Process templates

When defining many similar processes, it can be useful to parameterize a single process template. This can be accomplished by defining a procedure that takes any number of arguments and returns a parameterized process. Here’s how to do this somewhat verbosely in plain Scheme:

(define (build-me-a-process thing)
  "Return a process that displays THING."
  (make-process
    (name (string-append "show-" thing))
    (procedure `(display ,thing))))

;; Now use this procedure to build concrete processes.
(define show-fruit
  (build-me-a-process "fruit"))
(define show-kitchen
  (build-me-a-process "kitchen"))
(define show-table
  (build-me-a-process "table"))

As this is a somewhat common thing to do in real workflows, the GWL provides simplified syntax to express the same concepts with a little less effort:

process build-me-a-process (with thing)
  name
    string-append "show-" thing
  procedure
    ` display ,thing

define show-fruit
  build-me-a-process "fruit"
define show-kitchen
  build-me-a-process "kitchen"
define show-table
  build-me-a-process "table"

The result is the same: you get a procedure build-me-a-process that you can use to define a number of similar processes. In the end you have the three processes show-fruit, show-kitchen, and show-table.

In a real-life workflow, the above example would not be very efficient. The GWL generates an executable script for every process and passes the process properties (such as name, inputs, and outputs) to it as arguments. It is better to generate one script per process template than one script per process, as this vastly reduces the preparation work that the GWL has to perform.

The GWL can arrange for scripts to be reused as long as you take care not to embed arbitrary variables in the process procedure field. To this end the GWL offers the values field for arbitrary value definitions that should be passed to process scripts as arguments.
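The effect of embedding a template argument can be illustrated with a toy model in plain Guile. This is only a sketch: the two script-building procedures and the argument lookup shown in the script text are hypothetical stand-ins, not the GWL's actual script generator. It merely counts how many distinct script texts result from each approach.

```scheme
(use-modules (srfi srfi-1))   ; for delete-duplicates

;; Hypothetical stand-in for script generation; NOT the GWL's real code.
;; The template argument is spliced into the script text itself.
(define (script-with-embedded-value thing)
  (string-append "(display \"" thing "\")"))

;; The script text is fixed; the value would be supplied as an
;; argument when the script is run.
(define shared-script-text "(display (argument \"thing\"))")
(define (script-with-runtime-value thing)
  shared-script-text)

(define things '("fruit" "kitchen" "table"))

(define embedded-scripts
  (delete-duplicates (map script-with-embedded-value things)))
(define shared-scripts
  (delete-duplicates (map script-with-runtime-value things)))

(format #t "embedded: ~a distinct scripts~%" (length embedded-scripts)) ; 3
(format #t "shared:   ~a distinct script~%" (length shared-scripts))    ; 1
```

In this model, three template arguments yield three different scripts when the value is embedded, but only one when the value is passed in at run time.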

Also avoid making the process name depend on template arguments. A name that varies per process prevents script reuse, because the GWL is then forced to generate scripts that are virtually identical except for their names. Here’s an example with ten processes that all share the same process script:

define LOG_DIR
  file "logs"

define SAMPLES
  list
    . "first-sample"
    . "second"
    . "third-sample"
    . "sample-no4"
    . "take-five"
    . "666"
    . "se7en"
    . "who-eight-nine?"
    . "NEIN!"
    string-reverse "net"

process index-bam (with sample)
  inputs
    file "mapped-reads" / sample "_Aligned.sortedByCoord.out.bam"
  outputs
    . bai:
    file "mapped-reads" / sample "_Aligned.sortedByCoord.out.bam.bai"
    . log:
    file LOG_DIR / "samtools_index_" sample ".log"
  packages
    . "samtools"
    . "coreutils"
  values
    . sample-id: sample
    . backwards:
      string-reverse
        first inputs
  # {
    mkdir -p {{LOG_DIR}}
    echo "The sample identifier is {{values:sample-id}}"
    samtools index {{inputs}} {{outputs:bai}} >> {{outputs:log}} 2>&1
    echo "By the way, the sample's file name in reverse is {{values:backwards}}."
  }

workflow test
  processes
    map index-bam SAMPLES

Here the value of the variable LOG_DIR is embedded in the generated script, but that’s fine because it is independent of the template argument sample. We could have referenced sample directly in the code snippet, but that would have embedded it in the script; instead we defined it as a value in the values field and tagged it with the keyword sample-id:. For the fun of it we also defined a value with the tag backwards:, which is defined in terms of another process field (inputs).

References to the fields inputs, outputs, name, and values are resolved via arguments passed to the process script at execution time. They do not interfere with script reuse as their values are not embedded in the generated script.
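For readers who prefer plain Scheme, the list of samples from the example can be checked at a Guile REPL. This is a minimal sketch that assumes the last entry is meant to use Guile's string-reverse procedure; it confirms that the list holds ten names, so mapping the template over it yields ten processes.

```scheme
(use-modules (srfi srfi-1))   ; for last

;; Plain Scheme counterpart of the Wisp definition of SAMPLES.
(define SAMPLES
  (list "first-sample" "second" "third-sample" "sample-no4" "take-five"
        "666" "se7en" "who-eight-nine?" "NEIN!"
        (string-reverse "net")))

;; Prints: 10 samples, the last one is "ten"
(format #t "~a samples, the last one is ~s~%"
        (length SAMPLES) (last SAMPLES))
```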