process Fields (Guix Workflow Language Reference Manual)

`process` Fields

Both make-process and process accept the same fields, which we describe below.

name

The readable name of the process as a string. This is used for display purposes and to select processes by name. When the process constructor is used, the name field need not be provided explicitly.

version

This field holds an arbitrary version string. This can be used to disambiguate between different implementations of a process when searching by name.

synopsis

A short summary of what this process intends to accomplish.

description

A longer description of the purpose of this process.
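The descriptive fields above can be combined in one definition. The following is a minimal sketch rather than an example from the manual: the process name compress-log, the file names, and the version string are hypothetical, and the gzip package is assumed to be available in your configured channels. The name field is left out, which the process constructor permits.

```wisp
;; Hypothetical process: the name, version string, and file names
;; are made up for illustration.
process compress-log
  version "1.0"
  synopsis "Compress a log file"
  description "Compress app.log with gzip and write app.log.gz."
  packages "gzip"
  inputs "app.log"
  outputs "app.log.gz"
  # { gzip -c {{inputs}} > {{outputs}} }
```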

packages

This field is used to specify what software packages need to be available when executing the process. Packages can either be Guix package specifications — such as the string "guile@3.0" for Guile version 3.0 — or package variable names.

By default, package specifications are looked up in the context of the current Guix, i.e. the same version of Guix that you used to invoke guix workflow. This is to ensure that you get exactly those packages that you would expect given the Guix channels you have configured.

We strongly advise against using package variables from Guix modules. The workflow language uses Guix as a library and is compiled and tested with the version of Guix that is currently available as the guix package in (gnu packages package-management). This version of Guix will likely be older than the version of Guix you use to invoke guix workflow, so package variables imported from its modules may not match the packages provided by your configured channels.

Package variables are useful for one-off ad-hoc packages that are not contained in any channel and are defined in the workflow file itself. We suggest you use the procedure lookup-package from the (gwl packages) module to look up inputs in the context of the current Guix. To ensure reproducibility, however, we urge you to publish packages in a version-controlled channel. See the Guix reference manual to learn all there is to know about channels.
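As a rough illustration of that suggestion, the sketch below looks up a package by specification so that an ad-hoc package defined in the workflow file could use it as an input. It assumes that lookup-package takes a single specification string, like the strings accepted by the packages field, and that the module can be loaded with use-modules. Neither detail is confirmed by this section. The variable name zlib-for-inputs is hypothetical.

```wisp
;; Assumption: lookup-package accepts one package specification string.
use-modules : gwl packages

;; Hypothetical variable holding a package from the current Guix.
define zlib-for-inputs
  lookup-package "zlib"
```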

The packages field accepts a list of packages as well as multiple values (an “implicit list”). All of the following specifications are valid. A single package:

process
  packages "guile"
  …

More than one package:

process
  packages "guile" "python"
  …

A single list of packages:

process
  packages
    list "guile" "python"
  …

inputs

This field holds inputs to the process. Commonly, this will be a list of file names that the process requires to be present. The GWL can automatically connect processes by matching up their declared inputs and outputs, so that processes generating certain outputs are executed before those that declare the same item as an input.

As with the packages field, the inputs field accepts an “implicit list” of multiple values as well as an explicit list. Additionally, individual inputs can be “tagged” or named by prefixing them with a keyword (see Keywords in GNU Guile Reference Manual). Here’s an example of an implicit list of inputs spread across multiple lines where two inputs have been tagged:

process
  inputs
    . genome: "hg19.fa"
    . "cookie-recipes.txt"
    . samples: "foo.fq"
  …

The leading period is Wisp syntax to continue the previous line. You can, of course, do without the periods, but this may look a little more cluttered:

process
  inputs genome: "hg19.fa" "cookie-recipes.txt" samples: "foo.fq"
  …

Why tag inputs at all? Because you can reference them in other parts of your process definition without having to awkwardly traverse the whole list of inputs. Here is one way to select the first input that was tagged with the genome: keyword:

pick genome: inputs

To select the second item after the tag genome: do this:

pick second genome: inputs

or using a numerical zero-based index:

pick 1 genome: inputs

See Code Snippets for a convenient way to access named items in code snippets without having to define your picks beforehand.
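To give a flavor of that convenience, here is a sketch that refers to a tagged input from within a code snippet, using the {{inputs:tag}} placeholder form that also appears in later examples in this section. The process name index-genome and the choice of samtools are illustrative assumptions, not part of the manual.

```wisp
;; Hypothetical process: the name and the samtools command are
;; chosen only for illustration.
process index-genome
  packages "samtools"
  inputs
    . genome: "hg19.fa"
    . samples: "foo.fq"
  outputs "hg19.fa.fai"
  # { samtools faidx {{inputs:genome}} }
```

Only the input tagged genome: is passed to the command. The samples: input is declared but not referenced in the snippet.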

The procedure process-inputs can be used to access the list of inputs of any given process. By default, tags are removed from the list. If you want to include tags (e.g. to select specific inputs with pick), you can pass the keyword with-tags: as an additional argument.

Here is an example of two processes where the second process refers to the inputs of the first.

process count-reads (with sample)
  packages
    . "r-minimal"
  inputs
    . bam:
    file sample "_Aligned.sortedByCoord.out.bam"
    . bai:
    file sample "_Aligned.sortedByCoord.out.bam.bai"
    . script:
    file "count-reads.R"
  outputs
    file sample ".read_counts.csv"
  # {
    R {{inputs:script}} {{inputs:bam}} {{inputs:bai}} > {{outputs}}
  }

process genome-coverage (with sample)
  packages
    . "r-minimal"
  inputs
    define others
      process-inputs
        count-reads sample
        . with-tags:
    . files:
    pick bam: others
    pick bai: others
    . script:
    file "genome-coverage.R"
  outputs
    files sample / (list ".forward" ".reverse") ".bigwig"
  # {
    R {{inputs:script}} {{inputs::files}} > {{outputs}}
  }

outputs

This field holds a list of outputs that are expected to appear after executing the process. Usually this will be a list of file names. Just like the inputs field, this field accepts a plain list, an implicit list of one or more values, and lists with named items.

The GWL can automatically connect processes by matching up their declared inputs and outputs, so that processes generating certain outputs are executed before those that declare the same item as an input.

The procedure process-outputs can be used to access the list of outputs of any given process. By default, tags are removed from the list. If you want to include tags (e.g. to select specific outputs with pick), you can pass the keyword with-tags: as an additional argument.

Here is an example of two processes where the second process refers to the outputs of the first.

process one
  packages
    . "coreutils"
  inputs
    . "input.txt"
  outputs
    . log: "first.log"
    . text: "first.txt"
  # { tail {{inputs}} > {{outputs:text}} }

process two
  packages
    . "coreutils"
  inputs
    pick text:
      process-outputs one with-tags:
  outputs
    . done: "second.txt"
    . log: "second.log"
  # { head {{inputs}} > {{outputs:done}} }
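Because process two consumes an output of process one, the GWL can order the two automatically. The sketch below assumes the workflow form with its processes field and the auto-connect procedure, which are documented in the workflow part of the manual rather than in this section. The workflow name one-then-two is hypothetical.

```wisp
;; Assumption: workflow, processes, and auto-connect behave as
;; described in the manual's workflow documentation.
workflow one-then-two
  processes
    auto-connect one two
```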

output-path

This is a directory prefix for all outputs.
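A minimal sketch of how this field might be used follows. It assumes that the prefix is joined to each declared output, so that the file is expected at results/stamp.txt and {{outputs}} expands accordingly. Whether the directory is created for you is not stated here, so the snippet creates it defensively. The process name stamp and the directory name are hypothetical.

```wisp
;; Hypothetical process: "results" is an arbitrary directory prefix.
process stamp
  packages "coreutils"
  output-path "results"
  outputs "stamp.txt"
  # {
    mkdir -p "$(dirname {{outputs}})"
    date > {{outputs}}
  }
```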

run-time

This field is used to specify run-time resource estimates, such as the memory requirement of the process or the maximum time it should run. This is especially useful when submitting jobs to an HPC cluster scheduler such as Grid Engine, as these schedulers may give higher priority to jobs that declare a short run time.

Resources are specified as a complexity value with the fields space (for memory requirements), time (for the expected duration of the computation), and threads (to control the number of CPU threads). For convenience, memory requirements can be specified with the units kibibytes (or KiB), mebibytes (or MiB), or gibibytes (or GiB). Supported time units are seconds, minutes, and hours.

Here is an example of a single-threaded process that is granted 20 MiB of run-time memory for a duration of 10 seconds:

process stamp-inputs
  inputs "first" "second" "third"
  outputs "inputs.txt"
  run-time
    complexity
      space 20 mebibytes
      time  10 seconds
      threads 1
  # { echo {{inputs}} > {{outputs}} }

When this process is executed by a scheduler that honors resource limits, the process will be granted at most 20 MiB of memory and will be killed if it has not concluded after 10 seconds.
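For a heavier job the same fields can be combined with the larger units and more threads. The following sketch uses the abbreviated unit name GiB mentioned above. The process name sort-reads, the file names, and the resource figures are hypothetical.

```wisp
;; Hypothetical process: the resource figures are arbitrary.
process sort-reads
  packages "coreutils"
  inputs "reads.txt"
  outputs "reads.sorted.txt"
  run-time
    complexity
      space 4 GiB
      time 2 hours
      threads 8
  # { sort --parallel=8 {{inputs}} > {{outputs}} }
```

Note that the threads value only informs the scheduler. The command itself must still be told how many threads to use, which is what the --parallel option of GNU sort does here.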

values

This field holds a list with keyword-tagged items that can be used in code snippets. Values defined here are passed to the process script at execution time (rather than preparation time), so this field can be used to avoid embedding literal values in code snippets when generating processes from a template. To learn more about code snippets, see Code Snippets.

Here is a simple example of a process template with values:

process greet (with name)
  packages
    . "hello"
    . "coreutils"
  outputs
    file name ".txt"
  values
    . capitalized:
    string-upcase name
  # {
    echo "This is a greeting for {{values:capitalized}}."
    hello >> {{outputs}}
  }

map greet
  list "rekado" "civodul" "zimoun"

The generated script from this process does not embed any specific value for name or even capitalized. Instead it looks up the value for capitalized in the arguments passed to the script at execution time. So instead of generating three scripts that only differ in one value (the capitalized name), the GWL will only generate one script and pass it three different values for the three processes.

For another example and further discussion of embedding values versus referencing them at execution time, see Process templates.
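A template can also carry more than one tagged value. The sketch below follows the same layout as the greet example, with each tag followed by the expression that computes its value. The process name shout, the tags upper: and lower:, and the argument strings are hypothetical.

```wisp
;; Hypothetical template: both values are derived from the argument.
process shout (with name)
  packages "coreutils"
  outputs
    file name ".txt"
  values
    . upper:
    string-upcase name
    . lower:
    string-downcase name
  # { echo "{{values:upper}} and {{values:lower}}" > {{outputs}} }

map shout
  list "Ada" "Grace"
```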

procedure

This field holds an expression of code that should be run when the process is executed. This is the “work” that a process should perform. By default that’s a quoted Scheme expression, but code snippets in other languages are also supported (see Code Snippets).

Here’s an example of a process with a procedure that writes a haiku to a file:

process haiku
  outputs "haiku.txt"
  synopsis "Write a haiku to a file"
  description
    . "This process writes a haiku by Gary Hotham \
to the file \"haiku.txt\"."
  procedure
    ` with-output-to-file ,outputs
        lambda ()
          display "\
the library book
overdue?
slow falling snow"

The Scheme expression here is quasiquoted (with a leading `) to allow for unquoting (with ,) of variables, such as outputs.

Scheme will not always be the best choice for a process procedure. Sometimes all you want to do is fire off a few shell commands. While this is, of course, possible to express in Scheme, it is admittedly somewhat verbose. For convenience we offer a simple and surprisingly short syntax for this common use case. As a bonus you can even leave off the field name “procedure” and write your code snippet right there. How? See Code Snippets.
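As a sketch of that shorthand, the haiku process from above could be written with a shell snippet in place of the quasiquoted Scheme expression. This uses the # { … } snippet form seen in the other examples of this section and assumes the snippet is run by a shell that provides printf.

```wisp
;; The haiku process rewritten with a shell snippet and without the
;; explicit "procedure" field name.
process haiku
  outputs "haiku.txt"
  synopsis "Write a haiku to a file"
  # {
    printf 'the library book\noverdue?\nslow falling snow' > {{outputs}}
  }
```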