Recording computational steps

Objectives

  • Know the minimal amount of information necessary to reproduce your results

  • Know how to make it easier for others (or your future self) to reproduce your work

  • Understand why and when a workflow management tool can be useful

Questions

  • You have some steps that need to be run to do your work. How do you actually run them? Do they rely on your memory and manual work, or are they reproducible? How do you communicate the steps to future you and to others?

  • How can we create a reproducible workflow?

Instructor note

  • 5 min teaching

  • 5 min exercise/demo

Several steps from input data to result

The following material is partly derived from an HPC Carpentry lesson.

In this episode, we will use an example project which counts the frequency of words in a set of books and plots the results. In this example we wish to:

  1. Analyze word frequencies using code/count.py for 4 books (they are all in the data directory).

  2. Plot a histogram using code/plot.py.

From book to word counts to plot

Example (for one book only):

$ python code/count.py data/isles.txt > statistics/isles.data
$ python code/plot.py --data-file statistics/isles.data --plot-file plot/isles.png

Another way to analyze the data would be via a graphical user interface (GUI), where you can, for example, drag and drop files and click buttons to perform the different processing steps.

We can also express the workflow for all books with a script. The repository includes such a script, called run_all.sh.

We can run it with:

$ bash run_all.sh
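For the four books, such a script essentially repeats the two commands from above for each book. A sketch of what it might contain (the actual run_all.sh in the repository may differ):

#!/usr/bin/env bash
# Sketch of a run_all.sh-style script: one pair of commands per book.
python code/count.py data/isles.txt > statistics/isles.data
python code/plot.py --data-file statistics/isles.data --plot-file plot/isles.png
# ... three more pairs of lines, one for each of the remaining books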

These approaches are fine when only a few steps or a few repetitions are needed, but they become infeasible when large amounts of data need to be processed. For example, you would not want to use either approach for 500 different books: clicking 500 times, or maintaining 500 copies of nearly identical lines in a script, is bound to introduce typos or other errors. How could we deal with this?

  • Loops with automated argument lists or other approaches to specify the inputs

  • Workflow managers

The simpler way to get reproducible results is to have the script generate the list of inputs automatically, e.g. using a “one folder/file per input” approach. This leads to reproducible results, but every change requires re-running everything: if you add another 10 data points and your script simply processes whatever it finds, it will re-run the analysis for the 1000 existing elements as well, and if you instead list the inputs as explicit arguments, you again risk forgetting elements or introducing typos. The advantage of this approach, however, is that you can easily build an executable manuscript from such a script (like Jupyter notebooks or MATLAB live scripts).
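As a sketch of the loop-based approach, a shell loop can discover all books in the data directory and process each one; the file naming below follows the single-book example above:

# Sketch: process every book found in the data directory.
for book in data/*.txt; do
    name=$(basename "$book" .txt)
    python code/count.py "$book" > "statistics/${name}.data"
    python code/plot.py --data-file "statistics/${name}.data" --plot-file "plot/${name}.png"
done

This removes the copy-paste problem, but it still re-runs every step for every book on each invocation.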

Workflow managers

Workflow managers, in contrast, define workflows that keep track of what needs to be executed. Snakemake, for example, records what has already been processed and only re-runs the parts of its workflow that need updating. If the workflow is defined properly (e.g. source files are declared as inputs of steps), it will re-run all analyses downstream of a modified step, which makes results more dependable: you cannot forget to “run that one new pre-processing step” for some old input data. An example of how it can be used on Triton (which is also generally applicable) can be found here.
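To make this concrete, here is a minimal sketch of what a Snakemake workflow (a “Snakefile”) for the books example could look like. It is an illustration, not necessarily what the example repository uses; the rule names are made up, while the scripts and directories follow the example above:

# Hypothetical Snakefile sketch: Snakemake infers what to (re-)run
# from the input/output relationships declared below.

# Find all books in the data directory.
BOOKS = glob_wildcards("data/{book}.txt").book

# The final targets: one plot per book.
rule all:
    input:
        expand("plot/{book}.png", book=BOOKS)

# Count word frequencies for one book.
rule count_words:
    input:
        "data/{book}.txt"
    output:
        "statistics/{book}.data"
    shell:
        "python code/count.py {input} > {output}"

# Plot a histogram from the word counts.
rule make_plot:
    input:
        "statistics/{book}.data"
    output:
        "plot/{book}.png"
    shell:
        "python code/plot.py --data-file {input} --plot-file {output}"

With such a file in place, running snakemake --cores 1 builds only the outputs that are missing or older than their inputs.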



Keypoints

  • Computational steps can be recorded in many ways.

  • Workflow tools can help if there are many steps to be executed and/or many datasets to be processed.