List of exercises

Full list

This is a list of all exercises and solutions in this lesson, mainly as a reference for helpers and instructors. This list is automatically generated from all of the other pages in the lesson. Any single teaching event will probably cover only a subset of these, depending on their interests.

Organizing your projects

In organizing-projects.md:

Demonstration

We use a simple word-count repository in demonstrations and exercises (https://github.com/coderefinery/word-count). We should clone the repository already to prepare to work on it.

On VSCode (many other interfaces exist and are just as good)

  • Open command palette and type clone

  • paste the URL of the repository to clone

  • use the file dialog to find a place for it

  • Create a new file and make a change to an existing one.

  • Add them to the repository and push online.

  • See the changes online.

In organizing-projects.md:

Recording computational steps

In workflow-management.md:

Workflow-1: Scripted solution for processing 4 books

Somebody wrote a script (script.sh) to process all 4 books:

#!/usr/bin/env bash

# loop over all books
for title in abyss isles last sierra; do
    python statistics/count.py data/${title}.txt > statistics/${title}.data
    python plot/plot.py --data-file statistics/${title}.data --plot-file plot/${title}.png
done

We can run it with:

$ bash script.sh
  • What are the advantages of this solution compared to processing all one by one?

  • Is the scripted solution reproducible?

  • Imagine adding more steps to the analysis and imagine the steps being time consuming. What problems do you anticipate with a scripted solution?

In workflow-management.md:

Workflow-2: Workflow solution using Snakemake

How Snakemake works

Somebody wrote a Snakemake solution and the interesting file here is the Snakefile:

# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    input:
        script='plot/plot.py',
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    conda: 'environment.yml'
    log: 'plot/{file}.log'
    shell: 'python {input.script} --data-file {input.book} --plot-file {output}'

Snakemake uses declarative style: we describe dependencies but we let Snakemake figure out the series of steps to produce results (targets). Snakefiles contain rules that relate targets (output) to dependencies (input) and commands (shell).

Exercise goals:

  1. Clone the example to your computer: $ git clone https://github.com/coderefinery/word-count.git

  2. Study the Snakefile. How does it know what to do first and what to do then?

  3. Try to run it. Since version 5.11 one needs to specify number of cores (or jobs) using -j, --jobs or --cores:

    $ snakemake --delete-all-output -j 1
    $ snakemake -j 1
    

    The --delete-all-output part makes sure that we remove all generated files before we start.

  4. Try running snakemake again and observe that and discuss why it refused to rerun all steps:

    $ snakemake -j 1
    
    Building DAG of jobs...
    Nothing to be done (all requested files are present and up to date).
    
  5. Make a tiny modification to the plot.py script and run $ snakemake -j 1 again and observe how it will only re-run the plot steps.

  6. Make a tiny modification to one of the books and run $ snakemake -j 1 again and observe how it only regenerates files for this book.

  7. Discuss possible advantages compared to a scripted solution.

  8. Question for R developers: Imagine you want to rewrite the two Python scripts and use R instead. Which lines in the Snakefile would you have to modify so that it uses your R code?

  9. If you make changes to the Snakefile, validate it using $ snakemake --lint.

Recording dependencies

In dependencies.md:

(optional) Dependencies-1: Time-capsule of dependencies

Situation: 5 students (A, B, C, D, E) wrote a code that depends on a couple of libraries. They uploaded their projects to GitHub. We now travel 3 years into the future and find their GitHub repositories and try to re-run their code before adapting it.

Answer in the collaborative document:

  • Which version do you expect to be easiest to re-run? Why?

  • What problems do you anticipate in each solution?

    A: You find a couple of library imports across the code but that’s it.

    B: The README file lists which libraries were used but does not mention any versions.

    C: You find a environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy
      - numpy
      - sympy
      - click
      - python
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@master
        - git+https://github.com/anotheruser/anotherproject.git@master
    

    D: You find a environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@d7b2c7e
        - git+https://github.com/anotheruser/anotherproject.git@sometag
    

    E: You find a environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - someproject=1.2.3
      - anotherproject=2.3.4
    

In dependencies.md:

In dependencies.md:

(optional) Dependencies-2: Create a time-capsule for the future

Now it is time to create your own time-capsule and share it with the future world. If we asked you now which dependencies your project is using, what would you answer? How would you find out? And how would you communicate this information?

Try this either with your own project or inside the “coderefinery” conda environment:

$ conda env export > environment.yml

Have a look at the generated file and discuss what you see.

In future you can re-create this environment with:

$ conda env create -f environment.yml

More information: https://docs.conda.io/en/latest/

See also: https://github.com/mamba-org/mamba