List of exercises
Full list
This is a list of all exercises and solutions in this lesson, mainly as a reference for helpers and instructors. It is automatically generated from all of the other pages in the lesson. Any single teaching event will probably cover only a subset of these, depending on the audience's interests.
Organizing your projects
Demonstration
We use a simple word-count repository (https://github.com/coderefinery/word-count) in demonstrations and exercises. Clone the repository now so that you are ready to work on it.
In VS Code (many other interfaces exist and are just as good):
Open the command palette and type "clone".
Paste the URL of the repository to clone.
Use the file dialog to choose a location for it.
Create a new file and make a change to an existing one.
Add them to the repository and push online.
See the changes online.
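For reference, a rough command-line equivalent of these steps (the file names below are placeholders for illustration, not files guaranteed to exist in the repository):
$ git clone https://github.com/coderefinery/word-count.git
$ cd word-count
$ echo "some notes" > notes.txt              # create a new file
$ echo "one more line" >> existing-file.txt  # change an existing one
$ git add notes.txt existing-file.txt
$ git commit -m "add notes, extend an existing file"
$ git push
Afterwards, browse the repository on GitHub to see the changes online.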
Solution
Consider using version control for manuscripts as well. It helps you keep track of edits, and if you sync the repository online you don't have to worry about losing your work.
Collaboration can be done efficiently with:
real-time collaboration tools like HackMD/HedgeDoc, where conflicts are resolved on the fly
version control, where conflicts are detected and shown – and resolved manually
Recording computational steps
Workflow-1: Scripted solution for processing 4 books
Somebody wrote a script (script.sh) to process all 4 books:
#!/usr/bin/env bash

# loop over all books
for title in abyss isles last sierra; do
    python statistics/count.py data/${title}.txt > statistics/${title}.data
    python plot/plot.py --data-file statistics/${title}.data --plot-file plot/${title}.png
done
We can run it with:
$ bash script.sh
What are the advantages of this solution compared to processing the books one by one?
Is the scripted solution reproducible?
Imagine adding more steps to the analysis, and imagine those steps being time-consuming. What problems do you anticipate with a scripted solution?
Solution
The advantage of this solution compared to processing the books one by one is more automation: we can generate all results with a single command. This is not only easier, it is also less error-prone.
Yes, the scripted solution can be reproducible.
Once we have more steps and some of them become time-consuming, a limitation of the scripted solution is that it always tries to run all steps. Rerunning only some of the steps, or only some of the input data, requires us to comment out lines in our script, which can again become tedious and error-prone.
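As a sketch of that tedium: rerunning only the plotting step would mean editing the script by hand into something like the version below, and remembering to undo the edit later:
#!/usr/bin/env bash

# loop over all books
for title in abyss isles last sierra; do
    # counting is already done, so we comment it out by hand:
    # python statistics/count.py data/${title}.txt > statistics/${title}.data
    python plot/plot.py --data-file statistics/${title}.data --plot-file plot/${title}.png
done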
Workflow-2: Workflow solution using Snakemake
Somebody wrote a Snakemake solution and the interesting file here is the Snakefile:
# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    input:
        script='plot/plot.py',
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    conda: 'environment.yml'
    log: 'plot/{file}.log'
    shell: 'python {input.script} --data-file {input.book} --plot-file {output}'
Snakemake uses a declarative style: we describe the dependencies but let Snakemake figure out the series of steps needed to produce the results (targets). Snakefiles contain rules that relate targets (output) to dependencies (input) and commands (shell).
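As an aside, you can watch the declarative style at work by asking for one specific target and letting Snakemake work out the chain of steps on its own (this example assumes the data files shipped in the repository):
$ snakemake -j 1 plot/abyss.png
To produce plot/abyss.png, Snakemake sees that it first needs statistics/abyss.data, so it runs count_words before make_plot.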
Exercise goals:
Clone the example to your computer:
$ git clone https://github.com/coderefinery/word-count.git
Study the Snakefile. How does it know what to do first and what to do then?
Try to run it. Since version 5.11 one needs to specify the number of cores (or jobs) using -j, --jobs, or --cores:
$ snakemake --delete-all-output -j 1
$ snakemake -j 1
The --delete-all-output part makes sure that we remove all generated files before we start.
Try running snakemake again; observe that it refuses to rerun all steps, and discuss why:
$ snakemake -j 1
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Make a tiny modification to the plot.py script and run
$ snakemake -j 1
again and observe how it only re-runs the plot steps.
Make a tiny modification to one of the books and run
$ snakemake -j 1
again and observe how it only regenerates files for this book.
Discuss possible advantages compared to a scripted solution.
Question for R developers: Imagine you want to rewrite the two Python scripts and use R instead. Which lines in the Snakefile would you have to modify so that it uses your R code?
If you make changes to the Snakefile, validate it using:
$ snakemake --lint
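Optionally, you can also visualize the dependency graph that Snakemake builds; this assumes Graphviz (the dot command) is installed alongside Snakemake:
$ snakemake --dag | dot -Tpng > dag.png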
Solution
2: Start with "all" and look at what it depends on. Then search for rules that have these files as output. Look at their inputs and search where those are produced. In other words: search backwards and build a graph of dependencies. This is what Snakemake does.
4: It can see that outputs are newer than inputs. It will only regenerate outputs if they are not there or if the inputs or scripts have changed.
7: It only runs the steps whose outputs are missing or outdated; the workflow does not run everything every time. In other words, if you notice a problem or update information "half way" through the analysis, it will only re-run what needs to be re-run. Nothing more, nothing less. Further advantages: it can distribute tasks to multiple cores, off-load work to supercomputers, offer more fine-grained control over software environments, and more.
8: Probably only the two lines containing “shell”.
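A minimal sketch of that change, assuming hypothetical R scripts statistics/count.R and plot/plot.R that accept the same arguments (if the script file names change, the script= input lines change with them):
rule count_words:
    input:
        script='statistics/count.R',   # was statistics/count.py
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    shell: 'Rscript {input.script} {input.book} > {output}'   # was python ...

rule make_plot:
    input:
        script='plot/plot.R',          # was plot/plot.py
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    shell: 'Rscript {input.script} --data-file {input.book} --plot-file {output}'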
Recording dependencies
(optional) Dependencies-1: Time-capsule of dependencies
Situation: 5 students (A, B, C, D, E) wrote code that depends on a couple of libraries. They uploaded their projects to GitHub. We now travel 3 years into the future, find their GitHub repositories, and try to re-run their code before adapting it.
Answer in the collaborative document:
Which version do you expect to be easiest to re-run? Why?
What problems do you anticipate in each solution?
A: You find a couple of library imports across the code but that’s it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy
  - numpy
  - sympy
  - click
  - python
  - pip
  - pip:
    - git+https://github.com/someuser/someproject.git@master
    - git+https://github.com/anotheruser/anotherproject.git@master
D: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - python=3.8
  - pip
  - pip:
    - git+https://github.com/someuser/someproject.git@d7b2c7e
    - git+https://github.com/anotheruser/anotherproject.git@sometag
E: You find an environment.yml file with:
name: student-project
channels:
  - conda-forge
dependencies:
  - scipy=1.3.1
  - numpy=1.16.4
  - sympy=1.4
  - click=7.0
  - python=3.8
  - someproject=1.2.3
  - anotherproject=2.3.4
A: You find a couple of library imports across the code but that’s it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find a requirements.txt file with:
scipy
numpy
sympy
click
python
git+https://github.com/someuser/someproject.git@master
git+https://github.com/anotheruser/anotherproject.git@master
D: You find a requirements.txt file with:
scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
python==3.8
git+https://github.com/someuser/someproject.git@d7b2c7e
git+https://github.com/anotheruser/anotherproject.git@sometag
E: You find a requirements.txt file with:
scipy==1.3.1
numpy==1.16.4
sympy==1.4
click==7.0
python==3.8
someproject==1.2.3
anotherproject==2.3.4
A: You find a couple of library() or require() calls across the code but that's it.
B: The README file lists which libraries were used but does not mention any versions.
C: You find a DESCRIPTION file which contains:
Imports: dplyr, tidyr
In addition you find these:
remotes::install_github("someuser/someproject@master") remotes::install_github("anotheruser/anotherproject@master")
D: You find a DESCRIPTION file which contains:
Imports: dplyr (== 1.0.0), tidyr (== 1.1.0)
In addition you find these:
remotes::install_github("someuser/someproject@d7b2c7e") remotes::install_github("anotheruser/anotherproject@sometag")
E: You find a DESCRIPTION file which contains:
Imports: dplyr (== 1.0.0), tidyr (== 1.1.0), someproject (== 1.2.3), anotherproject (== 2.3.4)
Can you please contribute an example?
Solution
A: It will be tedious to collect the dependencies one by one. And after that tedious process you will still not know which versions they used.
B: Without a standard file to look for and look at, it might become very difficult to create the software environment required to run the software. But at least we know the list of libraries, even though we don't know their versions.
C: Having a standard file listing dependencies is definitely better than nothing. However, if the versions are not specified, you or someone else might run into problems with dependencies, deprecated features, changes in package APIs, etc.
D and E: In both these cases exact versions of all dependencies are specified and one can recreate the software environment required for the project. One problem with the dependencies that come from GitHub is that they might have disappeared (what if their authors deleted these repositories?).
E is slightly preferable because version numbers are easier to understand than Git commit hashes or Git tags.
(optional) Dependencies-2: Create a time-capsule for the future
Now it is time to create your own time-capsule and share it with the future world. If we asked you now which dependencies your project is using, what would you answer? How would you find out? And how would you communicate this information?
Try this either with your own project or inside the “coderefinery” conda environment:
$ conda env export > environment.yml
Have a look at the generated file and discuss what you see.
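The exported file will look something like the excerpt below; the name, the packages, their versions, and any build strings or prefix line will differ on your machine (this is only an illustrative sketch):
name: coderefinery
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - numpy=1.16.4
  - ...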
In the future you can re-create this environment with:
$ conda env create -f environment.yml
More information: https://docs.conda.io/en/latest/
See also: https://github.com/mamba-org/mamba
Try this in your own project:
$ pip freeze > requirements.txt
Have a look at the generated file and discuss what you see.
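The generated file pins one exact version per line, something like this (package names and versions here are illustrative):
click==7.0
numpy==1.16.4
scipy==1.3.1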
In the future you can re-create this environment with:
$ pip install -r requirements.txt
More information: https://docs.python.org/3/tutorial/venv.html
This example uses renv. Try to "save" and "load" the state of your project library using renv::snapshot() and renv::restore().
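A minimal sketch of the renv workflow, run from within your project (renv must be installed first):
install.packages("renv")  # one-time setup
renv::init()              # create a project library and an renv.lock file
renv::snapshot()          # "save": record package versions in renv.lock
renv::restore()           # "load": reinstall the recorded versions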
See also: https://rstudio.github.io/renv/articles/renv.html#reproducibility
More information: https://rstudio.github.io/renv/articles/renv.html
Can you please contribute an example?