The challenges of working collaboratively on data and some solutions
“rigour, reproducibility and robustness. These remind us of the reason why we became scientists in the first place.”
Note: The ELN function is also frequently combined with a LIMS (Laboratory information/inventory management system) solution as the two are often closely interrelated. They are however distinct functions and you may wish to pick different tools for these functions.
See also the Turing Way sub-chapter on ELNs, to which I have contributed.
ELNs are incredibly useful and massively expand on the utility of a paper lab notebook, adding links, images, multimedia, collaboration, search, sharing, integration with LIMS systems, and much more. It is, however, important when choosing an ELN solution that you do not give up the advantages offered by a paper lab notebook.
There is an immense and baffling array of options in the Electronic Lab Notebook space. Many organisations offer software that purports to solve the problem of electronic lab notebooks, so choosing a suitable solution can be a major headache. The choice of ELN is an incredibly important decision and one that your lab/institution will likely have to live with for years or even decades. You are putting the record of your research into the hands of the tool that you choose, and entering into a long-term relationship with the provider of your ELN solution.
Consider the differences between a paper copy of a lab notebook (PLN) and an electronic copy. Consider also which properties of a paper copy it is important to retain when adopting an electronic alternative.
A paper lab notebook is physically under your control: you (or more likely your institution) own it. You can control access to it physically, and your physical possession of this resource means that it would be difficult for anyone to prevent you from accessing it, or to charge you a fee in order to continue using it. You do not need any specialist tools in order to access its contents, and you are not dependent on the functioning of any complex systems like computer networks in order to be able to use your paper lab notebook. You do not have to agree to a ‘terms of service’ or ‘end-user license agreement’ with a 3rd party (the terms of which are likely subject to unilateral alteration by that 3rd party) in order to use and retain access to your lab book.
If the provider of my paper lab books goes out of business it has almost no bearing on my ability to continue doing my work. One paper notebook is much like another, so finding a new provider is easy, and changing providers affects neither my access to my past notebooks nor my ability to keep the same workflow in future ones. This is not necessarily true of ELNs. Few active measures are needed to maintain the data in your paper notebooks; they are vulnerable in that they exist in only one copy, but as long as they are kept in a cool, dry and dark spot they will likely last decades. Electronic data requires much more active upkeep.
Lab notebooks perform an archival function, and proprietary formats are antithetical to this: they assume that the institution which acts as gatekeeper to the proprietary format will outlive the need to archive the material. When choosing an archival format one seeks to maximize the likelihood that one can recover the relevant information from that format. Using a proprietary solution is taking a needless risk with the future of your data; your data’s fate can become tied to that of the firm, or project within a firm, that develops and operates the software that you use to store your lab notes.
When looking for any piece of software the first question that I ask is: “Is there a libre / open-source solution to this?” If it is a web app I ask: “Can I host my own fully featured instance, should I need to?” I also ask: “Is there a large community using the project, and does it have institutional backing of some kind?” This might take the form of a company which sells service contracts, or offers paid hosting, ideally with feature parity[^parity] with a self-hosted option; or perhaps a foundation or other non-profit/academic organisation with robust funding.
Open solutions provide me with the assurance that, if I do the appropriate preparatory work, I should be able to access all of my data in its native form by running the ELN application in a VM or similar reproducible computational environment in the future, should I need to. Even if the tools are no longer maintained or in a state that can be used in production, they can still be used to read the data and interact with it in the same way. The data will also likely be stored in an open format from which it can relatively easily be extracted and ported to a new format.
This recent review (Higgins, Nogiwa-Valdez, and Stevens 2022 [cito:citesAsRecommendedReading] [cito:critiques]) provides a good overview of considerations when adopting an ELN solution; it covers such things as regulatory compliance that I have not touched on here. It does occasionally appear to conflate open-source solutions with self-hosted ones, which need not necessarily be the case. This guide on choosing an ELN from Simon Bungers of Labfolder is also worth a read. Some companies will let you host proprietary apps on premises, and you can pay for 3rd party hosting and administration of open source applications. This is important: if you don’t have the expertise or internal resources to administer a self-hosted instance of an open-source ELN solution you can still pay a 3rd party to do this for you, in which case you get the benefits of professional support and the reassurance of an open solution. You should still take regular local backups of exports from your hosting provider, from which you could restore your ELN system with different hosting. This means that you retain the option to change providers, as the hosting and support are no longer vertically integrated parts of the software as a service (SaaS) experience for you.
The only ELN/LIMS software solutions that I have so far identified that meet my initial screening criteria are listed here. They are each quite different but share many of the same core features, for example rich text editing in a web browser, the ability to upload files, and sharing and permissions based on roles/groups.
Laboratory resource scheduling feature for booking things like hoods and microscopes, automatic mol file previews for molecules and proteins & support for free-hand drawing.
The eLabFTW site and documentation; there is also a demo deployment that you can try out.
Self-hosting is relatively simple according to the documentation. There is also a paid support tier, which would be recommended for any larger deployment to support the ongoing development of the project.
Paid cloud hosting is available from the developer in a geographical region suited to your needs; a more expensive tier with hosting in France, compliant with additional security and privacy certifications, is available.
Good features for integrated metadata management, e.g. linking to ontologies / controlled vocabularies. This is based on a flexible object system for making similar entries.
openBIS has an API and can integrate with jupyterhub for electronic lab notebooks.
Very feature rich LIMS system with optional integration of stores management with protocols and experiments including keeping track of bar-coded stocks.
You can get a feel for it in the demo deployment.
openBIS is a bit more complex to administer based on its documentation. It’s a slightly older and more complex project than the others on this list, meaning it is very featureful and well tested against the needs of the groups at ETH Zurich.
Developed at ETH Zurich, openBIS can be hosted for you under the openRDM service operated by ETH Zurich scientific IT services. No fixed pricing is available; cost would depend on your specific needs.
OSF is oriented towards sharing and collaborating on your work, including the ability to generate DOIs and host pre-prints directly on the main instance.
It is free to use OSF at the main instance at osf.io so you can try it out there directly. For larger data you must provide your own additional storage add-ons, available from a number of cloud storage providers.
Whilst you can host OSF yourself, the project presents this as being for development purposes, and self-hosting is not directly available as a paid service.
Strong sharing features, makes it easy to take your ELN and make it, or parts of it, public.
There are additional sections covering the management of types of information which don’t necessarily fit into an ELN solution: for more general, personal or informal notes see the short section on Personal Knowledge Management Section 9.2; for bibliographic information management, the section on Zotero Section 9.1; and for passwords, the section on ?sec-bitwarden.
“Code is text, code is readable, code is reproducible”
The choice between analyzing your data in a graphical or GUI (Graphical User Interface) tool, such as a spreadsheet or a statistical analysis and plotting application, and doing so in code in a programming language such as R or Python is a significant one, though the two are not always mutually exclusive. Sometimes this choice is made for us, because the tools available to solve our specific problem are only graphical or only command line (i.e. code). On many occasions, however, we are presented with a choice between the two.
Working in a graphical tool is typically, though not always, faster to pick up and learn for those with no prior coding experience than interacting through code, and its shallower learning curve can make it easier to get started quickly with data analysis.
Working in code typically provides much better provenance for the data and implicitly documents every step taken in the analysis of your data. When working with a graphical tool, repeating the same analysis commonly requires following manual, and often inexact, instructions. This requires that such steps be carefully documented and meticulously followed by someone attempting to reproduce your work, and both documenting and following them leave significant room for a class of ‘operator’ error: failing to unambiguously document a step, misinterpreting or having to guess at an ambiguous step, or just making random mistakes. Whilst these difficulties are hard to avoid in lab protocols, where physical steps must be described, they are theoretically avoidable in computational analyses, which reduce to a series of unambiguous, mechanically executed steps. This can result in its own class of errors, for example bugs in widely used tools which, if not spotted, lead to widespread errors, so there are some trade-offs.
A graphical tool which permits you to define a series of operations, export these instructions to a file, and import that process into a different session is a step up over one in which the user must repeatedly and manually specify each action. However, such approaches are usually not quite as robust as code: software versions change and user interfaces no longer match the instructions, or can no longer import the files from older versions. This can be especially difficult with proprietary or SaaS (software as a service) solutions where access to older versions of the software is not available. It is much easier to maintain the equivalent of a lab notebook for your computational analyses if you are able to do so in code than it is to do the equivalent when using a GUI tool.
Some GUI tools, which generate/edit code snippets from a GUI wrapper or produce a file containing manual annotation information, can provide a bridge between things that are just easier to do graphically and purely code-based solutions. These sorts of hybrid solutions are available for certain tasks and make it possible to have a primarily code-first workflow augmented by GUI assistance when needed/desired. This document is an example of this sort of workflow: I’m currently writing it in RStudio’s visual editor mode, which resembles an ordinary WYSIWYG (what you see is what you get) word processor with all the bells and whistles, like automated reference management integrated with Zotero, but it’s actually generating well-formed Rmarkdown syntax.
Another useful example of this, for generating reproducible figures combining imaging data and graphs in Inkscape with ImageJ, is Jérôme Mutterer’s[^mutterer] inkscape-imagej-panel plugin[^inkscape].
A section from Hadley Wickham’s 2019 Keynote at EMBL covering the merits of computational notebooks for reproducible science.
“In science consensus is irrelevant. What is relevant is reproducible results.”
- Michael Crichton
To reproduce, or indeed to easily collaborate on, a data analysis project you need shared access to three things:

- the code
- the data
- the computational environment in which the code is run
We will cover a number of technologies in the following sections each of which solves a different aspect of the problems associated with performing, collaborating on and sharing reproducible computational analyses. Then we will look at a tool which brings many of these technologies together into a single relatively easy to use platform, Renku. If you are in a hurry you can skip directly to the Renku section Section 4.8 and revisit the intervening sections as needed though I’d suggest at least skimming them to get a little context.
AKA version control or source control
When working with code at any scale beyond a few small scripts (and sometimes even then) it is highly advisable to use a tool to keep track of the changes that you have made to your code. This is especially true if you are collaborating with others, as such tools usually also feature utilities to help you merge code developed by multiple people working on the same project asynchronously. The de facto standard tool for this is git; it is widely used and there is much tooling built around the core git software.
git - Track changes but OP[^op] (and a bit more complicated)
- Richard J. Acton
Using git in a data analysis project is also a bit like using a lab notebook. Whenever you take a snapshot of your project by making a ‘commit’ you accompany it with a ‘commit message’ giving a brief description of why you did what you did. A digital file is not necessarily like a lab notebook: a physical notebook has a chronological order in which you can see the history of what you did, when, and why, whereas a digital file that you change over time has only its current form and does not retain a history of its changes. git adds this chronological dimension back to digital projects, letting you time travel through the history of your projects. This can be very valuable, for example if you want to be able to get back a result exactly as you generated it before you updated your code.
It is simple to learn basic git operations but its underlying structure can be conceptually difficult to grasp. I recommend taking the time to form a good mental model of git’s workings if you are going to use it regularly, if you want to understand and perform more advanced operations, or indeed just to fully understand the simple ones. See the learning resources below for some more in-depth material on git.
Before we delve a little more into git I’m going to introduce another concept - Literate programming. Source code is after all just text and many of these same concepts translate well to collaborating on prose.
“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
- Donald Knuth
Literate programming is a paradigm for writing code interspersed with natural language prose, or vice versa. The concept was introduced by Donald Knuth in 1984; an early product of literate programming was the source of TeX, the typesetting system underlying LaTeX, the document authoring language still popular with anyone writing mathematical notation on computers. Literate programming is now very popular in ‘computational notebooks’ used by data scientists, which are in many ways the computational equivalent of a lab notebook. This is a literate programming document; it leans heavily towards prose and does not contain much code, but I can easily include some, check it out:
2 + 2
[1] 4
The literate programming tool I’m using is called Quarto. With it I can write plain text documents which include snippets of code in R or a variety of other languages, and format my text with a simple markup syntax called markdown. As I mentioned above, I’m currently editing this document in a WYSIWYG editor, much like the word processors with which you are likely familiar, which generates the Quarto/markdown formatted text. Markdown is, however, remarkably simple and very easy to learn, and I regularly switch between source and visual modes with minimal friction.
This is an extremely powerful tool for generating and properly documenting my work, and indeed for outputting it in different publication formats, a concept called single source publishing. This document, for example, is automatically published as a website, a pdf & an epub every time I commit and push changes to the gitlab repository where it is hosted. You can even get your markdown formatted according to the requirements of many journals with {rticles} or Quarto Journals. Thus the published output from this source document is tightly coupled to the code in it; any code I write here is re-run when this document is built (unless I cache the results).
A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
- John Gruber
Here is a quick markdown syntax rundown (~90% of all the markdown syntax you’ll ever need):
# Heading 1
## Heading 2
### Heading 3 etc. {#sec-h3}
[hyperlink text](url)
footnote [^1]
Inline references were used by @smith2021, it has been claimed [@jones2022] (not inline)
cross references @sec-h3 using the h3 short alias
**Bold**
*italic*
***Bold & Italic***
`inline code`
- Bullet point
- nested
- point
1. Numbered list
2. another thing ...
> Quotation - unattributed
![alt text](/path/to/an/image.png)
```
generic code chunk
```
$inline~math~x^y$
$$
math~chunk~\frac{x}{y}
$$
[^1]: callback!
Markdown comes in a number of ‘flavors’, usually supersets of the CommonMark specification / reference implementation which extend it with additional features, so there is some variation in syntax; many tools have built-in linters to check/auto-correct any syntax not supported in a given flavor.
Quarto is a scientific and technical publishing system that uses markdown
Whilst still from the Posit (formerly RStudio) team, it is more language agnostic than Rmarkdown, which may be familiar to R users, and can be installed as a separate command-line utility without R dependencies. It can use Jupyter notebooks as a source document format and integrates well with VS Code as well as RStudio. It also unifies the variety of different pre-processing steps for different output formats previously performed by a family of R packages, bringing us closer to true single source publishing.
If you are starting in 2023, begin with Quarto: it’s basically the same as Rmarkdown but better, and highly backwards compatible with Rmarkdown. These texts are still relevant, but you can now do a lot of cool new stuff in Quarto; in particular ‘fenced divs’ are awesome.
Jupyter notebooks are another major player in the scientific computational notebook space and originate in the Python community with the IPython interactive shell. They are run from within your web browser, either locally or on a remote JupyterHub. They also make use of markdown syntax for literate programming.
Unlike Quarto/Rmarkdown they are not tightly integrated with an IDE (integrated development environment), meaning they sometimes lack some of the features that an IDE can provide, though tighter integrations with Microsoft’s featureful open source text editor VS Code are changing this.
I am primarily an R user; I’ve used Jupyter notebooks for Python and Raku projects but much prefer the experience of Rmarkdown-style notebooks over Jupyter, mostly because in Jupyter you do not generally see or edit the actual source document, you only generate it from the interface. This makes working with version control tools like git more challenging. Thankfully MyST makes markdown-style notebooks possible if you don’t like Quarto.
There are trade-offs between the Rmarkdown & Jupyter Notebook ways of working (see: The First Notebook War) but Quarto and jupyter book in conjunction with MyST go a long way to resolving some of these issues.
If you are primarily a python person looking to get started with a literate programming workflow I would suggest that you avoid classic jupyter notebook files in favor of those written entirely in Markdown.
You could use Quarto with VS Code, or Python in RStudio with Quarto and {reticulate}, over JupyterHub or the Jupyter extension for VS Code, but this may not be the best fit for your established workflow; it is a matter of taste.
One of the nice features of vscode for working collaboratively is the live share extension which gives you real-time google-docs-like collaboration tools, though of course you can still use git in vscode for asynchronous collaboration. JupyterHub now also has support for real-time collaboration.
Text/Video (quickstart) getting started with jupyter notebooks
Text (quickstart) Quarto in vscode
Video (longform) Jupyter Notebooks in VS Code Extension NEW in 2022 - Tutorial Introducing Kernels, Markdown, & Cells
Text (documentation) jupyter notebooks in vscode
git
glossary
I’m including command line examples here but the concepts should map well onto a number of different GUI front-ends to git. The glossary should help with both git’s terminology and understanding some of the key concepts that make it up.
- `git init`: initializes a new repo. You can also create one on a git hosting service like GitLab or GitHub and then clone a local copy. After this you’ll find a hidden `.git` folder in your repo.
- `git commit <file that I changed> -m "Short informative message"`: commits the changes to the specified file(s), accompanied by a message describing them.
- `git commit -am "Short informative message"`: commits all changes to files already tracked by git, excluding anything listed in your `.gitignore` file.
- `git add new_file.txt`: adds a specific new file to the staging area so that git will track it.
- `git add -A`: add all files - handle with care! You don’t want to accidentally commit a large binary file or things that should be kept secret.
- `git stash`: stashes changes to tracked files to get a clean working tree before performing other git actions. Untracked files must first be added, or stashed with the `-u` option.
- `master` or `main`: the conventional names for the default primary branch of a repository.
- `git branch -l`: lists branches.
- `git branch <new branch name>`: creates a new branch.
- `git checkout <new branch name>`: switches your working directory to that branch.
- `git branch -d <branch name>`: deletes a branch.
- `HEAD`: the commit your working directory currently reflects. All of your project’s history is stored in the `.git` directory; your working directory merely points to a commit in there. By changing which commit your HEAD is pointing to you can move the window of your working directory around the history and branches of your project.
- `origin`: the conventional name for the default remote repository.
- `git remote -v`: lists the remotes configured for your repo.
- `git remote add <remote name> <remote url>`: adds a new remote.
- `git pull <remote name> <branch name>`: fetches and merges changes from a remote, commonly `git pull origin master`.
- `git push <remote name> <branch name>`: sends your local commits to a remote, commonly `git push origin master`.
- `git log`: shows a list of commits; adding the `--graph` flag shows a text-based graphic of the branch structure.
- staging area: the `git status` command reveals the contents of your staging area under `Changes to be committed:`. If you change a tracked file it will appear in this section; if you create a new file it will not be in the staging area until you add it with `git add`. `git add -i` lets you stage changes interactively.
- `git stash`: stash your current changes; `git stash list`: list the stashes; `git stash pop`: apply the first item in the stash list to the working directory. This is like rebasing in that it will apply your stash to the tip of the current branch.
- merge: if I am on the `master` branch and I want to merge in the `feature` branch I can use the command `git merge feature` to merge the branches. This will, if there are no conflicts, create a merge commit.
- `git merge --abort`: abandons a merge. If you weren’t expecting a conflict, do this before making any attempts at resolving the conflict, as if you make changes during a merge you may not be able to revert cleanly.
- rebase: if your branch is behind `master` but has no conflicts with its current tip then, instead of merging, you can rebase on `master`: ‘snip’ your branch off from its current parent and automatically generate new commits where the current tip of `master` is the parent instead. Use `git rebase master` from your branch, or `git rebase master <branch name>`.
- diff: shows the differences between versions of files or commits (see also the tool `delta` for improved visual diffs).
- `git mv` / `git rm`: if you `mv` a file to re-name it, it will appear to git as though you deleted and re-added it unless you use `git mv`. If you `rm` a file you will, counter-intuitively, need to add the action of removing it to your staging area to let git know you’ve removed it; it is simpler to `git rm` a file, which both removes it and stages the action of removing it.

git on the command line
unfinished!
Installing git…links
Tell git who you are
git config --global user.email "youremail@yourdomain.com"
git config --global user.name "your name"
(Dropping `--global` will only set these values for the current project)
Initialize a git repository (turn a folder into a git repo)
git init
Add files to be tracked by git:
echo "# README" > README.md # an example file
git add README.md
set-up a remote
(Send your changes to a git server; see the sketch below)
pull from a remote (Get the latest changes from a git server)
Get the status of your git repository with git status
this will show you …
diff
staging chunks
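A minimal sketch of the remote, status, diff, and chunk-staging steps above; the remote URL and branch name are placeholders to adapt to your own project.

```bash
# connect your local repo to a copy on a git server (URL is a placeholder)
git remote add origin git@gitlab.com:your-group/your-project.git

# send your local commits to the server / fetch and merge its latest changes
git push origin main
git pull origin main

# inspect the state of your working tree and staging area
git status
git diff            # unstaged changes
git diff --staged   # changes already staged for the next commit
git add -p          # interactively stage individual chunks (hunks) of changes
```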
There are numerous GUI (e.g. gitkraken) and TUI (terminal/text user interface, e.g. gitui, lazygit) front-ends which provide convenient ways to use git beyond the core command line application. RStudio provides a built-in git UI in which you can commit changes, see diffs, explore history, manage branches etc. By default it is located in a tab in the top right pane of the RStudio interface in projects which use git.
Git and the platforms built around it such as github and gitlab solve the problem of sharing and collaborating on your code, and in the context of literate programming your prose as well.
You can also explore the history of the changes made to a project in the history view of the project on GitHub or GitLab; for example, here is the gitlab history of this document.
Another useful feature of git is attribution. Every git commit has an author, so when collaborating on a project managed in git, credit can go to the people who wrote particular parts of the document. (git also distinguishes between an author and a committer, so a committer can commit changes from an author who is not themselves directly using git if desired, though this is not entirely the intended use case.)
You can temporarily override the default author / committer values set in the global or local git config files by setting these environment variables:
export GIT_AUTHOR_NAME="John Smith"
export GIT_AUTHOR_EMAIL="jsmith@example.com"
export GIT_COMMITTER_NAME="Jane Doe"
export GIT_COMMITTER_EMAIL="jdoe@example.com"
Note that truly deleting things from a git history once that history has been pushed to a repo used by others can be quite difficult. (It can take a long time because git is based on a hash tree: if you delete something from the history you have to re-write all subsequent commits. This is part of what makes it such a good system for provenance of code.) So never commit secrets such as passwords or API keys, even to private repos, if these might ever be made public. Storing sensitive values in environment variables is a common solution to this problem.
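One hedged sketch of the environment variable approach; the file and variable names here are purely illustrative.

```bash
# keep secrets in an untracked file and make sure git never sees it
echo 'MY_API_KEY="not-a-real-key"' >> .env   # hypothetical variable name
echo ".env" >> .gitignore
git add .gitignore
git commit -m "Ignore local secrets file"

# load the secrets into environment variables for the current shell session
set -a
source .env
set +a
echo "$MY_API_KEY"   # your code reads the key from the environment, not from the repo
```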
feature branches for internal collaborators, forking for external collaborators.
There are a variety of workflow patterns which can be followed when it comes to collaborating on git projects. For a solo project you might be able to get away with just committing directly to the primary branch (often called main or master) almost all of the time. When collaborating, however, it can be a good idea to switch to working in ‘feature branches’: you have some small feature that you want to implement, or issue to address, so you make a branch and work on it there. Once you are done you can check on the status of the master branch; if master is ahead of where you branched off you might want to rebase on the new master, resolve any conflicts, and perform a fast-forward merge appending your new commits to the end of the master branch. It is best to keep the scope of feature branches as small as possible so there are minimal issues when merging back into the master branch (see the sketch below).
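A sketch of that feature-branch cycle using plain git commands; the branch name is hypothetical and your project may use main rather than master.

```bash
# start a feature branch from an up-to-date master
git checkout master
git pull origin master
git checkout -b my-feature      # hypothetical branch name

# work, then commit in small self-contained steps
git add -p
git commit -m "Implement my feature"

# before merging, replay your commits on top of the latest master
git fetch origin
git rebase origin/master        # resolve any conflicts here

# fast-forward merge back into master and share it
git checkout master
git merge --ff-only my-feature
git push origin master
```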
A collaborator with access and permissions on your repository can work on feature branches in your repo, but an external collaborator without these permissions cannot. To achieve the same thing they can fork the repo, i.e. make their own copy, and submit pull requests (PRs) from there. PRs are best for specific suggested changes; if there is a problem or query around what changes need to be made, an issue should be opened in the issue tracker of the project, and once plans for specific changes are agreed a PR can be generated with the proposed changes. PRs can be reviewed and revised before being accepted and merged into the master branch.
This process is generally how ‘peer review’, usually just referred to as ‘code review’, tends to happen in software projects. An issue becomes a proposed set of changes, which becomes a specific implementation of those changes, which becomes a pull request. Any alterations to the specifics are worked out in the PR before the agreed changes are merged.
In literate programming it is advisable to follow the convention of one sentence per line in the source document when using git. This makes it easier to manage git diffs, as git focuses on line-wise not character-wise differences. You can get this behavior in RStudio with the options below in the YAML header of an Rmarkdown document, or at a project or global level in the RStudio settings.
---
editor_options:
markdown:
wrap: sentence
canonical: true
---
When you write data analysis code in a language like R or Python, chances are that you are going to be depending on some other packages to do your work. You may have noticed that updates to these packages sometimes break your code: a function that used to exist has been deprecated and is no longer in a package, or the arguments to a function have changed. More worrying still, sometimes such changes won’t stop your code running but will produce an output that is wrong, though not in an obvious fashion. Thus in order to reproduce your analysis exactly we need not just your code but the versions of the language and the packages that your code depends on. This way it is possible to run your code with confidence that it is functioning the same way for us as it was for you.
In the R programming language the best package management solution for reproducible environments is {renv}.
{renv} provides `renv::install()`, which is a replacement for the base `install.packages()`, as well as for the `BiocManager::install()` & `remotes::install_github()` functions used to install R packages. The `renv::snapshot()` function is used to create a project-specific manifest file, `renv.lock`, which documents all the packages used and their versions. `renv::restore()` can then be used to make the installed packages and their versions match those specified in the lock file. {renv} has a central package cache and uses symbolic links to project libraries to ensure that there is only one copy of a given version of a package installed on your system, improving its performance over previous attempts at project-specific package management in R.
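A minimal sketch of that {renv} workflow, run from the shell with Rscript and assuming {renv} is already installed; the package installed is just an example.

```bash
# initialise a project-local library and lockfile (run inside the project directory)
Rscript -e 'renv::init()'

# install packages via renv so they go into the project library
Rscript -e 'renv::install("dplyr")'   # example package

# record the exact package versions in renv.lock
Rscript -e 'renv::snapshot()'

# later, or on another machine: reinstall exactly those versions
Rscript -e 'renv::restore()'
```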
In Python there are two main tools for managing package environments.
`venv` is a Python-specific environment manager for isolated, project-specific Python package management and is part of the Python standard library. A virtual environment can be created by running `python3 -m venv venv/` in your project directory. This command uses the `venv` module (`-m venv`) to create a virtual environment called `venv` (`venv/`), but the name is arbitrary; a sub-directory with the environment’s name will be created. To use the environment it must be activated with `source venv/bin/activate`; `deactivate` exits the environment. The `pip` package manager can be used as normal within the environment and will only affect the local environment while it is active. To capture a snapshot of the environment from which it could be restored later use `pip freeze > requirements.txt`. A virtual environment can be restored from a `requirements.txt` file with `pip install -r requirements.txt`.
This guide to Python virtual environments goes into some additional details of how to use `venv` and how it works. For management of the version of Python itself, `pyenv` is a good tool. For the management of Python packaging, the tool `poetry` is a good choice thanks to its good dependency management.
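Pulling the `venv` commands above together into one runnable sketch; the package names are examples only.

```bash
# create and activate a project-local virtual environment
python3 -m venv venv/
source venv/bin/activate

# install what you need, then record the exact versions used
pip install numpy pandas              # example packages
pip freeze > requirements.txt

# leave the environment when done
deactivate

# later, or on another machine: recreate the environment from the requirements file
python3 -m venv venv/
source venv/bin/activate
pip install -r requirements.txt
```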
`conda` is both a package and environment manager and is language agnostic, however it tends to be used in predominantly Python settings. Whilst `conda` can be used to manage R packages I would not recommend it for a predominantly R project. By default `conda` does not take the approach of storing the specification of your environment within the project directory, unlike {renv} & `venv`; I would avoid this default behavior. Keeping the environment specification in the project directory is obviously preferable if you want to be able to share the project along with its environment. This guide to conda projects provides a nice overview of getting started with conda environments using an `environment.yml` file and this demonstrates how you can set the location of the conda environments to be within a project directory.
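A hedged sketch of keeping a conda environment inside the project directory, assuming an environment.yml file already exists at the project root.

```bash
# create the environment from the spec file, placing it inside the project
conda env create -f environment.yml --prefix ./env

# activate the environment by path rather than by name
conda activate ./env

# after adding packages, write the updated specification back into the project
conda env export --prefix ./env > environment.yml
```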
Beyond language-specific packages, many of the packages in a given language will depend on system libraries in your operating system. For instance, an R package which parses XML files might rely on a fast system library written in C, which is used by many other packages in other languages, rather than re-implement its own XML parsing and duplicate the effort. Now a language-specific package or environment management solution is no longer sufficient on its own. Solutions to this problem include more advanced system-level package management tools such as Nix-based package management (see also GNU Guix). Alternatively, and more popularly at the moment, system dependencies can be managed by the operating system’s package manager, and containers can be used to create portable and isolated system environments with different system dependencies. Nix-like package management solves more and different problems than containers and can be used to more reproducibly build container images, but it still currently has a bit of an ‘early adopter tax’.
A container provides an isolated, self-contained computing environment similar in practice to that of a virtual machine (VM) whilst not having nearly the same performance deficits associated with virtualization (for the technically inclined, a simplification is that containers share a kernel but provide a different user-land). This lets you package up your code along with all its dependencies and configuration in a standard ‘box’ that can run exactly the same way on essentially any Linux back-end (as well as on Mac and Windows through what amounts to a wrapper around a Linux VM).
The most popular containerization technology is Docker, though others exist (Podman & Apptainer/Singularity for example). You specify the environment you want inside a Docker container using a `Dockerfile`, and build a container image which can run things in the environment specified in that `Dockerfile`. Whilst running something with the exact container image is fully reproducible, building a container image from a specification is not necessarily so. The `Dockerfile` starts from a ‘base image’, usually of the operating system you’d like to set up your environment in. You might use, for example, `ubuntu:latest`; the second part of this text specifying the operating system, `latest`, is called a tag. The `latest` tag obviously depends on what happened to be the latest version when the build command was run, thus you cannot rebuild an identical image to the original one built from this `Dockerfile` unless you know what version of Ubuntu was the latest when the build command was run. To avoid this ambiguity it is best to specify the version more explicitly, e.g. `ubuntu:jammy`; `jammy` is the code name for Ubuntu 22.04, the current (as of writing) LTS (long term support) release of the Ubuntu operating system.
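A small sketch of building an image from a version-pinned base; the image name and installed packages are placeholders.

```bash
# write a minimal Dockerfile pinned to a specific Ubuntu release
cat > Dockerfile <<'EOF'
FROM ubuntu:jammy
RUN apt-get update \
    && apt-get install -y --no-install-recommends r-base \
    && rm -rf /var/lib/apt/lists/*
EOF

# build a tagged image and run a command inside the resulting container
docker build -t my-analysis:0.1 .
docker run --rm my-analysis:0.1 R --version
```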
When datasets are small their day-to-day management is often relatively un-complicated. We can just make a copy of our original raw data and work with that in our analysis. As datasets get larger and simply making a copy of them becomes an expensive operation we often have to get a bit more creative with their management.
Raw data is conceptually ‘read-only’; it can be a good idea to make this literal. Keep the raw data for your project in a place where you cannot accidentally modify or delete it. An easy way to do this is to make your raw data files read-only and keep them in a specific location which you can back up with a little extra thoroughness.
On UNIX like systems you might want to follow a pattern like this:
# A central directory to store all your raw data files
# with subdirectories by project
mkdir -p ~/tank/test-project-data
# Make an example data file
touch ~/tank/test-project-data/data.file
# Change the mode of all the files in `test-project-data`
# and all sub-directories with `-R`
# remove the write permission with `-w`
chmod -R -w ~/tank/test-project-data
# Make a directory in your project folder to link to your raw data
mkdir -p ~/projects/test-project/data
ln -s ~/tank/test-project-data/data.file ~/projects/test-project/data/data.file
# Links like ~/projects/test-project/data/data.file can now be deleted
# The files they are linked to will not be affected
rm ~/projects/test-project/data/data.file
# If you run this you'll find it's still there
ls -l ~/tank/test-project-data/data.file
Need help understanding any of these shell commands? Check out explainshell.com: paste in any shell command to get a breakdown of its component parts and what they mean.
This approach lets you keep your datasets within your project directory without actually having to keep the files there. For example you might have your data directory on a secondary higher capacity storage device than your projects folder.
You may even want to make a dedicated user account who is the only one with write permissions to your raw data files as extra protection against their accidentally being changed. A dedicated account able only to read the raw data files that is used to perform backups is also a potentially sensible strategy.
Raw data is generally data directly from whatever your instrument is. There may be some degree of pre-processing applied by that instrument to its own raw sensor data prior to outputting this pre-processed data to the end user. For example, in DNA sequencing machines base calls are generally made on the machine from the raw sensor output, e.g. the fluorescence intensity, before being output as a fastq file with the call and an indicator of its quality.
Once you have your raw data you process it, yielding (surprise!) processed data. Some of this processed data will be ‘end points’ and other parts may be ‘intermediate data products’; whether your data is an endpoint or an intermediate product is context dependent. You might, for example, consider the count matrix from an RNA-seq experiment an endpoint, as it is a common product of analysis used in further downstream analyses. It’s the sort of data product that is useful to others if you include it when you deposit your data in a public repository. But you might discard the alignments in the form of BAM files, as these are very large. BAM files are, however, computationally expensive to generate, so you might keep them around for the active duration of the project but not archive them.
In theory all processed data should be dispensable if your raw data, analysis code, and computational environment are properly documented. It should be possible to exactly regenerate your results from your raw data and your computational methods.
When working with data that you did not generate, and thus do not need to ensure the preservation of, you might want to keep it somewhere separate from your own raw data: somewhere without your own backups, where you can cache the data. Always be sure to capture the metadata about how you acquired your copy though: accession numbers, when you retrieved it, and any version numbers available.
Domain specific data repositories may have their own download tools and approaches to locally caching data which you can use.
As we will discuss in Chapter 5 When to Publish Data, it can be a good idea to publish your own data to a public repository before publishing your main analysis. This way you can access your data from the public resource as other researchers would. This is good practice as it permits you to validate that your data are indeed FAIR: it shows that you were able to find your own data, refer to it with its unique identifiers, and retrieve it in an appropriate format. This improves the documentation of the provenance of your data, as its shared accessions and metadata annotation are used in the original work, leaving less opportunity for errors of labeling etc. in the data repository.
Within a project managed by git, large binary files, such as images, can be a problem as they will cause a repo to quickly grow to an unmanageable size if they are included. A solution to this problem, if you want to remain within the git paradigm, is git-lfs (git large file storage), though this approach is not without its drawbacks. Every committed version of a large file is still kept, just on the git-lfs server rather than in everyone’s local repos, where only the needed version is synchronized. When git-lfs is available at a git hosting service it is often, understandably, a paid feature or has limited capacity. It is also a non-trivial effort to configure and host your own git-lfs server.
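A minimal git-lfs sketch, assuming git-lfs is installed and your remote supports it; the file pattern and paths are examples.

```bash
# one-time setup of git-lfs for your user account
git lfs install

# tell git-lfs which file patterns it should manage
git lfs track "*.tiff"
git add .gitattributes

# matching files are now stored via LFS when committed and pushed
git add images/example.tiff
git commit -m "Add example image via git-lfs"
git push origin main
```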
The tool Data Version Control (DVC) provides git-compatible, git-lfs-like functionality with different storage back-end options, including consumer cloud storage options like Google Drive, Dropbox, etc. Other alternatives include lakeFS, and you can also use ZFS to version data if you are using it directly and not just as a storage back-end behind other abstractions.
Note: I am not an image analysis specialist, so this section would likely benefit from the contributions of someone with more experience in this area.
Working with imaging data can pose a number of substantial practical challenges. Imaging datasets are often quite complex to administer: they frequently comprise many files which need to be structured and accompanied by both experimental design and technical metadata. Many microscopes produce images in proprietary formats which attempt to address some of these organisational issues by bundling together metadata and images from individual planes or channels into single files that are interpretable to their software. Unfortunately these formats are proprietary, which can present issues when trying to use them with software other than that provided by the manufacturer of your imaging equipment or their partners.
Thanks to the Open Microscopy Environment’s (OME) Bio-Formats project and its hard work reverse engineering many of these formats, it is now possible to work with many of them interoperably and with open software tools. There are ongoing efforts to have commercial imaging providers make use of open standards and open up their imaging formats[^funders].
As alluded to in Chapter 3 How To Store Your Data imaging data can in some cases be very large. 3 dimensional multi-channel and time course (aka 5D) datasets at high resolution from imaging techniques such as light sheet microscopy can rapidly balloon in size. Extremely high resolution electron microscopy images are another example. When datasets reach multiple terabytes we start running up against the limits of the current generation of readily available computing technology to make use of datasets of this unwieldy size. Fortunately there is much work underway to make this a more manageable problem including the development of next generation file formats OME-NGFF which facilitate parallel processing and the streaming of only needed portions of large datasets to users remotely accessing data from central repository(s) (Moore et al. 2021 [cito:citesAsAuthority] [cito:credits] [cito:agreesWith]).
Organizing your imaging data benefits from software tools which permit you to store, view, annotate, share, search, and programmatically explore your imaging datasets. A simple file and folder directory structure with some standard operating procedures for where to put files and what to call them is slow, manual, cumbersome, and error prone. The Open Microscopy Environment’s OMERO tool is probably the best available software tool to solve your image data organisation woes. It operates a standard client-server approach, with a central server on which the data is indexed and stored, which can be accessed by various clients. There is a general web client, additional web-based viewers and figure creation tools, as well as a desktop client to speed up larger image uploads & downloads. The OMERO server can be accessed via an application programming interface (API) which permits you to interact with your data from applications like Fiji, CellProfiler, QuPath, or napari. There are libraries for Python, R, Java, & MATLAB to facilitate using the API in your own custom analysis code.
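As a taste of working with an OMERO server from the shell, a hedged sketch using the omero-py command line client; the server address, username, and file path are placeholders.

```bash
# log in to an OMERO server (details are placeholders)
omero login -s omero.example.org -u jsmith

# import an image file; Bio-Formats handles the format detection on upload
omero import data/experiment-01/image_001.nd2
```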
One of the advantages of deploying an OMERO instance and using it to store and analyse your data is that it is the same software stack which underpins public image databases such as the Image Data Repository (IDR), a highly curated ‘added-value database’ for image datasets that are community resources. You can interact with your own data in the same way you interact with publicly available datasets, and when you make your own data public others can access it the same way you do internally.
Text (documentation) Fiji/ImageJ documentation
Text (tutorial) ImageJ tutorials
Text (documentation) Open Microscopy Environment (OME) documentation
Text (tutorial) OMERO guides
Video (longform) Image data: management, sharing & re-use - IDR workshop
Text (documentation) Cellprofiler manual
Text (tutorial) Cellprofiler tutorials
Text (documentation) QuPath documentation
Text (publication) QuPath: Open source software for digital pathology image analysis (Bankhead et al. 2017 [cito:citesAsAuthority])
Video (YouTube Channel) QuPath YouTube channel
Video (quickstart) QuPath OMERO access
Video (longform) QuPath tutorial #1 - Getting started
One of the significant practical issues addressed by using a pipeline management system, when developing a new analysis or iterating on an existing one, is results caching. If your long, fairly complex pipeline with some slow, computationally expensive steps is just a script that has to be re-run from scratch because you changed how a graph looks, you are not going to use a single framework for your whole analysis; you are, quite sensibly, going to break it up into separate steps. You are, however, now at risk of ending up with your analysis in an inconsistent state if, for example, you forget to re-run a step downstream of your change. You have introduced a semi-manual stepping through each of the separate sections of your analysis to get the final result.
Many pipeline managers are designed to be (mostly) idempotent, that is to say running the same pipeline repeatedly will get you the same result: subsequent runs will not be affected by previous runs, and running it repeatedly is a safe operation in that it won’t affect the outcome. Whilst you can manage to get the same result with an ordinary script, it can be very cumbersome and time consuming to do so. One of the tricks generally employed by pipeline managers to make idempotency practical is caching. If you can cache the computationally expensive parts of an analysis you can feel safe running the pipeline command again. This way you can make a minor downstream modification to a plot safe in the knowledge that you won’t have to wait hours to see the results of your change, as the same pipeline command only runs the steps that need to be run to apply the changes.
This lets you keep your long and complex analyses properly connected together and re-runnable from scratch with a single command, while meaning that you don’t have to re-run the bits you have not changed to ensure everything stays consistent. Despite this, it is almost always advisable to re-run any pipeline from scratch with a clean cache once you think you have the final version ready, to make sure there are no hitches; cache invalidation is, after all, a legendarily hard thing to get consistently right. Because pipeline managers generally understand the dependency relationships between the steps of your analysis, it is usually simple to automatically parallelise independent tasks and get better run times.
Pipeline management tools are most advantageous for longer more complex or more computationally expensive analyses, especially those intended to be reused by others. Their design tends to favor workloads which require large batch processing with little to no user interaction needed during a run. So they won’t be applicable for everyone’s use-case.
There are a number of language-specific pipeline tools which may be easier to learn if you are already proficient with a particular language, and which benefit from language-specific integrations. R’s {targets} pipeline manager, for instance, has nice integrations with R’s literate programming tools, which can be useful when writing a pipeline with nicely formatted outputs. In Python there is the `snakemake` pipeline manager.
A common reason for using a pipeline manager, however, is not writing a new pipeline from scratch but making use of an existing one. A good example of this is the nf-core project, which uses Nextflow, a domain specific language (DSL) for pipeline management which excels in portability of pipelines between different systems. nf-core has a number of pre-built pipelines for common bioinformatic analyses which can be used by anyone and make it easy for others to reproduce your analysis. nf-core is an open source project, so anyone can contribute updates, extensions, bug fixes or entirely new pipelines, which may be incorporated into the upstream versions used by the community. If you have a novel analysis method, creating such a community pipeline is one of the best ways to make it easy for other researchers to use your work. (Publications which accompany tools that become popular tend to attract out-sized numbers of citations.)
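Running a community pipeline is typically a single command; a hedged sketch using the nf-core rnaseq pipeline as an example (the pinned version, input sheet, and output directory are placeholders).

```bash
# run a pre-built nf-core pipeline, pinning its version for reproducibility
# (-profile docker runs each step in a container; -resume reuses cached results)
nextflow run nf-core/rnaseq \
    -r 3.12.0 \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    -resume
```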
A project that is worth being aware of in the workflow management space is the Common Workflow Language (CWL). Many other pipeline management frameworks have at least partial compatibility to import/export CWL, which is very valuable when migrating a pipeline between systems. Whether you are using Nextflow or another pipeline management tool, you can also deposit workflows in WorkflowHub, which supports workflows of any type.
The Renku platform also has a built-in workflow management system, which takes a slightly different approach of constructing pipelines step-wise, can be exported to CWL, and is covered more in the Renku section Section 4.8.
CI/CD (continuous integration / continuous deployment) are concepts popularized by the software development industry for testing and deploying applications. If you are developing a software package to share your code, learning some of this tooling can be very useful to automate many of the steps involved in testing and distributing software. This same tooling can also be very useful for checking that your analyses are indeed reproducible and for publishing documentation associated with any workflows that you share.
This text is making use of a CI/CD pipeline in its publication process. It is built from its markdown source files into a website, epub & pdf on a gitlab CI/CD ‘runner’ every time I push commits to the remote repository. The static website for this document is only updated if the build completes correctly with no errors. This .gitlab-ci.yml file details the steps taken when building this document from source. (I’m doing some extra things, so building a document like this would not normally be as complicated as it might appear in this file.)
Keeping your code, data, & compute environment together
Renku (連句 “linked verses”), is a Japanese form of popular collaborative linked verse poetry, written by more than one author working together.
- Wikipedia
In the last few sections we covered a number of powerful, complex and configurable technologies; it may feel a bit overwhelming, as there is a lot to learn and a lot of choices to make. Fortunately, the Renku platform combines many of these technologies and has chosen some sensible defaults to make it simpler to get started using them. Picking up a Renku project is about as easy as picking up a Jupyter notebook or RStudio project, if you are already familiar with these, and you get much of the rest (almost) for free. Renku’s flexible template system makes it possible for people with more experience of the platform to set up easy-to-use environments, specialized for particular tasks, for other collaborators on a project.
Whilst there are other solutions to the reproducible compute problem which make it fairly straightforward to reproduce environments, e.g. Binder, they lack data integrations. There are proprietary cloud-based solutions offering the trifecta of data, code and compute environment in the same place, such as Google’s Colab; however, given Google’s graveyard full of dead projects it may be unwise to depend on them if you want your work to be around and accessible in the medium to long term. There are also Code Ocean & DagsHub, but these are paid, closed source solutions, even if many of their internals and integrations are based on open source tooling. Stencila is an ambitious but still early stage open source project; notably, they have an integration with eLife.
Fundamentally, adoption of a proprietary platform as a standard for computational reproducibility is an oxymoron: full transparency is not possible with this approach. I can’t verifiably reproduce your analysis if I’m using a black box to do it, or am missing key features needed to create new analyses or interact with their results. Paid services in support of open source tools are the only transparent, ethical and sustainable approach to solving this problem. A project worth watching as a source of publicly funded cloud infrastructure for hosting such open platforms is the European Open Science Cloud (EOSC), though this remains at a relatively early stage of development at the time of writing.
Renku provides all the needed features of the above projects but is an open-source project developed at the Swiss Data Science Center, based at EPFL and ETH Zurich. It is a project which you can host yourself and which has a public instance at renkulab.io. Similar considerations apply to the choice of a reproducible computational analysis platform as to the choice of an electronic lab notebook (section Section 4.1), because they have semi-overlapping functions; a platform like Renku can serve as the lab notebook for your more computationally focused researchers.
You can sign up for an account at Renkulab.io at this registration page; ORCID & GitHub are supported as single sign-on providers.
When you start a new project in renku you generally do so from a template. Renku has a templating system which permits users to create their own templates for projects. There are a core set of default templates as well as community contributed ones. I’m also developing some templates for HDBI.
If you have docker and the renku CLI client installed on your system you can run an interactive renku session on your local system by running `renku session start` and navigating to the link that it returns in your web browser; the link will look something like this: http://0.0.0.0:49153/?token=998dasdf...
If you are running a renku session on a local workstation or server with a lot of compute resources but still want to access this session remotely from your laptop or even phone there are a couple of ways of doing this.
If you can ssh (secure shell) into the machine running your container you can access your session by ssh port forwarding, using a command structured like the following: `ssh -nNT -L <local port>:localhost:<remote port> <user>@<host>`. Let’s say `renku session start` on my workstation returns http://0.0.0.0:49153/?token=998dasdf; I can run `ssh -nNT -L 49153:localhost:49153 me@host` on my laptop, where `me` is my username on my workstation and `host` is my workstation’s IP/url, then navigate to http://0.0.0.0:49153/?token=998dasdf on my laptop and remotely access the session.
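Putting the two halves of that example together (the port, token, and host names are the illustrative values from the text above):

```bash
# on the workstation: start the session
renku session start
# -> http://0.0.0.0:49153/?token=998dasdf

# on the laptop: forward the session's port over ssh
# ("me" is your username on the workstation, "host" is its address)
ssh -nNT -L 49153:localhost:49153 me@host

# then open http://0.0.0.0:49153/?token=998dasdf in the laptop's browser
```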
Tailscale will create a secure WireGuard mesh VPN between the clients on which it is installed, so if you can install Tailscale on your workstation and laptop you can connect to your workstation irrespective of any firewalls or NAT normally blocking your path. Simply navigate to your workstation’s Tailscale IP address and enter the port/token for the session, e.g. http://100.10.10.10:49153/?token=998dasdf, where 100.10.10.10 is your workstation’s IP from `tailscale status`.
[^parity]: Feature parity - meaning that the self-hosted version offers all the same features as the paid hosted version, i.e. there are no features locked behind a pay-wall. This business model tends to trend increasingly closed and can lead to tension between the in-house dev team and the community over the implementation of paid features in community versions.

[^mutterer]: Jérôme’s CNRS page, as his ORCID is a bit sparse.

[^inkscape]: inkscape-imagej-panel resources:

[^op]: OP: Over Powered, a colloquialism originating in the concept of a video game character having excessive abilities which upset the game balance. Now commonly used to refer to characters or items, in fiction or reality, whose abilities are disruptively good.

[^funders]: Dear Funders, please get together and adopt a blanket policy of refusing to fund the purchase of any scientific equipment which outputs data in a proprietary format, ideally eventually moving on to refusing to fund the purchase of any equipment with proprietary embedded software. This would really save everyone a lot of time and money in the long run. Pretty please.