A Beginner's Guide to Snakemake

Hongtao Hao / 2022-01-04

Please refer to the README.md in hongtaoh/snakemake-tutorial for updated instructions. This blog post might be outdated.

Prerequisite: You’ve installed conda or miniconda.

I also assume you are using a Mac or Linux. If you are using Windows, refer to here first.

Let’s say you already have a project folder. Take snakemake-tutorial as an example.

Install snakemake

First, create a virtual environment for your project.

cd snakemake-tutorial
conda create --name s-tutorial python=3.8
conda activate s-tutorial

Then, we install packages into this virtual environment. For simplicity, I will only use pandas.

conda install pandas

Then, install snakemake:

pip3 install "git+https://github.com/ashwinvis/datrie.git@python3.8-cythonize"
pip3 install snakemake

The above code came from here

Create a Snakefile

Let’s say we have this raw data, data/raw/raw_data.csv:

year,value
2001,4
2002,5
2003,6
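If the file does not exist yet, a minimal sketch like the following could create it. The helper name write_raw_data and its project_root parameter are my own assumptions for illustration; only the file path and the rows come from the listing above.

```python
from pathlib import Path

# Sample rows copied from the listing above.
RAW_CSV = "year,value\n2001,4\n2002,5\n2003,6\n"


def write_raw_data(project_root="."):
    """Create data/raw/raw_data.csv under project_root and return its path.

    Hypothetical helper for this tutorial; it is not part of Snakemake.
    """
    raw_path = Path(project_root) / "data" / "raw" / "raw_data.csv"
    # Create data/raw if it is missing; do nothing if it already exists.
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(RAW_CSV)
    return raw_path
```

Calling write_raw_data() from inside the project folder should leave the file at data/raw/raw_data.csv.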

We want a Python script that selects only the rows whose year is equal to or greater than 2002. Let’s name this script get_since_2002.py and place it in the scripts folder.

This script, scripts/get_since_2002.py, takes data/raw/raw_data.csv as input and generates an output file named since_2002.csv. We want this output file to go into the data/derived folder.

Based on the above information, we will have this Snakefile:

from os.path import join as pjoin

DATA_DIR = "data/"
RAW_DATA_DIR = pjoin(DATA_DIR, "raw")
DERIVED_DATA_DIR = pjoin(DATA_DIR, "derived")

###############################################################################
# Raw datasets
###############################################################################

RAW_DATA = pjoin(RAW_DATA_DIR, 'raw_data.csv')

###############################################################################
# Final outputs
###############################################################################

SINCE_2002 = pjoin(DERIVED_DATA_DIR, 'since_2002.csv')

###############################################################################
# Workflows
###############################################################################

rule get_since_2002:
    input: RAW_DATA
    output: SINCE_2002
    shell: "python scripts/get_since_2002.py {input} {output}"
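To see what the rule ends up running, here is a small sketch in plain Python that resolves the same paths and fills in the shell template. It uses str.format only as a stand-in for Snakemake's own placeholder substitution, so treat it as an approximation, and it assumes a POSIX path separator (Mac or Linux).

```python
from os.path import join as pjoin

# Same constants as in the Snakefile above.
DATA_DIR = "data/"
RAW_DATA = pjoin(pjoin(DATA_DIR, "raw"), "raw_data.csv")
SINCE_2002 = pjoin(pjoin(DATA_DIR, "derived"), "since_2002.csv")

# Stand-in for Snakemake's placeholder substitution (an approximation,
# not Snakemake's actual implementation).
shell_template = "python scripts/get_since_2002.py {input} {output}"
command = shell_template.format(input=RAW_DATA, output=SINCE_2002)

print(RAW_DATA)    # input path of the rule
print(SINCE_2002)  # output path of the rule
print(command)     # command line the rule would run
```

The two paths become the first and second command-line arguments of scripts/get_since_2002.py.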

Create Python scripts

As described above, we’ll name the script get_since_2002.py and put it in the scripts folder.

import sys
import pandas as pd

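# Snakemake passes {input} and {output} as the first and second arguments.
RAW_DATA = sys.argv[1]
OUT_FNAME = sys.argv[2]

df = pd.read_csv(RAW_DATA)
# Keep only the rows whose year is 2002 or later.
df_out = df.loc[df.year >= 2002]
# Note: to_csv also writes the DataFrame index as an unnamed first column.
df_out.to_csv(OUT_FNAME)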

Execute Snakefile

Now, go to the directory where your Snakefile is located, and run snakemake --cores 1:

snakemake --cores 1

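This command looks for a file named Snakefile in the current directory and executes it. Because no target is given, Snakemake runs the first rule in the file, which here is get_since_2002.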

If successful, you’ll see this:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
get_since_2002        1              1              1
total                 1              1              1

Select jobs to execute...

[Tue Jan  4 14:23:46 2022]
rule get_since_2002:
    input: data/raw/raw_data.csv
    output: data/derived/since_2002.csv
    jobid: 0
    resources: tmpdir=/var/folders/z2/5kr96fyn63z_tj_bwr33t5dw0000gn/T

[Tue Jan  4 14:23:49 2022]
Finished job 0.
1 of 1 steps (100%) done

And you will find since_2002.csv in data/derived.
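As a quick sanity check on the result, a sketch like the one below reads the output back and confirms that no year before 2002 slipped through. It uses only the standard csv module; the function names years_in and check_since_2002 are hypothetical helpers, not part of Snakemake or pandas.

```python
import csv


def years_in(csv_path):
    """Return the values of the 'year' column as integers."""
    with open(csv_path, newline="") as handle:
        # DictReader keys each row by the header, so the unnamed index
        # column written by pandas does not get in the way.
        return [int(row["year"]) for row in csv.DictReader(handle)]


def check_since_2002(csv_path, cutoff=2002):
    """Return True if the file has rows and every year is >= cutoff."""
    years = years_in(csv_path)
    return bool(years) and all(year >= cutoff for year in years)
```

With the sample data above, check_since_2002("data/derived/since_2002.csv") should return True, since only the 2002 and 2003 rows remain.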

Last modified on 2022-01-04