Snakemake for Beginners: Your First Bioinformatics Pipeline
Learn how to build reproducible bioinformatics workflows with Snakemake. A step-by-step guide from basic concepts to a complete RNA-seq analysis pipeline.
Why Snakemake Will Change Your Research Life
If you’re doing bioinformatics analysis, you’ve probably found yourself in this situation: You have a series of tools that need to be run in a specific order, with the output of one tool feeding into the next. You write a bash script, it works great… until you need to rerun part of the analysis, or add a new sample, or share it with a colleague who has a different computer setup.
Enter Snakemake - a workflow management system that makes your bioinformatics pipelines reproducible, scalable, and maintainable.
What is Snakemake?
Snakemake is like Make (the build system) but designed specifically for data analysis workflows. It:
- Tracks dependencies between files and automatically determines what needs to be rerun
- Parallelizes jobs automatically when possible
- Handles errors gracefully and allows easy restarts
- Integrates with conda environments and containers
- Scales from laptop to high-performance computing clusters
Think of it as a smart project manager for your data analysis.
Installing Snakemake
First, let’s get Snakemake installed. I recommend using conda:
# Create a new environment for workflow management
conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake
# Verify installation
snakemake --help
Basic Concepts: Rules and Files
Snakemake workflows are defined by rules. Each rule describes:
- What output files it produces
- What input files it needs
- What commands to run to transform inputs into outputs
Here’s the basic syntax:
rule rule_name:
    input:
        "input_file.txt"
    output:
        "output_file.txt"
    shell:
        "command_to_transform_input_to_output"
Your First Snakemake Workflow
Let’s start with a simple example: quality control of FASTQ files.
Create a file called Snakefile:
# Simple quality control workflow
rule all:
    input:
        "results/sample1_fastqc.html",
        "results/sample2_fastqc.html"

rule fastqc:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/{sample}_fastqc.html",
        "results/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o results/"
What’s happening here?
- Rule all: This is a special rule that defines what we want as the final output
- Rule fastqc: Does the actual work - runs FastQC on the input files
- Wildcards ({sample}): Allow the rule to work with multiple samples
- Dependencies: Snakemake automatically figures out that to produce the files requested by all, it needs to run the fastqc rule
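One more trick before running it: rather than listing every target by hand, you can generate the list with Snakemake's expand() helper - this is equivalent to the rule all above:

SAMPLES = ["sample1", "sample2"]

rule all:
    input:
        expand("results/{sample}_fastqc.html", sample=SAMPLES)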
Run it:
# Create directories; place your FASTQ files (e.g. data/sample1.fastq.gz) in data/
mkdir -p data results
# Run the workflow (dry run first to see what would happen)
snakemake -n
# Actually run it
snakemake --cores 2
Building a Real RNA-seq Pipeline
Now let’s build something more realistic - a complete RNA-seq analysis pipeline:
# RNA-seq analysis pipeline
SAMPLES = ["sample1", "sample2", "sample3"]

rule all:
    input:
        "results/multiqc_report.html",
        expand("results/{sample}_counts.txt", sample=SAMPLES)

# Quality control of raw reads
rule fastqc_raw:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/fastqc_raw/{sample}_fastqc.html",
        "results/fastqc_raw/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o results/fastqc_raw/"
# Trim adapters and low-quality bases
rule trim_reads:
    input:
        "data/{sample}.fastq.gz"
    output:
        trimmed="results/trimmed/{sample}_trimmed.fastq.gz",
        report="results/trimmed/{sample}_trimming_report.txt"
    shell:
        # Trim Galore names its own outputs ({sample}_trimmed.fq.gz and an
        # input-named trimming report), so rename them to match the declared outputs
        "trim_galore --gzip --output_dir results/trimmed/ "
        "--basename {wildcards.sample} {input} && "
        "mv results/trimmed/{wildcards.sample}_trimmed.fq.gz {output.trimmed} && "
        "mv results/trimmed/{wildcards.sample}.fastq.gz_trimming_report.txt {output.report}"
# Quality control of trimmed reads
rule fastqc_trimmed:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/fastqc_trimmed/{sample}_trimmed_fastqc.html",
        "results/fastqc_trimmed/{sample}_trimmed_fastqc.zip"
    shell:
        "fastqc {input} -o results/fastqc_trimmed/"
# Align reads to reference genome
rule align_reads:
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz",
        # HISAT2 indexes are split across several .ht2 files, so track the
        # first one as input and pass the prefix via params
        index="reference/genome_index.1.ht2"
    output:
        "results/aligned/{sample}.bam"
    params:
        index_prefix="reference/genome_index"
    threads: 4
    shell:
        "hisat2 -x {params.index_prefix} -U {input.reads} -p {threads} | "
        "samtools sort -@ {threads} -o {output}"
# Count reads mapping to genes
rule count_reads:
    input:
        bam="results/aligned/{sample}.bam",
        annotation="reference/genes.gtf"
    output:
        "results/{sample}_counts.txt"
    shell:
        "featureCounts -a {input.annotation} -o {output} {input.bam}"

# Generate summary report
rule multiqc:
    input:
        expand("results/fastqc_raw/{sample}_fastqc.zip", sample=SAMPLES),
        expand("results/fastqc_trimmed/{sample}_trimmed_fastqc.zip", sample=SAMPLES),
        expand("results/{sample}_counts.txt", sample=SAMPLES)
    output:
        "results/multiqc_report.html"
    shell:
        "multiqc results/ -o results/"
Advanced Features
1. Using Conda Environments
Snakemake can automatically manage software dependencies:
rule align_reads:
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    conda:
        "envs/alignment.yaml"  # Conda environment file
    shell:
        "hisat2 -x reference/genome -U {input.reads} | samtools sort -o {output}"
Create envs/alignment.yaml:
channels:
  - conda-forge
  - bioconda
dependencies:
  - hisat2=2.2.1
  - samtools=1.15
2. Configuration Files
Keep parameters separate from code:
config.yaml:
samples:
  - sample1
  - sample2
  - sample3

reference:
  genome: 'reference/genome.fa'
  annotation: 'reference/genes.gtf'

params:
  trim_galore:
    quality: 20
    min_length: 50
  hisat2:
    threads: 8
Snakefile:
configfile: "config.yaml"

SAMPLES = config["samples"]

rule trim_reads:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    params:
        quality=config["params"]["trim_galore"]["quality"],
        min_len=config["params"]["trim_galore"]["min_length"]
    shell:
        "trim_galore --quality {params.quality} "
        "--length {params.min_len} --gzip {input}"
3. Resource Management
Specify computing resources for each rule:
rule align_reads:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    threads: 8
    resources:
        mem_mb=16000,  # 16 GB memory
        runtime=120    # 2 hours max runtime
    shell:
        "hisat2 -p {threads} -x reference/genome -U {input} | "
        "samtools sort -@ {threads} -m 2G -o {output}"
4. Error Handling and Logging
rule align_reads:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    log:
        "logs/{sample}_alignment.log"
    shell:
        "hisat2 -x reference/genome -U {input} 2> {log} | "
        "samtools sort -o {output} 2>> {log}"
Running Your Workflow
Basic Execution
# Dry run to check workflow
snakemake -n
# Run with 4 cores
snakemake --cores 4
# Run specific rule
snakemake --cores 4 multiqc
# Force rerun of a specific rule (and everything downstream of it)
snakemake --cores 4 --forcerun trim_reads
With Conda Integration
# Install conda environments automatically
snakemake --cores 4 --use-conda
# Create environments but don't run workflow
snakemake --cores 4 --use-conda --conda-create-envs-only
On Computing Clusters
# Submit jobs to a SLURM cluster (up to 100 jobs in flight)
snakemake --jobs 100 --cluster "sbatch -p cluster -t 60"
# Using cluster profiles (more advanced; note that Snakemake 8+ replaces
# --cluster with executor plugins such as snakemake-executor-plugin-slurm)
snakemake --profile slurm --jobs 100
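A profile is just a directory (e.g. ~/.config/snakemake/slurm/) containing a config.yaml that stores the command-line options you would otherwise retype every run. A minimal sketch matching the older --cluster style above (the sbatch arguments are placeholders for your cluster):

# ~/.config/snakemake/slurm/config.yaml
jobs: 100
cluster: "sbatch -p cluster -t 60 --mem={resources.mem_mb}"
use-conda: true
printshellcmds: true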
Best Practices
1. Project Structure
project/
├── Snakefile
├── config.yaml
├── data/
│   ├── sample1.fastq.gz
│   └── sample2.fastq.gz
├── results/
├── logs/
├── envs/
│   ├── qc.yaml
│   └── alignment.yaml
└── scripts/
    └── custom_analysis.py
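The scripts/ directory pairs naturally with Snakemake's script: directive, which exposes a snakemake object (with input, output, params, and so on) inside the script. A sketch using the placeholder script from the tree above; the output name is illustrative:

rule custom_analysis:
    input:
        "results/sample1_counts.txt"
    output:
        "results/analysis_summary.txt"
    script:
        # Inside the script, paths are available as snakemake.input[0]
        # and snakemake.output[0] - no argument parsing needed
        "scripts/custom_analysis.py"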
2. Modular Workflows
Split complex workflows into modules:
Snakefile:
include: "rules/quality_control.smk"
include: "rules/alignment.smk"
include: "rules/quantification.smk"
rule all:
    input:
        "results/final_report.html"
3. Testing Your Workflow
# Test with subset of data
snakemake --cores 2 --config samples='["sample1"]'
# Validate workflow syntax
snakemake --lint
# Create workflow diagram
snakemake --dag | dot -Tpng > workflow.png
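For workflows with many samples the full job DAG gets crowded; --rulegraph collapses it to one node per rule instead of one per job:

# One node per rule instead of per job
snakemake --rulegraph | dot -Tpng > rulegraph.png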
4. Documentation
Always document your rules:
rule align_reads:
    """
    Align trimmed reads to the reference genome using HISAT2.

    This rule aligns single-end RNA-seq reads to the reference genome
    and outputs coordinate-sorted BAM files for downstream analysis.
    """
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz",
        index="reference/genome_index.1.ht2"  # first file of the HISAT2 index
    output:
        "results/aligned/{sample}.bam"
    log:
        "logs/{sample}_alignment.log"
    params:
        index_prefix="reference/genome_index"
    threads: 8
    shell:
        "hisat2 -x {params.index_prefix} -U {input.reads} -p {threads} "
        "--summary-file {log} | samtools sort -@ {threads} -o {output}"
Troubleshooting Common Issues
1. Missing Input Files
Error: MissingInputException
Solution: Check file paths and make sure input files exist
2. Circular Dependencies
Error: WorkflowError: Circular dependency
Solution: Review your rule dependencies - output of rule A shouldn’t depend on output of rule B if B depends on A
3. Wildcard Constraints
Error: Ambiguous wildcards
Solution: Add constraints to wildcards:
rule process_samples:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/{sample}_processed.txt"
    wildcard_constraints:
        sample="[A-Za-z0-9]+"  # Only alphanumeric characters
4. Resource Issues
Error: Jobs killed due to memory/time limits
Solution: Increase resources or optimize your tools:
rule memory_intensive:
    input: "input.txt"
    output: "output.txt"
    resources:
        mem_mb=32000,  # 32 GB
        runtime=600    # 10 hours
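In newer Snakemake versions you can also scale resources automatically on retry: a resource may be a callable that receives the attempt number, paired with --retries (older releases call it --restart-times). A sketch of a variant of the rule above:

rule memory_intensive:
    input: "input.txt"
    output: "output.txt"
    resources:
        # 16 GB on the first attempt, 32 GB on the second, and so on;
        # run with: snakemake --cores 4 --retries 3
        mem_mb=lambda wildcards, attempt: 16000 * attempt
    shell: "sort -S {resources.mem_mb}M {input} > {output}"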
Real-World Example: Complete Setup
Here’s a complete, production-ready RNA-seq workflow:
Directory structure:
rnaseq_workflow/
├── Snakefile
├── config.yaml
├── cluster_config.yaml
├── data/
├── results/
├── logs/
├── envs/
│   ├── qc.yaml
│   ├── trim.yaml
│   ├── align.yaml
│   └── count.yaml
└── scripts/
    └── generate_report.R
config.yaml:
samples:
  sample1: data/sample1_R1.fastq.gz
  sample2: data/sample2_R1.fastq.gz
  sample3: data/sample3_R1.fastq.gz

reference:
  genome: reference/genome.fa
  annotation: reference/annotation.gtf
  index: reference/hisat2_index

params:
  trimgalore:
    quality: 20
    min_length: 20
  hisat2:
    extra_params: '--rna-strandness U'
  featurecounts:
    extra_params: '-t exon -g gene_id'
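Note that samples is now a mapping from sample name to FASTQ path, so rules look up each path with an input function instead of a hard-coded pattern. A sketch of the corresponding Snakefile pieces (the rename at the end follows the Trim Galore naming note from the pipeline above):

configfile: "config.yaml"

# Sample names for expand(); FASTQ paths come from the mapping
SAMPLES = list(config["samples"])

rule trim_reads:
    input:
        # Input function: resolve the wildcard to its path from config.yaml
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    params:
        quality=config["params"]["trimgalore"]["quality"],
        min_length=config["params"]["trimgalore"]["min_length"]
    shell:
        "trim_galore --quality {params.quality} --length {params.min_length} "
        "--gzip --output_dir results/trimmed/ --basename {wildcards.sample} {input} && "
        "mv results/trimmed/{wildcards.sample}_trimmed.fq.gz {output}"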
Run command:
snakemake --cores 8 --use-conda --keep-going --printshellcmds
Next Steps
Once you’re comfortable with basic Snakemake:
- Learn cluster execution for large-scale analyses
- Explore Snakemake wrappers - pre-built rules for common tools
- Try workflow catalogs - the Snakemake Workflow Catalog offers ready-made pipelines, and nf-core does the same for Nextflow with similar concepts
- Implement continuous integration to test your workflows automatically
- Share your workflows on GitHub with proper documentation
Conclusion
Snakemake transforms chaotic shell scripts into organized, reproducible workflows. The initial learning curve is worth it - you’ll save countless hours debugging broken pipelines and your collaborators will thank you for sharing reproducible analyses.
Start small, build incrementally, and don’t try to implement every advanced feature at once. Even a basic Snakemake workflow is infinitely better than a collection of shell scripts held together with hope and prayer.
Want to see more bioinformatics workflow examples? Check out my GitHub repositories or reach out with questions at tamoghnadas.12@gmail.com.
Continue Learning
Related Content
Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
RNA-Seq Analysis Pipeline
A complete Snakemake pipeline for RNA-Seq data analysis from raw FASTQ files to differential expression results with MultiQC reporting.
From Excel Spreadsheets to Reproducible Pipelines: A Researcher's Journey
How I transformed my data analysis workflow from manual Excel processes to automated, reproducible computational pipelines - and why you should too.
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
Use This Template

git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run