Snakemake for Beginners: Your First Bioinformatics Pipeline
Learn how to build reproducible bioinformatics workflows with Snakemake. A step-by-step guide from basic concepts to a complete RNA-seq analysis pipeline.
Why Snakemake Will Change Your Research Life
If you’re doing bioinformatics analysis, you’ve probably found yourself in this situation: You have a series of tools that need to be run in a specific order, with the output of one tool feeding into the next. You write a bash script, it works great… until you need to rerun part of the analysis, or add a new sample, or share it with a colleague who has a different computer setup.
Enter Snakemake - a workflow management system that makes your bioinformatics pipelines reproducible, scalable, and maintainable.
What is Snakemake?
Snakemake is like Make (the build system) but designed specifically for data analysis workflows. It:
- Tracks dependencies between files and automatically determines what needs to be rerun
- Parallelizes jobs automatically when possible
- Handles errors gracefully and allows easy restarts
- Integrates with conda environments and containers
- Scales from laptop to high-performance computing clusters
Think of it as a smart project manager for your data analysis.
Installing Snakemake
First, let’s get Snakemake installed. I recommend using conda:
# Create a new environment for workflow management
conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake
# Verify installation
snakemake --help
Basic Concepts: Rules and Files
Snakemake workflows are defined by rules. Each rule describes:
- What output files it produces
- What input files it needs
- What commands to run to transform inputs into outputs
Here’s the basic syntax:
rule rule_name:
    input:
        "input_file.txt"
    output:
        "output_file.txt"
    shell:
        "command_to_transform_input_to_output"
Your First Snakemake Workflow
Let’s start with a simple example: quality control of FASTQ files.
Create a file called Snakefile:
# Simple quality control workflow
rule all:
    input:
        "results/sample1_fastqc.html",
        "results/sample2_fastqc.html"

rule fastqc:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/{sample}_fastqc.html",
        "results/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o results/"
What’s happening here?
- Rule all: This is a special rule that defines what we want as the final output
- Rule fastqc: Does the actual work - runs FastQC on the input files
- Wildcards ({sample}): Allow the rule to work with multiple samples
- Dependencies: Snakemake automatically figures out that to produce the files requested by all, it needs to run the fastqc rule
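One more trick before running it: rather than listing every target by hand, you can generate the list with Snakemake's expand() helper - this is equivalent to the rule all above:

SAMPLES = ["sample1", "sample2"]

rule all:
    input:
        expand("results/{sample}_fastqc.html", sample=SAMPLES)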
Run it:
# Create directories; place your FASTQ files (e.g. data/sample1.fastq.gz) in data/
mkdir -p data results
# Run the workflow (dry run first to see what would happen)
snakemake -n
# Actually run it
snakemake --cores 2
Building a Real RNA-seq Pipeline
Now let’s build something more realistic - a complete RNA-seq analysis pipeline:
# RNA-seq analysis pipeline
SAMPLES = ["sample1", "sample2", "sample3"]

rule all:
    input:
        "results/multiqc_report.html",
        expand("results/{sample}_counts.txt", sample=SAMPLES)

# Quality control of raw reads
rule fastqc_raw:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/fastqc_raw/{sample}_fastqc.html",
        "results/fastqc_raw/{sample}_fastqc.zip"
    shell:
        "fastqc {input} -o results/fastqc_raw/"
# Trim adapters and low-quality bases
rule trim_reads:
    input:
        "data/{sample}.fastq.gz"
    output:
        trimmed="results/trimmed/{sample}_trimmed.fastq.gz",
        report="results/trimmed/{sample}_trimming_report.txt"
    shell:
        # Trim Galore names its own outputs ({sample}_trimmed.fq.gz and an
        # input-named trimming report), so rename them to match the declared outputs
        "trim_galore --gzip --output_dir results/trimmed/ "
        "--basename {wildcards.sample} {input} && "
        "mv results/trimmed/{wildcards.sample}_trimmed.fq.gz {output.trimmed} && "
        "mv results/trimmed/{wildcards.sample}.fastq.gz_trimming_report.txt {output.report}"
# Quality control of trimmed reads
rule fastqc_trimmed:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/fastqc_trimmed/{sample}_trimmed_fastqc.html",
        "results/fastqc_trimmed/{sample}_trimmed_fastqc.zip"
    shell:
        "fastqc {input} -o results/fastqc_trimmed/"
# Align reads to reference genome
rule align_reads:
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz",
        # HISAT2 indexes are split across several .ht2 files, so track the
        # first one as input and pass the prefix via params
        index="reference/genome_index.1.ht2"
    output:
        "results/aligned/{sample}.bam"
    params:
        index_prefix="reference/genome_index"
    threads: 4
    shell:
        "hisat2 -x {params.index_prefix} -U {input.reads} -p {threads} | "
        "samtools sort -@ {threads} -o {output}"
# Count reads mapping to genes
rule count_reads:
    input:
        bam="results/aligned/{sample}.bam",
        annotation="reference/genes.gtf"
    output:
        "results/{sample}_counts.txt"
    shell:
        "featureCounts -a {input.annotation} -o {output} {input.bam}"

# Generate summary report
rule multiqc:
    input:
        expand("results/fastqc_raw/{sample}_fastqc.zip", sample=SAMPLES),
        expand("results/fastqc_trimmed/{sample}_trimmed_fastqc.zip", sample=SAMPLES),
        expand("results/{sample}_counts.txt", sample=SAMPLES)
    output:
        "results/multiqc_report.html"
    shell:
        "multiqc results/ -o results/"
Advanced Features
1. Using Conda Environments
Snakemake can automatically manage software dependencies:
rule align_reads:
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    conda:
        "envs/alignment.yaml"  # Conda environment file
    shell:
        "hisat2 -x reference/genome -U {input.reads} | samtools sort -o {output}"
Create envs/alignment.yaml:
channels:
  - conda-forge
  - bioconda
dependencies:
  - hisat2=2.2.1
  - samtools=1.15
2. Configuration Files
Keep parameters separate from code:
config.yaml:
samples:
  - sample1
  - sample2
  - sample3

reference:
  genome: 'reference/genome.fa'
  annotation: 'reference/genes.gtf'

params:
  trim_galore:
    quality: 20
    min_length: 50
  hisat2:
    threads: 8
Snakefile:
configfile: "config.yaml"

SAMPLES = config["samples"]

rule trim_reads:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    params:
        quality=config["params"]["trim_galore"]["quality"],
        min_len=config["params"]["trim_galore"]["min_length"]
    shell:
        "trim_galore --quality {params.quality} "
        "--length {params.min_len} --gzip {input}"
3. Resource Management
Specify computing resources for each rule:
rule align_reads:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    threads: 8
    resources:
        mem_mb=16000,  # 16 GB memory
        runtime=120    # 2 hours max runtime
    shell:
        "hisat2 -p {threads} -x reference/genome -U {input} | "
        "samtools sort -@ {threads} -m 2G -o {output}"
4. Error Handling and Logging
rule align_reads:
    input:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    output:
        "results/aligned/{sample}.bam"
    log:
        "logs/{sample}_alignment.log"
    shell:
        "hisat2 -x reference/genome -U {input} 2> {log} | "
        "samtools sort -o {output} 2>> {log}"
Running Your Workflow
Basic Execution
# Dry run to check workflow
snakemake -n
# Run with 4 cores
snakemake --cores 4
# Run specific rule
snakemake --cores 4 multiqc
# Force rerun of a specific rule (and everything downstream of it)
snakemake --cores 4 --forcerun trim_reads
With Conda Integration
# Install conda environments automatically
snakemake --cores 4 --use-conda
# Create environments but don't run workflow
snakemake --cores 4 --use-conda --conda-create-envs-only
On Computing Clusters
# Submit jobs to a SLURM cluster (up to 100 jobs in flight)
snakemake --jobs 100 --cluster "sbatch -p cluster -t 60"
# Using cluster profiles (more advanced; note that Snakemake 8+ replaces
# --cluster with executor plugins such as snakemake-executor-plugin-slurm)
snakemake --profile slurm --jobs 100
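A profile is just a directory (e.g. ~/.config/snakemake/slurm/) containing a config.yaml that stores the command-line options you would otherwise retype every run. A minimal sketch matching the older --cluster style above (the sbatch arguments are placeholders for your cluster):

# ~/.config/snakemake/slurm/config.yaml
jobs: 100
cluster: "sbatch -p cluster -t 60 --mem={resources.mem_mb}"
use-conda: true
printshellcmds: true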
Best Practices
1. Project Structure
project/
├── Snakefile
├── config.yaml
├── data/
│   ├── sample1.fastq.gz
│   └── sample2.fastq.gz
├── results/
├── logs/
├── envs/
│   ├── qc.yaml
│   └── alignment.yaml
└── scripts/
    └── custom_analysis.py
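The scripts/ directory pairs naturally with Snakemake's script: directive, which exposes a snakemake object (with input, output, params, and so on) inside the script. A sketch using the placeholder script from the tree above; the output name is illustrative:

rule custom_analysis:
    input:
        "results/sample1_counts.txt"
    output:
        "results/analysis_summary.txt"
    script:
        # Inside the script, paths are available as snakemake.input[0]
        # and snakemake.output[0] - no argument parsing needed
        "scripts/custom_analysis.py"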
2. Modular Workflows
Split complex workflows into modules:
Snakefile:
include: "rules/quality_control.smk"
include: "rules/alignment.smk"
include: "rules/quantification.smk"
rule all:
    input:
        "results/final_report.html"
3. Testing Your Workflow
# Test with subset of data
snakemake --cores 2 --config samples='["sample1"]'
# Validate workflow syntax
snakemake --lint
# Create workflow diagram
snakemake --dag | dot -Tpng > workflow.png
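For workflows with many samples the full job DAG gets crowded; --rulegraph collapses it to one node per rule instead of one per job:

# One node per rule instead of per job
snakemake --rulegraph | dot -Tpng > rulegraph.png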
4. Documentation
Always document your rules:
rule align_reads:
    """
    Align trimmed reads to the reference genome using HISAT2.

    This rule aligns single-end RNA-seq reads to the reference genome
    and outputs coordinate-sorted BAM files for downstream analysis.
    """
    input:
        reads="results/trimmed/{sample}_trimmed.fastq.gz",
        index="reference/genome_index.1.ht2"  # first file of the HISAT2 index
    output:
        "results/aligned/{sample}.bam"
    log:
        "logs/{sample}_alignment.log"
    params:
        index_prefix="reference/genome_index"
    threads: 8
    shell:
        "hisat2 -x {params.index_prefix} -U {input.reads} -p {threads} "
        "--summary-file {log} | samtools sort -@ {threads} -o {output}"
Troubleshooting Common Issues
1. Missing Input Files
Error: MissingInputException
Solution: Check file paths and make sure input files exist
2. Circular Dependencies
Error: WorkflowError: Circular dependency
Solution: Review your rule dependencies - output of rule A shouldn’t depend on output of rule B if B depends on A
3. Wildcard Constraints
Error: Ambiguous wildcards
Solution: Add constraints to wildcards:
rule process_samples:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/{sample}_processed.txt"
    wildcard_constraints:
        sample="[A-Za-z0-9]+"  # Only alphanumeric characters
4. Resource Issues
Error: Jobs killed due to memory/time limits
Solution: Increase resources or optimize your tools:
rule memory_intensive:
    input: "input.txt"
    output: "output.txt"
    resources:
        mem_mb=32000,  # 32 GB
        runtime=600    # 10 hours
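In newer Snakemake versions you can also scale resources automatically on retry: a resource may be a callable that receives the attempt number, paired with --retries (older releases call it --restart-times). A sketch of a variant of the rule above:

rule memory_intensive:
    input: "input.txt"
    output: "output.txt"
    resources:
        # 16 GB on the first attempt, 32 GB on the second, and so on;
        # run with: snakemake --cores 4 --retries 3
        mem_mb=lambda wildcards, attempt: 16000 * attempt
    shell: "sort -S {resources.mem_mb}M {input} > {output}"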
Real-World Example: Complete Setup
Here’s a complete, production-ready RNA-seq workflow:
Directory structure:
rnaseq_workflow/
├── Snakefile
├── config.yaml
├── cluster_config.yaml
├── data/
├── results/
├── logs/
├── envs/
│   ├── qc.yaml
│   ├── trim.yaml
│   ├── align.yaml
│   └── count.yaml
└── scripts/
    └── generate_report.R
config.yaml:
samples:
  sample1: data/sample1_R1.fastq.gz
  sample2: data/sample2_R1.fastq.gz
  sample3: data/sample3_R1.fastq.gz

reference:
  genome: reference/genome.fa
  annotation: reference/annotation.gtf
  index: reference/hisat2_index

params:
  trimgalore:
    quality: 20
    min_length: 20
  hisat2:
    extra_params: '--rna-strandness U'
  featurecounts:
    extra_params: '-t exon -g gene_id'
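Note that samples is now a mapping from sample name to FASTQ path, so rules look up each path with an input function instead of a hard-coded pattern. A sketch of the corresponding Snakefile pieces (the rename at the end follows the Trim Galore naming note from the pipeline above):

configfile: "config.yaml"

# Sample names for expand(); FASTQ paths come from the mapping
SAMPLES = list(config["samples"])

rule trim_reads:
    input:
        # Input function: resolve the wildcard to its path from config.yaml
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        "results/trimmed/{sample}_trimmed.fastq.gz"
    params:
        quality=config["params"]["trimgalore"]["quality"],
        min_length=config["params"]["trimgalore"]["min_length"]
    shell:
        "trim_galore --quality {params.quality} --length {params.min_length} "
        "--gzip --output_dir results/trimmed/ --basename {wildcards.sample} {input} && "
        "mv results/trimmed/{wildcards.sample}_trimmed.fq.gz {output}"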
Run command:
snakemake --cores 8 --use-conda --keep-going --printshellcmds
Next Steps
Once you’re comfortable with basic Snakemake:
- Learn cluster execution for large-scale analyses
- Explore Snakemake wrappers - pre-built rules for common tools
- Try workflow catalogs - the Snakemake Workflow Catalog offers ready-made pipelines, and nf-core does the same for Nextflow with similar concepts
- Implement continuous integration to test your workflows automatically
- Share your workflows on GitHub with proper documentation
Conclusion
Snakemake transforms chaotic shell scripts into organized, reproducible workflows. The initial learning curve is worth it - you’ll save countless hours debugging broken pipelines and your collaborators will thank you for sharing reproducible analyses.
Start small, build incrementally, and don’t try to implement every advanced feature at once. Even a basic Snakemake workflow is infinitely better than a collection of shell scripts held together with hope and prayer.
Want to see more bioinformatics workflow examples? Check out my GitHub repositories or reach out with questions at tamoghnadas.12@gmail.com.
Continue Learning
Related Content
Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
RNA-Seq Analysis Pipeline
A complete Snakemake pipeline for RNA-Seq data analysis from raw FASTQ files to differential expression results with MultiQC reporting.
From Excel Spreadsheets to Reproducible Pipelines: A Researcher's Journey
How I transformed my data analysis workflow from manual Excel processes to automated, reproducible computational pipelines - and why you should too.
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
Use This Template

git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run