Variant Calling Pipeline

A robust Snakemake pipeline for germline and somatic variant calling from whole-genome and whole-exome sequencing data.

January 20, 2025

#variant-calling #snakemake #wgs #wes #gatk #bcftools #genomics

A production-ready Snakemake pipeline for comprehensive variant calling from whole-genome and whole-exome sequencing data, supporting both germline and somatic variant detection.

Overview

This pipeline implements best-practice workflows for variant calling using GATK and other standard tools. It handles both whole-genome and whole-exome data, and its configuration file selects the callers and switches between germline, somatic, and joint-calling runs.

Key Features

Multiple Variant Callers: GATK HaplotypeCaller, FreeBayes, bcftools
Germline and Somatic Calling: Support for both variant types
Quality Control: Comprehensive QC at each step
Annotation: Variant effect prediction with SnpEff
Filtering: Hard and soft filtering options
Joint Calling: Multi-sample variant calling support
Reproducible: Conda environments and container support

Pipeline Workflow

1. Preprocessing

Adapter trimming with Trim Galore!
Read alignment with BWA-MEM
BAM file sorting and indexing
Duplicate marking with Picard
Base quality score recalibration (BQSR)
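
The preprocessing steps above can be sketched as an ordered list of shell commands for one paired-end sample. This is a rough illustration, not the pipeline's actual rules: the output paths, the read-group values, and the function name are assumptions, and the real workflow may pass different options to each tool.

```python
import shlex
from typing import List


def preprocessing_commands(sample: str, r1: str, r2: str, ref: str,
                           known_sites: List[str], threads: int = 8) -> List[str]:
    """Return shell commands for one paired-end sample, in pipeline order."""
    # GATK needs a read group; these ID/SM/PL values are placeholders.
    read_group = shlex.quote(f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA")
    # With --basename, Trim Galore names paired outputs <name>_val_1/2.fq.gz
    # (gzipped input assumed).
    t1 = f"trimmed/{sample}_val_1.fq.gz"
    t2 = f"trimmed/{sample}_val_2.fq.gz"
    # All paths below are hypothetical, chosen to match the results/ layout.
    sorted_bam = f"results/alignments/{sample}.sorted.bam"
    dedup_bam = f"results/alignments/{sample}.dedup.bam"
    recal_table = f"results/metrics/{sample}.recal.table"
    recal_bam = f"results/alignments/{sample}.recal.bam"
    known = " ".join(f"--known-sites {vcf}" for vcf in known_sites)
    return [
        f"trim_galore --paired --basename {sample} -o trimmed {r1} {r2}",
        f"bwa mem -t {threads} -R {read_group} {ref} {t1} {t2}"
        f" | samtools sort -@ {threads} -o {sorted_bam} -",
        f"samtools index {sorted_bam}",
        f"picard MarkDuplicates I={sorted_bam} O={dedup_bam}"
        f" M=results/metrics/{sample}.dup_metrics.txt",
        f"samtools index {dedup_bam}",
        f"gatk BaseRecalibrator -R {ref} -I {dedup_bam} {known} -O {recal_table}",
        f"gatk ApplyBQSR -R {ref} -I {dedup_bam}"
        f" --bqsr-recal-file {recal_table} -O {recal_bam}",
    ]


if __name__ == "__main__":
    for cmd in preprocessing_commands(
        "sampleA", "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz",
        "references/hg38.fasta",
        ["references/dbsnp_138.vcf",
         "references/Mills_and_1000G_gold_standard.indels.vcf"],
    ):
        print(cmd)
```

Printing the commands rather than running them makes it easy to check the order of the steps before wiring them into Snakemake rules.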

2. Variant Calling

Germline Calling: GATK HaplotypeCaller per sample
Joint Genotyping: GenotypeGVCFs for cohort analysis
Somatic Calling: Mutect2 for tumor-normal pairs
Alternative Callers: FreeBayes, bcftools
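
As a sketch of how the germline and somatic branches differ, the helpers below build GATK4 command lines for per-sample GVCF calling, joint genotyping, and a tumor-normal Mutect2 call. The output paths and function names are hypothetical, and combining GVCFs with CombineGVCFs is an assumption (the pipeline may use GenomicsDBImport or another route).

```python
from typing import Dict, List


def germline_commands(bams: Dict[str, str], ref: str, joint: bool = True) -> List[str]:
    """Per-sample HaplotypeCaller in GVCF mode, then genotyping."""
    cmds: List[str] = []
    gvcfs: List[str] = []
    for sample, bam in sorted(bams.items()):
        gvcf = f"results/variants/{sample}.g.vcf.gz"  # hypothetical path
        cmds.append(f"gatk HaplotypeCaller -R {ref} -I {bam} -O {gvcf} -ERC GVCF")
        gvcfs.append(gvcf)
    if joint:
        # Assumption: GVCFs are merged with CombineGVCFs before genotyping.
        inputs = " ".join(f"-V {g}" for g in gvcfs)
        cohort = "results/variants/cohort.g.vcf.gz"
        cmds.append(f"gatk CombineGVCFs -R {ref} {inputs} -O {cohort}")
        cmds.append(f"gatk GenotypeGVCFs -R {ref} -V {cohort}"
                    " -O results/variants/cohort.vcf.gz")
    else:
        # Without joint calling, each GVCF is genotyped on its own.
        for gvcf in gvcfs:
            vcf = gvcf.replace(".g.vcf.gz", ".vcf.gz")
            cmds.append(f"gatk GenotypeGVCFs -R {ref} -V {gvcf} -O {vcf}")
    return cmds


def somatic_command(tumor_bam: str, normal_bam: str, normal_name: str,
                    ref: str, pair_id: str) -> str:
    """Mutect2 call for one tumor-normal pair (raw calls, not yet filtered)."""
    return (f"gatk Mutect2 -R {ref} -I {tumor_bam} -I {normal_bam}"
            f" -normal {normal_name} -O results/variants/{pair_id}.somatic.vcf.gz")
```

The value passed to -normal is the sample name from the normal BAM's read group, not a file name.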

3. Variant Processing

Variant quality score recalibration (VQSR)
Variant filtering and annotation
Variant effect prediction with SnpEff
Population frequency annotation
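
When VQSR is not applicable (for example, with few samples), hard filtering is the usual fallback. The sketch below applies GATK's commonly cited generic starting thresholds for SNPs to a dictionary of INFO annotations. Whether this pipeline uses these exact thresholds is an assumption; treat them as a starting point to tune, and note that the filter labels are made up for illustration.

```python
from typing import Dict, List, Tuple

# (annotation, comparison, threshold, filter label).
# Thresholds follow GATK's generic SNP hard-filtering guidance;
# the labels are illustrative.
SNP_HARD_FILTERS: List[Tuple[str, str, float, str]] = [
    ("QD", "<", 2.0, "QD2"),
    ("FS", ">", 60.0, "FS60"),
    ("MQ", "<", 40.0, "MQ40"),
    ("SOR", ">", 3.0, "SOR3"),
    ("MQRankSum", "<", -12.5, "MQRankSum-12.5"),
    ("ReadPosRankSum", "<", -8.0, "ReadPosRankSum-8"),
]


def failed_filters(info: Dict[str, float]) -> List[str]:
    """Return the labels of all hard filters that a SNP record fails."""
    failed = []
    for key, op, threshold, label in SNP_HARD_FILTERS:
        if key not in info:
            continue  # a missing annotation does not fail the record
        value = info[key]
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(label)
    return failed
```

A record with an empty result would be marked PASS; any returned labels would go into the VCF FILTER column.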

4. Quality Control

Alignment statistics with Samtools
Variant calling metrics
Concordance analysis
MultiQC reports
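
One simple form of concordance analysis is site-level overlap between two call sets, for instance GATK versus FreeBayes calls on the same sample. The following is a minimal sketch under that assumption; the pipeline's own concordance step may instead compare genotypes or use a dedicated tool.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def concordance(calls_a: Set[Variant], calls_b: Set[Variant]) -> Dict[str, float]:
    """Summarise site-level agreement between two sets of variant calls."""
    shared = calls_a & calls_b
    union = calls_a | calls_b
    return {
        "shared": len(shared),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
        # Jaccard index: shared sites over all sites; two empty sets agree fully.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }
```

Variants should be normalized (left-aligned, multiallelics split) before comparison, otherwise the same event can appear under different keys.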

Installation

# Clone the repository
git clone https://github.com/tamoghnadas12/variant-calling-pipeline
cd variant-calling-pipeline

# Install dependencies
conda env create -f environment.yml
conda activate variant-calling

# Configure reference data
bash scripts/download_references.sh

Configuration

The pipeline is configured through:

config/config.yaml: Main configuration
config/samples.tsv: Sample metadata
config/units.tsv: Sequencing unit information

Example configuration:

# config/config.yaml
ref:
  genome: 'references/hg38.fasta'
  dbsnp: 'references/dbsnp_138.vcf'
  mills: 'references/Mills_and_1000G_gold_standard.indels.vcf'

calling:
  callers: ['gatk', 'freebayes']
  joint_calling: true
  somatic: false

processing:
  bqsr: true
  vqsr: true
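
To illustrate how settings like these typically drive a Snakemake workflow, the sketch below turns the calling section into a list of requested output files. The function name and every path are hypothetical; the real Snakefile may name its targets differently.

```python
from typing import Dict, List


def requested_targets(config: Dict, samples: List[str]) -> List[str]:
    """Map the 'calling' settings to the VCF files a run would request."""
    calling = config["calling"]
    targets: List[str] = []
    for caller in calling["callers"]:
        if caller == "gatk" and calling["joint_calling"]:
            # Joint genotyping yields one cohort-level VCF (hypothetical path).
            targets.append("results/variants/gatk/joint.vcf.gz")
        else:
            targets.extend(f"results/variants/{caller}/{s}.vcf.gz" for s in samples)
    return targets


if __name__ == "__main__":
    # Mirrors the example config above as a plain dict.
    example = {"calling": {"callers": ["gatk", "freebayes"],
                           "joint_calling": True, "somatic": False}}
    print(requested_targets(example, ["sampleA", "sampleB"]))
```

In a Snakefile, a list like this would usually feed the input of the top-level rule, so changing the config changes what gets built.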

Usage

Germline Variant Calling

# Run germline pipeline
snakemake --cores 16 --use-conda --config calling/somatic=false

# Joint genotyping for cohort
snakemake --cores 16 --use-conda --config calling/joint_calling=true
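
Nested settings such as joint_calling can also be changed through a small extra config file, which Snakemake merges over the main one when several files are passed to --configfile. The sketch below writes such a file as JSON and builds the matching command; the file name overrides.json and the helper name are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_override(path: Path, overrides: Dict) -> List[str]:
    """Write an override config and return a snakemake command that applies it."""
    path.write_text(json.dumps(overrides, indent=2))
    # Later config files take precedence over earlier ones.
    return ["snakemake", "--cores", "16", "--use-conda",
            "--configfile", "config/config.yaml", str(path)]


if __name__ == "__main__":
    cmd = write_override(Path("overrides.json"),
                         {"calling": {"joint_calling": True, "somatic": False}})
    print(" ".join(cmd))
```

Keeping overrides in a file also leaves a record of the exact settings used for a run.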

Somatic Variant Calling

# Run somatic pipeline
snakemake --cores 16 --use-conda --config calling/somatic=true

# Tumor-normal pairs specified in samples.tsv
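
The layout of samples.tsv is not shown here, so the following is only a guess at how tumor-normal pairs might be read from it. The column names sample, type, and matched_normal are hypothetical; check the repository's own samples.tsv for the real ones.

```python
import csv
import io
from typing import List, Tuple


def tumor_normal_pairs(tsv_text: str) -> List[Tuple[str, str]]:
    """Return (tumor, normal) sample pairs from a tab-separated sample sheet."""
    # Assumed columns: sample, type (tumor|normal), matched_normal.
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    normals = {row["sample"] for row in rows if row["type"] == "normal"}
    pairs = []
    for row in rows:
        if row["type"] != "tumor":
            continue
        normal = row["matched_normal"]
        if normal not in normals:
            raise ValueError(f"{row['sample']}: matched normal '{normal}' not in sheet")
        pairs.append((row["sample"], normal))
    return pairs


if __name__ == "__main__":
    sheet = ("sample\ttype\tmatched_normal\n"
             "P1_N\tnormal\t\n"
             "P1_T\ttumor\tP1_N\n")
    print(tumor_normal_pairs(sheet))
```

Failing early on an unmatched tumor avoids starting a long somatic run that cannot finish.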

Output Files

The pipeline writes its results to the following directories:

results/
├── alignments/    # BAM files and indices
├── variants/      # VCF files (per-sample and joint)
├── annotations/   # Annotated VCF files
├── metrics/       # QC metrics and statistics
├── reports/       # HTML and PDF reports
└── multiqc/       # MultiQC reports
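
A quick way to confirm that a run produced the layout above is to check for each directory. This is a small sketch assuming the directory names shown in the tree; the helper name is hypothetical.

```python
from pathlib import Path
from typing import List

# Directory names taken from the output tree above.
RESULT_SUBDIRS = ["alignments", "variants", "annotations",
                  "metrics", "reports", "multiqc"]


def missing_result_dirs(root: Path) -> List[str]:
    """Return the expected result subdirectories that do not exist under root."""
    return [name for name in RESULT_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_result_dirs(Path("results"))
    print("missing:", ", ".join(missing) if missing else "none")
```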

Technologies Used

Workflow: Snakemake
Alignment: BWA-MEM
Variant Calling: GATK, FreeBayes, bcftools
Processing: Picard, Samtools
Annotation: SnpEff, ANNOVAR
QC: FastQC, MultiQC
Environment: Conda, Singularity

Contributing

Please read our contributing guidelines for information on how to contribute to this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.
