Variant Calling Pipeline
A robust Snakemake pipeline for germline and somatic variant calling from whole-genome and whole-exome sequencing data.
Overview
This pipeline implements best-practice variant calling workflows built on GATK, with FreeBayes and bcftools available as alternative callers. It handles whole-genome and whole-exome data, and the choice of callers, joint calling, somatic mode, and recalibration steps is controlled through the configuration files described below.
Key Features
- Multiple Variant Callers: GATK HaplotypeCaller, FreeBayes, bcftools
- Germline and Somatic Calling: Support for both variant types
- Quality Control: Comprehensive QC at each step
- Annotation: Variant effect prediction with SnpEff
- Filtering: Hard and soft filtering options
- Joint Calling: Multi-sample variant calling support
- Reproducible: Conda environments and container support
Pipeline Workflow
1. Preprocessing
- Adapter trimming with Trim Galore!
- Read alignment with BWA-MEM
- BAM file sorting and indexing
- Duplicate marking with Picard
- Base quality score recalibration (BQSR)
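The preprocessing order above can be sketched as a list of shell commands. This is only an illustration, not the pipeline's actual rules: the file names, the `trimmed/` directory, the read-group fields, and the function name `preprocessing_commands` are all assumptions.

```python
from typing import List


def preprocessing_commands(sample: str, ref: str, known_sites: List[str],
                           threads: int = 4) -> List[str]:
    """Return shell commands for the preprocessing stage, in execution order.

    Every path here is a hypothetical placeholder derived from `sample`.
    """
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    # Assumed Trim Galore! paired-end output names (<basename>_val_1 / _val_2).
    t1 = f"trimmed/{sample}_R1_val_1.fq.gz"
    t2 = f"trimmed/{sample}_R2_val_2.fq.gz"
    # bwa expects a literal "\t" between read-group fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    known = " ".join(f"--known-sites {vcf}" for vcf in known_sites)
    return [
        f"trim_galore --paired -o trimmed {r1} {r2}",
        f"bwa mem -t {threads} -R '{read_group}' {ref} {t1} {t2} "
        f"| samtools sort -@ {threads} -o {sample}.sorted.bam -",
        f"samtools index {sample}.sorted.bam",
        f"picard MarkDuplicates I={sample}.sorted.bam "
        f"O={sample}.dedup.bam M={sample}.dup_metrics.txt",
        f"samtools index {sample}.dedup.bam",
        f"gatk BaseRecalibrator -R {ref} -I {sample}.dedup.bam "
        f"{known} -O {sample}.recal.table",
        f"gatk ApplyBQSR -R {ref} -I {sample}.dedup.bam "
        f"--bqsr-recal-file {sample}.recal.table -O {sample}.bqsr.bam",
    ]
```

Calling `preprocessing_commands("S1", "references/hg38.fasta", ["references/dbsnp_138.vcf"])` returns seven commands, one per step plus the two index calls. In the real workflow each of these would most likely live in its own Snakemake rule.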
2. Variant Calling
- Germline Calling: GATK HaplotypeCaller per sample
- Joint Genotyping: GenotypeGVCFs for cohort analysis
- Somatic Calling: Mutect2 for tumor-normal pairs
- Alternative Callers: FreeBayes, bcftools
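As a rough sketch of how the germline and somatic steps above fit together, the following builds the corresponding GATK4 command lines. The output names (`cohort.g.vcf.gz`, `cohort.vcf.gz`) and both function names are hypothetical, and the use of `CombineGVCFs` to merge per-sample GVCFs is an assumption; the pipeline may merge them differently.

```python
from typing import Dict, List


def germline_commands(bams: Dict[str, str], ref: str) -> List[str]:
    """Per-sample GVCF calling, then cohort merge and joint genotyping."""
    cmds = [
        f"gatk HaplotypeCaller -R {ref} -I {bam} -O {sample}.g.vcf.gz -ERC GVCF"
        for sample, bam in bams.items()
    ]
    variants = " ".join(f"-V {sample}.g.vcf.gz" for sample in bams)
    # Assumed merge step; GenomicsDBImport is a common alternative.
    cmds.append(f"gatk CombineGVCFs -R {ref} {variants} -O cohort.g.vcf.gz")
    cmds.append(f"gatk GenotypeGVCFs -R {ref} -V cohort.g.vcf.gz -O cohort.vcf.gz")
    return cmds


def somatic_command(tumor_bam: str, normal_bam: str, normal_sample: str,
                    ref: str, out_vcf: str) -> str:
    """Mutect2 call for one tumor-normal pair.

    `normal_sample` is the SM read-group name of the normal BAM.
    """
    return (f"gatk Mutect2 -R {ref} -I {tumor_bam} -I {normal_bam} "
            f"-normal {normal_sample} -O {out_vcf}")
```

For two samples, `germline_commands` yields two `HaplotypeCaller` calls followed by one merge and one `GenotypeGVCFs` call, which mirrors the per-sample then cohort structure of the list above.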
3. Variant Processing
- Variant quality score recalibration (VQSR)
- Variant filtering and annotation
- Variant effect prediction with SnpEff
- Population frequency annotation
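When VQSR is not applicable (for example, too few samples), hard filtering is the usual fallback. The sketch below applies GATK's generic hard-filter suggestions for SNPs to a VCF INFO string. Whether this pipeline uses these exact thresholds is an assumption, and the filter names and helper functions are made up for illustration.

```python
import operator
from typing import Dict, List

# (filter name, INFO key, comparison, threshold); a site fails a filter when
# compare(value, threshold) is true. Thresholds follow GATK's generic SNP
# suggestions and are assumed here, not taken from this pipeline.
SNP_HARD_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]


def parse_info(info_field: str) -> Dict[str, str]:
    """Split a VCF INFO column into key/value pairs (flags get an empty value)."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def failed_filters(info_field: str) -> List[str]:
    """Names of the hard filters this site fails; missing annotations are skipped."""
    info = parse_info(info_field)
    return [
        name
        for name, key, compare, threshold in SNP_HARD_FILTERS
        if key in info and compare(float(info[key]), threshold)
    ]


def filter_column(info_field: str) -> str:
    """Value for the VCF FILTER column: failed filter names, or PASS."""
    failed = failed_filters(info_field)
    return ";".join(failed) if failed else "PASS"
```

For instance, `filter_column("QD=1.5;FS=70.0;MQ=60.0")` returns `"QD2;FS60"`, while a site with no failing annotation returns `"PASS"`.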
4. Quality Control
- Alignment statistics with Samtools
- Variant calling metrics
- Concordance analysis
- MultiQC reports
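Concordance analysis can mean several things. One simple reading, assumed here, is site-level agreement between two call sets, measured as the share of variants found in both out of all variants found in either. The function names are hypothetical and the sketch ignores genotypes and normalization of indel representation.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def read_variants(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Multi-allelic records contribute one key per ALT allele.
    """
    variants = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def site_concordance(a: Set[Variant], b: Set[Variant]) -> float:
    """Jaccard index of two call sets; two empty sets count as fully concordant."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Comparing, say, the GATK and FreeBayes calls for one sample this way gives a quick sanity check before looking at genotype-level metrics.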
Installation
# Clone the repository
git clone https://github.com/tamoghnadas12/variant-calling-pipeline
cd variant-calling-pipeline
# Install dependencies
conda env create -f environment.yml
conda activate variant-calling
# Configure reference data
bash scripts/download_references.sh
Configuration
The pipeline is configured through:
- config/config.yaml: Main configuration
- config/samples.tsv: Sample metadata
- config/units.tsv: Sequencing unit information
Example configuration:
# config/config.yaml
ref:
  genome: 'references/hg38.fasta'
  dbsnp: 'references/dbsnp_138.vcf'
  mills: 'references/Mills_and_1000G_gold_standard.indels.vcf'
calling:
  callers: ['gatk', 'freebayes']
  joint_calling: true
  somatic: false
processing:
  bqsr: true
  vqsr: true
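A small sanity check on the loaded configuration can catch typos before a long run starts. The sketch below works on the already-parsed config as a plain dictionary. The function name is hypothetical, and the caller key `'bcftools'` is an assumption, since the example above only shows `'gatk'` and `'freebayes'`.

```python
from typing import Dict, List

# 'gatk' and 'freebayes' appear in the example config; 'bcftools' is assumed.
KNOWN_CALLERS = {"gatk", "freebayes", "bcftools"}


def validate_config(config: Dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in ("genome", "dbsnp", "mills"):
        if not config.get("ref", {}).get(key):
            problems.append(f"ref.{key} is missing")
    callers = config.get("calling", {}).get("callers", [])
    if not callers:
        problems.append("calling.callers is empty")
    for caller in callers:
        if caller not in KNOWN_CALLERS:
            problems.append(f"unknown caller: {caller}")
    switches = (("calling", "joint_calling"), ("calling", "somatic"),
                ("processing", "bqsr"), ("processing", "vqsr"))
    for section, key in switches:
        if not isinstance(config.get(section, {}).get(key), bool):
            problems.append(f"{section}.{key} should be true or false")
    return problems
```

With the example configuration above the function returns an empty list; a misspelled caller such as `'freebayez'` is reported as `unknown caller: freebayez`.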
Usage
Germline Variant Calling
# Run germline pipeline
snakemake --cores 16 --use-conda --config calling/somatic=false
# Joint genotyping for cohort
snakemake --cores 16 --use-conda --config calling/joint_calling=true
Somatic Variant Calling
# Run somatic pipeline
snakemake --cores 16 --use-conda --config calling/somatic=true
# Tumor-normal pairs specified in samples.tsv
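The layout of samples.tsv is not shown here, so the following is only a guess at how tumor-normal pairs could be derived from it. It assumes three tab-separated columns named `sample`, `patient`, and `type` (with values `tumor` or `normal`); both the column names and the function name are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def tumor_normal_pairs(tsv_text: str) -> List[Tuple[str, str]]:
    """Return (tumor_sample, normal_sample) pairs, one per patient.

    Patients lacking either a tumor or a normal sample are skipped.
    Assumed columns: sample, patient, type.
    """
    by_patient = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        by_patient.setdefault(row["patient"], {})[row["type"]] = row["sample"]
    return [
        (samples["tumor"], samples["normal"])
        for _, samples in sorted(by_patient.items())
        if "tumor" in samples and "normal" in samples
    ]
```

Each returned pair would then feed one somatic calling job, with the normal sample name passed to the caller so it can tell the two BAMs apart.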
Output Files
The pipeline generates organized output directories:
results/
├── alignments/ # BAM files and indices
├── variants/ # VCF files (per-sample and joint)
├── annotations/ # Annotated VCF files
├── metrics/ # QC metrics and statistics
├── reports/ # HTML and PDF reports
└── multiqc/ # MultiQC reports
Technologies Used
- Workflow: Snakemake
- Alignment: BWA-MEM
- Variant Calling: GATK, FreeBayes, bcftools
- Processing: Picard, Samtools
- Annotation: SnpEff, ANNOVAR
- QC: FastQC, MultiQC
- Environment: Conda, Singularity
Contributing
Please read our contributing guidelines for information on how to contribute to this project.
License
This project is licensed under the MIT License - see the LICENSE file for details.