Variant Calling Pipeline with Bash
Build a comprehensive bash pipeline for variant calling from raw sequencing data to annotated VCF files using GATK and bcftools.
Overview
This tutorial will guide you through creating a complete bash pipeline for variant calling, from raw FASTQ files to annotated VCF files. We’ll implement both GATK Best Practices and alternative approaches using bcftools, providing a flexible framework for different sequencing scenarios.
Prerequisites
Before starting this tutorial, you should have:
- Advanced understanding of bash scripting
- Familiarity with Next-Generation Sequencing (NGS) data
- Understanding of SAM/BAM file formats
- Basic knowledge of variant calling concepts
- Tools installed:
bwa, samtools, gatk, bcftools, picard, tabix, bgzip
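A quick sanity check before running anything, as a minimal sketch (adjust the tool list to your setup):
# Verify that required tools are on PATH
for tool in bwa samtools gatk bcftools picard tabix bgzip fastqc; do
command -v "$tool" >/dev/null 2>&1 || echo "Missing tool: $tool"
done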
Pipeline Workflow
Our variant calling pipeline will follow GATK Best Practices:
- Quality control of raw reads
- Read alignment to reference genome
- BAM file preprocessing (sorting, marking duplicates, base quality recalibration)
- Variant calling (HaplotypeCaller for single samples, GenotypeGVCFs for cohorts)
- Variant filtering and annotation
- Final reporting
Setting Up the Project Structure
First, let’s create a directory structure for our pipeline:
# Create project directory
mkdir variant_calling_pipeline
cd variant_calling_pipeline
# Create directory structure
mkdir -p data/raw data/reference results/alignments results/variants results/annotations results/reports logs tmp
# Download reference genome (example with hg38)
# You'll need to download the reference genome and create BWA index
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
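# Once downloaded, index the reference (standard one-time steps; adjust the
# filename to whatever you saved the FASTA as):
# bwa index data/reference/hg38.fasta
# samtools faidx data/reference/hg38.fasta
# gatk CreateSequenceDictionary -R data/reference/hg38.fasta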
Creating the Main Pipeline Script
Create a main pipeline script variant_calling_pipeline.sh:
#!/bin/bash
# variant_calling_pipeline.sh - Variant calling pipeline following GATK Best Practices
# Author: Your Name
# Date: $(date)
set -euo pipefail # Exit on error, undefined vars, pipe failures
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
RESULTS_DIR="${PROJECT_DIR}/results"
ALIGN_DIR="${RESULTS_DIR}/alignments"
VARIANTS_DIR="${RESULTS_DIR}/variants"
ANNOT_DIR="${RESULTS_DIR}/annotations"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
TMP_DIR="${PROJECT_DIR}/tmp"
# Reference files (update these paths for your environment)
REF_GENOME="${REF_DIR}/hg38.fasta"
DBSNP_VCF="${REF_DIR}/dbsnp_138.hg38.vcf.gz"
MILLS_INDELS="${REF_DIR}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
# Sample information (in practice, this would come from a sample sheet)
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"
# Output files
ALIGNED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.aligned.bam"
SORTED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.sorted.bam"
DEDUP_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.bam"
RECAL_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.recal.bam"
GVCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.g.vcf.gz"
VCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.vcf.gz"
ANNOTATED_VCF="${ANNOT_DIR}/${SAMPLE_NAME}.annotated.vcf.gz"
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}
# Create directories
setup_directories() {
log "Setting up directories..."
mkdir -p "${ALIGN_DIR}" "${VARIANTS_DIR}" "${ANNOT_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}" "${TMP_DIR}"
}
# Validate input files
validate_input() {
log "Validating input files..."
local required_files=("$FORWARD_READS" "$REVERSE_READS" "$REF_GENOME" "$DBSNP_VCF" "$MILLS_INDELS")
for file in "${required_files[@]}"; do
if [[ ! -f "$file" ]]; then
log "ERROR: Required file not found: $file"
exit 1
fi
done
log "Input validation passed"
}
# Quality control with FastQC
run_fastqc() {
local input_dir="$1"
local output_dir="$2"
log "Running FastQC on ${input_dir}..."
mkdir -p "${output_dir}"  # FastQC requires the output directory to exist
fastqc -o "${output_dir}" -t 4 "${input_dir}"/*.fastq.gz
}
# Align reads with BWA
align_reads() {
log "Aligning reads with BWA MEM..."
bwa mem -t 8 -R "@RG\tID:${SAMPLE_NAME}\tSM:${SAMPLE_NAME}\tPL:ILLUMINA\tLB:${SAMPLE_NAME}" \
"$REF_GENOME" "$FORWARD_READS" "$REVERSE_READS" | \
samtools view -b -o "$ALIGNED_BAM" -
log "Alignment completed"
}
# Sort BAM file
sort_bam() {
log "Sorting BAM file..."
samtools sort -@ 4 -o "$SORTED_BAM" "$ALIGNED_BAM"
samtools index "$SORTED_BAM"
# Clean up intermediate file
rm "$ALIGNED_BAM"
log "BAM sorting completed"
}
# Mark duplicates with Picard
mark_duplicates() {
log "Marking duplicates with Picard..."
picard MarkDuplicates \
I="$SORTED_BAM" \
O="$DEDUP_BAM" \
M="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.metrics.txt" \
REMOVE_DUPLICATES=false
samtools index "$DEDUP_BAM"
# Clean up intermediate file
rm "$SORTED_BAM" "${SORTED_BAM}.bai"
log "Duplicate marking completed"
}
# Base quality score recalibration with GATK
recalibrate_base_qualities() {
log "Recalibrating base qualities with GATK..."
# Step 1: Generate base recalibration table
gatk BaseRecalibrator \
-I "$DEDUP_BAM" \
-R "$REF_GENOME" \
--known-sites "$DBSNP_VCF" \
--known-sites "$MILLS_INDELS" \
-O "${ALIGN_DIR}/${SAMPLE_NAME}.recal_data.table"
# Step 2: Apply recalibration
gatk ApplyBQSR \
-I "$DEDUP_BAM" \
-R "$REF_GENOME" \
--bqsr-recal-file "${ALIGN_DIR}/${SAMPLE_NAME}.recal_data.table" \
-O "$RECAL_BAM"
samtools index "$RECAL_BAM"
# Clean up intermediate file
rm "$DEDUP_BAM" "${DEDUP_BAM}.bai"
log "Base quality recalibration completed"
}
# Variant calling with GATK HaplotypeCaller
call_variants() {
log "Calling variants with GATK HaplotypeCaller..."
gatk HaplotypeCaller \
-R "$REF_GENOME" \
-I "$RECAL_BAM" \
-O "$GVCF_FILE" \
-ERC GVCF \
--native-pair-hmm-threads 4
log "Variant calling completed"
}
# Joint genotyping (for cohort analysis)
joint_genotyping() {
local gvcf_list="$1"
local output_vcf="$2"
log "Performing joint genotyping..."
gatk GenotypeGVCFs \
-R "$REF_GENOME" \
-V "$gvcf_list" \
-O "$output_vcf"
log "Joint genotyping completed"
}
# Variant filtering
filter_variants() {
local input_vcf="$1"
local output_vcf="$2"
log "Filtering variants..."
# Separate SNPs and indels
gatk SelectVariants \
-R "$REF_GENOME" \
-V "$input_vcf" \
-select-type SNP \
-O "${VARIANTS_DIR}/${SAMPLE_NAME}.snps.vcf.gz"
gatk SelectVariants \
-R "$REF_GENOME" \
-V "$input_vcf" \
-select-type INDEL \
-O "${VARIANTS_DIR}/${SAMPLE_NAME}.indels.vcf.gz"
# Filter SNPs
gatk VariantFiltration \
-R "$REF_GENOME" \
-V "${VARIANTS_DIR}/${SAMPLE_NAME}.snps.vcf.gz" \
--filter-expression "QD < 2.0" \
--filter-name "QD2" \
--filter-expression "QUAL < 30.0" \
--filter-name "QUAL30" \
--filter-expression "SOR > 3.0" \
--filter-name "SOR3" \
--filter-expression "FS > 60.0" \
--filter-name "FS60" \
--filter-expression "MQ < 40.0" \
--filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" \
--filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" \
--filter-name "ReadPosRankSum-8" \
-O "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.snps.vcf.gz"
# Filter indels
gatk VariantFiltration \
-R "$REF_GENOME" \
-V "${VARIANTS_DIR}/${SAMPLE_NAME}.indels.vcf.gz" \
--filter-expression "QD < 2.0" \
--filter-name "QD2" \
--filter-expression "QUAL < 30.0" \
--filter-name "QUAL30" \
--filter-expression "FS > 200.0" \
--filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" \
--filter-name "ReadPosRankSum-20" \
-O "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.indels.vcf.gz"
# Merge filtered variants
gatk MergeVcfs \
-I "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.snps.vcf.gz" \
-I "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.indels.vcf.gz" \
-O "$output_vcf"
log "Variant filtering completed"
}
# Annotate variants with SnpEff
annotate_variants() {
local input_vcf="$1"
local output_vcf="$2"
log "Annotating variants with SnpEff..."
snpEff -Xmx8g hg38 "$input_vcf" > "$output_vcf"
# Compress and index
bgzip "$output_vcf"
tabix -p vcf "${output_vcf}.gz"
log "Variant annotation completed"
}
# Generate statistics and reports
generate_report() {
log "Generating variant statistics and reports..."
# Basic statistics
bcftools stats "$VCF_FILE" > "${REPORTS_DIR}/${SAMPLE_NAME}.stats"
# Plot statistics
plot-vcfstats -p "${REPORTS_DIR}/${SAMPLE_NAME}_plots" "${REPORTS_DIR}/${SAMPLE_NAME}.stats"
# Create summary
local total_variants=$(bcftools view -H "$VCF_FILE" | wc -l)
local passed_variants=$(bcftools view -H -f PASS "$VCF_FILE" | wc -l)
local snp_count=$(bcftools view -H -v snps "$VCF_FILE" | wc -l)
local indel_count=$(bcftools view -H -v indels "$VCF_FILE" | wc -l)
cat > "${REPORTS_DIR}/${SAMPLE_NAME}_summary.txt" << EOF
Variant Calling Pipeline Summary
===============================
Sample: $SAMPLE_NAME
Total variants called: $total_variants
Variants passing filters: $passed_variants
SNPs: $snp_count
Indels: $indel_count
Generated on: $(date)
EOF
log "Report generation completed"
}
# Main pipeline execution
main() {
setup_directories
validate_input
# Step 1: Initial QC
run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"
# Step 2: Alignment
align_reads
sort_bam
mark_duplicates
recalibrate_base_qualities
# Step 3: Variant calling
call_variants
# Step 4: Variant processing
# A GVCF contains intermediate likelihoods, not final calls: genotype it first,
# then filter and annotate the resulting VCF
joint_genotyping "$GVCF_FILE" "${VARIANTS_DIR}/${SAMPLE_NAME}.genotyped.vcf.gz"
filter_variants "${VARIANTS_DIR}/${SAMPLE_NAME}.genotyped.vcf.gz" "$VCF_FILE"
annotate_variants "$VCF_FILE" "${ANNOT_DIR}/${SAMPLE_NAME}.annotated.vcf"
# Step 5: Reporting
generate_report
log "Variant calling pipeline completed successfully!"
log "Results available in: ${RESULTS_DIR}"
}
# Run main function
main "$@"
Alternative Pipeline with Bcftools
Create an alternative pipeline using bcftools, bcftools_pipeline.sh:
#!/bin/bash
# bcftools_pipeline.sh - Alternative variant calling pipeline using bcftools
# Author: Your Name
# Date: $(date)
set -euo pipefail
# Configuration (same as above)
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
RESULTS_DIR="${PROJECT_DIR}/results"
ALIGN_DIR="${RESULTS_DIR}/alignments"
VARIANTS_DIR="${RESULTS_DIR}/variants"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
# Reference files
REF_GENOME="${REF_DIR}/hg38.fasta"
DBSNP_VCF="${REF_DIR}/dbsnp_138.hg38.vcf.gz"
# Sample information
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"
# Output files
ALIGNED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.aligned.bam"
SORTED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.sorted.bam"
DEDUP_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.bam"
VCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.bcftools.vcf.gz"
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/bcftools_pipeline.log"
}
# Align reads with BWA
align_reads() {
log "Aligning reads with BWA MEM..."
bwa mem -t 8 "$REF_GENOME" "$FORWARD_READS" "$REVERSE_READS" | \
samtools sort -@ 4 -o "$SORTED_BAM"
samtools index "$SORTED_BAM"
log "Alignment completed"
}
# Mark duplicates with samtools
mark_duplicates() {
log "Marking duplicates with samtools..."
samtools markdup -@ 4 "$SORTED_BAM" "$DEDUP_BAM"
samtools index "$DEDUP_BAM"
# Clean up intermediate file
rm "$SORTED_BAM" "${SORTED_BAM}.bai"
log "Duplicate marking completed"
}
# Variant calling with bcftools
call_variants_bcftools() {
log "Calling variants with bcftools..."
# mpileup and call variants
bcftools mpileup -Ou -f "$REF_GENOME" "$DEDUP_BAM" | \
bcftools call -mv -Oz -o "$VCF_FILE"
# Index the VCF file
bcftools index "$VCF_FILE"
log "Variant calling with bcftools completed"
}
# Filter variants with bcftools
filter_variants_bcftools() {
log "Filtering variants with bcftools..."
# Filter based on quality and depth
bcftools filter -Oz -o "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz" \
-s "LowQual" -i '%QUAL>30 && DP>10' "$VCF_FILE"
# Index filtered VCF
bcftools index "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz"
log "Variant filtering completed"
}
# Annotate variants with bcftools
annotate_variants_bcftools() {
log "Annotating variants with bcftools..."
bcftools annotate -Oz -o "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz" \
-a "$DBSNP_VCF" -c CHROM,POS,ID "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz"
bcftools index "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz"
log "Variant annotation completed"
}
# Generate statistics
generate_stats() {
log "Generating variant statistics..."
bcftools stats "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz" > \
"${REPORTS_DIR}/${SAMPLE_NAME}_bcftools.stats"
log "Statistics generation completed"
}
# Main execution
main() {
mkdir -p "${ALIGN_DIR}" "${VARIANTS_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"
align_reads
mark_duplicates
call_variants_bcftools
filter_variants_bcftools
annotate_variants_bcftools
generate_stats
log "Bcftools variant calling pipeline completed successfully!"
}
main "$@"
Making Scripts Executable
chmod +x variant_calling_pipeline.sh
chmod +x bcftools_pipeline.sh
Advanced Features
1. Cohort Analysis Script
Create a script for joint genotyping of multiple samples, joint_genotyping.sh:
#!/bin/bash
# joint_genotyping.sh - Joint genotyping for cohort analysis
set -euo pipefail
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VARIANTS_DIR="${PROJECT_DIR}/results/variants"
REF_GENOME="${PROJECT_DIR}/data/reference/hg38.fasta"
# List of GVCF files
GVCF_LIST_FILE="${VARIANTS_DIR}/gvcf_list.txt"
# Output
COHORT_VCF="${VARIANTS_DIR}/cohort_variants.vcf.gz"
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}
# Create GVCF list
create_gvcf_list() {
log "Creating GVCF list..."
find "$VARIANTS_DIR" -name "*.g.vcf.gz" > "$GVCF_LIST_FILE"
log "GVCF list created with $(wc -l < "$GVCF_LIST_FILE") files"
}
# Joint genotyping
run_joint_genotyping() {
log "Running joint genotyping..."
# GATK4's GenotypeGVCFs accepts only a single input, so combine the
# per-sample GVCFs first (GenomicsDBImport is the recommended
# alternative for large cohorts)
local combined_gvcf="${VARIANTS_DIR}/cohort_combined.g.vcf.gz"
local gvcf_args=()
while IFS= read -r gvcf; do
gvcf_args+=(-V "$gvcf")
done < "$GVCF_LIST_FILE"
gatk CombineGVCFs \
-R "$REF_GENOME" \
"${gvcf_args[@]}" \
-O "$combined_gvcf"
gatk GenotypeGVCFs \
-R "$REF_GENOME" \
-V "$combined_gvcf" \
-O "$COHORT_VCF"
# Index the output
gatk IndexFeatureFile -I "$COHORT_VCF"
log "Joint genotyping completed"
}
# Main execution
main() {
create_gvcf_list
run_joint_genotyping
log "Cohort analysis completed"
}
main "$@"
2. Quality Control and Validation
Add comprehensive QC functions:
# Helper used by the validation functions below
error_exit() {
log "ERROR: $1"
exit 1
}
# Validate BAM file
validate_bam() {
local bam_file="$1"
log "Validating BAM file: $bam_file"
# Check if file exists
[[ -f "$bam_file" ]] || error_exit "BAM file not found: $bam_file"
# Check BAM header
if ! samtools quickcheck "$bam_file"; then
error_exit "BAM file failed validation: $bam_file"
fi
# Check if indexed
if [[ ! -f "${bam_file}.bai" ]]; then
log "Indexing BAM file..."
samtools index "$bam_file"
fi
log "BAM validation passed"
}
# Calculate alignment statistics
alignment_stats() {
local bam_file="$1"
local sample_name="$2"
log "Calculating alignment statistics for $sample_name..."
# Basic statistics
samtools flagstat "$bam_file" > "${ALIGN_DIR}/${sample_name}.flagstat.txt"
# Insert size distribution
samtools stats "$bam_file" | grep ^IS > "${ALIGN_DIR}/${sample_name}.insert_size.txt"
# Coverage statistics
samtools depth "$bam_file" | awk '{sum+=$3; count++} END {print "Average coverage: " sum/count}' \
> "${ALIGN_DIR}/${sample_name}.coverage.txt"
log "Alignment statistics calculated"
}
# Validate VCF file
validate_vcf() {
local vcf_file="$1"
log "Validating VCF file: $vcf_file"
# Check if file exists
[[ -f "$vcf_file" ]] || error_exit "VCF file not found: $vcf_file"
# Check VCF format
if ! bcftools view -h "$vcf_file" >/dev/null 2>&1; then
error_exit "Invalid VCF format: $vcf_file"
fi
log "VCF validation passed"
}
3. Parallel Processing
Implement parallel processing for multiple samples:
# Process multiple samples in parallel
# (assumes variant_calling_pipeline.sh has been adapted to take the sample name as $1)
process_samples_parallel() {
local sample_list="$1"
local max_jobs="${2:-4}"
log "Processing samples in parallel (max ${max_jobs} jobs)..."
if command -v parallel &> /dev/null; then
# Use GNU parallel
cat "$sample_list" | parallel -j "$max_jobs" ./variant_calling_pipeline.sh {}
else
# Fallback to basic background jobs
local job_count=0
while IFS= read -r sample; do
if [[ $((job_count % max_jobs)) -eq 0 && $job_count -gt 0 ]]; then
wait # Wait for previous batch to complete
fi
./variant_calling_pipeline.sh "$sample" &
((job_count++))
done < "$sample_list"
wait # Wait for all jobs to complete
fi
log "All samples processed"
}
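A usage sketch, assuming the function is sourced into your shell and samples.txt (a hypothetical sample sheet) lists one sample name per line:
# Run up to 8 samples concurrently
process_samples_parallel samples.txt 8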
Configuration Management
Create a configuration file variant_calling_config.ini:
[general]
project_name=variant_calling
threads=8
memory=16G
[alignment]
aligner=bwa
read_group_platform=ILLUMINA
[variant_calling]
caller=gatk
joint_genotyping=true
cohort_size_threshold=10
[filtering]
snp_filter_expression=QD < 2.0 || FS > 60.0 || MQ < 40.0
indel_filter_expression=QD < 2.0 || FS > 200.0
[annotation]
annotator=snpeff
database=hg38
[quality_control]
min_mapping_quality=30
min_base_quality=20
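The scripts above hard-code their settings; to actually consume this INI file you need a small reader. A minimal sketch, assuming plain key=value lines (get_config is a hypothetical helper, and section headers are simply ignored):
# Read a key=value setting from the config file
get_config() {
local key="$1"
local config_file="${2:-variant_calling_config.ini}"
grep -E "^${key}=" "$config_file" | head -n 1 | cut -d'=' -f2-
}
THREADS="$(get_config threads)"  # e.g. 8
ALIGNER="$(get_config aligner)"  # e.g. bwa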
Running the Pipeline
To run the pipeline:
# Make scripts executable
chmod +x variant_calling_pipeline.sh
chmod +x bcftools_pipeline.sh
chmod +x joint_genotyping.sh
# Run the GATK pipeline
./variant_calling_pipeline.sh
# Or run the bcftools alternative
./bcftools_pipeline.sh
# For cohort analysis
./joint_genotyping.sh
# For verbose output
./variant_calling_pipeline.sh 2>&1 | tee variant_calling_run.log
Expected Output Structure
After running the pipeline, you’ll have:
variant_calling_pipeline/
├── data/
│ ├── raw/
│ │ ├── sample_R1.fastq.gz
│ │ └── sample_R2.fastq.gz
│ └── reference/
│ ├── hg38.fasta
│ ├── dbsnp_138.hg38.vcf.gz
│ └── Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
├── results/
│ ├── alignments/
│ │ ├── sample.recal.bam
│ │ ├── sample.recal.bam.bai
│ │ └── sample.dedup.metrics.txt
│ ├── variants/
│ │ ├── sample.g.vcf.gz
│ │ ├── sample.genotyped.vcf.gz
│ │ └── sample.vcf.gz
│ ├── annotations/
│ │ └── sample.annotated.vcf.gz
│ └── reports/
│ ├── sample.stats
│ ├── sample_summary.txt
│ └── fastqc_raw/
├── logs/
│ └── pipeline.log
├── tmp/
├── variant_calling_config.ini
├── variant_calling_pipeline.sh
├── bcftools_pipeline.sh
└── joint_genotyping.sh
Best Practices
- Modularity: Break complex tasks into functions
- Error Handling: Use set -euo pipefail and proper error checking
- Logging: Implement comprehensive logging for debugging
- Configuration: Use external config files for flexibility
- Validation: Validate inputs, outputs, and intermediate files
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use multiple cores for alignment and variant calling
- Memory Management: Monitor and limit memory usage
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible
- Compression: Use appropriate compression for large output files
- Resource Monitoring: Track CPU and memory usage
This variant calling pipeline provides a comprehensive framework for detecting genetic variants from sequencing data, implementing both GATK Best Practices and alternative approaches using bcftools. The modular design allows for easy customization and extension based on specific analysis requirements.
Frequently Asked Questions
Q: What is variant calling and when do I need it?
A: Variant calling identifies genetic differences (SNPs, indels) between your sample and a reference genome. You need it for genomics studies, disease research, population genetics, personalized medicine, and agricultural genomics. It’s essential for whole genome sequencing (WGS), whole exome sequencing (WES), and targeted sequencing.
Q: What’s the difference between GATK and bcftools for variant calling?
A: GATK implements industry best practices with sophisticated error correction, BQSR (Base Quality Score Recalibration), and advanced filtering. bcftools is faster and simpler but less refined. Use GATK for high-quality clinical/research variants; bcftools for quick analysis or large population studies where speed matters.
Q: How much computational power do I need for variant calling?
A: For human WGS: 32GB+ RAM, 16+ CPU cores, 500GB storage. WES needs less: 16GB RAM, 8 cores, 100GB storage. GATK HaplotypeCaller is memory-intensive. Processing time: WGS (6-12 hours), WES (2-4 hours) depending on coverage and sample complexity.
Q: What sequencing coverage do I need for reliable variant calling?
A: Minimum 30x coverage for WGS, 100x for WES. Higher coverage (50-100x WGS) improves sensitivity for rare variants and structural variants. Low coverage (<10x) misses many variants and has high false positive rates. Coverage uniformity matters as much as average depth.
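To check whether your data meets these thresholds, a quick sketch (the BAM name is a placeholder):
# Mean depth across all positions (slow for WGS, fine as a spot check)
samtools depth -a sample.dedup.bam | awk '{sum+=$3} END {print "Mean coverage:", sum/NR}'
# Per-contig summary table (samtools >= 1.10)
samtools coverage sample.dedup.bam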
Q: How do I interpret variant calling quality scores?
A: Key metrics: QUAL score (>30 for high confidence), GQ (genotype quality >20), DP (depth 10-100x), and FILTER field (PASS only). AD (allele depth) should support the called genotype. Use VQSR for sophisticated filtering instead of hard thresholds.
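For example, a hedged bcftools one-liner applying these thresholds (file names are placeholders):
# Keep PASS sites with QUAL>30, genotype quality >=20, and depth >=10
bcftools view -f PASS -i 'QUAL>30 && FMT/GQ>=20 && FMT/DP>=10' \
sample.vcf.gz -Oz -o sample.highconf.vcf.gz
bcftools index sample.highconf.vcf.gz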
Q: What’s Base Quality Score Recalibration (BQSR) and why is it important?
A: BQSR corrects systematic errors in base quality scores reported by sequencers. It uses known variant sites to model true error rates and adjusts quality scores accordingly. This improves variant calling accuracy by 10-20%, especially for low-frequency variants and indels.
Q: Should I call variants on individual samples or jointly?
A: Joint calling (multiple samples together) is more sensitive and accurate, especially for rare variants. It improves genotype quality and enables better filtering. However, it’s computationally expensive. Use individual calling for single samples or when adding samples incrementally.
Q: How do I handle indels and structural variants?
A: Standard pipelines detect small indels (<50bp). For larger structural variants, use specialized tools like Manta, LUMPY, or DELLY. Long-read sequencing (PacBio, Nanopore) is better for complex structural variants. Include indel realignment step for better indel calling accuracy.
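As a rough illustration with DELLY (a sketch only; check the DELLY documentation for current options, and note the file names are placeholders):
# Call structural variants with DELLY, then convert BCF to compressed VCF
delly call -g hg38.fasta -o sample.sv.bcf sample.recal.bam
bcftools view sample.sv.bcf -Oz -o sample.sv.vcf.gz
bcftools index sample.sv.vcf.gz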
Q: What annotation tools should I use after variant calling?
A: Essential annotations: VEP (Ensembl Variant Effect Predictor) or ANNOVAR for functional effects, dbSNP for known variants, ClinVar for clinical significance, gnomAD for population frequencies, and CADD/DANN for pathogenicity prediction. Choose based on your species and research goals.
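A minimal VEP invocation might look like this (assumes a local GRCh38 cache is already installed; file names are placeholders):
# Annotate functional effects with Ensembl VEP
vep -i sample.vcf.gz --cache --assembly GRCh38 --vcf \
--compress_output bgzip -o sample.vep.vcf.gz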
Q: How do I validate my variant calling results?
A: Validation strategies: Check known variants (dbSNP concordance >95%), use trio data for Mendelian error rates (<2%), compare with different callers, validate high-impact variants with Sanger sequencing, and use benchmark datasets (Genome in a Bottle) for accuracy assessment.
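A rough dbSNP concordance sketch with bcftools (both VCFs must be bgzipped and indexed; file names are placeholders):
# Count how many called variants are already in dbSNP
bcftools isec -n=2 -w1 sample.vcf.gz dbsnp_138.hg38.vcf.gz -Oz -o in_dbsnp.vcf.gz
known=$(bcftools view -H in_dbsnp.vcf.gz | wc -l)
total=$(bcftools view -H sample.vcf.gz | wc -l)
awk -v k="$known" -v t="$total" 'BEGIN{printf "dbSNP concordance: %.1f%%\n", 100*k/t}'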
Q: Can I use this pipeline for non-human organisms?
A: Yes! Replace reference genome and known variant files with your organism’s data. Adjust GATK parameters if needed (some are human-optimized). For organisms without known variant databases, skip BQSR or use variants from related species. Consider ploidy differences for plants or other organisms.
Continue Learning
Related Content
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Protein Annotation Pipeline with Bash
Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.
RNA-Seq Analysis Pipeline with Bash
Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.
Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
Use This Template:
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run