Variant Calling Pipeline with Bash

Build a comprehensive bash pipeline for variant calling from raw sequencing data to annotated VCF files using GATK and bcftools.

Overview

This tutorial will guide you through creating a complete bash pipeline for variant calling, from raw FASTQ files to annotated VCF files. We’ll implement both GATK Best Practices and alternative approaches using bcftools, providing a flexible framework for different sequencing scenarios.

Prerequisites

Before starting this tutorial, you should have:

  • Advanced understanding of bash scripting
  • Familiarity with Next-Generation Sequencing (NGS) data
  • Understanding of SAM/BAM file formats
  • Basic knowledge of variant calling concepts
  • Tools installed: bwa, samtools, gatk, bcftools, picard, tabix, bgzip, fastqc, snpEff (a quick availability check follows this list)
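
Before going further, confirm that each required tool is on your PATH. A minimal check, assuming the bioconda-style command names used throughout this tutorial:

# Verify required tools are available before starting
for tool in bwa samtools gatk bcftools picard tabix bgzip fastqc snpEff; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "ERROR: required tool not found on PATH: $tool" >&2
        exit 1
    fi
done
echo "All required tools found"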

Pipeline Workflow

Our variant calling pipeline will follow GATK Best Practices:

  1. Quality control of raw reads
  2. Read alignment to reference genome
  3. BAM file preprocessing (sorting, marking duplicates, base quality recalibration)
  4. Variant calling (HaplotypeCaller for single samples, GenotypeGVCFs for cohorts)
  5. Variant filtering and annotation
  6. Final reporting

Setting Up the Project Structure

First, let’s create a directory structure for our pipeline:

# Create project directory
mkdir variant_calling_pipeline
cd variant_calling_pipeline

# Create directory structure
mkdir -p data/raw data/reference results/alignments results/variants results/annotations results/reports logs tmp

# Download reference genome (example with hg38)
# You'll need to download the reference genome and create BWA index
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
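
The alignment and GATK steps below assume the reference FASTA has a BWA index, a samtools .fai index, and a sequence dictionary next to it. A one-time preparation sketch, assuming the download above is renamed to hg38.fasta:

# Prepare reference indexes (slow for a whole human genome)
gunzip -k data/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv data/reference/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna data/reference/hg38.fasta

bwa index data/reference/hg38.fasta                          # .amb/.ann/.bwt/.pac/.sa
samtools faidx data/reference/hg38.fasta                     # hg38.fasta.fai
gatk CreateSequenceDictionary -R data/reference/hg38.fasta   # hg38.dict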

Creating the Main Pipeline Script

Create a main pipeline script variant_calling_pipeline.sh:

#!/bin/bash

# variant_calling_pipeline.sh - Variant calling pipeline following GATK Best Practices
# Author: Your Name
# Date: $(date)

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
RESULTS_DIR="${PROJECT_DIR}/results"
ALIGN_DIR="${RESULTS_DIR}/alignments"
VARIANTS_DIR="${RESULTS_DIR}/variants"
ANNOT_DIR="${RESULTS_DIR}/annotations"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
TMP_DIR="${PROJECT_DIR}/tmp"

# Reference files (update these paths for your environment)
REF_GENOME="${REF_DIR}/hg38.fasta"
DBSNP_VCF="${REF_DIR}/dbsnp_138.hg38.vcf.gz"
MILLS_INDELS="${REF_DIR}/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"

# Sample information (in practice, this would come from a sample sheet)
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"

# Output files
ALIGNED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.aligned.bam"
SORTED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.sorted.bam"
DEDUP_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.bam"
RECAL_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.recal.bam"
GVCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.g.vcf.gz"
VCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.vcf.gz"
ANNOTATED_VCF="${ANNOT_DIR}/${SAMPLE_NAME}.annotated.vcf.gz"

# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}

# Create directories
setup_directories() {
    # Create the log directory before the first log() call, which tees into it
    mkdir -p "${ALIGN_DIR}" "${VARIANTS_DIR}" "${ANNOT_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}" "${TMP_DIR}"
    log "Directories ready"
}

# Validate input files
validate_input() {
    log "Validating input files..."

    local required_files=("$FORWARD_READS" "$REVERSE_READS" "$REF_GENOME" "$DBSNP_VCF" "$MILLS_INDELS")

    for file in "${required_files[@]}"; do
        if [[ ! -f "$file" ]]; then
            log "ERROR: Required file not found: $file"
            exit 1
        fi
    done

    log "Input validation passed"
}

# Quality control with FastQC
run_fastqc() {
    local input_dir="$1"
    local output_dir="$2"

    log "Running FastQC on ${input_dir}..."
    mkdir -p "${output_dir}"
    fastqc -o "${output_dir}" -t 4 "${input_dir}"/*.fastq.gz
}

# Align reads with BWA
align_reads() {
    log "Aligning reads with BWA MEM..."

    bwa mem -t 8 -R "@RG\tID:${SAMPLE_NAME}\tSM:${SAMPLE_NAME}\tPL:ILLUMINA" \
        "$REF_GENOME" "$FORWARD_READS" "$REVERSE_READS" | \
    samtools view -bS - > "$ALIGNED_BAM"

    log "Alignment completed"
}

# Sort BAM file
sort_bam() {
    log "Sorting BAM file..."

    samtools sort -@ 4 -o "$SORTED_BAM" "$ALIGNED_BAM"
    samtools index "$SORTED_BAM"

    # Clean up intermediate file
    rm "$ALIGNED_BAM"

    log "BAM sorting completed"
}

# Mark duplicates with Picard
mark_duplicates() {
    log "Marking duplicates with Picard..."

    picard MarkDuplicates \
        I="$SORTED_BAM" \
        O="$DEDUP_BAM" \
        M="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.metrics.txt" \
        REMOVE_DUPLICATES=false

    samtools index "$DEDUP_BAM"

    # Clean up intermediate file
    rm "$SORTED_BAM" "${SORTED_BAM}.bai"

    log "Duplicate marking completed"
}

# Base quality score recalibration with GATK
recalibrate_base_qualities() {
    log "Recalibrating base qualities with GATK..."

    # Step 1: Generate base recalibration table
    gatk BaseRecalibrator \
        -I "$DEDUP_BAM" \
        -R "$REF_GENOME" \
        --known-sites "$DBSNP_VCF" \
        --known-sites "$MILLS_INDELS" \
        -O "${ALIGN_DIR}/${SAMPLE_NAME}.recal_data.table"

    # Step 2: Apply recalibration
    gatk ApplyBQSR \
        -I "$DEDUP_BAM" \
        -R "$REF_GENOME" \
        --bqsr-recal-file "${ALIGN_DIR}/${SAMPLE_NAME}.recal_data.table" \
        -O "$RECAL_BAM"

    samtools index "$RECAL_BAM"

    # Clean up intermediate file
    rm "$DEDUP_BAM" "${DEDUP_BAM}.bai"

    log "Base quality recalibration completed"
}

# Variant calling with GATK HaplotypeCaller
call_variants() {
    log "Calling variants with GATK HaplotypeCaller..."

    gatk HaplotypeCaller \
        -R "$REF_GENOME" \
        -I "$RECAL_BAM" \
        -O "$GVCF_FILE" \
        -ERC GVCF \
        --native-pair-hmm-threads 4

    log "Variant calling completed"
}

# Genotype a GVCF (a single sample here; see joint_genotyping.sh for cohorts)
joint_genotyping() {
    local input_gvcf="$1"
    local output_vcf="$2"

    log "Performing joint genotyping..."

    gatk GenotypeGVCFs \
        -R "$REF_GENOME" \
        -V "$input_gvcf" \
        -O "$output_vcf"

    log "Joint genotyping completed"
}

# Variant filtering
filter_variants() {
    local input_vcf="$1"
    local output_vcf="$2"

    log "Filtering variants..."

    # Separate SNPs and indels
    gatk SelectVariants \
        -R "$REF_GENOME" \
        -V "$input_vcf" \
        -select-type SNP \
        -O "${VARIANTS_DIR}/${SAMPLE_NAME}.snps.vcf.gz"

    gatk SelectVariants \
        -R "$REF_GENOME" \
        -V "$input_vcf" \
        -select-type INDEL \
        -O "${VARIANTS_DIR}/${SAMPLE_NAME}.indels.vcf.gz"

    # Filter SNPs
    gatk VariantFiltration \
        -R "$REF_GENOME" \
        -V "${VARIANTS_DIR}/${SAMPLE_NAME}.snps.vcf.gz" \
        --filter-expression "QD < 2.0" \
        --filter-name "QD2" \
        --filter-expression "QUAL < 30.0" \
        --filter-name "QUAL30" \
        --filter-expression "SOR > 3.0" \
        --filter-name "SOR3" \
        --filter-expression "FS > 60.0" \
        --filter-name "FS60" \
        --filter-expression "MQ < 40.0" \
        --filter-name "MQ40" \
        --filter-expression "MQRankSum < -12.5" \
        --filter-name "MQRankSum-12.5" \
        --filter-expression "ReadPosRankSum < -8.0" \
        --filter-name "ReadPosRankSum-8" \
        -O "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.snps.vcf.gz"

    # Filter indels
    gatk VariantFiltration \
        -R "$REF_GENOME" \
        -V "${VARIANTS_DIR}/${SAMPLE_NAME}.indels.vcf.gz" \
        --filter-expression "QD < 2.0" \
        --filter-name "QD2" \
        --filter-expression "QUAL < 30.0" \
        --filter-name "QUAL30" \
        --filter-expression "FS > 200.0" \
        --filter-name "FS200" \
        --filter-expression "ReadPosRankSum < -20.0" \
        --filter-name "ReadPosRankSum-20" \
        -O "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.indels.vcf.gz"

    # Merge filtered variants
    gatk MergeVcfs \
        -I "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.snps.vcf.gz" \
        -I "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.indels.vcf.gz" \
        -O "$output_vcf"

    log "Variant filtering completed"
}

# Annotate variants with SnpEff
annotate_variants() {
    local input_vcf="$1"
    local output_vcf="$2"

    log "Annotating variants with SnpEff..."

    snpEff -Xmx8g hg38 "$input_vcf" > "$output_vcf"

    # Compress and index
    bgzip "$output_vcf"
    tabix -p vcf "${output_vcf}.gz"

    log "Variant annotation completed"
}

# Generate statistics and reports
generate_report() {
    log "Generating variant statistics and reports..."

    # Basic statistics
    bcftools stats "$VCF_FILE" > "${REPORTS_DIR}/${SAMPLE_NAME}.stats"

    # Plot statistics
    plot-vcfstats -p "${REPORTS_DIR}/${SAMPLE_NAME}_plots" "${REPORTS_DIR}/${SAMPLE_NAME}.stats"

    # Create summary
    local total_variants=$(bcftools view -H "$VCF_FILE" | wc -l)
    local passed_variants=$(bcftools view -H -f PASS "$VCF_FILE" | wc -l)
    local snp_count=$(bcftools view -H -v snps "$VCF_FILE" | wc -l)
    local indel_count=$(bcftools view -H -v indels "$VCF_FILE" | wc -l)

    cat > "${REPORTS_DIR}/${SAMPLE_NAME}_summary.txt" << EOF
Variant Calling Pipeline Summary
===============================

Sample: $SAMPLE_NAME
Total variants called: $total_variants
Variants passing filters: $passed_variants
SNPs: $snp_count
Indels: $indel_count

Generated on: $(date)
EOF

    log "Report generation completed"
}

# Main pipeline execution
main() {
    setup_directories
    validate_input

    # Step 1: Initial QC
    run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"

    # Step 2: Alignment
    align_reads
    sort_bam
    mark_duplicates
    recalibrate_base_qualities

    # Step 3: Variant calling
    call_variants

    # Step 4: Genotype the GVCF, then filter and annotate
    joint_genotyping "$GVCF_FILE" "${VARIANTS_DIR}/${SAMPLE_NAME}.raw.vcf.gz"
    filter_variants "${VARIANTS_DIR}/${SAMPLE_NAME}.raw.vcf.gz" "$VCF_FILE"
    annotate_variants "$VCF_FILE" "${ANNOT_DIR}/${SAMPLE_NAME}.annotated.vcf"

    # Step 5: Reporting
    generate_report

    log "Variant calling pipeline completed successfully!"
    log "Results available in: ${RESULTS_DIR}"
}

# Run main function
main "$@"

Alternative Pipeline with Bcftools

Create an alternative pipeline using bcftools bcftools_pipeline.sh:

#!/bin/bash

# bcftools_pipeline.sh - Alternative variant calling pipeline using bcftools
# Author: Your Name
# Date: $(date)

set -euo pipefail

# Configuration (same as above)
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
RESULTS_DIR="${PROJECT_DIR}/results"
ALIGN_DIR="${RESULTS_DIR}/alignments"
VARIANTS_DIR="${RESULTS_DIR}/variants"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"

# Reference files
REF_GENOME="${REF_DIR}/hg38.fasta"
DBSNP_VCF="${REF_DIR}/dbsnp_138.hg38.vcf.gz"

# Sample information
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"

# Output files
SORTED_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.sorted.bam"
DEDUP_BAM="${ALIGN_DIR}/${SAMPLE_NAME}.dedup.bam"
VCF_FILE="${VARIANTS_DIR}/${SAMPLE_NAME}.bcftools.vcf.gz"

# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/bcftools_pipeline.log"
}

# Align reads with BWA and add mate tags for samtools markdup
align_reads() {
    log "Aligning reads with BWA MEM..."

    # samtools markdup requires the mate-score (ms/MC) tags that fixmate -m
    # adds; fixmate needs name-collated input, which bwa mem produces
    bwa mem -t 8 -R "@RG\tID:${SAMPLE_NAME}\tSM:${SAMPLE_NAME}\tPL:ILLUMINA" \
        "$REF_GENOME" "$FORWARD_READS" "$REVERSE_READS" | \
    samtools fixmate -m - - | \
    samtools sort -@ 4 -o "$SORTED_BAM"

    samtools index "$SORTED_BAM"
    log "Alignment completed"
}

# Mark duplicates with samtools
mark_duplicates() {
    log "Marking duplicates with samtools..."

    samtools markdup -@ 4 "$SORTED_BAM" "$DEDUP_BAM"
    samtools index "$DEDUP_BAM"

    # Clean up intermediate file
    rm "$SORTED_BAM" "${SORTED_BAM}.bai"

    log "Duplicate marking completed"
}

# Variant calling with bcftools
call_variants_bcftools() {
    log "Calling variants with bcftools..."

    # mpileup and call variants
    bcftools mpileup -Ou -f "$REF_GENOME" "$DEDUP_BAM" | \
    bcftools call -mv -Oz -o "$VCF_FILE"

    # Index the VCF file
    bcftools index "$VCF_FILE"

    log "Variant calling with bcftools completed"
}

# Filter variants with bcftools
filter_variants_bcftools() {
    log "Filtering variants with bcftools..."

    # Filter based on quality and depth
    bcftools filter -Oz -o "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz" \
        -s "LowQual" -i '%QUAL>30 && DP>10' "$VCF_FILE"

    # Index filtered VCF
    bcftools index "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz"

    log "Variant filtering completed"
}

# Annotate variants with bcftools
annotate_variants_bcftools() {
    log "Annotating variants with bcftools..."

    bcftools annotate -Oz -o "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz" \
        -a "$DBSNP_VCF" -c CHROM,POS,ID "${VARIANTS_DIR}/${SAMPLE_NAME}.filtered.vcf.gz"

    bcftools index "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz"

    log "Variant annotation completed"
}

# Generate statistics
generate_stats() {
    log "Generating variant statistics..."

    bcftools stats "${VARIANTS_DIR}/${SAMPLE_NAME}.annotated.vcf.gz" > \
        "${REPORTS_DIR}/${SAMPLE_NAME}_bcftools.stats"

    log "Statistics generation completed"
}

# Main execution
main() {
    mkdir -p "${ALIGN_DIR}" "${VARIANTS_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"

    align_reads
    mark_duplicates
    call_variants_bcftools
    filter_variants_bcftools
    annotate_variants_bcftools
    generate_stats

    log "Bcftools variant calling pipeline completed successfully!"
}

main "$@"

Making Scripts Executable

chmod +x variant_calling_pipeline.sh
chmod +x bcftools_pipeline.sh

Advanced Features

1. Cohort Analysis Script

Create a script for joint genotyping multiple samples joint_genotyping.sh:

#!/bin/bash

# joint_genotyping.sh - Joint genotyping for cohort analysis

set -euo pipefail

PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VARIANTS_DIR="${PROJECT_DIR}/results/variants"
REF_GENOME="${PROJECT_DIR}/data/reference/hg38.fasta"

# List of GVCF files
GVCF_LIST_FILE="${VARIANTS_DIR}/gvcf_list.txt"

# Output
COHORT_VCF="${VARIANTS_DIR}/cohort_variants.vcf.gz"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

# Create GVCF list (exclude the combined cohort GVCF from any previous run)
create_gvcf_list() {
    log "Creating GVCF list..."
    find "$VARIANTS_DIR" -name "*.g.vcf.gz" ! -name "cohort.g.vcf.gz" > "$GVCF_LIST_FILE"
    log "GVCF list created with $(wc -l < "$GVCF_LIST_FILE") files"
}

# Joint genotyping: merge per-sample GVCFs, then genotype the cohort.
# GenotypeGVCFs accepts a single GVCF (or a GenomicsDB workspace), not a
# list file, so the per-sample GVCFs are combined with CombineGVCFs first.
run_joint_genotyping() {
    log "Running joint genotyping..."

    local gvcf_args=()
    while IFS= read -r gvcf; do
        gvcf_args+=(-V "$gvcf")
    done < "$GVCF_LIST_FILE"

    gatk CombineGVCFs \
        -R "$REF_GENOME" \
        "${gvcf_args[@]}" \
        -O "${VARIANTS_DIR}/cohort.g.vcf.gz"

    gatk GenotypeGVCFs \
        -R "$REF_GENOME" \
        -V "${VARIANTS_DIR}/cohort.g.vcf.gz" \
        -O "$COHORT_VCF"

    # .vcf.gz outputs are indexed automatically by GATK4
    log "Joint genotyping completed"
}

# Main execution
main() {
    create_gvcf_list
    run_joint_genotyping
    log "Cohort analysis completed"
}

main "$@"

2. Quality Control and Validation

Add comprehensive QC functions:

# Error helper used by the validation functions below
error_exit() {
    log "ERROR: $1"
    exit 1
}

# Validate BAM file
validate_bam() {
    local bam_file="$1"
    log "Validating BAM file: $bam_file"

    # Check if file exists
    [[ -f "$bam_file" ]] || error_exit "BAM file not found: $bam_file"

    # Check BAM header
    if ! samtools quickcheck "$bam_file"; then
        error_exit "BAM file failed validation: $bam_file"
    fi

    # Check if indexed
    if [[ ! -f "${bam_file}.bai" ]]; then
        log "Indexing BAM file..."
        samtools index "$bam_file"
    fi

    log "BAM validation passed"
}

# Calculate alignment statistics
alignment_stats() {
    local bam_file="$1"
    local sample_name="$2"

    log "Calculating alignment statistics for $sample_name..."

    # Basic statistics
    samtools flagstat "$bam_file" > "${ALIGN_DIR}/${sample_name}.flagstat.txt"

    # Insert size distribution
    samtools stats "$bam_file" | grep ^IS > "${ALIGN_DIR}/${sample_name}.insert_size.txt"

    # Coverage statistics (samtools depth skips zero-coverage positions, so
    # this is the mean over covered bases; add -a for a genome-wide mean)
    samtools depth "$bam_file" | awk '{sum+=$3; count++} END {print "Average coverage: " sum/count}' \
        > "${ALIGN_DIR}/${sample_name}.coverage.txt"

    log "Alignment statistics calculated"
}

# Validate VCF file
validate_vcf() {
    local vcf_file="$1"
    log "Validating VCF file: $vcf_file"

    # Check if file exists
    [[ -f "$vcf_file" ]] || error_exit "VCF file not found: $vcf_file"

    # Check VCF format
    if ! bcftools view -h "$vcf_file" >/dev/null 2>&1; then
        error_exit "Invalid VCF format: $vcf_file"
    fi

    log "VCF validation passed"
}

3. Parallel Processing

Implement parallel processing for multiple samples:

# Process multiple samples in parallel
process_samples_parallel() {
    local sample_list="$1"
    local max_jobs="${2:-4}"

    log "Processing samples in parallel (max ${max_jobs} jobs)..."

    if command -v parallel &> /dev/null; then
        # Use GNU parallel (assumes the pipeline script takes the sample name as $1)
        parallel -j "$max_jobs" ./variant_calling_pipeline.sh {} < "$sample_list"
    else
        # Fallback to basic background jobs
        local job_count=0
        while IFS= read -r sample; do
            if [[ $((job_count % max_jobs)) -eq 0 && $job_count -gt 0 ]]; then
                wait  # Wait for previous batch to complete
            fi

            ./variant_calling_pipeline.sh "$sample" &
            ((job_count++))
        done < "$sample_list"
        wait  # Wait for all jobs to complete
    fi

    log "All samples processed"
}
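
Note that variant_calling_pipeline.sh as written hardcodes SAMPLE_NAME, so both branches above assume the script has been adapted to read the sample name from its first argument. Under that assumption, usage looks like:

# One sample name per line
printf '%s\n' sample1 sample2 sample3 > samples.txt

# Run up to 4 samples concurrently
process_samples_parallel samples.txt 4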

Configuration Management

Create a configuration file variant_calling_config.ini:

[general]
project_name=variant_calling
threads=8
memory=16G

[alignment]
aligner=bwa
read_group_platform=ILLUMINA

[variant_calling]
caller=gatk
joint_genotyping=true
cohort_size_threshold=10

[filtering]
snp_filter_expression=QD < 2.0 || FS > 60.0 || MQ < 40.0
indel_filter_expression=QD < 2.0 || FS > 200.0

[annotation]
annotator=snpeff
database=hg38

[quality_control]
min_mapping_quality=30
min_base_quality=20
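
None of the scripts above parse this file automatically. A minimal bash reader for this layout (a sketch with no support for quoting or inline comments; variable names like CFG_GENERAL_THREADS are a convention chosen here):

# Load variant_calling_config.ini into CFG_<SECTION>_<KEY> variables
read_config() {
    local file="$1" section=""
    while IFS= read -r line; do
        [[ -z "${line// }" ]] && continue          # skip blank lines
        if [[ "$line" =~ ^\[(.+)\]$ ]]; then
            section="${BASH_REMATCH[1]}"
        elif [[ "$line" =~ ^([^=]+)=(.*)$ ]]; then
            # e.g. [general] threads=8 -> CFG_GENERAL_THREADS=8
            printf -v "CFG_${section^^}_${BASH_REMATCH[1]^^}" '%s' "${BASH_REMATCH[2]}"
        fi
    done < "$file"
}

read_config variant_calling_config.ini
echo "Using ${CFG_GENERAL_THREADS} threads with ${CFG_ALIGNMENT_ALIGNER}"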

Running the Pipeline

To run the pipeline:

# Make scripts executable
chmod +x variant_calling_pipeline.sh
chmod +x bcftools_pipeline.sh
chmod +x joint_genotyping.sh

# Run the GATK pipeline
./variant_calling_pipeline.sh

# Or run the bcftools alternative
./bcftools_pipeline.sh

# For cohort analysis
./joint_genotyping.sh

# For verbose output
./variant_calling_pipeline.sh 2>&1 | tee variant_calling_run.log

Expected Output Structure

After running the pipeline, you’ll have:

variant_calling_pipeline/
├── data/
│   ├── raw/
│   │   ├── sample_R1.fastq.gz
│   │   └── sample_R2.fastq.gz
│   └── reference/
│       ├── hg38.fasta
│       ├── dbsnp_138.hg38.vcf.gz
│       └── Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
├── results/
│   ├── alignments/
│   │   ├── sample.recal.bam
│   │   ├── sample.recal.bam.bai
│   │   ├── sample.dedup.metrics.txt
│   │   └── sample.recal_data.table
│   ├── variants/
│   │   ├── sample.g.vcf.gz
│   │   ├── sample.raw.vcf.gz
│   │   └── sample.vcf.gz
│   ├── annotations/
│   │   └── sample.annotated.vcf.gz
│   └── reports/
│       ├── sample.stats
│       ├── sample_summary.txt
│       └── fastqc_raw/
├── logs/
│   └── pipeline.log
├── tmp/
├── variant_calling_config.ini
├── variant_calling_pipeline.sh
├── bcftools_pipeline.sh
└── joint_genotyping.sh

Best Practices

  1. Modularity: Break complex tasks into functions
  2. Error Handling: Use set -euo pipefail and proper error checking (see the trap sketch after this list)
  3. Logging: Implement comprehensive logging for debugging
  4. Configuration: Use external config files for flexibility
  5. Validation: Validate inputs, outputs, and intermediate files
  6. Documentation: Comment your code thoroughly
  7. Portability: Use relative paths and environment variables
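
For error handling (item 2), a minimal trap that pairs with set -euo pipefail and reports where a run died:

#!/bin/bash
set -euo pipefail

# Print the failing line and exit status before the script aborts
trap 'echo "ERROR: command failed at line $LINENO (exit status $?)" >&2' ERR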

Performance Optimization

  1. Parallel Processing: Use multiple cores for alignment and variant calling
  2. Memory Management: Monitor and limit memory usage
  3. Disk I/O: Minimize file operations and use appropriate buffering
  4. Caching: Reuse intermediate results when possible
  5. Compression: Use appropriate compression for large output files
  6. Resource Monitoring: Track CPU and memory usage (a quick example follows this list)
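
For resource monitoring (item 6), the GNU /usr/bin/time binary (not the shell builtin) reports peak memory and CPU time for any single step; note that 2> here also captures the tool's own stderr:

# Peak RSS, CPU, and wall time for one pipeline step
/usr/bin/time -v gatk HaplotypeCaller -R "$REF_GENOME" -I "$RECAL_BAM" \
    -O "$GVCF_FILE" -ERC GVCF 2> "${LOGS_DIR}/haplotypecaller.time.log"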

This variant calling pipeline provides a comprehensive framework for detecting genetic variants from sequencing data, implementing both GATK Best Practices and alternative approaches using bcftools. The modular design allows for easy customization and extension based on specific analysis requirements.

Frequently Asked Questions

Q: What is variant calling and when do I need it?

A: Variant calling identifies genetic differences (SNPs, indels) between your sample and a reference genome. You need it for genomics studies, disease research, population genetics, personalized medicine, and agricultural genomics. It’s essential for whole genome sequencing (WGS), whole exome sequencing (WES), and targeted sequencing.

Q: What’s the difference between GATK and bcftools for variant calling?

A: GATK implements industry best practices with sophisticated error correction, BQSR (Base Quality Score Recalibration), and advanced filtering. bcftools is faster and simpler but less refined. Use GATK for high-quality clinical/research variants; bcftools for quick analysis or large population studies where speed matters.

Q: How much computational power do I need for variant calling?

A: For human WGS: 32GB+ RAM, 16+ CPU cores, 500GB storage. WES needs less: 16GB RAM, 8 cores, 100GB storage. GATK HaplotypeCaller is memory-intensive. Processing time: WGS (6-12 hours), WES (2-4 hours) depending on coverage and sample complexity.

Q: What sequencing coverage do I need for reliable variant calling?

A: Minimum 30x coverage for WGS, 100x for WES. Higher coverage (50-100x WGS) improves sensitivity for rare variants and structural variants. Low coverage (<10x) misses many variants and has high false positive rates. Coverage uniformity matters as much as average depth.
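
To check what a BAM actually achieves, samtools coverage reports per-contig depth and breadth in one pass (file name hypothetical):

# Per-chromosome depth and breadth of coverage
samtools coverage sample.recal.bam

# Rough genome-wide mean depth, weighting each contig by its length
samtools coverage sample.recal.bam | \
    awk 'NR>1 {sum+=$7*$3; len+=$3} END {print "Mean depth:", sum/len}'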

Q: How do I interpret variant calling quality scores?

A: Key metrics: QUAL score (>30 for high confidence), GQ (genotype quality >20), DP (depth 10-100x), and FILTER field (PASS only). AD (allele depth) should support the called genotype. Use VQSR for sophisticated filtering instead of hard thresholds.
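
To inspect these fields in your own VCF, bcftools query extracts them per site (the bracketed per-sample fields assume a single-sample VCF):

# QUAL, FILTER, depth, and genotype quality per site
bcftools query -f '%CHROM\t%POS\t%QUAL\t%FILTER\t[%DP]\t[%GQ]\n' sample.vcf.gz | head

# Keep only PASS sites with QUAL > 30 and GQ > 20
bcftools view -f PASS -i 'QUAL>30 && FMT/GQ>20' -Oz -o sample.highconf.vcf.gz sample.vcf.gz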

Q: What’s Base Quality Score Recalibration (BQSR) and why is it important?

A: BQSR corrects systematic errors in base quality scores reported by sequencers. It uses known variant sites to model true error rates and adjusts quality scores accordingly. This improves variant calling accuracy by 10-20%, especially for low-frequency variants and indels.

Q: Should I call variants on individual samples or jointly?

A: Joint calling (multiple samples together) is more sensitive and accurate, especially for rare variants. It improves genotype quality and enables better filtering. However, it’s computationally expensive. Use individual calling for single samples or when adding samples incrementally.

Q: How do I handle indels and structural variants?

A: Standard pipelines detect small indels (<50bp). For larger structural variants, use specialized tools like Manta, LUMPY, or DELLY. Long-read sequencing (PacBio, Nanopore) is better for complex structural variants. Include indel realignment step for better indel calling accuracy.

Q: What annotation tools should I use after variant calling?

A: Essential annotations: VEP (Ensembl Variant Effect Predictor) or ANNOVAR for functional effects, dbSNP for known variants, ClinVar for clinical significance, gnomAD for population frequencies, and CADD/DANN for pathogenicity prediction. Choose based on your species and research goals.

Q: How do I validate my variant calling results?

A: Validation strategies: Check known variants (dbSNP concordance >95%), use trio data for Mendelian error rates (<2%), compare with different callers, validate high-impact variants with Sanger sequencing, and use benchmark datasets (Genome in a Bottle) for accuracy assessment.
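
A rough dbSNP concordance check can be sketched with bcftools isec (both VCFs must be bgzipped and indexed; the output directory name is arbitrary):

# 0000.vcf = records private to your calls, 0002.vcf = your records also in dbSNP
bcftools isec -p isec_out sample.vcf.gz dbsnp_138.hg38.vcf.gz

novel=$(grep -vc '^#' isec_out/0000.vcf)
shared=$(grep -vc '^#' isec_out/0002.vcf)
echo "Known: $shared  Novel: $novel"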

Q: Can I use this pipeline for non-human organisms?

A: Yes! Replace reference genome and known variant files with your organism’s data. Adjust GATK parameters if needed (some are human-optimized). For organisms without known variant databases, skip BQSR or use variants from related species. Consider ploidy differences for plants or other organisms.
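
For the ploidy point, HaplotypeCaller exposes it directly; for example, calling a tetraploid sample (paths hypothetical):

# Default ploidy is 2; override for polyploids
gatk HaplotypeCaller -R ref.fasta -I sample.bam -O sample.g.vcf.gz \
    -ERC GVCF --sample-ploidy 4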

Continue Learning

One Small Win

Try this quick command to get started:

gatk HaplotypeCaller --version

Copy and paste this into your terminal to get started immediately.

Start Your Own Project

Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.

Use This Template
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run