Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Overview
This tutorial will guide you through creating a complete bash pipeline for genome extraction, from quality control of raw sequencing data to generating a polished genome assembly. We’ll automate the entire workflow using bash scripting best practices.
Prerequisites
Before starting this tutorial, you should have:
- Basic understanding of bash scripting
- Familiarity with genomic file formats (FASTQ, FASTA)
- Command line experience
- Tools installed:
fastqc, multiqc, trim_galore, spades, quast, seqtk
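If you use conda, all of these tools are available from the bioconda channel; a typical install (package names assumed from bioconda) looks like:
# Install the required tools via conda (assumes conda with bioconda/conda-forge configured)
conda install -c bioconda -c conda-forge fastqc multiqc trim-galore spades quast seqtk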
Pipeline Workflow
Our genome extraction pipeline will follow these steps:
- Quality control of raw reads
- Adapter trimming and quality filtering
- Genome assembly
- Assembly quality assessment
- Final reporting
Setting Up the Project Structure
First, let’s create a directory structure for our pipeline:
# Create project directory
mkdir genome_extraction_pipeline
cd genome_extraction_pipeline
# Create directory structure
mkdir -p data/raw data/processed results/assemblies results/reports logs
# Download sample data (or use your own)
# For this tutorial, we'll assume you have paired-end FASTQ files
# data/raw/sample_R1.fastq.gz
# data/raw/sample_R2.fastq.gz
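If you don't yet have suitable input data, seqtk (from the prerequisites) can subsample a larger dataset into a small test set. A sketch, with placeholder input paths:
# Subsample 10,000 read pairs for a quick test run
# (using the same seed, -s100, for both files keeps R1 and R2 in sync)
seqtk sample -s100 /path/to/full_R1.fastq.gz 10000 | gzip > data/raw/sample_R1.fastq.gz
seqtk sample -s100 /path/to/full_R2.fastq.gz 10000 | gzip > data/raw/sample_R2.fastq.gz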
Creating the Main Pipeline Script
Create a main pipeline script genome_pipeline.sh:
#!/bin/bash
# genome_pipeline.sh - Genome extraction pipeline
# Author: Your Name
# Date: YYYY-MM-DD
set -euo pipefail # Exit on error, undefined vars, pipe failures
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_DIR}/results"
ASSEMBLY_DIR="${RESULTS_DIR}/assemblies"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
# Sample information
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"
# Output files
TRIMMED_FORWARD="${PROCESSED_DIR}/${SAMPLE_NAME}_R1_trimmed.fastq.gz"
TRIMMED_REVERSE="${PROCESSED_DIR}/${SAMPLE_NAME}_R2_trimmed.fastq.gz"
ASSEMBLY_OUTPUT="${ASSEMBLY_DIR}/${SAMPLE_NAME}_assembly.fasta"
QUAST_REPORT="${REPORTS_DIR}/${SAMPLE_NAME}_quast_report"
# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}

# Create directories (mkdir runs before the first log call, so the
# log file's directory is guaranteed to exist)
setup_directories() {
    mkdir -p "${PROCESSED_DIR}" "${ASSEMBLY_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"
    log "Directory structure ready"
}

# Quality control with FastQC
run_fastqc() {
    local input_dir="$1"
    local output_dir="$2"
    log "Running FastQC on ${input_dir}..."
    mkdir -p "${output_dir}"  # FastQC does not create missing output directories
    fastqc -o "${output_dir}" -t 4 "${input_dir}"/*.f*q.gz  # matches both .fastq.gz and .fq.gz
}
# Adapter trimming with Trim Galore
trim_reads() {
    log "Trimming adapters and low-quality reads..."
    trim_galore \
        --paired \
        --cores 4 \
        --length 50 \
        --quality 20 \
        --output_dir "${PROCESSED_DIR}" \
        "${FORWARD_READS}" \
        "${REVERSE_READS}"
    # Trim Galore names paired output <basename>_val_1.fq.gz / <basename>_val_2.fq.gz;
    # rename to the paths the rest of the pipeline expects
    mv "${PROCESSED_DIR}/${SAMPLE_NAME}_R1_val_1.fq.gz" "${TRIMMED_FORWARD}"
    mv "${PROCESSED_DIR}/${SAMPLE_NAME}_R2_val_2.fq.gz" "${TRIMMED_REVERSE}"
}
# Genome assembly with SPAdes
assemble_genome() {
    log "Assembling genome with SPAdes..."
    spades.py \
        --pe1-1 "${TRIMMED_FORWARD}" \
        --pe1-2 "${TRIMMED_REVERSE}" \
        --careful \
        --cov-cutoff auto \
        -t 4 \
        -o "${ASSEMBLY_DIR}/${SAMPLE_NAME}_spades"
    # Copy the final contigs to a stable, predictable path
    cp "${ASSEMBLY_DIR}/${SAMPLE_NAME}_spades/contigs.fasta" "${ASSEMBLY_OUTPUT}"
}
# Assembly quality assessment with QUAST
assess_assembly() {
    log "Assessing assembly quality with QUAST..."
    quast.py \
        -o "${QUAST_REPORT}" \
        -t 4 \
        --min-contig 1000 \
        "${ASSEMBLY_OUTPUT}"
}

# Generate final report with MultiQC
generate_report() {
    log "Generating final report with MultiQC..."
    multiqc \
        -o "${REPORTS_DIR}" \
        "${REPORTS_DIR}" \
        "${PROCESSED_DIR}"
}
# Main pipeline execution
main() {
    setup_directories

    # Step 1: Initial QC
    run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"

    # Step 2: Trim reads
    trim_reads

    # Step 3: Post-trimming QC
    run_fastqc "${PROCESSED_DIR}" "${REPORTS_DIR}/fastqc_trimmed"

    # Step 4: Assemble genome
    assemble_genome

    # Step 5: Assess assembly
    assess_assembly

    # Step 6: Generate final report
    generate_report

    log "Pipeline completed successfully!"
    log "Results available in: ${RESULTS_DIR}"
}
# Run main function
main "$@"
Making the Script Executable
chmod +x genome_pipeline.sh
Advanced Pipeline Features
Let’s enhance our pipeline with additional features:
1. Configuration File
Create a configuration file config.ini:
[general]
project_name=genome_extraction
threads=8
memory=16G
[quality_control]
min_length=50
min_quality=20
[trimming]
adapter_type=illumina
cut_front=true
cut_tail=true
[assembly]
assembler=spades
kmer_sizes=21,33,55,77
careful=true
[quast]
min_contig=500
extensive_misassemblies=true
2. Enhanced Pipeline Script
Update genome_pipeline.sh with configuration support:
#!/bin/bash
# Enhanced genome_pipeline.sh with configuration support
set -euo pipefail
# Load configuration. Note: this simple parser matches key=value lines and
# ignores the [section] headers, so keys must be unique across the whole file.
CONFIG_FILE="config.ini"
if [[ -f "$CONFIG_FILE" ]]; then
    get_config_value() {
        grep "^$1=" "$CONFIG_FILE" | cut -d'=' -f2
    }
    THREADS=$(get_config_value "threads")
    MIN_LENGTH=$(get_config_value "min_length")
    MIN_QUALITY=$(get_config_value "min_quality")
else
    # Sensible defaults when no config file is present
    THREADS=4
    MIN_LENGTH=50
    MIN_QUALITY=20
fi

# ... rest of the script with enhanced features
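If you need section-aware lookups instead, one possible awk-based sketch (the get_ini_value helper is hypothetical, not part of the original script):
# get_ini_value <section> <key> - look up a key inside a specific [section]
get_ini_value() {
    awk -F'=' -v section="[$1]" -v key="$2" '
        $0 == section           { in_section = 1; next }  # entered the wanted section
        /^\[/                   { in_section = 0 }        # any other header ends it
        in_section && $1 == key { print $2; exit }        # first match wins
    ' "$CONFIG_FILE"
}

THREADS=$(get_ini_value general threads)
KMER_SIZES=$(get_ini_value assembly kmer_sizes)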
Quality Control and Validation
Add validation functions to ensure data integrity:
# Validate input files
validate_input() {
    log "Validating input files..."
    if [[ ! -f "$FORWARD_READS" ]] || [[ ! -f "$REVERSE_READS" ]]; then
        log "ERROR: Input files not found!"
        exit 1
    fi

    # Check gzip integrity of both read files
    if ! gzip -t "$FORWARD_READS" 2>/dev/null; then
        log "ERROR: Forward reads file is corrupted!"
        exit 1
    fi
    if ! gzip -t "$REVERSE_READS" 2>/dev/null; then
        log "ERROR: Reverse reads file is corrupted!"
        exit 1
    fi
    log "Input validation passed"
}
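For paired-end data it is also worth confirming that R1 and R2 contain the same number of reads (a FASTQ record is four lines). A minimal sketch, assuming GNU zcat is available:
# Optional: confirm R1 and R2 contain the same number of read records
check_read_counts() {
    local r1_reads r2_reads
    r1_reads=$(( $(zcat "$FORWARD_READS" | wc -l) / 4 ))
    r2_reads=$(( $(zcat "$REVERSE_READS" | wc -l) / 4 ))
    if [[ "$r1_reads" -ne "$r2_reads" ]]; then
        log "ERROR: Read counts differ (R1: ${r1_reads}, R2: ${r2_reads})"
        exit 1
    fi
    log "Read counts match: ${r1_reads} pairs"
}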
# Check disk space (df -BG is GNU coreutils; on macOS/BSD use df -g instead)
check_disk_space() {
    local required_space_gb=50
    local available_space
    available_space=$(df -BG . | awk 'NR==2 {print $4}' | sed 's/G//')
    if [[ "$available_space" -lt "$required_space_gb" ]]; then
        log "WARNING: Low disk space. Available: ${available_space}GB, Required: ${required_space_gb}GB"
    fi
}
Parallel Processing
Implement parallel processing for faster execution:
# Run multiple samples in parallel
process_samples_parallel() {
    local samples=("$@")
    local max_jobs=4
    log "Processing ${#samples[@]} samples in parallel (max ${max_jobs} jobs)..."

    # Use GNU parallel if available
    if command -v parallel &> /dev/null; then
        printf '%s\n' "${samples[@]}" | parallel -j "$max_jobs" ./process_sample.sh {}
    else
        # Fallback to basic background jobs
        local job_count=0
        for sample in "${samples[@]}"; do
            if [[ $((job_count % max_jobs)) -eq 0 && $job_count -gt 0 ]]; then
                wait  # Wait for the previous batch to complete
            fi
            ./process_sample.sh "$sample" &
            job_count=$((job_count + 1))  # note: ((job_count++)) returns 1 on the first pass and would trip set -e
        done
        wait  # Wait for all remaining jobs to complete
    fi
}
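Note that process_sample.sh is referenced above but not defined in this tutorial. A minimal hypothetical version could re-invoke the main pipeline with the sample name passed through the environment, assuming SAMPLE_NAME in genome_pipeline.sh is changed to SAMPLE_NAME="${SAMPLE_NAME:-sample}" so it can be overridden:
#!/bin/bash
# process_sample.sh - hypothetical per-sample wrapper for process_samples_parallel
set -euo pipefail

sample="$1"

# Run the single-sample pipeline with the sample name injected via the
# environment (requires genome_pipeline.sh to honour an existing SAMPLE_NAME)
SAMPLE_NAME="$sample" ./genome_pipeline.sh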
Error Handling and Recovery
Implement robust error handling:
# Cleanup function
cleanup() {
    local exit_code=$?
    if [[ $exit_code -ne 0 ]]; then
        log "Pipeline failed with exit code $exit_code"
        log "Check logs for details: ${LOGS_DIR}/pipeline.log"
    fi
}

# Set trap for cleanup
trap cleanup EXIT

# Checkpoint system for resume capability
create_checkpoint() {
    local step="$1"
    echo "$step" > "${LOGS_DIR}/checkpoint"
    log "Checkpoint created: $step"
}

check_checkpoint() {
    if [[ -f "${LOGS_DIR}/checkpoint" ]]; then
        local last_step
        last_step=$(cat "${LOGS_DIR}/checkpoint")
        log "Resuming from checkpoint: $last_step"
        return 0
    fi
    return 1
}
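One way to put these checkpoint functions to work is a small wrapper that records each completed step and skips it on re-run. A sketch, which appends to the checkpoint file (>>) instead of overwriting it:
# Run a step only if it has not been checkpointed yet
run_step() {
    local step="$1"; shift
    if [[ -f "${LOGS_DIR}/checkpoint" ]] && grep -qx "$step" "${LOGS_DIR}/checkpoint"; then
        log "Skipping completed step: $step"
        return 0
    fi
    "$@"                                      # run the actual step function
    echo "$step" >> "${LOGS_DIR}/checkpoint"  # append, so earlier steps stay recorded
}

# In main(), steps would then be invoked as:
#   run_step "trimming" trim_reads
#   run_step "assembly" assemble_genome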
Running the Pipeline
To run the pipeline:
# Make script executable
chmod +x genome_pipeline.sh
# Run the pipeline
./genome_pipeline.sh
# To capture all output (stdout and stderr) to a log file while still watching it
./genome_pipeline.sh 2>&1 | tee pipeline_run.log
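For long runs on a remote machine, you may also want the pipeline to survive a dropped SSH session:
# Run in the background, immune to hangups, with all output captured
nohup ./genome_pipeline.sh > pipeline_run.log 2>&1 &

# Follow progress
tail -f pipeline_run.log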
Expected Output Structure
After running the pipeline, you’ll have:
genome_extraction_pipeline/
├── data/
│ ├── raw/
│ │ ├── sample_R1.fastq.gz
│ │ └── sample_R2.fastq.gz
│ └── processed/
│ ├── sample_R1_trimmed.fastq.gz
│ └── sample_R2_trimmed.fastq.gz
├── results/
│ ├── assemblies/
│ │ ├── sample_spades/ (SPAdes working directory)
│ │ └── sample_assembly.fasta
│ └── reports/
│ ├── fastqc_raw/
│ ├── fastqc_trimmed/
│ ├── sample_quast_report/
│ └── multiqc_report.html
├── logs/
│ └── pipeline.log
├── config.ini
└── genome_pipeline.sh
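After a successful run, QUAST's plain-text report gives a quick view of the headline metrics (path follows the layout above):
# Headline assembly metrics (N50, total length, number of contigs, etc.)
cat results/reports/sample_quast_report/report.txt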
Best Practices
- Modularity: Break complex tasks into functions
- Error Handling: Use set -euo pipefail and proper error checking
- Logging: Implement comprehensive logging for debugging
- Configuration: Use external config files for flexibility
- Validation: Validate inputs and outputs
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use multiple cores where possible
- Memory Management: Monitor and limit memory usage (see the measurement sketch after this list)
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible
- Compression: Use appropriate compression levels for trade-offs
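For the memory management point above, GNU time (the /usr/bin/time binary on Linux, not the shell builtin; -v is GNU-specific) reports peak memory per command:
# Report wall-clock time and peak memory ("Maximum resident set size") for a step
/usr/bin/time -v spades.py --pe1-1 reads_R1.fq.gz --pe1-2 reads_R2.fq.gz -o assembly_out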
This genome extraction pipeline provides a solid foundation for automating genomic data processing workflows. You can extend it further based on your specific requirements and analysis needs.
Related Content
Protein Annotation Pipeline with Bash
Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.
RNA-Seq Analysis Pipeline with Bash
Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.
Variant Calling Pipeline with Bash
Build a comprehensive bash pipeline for variant calling from raw sequencing data to annotated VCF files using GATK and bcftools.
Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run