Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Overview
This tutorial will guide you through creating a complete bash pipeline for genome extraction, from quality control of raw sequencing data to generating a polished genome assembly. We’ll automate the entire workflow using bash scripting best practices.
Prerequisites
Before starting this tutorial, you should have:
- Basic understanding of bash scripting
- Familiarity with genomic file formats (FASTQ, FASTA)
- Command line experience
- Tools installed:
fastqc, multiqc, trim_galore, spades, quast, seqtk
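If you use conda, all of these tools are available from the bioconda channel; a typical install (package names assumed from bioconda) looks like:
# Install the required tools via conda (assumes conda with bioconda/conda-forge configured)
conda install -c bioconda -c conda-forge fastqc multiqc trim-galore spades quast seqtk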
Pipeline Workflow
Our genome extraction pipeline will follow these steps:
- Quality control of raw reads
- Adapter trimming and quality filtering
- Genome assembly
- Assembly quality assessment
- Final reporting
Setting Up the Project Structure
First, let’s create a directory structure for our pipeline:
# Create project directory
mkdir genome_extraction_pipeline
cd genome_extraction_pipeline
# Create directory structure
mkdir -p data/raw data/processed results/assemblies results/reports logs
# Download sample data (or use your own)
# For this tutorial, we'll assume you have paired-end FASTQ files
# data/raw/sample_R1.fastq.gz
# data/raw/sample_R2.fastq.gz
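If you don't yet have suitable input data, seqtk (from the prerequisites) can subsample a larger dataset into a small test set. A sketch, with placeholder input paths:
# Subsample 10,000 read pairs for a quick test run
# (using the same seed, -s100, for both files keeps R1 and R2 in sync)
seqtk sample -s100 /path/to/full_R1.fastq.gz 10000 | gzip > data/raw/sample_R1.fastq.gz
seqtk sample -s100 /path/to/full_R2.fastq.gz 10000 | gzip > data/raw/sample_R2.fastq.gz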
Creating the Main Pipeline Script
Create a main pipeline script genome_pipeline.sh:
#!/bin/bash
# genome_pipeline.sh - Genome extraction pipeline
# Author: Your Name
# Date: YYYY-MM-DD
set -euo pipefail # Exit on error, undefined vars, pipe failures
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_DIR}/results"
ASSEMBLY_DIR="${RESULTS_DIR}/assemblies"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
# Sample information
SAMPLE_NAME="sample"
FORWARD_READS="${RAW_DIR}/${SAMPLE_NAME}_R1.fastq.gz"
REVERSE_READS="${RAW_DIR}/${SAMPLE_NAME}_R2.fastq.gz"
# Output files
TRIMMED_FORWARD="${PROCESSED_DIR}/${SAMPLE_NAME}_R1_trimmed.fastq.gz"
TRIMMED_REVERSE="${PROCESSED_DIR}/${SAMPLE_NAME}_R2_trimmed.fastq.gz"
ASSEMBLY_OUTPUT="${ASSEMBLY_DIR}/${SAMPLE_NAME}_assembly.fasta"
QUAST_REPORT="${REPORTS_DIR}/${SAMPLE_NAME}_quast_report"
# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}

# Create directories (mkdir runs before the first log call, so the
# log file's directory is guaranteed to exist)
setup_directories() {
    mkdir -p "${PROCESSED_DIR}" "${ASSEMBLY_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"
    log "Directory structure ready"
}

# Quality control with FastQC
run_fastqc() {
    local input_dir="$1"
    local output_dir="$2"
    log "Running FastQC on ${input_dir}..."
    mkdir -p "${output_dir}"  # FastQC does not create missing output directories
    fastqc -o "${output_dir}" -t 4 "${input_dir}"/*.f*q.gz  # matches both .fastq.gz and .fq.gz
}
# Adapter trimming with Trim Galore
trim_reads() {
    log "Trimming adapters and low-quality reads..."
    trim_galore \
        --paired \
        --cores 4 \
        --length 50 \
        --quality 20 \
        --output_dir "${PROCESSED_DIR}" \
        "${FORWARD_READS}" \
        "${REVERSE_READS}"
    # Trim Galore names paired output <basename>_val_1.fq.gz / <basename>_val_2.fq.gz;
    # rename to the paths the rest of the pipeline expects
    mv "${PROCESSED_DIR}/${SAMPLE_NAME}_R1_val_1.fq.gz" "${TRIMMED_FORWARD}"
    mv "${PROCESSED_DIR}/${SAMPLE_NAME}_R2_val_2.fq.gz" "${TRIMMED_REVERSE}"
}
# Genome assembly with SPAdes
assemble_genome() {
    log "Assembling genome with SPAdes..."
    spades.py \
        --pe1-1 "${TRIMMED_FORWARD}" \
        --pe1-2 "${TRIMMED_REVERSE}" \
        --careful \
        --cov-cutoff auto \
        -t 4 \
        -o "${ASSEMBLY_DIR}/${SAMPLE_NAME}_spades"
    # Copy the final contigs to a stable, predictable path
    cp "${ASSEMBLY_DIR}/${SAMPLE_NAME}_spades/contigs.fasta" "${ASSEMBLY_OUTPUT}"
}
# Assembly quality assessment with QUAST
assess_assembly() {
    log "Assessing assembly quality with QUAST..."
    quast.py \
        -o "${QUAST_REPORT}" \
        -t 4 \
        --min-contig 1000 \
        "${ASSEMBLY_OUTPUT}"
}

# Generate final report with MultiQC
generate_report() {
    log "Generating final report with MultiQC..."
    multiqc \
        -o "${REPORTS_DIR}" \
        "${REPORTS_DIR}" \
        "${PROCESSED_DIR}"
}
# Main pipeline execution
main() {
    setup_directories

    # Step 1: Initial QC
    run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"

    # Step 2: Trim reads
    trim_reads

    # Step 3: Post-trimming QC
    run_fastqc "${PROCESSED_DIR}" "${REPORTS_DIR}/fastqc_trimmed"

    # Step 4: Assemble genome
    assemble_genome

    # Step 5: Assess assembly
    assess_assembly

    # Step 6: Generate final report
    generate_report

    log "Pipeline completed successfully!"
    log "Results available in: ${RESULTS_DIR}"
}
# Run main function
main "$@"
Making the Script Executable
chmod +x genome_pipeline.sh
Advanced Pipeline Features
Let’s enhance our pipeline with additional features:
1. Configuration File
Create a configuration file config.ini:
[general]
project_name=genome_extraction
threads=8
memory=16G
[quality_control]
min_length=50
min_quality=20
[trimming]
adapter_type=illumina
cut_front=true
cut_tail=true
[assembly]
assembler=spades
kmer_sizes=21,33,55,77
careful=true
[quast]
min_contig=500
extensive_misassemblies=true
2. Enhanced Pipeline Script
Update genome_pipeline.sh with configuration support:
#!/bin/bash
# Enhanced genome_pipeline.sh with configuration support
set -euo pipefail
# Load configuration. Note: this simple parser matches key=value lines and
# ignores the [section] headers, so keys must be unique across the whole file.
CONFIG_FILE="config.ini"
if [[ -f "$CONFIG_FILE" ]]; then
    get_config_value() {
        grep "^$1=" "$CONFIG_FILE" | cut -d'=' -f2
    }
    THREADS=$(get_config_value "threads")
    MIN_LENGTH=$(get_config_value "min_length")
    MIN_QUALITY=$(get_config_value "min_quality")
else
    # Sensible defaults when no config file is present
    THREADS=4
    MIN_LENGTH=50
    MIN_QUALITY=20
fi

# ... rest of the script with enhanced features
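If you need section-aware lookups instead, one possible awk-based sketch (the get_ini_value helper is hypothetical, not part of the original script):
# get_ini_value <section> <key> - look up a key inside a specific [section]
get_ini_value() {
    awk -F'=' -v section="[$1]" -v key="$2" '
        $0 == section           { in_section = 1; next }  # entered the wanted section
        /^\[/                   { in_section = 0 }        # any other header ends it
        in_section && $1 == key { print $2; exit }        # first match wins
    ' "$CONFIG_FILE"
}

THREADS=$(get_ini_value general threads)
KMER_SIZES=$(get_ini_value assembly kmer_sizes)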
Quality Control and Validation
Add validation functions to ensure data integrity:
# Validate input files
validate_input() {
    log "Validating input files..."
    if [[ ! -f "$FORWARD_READS" ]] || [[ ! -f "$REVERSE_READS" ]]; then
        log "ERROR: Input files not found!"
        exit 1
    fi

    # Check gzip integrity of both read files
    if ! gzip -t "$FORWARD_READS" 2>/dev/null; then
        log "ERROR: Forward reads file is corrupted!"
        exit 1
    fi
    if ! gzip -t "$REVERSE_READS" 2>/dev/null; then
        log "ERROR: Reverse reads file is corrupted!"
        exit 1
    fi
    log "Input validation passed"
}
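For paired-end data it is also worth confirming that R1 and R2 contain the same number of reads (a FASTQ record is four lines). A minimal sketch, assuming GNU zcat is available:
# Optional: confirm R1 and R2 contain the same number of read records
check_read_counts() {
    local r1_reads r2_reads
    r1_reads=$(( $(zcat "$FORWARD_READS" | wc -l) / 4 ))
    r2_reads=$(( $(zcat "$REVERSE_READS" | wc -l) / 4 ))
    if [[ "$r1_reads" -ne "$r2_reads" ]]; then
        log "ERROR: Read counts differ (R1: ${r1_reads}, R2: ${r2_reads})"
        exit 1
    fi
    log "Read counts match: ${r1_reads} pairs"
}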
# Check disk space (df -BG is GNU coreutils; on macOS/BSD use df -g instead)
check_disk_space() {
    local required_space_gb=50
    local available_space
    available_space=$(df -BG . | awk 'NR==2 {print $4}' | sed 's/G//')
    if [[ "$available_space" -lt "$required_space_gb" ]]; then
        log "WARNING: Low disk space. Available: ${available_space}GB, Required: ${required_space_gb}GB"
    fi
}
Parallel Processing
Implement parallel processing for faster execution:
# Run multiple samples in parallel
process_samples_parallel() {
    local samples=("$@")
    local max_jobs=4
    log "Processing ${#samples[@]} samples in parallel (max ${max_jobs} jobs)..."

    # Use GNU parallel if available
    if command -v parallel &> /dev/null; then
        printf '%s\n' "${samples[@]}" | parallel -j "$max_jobs" ./process_sample.sh {}
    else
        # Fallback to basic background jobs
        local job_count=0
        for sample in "${samples[@]}"; do
            if [[ $((job_count % max_jobs)) -eq 0 && $job_count -gt 0 ]]; then
                wait  # Wait for the previous batch to complete
            fi
            ./process_sample.sh "$sample" &
            job_count=$((job_count + 1))  # note: ((job_count++)) returns 1 on the first pass and would trip set -e
        done
        wait  # Wait for all remaining jobs to complete
    fi
}
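Note that process_sample.sh is referenced above but not defined in this tutorial. A minimal hypothetical version could re-invoke the main pipeline with the sample name passed through the environment, assuming SAMPLE_NAME in genome_pipeline.sh is changed to SAMPLE_NAME="${SAMPLE_NAME:-sample}" so it can be overridden:
#!/bin/bash
# process_sample.sh - hypothetical per-sample wrapper for process_samples_parallel
set -euo pipefail

sample="$1"

# Run the single-sample pipeline with the sample name injected via the
# environment (requires genome_pipeline.sh to honour an existing SAMPLE_NAME)
SAMPLE_NAME="$sample" ./genome_pipeline.sh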
Error Handling and Recovery
Implement robust error handling:
# Cleanup function
cleanup() {
    local exit_code=$?
    if [[ $exit_code -ne 0 ]]; then
        log "Pipeline failed with exit code $exit_code"
        log "Check logs for details: ${LOGS_DIR}/pipeline.log"
    fi
}

# Set trap for cleanup
trap cleanup EXIT

# Checkpoint system for resume capability
create_checkpoint() {
    local step="$1"
    echo "$step" > "${LOGS_DIR}/checkpoint"
    log "Checkpoint created: $step"
}

check_checkpoint() {
    if [[ -f "${LOGS_DIR}/checkpoint" ]]; then
        local last_step
        last_step=$(cat "${LOGS_DIR}/checkpoint")
        log "Resuming from checkpoint: $last_step"
        return 0
    fi
    return 1
}
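One way to put these checkpoint functions to work is a small wrapper that records each completed step and skips it on re-run. A sketch, which appends to the checkpoint file (>>) instead of overwriting it:
# Run a step only if it has not been checkpointed yet
run_step() {
    local step="$1"; shift
    if [[ -f "${LOGS_DIR}/checkpoint" ]] && grep -qx "$step" "${LOGS_DIR}/checkpoint"; then
        log "Skipping completed step: $step"
        return 0
    fi
    "$@"                                      # run the actual step function
    echo "$step" >> "${LOGS_DIR}/checkpoint"  # append, so earlier steps stay recorded
}

# In main(), steps would then be invoked as:
#   run_step "trimming" trim_reads
#   run_step "assembly" assemble_genome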
Running the Pipeline
To run the pipeline:
# Make script executable
chmod +x genome_pipeline.sh
# Run the pipeline
./genome_pipeline.sh
# To capture all output (stdout and stderr) to a log file while still watching it
./genome_pipeline.sh 2>&1 | tee pipeline_run.log
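For long runs on a remote machine, you may also want the pipeline to survive a dropped SSH session:
# Run in the background, immune to hangups, with all output captured
nohup ./genome_pipeline.sh > pipeline_run.log 2>&1 &

# Follow progress
tail -f pipeline_run.log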
Expected Output Structure
After running the pipeline, you’ll have:
genome_extraction_pipeline/
├── data/
│ ├── raw/
│ │ ├── sample_R1.fastq.gz
│ │ └── sample_R2.fastq.gz
│ └── processed/
│ ├── sample_R1_trimmed.fastq.gz
│ └── sample_R2_trimmed.fastq.gz
├── results/
│ ├── assemblies/
│ │ ├── sample_spades/ (SPAdes working directory)
│ │ └── sample_assembly.fasta
│ └── reports/
│ ├── fastqc_raw/
│ ├── fastqc_trimmed/
│ ├── sample_quast_report/
│ └── multiqc_report.html
├── logs/
│ └── pipeline.log
├── config.ini
└── genome_pipeline.sh
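After a successful run, QUAST's plain-text report gives a quick view of the headline metrics (path follows the layout above):
# Headline assembly metrics (N50, total length, number of contigs, etc.)
cat results/reports/sample_quast_report/report.txt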
Best Practices
- Modularity: Break complex tasks into functions
- Error Handling: Use set -euo pipefail and proper error checking
- Logging: Implement comprehensive logging for debugging
- Configuration: Use external config files for flexibility
- Validation: Validate inputs and outputs
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use multiple cores where possible
- Memory Management: Monitor and limit memory usage (see the measurement sketch after this list)
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible
- Compression: Use appropriate compression levels for trade-offs
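For the memory management point above, GNU time (the /usr/bin/time binary on Linux, not the shell builtin; -v is GNU-specific) reports peak memory per command:
# Report wall-clock time and peak memory ("Maximum resident set size") for a step
/usr/bin/time -v spades.py --pe1-1 reads_R1.fq.gz --pe1-2 reads_R2.fq.gz -o assembly_out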
This genome extraction pipeline provides a solid foundation for automating genomic data processing workflows. You can extend it further based on your specific requirements and analysis needs.
Related Content
Protein Annotation Pipeline with Bash
Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.
RNA-Seq Analysis Pipeline with Bash
Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.
Variant Calling Pipeline with Bash
Build a comprehensive bash pipeline for variant calling from raw sequencing data to annotated VCF files using GATK and bcftools.
Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run