Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
Overview
This tutorial will guide you through creating a robust, reusable bash pipeline framework that you can adapt for various bioinformatics workflows. We’ll focus on best practices for error handling, logging, configuration management, and parallel processing.
Prerequisites
Before starting this tutorial, you should have:
- Basic understanding of bash scripting
- Familiarity with Unix/Linux command line
- Understanding of bioinformatics file formats (FASTQ, FASTA, SAM, BAM, VCF)
- Experience with common bioinformatics tools
Framework Design Principles
Our bash pipeline framework will follow these design principles:
- Modularity: Break complex workflows into reusable functions
- Configurability: Use external configuration files
- Robustness: Implement comprehensive error handling
- Traceability: Maintain detailed logging
- Scalability: Support parallel processing
- Portability: Work across different environments
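To make these principles concrete before we start building, here is a minimal sketch of the modular step pattern the rest of the tutorial expands on. The names run_step and trim_reads are illustrative placeholders, not functions the framework defines:
#!/bin/bash
# Minimal sketch of the modular step pattern (illustrative names, not framework functions)
set -euo pipefail

run_step() {
    local step_name="$1"; shift
    echo "[$(date '+%F %T')] [INFO] Starting step: $step_name"
    "$@" || { echo "[$(date '+%F %T')] [ERROR] Step failed: $step_name" >&2; exit 1; }
    echo "[$(date '+%F %T')] [INFO] Finished step: $step_name"
}

trim_reads() {
    # Placeholder for a real command (e.g. fastp or cutadapt)
    echo "Trimming reads with ${THREADS:-4} threads..."
}

run_step "trim_reads" trim_reads
Each step is a plain function, the runner wraps it with logging and error handling, and configuration arrives through variables. The full framework below formalizes exactly this division of labor.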
Setting Up the Framework Structure
First, let’s create a directory structure for our framework:
# Create framework directory
mkdir bioinfo_pipeline_framework
cd bioinfo_pipeline_framework
# Create directory structure
mkdir -p bin config data/raw data/processed results logs tmp
# Create framework files
touch bin/pipeline.sh
touch config/pipeline.conf
touch README.md
touch Makefile
Creating the Core Framework
Let’s build the core framework in bin/pipeline.sh:
#!/bin/bash
# pipeline.sh - Bioinformatics Pipeline Framework
# A reusable framework for bioinformatics workflows
set -euo pipefail # Exit on error, undefined vars, pipe failures
# =============================================================================
# FRAMEWORK CORE
# =============================================================================
# Global variables
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
CONFIG_FILE="${PROJECT_ROOT}/config/pipeline.conf"
LOG_FILE="${PROJECT_ROOT}/logs/pipeline_$(date +%Y%m%d_%H%M%S).log"
# Load configuration
load_config() {
if [[ -f "$CONFIG_FILE" ]]; then
source "$CONFIG_FILE"
log "INFO" "Configuration loaded from $CONFIG_FILE"
else
log "WARNING" "Configuration file not found, using defaults"
fi
}
# Logging function
log() {
local level="$1"
local message="$2"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
}
# Error handling
error_exit() {
local message="$1"
log "ERROR" "$message"
exit 1
}
# Create required directories
setup_directories() {
local dirs=("data/raw" "data/processed" "results" "logs" "tmp")
for dir in "${dirs[@]}"; do
mkdir -p "${PROJECT_ROOT}/${dir}"
done
log "INFO" "Directory structure created"
}
# Validate environment
validate_environment() {
log "INFO" "Validating environment..."
# Check required tools
local required_tools=("awk" "sed" "grep" "gzip" "make")
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" &> /dev/null; then
error_exit "Required tool '$tool' not found in PATH"
fi
done
# Check disk space
local required_space=10 # GB
local available_space=$(df -BG "${PROJECT_ROOT}" | awk 'NR==2 {print $4}' | sed 's/G//')
if [[ $available_space -lt $required_space ]]; then
log "WARNING" "Low disk space: ${available_space}GB available, ${required_space}GB recommended"
fi
log "INFO" "Environment validation passed"
}
# =============================================================================
# UTILITY FUNCTIONS
# =============================================================================
# File management utilities
safe_copy() {
local source="$1"
local destination="$2"
if [[ -f "$source" ]]; then
cp "$source" "$destination"
log "INFO" "Copied $source to $destination"
else
error_exit "Source file not found: $source"
fi
}
safe_move() {
local source="$1"
local destination="$2"
if [[ -f "$source" ]]; then
mv "$source" "$destination"
log "INFO" "Moved $source to $destination"
else
error_exit "Source file not found: $source"
fi
}
# Compression utilities
compress_file() {
local file="$1"
if [[ -f "$file" ]]; then
gzip "$file"
log "INFO" "Compressed $file"
fi
}
decompress_file() {
local file="$1"
if [[ -f "$file" && "$file" == *.gz ]]; then
gunzip "$file"
log "INFO" "Decompressed $file"
fi
}
# =============================================================================
# PARALLEL PROCESSING
# =============================================================================
# Run commands in parallel with GNU parallel
run_parallel() {
local cmd_file="$1"
local max_jobs="${2:-4}"
if command -v parallel &> /dev/null; then
log "INFO" "Running parallel jobs with GNU parallel"
parallel -j "$max_jobs" < "$cmd_file"
else
log "INFO" "GNU parallel not available, using basic background jobs"
while IFS= read -r cmd; do
# Throttle so the fallback also respects max_jobs
while (( $(jobs -rp | wc -l) >= max_jobs )); do
wait -n 2>/dev/null || sleep 1
done
eval "$cmd" &
done < "$cmd_file"
wait
fi
}
# Split file for parallel processing
split_file() {
local input_file="$1"
local lines_per_chunk="$2"
local output_prefix="$3"
split -l "$lines_per_chunk" "$input_file" "${PROJECT_ROOT}/tmp/${output_prefix}_"
log "INFO" "Split $input_file into chunks of $lines_per_chunk lines"
}
# =============================================================================
# WORKFLOW MANAGEMENT
# =============================================================================
# Checkpoint system for resuming workflows
create_checkpoint() {
local step="$1"
echo "$step" > "${PROJECT_ROOT}/tmp/checkpoint"
log "INFO" "Checkpoint created: $step"
}
check_checkpoint() {
local current_step="$1"
if [[ -f "${PROJECT_ROOT}/tmp/checkpoint" ]]; then
local last_step=$(cat "${PROJECT_ROOT}/tmp/checkpoint")
if [[ "$last_step" == "$current_step" ]]; then
log "INFO" "Skipping $current_step (already completed)"
return 0
fi
fi
return 1
}
# Cleanup function
cleanup() {
local exit_code=$?
if [[ $exit_code -eq 0 ]]; then
log "INFO" "Pipeline completed successfully"
# Clean up temporary files
rm -f "${PROJECT_ROOT}/tmp/checkpoint"
else
log "ERROR" "Pipeline failed with exit code $exit_code"
fi
}
# Set trap for cleanup
trap cleanup EXIT
# =============================================================================
# BIOINFORMATICS SPECIFIC FUNCTIONS
# =============================================================================
# FASTQ utilities
validate_fastq() {
local fastq_file="$1"
log "INFO" "Validating FASTQ file: $fastq_file"
# Check if file exists
[[ -f "$fastq_file" ]] || error_exit "FASTQ file not found: $fastq_file"
# Check file format (basic validation)
local line_count=$(wc -l < "$fastq_file")
if (( line_count % 4 != 0 )); then
error_exit "Invalid FASTQ format: line count not divisible by 4"
fi
log "INFO" "FASTQ validation passed"
}
# FASTA utilities
validate_fasta() {
local fasta_file="$1"
log "INFO" "Validating FASTA file: $fasta_file"
# Check if file exists
[[ -f "$fasta_file" ]] || error_exit "FASTA file not found: $fasta_file"
# Check if file starts with >
if ! head -n 1 "$fasta_file" | grep -q "^>"; then
error_exit "Invalid FASTA format: file doesn't start with >"
fi
log "INFO" "FASTA validation passed"
}
# Count sequences in FASTA file
count_fasta_sequences() {
local fasta_file="$1"
grep -c "^>" "$fasta_file"
}
# Count reads in FASTQ file
count_fastq_reads() {
local fastq_file="$1"
echo $(($(wc -l < "$fastq_file") / 4))
}
# =============================================================================
# MAIN PIPELINE EXECUTION
# =============================================================================
# Initialize framework
initialize_framework() {
setup_directories
load_config
validate_environment
log "INFO" "Framework initialized successfully"
}
# Display usage
usage() {
cat << EOF
Bioinformatics Pipeline Framework
Usage: $0 [OPTIONS] COMMAND
OPTIONS:
-h, --help Show this help message
-c, --config Configuration file path
-v, --verbose Enable verbose output
COMMANDS:
init Initialize framework
run Run pipeline
validate Validate environment and inputs
clean Clean temporary files
status Show pipeline status
EXAMPLES:
$0 init
$0 run
$0 --config custom.conf run
EOF
}
# Parse command line arguments
parse_arguments() {
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
usage
exit 0
;;
-c|--config)
CONFIG_FILE="$2"
shift 2
;;
-v|--verbose)
set -x
shift
;;
init)
initialize_framework
exit 0
;;
run)
RUN_PIPELINE=1
shift
;;
validate)
validate_environment
exit 0
;;
clean)
rm -rf "${PROJECT_ROOT}/tmp/"*
rm -rf "${PROJECT_ROOT}/logs/"*
log "INFO" "Temporary files cleaned"
exit 0
;;
status)
echo "Pipeline Status:"
echo " Project: $PROJECT_ROOT"
echo " Config: $CONFIG_FILE"
echo " Log: $LOG_FILE"
if [[ -f "${PROJECT_ROOT}/tmp/checkpoint" ]]; then
echo " Last checkpoint: $(cat ${PROJECT_ROOT}/tmp/checkpoint)"
fi
exit 0
;;
*)
echo "Unknown option: $1"
usage
exit 1
;;
esac
done
}
# Main execution
main() {
parse_arguments "$@"
if [[ "${RUN_PIPELINE:-0}" -eq 1 ]]; then
log "INFO" "Starting pipeline execution"
# This is where you would call your specific pipeline functions
log "INFO" "Pipeline execution completed"
fi
}
# Run main function if script is executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi
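Before moving on, here is a hedged sketch of how a workflow step can combine the checkpoint and parallel helpers defined above. It assumes the script lives in bin/ next to pipeline.sh; the step name, command file, and fastqc invocation are examples, not framework requirements:
#!/bin/bash
# Sketch: combining check_checkpoint, run_parallel, and create_checkpoint.
# Assumes this file sits in bin/ next to pipeline.sh; the fastqc command is an example.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"

qc_step() {
    # Skip the step if the checkpoint says it already completed
    if check_checkpoint "qc_step"; then
        return 0
    fi

    # Build a command file: one shell command per line, one per FASTQ file
    local cmd_file="${PROJECT_ROOT}/tmp/qc_commands.txt"
    : > "$cmd_file"
    for fastq in "${PROJECT_ROOT}/data/raw"/*.fastq; do
        if [[ -f "$fastq" ]]; then
            echo "fastqc '$fastq' --outdir '${PROJECT_ROOT}/results'" >> "$cmd_file"
        fi
    done

    # Run up to MAX_PARALLEL_JOBS commands at once, then record the checkpoint
    run_parallel "$cmd_file" "${MAX_PARALLEL_JOBS:-4}"
    create_checkpoint "qc_step"
}

initialize_framework
qc_step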
Creating the Configuration File
Create a sample configuration file config/pipeline.conf:
# pipeline.conf - Configuration file for bioinformatics pipeline framework
# General settings
PROJECT_NAME="bioinfo_pipeline"
THREADS=4
MEMORY_GB=8
# Directory paths
DATA_DIR="${PROJECT_ROOT}/data"
RAW_DIR="${DATA_DIR}/raw"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_ROOT}/results"
TMP_DIR="${PROJECT_ROOT}/tmp"
# File extensions
FASTQ_EXT=".fastq.gz"
FASTA_EXT=".fasta"
BAM_EXT=".bam"
VCF_EXT=".vcf.gz"
# Quality control settings
MIN_READ_LENGTH=30
MIN_BASE_QUALITY=20
MIN_MAPPING_QUALITY=30
# Parallel processing
MAX_PARALLEL_JOBS=4
CHUNK_SIZE=10000
# Logging
LOG_LEVEL="INFO"
KEEP_LOGS_DAYS=30
# Database paths (update these for your environment)
BLAST_DB_PATH="/path/to/blast/db"
BWA_INDEX_PATH="/path/to/bwa/index"
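Because load_config simply sources this file, every setting becomes an ordinary shell variable inside your step functions. The sketch below shows that wiring; the fastp command and its flags are placeholders for whichever trimmer you actually use, and the script is assumed to sit in bin/ alongside pipeline.sh:
#!/bin/bash
# Sketch: reading pipeline.conf values inside a step function.
# Assumes this file sits in bin/ next to pipeline.sh; fastp is only an example tool.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"
load_config

trim_sample() {
    local sample="$1"
    local input="${RAW_DIR}/${sample}${FASTQ_EXT}"
    local output="${PROCESSED_DIR}/${sample}.trimmed${FASTQ_EXT}"
    log "INFO" "Trimming ${sample} with ${THREADS} threads (min length ${MIN_READ_LENGTH})"
    fastp --in1 "$input" --out1 "$output" \
        --length_required "$MIN_READ_LENGTH" \
        --qualified_quality_phred "$MIN_BASE_QUALITY" \
        --thread "$THREADS"
}

# Example invocation (requires ${RAW_DIR}/sample_R1.fastq.gz to exist):
# trim_sample "sample_R1"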
Creating a Makefile
Create a Makefile for easy pipeline management:
# Makefile - Pipeline management
# Default configuration
CONFIG ?= config/pipeline.conf
SCRIPT = bin/pipeline.sh
# Phony targets (the first real target, help, is the default)
.PHONY: help init run validate clean status
help:
@echo "Bioinformatics Pipeline Framework"
@echo ""
@echo "Usage:"
@echo " make init Initialize framework"
@echo " make run Run pipeline"
@echo " make validate Validate environment"
@echo " make clean Clean temporary files"
@echo " make status Show pipeline status"
@echo " make help Show this help"
init:
@bash $(SCRIPT) --config $(CONFIG) init
run:
@bash $(SCRIPT) --config $(CONFIG) run
validate:
@bash $(SCRIPT) --config $(CONFIG) validate
clean:
@bash $(SCRIPT) --config $(CONFIG) clean
status:
@bash $(SCRIPT) --config $(CONFIG) status
# Example pipeline steps (customize these for your workflow)
.PHONY: quality-control alignment variant-calling
quality-control:
@echo "Running quality control..."
# Add your QC commands here
alignment:
@echo "Running alignment..."
# Add your alignment commands here
variant-calling:
@echo "Running variant calling..."
# Add your variant calling commands here
# Install dependencies (example for Ubuntu/Debian)
.PHONY: install-deps
install-deps:
sudo apt-get update
sudo apt-get install -y \
bash \
gawk \
sed \
grep \
gzip \
make \
parallel \
ncbi-blast+ \
bwa \
samtools \
bcftools
Creating a Sample Workflow
Create a sample workflow bin/sample_workflow.sh:
#!/bin/bash
# sample_workflow.sh - Example workflow using the framework
set -euo pipefail
# Source the framework
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"
# Workflow-specific functions
download_sample_data() {
log "INFO" "Downloading sample data..."
# Example: Download sample FASTQ files
# In practice, you would download real data or use local files
# e.g. https://example.com/sample_R1.fastq.gz and sample_R2.fastq.gz
# Create sample data for demonstration
mkdir -p "${PROJECT_ROOT}/data/raw"
echo "@SAMPLE_READ_1" > "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "+" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "@SAMPLE_READ_2" > "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "+" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
log "INFO" "Sample data created"
}
run_quality_control() {
log "INFO" "Running quality control..."
# Example QC steps
for fastq in "${PROJECT_ROOT}/data/raw"/*.fastq; do
if [[ -f "$fastq" ]]; then
local base_name=$(basename "$fastq" .fastq)
local read_count=$(count_fastq_reads "$fastq")
log "INFO" "Sample ${base_name}: ${read_count} reads"
fi
done
log "INFO" "Quality control completed"
}
run_analysis() {
log "INFO" "Running analysis..."
# Example analysis steps
local fasta_file="${PROJECT_ROOT}/data/raw/sample.fasta"
if [[ ! -f "$fasta_file" ]]; then
# Create sample FASTA file
echo ">sample_sequence" > "$fasta_file"
echo "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT" >> "$fasta_file"
fi
local seq_count=$(count_fasta_sequences "$fasta_file")
log "INFO" "FASTA file contains ${seq_count} sequences"
log "INFO" "Analysis completed"
}
# Main workflow execution
main_workflow() {
initialize_framework
download_sample_data
run_quality_control
run_analysis
log "INFO" "Sample workflow completed successfully"
}
# Run if executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main_workflow
fi
Making Scripts Executable
chmod +x bin/pipeline.sh
chmod +x bin/sample_workflow.sh
Advanced Framework Features
1. Database Management
Add database management functions to the framework:
# Database management
check_database() {
local db_path="$1"
local db_type="$2" # blast, bwa, etc.
case "$db_type" in
blast)
if [[ -f "${db_path}.pin" ]]; then
log "INFO" "BLAST database found: $db_path"
return 0
else
log "ERROR" "BLAST database not found: $db_path"
return 1
fi
;;
bwa)
if [[ -f "${db_path}.bwt" ]]; then
log "INFO" "BWA index found: $db_path"
return 0
else
log "ERROR" "BWA index not found: $db_path"
return 1
fi
;;
*)
log "ERROR" "Unknown database type: $db_type"
return 1
;;
esac
}
download_database() {
local db_name="$1"
local db_type="$2"
local output_dir="$3"
log "INFO" "Downloading database: $db_name"
case "$db_type" in
blast)
# Example for downloading NCBI databases
# update_blastdb.pl --decompress "$db_name"
log "INFO" "Downloaded BLAST database: $db_name"
;;
*)
log "INFO" "Database download for $db_type not implemented"
;;
esac
}
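A typical way to use these functions is to verify every required database before any heavy computation starts and fail fast if something is missing. In the sketch below, BLAST_DB_PATH and BWA_INDEX_PATH are assumed to come from pipeline.conf, and the nt database name is only an example:
# Sketch: fail fast if required databases or indexes are missing.
# Assumes BLAST_DB_PATH and BWA_INDEX_PATH are set in pipeline.conf.
validate_databases() {
    local missing=0
    check_database "$BWA_INDEX_PATH" bwa || missing=1
    if ! check_database "$BLAST_DB_PATH" blast; then
        # Optionally try to fetch the database instead of failing outright
        download_database "nt" blast "$(dirname "$BLAST_DB_PATH")" || missing=1
    fi
    if [[ $missing -ne 0 ]]; then
        error_exit "One or more required databases are missing; see the log for details"
    fi
    log "INFO" "All required databases are present"
}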
2. Resource Management
Add resource monitoring functions:
# Resource monitoring
monitor_resources() {
local pid=$1
local interval=${2:-5}
while kill -0 "$pid" 2>/dev/null; do
# Get memory usage
local mem_usage=$(ps -p "$pid" -o rss= 2>/dev/null | awk '{print int($1/1024)}')
# Get CPU usage
local cpu_usage=$(ps -p "$pid" -o %cpu= 2>/dev/null)
log "RESOURCE" "PID $pid: Memory=${mem_usage}MB, CPU=${cpu_usage}%"
sleep "$interval"
done
}
# Start resource monitoring in background
start_monitoring() {
local pid=$1
monitor_resources "$pid" 10 &
local monitor_pid=$!
echo "$monitor_pid"
}
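To put the monitors to work, launch a long-running tool in the background, attach the monitor to its PID, and stop the monitor once the tool finishes. In this sketch the bwa mem command stands in for any long-running job, and BWA_INDEX_PATH, RAW_DIR, and THREADS are assumed to come from pipeline.conf:
# Sketch: wrapping a long-running job with resource monitoring.
# The bwa mem command is illustrative; BWA_INDEX_PATH, RAW_DIR, THREADS come from pipeline.conf.
run_with_monitoring() {
    local output_sam="${PROJECT_ROOT}/results/sample.sam"

    # Start the long-running job in the background and remember its PID
    bwa mem -t "${THREADS:-4}" "$BWA_INDEX_PATH" \
        "${RAW_DIR}/sample_R1.fastq.gz" "${RAW_DIR}/sample_R2.fastq.gz" \
        > "$output_sam" &
    local job_pid=$!

    # Attach the resource monitor, wait for the job, then stop the monitor
    local monitor_pid
    monitor_pid=$(start_monitoring "$job_pid")
    wait "$job_pid"
    kill "$monitor_pid" 2>/dev/null || true
}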
3. Notification System
Add notification functions:
# Notification system
send_notification() {
local subject="$1"
local message="$2"
# Email notification (if mail command is available)
if command -v mail &> /dev/null; then
echo "$message" | mail -s "$subject" "${NOTIFICATION_EMAIL:-user@example.com}"
log "INFO" "Notification sent via email"
fi
# Slack notification (if curl and jq are available and a webhook is configured)
if command -v curl &> /dev/null && command -v jq &> /dev/null && [[ -n "${SLACK_WEBHOOK_URL:-}" ]]; then
local payload=$(jq -n --arg text "$subject: $message" '{text: $text}')
curl -X POST -H 'Content-type: application/json' --data "$payload" "$SLACK_WEBHOOK_URL"
log "INFO" "Notification sent to Slack"
fi
}
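A convenient place to call send_notification is the cleanup trap from the core framework, so a message goes out whether the run succeeds or fails. The sketch below redefines cleanup after sourcing pipeline.sh; NOTIFICATION_EMAIL and SLACK_WEBHOOK_URL are settings you would add to pipeline.conf yourself:
# Sketch: redefine cleanup (after sourcing pipeline.sh) to notify on exit.
# NOTIFICATION_EMAIL / SLACK_WEBHOOK_URL are assumed to be defined in pipeline.conf.
cleanup() {
    local exit_code=$?
    if [[ $exit_code -eq 0 ]]; then
        log "INFO" "Pipeline completed successfully"
        rm -f "${PROJECT_ROOT}/tmp/checkpoint"
        send_notification "Pipeline finished" "Run completed successfully. Log: $LOG_FILE"
    else
        log "ERROR" "Pipeline failed with exit code $exit_code"
        send_notification "Pipeline FAILED" "Exit code $exit_code. Log: $LOG_FILE"
    fi
}
trap cleanup EXIT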
Running the Framework
To use the framework:
# Initialize the framework
make init
# Run the sample workflow
./bin/sample_workflow.sh
# Or use the main pipeline script
./bin/pipeline.sh run
# Check status
make status
# Clean up
make clean
Expected Output Structure
After running the framework, you’ll have:
bioinfo_pipeline_framework/
├── bin/
│ ├── pipeline.sh
│ └── sample_workflow.sh
├── config/
│ └── pipeline.conf
├── data/
│ ├── raw/
│ └── processed/
├── results/
├── logs/
│ └── pipeline_20250125_120000.log
├── tmp/
├── Makefile
└── README.md
Best Practices
- Modularity: Break complex workflows into reusable functions
- Configuration: Use external config files for flexibility
- Error Handling: Implement comprehensive error handling
- Logging: Maintain detailed logs for debugging
- Validation: Validate inputs, outputs, and environment
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use GNU parallel or background jobs
- Memory Management: Monitor and limit memory usage
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible (see the sketch after this list)
- Compression: Use appropriate compression for large files
- Resource Monitoring: Track CPU and memory usage
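As an example of the caching point above, a simple timestamp check lets a step skip work when its output is already newer than its input. The run_if_stale helper and the gzip example are an illustrative sketch, not part of the framework:
# Sketch: skip a step when its output is newer than its input.
# run_if_stale and the file names below are illustrative, not framework functions.
run_if_stale() {
    local input="$1"
    local output="$2"
    shift 2
    if [[ -f "$output" && "$output" -nt "$input" ]]; then
        log "INFO" "Skipping: $output is up to date"
        return 0
    fi
    log "INFO" "Regenerating $output"
    "$@"
}

# Only recompress if the raw file changed since the last run
run_if_stale "${RAW_DIR}/sample_R1.fastq" "${RAW_DIR}/sample_R1.fastq.gz" \
    gzip -kf "${RAW_DIR}/sample_R1.fastq"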
This bioinformatics bash pipeline framework provides a solid foundation for building robust, reproducible bioinformatics workflows. You can extend and customize it based on your specific analysis needs.
Continue Learning
Related Content
Snakemake for Beginners: Your First Bioinformatics Pipeline
Learn how to build reproducible bioinformatics workflows with Snakemake. A step-by-step guide from basic concepts to a complete RNA-seq analysis pipeline.
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
From Excel Spreadsheets to Reproducible Pipelines: A Researcher's Journey
How I transformed my data analysis workflow from manual Excel processes to automated, reproducible computational pipelines - and why you should too.
10 Productivity Tools That Transformed My PhD Workflow
Discover the essential digital tools and techniques I use to manage research projects, automate repetitive tasks, and maintain work-life balance during my PhD.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run