Bioinformatics Bash Pipeline Framework
Learn to build a flexible and reusable bash pipeline framework for bioinformatics workflows with error handling, logging, and parallel processing.
Overview
This tutorial will guide you through creating a robust, reusable bash pipeline framework that you can adapt for various bioinformatics workflows. We’ll focus on best practices for error handling, logging, configuration management, and parallel processing.
Prerequisites
Before starting this tutorial, you should have:
- Basic understanding of bash scripting
- Familiarity with Unix/Linux command line
- Understanding of bioinformatics file formats (FASTQ, FASTA, SAM, BAM, VCF)
- Experience with common bioinformatics tools
Framework Design Principles
Our bash pipeline framework will follow these design principles:
- Modularity: Break complex workflows into reusable functions
- Configurability: Use external configuration files
- Robustness: Implement comprehensive error handling
- Traceability: Maintain detailed logging
- Scalability: Support parallel processing
- Portability: Work across different environments
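To make these principles concrete before we start building, here is a minimal sketch of the modular step pattern the rest of the tutorial expands on. The names run_step and trim_reads are illustrative placeholders, not functions the framework defines:
#!/bin/bash
# Minimal sketch of the modular step pattern (illustrative names, not framework functions)
set -euo pipefail

run_step() {
    local step_name="$1"; shift
    echo "[$(date '+%F %T')] [INFO] Starting step: $step_name"
    "$@" || { echo "[$(date '+%F %T')] [ERROR] Step failed: $step_name" >&2; exit 1; }
    echo "[$(date '+%F %T')] [INFO] Finished step: $step_name"
}

trim_reads() {
    # Placeholder for a real command (e.g. fastp or cutadapt)
    echo "Trimming reads with ${THREADS:-4} threads..."
}

run_step "trim_reads" trim_reads
Each step is a plain function, the runner wraps it with logging and error handling, and configuration arrives through variables. The full framework below formalizes exactly this division of labor.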
Setting Up the Framework Structure
First, let’s create a directory structure for our framework:
# Create framework directory
mkdir bioinfo_pipeline_framework
cd bioinfo_pipeline_framework
# Create directory structure
mkdir -p bin config data/raw data/processed results logs tmp
# Create framework files
touch bin/pipeline.sh
touch config/pipeline.conf
touch README.md
touch Makefile
Creating the Core Framework
Let’s build the core framework in bin/pipeline.sh:
#!/bin/bash
# pipeline.sh - Bioinformatics Pipeline Framework
# A reusable framework for bioinformatics workflows
set -euo pipefail # Exit on error, undefined vars, pipe failures
# =============================================================================
# FRAMEWORK CORE
# =============================================================================
# Global variables
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
CONFIG_FILE="${PROJECT_ROOT}/config/pipeline.conf"
LOG_FILE="${PROJECT_ROOT}/logs/pipeline_$(date +%Y%m%d_%H%M%S).log"
# Load configuration
load_config() {
if [[ -f "$CONFIG_FILE" ]]; then
source "$CONFIG_FILE"
log "INFO" "Configuration loaded from $CONFIG_FILE"
else
log "WARNING" "Configuration file not found, using defaults"
fi
}
# Logging function
log() {
local level="$1"
local message="$2"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
}
# Error handling
error_exit() {
local message="$1"
log "ERROR" "$message"
exit 1
}
# Create required directories
setup_directories() {
local dirs=("data/raw" "data/processed" "results" "logs" "tmp")
for dir in "${dirs[@]}"; do
mkdir -p "${PROJECT_ROOT}/${dir}"
done
log "INFO" "Directory structure created"
}
# Validate environment
validate_environment() {
log "INFO" "Validating environment..."
# Check required tools
local required_tools=("awk" "sed" "grep" "gzip" "make")
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" &> /dev/null; then
error_exit "Required tool '$tool' not found in PATH"
fi
done
# Check disk space
local required_space=10 # GB
local available_space=$(df -BG "${PROJECT_ROOT}" | awk 'NR==2 {print $4}' | sed 's/G//')
if [[ $available_space -lt $required_space ]]; then
log "WARNING" "Low disk space: ${available_space}GB available, ${required_space}GB recommended"
fi
log "INFO" "Environment validation passed"
}
# =============================================================================
# UTILITY FUNCTIONS
# =============================================================================
# File management utilities
safe_copy() {
local source="$1"
local destination="$2"
if [[ -f "$source" ]]; then
cp "$source" "$destination"
log "INFO" "Copied $source to $destination"
else
error_exit "Source file not found: $source"
fi
}
safe_move() {
local source="$1"
local destination="$2"
if [[ -f "$source" ]]; then
mv "$source" "$destination"
log "INFO" "Moved $source to $destination"
else
error_exit "Source file not found: $source"
fi
}
# Compression utilities
compress_file() {
local file="$1"
if [[ -f "$file" ]]; then
gzip "$file"
log "INFO" "Compressed $file"
fi
}
decompress_file() {
local file="$1"
if [[ -f "$file" && "$file" == *.gz ]]; then
gunzip "$file"
log "INFO" "Decompressed $file"
fi
}
# =============================================================================
# PARALLEL PROCESSING
# =============================================================================
# Run commands in parallel with GNU parallel
run_parallel() {
local cmd_file="$1"
local max_jobs="${2:-4}"
if command -v parallel &> /dev/null; then
log "INFO" "Running parallel jobs with GNU parallel"
parallel -j "$max_jobs" < "$cmd_file"
else
log "INFO" "GNU parallel not available, using basic background jobs"
while IFS= read -r cmd; do
# Throttle so the fallback also respects max_jobs
while (( $(jobs -rp | wc -l) >= max_jobs )); do
wait -n 2>/dev/null || sleep 1
done
eval "$cmd" &
done < "$cmd_file"
wait
fi
}
# Split file for parallel processing
split_file() {
local input_file="$1"
local lines_per_chunk="$2"
local output_prefix="$3"
split -l "$lines_per_chunk" "$input_file" "${PROJECT_ROOT}/tmp/${output_prefix}_"
log "INFO" "Split $input_file into chunks of $lines_per_chunk lines"
}
# =============================================================================
# WORKFLOW MANAGEMENT
# =============================================================================
# Checkpoint system for resuming workflows
create_checkpoint() {
local step="$1"
echo "$step" > "${PROJECT_ROOT}/tmp/checkpoint"
log "INFO" "Checkpoint created: $step"
}
check_checkpoint() {
local current_step="$1"
if [[ -f "${PROJECT_ROOT}/tmp/checkpoint" ]]; then
local last_step=$(cat "${PROJECT_ROOT}/tmp/checkpoint")
if [[ "$last_step" == "$current_step" ]]; then
log "INFO" "Skipping $current_step (already completed)"
return 0
fi
fi
return 1
}
# Cleanup function
cleanup() {
local exit_code=$?
if [[ $exit_code -eq 0 ]]; then
log "INFO" "Pipeline completed successfully"
# Clean up temporary files
rm -f "${PROJECT_ROOT}/tmp/checkpoint"
else
log "ERROR" "Pipeline failed with exit code $exit_code"
fi
}
# Set trap for cleanup
trap cleanup EXIT
# =============================================================================
# BIOINFORMATICS SPECIFIC FUNCTIONS
# =============================================================================
# FASTQ utilities
validate_fastq() {
local fastq_file="$1"
log "INFO" "Validating FASTQ file: $fastq_file"
# Check if file exists
[[ -f "$fastq_file" ]] || error_exit "FASTQ file not found: $fastq_file"
# Check file format (basic validation)
local line_count=$(wc -l < "$fastq_file")
if (( line_count % 4 != 0 )); then
error_exit "Invalid FASTQ format: line count not divisible by 4"
fi
log "INFO" "FASTQ validation passed"
}
# FASTA utilities
validate_fasta() {
local fasta_file="$1"
log "INFO" "Validating FASTA file: $fasta_file"
# Check if file exists
[[ -f "$fasta_file" ]] || error_exit "FASTA file not found: $fasta_file"
# Check if file starts with >
if ! head -n 1 "$fasta_file" | grep -q "^>"; then
error_exit "Invalid FASTA format: file doesn't start with >"
fi
log "INFO" "FASTA validation passed"
}
# Count sequences in FASTA file
count_fasta_sequences() {
local fasta_file="$1"
grep -c "^>" "$fasta_file"
}
# Count reads in FASTQ file
count_fastq_reads() {
local fastq_file="$1"
echo $(($(wc -l < "$fastq_file") / 4))
}
# =============================================================================
# MAIN PIPELINE EXECUTION
# =============================================================================
# Initialize framework
initialize_framework() {
setup_directories
load_config
validate_environment
log "INFO" "Framework initialized successfully"
}
# Display usage
usage() {
cat << EOF
Bioinformatics Pipeline Framework
Usage: $0 [OPTIONS] COMMAND
OPTIONS:
-h, --help Show this help message
-c, --config Configuration file path
-v, --verbose Enable verbose output
COMMANDS:
init Initialize framework
run Run pipeline
validate Validate environment and inputs
clean Clean temporary files
status Show pipeline status
EXAMPLES:
$0 init
$0 run
$0 --config custom.conf run
EOF
}
# Parse command line arguments
parse_arguments() {
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
usage
exit 0
;;
-c|--config)
CONFIG_FILE="$2"
shift 2
;;
-v|--verbose)
set -x
shift
;;
init)
initialize_framework
exit 0
;;
run)
RUN_PIPELINE=1
shift
;;
validate)
validate_environment
exit 0
;;
clean)
rm -rf "${PROJECT_ROOT}/tmp/"*
rm -rf "${PROJECT_ROOT}/logs/"*
log "INFO" "Temporary files cleaned"
exit 0
;;
status)
echo "Pipeline Status:"
echo " Project: $PROJECT_ROOT"
echo " Config: $CONFIG_FILE"
echo " Log: $LOG_FILE"
if [[ -f "${PROJECT_ROOT}/tmp/checkpoint" ]]; then
echo " Last checkpoint: $(cat ${PROJECT_ROOT}/tmp/checkpoint)"
fi
exit 0
;;
*)
echo "Unknown option: $1"
usage
exit 1
;;
esac
done
}
# Main execution
main() {
parse_arguments "$@"
if [[ "${RUN_PIPELINE:-0}" -eq 1 ]]; then
log "INFO" "Starting pipeline execution"
# This is where you would call your specific pipeline functions
log "INFO" "Pipeline execution completed"
fi
}
# Run main function if script is executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi
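Before moving on, here is a hedged sketch of how a workflow step can combine the checkpoint and parallel helpers defined above. It assumes the script lives in bin/ next to pipeline.sh; the step name, command file, and fastqc invocation are examples, not framework requirements:
#!/bin/bash
# Sketch: combining check_checkpoint, run_parallel, and create_checkpoint.
# Assumes this file sits in bin/ next to pipeline.sh; the fastqc command is an example.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"

qc_step() {
    # Skip the step if the checkpoint says it already completed
    if check_checkpoint "qc_step"; then
        return 0
    fi

    # Build a command file: one shell command per line, one per FASTQ file
    local cmd_file="${PROJECT_ROOT}/tmp/qc_commands.txt"
    : > "$cmd_file"
    for fastq in "${PROJECT_ROOT}/data/raw"/*.fastq; do
        if [[ -f "$fastq" ]]; then
            echo "fastqc '$fastq' --outdir '${PROJECT_ROOT}/results'" >> "$cmd_file"
        fi
    done

    # Run up to MAX_PARALLEL_JOBS commands at once, then record the checkpoint
    run_parallel "$cmd_file" "${MAX_PARALLEL_JOBS:-4}"
    create_checkpoint "qc_step"
}

initialize_framework
qc_step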
Creating the Configuration File
Create a sample configuration file config/pipeline.conf:
# pipeline.conf - Configuration file for bioinformatics pipeline framework
# General settings
PROJECT_NAME="bioinfo_pipeline"
THREADS=4
MEMORY_GB=8
# Directory paths
DATA_DIR="${PROJECT_ROOT}/data"
RAW_DIR="${DATA_DIR}/raw"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_ROOT}/results"
TMP_DIR="${PROJECT_ROOT}/tmp"
# File extensions
FASTQ_EXT=".fastq.gz"
FASTA_EXT=".fasta"
BAM_EXT=".bam"
VCF_EXT=".vcf.gz"
# Quality control settings
MIN_READ_LENGTH=30
MIN_BASE_QUALITY=20
MIN_MAPPING_QUALITY=30
# Parallel processing
MAX_PARALLEL_JOBS=4
CHUNK_SIZE=10000
# Logging
LOG_LEVEL="INFO"
KEEP_LOGS_DAYS=30
# Database paths (update these for your environment)
BLAST_DB_PATH="/path/to/blast/db"
BWA_INDEX_PATH="/path/to/bwa/index"
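Because load_config simply sources this file, every setting becomes an ordinary shell variable inside your step functions. The sketch below shows that wiring; the fastp command and its flags are placeholders for whichever trimmer you actually use, and the script is assumed to sit in bin/ alongside pipeline.sh:
#!/bin/bash
# Sketch: reading pipeline.conf values inside a step function.
# Assumes this file sits in bin/ next to pipeline.sh; fastp is only an example tool.
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"
load_config

trim_sample() {
    local sample="$1"
    local input="${RAW_DIR}/${sample}${FASTQ_EXT}"
    local output="${PROCESSED_DIR}/${sample}.trimmed${FASTQ_EXT}"
    log "INFO" "Trimming ${sample} with ${THREADS} threads (min length ${MIN_READ_LENGTH})"
    fastp --in1 "$input" --out1 "$output" \
        --length_required "$MIN_READ_LENGTH" \
        --qualified_quality_phred "$MIN_BASE_QUALITY" \
        --thread "$THREADS"
}

# Example invocation (requires ${RAW_DIR}/sample_R1.fastq.gz to exist):
# trim_sample "sample_R1"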
Creating a Makefile
Create a Makefile for easy pipeline management:
# Makefile - Pipeline management
# Default configuration
CONFIG ?= config/pipeline.conf
SCRIPT = bin/pipeline.sh
# Phony targets (the first real target, help, is the default)
.PHONY: help init run validate clean status
help:
@echo "Bioinformatics Pipeline Framework"
@echo ""
@echo "Usage:"
@echo " make init Initialize framework"
@echo " make run Run pipeline"
@echo " make validate Validate environment"
@echo " make clean Clean temporary files"
@echo " make status Show pipeline status"
@echo " make help Show this help"
init:
@bash $(SCRIPT) --config $(CONFIG) init
run:
@bash $(SCRIPT) --config $(CONFIG) run
validate:
@bash $(SCRIPT) --config $(CONFIG) validate
clean:
@bash $(SCRIPT) --config $(CONFIG) clean
status:
@bash $(SCRIPT) --config $(CONFIG) status
# Example pipeline steps (customize these for your workflow)
.PHONY: quality-control alignment variant-calling
quality-control:
@echo "Running quality control..."
# Add your QC commands here
alignment:
@echo "Running alignment..."
# Add your alignment commands here
variant-calling:
@echo "Running variant calling..."
# Add your variant calling commands here
# Install dependencies (example for Ubuntu/Debian)
.PHONY: install-deps
install-deps:
sudo apt-get update
sudo apt-get install -y \
bash \
gawk \
sed \
grep \
gzip \
make \
parallel \
ncbi-blast+ \
bwa \
samtools \
bcftools
Creating a Sample Workflow
Create a sample workflow bin/sample_workflow.sh:
#!/bin/bash
# sample_workflow.sh - Example workflow using the framework
set -euo pipefail
# Source the framework
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/pipeline.sh"
# Workflow-specific functions
download_sample_data() {
log "INFO" "Downloading sample data..."
# Example: Download sample FASTQ files
# In practice, you would download real data or use local files
# e.g. https://example.com/sample_R1.fastq.gz and sample_R2.fastq.gz
# Create sample data for demonstration
mkdir -p "${PROJECT_ROOT}/data/raw"
echo "@SAMPLE_READ_1" > "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "+" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII" >> "${PROJECT_ROOT}/data/raw/sample_R1.fastq"
echo "@SAMPLE_READ_2" > "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "+" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
echo "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII" >> "${PROJECT_ROOT}/data/raw/sample_R2.fastq"
log "INFO" "Sample data created"
}
run_quality_control() {
log "INFO" "Running quality control..."
# Example QC steps
for fastq in "${PROJECT_ROOT}/data/raw"/*.fastq; do
if [[ -f "$fastq" ]]; then
local base_name=$(basename "$fastq" .fastq)
local read_count=$(count_fastq_reads "$fastq")
log "INFO" "Sample ${base_name}: ${read_count} reads"
fi
done
log "INFO" "Quality control completed"
}
run_analysis() {
log "INFO" "Running analysis..."
# Example analysis steps
local fasta_file="${PROJECT_ROOT}/data/raw/sample.fasta"
if [[ ! -f "$fasta_file" ]]; then
# Create sample FASTA file
echo ">sample_sequence" > "$fasta_file"
echo "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT" >> "$fasta_file"
fi
local seq_count=$(count_fasta_sequences "$fasta_file")
log "INFO" "FASTA file contains ${seq_count} sequences"
log "INFO" "Analysis completed"
}
# Main workflow execution
main_workflow() {
initialize_framework
download_sample_data
run_quality_control
run_analysis
log "INFO" "Sample workflow completed successfully"
}
# Run if executed directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main_workflow
fi
Making Scripts Executable
chmod +x bin/pipeline.sh
chmod +x bin/sample_workflow.sh
Advanced Framework Features
1. Database Management
Add database management functions to the framework:
# Database management
check_database() {
local db_path="$1"
local db_type="$2" # blast, bwa, etc.
case "$db_type" in
blast)
if [[ -f "${db_path}.pin" ]]; then
log "INFO" "BLAST database found: $db_path"
return 0
else
log "ERROR" "BLAST database not found: $db_path"
return 1
fi
;;
bwa)
if [[ -f "${db_path}.bwt" ]]; then
log "INFO" "BWA index found: $db_path"
return 0
else
log "ERROR" "BWA index not found: $db_path"
return 1
fi
;;
*)
log "ERROR" "Unknown database type: $db_type"
return 1
;;
esac
}
download_database() {
local db_name="$1"
local db_type="$2"
local output_dir="$3"
log "INFO" "Downloading database: $db_name"
case "$db_type" in
blast)
# Example for downloading NCBI databases
# update_blastdb.pl --decompress "$db_name"
log "INFO" "Downloaded BLAST database: $db_name"
;;
*)
log "INFO" "Database download for $db_type not implemented"
;;
esac
}
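A typical way to use these functions is to verify every required database before any heavy computation starts and fail fast if something is missing. In the sketch below, BLAST_DB_PATH and BWA_INDEX_PATH are assumed to come from pipeline.conf, and the nt database name is only an example:
# Sketch: fail fast if required databases or indexes are missing.
# Assumes BLAST_DB_PATH and BWA_INDEX_PATH are set in pipeline.conf.
validate_databases() {
    local missing=0
    check_database "$BWA_INDEX_PATH" bwa || missing=1
    if ! check_database "$BLAST_DB_PATH" blast; then
        # Optionally try to fetch the database instead of failing outright
        download_database "nt" blast "$(dirname "$BLAST_DB_PATH")" || missing=1
    fi
    if [[ $missing -ne 0 ]]; then
        error_exit "One or more required databases are missing; see the log for details"
    fi
    log "INFO" "All required databases are present"
}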
2. Resource Management
Add resource monitoring functions:
# Resource monitoring
monitor_resources() {
local pid=$1
local interval=${2:-5}
while kill -0 "$pid" 2>/dev/null; do
# Get memory usage
local mem_usage=$(ps -p "$pid" -o rss= 2>/dev/null | awk '{print int($1/1024)}')
# Get CPU usage
local cpu_usage=$(ps -p "$pid" -o %cpu= 2>/dev/null)
log "RESOURCE" "PID $pid: Memory=${mem_usage}MB, CPU=${cpu_usage}%"
sleep "$interval"
done
}
# Start resource monitoring in background
start_monitoring() {
local pid=$1
monitor_resources "$pid" 10 &
local monitor_pid=$!
echo "$monitor_pid"
}
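To put the monitors to work, launch a long-running tool in the background, attach the monitor to its PID, and stop the monitor once the tool finishes. In this sketch the bwa mem command stands in for any long-running job, and BWA_INDEX_PATH, RAW_DIR, and THREADS are assumed to come from pipeline.conf:
# Sketch: wrapping a long-running job with resource monitoring.
# The bwa mem command is illustrative; BWA_INDEX_PATH, RAW_DIR, THREADS come from pipeline.conf.
run_with_monitoring() {
    local output_sam="${PROJECT_ROOT}/results/sample.sam"

    # Start the long-running job in the background and remember its PID
    bwa mem -t "${THREADS:-4}" "$BWA_INDEX_PATH" \
        "${RAW_DIR}/sample_R1.fastq.gz" "${RAW_DIR}/sample_R2.fastq.gz" \
        > "$output_sam" &
    local job_pid=$!

    # Attach the resource monitor, wait for the job, then stop the monitor
    local monitor_pid
    monitor_pid=$(start_monitoring "$job_pid")
    wait "$job_pid"
    kill "$monitor_pid" 2>/dev/null || true
}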
3. Notification System
Add notification functions:
# Notification system
send_notification() {
local subject="$1"
local message="$2"
# Email notification (if mail command is available)
if command -v mail &> /dev/null; then
echo "$message" | mail -s "$subject" "${NOTIFICATION_EMAIL:-user@example.com}"
log "INFO" "Notification sent via email"
fi
# Slack notification (if curl and jq are available and a webhook is configured)
if command -v curl &> /dev/null && command -v jq &> /dev/null && [[ -n "${SLACK_WEBHOOK_URL:-}" ]]; then
local payload=$(jq -n --arg text "$subject: $message" '{text: $text}')
curl -X POST -H 'Content-type: application/json' --data "$payload" "$SLACK_WEBHOOK_URL"
log "INFO" "Notification sent to Slack"
fi
}
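A convenient place to call send_notification is the cleanup trap from the core framework, so a message goes out whether the run succeeds or fails. The sketch below redefines cleanup after sourcing pipeline.sh; NOTIFICATION_EMAIL and SLACK_WEBHOOK_URL are settings you would add to pipeline.conf yourself:
# Sketch: redefine cleanup (after sourcing pipeline.sh) to notify on exit.
# NOTIFICATION_EMAIL / SLACK_WEBHOOK_URL are assumed to be defined in pipeline.conf.
cleanup() {
    local exit_code=$?
    if [[ $exit_code -eq 0 ]]; then
        log "INFO" "Pipeline completed successfully"
        rm -f "${PROJECT_ROOT}/tmp/checkpoint"
        send_notification "Pipeline finished" "Run completed successfully. Log: $LOG_FILE"
    else
        log "ERROR" "Pipeline failed with exit code $exit_code"
        send_notification "Pipeline FAILED" "Exit code $exit_code. Log: $LOG_FILE"
    fi
}
trap cleanup EXIT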
Running the Framework
To use the framework:
# Initialize the framework
make init
# Run the sample workflow
./bin/sample_workflow.sh
# Or use the main pipeline script
./bin/pipeline.sh run
# Check status
make status
# Clean up
make clean
Expected Output Structure
After running the framework, you’ll have:
bioinfo_pipeline_framework/
├── bin/
│ ├── pipeline.sh
│ └── sample_workflow.sh
├── config/
│ └── pipeline.conf
├── data/
│ ├── raw/
│ └── processed/
├── results/
├── logs/
│ └── pipeline_20250125_120000.log
├── tmp/
├── Makefile
└── README.md
Best Practices
- Modularity: Break complex workflows into reusable functions
- Configuration: Use external config files for flexibility
- Error Handling: Implement comprehensive error handling
- Logging: Maintain detailed logs for debugging
- Validation: Validate inputs, outputs, and environment
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use GNU parallel or background jobs
- Memory Management: Monitor and limit memory usage
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible (see the sketch after this list)
- Compression: Use appropriate compression for large files
- Resource Monitoring: Track CPU and memory usage
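As an example of the caching point above, a simple timestamp check lets a step skip work when its output is already newer than its input. The run_if_stale helper and the gzip example are an illustrative sketch, not part of the framework:
# Sketch: skip a step when its output is newer than its input.
# run_if_stale and the file names below are illustrative, not framework functions.
run_if_stale() {
    local input="$1"
    local output="$2"
    shift 2
    if [[ -f "$output" && "$output" -nt "$input" ]]; then
        log "INFO" "Skipping: $output is up to date"
        return 0
    fi
    log "INFO" "Regenerating $output"
    "$@"
}

# Only recompress if the raw file changed since the last run
run_if_stale "${RAW_DIR}/sample_R1.fastq" "${RAW_DIR}/sample_R1.fastq.gz" \
    gzip -kf "${RAW_DIR}/sample_R1.fastq"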
This bioinformatics bash pipeline framework provides a solid foundation for building robust, reproducible bioinformatics workflows. You can extend and customize it based on your specific analysis needs.
Continue Learning
Related Content
Snakemake for Beginners: Your First Bioinformatics Pipeline
Learn how to build reproducible bioinformatics workflows with Snakemake. A step-by-step guide from basic concepts to a complete RNA-seq analysis pipeline.
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
From Excel Spreadsheets to Reproducible Pipelines: A Researcher's Journey
How I transformed my data analysis workflow from manual Excel processes to automated, reproducible computational pipelines - and why you should too.
10 Productivity Tools That Transformed My PhD Workflow
Discover the essential digital tools and techniques I use to manage research projects, automate repetitive tasks, and maintain work-life balance during my PhD.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run