
Protein Annotation Pipeline with Bash

Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.

20 min read
#bash #protein-annotation #blast #interproscan #hmmer #pipeline #functional-annotation

Overview

This tutorial will guide you through creating a complete bash pipeline for protein annotation, from raw protein sequences to comprehensive functional and structural annotations. We’ll automate the entire workflow using bash scripting and integrate multiple bioinformatics tools.

Prerequisites

Before starting this tutorial, you should have:

  • Intermediate understanding of bash scripting
  • Familiarity with protein sequences (FASTA format)
  • Access to bioinformatics databases (NR, Swiss-Prot, Pfam)
  • Tools installed: BLAST+ (blastp), InterProScan, HMMER, SignalP, TMHMM

Pipeline Workflow

Our protein annotation pipeline will follow these steps:

  1. Sequence quality control and preprocessing
  2. Homology search against reference databases
  3. Domain and motif identification
  4. Functional annotation and classification
  5. Subcellular localization prediction
  6. Structural feature prediction
  7. Comprehensive reporting

Setting Up the Project Structure

First, let’s create a directory structure for our pipeline:

# Create project directory
mkdir protein_annotation_pipeline
cd protein_annotation_pipeline

# Create directory structure
mkdir -p data/input data/processed results/blast results/interpro results/hmmer results/predictions results/reports logs

# Download sample data (or use your own)
# For this tutorial, we'll assume you have protein sequences in FASTA format
# data/input/proteins.fasta
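
If you just want to exercise the pipeline end to end, you can generate a tiny placeholder input first. The sequences below are arbitrary strings of valid residues, not real proteins:

# Create a minimal test input (arbitrary placeholder sequences)
cat > data/input/proteins.fasta << 'EOF'
>test_protein_1 placeholder
MKVLAAGIDETKRWQSPNFHYCMLVTAGIDERKWQSPLNMKVLAAGIDETKRW
>test_protein_2 placeholder
MSTNPKPQRKTKRNTNRRPQDVKFPGGNMSTNPKPQRKTWWYHGG
EOF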

Creating the Main Pipeline Script

Create a main pipeline script protein_annotation_pipeline.sh:

#!/bin/bash

# protein_annotation_pipeline.sh - Protein annotation pipeline
# Author: Your Name
# Date: YYYY-MM-DD

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
INPUT_DIR="${DATA_DIR}/input"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_DIR}/results"
BLAST_DIR="${RESULTS_DIR}/blast"
INTERPRO_DIR="${RESULTS_DIR}/interpro"
HMMER_DIR="${RESULTS_DIR}/hmmer"
PREDICTIONS_DIR="${RESULTS_DIR}/predictions"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"

# Input files
INPUT_PROTEINS="${INPUT_DIR}/proteins.fasta"

# Database paths (update these to your local paths)
BLAST_NR_DB="/path/to/blast/db/nr"
BLAST_SWISSPROT_DB="/path/to/blast/db/swissprot"
PFAM_DB="/path/to/pfam/Pfam-A.hmm"

# Output files
PROCESSED_PROTEINS="${PROCESSED_DIR}/proteins_clean.fasta"
BLAST_NR_RESULTS="${BLAST_DIR}/nr_blast.tsv"
BLAST_SWISSPROT_RESULTS="${BLAST_DIR}/swissprot_blast.tsv"
INTERPRO_RESULTS="${INTERPRO_DIR}/interpro_results.tsv"
HMMER_RESULTS="${HMMER_DIR}/hmmer_results.tbl"
SIGNALP_RESULTS="${PREDICTIONS_DIR}/signalp_results.txt"
TMHMM_RESULTS="${PREDICTIONS_DIR}/tmhmm_results.txt"
FINAL_REPORT="${REPORTS_DIR}/protein_annotation_report.tsv"

# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}

# Create directories
setup_directories() {
    log "Setting up directories..."
    mkdir -p "${PROCESSED_DIR}" "${BLAST_DIR}" "${INTERPRO_DIR}" "${HMMER_DIR}" \
           "${PREDICTIONS_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"
}

# Validate input files
validate_input() {
    log "Validating input files..."

    if [[ ! -f "$INPUT_PROTEINS" ]]; then
        log "ERROR: Input protein file not found: $INPUT_PROTEINS"
        exit 1
    fi

    # Check if file is valid FASTA
    if ! grep -q "^>" "$INPUT_PROTEINS"; then
        log "ERROR: Input file is not in FASTA format"
        exit 1
    fi

    log "Input validation passed"
}

# Preprocess protein sequences
preprocess_sequences() {
    log "Preprocessing protein sequences..."

    # Linearize multi-line FASTA records, then keep the first record seen
    # for each unique sequence (deduplicates by sequence, not by header)
    awk '/^>/ { if (seq) print header "\n" seq; header=$0; seq=""; next }
         { seq=seq $0 }
         END { if (seq) print header "\n" seq }' "$INPUT_PROTEINS" | \
    awk '/^>/ { header=$0; next }
         {
           if (!seen[$0]++) {
             print header;
             print $0
           }
         }' > "$PROCESSED_PROTEINS"

    # Count sequences
    local seq_count=$(grep -c "^>" "$PROCESSED_PROTEINS")
    log "Processed $seq_count unique protein sequences"
}

# BLAST against NR database
blast_nr() {
    log "Running BLAST against NR database..."

    blastp \
        -query "$PROCESSED_PROTEINS" \
        -db "$BLAST_NR_DB" \
        -out "$BLAST_NR_RESULTS" \
        -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
        -max_target_seqs 10 \
        -num_threads 8 \
        -evalue 1e-5
}

# BLAST against Swiss-Prot database
blast_swissprot() {
    log "Running BLAST against Swiss-Prot database..."

    blastp \
        -query "$PROCESSED_PROTEINS" \
        -db "$BLAST_SWISSPROT_DB" \
        -out "$BLAST_SWISSPROT_RESULTS" \
        -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
        -max_target_seqs 5 \
        -num_threads 8 \
        -evalue 1e-10
}

# InterProScan for domain and motif identification
run_interproscan() {
    log "Running InterProScan for domain identification..."

    interproscan.sh \
        -i "$PROCESSED_PROTEINS" \
        -o "$INTERPRO_RESULTS" \
        -f TSV \
        -cpu 8 \
        -appl Pfam,PRINTS,CDD,ProSiteProfiles
}

# HMMER for Pfam domain search
run_hmmer() {
    log "Running HMMER for Pfam domain search..."

    hmmscan \
        --tblout "$HMMER_RESULTS" \
        --cpu 8 \
        --cut_ga \
        "$PFAM_DB" \
        "$PROCESSED_PROTEINS"
}

# SignalP for signal peptide prediction
run_signalp() {
    log "Running SignalP for signal peptide prediction..."

    # Flag names and output handling vary between SignalP versions;
    # this matches the SignalP 5.x command line
    signalp \
        -org euk \
        -format short \
        -fasta "$PROCESSED_PROTEINS" \
        > "$SIGNALP_RESULTS"
}

# TMHMM for transmembrane helix prediction
run_tmhmm() {
    log "Running TMHMM for transmembrane helix prediction..."

    # -short gives one line per protein, including the PredHel=N helix count
    tmhmm -short \
        "$PROCESSED_PROTEINS" \
        > "$TMHMM_RESULTS"
}

# Parse and combine results
combine_results() {
    log "Combining annotation results..."

    # Create header for final report
    echo -e "protein_id\tlength\tblast_nr_hit\tblast_swissprot_hit\tinterpro_domains\tpfam_domains\tsignal_peptide\ttransmembrane_domains\tgo_terms" > "$FINAL_REPORT"

    # Process each protein
    while IFS= read -r header; do
        protein_id=$(echo "$header" | sed 's/^>//' | awk '{print $1}')

        # Get sequence length (sequences are one line each after preprocessing)
        length=$(awk -v id="$protein_id" '
            /^>/ { keep = (substr($1, 2) == id); next }
            keep { total += length($0) }
            END { print total + 0 }' "$PROCESSED_PROTEINS")

        # Extract best BLAST hits (column 2 = subject sequence ID)
        blast_nr_hit=$(awk -v id="$protein_id" '$1 == id {print $2; exit}' "$BLAST_NR_RESULTS")
        blast_swissprot_hit=$(awk -v id="$protein_id" '$1 == id {print $2; exit}' "$BLAST_SWISSPROT_RESULTS")

        # Extract InterPro results (TSV: col 4 analysis, col 5 signature, col 13 InterPro description)
        interpro_domains=$(awk -F'\t' -v id="$protein_id" '$1 == id {print $4":"$5":"$13}' "$INTERPRO_RESULTS" | tr '\n' ';' | sed 's/;$//')

        # Extract Pfam results from hmmscan --tblout (col 1 domain name, col 2 accession, col 3 query)
        pfam_domains=$(awk -v id="$protein_id" '!/^#/ && $3 == id {print $1":"$2}' "$HMMER_RESULTS" | tr '\n' ';' | sed 's/;$//')

        # Extract SignalP results (short format: col 2 is "OTHER" when no signal peptide is predicted)
        signal_peptide=$(awk -v id="$protein_id" '!/^#/ && $1 == id && $2 != "OTHER" {print "YES"; exit}' "$SIGNALP_RESULTS")
        signal_peptide=${signal_peptide:-NO}

        # Extract TMHMM results (short format reports the helix count as PredHel=N)
        tmh_count=$(awk -v id="$protein_id" '$1 == id { for (i = 1; i <= NF; i++) if ($i ~ /^PredHel=/) { sub(/^PredHel=/, "", $i); print $i; exit } }' "$TMHMM_RESULTS")
        tmh_count=${tmh_count:-0}
        if [[ "$tmh_count" -gt 0 ]]; then
            transmembrane_domains="YES ($tmh_count)"
        else
            transmembrane_domains="NO"
        fi

        # Simple GO term extraction (in practice, you'd parse this from databases)
        go_terms="N/A"

        # Write to report
        echo -e "$protein_id\t$length\t$blast_nr_hit\t$blast_swissprot_hit\t$interpro_domains\t$pfam_domains\t$signal_peptide\t$transmembrane_domains\t$go_terms" >> "$FINAL_REPORT"

    done < <(grep "^>" "$PROCESSED_PROTEINS")
}

# Generate summary statistics
generate_summary() {
    log "Generating summary statistics..."

    local total_proteins=$(( $(wc -l < "$FINAL_REPORT") - 1 ))  # exclude header line
    local annotated_proteins=$(awk -F'\t' 'NR>1 && ($3 != "" || $4 != "") {count++} END {print count+0}' "$FINAL_REPORT")
    local signal_peptide_count=$(awk -F'\t' 'NR>1 && $7 == "YES" {count++} END {print count+0}' "$FINAL_REPORT")
    local transmembrane_count=$(awk -F'\t' 'NR>1 && $8 ~ /YES/ {count++} END {print count+0}' "$FINAL_REPORT")

    cat > "${REPORTS_DIR}/summary.txt" << EOF
Protein Annotation Pipeline Summary
===================================

Total proteins processed: $total_proteins
Proteins with annotations: $annotated_proteins
Proteins with signal peptides: $signal_peptide_count
Proteins with transmembrane domains: $transmembrane_count

Generated on: $(date)
EOF

    log "Summary generated: ${REPORTS_DIR}/summary.txt"
}

# Main pipeline execution
main() {
    setup_directories
    validate_input
    preprocess_sequences

    # Step 1: Database searches
    blast_nr
    blast_swissprot

    # Step 2: Domain identification
    run_interproscan
    run_hmmer

    # Step 3: Feature predictions
    run_signalp
    run_tmhmm

    # Step 4: Combine results
    combine_results

    # Step 5: Generate summary
    generate_summary

    log "Protein annotation pipeline completed successfully!"
    log "Results available in: ${RESULTS_DIR}"
    log "Final report: ${FINAL_REPORT}"
}

# Run main function
main "$@"

Making the Script Executable

chmod +x protein_annotation_pipeline.sh

Advanced Features

1. Configuration Management

Create a configuration file protein_annotation_config.ini:

[general]
project_name=protein_annotation
threads=8
memory=16G

[blast]
evalue_nr=1e-5
evalue_swissprot=1e-10
max_targets=10

[interproscan]
applications=Pfam,PRINTS,ProSiteProfiles,CDD
timeout=3600

[hmmer]
cutoff=gathering
cpu=8

[signalp]
organism=eukaryote

[tmhmm]
format=short
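
The main script above hard-codes its settings, so adopting this config file requires a small reader. A minimal sketch, assuming the flat key=value layout shown above (no quoting or multi-line values), that exposes each entry as a SECTION_KEY shell variable:

# Load a simple key=value INI file into SECTION_KEY shell variables,
# e.g. evalue_nr=1e-5 under [blast] becomes BLAST_EVALUE_NR=1e-5
load_config() {
    local config_file="$1" section="" key value
    while IFS='=' read -r key value; do
        # Skip blank lines and comments
        [[ -z "$key" || "$key" == \#* ]] && continue
        # Track the current [section]
        if [[ "$key" == \[*\] ]]; then
            section=$(echo "$key" | tr -d '[]' | tr '[:lower:]' '[:upper:]')
            continue
        fi
        key=$(echo "$key" | tr '[:lower:]' '[:upper:]')
        declare -g "${section}_${key}=${value}"
    done < "$config_file"
}

# Usage: load_config protein_annotation_config.ini; echo "$BLAST_EVALUE_NR"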

2. Parallel Processing

Enhance the pipeline for parallel execution:

# Split large protein files for parallel processing
split_protein_file() {
    local input_file="$1"
    local chunk_size="$2"
    local output_prefix="$3"

    log "Splitting protein file into chunks of $chunk_size sequences..."

    awk -v chunk_size="$chunk_size" -v prefix="$output_prefix" '
    /^>/ {
        if (count % chunk_size == 0) {
            if (out) close(out)
            out = prefix "_" (int(count / chunk_size) + 1) ".fasta"
        }
        count++
    }
    { print > out }
    ' "$input_file"
}

# Process chunks in parallel
process_chunks_parallel() {
    local chunk_pattern="$1"
    local output_dir="$2"
    local max_jobs="${3:-4}"

    log "Processing protein chunks in parallel (up to $max_jobs jobs)..."

    for chunk in ${chunk_pattern}*; do
        ./annotate_chunk.sh "$chunk" &
        # Throttle: don't launch more jobs than the limit allows (needs bash 4.3+ for wait -n)
        while (( $(jobs -rp | wc -l) >= max_jobs )); do
            wait -n
        done
    done

    wait  # Wait for all remaining chunks to complete

    log "Combining chunk results..."
    cat "${output_dir}"/chunk_*.tsv > "$FINAL_REPORT"
}
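
Wiring the two functions together might look like this; annotate_chunk.sh is a hypothetical helper script that runs the annotation steps on a single chunk and writes a chunk_*.tsv result into the results directory:

# Split into 100-sequence chunks, then annotate up to 4 chunks concurrently
split_protein_file "$PROCESSED_PROTEINS" 100 "${PROCESSED_DIR}/chunk"
process_chunks_parallel "${PROCESSED_DIR}/chunk" "${RESULTS_DIR}" 4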

3. Database Management

Add database management functions:

# Check database status
check_database_status() {
    log "Checking database status..."

    # Check if databases exist (multi-volume databases such as NR expose a .pal alias file)
    if [[ ! -f "${BLAST_NR_DB}.pin" && ! -f "${BLAST_NR_DB}.pal" ]]; then
        log "WARNING: BLAST NR database not found"
    fi

    if [[ ! -f "${BLAST_SWISSPROT_DB}.pin" ]]; then
        log "WARNING: BLAST Swiss-Prot database not found"
    fi

    if [[ ! -f "${PFAM_DB}.h3f" ]]; then
        log "WARNING: Pfam HMM database not found"
    fi
}

# Download and update databases
update_databases() {
    log "Updating databases..."

    # This would require specific commands for each database
    # Example for updating Swiss-Prot:
    # wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
    # tar -xzf swissprot.tar.gz -C /path/to/blast/db/

    log "Database update completed"
}
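
In practice, the update_blastdb.pl helper that ships with BLAST+ is a more robust way to fetch NCBI databases, and a fresh Pfam download must be indexed with hmmpress before hmmscan can use it. A sketch, assuming both tools are on your PATH and the usual download locations haven't moved:

# Fetch/refresh the preformatted Swiss-Prot BLAST database (script ships with BLAST+)
update_blastdb.pl --decompress swissprot

# Download and index Pfam-A for hmmscan
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm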

4. Quality Control and Validation

Add comprehensive quality control:

# Validate protein sequences
validate_protein_sequences() {
    log "Validating protein sequences..."

    # Check for characters outside the amino acid alphabet
    # (standard residues plus ambiguity codes B, Z, X and the rare U, O)
    if grep -v "^>" "$PROCESSED_PROTEINS" | grep -qi "[^ACDEFGHIKLMNPQRSTVWYBXZUO*]"; then
        log "ERROR: Invalid amino acid characters found in sequences"
        exit 1
    fi

    # Flag very short sequences (assumes one sequence per line, as produced by preprocessing)
    awk '/^>/ { header=$0; next }
         {
           if (length($0) < 10) {
             print "WARNING: Very short sequence found in " header
           }
         }' "$PROCESSED_PROTEINS"

    log "Sequence validation completed"
}

# Validate tool installations
validate_tools() {
    log "Validating required tools..."

    local required_tools=("blastp" "interproscan.sh" "hmmscan" "signalp" "tmhmm")

    for tool in "${required_tools[@]}"; do
        if ! command -v "$tool" &> /dev/null; then
            log "ERROR: Required tool $tool not found in PATH"
            exit 1
        fi
    done

    log "All required tools are available"
}
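
To make these checks take effect, call them early in main(), before any expensive step runs. One way to reorder the version shown earlier:

main() {
    setup_directories
    validate_tools              # fail fast if a dependency is missing
    check_database_status
    validate_input
    preprocess_sequences
    validate_protein_sequences
    # ... remaining pipeline steps unchanged
}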

Running the Pipeline

To run the pipeline:

# Make script executable
chmod +x protein_annotation_pipeline.sh

# Run the pipeline
./protein_annotation_pipeline.sh

# For verbose output
./protein_annotation_pipeline.sh 2>&1 | tee protein_annotation_run.log
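
For long runs on a shared server, you may prefer to detach the pipeline from your terminal:

# Run in the background, immune to hangups, logging to a file
nohup ./protein_annotation_pipeline.sh > protein_annotation_run.log 2>&1 &
tail -f protein_annotation_run.log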

Expected Output Structure

After running the pipeline, you’ll have:

protein_annotation_pipeline/
├── data/
│   ├── input/
│   │   └── proteins.fasta
│   └── processed/
│       └── proteins_clean.fasta
├── results/
│   ├── blast/
│   │   ├── nr_blast.tsv
│   │   └── swissprot_blast.tsv
│   ├── interpro/
│   │   └── interpro_results.tsv
│   ├── hmmer/
│   │   └── hmmer_results.tbl
│   ├── predictions/
│   │   ├── signalp_results.txt
│   │   └── tmhmm_results.txt
│   └── reports/
│       ├── protein_annotation_report.tsv
│       └── summary.txt
├── logs/
│   └── pipeline.log
├── protein_annotation_config.ini
└── protein_annotation_pipeline.sh

Best Practices

  1. Modularity: Break complex tasks into functions
  2. Error Handling: Use set -euo pipefail and proper error checking
  3. Logging: Implement comprehensive logging for debugging
  4. Configuration: Use external config files for flexibility
  5. Validation: Validate inputs, outputs, and tool installations
  6. Documentation: Comment your code thoroughly
  7. Portability: Use relative paths and environment variables

Performance Optimization

  1. Parallel Processing: Use multiple cores for database searches
  2. Database Indexing: Ensure databases are properly indexed
  3. Memory Management: Monitor and limit memory usage
  4. Disk I/O: Minimize file operations and use appropriate buffering
  5. Caching: Reuse intermediate results when possible
  6. Compression: Use appropriate compression for large output files (see the sketch below)
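
For example, compressing finished BLAST output (often the largest intermediate files) is a cheap win once the final report has been built:

# Compress the largest intermediates after the final report is written
gzip -9 "${BLAST_DIR}"/*.tsv
# Later re-reads can stream the compressed files without unpacking them
zcat "${BLAST_DIR}/nr_blast.tsv.gz" | head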

This protein annotation pipeline offers a comprehensive framework for automated protein sequence analysis, integrating multiple bioinformatics tools and databases to deliver detailed functional and structural annotations.
