Protein Annotation Pipeline with Bash
Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.
Overview
This tutorial will guide you through creating a complete bash pipeline for protein annotation, from raw protein sequences to comprehensive functional and structural annotations. We’ll automate the entire workflow using bash scripting and integrate multiple bioinformatics tools.
Prerequisites
Before starting this tutorial, you should have:
- Intermediate understanding of bash scripting
- Familiarity with protein sequences (FASTA format)
- Access to bioinformatics databases (NR, Swiss-Prot, Pfam)
- Tools installed:
blast, interproscan, hmmer, signalp, tmhmm (a sample environment setup is sketched below)
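BLAST+ and HMMER are available from Bioconda; InterProScan is distributed by EBI, and SignalP and TMHMM are licensed through DTU and must be downloaded from their site. A minimal environment sketch (the environment name is arbitrary):

# BLAST+ and HMMER from Bioconda; install InterProScan, SignalP, and TMHMM
# separately under their own licenses
conda create -n protein-annotation -c conda-forge -c bioconda blast hmmer
conda activate protein-annotation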
Pipeline Workflow
Our protein annotation pipeline will follow these steps:
- Sequence quality control and preprocessing
- Homology search against reference databases
- Domain and motif identification
- Functional annotation and classification
- Subcellular localization prediction
- Structural feature prediction
- Comprehensive reporting
Setting Up the Project Structure
First, let’s create a directory structure for our pipeline:
# Create project directory
mkdir protein_annotation_pipeline
cd protein_annotation_pipeline
# Create directory structure
mkdir -p data/input data/processed results/blast results/interpro results/hmmer results/predictions results/reports logs
# Download sample data (or use your own)
# For this tutorial, we'll assume you have protein sequences in FASTA format
# data/input/proteins.fasta
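If you need a small test dataset, one option is to pull reviewed proteins from the UniProt REST API. This is an illustrative download (the query targets E. coli K-12; adjust it to your organism):

# Illustrative sample download from UniProt (reviewed E. coli K-12 entries)
curl -sG 'https://rest.uniprot.org/uniprotkb/stream' \
    --data-urlencode 'format=fasta' \
    --data-urlencode 'query=organism_id:83333 AND reviewed:true' \
    -o data/input/proteins.fasta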
Creating the Main Pipeline Script
Create a main pipeline script protein_annotation_pipeline.sh:
#!/bin/bash
# protein_annotation_pipeline.sh - Protein annotation pipeline
# Author: Your Name
# Date: YYYY-MM-DD
set -euo pipefail # Exit on error, undefined vars, pipe failures
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
INPUT_DIR="${DATA_DIR}/input"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_DIR}/results"
BLAST_DIR="${RESULTS_DIR}/blast"
INTERPRO_DIR="${RESULTS_DIR}/interpro"
HMMER_DIR="${RESULTS_DIR}/hmmer"
PREDICTIONS_DIR="${RESULTS_DIR}/predictions"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_DIR}/logs"
# Input files
INPUT_PROTEINS="${INPUT_DIR}/proteins.fasta"
# Database paths (update these to your local paths)
BLAST_NR_DB="/path/to/blast/db/nr"
BLAST_SWISSPROT_DB="/path/to/blast/db/swissprot"
PFAM_DB="/path/to/pfam/Pfam-A.hmm"
# Output files
PROCESSED_PROTEINS="${PROCESSED_DIR}/proteins_clean.fasta"
BLAST_NR_RESULTS="${BLAST_DIR}/nr_blast.tsv"
BLAST_SWISSPROT_RESULTS="${BLAST_DIR}/swissprot_blast.tsv"
INTERPRO_RESULTS="${INTERPRO_DIR}/interpro_results.tsv"
HMMER_RESULTS="${HMMER_DIR}/hmmer_results.tbl"
SIGNALP_RESULTS="${PREDICTIONS_DIR}/signalp_results.txt"
TMHMM_RESULTS="${PREDICTIONS_DIR}/tmhmm_results.txt"
FINAL_REPORT="${REPORTS_DIR}/protein_annotation_report.tsv"
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}
# Create directories
setup_directories() {
    # Create the directory tree before calling log(), which writes into LOGS_DIR
    mkdir -p "${PROCESSED_DIR}" "${BLAST_DIR}" "${INTERPRO_DIR}" "${HMMER_DIR}" \
        "${PREDICTIONS_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}"
    log "Directories ready"
}
# Validate input files
validate_input() {
log "Validating input files..."
if [[ ! -f "$INPUT_PROTEINS" ]]; then
log "ERROR: Input protein file not found: $INPUT_PROTEINS"
exit 1
fi
# Check if file is valid FASTA
if ! grep -q "^>" "$INPUT_PROTEINS"; then
log "ERROR: Input file is not in FASTA format"
exit 1
fi
log "Input validation passed"
}
# Preprocess protein sequences
preprocess_sequences() {
log "Preprocessing protein sequences..."
    # Remove duplicates: linearize to one sequence per line, then keep the
    # first record seen for each distinct sequence
    awk '/^>/ { if (seq) print header "\n" seq; header = $0; seq = ""; next }
         { seq = seq $0 }
         END { if (seq) print header "\n" seq }' "$INPUT_PROTEINS" | \
    awk '/^>/ { header = $0; next }
         !seen[$0]++ { print header; print }' > "$PROCESSED_PROTEINS"
# Count sequences
local seq_count=$(grep -c "^>" "$PROCESSED_PROTEINS")
log "Processed $seq_count unique protein sequences"
}
# BLAST against NR database
blast_nr() {
log "Running BLAST against NR database..."
blastp \
-query "$PROCESSED_PROTEINS" \
-db "$BLAST_NR_DB" \
-out "$BLAST_NR_RESULTS" \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
-max_target_seqs 10 \
-num_threads 8 \
-evalue 1e-5
}
# BLAST against Swiss-Prot database
blast_swissprot() {
log "Running BLAST against Swiss-Prot database..."
blastp \
-query "$PROCESSED_PROTEINS" \
-db "$BLAST_SWISSPROT_DB" \
-out "$BLAST_SWISSPROT_RESULTS" \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" \
-max_target_seqs 5 \
-num_threads 8 \
-evalue 1e-10
}
# InterProScan for domain and motif identification
run_interproscan() {
log "Running InterProScan for domain identification..."
interproscan.sh \
-i "$PROCESSED_PROTEINS" \
-o "$INTERPRO_RESULTS" \
-f TSV \
-cpu 8 \
-appl Pfam,PRINTS,CDD,ProSiteProfiles
}
# HMMER for Pfam domain search
run_hmmer() {
log "Running HMMER for Pfam domain search..."
hmmscan \
--tblout "$HMMER_RESULTS" \
--cpu 8 \
--cut_ga \
"$PFAM_DB" \
"$PROCESSED_PROTEINS"
}
# SignalP for signal peptide prediction
run_signalp() {
log "Running SignalP for signal peptide prediction..."
    # NOTE: this matches the SignalP 5 command line; some SignalP versions
    # write predictions to prefix-named files rather than stdout
    signalp \
        -org euk \
        -format short \
        -fasta "$PROCESSED_PROTEINS" \
        > "$SIGNALP_RESULTS"
}
# TMHMM for transmembrane helix prediction
run_tmhmm() {
log "Running TMHMM for transmembrane helix prediction..."
    # --short (on some installs: -short) emits one line per protein,
    # including the PredHel=N field parsed later
    tmhmm --short \
        "$PROCESSED_PROTEINS" \
        > "$TMHMM_RESULTS"
}
# Parse and combine results
combine_results() {
log "Combining annotation results..."
# Create header for final report
echo -e "protein_id\tlength\tblast_nr_hit\tblast_swissprot_hit\tinterpro_domains\tpfam_domains\tsignal_peptide\ttransmembrane_domains\tgo_terms" > "$FINAL_REPORT"
# Process each protein
while IFS= read -r header; do
protein_id=$(echo "$header" | sed 's/^>//' | awk '{print $1}')
    # Get sequence length (sequences are one line each after preprocessing)
    sequence=$(awk -v id="$protein_id" '
        /^>/ { keep = (substr($1, 2) == id); next }
        keep { printf "%s", $0 }' "$PROCESSED_PROTEINS")
    length=${#sequence}
    # Extract best BLAST hits (field 2 of outfmt 6 is the subject ID)
    blast_nr_hit=$(awk -v id="$protein_id" '$1 == id { print $2; exit }' "$BLAST_NR_RESULTS")
    blast_swissprot_hit=$(awk -v id="$protein_id" '$1 == id { print $2; exit }' "$BLAST_SWISSPROT_RESULTS")
    # Extract InterPro results (the output is tab-separated, so set FS accordingly)
    interpro_domains=$(awk -F'\t' -v id="$protein_id" '$1 == id { print $4 ":" $5 ":" $13 }' "$INTERPRO_RESULTS" | tr '\n' ';' | sed 's/;$//')
    # Extract Pfam results (hmmscan --tblout: field 1 = domain name, field 2 = accession)
    pfam_domains=$(awk -v id="$protein_id" '!/^#/ && $3 == id { print $1 ":" $2 }' "$HMMER_RESULTS" | tr '\n' ';' | sed 's/;$//')
    # Extract SignalP results (SignalP 5 short format: field 2 is "SP(...)" or "OTHER")
    signal_peptide=$(awk -v id="$protein_id" '$1 == id && $2 ~ /^SP/ { found = 1; exit } END { print (found ? "YES" : "NO") }' "$SIGNALP_RESULTS")
    # Extract TMHMM results (short format reports a PredHel=N field)
    tmh_count=$(awk -v id="$protein_id" '$1 == id { for (i = 1; i <= NF; i++) if ($i ~ /^PredHel=/) { sub(/^PredHel=/, "", $i); print $i } }' "$TMHMM_RESULTS")
    tmh_count=${tmh_count:-0}
    if [[ "$tmh_count" -gt 0 ]]; then
        transmembrane_domains="YES ($tmh_count)"
    else
        transmembrane_domains="NO"
    fi
# Simple GO term extraction (in practice, you'd parse this from databases)
go_terms="N/A"
# Write to report
echo -e "$protein_id\t$length\t$blast_nr_hit\t$blast_swissprot_hit\t$interpro_domains\t$pfam_domains\t$signal_peptide\t$transmembrane_domains\t$go_terms" >> "$FINAL_REPORT"
done < <(grep "^>" "$PROCESSED_PROTEINS")
}
# Generate summary statistics
generate_summary() {
log "Generating summary statistics..."
local total_proteins=$(( $(wc -l < "$FINAL_REPORT") - 1 ))  # exclude header line
local annotated_proteins=$(awk -F'\t' 'NR>1 && ($3 != "" || $4 != "") {count++} END {print count+0}' "$FINAL_REPORT")
local signal_peptide_count=$(awk -F'\t' 'NR>1 && $7 == "YES" {count++} END {print count+0}' "$FINAL_REPORT")
local transmembrane_count=$(awk -F'\t' 'NR>1 && $8 ~ /YES/ {count++} END {print count+0}' "$FINAL_REPORT")
cat > "${REPORTS_DIR}/summary.txt" << EOF
Protein Annotation Pipeline Summary
==================================
Total proteins processed: $total_proteins
Proteins with annotations: $annotated_proteins
Proteins with signal peptides: $signal_peptide_count
Proteins with transmembrane domains: $transmembrane_count
Generated on: $(date)
EOF
log "Summary generated: ${REPORTS_DIR}/summary.txt"
}
# Main pipeline execution
main() {
setup_directories
validate_input
preprocess_sequences
# Step 1: Database searches
blast_nr
blast_swissprot
# Step 2: Domain identification
run_interproscan
run_hmmer
# Step 3: Feature predictions
run_signalp
run_tmhmm
# Step 4: Combine results
combine_results
# Step 5: Generate summary
generate_summary
log "Protein annotation pipeline completed successfully!"
log "Results available in: ${RESULTS_DIR}"
log "Final report: ${FINAL_REPORT}"
}
# Run main function
main "$@"
Making the Script Executable
chmod +x protein_annotation_pipeline.sh
Advanced Features
1. Configuration Management
Create a configuration file protein_annotation_config.ini:
[general]
project_name=protein_annotation
threads=8
memory=16G
[blast]
evalue_nr=1e-5
evalue_swissprot=1e-10
max_targets=10
[interproscan]
applications=Pfam,PRINTS,ProSiteProfiles,CDD
timeout=3600
[hmmer]
cutoff=gathering
cpu=8
[signalp]
organism=eukaryote
[tmhmm]
format=long
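The main script does not read this file automatically; a minimal sketch of an INI section reader in bash, assuming the simple key=value layout shown above, could look like this:

# Minimal INI reader: print the key=value pairs of one section
read_ini_section() {
    local file="$1" section="$2"
    awk -F'=' -v section="[$section]" '
        $0 == section { in_section = 1; next }  # enter the requested section
        /^\[/         { in_section = 0 }        # any other header ends it
        in_section && NF == 2 { print $1 "=" $2 }
    ' "$file"
}

# Usage: eval only configuration files you trust, since the values
# become shell assignments
eval "$(read_ini_section protein_annotation_config.ini blast)"
echo "BLAST e-value cutoff (NR): $evalue_nr"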
2. Parallel Processing
Enhance the pipeline for parallel execution:
# Split large protein files for parallel processing
split_protein_file() {
local input_file="$1"
local chunk_size="$2"
local output_prefix="$3"
log "Splitting protein file into chunks of $chunk_size sequences..."
    awk -v chunk_size="$chunk_size" -v prefix="$output_prefix" '
        /^>/ {
            # Open a new chunk file at the start of every chunk_size records
            if (++count % chunk_size == 1 || chunk_size == 1) {
                if (out) close(out)
                out = prefix "_" int((count - 1) / chunk_size) ".fasta"
            }
        }
        { print > out }
    ' "$input_file"
}
# Process chunks in parallel
process_chunks_parallel() {
local chunk_pattern="$1"
local output_dir="$2"
log "Processing protein chunks in parallel..."
for chunk in ${chunk_pattern}*; do
./annotate_chunk.sh "$chunk" &
done
wait # Wait for all chunks to complete
log "Combining chunk results..."
cat ${output_dir}/chunk_*.tsv > "$FINAL_REPORT"
}
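Launching one background job per chunk can overwhelm a machine when there are many chunks. An alternative sketch bounds concurrency with xargs -P (annotate_chunk.sh is the same hypothetical per-chunk worker used above; adjust the path and file pattern to your split prefix):

# Run at most MAX_JOBS chunk annotations at a time
MAX_JOBS=4
find "$PROCESSED_DIR" -name 'chunk_*.fasta' -print0 | \
    xargs -0 -n 1 -P "$MAX_JOBS" ./annotate_chunk.sh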
3. Database Management
Add database management functions:
# Check database status
check_database_status() {
log "Checking database status..."
# Check if databases exist
if [[ ! -f "${BLAST_NR_DB}.pin" ]]; then
log "WARNING: BLAST NR database not found"
fi
if [[ ! -f "${BLAST_SWISSPROT_DB}.pin" ]]; then
log "WARNING: BLAST Swiss-Prot database not found"
fi
if [[ ! -f "${PFAM_DB}.h3f" ]]; then
log "WARNING: Pfam HMM database not found"
fi
}
# Download and update databases
update_databases() {
log "Updating databases..."
# This would require specific commands for each database
# Example for updating Swiss-Prot:
# wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
# tar -xzf swissprot.tar.gz -C /path/to/blast/db/
log "Database update completed"
}
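For the NCBI databases specifically, BLAST+ ships an update_blastdb.pl helper that downloads the preformatted databases; a short sketch (run it from your BLAST database directory):

# Fetch preformatted NCBI databases (swissprot is small; nr is very large)
cd /path/to/blast/db
update_blastdb.pl --decompress swissprot
update_blastdb.pl --decompress nr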
4. Quality Control and Validation
Add comprehensive quality control:
# Validate protein sequences
validate_protein_sequences() {
log "Validating protein sequences..."
    # Check for invalid characters; with the extended IUPAC codes
    # (B, J, O, U, X, Z) every letter is a legal residue code, and *
    # marks translated stops
    if grep -v "^>" "$PROCESSED_PROTEINS" | grep -q "[^A-Za-z*]"; then
        log "ERROR: Invalid amino acid characters found in sequences"
        exit 1
    fi
# Check for very short sequences
awk '/^>/ { header=$0; next }
{
if (length($0) < 10) {
print "WARNING: Very short sequence found in " header
}
}' "$PROCESSED_PROTEINS"
log "Sequence validation completed"
}
# Validate tool installations
validate_tools() {
log "Validating required tools..."
local required_tools=("blastp" "interproscan.sh" "hmmscan" "signalp" "tmhmm")
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" &> /dev/null; then
log "ERROR: Required tool $tool not found in PATH"
exit 1
fi
done
log "All required tools are available"
}
Running the Pipeline
To run the pipeline:
# Make script executable
chmod +x protein_annotation_pipeline.sh
# Run the pipeline
./protein_annotation_pipeline.sh
# For verbose output
./protein_annotation_pipeline.sh 2>&1 | tee protein_annotation_run.log
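Before committing to a full run, it can be worth smoke-testing on a handful of sequences. One way is to extract the first five FASTA records and temporarily point INPUT_PROTEINS at the subset:

# Extract the first 5 FASTA records for a quick test run
awk '/^>/ { n++ } n <= 5' data/input/proteins.fasta \
    > data/input/proteins_subset.fasta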
Expected Output Structure
After running the pipeline, you’ll have:
protein_annotation_pipeline/
├── data/
│ ├── input/
│ │ └── proteins.fasta
│ └── processed/
│ └── proteins_clean.fasta
├── results/
│ ├── blast/
│ │ ├── nr_blast.tsv
│ │ └── swissprot_blast.tsv
│ ├── interpro/
│ │ └── interpro_results.tsv
│ ├── hmmer/
│ │ └── hmmer_results.tbl
│ ├── predictions/
│ │ ├── signalp_results.txt
│ │ └── tmhmm_results.txt
│ └── reports/
│ ├── protein_annotation_report.tsv
│ └── summary.txt
├── logs/
│ └── pipeline.log
├── protein_annotation_config.ini
└── protein_annotation_pipeline.sh
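The final report is plain TSV, so it is easy to inspect from the shell; for example:

# Pretty-print the report for quick inspection
column -s$'\t' -t results/reports/protein_annotation_report.tsv | less -S

# Count proteins with at least one Pfam domain (column 6)
awk -F'\t' 'NR > 1 && $6 != "" { n++ } END { print n + 0 }' \
    results/reports/protein_annotation_report.tsv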
Best Practices
- Modularity: Break complex tasks into functions
- Error Handling: Use set -euo pipefail and proper error checking
- Logging: Implement comprehensive logging for debugging
- Configuration: Use external config files for flexibility
- Validation: Validate inputs, outputs, and tool installations
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use multiple cores for database searches
- Database Indexing: Ensure databases are properly indexed
- Memory Management: Monitor and limit memory usage
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible (see the sketch after this list)
- Compression: Use appropriate compression for large output files
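As a concrete example of the caching point, one minimal sketch wraps a step so it is skipped when its output already exists and is newer than the preprocessed input (run_cached is illustrative, not part of the script above):

# Skip a step when its output is present and newer than the input
run_cached() {
    local output="$1"; shift
    if [[ -s "$output" && "$output" -nt "$PROCESSED_PROTEINS" ]]; then
        log "Skipping $* (cached: $output)"
        return 0
    fi
    "$@"
}

# Usage inside main():
# run_cached "$BLAST_NR_RESULTS" blast_nr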
This protein annotation pipeline provides a comprehensive framework for automated protein sequence analysis, integrating multiple bioinformatics tools and databases to deliver detailed functional and structural annotations.