RNA-Seq Analysis Pipeline with Bash
Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.
RNA-Seq Analysis Pipeline with Bash
Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.
Overview
This tutorial will guide you through creating a complete bash pipeline for RNA-Seq data analysis, from quality control of raw sequencing reads to generating publication-ready differential expression results. We’ll implement best practices for RNA-Seq analysis using STAR aligner, featureCounts for quantification, and DESeq2 for statistical analysis.
Prerequisites
Before starting this tutorial, you should have:
- Advanced understanding of bash scripting
- Familiarity with RNA-Seq experimental design
- Understanding of gene expression and differential expression concepts
- Basic knowledge of R programming for statistical analysis
- Tools installed:
STAR,samtools,featureCounts,FastQC,MultiQC,R,DESeq2
Pipeline Workflow
Our RNA-Seq analysis pipeline will follow these steps:
- Quality control of raw reads
- Adapter trimming and quality filtering
- Read alignment to reference genome
- Gene expression quantification
- Quality control of aligned reads
- Differential expression analysis with DESeq2
- Functional enrichment analysis
- Final reporting
Setting Up the Project Structure
First, let’s create a directory structure for our pipeline:
# Create project directory
mkdir rnaseq_pipeline
cd rnaseq_pipeline
# Create directory structure
mkdir -p data/raw data/reference data/processed results/alignment results/quantification results/de_analysis results/enrichment results/reports logs tmp
# Download reference genome and annotation (example with hg38)
# You'll need to download the reference genome and create STAR index
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/gff/GCA_000001405.15_GRCh38.gff.gz
Creating the Main Pipeline Script
Create a main pipeline script rnaseq_pipeline.sh:
#!/bin/bash
# rnaseq_pipeline.sh - RNA-Seq analysis pipeline
# Author: Your Name
# Date: $(date)
set -euo pipefail # Exit on error, undefined vars, pipe failures
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_ROOT}/results"
ALIGN_DIR="${RESULTS_DIR}/alignment"
QUANT_DIR="${RESULTS_DIR}/quantification"
DE_DIR="${RESULTS_DIR}/de_analysis"
ENRICH_DIR="${RESULTS_DIR}/enrichment"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_ROOT}/logs"
TMP_DIR="${PROJECT_ROOT}/tmp"
# Reference files (update these paths for your environment)
REF_GENOME="${REF_DIR}/hg38.fasta"
GTF_FILE="${REF_DIR}/hg38.gtf"
STAR_INDEX_DIR="${REF_DIR}/star_index"
# Sample information (in practice, this would come from a sample sheet)
SAMPLES_FILE="${PROJECT_DIR}/samples.tsv"
# Output files
FINAL_COUNTS="${QUANT_DIR}/gene_counts.tsv"
FINAL_CPM="${QUANT_DIR}/gene_cpm.tsv"
DE_RESULTS="${DE_DIR}/differential_expression_results.tsv"
DE_REPORT="${DE_DIR}/differential_expression_report.html"
# Logging function
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}
# Create directories
setup_directories() {
log "Setting up directories..."
mkdir -p "${PROCESSED_DIR}" "${ALIGN_DIR}" "${QUANT_DIR}" "${DE_DIR}" "${ENRICH_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}" "${TMP_DIR}"
}
# Validate input files
validate_input() {
log "Validating input files..."
# Check if samples file exists
if [[ ! -f "$SAMPLES_FILE" ]]; then
log "ERROR: Samples file not found: $SAMPLES_FILE"
log "Creating sample template..."
cat > "$SAMPLES_FILE" << EOF
sample_id condition replicate forward_reads reverse_reads
sample1 control 1 data/raw/sample1_R1.fastq.gz data/raw/sample1_R2.fastq.gz
sample2 control 2 data/raw/sample2_R1.fastq.gz data/raw/sample2_R2.fastq.gz
sample3 treated 1 data/raw/sample3_R1.fastq.gz data/raw/sample3_R2.fastq.gz
sample4 treated 2 data/raw/sample4_R1.fastq.gz data/raw/sample4_R2.fastq.gz
EOF
log "Created sample template. Please update with your sample information."
exit 1
fi
# Check reference files
local required_files=("$REF_GENOME" "$GTF_FILE")
for file in "${required_files[@]}"; do
if [[ ! -f "$file" ]]; then
log "ERROR: Required reference file not found: $file"
exit 1
fi
done
# Check STAR index
if [[ ! -d "$STAR_INDEX_DIR" ]]; then
log "WARNING: STAR index not found. Creating index..."
create_star_index
fi
log "Input validation passed"
}
# Create STAR genome index
create_star_index() {
log "Creating STAR genome index..."
STAR --runMode genomeGenerate \
--genomeDir "$STAR_INDEX_DIR" \
--genomeFastaFiles "$REF_GENOME" \
--sjdbGTFfile "$GTF_FILE" \
--sjdbOverhang 100 \
--runThreadN 8
log "STAR genome index created"
}
# Quality control with FastQC
run_fastqc() {
local input_dir="$1"
local output_dir="$2"
log "Running FastQC on ${input_dir}..."
mkdir -p "$output_dir"
fastqc -o "$output_dir" -t 4 "$input_dir"/*.fastq.gz
}
# Adapter trimming with Trim Galore
trim_reads() {
log "Trimming adapters and low-quality reads..."
# Read samples from TSV file
while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
# Skip header
[[ "$sample_id" == "sample_id" ]] && continue
log "Trimming reads for sample: $sample_id"
trim_galore \
--paired \
--cores 4 \
--length 36 \
--quality 20 \
--output_dir "${PROCESSED_DIR}" \
"${DATA_DIR}/${forward}" \
"${DATA_DIR}/${reverse}"
done < "$SAMPLES_FILE"
}
# Align reads with STAR
align_reads() {
log "Aligning reads with STAR..."
while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
# Skip header
[[ "$sample_id" == "sample_id" ]] && continue
log "Aligning reads for sample: $sample_id"
# Define trimmed reads
local trimmed_forward="${PROCESSED_DIR}/$(basename "${forward}" .fastq.gz)_val_1.fq.gz"
local trimmed_reverse="${PROCESSED_DIR}/$(basename "${reverse}" .fastq.gz)_val_2.fq.gz"
# Create sample-specific output directory
local sample_align_dir="${ALIGN_DIR}/${sample_id}"
mkdir -p "$sample_align_dir"
# Align with STAR
STAR --genomeDir "$STAR_INDEX_DIR" \
--readFilesIn "$trimmed_forward" "$trimmed_reverse" \
--readFilesCommand zcat \
--outFileNamePrefix "${sample_align_dir}/${sample_id}_" \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonical \
--runThreadN 8 \
--limitBAMsortRAM 16000000000
# Index BAM file
samtools index "${sample_align_dir}/${sample_id}_Aligned.sortedByCoord.out.bam"
done < "$SAMPLES_FILE"
}
# Quality control of aligned reads
alignment_qc() {
log "Running alignment quality control..."
# Create directory for alignment QC
local align_qc_dir="${REPORTS_DIR}/alignment_qc"
mkdir -p "$align_qc_dir"
# Collect alignment metrics
while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
# Skip header
[[ "$sample_id" == "sample_id" ]] && continue
local bam_file="${ALIGN_DIR}/${sample_id}/${sample_id}_Aligned.sortedByCoord.out.bam"
# Flagstat
samtools flagstat "$bam_file" > "${align_qc_dir}/${sample_id}_flagstat.txt"
# Collect insert size metrics
samtools stats "$bam_file" | grep ^IS > "${align_qc_dir}/${sample_id}_insert_size.txt"
done < "$SAMPLES_FILE"
# Run MultiQC on alignment results
multiqc -o "$align_qc_dir" "$align_qc_dir"
}
# Gene expression quantification with featureCounts
quantify_expression() {
log "Quantifying gene expression with featureCounts..."
# Create list of BAM files
local bam_list=""
while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
# Skip header
[[ "$sample_id" == "sample_id" ]] && continue
local bam_file="${ALIGN_DIR}/${sample_id}/${sample_id}_Aligned.sortedByCoord.out.bam"
bam_list="$bam_list $bam_file"
done < "$SAMPLES_FILE"
# Run featureCounts
featureCounts -a "$GTF_FILE" \
-o "$FINAL_COUNTS" \
-T 8 \
-t exon \
-g gene_id \
$bam_list
log "Expression quantification completed"
}
# Calculate CPM (Counts Per Million)
calculate_cpm() {
log "Calculating CPM values..."
# Use R script for CPM calculation
local r_script="${TMP_DIR}/calculate_cpm.R"
cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(edgeR)
# Read count data
counts_file <- commandArgs(trailingOnly = TRUE)[1]
output_file <- commandArgs(trailingOnly = TRUE)[2]
# Read counts
counts <- read.table(counts_file, header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)
count_data <- counts[,6:ncol(counts)] # Assuming first 5 columns are metadata
# Calculate CPM
cpm_data <- cpm(count_data)
# Write output
write.table(cpm_data, file=output_file, sep="\t", quote=FALSE, col.names=NA)
message("CPM calculation completed")
EOF
Rscript "$r_script" "$FINAL_COUNTS" "$FINAL_CPM"
log "CPM calculation completed"
}
# Differential expression analysis with DESeq2
run_deseq2() {
log "Running differential expression analysis with DESeq2..."
# Create DESeq2 R script
local r_script="${TMP_DIR}/deseq2_analysis.R"
# Create sample information data frame
local sample_info_file="${TMP_DIR}/sample_info.tsv"
awk 'NR>1 {print $1 "\t" $2}' "$SAMPLES_FILE" > "$sample_info_file"
cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(DESeq2)
library(ggplot2)
library(pheatmap)
# Read arguments
counts_file <- commandArgs(trailingOnly = TRUE)[1]
sample_info_file <- commandArgs(trailingOnly = TRUE)[2]
output_dir <- commandArgs(trailingOnly = TRUE)[3]
# Read count data
counts <- read.table(counts_file, header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)
count_data <- counts[,6:ncol(counts)] # Assuming first 5 columns are metadata
colnames(count_data) <- gsub("\\.bam$", "", colnames(count_data))
# Read sample information
sample_info <- read.table(sample_info_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
rownames(sample_info) <- sample_info$sample_id
# Create DESeq dataset
dds <- DESeqDataSetFromMatrix(countData = count_data,
colData = sample_info,
design = ~ condition)
# Run DESeq
dds <- DESeq(dds)
# Extract results
res <- results(dds)
res <- res[order(res$padj),]
# Write results
write.table(as.data.frame(res), file=file.path(output_dir, "differential_expression_results.tsv"),
sep="\t", quote=FALSE, col.names=NA)
# Create MA plot
png(file.path(output_dir, "ma_plot.png"), width=800, height=600)
plotMA(res, main="MA Plot")
dev.off()
# Create volcano plot
res_df <- as.data.frame(res)
res_df$significance <- ifelse(res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1, "Significant", "Not Significant")
png(file.path(output_dir, "volcano_plot.png"), width=800, height=600)
ggplot(res_df, aes(x=log2FoldChange, y=-log10(padj), color=significance)) +
geom_point(alpha=0.6) +
scale_color_manual(values=c("Significant"="red", "Not Significant"="grey")) +
theme_minimal() +
labs(title="Volcano Plot", x="Log2 Fold Change", y="-Log10 Adjusted P-value")
dev.off()
# Create heatmap of top 50 most variable genes
top_genes <- head(order(rowVars(assay(dds)), decreasing=TRUE), 50)
mat <- assay(dds)[top_genes,]
mat <- mat - rowMeans(mat)
png(file.path(output_dir, "heatmap_top50.png"), width=1000, height=800)
pheatmap(mat, cluster_rows=TRUE, show_rownames=FALSE,
annotation_col=sample_info[, "condition", drop=FALSE])
dev.off()
message("DESeq2 analysis completed")
EOF
mkdir -p "$DE_DIR"
Rscript "$r_script" "$FINAL_COUNTS" "$sample_info_file" "$DE_DIR"
log "DESeq2 analysis completed"
}
# Functional enrichment analysis
run_enrichment() {
log "Running functional enrichment analysis..."
# Create enrichment R script
local r_script="${TMP_DIR}/enrichment_analysis.R"
cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(clusterProfiler)
library(org.Hs.eg.db)
# Read DE results
de_results_file <- commandArgs(trailingOnly = TRUE)[1]
output_dir <- commandArgs(trailingOnly = TRUE)[2]
# Read DE results
de_results <- read.table(de_results_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
# Filter significant genes
sig_genes <- de_results[de_results$padj < 0.05 & abs(de_results$log2FoldChange) > 1, ]
up_genes <- sig_genes[sig_genes$log2FoldChange > 1, ]
down_genes <- sig_genes[sig_genes$log2FoldChange < -1, ]
# Convert gene IDs (assuming they are Ensembl IDs)
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene_map <- getBM(attributes=c("ensembl_gene_id", "entrezgene_id"),
filters="ensembl_gene_id",
values=rownames(up_genes),
mart=ensembl)
up_entrez <- gene_map$entrezgene_id[!is.na(gene_map$entrezgene_id)]
# GO enrichment for upregulated genes
if(length(up_entrez) > 10) {
ego_up <- enrichGO(gene=up_entrez,
OrgDb=org.Hs.eg.db,
keyType="ENTREZID",
ont="BP",
pAdjustMethod="BH",
qvalueCutoff=0.05,
readable=TRUE)
if(length(ego_up) > 0) {
write.table(as.data.frame(ego_up),
file=file.path(output_dir, "go_enrichment_upregulated.tsv"),
sep="\t", quote=FALSE, row.names=FALSE)
}
}
message("Enrichment analysis completed")
EOF
mkdir -p "$ENRICH_DIR"
Rscript "$r_script" "${DE_DIR}/differential_expression_results.tsv" "$ENRICH_DIR"
log "Enrichment analysis completed"
}
# Generate final report
generate_report() {
log "Generating final report..."
# Create summary statistics
local total_samples=$(($(wc -l < "$SAMPLES_FILE") - 1))
local total_genes=$(wc -l < "$FINAL_COUNTS")
local significant_genes=$(awk '$6 < 0.05' "${DE_DIR}/differential_expression_results.tsv" | wc -l)
cat > "${REPORTS_DIR}/rnaseq_summary.txt" << EOF
RNA-Seq Analysis Pipeline Summary
================================
Total samples processed: $total_samples
Total genes quantified: $total_genes
Significantly differentially expressed genes: $significant_genes
Analysis completed on: $(date)
Results directories:
- Alignment results: ${ALIGN_DIR}
- Quantification results: ${QUANT_DIR}
- Differential expression results: ${DE_DIR}
- Functional enrichment results: ${ENRICH_DIR}
EOF
# Run MultiQC on all QC results
multiqc -o "${REPORTS_DIR}/multiqc_report" \
"${REPORTS_DIR}/fastqc_raw" \
"${REPORTS_DIR}/fastqc_trimmed" \
"${REPORTS_DIR}/alignment_qc"
log "Final report generated"
}
# Main pipeline execution
main() {
setup_directories
validate_input
# Step 1: Initial QC
run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"
# Step 2: Read trimming
trim_reads
# Step 3: Post-trimming QC
run_fastqc "${PROCESSED_DIR}" "${REPORTS_DIR}/fastqc_trimmed"
# Step 4: Alignment
align_reads
# Step 5: Alignment QC
alignment_qc
# Step 6: Expression quantification
quantify_expression
calculate_cpm
# Step 7: Differential expression analysis
run_deseq2
# Step 8: Functional enrichment
run_enrichment
# Step 9: Final reporting
generate_report
log "RNA-Seq pipeline completed successfully!"
log "Results available in: ${RESULTS_DIR}"
}
# Run main function
main "$@"
Creating a Sample Processing Script
Create a script for processing individual samples process_sample.sh:
#!/bin/bash
# process_sample.sh - Process individual RNA-Seq sample
set -euo pipefail
# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
PROCESSED_DIR="${DATA_DIR}/processed"
ALIGN_DIR="${PROJECT_DIR}/results/alignment"
QUANT_DIR="${PROJECT_DIR}/results/quantification"
# Function to process a single sample
process_sample() {
local sample_id="$1"
local forward_reads="$2"
local reverse_reads="$3"
local reference_genome="$4"
local gtf_file="$5"
local star_index="$6"
echo "Processing sample: $sample_id"
# Create sample directory
local sample_dir="${ALIGN_DIR}/${sample_id}"
mkdir -p "$sample_dir"
# Trim reads
trim_galore \
--paired \
--cores 4 \
--length 36 \
--quality 20 \
--output_dir "${PROCESSED_DIR}" \
"$forward_reads" \
"$reverse_reads"
# Define trimmed reads
local trimmed_forward="${PROCESSED_DIR}/$(basename "${forward_reads}" .fastq.gz)_val_1.fq.gz"
local trimmed_reverse="${PROCESSED_DIR}/$(basename "${reverse_reads}" .fastq.gz)_val_2.fq.gz"
# Align with STAR
STAR --genomeDir "$star_index" \
--readFilesIn "$trimmed_forward" "$trimmed_reverse" \
--readFilesCommand zcat \
--outFileNamePrefix "${sample_dir}/${sample_id}_" \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs RemoveNoncanonical \
--runThreadN 8
# Index BAM file
samtools index "${sample_dir}/${sample_id}_Aligned.sortedByCoord.out.bam"
echo "Sample $sample_id processed successfully"
}
# Main execution
if [[ $# -ne 6 ]]; then
echo "Usage: $0 <sample_id> <forward_reads> <reverse_reads> <reference_genome> <gtf_file> <star_index>"
exit 1
fi
process_sample "$1" "$2" "$3" "$4" "$5" "$6"
Making Scripts Executable
chmod +x rnaseq_pipeline.sh
chmod +x process_sample.sh
Advanced Features
1. Parallel Sample Processing
Create a script for parallel processing parallel_rnaseq.sh:
#!/bin/bash
# parallel_rnaseq.sh - Parallel RNA-Seq processing
set -euo pipefail
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SAMPLES_FILE="${PROJECT_DIR}/samples.tsv"
MAX_JOBS=4
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}
# Process samples in parallel
process_samples_parallel() {
log "Processing samples in parallel (max ${MAX_JOBS} jobs)..."
# Read samples and submit jobs
local job_pids=()
while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
# Skip header
[[ "$sample_id" == "sample_id" ]] && continue
# Submit job
./process_sample.sh \
"$sample_id" \
"${PROJECT_DIR}/${forward}" \
"${PROJECT_DIR}/${reverse}" \
"${PROJECT_DIR}/data/reference/hg38.fasta" \
"${PROJECT_DIR}/data/reference/hg38.gtf" \
"${PROJECT_DIR}/data/reference/star_index" &
job_pids+=($!)
# Limit concurrent jobs
if [[ ${#job_pids[@]} -ge $MAX_JOBS ]]; then
wait "${job_pids[@]}"
job_pids=()
fi
done < "$SAMPLES_FILE"
# Wait for remaining jobs
if [[ ${#job_pids[@]} -gt 0 ]]; then
wait "${job_pids[@]}"
fi
log "All samples processed"
}
# Main execution
main() {
process_samples_parallel
}
main "$@"
2. Quality Control and Validation
Add comprehensive QC functions:
# Validate FASTQ files
validate_fastq() {
local fastq_file="$1"
log "Validating FASTQ file: $fastq_file"
# Check if file exists
[[ -f "$fastq_file" ]] || error_exit "FASTQ file not found: $fastq_file"
# Check file format (basic validation)
local line_count=$(zcat "$fastq_file" | head -n 4 | wc -l)
if [[ $line_count -ne 4 ]]; then
error_exit "Invalid FASTQ format: $fastq_file"
fi
log "FASTQ validation passed"
}
# Validate alignment results
validate_alignment() {
local bam_file="$1"
local sample_id="$2"
log "Validating alignment for sample: $sample_id"
# Check if BAM file exists and is valid
if [[ ! -f "$bam_file" ]] || ! samtools quickcheck "$bam_file"; then
error_exit "Invalid BAM file for sample $sample_id: $bam_file"
fi
# Check alignment statistics
local mapped_reads=$(samtools view -c -F 4 "$bam_file")
local total_reads=$(samtools view -c "$bam_file")
local mapping_rate=$(echo "scale=2; $mapped_reads * 100 / $total_reads" | bc)
if (( $(echo "$mapping_rate < 70" | bc -l) )); then
log "WARNING: Low mapping rate for $sample_id: ${mapping_rate}%"
fi
log "Alignment validation passed for $sample_id"
}
# Validate count data
validate_counts() {
local count_file="$1"
log "Validating count data..."
# Check if file exists
[[ -f "$count_file" ]] || error_exit "Count file not found: $count_file"
# Check if file has data
local line_count=$(wc -l < "$count_file")
if [[ $line_count -lt 10 ]]; then
error_exit "Count file appears to be empty or incomplete: $count_file"
fi
log "Count data validation passed"
}
3. Configuration Management
Create a configuration file rnaseq_config.ini:
[general]
project_name=rnaseq_analysis
threads=8
memory=16G
[quality_control]
min_read_length=36
min_base_quality=20
adapter_type=illumina
[alignment]
aligner=star
max_intron_size=1000000
out_filter_multimap_nmax=20
[quantification]
quantifier=featurecounts
feature_type=exon
attribute_type=gene_id
min_overlap=1
[differential_expression]
method=deseq2
fdr_threshold=0.05
lfc_threshold=1.0
[enrichment]
tool=clusterprofiler
organism=human
ontology=BP
[reporting]
multiqc=true
plots=true
Running the Pipeline
To run the pipeline:
# Make scripts executable
chmod +x rnaseq_pipeline.sh
chmod +x process_sample.sh
chmod +x parallel_rnaseq.sh
# Run the complete pipeline
./rnaseq_pipeline.sh
# Or process samples in parallel
./parallel_rnaseq.sh
# For verbose output
./rnaseq_pipeline.sh 2>&1 | tee rnaseq_run.log
Expected Output Structure
After running the pipeline, you’ll have:
rnaseq_pipeline/
├── data/
│ ├── raw/
│ │ ├── sample1_R1.fastq.gz
│ │ ├── sample1_R2.fastq.gz
│ │ ├── sample2_R1.fastq.gz
│ │ └── sample2_R2.fastq.gz
│ ├── reference/
│ │ ├── hg38.fasta
│ │ ├── hg38.gtf
│ │ └── star_index/
│ └── processed/
│ ├── sample1_R1_val_1.fq.gz
│ └── sample1_R2_val_2.fq.gz
├── results/
│ ├── alignment/
│ │ ├── sample1/
│ │ │ ├── sample1_Aligned.sortedByCoord.out.bam
│ │ │ └── sample1_Log.final.out
│ │ └── sample2/
│ ├── quantification/
│ │ ├── gene_counts.tsv
│ │ └── gene_cpm.tsv
│ ├── de_analysis/
│ │ ├── differential_expression_results.tsv
│ │ ├── ma_plot.png
│ │ ├── volcano_plot.png
│ │ └── heatmap_top50.png
│ ├── enrichment/
│ │ └── go_enrichment_upregulated.tsv
│ └── reports/
│ ├── rnaseq_summary.txt
│ ├── fastqc_raw/
│ ├── fastqc_trimmed/
│ ├── alignment_qc/
│ └── multiqc_report/
├── logs/
│ └── pipeline.log
├── tmp/
├── samples.tsv
├── rnaseq_config.ini
├── rnaseq_pipeline.sh
├── process_sample.sh
└── parallel_rnaseq.sh
Best Practices
- Modularity: Break complex tasks into functions
- Error Handling: Use
set -euo pipefailand proper error checking - Logging: Implement comprehensive logging for debugging
- Configuration: Use external config files for flexibility
- Validation: Validate inputs, outputs, and intermediate files
- Documentation: Comment your code thoroughly
- Portability: Use relative paths and environment variables
Performance Optimization
- Parallel Processing: Use multiple cores for alignment and quantification
- Memory Management: Monitor and limit memory usage
- Disk I/O: Minimize file operations and use appropriate buffering
- Caching: Reuse intermediate results when possible
- Compression: Use appropriate compression for large output files
- Resource Monitoring: Track CPU and memory usage
This RNA-Seq analysis pipeline provides a comprehensive framework for analyzing RNA sequencing data, from raw reads to biological insights. The modular design allows for easy customization and extension based on specific analysis requirements.
Frequently Asked Questions
Q: What is RNA-Seq and why use bash for analysis?
A: RNA-Seq (RNA sequencing) is a method to analyze the transcriptome - all RNA molecules in cells. Bash pipelines are ideal for RNA-Seq analysis because they allow you to chain together multiple bioinformatics tools, handle large datasets efficiently, and create reproducible workflows that can run on local machines or HPC clusters.
Q: How much computational power do I need for RNA-Seq analysis?
A: For a typical RNA-Seq experiment with 6-12 samples, you’ll need at least 16GB RAM and 8 CPU cores. STAR alignment is memory-intensive (requires 30GB RAM for human genome), while DESeq2 analysis needs 4-8GB RAM. Processing time ranges from 2-6 hours depending on sample size and depth.
Q: What file formats are used in RNA-Seq analysis?
A: RNA-Seq analysis uses several key formats: FASTQ (raw sequencing reads), BAM (aligned reads), GTF/GFF (gene annotations), TSV (count matrices), and standard formats like PNG for plots. The pipeline automatically handles format conversions between tools.
Q: How do I choose the right parameters for STAR alignment?
A: Key STAR parameters include: --runThreadN (match your CPU cores), --outFilterMultimapNmax (20 for genome, 1 for transcriptome), and --outSAMstrandField intronMotif for stranded libraries. Use --quantMode GeneCounts if you want STAR to do quantification instead of featureCounts.
Q: What’s the difference between gene expression and differential expression?
A: Gene expression measures how much each gene is being transcribed (read counts). Differential expression compares gene expression between conditions (treatment vs control) to find genes that are significantly up- or down-regulated. DESeq2 performs statistical testing to identify these changes.
Q: How many biological replicates do I need for RNA-Seq?
A: Minimum 3 biological replicates per condition for meaningful statistics. More replicates (4-6) improve power to detect small fold changes. Technical replicates are less important than biological replicates. Never pool biological samples - treat each individual as a separate replicate.
Q: What quality control steps should I include?
A: Essential QC steps: FastQC on raw and trimmed reads, MultiQC for aggregated reports, alignment statistics (mapping rates >80%), duplication rates (<30%), and PCA plots to check sample clustering. Use tools like RSeQC for advanced RNA-Seq specific QC.
Q: How do I interpret differential expression results?
A: Key metrics: Log2 fold change (>1 or <-1 for 2-fold change), adjusted p-value (padj <0.05 for significance), and base mean (average expression level). Filter results by both statistical significance (padj) and biological significance (fold change).
Q: Can I modify this pipeline for different organisms?
A: Yes! Change the reference genome files in the configuration section, update the organism database in the enrichment analysis (e.g., org.Mm.eg.db for mouse), and adjust gene ID formats if needed. The core pipeline logic remains the same across organisms.
Q: What if my RNA-Seq data is single-end instead of paired-end?
A: Modify the STAR command to use only --readFilesIn $forward_read (remove reverse), update Trim Galore parameters to remove --paired, and adjust the sample sheet format to include only one FASTQ file per sample. The rest of the pipeline remains unchanged.
Continue Learning
One Small Win
Try this quick command to get started:
Copy and paste this into your terminal to get started immediately.
Related Content
Genome Extraction Pipeline with Bash
Learn to build a robust bash pipeline for extracting and preparing genomic data from raw sequencing files to analysis-ready assemblies.
Protein Annotation Pipeline with Bash
Build a comprehensive bash pipeline for protein sequence annotation including functional prediction, domain identification, and structural analysis.
Variant Calling Pipeline with Bash
Build a comprehensive bash pipeline for variant calling from raw sequencing data to annotated VCF files using GATK and bcftools.
RNA-Seq Analysis Pipeline
A complete Snakemake pipeline for RNA-Seq data analysis from raw FASTQ files to differential expression results with MultiQC reporting.
Start Your Own Project
Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.
Use This Templategit clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run