bioinformatics advanced

RNA-Seq Analysis Pipeline with Bash

Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.

February 5, 2025 25 min read

#bash #rna-seq #star #featurecounts #deseq2 #pipeline #differential-expression

RNA-Seq Analysis Pipeline with Bash

Build a comprehensive bash pipeline for RNA-Seq data analysis from raw reads to differential expression results using STAR, featureCounts, and DESeq2.

Overview

This tutorial will guide you through creating a complete bash pipeline for RNA-Seq data analysis, from quality control of raw sequencing reads to generating publication-ready differential expression results. We’ll implement best practices for RNA-Seq analysis using STAR aligner, featureCounts for quantification, and DESeq2 for statistical analysis.

Prerequisites

Before starting this tutorial, you should have:

Advanced understanding of bash scripting
Familiarity with RNA-Seq experimental design
Understanding of gene expression and differential expression concepts
Basic knowledge of R programming for statistical analysis
Tools installed: STAR, samtools, featureCounts, FastQC, MultiQC, R, DESeq2

Pipeline Workflow

Our RNA-Seq analysis pipeline will follow these steps:

Quality control of raw reads
Adapter trimming and quality filtering
Read alignment to reference genome
Gene expression quantification
Quality control of aligned reads
Differential expression analysis with DESeq2
Functional enrichment analysis
Final reporting

Setting Up the Project Structure

First, let’s create a directory structure for our pipeline:

# Create project directory
mkdir rnaseq_pipeline
cd rnaseq_pipeline

# Create directory structure
mkdir -p data/raw data/reference data/processed results/alignment results/quantification results/de_analysis results/enrichment results/reports logs tmp

# Download reference genome and annotation (example with hg38)
# You'll need to download the reference genome and create STAR index
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
# wget ftp://ftp.ncbi.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/gff/GCA_000001405.15_GRCh38.gff.gz

Creating the Main Pipeline Script

Create a main pipeline script rnaseq_pipeline.sh:

#!/bin/bash

# rnaseq_pipeline.sh - RNA-Seq analysis pipeline
# Author: Your Name
# Date: $(date)

set -euo pipefail  # Exit on error, undefined vars, pipe failures

# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
RAW_DIR="${DATA_DIR}/raw"
REF_DIR="${DATA_DIR}/reference"
PROCESSED_DIR="${DATA_DIR}/processed"
RESULTS_DIR="${PROJECT_ROOT}/results"
ALIGN_DIR="${RESULTS_DIR}/alignment"
QUANT_DIR="${RESULTS_DIR}/quantification"
DE_DIR="${RESULTS_DIR}/de_analysis"
ENRICH_DIR="${RESULTS_DIR}/enrichment"
REPORTS_DIR="${RESULTS_DIR}/reports"
LOGS_DIR="${PROJECT_ROOT}/logs"
TMP_DIR="${PROJECT_ROOT}/tmp"

# Reference files (update these paths for your environment)
REF_GENOME="${REF_DIR}/hg38.fasta"
GTF_FILE="${REF_DIR}/hg38.gtf"
STAR_INDEX_DIR="${REF_DIR}/star_index"

# Sample information (in practice, this would come from a sample sheet)
SAMPLES_FILE="${PROJECT_DIR}/samples.tsv"

# Output files
FINAL_COUNTS="${QUANT_DIR}/gene_counts.tsv"
FINAL_CPM="${QUANT_DIR}/gene_cpm.tsv"
DE_RESULTS="${DE_DIR}/differential_expression_results.tsv"
DE_REPORT="${DE_DIR}/differential_expression_report.html"

# Logging function
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "${LOGS_DIR}/pipeline.log"
}

# Create directories
setup_directories() {
    log "Setting up directories..."
    mkdir -p "${PROCESSED_DIR}" "${ALIGN_DIR}" "${QUANT_DIR}" "${DE_DIR}" "${ENRICH_DIR}" "${REPORTS_DIR}" "${LOGS_DIR}" "${TMP_DIR}"
}

# Validate input files
validate_input() {
    log "Validating input files..."

    # Check if samples file exists
    if [[ ! -f "$SAMPLES_FILE" ]]; then
        log "ERROR: Samples file not found: $SAMPLES_FILE"
        log "Creating sample template..."
        cat > "$SAMPLES_FILE" << EOF
sample_id	condition	replicate	forward_reads	reverse_reads
sample1	control	1	data/raw/sample1_R1.fastq.gz	data/raw/sample1_R2.fastq.gz
sample2	control	2	data/raw/sample2_R1.fastq.gz	data/raw/sample2_R2.fastq.gz
sample3	treated	1	data/raw/sample3_R1.fastq.gz	data/raw/sample3_R2.fastq.gz
sample4	treated	2	data/raw/sample4_R1.fastq.gz	data/raw/sample4_R2.fastq.gz
EOF
        log "Created sample template. Please update with your sample information."
        exit 1
    fi

    # Check reference files
    local required_files=("$REF_GENOME" "$GTF_FILE")
    for file in "${required_files[@]}"; do
        if [[ ! -f "$file" ]]; then
            log "ERROR: Required reference file not found: $file"
            exit 1
        fi
    done

    # Check STAR index
    if [[ ! -d "$STAR_INDEX_DIR" ]]; then
        log "WARNING: STAR index not found. Creating index..."
        create_star_index
    fi

    log "Input validation passed"
}

# Create STAR genome index
create_star_index() {
    log "Creating STAR genome index..."

    STAR --runMode genomeGenerate \
         --genomeDir "$STAR_INDEX_DIR" \
         --genomeFastaFiles "$REF_GENOME" \
         --sjdbGTFfile "$GTF_FILE" \
         --sjdbOverhang 100 \
         --runThreadN 8

    log "STAR genome index created"
}

# Quality control with FastQC
run_fastqc() {
    local input_dir="$1"
    local output_dir="$2"

    log "Running FastQC on ${input_dir}..."
    mkdir -p "$output_dir"
    fastqc -o "$output_dir" -t 4 "$input_dir"/*.fastq.gz
}

# Adapter trimming with Trim Galore
trim_reads() {
    log "Trimming adapters and low-quality reads..."

    # Read samples from TSV file
    while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
        # Skip header
        [[ "$sample_id" == "sample_id" ]] && continue

        log "Trimming reads for sample: $sample_id"

        trim_galore \
            --paired \
            --cores 4 \
            --length 36 \
            --quality 20 \
            --output_dir "${PROCESSED_DIR}" \
            "${DATA_DIR}/${forward}" \
            "${DATA_DIR}/${reverse}"

    done < "$SAMPLES_FILE"
}

# Align reads with STAR
align_reads() {
    log "Aligning reads with STAR..."

    while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
        # Skip header
        [[ "$sample_id" == "sample_id" ]] && continue

        log "Aligning reads for sample: $sample_id"

        # Define trimmed reads
        local trimmed_forward="${PROCESSED_DIR}/$(basename "${forward}" .fastq.gz)_val_1.fq.gz"
        local trimmed_reverse="${PROCESSED_DIR}/$(basename "${reverse}" .fastq.gz)_val_2.fq.gz"

        # Create sample-specific output directory
        local sample_align_dir="${ALIGN_DIR}/${sample_id}"
        mkdir -p "$sample_align_dir"

        # Align with STAR
        STAR --genomeDir "$STAR_INDEX_DIR" \
             --readFilesIn "$trimmed_forward" "$trimmed_reverse" \
             --readFilesCommand zcat \
             --outFileNamePrefix "${sample_align_dir}/${sample_id}_" \
             --outSAMtype BAM SortedByCoordinate \
             --outSAMstrandField intronMotif \
             --outFilterIntronMotifs RemoveNoncanonical \
             --runThreadN 8 \
             --limitBAMsortRAM 16000000000

        # Index BAM file
        samtools index "${sample_align_dir}/${sample_id}_Aligned.sortedByCoord.out.bam"

    done < "$SAMPLES_FILE"
}

# Quality control of aligned reads
alignment_qc() {
    log "Running alignment quality control..."

    # Create directory for alignment QC
    local align_qc_dir="${REPORTS_DIR}/alignment_qc"
    mkdir -p "$align_qc_dir"

    # Collect alignment metrics
    while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
        # Skip header
        [[ "$sample_id" == "sample_id" ]] && continue

        local bam_file="${ALIGN_DIR}/${sample_id}/${sample_id}_Aligned.sortedByCoord.out.bam"

        # Flagstat
        samtools flagstat "$bam_file" > "${align_qc_dir}/${sample_id}_flagstat.txt"

        # Collect insert size metrics
        samtools stats "$bam_file" | grep ^IS > "${align_qc_dir}/${sample_id}_insert_size.txt"

    done < "$SAMPLES_FILE"

    # Run MultiQC on alignment results
    multiqc -o "$align_qc_dir" "$align_qc_dir"
}

# Gene expression quantification with featureCounts
quantify_expression() {
    log "Quantifying gene expression with featureCounts..."

    # Create list of BAM files
    local bam_list=""
    while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
        # Skip header
        [[ "$sample_id" == "sample_id" ]] && continue

        local bam_file="${ALIGN_DIR}/${sample_id}/${sample_id}_Aligned.sortedByCoord.out.bam"
        bam_list="$bam_list $bam_file"
    done < "$SAMPLES_FILE"

    # Run featureCounts
    featureCounts -a "$GTF_FILE" \
                  -o "$FINAL_COUNTS" \
                  -T 8 \
                  -t exon \
                  -g gene_id \
                  $bam_list

    log "Expression quantification completed"
}

# Calculate CPM (Counts Per Million)
calculate_cpm() {
    log "Calculating CPM values..."

    # Use R script for CPM calculation
    local r_script="${TMP_DIR}/calculate_cpm.R"

    cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(edgeR)

# Read count data
counts_file <- commandArgs(trailingOnly = TRUE)[1]
output_file <- commandArgs(trailingOnly = TRUE)[2]

# Read counts
counts <- read.table(counts_file, header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)
count_data <- counts[,6:ncol(counts)]  # Assuming first 5 columns are metadata

# Calculate CPM
cpm_data <- cpm(count_data)

# Write output
write.table(cpm_data, file=output_file, sep="\t", quote=FALSE, col.names=NA)

message("CPM calculation completed")
EOF

    Rscript "$r_script" "$FINAL_COUNTS" "$FINAL_CPM"

    log "CPM calculation completed"
}

# Differential expression analysis with DESeq2
run_deseq2() {
    log "Running differential expression analysis with DESeq2..."

    # Create DESeq2 R script
    local r_script="${TMP_DIR}/deseq2_analysis.R"

    # Create sample information data frame
    local sample_info_file="${TMP_DIR}/sample_info.tsv"
    awk 'NR>1 {print $1 "\t" $2}' "$SAMPLES_FILE" > "$sample_info_file"

    cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(DESeq2)
library(ggplot2)
library(pheatmap)

# Read arguments
counts_file <- commandArgs(trailingOnly = TRUE)[1]
sample_info_file <- commandArgs(trailingOnly = TRUE)[2]
output_dir <- commandArgs(trailingOnly = TRUE)[3]

# Read count data
counts <- read.table(counts_file, header=TRUE, row.names=1, sep="\t", stringsAsFactors=FALSE)
count_data <- counts[,6:ncol(counts)]  # Assuming first 5 columns are metadata
colnames(count_data) <- gsub("\\.bam$", "", colnames(count_data))

# Read sample information
sample_info <- read.table(sample_info_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
rownames(sample_info) <- sample_info$sample_id

# Create DESeq dataset
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = ~ condition)

# Run DESeq
dds <- DESeq(dds)

# Extract results
res <- results(dds)
res <- res[order(res$padj),]

# Write results
write.table(as.data.frame(res), file=file.path(output_dir, "differential_expression_results.tsv"),
            sep="\t", quote=FALSE, col.names=NA)

# Create MA plot
png(file.path(output_dir, "ma_plot.png"), width=800, height=600)
plotMA(res, main="MA Plot")
dev.off()

# Create volcano plot
res_df <- as.data.frame(res)
res_df$significance <- ifelse(res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1, "Significant", "Not Significant")

png(file.path(output_dir, "volcano_plot.png"), width=800, height=600)
ggplot(res_df, aes(x=log2FoldChange, y=-log10(padj), color=significance)) +
  geom_point(alpha=0.6) +
  scale_color_manual(values=c("Significant"="red", "Not Significant"="grey")) +
  theme_minimal() +
  labs(title="Volcano Plot", x="Log2 Fold Change", y="-Log10 Adjusted P-value")
dev.off()

# Create heatmap of top 50 most variable genes
top_genes <- head(order(rowVars(assay(dds)), decreasing=TRUE), 50)
mat <- assay(dds)[top_genes,]
mat <- mat - rowMeans(mat)
png(file.path(output_dir, "heatmap_top50.png"), width=1000, height=800)
pheatmap(mat, cluster_rows=TRUE, show_rownames=FALSE,
         annotation_col=sample_info[, "condition", drop=FALSE])
dev.off()

message("DESeq2 analysis completed")
EOF

    mkdir -p "$DE_DIR"
    Rscript "$r_script" "$FINAL_COUNTS" "$sample_info_file" "$DE_DIR"

    log "DESeq2 analysis completed"
}

# Functional enrichment analysis
run_enrichment() {
    log "Running functional enrichment analysis..."

    # Create enrichment R script
    local r_script="${TMP_DIR}/enrichment_analysis.R"

    cat > "$r_script" << 'EOF'
#!/usr/bin/env Rscript
library(clusterProfiler)
library(org.Hs.eg.db)

# Read DE results
de_results_file <- commandArgs(trailingOnly = TRUE)[1]
output_dir <- commandArgs(trailingOnly = TRUE)[2]

# Read DE results
de_results <- read.table(de_results_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Filter significant genes
sig_genes <- de_results[de_results$padj < 0.05 & abs(de_results$log2FoldChange) > 1, ]
up_genes <- sig_genes[sig_genes$log2FoldChange > 1, ]
down_genes <- sig_genes[sig_genes$log2FoldChange < -1, ]

# Convert gene IDs (assuming they are Ensembl IDs)
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene_map <- getBM(attributes=c("ensembl_gene_id", "entrezgene_id"),
                  filters="ensembl_gene_id",
                  values=rownames(up_genes),
                  mart=ensembl)
up_entrez <- gene_map$entrezgene_id[!is.na(gene_map$entrezgene_id)]

# GO enrichment for upregulated genes
if(length(up_entrez) > 10) {
  ego_up <- enrichGO(gene=up_entrez,
                     OrgDb=org.Hs.eg.db,
                     keyType="ENTREZID",
                     ont="BP",
                     pAdjustMethod="BH",
                     qvalueCutoff=0.05,
                     readable=TRUE)

  if(length(ego_up) > 0) {
    write.table(as.data.frame(ego_up),
                file=file.path(output_dir, "go_enrichment_upregulated.tsv"),
                sep="\t", quote=FALSE, row.names=FALSE)
  }
}

message("Enrichment analysis completed")
EOF

    mkdir -p "$ENRICH_DIR"
    Rscript "$r_script" "${DE_DIR}/differential_expression_results.tsv" "$ENRICH_DIR"

    log "Enrichment analysis completed"
}

# Generate final report
generate_report() {
    log "Generating final report..."

    # Create summary statistics
    local total_samples=$(($(wc -l < "$SAMPLES_FILE") - 1))
    local total_genes=$(wc -l < "$FINAL_COUNTS")
    local significant_genes=$(awk '$6 < 0.05' "${DE_DIR}/differential_expression_results.tsv" | wc -l)

    cat > "${REPORTS_DIR}/rnaseq_summary.txt" << EOF
RNA-Seq Analysis Pipeline Summary
================================

Total samples processed: $total_samples
Total genes quantified: $total_genes
Significantly differentially expressed genes: $significant_genes

Analysis completed on: $(date)

Results directories:
- Alignment results: ${ALIGN_DIR}
- Quantification results: ${QUANT_DIR}
- Differential expression results: ${DE_DIR}
- Functional enrichment results: ${ENRICH_DIR}
EOF

    # Run MultiQC on all QC results
    multiqc -o "${REPORTS_DIR}/multiqc_report" \
            "${REPORTS_DIR}/fastqc_raw" \
            "${REPORTS_DIR}/fastqc_trimmed" \
            "${REPORTS_DIR}/alignment_qc"

    log "Final report generated"
}

# Main pipeline execution
main() {
    setup_directories
    validate_input

    # Step 1: Initial QC
    run_fastqc "${RAW_DIR}" "${REPORTS_DIR}/fastqc_raw"

    # Step 2: Read trimming
    trim_reads

    # Step 3: Post-trimming QC
    run_fastqc "${PROCESSED_DIR}" "${REPORTS_DIR}/fastqc_trimmed"

    # Step 4: Alignment
    align_reads

    # Step 5: Alignment QC
    alignment_qc

    # Step 6: Expression quantification
    quantify_expression
    calculate_cpm

    # Step 7: Differential expression analysis
    run_deseq2

    # Step 8: Functional enrichment
    run_enrichment

    # Step 9: Final reporting
    generate_report

    log "RNA-Seq pipeline completed successfully!"
    log "Results available in: ${RESULTS_DIR}"
}

# Run main function
main "$@"

Creating a Sample Processing Script

Create a script for processing individual samples process_sample.sh:

#!/bin/bash

# process_sample.sh - Process individual RNA-Seq sample

set -euo pipefail

# Configuration
PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DATA_DIR="${PROJECT_DIR}/data"
PROCESSED_DIR="${DATA_DIR}/processed"
ALIGN_DIR="${PROJECT_DIR}/results/alignment"
QUANT_DIR="${PROJECT_DIR}/results/quantification"

# Function to process a single sample
process_sample() {
    local sample_id="$1"
    local forward_reads="$2"
    local reverse_reads="$3"
    local reference_genome="$4"
    local gtf_file="$5"
    local star_index="$6"

    echo "Processing sample: $sample_id"

    # Create sample directory
    local sample_dir="${ALIGN_DIR}/${sample_id}"
    mkdir -p "$sample_dir"

    # Trim reads
    trim_galore \
        --paired \
        --cores 4 \
        --length 36 \
        --quality 20 \
        --output_dir "${PROCESSED_DIR}" \
        "$forward_reads" \
        "$reverse_reads"

    # Define trimmed reads
    local trimmed_forward="${PROCESSED_DIR}/$(basename "${forward_reads}" .fastq.gz)_val_1.fq.gz"
    local trimmed_reverse="${PROCESSED_DIR}/$(basename "${reverse_reads}" .fastq.gz)_val_2.fq.gz"

    # Align with STAR
    STAR --genomeDir "$star_index" \
         --readFilesIn "$trimmed_forward" "$trimmed_reverse" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${sample_dir}/${sample_id}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMstrandField intronMotif \
         --outFilterIntronMotifs RemoveNoncanonical \
         --runThreadN 8

    # Index BAM file
    samtools index "${sample_dir}/${sample_id}_Aligned.sortedByCoord.out.bam"

    echo "Sample $sample_id processed successfully"
}

# Main execution
if [[ $# -ne 6 ]]; then
    echo "Usage: $0 <sample_id> <forward_reads> <reverse_reads> <reference_genome> <gtf_file> <star_index>"
    exit 1
fi

process_sample "$1" "$2" "$3" "$4" "$5" "$6"

Making Scripts Executable

chmod +x rnaseq_pipeline.sh
chmod +x process_sample.sh

Advanced Features

1. Parallel Sample Processing

Create a script for parallel processing parallel_rnaseq.sh:

#!/bin/bash

# parallel_rnaseq.sh - Parallel RNA-Seq processing

set -euo pipefail

PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SAMPLES_FILE="${PROJECT_DIR}/samples.tsv"
MAX_JOBS=4

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

# Process samples in parallel
process_samples_parallel() {
    log "Processing samples in parallel (max ${MAX_JOBS} jobs)..."

    # Read samples and submit jobs
    local job_pids=()

    while IFS=$'\t' read -r sample_id condition replicate forward reverse; do
        # Skip header
        [[ "$sample_id" == "sample_id" ]] && continue

        # Submit job
        ./process_sample.sh \
            "$sample_id" \
            "${PROJECT_DIR}/${forward}" \
            "${PROJECT_DIR}/${reverse}" \
            "${PROJECT_DIR}/data/reference/hg38.fasta" \
            "${PROJECT_DIR}/data/reference/hg38.gtf" \
            "${PROJECT_DIR}/data/reference/star_index" &

        job_pids+=($!)

        # Limit concurrent jobs
        if [[ ${#job_pids[@]} -ge $MAX_JOBS ]]; then
            wait "${job_pids[@]}"
            job_pids=()
        fi

    done < "$SAMPLES_FILE"

    # Wait for remaining jobs
    if [[ ${#job_pids[@]} -gt 0 ]]; then
        wait "${job_pids[@]}"
    fi

    log "All samples processed"
}

# Main execution
main() {
    process_samples_parallel
}

main "$@"

2. Quality Control and Validation

Add comprehensive QC functions:

# Validate FASTQ files
validate_fastq() {
    local fastq_file="$1"
    log "Validating FASTQ file: $fastq_file"

    # Check if file exists
    [[ -f "$fastq_file" ]] || error_exit "FASTQ file not found: $fastq_file"

    # Check file format (basic validation)
    local line_count=$(zcat "$fastq_file" | head -n 4 | wc -l)
    if [[ $line_count -ne 4 ]]; then
        error_exit "Invalid FASTQ format: $fastq_file"
    fi

    log "FASTQ validation passed"
}

# Validate alignment results
validate_alignment() {
    local bam_file="$1"
    local sample_id="$2"

    log "Validating alignment for sample: $sample_id"

    # Check if BAM file exists and is valid
    if [[ ! -f "$bam_file" ]] || ! samtools quickcheck "$bam_file"; then
        error_exit "Invalid BAM file for sample $sample_id: $bam_file"
    fi

    # Check alignment statistics
    local mapped_reads=$(samtools view -c -F 4 "$bam_file")
    local total_reads=$(samtools view -c "$bam_file")
    local mapping_rate=$(echo "scale=2; $mapped_reads * 100 / $total_reads" | bc)

    if (( $(echo "$mapping_rate < 70" | bc -l) )); then
        log "WARNING: Low mapping rate for $sample_id: ${mapping_rate}%"
    fi

    log "Alignment validation passed for $sample_id"
}

# Validate count data
validate_counts() {
    local count_file="$1"
    log "Validating count data..."

    # Check if file exists
    [[ -f "$count_file" ]] || error_exit "Count file not found: $count_file"

    # Check if file has data
    local line_count=$(wc -l < "$count_file")
    if [[ $line_count -lt 10 ]]; then
        error_exit "Count file appears to be empty or incomplete: $count_file"
    fi

    log "Count data validation passed"
}

3. Configuration Management

Create a configuration file rnaseq_config.ini:

[general]
project_name=rnaseq_analysis
threads=8
memory=16G

[quality_control]
min_read_length=36
min_base_quality=20
adapter_type=illumina

[alignment]
aligner=star
max_intron_size=1000000
out_filter_multimap_nmax=20

[quantification]
quantifier=featurecounts
feature_type=exon
attribute_type=gene_id
min_overlap=1

[differential_expression]
method=deseq2
fdr_threshold=0.05
lfc_threshold=1.0

[enrichment]
tool=clusterprofiler
organism=human
ontology=BP

[reporting]
multiqc=true
plots=true

Running the Pipeline

To run the pipeline:

# Make scripts executable
chmod +x rnaseq_pipeline.sh
chmod +x process_sample.sh
chmod +x parallel_rnaseq.sh

# Run the complete pipeline
./rnaseq_pipeline.sh

# Or process samples in parallel
./parallel_rnaseq.sh

# For verbose output
./rnaseq_pipeline.sh 2>&1 | tee rnaseq_run.log

Expected Output Structure

After running the pipeline, you’ll have:

rnaseq_pipeline/
├── data/
│   ├── raw/
│   │   ├── sample1_R1.fastq.gz
│   │   ├── sample1_R2.fastq.gz
│   │   ├── sample2_R1.fastq.gz
│   │   └── sample2_R2.fastq.gz
│   ├── reference/
│   │   ├── hg38.fasta
│   │   ├── hg38.gtf
│   │   └── star_index/
│   └── processed/
│       ├── sample1_R1_val_1.fq.gz
│       └── sample1_R2_val_2.fq.gz
├── results/
│   ├── alignment/
│   │   ├── sample1/
│   │   │   ├── sample1_Aligned.sortedByCoord.out.bam
│   │   │   └── sample1_Log.final.out
│   │   └── sample2/
│   ├── quantification/
│   │   ├── gene_counts.tsv
│   │   └── gene_cpm.tsv
│   ├── de_analysis/
│   │   ├── differential_expression_results.tsv
│   │   ├── ma_plot.png
│   │   ├── volcano_plot.png
│   │   └── heatmap_top50.png
│   ├── enrichment/
│   │   └── go_enrichment_upregulated.tsv
│   └── reports/
│       ├── rnaseq_summary.txt
│       ├── fastqc_raw/
│       ├── fastqc_trimmed/
│       ├── alignment_qc/
│       └── multiqc_report/
├── logs/
│   └── pipeline.log
├── tmp/
├── samples.tsv
├── rnaseq_config.ini
├── rnaseq_pipeline.sh
├── process_sample.sh
└── parallel_rnaseq.sh

Best Practices

Modularity: Break complex tasks into functions
Error Handling: Use set -euo pipefail and proper error checking
Logging: Implement comprehensive logging for debugging
Configuration: Use external config files for flexibility
Validation: Validate inputs, outputs, and intermediate files
Documentation: Comment your code thoroughly
Portability: Use relative paths and environment variables

Performance Optimization

Parallel Processing: Use multiple cores for alignment and quantification
Memory Management: Monitor and limit memory usage
Disk I/O: Minimize file operations and use appropriate buffering
Caching: Reuse intermediate results when possible
Compression: Use appropriate compression for large output files
Resource Monitoring: Track CPU and memory usage

This RNA-Seq analysis pipeline provides a comprehensive framework for analyzing RNA sequencing data, from raw reads to biological insights. The modular design allows for easy customization and extension based on specific analysis requirements.

Frequently Asked Questions

Q: What is RNA-Seq and why use bash for analysis?

A: RNA-Seq (RNA sequencing) is a method to analyze the transcriptome - all RNA molecules in cells. Bash pipelines are ideal for RNA-Seq analysis because they allow you to chain together multiple bioinformatics tools, handle large datasets efficiently, and create reproducible workflows that can run on local machines or HPC clusters.

Q: How much computational power do I need for RNA-Seq analysis?

A: For a typical RNA-Seq experiment with 6-12 samples, you’ll need at least 16GB RAM and 8 CPU cores. STAR alignment is memory-intensive (requires 30GB RAM for human genome), while DESeq2 analysis needs 4-8GB RAM. Processing time ranges from 2-6 hours depending on sample size and depth.

Q: What file formats are used in RNA-Seq analysis?

A: RNA-Seq analysis uses several key formats: FASTQ (raw sequencing reads), BAM (aligned reads), GTF/GFF (gene annotations), TSV (count matrices), and standard formats like PNG for plots. The pipeline automatically handles format conversions between tools.

Q: How do I choose the right parameters for STAR alignment?

A: Key STAR parameters include: --runThreadN (match your CPU cores), --outFilterMultimapNmax (20 for genome, 1 for transcriptome), and --outSAMstrandField intronMotif for stranded libraries. Use --quantMode GeneCounts if you want STAR to do quantification instead of featureCounts.

Q: What’s the difference between gene expression and differential expression?

A: Gene expression measures how much each gene is being transcribed (read counts). Differential expression compares gene expression between conditions (treatment vs control) to find genes that are significantly up- or down-regulated. DESeq2 performs statistical testing to identify these changes.

Q: How many biological replicates do I need for RNA-Seq?

A: Minimum 3 biological replicates per condition for meaningful statistics. More replicates (4-6) improve power to detect small fold changes. Technical replicates are less important than biological replicates. Never pool biological samples - treat each individual as a separate replicate.

Q: What quality control steps should I include?

A: Essential QC steps: FastQC on raw and trimmed reads, MultiQC for aggregated reports, alignment statistics (mapping rates >80%), duplication rates (<30%), and PCA plots to check sample clustering. Use tools like RSeQC for advanced RNA-Seq specific QC.

Q: How do I interpret differential expression results?

A: Key metrics: Log2 fold change (>1 or <-1 for 2-fold change), adjusted p-value (padj <0.05 for significance), and base mean (average expression level). Filter results by both statistical significance (padj) and biological significance (fold change).

Q: Can I modify this pipeline for different organisms?

A: Yes! Change the reference genome files in the configuration section, update the organism database in the enrichment analysis (e.g., org.Mm.eg.db for mouse), and adjust gene ID formats if needed. The core pipeline logic remains the same across organisms.

Q: What if my RNA-Seq data is single-end instead of paired-end?

A: Modify the STAR command to use only --readFilesIn $forward_read (remove reverse), update Trim Galore parameters to remove --paired, and adjust the sample sheet format to include only one FASTQ file per sample. The rest of the pipeline remains unchanged.

Continue Learning

One Small Win

Try this quick command to get started:

fastqc *.fastq.gz

Copy and paste this into your terminal to get started immediately.

Start Your Own Project

Use our battle-tested template to jumpstart your reproducible research workflows. Pre-configured environments, standardized structure, and example workflows included.

Use This Template

git clone https://github.com/Tamoghna12/bench2bash-starter
cd bench2bash-starter
conda env create -f env.yml
make run

RNA-Seq Analysis Pipeline with Bash

RNA-Seq Analysis Pipeline with Bash

Overview

Prerequisites

Pipeline Workflow

Setting Up the Project Structure

Creating the Main Pipeline Script

Creating a Sample Processing Script

Making Scripts Executable

Advanced Features

1. Parallel Sample Processing

2. Quality Control and Validation

3. Configuration Management

Running the Pipeline

Expected Output Structure

Best Practices

Performance Optimization

Frequently Asked Questions

Q: What is RNA-Seq and why use bash for analysis?

Q: How much computational power do I need for RNA-Seq analysis?

Q: What file formats are used in RNA-Seq analysis?

Q: How do I choose the right parameters for STAR alignment?

Q: What’s the difference between gene expression and differential expression?

Q: How many biological replicates do I need for RNA-Seq?

Q: What quality control steps should I include?

Q: How do I interpret differential expression results?

Q: Can I modify this pipeline for different organisms?

Q: What if my RNA-Seq data is single-end instead of paired-end?

Continue Learning

One Small Win

Related Content

Genome Extraction Pipeline with Bash

Protein Annotation Pipeline with Bash

Variant Calling Pipeline with Bash

RNA-Seq Analysis Pipeline

Start Your Own Project