🧬 OmicsFlow

A modular, containerized NGS pipeline for RNA-seq, long-read, and metagenomic analysis

📋 Overview

OmicsFlow is a production-ready bioinformatics pipeline built with Nextflow and Docker, designed for reproducible multi-omics data analysis. It supports three major sequencing technologies and workflows:

| Workflow | Technology | Key tools | Status |
|---|---|---|---|
| rnaseq.nf | Illumina short reads | FastQC · STAR · Salmon · DESeq2 | ✅ Stable |
| longread.nf | Oxford Nanopore (ONT) | NanoStat · Minimap2 · Samtools | 🚧 In development |
| metagenomics.nf | Illumina / ONT | Kraken2 · Bracken | 🚧 In development |

All workflows are fully containerized via Docker and can run locally, on HPC clusters (SLURM/PBS), or in the cloud (AWS Batch).


📸 Pipeline Results

Pipeline Architecture

[Figure: pipeline architecture diagram]

Differential Expression: Volcano Plot

[Figure: volcano plot]

Top 50 DE Genes: Heatmap

[Figure: heatmap]

Sample Clustering: PCA

[Figure: PCA plot]


📊 Validation Metrics

Benchmarked on the nf-core test dataset (S. cerevisiae, GSE110004, 4 samples × 50,000 reads):

| Metric | Value |
|---|---|
| Input reads per sample | 50,000 |
| Reads passing QC | 99.5% |
| Adapter contamination (auto-detected & removed) | 40.3% |
| Uniquely mapped reads (STAR) | 81.8% – 84.6% |
| Properly paired reads | 100% |
| Mismatch rate | 0.9% |
| Pipeline execution time (4 samples, 4 CPUs) | ~8 min |
| Docker image size | 4.63 GB |
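
Mapping figures such as the uniquely mapped rate can be read from the Log.final.out file that STAR writes for each sample. The following is a minimal sketch in Python, assuming the usual "metric name | value" layout of that file. The helper names (parse_star_log, percent) are hypothetical and not part of OmicsFlow, and the excerpt contains made-up numbers for illustration only.

```python
def parse_star_log(text):
    """Return {metric name: value string} from the text of a STAR Log.final.out file."""
    metrics = {}
    for line in text.splitlines():
        # Metric lines contain a "|" between name and value;
        # section headers such as "UNIQUE READS:" do not and are skipped.
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        metrics[key.strip()] = value.strip()
    return metrics


def percent(metrics, key):
    """Convert a percentage string such as '85.00%' to a float."""
    return float(metrics[key].rstrip("%"))


# Made-up excerpt for illustration; real files contain many more lines.
EXAMPLE_LOG = (
    "                          Number of input reads |\t1000\n"
    "                                  UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t850\n"
    "                        Uniquely mapped reads % |\t85.00%\n"
)

metrics = parse_star_log(EXAMPLE_LOG)
print(percent(metrics, "Uniquely mapped reads %"))
```

To apply this to a real run, read the file first, for example with open("results/aligned/sample.Log.final.out").read(), and pass the text to parse_star_log.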

📦 What you need before starting

OmicsFlow is flexible: you can use the full pipeline or individual tools, depending on your needs.

The only real requirement: your data

| Use case | What you need |
|---|---|
| Quality control only | FASTQ files |
| Trimming only | FASTQ files |
| Alignment (STAR) | FASTQ files + reference genome + GTF + STAR index |
| Quantification (Salmon) | FASTQ files + Salmon index |
| Statistics (Samtools) | An existing BAM file |
| Differential expression | Salmon counts + sample metadata |
| Python / R analysis | Your own data + scripts |

You do not need to prepare everything upfront. Start with what you have and add steps as needed.


Reference genome & annotation (only if using STAR alignment)

If you plan to use STAR for alignment, you need a reference genome and its annotation.

If you already have a STAR index on your server or HPC, just point to it with --genomeDir. No need to rebuild it. Any STAR-compatible index works, regardless of how it was generated.

If you need to build one (one-time operation, ~45 min for full human genome):

# Download and decompress the reference genome (human GRCh38) into ./genome
mkdir -p genome
wget -P genome https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Download and decompress the gene annotation
wget -P genome https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
gunzip genome/Homo_sapiens.GRCh38.109.gtf.gz

# Build STAR index using OmicsFlow Docker (paths match the files downloaded above)
docker run --rm -v $(pwd):/data smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/star_index && STAR --runMode genomeGenerate \
  --genomeDir /data/star_index \
  --genomeFastaFiles /data/genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile /data/genome/Homo_sapiens.GRCh38.109.gtf \
  --runThreadN 8"

⚠️ Build the index once, store it, reuse it forever for all your experiments.
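
Before reusing a stored index, it can help to check that the directory actually looks like a finished STAR index. The sketch below is one possible quick check in Python. It assumes that a completed genomeGenerate run leaves at least the files Genome, SA, SAindex and genomeParameters.txt in the index directory. The list is deliberately partial, and the function names are hypothetical rather than part of OmicsFlow.

```python
from pathlib import Path

# Partial list of files written by STAR --runMode genomeGenerate.
# Used as a quick sanity check, not as a full validation of the index.
EXPECTED_INDEX_FILES = ("Genome", "SA", "SAindex", "genomeParameters.txt")


def missing_index_files(index_dir):
    """Return the expected STAR index files that are absent from index_dir."""
    index_dir = Path(index_dir)
    return [name for name in EXPECTED_INDEX_FILES if not (index_dir / name).is_file()]


def index_looks_complete(index_dir):
    """True if every file in EXPECTED_INDEX_FILES exists in index_dir."""
    return not missing_index_files(index_dir)


if __name__ == "__main__":
    # "star_index" is the directory created by the build command above.
    missing = missing_index_files("star_index")
    if missing:
        print("Index incomplete, missing:", ", ".join(missing))
    else:
        print("Index looks complete; no rebuild needed.")
```

A passing check does not guarantee that the index matches the STAR version in the image; it only catches interrupted or misplaced builds.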


What you do NOT need to install

Everything is already inside the Docker image:

| Tool | Without OmicsFlow | With OmicsFlow |
|---|---|---|
| FastQC | Manual install | ✅ Included |
| Trim Galore | Manual install | ✅ Included |
| STAR | Compile from source | ✅ Included |
| Salmon | Manual install | ✅ Included |
| Samtools | Compile from source | ✅ Included |
| DESeq2 | R + Bioconductor setup | ✅ Included |
| MultiQC | pip install | ✅ Included |
| BioPython | pip install | ✅ Included |
| numpy / pandas / matplotlib | pip install | ✅ Included |
| NanoStat / NanoPlot | pip install | ✅ Included |
| Kraken2 | Manual install | ✅ Included |
| Minimap2 | Manual install | ✅ Included |

🐳 Run with Docker only (no Nextflow required)

The easiest way to use OmicsFlow: just Docker, no installation needed.

# Pull the image
docker pull smill/omicsflow:1.0.0

# Step 1: Quality control (FastQC); the output directory must exist before FastQC runs
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/qc && fastqc /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz --outdir /data/qc"

# Step 2: Adapter trimming (Trim Galore)
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "trim_galore --paired --cores 4 \
  /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
  -o /data/trimmed"

# Step 3: Alignment (STAR); create the directory used by the output prefix first
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "mkdir -p /data/aligned && STAR --runMode alignReads \
  --genomeDir /data/star_index \
  --readFilesIn /data/trimmed/sample_R1_val_1.fq.gz /data/trimmed/sample_R2_val_2.fq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix /data/aligned/sample. \
  --runThreadN 4"

# Step 4: Quantification (Salmon)
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "salmon quant \
  --index /data/salmon_index \
  --libType A \
  -1 /data/trimmed/sample_R1_val_1.fq.gz \
  -2 /data/trimmed/sample_R2_val_2.fq.gz \
  --output /data/counts/sample \
  --threads 4 \
  --validateMappings"

# Step 5: BAM statistics (Samtools)
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "samtools flagstat /data/aligned/sample.Aligned.sortedByCoord.out.bam"

# Step 6: Aggregated QC report (MultiQC)
docker run --rm -v $(pwd)/data:/data smill/omicsflow:1.0.0 \
  bash -c "multiqc /data --outdir /data/multiqc"

# Interactive R session (DESeq2, ggplot2...)
docker run --rm -it -v $(pwd)/data:/data smill/omicsflow:1.0.0 R

# Interactive Python session (biopython, pandas, matplotlib...)
docker run --rm -it -v $(pwd)/data:/data smill/omicsflow:1.0.0 python3

Windows users: replace $(pwd) with %cd% in CMD, or use the full path.
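
When the same steps have to be repeated for many samples, the docker run invocations above can be assembled programmatically instead of typed by hand. The following Python sketch shows one way to do that. The docker_command helper is hypothetical (it is not shipped with OmicsFlow); it only builds the argument list and does not start a container on its own. Resolving the host directory to an absolute path also sidesteps the $(pwd) versus %cd% difference between shells.

```python
import shlex
from pathlib import Path

IMAGE = "smill/omicsflow:1.0.0"  # image tag used in the commands above


def docker_command(data_dir, tool_command, image=IMAGE):
    """Build the argv list that runs tool_command inside the container.

    data_dir is mounted at /data, mirroring the -v $(pwd)/data:/data examples.
    """
    host_dir = Path(data_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",
        image,
        "bash", "-c", tool_command,
    ]


cmd = docker_command(
    "data",
    "samtools flagstat /data/aligned/sample.Aligned.sortedByCoord.out.bam",
)
# Printable form of the command; to execute it, pass the list to
# subprocess.run(cmd, check=True) on a machine where Docker is installed.
print(shlex.join(cmd))
```

Passing a list to subprocess.run, rather than a single shell string, avoids quoting problems when the data directory contains spaces.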


Prerequisites

Run in one command

# Clone the repository
git clone https://github.com/Millimono/OmicsFlow.git
cd OmicsFlow

# Run RNA-seq pipeline with test data
nextflow run workflows/rnaseq.nf \
  --input data/test/samplesheet.csv \
  --genome GRCh38 \
  --outdir results/ \
  -profile docker

Input samplesheet format (CSV)

sample,fastq_1,fastq_2,strandedness
ctrl_rep1,/path/to/ctrl_rep1_R1.fastq.gz,/path/to/ctrl_rep1_R2.fastq.gz,reverse
ctrl_rep2,/path/to/ctrl_rep2_R1.fastq.gz,/path/to/ctrl_rep2_R2.fastq.gz,reverse
treat_rep1,/path/to/treat_rep1_R1.fastq.gz,/path/to/treat_rep1_R2.fastq.gz,reverse
treat_rep2,/path/to/treat_rep2_R1.fastq.gz,/path/to/treat_rep2_R2.fastq.gz,reverse

Strandedness: use reverse for most Illumina TruSeq kits, forward for some stranded protocols, unstranded if unsure.
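
A malformed samplesheet is easier to catch before launching a run than after. The sketch below is a small Python check for the format shown above, using only the standard library. It assumes that the four columns are mandatory, that fastq_1 must be non-empty, and that strandedness is one of forward, reverse or unstranded. Whether the pipeline itself enforces exactly these rules is an assumption, and validate_samplesheet is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]
ALLOWED_STRANDEDNESS = {"forward", "reverse", "unstranded"}


def validate_samplesheet(handle):
    """Return a list of problems found in a samplesheet; an empty list means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["sample"] or "").strip():
            problems.append(f"line {line_no}: empty sample name")
        if not (row["fastq_1"] or "").strip():
            problems.append(f"line {line_no}: fastq_1 is required")
        strand = (row["strandedness"] or "").strip()
        if strand not in ALLOWED_STRANDEDNESS:
            problems.append(f"line {line_no}: strandedness '{strand}' is not valid")
    return problems


# Example with a deliberate typo ("revers") in the second data row.
SHEET = (
    "sample,fastq_1,fastq_2,strandedness\n"
    "ctrl_rep1,/path/to/ctrl_rep1_R1.fastq.gz,/path/to/ctrl_rep1_R2.fastq.gz,reverse\n"
    "treat_rep1,/path/to/treat_rep1_R1.fastq.gz,/path/to/treat_rep1_R2.fastq.gz,revers\n"
)
print(validate_samplesheet(io.StringIO(SHEET)))
```

For a file on disk, open it with open("samplesheet.csv", newline="") and pass the file object to validate_samplesheet.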


🗂️ Project Structure

OmicsFlow/
├── workflows/
│   ├── rnaseq.nf           # ✅ RNA-seq Illumina pipeline (stable)
│   ├── longread.nf         # 🚧 Nanopore long-read pipeline (in development)
│   └── metagenomics.nf     # 🚧 Metagenomic pipeline (in development)
│
├── modules/
│   ├── qc/                 # FastQC, MultiQC, NanoStat
│   ├── alignment/          # STAR, Minimap2, Samtools
│   └── quantification/     # Salmon, DESeq2
│
├── analysis/
│   ├── deseq2.R            # Differential expression (DESeq2 / edgeR)
│   ├── plots.py            # Heatmaps, volcano plots, PCA
│   └── report.Rmd          # Automated HTML report template
│
├── containers/
│   └── Dockerfile          # All tools in one reproducible image
│
├── data/
│   └── test/               # Public mini-datasets for testing
│       ├── samplesheet.csv
│       └── reads/          # nf-core GSE110004 subset (Illumina)
│
├── docs/                   # Documentation (GitHub Pages)
├── .github/
│   └── workflows/
│       └── ci.yml          # GitHub Actions CI/CD
└── nextflow.config         # Profiles: local, cluster, cloud

📊 Workflows in Detail

1. RNA-seq Pipeline (rnaseq.nf) ✅ Stable

Designed for bulk RNA-seq analysis from raw FASTQ to differential expression.

Input FASTQ
    │
    ▼
[FastQC] ──────────────────────────> QC report
    │
    ▼
[Trim Galore] ──> Trimmed reads
    │
    ▼
[STAR] ──> Aligned BAM + splice junctions
    │
    ▼
[Salmon] ──> Gene/transcript counts
    │
    ▼
[DESeq2 / edgeR] ──> Differential expression
    │
    ▼
[MultiQC] ──> Aggregated QC report (HTML)

Output files: see the Results & Outputs section.


2. Long-read Pipeline (longread.nf) 🚧 In development

For Nanopore sequencing data. Coming soon; the tools are already available in the Docker image.

Input FASTQ (ONT)
    │
    ▼
[NanoStat / NanoPlot] ──> Read quality stats
    │
    ▼
[Minimap2] ──> Aligned BAM
    │
    ▼
[Samtools] ──> Sorted + indexed BAM
    │
    ▼
[MultiQC] ──> Aggregated report

In the meantime, you can use these tools individually via Docker (see the Docker section above).


3. Metagenomic Pipeline (metagenomics.nf) 🚧 In development

Taxonomic classification and abundance profiling. Coming soon; Kraken2 is already available in the Docker image.

Input FASTQ
    │
    ▼
[FastQC + Trim Galore] ──> Clean reads
    │
    ▼
[Kraken2] ──> Taxonomic classification
    │
    ▼
[Bracken] ──> Abundance re-estimation

In the meantime, you can run Kraken2 directly:

docker run --rm -v $(pwd):/data smill/omicsflow:1.0.0 \
  bash -c "kraken2 --db /data/kraken2_db --paired \
  /data/R1.fastq.gz /data/R2.fastq.gz \
  --output /data/kraken2_output.txt \
  --report /data/kraken2_report.txt"
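
The report written by --report can be summarized with a few lines of Python until the full workflow is available. The sketch below assumes the default six-column, tab-separated Kraken2 report layout (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, indented name). It would need adjusting if the report was produced with options that add columns. The function name and the example rows are illustrative only, and the read counts are made up.

```python
def species_from_report(text, min_percent=0.0):
    """Return (name, percent, clade_reads) for species-level rows of a Kraken2 report."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines or rows with an unexpected layout
        pct, clade_reads, _direct_reads, rank, _taxid, name = fields
        # Rank code "S" marks species-level entries.
        if rank == "S" and float(pct) >= min_percent:
            rows.append((name.strip(), float(pct), int(clade_reads)))
    # Most abundant species first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up report excerpt for illustration.
EXAMPLE_REPORT = (
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t5\tR\t1\troot\n"
    " 30.00\t300\t300\tS\t1280\t        Staphylococcus aureus\n"
    " 60.00\t600\t600\tS\t562\t        Escherichia coli\n"
)

for name, pct, reads in species_from_report(EXAMPLE_REPORT, min_percent=1.0):
    print(f"{name}\t{pct:.2f}%\t{reads} reads")
```

For a real run, read data/kraken2_report.txt (the host-side path of the --report file above) and pass its contents to species_from_report.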

🛠️ Technical Stack

| Category | Tools | Versions |
|---|---|---|
| Pipeline orchestration | Nextflow DSL2 | ≥ 22.10 |
| Containerization | Docker · Singularity | 28.x |
| QC | FastQC · MultiQC · NanoStat · NanoPlot | 0.12.1 · 1.35 · 1.6.0 |
| Alignment | STAR · Minimap2 | 2.7.11b · 2.31 |
| Quantification | Salmon | 1.12.0 |
| Variant calling | Samtools · BCFtools | 1.23.1 |
| Metagenomics | Kraken2 | 2.1.3 |
| Statistical analysis | DESeq2 · edgeR · R | R 4.5.2 |
| Visualization | ggplot2 · matplotlib · seaborn | - |
| Languages | Python · R · Bash · C · C++ | Python 3.x |
| CI/CD | GitHub Actions | - |
| Documentation | GitHub Pages | - |

🧪 Test Data

Test data used during development (publicly available):

| Dataset | Source | Size | Used for |
|---|---|---|---|
| GSE110004 / SRR6357070-71 (4 samples) | nf-core test datasets | ~8 MB | RNA-seq validation |
| S. cerevisiae R64-1-1 genome | nf-core test datasets | ~230 KB | Reference genome |
| S. cerevisiae gene annotation | nf-core test datasets | ~200 KB | Gene annotation |

⚙️ Configuration

OmicsFlow supports multiple execution profiles defined in nextflow.config:

profiles {
    docker {
        docker.enabled   = true
        process.executor = 'local'
    }
    cluster {
        process.executor    = 'slurm'
        singularity.enabled = true
        process.queue       = 'normal'
    }
    cloud {
        process.executor = 'awsbatch'
        aws.region       = 'ca-central-1'
    }
    test {
        params.input  = "${projectDir}/data/test/samplesheet.csv"
        params.outdir = 'results_test'
        docker.enabled = true
    }
}

📈 Results & Outputs

Every run generates a timestamped output directory:

results/
├── qc/
│   ├── fastqc/             # Per-sample FastQC reports (HTML)
│   └── multiqc_report.html # Aggregated QC report
├── trimmed/
│   └── logs/               # Trim Galore trimming reports
├── aligned/
│   ├── sample.Aligned.sortedByCoord.out.bam
│   └── sample.Log.final.out  # Mapping statistics
├── counts/
│   └── salmon/             # Transcript-level quantification
│       └── quant.sf
├── deseq2/
│   ├── deseq2_results.csv  # DE genes table
│   ├── volcano_plot.pdf    # Volcano plot
│   ├── heatmap_top50.pdf   # Top 50 DE genes heatmap
│   └── pca_plot.pdf        # PCA plot
└── pipeline_info/
    ├── execution_report.html
    └── execution_timeline.html
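
As a quick way to inspect the quantification output, the quant.sf file listed above can be read with Python's standard library. The sketch below assumes Salmon's usual tab-separated columns (Name, Length, EffectiveLength, TPM, NumReads). The top_transcripts helper and the transcript names in the example are hypothetical, and the values are made up.

```python
import csv
import io


def top_transcripts(handle, n=5):
    """Return the n (Name, TPM) pairs with the highest TPM from a quant.sf file."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["Name"], float(row["TPM"])) for row in reader]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


# Made-up quant.sf content for illustration.
SAMPLE_QUANT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx_A\t1500\t1320.5\t12.5\t40\n"
    "tx_B\t900\t720.0\t250.0\t310\n"
    "tx_C\t2100\t1920.0\t0.0\t0\n"
)

for name, tpm in top_transcripts(io.StringIO(SAMPLE_QUANT), n=2):
    print(f"{name}\t{tpm}")
```

For a real run, open results/counts/salmon/quant.sf with open(path, newline="") and pass the file object in place of the StringIO example.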

This pipeline was developed in conjunction with research in AI-based medical imaging and bioinformatics.

The analysis/ module is extensible: ML models from that research can be integrated as additional pipeline steps.


📚 Documentation

Full documentation available at: millimono.github.io/OmicsFlow


🤝 Contributing

Contributions welcome! Please open a pull request.

git clone https://github.com/Millimono/OmicsFlow.git
cd OmicsFlow
git checkout -b feature/my-new-module

📄 Citation

Millimono, S. (2026). OmicsFlow: A modular containerized NGS pipeline 
for reproducible multi-omics analysis (v1.0.1). Zenodo. 
https://doi.org/10.5281/zenodo.20677900

👤 Author

Sory Millimono
PhD Candidate in AI · Bioinformatician
Université de Montréal & Mohammed V University – ENSIAS


📜 License

MIT License. See LICENSE for details.