Deep Skills Catalog

918 compiled deep skills — auto-generated from live documentation with quality scoring.

Skill | Tier | bc7score
BLAST+

Basic Local Alignment Search Tool — finds regions of similarity between biological sequences using a heuristic method for sequence database searches.

Deep
97
SAMtools

Tools for manipulating next-generation sequencing data stored in SAM/BAM/CRAM format, including sorting, indexing, and format conversion.

Deep
97
BWA

Burrows-Wheeler Aligner for aligning short and long DNA sequencing reads to a reference genome using the Burrows-Wheeler Transform.

Deep
96
DESeq2

Differential expression analysis for RNA-seq data using negative binomial generalized linear models with size factor normalization and empirical Bayes shrinkage.

Deep
96
AlphaFold

Deep learning system for protein structure prediction from amino acid sequence with atomic accuracy — revolutionized structural biology.

Deep
95
STAR

Spliced Transcripts Alignment to a Reference — ultrafast RNA-seq aligner with splice junction detection and novel transcript discovery.

Deep
95
BEDTools

Swiss-army knife for genome arithmetic — intersecting, merging, counting, and complementing genomic intervals across BED, GFF, VCF, and BAM files.

Deep
93
edgeR

Differential expression analysis of RNA-seq and other count data using empirical Bayes estimation of the dispersion parameter for the negative binomial model.

Deep
93
GATK

Genome Analysis Toolkit — industry-standard variant discovery in high-throughput sequencing data with HaplotypeCaller, BaseRecalibrator, and more.

Deep
93
Scanpy

Scalable toolkit for analyzing single-cell gene expression data in Python — the ecosystem standard for scRNA-seq preprocessing, clustering, and visualization.

Deep
93
Seurat

R toolkit for single-cell RNA-seq data analysis — clustering, dimensionality reduction, integration, and cell type annotation with a tidy interface.

Deep
93
BCFtools

Set of utilities for manipulating and analyzing VCF and BCF files, including variant filtering, genotype calling, and population genetics.

Deep
92
Nextflow

Data-driven computational pipeline framework enabling scalable and reproducible scientific workflows using software containers.

Deep
92
MultiQC

Aggregate bioinformatics analysis results from many samples and tools into a single report — supports 150+ tools including FastQC, STAR, HISAT2, and Samtools.

Deep
91
Picard

Java command-line tools from the Broad Institute for manipulating high-throughput sequencing data and formats including BAM, CRAM, and VCF.

Deep
91
Salmon

Wicked-fast transcript-level quantification from RNA-seq reads using quasi-mapping, enabling accurate expression estimation without genome alignment.

Deep
91
fastp

Ultra-fast all-in-one FASTQ preprocessor with quality control, adapter trimming, low-complexity filtering and base correction — 5-10x faster than Trimmomatic.

Deep
90
FastQC

Quality control tool for high-throughput sequencing data, providing per-base quality scores, GC content, adapter contamination, and duplication metrics.

Deep
90
HMMER

Biosequence analysis using profile hidden Markov models — for detecting remote homologs, protein domain identification, and multiple sequence alignments.

Deep
90
MACS2

Model-based Analysis of ChIP-Seq for identifying transcription factor binding sites and histone modification regions from ChIP-seq data.

Deep
90
Minimap2

Versatile sequence alignment program for mapping long reads (PacBio/Oxford Nanopore), splice-aware RNA-seq alignment, and assembly-to-assembly alignment.

Deep
90
Snakemake

Python-based workflow management system enabling reproducible and scalable data analyses with cluster support and integrated software dependency management.

Deep
90
SPAdes

St. Petersburg genome assembler — an assembly toolkit containing various assembly pipelines for standard isolates, single-cell, metagenomics, and RNA-seq data.

Deep
90
HISAT2

Graph-based alignment of reads to a population of genomes using hierarchical indexing — successor to TopHat and HISAT for RNA-seq alignment.

Medium
89
Kallisto

Near-optimal probabilistic RNA-seq quantification using pseudoalignment against transcriptome index — orders of magnitude faster than traditional aligners.

Medium
89
Cutadapt

Removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Medium
88
deepTools

Modular tools for processing and analyzing deep sequencing data — generates coverage tracks, quality metrics, and heatmaps from BAM files.

Medium
88
featureCounts

Efficient general-purpose read summarization program for counting reads mapped to genomic features such as genes, exons, and promoters.

Medium
88
InterProScan

Sequence analysis application that combines multiple protein signature databases to annotate proteins with functional domains, sites, and families.

Medium
88
Prokka

Rapid prokaryotic genome annotation tool producing standards-compliant GFF3, GenBank, and FASTA output for submission and downstream analysis.

Medium
88
RDKit

Open-source cheminformatics library in Python and C++ for molecular manipulation, fingerprinting, QSAR modeling, and drug discovery workflows.

Medium
88
scvi-tools

Probabilistic framework for single-cell omics data analysis using deep learning variational inference — includes scVI, scANVI, and totalVI.

Medium
88
Trimmomatic

Flexible trimmer for Illumina sequence data that handles adapter removal, low-quality base trimming, and minimum length filtering.

Medium
88
VEP (Ensembl)

Ensembl Variant Effect Predictor determines the effect of variants on genes, transcripts, and protein sequences using the Ensembl annotation framework.

Medium
88
AutoDock Vina

Molecular docking program for predicting the binding mode and affinity of small molecules to protein receptors using an empirical scoring function.

Medium
87
Cellpose

Generalist algorithm for cellular segmentation based on a flow representation of cells — works on diverse image types without parameter tuning.

Medium
87
MetaPhlAn

Metagenomic phylogenetic analysis tool for profiling microbial community composition from shotgun metagenomics data using clade-specific markers.

Medium
87
MUSCLE

Multiple sequence alignment tool with high accuracy and high throughput — one of the best-performing multiple alignment programs.

Medium
87
SnpEff

Fast variant effect prediction and annotation tool that annotates SNPs, indels, and structural variants with their predicted functional effects.

Medium
87
StringTie

Efficient transcript assembly and quantification from RNA-seq alignments using a network flow algorithm with optional guide annotation.

Medium
87
Trinity

De novo reconstruction of full-length transcriptomes from RNA-seq data using three modules: Inchworm, Chrysalis, and Butterfly.

Medium
87
hifiasm

Haplotype-resolved de novo assembler for PacBio HiFi reads, producing highly accurate and contiguous chromosome-level assemblies.

Medium
86
RSEM

Accurate quantification of gene and isoform expression levels from RNA-seq data using expectation-maximization with multimapping read handling.

Medium
86
ChAMP

ChAMP — Chip Analysis Methylation Pipeline for Illumina 450K and EPIC arrays. Integrated R/Bioconductor pipeline for DNA methylation analysis covering IDAT loading, quality control, normalization (BMIQ, SWAN, PBC, FunctionalNormalization), batch correction (SVD, ComBat), differentially methylated positions (DMPs), differentially methylated regions (DMRs via Bumphunter, DMRcate, ProbeLasso), copy number alteration detection, cell-type deconvolution, and gene set enrichment. Use when analyzing Illumina HumanMethylation450 or MethylationEPIC BeadChip data for epigenetic studies, aging clocks, or cancer methylomics.

Medium
70
gget

Fast CLI/Python queries to 20+ bioinformatics databases. Use for quick lookups: gene info, BLAST searches, AlphaFold structures, enrichment analysis. Best for interactive exploration and simple queries. For batch processing or advanced BLAST, use biopython; for multi-database Python workflows, use bioservices.

Medium
70
INLA

INLA (Integrated Nested Laplace Approximations) — R package for fast approximate Bayesian inference on latent Gaussian models. Provides deterministic approximations to posterior marginals as an alternative to MCMC, with support for generalized linear mixed models, spatial SPDE models (Matern fields via stochastic partial differential equations), temporal models (AR1, RW1, RW2), survival models, and zero-inflated families. Includes mesh construction for spatial modeling (inla.mesh.2d), the inla.stack data interface, penalized complexity (PC) priors, model comparison via DIC/WAIC/CPO, and integration with inlabru for point process and spatial prediction workflows. Use for Bayesian hierarchical models where MCMC is too slow, especially geostatistical, disease mapping, and space-time applications.

Medium
70
Magic-BLAST

Magic-BLAST — NCBI aligner for mapping large next-generation RNA or DNA sequencing runs against whole genomes or transcriptomes. Optimizes composite paired-read scores with splice-aware alignment, excels at intron discovery, long reads (>250 bp), and compositionally biased genomes. Accepts SRA accessions, FASTA, or FASTQ input; outputs SAM. For short-read-only genomic alignment use BWA-MEM2; for fastest splice-aware mapping use HISAT2 or STAR.

Medium
70
Reactome Database

Query the Reactome REST API for pathway analysis, enrichment, gene-pathway mapping, disease pathways, molecular interactions, and expression analysis in systems biology studies.

Medium
70
scipy.optimize

scipy.optimize — Unified interface for numerical optimization in Python. Provides local and global minimization (minimize, differential_evolution, dual_annealing), root finding (root, root_scalar), curve fitting (curve_fit, least_squares), linear programming (linprog, milp), and assignment problems (linear_sum_assignment). Wraps 14 local minimizers, 6 global optimizers, and 8 scalar root-finding methods behind consistent APIs. For deep-learning optimization use PyTorch/JAX optimizers; for symbolic solving use SymPy.

Medium
70
Winnowmap

Winnowmap — long-read alignment optimized for repetitive genomic regions. Maps PacBio HiFi, PacBio CLR, and Oxford Nanopore reads to reference genomes with weighted minimizer seeding that down-weights frequent k-mers. Aligns genome assemblies with asm5/asm10/asm20 presets. Outperforms minimap2 in centromeres, segmental duplications, and tandem repeat arrays. Requires meryl k-mer counting as a preprocessing step. Use when mapping long reads to repetitive references, aligning T2T assemblies, or calling structural variants in near-identical repeats.

Medium
70
3D-DNA

3D-DNA — Hi-C-based genome scaffolding pipeline that produces chromosome-length assemblies from draft contigs and Hi-C contact maps. Performs iterative misjoin correction, scaffolding, polishing, chromosome splitting, sealing, and optional diploid merge. Outputs FASTA scaffolds and Juicebox-compatible .hic/.assembly files for manual review via JBAT (Juicebox Assembly Tools).

Thin
0
AbLang

AbLang — antibody language model for completing and analyzing antibody sequences. Generate residue-level embeddings (res-codings), sequence-level embeddings (seq-codings), restore missing residues from sequencing errors, and compute residue likelihoods for mutation analysis. Trained on the Observed Antibody Space (OAS) database with separate heavy and light chain models. Use for antibody sequence restoration, antibody engineering, immunoinformatics, and VH/VL representation learning.

Thin
0
ABRicate

ABRicate — mass screening of contigs for antimicrobial resistance or virulence genes.

Thin
0
ABySS

ABySS — de novo genome assembler for short paired-end reads and genomes of all sizes. Uses Bloom filter mode for memory-efficient assembly (10x reduction vs hash table). Supports paired-end, mate-pair, and linked-read scaffolding via ARCS/Tigmint. Driven by the abyss-pe Makefile pipeline with parameters k (k-mer size), B (Bloom filter memory), and name (prefix). Standard tool for WGS de novo assembly from Illumina data.

Thin
0
ACE

Use when working with ACE (Accurate CRISPR Essentiality) — a Python package from the Pe'er Lab (Memorial Sloan Kettering) for estimating gene essentiality from CRISPR pooled knockout screens. Models guide RNA depletion using a negative binomial mixture to separate essential from non-essential gene effects. Accepts raw or normalized sgRNA count matrices and outputs per-gene essentiality scores and confidence intervals. Handles copy-number confounding by incorporating genomic position information. Used in cancer functional genomics pipelines alongside CERES, Chronos, and MAGeCK for essential gene discovery and genetic dependency mapping.

Thin
0
ADMIXTURE

ADMIXTURE for maximum likelihood estimation of individual ancestries from SNP genotype data. Use when working with admixture, ancestry estimation, population structure analysis, K selection via cross-validation, supervised ancestry assignment, or STRUCTURE-like analysis on PLINK binary files.

Thin
0
AfterQC

AfterQC — automatic filtering, trimming, error removal, and quality control for FASTQ data. Processes Illumina paired-end and single-end reads with automatic adapter detection, overlap-based error correction, poly-X filtering, bubble artifact removal, and HTML QC reporting. Outputs good/bad/QC folders. Use when preprocessing Illumina FASTQ data or generating QC reports.

Thin
0
ai-models

ECMWF ai-models — unified CLI and Python framework for running AI-based weather forecasting models including PanguWeather, FourCastNet, FourCastNetv2, and GraphCast. Provides standardized input handling (MARS, CDS/ERA5, local GRIB), model weight management, and GRIB output for global medium-range weather prediction with GPU acceleration.

Thin
0
AICSImageIO

AICSImageIO — Python library for reading and writing multi-dimensional microscopy image data. Supports CZI (Zeiss), LIF (Leica), ND2 (Nikon), OME-TIFF, TIFF, and DV formats via a unified AICSImage interface with lazy-loaded dask/xarray arrays using STCZYX dimension ordering. Provides metadata extraction, scene selection, physical pixel size access, mosaic stitching, and OME-TIFF writing for bioimage analysis workflows.

Thin
0
alevin-fry

alevin-fry — fast, accurate, and memory-frugal Rust-based tool for single-cell and single-nucleus RNA-seq quantification. Processes RAD (Reduced Alignment Data) files from salmon alevin to generate cell-by-gene count matrices. Supports permit-list generation via knee-distance, forced-cells, or external whitelists, USA (unspliced/spliced/ambiguous) quantification, and multiple resolution strategies (cr-like, cr-like-em, parsimony, trivial). Part of the COMBINE-lab/salmon ecosystem for scRNA-seq and snRNA-seq preprocessing pipelines.

Thin
0
AlphaFold Database

AlphaFold Database — public repository of 200M+ AI-predicted protein structures maintained by DeepMind and EMBL-EBI. Access predictions by UniProt accession via REST API at alphafold.ebi.ac.uk. Download mmCIF/PDB coordinate files, per-residue pLDDT confidence scores, and PAE matrices. Bulk access via Google Cloud Storage (gs://public-datasets-deepmind-alphafold-v4/). Core workflow: resolve UniProt ID, fetch prediction metadata, download structure file, extract pLDDT from B-factors, visualize PAE matrix. Use for structure-based drug discovery, homology modeling gap-filling, disorder prediction, and domain boundary identification.

Thin
0
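The "extract pLDDT from B-factors" step named in the AlphaFold Database workflow above can be sketched in a few lines. This is a minimal illustration, assuming a PDB-format file whose B-factor field (columns 61-66) holds the per-residue pLDDT. The names `plddt_per_residue` and `make_atom_line` and the sample records are hypothetical, and nothing is downloaded.

```python
def plddt_per_residue(pdb_text):
    """Map (chain, residue number) -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":   # columns 13-16: atom name
            continue
        chain = line[21]                   # column 22: chain ID
        resnum = int(line[22:26])          # columns 23-26: residue number
        scores[(chain, resnum)] = float(line[60:66])  # columns 61-66: B-factor
    return scores


def make_atom_line(serial, name, resname, chain, resnum, bfactor):
    """Build a fixed-width ATOM record (hypothetical helper, used only for the demo)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, chain, resnum, 0.0, 0.0, 0.0, 1.0, bfactor)


# Made-up records standing in for a downloaded structure file.
sample = "\n".join([
    make_atom_line(1, "N", "MET", "A", 1, 91.25),
    make_atom_line(2, "CA", "MET", "A", 1, 91.25),
    make_atom_line(3, "CA", "ALA", "A", 2, 48.50),
])
print(plddt_per_residue(sample))
```

Reading only CA atoms gives one value per residue, which is enough because every atom of a residue carries the same pLDDT in these files.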
AlphaFold2

AlphaFold2 — deep learning system for predicting 3D protein structures from amino acid sequences with atomic-level accuracy. Uses multiple sequence alignments (MSAs) and an attention-based Evoformer architecture to predict inter-residue distances and torsion angles. Produces PDB files with per-residue confidence scores (pLDDT), predicted TM-scores (pTM), and predicted aligned error (PAE) matrices. Supports monomer, multimer (protein complex), and template-free prediction modes via Docker with NVIDIA GPU.

Thin
0
AlphaPept

AlphaPept — Python-based, open-source proteomics pipeline for DDA mass spectrometry data analysis. Provides feature detection, peptide identification via database search, deep learning-based scoring, label-free and TMT/SILAC quantification, and match-between-runs. Built on Numba for high performance with HDF5 intermediate files. Part of the MannLabs ecosystem alongside AlphaBase, AlphaTims, and alphamap.

Thin
0
AlphaTims

AlphaTims — open-source Python package for fast indexing, slicing, and visualization of unprocessed Bruker timsTOF Pro LC-TIMS-Q-TOF mass spectrometry data. Loads raw .d folders into a sparse five-dimensional matrix (retention time, ion mobility, quadrupole m/z, TOF m/z, intensity) enabling sub-second data access across billions of data points. Supports ddaPASEF and diaPASEF acquisition modes with HDF5 export, interactive plotting, GUI, CLI, and Python API interfaces.

Thin
0
AMBER

AMBER (Assisted Model Building with Energy Refinement) — suite of molecular dynamics simulation programs for biomolecular systems. Provides force fields (ff14SB, ff19SB, GAFF2), system preparation (tleap, antechamber), molecular dynamics engines (sander, pmemd, pmemd.cuda), and trajectory analysis (cpptraj, pytraj). Used for protein folding, ligand binding free energy, membrane simulations, and enhanced sampling (REMD, GaMD).

Thin
0
Amelia II

Amelia II — R package for multiple imputation of incomplete multivariate data using a bootstrap-based EM algorithm (EMB). Faster and more scalable than MCMC approaches, with native support for time-series cross-sectional (TSCS) panel data via polytime, splinetime, cross-section fixed effects (intercs), lags/leads, normalizing transformations, cell-level priors, logical bounds, and overimputation diagnostics. Core API: amelia(), overimpute(), compare.density(), missmap(), disperse(), write.amelia(), ameliabind(). Use for fast multiple imputation in R with Rubin's rules pooling.

Thin
0
AMRFinderPlus

AMRFinderPlus (NCBI Antimicrobial Resistance Gene Finder Plus) — identifies antimicrobial resistance genes, virulence factors, and stress genes in bacterial genome assemblies and protein sequences. Accepts nucleotide FASTA, protein FASTA, or combined protein+nucleotide+GFF3 input. Required for AMR surveillance, pathogen characterization, and NCBI submission workflows. Use for: AMR gene detection, resistance gene catalog search, organism-specific point mutation calling, virulence gene annotation, amrfinderplus, amrfinder.

Thin
0
Analyze Sample Metadata

Analyze bioinformatics sample metadata to auto-detect data modality, column semantics, and sequencing platform from CSV/TSV files, SRA RunTables, GEO Series Matrix files, and AnnData .obs exports. Infers RNA-seq, scRNA-seq, ATAC-seq, WGS, ChIP-seq, bisulfite-seq, and spatial transcriptomics modalities from column names, controlled vocabulary values, and EDAM ontology terms. Recommends matching QC and analysis tools ranked by bc7score from the biocontext7 registry.

Thin
0
ANGSD

ANGSD — Analysis of Next Generation Sequencing Data.

Thin
0
Ankh

Use when working with Ankh — an optimized protein language model — for generating protein sequence embeddings, secondary structure prediction, remote homology detection, and protein property prediction. Ankh provides efficient protein representations using transformer-based architecture trained on UniProt sequences. Supports ankh_base, ankh_large, ankh3_large, and ankh3_xl model variants. Use for feature extraction from FASTA sequences, downstream fine-tuning on biological tasks, and protein function prediction.

Thin
0
AnnData

AnnData — annotated data matrices for single-cell and multi-omics analysis. Core data structure for the scverse ecosystem storing expression matrices (X) with observation metadata (obs), variable metadata (var), multi-dimensional annotations (obsm, varm, obsp, varp), layers, and unstructured data (uns). Read/write h5ad and zarr formats. Backed mode for out-of-core access to large datasets. Concatenate samples with flexible join and merge strategies. This is the data format skill — for analysis workflows use scanpy; for probabilistic models use scvi-tools; for population-scale queries use cellxgene-census.

Thin
0
AnnotationHub

AnnotationHub — Bioconductor R package providing a central repository for pre-processed biological data resources. Query and retrieve genome annotations (GRanges, TxDb, OrgDb), FASTA sequences, VCF files, BED tracks, and expression data from Ensembl, ENCODE, UCSC, dbSNP, and FANTOM5. Access hundreds of thousands of curated biological datasets with automatic local caching. Use when users need organism genome annotations, gene models, GTF files, OrgDb packages, liftOver chains, or any Bioconductor annotation resource without manual download.

Thin
0
AnnotSV

AnnotSV — integrated tool for structural variation annotation and ranking. Annotates DEL, DUP, INS, INV, and translocation SVs from VCF or BED input against 30+ databases (gnomAD-SV, DGV, ClinVar, OMIM, ClinGen). Provides ACMG/ClinGen-based 5-class pathogenicity ranking. Supports GRCh37, GRCh38, and CHM13. Produces split (per-gene) and full (per-SV) annotation modes. Essential for clinical SV interpretation, exome/genome diagnostics, and CNV array analysis.

Thin
0
ANNOVAR

ANNOVAR — efficient Perl-based tool for functional annotation of genetic variants from diverse genomes. Supports gene-based, region-based, and filter-based annotation using 100+ databases including ClinVar, gnomAD, dbNSFP, COSMIC, and population frequency data. Primary command is table_annovar.pl for combined multi-database annotation of VCF and AVINPUT files. Used in clinical exome/genome variant prioritization workflows.

Thin
0
AntiFold

AntiFold — antibody-specific inverse folding model built on ESM-IF1, fine-tuned on antibody structures from SAbDab and OAS databases. Predicts residue log-likelihoods for antibody variable domains (IMGT positions 1-128) and samples new sequences in FASTA format. Supports CDR-specific design via --regions, antigen-conditioned prediction, and embedding extraction. Use for antibody sequence design, humanization, CDR optimization, and inverse folding of variable domains from PDB structures.

Thin
0
Anvi'o

Anvi'o — integrated analysis and visualization platform for 'omics data. Supports metagenomics (MAG binning, refinement, read recruitment), pangenomics (gene cluster analysis, ANI computation), phylogenomics (concatenated phylogeny), and genomics (functional annotation, HMM-based gene calling). Key workflows: contigs database creation, read profiling with Bowtie2/Samtools, interactive binning via anvi-interactive, collection summarization, and metapangenomics. Use for metagenome-assembled genome (MAG) recovery, comparative genomics across microbial populations, and multi-'omics integration in environmental or clinical microbiology.

Thin
0
Apache Arrow

Apache Arrow — cross-language columnar in-memory analytics format and ecosystem. Provides zero-copy IPC between Python, R, C++, Java, Go, and Rust via shared memory and the C Data Interface. Use for: converting between pandas, Polars, DuckDB, and genomics libraries without data copies; reading and writing Parquet and Feather/IPC files; building high-throughput data pipelines from bioinformatics ETL to ML feature matrices. Key terms: pyarrow, pa.Table, pa.RecordBatch, pa.Schema, Arrow IPC, Feather, Parquet, plasma store, chunked array, zero-copy, Apache Arrow Flight, columnar format.

Thin
0
ape

Use when working with the R package ape for phylogenetic tree I/O, manipulation, plotting, molecular dating, comparative methods, or DNA distance workflows. ape supports Newick and Nexus tree formats, DNAbin/AAbin sequence containers, distance-based tree inference with nj and bionj, ancestral character estimation with ace, plotting with plot.phylo, and penalized-likelihood dating with chronos. Trigger on ape, Analysis of Phylogenetics and Evolution, read.tree, read.nexus, dist.dna, nj, ace, chronos, phylogenetic comparative methods, or R phylogeny workflows.

Thin
0
Arima-HiC Mapping Pipeline

Arima-HiC mapping pipeline — shell-based workflow for aligning and processing Hi-C data generated with Arima Genomics proximity ligation kits. Maps paired-end FASTQ reads with BWA-MEM using Hi-C-specific flags (-5SP), filters chimeric alignments to retain 5-prime ends, pairs mates, removes PCR duplicates with Picard MarkDuplicates, and generates .pairs and .hic contact matrix files via Juicer pre. Designed for Arima-HiC, Arima-HiC+, and Arima-HiChIP library types. Upstream of Juicer, HiGlass, cooler, HiCExplorer, and 3D-DNA scaffolding workflows.

Thin
0
ArviZ

ArviZ — Python library for exploratory analysis of Bayesian models providing posterior visualization (trace, forest, pair, posterior plots), diagnostics (R-hat, ESS, MCSE, BFMI), model comparison (LOO-CV, WAIC, Bayes factors), and the InferenceData container (xarray-backed, NetCDF-serializable) with converters for PyMC, CmdStan, NumPyro, Pyro, PyStan, emcee, and Bean Machine.

Thin
0
ASTRAL

ASTRAL — fast species tree estimation from unrooted gene trees under the multi-species coalescent model. Handles incomplete lineage sorting (ILS) in phylogenomic datasets with hundreds of species and thousands of loci. Computes local posterior probabilities (localPP) for branch support. ASTRAL-III uses a constrained search space with dynamic programming for scalable, statistically consistent species tree inference. Input: Newick gene trees. Output: Newick species tree with branch support and branch lengths in coalescent units.

Thin
0
ATACseqQC -- ATAC-seq Quality Control in R

ATACseqQC -- R/Bioconductor package for ATAC-seq quality control and assessment. Computes fragment size distribution, nucleosome positioning, TSS enrichment score, library complexity, and signal-to-noise metrics. Provides footprinting analysis, NFR (nucleosome-free region) scoring, CTCF footprint visualization, and V-plot generation. Shifts Tn5 insertion sites (+4/-5 bp) for accurate cut-site analysis. Integrates with GenomicAlignments, BSgenome, and TxDb annotation packages for comprehensive ATAC-seq diagnostic workflows.

Thin
0
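The +4/-5 bp Tn5 shift mentioned in the ATACseqQC entry above can be illustrated with a small coordinate calculation. This is a sketch in Python, not the ATACseqQC R API. The function name `tn5_cut_site` is hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def tn5_cut_site(start, end, strand):
    """Return the shifted 5' position of a read after the +4/-5 bp Tn5 correction.

    Assumes 1-based inclusive coordinates: plus-strand reads have their 5' end
    at `start`, minus-strand reads have it at `end`.
    """
    if strand == "+":
        return start + 4   # plus-strand reads shift 4 bp downstream
    if strand == "-":
        return end - 5     # minus-strand reads shift 5 bp upstream
    raise ValueError("strand must be '+' or '-'")


# Two hypothetical reads covering the same interval on opposite strands.
print(tn5_cut_site(100, 150, "+"))
print(tn5_cut_site(100, 150, "-"))
```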
Augur

Augur — Nextstrain's Python toolkit for phylogenetic analysis of pathogen genomes. Provides modular subcommands for sequence filtering, multiple alignment, phylogenetic tree building, temporal dating, ancestral state reconstruction, clade assignment, and Auspice JSON export. Used for real-time genomic epidemiology of SARS-CoV-2, influenza, Ebola, and other pathogens. Install with pip install nextstrain-augur.

Thin
0
AUGUSTUS -- Eukaryotic Gene Prediction

Thin
0
Auspice

Auspice — interactive visualization tool for phylogenomic and genomic epidemiology data from the Nextstrain platform. Renders Nextstrain JSON datasets as interactive phylogenetic trees, geographic transmission maps, frequency panels, and temporal charts in the browser. Supports custom datasets, narrative-driven storytelling, and client-side filtering. Core visualization layer for real-time pathogen surveillance pipelines.

Thin
0
Auto-Detect Input Format

Infer bioinformatics file format from magic bytes, header patterns, and file extensions

Thin
0
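The detection order described in the Auto-Detect Input Format entry above (magic bytes, then header patterns, then file extension) might look like the sketch below. It is not the skill's actual implementation. The function name `sniff_format` and the set of formats checked are assumptions chosen for illustration.

```python
import gzip
import zlib


def sniff_format(data: bytes, filename: str = "") -> str:
    """Guess a file format from its first bytes, falling back to the extension."""
    if data[:2] == b"\x1f\x8b":  # gzip magic; BGZF (used by BAM/BCF) is gzip-compatible
        try:
            # wbits=31 selects gzip framing; only the first 4 bytes are decompressed
            head = zlib.decompressobj(31).decompress(data, 4)
        except zlib.error:
            return "gzip"
        if head == b"BAM\x01":
            return "bam"
        if head[:3] == b"BCF":
            return "bcf"
        return "gzip"
    if data[:4] == b"CRAM":
        return "cram"
    if data.startswith(b"##fileformat=VCF"):
        return "vcf"
    if data.startswith((b"@HD\t", b"@SQ\t")):  # SAM header lines, checked before FASTQ
        return "sam"
    if data[:1] == b">":
        return "fasta"
    if data[:1] == b"@":
        return "fastq"
    if "." in filename:  # last resort: trust the extension
        return filename.rsplit(".", 1)[-1].lower()
    return "unknown"


print(sniff_format(b">seq1\nACGT\n"))
print(sniff_format(gzip.compress(b"BAM\x01demo")))
print(sniff_format(b"chr1\t0\t100\n", "peaks.bed"))
```

Content checks run before the extension fallback, so a misnamed file is still classified by what it contains.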
Azimuth

Azimuth — reference-based automated cell type annotation for single-cell RNA-seq. Maps a query Seurat object or h5ad dataset onto a curated reference atlas (PBMC, lung, kidney, cortex, heart, bone marrow, and 10+ others) using weighted nearest neighbors (WNN) and supervised PCA. Transfers cell type labels, UMAP coordinates, and confidence scores with RunAzimuth(). Also available via a web app. Use when you need fast, reproducible cell annotation without manual marker curation or when comparing against a published atlas.

Thin
0
BAGEL2

BAGEL2 — Bayesian Analysis of Gene Essentiality v2 for identifying essential genes from CRISPR/Cas9 knockout screens. Calculates Bayes Factors from fold-change distributions using reference sets of core essential and non-essential genes. Supports multi-target gene-level analysis, precision-recall evaluation, and bootstrapped re-sampling for robust essentiality scoring. Input: read count matrices from CRISPR screens. Output: gene-level Bayes Factors, precision-recall curves, and essentiality classifications.

Thin
0
Bakta

Bakta — rapid, standardized annotation of bacterial genomes, plasmids, and metagenome-assembled genomes (MAGs). Provides comprehensive feature detection (CDS, tRNA, rRNA, ncRNA, CRISPR, sORF, oriC/oriV/oriT, pseudogenes) with taxonomy-independent UniRef-based protein identification. Outputs INSDC-compliant GenBank/EMBL, GFF3, and machine-readable JSON. Includes AMRFinderPlus for antimicrobial resistance gene detection.

Thin
0
ballgown

ballgown -- R/Bioconductor package for flexible isoform-level differential expression analysis of RNA-seq experiments. Works with StringTie output (.ctab files) to test for differential expression at the transcript, exon, and intron levels using linear models. Part of the new Tuxedo pipeline (HISAT2 -> StringTie -> ballgown). Provides statistical testing via stattest(), FPKM/TPM filtering, phenotype-aware models, and built-in visualization (plotTranscripts, plotMeans, plotLatent).

Thin
0
BALSAMIC

Use when running somatic mutation analysis on cancer whole-genome (WGS), whole-exome (WES), or targeted panel sequencing data with BALSAMIC. Handles tumor-normal and tumor-only workflows, calling SNVs, indels, SVs, and CNVs using an ensemble of callers (Mutect2, Strelka, Manta, DELLY, CNVkit, ASCAT). Built on Snakemake with Singularity containers for reproducibility. Part of the Clinical Genomics stack at SciLifeLab, Sweden. BALSAMIC stands for Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer.

Thin
0
bamAlignCleaner

bamAlignCleaner — removes unaligned references from BAM/CRAM alignment files. Cleans alignment files by filtering out reference sequences with no mapped reads, reducing file size and speeding downstream analysis. Two methods: parse (iterate reads, better when references > reads) and index_stat (use BAM index, better when reads > references). Supports retaining specific references via a list file or FASTA. Common in ancient DNA and metagenomics pipelines. CLI tool built on pysam.

Thin
0
Bambi

Bambi — BAyesian Model-Building Interface for Python providing R-style formula syntax for Bayesian regression models built on PyMC. Specify generalized linear and mixed-effects models with bmb.Model("y ~ x1 + (1|group)", data), automatic prior selection, built-in families (Gaussian, Bernoulli, Poisson, Negative Binomial, Beta, Gamma, StudentT, Categorical, Ordinal, Zero-Inflated), random effects with (1|group) and (slope|group) syntax, splines via bs(x), interaction terms, posterior predictive checks, model comparison via ArviZ, and interpretation plots (conditional adjusted predictions, comparisons, slopes).

Thin
0
Bambu

Bambu — R/Bioconductor package for context-aware transcript discovery and quantification from long-read RNA-seq data (Oxford Nanopore, PacBio). Performs reference-guided isoform reconstruction, novel transcript detection with NDR scoring, multi-sample analysis, and de novo assembly. Key functions: bambu(), prepareAnnotations(), plotBambu(), writeBambuOutput().

Thin
0
BAMtools

BAMtools — C++ API and command-line toolkit for reading, writing, sorting, indexing, filtering, merging, splitting, and converting BAM alignment files. Provides JSON-based filter expressions, multi-format conversion (BAM to BED, FASTA, FASTQ, SAM, JSON, YAML), coverage statistics, random subsampling, and paired-end resolution. Useful in NGS pipelines alongside samtools.

Thin
0
Bandage

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) — interactive visualization tool for de novo assembly graphs in GFA and FASTG formats. Visualize genome assembly graphs, perform BLAST searches within graphs, extract subgraphs by node or BLAST hit, and generate publication-quality images. Essential for inspecting SPAdes, MEGAHIT, Unicycler, and Flye assembly graphs.

Thin
0
BayesSpace

BayesSpace — R/Bioconductor package for spatially resolved clustering and resolution enhancement of spatial transcriptomics data. Performs Bayesian spatial clustering with a Potts smoothing prior on 10x Visium and Slide-seq data using SingleCellExperiment objects. Enhances sub-spot resolution to predict gene expression at finer spatial granularity. Use when clustering spatial spots, enhancing spatial resolution, or analyzing tissue domains.

Thin
0
BBDuk

BBDuk (Decontamination Using Kmers) — high-performance Java tool from the BBTools suite for adapter trimming, quality trimming, contaminant filtering, sequence masking, GC filtering, and format conversion in a single pass. Uses kmer matching against reference sequences for decontamination and trimming of FASTQ/FASTA reads. Essential preprocessing step in WGS, RNA-seq, metagenomics, and amplicon sequencing pipelines.

Thin
0
BBKNN -- Batch Balanced K Nearest Neighbours

BBKNN (Batch Balanced K Nearest Neighbours) -- lightweight Python batch correction method for single-cell RNA-seq that operates on the k-nearest neighbor graph rather than the expression matrix. Replaces scanpy's sc.pp.neighbors() with a batch-balanced alternative, connecting each cell to its nearest neighbors within each batch. Integrates directly with scanpy and the scverse ecosystem for downstream UMAP, Leiden clustering, and PAGA trajectory analysis on AnnData objects.

Thin
0
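
The batch-balanced neighbor search described in the BBKNN entry can be illustrated with a small pure-Python sketch. It assumes Euclidean distance on toy 2-D coordinates and excludes each cell from its own neighbor list; `batch_balanced_neighbors` is a hypothetical helper, not part of the bbknn package.

```python
import math


def batch_balanced_neighbors(points, batches, k=1):
    """For each cell, collect its k nearest cells from every batch separately."""
    by_batch = {}
    for idx, batch in enumerate(batches):
        by_batch.setdefault(batch, []).append(idx)

    neighbors = {}
    for i, p in enumerate(points):
        picked = []
        for batch in sorted(by_batch):
            # Rank the other cells of this batch by distance to cell i.
            candidates = [j for j in by_batch[batch] if j != i]
            candidates.sort(key=lambda j: math.dist(p, points[j]))
            picked.extend(candidates[:k])
        neighbors[i] = picked
    return neighbors


# Toy embedding: three cells from batch "a", two from batch "b".
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 5.0)]
batches = ["a", "a", "a", "b", "b"]
graph = batch_balanced_neighbors(points, batches, k=1)
print(graph)
```

Because every cell draws neighbors from each batch, the resulting graph links batches together even when an ordinary k-nearest-neighbor search would stay within one batch.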
BBMap / BBTools

BBMap/BBTools — suite of 265+ fast, multithreaded Java-based bioinformatics tools for DNA/RNA sequence analysis. Includes BBMap short read aligner, BBDuk adapter trimmer and quality filter, BBMerge paired-end read merger, Clumpify optical duplicate remover, Tadpole assembler, BBNorm coverage normalizer, and Reformat file converter. Handles FASTQ, FASTA, SAM/BAM, VCF, and GFF formats across Illumina, PacBio, and Nanopore platforms.

Thin
0
BCL Convert

Use when converting Illumina BCL (Binary Base Call) files to FASTQ format using BCL Convert, the successor to bcl2fastq. Supports all modern Illumina platforms including NovaSeq X, NovaSeq X Plus, NovaSeq 6000, NextSeq 1000/2000, MiSeq, and iSeq. Handles demultiplexing, adapter trimming, UMI extraction, ORA compression, and DRAGEN hardware-accelerated conversion. Also known as DRAGEN BCL Convert, Illumina BCL Convert, bclconvert.

Thin
0
bcl2fastq

Use when converting Illumina BCL (Binary Base Call) files to FASTQ format, demultiplexing sequencing runs by index sequences, or trimming adapters from short-read data. Handles HiSeq, MiSeq, NextSeq, NovaSeq, and iSeq instrument output. Key for NGS preprocessing pipelines: RNA-seq, WGS, WES, ATAC-seq, ChIP-seq, amplicon sequencing. Also known as bcl2fastq2, Illumina bcl2fastq conversion software, Illumina FASTQ conversion.

Thin
0
BE-Hive

Use when working with BE-Hive, the Shen lab base-editing outcome predictor distributed through the `be_predict_bystander` repository and the crisprbehive.design web app. BE-Hive predicts bystander editing outcome frequencies for a 50-nt target sequence after `init_model(base_editor, celltype)` and `predict(seq)`, supports ABE/CBE/CGBE editor panels listed in `models.csv`, and can expand predictions to full edited 50-nt genotypes with `add_genotype_column()`. Use this skill for route planning, local wrapper execution against a cloned upstream repository, debugging unsupported editor-celltype combinations, or deciding whether predictive BE-Hive output is the right tool instead of experimental readout pipelines.

Thin
0
BEAGLE

BEAGLE — software for genotype phasing and imputation from VCF files. Performs haplotype phase inference for genotyped variants, imputes ungenotyped markers using a reference panel, and estimates identity-by-descent (IBD) segments. Use when phasing genotypes, imputing missing variants, building haplotype reference panels, or detecting IBD sharing.

Thin
0
BEAST2

BEAST2 — Bayesian Evolutionary Analysis by Sampling Trees. Cross-platform Java application for Bayesian phylogenetic inference using MCMC sampling. Use when estimating time-calibrated phylogenies, relaxed molecular clocks, coalescent demographic reconstruction (Bayesian skyline), birth-death epidemiological models (BDSKY), species trees (*BEAST/StarBEAST2), or phylogeographic inference from molecular sequence data.

Thin
0
BEDOPS

BEDOPS — high-performance genomic interval arithmetic suite for BED, VCF, GFF, GTF, BAM, and SAM files. Provides set operations (intersection, union, difference, complement, symmetric difference) via bedops, element-by-element statistics via bedmap, and a required coordinate-sort utility sort-bed. Designed for memory-efficient streaming on sorted inputs — no genome loading required. Use for interval arithmetic, coverage analysis, feature annotation, and building genomic pipelines from POSIX-composable components.

Thin
0
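
The streaming set operations in the BEDOPS entry depend on sorted inputs. The sketch below shows why: with both lists coordinate-sorted, an intersection needs only a single pass with two cursors. It assumes half-open `(chrom, start, end)` tuples and no overlaps within a single list; `intersect_sorted` is a hypothetical helper, not the bedops implementation.

```python
def intersect_sorted(a, b):
    """Return overlaps between two coordinate-sorted interval lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a == chrom_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                out.append((chrom_a, start, end))
        # Advance whichever interval finishes first in sort order.
        if (chrom_a, end_a) <= (chrom_b, end_b):
            i += 1
        else:
            j += 1
    return out


a = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 5, 15)]
b = [("chr1", 5, 25), ("chr2", 0, 6)]
overlaps = intersect_sorted(a, b)
print(overlaps)
```

Only the current element of each input is held in memory, which is the property the entry refers to as memory-efficient streaming.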
Benchmark Tools for Task

Benchmark and compare bioinformatics tools for a given analysis task using published benchmark data (accuracy, runtime, memory), PrecisionFDA and CAMI challenge results, bc7score community metrics, and peer-reviewed citation references. Supports head-to-head comparison tables for variant calling, single-cell clustering, RNA-seq alignment, genome assembly, metagenomics profiling, differential expression, peak calling, and phylogenetics with graceful fallback to community signals when formal benchmarks are unavailable.

Thin
0
bgzip

bgzip — BGZF block-compression utility

Thin
0
Bio Context Lookup

Look up documentation and code examples for bioinformatics tools using BioContext7

Thin
0
Bio-Formats

Bio-Formats — Java library for reading and writing microscopy image formats. Supports 160+ file formats including OME-TIFF, CZI (Zeiss), LIF (Leica), ND2 (Nikon), VSI (Olympus), and proprietary vendor formats. Provides metadata extraction, format conversion, pixel data access, and OME-XML metadata model. Essential for microscopy image analysis pipelines and FAIR data archival.

Thin
0
bio-tool-search

bio-tool-search — search 47,000+

Thin
0
bioawk

bioawk — AWK with built-in parsers for

Thin
0
biobambam2

biobambam2 — C++ toolkit for early-stage BAM file processing built on libmaus2. Provides bamsormadup for parallel sorting with duplicate marking, bamcollate2 for name-collation, bammarkduplicates for tagging PCR/optical duplicates, bamtofastq for BAM-to-FASTQ conversion, bamsort for coordinate/queryname resorting, and bamrecompress for BAM recompression. Used in high-throughput sequencing pipelines for alignment post-processing.

Thin
0
Bioconda

Bioconda — community-driven Conda channel providing 12,000+ bioinformatics packages with pinned dependencies and reproducible environments. Supports recipe creation, environment management, containerized builds, and multi-platform package distribution. Essential for reproducible bioinformatics pipelines including nf-core, Snakemake, and Galaxy workflows.

Thin
0
Bioconductor

Bioconductor package management and installation via BiocManager for R-based genomic data analysis. Install, update, and validate Bioconductor packages for RNA-seq, ChIP-seq, single-cell, methylation, variant calling, and proteomics workflows. Manage version-locked Bioconductor releases, check package validity, query available packages, and resolve dependency conflicts. Covers BiocManager::install(), BiocManager::valid(), BiocManager::version(), and BiocManager::available() APIs.

Thin
0
Bioconductor AnnotationDbi

Use when working with Bioconductor AnnotationDbi, the R package that provides a common interface for SQLite-backed annotation packages such as OrgDb, ChipDb, GO.db, and custom AnnotationDb derivatives. Route users who need to inspect available key types and columns, translate identifiers with select() or mapIds(), query annotation package metadata, save or reload SQLite annotation databases, or debug one-to-many identifier mappings in Bioconductor workflows for transcriptomics, microarrays, and functional annotation.

Thin
0
BioContainers

BioContainers — community-driven registry of Docker and Singularity container images for bioinformatics tools. Provides reproducible, pre-built containers from Bioconda recipes via automated multi-build infrastructure. Covers container search, pull, run, multi-tool combinations, and Singularity/Docker conversion for HPC and cloud workflows. Essential for reproducible bioinformatics pipelines on Nextflow, Snakemake, WDL, and Galaxy.

Thin
0
BioJulia

BioJulia — Julia ecosystem for computational biology providing high-performance biological sequence analysis, file I/O for FASTA/FASTQ/BAM/SAM/VCF/GFF3 formats, pairwise alignment, genomic interval operations, and k-mer analysis. Core packages: BioSequences.jl (DNA/RNA/AA sequences), FASTX.jl (FASTA/FASTQ I/O), XAM.jl (SAM/BAM), GenomicFeatures.jl (intervals/annotations), BioAlignments.jl (alignment). Use for Julia-based bioinformatics pipelines, sequence manipulation, genomics data parsing, and high-throughput sequence processing.

Thin
0
biomaRt

Use when working with biomaRt, BioMart queries in R, Ensembl annotation retrieval, cross-dataset identifier mapping, or annotation lookups through Bioconductor. biomaRt connects R to BioMart services such as Ensembl, supports dataset discovery with listEnsembl() and listDatasets(), query construction with getBM(), linked-dataset retrieval with getLDS(), helper accessors such as listFilters() and listAttributes(), and result caching via biomartCacheInfo(). Use this skill for gene annotation, identifier conversion, sequence retrieval, Ensembl archive access, or troubleshooting BioMart connection settings.

Thin
0
BioNumPy

BioNumPy — NumPy-based Python library for array-oriented genomics analysis.

Thin
0
Biopython

Biopython — comprehensive Python toolkit for computational molecular biology. Provides sequence manipulation (Bio.Seq), file I/O for 30+ formats including FASTA/GenBank/FASTQ/PDB (Bio.SeqIO), programmatic NCBI database access (Bio.Entrez), BLAST automation (Bio.Blast), structural bioinformatics (Bio.PDB with SMCRA hierarchy), phylogenetics (Bio.Phylo), pairwise and multiple sequence alignment (Bio.Align), motif analysis, restriction enzymes, and population genetics. Best for batch bioinformatics pipelines, custom sequence analysis, and programmatic database queries. For quick single-gene lookups use gget; for multi-service REST integration use bioservices.

Thin
0
Biopython SeqIO

Biopython SeqIO — the unified sequence file I/O module for 30+ bioinformatics formats. Read, write, and convert FASTA, FASTQ, GenBank, EMBL, Stockholm, Phylip, NEXUS, PDB-seqres, and more using SeqIO.parse, SeqIO.read, SeqIO.write, and SeqIO.convert. Handles streaming large files without loading into memory, batch format conversion, and SeqRecord manipulation. Use for sequence file parsing, format conversion pipelines, record filtering by ID or quality, and integrating external sequence databases into Python workflows.

Thin
0
BioPython.PDB

BioPython.PDB -- Python module for structural biology and macromolecular structure analysis. Parses PDB and mmCIF files, navigates the Structure-Model-Chain-Residue-Atom (SMCRA) hierarchy, computes atomic distances and contacts via NeighborSearch, performs structure superimposition with SVD, calculates secondary structure (DSSP), solvent accessibility (NACCESS/DSSP), and half-sphere exposure. Supports reading from RCSB PDB, writing modified structures, and extracting B-factors, occupancies, and coordinate data for downstream analysis.

Thin
0
BioRender

BioRender — web-based platform for creating professional scientific figures, illustrations, and diagrams. Use for designing publication-quality figures for research papers, grants, posters, and presentations. Provides 50,000+ pre-made scientific icons across biology, chemistry, immunology, and neuroscience. Supports figure export as PNG, PDF, TIFF, or SVG. Offers a REST publish API for programmatic figure export and automation. Use when creating scientific diagrams, biological pathway figures, experimental workflow illustrations, or journal-ready panels without Illustrator expertise.

Thin
0
Biostrings

Biostrings — Bioconductor R package for efficient manipulation of biological sequences (DNA, RNA, amino acids). Provides S4 classes DNAStringSet, RNAStringSet, AAStringSet for collections; pattern matching with matchPattern and vmatchPattern; pairwise alignment (Needleman-Wunsch, Smith-Waterman); FASTA/FASTQ I/O; translation; reverse complement; and consensus motif discovery. Essential for sequence analysis in R-based genomics workflows.

Thin
0
Bismark

Bismark — bisulfite-seq alignment and methylation calling toolkit. Maps bisulfite-treated reads to a reference genome using Bowtie 2, HISAT2, or minimap2, performs cytosine methylation calls in CpG/CHG/CHH contexts, and generates comprehensive HTML reports. Supports WGBS, RRBS, PBAT, NOMe-seq, and SLAM-seq library types. Outputs BAM files with methylation tags, bedGraph/coverage files, and genome-wide cytosine reports.

Thin
0
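
The CpG/CHG/CHH contexts named in the Bismark entry are defined by the two bases downstream of a cytosine, where H stands for A, C, or T. A minimal sketch of that classification follows; it looks at the top strand only, and `cytosine_context` is a hypothetical helper rather than anything from Bismark.

```python
def cytosine_context(seq, pos):
    """Classify the cytosine at seq[pos] as CpG, CHG, or CHH."""
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[pos + 1 : pos + 3]
    if downstream[:1] == "G":
        return "CpG"
    # Without two unambiguous downstream bases the context is undecidable.
    if len(downstream) < 2 or "N" in downstream:
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


# Cytosines at indices 1, 4, and 7 of a toy sequence.
contexts = [cytosine_context("ACGTCAGCTT", i) for i in (1, 4, 7)]
print(contexts)
```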
BLASR

BLASR (Basic Local Alignment with Successive Refinement) — PacBio long-read aligner for mapping SMRT sequencing reads to reference genomes. Uses suffix array indexing and global chaining for seed-and-extend alignment of high-error-rate long reads. Accepts FASTA, BAM, and legacy bas.h5 inputs; outputs SAM/BAM. DEPRECATED: PacBio recommends pbmm2 (minimap2 wrapper) for new projects. Use BLASR only for legacy pipeline compatibility or reproducing published analyses.

Thin
0
BOLT-LMM

BOLT-LMM — efficient linear mixed model association testing for biobank-scale GWAS. Implements infinitesimal and non-infinitesimal Bayesian mixed models with O(MN^1.5) complexity for genome-wide association analysis of hundreds of thousands of individuals. Use when running GWAS on quantitative traits in large cohorts, testing imputed variants in BGEN format, estimating SNP heritability with BOLT-REML, or constructing polygenic scores via BLUP. Operates on PLINK binary genotypes with support for BGEN, IMPUTE2, and dosage-format imputed variants.

Thin
0
Boltz-1

Boltz-1 — open-source deep learning model for predicting biomolecular 3D structures and interactions, approaching AlphaFold3-level accuracy. Supports protein, DNA, RNA, and small-molecule ligand structure prediction via YAML input. Provides confidence metrics (pLDDT, pTM, ipTM), binding affinity prediction, MSA generation, and pocket conditioning. CLI-driven with GPU acceleration. MIT licensed for academic and commercial use.

Thin
0
Bonito

Bonito — open-source PyTorch research basecaller for Oxford Nanopore Technologies (ONT) long-read sequencing. Designed for basecaller training and model development, not production basecalling (use Dorado for production). Converts POD5 raw signal files to FASTQ/BAM using CTC-based neural networks. Supports custom model training from scratch or fine-tuning pretrained models, model evaluation with accuracy/error-rate metrics, and training data generation via --save-ctc flag. Works with R9.4.1 and R10.4.1 flow cells.

Thin
0
Bowtie2

Bowtie2 — ultrafast and memory-efficient short-read aligner using FM-index with gapped alignment and local alignment modes. Standard aligner for ChIP-seq, ATAC-seq, CUT&RUN, CUT&Tag, and general short-read alignment. Supports end-to-end and local alignment, sensitivity presets, and multithreaded operation. Produces SAM output for downstream peak calling (MACS2/MACS3), signal quantification (deepTools), and variant analysis. Use when aligning Illumina short reads for epigenomics, functional genomics, or metagenomics mapping workflows.

Thin
0
Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) — statistical method for computing species-level abundance estimates from Kraken/Kraken2/KrakenUniq taxonomic classifications. Redistributes reads assigned to higher taxonomic levels down to species (or any specified rank) using a Bayesian algorithm trained on k-mer distribution patterns from the reference database. Produces corrected abundance tables and updated Kraken-style reports compatible with Pavian, KrakenTools, and downstream R/Python ecological analysis.

Thin
0
BRAKER

BRAKER — fully automated gene structure annotation pipeline combining GeneMark-ES/ET/EP/ETP with AUGUSTUS training. Supports RNA-Seq evidence (BRAKER1), protein evidence (BRAKER2), and combined RNA-Seq + protein evidence (BRAKER3). Produces GTF/GFF3 gene annotations from a genome FASTA plus extrinsic evidence. Widely used for eukaryotic genome annotation projects including de novo, RNA-Seq-guided, and protein-homology workflows.

Thin
0
broom

broom — Convert statistical model objects into tidy tibbles in R. Three core verbs: tidy() extracts per-term coefficient estimates, standard errors, test statistics, and p-values; glance() returns one-row model-level summaries (R-squared, AIC, BIC, log-likelihood); augment() adds per-observation fitted values, residuals, Cook's distance, and leverage to original data. Supports 100+ model classes including lm, glm, nls, aov, coxph, survreg, survfit, kmeans, prcomp, factanal, glmnet, gam, betareg, polr, multinom, lavaan, Arima, rq, rlm, lmrob, ergm, rma, and all base stats hypothesis tests (t.test, cor.test, wilcox.test, chisq.test). Part of the tidymodels ecosystem. Essential for reproducible reporting, model comparison tables, diagnostic plots with ggplot2, and pipeline integration.

Thin
0
BSgenome

Use when working with BSgenome, Bioconductor whole-genome sequence packages, getBSgenome(), getSeq(), bsapply(), injectSNPs(), genome FASTA export, or masked BSgenome objects in R. BSgenome provides the shared infrastructure for Biostrings-based genome data packages and is the correct skill for packaged reference genomes, subsequence extraction, chromosome-wise iteration, SNPlocs-based SNP injection, and export to FASTA or twoBit. Trigger on: BSgenome, BSgenome.Hsapiens.UCSC.hg38, available.genomes(), installed.genomes(), getBSgenome(), getSeq(), bsapply(), injectSNPs(), writeBSgenomeToFasta(), writeBSgenomeToTwobit(), or BSgenomeForge hand-off questions.

Thin
0
BSseeker2

BSseeker2 — bisulfite sequencing aligner

Thin
0
bsseq -- Bisulfite Sequencing Data Analysis

bsseq -- Bioconductor R package for analyzing and visualizing bisulfite sequencing (BS-seq) data, including whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS). Provides the BSseq class for storing methylation data, BSmooth smoothing algorithm for low-coverage WGBS, detection of differentially methylated regions (DMRs), and publication-quality plotting of methylation profiles across genomic regions.

Thin
0
BUGS/WinBUGS

BUGS/WinBUGS — Bayesian inference Using Gibbs Sampling for complex hierarchical models via MCMC. Model specification in the BUGS language with stochastic (~) and deterministic (<-) nodes, 30+ distributions (dnorm, dgamma, dbin, dpois, dunif, dbeta, dweib, dmnorm, dwish), link functions (logit, log, cloglog, probit), censoring/truncation support, DIC model comparison, and R integration via R2WinBUGS/R2OpenBUGS.

Thin
0
BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) — assess genome assembly, transcriptome, and proteome completeness using lineage-specific single-copy orthologs from the OrthoDB database. Reports Complete, Duplicated, Fragmented, and Missing (C/D/F/M) percentages. Supports genome, transcriptome, and protein assessment modes. Essential quality control step for de novo assemblies.

Thin
0
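
The percentages in a BUSCO report can be derived from per-ortholog status calls. The sketch below assumes each ortholog carries one of four hypothetical status labels and that Complete is the sum of single-copy and duplicated hits, which is how BUSCO summaries are usually laid out; `busco_summary` is an illustrative helper, not part of BUSCO.

```python
from collections import Counter


def busco_summary(statuses):
    """Return C/D/F/M percentages from a list of per-ortholog statuses."""
    counts = Counter(statuses)
    total = len(statuses)

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "C": pct(counts["Single"] + counts["Duplicated"]),
        "D": pct(counts["Duplicated"]),
        "F": pct(counts["Fragmented"]),
        "M": pct(counts["Missing"]),
    }


# Toy lineage set with ten orthologs.
statuses = ["Single"] * 7 + ["Duplicated", "Fragmented", "Missing"]
summary = busco_summary(statuses)
print(summary)
```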
BUStools

BUStools — C toolkit for manipulating BUS (Barcode, UMI, Set) files from single-cell RNA-seq pseudoalignment. Provides sorting, barcode error correction, UMI counting, matrix generation, quality inspection, and BUS file capture/filtering. Works downstream of kallisto bus and upstream of Scanpy/Seurat for scRNA-seq preprocessing pipelines.

Thin
0
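
UMI counting, one of the BUStools steps listed above, amounts to counting distinct UMIs per barcode and gene so that PCR duplicates collapse to one molecule. The sketch below is a simplification: real BUS records carry an equivalence-class set rather than a gene name, and `count_umis` and the record tuples are hypothetical.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (barcode, gene) from (barcode, umi, gene) records."""
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


records = [
    ("AAAC", "UMI1", "geneA"),
    ("AAAC", "UMI1", "geneA"),  # duplicate molecule, counted once
    ("AAAC", "UMI2", "geneA"),
    ("AAAG", "UMI1", "geneB"),
]
matrix = count_umis(records)
print(matrix)
```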
BWA-MEM2

BWA-MEM2 — accelerated Burrows-Wheeler Aligner for mapping short DNA reads to reference genomes. Use when aligning Illumina WGS, WES, targeted panel, ChIP-seq, or ATAC-seq FASTQ reads. Drop-in replacement for BWA-MEM with identical output and 1.3-3.1x faster performance via SIMD vectorization. De facto standard for short-read alignment in variant calling pipelines (GATK, DeepVariant).

Thin
0
bwa-meth

bwa-meth -- fast and accurate bisulfite-seq (WGBS/RRBS) aligner built on BWA-MEM. Performs in-silico C-to-T conversion of reads and reference, aligns with BWA-MEM or BWA-MEM2, and recovers original bases for downstream methylation calling. Handles paired-end and single-end directional BS-Seq protocols. Used upstream of MethylDackel, Bismark methylation extractor, or biscuit for CpG methylation quantification.

Thin
0
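
The in-silico conversion step in the bwa-meth entry can be sketched as a plain string substitution. The entry only mentions C-to-T conversion; the G-to-A treatment of the second read in a pair is an assumption added here for illustration, and `convert_read` is a hypothetical helper, not the bwa-meth code.

```python
def convert_read(seq, is_read2=False):
    """In-silico bisulfite conversion: C->T for read 1, G->A for read 2 (assumed)."""
    seq = seq.upper()
    return seq.replace("G", "A") if is_read2 else seq.replace("C", "T")


original = "ACGTCCGA"
converted = convert_read(original)

# The unconverted bases are stored alongside so they can be restored after
# alignment, since methylation calling needs to see which cytosines survived.
record = {"aligned_as": converted, "original": original}
print(record)
```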
Cactus

Cactus — reference-free whole-genome multiple alignment using progressive decomposition and the cactus graph structure. Produces HAL format alignments for comparative genomics, conservation scoring, liftover, and pangenome graph construction via Minigraph-Cactus. Use when users need to align multiple genomes, build pangenome graphs from assemblies, perform cross-species liftover, or run Minigraph-Cactus pangenome pipelines.

Thin
0
CADD -- Combined Annotation Dependent Depletion

CADD (Combined Annotation Dependent Depletion) -- integrative framework for scoring the deleteriousness of single nucleotide variants and insertions/deletions in the human genome. Combines diverse genomic annotations (conservation, regulatory, protein-level, epigenetic) into a single C-score using a machine-learning model trained on evolutionary proxy variants (a support vector machine in the original release, logistic regression in later versions). Provides pre-scored files for all possible SNVs (GRCh37/GRCh38), a REST API for small queries, and offline scoring scripts for novel indels and custom variant sets via CADD-scripts.

Thin
0
Caduceus

Caduceus — bi-directional DNA foundation model built on the Mamba state-space architecture for genomic sequence modeling. Provides bi-directional long-range sequence understanding with reverse complement equivariance, masked language modeling pretraining, variant effect prediction, chromatin profile classification, and regulatory element analysis. Supports sequences up to 131k tokens using the Mamba backbone. Use for DNA sequence classification, variant scoring, and genomic sequence embeddings.

Thin
0
Canu

Canu — long-read de novo genome assembler for PacBio HiFi, PacBio CLR, and Oxford Nanopore reads. Performs read correction, trimming, and overlap-layout-consensus assembly in three stages. Supports diploid-aware assembly, trio binning, grid engine submission (SGE/Slurm/PBS), and adaptive k-mer-based overlap detection. Standard tool for microbial, plant, and human genome assembly from third-generation sequencing.

Thin
0
car

car — Companion to Applied Regression. R package providing Type II/III ANOVA tables via Anova(), variance inflation factors via vif(), general linear hypothesis tests via linearHypothesis(), diagnostic plots (avPlots, crPlots, residualPlots, influencePlot, qqPlot, spreadLevelPlot), power transformations (boxCox, powerTransform, bcPower, yjPower), homoscedasticity tests (leveneTest, ncvTest), autocorrelation tests (durbinWatsonTest), outlier detection (outlierTest), bootstrapping (Boot), delta method standard errors (deltaMethod), contrast coding (Contrasts), and data utilities (recode, some, Import, Export). Essential companion to Fox & Weisberg "An R Companion to Applied Regression" (3rd ed., Sage, 2019). Works with lm, glm, lme, lmer, and many other model classes.

Thin
0
CARD/RGI

CARD/RGI (Comprehensive Antibiotic Resistance

Thin
0
Cas-OFFinder

Cas-OFFinder — OpenCL-accelerated off-target site finder for CRISPR/Cas RNA-guided endonucleases. Searches whole genomes for potential off-target sites with configurable mismatch thresholds, DNA/RNA bulge detection, and ambiguous PAM support. Works with SpCas9, SaCas9, StCas9, NmCas9, and custom nucleases. Accepts FASTA or 2BIT genome input, outputs tab-separated off-target hits with genomic coordinates, strand, and mismatch counts. Runs on GPU (OpenCL) or CPU. Essential for CRISPR guide RNA design and specificity assessment.

Thin
0
CasTLE

CasTLE (Cas9 High-Throughput maximum Likelihood Estimator) — empirical Bayesian framework for analyzing pooled CRISPR/Cas9 and RNAi genetic screens. Identifies genes with significant phenotypic effects from guide-level count data using maximum likelihood estimation with permutation-based p-values. Supports combining results from multiple screen types (shRNA + CRISPR), volcano plots, gene-level visualization, and reproducibility analysis. Works with FASTQ input via Bowtie alignment or pre-computed count files.

Thin
0
CausalImpact

CausalImpact — Google's R package for causal inference on time series using Bayesian structural time-series (BSTS) models. Estimate the causal effect of an intervention (ad campaign, policy change, treatment rollout) by constructing a synthetic counterfactual from control time series via CausalImpact(), visualize observed vs predicted response with plot(), obtain point and cumulative effect estimates with summary(), customize the underlying state-space model through model.args (niter, prior.level.sd, nseasons, dynamic.regression), or supply a fully custom bsts model for maximum flexibility. Returns posterior credible intervals, tail-area probabilities, and verbal effect descriptions.

Thin
0
CCP4

CCP4 — Collaborative Computational Project Number 4 software suite for macromolecular X-ray crystallography. Provides programs for data processing (AIMLESS, POINTLESS), molecular replacement (Phaser), model building (Buccaneer, Coot), refinement (REFMAC5), and validation (MolProbity integration). Uses MTZ reflection format and PDB/mmCIF coordinate files. Essential for protein structure determination from diffraction data.

Thin
0
CD-HIT

CD-HIT — fast sequence clustering and redundancy reduction for protein and nucleotide sequences. Clusters FASTA sequences at a user-defined identity threshold to remove redundancy from databases. Provides cd-hit (protein), cd-hit-est (nucleotide/EST), cd-hit-2d (cross-database comparison), and cd-hit-div (splitting large databases). Widely used for non-redundant reference set construction, training/test set splitting, and database pre-processing in comparative genomics and machine learning pipelines.

Thin
0
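
Clustering at an identity threshold, as in the CD-HIT entry, is commonly done greedily: sequences are processed longest first, and each one either joins the first cluster whose representative it matches or starts a new cluster. The sketch below follows that scheme under a strong simplification: `identity` is a hypothetical position-by-position comparison with no alignment, far cruder than what CD-HIT computes.

```python
def identity(a, b):
    """Hypothetical stand-in: matching positions divided by the shorter length."""
    shorter = min(len(a), len(b))
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter if shorter else 0.0


def greedy_cluster(seqs, threshold=0.85):
    """Greedy incremental clustering, longest sequence first."""
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough, so start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTT", "ACGTACGT"]
clusters = greedy_cluster(seqs)
print([rep for rep, _ in clusters])
```

Keeping one representative per cluster is what yields the non-redundant reference set the entry describes.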
cell2location

cell2location — Bayesian spatial deconvolution mapping fine-grained cell types from scRNA-seq references onto spatial transcriptomics data. Estimates absolute cell type abundance per location via Pyro/scvi-tools hierarchical model. Two-stage workflow: NB regression for reference signatures (RegressionModel), then Bayesian spatial mapping (Cell2location). Supports 10X Visium, Slide-seq, MERFISH. Use when deconvolving spatial spots, mapping cell types to tissue, or running colocation analysis on spatial transcriptomics experiments.

Thin
0
CellBender

CellBender removes ambient RNA contamination and technical noise from droplet-based single-cell RNA-seq data using a deep generative model (VAE). Processes raw 10x Genomics feature-barcode matrices to distinguish true cell signal from empty droplet background. Use before Scanpy, Seurat, or scvi-tools preprocessing. Handles remove-background, empty droplet detection, and posterior regularization. Supports GPU acceleration via PyTorch. Outputs cleaned HDF5 files compatible with standard scRNA-seq pipelines.

Thin
0
CellChat

CellChat — R package for inference, analysis, and visualization of cell-cell communication networks from single-cell RNA-seq data. Uses a curated ligand-receptor database (CellChatDB) to quantify signaling probability across cell groups. Supports secreted signaling, cell-cell contact, and extracellular matrix interactions. Key capabilities: communication probability inference, signaling pathway analysis, network visualization, comparison across conditions. Works with Seurat and SingleCellExperiment objects. Triggers: CellChat, cell-cell communication, ligand-receptor, intercellular signaling, CellChatDB, LIANA, NicheNet comparison.

Thin
0
CellOracle

CellOracle — Python toolkit for gene regulatory network (GRN) inference and in silico transcription factor (TF) perturbation simulation from single-cell data. Builds cluster-specific GRNs by integrating scRNA-seq expression with chromatin accessibility (scATAC-seq/Cicero) base GRNs, then simulates TF knockout or overexpression to predict cell fate transitions. Key capabilities: Oracle object for GRN fitting, Links class for network analysis, vector field visualization of simulated cell state shifts. Use for cell reprogramming design, lineage analysis, and identifying driver TFs in differentiation.

Thin
0
CellPhoneDB

CellPhoneDB — repository of curated receptors, ligands, and their interactions for cell-cell communication analysis from single-cell RNA-seq data. Performs permutation-based statistical analysis (statistical_analysis) or DEG-based analysis (degs_analysis) to identify significant ligand-receptor interactions between cell types. Supports microenvironment constraints, multi-subunit complex handling, and produces interaction tables ranked by specificity. Use when inferring intercellular signaling from scRNA-seq, analyzing cell communication networks, or comparing cell-type-specific interactions.

Thin
0
CellProfiler

CellProfiler — open-source image analysis software for quantitative measurement of cell phenotypes from microscopy images. Supports high-content screening, cell segmentation (IdentifyPrimaryObjects, IdentifySecondaryObjects), feature extraction (intensity, morphology, texture), pipeline-based workflows (.cppipe), batch processing, and integration with CellProfiler Analyst for machine learning classification. Works with TIFF, PNG, and common microscopy formats via Bio-Formats.

Thin
0
Cell Ranger

Cell Ranger — 10x Genomics official analysis pipeline for Chromium single-cell and spatial data. Performs FASTQ generation, alignment (STAR-based), barcode processing, UMI counting, V(D)J assembly, Feature Barcode analysis, and secondary analysis (PCA, clustering, t-SNE, UMAP, differential expression). Produces filtered and raw feature-barcode matrices, web summaries, Loupe Browser files, and BAM files from 10x Chromium 3' and 5' Gene Expression, Immune Profiling, and Feature Barcode libraries.

Thin
0
CellTypist

CellTypist — automated cell type annotation for scRNA-seq using logistic regression with pre-trained immune models, majority voting refinement, multi-label classification, custom model training, cross-species conversion, GPU acceleration, and CLI/Python API for batch AnnData annotation.

Thin
0
CellTypist

celltypist — automated cell type annotation for single-cell RNA-seq data using pre-trained logistic regression models from a curated model zoo. Use for: cell type classification, immune cell annotation, tissue-specific cell typing, majority-voting cluster annotation, custom model training, and model zoo management. Key terms: celltypist.annotate, celltypist.train, Model.load, download_models, get_all_models, AnnotationResult, predicted_labels, probability_matrix, majority_voting, Immune_All_High, Immune_All_Low, single-cell annotation, transfer learning, scRNA-seq.

Thin
0
cellxgene

cellxgene — interactive single-cell data explorer from the Chan Zuckerberg Initiative. Launch a local browser-based visualizer for any AnnData (.h5ad) file: pan-zoom UMAP/PCA plots, gene expression overlays, subset selection, and cluster annotation write-back. Also covers cellxgene Census, the CZI cloud data platform providing programmatic access to 33M+ cells across 700+ datasets via cellxgene-census Python SDK and TileDB-SOMA backend. Use to explore, annotate, and download single-cell datasets interactively or programmatically.

Thin
0
CZ CELLxGENE Census

Query the CZ CELLxGENE Census (61M+ cells) programmatically via TileDB-SOMA. Use when you need single-cell expression data across tissues, diseases, or cell types from the largest curated atlas. Supports in-memory AnnData queries, out-of-core streaming via axis_query(), PyTorch dataloaders for ML training, and scanpy integration. Best for population-scale cross-dataset analyses and reference atlas comparisons.

Thin
0
Centrifuge

Centrifuge is a rapid and memory-efficient metagenomic classifier for DNA sequences. Uses FM-index compression to classify reads against large reference databases including bacteria, archaea, viruses, and human. Supports paired-end and single-end reads, custom database building, Kraken-style reports, and abundance estimation. Use when classifying metagenomic or metatranscriptomic reads, building taxonomic profiles, or comparing metagenomic classifiers.

Thin
0
CERES

Use when correcting copy-number bias in CRISPR-Cas9 essentiality screens with CERES. Handles guide-level depletion data from Project Achilles or custom CRISPR knockout screens, integrating segmented copy number profiles to separate true gene dependency from cut-site artifacts. Also known as CERES, Computational Elimination of copy-number Effects in CRISPR screens. R package from the Broad Institute Cancer Data Science team (cancerdatasci).

Thin
0
Chai-1

Chai-1 -- multi-modal foundation model for molecular structure prediction from Chai Discovery. Predicts 3D structures of proteins, nucleic acids (DNA/RNA), small molecules, glycans, ions, and their complexes from sequence and SMILES input. Supports protein folding, protein-ligand docking, antibody structure prediction, and multimeric complex modeling. Python API and CLI via the chai-lab package with PyTorch backend.

Thin
0
Chanjo

Use when working with chanjo -- clinical genomics coverage analysis

Thin
0
CheckM

CheckM — assess the quality of microbial genomes recovered from metagenomes, single cells, and isolates. Estimates genome completeness and contamination using lineage-specific sets of single-copy marker genes via collocated gene analysis. Used for MAG quality control, genome bin validation, taxonomic placement, and MIMAG standard compliance assessment.
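
The marker-gene idea in this entry can be sketched in a few lines of Python. This is a deliberately simplified illustration, not CheckM's implementation: it scores each single-copy marker independently and ignores the lineage-specific collocated marker sets mentioned above. The function name `estimate_quality` and the marker IDs are hypothetical.

```python
from collections import Counter

def estimate_quality(expected_markers, observed_hits):
    """Simplified completeness/contamination from single-copy marker counts.

    expected_markers: marker IDs expected exactly once per genome.
    observed_hits: marker IDs found in the bin (repeats allowed).
    Returns (completeness %, contamination %).
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")
    counts = Counter(h for h in observed_hits if h in expected)
    # A marker seen at least once counts toward completeness.
    present = sum(1 for m in expected if counts[m] >= 1)
    # Every copy beyond the first counts toward contamination.
    extra_copies = sum(counts[m] - 1 for m in expected if counts[m] > 1)
    completeness = 100.0 * present / len(expected)
    contamination = 100.0 * extra_copies / len(expected)
    return completeness, contamination

# Hypothetical bin: m4 is missing, m2 is duplicated.
markers = ["m1", "m2", "m3", "m4"]
hits = ["m1", "m2", "m2", "m3"]
completeness, contamination = estimate_quality(markers, hits)
```

In this toy case three of four markers are present and one extra copy is found, so the sketch reports 75% completeness and 25% contamination.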

Thin
0
ChEMBL Database

ChEMBL Database — manually curated database of bioactive molecules with drug-like properties maintained by EMBL-EBI. Contains 2.4M compounds, 1.6M assays, 20M+ activity measurements, 15K+ targets, and approved drug data. Query via chembl_webresource_client Python client using Django-style ORM filters. Core API: new_client.molecule, .target, .activity, .drug, .mechanism, .similarity, .substructure. Use for compound search, bioactivity retrieval (IC50/Ki/EC50), target identification, SAR analysis, and drug repurposing studies.

Thin
0
CHESS

CHESS (Comparison of Hi-C Experiments using Structural Similarity) — Python command-line tool for quantitative comparison and automatic feature extraction of chromatin contact data using the structural similarity index (SSIM). Compares Hi-C matrices between conditions, species, or genotypes to identify regions with structural differences. Supports FAN-C, Juicer .hic, and cooler formats. Outputs SSIM scores, Z-scores, and signal-to-noise metrics. Includes automated feature extraction and cross-correlation clustering of differential regions.

Thin
0
UCSF ChimeraX

UCSF ChimeraX — next-generation molecular visualization and analysis tool for structural biology. Provides interactive 3D visualization of proteins, nucleic acids, cryo-EM density maps, and molecular assemblies. Supports structure superposition, density map fitting, sequence alignment, surface analysis, and Python/command-line scripting. Successor to UCSF Chimera with modern architecture, VR support, and large-data handling.

Thin
0
ChIPseeker

Use when working with ChIPseeker -- R/Bioconductor package

Thin
0
CHOPCHOP

CHOPCHOP — web-based and command-line tool for designing optimized CRISPR guide RNAs (sgRNA) and TALENs. Supports Cas9, Cas9 nickase, Cpf1/Cas12a, and Cas13 nuclease systems across 200+ genomes. Provides efficiency scoring (Doench 2014/2016, Xu 2015, Moreno-Mateos 2015), off-target analysis via Bowtie, primer design, and restriction site identification. Accepts gene IDs, genomic coordinates, or FASTA sequences as input.

Thin
0
Chopper

Chopper — fast Rust-based quality, length, and GC content filtering and trimming tool for long-read sequencing data (Oxford Nanopore, PacBio) in FASTQ format. Successor to NanoFilt with multithreaded processing. Supports four trimming strategies: fixed-crop, trim-by-quality, best-read-segment, and split-by-low-quality. Filters reads by minimum/maximum quality score, read length, and GC content. Reads from stdin or file, writes filtered FASTQ to stdout. Can filter contaminant reads against a reference FASTA.
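
The kind of per-read filtering described here (length, quality, GC content) can be illustrated with a small pure-Python sketch. It is not Chopper's code or CLI: the function names and thresholds are hypothetical, and averaging per-base error probabilities (rather than raw Phred values) is an assumption made for this example.

```python
import math

def mean_phred(quality, offset=33):
    """Mean read quality, averaged over per-base error probabilities."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality]
    return -10 * math.log10(sum(probs) / len(probs))

def gc_fraction(seq):
    """Fraction of G and C bases in the read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def keep_read(seq, quality, min_len=1, max_len=None,
              min_qual=0.0, min_gc=0.0, max_gc=1.0):
    """Return True if the read passes all length, quality, and GC filters."""
    if not seq or len(seq) != len(quality):
        return False
    if len(seq) < min_len or (max_len is not None and len(seq) > max_len):
        return False
    if mean_phred(quality) < min_qual:
        return False
    return min_gc <= gc_fraction(seq) <= max_gc

# Hypothetical read: 8 bases, every base at Phred 40 ('I' with offset 33).
seq, qual = "GGCCAATT", "IIIIIIII"
passes = keep_read(seq, qual, min_len=5, min_qual=10)
too_short = keep_read(seq, qual, min_len=20)
```

Checks are ordered from cheapest to most expensive, so a read that fails the length filter never has its quality string decoded.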

Thin
0
Chromap

Chromap — ultrafast chromatin accessibility and Hi-C read alignment tool for single-cell and bulk ATAC-seq, CUT&Tag, and Hi-C data. Performs alignment, deduplication, and fragment file generation in a single pass with low memory usage. Outputs fragment files compatible with ArchR, Signac, SnapATAC2, and Cell Ranger ATAC downstream workflows.

Thin
0
ChromHMM

Use when working with ChromHMM -- Java-based tool for learning

Thin
0
Chromosight

Chromosight — Python computer-vision tool for detecting chromatin loops, borders (TAD boundaries), hairpins, and other patterns in Hi-C contact maps using template matching with Pearson correlation. Reads cooler (.cool) files, outputs coordinates with confidence scores. Supports custom kernel generation, quantification of patterns across samples, and multi-threaded detection. Common in 3D genome analysis pipelines downstream of HiC-Pro, cooler, or pairtools.
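
Template matching with Pearson correlation, as named in this entry, can be sketched as follows. This is a minimal pure-Python illustration on a dense toy matrix, not Chromosight's implementation; the kernel values, the matrix, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def match_template(matrix, kernel):
    """Slide the kernel over the matrix; score each window by correlation."""
    kr, kc = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    scores = {}
    for i in range(len(matrix) - kr + 1):
        for j in range(len(matrix[0]) - kc + 1):
            window = [matrix[i + a][j + b] for a in range(kr) for b in range(kc)]
            scores[(i, j)] = pearson(flat_kernel, window)
    return scores

# Hypothetical loop-like kernel pasted into an empty 5x5 map at offset (1, 1).
kernel = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]
matrix = [[0] * 5 for _ in range(5)]
for a in range(3):
    for b in range(3):
        matrix[1 + a][1 + b] = kernel[a][b]

scores = match_template(matrix, kernel)
best = max(scores, key=scores.get)
```

The window that coincides with the pasted pattern correlates perfectly with the kernel, so `best` is the top-left corner `(1, 1)` of that window.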

Thin
0
chromVAR

chromVAR — R/Bioconductor package for determining variation in chromatin accessibility across sets of annotations or peaks. Designed for single-cell ATAC-seq (scATAC-seq) and sparse bulk ATAC-seq data, chromVAR computes per-cell bias-corrected deviation scores for transcription factor motifs, k-mers, or custom annotation sets. Uses a background peak sampling strategy matched on GC content and average accessibility to control for technical biases. Integrates with JASPAR motif databases via motifmatchr, supports t-SNE visualization of motif accessibility, and provides differential deviation and variability testing between cell populations.

Thin
0
Circos

Circos — Perl-based circular visualization tool for genomics data. Creates publication-quality circular plots for genome comparisons, synteny, GC content, gene density, chromosome ideograms, link/ribbon diagrams, heatmaps, and multi-track data overlays. Configuration-driven via .conf files with tabular data inputs. Standard tool for comparative genomics and structural variation visualization.

Thin
0
cisTEM

Use when working with cisTEM -- single-particle cryo-EM data

Thin
0
CIViC

CIViC (Clinical Interpretation of Variants in Cancer) — comprehensive knowledgebase for cancer genomic variant interpretation. Provides expert-curated clinical evidence linking cancer mutations to treatment response, prognosis, and diagnostic implications. GraphQL API for querying cancer variants, genes, evidence, assertions, and clinical interpretations. Essential for precision oncology, variant annotation pipelines, and cancer genomic research.

Thin
0
Clair3 -- Long-Read Germline Variant Calling

Clair3 -- deep learning-based germline variant caller for long-read sequencing (Oxford Nanopore, PacBio HiFi/CLR) that also accepts Illumina short reads. Calls SNPs and indels using a two-stage pileup + full-alignment neural network pipeline. Supports ONT R9/R10, PacBio CCS/HiFi/Revio, GVCF output, WhatsHap phasing integration, and Docker/Singularity deployment. Produces high-accuracy VCF from BAM + reference FASTA.

Thin
0
ClimaX

ClimaX — Vision Transformer foundation model for weather and climate science from Microsoft Research. Pretrained on CMIP6 via self-supervised learning, then fine-tuned for global forecasting (ERA5), regional forecasting, climate projection (ClimateBench), and downscaling. Uses variable tokenization and variable aggregation to handle heterogeneous atmospheric variables and resolutions. Powered by PyTorch Lightning for distributed training. Supports 5.625° and 1.40625° pretrained checkpoints from HuggingFace.

Thin
0
ClinVar Database

ClinVar Database — NCBI's public archive of human genetic variant clinical significance. Query via E-utilities REST API (esearch, esummary, efetch, elink) or download bulk data from FTP in XML, VCF, and tab-delimited formats. Interpret pathogenicity classifications (Pathogenic, Likely Pathogenic, VUS, Likely Benign, Benign) with ACMG/AMP criteria and review star ratings. Core workflow: search variants by gene/condition, retrieve clinical significance, check review status, annotate VCFs, resolve conflicting interpretations. Use for clinical genomics, variant annotation pipelines, pharmacogenomics, and genetic counseling support.

Thin
0
ClinVar Tools

Use when working with clinvar-tools (ClinVar tools) -- NCBI utilities

Thin
0
Clustal Omega

Clustal Omega — high-performance multiple sequence alignment tool for protein and DNA sequences using HMM-accelerated pairwise alignment, progressive alignment, and iterative refinement. Supports FASTA, Clustal, Stockholm, and phylip formats with both command-line and web interface workflows.

Thin
0
clusterProfiler

clusterProfiler — universal enrichment analysis tool for interpreting omics data using Gene Ontology, KEGG pathways, WikiPathways, Reactome, Disease Ontology, and custom gene sets. Implements over-representation analysis (ORA) via hypergeometric test and gene set enrichment analysis (GSEA) via fgsea permutation algorithm. Supports cross-species analysis for thousands of organisms, biological theme comparison across gene clusters, redundancy reduction via semantic similarity, and publication-ready visualization through enrichplot integration. Use when running GO enrichment, KEGG pathway analysis, GSEA from DESeq2/edgeR/limma results, comparing enrichment across clusters or conditions, or visualizing functional profiles with dotplot, cnetplot, emapplot, or ridgeplot. The standard functional enrichment tool in the Bioconductor ecosystem since 2012 with 18,000+ citations.
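
The hypergeometric test behind over-representation analysis can be written out directly. The sketch below is a plain-Python illustration of the statistic only, not clusterProfiler's API; the function name `ora_pvalue` and the example numbers are hypothetical.

```python
from math import comb

def ora_pvalue(universe_size, set_size, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    universe_size: number of background genes.
    set_size: genes in the pathway or GO term (within the background).
    n_selected: genes in the query list (e.g. differentially expressed).
    n_overlap: query genes that belong to the gene set.
    """
    total = comb(universe_size, n_selected)
    upper = min(set_size, n_selected)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 20 background genes, a 5-gene set,
# 4 selected genes of which 2 fall in the set.
p = ora_pvalue(universe_size=20, set_size=5, n_selected=4, n_overlap=2)
```

For this toy input the tail sums to 1205 of 4845 possible draws, a p-value of about 0.249. In practice the p-values for many gene sets would also need multiple-testing correction.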

Thin
0
cmocean

cmocean — perceptually uniform colormaps for oceanographic data visualization. Provides 22 colormaps (thermal, haline, solar, ice, deep, dense, algae, matter, turbid, speed, amp, tempo, rain, phase, topo, balance, delta, curl, diff, tarn, oxy, gray) designed for sea temperature, salinity, bathymetry, chlorophyll, oxygen, currents, and wave data. Integrates with matplotlib, cartopy, and xarray. Colorblind-friendly and grayscale-compatible.

Thin
0
cmprsk

cmprsk -- Subdistribution Analysis of Competing Risks. R package providing non-parametric cumulative incidence estimation via cuminc() with Gray's K-sample test for group comparisons, Fine & Gray proportional subdistribution hazards regression via crr() with time-varying covariates, cumulative incidence function extraction at arbitrary timepoints via timepoints(), predicted subdistribution curves via predict.crr(), and plotting methods for both non-parametric estimates (plot.cuminc) and regression predictions (plot.predict.crr). Foundational package for competing risks survival analysis, implementing the methods of Gray (1988) and Fine & Gray (1999). Depends on the survival package. CRAN Task Views: Survival, Epidemiology.

Thin
0
CNVkit -- Copy Number Variant Detection

CNVkit -- Python toolkit for detecting copy number variants (CNVs) from targeted DNA sequencing data (hybrid capture, amplicon) and whole-genome sequencing (WGS). Leverages both on-target and off-target reads for uniform coverage analysis. Provides batch pipeline, reference building, segmentation (CBS, HMM, Fused Lasso), copy number calling (integer, clonal, thresholding), scatter/diagram plots, VCF/BED export, and sex-chromosome inference. Standard tool in somatic and germline CNV detection pipelines.

Thin
0
CNVnator

CNVnator — read-depth (RD) based copy number variation (CNV) detection from whole-genome sequencing BAM/CRAM files. Partitions the genome into equal-size bins, computes normalized read-depth signals, applies mean-shift partitioning and t-test/KS-test statistics to identify deletions and duplications. Designed for germline large-CNV discovery (>1 kb) in diploid genomes. Required dependency: CERN ROOT framework. Use when detecting CNVs from WGS read depth.

Thin
0
Cobolt

Cobolt — Bayesian multi-modal variational autoencoder for joint analysis of single-cell multi-omics data, especially paired snRNA-seq and snATAC-seq. Learns a shared latent representation across gene expression and chromatin accessibility modalities for integrated clustering, batch correction, and downstream analysis. Built on PyTorch, operates on AnnData/MuData objects with a fit() / get_latent() API pattern. Part of the multi-omics integration ecosystem alongside MultiVI, MOFA+, and Seurat WNN.

Thin
0
ColabFold

ColabFold — fast protein structure prediction combining AlphaFold2 with MMseqs2 for rapid MSA generation. Predict monomer and multimer structures, generate multiple sequence alignments, run batch predictions, use custom PDB templates, and apply Amber relaxation. Produces PDB structure files, pLDDT confidence scores, PAE plots, and MSA coverage visualizations. Available as Google Colab notebooks or local CLI (colabfold_batch). Use when predicting protein structures, analyzing protein complexes, or running high-throughput structure prediction pipelines.

Thin
0
COLOC -- Bayesian Colocalization of Genetic Associations

COLOC (coloc) -- R package for Bayesian colocalization analysis of genetic associations. Tests whether two traits share a causal variant at a genomic locus using GWAS summary statistics. Implements coloc.abf() for approximate Bayes factor colocalization under five hypotheses (H0-H4), and coloc.susie() for multi-causal-variant colocalization via SuSiE fine-mapping. Used in GWAS interpretation, eQTL-GWAS integration, and functional genomics pipelines.
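
How per-variant evidence is combined into posteriors for the five hypotheses (H0-H4) can be sketched in Python. This is an illustration of the single-causal-variant calculation, not the R package's code: it assumes the per-SNP log approximate Bayes factors are already computed, the function names are hypothetical, and the prior values `p1`, `p2`, and `p12` are assumptions chosen for the example.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-SNP log ABFs of two traits."""
    joint = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = logsumexp(labf1), logsumexp(labf2), logsumexp(joint)
    # H3 (two distinct causal variants): all SNP pairs minus the same-SNP pairs.
    if s1 + s2 <= s12:
        pairs = float("-inf")  # e.g. a single SNP leaves no distinct pair
    else:
        pairs = (s1 + s2) + math.log1p(-math.exp(s12 - (s1 + s2)))
    log_h = [
        0.0,                                  # H0: no association
        math.log(p1) + s1,                    # H1: trait 1 only
        math.log(p2) + s2,                    # H2: trait 2 only
        math.log(p1) + math.log(p2) + pairs,  # H3: distinct causal variants
        math.log(p12) + s12,                  # H4: shared causal variant
    ]
    denom = logsumexp(log_h)
    return [math.exp(v - denom) for v in log_h]

# Hypothetical locus with three SNPs.
shared = coloc_posteriors([10.0, 0.0, 0.0], [10.0, 0.0, 0.0])
distinct = coloc_posteriors([10.0, 0.0, 0.0], [0.0, 10.0, 0.0])
```

When both traits peak at the same SNP, H4 takes most of the posterior mass; when they peak at different SNPs, H3 is favored instead.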

Thin
0
Comet

Comet — open-source tandem mass spectrometry (MS/MS) sequence database search engine for peptide identification. Searches MS/MS spectra against FASTA protein databases producing pepXML, mzIdentML, SQT, TSV, and Percolator PIN output. Supports mzML, mzXML, ms2, and Thermo RAW input. Multithreaded, cross-platform, configurable via comet.params. Used in shotgun proteomics, DDA analysis, and peptide-spectrum matching workflows.

Thin
0
Compare Containerization

Compare container images for bioinformatics tools across BioContainers (quay.io/biocontainers), DockerHub, and Galaxy Singularity depot. Returns available versions, image sizes, pull commands for Docker and Singularity/Apptainer, registry priority recommendations, and workflow integration snippets for Nextflow, Snakemake, WDL, and CWL. Validates container availability, checks digest pinning for reproducibility, and generates provenance manifests for audit trails.

Thin
0
ComplexHeatmap

ComplexHeatmap — R/Bioconductor package for creating highly customizable heatmaps with complex annotations. Supports row/column splitting, multiple annotation tracks (bar, box, point, histogram), OncoPrint for cancer genomics mutation landscapes, UpSet plot intersections, density heatmaps, and heatmap list composition via + and %v% operators. Built on the grid graphics system for precise layout control of publication-ready figures.

Thin
0
CONCOCT

CONCOCT — metagenomic binning tool that clusters contigs into genomes using sequence composition (k-mer frequencies) and coverage across multiple samples. Combines PCA dimensionality reduction with Gaussian mixture models for unsupervised binning of assembled metagenomes. Trigger on: CONCOCT, metagenomic binning, contig binning, metagenome-assembled genomes, MAG recovery, coverage binning, composition binning, metagenomic clustering.

Thin
0
Conda/Mamba

Conda and Mamba package/environment management for bioinformatics. Create, export, and reproduce isolated software environments using conda or mamba (fast C++ solver). Manage Bioconda channels, resolve dependency conflicts, convert environments to containers, and pin versions for reproducible pipelines. Essential for any NGS, genomics, or computational biology workflow setup.

Thin
0
ConsensusClusterPlus

ConsensusClusterPlus — R/Bioconductor package for determining cluster count and membership by stability evidence in unsupervised analysis, implementing the Monti et al. (2003) consensus clustering algorithm with extensions for subsampling-based resampling, multiple clustering algorithms (hierarchical, k-means, PAM), consensus matrix visualization (heatmaps, CDF, delta area), item-consensus and cluster-consensus metrics (calcICL), and custom distance and clustering function hooks for integration with TCGA cancer subtyping, single-cell clustering, flow cytometry, and other high-dimensional biological data analysis workflows.

Thin
0
cooler

cooler — file format and Python library for Hi-C chromatin contact data. Stores sparse contact matrices in .cool (single-resolution) and .mcool (multi-resolution) HDF5 files. Supports ingestion from pairs/tabular data, ICE balancing (matrix normalization), coarsening to multi-resolution, and random-access pixel queries. Core of the Open2C Hi-C ecosystem alongside pairtools, cooltools, and higlass. Use for Hi-C, Micro-C, ChIA-PET, and any 3D genome contact map analysis.

Thin
0
cooltools

cooltools — Python toolkit for Hi-C and chromosome conformation capture analysis in the Open2C ecosystem. Analyzes .cool format contact maps to compute A/B compartments (eigenvectors), insulation scores and TAD boundaries, dot/loop calling, expected contact decay (P(s) curves), pileup averaging, saddle plots, and coverage. Used for all major 3D genome structure analyses including Hi-C, Micro-C, SPRITE, and PLAC-seq data. Requires pre-balanced cooler files from cooler or distiller pipelines.

Thin
0
CopyKAT

CopyKAT (Copy number Karyotyping of Tumors) -- R package for inferring genome-wide aneuploidy and copy number variations from single-cell RNA-seq data. Uses Bayesian segmentation to distinguish tumor (aneuploid) from normal (diploid) cells without requiring a predefined reference set. Identifies tumor subclones via hierarchical clustering of copy number profiles at 5 Mb resolution. Supports human (hg20/hg38) and mouse (mm10) genomes with parallel computation. Core tool for cancer single-cell genomics.

Thin
0
Corset

Corset — C++ tool for clustering de novo assembled transcripts into gene-level groups and producing gene-level counts for differential expression analysis. Groups contigs from Trinity, Trans-ABySS, or Oases assemblies using multi-mapped read information from bowtie/bowtie2. Outputs clusters.txt (contig-to-cluster mapping) and counts.txt (gene-level read counts).

Thin
0
COSMOS

Use when inferring mechanistic causal links across multiple omics layers with COSMOS (Causal Oriented Search of Multi-Omics Space). Integrates metabolomics, transcriptomics, and signaling pathway data using prior knowledge networks from OmniPath and integer linear programming via CARNIVAL. Scores TF activities with DOROTHEA and pathway activities with PROGENy to trace causal paths from metabolite signals through signaling intermediates to gene expression changes. R/Bioconductor package from the saezlab (Julio Saez-Rodriguez group). Also known as cosmosR.

Thin
0
coxph

coxph -- Cox Proportional Hazards Regression via R's survival package. Fits semiparametric Cox models with coxph(), creates survival objects with Surv(), estimates survival curves with survfit(), tests the proportional hazards assumption with cox.zph(), computes concordance statistics with concordance(), compares survival curves with survdiff(), supports stratified models via strata(), clustered data via cluster(), frailty models via frailty(), penalized regression via ridge(), time-varying covariates via tmerge() and time-dependent coefficients, competing risks, multi-state models, and prediction via predict.coxph(). Provides basehaz() for baseline hazard extraction, anova.coxph() for nested model comparison, residuals (martingale, deviance, dfbeta, Schoenfeld, score) for diagnostics, and AIC/BIC for model selection. The survival package ships with standard R distributions as a recommended package and is the foundation for clinical trial analysis, epidemiological cohort studies, and reliability engineering.

Thin
0
CPA

Use when working with CPA, Compositional Perturbation Autoencoder, single-cell perturbation modeling, drug-response latent spaces, out-of-distribution prediction for unseen drug combinations, dose-response estimation, or AnnData datasets annotated with perturbation, dose, split, and covariate metadata. Trigger on: facebookresearch/CPA, cpa.train, cpa.api.API, perturb-seq response modeling, combinatorial perturbation prediction, uncertainty for unseen perturbations, and preparing `.h5ad` inputs for CPA training or inference.

Thin
0
CPSR

CPSR (Cancer Predisposition Sequencing Reporter) — Python tool for clinical interpretation of germline variants in cancer predisposition genes. Generates interactive HTML reports with ACMG/AMP variant classification, ClinVar annotations, gnomAD population frequencies, and actionable findings for hereditary cancer risk assessment. Built on PCGR framework, supports VCF input with configurable gene panels and virtual panels for targeted analysis.

Thin
0
CRAM Tools

Use when working with cram-tools (CRAM tools) -- reference-based compression

Thin
0
CRISPRcleanR

Use when working with CRISPRcleanR — an R package for unsupervised correction of copy-number-driven biases and off-target effects in CRISPR pooled knock-out screens. Normalizes sgRNA read counts, removes copy-number bias using a segmentation model, and re-scores gene essentiality. Works with genome-wide and focused Cas9 dropout screens from Brunello, TKOv3, or custom libraries. Outputs corrected fold-changes and quality metrics for downstream MAGeCK, BAGEL2, or drugZ analyses. Used in cancer functional genomics and essential gene discovery pipelines.

Thin
0
CRISPResso2

CRISPResso2 — Python tool for quantifying CRISPR genome editing outcomes from amplicon sequencing data. Analyzes NHEJ (insertions, deletions), HDR, and mixed repair outcomes. Supports single amplicon (CRISPResso), batch (CRISPRessoBatch), pooled amplicon (CRISPRessoPooled), WGS (CRISPRessoWGS), and comparison (CRISPRessoCompare) modes. Generates publication-quality plots of editing frequencies, indel distributions, and allele tables.

Thin
0
CRISPRscan

CRISPRscan — sequence-based scoring algorithm for predicting CRISPR/Cas9 sgRNA on-target efficiency. Uses a linear regression model trained on zebrafish injection data to score 20-mer guide sequences with PAM context. Provides hexamer enrichment scoring, nucleotide position weights, and guide ranking for CRISPR experiment design. Use for sgRNA scoring, guide selection, CRISPR library design, and comparing guide scoring methods.

Thin
0
cryoDRGN

Use when working with cryoDRGN -- deep variational autoencoder

Thin
0
cryoSPARC

cryoSPARC — cryo-EM single particle analysis platform for high-resolution 3D structure determination. Provides motion correction, CTF estimation, particle picking (blob, template, Topaz), 2D classification, ab-initio reconstruction, heterogeneous/homogeneous/non-uniform refinement, 3D variability analysis, and local resolution estimation. Accessible via web GUI and cryosparc-tools Python API for programmatic job management.

Thin
0
csvtk

Use when working with csvtk -- cross-platform, ultrafast CSV/TSV

Thin
0
Cufflinks

Cufflinks — RNA-seq transcript assembly, quantification, and differential expression suite. Assembles aligned reads into transcripts (cufflinks), performs differential expression analysis at transcript resolution (cuffdiff), compares assemblies (cuffcompare), merges annotations (cuffmerge), pre-computes expression levels (cuffquant), and normalizes expression tables (cuffnorm). Part of the Tuxedo pipeline with TopHat/HISAT2 and CummeRbund. Essential for reproducing published RNA-seq analyses and isoform-level differential expression.

Thin
0
Cumulus

Cumulus — cloud-based framework for large-scale single-cell and single-nucleus RNA-seq analysis on Terra/Google Cloud. Provides WDL workflows for demultiplexing (demuxEM, souporcell, demuxlet), count matrix generation (Cell Ranger, STARsolo, Kallisto BUStools), and downstream analysis via the Pegasus Python library (clustering, visualization, differential expression, batch correction). Use for Terra/Cromwell workflow execution, multi-sample demultiplexing, and cloud-scale scRNA-seq pipelines processing millions of cells.

Thin
0
cuteSV

cuteSV -- sensitive and scalable long-read-based structural variation (SV) detection from PacBio CLR, PacBio CCS/HiFi, and Oxford Nanopore Technology (ONT) sequencing data. Detects deletions, insertions, inversions, duplications, and translocations from sorted BAM files aligned to a reference genome. Uses a clustering-and-refinement approach to identify SVs with high sensitivity. Also known as cute-SV. Produces standard VCF output with optional genotyping.

Thin
0
CUT&RUNTools 2.0

CUT&RUNTools 2.0 — end-to-end analysis pipeline for CUT&RUN and CUT&TAG chromatin profiling experiments. Handles adapter trimming (Trimmomatic), spike-in normalization, Bowtie2 alignment with size-selection filtering, peak calling (MACS2 for transcription factors, SEACR for broad marks), fragment-size QC, motif analysis (MEME-ChIP), and single-cell CUT&RUN support. Use for histone modification profiling, transcription factor occupancy, chromatin accessibility, and CUT&TAG data from bulk or single-cell experiments. Successor to CUT&RUNTools 1.x with JSON-based configuration, improved sensitivity, and automated ENCODE-compliant QC.

Thin
0
CWL (Common Workflow Language)

CWL (Common Workflow Language) — open standard for describing portable, reproducible computational workflows. Covers cwltool reference runner, CWL v1.2 spec, CommandLineTool and Workflow classes, Docker/Singularity containers, scatter/gather parallelism, and workflow validation. Use for writing, running, debugging, or porting CWL workflows.

Thin
0
CyTOF Workflow (CATALYST)

CyTOF workflow — differential discovery in high-dimensional mass cytometry data using the CATALYST R/Bioconductor package. Covers the full CyTOF analysis pipeline: bead-based normalization, single-cell deconvolution, spillover compensation, FlowSOM clustering, dimension reduction (UMAP, tSNE), differential abundance (DA) and differential state (DS) analysis with visualization. Use when users need to analyze CyTOF, mass cytometry, FACS, or IMC data, perform cytometry clustering, or run differential discovery workflows.

Thin
0
CytoTRACE -- Predicting Differentiation State from scRNA-seq

CytoTRACE -- computational method for predicting relative differentiation state of cells from single-cell RNA-seq data. Uses gene counts (number of detectably expressed genes per cell) as a robust proxy for developmental potential. Scores cells 0-1 (0=differentiated, 1=undifferentiated) via gene count signature correlation, NNLS regression, and Markov diffusion. Available as R package and web interface. Works across tissue types, platforms, and species without prior knowledge of lineage.
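
The "gene counts" proxy in this entry is easy to make concrete. The sketch below only counts detectably expressed genes per cell and applies a plain min-max rescaling for illustration; it does not reproduce CytoTRACE's gene count signature, NNLS regression, or diffusion steps. The function names and the toy expression values are hypothetical.

```python
def gene_counts(expression, threshold=0.0):
    """Number of genes with expression above the threshold, per cell.

    expression: dict mapping cell ID to a list of per-gene values.
    """
    return {
        cell: sum(1 for value in values if value > threshold)
        for cell, values in expression.items()
    }

def rescale_01(scores):
    """Min-max rescale to [0, 1]; all zeros if every score is equal."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cell: 0.0 for cell in scores}
    return {cell: (v - lo) / (hi - lo) for cell, v in scores.items()}

# Hypothetical cells with four genes each.
expression = {
    "cell_a": [5, 0, 2, 1],  # 3 genes detected
    "cell_b": [0, 0, 3, 1],  # 2 genes detected
    "cell_c": [0, 0, 0, 9],  # 1 gene detected
}
counts = gene_counts(expression)
proxy = rescale_01(counts)
```

Under the entry's convention (0 = differentiated, 1 = undifferentiated), `cell_a` with the most detected genes lands at 1.0 and `cell_c` at 0.0. Note that `cell_c` has the highest total expression but the lowest gene count, which is the distinction the proxy relies on.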

Thin
0
cyvcf2

cyvcf2 — fast Cython-wrapped htslib library for reading, writing, and querying VCF/BCF variant files in Python. Provides numpy-backed genotype arrays (gt_types, gt_ref_depths, gt_alt_depths, gt_quals, gt_phases), region-based queries, INFO/FORMAT field extraction, and a Writer class for programmatic VCF creation. Essential for high-performance variant analysis pipelines in Python.

Thin
0
DADA2

DADA2 — high-resolution amplicon sequence variant (ASV) inference from Illumina, 454, and Ion Torrent amplicon sequencing data. R/Bioconductor package that models sequencing errors to resolve exact biological sequences at single-nucleotide resolution, replacing 97% OTU clustering. Complete pipeline from demultiplexed FASTQ through quality filtering, error learning, sample inference, paired-end merging, chimera removal, and taxonomy assignment. Produces ASV abundance tables compatible with phyloseq, vegan, DESeq2, and QIIME 2.

Thin
0
DAGitty

DAGitty — graphical analysis of causal models using directed acyclic graphs (DAGs). Create DAGs with dagitty(), identify minimal adjustment sets for unbiased causal effect estimation (adjustmentSets), derive testable conditional independence implications (impliedConditionalIndependencies), find instrumental variables for endogeneity (instrumentalVariables), test model fit against data (localTests), simulate data from structural equation models (simulateSEM, simulateLogistic), and visualize causal structures (graphLayout, plot.dagitty). R package wrapping JavaScript algorithms; available on CRAN and via dagitty.net web interface.

Thin
0
DAS Tool

DAS Tool — genome-resolved metagenomics bin refinement tool that integrates results from multiple binning algorithms to produce an optimized, non-redundant set of metagenome-assembled genomes (MAGs). Uses single-copy gene analysis with dereplication, aggregation, and scoring to select the highest-quality bin per genome. Accepts contig-to-bin mapping tables from MetaBAT2, MaxBin2, CONCOCT, VAMB, SemiBin, or any binner. Outputs refined bin assignments, quality scores, and optional FASTA bin files. Use for MAG recovery, bin refinement, bin dereplication, and consensus binning in shotgun metagenomics pipelines.

Thin
0
Dash Bio -- Interactive Bioinformatics Visualization

Dash Bio -- interactive bioinformatics visualization components built on Plotly Dash. Provides domain-specific charts including Circos plots, ideograms, needle plots (lollipop/mutation diagrams), alignment charts, OncoPrint heatmaps, volcano plots, clustergrams, Manhattan plots, RNA secondary structure viewers (FornaContainer), sequence viewers, and 2D/3D molecular structure viewers. Use for building interactive web dashboards for genomics, proteomics, and structural biology data exploration.

Thin
0
datamash

Use when working with datamash -- GNU command-line tool for

Thin
0
data.table

Use when working with data.table — the high-performance R package for fast data manipulation, aggregation, and file I/O on large datasets. data.table extends data.frame with a concise DT[i, j, by] syntax for filtering, computing, and grouping. Use for fread/fwrite of large CSV/TSV/genomic files, fast joins (including rolling and non-equi joins), in-place column updates via :=, reshape (melt/dcast), and set operations. Especially suited for genomics, GWAS, proteomics, survival analysis, and any workflow with multi-GB tabular data in R. Also known as R data.table, DT, dt package, fread, fwrite.

Thin
0
DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) — fast C++ implementation of density-based clustering algorithms in R. Includes DBSCAN, HDBSCAN, OPTICS/OPTICSXi, LOF outlier detection, GLOSH outlier scoring, shared-nearest-neighbor clustering, Jarvis-Patrick clustering, and FOSC optimal selection. Uses kd-tree acceleration for neighbor searches, providing performance superior to fpc::dbscan, scikit-learn, WEKA, and ELKI for Euclidean distances. Standard density-based clustering tool in R for spatial data analysis, single-cell genomics, ecology, anomaly detection, and unsupervised learning workflows.

Thin
0
deconstructSigs

deconstructSigs — R package for decomposing tumor somatic mutation patterns into known mutational signature contributions. Converts SNV calls to a 96-trinucleotide context matrix, then applies constrained least squares to determine the weighted combination of COSMIC SBS signatures (v2 or v3) that best explains each sample. Used for cancer genomics signature attribution, clinical HRD/APOBEC assessment, and single-sample or cohort-level signature exposure analysis. Input: somatic SNVs (MAF/VCF/data frame); output: per-sample signature weights summing to 1. Integrates with BSgenome and Bioconductor.
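
The "96-trinucleotide context" count follows from 6 pyrimidine-centered substitution types times 4 possible 5' bases times 4 possible 3' bases. The Python sketch below enumerates those channels and normalizes per-channel counts to fractions; it illustrates the input representation only, not the signature fitting step or the R package's API. The label format and function names are assumptions made for this example.

```python
from itertools import product

BASES = "ACGT"
# Substitutions are written from the pyrimidine (C or T) of the mutated base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def trinucleotide_channels():
    """Labels such as 'A[C>A]A' for every 5' base, substitution, and 3' base."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, BASES)
    ]

def normalize(counts):
    """Scale per-channel mutation counts to fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mutations to normalize")
    return {channel: n / total for channel, n in counts.items()}

channels = trinucleotide_channels()
# Hypothetical sample with four SNVs in two channels.
fractions = normalize({"A[C>A]A": 3, "A[C>T]G": 1})
```

The enumeration yields 6 x 16 = 96 distinct channels, and each sample becomes one row of channel fractions in the context matrix.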

Thin
0
Decoupler -- Biological Activity Inference from Omics Data

Decoupler -- Python framework for inferring biological activities from omics data. Estimates transcription factor (TF) activities, pathway activities, and ligand-receptor interactions from gene expression using multiple statistical methods (ULM, MLM, WSUM, VIPER, AUCell, GSEA, ORA, consensus). Works with AnnData objects and integrates with scanpy, scverse, and OmniPath for prior knowledge networks. Supports bulk and single-cell RNA-seq activity inference with dc.run_ulm(), dc.run_mlm(), dc.get_acts(), dc.decouple().

Thin
0
DeepCell

DeepCell — deep learning library for single-cell analysis of biological images using TensorFlow. Provides pretrained models for cell segmentation including Mesmer (multiplexed tissue imaging), nuclear segmentation, and cytoplasm segmentation. Supports whole-cell and nuclear segmentation from fluorescence microscopy, multiplexed ion beam imaging (MIBI), cyclic immunofluorescence (CyCIF), CODEX, and other multiplexed imaging platforms. Use for cell segmentation, instance segmentation, cell tracking, and image preprocessing.

Thin
0
DeepChem

DeepChem — Python framework for deep learning in drug discovery, materials science, quantum chemistry, and biology. Provides molecular featurizers (ECFP, graph convolutions, Coulomb matrices), pre-built models (GraphConvModel, AttentiveFPModel, GCNModel, MPNN, SchNet), MoleculeNet benchmarks, and dataset loaders for molecular property prediction, virtual screening, de novo generation, and reaction prediction tasks.

Thin
0
DeepConsensus

DeepConsensus — Google's gap-aware sequence transformer for improving PacBio CCS (HiFi) read accuracy post-basecalling. Use for: polishing PacBio long reads, reducing insertion/deletion errors in CCS data, improving Q-score of HiFi reads, preparing high-accuracy long reads for assembly or variant calling. Key terms: DeepConsensus, PacBio CCS, HiFi polishing, long-read accuracy, subreads, actc alignment, checkpoint model, deepconsensus run, gap-aware transformer, Q20, Q30, read quality improvement, pbccs, ccs bam, subreads_to_ccs.

Thin
0
DeepSignal

DeepSignal — deep learning-based 5mC CpG methylation detection from Oxford Nanopore sequencing raw signals. Uses a BiLSTM+Inception neural network trained on bisulfite-seq-labeled data to call methylation at single-read resolution from FAST5 files. Supports model training, feature extraction, and methylation calling for nanopore methylation analysis pipelines. Use for nanopore CpG methylation calling, signal feature extraction, or training custom methylation models.

Thin
0
DeepVariant

DeepVariant — CNN-based variant caller that converts aligned reads into pileup image tensors and classifies them with a deep neural network to produce SNP and indel calls in VCF/gVCF format. Supports Illumina WGS/WES, PacBio HiFi, Oxford Nanopore, and hybrid sequencing data with pre-trained models for each platform. Won PrecisionFDA Truth Challenge V2 (2020) across multiple categories.

Thin
0
DELLY

DELLY — Structural variant discovery tool that uses paired-end and split-read analysis to detect deletions, tandem duplications, inversions, translocations, and copy-number variants from short-read and long-read sequencing data. Supports germline multi-sample calling, somatic tumor-control analysis, CNV segmentation, and pan-genome graph-based SV detection. Use when user asks about structural variant calling, SV genotyping, copy-number analysis, or tumor-control paired analysis from BAM/CRAM files.

Thin
0
Dendroscope

Dendroscope — interactive viewer for large phylogenetic trees and rooted networks. Handles hundreds of thousands of taxa with smooth navigation. Supports Newick, Nexus, and phyloXML input; exports PNG, SVG, PDF, and EPS. Provides rectangular, circular, radial, and triangular tree layouts; consensus tree computation; tanglegram comparison; and batch-mode scripting for automated figure generation. Use when visualizing, comparing, or exporting phylogenetic trees from RAxML, IQ-TREE, FastTree, BEAST, or ASTRAL output.

Thin
0
densMAP

densMAP — density-preserving dimensionality reduction for visualization. Augments UMAP with a density-correlation regularizer that preserves local density information from high-dimensional data in the low-dimensional embedding. Computes local radii estimates and uses them as regularization during stochastic gradient descent optimization. Particularly valuable for single-cell RNA-seq where cell density encodes biological state abundance. Accessed via the densmap=True parameter in the umap-learn package.

Thin
0
DepMap/Chronos

DepMap/Chronos — Bayesian algorithm for inferring gene fitness effects from CRISPR knockout screen readcount data. Separates true gene knockout effects from copy-number artifacts, guide efficacy variation, and screen quality differences. Supports multi-library integration, copy-number correction, hit-calling with FDR control, and condition comparison experiments. Core algorithm behind the Cancer Dependency Map (DepMap) project at the Broad Institute. Requires TensorFlow 2.x.

Thin
0
DEXSeq

DEXSeq -- R/Bioconductor package for testing differential exon usage (DEU) from RNA-seq data. Uses a generalized linear model framework (negative binomial distribution) to detect exons whose relative usage changes between experimental conditions, enabling discovery of alternative splicing events. Requires flattened exon counting bins from DEXSeq-specific annotation preparation and HTSeq-based counting. Integrates with DESeq2 dispersion estimation and produces per-exon p-values, fold changes, and visualization plots.

Thin
0
DGL-LifeSci

DGL-LifeSci — deep graph learning toolkit for molecular property prediction, drug-target interaction (DTI), and reaction prediction. Converts SMILES strings to DGL molecular graphs, then applies GNN architectures (GCN, GAT, MPNN, AttentiveFP) for regression or multi-task classification. Use for: MoleculeNet benchmarks, QSAR/QSPR modeling, graph-level property prediction, virtual screening, binding affinity prediction, and drug-likeness scoring. Key terms: smiles_to_bigraph, CanonicalAtomFeaturizer, GCNPredictor, GATPredictor, AttentiveFP, MPNNPredictor, MoleculeCSVDataset, dgllife, molecular graph, graph neural network, GNN, molecular property prediction, ESOL, Tox21, HIV, BBBP, FreeSolv, Lipophilicity, BACE.

Thin
0
DIABLO

DIABLO (Data Integration Analysis for Biomarker discovery using Latent cOmponents) — supervised multi-omics integration method from the mixOmics R package. Performs sparse generalized canonical correlation analysis (sGCCA) via block.splsda() to identify multi-omics biomarker signatures that discriminate known sample groups. Integrates transcriptomics, proteomics, metabolomics, methylation, and other omics blocks with a tunable design matrix controlling pairwise block correlations. Produces circos plots, relevance networks, and loading plots for multi-omics feature selection.

Thin
0
DIAMOND

DIAMOND — fast protein alignment and translated DNA search tool, operating at 100x-10,000x the speed of BLAST. Aligns protein query sequences against protein reference databases (blastp) or translated DNA sequences against protein databases (blastx). Supports multiple sensitivity modes from fast screening to ultra-sensitive remote homology detection. Includes protein sequence clustering (cluster, linclust) scaling to billions of sequences. Produces BLAST-compatible tabular, XML, pairwise, and DAA output formats. Supports frameshift-aware alignment for long-read sequencing data, NCBI taxonomy integration, and composition-based statistics.

Thin
0
DIA-NN

DIA-NN — automated software suite for data-independent acquisition (DIA) proteomics data processing. Uses deep neural networks for peptide retention time, ion mobility, and fragmentation prediction to generate in silico spectral libraries from FASTA databases. Processes DIA, SWATH, dia-PASEF, Slice-PASEF, scanning SWATH, and Orbitrap Astral data. Produces precursor and protein quantification in Parquet and TSV matrix formats with QuantUMS/MaxLFQ normalization. Supports match-between-runs, plexDIA multiplexing (mTRAQ, SILAC, dimethyl labels), phosphoproteomics with site localization, and peptidoform-level confidence scoring.

Thin
0
DIA-NN R Package

DIA-NN R package — R toolkit for post-processing DIA-NN proteomics search results. Provides report loading (diann_load), precursor/peptide/protein matrix generation with FDR filtering (diann_matrix), MaxLFQ protein quantification (diann_maxlfq), and tab-delimited export (diann_save). Essential for DIA/SWATH-MS downstream analysis, batch correction, and label-free quantification from DIA-NN output.

Thin
0
DiffBind

DiffBind — Bioconductor R package for differential binding analysis of ChIP-seq and ATAC-seq data. Computes read count overlaps from BAM files at consensus peak sets, normalizes with DESeq2 or edgeR, identifies differentially bound sites between conditions, and produces MA plots, volcano plots, PCA, and heatmaps. Works with peak callers like MACS2, HOMER, and BED-format peak files.

Thin
0
DiffDock

DiffDock — diffusion generative model for molecular docking that predicts protein-ligand binding poses. Uses a diffusion process over translations, rotations, and torsion angles to generate and rank docked conformations. Supports blind docking without predefined binding sites, confidence-based pose ranking, and batch inference for virtual screening. State-of-the-art on PDBBind blind docking benchmarks.

Thin
0
distiller-nf

distiller-nf — Nextflow pipeline for reproducible Hi-C data processing. Aligns Hi-C FASTQ reads with BWA, parses alignments into pairs with pairtools, filters PCR duplicates, and aggregates contacts into multi-resolution cooler (.mcool) matrices. Supports chunked mapping, SRA downloads, MAPQ filtering, and ICE balancing. Part of the Open2C ecosystem upstream of cooler, cooltools, and chromosight.

Thin
0
DMRcate

DMRcate — R/Bioconductor package for identifying differentially methylated regions (DMRs) from Illumina Infinium arrays (450K, EPIC, EPICv2) and whole genome bisulfite sequencing (WGBS/RRBS) data. Uses kernel smoothing to aggregate CpG-level statistics into contiguous DMR calls via limma-based linear models. Filters SNP-confounded and cross-hybridising probes, generates GRanges output with gene annotations, and produces Gviz-based DMR visualisation plots. Supports differential methylation, variability (VMR), ANOVA, and differential variability analyses.

Thin
0
DNABERT

DNABERT — pre-trained BERT model for DNA sequence understanding and classification. Tokenizes DNA using overlapping k-mers (k=3,4,5,6) and provides contextualized embeddings for promoter prediction, splice site detection, transcription factor binding site identification, and sequence classification. Fine-tune for downstream genomic tasks; use DNABERT-2 for multi-species analysis with byte-pair encoding. Supports HuggingFace Transformers API. Compare with Nucleotide Transformer, Hyena-DNA, or Evo for long-context genomic modeling.

Thin
0
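The overlapping k-mer tokenization described in the DNABERT entry can be sketched in a few lines of Python. This is an illustrative stand-in, not the DNABERT tokenizer itself: the function name `kmer_tokenize` is hypothetical, and special tokens and vocabulary lookup are left out. It shows only the stride-1 sliding window that turns a sequence of length n into n - k + 1 tokens.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers using a stride of 1."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmer_tokenize("ATGGCTA", k=3)
print(" ".join(tokens))
```

Because adjacent tokens share k - 1 bases, the token count grows with sequence length rather than shrinking by a factor of k.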
Docker

Docker — container platform for building, running, and distributing reproducible bioinformatics environments. Provides Dockerfile authoring, image management, container execution with volume mounts for genomic data, multi-stage builds for pipeline tools, and integration with BioContainers registry. Essential for reproducible NGS, single-cell, and multi-omics workflows on HPC and cloud.

Thin
0
Dockstore

Dockstore — open platform for sharing Docker-based bioinformatics tools and workflows written in CWL, WDL, Nextflow, and Galaxy. Provides workflow discovery, versioned registrations, TRS API for programmatic access, launch-with integrations (Terra, DNAnexus, CGC, AnVIL), and checker workflows for validation. Central hub for GA4GH-compliant workflow sharing across bioinformatics platforms.

Thin
0
Dorado

Dorado — Oxford Nanopore's official high-performance open-source basecaller that converts raw nanopore signal (POD5/FAST5) into nucleotide sequences. Replaces Guppy as the primary ONT basecaller. Supports simplex basecalling, duplex basecalling (~Q30 accuracy), modified base detection (5mC, 5hmC, 6mA, m6A, pseudouridine), barcode demultiplexing, poly(A) tail estimation, single-read error correction (HERRO), assembly polishing, and integrated minimap2 alignment. Three accuracy tiers: fast, hac (recommended), sup. Uses R10.4.1 chemistry with v5.2.0 DNA models and RNA004 with v5.3.0 RNA models. Output is BAM by default with SAM, CRAM, and FASTQ options. Runs on NVIDIA GPUs (CUDA), Apple Silicon (Metal), and CPU.

Thin
0
DOSE

DOSE — Disease Ontology Semantic and Enrichment analysis R/Bioconductor package for gene set enrichment against disease ontologies. Performs over-representation analysis (ORA) and gene set enrichment analysis (GSEA) using Disease Ontology (DO), Network of Cancer Genes (NCG), and DisGeNET databases. Computes disease semantic similarity with doSim(). Use for functional annotation of DEG lists, disease pathway enrichment, cancer gene network analysis, and translational genomics workflows. Input requires ENTREZ gene IDs.

Thin
0
DoubletFinder

DoubletFinder — R package for detecting doublets in single-cell RNA-seq data by scoring each cell's proportion of artificial nearest neighbors (pANN). Interfaces with Seurat objects (v2–v5) to identify heterotypic doublets through parameter sweeping (paramSweep), pK optimization (find.pK), homotypic doublet proportion estimation (modelHomotypic), and classification with adjustable stringency thresholds.

Thin
0
dplyr

dplyr — grammar of data manipulation R package from the tidyverse. Provides consistent, expressive verbs for filtering rows, selecting columns, mutating values, summarising groups, and joining tables. Use for: RNA-seq sample metadata wrangling, genomic annotation table manipulation, VCF/BED/TSV preprocessing, multi-condition comparisons, and any tabular data work in R. Key terms: filter, select, mutate, group_by, summarise, left_join, inner_join, anti_join, across, case_when, arrange, tibble, tidyverse, data.frame, dplyr pipe, tidy data, .by argument, tidy evaluation, grammar of data manipulation.

Thin
0
DRAGEN Bio-IT Platform

DRAGEN (Dynamic Read Analysis for GENomics) — Illumina's FPGA-accelerated bioinformatics platform for secondary analysis of next-generation sequencing data. Performs hardware-accelerated mapping, alignment, sorting, duplicate marking, variant calling (SNVs, indels, SVs, CNVs), and joint genotyping. Supports WGS, WES, RNA-seq, methylation, single-cell, and tumor-normal somatic pipelines. Replaces traditional BWA-MEM + GATK workflows with 25-40x speedup.

Thin
0
DVC (Data Version Control)

DVC (Data Version Control) — Git-based version control for data, models, and ML experiments. Provides data versioning with .dvc files, ML pipeline definition via dvc.yaml DAGs, experiment tracking and comparison, and remote storage backends (S3, GCS, Azure, SSH, local). Essential for reproducible bioinformatics and ML workflows that manage large genomic datasets alongside code.

Thin
0
Dynamo

Use when working with Dynamo, `dynamo-release`, or transcriptomic vector field analysis for single-cell RNA-seq, metabolic labeling, and multi-omics time-course data. Dynamo estimates RNA velocity with `dyn.tl.dynamics`, projects velocities with `dyn.tl.cell_velocities`, reconstructs continuous vector fields with `dyn.vf.VectorField`, and supports fate prediction, topology analysis, and in silico perturbation. Trigger this skill for scRNA-seq velocity debugging, vector field reconstruction, labeling-based kinetic analysis, or comparisons with Scanpy, scVelo, or Monocle-style trajectory workflows.

Thin
0
Dysgu

Dysgu — structural variant caller for paired-end and long-read sequencing data (Illumina, PacBio HiFi, Oxford Nanopore). Detects deletions, insertions, duplications, inversions, and translocations from BAM/CRAM alignments. Supports germline calling with diploid/non-diploid models, somatic SV filtering via panel-of-normals, multi-sample merging, site-level re-genotyping, and phased variant calling. Works with both short and long reads.

Thin
0
echtvar

Use when working with echtvar — echtvar — ultra-fast Rust CLI for variant

Thin
0
EDAM Ontology Explorer

Navigate the EDAM ontology hierarchy, find compatible tools and data types, and check format compatibility

Thin
0
eggNOG-mapper

Use when working with eggnog-mapper — eggNOG-mapper — fast functional

Thin
0
EIGENSOFT smartpca -- Principal Component Analysis for Population Genetics

EIGENSOFT smartpca -- C/C++ tool for principal component analysis of genome-wide SNP genotype data. Computes eigenvectors and eigenvalues for population structure analysis, ancestry inference, stratification correction, and outlier detection. Reads PLINK binary (.bed/.bim/.fam) or EIGENSTRAT (.geno/.snp/.ind) format. Part of the EIGENSOFT suite alongside smarteigenstrat, twstats, and evec2pca. Used in GWAS quality control and population genetics studies.

Thin
0
EMAN2

Use when working with eman2 — EMAN2 — comprehensive cryo-EM image processing

Thin
0
emmeans

emmeans — Estimated Marginal Means (least-squares means) in R. Compute marginal means from fitted models with emmeans(), pairwise comparisons with pairs() and contrast(), interaction analysis with joint_tests(), custom contrasts with contrast(method = list()), effect sizes with eff_size(), back-transformation for GLMs with type = "response", multiplicity adjustment via adjust = "tukey"|"bonferroni"|"mvt"|"none", plotting with emmip() and plot(), and support for 50+ model types including lm, glm, lmer, glmer, lme, gls, gam, clm, coxph, brmsfit, and rstanarm. Essential for post-hoc analysis in clinical trials, agricultural experiments, and factorial designs.

Thin
0
ENCODE Pipeline ATAC-seq

ENCODE DCC ATAC-seq pipeline — ENCODE consortium-standard WDL/Cromwell pipeline (run via Caper) for processing bulk ATAC-seq data from FASTQ through alignment, deduplication, peak calling (MACS2), IDR reproducibility analysis, and comprehensive QC (ataqc). Produces ENCODE-compliant outputs: optimal peak sets, fold-change and p-value signal tracks (bigWig), and qc.json reports. Supports hg38, mm10, hg19, mm9, and custom genomes via genome TSV database. Use for ENCODE-standard ATAC-seq processing, open chromatin profiling, IDR-filtered peak sets, and compliance with ENCODE data submission requirements.

Thin
0
ENCODE Pipeline

ENCODE ChIP-seq pipeline — official ENCODE DCC WDL/Cromwell pipeline for transcription factor (TF) and histone mark ChIP-seq analysis. Handles adapter trimming (Trimmomatic), alignment (Bowtie2/BWA), duplicate removal (Picard/sambamba), cross-correlation QC (SPP/phantompeakqualtools), peak calling (MACS2), irreproducibility discovery rate (IDR), signal track generation (deepTools), and blacklist filtering. Supports single-end and paired-end reads, pooled replicates, multiple backends (local, Docker, Singularity, GCP, AWS Batch). Use for ENCODE-compliant ChIP-seq processing, reproducible peak calling with IDR, or comparative TF binding analysis.

Thin
0
EncyclopeDIA

Use when working with encyclopedia — encyclopeDIA — Java-based DIA (Data-Independent

Thin
0
Enformer

Enformer — transformer-based deep learning model from DeepMind that predicts gene expression, chromatin accessibility, histone modifications, and TF binding from 196,608 bp raw DNA sequence at 128 bp resolution. Outputs 5,313 human genomic tracks (CAGE, DNase-seq, histone ChIP-seq, TF ChIP-seq) for variant effect scoring, in-silico mutagenesis (ISM), regulatory element prediction, and eQTL prioritization. Available as TensorFlow original (deepmind-research) and PyTorch port (lucidrains/enformer-pytorch).

Thin
0
EnhancedVolcano

EnhancedVolcano — R/Bioconductor package for creating publication-ready volcano plots with enhanced colouring and labeling from differential expression results. Visualize DESeq2, limma, edgeR output with customizable fold-change and p-value thresholds, connector lines, boxed labels, and multi-attribute encoding (color, shape, size, encircle, shade). Supports selective gene labeling, italic text parsing, and ggplot2-based theming.

Thin
0
ENmix

ENmix — R/Bioconductor package for quality control, background correction, and normalization of Illumina DNA methylation array data (EPIC, 450K, EPIC v2). Provides ENmix background correction, RELIC dye-bias correction, RCP probe-type bias correction, detection p-value filtering, bead count QC, principal component regression plots, surrogate variable analysis for batch effects, sex prediction, and cell-type composition estimation. Use for preprocessing raw IDAT files from methylation arrays into analysis-ready beta or M-value matrices. Key functions: readidat, preprocessENmix, QCfilter, norm.quantile, rcp, ctrlsva, pcrplot.

Thin
0
Enrichr

Enrichr — web-based gene set enrichment analysis platform providing instant over-representation analysis against ~300 gene set libraries containing ~400,000 annotated gene sets. Implements Fisher's exact test with Benjamini-Hochberg correction and a combined score integrating p-value and z-score for robust ranking. Accessible via REST API, the GSEApy Python client (gseapy.enrichr), the enrichR R/CRAN package, and interactive web interface. Supports human, mouse, fly, worm, yeast, and fish through organism-specific Enrichr instances.

Thin
0
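The statistics named in the Enrichr entry, Fisher's exact test followed by Benjamini-Hochberg correction, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not Enrichr's implementation: the function names are hypothetical, the test is the one-sided (over-representation) form, and the combined score is not reproduced because it depends on a z-score that this sketch does not compute.

```python
from math import comb


def fisher_right_tail(overlap, set_size, query_size, universe):
    """One-sided Fisher p-value: P(X >= overlap) under the hypergeometric null."""
    upper = min(set_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, query_size)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 libraries' gene sets tested against a 50-gene query
# drawn from a 20,000-gene universe, given as (overlap, gene set size).
tests = [(8, 100), (2, 150), (1, 400)]
pvals = [fisher_right_tail(o, s, 50, 20000) for o, s in tests]
for (overlap, size), p, q in zip(tests, pvals, benjamini_hochberg(pvals)):
    print(overlap, size, f"{p:.3g}", f"{q:.3g}")
```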
Ensembl Database

Ensembl Database — EMBL-EBI's comprehensive genome annotation database covering 300+ species with gene models, variants, regulatory features, and comparative genomics. Query via REST API at rest.ensembl.org (no auth required). Core endpoints: /lookup/symbol/{species}/{symbol}, /sequence/id/{id}, /vep/{species}/hgvs/{notation}, /homology/id/{id}, /overlap/region/{species}/{region}, /map/{species}/{asm_one}/{region}/{asm_two}. Supports gene lookups, sequence retrieval (genomic/cDNA/protein), VEP variant consequence prediction, ortholog discovery, assembly coordinate mapping (GRCh37↔GRCh38), and regulatory feature queries. Rate limit: 55,000 requests per hour, which averages about 15 requests per second. Use for genomic annotation, variant analysis, comparative genomics, and cross-species research.

Thin
0
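As a hedged illustration of the `/lookup/symbol/{species}/{symbol}` endpoint listed in the Ensembl Database entry, the sketch below builds a request URL and defines a small fetch helper using only the Python standard library. The helper names are hypothetical, the `expand` query parameter is shown as an example option, and the network call is defined but not executed, so the snippet runs offline.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol, **params):
    """Build the URL for /lookup/symbol/{species}/{symbol} with optional query params."""
    path = f"/lookup/symbol/{quote(species, safe='')}/{quote(symbol, safe='')}"
    query = urlencode(params)
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")


def fetch_json(url, timeout=30):
    """Request a URL as JSON; requires network access, so it is not called below."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = lookup_symbol_url("homo_sapiens", "BRCA2", expand=1)
print(url)
# gene = fetch_json(url)  # uncomment to query the live service
```

Keeping URL construction separate from the request makes it straightforward to add throttling around `fetch_json` for batch lookups.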
Ensembl VEP

Use when working with Ensembl VEP (Variant Effect Predictor) to annotate SNPs, indels, and structural variation with gene, transcript, and protein-level consequences. Supports VCF, HGVS, and coordinate input with Ensembl/RefSeq transcripts, pathogenicity scores (SIFT, PolyPhen, CADD, REVEL), population frequencies, regulatory annotations, and plugin extensibility. Trigger phrases: vep, ensembl-vep, variant effect predictor, clinical annotation, somatic annotation, hgvs notation, filter common variants, plugin annotation.

Thin
0
EpiDISH

EpiDISH — Bioconductor R package for cell-type deconvolution of DNA methylation data. Estimates cell-type proportions from Illumina 450K and EPIC arrays using reference-based methods: Robust Partial Correlations (RPC), Cibersort (CBS), and Constrained Projection (CP). Includes centDHSbloodDMC.m and centBloodSub.m reference matrices for blood cell subtypes and immune cell fractions.

Thin
0
ESM-2/ESM-IF

ESM-2/ESM-IF — Meta AI protein language models for embeddings, structure prediction, and inverse folding. ESM-2 provides learned representations (up to 15B parameters) for zero-shot variant effect prediction, contact prediction, and structure prediction via ESMFold. ESM-IF1 performs inverse folding — predicting amino acid sequences from backbone structures using a GVP-Transformer. Input via FASTA sequences or PDB structures; output as embeddings, predicted structures, or designed sequences.

Thin
0
ESMFold

ESMFold — end-to-end single-sequence protein structure prediction using the ESM-2 protein language model. Predicts 3D atomic coordinates directly from amino acid sequence without multiple sequence alignment (MSA), achieving AlphaFold2-competitive accuracy for well-represented families at 10-60x faster inference. Part of Meta AI's ESM (Evolutionary Scale Modeling) framework. Supports batch prediction via the ESM Metagenomic Atlas API or local GPU inference with the esm Python package.

Thin
0
Estimate Compute Requirements

estimate-compute-requirements — predict CPU, memory, wall time, and storage for bioinformatics pipelines given input data size and tool selection. Generates per-step resource tables, platform-specific configurations (Nextflow, SLURM, PBS, AWS Batch, Google Cloud Life Sciences, Azure Batch), and cloud cost estimates using published benchmark data and nf-core resource labels. Covers alignment, variant calling, quantification, assembly, QC, and single-cell analysis tools with O(n)/O(s)/O(c) scaling models.

Thin
0
ETE Toolkit

ETE Toolkit — Python library for phylogenetic tree exploration, manipulation, and analysis. Provides tree parsing (Newick, Nexus), database connectivity (NCBI Taxonomy, GTDB), interactive visualization with rendering framework, orthology detection, distance calculations, comparative genomics pipelines, and large-scale phylogenetic visualization for molecular evolution and phylogenomics research.

Thin
0
eulerr

eulerr — R package for drawing area-proportional Euler and Venn diagrams with circles or ellipses. Takes named numeric vectors, logical matrices, or data.frames describing set memberships or intersection sizes and fits an optimal diagram layout. Use when users need to visualize overlapping sets, set intersections, gene set overlaps, Venn diagrams, upset-style set comparisons, or any area-proportional set relationship diagram in R. Handles 2-5+ sets with goodness-of-fit diagnostics.

Thin
0
Evo

Evo — genomic foundation model for DNA sequence modeling at single-nucleotide resolution. Uses StripedHyena architecture (7B parameters) trained on 2.7M prokaryotic and phage genomes (OpenGenome dataset). Provides zero-shot fitness prediction for proteins and non-coding RNA, DNA sequence generation, variant effect scoring, and multi-element genetic system design with up to 131k token context length. Available via HuggingFace (togethercomputer/evo-1-131k-base) and Together AI API.

Thin
0
EvoDiff

EvoDiff — discrete diffusion models for protein sequence generation directly from evolutionary-scale data. Generates novel protein sequences without requiring 3D structures using order-agnostic autoregressive diffusion (OA-AR-DM), discrete denoising diffusion (D3PM), and MSA-guided conditional generation. Supports unconditional generation, sequence inpainting, and scaffold-based motif scaffolding. Input via FASTA sequences or MSAs; output as generated protein sequences.

Thin
0
Exomiser -- Phenotype-Driven Variant Prioritization

Exomiser -- Java application for phenotype-driven variant prioritization in rare disease diagnosis. Filters and ranks variants from VCF files using Human Phenotype Ontology (HPO) terms, cross-species phenotype matching (hiPhive), protein interaction networks, and pathogenicity scores (CADD, REVEL, ClinVar). Supports whole-exome and whole-genome analysis with preset and custom YAML configurations.

Thin
0
ExpansionHunter

Use when working with ExpansionHunter — Illumina's tool for genotyping short tandem repeats (STRs) and repeat expansions from PCR-free whole-genome sequencing data. Detects pathogenic repeat expansions (e.g., Huntington disease, fragile X syndrome, ALS/FTD C9orf72, Friedreich ataxia, myotonic dystrophy) by realigning reads to repeat loci defined in a JSON variant catalog. Outputs VCF and JSON with allele sizes, confidence intervals, and read support. Used in clinical genomics pipelines and rare-disease cohort studies for STR genotyping.

Thin
0
f5c

f5c — ultra-fast methylation calling and event alignment for Oxford Nanopore sequencing data. Re-implements nanopolish call-methylation and eventalign with multi-threading and optional CUDA GPU acceleration. Processes FAST5 and BLOW5 signal files alongside BAM alignments for CpG methylation detection, signal-level event alignment, and frequency calculation. Use for nanopore methylation analysis, signal-to-reference alignment, or as a faster nanopolish replacement.

Thin
0
factoextra

factoextra — R package for extracting and visualizing multivariate analysis results with ggplot2. Covers PCA (fviz_pca_ind, fviz_pca_var, fviz_pca_biplot), CA (fviz_ca_row, fviz_ca_col, fviz_ca_biplot), MCA (fviz_mca_ind, fviz_mca_var), MFA, HMFA, FAMD visualization, eigenvalue screeplots (fviz_eig), contribution and cos2 bar charts (fviz_contrib, fviz_cos2), clustering visualization (fviz_cluster, fviz_dend, fviz_silhouette, fviz_nbclust), hierarchical k-means (hkmeans), enhanced hierarchical clustering (hcut, eclust), and optimal cluster number determination via silhouette, WSS, and gap statistic.

Thin
0
FactoMineR

FactoMineR — R package for multivariate exploratory data analysis providing principal component analysis (PCA), correspondence analysis (CA), multiple correspondence analysis (MCA), multiple factor analysis (MFA), factor analysis of mixed data (FAMD), hierarchical multiple factor analysis (HMFA), and hierarchical clustering on principal components (HCPC). Handles quantitative, categorical, and mixed variables with supplementary individuals/variables, automatic dimension description via dimdesc/catdes, and integrated visualization. For interactive exploration use Factoshiny; for ggplot2-based visualization use factoextra; for missing data imputation before analysis use missMDA.

Thin
0
FAN-C

FAN-C is a Python framework for analysis, visualization, and exploration of Hi-C and related chromatin conformation capture data. Provides a unified API for importing, processing, and analyzing Hi-C contact matrices from multiple formats (.hic, .cool, .pairs, .validPairs). Includes modules for matrix balancing (KR, ICE), TAD calling via insulation score, A/B compartment analysis, aggregate peak analysis (APA), loop calling, and publication-quality plotting. Works with cooler and Juicer backends. Supports automatic format conversion between .hic, .cool, and FAN-C native formats.

Thin
0
fancyimpute

fancyimpute — Python library for multivariate imputation and matrix completion providing KNN imputation (weighted nearest neighbors), SoftImpute (iterative SVD soft-thresholding), IterativeSVD (low-rank SVD decomposition), MatrixFactorization (SGD-based low-rank UV with biases), NuclearNormMinimization (convex optimization), BiScaler (iterative row/column normalization), SimpleFill (mean/median baseline), and IterativeImputer re-export from scikit-learn (MICE chained equations), all via a unified fit_transform(X_incomplete) API on NumPy arrays.

Thin
0
Fastq Screen

Fastq Screen — Perl tool for screening FASTQ sequencing reads against multiple reference genomes to detect sample contamination. Maps reads to user-defined genome databases (human, mouse, E. coli, PhiX, adapters, rRNA, mitochondria) using Bowtie2, BWA, or Bismark as alignment backend. Produces text summary and PNG/HTML bar charts showing percentage of reads mapping to each genome. Supports paired-end reads, bisulfite-treated samples, and tag-based filtering/removal of contaminant reads. Part of the Babraham Bioinformatics suite alongside FastQC.

Thin
0
fastSTRUCTURE

fastSTRUCTURE -- variational Bayesian inference of population structure from SNP genotype data. Use when working with faststructure, fast structure, population structure analysis, variational Bayes ancestry estimation, K selection via marginal likelihood, admixture analysis on PLINK binary files, or STRUCTURE-like analysis with faster convergence.

Thin
0
FastTree

FastTree — approximately maximum-likelihood phylogenetic tree inference from nucleotide or protein sequence alignments. Infers large phylogenies (up to 1M sequences) 100-1000x faster than RAxML or PhyML using heuristic neighbor-joining, minimum-evolution SPR moves, and ML NNIs. Supports JTT, WAG, LG (protein) and Jukes-Cantor, GTR (nucleotide) models with CAT rate approximation or Gamma20. Outputs Newick trees with Shimodaira-Hasegawa local support values. CLI binary: FastTree / FastTreeMP (OpenMP multi-threaded).

Thin
0
fdrtool

fdrtool — estimation of tail area-based false discovery rates (Fdr/q-values) and density-based local false discovery rates (fdr) from observed test statistics. Supports four null models: normal (z-scores), correlation coefficients, p-values, and t-scores. Uses the Grenander estimator for non-parametric density estimation under monotonicity constraints and censored maximum-likelihood fitting for null distribution parameters. Includes higher criticism (HC) scoring for rare/weak signal detection, half-normal distribution functions, and monotone regression (isotonic and antitonic). Standard CRAN package for unified FDR analysis in genomics, proteomics, and high-throughput multiple testing scenarios.

Thin
0
fgbio

fgbio — command-line toolkit for working with genomic and next-generation sequencing data, specializing in UMI (Unique Molecular Identifier) processing, consensus read calling, duplex sequencing, and BAM manipulation. Essential for error-corrected variant calling pipelines. Includes GroupReadsByUmi, CallMolecularConsensusReads, CallDuplexConsensusReads, FilterConsensusReads, FastqToBam, DemuxFastqs, ClipBam, and FilterBam.

Thin
0
fgsea

fgsea — fast preranked gene set enrichment analysis in R using an adaptive multi-level split Monte-Carlo scheme for arbitrarily precise GSEA p-values. Provides fgsea(), fgseaMultilevel(), plotEnrichment(), plotGseaTable(), collapsePathways() for pathway enrichment testing from ranked gene lists. Works with MSigDB, KEGG, Reactome, GO gene set collections. Input is a named numeric vector of gene-level statistics and a list of gene sets.

Thin
0
FigTree

FigTree — interactive Java-based phylogenetic tree viewer for displaying, annotating, and exporting phylogenetic trees produced by BEAST, BEAST2, MrBayes, RAxML, IQ-TREE, or any NEXUS/Newick tree file. Supports clade coloring, branch annotation, tip label formatting, time axes, posterior probability display, HPD interval bars, and export to PDF, SVG, and PNG. Essential for visualizing Bayesian phylogenetics and molecular clock trees. Trigger on: figtree, phylogenetic tree visualization, BEAST MCC tree, NEXUS tree annotation, Newick display, clade coloring, HPD intervals.

Thin
0
Fiji/ImageJ

Fiji/ImageJ — batteries-included distribution of ImageJ2 for scientific bioimage analysis. Batch macro scripting for cell counting, fluorescence quantification, and particle tracking. Bio-Formats support for 130+ microscopy file formats (CZI, LIF, ND2, OME-TIFF). TrackMate object tracking in time-lapse data. GPU-accelerated processing via CLIJ2. PyImageJ bridge for NumPy/scikit-image integration. Headless execution on HPC clusters. 7+ scripting languages (ImageJ Macro, Jython, Groovy).

Thin
0
Fiji/ImageJ

Fiji/ImageJ — open-source scientific image analysis platform for life sciences. Provides image segmentation, particle analysis, cell tracking, colocalization, 3D visualization, and batch processing of microscopy images. Supports Bio-Formats for 150+ file formats (ND2, CZI, LIF, OME-TIFF). Scriptable via ImageJ Macro Language, Jython, and Python (PyImageJ). Used in fluorescence microscopy, histopathology, calcium imaging, and high-content screening workflows.

Thin
0
Find Alternative Tools

Find drop-in replacements or compatible alternatives for bioinformatics tools using EDAM ontology I/O compatibility matching. Searches 47,000+ tools with curated migration knowledge for 20+ common alternative pairs. Ranks alternatives by EDAM operation/topic overlap, generates feature comparison matrices, and assesses migration difficulty. Detects deprecated or unmaintained tools and suggests active successors.

Thin
0
FINEMAP

FINEMAP — Bayesian fine-mapping tool for identifying causal variants from GWAS summary statistics and linkage disequilibrium (LD) matrices. Uses shotgun stochastic search to enumerate causal configurations and compute posterior inclusion probabilities (PIPs). Supports conditional analysis, credible set construction, and multi-causal-variant modeling. Essential for post-GWAS fine-mapping in complex trait genetics.

Thin
0
FLAIR

FLAIR — Full-Length Alternative Isoform analysis of RNA for long-read sequencing. Corrects, defines, and quantifies transcript isoforms from nanopore cDNA, native RNA, and PacBio reads. Provides splice-site correction using short-read data, isoform collapse, quantification, differential expression (DESeq2), and differential splicing (DRIMSeq). Core workflow: align → correct → collapse → quantify → diffexp/diffsplice. Essential for isoform-level analysis of Oxford Nanopore and PacBio long-read RNA-seq data.

Thin
0
FLAMES

FLAMES — Full-Length Analysis of Mutations and Splicing for bulk and single-cell long-read RNA-seq data. R/Bioconductor pipeline for barcode demultiplexing, isoform discovery, transcript quantification, and variant calling from Oxford Nanopore reads. Key functions: BulkPipeline(), SingleCellPipeline(), MultiSampleSCPipeline(), find_isoform(), quantify_transcript(), find_variants(), sc_mutations(), sc_DTU_analysis().

Thin
0
FlashLFQ

FlashLFQ — ultrafast label-free quantification for mass spectrometry proteomics. Quantifies peptide and protein abundances from MS/MS identifications using peak detection on raw/mzML spectra. Supports match-between-runs (MBR) for transferring identifications across runs, Bayesian protein fold-change analysis with MCMC, intensity normalization, and multiple search engine inputs (MetaMorpheus, MaxQuant, PeptideShaker). .NET CLI tool with cross-platform support.

Thin
0
flexsurv

flexsurv — Flexible Parametric Survival and Multi-State Models. R package for parametric time-to-event analysis with right-censored, left-censored, and left-truncated data. Core functions: flexsurvreg() fits standard parametric models (Weibull, Gompertz, log-normal, log-logistic, generalized gamma, generalized F) with covariates on any parameter; flexsurvspline() fits Royston-Parmar spline models on log cumulative hazard, odds, or probit scales; flexsurvmix() fits mixture models for competing events; fmsm() constructs multi-state models from transition-specific survival fits. Supports standsurv() for marginal (standardized) survival/hazard estimation, hr_flexsurvreg() for time-varying hazard ratios, AICc/BIC model comparison, Cox-Snell residuals, predict/tidy/augment tidyverse integration, and simulation from fitted models. Primary citation: Jackson (2016) JSS 70(8). Depends on survival package; builds on mstate for multi-state models.

Thin
0
flowCore

Use when working with flowCore — the foundational Bioconductor R package for reading, writing, and manipulating flow cytometry FCS files. Provides core data structures (flowFrame, flowSet), compensation matrix application, logicle and arcsinh transformations, and gate-based filtering. The base layer beneath flowWorkspace, openCyto, ggcyto, and CytoML in the RGLab ecosystem. Used for batch FCS processing, channel inspection, compensation, transformation, and extracting event-level expression matrices from cytometry experiments.

Thin
0
FlowSOM

FlowSOM — high-performance self-organizing map (SOM) clustering for flow cytometry and mass cytometry (CyTOF) data. Build a SOM grid from multi-parameter single-cell marker expression, then apply consensus metaclustering to identify cell populations automatically. Supports FCS file input, marker selection, FlowSOM tree visualization, star charts, and integration with downstream differential abundance analysis. Use for unsupervised cell population identification in cytometry experiments.

Thin
0
Flye

Flye — de novo assembler for single-molecule sequencing reads (PacBio CLR, PacBio HiFi, Oxford Nanopore) using repeat graphs built on approximate sequence matching. Produces polished contigs, GFA repeat graphs, and assembly statistics from raw, uncorrected long reads without requiring pre-correction. Supports genome, metagenome (metaFlye), and haplotype-aware assembly modes. Handles datasets from bacterial plasmids to mammalian-scale genomes.

Thin
0
FMLRC2

FMLRC2 — FM-index Long Read Corrector version 2, a Rust-based hybrid error correction tool for long reads (PacBio CLR, Oxford Nanopore) using Illumina short reads as a correction guide. Builds a compressed multi-string BWT (msBWT2) from short reads, then applies two-pass k-mer correction (small k-mer + large k-mer) to reduce error rates before downstream assembly or variant calling. Successor to the original Python FMLRC with major performance improvements. Use when you need to correct long reads with short-read data before genome assembly or variant calling.

Thin
0
Foldseek

Foldseek — fast protein structure search and clustering using the 3Di structural alphabet. Compares protein structures at sequence-search speed, supporting monomer and multimer searches, structural clustering, and sequence-to-structure prediction via ProstT5. Used for remote homolog detection, fold classification, structural database mining against AlphaFoldDB/PDB, and protein complex alignment.

Thin
0
FourCastNet

FourCastNet — NVIDIA's global data-driven weather forecasting model using Adaptive Fourier Neural Operators (AFNO) on a vision transformer backbone. Generates 0.25-degree resolution global forecasts in under 2 seconds. Supports backbone weather variable prediction and precipitation diagnostics from ERA5 reanalysis data. Trained on ECMWF ERA5 with PyTorch DDP on multi-GPU clusters. Provides inference scripts for custom date ranges via Copernicus CDS downloads.

Thin
0
FPocket

FPocket — fast open-source protein pocket detection and druggability estimation using Voronoi tessellation and alpha spheres. Identifies binding sites on protein surfaces from PDB structures, scores pockets by druggability, and tracks pocket dynamics across MD trajectories. Use for structure-based drug design, binding site prediction, pocket comparison, and druggability assessment.

Thin
0
fqtools

fqtools — fast FASTQ file manipulation suite written in C. Provides 16 subcommands for viewing, filtering, trimming, converting, and validating FASTQ files. Supports gzip-compressed and BAM input natively. Commands include view, head, count, header, sequence, quality, fasta, basetab, qualtab, type, validate, find, trim, qualmap, and header2. Handles single-end and paired-end (interleaved) data. Requires compilation from source with zlib and htslib dependencies.

Thin
0
FragGeneScan

Use when working with FragGeneScan for gene prediction.

Thin
0
FragPipe / MSFragger

FragPipe/MSFragger — ultrafast proteomics database search engine and computational pipeline for peptide identification from mass spectrometry (MS/MS) data. Supports closed search, open search for post-translational modification (PTM) discovery, glycoproteomics (O-Pair), DIA analysis via DIA-Umpire, TMT/iTRAQ isobaric labeling quantification, and label-free quantification (LFQ) via IonQuant. Integrates Philosopher for downstream validation with PeptideProphet and ProteinProphet.

Thin
0
Franklin

Franklin — web-based clinical variant interpretation platform by Genoox for ACMG/AMP-based pathogenicity classification of SNVs, indels, and CNVs. Provides a community-curated variant database, REST API for automated classification, ML-assisted evidence scoring, and aggregated evidence from ClinVar, gnomAD, HGMD, and literature. Used in clinical exome/genome interpretation, variant curation workflows, and laboratory reporting. Trigger keywords: Franklin, Genoox, variant interpretation, ACMG classification, clinical variant curation, variant pathogenicity, variant database.

Thin
0
FreeBayes -- Bayesian Haplotype-Based Variant Caller

FreeBayes -- Bayesian haplotype-based variant caller for short-read sequencing data. Detects SNPs, indels, MNPs, and complex variants from BAM/CRAM alignments against a reference genome. Uses literal haplotype observations without local reassembly. Supports pooled sequencing, polyploid genotyping, and population-level calling. Common in WGS, WES, and targeted panel variant calling pipelines.

Thin
0
FusionCatcher

FusionCatcher — tool for detecting somatic fusion genes, translocations, and chimeric transcripts from RNA-seq data. Identifies known and novel gene fusions in tumor and normal samples using multiple alignment methods (BLAT, STAR, Bowtie2). Supports human genome assemblies, outputs fusion candidates with evidence levels and supporting reads. Used in cancer genomics, translational oncology, and transcriptome analysis pipelines.

Thin
0
GA4GH Schemas

GA4GH Schemas — the Global Alliance for Genomics and Health data models and APIs for standardized genomic data exchange. Defines Protobuf/JSON schemas for Reads (SAM-equivalent), Variants (VCF-equivalent), References, Sequence Annotations, Metadata (Biosample, Individual), RNA Quantification, Allele Annotations, and Genotype-to-Phenotype associations. Use when implementing federated genomic data sharing, parsing GA4GH API responses, validating schema compliance, or migrating to successor standards (VRS, GKS, DRS, TES).

Thin
0
Galaxy

Galaxy — open-source web-based platform for accessible, reproducible, and transparent computational biology. Provides a browser GUI and REST API for running bioinformatics tools without command-line expertise. Supports workflow composition (drag-and-drop editor), dataset management, history tracking, interactive visualizations, and integration with thousands of tools via the Galaxy ToolShed. Programmable via BioBlend Python library for batch job submission, workflow invocation, and data transfer. Deployed as public servers (usegalaxy.org, usegalaxy.eu), institutional installs, or local Docker/Ansible instances.

Thin
0
gamlss

gamlss — Generalized Additive Models for Location, Scale and Shape in R. Distributional regression framework that models all parameters of a response distribution (location mu, scale sigma, skewness nu, kurtosis tau) as functions of explanatory variables. Supports 100+ distributions via gamlss.dist, P-spline and cubic spline smoothers (pb(), cs(), ps()), centile curve estimation (centiles(), centiles.pred()), worm plot diagnostics (wp()), stepwise model selection via GAIC (stepGAIC()), random effects (random(), re()), and ridge/lasso regularization (ri()). Used by WHO for Child Growth Standards.

Thin
0
GATK HaplotypeCaller

GATK HaplotypeCaller — germline short variant caller using localized de novo assembly of haplotypes. Calls SNPs and indels from BAM/CRAM alignment files, producing VCF or gVCF output. Supports single-sample calling, gVCF mode for scalable joint calling via GenomicsDBImport + GenotypeGVCFs, and interval-based parallelism. Required for GATK Best Practices germline variant discovery.

Thin
0
gatk-markduplicates

Use when working with GATK MarkDuplicates, Picard MarkDuplicates through the GATK wrapper, duplicate marking in BAM or CRAM files, duplicate metrics, or optical duplicate detection in short-read DNA sequencing pipelines. gatk-markduplicates marks PCR or optical duplicates, writes duplication metrics, can tag duplicate type, and fits directly after coordinate sorting and before BQSR or variant calling in common germline and somatic workflows. Trigger this skill for questions about MarkDuplicates syntax, sort order requirements, duplicate metrics, CreateIndex behavior, or MarkDuplicates vs MarkDuplicatesSpark.

Thin
0
GATK VQSR

GATK VQSR (Variant Quality Score Recalibration) — machine learning-based variant filtering for GATK Best Practices germline pipelines. Trains a Gaussian mixture model on truth/training resource datasets to assign VQSLOD scores to SNPs and indels, then applies tranche-based filtering to produce high-confidence callsets. Required for cohorts of 30+ WGS or 300+ WES samples. Replaces hard filtering in production variant discovery.

Thin
0
GATK4 (Genome Analysis Toolkit)

GATK4 — Genome Analysis Toolkit for germline and somatic short variant discovery (SNPs and indels). Industry-standard caller providing HaplotypeCaller for germline, Mutect2 for somatic, plus Base Quality Score Recalibration (BQSR), joint genotyping via GenomicsDBImport, and variant filtering with VQSR or hard filters. The backbone of Broad Institute best-practices pipelines for WGS, WES, and targeted panels.

Thin
0
GCTA

GCTA (Genome-wide Complex Trait Analysis) — command-line tool for estimating SNP-based heritability (GREML), performing mixed linear model association (MLMA/fastGWA), computing genetic relationship matrices (GRM), running principal component analysis on genotype data, and conditional/joint GWAS analysis (COJO). Works with PLINK binary format (.bed/.bim/.fam) inputs. Essential for quantitative genetics and complex trait analysis.

Thin
0
GEARS

Use when working with GEARS, the SNAP Stanford method for predicting transcriptional outcomes of single-gene and multi-gene perturbations from single-cell RNA-seq perturbation screens. GEARS exposes the `PertData` and `GEARS` Python APIs for loading built-in Norman, Adamson, Dixit, and Replogle datasets, processing custom AnnData inputs, preparing simulation or combo-aware splits, training a graph model, running uncertainty-aware inference, and scoring genetic interactions with `GI_predict`. Trigger this skill for GEARS, cell-gears, combinatorial perturbation prediction, perturbation-response modeling, or GEARS training and inference workflows.

Thin
0
GEMMA

GEMMA (Genome-wide Efficient Mixed Model Association) — fast C++ tool for genome-wide association study (GWAS) analysis using linear mixed models (LMM). Computes genetic relatedness matrices (GRM/kinship), runs univariate and multivariate LMM association tests, and fits Bayesian sparse linear mixed models (BSLMM) for polygenic score estimation. Accepts PLINK binary (.bed/.bim/.fam) and BIMBAM genotype formats. Use when analyzing population structure confounding, estimating SNP heritability, or running polygenic modeling. Alternatives: BOLT-LMM, SAIGE, regenie, FaST-LMM.

Thin
0
GENCODE

Use when working with GENCODE comprehensive gene annotation.

Thin
0
GeneAbacus

GeneAbacus — fast Go-based tool for counting and profiling sequencing reads within genes, chromosomes, or constructs from SAM/BAM files. Supports mRNA-seq, Ribo-seq, ChIP-seq, CLIP-seq, Structure-seq, and MPRAs. Produces per-feature read counts (RPKM/TPM) and per-nucleotide coverage profiles (BedGraph, binary, CSV). Works with unsorted BAM for maximum pipeline efficiency.

Thin
0
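The RPKM and TPM values mentioned in the GeneAbacus entry can be illustrated with a minimal sketch. It uses the textbook definitions of both units, not GeneAbacus's own code; the function names, gene names, counts, and lengths are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million counted reads (textbook definition)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    # count / (length in kb) / (total reads in millions) == count * 1e9 / (length * total)
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalised rates rescaled to sum to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("no reads counted")
    return {g: rates[g] * 1e6 / scale for g in counts}


# Hypothetical toy input: read counts and feature lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values sum to one million within a sample, whereas RPKM totals depend on the length distribution of the counted features.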
Geneformer

Geneformer — transformer-based foundation model pretrained on ~30 million single-cell transcriptomes for context-specific gene network analysis. Supports fine-tuning for cell type classification, gene dosage sensitivity prediction, chromatin dynamics, network perturbation analysis, and in silico perturbation of gene regulatory networks. Built on Hugging Face Transformers with rank value encoding of transcriptomic data.

Thin
0
Generate CWL From Nextflow

generate-cwl-from-nextflow — cross-compile Nextflow DSL2 process and workflow definitions into CWL v1.2 CommandLineTool and Workflow YAML documents for portable execution on CWL-compatible platforms (Terra, Seven Bridges, Arvados, Toil, CWLtool). Maps Nextflow container directives to DockerRequirement, publishDir to output bindings, and channel wiring to CWL step connections. Validates generated CWL against v1.2 structural rules before output.

Thin
0
GenomeScope

GenomeScope — reference-free genome profiling from k-mer count histograms. Estimates genome size, heterozygosity, and repeat content using a negative binomial mixture model. Supports diploid (GenomeScope 1.0) and polyploid (GenomeScope 2.0) genomes. Input is a k-mer frequency histogram from Jellyfish, KMC, or meryl. Essential pre-assembly QC for genome projects.

Thin
0
GenomicAlignments

Use when working with GenomicAlignments, GAlignments, GAlignmentPairs, summarizeOverlaps(), summarizeJunctions(), coverage() on BAM files, splice junction counting, paired-end alignment import, or Bioconductor workflows that operate on genomic read alignments. Trigger on: GenomicAlignments, readGAlignments, readGAlignmentPairs, summarizeOverlaps, summarizeJunctions, junctions, cigar operations, overlap counting, RNA-seq read counting, and BAM-backed range analysis in R.

Thin
0
GenomicFeatures

Use when working with GenomicFeatures (Bioconductor).

Thin
0
GenomicRanges

GenomicRanges — Bioconductor R package for representing, manipulating, and analyzing genomic intervals (GRanges, GRangesList). Provides operations for finding overlaps (findOverlaps, subsetByOverlaps), computing coverage, nearest-neighbor searches, set operations (union, intersect, setdiff), flanking/promoter extraction, and seqinfo-aware interval arithmetic on chromosome coordinates. Core infrastructure for ChIP-seq peak analysis, variant annotation, RNA-seq exon counting, and any genomic interval workflow in the Bioconductor ecosystem.

Thin
0
GenomicSEM

GenomicSEM — R package for structural equation modeling (SEM) on genome-wide association study (GWAS) summary statistics. Enables fitting SEM models without individual-level SNP data, conducting multivariate GWAS analyses, estimating heritability and genetic correlations via HDL method, and performing functional enrichment for SEM parameters across genomic regions.

Thin
0
GENOVA

GENOVA — GEnome orgaNizatiOn Visual Analytics, an R package for analysis and visualization of Hi-C and Micro-C chromatin contact data. Provides compartment analysis (A/B compartments, saddle plots), TAD detection (insulation scores, diamond insulation), loop analysis (APA — Aggregate Peak Analysis), distance-decay curves (RCP), virtual 4C plots, and differential Hi-C comparisons. Input: cooler (.cool/.mcool) or sparse contact matrix files. Output: GENOVA contacts objects, publication-ready plots, and quantitative chromatin organization metrics.

Thin
0
Genrich

Genrich — peak caller for genomic enrichment assays (ChIP-seq and ATAC-seq) with built-in support for multiple biological replicates via Fisher's method, multimapping reads with fractional pileup counts, PCR duplicate removal, and ATAC-seq Tn5 cut site centering. Produces ENCODE narrowPeak output. Use when calling peaks from queryname-sorted BAM files, combining replicates without IDR, analyzing repetitive regions, or running ATAC-seq peak calling workflows.

Thin
0
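Fisher's method, which the Genrich entry names as its way of combining replicates, can be sketched in a few lines. This is the textbook form of the test and makes no claim about how Genrich implements or applies it internally; the function name and example p-values are hypothetical.

```python
import math


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no external library is needed.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return math.exp(-half_stat) * series


# Two hypothetical replicates, each only marginally significant on its own.
combined = fisher_combine([0.05, 0.05])
```

With these inputs the combined p-value is smaller than either replicate's p-value, which is the effect that makes the method useful when replicates agree.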
GenVisR

Use when working with GenVisR, the R/Bioconductor package for genomic cohort visualization. Supports cohort oncoprints with `Waterfall()`, protein-domain mutation maps with `Lolliplot()`, copy-number cohort displays with `cnSpec()` and `cnFreq()`, loss-of-heterozygosity summaries with `lohSpec()`, region-of-interest coverage plots with `genCov()`, and identity-SNP or rainfall-style mutation diagnostics. Use for somatic mutation, copy-number, and targeted sequencing figure generation from MAF-like tables, segmentation data, BAM-derived coverage, or variant-allele-frequency tracks.

Thin
0
GEO Database

GEO Database — NCBI Gene Expression Omnibus public repository for high-throughput gene expression and functional genomics data. Contains 264,000+ series (GSE), 8M+ samples (GSM), 27,000+ platforms (GPL), and 4,300+ curated datasets (GDS). Query via GEOparse Python client for series download/parsing, or Biopython Entrez for metadata search. Core API: GEOparse.get_GEO(), Entrez.esearch(db="gds"), Entrez.esummary(). Use for transcriptomics dataset retrieval, expression matrix extraction, sample metadata parsing, differential expression analysis, and multi-study meta-analysis.

Thin
0
GEOfetch

GEOfetch — Python tool for downloading and converting GEO and SRA metadata and data into PEP (Portable Encapsulations of Projects) format. Fetches sample metadata from NCBI GEO/SRA, builds standardized PEP sample tables, and optionally downloads raw sequencing data via SRA Toolkit. Part of the pepkit ecosystem for reproducible project management.

Thin
0
gfatools

Use when working with gfatools, a lightweight C toolkit for viewing, converting, sorting, and validating Graphical Fragment Assembly (GFA) files. Provides subcommands for GFA statistics, graph simplification (asm), blacklisting contigs, and converting between GFA and FASTA/BED formats (gfa2fa, gfa2bed). Essential for pangenome assembly QC, graph inspection, and format conversion in genome assembly pipelines. Developed by Heng Li alongside minimap2 and minigraph.

Thin
0
gffcompare

gffcompare -- Tool for comparing, merging, and annotating RNA-seq transcript assemblies against a reference annotation. Classifies each assembled transcript with a class code (=, c, j, u, x, i, p, r, e, o, s) indicating its relationship to the reference. Produces .stats accuracy metrics (Sensitivity/Precision at base, exon, intron, transcript, locus levels), .tracking multi-sample correspondence file, .loci consolidated locus list, and .annotated.gtf with class-code annotations. Successor to cuffcompare. Essential QC step after StringTie, Scallop, or CLASS2 assembly.

Thin
0
gffread

gffread -- GFF/GTF utility for filtering, converting, and extracting sequences from genome annotation files. Converts between GFF3 and GTF formats, extracts transcript (FASTA) and protein sequences from genomic FASTA using GFF/GTF annotations, merges overlapping transcripts, filters by attributes or coordinates, and validates annotation structure. Part of the GFF Utilities suite by Geo Pertea. Essential in RNA-seq pipelines between transcript assembly (StringTie, Cufflinks) and downstream quantification or analysis.

Thin
0
ggplot2

ggplot2 — grammar of graphics implementation in R for creating complex, layered statistical visualizations. Supports scatter plots, bar charts, histograms, boxplots, heatmaps, faceted plots, and custom themes. Widely used in bioinformatics for gene expression visualization, variant plots, survival curves, volcano plots, and publication-quality figure generation from tabular data (CSV, TSV, data frames).

Thin
0
ggtree

ggtree — R/Bioconductor package for phylogenetic tree visualization and annotation using ggplot2 grammar. Supports Newick, Nexus, NHX, Jplace, and BEAST tree formats. Enables layer-based annotation with external metadata, clade highlighting, bootstrap labeling, tanglegrams, and tree subsetting. Used for microbial phylogenetics, molecular evolution, metagenomics taxonomy, and comparative genomics visualization.

Thin
0
GiardiaDB

GiardiaDB — VEuPathDB genomic resource for the intestinal parasite Giardia lamblia (G. intestinalis, G. duodenalis). Provides gene search, BLAST, genome browsing, functional annotation, ortholog queries, transcript expression evidence (EST, SAGE, microarray), proteomics data, and phylogenetic trees via a REST API. Supports programmatic access to gene records, FASTA/GFF3 downloads, and search strategies for the 12 Mb genome (~4,976 genes). Part of the VEuPathDB eukaryotic pathogen database federation.

Thin
0
Giotto Suite

Giotto Suite — comprehensive R framework for spatial multi-omics analysis supporting 20+ spatial technologies (Visium, MERFISH, Xenium, CosMx, Slide-seq, seqFISH, CODEX, Stereo-seq). Technology-agnostic pipeline for data ingestion, QC filtering, normalization, dimensionality reduction, clustering, spatial gene detection (binSpect), spatial domain identification (HMRF), cell-cell interaction analysis, deconvolution (PAGE, DWLS), ligand-receptor communication scoring, and 2D/3D visualization. Built on modular packages (GiottoClass, GiottoUtils, GiottoVisuals) with interoperability to AnnData, Seurat, and SpatialExperiment.

Thin
0
Giotto Suite

Giotto Suite — R toolkit for spatial multi-omics analysis at all scales and resolutions. Processes data from Visium, MERFISH, Xenium, CosMx, Slide-seq, CODEX, Stereo-seq, and other spatial technologies. Provides preprocessing (filtering, normalization, HVF detection), dimension reduction (PCA, UMAP, tSNE, NMF), clustering (Leiden, Louvain, k-means), spatial pattern detection (binSpect, spatialDE, SPARK), HMRF domain analysis, cell-cell communication, deconvolution (DWLS), enrichment analysis, subcellular polygon analysis, and 2D/3D visualization. The giottoObject stores expression, spatial coordinates, networks, images, and polygons in a unified multi-resolution framework. Interoperable with Seurat, AnnData, and SpatialExperiment.

Thin
0
glmmTMB

glmmTMB — R package for fitting generalized linear mixed models (GLMMs) and extensions using Template Model Builder. Supports zero-inflation via ziformula, dispersion modeling via dispformula, and 15+ covariance structures (ar1, cs, us, ou, exp, gau, mat, toep, diag, rr). Core API: glmmTMB() for model fitting with three-formula interface (conditional + zero-inflation + dispersion), fixef()/ranef()/VarCorr() for extraction, predict() with SE and confidence intervals, simulate() for parametric bootstrap, diagnose() for convergence troubleshooting, and confint() with Wald/profile/uniroot methods. Families include nbinom1, nbinom2, beta_family, tweedie, compois, genpois, betabinomial, truncated distributions, and all standard GLM families.

Thin
0
GLnexus

GLnexus — scalable gVCF merging and joint variant calling for cohort genomics. Merges per-sample gVCFs from DeepVariant or GATK HaplotypeCaller into a joint-called project-level BCF/VCF using an embedded RocksDB database. Provides preset configurations (DeepVariantWGS, DeepVariantWES, gatk, gatk_unfiltered, weCall) for consistent quality filtering. Essential for population-scale WGS/WES joint genotyping pipelines.

Thin
0
GLUE

GLUE (Graph-Linked Unified Embedding) — deep learning framework for integrating unpaired single-cell multi-omics data (scRNA-seq, scATAC-seq, snmC-seq). Uses a guidance graph of prior regulatory interactions to orient cross-modality alignment. Produces unified cell embeddings for joint clustering, feature embeddings for cis-regulatory inference, and TF-gene regulatory networks. Python package: scglue. Built on AnnData, PyTorch, and NetworkX.

Thin
0
Gnina

Gnina — deep learning molecular docking program built on AutoDock Vina. Uses convolutional neural networks (CNNs) to rescore protein-ligand poses for improved binding pose prediction and virtual screening. Supports flexible ligand docking, CNN-based affinity scoring, ensemble rescoring, and AutoDock Vina compatibility. Works with PDB receptors and SDF/MOL2 ligands for structure-based drug discovery.

Thin
0
GNPS

GNPS (Global Natural Products Social Molecular Networking) — web-based mass spectrometry platform for molecular networking, spectral library searching, and community-driven metabolomics analysis. Provides MS/MS spectral clustering, MASST (Mass Spectrometry Search Tool), FBMN (Feature-Based Molecular Networking), MolNetEnhancer, and spectral library matching against community-curated reference libraries. Used for natural products discovery, metabolomics, and untargeted mass spectrometry workflows.

Thin
0
g:Profiler

g:Profiler — multi-organism functional enrichment analysis and gene identifier mapping toolkit. Performs over-representation analysis (ORA) against Gene Ontology (GO:BP, GO:MF, GO:CC), KEGG, Reactome, WikiPathways, TRANSFAC, miRTarBase, CORUM, Human Phenotype Ontology, and Human Protein Atlas via the g:GOSt module. Includes gene ID conversion across databases (g:Convert), orthology mapping between 641+ organisms (g:Orth), and SNP-to-gene annotation (g:SNPense). Available as a web server, Python client (gprofiler-official), and R client (gprofiler2). Uses the g:SCS multiple testing correction method designed for hierarchical GO structure.

Thin
0
GROMACS

GROMACS — high-performance molecular dynamics simulation engine for biomolecular systems. Simulates Newtonian equations of motion for hundreds to millions of particles using leap-frog, velocity Verlet, or stochastic dynamics integrators. Supports all-atom and coarse-grained force fields (AMBER, CHARMM, GROMOS, OPLS-AA, Martini), PME electrostatics, free energy perturbation, replica exchange, and GPU-accelerated production runs. Use when setting up MD simulations, preparing topologies with pdb2gmx, running energy minimization, equilibration, or production dynamics, analyzing trajectories with gmx analysis tools, or troubleshooting GROMACS errors.

Thin
0
GSEA (Gene Set Enrichment Analysis)

GSEA — Gene Set Enrichment Analysis determines whether a priori defined gene sets show statistically significant, concordant differences between two biological states. Computes enrichment scores via a weighted Kolmogorov-Smirnov-like running statistic on ranked gene lists, with permutation-based FDR control. Available as the Broad Institute Java desktop tool (GSEA 4.4.x), the fast R/Bioconductor implementation (fgsea 1.36.x with multilevel adaptive algorithm), and the Python/Rust implementation (GSEApy 1.1.x supporting prerank, ssGSEA, GSVA, and Enrichr). All implementations consume ranked gene lists and GMT gene set collections from MSigDB (H, C1-C8) and produce normalized enrichment scores, FDR q-values, and leading edge gene subsets.

Thin
0
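The weighted Kolmogorov-Smirnov-like running statistic mentioned in the GSEA entry can be sketched in plain Python. This is a minimal illustration only: the function name `enrichment_score` and the default weight `p=1.0` are assumptions for this sketch, not the API of the Broad tool, fgsea, or GSEApy, and the permutation-based normalization and FDR steps are omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Signed maximum deviation from zero of the GSEA-style running sum.

    ranked_genes: gene IDs sorted by ranking metric (best first).
    scores: ranking metric for each gene, in the same order.
    gene_set: set of gene IDs to test.
    p: exponent weighting each hit by |score|**p (illustrative default).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    # Normalizer so that all hit increments sum to 1.
    hit_norm = sum(abs(s) ** p for s, h in zip(scores, in_set) if h)
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running, best = 0.0, 0.0
    for s, h in zip(scores, in_set):
        # Step up on a hit (weighted), step down on a miss (uniform).
        running += abs(s) ** p / hit_norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


genes = ["A", "B", "C", "D", "E", "F"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
top_es = enrichment_score(genes, metric, {"A", "B"})
bottom_es = enrichment_score(genes, metric, {"E", "F"})
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one. Real analyses should use one of the implementations named in the entry.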
GTDB-Tk

GTDB-Tk — toolkit for objective taxonomic classification of bacterial and archaeal genomes using the Genome Taxonomy Database (GTDB). Assigns taxonomy based on placement in reference trees inferred from 120 bacterial and 53 archaeal marker genes. Provides genome classification, ANI-based species assignment, relative evolutionary divergence, and multiple sequence alignment. Use for metagenome-assembled genome (MAG) taxonomy, genome quality assessment integration, and prokaryotic phylogenomics.

Thin
0
GuacaMol

Use when working with GuacaMol — the benchmarking suite for de novo molecular generation — to evaluate generative models for drug-like molecules. Assesses distribution-learning (5 benchmarks: validity, uniqueness, novelty, KL divergence, FCD) and goal-directed generation (20 benchmarks: rediscovery, similarity, MPO, isomers). Implements DistributionMatchingGenerator and GoalDirectedGenerator interfaces. Essential for comparing VAEs, GANs, RNNs, and transformer-based molecular generators against standardized benchmarks.

Thin
0
Gubbins

Gubbins — rapid phylogenetic analysis of recombinant bacterial whole genome sequences. Iteratively detects recombination hotspots with elevated SNP density while constructing clonal frame phylogenies. Supports RAxML, IQ-TREE, RAxML-NG, FastTree, and RapidNJ tree builders with GTR/HKY/JC models. Outputs Newick trees, GFF/EMBL recombination predictions, VCF SNP summaries, and per-branch recombination statistics (r/m ratio).

Thin
0
GUIDES

GUIDES (Graphical User Interface for DNA Editing Screens) — web-based tool for designing customized CRISPR-Cas9 guide RNA libraries. Integrates Doench on-target efficiency scoring, tissue-specific gene expression data, and protein structural information for optimized sgRNA pool design in genome-scale knockout screens. Supports human (GRCh) and mouse (GRCm38) genomes. Use for CRISPR library design, sgRNA scoring, guide RNA optimization, and knockout screen preparation.

Thin
0
GuideScan2

GuideScan2 — genome-wide CRISPR guide RNA (gRNA) design, specificity analysis, and library construction for Cas9 and Cas12a (Cpf1) in custom genomes. Provides memory-efficient gRNA database building from FASTA references, off-target enumeration with configurable mismatch thresholds, allele-specific gRNA design, and pre-computed human/mouse libraries. Essential for CRISPR knockout, interference, and activation screen design in genomics workflows.

Thin
0
Guppy

Guppy — Oxford Nanopore Technologies neural-network basecaller for converting raw nanopore signal data (FAST5/POD5) into nucleotide sequences (FASTQ/BAM). Supports GPU-accelerated basecalling, barcoding/demultiplexing, modified base detection (5mC, 6mA), and adapter trimming. Used in long-read genomics pipelines for whole-genome sequencing, metagenomics, and direct RNA sequencing.

Thin
0
Gviz

Use when working with Gviz, the Bioconductor R package for plotting genomic coordinates, annotations, signal tracks, read alignments, and sequence context in genome-browser-style figures. Covers core track classes such as DataTrack, AnnotationTrack, GeneRegionTrack, GenomeAxisTrack, IdeogramTrack, AlignmentsTrack, and SequenceTrack, plus plotTracks() composition, UCSC and biomaRt-backed annotation retrieval, and publication-ready genomic panels for ChIP-seq, ATAC-seq, RNA-seq, and gene model visualization.

Thin
0
Hail

Hail — scalable Python framework for genomic data analysis built on Apache Spark. Use when performing GWAS, variant quality control, sample QC, PCA, LD pruning, association testing, burden tests, or biobank-scale genomic analysis. Supports VCF, PLINK, BGEN, and MatrixTable formats.

Thin
0
hap.py (Haplotype Comparison Tools)

hap.py (Haplotype Comparison Tools) — Python/C++ benchmarking toolkit from Illumina for comparing VCF files against a gold-standard truth set. Computes SNP and INDEL precision, recall, and F1-score for variant calling accuracy benchmarking. Supports GIAB truth sets, GA4GH benchmarking standards, stratified analysis by region (repeats, GC content, segmental duplications), ROC curve generation, somatic variant evaluation (som.py), and pre.py variant normalization. Essential for validating GATK, DeepVariant, Strelka2, or any germline/somatic variant caller against NIST/GIAB reference standards.

Thin
0
Harmony -- Single-Cell Data Integration

Harmony -- fast and scalable single-cell data integration and batch correction algorithm. Operates on PCA embeddings using iterative soft k-means clustering to remove batch effects while preserving biological variation. Available as R package (harmony) and Python package (harmonypy). Integrates natively with Seurat and Scanpy workflows. Suitable for multi-sample, multi-technology, and multi-modality integration of single-cell experiments.

Thin
0
HDBSCAN

HDBSCAN — Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs density-based clustering over varying epsilon values to find clusters of variable density without requiring a fixed distance threshold. Provides soft clustering (membership probabilities), GLOSH outlier detection, condensed tree visualization, approximate prediction for new points, and branch detection within clusters. Available as standalone scikit-learn-contrib package and integrated into scikit-learn (>=1.3).

Thin
0
HDMT

HDMT — High-Dimensional Mediation Testing for joint significance of exposure-mediator and mediator-outcome associations. Controls FWER and FDR for mediation hypotheses in high-dimensional settings (epigenome-wide, transcriptome-wide mediation studies) where thousands of potential mediators are tested simultaneously. Estimates a four-component null proportion vector from bivariate p-value pairs, then applies composite null-adjusted FDR or FWER procedures. Specifically designed for mediation analysis in genomic epidemiology where the joint significance test (both paths non-null) must be corrected for multiplicity. Implements the Dai et al. (2020) JASA method with kernel density-based null estimation and quantile-adjusted p-values.

Thin
0
Herro

Herro — haplotype-aware error correction for Oxford Nanopore long reads using deep learning. Corrects ONT simplex reads to Q30+ accuracy while preserving haplotype information for diploid assembly. Uses all-vs-all read overlaps with a POA + deep learning correction model. Input is aligned BAM (minimap2 all-vs-all); output is corrected FASTQ. Supports both ONT R9.4.1 and R10.4.1 chemistries. Requires CUDA GPU.

Thin
0
HiC-Pro

HiC-Pro is an all-in-one pipeline for processing Hi-C sequencing data from raw FASTQ files to normalized contact matrices. Performs read alignment with Bowtie2, restriction-fragment assignment, valid-pair filtering, duplicate removal, and ICE (Iterative Correction and Eigenvector decomposition) normalization. Outputs allValidPairs, sparse contact matrices, and genome binned interaction files compatible with downstream visualization (HiCBrowser, Juicebox) and TAD/loop callers (insulation score, HiCCUPS). Handles both restriction-enzyme Hi-C and DNase Hi-C protocols.

Thin
0
HiCanu

HiCanu — HiFi-optimized mode of the Canu genome assembler for accurate

Thin
0
HiCExplorer

HiCExplorer — comprehensive toolkit for Hi-C chromosome conformation capture data analysis, quality control, normalization, and visualization. Provides hicBuildMatrix (contact matrix construction), hicCorrectMatrix (ICE normalization), hicFindTADs (topologically associating domain calling), hicDetectLoops (chromatin loop detection), hicPCA (compartment analysis), hicPlotMatrix (visualization), and 40+ additional tools. Works with .h5 and .cool matrix formats. Used for 3D genome organization studies, TAD analysis, loop detection, and A/B compartment identification in Hi-C and Micro-C data.

Thin
0
HiCUP

HiCUP (Hi-C User Pipeline) is a modular pipeline for mapping and performing quality control on Hi-C sequencing data. Processes paired-end FASTQ reads through four stages: truncation at ligation junctions, alignment with Bowtie2 or HISAT2, filtering of experimental artifacts (dangling ends, self-ligations, re-ligations, wrong size, contiguous sequences), and PCR duplicate removal. Produces deduplicated BAM files, per-stage HTML QC reports, and di-tag summary statistics. Output feeds downstream tools including Juicer, cooler, HiCExplorer, and CHiCAGO for TAD calling, loop detection, compartment analysis, and Capture Hi-C interaction scoring.

Thin
0
HiGlass

HiGlass — web-based genome browser and visualization platform for multi-resolution

Thin
0
HistomicsTK

HistomicsTK — Python toolkit for computational pathology and histopathology image analysis. Provides color normalization and deconvolution of H&E-stained whole-slide images, nuclei segmentation, feature extraction, and classification. Use for digital pathology workflows including stain normalization, cell detection, nuclear morphometry, and tissue analysis on SVS/NDPI/TIFF whole-slide images via large_image.

Thin
0
HMMRATAC

HMMRATAC — Hidden Markov Model-based peak caller for ATAC-seq chromatin accessibility data. Identifies nucleosome-free regions (NFR), mono-, di-, and tri-nucleosomal fragments using an HMM that models the fragment length distribution in ATAC-seq experiments. Accepts sorted, indexed BAM files and outputs gappedPeak, summit BED, and open chromatin bedgraph files. Use for ATAC-seq peak calling, open chromatin identification, nucleosome positioning, and chromatin accessibility profiling in bulk or cell-line experiments.

Thin
0
HOMER (Hypergeometric Optimization of Motif EnRichment)

HOMER (Hypergeometric Optimization of Motif EnRichment) — a comprehensive suite of tools for motif discovery and next-generation sequencing analysis. HOMER performs de novo motif finding optimized for 8-12 bp motifs in large-scale genomics data, ChIP-seq peak calling, peak annotation, differential peak analysis, GRO-seq transcript identification, Hi-C interaction analysis, and RNA-seq quantification. Built around a tag directory abstraction that normalizes diverse input formats, HOMER provides an integrated analysis environment from raw alignments through publication-ready motif logos and genome annotations. The hypergeometric enrichment algorithm identifies both known and novel motifs while controlling for background sequence composition.

Thin
0
Horvath Clock

Horvath clock — the foundational multi-tissue epigenetic age predictor using 353 CpG dinucleotide methylation sites and elastic net regression. Predict biological age from DNA methylation arrays (27K, 450K, EPIC, EPICv2) via pyaging (Python, GPU-optimized PyTorch) or methylclock (R/Bioconductor). Supports Horvath2013 (pan-tissue), SkinAndBlood (Horvath2018), and 100+ additional aging clocks. Preprocessing includes missing value imputation (mean/KNN), probe aggregation for EPICv2, and anti-log-linear postprocessing. Downstream analysis: age acceleration residuals, clock correlation, and multi-clock benchmarking for aging intervention studies.

Thin
0
CHORD

CHORD (Classifier of Homologous Recombination Deficiency) — R package for predicting HRD status in tumors from somatic mutation patterns (SNVs, indels, structural variants). Uses random forest classification to distinguish BRCA1-type vs BRCA2-type HRD. Guides clinical decisions for PARP inhibitor and platinum-based chemotherapy eligibility. Companion to mutSigExtractor; works with WGS-derived VCF files and mutation contexts.

Thin
0
HTSeq

HTSeq — Python framework for high-throughput sequencing data analysis, primarily used for counting aligned reads overlapping genomic features (genes, exons) from SAM/BAM/CRAM files using GTF/GFF annotations. Key CLI tools: htseq-count (feature quantification), htseq-count-barcodes (single-cell UMI counting), htseq-qa (quality assessment). Critical flags: -s/--stranded (yes/no/reverse), -r/--order (name/pos), -m/--mode (union/intersection-strict/intersection-nonempty), -t/--type (exon), -i/--idattr (gene_id), -a (min MAPQ), --nonunique (none/all/fraction/random), -c/--counts_output, -n/--nprocesses. Input: SAM/BAM/CRAM + GTF/GFF. Output: tab-delimited count table, CSV, sparse matrix (mtx), h5ad, loom. Special counters: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique. Python API provides GenomicArray, GenomicInterval, GFF_Reader for custom analyses. Use for RNA-seq gene quantification, differential expression input, and single-cell counting.

Thin
0
htsget

htsget — GA4GH REST streaming API for accessing genomic reads and variants from remote servers. Fetch BAM, CRAM, VCF, or BCF data by region using HTTP ticket-based retrieval. Supports Bearer token auth, 0-based half-open coordinate queries, and chunked URL assembly. Use to stream reads from htsget-rs, EGA, or any GA4GH-compliant server.

Thin
0
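The ticket-based retrieval and 0-based half-open coordinates described in the htsget entry can be illustrated without any network access. This is a hedged sketch: the helper names and the `example.org` server are hypothetical, and only the `/reads/<id>` path, the `format`, `referenceName`, `start`, and `end` query parameters, and the `htsget.urls` ticket layout are taken from the GA4GH specification.

```python
from urllib.parse import urlencode


def reads_ticket_url(base_url, read_id, ref_name, start_1based, end_1based, fmt="BAM"):
    """Build a ticket request URL from a 1-based inclusive region.

    htsget expects 0-based half-open coordinates, so the start is
    shifted down by one and the end is left unchanged.
    """
    params = {
        "format": fmt,
        "referenceName": ref_name,
        "start": start_1based - 1,
        "end": end_1based,
    }
    return f"{base_url.rstrip('/')}/reads/{read_id}?{urlencode(params)}"


def ticket_requests(ticket):
    """Return (url, headers) pairs, in order, from a parsed ticket JSON."""
    return [(u["url"], u.get("headers", {})) for u in ticket["htsget"]["urls"]]


url = reads_ticket_url("https://example.org/htsget/", "sample1", "chr1", 1001, 2000)

# Hypothetical ticket, shaped like the JSON a server would return.
ticket = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "https://example.org/block1",
             "headers": {"Authorization": "Bearer TOKEN"}},
            {"url": "https://example.org/block2"},
        ],
    }
}
requests_to_make = ticket_requests(ticket)
```

A client would fetch each URL in order with its headers and concatenate the response bodies to obtain the final file.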
HTSlib

HTSlib — C library and command-line utilities for reading, writing, indexing, and manipulating high-throughput sequencing data in SAM, BAM, CRAM, VCF, BCF, and tabix-indexed formats. Provides bgzip block compression, tabix genomic region indexing, and htsfile format identification. Foundation library for samtools and bcftools.

Thin
0
Hugging Face Transformers

Hugging Face Transformers — state-of-the-art machine learning library for natural language processing, computer vision, audio, and multimodal tasks. Provides pretrained models (BERT, GPT-2, T5, Llama, Mistral, CLIP, Whisper, etc.) via the Model Hub, high-level pipeline API for inference, AutoModel and AutoTokenizer for flexible loading, and Trainer API for fine-tuning on custom datasets. Supports PyTorch, TensorFlow, and JAX backends. Use for text classification, NER, summarization, translation, question answering, image classification, object detection, speech recognition, and fine-tuning foundation models on domain data.

Thin
0
HUMAnN (HMP Unified Metabolic Analysis Network)

HUMAnN 3 — functional profiling pipeline for metagenomic and metatranscriptomic shotgun sequencing data. Quantifies gene families (UniRef90/50), metabolic pathways (MetaCyc), and enzyme reactions via three-tier search: MetaPhlAn taxonomic prescreening, Bowtie2 nucleotide alignment against ChocoPhlAn pangenomes, then DIAMOND translated protein search against UniRef databases. Produces organism-stratified RPK abundance tables for gene families, pathway abundances, and pathway coverage. Core tool in the bioBakery suite alongside MetaPhlAn and MaAsLin2.

Thin
0
HydroEval

HydroEval — Python evaluator for streamflow simulations using vectorized NumPy computation. Provides Nash-Sutcliffe Efficiency (NSE), Kling-Gupta Efficiency (KGE, KGE', KGEnp), RMSE, MARE, Percent Bias, and bounded C2M variants for hydrological model calibration and validation. Supports flow transformations (log, sqrt, inverse) for low-flow emphasis.

Thin
0
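Two of the metrics listed in the HydroEval entry, NSE and the original KGE, have short closed forms. The sketch below writes them in plain Python using the standard library rather than calling HydroEval itself, so the function names `nse` and `kge` are illustrative and not the package's API.

```python
import math
from statistics import fmean, pstdev


def nse(sim, obs):
    """Nash-Sutcliffe Efficiency: 1 minus error variance over observed variance."""
    mean_obs = fmean(obs)
    num = sum((o - s) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - num / den


def kge(sim, obs):
    """Original Kling-Gupta Efficiency (correlation, variability, bias terms).

    Undefined when either series is constant (zero standard deviation).
    """
    mean_s, mean_o = fmean(sim), fmean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = fmean([(s - mean_s) * (o - mean_o) for s, o in zip(sim, obs)])
    r = cov / (sd_s * sd_o)   # linear correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mean_s / mean_o    # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


observed = [1.0, 2.0, 3.0, 4.0]
perfect_nse = nse(observed, observed)
mean_only_nse = nse([2.5] * 4, observed)
doubled_kge = kge([2 * o for o in observed], observed)
```

A simulation equal to the observed mean scores NSE = 0. A simulation that doubles every flow keeps perfect correlation but is penalized on both variability and bias, giving KGE = 1 - sqrt(2).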
HydroMT (Hydro Model Tools)

HydroMT (Hydro Model Tools) — open-source Python framework for automated building, updating, and analysis of spatial geoscientific models with a focus on water systems. Provides a model-agnostic interface with plugin architecture supporting Wflow, SFINCS, Delwaq, FIAT, and Delft3D FM backends. Automates data ingestion via DataCatalog YAML configs, spatial clipping, regridding, and reproducible model setup from raw geospatial data (DEMs, land use, soil, climate, hydrography). Core model types include GridModel, MeshModel, VectorModel, and NetworkModel.

Thin
0
HyPhy

HyPhy (Hypothesis Testing using Phylogenies) — open-source software for molecular evolution analysis and natural selection detection. Provides site-level methods (FEL, SLAC, FUBAR, MEME), branch-level methods (aBSREL, BUSTED), comparative methods (RELAX, CONTRAST-FEL), and recombination detection (GARD). Processes codon-aware alignments with phylogenetic trees. Install with conda install -c bioconda hyphy.

Thin
0
iClusterPlus

iClusterPlus — Bioconductor R package for integrative clustering of multi-omics data. Performs joint latent variable modeling of multiple genomic data types (e.g., SNP copy number, methylation, gene expression, mutation) using a Bayesian latent variable framework with Lasso regularization. Produces consensus molecular subtypes across data modalities. Used for multi-omics integration, cancer subtype discovery, TCGA pan-cancer analyses, and dimensionality reduction across genomic platforms. Supports continuous, binary, and count data types in a unified model.

Thin
0
IDR

IDR (Irreproducible Discovery Rate) — statistical framework for assessing and controlling reproducibility of high-throughput sequencing experiments across biological replicates. Fits a copula mixture model to ranked peak lists from ChIP-seq, ATAC-seq, or other peak-calling experiments. Outputs per-peak IDR scores, reproducible peak sets at a given threshold, and diagnostic plots. Required step in ENCODE ChIP-seq and ATAC-seq processing pipelines. Use for replicate consistency analysis, setting reproducibility thresholds, and generating consensus peak sets from narrowPeak or broadPeak files.

Thin
0
ieugwasr -- GWAS Data Access and Querying

ieugwasr (IEU GWAS API in R) -- R package for querying the IEU GWAS database, accessing summary statistics from 47,000+ GWAS studies. Enables genome-wide association study data retrieval, effect size extraction, LD estimation, and two-sample Mendelian randomization analysis. Query by SNP (rsID, chr:pos), trait, or exposure-outcome pairs with flexible filtering and batch operations.

Thin
0
IgFold

IgFold — fast antibody structure prediction using language model embeddings. Predicts 3D structures of antibody variable domains (Fv) from amino acid sequence alone, without multiple sequence alignments. Uses pretrained protein language model (AntiBERTy) embeddings with graph neural networks for coordinate prediction. Supports paired heavy/light chain modeling, CDR loop refinement via OpenMM energy minimization, and nanobody (VHH) prediction. Use for antibody homology modeling, CDR loop structure, therapeutic antibody design, and structural annotation of repertoire data.

Thin
0
IGV (Integrative Genomics Viewer)

IGV (Integrative Genomics Viewer) — high-performance Java desktop application for interactive visualization and exploration of genomic data. Supports BAM/CRAM alignments, VCF variants, BED/GFF annotations, BigWig/TDF quantitative tracks, and GWAS results. Provides batch scripting for automated snapshots, igvtools for preprocessing (count, sort, index, toTDF), and session files for reproducible views. Essential for visual QC of NGS data.

Thin
0
IGV.js

IGV.js — embeddable JavaScript genome browser for interactive visualization of genomic data in web applications. Supports BAM/CRAM alignments, VCF variants, BED annotations, BigWig signal tracks, GFF3/GTF gene models, and SEG copy-number segments. Configurable via JSON track objects. Embeddable in React, Angular, and plain HTML. Supports indexed file access via htsget, Google Cloud Storage, and Amazon S3. Reference genomes include hg38, hg19, mm10, and custom FASTA.

Thin
0
IHW

IHW — Independent Hypothesis Weighting for covariate-informed multiple testing correction with increased statistical power. Accepts p-values and an independent covariate, learns data-driven weights via cross-validation, and applies weighted Benjamini-Hochberg (FDR) or weighted Bonferroni (FWER) procedures. Provides accessor functions for adjusted p-values (adj_pvalues), weights, rejections, and weighted p-values. Supports ordinal and nominal covariates, automatic bin selection, configurable fold counts, and Grenander or ECDF distribution estimation. Standard tool for boosting discovery power in genome-wide studies (RNA-seq, GWAS, proteomics, ChIP-seq) where an informative covariate is available alongside p-values.

Thin
0
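The weighted Benjamini-Hochberg step named in the IHW entry divides each p-value by its weight and then runs the ordinary BH procedure. The sketch below is a Python illustration, not the Bioconductor package: the weights are supplied by hand and assumed to be non-negative with mean 1, whereas IHW learns them from the covariate by cross-validation. The function name `weighted_bh` is hypothetical.

```python
def weighted_bh(pvalues, weights, alpha=0.1):
    """Return a list of booleans marking hypotheses rejected by weighted BH."""
    m = len(pvalues)
    # Weighted p-values; a zero weight means the hypothesis is never rejected.
    wp = [min(p / w, 1.0) if w > 0 else 1.0 for p, w in zip(pvalues, weights)]
    order = sorted(range(m), key=lambda i: wp[i])

    # Largest rank k whose weighted p-value is at most alpha * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if wp[i] <= alpha * rank / m:
            k = rank

    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


pvals = [0.001, 0.02, 0.04, 0.5]
unweighted = weighted_bh(pvals, [1.0, 1.0, 1.0, 1.0], alpha=0.05)
weighted = weighted_bh(pvals, [0.5, 0.5, 2.0, 1.0], alpha=0.05)
```

With uniform weights this reduces to plain BH. Shifting weight toward the third hypothesis changes which tests are rejected while the weights still average to 1.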
ilastik

ilastik — interactive machine learning toolkit for bioimage analysis. Provides pixel classification, object classification, autocontext, tracking, counting, carving (3D segmentation), and multicut boundary-based segmentation using random forest classifiers trained interactively. Supports headless batch processing via command line for HDF5, TIFF, and N5 image data. Use for cell segmentation, organelle detection, particle tracking, and neurite tracing in microscopy images.

Thin
0
ImageJ/Fiji

Use when working with ImageJ — open-source image processing program with

Thin
0
ImmuneBuilder

ImmuneBuilder — deep-learning models for predicting 3D structures of immune proteins from amino acid sequence alone. Provides three specialist predictors: ABodyBuilder2 (paired antibody Fv), NanoBodyBuilder2 (nanobody / VHH single-domain), and TCRBuilder2 (alpha/beta T-cell receptor). Generates ensembled PDB structures with per-residue pLDDT confidence scores using IMGT, AHo, or Chothia numbering. Use for antibody structure prediction, TCR modeling, nanobody engineering, CDR loop analysis, and immune repertoire structural annotation. Faster and more accurate than homology modeling; complementary to IgFold, AlphaFold2-Multimer, and ABodyBuilder1.

Thin
0
IMPUTE5

IMPUTE5 is a tool for genotype imputation from phased haplotypes and a reference panel. Performs fast, accurate imputation using the Li and Stephens hidden Markov model with the PBWT data structure for efficient haplotype matching. Use when imputing genotypes, running imputation pipelines, chunking chromosomes for parallel imputation, or comparing imputation tools.

Thin
0
inferCNV

inferCNV — R/Bioconductor package for inferring copy number variations (CNVs) from single-cell RNA-seq data by comparing tumor cell expression against a reference set of normal cells. Generates chromosomal CNV heatmaps, identifies tumor subclones via hierarchical clustering, applies hidden Markov model (HMM) denoising (i3 or i6 modes), and supports subcluster analysis for intratumoral heterogeneity. Core tool for cancer single-cell genomics pipelines.

Thin
0
Inspector

Inspector — long-read-based genome assembly evaluation and error correction. Use for: assembly quality assessment, QV (quality value) score calculation, structural error detection, small-scale error profiling, assembly correction using long reads, multi-assembly comparison, and phasing quality estimation. Key terms: inspector.py, inspector-correct.py, QV score, structural errors, small-scale errors, assembly evaluation, PacBio HiFi, PacBio CLR, ONT, contig_summary, assembly_summary, structural_error.bed, small_scale_error.bed, genome assembly QC, long-read assembly benchmarking.

Thin
0
inStrain

Use when working with inStrain — strain-level microbial population

Thin
0
intervaltree

Use when working with intervaltree — Python mutable interval

Thin
0
InterVar

InterVar — Python tool for clinical interpretation of genetic variants according to ACMG-AMP 2015 guidelines. Automatically evaluates 28 evidence codes (PVS1, PS1-4, PM1-6, PP1-5, BA1, BS1-4, BP1-7) and classifies variants as pathogenic, likely pathogenic, uncertain significance, likely benign, or benign. Integrates with ANNOVAR for variant annotation. Used in clinical genomics, exome/genome sequencing interpretation, and variant curation pipelines.

Thin
0
IonQuant

IonQuant — high-performance label-free and isobaric (TMT/iTRAQ) quantification engine for mass spectrometry proteomics. Performs MaxLFQ-based protein intensity rollup, match-between-runs (MBR) with 3D LC-MS feature detection, and site-level quantification. Integrates with FragPipe/MSFragger pipeline and supports DDA, DIA, and timsTOF PASEF data. Used for protein-level and peptide-level quantification in bottom-up proteomics experiments.

Thin
0
IQ-TREE

IQ-TREE — efficient phylogenomic software for maximum likelihood tree inference. Integrates ModelFinder for automatic model selection (10-100x faster than jModelTest/ProtTest), ultrafast bootstrap (UFBoot2, 100x faster than standard bootstrap), partition models with PartitionFinder-like merging, concordance factor analysis (gCF/sCF), and AliSim sequence simulation. Supports DNA, protein, codon, binary, and morphological data with extensive rate heterogeneity models (+G, +I, +R FreeRate). Produces Newick tree files, detailed analysis reports, bootstrap consensus trees, and checkpoint files for resuming.

Thin
0
IsoformSwitchAnalyzeR

IsoformSwitchAnalyzeR — R/Bioconductor package for detecting, annotating, and visualizing isoform switches with functional consequences from RNA-seq data. Integrates with Salmon, kallisto, StringTie, or RSEM quantification and DEXSeq or DRIMSeq for differential isoform usage testing. Annotates switches with coding potential (CPC2/CPAT), protein domains (Pfam), signal peptides (SignalP), intrinsically disordered regions (IUPred2A), and alternative splicing events. Produces switchPlot visualizations and genome-wide summaries of functional isoform switching consequences.

Thin
0
ISOLDE

ISOLDE — interactive molecular dynamics plugin for UCSF ChimeraX that enables real-time model building into low-to-medium resolution cryo-EM and crystallographic maps. Uses GPU-accelerated OpenMM with AMBER forcefield for interactive simulation, real-time geometric validation, adaptive distance and torsion restraints, peptide flipping, ligand fitting, and refinement export to Phenix/REFMAC.

Thin
0
IsoQuant

IsoQuant — long-read RNA-seq transcript discovery and quantification tool from PacBio and Oxford Nanopore data. Assigns reads to known isoforms, discovers novel transcripts via intron graph algorithms, and quantifies expression at gene and transcript level. Supports reference-guided and reference-free modes. Input: BAM/FASTQ from PacBio HiFi, CLR, or ONT. Output: GTF annotations, read assignments, transcript/gene counts. Python CLI installed via pip or conda.

Thin
0
iTOL

iTOL (Interactive Tree Of Life) — web-based phylogenetic tree display and annotation tool. Visualizes and annotates trees in Newick, Nexus, NHX, and PhyloXML formats. Supports colored nodes/branches, dataset overlays (binary, bar charts, pie charts, color strips, heatmaps, gradient), external links, and publication-quality export (SVG, PDF, PNG, EPS). Batch upload and programmatic access available via itolapi Python package. Use for phylogenomics, microbial ecology, comparative genomics, and evolutionary biology visualizations.

Thin
0
JACKS

JACKS — Joint Analysis of CRISPR/Cas9 Knock-out Screens. Bayesian method for jointly estimating gene essentiality and gRNA efficacy across multiple CRISPR knockout screens. Shares information across cell lines to improve statistical power. Supports negative and positive selection, pre-trained gRNA efficacy references, matched or common controls, FDR thresholding, and pseudo-gene p-value calibration.

Thin
0
JAGS

JAGS (Just Another Gibbs Sampler) — program for Bayesian hierarchical model analysis via Markov Chain Monte Carlo (MCMC) simulation using the BUGS language dialect. Supports Gibbs sampling, slice sampling, and Metropolis-Hastings for continuous, discrete, and mixture models. Python interface via PyJAGS with multichain parallel sampling, HDF5 persistence, and ArviZ integration for diagnostics and visualization. Key distributions include dnorm (precision-parameterized), dgamma, dbeta, dbinom, dpois, dunif, dexp, and dt.

Thin
0
JAX

JAX — Google's high-performance numerical computing library combining NumPy API with automatic differentiation (grad, value_and_grad), just-in-time XLA compilation (jit), automatic vectorization (vmap), and parallelization (pmap). Designed for machine learning research and scientific computing on CPU, GPU, and TPU. Provides a functional programming model with explicit PRNG state, pytree data structures, and composable function transformations for high-performance array computing in biology, genomics, and deep learning.

Thin
0
JBrowse 2

JBrowse 2 — modern React-based genome browser for interactive visualization of genomic data including BAM/CRAM alignments, VCF variants, BED/GFF annotations, BigWig signal tracks, Hi-C contact maps, and synteny/dotplot views. Provides an embeddable React component library (@jbrowse/react-linear-genome-view), a desktop Electron app, and a CLI (@jbrowse/cli) for assembly setup, track configuration, and instance administration. Successor to JBrowse 1.

Thin
0
Jellyfish

Jellyfish — fast, multi-threaded k-mer counter for DNA sequences. Counts exact k-mer frequencies in FASTA/FASTQ files using a lock-free hash table. Used for genome size estimation (flow cytometry-free), sequencing error detection, repeat analysis, read deduplication, and as a preprocessing step for genome assemblers. Outputs k-mer histograms or sorted k-mer databases. Supports streaming input via pipes, gzipped files, and canonical k-mers.

Thin
0
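The canonical k-mer counting and histogram output described in the Jellyfish entry can be shown on a toy sequence. This sketch uses a plain Python `Counter` rather than Jellyfish's lock-free hash table, and the function names are hypothetical. It only illustrates what "canonical" means: a k-mer and its reverse complement are counted under whichever of the two sorts first.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(seq, k):
    """Count canonical k-mers, skipping windows with non-ACGT symbols such as N."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


counts = count_canonical_kmers("ACGTT", 2)
hist = histogram(counts)
```

Here `AC` and `GT` collapse into one canonical k-mer, so the histogram reports two k-mers seen once and one seen twice. Histograms of this shape are the input to the genome size estimates mentioned in the entry.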
JM / JMbayes2

JM / JMbayes2 — Joint models for longitudinal and time-to-event data under the Bayesian framework. Fits shared-parameter joint models via jm() linking mixed-effects longitudinal submodels (lme, mixed_model from GLMMadaptive) with Cox or AFT survival submodels (coxph, survreg). Supports multiple longitudinal outcomes of mixed type (continuous Gaussian/Student-t/Gamma/Beta, binary, ordinal, count Poisson/Negative-Binomial/Beta-Binomial), censored normal, competing risks via crisk_setup(), multi-state processes, recurrent events via rc_setup(), and time-varying effects via tv(). Association structures include value(), slope(), velocity(), acceleration(), area(), with transformation helpers vexpit(), vexp(), vlog(), vsqrt(). Dynamic predictions via predict() with time-dependent ROC (tvROC), AUC (tvAUC), calibration_plot(), calibration_metrics(), Brier score (tvBrier), and cross-validation via create_folds(). MCMC diagnostics via traceplot(), ggtraceplot(), gelman_diag(), densplot(), ggdensityplot(). Model comparison via compare_jm(). Posterior predictive checks via ppcheck(). Penalized estimation with horseshoe/ridge priors for variable selection. Consensus Monte Carlo for large datasets via consensus(). Implements C++ MCMC via Rcpp/RcppArmadillo. Primary reference: Rizopoulos (2012) "Joint Models for Longitudinal and Time-to-Event Data" (Chapman & Hall/CRC). JMbayes2 v0.6-0 (2026-01-28).

Thin
0
Juicebox

Juicebox — interactive visualization and analysis tool for Hi-C and other proximity-ligation chromatin conformation data. Loads .hic files produced by Juicer, Dovetail, or Arima workflows and enables zoom, normalization, and loop/domain annotation at any resolution. Supports A/B compartment analysis, TAD boundary calling, loop annotation with HiCCUPS, and comparison of multiple Hi-C maps. Required for 3D genome visualization, chromatin loop inspection, TAD/compartment exploration, and Hi-C quality control.

Thin
0
Juicer

Juicer — one-click Hi-C data processing pipeline that aligns paired-end FASTQ reads to a reference genome and produces .hic contact matrices for chromatin conformation analysis. Handles alignment (BWA), deduplication, valid-pair parsing, and contact matrix generation via Juicer Tools. Outputs merged_nodups.txt for downstream genome assembly (3D-DNA) and .hic files for loop calling (HiCCUPS), TAD calling (Arrowhead), and visualization (Juicebox). Supports MboI, DpnII, HindIII, Arima, and no-enzyme (micro-C) digestion protocols.

Thin
0
Kaiju

Kaiju is a fast taxonomic classifier for metagenomic sequencing reads using protein-level sequence matching. Translates DNA reads into amino acid sequences and compares against NCBI protein reference databases (RefSeq, nr, progenomes) using the Burrows-Wheeler transform. Use for taxonomic profiling of shotgun metagenomics, viral metagenomics, environmental DNA classification, or comparing protein-level vs nucleotide-level classification approaches (Kaiju vs Kraken2).

Thin
0
karyoploteR

karyoploteR — R/Bioconductor package for creating publication-quality karyotype plots and genome-wide data visualization along chromosome ideograms. Supports plotting points, lines, bars, links, rainfall plots, Manhattan plots, copy number segments, coverage tracks, and gene/region labels on customizable chromosome backgrounds. Ideal for visualizing GWAS results, structural variants, CNVs, and multi-track genomic annotations.

Thin
0
kb-python

kb-python — Python wrapper for the kallisto|bustools (kb) workflow for single-cell RNA-seq pre-processing. Runs kb ref to build or download reference indices and kb count to generate count matrices from raw FASTQ files. Supports 10x Chromium v1/v2/v3, Drop-seq, inDrop, and other single-cell technologies. Produces h5ad, loom, and sparse mtx outputs. Standard tool for kallisto bus-based single-cell quantification pipelines, RNA velocity (lamanno/nac workflows), and modular scRNA-seq pre-processing.

Thin
0
KEGG Database

KEGG Database — Direct REST API access to KEGG (Kyoto Encyclopedia of Genes and Genomes) for biological pathway analysis, gene-pathway mapping, metabolic networks, drug interactions, and cross-database ID conversion. Core API operations: kegg info, kegg list, kegg find, kegg get, kegg conv, kegg link, kegg ddi. Access at rest.kegg.jp. Academic use only. For Python workflows needing multiple databases, prefer bioservices. Use this skill for direct HTTP/REST work or KEGG-specific pathway and gene queries.

Thin
0
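
For the KEGG Database entry above, the listed operations share one URL pattern on rest.kegg.jp: the operation name followed by slash-separated arguments. The sketch below only builds request URLs and makes no network calls. The helper name `kegg_url` is hypothetical, and the identifiers (`hsa`, `hsa:7157`) are illustrative inputs.

```python
from urllib.parse import quote

KEGG_BASE = "https://rest.kegg.jp"
# Operations named in the catalog entry.
KEGG_OPERATIONS = {"info", "list", "find", "get", "conv", "link", "ddi"}


def kegg_url(operation, *args):
    """Build a KEGG REST URL of the form <base>/<operation>/<arg1>/<arg2>/..."""
    if operation not in KEGG_OPERATIONS:
        raise ValueError(f"unknown KEGG operation: {operation!r}")
    if not args:
        raise ValueError("at least one argument is required")
    # Leave ':' and '+' unescaped: KEGG identifiers contain ':' (hsa:7157)
    # and '+' joins several entries in one request.
    path = "/".join(quote(str(a), safe=":+") for a in args)
    return f"{KEGG_BASE}/{operation}/{path}"


if __name__ == "__main__":
    print(kegg_url("list", "pathway", "hsa"))          # pathways for an organism
    print(kegg_url("get", "hsa:7157"))                 # one gene entry
    print(kegg_url("conv", "ncbi-geneid", "hsa:7157")) # ID conversion
    print(kegg_url("link", "pathway", "hsa:7157"))     # gene-to-pathway links
```

The resulting strings can be passed to any HTTP client; responses are plain text, so parsing is left to the caller.
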
KING

KING — robust kinship estimation and relationship inference for genome-wide SNP data. Computes pairwise kinship coefficients, infers family relationships (MZ twins, parent-offspring, full siblings, 2nd/3rd-degree relatives, unrelated), detects pedigree errors, performs ancestry inference via principal components, and identifies population structure. Essential QC tool for GWAS, biobank, and family-based genetic studies.

Thin
0
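
For the KING entry above, relationship inference can be read as binning the estimated kinship coefficient. The sketch below uses the cut-offs given in the KING documentation (0.354, 0.177, 0.0884, 0.0442). The function name `classify_kinship` is hypothetical, and this is not KING's own code.

```python
def classify_kinship(phi):
    """Map an estimated kinship coefficient to a relationship degree.

    Kinship alone cannot separate parent-offspring from full siblings
    (both are 1st-degree); that distinction needs IBS0/IBD sharing,
    which this sketch does not model.
    """
    if phi > 0.354:
        return "duplicate/MZ twin"
    if phi >= 0.177:
        return "1st-degree"
    if phi >= 0.0884:
        return "2nd-degree"
    if phi >= 0.0442:
        return "3rd-degree"
    return "unrelated"


if __name__ == "__main__":
    # Theoretical kinship values for each relationship class.
    for phi in (0.5, 0.25, 0.125, 0.0625, 0.0):
        print(phi, classify_kinship(phi))
```
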
Kipoi

Kipoi — unified Python API and CLI for 2000+ pre-trained machine learning models for genomic sequence analysis. Load, run, and interpret models for variant effect prediction, transcription factor binding, RNA splicing, chromatin accessibility, and gene expression from a shared model zoo. Supports Keras, TensorFlow, PyTorch, and scikit-learn backends. Use for applying published genomics ML models without retraining, variant scoring, and DeepLIFT/in-silico mutagenesis interpretation of sequence models.

Thin
0
KMC (K-Mer Counter)

KMC — high-performance disk-based k-mer counting and set operations for genomic sequences. Counts k-mers in FASTQ, FASTA, BAM, and KMC database files with minimal memory usage via disk-based partitioning. Provides kmc_tools for union, intersection, subtraction, and histogram operations on k-mer databases. Used for genome size estimation, error correction preprocessing, contamination screening, and metagenomic profiling.

Thin
0
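
For the KMC entry above, the following in-memory sketch shows what canonical k-mer counting and the set operations mean. It does not reproduce KMC's disk-based partitioning, and the union, intersection, and subtraction shown are simplified stand-ins for what kmc_tools offers (the counter-combination rules here are an assumption of this sketch).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seq, k):
    """Count canonical k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


if __name__ == "__main__":
    a = count_kmers("ACGTACGT", 3)
    b = count_kmers("ACGTTTTT", 3)
    union = a + b          # k-mers in either set, counts summed
    intersection = a & b   # k-mers in both sets, minimum count kept
    subtraction = Counter({kmer: c for kmer, c in a.items() if kmer not in b})
    print(dict(a), dict(b))
    print(dict(union), dict(intersection), dict(subtraction))
```
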
KMCP

Use when working with kmcp — KMCP — coverage-based metagenomic sequence

Thin
0
KneadData

KneadData — quality control and host decontamination tool for metagenomic

Thin
0
Kraken 2

Kraken 2 — fast k-mer-based taxonomic sequence classifier for metagenomics. Assigns NCBI taxonomic labels to DNA/protein sequences using exact k-mer matching against a minimizer-indexed hash table. Supports standard NCBI databases, custom databases, 16S rRNA databases (Greengenes, SILVA, GTDB), and protein-level translated searching. Produces per-read classification output and aggregate sample reports compatible with Pavian, Bracken, and KrakenTools for downstream abundance estimation.

Thin
0
KrakenUniq

Use when working with krakenuniq — krakenUniq — metagenomics taxonomic

Thin
0
Krona

Krona — interactive HTML pie chart visualization of hierarchical taxonomic data from metagenomics and sequence classification. Part of KronaTools, which provides importers for Kraken2, BLAST, Diamond, MetaPhlan, MEGAN, and tab-delimited text. Generates zoomable, multi-level taxonomy charts viewable in any browser. Key commands: ktImportText, ktImportBLAST, ktImportKraken, ktImportTaxonomy. Use for metagenomic taxonomy visualization, BLAST result exploration, and diversity summary.

Thin
0
Laspy

Laspy — Python library for reading, writing, and manipulating LAS and LAZ (compressed) LIDAR point cloud files conforming to ASPRS LAS specification versions 1.0 through 1.4. Supports chunked I/O for large datasets, LAZ compression via lazrs or laszip backends, extra byte dimensions, VLRs, COPC (Cloud Optimized Point Cloud), and numpy-based point attribute access. Essential for remote sensing, forestry, terrain modeling, and geospatial point cloud processing pipelines.

Thin
0
LAST

LAST — adaptive-seed sequence aligner for genomes, long reads, and proteins. Uses lastdb to build databases, last-train to learn substitution/gap rates, lastal for alignment, last-split for rearrangement/splice detection, last-pair-probs for paired reads, and maf-convert for format conversion. Handles AT-rich DNA, DNA-protein frameshifts, and genome-genome comparison.

Thin
0
LDpred2

LDpred2 — Bayesian polygenic risk score (PRS) method implemented in R via the bigsnpr package. Computes polygenic scores from GWAS summary statistics using four models: LDpred2-inf (infinitesimal), LDpred2-grid (grid of p and h2), LDpred2-auto (automatic hyperparameter estimation), and lassosum2. Use when working with ldpred2, bigsnpr, polygenic scores, PRS, genetic risk prediction, or Bayesian shrinkage of GWAS effect sizes.

Thin
0
LDSC

LD Score Regression (LDSC) for estimating SNP heritability, genetic correlation, and partitioned heritability from GWAS summary statistics. Use when working with ldsc, LD score regression, heritability estimation, genetic correlation analysis, cell-type-specific enrichment, or partitioned h2.

Thin
0
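
For the LDSC entry above, the core relation is that the expected chi-square statistic of a SNP grows linearly with its LD score: E[chi2] = (N * h2 / M) * l + intercept. The sketch below recovers h2 from the slope with an unweighted least-squares fit. This is a deliberate simplification: LDSC itself uses a weighted regression with block-jackknife standard errors. The function name `ldsc_h2` and the toy numbers are hypothetical.

```python
def ldsc_h2(ld_scores, chi2, n_samples, n_snps):
    """Estimate (h2, intercept) by ordinary least squares of chi2 on LD score."""
    n = len(ld_scores)
    if n < 2 or n != len(chi2):
        raise ValueError("need two or more paired LD scores and chi2 values")
    mean_l = sum(ld_scores) / n
    mean_c = sum(chi2) / n
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    # slope = N * h2 / M, so h2 = slope * M / N
    return slope * n_snps / n_samples, intercept


if __name__ == "__main__":
    # Noiseless toy data generated with h2 = 0.4, N = 50,000, M = 1,000,000.
    ld = [10.0, 50.0, 100.0, 200.0]
    chi2 = [1.0 + 0.02 * l for l in ld]
    print(ldsc_h2(ld, chi2, n_samples=50_000, n_snps=1_000_000))
```

An intercept well above 1 is usually read as a sign of confounding such as population stratification, which is why the intercept is returned alongside h2.
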
Leafcutter

Leafcutter — annotation-free RNA splicing quantification from short-read RNA-seq data. Detects differential intron usage, alternative splicing events, outlier splicing (LeafcutterMD), and splicing QTLs by analyzing split reads spanning introns. Uses Dirichlet-multinomial GLM for statistical testing. Scales to thousands of samples. Includes leafviz Shiny visualization app. Works with BAM files via regtools junction extraction. R package + Python clustering scripts.

Thin
0
LEfSe

Use when working with lefse — LEfSe (Linear discriminant analysis Effect

Thin
0
LIANA

LIANA (LIgand-receptor ANAlysis) — unified Python framework for cell-cell communication inference from single-cell transcriptomics. Wraps multiple methods (CellPhoneDB, NATMI, Connectome, SingleCellSignalR, logFC, CellChat) under a consistent API using AnnData objects. Generates consensus rankings via robust rank aggregation to reduce method-specific biases. Supports multi-condition comparison via LIANA+, tensor decomposition with Tensor-cell2cell, and spatial CCC with MISTY integration. Part of the scverse/liana-py ecosystem for ligand-receptor interaction scoring.

Thin
0
PDAL

PDAL (Point Data Abstraction Library) — C/C++ toolkit for translating, filtering, and processing point cloud data from LiDAR, photogrammetry, and depth sensors. Provides JSON pipeline architecture with 40+ readers, 80+ filters, and 25+ writers. Supports LAS/LAZ, COPC, E57, PLY, PCD formats. Key capabilities include ground classification (SMRF, PMF, CSF), noise removal, coordinate reprojection, decimation, height-above-ground, surface reconstruction, and batch tiling. The standard point cloud processing engine for remote sensing, forestry, terrain modeling, and earth observation.

Thin
0
lifelines

lifelines — Python survival analysis library for time-to-event data. Fits Kaplan-Meier curves with KaplanMeierFitter(), Nelson-Aalen cumulative hazard with NelsonAalenFitter(), semiparametric Cox proportional hazards with CoxPHFitter(), time-varying Cox models with CoxTimeVaryingFitter(), accelerated failure time models (WeibullAFTFitter, LogNormalAFTFitter, LogLogisticAFTFitter, GeneralizedGammaRegressionFitter), piecewise exponential regression, and Aalen additive hazards. Supports left, right, and interval censoring, late entry (left truncation), formula API via formulaic, elastic net regularization (L1/L2), stratified Cox models, cluster-robust standard errors, log-rank tests via logrank_test(), proportional hazards assumption testing via check_assumptions(), residual diagnostics (martingale, deviance, Schoenfeld, score, delta_beta), concordance index (C-statistic), AIC model comparison, calibration plots, at-risk tables, and partial effect plots. Built on pandas DataFrames with matplotlib plotting. The primary Python library for classical survival analysis in clinical trials, epidemiology, and reliability engineering.

Thin
0
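
For the lifelines entry above, the Kaplan-Meier curve is the product-limit estimate: at each event time, survival is multiplied by (1 - deaths / number at risk). The dependency-free sketch below illustrates that calculation only. It is not lifelines code, and the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: follow-up time per subject
    events:    1/True if the event was observed, 0/False if right-censored
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


if __name__ == "__main__":
    # Four subjects; the third is censored at time 2.
    for time, surv in kaplan_meier([1, 2, 2, 3], [1, 1, 0, 1]):
        print(time, round(surv, 3))
```

The censored subject leaves the risk set without lowering the curve, which is the behaviour that separates this estimator from a naive proportion of survivors.
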
Liftoff

Liftoff — accurate genome annotation liftover tool that maps GFF/GTF annotations between assemblies of the same or closely-related species using Minimap2 alignment. Supports gene copy detection, ORF validation, annotation polishing, and chromosome-by-chromosome lifting without requiring pre-generated chain files. Essential for genome assembly projects, comparative genomics, and annotation transfer pipelines.

Thin
0
LIGER

liger (rliger) -- R package for integrative non-negative matrix factorization (iNMF) of single-cell multi-omic data. Integrates scRNA-seq, scATAC-seq, spatial transcriptomics, and methylation datasets across batches, modalities, and species. Identifies shared and dataset-specific expression programs (factors). Key functions: createLiger, normalize, selectGenes, scaleNotCenter, optimizeALS/runIntegration, quantileNorm, runUMAP, runWilcoxon. Supports online learning for large-scale integration and UINMF for unshared features.

Thin
0
limma-voom

limma-voom — linear models for differential expression analysis of RNA-seq and microarray data. The voom transformation converts RNA-seq read counts to log-CPM with precision weights derived from the mean-variance trend, enabling limma's powerful linear modeling framework (lmFit, contrasts.fit, eBayes) to analyze RNA-seq data with the same flexibility as microarrays. Provides empirical Bayes moderation of gene-wise variances, support for complex experimental designs with arbitrary contrasts, gene set testing (camera, fry, roast), gene ontology analysis (goana, kegga), and diagnostic visualizations (plotMD, plotSA, plotMDS). Part of the Bioconductor ecosystem and integrates with edgeR for normalization and filtering.

Thin
0
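
For the limma-voom entry above, the first step of the voom transformation is the log-CPM value log2((count + 0.5) / (library size + 1) * 1e6). The sketch below computes only that quantity for one sample. The precision weights from the mean-variance trend are not modelled, and the function name `voom_log_cpm` is hypothetical.

```python
import math


def voom_log_cpm(counts, lib_size=None):
    """Log2 counts-per-million with the 0.5 / 1 offsets used by voom.

    counts:   read counts for the genes of one sample
    lib_size: library size; defaults to the sum of counts
    """
    if lib_size is None:
        lib_size = sum(counts)
    return [math.log2((c + 0.5) * 1e6 / (lib_size + 1.0)) for c in counts]


if __name__ == "__main__":
    sample = [0, 10, 100, 1000]
    print([round(v, 2) for v in voom_log_cpm(sample)])
```

The 0.5 offset keeps zero counts finite on the log scale, and adding 1 to the library size keeps the ratio strictly below one million.
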
LINX

LINX — structural variant annotation and visualization tool from the Hartwig Medical Foundation hmftools suite. Interprets structural variants and copy number data to classify driver events including gene fusions, homozygous deletions, viral insertions, and complex genomic rearrangements in cancer whole-genome sequencing data. Requires PURPLE copy number and GRIPSS filtered SV output as input.

Thin
0
LJA

LJA (La Jolla Assembler) — de novo genome assembler for PacBio HiFi (CCS) long reads using a multiplex de Bruijn graph that simultaneously spans multiple k-mer lengths. Produces chromosome-scale, highly contiguous haploid assemblies from HiFi data. Outputs Assembly.fasta and Assembly.gfa. Developed by the Pevzner Lab at UC San Diego. Comparable to Hifiasm and HiCanu for HiFi assembly; preferred when fine-grained repeat resolution via multiplex dbg is desired.

Thin
0
lme4

lme4 — R package for fitting linear, generalized linear, and nonlinear mixed-effects models using Eigen-based sparse matrix methods. Core functions: lmer() for linear mixed models (LMMs), glmer() for generalized linear mixed models (GLMMs) with user-specified families/links, and nlmer() for nonlinear mixed models. Supports arbitrary nesting and crossing of random effects via formula syntax (1|group), (slope|group), (1|g1/g2). Implements REML/ML estimation, Laplace approximation, adaptive Gauss-Hermite quadrature, likelihood profiling, parametric bootstrapping, and prediction with new random-effect levels. Pairs with lmerTest for Satterthwaite/KR p-values.

Thin
0
locfdr

locfdr — Efron's empirical Bayes local false discovery rate estimation from z-score vectors. Computes per-test posterior probability of being null using a mixture model approach that fits the empirical density f(z) and a null sub-density f0(z). Core method for large-scale simultaneous hypothesis testing (RNA-seq, GWAS, proteomics, ChIP-seq) where thousands of z-statistics are produced. Supports theoretical N(0,1) null, MLE-estimated null, central matching null, or split-normal null. Provides local FDR estimates (per-test), expected FDR for non-null cases, and diagnostic plots. Implements Efron (2004, 2007) methodology for empirical null estimation which accounts for unobserved correlation and other departures from the theoretical null.

Thin
0
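
For the locfdr entry above, the local false discovery rate at z is fdr(z) = pi0 * f0(z) / f(z). The sketch below evaluates that ratio for a toy two-component normal mixture whose densities are known. In practice the package estimates f(z) and the null from the data, which this sketch does not attempt. All names and parameter values here are hypothetical.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z, pi0, f0, f):
    """Posterior probability that a test with statistic z is null."""
    return min(1.0, pi0 * f0(z) / f(z))


if __name__ == "__main__":
    pi0 = 0.9  # assumed null proportion for the toy example

    def f0(z):
        return normal_pdf(z)  # theoretical N(0, 1) null

    def f(z):
        # Toy mixture: 90% null, 10% non-null centred at z = 3.
        return pi0 * f0(z) + (1.0 - pi0) * normal_pdf(z, mean=3.0)

    for z in (0.0, 2.0, 4.0):
        print(z, round(local_fdr(z, pi0, f0, f), 4))
```
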
LocusZoom

LocusZoom — regional association plot generator for GWAS and fine-mapping results. Creates publication-quality locus zoom plots showing -log10(p-value) vs genomic position with LD coloring, recombination rate tracks, and gene annotations. Supports command-line static plots (PNG/PDF) and interactive browser-based visualization via LocusZoom.js. Used in post-GWAS analysis, fine-mapping, colocalization, and credible set visualization workflows.

Thin
0
LOFTEE

Use when working with loftee — LOFTEE (Loss-Of-Function Transcript Effect

Thin
0
LongQC

LongQC — quality control tool for long-read sequencing data from Oxford Nanopore and PacBio platforms. Generates comprehensive QC reports with read length distributions, per-read quality scores, coverage analysis, and sample contamination detection. Uses minimap2 internally for sample-vs-sample alignment. Supports platform-adaptive profiles for ONT and PacBio HiFi/CLR. Produces HTML reports and TSV summary tables. Developed at RIKEN.

Thin
0
loompy

Use when working with loompy — loompy — Python library for reading, writing,

Thin
0
LSHTM Pipelines

LSHTM PathogenSeq bioinformatics pipelines for pathogen whole-genome sequencing analysis, including TB-Profiler for Mycobacterium tuberculosis lineage and drug-resistance prediction from WGS data, Malaria-Profiler for Plasmodium species identification and antimalarial resistance detection, and the shared pathogen-profiler Python library. Covers Illumina and Nanopore workflows for tropical disease surveillance, AMR profiling, and molecular epidemiology.

Thin
0
LUMPY

LUMPY — probabilistic structural variant discovery from paired-end short-read sequencing. Detects deletions, duplications, inversions, and translocations from BWA-MEM aligned BAM files using discordant paired-end reads and split reads. Supports lumpyexpress (automated) and lumpy (manual) modes. Use for germline or somatic SV calling from Illumina WGS or WES data.

Thin
0
MACS2/MACS3

MACS2/MACS3 — Model-based Analysis of ChIP-Seq for identifying transcription factor binding sites and histone modification enrichment from ChIP-seq, ATAC-seq, and CUT&Tag data. Provides peak calling (callpeak), signal track generation (bdgcmp), fragment size modeling (predictd), duplicate filtering (filterdup), and ATAC-seq-specific HMM-based nucleosome positioning (hmmratac, MACS3 only). The standard peak caller in ENCODE and most epigenomics pipelines.

Thin
0
MAFFT

MAFFT — multiple sequence alignment for nucleotide and protein sequences. Implements progressive (FFT-NS-1, FFT-NS-2), iterative refinement (FFT-NS-i), and consistency-based iterative methods (L-INS-i, G-INS-i, E-INS-i) for different accuracy/speed trade-offs. Handles datasets from a few sequences to hundreds of thousands using PartTree guide-tree construction. Supports fragment addition (--addfragments), profile alignment (--addprofile), RNA structural alignment, and viral genome alignment with --6merpair. Use when aligning sequences, building phylogenies, or adding sequences to existing alignments.

Thin
0
Maftools

Maftools — R package for analyzing somatic mutations from cancer sequencing (MAF files). Provides oncoplots for mutation visualization, clonality analysis, tumor burden estimation, mutational signature detection, pathway enrichment, survival association, GISTIC analysis, drug-gene interactions, and clinical correlation for cohort-level cancer genomics. Works with TCGA MAF, institutional sequencing, and copy number variation data.

Thin
0
MAGeCK -- Model-based Analysis of Genome-wide CRISPR-Knockout

MAGeCK (Model-based Analysis of Genome-wide CRISPR-Knockout) -- computational pipeline for CRISPR genetic screen analysis. Performs sgRNA counting from FASTQ files (mageck count), gene-level essentiality testing via robust rank aggregation (mageck test), multi-condition maximum likelihood estimation (mageck mle), and pathway enrichment analysis (mageck pathway). Works with CRISPR knockout, CRISPRi, and CRISPRa screens. Accepts FASTQ or count table input, produces ranked gene lists with RRA scores, beta scores, FDR values, and pathway results.

Thin
0
MAGeCK-VISPR

MAGeCK-VISPR — comprehensive CRISPR screen analysis framework combining MAGeCK statistical testing (RRA and MLE algorithms) with VISPR interactive visualization. Supports sgRNA count generation from FASTQ, gene-level ranking, pathway enrichment, and quality control for genome-wide CRISPR knockout, activation (CRISPRa), and inhibition (CRISPRi) screens. Essential for functional genomics CRISPR library screening analysis.

Thin
0
MAGMA

MAGMA (Multi-marker Analysis of GenoMic Annotation) for gene-level and gene-set analysis of GWAS summary statistics. Use when performing gene analysis from SNP p-values, gene-set enrichment testing, competitive gene-set analysis, tissue expression analysis, or pathway enrichment from GWAS data.

Thin
0
MAJIQ

MAJIQ — Modeling Alternative Junction Inclusion Quantification for detecting, quantifying, and visualizing local splicing variations (LSVs) from RNA-Seq data. Builds splice graphs from BAM files and GFF3 annotations, quantifies PSI and delta-PSI with Bayesian Dirichlet-multinomial model, and visualizes via Voila (TSV, modulize, interactive HTML). Key subcommands: majiq build, majiq psi, majiq deltapsi, majiq heterogen, voila tsv, voila modulize, voila view. Input: sorted BAM + GFF3 annotation + settings INI. Output: .majiq files, .voila files, splicegraph.sql, TSV tables. Use for differential splicing analysis, LSV quantification, and splice graph visualization.

Thin
0
MAKER

MAKER — portable genome annotation pipeline that integrates ab initio gene predictors (SNAP, Augustus, GeneMark), protein and EST evidence alignment, and repeat masking to produce GFF3 gene models with AED quality scores. Supports iterative training, multiple evidence types, and MPI parallelism. Standard tool for eukaryotic genome annotation in non-model organisms.

Thin
0
MalariaGEN Data

MalariaGEN data Python package for accessing and analysing malaria genomic epidemiology data. Provides cloud-native APIs for Anopheles mosquito vectors (Ag3, Af1, Amin1, Adir1) and Plasmodium parasites (Pf7, Pf8, Pv4) with built-in population genetics tools including SNP calling, allele frequency analysis, PCA, FST, haplotype clustering, selection scans (H12, iHS, XP-EHH), copy number variation, and interactive geospatial visualisation. Data is hosted in Google Cloud Storage and accessed via zarr arrays without local download.

Thin
0
Manta

Manta — structural variant (SV) and indel caller that discovers, assembles, and scores large-scale genomic rearrangements from paired-end sequencing reads. Detects deletions, insertions, inversions, tandem duplications, and interchromosomal translocations in germline, tumor/normal somatic, and RNA-Seq data. Outputs scored VCF files with paired-read and split-read evidence fields. Use when user asks about structural variant calling, SV discovery, somatic SV analysis, or Manta configuration from BAM/CRAM files.

Thin
0
Mash

Mash — fast genome and metagenome distance estimation using MinHash sketching. Estimates pairwise distances between genomic sequences (FASTA/FASTQ) without full alignment using k-mer sketches. Supports whole-genome comparison, all-vs-all distance matrices, metagenome containment screening (mash screen), genome clustering, and species-level identification. Key commands: sketch, dist, screen, paste, info, triangle. Use for rapid phylogenetics, genome clustering, outbreak detection, and database containment queries.

Thin
0
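
For the Mash entry above, the idea is to keep only the smallest hash values of each sequence's k-mers (a bottom-s sketch), estimate the Jaccard index j from two sketches, and convert it to the Mash distance D = -(1/k) * ln(2j / (1 + j)). The sketch below is an illustration, not Mash's implementation: it uses BLAKE2b from the standard library as a stand-in hash, and the function names and toy sequences are hypothetical.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sketch(seq, k=21, size=1000):
    """Return the `size` smallest hashes of the canonical k-mers in seq."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        canon = min(kmer, kmer.translate(_COMPLEMENT)[::-1])
        digest = hashlib.blake2b(canon.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard from the `size` smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate to the Mash distance; 1.0 when nothing is shared."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


if __name__ == "__main__":
    k, size = 5, 20  # tiny values so the toy sequences produce enough k-mers
    a = sketch("ACGTTGCATGCCGATAGCTAGGCTAACGGATTC", k, size)
    b = sketch("ACGTTGCATGCCGATAGCTAGGCTAACGGTTTC", k, size)
    print(mash_distance(jaccard_estimate(a, b, size), k))
```
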
Mashtree

Mashtree — rapid distance-based phylogenetic tree construction from genome assemblies using MinHash (Mash) distances. Creates neighbor-joining trees from FASTA assemblies, FASTQ reads, or GenBank files without requiring multiple sequence alignment. Used for rapid outbreak investigation, species clustering, and phylogenetic screening of bacterial genomes. Install with conda install -c bioconda mashtree.

Thin
0
MASS

MASS — Modern Applied Statistics with S. R package providing negative binomial GLMs via glm.nb(), robust linear models via rlm() with Huber/bisquare/Hampel psi functions, linear and quadratic discriminant analysis via lda()/qda(), ordinal regression via polr(), ridge regression via lm.ridge(), stepwise model selection via stepAIC(), Box-Cox transformations via boxcox(), maximum-likelihood distribution fitting via fitdistr(), multivariate normal simulation via mvrnorm(), two-dimensional kernel density estimation via kde2d(), generalized matrix inverse via ginv(), resistant regression via lqs(), log-linear models via loglm(), correspondence analysis via corresp()/mca(), and 100+ classic datasets. Companion to Venables & Ripley "Modern Applied Statistics with S" (4th ed., Springer, 2002). Ships as a recommended package with every R installation.

Thin
0
MaSuRCA

MaSuRCA (Maryland Super-Read Celera Assembler) — hybrid genome assembler combining Illumina short reads with PacBio or Oxford Nanopore long reads using super-read and mega-read technology. Performs automatic error correction, k-mer counting, super-read construction, and Celera/CABOG-based assembly. Supports Illumina-only, hybrid (short+long), and chromosome-scale scaffolding with SAMBA. Suitable for bacterial through mammalian-scale genomes.

Thin
0
MatchIt

MatchIt — R package for nonparametric preprocessing via propensity score matching in observational studies. Implements nearest neighbor, optimal, full, generalized full, genetic, exact, coarsened exact, cardinality matching, and subclassification. Core API: matchit() to match, summary() for covariate balance diagnostics (SMD, variance ratios, eCDF), match.data()/get_matches() to extract matched datasets with weights, and plot() for Love plots and distributional balance visualization. Supports ATT, ATE, ATC estimands with caliper, replacement, and ratio options.

Thin
0
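
For the MatchIt entry above, nearest neighbor matching can be sketched as a greedy loop: each treated unit takes the closest unused control by propensity score, optionally rejecting pairs outside a caliper. This is a simplified illustration and not how matchit() is implemented. It assumes propensity scores are already estimated, processes treated units in input order, and applies the caliper in raw score units. All names are hypothetical.

```python
def nearest_neighbor_match(treated, control, caliper=None):
    """Greedy 1:1 matching without replacement.

    treated, control: dicts mapping unit id -> propensity score
    caliper:          maximum allowed absolute score difference (optional)
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not yet used
    pairs = []
    for t_id, t_score in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # no acceptable control; the treated unit stays unmatched
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs


if __name__ == "__main__":
    treated = {"t1": 0.80, "t2": 0.30}
    control = {"c1": 0.75, "c2": 0.35, "c3": 0.10}
    print(nearest_neighbor_match(treated, control))
    print(nearest_neighbor_match(treated, control, caliper=0.01))
```

Because the loop is greedy, the result depends on the order of treated units, which is one reason optimal and full matching exist as alternatives.
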
Matplotlib

Matplotlib — comprehensive Python library for creating static, animated, and interactive visualizations. Provides publication-quality figures with fine-grained control over every plot element via the pyplot interface and object-oriented API. Supports line plots, scatter plots, bar charts, histograms, heatmaps, 3D plots, image display, and custom layouts. Foundation of the Python scientific visualization ecosystem used by pandas, seaborn, scanpy, and most bioinformatics tools.

Thin
0
matUtils

Use when working with matUtils — a toolkit for querying, interpreting, and manipulating mutation-annotated trees (MATs) in protobuf format. Part of the UShER project. Supports subcommands for tree summarization, subtree extraction, clade annotation, placement uncertainty analysis, and geographic introduction inference. Handles conversions between protobuf, VCF, Newick, and Auspice JSON. Essential for SARS-CoV-2 genomic surveillance, phylogenetic analysis, and large-scale pathogen evolution studies.

Thin
0
MaxBin2

MaxBin2 — metagenomic binning tool that recovers metagenome-assembled genomes (MAGs) from assembled contigs using an expectation-maximization algorithm with marker gene probabilities and coverage information. Requires assembled contigs in FASTA format and optional abundance (coverage) files. Used for microbial genome recovery, metagenome binning, and environmental genomics.

Thin
0
MaxQuant

MaxQuant — quantitative proteomics platform for analyzing large-scale mass spectrometry data. Integrates the Andromeda search engine for peptide identification, supports label-free quantification (LFQ), SILAC, TMT/iTRAQ, and DIA analysis via MaxDIA. Produces proteinGroups.txt, peptides.txt, evidence.txt, and msms.txt from raw LC-MS/MS data. Configuration through mqpar.xml parameter files enables reproducible proteomics workflows.

Thin
0
MCIA

MCIA (Multiple Co-Inertia Analysis) — R/Bioconductor omicade4 package for simultaneous dimensionality reduction across multiple paired omics datasets. Identifies co-structure axes (co-inertia) that are maximally covariant across data blocks (e.g., mRNA, miRNA, proteomics) measured on the same samples. Produces per-dataset loadings, global sample scores, and a discordance metric (RV coefficient) to flag samples with inconsistent multi-omics profiles. Use when users need to integrate two or more omics matrices sharing the same samples, visualize multi-omics co-variation, identify driver features per layer, or detect outlier samples across data blocks.

Thin
0
mclust

mclust — R package for Gaussian finite mixture models fitted via EM algorithm providing model-based clustering (automatic cluster number and covariance structure selection via BIC/ICL across 14 parameterizations from EII to VVV), classification (MclustDA discriminant analysis, EDDA, semi-supervised MclustSSC), density estimation (densityMclust with diagnostic plots), dimension reduction (MclustDR via GMMDR directions), bootstrap inference, and model-based agglomerative hierarchical clustering initialization.

Thin
0
MCMCglmm

MCMCglmm — Bayesian Multivariate Generalised Linear Mixed Models via MCMC in R. Fit complex hierarchical models with MCMCglmm() supporting 15+ response families (gaussian, poisson, categorical, ordinal, multinomial, exponential, geometric, cengaussian, cenpoisson, cenexponential, zipoisson, zapoisson, zibinomial, threshold, hupoisson). Prior specification via inverse-Wishart (V, nu) with parameter expansion (alpha.mu, alpha.V) and half-Cauchy priors. Supports pedigree-based animal models and phylogenetic comparative methods via inverseA(), multi-response models with cbind(), measurement error meta-analysis via mev, random regression, antedependence, path analysis via sir(), and multiple membership models via mult.memb(). Posterior diagnostics via coda integration (plot, autocorr.diag, effectiveSize).

Thin
0
MDAnalysis

MDAnalysis — Python library for analyzing molecular dynamics trajectories and atomic coordinate data. Supports 40+ file formats (DCD, XTC, TRR, PDB, GRO, AMBER, LAMMPS). Provides RMSD, RMSF, hydrogen bond analysis, radial distribution functions, contact maps, PCA, solvent accessibility, membrane analysis, and custom atom group selections. Interoperates with NumPy arrays and works with GROMACS, NAMD, AMBER, OpenMM, and LAMMPS trajectories.

Thin
0
MDTraj

MDTraj — Python library for reading, writing, and analyzing molecular dynamics trajectories. Supports 20+ formats (DCD, XTC, TRR, PDB, HDF5, NetCDF, etc.) with NumPy-native arrays. Provides RMSD, RMSF, hydrogen bond analysis, DSSP secondary structure, contact maps, solvent-accessible surface area, and distance/angle/dihedral calculations for biomolecular simulation analysis.

Thin
0
Medaka

medaka -- neural network-based tool from Oxford Nanopore Technologies for creating consensus sequences and calling variants from nanopore sequencing data. Polishes draft assemblies from Flye, miniasm, or canu using basecalled reads aligned with minimap2. Calls SNPs and indels from nanopore reads mapped to a reference genome. Supports R9 and R10 chemistries with model auto-selection. Essential in nanopore assembly and variant calling pipelines.

Thin
0
mediation

mediation — R package for causal mediation analysis implementing parametric and non-parametric methods. Core function mediate() estimates average causal mediation effects (ACME), average direct effects (ADE), and total effects via quasi-Bayesian Monte Carlo or nonparametric bootstrap. Supports binary, continuous, and count outcomes with GLM, GAM, quantile regression, survival, and multilevel (lme4) models. Includes sensitivity analysis (medsens) for sequential ignorability violations, treatment-mediator interaction testing (test.TMint), moderated mediation (test.modmed), multiple mediators (multimed), and instrumental variable mediation (ivmediate).

Thin
0
Meeko

Meeko — Python interface for AutoDock molecular docking preparation. Parameterizes small-molecule ligands and macromolecular receptors into PDBQT format for AutoDock-GPU and AutoDock-Vina. Handles flexible sidechains, reactive docking, macrocycles, and post-docking export of results to SDF/PDB. Essential for structure-based drug discovery and virtual screening pipelines.

Thin
0
MEGA (Molecular Evolutionary Genetics Analysis)

MEGA (Molecular Evolutionary Genetics Analysis) — cross-platform software suite for phylogenetic tree construction, multiple sequence alignment, molecular evolution analysis, and molecular dating. Supports Neighbor-Joining, Maximum Likelihood, Maximum Parsimony, and Minimum Evolution methods. Includes ClustalW and MUSCLE alignment, substitution model selection, ancestral state reconstruction, selection pressure tests, and Timetree calibration. MEGA-CC provides headless command-line execution via .mao analysis options files.

Thin
0
MEGAHIT

MEGAHIT — ultra-fast and memory-efficient NGS assembler for large and complex metagenomes, single genomes, and single-cell data. Uses succinct de Bruijn graph (SdBG) with an iterative multi k-mer strategy to achieve low memory usage while maintaining high assembly quality. Supports paired-end, single-end, and interleaved reads; multiple libraries in one invocation; optional CUDA GPU acceleration; and resumable assemblies with --continue. Produces FASTA contigs (final.contigs.fa) and intermediate FASTG graphs per k-mer iteration. The standard tool for metagenomic assembly in gut, soil, ocean, and clinical microbiome projects.

Thin
0
Megalodon

Megalodon — Oxford Nanopore basecalling-anchored analysis tool for modified base detection (5mC, 5hmC, 6mA) and sequence variant calling from raw nanopore signal (FAST5) data. Provides per-read and per-site modification probabilities, aggregated bedmethyl output, and variant calls in VCF format. Integrates with Guppy and Remora models for signal-level analysis. Use for nanopore methylation calling, modified base detection, and long-read variant calling from raw signal files.

Thin
0
MELD -- Quantifying Perturbation Effects at Single-Cell Resolution

MELD (Manifold Enhancement of Latent Dimensions) -- Python package for quantifying the effect of experimental perturbations at single-cell resolution. Uses graph signal processing to estimate sample-associated density for each cell, enabling identification of cell populations most affected by treatment, knockout, or disease conditions without requiring pre-defined clusters. Works with scRNA-seq data via graphtools and pygsp. Includes VertexFrequencyCluster for spectral clustering of perturbed populations. Part of the Krishnaswamy Lab ecosystem (PHATE, MAGIC, scprep).

Thin
0
MEME Suite

MEME Suite — motif-based sequence analysis toolkit for de novo motif discovery (MEME, STREME), motif enrichment testing (AME, SEA, CentriMo), motif scanning (FIMO, MAST, MCAST), motif comparison (Tomtom), and integrated ChIP-seq (MEME-ChIP) and general enrichment (XSTREME) pipelines. Standard workflow for discovering transcription factor binding motifs from peak-called sequences.

Thin
0
MEME Suite

MEME Suite — comprehensive motif analysis toolkit for DNA, RNA, and protein sequences. Provides de novo motif discovery (MEME, STREME), motif scanning (FIMO), motif comparison (TOMTOM), motif enrichment analysis (AME, SEA), and integrated ChIP-seq pipelines (MEME-ChIP). Supports MEME motif format, minimal MEME format, and position-specific scoring matrices. Used for transcription factor binding site analysis, regulatory element discovery, and ChIP-seq / ATAC-seq peak motif characterization.

Thin
0
Merqury

Merqury — reference-free quality, completeness, and phasing assessment for genome assemblies using k-mer databases built from Illumina reads. Computes QV (Phred-scaled quality value), k-mer completeness, copy-number spectra (CN spectra), and haplotype phasing statistics. Essential QC step for de novo genome assemblies, haplotype-resolved assemblies, and trio binning workflows. Works with diploid, polyploid, and haplotype-phased assemblies.

Thin
0
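The Phred-scaled QV mentioned in the Merqury entry can be illustrated with a short sketch. It assumes the commonly described k-mer survival model, in which the fraction of assembly k-mers that are also found in the reads is raised to the power 1/k to estimate per-base accuracy. The function name `kmer_qv` and its arguments are hypothetical and not part of Merqury itself.

```python
import math

def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Estimate a Phred-scaled consensus QV from k-mer counts.

    asm_only_kmers: k-mers seen in the assembly but absent from the reads
    total_asm_kmers: all k-mers in the assembly
    k: k-mer length used to build the databases
    """
    if k <= 0 or total_asm_kmers <= 0:
        raise ValueError("k and total_asm_kmers must be positive")
    if not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("asm_only_kmers must lie in [0, total_asm_kmers]")
    # Assumed model: a k-mer is correct only if all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    if error_rate == 0:
        return math.inf  # no erroneous k-mers observed
    return -10 * math.log10(error_rate)
```

Fewer assembly-only k-mers give a higher QV, so the value can be compared across assemblies built from the same read set.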
Meryl

Meryl — fast k-mer counting and set-operation toolkit from the Marbl group (Canu/Verkko). Counts canonical or strand-specific k-mers from FASTA/FASTQ inputs into binary meryl databases, then supports histogram, print, filter, and full set-algebra (union, intersect, difference) across databases. Required pre-step for Merqury assembly QV evaluation and Verkko genome assembly.

Thin
0
Mesmer

Mesmer — deep learning-powered whole-cell segmentation for multiplexed tissue imaging. Uses a pre-trained TissueNet model to segment both nuclei and whole cells from multiplex fluorescence images (CODEX, MIBI, IMC, mIF, MERFISH). Accepts nuclear and membrane/cytoplasm channel inputs, returns labeled segmentation masks. Built on DeepCell and TensorFlow. Use for cell segmentation, nuclear segmentation, spatial proteomics preprocessing, and multiplexed imaging analysis pipelines.

Thin
0
MetaBAT2

MetaBAT2 — adaptive binning of metagenomic contigs using empirical probabilistic distances of genome abundance and tetranucleotide frequency. Clusters assembled contigs into genome bins (MAGs) from metagenomic sequencing data. Requires sorted BAM alignment files and assembled contigs in FASTA format. Used for metagenome-assembled genome recovery, microbial community profiling, and environmental genomics.

Thin
0
MetaboAnalyst

MetaboAnalyst — comprehensive R package and web platform for metabolomics data analysis and interpretation. Provides statistical analysis (t-test, ANOVA, PCA, PLS-DA), pathway analysis (KEGG, SMPDB), enrichment analysis (MSEA, QEA), biomarker discovery (ROC curves, random forest), compound identification, time-series analysis, and publication-ready visualization for LC-MS, GC-MS, NMR, and targeted metabolomics workflows.

Thin
0
METAL

METAL for genome-wide meta-analysis of genetic association studies. Combines GWAS summary statistics across cohorts using inverse-variance-weighted or sample-size-weighted fixed-effects schemes. Use when performing GWAS meta-analysis, combining summary statistics, heterogeneity testing, or genomic control correction.

Thin
0
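The inverse-variance-weighted fixed-effects scheme named in the METAL entry follows a textbook formula, sketched below in plain Python for a single variant. This is an illustration of the arithmetic only, not METAL's own code or input format; the function name `ivw_meta` is hypothetical.

```python
import math

def ivw_meta(betas, ses):
    """Combine per-cohort effect sizes with inverse-variance weights.

    betas: effect estimates for one variant, one per cohort
    ses:   matching standard errors
    Returns (pooled beta, pooled standard error, z-score).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and equal length")
    # Each cohort is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se
```

Cohorts with smaller standard errors dominate the pooled estimate, which is why this scheme needs effect sizes on a common scale across studies.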
MetaPhlAn 4 / StrainPhlAn 4

MetaPhlAn 4 + StrainPhlAn 4 — marker-gene-based metagenomic profiling and strain-level phylogenetics. MetaPhlAn profiles microbial communities at species level from shotgun metagenomes using ~5.1M unique clade-specific markers from the SGB database (Jan 2024, mpa_vJun23_CHOCOPhlAnSGB_202403). StrainPhlAn reconstructs strain-level phylogenies from consensus marker sequences extracted during MetaPhlAn profiling. Use for taxonomic profiling, relative abundance estimation, strain tracking, and transmission analysis.

Thin
0
metaSPAdes

metaSPAdes — de novo metagenomic assembler from the SPAdes suite using iterative de Bruijn graphs with multiple k-mer sizes optimized for uneven-coverage metagenomes. Assembles short-read (Illumina paired-end) and hybrid (short + long) metagenomic datasets into contigs and scaffolds. Used in metagenomics pipelines before binning (MetaBAT2, SemiBin, CONCOCT) and annotation. Competes with MEGAHIT and IDBA-UD. Invoked as `metaspades.py` or `spades.py --meta`.

Thin
0
MetaWRAP

MetaWRAP — flexible pipeline for genome-resolved metagenomic data analysis. Wraps read QC (trimming, host removal), metagenomic assembly (MEGAHIT, metaSPAdes), binning (MaxBin2, metaBAT2, CONCOCT), bin refinement, bin reassembly, taxonomy classification, abundance quantification, and functional annotation. Used for microbiome studies including gut, soil, and aquatic metagenomics.

Thin
0
methylclock

methylclock — R/Bioconductor package for computing DNA methylation-based biological age estimates using epigenetic clocks. Implements Horvath (2013), Hannum (2013), Levine/PhenoAge (2018), Zhang (2019), Skin&Blood, and DunedinPACE clocks from Illumina 450k/EPIC array beta-value matrices. Produces chronological age predictions, biological age acceleration metrics, and telomere length estimates for population-level aging studies, clinical cohort analyses, and comparative epigenomics. Trigger keywords: epigenetic clock, biological age, methylation age, DNA methylation aging, age acceleration, Horvath clock, Hannum clock, PhenoAge, DNAm age, aging biomarker.

Thin
0
MethylDackel -- Methylation Extraction from BAM/CRAM Files

MethylDackel (formerly PileOMeth) -- fast C tool for extracting per-base methylation metrics from coordinate-sorted BAM/CRAM files produced by bisulfite sequencing aligners (Bismark, bwa-meth, BSMAP). Provides CpG/CHG/CHH context extraction, per-read methylation filtering, M-bias plot generation, merging of strand-complementary CpG sites, and per-read methylation output. Standard extraction layer between alignment and downstream differential methylation analysis (methylKit, DSS, dmrseq).

Thin
0
methylKit -- DNA Methylation Analysis in R

methylKit -- R/Bioconductor package for DNA methylation analysis from bisulfite sequencing data. Reads Bismark, BSMAP, and generic CpG coverage files. Provides differential methylation analysis at base-pair and regional resolution, coverage filtering, normalization, sample clustering, PCA, correlation heatmaps, and per-base/per-region methylation statistics. Supports tiling windows, CpG island annotation, and gene-level aggregation. Core downstream analysis layer for WGBS, RRBS, and targeted bisulfite sequencing pipelines.

Thin
0
methylpy

methylpy — Python pipeline for whole-genome bisulfite sequencing (WGBS) methylation analysis. Trims adapters, aligns with Bowtie2, marks duplicates, and calls cytosine methylation in CpG, CHG, and CHH contexts. Outputs allc format (per-base mc/cov counts). Provides DMRfind for differentially methylated regions, allc merge/filter/bigwig conversion, and allele-specific methylation analysis. Supports single-cell bisulfite sequencing (snmC-seq) and multi-chromosome parallelism. Use when you need a complete WGBS or single-cell methylation pipeline.

Thin
0
mgcv

mgcv — Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. Fit generalized additive models (GAMs) with penalized regression splines via gam(), large-dataset GAMs with bam(), and generalized additive mixed models with gamm(). Supports 17+ smooth basis types (thin plate tp, cubic cr, cyclic cc, P-splines ps, Duchon ds, random effects re, Markov random fields mrf, Gaussian process gp, soap film so, tensor products te/ti/t2), extended families (Tweedie, negative binomial, beta, zero-inflated Poisson, ordered categorical, Cox PH), automatic smoothing parameter selection via GCV, REML, ML, or NCV, variable selection with select=TRUE, and concurvity diagnostics.

Thin
0
mice

mice — Multivariate Imputation by Chained Equations for handling missing data in R. Fully Conditional Specification (FCS) with 9+ built-in methods (pmm, norm, logreg, polyreg, polr, cart, rf, mean, lda), diagnostic visualizations (stripplot, densityplot, bwplot, md.pattern), predictor selection via quickpred(), pooling with Rubin's rules via pool(), passive imputation for derived variables, and two-level imputation support.

Thin
0
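Pooling with Rubin's rules, which the mice entry attributes to pool(), reduces to a few lines of arithmetic. The sketch below shows that arithmetic in plain Python for a single parameter. It is not the mice implementation (which is in R and also computes degrees of freedom), and the function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: matching squared standard errors
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations of equal length")
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what carries the extra uncertainty due to the missing data into the final standard error.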
MicrobiomeAnalystR

MicrobiomeAnalystR — comprehensive R package for statistical analysis of microbiome data from 16S rRNA amplicon sequencing and shotgun metagenomics. Provides alpha diversity (Shannon, Chao1, ACE, Faith's PD), beta diversity (PCoA, NMDS, PERMANOVA), differential abundance testing (DESeq2, edgeR, LEfSe, ALDEx2), biomarker discovery, functional prediction (PICRUSt2), co-occurrence network analysis, and longitudinal microbiome analysis. Accepts OTU/ASV tables from QIIME2, DADA2, mothur, or plain TSV. Companion R package to the MicrobiomeAnalyst web server (microbiomeanalyst.ca).

Thin
0
MiDAS

Use when working with MiDAS: Multiple Imputation with Denoising

Thin
0
Mikado

Mikado -- Python pipeline for selecting the best transcript models from multiple RNA-seq assemblies. Merges transcript sets from StringTie, Trinity, CLASS2, and other assemblers, then scores and filters loci using ORF completeness, junction reliability, UTR length, BLAST/HMM homology, and expression data. Three-phase workflow: mikado prepare (merge GTFs), mikado serialise (load external evidence), mikado pick (select best transcripts). Essential for genome annotation refinement and evidence-based gene model selection.

Thin
0
miloR

miloR — R/Bioconductor package for differential abundance (DA) testing on single-cell datasets using k-nearest neighbor (KNN) graphs. Identifies cell populations that shift in frequency across experimental conditions without requiring hard cluster boundaries. Operates on SingleCellExperiment objects with functions: Milo(), buildGraph(), makeNhoods(), countCells(), testNhoods(), calcNhoodDistance(), annotateNhoods(), findNhoodMarkers(), plotNhoodGraphDA(), plotDAbeeswarm(). Use for scRNA-seq or scATAC-seq DA analysis, compositional shifts, case/control comparisons, and multi-condition trajectory studies.

Thin
0
Miniasm

Miniasm — ultrafast OLC (overlap-layout-consensus) assembler for long reads (PacBio, Oxford Nanopore). Produces unitig-level assemblies from read overlaps computed by minimap2, without error correction or consensus polishing. Designed for rapid draft assembly of bacterial and small eukaryotic genomes. Requires post-assembly polishing with minipolish, racon, or medaka for accuracy.

Thin
0
Minigraph

Minigraph — sequence-to-graph aligner and incremental pangenome graph assembler for long reads and whole-genome assemblies. Aligns PacBio/ONT reads or haplotype-resolved assemblies to an rGFA reference graph, and incrementally constructs pangenome graphs from multiple genomes. Produces rGFA graphs and GAF alignments. Use when users need to build or align to a reference-guided pangenome graph, call structural variants from graph paths, map assemblies to a linear or graph reference, or work with rGFA/GAF formats in long-read genomics workflows.

Thin
0
Minimac4 -- Genotype Imputation

Minimac4 for genotype imputation from phased haplotype reference panels. Imputes missing genotypes using pre-phased target haplotypes and a reference panel in MVCF/M3VCF/VCF/BCF format. Use when performing genotype imputation, preparing imputation reference panels, converting M3VCF to MVCF, running meta-imputation with MetaMinimac2, or evaluating imputation quality (R-squared).

Thin
0
MIP

Use when working with MIP (Mutation Identification Pipeline), a workflow management system for clinical whole-genome and whole-exome sequencing analysis. Runs end-to-end rare disease (rd_dna, rd_rna) and cancer DNA pipelines from FASTQ to annotated VCF. Integrates BWA, GATK, DeepVariant, VEP, Chanjo, and produces output for Scout clinical review. Supports SLURM HPC cluster execution and local modes. Part of the Clinical Genomics MIP/Scout/Chanjo stack at SciLifeLab, Sweden.

Thin
0
MIRA --- Multimodal Models for Integrated Regulatory Analysis

MIRA (Multimodal models for Integrated Regulatory Analysis) for joint single-cell RNA-seq and ATAC-seq analysis. Provides probabilistic topic modeling, joint representation learning, pseudotime trajectory inference, cis-regulatory potential modeling (NITE/LITE), chromatin differential analysis, and transcription factor enrichment. Use when integrating scRNA-seq with scATAC-seq, building multimodal topic models, or modeling gene regulatory dynamics from single-cell multiomics data.

Thin
0
missMDA

missMDA — R package for handling missing values in multivariate data analysis providing single imputation via iterative PCA (imputePCA for continuous data), MCA (imputeMCA for categorical data), FAMD (imputeFAMD for mixed data), MFA (imputeMFA for grouped variables), CA (imputeCA for contingency tables), and multilevel FAMD (imputeMultilevel), with regularized and EM algorithms, automatic dimension selection via cross-validation (estim_ncpPCA, estim_ncpMCA, estim_ncpFAMD), and multiple imputation extensions (MIPCA, MIMCA, MIFAMD) compatible with the FactoMineR ecosystem.

Thin
0
mixOmics

mixOmics — R/Bioconductor package for multi-omics data integration using multivariate projection methods. Provides PCA, PLS, sPLS, PLS-DA, sPLS-DA, DIABLO (multi-block sPLS-DA), and MINT (multi-study integration) for supervised and unsupervised analysis of transcriptomics, proteomics, metabolomics, and microbiome data. Includes variable selection via LASSO penalization, sample and variable visualization (plotIndiv, plotVar, plotLoadings, cimDiablo, circosPlot), and performance evaluation via repeated cross-validation with BER metric.

Thin
0
MLflow

MLflow — open-source platform for managing the full ML lifecycle: experiment tracking (parameters, metrics, artifacts), model registry (versioning, staging, production promotion), model serving (REST API, batch inference), and project reproducibility. Use for: logging training runs, comparing experiments, registering and deploying models, tracking hyperparameters and evaluation metrics, managing model versions across dev/staging/prod, and integrating with scikit-learn, PyTorch, TensorFlow, XGBoost, and HuggingFace. Key terms: mlflow.log_param, mlflow.log_metric, mlflow.log_artifact, MLflowClient, MlflowException, autolog, model registry, MLflow server.

Thin
0
MMseqs2

MMseqs2 — ultra-fast protein and nucleotide sequence searching, clustering, and taxonomy assignment. Performs all-vs-all comparisons, iterative profile searches, linclust linear-time clustering, easy-taxonomy LCA-based classification, and multi-step search workflows. Used for large-scale homology detection, metagenomic read classification, protein family construction, and database creation from custom FASTA files.

Thin
0
ModelAngelo

Use when working with ModelAngelo: automated atomic model

Thin
0
modkit

modkit — Oxford Nanopore Technologies Rust CLI for modified base (methylation) analysis on nanopore sequencing data. Processes modBAM files (BAM with MM/ML tags) to compute per-position modification frequencies (bedMethyl), extract per-read modification calls, summarize modification profiles, and perform motif-based filtering (CpG, CHH, CHG, 6mA). Use for 5mC, 5hmC, 6mA nanopore methylation analysis, DMR detection, and epigenetic pipelines.

Thin
0
MOFA+

MOFA+ (Multi-Omics Factor Analysis) — Bayesian group factor analysis framework for unsupervised integration of multiple omics data modalities. Discovers shared and modality-specific latent factors from RNA-seq, ATAC-seq, methylation, proteomics, or any combination. Supports multi-group designs (e.g., conditions, time points, batches) via structured sparsity priors. Python interface through mofapy2 for model training and muon/mofax for downstream interpretation and visualization.

Thin
0
mokapot -- Semi-Supervised PSM Confidence Estimation

Use when working with mokapot: Python package for semi-supervised

Thin
0
MolBART

Use when working with MolBART — a BART-based sequence-to-sequence transformer for computational chemistry and drug discovery. MolBART is pre-trained on SMILES molecular representations and supports forward reaction prediction, retrosynthesis, molecular optimization, and de novo molecular generation. Use for chemical synthesis planning, SMILES-to-SMILES translation, molecular property fine-tuning, and exploring chemical space. Also known as MolecularAI BART, Chemformer precursor, and MolecularAI/MolBART on GitHub.

Thin
0
MoleculeNet

MoleculeNet in DeepChem provides curated molecular machine learning dataset loaders for ADME, toxicity, quantum chemistry, materials, and reaction benchmarks through `deepchem.molnet.load_*` functions. Use when the user wants to load datasets such as Delaney/ESOL, Tox21, ClinTox, BBBP, FreeSolv, QM7, QM8, QM9, or PCBA with standard featurizers, splitters, and cached disk datasets. Trigger phrases: MoleculeNet benchmark, DeepChem molnet, load_tox21, load_delaney, molecular property prediction dataset, scaffold split, graph convolution featurizer, drug discovery benchmark.

Thin
0
Monocle3

Monocle3 single-cell trajectory analysis and pseudotime inference — use when ordering cells along developmental trajectories, computing pseudotime, identifying trajectory-dependent genes via Moran's I, discovering gene modules, performing regression-based differential expression, or projecting query data onto a reference atlas in R/Bioconductor workflows.

Thin
0
mosdepth

mosdepth — fast BAM/CRAM depth calculation for WGS, exome, and targeted sequencing. Computes per-base depth, mean per-window depth for CNV calling, per-region coverage from BED intervals, coverage distribution histograms, quantized callable-region BED files, and threshold-based coverage summaries. Written in Nim with htslib; ~2x faster than samtools depth.

Thin
0
Mothur

Mothur is an open-source bioinformatics platform for microbial ecology analysis. Provides tools for 16S/18S/ITS amplicon sequence processing, OTU clustering, phylotype classification, ASV generation, alpha/beta diversity analysis, community structure visualization, and statistical hypothesis testing. Use when processing amplicon sequencing data, building OTU tables, computing diversity metrics, or comparing microbial community composition.

Thin
0
mOTUs

mOTUs (marker gene-based Operational Taxonomic Units) is a computational tool for taxonomic profiling of microbial communities from metagenomic and metatranscriptomic shotgun sequencing data. Uses 10 universal single-copy marker genes to profile both known and unknown species (metagenomic OTUs). Provides species-level abundance estimation, multi-sample merging, and SNV-based strain profiling. Use when profiling microbial taxonomy from shotgun metagenomics, comparing community composition, or identifying unknown species via marker gene analysis.

Thin
0
Mowgli

Mowgli — multi-omics Wasserstein integrated analysis using group NMF to jointly factorize paired multi-omics data (RNA, ATAC, protein/ADT) from single-cell experiments. Learns shared and modality-specific latent factors using optimal transport regularization on MuData objects. Part of the scverse ecosystem, provides dimensionality reduction, factor interpretation, and integration of CITE-seq, multiome, and TEA-seq datasets.

Thin
0
MrBayes

MrBayes — Bayesian phylogenetic inference using MCMC. Accepts sequence alignments (NEXUS format) and performs Bayesian inference on evolutionary trees with posterior probability estimates. Supports DNA, protein, and morphological data with partitioned models, topology constraints, nucleotide substitution models (GTR, HKY, F81), and calibration for molecular clocks.

Thin
0
msigdbr

msigdbr — R package providing MSigDB (Molecular Signatures Database) gene sets as tidy data frames for use in gene set enrichment analysis (GSEA), over-representation analysis (ORA), and pathway analysis. Access all MSigDB collections (H, C1–C8) including Hallmark, KEGG, Reactome, GO, and immunological signatures for 20+ species via ortholog mapping. Use when users need gene sets for fgsea, clusterProfiler, enricher, limma::fry, or any GSEA-compatible downstream tool. Replaces manual MSigDB downloads and .gmt file parsing.

Thin
0
MSIsensor2

MSIsensor2 detects microsatellite instability (MSI) in tumor-only sequencing data using machine learning — no paired normal sample required. Supports whole-exome sequencing (WES), whole-genome sequencing (WGS), targeted panel sequencing, cell-free DNA (cfDNA), and FFPE samples. Achieves up to 99% accuracy on TCGA/EGA datasets with a 10x speed advantage over paired methods. Trigger on: MSIsensor2, MSIsensor, microsatellite instability, MSI detection, tumor-only MSI, cfDNA MSI, FFPE MSI, MSI-high, MSI scoring, somatic MSI, mismatch repair deficiency, dMMR, MSI WES, MSI WGS, MSI panel.

Thin
0
MSstats

MSstats — Bioconductor R package for statistical analysis of quantitative mass spectrometry-based proteomics experiments. Supports label-free DDA, TMT, iTRAQ, SRM/MRM, and DIA workflows. Provides preprocessing with dataProcess(), differential abundance testing with groupComparison(), sample size estimation with designSampleSize(), and quantification summaries. Converts output from MaxQuant, Spectronaut, DIA-NN, Skyline, Progenesis, OpenSWATH, and Proteome Discoverer into analysis-ready format.

Thin
0
MuData

MuData — Python multimodal data container for storing and manipulating multi-omics single-cell datasets built on AnnData. Provides MuData objects wrapping multiple AnnData modalities (RNA, ATAC, protein, etc.), HDF5-based .h5mu file I/O, shared and modality-specific annotations, observation/variable axis management, concatenation, AnnData interconversion, and Zarr support. Part of the scverse ecosystem for CITE-seq, Multiome, TEA-seq, and other multimodal omics experiments.

Thin
0
MultiVI

MultiVI — deep generative model for joint analysis of scRNA-seq and scATAC-seq multi-omics data using variational inference. Part of scvi-tools (scverse ecosystem). Integrates single-modality and multiome (paired RNA+ATAC) datasets, learns a shared latent space, imputes missing modalities, and enables joint clustering and differential accessibility analysis. Built on PyTorch with GPU acceleration via scvi-tools.

Thin
0
Muon

Muon — multimodal omics data analysis framework from the scverse ecosystem. Built on the MuData container for multi-modal single-cell experiments including CITE-seq, Multiome (RNA+ATAC), TEA-seq, and DOGMA-seq. Provides modality-specific preprocessing (mu.atac for ATAC-seq, mu.prot for protein), multi-omics integration (multi-omics factor analysis with MOFA, weighted nearest neighbors WNN), and cross-modal visualization. Works with AnnData objects per modality and MuData as the joint container. Use for multimodal single-cell analysis, ATAC-seq peak processing, and protein surface marker quantification.

Thin
0
Mustache

Mustache — multi-scale detection of chromatin loops from Hi-C and Micro-C contact maps using scale-space representation. Identifies loops across resolutions from 10 kb down to 500 bp using computer-vision scale-space theory. Supports .hic (Juicer), .cool/.mcool (Cooler), and text contact formats. Includes diff_mustache for differential loop calling between conditions. Used for chromatin loop detection, 3D genome organization, enhancer-promoter interaction mapping, and CTCF loop analysis.

Thin
0
MutationalPatterns

MutationalPatterns — R/Bioconductor package for comprehensive analysis of somatic mutational patterns in cancer genomes. Extracts and visualizes mutational signatures (SBS, DBS, indel), performs COSMIC signature fitting, analyzes transcription and replication strand bias, generates rainfall plots, and quantifies regional mutation density. Works with VCF files via GRanges and BSgenome reference genomes. Used for cancer genomics, mutational signature analysis, and tumor mutational burden studies.

Thin
0
Mutect2 -- GATK Somatic Short Variant Caller

Mutect2 -- GATK somatic short variant caller for detecting SNVs and indels in tumor samples. Supports tumor-only and tumor-normal paired calling modes with Panel of Normals and germline resource filtering. Includes orientation bias artifact detection and FilterMutectCalls post-processing. Standard tool for somatic variant calling in cancer genomics WGS, WES, and targeted panel pipelines.

Thin
0
MZmine 3

MZmine 3 — open-source Java software for LC-MS and GC-MS mass spectrometry data processing in untargeted metabolomics. Provides feature detection (ADAP, local minimum), chromatogram building, isotope grouping, alignment (join, RANSAC), gap filling, spectral networking, MS/MS library matching, GNPS/SIRIUS export, and batch-mode processing via XML configuration files. Supports mzML, mzXML, imzML, and vendor raw formats through msconvert.

Thin
0
NAMD

NAMD — high-performance parallel molecular dynamics for large biomolecular systems. Runs on CPU clusters and GPUs with CHARMM and AMBER force fields. Supports explicit/implicit solvent, enhanced sampling (replica exchange, metadynamics via colvars), free energy perturbation (FEP/TI), steered MD, and QM/MM simulations. Input via configuration files (.namd/.conf) with PSF/PDB or AMBER prmtop/inpcrd coordinate systems.

Thin
0
NanoFilt

NanoFilt — Python tool for filtering and trimming Oxford Nanopore long-read sequencing data in FASTQ format. Filters reads by minimum average quality score, minimum/maximum read length, and GC content range. Trims bases from read start (headcrop) and end (tailcrop). Reads from stdin or file, writes filtered FASTQ to stdout. Supports basecaller summary files from Guppy/Albacore for quality scores. Part of the nanopack suite.

Thin
0
NanoPlot

NanoPlot — visualization and quality-control tool for Oxford Nanopore long-read sequencing data. Generates read length histograms, quality distribution plots, cumulative yield curves, and alignment identity scatter plots from FASTQ, BAM, CRAM, or sequencing_summary input files. Produces HTML reports with NanoStats summary tables. Essential QC step in Nanopore basecalling, assembly, and variant-calling pipelines.

Thin
0
NanoStat

NanoStat — Python tool for calculating statistics from Oxford Nanopore and PacBio long-read sequencing data. Reads FASTQ, FASTA, BAM, and sequencing summary files to produce read length distributions, mean/median quality scores, throughput metrics, and N50 values. Part of the nanopack suite alongside NanoPlot, NanoFilt, and NanoComp. Essential QC step in nanopore and long-read sequencing pipelines.

Thin
0
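The N50 value listed in the NanoStat entry has a simple definition: the length of the shortest read in the smallest set of longest reads that together cover at least half of all bases. A minimal sketch of that definition follows. It is not NanoStat's code, and the function name `read_n50` is hypothetical.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For example, reads of length 2, 3, 4, 5 and 6 total 20 bases; the two longest reads reach 11 bases, so the N50 is 5.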
NanoSV

NanoSV — structural variant caller for long-read sequencing data (Oxford Nanopore). Detects deletions, insertions, inversions, duplications, and translocations from split-read alignments in BAM files. Outputs standard VCF. Requires sorted BAM input aligned with a split-read aware aligner (minimap2 with -a, LAST). Configurable via INI config file for minimum support, clustering distance, and SV type filtering.

Thin
0
Napari

Napari — fast, interactive, multi-dimensional image viewer for Python built on Qt and vispy (OpenGL). Provides GPU-accelerated rendering of 2D, 3D, and nD image data with seven layer types (Image, Labels, Points, Shapes, Surface, Tracks, Vectors), interactive annotation and segmentation tools, dask-based lazy loading for large datasets, an extensible plugin system (npe2) with 300+ community plugins via napari-hub, and tight integration with the scientific Python stack (NumPy, scikit-image, scipy) for microscopy, pathology, and bioimage analysis workflows.

Thin
0
needletail

needletail — high-performance Rust-based FASTA/FASTQ parser with Python bindings. Parse FASTA and FASTQ files at near-C speed using parse_fastx_file and parse_fastx_string. Provides k-mer extraction with canonical_kmers, sequence normalization with normalize_seq, and reverse_complement operations. Use for fast sequence parsing, k-mer counting, format-agnostic FASTA/FASTQ iteration, and high-throughput sequence preprocessing in Python bioinformatics pipelines.

Thin
0
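Canonical k-mers and reverse complements, which the needletail entry lists as operations, can be illustrated without the library. The sketch below is a pure-Python picture of the concept: a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement. It does not call needletail, and the names `revcomp` and `canonical_kmers_sketch` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")

def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_kmers_sketch(seq, k):
    """Yield the canonical form of every k-mer in seq.

    k-mers containing characters other than A, C, G, T are skipped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= _VALID:
            yield min(kmer, revcomp(kmer))
```

Using canonical forms means a k-mer and its reverse complement are counted as the same key, so strand orientation of the read does not matter.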
NetCoMi

NetCoMi (Network Construction and Comparison for Microbiome Data) — R package for constructing, analyzing, and comparing microbial association and dissimilarity networks from compositional count data. Supports 10+ association measures (SparCC, SpiecEasi, SPRING, CCREPE, pearson, spearman), multiple zero-handling and normalization strategies, network sparsification, centrality analysis, hub detection, cluster comparison via Jaccard/Rand indices, and differential network construction. Designed for 16S/ITS amplicon sequencing and shotgun metagenomics OTU/ASV tables.

Thin
0
NetMHCpan

Use when working with NetMHCpan 4.1 for MHC class I peptide binding and ligand presentation prediction. NetMHCpan accepts peptide lists or FASTA proteins, scores candidate ligands against up to 20 alleles per submission, reports EL rank and optional binding-affinity output, and supports custom full length MHC-I sequences for pan-allele prediction. Useful for epitope prioritization, neoantigen screening, immunopeptidomics follow-up, allele benchmarking, and interpreting strong-binder versus weak-binder rank thresholds from the DTU Health Tech web service.

Thin
0
NeuralGCM

NeuralGCM — hybrid ML-physics atmospheric model for weather forecasting and climate simulation built on JAX and the Dinosaur spectral dynamical core. Provides PressureLevelModel API for loading pre-trained checkpoints (0.7°–2.8° deterministic and stochastic), running multi-step predictions from ERA5 initial conditions via encode/advance/decode, and converting results to xarray Datasets. Supports GPU/TPU-accelerated inference for global weather prediction and climate projections.

Thin
0
NeuronJ

Use when working with NeuronJ — an ImageJ/Fiji plugin for semi-automatic tracing and quantification of elongated image structures such as neurons, axons, dendrites, and neurites in microscopy images. Computes length, branching, and morphology metrics from 2D fluorescence or brightfield images. Essential for neuroscience imaging workflows requiring reproducible neuron tracing and morphological analysis.

Thin
0
Nextclade

Nextclade — viral genome alignment, clade/lineage assignment, mutation calling, and sequence quality assessment. Aligns query sequences against pathogen-specific reference datasets (SARS-CoV-2, influenza, RSV, mpox, and 50+ others), calls nucleotide and amino acid substitutions/deletions/insertions, assigns clades and Pango lineages, and produces per-sequence QC scores. Part of the Nextstrain ecosystem for genomic epidemiology and real-time pathogen surveillance.

Thin
0
NextDenovo

NextDenovo — string graph-based de novo assembler for long reads. Performs error correction (NextCorrect) then assembly (NextGraph) from PacBio CLR, PacBio HiFi, or Oxford Nanopore reads. Config-file driven with parallelized task scheduling. Use for genome assembly, read correction, and contig generation from third-generation sequencing data.

Thin
0
nf-core

nf-core — community-curated Nextflow pipeline framework with 100+ production bioinformatics pipelines. Provides CLI tools for listing, launching, downloading, creating, and linting pipelines; managing DSL2 modules and subworkflows; and building parameter schemas. Used for RNA-seq, variant calling, metagenomics, proteomics, and single-cell analysis with standardized configs and containers.

Thin
0
nf-core/atacseq

nf-core/atacseq — Nextflow pipeline for ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) data analysis. Handles paired- or single-end reads through adapter trimming (Trim Galore), alignment (BWA, Bowtie2, STAR, Chromap), duplicate marking, library complexity estimation, peak calling (MACS2), consensus peak sets, differential accessibility (DESeq2), motif enrichment (HOMER), and MultiQC reporting. Supports hg38, mm10, and custom genomes. Input: CSV samplesheet with FASTQ paths. Output: BAM, BED peaks, ATAC-seq QC metrics (FRiP, TSS enrichment, fragment length).

Thin
0
nf-core/chipseq

nf-core/chipseq — Nextflow pipeline for comprehensive ChIP-seq analysis from raw reads through peak calling and differential analysis. Automates adapter trimming, multi-aligner support (BWA, Bowtie2, Chromap, STAR), duplicate marking, peak calling (MACS3), quality control with FastQC/MultiQC, and consensus peak identification. Production-ready with containerization (Docker/Singularity), ideal for bulk ChIP-seq and CUT&RUN workflows analyzing transcription factors, histones, and chromatin modifications.

Thin
0
nf-corefetchngs

nf-core/fetchngs — Nextflow pipeline for downloading and standardizing sequencing data from public repositories (SRA, ENA, DDBJ, GEO, Synapse). Fetches FASTQ files from accession IDs and generates nf-core-compatible samplesheets. Use when downloading public sequencing data or preparing inputs for nf-core pipelines.

Thin
0
nf-coremag

nf-core/mag — Nextflow pipeline for metagenome-assembled genome (MAG) analysis. Takes raw short-read (Illumina) and/or long-read (Nanopore/PacBio) metagenomic sequencing data through quality control, taxonomic classification, assembly, binning, and bin quality assessment. Supports hybrid assembly, co-assembly, ancient DNA, and domain classification. Use when building MAGs from shotgun metagenomics data or evaluating genome bin quality with CheckM, BUSCO, and GUNC.

Thin
0
nf-coremethylseq

nf-core/methylseq — community-curated Nextflow pipeline for whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data analysis. Handles adapter trimming (Trim Galore), alignment (Bismark or bwa-meth), methylation extraction (Bismark extractor or MethylDackel), and quality reporting (MultiQC). Supports CpG methylation calling, CHG/CHH context analysis, and differential methylation preparation. Use for WGBS, RRBS, or any bisulfite sequencing pipeline on human, mouse, or custom genomes.

Thin
0
nf-corernaseq

nf-core/rnaseq — production Nextflow pipeline for bulk RNA-seq analysis including read QC, trimming, alignment (STAR/HISAT2), pseudo-alignment (Salmon), quantification, and comprehensive quality control (FastQC, MultiQC, RSeQC, dupRadar, Qualimap). Use for RNA-seq samplesheet preparation, pipeline execution, aligner selection, strandedness detection, and output validation.

Thin
0
nf-corescrnaseq

nf-core/scrnaseq — Nextflow pipeline for single-cell RNA-seq quantification and preprocessing. Supports multiple alignment/quantification methods including Cell Ranger, STARsolo, Alevin-fry (Salmon), Kallisto/Bustools, and UniverSC. Handles FASTQ demultiplexing, barcode whitelisting, UMI deduplication, and count matrix generation for 10x Genomics, Drop-seq, and other droplet-based protocols. Input: CSV samplesheet with FASTQ paths and protocol specification. Output: count matrices (h5ad, MTX), MultiQC report, and alignment statistics.

Thin
0
nf-coreviralrecon

nf-core/viralrecon — Nextflow pipeline for viral genome reconstruction and analysis from sequencing data. Supports Illumina and Oxford Nanopore reads for SARS-CoV-2, influenza, and other viral genomes. Performs read QC, reference-based mapping (BWA/Bowtie2), de novo assembly (SPAdes/Unicycler), variant calling (iVar/BCFtools), consensus generation, lineage assignment (Pangolin/Nextclade), and MultiQC reporting. Part of the nf-core framework with containerized reproducibility.

Thin
0
NGMLR

NGMLR — long-read mapper for sensitive alignment of PacBio and Oxford Nanopore reads to reference genomes using convex gap-cost scoring. Designed for structural variation detection with SV-aware k-mer search and banded Smith-Waterman alignment. Key flags: -r (reference), -q (query reads), -o (output SAM), -x (preset: pacbio/ont), -t (threads), -i (min identity), -R (min residues), --bam-fix, --rg-id, --no-smallinv. Input: FASTA/FASTQ (plain or gzipped). Output: SAM. Commonly paired with Sniffles for SV calling.

Thin
0
ngsplot

ngs.plot — fast R-based visualization tool for next-generation sequencing data enrichment at functional genomic regions. Generates average profiles and heatmaps from BAM files over TSS, TES, gene bodies, enhancers, and custom BED regions. Supports ChIP-seq, RNA-seq, GRO-seq, and MNase-seq data. Uses pre-built genome annotation databases for hg19, hg38, mm10, and other assemblies. Use for publication-quality metagene and enrichment plots from aligned sequencing reads.

Thin
0
NicheNet

NicheNet — R-based computational framework for studying intercellular communication from single-cell transcriptomics. Predicts which ligands expressed by sender cell types regulate target gene expression in receiver cell types using a prior knowledge network of ligand-receptor-target regulatory interactions. Generates ranked ligand prioritization, ligand-target regulatory links, and ligand-receptor networks. Accepts Seurat objects or raw expression matrices. Use for cell-cell communication analysis, niche characterization, ligand activity prediction, and comparing signaling between conditions.

Thin
0
niffler

niffler — transparent compressed file I/O for Rust bioinformatics pipelines. Sniffs compression format from magic bytes using get_reader and from_path. Supports gzip, bzip2, xz/lzma, zstd, and BGZF (block gzip) via get_writer and to_path. Use for automatic format detection, transparent decompression of FASTQ/FASTA/VCF files, compressed output in Rust-based genomics tools, and building I/O layers that handle mixed compression formats without manual dispatch.

Thin
0
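The magic-byte sniffing described in the niffler entry above can be illustrated outside Rust. The following is a minimal Python sketch of the general technique, not niffler's implementation or API: the names `MAGIC`, `sniff_compression`, and `read_any` are hypothetical, and only formats with a Python standard-library decoder are decompressed. BGZF files begin with the ordinary gzip magic bytes, so this sketch reports them as gzip.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes for common compression formats (hypothetical table
# for illustration; not taken from niffler's source).
MAGIC = [
    (b"\x1f\x8b", "gzip"),          # also matches BGZF, which is gzip-compatible
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
]

# Decoders available in the standard library for the formats above.
DECOMPRESS = {
    "gzip": gzip.decompress,
    "bzip2": bz2.decompress,
    "xz": lzma.decompress,
}


def sniff_compression(data: bytes) -> str:
    """Return the format name whose magic bytes prefix `data`, else 'none'."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return "none"


def read_any(data: bytes) -> bytes:
    """Decompress `data` if a known format is detected; pass plain data through."""
    fmt = sniff_compression(data)
    if fmt == "none":
        return data
    if fmt not in DECOMPRESS:
        raise ValueError(f"detected {fmt}, but no decoder is available in this sketch")
    return DECOMPRESS[fmt](data)
```

A caller would pass the raw bytes of a FASTQ, FASTA, or VCF file to `read_any` and receive the decompressed content without choosing a decoder by hand.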
NIMBLE

NIMBLE — R package for hierarchical statistical modeling using MCMC, particle filtering, Laplace approximation, and programmable algorithms. Extends the BUGS/JAGS language with customizable samplers, compiled C++ execution, and a nimbleFunction system for writing model-generic algorithms. Supports Bayesian inference (nimbleMCMC, configureMCMC, buildMCMC), sequential Monte Carlo (nimbleSMC), maximum likelihood via Laplace (nimbleQuad/buildLaplace), and Monte Carlo EM (buildMCEM).

Thin
0
Nirvana

Nirvana — Clinical-grade genomic variant annotation tool by Illumina that processes VCF files and outputs structured JSON with transcript consequences, population frequencies, pathogenicity scores, and clinical annotations from 25+ data sources including ClinVar, gnomAD, dbSNP, COSMIC, OMIM, REVEL, SpliceAI, and PrimateAI. Supports SNVs, MNVs, insertions, deletions, indels, structural variants, CNVs, gene fusions, and STRs. Use when user asks about variant annotation, clinical interpretation, VCF-to-JSON conversion, or integrating multiple annotation databases into a single output.

Thin
0
nlme

nlme — Linear and Nonlinear Mixed-Effects Models in R. Fit hierarchical and repeated-measures data with lme() for linear models and nlme() for nonlinear models. Supports nested random effects, within-group correlation structures (corAR1, corARMA, corCompSymm, corExp, corGaus, corLin, corRatio, corSpher), heteroscedastic variance functions (varIdent, varPower, varExp, varConstPower, varComb, varFixed), generalized least squares via gls(), model comparison with anova(), confidence intervals via intervals(), and diagnostic plots. Ships with standard R distributions as a recommended package.

Thin
0
nloptr

nloptr — R interface to the NLopt nonlinear optimization library. Provides a common interface to 30+ optimization algorithms covering local and global search, gradient-based and derivative-free methods, with support for bound constraints, nonlinear inequality constraints, and equality constraints. Wrapper functions offer simplified access to BOBYQA, COBYLA, L-BFGS, SLSQP, Nelder-Mead, DIRECT, ISRES, and other algorithms. Widely used as a backend for lme4, brms, and other statistical modeling packages.

Thin
0
Noodles

Noodles — Rust bioinformatics I/O library with specification-compliant

Thin
0
Novoalign

Novoalign — high-accuracy short-read sequence aligner for Illumina and MGI platforms using full Needleman-Wunsch algorithm with affine gap penalties. Maps single-end and paired-end reads up to 950bp onto reference genomes. Features built-in adapter trimming, base quality calibration, bisulfite alignment for methylation analysis, and ambiguous code support. Outputs SAM/BAM format. Requires novoindex for reference indexing. Commercial license required (free for academic evaluation).

Thin
0
Nucleotide Transformer

Nucleotide Transformer — DNA foundation models by InstaDeep and BioNTech for genomic sequence analysis. Pre-trained transformer models (50M to 2.5B parameters) on 3,202 diverse genomes generate embeddings for variant effect prediction, promoter detection, splice site identification, enhancer classification, and histone modification prediction. Access via Hugging Face Transformers with InstaDeepAI model hub. Supports fine-tuning and zero-shot token-level and sequence-level downstream tasks.

Thin
0
Numba

Numba — open-source JIT compiler for numerical Python using LLVM. Compiles Python functions to optimized machine code at runtime via @jit/@njit decorators. Supports automatic parallelization (parallel=True, prange), NumPy-aware compilation, CUDA GPU kernels (@cuda.jit), custom ufunc creation (@vectorize, @guvectorize), stencil computations (@stencil), and C callback generation (@cfunc). Enables near-C performance for numerical loops, array operations, and scientific computing without leaving Python.

Thin
0
numpy

Use when working with NumPy, ndarray-based numerical computing, array creation, vectorization, broadcasting, linear algebra, random sampling, or scientific file I/O in Python. NumPy provides the n-dimensional array, ufuncs, dtype controls, indexing, reshaping, sorting, and statistics used by pandas, SciPy, scikit-learn, JAX, and many bioinformatics pipelines. Trigger on: numpy arrays, ndarray, np.array, np.asarray, broadcasting, vectorized math, matrix multiplication, .npy/.npz files, genfromtxt, savetxt, einsum, linalg, and converting Python lists or pandas objects into arrays.

Thin
0
Oases

Oases — de novo transcriptome assembler for short-read RNA-seq data built on the Velvet de Bruijn graph framework. Performs transcript assembly across varying expression levels using multi-k-mer merge strategy, scaffold construction from paired-end reads, and locus-aware isoform resolution. Key commands: velveth, velvetg, oases, oases_pipeline.py.

Thin
0
Octopus

Octopus — Bayesian haplotype-based variant caller for germline and somatic mutation detection from WGS, WES, and targeted sequencing data. Supports multiple calling models including individual, trio, population, tumour-normal, polyclone, and cell-level calling. Takes BAM/CRAM alignments and a reference FASTA as input and produces VCF output with phased genotypes. Built on a variational Bayesian genotype model with haplotype likelihood caching.

Thin
0
odgi

Use when working with odgi — the optimized dynamic genome/graph implementation toolkit — for pangenome variation graph manipulation, sorting, visualization, and analysis. Converts GFA graphs to efficient ODGI format (odgi build), computes 1D and 2D graph layouts (odgi sort, odgi layout), renders linear and 2D visualizations (odgi viz, odgi draw), extracts graph statistics and depth (odgi stats, odgi depth), and performs untangling for structural variant analysis (odgi untangle). Core component of the PGGB pangenome pipeline alongside wfmash, seqwish, and smoothxg.

Thin
0
OGGM (Open Global Glacier Model)

OGGM (Open Global Glacier Model) — modular open-source framework for simulating past and future glacier mass balance, volume, and geometry worldwide. Automates glacier-specific workflows from RGI outlines through DEM processing, centerline computation, climate data ingestion, mass balance calibration, ice thickness inversion, and flowline dynamics modeling. Outputs NetCDF time series and glacier statistics for sea-level projections, water resource assessments, and paleoglaciology studies.

Thin
0
ome-zarr-py -- OME-Zarr Image I/O

ome-zarr-py -- Python library for reading and writing OME-Zarr (NGFF) images, the cloud-native, chunked, multiscale format for bioimaging. Provides APIs for reading OME-Zarr datasets with multiscale pyramids, writing image data with OME metadata, converting between TIFF/HDF5 and Zarr, and validating NGFF specification compliance. Integrates with napari for visualization and Dask for lazy out-of-core access to large microscopy datasets.

Thin
0
OMERO

OMERO (Open Microscopy Environment Remote Objects) — client-server platform for managing, visualizing, and analyzing microscopy image data. Provides centralized image repository with Python API (omero-py), CLI, and web client. Supports Bio-Formats for 150+ image formats, ROI management, metadata annotation, and integration with Fiji/ImageJ and CellProfiler. Essential for multi-user microscopy core facilities and high-content screening pipelines.

Thin
0
OmicsIntegrator2

Use when working with OmicsIntegrator or OmicsIntegrator2 — a network-based multi-omics data integration framework from the Fraenkel Lab (MIT). Applies the Prize-Collecting Steiner Forest (PCSF) algorithm to connect proteomic, transcriptomic, phosphoproteomic, and metabolomic hits through protein-protein interaction networks. Identifies functionally coherent subnetworks that explain multi-level molecular changes in disease or perturbation experiments. Use for pathway discovery, network medicine, multi-omics integration, and PCSF-based analysis.

Thin
0
OncoKB

OncoKB — precision oncology knowledge base from Memorial Sloan Kettering

Thin
0
Open Babel

Open Babel — open-source cheminformatics toolkit for chemical file format conversion, 3D coordinate generation, molecular fingerprint calculation, substructure and SMARTS pattern searching, descriptor computation, and property filtering. Supports 110+ chemical file formats including SMILES, SDF/MOL, PDB, MOL2, CIF, and XYZ. Provides the obabel CLI and pybel Python interface for molecular data interconversion and cheminformatics pipelines.

Thin
0
OpenAQ

OpenAQ — open-source platform providing free, real-time and historical air quality data from government and research-grade monitoring stations worldwide. Query PM2.5, PM10, O3, NO2, CO, SO2 measurements via REST API v3 or Python SDK. Supports geospatial queries, temporal aggregations (hourly, daily, yearly), and sensor-level data access for environmental health research, wildfire smoke tracking, and pollution exposure analysis.

Thin
0
OpenCRAVAT

Use when working with opencravat — openCRAVAT — open-source variant

Thin
0
openCyto

Use when working with openCyto — the Bioconductor framework for automated, reproducible flow cytometry gating. Applies hierarchical gating strategies defined in a CSV template to GatingSet objects (flowWorkspace). Supports data-driven gate placement via mindensity, flowClust.1d, flowClust.2d, quantile, tailgate, and cytokine methods. Essential for high-throughput FCS file analysis, cell population identification, and immunophenotyping pipelines. Part of the RGLab ecosystem alongside flowCore, flowWorkspace, CytoML, and ggcyto.

Thin
0
OpenFold

OpenFold — open-source PyTorch reimplementation of AlphaFold2 for trainable protein structure prediction. Supports custom training on user datasets, fine-tuning of structure prediction models, and inference with DeepSpeed acceleration. Produces PDB/mmCIF structures with pLDDT, pTM, and PAE confidence metrics. Provides OpenProteinSet (140k+ MSA clusters) for training. Compatible with AlphaFold2 model weights and databases.

Thin
0
OpenMM

OpenMM — high-performance molecular dynamics simulation toolkit with GPU acceleration via CUDA and OpenCL. Provides Python API for building custom simulations, force field support (AMBER, CHARMM, AMOEBA), enhanced sampling methods (metadynamics, replica exchange), free energy calculations, and custom force definitions for biomolecular systems including proteins, nucleic acids, lipid membranes, and small molecules.

Thin
0
OpenMS

OpenMS — open-source C++ framework with Python bindings (pyOpenMS) for liquid chromatography-mass spectrometry (LC-MS/MS) data analysis. Provides 200+ TOPP command-line tools for proteomics, metabolomics, and lipidomics workflows including peak picking, feature detection, map alignment, peptide/protein identification, and label-free/labeled quantification. Supports mzML, mzXML, FASTA, featureXML, and consensusXML formats. Integrates with workflow systems (KNIME, Nextflow, Galaxy, TOPPAS) and search engines (Comet, MSFragger, MS-GF+, Sage, X!Tandem). pyOpenMS exposes the full C++ API as Python classes for scripted analysis pipelines and custom algorithm development.

Thin
0
org.Hs.eg.db

org.Hs.eg.db is the Bioconductor OrgDb package for genome-wide human gene annotation built primarily around Entrez Gene identifiers. Use this skill when users need human gene ID conversion, symbol lookup, GO or KEGG mapping, chromosome and cytoband annotation, or R workflows using AnnotationDbi functions such as keys(), columns(), select(), and mapIds(). It is the right choice for reproducible offline human annotation in Bioconductor pipelines and for comparing org.Hs.eg.db against AnnotationHub, biomaRt, or GenomicFeatures.

Thin
0
OrthoFinder

OrthoFinder — fast, accurate ortholog inference for comparative genomics. Identifies orthogroups, orthologs, gene trees, rooted species trees, and gene duplication events from protein or nucleotide sequences using all-vs-all DIAMOND/BLAST search. Standard tool for phylogenetic profiling, gene family evolution, and multi-species proteome comparison.

Thin
0
P2Rank

P2Rank — fast machine learning tool for predicting ligand binding sites from protein structure. Uses random forest classifier on physicochemical and geometric features of solvent-accessible surface points. Accepts PDB and mmCIF input files, outputs predicted binding pockets with confidence scores and residue-level predictions. Useful for drug discovery, protein function annotation, binding site comparison, and virtual screening preparation.

Thin
0
padjust

p.adjust — Multiple testing p-value correction in base R (stats package). Adjusts a vector of p-values using one of seven methods: Bonferroni, Holm (1979), Hochberg (1988), Hommel (1988), Benjamini-Hochberg / FDR (1995), Benjamini-Yekutieli (2001), or no adjustment. Controls either family-wise error rate (FWER) or false discovery rate (FDR). Part of the stats package shipped with every R installation — no external packages required. The single most commonly used p-value adjustment function in R-based bioinformatics, called by DESeq2, limma, edgeR, clusterProfiler, and virtually every differential analysis pipeline. Handles NA values, supports custom comparison counts via the n parameter, and preserves input vector names.

Thin
0
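To make the corrections named in the p.adjust entry above concrete, here is a minimal Python sketch of two of them: Bonferroni (family-wise error rate) and Benjamini-Hochberg (false discovery rate). It is an illustration of the standard procedures rather than a port of the R function: the names `p_adjust_bonferroni` and `p_adjust_bh` are hypothetical, and NA handling, the `n` parameter, and name preservation are deliberately left out.

```python
def p_adjust_bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def p_adjust_bh(pvals):
    """Benjamini-Hochberg: scale each p-value by n / rank, then enforce
    monotonicity by taking a running minimum from the largest p-value down."""
    n = len(pvals)
    # Indices of the p-values, ordered from largest to smallest.
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # ascending rank of this p-value (1 = smallest)
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min  # written back in the original input order
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(p_adjust_bonferroni(raw))
    print(p_adjust_bh(raw))
```

Results come back in the input order, so they can be zipped directly against gene or feature identifiers.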
PAGA

PAGA (Partition-based Graph Abstraction) — trajectory inference and topology analysis for single-cell RNA-seq data. Computes a coarse-grained graph of cell clusters (PAGA graph) that captures connectivity, trajectory structure, and differentiation paths. Integrated into scanpy as sc.tl.paga and sc.pl.paga. Enables trajectory-informed UMAP initialization, pseudotime analysis, and multi-resolution graph abstraction from scRNA-seq count matrices.

Thin
0
Pairtools

pairtools — Command-line toolkit for processing Hi-C and chromosome conformation capture contact pairs in the Open2C ecosystem. Parses alignments (SAM/BAM) into .pairs format, sorts, deduplicates, filters by distance or chromosome, and generates contact statistics. Used as the primary upstream processing step before cooler and cooltools for Hi-C, Micro-C, SPRITE, and capture-C data. Supports 4DN .pairs format and integrates with distiller-nf pipeline.

Thin
0
Palantir

Palantir — pseudotime and cell fate probability analysis for single-cell RNA-seq data using Markov diffusion chains. Computes a continuous pseudotime ordering from a user-specified start cell and quantifies probabilistic branch fate assignments for each cell. Identifies waypoints (representative cells) along trajectories, and computes fate entropy to locate progenitor and committed states. Integrates with scanpy AnnData objects.

Thin
0
PAML

PAML — Phylogenetic Analysis by Maximum Likelihood. Suite of programs for phylogenetic analysis of DNA and protein sequences using maximum likelihood. Includes codeml (codon/amino acid models, dN/dS selection analysis), baseml (nucleotide substitution models), mcmctree (Bayesian divergence time estimation), evolver (sequence simulation), and yn00 (pairwise dN/dS). Used for detecting positive selection, estimating divergence times, and molecular evolution studies. Install via conda or compile from source.

Thin
0
Panaroo

Panaroo — graph-based bacterial pan-genome analysis pipeline that corrects annotation errors. Takes GFF3 files from Prokka or Bakta, builds a population-level gene graph, and outputs core/accessory gene presence-absence matrices, aligned core gene sequences, and structural variants. Supports strict, moderate, and sensitive cleaning modes for different levels of contamination tolerance. Used for pan-genome construction, core genome alignment, gene gain/loss analysis, and population structure studies in bacterial genomics.

Thin
0
pandas

Use when working with pandas — the foundational Python data analysis library — for tabular data manipulation, CSV/TSV/Excel/HDF5/Parquet I/O, DataFrame operations, and bioinformatics data wrangling. pandas provides DataFrame and Series objects with labeled indexing, rich I/O capabilities, groupby/merge/pivot operations, and time series support. Use for: reading genomic annotation tables, sample metadata processing, VCF/BED/GTF parsing, feature matrix construction, statistical summarization, and ETL pipelines in Python. Key terms: DataFrame, Series, read_csv, read_parquet, groupby, merge, pivot_table, apply, loc, iloc, pandas dataframe, pd.read_csv, data wrangling python.

Thin
0
Pangolin

Pangolin — Phylogenetic Assignment of Named Global Outbreak LINeages for SARS-CoV-2. Assigns Pango lineages to consensus genome sequences using UShER (phylogenetic placement) or pangoLEARN (ML classification). Provides scorpio constellation calling, designation cache lookup, and lineage report generation in CSV format. Essential for genomic epidemiology and SARS-CoV-2 surveillance pipelines.

Thin
0
Pangu-Weather

Pangu-Weather — AI-based global weather forecasting with 3D neural networks. Predicts atmospheric conditions at 0.25° resolution (~28 km) for lead times of 1, 3, 6, and 24 hours using Earth-Specific Transformer architecture trained on 43 years of ERA5 reanalysis data. 10,000x faster than traditional numerical weather prediction. Runs via ONNX Runtime on CPU or GPU. Covers geopotential, temperature, humidity, wind speed, and mean sea level pressure across 13 pressure levels. Integrated into ECMWF operational suite.

Thin
0
Papermill

Papermill — parameterize, execute, and analyze Jupyter notebooks from the command line or Python API. Enables reproducible notebook-based bioinformatics pipelines by injecting parameters into tagged cells and producing executed output notebooks with full provenance. Supports parameterized batch execution, notebook inspection, and integration with workflow managers like Nextflow and Snakemake.

Thin
0
Paragraph

Paragraph — graph-based structural variant genotyper for short-read sequencing data from Illumina. Genotypes known deletions, insertions, duplications, and inversions from BAM/CRAM files using local graph realignment. Accepts candidate SVs in VCF format and outputs per-sample genotypes. Supports population-scale genotyping via single-sample mode with VCF merging. Built by Illumina.

Thin
0
Pavian

Pavian — interactive R/Shiny web application for metagenomics analysis and visualization. Reads classification reports from Kraken, Kraken2, Bracken, MetaPhlAn, DIAMOND+MEGAN, and Centrifuge to generate Sankey diagrams, comparative heatmaps, and abundance tables. Key capabilities: multi-sample comparison, taxonomic filtering at any rank, pathogen identification, microbiome composition visualization, and HTML report export. Use for metagenomics QC review, clinical pathogen detection, and microbiome community profiling from shotgun sequencing data.

Thin
0
pbmm2

pbmm2 — SMRT C++ wrapper for minimap2's C API providing native PacBio long-read alignment. Supports CCS/HiFi, SUBREAD, ISOSEQ, and UNROLLED presets. Reads native PacBio BAM and dataset XML inputs, produces sorted BAM output with concordance/identity tags. Official replacement for BLASR in PacBio sequencing workflows.

Thin
0
pbsv

pbsv — PacBio structural variant (SV) caller for HiFi and CLR long-read sequencing data. Discovers and genotypes deletions, insertions, duplications, inversions, translocations, and breakends from aligned BAM files. Two-stage workflow: discover signatures from individual samples, then call/genotype variants jointly. Use for long-read SV detection, multi-sample SV genotyping, or as part of a PacBio HiFi variant calling pipeline.

Thin
0
pcalg

pcalg — R package for causal structure learning and causal inference using graphical models. Constraint-based algorithms (PC for no hidden variables, FCI/RFCI for latent confounders), score-based methods (GES for observational data, GIES for interventional data), and causal effect estimation (IDA, jointIda, Generalized Backdoor Criterion). Supports Gaussian, discrete, and binary data with conditional independence testing, DAG simulation (randDAG, rmvDAG), and graph comparison utilities (compareGraphs, SHD).

Thin
0
PCAngsd

PCAngsd — principal component analysis and population genetics inference from genotype likelihoods in low-depth sequencing data. Use when performing PCA on low-coverage WGS, ancient DNA, or pooled sequencing with Beagle genotype probability files. Supports selection scans, kinship estimation, admixture inference, inbreeding coefficients, and GWAS via genotype likelihood-based methods. Compares to smartpca, FlashPCA2, and PLINK PCA.

Thin
0
PCGR

PCGR (Personal Cancer Genome Reporter) — Python/R tool for automated annotation and clinical interpretation of somatic mutations and copy number aberrations in individual tumor genomes. Accepts somatic VCF and optional CNA segment files, producing HTML/JSON reports with oncogenicity classification (VICC/CGC), tumor mutational burden (TMB), MSI status, mutational signature decomposition (COSMIC SBS/DBS/ID), and HRD estimation. Integrates ClinVar, CancerHotspots, COSMIC, OncoKB, and IntOGen. Used for clinical genomics reporting, tumor characterization, and treatment guidance.

Thin
0
Peddy

Peddy — pedigree and sample QC tool for VCF files. Verifies sample sex from X-chromosome heterozygosity, checks relatedness against PED file expectations, predicts ancestry via PCA against 1000 Genomes, and flags samples with outlier heterozygosity ratios. Works with multi-sample germline VCFs from GATK, DeepVariant, or any standard variant caller. Essential QC step before GWAS or family-based studies.

Thin
0
Pegasus

Pegasus — scalable single-cell RNA-seq analysis toolkit for millions of cells. Command-line tool and Python package for preprocessing, batch correction, clustering, differential expression, visualization, and demultiplexing. Supports zarr/h5ad formats, harmony/scanorama/scVI batch correction, Louvain/Leiden clustering, UMAP/tSNE/FLE embeddings, pseudo-bulk DESeq2, and cell-hashing demultiplexing via demuxEM. Developed at the Li Lab, Broad Institute of MIT and Harvard.

Thin
0
PEPPER-Margin-DeepVariant

PEPPER-Margin-DeepVariant — end-to-end haplotype-aware variant calling pipeline for long-read sequencing data (Oxford Nanopore Technologies and PacBio HiFi). Combines PEPPER (RNN-based candidate variant detection), Margin (haplotype phasing), and DeepVariant (CNN-based variant genotyping) into a single Docker/Singularity container. Supports SNP and indel calling from ONT R9.4.1, ONT R10.4.1, and PacBio HiFi reads. Outputs phased VCF and haplotagged BAM files. Use when calling variants from long-read alignments.

Thin
0
Percolator

Percolator — semi-supervised machine learning tool for rescoring peptide-spectrum matches (PSMs) from shotgun proteomics database searches. Uses target-decoy competition with SVM classification to separate correct from incorrect PSMs, producing posterior error probabilities (PEP) and q-values at PSM, peptide, and protein levels. Accepts PIN/tab-delimited input from Comet, X!Tandem, Crux, MS-GF+, and Tide search engines. Standard post-processing step in DDA proteomics pipelines.

Thin
0
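The q-values mentioned in the Percolator entry above come from target-decoy competition. The sketch below shows only the basic idea in plain Python: rank PSMs by score, estimate the false discovery rate at each threshold as decoys divided by targets, then take the minimum FDR at that threshold or any more permissive one. It is a simplified illustration, not Percolator's algorithm (there is no SVM rescoring and no posterior error probability), and the function name `target_decoy_qvalues` and the `(score, is_decoy)` input shape are assumptions.

```python
def target_decoy_qvalues(psms):
    """Assign a q-value to each PSM.

    `psms` is a list of (score, is_decoy) pairs, where a higher score means
    a better match. Returns (score, is_decoy, q_value) triples sorted by
    descending score.
    """
    ordered = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # Estimated FDR when accepting everything down to each position.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this threshold or any lower-scoring one.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(min(running_min, 1.0))
    qvalues.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ordered, qvalues)]


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
    for row in target_decoy_qvalues(example):
        print(row)
```

Filtering the returned triples to targets with a q-value at or below a chosen cutoff gives the accepted PSM list at that estimated error rate.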
Peregrine

Use when working with Peregrine (peregrine-2021) — a fast, SHIMMER-based

Thin
0
Perseus

Perseus — computational platform for comprehensive statistical analysis of quantitative proteomics data. Companion to MaxQuant, Perseus provides an interactive workflow environment for processing protein group tables through normalization, imputation, statistical testing (t-tests, ANOVA), hierarchical clustering, PCA, volcano plots, annotation enrichment, and machine learning classification. Supports label-free (LFQ), SILAC, TMT/iTRAQ quantification data and integrates with R and Python via PerseusR and perseuspy companion libraries. All analysis steps are captured as reproducible workflow trees with full parameter documentation.

Thin
0
pertpy

Use when working with pertpy, the scverse framework for single-cell perturbation analysis in Python. Trigger this skill for Perturb-seq, CRISPR screen QC, perturbation signatures, Mixscape, Milo, Augur, scCODA, tasccoda, Dialogue, distance-based perturbation comparison, or MuData / AnnData workflows that need the official `pertpy.tl` APIs and tutorials. pertpy organizes perturbation methods around `AnnData` and `MuData`, exposes curated example datasets, and routes users to method-specific tools such as `Mixscape`, `Milo`, `Sccoda`, `Distance`, `Augur`, and `Dialogue`.

Thin
0
Perturb-seq Tools (MIMOSCA)

Perturb-seq analysis tools including MIMOSCA (Multi-Input Multi-Output Single-Cell Analysis) for CRISPR perturbation screens with single-cell RNA-seq readout. Handles guide RNA assignment to cells, perturbation effect estimation, differential expression across perturbations, and multi-perturbation combinatorial analysis. Works with Perturb-seq, CROP-seq, and CRISPRi/a screen data in AnnData or CSV format. Uses linear models and negative binomial regression for effect estimation.

Thin
0
MalariaGEN Pf3k/Pv4 Pipelines

MalariaGEN Pf3k/Pv4 pipelines — cloud-native Python API for accessing and analysing Plasmodium falciparum (Pf7/Pf8, 33K+ genomes) and Plasmodium vivax (Pv4, 1,895 genomes) population genomics data. Provides sample metadata retrieval, variant calls (SNPs, indels, CNVs), reference genome access, and gene annotations via the malariagen_data package. Supports lazy cloud-based loading from S3/GCS through Zarr, xarray, and dask arrays without local downloads. Commonly used for malaria drug resistance surveillance, population structure analysis, and genomic epidemiology.

Thin
0
PGAP

PGAP — NCBI Prokaryotic Genome Annotation Pipeline for comprehensive annotation of bacterial and archaeal genomes. Runs gene prediction (GeneMarkS-2+, tRNAscan-SE), functional annotation via HMM/BLAST, frameshifted gene detection, and produces GenBank/GFF3/FASTA submissions. Uses CWL workflows with Docker or Singularity. Required for NCBI genome submissions and RefSeq annotation.

Thin
0
PGGB

PGGB (PanGenome Graph Builder) — constructs unbiased pangenome variation graphs from multiple whole-genome assemblies using all-vs-all pairwise alignment (wfmash), graph induction (seqwish), and graph normalization (smoothxg). Produces GFA variation graphs for downstream variant calling, haplotype analysis, and comparative genomics. Use when users need to build a pangenome graph, align multiple assemblies, call variants across haplotypes, or work with panSN-spec sequence naming in graph genomics workflows.

Thin
0
phantompeakqualtools

Use when working with phantompeakqualtools — phantompeakqualtools (spp)

Thin
0
PHATE

PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) — dimensionality reduction for visualizing high-dimensional biological data while preserving both local and global structure. Constructs diffusion potential from adaptive kernel, applies multidimensional scaling to embed into 2D/3D. Excels at revealing trajectory and branching structure in single-cell RNA-seq, mass cytometry (CyTOF), and other high-dimensional biological datasets. Includes built-in clustering (k-means with auto silhouette selection), plotting utilities, and Scanpy/AnnData integration.

Thin
0
pheatmap

Use when working with pheatmap, Pretty Heatmaps, clustered heatmaps in R, or publication-grade matrix visualization. pheatmap draws annotated heatmaps from numeric matrices, supports row and column annotations, hierarchical clustering, k-means row aggregation, numeric overlays, and direct export to png, pdf, tiff, bmp, or jpeg. Use this skill when users need pheatmap syntax, troubleshooting, comparison against ComplexHeatmap or base heatmap, or a reproducible wrapper around pheatmap::pheatmap().

Thin
0
Phenix

Phenix — comprehensive macromolecular structure determination from X-ray crystallography, cryo-EM, and neutron diffraction data. Provides automated molecular replacement (Phaser), structure refinement (phenix.refine, phenix.real_space_refine), model building (AutoBuild), ligand fitting (LigandFit), map improvement, and validation (MolProbity). Built on the CCTBX computational crystallography toolbox. Input: MTZ reflection files, PDB/mmCIF models, cryo-EM maps (MRC/CCP4).

Thin
0
Full phenopacket workflow

GA4GH Phenopacket Schema for computable representation of clinical phenotypic data. Build, validate, and convert phenopackets encoding patient phenotypes, diagnoses, genomic interpretations, and medical actions using standardised ontology terms (HPO, MONDO, OMIM). Supports rare disease, cancer, and complex disease contexts with protobuf-based Python API.

Thin
0
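To make the phenopacket entry above more concrete, here is a minimal sketch that assembles a phenopacket-shaped record as plain JSON with the standard library only. The field names follow the GA4GH Phenopacket Schema v2 JSON layout as I understand it. The `build_phenopacket` helper, the identifiers, and the curator name are hypothetical, and this does not use the protobuf-based Python API the entry mentions.

```python
import json
from datetime import datetime, timezone


def build_phenopacket(packet_id, subject_id, hpo_terms):
    """Assemble a minimal phenopacket-like dict (hypothetical helper).

    hpo_terms is a list of (ontology term id, label) pairs.
    """
    return {
        "id": packet_id,
        "subject": {"id": subject_id},
        "phenotypicFeatures": [
            {"type": {"id": term_id, "label": label}}
            for term_id, label in hpo_terms
        ],
        "metaData": {
            "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "createdBy": "example-curator",  # hypothetical value
            "phenopacketSchemaVersion": "2.0",
        },
    }


packet = build_phenopacket(
    "example-packet-1",  # hypothetical identifier
    "patient-1",         # hypothetical identifier
    [("HP:0001250", "Seizure")],
)
packet_json = json.dumps(packet, indent=2)
```

A record meant to pass schema validation would likely need more metadata (for example ontology resource descriptions), which this sketch leaves out.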
Philosopher

Philosopher — fast, scalable Go toolkit for shotgun proteomics data analysis. Provides database downloading, peptide validation (PeptideProphet), protein inference (ProteinProphet), PTM localization (PTMProphet), multi-level integrative analysis (iProphet), FDR filtering, label-free and TMT/iTRAQ quantification, and multi-level reporting. Core engine behind FragPipe workflows for mass spectrometry-based proteomics.

Thin
0
phyloseq -- Microbiome Data Analysis in R

phyloseq -- R/Bioconductor package for importing, storing, analyzing, and visualizing microbiome census data. Handles OTU/ASV tables, taxonomy tables, sample metadata, phylogenetic trees, and reference sequences in a unified S4 object. Provides ordination (PCoA, NMDS, RDA, CCA), alpha and beta diversity analysis, taxonomic filtering, agglomeration, normalization, and ggplot2-based visualization. Imports BIOM, QIIME, and mothur formats. Core downstream analysis layer for DADA2 and QIIME 2 amplicon pipelines.

Thin
0
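The alpha diversity analysis named in the phyloseq entry above can be illustrated without R. The sketch below computes observed richness and the Shannon index from per-sample count vectors in plain Python. It is not phyloseq's API; the sample names and counts are made up for illustration.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def observed_richness(counts):
    """Number of taxa with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


# Hypothetical OTU counts: one even community, one dominated by a single taxon.
otu_table = {
    "sample_A": [10, 10, 10, 10],
    "sample_B": [37, 1, 1, 1],
}
alpha = {
    name: (observed_richness(counts), shannon_index(counts))
    for name, counts in otu_table.items()
}
```

Both samples have the same richness, but the even community scores a higher Shannon index, which is why the two metrics are usually reported together.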
PhyML

PhyML — maximum likelihood phylogenetic inference from nucleotide and amino acid sequence alignments. Supports GTR, HKY, K80, JC, and 15+ DNA/protein substitution models, gamma rate heterogeneity (+G), proportion of invariants (+I), NNI/SPR tree topology search, standard bootstrap, SH-like aLRT, and aBayes branch support. PhyML-SMS integrates automatic Smart Model Selection. Produces Newick tree files and detailed statistics reports with log-likelihood, AIC, BIC, and estimated model parameters.

Thin
0
phytools

phytools — R package for phylogenetic comparative biology and the study of trait evolution on phylogenies. Provides ancestral state reconstruction (continuous and discrete), stochastic character mapping (make.simmap), Brownian motion and OU model fitting, phylogenetic PCA, correlated trait evolution tests (Pagel's method), tree simulation (birth-death, coalescent), and publication-quality tree visualization (contMap, plotSimmap, phenogram). Use when users need phylogenetic comparative methods, character evolution analysis, trait mapping onto trees, or phylogenetic visualization in R.

Thin
0
PICRUSt2

PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States 2) predicts functional composition of microbial metagenomes from 16S rRNA, ITS1, or ITS2 amplicon marker gene data. Performs phylogenetic placement of sequences onto reference trees, hidden state prediction of gene families (EC, KO, COG), and pathway-level inference via MinPath. Essential for estimating MetaCyc pathways, KEGG orthologs, and enzyme functions from amplicon surveys without shotgun metagenomics. Use for microbiome functional profiling, comparative metagenomics, and NSTI-based reliability filtering.

Thin
0
Pierian Dx (Velsera)

Pierian Dx (now Velsera) Clinical Genomics Workspace (CGW) — comprehensive clinical interpretation and analysis platform for NGS data. Supports variant annotation, clinical significance assessment, report generation, and integration with EMR/LIS systems. Used for cancer diagnostics, molecular testing, and precision medicine workflows across 140+ global partner sites.

Thin
0
pigz

pigz (Parallel Implementation of GZip) — multi-threaded drop-in replacement for gzip that compresses and decompresses files using multiple CPU cores. Supports gzip, zlib, and single-entry zip formats. Essential for fast compression of FASTQ, SAM, VCF, and other large bioinformatics files in NGS pipelines. Includes unpigz for parallel decompression verification.

Thin
0
Pilon

Pilon — automated genome assembly improvement and variant detection tool. Polishes draft assemblies by correcting SNPs, indels, and misassemblies using short-read alignments (BAM files). Fills gaps in scaffolds via local reassembly with De Bruijn graphs. Generates corrected FASTA, change logs, VCF variant calls, and genome browser tracks. Use for Illumina-based assembly polishing, microbial genome finishing, and short-read variant calling against a draft reference. Alternatives: Racon (long reads), Medaka (ONT), NextPolish (large genomes), HyPo (hybrid).

Thin
0
Pipeline Composer

Use when working with pipeline-composer — pipeline-composer — compose

Thin
0
Pixi

Pixi — fast, cross-platform package manager for conda and PyPI packages with project-level lockfiles and built-in task runner. Uses pixi.toml manifest for reproducible, multi-platform environments. Ideal for bioinformatics pipelines requiring pinned tool versions across Linux, macOS, and Windows.

Thin
0
PlantCV

Use when analyzing plant images for phenotyping with PlantCV. Handles visible (RGB), near-infrared (NIR), fluorescence, thermal, and hyperspectral imaging workflows. Performs image segmentation, shape analysis, color analysis, landmark detection, and multi-plant workflows. Built on OpenCV with a modular Python API for high-throughput plant phenotyping pipelines. Also known as PlantCV, plant computer vision, plant image analysis.

Thin
0
PlasmoDB Tools

PlasmoDB — the Plasmodium genome resource and functional genomics database for malaria parasites. Part of VEuPathDB, it provides genomic, transcriptomic, and proteomic data for Plasmodium species including P. falciparum, P. vivax, P. berghei, P. chabaudi, P. knowlesi, and P. yoelii. Offers gene search, BLAST, orthology analysis (OrthoMCL), GO enrichment, expression queries, and a REST API for programmatic access to gene records, searches, and bulk data downloads.

Thin
0
PLINK 1.9/2

PLINK 1.9/2 — high-performance command-line toolset for whole-genome association analysis, population stratification, identity-by-descent, linkage disequilibrium computation, and genotype data management. Reads VCF, BED/BIM/FAM (PLINK 1 binary), and PGEN/PVAR/PSAM (PLINK 2) formats. Provides allele frequency calculation, Hardy-Weinberg filtering, case/control association tests (--assoc, --logistic, --glm), PCA, sample/variant QC, and format conversion. Essential in GWAS pipelines and population genetics workflows.

Thin
0
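The allele frequency and Hardy-Weinberg steps listed in the PLINK 1.9/2 entry above come down to simple arithmetic on genotype counts. The sketch below shows that arithmetic in plain Python using a chi-square statistic. It is an illustration only: the function name is hypothetical, and PLINK's own Hardy-Weinberg filter may use a different (exact) test.

```python
def allele_freq_and_hwe(n_aa, n_ab, n_bb):
    """Return (freq of allele A, minor allele freq, HWE chi-square statistic).

    Inputs are counts of the three genotypes at one biallelic variant.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, min(p, q), chi2


in_equilibrium = allele_freq_and_hwe(25, 50, 25)
het_deficit = allele_freq_and_hwe(30, 40, 30)
```

A variant with fewer heterozygotes than expected yields a larger statistic, which is the signal a Hardy-Weinberg QC filter acts on.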
PLINK 2.0

PLINK 2.0 — whole-genome association analysis toolset for large-scale genotype data management, quality control (missingness, HWE, MAF filtering), population stratification correction (PCA), linkage disequilibrium pruning, GWAS association testing (linear/logistic/Firth regression), polygenic risk scoring, KING kinship estimation, GRM computation, allele frequency calculation, and format conversion between VCF, PGEN, BED, BGEN, and Oxford formats. Standard pre-processing tool for REGENIE, SAIGE, GCTA, and LDSC pipelines.

Thin
0
PLINK/SEQ

PLINK/SEQ — open-source C/C++ library and command-line tool (pseq) for working with human genetic variation data from large-scale resequencing projects. Use when working with VCF loading, variant filtering, rare variant burden tests, gene-based association (SKAT, C-alpha, VT), mask-based queries, exome sequencing QC, or PLINK/SEQ project management.

Thin
0
PLIP (Protein-Ligand Interaction Profiler)

PLIP (Protein-Ligand Interaction Profiler) — rule-based detection, profiling, and visualization of non-covalent protein-ligand interactions in 3D structures. Identifies hydrogen bonds, hydrophobic contacts, pi-stacking, salt bridges, water bridges, halogen bonds, and metal coordination from PDB files. Outputs XML/text reports and PyMOL session files for structural biology and drug discovery.

Thin
0
Plotly

Plotly — interactive graphing library for Python producing publication-quality figures as HTML, PNG, SVG, and PDF. Provides plotly.express for concise high-level charts (scatter, bar, line, heatmap, violin, box, histogram), plotly.graph_objects for fine-grained control over traces and layouts, plotly.subplots for multi-panel figures, and plotly.io for static/interactive export. Commonly used in bioinformatics for volcano plots, heatmaps, PCA biplots, genome feature visualization, and interactive dashboards with Dash.

Thin
0
POD5

POD5 — Oxford Nanopore Technologies signal storage format based on Apache Arrow. Stores raw nanopore electrical signal (squiggle) data per read with calibration, pore, and run metadata. Replaces FAST5 (HDF5) as the primary ONT output format. Supports convert (FAST5↔POD5), inspect, merge, subset, filter, view, and repack operations via CLI and Python API. Use for reading, writing, converting, or subsetting Oxford Nanopore raw signal data.

Thin
0
Polars

Polars — high-performance DataFrame library written in Rust with Python and Rust APIs. Built for speed, parallelism, and out-of-core processing via lazy evaluation and SIMD acceleration. Use for: fast tabular data manipulation, genomic annotation tables, VCF/BED/TSV parsing, sample metadata processing, feature matrices, and replacing pandas in bioinformatics ETL pipelines. Key terms: LazyFrame, eager, scan_csv, scan_parquet, pl.col(), expressions, group_by, join, pivot, streaming mode, polars dataframe, fast pandas alternative.

Thin
0
PopART (Population Analysis with Reticulate Trees)

PopART (Population Analysis with Reticulate Trees) — Java-based application for constructing and visualizing haplotype networks from DNA sequence alignments. Supports TCS, Median-Joining, Minimum Spanning, Integer NJ, and Tight Span Walker network methods. Integrates geographic and demographic trait data for phylogeographic analysis. Reads Nexus format input with haplotype frequency and trait blocks. Used in population genetics, phylogeography, and conservation genetics studies.

Thin
0
PopLDdecay

Use when working with poplddecay — popLDdecay — fast, memory-efficient

Thin
0
Porechop

Porechop — adapter trimming and demultiplexing tool for Oxford Nanopore long-read sequencing data. Finds and removes adapters from read ends, splits chimeric reads containing internal adapters, and demultiplexes barcoded reads. Supports FASTQ and FASTA input (gzip-compressed or uncompressed). Part of the ONT preprocessing pipeline between basecalling and quality filtering or alignment. No longer actively maintained; Porechop_ABI is the community-maintained fork.

Thin
0
Poseidon

Poseidon — a framework for organizing, sharing, and managing archaeogenetic genotype datasets. Standardized package format (POSEIDON.yml, .janno metadata, .ssf sequencing source files, .bib bibliography) with genotype data in EIGENSTRAT (.geno/.snp/.ind), PLINK (.bed/.bim/.fam), or VCF formats. The trident CLI provides init, fetch, forge, genoconvert, list, summarise, survey, validate, rectify, and jannocoalesce commands for package management, data merging, format conversion, and quality control. Three public archives (PCA, PMA, PAA) host curated ancient DNA datasets.

Thin
0
PPanGGOLiN

Use when building pangenome graphs or analyzing pangenome structure across prokaryotic genomes. PPanGGOLiN partitions gene families into persistent, shell, and cloud categories using a statistical model applied to a pangenome graph. Accepts annotated input (GFF3, GBFF) or unannotated FASTA with built-in Prokka annotation. Detects Regions of Genome Plasticity (RGP), spots of insertion, and conserved pangenome modules. Use for comparative genomics, mobile element analysis, and microbial population pangenome studies.

Thin
0
Preseq

Preseq — C++ toolkit for predicting library complexity and future sequencing yield from genomic sequencing experiments. Estimates distinct molecule counts, extrapolates coverage curves, and identifies optimal sequencing depth using rational function approximation of Good-Toulmin power series. Supports BAM, BED, and text histogram inputs. Essential QC tool for detecting over-sequencing and planning cost-effective sequencing experiments.

Thin
0
PRIME-Del

Use when working with PRIME-Del, primedel, paired pegRNA deletion design, prime editing deletions, or short insertion-at-deletion workflows. PRIME-Del designs paired pegRNAs from a single FASTA target plus optional FlashFry, CRISPOR, or Broad GPP guide scoring tables, and supports either deletion-size windows or exact start/end targets. Use this skill when users need the real PRIME-Del CLI flags, input format requirements, troubleshooting for empty peg pair outputs, or a reproducible wrapper around the upstream `gen_pegs.py` script and `primedel.design` helpers.

Thin
0
PRINSEQ++

PRINSEQ++ — high-performance C++ tool for quality control and preprocessing of FASTQ sequencing reads. Filters and trims reads by quality score, length, GC content, sequence complexity (DUST/entropy), N content, and exact/near duplicates. Supports single-end and paired-end data. Generates summary statistics for QC reporting. Successor to the original Perl-based PRINSEQ. Used as a preprocessing step before alignment, assembly, or downstream analysis.

Thin
0
Prodigal

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a fast, accurate gene prediction tool for prokaryotic genomes and metagenomes. Predicts protein-coding genes in bacterial and archaeal DNA sequences using dynamic programming. Supports single genome mode, metagenomic mode, and anonymous mode. Outputs GFF3, GenBank, and protein/nucleotide FASTA formats. Use when predicting ORFs, annotating draft genomes, or finding genes in metagenomic assemblies.

Thin
0
ProDy

ProDy — Python library for protein dynamics analysis and structural bioinformatics. Provides elastic network models (GNM, ANM), principal component analysis (PCA) of structural ensembles and MD trajectories, normal mode analysis, structure parsing (PDB, mmCIF, DCD), sequence and structure alignment, contact maps, and cross-correlation analysis. Used for protein flexibility prediction, allosteric mechanism discovery, and conformational dynamics studies.

Thin
0
ProGen

ProGen — protein language model suite by Salesforce Research for generating functional protein sequences and scoring protein fitness. Trained on 280M+ protein sequences with conditional control tags for organism, function, and localization. Models range from 151M to 6.4B parameters. Generates novel proteins with catalytic activity comparable to natural enzymes. Essential for de novo protein design, directed evolution guidance, and protein fitness landscape exploration.

Thin
0
Prosit

Prosit — deep learning framework for predicting MS2 fragment ion spectra and indexed retention times (iRT) from peptide sequences. Enables in silico spectral library generation for any organism and protease. Provides REST API server with Docker/GPU deployment, outputs in generic/MSP/msms.txt formats. Core prediction engine behind Koina model serving and Oktoberfest rescoring. Used for DDA and DIA proteomics spectral library generation, retention time prediction, and peptide identification rescoring.

Thin
0
ProteinMPNN

ProteinMPNN — deep learning-based protein sequence design from backbone structures. Uses message passing neural networks to predict amino acid sequences that fold into a given 3D backbone. Supports fixed backbone design, multi-chain complex design, tied positions for symmetric assemblies, partial sequence fixing, and sampling with controllable temperature. Input via PDB/mmCIF structures; output as FASTA sequences.

Thin
0
Proteome Discoverer

Proteome Discoverer — Thermo Fisher Scientific's integrated proteomics software platform for mass spectrometry data analysis. Provides SEQUEST HT and Mascot database searching, Percolator PSM validation, label-free and TMT/iTRAQ isobaric quantification, post-translational modification analysis, multi-consensus reporting, and node-based visual workflow design. Primary platform for Orbitrap instrument data processing.

Thin
0
ProteoWizard msconvert

ProteoWizard msconvert — command-line mass spectrometry data converter for proteomics workflows. Use when converting vendor RAW files (Thermo, AB SCIEX, Waters, Bruker, Agilent) to open formats (mzML, mzXML, MGF, MS2). Essential preprocessing step for database search engines (MaxQuant, MSFragger, Comet), spectral libraries, and DIA analysis (DIA-NN, Spectronaut). Supports peakPicking, zlib compression, subset filtering, and SRM/MRM chromatogram extraction.

Thin
0
ProtTrans

Use when working with ProtTrans — a collection of protein language models (ProtT5, ProtBert, ProtAlbert, ProtXLNet, ProtElectra) — for generating protein sequence embeddings, secondary structure prediction, subcellular localization, and protein property prediction. ProtTrans models are trained on UniRef and BFD protein databases using self-supervised learning. The recommended model is ProtT5-XL-UniRef50 (~3B parameters, 1024-dim embeddings). Use for feature extraction from FASTA sequences, downstream fine-tuning on biological tasks, and protein function annotation.

Thin
0
PRSice-2

PRSice-2 — fast polygenic risk score (PRS) calculator using clumping and thresholding (C+T). Performs LD clumping on GWAS summary statistics, applies p-value thresholds, computes PRS in target genotype data, and evaluates predictive performance via regression. Supports quantitative and binary traits, PRSet pathway-based analysis, cross-ancestry PRS, permutation testing for empirical p-values, and high-resolution bar/scatter plots. Keywords: polygenic risk score, PRS, C+T, clumping thresholding, GWAS summary statistics, LD clumping, p-value threshold, PRSice, PRSet.

Thin
0
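The thresholding half of the clumping-and-thresholding approach in the PRSice-2 entry above can be sketched in a few lines. The example assumes the variants have already been LD-clumped and returns a plain weighted sum of effect-allele dosages. The SNP identifiers, effect sizes, and function name are hypothetical, and PRSice-2's own scoring options may normalize the sum differently.

```python
def prs_at_threshold(snps, dosages, p_threshold):
    """Score one individual at a single p-value threshold.

    snps:    dict of snp_id -> (effect size, GWAS p-value), assumed already clumped
    dosages: dict of snp_id -> effect-allele count (0, 1, or 2) for the individual
    Returns (score, number of variants used).
    """
    score = 0.0
    n_used = 0
    for snp_id, (beta, p_value) in snps.items():
        if p_value <= p_threshold and snp_id in dosages:
            score += beta * dosages[snp_id]
            n_used += 1
    return score, n_used


# Hypothetical summary statistics and genotypes.
summary_stats = {
    "snp1": (0.5, 1e-8),
    "snp2": (-0.25, 0.01),
    "snp3": (0.125, 0.4),
}
genotype = {"snp1": 2, "snp2": 1, "snp3": 1}

scores_by_threshold = {
    t: prs_at_threshold(summary_stats, genotype, t) for t in (5e-8, 0.05, 1.0)
}
```

Evaluating several thresholds in one pass mirrors the idea of scanning for the threshold with the best predictive performance, which a full tool then assesses by regression against the phenotype.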
PubMed Database

PubMed Database — NCBI's comprehensive biomedical literature database providing free access to over 37 million citations from MEDLINE, life science journals, and online books. Query via E-utilities REST API (ESearch, EFetch, ESummary, ELink, EPost) or Biopython Entrez module. Core API: Entrez.esearch(db="pubmed"), Entrez.efetch(), Entrez.elink(). Supports Boolean/MeSH queries, field tags, batch processing, citation matching, and history server for large result sets. Use for literature search, systematic reviews, citation retrieval, and programmatic PubMed access.

Thin
0
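For the E-utilities access described in the PubMed Database entry above, a request is just a URL with query parameters. The sketch below builds an ESearch URL with the standard library and makes no network call. The helper name and the example query are hypothetical, and real use should follow NCBI's usage guidelines (rate limits, API key, contact email).

```python
from urllib.parse import urlencode, urlparse, parse_qs

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20, use_history=False):
    """Build an ESearch URL for PubMed (hypothetical helper; no request is sent)."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    if use_history:
        # Ask the history server to keep the result set for later EFetch calls.
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("malaria[MeSH Terms] AND 2020:2024[dp]", retmax=5, use_history=True)
query = parse_qs(urlparse(url).query)
```

Building the URL separately from sending it keeps the query easy to log and test before any request reaches NCBI.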
Purge_dups

Purge_dups — haplotypic duplication identification and removal tool for genome assemblies. Detects and removes overlapping haplotigs and heterozygous duplications from primary assemblies using read depth analysis and self-alignment. Key step between initial assembly (hifiasm, Canu, Flye) and scaffolding (SALSA, YaHS). Works with PacBio HiFi, PacBio CLR, and Oxford Nanopore reads.

Thin
0
Purge Haplotigs

Purge Haplotigs — pipeline for curating heterozygous diploid genome assemblies by identifying and reassigning haplotigs (alternative haplotype contigs). Analyzes read-depth histograms to set coverage cutoffs, flags suspect contigs, and uses BLAST/LASTZ alignments to reassign haplotigs. Essential post-assembly QC for PacBio HiFi, ONT, and CLR long-read data.

Thin
0
PURPLE

PURPLE (Purity/Ploidy Estimator) — HMFtools Java tool for tumor purity and ploidy estimation from whole-genome sequencing. Consumes AMBER allele frequencies and COBALT read-depth ratios to fit the optimal purity/ploidy model, producing somatic copy-number profiles, driver catalogs, and annotated SV/SNV VCFs. Core component of the Hartwig Medical Foundation WGS pipeline (AMBER → COBALT → PURPLE → GRIPSS → LINX). Supports GRCh37 and GRCh38.

Thin
0
pwr

pwr — Statistical power analysis functions along the lines of Cohen (1988). Computes sample size, effect size, significance level, or power for t-tests, ANOVA, chi-square tests, correlation tests, proportion tests, and general linear models. Implements conventional effect size benchmarks (small, medium, large) and arcsine-transform effect sizes for proportions. Standard tool for experimental design in clinical trials, A/B testing, grant applications, and pre-registration of studies. Includes plotting methods for power curves and vector-valued parameter sweeps.

Thin
0
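The arcsine-transform effect size for proportions mentioned in the pwr entry above is Cohen's h. The sketch below computes h and a normal-approximation sample size for a two-sided, two-group comparison of proportions. This is plain Python rather than the R package, the function names are hypothetical, and the sample-size formula is the common textbook approximation, so results can differ slightly from pwr's.

```python
import math
from statistics import NormalDist


def cohen_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def n_per_group(h, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / abs(h)) ** 2)


h = cohen_h(0.50, 0.25)
n = n_per_group(h)
```

By Cohen's conventional benchmarks, an h of about 0.5 counts as a medium effect.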
PyAEDT

PyAEDT — Python library for automating Ansys Electronics Desktop (AEDT) electromagnetic, thermal, and circuit simulations. Provides unified API for HFSS (high-frequency EM), Icepak (thermal management), Maxwell 3D/2D (magnetostatics and eddy currents), Q3D Extractor (parasitic extraction), Circuit/Nexxim (circuit simulation), Twin Builder (system simulation), and HFSS 3D Layout (PCB/package analysis). Supports parametric sweeps, design optimization, post-processing, and report generation.

Thin
0
pybedtools

Use when working with pybedtools, a Python interface to BEDTools.

Thin
0
Pychopper

Pychopper — Nanopore direct cDNA read preprocessing tool that identifies, orients, and trims full-length reads using primer detection (pHMM or alignment). Rescues fused reads by dynamic programming, extracts UMI from primer sequences, outputs validated FASTQ/BAM with per-read statistics for filtering and downstream analysis. Essential for direct cDNA sequencing workflows.

Thin
0
PycoQC

PycoQC — Python tool for Oxford Nanopore sequencing QC that parses basecaller summary files (sequencing_summary.txt from Guppy, Dorado, or MinKNOW) and BAM alignment files to generate interactive HTML quality control reports. Computes read length distributions, quality score distributions, throughput over time, channel activity, read length vs quality, and alignment statistics. Supports 1D and 1D2 sequencing runs. Part of the Nanopore QC ecosystem alongside NanoPlot and NanoStat.

Thin
0
PyDESeq2

PyDESeq2 — Python implementation of DESeq2 for differential gene expression analysis from bulk RNA-seq count data. Performs size factor normalization, genewise dispersion estimation, Wald tests with Benjamini-Hochberg FDR correction, and optional apeGLM log-fold-change shrinkage. Supports single- and multi-factor designs via formulaic notation, AnnData integration, and reproducible export of results DataFrames and pickled objects.

Thin
0
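The size factor normalization named in the PyDESeq2 entry above is usually the median-of-ratios method from DESeq2. The sketch below implements that calculation in plain Python on a tiny count matrix. It does not use PyDESeq2's API, and the function name and counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of raw gene counts in the same gene order.
    Genes with a zero count in any sample are excluded from the reference.
    """
    n_genes = len(counts[0])
    log_geo_means = []
    for g in range(n_genes):
        column = [sample[g] for sample in counts]
        if all(c > 0 for c in column):
            log_geo_means.append(sum(math.log(c) for c in column) / len(column))
        else:
            log_geo_means.append(None)  # gene skipped
    if all(lgm is None for lgm in log_geo_means):
        raise ValueError("every gene has a zero count in some sample")
    factors = []
    for sample in counts:
        log_ratios = [
            math.log(sample[g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample B was sequenced twice as deeply as sample A.
raw = [
    [10, 20, 30, 0],
    [20, 40, 60, 5],
]
factors = size_factors(raw)
normalized = [[c / f for c in sample] for sample, f in zip(raw, factors)]
```

Dividing each sample by its factor puts the shared genes on the same scale, so later dispersion estimates and tests are not driven by sequencing depth.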
pyGenomeTracks

pyGenomeTracks -- Python program and library for plotting publication-quality genome browser tracks. Generates highly customizable visualizations of bigWig, BED, GTF, bedGraph, Hi-C matrices, links/arcs, narrow peaks, FASTA, and MAF data using INI-format configuration files. Part of the deepTools ecosystem. Supports PDF, PNG, SVG output. CLI tools: pyGenomeTracks, make_tracks_file.

Thin
0
PyMC

PyMC — probabilistic programming framework for Bayesian statistical modeling in Python. Define models with intuitive syntax using pm.Model() context manager, specify priors (Normal, HalfNormal, HalfCauchy, Exponential, Beta, Gamma, StudentT, Uniform), define likelihoods with observed data, run MCMC inference via NUTS/Metropolis/Slice samplers (pm.sample), perform variational inference (ADVI, FullRankADVI), generate posterior/prior predictive checks, and compare models with ArviZ (WAIC, LOO). Built on PyTensor computational backend.

Thin
0
PyMOL

PyMOL — molecular visualization system for rendering publication-quality 3D images of protein structures, nucleic acids, small molecules, electron density maps, and volumetric data. Provides interactive visualization with cartoon, surface, stick, and sphere representations. Features structural alignment (align, super, cealign), distance/angle measurements, electrostatic surface mapping via APBS integration, ray-traced rendering, movie generation, and a full Python scripting API (cmd module) for reproducible figure generation. Supports PDB, mmCIF, SDF/MOL2, and MAP file formats with direct PDB/EMDB fetching. Used in structural biology, drug discovery, and molecular modeling for structure analysis, ligand binding visualization, and publication figures.

Thin
0
pyOpenMS

pyOpenMS — Python bindings for OpenMS mass spectrometry framework. Read/write mzML, mzXML, mzTab, TraML, and FASTA files. Signal processing, peak picking, feature detection, peptide identification, protein quantification, and metabolomics analysis. Use for proteomics pipelines, LC-MS/MS data processing, spectral library search, and targeted quantification (SRM/MRM/PRM).

Thin
0
Pyrodigal

Pyrodigal is a Python library binding to Prodigal for gene prediction in prokaryotic genomes and metagenomes. Provides in-memory ORF finding with SIMD-accelerated scoring, single-genome and metagenomic modes, custom translation tables, region masking, and thread-safe Gene/GeneFinder API. Use for prokaryotic gene calling, ORF detection, protein translation, metagenomic gene prediction, or replacing command-line Prodigal.

Thin
0
Pysam

Pysam — Python interface to htslib for reading, writing, and manipulating SAM/BAM/CRAM alignment files, VCF/BCF variant files, FASTA/FASTQ sequences, and tabix-indexed files. Provides Pythonic wrappers around samtools and bcftools with indexed random access, pileup analysis, and programmatic variant filtering. Key classes: AlignmentFile, VariantFile, FastaFile, FastxFile, TabixFile.

Thin
0
pySCENIC

pySCENIC — Python implementation of SCENIC for single-cell gene regulatory network inference and regulon analysis. Implements a three-step pipeline: GRN inference via GRNBoost2 or GENIE3, cis-regulatory motif enrichment via cisTarget, and cellular enrichment scoring via AUCell. Identifies transcription factor regulons from scRNA-seq count matrices and produces regulon activity scores (AUC) per cell for downstream clustering and cell state characterization.

Thin
0
pysheds

pysheds — simple and fast watershed delineation in Python. Processes digital elevation models (DEMs) for hydrological analysis including pit filling, depression filling, flat resolution, D8 flow direction, flow accumulation, catchment delineation, stream ordering, river network extraction, HAND computation, and distance-to-outlet calculations. Reads GeoTIFF and ASCII grid formats via rasterio, outputs raster grids and GeoJSON river networks.

Thin
0
PyStan

PyStan — Python interface to Stan for Bayesian statistical modeling and high-performance inference. Compile Stan programs (stan.build), draw posterior samples via HMC-NUTS (model.sample), extract draws as DataFrames (fit.to_frame), access parameters dictionary-style (fit["param"]), compute log-probabilities and gradients (model.log_prob, model.grad_log_prob), and extend via plugins (stan.plugins.PluginBase). Supports hierarchical models, regression, time series, and custom probability models.

Thin
0
pyteomics -- Proteomics and Mass Spectrometry Data Analysis

pyteomics -- Python library for proteomics and mass spectrometry data analysis. Reads and writes mzML, mzXML, MGF, FASTA, pepXML, mzIdentML, and other standard formats. Provides mass calculations, peptide fragmentation prediction, isotopic distribution modeling, enzyme digestion simulation, and retention time indexing. Supports LC-MS/MS data parsing, PSM handling, and FDR estimation with target-decoy approach.

Thin
0
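Two of the calculations in the pyteomics entry above, enzyme digestion simulation and peptide mass calculation, are easy to show from first principles. The sketch below does an in silico tryptic digest (cleaving after K or R unless the next residue is P) and sums standard monoisotopic residue masses. It does not call pyteomics; the function names and the example sequence are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


peptides = tryptic_digest("ACDKEFRPGHKLM")  # made-up sequence
masses = {pep: monoisotopic_mass(pep) for pep in peptides}
```

Missed cleavages and modifications are left out here. A real pipeline would normally enumerate both.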
PyTorch

PyTorch — open-source deep learning framework for building and training neural networks in biology and life sciences. Provides tensor computation with GPU acceleration, automatic differentiation (autograd), nn.Module model building, DataLoader pipelines, mixed-precision training (AMP), distributed training (DDP), TorchScript export, and ONNX interoperability. Used for protein structure prediction, genomic sequence modeling, single-cell analysis, drug discovery, and medical image classification.

Thin
0
QIIME 2

QIIME 2 — open-source microbiome bioinformatics platform for amplicon and marker-gene analysis. Plugin-based architecture with semantic type system enforcing data compatibility between analysis steps. Processes demultiplexed amplicon sequences (16S/18S/ITS) through denoising (DADA2/Deblur), taxonomy assignment (Naive Bayes classifiers), alpha/beta diversity analysis, and differential abundance testing (ANCOM-BC). All data stored as Artifacts (.qza) with automatic provenance tracking for full reproducibility. Integrates with Greengenes2, SILVA, and GTDB reference databases. Supports CLI (q2cli), Python Artifact API, and Galaxy interfaces.

Thin
0
Qualimap

Qualimap — platform-independent quality control tool for next-generation sequencing alignment data. Provides BAM QC (coverage, insert size, GC content, mapping quality), RNA-seq QC (gene body coverage, 5'/3' bias, junction analysis), multi-sample BAM QC (cross-sample comparison), and counts QC (saturation, biotype proportions). Generates HTML reports with interactive plots. Used in WGS, WES, and RNA-seq pipelines for alignment quality assessment.

Thin
0
QUAST

QUAST (Quality Assessment Tool for Genome Assemblies) — evaluate genome assembly quality using contiguity metrics including N50, L50, total length, largest contig, misassemblies, and mismatches. Supports reference-based and reference-free assessment. Includes MetaQUAST for metagenome assemblies and QUAST-LG for large genomes (> 100 Mb). Essential QC step for de novo assembly.

Thin
0
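The contiguity metrics N50 and L50 named in the QUAST entry above have short definitions that are easy to check by hand. The sketch below computes both from a list of contig lengths in plain Python. It illustrates the definitions only, not QUAST's implementation, and the contig lengths are made up.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of contigs,
    sorted longest first, reaches half of the assembly size. L50 is how many
    contigs that took.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths, total 290
n50, l50 = n50_l50(contigs)
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 290 total, so N50 is 70 and L50 is 2.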
QuPath

QuPath — open-source platform for whole slide image analysis and digital pathology. Provides interactive tools for tissue detection via thresholding, cell detection and positive cell classification (H-score, Ki67 labelling index), pixel classification with machine learning (random forests, ANNs), multiplexed fluorescence analysis, density mapping, stain separation via color deconvolution, and measurement export. Supports deep learning extensions including StarDist (nucleus detection), InstanSeg (instance segmentation), and WSInfer (whole slide inference). Built on Java with Groovy scripting for automation and batch processing, integrates with ImageJ, OpenCV, and the Bioimage Model Zoo. Handles whole slide images (WSI) from scanners in formats including .svs, .ndpi, .mrxs, .scn, .czi, and OME-TIFF via Bio-Formats and OpenSlide.

Thin
0
qvalue

qvalue — Q-value estimation for false discovery rate control in multiple hypothesis testing. Estimates q-values, the proportion of true null hypotheses (pi0), and local false discovery rates from vectors of p-values. Core method for controlling FDR in genome-wide studies (RNA-seq, GWAS, proteomics, ChIP-seq) where thousands of tests are performed simultaneously. Implements the Storey (2002, 2003) direct approach to FDR with automatic pi0 estimation via spline smoother or bootstrap. Includes empirical p-value computation from permutation null distributions and diagnostic visualization of q-value distributions. Standard tool for post-hoc significance filtering in differential expression, association studies, and any high-throughput multiple testing scenario.

Thin
0
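The pi0 and q-value estimates described in the qvalue entry above can be sketched with a single fixed tuning parameter. The example below estimates pi0 from the fraction of p-values above `lam` and then derives monotone q-values in plain Python. Using one fixed `lam` instead of the spline smoother or bootstrap is a simplifying assumption of this sketch, and the function name and p-values are hypothetical.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Return (pi0 estimate, q-values in the input order)."""
    m = len(pvalues)
    if m == 0:
        raise ValueError("no p-values given")
    # p-values above lam are assumed to come mostly from true nulls.
    pi0 = min(1.0, sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    previous = 1.0
    # Walk from the largest p-value down so q-values never increase as p decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        q = min(previous, pi0 * m * pvalues[i] / rank)
        qvalues[i] = q
        previous = q
    return pi0, qvalues


pvals = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.6, 0.9]  # made-up values
pi0, qvals = storey_qvalues(pvals)
```

With pi0 fixed at 1.0 the same loop reduces to the Benjamini-Hochberg adjustment, so the pi0 estimate is what makes the q-values less conservative.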
Racon

Racon — ultrafast consensus module for raw de novo genome assembly of long uncorrected reads (PacBio CLR, Oxford Nanopore). Uses partial order alignment (POA) to polish draft assemblies from miniasm, wtdbg2, Canu, Flye, or other assemblers. Supports optional GPU acceleration via CUDA. Typically chained as minimap2 → racon (1-2 rounds) → medaka for Nanopore polishing workflows.

Thin
0
RagTag (RaGOO/RagTag)

RagTag — reference-guided genome assembly scaffolding, misassembly correction, gap patching, and multi-scaffold merging toolkit. Successor to RaGOO. Supports minimap2, unimap, and nucmer aligners. Produces AGP and FASTA outputs with confidence scoring. Enables Hi-C-guided scaffold merging for consensus assembly. Use for chromosome-scale scaffolding, gap filling, assembly correction, and multi-reference merging. Alternatives: SALSA2 (Hi-C), 3D-DNA (Hi-C), Chromosomer (reference-guided), ALLMAPS (genetic maps).

Thin
0
RawTools

RawTools — open-source C# toolkit for parsing and quality control of Thermo Orbitrap RAW mass spectrometry files from DDA experiments. Extracts scan metadata, generates QC metrics for instrument monitoring, converts .raw files to MGF format for downstream database search, extracts chromatograms and extracted ion chromatograms (XIC), and supports FAIMS, TMT, and TMTPro data. Use for proteomics RAW file interrogation, MS quality control dashboards, and MGF generation pipelines.

Thin
0
RAxML-NG

RAxML-NG — maximum likelihood phylogenetic tree inference using iterative SPR moves with libpll likelihood computation. Supports DNA, protein, binary, and multi-state data with partitioned analyses, non-parametric bootstrap, Transfer Bootstrap Expectation (TBE), ancestral state reconstruction, consensus trees, Robinson-Foulds distance computation, and terrace analysis. Three-level parallelization (SSE/AVX, pthreads, MPI) with checkpointing. Successor to RAxML 8.x.

Thin
0
Rcorrector

Rcorrector — kmer-based error correction for RNA-seq reads. Corrects Illumina sequencing errors in FASTQ data using Jellyfish2 bloom filters and adaptive kmer frequency thresholds. Handles non-uniform transcript coverage in bulk RNA-seq and single-cell datasets. Outputs corrected FASTQ files with cor/unfixable_error annotations in read headers. Run via perl run_rcorrector.pl for paired-end or single-end reads.

Thin
0
spacexr (RCTD)

spacexr (RCTD) — R package for Robust Cell Type Decomposition of spatial transcriptomics data. Deconvolves cell type mixtures in spatial spots using single-cell RNA-seq references. Supports three modes: doublet (max 2 types per spot), full (weighted proportions), and multi (flexible assignments). Includes C-SIDE for cell-type-specific spatially variable differential expression. Works with Visium, Slide-seq, MERFISH, and other spatial platforms. Requires a Reference object built from annotated scRNA-seq and a SpatialRNA object from spatial counts plus coordinates.

Thin
0
Reactome

Reactome — curated pathway knowledgebase and analysis platform for pathway enrichment, expression overlay, and species comparison. Use when user needs pathway over-representation analysis (ORA) with ReactomePA enrichPathway, gene set enrichment analysis (GSEA) with gsePathway, expression overlay on curated diagrams via REST API, or programmatic pathway queries with reactome2py. Covers 2,700+ peer-reviewed human pathways across metabolism, signaling, immunity, cell cycle, and disease.

Thin
0
ReactomePA

Use when working with ReactomePA — an R/Bioconductor package for Reactome Pathway Analysis. Performs over-representation analysis (ORA) and gene set enrichment analysis (GSEA) against the Reactome pathway database. Accepts Entrez Gene ID lists from RNA-seq differential expression results and ranked gene lists for GSEA. Key functions: enrichPathway() for ORA, gsePathway() for GSEA, viewPathway() for pathway topology visualization. Supports human, mouse, rat, zebrafish, fly, worm, and yeast. Integrates with clusterProfiler, enrichplot, and ggplot2 for publication-quality figures.

Thin
0
REGENIE

REGENIE — whole-genome regression for biobank-scale GWAS. Two-step ridge regression with leave-one-chromosome-out (LOCO) predictions for efficient single-variant and gene-based association testing. Supports quantitative traits, binary traits with Firth/SPA correction for unbalanced case-control ratios, and time-to-event phenotypes. Handles population structure and relatedness without GRM computation. Includes Burden, SKAT, SKATO, ACAT-V, ACAT-O gene-based tests and GxE/GxG interaction testing. Input: BGEN, PLINK bed, PLINK2 pgen. Output: .regenie summary statistics.

Thin
0
REINVENT4

REINVENT4 — AI-driven de novo molecular design platform for drug discovery using reinforcement learning (RL) and generative models. Generate novel SMILES, decorate scaffolds (LibInvent), design linkers (LinkInvent), perform scaffold hopping (Mol2Mol), and pre-train custom priors (Transfer Learning). Configured via TOML files and run as a CLI tool. Use for lead generation, hit expansion, scaffold decoration, fragment linking, multi-objective molecular optimization, and ADMET-guided drug design.

Thin
0
RELION

RELION — Bayesian approach to cryo-EM single-particle analysis for 3D structure determination of biological macromolecules. Provides motion correction, CTF estimation, particle picking (autopicking, LoG, Topaz), 2D/3D classification, 3D auto-refinement, CTF refinement, Bayesian polishing, post-processing, and local resolution estimation. Uses STAR metadata files and MRC map format throughout the pipeline. Integrates with CTFFIND, MotionCor2, EMAN2, and molecular modeling tools.

Thin
0
RepeatMasker

RepeatMasker — screens DNA sequences for interspersed repeats and low-complexity regions. Uses Tandem Repeats Finder and curated repeat libraries (Dfam, RepBase) to identify SINEs, LINEs, LTR elements, DNA transposons, simple repeats, and satellites. Outputs masked FASTA (hard/soft), annotation tables (.out), and GFF. Essential pre-processing step for genome annotation, gene prediction, and comparative genomics pipelines.

Thin
0
RepeatModeler

RepeatModeler — automated de novo repeat family identification and modeling pipeline for genomic sequences. Combines RECON and RepeatScout for consensus building, with optional LTR structural detection via LTRharvest/LTR_retriever. Produces classified repeat libraries for genome annotation with RepeatMasker. Essential for repeat annotation in non-model organisms and new genome assemblies.

Thin
0
ResFinder

Use when working with ResFinder, a command-line tool for identifying

Thin
0
REVEL -- Rare Exome Variant Ensemble Learner

REVEL (Rare Exome Variant Ensemble Learner) -- ensemble method for predicting the pathogenicity of rare missense variants in the human genome. A random forest combines scores from 13 individual tools (MutPred, FATHMM, VEST3, PolyPhen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, phastCons) into a single score (0-1), trained on HGMD disease mutations versus rare neutral missense variants from population datasets. In the authors' benchmarks it outperformed the individual component predictors for missense pathogenicity classification. Pre-computed scores available for all possible missense variants in the human exome (GRCh37/hg19 and GRCh38/hg38) via dbNSFP and direct download.

Thin
0
REViewer

REViewer — haplotype-resolved visualization of read alignments at short tandem repeat (STR) loci detected by ExpansionHunter. Accepts a sorted BAM/CRAM file, an ExpansionHunter VCF, a reference FASTA, and a variant catalog JSON; produces SVG/PNG read-pileup images per locus. Use for clinical STR interpretation, repeat expansion QC, and generating publication-quality figures for tandem repeat variants in disease-associated loci (HTT, FMR1, C9orf72, ATXN, RFC1, etc.).

Thin
0
RFdiffusion -- Protein Structure Generation via Diffusion

RFdiffusion -- deep learning framework for protein structure generation via denoising diffusion on SE(3) protein backbone frames. Designs novel protein backbones including unconditional monomer generation, motif scaffolding, symmetric oligomer assembly, binder design, and fold-conditioned generation. Built on RoseTTAFold architecture. Pairs with ProteinMPNN for sequence design and AlphaFold2/ESMFold for structure prediction validation. Accepts PDB inputs, produces PDB backbone coordinates.

Thin
0
Rhtslib

Use when working with Rhtslib, the Bioconductor package that vendors HTSlib inside R packages for high-throughput sequencing development. Covers pkgconfig("PKG_LIBS"), pkgconfig("PKG_CPPFLAGS"), Rhtslib:::htsVersion(), LinkingTo setup, GNU make Makevars snippets, header discovery, and package build validation for R packages that need bundled HTSlib instead of a system dependency. Trigger phrases include Rhtslib, HTSlib in R, compile against HTSlib from Bioconductor, Makevars PKG_LIBS, and Rsamtools example package.

Thin
0
rMATS

rMATS (Replicate Multivariate Analysis of Transcript Splicing) — statistical tool for detecting differential alternative splicing from RNA-seq data. Identifies five splicing event types: skipped exon (SE), alternative 5'/3' splice sites (A5SS/A3SS), mutually exclusive exons (MXE), and retained introns (RI). Uses a hierarchical model with biological replicates for accurate splicing quantification. Key flags: --b1, --b2, --gtf, --od, --tmp, --readLength, -t, --nthread, --variable-read-length, --novelSS, --statoff. Input: BAM (sorted) or FASTQ + STAR index. Output: per-event-type tab-separated files with PSI, delta-PSI, p-value, FDR. Essential for transcriptomics studies comparing splicing regulation across conditions, tissues, or disease states. Use for differential splicing, exon skipping, intron retention, and alternative splice site analysis.

Thin
0
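
For the rMATS entry above, the PSI and delta-PSI columns can be illustrated with a small sketch. It assumes the commonly cited length-normalized inclusion level, where inclusion and skipping read counts are each divided by an effective isoform length before taking the ratio. The function names and the way replicates are averaged are illustrative assumptions, not rMATS code.

```python
def psi(inc_reads, skip_reads, inc_len, skip_len):
    """Length-normalized percent spliced in for one event in one sample.

    Returns None when the event has no supporting reads at all.
    """
    inc = inc_reads / inc_len
    skip = skip_reads / skip_len
    total = inc + skip
    if total == 0:
        return None
    return inc / total


def delta_psi(group1_psi, group2_psi):
    """Difference of mean PSI between two groups of replicates.

    Replicates without a PSI value (None) are ignored; returns None if
    either group has no usable replicate.
    """
    g1 = [x for x in group1_psi if x is not None]
    g2 = [x for x in group2_psi if x is not None]
    if not g1 or not g2:
        return None
    return sum(g1) / len(g1) - sum(g2) / len(g2)
```

For example, 20 inclusion reads over an effective length of 2 and 10 skipping reads over an effective length of 1 give equal normalized support for both isoforms, so `psi(20, 10, 2, 1)` is 0.5.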
nf-core/rnafusion

Use when working with nf-core/rnafusion, a Nextflow pipeline for RNA fusion gene detection and quantification. Detects fusion transcripts from RNA-seq data using up to six callers (STAR-Fusion, Arriba, FusionCatcher, Squid, StringTie, pizzly). Supports single-sample and multi-sample cohort analysis with integrated QC (FastQC, MultiQC, Fastp), genome reference building, and VCF/TSV fusion output. Runs on HPC, cloud (AWS, GCP, Azure), and local environments via nf-core profiles.

Thin
0
RNA-SeQC

RNA-SeQC — fast C++ tool for RNA-seq quality control and gene-level quantification for large cohorts. Computes 70+ QC metrics including alignment rates, rRNA contamination, GC bias, 3'/5' coverage bias, exonic/intronic/intergenic rates, and gene detection. Outputs per-sample metrics TSV, gene-level read counts and TPM in GCT format, fragment sizes, and per-transcript coverage. Requires a collapsed GTF with non-overlapping transcripts. Used in GTEx and CPTAC pipelines.

Thin
0
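
For the RNA-SeQC entry above, the relationship between gene-level read counts and TPM can be sketched as follows. This is the generic TPM definition, assumed here for illustration; it is not taken from the RNA-SeQC source, and the function name `tpm` is hypothetical.

```python
def tpm(counts, lengths):
    """Convert per-gene read counts to transcripts per million.

    counts  -- reads assigned to each gene
    lengths -- gene (or collapsed transcript) lengths in bases
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    # Reads per base, so that long genes are not favoured.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million within the sample.
    return [r / total * 1e6 for r in rates]
```

Because the values are rescaled per sample, TPM is comparable across genes within a sample, while raw counts remain the appropriate input for count-based differential expression tools.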
RnBeads

RnBeads — comprehensive R/Bioconductor package for DNA methylation analysis. Supports Illumina Infinium arrays (450K, EPIC/850K), WGBS, RRBS, and other bisulfite sequencing platforms. Key capabilities: automated HTML report generation, quality control, preprocessing (normalization, filtering), batch effect analysis, differential methylation analysis (DMA) at locus and region level, segmentation, and cross-dataset comparison. Use for epigenome-wide association studies (EWAS), cancer methylation profiling, imprinting analysis, and methylation-based cell type deconvolution.

Thin
0
Roary

Roary — rapid large-scale prokaryotic pan genome analysis. Calculates the pan genome from annotated assemblies (GFF3 from Prokka/Bakta), producing core and accessory gene clusters, gene presence/absence matrices, and core gene alignments for phylogenetics. Supports identity thresholds, paralog splitting, and outputs suitable for Scoary, Phandango, and downstream tree building.

Thin
0
Rosetta

Rosetta is a comprehensive macromolecular modeling suite for protein structure prediction, protein design, docking, loop modeling, and enzyme engineering. Includes RoseTTAFold for deep-learning structure prediction, RosettaDesign for computational protein design, RosettaDock for protein-protein docking, and the RosettaScripts XML protocol engine. Uses the REF2015 energy function with PDB input/output and silent file formats.

Thin
0
RoseTTAFold

RoseTTAFold — deep learning-based protein structure prediction using a three-track neural network architecture for simultaneous processing of 1D sequence, 2D distance maps, and 3D coordinates. Predicts protein structures from amino acid sequences using multiple sequence alignments (MSAs) and templates. Supports monomer structure prediction, complex modeling with RoseTTAFold2, and protein-nucleic acid interactions with RoseTTAFold All-Atom. Developed by the Baker lab at UW.

Thin
0
Rsamtools

Use when working with Rsamtools, the Bioconductor R package for indexed BAM, BCF, FASTA, and tabix-backed genomic files. Covers BamFile, ScanBamParam, filterBam, sortBam, indexBam, pileup, TabixFile, scanTabix, FaFile, scanFa, and BcfFile workflows for alignment inspection, region-restricted retrieval, FASTA sequence extraction, and tabix/BCF access inside R pipelines. Trigger phrases include scan BAM in R, count alignments by region, query tabix from R, read indexed FASTA, and Rsamtools vs samtools.

Thin
0
RStan

RStan — R interface to Stan for full Bayesian inference via MCMC (NUTS/HMC), approximate inference via ADVI, and penalized MLE via L-BFGS. Compile Stan programs in-process with stan() or stan_model(), draw posterior samples with sampling(), obtain point estimates with optimizing(), run variational Bayes with vb(), and generate quantities from existing fits with gqs(). Extract posterior draws via extract() or as.matrix(), diagnose convergence with Rhat, ESS, and check_hmc_diagnostics(), and visualize with stan_trace(), stan_plot(), and stan_dens(). Tight integration with bayesplot, loo, and shinystan for post-estimation workflows.

Thin
0
rstanarm

rstanarm — Bayesian Applied Regression Modeling via Stan. Fit Bayesian regression models using pre-compiled Stan programs with familiar R syntax: stan_glm() for generalized linear models, stan_glmer()/stan_lmer() for mixed-effects models, stan_polr() for ordinal regression, stan_betareg() for beta regression, stan_gamm4() for generalized additive mixed models, stan_jm() for joint longitudinal-survival models, and stan_mvmer() for multivariate models. Supports flexible prior specification (normal, student_t, cauchy, laplace, horseshoe, R2), posterior predictive checks via pp_check(), LOO cross-validation via loo(), and interactive diagnostics via launch_shinystan(). Backend: rstan HMC/NUTS sampler.

Thin
0
rstpm2

rstpm2 — R package for generalized survival models (GSMs), smooth accelerated failure time (AFT) models, and Markov multi-state models. Flexible parametric survival analysis via stpm2() with natural spline baselines and multiple link functions (log-log, probit, logit, additive hazards, proportional odds). Penalized models via pstpm2() using mgcv smoothers without manual knot selection. Predictions for survival, hazard, hazard ratios, survival differences, restricted mean survival time (RMST), cure fractions, and attributable fractions via 25+ predict types. Markov multi-state models via markov_msm() for state occupancy, length of stay, costs, and utilities. Supports left truncation, right/interval censoring, gamma frailties, normal random effects, copulas, relative survival, and cure models.

Thin
0
Rsubread

Use when working with Rsubread, the Bioconductor R package for read alignment, exon junction discovery, feature counting, long-read mapping, annotation flattening, and alignment QC. Covers buildindex(), align(), subjunc(), featureCounts(), flattenGTF(), qualityScores(), propmapped(), sublong(), and txUnique() for bulk RNA-seq, genomic DNA sequencing, and transcript-aware counting pipelines in R. Trigger on: Rsubread, Subread aligner, Subjunc, featureCounts in R, build a Subread index, flatten GTF to SAF, long-read alignment in R, or Rsubread vs STAR / HISAT2 / HTSeq.

Thin
0
RTG Tools

RTG Tools — Java-based toolkit from Real Time Genomics for haplotype-aware variant call comparison, VCF filtering, statistics, and pedigree analysis. Primary use: vcfeval for benchmarking variant callers against a truth set (GIAB, PrecisionFDA), using SDF reference format. Also provides vcffilter, vcfstats, vcfannotate, vcfmerge, mendelian, and pedstats. Use when user needs to compare VCFs, benchmark callers, check Mendelian concordance, compute VCF statistics, or work with RTG SDF reference files.

Thin
0
rtracklayer

Use when working with rtracklayer, a Bioconductor R package

Thin
0
t-SNE (Rtsne)

t-SNE via Rtsne — Barnes-Hut t-distributed Stochastic Neighbor Embedding for nonlinear dimensionality reduction and visualization. Wraps Van der Maaten's C++ Barnes-Hut implementation for O(n log n) computation. Supports exact and approximate modes, perplexity tuning, PCA initialization, distance matrix input, multi-threaded computation via OpenMP, and continuation from prior embeddings. Standard visualization method in single-cell RNA-seq (Seurat, Scanpy), genomics, and high-dimensional exploratory data analysis.

Thin
0
rust-bio

rust-bio — high-performance Rust library for bioinformatics algorithms

Thin
0
rust-htslib

rust-htslib — safe Rust bindings to htslib for reading and writing BAM, SAM, CRAM, VCF, BCF, tabix, BGZF, and FASTA index files. Provides bam::Reader, bam::IndexedReader, bcf::Reader, bcf::Writer, faidx::Reader, tbx::Reader, and bgzf compression. Use for high-performance alignment processing, variant calling pipelines, indexed region queries, and BAM/VCF manipulation in Rust bioinformatics applications. Alternative to pysam and htslib C API.

Thin
0
SAbDab

SAbDab (Structural Antibody Database) and ANARCI antibody numbering toolkit. Query the curated database of all antibody structures from the PDB with consistent Chothia, IMGT, Kabat, and Martin numbering. Number antibody sequences with ANARCI, extract CDR loops, analyze VH/VL pairings, and retrieve antibody-antigen complex structures. Essential for computational antibody engineering, therapeutic antibody design, and structural immunology.

Thin
0
Sage

Use when working with Sage, an ultrafast Rust proteomics search

Thin
0
SAIGE

SAIGE — scalable genome-wide association tests in biobank-scale data using generalized mixed models with saddlepoint approximation. Fits null logistic/linear mixed models with sparse or full GRM to account for sample relatedness, then runs single-variant tests with SPA for case-control imbalance correction. SAIGE-GENE+ extends the framework with BURDEN, SKAT, and SKAT-O set-based rare variant tests across multiple MAF thresholds and functional annotation masks. Supports PLINK, BGEN, VCF, BCF, SAV, and PGEN input formats with Firth's bias-reduced logistic regression.

Thin
0
Sailfish

Sailfish is a tool for rapid mapping-based isoform quantification from RNA-seq reads. Uses quasi-mapping to estimate transcript abundance without full alignment. Provides TPM and estimated counts per transcript. Commands: sailfish index, sailfish quant. Supports paired-end and single-end reads with strand-specific library types. Predecessor to Salmon, which has now superseded it.

Thin
0
SALSA2

SALSA2 — scaffold long-read genome assemblies using Hi-C proximity ligation data. Takes a draft contig assembly and Hi-C read alignments (BAM/BED) to produce chromosome-scale scaffolds. Iteratively corrects misassemblies using Hi-C contact maps, then orders and orients contigs by graph-based optimization. Outputs scaffolded FASTA, AGP files, and misassembly-corrected assemblies. Use for Hi-C scaffolding of PacBio/ONT assemblies, chromosome-scale genome finishing, and misassembly correction. Alternatives: 3D-DNA, YaHS, LACHESIS, AllHiC.

Thin
0
SAMap

Use when working with SAMap for cross-species single-cell RNA-seq atlas alignment, manifold mapping, homolog graph construction, or cell-type conservation analysis. SAMap maps AnnData or SAM objects across evolutionarily distant organisms using BLAST-derived gene homology, iterative manifold alignment, and downstream scoring utilities such as get_mapping_scores and GenePairFinder. Trigger this skill for SAMap, sc-samap, map_genes.sh, or when users need cross-species cell atlas comparison instead of within-species clustering alone.

Thin
0
Sambamba

Sambamba — high-performance BAM/CRAM processing tool written in D with native multi-threading. Provides fast sorting, indexing, duplicate marking, merging, filtering, depth calculation, flagstat, and slice extraction. Drop-in replacement for samtools and Picard operations on multi-core systems.

Thin
0
Sambamba mkdup

Sambamba mkdup covers duplicate-marking workflows for coordinate-sorted BAM files using `sambamba markdup`. Use this skill when a user needs to mark or remove duplicate reads, tune `--hash-table-size` or `--overflow-list-size`, choose between Sambamba and Picard MarkDuplicates, or validate duplicate-flagged BAM outputs in NGS preprocessing pipelines. Trigger phrases include sambamba markdup, sambamba mkdup, duplicate marking, remove duplicates, Picard-style duplicate flags, and BAM deduplication.

Thin
0
samblaster

samblaster is a streaming duplicate marker and structural-variant read extractor for read-id grouped paired-end SAM. Use it when a user needs fast duplicate marking, discordant read extraction, split-read extraction, or unmapped/clipped FASTQ recovery directly in a bwa mem or samtools pipeline. Trigger on phrases such as mark duplicates in SAM, extract discordant reads, splitter reads for LUMPY, accept existing duplicate marks, or samblaster vs Picard or sambamba.

Thin
0
Sapiens

Use when working with Sapiens — a BERT-based human antibody language model for antibody humanization, humanness scoring, and sequence design. Sapiens predicts humanizing mutations, scores residue-level humanness probabilities, infills masked antibody positions, and generates embeddings for VH and VL chains. Use for therapeutic antibody engineering, CDR grafting guidance, humanization campaigns, and antibody sequence analysis. Also known as BioPhi Sapiens, biophi-sapiens1, and Merck Sapiens on GitHub.

Thin
0
Sarek

Sarek — nf-core Nextflow pipeline for germline and somatic variant calling from whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted sequencing data. Supports multiple variant callers including GATK HaplotypeCaller, Mutect2, Strelka2, FreeBayes, DeepVariant, Manta, and TIDDIT. Includes preprocessing (mapping, marking duplicates, BQSR), variant calling, and annotation (VEP, SnpEff). Built on Nextflow DSL2 with nf-core best practices for reproducibility and scalability.

Thin
0
SATURN

Use when working with SATURN for cross-species single-cell RNA-seq integration that combines AnnData count matrices with protein embedding TorchDicts. SATURN trains macrogene representations with k-means initialization, a conditional autoencoder, and metric learning, then writes integrated `.h5ad` outputs plus macrogene weights and scoring logs. Use this skill when users need the real `train-saturn.py` or `score_adata.py` arguments, input CSV layout, output file contracts, scoring workflow, or reproducible wrappers around the upstream Stanford SNAP repository.

Thin
0
Scanorama -- Efficient Integration of Single-Cell Transcriptomes

Scanorama -- efficient batch correction and integration of heterogeneous single-cell RNA-seq datasets using panoramic stitching. Aligns and merges multiple scRNA-seq experiments via mutual nearest neighbors (MNN), producing corrected expression matrices and low-dimensional embeddings (X_scanorama). Provides integrate(), correct(), integrate_scanpy(), correct_scanpy() for AnnData workflows. Scales to hundreds of thousands of cells across dozens of batches with approximate nearest neighbor acceleration.

Thin
0
scArches

scArches (Single-cell Architecture Surgery) — Python toolkit for reference-based single-cell data integration via transfer learning. Maps query datasets into pre-trained reference atlases using architectural surgery on models like scVI, scANVI, totalVI, trVAE, scPoli, treeArches, and expiMap. Enables cell-type annotation transfer, novel cell state detection, multi-modal integration, and spatial transcriptomics mapping. Built on PyTorch with scvi-tools backend.

Thin
0
scater

scater — Bioconductor single-cell RNA-seq QC and visualization toolkit built on SingleCellExperiment. Use when user needs per-cell QC metrics with addPerCellQCMetrics, adaptive MAD-based outlier detection with perCellQCFilters, dimensionality reduction (runPCA, runTSNE, runUMAP), or publication-quality plots (plotReducedDim, plotExpression, plotColData, plotHighestExprs, plotHeatmap, plotDots, ggcells). Core QC functions re-exported from scuttle since Bioconductor 3.12.

Thin
0
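
For the scater entry above, the idea behind adaptive MAD-based outlier detection can be sketched as follows. The sketch assumes a threshold of 3 MADs and the usual 1.4826 scaling constant that makes the MAD comparable to a standard deviation. The function name `mad_outliers` and its parameters are hypothetical; this is not the scater or scuttle implementation.

```python
from statistics import median


def mad_outliers(values, nmads=3.0, direction="both"):
    """Flag values lying more than `nmads` scaled MADs from the median.

    direction -- "lower", "higher", or "both"
    """
    if direction not in ("lower", "higher", "both"):
        raise ValueError("direction must be 'lower', 'higher', or 'both'")
    values = list(values)
    med = median(values)
    # Median absolute deviation, scaled to match the standard deviation
    # of a normal distribution.
    mad = 1.4826 * median([abs(v - med) for v in values])
    lower = med - nmads * mad
    upper = med + nmads * mad
    flags = []
    for v in values:
        too_low = direction in ("lower", "both") and v < lower
        too_high = direction in ("higher", "both") and v > upper
        flags.append(too_low or too_high)
    return flags
```

Because the thresholds come from the median and MAD of the data themselves, they adapt to each dataset rather than relying on a fixed cutoff. In practice, metrics such as library size are often log-transformed before a filter like this is applied.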
scBERT

scBERT is a transformer-based single-cell RNA-seq annotation workflow from Tencent AI Lab Healthcare that follows a pretrain-and-fine-tune pattern for cell type labeling. Use this skill when users mention scBERT, PerformerLM, Panglao reference genes, pretrained cell annotation models, or novel cell type detection by confidence thresholding. It is appropriate for planning scBERT preprocessing, fine-tuning, prediction, and output validation on `.h5ad` single-cell datasets.

Thin
0
scCODA

scCODA — Bayesian compositional analysis for single-cell data using Dirichlet-multinomial models. Detects statistically credible changes in cell type compositions between conditions (e.g., disease vs control) while accounting for the compositional nature of cell count data. Built on TensorFlow Probability with AnnData integration. Part of the scverse/pertpy ecosystem. Use for differential abundance testing, compositional bias detection, and cell type proportion analysis in scRNA-seq experiments.

Thin
0
scCRISPR

scCRISPR — single-cell multiomics framework for CRISPR perturbation analysis combining scRNA-seq and scATAC-seq readouts. Built on the MPAL-Single-Cell-2019 codebase from the Greenleaf Lab, it integrates chromatin accessibility and gene expression to map regulatory programs disrupted by CRISPR perturbations. Use for single-cell CRISPR screens, perturbation response analysis, guide assignment, enhancer-gene linking, and multimodal single-cell CRISPR analysis.

Thin
0
scDblFinder

scDblFinder — Bioconductor R package for detecting doublets in single-cell RNA-seq data using simulation-based classification with gradient boosting. Identifies neotypic and heterotypic doublets in droplet-based scRNA-seq experiments. Works with SingleCellExperiment objects. Supports multi-sample and cluster-based doublet detection. Use for QC filtering before downstream clustering and differential expression analysis.

Thin
0
SCENIC

SCENIC (pySCENIC) — gene regulatory network inference and transcription factor regulon analysis for single-cell RNA-seq data. Infers TF-target gene networks using GRNBoost2/GENIE3, refines regulons via RcisTarget cisTarget motif enrichment databases, and scores regulon activity per cell with AUCell. Enables transcription factor-driven cell state identification, regulatory network visualization, and GRN-based clustering from scRNA-seq count matrices.

Thin
0
scFoundation

scFoundation — large-scale pretrained foundation model for single-cell transcriptomics built on the xTrimo transformer architecture. Trained on ~50 million human single-cell profiles, scFoundation generates contextualized cell embeddings for downstream tasks including cell type annotation, gene expression enhancement, drug response prediction, and perturbation analysis. Supports zero-shot and fine-tuned inference on scRNA-seq count matrices.

Thin
0
scGen

scGen — deep learning framework for predicting single-cell perturbation responses using variational autoencoders. Predicts how cells respond to unseen conditions (drug treatment, genetic knockout, disease state) by learning latent representations and applying vector arithmetic. Also performs batch correction across datasets. Works with AnnData objects and integrates with the scanpy/scverse ecosystem. Use for in silico perturbation prediction, transfer learning across cell types, and multi-dataset batch integration.

Thin
0
scGPT

scGPT — generative pretrained transformer foundation model for single-cell multi-omics analysis. Provides cell type annotation, gene perturbation prediction, multi-batch integration, multi-omic integration, gene regulatory network (GRN) inference, and zero-shot cell embedding. Built on a transformer architecture pretrained on 33 million human cells from CellxGene, scGPT enables transfer learning for diverse single-cell tasks using AnnData objects.

Thin
0
scHPF

scHPF — single-cell Hierarchical Poisson Factorization for de novo gene program discovery in scRNA-seq data. Decomposes raw UMI count matrices into interpretable cell scores (theta) and gene loadings (beta) without normalization. Use for identifying latent transcriptional programs, gene signatures, continuous cell state gradients, rare population detection, and integration of multiple datasets via projection. Supports CLI (scHPF prep, train, score, project) and Python API. Works on MEX/10x, loom, and sparse text matrix formats.

Thin
0
scib -- Single-Cell Integration Benchmarking

scib (single-cell integration benchmarking) -- Python framework for evaluating and benchmarking batch correction and data integration methods in single-cell omics. Computes standardized metrics for bio-conservation (ARI, NMI, cell type ASW) and batch correction (batch ASW, graph connectivity, kBET, iLISI, cLISI, PCR batch). Supports benchmarking of Scanorama, Harmony, scVI, BBKNN, scGen, MNN, and other integration methods on AnnData objects.

Thin
0
scib-metrics

scib-metrics — single-cell integration benchmarking metrics for evaluating batch correction and biological conservation in single-cell RNA-seq data. Compute batch correction metrics (kBET, iLISI, PCR, graph connectivity, ASW batch) and biological conservation metrics (NMI, ARI, cLISI, isolated labels ASW, HVG overlap, cell cycle conservation) on AnnData objects. GPU-accelerated via JAX. Use for benchmarking data integration methods, evaluating scVI, Harmony, Seurat, BBKNN, scANVI outputs, and comparing integration quality across embedding spaces.

Thin
0
scikit-allel

scikit-allel — Python package for exploratory analysis of large-scale genetic variation data. Provides data structures for genotypes, haplotypes, and allele counts (GenotypeArray, HaplotypeArray, AlleleCountsArray), population genetics statistics (FST, Tajima's D, nucleotide diversity), selection scans (iHS, XP-EHH, NSL, PBS), PCA on genotypes, linkage disequilibrium analysis, VCF I/O to NumPy/HDF5/Zarr, and windowed genome-wide analyses for population genomics, malaria genomics, and parasitology research.

Thin
0
scikit-image

scikit-image — Python library for image processing built on NumPy arrays. Provides filtering (gaussian, sobel, median, thresholding), segmentation (watershed, slic, felzenszwalb, random_walker), feature detection (canny, blob_log, corner_harris, hog), morphological operations, color space conversions, geometric transforms (resize, rotate, warp), region measurement (label, regionprops), and image restoration for bioimage analysis, microscopy, histopathology, and general computer vision workflows.

Thin
0
scikit-learn

scikit-learn is an open-source Python machine learning library that provides a consistent estimator API for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Used in bioinformatics for gene expression classification, microbiome analysis, variant feature engineering, proteomics, and single-cell ML pipelines. Integrates with NumPy, pandas, and AnnData. Use for supervised/unsupervised ML tasks on tabular biological data, feature importance analysis, cross-validation, and pipeline construction.

Thin
0
sciPENN

sciPENN — neural network model for single-cell protein expression imputation and multi-omics integration. Transfers protein predictions from CITE-seq (paired RNA+protein) training data to unpaired RNA-only datasets. Supports protein imputation, batch integration across CITE-seq and scRNA-seq datasets, quality uncertainty quantification, and downstream differential expression. Built on PyTorch and scanpy/AnnData. Use when imputing missing ADT/protein data from RNA-only experiments or integrating heterogeneous CITE-seq cohorts.

Thin
0
SciPy

Use when working with SciPy — the foundational Python scientific computing library — for statistical testing, signal processing, optimization, linear algebra, spatial analysis, and numerical integration in bioinformatics workflows. SciPy wraps LAPACK/BLAS for linear algebra, provides scipy.stats for parametric and non-parametric tests (t-test, Mann-Whitney, Fisher's exact, chi-square, Kruskal-Wallis), scipy.signal for filtering and spectral analysis, scipy.optimize for curve fitting and root finding, scipy.spatial for distance matrices and k-d trees, and scipy.cluster for hierarchical clustering. Use for differential expression statistics, genomic distance computation, clustering pipelines, and numerical analysis of biological data.

Thin
0
scirpy -- Single-Cell Immune Receptor Analysis

scirpy -- Python toolkit for single-cell immune receptor (TCR and BCR) repertoire analysis built on the scverse ecosystem. Defines clonotypes from paired chain data using CDR3 nucleotide/amino acid sequence identity or receptor structure graphs. Computes repertoire diversity (Shannon entropy, D50), clonal expansion, V(D)J gene usage, spectratype distributions, and repertoire overlap (Jaccard, Morisita-Horn). Integrates with AnnData/MuData and scanpy for combined transcriptome-repertoire analysis.

Thin
0
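Two of the repertoire statistics named in the scirpy entry, Shannon entropy and Jaccard overlap, can be sketched in plain Python. This is an illustration of the formulas only, not scirpy's API; the function names and the clonotype labels are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(clonotypes):
    """Shannon entropy (in bits) of a list of per-cell clonotype labels."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def jaccard_overlap(sample_a, sample_b):
    """Jaccard index between the sets of clonotypes seen in two samples."""
    a, b = set(sample_a), set(sample_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical per-cell clonotype labels for two samples.
sample_1 = ["c1", "c1", "c2", "c3"]
sample_2 = ["c1", "c4", "c4", "c4"]

entropy_1 = shannon_entropy(sample_1)          # 1.5 bits
overlap = jaccard_overlap(sample_1, sample_2)  # 1 shared of 4 distinct -> 0.25
```

Higher entropy indicates a more even repertoire; a strongly expanded clone pulls the value toward zero.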
scMoMaT

Use when working with scMoMaT, single-cell mosaic integration, multi-omics matrix tri-factorization, or marker discovery across RNA, ATAC, and protein batches with missing modalities. scMoMaT integrates mosaic single-cell datasets, supports unequal cell type composition across batches, learns joint cell factors in a first training stage, and learns modality-specific feature factors in a retraining stage. Trigger this skill for scMoMaT, scmomat, mosaic integration, multi-modal biomarker detection, or when users need the scMoMaT count-dictionary contract and real API names.

Thin
0
Scout

Use when working with Scout — a web-based visualization and case management

Thin
0
scran -- Single-Cell RNA-Seq Normalization and Analysis

Thin
0
ScreenProcessing

ScreenProcessing — Python pipeline for analyzing pooled genetic screens (CRISPRi/CRISPRa). Converts raw FASTQ sequencing files into library counts using fastqgz_to_counts.py, then generates sgRNA phenotype scores and gene-level scores with p-values using process_experiments.py. Supports configurable experiment definitions, multiple scoring algorithms (MW, ttest), and interactive visualization via screen_analysis.py. Essential for pooled CRISPR screen analysis in functional genomics.

Thin
0
scRepertoire -- Single-Cell Immune Repertoire Analysis

scRepertoire -- R package for analyzing T cell receptor (TCR) and B cell receptor (BCR) repertoires from single-cell sequencing data. Integrates with Seurat and SingleCellExperiment objects to combine clonotype information with gene expression. Supports clonotype quantification, diversity analysis, overlap comparison, gene usage visualization, and clonal homeostasis across samples and clusters. Works with 10x Genomics, AIRR, BD Rhapsody, TRUST4, WAT3R, and other immune repertoire formats.

Thin
0
Scrublet

Scrublet — Python toolkit for computational identification of cell doublets in single-cell RNA-seq data. Simulates synthetic doublets from observed transcriptomes and scores cells via k-nearest-neighbor density estimation in PCA space. Provides doublet score thresholding, UMAP embedding of doublet neighborhoods, and integration with Scanpy/AnnData workflows.

Thin
0
scVelo

scVelo — RNA velocity generalized through dynamical modeling. Estimates splicing kinetics from single-cell RNA-seq data using steady-state, stochastic, and dynamical models to infer cellular dynamics, latent time, driver genes, and fate decisions. Built on the scverse AnnData framework, scVelo recovers per-gene transcription, splicing, and degradation rates via an expectation-maximization algorithm, enabling velocity-based pseudotime, confidence scoring, differential kinetics testing, and velocity-directed PAGA trajectory analysis.

Thin
0
scverse

scverse — the community-maintained Python ecosystem for single-cell and spatial omics analysis. Coordinates AnnData (universal data structure), scanpy (scRNA-seq QC/clustering/DE), squidpy (spatial transcriptomics), scvi-tools (deep generative models), muon (multimodal omics), scirpy (immune repertoire), spatialdata (spatial omics), and cellrank (cell fate). Use when working with single-cell RNA-seq, spatial transcriptomics, CITE-seq, scATAC-seq, TCR/BCR repertoire, or multi-omics datasets in Python.

Thin
0
scvi-hub

Use when working with scvi-hub -- the Hugging Face Hub integration

Thin
0
Seaborn -- Statistical Data Visualization

Seaborn — statistical data visualization library built on matplotlib. Provides high-level functions for relational plots (scatterplot, lineplot), distribution plots (histplot, kdeplot, ecdfplot), categorical plots (boxplot, violinplot, swarmplot), regression plots (lmplot, regplot), matrix heatmaps (heatmap, clustermap), and multi-facet grids (FacetGrid, PairGrid, JointGrid). Integrates with pandas DataFrames for tidy-data workflows and produces publication-quality figures with minimal code.

Thin
0
SEACR (Sparse Enrichment Analysis for CUT&RUN)

SEACR (Sparse Enrichment Analysis for CUT&RUN) -- peak caller specifically designed for CUT&RUN and CUT&Tag chromatin profiling data. Exploits the sparse, low-background signal of CUT&RUN to call enriched regions either against a control track (such as IgG) or, when no control is available, against a numeric threshold. Outputs BED files of enriched regions in stringent or relaxed mode. Recommended alternative to MACS2 for low-background CUT&RUN/CUT&Tag experiments.

Thin
0
Search Publications for Tool

Search publications for a bioinformatics tool — find peer-reviewed papers that use, cite, or benchmark a tool using PubMed E-utilities, CrossRef REST API, and OpenAlex scholarly graph. Aggregates citation counts, usage trends by year, open-access flags, and DOI-based citation tracking across 240M+ scholarly works. Supports tool name search, primary DOI citation lookup, and temporal trend analysis.

Thin
0
Segway

Use when working with Segway -- probabilistic genome segmentation

Thin
0
SemiBin

SemiBin — semi-supervised metagenomic binning using deep learning with self-supervised contrastive learning for contig embeddings. Supports single-sample, multi-sample, and long-read binning modes. Ships with pre-trained models for common habitats (human gut, dog gut, ocean, soil, cat gut, built environment, wastewater, chicken caecum, global, pig gut). Used for recovering metagenome-assembled genomes (MAGs) from assembled contigs with higher quality than traditional binners.

Thin
0
Sentieon

Sentieon — commercial high-performance genomics toolkit implementing GATK-compatible algorithms (DNAscope, TNscope, Haplotyper, BQSR, Dedup) with 50x speed improvement over GATK. Provides germline SNV/indel calling, somatic tumor-normal variant calling, structural variant detection, long-read analysis (PacBio/ONT), copy number variation, and RNA-seq analysis. Used in clinical and research WGS/WES pipelines requiring fast, reproducible, GATK-equivalent results.

Thin
0
seq_io

seq_io — high-performance Rust FASTA/FASTQ parser with zero-copy design. Read and write FASTA and FASTQ files using fasta::Reader, fastq::Reader, and the Record trait. Supports multi-line FASTA, parallel batch processing via RecordSet and read_parallel, configurable buffer policies, and both borrowed RefRecord and owned OwnedRecord variants. Use for fast sequence I/O in Rust bioinformatics pipelines, high-throughput FASTX processing, and parallel sequence analysis with minimal memory allocations.

Thin
0
Seqera Platform

Seqera Platform (formerly Nextflow Tower) — cloud orchestration and monitoring platform for Nextflow pipelines. Provides the tw CLI for launching pipelines, managing compute environments (AWS Batch, Google Life Sciences, Azure Batch, SLURM, LSF), monitoring workflow runs, and organizing work within workspaces and organizations. Essential for scaling Nextflow pipelines from local development to production on cloud and HPC infrastructure.

Thin
0
SeqKit

SeqKit — cross-platform ultrafast toolkit for FASTA/Q file manipulation. 38 subcommands for sequence statistics (stats), filtering by length or quality (seq), pattern searching with mismatches (grep, locate), subsequence extraction (subseq), random sampling (sample), deduplication (rmdup), file splitting (split, split2), paired-end reconciliation (pair), format conversion (fq2fa, fx2tab, tab2fx), sorting and shuffling, BAM processing, and in-silico PCR (amplicon). Single static binary with no dependencies. Handles gzip, xz, zstd, bzip2, and lz4 compressed files natively.

Thin
0
seqtk

seqtk — fast lightweight C toolkit for processing FASTA and FASTQ files. Supports format conversion (FASTQ↔FASTA), random subsampling, quality trimming, reverse complement, base composition, sequence extraction by name or region, and interleave/deinterleave of paired-end reads. Essential utility for NGS preprocessing and QC pipelines. By Heng Li (BWA/SAMtools author).

Thin
0
SeSAMe

SeSAMe (SEnsible Step-wise Analysis of DNA MEthylation BeadChips) — R/Bioconductor package for processing Illumina Infinium DNA methylation arrays. Supports EPIC, EPICv2, HM450, HM27, MM285, and Mammal40 platforms. Provides the openSesame one-command pipeline for IDAT-to-beta conversion with configurable preprocessing codes (quality masking, channel inference, dye bias correction, p-value detection, background subtraction). Includes sex/species/strain inference, epigenetic age prediction (Horvath clocks), leukocyte estimation, CNV segmentation, differential methylation (DML/DMR), cross-platform lift-over (mLiftOver), SNP-based sample QC, and visualization of genes, regions, and probes.

Thin
0
SHAPEIT4/5

SHAPEIT4/5 — statistical haplotype phasing for SNP array and whole-genome sequencing data. SHAPEIT5 is the current production version, providing phasing of common variants (MAF >= 0.1%) via phase_common, rare variant phasing via phase_rare, chunk ligation via ligate, and phasing accuracy evaluation via switch. Works with large cohorts (100k+ samples), reference panels, and outputs phased VCF/BCF files. SHAPEIT4 remains widely used for array-based imputation and smaller cohorts.

Thin
0
Shasta

Shasta — fast de novo long-read genome assembler optimized for Oxford Nanopore (ONT) reads. Produces haploid or phased diploid assemblies from nanopore data using run-length encoding, MinHash-based overlap detection, and marker graph construction. Outputs Assembly.fasta and Assembly.gfa. Supports configurable assembly modes (haploid mode 0, phased diploid mode 2, anchor-based mode 3). Capable of assembling a human genome in hours on a single machine.

Thin
0
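The run-length encoding mentioned in the Shasta entry collapses each homopolymer run to a single base plus a repeat count. A minimal sketch of the idea follows, assuming a plain string input; it is not Shasta's internal representation, and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(sequence):
    """Collapse homopolymer runs; return (collapsed bases, run lengths)."""
    bases = []
    lengths = []
    for base, run in groupby(sequence):
        bases.append(base)
        lengths.append(sum(1 for _ in run))
    return "".join(bases), lengths


def run_length_decode(bases, lengths):
    """Rebuild the original sequence from collapsed bases and run lengths."""
    return "".join(base * n for base, n in zip(bases, lengths))


collapsed, counts = run_length_encode("GATTTACCA")
# collapsed == "GATACA", counts == [1, 1, 3, 1, 2, 1]
restored = run_length_decode(collapsed, counts)
```

Comparing reads in the collapsed space makes the comparison insensitive to errors in homopolymer length, since such errors only change the counts.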
ShortRead

Use when working with ShortRead, the Bioconductor package for FASTQ and FASTA quality control, chunked read iteration, random sampling, filtering, trimming, and QA reporting in R. Covers FastqStreamer, FastqSampler, readFastq, writeFastq, countFastq, qa(), report(), srFilter, filterFastq, trimTailw(), trimTails(), and ShortReadQ workflows for sequencing-read preprocessing and exploratory quality assessment. Trigger on: ShortRead, FASTQ QC in R, Bioconductor FASTQ processing, stream gzipped FASTQ, build a QA report, filter reads by Ns or quality, or ShortRead vs DADA2 / fastp / Biostrings.

Thin
0
Sickle

Sickle — sliding-window quality trimming tool for short-read Illumina FASTQ data. Trims low-quality bases from 3' and 5' ends using a windowed adaptive algorithm. Supports paired-end (pe) and single-end (se) modes with Sanger (Phred+33), Solexa, and Illumina 1.3+ quality encodings. Written in C for fast processing. Outputs trimmed FASTQ files. Use for Illumina read QC preprocessing before alignment or assembly.

Thin
0
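The windowed trimming described in the Sickle entry can be illustrated with a simplified sketch. It is not Sickle's exact algorithm: the fixed window size, the threshold, and the function names are assumptions chosen for the example.

```python
def decode_phred33(quality_string):
    """Convert a Sanger (Phred+33) quality string to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(quals, window=4, threshold=20):
    """Return (start, end) of the slice to keep; start == end means discard."""
    n = len(quals)
    start = None
    for i in range(n - window + 1):
        mean = sum(quals[i:i + window]) / window
        if start is None:
            if mean >= threshold:
                # 5' cut: first base in this window at or above the threshold.
                start = next(j for j in range(i, i + window)
                             if quals[j] >= threshold)
        elif mean < threshold:
            # 3' cut: first base in this window below the threshold.
            end = next(j for j in range(i, i + window)
                       if quals[j] < threshold)
            return (start, max(start, end))
    if start is None:
        return (0, 0)
    return (start, n)


read = "ACGTACGTAC"
quals = decode_phred33("##IIIIII##")  # '#' is Q2, 'I' is Q40
start, end = sliding_window_trim(quals, window=4, threshold=30)
trimmed = read[start:end]  # "GTACGT"
```

In paired-end use, a trimmer would additionally drop or orphan reads whose retained length falls below a minimum; that step is omitted here.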
Signac

Signac — R toolkit for single-cell chromatin accessibility analysis built on Seurat. Processes scATAC-seq data from 10x Genomics and other platforms. Handles fragment files, performs TF-IDF normalization, LSI dimensionality reduction, peak calling (MACS2), peak annotation (EnsDb/TxDb), motif analysis (JASPAR), gene activity scoring, and multi-modal RNA+ATAC integration via Weighted Nearest Neighbor (WNN). ChromatinAssay stores peaks, fragment files, genome annotation, and motif information alongside standard Seurat assay slots.

Thin
0
SigProfiler

SigProfiler — Alexandrov Lab suite for somatic mutational signature analysis. Generates mutation matrices from VCF/MAF files (SigProfilerMatrixGenerator), extracts de novo mutational signatures via NMF (SigProfilerExtractor), assigns COSMIC signatures to samples (SigProfilerAssignment), and simulates somatic mutations (SigProfilerSimulator). Used for cancer genomics, tumor mutational burden analysis, exposome studies, and COSMIC SBS/DBS/ID signature profiling. Supports GRCh37, GRCh38, mm9, mm10 reference genomes.

Thin
0
SingleCellExperiment

SingleCellExperiment — core R/Bioconductor container class for single-cell genomics data. Stores assays (counts, logcounts, normalized), cell metadata (colData), gene metadata (rowData), dimensionality reductions (reducedDims), alternative experiments (altExps for spike-ins and CRISPR), and size factors. Used as the shared data structure across the entire Bioconductor single-cell ecosystem: scran, scater, scuttle, DESeq2, and 200+ downstream packages. Trigger on: SingleCellExperiment, SCE, colData, rowData, reducedDim, altExp, assay, Bioconductor single-cell container.

Thin
0
SingleR

SingleR — automated cell type annotation for single-cell RNA-seq data using reference-based scoring. Assigns cell type labels by correlating query expression profiles against curated reference datasets (HumanPrimaryCellAtlas, Blueprint+Encode, ImmGen, MouseRNAseq). Works with SingleCellExperiment and Seurat objects via the Bioconductor ecosystem. Use SingleR() for single-step annotation, trainSingleR() + classifySingleR() for pre-trained models, and celldex for reference dataset retrieval. Handles fine vs. main label granularity, per-cluster annotation, and ambiguous cell pruning via delta score thresholds.

Thin
0
Singularity/Apptainer

Singularity/Apptainer — HPC container runtime for building, running, and managing SIF container images in rootless, multi-tenant cluster environments. Provides OCI-compatible builds from Dockerfiles and Docker Hub, GPU passthrough via --nv/--rocm, MPI integration, bind-mount host filesystems, and cryptographic image signing. Standard container solution for SLURM/PBS bioinformatics pipelines.

Thin
0
Skyline

Skyline — free, open-source Windows application for building targeted mass spectrometry methods and quantitative analysis. Supports SRM/MRM, PRM, DIA (SWATH), and full-scan MS1 quantification with transition list management, spectral library building, peak integration, and result export. Part of the ProteoWizard ecosystem. Command-line interface via SkylineCmd and SkylineRunner for batch processing and pipeline integration.

Thin
0
sleuth

sleuth — R package for differential expression analysis of RNA-seq data at the transcript level. Works with kallisto bootstrap quantifications to model technical variability using a response error model. Supports transcript-level and gene-level differential expression testing, likelihood ratio tests, Wald tests, interactive Shiny visualization, and sleuth-to-gene aggregation. Designed for the kallisto-sleuth lightweight RNA-seq analysis pipeline.

Thin
0
Slingshot

Slingshot — R/Bioconductor package for trajectory inference in single-cell RNA-seq data. Infers branching lineage structures via cluster-based minimum spanning trees (MST), fits simultaneous principal curves along each lineage, computes per-cell pseudotime and lineage assignment weights. Two-step design separates topology inference (getLineages) from curve fitting (getCurves), allowing supervised trajectory construction with known start/end clusters. Integrates with tradeSeq for trajectory-based differential expression and condiments for differential topology across conditions. Operates on SingleCellExperiment objects and produces PseudotimeOrdering objects.

Thin
0
slivar

Use when working with slivar -- fast VCF/BCF variant filtering

Thin
0
slow5tools

slow5tools -- command-line toolkit for converting, compressing, merging, splitting, indexing, and extracting reads from SLOW5/BLOW5 format files. SLOW5/BLOW5 is an alternative to HDF5-based FAST5/POD5 for Oxford Nanopore sequencing data, enabling efficient random access and parallel I/O. Provides fast5toslow5, slow5toslow5, merge, split, index, get, view, stats, and quickcheck subcommands for nanopore signal data management.

Thin
0
EIGENSOFT / smartpca

EIGENSOFT / smartpca — suite for population structure analysis and stratification correction in genome-wide genetic studies. smartpca performs principal component analysis (PCA) on genotype data to infer ancestry and detect population structure. EIGENSTRAT corrects association statistics for population stratification using PCA-derived axes. CONVERTF converts between five genotype formats (ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, PACKEDANCESTRYMAP). MERGEIT merges datasets by union of individuals and intersection of SNPs. Widely used in ancient DNA, human population genetics, and GWAS quality control for stratification detection and correction.

Thin
0
SMINA

Use when working with SMINA -- a fork of AutoDock Vina with improved scoring functions and flexible docking for structure-based virtual screening and lead optimization. SMINA extends Vina with custom scoring (gnina default, vinardo, AD4), per-atom affinity maps, minimization-only mode, and flexible receptor residues. Use for small-molecule docking against protein targets, rescoring docked poses, SBVS campaigns, binding affinity estimation, and preparing docking inputs from PDB/PDBQT structures. Also known as smina-docking or smina vina; serves as the CPU docking backbone of GNINA.

Thin
0
SMR

SMR (Summary-data-based Mendelian Randomization) -- tool for testing pleiotropic associations between molecular traits (eQTL, mQTL, sQTL) and complex traits using GWAS summary statistics. Includes the HEIDI test for distinguishing pleiotropy from linkage, multi-SNP SMR, and MeCS meta-analysis. Use when working with smr, Mendelian randomization, eQTL integration, HEIDI test, BESD format, or causal gene prioritization from GWAS.

Thin
0
SnapATAC2

SnapATAC2 — Python/Rust toolkit for single-cell ATAC-seq analysis. Provides fragment file import, cell-by-bin/peak matrix generation, spectral embedding dimensionality reduction, leiden clustering, MACS3-based peak calling, gene activity scoring, motif enrichment via chromVAR, and multi-modal integration with scRNA-seq. Built on AnnData for scverse ecosystem interoperability.

Thin
0
SnapGene Reader

snapgene-reader — Python library for parsing SnapGene *.dna binary files into Python dicts or Biopython SeqRecord objects. Converts proprietary SnapGene format to GenBank (.gbk), extracts annotated sequences, features, primers, and metadata without requiring a SnapGene license. Use for batch processing of .dna files, format conversion pipelines, and programmatic access to SnapGene-annotated constructs. Aliases: SnapGene Reader, snapgene_reader, dna file parser, SnapGene to GenBank converter, Edinburgh Genome Foundry.

Thin
0
SNF

Use when working with SNF, SNFtool, or Similarity Network Fusion for multi-omics integration, patient subtyping, fused similarity networks, or multi-view clustering in R. SNFtool accepts feature matrices, pairwise distances, or pairwise similarities from multiple data types, builds affinity networks with affinityMatrix(), fuses them with SNF(), and supports spectralClustering(), estimateNumberOfClustersGivenGraph(), concordanceNetworkNMI(), rankFeaturesByNMI(), and groupPredict() for downstream subtype discovery and prediction.

Thin
0
Sniffles2

Sniffles2 — fast structural variant caller for long-read sequencing data (PacBio HiFi, Oxford Nanopore). Detects deletions, insertions, duplications, inversions, and translocations from BAM/CRAM alignments. Supports germline single-sample calling, multi-sample population-level SV calling via .snf intermediate files, mosaic/somatic SV detection, and genotyping of known SVs. Requires minimap2-aligned long reads.

Thin
0
Snippy

Snippy — rapid haploid variant calling pipeline for bacterial genomics. Maps paired-end or single-end reads to a reference genome using BWA-MEM and Minimap2, calls SNPs, MNPs, and indels via Freebayes, and produces VCF, aligned FASTA, BED, GFF, and tab-delimited variant tables. Includes snippy-core for merging multiple isolate results into a core SNP alignment for downstream phylogenetics, recombination detection, and outbreak analysis.

Thin
0
SNP-sites

SNP-sites — fast extraction of single nucleotide polymorphism (SNP) sites from multi-FASTA whole-genome alignments. Produces VCF, multi-FASTA (SNP-only), and Phylip output for downstream phylogenetic inference, recombination detection, and population genomics. Essential in microbial genomics pipelines between alignment (Snippy, Gubbins) and tree building (RAxML-NG, IQ-TREE, FastTree).

Thin
0
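The core operation in the SNP-sites entry, keeping only the variable columns of a multi-sequence alignment, can be sketched as follows. Ignoring gaps and N when deciding whether a column varies is an assumption of this sketch, and the isolate names and function names are hypothetical; the tool's own handling of such characters may differ.

```python
def snp_columns(alignment):
    """Return 0-based indices of alignment columns with more than one base."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    ignored = {"-", "N"}  # assumption: gaps and N do not count as alleles
    positions = []
    for i in range(length):
        alleles = {s[i].upper() for s in seqs} - ignored
        if len(alleles) > 1:
            positions.append(i)
    return positions


def snp_only_alignment(alignment):
    """Reduce every sequence to the variable columns only."""
    cols = snp_columns(alignment)
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


# Hypothetical four-isolate alignment.
alignment = {
    "iso1": "ACGTAC",
    "iso2": "ACGTTC",
    "iso3": "AC-TAC",
    "iso4": "AAGTAC",
}
positions = snp_columns(alignment)       # [1, 4]
reduced = snp_only_alignment(alignment)  # e.g. "iso2" -> "CT"
```

Because invariant sites are dropped, tree-building tools run on the reduced alignment typically need an ascertainment-bias correction.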
SnpSift

SnpSift -- Java-based toolset for filtering, annotating, and manipulating annotated VCF files from SnpEff or other variant callers. Provides expression-based filtering (filter), database annotation with ClinVar/dbSNP/COSMIC (annotate), pathogenicity score integration from dbNSFP (dbnsfp), field extraction to tab-delimited tables (extractFields), variant concordance analysis, case-control association testing, and interval-based filtering. Part of the SnpEff suite.

Thin
0
SOAPdenovo2

SOAPdenovo2 — de novo short-read genome assembler for large plant and animal genomes using de Bruijn graph construction. Runs a four-stage pipeline: pregraph (k-mer graph), contig (initial contigs), map (read alignment), and scaff (scaffolding with gap closure). Accepts paired-end and mate-pair Illumina reads via a configuration file. Outputs contigs (*.contig) and scaffolds (*.scafSeq). Supports k-mer sizes 13-127 and multi-library designs.

Thin
0
SOAPnuke

SOAPnuke — C++ quality control and preprocessing tool for high-throughput sequencing data. Filters and trims paired-end or single-end FASTQ reads by adapter content, low quality bases, N-base ratio, read length, polyG/polyX tails, and tile/FOV IDs. Supports BAM/CRAM input via filterHts module and stLFR barcode detection via filterStLFR. Multi-threaded with MapReduce acceleration. Generates QC statistics and visualization-ready reports. Developed by BGI-flexlab for BGI-seq and Illumina platforms.

Thin
0
softImpute

softImpute — R package for matrix completion via iterative soft-thresholded SVD with nuclear-norm regularization. Implements ALS and SVD algorithms for imputing missing entries in large incomplete matrices. Core API: softImpute(), lambda0(), complete(), impute(), biScale(), deBias(). Handles Netflix-scale sparse matrices via the Incomplete class and SparseplusLowRank storage. Use for recommender systems, genomics missing data, and low-rank matrix approximation in R.

Thin
0
Somalier

somalier -- fast sample-identity verification and relatedness checking for BAM/CRAM/VCF/GVCF files. Extracts genotypes at informative polymorphic sites, computes pairwise relatedness using bit-vector operations, detects sample swaps and mislabeling, infers pedigrees, and predicts genetic ancestry. Essential QC tool in WGS/WES/RNA-seq pipelines for cohort integrity validation.

Thin
0
SortMeRNA

SortMeRNA — fast filtering of ribosomal RNA reads from metatranscriptomic and RNA-seq data using local sequence alignment against curated rRNA databases (SILVA, RFAM). CLI tool for rRNA removal, rRNA quantification, and paired-end read separation. Key flags: --ref, --reads, --aligned, --other, --fastx, --sam, --blast, --threads, --paired_in, --paired_out, --best, --num_alignments, -e (E-value). Input: FASTA/FASTQ (plain or gzipped). Output: aligned/rejected reads, SAM, BLAST-like tabular. Use for metatranscriptomics preprocessing, rRNA depletion QC, and non-coding RNA filtering in NGS pipelines.

Thin
0
SoupX

SoupX -- R package for estimating and removing ambient RNA contamination (soup) from droplet-based single-cell RNA-seq data. Works with 10x Chromium Cell Ranger output to estimate the contamination fraction (rho) per cell and produce corrected count matrices. Key functions include SoupChannel(), setClusters(), autoEstCont(), and adjustCounts(). Use before downstream analysis with Seurat, Scanpy, or scvi-tools.

Thin
0
sourmash

sourmash — fast k-mer sketching and comparison for genomic and metagenomic data using FracMinHash (scaled MinHash). Computes DNA and protein sketches, performs pairwise genome comparison, containment search against large databases (GenBank, GTDB), taxonomic classification and profiling of metagenomes, and Jaccard/containment similarity estimation. Supports k-mer-based operations on FASTA/FASTQ inputs at multiple k-sizes.

Thin
0
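The FracMinHash idea in the sourmash entry is to keep every k-mer whose hash falls below a fixed fraction (1/scaled) of the hash space, so that sketches of different sequences stay comparable. The sketch below illustrates this with the standard library only; it uses SHA-1 from hashlib for convenience, whereas sourmash uses its own hash function, so the values are not interchangeable with real sourmash signatures. All function names are hypothetical.

```python
import hashlib

MAX_HASH = 2 ** 64
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_hash(kmer):
    """64-bit hash of a k-mer (first 8 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def frac_minhash(sequence, ksize=21, scaled=1000):
    """Keep hashes of canonical k-mers that fall below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(sequence) - ksize + 1):
        kmer = sequence[i:i + ksize]
        canonical = min(kmer, reverse_complement(kmer))
        h = kmer_hash(canonical)
        if h < threshold:
            sketch.add(h)
    return sketch


def containment(query, subject):
    """Fraction of the query sketch that is also present in the subject."""
    if not query:
        return 0.0
    return len(query & subject) / len(query)


genome = "ACGTACGTAGCTAGCTAGGATC"
fragment = genome[3:15]
# scaled=1 keeps every k-mer, which makes this toy example deterministic.
genome_sketch = frac_minhash(genome, ksize=5, scaled=1)
fragment_sketch = frac_minhash(fragment, ksize=5, scaled=1)
fragment_in_genome = containment(fragment_sketch, genome_sketch)
```

With a realistic `scaled` value such as 1000, roughly one in a thousand distinct k-mers is retained, and containment is estimated from that subsample.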
SPARK

SPARK -- R package for identifying spatially variable genes in spatial transcriptomics data. Uses generalized linear spatial models with penalized quasi-likelihood (PQL) and multiple spatial kernels to test for spatial expression patterns. Use when analyzing spatial transcriptomics, detecting spatially variable genes, running SpatialDE alternatives, or comparing spatial gene expression methods.

Thin
0
SpatialDE -- Spatially Variable Gene Detection

SpatialDE -- Python package for identifying spatially variable genes in spatial transcriptomics data using Gaussian process regression. Tests whether gene expression exhibits spatial patterns beyond random variation. Supports Visium, MERFISH, seqFISH, Slide-seq, and Stereo-seq platforms. Includes automatic expression histology (AEH) for clustering genes with similar spatial patterns. Works with AnnData objects from the scverse ecosystem.

Thin
0
ST Pipeline

ST Pipeline — automated processing pipeline for spatial transcriptomics (Spatial Transcriptomics method). Handles demultiplexing, quality trimming, contamination filtering, genome alignment, annotation, and UMI-based gene expression quantification from Spatial Transcriptomics array data. Produces count matrices with spatial barcode coordinates for downstream visualization and clustering. Works with FASTQ input from ST arrays.

Thin
0
Spectronaut -- DIA Proteomics Analysis

Spectronaut -- commercial DIA (Data-Independent Acquisition) proteomics software from Biognosys for analyzing DIA-MS data. Supports library-based and library-free (directDIA / Pulsar) peptide-centric analysis, spectral library generation, peptide and protein quantification, post-translational modification (PTM) analysis, and multi-sample statistical reporting. Processes Thermo .raw, Bruker .d, and SCIEX .wiff files.

Thin
0
SpliceAI

SpliceAI -- deep learning tool for predicting the impact of genetic variants on RNA splicing using raw DNA sequence context. Predicts four delta scores (acceptor gain, acceptor loss, donor gain, donor loss) ranging 0-1 for each variant. Supports VCF input/output, SNVs and indels, GRCh37 and GRCh38. Pre-scored lookup tables available for all possible SNVs (~30 GB). Used for splice site variant prioritization, pathogenicity assessment of intronic and synonymous variants, and splicing QTL analysis.

Thin
0
SPOTlight

SPOTlight — R/Bioconductor spatial transcriptomics deconvolution using seeded NMF regression. Estimates cell type proportions at each spatial location by integrating single-cell RNA-seq reference data with Visium, Slide-seq, or any spot-based spatial transcriptomics technology. Uses Non-negative Matrix Factorization initialized with marker genes to learn topic profiles, then Non-Negative Least Squares regression to decompose spatial spots. Works with SingleCellExperiment, SpatialExperiment, and Seurat objects. CPU-only, fast, produces interpretable NMF topic profiles for biological validation.

Thin
0
Squidpy

Squidpy — spatial single-cell analysis toolkit in the scverse ecosystem for analyzing and visualizing spatial molecular data. Builds spatial neighbor graphs from tissue coordinates (Visium, Xenium, MERFISH, Slide-seq, CosMx), computes spatial statistics (neighborhood enrichment, co-occurrence, Moran's I, Ripley's, centrality scores, ligand-receptor interactions), extracts image features from H&E and fluorescence microscopy, performs cell segmentation, and provides publication-ready spatial scatter plots. Operates on AnnData objects and integrates directly with Scanpy for downstream clustering and visualization.

Thin
0
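Moran's I, one of the spatial statistics listed in the Squidpy entry, measures whether neighboring spots carry similar values. The following is a plain-Python sketch of the textbook formula, not Squidpy's implementation; the neighbor graph and expression values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for values indexed 0..n-1 and a dict {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_sum = sum(weights.values())
    if denom == 0 or w_sum == 0:
        raise ValueError("need non-constant values and non-zero weights")
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / w_sum) * (num / denom)


def chain_weights(n):
    """Symmetric binary weights for n spots arranged in a line."""
    weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights


weights = chain_weights(4)
clustered = morans_i([1, 1, 5, 5], weights)    # similar neighbors -> positive
alternating = morans_i([1, 5, 1, 5], weights)  # dissimilar neighbors -> negative
```

Values near +1 suggest spatial clustering of expression, values near 0 suggest no spatial structure, and negative values suggest a checkerboard-like pattern.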
SRA Toolkit

SRA Toolkit — NCBI command-line utilities for downloading, converting, and validating Sequence Read Archive (SRA) data. Provides prefetch for bulk download, fasterq-dump for high-speed SRA-to-FASTQ conversion, vdb-validate for integrity checking, sam-dump for SAM/BAM extraction, and vdb-config for cloud/local storage configuration. Essential first step in any NGS pipeline that starts from public sequencing data (GEO, SRA, ENA, DDBJ).

Thin
0
StabMap

Use when working with StabMap for mosaic single-cell integration across assays that only partially share features, especially RNA plus Multiome, RNA plus ATAC via a bridging assay, or other feature-overlap topologies that need a shared embedding without inventing missing features. StabMap accepts named feature-by-cell matrices, builds a mosaic data topology from rowname overlaps, runs `stabMap()` or `stabMapSE()`, and optionally uses `reWeightEmbedding()` or `imputeEmbedding()` for downstream weighting and naive imputation. Trigger this skill for StabMap, mosaic data topology, unshared-feature integration, or indirect single-cell integration through a connected overlap graph.

Thin
0
Stan

Stan — probabilistic programming language for Bayesian statistical modeling and high-performance inference. Full Bayesian inference via No-U-Turn Sampler (NUTS/HMC), approximate inference via Automatic Differentiation Variational Inference (ADVI), and penalized maximum likelihood via L-BFGS. Stan programs define data, parameters, transformed parameters, model (log-density), and generated quantities blocks. Python interfaces: CmdStanPy (recommended, lightweight CLI wrapper) and PyStan (in-process compilation). R interfaces: CmdStanR and RStan. Supports ODEs, Gaussian processes, mixture models, hierarchical/multilevel models, survival analysis, and custom distributions.

Thin
0
CmdStan

CmdStan — command-line interface to the Stan probabilistic programming language for Bayesian statistical modeling and high-performance inference. Compiles Stan programs (.stan) to C++ executables via GNU Make, then runs MCMC sampling (NUTS/HMC), penalized maximum likelihood optimization (L-BFGS, BFGS, Newton), approximate Bayesian inference (ADVI meanfield/fullrank), Pathfinder multi-path approximate inference, Laplace sampling, and generated quantities. Includes stanc (Stan-to-C++ compiler), stansummary (posterior analysis from CSV output), and diagnose (HMC diagnostic checks). Output in Stan CSV format with metadata comments, sampler parameters, and model parameters. Supports JSON and RDump data input formats. Version 2.38.0. BSD-3-Clause license.

Thin
0
STAR-Fusion

STAR-Fusion — detects candidate fusion transcripts from RNA-seq data using STAR alignments and the FusionInspector validation framework. Integrates with CTAT genome resource libraries for comprehensive fusion annotation including cancer gene databases, paralogs, and read-through events. Used in cancer transcriptomics, rare disease diagnostics, and gene fusion discovery pipelines. Supports both chimeric junction and spanning read evidence for fusion calling.

Thin
0
StarDist

StarDist — deep learning framework for cell and nuclei detection and instance segmentation in 2D and 3D microscopy images using star-convex polygons. Predicts radial distances to object boundaries along fixed rays combined with object probability maps, followed by non-maximum suppression to produce pixel-accurate instance labels. Provides pretrained models for fluorescence nuclei (2D_versatile_fluo), H&E histopathology nuclei (2D_versatile_he), and DSB2018 challenge data (2D_paper_dsb2018). Supports custom model training on annotated data, multi-class prediction, tiled prediction for large images, threshold optimization, and export to TensorFlow SavedModel for deployment in Fiji, QuPath, and napari.

Thin
0
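Example for the StarDist entry above: the star-convex representation can be illustrated by turning radial distances along fixed rays into polygon vertices. This is a sketch under stated assumptions and does not use the StarDist package. It assumes rays are equally spaced in angle starting at angle 0 and uses (row, column) coordinates, which may differ from the library's own ray ordering. The function names are hypothetical.

```python
import math

def star_polygon(center, distances):
    """Vertices of a star-convex polygon from per-ray radial distances.

    Assumption: ray k points at angle 2*pi*k/n (equally spaced rays).
    """
    cy, cx = center
    n = len(distances)
    verts = []
    for k, d in enumerate(distances):
        phi = 2 * math.pi * k / n
        verts.append((cy + d * math.sin(phi), cx + d * math.cos(phi)))
    return verts

def polygon_area(verts):
    """Polygon area via the shoelace formula."""
    s = 0.0
    for i in range(len(verts)):
        y0, x0 = verts[i]
        y1, x1 = verts[(i + 1) % len(verts)]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2

# A roughly circular nucleus: 32 rays, all of length 5 pixels.
n_rays, radius = 32, 5.0
verts = star_polygon((10.0, 10.0), [radius] * n_rays)
area = polygon_area(verts)
# Closed-form area of a regular n-gon, used as a cross-check.
expected_area = 0.5 * n_rays * radius ** 2 * math.sin(2 * math.pi / n_rays)
```

In the method itself a network predicts one such distance vector per pixel together with an object probability, and non-maximum suppression then keeps the best candidate polygons.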
STARmap

STARmap (Spatially-resolved Transcript Amplicon Readout mapping) — in situ spatial transcriptomics method that combines hydrogel-tissue chemistry with sequencing-by-hybridization (SEDAL) for multiplexed RNA detection in intact tissue. Enables imaging-based spatial gene expression profiling, cell segmentation, cell type clustering, and 3D tissue reconstruction from thick tissue sections. Python toolkit for image processing, spot detection, barcode decoding, and spatial analysis of STARmap datasets.

Thin
0
STdeconvolve

STdeconvolve — reference-free cell-type deconvolution for spatial transcriptomics using Latent Dirichlet Allocation (LDA). Decomposes multi-cellular spatial pixels into cell-type proportions and transcriptomic profiles without requiring external single-cell RNA-seq reference data. Supports Visium, Slide-seq, MERFISH, and other spatial platforms. Uses topic modeling to identify optimal number of cell types (K) and recover per-pixel proportions (theta) and cell-type gene expression profiles (beta).

Thin
0
stLearn

stLearn — spatial transcriptomics analysis in Python integrating gene expression with tissue morphology. Provides SME (spatial morphological gene expression) normalization, spatial clustering, spatial trajectory inference (PSTS), cell-cell interaction analysis (LR pairs), and visualization for Visium (10x Genomics) and other spatial transcriptomics platforms. Built on AnnData and Scanpy for seamless scverse ecosystem integration.

Thin
0
Straglr

Straglr — genome-wide detection and genotyping of tandem repeat (TR) expansions from long-read sequencing alignments (PacBio HiFi, ONT). Performs expansion scanning with configurable size thresholds and targeted genotyping at known TR loci using Gaussian Mixture Model clustering. Requires Minimap2-aligned BAM, TRF, and BLASTN. Outputs TSV, BED, and VCF with allele sizes and copy numbers.

Thin
0
stranger

Use when working with stranger — a Clinical Genomics tool for annotating short tandem repeat (STR) variants in VCF files with pathogenicity classifications. Accepts VCFs from ExpansionHunter or TRGT and adds STR_STATUS (normal/pre_mutation/full_mutation), NormalMax, and PathologicMin INFO fields using a repeat definitions catalog. Essential for clinical STR expansion reporting, Huntington disease, spinocerebellar ataxia, fragile X, and other repeat expansion disorder diagnostics in WGS/WES pipelines.

Thin
0
Strelka2

Strelka2 — fast and accurate small variant caller for germline and somatic analysis. Detects SNVs and indels (up to ~49 bp) from mapped paired-end sequencing reads with tiered haplotype modeling, adaptive indel error estimation, and random forest empirical variant scoring. Optimized for WGS with exome/targeted and RNA-Seq support.

Thin
0
STRING Database

Query STRING API for protein-protein interaction networks, functional enrichment, and interaction partner discovery. Covers 59M proteins across 5000+ species with 20B+ scored interactions from 7 evidence channels. REST API access for network retrieval, GO/KEGG/Pfam enrichment, PPI enrichment testing, homology analysis, and network visualization. Part of the ELIXIR Core Data Resources.

Thin
0
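Example for the STRING Database entry above: a request URL for the REST API can be assembled without any third-party library. The sketch assumes the URL pattern `https://string-db.org/api/<format>/<method>`, carriage-return separation of multiple identifiers, and the `species` and `required_score` parameters. Check these against the current STRING API documentation before relying on them. `string_url` is a hypothetical helper, and no network request is made here.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"  # assumed base URL

def string_url(method, identifiers, species, output_format="tsv", **extra):
    """Build (but do not send) a STRING API request URL."""
    params = {
        # Multiple identifiers are joined with a carriage return.
        "identifiers": "\r".join(identifiers),
        "species": species,  # NCBI taxonomy ID, e.g. 9606 for human
    }
    params.update(extra)
    return f"{STRING_API}/{output_format}/{method}?{urlencode(params)}"

# Interaction network for two human proteins at a combined score of at least 700.
url = string_url("network", ["TP53", "MDM2"], 9606, required_score=700)
```

The resulting string can be passed to any HTTP client; changing `method` to another endpoint name follows the same pattern.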
Struo2

Struo2 — Snakemake pipeline for building custom Kraken2, Bracken, and HUMAnN3 databases from NCBI or GTDB genome accessions. Automates downloading reference genomes, formatting taxonomy, and running Kraken2/Bracken database build steps. Supports both NCBI and GTDB taxonomic frameworks. Use for metagenomics custom database construction, taxonomic classifier database building, GTDB-based classification, and microbiome reference databases.

Thin
0
Subread

Subread — high-performance read alignment and quantification package using a seed-and-vote mapping strategy. Includes subread-align for genomic DNA and RNA-seq read alignment, subjunc for exon-junction-aware RNA-seq mapping, featureCounts for read summarization against genomic features (genes, exons, promoters), and exactSNP for SNP calling. The seed-and-vote algorithm extracts multiple short subreads (seeds) from each read, maps them to a hash-table index independently, and determines the best mapping location by majority vote — achieving ultrafast speed while maintaining high accuracy. Supports both short reads (Illumina) and moderately long reads (up to several thousand bases). Handles single-end and paired-end data with gapped and multi-mapping read support. Widely used in RNA-seq quantification pipelines via featureCounts. Published in Nucleic Acids Research (2013) with >10,000 citations. Part of the Rsubread Bioconductor package for R users.

Thin
0
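Example for the Subread entry above: the seed-and-vote idea can be shown with a toy k-mer index. This is an illustrative sketch, not Subread's implementation. It uses exact-match seeds, a plain dictionary as the hash-table index, and a made-up reference sequence; real aligners add gapped extension, quality handling, and much larger indexes.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Hash-table index: every k-mer maps to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_and_vote(read, index, k, step):
    """Map seeds independently; each hit votes for the implied read start."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, step):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            votes[pos - offset] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]  # (best start position, vote count)

reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGTA"  # made-up sequence
read = reference[9:25]
read = read[:13] + "A" + read[14:]  # introduce one mismatch

index = build_index(reference, k=4)
position, votes = seed_and_vote(read, index, k=4, step=4)
```

Three of the four seeds agree on start position 9; the seed covering the mismatch finds no hit, yet the read is still placed correctly. That tolerance is the point of voting.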
SummarizedExperiment

SummarizedExperiment — core Bioconductor container for rectangular genomics data matrices. Stores count matrices (assays) with row metadata (rowData, rowRanges for GRanges coordinates) and column metadata (colData for sample annotations). Used as the standard input/output format for DESeq2, edgeR, limma, scater, and the SingleCellExperiment extension. Handles RNA-seq, ChIP-seq, ATAC-seq, and proteomics assay data with seamless GRanges integration for genomic interval subsetting.

Thin
0
SUPPA2

SUPPA2 — fast, accurate analysis of alternative splicing from RNA-seq data. Calculates PSI (Percent Spliced In) values per event and per transcript from transcript quantification (Salmon, kallisto, RSEM). Detects differential splicing across conditions using empirical distribution methods. Supports seven event types (SE, A5, A3, MX, RI, AF, AL) and cluster analysis for co-regulated splicing programs. Used in splicing regulation, disease variant effect, and alternative isoform studies.

Thin
0
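Example for the SUPPA2 entry above: a PSI value derived from transcript quantification is a ratio of summed abundances. The sketch below assumes the usual definition (TPM of transcripts carrying the inclusion form divided by TPM of all transcripts that define the event) and uses made-up transcript IDs and TPM values. It does not reproduce SUPPA2's file formats or its statistical test.

```python
def event_psi(tpm, inclusion, total):
    """PSI = sum(TPM of inclusion transcripts) / sum(TPM of all event transcripts)."""
    denom = sum(tpm.get(t, 0.0) for t in total)
    if denom == 0:
        return None  # event not expressed, PSI undefined
    return sum(tpm.get(t, 0.0) for t in inclusion) / denom

# Hypothetical skipped-exon event: tx1 includes the exon, tx2 skips it.
inclusion = ["tx1"]
total = ["tx1", "tx2"]

tpm_control = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
tpm_treated = {"tx1": 10.0, "tx2": 30.0, "tx3": 60.0}

psi_control = event_psi(tpm_control, inclusion, total)
psi_treated = event_psi(tpm_treated, inclusion, total)
delta_psi = psi_treated - psi_control
```

Transcripts that do not define the event (here `tx3`) do not enter the ratio, so PSI reflects relative isoform usage rather than overall gene expression.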
survival

survival — the foundational R package for time-to-event analysis, including Kaplan-Meier survival curves, Cox proportional hazards regression, log-rank tests, and parametric survival models. Use for: overall survival, progression-free survival, disease-free survival, competing risks, time-varying covariates, clinical trial endpoint analysis, and genomic biomarker validation. Key terms: Surv(), survfit(), coxph(), survdiff(), cox.zph(), strata(), hazard ratio, concordance index, censoring, Kaplan-Meier, log-rank test, Cox model, survival curve, survreg, right-censored.

Thin
0
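Example for the survival entry above: the Kaplan-Meier estimator with right-censoring is short enough to write out. This is a plain Python sketch of the product-limit formula for illustration. It is not the R package's code and it omits confidence intervals. The data are made up; `events` uses 1 for an observed event and 0 for a censored observation.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(time)) at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather all subjects that share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set.
        n_at_risk -= removed
    return curve

times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored subjects never trigger a drop in the curve, but they shrink the risk set for later time points, which is why the step at time 5 is larger than the one at time 2.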
SURVIVOR

SURVIVOR — C++ toolkit for structural variation (SV) analysis including merging multi-caller VCF files into consensus callsets, SV simulation on reference genomes, benchmarking SV callers against truth sets, filtering by size or region, computing SV statistics, and format conversion (BED/BEDPE/smap to VCF). Essential for multi-sample SV merging and SV caller benchmarking in long-read and short-read pipelines.

Thin
0
SuSiE

SuSiE (Sum of Single Effects) — R package for Bayesian variable selection and fine-mapping of GWAS loci. Fits a sparse regression model to identify credible sets of likely causal variants with posterior inclusion probabilities (PIPs). Supports individual-level genotype data (susie) and GWAS summary statistics with LD matrix (susie_rss). Used for statistical fine-mapping, colocalization input preparation, and multi-causal-variant analysis in quantitative genetics and complex trait genomics.

Thin
0
sva

sva — R/Bioconductor package for surrogate variable analysis, batch effect correction, and unwanted variation removal in high-throughput experiments. Provides sva() for estimating surrogate variables, ComBat() for known batch removal via empirical Bayes, ComBat_seq() for RNA-seq count data batch correction, num.sv() for estimating the number of surrogate variables, and f.pvalue() for F-statistic computation. Essential for differential expression preprocessing in microarray and RNA-seq workflows.

Thin
0
SvABA

SvABA -- structural variant and indel caller using genome-wide local assembly. Detects deletions, insertions, duplications, inversions, and complex rearrangements from short-read (Illumina) whole-genome sequencing (WGS) data. Supports somatic mode (tumor-normal pairs) and germline mode (single sample). Uses a local de novo assembly approach with BWA-MEM re-alignment to discover SVs and large indels with single-nucleotide breakpoint resolution. Produces VCF output with somatic/germline annotations and contig evidence.

Thin
0
SVIM

SVIM — structural variant identification from long-read sequencing data (PacBio, Oxford Nanopore). Detects deletions, insertions, tandem and interspersed duplications, inversions, and translocations from sorted BAM alignments or raw reads. Supports two modes: alignment (pre-aligned BAM) and reads (raw FASTQ with built-in minimap2 alignment). Outputs VCF with QUAL scores for downstream filtering.

Thin
0
SWAT (Soil and Water Assessment Tool)

SWAT (Soil and Water Assessment Tool) — river basin-scale hydrological model for simulating water quantity and quality, sediment transport, nutrient cycling, and land management impacts. Covers watershed delineation, HRU definition, weather generation, surface runoff (SCS-CN / Green-Ampt), groundwater flow, crop growth, and channel routing. Python interface via pySWATPlus for calibration, sensitivity analysis, and programmatic simulation control. Supports SWAT+ (restructured) and classic SWAT. QSWAT+ for GIS-based setup.

Thin
0
Sylph

Sylph — ultrafast metagenomic profiling and containment ANI estimation using k-mer sketching. Performs species-level taxonomic profiling with abundance quantification and genome querying against pre-built databases (GTDB-R220, viral, fungal). Over 50x faster than MetaPhlAn4 with fewer false positives than Kraken2. Supports paired-end and long reads. Key commands: sketch, profile, query. Use for metagenomic profiling, taxonomic abundance estimation, genome detection, ANI calculation, and shotgun metagenomics pipelines.

Thin
0
systemPipeR

Use when working with systemPipeR, the Bioconductor workflow environment that combines R and command-line tools through CWL-backed workflow definitions. Covers systemPipeRdata::genWorkenvir, genWorkenvir_gh, SPRproject, importWF, runWF, plotWF, renderReport, renderLogs, listCmdTools, and targets-driven workflow setup for RNA-Seq, ChIP-Seq, VAR-Seq, and custom reproducible analysis pipelines on local systems or HPC. Trigger phrases include systemPipeR workflow, CWL in Bioconductor, importWF errors, runWF restart, systemPipeRdata template setup, and systemPipeR vs Nextflow or Snakemake.

Thin
0
tabix

tabix — fast random-access indexer for bgzip-compressed, tab-delimited genomic position files (VCF, BED, GFF3, GTF, SAM, and custom formats). Creates .tbi or .csi index files enabling sub-second region queries over multi-GB files without decompression. Part of HTSlib. Required upstream of region-based variant filtering, annotation intersection, and genome browser track serving. Trigger phrases: index VCF, bgzip compress, tabix index, random access genomic regions, query BED by region, .tbi .csi index.

Thin
0
TADbit

TADbit — Python toolkit for end-to-end Hi-C data analysis including read mapping, contact matrix construction, ICE normalization, TAD (Topologically Associating Domain) detection, and 3D chromatin structure modeling using IMP. Supports FASTQ-to-model pipelines for chromosome conformation analysis, TAD boundary calling, compartment detection, and 3D structure validation. Use for Hi-C analysis, 3D genome modeling, TAD boundary detection, chromatin organization, and chromosome conformation capture workflows.

Thin
0
TALON

TALON — technology-agnostic long-read transcriptome analysis for identifying and quantifying known and novel genes/isoforms. Works with PacBio and Oxford Nanopore SAM files via a SQLite database backend. Classifies transcripts into novelty categories (Known, ISM, NIC, NNC, Antisense, Intergenic, Genomic). Supports internal priming detection, transcript filtering by reproducibility, abundance extraction, and custom GTF generation. Python package from the Mortazavi Lab at UC Irvine.

Thin
0
Tandem-genotypes

Tandem-genotypes — genotype tandem repeats from long-read sequencing alignments. Analyzes minimap2 or LAST alignments against tandem-repeats-finder annotations to call insertion/deletion lengths at tandem repeat loci. Supports microsatellite instability detection, repeat expansion disorder screening, and population-level repeat polymorphism studies from PacBio or Oxford Nanopore long reads.

Thin
0
Tangram

Tangram — deep learning framework for mapping single-cell and single-nucleus gene expression data onto spatial transcriptomics data. Built on PyTorch and scanpy, Tangram optimizes a probabilistic mapping matrix that aligns single-cell profiles to spatial voxels by maximizing cosine similarity of gene expression across shared genes. Supports cell-level mapping (full resolution, GPU recommended), cluster-level mapping (faster, laptop-friendly), and constrained mapping (with cell density priors). Enables spatial gene imputation, cell type deconvolution, and annotation transfer. Integrates with Squidpy for spatial analysis and scvi-tools for probabilistic modeling workflows.

Thin
0
tascCODA

Use when working with tascCODA for tree-aggregated compositional analysis of high-throughput sequencing data, especially single-cell RNA-seq, microbiome, or amplicon count tables with a lineage or taxonomic tree. tascCODA extends scCODA with Bayesian tree-aware effect inference via `tree_ana.CompositionalAnalysisTree`, `TreeModelSSLasso`, and `tree_utils.df2newick`, then summarizes credible node- and feature-level changes through `summary()`, `node_df`, and `draw_tree_effects()`. Trigger this skill for tascCODA, tree-aggregated compositional analysis, phylogenetic differential abundance, lineage-aware cell type shifts, or scCODA plus tree questions.

Thin
0
TensorFlow

TensorFlow — open-source machine learning framework for building and deploying deep learning models. Provides Keras high-level API for model building, tf.data for input pipelines, tf.image for image preprocessing, TFRecord for efficient data storage, SavedModel for serving, and TensorBoard for visualization. Used in biology for cell image classification, protein structure prediction, genomic sequence analysis, drug discovery, and medical image segmentation. Supports GPU/TPU acceleration, distributed training, and TensorFlow Lite for edge deployment.

Thin
0
Terra/Firecloud -- Cloud Genomics Platform

Terra/Firecloud -- Broad Institute cloud-based genomics platform for managing workspaces, executing WDL workflows at scale via Cromwell, organizing data tables, and running interactive Jupyter analyses on Google Cloud. Provides the FISS Python SDK and Firecloud REST API for programmatic workspace management, workflow submission, data import/export, and cost monitoring. Part of the AnVIL ecosystem for NHGRI genomic data analysis.

Thin
0
ThermoRawFileParser

ThermoRawFileParser — cross-platform converter for Thermo Fisher RAW mass spectrometry files to open formats (mzML, MGF, Parquet). Supports batch conversion, metadata extraction, native peak picking, gzip compression, S3 cloud upload, XIC extraction, and spectral queries. Built on .NET 8 with the Thermo RawFileReader library. Essential first step in proteomics pipelines before search engines like MaxQuant, FragPipe, or DIA-NN.

Thin
0
tidySingleCellExperiment

Use when working with tidySingleCellExperiment, the R/Bioconductor adapter that makes SingleCellExperiment objects behave like tidy tibbles while preserving compatibility with the Bioconductor single-cell stack. Covers tidyverse verbs on SingleCellExperiment, `as_tibble()`, `join_features()`, `extract()`, ggplot/plotly dispatch, and pseudobulk-oriented helpers such as `aggregate_cells()`. Trigger on tidySingleCellExperiment, tidy single-cell analysis in R, SingleCellExperiment plus dplyr/tidyr, or Bioconductor to tidyverse bridging questions.

Thin
0
tidyverse

Use when working with tidyverse — the collection of R packages for data science including ggplot2 (visualization), dplyr (data manipulation), tidyr (reshaping), readr (file I/O), purrr (functional programming), tibble, stringr, forcats, and lubridate. Use for data wrangling, tidy data workflows, exploratory data analysis, and publication-quality plots. Also known as the "Hadleyverse", tidyverse core, or individual packages ggplot2, dplyr, tidyr, readr, purrr, stringr, forcats, and lubridate. Key terms: filter, mutate, select, group_by, summarize, pivot_longer, pivot_wider, left_join, inner_join, read_csv, read_tsv, ggplot, aes, geom_point, geom_bar, geom_line, facet_wrap, %>%, |>, tibble, R pipeline.

Thin
0
timereg

timereg — flexible regression models for survival data in R. Additive hazards via aalen(), semiparametric Cox-Aalen models via cox.aalen(), competing risks regression via comp.risk() with Fine-Gray and additive subdistribution models, proportional hazards with time-varying effects via timecox(), semiparametric proportional odds via prop.odds(), two-stage frailty models for clustered survival data via two.stage(), cumulative residual goodness-of-fit tests, excess risk modeling, and the Event() constructor for event-history objects. Based on Martinussen & Scheike "Dynamic Regression Models for Survival Data" (Springer, 2006).

Thin
0
TMB/MSI Tools

TMB/MSI analysis tools — Tumor Mutational Burden (TMB) estimation and Microsatellite Instability (MSI) detection from paired tumor-normal sequencing data. Covers MSIsensor for MSI scoring from BAM files, TMB calculation from VCF/MAF somatic mutation calls, and clinical interpretation for immunotherapy biomarker assessment. Used in precision oncology for checkpoint inhibitor eligibility and mismatch repair deficiency screening.

Thin
0
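Example for the TMB/MSI Tools entry above: both biomarkers reduce to simple ratios once mutation calls and site counts are available. The sketch assumes TMB is reported as somatic mutations per megabase of callable sequence and that the MSI score is the percentage of tested microsatellite sites called unstable. The numbers are made up, and clinical cutoffs are left out on purpose because they depend on the assay.

```python
def tumor_mutational_burden(n_mutations, callable_bases):
    """Somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_mutations / (callable_bases / 1_000_000)

def msi_score(n_unstable_sites, n_tested_sites):
    """Percentage of tested microsatellite sites called unstable."""
    if n_tested_sites <= 0:
        raise ValueError("n_tested_sites must be positive")
    return 100.0 * n_unstable_sites / n_tested_sites

# Hypothetical exome: 450 qualifying mutations over 30 Mb of callable bases.
tmb = tumor_mutational_burden(450, 30_000_000)
# Hypothetical MSI run: 12 unstable sites out of 80 with sufficient coverage.
msi = msi_score(12, 80)
```

Which variants count toward the numerator (for example, whether synonymous changes are included) and how the callable region is defined vary between pipelines and should be stated alongside any reported value.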
TOBIAS

TOBIAS — Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal. Python toolkit for differential transcription factor (TF) footprinting from ATAC-seq data. Corrects Tn5 insertion bias, calculates footprint scores, performs differential binding analysis between conditions, and visualizes TF binding dynamics. Works with ATAC-seq BAM files, motif databases (JASPAR), and peak regions (BED).

Thin
0
PyTorch Geometric

PyTorch Geometric (PyG) — graph neural network library built on PyTorch for learning on graphs and irregular structures. Provides message-passing layers (GCN, GAT, GraphSAGE, GIN, Transformer), mini-batch graph loaders, neighborhood sampling, 200+ benchmark datasets, and graph transforms. Used for molecular property prediction, protein interaction networks, drug discovery, and biological network analysis.

Thin
0
TorchDrug

TorchDrug — PyTorch-based machine learning platform for drug discovery and graph representation learning. Supports molecule property prediction, molecular generation, retrosynthesis planning, knowledge graph reasoning, and pretrained molecular representations. Provides GPU-accelerated graph operations with Molecule, Protein, and Graph data structures. Works with SMILES, PDB, and standard molecular formats via RDKit integration.

Thin
0
totalVI

totalVI — deep generative model for joint analysis of CITE-seq RNA and surface protein data built on scvi-tools and PyTorch. Provides integrated dimensionality reduction across transcriptome and protein modalities, batch correction for multi-sample CITE-seq experiments, denoised protein expression imputation, differential expression accounting for both modalities, and latent space extraction for downstream clustering. Part of the scverse ecosystem, operates on MuData/AnnData objects via the standard setup_mudata / train / get_* API pattern.

Thin
0
totalVI / MultiVI

totalVI and MultiVI — deep generative models from scvi-tools for multi-modal single-cell integration. totalVI jointly models scRNA-seq and protein (CITE-seq) data via a variational autoencoder for denoised protein imputation, differential expression, and joint latent space learning. MultiVI extends this to integrate scRNA-seq, scATAC-seq, and CITE-seq in a single framework. Built on AnnData/MuData objects in the scverse ecosystem with GPU-accelerated PyTorch training.

Thin
0
ToxoDB

ToxoDB — integrated genomic and functional database for Toxoplasma gondii and related apicomplexan parasites within the VEuPathDB ecosystem. Provides gene search, BLAST, genome browser (GBrowse/JBrowse), expression queries, proteomics data, SNP analysis, ortholog mapping, metabolic pathway exploration, and REST API access for programmatic queries across 30+ species of Toxoplasma, Neospora, Eimeria, Sarcocystis, and related coccidian parasites.

Thin
0
TransDecoder

TransDecoder — identifies candidate coding regions within transcript sequences from de novo RNA-Seq assemblies. Provides TransDecoder.LongOrfs for extracting long open reading frames and TransDecoder.Predict for predicting likely coding regions using log-likelihood scoring, optional BLAST homology, and Pfam domain evidence. Works downstream of Trinity, StringTie, or any transcript assembler. Input is transcript FASTA; output is predicted peptides, CDS, GFF3, and BED.

Thin
0
treeio

treeio — Bioconductor R package for importing and exporting phylogenetic tree data with associated metadata. Reads and writes Newick, Nexus, BEAST, RAxML, FastTree, IQ-TREE, MrBayes, PHYLODOG, r8s, MEGA, NHX, Phylip, and jplace formats. Attaches evolutionary metadata (posterior probabilities, bootstrap values, branch lengths, node annotations) as tidy data frames to tree objects, enabling downstream visualization with ggtree and statistical analysis with treedata methods. Use when users need to parse, convert, or annotate phylogenetic tree files for evolutionary biology or microbiome workflows.

Thin
0
TreeTime

TreeTime — maximum likelihood molecular clock inference, ancestral sequence reconstruction, and phylodynamic analysis from time-stamped phylogenies. Infers timetrees by iteratively optimizing ancestral sequences and node positions on the time axis. Supports GTR model inference, root-to-tip regression, relaxed clocks, coalescent models, mugration (discrete trait) analysis, and homoplasy detection. Integral to the Nextstrain real-time pathogen tracking platform. Python library and CLI.

Thin
0
Tandem Repeats Finder (TRF)

Tandem Repeats Finder (TRF) — command-line tool for locating and displaying tandem repeats in DNA sequences. Detects microsatellites (STRs), minisatellites, and larger tandem duplications using a probabilistic model with alignment scoring. Accepts FASTA input and produces tabular .dat output with repeat coordinates, period size, copy number, consensus pattern, and alignment scores. Essential for genome annotation, repeat masking, forensic STR analysis, and structural variant characterization.

Thin
0
trimAl

trimAl — automated alignment trimming tool for removing spurious sequences and poorly aligned regions from multiple sequence alignments. Supports automated heuristic methods (gappyout, strict, strictplus, automated1), manual gap and similarity thresholds, consistency-based trimming from multiple alignments, column and sequence selection, back-translation from protein to codon alignments, and format conversion between FASTA, Clustal, NEXUS, Phylip, MEGA, and PIR formats.

Thin
0
Trim Galore

Trim Galore — convenience wrapper around Cutadapt and FastQC for consistent, automated adapter trimming, quality filtering, and post-trimming QC of high-throughput sequencing reads. Auto-detects Illumina, Nextera, and Small RNA adapters, provides specialized RRBS bisulfite sequencing support, NextSeq/NovaSeq two-color chemistry polyG handling, and optional integrated FastQC reporting. Use when preprocessing FASTQ files before alignment or quantification.

Thin
0
Trinotate

Trinotate — comprehensive functional annotation suite for transcriptome assemblies. Integrates BLAST/DIAMOND homology searches against SwissProt, HMMER/Pfam protein domain identification, SignalP signal peptide prediction, TMHMM transmembrane domain detection, Infernal ncRNA annotation, and EggnogMapper functional classification into a unified SQLite database. Produces tab-delimited annotation reports with GO, KEGG, and eggNOG assignments. Standard companion to Trinity for non-model organism transcriptomics. Includes TrinotateWeb for interactive visualization of annotations and expression data.

Thin
0
TriTrypDB

TriTrypDB — integrated functional genomics resource for kinetoplastid parasites (Trypanosoma, Leishmania, and related species). Part of the VEuPathDB Bioinformatics Resource Center, hosting 83+ annotated genomes with REST web services for gene search, GO/pathway enrichment, BLAST, orthology queries, SNP analysis, transcriptomics, proteomics, and phenotype data mining across tropical disease pathogens including T. brucei, T. cruzi, and L. major.

Thin
0
Truvari

Truvari — structural variant benchmarking, merging, and annotation toolkit for VCF files. Provides SV caller performance evaluation (bench) with configurable size/sequence/overlap thresholds, redundant variant collapsing (collapse), representation harmonization via MSA (refine/phab), and 14 annotation subcommands (gcpct, trf, remap, grm, etc.). Supports symbolic alleles, BND comparison, and GA4GH benchmarking format. Python CLI and API.

Thin
0
TwoSampleMR

TwoSampleMR — R package for two-sample Mendelian randomization using GWAS summary statistics. Extracts genetic instruments from the IEU OpenGWAS database, harmonises exposure and outcome data, and applies MR methods (IVW, MR-Egger, weighted median, MR-PRESSO) to estimate causal effects. Use when performing Mendelian randomization, causal inference from GWAS, instrument variable analysis, or pleiotropy assessment.

Thin
0
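Example for the TwoSampleMR entry above: the inverse-variance weighted (IVW) estimate can be computed directly from harmonised summary statistics. This is a fixed-effect sketch in plain Python and not the package's own routine, which is written in R and may compute standard errors differently. The effect sizes are made up.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW: weighted average of per-variant Wald ratios.

    beta = sum(bx * by / se^2) / sum(bx^2 / se^2)
    se   = sqrt(1 / sum(bx^2 / se^2))
    """
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1 / den)

# Three hypothetical instruments whose Wald ratios (by / bx) all equal 0.5.
bx = [0.1, 0.2, 0.4]
by = [0.05, 0.1, 0.2]
se = [0.01, 0.01, 0.01]

beta_ivw, se_ivw = ivw_estimate(bx, by, se)
```

Because every instrument here implies the same ratio, the pooled estimate is 0.5; with real data, disagreement between ratios is what the pleiotropy-robust methods named in the entry are meant to probe.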
TxDb packages

Use when working with Bioconductor TxDb annotation packages, especially TxDb.Hsapiens.UCSC.hg38.knownGene, GenomicFeatures transcript models, UCSC knownGene annotations, gene and transcript coordinates, promoter extraction, transcript-length summaries, overlap-based genomic feature lookup, or GRanges/GRangesList workflows that depend on a packaged TxDb object for hg38. Trigger on: TxDb packages, TxDb.Hsapiens.UCSC.hg38.knownGene, transcripts(), genes(), promoters(), transcriptsBy(), exonsBy(), transcriptLengths(), and AnnotationDbi-backed TxDb queries.

Thin
0
tximeta

tximeta — R/Bioconductor package for importing transcript-level quantification data from Salmon, alevin, piscem, and other quantifiers with automatic metadata attachment. Automatically identifies the reference transcriptome via hashed checksums and links genomic ranges, genome info, and provenance metadata to the resulting SummarizedExperiment object. Wraps tximport for gene-level summarization, differential expression with DESeq2/edgeR, and reproducible RNA-seq quantification workflows.

Thin
0
tximport

tximport -- R/Bioconductor package for importing transcript-level abundance, estimated counts, and transcript lengths from quantification tools (Salmon, Kallisto, RSEM, StringTie, Sailfish) and summarizing into gene-level matrices for downstream analysis with DESeq2, edgeR, or limma-voom. Handles countsFromAbundance normalization, tx2gene mapping, and inferential replicates. Central to RNA-seq differential expression pipelines.

Thin
0
UCE

UCE (Universal Cell Embeddings) — species-agnostic cell embedding model from Stanford SNAP Lab that uses protein language models (ESM-2) to create universal cell representations without species-specific reference genomes. Generates transferable embeddings for cross-species integration, zero-shot cell type annotation, atlas-level comparison, and integration of novel organisms without re-training. Works directly from raw single-cell RNA-seq count matrices. Produces 1280-dim embeddings usable with scanpy/AnnData.

Thin
0
UCSC Genome Browser

UCSC Genome Browser — web-based genome annotation browser providing access to reference genome assemblies, gene predictions, comparative genomics, variation, regulation, and expression data across hundreds of species. Features the BLAT sequence alignment tool, REST API for programmatic data access, Track Hubs for custom annotation hosting, Table Browser for bulk data export, and the kent command-line utilities (bedToBigBed, wigToBigWig, liftOver) for file format conversion. The primary public genome browser for ENCODE, Roadmap Epigenomics, and GTEx data integration.

Thin
0
UINMF

Use when working with UINMF in `rliger` for mosaic single-cell integration with shared and unshared features. UINMF extends LIGER's iNMF workflow to keep modality-specific genes, intergenic accessibility features, targeted spatial genes, or non-orthologous cross-species features while still learning shared factors. Trigger this skill for `runUINMF`, `runIntegration(..., method = "UINMF")`, `selectGenes(useUnsharedDatasets = ...)`, `scaleUnsharedData`, multimodal RNA + ATAC integration, targeted spatial plus scRNA integration, or cross-species analyses that cannot rely on shared features alone.

Thin
0
UMAP

UMAP — Uniform Manifold Approximation and Projection for dimension reduction. Constructs a fuzzy topological representation of high-dimensional data and optimizes a low-dimensional layout preserving that structure. Supports unsupervised, supervised, and semi-supervised reduction, transforming new data via a learned model, inverse transforms, density-preserving DensMAP, aligned multi-dataset embedding (AlignedUMAP), and neural-network-based Parametric UMAP. Scikit-learn compatible transformer API. Primary tool for visualization and preprocessing in single-cell genomics, spatial transcriptomics, and general-purpose nonlinear dimension reduction.

Thin
0
UMAP

UMAP (Uniform Manifold Approximation and Projection) — nonlinear dimensionality reduction for visualization and general-purpose embedding. Constructs fuzzy simplicial complex from high-dimensional data and optimizes low-dimensional layout preserving topological structure. Supports supervised, semi-supervised, and unsupervised modes, sparse data, custom distance metrics, inverse transforms, aligned embeddings, and parametric neural network variants. Standard embedding in single-cell RNA-seq pipelines (Scanpy, Seurat) and widely used in text analysis, image embeddings, and ML preprocessing.

Thin
0
UMI-tools

UMI-tools — Python toolkit for processing Unique Molecular Identifiers (UMIs) in sequencing data. Provides end-to-end UMI workflows including extraction from FASTQ reads, cell barcode demultiplexing for single-cell RNA-seq, network-based UMI deduplication from BAM files, and per-gene UMI counting. Supports directional, adjacency, and cluster deduplication methods. Essential for removing PCR duplicates in UMI-tagged libraries and quantifying single-cell gene expression.

Thin
0
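Example for the UMI-tools entry above: the directional network method can be sketched for UMIs observed at a single mapping position. The sketch assumes the commonly described rule that UMI A absorbs UMI B when they differ by one base and count(A) >= 2 * count(B) - 1. It is an illustration in plain Python, not the toolkit's code, and the counts are made up.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_groups(counts):
    """Group UMIs into inferred original molecules (directional rule)."""
    # Visit UMIs from most to least abundant; ties broken alphabetically.
    order = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = set()
    groups = []
    for umi in order:
        if umi in assigned:
            continue
        group, stack = [umi], [umi]
        assigned.add(umi)
        while stack:
            parent = stack.pop()
            for child in order:
                if (child not in assigned
                        and hamming(parent, child) == 1
                        and counts[parent] >= 2 * counts[child] - 1):
                    assigned.add(child)
                    group.append(child)
                    stack.append(child)
        groups.append(group)
    return groups

counts = {"ACGT": 100, "ACGA": 40, "ACCA": 3, "TTTT": 5, "TTTA": 4}
groups = directional_groups(counts)
n_molecules = len(groups)
```

`ACGA` and `ACCA` are folded into `ACGT` as likely amplification or sequencing errors, while `TTTT` and `TTTA` stay separate because their counts are too similar for one to be explained as an error copy of the other.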
UniProt Database

Direct REST API access to UniProt (250M+ protein sequences, 570K+ reviewed Swiss-Prot entries). Protein search, FASTA/JSON retrieval, ID mapping across 200+ databases, batch retrieval, streaming, field selection, and advanced query syntax. For Python workflows with multiple databases, prefer bioservices (unified interface to 40+ services). Use this for direct HTTP/REST work or UniProt-specific control.

Thin
0
UpSetR

UpSetR is an R package for visualizing intersections of multiple sets using UpSet plots — a scalable alternative to Venn diagrams for 3+ sets. Use when you need to compare membership across gene lists, sample groups, variant callsets, pathway hits, or any categorical overlaps. Supports binary matrix input, named list input via fromList(), and expression-string input via fromExpression(). Key capabilities: ordered intersection bars, set size bars, dot-line matrix, intersection queries with color highlighting, and attribute scatter plots. Use for set intersection visualization, multi-set overlap, UpSet plot, upset chart, gene list comparison, sample overlap, variant overlap.

Thin
0
USEARCH

USEARCH — ultra-fast amplicon sequence analysis toolkit for 16S/ITS/18S microbiome studies. Supports FASTQ quality filtering (fastq_filter), paired-end merging (fastq_mergepairs), dereplication (derep_fulllength), OTU clustering (cluster_otus / UPARSE pipeline), denoising to ZOTUs (unoise3), chimera detection (uchime2_ref), taxonomic classification (sintax), and OTU table generation (otutab). Use for amplicon-based microbiome profiling from raw reads to OTU/ZOTU tables.

Thin
0
UShER

Use when working with UShER (Ultrafast Sample placement on Existing tRees) for phylogenetic placement of new samples onto existing mutation-annotated trees. Places sequences via maximum parsimony on large phylogenies (millions of tips), manipulates mutation-annotated trees (MAT protobuf) with matUtils, optimizes tree parsimony with matOptimize, and detects recombinants with RIPPLES. Essential for SARS-CoV-2 genomic surveillance, real-time phylogenetics, outbreak investigation, and clade assignment.

Thin
0
uwot

uwot is an R package implementing UMAP (Uniform Manifold Approximation and Projection) for nonlinear dimensionality reduction, with compiled C++ internals and no Python dependency. Provides umap(), umap2() (optimized defaults), tumap() (t-distributed UMAP with ~50% speedup), and lvish() (LargeVis-like embedding). Supports supervised and semi-supervised reduction, out-of-sample transform via umap_transform(), density preservation (densMAP), multiple nearest neighbor backends (Annoy, HNSW, nndescent, FNN), multi-threaded computation, and batch-reproducible SGD. Standard tool in Seurat single-cell workflows (RunUMAP).

Thin
0
Validate Workflow Syntax

LSP-style static validation of Nextflow, Snakemake, WDL, and CWL workflow files without execution.

Thin
0
VarDict

VarDict variant caller for SNVs, MNVs, indels, complex variants, and structural variants from BAM files. Supports somatic paired tumor-normal calling and single-sample germline mode. Ultra-sensitive variant detection for targeted sequencing, FFPE samples, and low allele frequency mutations. Also known as VarDictJava. Use for variant calling, somatic mutation detection, amplicon-based sequencing analysis.

Thin
0
VariantAnnotation

Use when working with Bioconductor VariantAnnotation, VCF import in R, ScanVcfParam field or range subsetting, locateVariants(), predictCoding(), filterVcf(), or coding consequence workflows that combine VariantAnnotation with TxDb and BSgenome resources. Trigger on: VariantAnnotation, readVcf, VCF annotation in Bioconductor, coding change prediction, genomic context lookup, chunked VCF filtering, and R-based VCF inspection or export.

Thin
0
VarScan2

VarScan2 is a Java-based variant caller for somatic and germline SNV/indel detection, copy number analysis, and LOH detection from samtools mpileup output. Supports tumor-normal paired somatic calling, germline SNP/indel calling, copy number variant (CNV) profiling, and somatic copy number alteration (SCNA) detection. Does not read SAM/BAM directly: alignments are first converted to pileup format with samtools mpileup. Widely used in cancer genomics (WGS/WES/amplicon) for somatic mutation detection. Also known as VarScan v2, varscan, VarScan.jar.

Thin
0
VarSome

Use when working with varsome — varSome — comprehensive human genomics

Thin
0
VCF2PCA

Use when working with vcf2pca — VCF2PCA — lightweight Python tool for

Thin
0
VCFtools

VCFtools — C++ toolkit for filtering, comparing, summarizing, converting, and manipulating VCF (Variant Call Format) and BCF files. Provides site and individual-level filtering, allele frequency calculation, Hardy-Weinberg equilibrium testing, linkage disequilibrium computation, Fst population differentiation, nucleotide diversity (pi), Tajima's D, relatedness estimation, and format conversion (VCF to PLINK, IMPUTE, LDhat). Essential for variant-level QC in GWAS, population genetics, and WGS/WES pipelines.

Thin
0
VectorBase

VectorBase — bioinformatics resource for invertebrate vector genomics providing gene search, BLAST, genome browsing (JBrowse), population biology mapping (MapVEu), and REST API access for arthropod vectors of human disease including mosquitoes (Anopheles, Aedes, Culex), ticks (Ixodes), sand flies (Lutzomyia, Phlebotomus), tsetse flies (Glossina), and other invertebrate vectors. Part of VEuPathDB, supports 53+ genomes with multi-omics data integration.

Thin
0
Velocyto

Velocyto — RNA velocity estimation tool that distinguishes unspliced and spliced mRNAs in single-cell RNA-seq data to predict future cell states. Provides a CLI for counting spliced/unspliced/ambiguous reads from BAM files (10x Chromium, Smart-seq2, Drop-seq) into loom files, and a Python API (VelocytoLoom) for gamma fitting, velocity calculation, transition probability estimation, and velocity field visualization on cell embeddings.

Thin
0
velocyto.R

velocyto.R skill for RNA velocity analysis in single-cell RNA-seq data. Use when users ask about velocyto.R, RNA velocity in R, deterministic velocity models, or working with .loom files containing spliced and unspliced layers. Supports end-to-end workflows from .loom validation through velocity projection on UMAP/tSNE embeddings and comparison with scVelo trajectory tools.

Thin
0
Velvet

Velvet is a de Bruijn graph short-read genome assembler for Illumina, Solexa, and 454 sequencing data. Runs as a two-stage pipeline: velveth hashes the reads into k-mers and prepares the dataset, then velvetg builds and simplifies the de Bruijn graph to produce contigs and scaffolds. Key for small genome assembly, metagenomics (MetaVelvet), and transcriptomics (Oases). Supports paired-end, mate-pair, and mixed short-read datasets with configurable k-mer length and coverage cutoffs.

Thin
0
VennDiagram

VennDiagram is an R package for generating high-resolution, highly customizable Venn diagrams for one to five sets and Euler diagrams for selected two- and three-set cases. Use when users need publication-ready overlap figures in R, TIFF/PNG/SVG export, direct control over fills, labels, cat.pos, cat.dist, or grid-based Venn rendering for gene lists, pathway overlaps, cohort overlaps, and set comparison workflows. Includes venn.diagram(), draw.pairwise.venn(), draw.triple.venn(), draw.quad.venn(), draw.quintuple.venn(), get.venn.partitions(), and calculate.overlap().

Thin
0
VEP

VEP (Ensembl Variant Effect Predictor) — gold-standard tool for annotating and predicting the functional effects of genomic variants on genes, transcripts, and protein sequences. Provides consequence types, SIFT/PolyPhen predictions, allele frequencies, regulatory annotations, and extensible plugin framework. Use when annotating VCF files, filtering variants by pathogenicity, running filter_vep, or building clinical variant interpretation pipelines.

Thin
0
Verkko

Verkko — hybrid genome assembler for telomere-to-telomere (T2T) diploid assembly from PacBio HiFi and Oxford Nanopore ultra-long reads. Combines MBG de Bruijn graphs with progressive ONT resolution, trio phasing via Merqury hapmers, and Hi-C/PoreC scaffolding. Snakemake-based pipeline with grid engine support (Slurm/SGE/PBS/LSF). Standard tool for complete human and non-human genome assembly.

Thin
0
vg

Use when working with vg — the variation graph toolkit — for pangenome graph construction, read mapping to variation graphs, and graph-aware variant calling. Builds genome graphs from VCF + reference FASTA (vg construct, vg autoindex), maps short or long reads to graphs (vg giraffe, vg map), and calls variants (vg call). Essential for representing population-level variation, structural variant detection, and pangenome analysis. Part of the vgteam ecosystem alongside Minigraph and PGGB.

Thin
0
VSEARCH

VSEARCH — open-source, multithreaded alternative to USEARCH for amplicon and metagenomics sequence analysis. Performs dereplication, chimera detection (de novo and reference-based), OTU/ASV clustering, paired-end merging, quality filtering, and database searching on FASTA/FASTQ inputs. Key commands: --cluster_fast, --cluster_size, --uchime_denovo, --uchime_ref, --derep_fulllength, --fastq_filter, --fastq_mergepairs, --usearch_global. Use for 16S/18S rRNA amplicon pipelines, ITS analysis, and metagenome dereplication workflows.

Thin
0
vt

vt — C++ command-line variant tool set for manipulating VCF files. Provides variant normalization (left-alignment and trimming), multiallelic decomposition, VCF summary statistics (peek), annotation, subsetting, concatenation, and genotype operations. Essential preprocessing step between variant callers (GATK, FreeBayes, DeepVariant) and annotation tools (VEP, SnpEff, ANNOVAR).

Thin
0
wateRmelon

wateRmelon v2.16.0 — Bioconductor R package for Illumina 450K and EPIC DNA methylation array normalization and performance metrics. Provides 15 normalization methods (dasen, nasen, naten, danes, swan, BMIQ, noob, and others), three QC performance metrics (dmrse, genki, seabi), probe filtering (pfilter), epigenetic age prediction via Horvath and Hannum clocks (agep), sex prediction (predictSex), cell-type deconvolution (estimateCellCounts.wmln), bisulphite conversion QC (bscon), and outlier detection (outlyx). Works with methylumi and minfi objects.

Thin
0
Wave

Wave — on-demand container provisioning service by Seqera Labs for building, augmenting, and caching container images dynamically. Supports Dockerfile-based builds, Conda/Mamba package-based image creation, Singularity image generation, multi-platform builds (amd64/arm64), container vulnerability scanning, and registry mirroring. Deeply integrated with Nextflow via wave.enabled config. Essential for reproducible bioinformatics pipelines without manual image management.

Thin
0
WDL/Cromwell

WDL/Cromwell — Workflow Description Language (WDL) authoring and Cromwell execution engine for reproducible bioinformatics pipelines. WDL defines portable, composable workflows using task/workflow blocks with typed inputs, outputs, and runtime declarations. Cromwell executes WDL on local machines, HPC clusters (SLURM, SGE, LSF, PBS), and cloud platforms (Google Cloud, AWS Batch, Azure Batch). Powers GATK Best Practices, Terra/AnVIL genomics platform, and BioData Catalyst. Supports scatter-gather parallelism, sub-workflows, call caching, struct types, and WDL 1.0/1.1 specification. Use for variant calling pipelines, whole genome sequencing, RNA-seq, and any workflow requiring cloud-portable reproducibility.

Thin
0
Weights & Biases

Weights & Biases (wandb) — ML experiment tracking, hyperparameter sweeps, artifact versioning, and collaborative model management platform. Log training metrics, gradients, system stats, and media (images, tables, audio) with wandb.init() and wandb.log(). Run automated hyperparameter search with Sweeps. Track and version datasets, models, and evaluation results as Artifacts. Integrates with PyTorch, TensorFlow, Keras, JAX, Hugging Face Transformers, and Lightning. Use for reproducible ML experiment management, team collaboration, model registry, and training performance monitoring.

Thin
0
Wengan

Wengan is a hybrid genome assembler that combines short and long reads using a synthetic scaffolding approach. Integrates short-read assembly backends (Minia3, ABySS2, DiscovarDenovo) with alignment-free long-read mapping and IntervalMiss-based detection of chimeric short-read contigs for rapid, accurate hybrid assembly of large genomes. Supports PacBio CLR, PacBio HiFi, and Oxford Nanopore long reads paired with Illumina short reads. Produces long scaffolds whose N50 values are reported as competitive with other hybrid and long-read assemblers.

Thin
0
wf-basecalling

wf-basecalling — Oxford Nanopore Technologies (ONT) EPI2ME Nextflow workflow for production-grade basecalling using Dorado. Converts POD5 or FAST5 raw electrical signal files to BAM/FASTQ with optional alignment to a reference genome. Supports simplex (single-strand) and duplex (paired-strand) basecalling, modified base detection (5mCG, 5hmC, 6mA), and generates built-in QC reports. Use when running ONT basecalling at scale with MinION, GridION, or PromethION data. Replaces the deprecated Guppy + Nextflow combination.

Thin
0
Wflow

Wflow — Deltares open-source distributed hydrological modeling framework written in Julia. Provides SBM, HBV, and sediment model configurations for catchment hydrology simulation. Uses TOML configuration files and NetCDF input/output. Supports kinematic wave and local inertial routing, reservoir and lake modules, snow and glacier processes, and water demand allocation. Build models with HydroMT-wflow for automated setup from global datasets.

Thin
0
wgbs-tools

wgbs_tools — command-line toolkit for whole-genome bisulfite sequencing (WGBS) analysis. Converts, indexes, and queries methylation data in efficient PAT/BETA binary formats. Provides per-CpG and per-region methylation extraction, segment-level beta value computation, genomic region visualization, and marker identification for cell-type deconvolution. Works with BAM files from bisulfite aligners (Bismark, bwa-meth) and human/mouse reference genomes.

Thin
0
WGCNA

WGCNA (Weighted Gene Co-expression Network Analysis) is an R package for constructing weighted correlation networks from high-dimensional gene expression data. Identifies co-expression modules via hierarchical clustering and dynamic tree cutting, relates modules to sample traits using eigengene correlations, finds hub genes, and exports networks to Cytoscape. Key functions: pickSoftThreshold, blockwiseModules, moduleEigengenes, cor, adjacency, TOMsimilarity, exportNetworkToCytoscape. Input: gene expression matrix (samples as rows, genes as columns; transpose a genes-by-samples matrix first) and sample trait data. Output: module assignments, module-trait correlations, hub gene lists, network edge lists. Essential for systems biology, biomarker discovery, and gene regulatory network inference from RNA-seq or microarray data.

Thin
0
WhatsHap

WhatsHap is a tool for read-based phasing (haplotype assembly) of genomic variants from sequencing reads. Phases diploid and polyploid genomes using long reads (PacBio, ONT) or short reads (Illumina). Key subcommands: phase (variant phasing), haplotag (tag BAM reads by haplotype), stats (phasing quality metrics), compare (evaluate phasing accuracy), polyphase (polyploid phasing), genotype (re-genotyping), split (separate reads by haplotype), and unphase (strip phasing). Produces phased VCF output with PS/HP tags. Search terms: WhatsHap, whatshap, haplotype assembly.

Thin
0
WhiteboxTools

WhiteboxTools — advanced geospatial data analysis platform with 518 tools for GIS operations, hydrological analysis, terrain analysis, LiDAR processing, remote sensing, and geomorphometry. Supports raster (GeoTIFF, Whitebox GAT, ESRI ASCII), vector (Shapefile), and LiDAR (LAS/LAZ) formats. Python frontend via the whitebox package wraps a Rust-based analytical engine for cost-distance analysis, flow accumulation, watershed delineation, slope calculations, hillshading, and point cloud classification.

Thin
0
WNN (Weighted Nearest Neighbor)

WNN (Weighted Nearest Neighbor) — Seurat's method for multimodal single-cell data integration that computes cell-specific modality weights and combines per-modality KNN graphs. Supports any combination of data modalities (RNA + ATAC, RNA + protein/ADT, RNA + spatial, etc.). Produces a unified WNN graph for UMAP embedding and clustering via FindMultiModalNeighbors, RunUMAP with weighted.nn, and FindClusters on the wsnn graph. Part of the Seurat v4+ ecosystem for multi-omics single-cell analysis in R.

Thin
0
Workflow Generator

Use for deterministic conversion of workflow specifications into Nextflow DSL2, Snakemake 8+, WDL 1.0, and CWL v1.2 pipelines with container-aware resource routing and validation.

Thin
0
wtdbg2

wtdbg2 is a fast de novo long-read genome assembler based on a fuzzy de Bruijn graph. Assembles PacBio (RSII, Sequel, CCS) and Oxford Nanopore reads without prior error correction. Two-step workflow: the wtdbg2 assembler produces the contig layout, then wtpoa-cns derives the consensus. Handles bacterial to mammalian genomes and is designed for speed on large genomes.

Thin
0
xarray-spatial

xarray-spatial — fast Numba-accelerated Python library for raster spatial analysis operations on xarray DataArrays. Provides surface analysis (slope, aspect, curvature, hillshade, viewshed), focal statistics (hotspots, mean), proximity (distance, allocation, direction), multispectral indices (NDVI, EVI, SAVI, NBR), classification (natural breaks, quantile, reclassify), hydrology (flow direction, flow accumulation, watershed), pathfinding (A*), and terrain generation. Scales with Dask for distributed computation without GDAL/GEOS dependencies.

Thin
0
XCMS

XCMS — R/Bioconductor framework for preprocessing and analysis of chromatographically separated mass spectrometry data. Provides centWave and matchedFilter peak detection, obiwarp and peak-groups retention time alignment, density-based correspondence, and gap-filling for LC-MS, GC-MS, and DIA/SWATH data in mzML/mzXML/mzData/NetCDF formats. Produces feature-by-sample intensity matrices for downstream metabolomics statistical analysis. Built on the MsExperiment/Spectra infrastructure (v4) with full backward compatibility.

Thin
0
X!Tandem

X!Tandem — open-source proteomics search engine for matching tandem mass spectra (MS/MS) to peptide sequences in protein databases. Performs database searching with configurable scoring, supports post-translational modification searching, refinement searches for semi-tryptic and non-specific cleavage, and outputs XML results with peptide-spectrum matches (PSMs). Common in shotgun proteomics pipelines for protein identification from LC-MS/MS data.

Thin
0
xTrimoGene

Use when working with xTrimoGene or scFoundation-style single-cell foundation model inference for cell embeddings, gene embeddings, gene-batch embeddings, or xTrimoGene-based fine-tuning. Trigger this skill for xTrimoGene, scFoundation, single-cell transformer embeddings, `get_embedding.py`, 19,264-gene remapping, targeted high-resolution tokens, or workflows that need the released BioMap checkpoint interface instead of inventing a new API.

Thin
0
Zarr

Zarr — chunked, compressed N-dimensional array storage for large numerical datasets in bioinformatics and scientific computing. Supports cloud-native storage (S3, GCS, Azure), parallel read/write, and multiple codecs (Blosc, Zstd, GZip). Widely used for genomics arrays, imaging data, spatial transcriptomics, and single-cell matrices. Implements the Zarr spec v2 and v3. Use when storing or streaming large arrays to/from cloud or local disk. Integrates with AnnData, Dask, and OME-Zarr for bioinformatics workflows.

Thin
0
Zarr

Zarr — chunked, compressed N-dimensional arrays for Python with cloud-native storage. Provides hierarchical groups, pluggable compression codecs (Blosc, Zstd, Gzip), sharding for large-scale datasets, parallel I/O via Dask, and seamless integration with NumPy, Xarray, and AnnData for scientific computing workflows from genomics to climate science.

Thin
0