scRBP implements a 10-step command-line pipeline that takes raw single-cell RNA-seq data and produces disease-relevant RBP regulon rankings.

Raw scRNA-seq (.h5ad / .feather)
        ↓
  [1] getSketch       → Stratified subsampling (Optional)
  [2] getGRN          → GRN inference (GRNBoost2/GENIE3, --mode gene/isoform)
  [3] getMerge_GRN    → Consensus merging (N seeds, default 30)
  [4] getModule       → Regulon candidate extraction
  [5] getPrune        → Motif enrichment pruning
  [6] getRegulon      → GMT file generation (symbol + Entrez)
  [7] mergeRegulons   → Merge 4 region GMT files (3UTR/5UTR/CDS/Introns)
        ↓
  [8] ras             → Regulon Activity Score (RAS, --mode sc/ct)
  [9] rgs             → Regulon-level Genetic association Score (RGS, --mode sc/ct)
  [10] trs            → Trait Relevance Score (TRS, --mode sc/ct)
        ↓
   Disease-relevant RBP rankings

Step 1: getSketch

Optional stratified cell downsampling using GeoSketch. For .h5ad input, sampling is stratified per cell type so that each cell type retains at least min_cells_per_type cells while the total approaches the global n_cells target.

scRBP getSketch \
  --input data.h5ad \
  --output sketch.feather \
  --n_cells 50000 \
  --celltype_col celltype \
  --min_cells_per_type 50

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --input | path | required | Input .h5ad or .feather file |
| --output | path | required | Output file (.h5ad / .feather / .csv / .npz) |
| --n_cells | int | 50000 | Target total number of cells to sketch |
| --n_pca | int | 100 | Number of PCA components for GeoSketch |
| --celltype_col | str | celltype | Column in adata.obs for cell-type labels (.h5ad only) |
| --min_cells_per_type | int | 50 | Minimum cells per cell type to guarantee (.h5ad only) |
| --seed | int | 42 | Random seed |

Note: .feather input lacks cell-type metadata; global GeoSketch is applied without per-type guarantees.
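
The per-type guarantee described above can be pictured as a quota computed before sketching. The following is a minimal sketch, assuming proportional allocation with a floor; the function name `allocate_quotas` and the rounding rule are illustrative and not taken from scRBP, and the GeoSketch selection within each cell type is not shown.

```python
def allocate_quotas(type_counts, n_cells=50000, min_cells_per_type=50):
    """Hypothetical quota rule: proportional share, floored at the per-type minimum."""
    total = sum(type_counts.values())
    quotas = {}
    for cell_type, count in type_counts.items():
        share = round(n_cells * count / total)       # proportional target for this type
        floored = max(share, min_cells_per_type)     # enforce the per-type minimum
        quotas[cell_type] = min(count, floored)      # cannot exceed the cells available
    return quotas


counts = {"T cell": 9000, "B cell": 900, "rare": 100, "tiny": 30}
quotas = allocate_quotas(counts, n_cells=1000, min_cells_per_type=50)
print(quotas)
```

Under this rule, small cell types are topped up to the minimum, so the quotas can sum to slightly more than n_cells; that is one possible reading of "approaching" the global target.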


Step 2: getGRN

GRN inference using GRNBoost2 or GENIE3. Supports two modes: gene-level (RBP→gene) and isoform-level (RBP→isoform). Run this command with multiple seeds (e.g., 30 times) for robust consensus networks.

# Gene-level inference (recommended)
scRBP getGRN \
  --matrix sketch.feather \
  --rbp_list rbp_list.txt \
  --output grn_seed1.tsv \
  --method grnboost2 \
  --mode gene \
  --seed 1

# Isoform-level inference
scRBP getGRN \
  --matrix sketch.feather \
  --rbp_list rbp_list.txt \
  --output grn_isoform_seed1.tsv \
  --mode isoform \
  --isoform_annotation isoform2gene.txt

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --matrix | path | required | Expression matrix (.csv / .csv.gz / .feather / .loom); rows=features, cols=cells |
| --rbp_list | path | required | RBP list file (gene symbols, one per line) |
| --output | path | required | Output GRN .tsv file |
| --method | str | grnboost2 | Inference algorithm: grnboost2 or genie3 |
| --mode | str | gene | Inference mode: gene (RBP→gene) or isoform (RBP→isoform) |
| --isoform_annotation | path | None | Isoform→gene annotation file (required when --mode isoform) |
| --n_workers | int | all CPUs | Number of parallel workers |
| --batch_size | int | 10 | Number of outer batches |
| --threshold | float | 0.03 | Absolute Spearman correlation threshold for filtering |
| --correlation | bool | True | Compute Spearman correlation and Mode columns |
| --seed | int | 1234 | Random seed |

Output columns: RBP, Target, Importance, Correlation, Mode
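
Since the consensus step expects one GRN file per seed, the repeated invocation is easy to script. The sketch below only builds the argument lists, using the flags documented for getGRN; the helper name `build_getgrn_commands` and the output file-name template are assumptions made for illustration, and nothing is executed until a command is handed to `subprocess.run`.

```python
import shlex


def build_getgrn_commands(matrix, rbp_list, n_seeds=30,
                          out_template="grn_seed{seed}.tsv"):
    """Return one `scRBP getGRN` argument list per seed (hypothetical helper)."""
    commands = []
    for seed in range(1, n_seeds + 1):
        commands.append([
            "scRBP", "getGRN",
            "--matrix", matrix,
            "--rbp_list", rbp_list,
            "--output", out_template.format(seed=seed),
            "--method", "grnboost2",
            "--mode", "gene",
            "--seed", str(seed),
        ])
    return commands


commands = build_getgrn_commands("sketch.feather", "rbp_list.txt")
print(shlex.join(commands[0]))  # inspect the first command before running anything
# To run them for real: subprocess.run(cmd, check=True) for each cmd in commands.
```

The default file-name template is chosen so that the outputs match a glob such as "grn_seed*.tsv".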


Step 3: getMerge_GRN

Merge GRN results across N seeds for a robust consensus network. Uses a glob pattern to match all seed output files.

scRBP getMerge_GRN \
  --pattern "grn_seed*.tsv" \
  --output merged_grn.tsv \
  --present_rate 0.0 \
  --corr-threshold 0.0

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --pattern | str | required | Glob pattern matching all seed GRN .tsv files (e.g. "grn_seed*.tsv") |
| --output | path | required | Output merged consensus GRN .tsv |
| --corr-threshold | float | 0.0 | Drop edges with abs(mean Correlation) ≤ threshold |
| --n_present | int | 0 | Minimum number of seed runs in which an edge must appear |
| --present_rate | float | 0.0 | Minimum presence rate (n_present / N_runs) to keep an edge |

Output columns: RBP, Target, mean_Importance, mean_Correlation, n_present, present_rate, Mode
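
To make the consensus columns concrete, here is a minimal sketch of how per-seed edge lists could be averaged. It assumes means are taken over the runs in which an edge appears and that the filters behave as described in the parameter table; the function `merge_grn_runs` is illustrative rather than scRBP's implementation, and the Mode column is omitted because its definition is not given here.

```python
from collections import defaultdict


def merge_grn_runs(runs, present_rate=0.0, corr_threshold=0.0):
    """Merge per-seed edge lists of (rbp, target, importance, correlation)."""
    importances = defaultdict(list)
    correlations = defaultdict(list)
    for edges in runs:
        for rbp, target, importance, correlation in edges:
            importances[(rbp, target)].append(importance)
            correlations[(rbp, target)].append(correlation)

    merged = []
    for edge in sorted(importances):
        n_present = len(importances[edge])
        rate = n_present / len(runs)
        mean_corr = sum(correlations[edge]) / n_present
        # Assumed filters: drop rare edges and edges with |mean corr| <= threshold.
        if rate < present_rate or abs(mean_corr) <= corr_threshold:
            continue
        merged.append({
            "RBP": edge[0],
            "Target": edge[1],
            "mean_Importance": sum(importances[edge]) / n_present,
            "mean_Correlation": mean_corr,
            "n_present": n_present,
            "present_rate": rate,
        })
    return merged


run1 = [("ELAVL1", "MYC", 1.0, 0.5), ("PTBP1", "PKM", 2.0, -0.2)]
run2 = [("ELAVL1", "MYC", 3.0, 0.3)]
consensus = merge_grn_runs([run1, run2], present_rate=1.0)
print(consensus)
```

With `present_rate=1.0` only the edge seen in both toy runs survives; with the default of 0.0 both edges are kept.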


Step 4: getModule

Extract regulon candidate modules from the merged GRN using multiple selection strategies (Top-N and percentile-based).

scRBP getModule \
  --input merged_grn.tsv \
  --output_merged modules.tsv \
  --importance_threshold 0.005 \
  --top_n_list "5,10,50" \
  --target_top_n "50" \
  --percentile "0.75,0.9"

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --input | path | required | Merged GRN .tsv from getMerge_GRN |
| --output_merged | path | required | Output merged modules .tsv |
| --importance_threshold | float | 0.005 | Minimum importance score to retain an edge |
| --top_n_list | str | "5,10,50" | Comma-separated Top-N values for target selection |
| --target_top_n | str | "50" | Target Top-N for merged module output |
| --percentile | str | "0.75,0.9" | Comma-separated percentile thresholds for importance |
| --verbose | flag | False | Enable verbose logging |
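
The two families of selection strategies can be illustrated on a toy edge list. This is a sketch under assumptions: Top-N is taken to mean the N highest-importance targets per RBP, and the percentile strategy is taken to be a cutoff over all edge importances. Both function names are hypothetical, and the exact rules used by getModule may differ.

```python
def top_n_targets(edges, n):
    """Keep the n highest-importance targets for each RBP."""
    by_rbp = {}
    for rbp, target, importance in edges:
        by_rbp.setdefault(rbp, []).append((importance, target))
    return {rbp: [target for _, target in sorted(pairs, reverse=True)[:n]]
            for rbp, pairs in by_rbp.items()}


def percentile_targets(edges, q):
    """Keep edges whose importance reaches the q-th quantile of all importances."""
    values = sorted(importance for _, _, importance in edges)
    cutoff = values[min(len(values) - 1, int(q * len(values)))]
    modules = {}
    for rbp, target, importance in edges:
        if importance >= cutoff:
            modules.setdefault(rbp, []).append(target)
    return modules


edges = [
    ("ELAVL1", "A", 0.9), ("ELAVL1", "B", 0.5), ("ELAVL1", "C", 0.1),
    ("PTBP1", "A", 0.7), ("PTBP1", "D", 0.2),
]
top2 = top_n_targets(edges, 2)
p75 = percentile_targets(edges, 0.75)
print(top2, p75)
```

Running several strategies side by side yields multiple candidate modules per RBP, which the pruning step can then evaluate individually.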

Step 5: getPrune

Filter regulon candidates using motif-binding enrichment via ctxcore (NES scoring). Requires pre-built motif annotation and genome ranking databases.

scRBP getPrune \
  --rbp_targets modules.tsv \
  --motif_rbp_links motif_rbp_links.feather \
  --motif_target_ranks hg38_500bp_rankings.feather \
  --save_dir pruned_results/ \
  --rank_threshold 1500 \
  --nes_threshold 3.0 \
  --min_genes 20

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --rbp_targets | path | required | Module .tsv from getModule |
| --motif_rbp_links | path | required | Motif-RBP annotation .feather (maps motifs to RBPs) |
| --motif_target_ranks | path | required | Rankings .feather (e.g. homo_sapiens_616RBPs_20746motifs_gene_rank_3UTR.feather) |
| --save_dir | path | required | Output directory for pruned scores (Parquet) |
| --rank_threshold | int | 1500 | Top-N rank cutoff for motif target enrichment |
| --auc_threshold | float | 0.05 | AUC threshold for enrichment significance |
| --nes_threshold | float | 3.0 | Normalized Enrichment Score (NES) threshold |
| --min_genes | int | 20 | Minimum number of target genes to retain a regulon |
| --n_jobs | int | all CPUs | Number of parallel processes |
| --chunksize | int | 4 | Chunk size for multiprocessing imap |
| --only_rbp | str | None | Restrict pruning to a specific RBP (for debugging) |
| --only_strategy | str | None | Restrict to a specific selection strategy |

Download motif databases from scRBP resources.
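
For intuition about the NES threshold, the sketch below standardizes per-motif AUC values against the distribution over all motifs, which is the usual cisTarget-style definition of a Normalized Enrichment Score. The AUC values are invented toy numbers and the function name is hypothetical; how getPrune computes the AUCs themselves is not shown.

```python
from statistics import mean, pstdev


def normalized_enrichment_scores(auc_by_motif):
    """NES = (AUC - mean AUC) / std AUC, computed across all motifs."""
    mu = mean(auc_by_motif.values())
    sigma = pstdev(auc_by_motif.values())
    if sigma == 0:
        return {motif: 0.0 for motif in auc_by_motif}
    return {motif: (auc - mu) / sigma for motif, auc in auc_by_motif.items()}


# Toy data: 19 background motifs with identical AUC and one enriched motif.
aucs = {f"background_{i}": 0.02 for i in range(19)}
aucs["enriched_motif"] = 0.10

nes = normalized_enrichment_scores(aucs)
passing = {motif for motif, score in nes.items() if score >= 3.0}
print(passing)
```

Because the score is relative to the full motif collection, the same AUC can pass or fail the threshold depending on how the other motifs score for that module.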


Step 6: getRegulon

Convert pruned ctxcore scores to standard GMT files in both gene-symbol and Entrez-ID formats.

scRBP getRegulon \
  --input pruned_results/ctx_scores.csv \
  --out-symbol regulons_symbol.gmt \
  --out-entrez regulons_entrez.gmt \
  --min_genes 1 \
  --taxid 9606

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --input | path | required | Pruned ctxcore .csv file from getPrune |
| --out-symbol | path | required | Output GMT file (gene symbols) |
| --out-entrez | path | required | Output GMT file (Entrez IDs) |
| --rbp_col | str | RBP | Column name for RBP in the input CSV |
| --genes_col | str | auto | Column name for target genes (auto-detected if omitted) |
| --min_genes | int | 1 | Minimum number of targets to retain a regulon |
| --taxid | int | 9606 | NCBI Taxonomy ID (9606 = human, 10090 = mouse) |
| --map-hgnc | path | None | HGNC gene mapping table for symbol→Entrez conversion |
| --map-ncbi | path | None | NCBI gene info file for symbol→Entrez conversion |
| --drop-unmapped-genes | flag | False | Drop genes that cannot be mapped to Entrez IDs |
| --drop-empty-sets | flag | False | Drop regulons with zero genes after mapping |

GMT format: RBP_name <TAB> description <TAB> gene1 <TAB> gene2 ...
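
The GMT layout above is simple enough to read and write by hand, which is handy for inspecting regulons outside the pipeline. A minimal sketch follows; the helper names and the placeholder description "NA" are assumptions, not scRBP conventions.

```python
def format_gmt_line(name, genes, description="NA"):
    """Build one GMT line: name, description, then tab-separated genes."""
    return "\t".join([name, description, *genes])


def parse_gmt(text):
    """Parse GMT text into {regulon_name: [genes]}, skipping malformed lines."""
    regulons = {}
    for line in text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # a valid line needs at least a name and a description
        name, _description, *genes = fields
        regulons[name] = [gene for gene in genes if gene]
    return regulons


gmt_text = "\n".join([
    format_gmt_line("ELAVL1", ["MYC", "FOS", "JUN"]),
    format_gmt_line("PTBP1", ["PKM"]),
])
regulons = parse_gmt(gmt_text)
print(regulons)
```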


Step 7: mergeRegulons

Merge region-specific GMT files (3’UTR, 5’UTR, CDS, Introns) from multiple run directories into a single unified regulon set.

scRBP mergeRegulons \
  --base_dir results/ \
  --input regulons_symbol.gmt \
  --output merged_regulons.gmt \
  --region_order 3UTR 5UTR CDS Introns \
  --dedup_lines

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --base_dir | path | required | Base directory containing region subdirectories |
| --input | str | required | Input GMT filename to find in each region directory |
| --output | str | required | Output merged GMT filename |
| --region_order | list | 3UTR 5UTR CDS Introns | Order in which regions are merged |
| --region_glob | str | Results_final_*_RBP_top1500_* | Glob pattern for region-specific subdirectories |
| --tissue_glob | str | z_GRNBoost2_*_30times | Glob pattern for parent tissue dirs (with --recursive) |
| --recursive | flag | False | Recursively process multiple parent directories |
| --dedup_lines | flag | False | Deduplicate identical GMT lines |
| --overwrite | flag | False | Overwrite existing output files |
| --summary_out | path | None | Optional output .tsv for region-level summary table |

Step 8: ras

Compute Regulon Activity Scores (RAS) using the AUCell algorithm. Supports single-cell (--mode sc) and cell-type aggregated (--mode ct) modes.

# Single-cell mode
scRBP ras \
  --mode sc \
  --matrix data.h5ad \
  --regulons merged_regulons.gmt \
  --out ras_sc.csv

# Cell-type mode (requires cell-type annotation)
scRBP ras \
  --mode ct \
  --matrix data.h5ad \
  --regulons merged_regulons.gmt \
  --celltypes-csv celltypes.csv \
  --out ras_ct.csv

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --mode | str | ct | Scoring mode: sc (per cell) or ct (per cell type) |
| --matrix | path | required | Expression matrix (.h5ad / .feather / .loom / .csv) |
| --regulons | path | required | Regulon GMT file from mergeRegulons |
| --out | path | required | Output RAS file |
| --out_format | str | csv | Output format: csv, loom, or both |
| --celltypes-csv | path | None | CSV with cell_id, cell_type columns (required for --mode ct) |
| --n_workers | int | 4 | Number of workers for AUCell |
| --min_genes | int | 1 | Drop regulons with fewer than min_genes targets |
| --to_upper | flag | False | Uppercase gene symbols when matching to regulons |
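
To give a feel for what a rank-based activity score measures, here is a simplified recovery-curve AUC for a single cell, written in the spirit of AUCell. It is a sketch only: the normalization, the 5% default cutoff, and the deterministic tie handling are simplifying assumptions, and the function name is hypothetical. Use the `ras` command for real analyses.

```python
def aucell_like_score(expression, regulon, top_fraction=0.05):
    """Simplified recovery-curve AUC for one cell.

    expression: dict mapping gene -> expression value in this cell.
    regulon: iterable of target genes.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    cutoff = max(1, int(round(top_fraction * len(ranked))))
    rank_of = {gene: rank for rank, gene in enumerate(ranked)}

    present = [gene for gene in regulon if gene in rank_of]
    if not present:
        return 0.0
    # Each target inside the top `cutoff` ranks adds the area to its right.
    area = sum(cutoff - rank_of[g] for g in present if rank_of[g] < cutoff)
    return area / (cutoff * len(present))


# Toy cell: g0 is the most expressed gene, g19 the least.
cell = {f"g{i}": 20 - i for i in range(20)}
score_top = aucell_like_score(cell, ["g0"], top_fraction=0.25)
score_mixed = aucell_like_score(cell, ["g0", "g4"], top_fraction=0.25)
score_low = aucell_like_score(cell, ["g19"], top_fraction=0.25)
print(score_top, score_mixed, score_low)
```

A regulon scores high only when its targets sit near the top of the cell's expression ranking, which makes the score insensitive to the absolute scale of the expression values.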

Step 9: rgs

MAGMA gene-set analysis for GWAS enrichment. Computes Regulon-level Genetic association Scores (RGS) with matched null regulons controlling for 4 confounders: number of SNPs, number of parameters, mean gene expression, and percent detected.

scRBP rgs \
  --mode ct \
  --magma /path/to/magma \
  --genes-raw gwas.genes.raw \
  --sets merged_regulons_entrez.gmt \
  --id-type entrez \
  --out rgs_output \
  --n-null 1000 \
  --seed 2025

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --mode | str | required | sc (single-cell) or ct (cell-type) |
| --magma | path | required | Path to MAGMA binary |
| --genes-raw | path | required | MAGMA <prefix>.genes.raw from prior gene analysis |
| --sets | path | required | Regulon GMT file (symbol or Entrez format) |
| --id-type | str | entrez | Gene ID format in GMT: entrez or symbol |
| --out | str | required | Output file prefix |
| --n-null | int | 1000 | Number of matched null regulons to generate |
| --seed | int | 2025 | Random seed for null sampling |
| --q-bins | int | 10 | Number of quantile bins for null matching |
| --threads | int | auto | CPU threads for MAGMA |
| --gene-loc | path | None | MAGMA NCBI*.gene.loc file (for symbol→Entrez mapping) |
| --min_genes | int | 0 | Minimum regulon size for inclusion |
| --cleanup-out | bool | True | Remove intermediate MAGMA output files |
| --expr-stats | path | None | Precomputed expression stats TSV (symbol, mean_expr, pct_detected) |

Requires the MAGMA binary, which can be downloaded from CNCR. See GWASTutorial for background on running MAGMA.


Step 10: trs

Integrate RAS and RGS into a unified Trait Relevance Score (TRS):

\[\text{TRS} = \text{norm(RAS)} + \text{norm(RGS)} - \lambda \times |\text{norm(RAS)} - \text{norm(RGS)}|\]
scRBP trs \
  --mode ct \
  --ras ras_ct.csv \
  --rgs-csv rgs_output.csv \
  --out-prefix trs_results \
  --lambda-penalty 1.0 \
  --rgs-score mlog10p

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --mode | str | required | sc (single-cell) or ct (cell-type) |
| --ras | path | required | RAS .csv from ras step |
| --rgs-csv | path | required | RGS .csv from rgs step |
| --out-prefix | str | required | Output file prefix |
| --rgs-score | str | mlog10p | RGS score column to use: mlog10p or z |
| --lambda-penalty | float | 1.0 | Penalty for RAS–RGS divergence (λ in TRS formula) |
| --q-hi-ras | float | 0.99 | Upper quantile cap for RAS normalization |
| --q-hi-rgs | float | 0.99 | Upper quantile cap for RGS normalization |
| --do-fdr | int | 1 | Apply BH-FDR correction (1=yes, 0=no; CT mode only) |
| --celltypes-csv | path | None | CSV with cell_id, cell_type columns (CT mode) |
| --min_cells_pert_ct | int | 25 | Minimum cells per cell type for CT mode |
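
The TRS formula can be checked by hand on a few values. In the sketch below, `trs` implements the formula exactly as written, while `capped_minmax` is only an assumed stand-in for norm(): it clips at an upper quantile (mirroring --q-hi-ras and --q-hi-rgs) and rescales to [0, 1]. The normalization actually used by scRBP is not specified here and may differ.

```python
def capped_minmax(values, q_hi=0.99):
    """Assumed normalization: clip at the q_hi quantile, then rescale to [0, 1]."""
    ordered = sorted(values)
    cap = ordered[min(len(ordered) - 1, int(q_hi * len(ordered)))]
    clipped = [min(value, cap) for value in values]
    low, high = min(clipped), max(clipped)
    if high == low:
        return [0.0] * len(values)
    return [(value - low) / (high - low) for value in clipped]


def trs(norm_ras, norm_rgs, lambda_penalty=1.0):
    """TRS = norm(RAS) + norm(RGS) - lambda * |norm(RAS) - norm(RGS)|."""
    return [a + g - lambda_penalty * abs(a - g)
            for a, g in zip(norm_ras, norm_rgs)]


ras = [0.0, 1.0, 2.0]   # toy regulon activity scores
rgs = [2.0, 1.0, 0.0]   # toy genetic association scores
scores = trs(capped_minmax(ras), capped_minmax(rgs), lambda_penalty=1.0)
print(scores)
```

With λ = 1 the expression reduces to 2 × min(norm(RAS), norm(RGS)), so a regulon ranks highly only when both its activity and its genetic association are high. Smaller values of λ tolerate more divergence between the two scores.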