Pipeline overview

Preflight → DADA2 → 3-layer NCBI ID Mapping → SpacePHARER Execution
         → Protein Clusters (ProCs) Estimation → Enhanced Networks Estimation

Stages

Preflight

Folder layout; optional fresh-start cleanup that preserves bundled references and the input virus FASTA.

DADA2

Denoise 16S reads to ASVs and assign SILVA taxonomy via the bundled DADA2_Pipe.R script.

3-layer NCBI ID Mapping

Download names.dmp and assign real NCBI taxids to the SILVA taxonomy table, then run a taxonomy-aware mapping of ASVs to proGenomes3 representative genomes via mmseqs easy-search with three-layer fallback (ASV → genus → family) and derivation of the target_taxids column.

SpacePHARER Execution

Filter the bundled spacer collection to the cohort, build SpacePHARER databases, and run predictmatch with FDR control to obtain the virus–host adjacency \(W\).

Protein Clusters (ProCs) Estimation

Protein clustering of bacterial and viral proteins, building the ProCs presence/count matrix.

Enhanced Networks Estimation

Common-abundance preprocessing, CLR transformation, Schäfer–Strimmer shrinkage correlations, raw and taxonomy-smoothed CRISPR networks

\[\tilde{W} = (1 - \alpha) W + \alpha\, K_{\mathrm{vir}}\, W\, K_{\mathrm{bac}},\]

and X* message-passing propagation

\[Z^*_v = Z_v + \eta (Z_b P_h - Z_v), \quad Z^*_b = Z_b + \eta (Z_v P_v - Z_b).\]