Pipeline overview
Preflight → DADA2 → 3-layer NCBI ID Mapping → SpacePHARER Execution
→ Protein Clusters (ProCs) Estimation → Enhanced Networks Estimation
Stages
- Preflight
Folder layout; optional fresh-start cleanup that preserves bundled references and the input virus FASTA.
- DADA2
Denoise 16S reads to ASVs and assign SILVA taxonomy via the bundled DADA2_Pipe.Rscript.
- 3-layer NCBI ID Mapping
Download names.dmp and assign real NCBI taxids to the SILVA taxonomy table, then run a taxonomy-aware mapping of ASVs to proGenomes3 representative genomes via mmseqs easy-search with a three-layer fallback (ASV → genus → family) and derivation of the target_taxids column.
- SpacePHARER Execution
Filter the bundled spacer collection to the cohort, build SpacePHARER databases, and run predictmatch with FDR control to obtain the virus–host adjacency \(W\).
- Protein Clusters (ProCs) Estimation
Protein clustering of bacterial and viral proteins, building the ProCs presence/count matrix.
- Enhanced Networks Estimation
Common-abundance preprocessing, CLR transformation, Schäfer–Strimmer shrinkage correlations, raw and taxonomy-smoothed CRISPR networks
\[\tilde{W} = (1 - \alpha) W + \alpha\, K_{\mathrm{vir}}\, W\, K_{\mathrm{bac}},\]
and X* message-passing propagation
\[Z^*_v = Z_v + \eta (Z_b P_h - Z_v), \quad Z^*_b = Z_b + \eta (Z_v P_v - Z_b).\]
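
The two formulas in the last stage, together with the CLR step, can be sketched in a few lines of dependency-free Python. This is an illustration only, not the pipeline's own code. It assumes that \(W\) is a viruses × hosts matrix, that \(Z_v\) and \(Z_b\) are samples × viruses and samples × hosts matrices, that \(P_v\) is the row-normalized \(\tilde{W}\) and \(P_h\) the row-normalized \(\tilde{W}^\top\), and that the CLR uses a pseudocount of 0.5. None of these choices is stated above, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def row_normalize(a):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    out = []
    for row in a:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else [0.0] * len(row))
    return out


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of each sample (row).

    The pseudocount is an assumption made here to avoid log(0).
    """
    out = []
    for row in counts:
        logs = [math.log(x + pseudocount) for x in row]
        mean = sum(logs) / len(logs)
        out.append([v - mean for v in logs])
    return out


def smooth_network(W, K_vir, K_bac, alpha):
    """W~ = (1 - alpha) * W + alpha * K_vir @ W @ K_bac."""
    kwk = matmul(matmul(K_vir, W), K_bac)
    return [
        [(1 - alpha) * w + alpha * k for w, k in zip(w_row, k_row)]
        for w_row, k_row in zip(W, kwk)
    ]


def propagate(Z_v, Z_b, W_smooth, eta):
    """One message-passing step between virus and host profiles.

    Z*_v = Z_v + eta * (Z_b @ P_h - Z_v)
    Z*_b = Z_b + eta * (Z_v @ P_v - Z_b)
    P_v and P_h are ASSUMED to be row-normalized W~ and W~ transposed.
    """
    P_v = row_normalize(W_smooth)             # viruses x hosts
    P_h = row_normalize(transpose(W_smooth))  # hosts x viruses
    from_hosts = matmul(Z_b, P_h)             # samples x viruses
    from_viruses = matmul(Z_v, P_v)           # samples x hosts
    Z_v_new = [
        [zv + eta * (m - zv) for zv, m in zip(z_row, m_row)]
        for z_row, m_row in zip(Z_v, from_hosts)
    ]
    Z_b_new = [
        [zb + eta * (m - zb) for zb, m in zip(z_row, m_row)]
        for z_row, m_row in zip(Z_b, from_viruses)
    ]
    return Z_v_new, Z_b_new


# Toy example: 2 viruses, 3 hosts, 1 sample; hosts 0 and 1 share a genus.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
K_vir = [[1.0, 0.0], [0.0, 1.0]]
K_bac = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

W_smooth = smooth_network(W, K_vir, K_bac, alpha=0.5)
Z_v = clr([[10, 40]])
Z_b = clr([[30, 5, 15]])
Z_v_star, Z_b_star = propagate(Z_v, Z_b, W_smooth, eta=0.3)
print(W_smooth)
print(Z_v_star, Z_b_star)
```

In this toy example, smoothing moves part of each virus's edge weight onto the host in the same genus, while host 2, which has no CRISPR evidence, stays disconnected. With eta = 0 the propagation returns the inputs unchanged. With eta = 1 each profile is replaced entirely by the message from the other side.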