capellini.utils.network_utils

Network-level utilities: residual message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.

This is the slim core of the CAPELLINI network stage. The math follows the paper: smoothing W̃ = (1−α) W + α K_vir W K_bac (written for W oriented viruses × bacteria) and residual propagation Z* = Z + η (cross − Z). Helper plumbing (sample alignment, orientation auto-detection, CRISPR binarisation, taxonomy clean-up) lives below the math.

Functions

`aggregate_otu_columns_by_rank_skip_nan`(otu, ...)	Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
`align_abundance_from_metadata`(...[, ...])	Align V/B with the standardized metadata (filter, reorder, rename index).
`build_binary_crispr_matrix`(raw_crispr_path, ...)	Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix.
`build_smoothed_crispr_for_study`(...[, ...])	Orchestrate the smoothed-CRISPR build for a single study.
`build_taxonomy_kernel`(ids, tax_df, ranks[, ...])	Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]).
`build_xstar_from_smoothed_crispr`(V_df, B_df, ...)	Residual X* pipeline: align samples, CLR, propagate via W̃.
`crispr_matrix_aggregate_viruses`(df_crispr, ...)	Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix.
`get_hierarchies`(df_b, df_v)	Prepare bacteria/virus taxonomy frames for taxonomy-kernel building.
`orient_W_viruses_by_bacteria`(W_df, V_df, B_df)	Detect and correct the orientation of W to be viruses × bacteria.
`prepare_bacteria_genus_abundance`(otu, tax[, ...])	Aggregate bacteria OTUs to `rank` and sanitize the resulting column index.
`prevalence_filter_df`(df[, prevalence, verbose])	Keep features present in at least `prevalence` × n_samples samples.
`remove_disease_columns_from_virus_abundance`(V)	Drop phenotype/metadata columns that sometimes leak into viral abundance tables.
`smooth_crispr_bac_vir`(crispr_df, K_bac, K_vir)	W̃ = (1 − α) W + α (K_bac · W · K_vir), restricted to common rows/cols.

capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) → DataFrame

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
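The aggregation can be illustrated with a short pandas sketch. It is not the library's implementation: `aggregate_sketch` is a hypothetical name, and the sketch assumes that samples are rows, OTUs are columns, `tax` is indexed by OTU ID, and that aggregation means summing.

```python
import numpy as np
import pandas as pd

def aggregate_sketch(otu, tax, rank):
    """Hypothetical helper: sum OTU columns that share a label at `rank`."""
    # Look up the rank label of every OTU column (NaN if unknown).
    labels = tax[rank].reindex(otu.columns)
    keep = labels.notna().to_numpy()
    # Drop unlabeled OTUs, then group the remaining columns by label and sum.
    kept = otu.loc[:, keep]
    return kept.T.groupby(labels[keep].to_numpy()).sum().T

otu = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=["s1", "s2"],
    columns=["o1", "o2", "o3", "o4"],
)
tax = pd.DataFrame({"genus": ["G1", "G1", np.nan, "G2"]}, index=["o1", "o2", "o3", "o4"])
agg = aggregate_sketch(otu, tax, "genus")
```

Here `o3` has no genus label and is dropped, while `o1` and `o2` are merged into a single `G1` column.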

capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') → tuple[DataFrame, DataFrame, DataFrame]

Align V/B with the standardized metadata (filter, reorder, rename index).
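A rough sketch of the filter, reorder, and rename steps, under stated assumptions: samples are rows of both abundance tables, `keep_col` is boolean-like, and metadata rows whose IDs are missing from either table are dropped. `align_sketch` is a hypothetical stand-in, not the library function.

```python
import pandas as pd

def align_sketch(V, B, meta, *, keep_col="keep_for_analysis",
                 virus_id_col="virus_sample_id",
                 bacteria_id_col="bacteria_sample_id",
                 final_index_col="virus_sample_id"):
    """Hypothetical helper mirroring the documented keyword arguments."""
    # Filter: keep flagged metadata rows whose IDs exist in both tables.
    kept = meta[meta[keep_col].astype(bool)]
    kept = kept[kept[virus_id_col].isin(V.index) & kept[bacteria_id_col].isin(B.index)]
    # Reorder: follow the metadata row order in both tables.
    V_al = V.loc[kept[virus_id_col].tolist()].copy()
    B_al = B.loc[kept[bacteria_id_col].tolist()].copy()
    # Rename: give both tables the same final sample index.
    new_index = kept[final_index_col].tolist()
    V_al.index = new_index
    B_al.index = new_index
    return V_al, B_al, kept.reset_index(drop=True)

meta = pd.DataFrame({
    "virus_sample_id": ["vs3", "vs2", "vs1"],
    "bacteria_sample_id": ["bs3", "bs2", "bs1"],
    "keep_for_analysis": [True, False, True],
})
V = pd.DataFrame({"v1": [1, 2, 3]}, index=["vs1", "vs2", "vs3"])
B = pd.DataFrame({"b1": [10, 20, 30]}, index=["bs1", "bs2", "bs3"])
V_al, B_al, meta_al = align_sketch(V, B, meta)
```

After alignment both tables share the index `["vs3", "vs1"]`, so row-wise operations pair the correct virus and bacteria samples.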

capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) → DataFrame

Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix.

The output is reindexed onto the requested bacteria_features / virus_features (missing rows/cols are zero) and the values are clipped to {0, 1}.
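The reindex-and-binarise step can be sketched as follows. This is an illustration only: `binarize_and_reindex` is a hypothetical helper, the raw table is assumed to be already loaded as a bacteria × viruses DataFrame of non-negative counts, and "clipped to {0, 1}" is interpreted here as "any positive value becomes 1".

```python
import pandas as pd

def binarize_and_reindex(raw, bacteria_features, virus_features):
    """Hypothetical helper: reindex onto the requested features, then binarise."""
    # Requested rows/columns that are absent from `raw` are filled with zeros.
    out = raw.reindex(
        index=list(bacteria_features),
        columns=list(virus_features),
        fill_value=0,
    )
    # Any positive count becomes 1; zeros stay 0.
    return (out > 0).astype(int)

raw = pd.DataFrame([[3, 0], [0, 1]], index=["b1", "b2"], columns=["v1", "v2"])
W = binarize_and_reindex(raw, ["b1", "b3"], ["v1", "v2", "v9"])
```

In this example `b2` is discarded because it was not requested, while `b3` and `v9` appear as all-zero entries.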

capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) → dict[str, DataFrame]

Orchestrate the smoothed-CRISPR build for a single study.

Returns the artefacts (CRISPR binary, K_bac, K_vir, smoothed W, aligned taxonomy frames). The caller is responsible for persisting them.

capellini.utils.network_utils.build_taxonomy_kernel(ids, tax_df: DataFrame, ranks, weights=None, fill_value: str = '', normalize_rows: bool = True) → tuple[DataFrame, DataFrame]

Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]).

Parameters:

ids – Feature IDs to include.
tax_df – Taxonomy table (rows = IDs, columns include ranks).
ranks – Ordered rank columns (deepest last).
weights – Optional per-rank weights; default 1..len(ranks). Normalized to sum to 1.
fill_value – String treated as missing — pairs sharing this value at a given rank are NOT counted as a match.
normalize_rows – Row-normalize the resulting kernel.

Returns:

(K, aligned_tax) — kernel as DataFrame indexed by the overlap of ids and tax_df.index, plus the cleaned taxonomy slice.
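The kernel formula can be worked through on a toy taxonomy. The sketch below is a re-derivation from the documented formula and parameter descriptions, not the library code; `taxonomy_kernel_sketch` is a hypothetical name, and it assumes that a feature with a missing label at a rank also gets no self-match at that rank.

```python
import numpy as np
import pandas as pd

def taxonomy_kernel_sketch(tax, ranks, weights=None, fill_value="", normalize_rows=True):
    """Hypothetical sketch of K[i,j] = sum_k w_k * I(rank_k[i] == rank_k[j])."""
    n_ranks = len(ranks)
    # Default weights 1..len(ranks); always normalized to sum to 1.
    if weights is None:
        w = np.arange(1, n_ranks + 1, dtype=float)
    else:
        w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    K = np.zeros((len(tax), len(tax)))
    for w_k, rank in zip(w, ranks):
        labels = tax[rank].fillna(fill_value).astype(str).to_numpy()
        same = labels[:, None] == labels[None, :]
        # Pairs that share the missing-value marker do not count as a match.
        known = labels != fill_value
        same &= known[:, None] & known[None, :]
        K += w_k * same
    if normalize_rows:
        row_sums = K.sum(axis=1, keepdims=True)
        K = np.divide(K, row_sums, out=np.zeros_like(K), where=row_sums > 0)
    return pd.DataFrame(K, index=tax.index, columns=tax.index)

tax = pd.DataFrame(
    {"family": ["F1", "F1", "F2"], "genus": ["G1", "G2", ""]},
    index=["a", "b", "c"],
)
ranks = ["family", "genus"]  # deepest rank last, so genus gets weight 2/3
K_raw = taxonomy_kernel_sketch(tax, ranks, normalize_rows=False)
K = taxonomy_kernel_sketch(tax, ranks)
```

Features `a` and `b` share a family but not a genus, so their unnormalized similarity is the family weight 1/3. Note that row normalization makes the kernel asymmetric in general.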

capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, *, pseudocount: float = 1e-06, lam: float = 0.5, n_steps: int = 1, eps: float = 1e-12) → dict[str, DataFrame]

Residual X* pipeline: align samples, CLR, propagate via W̃.

Returns a dict with V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
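To make the CLR and residual steps concrete, here is a minimal sketch. The CLR transform is standard; everything else is an assumption made for illustration: the names `clr`, `residual_step`, and `eta` are hypothetical, and the cross-domain term is taken to be the bacterial CLR profile projected through a row-normalized viruses × bacteria matrix. The actual function may define the cross term and its step size differently.

```python
import numpy as np
import pandas as pd

def clr(df, pseudocount=1e-6):
    """Centered log-ratio transform, applied per sample (row)."""
    logged = np.log(df + pseudocount)
    return logged.sub(logged.mean(axis=1), axis=0)

def residual_step(Z, cross, eta):
    """One residual update Z* = Z + eta * (cross - Z)."""
    return Z + eta * (cross - Z)

V = pd.DataFrame([[10.0, 0.0, 5.0], [1.0, 1.0, 1.0]],
                 index=["s1", "s2"], columns=["v1", "v2", "v3"])
B = pd.DataFrame([[4.0, 1.0], [2.0, 2.0]],
                 index=["s1", "s2"], columns=["b1", "b2"])
# Toy smoothed network, viruses x bacteria, every row non-zero.
W = pd.DataFrame([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
                 index=["v1", "v2", "v3"], columns=["b1", "b2"])

V_clr, B_clr = clr(V), clr(B)
# Assumed cross term: project bacterial CLR values onto viruses.
W_norm = W.div(W.sum(axis=1), axis=0)
cross = B_clr @ W_norm.T  # samples x viruses
V_star = residual_step(V_clr, cross, eta=0.5)
```

With `eta = 0` the update leaves Z unchanged, and with `eta = 1` it replaces Z by the cross term, so intermediate values blend the two.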

capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0') → DataFrame

Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix.

Bacterial spacer IDs of the form ...>TAXID... are parsed to the leading taxid; viral contigs are mapped to vir_rank via vir_tax.

capellini.utils.network_utils.get_hierarchies(df_b: DataFrame, df_v: DataFrame) → tuple[DataFrame, DataFrame]

Prepare bacteria/virus taxonomy frames for taxonomy-kernel building.

Bacteria are reindexed by progenomes_taxid_genus (drop NaN, dedup). Viruses are reindexed by lev0 (dedup).

capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) → DataFrame

Detect and correct the orientation of W to be viruses × bacteria.
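One plausible way to detect the orientation is to compare label overlap, sketched below. This heuristic is an assumption, not the documented algorithm: `orient_sketch` is a hypothetical helper, and it assumes that V_df and B_df hold samples as rows and features as columns.

```python
import pandas as pd

def orient_sketch(W, V, B):
    """Hypothetical heuristic: return W as viruses x bacteria using label overlap."""
    # Score for "rows are viruses, columns are bacteria".
    as_is = len(W.index.intersection(V.columns)) + len(W.columns.intersection(B.columns))
    # Score for the transposed layout.
    flipped = len(W.index.intersection(B.columns)) + len(W.columns.intersection(V.columns))
    return W.T if flipped > as_is else W

V = pd.DataFrame(0.0, index=["s1", "s2"], columns=["v1", "v2"])
B = pd.DataFrame(0.0, index=["s1", "s2"], columns=["b1", "b2", "b3"])
# W arrives as bacteria x viruses and should be transposed.
W_bv = pd.DataFrame(1.0, index=["b1", "b2", "b3"], columns=["v1", "v2"])
W_vb = orient_sketch(W_bv, V, B)
```

Passing an already oriented matrix through the same function returns it unchanged.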

capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') → DataFrame

Aggregate bacteria OTUs to the given rank and sanitize the resulting column index.

capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) → DataFrame

Keep features present in at least prevalence × n_samples samples.
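The prevalence rule can be sketched in a few lines. The sketch assumes that samples are rows, features are columns, and "present" means a value greater than zero; `prevalence_filter_sketch` is a hypothetical name used only for this example.

```python
import pandas as pd

def prevalence_filter_sketch(df, prevalence=0.1):
    """Hypothetical helper: keep columns present in >= prevalence * n_samples rows."""
    min_samples = prevalence * df.shape[0]
    # Count, per feature, the samples in which it is present (> 0).
    present = (df > 0).sum(axis=0)
    return df.loc[:, present >= min_samples]

df = pd.DataFrame({
    "f1": [1, 0, 2, 3],  # present in 3 of 4 samples
    "f2": [0, 0, 0, 5],  # present in 1 of 4 samples
    "f3": [0, 0, 0, 0],  # never present
})
strict = prevalence_filter_sketch(df, prevalence=0.5)
default = prevalence_filter_sketch(df)
```

With four samples, a prevalence of 0.5 requires presence in at least two samples, so only `f1` survives; the default of 0.1 keeps every feature seen at least once.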

capellini.utils.network_utils.remove_disease_columns_from_virus_abundance(V: DataFrame) → DataFrame

Drop phenotype/metadata columns that sometimes leak into viral abundance tables.

capellini.utils.network_utils.smooth_crispr_bac_vir(crispr_df: DataFrame, K_bac: DataFrame, K_vir: DataFrame, alpha: float = 0.95) → DataFrame

W̃ = (1 − α) W + α (K_bac · W · K_vir), restricted to common rows/cols.
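The smoothing formula can be checked numerically on a toy network. The sketch below follows the formula as written, with W oriented bacteria × viruses; `smooth_sketch` is a hypothetical stand-in, and "restricted to common rows/cols" is assumed to mean intersecting W's labels with each kernel's index.

```python
import numpy as np
import pandas as pd

def smooth_sketch(W, K_bac, K_vir, alpha=0.95):
    """Hypothetical sketch of (1 - alpha) * W + alpha * (K_bac @ W @ K_vir)."""
    # Restrict to bacteria/viruses present in both W and the matching kernel.
    bac = W.index.intersection(K_bac.index)
    vir = W.columns.intersection(K_vir.index)
    W_c = W.loc[bac, vir].to_numpy(dtype=float)
    Kb = K_bac.loc[bac, bac].to_numpy(dtype=float)
    Kv = K_vir.loc[vir, vir].to_numpy(dtype=float)
    smoothed = (1 - alpha) * W_c + alpha * (Kb @ W_c @ Kv)
    return pd.DataFrame(smoothed, index=bac, columns=vir)

# b3 has no kernel entry, so it is dropped by the restriction.
W = pd.DataFrame([[1, 0], [0, 0], [1, 1]],
                 index=["b1", "b2", "b3"], columns=["v1", "v2"])
K_bac = pd.DataFrame([[0.5, 0.5], [0.5, 0.5]],
                     index=["b1", "b2"], columns=["b1", "b2"])
K_vir = pd.DataFrame(np.eye(2), index=["v1", "v2"], columns=["v1", "v2"])
W_s = smooth_sketch(W, K_bac, K_vir, alpha=0.5)
```

In this example `b2` has no observed CRISPR link, but it receives a weight of 0.25 toward `v1` because the bacterial kernel treats it as similar to `b1`.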