capellini.utils.network_utils

Network-level utilities: residual message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.

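This is the slim core of the CAPELLINI network stage. The math follows the paper: CRISPR smoothing W̃ = (1−α) W + α K_vir W K_bac (with W oriented viruses × bacteria; smooth_crispr_bac_vir is documented in the bacteria × viruses form, K_bac · W · K_vir) and residual propagation Z* = Z + η (cross − Z). Helper plumbing (sample alignment, orientation auto-detection, CRISPR binarisation, taxonomy clean-up) lives below the math.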

Functions

aggregate_otu_columns_by_rank_skip_nan(otu, ...)

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.

align_abundance_from_metadata(...[, ...])

Align V/B with the standardized metadata (filter, reorder, rename index).

build_binary_crispr_matrix(raw_crispr_path, ...)

Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix.

build_smoothed_crispr_for_study(...[, ...])

Orchestrate the smoothed-CRISPR build for a single study.

build_taxonomy_kernel(ids, tax_df, ranks[, ...])

Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]).

build_xstar_from_smoothed_crispr(V_df, B_df, ...)

Residual X* pipeline: align samples, CLR, propagate via W̃.

crispr_matrix_aggregate_viruses(df_crispr, ...)

Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix.

get_hierarchies(df_b, df_v)

Prepare bacteria/virus taxonomy frames for taxonomy-kernel building.

orient_W_viruses_by_bacteria(W_df, V_df, B_df)

Detect and correct the orientation of W to be viruses × bacteria.

prepare_bacteria_genus_abundance(otu, tax[, ...])

Aggregate bacteria OTUs to rank and sanitize the resulting column index.

prevalence_filter_df(df[, prevalence, verbose])

Keep features present in at least prevalence × n_samples samples.

remove_disease_columns_from_virus_abundance(V)

Drop phenotype/metadata columns that sometimes leak into viral abundance tables.

smooth_crispr_bac_vir(crispr_df, K_bac, K_vir)

W̃ = (1 − α) W + α (K_bac · W · K_vir), restricted to common rows/cols.

capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) → DataFrame

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
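The aggregation described above can be sketched roughly as follows. This is not the library's code: the function name is hypothetical, and it assumes that otu is samples × OTUs, that tax is indexed by OTU ID, and that columns sharing a label are summed.

```python
import numpy as np
import pandas as pd


def aggregate_columns_by_rank_sketch(otu: pd.DataFrame, tax: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Hypothetical stand-in: sum OTU columns that share a rank label.

    Assumptions (not confirmed by the source): `otu` is samples x OTUs,
    `tax` is indexed by OTU ID, and aggregation means summation.
    """
    # One label per OTU column; OTUs absent from `tax` get NaN.
    labels = tax[rank].reindex(otu.columns)
    keep = labels.notna()
    # Drop OTUs without a usable label instead of pooling them under "NaN".
    kept = otu.loc[:, keep.to_numpy()]
    # Group columns by label: transpose, group rows, sum, transpose back.
    return kept.T.groupby(labels[keep].to_numpy()).sum().T
```

Skipping unlabeled columns keeps unclassified OTUs from being merged into one artificial taxon.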

capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') → tuple[DataFrame, DataFrame, DataFrame]

Align V/B with the standardized metadata (filter, reorder, rename index).
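One way to read "filter, reorder, rename index" is sketched below. The function name is hypothetical, and the sketch assumes that samples are rows, that keep_col holds booleans, and that metadata rows whose sample IDs are missing from either table are dropped. The real function may handle these cases differently.

```python
import pandas as pd


def align_from_metadata_sketch(
    V: pd.DataFrame,
    B: pd.DataFrame,
    meta: pd.DataFrame,
    keep_col: str = "keep_for_analysis",
    virus_id_col: str = "virus_sample_id",
    bacteria_id_col: str = "bacteria_sample_id",
    final_index_col: str = "virus_sample_id",
):
    """Hypothetical sketch: align paired virus/bacteria tables to metadata order."""
    # Filter: keep only rows flagged for analysis (assumed boolean column).
    meta = meta[meta[keep_col].astype(bool)]
    # Assumption: silently drop pairs whose IDs are missing from V or B.
    present = meta[virus_id_col].isin(V.index) & meta[bacteria_id_col].isin(B.index)
    meta = meta[present].reset_index(drop=True)
    # Reorder: pull rows in metadata order so V and B are row-paired.
    V_al = V.loc[meta[virus_id_col].to_list()].copy()
    B_al = B.loc[meta[bacteria_id_col].to_list()].copy()
    # Rename index: both tables share the same final sample labels.
    V_al.index = meta[final_index_col].to_numpy()
    B_al.index = meta[final_index_col].to_numpy()
    return V_al, B_al, meta
```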

capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) → DataFrame

Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix.

The output is reindexed onto the requested bacteria_features / virus_features (missing rows/cols are zero) and the values are clipped to {0, 1}.
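The reindex-and-clip step can be illustrated with plain pandas. The sketch below skips CSV loading and transposition, uses a hypothetical function name, and assumes that any positive raw value counts as a link.

```python
import pandas as pd


def binarize_and_reindex_sketch(raw: pd.DataFrame, bacteria_features, virus_features) -> pd.DataFrame:
    """Hypothetical sketch: binary bacteria x viruses matrix on a fixed feature grid."""
    # Assumption: a positive value (e.g. a spacer hit count) means "linked".
    binary = (raw.fillna(0) > 0).astype(int)
    # Features missing from the raw table become all-zero rows/columns;
    # features not requested are dropped.
    return binary.reindex(index=list(bacteria_features), columns=list(virus_features), fill_value=0)
```

Reindexing onto the requested feature lists gives every study a matrix with a predictable shape, even when the raw network covers only part of the features.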

capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) → dict[str, DataFrame]

Orchestrate the smoothed-CRISPR build for a single study.

Returns the artefacts (CRISPR binary, K_bac, K_vir, smoothed W, aligned taxonomy frames). The caller is responsible for persisting them.

capellini.utils.network_utils.build_taxonomy_kernel(ids, tax_df: DataFrame, ranks, weights=None, fill_value: str = '', normalize_rows: bool = True) → tuple[DataFrame, DataFrame]

Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]).

Parameters:
  • ids – Feature IDs to include.

  • tax_df – Taxonomy table (rows = IDs, columns include ranks).

  • ranks – Ordered rank columns (deepest last).

  • weights – Optional per-rank weights; defaults to 1..len(ranks), so deeper ranks weigh more. Weights are normalized to sum to 1.

  • fill_value – String treated as missing — pairs sharing this value at a given rank are NOT counted as a match.

  • normalize_rows – Row-normalize the resulting kernel.

Returns:

(K, aligned_tax) — kernel as DataFrame indexed by the overlap of ids and tax_df.index, plus the cleaned taxonomy slice.
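A minimal sketch of the kernel formula and the parameters above, using NumPy broadcasting. The function name is hypothetical, and details such as casting labels to strings and leaving all-missing rows at zero are assumptions rather than documented behaviour.

```python
import numpy as np
import pandas as pd


def taxonomy_kernel_sketch(ids, tax_df, ranks, weights=None, fill_value="", normalize_rows=True):
    """Hypothetical sketch of K[i, j] = sum_k w_k * I(rank_k[i] == rank_k[j])."""
    # Restrict to the overlap of `ids` and the taxonomy index, keeping `ids` order.
    ids = [i for i in ids if i in tax_df.index]
    tax = tax_df.loc[ids, list(ranks)].fillna(fill_value).astype(str)
    # Default weights 1..len(ranks), then normalize so they sum to 1.
    w = np.arange(1, len(ranks) + 1, dtype=float) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    K = np.zeros((len(ids), len(ids)))
    for w_k, rank in zip(w, ranks):
        col = tax[rank].to_numpy()
        # Pairwise equality; pairs sharing the missing marker do not count as a match.
        match = (col[:, None] == col[None, :]) & (col[:, None] != fill_value)
        K += w_k * match
    if normalize_rows:
        row_sums = K.sum(axis=1, keepdims=True)
        # Assumption: rows with no matches at all stay zero instead of producing NaN.
        K = np.divide(K, row_sums, out=np.zeros_like(K), where=row_sums > 0)
    return pd.DataFrame(K, index=ids, columns=ids), tax
```

With ranks ("family", "genus") and default weights, two features sharing only a family score 1/3 before row normalization, and two sharing both ranks score 1.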

capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, *, pseudocount: float = 1e-06, lam: float = 0.5, n_steps: int = 1, eps: float = 1e-12) → dict[str, DataFrame]

Residual X* pipeline: align samples, CLR, propagate via W̃.

Returns a dict with V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
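The CLR and residual steps might look roughly like the sketch below. Several points are assumptions, not documented behaviour: the function names are hypothetical, the "cross" term is taken to be the other domain's CLR profile pushed through the row-normalized W̃, lam is taken to play the role of η, and only a single propagation step is shown.

```python
import numpy as np
import pandas as pd


def clr_sketch(X: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """Centered log-ratio per sample (rows): log(x + pc) minus the row mean of logs."""
    logX = np.log(X + pseudocount)
    return logX.sub(logX.mean(axis=1), axis=0)


def residual_step_sketch(V_clr, B_clr, W_vb, lam=0.5, eps=1e-12):
    """Hypothetical single step of Z* = Z + lam * (cross - Z).

    Assumptions: V_clr and B_clr share the same sample index, W_vb is
    viruses x bacteria, and `cross` is a W-weighted average of the
    other domain's CLR values.
    """
    W = W_vb.reindex(index=V_clr.columns, columns=B_clr.columns, fill_value=0.0)
    W_v = W.div(W.sum(axis=1) + eps, axis=0)      # each virus row sums to ~1
    W_b = W.T.div(W.T.sum(axis=1) + eps, axis=0)  # each bacterium row sums to ~1
    cross_V = B_clr @ W_v.T  # samples x viruses: bacterial signal seen by each virus
    cross_B = V_clr @ W_b.T  # samples x bacteria: viral signal seen by each bacterium
    V_star = V_clr + lam * (cross_V - V_clr)
    B_star = B_clr + lam * (cross_B - B_clr)
    return V_star, B_star
```

Under this reading, lam = 0 leaves the CLR profiles unchanged and lam = 1 replaces each feature with its cross-domain average.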

capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0') → DataFrame

Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix.

Bacterial spacer IDs of the form ...>TAXID... are parsed to the leading taxid; viral contigs are mapped to vir_rank via vir_tax.

capellini.utils.network_utils.get_hierarchies(df_b: DataFrame, df_v: DataFrame) → tuple[DataFrame, DataFrame]

Prepare bacteria/virus taxonomy frames for taxonomy-kernel building.

Bacteria are reindexed by progenomes_taxid_genus (drop NaN, dedup). Viruses are reindexed by lev0 (dedup).
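The reindexing described above amounts to a drop-NaN, dedup, set-index chain. The helper below is a hypothetical sketch; keeping the first occurrence of each duplicated key is an assumption.

```python
import pandas as pd


def reindex_taxonomy_sketch(df: pd.DataFrame, key: str, drop_missing: bool) -> pd.DataFrame:
    """Hypothetical helper: index a taxonomy frame by `key` with unique labels."""
    out = df.dropna(subset=[key]) if drop_missing else df
    # Assumption: on duplicated keys the first row wins.
    return out.drop_duplicates(subset=key, keep="first").set_index(key)


def get_hierarchies_sketch(df_b: pd.DataFrame, df_v: pd.DataFrame):
    """Bacteria keyed by progenomes_taxid_genus (NaN dropped), viruses keyed by lev0."""
    bac = reindex_taxonomy_sketch(df_b, "progenomes_taxid_genus", drop_missing=True)
    vir = reindex_taxonomy_sketch(df_v, "lev0", drop_missing=False)
    return bac, vir
```

A unique index matters downstream because label-based lookups on a duplicated index return several rows per ID.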

capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) → DataFrame

Detect and correct the orientation of W to be viruses × bacteria.
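Orientation detection could, for instance, compare label overlap in both orientations and transpose when the flipped one fits better. The sketch below is hypothetical: it assumes V_df and B_df are samples × features and that overlap of labels is the detection criterion, which the source does not state.

```python
import pandas as pd


def orient_viruses_by_bacteria_sketch(W: pd.DataFrame, V: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: return W as viruses x bacteria based on label overlap."""

    def overlap(labels, features) -> int:
        return len(set(labels) & set(features))

    # Score W as given (rows ~ viruses, columns ~ bacteria) ...
    as_is = overlap(W.index, V.columns) + overlap(W.columns, B.columns)
    # ... and transposed (rows ~ bacteria, columns ~ viruses).
    flipped = overlap(W.columns, V.columns) + overlap(W.index, B.columns)
    return W.T if flipped > as_is else W
```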

capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') → DataFrame

Aggregate bacteria OTUs to rank and sanitize the resulting column index.

capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) → DataFrame

Keep features present in at least prevalence × n_samples samples.
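As a rough illustration of the threshold, the sketch below assumes that samples are rows, features are columns, and "present" means a value greater than zero. The function name is hypothetical.

```python
import pandas as pd


def prevalence_filter_sketch(df: pd.DataFrame, prevalence: float = 0.1) -> pd.DataFrame:
    """Hypothetical sketch: keep columns present in >= prevalence * n_samples rows."""
    min_samples = prevalence * df.shape[0]
    # Assumption: presence means a strictly positive abundance.
    n_present = (df > 0).sum(axis=0)
    return df.loc[:, n_present >= min_samples]
```

With 4 samples and prevalence = 0.5, a feature must be non-zero in at least 2 samples to survive.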

capellini.utils.network_utils.remove_disease_columns_from_virus_abundance(V: DataFrame) → DataFrame

Drop phenotype/metadata columns that sometimes leak into viral abundance tables.

capellini.utils.network_utils.smooth_crispr_bac_vir(crispr_df: DataFrame, K_bac: DataFrame, K_vir: DataFrame, alpha: float = 0.95) → DataFrame

W̃ = (1 − α) W + α (K_bac · W · K_vir), restricted to common rows/cols.
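The smoothing formula translates almost directly into NumPy. The sketch below uses a hypothetical function name, and assumes that crispr_df is bacteria × viruses and that "common rows/cols" means features present in both the CRISPR matrix and the corresponding kernel.

```python
import numpy as np
import pandas as pd


def smooth_crispr_sketch(crispr_df: pd.DataFrame, K_bac: pd.DataFrame, K_vir: pd.DataFrame, alpha: float = 0.95) -> pd.DataFrame:
    """Hypothetical sketch of W~ = (1 - alpha) W + alpha (K_bac @ W @ K_vir)."""
    # Restrict to features that both the matrix and the kernels know about.
    bac = [b for b in crispr_df.index if b in K_bac.index]
    vir = [v for v in crispr_df.columns if v in K_vir.index]
    W = crispr_df.loc[bac, vir].to_numpy(dtype=float)
    Kb = K_bac.loc[bac, bac].to_numpy(dtype=float)
    Kv = K_vir.loc[vir, vir].to_numpy(dtype=float)
    # Blend the observed links with links diffused through taxonomic similarity.
    W_smooth = (1.0 - alpha) * W + alpha * (Kb @ W @ Kv)
    return pd.DataFrame(W_smooth, index=bac, columns=vir)
```

With the default alpha = 0.95 the smoothed term dominates, so a bacterium with no observed spacer hit still inherits most of the signal of its close taxonomic relatives.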