capellini.utils.network_utils
Network-level utilities: residual message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.
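This is the slim core of the CAPELLINI network stage. The math follows the paper: taxonomy smoothing W̃ = (1 − α) W + α K_vir W K_bac, and residual propagation Z* = Z + η (cross − Z). The product order follows the orientation of W: K_vir W K_bac is conformable for a viruses × bacteria W, and K_bac · W · K_vir (the form used in the function summary) for a bacteria × viruses W. Helper plumbing (sample alignment, orientation auto-detection, CRISPR binarisation, taxonomy clean-up) lives below the math.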
Functions
| aggregate_otu_columns_by_rank_skip_nan | Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels. |
| align_abundance_from_metadata | Align V/B with the standardized metadata (filter, reorder, rename index). |
| build_binary_crispr_matrix | Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix. |
| build_smoothed_crispr_for_study | Orchestrate the smoothed-CRISPR build for a single study. |
| build_taxonomy_kernel | Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]). |
| build_xstar_from_smoothed_crispr | Residual X* pipeline: align samples, CLR, propagate via W̃. |
| crispr_matrix_aggregate_viruses | Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix. |
| get_hierarchies | Prepare bacteria/virus taxonomy frames for taxonomy-kernel building. |
| orient_W_viruses_by_bacteria | Detect and correct the orientation of W to be viruses x bacteria. |
| prepare_bacteria_genus_abundance | Aggregate bacteria OTUs to rank, sanitize the resulting column index. |
| prevalence_filter_df | Keep features present in at least prevalence × n_samples samples. |
| | Drop phenotype/metadata columns that sometimes leak into viral abundance tables. |
| | W̃ = (1 − α) W + α (K_bac · W · K_vir), restricted to common rows/cols. |
- capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) → DataFrame
Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
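The aggregation can be illustrated with a minimal pandas sketch. It is not the library's implementation: the helper name `aggregate_by_rank_sketch` is hypothetical, and it assumes `otu` is samples × OTUs and `tax` is indexed by OTU ID with one column per rank.

```python
import numpy as np
import pandas as pd


def aggregate_by_rank_sketch(otu: pd.DataFrame, tax: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Sum OTU columns sharing a label at `rank`; OTUs with a NaN label are dropped."""
    # Look up the label of every OTU column; OTUs absent from `tax` become NaN.
    labels = tax[rank].reindex(otu.columns)
    keep = labels.notna().to_numpy()
    kept = otu.loc[:, keep]
    # Group the transposed table (OTUs as rows) by label, sum, and transpose back.
    return kept.T.groupby(labels[keep].to_numpy()).sum().T


otu = pd.DataFrame(
    [[1, 2, 3, 4], [0, 5, 0, 1]],
    index=["s1", "s2"],
    columns=["otu1", "otu2", "otu3", "otu4"],
)
tax = pd.DataFrame(
    {"genus": ["A", "A", np.nan, "B"]},
    index=["otu1", "otu2", "otu3", "otu4"],
)
agg = aggregate_by_rank_sketch(otu, tax, "genus")
```

Here `otu3` has no genus label, so its counts are excluded rather than pooled into a catch-all column.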
- capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') → tuple[DataFrame, DataFrame, DataFrame]
Align V/B with the standardized metadata (filter, reorder, rename index).
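A rough sketch of the filter, reorder, and rename steps, under assumptions not stated in this page: samples are rows of both abundance tables, and pairs whose sample IDs are missing from either table are dropped. `align_sketch` is a hypothetical name; the real function may handle edge cases differently.

```python
import pandas as pd


def align_sketch(V, B, meta, keep_col="keep_for_analysis",
                 virus_id_col="virus_sample_id",
                 bacteria_id_col="bacteria_sample_id",
                 final_index_col="virus_sample_id"):
    # Filter: keep metadata rows flagged for analysis.
    m = meta[meta[keep_col].astype(bool)]
    # Assumption: drop pairs whose sample IDs are missing from either table.
    m = m[m[virus_id_col].isin(V.index) & m[bacteria_id_col].isin(B.index)]
    # Reorder: pull rows in metadata order so V and B are paired row by row.
    V_al = V.loc[m[virus_id_col].to_list()].copy()
    B_al = B.loc[m[bacteria_id_col].to_list()].copy()
    # Rename: both tables share one final sample index.
    V_al.index = m[final_index_col].to_numpy()
    B_al.index = m[final_index_col].to_numpy()
    return V_al, B_al, m.reset_index(drop=True)


V = pd.DataFrame({"phage_a": [1, 2, 3]}, index=["v1", "v2", "v3"])
B = pd.DataFrame({"genus_x": [30, 20, 10]}, index=["b3", "b2", "b1"])
meta = pd.DataFrame({
    "virus_sample_id": ["v2", "v1", "v3"],
    "bacteria_sample_id": ["b2", "b1", "b3"],
    "keep_for_analysis": [True, False, True],
})
V_al, B_al, meta_al = align_sketch(V, B, meta)
```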
- capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) → DataFrame
Load a raw CRISPR network CSV and build a binary bacteria × viruses matrix.
The output is reindexed onto the requested bacteria_features/virus_features (missing rows/cols are zero) and the values are clipped to {0, 1}.
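The reindex-and-binarise step might look like the sketch below. It starts from an in-memory DataFrame instead of a CSV path, and it assumes that `transpose_after_load=True` means the raw table is stored viruses × bacteria; `binarise_sketch` is a hypothetical name.

```python
import pandas as pd


def binarise_sketch(raw, bacteria_features, virus_features, transpose_after_load=True):
    # Assumption: the raw table is viruses x bacteria when the flag is set.
    mat = raw.T if transpose_after_load else raw
    # Reindex onto the requested features; unseen rows/cols are filled with zero.
    mat = mat.reindex(index=list(bacteria_features),
                      columns=list(virus_features),
                      fill_value=0)
    # Any positive value counts as a link, so the result lies in {0, 1}.
    return (mat.fillna(0) > 0).astype(int)


raw = pd.DataFrame([[2, 0], [0, 1]], index=["v1", "v2"], columns=["b1", "b2"])
binary = binarise_sketch(raw, ["b1", "b2", "b3"], ["v1", "v3"])
```

Bacterium `b3` and virus `v3` do not occur in the raw table, so their row and column are all zeros.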
- capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) → dict[str, DataFrame]
Orchestrate the smoothed-CRISPR build for a single study.
Returns the artefacts (CRISPR binary, K_bac, K_vir, smoothed W, aligned taxonomy frames). The caller is responsible for persisting them.
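The smoothing at the centre of this build, W̃ = (1 − α) W + α (K_bac · W · K_vir) restricted to common rows/cols, can be sketched as follows. The sketch assumes W is bacteria × viruses and that each kernel is a square DataFrame indexed by feature ID; `smooth_sketch` is a hypothetical name, not the library function.

```python
import numpy as np
import pandas as pd


def smooth_sketch(W, K_bac, K_vir, alpha=0.95):
    # Restrict to bacteria and viruses that appear in both W and the kernels.
    bac = W.index.intersection(K_bac.index)
    vir = W.columns.intersection(K_vir.index)
    W_c = W.loc[bac, vir].to_numpy(dtype=float)
    Kb = K_bac.loc[bac, bac].to_numpy(dtype=float)
    Kv = K_vir.loc[vir, vir].to_numpy(dtype=float)
    # Blend the observed links with their taxonomy-diffused version.
    W_s = (1 - alpha) * W_c + alpha * (Kb @ W_c @ Kv)
    return pd.DataFrame(W_s, index=bac, columns=vir)


W = pd.DataFrame([[1, 0], [0, 0], [1, 1]],
                 index=["b1", "b2", "b9"], columns=["v1", "v2"])
K_bac = pd.DataFrame([[0.5, 0.5], [0.5, 0.5]],
                     index=["b1", "b2"], columns=["b1", "b2"])
K_vir = pd.DataFrame(np.eye(2), index=["v1", "v2"], columns=["v1", "v2"])
W_smooth = smooth_sketch(W, K_bac, K_vir, alpha=0.5)
```

In this toy case `b2` has no observed link but shares taxonomy with `b1`, so it inherits part of the `b1`–`v1` link; `b9` has no kernel entry and is dropped.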
- capellini.utils.network_utils.build_taxonomy_kernel(ids, tax_df: DataFrame, ranks, weights=None, fill_value: str = '', normalize_rows: bool = True) → tuple[DataFrame, DataFrame]
Build a similarity kernel where K[i,j] = Σ w_k · I(rank_k[i] == rank_k[j]).
- Parameters:
ids – Feature IDs to include.
tax_df – Taxonomy table (rows = IDs, columns include ranks).
ranks – Ordered rank columns (deepest last).
weights – Optional per-rank weights; default 1..len(ranks). Normalized to sum to 1.
fill_value – String treated as missing — pairs sharing this value at a given rank are NOT counted as a match.
normalize_rows – Row-normalize the resulting kernel.
- Returns:
(K, aligned_tax): the kernel as a DataFrame indexed by the overlap of ids and tax_df.index, plus the cleaned taxonomy slice.
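The kernel definition and parameters above can be sketched in NumPy as follows. This is an illustration of the documented formula only, under the assumption that NaN labels are treated like `fill_value`; `taxonomy_kernel_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def taxonomy_kernel_sketch(ids, tax_df, ranks, weights=None,
                           fill_value="", normalize_rows=True):
    # Keep only IDs that have a taxonomy row.
    ids = [i for i in ids if i in tax_df.index]
    # Assumption: NaN labels are treated as missing, like `fill_value`.
    tax = tax_df.loc[ids, list(ranks)].fillna(fill_value).astype(str)
    # Default weights 1..len(ranks), normalised to sum to 1.
    w = (np.arange(1, len(ranks) + 1, dtype=float)
         if weights is None else np.asarray(weights, dtype=float))
    w = w / w.sum()
    K = np.zeros((len(ids), len(ids)))
    for w_k, rank in zip(w, ranks):
        col = np.asarray(tax[rank], dtype=str)
        same = col[:, None] == col[None, :]
        known = col != fill_value
        # Pairs that share the missing marker are not counted as a match.
        K += w_k * (same & known[:, None] & known[None, :])
    if normalize_rows:
        row_sums = K.sum(axis=1, keepdims=True)
        K = np.divide(K, row_sums, out=np.zeros_like(K), where=row_sums > 0)
    return pd.DataFrame(K, index=ids, columns=ids), tax


tax_df = pd.DataFrame(
    {"family": ["F1", "F1", "F2"], "genus": ["G1", "", ""]},
    index=["a", "b", "c"],
)
K_raw, _ = taxonomy_kernel_sketch(["a", "b", "c", "z"], tax_df,
                                  ["family", "genus"], normalize_rows=False)
K_norm, aligned = taxonomy_kernel_sketch(["a", "b", "c", "z"], tax_df,
                                         ["family", "genus"])
```

With two ranks the default weights become 1/3 (family) and 2/3 (genus), so a match at the deeper rank contributes more. ID `z` has no taxonomy row and is dropped from the kernel.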
- capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, *, pseudocount: float = 1e-06, lam: float = 0.5, n_steps: int = 1, eps: float = 1e-12) → dict[str, DataFrame]
Residual X* pipeline: align samples, CLR, propagate via W̃.
Returns a dict with V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
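Two building blocks of this pipeline can be sketched in isolation: the centred log-ratio (CLR) transform and a single residual update Z* = Z + η (cross − Z). How the cross term is computed from W̃ is not specified on this page, so the sketch takes `cross` as an input. The names `clr_sketch`, `residual_step_sketch`, and `eta` are hypothetical, and samples are assumed to be rows.

```python
import numpy as np
import pandas as pd


def clr_sketch(X: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """Centred log-ratio per sample (row): log(x + pc) minus the row mean of the logs."""
    logX = np.log(X.to_numpy(dtype=float) + pseudocount)
    centred = logX - logX.mean(axis=1, keepdims=True)
    return pd.DataFrame(centred, index=X.index, columns=X.columns)


def residual_step_sketch(Z: pd.DataFrame, cross: pd.DataFrame, eta: float = 0.5) -> pd.DataFrame:
    """One residual update: move Z a fraction eta of the way towards `cross`."""
    return Z + eta * (cross - Z)


X = pd.DataFrame([[1.0, 1.0, 1.0], [1.0, 10.0, 100.0]],
                 index=["s1", "s2"], columns=["f1", "f2", "f3"])
Z = clr_sketch(X)
# Placeholder cross term for illustration only (all zeros).
cross = pd.DataFrame(0.0, index=Z.index, columns=Z.columns)
Z_star = residual_step_sketch(Z, cross, eta=0.5)
```

With `eta = 0` the update leaves Z unchanged and with `eta = 1` it replaces Z by the cross term, which is why the update is called residual: the original signal is kept and only partly corrected.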
- capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0') → DataFrame
Aggregate a SpacePHARER predictions TSV into a (bac_taxid × vOTU) matrix.
Bacterial spacer IDs of the form ...>TAXID... are parsed to the leading taxid; viral contigs are mapped to vir_rank via vir_tax.
- capellini.utils.network_utils.get_hierarchies(df_b: DataFrame, df_v: DataFrame) → tuple[DataFrame, DataFrame]
Prepare bacteria/virus taxonomy frames for taxonomy-kernel building.
Bacteria are reindexed by progenomes_taxid_genus (drop NaN, dedup). Viruses are reindexed by lev0 (dedup).
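The reindexing described here amounts to a drop-NaN, deduplicate, set-index chain. A minimal sketch, assuming the first occurrence wins when a key is duplicated (the page does not say which row is kept); `index_by_sketch` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def index_by_sketch(df: pd.DataFrame, col: str, drop_na: bool = True) -> pd.DataFrame:
    out = df.dropna(subset=[col]) if drop_na else df
    # Assumption: keep the first row for each duplicated key.
    out = out.drop_duplicates(subset=[col], keep="first")
    return out.set_index(col)


df_b = pd.DataFrame({
    "progenomes_taxid_genus": [10, np.nan, 10, 20],
    "genus": ["Gx", "Gy", "Gx_dup", "Gz"],
})
df_v = pd.DataFrame({"lev0": ["x", "x", "y"], "family": ["Fa", "Fa", "Fb"]})

bac_tax = index_by_sketch(df_b, "progenomes_taxid_genus")   # drop NaN, dedup
vir_tax = index_by_sketch(df_v, "lev0", drop_na=False)      # dedup only
```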
- capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) → DataFrame
Detect and correct the orientation of W to be viruses x bacteria.
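One plausible way to detect orientation is to compare how many of W's row and column labels match the virus and bacteria feature names. The sketch below uses that overlap heuristic; it is an assumption, not necessarily the rule the library applies. It also assumes V_df and B_df are samples × features, and `orient_sketch` is a hypothetical name.

```python
import pandas as pd


def orient_sketch(W: pd.DataFrame, V: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    # Score for "W is already viruses x bacteria".
    as_is = (len(W.index.intersection(V.columns))
             + len(W.columns.intersection(B.columns)))
    # Score for "W is bacteria x viruses and needs transposing".
    flipped = (len(W.columns.intersection(V.columns))
               + len(W.index.intersection(B.columns)))
    return W.T if flipped > as_is else W


V = pd.DataFrame([[1, 2]], index=["s1"], columns=["v1", "v2"])
B = pd.DataFrame([[3, 4, 5]], index=["s1"], columns=["b1", "b2", "b3"])
W_bv = pd.DataFrame([[1, 0], [0, 1]], index=["b1", "b2"], columns=["v1", "v2"])
W_vb = orient_sketch(W_bv, V, B)
```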
- capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') → DataFrame
Aggregate bacteria OTUs to rank and sanitize the resulting column index.
- capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) → DataFrame
Keep features present in at least prevalence × n_samples samples.
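A minimal sketch of such a filter, assuming samples are rows, features are columns, and "present" means a strictly positive value (none of which is stated on this page); `prevalence_filter_sketch` is a hypothetical name.

```python
import pandas as pd


def prevalence_filter_sketch(df: pd.DataFrame, prevalence: float = 0.1) -> pd.DataFrame:
    n_samples = df.shape[0]
    # Count, per feature, the samples in which it is present (value > 0).
    present = (df > 0).sum(axis=0)
    return df.loc[:, present >= prevalence * n_samples]


df = pd.DataFrame(
    {"a": [1, 0, 0, 0], "b": [2, 3, 0, 1], "c": [0, 0, 0, 0]},
    index=["s1", "s2", "s3", "s4"],
)
kept_half = prevalence_filter_sketch(df, prevalence=0.5)
kept_quarter = prevalence_filter_sketch(df, prevalence=0.25)
```

With four samples, a prevalence of 0.5 requires presence in at least two samples, so only `b` survives; lowering it to 0.25 also keeps `a`.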