capellini.utils.network_utils

Network-level utilities: message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.

Functions

aggregate_otu_columns_by_rank_skip_nan(otu, ...)

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.

align_abundance_from_metadata(...[, ...])

Align viral and bacterial abundance using standardized metadata.

assign_crispr(cri_big, cri_s)

Fill matching rows/columns of a target matrix from a source CRISPR matrix.

build_binary_crispr_matrix(raw_crispr_path, ...)

Load a raw CRISPR network and build a binary bacteria x viruses matrix.

build_smoothed_crispr_for_study(...[, ...])

Build a smoothed CRISPR matrix for a single study.

build_taxonomy_kernel_from_shared_ranks(ids, ...)

Build a taxonomy kernel matrix using weighted shared rank agreement.

build_xstar_from_smoothed_crispr(V_df, B_df, ...)

Full non-residual X-star pipeline (convex message passing).

build_xstar_from_smoothed_crispr_residual(...)

Full residual additive X-star pipeline.

crispr_matrix_aggregate_viruses(df_crispr, ...)

Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.

get_hierarchies(df_b1, df_v1)

Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.

orient_W_viruses_by_bacteria(W_df, V_df, B_df)

Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.

out_path(study, subdir, filename, output_root)

Construct the full path to an output file within a study directory.

prepare_bacteria_genus_abundance(otu, tax[, ...])

Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.

prevalence_filter_df(df[, prevalence, verbose])

Keep only features present in at least prevalence * n_samples samples.

remove_disease_columns_from_virus_abundance(V)

Remove phenotype/metadata columns accidentally stored in viral abundance tables.

residual_message_passing(V, B, W_vh_smooth)

Residual additive cross-domain message passing using a smoothed CRISPR matrix.

residual_message_passing_df(V_clr_df, ...[, ...])

DataFrame wrapper for residual additive message passing.

smooth_crispr_bac_vir(crispr_df, K_bac, K_vir)

Smooth a CRISPR matrix using taxonomy kernels.

study_outdir(study, subdir, output_root)

Construct the output directory path for a study.

summarize_df(name, df)

Print a short summary of a DataFrame shape and index uniqueness.

transform_message_passing_smoothed_crispr(V, ...)

Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.

transform_message_passing_smoothed_crispr_df(...)

DataFrame wrapper for non-residual convex message passing.

capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) -> DataFrame

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.

Parameters:
  • otu – Samples x ASVs abundance DataFrame.

  • tax – ASVs x ranks taxonomy DataFrame.

  • rank – Taxonomy column to aggregate to.

Returns:

Samples x rank-groups abundance DataFrame.
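
The aggregation described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name aggregate_by_rank_sketch and the toy data are hypothetical, and it assumes that ASVs sharing a rank label are summed within each sample.

```python
import numpy as np
import pandas as pd


def aggregate_by_rank_sketch(otu: pd.DataFrame, tax: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Hypothetical stand-in: sum ASV columns that share a rank label, skipping NaN labels."""
    # Look up the label of every ASV column at the requested rank.
    labels = tax[rank].reindex(otu.columns)
    # ASVs without a label at this rank are dropped rather than pooled.
    keep = labels.notna().to_numpy()
    # Group the (transposed) ASV rows by label and sum within each sample.
    grouped = otu.loc[:, keep].T.groupby(labels[keep].to_numpy()).sum()
    return grouped.T


otu = pd.DataFrame(
    {"asv1": [1, 0], "asv2": [2, 3], "asv3": [5, 5]},
    index=["s1", "s2"],
)
tax = pd.DataFrame(
    {"genus": ["Bacteroides", "Bacteroides", np.nan]},
    index=["asv1", "asv2", "asv3"],
)
agg = aggregate_by_rank_sketch(otu, tax, "genus")
```

Here asv3 has no genus label, so it contributes to no output column.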

capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') -> tuple[DataFrame, DataFrame, DataFrame]

Align viral and bacterial abundance using standardized metadata.

Parameters:
  • virus_abundance – Viral abundance DataFrame.

  • bacteria_abundance – Bacterial abundance DataFrame.

  • metadata – Metadata DataFrame with sample ID and keep columns.

  • keep_col – Column used to filter metadata rows.

  • virus_id_col – Metadata column for viral sample IDs.

  • bacteria_id_col – Metadata column for bacterial sample IDs.

  • final_index_col – Metadata column to use as the aligned index.

Returns:

Tuple (V, B, meta_aligned) with aligned, filtered abundance and metadata.
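
One plausible reading of this alignment is sketched below. It is an assumption-laden illustration rather than the actual function: align_sketch is a hypothetical name, and the sketch assumes that both abundance tables are indexed by sample ID, that keep_col is boolean-like, and that samples missing from either table are dropped.

```python
import pandas as pd


def align_sketch(V, B, meta, keep_col="keep_for_analysis",
                 virus_id_col="virus_sample_id",
                 bacteria_id_col="bacteria_sample_id",
                 final_index_col="virus_sample_id"):
    """Hypothetical stand-in for metadata-driven alignment of two abundance tables."""
    # Keep only metadata rows flagged for analysis.
    m = meta[meta[keep_col].astype(bool)]
    # Assumption: drop pairs whose sample IDs are missing from either table.
    m = m[m[virus_id_col].isin(V.index) & m[bacteria_id_col].isin(B.index)]
    # Pull the matching rows in metadata order.
    V_al = V.loc[m[virus_id_col]].copy()
    B_al = B.loc[m[bacteria_id_col]].copy()
    # Give all three outputs the same index.
    new_index = m[final_index_col].to_numpy()
    V_al.index = new_index
    B_al.index = new_index
    meta_al = m.copy()
    meta_al.index = new_index
    return V_al, B_al, meta_al


V = pd.DataFrame({"phage1": [1, 2, 3]}, index=["v1", "v2", "v3"])
B = pd.DataFrame({"genusA": [10, 20]}, index=["b1", "b2"])
meta = pd.DataFrame({
    "virus_sample_id": ["v1", "v2", "v3"],
    "bacteria_sample_id": ["b1", "b2", "b9"],
    "keep_for_analysis": [True, False, True],
})
V_al, B_al, meta_al = align_sketch(V, B, meta)
```

In this toy input, v2 is excluded by the keep flag and v3 has no bacterial counterpart, so only v1 survives.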

capellini.utils.network_utils.assign_crispr(cri_big: DataFrame, cri_s: DataFrame) -> DataFrame

Fill matching rows/columns of a target matrix from a source CRISPR matrix.

Parameters:
  • cri_big – Target (full-size) matrix.

  • cri_s – Source CRISPR matrix (subset).

Returns:

Updated copy of cri_big.
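
A minimal sketch of this kind of label-based fill, assuming that only rows and columns present in both matrices are overwritten (assign_sketch is a hypothetical name, not part of the package):

```python
import pandas as pd


def assign_sketch(big: pd.DataFrame, small: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: copy values from `small` into a copy of `big` where labels match."""
    out = big.copy()
    # Only labels present in both matrices are touched.
    rows = out.index.intersection(small.index)
    cols = out.columns.intersection(small.columns)
    out.loc[rows, cols] = small.loc[rows, cols]
    return out


big = pd.DataFrame(0.0, index=["b1", "b2", "b3"], columns=["v1", "v2", "v3"])
small = pd.DataFrame([[1.0], [1.0]], index=["b1", "bX"], columns=["v2"])
filled = assign_sketch(big, small)
```

The row bX does not exist in the target, so it is ignored, and the input big is left unmodified.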

capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) -> DataFrame

Load a raw CRISPR network and build a binary bacteria x viruses matrix.

Parameters:
  • raw_crispr_path – Path to the CSV of the raw CRISPR network.

  • bacteria_features – Ordered list of bacteria feature IDs.

  • virus_features – Ordered list of virus feature IDs.

  • transpose_after_load – If True, transpose the loaded matrix (contigs were rows).

Returns:

Binary bacteria x viruses DataFrame.

capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) -> dict[str, DataFrame]

Build a smoothed CRISPR matrix for a single study.

Parameters:
  • raw_crispr_path – Path to the raw CRISPR network CSV.

  • bacteria_features – Bacteria feature IDs (from processed abundance).

  • virus_features – Virus feature IDs (from processed abundance).

  • tax_bac_path – Path to bacteria taxonomy CSV.

  • tax_vir_path – Path to virus taxonomy CSV.

  • bacterial_ranks – Bacterial taxonomy rank columns for kernel.

  • viral_ranks – Viral taxonomy rank columns for kernel.

  • bacterial_weights – Per-rank weights for bacteria kernel.

  • viral_weights – Per-rank weights for virus kernel.

  • alpha – CRISPR smoothing weight.

  • transpose_after_load – Passed to build_binary_crispr_matrix.

Returns:

Dict with crispr_binary, K_bac, K_vir, crispr_smooth, bac_tax_aligned, vir_tax_aligned.

capellini.utils.network_utils.build_taxonomy_kernel_from_shared_ranks(ids, tax_df: DataFrame, ranks, weights=None, fill_value: str = '', normalize_rows: bool = True, strict: bool = False) -> tuple[DataFrame, DataFrame]

Build a taxonomy kernel matrix using weighted shared rank agreement.

For i != j, K[i,j] = sum_k weights[k] * I(rank_k[i] == rank_k[j] and rank_k[i] != fill_value), so a rank contributes only when both labels are present and equal. K[i,i] = 1.

Parameters:
  • ids – Ordered collection of feature IDs to include.

  • tax_df – Taxonomy DataFrame indexed by the same IDs.

  • ranks – Ordered list of taxonomy rank column names.

  • weights – Per-rank weights (default: 1..n). Normalized to sum to 1.

  • fill_value – Value treated as missing (no match score).

  • normalize_rows – Row-normalize the kernel.

  • strict – Raise ValueError if any IDs are missing from tax_df.

Returns:

Tuple (K DataFrame, aligned taxonomy DataFrame).
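
The kernel formula above can be written out in a few lines of NumPy. This is a sketch under stated assumptions, not the library code: taxonomy_kernel_sketch is a hypothetical name, the taxonomy table is assumed to be already aligned to the requested IDs, and row normalization is assumed to divide each row by its sum.

```python
import numpy as np
import pandas as pd


def taxonomy_kernel_sketch(tax, ranks, weights=None, fill_value="", normalize_rows=True):
    """Hypothetical stand-in for a weighted shared-rank kernel."""
    n = len(tax)
    # Default weights 1..n (as documented), then normalized to sum to 1.
    if weights is None:
        weights = np.arange(1, len(ranks) + 1, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()

    K = np.zeros((n, n))
    for wk, rank in zip(w, ranks):
        labels = tax[rank].fillna(fill_value).astype(str).to_numpy(dtype=object)
        same = labels[:, None] == labels[None, :]
        known = labels != fill_value
        # A rank counts only if the label is present and equal for both features.
        K += wk * (same & known[:, None])
    # Every feature fully agrees with itself.
    np.fill_diagonal(K, 1.0)
    if normalize_rows:
        K = K / K.sum(axis=1, keepdims=True)
    return pd.DataFrame(K, index=tax.index, columns=tax.index)


tax = pd.DataFrame(
    {"family": ["F1", "F1", "F2"], "genus": ["G1", "G2", ""]},
    index=["a", "b", "c"],
)
K_raw = taxonomy_kernel_sketch(tax, ["family", "genus"], normalize_rows=False)
K_norm = taxonomy_kernel_sketch(tax, ["family", "genus"])
```

With the default weights, family carries 1/3 and genus 2/3 of the score, so features a and b (same family, different genus) score 1/3 before normalization.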

capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) -> dict[str, DataFrame]

Full non-residual X-star pipeline (convex message passing).

Parameters:
  • V_df – Samples x viruses raw abundance.

  • B_df – Samples x bacteria raw abundance.

  • W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.

  • pseudocount – CLR pseudocount.

  • lam – Mixing weight.

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

  • eps – Numerical stability constant.

Returns:

Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
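
The pseudocount parameter refers to the centered log-ratio (CLR) transform applied to the raw abundances. A minimal sketch of a per-sample CLR is shown below; clr_sketch is a hypothetical helper, and the pipeline's own CLR step may differ in detail.

```python
import numpy as np
import pandas as pd


def clr_sketch(X: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """Hypothetical stand-in: centered log-ratio transform, one sample per row."""
    # The pseudocount keeps log() finite for zero counts.
    logX = np.log(X.to_numpy(dtype=float) + pseudocount)
    # Subtract each sample's mean log-abundance (the log of its geometric mean).
    clr = logX - logX.mean(axis=1, keepdims=True)
    return pd.DataFrame(clr, index=X.index, columns=X.columns)


raw = pd.DataFrame(
    [[10, 0, 5], [3, 3, 3]],
    index=["s1", "s2"],
    columns=["f1", "f2", "f3"],
)
clr = clr_sketch(raw)
```

Each row of the result sums to zero, and a sample with equal counts across features maps to all zeros.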

capellini.utils.network_utils.build_xstar_from_smoothed_crispr_residual(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) -> dict[str, DataFrame]

Full residual additive X-star pipeline.

Parameters:
  • V_df – Samples x viruses raw abundance.

  • B_df – Samples x bacteria raw abundance.

  • W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.

  • pseudocount – CLR pseudocount.

  • lam – Additive weight.

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

  • eps – Numerical stability constant.

Returns:

Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.

capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0', vir_id_col=None, dropna_vir: bool = True, dtype=int) -> DataFrame

Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.

Parameters:
  • df_crispr – Raw SpacePHARER predictions DataFrame (no header).

  • vir_tax – Virus taxonomy DataFrame.

  • bac_col – Column index for bacterial spacer IDs.

  • vir_col – Column index for viral contig IDs.

  • vir_rank – Viral taxonomy rank column to aggregate to.

  • vir_id_col – Optional viral ID column in vir_tax; uses index if None.

  • dropna_vir – Drop viruses not found in taxonomy.

  • dtype – Output matrix dtype.

Returns:

Bacteria x viral_groups crosstab matrix.
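
The crosstab step can be illustrated with pandas.crosstab. The sketch below is hypothetical (crosstab_sketch is not part of the package) and assumes that the taxonomy table is indexed by viral contig ID and that hits to contigs without a taxonomy entry are dropped, matching dropna_vir=True.

```python
import pandas as pd


def crosstab_sketch(df_crispr, vir_tax, bac_col=0, vir_col=1, vir_rank="lev0"):
    """Hypothetical stand-in: count spacer hits per (bacterium, viral group)."""
    bac = df_crispr.iloc[:, bac_col].rename("bacteria")
    # Map each viral contig to its group at the requested rank (NaN if unknown).
    group = df_crispr.iloc[:, vir_col].map(vir_tax[vir_rank]).rename("viral_group")
    # Drop hits whose contig has no taxonomy entry.
    known = group.notna()
    return pd.crosstab(bac[known], group[known]).astype(int)


hits = pd.DataFrame([
    ["b1", "c1"],
    ["b1", "c2"],
    ["b2", "c3"],
    ["b2", "cX"],  # contig absent from the taxonomy table
])
vir_tax = pd.DataFrame({"lev0": ["g1", "g1", "g2"]}, index=["c1", "c2", "c3"])
M = crosstab_sketch(hits, vir_tax)
```

The entries are hit counts; whether the real function keeps counts or binarizes them is not stated here.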

capellini.utils.network_utils.get_hierarchies(df_b1: DataFrame, df_v1: DataFrame) -> tuple[DataFrame, DataFrame]

Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.

Parameters:
  • df_b1 – Bacteria taxonomy DataFrame with progenomes_taxid_genus column.

  • df_v1 – Virus taxonomy DataFrame with lev0 column.

Returns:

Tuple (bacteria_tax, virus_tax) with cleaned indexes.

capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) -> DataFrame

Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.

Parameters:
  • W_df – CRISPR matrix (may be bacteria x viruses or viruses x bacteria).

  • V_df – Viral abundance (samples x viruses).

  • B_df – Bacterial abundance (samples x bacteria).

  • verbose – Print overlap counts.

Returns:

W_df in viruses x bacteria orientation.
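
Orientation detection of this kind can be done by comparing label overlaps. The following is a sketch of that idea only; orient_sketch is a hypothetical name and the overlap rule is an assumption, not the library's exact logic.

```python
import pandas as pd


def orient_sketch(W: pd.DataFrame, V: pd.DataFrame, B: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: return W as viruses x bacteria, transposing if needed."""
    # Overlap if W is already viruses (rows) x bacteria (columns).
    as_is = len(W.index.intersection(V.columns)) + len(W.columns.intersection(B.columns))
    # Overlap if W is bacteria (rows) x viruses (columns).
    flipped = len(W.index.intersection(B.columns)) + len(W.columns.intersection(V.columns))
    return W.T if flipped > as_is else W


V = pd.DataFrame(0.0, index=["s1"], columns=["v1", "v2"])
B = pd.DataFrame(0.0, index=["s1"], columns=["b1", "b2", "b3"])
W_bac_by_vir = pd.DataFrame(1.0, index=["b1", "b2", "b3"], columns=["v1", "v2"])

W_vb = orient_sketch(W_bac_by_vir, V, B)
W_vb_again = orient_sketch(W_vb, V, B)
```

Applying the function to an already oriented matrix leaves it unchanged.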

capellini.utils.network_utils.out_path(study: str, subdir: str, filename: str, output_root: Path) -> Path

Construct the full path to an output file within a study directory.

Parameters:
  • study – Study identifier string.

  • subdir – Sub-directory name.

  • filename – File name.

  • output_root – Root output directory.

Returns:

Full Path to the output file.

capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') -> DataFrame

Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.

Parameters:
  • otu – Samples x ASVs raw abundance.

  • tax – ASV taxonomy DataFrame with a column matching rank.

  • rank – Taxonomy column to aggregate to.

Returns:

Samples x genus-level abundance DataFrame with sanitized column names.

capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) -> DataFrame

Keep only features present in at least prevalence * n_samples samples.

Parameters:
  • df – Samples x features DataFrame.

  • prevalence – Minimum fractional prevalence threshold.

  • verbose – Print kept/total feature count.

Returns:

Filtered DataFrame.
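
A minimal sketch of the prevalence rule, assuming that a feature counts as present in a sample when its value is greater than zero (prevalence_filter_sketch is a hypothetical name):

```python
import pandas as pd


def prevalence_filter_sketch(df: pd.DataFrame, prevalence: float = 0.1) -> pd.DataFrame:
    """Hypothetical stand-in: keep columns present in at least prevalence * n_samples rows."""
    min_samples = prevalence * len(df)
    # Assumption: "present" means a strictly positive abundance.
    n_present = (df > 0).sum(axis=0)
    return df.loc[:, n_present >= min_samples]


abund = pd.DataFrame(
    {"f1": [1, 2, 0, 4], "f2": [0, 0, 0, 7]},
    index=["s1", "s2", "s3", "s4"],
)
kept = prevalence_filter_sketch(abund, prevalence=0.5)
```

With four samples and prevalence=0.5, a feature must be present in at least two samples, so f2 (present once) is dropped.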

capellini.utils.network_utils.remove_disease_columns_from_virus_abundance(V: DataFrame) -> DataFrame

Remove phenotype/metadata columns accidentally stored in viral abundance tables.

Parameters:
  • V – Viral abundance DataFrame.

Returns:

Cleaned copy without non-feature columns.

capellini.utils.network_utils.residual_message_passing(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) -> tuple[ndarray, ndarray]

Residual additive cross-domain message passing using a smoothed CRISPR matrix.

Parameters:
  • V – Samples x viruses CLR matrix.

  • B – Samples x bacteria CLR matrix.

  • W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.

  • lam – Additive weight for cross-domain messages.

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star, B_star).
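
The exact update rule is not spelled out on this page. The sketch below shows one common form of a residual additive step, in which each domain keeps its own signal and adds lam times a message from the other domain. Everything in it is an assumption made for illustration: the name residual_step_sketch, the row normalization of W in each direction, and the way preserve_scale rescales columns.

```python
import numpy as np


def _match_std(X_new, X_ref, eps=1e-12):
    """Rescale each column of X_new around its mean to have the std of X_ref."""
    mu = X_new.mean(axis=0, keepdims=True)
    scale = X_ref.std(axis=0, keepdims=True) / (X_new.std(axis=0, keepdims=True) + eps)
    return mu + (X_new - mu) * scale


def residual_step_sketch(V, B, W, lam=0.1, preserve_scale=False, eps=1e-12):
    """Hypothetical one-step residual update; W is viruses x bacteria."""
    # Assumption: messages are weighted averages, so W is row-normalized per direction.
    W_vb = W / (W.sum(axis=1, keepdims=True) + eps)      # each virus: weights over bacteria
    W_bv = W.T / (W.T.sum(axis=1, keepdims=True) + eps)  # each bacterium: weights over viruses
    # Residual form: original signal plus a scaled cross-domain message.
    V_star = V + lam * (B @ W_vb.T)
    B_star = B + lam * (V @ W_bv.T)
    if preserve_scale:
        V_star = _match_std(V_star, V, eps)
        B_star = _match_std(B_star, B, eps)
    return V_star, B_star


rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))   # samples x viruses
B = rng.normal(size=(5, 2))   # samples x bacteria
W = rng.random((3, 2))        # viruses x bacteria

V0, B0 = residual_step_sketch(V, B, W, lam=0.0)
V1, B1 = residual_step_sketch(V, B, W, lam=0.5, preserve_scale=True)
```

With lam=0 the inputs come back unchanged, and with preserve_scale=True the output columns keep the standard deviations of the inputs.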

capellini.utils.network_utils.residual_message_passing_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) -> tuple[DataFrame, DataFrame, DataFrame]

DataFrame wrapper for residual additive message passing.

Parameters:
  • V_clr_df – Samples x viruses CLR-transformed abundance.

  • B_clr_df – Samples x bacteria CLR-transformed abundance.

  • W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.

  • lam – Additive weight.

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star_df, B_star_df, X_star_df).

capellini.utils.network_utils.smooth_crispr_bac_vir(crispr_df: DataFrame, K_bac: DataFrame, K_vir: DataFrame, alpha: float = 1.0, preserve_original: bool = True) -> DataFrame

Smooth a CRISPR matrix using taxonomy kernels.

W_smooth = (1 - alpha) * W + alpha * K_bac @ W @ K_vir

Parameters:
  • crispr_df – Bacteria x viruses binary CRISPR matrix.

  • K_bac – Bacteria taxonomy kernel.

  • K_vir – Virus taxonomy kernel.

  • alpha – Smoothing weight (1.0 = full kernel propagation).

  • preserve_original – If True, blend with original; if False, use propagated only.

Returns:

Smoothed CRISPR DataFrame.
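
The smoothing formula above translates directly into matrix products. The sketch below is illustrative only: smooth_crispr_sketch is a hypothetical name, and it assumes that preserve_original=False returns the propagated term K_bac @ W @ K_vir on its own.

```python
import numpy as np
import pandas as pd


def smooth_crispr_sketch(W, K_bac, K_vir, alpha=1.0, preserve_original=True):
    """Hypothetical stand-in for W_smooth = (1 - alpha) * W + alpha * K_bac @ W @ K_vir."""
    W_arr = W.to_numpy(dtype=float)
    # Align both kernels to the row (bacteria) and column (virus) order of W.
    K_b = K_bac.loc[W.index, W.index].to_numpy(dtype=float)
    K_v = K_vir.loc[W.columns, W.columns].to_numpy(dtype=float)
    propagated = K_b @ W_arr @ K_v
    if preserve_original:
        out = (1.0 - alpha) * W_arr + alpha * propagated
    else:
        out = propagated
    return pd.DataFrame(out, index=W.index, columns=W.columns)


bacteria, viruses = ["b1", "b2"], ["v1", "v2"]
W = pd.DataFrame([[1.0, 0.0], [0.0, 0.0]], index=bacteria, columns=viruses)
K_bac = pd.DataFrame(0.5, index=bacteria, columns=bacteria)  # b1 and b2 fully shared
K_vir = pd.DataFrame(np.eye(2), index=viruses, columns=viruses)  # no sharing among viruses

W_smooth = smooth_crispr_sketch(W, K_bac, K_vir, alpha=0.5)
```

In this toy case the single b1-v1 link is partly propagated to b2, which is related to b1 through the bacterial kernel, while v2 stays at zero.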

capellini.utils.network_utils.study_outdir(study: str, subdir: str, output_root: Path) -> Path

Construct the output directory path for a study.

Parameters:
  • study – Study identifier string.

  • subdir – Sub-directory name within the study folder.

  • output_root – Root output directory.

Returns:

Path to the study sub-directory.

capellini.utils.network_utils.summarize_df(name: str, df: DataFrame) -> None

Print a short summary of a DataFrame shape and index uniqueness.

Parameters:
  • name – Label for the printout.

  • df – DataFrame to summarize.

capellini.utils.network_utils.transform_message_passing_smoothed_crispr(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) -> tuple[ndarray, ndarray]

Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.

Parameters:
  • V – Samples x viruses matrix (CLR-transformed).

  • B – Samples x bacteria matrix (CLR-transformed).

  • W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.

  • lam – Mixing weight (0 = no update, 1 = full cross-domain).

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star, B_star).
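
The description of lam (0 = no update, 1 = full cross-domain) suggests a convex combination of each domain's own signal and a message from the other domain. The sketch below illustrates that reading; the function name convex_message_passing_sketch, the row normalization of W, and the simultaneous update order are assumptions, not the documented algorithm.

```python
import numpy as np


def convex_message_passing_sketch(V, B, W, lam=0.1, n_steps=1, eps=1e-12):
    """Hypothetical convex mixing; W is viruses x bacteria."""
    # Assumption: row-normalize W in each direction so messages are weighted averages.
    W_vb = W / (W.sum(axis=1, keepdims=True) + eps)
    W_bv = W.T / (W.T.sum(axis=1, keepdims=True) + eps)
    for _ in range(n_steps):
        # Both domains are updated from the previous step's values.
        V_new = (1.0 - lam) * V + lam * (B @ W_vb.T)
        B_new = (1.0 - lam) * B + lam * (V @ W_bv.T)
        V, B = V_new, B_new
    return V, B


rng = np.random.default_rng(1)
V = rng.normal(size=(4, 2))  # samples x viruses
B = rng.normal(size=(4, 2))  # samples x bacteria
W_identity = np.eye(2)       # virus i linked only to bacterium i

V_same, B_same = convex_message_passing_sketch(V, B, W_identity, lam=0.0)
V_swap, B_swap = convex_message_passing_sketch(V, B, W_identity, lam=1.0)
```

The two extremes behave as the parameter description implies: lam=0 returns the inputs, and lam=1 with one-to-one links replaces each virus profile with its linked bacterium's profile and vice versa.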

capellini.utils.network_utils.transform_message_passing_smoothed_crispr_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) -> tuple[DataFrame, DataFrame, DataFrame]

DataFrame wrapper for non-residual convex message passing.

Parameters:
  • V_clr_df – Samples x viruses CLR-transformed abundance.

  • B_clr_df – Samples x bacteria CLR-transformed abundance.

  • W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.

  • lam – Mixing weight.

  • n_steps – Number of propagation steps.

  • preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star_df, B_star_df, X_star_df).