capellini.utils.network_utils

Network-level utilities: message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.

Functions

`aggregate_otu_columns_by_rank_skip_nan`(otu, ...)	Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
`align_abundance_from_metadata`(...[, ...])	Align viral and bacterial abundance using standardized metadata.
`assign_crispr`(cri_big, cri_s)	Fill matching rows/columns of a target matrix from a source CRISPR matrix.
`build_binary_crispr_matrix`(raw_crispr_path, ...)	Load a raw CRISPR network and build a binary bacteria x viruses matrix.
`build_smoothed_crispr_for_study`(...[, ...])	Build a smoothed CRISPR matrix for a single study.
`build_taxonomy_kernel_from_shared_ranks`(ids, ...)	Build a taxonomy kernel matrix using weighted shared rank agreement.
`build_xstar_from_smoothed_crispr`(V_df, B_df, ...)	Full non-residual X-star pipeline (convex message passing).
`build_xstar_from_smoothed_crispr_residual`(...)	Full residual additive X-star pipeline.
`crispr_matrix_aggregate_viruses`(df_crispr, ...)	Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.
`get_hierarchies`(df_b1, df_v1)	Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.
`orient_W_viruses_by_bacteria`(W_df, V_df, B_df)	Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.
`out_path`(study, subdir, filename, output_root)	Construct the full path to an output file within a study directory.
`prepare_bacteria_genus_abundance`(otu, tax[, ...])	Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.
`prevalence_filter_df`(df[, prevalence, verbose])	Keep only features present in at least prevalence * n_samples samples.
`remove_disease_columns_from_virus_abundance`(V)	Remove phenotype/metadata columns accidentally stored in viral abundance tables.
`residual_message_passing`(V, B, W_vh_smooth)	Residual additive cross-domain message passing using a smoothed CRISPR matrix.
`residual_message_passing_df`(V_clr_df, ...[, ...])	DataFrame wrapper for residual additive message passing.
`smooth_crispr_bac_vir`(crispr_df, K_bac, K_vir)	Smooth a CRISPR matrix using taxonomy kernels.
`study_outdir`(study, subdir, output_root)	Construct the output directory path for a study.
`summarize_df`(name, df)	Print a short summary of a DataFrame shape and index uniqueness.
`transform_message_passing_smoothed_crispr`(V, ...)	Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.
`transform_message_passing_smoothed_crispr_df`(...)	DataFrame wrapper for non-residual convex message passing.

capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) → DataFrame[source]

Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.

Parameters:

otu – Samples x ASVs abundance DataFrame.
tax – ASVs x ranks taxonomy DataFrame.
rank – Taxonomy column to aggregate to.

Returns:

Samples x rank-groups abundance DataFrame.
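
The aggregation described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the name aggregate_by_rank_sketch and the toy data are hypothetical, and it assumes that abundances of ASVs sharing a label are summed.

```python
import numpy as np
import pandas as pd


def aggregate_by_rank_sketch(otu: pd.DataFrame, tax: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Hypothetical stand-in; assumes grouped ASV abundances are summed."""
    # Look up the rank label of every ASV column (NaN if the ASV is not in tax).
    labels = tax[rank].reindex(otu.columns)
    keep = labels.notna().to_numpy()
    # Sum the columns that share a label; ASVs without a label are left out.
    grouped = otu.loc[:, keep].T.groupby(labels.to_numpy()[keep]).sum()
    return grouped.T


# Toy data: asv3 has no genus label and is therefore skipped.
otu = pd.DataFrame({"asv1": [1, 2], "asv2": [3, 4], "asv3": [5, 6]}, index=["s1", "s2"])
tax = pd.DataFrame({"genus": ["A", "A", np.nan]}, index=["asv1", "asv2", "asv3"])
genus_table = aggregate_by_rank_sketch(otu, tax, "genus")
```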

capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') → tuple[DataFrame, DataFrame, DataFrame][source]

Align viral and bacterial abundance using standardized metadata.

Parameters:

virus_abundance – Viral abundance DataFrame.
bacteria_abundance – Bacterial abundance DataFrame.
metadata – Metadata DataFrame with sample ID and keep columns.
keep_col – Column used to filter metadata rows.
virus_id_col – Metadata column for viral sample IDs.
bacteria_id_col – Metadata column for bacterial sample IDs.
final_index_col – Metadata column to use as the aligned index.

Returns:

Tuple (V, B, meta_aligned) with aligned, filtered abundance and metadata.
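
One plausible reading of this alignment is sketched below. It is an illustration under stated assumptions, not the library's code: align_sketch is a hypothetical name, keep_col is assumed to be boolean-like, and pairs whose sample IDs are missing from either abundance table are assumed to be dropped.

```python
import pandas as pd


def align_sketch(virus_abundance, bacteria_abundance, metadata,
                 keep_col="keep_for_analysis",
                 virus_id_col="virus_sample_id",
                 bacteria_id_col="bacteria_sample_id",
                 final_index_col="virus_sample_id"):
    """Hypothetical stand-in for metadata-driven alignment."""
    # Assumption: keep_col is boolean-like; keep only flagged rows.
    meta = metadata[metadata[keep_col].astype(bool)]
    # Assumption: pairs with an ID missing from either table are dropped.
    ok = (meta[virus_id_col].isin(virus_abundance.index)
          & meta[bacteria_id_col].isin(bacteria_abundance.index))
    meta = meta[ok]
    # Pull rows in metadata order, then give both tables the same index.
    V = virus_abundance.loc[meta[virus_id_col].tolist()].copy()
    B = bacteria_abundance.loc[meta[bacteria_id_col].tolist()].copy()
    new_index = pd.Index(meta[final_index_col].tolist())
    V.index = new_index
    B.index = new_index
    meta_aligned = meta.copy()
    meta_aligned.index = new_index
    return V, B, meta_aligned


meta = pd.DataFrame({
    "virus_sample_id": ["v1", "v2", "v3"],
    "bacteria_sample_id": ["b1", "b2", "b3"],
    "keep_for_analysis": [True, False, True],
})
V_raw = pd.DataFrame({"phage": [1.0, 2.0, 3.0]}, index=["v1", "v2", "v3"])
B_raw = pd.DataFrame({"bact": [10.0, 20.0, 30.0]}, index=["b3", "b2", "b1"])
V_al, B_al, meta_al = align_sketch(V_raw, B_raw, meta)
```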

capellini.utils.network_utils.assign_crispr(cri_big: DataFrame, cri_s: DataFrame) → DataFrame[source]

Fill matching rows/columns of a target matrix from a source CRISPR matrix.

Parameters:

cri_big – Target (full-size) matrix.
cri_s – Source CRISPR matrix (subset).

Returns:

Updated copy of cri_big.
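
A label-based fill of this kind might look like the sketch below. The name assign_sketch is hypothetical, and the behavior for labels that appear only in the source (ignored here) is an assumption.

```python
import numpy as np
import pandas as pd


def assign_sketch(cri_big: pd.DataFrame, cri_s: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: copy overlapping cells from cri_s into a copy of cri_big."""
    out = cri_big.copy()
    # Only labels present in both matrices are touched.
    rows = out.index.intersection(cri_s.index)
    cols = out.columns.intersection(cri_s.columns)
    out.loc[rows, cols] = cri_s.loc[rows, cols].to_numpy()
    return out


big = pd.DataFrame(np.zeros((3, 3)), index=["b1", "b2", "b3"], columns=["v1", "v2", "v3"])
small = pd.DataFrame([[1.0, 0.0], [1.0, 1.0]], index=["b2", "bX"], columns=["v1", "v3"])
filled = assign_sketch(big, small)
```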

capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) → DataFrame[source]

Load a raw CRISPR network and build a binary bacteria x viruses matrix.

Parameters:

raw_crispr_path – Path to the CSV of the raw CRISPR network.
bacteria_features – Ordered list of bacteria feature IDs.
virus_features – Ordered list of virus feature IDs.
transpose_after_load – If True, transpose the loaded matrix (contigs were rows).

Returns:

Binary bacteria x viruses DataFrame.

capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) → dict[str, DataFrame][source]

Build a smoothed CRISPR matrix for a single study.

Parameters:

raw_crispr_path – Path to the raw CRISPR network CSV.
bacteria_features – Bacteria feature IDs (from processed abundance).
virus_features – Virus feature IDs (from processed abundance).
tax_bac_path – Path to bacteria taxonomy CSV.
tax_vir_path – Path to virus taxonomy CSV.
bacterial_ranks – Bacterial taxonomy rank columns for kernel.
viral_ranks – Viral taxonomy rank columns for kernel.
bacterial_weights – Per-rank weights for bacteria kernel.
viral_weights – Per-rank weights for virus kernel.
alpha – CRISPR smoothing weight.
transpose_after_load – Passed to build_binary_crispr_matrix.

Returns:

Dict with crispr_binary, K_bac, K_vir, crispr_smooth, bac_tax_aligned, vir_tax_aligned.

capellini.utils.network_utils.build_taxonomy_kernel_from_shared_ranks(ids, tax_df: DataFrame, ranks, weights=None, fill_value: str = '', normalize_rows: bool = True, strict: bool = False) → tuple[DataFrame, DataFrame][source]

Build a taxonomy kernel matrix using weighted shared rank agreement.

K[i,j] = sum_k weights[k] * I(rank_k[i] == rank_k[j]), where rank k contributes only if rank_k[i] != fill_value. The diagonal is fixed at K[i,i] = 1.

Parameters:

ids – Ordered collection of feature IDs to include.
tax_df – Taxonomy DataFrame indexed by the same IDs.
ranks – Ordered list of taxonomy rank column names.
weights – Per-rank weights (default: 1..n). Normalized to sum to 1.
fill_value – Value treated as missing (no match score).
normalize_rows – Row-normalize the kernel.
strict – Raise ValueError if any IDs are missing from tax_df.

Returns:

Tuple (K DataFrame, aligned taxonomy DataFrame).
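
The kernel formula above can be written out in NumPy as follows. This is a sketch under assumptions, not the library's code: taxonomy_kernel_sketch is a hypothetical name, the diagonal is assumed to be set before row normalization, and the strict and ID-alignment logic is omitted.

```python
import numpy as np
import pandas as pd


def taxonomy_kernel_sketch(tax, ranks, weights=None, fill_value="", normalize_rows=True):
    """Hypothetical stand-in for the weighted shared-rank kernel."""
    # Default weights 1..n, then normalize so they sum to 1.
    if weights is None:
        w = np.arange(1, len(ranks) + 1, dtype=float)
    else:
        w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(tax)
    K = np.zeros((n, n))
    for wk, rank in zip(w, ranks):
        labels = tax[rank].fillna(fill_value).astype(str).to_numpy()
        same = labels[:, None] == labels[None, :]
        known = labels != fill_value
        # A rank contributes only where taxon i has a non-missing label.
        K += wk * (same & known[:, None])
    np.fill_diagonal(K, 1.0)
    if normalize_rows:
        K = K / K.sum(axis=1, keepdims=True)
    return pd.DataFrame(K, index=tax.index, columns=tax.index)


tax_demo = pd.DataFrame(
    {"family": ["F1", "F1", "F2"], "genus": ["G1", "G2", ""]},
    index=["a", "b", "c"],
)
K_raw = taxonomy_kernel_sketch(tax_demo, ["family", "genus"], normalize_rows=False)
K_norm = taxonomy_kernel_sketch(tax_demo, ["family", "genus"])
```

In the toy data, taxa a and b share only the family, so with default weights (1, 2) normalized to (1/3, 2/3) their unnormalized similarity is 1/3.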

capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) → dict[str, DataFrame][source]

Full non-residual X-star pipeline (convex message passing).

Parameters:

V_df – Samples x viruses raw abundance.
B_df – Samples x bacteria raw abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
pseudocount – CLR pseudocount.
lam – Mixing weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
eps – Numerical stability constant.

Returns:

Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
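
The V_clr and B_clr entries suggest a centered log-ratio (CLR) transform applied with the pseudocount parameter. The sketch below shows the standard CLR definition; clr_sketch is a hypothetical name, and whether the library applies the pseudocount in exactly this way is an assumption.

```python
import numpy as np
import pandas as pd


def clr_sketch(df: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """Standard CLR: log abundances minus the per-sample mean log abundance."""
    logx = np.log(df.to_numpy(dtype=float) + pseudocount)
    centered = logx - logx.mean(axis=1, keepdims=True)
    return pd.DataFrame(centered, index=df.index, columns=df.columns)


raw = pd.DataFrame({"t1": [10.0, 0.0], "t2": [5.0, 3.0], "t3": [1.0, 7.0]}, index=["s1", "s2"])
clr = clr_sketch(raw)
```

By construction, every row of a CLR-transformed table sums to zero (up to floating-point error).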

capellini.utils.network_utils.build_xstar_from_smoothed_crispr_residual(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) → dict[str, DataFrame][source]

Full residual additive X-star pipeline.

Parameters:

V_df – Samples x viruses raw abundance.
B_df – Samples x bacteria raw abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
pseudocount – CLR pseudocount.
lam – Additive weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
eps – Numerical stability constant.

Returns:

Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.

capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0', vir_id_col=None, dropna_vir: bool = True, dtype=int) → DataFrame[source]

Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.

Parameters:

df_crispr – Raw SpacePHARER predictions DataFrame (no header).
vir_tax – Virus taxonomy DataFrame.
bac_col – Column index for bacterial spacer IDs.
vir_col – Column index for viral contig IDs.
vir_rank – Viral taxonomy rank column to aggregate to.
vir_id_col – Optional viral ID column in vir_tax; uses index if None.
dropna_vir – Drop viruses not found in taxonomy.
dtype – Output matrix dtype.

Returns:

Bacteria x viral_groups crosstab matrix.
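
A crosstab of this shape can be built with pandas.crosstab. The sketch below is illustrative only: crosstab_sketch is a hypothetical name, the cells are assumed to hold hit counts, and viruses missing from the taxonomy are assumed to be dropped (as with dropna_vir=True).

```python
import pandas as pd


def crosstab_sketch(df_crispr, vir_tax, bac_col=0, vir_col=1, vir_rank="lev0"):
    """Hypothetical stand-in: count hits per (bacterium, viral group)."""
    bac = df_crispr.iloc[:, bac_col]
    # Map each viral contig ID to its taxonomy group; unknown contigs become NaN.
    groups = df_crispr.iloc[:, vir_col].map(vir_tax[vir_rank])
    known = groups.notna()
    return pd.crosstab(
        bac[known].to_numpy(),
        groups[known].to_numpy(),
        rownames=["bacteria"],
        colnames=["viral_group"],
    )


hits = pd.DataFrame([["b1", "c1"], ["b1", "c2"], ["b2", "c3"], ["b2", "cX"]])
vir_tax_demo = pd.DataFrame({"lev0": ["g1", "g1", "g2"]}, index=["c1", "c2", "c3"])
ct = crosstab_sketch(hits, vir_tax_demo)
```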

capellini.utils.network_utils.get_hierarchies(df_b1: DataFrame, df_v1: DataFrame) → tuple[DataFrame, DataFrame][source]

Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.

Parameters:

df_b1 – Bacteria taxonomy DataFrame with progenomes_taxid_genus column.
df_v1 – Virus taxonomy DataFrame with lev0 column.

Returns:

Tuple (bacteria_tax, virus_tax) with cleaned indexes.

capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) → DataFrame[source]

Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.

Parameters:

W_df – CRISPR matrix (may be bacteria x viruses or viruses x bacteria).
V_df – Viral abundance (samples x viruses).
B_df – Bacterial abundance (samples x bacteria).
verbose – Print overlap counts.

Returns:

W_df in viruses x bacteria orientation.
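
Orientation detection by label overlap could be sketched as follows. The name orient_sketch and the tie-breaking rule (keep the matrix as-is on a tie) are assumptions, not the library's documented behavior.

```python
import pandas as pd


def orient_sketch(W_df: pd.DataFrame, V_df: pd.DataFrame, B_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: return W_df as viruses x bacteria based on label overlap."""
    vir, bac = set(V_df.columns), set(B_df.columns)
    # Overlap if rows are viruses and columns are bacteria (already oriented).
    as_is = len(vir & set(W_df.index)) + len(bac & set(W_df.columns))
    # Overlap if rows are bacteria and columns are viruses (needs a transpose).
    flipped = len(vir & set(W_df.columns)) + len(bac & set(W_df.index))
    return W_df if as_is >= flipped else W_df.T


V_demo = pd.DataFrame({"v1": [0.1], "v2": [0.2]}, index=["s1"])
B_demo = pd.DataFrame({"b1": [0.3], "b2": [0.4], "b3": [0.5]}, index=["s1"])
W_bac_by_vir = pd.DataFrame(0.0, index=["b1", "b2", "b3"], columns=["v1", "v2"])
W_oriented = orient_sketch(W_bac_by_vir, V_demo, B_demo)
```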

capellini.utils.network_utils.out_path(study: str, subdir: str, filename: str, output_root: Path) → Path[source]

Construct the full path to an output file within a study directory.

Parameters:

study – Study identifier string.
subdir – Sub-directory name.
filename – File name.
output_root – Root output directory.

Returns:

Full Path to the output file.

capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') → DataFrame[source]

Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.

Parameters:

otu – Samples x ASVs raw abundance.
tax – ASV taxonomy DataFrame with a column matching rank.
rank – Taxonomy column to aggregate to.

Returns:

Samples x genus-level abundance DataFrame with sanitized column names.

capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) → DataFrame[source]

Keep only features present in at least prevalence * n_samples samples.

Parameters:

df – Samples x features DataFrame.
prevalence – Minimum fractional prevalence threshold.
verbose – Print kept/total feature count.

Returns:

Filtered DataFrame.
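
The threshold rule can be expressed in a few lines of pandas. In this sketch, prevalence_filter_sketch is a hypothetical name and "present" is assumed to mean a value strictly greater than zero.

```python
import pandas as pd


def prevalence_filter_sketch(df: pd.DataFrame, prevalence: float = 0.1) -> pd.DataFrame:
    """Hypothetical stand-in; assumes presence means abundance > 0."""
    n_present = (df > 0).sum(axis=0)
    keep = n_present >= prevalence * len(df)
    return df.loc[:, keep]


abund = pd.DataFrame(
    {"f1": [1, 0, 0, 0], "f2": [2, 3, 0, 1], "f3": [0, 0, 0, 0]},
    index=["s1", "s2", "s3", "s4"],
)
filtered = prevalence_filter_sketch(abund, prevalence=0.5)
```

With four samples and prevalence=0.5, a feature must be present in at least two samples, so only f2 survives in the toy table.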

capellini.utils.network_utils.remove_disease_columns_from_virus_abundance(V: DataFrame) → DataFrame[source]

Remove phenotype/metadata columns accidentally stored in viral abundance tables.

Parameters:

V – Viral abundance DataFrame.

Returns:

Cleaned copy without non-feature columns.

capellini.utils.network_utils.residual_message_passing(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) → tuple[ndarray, ndarray][source]

Residual additive cross-domain message passing using a smoothed CRISPR matrix.

Parameters:

V – Samples x viruses CLR matrix.
B – Samples x bacteria CLR matrix.
W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.
lam – Additive weight for cross-domain messages.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star, B_star).
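
The reference above does not state the update rule, so the following is only one plausible form of a residual additive update. Everything in it is an assumption: the name residual_sketch, the row and column normalization of W_vh_smooth, the simultaneous update of both domains, and the eps constant. The preserve_scale option is omitted.

```python
import numpy as np


def residual_sketch(V, B, W, lam=0.1, n_steps=1, eps=1e-12):
    """Hypothetical residual update: X_star = X + lam * message from the other domain."""
    # Assumption: each virus averages over its hosts, each bacterium over its viruses.
    W_rows = W / (W.sum(axis=1, keepdims=True) + eps)
    W_cols = W / (W.sum(axis=0, keepdims=True) + eps)
    V_star, B_star = V.copy(), B.copy()
    for _ in range(n_steps):
        # Both domains are updated from the previous step's values.
        V_new = V_star + lam * (B_star @ W_rows.T)
        B_new = B_star + lam * (V_star @ W_cols)
        V_star, B_star = V_new, B_new
    return V_star, B_star


V0 = np.array([[1.0]])   # 1 sample x 1 virus
B0 = np.array([[2.0]])   # 1 sample x 1 bacterium
W0 = np.array([[1.0]])   # 1 virus x 1 bacterium
V1, B1 = residual_sketch(V0, B0, W0, lam=0.1)
```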

capellini.utils.network_utils.residual_message_passing_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) → tuple[DataFrame, DataFrame, DataFrame][source]

DataFrame wrapper for residual additive message passing.

Parameters:

V_clr_df – Samples x viruses CLR-transformed abundance.
B_clr_df – Samples x bacteria CLR-transformed abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
lam – Additive weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star_df, B_star_df, X_star_df).

capellini.utils.network_utils.smooth_crispr_bac_vir(crispr_df: DataFrame, K_bac: DataFrame, K_vir: DataFrame, alpha: float = 1.0, preserve_original: bool = True) → DataFrame[source]

Smooth a CRISPR matrix using taxonomy kernels.

W_smooth = (1 - alpha) * W + alpha * K_bac @ W @ K_vir

Parameters:

crispr_df – Bacteria x viruses binary CRISPR matrix.
K_bac – Bacteria taxonomy kernel.
K_vir – Virus taxonomy kernel.
alpha – Smoothing weight (1.0 = full kernel propagation).
preserve_original – If True, blend with original; if False, use propagated only.

Returns:

Smoothed CRISPR DataFrame.
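
The smoothing formula above translates directly into NumPy. This is a sketch rather than the library's code: smooth_sketch is a hypothetical name, and the kernels are assumed to be indexed by the same labels as the CRISPR matrix.

```python
import numpy as np
import pandas as pd


def smooth_sketch(crispr_df, K_bac, K_vir, alpha=1.0, preserve_original=True):
    """Hypothetical stand-in for W_smooth = (1 - alpha) * W + alpha * K_bac @ W @ K_vir."""
    W = crispr_df.to_numpy(dtype=float)
    # Align kernels to the matrix labels before multiplying.
    Kb = K_bac.loc[crispr_df.index, crispr_df.index].to_numpy(dtype=float)
    Kv = K_vir.loc[crispr_df.columns, crispr_df.columns].to_numpy(dtype=float)
    propagated = Kb @ W @ Kv
    if preserve_original:
        smooth = (1.0 - alpha) * W + alpha * propagated
    else:
        smooth = propagated
    return pd.DataFrame(smooth, index=crispr_df.index, columns=crispr_df.columns)


bac_ids, vir_ids = ["b1", "b2"], ["v1", "v2"]
W_bin = pd.DataFrame([[1, 0], [0, 0]], index=bac_ids, columns=vir_ids)
K_b = pd.DataFrame([[0.5, 0.5], [0.5, 0.5]], index=bac_ids, columns=bac_ids)
K_v = pd.DataFrame(np.eye(2), index=vir_ids, columns=vir_ids)
W_half = smooth_sketch(W_bin, K_b, K_v, alpha=0.5)
```

In the toy example the single b1-v1 link is partly shared with b2, because the bacteria kernel treats b1 and b2 as equally similar.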

capellini.utils.network_utils.study_outdir(study: str, subdir: str, output_root: Path) → Path[source]

Construct the output directory path for a study.

Parameters:

study – Study identifier string.
subdir – Sub-directory name within the study folder.
output_root – Root output directory.

Returns:

Path to the study sub-directory.

capellini.utils.network_utils.summarize_df(name: str, df: DataFrame) → None[source]

Print a short summary of a DataFrame shape and index uniqueness.

Parameters:

name – Label for the printout.
df – DataFrame to summarize.

capellini.utils.network_utils.transform_message_passing_smoothed_crispr(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) → tuple[ndarray, ndarray][source]

Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.

Parameters:

V – Samples x viruses matrix (CLR-transformed).
B – Samples x bacteria matrix (CLR-transformed).
W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.
lam – Mixing weight (0 = no update, 1 = full cross-domain).
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star, B_star).
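
The description of lam (0 = no update, 1 = full cross-domain) is consistent with a convex combination of each matrix and a message from the other domain. The sketch below shows one such form. It is not taken from the library: the name convex_sketch, the normalization of W_vh_smooth, the simultaneous update, and eps are all assumptions, and preserve_scale is omitted.

```python
import numpy as np


def convex_sketch(V, B, W, lam=0.1, n_steps=1, eps=1e-12):
    """Hypothetical convex update: X_star = (1 - lam) * X + lam * message."""
    # Assumption: each virus averages over its hosts, each bacterium over its viruses.
    W_rows = W / (W.sum(axis=1, keepdims=True) + eps)
    W_cols = W / (W.sum(axis=0, keepdims=True) + eps)
    V_star, B_star = V.copy(), B.copy()
    for _ in range(n_steps):
        # Both domains are updated from the previous step's values.
        V_new = (1.0 - lam) * V_star + lam * (B_star @ W_rows.T)
        B_new = (1.0 - lam) * B_star + lam * (V_star @ W_cols)
        V_star, B_star = V_new, B_new
    return V_star, B_star


V_in = np.array([[1.0, 0.0]])    # 1 sample x 2 viruses
B_in = np.array([[2.0]])         # 1 sample x 1 bacterium
W_in = np.array([[1.0], [0.0]])  # 2 viruses x 1 bacterium; only virus 0 is linked
V_out, B_out = convex_sketch(V_in, B_in, W_in, lam=0.5)
```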

capellini.utils.network_utils.transform_message_passing_smoothed_crispr_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) → tuple[DataFrame, DataFrame, DataFrame][source]

DataFrame wrapper for non-residual convex message passing.

Parameters:

V_clr_df – Samples x viruses CLR-transformed abundance.
B_clr_df – Samples x bacteria CLR-transformed abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
lam – Mixing weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.

Returns:

Tuple (V_star_df, B_star_df, X_star_df).