capellini.utils.network_utils
Network-level utilities: message passing, CRISPR smoothing, taxonomy kernels, abundance helpers.
Functions
- aggregate_otu_columns_by_rank_skip_nan – Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
- align_abundance_from_metadata – Align viral and bacterial abundance using standardized metadata.
- assign_crispr – Fill matching rows/columns of a target matrix from a source CRISPR matrix.
- build_binary_crispr_matrix – Load a raw CRISPR network and build a binary bacteria x viruses matrix.
- build_smoothed_crispr_for_study – Build a smoothed CRISPR matrix for a single study.
- Build a taxonomy kernel matrix using weighted shared rank agreement.
- build_xstar_from_smoothed_crispr – Full non-residual X-star pipeline (convex message passing).
- build_xstar_from_smoothed_crispr_residual – Full residual additive X-star pipeline.
- crispr_matrix_aggregate_viruses – Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.
- get_hierarchies – Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.
- orient_W_viruses_by_bacteria – Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.
- out_path – Construct the full path to an output file within a study directory.
- prepare_bacteria_genus_abundance – Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.
- prevalence_filter_df – Keep only features present in at least prevalence * n_samples samples.
- remove_disease_columns_from_virus_abundance – Remove phenotype/metadata columns accidentally stored in viral abundance tables.
- residual_message_passing – Residual additive cross-domain message passing using a smoothed CRISPR matrix.
- residual_message_passing_df – DataFrame wrapper for residual additive message passing.
- smooth_crispr_bac_vir – Smooth a CRISPR matrix using taxonomy kernels.
- study_outdir – Construct the output directory path for a study.
- summarize_df – Print a short summary of a DataFrame shape and index uniqueness.
- transform_message_passing_smoothed_crispr – Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.
- transform_message_passing_smoothed_crispr_df – DataFrame wrapper for non-residual convex message passing.
- capellini.utils.network_utils.aggregate_otu_columns_by_rank_skip_nan(otu: DataFrame, tax: DataFrame, rank: str) DataFrame[source]
Aggregate OTU/ASV columns to a taxonomy rank, skipping NaN labels.
- Parameters:
otu – Samples x ASVs abundance DataFrame.
tax – ASVs x ranks taxonomy DataFrame.
rank – Taxonomy column to aggregate to.
- Returns:
Samples x rank-groups abundance DataFrame.
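The aggregation described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name `aggregate_columns_by_rank` and the toy data are hypothetical, and summing is assumed as the aggregation.

```python
import numpy as np
import pandas as pd


def aggregate_columns_by_rank(otu: pd.DataFrame, tax: pd.DataFrame, rank: str) -> pd.DataFrame:
    """Sum ASV columns sharing a label at `rank`; ASVs with a NaN label are dropped."""
    labels = tax[rank].reindex(otu.columns)   # one label per ASV column
    keep = labels.notna()                     # ASVs that have a usable label
    kept = otu.loc[:, keep.to_numpy()]
    # Transpose so ASVs are rows, group rows by label, sum, transpose back.
    return kept.T.groupby(labels[keep].to_numpy()).sum().T


otu = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["s1", "s2"],
    columns=["asv1", "asv2", "asv3"],
)
tax = pd.DataFrame({"genus": ["A", "A", np.nan]}, index=["asv1", "asv2", "asv3"])
agg = aggregate_columns_by_rank(otu, tax, "genus")
print(agg)
```

Here `asv3` has no genus label, so it contributes to no group rather than forming a NaN column.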
- capellini.utils.network_utils.align_abundance_from_metadata(virus_abundance: DataFrame, bacteria_abundance: DataFrame, metadata: DataFrame, *, keep_col: str = 'keep_for_analysis', virus_id_col: str = 'virus_sample_id', bacteria_id_col: str = 'bacteria_sample_id', final_index_col: str = 'virus_sample_id') tuple[DataFrame, DataFrame, DataFrame][source]
Align viral and bacterial abundance using standardized metadata.
- Parameters:
virus_abundance – Viral abundance DataFrame.
bacteria_abundance – Bacterial abundance DataFrame.
metadata – Metadata DataFrame with sample ID and keep columns.
keep_col – Column used to filter metadata rows.
virus_id_col – Metadata column for viral sample IDs.
bacteria_id_col – Metadata column for bacterial sample IDs.
final_index_col – Metadata column to use as the aligned index.
- Returns:
Tuple (V, B, meta_aligned) with aligned, filtered abundance and metadata.
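One way such an alignment could work is sketched below. The helper `align_from_metadata` and the toy tables are hypothetical; the sketch assumes the keep column is boolean-like and that rows whose sample IDs are missing from either table are dropped, which the documentation does not state.

```python
import pandas as pd


def align_from_metadata(V, B, meta, keep_col="keep_for_analysis",
                        virus_id_col="virus_sample_id",
                        bacteria_id_col="bacteria_sample_id"):
    """Return (V, B, meta) restricted to kept samples and sharing one index."""
    m = meta[meta[keep_col].astype(bool)]
    # Assumption: keep only pairs present in both abundance tables.
    m = m[m[virus_id_col].isin(V.index) & m[bacteria_id_col].isin(B.index)]
    V_al = V.loc[m[virus_id_col].tolist()]
    B_al = B.loc[m[bacteria_id_col].tolist()].copy()
    B_al.index = V_al.index                  # use the viral sample ID as final index
    meta_al = m.set_index(virus_id_col, drop=False)
    return V_al, B_al, meta_al


V = pd.DataFrame({"phage1": [1, 2, 3]}, index=["v1", "v2", "v3"])
B = pd.DataFrame({"genusA": [4, 5, 6]}, index=["b1", "b2", "b3"])
meta = pd.DataFrame({
    "virus_sample_id": ["v1", "v2", "v3"],
    "bacteria_sample_id": ["b1", "b2", "b3"],
    "keep_for_analysis": [True, False, True],
})
V_al, B_al, meta_al = align_from_metadata(V, B, meta)
```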
- capellini.utils.network_utils.assign_crispr(cri_big: DataFrame, cri_s: DataFrame) DataFrame[source]
Fill matching rows/columns of a target matrix from a source CRISPR matrix.
- Parameters:
cri_big – Target (full-size) matrix.
cri_s – Source CRISPR matrix (subset).
- Returns:
Updated copy of cri_big.
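The fill operation can be pictured as copying the overlapping block of labels from the source into a copy of the target. The sketch below assumes exactly that behavior; `fill_from_source` is a hypothetical stand-in, not the library function.

```python
import numpy as np
import pandas as pd


def fill_from_source(target: pd.DataFrame, source: pd.DataFrame) -> pd.DataFrame:
    """Copy source values into a copy of target wherever row and column labels match."""
    out = target.copy()
    rows = out.index.intersection(source.index)
    cols = out.columns.intersection(source.columns)
    out.loc[rows, cols] = source.loc[rows, cols].to_numpy()
    return out


target = pd.DataFrame(
    np.zeros((3, 3), dtype=int),
    index=["b1", "b2", "b3"],
    columns=["v1", "v2", "v3"],
)
source = pd.DataFrame([[1, 0], [1, 1]], index=["b2", "b4"], columns=["v1", "v3"])
filled = fill_from_source(target, source)
```

Labels that exist only in the source (`b4` here) are ignored, and the original target is left unmodified.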
- capellini.utils.network_utils.build_binary_crispr_matrix(raw_crispr_path: str, bacteria_features, virus_features, transpose_after_load: bool = True) DataFrame[source]
Load a raw CRISPR network and build a binary bacteria x viruses matrix.
- Parameters:
raw_crispr_path – Path to the CSV of the raw CRISPR network.
bacteria_features – Ordered list of bacteria feature IDs.
virus_features – Ordered list of virus feature IDs.
transpose_after_load – If True, transpose the loaded matrix (for files that store contigs as rows).
- Returns:
Binary bacteria x viruses DataFrame.
- capellini.utils.network_utils.build_smoothed_crispr_for_study(raw_crispr_path: str, bacteria_features, virus_features, tax_bac_path: str, tax_vir_path: str, bacterial_ranks, viral_ranks, bacterial_weights, viral_weights, alpha: float = 0.95, transpose_after_load: bool = True) dict[str, DataFrame][source]
Build a smoothed CRISPR matrix for a single study.
- Parameters:
raw_crispr_path – Path to the raw CRISPR network CSV.
bacteria_features – Bacteria feature IDs (from processed abundance).
virus_features – Virus feature IDs (from processed abundance).
tax_bac_path – Path to bacteria taxonomy CSV.
tax_vir_path – Path to virus taxonomy CSV.
bacterial_ranks – Bacterial taxonomy rank columns for kernel.
viral_ranks – Viral taxonomy rank columns for kernel.
bacterial_weights – Per-rank weights for bacteria kernel.
viral_weights – Per-rank weights for virus kernel.
alpha – CRISPR smoothing weight.
transpose_after_load – Passed to build_binary_crispr_matrix.
- Returns:
Dict with crispr_binary, K_bac, K_vir, crispr_smooth, bac_tax_aligned, vir_tax_aligned.
Build a taxonomy kernel matrix using weighted shared rank agreement.
K[i,j] = sum_k weights[k] * I(rank_k[i] == rank_k[j]), where rank k contributes only if rank_k[i] != fill_value. The diagonal is fixed at K[i,i] = 1.
- Parameters:
ids – Ordered collection of feature IDs to include.
tax_df – Taxonomy DataFrame indexed by the same IDs.
ranks – Ordered list of taxonomy rank column names.
weights – Per-rank weights (default: 1..n). Normalized to sum to 1.
fill_value – Value treated as missing (no match score).
normalize_rows – Row-normalize the kernel.
strict – Raise ValueError if any IDs are missing from tax_df.
- Returns:
Tuple (K DataFrame, aligned taxonomy DataFrame).
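The kernel formula above can be sketched in a few lines of NumPy. This is an illustration under assumptions: the name `taxonomy_kernel`, the default `fill_value="unknown"`, and the toy taxonomy are hypothetical, and the `normalize_rows` and `strict` options are omitted.

```python
import numpy as np
import pandas as pd


def taxonomy_kernel(ids, tax_df, ranks, weights=None, fill_value="unknown"):
    """K[i, j] = sum_k w[k] * I(rank_k[i] == rank_k[j]), skipping missing labels."""
    tax = tax_df.reindex(ids)[ranks].fillna(fill_value)
    if weights is None:
        w = np.arange(1, len(ranks) + 1, dtype=float)   # default weights 1..n
    else:
        w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                     # normalize to sum to 1
    K = np.zeros((len(ids), len(ids)))
    for wk, rank in zip(w, ranks):
        labels = tax[rank].to_numpy()
        same = labels[:, None] == labels[None, :]       # pairwise label agreement
        known = (tax[rank] != fill_value).to_numpy()    # rows with a real label
        K += wk * (same & known[:, None])
    np.fill_diagonal(K, 1.0)
    return pd.DataFrame(K, index=ids, columns=ids), tax


ids = ["a", "b", "c"]
tax_df = pd.DataFrame(
    {"family": ["F1", "F1", "F2"], "genus": ["G1", "G2", None]},
    index=ids,
)
K, tax_aligned = taxonomy_kernel(ids, tax_df, ranks=["family", "genus"])
```

With the default weights, family carries 1/3 and genus 2/3, so `a` and `b` (same family, different genus) score 1/3.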
- capellini.utils.network_utils.build_xstar_from_smoothed_crispr(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) dict[str, DataFrame][source]
Full non-residual X-star pipeline (convex message passing).
- Parameters:
V_df – Samples x viruses raw abundance.
B_df – Samples x bacteria raw abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
pseudocount – CLR pseudocount.
lam – Mixing weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
eps – Numerical stability constant.
- Returns:
Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
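The `pseudocount` parameter refers to the centered log-ratio (CLR) transform that precedes message passing. A minimal CLR sketch follows; the helper `clr` is hypothetical, and applying CLR to each domain separately before concatenating into `X_clr` is an assumption, not something this page confirms.

```python
import numpy as np
import pandas as pd


def clr(df: pd.DataFrame, pseudocount: float = 1e-6) -> pd.DataFrame:
    """Centered log-ratio per sample: log(x + pseudocount) minus its row mean."""
    logged = np.log(df.to_numpy(dtype=float) + pseudocount)
    centered = logged - logged.mean(axis=1, keepdims=True)
    return pd.DataFrame(centered, index=df.index, columns=df.columns)


V = pd.DataFrame([[10, 0], [5, 5]], index=["s1", "s2"], columns=["v1", "v2"])
B = pd.DataFrame([[1, 2, 3], [0, 4, 4]], index=["s1", "s2"], columns=["b1", "b2", "b3"])
V_clr, B_clr = clr(V), clr(B)
# Assumed layout of the combined matrix: viruses first, then bacteria.
X_clr = pd.concat([V_clr, B_clr], axis=1)
```

The pseudocount keeps `log` finite for zero counts; each CLR row sums to zero by construction.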
- capellini.utils.network_utils.build_xstar_from_smoothed_crispr_residual(V_df: DataFrame, B_df: DataFrame, W_vh_smooth_df: DataFrame, pseudocount: float = 1e-06, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False, eps: float = 1e-12) dict[str, DataFrame][source]
Full residual additive X-star pipeline.
- Parameters:
V_df – Samples x viruses raw abundance.
B_df – Samples x bacteria raw abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
pseudocount – CLR pseudocount.
lam – Additive weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
eps – Numerical stability constant.
- Returns:
Dict with keys V_raw_aligned, B_raw_aligned, V_clr, B_clr, X_clr, V_star, B_star, X_star, W_smooth_aligned.
- capellini.utils.network_utils.crispr_matrix_aggregate_viruses(df_crispr: DataFrame, vir_tax: DataFrame, *, bac_col: int = 0, vir_col: int = 1, vir_rank: str = 'lev0', vir_id_col=None, dropna_vir: bool = True, dtype=<class 'int'>) DataFrame[source]
Parse a SpacePHARER output TSV and aggregate viruses by taxonomy rank.
- Parameters:
df_crispr – Raw SpacePHARER predictions DataFrame (no header).
vir_tax – Virus taxonomy DataFrame.
bac_col – Column index for bacterial spacer IDs.
vir_col – Column index for viral contig IDs.
vir_rank – Viral taxonomy rank column to aggregate to.
vir_id_col – Optional viral ID column in vir_tax; uses index if None.
dropna_vir – Drop viruses not found in taxonomy.
dtype – Output matrix dtype.
- Returns:
Bacteria x viral_groups crosstab matrix.
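The aggregation step can be approximated with `pandas.crosstab`. The sketch below uses hypothetical toy data and assumes the taxonomy is indexed by contig ID (the `vir_id_col=None` case); whether the real function keeps hit counts or binarizes them is not stated here, so counts are kept.

```python
import pandas as pd

# Headerless predictions: column 0 = bacterial ID, column 1 = viral contig ID.
pairs = pd.DataFrame([
    ["bac1", "c1"],
    ["bac1", "c2"],
    ["bac2", "c3"],
    ["bac2", "c9"],   # c9 is absent from the taxonomy
])
vir_tax = pd.DataFrame({"lev0": ["g1", "g1", "g2"]}, index=["c1", "c2", "c3"])

groups = pairs[1].map(vir_tax["lev0"])   # contig -> viral group (NaN if unknown)
mask = groups.notna()                    # mimics dropna_vir=True
mat = pd.crosstab(pairs.loc[mask, 0], groups[mask]).astype(int)
print(mat)
```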
- capellini.utils.network_utils.get_hierarchies(df_b1: DataFrame, df_v1: DataFrame) tuple[DataFrame, DataFrame][source]
Prepare bacteria and virus taxonomy hierarchies for CRISPR smoothing.
- Parameters:
df_b1 – Bacteria taxonomy DataFrame with progenomes_taxid_genus column.
df_v1 – Virus taxonomy DataFrame with lev0 column.
- Returns:
Tuple (bacteria_tax, virus_tax) with cleaned indexes.
- capellini.utils.network_utils.orient_W_viruses_by_bacteria(W_df: DataFrame, V_df: DataFrame, B_df: DataFrame, verbose: bool = True) DataFrame[source]
Detect and correct the orientation of a CRISPR matrix to be viruses x bacteria.
- Parameters:
W_df – CRISPR matrix (may be bacteria x viruses or viruses x bacteria).
V_df – Viral abundance (samples x viruses).
B_df – Bacterial abundance (samples x bacteria).
verbose – Print overlap counts.
- Returns:
W_df in viruses x bacteria orientation.
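Orientation detection can be done by counting label overlaps in both possible layouts and keeping the better one. The following is a sketch under that assumption; `orient_viruses_by_bacteria` is a hypothetical helper, not the library code.

```python
import numpy as np
import pandas as pd


def orient_viruses_by_bacteria(W: pd.DataFrame, virus_ids, bacteria_ids) -> pd.DataFrame:
    """Return W with viruses as rows, transposing if that layout matches better."""
    as_is = (len(W.index.intersection(virus_ids))
             + len(W.columns.intersection(bacteria_ids)))
    flipped = (len(W.index.intersection(bacteria_ids))
               + len(W.columns.intersection(virus_ids)))
    return W if as_is >= flipped else W.T


virus_ids = ["v1", "v2", "v3"]
bacteria_ids = ["b1", "b2"]
# Stored as bacteria x viruses, so a transpose is expected.
W = pd.DataFrame(np.zeros((2, 3)), index=bacteria_ids, columns=virus_ids)
oriented = orient_viruses_by_bacteria(W, virus_ids, bacteria_ids)
```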
- capellini.utils.network_utils.out_path(study: str, subdir: str, filename: str, output_root: Path) Path[source]
Construct the full path to an output file within a study directory.
- Parameters:
study – Study identifier string.
subdir – Sub-directory name.
filename – File name.
output_root – Root output directory.
- Returns:
Full Path to the output file.
- capellini.utils.network_utils.prepare_bacteria_genus_abundance(otu: DataFrame, tax: DataFrame, rank: str = 'target_taxids') DataFrame[source]
Aggregate bacteria OTUs to the selected taxonomy rank and sanitize feature names.
- Parameters:
otu – Samples x ASVs raw abundance.
tax – ASV taxonomy DataFrame with a column matching rank.
rank – Taxonomy column to aggregate to.
- Returns:
Samples x genus-level abundance DataFrame with sanitized column names.
- capellini.utils.network_utils.prevalence_filter_df(df: DataFrame, prevalence: float = 0.1, verbose: bool = True) DataFrame[source]
Keep only features present in at least prevalence * n_samples samples.
- Parameters:
df – Samples x features DataFrame.
prevalence – Minimum fractional prevalence threshold.
verbose – Print kept/total feature count.
- Returns:
Filtered DataFrame.
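The prevalence rule can be written directly from its definition. In this sketch, "present" is assumed to mean a value strictly greater than zero, and `prevalence_filter` is a hypothetical stand-in for the documented function.

```python
import pandas as pd


def prevalence_filter(df: pd.DataFrame, prevalence: float = 0.1) -> pd.DataFrame:
    """Keep features that are non-zero in at least prevalence * n_samples samples."""
    present = (df > 0).sum(axis=0)               # samples in which each feature occurs
    return df.loc[:, present >= prevalence * len(df)]


df = pd.DataFrame(
    {"f1": [1, 2, 3, 4], "f2": [0, 0, 0, 5], "f3": [0, 0, 0, 0]},
    index=["s1", "s2", "s3", "s4"],
)
half = prevalence_filter(df, prevalence=0.5)      # needs presence in >= 2 samples
quarter = prevalence_filter(df, prevalence=0.25)  # needs presence in >= 1 sample
```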
- capellini.utils.network_utils.remove_disease_columns_from_virus_abundance(V: DataFrame) DataFrame[source]
Remove phenotype/metadata columns accidentally stored in viral abundance tables.
- Parameters:
V – Viral abundance DataFrame.
- Returns:
Cleaned copy without non-feature columns.
- capellini.utils.network_utils.residual_message_passing(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) tuple[ndarray, ndarray][source]
Residual additive cross-domain message passing using a smoothed CRISPR matrix.
- Parameters:
V – Samples x viruses CLR matrix.
B – Samples x bacteria CLR matrix.
W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.
lam – Additive weight for cross-domain messages.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
- Returns:
Tuple (V_star, B_star).
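A residual additive update keeps each domain's own signal and adds a weighted message from the other domain. The sketch below shows one plausible form of that update; the exact normalization of the messages in the library is not documented here, so the unnormalized products and the name `residual_step` are assumptions, and `preserve_scale` is omitted.

```python
import numpy as np


def residual_step(V, B, W, lam=0.1, n_steps=1):
    """V: samples x viruses, B: samples x bacteria, W: viruses x bacteria."""
    V_star, B_star = V.copy(), B.copy()
    for _ in range(n_steps):
        # Both updates use the previous step's values (synchronous update).
        V_new = V_star + lam * B_star @ W.T   # bacteria -> viruses message
        B_new = B_star + lam * V_star @ W     # viruses -> bacteria message
        V_star, B_star = V_new, B_new
    return V_star, B_star


V = np.array([[1.0, 0.0]])
B = np.array([[1.0, 2.0, 3.0]])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
V_star, B_star = residual_step(V, B, W, lam=0.1)
```

With `lam=0` the inputs pass through unchanged, which is a quick sanity check for any implementation of this scheme.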
- capellini.utils.network_utils.residual_message_passing_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) tuple[DataFrame, DataFrame, DataFrame][source]
DataFrame wrapper for residual additive message passing.
- Parameters:
V_clr_df – Samples x viruses CLR-transformed abundance.
B_clr_df – Samples x bacteria CLR-transformed abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
lam – Additive weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
- Returns:
Tuple (V_star_df, B_star_df, X_star_df).
- capellini.utils.network_utils.smooth_crispr_bac_vir(crispr_df: DataFrame, K_bac: DataFrame, K_vir: DataFrame, alpha: float = 1.0, preserve_original: bool = True) DataFrame[source]
Smooth a CRISPR matrix using taxonomy kernels.
W_smooth = (1 - alpha) * W + alpha * K_bac @ W @ K_vir
- Parameters:
crispr_df – Bacteria x viruses binary CRISPR matrix.
K_bac – Bacteria taxonomy kernel.
K_vir – Virus taxonomy kernel.
alpha – Smoothing weight (1.0 = full kernel propagation).
preserve_original – If True, blend with original; if False, use propagated only.
- Returns:
Smoothed CRISPR DataFrame.
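The smoothing formula translates almost line for line into NumPy. This sketch follows the formula as written above; the function name `smooth` and the toy matrices are hypothetical.

```python
import numpy as np


def smooth(W, K_bac, K_vir, alpha=1.0, preserve_original=True):
    """W: bacteria x viruses; K_bac, K_vir: square taxonomy kernels."""
    propagated = K_bac @ W @ K_vir
    if preserve_original:
        # Blend original links with kernel-propagated links.
        return (1.0 - alpha) * W + alpha * propagated
    return propagated


W = np.array([[1.0, 0.0],
              [0.0, 0.0]])
K_bac = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
K_vir = np.eye(2)
W_half = smooth(W, K_bac, K_vir, alpha=0.5)
W_full = smooth(W, K_bac, K_vir, alpha=0.5, preserve_original=False)
```

The second bacterium has no observed link, but it inherits a fraction of the first bacterium's link through the 0.5 kernel similarity.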
- capellini.utils.network_utils.study_outdir(study: str, subdir: str, output_root: Path) Path[source]
Construct the output directory path for a study.
- Parameters:
study – Study identifier string.
subdir – Sub-directory name within the study folder.
output_root – Root output directory.
- Returns:
Path to the study sub-directory.
- capellini.utils.network_utils.summarize_df(name: str, df: DataFrame) None[source]
Print a short summary of a DataFrame shape and index uniqueness.
- Parameters:
name – Label for the printout.
df – DataFrame to summarize.
- capellini.utils.network_utils.transform_message_passing_smoothed_crispr(V: ndarray, B: ndarray, W_vh_smooth: ndarray, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) tuple[ndarray, ndarray][source]
Non-residual convex cross-domain message passing using a smoothed CRISPR matrix.
- Parameters:
V – Samples x viruses matrix (CLR-transformed).
B – Samples x bacteria matrix (CLR-transformed).
W_vh_smooth – Viruses x bacteria smoothed CRISPR matrix.
lam – Mixing weight (0 = no update, 1 = full cross-domain).
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
- Returns:
Tuple (V_star, B_star).
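In the convex variant, each feature becomes a mix of its own value and a message from the other domain, with `lam` as the mixing weight. The sketch below assumes the messages are weighted averages (W normalized per receiving feature); that normalization, the `eps` guard, and the name `convex_step` are assumptions, and `n_steps` and `preserve_scale` are omitted.

```python
import numpy as np


def convex_step(V, B, W, lam=0.1, eps=1e-12):
    """V: samples x viruses, B: samples x bacteria, W: viruses x bacteria."""
    to_virus = W / (W.sum(axis=1, keepdims=True) + eps)  # each virus row sums to ~1
    to_bact = W / (W.sum(axis=0, keepdims=True) + eps)   # each bacterium column sums to ~1
    V_star = (1.0 - lam) * V + lam * B @ to_virus.T
    B_star = (1.0 - lam) * B + lam * V @ to_bact
    return V_star, B_star


V = np.array([[1.0, 0.0]])
B = np.array([[1.0, 2.0, 3.0]])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
V_same, B_same = convex_step(V, B, W, lam=0.0)   # no update
V_full, B_full = convex_step(V, B, W, lam=1.0)   # pure cross-domain message
```

At `lam=1` the second virus takes the average of its two linked bacteria, (2 + 3) / 2 = 2.5, matching the "full cross-domain" end of the range.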
- capellini.utils.network_utils.transform_message_passing_smoothed_crispr_df(V_clr_df: DataFrame, B_clr_df: DataFrame, W_vh_smooth_df: DataFrame, lam: float = 0.1, n_steps: int = 1, preserve_scale: bool = False) tuple[DataFrame, DataFrame, DataFrame][source]
DataFrame wrapper for non-residual convex message passing.
- Parameters:
V_clr_df – Samples x viruses CLR-transformed abundance.
B_clr_df – Samples x bacteria CLR-transformed abundance.
W_vh_smooth_df – Viruses x bacteria smoothed CRISPR matrix.
lam – Mixing weight.
n_steps – Number of propagation steps.
preserve_scale – If True, restore original column standard deviations.
- Returns:
Tuple (V_star_df, B_star_df, X_star_df).