capellini.utils.taxonomy

Taxonomy helpers: NCBI name lookup, index sanitization, bacteria taxonomy cleaning.

Functions

`apply_custom_renames`(df[, renames])	Apply a fixed set of column renames to a DataFrame.
`assign_ncbi_taxids`(taxonomy_table, name_to_ncbi)	Assign NCBI taxids to every row of a taxonomy table and print a summary.
`build_name_to_ncbi`(names_dmp_path)	Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
`build_rank_to_taxids`(df_all_ncbis, rank_col)	Build a mapping from rank name to the set of ProGenomes taxids in that rank.
`clean_bacteria_taxonomy`(tax[, ...])	Sanitize bacteria taxonomy columns and index.
`clean_df_ids`(df)	Apply clean_index_ids to both the row index and column index of a DataFrame.
`clean_index_ids`(idx)	Strip trailing .0 float artefacts from string-cast integer IDs.
`load_bacteria_taxonomy`(path)	Load a bacteria taxonomy CSV, handling the old notebook's double-index convention.
`lookup_ncbi_taxid`(row, name_to_taxid[, ranks])	Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
`parse_bool_series`(s)	Robustly parse a boolean metadata column that may be stored as strings.
`rename_clostridium_sensu_stricto`(df)	Rename the Clostridium sensu stricto column to include its numeric suffix.
`sanitize_index`(idx)	Apply `sanitize_taxon_name()` to every element of an index.
`sanitize_taxon_name`(s)	Normalize a taxon string consistently across studies.

capellini.utils.taxonomy.apply_custom_renames(df: DataFrame, renames: dict[str, str] = {'Clostridium sensu stricto': 'Clostridium sensu stricto 1'}) → DataFrame

Apply a fixed set of column renames to a DataFrame.

Parameters:

df – Input DataFrame.
renames – Mapping from old column name to new name.

Returns:

DataFrame with renamed columns.
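
As an illustration of the documented behaviour, the sketch below renames columns with pandas' `DataFrame.rename`. It is not the library's implementation: `rename_columns_sketch` and `DEFAULT_RENAMES` are hypothetical names, and returning a copy rather than modifying `df` in place is an assumption.

```python
import pandas as pd

# Hypothetical constant mirroring the default shown in the signature above.
DEFAULT_RENAMES = {"Clostridium sensu stricto": "Clostridium sensu stricto 1"}


def rename_columns_sketch(df: pd.DataFrame, renames=None) -> pd.DataFrame:
    """Return a copy of df with columns renamed; unknown keys are ignored."""
    mapping = DEFAULT_RENAMES if renames is None else renames
    return df.rename(columns=mapping)


# Toy abundance table, values are illustrative only.
df = pd.DataFrame({"Clostridium sensu stricto": [0.1], "Bacteroides": [0.9]})
renamed = rename_columns_sketch(df)
print(list(renamed.columns))
```

Columns that do not appear in the mapping pass through unchanged, so the same mapping can be applied to tables that lack the renamed genus.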

capellini.utils.taxonomy.assign_ncbi_taxids(taxonomy_table: DataFrame, name_to_ncbi: dict[str, int]) → DataFrame

Assign NCBI taxids to every row of a taxonomy table and print a summary.

Parameters:

taxonomy_table – DataFrame with ranks as columns.
name_to_ncbi – Scientific-name → taxid mapping from build_name_to_ncbi.

Returns:

DataFrame with added NCBI_taxid and taxid_matched_rank columns.

capellini.utils.taxonomy.build_name_to_ncbi(names_dmp_path: str) → dict[str, int]

Parse an NCBI names.dmp file into a scientific-name → taxid mapping.

Parameters:

names_dmp_path – Path to names.dmp extracted from taxdmp.zip.

Returns:

Dictionary mapping scientific name strings to integer NCBI taxids.
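
For readers unfamiliar with the file layout, here is a minimal sketch of this kind of parsing. It assumes the standard names.dmp layout (fields separated by tab, pipe, tab; each row ending in tab, pipe) and keeps only rows whose name class is "scientific name". `parse_names_dmp` is a hypothetical stand-in, not the library's code, and the sample rows are hand-written rather than taken from a real download.

```python
import os
import tempfile


def parse_names_dmp(path: str) -> dict[str, int]:
    """Sketch: map each scientific name in a names.dmp file to its taxid."""
    name_to_taxid: dict[str, int] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop the row terminator, then split on the field delimiter.
            fields = line.rstrip("\n").removesuffix("\t|").split("\t|\t")
            if len(fields) < 4:
                continue  # skip empty or malformed rows
            taxid, name, _unique_name, name_class = fields[:4]
            # Ignore synonyms, common names, and other name classes.
            if name_class == "scientific name":
                name_to_taxid[name] = int(taxid)
    return name_to_taxid


# Hand-written sample in names.dmp layout (assumed, for illustration).
sample = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
    "1485\t|\tClostridium\t|\t\t|\tscientific name\t|\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "names.dmp")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    mapping = parse_names_dmp(sample_path)

print(mapping)
```

Filtering on the name class matters because a single taxid usually has several rows (synonyms, common names), and only one of them is the scientific name.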

capellini.utils.taxonomy.build_rank_to_taxids(df_all_ncbis: DataFrame, rank_col: str) → dict[str, set]

Build a mapping from rank name to the set of ProGenomes taxids in that rank.

Parameters:

df_all_ncbis – DataFrame with taxid and rank columns.
rank_col – Column name for the rank (e.g., “genus”, “family”).

Returns:

Dictionary mapping rank name → set of integer taxids.
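
The grouping described above can be sketched with a pandas groupby. This is an assumption-laden illustration, not the library's code: the taxid column name (`taxid`), the helper name `rank_to_taxids_sketch`, and the choice to skip rows with a missing rank are all hypothetical.

```python
import pandas as pd


def rank_to_taxids_sketch(df: pd.DataFrame, rank_col: str,
                          taxid_col: str = "taxid") -> dict[str, set]:
    """Group taxids by the value in rank_col and collect them into sets."""
    # Rows without a value for this rank cannot be assigned to a group.
    grouped = df.dropna(subset=[rank_col]).groupby(rank_col)[taxid_col]
    return {name: {int(t) for t in taxids} for name, taxids in grouped}


# Toy table, for illustration only.
df_all = pd.DataFrame(
    {
        "taxid": [1491, 1502, 817, 999],
        "genus": ["Clostridium", "Clostridium", "Bacteroides", None],
    }
)
genus_to_taxids = rank_to_taxids_sketch(df_all, "genus")
print(genus_to_taxids)
```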

capellini.utils.taxonomy.clean_bacteria_taxonomy(tax: DataFrame, cols_to_clean: Sequence[str] = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus'), keep_cols: Sequence[str] = ('target_taxids',)) → DataFrame

Sanitize bacteria taxonomy columns and index.

Parameters:

tax – Taxonomy DataFrame.
cols_to_clean – Rank columns to sanitize.
keep_cols – Columns to copy through without sanitization.

Returns:

Cleaned taxonomy DataFrame.

capellini.utils.taxonomy.clean_df_ids(df: DataFrame) → DataFrame

Apply clean_index_ids to both the row index and column index of a DataFrame.

Parameters:

df – Input DataFrame.

Returns:

Copy of df with cleaned index and columns.

capellini.utils.taxonomy.clean_index_ids(idx: Iterable) → list

Strip trailing .0 float artefacts from string-cast integer IDs.

Parameters:

idx – Iterable of index values.

Returns:

List of cleaned string values.
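
The ".0" artefact typically appears when integer IDs pass through a float column and are then cast to strings. A minimal sketch of the clean-up follows, with `strip_float_suffix` as a hypothetical stand-in for the documented function; the exact matching rule (a single trailing ".0") is an assumption.

```python
import re


def strip_float_suffix(idx) -> list:
    """Cast every value to str and drop one trailing '.0', if present."""
    return [re.sub(r"\.0$", "", str(value)) for value in idx]


cleaned = strip_float_suffix([562.0, "1485.0", "10.05", 2, "ASV_1"])
print(cleaned)  # ['562', '1485', '10.05', '2', 'ASV_1']
```

Values such as "10.05" are left alone because only a ".0" at the very end of the string is treated as an artefact.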

capellini.utils.taxonomy.load_bacteria_taxonomy(path: str) → DataFrame

Load a bacteria taxonomy CSV, handling the old notebook’s double-index convention.

Parameters:

path – Path to the taxonomy CSV file.

Returns:

DataFrame with ASV/OTU index.

capellini.utils.taxonomy.lookup_ncbi_taxid(row: Any, name_to_taxid: dict[str, int], ranks: list[str] = ['Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']) → tuple[Any, Any]

Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.

Parameters:

row – A dict-like row with taxonomy rank keys.
name_to_taxid – Mapping from scientific name to taxid.
ranks – Ordered list of ranks to try (finest first).

Returns:

Tuple of (taxid, matched_rank) or (pd.NA, None) if no match found.
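
The finest-to-coarsest fallback can be sketched as a simple loop over the ranks. This is an illustration under stated assumptions rather than the library's implementation: `lookup_taxid_sketch` is a hypothetical name, and treating missing or NA rank values as "skip to the next rank" is assumed.

```python
import pandas as pd

DEFAULT_RANKS = ["Genus", "Family", "Order", "Class", "Phylum", "Kingdom"]


def lookup_taxid_sketch(row, name_to_taxid, ranks=DEFAULT_RANKS):
    """Return (taxid, matched_rank) for the finest rank whose name is known."""
    for rank in ranks:
        name = row.get(rank)
        # Skip ranks that are absent or missing in this row.
        if name is None or pd.isna(name):
            continue
        if name in name_to_taxid:
            return name_to_taxid[name], rank
    return pd.NA, None


name_to_taxid = {"Lachnospiraceae": 186803, "Bacteria": 2}
row = {"Kingdom": "Bacteria", "Family": "Lachnospiraceae", "Genus": "UnknownGenus"}

result = lookup_taxid_sketch(row, name_to_taxid)
print(result)  # (186803, 'Family')
```

Here the genus is not in the mapping, so the lookup falls back to the family. The kingdom would match too, but it is never reached because coarser ranks are only tried after finer ones fail.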

capellini.utils.taxonomy.parse_bool_series(s: Series) → Series

Robustly parse a boolean metadata column that may be stored as strings.

Parameters:

s – Series of bool or string values.

Returns:

Boolean Series.
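
One way such parsing might work is to normalize every value to a lowercase string and test membership in a set of "true" spellings. The accepted vocabulary below is an assumption, as is the helper name `parse_bool_sketch`; the documented function may accept different tokens or handle missing values differently.

```python
import pandas as pd

# Assumed set of spellings that count as True; everything else is False.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}


def parse_bool_sketch(s: pd.Series) -> pd.Series:
    """Map bools and common string spellings onto a plain boolean Series."""
    normalized = s.astype(str).str.strip().str.lower()
    return normalized.isin(TRUE_TOKENS)


parsed = parse_bool_sketch(pd.Series([True, "False", " yes ", "0", "TRUE"]))
print(parsed.tolist())  # [True, False, True, False, True]
```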

capellini.utils.taxonomy.rename_clostridium_sensu_stricto(df: DataFrame) → DataFrame

Rename the Clostridium sensu stricto column to include its numeric suffix (Clostridium sensu stricto 1).

Parameters:

df – DataFrame with genus-level abundance columns.

Returns:

Copy of df with corrected column name.

capellini.utils.taxonomy.sanitize_index(idx: Iterable) → Index

Apply sanitize_taxon_name() to every element of an index.

capellini.utils.taxonomy.sanitize_taxon_name(s: Any) → str

Normalize a taxon string consistently across studies.

Strips leading and trailing whitespace, collapses internal whitespace, removes brackets and quotes, drops the "X." prefix that R prepends to column names, removes spaces and underscores, and collapses runs of dots.
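
The normalization steps listed above can be sketched as a chain of regular-expression substitutions. The order of the steps, the exact character classes (square brackets only, straight quotes only), and the helper name `sanitize_sketch` are assumptions; the library's function may differ in these details.

```python
import re


def sanitize_sketch(s) -> str:
    """Apply the listed normalization steps in the order they are described."""
    text = str(s).strip()                    # strip surrounding whitespace
    text = re.sub(r"\s+", " ", text)         # collapse whitespace runs
    text = re.sub(r"[\[\]\"']", "", text)    # remove square brackets and quotes
    text = re.sub(r"^X\.", "", text)         # drop a leading "X." prefix
    text = re.sub(r"[ _]", "", text)         # remove spaces and underscores
    text = re.sub(r"\.{2,}", ".", text)      # collapse runs of dots
    return text


print(sanitize_sketch("  [Eubacterium]  hallii_group "))  # Eubacteriumhalliigroup
print(sanitize_sketch("X.Eubacterium..hallii.group"))     # Eubacterium.hallii.group
```

The two inputs illustrate why this normalization is useful: the same taxon can arrive with brackets and underscores from one pipeline and with an "X." prefix and dots from another.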