capellini.utils.taxonomy

Taxonomy helpers: NCBI name lookup, index sanitization, bacteria taxonomy cleaning.

Functions

`apply_custom_renames`(df[, renames])	Apply a fixed set of column renames to a DataFrame.
`assign_ncbi_taxids`(taxonomy_table, name_to_ncbi)	Assign NCBI taxids to every row of a taxonomy table and print a summary.
`build_name_to_ncbi`(names_dmp_path)	Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
`build_rank_to_taxids`(df_all_ncbis, rank_col)	Build a mapping from rank name to the set of ProGenomes taxids in that rank.
`clean_bacteria_taxonomy`(tax[, ...])	Sanitize bacteria taxonomy columns and index.
`clean_df_ids`(df)	Apply clean_index_ids to both the row index and column index of a DataFrame.
`clean_index_ids`(idx)	Strip trailing .0 float artefacts from string-cast integer IDs.
`load_bacteria_taxonomy`(path)	Load a bacteria taxonomy CSV, handling the old notebook's double-index convention.
`lookup_ncbi_taxid`(row, name_to_taxid[, ranks])	Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
`parse_bool_series`(s)	Robustly parse a boolean metadata column that may be stored as strings.
`rename_clostridium_sensu_stricto`(df)	Rename the Clostridium sensu stricto column to include the cluster number.
`sanitize_index`(idx, **kwargs)	Apply sanitize_taxon_name to every element of an index.
`sanitize_taxon_name`(s[, ...])	Sanitize a taxon name consistently across studies, matching old notebook behavior.

capellini.utils.taxonomy.apply_custom_renames(df: DataFrame, renames: dict[str, str] = {'Clostridium sensu stricto': 'Clostridium sensu stricto 1'}) → DataFrame

Apply a fixed set of column renames to a DataFrame.

Parameters:

df – Input DataFrame.
renames – Mapping from old column name to new name.

Returns:

DataFrame with renamed columns.

capellini.utils.taxonomy.assign_ncbi_taxids(taxonomy_table: DataFrame, name_to_ncbi: dict[str, int]) → DataFrame

Assign NCBI taxids to every row of a taxonomy table and print a summary.

Parameters:

taxonomy_table – DataFrame with ranks as columns.
name_to_ncbi – Scientific-name → taxid mapping from build_name_to_ncbi.

Returns:

DataFrame with added NCBI_taxid and taxid_matched_rank columns.

capellini.utils.taxonomy.build_name_to_ncbi(names_dmp_path: str) → dict[str, int]

Parse an NCBI names.dmp file into a scientific-name → taxid mapping.

Parameters:

names_dmp_path – Path to names.dmp extracted from taxdmp.zip.

Returns:

Dictionary mapping scientific name strings to integer NCBI taxids.
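The parsing step can be sketched roughly as follows. This is not the library's implementation: `parse_names_dmp_sketch` is a hypothetical stand-in, and it assumes the standard NCBI layout in which fields are separated by a tab, a pipe and a tab, with the name class in the fourth field.

```python
import os
import tempfile


def parse_names_dmp_sketch(names_dmp_path):
    """Hypothetical parser: scientific name -> integer taxid."""
    name_to_taxid = {}
    with open(names_dmp_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Each row ends with a trailing "\t|" terminator.
            if line.endswith("\t|"):
                line = line[:-2]
            fields = line.split("\t|\t")
            if len(fields) < 4:
                continue
            taxid, name, _unique_name, name_class = fields[:4]
            # Keep only canonical names; synonyms and common names are skipped.
            # If a scientific name occurs twice, the later row wins here.
            if name_class == "scientific name":
                name_to_taxid[name] = int(taxid)
    return name_to_taxid


# A few rows written in the names.dmp layout.
sample = (
    "1\t|\tall\t|\t\t|\tsynonym\t|\n"
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.dmp")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    name_to_ncbi = parse_names_dmp_sketch(path)
```

Only the two rows whose name class is "scientific name" end up in the mapping. How the real function treats duplicate scientific names is not stated in this reference.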

capellini.utils.taxonomy.build_rank_to_taxids(df_all_ncbis: DataFrame, rank_col: str) → dict[str, set]

Build a mapping from rank name to the set of ProGenomes taxids in that rank.

Parameters:

df_all_ncbis – DataFrame with taxid and rank columns.
rank_col – Column name for the rank (e.g., “genus”, “family”).

Returns:

Dictionary mapping rank name → set of integer taxids.
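One plausible way to build such a mapping with plain pandas is sketched below. `rank_to_taxids_sketch` is a hypothetical stand-in, the taxid column name `taxid` is an assumption, and the values are illustrative rather than real taxids.

```python
import pandas as pd


def rank_to_taxids_sketch(df, rank_col, taxid_col="taxid"):
    """Hypothetical stand-in: group taxids by the values found in rank_col."""
    # groupby drops rows whose rank value is missing by default.
    grouped = df.groupby(rank_col)[taxid_col]
    return {name: {int(t) for t in taxids} for name, taxids in grouped}


# Illustrative table; taxid values are made up for the example.
genomes = pd.DataFrame(
    {
        "taxid": [101, 102, 201],
        "genus": ["Streptococcus", "Streptococcus", "Bacteroides"],
    }
)
mapping = rank_to_taxids_sketch(genomes, "genus")
```

Here `mapping["Streptococcus"]` collects both taxids that share the genus, which matches the "name → set of integer taxids" shape described above.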

capellini.utils.taxonomy.clean_bacteria_taxonomy(tax: DataFrame, cols_to_clean: Sequence[str] = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus'), keep_cols: Sequence[str] = ('target_taxids',)) → DataFrame

Sanitize bacteria taxonomy columns and index.

Parameters:

tax – Taxonomy DataFrame.
cols_to_clean – Rank columns to sanitize.
keep_cols – Columns to copy through without sanitization.

Returns:

Cleaned taxonomy DataFrame.

capellini.utils.taxonomy.clean_df_ids(df: DataFrame) → DataFrame

Apply clean_index_ids to both the row index and column index of a DataFrame.

Parameters:

df – Input DataFrame.

Returns:

Copy of df with cleaned index and columns.

capellini.utils.taxonomy.clean_index_ids(idx: Iterable) → list

Strip trailing .0 float artefacts from string-cast integer IDs.

Parameters:

idx – Iterable of index values.

Returns:

List of cleaned string values.
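The artefact in question typically appears when integer IDs pass through a float column and are then cast to strings. A minimal sketch of the clean-up, assuming a simple regex on the string form (`strip_float_suffix_sketch` is a hypothetical name, not the library function):

```python
import re


def strip_float_suffix_sketch(idx):
    """Hypothetical stand-in: cast each value to str and drop a trailing '.0'."""
    return [re.sub(r"\.0$", "", str(value)) for value in idx]


cleaned = strip_float_suffix_sketch([562.0, "1301.0", "ASV_7", 10])
```

Labels that do not end in `.0`, such as `"ASV_7"`, pass through unchanged apart from the string cast.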

capellini.utils.taxonomy.load_bacteria_taxonomy(path: str) → DataFrame

Load a bacteria taxonomy CSV, handling the old notebook’s double-index convention.

Parameters:

path – Path to the taxonomy CSV file.

Returns:

DataFrame with ASV/OTU index.

capellini.utils.taxonomy.lookup_ncbi_taxid(row: Any, name_to_taxid: dict[str, int], ranks: list[str] = ['Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']) → tuple[Any, Any]

Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.

Parameters:

row – A dict-like row with taxonomy rank keys.
name_to_taxid – Mapping from scientific name to taxid.
ranks – Ordered list of ranks to try (finest first).

Returns:

Tuple of (taxid, matched_rank) or (pd.NA, None) if no match found.
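The finest-to-coarsest fallback can be illustrated with the sketch below. `lookup_taxid_sketch` is a hypothetical stand-in written for this page; it assumes the row supports `.get` (a dict or a pandas Series) and that non-string values count as missing. The taxids are illustrative.

```python
import pandas as pd

# Same rank order as the default in the signature above, finest first.
DEFAULT_RANKS = ("Genus", "Family", "Order", "Class", "Phylum", "Kingdom")


def lookup_taxid_sketch(row, name_to_taxid, ranks=DEFAULT_RANKS):
    """Hypothetical stand-in: return (taxid, matched_rank) for the finest hit."""
    for rank in ranks:
        name = row.get(rank)
        # Skip ranks that are missing or whose name is not in the mapping.
        if isinstance(name, str) and name in name_to_taxid:
            return name_to_taxid[name], rank
    return pd.NA, None


# Illustrative mapping; the taxid values are made up.
name_to_taxid = {"Lachnospiraceae": 10, "Bacteria": 2}
row = {"Genus": "UnknownGenus", "Family": "Lachnospiraceae", "Kingdom": "Bacteria"}
hit = lookup_taxid_sketch(row, name_to_taxid)
miss = lookup_taxid_sketch({"Genus": "Nothing"}, name_to_taxid)
```

The genus is not in the mapping, so the family match wins even though the kingdom would also match. When nothing matches, the sketch returns the `(pd.NA, None)` pair described above.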

capellini.utils.taxonomy.parse_bool_series(s: Series) → Series

Robustly parse a boolean metadata column that may be stored as strings.

Parameters:

s – Series of bool or string values.

Returns:

Boolean Series.
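Which string spellings the function accepts is not listed here, so the sketch below is only one possible reading. The set of truthy strings and the rule that everything else (including missing values) becomes False are assumptions, and `parse_bool_sketch` is a hypothetical name.

```python
import pandas as pd

# Assumed truthy spellings; the real function may accept a different set.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}


def parse_bool_sketch(s):
    """Hypothetical stand-in: True for recognised truthy values, else False."""
    # Normalise case and surrounding whitespace before comparing.
    normalized = s.astype(str).str.strip().str.lower()
    return normalized.isin(TRUE_STRINGS)


raw = pd.Series([True, "False", " TRUE ", "no", None])
parsed = parse_bool_sketch(raw)
```

Casting to strings first lets genuine booleans and their string forms go through the same comparison.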

capellini.utils.taxonomy.rename_clostridium_sensu_stricto(df: DataFrame) → DataFrame

Rename the Clostridium sensu stricto column to include the cluster number.

Parameters:

df – DataFrame with genus-level abundance columns.

Returns:

Copy of df with corrected column name.

capellini.utils.taxonomy.sanitize_index(idx: Iterable, **kwargs) → Index

Apply sanitize_taxon_name to every element of an index.

Parameters:

idx – Iterable of index labels.
**kwargs – Forwarded to sanitize_taxon_name.

Returns:

New pd.Index with sanitized labels.

capellini.utils.taxonomy.sanitize_taxon_name(s: Any, remove_trailing_numeric_suffix: bool = True, remove_brackets: bool = True, remove_quotes: bool = True, collapse_spaces: bool = True, strip: bool = True, remove_X_prefix: bool = True, spaces_to_underscore: bool = True, collapse_dots: bool = True, remove_spaces_and_underscores: bool = True) → str

Sanitize a taxon name consistently across studies, matching old notebook behavior.

Parameters:

s – Input taxon name (any type; will be cast to str).
remove_trailing_numeric_suffix – Strip a trailing numeric suffix.
remove_brackets – Remove square brackets.
remove_quotes – Remove single and double quotes.
collapse_spaces – Collapse runs of whitespace to a single space.
strip – Strip leading/trailing whitespace.
remove_X_prefix – Remove the R-generated X. prefix.
spaces_to_underscore – Convert spaces to underscores (overridden by remove_spaces_and_underscores).
collapse_dots – Collapse multiple consecutive dots.
remove_spaces_and_underscores – Remove all spaces and underscores.

Returns:

Sanitized string.
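To make the flag semantics more concrete, here is a sketch covering only a subset of the options. `sanitize_sketch` is a hypothetical stand-in; the order of the steps is an assumption, and the remaining flags (numeric suffix, X prefix, dot collapsing, and `remove_spaces_and_underscores`) are deliberately left out.

```python
import re


def sanitize_sketch(s, remove_brackets=True, remove_quotes=True,
                    collapse_spaces=True, strip=True,
                    spaces_to_underscore=True):
    """Hypothetical stand-in covering a subset of the documented flags."""
    text = str(s)
    if remove_brackets:
        text = text.replace("[", "").replace("]", "")
    if remove_quotes:
        text = text.replace('"', "").replace("'", "")
    if collapse_spaces:
        # Collapse any run of whitespace to a single space.
        text = re.sub(r"\s+", " ", text)
    if strip:
        text = text.strip()
    if spaces_to_underscore:
        text = text.replace(" ", "_")
    return text


example = sanitize_sketch('  [Eubacterium]  "hallii" group ')
kept_spaces = sanitize_sketch("a  b", spaces_to_underscore=False)
```

Because the documented default `remove_spaces_and_underscores=True` overrides `spaces_to_underscore`, the library's default output for the same input may contain no separators at all; the sketch does not model that step.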