capellini.utils.taxonomy

Taxonomy helpers: NCBI name lookup, index sanitization, bacteria taxonomy cleaning.

Functions

apply_custom_renames(df[, renames])

Apply a fixed set of column renames to a DataFrame.

assign_ncbi_taxids(taxonomy_table, name_to_ncbi)

Assign NCBI taxids to every row of a taxonomy table and print a summary.

build_name_to_ncbi(names_dmp_path)

Parse an NCBI names.dmp file into a scientific-name → taxid mapping.

build_rank_to_taxids(df_all_ncbis, rank_col)

Build a mapping from rank name to the set of ProGenomes taxids in that rank.

clean_bacteria_taxonomy(tax[, ...])

Sanitize bacteria taxonomy columns and index.

clean_df_ids(df)

Apply clean_index_ids to both the row index and column index of a DataFrame.

clean_index_ids(idx)

Strip trailing .0 float artefacts from string-cast integer IDs.

load_bacteria_taxonomy(path)

Load a bacteria taxonomy CSV, handling the old notebook's double-index convention.

lookup_ncbi_taxid(row, name_to_taxid[, ranks])

Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.

parse_bool_series(s)

Robustly parse a boolean metadata column that may be stored as strings.

rename_clostridium_sensu_stricto(df)

Rename the Clostridium sensu stricto column to include its numeric suffix.

sanitize_index(idx, **kwargs)

Apply sanitize_taxon_name to every element of an index.

sanitize_taxon_name(s[, ...])

Sanitize a taxon name consistently across studies, matching old notebook behavior.

capellini.utils.taxonomy.apply_custom_renames(df: DataFrame, renames: dict[str, str] = {'Clostridium sensu stricto': 'Clostridium sensu stricto 1'}) → DataFrame

Apply a fixed set of column renames to a DataFrame.

Parameters:
  • df – Input DataFrame.

  • renames – Mapping from old column name to new name.

Returns:

DataFrame with renamed columns.

capellini.utils.taxonomy.assign_ncbi_taxids(taxonomy_table: DataFrame, name_to_ncbi: dict[str, int]) → DataFrame

Assign NCBI taxids to every row of a taxonomy table and print a summary.

Parameters:
  • taxonomy_table – DataFrame with ranks as columns.

  • name_to_ncbi – Scientific-name → taxid mapping from build_name_to_ncbi.

Returns:

DataFrame with added NCBI_taxid and taxid_matched_rank columns.

capellini.utils.taxonomy.build_name_to_ncbi(names_dmp_path: str) → dict[str, int]

Parse an NCBI names.dmp file into a scientific-name → taxid mapping.

Parameters:

names_dmp_path – Path to names.dmp extracted from taxdmp.zip.

Returns:

Dictionary mapping scientific name strings to integer NCBI taxids.
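
The parsing step can be illustrated with a minimal sketch. It assumes the standard NCBI taxdump layout, in which fields are separated by a tab, a pipe, and a tab, each record ends with a tab and a pipe, and the fourth field holds the name class. The helper name parse_names_dmp_lines is hypothetical and is not part of capellini; the actual implementation may differ.

```python
from typing import Iterable


def parse_names_dmp_lines(lines: Iterable[str]) -> dict[str, int]:
    """Hypothetical sketch: map scientific names to taxids from names.dmp lines."""
    name_to_taxid: dict[str, int] = {}
    for line in lines:
        record = line.rstrip("\n")
        # Each record is terminated by "\t|"; drop it before splitting.
        if record.endswith("\t|"):
            record = record[:-2]
        fields = record.split("\t|\t")
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        taxid, name, _unique_name, name_class = fields[:4]
        # Keep only canonical scientific names, not synonyms or common names.
        if name_class == "scientific name":
            name_to_taxid[name] = int(taxid)
    return name_to_taxid


# Toy records in names.dmp layout (an in-memory stand-in for the real file).
sample = [
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n",
    "816\t|\tBacteroides\t|\t\t|\tscientific name\t|\n",
]
mapping = parse_names_dmp_lines(sample)
```

Filtering on the name class keeps one entry per taxon, so synonyms and common names do not appear as keys.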

capellini.utils.taxonomy.build_rank_to_taxids(df_all_ncbis: DataFrame, rank_col: str) → dict[str, set]

Build a mapping from rank name to the set of ProGenomes taxids in that rank.

Parameters:
  • df_all_ncbis – DataFrame with taxid and rank columns.

  • rank_col – Column name for the rank (e.g., “genus”, “family”).

Returns:

Dictionary mapping rank name → set of integer taxids.

capellini.utils.taxonomy.clean_bacteria_taxonomy(tax: DataFrame, cols_to_clean: Sequence[str] = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus'), keep_cols: Sequence[str] = ('target_taxids',)) → DataFrame

Sanitize bacteria taxonomy columns and index.

Parameters:
  • tax – Taxonomy DataFrame.

  • cols_to_clean – Rank columns to sanitize.

  • keep_cols – Columns to copy through without sanitization.

Returns:

Cleaned taxonomy DataFrame.

capellini.utils.taxonomy.clean_df_ids(df: DataFrame) → DataFrame

Apply clean_index_ids to both the row index and column index of a DataFrame.

Parameters:

df – Input DataFrame.

Returns:

Copy of df with cleaned index and columns.

capellini.utils.taxonomy.clean_index_ids(idx: Iterable) → list

Strip trailing .0 float artefacts from string-cast integer IDs.

Parameters:

idx – Iterable of index values.

Returns:

List of cleaned string values.
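
The artefact in question typically appears when integer IDs pass through a float column and are then cast to strings (562 becomes "562.0"). A minimal sketch of the cleaning step follows; the helper name strip_float_suffix and the exact regular expression are assumptions, not the library's code.

```python
import re
from typing import Iterable


def strip_float_suffix(values: Iterable) -> list[str]:
    """Hypothetical sketch: cast each value to str and drop one trailing '.0'."""
    return [re.sub(r"\.0$", "", str(v)) for v in values]


# Floats and strings are both cast to str; only an exact trailing ".0" is removed.
cleaned = strip_float_suffix([562.0, "816.0", "1301", "10.05"])
```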

capellini.utils.taxonomy.load_bacteria_taxonomy(path: str) → DataFrame

Load a bacteria taxonomy CSV, handling the old notebook’s double-index convention.

Parameters:

path – Path to the taxonomy CSV file.

Returns:

DataFrame with ASV/OTU index.

capellini.utils.taxonomy.lookup_ncbi_taxid(row: Any, name_to_taxid: dict[str, int], ranks: list[str] = ['Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']) → tuple[Any, Any]

Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.

Parameters:
  • row – A dict-like row with taxonomy rank keys.

  • name_to_taxid – Mapping from scientific name to taxid.

  • ranks – Ordered list of ranks to try (finest first).

Returns:

Tuple of (taxid, matched_rank), or (pd.NA, None) if no match is found.
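
The finest-to-coarsest fallback can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: the helper name lookup_taxid_sketch is hypothetical, the row is a plain dict, the mapping is a toy, and the sketch returns (None, None) where the documented function returns (pd.NA, None), so that it runs without pandas.

```python
from typing import Any, Mapping, Optional, Sequence

DEFAULT_RANKS = ("Genus", "Family", "Order", "Class", "Phylum", "Kingdom")


def lookup_taxid_sketch(
    row: Mapping[str, Any],
    name_to_taxid: Mapping[str, int],
    ranks: Sequence[str] = DEFAULT_RANKS,
) -> tuple[Optional[int], Optional[str]]:
    """Hypothetical sketch: return the taxid of the finest rank that matches."""
    for rank in ranks:
        name = row.get(rank)
        # Skip missing or non-string values (e.g. NaN) and unknown names.
        if isinstance(name, str) and name in name_to_taxid:
            return name_to_taxid[name], rank
    return None, None  # stand-in for (pd.NA, None)


toy_mapping = {"Bacteria": 2, "Bacteroidaceae": 815}
row = {"Kingdom": "Bacteria", "Family": "Bacteroidaceae", "Genus": "unclassified"}
result = lookup_taxid_sketch(row, toy_mapping)
no_match = lookup_taxid_sketch({"Genus": "unknown"}, toy_mapping)
```

Here the genus is not in the mapping, so the lookup falls back to the family and reports which rank produced the match.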

capellini.utils.taxonomy.parse_bool_series(s: Series) → Series

Robustly parse a boolean metadata column that may be stored as strings.

Parameters:

s – Series of bool or string values.

Returns:

Boolean Series.
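
The documentation does not list which string spellings are accepted, so the following is only a sketch of the general idea, applied per value rather than to a Series. The token sets, the helper name parse_bool_value, and the choice to raise on unrecognized input are all assumptions.

```python
from typing import Any

# Assumed spellings; the real function may accept a different set.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}


def parse_bool_value(value: Any) -> bool:
    """Hypothetical sketch: coerce one metadata value to bool."""
    if isinstance(value, bool):
        return value  # already boolean, pass through
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


parsed = [parse_bool_value(v) for v in [True, "False", " yes ", "0"]]
```

Normalizing case and whitespace before comparison is what guards against the common pitfall that any non-empty string, including "False", is truthy in Python.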

capellini.utils.taxonomy.rename_clostridium_sensu_stricto(df: DataFrame) → DataFrame

Rename the Clostridium sensu stricto column to include its numeric suffix.

Parameters:

df – DataFrame with genus-level abundance columns.

Returns:

Copy of df with corrected column name.

capellini.utils.taxonomy.sanitize_index(idx: Iterable, **kwargs) → Index

Apply sanitize_taxon_name to every element of an index.

Parameters:
  • idx – Iterable of index labels.

  • **kwargs – Forwarded to sanitize_taxon_name.

Returns:

New pd.Index with sanitized labels.

capellini.utils.taxonomy.sanitize_taxon_name(s: Any, remove_trailing_numeric_suffix: bool = True, remove_brackets: bool = True, remove_quotes: bool = True, collapse_spaces: bool = True, strip: bool = True, remove_X_prefix: bool = True, spaces_to_underscore: bool = True, collapse_dots: bool = True, remove_spaces_and_underscores: bool = True) → str

Sanitize a taxon name consistently across studies, matching old notebook behavior.

Parameters:
  • s – Input taxon name (any type; will be cast to str).

  • remove_trailing_numeric_suffix – Remove a trailing numeric suffix from the name.

  • remove_brackets – Remove square brackets.

  • remove_quotes – Remove single and double quotes.

  • collapse_spaces – Collapse runs of whitespace to a single space.

  • strip – Strip leading/trailing whitespace.

  • remove_X_prefix – Remove the R-generated X. prefix.

  • spaces_to_underscore – Convert spaces to underscores (overridden by remove_spaces_and_underscores).

  • collapse_dots – Collapse multiple consecutive dots.

  • remove_spaces_and_underscores – Remove all spaces and underscores.

Returns:

Sanitized string.
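
How the flags interact, in particular how remove_spaces_and_underscores overrides spaces_to_underscore, can be sketched with a subset of the options. The helper sanitize_name_sketch is hypothetical; the order of the steps and the exact replacements are assumptions, and the numeric-suffix, X. prefix, and dot-collapsing options are omitted because their precise rules are not documented here.

```python
import re
from typing import Any


def sanitize_name_sketch(
    s: Any,
    remove_brackets: bool = True,
    remove_quotes: bool = True,
    collapse_spaces: bool = True,
    strip: bool = True,
    spaces_to_underscore: bool = True,
    remove_spaces_and_underscores: bool = True,
) -> str:
    """Hypothetical sketch covering a subset of the documented flags."""
    name = str(s)  # any input type is cast to str first
    if remove_brackets:
        name = name.replace("[", "").replace("]", "")
    if remove_quotes:
        name = name.replace("'", "").replace('"', "")
    if collapse_spaces:
        name = re.sub(r"\s+", " ", name)
    if strip:
        name = name.strip()
    if remove_spaces_and_underscores:
        # Takes precedence: no separators survive, so spaces_to_underscore is moot.
        name = name.replace(" ", "").replace("_", "")
    elif spaces_to_underscore:
        name = name.replace(" ", "_")
    return name


raw = " [Eubacterium]  'hallii' group "
compact = sanitize_name_sketch(raw)
underscored = sanitize_name_sketch(raw, remove_spaces_and_underscores=False)
```

With all defaults enabled the result contains no separators at all; underscores appear only when remove_spaces_and_underscores is switched off.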