capellini.utils.taxonomy
Taxonomy helpers: NCBI name lookup, index sanitization, bacteria taxonomy cleaning.
Functions
apply_custom_renames | Apply a fixed set of column renames to a DataFrame.
assign_ncbi_taxids | Assign NCBI taxids to every row of a taxonomy table and print a summary.
build_name_to_ncbi | Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
build_rank_to_taxids | Build a mapping from rank name to the set of ProGenomes taxids in that rank.
clean_bacteria_taxonomy | Sanitize bacteria taxonomy columns and index.
clean_df_ids | Apply clean_index_ids to both the row index and column index of a DataFrame.
clean_index_ids | Strip trailing .0 float artefacts from string-cast integer IDs.
load_bacteria_taxonomy | Load a bacteria taxonomy CSV, handling the old notebook's double-index convention.
lookup_ncbi_taxid | Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
parse_bool_series | Robustly parse a boolean metadata column that may be stored as strings.
rename_clostridium_sensu_stricto | Rename the Clostridium sensu stricto column to include the subspecies number.
sanitize_index | Apply sanitize_taxon_name to every element of an index.
sanitize_taxon_name | Sanitize a taxon name consistently across studies, matching old notebook behavior.
- capellini.utils.taxonomy.apply_custom_renames(df: DataFrame, renames: dict[str, str] = {'Clostridium sensu stricto': 'Clostridium sensu stricto 1'}) → DataFrame
Apply a fixed set of column renames to a DataFrame.
- Parameters:
df – Input DataFrame.
renames – Mapping from old column name to new name.
- Returns:
DataFrame with renamed columns.
- capellini.utils.taxonomy.assign_ncbi_taxids(taxonomy_table: DataFrame, name_to_ncbi: dict[str, int]) → DataFrame
Assign NCBI taxids to every row of a taxonomy table and print a summary.
- Parameters:
taxonomy_table – DataFrame with ranks as columns.
name_to_ncbi – Scientific-name → taxid mapping from build_name_to_ncbi.
- Returns:
DataFrame with added NCBI_taxid and taxid_matched_rank columns.
- capellini.utils.taxonomy.build_name_to_ncbi(names_dmp_path: str) → dict[str, int]
Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
- Parameters:
names_dmp_path – Path to names.dmp extracted from taxdmp.zip.
- Returns:
Dictionary mapping scientific name strings to integer NCBI taxids.
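As an illustration of the parsing step, here is a minimal sketch and not the library's implementation. It assumes the standard NCBI dump layout, where fields are separated by a tab, a pipe, and a tab, and each row ends with a tab and a pipe. The function name `parse_names_dmp_lines` is hypothetical, and it takes an iterable of lines instead of a path so the example stays self-contained.

```python
from typing import Iterable


def parse_names_dmp_lines(lines: Iterable[str]) -> dict:
    """Hypothetical stand-in: build a scientific-name -> taxid dict from names.dmp rows."""
    name_to_taxid = {}
    for line in lines:
        row = line.rstrip("\n")
        # Each row ends with a tab followed by a pipe; drop that terminator.
        if row.endswith("\t|"):
            row = row[:-2]
        fields = row.split("\t|\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        taxid, name, _unique_name, name_class = fields[:4]
        # Keep only scientific names; synonyms and common names are ignored.
        # If a name repeats, the later row overwrites the earlier one.
        if name_class == "scientific name":
            name_to_taxid[name] = int(taxid)
    return name_to_taxid


sample = [
    "561\t|\tEscherichia\t|\t\t|\tscientific name\t|\n",
    "562\t|\tBacillus coli\t|\t\t|\tsynonym\t|\n",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
]
mapping = parse_names_dmp_lines(sample)
print(mapping)
```

With a real dump, the same function could be fed an open file handle, since iterating over a text file yields its lines.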
- capellini.utils.taxonomy.build_rank_to_taxids(df_all_ncbis: DataFrame, rank_col: str) → dict[str, set]
Build a mapping from rank name to the set of ProGenomes taxids in that rank.
- Parameters:
df_all_ncbis – DataFrame with taxid and rank columns.
rank_col – Column name for the rank (e.g., “genus”, “family”).
- Returns:
Dictionary mapping rank name → set of integer taxids.
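The grouping described here can be sketched without pandas. This is an illustrative stand-in, not the library's code: the function name `group_taxids_by_rank`, the `taxid` column name, and the use of a list of dicts in place of a DataFrame are all assumptions.

```python
from collections import defaultdict


def group_taxids_by_rank(rows, rank_col, taxid_col="taxid"):
    """Hypothetical sketch: map each value of rank_col to the set of taxids seen with it."""
    rank_to_taxids = defaultdict(set)
    for row in rows:
        rank_name = row.get(rank_col)
        if rank_name is None:
            continue  # rows without a value at this rank are skipped
        rank_to_taxids[rank_name].add(int(row[taxid_col]))
    return dict(rank_to_taxids)


rows = [
    {"taxid": 562, "genus": "Escherichia"},
    {"taxid": 564, "genus": "Escherichia"},
    {"taxid": 1280, "genus": "Staphylococcus"},
    {"taxid": 9999, "genus": None},
]
by_genus = group_taxids_by_rank(rows, "genus")
print(by_genus)
```

Rows in this shape can be obtained from a DataFrame with `df.to_dict("records")`.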
- capellini.utils.taxonomy.clean_bacteria_taxonomy(tax: DataFrame, cols_to_clean: Sequence[str] = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus'), keep_cols: Sequence[str] = ('target_taxids',)) → DataFrame
Sanitize bacteria taxonomy columns and index.
- Parameters:
tax – Taxonomy DataFrame.
cols_to_clean – Rank columns to sanitize.
keep_cols – Columns to copy through without sanitization.
- Returns:
Cleaned taxonomy DataFrame.
- capellini.utils.taxonomy.clean_df_ids(df: DataFrame) → DataFrame
Apply clean_index_ids to both the row index and column index of a DataFrame.
- Parameters:
df – Input DataFrame.
- Returns:
Copy of df with cleaned index and columns.
- capellini.utils.taxonomy.clean_index_ids(idx: Iterable) → list
Strip trailing .0 float artefacts from string-cast integer IDs.
- Parameters:
idx – Iterable of index values.
- Returns:
List of cleaned string values.
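To make the ".0" artefact concrete: integer IDs that pass through a float column come out as strings such as "562.0". The sketch below is one plausible way to undo that, under the assumption that only values consisting entirely of digits followed by ".0" should be rewritten. The name `strip_float_artefacts` is hypothetical and the real function may behave differently on edge cases.

```python
import re

# Matches an integer (optionally negative) followed by exactly ".0".
_FLOAT_ARTEFACT = re.compile(r"(-?\d+)\.0")


def strip_float_artefacts(values):
    """Hypothetical sketch: cast each value to str and drop a trailing '.0' on integer-like IDs."""
    cleaned = []
    for value in values:
        text = str(value)
        match = _FLOAT_ARTEFACT.fullmatch(text)
        cleaned.append(match.group(1) if match else text)
    return cleaned


cleaned_ids = strip_float_artefacts([562.0, "1301.0", "ASV_12", "1.5", 7])
print(cleaned_ids)  # ['562', '1301', 'ASV_12', '1.5', '7']
```

Non-numeric labels and genuine decimals such as "1.5" pass through unchanged in this sketch.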
- capellini.utils.taxonomy.load_bacteria_taxonomy(path: str) → DataFrame
Load a bacteria taxonomy CSV, handling the old notebook’s double-index convention.
- Parameters:
path – Path to the taxonomy CSV file.
- Returns:
DataFrame with ASV/OTU index.
- capellini.utils.taxonomy.lookup_ncbi_taxid(row: Any, name_to_taxid: dict[str, int], ranks: list[str] = ['Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']) → tuple[Any, Any]
Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
- Parameters:
row – A dict-like row with taxonomy rank keys.
name_to_taxid – Mapping from scientific name to taxid.
ranks – Ordered list of ranks to try (finest first).
- Returns:
Tuple of (taxid, matched_rank) or (pd.NA, None) if no match found.
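The finest-to-coarsest fallback can be sketched as a simple loop over the ranks. This is an illustration under stated assumptions, not the library's code: `lookup_taxid_sketch` is a hypothetical name, and it returns `(None, None)` on a miss where the documented function returns `(pd.NA, None)`, so that the example needs no pandas import.

```python
from typing import Any, Mapping, Optional, Sequence

DEFAULT_RANKS = ("Genus", "Family", "Order", "Class", "Phylum", "Kingdom")


def lookup_taxid_sketch(
    row: Mapping[str, Any],
    name_to_taxid: Mapping[str, int],
    ranks: Sequence[str] = DEFAULT_RANKS,
) -> tuple:
    """Hypothetical sketch: return (taxid, matched_rank) for the finest rank with a known name."""
    for rank in ranks:
        name = row.get(rank)
        # Missing or non-string entries (e.g. NaN) are skipped.
        if isinstance(name, str) and name in name_to_taxid:
            return name_to_taxid[name], rank
    return None, None


name_to_taxid = {"Enterobacteriaceae": 543, "Bacteria": 2}
row = {"Kingdom": "Bacteria", "Family": "Enterobacteriaceae", "Genus": "Unclassified"}
result = lookup_taxid_sketch(row, name_to_taxid)
print(result)  # (543, 'Family')
```

Here the genus is unknown to the mapping, so the lookup falls back to the family and reports which rank matched.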
- capellini.utils.taxonomy.parse_bool_series(s: Series) → Series
Robustly parse a boolean metadata column that may be stored as strings.
- Parameters:
s – Series of bool or string values.
- Returns:
Boolean Series.
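The kind of normalization involved can be sketched on plain Python values. The accepted spellings below, the error on unknown values, and the names `parse_bool_value` and `parse_bool_values` are all assumptions made for illustration; the documentation does not list which strings the real function recognizes.

```python
# Assumed spellings; the real function may accept a different set.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}
FALSE_STRINGS = {"false", "f", "no", "n", "0"}


def parse_bool_value(value):
    """Hypothetical sketch: map a bool or a boolean-like string to a Python bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


def parse_bool_values(values):
    """Apply parse_bool_value to every element of an iterable."""
    return [parse_bool_value(v) for v in values]


parsed = parse_bool_values([True, "False", " yes ", "0"])
print(parsed)  # [True, False, True, False]
```

On a pandas Series, a scalar helper like this could be applied with `Series.map`.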
- capellini.utils.taxonomy.rename_clostridium_sensu_stricto(df: DataFrame) → DataFrame
Rename the Clostridium sensu stricto column to include the subspecies number.
- Parameters:
df – DataFrame with genus-level abundance columns.
- Returns:
Copy of df with corrected column name.
- capellini.utils.taxonomy.sanitize_index(idx: Iterable, **kwargs) → Index
Apply sanitize_taxon_name to every element of an index.
- Parameters:
idx – Iterable of index labels.
**kwargs – Forwarded to sanitize_taxon_name.
- Returns:
New pd.Index with sanitized labels.
- capellini.utils.taxonomy.sanitize_taxon_name(s: Any, remove_trailing_numeric_suffix: bool = True, remove_brackets: bool = True, remove_quotes: bool = True, collapse_spaces: bool = True, strip: bool = True, remove_X_prefix: bool = True, spaces_to_underscore: bool = True, collapse_dots: bool = True, remove_spaces_and_underscores: bool = True) → str
Sanitize a taxon name consistently across studies, matching old notebook behavior.
- Parameters:
s – Input taxon name (any type; will be cast to str).
remove_trailing_numeric_suffix – Remove a trailing numeric suffix from the name.
remove_brackets – Remove square brackets.
remove_quotes – Remove single and double quotes.
collapse_spaces – Collapse runs of whitespace to a single space.
strip – Strip leading/trailing whitespace.
remove_X_prefix – Remove the R-generated X. prefix.
spaces_to_underscore – Convert spaces to underscores (overridden by remove_spaces_and_underscores).
collapse_dots – Collapse multiple consecutive dots.
remove_spaces_and_underscores – Remove all spaces and underscores.
- Returns:
Sanitized string.
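How the flags interact, in particular that remove_spaces_and_underscores takes precedence over spaces_to_underscore, can be sketched as follows. This is a hedged illustration, not the library's implementation: the function name `sanitize_name_sketch`, the order of the steps, and the exact patterns are assumptions. The remove_trailing_numeric_suffix step is left out because the suffix pattern is not documented.

```python
import re


def sanitize_name_sketch(
    s,
    remove_brackets=True,
    remove_quotes=True,
    collapse_spaces=True,
    strip=True,
    remove_X_prefix=True,
    spaces_to_underscore=True,
    collapse_dots=True,
    remove_spaces_and_underscores=True,
):
    """Hypothetical sketch of flag-driven taxon name cleaning (step order is assumed)."""
    text = str(s)
    if remove_X_prefix:
        text = re.sub(r"^X\.", "", text)  # leading "X." as produced by R name mangling
    if remove_brackets:
        text = text.replace("[", "").replace("]", "")
    if remove_quotes:
        text = text.replace("'", "").replace('"', "")
    if collapse_dots:
        text = re.sub(r"\.{2,}", ".", text)
    if collapse_spaces:
        text = re.sub(r"\s+", " ", text)
    if strip:
        text = text.strip()
    # Removing spaces and underscores overrides converting spaces to underscores.
    if remove_spaces_and_underscores:
        text = text.replace(" ", "").replace("_", "")
    elif spaces_to_underscore:
        text = text.replace(" ", "_")
    return text


raw = " [Ruminococcus]  torques group "
print(sanitize_name_sketch(raw))
print(sanitize_name_sketch(raw, remove_spaces_and_underscores=False))
print(sanitize_name_sketch("X.Eubacterium..hallii"))
```

Because every flag defaults to True, the default call produces the most aggressive normalization; turning off remove_spaces_and_underscores is what lets spaces_to_underscore take effect in this sketch.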