capellini.utils.taxonomy
Taxonomy helpers: NCBI name lookup, index sanitization, bacteria taxonomy cleaning.
Functions
apply_custom_renames | Apply a fixed set of column renames to a DataFrame.
assign_ncbi_taxids | Assign NCBI taxids to every row of a taxonomy table and print a summary.
build_name_to_ncbi | Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
build_rank_to_taxids | Build a mapping from rank name to the set of ProGenomes taxids in that rank.
clean_bacteria_taxonomy | Sanitize bacteria taxonomy columns and index.
clean_df_ids | Apply clean_index_ids to both the row index and column index of a DataFrame.
clean_index_ids | Strip trailing .0 float artefacts from string-cast integer IDs.
load_bacteria_taxonomy | Load a bacteria taxonomy CSV, handling the old notebook's double-index convention.
lookup_ncbi_taxid | Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
parse_bool_series | Robustly parse a boolean metadata column that may be stored as strings.
rename_clostridium_sensu_stricto | Rename the Clostridium sensu stricto column to include the subspecies number.
sanitize_index | Apply sanitize_taxon_name() to every element of an index.
sanitize_taxon_name | Normalize a taxon string consistently across studies.
- capellini.utils.taxonomy.apply_custom_renames(df: DataFrame, renames: dict[str, str] = {'Clostridium sensu stricto': 'Clostridium sensu stricto 1'}) → DataFrame
Apply a fixed set of column renames to a DataFrame.
- Parameters:
df – Input DataFrame.
renames – Mapping from old column name to new name.
- Returns:
DataFrame with renamed columns.
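As an illustration of the renaming described above, the following is a minimal sketch built on `DataFrame.rename`. It is not the module's implementation; the names `apply_renames_sketch` and `DEFAULT_RENAMES` are hypothetical, and the default mapping is copied from the documented signature.

```python
import pandas as pd

# Default mapping copied from the documented signature.
DEFAULT_RENAMES = {"Clostridium sensu stricto": "Clostridium sensu stricto 1"}


def apply_renames_sketch(df, renames=None):
    """Hypothetical stand-in: rename columns, leaving unmatched names alone."""
    renames = DEFAULT_RENAMES if renames is None else renames
    # DataFrame.rename returns a new object; the input is not modified.
    return df.rename(columns=renames)


abundance = pd.DataFrame(
    [[0.2, 0.8]], columns=["Clostridium sensu stricto", "Bacteroides"]
)
renamed = apply_renames_sketch(abundance)
print(list(renamed.columns))
```

Column names that are absent from the mapping pass through unchanged, so the same call is safe on tables that lack the Clostridium column.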
- capellini.utils.taxonomy.assign_ncbi_taxids(taxonomy_table: DataFrame, name_to_ncbi: dict[str, int]) → DataFrame
Assign NCBI taxids to every row of a taxonomy table and print a summary.
- Parameters:
taxonomy_table – DataFrame with ranks as columns.
name_to_ncbi – Scientific-name → taxid mapping from build_name_to_ncbi.
- Returns:
DataFrame with added NCBI_taxid and taxid_matched_rank columns.
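One way to picture the table-level assignment is a column-wise pass over the ranks, filling only rows that are still unmatched. The sketch below assumes that approach; the function name `assign_taxids_sketch`, the two-rank default, and the sample table are illustrative, and the real function may work row by row and format its summary differently.

```python
import pandas as pd


def assign_taxids_sketch(table, name_to_ncbi, ranks=("Genus", "Family")):
    """Hypothetical stand-in: add NCBI_taxid and taxid_matched_rank columns."""
    out = table.copy()
    out["NCBI_taxid"] = pd.array([pd.NA] * len(out), dtype="Int64")
    out["taxid_matched_rank"] = None
    for rank in ranks:  # finest rank first
        hits = out[rank].map(name_to_ncbi)
        # Only fill rows that have no taxid yet and match at this rank.
        fill = out["NCBI_taxid"].isna() & hits.notna()
        out.loc[fill, "NCBI_taxid"] = hits[fill].astype("Int64")
        out.loc[fill, "taxid_matched_rank"] = rank
    return out


# Illustrative input; the mapping would normally come from build_name_to_ncbi.
name_to_ncbi = {"Clostridium": 1485, "Clostridiaceae": 31979}
taxonomy = pd.DataFrame(
    {
        "Genus": ["Clostridium", "unknown", "unknown"],
        "Family": ["Clostridiaceae", "Clostridiaceae", "unknown"],
    },
    index=["ASV1", "ASV2", "ASV3"],
)
result = assign_taxids_sketch(taxonomy, name_to_ncbi)
# A simple summary: how many rows matched at each rank (None = unmatched).
print(result["taxid_matched_rank"].value_counts(dropna=False))
```

Using the nullable `Int64` dtype keeps taxids as integers even when some rows remain unmatched.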
- capellini.utils.taxonomy.build_name_to_ncbi(names_dmp_path: str) → dict[str, int]
Parse an NCBI names.dmp file into a scientific-name → taxid mapping.
- Parameters:
names_dmp_path – Path to names.dmp extracted from taxdmp.zip.
- Returns:
Dictionary mapping scientific name strings to integer NCBI taxids.
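For readers unfamiliar with names.dmp, each row holds a taxid, a name, a unique name, and a name class, with fields separated by a tab, pipe, tab sequence. The sketch below shows how such a file could be reduced to a scientific-name mapping. It is an assumption about the approach, not the module's code; `parse_names_dmp` is a hypothetical helper that reads from an open handle, and the sample rows are illustrative.

```python
import io


def parse_names_dmp(handle):
    """Hypothetical helper: map scientific names to integer taxids."""
    name_to_taxid = {}
    for line in handle:
        # Drop the trailing "\t|" terminator, then split on the pipe separator.
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        taxid, name, _unique_name, name_class = fields[:4]
        # Keep only the canonical name; synonyms and common names are skipped.
        if name_class == "scientific name":
            name_to_taxid[name] = int(taxid)
    return name_to_taxid


# Illustrative rows in the names.dmp layout.
sample = io.StringIO(
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
    "1485\t|\tClostridium\t|\t\t|\tscientific name\t|\n"
)
mapping = parse_names_dmp(sample)
print(mapping)
```

With a real file, the handle would come from `open(names_dmp_path)` instead of `io.StringIO`.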
- capellini.utils.taxonomy.build_rank_to_taxids(df_all_ncbis: DataFrame, rank_col: str) → dict[str, set]
Build a mapping from rank name to the set of ProGenomes taxids in that rank.
- Parameters:
df_all_ncbis – DataFrame with taxid and rank columns.
rank_col – Column name for the rank (e.g., “genus”, “family”).
- Returns:
Dictionary mapping rank name → set of integer taxids.
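The mapping described above amounts to a group-by on the rank column. The following sketch shows one possible shape of it; the function name `rank_to_taxids_sketch`, the `taxid` column name, and the placeholder genus names and taxids are all assumptions made for illustration.

```python
import pandas as pd


def rank_to_taxids_sketch(df, rank_col, taxid_col="taxid"):
    """Hypothetical stand-in: map each rank name to a set of integer taxids."""
    # Rows without a name at this rank cannot be grouped, so drop them first.
    grouped = df.dropna(subset=[rank_col]).groupby(rank_col)[taxid_col]
    return {name: {int(t) for t in taxids} for name, taxids in grouped}


# Placeholder data: names and taxids are made up.
df_all = pd.DataFrame(
    {
        "taxid": [101, 102, 201, 301],
        "genus": ["GenusA", "GenusA", "GenusB", None],
    }
)
genus_to_taxids = rank_to_taxids_sketch(df_all, "genus")
print(genus_to_taxids)
```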
- capellini.utils.taxonomy.clean_bacteria_taxonomy(tax: DataFrame, cols_to_clean: Sequence[str] = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus'), keep_cols: Sequence[str] = ('target_taxids',)) → DataFrame
Sanitize bacteria taxonomy columns and index.
- Parameters:
tax – Taxonomy DataFrame.
cols_to_clean – Rank columns to sanitize.
keep_cols – Columns to copy through without sanitization.
- Returns:
Cleaned taxonomy DataFrame.
- capellini.utils.taxonomy.clean_df_ids(df: DataFrame) → DataFrame
Apply clean_index_ids to both the row index and column index of a DataFrame.
- Parameters:
df – Input DataFrame.
- Returns:
Copy of df with cleaned index and columns.
- capellini.utils.taxonomy.clean_index_ids(idx: Iterable) → list
Strip trailing .0 float artefacts from string-cast integer IDs.
- Parameters:
idx – Iterable of index values.
- Returns:
List of cleaned string values.
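The ".0" artefact typically appears when integer IDs pass through a float column before being cast to strings. A minimal sketch of the cleaning step, together with the DataFrame-level wrapper that clean_df_ids describes, could look like the following. Both function names are hypothetical and the regular expression is an assumption about what counts as a float artefact.

```python
import re

import pandas as pd


def clean_ids_sketch(idx):
    """Hypothetical stand-in: cast to str and drop a single trailing '.0'."""
    return [re.sub(r"\.0$", "", str(value)) for value in idx]


def clean_df_sketch(df):
    """Hypothetical stand-in: clean both axes on a copy of the frame."""
    out = df.copy()
    out.index = clean_ids_sketch(out.index)
    out.columns = clean_ids_sketch(out.columns)
    return out


cleaned = clean_ids_sketch([562.0, "1485.0", "ASV_1", 10])
print(cleaned)

frame = pd.DataFrame([[1, 2]], index=[562.0], columns=["1485.0", "ASV_1"])
cleaned_frame = clean_df_sketch(frame)
print(list(cleaned_frame.index), list(cleaned_frame.columns))
```

Anchoring the pattern at the end of the string leaves values such as "1.10" or "ASV_1" untouched.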
- capellini.utils.taxonomy.load_bacteria_taxonomy(path: str) → DataFrame
Load a bacteria taxonomy CSV, handling the old notebook’s double-index convention.
- Parameters:
path – Path to the taxonomy CSV file.
- Returns:
DataFrame with ASV/OTU index.
- capellini.utils.taxonomy.lookup_ncbi_taxid(row: Any, name_to_taxid: dict[str, int], ranks: list[str] = ['Genus', 'Family', 'Order', 'Class', 'Phylum', 'Kingdom']) → tuple[Any, Any]
Look up an NCBI taxid for a taxonomy row, trying ranks from finest to coarsest.
- Parameters:
row – A dict-like row with taxonomy rank keys.
name_to_taxid – Mapping from scientific name to taxid.
ranks – Ordered list of ranks to try (finest first).
- Returns:
Tuple of (taxid, matched_rank) or (pd.NA, None) if no match found.
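The finest-to-coarsest fallback can be sketched as a simple loop that returns on the first rank whose name is present in the mapping. This is an illustration under that assumption rather than the module's code; `lookup_taxid_sketch` and the sample mapping are hypothetical, while the default rank order is copied from the documented signature.

```python
import pandas as pd

# Rank order copied from the documented default (finest first).
RANKS = ["Genus", "Family", "Order", "Class", "Phylum", "Kingdom"]


def lookup_taxid_sketch(row, name_to_taxid, ranks=RANKS):
    """Hypothetical stand-in: return (taxid, matched_rank) or (pd.NA, None)."""
    for rank in ranks:
        name = row.get(rank)  # works for dicts and pandas Series alike
        if isinstance(name, str) and name in name_to_taxid:
            return name_to_taxid[name], rank
    return pd.NA, None


# Illustrative mapping and row.
name_to_taxid = {"Clostridium": 1485, "Clostridiaceae": 31979}
row = {"Genus": "unclassified", "Family": "Clostridiaceae"}
taxid, matched_rank = lookup_taxid_sketch(row, name_to_taxid)
print(taxid, matched_rank)
```

Here the genus is not in the mapping, so the lookup falls back to the family. Returning the matched rank alongside the taxid lets downstream code tell a genus-level match from a coarser one.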
- capellini.utils.taxonomy.parse_bool_series(s: Series) → Series
Robustly parse a boolean metadata column that may be stored as strings.
- Parameters:
s – Series of bool or string values.
- Returns:
Boolean Series.
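Metadata columns read from CSV often mix real booleans with strings such as "True" or "no". One possible way to normalize them is sketched below. The set of accepted true spellings and the choice to treat everything else, including missing values, as False are assumptions; the real function may differ on both points.

```python
import pandas as pd

# Assumed spellings of "true"; anything else is treated as False.
TRUE_STRINGS = {"true", "t", "yes", "y", "1"}


def parse_bool_sketch(s):
    """Hypothetical stand-in: coerce bools and bool-like strings to bool."""

    def to_bool(value):
        if isinstance(value, bool):
            return value
        # Normalize case and whitespace before comparing.
        return str(value).strip().lower() in TRUE_STRINGS

    return s.map(to_bool).astype(bool)


raw = pd.Series([True, "False", " TRUE ", "no", None])
parsed = parse_bool_sketch(raw)
print(parsed.tolist())
```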
- capellini.utils.taxonomy.rename_clostridium_sensu_stricto(df: DataFrame) → DataFrame
Rename the Clostridium sensu stricto column to include the subspecies number.
- Parameters:
df – DataFrame with genus-level abundance columns.
- Returns:
Copy of df with corrected column name.
- capellini.utils.taxonomy.sanitize_index(idx: Iterable) → Index
Apply sanitize_taxon_name() to every element of an index.