capellini.stages.ncbi_mapping

NCBI mapping stage: download taxonomy names and assign real NCBI taxids.

Functions

download_ncbi_names(names_dmp_path, taxdmp_url)

Download taxdmp.zip and extract names.dmp to names_dmp_path.

run_ncbi_mapping(cfg)

Load taxonomy table, assign NCBI taxids, and return the updated DataFrame.

capellini.stages.ncbi_mapping.download_ncbi_names(names_dmp_path: str | Path, taxdmp_url: str) → Path

Download taxdmp.zip and extract names.dmp to names_dmp_path.

Skips the download if names_dmp_path already exists.

Parameters:
  • names_dmp_path – Destination path for names.dmp.

  • taxdmp_url – Source URL of the taxdmp.zip archive.

Returns:

Path to the names.dmp file.
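
The skip-if-present behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the function name fetch_names_dmp is hypothetical, and the sketch assumes that names.dmp sits at the root of the archive and that the archive is small enough to hold in memory.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def fetch_names_dmp(names_dmp_path, taxdmp_url):
    """Hypothetical stand-in for download_ncbi_names (illustration only)."""
    names_dmp_path = Path(names_dmp_path)
    if names_dmp_path.exists():
        # A copy is already on disk, so no network access is needed.
        return names_dmp_path

    names_dmp_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the whole archive is read into memory before extraction.
    with urllib.request.urlopen(taxdmp_url) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        # Assumption: the member is stored as "names.dmp" at the archive root.
        names_dmp_path.write_bytes(archive.read("names.dmp"))
    return names_dmp_path
```

Because the existence check comes first, repeated pipeline runs reuse the cached file. Deleting names.dmp would be one way to force a fresh download in a design like this.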

capellini.stages.ncbi_mapping.run_ncbi_mapping(cfg: CapelliniConfig) → DataFrame

Load taxonomy table, assign NCBI taxids, and return the updated DataFrame.

Loads the DADA2-produced taxonomy_table_{F|R|P}.csv, downloads the NCBI names file if it is not already present, looks up the real NCBI taxid for each ASV at the finest rank available, and adds the NCBI_taxid and taxid_matched_rank columns.

Parameters:

cfg – Populated CapelliniConfig instance.

Returns:

taxonomy_table DataFrame with NCBI_taxid and taxid_matched_rank columns added.
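
One way to picture the finest-rank lookup is the sketch below. It is an assumption-laden illustration rather than the actual code of run_ncbi_mapping: the helper names load_scientific_names and finest_rank_match, the rank column names in RANKS, and the choice to match only "scientific name" entries are all hypothetical. It also assumes each rank cell holds the full name as it appears in names.dmp.

```python
# Assumption: rank columns ordered from coarsest to finest.
RANKS = ("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")


def load_scientific_names(lines):
    """Map scientific names to taxids from names.dmp-style lines."""
    name_to_taxid = {}
    for line in lines:
        # Fields are separated by "\t|\t" and each line ends with "\t|".
        fields = line.rstrip("\n").removesuffix("\t|").split("\t|\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            # If a name occurs more than once, the first taxid seen is kept.
            name_to_taxid.setdefault(name, int(taxid))
    return name_to_taxid


def finest_rank_match(row, name_to_taxid, ranks=RANKS):
    """Return (taxid, rank) for the finest rank whose name is known."""
    for rank in reversed(ranks):  # try the finest rank first
        name = row.get(rank)
        if isinstance(name, str) and name in name_to_taxid:
            return name_to_taxid[name], rank
    return None, None  # no rank matched
```

Applied to each row of the taxonomy table, the two returned values would correspond to the NCBI_taxid and taxid_matched_rank columns. An open file handle on names.dmp can be passed directly as lines.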