capellini.stages.mmseqs2
MMSeqs2 stage: 16S reference, easy-search, and 3-layer NCBI/GCA assignment.
Functions
build_top200_dicts | Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.
extract_ids_from_header | Extract all structured IDs from a FASTA header using compiled regex patterns.
extract_ids_from_reference | Parse the 16S FASTA reference and extract IDs from every record header.
fallback_pick_from_space | Deterministic fallback: pick the smallest unused taxid from the allowed space.
get_reference_16s | Return path to the 16S ProGenomes reference, using the bundled file if available.
is_16s_gene | Return True if a FASTA description corresponds to a 16S rRNA gene.
map_silva_to_progenomes_bounded | 3-layer NCBI taxid assignment: species, genus, and family resolution.
parse_mmseqs_output | Parse an mmseqs .m8 output file and return the top-scoring hits per query.
pick_bounded | Pick the first ranked taxid that is in allowed_space and not already used.
run_mmseqs2 | Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.
run_mmseqs_easy_search | Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.
- capellini.stages.mmseqs2.build_top200_dicts(taxonomy_table: DataFrame, topBitScore_df: DataFrame) → tuple[dict, dict, dict]
Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.
Layer 1 (species): per-ASV ranked hits for ASVs with a Genus.
Layer 2 (genus): pooled ranked hits for all ASVs in each genus.
Layer 3 (family): per-ASV ranked hits for ASVs without a Genus but with a Family.
- Parameters:
taxonomy_table – ASV taxonomy DataFrame (index = ASV names, Genus/Family columns).
topBitScore_df – MMSeqs2 hits with query, NCBI ID, bitscore columns.
- Returns:
Tuple (top_200_per_asv, top_200_per_genus, top_200_per_family).
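The layering described above can be sketched with plain Python structures standing in for the two DataFrames. The helper name `build_ranked_layers`, the input shapes, and the choice to keep each taxid only once per ranking are assumptions made for illustration; the package's actual implementation is not shown on this page.

```python
from collections import defaultdict


def build_ranked_layers(taxonomy, hits, top_n=200):
    """Illustrative stand-in for build_top200_dicts using plain Python data.

    taxonomy: {asv: {"Genus": str or None, "Family": str or None}}
    hits:     iterable of (asv, taxid, bitscore) tuples
    """
    def ranked(rows):
        # Sort by bitscore (best first), keep each taxid once (assumption),
        # then cap the ranking at top_n entries.
        seen, out = set(), []
        for _, taxid, _ in sorted(rows, key=lambda r: r[2], reverse=True):
            if taxid not in seen:
                seen.add(taxid)
                out.append(taxid)
        return out[:top_n]

    by_asv = defaultdict(list)
    for row in hits:
        by_asv[row[0]].append(row)

    per_asv, per_family, genus_rows = {}, {}, defaultdict(list)
    for asv, rows in by_asv.items():
        tax = taxonomy.get(asv, {})
        if tax.get("Genus"):
            per_asv[asv] = ranked(rows)              # Layer 1: per-ASV
            genus_rows[tax["Genus"]].extend(rows)    # pooled for Layer 2
        elif tax.get("Family"):
            per_family[asv] = ranked(rows)           # Layer 3: family fallback

    per_genus = {g: ranked(rows) for g, rows in genus_rows.items()}
    return per_asv, per_genus, per_family
```

Pooling the raw hit rows before ranking means the genus-level list is ordered by the best bitscore seen across every ASV in that genus, not by concatenating the per-ASV rankings.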
- capellini.stages.mmseqs2.extract_ids_from_header(header: str) → Dict[str, List[str]]
Extract all structured IDs from a FASTA header using compiled regex patterns.
- Parameters:
header – FASTA record description string.
- Returns:
Dict with keys taxids, assemblies, biosamples, bioprojects, accessions.
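A minimal sketch of the compiled-pattern approach follows. The regexes below are hypothetical: they cover common NCBI identifier shapes (GCA/GCF assemblies, SAMN/SAME/SAMD biosamples, PRJNA/PRJEB/PRJDB bioprojects, versioned sequence accessions) and assume a `taxid=` style tag for taxids. The patterns used by the real module may differ.

```python
import re
from typing import Dict, List

# Hypothetical patterns; the module's own regexes are not shown on this page.
PATTERNS = {
    "taxids": re.compile(r"\btaxid[=:|](\d+)", re.IGNORECASE),
    "assemblies": re.compile(r"\b(GC[AF]_\d{9}\.\d+)"),
    "biosamples": re.compile(r"\b(SAM[NED][A-Z]?\d+)"),
    "bioprojects": re.compile(r"\b(PRJ[NED][A-Z]\d+)"),
    "accessions": re.compile(r"\b([A-Z]{1,2}_?\d{5,8}\.\d+)"),
}


def extract_ids(header: str) -> Dict[str, List[str]]:
    """Return every match of each pattern, keyed like the documented dict."""
    return {key: pattern.findall(header) for key, pattern in PATTERNS.items()}
```

Compiling the patterns once at import time avoids recompiling them for every record when the whole reference is scanned.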
- capellini.stages.mmseqs2.extract_ids_from_reference(reference_16s_path: Path) → tuple[list, DataFrame]
Parse the 16S FASTA reference and extract IDs from every record header.
- Parameters:
reference_16s_path – Path to progenome16S.fasta.
- Returns:
Tuple (per_record_rows list, df DataFrame) where df has columns record_id, ncbi_id, assembly, biosample, bioproject, accessions, description.
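Walking the record headers of a FASTA file needs only the standard library. The sketch below is an assumption about the general approach, not the package's code; `iter_fasta_headers` is a hypothetical helper, and it follows the common convention that the description includes the record ID as its first token.

```python
from typing import Iterable, Iterator, Tuple


def iter_fasta_headers(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, description) for each '>' header line."""
    for line in lines:
        if line.startswith(">"):
            description = line[1:].strip()
            # The record ID is the first whitespace-delimited token.
            record_id = description.split(None, 1)[0] if description else ""
            yield record_id, description
```

Passing an open file handle (for example `open(reference_16s_path)`) streams the reference line by line, so sequence data never has to be held in memory.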
- capellini.stages.mmseqs2.fallback_pick_from_space(allowed_space: set, used_set: set)
Deterministic fallback: pick the smallest unused taxid from the allowed space.
- Parameters:
allowed_space – Set of valid taxids.
used_set – Set of already-assigned taxids.
- Returns:
Smallest unused int taxid, or None if the space is exhausted.
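The documented behaviour fits in one expression. This is a sketch of the idea under the hypothetical name `fallback_pick`, not the package's source.

```python
from typing import Optional, Set


def fallback_pick(allowed_space: Set[int], used_set: Set[int]) -> Optional[int]:
    # Set difference leaves only unused candidates; min() makes the choice
    # independent of set iteration order, which is what makes it deterministic.
    return min(allowed_space - used_set, default=None)
```

For example, `fallback_pick({5, 3, 9}, {3})` returns `5`, and an exhausted space such as `fallback_pick({1}, {1})` returns `None`.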
- capellini.stages.mmseqs2.get_reference_16s(cfg: CapelliniConfig) → Path
Return path to the 16S ProGenomes reference, using the bundled file if available.
Modification 1: checks the bundled progenome16S.fasta first. If not present, downloads the full genes FASTA from ProGenomes3 and filters it for 16S records.
- Parameters:
cfg – Populated CapelliniConfig instance.
- Returns:
Path to the ready progenome16S.fasta reference.
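The bundled-first lookup can be sketched as below. `resolve_reference` and its `build_from_download` callback are hypothetical names; the callback stands in for the download-and-filter step, whose URL and details are not given on this page.

```python
from pathlib import Path
from typing import Callable


def resolve_reference(bundled: Path, build_from_download: Callable[[], Path]) -> Path:
    """Prefer the bundled FASTA; otherwise delegate to a download-and-filter step."""
    if bundled.is_file():
        return bundled
    # Hypothetical callback: download the full genes FASTA and keep 16S records.
    return build_from_download()
```

Keeping the download behind a callback means the expensive path runs only when the bundled file is missing.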
- capellini.stages.mmseqs2.is_16s_gene(description: str) → bool
Return True if a FASTA description corresponds to a 16S rRNA gene.
- Parameters:
description – FASTA record description string.
- Returns:
True if the record is a 16S rRNA gene.
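One plausible way to implement such a check is a case-insensitive regex on the description. The wording it looks for ("16S" followed by "rRNA" or "ribosomal RNA") is an assumption about how the reference headers are annotated, and `looks_like_16s` is a hypothetical name.

```python
import re

# Assumed annotation wording: "16S" followed by "rRNA" or "ribosomal RNA".
_16S_RE = re.compile(r"\b16S\s+(?:rRNA|ribosomal\s+RNA)\b", re.IGNORECASE)


def looks_like_16s(description: str) -> bool:
    """Return True when the description mentions a 16S rRNA annotation."""
    return bool(_16S_RE.search(description))
```

A pattern this simple would also accept protein annotations such as "16S rRNA methyltransferase", so a production filter would likely need an exclusion list or a check on feature type.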
- capellini.stages.mmseqs2.map_silva_to_progenomes_bounded(silva: DataFrame, top_200_per_asv: dict, top_200_per_genus: dict, top_200_per_family: dict, genus_to_taxids: dict, family_to_taxids: dict, ncbi_taxids_by_genus: dict, ncbi_taxids_by_family: dict, allowed_universe: set, topBitScore_df: DataFrame, progenomes_ref_df: DataFrame, enforce_unique_taxids: bool = True, debug: bool = False) → tuple
3-layer NCBI taxid assignment: species, genus, and family resolution.
- Adds six per-layer columns and one consolidated column to the input silva DataFrame:
progenomes_taxid_species, progenomes_taxid_genus, progenomes_taxid_family, GCA_species, GCA_genus, GCA_family, and GCA (consolidated).
- Parameters:
silva – ASV taxonomy DataFrame.
top_200_per_asv – Layer 1 ranked hits (per-ASV).
top_200_per_genus – Layer 2 ranked hits (per-genus).
top_200_per_family – Layer 3 ranked hits (per-ASV, family fallback).
genus_to_taxids – ProGenomes genus name → set of taxids.
family_to_taxids – ProGenomes family name → set of taxids.
ncbi_taxids_by_genus – NCBI genus name → set of taxids.
ncbi_taxids_by_family – NCBI family name → set of taxids.
allowed_universe – Set of all NCBI taxids in the 16S reference.
topBitScore_df – MMSeqs2 hits DataFrame (with NCBI ID, Genome Accession ID).
progenomes_ref_df – DataFrame from extract_ids_from_reference (ncbi_id, record_id).
enforce_unique_taxids – Enforce uniqueness across genera/families (Layer 2/3).
debug – Print per-ASV assignment details.
- Returns:
Tuple (silva_out, species_mapping, genus_mapping, family_mapping, genus_rep).
- capellini.stages.mmseqs2.parse_mmseqs_output(mmseqs_output: Path, min_bitscore: int, max_matches: int) → DataFrame
Parse an mmseqs .m8 output file and return the top-scoring hits per query.
Adds NCBI ID, Genome Accession ID, and Gene Index columns parsed from the target header.
- Parameters:
mmseqs_output – Path to output.m8.
min_bitscore – Minimum bitscore threshold.
max_matches – Maximum number of hits to keep per query (by bitscore).
- Returns:
Filtered DataFrame with NCBI ID, Genome Accession ID, Gene Index columns.
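The filter-and-rank step can be sketched without pandas. The column list below is the default 12-column tab-separated layout that `mmseqs easy-search` writes. The helper name `top_hits`, the inclusive `>=` threshold, and the list-of-dicts return type are assumptions; splitting the target header into NCBI ID, Genome Accession ID, and Gene Index is omitted because the header layout is not specified on this page.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List

# Default tab-separated layout written by mmseqs easy-search.
M8_COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def top_hits(lines: Iterable[str], min_bitscore: float,
             max_matches: int) -> Dict[str, List[dict]]:
    """Keep hits at or above min_bitscore, then the best max_matches per query."""
    per_query = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < len(M8_COLUMNS):
            continue  # skip blank or truncated lines
        hit = dict(zip(M8_COLUMNS, row))
        hit["bits"] = float(hit["bits"])
        if hit["bits"] >= min_bitscore:
            per_query[hit["query"]].append(hit)
    # Rank each query's hits by bitscore, best first, and truncate.
    return {query: sorted(hits, key=lambda h: h["bits"], reverse=True)[:max_matches]
            for query, hits in per_query.items()}
```

Any iterable of lines works as input, so an open file handle on `output.m8` can be passed directly.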
- capellini.stages.mmseqs2.pick_bounded(ranked_taxids: list, allowed_space: set, used_set: set)
Pick the first ranked taxid that is in allowed_space and not already used.
- Parameters:
ranked_taxids – Ordered list of NCBI taxids (best first).
allowed_space – Set of valid taxids for this rank/layer.
used_set – Global set of already-assigned taxids.
- Returns:
Chosen integer taxid, or None if none qualify.
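The selection rule amounts to a first-match scan over the ranked list. The sketch below uses the hypothetical name `pick_first_allowed` and assumes the caller, not the function, records the chosen taxid in `used_set`.

```python
from typing import Iterable, Optional, Set


def pick_first_allowed(ranked_taxids: Iterable[int], allowed_space: Set[int],
                       used_set: Set[int]) -> Optional[int]:
    """Return the best-ranked taxid that is allowed and still unused."""
    for taxid in ranked_taxids:
        if taxid in allowed_space and taxid not in used_set:
            return taxid
    return None  # nothing in the ranking qualifies
```

Because the input is ordered best first, the first qualifying entry is also the best-scoring one, and both membership tests are constant-time set lookups.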
- capellini.stages.mmseqs2.run_mmseqs2(cfg: CapelliniConfig, taxonomy_table: DataFrame) → DataFrame
Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.
- Steps:
1. Get/bundle 16S reference (Modification 1).
2. Run mmseqs easy-search.
3. Parse .m8 output.
4. Extract IDs from reference FASTA.
5. Build per-layer ranked hit dicts.
6. Run map_silva_to_progenomes_bounded.
7. Save silva_fixed CSV.
8. Optionally delete downloaded reference.
- Parameters:
cfg – Populated CapelliniConfig instance.
taxonomy_table – Taxonomy table from the NCBI mapping stage.
- Returns:
silva_fixed DataFrame with progenomes_taxid and GCA columns.
- capellini.stages.mmseqs2.run_mmseqs_easy_search(bact_path: Path, reference_16s_path: Path, mmseq_folder: str) → Path
Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.
- Parameters:
bact_path – Query FASTA (16S DADA2 bacteria sequences).
reference_16s_path – Subject FASTA (progenome16S.fasta).
mmseq_folder – Output directory for the .m8 file.
- Returns:
Path to the output.m8 file.
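A nucleotide `easy-search` call might be assembled as below. `mmseqs easy-search` takes a query, a target, a result file, and a temporary directory, and `--search-type 3` selects nucleotide search. The helper names, the `tmp` subdirectory, and the absence of any further flags are assumptions; the options the package actually passes are not listed on this page.

```python
import subprocess
from pathlib import Path
from typing import List


def build_easy_search_cmd(query: Path, reference: Path, out_dir: Path) -> List[str]:
    """Assemble the argument list for a nucleotide easy-search run."""
    out_dir = Path(out_dir)
    return [
        "mmseqs", "easy-search",
        str(query), str(reference),
        str(out_dir / "output.m8"),
        str(out_dir / "tmp"),      # scratch directory required by easy-search
        "--search-type", "3",      # 3 = nucleotide search
    ]


def run_easy_search(query: Path, reference: Path, out_dir: Path) -> Path:
    """Run the search and return the path to the .m8 result file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if mmseqs exits with an error.
    subprocess.run(build_easy_search_cmd(query, reference, out_dir), check=True)
    return Path(out_dir) / "output.m8"
```

Building the command separately from running it keeps the argument list inspectable and testable on machines where `mmseqs` is not installed.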