capellini.stages.mmseqs2

MMSeqs2 stage: 16S reference, easy-search, and 3-layer NCBI/GCA assignment.

Functions

`build_top200_dicts`(taxonomy_table, ...)	Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.
`extract_ids_from_header`(header)	Extract all structured IDs from a FASTA header using compiled regex patterns.
`extract_ids_from_reference`(reference_16s_path)	Parse the 16S FASTA reference and extract IDs from every record header.
`fallback_pick_from_space`(allowed_space, used_set)	Deterministic fallback: pick the smallest unused taxid from the allowed space.
`get_reference_16s`(cfg)	Return path to the 16S ProGenomes reference, using the bundled file if available.
`is_16s_gene`(description)	Return True if a FASTA description corresponds to a 16S rRNA gene.
`map_silva_to_progenomes_bounded`(silva, ...)	3-layer NCBI taxid assignment: species, genus, and family resolution.
`parse_mmseqs_output`(mmseqs_output, ...)	Parse an mmseqs .m8 output file and return the top-scoring hits per query.
`pick_bounded`(ranked_taxids, allowed_space, ...)	Pick the first ranked taxid that is in allowed_space and not already used.
`run_mmseqs2`(cfg, taxonomy_table)	Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.
`run_mmseqs_easy_search`(bact_path, ...)	Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.

capellini.stages.mmseqs2.build_top200_dicts(taxonomy_table: DataFrame, topBitScore_df: DataFrame) → tuple[dict, dict, dict]

Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.

Layer 1 (species): per-ASV ranked hits for ASVs with a Genus.
Layer 2 (genus): pooled ranked hits for all ASVs in each genus.
Layer 3 (family): per-ASV ranked hits for ASVs without a Genus but with a Family.

Parameters:

taxonomy_table – ASV taxonomy DataFrame (index = ASV names, Genus/Family columns).
topBitScore_df – MMSeqs2 hits with query, NCBI ID, bitscore columns.

Returns:

Tuple (top_200_per_asv, top_200_per_genus, top_200_per_family).
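The three layers can be illustrated with a small pandas sketch. This is not the package's implementation: the function name `build_dicts_sketch`, the `limit` argument, and the choice to store each entry as a de-duplicated list of NCBI IDs ordered by descending bitscore are all assumptions made for illustration. Only the column names (query, NCBI ID, bitscore, Genus, Family) come from the parameter descriptions above.

```python
import pandas as pd


def _ranked_ids(hits: pd.DataFrame, limit: int) -> list:
    """Return unique NCBI IDs ordered by descending bitscore (assumed ranking)."""
    ordered = hits.sort_values("bitscore", ascending=False)
    return ordered["NCBI ID"].drop_duplicates().head(limit).tolist()


def build_dicts_sketch(taxonomy_table: pd.DataFrame, hits: pd.DataFrame, limit: int = 200):
    """Hypothetical stand-in for build_top200_dicts; `limit` is an assumed cap."""
    has_genus = taxonomy_table["Genus"].notna()
    has_family = taxonomy_table["Family"].notna()
    hits_by_asv = {asv: group for asv, group in hits.groupby("query")}

    # Layer 1: per-ASV ranked hits for ASVs that have a Genus.
    per_asv = {
        asv: _ranked_ids(hits_by_asv[asv], limit)
        for asv in taxonomy_table.index[has_genus]
        if asv in hits_by_asv
    }

    # Layer 2: pool the hits of every ASV in a genus, then rank the pool.
    genus_of = taxonomy_table.loc[has_genus, "Genus"]
    pooled = hits.assign(Genus=hits["query"].map(genus_of)).dropna(subset=["Genus"])
    per_genus = {
        genus: _ranked_ids(group, limit) for genus, group in pooled.groupby("Genus")
    }

    # Layer 3: per-ASV ranked hits for ASVs with a Family but no Genus.
    per_family = {
        asv: _ranked_ids(hits_by_asv[asv], limit)
        for asv in taxonomy_table.index[~has_genus & has_family]
        if asv in hits_by_asv
    }
    return per_asv, per_genus, per_family


# Made-up example data, for illustration only.
taxonomy = pd.DataFrame(
    {
        "Genus": ["Escherichia", "Escherichia", None],
        "Family": ["Enterobacteriaceae", "Enterobacteriaceae", "Lachnospiraceae"],
    },
    index=["ASV1", "ASV2", "ASV3"],
)
example_hits = pd.DataFrame(
    {
        "query": ["ASV1", "ASV1", "ASV2", "ASV3"],
        "NCBI ID": [562, 564, 564, 1000],
        "bitscore": [900, 800, 950, 500],
    }
)
per_asv, per_genus, per_family = build_dicts_sketch(taxonomy, example_hits)
```

In this toy input the genus pool for Escherichia ranks taxid 564 first, because its best hit (from ASV2) outscores the best hit for 562.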

capellini.stages.mmseqs2.extract_ids_from_header(header: str) → Dict[str, List[str]]

Extract all structured IDs from a FASTA header using compiled regex patterns.

Parameters: header – FASTA record description string.
Returns: Dict with keys taxids, assemblies, biosamples, bioprojects, accessions.
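The compiled patterns themselves are not listed in this reference, so the sketch below only illustrates the general approach. Every regex in `ID_PATTERNS`, the function name `extract_ids_sketch`, and the example header are assumptions based on common NCBI identifier formats. They are not the module's actual patterns.

```python
import re
from typing import Dict, List

# Hypothetical patterns; the real compiled regexes are not documented here.
ID_PATTERNS = {
    # Assumed: the header starts with "<taxid>." as a prefix.
    "taxids": re.compile(r"^(\d+)\."),
    # GenBank/RefSeq assembly accessions such as GCA_000005845.2.
    "assemblies": re.compile(r"\bGC[AF]_\d{9}(?:\.\d+)?"),
    # BioSample accessions (SAMN, SAMEA, SAMD prefixes).
    "biosamples": re.compile(r"\bSAM(?:N|EA|D)\d+"),
    # BioProject accessions (PRJNA, PRJEB, PRJDB prefixes).
    "bioprojects": re.compile(r"\bPRJ(?:NA|EB|DB)\d+"),
    # RefSeq-style sequence accessions such as NC_000913.3 or NZ_CP012345.1.
    "accessions": re.compile(r"\b[A-Z]{2}_[A-Z]*\d+(?:\.\d+)?"),
}


def extract_ids_sketch(header: str) -> Dict[str, List[str]]:
    """Return every match of each pattern, keyed like the documented result."""
    return {key: pattern.findall(header) for key, pattern in ID_PATTERNS.items()}


# Made-up header, for illustration only.
EXAMPLE_HEADER = "511145.SAMN02604091.b0001 NC_000913.3 GCF_000005845.2 PRJNA57779"
ids = extract_ids_sketch(EXAMPLE_HEADER)
```

Returning a list per key, even when empty, keeps the result shape stable for headers that lack some identifier types.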

capellini.stages.mmseqs2.extract_ids_from_reference(reference_16s_path: Path) → tuple[list, DataFrame]

Parse the 16S FASTA reference and extract IDs from every record header.

Parameters: reference_16s_path – Path to progenome16S.fasta.
Returns: Tuple (per_record_rows list, df DataFrame) where df has columns record_id, ncbi_id, assembly, biosample, bioproject, accessions, description.

capellini.stages.mmseqs2.fallback_pick_from_space(allowed_space: set, used_set: set)

Deterministic fallback: pick the smallest unused taxid from the allowed space.

Parameters:

allowed_space – Set of valid taxids.
used_set – Set of already-assigned taxids.

Returns:

Smallest unused int taxid, or None if the space is exhausted.
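The documented behaviour amounts to a set difference followed by a minimum. A minimal sketch, assuming integer taxids and using the hypothetical name `fallback_pick_sketch` to avoid implying this is the module's own code:

```python
from typing import Optional, Set


def fallback_pick_sketch(allowed_space: Set[int], used_set: Set[int]) -> Optional[int]:
    """Return the smallest taxid in allowed_space that is not in used_set."""
    candidates = allowed_space - used_set
    # min() raises on an empty set, so the exhausted case is handled explicitly.
    return min(candidates) if candidates else None


first_pick = fallback_pick_sketch({562, 564, 1000}, {562})
exhausted = fallback_pick_sketch({562}, {562})
```

Taking the minimum makes the choice independent of set iteration order, which is what makes the fallback deterministic.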

capellini.stages.mmseqs2.get_reference_16s(cfg: CapelliniConfig) → Path

Return path to the 16S ProGenomes reference, using the bundled file if available.

Modification 1: checks the bundled progenome16S.fasta first. If not present, downloads the full genes FASTA from ProGenomes3 and filters it for 16S records.

Parameters: cfg – Populated CapelliniConfig instance.
Returns: Path to the ready progenome16S.fasta reference.

capellini.stages.mmseqs2.is_16s_gene(description: str) → bool

Return True if a FASTA description corresponds to a 16S rRNA gene.

Parameters: description – FASTA record description string.
Returns: True if the record is a 16S rRNA gene.
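The exact matching rule is not stated in this reference. One plausible approach is a case-insensitive regex that requires "16S" followed by "rRNA" or "ribosomal RNA". The pattern and the name `is_16s_sketch` below are assumptions, not the module's actual test.

```python
import re

# Hypothetical pattern: "16S" as a whole token, later followed by an rRNA term.
_16S_PATTERN = re.compile(r"\b16S\b.*\b(?:rRNA|ribosomal\s+RNA)\b", re.IGNORECASE)


def is_16s_sketch(description: str) -> bool:
    """Return True if the description looks like a 16S rRNA gene annotation."""
    return bool(_16S_PATTERN.search(description))


looks_16s = is_16s_sketch("511145.SAMN02604091.b0201 16S ribosomal RNA")
looks_23s = is_16s_sketch("511145.SAMN02604091.b0204 23S ribosomal RNA")
```

A pattern this loose would also accept annotations such as "16S rRNA methyltransferase", so a production filter would likely need extra exclusions.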

capellini.stages.mmseqs2.map_silva_to_progenomes_bounded(silva: DataFrame, top_200_per_asv: dict, top_200_per_genus: dict, top_200_per_family: dict, genus_to_taxids: dict, family_to_taxids: dict, ncbi_taxids_by_genus: dict, ncbi_taxids_by_family: dict, allowed_universe: set, topBitScore_df: DataFrame, progenomes_ref_df: DataFrame, enforce_unique_taxids: bool = True, debug: bool = False) → tuple

3-layer NCBI taxid assignment: species, genus, and family resolution.

Adds the following columns to the input silva DataFrame: progenomes_taxid_species, progenomes_taxid_genus, progenomes_taxid_family, GCA_species, GCA_genus, GCA_family, and GCA (consolidated).

Parameters:

silva – ASV taxonomy DataFrame.
top_200_per_asv – Layer 1 ranked hits (per-ASV).
top_200_per_genus – Layer 2 ranked hits (per-genus).
top_200_per_family – Layer 3 ranked hits (per-ASV, family fallback).
genus_to_taxids – ProGenomes genus name → set of taxids.
family_to_taxids – ProGenomes family name → set of taxids.
ncbi_taxids_by_genus – NCBI genus name → set of taxids.
ncbi_taxids_by_family – NCBI family name → set of taxids.
allowed_universe – Set of all NCBI taxids in the 16S reference.
topBitScore_df – MMSeqs2 hits DataFrame (with NCBI ID, Genome Accession ID).
progenomes_ref_df – DataFrame from extract_ids_from_reference (ncbi_id, record_id).
enforce_unique_taxids – Enforce uniqueness across genera/families (Layer 2/3).
debug – Print per-ASV assignment details.

Returns:

Tuple (silva_out, species_mapping, genus_mapping, family_mapping, genus_rep).

capellini.stages.mmseqs2.parse_mmseqs_output(mmseqs_output: Path, min_bitscore: int, max_matches: int) → DataFrame

Parse an mmseqs .m8 output file and return the top-scoring hits per query.

Adds NCBI ID, Genome Accession ID, and Gene Index columns parsed from the target header.

Parameters:

mmseqs_output – Path to output.m8.
min_bitscore – Minimum bitscore threshold.
max_matches – Maximum number of hits to keep per query (by bitscore).

Returns:

Filtered DataFrame with NCBI ID, Genome Accession ID, Gene Index columns.
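The filter-then-rank step can be sketched with pandas as follows. The twelve column names follow the default MMseqs2 tabular output order, with the last column named bitscore to match this page. The dot-separated target layout ("taxid.accession.gene_index"), the function name `parse_m8_sketch`, and the sample rows are assumptions for illustration only.

```python
import io

import pandas as pd

# Default 12-column .m8 layout; "bitscore" is this page's name for the last column.
M8_COLUMNS = [
    "query", "target", "pident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bitscore",
]


def parse_m8_sketch(source, min_bitscore: int, max_matches: int) -> pd.DataFrame:
    """Keep hits at or above min_bitscore, then the best max_matches per query."""
    df = pd.read_csv(source, sep="\t", header=None, names=M8_COLUMNS)
    df = df[df["bitscore"] >= min_bitscore]
    df = df.sort_values(["query", "bitscore"], ascending=[True, False])
    df = df.groupby("query", sort=False).head(max_matches).reset_index(drop=True)

    # Assumed target layout: "<taxid>.<accession>.<gene index>".
    parts = df["target"].str.split(".", n=2)
    df["NCBI ID"] = parts.str[0]
    df["Genome Accession ID"] = parts.str[1]
    df["Gene Index"] = parts.str[2]
    return df


# Made-up rows, for illustration only.
SAMPLE_ROWS = [
    ("ASV1", "511145.SAMN02604091.b0201", 99.0, 250, 2, 0, 1, 250, 1, 250, 1e-120, 900),
    ("ASV1", "562.SAMN00000001.g0001", 99.6, 250, 1, 0, 1, 250, 1, 250, 1e-125, 950),
    ("ASV1", "83333.SAMN00000002.g0002", 80.0, 120, 24, 0, 1, 120, 1, 120, 1e-10, 100),
    ("ASV2", "562.SAMN00000001.g0001", 95.0, 250, 12, 0, 1, 250, 1, 250, 1e-90, 400),
]
sample_text = "\n".join("\t".join(map(str, row)) for row in SAMPLE_ROWS)
hits = parse_m8_sketch(io.StringIO(sample_text), min_bitscore=200, max_matches=1)
```

Filtering before ranking means a query whose hits all fall below the threshold drops out of the result entirely.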

capellini.stages.mmseqs2.pick_bounded(ranked_taxids: list, allowed_space: set, used_set: set)

Pick the first ranked taxid that is in allowed_space and not already used.

Parameters:

ranked_taxids – Ordered list of NCBI taxids (best first).
allowed_space – Set of valid taxids for this rank/layer.
used_set – Global set of already-assigned taxids.

Returns:

Chosen integer taxid, or None if none qualify.
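The selection rule is a first-match scan over the ranked list. A minimal sketch under the hypothetical name `pick_bounded_sketch`; whether the real function also records the chosen taxid in used_set is not documented, so this version leaves that to the caller.

```python
from typing import Iterable, Optional, Set


def pick_bounded_sketch(
    ranked_taxids: Iterable[int], allowed_space: Set[int], used_set: Set[int]
) -> Optional[int]:
    """Return the first ranked taxid that is allowed and not yet used."""
    for taxid in ranked_taxids:
        taxid = int(taxid)  # tolerate taxids stored as strings or numpy ints
        if taxid in allowed_space and taxid not in used_set:
            return taxid
    return None


used = {564}
choice = pick_bounded_sketch([564, 562, 1000], allowed_space={562, 564}, used_set=used)
if choice is not None:
    used.add(choice)  # the caller marks the taxid as taken in this sketch
no_choice = pick_bounded_sketch([564, 562], allowed_space={562, 564}, used_set=used)
```

Because the list is scanned best-first, the result is always the highest-ranked candidate that satisfies both constraints.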

capellini.stages.mmseqs2.run_mmseqs2(cfg: CapelliniConfig, taxonomy_table: DataFrame) → DataFrame

Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.

Steps:

1. Get/bundle 16S reference (Modification 1).
2. Run mmseqs easy-search.
3. Parse .m8 output.
4. Extract IDs from reference FASTA.
5. Build per-layer ranked hit dicts.
6. Run map_silva_to_progenomes_bounded.
7. Save silva_fixed CSV.
8. Optionally delete downloaded reference.

Parameters:

cfg – Populated CapelliniConfig instance.
taxonomy_table – Taxonomy table from the NCBI mapping stage.

Returns:

silva_fixed DataFrame with progenomes_taxid and GCA columns.

capellini.stages.mmseqs2.run_mmseqs_easy_search(bact_path: Path, reference_16s_path: Path, mmseq_folder: str) → Path

Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.

Parameters:

bact_path – Query FASTA (16S DADA2 bacteria sequences).
reference_16s_path – Subject FASTA (progenome16S.fasta).
mmseq_folder – Output directory for the .m8 file.

Returns:

Path to the output.m8 file.
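For orientation, an easy-search call of this kind might be assembled as shown below. The positional order (query, target, result file, temporary directory) and `--search-type 3` for nucleotide search follow the MMseqs2 command line. The helper name `build_easy_search_cmd` and the `tmp` subdirectory are assumptions, and any further options the stage passes are not shown here.

```python
from pathlib import Path
from typing import List


def build_easy_search_cmd(bact_path: Path, reference_16s_path: Path, mmseq_folder: str) -> List[str]:
    """Assemble an mmseqs easy-search command line without running it."""
    out_dir = Path(mmseq_folder)
    return [
        "mmseqs", "easy-search",
        str(bact_path),             # query FASTA (16S ASVs)
        str(reference_16s_path),    # target FASTA (16S reference)
        str(out_dir / "output.m8"),  # tabular result file
        str(out_dir / "tmp"),       # scratch directory (assumed name)
        "--search-type", "3",       # 3 = nucleotide search
    ]


cmd = build_easy_search_cmd(Path("asvs.fasta"), Path("progenome16S.fasta"), "mmseqs_out")
# To execute, something like subprocess.run(cmd, check=True) could be used,
# provided the mmseqs binary is on PATH.
```

Passing the command as a list rather than a shell string avoids quoting problems when paths contain spaces.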