capellini.stages.mmseqs2

MMSeqs2 stage: 16S reference, easy-search, and 3-layer NCBI/GCA assignment.

Functions

build_top200_dicts(taxonomy_table, ...)

Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.

extract_ids_from_header(header)

Extract all structured IDs from a FASTA header using compiled regex patterns.

extract_ids_from_reference(reference_16s_path)

Parse the 16S FASTA reference and extract IDs from every record header.

fallback_pick_from_space(allowed_space, used_set)

Deterministic fallback: pick the smallest unused taxid from the allowed space.

get_reference_16s(cfg)

Return path to the 16S ProGenomes reference, using the bundled file if available.

is_16s_gene(description)

Return True if a FASTA description corresponds to a 16S rRNA gene.

map_silva_to_progenomes_bounded(silva, ...)

3-layer NCBI taxid assignment: species, genus, and family resolution.

parse_mmseqs_output(mmseqs_output, ...)

Parse an mmseqs .m8 output file and return the top-scoring hits per query.

pick_bounded(ranked_taxids, allowed_space, ...)

Pick the first ranked taxid that is in allowed_space and not already used.

run_mmseqs2(cfg, taxonomy_table)

Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.

run_mmseqs_easy_search(bact_path, ...)

Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.

capellini.stages.mmseqs2.build_top200_dicts(taxonomy_table: DataFrame, topBitScore_df: DataFrame) → tuple[dict, dict, dict]

Build the three ranked-hit dictionaries for the 3-layer NCBI ID assignment.

The layers are:
  • Layer 1 (species) – per-ASV ranked hits for ASVs with a Genus.

  • Layer 2 (genus) – pooled ranked hits for all ASVs in each genus.

  • Layer 3 (family) – per-ASV ranked hits for ASVs without a Genus but with a Family.

Parameters:
  • taxonomy_table – ASV taxonomy DataFrame (index = ASV names, Genus/Family columns).

  • topBitScore_df – MMSeqs2 hits with query, NCBI ID, bitscore columns.

Returns:

Tuple (top_200_per_asv, top_200_per_genus, top_200_per_family).
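
The three layers can be illustrated with plain Python containers instead of DataFrames. This is a minimal sketch, not the module's implementation: the names build_layers and _append_ranked are hypothetical, ranking by descending bitscore is an assumption, and the limit of 200 is inferred only from the function name.

```python
from collections import defaultdict


def _append_ranked(bucket, taxid, limit):
    # Keep each taxid once, in rank order, up to the limit.
    if taxid not in bucket and len(bucket) < limit:
        bucket.append(taxid)


def build_layers(taxonomy, hits, limit=200):
    """Hypothetical stand-in for the three ranked-hit dictionaries.

    taxonomy: {asv: {"Genus": str or None, "Family": str or None}}
    hits:     iterable of (asv, taxid, bitscore) tuples
    """
    # Assumption: "ranked" means sorted by descending bitscore.
    ranked = sorted(hits, key=lambda h: h[2], reverse=True)
    per_asv = defaultdict(list)
    per_genus = defaultdict(list)
    per_family = defaultdict(list)
    for asv, taxid, _score in ranked:
        ranks = taxonomy.get(asv, {})
        genus, family = ranks.get("Genus"), ranks.get("Family")
        if genus:
            _append_ranked(per_asv[asv], taxid, limit)      # Layer 1
            _append_ranked(per_genus[genus], taxid, limit)  # Layer 2 (pooled)
        elif family:
            _append_ranked(per_family[asv], taxid, limit)   # Layer 3
    return dict(per_asv), dict(per_genus), dict(per_family)
```

ASVs with neither a Genus nor a Family contribute to none of the three dictionaries in this sketch.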

capellini.stages.mmseqs2.extract_ids_from_header(header: str) → Dict[str, List[str]]

Extract all structured IDs from a FASTA header using compiled regex patterns.

Parameters:

header – FASTA record description string.

Returns:

Dict with keys taxids, assemblies, biosamples, bioprojects, accessions.
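
A regex-based extractor of this shape might look like the sketch below. The patterns are illustrative assumptions, not the ones compiled in the module: the taxid=... tag format and the RefSeq-style accession pattern in particular are guesses, and the names PATTERNS and extract_ids are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical patterns; the module's real patterns may differ.
PATTERNS = {
    "taxids": re.compile(r"\btaxid[=:|](\d+)", re.IGNORECASE),
    "assemblies": re.compile(r"\bGC[AF]_\d{9}(?:\.\d+)?"),
    "biosamples": re.compile(r"\bSAM(?:N|EA|D)\d+"),
    "bioprojects": re.compile(r"\bPRJ(?:NA|EB|DB)\d+"),
    "accessions": re.compile(r"\bN[CZ]_[A-Z]*\d+\.\d+"),
}


def extract_ids(header: str) -> Dict[str, List[str]]:
    # findall returns the captured group for "taxids" and the full match
    # for the other patterns, which contain only non-capturing groups.
    return {key: pattern.findall(header) for key, pattern in PATTERNS.items()}
```

Every key is always present in the result, so callers can index it without checking; a header with no matches yields empty lists.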

capellini.stages.mmseqs2.extract_ids_from_reference(reference_16s_path: Path) → tuple[list, DataFrame]

Parse the 16S FASTA reference and extract IDs from every record header.

Parameters:

reference_16s_path – Path to progenome16S.fasta.

Returns:

Tuple (per_record_rows list, df DataFrame) where df has columns record_id, ncbi_id, assembly, biosample, bioproject, accessions, description.

capellini.stages.mmseqs2.fallback_pick_from_space(allowed_space: set, used_set: set)

Deterministic fallback: pick the smallest unused taxid from the allowed space.

Parameters:
  • allowed_space – Set of valid taxids.

  • used_set – Set of already-assigned taxids.

Returns:

Smallest unused int taxid, or None if the space is exhausted.
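
The documented behavior reduces to a set difference followed by min. The sketch below uses the hypothetical name smallest_unused; whether the real function also records the pick in used_set is not stated here, so this version leaves both sets untouched.

```python
from typing import Optional, Set


def smallest_unused(allowed_space: Set[int], used_set: Set[int]) -> Optional[int]:
    # Taxids that are valid for this layer and not yet assigned.
    remaining = allowed_space - used_set
    # min() makes the choice deterministic regardless of set ordering.
    return min(remaining) if remaining else None
```

For example, smallest_unused({5, 3, 9}, {3}) returns 5, and the result is None once every allowed taxid is in use.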

capellini.stages.mmseqs2.get_reference_16s(cfg: CapelliniConfig) → Path

Return path to the 16S ProGenomes reference, using the bundled file if available.

Modification 1: checks for the bundled progenome16S.fasta first. If it is not present, downloads the full genes FASTA from ProGenomes3 and filters it for 16S records.

Parameters:

cfg – Populated CapelliniConfig instance.

Returns:

Path to the ready progenome16S.fasta reference.
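
The bundled-first lookup can be sketched as follows. The function name resolve_reference and its parameters are hypothetical, and the download-and-filter step is passed in as a callable rather than implemented, since the ProGenomes3 URL and the layout of CapelliniConfig are not shown here.

```python
from pathlib import Path
from typing import Callable


def resolve_reference(
    bundled: Path,
    download_dir: Path,
    build: Callable[[Path], None],
) -> Path:
    """Return the bundled reference if present, else build one.

    build is a hypothetical callable that downloads the genes FASTA,
    filters it for 16S records, and writes the result to the given path.
    """
    if bundled.is_file():
        return bundled
    target = download_dir / "progenome16S.fasta"
    if not target.is_file():
        download_dir.mkdir(parents=True, exist_ok=True)
        build(target)
    return target
```

Injecting build keeps the lookup logic testable without network access; a previously built file in download_dir is reused rather than rebuilt.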

capellini.stages.mmseqs2.is_16s_gene(description: str) → bool

Return True if a FASTA description corresponds to a 16S rRNA gene.

Parameters:

description – FASTA record description string.

Returns:

True if the record is a 16S rRNA gene.
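
One plausible way to implement such a check is a case-insensitive regex on the description. The pattern below is an assumption, as is the name looks_like_16s; the module's actual matching rules are not documented here.

```python
import re

# Assumption: a 16S record mentions "16S" followed by "rRNA" or "ribosomal RNA".
_16S_RE = re.compile(r"\b16S\b.*\b(?:rRNA|ribosomal RNA)\b", re.IGNORECASE)


def looks_like_16s(description: str) -> bool:
    return bool(_16S_RE.search(description))
```

A pattern this loose also accepts descriptions such as "16S rRNA methyltransferase", so a production filter would likely need an exclusion list.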

capellini.stages.mmseqs2.map_silva_to_progenomes_bounded(silva: DataFrame, top_200_per_asv: dict, top_200_per_genus: dict, top_200_per_family: dict, genus_to_taxids: dict, family_to_taxids: dict, ncbi_taxids_by_genus: dict, ncbi_taxids_by_family: dict, allowed_universe: set, topBitScore_df: DataFrame, progenomes_ref_df: DataFrame, enforce_unique_taxids: bool = True, debug: bool = False) → tuple

3-layer NCBI taxid assignment: species, genus, and family resolution.

Adds the following columns to the input silva DataFrame:
  • progenomes_taxid_species, progenomes_taxid_genus, progenomes_taxid_family

  • GCA_species, GCA_genus, GCA_family

  • GCA (consolidated)

Parameters:
  • silva – ASV taxonomy DataFrame.

  • top_200_per_asv – Layer 1 ranked hits (per-ASV).

  • top_200_per_genus – Layer 2 ranked hits (per-genus).

  • top_200_per_family – Layer 3 ranked hits (per-ASV, family fallback).

  • genus_to_taxids – ProGenomes genus name → set of taxids.

  • family_to_taxids – ProGenomes family name → set of taxids.

  • ncbi_taxids_by_genus – NCBI genus name → set of taxids.

  • ncbi_taxids_by_family – NCBI family name → set of taxids.

  • allowed_universe – Set of all NCBI taxids in the 16S reference.

  • topBitScore_df – MMSeqs2 hits DataFrame (with NCBI ID, Genome Accession ID).

  • progenomes_ref_df – DataFrame from extract_ids_from_reference (ncbi_id, record_id).

  • enforce_unique_taxids – Enforce uniqueness across genera/families (Layer 2/3).

  • debug – Print per-ASV assignment details.

Returns:

Tuple (silva_out, species_mapping, genus_mapping, family_mapping, genus_rep).

capellini.stages.mmseqs2.parse_mmseqs_output(mmseqs_output: Path, min_bitscore: int, max_matches: int) → DataFrame

Parse an mmseqs .m8 output file and return the top-scoring hits per query.

Adds NCBI ID, Genome Accession ID, and Gene Index columns parsed from the target header.

Parameters:
  • mmseqs_output – Path to output.m8.

  • min_bitscore – Minimum bitscore threshold.

  • max_matches – Maximum number of hits to keep per query (by bitscore).

Returns:

Filtered DataFrame with NCBI ID, Genome Accession ID, Gene Index columns.
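
The filter-and-rank step can be sketched with the standard library alone. This assumes the .m8 file uses the default 12-column MMseqs2 tabular layout and that the threshold is inclusive; the name top_hits is hypothetical, and parsing the target header into NCBI ID, Genome Accession ID, and Gene Index is deliberately left out because the header layout is not specified here.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Default MMseqs2 tabular columns (assumed; a custom --format-output changes them).
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]


def top_hits(m8_path: Path, min_bitscore: float, max_matches: int) -> list:
    by_query = defaultdict(list)
    with open(m8_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue
            hit = dict(zip(M8_COLUMNS, row))
            hit["bits"] = float(hit["bits"])
            # Assumption: hits exactly at the threshold are kept.
            if hit["bits"] >= min_bitscore:
                by_query[hit["query"]].append(hit)
    kept = []
    for hits in by_query.values():
        # Best bitscore first, then truncate to max_matches per query.
        hits.sort(key=lambda h: h["bits"], reverse=True)
        kept.extend(hits[:max_matches])
    return kept
```

The result is a list of row dicts grouped by query in first-seen order; loading it into a DataFrame is a one-line step for callers that need one.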

capellini.stages.mmseqs2.pick_bounded(ranked_taxids: list, allowed_space: set, used_set: set)

Pick the first ranked taxid that is in allowed_space and not already used.

Parameters:
  • ranked_taxids – Ordered list of NCBI taxids (best first).

  • allowed_space – Set of valid taxids for this rank/layer.

  • used_set – Global set of already-assigned taxids.

Returns:

Chosen integer taxid, or None if none qualify.
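
The selection rule is a first-match scan over the ranked list. A minimal sketch, using the hypothetical name pick_first_allowed and assuming the function does not itself update used_set:

```python
from typing import Iterable, Optional, Set


def pick_first_allowed(
    ranked_taxids: Iterable[int],
    allowed_space: Set[int],
    used_set: Set[int],
) -> Optional[int]:
    # Walk the hits best-first and stop at the first one that is both
    # valid for this layer and not already assigned elsewhere.
    for taxid in ranked_taxids:
        if taxid in allowed_space and taxid not in used_set:
            return int(taxid)
    return None
```

With ranked hits [818, 817, 820], an allowed space of {817, 820}, and 817 already used, the result is 820.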

capellini.stages.mmseqs2.run_mmseqs2(cfg: CapelliniConfig, taxonomy_table: DataFrame) → DataFrame

Orchestrate the full MMSeqs2 stage and 3-layer NCBI/GCA assignment.

Steps:
  1. Get/bundle 16S reference (Modification 1).

  2. Run mmseqs easy-search.

  3. Parse .m8 output.

  4. Extract IDs from reference FASTA.

  5. Build per-layer ranked hit dicts.

  6. Run map_silva_to_progenomes_bounded.

  7. Save silva_fixed CSV.

  8. Optionally delete downloaded reference.

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • taxonomy_table – Taxonomy table from the NCBI mapping stage.

Returns:

silva_fixed DataFrame with progenomes_taxid and GCA columns.

capellini.stages.mmseqs2.run_mmseqs_easy_search(bact_path, reference_16s_path, mmseq_folder)

Run mmseqs easy-search (nucleotide mode) of 16S ASVs against the reference.

Parameters:
  • bact_path – Query FASTA (16S DADA2 bacteria sequences).

  • reference_16s_path – Subject FASTA (progenome16S.fasta).

  • mmseq_folder – Output directory for the .m8 file.

Returns:

Path to the output.m8 file.
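
A wrapper of this kind typically shells out to the mmseqs binary. The sketch below assumes mmseqs is on PATH and that the temporary directory lives under the output folder; the function names and the tmp location are hypothetical. The --search-type 3 flag selects nucleotide search in MMseqs2.

```python
import subprocess
from pathlib import Path
from typing import List


def build_easy_search_cmd(query: Path, target: Path, out_dir: Path) -> List[str]:
    out_dir = Path(out_dir)
    return [
        "mmseqs", "easy-search",
        str(query),                  # query FASTA (ASV sequences)
        str(target),                 # target FASTA (16S reference)
        str(out_dir / "output.m8"),  # tabular alignment output
        str(out_dir / "tmp"),        # scratch directory (assumed location)
        "--search-type", "3",        # 3 = nucleotide search
    ]


def run_easy_search(query: Path, target: Path, out_dir: Path) -> Path:
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if mmseqs exits non-zero.
    subprocess.run(build_easy_search_cmd(query, target, out_dir), check=True)
    return out_dir / "output.m8"
```

Separating command construction from execution makes the argument list easy to inspect or log before anything is run.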