capellini.stages.procs

ProCs stage: bacterial/viral protein extraction, clustering, and PA matrix.

Functions

build_pa_matrix(cluster_res_df, ...[, ...])

Build a presence/absence or count matrix of protein clusters per genome/virus.

combine_protein_collections(bac_path, ...)

Concatenate bacterial and viral protein FASTAs into a single combined FASTA.

download_protein_reference(cfg)

Download the ProGenomes3 proteins FASTA if not already present.

extract_bacterial_proteins(cfg, gca_target_set)

Stream the bz2 protein FASTA and extract proteins for target GCAs.

extract_viral_proteins(cfg)

Run prodigal in metagenomic mode on the virus FASTA to extract proteins.

run_mmseqs_clustering(combined_fasta_path, ...)

Run mmseqs easy-cluster on the combined protein FASTA.

run_procs(cfg, gca_target_set)

Orchestrate the full ProCs stage.

capellini.stages.procs.build_pa_matrix(cluster_res_df: DataFrame, filter_1bac_1vir: bool, vir_fasta: Path, bac_fasta: Path, matrix_type: str = 'count') → DataFrame

Build a presence/absence or count matrix of protein clusters per genome/virus.

Parameters:
  • cluster_res_df – DataFrame with Cluster and Protein columns.

  • filter_1bac_1vir – If True, keep only clusters with ≥1 bacterial and ≥1 viral protein.

  • vir_fasta – ViralProteinsCollection.fasta for filter_1bac_1vir logic.

  • bac_fasta – BacterialProteinsCollection.fasta for filter_1bac_1vir logic.

  • matrix_type – 'count' or 'binary'.

Returns:

Genomes/viruses x protein clusters matrix DataFrame.
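The matrix construction described above can be sketched with pandas. This is an illustration under stated assumptions, not the capellini implementation: sketch_pa_matrix, keep_shared_clusters, and the protein_to_genome mapping are hypothetical names, and the sketch takes sets of protein IDs directly where the real function reads them from the two FASTA files.

```python
import pandas as pd


def keep_shared_clusters(cluster_res_df, bac_ids, vir_ids):
    """Keep rows whose cluster has >=1 bacterial and >=1 viral protein.

    bac_ids / vir_ids are sets of protein IDs (assumed to come from the
    bacterial and viral protein FASTA headers).
    """
    clusters = cluster_res_df["Cluster"]
    has_bac = cluster_res_df["Protein"].isin(bac_ids).groupby(clusters).any()
    has_vir = cluster_res_df["Protein"].isin(vir_ids).groupby(clusters).any()
    keep = has_bac & has_vir
    return cluster_res_df[clusters.isin(keep[keep].index)]


def sketch_pa_matrix(cluster_res_df, protein_to_genome, matrix_type="count"):
    """Build a genomes/viruses x protein clusters matrix.

    protein_to_genome is a hypothetical dict mapping each protein ID to the
    genome or virus it belongs to.
    """
    if matrix_type not in ("count", "binary"):
        raise ValueError("matrix_type must be 'count' or 'binary'")
    owner = cluster_res_df["Protein"].map(protein_to_genome).rename("Genome")
    # crosstab counts how many proteins of each genome fall in each cluster
    matrix = pd.crosstab(owner, cluster_res_df["Cluster"])
    if matrix_type == "binary":
        matrix = (matrix > 0).astype(int)
    return matrix
```

With matrix_type='count' each cell holds the number of proteins a genome contributes to a cluster; 'binary' collapses any non-zero count to 1.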

capellini.stages.procs.combine_protein_collections(bac_path: Path, vir_path: Path, combined_path: Path) → Path

Concatenate bacterial and viral protein FASTAs into a single combined FASTA.

Parameters:
  • bac_path – BacterialProteinsCollection.fasta path.

  • vir_path – ViralProteinsCollection.fasta path.

  • combined_path – Destination CombinedProteinsCollection.fasta path.

Returns:

Path to the combined FASTA.
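A minimal sketch of FASTA concatenation, assuming nothing beyond the standard library. The helper name concat_fastas is hypothetical, and the trailing-newline guard is a precaution added here rather than documented behaviour of the real function.

```python
import shutil
from pathlib import Path


def concat_fastas(bac_path: Path, vir_path: Path, combined_path: Path) -> Path:
    """Append the bacterial then the viral FASTA into combined_path."""
    combined_path.parent.mkdir(parents=True, exist_ok=True)
    with open(combined_path, "wb") as out:
        for src in (bac_path, vir_path):
            with open(src, "rb") as handle:
                # Copy in chunks so large FASTAs never sit fully in memory
                shutil.copyfileobj(handle, out)
                # If the source lacks a final newline, add one so the next
                # header does not fuse with the last sequence line
                if handle.tell() > 0:
                    handle.seek(-1, 2)
                    if handle.read(1) != b"\n":
                        out.write(b"\n")
    return combined_path
```

Copying in binary mode avoids any decoding cost and preserves the input bytes exactly.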

capellini.stages.procs.download_protein_reference(cfg: CapelliniConfig) → Path

Download the ProGenomes3 proteins FASTA if not already present.

Parameters:

cfg – Populated CapelliniConfig instance.

Returns:

Path to the downloaded bz2 protein reference.
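The "download only if not already present" pattern might look like the sketch below. The helper fetch_if_missing, its url and dest arguments, and the ".part" temporary suffix are assumptions for illustration; the real function takes its locations from cfg, whose fields are not documented here.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: Path) -> Path:
    """Download url to dest unless a non-empty dest already exists."""
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # reuse the cached copy, no network access
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # never leaves a truncated file at the final path
    partial = dest.with_name(dest.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(dest)
    return dest
```

Writing to a temporary file and renaming it afterwards keeps the existence check trustworthy on later runs.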

capellini.stages.procs.extract_bacterial_proteins(cfg: CapelliniConfig, gca_target_set: set) → Path

Stream the bz2 protein FASTA and extract proteins for target GCAs.

Uses a batch approach when len(gca_target_set) > cfg.batch_size to avoid holding all sequences in memory at once.

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • gca_target_set – Set of GCA IDs (e.g. 'GCA_000001405') to extract.

Returns:

Path to BacterialProteinsCollection.fasta.
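Streaming extraction from a bz2-compressed FASTA can be sketched as follows. The names stream_filter_fasta, key_of, and chunked are hypothetical. How a header maps to a GCA ID is not stated in this reference, so the sketch leaves that to a caller-supplied function, and chunked shows only one plausible way to split a large target set into batches.

```python
import bz2
from itertools import islice
from typing import Callable, Iterable, Iterator, Set


def stream_filter_fasta(bz2_path, out_path, targets: Set[str],
                        key_of: Callable[[str], str]) -> int:
    """Copy records whose header maps to a target ID; return how many."""
    kept = 0
    keep = False
    # bz2.open in text mode decompresses lazily, one line at a time
    with bz2.open(bz2_path, "rt") as src, open(out_path, "w") as out:
        for line in src:
            if line.startswith(">"):
                # key_of is an assumption: it turns a header into a genome ID
                keep = key_of(line[1:].strip()) in targets
                if keep:
                    kept += 1
            if keep:
                out.write(line)
    return kept


def chunked(ids: Iterable[str], size: int) -> Iterator[Set[str]]:
    """Yield the IDs in sorted order as sets of at most `size` elements."""
    it = iter(sorted(ids))
    while chunk := set(islice(it, size)):
        yield chunk
```

Only the current line is held in memory, so the size of the compressed reference does not affect peak memory use.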

capellini.stages.procs.extract_viral_proteins(cfg: CapelliniConfig) → Path

Run prodigal in metagenomic mode on the virus FASTA to extract proteins.

Parameters:

cfg – Populated CapelliniConfig instance.

Returns:

Path to ViralProteinsCollection.fasta.
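For orientation, a Prodigal call in metagenomic mode could be assembled as in this sketch. The helper names and file arguments are hypothetical, and the exact options capellini passes are not documented here. The flags are those of Prodigal 2.6.x: -i input, -a protein translations, -o gene coordinates, -p meta, -q quiet.

```python
import subprocess
from typing import List


def prodigal_meta_command(virus_fasta: str, proteins_out: str,
                          coords_out: str) -> List[str]:
    """Build the argument list for a metagenomic-mode Prodigal run."""
    return [
        "prodigal",
        "-i", str(virus_fasta),   # nucleotide FASTA of viral genomes
        "-a", str(proteins_out),  # translated proteins (FASTA)
        "-o", str(coords_out),    # gene coordinates
        "-p", "meta",             # metagenomic procedure
        "-q",                     # suppress progress output on stderr
    ]


def run_prodigal(virus_fasta: str, proteins_out: str, coords_out: str) -> str:
    """Run Prodigal and return the protein FASTA path (needs prodigal on PATH)."""
    cmd = prodigal_meta_command(virus_fasta, proteins_out, coords_out)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return proteins_out
```

Metagenomic mode uses pre-trained models instead of training on the input, which suits short viral genomes that offer too little sequence for training.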

capellini.stages.procs.run_mmseqs_clustering(combined_fasta_path: Path, clustering_path: str) → Path

Run mmseqs easy-cluster on the combined protein FASTA.

Parameters:
  • combined_fasta_path – CombinedProteinsCollection.fasta path.

  • clustering_path – Directory for clustering outputs.

Returns:

Path to the clusterRes output prefix. The cluster table itself is written to clusterRes_cluster.tsv.
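The relationship between the returned prefix and the TSV can be sketched as below. The helper names and the tmp subdirectory are assumptions, and any extra clustering options capellini may pass are not shown. The positional form `mmseqs easy-cluster <input> <prefix> <tmpDir>` is the standard MMseqs2 invocation.

```python
from pathlib import Path
from typing import List, Tuple


def easy_cluster_command(combined_fasta_path: Path,
                         clustering_path: str) -> Tuple[List[str], Path]:
    """Return the mmseqs easy-cluster argument list and the output prefix."""
    out_dir = Path(clustering_path)
    prefix = out_dir / "clusterRes"
    tmp_dir = out_dir / "tmp"  # scratch directory required by mmseqs (assumed name)
    cmd = ["mmseqs", "easy-cluster", str(combined_fasta_path),
           str(prefix), str(tmp_dir)]
    return cmd, prefix


def cluster_tsv_path(prefix: Path) -> Path:
    """Derive the cluster table path that mmseqs writes next to the prefix."""
    return prefix.with_name(prefix.name + "_cluster.tsv")
```

The resulting TSV has two tab-separated columns, the cluster representative and the member sequence, which would line up with the Cluster and Protein columns that build_pa_matrix expects.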

capellini.stages.procs.run_procs(cfg: CapelliniConfig, gca_target_set: set) → DataFrame

Orchestrate the full ProCs stage.

Steps:
  1. Extract bacterial proteins from ProGenomes3 bz2.

  2. Extract viral proteins with Prodigal.

  3. Combine into a single FASTA.

  4. Run mmseqs easy-cluster.

  5. Build PA matrix.

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • gca_target_set – Set of target GCA IDs from the MMSeqs2 stage.

Returns:

PA/count matrix DataFrame (genomes/viruses x protein clusters).