capellini.stages.procs
ProCs stage: bacterial/viral protein extraction, clustering, and PA matrix.
Functions
build_pa_matrix – Build a presence/absence or count matrix of protein clusters per genome/virus.
combine_protein_collections – Concatenate bacterial and viral protein FASTAs into a single combined FASTA.
download_protein_reference – Download the ProGenomes3 proteins FASTA if not already present.
extract_bacterial_proteins – Stream the bz2 protein FASTA and extract proteins for target GCAs.
extract_viral_proteins – Run prodigal in metagenomic mode on the virus FASTA to extract proteins.
run_mmseqs_clustering – Run mmseqs easy-cluster on the combined protein FASTA.
run_procs – Orchestrate the full ProCs stage.
- capellini.stages.procs.build_pa_matrix(cluster_res_df: DataFrame, filter_1bac_1vir: bool, vir_fasta: Path, bac_fasta: Path, matrix_type: str = 'count') → DataFrame
Build a presence/absence or count matrix of protein clusters per genome/virus.
- Parameters:
cluster_res_df – DataFrame with Cluster and Protein columns.
filter_1bac_1vir – If True, keep only clusters with ≥1 bacterial and ≥1 viral protein.
vir_fasta – Path to ViralProteinsCollection.fasta; used by the filter_1bac_1vir logic.
bac_fasta – Path to BacterialProteinsCollection.fasta; used by the filter_1bac_1vir logic.
matrix_type – ‘count’ or ‘binary’.
- Returns:
Genomes/viruses x protein clusters matrix DataFrame.
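The matrix construction described above can be illustrated with a minimal pandas sketch. It is not the package's implementation: `genome_of`, `filter_shared_clusters`, and `pa_matrix_sketch` are hypothetical helpers, and the `<genome>|<protein>` ID convention is an assumption made only for this example. The filter takes sets of protein IDs directly instead of reading them from the two FASTA files.

```python
import pandas as pd


def genome_of(protein_id: str) -> str:
    # Assumed ID convention "<genome>|<protein>"; the real IDs may differ.
    return protein_id.split("|", 1)[0]


def filter_shared_clusters(df: pd.DataFrame, bac_ids: set, vir_ids: set) -> pd.DataFrame:
    # Keep clusters that contain at least one bacterial and one viral protein.
    grouped = df.groupby("Cluster")["Protein"]
    has_bac = grouped.agg(lambda s: s.isin(bac_ids).any())
    has_vir = grouped.agg(lambda s: s.isin(vir_ids).any())
    keep = has_bac.index[has_bac & has_vir]
    return df[df["Cluster"].isin(keep)]


def pa_matrix_sketch(cluster_res_df: pd.DataFrame, matrix_type: str = "count") -> pd.DataFrame:
    if matrix_type not in {"count", "binary"}:
        raise ValueError("matrix_type must be 'count' or 'binary'")
    genomes = cluster_res_df["Protein"].map(genome_of)
    # Rows: genomes/viruses; columns: clusters; values: proteins per cluster.
    counts = pd.crosstab(genomes, cluster_res_df["Cluster"])
    if matrix_type == "binary":
        return (counts > 0).astype(int)
    return counts


demo = pd.DataFrame({
    "Cluster": ["c1", "c1", "c1", "c2"],
    "Protein": ["GCA_1|p1", "GCA_1|p2", "virA|p1", "virA|p2"],
})
count_matrix = pa_matrix_sketch(demo, "count")
binary_matrix = pa_matrix_sketch(demo, "binary")
shared = filter_shared_clusters(
    demo,
    bac_ids={"GCA_1|p1", "GCA_1|p2"},
    vir_ids={"virA|p1", "virA|p2"},
)
```

In this toy input, cluster `c1` holds two proteins from `GCA_1` and one from `virA`, so it survives the filter, while `c2` is purely viral and is dropped.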
- capellini.stages.procs.combine_protein_collections(bac_path: Path, vir_path: Path, combined_path: Path) → Path
Concatenate bacterial and viral protein FASTAs into a single combined FASTA.
- Parameters:
bac_path – BacterialProteinsCollection.fasta path.
vir_path – ViralProteinsCollection.fasta path.
combined_path – Destination CombinedProteinsCollection.fasta path.
- Returns:
Path to the combined FASTA.
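A concatenation of this kind might look like the following sketch. `combine_fastas_sketch` is a hypothetical stand-in, not the package's function; the newline guard is an added precaution so that a header from the second file cannot fuse with the last sequence line of the first.

```python
import os
import shutil
import tempfile
from pathlib import Path


def combine_fastas_sketch(bac_path: Path, vir_path: Path, combined_path: Path) -> Path:
    with open(combined_path, "wb") as out:
        for src in (bac_path, vir_path):
            with open(src, "rb") as handle:
                # Stream in chunks so large FASTA files are never fully in memory.
                shutil.copyfileobj(handle, out)
                # If the source did not end with a newline, add one.
                if handle.tell() > 0:
                    handle.seek(-1, os.SEEK_END)
                    if handle.read(1) != b"\n":
                        out.write(b"\n")
    return combined_path


with tempfile.TemporaryDirectory() as tmp:
    bac = Path(tmp) / "BacterialProteinsCollection.fasta"
    vir = Path(tmp) / "ViralProteinsCollection.fasta"
    bac.write_text(">b1\nMK")      # deliberately missing the trailing newline
    vir.write_text(">v1\nML\n")
    combined = combine_fastas_sketch(bac, vir, Path(tmp) / "CombinedProteinsCollection.fasta")
    combined_text = combined.read_text()
```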
- capellini.stages.procs.download_protein_reference(cfg: CapelliniConfig) → Path
Download the ProGenomes3 proteins FASTA if not already present.
- Parameters:
cfg – Populated CapelliniConfig instance.
- Returns:
Path to the downloaded bz2 protein reference.
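The "download only if missing" behaviour can be sketched as below. This is an illustration under assumptions, not the package's code: `ensure_downloaded` is a hypothetical helper, the URL is a placeholder rather than the real ProGenomes3 location, and the injectable `fetch` argument exists only so the demo runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def ensure_downloaded(
    url: str,
    dest: Path,
    fetch: Callable[[str, Path], object] = urllib.request.urlretrieve,
) -> Path:
    # Skip the download when a non-empty file is already in place.
    if dest.exists() and dest.stat().st_size > 0:
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name so an interrupted transfer is not mistaken
    # for a complete reference on the next run.
    partial = dest.with_name(dest.name + ".part")
    fetch(url, partial)
    partial.replace(dest)
    return dest


calls = []


def fake_fetch(url: str, target: Path) -> None:
    # Stand-in for a real download; writes a few bytes instead.
    calls.append(url)
    Path(target).write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "ref" / "proteins.fasta.bz2"
    placeholder_url = "https://example.invalid/proteins.fasta.bz2"  # hypothetical
    ensure_downloaded(placeholder_url, dest, fetch=fake_fetch)
    ensure_downloaded(placeholder_url, dest, fetch=fake_fetch)  # second call is a no-op
    download_count = len(calls)
```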
- capellini.stages.procs.extract_bacterial_proteins(cfg: CapelliniConfig, gca_target_set: set) → Path
Stream the bz2 protein FASTA and extract proteins for target GCAs.
Uses a batch approach when len(gca_target_set) > cfg.batch_size to avoid holding all sequences in memory at once.
- Parameters:
cfg – Populated CapelliniConfig instance.
gca_target_set – Set of GCA IDs (e.g. ‘GCA_000001405’) to extract.
- Returns:
Path to BacterialProteinsCollection.fasta.
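One way to read the streaming and batching behaviour described above is sketched here. It is an assumption-laden illustration: `gca_of`, `batched`, and `extract_sketch` are hypothetical names, the header layout `>GCA_xxx|protein` is invented for the example, and the package may batch differently.

```python
import bz2
import tempfile
from itertools import islice
from pathlib import Path


def gca_of(header: str) -> str:
    # Assumed header layout ">GCA_xxx|protein ..."; the real reference may differ.
    return header[1:].split("|", 1)[0]


def batched(items, size):
    # Yield the targets as sets of at most `size` IDs.
    it = iter(sorted(items))
    while chunk := list(islice(it, size)):
        yield set(chunk)


def extract_sketch(bz2_path: Path, out_path: Path, targets: set, batch_size: int) -> Path:
    with open(out_path, "w") as out:
        for batch in batched(targets, batch_size):
            keep = False
            # Decompress line by line; matching records go straight to disk.
            with bz2.open(bz2_path, "rt") as fh:
                for line in fh:
                    if line.startswith(">"):
                        keep = gca_of(line.strip()) in batch
                    if keep:
                        out.write(line)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "proteins.fasta.bz2"
    with bz2.open(ref, "wt") as fh:
        fh.write(">GCA_1|p1\nMKV\n>GCA_2|p1\nMAA\n>GCA_3|p1\nMTT\n")
    out = extract_sketch(
        ref,
        Path(tmp) / "BacterialProteinsCollection.fasta",
        {"GCA_1", "GCA_3"},
        batch_size=1,
    )
    extracted = out.read_text()
```

This version re-reads the compressed file once per batch, which trades extra decompression time for a bounded working set; a single pass with one large lookup set would be faster when memory allows.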
- capellini.stages.procs.extract_viral_proteins(cfg: CapelliniConfig) → Path
Run prodigal in metagenomic mode on the virus FASTA to extract proteins.
- Parameters:
cfg – Populated CapelliniConfig instance.
- Returns:
Path to ViralProteinsCollection.fasta.
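For orientation, a Prodigal call in metagenomic mode could be assembled as in this sketch. The helper names and file paths are hypothetical, and the exact flags used by the package are not documented here; `-i`, `-a`, `-o`, `-p meta`, and `-q` are standard Prodigal options.

```python
import subprocess
from pathlib import Path
from typing import List


def prodigal_meta_command(virus_fasta: Path, proteins_out: Path, genes_out: Path) -> List[str]:
    return [
        "prodigal",
        "-i", str(virus_fasta),    # input nucleotide FASTA
        "-a", str(proteins_out),   # protein translations (FASTA)
        "-o", str(genes_out),      # gene coordinates, kept off stdout
        "-p", "meta",              # metagenomic mode
        "-q",                      # suppress progress output
    ]


def run_prodigal(cmd: List[str]) -> None:
    # Requires prodigal on PATH; raises CalledProcessError on failure.
    subprocess.run(cmd, check=True)


cmd = prodigal_meta_command(
    Path("viruses.fasta"),                   # hypothetical input path
    Path("ViralProteinsCollection.fasta"),
    Path("viral_genes.gbk"),                 # hypothetical coordinates file
)
```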
- capellini.stages.procs.run_mmseqs_clustering(combined_fasta_path: Path, clustering_path: str) → Path
Run mmseqs easy-cluster on the combined protein FASTA.
- Parameters:
combined_fasta_path – CombinedProteinsCollection.fasta path.
clustering_path – Directory for clustering outputs.
- Returns:
Path to the clusterRes output prefix; the cluster table itself is written to clusterRes_cluster.tsv.
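The relationship between the returned prefix and the TSV can be sketched as follows. The helper names are hypothetical, the `tmp` subdirectory is an assumption, and mapping the two TSV columns (cluster representative, member) to `Cluster` and `Protein` is a guess based on the columns that build_pa_matrix expects.

```python
import io
from pathlib import Path
from typing import List

import pandas as pd


def easy_cluster_command(combined_fasta: Path, clustering_dir: str) -> List[str]:
    prefix = Path(clustering_dir) / "clusterRes"
    tmp_dir = Path(clustering_dir) / "tmp"  # assumed scratch location
    # mmseqs easy-cluster <input> <output prefix> <tmp dir>
    return ["mmseqs", "easy-cluster", str(combined_fasta), str(prefix), str(tmp_dir)]


def read_cluster_tsv(source) -> pd.DataFrame:
    # <prefix>_cluster.tsv has two tab-separated columns and no header:
    # cluster representative, then member sequence.
    return pd.read_csv(source, sep="\t", header=None, names=["Cluster", "Protein"])


cmd = easy_cluster_command(Path("CombinedProteinsCollection.fasta"), "clustering")

# Toy stand-in for clusterRes_cluster.tsv.
toy_tsv = io.StringIO("p1\tp1\np1\tp2\np3\tp3\n")
cluster_res_df = read_cluster_tsv(toy_tsv)
```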
- capellini.stages.procs.run_procs(cfg: CapelliniConfig, gca_target_set: set) → DataFrame
Orchestrate the full ProCs stage.
- Steps:
1. Extract bacterial proteins from the ProGenomes3 bz2 reference.
2. Extract viral proteins with Prodigal.
3. Combine both protein sets into a single FASTA.
4. Run mmseqs easy-cluster.
5. Build the PA matrix.
- Parameters:
cfg – Populated CapelliniConfig instance.
gca_target_set – Set of target GCA IDs from the MMSeqs2 stage.
- Returns:
PA/count matrix DataFrame (genomes/viruses x protein clusters).