capellini.config

Configuration dataclass for the CAPELLINI pipeline.

Classes

CapelliniConfig(base, download_path, ...)

All settings for the CAPELLINI pipeline, mirroring the notebook Settings section.

class capellini.config.CapelliniConfig(base: str = '', download_path: str = '', input_fasta_folder: str = '', dada2_folder: str = '', mmseq_folder: str = '', sp_folder: str = '', procs_folder: str = '', enhanced_networks_folder: str = '', silva_ref_path: str = '', silva_taxmap_path: str = '', full_ncbi_taxonomy_path: str = '', virus_fasta_name: str = '', metadata_path: str = '', bacterial_raw_fasta_folder: str = '', species_level: bool = False, fresh_start: bool = False, ref_removal: bool = True, regenerate_16S_reference: bool = False, regenerate_spacers_collection: bool = False, genes_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.genes.representatives.fasta.bz2', bacContigs_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.contigs.representatives.fasta.bz2', protein_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.proteins.representatives.fasta.bz2', direction: str = 'forward', bacteria_fasta_name: str = '16S_DADA2_bacteria.fasta', fasta_generation: bool = True, isolate_ref_16S: bool = True, mapping_saving: bool = True, min_bitscore: int = 50, max_matches: int = 20, add_taxonomy: bool = True, extend_taxonomy: bool = True, min_n_spacers: int = 3, min_length: int = 23, max_length: int = 47, fdr: float = 0.05, keep_spacers_collection: bool = True, remove_decomp_fasta: bool = True, proteins_extraction_path: str = '', clustering_path: str = '', matrix_type: str = 'count', save_single_bacgenome_collection: bool = False, keep_coords: bool = False, filter_1bac_1vir: bool = False, remove_collections: bool = False, batch_size: int = 1500, OUTPUT_ROOT: str = '', OVERWRITE: bool = False, VERBOSE: bool = True, RUN_COMMON_ABUNDANCE: bool = True, RUN_SHRINKAGE_CORRELATIONS: bool = True, RUN_RAW_CRISPR_NETWORKS: bool = True, RUN_SMOOTH_CRISPR: bool = True, RUN_XSTAR: bool = True, PREVALENCE: float = 0.1, KEEP_COLUMN: str = 'keep_for_analysis', BACTERIA_TAXONOMY_RANK: str = 'target_taxids', 
BACTERIAL_RANKS: list = <factory>, BACTERIAL_WEIGHTS: list = <factory>, CRISPR_SMOOTH_ALPHA: float = 0.95, TRANSPOSE_RAW_CRISPR_AFTER_LOAD: bool = True, PSEUDOCOUNT: float = 1e-06, LAM: float = 0.5, N_STEPS: int = 1, PRESERVE_SCALE: bool = False, STUDY: str = 'default', virus_abundance_raw: str = '', bacteria_otu: str = '', bacteria_taxonomy: str = '', phage_host_predictions: str = '', tax_bac_for_smoothing: str = '', tax_vir: str = '', viral_ranks: list = <factory>, viral_weights: list = <factory>, aggregate_viral_rank: str = 'lev0')

Bases: object

All settings for the CAPELLINI pipeline, mirroring the notebook Settings section.

Every field has a default value, so the class can be instantiated without arguments; fields that a given run needs must still be set explicitly or loaded via from_yaml or from_dict. Derived path fields are computed in __post_init__ when left as empty strings.
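The two dataclass patterns visible here, path fields filled in by __post_init__ when left empty and list fields rendered as <factory>, can be illustrated with a small stand-in class. This is only a sketch: MiniConfig, the "downloads" folder name, and the rank values are hypothetical, and the actual derivation rules used by CapelliniConfig are not documented on this page.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class MiniConfig:
    """Hypothetical stand-in; not the real CapelliniConfig."""

    base: str = ""
    download_path: str = ""  # derived from base when left empty
    # A list default must go through default_factory so that each
    # instance gets its own list; Sphinx renders this as <factory>.
    ranks: list = field(default_factory=lambda: ["genus", "family"])

    def __post_init__(self) -> None:
        # Assumed rule: derive only when base is set and the field is empty.
        if self.base and not self.download_path:
            self.download_path = str(Path(self.base) / "downloads")


derived = MiniConfig(base="/data/project")
explicit = MiniConfig(base="/data/project", download_path="/scratch/dl")
```

In this sketch an explicitly supplied path is never overwritten, and a config created without base keeps its path fields empty.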

BACTERIAL_RANKS: list
BACTERIAL_WEIGHTS: list
BACTERIA_TAXONOMY_RANK: str = 'target_taxids'
CRISPR_SMOOTH_ALPHA: float = 0.95
KEEP_COLUMN: str = 'keep_for_analysis'
LAM: float = 0.5
N_STEPS: int = 1
OUTPUT_ROOT: str = ''
OVERWRITE: bool = False
PRESERVE_SCALE: bool = False
PREVALENCE: float = 0.1
PSEUDOCOUNT: float = 1e-06
RUN_COMMON_ABUNDANCE: bool = True
RUN_RAW_CRISPR_NETWORKS: bool = True
RUN_SHRINKAGE_CORRELATIONS: bool = True
RUN_SMOOTH_CRISPR: bool = True
RUN_XSTAR: bool = True
STUDY: str = 'default'
TRANSPOSE_RAW_CRISPR_AFTER_LOAD: bool = True
VERBOSE: bool = True
add_taxonomy: bool = True
aggregate_viral_rank: str = 'lev0'
bacContigs_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.contigs.representatives.fasta.bz2'
bacteria_fasta_name: str = '16S_DADA2_bacteria.fasta'
bacteria_otu: str = ''
bacteria_taxonomy: str = ''
bacterial_raw_fasta_folder: str = ''
base: str = ''
batch_size: int = 1500
clustering_path: str = ''
dada2_folder: str = ''
classmethod default() → CapelliniConfig

Return a config with all default values (paths will be empty until base is set).

direction: str = 'forward'
download_path: str = ''
enhanced_networks_folder: str = ''
extend_taxonomy: bool = True
fasta_generation: bool = True
fdr: float = 0.05
filter_1bac_1vir: bool = False
fresh_start: bool = False
classmethod from_dict(d: dict[str, Any]) → CapelliniConfig

Load config from a plain Python dict.
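One plausible way to build a dataclass from a plain dict is sketched below. MiniConfig is a hypothetical stand-in with three of the documented field names, and the choice to reject unknown keys is an assumption; this page does not say how the real from_dict treats keys that are not fields.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class MiniConfig:
    """Hypothetical stand-in; not the real CapelliniConfig."""

    base: str = ""
    min_bitscore: int = 50
    fdr: float = 0.05

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> MiniConfig:
        # Assumption: keys that are not dataclass fields are rejected
        # loudly rather than silently dropped.
        known = {f.name for f in fields(cls)}
        unknown = sorted(set(d) - known)
        if unknown:
            raise KeyError(f"unknown config keys: {unknown}")
        return cls(**d)


# Keys missing from the dict fall back to the dataclass defaults.
cfg = MiniConfig.from_dict({"base": "/data/project", "fdr": 0.01})
```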

classmethod from_yaml(path: str | Path) → CapelliniConfig

Load config from a YAML file.

Parameters:

path – Path to the YAML configuration file.

Returns:

CapelliniConfig populated from the YAML.

full_ncbi_taxonomy_path: str = ''
genes_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.genes.representatives.fasta.bz2'
input_fasta_folder: str = ''
isolate_ref_16S: bool = True
keep_coords: bool = False
keep_spacers_collection: bool = True
mapping_saving: bool = True
matrix_type: str = 'count'
max_length: int = 47
max_matches: int = 20
metadata_path: str = ''
min_bitscore: int = 50
min_length: int = 23
min_n_spacers: int = 3
mmseq_folder: str = ''
phage_host_predictions: str = ''
procs_folder: str = ''
protein_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.proteins.representatives.fasta.bz2'
proteins_extraction_path: str = ''
ref_removal: bool = True
regenerate_16S_reference: bool = False
regenerate_spacers_collection: bool = False
remove_collections: bool = False
remove_decomp_fasta: bool = True
save_single_bacgenome_collection: bool = False
silva_ref_path: str = ''
silva_taxmap_path: str = ''
sp_folder: str = ''
species_level: bool = False
tax_bac_for_smoothing: str = ''
tax_vir: str = ''
to_yaml(path: str | Path) → None

Serialize config to a YAML file.

Parameters:

path – Destination YAML file path.
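The save-and-reload cycle that to_yaml and from_yaml provide can be sketched with a stand-in dataclass. To stay free of third-party dependencies, this sketch swaps YAML for the standard-library json module; MiniConfig, to_file, and from_file are hypothetical names, and the YAML library used by the real methods is not stated on this page.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class MiniConfig:
    """Hypothetical stand-in; not the real CapelliniConfig."""

    base: str = ""
    batch_size: int = 1500
    VERBOSE: bool = True

    def to_file(self, path: str | Path) -> None:
        # asdict() turns the dataclass into a plain dict of its fields.
        Path(path).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def from_file(cls, path: str | Path) -> MiniConfig:
        # JSON is used here only as a stand-in for YAML.
        return cls(**json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "config.json"
    MiniConfig(base="/data/project", batch_size=500).to_file(target)
    restored = MiniConfig.from_file(target)
```

Because dataclasses compare field by field, the restored object equals the one that was written.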

viral_ranks: list
viral_weights: list
virus_abundance_raw: str = ''
virus_fasta_name: str = ''
virus_fasta_path() → Path

Resolved absolute path to the virus FASTA file.
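Resolving a configured file name to an absolute path can be sketched with pathlib. MiniConfig is a hypothetical stand-in, and the assumption that the FASTA file sits directly under base is illustrative only; this page does not say which folder the real method joins with virus_fasta_name.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class MiniConfig:
    """Hypothetical stand-in; not the real CapelliniConfig."""

    base: str = ""
    virus_fasta_name: str = ""

    def virus_fasta_path(self) -> Path:
        # Assumption: the FASTA file sits directly under base.
        # resolve() makes the path absolute; the file need not exist.
        return (Path(self.base) / self.virus_fasta_name).resolve()


path = MiniConfig(base="data", virus_fasta_name="viruses.fasta").virus_fasta_path()
```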