capellini.config

Configuration dataclass for the CAPELLINI pipeline.

Classes

CapelliniConfig(base, download_path, ...)

All settings for the CAPELLINI pipeline, mirroring the notebook Settings section.

class capellini.config.CapelliniConfig(base: str = '', download_path: str = '', input_fasta_folder: str = '', dada2_folder: str = '', mmseq_folder: str = '', sp_folder: str = '', procs_folder: str = '', enhanced_networks_folder: str = '', silva_ref_path: str = '', silva_taxmap_path: str = '', full_ncbi_taxonomy_path: str = '', ncbi_accessory_path: str = '', virus_fasta_name: str = '', metadata_path: str = '', bacterial_raw_fasta_folder: str = '', species_level: bool = False, fresh_start: bool = False, ref_removal: bool = True, regenerate_16S_reference: bool = False, regenerate_spacers_collection: bool = False, silva_ref_url: str = 'https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz', silva_taxmap_url: str = 'https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz', full_ncbi_taxonomy_url: str = '', ncbi_taxdmp_url: str = 'https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip', genes_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.genes.representatives.fasta.bz2', bacContigs_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.contigs.representatives.fasta.bz2', protein_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.proteins.representatives.fasta.bz2', direction: str = 'forward', bacteria_fasta_name: str = '16S_DADA2_bacteria.fasta', fasta_generation: bool = True, isolate_ref_16S: bool = True, mapping_saving: bool = True, min_bitscore: int = 50, max_matches: int = 20, add_taxonomy: bool = True, extend_taxonomy: bool = True, min_n_spacers: int = 3, min_length: int = 23, max_length: int = 47, fdr: float = 0.05, keep_spacers_collection: bool = True, remove_decomp_fasta: bool = True, proteins_extraction_path: str = '', clustering_path: str = '', matrix_type: str = 'count', save_single_bacgenome_collection: bool = False, keep_coords: bool = False, filter_1bac_1vir: bool = False, remove_collections: bool = False, 
batch_size: int = 1500, output_root: str = '', overwrite: bool = False, verbose: bool = True, run_common_abundance: bool = True, run_shrinkage_correlations: bool = True, run_raw_crispr_networks: bool = True, run_smooth_crispr: bool = True, run_xstar: bool = True, prevalence: float = 0.1, keep_column: str = 'keep_for_analysis', bacteria_taxonomy_rank: str = 'target_taxids', bacterial_ranks: list = <factory>, bacterial_weights: list = <factory>, crispr_smooth_alpha: float = 0.95, transpose_raw_crispr_after_load: bool = True, pseudocount: float = 1e-06, lam: float = 0.5, n_steps: int = 1, preserve_scale: bool = False, virus_abundance_raw: str = '', bacteria_otu: str = '', bacteria_taxonomy: str = '', phage_host_predictions: str = '', tax_bac_for_smoothing: str = '', tax_vir: str = '', viral_ranks: list = <factory>, viral_weights: list = <factory>, aggregate_viral_rank: str = 'lev0')

Bases: object

All settings for the CAPELLINI pipeline, mirroring the notebook Settings section.

Although every field in the signature above carries a default, the fields the pipeline requires must still be provided explicitly or via from_yaml/from_dict. Derived path fields are computed in __post_init__ when left as empty strings.
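
The empty-string convention can be illustrated with a small stand-in dataclass. This is a sketch, not the CAPELLINI source: the class name MiniConfig, the rule that dada2_folder becomes base/dada2, and the choice to skip derivation when base is empty are all assumptions made for illustration. Only the field names base, dada2_folder and bacterial_ranks, the use of __post_init__, and the factory defaults are taken from the signature above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class MiniConfig:
    """Hypothetical stand-in for CapelliniConfig (illustration only)."""

    base: str = ""
    dada2_folder: str = ""  # derived from base when left empty
    # Mutable defaults need a factory; this is what "<factory>" denotes
    # in the rendered signature.
    bacterial_ranks: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed derivation rule: fill the folder only if the caller
        # left it empty and a base directory is known.
        if self.base and not self.dada2_folder:
            self.dada2_folder = str(Path(self.base) / "dada2")


derived = MiniConfig(base="/data/study")
explicit = MiniConfig(base="/data/study", dada2_folder="/scratch/dada2")
empty = MiniConfig()
```

Here derived gets its folder filled in from base, explicit keeps the value the caller supplied, and empty keeps an empty path because there is no base to derive from.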

add_taxonomy: bool = True
aggregate_viral_rank: str = 'lev0'
bacContigs_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.contigs.representatives.fasta.bz2'
bacteria_fasta_name: str = '16S_DADA2_bacteria.fasta'
bacteria_otu: str = ''
bacteria_taxonomy: str = ''
bacteria_taxonomy_rank: str = 'target_taxids'
bacterial_ranks: list
bacterial_raw_fasta_folder: str = ''
bacterial_weights: list
base: str = ''
batch_size: int = 1500
clustering_path: str = ''
crispr_smooth_alpha: float = 0.95
dada2_folder: str = ''
classmethod default() -> CapelliniConfig

Return a config with all default values (paths will be empty until base is set).

direction: str = 'forward'
download_path: str = ''
enhanced_networks_folder: str = ''
extend_taxonomy: bool = True
fasta_generation: bool = True
fdr: float = 0.05
filter_1bac_1vir: bool = False
fresh_start: bool = False
classmethod from_dict(d: dict[str, Any]) -> CapelliniConfig

Load config from a plain Python dict.

  • Old UPPER_CASE network keys are auto-translated to the new lowercase names.

  • STUDY is silently dropped.

  • YAML values that look like un-interpolated Python f-strings (f"{dada2_folder}/...") are normalised to empty so the path-derivation logic in __post_init__ can fill them in.
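
The three normalisation rules above can be sketched as a plain function over a dict. This is an illustration under stated assumptions, not the library's implementation: the function name normalise, the LEGACY_KEYS mapping (including which UPPER_CASE keys exist), and the regular expression used to spot un-interpolated f-strings are all hypothetical.

```python
import re
from typing import Any

# Hypothetical mapping from legacy UPPER_CASE keys to current field names.
LEGACY_KEYS = {"PREVALENCE": "prevalence", "KEEP_COLUMN": "keep_column"}

# Matches string values such as f"{dada2_folder}/reads" that were written
# as Python f-strings but never interpolated.
_FSTRING = re.compile(r"""^f(["']).*\{.*\}.*\1$""")


def normalise(d: dict[str, Any]) -> dict[str, Any]:
    """Apply the three rules: rename legacy keys, drop STUDY, blank f-strings."""
    out: dict[str, Any] = {}
    for key, value in d.items():
        if key == "STUDY":
            continue  # dropped without warning
        key = LEGACY_KEYS.get(key, key)
        if isinstance(value, str) and _FSTRING.match(value.strip()):
            value = ""  # left empty so path derivation can fill it in later
        out[key] = value
    return out


raw = {
    "PREVALENCE": 0.2,
    "STUDY": "pilot",
    "silva_ref_path": 'f"{dada2_folder}/silva.fa.gz"',
    "fdr": 0.05,
}
clean = normalise(raw)
```

With this input, clean contains prevalence, an empty silva_ref_path, and fdr; the STUDY entry is gone.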

classmethod from_yaml(path: str | Path) -> CapelliniConfig

Load config from a YAML file.

Parameters:

path – Path to the YAML configuration file.

Returns:

CapelliniConfig populated from the YAML.

full_ncbi_taxonomy_path: str = ''
full_ncbi_taxonomy_url: str = ''
genes_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.genes.representatives.fasta.bz2'
input_fasta_folder: str = ''
isolate_ref_16S: bool = True
keep_column: str = 'keep_for_analysis'
keep_coords: bool = False
keep_spacers_collection: bool = True
lam: float = 0.5
mapping_saving: bool = True
matrix_type: str = 'count'
max_length: int = 47
max_matches: int = 20
metadata_path: str = ''
min_bitscore: int = 50
min_length: int = 23
min_n_spacers: int = 3
mmseq_folder: str = ''
n_steps: int = 1
ncbi_accessory_path: str = ''
ncbi_taxdmp_url: str = 'https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip'
output_root: str = ''
overwrite: bool = False
phage_host_predictions: str = ''
preserve_scale: bool = False
prevalence: float = 0.1
procs_folder: str = ''
protein_reference_url: str = 'http://progenomes3.embl.de/data/repGenomes/progenomes3.proteins.representatives.fasta.bz2'
proteins_extraction_path: str = ''
pseudocount: float = 1e-06
ref_removal: bool = True
regenerate_16S_reference: bool = False
regenerate_spacers_collection: bool = False
remove_collections: bool = False
remove_decomp_fasta: bool = True
run_common_abundance: bool = True
run_raw_crispr_networks: bool = True
run_shrinkage_correlations: bool = True
run_smooth_crispr: bool = True
run_xstar: bool = True
save_single_bacgenome_collection: bool = False
silva_ref_path: str = ''
silva_ref_url: str = 'https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz'
silva_taxmap_path: str = ''
silva_taxmap_url: str = 'https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz'
sp_folder: str = ''
species_level: bool = False
tax_bac_for_smoothing: str = ''
tax_vir: str = ''
to_yaml(path: str | Path) -> None

Serialize config to a YAML file.

Parameters:

path – Destination YAML file path.

transpose_raw_crispr_after_load: bool = True
verbose: bool = True
viral_ranks: list
viral_weights: list
virus_abundance_raw: str = ''
virus_fasta_name: str = ''
virus_fasta_path() -> Path

Resolved absolute path to the virus FASTA file.
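
What "resolved absolute path" means in pathlib terms can be sketched as follows. The helper name resolve_fasta and the idea that the path is built by joining a folder with virus_fasta_name are assumptions; the documentation does not say which folder field the method uses.

```python
from pathlib import Path


def resolve_fasta(folder: str, fasta_name: str) -> Path:
    """Join folder and file name, then make the result absolute."""
    # expanduser() handles a leading "~"; resolve() makes the path
    # absolute and collapses ".." segments, and does not require the
    # file to exist.
    return (Path(folder).expanduser() / fasta_name).resolve()


# Hypothetical folder and file name, for illustration only.
p = resolve_fasta("data/../inputs", "viruses.fasta")
```

The result is absolute regardless of the current working directory, which makes it safe to pass to external tools that run from a different location.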