capellini.stages.spacepharer
SpacePHARER stage: spacer extraction, DB creation, prediction, and statistics.
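Functions
check_and_install_spacepharer() | Check that spacepharer and minced are on PATH; install via conda if missing.
compute_spacepharer_stats(cfg, silva_fixed, topBitScore_df) | Print the full pipeline statistics summary (Sections 1-3 from notebook).
filter_target_spacers(spacers_collection, ncbi_id_target_set_int, input_fasta_folder) | Filter spacers_CompleteCollection.fasta to cohort-specific NCBI IDs.
get_spacers_collection(cfg, wf) | Return path to spacers_CompleteCollection.fasta, using the bundled file if present.
plot_spacepharer_figures(cfg) | Generate SpacePHARER network figures (Sections 4-5 from notebook).
run_spacepharer(cfg, silva_fixed) | Run the full SpacePHARER stage: spacer collection, filtering, and prediction.
Classes
SpacePHARERWorkflow(workdir, spacerdir) | Wrapper around the SpacePHARER + MinCED command-line tools.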
- class capellini.stages.spacepharer.SpacePHARERWorkflow(workdir: str, spacerdir: str)
Bases: object
Wrapper around the SpacePHARER + MinCED command-line tools.
- extract_spacers(fasta_path: str | Path, min_n_spacers: int, min_length: int, max_length: int, tag: str = 'spacers_CompleteCollection') -> Path
Extract CRISPR spacers from a FASTA using MinCED.
- Parameters:
fasta_path – Input contigs FASTA.
min_n_spacers – Minimum number of spacers in a CRISPR array (-minNR).
min_length – Minimum length of a CRISPR repeat (-minRL).
max_length – Maximum length of a CRISPR repeat (-maxRL).
tag – Output file name prefix.
- Returns:
Path to the spacer FASTA file.
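The mapping from these parameters to MinCED flags can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the helper name build_minced_command, the outdir argument, and the report file name are hypothetical, and the command is only assembled, never executed.

```python
from pathlib import Path


def build_minced_command(fasta_path, min_n_spacers, min_length, max_length,
                         outdir, tag="spacers_CompleteCollection"):
    """Assemble a MinCED argument list (hypothetical helper; nothing is run)."""
    # Assumption: the text report is named after `tag` inside `outdir`.
    report = Path(outdir) / f"{tag}.txt"
    return [
        "minced",
        "-spacers",                      # ask MinCED to also write a spacer FASTA
        "-minNR", str(min_n_spacers),
        "-minRL", str(min_length),
        "-maxRL", str(max_length),
        str(fasta_path),
        str(report),
    ]


cmd = build_minced_command("contigs.fasta", 3, 23, 47, "spacers")
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run(cmd, check=True) once MinCED is available on PATH.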
- make_db(fasta_file: str | Path, dbname: str, is_spacer: bool = False, rev: bool = False) -> Path
Create a SpacePHARER database from a FASTA file.
- Parameters:
fasta_file – Input FASTA.
dbname – Database name (placed in databases/ subdirectory).
is_spacer – Add the --extractorf-spacer 1 flag.
rev – Add the --reverse-fragments 1 flag.
- Returns:
Path to the database.
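How the two boolean options translate into a spacepharer createsetdb call can be sketched as below. The helper name build_createsetdb_command and the tmp_dir argument are hypothetical; the real method may lay out its paths differently, and the sketch only builds the argument list.

```python
def build_createsetdb_command(fasta_file, db_path, tmp_dir,
                              is_spacer=False, rev=False):
    """Assemble a `spacepharer createsetdb` argument list (hypothetical helper)."""
    cmd = ["spacepharer", "createsetdb",
           str(fasta_file), str(db_path), str(tmp_dir)]
    if is_spacer:
        # Spacer databases are built from the spacer sequences themselves.
        cmd += ["--extractorf-spacer", "1"]
    if rev:
        # Reversed fragments give the control set used for FDR estimation.
        cmd += ["--reverse-fragments", "1"]
    return cmd


spacer_cmd = build_createsetdb_command("target_spacers.fasta",
                                       "databases/spacerDB", "tmp",
                                       is_spacer=True)
control_cmd = build_createsetdb_command("phages.fasta",
                                        "databases/viralDB_rev", "tmp",
                                        rev=True)
```

The file and database names in the two calls are placeholders chosen for illustration.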
- predict(spacerDB: Path, viralDB: Path, viralctrlDB: Path, out: str = 'phage_host_predictions.tsv', fdr: float = 0.05) -> Path
Run SpacePHARER predictmatch.
- Parameters:
spacerDB – Path to spacer database.
viralDB – Path to viral database.
viralctrlDB – Path to viral control database.
out – Output TSV filename.
fdr – False discovery rate threshold.
- Returns:
Path to the prediction TSV.
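A minimal sketch of the underlying predictmatch invocation, assuming the positional order query DB, target DB, control DB, output file, temporary folder. The helper name build_predictmatch_command and the tmp_dir argument are hypothetical, and the command is only assembled here.

```python
def build_predictmatch_command(spacer_db, viral_db, viral_ctrl_db,
                               out="phage_host_predictions.tsv",
                               tmp_dir="tmp", fdr=0.05):
    """Assemble a `spacepharer predictmatch` argument list (hypothetical helper)."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("fdr must lie in [0, 1]")
    return [
        "spacepharer", "predictmatch",
        str(spacer_db), str(viral_db), str(viral_ctrl_db),
        str(out), str(tmp_dir),
        "--fdr", str(fdr),
    ]


predict_cmd = build_predictmatch_command("databases/spacerDB",
                                         "databases/viralDB",
                                         "databases/viralDB_rev")
```

Validating fdr before building the command keeps an out-of-range threshold from reaching the external tool.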
- capellini.stages.spacepharer.check_and_install_spacepharer() -> tuple[str, str]
Check that spacepharer and minced are on PATH; install via conda if missing.
- Returns:
Tuple (spacepharer_path, minced_path).
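The PATH check can be approximated with shutil.which. The sketch below only reports what is missing and prints one plausible conda command; the helper name find_tools and the bioconda channel are assumptions, and nothing is installed.

```python
import shutil


def find_tools(names=("spacepharer", "minced")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools()
missing = [name for name, path in tools.items() if path is None]
if missing:
    # Assumption: both tools are installed from the bioconda channel.
    print("Missing tools, consider: conda install -c bioconda " + " ".join(missing))
```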
- capellini.stages.spacepharer.compute_spacepharer_stats(cfg: CapelliniConfig, silva_fixed: DataFrame, topBitScore_df: DataFrame) -> None
Print the full pipeline statistics summary (Sections 1-3 from notebook).
- Parameters:
cfg – Populated CapelliniConfig instance.
silva_fixed – Output DataFrame with NCBI taxid columns.
topBitScore_df – MMseqs2 hits DataFrame.
- capellini.stages.spacepharer.filter_target_spacers(spacers_collection: Path, ncbi_id_target_set_int: set, input_fasta_folder: str | Path) -> Path
Filter spacers_CompleteCollection.fasta to cohort-specific NCBI IDs.
- Parameters:
spacers_collection – Path to the complete spacers FASTA.
ncbi_id_target_set_int – Set of integer NCBI IDs to keep.
input_fasta_folder – Directory where target_spacers.fasta will be written.
- Returns:
Path to the filtered target_spacers.fasta.
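The filtering step can be illustrated on FASTA lines held in memory. This sketch assumes each header starts with the NCBI taxonomy ID followed by a dot (for example >1000.SAMN1.contig_1_spacer_1); that header layout and the helper name filter_fasta_by_taxid are assumptions, not taken from the module.

```python
def filter_fasta_by_taxid(lines, keep_ids):
    """Yield only the records whose header starts with a taxid in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumed header layout: ><taxid>.<rest of identifier>
            first_field = line[1:].split(".", 1)[0]
            keep = first_field.isdigit() and int(first_field) in keep_ids
        if keep:
            yield line


records = [
    ">1000.SAMN1.contig_1_spacer_1\n", "ACGT\n",
    ">2000.SAMN2.contig_7_spacer_1\n", "TTGA\n",
]
kept = list(filter_fasta_by_taxid(records, {1000}))
```

Because the function is a generator over lines, it can be fed an open file handle and written out with writelines without loading the whole collection into memory.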
- capellini.stages.spacepharer.get_spacers_collection(cfg: CapelliniConfig, wf: SpacePHARERWorkflow) -> Path
Return path to spacers_CompleteCollection.fasta, using the bundled file if present.
Modification 2: the bundled FASTA is checked first. If it is not present, the function downloads progenomes3.contigs.representatives.fasta.bz2, decompresses it, runs MinCED on the result, and optionally removes the decompressed FASTA.
- Parameters:
cfg – Populated CapelliniConfig instance.
wf – Initialized SpacePHARERWorkflow instance.
- Returns:
Path to spacers_CompleteCollection.fasta.
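The decompression step in the fallback path can be sketched with the standard library alone. The helper name decompress_bz2 is hypothetical; streaming in chunks is a choice made here because the representative-contigs archive is large, not a description of how the module does it.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file into dest and return dest as a Path."""
    src, dest = Path(src), Path(dest)
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Copy in fixed-size chunks so the whole file never sits in memory.
        shutil.copyfileobj(fin, fout, chunk_size)
    return dest
```

Removing the decompressed FASTA afterwards would be a single dest.unlink() call once MinCED has finished.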
- capellini.stages.spacepharer.plot_spacepharer_figures(cfg: CapelliniConfig) -> None
Generate SpacePHARER network figures (Sections 4-5 from notebook).
Only called when figures_display=True in the pipeline config.
- Parameters:
cfg – Populated CapelliniConfig instance.
- capellini.stages.spacepharer.run_spacepharer(cfg: CapelliniConfig, silva_fixed: DataFrame) -> None
Run the full SpacePHARER stage: spacer collection, filtering, and prediction.
- Parameters:
cfg – Populated CapelliniConfig instance.
silva_fixed – Output DataFrame of the MMseqs2 stage, with progenomes_taxid and GCA columns.