capellini.stages.spacepharer

SpacePHARER stage: spacer extraction, DB creation, prediction, and statistics.

Functions

check_and_install_spacepharer()

Check that spacepharer and minced are on PATH; install via conda if missing.

compute_spacepharer_stats(cfg, silva_fixed, ...)

Print the full pipeline statistics summary (Sections 1-3 from notebook).

filter_target_spacers(spacers_collection, ...)

Filter spacers_CompleteCollection.fasta to cohort-specific NCBI IDs.

get_spacers_collection(cfg, wf)

Return path to spacers_CompleteCollection.fasta, using the bundled file if present.

plot_spacepharer_figures(cfg)

Generate SpacePHARER network figures (Sections 4-5 from notebook).

run_spacepharer(cfg, silva_fixed)

Run the full SpacePHARER stage: spacer collection, filtering, and prediction.

Classes

SpacePHARERWorkflow(workdir, spacerdir)

Wrapper around the SpacePHARER + MinCED command-line tools.

class capellini.stages.spacepharer.SpacePHARERWorkflow(workdir: str, spacerdir: str)

Bases: object

Wrapper around the SpacePHARER + MinCED command-line tools.

extract_spacers(fasta_path: str | Path, min_n_spacers: int, min_length: int, max_length: int, tag: str = 'spacers_CompleteCollection') -> Path

Extract CRISPR spacers from a FASTA using MinCED.

Parameters:
  • fasta_path – Input contigs FASTA.

  • min_n_spacers – Minimum size of a CRISPR array, passed to MinCED as -minNR (which MinCED counts in repeats).

  • min_length – Minimum CRISPR repeat length (-minRL).

  • max_length – Maximum CRISPR repeat length (-maxRL).

  • tag – Output file name prefix.

Returns:

Path to the spacer FASTA file.
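The parameters above map onto MinCED command-line options. The following is a minimal sketch of how such a command line could be assembled, not the actual implementation of extract_spacers. The helper name build_minced_command, the out_dir argument, and the ".crisprs" output suffix are assumptions made for illustration; only the -minNR, -minRL, -maxRL, and -spacers options come from MinCED itself.

```python
from pathlib import Path


def build_minced_command(fasta_path, out_dir, min_n_spacers, min_length,
                         max_length, tag="spacers_CompleteCollection"):
    """Return the argv list for a MinCED run; nothing is executed here."""
    # Hypothetical layout: the main MinCED report is written to <out_dir>/<tag>.crisprs.
    report = Path(out_dir) / f"{tag}.crisprs"
    return [
        "minced",
        "-spacers",                      # also write the spacers as FASTA
        "-minNR", str(min_n_spacers),    # minimum array size
        "-minRL", str(min_length),       # minimum repeat length
        "-maxRL", str(max_length),       # maximum repeat length
        str(fasta_path),
        str(report),
    ]


cmd = build_minced_command("contigs.fasta", "spacers", 3, 23, 47)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once MinCED is installed.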

make_db(fasta_file: str | Path, dbname: str, is_spacer: bool = False, rev: bool = False) -> Path

Create a SpacePHARER database from a FASTA file.

Parameters:
  • fasta_file – Input FASTA.

  • dbname – Database name (placed in databases/ subdirectory).

  • is_spacer – Add the --extractorf-spacer 1 flag.

  • rev – Add the --reverse-fragments 1 flag.

Returns:

Path to the database.
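The two boolean parameters translate into optional flags of spacepharer createsetdb. The sketch below shows one way the command could be built, assuming a hypothetical helper build_createsetdb_command and a hypothetical tmp/ scratch directory under the working directory; the databases/ subdirectory and the two flags are taken from the parameter descriptions above.

```python
from pathlib import Path


def build_createsetdb_command(fasta_file, dbname, workdir,
                              is_spacer=False, rev=False):
    """Return (argv, db_path) for `spacepharer createsetdb`; nothing is run."""
    db_path = Path(workdir) / "databases" / dbname
    tmp_dir = Path(workdir) / "tmp"  # assumed scratch location
    argv = ["spacepharer", "createsetdb",
            str(fasta_file), str(db_path), str(tmp_dir)]
    if is_spacer:
        # Spacer sets need the spacer-specific extraction mode.
        argv += ["--extractorf-spacer", "1"]
    if rev:
        # Reversed fragments are used to build the control database.
        argv += ["--reverse-fragments", "1"]
    return argv, db_path


argv, db = build_createsetdb_command("viral.fasta", "viralDB_rev", "work", rev=True)
print(db)
print(argv)
```

A typical run builds three databases this way: the spacer set (is_spacer=True), the viral target set, and the reversed viral control set (rev=True).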

predict(spacerDB: Path, viralDB: Path, viralctrlDB: Path, out: str = 'phage_host_predictions.tsv', fdr: float = 0.05) -> Path

Run SpacePHARER predictmatch.

Parameters:
  • spacerDB – Path to spacer database.

  • viralDB – Path to viral database.

  • viralctrlDB – Path to viral control database.

  • out – Output TSV filename.

  • fdr – False discovery rate threshold.

Returns:

Path to the prediction TSV.
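As a rough illustration of how these parameters could be passed to spacepharer predictmatch, the sketch below assembles the argument list in the order the tool expects (query set, target set, control set, output file, temporary directory). The helper name build_predictmatch_command and the tmp_dir argument are hypothetical, and forwarding fdr through the --fdr option is an assumption about how the wrapper works.

```python
def build_predictmatch_command(spacer_db, viral_db, viral_ctrl_db,
                               out_tsv="phage_host_predictions.tsv",
                               tmp_dir="tmp", fdr=0.05):
    """Return the argv list for `spacepharer predictmatch`; nothing is run."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("fdr must lie in [0, 1]")
    return [
        "spacepharer", "predictmatch",
        str(spacer_db),       # query: spacer set database
        str(viral_db),        # target: viral set database
        str(viral_ctrl_db),   # control: reversed viral set database
        str(out_tsv),
        str(tmp_dir),
        "--fdr", str(fdr),
    ]


print(build_predictmatch_command("spacerDB", "viralDB", "viralDB_rev"))
```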

quick_stats(tsv: str | Path) -> None

Print quick interaction statistics from a prediction TSV.

Parameters:

tsv – Path to phage_host_predictions.tsv.
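The exact figures that quick_stats prints are not documented here. As a hedged sketch of the kind of summary that can be derived from a prediction file, the code below counts host-phage pairs. It assumes the SpacePHARER layout in which each prediction record is a tab-separated line starting with "#" (host accession, phage accession, score, hit count) and individual hits start with ">". The function name summarize_predictions and the sample records are invented for illustration.

```python
def summarize_predictions(lines):
    """Count unique host-phage pairs, hosts, and phages in prediction lines."""
    pairs = set()
    for line in lines:
        if not line.startswith("#"):
            continue  # skip per-hit (">") and alignment lines
        fields = line[1:].rstrip("\n").split("\t")
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return {
        "interactions": len(pairs),
        "hosts": len({host for host, _ in pairs}),
        "phages": len({phage for _, phage in pairs}),
    }


# Synthetic records, not real SpacePHARER output.
sample = [
    "#hostA\tphage1\t12.5\t2\n",
    ">spacer1\tphage1\t0.001\n",
    "#hostA\tphage2\t8.1\t1\n",
    "#hostB\tphage1\t9.7\t1\n",
]
print(summarize_predictions(sample))
```

For a file on disk, the same function can be fed an open file handle, for example summarize_predictions(open(tsv)).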

capellini.stages.spacepharer.check_and_install_spacepharer() -> tuple[str, str]

Check that spacepharer and minced are on PATH; install via conda if missing.

Returns:

Tuple (spacepharer_path, minced_path).
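The PATH check can be sketched with the standard library alone. This is an illustration of the lookup step only, under the assumption that it is done with shutil.which; the helper names find_tools and missing_tools are hypothetical, and the conda installation step is deliberately left out.

```python
import shutil


def find_tools(names=("spacepharer", "minced")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the names whose lookup failed."""
    return [name for name, path in found.items() if path is None]


found = find_tools()
print("missing:", missing_tools(found))
```

Any name reported as missing would then be the candidate for the conda installation described above.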

capellini.stages.spacepharer.compute_spacepharer_stats(cfg: CapelliniConfig, silva_fixed: DataFrame, topBitScore_df: DataFrame) -> None

Print the full pipeline statistics summary (Sections 1-3 from notebook).

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • silva_fixed – Output DataFrame with NCBI taxid columns.

  • topBitScore_df – MMseqs2 hits DataFrame.

capellini.stages.spacepharer.filter_target_spacers(spacers_collection: Path, ncbi_id_target_set_int: set, input_fasta_folder: str | Path) -> Path

Filter spacers_CompleteCollection.fasta to cohort-specific NCBI IDs.

Parameters:
  • spacers_collection – Path to the complete spacers FASTA.

  • ncbi_id_target_set_int – Set of integer NCBI IDs to keep.

  • input_fasta_folder – Directory where target_spacers.fasta will be written.

Returns:

Path to the filtered target_spacers.fasta.
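The filtering step amounts to keeping only the FASTA records whose header carries one of the target NCBI IDs. The sketch below assumes, purely for illustration, that each header begins with the integer NCBI ID followed by a dot (for example ">562.SAMPLE.contig_1"); the real header layout of spacers_CompleteCollection.fasta may differ, and the function name filter_fasta_by_ncbi_id is hypothetical.

```python
def filter_fasta_by_ncbi_id(lines, keep_ids):
    """Yield the lines of FASTA records whose header ID is in keep_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumed header layout: ">" + integer NCBI ID + "." + remainder.
            token = line[1:].split(".", 1)[0].strip()
            keep = token.isdigit() and int(token) in keep_ids
        if keep:
            yield line


# Synthetic records for illustration only.
records = [
    ">562.SAMPLE_A.contig_1_spacer_1\n", "ACGTACGT\n",
    ">1280.SAMPLE_B.contig_4_spacer_2\n", "TTGACCAA\n",
    ">562.SAMPLE_A.contig_1_spacer_2\n", "GGCATTCA\n",
]
kept = list(filter_fasta_by_ncbi_id(records, {562}))
print("".join(kept), end="")
```

Streaming line by line keeps memory use flat, which matters for a collection derived from a large reference set.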

capellini.stages.spacepharer.get_spacers_collection(cfg: CapelliniConfig, wf: SpacePHARERWorkflow) -> Path

Return path to spacers_CompleteCollection.fasta, using the bundled file if present.

Modification 2: checks for the bundled FASTA first. If it is not present, the function downloads progenomes3.contigs.representatives.fasta.bz2, decompresses it, runs MinCED on it, and optionally removes the decompressed FASTA afterwards.

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • wf – Initialized SpacePHARERWorkflow instance.

Returns:

Path to spacers_CompleteCollection.fasta.
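The "bundled file first, otherwise build it" logic and the decompression step can be sketched as follows. This is not the function's actual code: the helper names resolve_collection and decompress_bz2 are hypothetical, the download itself is omitted, and the build step is represented by a caller-supplied callable.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest, chunk_size=1 << 20):
    """Stream-decompress a .bz2 file to dest without loading it into memory."""
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return Path(dest)


def resolve_collection(bundled, build):
    """Return the bundled file if it exists, else whatever build() produces."""
    bundled = Path(bundled)
    if bundled.is_file():
        return bundled
    # Fallback: download, decompress, and run MinCED (supplied by the caller).
    return build()
```

Streaming the decompression in chunks avoids holding the whole decompressed FASTA in memory; removing the decompressed file afterwards would be a single Path.unlink() call.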

capellini.stages.spacepharer.plot_spacepharer_figures(cfg: CapelliniConfig) -> None

Generate SpacePHARER network figures (Sections 4-5 from notebook).

Only called when figures_display=True in the pipeline config.

Parameters:

cfg – Populated CapelliniConfig instance.

capellini.stages.spacepharer.run_spacepharer(cfg: CapelliniConfig, silva_fixed: DataFrame) -> None

Run the full SpacePHARER stage: spacer collection, filtering, and prediction.

Parameters:
  • cfg – Populated CapelliniConfig instance.

  • silva_fixed – Output of the MMseqs2 stage with progenomes_taxid and GCA columns.