Guide RNA Design for a Toxin–Antitoxin Locus in CP052959 Using the GuideFinder Workflow (Data_JuliaFuchs_RNAseq_2025)

This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.

Target locus

Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:

  • Toxin: HJI06_09100 (1889026–1889421; strand -)
  • Antitoxin: HJI06_09105 (1889421–1889651; strand -)

Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.


Overview of the pipeline

The workflow consists of four stages:

  1. Extract CDS sequences from the GenBank file → CP052959_CDS.fasta
  2. BLAST CDS to genome to generate a coordinate table → CP052959_gene_coordinates.tsv
  3. Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters → CP052959_guides_preoff.tsv + BLAST hit table
  4. Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined) → scan_results/.../guides_all.tsv + guides_top5.tsv

A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.


Requirements

Software

  • Python 3 (Biopython)
  • R (with Biostrings, readr, dplyr, stringr, tidyr)
  • BLAST+ (makeblastdb, blastn)
  • (Optional) samtools for quick sequence validation

Input files

Download from NCBI:

  • Genome FASTA: CP052959.1.fna (or equivalent)
  • GenBank annotation: CP052959.1.gbk (or equivalent)

For convenience, I standardize filenames locally as:

  • CP052959.fna (genome)
  • CP052959.gbk (annotation)

Step-by-step commands

0) Prepare working directory

mkdir -p HD46_GuideFinder
cd HD46_GuideFinder

Place the following files in this directory:

  • CP052959.gbk
  • CP052959.fna
  • the scripts from the bottom of this post

1) Extract CDS sequences from GenBank (Python)

This creates a CDS FASTA where each header encodes metadata. Important note from troubleshooting: headers can contain spaces; later, it’s safest to avoid spaces or replace them with underscores.

✅ Run:

python 1_extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Expected output:

  • CP052959_CDS.fasta

Example log:

  • Wrote 2573 CDS sequences to CP052959_CDS.fasta

2) Pre-processing: build BLAST DB + map CDS to genome (R)

This step wraps:

  • makeblastdb
  • blastn (CDS vs genome)
  • parsing BLAST output into a coordinate table

✅ Run:

#NOTE that we need to manually replace spaces (' ') with underscores ('_') in all headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R

Expected output:

  • GenomeDB_CP052959.* (BLAST database files)
  • CDS_vs_genome_CP052959.blast
  • CP052959_gene_coordinates.tsv

Sanity check (recommended):

grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv

You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).


3) Filter TA coordinates + create TA_targets.txt (R)

The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

✅ Run:

Rscript 3_Filter_TA_coords.R

Expected outputs:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

This script scans:

  • promoter window + gene body window
  • identifies NGG PAM (and CCN for reverse-strand equivalent)
  • extracts adjacent 20-nt protospacers
  • filters by GC and homopolymer
  • BLASTs guides to genome and writes a hit table

✅ Run:

Rscript 4_GuideFinder_CP052959_base.R

Expected outputs:

  • CP052959_guides_preoff.tsv
  • BLASTguides.fasta
  • MatchGuides.blast
  • CP052959_blast_hits.tsv

Troubleshooting note (from my run)

Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:

  • scan_gene_frac <- 1.0 (scan full gene; crucial for short antitoxin genes)
  • increase promoter length (e.g., 600 bp)
  • optionally relax GC/homopolymer thresholds slightly

5) Off-target filtering (R)

This script reads:

  • CP052959_guides_preoff.tsv
  • CP052959_blast_hits.tsv

Then produces multiple filtering scenarios:

  • off_n1 / off_n3 / off_n5 / off_n10: keep guides where the number of strict genomic hits ≤ N
  • off_refined: remove any guide with a second perfect 20/20 match elsewhere; allow only “weak” secondary hits

✅ Run:

Rscript 5_ScanOfftarget_CP052959.R

Expected output structure:

scan_results/
  off_n1/
  off_n3/
  off_n5/
  off_n10/
  off_refined/

Each contains:

  • guides_all.tsv
  • guides_top5.tsv (ranked by dist_to_tss and GC)

Key concepts (explained from the debugging/communication notes)

What is PAM?

PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.

What is guide_seq vs protospacer?

In this pipeline:

  • guide_seq is the 20-nt protospacer sequence extracted from the genome adjacent to PAM.
  • The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).

What does guide_strand mean?

  • guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) at guide_start..guide_end.
  • guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.

It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.

Why do some target sites appear multiple times?

Two reasons:

  1. The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
  2. In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.

Final candidate selection (from my off_refined output)

After off_refined filtering, I had 10 candidates covering both genes.

Recommended single guide (entry-proximal for TA-unit knockdown):

  • CP052959|HJI06_09105|gene|1889575|+
  • guide_seq: AGTGCTGCGATCACTTCTGT
  • PAM: CGG
  • This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.

Recommended “enhanced” 2-guide set:

  • Guide A (entry-proximal): CP052959|HJI06_09105|gene|1889575|+
  • Guide B (upstream roadblock): CP052959|HJI06_09105|promoter|1889908|- (guide_seq: CATTCTGACTTTATGGAAAC, PAM AGG)
  • These are ~333 bp apart and both pass off_refined.

Reproducibility note: verifying a candidate sequence in the genome

I used samtools faidx to confirm that the guide_seq matches the genome coordinates:

samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594

Expected sequence: AGTGCTGCGATCACTTCTGT


Raw code (scripts used)

Below are the exact scripts used so you can reproduce this pipeline on another computer.


1) 1_extract_cds_from_gbk.py

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

gbk_file = "CP052959.gbk"
out_fasta = "CP052959_CDS.fasta"

records = []

for rec in SeqIO.parse(gbk_file, "genbank"):
    for feature in rec.features:
        if feature.type != "CDS":
            continue

        seq = feature.extract(rec.seq)

        locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        gene      = feature.qualifiers.get("gene", [""])[0]
        product   = feature.qualifiers.get("product", [""])[0]

        header_parts = [locus_tag]
        if gene:
            header_parts.append(gene)
        if product:
            header_parts.append(product)

        header = "|".join(header_parts)

        rec_out = SeqRecord(
            seq,
            id=header,
            description=""
        )
        records.append(rec_out)

SeqIO.write(records, out_fasta, "fasta")

print(f"Wrote {len(records)} CDS sequences to {out_fasta}")

2) 2_PreProcessing_CP052959.R

# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fasta and align CP052959_CDS.fasta to get gene coordinates

# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")

library(Biostrings)
library(readr)
library(dplyr)
library(stringr)

genome_fasta <- "CP052959.fna"
cds_fasta    <- "CP052959_CDS.fasta"
db_name      <- "GenomeDB_CP052959"
blast_out    <- "CDS_vs_genome_CP052959.blast"
coord_table  <- "CP052959_gene_coordinates.tsv"

message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA:    ", cds_fasta)

# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
                  shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)

# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
  "blastn",
  "-task blastn",
  "-query", shQuote(cds_fasta),
  "-db", shQuote(db_name),
  "-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
  "-max_target_seqs 1",
  "-perc_identity 95",
  "-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)

# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
          "length", "pident", "bitscore")

blast <- read_tsv(blast_out,
                  col_names = cols,
                  show_col_types = FALSE,
                  comment = "#")

if (nrow(blast) == 0) {
  stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fasta.")
}

# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
  mutate(
    start  = pmin(sstart, send),
    end    = pmax(sstart, send),
    strand = if_else(sstart <= send, "+", "-")
  ) %>%
  select(qseqid, sseqid, start, end, strand, length, pident, bitscore)

# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
  mutate(
    locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
    gene      = str_split_fixed(qseqid, "\\|", 3)[, 2]
  )

# 6. Save coordinate table
coord <- blast2 %>%
  transmute(
    gene_id    = if_else(gene == "", locus_tag, gene),
    locus_tag  = locus_tag,
    contig_id  = sseqid,
    start      = start,
    end        = end,
    strand     = strand,
    length_nt  = length,
    pident     = pident,
    bitscore   = bitscore
  )

write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)

3) 3_Filter_TA_coords.R

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})

# ---------- 输入 / 输出 ----------
coord_in  <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"

# 你的 TA targets
targets <- c("HJI06_09100", "HJI06_09105")

# ---------- 读取坐标表 ----------
coords <- read_tsv(coord_in, col_names = FALSE, show_col_types = FALSE)

# 预期 9 列:product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore
if (ncol(coords) < 9) {
  stop("Input coord table has < 9 columns: ", coord_in)
}

colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")

# 强制类型(避免读成字符导致 slice_max 异常)
coords <- coords %>%
  mutate(
    locus_tag = as.character(locus_tag),
    aln_len = as.integer(aln_len),
    pident = as.numeric(pident),
    bitscore = as.numeric(bitscore)
  )

# ---------- 过滤:只保留 TA + 每基因最长命中 ----------
coords_ta <- coords %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# ---------- 安全检查 ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
  stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
       "\nCheck that locus_tags exist in ", coord_in)
}

if (nrow(coords_ta) != length(targets)) {
  warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
          "\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}

# ---------- 写输出 ----------
write_tsv(coords_ta, coord_out)

# targets.txt:一行一个 locus_tag(按你给定的顺序写)
writeLines(targets, targets_out)

message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)

4) 4_GuideFinder_CP052959_base.R

#!/usr/bin/env Rscript

# GuideFinder_CP052959_base.R
# 目的:
#   1) 仅针对 TA targets 生成所有候选 guides(NGG PAM,20bp)
#   2) 计算 dist_to_tss(promoter 为负,gene body 为正,按转录方向)
#   3) 输出 guides_preoff.tsv + guides fasta
#   4) 运行 BLAST 做全基因组命中统计,输出 blast_hits.tsv
#
# 输入:
#   CP052959.fna                (整条基因组序列,单条序列 FASTA;可软链接到 CP052959.1.fna)
#   CP052959_gene_coordinates.tsv (PreProcessing_CP052959.R 生成)
#   TA_targets.txt                (两行:HJI06_09100 / HJI06_09105)
#
# 输出:
#   CP052959_guides_preoff.tsv
#   BLASTguides.fasta
#   MatchGuides.blast
#   CP052959_blast_hits.tsv

suppressPackageStartupMessages({
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
  if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")

  library(Biostrings)
  library(readr)
  library(dplyr)
  library(stringr)
  library(tidyr)
})

# ---------- 参数区 ----------
genome_fasta        <- "CP052959.fna"
coord_table         <- "CP052959_gene_coordinates.tsv"   # PreProcessing_CP052959.R 生成的
targets_file        <- "TA_targets.txt"                  # 只跑 TA 两个基因
blast_db_name       <- "GenomeDB_CP052959"               # 预处理里建的 BLAST DB 前缀
guides_fasta        <- "BLASTguides.fasta"
blast_matches_file  <- "MatchGuides.blast"
guides_preoff_tsv   <- "CP052959_guides_preoff.tsv"
blast_hits_tsv      <- "CP052959_blast_hits.tsv"

guide_len      <- 20L
pam            <- "NGG"              # SpCas9
#promoter_len   <- 200L              # 扫描 TSS 上游长度(bp),可按需改 100/300
#scan_gene_frac <- 0.30              # gene body 扫描前 30%(更适合 CRISPRi)
#gc_min         <- 0.30              # S. epidermidis GC 偏低,建议 0.30~0.35
#gc_max         <- 0.80              # 基本不会用到,但留着
#homopolymer_n  <- 4L                # 连续 A 或 T >= 4 就过滤(可改 5 更宽松)
promoter_len   <- 600L     # 原来 200L,改大:增加上游/入口区域的 PAM 机会
scan_gene_frac <- 1.00     # 原来 0.30,改为扫完整 gene body(短基因必需)
gc_min         <- 0.25     # 原来 0.30,略放宽(S. epi GC 低,避免全被 GC 过滤)
gc_max         <- 0.80
homopolymer_n  <- 5L       # 原来 4L,略放宽(避免把仅有的候选全杀掉)

# BLAST 参数:用于命中统计(不是最终 off_refined)
blast_task     <- "blastn-short"
blast_wordsize <- 7L
blast_evalue   <- 1000

# ---------- 工具函数 ----------
stopif_file_missing <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
}

calc_gc <- function(seq) {
  s <- as.character(seq)
  if (nchar(s) == 0) return(NA_real_)
  g <- str_count(s, "G")
  c <- str_count(s, "C")
  (g + c) / nchar(s)
}

has_homopolymer_AT <- function(seq, n = 4L) {
  s <- as.character(seq)
  patA <- paste0("A{", n, ",}")
  patT <- paste0("T{", n, ",}")
  str_detect(s, patA) | str_detect(s, patT)
}

revcomp_str <- function(s) {
  as.character(reverseComplement(DNAString(s)))
}

# 扫描 forward strand 的 NGG:返回 data.frame(protospacer + PAM)
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  # dna: DNAString(该区域序列,5'->3' 按基因组正链)
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # i: protospacer 起点(0-based in R string positions)
  # protospacer = s[i+1 .. i+20]
  # PAM         = s[i+21 .. i+23] must match NGG
  max_i <- L - (guide_len + 3L)
  if (max_i < 0) return(tibble())

  idx <- 0L
  for (i in 0:max_i) {
    pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
    if (nchar(pam_seq) != 3) next
    if (!str_detect(pam_seq, "^[ACGT]GG$")) next

    prot <- substr(s, i + 1L, i + guide_len)
    if (nchar(prot) != guide_len) next

    guide_start <- offset_genome_1based + i
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- guide_end + 1L
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = prot,        # 常规报告为 protospacer(不含 PAM)
      pam_seq     = pam_seq,
      guide_strand = "+",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# 扫描 reverse strand:等价扫描 forward 的 CCN(因为 reverse 的 PAM = NGG)
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # 在 forward 上找 CCN:pam_fwd = CCN (positions j..j+2)
  # reverse 上 PAM = NGG,对应的 protospacer 在 forward 上是 pam_fwd 后面紧接 20nt(j+3..j+22)
  # 输出 guide_seq 应为该 20nt 的反向互补(5'->3')
  max_j <- L - (3L + guide_len)
  if (max_j < 0) return(tibble())

  idx <- 0L
  for (j in 0:max_j) {
    pam_fwd <- substr(s, j + 1L, j + 3L)
    if (nchar(pam_fwd) != 3) next
    if (!str_detect(pam_fwd, "^CC[ACGT]$")) next

    prot_fwd <- substr(s, j + 4L, j + 3L + guide_len)  # downstream 20nt
    if (nchar(prot_fwd) != guide_len) next

    guide_seq <- revcomp_str(prot_fwd)
    # 基因组坐标(1-based):prot_fwd 覆盖 [j+4 .. j+23](长度20)
    guide_start <- offset_genome_1based + j + 3L
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- offset_genome_1based + j
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = guide_seq,
      pam_seq     = revcomp_str(pam_fwd),   # 反向互补后就是 NGG
      guide_strand = "-",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# dist_to_tss:按转录方向计算 guide 到 TSS 的距离(bp)
# 约定:
#   + 链:TSS ~ gene_start(start),promoter 在 start-promoter_len .. start-1(dist 负)
#   - 链:TSS ~ gene_end(end),promoter 在 end+1 .. end+promoter_len(dist 负)
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
  if (gene_strand == "+") {
    # 取 guide 的 5'端(转录方向上的前端)近似
    guide_5p <- guide_start
    as.integer(guide_5p - gene_start)
  } else {
    # - 链转录方向为 大 -> 小,5'端在更大坐标
    guide_5p <- guide_end
    as.integer(gene_end - guide_5p)
  }
}

# ---------- 输入检查 ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)

message("Using genome FASTA: ", genome_fasta)
message("Using coord table:  ", coord_table)
message("Using targets:      ", targets_file)

targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))

# ---------- 读取基因组 ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
  message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)

# ---------- 读取坐标表并过滤(关键:只保留 targets + 每基因最长命中) ----------
# 你的表是 9 列(从 grep 输出推断):
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
coords_raw <- read_tsv(coord_table, col_names = FALSE, show_col_types = FALSE)
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)

coords <- coords_raw %>%
  transmute(
    product   = as.character(X1),
    locus_tag = as.character(X2),
    seqid     = as.character(X3),
    start     = as.integer(X4),
    end       = as.integer(X5),
    strand    = as.character(X6),
    aln_len   = as.integer(X7),
    pident    = as.numeric(X8),
    bitscore  = as.numeric(X9)
  ) %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# 安全检查:targets 都得存在
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
  stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}

message("Filtered coord rows: ", nrow(coords))
print(coords)

# ---------- 逐基因生成候选 guides ----------
all_guides <- list()
gid <- 0L

for (i in seq_len(nrow(coords))) {
  row <- coords[i, ]

  gene_start <- min(row$start, row$end)
  gene_end   <- max(row$start, row$end)
  gene_strand <- row$strand
  locus <- row$locus_tag
  prod  <- row$product

  gene_len <- gene_end - gene_start + 1L
  scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))

  # promoter 区间(按链方向定义“上游”)
  if (gene_strand == "+") {
    tss <- gene_start
    prom_start <- max(1L, gene_start - promoter_len)
    prom_end   <- gene_start - 1L
    gene_scan_start <- gene_start
    gene_scan_end   <- min(gene_end, gene_start + scan_gene_len - 1L)
  } else {
    tss <- gene_end
    prom_start <- gene_end + 1L
    prom_end   <- min(genome_len, gene_end + promoter_len)
    gene_scan_end   <- gene_end
    gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
  }

  message("\n--- ", locus, " (", gene_strand, ") ---")
  message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
          ", TSS~", tss)
  message("Promoter region: ", prom_start, "..", prom_end)
  message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)

  # 提取 promoter / gene-scan 序列(按基因组正链坐标截取)
  guides_gene <- tibble()
  guides_prom <- tibble()

  if (prom_start <= prom_end) {
    dna_prom <- subseq(genome, start = prom_start, end = prom_end)
    guides_prom <- bind_rows(
      scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
      scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
    )
  }

  if (gene_scan_start <= gene_scan_end) {
    dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
    guides_gene <- bind_rows(
      scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
      scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
    )
  }

  guides <- bind_rows(guides_prom, guides_gene)
  if (nrow(guides) == 0) {
    message("No NGG candidates found in promoter+gene-scan region for ", locus)
    next
  }

  # 过滤:GC + homopolymer
  guides <- guides %>%
    mutate(
      locus_tag = locus,
      product   = prod,
      gene_start = gene_start,
      gene_end   = gene_end,
      gene_strand = gene_strand,
      tss_pos   = tss,
      gc        = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
      has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
      dist_to_tss = mapply(
        FUN = calc_dist_to_tss,
        gene_start = gene_start,
        gene_end   = gene_end,
        gene_strand = gene_strand,
        guide_start = guide_start,
        guide_end   = guide_end
      )
    ) %>%
    filter(!is.na(gc)) %>%
    filter(gc >= gc_min, gc <= gc_max) %>%
    filter(!has_hpoly)

  if (nrow(guides) == 0) {
    message("All candidates filtered out by GC/homopolymer for ", locus)
    next
  }

  # 给每条 guide 一个唯一 ID
  guides <- guides %>%
    arrange(region, abs(dist_to_tss), desc(gc)) %>%
    mutate(
      guide_id = {
        # 稳定可读:CP052959|locus|region|start|strand
        paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
      }
    )

  all_guides[[length(all_guides) + 1L]] <- guides
  message("Kept guides after filters: ", nrow(guides))
}

guides_df <- bind_rows(all_guides)

if (nrow(guides_df) == 0) {
  stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}

# 输出 preoff
write_tsv(guides_df %>%
            select(
              guide_id, locus_tag, product, region,
              guide_seq, pam_seq, guide_strand,
              guide_start, guide_end, pam_start, pam_end,
              gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
              gc, has_hpoly
            ),
          guides_preoff_tsv)

message("\nWrote guides_preoff: ", guides_preoff_tsv)

# ---------- 写 BLASTguides.fasta ----------
# 注意:写出的序列是 guide_seq(20nt)
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)

# ---------- 运行 BLAST ----------
# 检查 BLAST DB 是否存在,否则尝试建库
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
  message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
  cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
  message("Running: ", cmd_mk)
  system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}

cmd_blast <- sprintf(
  "blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
  blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)

stopif_file_missing(blast_matches_file)

# ---------- 解析 BLAST 输出 ----------
blast_raw <- read_tsv(
  blast_matches_file,
  col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
  show_col_types = FALSE
)

if (nrow(blast_raw) == 0) {
  stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}

# 统一 sstart/send 为 min/max(方便判断)
blast_hits <- blast_raw %>%
  mutate(
    smin = pmin(sstart, send),
    smax = pmax(sstart, send),
    hit_strand = ifelse(sstart <= send, "+", "-")
  )

# 命中计数(粗略:所有 hits;后续 off_refined 由你的 ScanOfftarget 脚本做)
hit_counts <- blast_hits %>%
  count(qseqid, name = "n_hits") %>%
  rename(guide_id = qseqid)

# 合并 guides 注释
blast_merged <- blast_hits %>%
  rename(guide_id = qseqid) %>%
  left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
  left_join(hit_counts, by = "guide_id") %>%
  arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))

write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)

message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

#!/usr/bin/env Rscript

# ScanOfftarget_CP052959.R
# 适配当前版本输出:
#   - guides_preoff: CP052959_guides_preoff.tsv(locus_tag, dist_to_tss, gc, guide_id...)
#   - blast_hits:    CP052959_blast_hits.tsv(guide_id, pident, length, mismatch, gapopen... + n_hits)
#
# 输出:
#   scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}

suppressPackageStartupMessages({
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(readr)
  library(dplyr)
})

guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv    <- "CP052959_blast_hits.tsv"
guide_len         <- 20L
base_outdir       <- "scan_results"

message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)

message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)

if (!dir.exists(base_outdir)) dir.create(base_outdir)

# ---------- 兼容列名:gene_id / locus_tag;GC / gc ----------
# guides 表必须有 guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff 缺少 guide_id 列:", guides_preoff_tsv)

# gene_id:优先用 gene_id,没有就用 locus_tag
if (!"gene_id" %in% names(guides)) {
  if (!"locus_tag" %in% names(guides)) stop("guides_preoff 既没有 gene_id 也没有 locus_tag")
  guides <- guides %>% mutate(gene_id = locus_tag)
}

# GC:优先用 GC,没有就用 gc
if (!"GC" %in% names(guides)) {
  if (!"gc" %in% names(guides)) stop("guides_preoff 既没有 GC 也没有 gc")
  guides <- guides %>% mutate(GC = gc)
}

# dist_to_tss 必须存在
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff 缺少 dist_to_tss 列")

# ---------- BLAST hits 列检查 ----------
# 我们需要:guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
  stop("blast_hits 缺少必要列:", paste(missing_cols, collapse = ", "),
       "\n请确认 blast_hits_tsv 是由 GuideFinder_CP052959_base.R 生成的版本。")
}

# ---------- 规则封装 ----------
# 通用:按 n_hits 上限过滤
#是太严(仍然 0),就放宽一点: mismatch 2-->3: filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
                            require_full_length = TRUE,
                            max_mismatch = 2,
                            require_no_gaps = TRUE,
                            min_pident = 0) {

  # 1) 先从 blast hits 里挑“更像真实 off-target 风险”的命中
  b <- blast

  # 强制数值类型(避免字符导致比较全 NA)
  b <- b %>%
    mutate(
      pident  = as.numeric(pident),
      length  = as.integer(length),
      mismatch = as.integer(mismatch),
      gapopen = as.integer(gapopen)
    )

  if (require_full_length) {
    b <- b %>% filter(length == guide_len)
  }
  if (require_no_gaps) {
    b <- b %>% filter(gapopen == 0)
  }
  if (!is.null(max_mismatch)) {
    b <- b %>% filter(mismatch <= max_mismatch)
  }
  if (!is.null(min_pident) && min_pident > 0) {
    # 注意:你的 pident 通常是 0-100 的百分比
    b <- b %>% filter(pident >= min_pident)
  }

  # 2) 用“严格命中”重新计算每条 guide 的 n_hits(更合理)
  valid_ids <- b %>%
    count(guide_id, name = "n_hits_strict") %>%
    filter(n_hits_strict <= max_hits) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# 精细规则 off_refined:
#  - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
#  - 其他 hits(若存在):length < 15 或 mismatch >= 3
filter_refined <- function() {
  blast_grouped <- blast %>%
    group_by(guide_id) %>%
    summarise(
      n_hits = n(),
      n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
      other_ok = all(
        ifelse(
          length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
          TRUE,
          (length < 15 | mismatch >= 3)
        )
      ),
      .groups = "drop"
    )

  valid_ids <- blast_grouped %>%
    filter(n_perfect == 1, other_ok) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# 写结果到子目录
run_scenario <- function(name, guides_selected) {
  outdir <- file.path(base_outdir, name)
  if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)

  all_path <- file.path(outdir, "guides_all.tsv")
  top_path <- file.path(outdir, "guides_top5.tsv")

  write_tsv(guides_selected, all_path)

  guides_top <- guides_selected %>%
    group_by(gene_id) %>%
    arrange(dist_to_tss, desc(GC)) %>%
    slice_head(n = 5) %>%
    ungroup()

  write_tsv(guides_top, top_path)

  message(sprintf("[%s] 总 guides: %d, 覆盖基因数: %d",
                  name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}

# ---------- 运行多套参数 ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))

message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))

message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))

message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))

message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())

message("All scenarios finished. Check 'scan_results/' directory.")

Final notes

  • If move to another machine, the only strict requirements are: BLAST+, R packages, and the same input FASTA/GBK.
  • If a target gene yields no candidates again, check:

    • scan_gene_frac (set to 1.0 for short genes)
    • promoter length (increase if needed)
    • GC/homopolymer thresholds (slightly relax)

If the gRNA backbone available (e.g., BsaI/BsmBI Golden Gate format) and we can add an appendix section that auto-generates the exact top/bottom oligos for ordering from the selected guides.

Leave a Reply

Your email address will not be published. Required fields are marked *