This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.

Target locus

Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:

Toxin: HJI06_09100 (1889026–1889421; strand -)
Antitoxin: HJI06_09105 (1889421–1889651; strand -)

Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.

Overview of the pipeline

The workflow consists of four stages:

Extract CDS sequences from the GenBank file → CP052959_CDS.fasta
BLAST CDS to genome to generate a coordinate table → CP052959_gene_coordinates.tsv
Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters → CP052959_guides_preoff.tsv + BLAST hit table
Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined) → scan_results/.../guides_all.tsv + guides_top5.tsv

A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.

Requirements

Software

Python 3 (Biopython)
R (with Biostrings, readr, dplyr, stringr, tidyr)
BLAST+ (makeblastdb, blastn)
(Optional) samtools for quick sequence validation

Input files

Download from NCBI:

Genome FASTA: CP052959.1.fna (or equivalent)
GenBank annotation: CP052959.1.gbk (or equivalent)

For convenience, I standardize filenames locally as:

CP052959.fna (genome)
CP052959.gbk (annotation)

Step-by-step commands

0) Prepare working directory

mkdir -p HD46_GuideFinder
cd HD46_GuideFinder

Place the following files in this directory:

CP052959.gbk
CP052959.fna
the scripts from the bottom of this post

1) Extract CDS sequences from GenBank (Python)

This creates a CDS FASTA where each header encodes metadata. Important note from troubleshooting: headers can contain spaces; later, it’s safest to avoid spaces or replace them with underscores.

✅ Run:

python 1_extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Expected output:

CP052959_CDS.fasta

Example log:

Wrote 2573 CDS sequences to CP052959_CDS.fasta

2) Pre-processing: build BLAST DB + map CDS to genome (R)

This step wraps:

makeblastdb
blastn (CDS vs genome)
parsing BLAST output into a coordinate table

✅ Run:

#NOTE that we need to manually replace spaces (' ') with underscores ('_') in all headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R

Expected output:

GenomeDB_CP052959.* (BLAST database files)
CDS_vs_genome_CP052959.blast
CP052959_gene_coordinates.tsv

Sanity check (recommended):

grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv

You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).

3) Filter TA coordinates + create TA_targets.txt (R)

The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:

TA_gene_coordinates.tsv
TA_targets.txt

✅ Run:

Rscript 3_Filter_TA_coords.R

Expected outputs:

TA_gene_coordinates.tsv
TA_targets.txt

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

This script scans:

promoter window + gene body window
identifies NGG PAM (and CCN for reverse-strand equivalent)
extracts adjacent 20-nt protospacers
filters by GC and homopolymer
BLASTs guides to genome and writes a hit table

✅ Run:

Rscript 4_GuideFinder_CP052959_base.R

Expected outputs:

CP052959_guides_preoff.tsv
BLASTguides.fasta
MatchGuides.blast
CP052959_blast_hits.tsv

Troubleshooting note (from my run)

Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:

scan_gene_frac <- 1.0 (scan full gene; crucial for short antitoxin genes)
increase promoter length (e.g., 600 bp)
optionally relax GC/homopolymer thresholds slightly

5) Off-target filtering (R)

This script reads:

CP052959_guides_preoff.tsv
CP052959_blast_hits.tsv

Then produces multiple filtering scenarios:

off_n1 / off_n3 / off_n5 / off_n10: keep guides where the number of strict genomic hits ≤ N
off_refined: remove any guide with a second perfect 20/20 match elsewhere; allow only “weak” secondary hits

✅ Run:

Rscript 5_ScanOfftarget_CP052959.R

Expected output structure:

scan_results/
  off_n1/
  off_n3/
  off_n5/
  off_n10/
  off_refined/

Each contains:

guides_all.tsv
guides_top5.tsv (ranked by dist_to_tss and GC)

Key concepts (explained from the debugging/communication notes)

What is PAM?

PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.

What is guide_seq vs protospacer?

In this pipeline:

guide_seq is the 20-nt protospacer sequence extracted from the genome adjacent to PAM.
The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).

What does guide_strand mean?

guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) at guide_start..guide_end.
guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.

It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.

Why do some target sites appear multiple times?

Two reasons:

The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.

Final candidate selection (from my off_refined output)

After off_refined filtering, I had 10 candidates covering both genes.

Recommended single guide (entry-proximal for TA-unit knockdown):

CP052959|HJI06_09105|gene|1889575|+
guide_seq: AGTGCTGCGATCACTTCTGT
PAM: CGG
This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.

Recommended “enhanced” 2-guide set:

Guide A (entry-proximal): CP052959|HJI06_09105|gene|1889575|+
Guide B (upstream roadblock): CP052959|HJI06_09105|promoter|1889908|- (guide_seq: CATTCTGACTTTATGGAAAC, PAM AGG)
These are ~333 bp apart and both pass off_refined.

Reproducibility note: verifying a candidate sequence in the genome

I used samtools faidx to confirm that the guide_seq matches the genome coordinates:

samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594

Expected sequence: AGTGCTGCGATCACTTCTGT

Raw code (scripts used)

Below are the exact scripts used so you can reproduce this pipeline on another computer.

1) 1_extract_cds_from_gbk.py

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

gbk_file = "CP052959.gbk"
out_fasta = "CP052959_CDS.fasta"

records = []

for rec in SeqIO.parse(gbk_file, "genbank"):
    for feature in rec.features:
        if feature.type != "CDS":
            continue

        seq = feature.extract(rec.seq)

        locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        gene      = feature.qualifiers.get("gene", [""])[0]
        product   = feature.qualifiers.get("product", [""])[0]

        header_parts = [locus_tag]
        if gene:
            header_parts.append(gene)
        if product:
            header_parts.append(product)

        header = "|".join(header_parts)

        rec_out = SeqRecord(
            seq,
            id=header,
            description=""
        )
        records.append(rec_out)

SeqIO.write(records, out_fasta, "fasta")

print(f"Wrote {len(records)} CDS sequences to {out_fasta}")

2) 2_PreProcessing_CP052959.R

# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fasta and align CP052959_CDS.fasta to get gene coordinates

# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")

library(Biostrings)
library(readr)
library(dplyr)
library(stringr)

genome_fasta <- "CP052959.fna"
cds_fasta    <- "CP052959_CDS.fasta"
db_name      <- "GenomeDB_CP052959"
blast_out    <- "CDS_vs_genome_CP052959.blast"
coord_table  <- "CP052959_gene_coordinates.tsv"

message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA:    ", cds_fasta)

# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
                  shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)

# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
  "blastn",
  "-task blastn",
  "-query", shQuote(cds_fasta),
  "-db", shQuote(db_name),
  "-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
  "-max_target_seqs 1",
  "-perc_identity 95",
  "-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)

# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
          "length", "pident", "bitscore")

blast <- read_tsv(blast_out,
                  col_names = cols,
                  show_col_types = FALSE,
                  comment = "#")

if (nrow(blast) == 0) {
  stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fasta.")
}

# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
  mutate(
    start  = pmin(sstart, send),
    end    = pmax(sstart, send),
    strand = if_else(sstart <= send, "+", "-")
  ) %>%
  select(qseqid, sseqid, start, end, strand, length, pident, bitscore)

# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
  mutate(
    locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
    gene      = str_split_fixed(qseqid, "\\|", 3)[, 2]
  )

# 6. Save coordinate table
coord <- blast2 %>%
  transmute(
    gene_id    = if_else(gene == "", locus_tag, gene),
    locus_tag  = locus_tag,
    contig_id  = sseqid,
    start      = start,
    end        = end,
    strand     = strand,
    length_nt  = length,
    pident     = pident,
    bitscore   = bitscore
  )

write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)

3) 3_Filter_TA_coords.R

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})

# ---------- 输入 / 输出 ----------
coord_in  <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"

# 你的 TA targets
targets <- c("HJI06_09100", "HJI06_09105")

# ---------- 读取坐标表 ----------
coords <- read_tsv(coord_in, col_names = FALSE, show_col_types = FALSE)

# 预期 9 列：product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore
if (ncol(coords) < 9) {
  stop("Input coord table has < 9 columns: ", coord_in)
}

colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")

# 强制类型（避免读成字符导致 slice_max 异常）
coords <- coords %>%
  mutate(
    locus_tag = as.character(locus_tag),
    aln_len = as.integer(aln_len),
    pident = as.numeric(pident),
    bitscore = as.numeric(bitscore)
  )

# ---------- 过滤：只保留 TA + 每基因最长命中 ----------
coords_ta <- coords %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# ---------- 安全检查 ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
  stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
       "\nCheck that locus_tags exist in ", coord_in)
}

if (nrow(coords_ta) != length(targets)) {
  warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
          "\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}

# ---------- 写输出 ----------
write_tsv(coords_ta, coord_out)

# targets.txt：一行一个 locus_tag（按你给定的顺序写）
writeLines(targets, targets_out)

message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)

4) 4_GuideFinder_CP052959_base.R

#!/usr/bin/env Rscript

# GuideFinder_CP052959_base.R
# 目的：
#   1) 仅针对 TA targets 生成所有候选 guides（NGG PAM，20bp）
#   2) 计算 dist_to_tss（promoter 为负，gene body 为正，按转录方向）
#   3) 输出 guides_preoff.tsv + guides fasta
#   4) 运行 BLAST 做全基因组命中统计，输出 blast_hits.tsv
#
# 输入：
#   CP052959.fna                (整条基因组序列，单条序列 FASTA；可软链接到 CP052959.1.fna)
#   CP052959_gene_coordinates.tsv (PreProcessing_CP052959.R 生成)
#   TA_targets.txt                (两行：HJI06_09100 / HJI06_09105)
#
# 输出：
#   CP052959_guides_preoff.tsv
#   BLASTguides.fasta
#   MatchGuides.blast
#   CP052959_blast_hits.tsv

suppressPackageStartupMessages({
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
  if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")

  library(Biostrings)
  library(readr)
  library(dplyr)
  library(stringr)
  library(tidyr)
})

# ---------- 参数区 ----------
genome_fasta        <- "CP052959.fna"
coord_table         <- "CP052959_gene_coordinates.tsv"   # PreProcessing_CP052959.R 生成的
targets_file        <- "TA_targets.txt"                  # 只跑 TA 两个基因
blast_db_name       <- "GenomeDB_CP052959"               # 预处理里建的 BLAST DB 前缀
guides_fasta        <- "BLASTguides.fasta"
blast_matches_file  <- "MatchGuides.blast"
guides_preoff_tsv   <- "CP052959_guides_preoff.tsv"
blast_hits_tsv      <- "CP052959_blast_hits.tsv"

guide_len      <- 20L
pam            <- "NGG"              # SpCas9
#promoter_len   <- 200L              # 扫描 TSS 上游长度（bp），可按需改 100/300
#scan_gene_frac <- 0.30              # gene body 扫描前 30%（更适合 CRISPRi）
#gc_min         <- 0.30              # S. epidermidis GC 偏低，建议 0.30~0.35
#gc_max         <- 0.80              # 基本不会用到，但留着
#homopolymer_n  <- 4L                # 连续 A 或 T >= 4 就过滤（可改 5 更宽松）
promoter_len   <- 600L     # 原来 200L，改大：增加上游/入口区域的 PAM 机会
scan_gene_frac <- 1.00     # 原来 0.30，改为扫完整 gene body（短基因必需）
gc_min         <- 0.25     # 原来 0.30，略放宽（S. epi GC 低，避免全被 GC 过滤）
gc_max         <- 0.80
homopolymer_n  <- 5L       # 原来 4L，略放宽（避免把仅有的候选全杀掉）

# BLAST 参数：用于命中统计（不是最终 off_refined）
blast_task     <- "blastn-short"
blast_wordsize <- 7L
blast_evalue   <- 1000

# ---------- 工具函数 ----------
stopif_file_missing <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
}

calc_gc <- function(seq) {
  s <- as.character(seq)
  if (nchar(s) == 0) return(NA_real_)
  g <- str_count(s, "G")
  c <- str_count(s, "C")
  (g + c) / nchar(s)
}

has_homopolymer_AT <- function(seq, n = 4L) {
  s <- as.character(seq)
  patA <- paste0("A{", n, ",}")
  patT <- paste0("T{", n, ",}")
  str_detect(s, patA) | str_detect(s, patT)
}

revcomp_str <- function(s) {
  as.character(reverseComplement(DNAString(s)))
}

# 扫描 forward strand 的 NGG：返回 data.frame（protospacer + PAM）
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  # dna: DNAString（该区域序列，5'->3' 按基因组正链）
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # i: protospacer 起点（0-based in R string positions）
  # protospacer = s[i+1 .. i+20]
  # PAM         = s[i+21 .. i+23] must match NGG
  max_i <- L - (guide_len + 3L)
  if (max_i < 0) return(tibble())

  idx <- 0L
  for (i in 0:max_i) {
    pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
    if (nchar(pam_seq) != 3) next
    if (!str_detect(pam_seq, "^[ACGT]GG$")) next

    prot <- substr(s, i + 1L, i + guide_len)
    if (nchar(prot) != guide_len) next

    guide_start <- offset_genome_1based + i
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- guide_end + 1L
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = prot,        # 常规报告为 protospacer（不含 PAM）
      pam_seq     = pam_seq,
      guide_strand = "+",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# 扫描 reverse strand：等价扫描 forward 的 CCN（因为 reverse 的 PAM = NGG）
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # 在 forward 上找 CCN：pam_fwd = CCN (positions j..j+2)
  # reverse 上 PAM = NGG，对应的 protospacer 在 forward 上是 pam_fwd 后面紧接 20nt（j+3..j+22）
  # 输出 guide_seq 应为该 20nt 的反向互补（5'->3'）
  max_j <- L - (3L + guide_len)
  if (max_j < 0) return(tibble())

  idx <- 0L
  for (j in 0:max_j) {
    pam_fwd <- substr(s, j + 1L, j + 3L)
    if (nchar(pam_fwd) != 3) next
    if (!str_detect(pam_fwd, "^CC[ACGT]$")) next

    prot_fwd <- substr(s, j + 4L, j + 3L + guide_len)  # downstream 20nt
    if (nchar(prot_fwd) != guide_len) next

    guide_seq <- revcomp_str(prot_fwd)
    # 基因组坐标（1-based）：prot_fwd 覆盖 [j+4 .. j+23]（长度20）
    guide_start <- offset_genome_1based + j + 3L
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- offset_genome_1based + j
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = guide_seq,
      pam_seq     = revcomp_str(pam_fwd),   # 反向互补后就是 NGG
      guide_strand = "-",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# dist_to_tss：按转录方向计算 guide 到 TSS 的距离（bp）
# 约定：
#   + 链：TSS ~ gene_start（start），promoter 在 start-promoter_len .. start-1（dist 负）
#   - 链：TSS ~ gene_end（end），promoter 在 end+1 .. end+promoter_len（dist 负）
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
  if (gene_strand == "+") {
    # 取 guide 的 5'端（转录方向上的前端）近似
    guide_5p <- guide_start
    as.integer(guide_5p - gene_start)
  } else {
    # - 链转录方向为 大 -> 小，5'端在更大坐标
    guide_5p <- guide_end
    as.integer(gene_end - guide_5p)
  }
}

# ---------- 输入检查 ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)

message("Using genome FASTA: ", genome_fasta)
message("Using coord table:  ", coord_table)
message("Using targets:      ", targets_file)

targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))

# ---------- 读取基因组 ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
  message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)

# ---------- 读取坐标表并过滤（关键：只保留 targets + 每基因最长命中） ----------
# 你的表是 9 列（从 grep 输出推断）：
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
coords_raw <- read_tsv(coord_table, col_names = FALSE, show_col_types = FALSE)
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)

coords <- coords_raw %>%
  transmute(
    product   = as.character(X1),
    locus_tag = as.character(X2),
    seqid     = as.character(X3),
    start     = as.integer(X4),
    end       = as.integer(X5),
    strand    = as.character(X6),
    aln_len   = as.integer(X7),
    pident    = as.numeric(X8),
    bitscore  = as.numeric(X9)
  ) %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# 安全检查：targets 都得存在
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
  stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}

message("Filtered coord rows: ", nrow(coords))
print(coords)

# ---------- 逐基因生成候选 guides ----------
all_guides <- list()
gid <- 0L

for (i in seq_len(nrow(coords))) {
  row <- coords[i, ]

  gene_start <- min(row$start, row$end)
  gene_end   <- max(row$start, row$end)
  gene_strand <- row$strand
  locus <- row$locus_tag
  prod  <- row$product

  gene_len <- gene_end - gene_start + 1L
  scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))

  # promoter 区间（按链方向定义“上游”）
  if (gene_strand == "+") {
    tss <- gene_start
    prom_start <- max(1L, gene_start - promoter_len)
    prom_end   <- gene_start - 1L
    gene_scan_start <- gene_start
    gene_scan_end   <- min(gene_end, gene_start + scan_gene_len - 1L)
  } else {
    tss <- gene_end
    prom_start <- gene_end + 1L
    prom_end   <- min(genome_len, gene_end + promoter_len)
    gene_scan_end   <- gene_end
    gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
  }

  message("\n--- ", locus, " (", gene_strand, ") ---")
  message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
          ", TSS~", tss)
  message("Promoter region: ", prom_start, "..", prom_end)
  message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)

  # 提取 promoter / gene-scan 序列（按基因组正链坐标截取）
  guides_gene <- tibble()
  guides_prom <- tibble()

  if (prom_start <= prom_end) {
    dna_prom <- subseq(genome, start = prom_start, end = prom_end)
    guides_prom <- bind_rows(
      scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
      scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
    )
  }

  if (gene_scan_start <= gene_scan_end) {
    dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
    guides_gene <- bind_rows(
      scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
      scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
    )
  }

  guides <- bind_rows(guides_prom, guides_gene)
  if (nrow(guides) == 0) {
    message("No NGG candidates found in promoter+gene-scan region for ", locus)
    next
  }

  # 过滤：GC + homopolymer
  guides <- guides %>%
    mutate(
      locus_tag = locus,
      product   = prod,
      gene_start = gene_start,
      gene_end   = gene_end,
      gene_strand = gene_strand,
      tss_pos   = tss,
      gc        = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
      has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
      dist_to_tss = mapply(
        FUN = calc_dist_to_tss,
        gene_start = gene_start,
        gene_end   = gene_end,
        gene_strand = gene_strand,
        guide_start = guide_start,
        guide_end   = guide_end
      )
    ) %>%
    filter(!is.na(gc)) %>%
    filter(gc >= gc_min, gc <= gc_max) %>%
    filter(!has_hpoly)

  if (nrow(guides) == 0) {
    message("All candidates filtered out by GC/homopolymer for ", locus)
    next
  }

  # 给每条 guide 一个唯一 ID
  guides <- guides %>%
    arrange(region, abs(dist_to_tss), desc(gc)) %>%
    mutate(
      guide_id = {
        # 稳定可读：CP052959|locus|region|start|strand
        paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
      }
    )

  all_guides[[length(all_guides) + 1L]] <- guides
  message("Kept guides after filters: ", nrow(guides))
}

guides_df <- bind_rows(all_guides)

if (nrow(guides_df) == 0) {
  stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}

# 输出 preoff
write_tsv(guides_df %>%
            select(
              guide_id, locus_tag, product, region,
              guide_seq, pam_seq, guide_strand,
              guide_start, guide_end, pam_start, pam_end,
              gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
              gc, has_hpoly
            ),
          guides_preoff_tsv)

message("\nWrote guides_preoff: ", guides_preoff_tsv)

# ---------- 写 BLASTguides.fasta ----------
# 注意：写出的序列是 guide_seq（20nt）
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)

# ---------- 运行 BLAST ----------
# 检查 BLAST DB 是否存在，否则尝试建库
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
  message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
  cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
  message("Running: ", cmd_mk)
  system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}

cmd_blast <- sprintf(
  "blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
  blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)

stopif_file_missing(blast_matches_file)

# ---------- 解析 BLAST 输出 ----------
blast_raw <- read_tsv(
  blast_matches_file,
  col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
  show_col_types = FALSE
)

if (nrow(blast_raw) == 0) {
  stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}

# 统一 sstart/send 为 min/max（方便判断）
blast_hits <- blast_raw %>%
  mutate(
    smin = pmin(sstart, send),
    smax = pmax(sstart, send),
    hit_strand = ifelse(sstart <= send, "+", "-")
  )

# 命中计数（粗略：所有 hits；后续 off_refined 由你的 ScanOfftarget 脚本做）
hit_counts <- blast_hits %>%
  count(qseqid, name = "n_hits") %>%
  rename(guide_id = qseqid)

# 合并 guides 注释
blast_merged <- blast_hits %>%
  rename(guide_id = qseqid) %>%
  left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
  left_join(hit_counts, by = "guide_id") %>%
  arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))

write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)

message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

#!/usr/bin/env Rscript

# ScanOfftarget_CP052959.R
# 适配当前版本输出：
#   - guides_preoff: CP052959_guides_preoff.tsv（locus_tag, dist_to_tss, gc, guide_id...）
#   - blast_hits:    CP052959_blast_hits.tsv（guide_id, pident, length, mismatch, gapopen... + n_hits）
#
# 输出：
#   scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}

suppressPackageStartupMessages({
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(readr)
  library(dplyr)
})

guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv    <- "CP052959_blast_hits.tsv"
guide_len         <- 20L
base_outdir       <- "scan_results"

message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)

message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)

if (!dir.exists(base_outdir)) dir.create(base_outdir)

# ---------- 兼容列名：gene_id / locus_tag；GC / gc ----------
# guides 表必须有 guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff 缺少 guide_id 列：", guides_preoff_tsv)

# gene_id：优先用 gene_id，没有就用 locus_tag
if (!"gene_id" %in% names(guides)) {
  if (!"locus_tag" %in% names(guides)) stop("guides_preoff 既没有 gene_id 也没有 locus_tag")
  guides <- guides %>% mutate(gene_id = locus_tag)
}

# GC：优先用 GC，没有就用 gc
if (!"GC" %in% names(guides)) {
  if (!"gc" %in% names(guides)) stop("guides_preoff 既没有 GC 也没有 gc")
  guides <- guides %>% mutate(GC = gc)
}

# dist_to_tss 必须存在
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff 缺少 dist_to_tss 列")

# ---------- BLAST hits 列检查 ----------
# 我们需要：guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
  stop("blast_hits 缺少必要列：", paste(missing_cols, collapse = ", "),
       "\n请确认 blast_hits_tsv 是由 GuideFinder_CP052959_base.R 生成的版本。")
}

# ---------- 规则封装 ----------
# 通用：按 n_hits 上限过滤
#是太严（仍然 0），就放宽一点： mismatch 2-->3： filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
                            require_full_length = TRUE,
                            max_mismatch = 2,
                            require_no_gaps = TRUE,
                            min_pident = 0) {

  # 1) 先从 blast hits 里挑“更像真实 off-target 风险”的命中
  b <- blast

  # 强制数值类型（避免字符导致比较全 NA）
  b <- b %>%
    mutate(
      pident  = as.numeric(pident),
      length  = as.integer(length),
      mismatch = as.integer(mismatch),
      gapopen = as.integer(gapopen)
    )

  if (require_full_length) {
    b <- b %>% filter(length == guide_len)
  }
  if (require_no_gaps) {
    b <- b %>% filter(gapopen == 0)
  }
  if (!is.null(max_mismatch)) {
    b <- b %>% filter(mismatch <= max_mismatch)
  }
  if (!is.null(min_pident) && min_pident > 0) {
    # 注意：你的 pident 通常是 0-100 的百分比
    b <- b %>% filter(pident >= min_pident)
  }

  # 2) 用“严格命中”重新计算每条 guide 的 n_hits（更合理）
  valid_ids <- b %>%
    count(guide_id, name = "n_hits_strict") %>%
    filter(n_hits_strict <= max_hits) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# 精细规则 off_refined：
#  - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
#  - 其他 hits（若存在）：length < 15 或 mismatch >= 3
filter_refined <- function() {
  blast_grouped <- blast %>%
    group_by(guide_id) %>%
    summarise(
      n_hits = n(),
      n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
      other_ok = all(
        ifelse(
          length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
          TRUE,
          (length < 15 | mismatch >= 3)
        )
      ),
      .groups = "drop"
    )

  valid_ids <- blast_grouped %>%
    filter(n_perfect == 1, other_ok) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# 写结果到子目录
run_scenario <- function(name, guides_selected) {
  outdir <- file.path(base_outdir, name)
  if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)

  all_path <- file.path(outdir, "guides_all.tsv")
  top_path <- file.path(outdir, "guides_top5.tsv")

  write_tsv(guides_selected, all_path)

  guides_top <- guides_selected %>%
    group_by(gene_id) %>%
    arrange(dist_to_tss, desc(GC)) %>%
    slice_head(n = 5) %>%
    ungroup()

  write_tsv(guides_top, top_path)

  message(sprintf("[%s] 总 guides: %d, 覆盖基因数: %d",
                  name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}

# ---------- 运行多套参数 ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))

message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))

message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))

message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))

message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())

message("All scenarios finished. Check 'scan_results/' directory.")

Final notes

If move to another machine, the only strict requirements are: BLAST+, R packages, and the same input FASTA/GBK.
If a target gene yields no candidates again, check:
- scan_gene_frac (set to 1.0 for short genes)
- promoter length (increase if needed)
- GC/homopolymer thresholds (slightly relax)

If the gRNA backbone available (e.g., BsaI/BsmBI Golden Gate format) and we can add an appendix section that auto-generates the exact top/bottom oligos for ordering from the selected guides.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Guide RNA Design for a Toxin–Antitoxin Locus in CP052959 Using the GuideFinder Workflow (Data_JuliaFuchs_RNAseq_2025)

Target locus

Overview of the pipeline

Requirements

Software

Input files

Step-by-step commands

0) Prepare working directory

1) Extract CDS sequences from GenBank (Python)

2) Pre-processing: build BLAST DB + map CDS to genome (R)

3) Filter TA coordinates + create TA_targets.txt (R)

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

Troubleshooting note (from my run)

5) Off-target filtering (R)

Key concepts (explained from the debugging/communication notes)

What is PAM?

What is guide_seq vs protospacer?

What does guide_strand mean?

Why do some target sites appear multiple times?

Final candidate selection (from my off_refined output)

Reproducibility note: verifying a candidate sequence in the genome

Raw code (scripts used)

1) 1_extract_cds_from_gbk.py

2) 2_PreProcessing_CP052959.R

3) 3_Filter_TA_coords.R

4) 4_GuideFinder_CP052959_base.R

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

Final notes

Leave a Reply Cancel reply