This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.
Target locus
Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:
- Toxin:
HJI06_09100(1889026–1889421; strand-) - Antitoxin:
HJI06_09105(1889421–1889651; strand-)
Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.
Overview of the pipeline
The workflow consists of four stages:
- Extract CDS sequences from the GenBank file →
CP052959_CDS.fasta - BLAST CDS to genome to generate a coordinate table →
CP052959_gene_coordinates.tsv - Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters
→
CP052959_guides_preoff.tsv+ BLAST hit table - Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined)
→
scan_results/.../guides_all.tsv+guides_top5.tsv
A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.
Requirements
Software
- Python 3 (Biopython)
- R (with
Biostrings,readr,dplyr,stringr,tidyr) - BLAST+ (
makeblastdb,blastn) - (Optional)
samtoolsfor quick sequence validation
Input files
Download from NCBI:
- Genome FASTA:
CP052959.1.fna(or equivalent) - GenBank annotation:
CP052959.1.gbk(or equivalent)
For convenience, I standardize filenames locally as:
CP052959.fna(genome)CP052959.gbk(annotation)
Step-by-step commands
0) Prepare working directory
mkdir -p HD46_GuideFinder
cd HD46_GuideFinder
Place the following files in this directory:
CP052959.gbkCP052959.fna- the scripts from the bottom of this post
1) Extract CDS sequences from GenBank (Python)
This creates a CDS FASTA where each header encodes metadata. Important note from troubleshooting: headers can contain spaces; later, it’s safest to avoid spaces or replace them with underscores.
✅ Run:
python 1_extract_cds_from_gbk.py \
--gbk CP052959.gbk \
--out CP052959_CDS.fasta
Expected output:
CP052959_CDS.fasta
Example log:
Wrote 2573 CDS sequences to CP052959_CDS.fasta
2) Pre-processing: build BLAST DB + map CDS to genome (R)
This step wraps:
makeblastdbblastn(CDS vs genome)- parsing BLAST output into a coordinate table
✅ Run:
#NOTE that we need to manually replace spaces (' ') with underscores ('_') in all headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R
Expected output:
GenomeDB_CP052959.*(BLAST database files)CDS_vs_genome_CP052959.blastCP052959_gene_coordinates.tsv
Sanity check (recommended):
grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv
You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).
3) Filter TA coordinates + create TA_targets.txt (R)
The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:
TA_gene_coordinates.tsvTA_targets.txt
✅ Run:
Rscript 3_Filter_TA_coords.R
Expected outputs:
TA_gene_coordinates.tsvTA_targets.txt
4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)
This script scans:
- promoter window + gene body window
- identifies NGG PAM (and CCN for reverse-strand equivalent)
- extracts adjacent 20-nt protospacers
- filters by GC and homopolymer
- BLASTs guides to genome and writes a hit table
✅ Run:
Rscript 4_GuideFinder_CP052959_base.R
Expected outputs:
CP052959_guides_preoff.tsvBLASTguides.fastaMatchGuides.blastCP052959_blast_hits.tsv
Troubleshooting note (from my run)
Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:
scan_gene_frac <- 1.0(scan full gene; crucial for short antitoxin genes)- increase promoter length (e.g., 600 bp)
- optionally relax GC/homopolymer thresholds slightly
5) Off-target filtering (R)
This script reads:
CP052959_guides_preoff.tsvCP052959_blast_hits.tsv
Then produces multiple filtering scenarios:
- off_n1 / off_n3 / off_n5 / off_n10: keep guides where the number of strict genomic hits ≤ N
- off_refined: remove any guide with a second perfect 20/20 match elsewhere; allow only “weak” secondary hits
✅ Run:
Rscript 5_ScanOfftarget_CP052959.R
Expected output structure:
scan_results/
off_n1/
off_n3/
off_n5/
off_n10/
off_refined/
Each contains:
guides_all.tsvguides_top5.tsv(ranked bydist_to_tssand GC)
Key concepts (explained from the debugging/communication notes)
What is PAM?
PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.
What is guide_seq vs protospacer?
In this pipeline:
guide_seqis the 20-nt protospacer sequence extracted from the genome adjacent to PAM.- The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).
What does guide_strand mean?
guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) atguide_start..guide_end.guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.
It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.
Why do some target sites appear multiple times?
Two reasons:
- The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
- In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.
Final candidate selection (from my off_refined output)
After off_refined filtering, I had 10 candidates covering both genes.
Recommended single guide (entry-proximal for TA-unit knockdown):
CP052959|HJI06_09105|gene|1889575|+- guide_seq:
AGTGCTGCGATCACTTCTGT - PAM:
CGG - This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.
Recommended “enhanced” 2-guide set:
- Guide A (entry-proximal):
CP052959|HJI06_09105|gene|1889575|+ - Guide B (upstream roadblock):
CP052959|HJI06_09105|promoter|1889908|-(guide_seq:CATTCTGACTTTATGGAAAC, PAMAGG) - These are ~333 bp apart and both pass off_refined.
Reproducibility note: verifying a candidate sequence in the genome
I used samtools faidx to confirm that the guide_seq matches the genome coordinates:
samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594
Expected sequence:
AGTGCTGCGATCACTTCTGT
Raw code (scripts used)
Below are the exact scripts used so you can reproduce this pipeline on another computer.
1) 1_extract_cds_from_gbk.py
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
gbk_file = "CP052959.gbk"
out_fasta = "CP052959_CDS.fasta"
records = []
for rec in SeqIO.parse(gbk_file, "genbank"):
for feature in rec.features:
if feature.type != "CDS":
continue
seq = feature.extract(rec.seq)
locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
gene = feature.qualifiers.get("gene", [""])[0]
product = feature.qualifiers.get("product", [""])[0]
header_parts = [locus_tag]
if gene:
header_parts.append(gene)
if product:
header_parts.append(product)
header = "|".join(header_parts)
rec_out = SeqRecord(
seq,
id=header,
description=""
)
records.append(rec_out)
SeqIO.write(records, out_fasta, "fasta")
print(f"Wrote {len(records)} CDS sequences to {out_fasta}")
2) 2_PreProcessing_CP052959.R
# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fasta and align CP052959_CDS.fasta to get gene coordinates
# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
install.packages("BiocManager")
BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
library(Biostrings)
library(readr)
library(dplyr)
library(stringr)
genome_fasta <- "CP052959.fna"
cds_fasta <- "CP052959_CDS.fasta"
db_name <- "GenomeDB_CP052959"
blast_out <- "CDS_vs_genome_CP052959.blast"
coord_table <- "CP052959_gene_coordinates.tsv"
message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA: ", cds_fasta)
# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)
# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
"blastn",
"-task blastn",
"-query", shQuote(cds_fasta),
"-db", shQuote(db_name),
"-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
"-max_target_seqs 1",
"-perc_identity 95",
"-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)
# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
"length", "pident", "bitscore")
blast <- read_tsv(blast_out,
col_names = cols,
show_col_types = FALSE,
comment = "#")
if (nrow(blast) == 0) {
stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fasta.")
}
# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
mutate(
start = pmin(sstart, send),
end = pmax(sstart, send),
strand = if_else(sstart <= send, "+", "-")
) %>%
select(qseqid, sseqid, start, end, strand, length, pident, bitscore)
# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
mutate(
locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
gene = str_split_fixed(qseqid, "\\|", 3)[, 2]
)
# 6. Save coordinate table
coord <- blast2 %>%
transmute(
gene_id = if_else(gene == "", locus_tag, gene),
locus_tag = locus_tag,
contig_id = sseqid,
start = start,
end = end,
strand = strand,
length_nt = length,
pident = pident,
bitscore = bitscore
)
write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)
3) 3_Filter_TA_coords.R
#!/usr/bin/env Rscript
suppressPackageStartupMessages({
library(readr)
library(dplyr)
})
# ---------- 输入 / 输出 ----------
coord_in <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"
# 你的 TA targets
targets <- c("HJI06_09100", "HJI06_09105")
# ---------- 读取坐标表 ----------
coords <- read_tsv(coord_in, col_names = FALSE, show_col_types = FALSE)
# 预期 9 列:product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore
if (ncol(coords) < 9) {
stop("Input coord table has < 9 columns: ", coord_in)
}
colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")
# 强制类型(避免读成字符导致 slice_max 异常)
coords <- coords %>%
mutate(
locus_tag = as.character(locus_tag),
aln_len = as.integer(aln_len),
pident = as.numeric(pident),
bitscore = as.numeric(bitscore)
)
# ---------- 过滤:只保留 TA + 每基因最长命中 ----------
coords_ta <- coords %>%
filter(locus_tag %in% targets) %>%
group_by(locus_tag) %>%
slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
ungroup()
# ---------- 安全检查 ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
"\nCheck that locus_tags exist in ", coord_in)
}
if (nrow(coords_ta) != length(targets)) {
warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
"\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}
# ---------- 写输出 ----------
write_tsv(coords_ta, coord_out)
# targets.txt:一行一个 locus_tag(按你给定的顺序写)
writeLines(targets, targets_out)
message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)
4) 4_GuideFinder_CP052959_base.R
#!/usr/bin/env Rscript
# GuideFinder_CP052959_base.R
# 目的:
# 1) 仅针对 TA targets 生成所有候选 guides(NGG PAM,20bp)
# 2) 计算 dist_to_tss(promoter 为负,gene body 为正,按转录方向)
# 3) 输出 guides_preoff.tsv + guides fasta
# 4) 运行 BLAST 做全基因组命中统计,输出 blast_hits.tsv
#
# 输入:
# CP052959.fna (整条基因组序列,单条序列 FASTA;可软链接到 CP052959.1.fna)
# CP052959_gene_coordinates.tsv (PreProcessing_CP052959.R 生成)
# TA_targets.txt (两行:HJI06_09100 / HJI06_09105)
#
# 输出:
# CP052959_guides_preoff.tsv
# BLASTguides.fasta
# MatchGuides.blast
# CP052959_blast_hits.tsv
suppressPackageStartupMessages({
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")
library(Biostrings)
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
})
# ---------- 参数区 ----------
genome_fasta <- "CP052959.fna"
coord_table <- "CP052959_gene_coordinates.tsv" # PreProcessing_CP052959.R 生成的
targets_file <- "TA_targets.txt" # 只跑 TA 两个基因
blast_db_name <- "GenomeDB_CP052959" # 预处理里建的 BLAST DB 前缀
guides_fasta <- "BLASTguides.fasta"
blast_matches_file <- "MatchGuides.blast"
guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv <- "CP052959_blast_hits.tsv"
guide_len <- 20L
pam <- "NGG" # SpCas9
#promoter_len <- 200L # 扫描 TSS 上游长度(bp),可按需改 100/300
#scan_gene_frac <- 0.30 # gene body 扫描前 30%(更适合 CRISPRi)
#gc_min <- 0.30 # S. epidermidis GC 偏低,建议 0.30~0.35
#gc_max <- 0.80 # 基本不会用到,但留着
#homopolymer_n <- 4L # 连续 A 或 T >= 4 就过滤(可改 5 更宽松)
promoter_len <- 600L # 原来 200L,改大:增加上游/入口区域的 PAM 机会
scan_gene_frac <- 1.00 # 原来 0.30,改为扫完整 gene body(短基因必需)
gc_min <- 0.25 # 原来 0.30,略放宽(S. epi GC 低,避免全被 GC 过滤)
gc_max <- 0.80
homopolymer_n <- 5L # 原来 4L,略放宽(避免把仅有的候选全杀掉)
# BLAST 参数:用于命中统计(不是最终 off_refined)
blast_task <- "blastn-short"
blast_wordsize <- 7L
blast_evalue <- 1000
# ---------- 工具函数 ----------
stopif_file_missing <- function(path) {
if (!file.exists(path)) stop("File not found: ", path)
}
calc_gc <- function(seq) {
s <- as.character(seq)
if (nchar(s) == 0) return(NA_real_)
g <- str_count(s, "G")
c <- str_count(s, "C")
(g + c) / nchar(s)
}
has_homopolymer_AT <- function(seq, n = 4L) {
s <- as.character(seq)
patA <- paste0("A{", n, ",}")
patT <- paste0("T{", n, ",}")
str_detect(s, patA) | str_detect(s, patT)
}
revcomp_str <- function(s) {
as.character(reverseComplement(DNAString(s)))
}
# 扫描 forward strand 的 NGG:返回 data.frame(protospacer + PAM)
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
# dna: DNAString(该区域序列,5'->3' 按基因组正链)
s <- as.character(dna)
L <- nchar(s)
out <- list()
# i: protospacer 起点(0-based in R string positions)
# protospacer = s[i+1 .. i+20]
# PAM = s[i+21 .. i+23] must match NGG
max_i <- L - (guide_len + 3L)
if (max_i < 0) return(tibble())
idx <- 0L
for (i in 0:max_i) {
pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
if (nchar(pam_seq) != 3) next
if (!str_detect(pam_seq, "^[ACGT]GG$")) next
prot <- substr(s, i + 1L, i + guide_len)
if (nchar(prot) != guide_len) next
guide_start <- offset_genome_1based + i
guide_end <- guide_start + guide_len - 1L
pam_start <- guide_end + 1L
pam_end <- pam_start + 2L
idx <- idx + 1L
out[[idx]] <- tibble(
guide_seq = prot, # 常规报告为 protospacer(不含 PAM)
pam_seq = pam_seq,
guide_strand = "+",
guide_start = guide_start,
guide_end = guide_end,
pam_start = pam_start,
pam_end = pam_end,
region = region_label
)
}
bind_rows(out)
}
# 扫描 reverse strand:等价扫描 forward 的 CCN(因为 reverse 的 PAM = NGG)
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
s <- as.character(dna)
L <- nchar(s)
out <- list()
# 在 forward 上找 CCN:pam_fwd = CCN (positions j..j+2)
# reverse 上 PAM = NGG,对应的 protospacer 在 forward 上是 pam_fwd 后面紧接 20nt(j+3..j+22)
# 输出 guide_seq 应为该 20nt 的反向互补(5'->3')
max_j <- L - (3L + guide_len)
if (max_j < 0) return(tibble())
idx <- 0L
for (j in 0:max_j) {
pam_fwd <- substr(s, j + 1L, j + 3L)
if (nchar(pam_fwd) != 3) next
if (!str_detect(pam_fwd, "^CC[ACGT]$")) next
prot_fwd <- substr(s, j + 4L, j + 3L + guide_len) # downstream 20nt
if (nchar(prot_fwd) != guide_len) next
guide_seq <- revcomp_str(prot_fwd)
# 基因组坐标(1-based):prot_fwd 覆盖 [j+4 .. j+23](长度20)
guide_start <- offset_genome_1based + j + 3L
guide_end <- guide_start + guide_len - 1L
pam_start <- offset_genome_1based + j
pam_end <- pam_start + 2L
idx <- idx + 1L
out[[idx]] <- tibble(
guide_seq = guide_seq,
pam_seq = revcomp_str(pam_fwd), # 反向互补后就是 NGG
guide_strand = "-",
guide_start = guide_start,
guide_end = guide_end,
pam_start = pam_start,
pam_end = pam_end,
region = region_label
)
}
bind_rows(out)
}
# dist_to_tss:按转录方向计算 guide 到 TSS 的距离(bp)
# 约定:
# + 链:TSS ~ gene_start(start),promoter 在 start-promoter_len .. start-1(dist 负)
# - 链:TSS ~ gene_end(end),promoter 在 end+1 .. end+promoter_len(dist 负)
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
if (gene_strand == "+") {
# 取 guide 的 5'端(转录方向上的前端)近似
guide_5p <- guide_start
as.integer(guide_5p - gene_start)
} else {
# - 链转录方向为 大 -> 小,5'端在更大坐标
guide_5p <- guide_end
as.integer(gene_end - guide_5p)
}
}
# ---------- 输入检查 ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)
message("Using genome FASTA: ", genome_fasta)
message("Using coord table: ", coord_table)
message("Using targets: ", targets_file)
targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))
# ---------- 读取基因组 ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)
# ---------- 读取坐标表并过滤(关键:只保留 targets + 每基因最长命中) ----------
# 你的表是 9 列(从 grep 输出推断):
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
coords_raw <- read_tsv(coord_table, col_names = FALSE, show_col_types = FALSE)
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)
coords <- coords_raw %>%
transmute(
product = as.character(X1),
locus_tag = as.character(X2),
seqid = as.character(X3),
start = as.integer(X4),
end = as.integer(X5),
strand = as.character(X6),
aln_len = as.integer(X7),
pident = as.numeric(X8),
bitscore = as.numeric(X9)
) %>%
filter(locus_tag %in% targets) %>%
group_by(locus_tag) %>%
slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
ungroup()
# 安全检查:targets 都得存在
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}
message("Filtered coord rows: ", nrow(coords))
print(coords)
# ---------- 逐基因生成候选 guides ----------
all_guides <- list()
gid <- 0L
for (i in seq_len(nrow(coords))) {
row <- coords[i, ]
gene_start <- min(row$start, row$end)
gene_end <- max(row$start, row$end)
gene_strand <- row$strand
locus <- row$locus_tag
prod <- row$product
gene_len <- gene_end - gene_start + 1L
scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))
# promoter 区间(按链方向定义“上游”)
if (gene_strand == "+") {
tss <- gene_start
prom_start <- max(1L, gene_start - promoter_len)
prom_end <- gene_start - 1L
gene_scan_start <- gene_start
gene_scan_end <- min(gene_end, gene_start + scan_gene_len - 1L)
} else {
tss <- gene_end
prom_start <- gene_end + 1L
prom_end <- min(genome_len, gene_end + promoter_len)
gene_scan_end <- gene_end
gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
}
message("\n--- ", locus, " (", gene_strand, ") ---")
message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
", TSS~", tss)
message("Promoter region: ", prom_start, "..", prom_end)
message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)
# 提取 promoter / gene-scan 序列(按基因组正链坐标截取)
guides_gene <- tibble()
guides_prom <- tibble()
if (prom_start <= prom_end) {
dna_prom <- subseq(genome, start = prom_start, end = prom_end)
guides_prom <- bind_rows(
scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
)
}
if (gene_scan_start <= gene_scan_end) {
dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
guides_gene <- bind_rows(
scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
)
}
guides <- bind_rows(guides_prom, guides_gene)
if (nrow(guides) == 0) {
message("No NGG candidates found in promoter+gene-scan region for ", locus)
next
}
# 过滤:GC + homopolymer
guides <- guides %>%
mutate(
locus_tag = locus,
product = prod,
gene_start = gene_start,
gene_end = gene_end,
gene_strand = gene_strand,
tss_pos = tss,
gc = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
dist_to_tss = mapply(
FUN = calc_dist_to_tss,
gene_start = gene_start,
gene_end = gene_end,
gene_strand = gene_strand,
guide_start = guide_start,
guide_end = guide_end
)
) %>%
filter(!is.na(gc)) %>%
filter(gc >= gc_min, gc <= gc_max) %>%
filter(!has_hpoly)
if (nrow(guides) == 0) {
message("All candidates filtered out by GC/homopolymer for ", locus)
next
}
# 给每条 guide 一个唯一 ID
guides <- guides %>%
arrange(region, abs(dist_to_tss), desc(gc)) %>%
mutate(
guide_id = {
# 稳定可读:CP052959|locus|region|start|strand
paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
}
)
all_guides[[length(all_guides) + 1L]] <- guides
message("Kept guides after filters: ", nrow(guides))
}
guides_df <- bind_rows(all_guides)
if (nrow(guides_df) == 0) {
stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}
# 输出 preoff
write_tsv(guides_df %>%
select(
guide_id, locus_tag, product, region,
guide_seq, pam_seq, guide_strand,
guide_start, guide_end, pam_start, pam_end,
gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
gc, has_hpoly
),
guides_preoff_tsv)
message("\nWrote guides_preoff: ", guides_preoff_tsv)
# ---------- 写 BLASTguides.fasta ----------
# 注意:写出的序列是 guide_seq(20nt)
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)
# ---------- 运行 BLAST ----------
# 检查 BLAST DB 是否存在,否则尝试建库
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
message("Running: ", cmd_mk)
system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}
cmd_blast <- sprintf(
"blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
stopif_file_missing(blast_matches_file)
# ---------- 解析 BLAST 输出 ----------
blast_raw <- read_tsv(
blast_matches_file,
col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
show_col_types = FALSE
)
if (nrow(blast_raw) == 0) {
stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}
# 统一 sstart/send 为 min/max(方便判断)
blast_hits <- blast_raw %>%
mutate(
smin = pmin(sstart, send),
smax = pmax(sstart, send),
hit_strand = ifelse(sstart <= send, "+", "-")
)
# 命中计数(粗略:所有 hits;后续 off_refined 由你的 ScanOfftarget 脚本做)
hit_counts <- blast_hits %>%
count(qseqid, name = "n_hits") %>%
rename(guide_id = qseqid)
# 合并 guides 注释
blast_merged <- blast_hits %>%
rename(guide_id = qseqid) %>%
left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
left_join(hit_counts, by = "guide_id") %>%
arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))
write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)
message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")
5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)
#!/usr/bin/env Rscript
# ScanOfftarget_CP052959.R
# 适配当前版本输出:
# - guides_preoff: CP052959_guides_preoff.tsv(locus_tag, dist_to_tss, gc, guide_id...)
# - blast_hits: CP052959_blast_hits.tsv(guide_id, pident, length, mismatch, gapopen... + n_hits)
#
# 输出:
# scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}
suppressPackageStartupMessages({
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(readr)
library(dplyr)
})
guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv <- "CP052959_blast_hits.tsv"
guide_len <- 20L
base_outdir <- "scan_results"
message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)
message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)
if (!dir.exists(base_outdir)) dir.create(base_outdir)
# ---------- 兼容列名:gene_id / locus_tag;GC / gc ----------
# guides 表必须有 guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff 缺少 guide_id 列:", guides_preoff_tsv)
# gene_id:优先用 gene_id,没有就用 locus_tag
if (!"gene_id" %in% names(guides)) {
if (!"locus_tag" %in% names(guides)) stop("guides_preoff 既没有 gene_id 也没有 locus_tag")
guides <- guides %>% mutate(gene_id = locus_tag)
}
# GC:优先用 GC,没有就用 gc
if (!"GC" %in% names(guides)) {
if (!"gc" %in% names(guides)) stop("guides_preoff 既没有 GC 也没有 gc")
guides <- guides %>% mutate(GC = gc)
}
# dist_to_tss 必须存在
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff 缺少 dist_to_tss 列")
# ---------- BLAST hits 列检查 ----------
# 我们需要:guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
stop("blast_hits 缺少必要列:", paste(missing_cols, collapse = ", "),
"\n请确认 blast_hits_tsv 是由 GuideFinder_CP052959_base.R 生成的版本。")
}
# ---------- 规则封装 ----------
# 通用:按 n_hits 上限过滤
#是太严(仍然 0),就放宽一点: mismatch 2-->3: filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
require_full_length = TRUE,
max_mismatch = 2,
require_no_gaps = TRUE,
min_pident = 0) {
# 1) 先从 blast hits 里挑“更像真实 off-target 风险”的命中
b <- blast
# 强制数值类型(避免字符导致比较全 NA)
b <- b %>%
mutate(
pident = as.numeric(pident),
length = as.integer(length),
mismatch = as.integer(mismatch),
gapopen = as.integer(gapopen)
)
if (require_full_length) {
b <- b %>% filter(length == guide_len)
}
if (require_no_gaps) {
b <- b %>% filter(gapopen == 0)
}
if (!is.null(max_mismatch)) {
b <- b %>% filter(mismatch <= max_mismatch)
}
if (!is.null(min_pident) && min_pident > 0) {
# 注意:你的 pident 通常是 0-100 的百分比
b <- b %>% filter(pident >= min_pident)
}
# 2) 用“严格命中”重新计算每条 guide 的 n_hits(更合理)
valid_ids <- b %>%
count(guide_id, name = "n_hits_strict") %>%
filter(n_hits_strict <= max_hits) %>%
pull(guide_id)
guides %>% filter(guide_id %in% valid_ids)
}
# 精细规则 off_refined:
# - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
# - 其他 hits(若存在):length < 15 或 mismatch >= 3
filter_refined <- function() {
blast_grouped <- blast %>%
group_by(guide_id) %>%
summarise(
n_hits = n(),
n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
other_ok = all(
ifelse(
length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
TRUE,
(length < 15 | mismatch >= 3)
)
),
.groups = "drop"
)
valid_ids <- blast_grouped %>%
filter(n_perfect == 1, other_ok) %>%
pull(guide_id)
guides %>% filter(guide_id %in% valid_ids)
}
# 写结果到子目录
run_scenario <- function(name, guides_selected) {
outdir <- file.path(base_outdir, name)
if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)
all_path <- file.path(outdir, "guides_all.tsv")
top_path <- file.path(outdir, "guides_top5.tsv")
write_tsv(guides_selected, all_path)
guides_top <- guides_selected %>%
group_by(gene_id) %>%
arrange(dist_to_tss, desc(GC)) %>%
slice_head(n = 5) %>%
ungroup()
write_tsv(guides_top, top_path)
message(sprintf("[%s] 总 guides: %d, 覆盖基因数: %d",
name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}
# ---------- 运行多套参数 ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))
message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))
message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))
message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))
message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())
message("All scenarios finished. Check 'scan_results/' directory.")
Final notes
- If move to another machine, the only strict requirements are: BLAST+, R packages, and the same input FASTA/GBK.
-
If a target gene yields no candidates again, check:
scan_gene_frac(set to1.0for short genes)- promoter length (increase if needed)
- GC/homopolymer thresholds (slightly relax)
If the gRNA backbone available (e.g., BsaI/BsmBI Golden Gate format) and we can add an appendix section that auto-generates the exact top/bottom oligos for ordering from the selected guides.





