Guide RNA Design for a Toxin–Antitoxin Locus in CP052959 Using the GuideFinder Workflow (Data_JuliaFuchs_RNAseq_2025)

This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.

Target locus

Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:

  • Toxin: HJI06_09100 (1889026–1889421; strand -)
  • Antitoxin: HJI06_09105 (1889421–1889651; strand -)

Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.


Overview of the pipeline

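The workflow consists of four main stages (the step-by-step section below adds a small TA-specific coordinate filter between stages 2 and 3):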

  1. Extract CDS sequences from the GenBank file → CP052959_CDS.fasta
  2. BLAST CDS to genome to generate a coordinate table → CP052959_gene_coordinates.tsv
  3. Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters → CP052959_guides_preoff.tsv + BLAST hit table
  4. Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined) → scan_results/.../guides_all.tsv + guides_top5.tsv

A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.
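
To see why short genes are sensitive to scan_gene_frac, the window arithmetic can be sketched in Python. This is an illustrative re-implementation of the scan-window logic from the R script at the bottom of this post, not part of the pipeline itself; the function names scan_window and n_site_frames are made up for this example, and the coordinates are the antitoxin coordinates listed under "Target locus".

```python
import math

GUIDE_LEN = 20  # protospacer length
PAM_LEN = 3     # NGG


def scan_window(gene_start, gene_end, strand, scan_gene_frac):
    """Return (start, end) of the gene-body region scanned for PAMs.

    The first `scan_gene_frac` of the gene is scanned, counted from the
    transcription start (the high coordinate for a minus-strand gene).
    """
    gene_len = gene_end - gene_start + 1
    scan_len = max(1, math.floor(gene_len * scan_gene_frac))
    if strand == "+":
        return gene_start, min(gene_end, gene_start + scan_len - 1)
    return max(gene_start, gene_end - scan_len + 1), gene_end


def n_site_frames(window):
    """Number of 23-nt frames (protospacer + PAM) that fit in the window."""
    start, end = window
    return max(0, (end - start + 1) - (GUIDE_LEN + PAM_LEN) + 1)


# Antitoxin HJI06_09105: 1889421..1889651 on the minus strand (231 bp)
narrow = scan_window(1889421, 1889651, "-", 0.30)
full = scan_window(1889421, 1889651, "-", 1.00)
print("30% window: ", narrow, "frames:", n_site_frames(narrow))
print("100% window:", full, "frames:", n_site_frames(full))
```

Under these assumptions the 30% window covers only 1889583..1889651, so a site starting at 1889575 (like the guide recommended later in this post) is not fully contained in it; scanning the full gene removes that restriction.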


Requirements

Software

  • Python 3 (Biopython)
  • R (with Biostrings, readr, dplyr, stringr, tidyr)
  • BLAST+ (makeblastdb, blastn)
  • (Optional) samtools for quick sequence validation

Input files

Download from NCBI:

  • Genome FASTA: CP052959.1.fna (or equivalent)
  • GenBank annotation: CP052959.1.gbk (or equivalent)

For convenience, I standardize filenames locally as:

  • CP052959.fna (genome)
  • CP052959.gbk (annotation)

Step-by-step commands

0) Prepare working directory

mkdir -p HD46_GuideFinder
cd HD46_GuideFinder

Place the following files in this directory:

  • CP052959.gbk
  • CP052959.fna
  • the scripts from the bottom of this post

1) Extract CDS sequences from GenBank (Python)

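This creates a CDS FASTA where each header encodes metadata (locus_tag, gene name if present, product). Important note from troubleshooting: product names usually contain spaces, and BLAST uses only the header text up to the first whitespace as the query ID, so replace spaces with underscores before running step 2.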

✅ Run:

python 1_extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Expected output:

  • CP052959_CDS.fasta

Example log:

  • Wrote 2573 CDS sequences to CP052959_CDS.fasta
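
The header clean-up mentioned above does not have to be done by hand. Below is a minimal Python sketch, assuming a plain FASTA file; the function names and the example record (LOCUS_0001, "example product name") are hypothetical and only illustrate the transformation.

```python
def sanitize_fasta_headers(lines):
    """Replace whitespace runs in FASTA header lines with underscores.

    Sequence lines are passed through unchanged (minus the trailing newline).
    """
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = ">" + "_".join(line[1:].split())
        cleaned.append(line)
    return cleaned


def sanitize_fasta_file(in_path, out_path):
    """Apply sanitize_fasta_headers to a file on disk."""
    with open(in_path) as fh:
        cleaned = sanitize_fasta_headers(fh)
    with open(out_path, "w") as fh:
        fh.write("\n".join(cleaned) + "\n")


# Hypothetical record, for illustration only
example = [">LOCUS_0001|example product name\n", "ATGAAACGT\n"]
print("\n".join(sanitize_fasta_headers(example)))
```

For the real data, something like sanitize_fasta_file("CP052959_CDS.fasta", "CP052959_CDS.clean.fasta") followed by renaming the cleaned file over the original should be equivalent to the manual replacement.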

2) Pre-processing: build BLAST DB + map CDS to genome (R)

This step wraps:

  • makeblastdb
  • blastn (CDS vs genome)
  • parsing BLAST output into a coordinate table

✅ Run:

# NOTE: before running this step, replace spaces (' ') with underscores ('_') in all headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R

Expected output:

  • GenomeDB_CP052959.* (BLAST database files)
  • CDS_vs_genome_CP052959.blast
  • CP052959_gene_coordinates.tsv

Sanity check (recommended):

grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv

You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).


3) Filter TA coordinates + create TA_targets.txt (R)

The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

✅ Run:

Rscript 3_Filter_TA_coords.R

Expected outputs:

  • TA_gene_coordinates.tsv
  • TA_targets.txt
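
The "longest hit per locus_tag" rule is simple enough to sketch in Python. This is an illustrative equivalent of the dplyr filter in the R script, not the script itself; the row dictionaries and the short 28-nt hit are invented for the example (the two full-length values follow from the gene coordinates given under "Target locus").

```python
def longest_hit_per_locus(rows, targets):
    """Keep the longest alignment per locus_tag, for target loci only.

    rows: iterable of dicts with at least 'locus_tag' and 'aln_len'.
    On ties, the first row seen is kept.
    """
    best = {}
    for row in rows:
        tag = row["locus_tag"]
        if tag not in targets:
            continue
        if tag not in best or row["aln_len"] > best[tag]["aln_len"]:
            best[tag] = row
    missing = [t for t in targets if t not in best]
    if missing:
        raise ValueError("Missing targets: " + ", ".join(missing))
    return [best[t] for t in targets]


rows = [
    {"locus_tag": "HJI06_09100", "start": 1889026, "end": 1889421, "aln_len": 396},
    {"locus_tag": "HJI06_09100", "start": 1, "end": 28, "aln_len": 28},  # invented spurious hit
    {"locus_tag": "HJI06_09105", "start": 1889421, "end": 1889651, "aln_len": 231},
]
for row in longest_hit_per_locus(rows, ["HJI06_09100", "HJI06_09105"]):
    print(row["locus_tag"], row["start"], row["end"], row["aln_len"])
```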

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

This script:

  • scans a promoter window plus a gene-body window for each target
  • identifies NGG PAMs on the forward strand (and CCN, the reverse-strand equivalent)
  • extracts the adjacent 20-nt protospacers
  • filters by GC content and A/T homopolymer runs
  • BLASTs the guides against the genome and writes a hit table
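
The PAM scan and the basic filters listed above can be sketched compactly in Python. This is a simplified illustration of the same idea as the R script (20-nt protospacer followed by NGG on the forward strand, or preceded by CCN for reverse-strand sites), assuming an uppercase A/C/G/T region; the function names and the toy sequence are made up for this example.

```python
import re

GUIDE_LEN = 20
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def scan_pams(region, offset=1):
    """Return (guide_seq, pam, strand, guide_start, guide_end) tuples.

    region: forward-strand sequence; offset: 1-based genome coordinate
    of the first base of region.
    """
    region = region.upper()
    sites = []
    for i in range(len(region) - GUIDE_LEN - 3 + 1):
        # '+' site: protospacer at i..i+19, NGG immediately 3' of it
        pam = region[i + GUIDE_LEN:i + GUIDE_LEN + 3]
        if re.fullmatch("[ACGT]GG", pam):
            start = offset + i
            sites.append((region[i:i + GUIDE_LEN], pam, "+", start, start + GUIDE_LEN - 1))
        # '-' site: CCN at i..i+2, protospacer is the next 20 nt (reported as revcomp)
        pam_fwd = region[i:i + 3]
        if re.fullmatch("CC[ACGT]", pam_fwd):
            start = offset + i + 3
            prot_fwd = region[i + 3:i + 3 + GUIDE_LEN]
            sites.append((revcomp(prot_fwd), revcomp(pam_fwd), "-", start, start + GUIDE_LEN - 1))
    return sites


def passes_filters(guide, gc_min=0.25, gc_max=0.80, homopolymer_n=5):
    """GC window plus a check for runs of >= homopolymer_n A or T."""
    gc = (guide.count("G") + guide.count("C")) / len(guide)
    has_run = "A" * homopolymer_n in guide or "T" * homopolymer_n in guide
    return gc_min <= gc <= gc_max and not has_run


# Toy region: CCN + 20 nt + NGG, so the same interval yields a '+' and a '-' site
toy = "CCT" + "ATCGATTACGCTAAGTCATC" + "AGG"
for site in scan_pams(toy, offset=1):
    print(site, passes_filters(site[0]))
```

The toy region is built so that one 20-nt interval has a PAM on both sides, which is the situation that produces forward/reverse pairs at identical coordinates in the output tables.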

✅ Run:

Rscript 4_GuideFinder_CP052959_base.R

Expected outputs:

  • CP052959_guides_preoff.tsv
  • BLASTguides.fasta
  • MatchGuides.blast
  • CP052959_blast_hits.tsv

Troubleshooting note (from my run)

Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:

  • scan_gene_frac <- 1.0 (scan full gene; crucial for short antitoxin genes)
  • increase promoter length (e.g., 600 bp)
  • optionally relax GC/homopolymer thresholds slightly

5) Off-target filtering (R)

This script reads:

  • CP052959_guides_preoff.tsv
  • CP052959_blast_hits.tsv

Then produces multiple filtering scenarios:

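  • off_n1 / off_n3 / off_n5 / off_n10: keep guides whose number of strict genomic hits (full-length 20-nt alignment, no gaps, ≤ 2 mismatches; the on-target site itself counts as one hit) is ≤ N
  • off_refined: keep guides with exactly one perfect 20/20 match (the on-target site); any other hit must be “weak” (alignment shorter than 15 nt, or ≥ 3 mismatches)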
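
The two kinds of rule can be written down in a few lines of Python. This sketch mirrors the thresholds used in the R script (full length, no gaps, at most 2 mismatches for a "strict" hit; length < 15 or at least 3 mismatches for a "weak" hit), but the function names and the example hit lists are invented for illustration. The R version additionally checks pident == 100 for a perfect hit, which is implied by 20/20 with no mismatches or gaps.

```python
GUIDE_LEN = 20


def is_perfect(hit):
    """Full-length hit with no mismatches and no gaps."""
    return hit["length"] == GUIDE_LEN and hit["mismatch"] == 0 and hit["gapopen"] == 0


def is_strict(hit, max_mismatch=2):
    """Full-length, gap-free hit with at most max_mismatch mismatches."""
    return (hit["length"] == GUIDE_LEN
            and hit["gapopen"] == 0
            and hit["mismatch"] <= max_mismatch)


def passes_off_n(hits, max_hits):
    """off_nN rule: between 1 and max_hits strict hits (on-target included)."""
    n_strict = sum(is_strict(h) for h in hits)
    return 1 <= n_strict <= max_hits


def passes_off_refined(hits):
    """off_refined rule: exactly one perfect hit, all other hits weak."""
    n_perfect = sum(is_perfect(h) for h in hits)
    others_weak = all(h["length"] < 15 or h["mismatch"] >= 3
                      for h in hits if not is_perfect(h))
    return n_perfect == 1 and others_weak


on_target = {"length": 20, "mismatch": 0, "gapopen": 0}
guide_a = [on_target, {"length": 12, "mismatch": 0, "gapopen": 0}]  # short secondary hit
guide_b = [on_target, {"length": 20, "mismatch": 2, "gapopen": 0}]  # near-perfect secondary hit

for name, hits in [("guide_a", guide_a), ("guide_b", guide_b)]:
    print(name, passes_off_n(hits, 1), passes_off_n(hits, 3), passes_off_refined(hits))
```

In this toy example guide_a passes every scenario, while guide_b fails off_n1 and off_refined because its second hit is a full-length 2-mismatch site.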

✅ Run:

Rscript 5_ScanOfftarget_CP052959.R

Expected output structure:

scan_results/
  off_n1/
  off_n3/
  off_n5/
  off_n10/
  off_refined/

Each contains:

  • guides_all.tsv
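  • guides_top5.tsv (up to five guides per gene, sorted by ascending signed dist_to_tss, then descending GC; promoter guides have negative distances, so they sort first)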

Key concepts (explained from the debugging/communication notes)

What is PAM?

PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.

What is guide_seq vs protospacer?

In this pipeline:

  • guide_seq is the 20-nt protospacer sequence extracted from the genome adjacent to PAM.
  • The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).

What does guide_strand mean?

  • guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) at guide_start..guide_end.
  • guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.

It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.

Why do some target sites appear multiple times?

Two reasons:

  1. The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
  2. In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.
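
Collapsing the table to physical sites can be sketched as follows. The guide_id layout (CP052959|locus_tag|region|start|strand) is the one produced by the R script; the helper names are made up, and the second ID in the example is a hypothetical duplicate added only to show the grouping. The key used here is (start, strand), so the same site reported under two locus_tags collapses to one entry, while "+" and "-" entries at the same coordinates stay separate because their spacer sequences differ; drop the strand from the key if one entry per interval is preferred.

```python
from collections import defaultdict


def parse_guide_id(guide_id):
    """Split a guide_id of the form genome|locus_tag|region|start|strand."""
    genome, locus_tag, region, start, strand = guide_id.split("|")
    return {"genome": genome, "locus_tag": locus_tag, "region": region,
            "start": int(start), "strand": strand}


def group_by_site(guide_ids):
    """Map (start, strand) -> list of guide_ids reporting that site."""
    sites = defaultdict(list)
    for gid in guide_ids:
        parsed = parse_guide_id(gid)
        sites[(parsed["start"], parsed["strand"])].append(gid)
    return dict(sites)


ids = [
    "CP052959|HJI06_09105|gene|1889575|+",
    "CP052959|HJI06_09100|promoter|1889575|+",  # hypothetical duplicate, illustration only
    "CP052959|HJI06_09105|promoter|1889908|-",
]
for site, members in sorted(group_by_site(ids).items()):
    print(site, len(members), members)
```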

Final candidate selection (from my off_refined output)

After off_refined filtering, I had 10 candidates covering both genes.

Recommended single guide (entry-proximal for TA-unit knockdown):

  • CP052959|HJI06_09105|gene|1889575|+
  • guide_seq: AGTGCTGCGATCACTTCTGT
  • PAM: CGG
  • This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.

Recommended “enhanced” 2-guide set:

  • Guide A (entry-proximal): CP052959|HJI06_09105|gene|1889575|+
  • Guide B (upstream roadblock): CP052959|HJI06_09105|promoter|1889908|- (guide_seq: CATTCTGACTTTATGGAAAC, PAM AGG)
  • These are ~333 bp apart and both pass off_refined.
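
The positions of the two guides relative to the antitoxin can be checked with a small Python sketch of the dist_to_tss convention used in the R script. It assumes, as the script does, that the TSS is approximated by the annotated gene end (1889651) for this minus-strand gene; the function name is chosen for this example only.

```python
GUIDE_LEN = 20


def dist_to_tss(gene_start, gene_end, gene_strand, guide_start, guide_len=GUIDE_LEN):
    """Signed distance from the approximate TSS to the guide's leading end.

    Negative values lie upstream (promoter side), positive values in the gene body.
    """
    guide_end = guide_start + guide_len - 1
    if gene_strand == "+":
        return guide_start - gene_start
    # Minus strand: transcription runs from high to low coordinates
    return gene_end - guide_end


# Antitoxin HJI06_09105: 1889421..1889651, minus strand
dist_a = dist_to_tss(1889421, 1889651, "-", 1889575)  # Guide A (gene body)
dist_b = dist_to_tss(1889421, 1889651, "-", 1889908)  # Guide B (promoter)
spacing = 1889908 - 1889575                           # start-to-start distance

print("Guide A dist_to_tss:", dist_a)
print("Guide B dist_to_tss:", dist_b)
print("Spacing between guide starts:", spacing)
```

With these inputs Guide A sits inside the gene body (positive distance) and Guide B upstream of the annotated start (negative distance), which is the arrangement described above.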

Reproducibility note: verifying a candidate sequence in the genome

I used samtools faidx to confirm that the guide_seq matches the genome coordinates:

samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594

Expected sequence: AGTGCTGCGATCACTTCTGT
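
The same check can be done without samtools, and extended to guide_strand = - entries, with a short Python sketch. It assumes a single-record genome FASTA and 1-based guide_start values as in the guide tables; the function names and the toy genome are invented for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_first_fasta_record(path):
    """Return the sequence of the first record in a FASTA file (uppercase)."""
    chunks = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if chunks:
                    break  # a second record starts; stop after the first
                continue
            chunks.append(line.strip())
    return "".join(chunks).upper()


def guide_matches(genome, guide_seq, guide_start, strand):
    """Check that guide_seq matches the genome at guide_start (1-based).

    For '-' guides the forward-strand interval is reverse-complemented first.
    """
    site = genome[guide_start - 1:guide_start - 1 + len(guide_seq)]
    if strand == "-":
        site = revcomp(site)
    return site == guide_seq.upper()


# Toy genome for illustration: the 20-mer starts at position 4
toy_genome = "TTT" + "ATCGATTACGCTAAGTCATC" + "AAA"
print(guide_matches(toy_genome, "ATCGATTACGCTAAGTCATC", 4, "+"))
print(guide_matches(toy_genome, revcomp("ATCGATTACGCTAAGTCATC"), 4, "-"))
```

For the real genome, genome = read_first_fasta_record("CP052959.fna") followed by guide_matches(genome, "AGTGCTGCGATCACTTCTGT", 1889575, "+") should reproduce the samtools check above.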


Raw code (scripts used)

Below are the exact scripts used so you can reproduce this pipeline on another computer.


1) 1_extract_cds_from_gbk.py

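import argparse

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# Parse the --gbk/--out options used in the run command for step 1;
# the defaults reproduce the original hard-coded filenames.
parser = argparse.ArgumentParser(description="Extract CDS sequences from a GenBank file")
parser.add_argument("--gbk", default="CP052959.gbk", help="input GenBank file")
parser.add_argument("--out", default="CP052959_CDS.fasta", help="output CDS FASTA")
args = parser.parse_args()

gbk_file = args.gbk
out_fasta = args.out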

records = []

for rec in SeqIO.parse(gbk_file, "genbank"):
    for feature in rec.features:
        if feature.type != "CDS":
            continue

        seq = feature.extract(rec.seq)

        locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        gene      = feature.qualifiers.get("gene", [""])[0]
        product   = feature.qualifiers.get("product", [""])[0]

        header_parts = [locus_tag]
        if gene:
            header_parts.append(gene)
        if product:
            header_parts.append(product)

        header = "|".join(header_parts)

        rec_out = SeqRecord(
            seq,
            id=header,
            description=""
        )
        records.append(rec_out)

SeqIO.write(records, out_fasta, "fasta")

print(f"Wrote {len(records)} CDS sequences to {out_fasta}")

2) 2_PreProcessing_CP052959.R

# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fna and align CP052959_CDS.fasta to get gene coordinates

# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")

library(Biostrings)
library(readr)
library(dplyr)
library(stringr)

genome_fasta <- "CP052959.fna"
cds_fasta    <- "CP052959_CDS.fasta"
db_name      <- "GenomeDB_CP052959"
blast_out    <- "CDS_vs_genome_CP052959.blast"
coord_table  <- "CP052959_gene_coordinates.tsv"

message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA:    ", cds_fasta)

# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
                  shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)

# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
  "blastn",
  "-task blastn",
  "-query", shQuote(cds_fasta),
  "-db", shQuote(db_name),
  "-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
  "-max_target_seqs 1",
  "-perc_identity 95",
  "-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)

# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
          "length", "pident", "bitscore")

blast <- read_tsv(blast_out,
                  col_names = cols,
                  show_col_types = FALSE,
                  comment = "#")

if (nrow(blast) == 0) {
  stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fna.")
}

# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
  mutate(
    start  = pmin(sstart, send),
    end    = pmax(sstart, send),
    strand = if_else(sstart <= send, "+", "-")
  ) %>%
  select(qseqid, sseqid, start, end, strand, length, pident, bitscore)

# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
  mutate(
    locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
    gene      = str_split_fixed(qseqid, "\\|", 3)[, 2]
  )

# 6. Save coordinate table
coord <- blast2 %>%
  transmute(
    gene_id    = if_else(gene == "", locus_tag, gene),
    locus_tag  = locus_tag,
    contig_id  = sseqid,
    start      = start,
    end        = end,
    strand     = strand,
    length_nt  = length,
    pident     = pident,
    bitscore   = bitscore
  )

write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)

3) 3_Filter_TA_coords.R

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})

# ---------- Input / output ----------
coord_in  <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"

# TA target locus tags
targets <- c("HJI06_09100", "HJI06_09105")

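# ---------- Read the coordinate table ----------
# 2_PreProcessing_CP052959.R writes a header row; skip it and assign column names below.
coords <- read_tsv(coord_in, col_names = FALSE, skip = 1, show_col_types = FALSE)

# Expect 9 columns: product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore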
if (ncol(coords) < 9) {
  stop("Input coord table has < 9 columns: ", coord_in)
}

colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")

# Coerce types (avoids slice_max misbehaving on columns read as character)
coords <- coords %>%
  mutate(
    locus_tag = as.character(locus_tag),
    aln_len = as.integer(aln_len),
    pident = as.numeric(pident),
    bitscore = as.numeric(bitscore)
  )

# ---------- Filter: keep only TA targets + longest hit per gene ----------
coords_ta <- coords %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# ---------- Sanity checks ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
  stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
       "\nCheck that locus_tags exist in ", coord_in)
}

if (nrow(coords_ta) != length(targets)) {
  warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
          "\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}

# ---------- Write outputs ----------
write_tsv(coords_ta, coord_out)

# TA_targets.txt: one locus_tag per line (written in the order given above)
writeLines(targets, targets_out)

message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)

4) 4_GuideFinder_CP052959_base.R

#!/usr/bin/env Rscript

# GuideFinder_CP052959_base.R
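# Purpose:
#   1) Generate all candidate guides (NGG PAM, 20 bp) for the TA targets only
#   2) Compute dist_to_tss (negative in the promoter, positive in the gene body, following the direction of transcription)
#   3) Write guides_preoff.tsv + guides FASTA
#   4) Run BLAST for genome-wide hit counts and write blast_hits.tsv
#
# Inputs:
#   CP052959.fna                  (whole-genome sequence, single-record FASTA; can be a symlink to CP052959.1.fna)
#   CP052959_gene_coordinates.tsv (generated by PreProcessing_CP052959.R)
#   TA_targets.txt                (two lines: HJI06_09100 / HJI06_09105)
#
# Outputs: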
#   CP052959_guides_preoff.tsv
#   BLASTguides.fasta
#   MatchGuides.blast
#   CP052959_blast_hits.tsv

suppressPackageStartupMessages({
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
  if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")

  library(Biostrings)
  library(readr)
  library(dplyr)
  library(stringr)
  library(tidyr)
})

# ---------- Parameters ----------
genome_fasta        <- "CP052959.fna"
coord_table         <- "CP052959_gene_coordinates.tsv"   # generated by PreProcessing_CP052959.R
targets_file        <- "TA_targets.txt"                  # run only the two TA genes
blast_db_name       <- "GenomeDB_CP052959"               # BLAST DB prefix built during pre-processing
guides_fasta        <- "BLASTguides.fasta"
blast_matches_file  <- "MatchGuides.blast"
guides_preoff_tsv   <- "CP052959_guides_preoff.tsv"
blast_hits_tsv      <- "CP052959_blast_hits.tsv"

guide_len      <- 20L
pam            <- "NGG"              # SpCas9
#promoter_len   <- 200L              # length scanned upstream of the TSS (bp); adjust to 100/300 as needed
#scan_gene_frac <- 0.30              # scan the first 30% of the gene body (better suited to CRISPRi)
#gc_min         <- 0.30              # S. epidermidis has low GC; 0.30-0.35 suggested
#gc_max         <- 0.80              # rarely limiting, kept for completeness
#homopolymer_n  <- 4L                # filter runs of >= 4 consecutive A or T (5 is more permissive)
promoter_len   <- 600L     # was 200L; enlarged to give more PAM opportunities in the upstream/entry region
scan_gene_frac <- 1.00     # was 0.30; scan the full gene body (required for short genes)
gc_min         <- 0.25     # was 0.30; slightly relaxed (S. epidermidis GC is low; avoids losing everything to the GC filter)
gc_max         <- 0.80
homopolymer_n  <- 5L       # was 4L; slightly relaxed (avoids removing the only candidates)

# BLAST parameters: used for hit counting (not the final off_refined rule)
blast_task     <- "blastn-short"
blast_wordsize <- 7L
blast_evalue   <- 1000

# ---------- Helper functions ----------
stopif_file_missing <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
}

calc_gc <- function(seq) {
  s <- as.character(seq)
  if (nchar(s) == 0) return(NA_real_)
  g <- str_count(s, "G")
  c <- str_count(s, "C")
  (g + c) / nchar(s)
}

has_homopolymer_AT <- function(seq, n = 4L) {
  s <- as.character(seq)
  patA <- paste0("A{", n, ",}")
  patT <- paste0("T{", n, ",}")
  str_detect(s, patA) | str_detect(s, patT)
}

revcomp_str <- function(s) {
  as.character(reverseComplement(DNAString(s)))
}

# Scan the forward strand for NGG: returns a data frame (protospacer + PAM)
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  # dna: DNAString (region sequence, 5'->3' on the genome forward strand)
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # i: protospacer start offset (0-based; R string positions begin at i+1)
  # protospacer = s[i+1 .. i+20]
  # PAM         = s[i+21 .. i+23] must match NGG
  max_i <- L - (guide_len + 3L)
  if (max_i < 0) return(tibble())

  idx <- 0L
  for (i in 0:max_i) {
    pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
    if (nchar(pam_seq) != 3) next
    if (!str_detect(pam_seq, "^[ACGT]GG$")) next

    prot <- substr(s, i + 1L, i + guide_len)
    if (nchar(prot) != guide_len) next

    guide_start <- offset_genome_1based + i
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- guide_end + 1L
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = prot,        # reported as the protospacer (PAM not included)
      pam_seq     = pam_seq,
      guide_strand = "+",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# Scan the reverse strand: equivalent to scanning the forward strand for CCN (the PAM on the reverse strand is NGG)
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

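  # Find CCN on the forward strand: pam_fwd = CCN (positions j..j+2)
  # The reverse-strand PAM is NGG; the matching protospacer on the forward strand is the 20 nt immediately after pam_fwd (j+3..j+22)
  # The reported guide_seq is the reverse complement of those 20 nt (5'->3')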
  max_j <- L - (3L + guide_len)
  if (max_j < 0) return(tibble())

  idx <- 0L
  for (j in 0:max_j) {
    pam_fwd <- substr(s, j + 1L, j + 3L)
    if (nchar(pam_fwd) != 3) next
    if (!str_detect(pam_fwd, "^CC[ACGT]$")) next

    prot_fwd <- substr(s, j + 4L, j + 3L + guide_len)  # downstream 20nt
    if (nchar(prot_fwd) != guide_len) next

    guide_seq <- revcomp_str(prot_fwd)
    # Genome coordinates (1-based): prot_fwd covers string positions [j+4 .. j+23] (length 20)
    guide_start <- offset_genome_1based + j + 3L
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- offset_genome_1based + j
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = guide_seq,
      pam_seq     = revcomp_str(pam_fwd),   # reverse complement gives NGG
      guide_strand = "-",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

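# dist_to_tss: distance (bp) from the guide to the TSS, measured along the direction of transcription
# Conventions:
#   + strand: TSS ~ gene_start (start); promoter spans start-promoter_len .. start-1 (negative dist)
#   - strand: TSS ~ gene_end (end); promoter spans end+1 .. end+promoter_len (negative dist)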
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
  if (gene_strand == "+") {
    # Use the guide's 5' end (its leading end in the direction of transcription) as an approximation
    guide_5p <- guide_start
    as.integer(guide_5p - gene_start)
  } else {
    # - strand is transcribed from high to low coordinates, so the 5' end is at the larger coordinate
    guide_5p <- guide_end
    as.integer(gene_end - guide_5p)
  }
}

# ---------- Input checks ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)

message("Using genome FASTA: ", genome_fasta)
message("Using coord table:  ", coord_table)
message("Using targets:      ", targets_file)

targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))

# ---------- Read the genome ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
  message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)

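# ---------- Read the coordinate table and filter (key step: keep only targets + longest hit per gene) ----------
# The table has 9 columns (inferred from the grep output):
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
# 2_PreProcessing_CP052959.R writes a header row, so skip it; columns are then auto-named X1..X9.
coords_raw <- read_tsv(coord_table, col_names = FALSE, skip = 1, show_col_types = FALSE)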
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)

coords <- coords_raw %>%
  transmute(
    product   = as.character(X1),
    locus_tag = as.character(X2),
    seqid     = as.character(X3),
    start     = as.integer(X4),
    end       = as.integer(X5),
    strand    = as.character(X6),
    aln_len   = as.integer(X7),
    pident    = as.numeric(X8),
    bitscore  = as.numeric(X9)
  ) %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# Sanity check: every target must be present
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
  stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}

message("Filtered coord rows: ", nrow(coords))
print(coords)

# ---------- Generate candidate guides gene by gene ----------
all_guides <- list()
gid <- 0L

for (i in seq_len(nrow(coords))) {
  row <- coords[i, ]

  gene_start <- min(row$start, row$end)
  gene_end   <- max(row$start, row$end)
  gene_strand <- row$strand
  locus <- row$locus_tag
  prod  <- row$product

  gene_len <- gene_end - gene_start + 1L
  scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))

  # Promoter interval ("upstream" is defined relative to the gene's strand)
  if (gene_strand == "+") {
    tss <- gene_start
    prom_start <- max(1L, gene_start - promoter_len)
    prom_end   <- gene_start - 1L
    gene_scan_start <- gene_start
    gene_scan_end   <- min(gene_end, gene_start + scan_gene_len - 1L)
  } else {
    tss <- gene_end
    prom_start <- gene_end + 1L
    prom_end   <- min(genome_len, gene_end + promoter_len)
    gene_scan_end   <- gene_end
    gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
  }

  message("\n--- ", locus, " (", gene_strand, ") ---")
  message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
          ", TSS~", tss)
  message("Promoter region: ", prom_start, "..", prom_end)
  message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)

  # Extract promoter / gene-scan sequences (sliced in genome forward-strand coordinates)
  guides_gene <- tibble()
  guides_prom <- tibble()

  if (prom_start <= prom_end) {
    dna_prom <- subseq(genome, start = prom_start, end = prom_end)
    guides_prom <- bind_rows(
      scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
      scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
    )
  }

  if (gene_scan_start <= gene_scan_end) {
    dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
    guides_gene <- bind_rows(
      scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
      scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
    )
  }

  guides <- bind_rows(guides_prom, guides_gene)
  if (nrow(guides) == 0) {
    message("No NGG candidates found in promoter+gene-scan region for ", locus)
    next
  }

  # Annotate, then filter by GC + homopolymer
  guides <- guides %>%
    mutate(
      locus_tag = locus,
      product   = prod,
      gene_start = gene_start,
      gene_end   = gene_end,
      gene_strand = gene_strand,
      tss_pos   = tss,
      gc        = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
      has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
      dist_to_tss = mapply(
        FUN = calc_dist_to_tss,
        gene_start = gene_start,
        gene_end   = gene_end,
        gene_strand = gene_strand,
        guide_start = guide_start,
        guide_end   = guide_end
      )
    ) %>%
    filter(!is.na(gc)) %>%
    filter(gc >= gc_min, gc <= gc_max) %>%
    filter(!has_hpoly)

  if (nrow(guides) == 0) {
    message("All candidates filtered out by GC/homopolymer for ", locus)
    next
  }

  # Give each guide a unique ID
  guides <- guides %>%
    arrange(region, abs(dist_to_tss), desc(gc)) %>%
    mutate(
      guide_id = {
        # Stable and readable: CP052959|locus|region|start|strand
        paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
      }
    )

  all_guides[[length(all_guides) + 1L]] <- guides
  message("Kept guides after filters: ", nrow(guides))
}

guides_df <- bind_rows(all_guides)

if (nrow(guides_df) == 0) {
  stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}

# Write the pre-off-target guide table
write_tsv(guides_df %>%
            select(
              guide_id, locus_tag, product, region,
              guide_seq, pam_seq, guide_strand,
              guide_start, guide_end, pam_start, pam_end,
              gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
              gc, has_hpoly
            ),
          guides_preoff_tsv)

message("\nWrote guides_preoff: ", guides_preoff_tsv)

# ---------- Write BLASTguides.fasta ----------
# Note: the sequences written are guide_seq (20 nt)
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)

# ---------- Run BLAST ----------
# Check whether the BLAST DB exists; otherwise try to build it
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
  message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
  cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
  message("Running: ", cmd_mk)
  system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}

cmd_blast <- sprintf(
  "blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
  blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)

stopif_file_missing(blast_matches_file)

# ---------- 解析 BLAST 输出 ----------
blast_raw <- read_tsv(
  blast_matches_file,
  col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
  show_col_types = FALSE
)

if (nrow(blast_raw) == 0) {
  stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}

# Normalise sstart/send to min/max (simplifies later checks)
blast_hits <- blast_raw %>%
  mutate(
    smin = pmin(sstart, send),
    smax = pmax(sstart, send),
    hit_strand = ifelse(sstart <= send, "+", "-")
  )

# Hit counts (coarse: all hits; the off_refined rule is applied later by the ScanOfftarget script)
hit_counts <- blast_hits %>%
  count(qseqid, name = "n_hits") %>%
  rename(guide_id = qseqid)

# Join the guide annotations
blast_merged <- blast_hits %>%
  rename(guide_id = qseqid) %>%
  left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
  left_join(hit_counts, by = "guide_id") %>%
  arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))

write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)

message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

#!/usr/bin/env Rscript

# ScanOfftarget_CP052959.R
# Adapted to the outputs of the current version:
#   - guides_preoff: CP052959_guides_preoff.tsv (locus_tag, dist_to_tss, gc, guide_id...)
#   - blast_hits:    CP052959_blast_hits.tsv (guide_id, pident, length, mismatch, gapopen... + n_hits)
#
# Outputs:
#   scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}

suppressPackageStartupMessages({
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(readr)
  library(dplyr)
})

guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv    <- "CP052959_blast_hits.tsv"
guide_len         <- 20L
base_outdir       <- "scan_results"

message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)

message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)

if (!dir.exists(base_outdir)) dir.create(base_outdir)

# ---------- Column-name compatibility: gene_id / locus_tag; GC / gc ----------
# The guides table must have guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff is missing the guide_id column: ", guides_preoff_tsv)

# gene_id: prefer gene_id; fall back to locus_tag
if (!"gene_id" %in% names(guides)) {
  if (!"locus_tag" %in% names(guides)) stop("guides_preoff has neither gene_id nor locus_tag")
  guides <- guides %>% mutate(gene_id = locus_tag)
}

# GC: prefer GC; fall back to gc
if (!"GC" %in% names(guides)) {
  if (!"gc" %in% names(guides)) stop("guides_preoff has neither GC nor gc")
  guides <- guides %>% mutate(GC = gc)
}

# dist_to_tss must be present
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff is missing the dist_to_tss column")

# ---------- BLAST hits column check ----------
# Required: guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
  stop("blast_hits is missing required columns: ", paste(missing_cols, collapse = ", "),
       "\nPlease confirm that blast_hits_tsv was generated by GuideFinder_CP052959_base.R.")
}

# ---------- Filtering rules ----------
# Generic: filter by an upper limit on n_hits
# If this is too strict (still 0 guides), relax it slightly, e.g. mismatch 2 --> 3: filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
                            require_full_length = TRUE,
                            max_mismatch = 2,
                            require_no_gaps = TRUE,
                            min_pident = 0) {

  # 1) First select the BLAST hits that look like real off-target risks
  b <- blast

  # Coerce to numeric types (avoids comparisons returning NA on character columns)
  b <- b %>%
    mutate(
      pident  = as.numeric(pident),
      length  = as.integer(length),
      mismatch = as.integer(mismatch),
      gapopen = as.integer(gapopen)
    )

  if (require_full_length) {
    b <- b %>% filter(length == guide_len)
  }
  if (require_no_gaps) {
    b <- b %>% filter(gapopen == 0)
  }
  if (!is.null(max_mismatch)) {
    b <- b %>% filter(mismatch <= max_mismatch)
  }
  if (!is.null(min_pident) && min_pident > 0) {
    # Note: pident is normally a percentage on a 0-100 scale
    b <- b %>% filter(pident >= min_pident)
  }

  # 2) Recompute n_hits per guide from the "strict" hits only (more meaningful)
  valid_ids <- b %>%
    count(guide_id, name = "n_hits_strict") %>%
    filter(n_hits_strict <= max_hits) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Refined rule off_refined:
#  - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
#  - any other hits (if present): length < 15 or mismatch >= 3
filter_refined <- function() {
  blast_grouped <- blast %>%
    group_by(guide_id) %>%
    summarise(
      n_hits = n(),
      n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
      other_ok = all(
        ifelse(
          length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
          TRUE,
          (length < 15 | mismatch >= 3)
        )
      ),
      .groups = "drop"
    )

  valid_ids <- blast_grouped %>%
    filter(n_perfect == 1, other_ok) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Write results to a per-scenario subdirectory
run_scenario <- function(name, guides_selected) {
  outdir <- file.path(base_outdir, name)
  if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)

  all_path <- file.path(outdir, "guides_all.tsv")
  top_path <- file.path(outdir, "guides_top5.tsv")

  write_tsv(guides_selected, all_path)

  guides_top <- guides_selected %>%
    group_by(gene_id) %>%
    arrange(dist_to_tss, desc(GC)) %>%
    slice_head(n = 5) %>%
    ungroup()

  write_tsv(guides_top, top_path)

  message(sprintf("[%s] total guides: %d, genes covered: %d",
                  name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}

# ---------- Run several parameter sets ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))

message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))

message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))

message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))

message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())

message("All scenarios finished. Check 'scan_results/' directory.")
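To prototype or unit-test the off_refined rule outside R, the same logic can be written in a few lines of plain Python. This is a minimal sketch, not the script's implementation: the hit dictionaries, the helper names `is_perfect` and `passes_refined`, and the toy values are assumptions that mirror the BLAST columns used above.

```python
GUIDE_LEN = 20  # assumed guide length, matching guide_len in the R script

def is_perfect(hit, guide_len=GUIDE_LEN):
    """Perfect hit: full length, 100% identity, no mismatches, no gap openings."""
    return (hit["length"] == guide_len and hit["pident"] == 100.0
            and hit["mismatch"] == 0 and hit["gapopen"] == 0)

def passes_refined(hits, guide_len=GUIDE_LEN):
    """Exactly one perfect hit; every other hit is short (<15) or has >=3 mismatches."""
    n_perfect = sum(is_perfect(h, guide_len) for h in hits)
    others_ok = all(h["length"] < 15 or h["mismatch"] >= 3
                    for h in hits if not is_perfect(h, guide_len))
    return n_perfect == 1 and others_ok

# Toy hits (made-up values, for illustration only)
on_target = {"length": 20, "pident": 100.0, "mismatch": 0, "gapopen": 0}
weak = {"length": 13, "pident": 100.0, "mismatch": 0, "gapopen": 0}
near = {"length": 20, "pident": 95.0, "mismatch": 1, "gapopen": 0}

print(passes_refined([on_target, weak]))       # on-target plus a short partial hit
print(passes_refined([on_target, near]))       # a 19/20 match elsewhere
print(passes_refined([on_target, on_target]))  # two perfect hits
```

Note that a near-full-length hit with only one or two mismatches fails the rule as well, not only a second perfect hit.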

Final notes

  • If you move to another machine, the only strict requirements are BLAST+, the R packages, Python 3 with Biopython (for the CDS extraction step), and the same input FASTA/GBK.
  • If a target gene yields no candidates again, check:

    • scan_gene_frac (set to 1.0 for short genes)
    • promoter length (increase if needed)
    • GC/homopolymer thresholds (slightly relax)

If the gRNA backbone is known (e.g., BsaI/BsmBI Golden Gate format), an appendix section could be added that auto-generates the exact top/bottom oligos for ordering from the selected guides.

Guide RNA Design with GuideFinder for the Toxin–Antitoxin Unit in Staphylococcus epidermidis HD46 (CP052959.1)

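Part A|Porting the isolate “125” workflow to CP052959.1, step by step

0️⃣ Suggested directory layout (the simplest option)

Create a working directory, for example:

HD46_GuideFinder/
  CP052959.gbk
  CP052959.fna
  extract_cds_from_gbk.py
  PreProcessing_CP052959.R
  GuideFinder_CP052959_base.R
  ScanOfftarget_CP052959.R
  (optional) *.Rmd

Copy the original PreProcessing_125.R / GuideFinder_125_base.R / ScanOfftarget_125.R, rename the copies, and replace every “125” filename prefix inside them with the new prefix (“CP052959” in the listing above; “HD46” works equally well as long as it is used consistently).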


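1️⃣ Extract CDS from GenBank → generate CP052959_CDS.fasta (Python + Biopython)

Input: CP052959.gbk
Output: CP052959_CDS.fasta (one sequence per CDS, with the locus_tag as header)

Run (example):

python extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

The output should contain headers such as:

  • >HJI06_09100 ...
  • >HJI06_09105 ...

Purpose of this step: each CDS is later located on the genome with BLAST to produce a uniform coordinate table. This matters most for draft genomes or multi-contig assemblies; a complete genome can be run the same way, which keeps the workflow uniform.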


2️⃣ Pre-processing: BLAST the CDS back to the genome → generate the gene coordinate table

2.1 Build the BLAST database (GenomeDB)

Build the database from the genome FASTA (input: CP052959.fna, the locally standardized name of CP052959.1.fna):

makeblastdb -in CP052959.fna -dbtype nucl -out GenomeDB_HD46

2.2 BLAST: CDS vs genome

(The original notes use -task blastn, which is kept here; for very short CDS, blastn-short is an alternative.)

blastn -task blastn \
  -query CP052959_CDS.fasta \
  -db GenomeDB_HD46 \
  -outfmt "6 qseqid sseqid sstart send sstrand length pident bitscore" \
  -max_target_seqs 1 \
  -perc_identity 95 \
  -out CDS_vs_genome_HD46.blast

2.3 Generate the coordinate table

Run the adapted pre-processing script (the counterpart of PreProcessing_125.R in the original notes):

Rscript PreProcessing_CP052959.R
# or
Rscript -e "rmarkdown::render('PreProcessing_CP052959.Rmd')"

Output: CP052959_gene_coordinates.tsv

For each gene (locus_tag), the table typically contains:

  • contig/seqid (here it should be CP052959.1)
  • start/end
  • strand
  • (if the script computes them) promoter region, approximate TSS position, gene length, etc.
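To make the shape of that table concrete, here is a minimal Python sketch that turns BLAST tabular output (the eight `-outfmt 6` columns requested in step 2.2) into coordinate rows. It is an illustration, not the pre-processing script itself: the function name `blast_to_coordinates`, the output keys, and the `five_prime` field are assumptions, and the identity and bitscore values in the example rows are placeholders.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "sstart", "send", "sstrand", "length", "pident", "bitscore"]

def blast_to_coordinates(handle):
    """Keep the best-bitscore hit per CDS and normalize coordinates so start <= end."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        score = float(row["bitscore"])
        if row["qseqid"] in best and score <= best[row["qseqid"]]["bitscore"]:
            continue
        sstart, send = int(row["sstart"]), int(row["send"])
        strand = "-" if row["sstrand"] == "minus" else "+"
        start, end = min(sstart, send), max(sstart, send)
        best[row["qseqid"]] = {
            "gene_id": row["qseqid"],
            "seqid": row["sseqid"],
            "start": start,
            "end": end,
            "strand": strand,
            # 5' end of the gene: the high coordinate on the minus strand
            "five_prime": end if strand == "-" else start,
            "bitscore": score,
        }
    return list(best.values())

# Example rows for the two TA genes (pident/bitscore are placeholders)
EXAMPLE = (
    "HJI06_09100\tCP052959.1\t1889421\t1889026\tminus\t396\t100.000\t732\n"
    "HJI06_09105\tCP052959.1\t1889651\t1889421\tminus\t231\t100.000\t427\n"
)

for rec in blast_to_coordinates(io.StringIO(EXAMPLE)):
    print(rec["gene_id"], rec["start"], rec["end"], rec["strand"], rec["five_prime"])
```

For minus-strand hits BLAST reports sstart > send, which is why the sketch sorts the two values before writing start/end.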

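3️⃣ Guide design: run GuideFinder (or your simplified GuideFinder-like script)

The original notes split this into two scripts:

  • GuideFinder_125_base.R: PAM scanning, GC/homopolymer filtering, dist_to_tss calculation, and writing the guides FASTA to be BLASTed
  • ScanOfftarget_125.R: off-target tiering based on the BLAST hits (off_n3/off_n5/off_refined)

I recommend keeping the same split; it is clearer and easier to debug.

3.1 (Optional but strongly recommended) Restrict the run to the two TA genes: build a targets list

Create TA_targets.txt:

HJI06_09100
HJI06_09105

Then add a filter to the GuideFinder script that keeps only these two genes (if the “125” run covered the whole genome, this speeds things up considerably).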

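3.2 Run the base design script

Rscript GuideFinder_CP052959_base.R

The script is expected to:

  1. Scan for NGG PAMs and generate 20 bp (or 20/21/22 bp) candidates
  2. Apply the GC / homopolymer / bad-seed / restriction-site filters
  3. Compute dist_to_tss correctly (the notes state this was fixed in this version)
  4. Write BLASTguides.fasta and run BLAST (or at least prepare the BLAST input)
  5. Produce a “preoff” guides table (e.g., CP052959_guides_preoff.tsv)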
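The first two items in that list (PAM scan plus basic sequence filters) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the GuideFinder code: the function name `find_guides`, the default thresholds (`gc_min=0.35`, at most 4 identical bases in a row), and the returned tuple layout are all hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_guides(seq, guide_len=20, gc_min=0.35, max_run=4):
    """Return (strand, 0-based start on the plus strand, guide) for every
    protospacer followed by an NGG PAM, on both strands, after GC and
    homopolymer filtering."""
    seq = seq.upper()
    # Lookahead so that overlapping PAM sites are all reported
    pam_site = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    # A base followed by max_run or more copies of itself = run longer than max_run
    long_run = re.compile(r"(.)\1{%d,}" % max_run)
    guides = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for m in pam_site.finditer(s):
            guide = m.group(1)
            gc = (guide.count("G") + guide.count("C")) / guide_len
            if gc < gc_min or long_run.search(guide):
                continue
            # Convert minus-strand offsets back to plus-strand coordinates
            start = m.start() if strand == "+" else len(seq) - m.start() - guide_len
            guides.append((strand, start, guide))
    return guides

print(find_guides("ATGCATGCATGCATGCATGC" + "AGG"))
```

Bad-seed and restriction-site filters would be additional checks on `guide` at the same place as the GC test.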

3.3 Off-target scan and tiered output

Rscript ScanOfftarget_CP052959.R

The results are written to a directory structure like this (following the convention in the notes):

scan_results/
  off_n3/
    guides_all.tsv
    guides_top5.tsv
  off_n5/
    guides_all.tsv
    guides_top5.tsv
  off_refined/
    guides_all.tsv
    guides_top5.tsv

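Part B|How to read the off_refined selection logic

This part of the notes is key; here is a plainer summary that keeps the original intent:

1) What does n_hits count in the current script?

  • n_hits is currently the number of BLAST hits for a guide, with no distinction between promoter / CDS / intergenic hits
  • The script does not check whether a hit site has an adjacent NGG PAM
  • The count is therefore conservative (matches that lack a PAM are still counted as potential risks)

2) What is the core of the off_refined strategy?

As the notes put it:

  • off_refined already removes, at the BLAST level, the truly dangerous case: a second perfect 20/20 hit
  • The remaining guides usually have one on-target hit plus a set of partial off-target hits (many mismatches, incomplete seed, etc.), which is closer to the ideal of acting only at the target site

3) Counting only PAM-adjacent hits as off-targets is the right idea

In terms of dCas9/Cas9 biology:

  • Without a suitable PAM (NGG here), a match does not pose a real off-target risk in most cases
  • A later refinement could add one more layer: for each BLAST hit, fetch the neighbouring genome sequence, check for an NGG, and count only PAM-adjacent hits toward n_hits

For the present purpose (practical CRISPRi design), however, off_refined is already a very usable default tier.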
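The PAM-aware refinement mentioned in point 3 could look roughly like the Python sketch below. It is not part of the current scripts. It assumes full-length hits (so that the alignment ends at the guide's 3' end), 1-based BLAST subject coordinates, and a linear sequence string; the function name `hit_has_pam` is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def hit_has_pam(genome, sstart, send):
    """True if an NGG PAM sits directly 3' of a full-length BLAST hit.

    sstart/send are 1-based subject coordinates as reported by BLAST;
    sstart > send means the guide matches the minus strand.
    """
    if sstart <= send:
        # Plus-strand hit: PAM is the 3 bases right after `send`
        pam = genome[send:send + 3]
    else:
        # Minus-strand hit: PAM lies just left of `send` on the plus strand
        if send < 4:
            return False
        pam = revcomp(genome[send - 4:send - 1])
    return len(pam) == 3 and pam[1:] == "GG"

# Toy sequence: 4 nt flank + 20 nt protospacer + TGG PAM + 4 nt flank
proto = "ACGTACGTACGTACGTACGT"
genome = "TTTT" + proto + "TGG" + "AAAA"

print(hit_has_pam(genome, 5, 24))            # plus-strand hit next to TGG
print(hit_has_pam(revcomp(genome), 27, 8))   # same site seen on the minus strand
print(hit_has_pam(genome, 5, 23))            # shifted by one base: no NGG
```

Hits that return False would simply be excluded before n_hits is recounted.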


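Part C|Email draft for Xxxxxx (ready to copy and paste)

The request was for an email to Xxxxxx stating that the approach works in principle and that a probe-test was run; best match Enterobacter intestinihominis…; strategy off_refined; an explanation of the results; the algorithm details; attachments guides_all.tsv / guides_top5.tsv. The email below is in English (it reads more like a work email). Replace the bracketed placeholders with your actual numbers/file names.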

Hi Xxxxxx,

In principle the guide design pipeline is working end-to-end now. I ran a small probe-test on Patricia’s previously assembled complete isolate “125” genome input.

Quick taxonomy sanity check (PATRIC/closest match): Best match: Enterobacter intestinihominis (taxid = 3133180; species; lineage: Bacteria → Pseudomonadati/Pseudomonadota → Gammaproteobacteria → Enterobacterales → Enterobacteriaceae).

What the pipeline produced (strategy: off_refined):

  • It completes the sequence-level filters (GC + homopolymers) and computes dist_to_tss correctly.
  • It then runs a BLAST-based off-target scan and generates the final guide tables.
  • Attached are:

    • guides_all.tsv (all guides passing filters)
    • guides_top5.tsv (up to 5 best guides per gene)

Summary of the results (off_refined):

  • Total genes evaluated: [N_genes]
  • Genes with ≥1 guide: [N_with_guides] ([pct]% coverage)
  • Genes without guides: [N_without_guides]
  • Total guides retained in guides_all.tsv: [N_guides_all]
  • Median guides per gene (genes with guides): [median_guides_per_gene] (These numbers are from the attached tables.)

Algorithm details (off_refined, in plain terms):

  1. Candidate generation: scan for NGG PAM sites and create guide candidates (20 bp; optionally 21/22 bp depending on settings).
  2. Primary filters: remove candidates failing GC threshold and basic sequence-quality rules (e.g., long homopolymer runs; optional “bad-seed” motifs / restriction sites).
  3. TSS proximity scoring: compute dist_to_tss and prioritize guides closer to the inferred transcription start region for CRISPRi.
  4. Off-target screening via BLAST: align guide (or seed region, depending on mode) against the whole genome.
  5. off_refined rule: keep a guide only if it has exactly one perfect full-length match in the genome (the intended on-target), i.e. discard any guide with a second 20/20 hit elsewhere, and require every other BLAST hit to be short (alignment length < 15) or to carry at least 3 mismatches. The remaining guides therefore have 1 on-target hit plus only weak/partial off-target hits.

Note: in the current implementation, n_hits is counted genome-wide from BLAST hits (no promoter/CDS distinction and no explicit PAM check at hit sites), which is conservative. If needed, we can further refine by counting only BLAST hits that also have an appropriate NGG PAM in the correct context.

Best, Yyyy


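If this workflow is next applied to the actual target, the HD46 (CP052959.1) TA unit, run Part A to completion and then take the rows for HJI06_09100 and HJI06_09105 from scan_results/off_refined/guides_top5.tsv for a second round of selection: which single guide, or which pair of guides, is best suited to knock down the TA unit as a whole.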

GuideFinder-Based gRNA Design for CRISPRi Knockdown of a Toxin–Antitoxin Operon in S. epidermidis HD46 (CP052959.1)

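The locus_tags of the two genes are:

  • Toxin: HJI06_09100, coordinates complement(1889026..1889421)
  • Antitoxin: HJI06_09105, coordinates complement(1889421..1889651)

One important correction: both genes are annotated as complement, i.e. they lie on the minus strand. The direction of “operon upstream / promoter at the front” is therefore:

  • For a minus-strand gene, upstream (the 5’ end) lies on the side with the larger coordinates
  • In genome coordinates, the TA unit is transcribed in the direction 1889651 (larger) → 1889026 (smaller)

To knock down toxin + antitoxin together as one unit, the most effective strategy is usually to target the region near the 5’ end of the antitoxin (the side close to 1889651) and the promoter region upstream of it; this gives the best chance of repressing the whole co-transcribed mRNA.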
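The strand logic is easy to get wrong, so here is a small Python sketch that computes the promoter and gene-body scan windows for a gene on either strand. The function names, the 100 bp promoter length, and the reading of scan_gene_frac as "fraction of the gene counted from its 5' end" are assumptions for illustration; coordinates are 1-based and inclusive, and genome ends/circularity are ignored.

```python
import math

def promoter_window(start, end, strand, promoter_len):
    """(lo, hi) of the window immediately upstream of the gene's 5' end."""
    if strand == "+":
        return (max(1, start - promoter_len), start - 1)
    # Minus strand: upstream means larger coordinates
    return (end + 1, end + promoter_len)

def gene_scan_window(start, end, strand, frac):
    """(lo, hi) of the leading fraction of the gene body, measured from the 5' end."""
    n = math.ceil((end - start + 1) * frac)
    if strand == "+":
        return (start, start + n - 1)
    return (end - n + 1, end)

antitoxin = (1889421, 1889651, "-")  # HJI06_09105
toxin = (1889026, 1889421, "-")      # HJI06_09100

print(promoter_window(*antitoxin, promoter_len=100))
print(gene_scan_window(*antitoxin, frac=1.0))
print(promoter_window(*toxin, promoter_len=100))
```

With these inputs the "promoter" window of the toxin falls entirely inside the antitoxin CDS, which is another way of seeing why the antitoxin-side entry region is the natural target for the whole unit.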


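1) Paired guides, explained

Paired guides = two guides at different positions chosen for the same target (the same gene or transcription unit) and expressed together, so that dCas9 blocks transcription at two points, which strengthens or stabilizes the repression.

  • It does not mean targeting two genes separately
  • The more suitable use here is:

    • one guide positioned near the promoter/TSS
    • a second guide positioned in the front part of the operon (here the antitoxin 5’ end or the region right next to it)
  • GuideFinder outputs usable guide pairs according to the minimum spacing you set (commonly 100 bp)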
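The minimum-spacing idea can be illustrated with a short Python sketch. It is not GuideFinder's pairing code: the function name `pair_guides`, the dictionary keys, the ranking by summed dist_to_tss, and the toy positions are all assumptions.

```python
from itertools import combinations

def pair_guides(guides, min_spacing=100):
    """Return guide pairs at least min_spacing bp apart, closest-to-TSS pairs first."""
    pairs = [(a, b) for a, b in combinations(guides, 2)
             if abs(a["pos"] - b["pos"]) >= min_spacing]
    return sorted(pairs, key=lambda p: p[0]["dist_to_tss"] + p[1]["dist_to_tss"])

# Toy guides (made-up positions relative to the TSS)
guides = [
    {"id": "g1", "pos": 10, "dist_to_tss": 10},
    {"id": "g2", "pos": 60, "dist_to_tss": 60},
    {"id": "g3", "pos": 140, "dist_to_tss": 140},
]

for a, b in pair_guides(guides, min_spacing=100):
    print(a["id"], b["id"])
```

For a short unit such as this TA locus, a 100 bp spacing leaves few valid pairs, so it can be worth checking how many pairs survive before committing to a paired design.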

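2) Designing guides for the two HD46 (CP052959.1) locus_tags with GuideFinder: practical steps

The steps below follow the most common workflow for a complete genome (there is no need to compute NGG sites by hand; GuideFinder does that).

Step A: Prepare the files

Download these two files from NCBI into the same directory (recommended):

  1. Genome sequence FASTA: CP052959.1.fna
  2. Genome annotation GenBank: CP052959.1.gbff (or .gbk)

The FASTA provides the sequence; the GenBank file provides the gene/CDS coordinates and locus_tags. GuideFinder relies on this information to locate the target genes and extract their promoter regions.

Step B: Install the dependencies

You need:

  • R / RStudio
  • BLAST+ (command-line blastn): GuideFinder uses it for off-target screening
  • The GuideFinder R Markdown/R scripts (GitHub: ohlab/Guide-Finder)

Step C: Prepare the target gene list

Create a text file (e.g., targets.txt) with just two lines:

HJI06_09100
HJI06_09105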

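Step D: Set the key parameters in GuideFinder (recommended combination for S. epidermidis)

Because the S. epidermidis genome has a low GC content, a “strict first, then iteratively relax” strategy is recommended (a distinctive GuideFinder feature):

First round (main parameters)

  • PAM: NGG (default, SpCas9)
  • GC minimum: 35% (or slightly lower)
  • TSS/promoter proximity: bias the selection toward guides closer to the TSS
  • Off-target screening: enabled (strongly recommended; do not turn it off)

Iterative rounds (automatic relaxation, applied only to genes with no guide after the first round)

  • GC minimum: lowered to 30%
  • Maximum TSS distance: relaxed
  • If necessary, relax some sequence filters further (homopolymers, etc.), but avoid loosening the off-target criteria too much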
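The "strict first, then relax only where needed" logic can be sketched in Python as below. This is an illustration of the idea, not GuideFinder's implementation: the function name, the parameter dictionaries, and the `max_dist` values are hypothetical (only the 35% and 30% GC minima come from the text above).

```python
def design_with_relaxation(candidates_by_gene, rounds):
    """Apply filter rounds in order; a gene moves on to the next (looser)
    round only if no guide passed the earlier rounds."""
    selected = {}
    for params in rounds:
        for gene, cands in candidates_by_gene.items():
            if gene in selected:
                continue  # already covered by a stricter round
            kept = [c for c in cands
                    if c["gc"] >= params["gc_min"]
                    and c["dist_to_tss"] <= params["max_dist"]]
            if kept:
                selected[gene] = {"round": params["name"], "guides": kept}
    return selected

ROUNDS = [
    {"name": "strict", "gc_min": 0.35, "max_dist": 100},   # max_dist is an assumed value
    {"name": "relaxed", "gc_min": 0.30, "max_dist": 300},  # max_dist is an assumed value
]

# Toy candidates (made-up GC fractions and distances)
candidates = {
    "geneA": [{"gc": 0.40, "dist_to_tss": 50}],
    "geneB": [{"gc": 0.32, "dist_to_tss": 150}],
    "geneC": [{"gc": 0.20, "dist_to_tss": 10}],
}

result = design_with_relaxation(candidates, ROUNDS)
for gene in candidates:
    print(gene, result.get(gene, {}).get("round", "no guide"))
```

Recording which round produced each guide makes it easy to see later which genes only passed under relaxed settings.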

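Step E: Run and inspect the output

GuideFinder usually produces two kinds of output tables (names may vary slightly between versions):

  1. Top hits (recommended single guides)
  2. Paired guides (recommended guide pairs)

Since the goal is to repress the TA unit as a whole, prioritize in the output:

  • Guides on the side near 1889651 (the 5’ end/upstream on the minus strand)

    • Focus on the guides listed for HJI06_09105 (antitoxin) that lie furthest toward the front, closest to its 5’ end/putative promoter region
    • HJI06_09100 (toxin) will also yield guides, but for whole-unit knockdown the position closest to the operon “entry” is usually more effective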

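3) Choosing the guide most likely to repress both genes at once for this TA unit

Because the two genes are adjacent and co-transcribed, the recommended options are:

Option 1: single guide (simplest, most often successful)

  • Pick the one guide closest to the operon promoter/TSS
  • Validate experimentally first that both the toxin and the antitoxin mRNA levels drop (measure both by RT-qPCR)

Option 2: paired guides (a more robust, enhanced version)

  • Pick two guides that both lie at the front of the operon (generally centred on the antitoxin 5’ end/upstream region)
  • The two guides must satisfy the minimum spacing you set (e.g., 100 bp)
  • Express both guides from the same vector (or as a tandem array)

4) A key reminder: on the minus strand, “upstream” is at the larger-coordinate end

The original email stated “(+)”, but the GenBank annotation clearly says complement. So when reading coordinates in the GuideFinder output, remember:

  • The closer to 1889651 (larger coordinate), the closer to the transcription start/upstream entry
  • Such guides usually repress the whole TA transcription unit more readily

Once GuideFinder has run, the top candidates for the two locus_tags (with coordinates, strand, sequence, and off-target status) are enough for a quick second-round selection: which guide is best as the single guide, and which pair is best as the enhanced paired-guide version.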

GEO HTS Submission

The steps are basically: (1) prepare the GEO HTS metadata spreadsheet, (2) upload files to GEO FTP, (3) submit the metadata file via the GEO web form, and (4) respond to any curator questions (esp. raw FASTQ).

1) Make a single dataset folder name for GEO (recommended)

GEO wants one folder per submission/dataset. Create one top-level folder name that will be uploaded as-is:

cd ~/Downloads
mv GEO_submissions GeoMx_DSP_M666_UKE_Hamburg

Inside it should remain:

  • dcc/
  • pkc/
  • annotation/
  • README.txt
  • counts_matrix.tsv

2) Prepare the GEO HTS metadata spreadsheet (REQUIRED)

On the GEO “Submit HTS” page:

  • Download the HTS metadata spreadsheet template.
  • Fill at least these tabs:

STUDY tab

  • Title, summary, overall design.
  • Add the “raw FASTQ missing” sentence if you still don’t have FASTQs:

    Raw sequencing FASTQ files are currently not available to the authors; processed DCC outputs and PKC probe metadata are provided. FASTQs will be deposited to SRA and linked once obtained.

SAMPLES tab (most important)

One row per ROI/AOI (so ~45 rows for your DCCs).

  • Use Sample_ID (from your Sheet8/clean TSV) as the sample/library name.
  • For each row, set processed data file to the matching DCC filename (e.g., DSP-...-A02.dcc).
  • Include key sample characteristics (Group/Location/PCR/Case/etc.) from your annotation.
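Before filling the SAMPLES tab by hand, it can help to check that every Sample_ID has a matching DCC file and vice versa. The Python sketch below does that under one loudly stated assumption: that each DCC file is named exactly `<Sample_ID>.dcc`. The function name and the toy IDs are hypothetical; adjust the naming rule to whatever the annotation sheet actually uses.

```python
def match_samples_to_dcc(sample_ids, dcc_filenames):
    """Map each Sample_ID to <Sample_ID>.dcc; report IDs without a file
    and files without an ID."""
    files = set(dcc_filenames)
    mapping, missing = {}, []
    for sid in sample_ids:
        name = sid + ".dcc"  # assumed naming rule
        if name in files:
            mapping[sid] = name
        else:
            missing.append(sid)
    orphans = sorted(files - set(mapping.values()))
    return mapping, missing, orphans

# Toy inputs (hypothetical IDs, for illustration only)
sample_ids = ["ROI_A02", "ROI_A03", "ROI_A04"]
dcc_files = ["ROI_A02.dcc", "ROI_A03.dcc", "ROI_B01.dcc"]

mapping, missing, orphans = match_samples_to_dcc(sample_ids, dcc_files)
print(mapping)
print("no DCC for:", missing)
print("DCC without sample row:", orphans)
```

Both lists should be empty before the spreadsheet is submitted.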

PROTOCOLS tab

  • Sample prep + GeoMx DSP workflow description.
  • Mention panels: Hs_R_NGS_WTA_v1.0, GeoMx_Hs_CTA_v1.0, GeoMx_COVID19_v1.0.
  • State: DCC exported from NanoString GeoMx NGS pipeline; PKC required to interpret RTS_IDs.

FILES / supplementary (if the template has it)

List:

  • all DCCs
  • all PKCs
  • the annotation file(s) (at least the clean TSV or the Excel)
  • README.txt
  • counts_matrix.tsv (optional, but you have it already)

Tip: In the metadata, reference counts_matrix.tsv as “gene-level raw counts matrix (targets x ROIs)” and keep DCCs as the authoritative per-ROI outputs.

3) FTP upload to GEO (data files only)

  1. Connect to GEO’s FTP using the credentials/path GEO gives you (your personal upload directory).
  2. In that upload directory, create a dataset folder, e.g.:

    • GeoMx_DSP_M666_UKE_Hamburg/
  3. Upload only the data files by FTP:

    • the whole folder with subfolders: dcc/, pkc/, annotation/, plus README.txt, counts_matrix.tsv
  4. Do NOT upload the metadata spreadsheet by FTP (GEO explicitly says upload metadata via the web form).

4) Submit the metadata spreadsheet on the GEO web page

After the FTP transfer finishes:

  • Go to Submit a new high-throughput sequencing submission
  • Upload the metadata spreadsheet file for this dataset
  • Submit → this places it in the GEO processing queue.

5) If GEO asks for raw FASTQs

This is the most common friction point for GeoMx when FASTQs aren’t available.

  • If they allow: they’ll accept processed now and ask you to link SRA later.
  • If they require: you’ll need to obtain FASTQs from the provider or email GEO and explain you only received DCC/PKC and are requesting FASTQs.

TODO: upload/paste the blank GEO HTS metadata spreadsheet template, or a screenshot of the SAMPLES tab columns, to work out which columns should be filled with which fields from annotation/M666_Sheet8_annotation_clean.tsv.

Comprehensive summary: GEO vs NCBI Submission Portal (SRA/BioProject/BioSample) for bulk RNA-seq + GeoMx DSP

https://submit.ncbi.nlm.nih.gov/subs/

https://www.ncbi.nlm.nih.gov/geo/subs/

1) What each platform is best for

  • NCBI Submission Portal: used primarily to submit raw sequencing reads to the Sequence Read Archive (SRA) and to create/link the organizing records:

    • BioProject (the overall study / project container)
    • BioSample (the per-sample metadata records required for SRA)
  • NCBI GEO (Gene Expression Omnibus): best for gene expression–style datasets and processed outputs, including:

    • count matrices / processed tables
    • sample/ROI metadata tables
    • GeoMx DSP outputs (DCC/PKC)
    • supplementary analysis outputs and documentation (README, workflow description)

    GEO is commonly used for “processed + metadata”, while raw FASTQs (if available) go to SRA.

2) Where to submit your two dataset types

  • Bulk RNA-seq (Illumina) with raw FASTQ files

    • Submit to SRA via the NCBI Submission Portal.
    • Ensure you have/define: BioProject + BioSamples, then upload FASTQs and metadata.
  • GeoMx DSP spatial transcriptomics with only DCC/PKC + annotation Excel (+ R workflow), no FASTQs

    • Submit to GEO (not SRA), because DCC/PKC are processed outputs, not raw reads.
    • In GEO, upload:

      • DCC files (processed counts + QC/metrics per ROI/AOI)
      • PKC files (panel/probe definitions)
      • annotation table (Excel is okay; CSV/TSV is even better for machine readability)
      • README describing file structure + column meanings + mapping between ROI IDs and metadata
      • analysis workflow (e.g., R scripts; optionally hosted on GitHub/Zenodo and linked)
  • If GeoMx FASTQs are obtained later

    • Submit GeoMx FASTQs to SRA, and in the GEO record state: raw reads in SRA, processed data in GEO, cross-linking accessions.

3) What to choose in the NCBI “Start a new submission” list

For your use case:

  • Choose Sequence Read Archive (SRA) ✅ for FASTQ (raw reads).
  • You will also use/associate BioProject and BioSample (often created during the SRA workflow).
  • Do not use GenBank/TSA/Genome for these transcriptomics read submissions (those are for assembled sequences/genomes, not raw RNA-seq reads).
  • GEO is not started from that list; it has its own submission entry point.

4) Accounts / login clarification

  • Same login principle: In general, you use the same My NCBI account across NCBI services.
  • Why GEO can feel like “another account”:

    • GEO requires a GEO Submitter Profile (contact identity/profile) attached to a My NCBI account.
    • In many labs, a PI or colleague already has a GEO profile under their account—so the lab might say “GEO is on another account.”
  • Your options:

    • Create/use your own My NCBI + GEO Submitter Profile and submit under your name.
    • Or submit using the lab/PI account that already has the GEO profile (for consistent lab identity in GEO).

5) Practical “Data availability” logic for a manuscript

A journal-friendly setup is:

  • Raw sequencing reads (bulk RNA-seq FASTQs; later GeoMx FASTQs if available) → SRA accession(s)
  • Processed expression outputs and metadata (GeoMx DCC/PKC + annotation + processed tables) → GEO accession(s)
  • Workflow/code → GitHub + DOI archive (e.g., Zenodo), linked from GEO and/or the paper.

6) ENA note (from the email context)

  • If a manuscript currently lists an ENA project accession that can’t be found (e.g., looks like a placeholder), you either:

    • confirm the correct existing ENA project accession, or
    • create a new submission/accession (ENA or NCBI—both are widely accepted, but your processed GeoMx outputs still fit best in GEO).

7) Key takeaway (one-liner)

  • FASTQ = SRA (via NCBI Submission Portal + BioProject/BioSample)
  • GeoMx DCC/PKC + annotation + processed outputs = GEO
  • One My NCBI login, but GEO needs a GEO Submitter Profile and uses a separate submission entry point.

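German Residence Act (AufenthG) §9 vs §18: translation and comparison

§ 9 (settlement permit / Niederlassungserlaubnis): translation

§9 Settlement permit (Niederlassungserlaubnis)

(1) The settlement permit is an unlimited residence title. Ancillary conditions may be attached only in the cases expressly permitted by this Act. §47 remains unaffected. (sozialgesetzbuch-sgb.de)

(2) A foreigner shall be granted a settlement permit if: (sozialgesetzbuch-sgb.de)

  1. they have held a residence permit (Aufenthaltserlaubnis) for five years;
  2. their livelihood is secure;
  3. they have paid compulsory or voluntary contributions to the statutory pension insurance for at least 60 months, or can show expenditure toward a comparable entitlement from an insurance or pension institution or an insurance company; periods of career interruption for childcare or home care are credited accordingly;
  4. no reasons of public safety or order stand against it, taking into account the severity or nature of the breach of public safety or order or the danger posed by the foreigner, and having regard to the length of their stay so far and their ties in Germany;
  5. if they are an employee, their employment is permitted;
  6. they hold any other permits required for the permanent exercise of their occupation;
  7. they have a sufficient command of German;
  8. they have basic knowledge of the legal and social order and the living conditions in the Federal Republic of Germany; and
  9. they have sufficient living space for themselves and the family members living in their household.

(Further provisions of the same subsection) (sozialgesetzbuch-sgb.de)

  • Conditions 7 and 8 are deemed proven if an integration course (Integrationskurs) has been completed successfully.
  • Conditions 7 and 8 need not be met if the foreigner cannot meet them because of a physical, mental or psychological illness or disability.
  • Conditions 7 and 8 may also be waived to avoid particular hardship (Härte).
  • Conditions 7 and 8 may further be waived if the foreigner can communicate orally in German in a simple manner and either had no entitlement to attend an integration course under §44 Abs.3 Nr.2 or was not obliged to attend one under §44a Abs.2 Nr.3.
  • In addition, conditions 2 and 3 need not be met if the foreigner cannot meet them for the illness/disability reasons mentioned above.

(3) For spouses living in marital cohabitation, it suffices if the conditions in subsection (2) sentence 1 nos. 3, 5 and 6 are met by one spouse. The condition in subsection (2) sentence 1 no. 3 (pension contributions/comparable provision) is waived if the foreigner is in education leading to a recognised school or vocational qualification or a university degree. Sentence 1 applies accordingly in the cases of §26 Abs.4. (sozialgesetzbuch-sgb.de)

(3a) The spouse of a foreigner who holds a settlement permit under §18c (settlement permit for skilled workers) shall be granted a settlement permit if: (sozialgesetzbuch-sgb.de)

  1. they live in marital cohabitation with that foreigner;
  2. they have held a residence permit for three years;
  3. they work at least 20 hours per week; and
  4. they meet the conditions in subsection (2) sentence 1 no. 2 and nos. 4 to 9. Subsection (2) sentences 2 to 6 apply accordingly; the granting of a settlement permit under the conditions of subsection (3) remains unaffected.

(4) The following periods count toward the period of holding a residence permit that is required for a settlement permit: (sozialgesetzbuch-sgb.de)

  1. the period of previously holding a residence permit or settlement permit, if the foreigner held a settlement permit at the time of departure, minus the periods spent outside Germany that led to the settlement permit lapsing; at most four years are counted;
  2. at most six months for each stay outside Germany that did not cause the residence permit to lapse;
  3. half of the period of lawful residence for the purpose of study or vocational training.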

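§ 18 (principles of skilled-worker immigration; general provisions): translation

§18 Principles of skilled-worker immigration; general provisions

(1) The admission of foreign employees is guided by the needs of Germany as a location for business and science, taking the labour-market situation into account. The special opportunities for foreign skilled workers and labour serve to secure the base of skilled workers and labour and to strengthen the social security systems. The provisions aim at the sustainable integration of skilled workers, and of workers with substantial professional experience, into the labour market and society, with due regard to public security interests. (sozialgesetzbuch-sgb.de)

(2) Granting a residence permit for employment under this section requires that: (sozialgesetzbuch-sgb.de)

  1. there is a specific job offer;
  2. the Federal Employment Agency (Bundesagentur für Arbeit) has given its approval under §39; this does not apply where a statute, an intergovernmental agreement or the Employment Ordinance (Beschäftigungsverordnung) provides that employment is permitted without the agency's approval; even then, the residence permit may be refused if one of the situations in §40 Abs.2 or Abs.3 applies;
  3. a licence to practise the profession (Berufsausübungserlaubnis), where required, has been granted or assured;
  4. the equivalence (Gleichwertigkeit) of the qualification has been established, or a recognised foreign university degree or a foreign university degree comparable to a German one exists, insofar as this is a condition for granting the residence permit;
  4a. the foreigner and the employer jointly declare that the employment will actually be carried out; and
  5. where a permit under §18a or §18b is granted for the first time and the foreigner applies after reaching the age of 45, the salary amounts to at least 55% of the annual contribution assessment ceiling (Beitragsbemessungsgrenze) of the statutory pension insurance, unless adequate old-age provision is demonstrated.

Further in the same subsection: where there is a public interest in employing the foreigner (in particular a regional economic or labour-market policy interest), exceptions to the above conditions may be made in individual cases, especially where the salary falls only slightly below the threshold or the age limit is only slightly exceeded. The Ministry of the Interior publishes the minimum salary for each year in the Federal Gazette by 31 December of the preceding year at the latest. (sozialgesetzbuch-sgb.de)

(3) For the purposes of this Act, a skilled worker (Fachkraft) is a foreigner who: (sozialgesetzbuch-sgb.de)

  1. holds a qualified vocational training completed in Germany, or a foreign vocational qualification equivalent to it (skilled worker with vocational training); or
  2. holds a German university degree, a recognised foreign university degree, or a foreign university degree comparable to a German one (skilled worker with academic training).

(4) Residence permits under §§18a, 18b, 18g and 19c are granted for four years; if the employment contract or the agency's approval covers a shorter period, the permit is granted for that shorter period plus 3 months, but not for more than four years in total. (sozialgesetzbuch-sgb.de)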


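Core differences between §9 and §18 (comparison)

  1. Different nature

    • §9: defines what the settlement permit / permanent residence (an unlimited residence title) is and sets out the general conditions for obtaining it. (sozialgesetzbuch-sgb.de)
    • §18: the general part/framework of the system of residence permits for employment (principles of skilled-worker immigration, general conditions, definition of a skilled worker, rules on permit duration). (sozialgesetzbuch-sgb.de)
  2. Different duration

    • §9: unlimited (§9(1)).
    • §18: the residence permits under §§18a, 18b, 18g and 19c are time-limited and granted for at most four years (§18(4)).
  3. Different emphasis of the conditions

    • §9: stresses stable integration and the capacity for long-term residence: 5 years of residence, secure livelihood, 60 months of pension contributions, language, integration knowledge, housing, public safety, etc. (sozialgesetzbuch-sgb.de)
    • §18: stresses access to employment: a specific job offer, approval by the employment agency (or a statutory exemption), any required licence to practise, recognition of the degree/qualification, the declaration that the employment is genuine, and the salary/pension-provision threshold for applicants over 45. (sozialgesetzbuch-sgb.de)
  4. How the two relate

    • The typical route is to obtain an employment-based residence permit within the §18 system first (e.g., §18a/§18b/§18g) and then, once the conditions are met, apply under §9 (or, for some, go directly for the skilled-worker settlement permit under §18c). The direct reference to §18c in §9(3a) reflects this link. (sozialgesetzbuch-sgb.de)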

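§18a Skilled workers with vocational training (Fachkräfte mit Berufsausbildung): translation

A skilled worker with vocational training shall be granted a residence permit (Aufenthaltserlaubnis) to take up any qualified employment (qualifizierte Beschäftigung). (Gesetze im Internet)


§18b Skilled workers with academic training (Fachkräfte mit akademischer Ausbildung): translation

A skilled worker with academic training shall be granted a residence permit (Aufenthaltserlaubnis) to take up any qualified employment (qualifizierte Beschäftigung). (sozialgesetzbuch-sgb.de)


§18c Settlement permit for skilled workers (Niederlassungserlaubnis für Fachkräfte): translation

(1) A skilled worker shall be granted a settlement permit (Niederlassungserlaubnis), without the approval of the Federal Employment Agency (BA), if:

  1. they have held a residence title under §18a / §18b / §18d or §18g for 3 years;
  2. they have a job that they are permitted to hold under the conditions of §18a/§18b/§18d/§18g;
  3. they have paid compulsory or voluntary contributions to the statutory pension insurance for at least 36 months (or can show expenditure toward comparable old-age provision);
  4. they have a sufficient command of German;
  5. they also meet the conditions of §9 Abs.2 Satz1 Nr.2 and Nr.4–6, 8, 9 (with several of the exceptions in §9 applying). In addition: if the skilled worker completed vocational training or a degree in Germany, the 3 years in no. 1 are shortened to 2 years and the 36 months of pension contributions in no. 3 to 24 months. (sozialgesetzbuch-sgb.de)

(2) A Blue Card holder (§18g) shall be granted a settlement permit if they have been employed under §18g for 27 months and have paid pension contributions, meet the corresponding conditions of §9, and have basic/simple German; with sufficient German, the period is shortened to 21 months. (sozialgesetzbuch-sgb.de)

(3) A highly qualified skilled worker with academic training can (and as a rule should), in special cases, be granted a settlement permit without BA approval if it can reasonably be expected that they will integrate into life in Germany and secure their livelihood without state assistance, and if §9 Abs.2 Satz1 Nr.4 is met (no reasons of public safety or order stand against it). The federal states may additionally provide that such a grant requires the approval of the supreme state authority (or a body it designates). Examples of “highly qualified” include researchers with special expertise, and teaching staff or senior researchers in prominent positions. (sozialgesetzbuch-sgb.de)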


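§18d Research (Forschung): translation

(1) A foreigner shall be granted a residence permit for the purpose of research under Directive (EU) 2016/801, without BA approval, if:

  1. they have: a) concluded an effective hosting agreement (Aufnahmevereinbarung) or an equivalent contract, for carrying out a research project, with a research institution recognised in Germany for the special admission procedure for researchers; or b) concluded an effective hosting agreement or an equivalent contract with a research institution that carries out research; and
  2. the research institution has undertaken in writing to bear the costs incurred by public bodies up to 6 months after the end of the hosting agreement, in particular for: a) the foreigner's living expenses during an unlawful stay in an EU member state; and b) the foreigner's removal/deportation. In the case of (1) no. 1 a), the residence permit is to be granted within 60 days of the application. (sozialgesetzbuch-sgb.de)

(2) If the research institution's activity is funded mainly from public money, the cost undertaking in (1) no. 2 should as a rule be waived; it may also be waived if there is a particular public interest in the research project. The subsection also specifies which provisions apply to the undertaking. (sozialgesetzbuch-sgb.de)

(3) The research institution may also give a general undertaking, to the authority responsible for its recognition, covering all foreigners who conclude a hosting agreement with it and are granted a research residence permit. (sozialgesetzbuch-sgb.de)

(4) The research residence permit is generally granted for at least 1 year; for participation in EU or multilateral programmes with mobility measures, for at least 2 years; if the research project is shorter, the permit is limited to the project duration, but in the “at least 2 years” case it still runs for at least 1 year. (sozialgesetzbuch-sgb.de)

(5) A residence permit granted under this section entitles the holder to carry out research at the research institution named in the hosting agreement and to teach; a change in the research project during the stay does not in itself cause the permit to lapse. (sozialgesetzbuch-sgb.de)

(6) A person who has been granted international protection in an EU member state may be granted a residence permit for research if they meet the conditions in (1) and have resided in that member state for at least 2 years after being granted protection; (5) applies accordingly. (sozialgesetzbuch-sgb.de)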


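§18g EU Blue Card (Blaue Karte EU): translation

(1) A skilled worker with academic training shall be granted an EU Blue Card, without BA approval, for employment in Germany that matches their qualification, provided that the salary is at least 50% of the annual contribution assessment ceiling of the statutory pension insurance and that no ground for refusal under §19f applies. However, for the following two groups:

  1. persons in certain occupations (occupations listed in specified ISCO-08 groups); or
  2. persons who obtained their university degree no more than 3 years before applying for the Blue Card; the Blue Card is instead granted with BA approval, and the salary threshold is lowered to 45.3% of the annual contribution assessment ceiling. Further: if the applicant already holds a §18b residence permit and the licence to practise required for the Blue Card job is the same as for §18b, §18 Abs.2 Nr.3 is deemed met; if they already submitted the same degree with the §18b application, §18 Abs.2 Nr.4 is deemed met. The corresponding rules also apply to graduates of tertiary education programmes of at least three years that are equivalent to a university degree. (sozialgesetzbuch-sgb.de)

(2) For applicants who do not meet (1), a Blue Card may be granted with BA approval in certain occupational groups (specific ISCO-08 groups), with special treatment of the degree requirement under certain conditions (including: a salary of at least 45.3%; no ground for refusal under §19f; and proof of at least 3 years of relevant professional experience acquired within the last 7 years, at a level of competence comparable to a university degree and necessary for the job). (sozialgesetzbuch-sgb.de)

(3) Granting a Blue Card requires that the specific job offer provides for an employment period of at least 6 months. (sozialgesetzbuch-sgb.de)

(4) Change of employer/job by a Blue Card holder: permission from the foreigners authority is generally not needed; during the first 12 months of employment, however, the foreigners authority may suspend the job change for up to 30 days and refuse it within that period (if the conditions for granting the Blue Card are no longer met). (sozialgesetzbuch-sgb.de)

(5) In certain cases, livelihood is deemed secured when a Blue Card is granted: if the foreigner holds a residence permit under §18a or §18b and does not change jobs. (sozialgesetzbuch-sgb.de)

(6) Special salary threshold for Blue Card extensions: if the applicant obtained their degree no more than 3 years before applying for the extension, or fewer than 24 months have passed since a Blue Card was first granted at the lower threshold (the 45.3% case in (1)), the lower threshold applies to the extension; otherwise the general extension rules apply. (sozialgesetzbuch-sgb.de)

(7) The Ministry of the Interior publishes in the Federal Gazette, by 31 December of each year, the minimum salaries required under (1) and (2) for the following year. (sozialgesetzbuch-sgb.de)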
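The percentage thresholds quoted above (50% and 45.3% in §18g, 55% in §18 Abs.2 Nr.5) are simple fractions of the annual contribution assessment ceiling. The Python sketch below shows the arithmetic only; the function name is hypothetical, and the ceiling used in the example is a round placeholder, not the official figure, which must be taken from the yearly publication in the Federal Gazette.

```python
def salary_thresholds(annual_ceiling):
    """Minimum gross annual salaries as fractions of the contribution assessment ceiling."""
    return {
        "blue_card_standard_50": round(annual_ceiling * 0.50, 2),
        "blue_card_reduced_45_3": round(annual_ceiling * 0.453, 2),
        "first_permit_after_45_55": round(annual_ceiling * 0.55, 2),
    }

# Placeholder ceiling for illustration only (NOT the official value)
example = salary_thresholds(100000)
for name, value in example.items():
    print(name, value)
```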


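Key differences among the five provisions (comparison)

  • §18a vs §18b (entry routes to a work-based residence permit)

    • Both are residence permits (Aufenthaltserlaubnis) for qualified employment;
    • The main difference is whether you are a skilled worker with vocational training or with a university degree. (Gesetze im Internet)
  • §18g (Blue Card)

    • Still a residence permit by type, but on the EU Blue Card route; the core is academic training + salary threshold (50%, or 45.3% in specific cases) plus the rules on changing jobs. (sozialgesetzbuch-sgb.de)
  • §18d (research)

    • Also a residence permit, limited to the purpose of research; the core conditions are the hosting agreement/contract + the research institution's cost undertaking (with the related waivers and duration rules). (sozialgesetzbuch-sgb.de)
  • §18c (settlement / permanent residence)

    • This is the settlement permit (Niederlassungserlaubnis, unlimited) route: it generally requires holding §18a/18b/18d/18g status for some time first and meeting the pension, language and §9-related conditions; Blue Card holders have an accelerated route of 27/21 months. (sozialgesetzbuch-sgb.de)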

FASTQ / raw sequencing datasets overview (T. and F.)

1) Per-dataset sample inventory (compact lists)

1. Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606

  • X101SC24105589-Z01-J001: AUM-1..3, MHB-1..3, Urine-1..3 (all PE)
  • X101SC25062155-Z01-J002: AUM-1..3, AUM-AZI-1..3, MH-1..3, MH-AZI-1..3, Urine-1..3, Urine-AZI-1..3 (all PE)
  • Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606_pca2

2. Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606

  • LB: LB-AB-1..3, LB-IJ-(1,2,4), LB-W1-1..3, LB-WT19606-2..4, LB-Y1-2..4
  • Mac: Mac-AB-1..3, Mac-IJ-(1,2,4), Mac-W1-1..3, Mac-WT19606-2..4, Mac-Y1-2..4
  • Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606_pca

3. Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606

Each with reps -1..-3 (all PE):

  • 0_5ΔIJ-17, 0_5ΔIJ-24
  • preWT-17, preWT-24
  • preΔIJ-17, preΔIJ-24
  • WT0_5-17, WT0_5-24
  • WT-17, WT-24
  • ΔIJ-17, ΔIJ-24
  • Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606_PCA_condition_time_complete

4. Data_Tam_DNAseq_2023_lab_strains

  • A6WT – Acinetobacter baumannii ATCC19606
  • A10CraA – Acinetobacter baumannii ATCC19606
  • A12AYE – Acinetobacter baumannii AYE
  • A1917978 – Acinetobacter baumannii ATCC17978

5. Data_Tam_DNAseq_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200

  • AYE-Q, AYE-S, AYE-WTonTig4, AYE-craAonTig4, AYE-craA-1onCm200, AYE-craA-2onCm200, clinical (all PE)
  • brig_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200

6. Data_Tam_DNAseq_2025_E.hormaechei-adeABadeIJ_adeIJK_CM1_CM2_on_ATCC19606

  • adeABadeIJ, adeIJK, CM1, CM2, HF (all PE)
  • brig_2025_adeABadeIJ_adeIJK_CM1_CM2_on_ATCC19606

7. Data_Tam_DNAseq_2025_ATCC19606-Y1Y2Y3Y4W1W2W3W4

  • Illumina PE: △adeIJ, Tig1, Tig2, W, W2, W3, W4, Y, Y2, Y3, Y4
  • Nanopore (*_fastq_pass.tar):

    • W1 (3 tar files), W2 (1), W3 (2), W4 (1)
    • Y1 (3), Y2 (1), Y3 (1), Y4 (1)

8. Data_Tam_DNAseq_2026_19606deltaIJfluE

All PE; grouped by background:

  • 19606△ABfluE: cef-1, cipro-2, dori-2, nitro-3, pip-1, polyB-3, tet-1
  • 19606△IJfluE: cef-4, cipro-3, dori-1, nitro-3, pip-4, polyB-4
  • 19606wtfluE: cef-1, cipro-2, dori-1, nitro-1, pip-4, polyB-4, tet-2

9. Data_Tam_DNAseq_2026_Acinetobacter_harbinensis

  • An6 (PE)

10. Data_Tam_Metagenomics_2026

  • A1, A1a, A2, B1, B2 (PE)

11. Data_Foong_RNAseq_2021_ATCC19606_Cm (mapping list provided)

  • Batch1: WT_1, WT_2B, C_1B, C_2, J_1, J_2
  • Batch2: Control, WT_1B, WT_2B, WT_3B, Cra_1, Cra_2, Cra_3, IJ_1B, IJ_2B, IJ_3
  • Batch3: adIJ_1, adIJ_2, crA2, crA_ab_1, crA_ab_2, crA_ab_3, adAB_1, adAB_2, adAB_ab1, adAB_ab2, adAB_ab3
  • Data_Foong_RNAseq_2021_ATCC19606_Cm_pca_after_batch_correction_400dpi

12. Data_Foong_DNAseq_2025_AYE_Dark_vs_Light

  • Dark, Light (PE)

2) Dataset-level summary (quick lookup)

| Dataset folder | Year | Data type | Platform / format | Run / project IDs present | Samples (n) | Files (n) | Sample groups / notes |
|---|---|---|---|---|---|---|---|
| Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606/ | 2024 | RNA-seq | Illumina PE (*_1.fq.gz, *_2.fq.gz) | X101SC24105589-Z01-J001, X101SC25062155-Z01-J002 | 27 | 54 | J001: AUM/MHB/Urine (each 1–3). J002: AUM, AUM-AZI, MH, MH-AZI, Urine, Urine-AZI (each 1–3). |
| Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606/ | 2025 | RNA-seq | Illumina PE | X101SC25015922-Z02-J002 | 30 | 60 | LB vs Mac sets; conditions AB, IJ, W1, Y1, WT19606 with listed replicates (mostly 1–3 or 2–4; IJ uses 1,2,4). |
| Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606/ | 2025 | RNA-seq | Illumina PE | X101SC25062155-Z01-J001 | 36 | 72 | 12 condition blocks × 3 reps: preWT, preΔIJ, WT, ΔIJ, WT0_5, 0_5ΔIJ at timepoints 17 and 24. |
| Data_Tam_DNAseq_2025_ATCC19606-Y1Y2Y3Y4W1W2W3W4/ | 2025 | DNA-seq | Illumina PE + Nanopore (*_fastq_pass.tar) | Illumina: X101SC24065637-Z01-J001/J002; Nanopore: X101SC25080408-Z01-J001 | 11 (Illumina) + 13 tar archives | 22 + 13 | Illumina: △adeIJ, Tig1, Tig2, W, W2–W4, Y, Y2–Y4. Nanopore: W1(3), W2(1), W3(2), W4(1), Y1(3), Y2(1), Y3(1), Y4(1) tar files. |
| Data_Tam_DNAseq_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200/ | 2025 | DNA-seq | Illumina PE | X101SC25015922-Z01-J001 | 7 | 14 | AYE variants: AYE-Q, AYE-S, AYE-WTonTig4, AYE-craAonTig4, AYE-craA-1onCm200, AYE-craA-2onCm200, plus clinical. |
| Data_Tam_DNAseq_2025_E.hormaechei-adeABadeIJ_adeIJK_CM1_CM2 | 2025 | DNA-seq | Illumina PE | X101SC24115801-Z01-J001 | 5 | 10 | adeABadeIJ, adeIJK, CM1, CM2, HF. |
| Data_Tam_DNAseq_2026_19606deltaIJfluE/ | 2026 | DNA-seq | Illumina PE | X101SC25116512-Z01-J003 | 20 | 40 | Three backgrounds: 19606△ABfluE* (7), 19606△IJfluE* (6), 19606wtfluE* (7) across drug tags (cef/cipro/dori/nitro/pip/polyB/tet) with replicate suffixes. |
| Data_Tam_DNAseq_2026_Acinetobacter_harbinensis/ | 2026 | DNA-seq | Illumina PE | X101SC25116512-Z01-J002 | 1 | 2 | An6 (paired-end). |
| Data_Tam_Metagenomics_2026/ | 2026 | Metagenomics | Illumina PE | X101SC25123808-Z01-J001 | 5 | 10 | A1, A1a, A2, B1, B2. |
| Data_Foong_RNAseq_2021_ATCC19606_Cm/ | 2021 | RNA-seq | Illumina PE (symlink/mapping list shown) | (paths point to raw_data_batch1/2/3) | 27 | 54 | Batch1: WT/craA/adeIJ (each 2 reps). Batch2: Control + WT.abx + craA.abx + adeIJ.abx (various reps). Batch3: adeIJ, craA, craA.abx, adeAB, adeAB.abx (various reps). |
| Data_Foong_DNAseq_2025_AYE_Dark_vs_Light/ | 2025 | DNA-seq | Illumina PE | X101SC25116512-Z01-J001 | 2 | 4 | Dark, Light. |
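The sample and file counts in this summary can be cross-checked against a plain file listing (for example the output of `find . -name "*.fq.gz"`). The Python sketch below is one way to do that; the function name is hypothetical and the regular expression assumes the `./<run ID>/01.RawData/<sample>/<sample>_1.fq.gz` layout used by the Illumina paired-end folders in the complete list.

```python
import re
from collections import defaultdict

# Assumed layout: ./<run ID>/01.RawData/<sample>/<file>_1.fq.gz or _2.fq.gz
PATH_RE = re.compile(r"\./([^/]+)/01\.RawData/([^/]+)/.+_[12]\.fq\.gz$")

def count_samples(paths):
    """Group paired-end FASTQ files by (run ID, sample) and return
    (number of samples, number of files)."""
    mates = defaultdict(set)
    for p in paths:
        p = p.strip()
        m = PATH_RE.search(p)
        if m:
            mates[(m.group(1), m.group(2))].add(p)
    return len(mates), sum(len(files) for files in mates.values())

listing = [
    "./X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz",
    "./X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz",
    "./X101SC25062155-Z01-J002/01.RawData/AUM-1/AUM-1_1.fq.gz",
    "./X101SC25062155-Z01-J002/01.RawData/AUM-1/AUM-1_2.fq.gz",
]

print(count_samples(listing))
```

Keying on the run ID as well as the sample name keeps same-named samples from different runs (such as AUM-1 in J001 and J002) apart.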

3) Complete list


Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606/

    ./X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz
    ./X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-1/AUM-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-1/AUM-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-2/AUM-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-2/AUM-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-3/AUM-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-3/AUM-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-1/AUM-AZI-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-1/AUM-AZI-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-2/AUM-AZI-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-2/AUM-AZI-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-3/AUM-AZI-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/AUM-AZI-3/AUM-AZI-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-1/MH-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-1/MH-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-2/MH-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-2/MH-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-3/MH-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-3/MH-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-1/MH-AZI-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-1/MH-AZI-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-2/MH-AZI-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-2/MH-AZI-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-3/MH-AZI-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/MH-AZI-3/MH-AZI-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-1/Urine-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-1/Urine-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-2/Urine-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-2/Urine-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-3/Urine-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-3/Urine-3_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-1/Urine-AZI-1_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-1/Urine-AZI-1_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-2/Urine-AZI-2_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-2/Urine-AZI-2_2.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-3/Urine-AZI-3_1.fq.gz
    ./X101SC25062155-Z01-J002/01.RawData/Urine-AZI-3/Urine-AZI-3_2.fq.gz

Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606/

    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-1/LB-AB-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-2/LB-AB-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-AB-3/LB-AB-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-1/LB-IJ-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-2/LB-IJ-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-IJ-4/LB-IJ-4_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-1/LB-W1-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-2/LB-W1-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-W1-3/LB-W1-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-2/LB-WT19606-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-3/LB-WT19606-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-WT19606-4/LB-WT19606-4_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-2/LB-Y1-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-3/LB-Y1-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/LB-Y1-4/LB-Y1-4_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-1/Mac-AB-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-2/Mac-AB-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-AB-3/Mac-AB-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-1/Mac-IJ-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-2/Mac-IJ-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-IJ-4/Mac-IJ-4_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-1/Mac-W1-1_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-2/Mac-W1-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-W1-3/Mac-W1-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-2/Mac-WT19606-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-3/Mac-WT19606-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-WT19606-4/Mac-WT19606-4_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-2/Mac-Y1-2_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-3/Mac-Y1-3_2.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_1.fq.gz
    ./X101SC25015922-Z02-J002/01.RawData/Mac-Y1-4/Mac-Y1-4_2.fq.gz

Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606/

    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_2.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_1.fq.gz
    ./X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_2.fq.gz

Data_Tam_DNAseq_2025_ATCC19606-Y1Y2Y3Y4W1W2W3W4/

    Illumina short-read sequencing:

        ./X101SC24065637-Z01-J001/01.RawData/△adeIJ/△adeIJ_1.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/△adeIJ/△adeIJ_2.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Tig1/Tig1_1.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Tig1/Tig1_2.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Tig2/Tig2_1.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Tig2/Tig2_2.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/W/W_1.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/W/W_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W2/W2_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W2/W2_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W3/W3_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W3/W3_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W4/W4_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/W4/W4_2.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Y/Y_1.fq.gz
        ./X101SC24065637-Z01-J001/01.RawData/Y/Y_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y2/Y2_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y2/Y2_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y3/Y3_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y3/Y3_2.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y4/Y4_1.fq.gz
        ./X101SC24065637-Z01-J002/01.RawData/Y4/Y4_2.fq.gz

    Nanopore long-read sequencing:

        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W1/0710_2F_PBG50143_74807b09/W1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W1/0629_2H_PBG55359_f19e323f/W1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W1/0631_2C_PBG05153_55abe88b/W1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W2/0620_2C_PBG17000_6bfd0048/W2_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W3/0710_2F_PBG50143_74807b09/W3_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W3/0629_2H_PBG55359_f19e323f/W3_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/W4/0620_2C_PBG17000_6bfd0048/W4_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y1/0655_3B_PBE70655_6bbd09a4/Y1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y1/0620_2C_PBG17000_6bfd0048/Y1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y1/0631_2C_PBG05153_55abe88b/Y1_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y2/0620_2C_PBG17000_6bfd0048/Y2_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y3/0620_2C_PBG17000_6bfd0048/Y3_fastq_pass.tar
        ./X101SC25080408-Z01-J001/Release-X101SC25080408-Z01-J001-20251009/Data-X101SC25080408-Z01-J001/Y4/0620_2C_PBG17000_6bfd0048/Y4_fastq_pass.tar

Data_Tam_DNAseq_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200/

    ./X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/clinical/clinical_1.fq.gz
    ./X101SC25015922-Z01-J001/01.RawData/clinical/clinical_2.fq.gz

Data_Tam_DNAseq_2025_E.hormaechei-adeABadeIJ_adeIJK_CM1_CM2/

    ./X101SC24115801-Z01-J001/01.RawData/adeABadeIJ/adeABadeIJ_1.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/adeABadeIJ/adeABadeIJ_2.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/adeIJK/adeIJK_1.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/adeIJK/adeIJK_2.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/CM1/CM1_1.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/CM1/CM1_2.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/CM2/CM2_1.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/CM2/CM2_2.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/HF/HF_1.fq.gz
    ./X101SC24115801-Z01-J001/01.RawData/HF/HF_2.fq.gz

Data_Tam_DNAseq_2026_19606deltaIJfluE/

    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_2.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_1.fq.gz
    ./X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_2.fq.gz

Data_Tam_DNAseq_2026_Acinetobacter_harbinensis/

    ./X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz
    ./X101SC25116512-Z01-J002/01.RawData/An6/An6_2.fq.gz

Data_Tam_Metagenomics_2026/

    ./X101SC25123808-Z01-J001/01.RawData/A1/A1_1.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/A1/A1_2.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/A1a/A1a_1.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/A1a/A1a_2.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/A2/A2_1.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/A2/A2_2.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/B1/B1_1.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/B1/B1_2.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/B2/B2_1.fq.gz
    ./X101SC25123808-Z01-J001/01.RawData/B2/B2_2.fq.gz

Data_Foong_RNAseq_2021_ATCC19606_Cm/

    wt_r1_R1.fq.gz -> ../raw_data_batch1/WT_1_1.fq.gz
    wt_r1_R2.fq.gz -> ../raw_data_batch1/WT_1_2.fq.gz
    wt_r2_R1.fq.gz -> ../raw_data_batch1/WT_2B_1.fq.gz
    wt_r2_R2.fq.gz -> ../raw_data_batch1/WT_2B_2.fq.gz
    craA_r1_R1.fq.gz -> ../raw_data_batch1/C_1B_1.fq.gz
    craA_r1_R2.fq.gz -> ../raw_data_batch1/C_1B_2.fq.gz
    craA_r2_R1.fq.gz -> ../raw_data_batch1/C_2_1.fq.gz
    craA_r2_R2.fq.gz -> ../raw_data_batch1/C_2_2.fq.gz
    adeIJ_r1_R1.fq.gz -> ../raw_data_batch1/J_1_1.fq.gz
    adeIJ_r1_R2.fq.gz -> ../raw_data_batch1/J_1_2.fq.gz
    adeIJ_r2_R1.fq.gz -> ../raw_data_batch1/J_2_1.fq.gz
    adeIJ_r2_R2.fq.gz -> ../raw_data_batch1/J_2_2.fq.gz
    wt_r3_R1.fq.gz -> ../raw_data_batch2/Control_1.fq.gz
    wt_r3_R2.fq.gz -> ../raw_data_batch2/Control_2.fq.gz
    wt.abx_r1_R1.fq.gz -> ../raw_data_batch2/WT_1B_1.fq.gz
    wt.abx_r1_R2.fq.gz -> ../raw_data_batch2/WT_1B_2.fq.gz
    wt.abx_r2_R1.fq.gz -> ../raw_data_batch2/WT_2B_1.fq.gz
    wt.abx_r2_R2.fq.gz -> ../raw_data_batch2/WT_2B_2.fq.gz
    wt.abx_r3_R1.fq.gz -> ../raw_data_batch2/WT_3B_1.fq.gz
    wt.abx_r3_R2.fq.gz -> ../raw_data_batch2/WT_3B_2.fq.gz
    craA.abx_r1_R1.fq.gz -> ../raw_data_batch2/Cra_1_1.fq.gz
    craA.abx_r1_R2.fq.gz -> ../raw_data_batch2/Cra_1_2.fq.gz
    craA.abx_r2_R1.fq.gz -> ../raw_data_batch2/Cra_2_1.fq.gz
    craA.abx_r2_R2.fq.gz -> ../raw_data_batch2/Cra_2_2.fq.gz
    craA.abx_r3_R1.fq.gz -> ../raw_data_batch2/Cra_3_1.fq.gz
    craA.abx_r3_R2.fq.gz -> ../raw_data_batch2/Cra_3_2.fq.gz
    adeIJ.abx_r1_R1.fq.gz -> ../raw_data_batch2/IJ_1B_1.fq.gz
    adeIJ.abx_r1_R2.fq.gz -> ../raw_data_batch2/IJ_1B_2.fq.gz
    adeIJ.abx_r2_R1.fq.gz -> ../raw_data_batch2/IJ_2B_1.fq.gz
    adeIJ.abx_r2_R2.fq.gz -> ../raw_data_batch2/IJ_2B_2.fq.gz
    adeIJ.abx_r3_R1.fq.gz -> ../raw_data_batch2/IJ_3_1.fq.gz
    adeIJ.abx_r3_R2.fq.gz -> ../raw_data_batch2/IJ_3_2.fq.gz
    adeIJ_r3_R1.fq.gz -> ../raw_data_batch3/adIJ_1_1.fq.gz
    adeIJ_r3_R2.fq.gz -> ../raw_data_batch3/adIJ_1_2.fq.gz
    adeIJ_r4_R1.fq.gz -> ../raw_data_batch3/adIJ_2_1.fq.gz
    adeIJ_r4_R2.fq.gz -> ../raw_data_batch3/adIJ_2_2.fq.gz
    craA_r3_R1.fq.gz -> ../raw_data_batch3/crA2_1.fq.gz
    craA_r3_R2.fq.gz -> ../raw_data_batch3/crA2_2.fq.gz
    craA.abx_r4_R1.fq.gz -> ../raw_data_batch3/crA_ab_1_1.fq.gz
    craA.abx_r4_R2.fq.gz -> ../raw_data_batch3/crA_ab_1_2.fq.gz
    craA.abx_r5_R1.fq.gz -> ../raw_data_batch3/crA_ab_2_1.fq.gz
    craA.abx_r5_R2.fq.gz -> ../raw_data_batch3/crA_ab_2_2.fq.gz
    craA.abx_r6_R1.fq.gz -> ../raw_data_batch3/crA_ab_3_1.fq.gz
    craA.abx_r6_R2.fq.gz -> ../raw_data_batch3/crA_ab_3_2.fq.gz
    adeAB_r1_R1.fq.gz -> ../raw_data_batch3/adAB_1_1.fq.gz
    adeAB_r1_R2.fq.gz -> ../raw_data_batch3/adAB_1_2.fq.gz
    adeAB_r2_R1.fq.gz -> ../raw_data_batch3/adAB_2_1.fq.gz
    adeAB_r2_R2.fq.gz -> ../raw_data_batch3/adAB_2_2.fq.gz
    adeAB.abx_r1_R1.fq.gz -> ../raw_data_batch3/adAB_ab1_1.fq.gz
    adeAB.abx_r1_R2.fq.gz -> ../raw_data_batch3/adAB_ab1_2.fq.gz
    adeAB.abx_r2_R1.fq.gz -> ../raw_data_batch3/adAB_ab2_1.fq.gz
    adeAB.abx_r2_R2.fq.gz -> ../raw_data_batch3/adAB_ab2_2.fq.gz
    adeAB.abx_r3_R1.fq.gz -> ../raw_data_batch3/adAB_ab3_1.fq.gz
    adeAB.abx_r3_R2.fq.gz -> ../raw_data_batch3/adAB_ab3_2.fq.gz
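
The `name -> target` lines above describe symlinks that give the raw files standardized sample names. As a minimal sketch, assuming the mapping is kept as plain text in exactly this format, the links could be planned and recreated as follows. The helper names `parse_link_map` and `make_links` are hypothetical and not part of the original workflow; `dry_run=True` only reports the plan and touches nothing on disk.

```python
import os
from pathlib import Path

def parse_link_map(text):
    """Return (link_name, target) pairs from lines shaped like 'a -> b'."""
    pairs = []
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank lines and headings
        link, target = (part.strip() for part in line.split("->", 1))
        pairs.append((link, target))
    return pairs

def make_links(pairs, link_dir, dry_run=True):
    """Plan each symlink inside link_dir; create it only if dry_run is False."""
    link_dir = Path(link_dir)
    planned = []
    for link, target in pairs:
        dest = link_dir / link
        planned.append((str(dest), target))
        if not dry_run and not dest.is_symlink():
            os.symlink(target, dest)  # target is stored as written (relative)
    return planned

# Two lines copied from the listing above, used as illustrative input.
listing = """
    wt_r1_R1.fq.gz -> ../raw_data_batch1/WT_1_1.fq.gz
    wt_r1_R2.fq.gz -> ../raw_data_batch1/WT_1_2.fq.gz
"""
pairs = parse_link_map(listing)
for dest, target in make_links(pairs, "."):
    print(dest, "->", target)
```

Because the targets are relative paths, the links only resolve when `link_dir` sits next to the `raw_data_batch*` directories, as in the listing.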

Data_Foong_DNAseq_2025_AYE_Dark_vs_Light/

    ./X101SC25116512-Z01-J001/01.RawData/Dark/Dark_1.fq.gz
    ./X101SC25116512-Z01-J001/01.RawData/Dark/Dark_2.fq.gz
    ./X101SC25116512-Z01-J001/01.RawData/Light/Light_1.fq.gz
    ./X101SC25116512-Z01-J001/01.RawData/Light/Light_2.fq.gz
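
An inventory like this is easy to sanity-check for incomplete read pairs. The sketch below assumes the `<sample>_1.fq.gz` / `<sample>_2.fq.gz` naming used throughout the listing; `unpaired_mates` is a hypothetical helper. In the example input, the `Light_2` file is left out on purpose to show what a reported gap looks like; the listing itself contains both mates.

```python
import re
from collections import defaultdict

def unpaired_mates(paths):
    """Return path prefixes that lack either the _1 or the _2 mate."""
    mates = defaultdict(set)
    for p in paths:
        m = re.search(r"^(.*)_([12])\.fq\.gz$", p.strip())
        if m:
            mates[m.group(1)].add(m.group(2))
    return sorted(prefix for prefix, seen in mates.items() if seen != {"1", "2"})

# Illustrative input: Light_2 is deliberately omitted.
example = [
    "./X101SC25116512-Z01-J001/01.RawData/Dark/Dark_1.fq.gz",
    "./X101SC25116512-Z01-J001/01.RawData/Dark/Dark_2.fq.gz",
    "./X101SC25116512-Z01-J001/01.RawData/Light/Light_1.fq.gz",
]
print(unpaired_mates(example))
```

Lines that do not end in `_1.fq.gz` or `_2.fq.gz` (directory headings, the Nanopore `.tar` archives) are ignored by the pattern.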