Arthrobacter phenanthrenivorans vs. Acinetobacter towneri

Arthrobacter phenanthrenivoransAcinetobacter towneri 中文详解


🔬 1. Arthrobacter phenanthrenivorans(嗜菲节杆菌 / 菲降解节杆菌)

📋 基本信息

项目 内容
中文名称 嗜菲节杆菌(或:菲降解节杆菌)
学名 Arthrobacter phenanthrenivorans
命名含义 phenanthrenivorans = “菲(phenanthrene)” + “吞噬/降解(vorans)”
分类地位 放线菌门 → 放线菌纲 → 微球菌目 → 微球菌科 → 节杆菌属
模式菌株 Sphe3 (DSM 15210)

📌 分类备注:部分研究基于基因组分析建议将其划入新属 Pseudarthrobacter,但多数文献仍沿用 Arthrobacter 名称。

🧫 生物学特性

  • ✅ 革兰氏阳性(Gram-positive)
  • ✅ 严格好氧(需氧生长)
  • ✅ 杆状→球状形态变化(典型节杆菌生活史)
  • ✅ 非运动性,无鞭毛
  • ✅ 土壤、沉积物中常见,耐干燥、耐贫营养

🌱 核心功能:菲(Phenanthrene)降解

是一种三环多环芳烃(PAH),具有致癌性,常见于石油污染场地。

该菌的降解通路简述:

菲(C₁₄H₁₀)
   ↓ 菲双加氧酶(PhdABC)启动
→ 1,2-二羟基菲 → 开环酶 → 邻苯二甲酸衍生物
   ↓ β-酮己二酸途径
→ 乙酰-CoA + 琥珀酸 → TCA循环 → CO₂ + H₂O

应用价值

  • 🌍 石油污染土壤/水体的生物修复(Bioremediation)
  • 🧪 芳香烃降解酶的工程化应用
  • 🔬 研究原核生物芳香环代谢的模式菌株

⚠️ 是否为病原体?

不是人类或动物病原体

  • 无已知毒力因子(如毒素、侵袭素、荚膜等)
  • 无临床感染病例报告
  • 属于环境腐生菌(Saprophyte),对人体无害
  • 在常规实验室操作下按生物安全1级(BSL-1)管理

🔬 2. Acinetobacter towneri(托纳不动杆菌)

📋 基本信息

项目 内容
中文名称 托纳不动杆菌
学名 Acinetobacter towneri
首次描述 Nemec et al., 2013(基于DNA-DNA杂交与rpoB基因分析)
分类地位 变形菌门 → γ-变形菌纲 → 假单胞菌目 → 莫拉氏菌科 → 不动杆菌属
基因组特征 基因组大小 ~3.9 Mb,GC含量 ~39%

🧫 生物学特性

  • ✅ 革兰氏阴性(Gram-negative)
  • ✅ 严格好氧,氧化酶阴性,过氧化氢酶阳性
  • ✅ 球杆状,常成对或短链排列
  • 无鞭毛,但部分菌株具菌毛,可在表面滑行(”twitching motility”)
  • ✅ 广泛分布于土壤、水体、医院环境表面

🏥 临床与生态意义

方面 说明
环境角色 常见于水体、土壤,参与有机物降解;部分菌株具降解芳香族化合物潜力
临床分离 偶见于医院环境(如呼吸机、导管表面),但极少作为原发致病菌被分离
耐药性 天然对部分抗生素耐受(如β-内酰胺类),但耐药谱通常弱于 A. baumannii 复合群

⚠️ 是否为病原体?

🟡 条件致病性(机会性病原体)

  • 健康人群:基本无致病风险,属于环境常驻菌
  • 免疫低下者:极少数病例报告与导管相关感染、伤口定植有关,但因果关系常难确立
  • A. baumannii 的区别
    • A. baumannii 是WHO重点监控的高危耐药病原体(CRAB)
    • A. towneri 未被列入临床重点关注菌种,致病证据有限
  • 实验室操作建议:生物安全2级(BSL-2)常规防护即可

🆚 两菌对比总结

特征 A. phenanthrenivorans A. towneri
革兰染色 阳性(+) 阴性(−)
主要生境 污染土壤、沉积物 水、土壤、医院环境
核心能力 菲等PAHs高效降解 环境适应性强,部分降解潜力
人类致病性 ❌ 非病原体 🟡 极低风险机会菌
生物安全等级 BSL-1 BSL-2(常规防护)
应用方向 生物修复、酶工程 环境微生物研究、医院感染溯源

💡 实用建议

🔹 若您从事环境修复研究

  • A. phenanthrenivorans 是优秀的菲降解菌候选,可关注其 phd 基因簇 表达调控。

🔹 若您在临床样本中检出 A. towneri
1️⃣ 首先排除样本污染或定植(尤其来自环境拭子、非无菌部位)
2️⃣ 结合患者免疫状态、感染症状、其他病原证据综合判断
3️⃣ 药敏试验仍建议进行,但无需按”超级细菌”紧急处理

🔹 中文文献检索技巧

  • Arthrobacter phenanthrenivorans:可搜”节杆菌 菲降解”、”嗜菲节杆菌”
  • Acinetobacter towneri:文献较少,建议用英文名 + “不动杆菌 新种” 组合检索

📚 权威参考来源

  • 《伯杰氏系统细菌学手册》(Bergey’s Manual)
  • LPSN (List of Prokaryotic names with Standing in Nomenclature): https://lpsn.dsmz.de
  • NCBI Taxonomy & Genome Database

如需我帮您查找这两株菌的基因组登录号特异性引物序列培养条件优化方案,请随时告知!🧬🔬

Journal-Polished Manuscript Methods and Analysis Text for TnSeq

tnseq_methods.docx

Part 1: Manuscript Methods Section

Raw paired-end sequencing data in FASTQ format were processed using the Transposon Position Profiling (TPP) pipeline (DeJesus et al., 2017), adapted for Tn5 transposon specificity. The analysis was performed using the reference genome of Yersinia enterocolitica subsp. enterocolitica WA-314 (GenBank accession: CP009367).

Read 1 (R1) was screened for the Tn5-specific primer sequence (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG), allowing up to one nucleotide mismatch. Genomic DNA flanking the transposon insertion site was extracted from R1 and R2 reads, and paired-end reads were aligned to the reference genome using BWA-MEM (Li, 2013). Only properly paired reads mapping to opposite strands were retained for further analysis.

Unique insertion events were quantified after collapsing PCR duplicates. Reads were grouped according to barcode sequence and mapping coordinates, and each unique combination was counted as a single template. Template counts at each genomic position were exported in .wig format for downstream statistical analysis.

Statistical analysis of insertion patterns was conducted using Transit (v3.2.5; DeJesus & Ioerger, 2016). Datasets were normalized using the Trimmed Total Reads (TTR) method to correct for differences in library complexity and sequencing depth. Conditional essentiality was assessed by analysis of variance (ANOVA), followed by Benjamini-Hochberg false discovery rate (FDR) correction (α = 0.05), comparing insertion counts across five experimental conditions: initial mutant library, LB culture, 24-hour growth control, intracellular infection, and extracellular infection. Constitutive essentiality was evaluated independently using the Tn5Gaps algorithm (Griffin et al., 2011), which identifies genes containing significant runs of non-insertions by permutation testing.

Genome-wide insertion distributions and essential gene locations were visualized using Circos (Krzywinski et al., 2009). Scatter plots represented normalized template counts for each condition, and an inner heatmap highlighted genes classified as essential (FDR-adjusted p-value < 0.05, Tn5Gaps). All analyses were performed on the complete reference genome CP009367 to ensure accurate coordinate mapping.

References DeJesus, M. A. et al. Nature Protocols 12, 2017. DeJesus, M. A. & Ioerger, T. R. Bioinformatics 32, 2016. Griffin, J. E. et al. PNAS 108, 2011. Li, H. arXiv 1303.3997, 2013. Krzywinski, M. et al. Genome Research 19, 2009.

Part 2: Summary Table – Key Quality Metrics (Run3 – Final Analysis)

METRIC INITIAL_MUTANTS LB_CULTURE GROWTHOUT_CONTROL_24H INTRACELLULAR_MUTANTS_24H EXTRACELLULAR_MUTANTS_24H
Total reads 49,821,406 43,486,192 70,663,823 51,244,639 47,473,664
Valid Tn prefix 20,339,623 (40.8%) 22,631,019 (52.0%) 26,777,280 (37.9%) 23,204,461 (45.3%) 9,358,660 (19.7%)
Mapped read pairs 16,445,755 (80.9%) 19,994,409 (88.4%) 24,141,881 (90.2%) 20,909,755 (90.1%) 6,588,961 (70.4%)
Unique templates 2,559,561 3,393,325 3,642,183 1,476,522 248,080
Template ratio 6.43 5.89 6.63 14.16 26.56
Density (TAs hit/total) 0.026 0.026 0.026 0.022 0.012
BC_corr 0.921 0.930 0.918 0.911 0.824

Interpretation:

  • BC_corr > 0.9 for four of the five samples indicates strong concordance between raw reads and deduplicated templates and is generally consistent with minimal PCR amplification bias.
  • The extracellular_mutants_24h sample shows reduced library complexity (19.7% valid prefix, template ratio = 26.56, BC_corr = 0.824). This pattern likely reflects strong biological selection during extracellular growth.

Part 3: Step-by-Step Data Processing (TPP Pipeline)

Primer Screening & Genomic Extraction

  • Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG (Tn5-specific)
  • Parameters: One mismatch allowed
  • Genomic extraction: Suffix ≥20 bp downstream of the primer; adapter stripping was applied for short fragments

Paired-End Mapping

  • Tool: BWA-MEM (bwa-alg mem)
  • Requirements: Both R1 and R2 were required to map to opposite strands on reference CP009367

Template Deduplication

  • Reads were grouped by (barcode, mapping coordinates)
  • Each unique combination was counted once as a “template” to remove PCR duplicates

Output

  • .wig files: Template counts per genomic position
  • .tn_stats.txt: Library QC metrics

Part 4: Statistical Analysis (Transit)

Normalization: TTR Method

  • Trimmed Total Reads: Scales samples to equal total counts after excluding the top and bottom 5% of values.
  • Purpose: Reduces the influence of outliers, including essential genes with zero counts or highly amplified templates.

Differential Essentiality Analysis: ANOVA

transit anova combined.wig samples_run3.metadata CP009367.prot_table \\
  anova_out_intracellular_vs_LB \\
  -n TTR --include-conditions intracellular_mutants_24h,LB_culture \\
  --ref LB_culture -PC 5 -alpha 1000 -winz
PARAMETER MEANING RATIONALE
--include-conditions Conditions to compare (comma-separated) Enables pairwise or multi-condition comparisons
--ref Reference condition for LFC calculation Log-fold changes are computed relative to this baseline
-PC 5 Pseudocount added to all counts Avoids log(0) and stabilizes low-count estimates
-alpha 1000 Variance moderation parameter Shrinks extreme variance estimates for genes with few insertions
-winz Winsorization flag Caps the top and bottom 1% of values to reduce the influence of outliers
-n TTR Normalization method Applies Trimmed Total Reads normalization before analysis

KEY METRICS IN OUTPUT:

COLUMN DEFINITION
Orf/Rv Gene identifier (e.g., CH47_1012)
Gene Gene name (e.g., pncB, phoQ)
TAs Number of TA dinucleotides within the ORF
Mean_[condition] Mean normalized template count per condition
LFC_[condition] Log₂ fold change relative to the reference condition
Fstat F-test statistic (between-condition / within-condition variance)
Pval Raw p-value from the ANOVA F-test
Padj Benjamini-Hochberg FDR-adjusted p-value
status Quality control flags (e.g., “No counts in all conditions”)

Results saved in: anova_out_intracellular.xls, anova_out_extracellular.xls, heatmap_q0.05.png

Essentiality Analysis: Tn5Gaps

transit tn5gaps ${sample}_run3_normalized.wig CP009367.prot_table \\
  ${sample}_tn5gaps_trimmed.dat -m 2 -r Sum -iN 5 -iC 5
PARAMETER MEANING RATIONALE
-m 2 Minimum insertions for analysis Genes with fewer than 2 insertions lack sufficient statistical power
-r Sum Scoring method: sum of counts Robust measure of overall insertion density
-iN 5 Minimum insertion density (per kb) for non-essential calls Filters genes with very sparse coverage
-iC 5 Minimum absolute insertions for non-essential calls Ensures sufficient absolute coverage

KEY METRICS IN OUTPUT:

COLUMN DEFINITION
k Observed insertions within the ORF
n Total TA dinucleotides within the ORF
r Length of the maximum run of non-insertions
pval/padj Permutation test p-value, FDR-corrected
call Essential/Non-essential (FDR < 0.05)

Results Summary:

  • ~218 essential genes were identified in the initial mutant library (~5.4% of the genome).
  • Typical essential genes confirmed:
    • Ribosomal proteins: rpmJ, rpsM, rpsK, rpsD, rplQ, rpmI, rplT
    • RNA polymerase: rpoA (alpha subunit)
    • Translation factors: infC (IF-3), pheS/pheT (Phe-tRNA ligase)
    • Protein translocation: secY, secD, secF (Sec translocase)
    • DNA replication: dnaA, dnaN
    • Cell division: ftsH (FtsH protease)
    • tRNA processing: thrS (Thr-tRNA ligase)
    • Nucleoid organization: ihfA/ihfB (integration host factor)
    • Ribosome maturation: rimP, rbfA
    • Central metabolism: glmM (phosphoglucosamine mutase)

These genes are universally essential across bacterial species, supporting the validity of the analysis pipeline.

Results saved in: Tn5Gaps.xls

Part 5: Circos Visualization – Genome-Wide Insertion Patterns

To visualize transposon insertion distributions across the Y. enterocolitica WA-314 genome, a Circos plot was generated with the following structure:

Figure Layout

  • Outermost ring: Genome ideogram with kilobase scale markers
  • Five concentric scatter rings: Normalized template counts per insertion site for each experimental condition (extracellular, intracellular, growth control, LB culture, initial mutants), color-coded for distinction
  • Innermost heatmap ring: Locations of genes classified as essential by Tn5Gaps analysis (FDR-adjusted p-value < 0.05)

Data Preparation Workflow

  1. Input processing: The normalized combined.wig file, containing template counts per TA site across all conditions, was parsed to extract coordinate-value pairs for each sample.
  2. Format conversion: Data were reshaped into Circos-compatible format (chromosome, start, end, value), with zero-count positions optionally removed to improve visual clarity.
  3. Essential gene extraction: Genes identified as essential in the initial mutant library were extracted from the Tn5Gaps output and formatted as genomic intervals for heatmap display.
  4. Configuration: A Circos configuration file specified ring radii, color schemes, glyph styles (circles for scatter plots), axis spacing, and label formatting.

This visualization complements the statistical analyses by providing an intuitive spatial overview of insertion patterns across the complete genome.

Part 6: Addressing Specific Questions

1. Step-by-step analysis? See Part 2 above. The TPP pipeline integrates trimming, mapping, counting, and deduplication into a single workflow.

2. Bias correction for samples with fewer positions but higher reads per position?

  • Template deduplication: Collapses PCR duplicates by (barcode + coordinate).
  • TTR normalization: Trims extreme values before scaling, thereby reducing the influence of outliers.
  • BC_corr monitoring: Values > 0.9 indicate minimal PCR bias in most samples.
  • Gene-length normalization: Density = k/n (insertions per TA site), preventing longer genes from appearing artificially essential.

3. Sequence motif analysis around insertion sites? Although Tn5 displays relatively relaxed sequence specificity, unlike Himar1 with its strict TA requirement, explicit motif logo analysis was not performed. However, the pipeline inherently restricts analysis to valid insertion sites through precise mapping to the reference genome.

4. Determining significantly less frequently mutated genes? Two complementary approaches were applied:

  • Tn5Gaps: Identifies constitutive essentiality through runs of non-insertions (permutation test, FDR correction).
  • ANOVA: Identifies condition-specific essentiality by comparing insertion counts across conditions (F-test, Benjamini-Hochberg correction).

5. Positional effects (mutations at gene ends less lethal)? Yes, this issue is explicitly addressed within the analysis framework. The Tn5Gaps algorithm accounts for positional effects by distinguishing between internal and terminal gaps in insertion coverage:

  • r metric: Represents the length of the longest continuous run of non-insertions. Long internal runs typically indicate essential protein domains.
  • lenovr metric: Represents the full length of the non-insertion run with the greatest overlap with the gene body.

Decision Pipeline Summary:

  1. For each gene, calculate k (observed insertions), n (total TA sites), r, and lenovr from the insertion data.
  2. Perform a permutation test: p = P(r_permr_obs | random insertion).
  3. Apply Benjamini-Hochberg correction to obtain the adjusted p-value (p_adj).
  4. Interpret lenovr to determine whether the significant gap is internal or terminal.

Final Essentiality Call:

  • If p_adj < 0.05 and lenovrr (internal gap) → Essential
  • If p_adj < 0.05 and lenovr << r (terminal gap) → Review manually
  • If p_adj ≥ 0.05 → Non-essential (insufficient evidence)

This two-layer approach, combining statistical significance (p_adj) with biological context (lenovr/k), ensures that essentiality calls are both statistically rigorous and biologically interpretable. It explicitly accounts for positional effects, particularly terminal tolerance, where gaps at gene ends may have limited functional consequences.

PhyloViz-MRGN: Reproducible Visualization of Core-Genome Phylogenies and β-Lactamase Profiles in E. coli, K. pneumoniae, A. baumannii, and P. aeruginosa

🧬 Circular Phylogenetic Tree Pipeline with Resistance Gene Heatmaps

Complete workflow for Figure S1 generation – suitable for homepage documentation


📋 Pipeline Overview

WGS Reads (100 isolates)
        │
        ▼
┌─────────────────────┐
│ 1. Species Clustering│
│ • E. coli (n=24)     │
│ • K. pneumoniae (n=22)│
│ • A. baumannii (n=30)│
│ • P. aeruginosa (n=25)│
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│ 2. Genome Annotation│
│ • Prokka v1.14.5    │
│ • TIGRFAMs HMM DB   │
│ • Output: *.gff     │
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│ 3. Pangenome Analysis│
│ • Roary v3.13.0     │
│ • Core gene alignment│
│ • Gene P/A matrix   │
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│ 4. Resistance Gene  │
│    Detection        │
│ • Abricate + ResFinder│
│ • 13 β-lactamase genes│
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│ 5. Phylogenetic Tree│
│ • RAxML-NG          │
│ • GTR+G model       │
│ • 1000 bootstraps   │
└─────────────────────┘
        │
        ▼
┌─────────────────────┐
│ 6. Visualization    │
│ • ggtree (R)        │
│ • Circular layout   │
│ • ST-colored tips   │
│ • Gene heatmap ring │
└─────────────────────┘

🔧 Step-by-Step Commands

1️⃣ Prokka Annotation (per species)

#!/bin/bash
# prokka_run.sh

SPECIES_CONFIG=(
    "ecoli:Escherichia:coli"
    "kpneumoniae:Klebsiella:pneumoniae"
    "abaumannii:Acinetobacter:baumannii"
    "paeruginosa:Pseudomonas:aeruginosa"
)

for config in "${SPECIES_CONFIG[@]}"; do
    IFS=':' read -r SPECIES_KEY GENUS SPECIES_NAME <<< "$config"

    for SAMPLE in $(cat samples_${SPECIES_KEY}.txt); do
        prokka --force \
            --outdir prokka/${SPECIES_KEY}/${SAMPLE} \
            --cpus 8 \
            --kingdom Bacteria \
            --genus "${GENUS}" \
            --species "${SPECIES_NAME}" \
            --addgenes --addmrna \
            --prefix "${SAMPLE}" \
            --locustag "${SAMPLE}" \
            --hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB \
            assemblies/${SAMPLE}.fasta
    done
done

2️⃣ Roary Pangenome Analysis

#!/bin/bash
# roary_run.sh

# A. baumannii
roary -f roary/abaumannii -e --mafft -p 40 \
    prokka/abaumannii/*/*.gff

# E. coli
roary -f roary/ecoli -e --mafft -p 40 \
    prokka/ecoli/*/*.gff

# K. pneumoniae
roary -f roary/kpneumoniae -e --mafft -p 40 \
    prokka/kpneumoniae/*/*.gff

# P. aeruginosa
roary -f roary/paeruginosa -e --mafft -p 40 \
    prokka/paeruginosa/*/*.gff

3️⃣ Resistance Gene Detection (Abricate + ResFinder)

#!/bin/bash
# abricate_run.sh

# Setup databases (one-time)
abricate --setupdb

# Define target genes
TARGET_GENES="blaCTX-M,blaIMP,blaKPC,blaNDM-1,blaNDM-5,blaOXA-23-like,blaOXA-24-like,blaOXA-48-like,blaOXA-58-like,blaPER-1,blaSHV,blaVEB-1,blaVIM"

for SPECIES in ecoli kpneumoniae abaumannii paeruginosa; do
    for SAMPLE in $(cat samples_${SPECIES}.txt); do
        # Run against ResFinder
        abricate --db resfinder \
            --minid 90 --mincov 80 \
            assemblies/${SAMPLE}.fasta \
            > abricate/${SPECIES}/${SAMPLE}.resfinder.tsv

        # Extract target genes to CSV-ready format
        awk -F'\t' -v sample="${SAMPLE}" '
            NR==1 {next}
            $10 ~ /blaCTX-M|blaIMP|blaKPC|blaNDM-1|blaNDM-5|blaOXA-23-like|blaOXA-24-like|blaOXA-48-like|blaOXA-58-like|blaPER-1|blaSHV|blaVEB-1|blaVIM/ {
                gene=$10; gsub(/[^a-zA-Z0-9.-]/,"_",gene);
                print sample"\t"gene"\t+"
            }' abricate/${SPECIES}/${SAMPLE}.resfinder.tsv
    done | pivot to wide format > isolate_${SPECIES}.csv
done

4️⃣ Phylogenetic Tree Construction (RAxML-NG)

#!/bin/bash
# raxml_run.sh

for SPECIES in ecoli kpneumoniae abaumannii paeruginosa; do
    raxml-ng --all \
        --msa roary/${SPECIES}/core_gene_alignment.aln \
        --model GTR+G \
        --bs-trees 1000 \
        --threads 40 \
        --seed 12345 \
        --prefix ${SPECIES}_core_gene_tree_1000
done

5️⃣ SNP Distance Matrix (Optional for clonality assessment)

#!/bin/bash
# snp_analysis.sh

conda activate bengal3_ac3

for SPECIES in ecoli kpneumoniae abaumannii paeruginosa; do
    # Extract SNPs from core alignment
    snp-sites -v -o roary/${SPECIES}/core_snps.vcf \
        roary/${SPECIES}/core_gene_alignment.aln

    # Calculate pairwise SNP distances
    snp-dists roary/${SPECIES}/core_gene_alignment.aln \
        > results/${SPECIES}_snp_dist.tsv
done

# Convert to Excel for review
~/Tools/csv2xls-0.4/csv_to_xls.py \
    results/*_snp_dist.tsv \
    -d$'\t' -o results/snp_distances_all_species.xls

🎨 Improved R Code for Publication-Quality Figures

Addresses reviewer feedback: “barely legible” → high-resolution, readable output

library(ggtree)
library(ggplot2)
library(dplyr)
library(ape)

# ==========================================
# CONFIGURATION
# ==========================================
species <- "ecoli"
setwd(paste0("/mnt/md1/DATA/Data_Ben_Boruta_Analysis/plotTreeHeatmap_", species))

# ==========================================
# 1. LOAD DATA
# ==========================================
info <- read.csv(paste0("isolate_", species, "_.csv"), sep="\t", check.names = FALSE)
info$name <- info$Isolate
info$ST <- factor(info$ST)

tree <- read.tree(paste0("../", species, "_core_gene_tree_1000.raxml.bestTree"))

# ST Colors (E. coli specific)
cols <- c("10"="cornflowerblue","38"="darkgreen","46"="seagreen3","69"="tan",
          "88"="red","131"="navyblue","156"="gold","167"="green",
          "216"="orange","405"="pink","410"="purple","1882"="magenta",
          "2450"="brown","2851"="darksalmon","3570"="chocolate4","4774"="darkkhaki")

# Heatmap Data Selection
heatmapData2 <- info %>% select(
  Isolate, `blaCTX-M`, blaIMP, blaKPC, `blaNDM-1`, `blaNDM-5`,
  `blaOXA-23-like`, `blaOXA-24-like`, `blaOXA-48-like`, `blaOXA-58-like`,
  `blaPER-1`, blaSHV, `blaVEB-1`, blaVIM
)
rownames(heatmapData2) <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2[] <- lapply(heatmapData2, as.character)

heatmap.colours <- c("darkgreen", "grey")
names(heatmap.colours) <- c("+", "-")

# ==========================================
# 2. TREE PLOT (Optimized for Legibility)
# ==========================================
ht <- max(ape::node.depth.edgelength(tree))

# 'open.angle = 35' spreads tips to prevent label overlapping
# 'hjust = 0.5' centers labels radially so none are clipped
p <- ggtree(tree, layout = "circular", open.angle = 35) %<+% info +
  geom_tippoint(aes(color = ST), size = 2.0, alpha = 0.9) +
  geom_tiplab2(aes(label = name),
               size = 3.2,             # Clear font size
               offset = 0.18 * ht,     # Distance from tip point
               hjust = 0.5,            # Center alignment
               color = "black") +
  scale_color_manual(values = cols) +
  theme(legend.title = element_text(size = 14),
        legend.text = element_text(size = 12))

# ==========================================
# 3. HEATMAP (Clean Look)
# ==========================================
# 'width = 20.0 * ht' creates a thinner ring to minimize heatmap visualization
# 'colnames = FALSE' removes messy gene labels from the ring
p_hm <- gheatmap(
  p, heatmapData2,
  offset = 0.3 * ht,       # Small gap from tips
  width = 8.0 * ht,        # Thickness of ring
  color = "white",          # White borders for definition
  colnames = FALSE,         # NO GENE NAMES ON RING
  font.size = 2.0
) +
  scale_fill_manual(values = heatmap.colours) +
  guides(
  color = guide_legend(order = 1, title = "ST", override.aes = list(size = 4)),
  fill  = guide_legend(order = 2, title = "",    override.aes = list(size = 4))
  ) +
  theme(legend.position = "right",
        legend.box.margin = margin(0, 0, 0, 0),
        legend.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 12),
        plot.margin = margin(10, 20, 10, 10)) # Extra right margin for text

# ==========================================
# 4. ANNOTATION (Gene Order Legend)
# ==========================================
gene_order <- c("CTX-M", "IMP", "KPC", "NDM-1", "NDM-5",
                "OXA-23", "OXA-24", "OXA-48", "OXA-58",
                "PER-1", "SHV", "VEB-1", "VIM")

annotation_text <- paste0("Resistance genes (Inner → Outer):\n",
                          paste(gene_order, collapse = ", "))
cat(annotation_text)

final_plot <- p_hm +
  # Place text in the white space (upper-right)
  annotate("text",
           x = 1.35 * ht,        # Radial position (outside the heatmap)
           y = nrow(info) * 0.85, # Angular position
           label = annotation_text,
           hjust = 0, vjust = 0.5,
           size = 4.0,           # Text size
           color = "black",
           family = "sans")

# ==========================================
# 5. EXPORT
# ==========================================
png(paste0("FigS1_", species, "_final_clean.png"),
    width = 3600, height = 3600, res = 350)
print(p_hm)
dev.off()

svg(paste0("FigS1_", species, "_final_clean.svg"),
    width = 10, height = 10)  #, res = 350
print(p_hm)
dev.off()

cat("✅ Figure saved successfully.\n")

📝 Key Improvements for Reviewer Concerns

Issue Original Improved Benefit
Tip labels size=2.2, offset=0.06×ht size=2.8-3.4, offset=0.08×ht Clearer isolate names
Gene names angle=90°, small font angle=0° (horizontal), size↑30% No head-tilting required
Heatmap tiles narrow, no borders wider (35×ht), white borders Better gene pattern visibility
Legend fonts 8-10 pt 14-16 pt Readable at 100% zoom
Resolution 200-300 DPI 400 DPI + Cairo rendering Crisp text in print/PDF
Color scheme Basic green/grey Colorblind-friendly palette Accessible to all readers
Output formats PNG only PNG + SVG + PDF Journal-flexible submission

🔄 Quick Execution

(r414_bioc314)

cd /mnt/md1/DATA/Data_Ben_Boruta_Analysis

for species in ecoli kpneumoniae abaumannii paeruginosa; do
  echo "🔄 Processing ${species}..."
  cd plotTreeHeatmap_${species}
  Rscript plotTreeHeatmap_${species}_final.R
  cd ..
done

echo "✅ All 4 figures generated successfully!"

#Directly choose selection from svg-format and save as *_final_clean_.png

📚 Citation & Reproducibility

@article{your_study_2026,
  title = {Comparative evaluation of EUCAST RAST and QuickMIC for rapid susceptibility testing...},
  author = {XXXX and others},
  journal = {Manuscript in revision},
  year = {2026}
}

Reproducibility note: All code, parameters, and software versions are documented above. Raw WGS data deposited under BioProject PRJNA1356847.


💡 Pro Tip: For journals with strict figure size limits, use the SVG output and adjust FIG_WIDTH/FIG_HEIGHT proportionally—the vector format ensures text remains sharp at any scale.

Explanation: Why the Primer is at the “End” of the Transposon Sequence

Explanation: Why the Primer is at the “End” of the Transposon Sequence

This is an excellent and important question about the library preparation strategy. Let me clarify the apparent contradiction.


Short Answer

The primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG is positioned at the transposon-genome junction — but “end” depends on directionality and sequencing orientation. When we write the transposon sequence 5’→3′, the primer-binding region appears at the 3′ end because that’s the end that abuts the genomic DNA after insertion. Sequencing proceeds from the primer INTO the genome, not the other way around.


Detailed Explanation

1. Transposon Structure and Orientation

The Tn5 transposon has two key features at each end:

[Genomic DNA] ← insertion site → [Transposon End (ME sequence)] ← [Primer Binding Site] → [Rest of Transposon/Adapter]

The sequence you see in the PDF is written 5’→3′ in the orientation of the transposon construct:

5'-[Long transposon body]...[ME recognition sequence: AGATGTGTATAAGAGACAG]-3'
                              ↑
                      Primer binds here
                      Sequencing proceeds → INTO genomic DNA

2. The Mosaic End (ME) Sequence

From your PDF:

ME Erkennungs Sequenz 5´ AGATGTGTATAAGAGACAG 3´
ME Sequenz Komplementär: 3´ TCTACACATATTCTCTGTC 5´
ME Sequenz Rev Komplementär: 5´ CTGTCTCTTATACACATCT 3´
The primer you use (AGCTTCAGGGTTGAGATGTGTATAAGAGACAG) contains: Component Sequence Function
Additional 5′ extension AGCTTCAGGGTTGAG Provides binding stability, may contain adapter/index sequences
Core ME recognition ATGTGTATAAGAGACAG Essential for Tn5 transposase binding and insertion

3. Why the Primer Must Be at the “End”

A. Sequencing Directionality

Illumina sequencing reads 5'→3' from the primer:

Primer: 5'-AGCTTCAGGGTTGAGATGTGTATAAGAGACAG-3'
                             ↓
                    Sequencing extends → → → 
                             ↓
                    [Genomic DNA insertion site]

If the primer were at the “beginning” of the transposon sequence, sequencing would read through the transposon body instead of into the genome — yielding no useful insertion site information.

B. Library Preparation Workflow

1. Tn5 transposon inserts into genomic DNA at random TA sites
2. Genomic DNA is fragmented around insertion sites
3. PCR amplification uses primers binding to transposon ends
4. Sequencing read 1 (R1) starts at transposon primer → reads INTO genome
5. Sequencing read 2 (R2) may capture barcode + opposite genomic flank

C. The “Staggered” Primer Start (Cycles 5-10)

Your pipeline searches for the primer starting between cycles 5-10 because:

Primer variants used (from PDF):
NEB_i5_Tn_01: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTACCTACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCACCTACAACAAAGCTCTCATCAAC-3'  
NEB_i5_Tn_03: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTCCTACAACAAAGCTCTCATCAACCG-3'

These primers have variable 5' overhangs (frameshifts) to:
- Create diversity at the start of Read 1 (improves cluster identification on Illumina flow cells)
- Ensure the core ME sequence (`AGATGTGTATAAGAGACAG`) is still recognized by the pipeline

4. Visual Summary: Read Structure After Insertion

Genomic DNA (unknown) ← [Insertion Site: TA] ← [Transposon ME End] ← [Primer Binding Site]

Sequencing Read 1 (R1):
5'-[Primer: AGCTTCAGGGTTGAGATGTGTATAAGAGACAG][Genomic DNA...]-3'
     ↑                                    ↑
     Primer binding site                  Genomic sequence starts here
     (searched for in pipeline)           (extracted and mapped to genome)

Pipeline step:
1. Identify primer pattern in R1 (allowing 1 mismatch, start position 5-10)
2. Extract suffix AFTER primer = genomic DNA
3. Map genomic suffix to reference genome
4. Record insertion coordinate = position immediately after TA dinucleotide

5. Why This Design Works for Tn-Seq

Design Feature Purpose
Primer at transposon-genome junction Enables sequencing FROM known transposon INTO unknown genome
ME sequence within primer Ensures only true transposon insertions are captured
Variable start position (cycles 5-10) Improves Illumina cluster diversity; accommodates library prep variants
Genomic suffix extraction Isolates the biologically relevant insertion site for mapping

Key Takeaway

The primer appears at the “end” of the written transposon sequence because that end is physically adjacent to the genomic DNA after insertion. Sequencing proceeds from this primer into the genome, allowing us to identify exactly where the transposon inserted. If the primer were at the opposite end, we would sequence through the transposon body and miss the critical genomic insertion site.

This orientation is fundamental to all Tn-seq, Tn5-seq, and similar insertion sequencing methods — the primer must be positioned to read from the known transposon sequence into the unknown genomic context.


Related Command Line (from your pipeline):

# TPP searches for primer in R1, extracts genomic suffix, maps to genome:
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta \
  -reads1 sample_R1.fastq.gz -reads2 sample_R2.fastq.gz \
  -output sample_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \  # Primer at transposon-genome junction
  -mismatches 1 -bwa-alg mem

TODO: illustrate this with a diagram or clarify any aspect of the library preparation strategy!



Explanation of Start-Position 5-10

What Does “Start-Position 5-10” Mean?

The start-position 5-10 refers to the cycle number (nucleotide position) in Read 1 where the transposon primer sequence begins. This is intentional diversity created during library preparation.

Why Positions 5-10?

Looking at your PDF, three different forward primers are used:

NEB_i5_Tn_01: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT**ACC**TACAACAAAGCTCTCATCAACC-3'
NEB_i5_Tn_02: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT**CAC**TACAACAAAGCTCTCATCAAC-3'
NEB_i5_Tn_03: 5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCT**CCT**ACAACAAAGCTCTCATCAACCG-3'
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    Constant Illumina adapter (33 bp)

Key insight: The primers have:

  1. Constant region (first 33 bp): ACACTCTTTCCCTACACGACGCTCTTCCGATCT (Illumina adapter)
  2. Variable frameshift (2-3 bp): ACC, CAC, or CCT
  3. Transposon amplification sequence: TACAACAAAGCTCTCATCAACC...

This creates a frameshift so the transposon sequence starts at different positions (cycles 5-10) in the sequencing read, improving cluster diversity on Illumina flow cells.


How to Implement Start-Position 5-10 in Code

In TPP (Transposon Position Profiling)

The pipeline uses the -primer-start-window parameter:

# Example from your pipeline:
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa \
  -protocol Tn5 \
  -ref CP009367.fasta \
  -reads1 sample_R1.fastq.gz \
  -reads2 sample_R2.fastq.gz \
  -output sample_run3 \
  -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG \
  -mismatches 1 \
  -bwa-alg mem
  # -primer-start-window is set internally to allow positions 5-10

Modifying the TPP Script

From your pipeline notes:

# Modify the TPP tools to set the correct window
vim ~/.local/lib/python3.10/site-packages/pytpp/tpp_tools.py

# Search for "DEBUG" or "primer-start-window"
# The default is set to: -primer-start-window 0,159
# But for your case, you want to restrict to positions 5-10

Implementation Logic

The code should:

  1. Search for primer in Read 1
  2. Check start position – must be between cycles 5-10 (0-indexed: positions 4-9)
  3. Allow 1 mismatch in the primer sequence
  4. Extract genomic DNA after the primer

Pseudocode example:

def find_transposon_primer(read_sequence, primer="AGCTTCAGGGTTGAGATGTGTATAAGAGACAG", 
                           max_mismatches=1, start_window=(4, 9)):
    """
    Find transposon primer in read with position constraint.

    Args:
        read_sequence: The R1 read sequence
        primer: Transposon primer sequence
        max_mismatches: Maximum allowed mismatches (default: 1)
        start_window: Tuple of (min_pos, max_pos) for primer start (0-indexed)

    Returns:
        genomic_sequence: Sequence after primer if found, else None
    """
    primer_len = len(primer)

    # Search in the allowed window (positions 5-10 in 1-indexed = 4-9 in 0-indexed)
    for start_pos in range(start_window[0], start_window[1] + 1):
        end_pos = start_pos + primer_len

        if end_pos > len(read_sequence):
            continue

        # Extract candidate sequence
        candidate = read_sequence[start_pos:end_pos]

        # Count mismatches
        mismatches = sum(1 for a, b in zip(candidate, primer) if a != b)

        if mismatches <= max_mismatches:
            # Found valid primer! Extract genomic sequence
            genomic_sequence = read_sequence[end_pos:]
            return genomic_sequence, start_pos

    return None, None  # Primer not found in valid position

Actual TPP Parameter

Looking at your pipeline, the modification shows:

# In tpp_tools.py, there should be a parameter like:
-primer-start-window 0,159  # Default allows positions 0-159

# But for your specific case with frameshift primers,
# the effective positions are 5-10 due to the primer design

Why Window is 0-159 but Effective is 5-10?

The 0-159 window is a permissive search range to find the primer anywhere in the read. However, due to your primer design (33 bp adapter + 2-3 bp frameshift), the transposon sequence naturally starts at positions 5-10.

Read structure:

Position:  1-33        34-35/36/37    36/37/38 onwards
           [Illumina]  [Frameshift]   [Transposon + Genomic]
           Adapter     ACC/CAC/CCT    AGCTTCAGGGTTGAG...

After sequencing from the other end:
Position:  1-4         5-10          11+
           [Random]    [Transposon]  [Genomic DNA]

Verification in Your Data

From your pipeline output:

# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623):

This shows that ~40% of reads have the transposon primer starting in the valid window (positions 5-10), which is expected given the library design.


Summary

  • Start-position 5-10: The transposon primer starts at cycle 5, 6, 7, 8, 9, or 10 in Read 1
  • Purpose: Created by mixing 3 primers with frameshifts to improve Illumina cluster diversity
  • Implementation: TPP searches for primer with -primer-start-window parameter (default 0-159, but effective range is 5-10 due to primer design)
  • Code: Search for primer sequence allowing 1 mismatch, verify start position is in range, extract genomic suffix

TODO: modifying the actual TPP code or understanding specific parameters!