
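Tenergy / Dignics on the Butterfly Timo Boll ALC: The Ultimate Practical Handbook

Version: 2025-10-27 (Europe/Berlin)

TODO: First choice, for grip and arc: forehand Dignics 09C black 2.1 (the most reliable against backspin), backhand Dignics 64 red 1.9. Second choice, for a direct, fast release: forehand Tenergy 05 red 2.1, backhand Tenergy 80 black 1.9.

Audience: players who use or are considering the Butterfly Timo Boll ALC (TB ALC) with Dignics / Tenergy rubbers (09C, 64, 64 FX, 05, 80). Goal: settle your forehand and backhand setup quickly, understand each rubber's feel, arc, speed and short-game behavior, and get thickness and color recommendations.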


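Contents

  1. Quick conclusions (for people in a hurry)
  2. TB ALC blade at a glance
  3. Rubber quick-reference cards (D09C / D64 / T05 / T80 / T64 / T64 FX)
  4. Key differences table
  5. Forehand/backhand pairings (by playing style)
  6. Thickness and color (red/black)
  7. Chemistry with the TB ALC: first impressions and caveats
  8. Short game, serve/receive and rally notes
  9. Maintenance and replacement cycle
  10. Frequently asked questions (FAQ)
  11. Mini glossary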

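1) Quick conclusions (for people in a hurry)

  • Steady FH looping: Dignics 09C (black) 2.1 or Tenergy 05 (red) 2.1
  • Sharp BH counter-drive / counter-loop: Dignics 64 (1.9) or Tenergy 64 (1.9)
  • One rubber for everything, low maintenance: Tenergy 80 (FH 2.1 / BH 1.9)
  • BH that needs forgiveness and easy lift: Tenergy 64 FX (1.9 or 2.1)
  • Why T64 is rarely recommended for the forehand: low, flat arc, lively in the short game, and less forgiving when opening against backspin; the forehand usually needs grip plus arc (09C/05/80 fit better).
  • Red/black: the rules only require one black side and one non-black side. Common experience: black sheets feel tackier and more solid (suits FH brushing); red sheets feel crisper and livelier (common on the BH or for fast-release styles).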

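2) TB ALC blade at a glance

  • Type: OFF / OFF- (offensive), ALC fiber (Arylate-Carbon)
  • Plies (classic ALC layup): Koto – ALC – Limba – Kiri (core) – Limba – ALC – Koto
  • Typical specs: thickness ~5.7–5.9 mm; weight ~86–90 g; head ~157×150 mm
  • Feel: large sweet spot, low vibration, clean and slightly "muted" contact; fast but controllable; medium arc on loops, stable in blocks and counter-hits, linear response.

Compared with similar blades

  • Viscaria: slightly softer overall with a slightly higher arc; the TB ALC is more direct with a touch less arc.
  • TB ZLC: faster and crisper, with less dwell and forgiveness.
  • TB ZLF: softer with more dwell, but a lower top speed.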

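3) Rubber quick-reference cards

Dignics 09C (D09C)

  • Characteristics: slightly tacky topsheet, the strongest grip and deepest dwell; medium-high arc, the most reliable against backspin; not bouncy close to the table, with plenty of reserve power from mid and long distance.
  • Role: core FH rubber for modern looping; on the TB ALC it tempers the blade's crispness and adds arc and control.

Dignics 64 (D64)

  • Characteristics: flat trajectory, fast rebound, good at using the opponent's pace; medium-low arc, high exit speed; slightly lively in the short game.
  • Role: aggressive BH counter-drive / counter-hit; very comfortable on the BH of the TB ALC.

Tenergy 05 (T05)

  • Characteristics: high-friction pimple geometry, strong grip, medium-high arc, steadier in the short game, linear response.
  • Role: all-round FH looping / loop-drive; a proven, durable classic pairing on the TB ALC.

Tenergy 80 (T80)

  • Characteristics: the balance point between 05 and 64; medium arc, balanced speed and control.
  • Role: works on either side; suits a simplified "one rubber for the whole table" setup.

Tenergy 64 (T64)

  • Characteristics: flat release, medium-low arc, strong top speed and penetration; more prone to being lively in the short game.
  • Role: BH counter-drive / counter-loop oriented; an FH option for the few players who prefer a flat, penetrating ball.

Tenergy 64 FX (T64 FX)

  • Characteristics: softer-sponge version; easy lift, high forgiveness, easier at low to medium power; livelier in the short game, slightly lower top speed.
  • Role: forgiving, easy-to-handle BH option; friendly to beginner-to-intermediate players and lighter hitters.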

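4) Key differences table

Model  Grip/dwell  Arc height  Exit speed/penetration  Short-game control  Tolerance vs backspin  Best side
D09C  highest  medium to high  strong in mid/late flight  best  FH
T05  medium to high  medium-high  very good  FH / BH control
T80  medium-high  medium-high  stable  FH / BH
D64  medium to low  highest  medium (lively)  medium  BH
T64  low to medium  highest  medium (lively)  lower  BH / a few FH
T64 FX  medium-high (soft)  high (easier at low to medium power)  rather lively  higher  BH, beginner to intermediate

Note: in the table, "lively in the short game" = high rebound and more sensitivity to small strokes; it calls for finer touch.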


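5) Forehand/backhand pairings (by playing style)

A. FH modern looping + BH counter-drive / counter-loop (mainstream)

  • FH: D09C 2.1 (black) / T05 2.1 (red)
  • BH: D64 1.9 (red) / T64 1.9 / T80 1.9 (black)

B. Close-to-table counter-hitting on borrowed pace + quick second-tempo attack

  • FH: T05 2.1 / T80 2.1
  • BH: T64 1.9 / T64 FX 1.9 (for forgiveness)

C. Power looping from mid and long distance

  • FH: D09C 2.1
  • BH: D64 2.1 or T80 2.1 (depending on how much stability you need)

D. One hassle-free all-round set

  • FH: T80 2.1
  • BH: T80 1.9

E. Committed to a flat, penetrating FH

  • FH: T64 1.9 (to keep the short game under control)
  • BH: T80 1.9 / T64 FX 1.9 (for forgiveness)

Common selection logic

  • Tacky / Chinese-style rubbers (e.g. Hurricane, 09C) on the forehand → most players choose black; black sheets are usually slightly tackier, firmer and more "solid", which suits powered brushing and forward-driving loops.
  • Japanese/German rubbers (e.g. Tenergy/Dignics/ESN) on the forehand → many players choose red; red sheets generally feel a bit crisper with a livelier release, a slightly higher arc and a lighter, quicker pace.

For your TB ALC specifically:

  • For grip and arc: forehand Dignics 09C black 2.1, backhand Dignics 64 red 1.9.
  • For a direct, fast release: forehand Tenergy 05 red 2.1, backhand Tenergy 80 black 1.9.
  • For more stability and control: use 1.9 on both sides.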

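6) Thickness and color (red/black)

Thickness (1.9 vs 2.1)

  • 1.9 mm: steadier, better short-game control and a higher success rate when opening; suits the backhand or control-first players.
  • 2.1 mm: maximum power and late-flight top speed; suits the forehand or players chasing power loops.

Color (rules and habits)

  • Rule: one side black, one side non-black (usually red); there is no hard requirement that the forehand be red or black.
  • Experience: black sheets tend to be tackier and more solid (FH brushing); red sheets are crisper with a faster release (BH / fast attack).
  • Recommendation: FH black / BH red (if you choose a grippy rubber such as 09C/T05); with 64/64 FX on the BH, either color works.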

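7) Chemistry with the TB ALC: first impressions and caveats

  • The TB ALC releases the ball cleanly with a fast rebound; paired with D09C/T05 it gives a steadier arc and more tolerance against backspin.
  • Stacked with D64/T64 it becomes even crisper: the BH feels great but the short game gets livelier.
  • If you also put 64 on the FH: 1.9 is advisable, and the short game needs extra finesse; alternatively use a softer blade to calm things down.

8) Short game, serve/receive and rally notes

  • Short game: D09C/T05/T80 make short control and short pushes easier; with 64/64 FX, reduce stroke force and racket-angle changes and brush more.
  • Opening against backspin: D09C is the most reliable, T05 next; with 64/64 FX watch the brushing angle and contact depth.
  • Counter-looping: 64/D64 have high exit speed and strong flat penetration; T05/T80 give a larger safety window through arc.
  • Using the opponent's pace: the 64 family has the edge; D09C needs active power and depends more on your own stroke quality.

9) Maintenance and replacement cycle

  • Cleaning: after every session wipe gently with a slightly damp sponge or dedicated cleaner and apply a protective film; avoid hard rubbing on the slightly tacky 09C surface.
  • Replacement: with frequent training (>3 sessions/week), replace one side about every 2–3 months; at normal intensity, every 3–6 months.
  • Gluing: use water-based, VOC-free glue; avoid illegal modifications (e.g. prohibited tack or speed boosters).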

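10) Frequently asked questions (FAQ)

Q1: Why do many people advise against T64 on the forehand?
A: Flat, low arc, a lively short game and low tolerance against backspin; stacked on the TB ALC it gets even jumpier. Most forehands need grip and arc (09C/05/80).

Q2: D64 vs T64?
A: D64 has slightly better grip and forgiveness while keeping the flat trajectory and high exit speed; T64 has the more "classic 64" character, sharper but more demanding in the short game.

Q3: T64 vs T64 FX?
A: FX is softer, easier at low to medium power and more forgiving; its top speed and short-game stability fall short of T64. For beginner-to-intermediate backhands, lean toward FX.

Q4: Can T80 go on both sides?
A: Yes. It sits between 05 and 64, with balanced speed, arc and control, and is the low-maintenance choice.

Q5: Do red and black differ in performance?
A: Apart from formula and batch variation, the common experience is that black is slightly tackier and red crisper; choose by feel and needs, as there is no hard rule.


11) Mini glossary

  • Grip/dwell: the sense of the ball "staying and rubbing" on the rubber. The stronger it is, the easier it is to lift backspin and loop consistently.
  • Lively in the short game: high rebound sensitivity at low power; the ball pops up easily and control gets harder.
  • Arc height: how much the trajectory curves; a higher arc gives a larger safety window.
  • Flat penetration: a flat, fast ball that kicks forward after the bounce.
  • Forgiveness: how far stroke angle or power can be off while the ball still lands and stays effective.

Summary

  • FH: choose grip and arc (D09C/T05); BH: choose flat speed and use of incoming pace (D64/T64/T80).
  • T80 is the all-rounder that is extreme on neither side; T64 FX gives the BH ease and forgiveness.
  • The TB ALC's character is "clean + linear"; a sensible rubber pairing covers both the short game and rallies.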

ONT Methylation Analysis — Comprehensive Summary

Scope: What methylation is (5‑mC, 6‑mA, 4‑mC), how Oxford Nanopore (ONT) detects it, how it differs from bisulfite sequencing, required coverage, file types (modBAM/CRAM with MM/ML tags), basecalling models (Dorado), practical workflows, pipelines (nf-core/methylong), deliverables to request from providers (e.g., Novogene), and specific advice for bacterial projects.


1) What are 5‑mC, 6‑mA, 4‑mC?

  • 5‑mC (5‑methylcytosine): methyl group on cytosine C5 carbon. In eukaryotes strongly linked to gene regulation (CpG), chromatin state, imprinting. Also present in some bacteria (e.g., Dcm at CCWGG).
  • 6‑mA (N6‑methyladenine): methyl on adenine N6. Very common in bacteria/archaea (e.g., Dam at GATC), functions in restriction–modification (R–M), mismatch repair, replication control, and gene regulation.
  • 4‑mC (N4‑methylcytosine): methyl on cytosine N4, mostly in bacteria/archaea (R–M and regulation).

Coverage guidance (ONT direct detection):

  • ≥ 10× for 5‑mC calling/quantification.
  • ≥ 50× for 6‑mA and 4‑mC (signals are weaker; models need depth).
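As a rough planning aid, the coverage targets above translate into sequencing yield via bases = genome size × coverage. A minimal Python sketch; the 4 Mb genome size is a hypothetical example value, and real runs need headroom for reads that fail QC or do not map.

```python
# Fold-coverage targets taken from the guidance above.
MIN_COVERAGE = {"5mC": 10, "6mA": 50, "4mC": 50}


def required_bases(genome_size_bp: int, mods: list[str]) -> int:
    """Minimum mapped bases needed to reach the strictest target among `mods`."""
    depth = max(MIN_COVERAGE[m] for m in mods)
    return genome_size_bp * depth


genome = 4_000_000  # hypothetical bacterial genome size in bp
print(required_bases(genome, ["5mC"]))
print(required_bases(genome, ["5mC", "6mA", "4mC"]))
```

When several modifications are profiled from the same run, the strictest target (here 50×) sets the yield.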

2) How ONT detects methylation (no chemical conversion)

  • ONT does not convert bases (unlike bisulfite sequencing which converts un‑methylated C → U → read as T). ONT reads remain A/C/G/T.
  • ONT measures ionic current while DNA k‑mers pass the pore. Modified bases (5‑mC/6‑mA/4‑mC) slightly shift current distributions.
  • A modified‑base basecaller (now Dorado; historically Guppy+Remora) decodes those shifts and writes methylation annotations into aligned BAM/CRAM as MM/ML tags:
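    • MM: which modification types were called and where, stored per read as skip counts along the read's own bases.
    • ML: the probability of each modification call listed in MM, stored as 8‑bit values (0–255).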
  • Downstream tools (e.g., modkit, methylartist, nf‑core/methylong) summarize per‑site/per‑region methylation and export BED/bedGraph/bigWig for visualization/statistics.
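To make the MM/ML encoding concrete, here is a minimal pure-Python sketch of how the two tags map back to read positions. It makes simplifying assumptions that real tools such as modkit do not need: the read is in its originally basecalled orientation, and each MM entry carries a single modification code. The function name `decode_mods` and the toy read are illustrative only.

```python
def decode_mods(seq: str, mm: str, ml: list[int]) -> list[tuple[int, str, float]]:
    """Return (read position, modification code, probability) for each call.

    Assumes `seq` is in the original basecalled orientation and that every
    MM entry has one modification code, e.g. "C+m" or "A+a".
    """
    calls = []
    ml_index = 0
    for entry in mm.strip().rstrip(";").split(";"):
        if not entry:
            continue
        header, *skips = entry.split(",")
        base, code = header[0], header[2:].rstrip("?.")
        # 0-based positions of every candidate base (e.g. every C) in the read.
        candidates = [i for i, b in enumerate(seq) if b == base]
        k = -1
        for skip in skips:
            k += int(skip) + 1  # skip N candidate bases, then take the next one
            prob = (ml[ml_index] + 0.5) / 256  # midpoint of the 8-bit probability bin
            calls.append((candidates[k], code, prob))
            ml_index += 1
    return calls


# Toy read with C at positions 1, 4 and 5; calls on the first and third C.
print(decode_mods("ACGTCCGA", "C+m,0,1;", [255, 0]))
```

The first call is near-certain methylation and the second is near-certain absence, which shows why downstream summaries apply a probability threshold before counting a site as modified.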

Key contrast with bisulfite (BS‑seq):

  • BS‑seq chemically converts un‑methylated C to U (sequenced as T) → uses base changes to infer methylation.
  • ONT uses signal differences; no base letters change. Methylation is metadata in BAM tags, not edits in the sequence.

3) Data types & what you need (modBAM vs “assembly” reads)

  • Previous ONT reads used for genome assembly are typically standard basecalls (A/C/G/T only) and lack MM/ML tags, so not suitable for methylation quantification.
  • For methylation analysis you need either:
    1. Provider delivers aligned modified‑base BAM/CRAM (modBAM/CRAM) with MM/ML tags and indices (.bai/.crai).
    2. Or you re‑basecall the raw signal (FAST5/POD5) with a modified‑base Dorado model and then align to your reference (producing modBAM). FASTQ alone is not enough, because it no longer carries the signal.

Reference genome requirement:

  • For aligned BAM, you (or the provider) must map to a reference FASTA. Keep the exact FASTA (and .fai) used for reproducibility and downstream summarization.

4) Practical workflow (bacteria)

A. Planning & sequencing

  • Decide targets: in bacteria prioritize 6‑mA/4‑mC; optionally 5‑mC (if a Dcm‑like cytosine methyltransferase is present; Dam itself deposits 6‑mA).
  • Coverage targets: ≥50× (6‑mA/4‑mC), ≥10× (5‑mC).
  • Ask provider to run Dorado (modified‑base model) and deliver aligned modBAM/CRAM with MM/ML tags.

B. Inputs/outputs to request from provider (e.g., Novogene)

  1. Deliverables:
    • modBAM/CRAM (aligned to our provided reference), with MM/ML tags + .bai/.crai.
    • Optional per‑site tracks: BED/bedGraph/bigWig and a QC report.
  2. Reference:
    • Can we provide bacterial reference FASTA? Will they return the exact FASTA (.fai) used?
  3. Models & modifications:
    • Which Dorado model version and which mods (5‑mC, 6‑mA, 4‑mC) are called by default?
  4. Unaligned data:
    • If delivering unmapped uBAM/FASTQ, request that modified‑base calls (tags) are still included, or obtain raw signal/FAST5 if re‑calling in‑house.

C. In‑house analysis (outline)

  • Align mod‑called reads to reference (if not already) → modBAM.
  • Run modkit to summarize per‑site methylation frequencies and export bedGraph/bigWig.
  • Use methylartist for regional plots, motif‑centric views, metaplots over features (promoters, operons, RND genes, etc.).
  • Integrate with other omics (RNA‑seq) by averaging methylation in promoter/operon windows and correlating with expression changes.

5) nf‑core/methylong (pipeline overview)

  • Community Nextflow pipeline for ONT methylation. Typical features:
    • Supports Dorado modified‑base calling (or consumes modBAM/CRAM).
    • Performs alignment (e.g., minimap2) to your reference, keeps MM/ML tags.
    • Generates per‑site/ per‑region summaries, tracks (bedGraph/bigWig), and QC.
  • Inputs: reads (FASTQ/FAST5) or modBAM + reference FASTA; sample sheet with metadata.
  • Outputs: modBAM/CRAM + indices, per‑site methylation tables, genome tracks, multiQC‑style reports.

(Exact CLI flags vary by version; coordinate with the provider or your compute environment.)


6) QC & caveats

  • Depth matters: 6‑mA/4‑mC need higher coverage than 5‑mC.
  • Model choice: Use the correct Dorado modified‑base model for your chemistry/flow cell and target modifications.
  • Reference fidelity: Use the same reference throughout (and document version).
  • BAM integrity: Verify MM/ML tags exist; confirm alignment header matches the provided FASTA.
  • Context effects: Methylation calling is k‑mer context‑dependent; some motifs are easier/harder.
  • Biological interpretation: In bacteria, methylation is often tied to R–M systems, replication, and gene regulation; interpret rates in motif/operon context, not only at single CpG‑style sites.

7) What to ask a provider (email checklist)

  • Will you deliver aligned modBAM/CRAM with MM/ML tags (+ index)?
  • Which modified bases are called (5‑mC, 6‑mA, 4‑mC)? Which Dorado model/version?
  • Do you require us to provide a bacterial reference FASTA for alignment? Will you return the exact reference used?
  • Can you also provide per‑site methylation tracks (bedGraph/bigWig) and a QC report?
  • What coverage will be achieved per sample (target ≥10× for 5‑mC; ≥50× for 6‑mA/4‑mC)?

8) Suggested minimal deliverables

  • modBAM/CRAM aligned to our provided reference (+ .bai/.crai).
  • Reference FASTA and .fai used in alignment/calling.
  • Per‑site tables (tsv) and tracks (bedGraph/bigWig).
  • Brief QC (coverage, fraction modified by motif, per‑site confidence).

9) Bacterial project recommendation (one‑liner)

For bacteria, profile 6‑mA (and 4‑mC) as primary targets (≥50×), optionally 5‑mC (≥10× if Dcm‑like activity expected), using Dorado modified‑base calling and aligned modBAM/CRAM with MM/ML tags; summarize with modkit/methylartist and integrate with RNA‑seq.


10) Handy pointers & checks (quick ref)

  • Check BAM has mods: samtools view mod.bam | head → look for MM:Z: and ML:B:C tags (leave out -h here, otherwise header lines fill the output).
  • Confirm reference: samtools view -H mod.bam | grep '^@SQ' and keep the FASTA.
  • Summarize (example modkit; options vary by version, check modkit pileup --help): modkit pileup mod.bam out.bed --ref ref.fa
  • Visualize: Load bigWig/bedGraph in IGV/JBrowse; overlay RNA‑seq coverage/DE results.

Prepared from the morning discussion to serve as a self‑contained guide and hand‑off document.

RNA-seq Sample Interpretation & Design (Updated Mapping) for RNAseq_2025_LB_vs_Mac_ATCC19606

This document explains the background and rationale of your RNA-seq dataset based on sample names, updates the genotype mapping (AB = ΔadeAB; IJ = ΔadeIJ; WT19606 = WT; W1/Y1 = clinical WT isolates), and provides ready-to-run analysis contrasts aligned with your research proposal.


1) Naming logic (decoded)

  • Medium prefix

    • LB- → growth in LB (non-selective baseline; total transcriptome without membrane/efflux stress).
    • Mac- → growth in MacConkey (bile salts + crystal violet; membrane/outer-membrane/efflux stress condition).
  • Strain/genotype tag

    • WT19606 → A. baumannii ATCC 19606, wild-type (reference WT).
    • IJ → ΔadeIJ (AdeIJK RND efflux knockout).
    • AB → ΔadeAB (AdeABC RND efflux knockout).
    • W1, Y1 → two clinical wild-type isolates (distinct lineages).
  • Replicates

    • r1 / r2 / r3 / r4 → biological replicates.
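The naming scheme can be decoded mechanically. The sketch below is one hedged way to do it in Python; the function name `parse_sample` and the returned keys are illustrative choices, and it assumes every sample name has exactly the three dash-separated parts described above.

```python
MEDIUM = {"LB": "LB", "Mac": "MacConkey"}
GENOTYPE = {
    "WT19606": "WT",
    "IJ": "ΔadeIJ",
    "AB": "ΔadeAB",
    "W1": "WT",  # clinical isolate
    "Y1": "WT",  # clinical isolate
}


def parse_sample(name: str) -> dict[str, str]:
    """Split '<medium>-<strain>-<replicate>' and look up medium and genotype."""
    medium_tag, strain, replicate = name.split("-")
    return {
        "sample": name,
        "medium": MEDIUM[medium_tag],   # KeyError flags an unknown prefix
        "strain": strain,
        "genotype": GENOTYPE[strain],   # KeyError flags an unknown strain tag
        "replicate": replicate,
    }


print(parse_sample("Mac-IJ-r4"))
```

Failing loudly on an unknown tag is deliberate: a mistyped sample name should stop the run rather than end up in the wrong group.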

Note on proposal text: The provided proposal did not explicitly name W1/Y1; it referred to “clinical isolates” generally. If you analyze W1/Y1, add a Methods line defining them (e.g., source, genotype as WT).


2) Updated mapping table

Sample prefix | Medium | Strain | Genotype | Purpose / Rationale
LB-AB-* | LB | AB | ΔadeAB | Baseline transcriptome of AdeAB knockout (no stress)
Mac-AB-* | MacConkey | AB | ΔadeAB | Stress response without AdeAB (efflux-dependent programs)
LB-IJ-* | LB | IJ | ΔadeIJ | Baseline of AdeIJK knockout
Mac-IJ-* | MacConkey | IJ | ΔadeIJ | Stress response without AdeIJK
LB-WT19606-* | LB | ATCC 19606 | WT | Reference WT baseline (anchor strain)
Mac-WT19606-* | MacConkey | ATCC 19606 | WT | Reference WT under stress (gold-standard contrast vs ΔadeAB/ΔadeIJ)
LB-W1-* | LB | W1 | WT (clinical) | Baseline of clinical WT lineage W1
Mac-W1-* | MacConkey | W1 | WT (clinical) | W1 under stress
LB-Y1-* | LB | Y1 | WT (clinical) | Baseline of clinical WT lineage Y1
Mac-Y1-* | MacConkey | Y1 | WT (clinical) | Y1 under stress

Why these conditions matter (proposal-aligned):

  • LB captures baseline networks.
  • Mac induces the membrane/efflux stress program that revealed R vs S behavior in your proposal and is tightly linked to RND function.
  • Contrasting WT vs efflux knockouts (ΔadeAB/ΔadeIJ) under LB/Mac tests both genotype main effects and genotype × stress interactions.
  • Multiple WT lineages (WT19606/W1/Y1) allow testing conservation vs isolate-specificity of stress responses (avoid overfitting to one WT).

3) Suggested metadata file (samples.csv)

sample,fastq_1,fastq_2,medium,strain,genotype,replicate,notes,strandedness
LB-AB-r1,LB-AB-r1_R1.fq.gz,LB-AB-r1_R2.fq.gz,LB,AB,ΔadeAB,r1,AdeAB knockout baseline,auto
LB-AB-r2,LB-AB-r2_R1.fq.gz,LB-AB-r2_R2.fq.gz,LB,AB,ΔadeAB,r2,AdeAB knockout baseline,auto
LB-AB-r3,LB-AB-r3_R1.fq.gz,LB-AB-r3_R2.fq.gz,LB,AB,ΔadeAB,r3,AdeAB knockout baseline,auto
LB-IJ-r1,LB-IJ-r1_R1.fq.gz,LB-IJ-r1_R2.fq.gz,LB,IJ,ΔadeIJ,r1,AdeIJK knockout baseline,auto
LB-IJ-r2,LB-IJ-r2_R1.fq.gz,LB-IJ-r2_R2.fq.gz,LB,IJ,ΔadeIJ,r2,AdeIJK knockout baseline,auto
LB-IJ-r4,LB-IJ-r4_R1.fq.gz,LB-IJ-r4_R2.fq.gz,LB,IJ,ΔadeIJ,r4,AdeIJK knockout baseline,auto
LB-W1-r1,LB-W1-r1_R1.fq.gz,LB-W1-r1_R2.fq.gz,LB,W1,WT,r1,clinical WT W1 baseline,auto
LB-W1-r2,LB-W1-r2_R1.fq.gz,LB-W1-r2_R2.fq.gz,LB,W1,WT,r2,clinical WT W1 baseline,auto
LB-W1-r3,LB-W1-r3_R1.fq.gz,LB-W1-r3_R2.fq.gz,LB,W1,WT,r3,clinical WT W1 baseline,auto
LB-WT19606-r2,LB-WT19606-r2_R1.fq.gz,LB-WT19606-r2_R2.fq.gz,LB,WT19606,WT,r2,reference WT baseline,auto
LB-WT19606-r3,LB-WT19606-r3_R1.fq.gz,LB-WT19606-r3_R2.fq.gz,LB,WT19606,WT,r3,reference WT baseline,auto
LB-WT19606-r4,LB-WT19606-r4_R1.fq.gz,LB-WT19606-r4_R2.fq.gz,LB,WT19606,WT,r4,reference WT baseline,auto
LB-Y1-r2,LB-Y1-r2_R1.fq.gz,LB-Y1-r2_R2.fq.gz,LB,Y1,WT,r2,clinical WT Y1 baseline,auto
LB-Y1-r3,LB-Y1-r3_R1.fq.gz,LB-Y1-r3_R2.fq.gz,LB,Y1,WT,r3,clinical WT Y1 baseline,auto
LB-Y1-r4,LB-Y1-r4_R1.fq.gz,LB-Y1-r4_R2.fq.gz,LB,Y1,WT,r4,clinical WT Y1 baseline,auto
Mac-AB-r1,Mac-AB-r1_R1.fq.gz,Mac-AB-r1_R2.fq.gz,MacConkey,AB,ΔadeAB,r1,AdeAB knockout under stress,auto
Mac-AB-r2,Mac-AB-r2_R1.fq.gz,Mac-AB-r2_R2.fq.gz,MacConkey,AB,ΔadeAB,r2,AdeAB knockout under stress,auto
Mac-AB-r3,Mac-AB-r3_R1.fq.gz,Mac-AB-r3_R2.fq.gz,MacConkey,AB,ΔadeAB,r3,AdeAB knockout under stress,auto
Mac-IJ-r1,Mac-IJ-r1_R1.fq.gz,Mac-IJ-r1_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r1,AdeIJK knockout under stress,auto
Mac-IJ-r2,Mac-IJ-r2_R1.fq.gz,Mac-IJ-r2_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r2,AdeIJK knockout under stress,auto
Mac-IJ-r4,Mac-IJ-r4_R1.fq.gz,Mac-IJ-r4_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r4,AdeIJK knockout under stress,auto
Mac-W1-r1,Mac-W1-r1_R1.fq.gz,Mac-W1-r1_R2.fq.gz,MacConkey,W1,WT,r1,clinical WT W1 under stress,auto
Mac-W1-r2,Mac-W1-r2_R1.fq.gz,Mac-W1-r2_R2.fq.gz,MacConkey,W1,WT,r2,clinical WT W1 under stress,auto
Mac-W1-r3,Mac-W1-r3_R1.fq.gz,Mac-W1-r3_R2.fq.gz,MacConkey,W1,WT,r3,clinical WT W1 under stress,auto
Mac-WT19606-r2,Mac-WT19606-r2_R1.fq.gz,Mac-WT19606-r2_R2.fq.gz,MacConkey,WT19606,WT,r2,reference WT under stress,auto
Mac-WT19606-r3,Mac-WT19606-r3_R1.fq.gz,Mac-WT19606-r3_R2.fq.gz,MacConkey,WT19606,WT,r3,reference WT under stress,auto
Mac-WT19606-r4,Mac-WT19606-r4_R1.fq.gz,Mac-WT19606-r4_R2.fq.gz,MacConkey,WT19606,WT,r4,reference WT under stress,auto
Mac-Y1-r2,Mac-Y1-r2_R1.fq.gz,Mac-Y1-r2_R2.fq.gz,MacConkey,Y1,WT,r2,clinical WT Y1 under stress,auto
Mac-Y1-r3,Mac-Y1-r3_R1.fq.gz,Mac-Y1-r3_R2.fq.gz,MacConkey,Y1,WT,r3,clinical WT Y1 under stress,auto
Mac-Y1-r4,Mac-Y1-r4_R1.fq.gz,Mac-Y1-r4_R2.fq.gz,MacConkey,Y1,WT,r4,clinical WT Y1 under stress,auto

4) DE design & contrasts (DESeq2/edgeR)

Model

  • Reference-genotype focus (WT19606 vs knockouts):
    ~ medium * genotype
    where medium ∈ {LB, MacConkey}, genotype ∈ {WT, ΔadeAB, ΔadeIJ}.
    (Set LB and WT as the reference levels of medium and genotype, so that the intercept corresponds to WT19606 in LB.)

  • All strains (WT lineages):
    ~ medium * strain
    where strain ∈ {WT19606, W1, Y1, ΔadeAB, ΔadeIJ}.

Hypothesis-driven contrasts

  1. Stress response within each background:
    Mac vs LB for WT19606, ΔadeAB, ΔadeIJ, W1, Y1.
    → Membrane/efflux-stress regulons; check adeA/B/C, adeI/J/K, envelope stress, OM biogenesis, osmotic/ions, ribosome PQC.

  2. Genotype main effect at baseline (LB):
    ΔadeAB vs WT19606 (LB), ΔadeIJ vs WT19606 (LB).
    → Efflux-independent differences; compensatory pathways.

  3. Genotype effect under stress (Mac):
    ΔadeAB vs WT19606 (Mac), ΔadeIJ vs WT19606 (Mac).
    → How loss of AdeAB/AdeIJK alters the stress transcriptome.

  4. Interaction (genotype × medium):
    (ΔadeAB_Mac − ΔadeAB_LB) − (WT19606_Mac − WT19606_LB); same for ΔadeIJ.
    Core test: does RND loss change the stress response itself?

  5. Conservation across WT lineages:
    Intersect Mac vs LB DEGs across WT19606/W1/Y1 to define a conserved “Mac stress signature”; then identify isolate-specific modules.
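The interaction contrast in item 4 is a difference of differences. For a balanced two-by-two layout, the interaction coefficient of an ordinary linear model on log2 expression equals exactly this quantity, which the toy Python sketch below computes from cell means. The numbers are made up for one hypothetical gene; DESeq2 estimates the same target with a negative-binomial model rather than simple means.

```python
from statistics import mean

# Made-up log2 expression for one gene, three replicates per (medium, genotype) cell.
log2_expr = {
    ("LB", "WT"): [4.9, 5.0, 5.1],
    ("Mac", "WT"): [6.9, 7.0, 7.1],
    ("LB", "ΔadeAB"): [4.9, 5.0, 5.1],
    ("Mac", "ΔadeAB"): [5.9, 6.0, 6.1],
}


def stress_response(genotype: str) -> float:
    """Mac minus LB log2 fold change within one genotype."""
    return mean(log2_expr[("Mac", genotype)]) - mean(log2_expr[("LB", genotype)])


# (ΔadeAB_Mac − ΔadeAB_LB) − (WT_Mac − WT_LB)
interaction = stress_response("ΔadeAB") - stress_response("WT")
print(round(interaction, 3))
```

A negative value here means the knockout induces this gene less strongly under MacConkey than the wild type does, even though both start from the same LB baseline.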

QC

  • Verify strandedness with RSeQC (you set auto), mapping rate, rRNA %, insert size.
  • PCA: expect Mac vs LB separation; ΔadeIJ should diverge strongly under Mac.
  • Validate a small panel via RT-qPCR (ade genes + OM stress markers).

5) Where this plugs into the proposal

  • Mac vs LB embodies the proposal’s “LB = who is alive” vs “Mac = who can cope” paradigm (membrane/efflux stress).
  • ΔadeAB / ΔadeIJ model the role of RND efflux in stress adaptation and heterogeneity.
  • Multiple WT lineages prevent overfitting to ATCC 19606 and test generalizability.
  • Downstream integration with PAP/MDK/R% metrics links transcriptome to phenotype (R vs S).

If you want, I can also generate a starter DESeq2 R script that reads this samples.csv, sets factors/contrasts, and outputs PCA, volcano plots, and KEGG/GO enrichment stubs.

Model selection and comparison: how to answer "two ≈ one at 17 h, two > one at 24 h (especially ΔadeIJ)" with limited power

  • Research_Proposal-E_Figure3
  • Research_Proposal-E_Figure4

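Sub-MIC-driven phenotypic heterogeneity and resistance evolution: mechanisms and interventions centered on the RND efflux pumps of A. baumannii

Model selection and comparison: how to answer "two ≈ one at 17 h, two > one at 24 h (especially ΔadeIJ)" with limited power

Comparison of the three models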

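1) Stratify by time (Option A: one model per time point)

Formula: in the 17 h subset: ~ exposure * genotype; in the 24 h subset: ~ exposure * genotype
(Within a single subset time_h is constant, so adding + time_h would be redundant.)

Questions it can answer

  • Gives every core contrast within a time point directly (e.g. two vs one, genotype differences, and their interaction).
  • A good fit for your current question: "is two ≈ one at 17 h, but two > one at 24 h?"

Advantages

  • Few parameters and high power (no three-way interaction to estimate).
  • Intuitive results: one DEG set per time point, easy to match against what Fig. 3c shows.
  • Robust to unbalanced designs (more stable when the two time points differ in sample number or outliers).

Disadvantages

  • Cannot formally test, within one model, whether time changes the exposure × genotype effect (there is no overall significance test for the three-way interaction).
  • Cross-time comparison has to be done at the results level (e.g. comparing LFCs or DEG counts between 17 h and 24 h), not as a model parameter.

When to use: limited sample size and a primary interest in per-time effects; get clear, well-powered per-time results first.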


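2) Global reduced model: ~ exposure * genotype + time_h

Questions it can answer

  • "Is the exposure effect modified by genotype?" (it contains the exposure × genotype interaction)
  • It also controls for an average main effect of time.

Advantages

  • Fewer parameters and more power than the full model.
  • Still allows a difference-in-differences interaction test (but as an average across the two time points).

Disadvantages

  • Assumes the exposure × genotype interaction is the same or similar at 17 h and 24 h;
  • Cannot tell you "small at 17 h, large at 24 h" (the time-specific interaction is averaged out).

When to use: if PCA or exploratory analysis shows that the time effect is small and parallel across groups, and you only want an interaction conclusion in the average sense.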


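3) Full model: ~ exposure * genotype * time_h

Questions it can answer

  • All main effects and all interactions, in particular the three-way interaction (which tests whether the exposure × genotype difference changes over time).
  • An LRT (likelihood-ratio test) against a reduced model can formally demonstrate time dependence.

Advantages

  • Statistically the most complete; it can test the "small at 17 h, large at 24 h" time dependence you propose in one go.

Disadvantages

  • The most parameters and the lowest power (harder to reach significance at the gene level; heavier multiple-testing burden).
  • Somewhat harder to interpret (many coefficients; contrasts have to be composed).

When to use: ample samples or a strong signal, or to confirm the conclusion that "time changes the interaction" (for example after seeing a strong trend in Option A).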
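The "fewer parameters" argument can be checked by counting design-matrix columns. The sketch below assumes treatment coding and the factor levels described in the study design (exposure: No/One/Two; genotype: WT/ΔadeIJ; time: 17/24 h); the helper name `n_coefficients` is illustrative.

```python
from math import prod

LEVELS = {"exposure": 3, "genotype": 2, "time_h": 2}


def n_coefficients(terms: list[tuple[str, ...]]) -> int:
    """Intercept plus, for each term, the product of (levels - 1) over its factors."""
    return 1 + sum(prod(LEVELS[f] - 1 for f in term) for term in terms)


per_time = [("exposure",), ("genotype",), ("exposure", "genotype")]
reduced = per_time + [("time_h",)]
full = reduced + [
    ("exposure", "time_h"),
    ("genotype", "time_h"),
    ("exposure", "genotype", "time_h"),
]

print(n_coefficients(per_time))  # Option A, each per-time model
print(n_coefficients(reduced))   # ~ exposure * genotype + time_h
print(n_coefficients(full))      # ~ exposure * genotype * time_h
print(n_coefficients(full) - n_coefficients(reduced))  # LRT degrees of freedom
```

Under these assumptions the full model has 12 coefficients against 7 for the global reduced model, so an LRT between them has 5 degrees of freedom and tests all time interactions jointly, not only the three-way term.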


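Recommendation for this dataset (given your goal and the current signal)

Main analysis: prefer Option A (stratify by time)

  • Fit ~ exposure * genotype separately at 17 h and 24 h and export two vs one (one set each for WT and ΔadeIJ).
  • This matches how Fig. 3c is read and has the best power; if your expectation is right, two vs one is largely non-significant at 17 h and significant for many more genes at 24 h, especially in ΔadeIJ.

Confirming time dependence (optional but recommended): also run the full model

  • Fit ~ exposure * genotype * time_h and compare it by LRT against the reduced model ~ exposure * genotype + time_h.
    If many genes are significant in the LRT ⇒ time-dependent effects are real. Because this comparison drops every time interaction at once, it tests them jointly; to isolate the three-way term, compare against a reduced model that keeps all two-way interactions.
  • For key genes and pathways, additionally report the direction and effect size of the three-way interaction.

If power is even more limited or you only want one "overall interaction" conclusion

  • You can report only the Global reduced result (the average exposure × genotype interaction),
  • and add the Option A volcano plots as time-stratified visual support.

Short decision tree

  • You want to see the difference between 17 h and 24 h clearly → Option A
  • You want to show statistically that the difference changes with time → Full + LRT (can follow Option A)
  • You want maximum power and only need an average conclusion → Global reduced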

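Practical tips

  • Keep sample numbers at the two time points as balanced as possible; deal with outliers first (PCA / library size / between-sample correlation).
  • Add covariates such as batch or library prep when needed (+ batch); this works in all three models.
  • Report DEG counts, FDR and LFCs of representative genes together, and use GSEA/enrichment for pathway-level comparison.
  • Figures: a two vs one volcano plot/heatmap per time point; for the full model, the LRT p-value distribution or an overview of significant pathways.

One-sentence conclusion

To answer what you care about most right now, "two ≈ one at 17 h, two > one at 24 h, especially in ΔadeIJ", Option A is the first choice (clear and well powered); then add the full model + LRT as a formal confirmation of time dependence.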


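0. Glossary at a glance (for quick reading)

  • MDR: multidrug resistance (Multi-Drug Resistance)
  • RND efflux pumps: Resistance–Nodulation–Division family (in A. baumannii mainly AdeABC/AdeIJK)
  • sub-MIC: exposure/treatment below the minimum inhibitory concentration (MIC)
  • R/S subpopulations: R = tolerant/adapted subpopulation that still forms colonies under selective pressure; S = susceptible subpopulation that fails to form colonies or is impaired
  • PAP: Population Analysis Profiling
  • Time-kill / MDK: time-kill curve / minimum duration for killing (MDK99/MDK99.99)
  • AUM: Artificial Urine Medium
  • MH2B (MHB): Mueller–Hinton II Broth
  • EPI: Efflux Pump Inhibitor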

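1. Background and problem statement

1.1 Clinical and public-health drivers

  • The threat from MDR Gram-negative bacteria is escalating; A. baumannii accounts for a large share of ICU and hospital-acquired infections, and treatment options are limited.
  • Susceptibility to last-line drugs (tigecycline, carbapenems, etc.) is declining and varies widely by region, pointing to the importance of environmental and in-host factors.
  • Sub-MIC exposure is widespread in the clinic (pharmacokinetics / tissue penetration / biofilms) and in the environment (wastewater / sludge / water bodies) and can drive evolutionary processes such as adaptation, tolerance and heteroresistance.

1.2 Knowledge gaps

  • RND efflux pumps are recognized as key to A. baumannii resistance, but gene expression and phenotypic resistance do not always agree;
  • Heterogeneity (heteroresistance / tolerance / persistence) and the effect of host-relevant media (AUM/urine) are hard to see in routine AST;
  • There is no systematic chain of evidence linking sub-MIC → phenotypic heterogeneity → transcriptional reprogramming in the host environment.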

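2. Scientific questions, core hypotheses and research goals

2.1 Scientific questions (What)

1) How does sub-MIC tigecycline induce or amplify phenotypic heterogeneity and tolerant/adapted subpopulations in A. baumannii?
2) What kinetic and regulatory roles do the RND efflux pumps (AdeABC/AdeIJK) play in this process?
3) How do host-associated environments such as urine/AUM reprogram transcriptional networks and alter susceptibility phenotypes and virulence?

2.2 Core hypotheses (Why + How)

  • H1: Repeated sub-MIC exposure, acting through stress responses / membrane robustness / metabolic reallocation, selects for or induces a rise in the R subpopulation that "can still form colonies under membrane/efflux stress"; this is especially pronounced in the RND-deficient strain (ΔadeIJ).
  • H2: RND pumps not only lower intracellular drug directly through efflux but also affect membrane robustness, metabolism and stress pathways through allostery / energy coupling, amplifying phenotypic heterogeneity.
  • H3: Urine/AUM combine multiple factors (osmolarity, urea, metal ions, carbon-source limitation) that systematically reprogram efflux pumps and metabolic networks, leading to MIC shifts and altered virulence.

2.3 Research goals (Deliverables)

1) Establish a quantitative framework for sub-MIC → R/S subpopulations (PAP, MDK, pulse survival) and pinpoint the molecular mechanisms;
2) Dissect the role of RND pumps in heterogeneity and in adaptation to the host environment;
3) Build a UTI-relevant paradigm for interpreting susceptibility and a list of EPI targets.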


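3. Overall experimental design and sample system

3.1 Strains and treatments

  • Strains: WT (wild type) and ΔadeIJ (RND-deficient), extended to ΔadeIJK if needed.
  • Treatments and naming logic:
    • No (unexposed): WT_17/24, deltaIJ_17/24
    • One exposure (one sub-MIC exposure): pre_WT_17/24, pre_deltaIJ_17/24
    • Two exposures (two sub-MIC exposures, 0.5×MIC ×2): 0_5_WT_17/24, 0_5_deltaIJ_17/24
    • Time points: 17 h, 24 h (or batch sampling at 8/16/24 h)
  • Rationale: the "0_5" prefix denotes two pulses at 0.5×MIC each, emphasizing the cumulative sub-MIC effect.

3.2 Media and use cases

  • LB (liquid/plates): growth curves (liquid), total CFU and master plates (plates).
  • MacConkey agar (bile salts + crystal violet): membrane/efflux stress selection, used to reveal R/S subpopulations and to compute CFU_Mac/CFU_LB.
  • MH2B (liquid): reference for standard susceptibility testing and kinetics.
  • AUM (liquid) / Urine (liquid): mimic the urinary-tract environment; used for growth, MIC and RNA-seq.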

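4. Metrics and statistics (unified definitions)

4.1 Key metric definitions

  • R fraction (the Fig. 3c metric): R% = CFU_Mac / CFU_LB × 100%
  • R/S ratio (optional to report): R/S = CFU_Mac / (CFU_LB − CFU_Mac)
  • PAP tail fraction: p_tail = ΣCFU(≥MIC) / CFU(no drug)
  • AUC (area under the PAP curve): measures overall tolerance
  • MDK99 / MDK99.99: time needed to reach 99% / 99.99% killing
  • Pulse survival: survival = CFU_pulse / CFU_start
  • RNA-seq: DEGs (|log2FC| and FDR), pathway enrichment (KEGG/GO), principal-component contributions (PC1/PC2)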
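The plate-count metrics above are simple ratios. The Python sketch below shows one way to compute them; function names and the example CFU values are illustrative, and confidence intervals are left out.

```python
def r_percent(cfu_mac: float, cfu_lb: float) -> float:
    """R% = CFU_Mac / CFU_LB × 100."""
    return 100.0 * cfu_mac / cfu_lb


def rs_ratio(cfu_mac: float, cfu_lb: float) -> float:
    """R/S = CFU_Mac / (CFU_LB − CFU_Mac); only defined when CFU_LB > CFU_Mac."""
    if cfu_lb <= cfu_mac:
        raise ValueError("R/S is undefined when CFU_Mac >= CFU_LB")
    return cfu_mac / (cfu_lb - cfu_mac)


def pulse_survival(cfu_pulse: float, cfu_start: float) -> float:
    """survival = CFU_pulse / CFU_start."""
    return cfu_pulse / cfu_start


# Made-up counts (CFU/mL) from one paired LB / MacConkey plating.
print(r_percent(5.0e7, 2.0e8))
print(rs_ratio(5.0e7, 2.0e8))
print(pulse_survival(3.0e5, 1.0e8))
```

Plating noise can push CFU_Mac above CFU_LB, in which case R/S is undefined; the sketch raises an error in that case instead of returning a negative ratio.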

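4.2 Statistical tests and multiple-testing correction

  • Group comparisons: t test / Mann–Whitney; for multiple groups, ANOVA / Kruskal–Wallis + FDR/Bonferroni.
  • Proportions/counts: binomial confidence intervals, Fisher's exact test, GLM (logit).
  • Time–survival data: piecewise regression / nonlinear fitting; confidence intervals or bootstrap for MDK.
  • RNA-seq: DESeq2 (FDR < 0.05; |log2FC| threshold pre-registered).

4.3 Biological replicates and effect sizes

  • ≥3 biological replicates per group; report effect sizes (Cohen's d or Cliff's delta) with 95% CI rather than p values alone.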

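5. SOPs: core experimental workflows (can be pasted straight into Methods)

5.1 Parallel plating (LB vs MacConkey) and R fraction

1) Standardize the inoculum by OD/CFU;
2) At the target time points (8/16/24 h), sample and plate in parallel on LB agar and MacConkey agar;
3) Incubate overnight and count CFU_LB / CFU_Mac;
4) Compute R% = CFU_Mac / CFU_LB × 100% and record the 95% CI;
5) Test the differences No vs One vs Two and WT vs ΔadeIJ.

5.2 Replica plating to isolate S (susceptible)

1) First prepare a master plate on drug-free LB (100–300 discrete colonies);
2) Replica-plate the whole plate onto MacConkey (± drug plates) with velvet or a membrane;
3) Positions that do not grow or are clearly impaired on the selective plate = S candidates;
4) Go back to the corresponding positions on the LB master plate to pick S, purify for 2–3 rounds, and bank the isolates;
5) Verify: MIC (not raised / close to the original population), time-kill (no long tail), low PAP tail.

5.3 PAP (Population Analysis Profiling)

1) Prepare a drug ladder: 0, 0.25×, 0.5×, 1×, 2×MIC (extendable);
2) Plate equal volumes → count CFU;
3) Plot log10 CFU vs concentration and compute AUC and p_tail;
4) Test differences between treatments and report effect sizes.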
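Step 3 of the PAP workflow can be sketched as follows. This is a minimal Python illustration that assumes concentrations are expressed in multiples of MIC, that every plate has a non-zero count (substitute the detection limit otherwise), and that AUC is taken by the trapezoid rule on log10 CFU; the counts are made up.

```python
import math


def pap_auc(conc_x_mic: list[float], cfu: list[float]) -> float:
    """Trapezoidal area under log10(CFU) versus concentration (× MIC)."""
    logs = [math.log10(c) for c in cfu]
    return sum(
        (conc_x_mic[i + 1] - conc_x_mic[i]) * (logs[i] + logs[i + 1]) / 2
        for i in range(len(cfu) - 1)
    )


def pap_tail(conc_x_mic: list[float], cfu: list[float]) -> float:
    """p_tail = sum of CFU at concentrations >= 1× MIC, divided by CFU without drug."""
    no_drug = cfu[conc_x_mic.index(0)]
    return sum(c for x, c in zip(conc_x_mic, cfu) if x >= 1) / no_drug


conc = [0, 0.25, 0.5, 1, 2]          # drug ladder in × MIC
cfu = [1e8, 1e7, 1e6, 1e4, 1e2]      # made-up counts (CFU/mL)
print(pap_auc(conc, cfu))
print(pap_tail(conc, cfu))
```

Because AUC depends on the concentration range, compare treatments only across PAPs run on the same ladder.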

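5.4 Time-kill / MDK

1) Fix a high drug concentration (often ≥10×MIC for bactericidal drugs; for tetracyclines, a combination or a tolerance-characterization scheme is an option);
2) Sample at intervals over 0–24 h and count CFU;
3) Fit the curve and estimate MDK99/MDK99.99;
4) A higher MDK or a heavier long tail = increased tolerance.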
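As a simple alternative to curve fitting, MDK can be read off a time-kill series by linear interpolation on the log10 survival scale. The Python sketch below assumes the first time point is the untreated starting count and expresses the kill level as a log10 reduction (2 for MDK99, 4 for MDK99.99); the data are made up.

```python
import math


def mdk(times_h: list[float], cfu: list[float], log_reduction: float):
    """Time at which log10(CFU/CFU_0) first reaches -log_reduction, or None."""
    target = -log_reduction
    logs = [math.log10(c / cfu[0]) for c in cfu]
    for i in range(1, len(times_h)):
        if logs[i] <= target:
            t0, t1 = times_h[i - 1], times_h[i]
            y0, y1 = logs[i - 1], logs[i]
            # Linear interpolation between the two bracketing time points.
            return t0 + (target - y0) * (t1 - t0) / (y1 - y0)
    return None  # the kill level was never reached within the sampled window


times = [0, 2, 4, 8, 24]               # hours
cfu = [1e8, 1e7, 1e5, 1e4, 1e3]        # made-up counts (CFU/mL)
print(mdk(times, cfu, 2))   # MDK99
print(mdk(times, cfu, 4))   # MDK99.99
print(mdk(times, cfu, 6))   # not reached
```

A return value of None is itself informative: a kill level that is not reached within the window points to a long tail, which the SOP reads as increased tolerance.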

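5.5 Pulse survival

1) Apply a short high-concentration pulse (10–20×MIC, 30–60 min);
2) Stop the drug action, allow recovery, and count CFU;
3) Compare survival fractions between treatments.

5.6 RNA-seq (urine/AUM/MH2B and exposure controls)

  • Pipeline: FastQC → HISAT2/STAR → featureCounts → DESeq2;
  • Batch design: WT/ΔadeIJ × (urine/AUM/MH2B) × (No/One/Two) × time points;
  • Outputs: PCA (PC1/PC2), DEGs (FDR<0.05), KEGG/GO enrichment; focus on efflux pumps (adeA/B/C, adeI/J/K) and membrane/metabolism/stress pathways.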

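6. Preliminary results (the same chain of evidence as above)

6.1 Fig. 3: sub-MIC exposure → adaptation and rise of the R subpopulation

  • 3a: workflow schematic (matrix not drawn).
  • 3b (LB liquid): growth recovery/survival is Two > One > None.
  • 3c (R%): R% = CFU_Mac/CFU_LB; in ΔadeIJ, Two > One > None (most pronounced at 16–24 h; *p<0.005, p<0.01**); little change in WT.

6.2 Fig. 4: R/S phenotypic split after pre-exposure and validation on drug plates

  • LB vs MacConkey: R grows; S cannot grow or is impaired; WT grows on both plates.
  • Drug plates (tetracycline, carbapenem, rifampicin, polymyxin, vancomycin): S cannot form colonies; R is more tolerant.

6.3 Fig. 5: medium effects (AUM/Urine/MH2B)

  • Growth: fastest in MH2B; moderate in urine; slower in AUM.
  • MIC shift: in AUM, tetracyclines ↓ ~4× and carbapenems ↓ ~8× (vs MH2B), indicating a medium effect.

6.4 Fig. 6: global transcriptional reprogramming in urine/AUM

  • PCA: urine clusters along PC1 (~75%); AUM and MH2B separate along PC2 (~20%).
  • DEGs/pathways: each medium has its own set with partial overlap; urease up; adeB up in urine, adeA/adeB down in AUM; enrichment for amino-acid/carbon metabolism, TCA, etc.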

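7. Quality control (QC) and confounder control

  • Inoculum and growth phase: standardize OD600/CFU and growth stage to avoid inoculum effects.
  • Time points: fixed 8/16/24 h (or 17/24 h) windows; keep replicates consistent.
  • Parallel plating: plate the same dilution in parallel on LB vs MacConkey, in the same batch.
  • Batch effects: use a blocked design / ComBat correction for RNA-seq; check batch drift by PCA.
  • Drug potency: MIC calibration, fresh preparation, stability checks; monitor AUM/urine formulation and pH/osmolarity.
  • Pre-registration: pre-register thresholds (FDR, |log2FC|, effect size) and report in full.

8. Risks and fallback plans (Mitigation)

  • R/S difference not significant: increase replicates; add a MacConkey bile-salt/crystal-violet gradient; lengthen or adjust the exposure scheme.
  • High RNA-seq variability: increase sample size or focus on key time points; add RT-qPCR validation.
  • WT also shows a clear rise in R: bring in ΔadeABC / ΔadeIJK and complemented strains to dissect specificity.
  • MIC and tolerance definitions confused: use the dual standard "MIC (steady state) + MDK (kinetics)" to separate them.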

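9. Project management: timeline and milestones

Period | Core tasks | Deliverables
2026 Q1–Q2 | Growth curves / MIC / PAP / parallel-plating setup | QC report, first SOP version, baseline data
2026 Q3 | Evolution experiments / time-kill / MDK / pulse survival | AUC/MDK/survival statistics and figures
2026 Q4 | RNA-seq (urine/AUM/MH2B), WGS | PCA, DEGs, KEGG/GO, mutation spectrum
2027 Q1 | Knockout construction and phenotypic/molecular validation | RT-qPCR / WB / silver staining / metabolite data
2027 Q2 | Integrated analysis and model building | Mechanism diagram, integrated map (Fig. Model)
2027 Q3–Q4 | Papers / conferences / patents | 3–4 manuscript drafts, conference abstracts, EPI target list

10. Expected outputs and academic/clinical value

  • Mechanistic contribution: the full chain, sub-MIC exposure → RND/membrane/metabolism → R/S heterogeneity → transcriptional reprogramming in the UTI environment → susceptibility changes.
  • Tools and paradigms: a unified reporting scheme for R fraction / PAP / MDK / pulse survival; a flag for medium effects in UTI-relevant AST.
  • Translational potential: EPI design focused on TM allosteric / energy-coupling sites; pointers for optimizing drugs and combinations in the UTI setting.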

11. Appendix: ready-to-reuse figure legends (condensed, in English)

  • Fig. 3:Sub-MIC tigecycline exposures (0.5×MIC ×1/×2) promote adaptation in ΔadeIJ; LB growth (OD600) shows Two > One > None; MacConkey tolerance ratio R% = CFU_Mac/CFU_LB increases significantly at 16–24 h (*p<0.005, p<0.01**); WT shows minimal change.
  • Fig. 4:Post-exposure ΔadeIJ splits into R (grows on MacConkey) and S (impaired); S fails on drug plates (TET/CARB/RIF/PMB/VAN).
  • Fig. 5:MH2B fastest growth; urine moderate; AUM slower; MICs drop in AUM (TET ~4×, CARB ~8× vs MH2B).
  • Fig. 6:Urine clusters on PC1 (~75%); AUM separates on PC2 (~20%); urease up; adeB up in urine, adeA/adeB down in AUM; AA/carbon/TCA enriched.

Toxin–Antitoxin (TA) Systems & Pulldown Experiments — Practical Guide

TA_operon_shared_promoter_v3_en

A consolidated reference covering TA gene organization and regulation, promoter vs RBS roles, co-transcription criteria, start-codon troubleshooting, RNA-seq analysis strategy, pulldown experiment design/controls/statistics, and multi-omics integration.


1) TA System Overview

  • Composition: Paired genes encoding a toxin (protein) and an antitoxin (a protein in type II systems, the case assumed throughout this guide; an RNA in some other types).
  • Typical organization: Same strand, antitoxin upstream, toxin downstream, forming a bicistronic operon.
  • Transcriptional control: Frequently transcribed from a shared σ⁷⁰-like promoter (−35/−10) upstream of the antitoxin.
  • Autoregulation: Antitoxin or TA complex often binds operator sites near the promoter to repress transcription. Under stress (e.g., antitoxin proteolysis), repression is relieved → toxin increases.
  • Functions: Stress response, persistence, plasmid maintenance, virulence modulation (family-specific).

2) Promoter vs RBS — Who Does What?

  • Promoter → transcription start.
    • Recognized by RNA polymerase holoenzyme (core RNAP + σ factor; often σ⁷⁰).
    • −35/−10 boxes typically spaced 16–19 bp; TSS sits downstream of −10.
  • RBS (Shine–Dalgarno) → translation start.
    • 16S rRNA (30S) anti-SD tail base-pairs with the RBS to position the start codon (usually ATG, also GTG/TTG).
    • RBS–start codon spacing commonly 5–10 nt.
  • Cheat sheet: Promoter decides where transcription begins; RBS decides where translation begins.

3) Evidence Framework for a Shared Promoter / Co-transcription

Goal: Decide whether antitoxin & toxin belong to the same transcript and quantify co-expression.

3.1 Structural / Sequence Evidence

  1. Genomic context: Same strand; short intergenic (<50–100 bp) or slight overlap.
  2. Promoter prediction: Clear −35/−10 upstream of antitoxin; no strong independent promoter upstream of toxin.
  3. RBS: SD-like motifs upstream of both ORFs.
  4. Terminator: No strong Rho-independent terminator between the pair; terminator at operon end.

3.2 RNA-seq Evidence (Strand-specific libraries preferred)

  1. Coverage continuity: Same-strand coverage crosses the intergenic region.
  2. Spanning fragments: Paired-end insert spans the antitoxin↔toxin boundary.
  3. Expression correlation: From all samples (e.g., 27), compute TPM/CPM correlations; Pearson/Spearman r ≥ 0.8, p<0.01; remains high within each timepoint subset.
  4. DE consistency: For each timepoint’s treated vs control, log2FC for both genes are same direction with FDR<0.05.
  5. (Optional) TSS evidence: 5′-enriched or TSS-seq reveals shared TSS cluster upstream of antitoxin.

Note: Non-strand-specific libraries weaken strand-continuity evidence; interpret cautiously.
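The expression-correlation criterion (item 3) can be sketched in a few lines of Python. The TPM values below are made up, the log2(TPM + 1) transform is one common choice rather than something prescribed here, and the sketch reports r only; the p-value would come from a statistics package.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log2p1(tpm: list[float]) -> list[float]:
    """log2(TPM + 1), which damps the influence of very high-expression samples."""
    return [math.log2(t + 1.0) for t in tpm]


# Made-up TPMs for one TA pair across five samples.
antitoxin_tpm = [10, 20, 40, 80, 160]
toxin_tpm = [5, 11, 19, 42, 78]

r = pearson(log2p1(antitoxin_tpm), log2p1(toxin_tpm))
print(round(r, 3), r >= 0.8)
```

High correlation alone does not prove co-transcription, since two separately transcribed genes under the same regulator would also correlate; that is why the coverage-continuity and spanning-fragment checks sit alongside it.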


4) Why Your Provided Sequences Start with “TTA,” Not “ATG”

  • Observed “TTA” starts suggest:
    1. Sequences include 5′ UTR/promoter (CDS not cut at true start).
    2. Sequences could be reverse-complement relative to coding strand.
    3. Bacteria can use GTG/TTG as starts, but TTA is not a typical start codon.
  • Standard resolution steps:
    • BLAST the fragments to the genome to get strand & coordinates.
    • Six-frame translate; on the correct strand, locate the longest ORF starting with ATG/GTG/TTG and ending at a stop.
    • Verify RBS distance (5–10 nt) and domain homology (BLASTX/HMM against TA families).
    • Use RNA-seq coverage shape/TSS to refine the start site.
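The six-frame step can be sketched in Python as below. One useful observation: the reverse complement of TTA is TAA, a stop codon, so a fragment that starts with TTA is consistent with a CDS delivered on the opposite strand. The function names and the toy sequence are illustrative; the sketch scans DNA codons directly and does not check RBS spacing or domain homology.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq: str):
    """Longest start-to-stop ORF over all six frames.

    Returns (strand, start, end, orf) with 0-based coordinates on the scanned
    strand, or None if no complete ORF exists.
    """
    best = None
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in STARTS:
                    start = i  # most upstream start in this open stretch
                elif start is not None and codon in STOPS:
                    if best is None or i + 3 - start > len(best[3]):
                        best = (strand, start, i + 3, s[start:i + 3])
                    start = None
    return best


fragment = "TTACCCGGGTTTCAT"  # toy fragment that starts with TTA
print(longest_orf(fragment))
```

For this toy fragment the only complete ORF lies on the minus strand and begins with ATG, which is the pattern expected when a provided sequence is the reverse complement of the coding strand.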

5) RNA-seq Analysis Plan for 27 Samples (Example Design)

Design: Same strain × 3 conditions (untreated / Mitomycin C / Moxifloxacin) × 3 timepoints × 3 biological replicates = 27 samples.

5.1 Pipeline Outline

  1. QC & Alignment: FastQC/MultiQC → trimming → align to reference (confirm strand-specificity).
  2. Quantification: featureCounts/Salmon → DESeq2/edgeR normalization.
  3. Differential Expression:
    • For each timepoint, contrast treated vs untreated (include batch if needed).
    • Output per contrast: log2FC, SE, p, FDR.
  4. TA Co-transcription Checks:
    • IGV views: same-strand continuity across intergenic; spanning fragments.
    • Correlation between antitoxin & toxin across 27 samples (r, p).
    • DE direction consistency for both genes.
  5. Pulldown Targets in RNA-seq:
    • For candidate target list, extract log2FC/FDR; produce volcano/heatmaps.
    • Perform functional enrichment (GO/KEGG/COG) with overlap to pulldown hits.
  6. Deliverables:
    • IGV screenshots with annotated −35/−10, TSS, RBS, terminator.
    • MA/volcano plots, sample PCA, correlation plots.
    • Tables summarizing DEGs per timepoint and pulldown×RNA-seq overlaps.

6) Pulldown Experiments — Types, Controls, Statistics

6.1 Types

  • Protein–protein pulldown / affinity purification: Bait = toxin/antitoxin protein (His/FLAG/biotin); ID by LC–MS/MS.
  • Nucleic-acid pulldown:
    • DNA pulldown: bait = promoter/operator DNA; identify bound proteins (MS).
    • RNA pulldown: bait = specific RNA; identify bound proteins (MS) or enriched RNAs.

6.2 Critical Controls

  • Empty vector/beads, irrelevant protein, mutant bait (disrupt binding), competition elution.
  • 3 biological replicates recommended.

6.3 Hit Calling (Proteomics Example)

  • Use SAINT / MSstats / DEP or log2FC + FDR thresholds, e.g. log2FC ≥ 1 & FDR ≤ 0.05, consistently detected in ≥2 replicates.
  • Remove sticky background proteins (CRAPome) and ubiquitous ribosomal/chaperones where appropriate.
  • Deliver a high-confidence candidate list.
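
The threshold-based variant of hit calling can be sketched as a header-safe awk filter, in the same style as the Abricate filters used later in these notes. The column names `log2FC`, `FDR` and `n_detected` are assumptions about how the proteomics result table is laid out, and `call_hits` is a hypothetical helper; model-based tools such as SAINT or MSstats are not reproduced here.

```shell
# Keep the header plus rows with log2FC >= 1, FDR <= 0.05 and
# detection in >= 2 replicates. Reads a tab-separated table on stdin.
call_hits() {
  awk -F'\t' '
    NR == 1 {
      # locate columns by name so column order does not matter
      for (i = 1; i <= NF; i++) {
        if ($i == "log2FC")     lfc = i
        if ($i == "FDR")        fdr = i
        if ($i == "n_detected") det = i
      }
      print; next
    }
    ($lfc + 0) >= 1 && ($fdr + 0) <= 0.05 && ($det + 0) >= 2
  '
}

# Toy table (made-up values)
printf 'protein\tlog2FC\tFDR\tn_detected\nP1\t2.3\t0.001\t3\nP2\t0.4\t0.001\t3\nP3\t1.5\t0.20\t3\nP4\t1.2\t0.04\t1\n' | call_hits
```

Rows with missing values (for example `NA` in the FDR column) would be coerced to 0 by awk, so such rows should be removed before filtering.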

6.4 Integration with RNA-seq

  • Cross-table: pulldown hits vs RNA-seq log2FC/FDR across conditions/timepoints.
  • Enrichment/pathways: overlap enrichment for hits and DEGs.
  • Evidence ladder: 1) Pulldown enrichment (binding); 2) RNA-seq co-expression / DE (regulatory consistency); 3) Biophysical/functional assays (EMSA/SPR/ChIP-qPCR, reporter assays) for validation.
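
The cross-table can be sketched as a simple lookup of pulldown hits in a differential-expression table. The file layouts here are assumptions: a hit list with one gene ID per line, and a tab-separated DE table with a header and the gene ID in column 1. `cross_table` is a hypothetical helper name.

```shell
# Keep only the DE rows whose gene ID appears in the pulldown hit list.
# $1 = hit list (one ID per line); $2 = DE table (header + gene ID in column 1)
cross_table() {
  awk -F'\t' '
    NR == FNR { hit[$1] = 1; next }   # first file: remember hit IDs
    FNR == 1  { print; next }         # second file: keep the header
    ($1 in hit)                       # print DE rows for pulldown hits
  ' "$1" "$2"
}

# Toy inputs (made-up IDs and values)
tmp=$(mktemp -d)
printf 'geneB\ngeneD\n' > "$tmp/hits.txt"
printf 'gene\tlog2FC\tFDR\ngeneA\t0.2\t0.80\ngeneB\t1.8\t0.01\ngeneD\t-1.1\t0.03\n' > "$tmp/de.tsv"
cross_table "$tmp/hits.txt" "$tmp/de.tsv"
rm -r "$tmp"
```

Running it once per contrast and timepoint yields the per-condition columns of the cross-table; the hit list must not be empty, because the `NR == FNR` idiom relies on the first file having at least one line.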

7) Validation Roadmap (Low→High Effort)

  1. RT-qPCR: Junction-spanning primers across antitoxin→toxin.
  2. EMSA/SPR: Direct binding & affinity to operator by antitoxin/TA complex.
  3. Reporter / Mutagenesis: Disrupt operator/−35/−10 or RBS, assess transcription/translation impact.
  4. ChIP-qPCR/ChIP-seq: In vivo occupancy (if antitoxin has DNA-binding domain).
  5. RACE/TSS-seq: Precise TSS mapping to confirm shared promoter.

8) Practical Criteria & Verdict Grades

  • Structure: Same strand; intergenic <100 bp; no strong terminator in-between.
  • Promoter: Clear −35/−10 upstream of antitoxin; toxin lacks strong independent promoter.
  • RNA-seq: Same-strand continuity across intergenic; boundary-spanning fragments; r ≥ 0.8 (p<0.01) across all samples; per-timepoint log2FC same direction (FDR<0.05).
  • Conclusion grades: Strong support / Support / Insufficient evidence.
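
The structural criterion (same strand, intergenic distance below 100 bp) is easy to check from annotation coordinates. The sketch below assumes 1-based inclusive coordinates as in GFF and the antitoxin-upstream arrangement described in this guide; `ta_structure_ok` is a hypothetical helper, and the terminator check is not covered.

```shell
# usage: ta_structure_ok A_START A_END A_STRAND T_START T_END T_STRAND [MAX_GAP]
# A_* = antitoxin, T_* = toxin; MAX_GAP defaults to 100 bp.
ta_structure_ok() {
  max_gap=${7:-100}
  if [ "$3" != "$6" ]; then
    echo "fail: genes on different strands"; return 1
  fi
  if [ "$3" = "+" ]; then
    # plus strand: upstream gene has the smaller coordinates
    [ "$1" -lt "$4" ] || { echo "fail: antitoxin is not upstream"; return 1; }
    gap=$(( $4 - $2 - 1 ))
  else
    # minus strand: upstream gene has the larger coordinates
    [ "$2" -gt "$5" ] || { echo "fail: antitoxin is not upstream"; return 1; }
    gap=$(( $1 - $5 - 1 ))
  fi
  if [ "$gap" -lt "$max_gap" ]; then
    echo "pass: intergenic gap ${gap} bp"
  else
    echo "fail: intergenic gap ${gap} bp"; return 1
  fi
}

ta_structure_ok 1000 1250 + 1260 1600 +   # toy coordinates; prints "pass: intergenic gap 9 bp"
```

A negative gap means the two reading frames overlap, which this sketch counts as passing.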

9) Schematic Figures (Generated)

  • Chinese-labeled (non-overlapping):
    TA_operon_shared_promoter_v3_cn.png
  • English-labeled (non-overlapping):
    TA_operon_shared_promoter_v3_en.png

Each figure depicts: shared σ⁷⁰-like promoter (−35/−10, TSS) → antitoxin (upstream) → toxin (downstream), each with its own RBS, a terminal Rho-independent terminator, and stylized same-strand RNA-seq coverage that spans the intergenic region.


10) “Ready-to-Ask” Template for Collaborators

Objective: Determine if the TA pair shares a promoter and is co-transcribed; call DEGs per timepoint across conditions; test RNA-seq changes for pulldown targets.

Please deliver:

  1. IGV tracks with −35/−10, TSS, RBS, terminator, and boundary-spanning reads.
  2. DE tables (per timepoint per contrast), with log2FC/FDR.
  3. Correlation stats (antitoxin↔toxin r, p) across 27 samples and within timepoints.
  4. Pulldown×RNA-seq cross table (+ enrichment analyses).
  5. One-page verdict: shared promoter? co-transcription? evidence grade & key screenshots.

Inputs we’ll provide: Reference genome/annotation (FASTA/GFF/GTF), BAM/BAI, sample sheet, pulldown target list.


One-liner Summary

Promoter = transcription start; RBS = translation start.
For TA pairs, antitoxin→toxin often sits in a single operon driven by a shared promoter; RNA-seq continuity, spanning fragments, correlation, and concordant DE together provide strong evidence for co-transcription.

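Clonal ≠ Phenotypically Identical: Detailed Explanation and a Concrete Example

“Clonality” means that bacterial isolates are nearly indistinguishable at the genotype level and form a highly related group. It does not mean that their phenotypes, such as antimicrobial resistance, are necessarily identical. Isolates of the same clone can differ phenotypically because gene function or expression differs, for example through regulatory mechanisms, expression levels, efflux pump activity, changes in membrane proteins, or insertion sequences.

Concrete example:

In clinical practice, isolates of the same Pseudomonas aeruginosa clone can show different minimum inhibitory concentrations (MICs) for meropenem even though typing places them in the same clone with nearly identical genomes. Closer inspection shows that some isolates carry mutations or insertions in oprD that reduce expression of the outer-membrane porin and produce a high MIC (resistant), whereas isolates in the same cluster with an intact oprD remain susceptible.

This shows that even isolates of the same clonal cluster can differ in resistance phenotype because of acquired changes such as regulatory mutations or altered gene expression.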


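Science Article (Chinese Release Version)

Clonality ≠ Phenotypic Identity: Genetic Relatedness Does Not Mean Identical Resistance (Commentary on the Email Discussion with Holger)

In our recent whole-genome analysis of the isolate collection, cgMLST/WGS identified several clusters of clonally related isolates. Taking Acinetobacter baumannii and Klebsiella pneumoniae as examples, isolates within each clonal cluster carry highly consistent sets of resistance genes and virulence factors, and their AST (antimicrobial susceptibility testing) phenotypes are similar at most time points.

In the email exchange, however, the Gradientech team pointed out: “Judging clonality from MLST and similar typing alone cannot guarantee that the genomes are free of differences. Even with an identical core genotype, phenotypic differences may exist, for example in resistance or virulence.” This is a fair point.

For example, in clinical bacteria such as Pseudomonas aeruginosa, some isolates within a cluster show reduced meropenem susceptibility (higher MIC) and become phenotypically resistant because efflux pump or porin genes (such as oprD) are downregulated, disrupted by insertion sequences, or inactivated by mutation. Another isolate of the same clonal cluster may express the gene normally and remain susceptible. This “same genotype, not quite the same phenotype” phenomenon is one of the challenges facing precision medicine.

On whether clonal isolates should be removed from the analysis, Holger recommends keeping the information transparent: disclose and explain it in the Methods and Discussion, and do not remove isolates. A suggested sentence for the Discussion: “Clonality does not necessarily imply identical phenotypic resistance expression, which also depends on regulatory mechanisms and other factors; this study therefore retains all isolates when evaluating the performance of the phenotypic testing method. If required at a later review stage, a sensitivity analysis can be run after de-duplicating by cluster.”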


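Summary

  • Clonality is defined only at the genotype level. Phenotypes such as resistance are also shaped by gene regulation, expression levels, and specific mutations, so “identical phenotype” within a clone is not an absolute rule.
  • Recognizing this is essential for epidemiological tracking of resistant clinical isolates and for evaluating new methods, and it makes the scientific conclusions more rigorous.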

Step‑by‑Step Workflow for Phylogenomic and Functional Analysis of 100 Clinical Isolates Using WGS Data

  • ggtree_and_gheatmap_kpneumoniae_new
  • ggtree_and_gheatmap_paeruginosa_new
  • ggtree_and_gheatmap_paeruginosa
  • ggtree_and_gheatmap_kpneumoniae
  • ggtree_and_gheatmap_abaumannii
  • ggtree_and_gheatmap_ecoli

Below is a sorted, step‑by‑step protocol integrating all commands and materials from your notes — organized into logical analysis phases. It outlines how to process whole‑genome data, annotate with Prokka, perform pan‑genome and SNP‑based phylogenetic analysis, and conduct resistome profiling and visualization.


1. Initial Setup and Environment Configuration

  1. Create and activate working environments:
conda create -n bengal3_ac3 python=3.9
conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
  2. Copy necessary config files for bacto:
cp /home/jhuang/Tools/bacto/bacto-0.1.json .
cp /home/jhuang/Tools/bacto/Snakefile .
  3. Prepare R environment for later visualization:
mamba create -n r_414_bioc314 -c conda-forge -c bioconda \
  r-base=4.1.3 bioconductor-ggtree bioconductor-treeio \
  r-ggplot2 r-dplyr r-ape pandoc

2. Species Assignment and MLST Screening

  • Use WGS data to determine phylogenetic relationships based on MLST. Confirm species identity of isolates: E. coli, K. pneumoniae, A. baumannii complex, P. aeruginosa.
  • Example log output:
[10:19:59] Excluding ecoli.icdA.262 due to --exclude option
[10:19:59] If you like MLST, you're going to absolutely love cgMLST!
[10:19:59] Done.
  • Observation: No strain was close enough to be a suspected clonal strain (to be confirmed later with SNP‑based analysis).

3. Genome Annotation with Prokka

Cluster isolates into four species folders and annotate each genome separately.

Example shell script:

# Map each cluster name to genus/species (associative arrays need `declare -A`, bash >= 4)
declare -A GENUS_MAP=( [abaumannii_2]="Acinetobacter" [ecoli_achtman_4]="Escherichia" [klebsiella]="Klebsiella" [paeruginosa]="Pseudomonas" )
declare -A SPECIES_MAP=( [abaumannii_2]="baumannii" [ecoli_achtman_4]="coli" [klebsiella]="pneumoniae" [paeruginosa]="aeruginosa" )
for cluster in abaumannii_2 ecoli_achtman_4 klebsiella paeruginosa; do
  # NOTE: the notes omit the input contigs FASTA; Prokka needs it as the final argument.
  prokka --force --outdir "prokka/${cluster}" --cpus 8 \
         --genus "${GENUS_MAP[$cluster]}" --species "${SPECIES_MAP[$cluster]}" \
         --addgenes --addmrna --prefix "${cluster}" \
         --hmms /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
done

4. Pan‑Genome Construction with Roary

Run Roary for each species cluster (GFF outputs from Prokka):

roary -f roary/ecoli_cluster -e --mafft -p 100 prokka/ecoli_cluster/*.gff
roary -f roary/kpneumoniae_cluster1 -e --mafft -p 100 prokka/kpneumoniae_cluster1/*.gff
roary -f roary/abaumannii_cluster -e --mafft -p 100 prokka/abaumannii_cluster/*.gff
roary -f roary/paeruginosa_cluster -e --mafft -p 100 prokka/paeruginosa_cluster/*.gff

Output: core gene alignments for each species.


5. Core‑Genome Phylogenetic Tree Building (RAxML‑NG)

Use maximum likelihood reconstruction per species:

raxml-ng --all --msa roary/ecoli_achtman_4/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix ecoli_core_gene_tree_1000
raxml-ng --all --msa roary/klebsiella/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix klebsiella_core_gene_tree_1000
raxml-ng --all --msa roary/abaumannii/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix abaumannii_core_gene_tree_1000
raxml-ng --all --msa roary/paeruginosa/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix paeruginosa_core_gene_tree_1000

6. SNP Detection and Distance Calculation

  1. Extract SNP sites and generate distance matrices:
snp-sites -v -o core_snps.vcf roary/ecoli_achtman_4/core_gene_alignment.aln
snp-dists roary/ecoli_achtman_4/core_gene_alignment.aln > ecoli_snp_dist.tsv
  2. Repeat for other species and convert to Excel summary:
~/Tools/csv2xls-0.4/csv_to_xls.py ecoli_snp_dist.tsv klebsiella_snp_dist.tsv abaumannii_snp_dist.tsv paeruginosa_snp_dist.tsv -d$'\t' -o snp_dist.xls

7. Resistome and Virulence Profiling with Abricate

  1. Install and set up databases:
conda install -c bioconda -c conda-forge abricate=1.0.1
abricate --setupdb
abricate --list
DATABASE        SEQUENCES       DBTYPE  DATE
vfdb    2597    nucl    2025-Oct-22
resfinder       3077    nucl    2025-Oct-22
argannot        2223    nucl    2025-Oct-22
ecoh    597     nucl    2025-Oct-22
megares 6635    nucl    2025-Oct-22
card    2631    nucl    2025-Oct-22
ecoli_vf        2701    nucl    2025-Oct-22
plasmidfinder   460     nucl    2025-Oct-22
ncbi    5386    nucl    2025-Oct-22
  2. Run VFDB for virulence genes:
# # Run VFDB example
# abricate --db vfdb genome.fasta > vfdb.tsv
#
# # strict filter (≥90% ID, ≥80% cov) using header-safe awk
# awk -F'\t' 'NR==1{
#   for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
#   print; next
# } ($id+0)>=90 && ($cov+0)>=80' vfdb.tsv > vfdb.strict.tsv

# 0) Define the cluster directories
CLUSTERS="fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3"

# 1) Run Abricate (VFDB) per isolate, strictly filtered (≥90% ID, ≥80% COV)
for D in $CLUSTERS; do
  mkdir -p "$D/abricate_vfdb"
  for F in "$D"/*.fasta; do
    ISO=$(basename "$F" .fasta)
    # raw
    abricate --db vfdb "$F" > "$D/abricate_vfdb/${ISO}.vfdb.tsv"
    # strict filter while keeping header (header-safe awk)
    awk -F'\t' 'NR==1{
      for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
      print; next
    } ($id+0)>=90 && ($cov+0)>=80' \
      "$D/abricate_vfdb/${ISO}.vfdb.tsv" > "$D/abricate_vfdb/${ISO}.vfdb.strict.tsv"
  done
done

# 2) Build per-cluster "isolate<TAB>gene" lists (Abricate column 5 = GENE)
for D in $CLUSTERS; do
  OUT="$D/virulence_isolate_gene.tsv"
  : > "$OUT"
  for T in "$D"/abricate_vfdb/*.vfdb.strict.tsv; do
    ISO=$(basename "$T" .vfdb.strict.tsv)
    awk -F'\t' -v I="$ISO" 'NR>1{print I"\t"$5}' "$T" >> "$OUT"
  done
  sort -u "$OUT" -o "$OUT"
done

# 3) Create a per-isolate "signature" (sorted, comma-joined gene list)
for D in $CLUSTERS; do
  IN="$D/virulence_isolate_gene.tsv"
  SIG="$D/virulence_signatures.tsv"
  awk -F'\t' '
  {m[$1][$2]=1}
  END{
    for(i in m){
      n=0
      for(g in m[i]){n++; a[n]=g}
      asort(a)
      printf("%s\t", i)
      for(k=1;k<=n;k++){printf("%s%s", (k>1?",":""), a[k])}
      printf("\n")
      delete a
    }
  }' "$IN" | sort > "$SIG"
done

# 4) Report whether each cluster is internally identical
for D in $CLUSTERS; do
  SIG="$D/virulence_signatures.tsv"
  echo "== $D =="
  # how many unique signatures?
  CUT=$(cut -f2 "$SIG" | sort -u | wc -l)
  if [ "$CUT" -eq 1 ]; then
    echo "  Virulence profiles: IDENTICAL within cluster"
  else
    echo "  Virulence profiles: NOT IDENTICAL (unique signatures: $CUT)"
    echo "  --- differing isolates & their signatures ---"
    cat "$SIG"
  fi
done
  3. Summarize VFDB results:
#----
# Make gene lists (VFDB "GENE" = column 5) for each isolate
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | tail -n +2 | sort -u > c2_070.genes.txt

# Show symmetric differences
echo "Genes present in QRC018 only:"
comm -23 c2_018.genes.txt c2_070.genes.txt

echo "Genes present in QRC070 only:"
comm -13 c2_018.genes.txt c2_070.genes.txt

awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | sort -u > c2_018.gene_product.txt
awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | sort -u > c2_070.gene_product.txt

# Genes unique to QRC018 with product
join -t $'\t' <(cut -f1 c2_018.gene_product.txt) <(cut -f1 c2_070.gene_product.txt) -v1 > /dev/null # warms caches
comm -23 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_018.gene_product.txt

# Genes unique to QRC070 with product
comm -13 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_070.gene_product.txt

# Make a cluster summary table from raw (or strict) TSVs
for D in fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3; do
  echo "== $D =="
  abricate --summary "$D"/abricate_vfdb/*.vfdb.strict.tsv | column -t
done
  4. Make gene presence lists and find symmetric differences between isolates:
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
comm -23 c2_018.genes.txt c2_070.genes.txt

  5. (Optional) For A. baumannii with zero hits, run DIAMOND BLASTp against VFDB proteins.

# Optional extra check for A. baumannii: the VFDB nucleotide set returned zero hits, so cross-check against the VFDB protein set with DIAMOND.

# Build a DIAMOND db (once)
diamond makedb --in VFDB_proteins.faa -d vfdb_prot

# Query predicted proteins (Prokka .faa) per isolate
mkdir -p abricate_vfdb   # output directory used by -o below
for F in fasta_abaumannii_cluster/*.fasta; do
  ISO=$(basename "$F" .fasta)
  prokka --cpus 8 --outdir "prokka/$ISO" --prefix "$ISO" "$F"
  diamond blastp -q "prokka/$ISO/$ISO.faa" -d vfdb_prot \
    -o "abricate_vfdb/$ISO.vfdb_prot.tsv" \
    -f 6 qseqid sseqid pident length evalue bitscore qcovhsp \
    --id 90 --query-cover 80 --max-target-seqs 1 --threads 8
done

8. Visualization with ggtree and Heatmaps

Render circular core genome trees and overlay selected ARGs or gene presence/absence matrices.

Key R snippet:

library(ggtree)
library(ggplot2)
library(dplyr)
info <- read.csv("typing_until_ecpA.csv", sep="\t")
tree <- read.tree("raxml-ng/snippy.core.aln.raxml.tree")
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info +
     geom_tippoint(aes(color=ST), size=4) + geom_tiplab2(aes(label=name), offset=1)
gheatmap(p, heatmapData2, width=4, colnames_angle=45, offset=4.2)

Output: multi‑layer circular tree (e.g., ggtree_and_gheatmap_selected_genes.png).


9. Final Analyses and Reporting

  • Compute pairwise SNP distances and identify potential clonal clusters using species‑specific thresholds (e.g., ≤17 SNPs for E. coli).
  • If clones exist, retain one representative per cluster for subsequent Boruta analysis or CA/EA metrics.
  • Integrate ARG heatmaps to the trees for intuitive visualization of resistance determinants relative to phylogeny.
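
Candidate clonal pairs can be pulled from a pairwise distance matrix with a short awk filter. This sketch assumes a square, tab-separated matrix with sample names in the header row and first column, and rows in the same order as the columns; the function name `clonal_pairs` is hypothetical. The threshold is passed as an argument so that species-specific cut-offs can be applied.

```shell
# usage: clonal_pairs MAX_SNPS < matrix.tsv
# Prints "sampleA<TAB>sampleB<TAB>distance" for each pair at or below MAX_SNPS.
clonal_pairs() {
  awk -F'\t' -v max="$1" '
    NR == 1 { for (j = 2; j <= NF; j++) name[j] = $j; next }   # column names
    {
      # upper triangle only: row NR holds its own sample in column NR
      for (j = NR + 1; j <= NF; j++)
        if (($j + 0) <= max) print $1 "\t" name[j] "\t" $j
    }'
}

# Toy 3x3 matrix (made-up distances); prints the single pair A, B with distance 5
printf '\tA\tB\tC\nA\t0\t5\t40\nB\t5\t0\t38\nC\t40\t38\t0\n' | clonal_pairs 17
```

Pairs reported this way still need to be merged into clusters (for example by single linkage) before picking one representative per cluster.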

Protocol Description Summary

This workflow systematically processes WGS data from multiple bacterial species:

  1. Annotation (Prokka) ensures consistent gene predictions.
  2. Pan‑genome alignment (Roary) captures core and accessory variation.
  3. Phylogenetic reconstruction (RAxML‑NG) provides species‑level evolutionary context.
  4. SNP‑distance computation (snp‑sites, snp‑dists) allows clonal determination and diversity quantification.
  5. Resistome analysis (Abricate, Diamond) characterizes antimicrobial resistance and virulence determinants.
  6. Visualization (ggtree + gheatmap) integrates tree topology with gene presence–absence or ARG annotations.

Properly documented, this end‑to‑end pipeline can serve as the Methods section for publication or as an online reproducible workflow guide.

Methods (for the manuscript). Genomes were annotated with PROKKA v1.14.5 [1]. A pangenome was built with Roary v3.13.0 [2] using default settings (95% protein identity) to derive a gene presence/absence matrix. Core genes—defined as present in 99–100% of isolates by Roary—were individually aligned with MAFFT v7 [3] and concatenated; the resulting multiple alignment was used to infer a maximum-likelihood phylogeny in RAxML-NG [4] under GTR+G, with 1,000 bootstrap replicates [5] and a random starting tree. Trees were visualized with the R package ggtree [6] in circular layout; tip colors denote sequence type (ST), with MLST assigned via PubMLST/BIGSdb [7]; concentric heatmap rings indicate the presence (+) or absence (-) of resistance genes.

  1. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.
  2. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691–3693.
  3. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772–780.
  4. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
  5. Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
  6. Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. 2017. ggtree: an R package for visualization and annotation of phylogenetic trees. Methods in Ecology and Evolution 8(1):28–36.
  7. Jolley KA, Bray JE, Maiden MCJ. 2018. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3:124.

Phylogenetic tree of E. coli isolates using the ggtree and ggplot2 packages (plotTreeHeatmap)

ggtree_and_gheatmap_ecoli
library(ggtree)
library(ggplot2)
library(dplyr)
#setwd("/home/jhuang/DATA/Data_Ben_Boruta_Analysis/plotTreeHeatmap_ecoli/")

info <- read.csv("isolate_ecoli_.csv", sep="\t", check.names = FALSE)
info$name <- info$Isolate
# make ST discrete
info$ST <- factor(info$ST)

tree <- read.tree("../ecoli_core_gene_tree_1000.raxml.bestTree")
cols <- c("10"="cornflowerblue","38"="darkgreen","46"="seagreen3","69"="tan","88"="red",  "131"="navyblue", "156"="gold",     "167"="green","216"="orange","405"="pink","410"="purple","1882"="magenta","2450"="brown", "2851"="darksalmon","3570"="chocolate4","4774"="darkkhaki")
# "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet"

#heatmapData2 <- info %>% select(Isolate, blaCTX.M, blaIMP, blaKPC, blaNDM.1, blaNDM.5, blaOXA.23.like, blaOXA.24.like, blaOXA.48.like, blaOXA.58.like, blaPER.1, blaSHV, blaVEB.1, blaVIM)  #ST,
heatmapData2 <- info %>% select(
    Isolate,
    `blaCTX-M`, blaIMP, blaKPC, `blaNDM-1`, `blaNDM-5`,
    `blaOXA-23-like`, `blaOXA-24-like`, `blaOXA-48-like`, `blaOXA-58-like`,
    `blaPER-1`, blaSHV, `blaVEB-1`, blaVIM
)
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
#names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")

heatmap.colours <- c("darkgreen", "grey")
names(heatmap.colours) <- c("+","-")

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular

p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))

png("ggtree_and_gheatmap_mibi_selected_genes.png", width=1590, height=1300)
#svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

# ---------

# 1) Optional: shrink tree to create more outer space (keeps relative scale)
tree_small <- tree
tree_small$edge.length <- tree_small$edge.length * 0.4    # try 0.3–0.5

# 2) Height after shrinking
ht <- max(ape::node.depth.edgelength(tree_small))

# 3) Base tree
info$ST <- factor(info$ST)
p <- ggtree(tree_small, layout = "circular", open.angle = 20) %<+% info +
    geom_tippoint(aes(color = ST), size = 1.4) +
    geom_tiplab2(aes(label = name), size = 1.4, offset = 0.02 * ht) +
    scale_color_manual(values = cols)

# 4) Reserve room outside the tips
off <- 0.30 * ht   # gap
wid <- 2.80 * ht   # ring thickness
mar <- 0.25 * ht
p_wide <- p + xlim(0, ht + off + wid + mar)

# 5) Ensure discrete values are factors and keep both levels
heatmapData2[] <- lapply(heatmapData2, function(x) factor(x, levels = c("+","-")))

# 6) Draw the heatmap
png("ggtree_and_gheatmap_readable.png", width = 3600, height = 3200, res = 350)
gheatmap(
    p_wide, heatmapData2,
    offset = off,
    width  = wid,
    color  = "white",            # borders between tiles
    colnames = TRUE,
    colnames_position = "top",
    colnames_angle = 0,
    colnames_offset_y = 0.04 * ht,
    hjust = 0.5,
    font.size = 7
) +
    scale_fill_manual(values = c("+"="darkgreen", "-"="grey"), drop = FALSE) +
    guides(
        fill  = guide_legend(title = "", override.aes = list(size = 4)),
        color = guide_legend(override.aes = list(size = 3))
    )
dev.off()

#-------------- plot with true scale -------------

# tree height (root → farthest tip)
ht <- max(ape::node.depth.edgelength(tree))

# circular tree with a small opening and compact labels
p <- ggtree(tree, layout = "circular", open.angle = 20) %<+% info +
    geom_tippoint(aes(color = ST), size = 1.8, alpha = 0.9) +
    geom_tiplab2(aes(label = name), size = 2.2, offset = 0.06 * ht) +
    scale_color_manual(values = cols) +
    theme(legend.title = element_text(size = 12),
                legend.text  = element_text(size = 10))

# higher-DPI export so small cells look crisp
png("ggtree_true_scale.png", width = 2400, height = 2400, res = 300)
p
dev.off()

# --- Heatmap: push it further out and make it wider ---
#svg("ggtree_and_gheatmap_mibi_selected_genes_true_scale.svg",
#    width = 32, height = 28)

png("ggtree_and_gheatmap_mibi_selected_genes_true_scale.png",
        width = 2000, height = 1750,  res = 200)

gheatmap(
    p, heatmapData2,
    offset = 0.24 * ht,         # farther from tips (was 0.08*ht)
    width  = 20.0 * ht,         # much wider ring (was 0.35*ht)
    color  = "white",           # thin tile borders help readability
    colnames_position  = "top",
    colnames_angle     = 90,     # rotate column labels to vertical to save space
    colnames_offset_y  = 0.08 * ht,
    hjust = 0.1,
    font.size = 2.6               # bigger column label font
) +
    scale_fill_manual(values = heatmap.colours) +
    guides(fill  = guide_legend(title = "", override.aes = list(size = 4)),
                 color = guide_legend(override.aes = list(size = 3))) +
    theme(legend.position = "right",
                legend.title    = element_text(size = 10),
                legend.text     = element_text(size = 8))
dev.off()

This R script visualizes a phylogenetic tree of E. coli isolates using the ggtree and ggplot2 packages, annotated with sequence type (ST) colors and a heatmap showing the presence or absence of antimicrobial resistance genes. It creates several versions of the circular tree figure to adjust scaling and readability.

Key Steps

  1. Load libraries and data: The code imports ggtree, ggplot2, and dplyr, reads isolate metadata (isolate_ecoli_.csv), and loads a phylogenetic tree file (ecoli_core_gene_tree_1000.raxml.bestTree).
  2. Prepare metadata: It defines each isolate’s name and converts the sequence type (ST) into a categorical factor. A subset of columns (various bla resistance gene markers) is selected to create a heatmap data matrix, where each gene’s presence (“+”) or absence (“-”) will be visualized.
  3. Define color schemes: Different STs are assigned distinct colors, and the binary gene presence/absence states (“+”, “-”) are mapped to “darkgreen” and “grey,” respectively.
  4. Plot tree with annotation: The main tree is plotted circularly using:
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + 
     geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols)

Tip labels are added for isolates, and STs are colored as per the legend.

  5. Add heatmap (gene presence/absence): The gheatmap() function appends the gene matrix to the tree, aligning rows with the tree tips and providing a heatmap of resistance gene patterns.
  6. Export figures: Multiple PNGs are generated:
    • ggtree.png: Basic circular tree
    • ggtree_and_gheatmap_mibi_selected_genes.png: Tree + resistance gene heatmap
    • Scaled versions with adjusted offsets and spacing for improved readability and “true-scale” representations

Final Output

The resulting figures show a circular phylogenetic tree annotated by ST colors, accompanied by a ring-style heatmap indicating the distribution of key resistance genes across isolates. The layout adjustments (tree shrinking, offset tuning, width scaling) ensure legibility when genes or isolates are numerous.

Processing Data_Patricia_Transposon_2025 v2 (Workflow for Structural Variant Calling in Nanopore Sequencing)

  1. Generate the HD46_Ctrl annotation

    mamba activate trycycler
    cd trycycler_HD46_Ctrl;
    trycycler cluster --threads 55 --assemblies assemblies/*.fasta --reads reads.fastq --out_dir trycycler;
    
    trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_001
    mv trycycler/cluster_001/1_contigs/J_ctg000010.fasta .
    mv trycycler/cluster_001/1_contigs/L_tig00000016.fasta .
    mv trycycler/cluster_001/1_contigs/R_tig00000001.fasta .
    mv trycycler/cluster_001/1_contigs/H_utg000001c.fasta .
    
    trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_002
    mv trycycler/cluster_002/1_contigs/*00000*.fasta .
    # Error: unable to find a suitable common sequence
    
    trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_003
    mv trycycler/cluster_003/1_contigs/F_tig00000004.fasta .
    mv trycycler/cluster_003/1_contigs/L_tig00000003.fasta .
    
    trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir trycycler/cluster_004
    mv trycycler/cluster_004/1_contigs/J_ctg000000.fasta .
    mv trycycler/cluster_004/1_contigs/P_ctg000000.fasta .
    mv trycycler/cluster_004/1_contigs/S_contig_2.fasta .
    
    # Reconcile the remaining clusters (cluster_005 to cluster_029)
    for i in $(seq -w 5 29); do
        trycycler reconcile --threads 55 --reads reads.fastq --cluster_dir "trycycler/cluster_0${i}"
    done
    
    trycycler msa --threads 55 --cluster_dir trycycler/cluster_001
    trycycler msa --threads 55 --cluster_dir trycycler/cluster_004
    
    trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_001
    trycycler partition --threads 55 --reads reads.fastq --cluster_dirs trycycler/cluster_004
    
    trycycler consensus --threads 55 --cluster_dir trycycler/cluster_001
    trycycler consensus --threads 55 --cluster_dir trycycler/cluster_004
    
    #Polish --> TODO: Need to be Debugged!
    for c in trycycler/cluster_001 trycycler/cluster_004; do
        medaka_consensus -i "$c"/4_reads.fastq -d "$c"/7_final_consensus.fasta -o "$c"/medaka  -m r941_min_sup_g507 -t 12
        mv "$c"/medaka/consensus.fasta "$c"/8_medaka.fasta
        rm -r "$c"/medaka "$c"/*.fai "$c"/*.mmi  # clean up
    done
    # cat trycycler/cluster_*/8_medaka.fasta > trycycler/consensus.fasta
    
    cp trycycler/cluster_001/7_final_consensus.fasta HD46_Ctrl_chr.fasta
    cp trycycler/cluster_004/7_final_consensus.fasta HD46_Ctrl_plasmid.fasta
  2. Install Mambaforge (recommended): https://conda-forge.org/miniforge/

    #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    
    To activate this environment, use:
        micromamba activate /home/jhuang/mambaforge
    Or to execute a single command in this environment, use:
        micromamba run -p /home/jhuang/mambaforge mycommand
    installation finished.
    
    Do you wish to update your shell profile to automatically initialize conda?
    This will activate conda on startup and change the command prompt when activated.
    If you'd prefer that conda's base environment not be activated on startup,
      run the following command when conda is activated:
    
    conda config --set auto_activate_base false
    
    You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    [no] >>> yes
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    modified      /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    no change     /home/jhuang/mambaforge/condabin/conda
    no change     /home/jhuang/mambaforge/bin/conda
    no change     /home/jhuang/mambaforge/bin/conda-env
    no change     /home/jhuang/mambaforge/bin/activate
    no change     /home/jhuang/mambaforge/bin/deactivate
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.sh
    no change     /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    no change     /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    no change     /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    no change     /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    no change     /home/jhuang/mambaforge/etc/profile.d/conda.csh
    no change     /home/jhuang/.bashrc
    No action taken.
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    Added mamba to /home/jhuang/.bashrc
    ==> For changes to take effect, close and re-open your current shell. <==
    Thank you for installing Mambaforge!
    
    Close your terminal window and open a new one, or run:
    #source ~/mambaforge/bin/activate
    conda --version
    mamba --version
    
    https://github.com/conda-forge/miniforge/releases
    Note
    
        * After installation, please make sure that you do not have the Anaconda default channels configured.
            conda config --show channels
            conda config --remove channels defaults
            conda config --add channels conda-forge
            conda config --show channels
            conda config --set channel_priority strict
            #conda clean --all
            conda config --remove channels biobakery
    
        * !!!!Do not install anything into the base environment as this might break your installation.!!!!
    
    # --Deprecated method: mamba installing on conda--
    #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    #    * Note that installing mamba into any other environment than base is not supported.
    #
    #conda activate base
    #conda install conda
    #conda uninstall mamba
    #conda install mamba

2: Install the required tools in the mamba environment

    * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
    * RepeatModeler2: Identify and classify transposons de novo.
    * RepeatMasker: Annotate known transposable elements using transposon libraries.
    * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
    * SURVIVOR: Consolidate structural variants across samples for comparative analysis.

    mamba deactivate
    # Create a new conda environment
    mamba create -n transposon_long python=3.6 -y

    # Activate the environment
    mamba activate transposon_long

    mamba install -c bioconda sniffles
    mamba install -c bioconda repeatmodeler repeatmasker

    # configure repeatmasker database
    mamba info --envs
    cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker

    #mamba install python=3.6
    mamba install -c bioconda svim
    mamba install -c bioconda survivor
  1. Test the installed tools

    # Check versions
    sniffles --version
    RepeatModeler -h
    RepeatMasker -h
    svim --help
    SURVIVOR --help
    mamba install -c conda-forge perl r
  2. Data Preparation

    Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

    Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
  4. (Optional) Variant Calling for SNP and Indel Detection:

    Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. (OFFICIAL STARTING POINT) Alignment and Structural Variant Calling: Long-read tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants (e.g., interspersed repeats).

      #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    
      # -- PREPARING the input fastq data: merge the fastq.gz files and move the result to the top-level directory
    
      # Under raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass
      zcat ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_0.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_1.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_2.fastq.gz ./barcode01/FBD12615_pass_barcode01_97f118c2_aa46ecf7_3.fastq.gz ... | gzip > HD46_1.fastq.gz
      mv ./raw_data/no_sample_id/20250731_0943_MN45170_FBD12615_97f118c2/fastq_pass/HD46_1.fastq.gz ~/DATA/Data_Patricia_Transposon_2025
    
        #these are the corresponding sample names:
        #barcode 1: HD46-1
        #barcode 2: HD46-2
        #barcode 3: HD46-3
        #barcode 4: HD46-4
        mv barcode01.fastq.gz HD46_1.fastq.gz
        mv barcode02.fastq.gz HD46_2.fastq.gz
        mv barcode03.fastq.gz HD46_3.fastq.gz
        mv barcode04.fastq.gz HD46_4.fastq.gz
    
      # -- CALCULATE the coverages
        #!/bin/bash
    
        for bam in barcode*_minimap2.sorted.bam; do
            echo "Processing $bam ..."
            avg_cov=$(samtools depth -a "$bam" | awk '{sum+=$3; cnt++} END {if (cnt>0) print sum/cnt; else print 0}')
            echo -e "${bam}\t${avg_cov}" >> coverage_summary.txt
        done
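
The awk one-liner above averages the third column of the `samtools depth -a` output. A minimal Python sketch of the same calculation follows, assuming the depth output has already been captured as tab-separated lines of the form contig, position, depth. The function name `mean_depth` and the example lines are hypothetical.

```python
from typing import Iterable


def mean_depth(depth_lines: Iterable[str]) -> float:
    """Average the third column (depth) of `samtools depth -a` style lines."""
    total = 0
    count = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        total += int(fields[2])
        count += 1
    # Mirror the awk guard: report 0 when no positions were seen
    return total / count if count else 0.0


# Hypothetical three-position example (not real data)
example = ["chr\t1\t10\n", "chr\t2\t20\n", "chr\t3\t30\n"]
print(mean_depth(example))
```

Like the shell loop, this counts every position reported by `-a`, including zero-depth positions, so the result is the mean over the whole reference.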
    
      # ---- !!!! ACTIVATE the suitable environment !!!! ----
      # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
      mamba activate transposon_long
    
      # -- TODO: AFTERNOON_DEBUG_THIS: FAILED and not_USED: Align and detect structural variants in each sample using SVIM, which uses the aligner ngmlr or minimap2
      #mamba install -c bioconda ngmlr
      mamba install -c bioconda svim
    
      #SEARCH FOR "HD46_Ctrl_chr_plasmid.fasta" for finding the insertion-calling-commands
      # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! for all 4 options #
      # ---- Option_1: minimap2 (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
    
      for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #INS,INV,DUP:TANDEM,DUP:INT,BND
          svim reads --aligner minimap2 --nanopore minimap2+svim_${sample}    ${sample}.fastq.gz HD46_Ctrl_chr_plasmid.fasta  --cores 20 --types INS --min_sv_size 100 --sequence_alleles --insertion_sequences --read_names;
      done
    
      #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    
      # ---- Option_2: minimap2 (aligner) + Sniffles2 (structural variant caller) --> SUCCESSFUL ----
      #Minimap2: A commonly used aligner for nanopore sequencing data.
      #    Align Long Reads to the WT Reference using Minimap2
      #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
      #  -s 10: Requires at least 10 reads to support an SV for reporting (reduced from 20).
      #  -l 50: Reports SVs that are at least 50 base pairs long.
      #  -t 60: Uses 60 threads for faster processing.
      for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          #minimap2 --MD -t 60 -ax map-ont HD46_Ctrl_chr_plasmid.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
          minimap2 --MD -t 60 -ax map-ont HD46_Ctrl_chr_plasmid.fasta ${sample}.fastq.gz | samtools sort -o ${sample}_minimap2.sorted.bam
          samtools index ${sample}_minimap2.sorted.bam
          sniffles -m ${sample}_minimap2.sorted.bam -v ${sample}_minimap2+sniffles.vcf -s 10 -l 50 -t 60
          #QUAL < 20 ||
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_minimap2+sniffles.vcf > ${sample}_minimap2+sniffles_filtered.vcf
      done
    
        #Estimating parameter...
        #        Max dist between aln events: 44
        #        Max diff in window: 76
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.0112045
        #        Avg INS ratio: 0.0364027
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #        Finalizing  ..
        #Start genotype calling:
        #        Reopening Bam file for parsing coverage
        #        Finalizing  ..
        #Estimating parameter...
        #        Max dist between aln events: 28
        #        Max diff in window: 89
        #        Min score ratio: 2
        #        Avg DEL ratio: 0.013754
        #        Avg INS ratio: 0.17393
        #Start parsing... CP020463
        #                # Processed reads: 10000
        #                # Processed reads: 20000
        #                # Processed reads: 30000
        #                # Processed reads: 40000
    
        # Results:
        # * barcode01_minimap2+sniffles.vcf
        # * barcode01_minimap2+sniffles_filtered.vcf
        # * barcode02_minimap2+sniffles.vcf
        # * barcode02_minimap2+sniffles_filtered.vcf
        # * barcode03_minimap2+sniffles.vcf
        # * barcode03_minimap2+sniffles_filtered.vcf
        # * barcode04_minimap2+sniffles.vcf
        # * barcode04_minimap2+sniffles_filtered.vcf
    
      #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
      #for sample in barcode01 barcode02 barcode03 barcode04; do
      #    sniffles -m svim_reads_minimap2_${sample}/${sample}.fastq.minimap2.coordsorted.bam -v sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
      #    bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
      #done
    
      # ---- Option_3: NGMLR (aligner) + SVIM (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          svim reads --aligner ngmlr --nanopore    ngmlr+svim_${sample}       ${sample}.fastq.gz HD46_Ctrl_chr_plasmid.fasta  --cores 10;
      done
    
      # ---- Option_4: NGMLR (aligner) + sniffles (structural variant caller) --> SUCCESSFUL ----
      for sample in HD46_1 HD46_2 HD46_3 HD46_4 HD46_5 HD46_6 HD46_7 HD46_8 HD46_13; do
          sniffles -m ngmlr+svim_${sample}/${sample}.fastq.ngmlr.coordsorted.bam -v ${sample}_ngmlr+sniffles.vcf -s 10 -l 50 -t 60
          bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}_ngmlr+sniffles.vcf > ${sample}_ngmlr+sniffles_filtered.vcf
      done
    
      #END
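
The `bcftools filter -e "INFO/SVTYPE != 'INS'"` step used in the loops above keeps only insertion records. For inspection without bcftools, a minimal Python sketch of the same idea follows. It assumes plain-text VCF lines and a literal `SVTYPE=INS` entry in the INFO column. The function name `keep_insertions` and the example records are hypothetical.

```python
from typing import Iterable, Iterator


def keep_insertions(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines plus records whose INFO column contains SVTYPE=INS."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if "SVTYPE=INS" in cols[7].split(";"):
            yield line


# Hypothetical records for illustration only
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr\t100\t1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=999\n",
    "chr\t200\t2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-50\n",
]
print("".join(keep_insertions(example)), end="")
```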
  6. Compare and integrate all results produced by minimap2+sniffles and ngmlr+sniffles, and check each position in IGV!

    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    mv HD46_1_minimap2+sniffles_filtered.vcf    HD46-1_minimap2+sniffles_filtered.vcf
    mv HD46_1_ngmlr+sniffles_filtered.vcf       HD46-1_ngmlr+sniffles_filtered.vcf
    mv HD46_2_minimap2+sniffles_filtered.vcf    HD46-2_minimap2+sniffles_filtered.vcf
    mv HD46_2_ngmlr+sniffles_filtered.vcf       HD46-2_ngmlr+sniffles_filtered.vcf
    mv HD46_3_minimap2+sniffles_filtered.vcf    HD46-3_minimap2+sniffles_filtered.vcf
    mv HD46_3_ngmlr+sniffles_filtered.vcf       HD46-3_ngmlr+sniffles_filtered.vcf
    mv HD46_4_minimap2+sniffles_filtered.vcf    HD46-4_minimap2+sniffles_filtered.vcf
    mv HD46_4_ngmlr+sniffles_filtered.vcf       HD46-4_ngmlr+sniffles_filtered.vcf
    mv HD46_5_minimap2+sniffles_filtered.vcf    HD46-5_minimap2+sniffles_filtered.vcf
    mv HD46_5_ngmlr+sniffles_filtered.vcf       HD46-5_ngmlr+sniffles_filtered.vcf
    mv HD46_6_minimap2+sniffles_filtered.vcf    HD46-6_minimap2+sniffles_filtered.vcf
    mv HD46_6_ngmlr+sniffles_filtered.vcf       HD46-6_ngmlr+sniffles_filtered.vcf
    mv HD46_7_minimap2+sniffles_filtered.vcf    HD46-7_minimap2+sniffles_filtered.vcf
    mv HD46_7_ngmlr+sniffles_filtered.vcf       HD46-7_ngmlr+sniffles_filtered.vcf
    mv HD46_8_minimap2+sniffles_filtered.vcf    HD46-8_minimap2+sniffles_filtered.vcf
    mv HD46_8_ngmlr+sniffles_filtered.vcf       HD46-8_ngmlr+sniffles_filtered.vcf
    mv HD46_13_minimap2+sniffles_filtered.vcf   HD46-13_minimap2+sniffles_filtered.vcf
    mv HD46_13_ngmlr+sniffles_filtered.vcf      HD46-13_ngmlr+sniffles_filtered.vcf
  7. (NOT_USED) Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)

      python vcf_to_fasta.py variants.vcf variants.fasta
      #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
      #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
      /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5    #1.9G
      python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
      Exact Matches
      =============
      2 bacteria (blast name), Bacteria (scientific name), eubacteria (genbank common name), Monera (in-part), Procaryotae (in-part), Prokaryota (in-part), Prokaryotae (in-part), prokaryote (in-part), prokaryotes (in-part)

      Non-exact Matches
      =================
      1783272 Terrabacteria group (scientific name)
      91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
      1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)

      Summary of Classes within Firmicutes:

      * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
          * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
          * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
          * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
          * Listeria (e.g., Listeria monocytogenes)
      * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
      * Erysipelotrichia (intestinal bacteria, some pathogenic)
      * Tissierellia (less-studied, veterinary relevance)
      * Mollicutes (cell wall-less, includes Mycoplasma species)
      * Negativicutes (includes some Gram-negative, anaerobic species)

      RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
      python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta

      #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
      #
      #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
      #mamba install -c bioconda vcf2fasta
      #vcf2fasta variants.vcf -o insertions.fasta
      #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
      #python3 filtering_low_complexity.py
      #
      #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
      #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf

      # -- calculate the percentage of reads

      To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record.

      Step 1: Extract Relevant Information

      In the provided VCF entry:

      * RE (Reads Evidence): 733, the total number of reads supporting the insertion.
      * GT (Genotype): 1/1, which indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion.
      * AF (Allele Frequency): 1, a 100% allele frequency, indicating that every read in this sample supports the insertion.
      * DR (Depth Reference): 0, the number of reads supporting the reference allele.
      * DV (Depth Variant): 733, the number of reads supporting the variant allele (insertion).

      Step 2: Calculate Percentage of Reads Supporting the Insertion

      Using the formula:

      $$\text{Percentage of reads with insertion} = \frac{DV}{DR + DV} \times 100$$

      Substitute the values:

      $$\text{Percentage} = \frac{733}{0 + 733} \times 100 = 100\%$$

      Conclusion

      Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields.

      * In your VCF file generated by Sniffles, the REF=N in the results has a specific meaning:
          * In a standard VCF, the REF field usually contains the reference base(s) at the variant position.
          * For structural variants (SVs), especially insertions, there is no reference sequence replaced; the insertion occurs between reference bases.
          * Therefore, Sniffles uses N as a placeholder in the REF field to indicate "no reference base replaced".
          * The actual inserted sequence is then stored in the ALT field.
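
The percentage calculation described above, DV / (DR + DV) × 100, can be sketched in Python. This is a minimal illustration that assumes the FORMAT column lists `DR` and `DV` keys, as in the `GT:DR:DV` layout quoted in these notes. The function name `insertion_read_percentage` is hypothetical.

```python
def insertion_read_percentage(format_keys: str, sample_values: str) -> float:
    """Return DV / (DR + DV) * 100 from a VCF FORMAT/sample column pair."""
    fields = dict(zip(format_keys.split(":"), sample_values.split(":")))
    dr = int(fields["DR"])  # reads supporting the reference allele
    dv = int(fields["DV"])  # reads supporting the variant allele
    if dr + dv == 0:
        return 0.0  # no covering reads, avoid division by zero
    return 100.0 * dv / (dr + dv)


# DR=0, DV=733 as in the record discussed above
print(insertion_read_percentage("GT:DR:DV", "1/1:0:733"))
# A heterozygous-looking call with mixed support
print(insertion_read_percentage("GT:DR:DV", "0/1:169:117"))
```

A value well below 100 means a mixed population or a heterozygous-looking call. Such sites are worth checking in IGV before they are reported.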
  8. Why some records have UNRESOLVED in the FILTER field in the Excel output.

    1. Understanding the format
    
        The data appears to be structural variant (SV) calls from Sniffles, probably in a VCF-like tabular format exported to Excel:
    
            * gi|1176884116|gb|CP020463.1| → reference sequence
            * Positions: 1855752 and 2422820
            * N → insertion event
            * SVLEN=999 → size of the insertion
            * AF → allele frequency
            * GT:DR:DV → genotype, depth reference, depth variant (1/1:0:678, example values for a PASS variant)
            * FILTER → whether the variant passed filters (UNRESOLVED means it didn’t pass)
    
    2. What UNRESOLVED usually means
    
        In Sniffles:
    
        * UNRESOLVED is assigned to SVs when the tool cannot confidently resolve the exact sequence or breakpoint.
        * Reasons include:
            - Low read support (RE, DV) relative to the expected coverage
            - Ambiguous alignment at repetitive regions
            - Conflicting strand or orientation signals
            - Allele frequency inconsistent with expectations
    
    3. Examine your two records
    
        First record
    
            POS: 1855752
            SVTYPE: INS
            SVLEN: 999
            RE: 68
            AF: 1
            GT: 1/1
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 1 → allele frequency 100%, homozygous insertion
        * RE = 68 → 68 reads support the variant, decent coverage
        * Still UNRESOLVED → likely because Sniffles could not resolve the inserted sequence precisely; sometimes long insertions in repetitive regions are hard to reconstruct fully even with good read support.
    
        Second record
    
            POS: 2422820
            SVTYPE: INS
            SVLEN: 999
            RE: 22
            AF: 0.025522
            GT: 0/0
            FILTER: UNRESOLVED
    
        Observations:
    
        * AF = 0.0255 → very low allele frequency (~2.5%)
        * RE = 22, DR = 840 → very low variant reads vs reference
        * GT = 0/0 → homozygous reference
        * Sniffles marks it UNRESOLVED because the variant is essentially noise, not confidently detected.
    
    4. Key difference between the two

        Feature                 First record                     Second record
        Allele frequency (AF)   1 (high)                         0.0255 (very low)
        Variant reads (RE)      68                               22
        Genotype (GT)           1/1                              0/0
        Reason for UNRESOLVED   Unresolvable inserted sequence   Signal too weak (essentially noise)
    
    5. Conclusion
    
        * Sniffles marks a variant as UNRESOLVED when the SV cannot be confidently characterized.
        * Even if there is good read support (first record), complex insertions can’t always be reconstructed fully.
        * Very low allele frequency (second record) also triggers UNRESOLVED because the signal is too weak compared to background noise.
        * In short: UNRESOLVED does not by itself mean bad data; it means Sniffles could not fully characterize the call.
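
To compare records like the two above side by side, it helps to split the INFO column into key/value pairs. The following is a minimal sketch. The INFO strings are assembled from the values listed above (real Sniffles INFO columns contain more keys), and the function name `parse_info` is hypothetical.

```python
from typing import Dict


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into a dict; flag-only entries map to ''."""
    out: Dict[str, str] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        out[key] = value
    return out


# INFO strings rebuilt from the two records discussed above (abbreviated)
first = parse_info("SVTYPE=INS;SVLEN=999;RE=68;AF=1")
second = parse_info("SVTYPE=INS;SVLEN=999;RE=22;AF=0.025522")
for name, rec in (("first", first), ("second", second)):
    print(name, "RE =", rec["RE"], "AF =", float(rec["AF"]))
```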
  9. (NOT_SURE_HOW_TO_USE) Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

      mamba install -c bioconda medaka
      #medaka_consensus takes the reads (-i) and the draft/reference sequence (-d) and aligns the reads itself
      medaka_consensus -i reads.fastq -d reference.fasta -o polished_output -t 4
  10. Compare Insertions Across Samples

    Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    
    SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    
        Input: List of VCF files from Sniffles2.
        Output: A consolidated VCF file with shared and unique variants.
    
    Filter WT Insertions:
    
        Identify transposons present only in samples 1–9 by subtracting WT variants using bcftools:
    
            # bcftools isec needs bgzip-compressed and indexed inputs (bgzip + tabix -p vcf)
            bcftools isec WT.vcf.gz merged.vcf.gz -p comparison_results
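
Insertion breakpoints called in different samples rarely agree to the exact base, so an exact-position comparison can miss shared sites. As an illustration only (this is not what `bcftools isec` does), the sketch below subtracts WT insertion sites from sample sites using a position tolerance. The function name `subtract_wt`, the default tolerance of 50 bp, and the coordinates are all assumptions.

```python
from typing import List, Tuple

Site = Tuple[str, int]  # (contig name, 1-based position)


def subtract_wt(sample_sites: List[Site], wt_sites: List[Site],
                tolerance: int = 50) -> List[Site]:
    """Return sample sites with no WT site on the same contig within `tolerance` bp."""
    kept = []
    for chrom, pos in sample_sites:
        near_wt = any(c == chrom and abs(p - pos) <= tolerance
                      for c, p in wt_sites)
        if not near_wt:
            kept.append((chrom, pos))
    return kept


# Hypothetical coordinates for illustration only
wt = [("chr", 1000), ("chr", 50000)]
sample = [("chr", 1010), ("chr", 75000), ("plasmid", 1000)]
print(subtract_wt(sample, wt))
```

The quadratic scan is fine for a few dozen calls per sample, which matches the counts reported in these notes. A sorted index would be needed for much larger call sets.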
  11. Validate and Visualize

    Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    
    igv.sh
    
    Validate Findings:
        Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  12. Alternatives to TEPID for Long-Read Data

    If you’re looking for transposon-specific tools for long reads:
    
        REPET: A robust transposon annotation tool compatible with assembled genomes.
        EDTA (Extensive de novo TE Annotator):
            A pipeline to identify, classify, and annotate transposons.
            Works directly on your assembled genomes.
    
            perl EDTA.pl --genome WT.fasta --type all
  13. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    Steps to Generate WT.vcf
    1. Align WT Reads to the WT Reference Genome
    
    The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    
    Command using Minimap2:
    
    minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    
    Index the BAM file:
    
    samtools index WT.sorted.bam
    
    2. Detect Structural Variants with Sniffles2
    
    Run Sniffles2 on the WT alignment to call structural variants:
    
    sniffles --input WT.sorted.bam --vcf WT.vcf
    
    This step identifies:
    
        Native transposons and insertions present in the WT genome.
        Other structural variants that are part of the reference genome or sequencing artifacts.
    
    Key parameters to consider:
    
        --min_support: Adjust based on your WT sequencing coverage.
        --max_distance: Define proximity for merging variants.
        --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
  14. Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

    To ensure the WT.vcf only includes relevant transposons or SVs:
    
        Use bcftools or similar tools to filter out low-confidence variants:
    
        bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
        bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
  15. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species’ genome. It is not assembled from the WT.fastq sequencing reads in this context. Here’s why:

    Why Use a Reference Genome (WT.fasta) from a Database?
    
        Higher Quality and Completeness:
            Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    
        Consistency:
            Using a standard reference ensures consistent comparisons across your WT and samples (1–9). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    
        Saves Time:
            Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    
    Alternative: Assembling WT from FASTQ
    
    If you don’t have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    
        Assemble the genome from your WT.fastq:
            Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    
        flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size 
        Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.

        Use the assembled and polished genome as your WT.fasta reference for further steps.

    Key Takeaways:

        If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
        Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  16. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    #Using snpEff to annotate the insertion!
    conda activate /home/jhuang/miniconda3/envs/spandx
# --> BUG:
LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT
    02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION
VERSION

# --> DEBUG: adapt the genbank-file header as follows:
LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT 02-OCT-2025
DEFINITION  Staphylococcus epidermidis strain HD46-ctrl chromosome, whole
            genome shotgun sequence.
ACCESSION   HD46_Ctrl
VERSION     HD46_Ctrl.1
DBLINK      BioProject: PRJNA1337321
            BioSample: SAMN52215988
KEYWORDS    .
SOURCE      Staphylococcus epidermidis
  ORGANISM  Staphylococcus epidermidis
            Bacteria; Firmicutes; Bacilli; Bacillales; Staphylococcaceae;
            Staphylococcus.
COMMENT     Annotated genome for HD46_Ctrl.
...
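
The two header problems shown above (the date wrapped onto its own line after LOCUS, and empty ACCESSION/VERSION fields) can also be repaired programmatically. The sketch below covers only those two fixes and does not add the DBLINK, KEYWORDS, or SOURCE lines. The function name `repair_header` and the date pattern are assumptions.

```python
import re
from typing import List

# A line holding only a GenBank-style date such as 02-OCT-2025
DATE_RE = re.compile(r"^\s*\d{2}-[A-Z]{3}-\d{4}\s*$")


def repair_header(lines: List[str], accession: str) -> List[str]:
    """Join a wrapped LOCUS date and fill empty ACCESSION/VERSION lines."""
    fixed: List[str] = []
    for line in lines:
        # Re-attach a date that was wrapped onto the line after LOCUS
        if fixed and fixed[-1].startswith("LOCUS") and DATE_RE.match(line):
            fixed[-1] = fixed[-1].rstrip() + " " + line.strip()
            continue
        if line.strip() == "ACCESSION":
            line = f"ACCESSION   {accession}"
        elif line.strip() == "VERSION":
            line = f"VERSION     {accession}.1"
        fixed.append(line)
    return fixed


broken = [
    "LOCUS       HD46_Ctrl 2707468 bp    DNA     circular BCT",
    "    02-OCT-2025",
    "ACCESSION",
    "VERSION",
]
print("\n".join(repair_header(broken, "HD46_Ctrl")))
```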
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/HD46_Ctrl
    cp HD46_Ctrl_chr.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/HD46_Ctrl/genes.gbk

    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config  #HD46_Ctrl.genome : HD46_Ctrl
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build -genbank HD46_Ctrl      -d

    sed -i 's/^cluster_001_consensus/HD46_Ctrl.1/' HD46-8_ngmlr+sniffles_filtered.vcf
    sed -i 's/^cluster_001_consensus/HD46_Ctrl.1/' HD46-13_ngmlr+sniffles_filtered.vcf
    #snpEff eff -nodownload -no-downstream -no-intergenic -ud 100 -v HD46_Ctrl HD46-8_ngmlr+sniffles_filtered.vcf > HD46-8_ngmlr+sniffles_filtered.annotated.vcf
    #snpEff eff -nodownload -no-downstream -no-intergenic -ud 100 -v HD46_Ctrl HD46-13_ngmlr+sniffles_filtered.vcf > HD46-13_ngmlr+sniffles_filtered.annotated.vcf

    # HD46-8
    snpEff ann -Xmx8g -v -hgvs -canon -ud 200 \
    -stats HD46-8_snpeff_stats.html \
    HD46_Ctrl \
    HD46-8_ngmlr+sniffles_filtered.vcf \
    > HD46-8_ngmlr+sniffles_filtered.annotated.vcf

    # HD46-13
    snpEff ann -Xmx8g -v -hgvs -canon -ud 200 \
    -stats HD46-13_snpeff_stats.html \
    HD46_Ctrl \
    HD46-13_ngmlr+sniffles_filtered.vcf \
    > HD46-13_ngmlr+sniffles_filtered.annotated.vcf
  1. Summarize the results as an Excel file

    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    conda activate plot-numpy1
    #python generate_common_vcf.py
    #mv common_variants.xlsx putative_transposons.xlsx
    
    # * Reads each of your VCFs.
    # * Filters variants → only keep those with FILTER == PASS.
    # * Compares the two aligner methods (minimap2+sniffles2 vs ngmlr+sniffles2) per sample.
    # * Keeps only variants that appear in both methods for the same sample.
    # * Outputs: An Excel file with the common variants and a log text file listing which variants were filtered out, and why (not_PASS or not_COMMON_in_two_VCF).
    
    #python generate_fuzzy_common_vcf_v1.py
    #Sample PASS_minimap2   PASS_ngmlr  COMMON
    #  HD46-Ctrl_Ctrl   39  29  28
    #  HD46-1   39  32  29
    #  HD46-2   40  32  28
    #  HD46-3   38  30  27
    #  HD46-4   46  35  32
    #  HD46-5   40  35  31
    #  HD46-6   43  35  30
    #  HD46-7   40  33  28
    #  HD46-8   37  20  11
    #  HD46-13  39  38  27
    
    #Sample PASS_minimap2   PASS_ngmlr  COMMON_FINAL
    #HD46-Ctrl_Ctrl 39  29  6
    #HD46-1 39  32  8
    #HD46-2 40  32  8
    #HD46-3 38  30  6
    #HD46-4 46  35  8
    #HD46-5 40  35  9
    #HD46-6 43  35  10
    #HD46-7 40  33  8
    #HD46-8 37  20  4
    #HD46-13    39  38  5
    
    # !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! #
    #!!!! Summarize the results of ngmlr+sniffles !!!!
    python merge_ngmlr+sniffles_filtered_results_and_summarize.py
    
    #!!!! Post-Processing !!!!
    #DELETE "2186168    N    . PASS" in Sheet HD46-13 and Summary
    #DELETE "2427785 N CGTCAGAATCGCTGTCTGCGTCCGAGTCACTGTCTGAGTCTGAATCACTATCTGCGTCTGAGTCACTGTCTG . PASS" due to "0/1:169:117" in HD46-13 and Summary
    #DELETE "2441640 N GCTCATTAAGAATCATTAAATTAC . PASS" due to "0/1:170:152" in HD46-13 and Summary
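
The comparison logic listed in the comments above (keep only FILTER == PASS, then keep variants reported by both minimap2+sniffles and ngmlr+sniffles for the same sample) can be sketched as follows. This is not the source of `generate_common_vcf.py` or `generate_fuzzy_common_vcf_v1.py`, which are not shown here. The function name `common_pass_calls`, the tuple layout, and the `tolerance` parameter are assumptions for illustration.

```python
from typing import List, Tuple

Call = Tuple[str, int, str]  # (contig name, position, FILTER value)


def common_pass_calls(calls_a: List[Call], calls_b: List[Call],
                      tolerance: int = 0) -> List[Tuple[str, int]]:
    """Return PASS calls from A matched by a PASS call in B within `tolerance` bp."""
    pass_b = [(c, p) for c, p, flt in calls_b if flt == "PASS"]
    common = []
    for chrom, pos, flt in calls_a:
        if flt != "PASS":
            continue  # would be logged as not_PASS
        if any(c == chrom and abs(p - pos) <= tolerance for c, p in pass_b):
            common.append((chrom, pos))
        # otherwise it would be logged as not_COMMON_in_two_VCF
    return common


# Hypothetical calls for illustration only
minimap2_calls = [("chr", 100, "PASS"), ("chr", 900, "UNRESOLVED"), ("chr", 5000, "PASS")]
ngmlr_calls = [("chr", 103, "PASS"), ("chr", 900, "PASS")]
print(common_pass_calls(minimap2_calls, ngmlr_calls))               # exact matching
print(common_pass_calls(minimap2_calls, ngmlr_calls, tolerance=5))  # fuzzy matching
```

With `tolerance=0` only identical positions count as common. A small tolerance allows for breakpoint shifts between the two aligners.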

  2. Add the SnpEff annotations to the Excel file (usage and source code of add_ann_to_excel.py)

    python add_ann_to_excel.py \
        --excel merged_ngmlr+sniffles_variants.xlsx \
        --sheet8 "HD46-8" \
        --sheet13 "HD46-13" \
        --vcf8 HD46-8_ngmlr+sniffles_filtered.annotated.vcf \
        --vcf13 HD46-13_ngmlr+sniffles_filtered.annotated.vcf \
        --out merged_ngmlr+sniffles_variants_with_ANN.xlsx
    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Add SnpEff ANN columns (for SVTYPE=INS) from annotated VCFs into an Excel workbook,
    with detailed debug about why CHROM+POS may not match.
    
    Key improvements:
    - Stronger CHROM/POS normalization (strip 'chr', unify MT naming, coerce numbers).
    - Explicit detection and logging of sheet key columns used.
    - Debug block prints:
      * Unique key counts in sheet vs VCF (before/after normalization)
      * Example non-matching keys from the sheet and from the VCF (top N)
      * Chromosome naming diagnostics (e.g., 'chr' presence, 'MT'/'M' harmonization)
      * Off-by-N diagnostics via --pos_tolerance (counts for would-match @ ±N)
      * Optional preview of the sheet's SV type column, if present
    - Safer ANN parsing and aggregation.
    - Command-line options: --debug_examples, --pos_tolerance
    
    ANN filling is done only for **exact** (CHROM, POS) equality (as before).
    Tolerance is used only for *diagnostics*, not for filling, to avoid incorrect merges.
    """
    
    import argparse
    import gzip
    import io
    import re
    from pathlib import Path
    from typing import List, Tuple, Dict, Iterable, Set
    
    import pandas as pd
    
    FALLBACK_ANN_FIELDS: List[str] = [
        'Allele','Annotation','Annotation_Impact','Gene_Name','Gene_ID',
        'Feature_Type','Feature_ID','Transcript_BioType','Rank','HGVS.c',
        'HGVS.p','cDNA.pos/cDNA.length','CDS.pos/CDS.length','AA.pos/AA.length',
        'Distance','Errors_Warnings_Info'
    ]
    
    def open_text_maybe_gzip(path: Path):
        if str(path).endswith('.gz'):
            return io.TextIOWrapper(gzip.open(path, 'rb'), encoding='utf-8', errors='ignore')
        return open(path, 'r', encoding='utf-8', errors='ignore')
    
    def normalize_chrom(col: pd.Series) -> pd.Series:
        s = col.astype(str).str.strip()
        s = s.str.replace(r'^(chr|CHR)', '', regex=True)
        # Standardize mitochondrial names to "MT"
        s = s.str.replace(r'^(M|MtDNA|MTDNA|Mito|Mitochondrion)$', 'MT', regex=True, case=False)
        return s.str.upper()
    
    def normalize_pos(col: pd.Series) -> pd.Series:
        # Excel can make ints look like floats; coerce then Int64
        # (We keep Int64 nullable for robustness; we never compare NaNs.)
        s = pd.to_numeric(col, errors='coerce')
        # If people had 0-based starts in the sheet (rare for INS), this won't fix it,
        # but the tolerance debug will reveal a +1 shift if present.
        return s.astype('Int64')
    
    def parse_vcf_ann(vcf_path: Path) -> Tuple[pd.DataFrame, List[str]]:
        ann_fields = None
        header_cols = None
        records = []
    
        with open_text_maybe_gzip(vcf_path) as f:
            for line in f:
                if line.startswith('##INFO=<ID=ANN'):
                    m = re.search(r'Format:\s*([^">]+)', line)
                    if m:
                        ann_fields = [s.strip() for s in m.group(1).split('|')]
                if line.startswith('#CHROM'):
                    header_cols = line.strip().lstrip('#').split('\t')
                    break
    
            if not header_cols:
                raise RuntimeError(f"Could not find VCF header line (#CHROM ...) in {vcf_path}")
    
            if not ann_fields:
                ann_fields = FALLBACK_ANN_FIELDS
    
            ann_cols = [f'ANN_{x}' for x in ann_fields]
    
            for line in f:
                if not line or line[0] == '#':
                    continue
                parts = line.rstrip('\n').split('\t')
                if len(parts) < len(header_cols):
                    continue
                row = dict(zip(header_cols, parts))
                info = row.get('INFO', '')
    
                # Only INS
                if not re.search(r'(?:^|;)SVTYPE=INS(?:;|$)', info):
                    continue
    
                chrom = row.get('#CHROM') or row.get('CHROM')
                pos_str = row.get('POS')
                try:
                    pos = int(pos_str)
                except Exception:
                    continue
    
                # Extract ANN entries
                ann_match = re.search(r'(?:^|;)ANN=([^;]+)', info)
                ann_entries = ann_match.group(1).split(',') if ann_match else []
    
                field_values: Dict[str, List[str]] = {k: [] for k in ann_fields}
                for ann in ann_entries:
                    items = ann.split('|')
                    if len(items) < len(ann_fields):
                        items += [''] * (len(ann_fields) - len(items))
                    elif len(items) > len(ann_fields):
                        items = items[:len(ann_fields)]
                    for k, v in zip(ann_fields, items):
                        field_values[k].append(v)
    
                joined = {f'ANN_{k}': (';'.join(v) if v else '') for k, v in field_values.items()}
                records.append({'CHROM': chrom, 'POS': pos, **joined})
    
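        # Build the frame with explicit columns so that a VCF without INS records
        # still yields CHROM/POS/ANN_* columns (downstream code indexes them unconditionally).
        df = pd.DataFrame.from_records(records, columns=['CHROM', 'POS'] + ann_cols)
        df['POS'] = pd.to_numeric(df['POS'], errors='coerce').astype('Int64')
        df['CHROM'] = normalize_chrom(df['CHROM'])
        return df, ann_cols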
    
    def detect_key_columns(df: pd.DataFrame) -> Dict[str, str]:
        chrom_candidates = ['CHROM', '#CHROM', 'Chrom', 'Chromosome', 'chrom', 'chr', 'Chr']
        pos_candidates   = ['POS', 'Position', 'position', 'pos', 'Start', 'start']
        mapping = {}
        for c in chrom_candidates:
            if c in df.columns:
                mapping['CHROM'] = c
                break
        for p in pos_candidates:
            if p in df.columns:
                mapping['POS'] = p
                break
        return mapping
    
    def normalize_chrom_pos_df(df: pd.DataFrame, keys: Dict[str, str]) -> pd.DataFrame:
        out = df.copy()
        out[keys['CHROM']] = normalize_chrom(out[keys['CHROM']])
        out[keys['POS']]   = normalize_pos(out[keys['POS']])
        return out
    
    def summarize_chr_formats(series: pd.Series, label: str):
        raw = series.astype(str)
        has_chr_prefix = raw.str.startswith(('chr','CHR')).sum()
        mt_like = raw.str.fullmatch(r'(M|MtDNA|MTDNA|Mito|Mitochondrion)', case=False).sum()
        print(f"[{label}] CHROM diagnostics:")
        print(f"  total rows: {len(raw)}")
        print(f"  with 'chr'/'CHR' prefix: {has_chr_prefix}")
        print(f"  mitochondrial names like M/MtDNA/etc: {mt_like}")
    
    def keys_set(df: pd.DataFrame, chrom_col: str, pos_col: str) -> Set[Tuple[str, int]]:
        # Drop NA POS, NA CHROM
        sub = df[[chrom_col, pos_col]].dropna()
        # Ensure ints (drop NA after coercion)
        sub = sub[(sub[pos_col].astype('Int64').notna())]
        return set(zip(sub[chrom_col].astype(str), sub[pos_col].astype('int64')))
    
    def tolerance_match_count(sheet_keys: Iterable[Tuple[str,int]],
                            vcf_keys: Set[Tuple[str,int]],
                            tol: int) -> int:
        if tol <= 0:
            return sum(1 for k in sheet_keys if k in vcf_keys)
        cnt = 0
        for chrom, pos in sheet_keys:
            if (chrom, pos) in vcf_keys:
                cnt += 1
            else:
                matched = False
                # check +/- 1..tol
                for d in range(1, tol+1):
                    if (chrom, pos - d) in vcf_keys or (chrom, pos + d) in vcf_keys:
                        matched = True
                        break
                if matched:
                    cnt += 1
        return cnt
    
    def debug_match_report(df_sheet: pd.DataFrame,
                        vcf_df: pd.DataFrame,
                        keys: Dict[str, str],
                        debug_examples: int = 15,
                        pos_tolerance: int = 1):
        print("\n=== DEBUG: Matching overview ===")
        # Raw diagnostics
        summarize_chr_formats(df_sheet[keys['CHROM']], label="SHEET (raw)")
        summarize_chr_formats(vcf_df['CHROM'], label="VCF (normalized)")
    
        # Normalize sheet
        df_norm = normalize_chrom_pos_df(df_sheet, keys)
        print(f"Detected key columns -> CHROM: '{keys['CHROM']}'  POS: '{keys['POS']}'")
        # Basic stats
        n_sheet_all = len(df_sheet)
        # Count on the raw sheet: normalize_chrom() stringifies NaN, which would hide missing values.
        n_sheet_key_nonnull = df_sheet[keys['CHROM']].notna().sum()
        n_sheet_pos_nonnull = df_norm[keys['POS']].notna().sum()
        print(f"SHEET rows total: {n_sheet_all}")
        print(f"SHEET rows with non-null CHROM: {n_sheet_key_nonnull}, non-null POS: {n_sheet_pos_nonnull}")
    
        # Unique key counts
        sheet_norm_keys_df = df_norm.rename(columns={keys['CHROM']: 'CHROM', keys['POS']: 'POS'})
        sheet_norm_keys_df = sheet_norm_keys_df.dropna(subset=['CHROM','POS'])
        sheet_norm_keys_df['POS'] = sheet_norm_keys_df['POS'].astype('Int64')
        sheet_keys_unique = keys_set(sheet_norm_keys_df, 'CHROM', 'POS')
        vcf_keys_unique   = keys_set(vcf_df, 'CHROM', 'POS')
    
        print(f"Unique (CHROM,POS) keys -> SHEET: {len(sheet_keys_unique)}  VCF(INS): {len(vcf_keys_unique)}")
    
        # Exact match count
        exact_matches = len(sheet_keys_unique & vcf_keys_unique)
        print(f"Exact key matches (SHEET∩VCF): {exact_matches}")
    
        # Tolerance diagnostics (diagnose off-by-one etc.)
        if pos_tolerance > 0:
            approx_matches = tolerance_match_count(sheet_keys_unique, vcf_keys_unique, pos_tolerance)
            print(f"Keys that would match within ±{pos_tolerance}: {approx_matches}")
    
        # Show some examples of non-matching keys from SHEET
        if debug_examples > 0:
            not_in_vcf = sorted(k for k in sheet_keys_unique if k not in vcf_keys_unique)
            not_in_sheet = sorted(k for k in vcf_keys_unique if k not in sheet_keys_unique)
            print(f"\nExamples of SHEET keys not found in VCF (showing up to {debug_examples}):")
            for k in not_in_vcf[:debug_examples]:
                print("  SHEET-only:", k)
            print(f"\nExamples of VCF keys not found in SHEET (showing up to {debug_examples}):")
            for k in not_in_sheet[:debug_examples]:
                print("  VCF-only:", k)
    
        # Try to detect a type column and report counts
        type_cols = [c for c in df_sheet.columns if c.lower() in ('svtype','type','variant_type','sv_type')]
        if type_cols:
            tcol = type_cols[0]
            is_ins = df_sheet[tcol].astype(str).str.upper() == 'INS'
            print(f"\nType column detected: '{tcol}'. SHEET rows with INS: {int(is_ins.sum())} / {len(df_sheet)}")
            # Of the INS rows, how many have keys that match?
            ins_keys = keys_set(df_norm[is_ins], keys['CHROM'], keys['POS'])
            exact_ins_matches = len(ins_keys & vcf_keys_unique)
            print(f"  INS-only exact key matches: {exact_ins_matches} / {len(ins_keys)}")
            if pos_tolerance > 0:
                approx_ins_matches = tolerance_match_count(ins_keys, vcf_keys_unique, pos_tolerance)
                print(f"  INS-only matches within ±{pos_tolerance}: {approx_ins_matches} / {len(ins_keys)}")
        else:
            print("\nNo explicit type column found in SHEET.")
    
    def merge_ann_into_sheet(df_sheet: pd.DataFrame, vcf_df: pd.DataFrame, ann_cols: List[str],
                            pos_tolerance: int = 1, debug_examples: int = 15) -> pd.DataFrame:
        df = df_sheet.copy()
    
        keys = detect_key_columns(df)
        if 'CHROM' not in keys or 'POS' not in keys:
            print("WARNING: Could not detect CHROM/POS columns in sheet; ANN columns will be empty.")
            for c in ann_cols:
                if c not in df.columns:
                    df[c] = ''
            return df
    
        # DEBUG: run a comprehensive match report
        debug_match_report(df, vcf_df, keys, debug_examples=debug_examples, pos_tolerance=pos_tolerance)
    
        # Normalize sheet keys for merge
        df_norm = normalize_chrom_pos_df(df, keys)
    
        # Prepare VCF map (unique by CHROM,POS), aggregate ANN fields
        vcf_use = vcf_df.copy()
        if vcf_use.empty:
            print("NOTE: No INS records found in VCF; ANN columns will be created but empty.")
        else:
            agg = {c: lambda s: ';'.join([x for x in s.astype(str).tolist() if x]) for c in ann_cols}
            vcf_use = vcf_use.groupby(['CHROM', 'POS'], as_index=False).agg(agg)
    
        # Identify potential type column in sheet
        type_cols = [c for c in df.columns if c.lower() in ('svtype','type','variant_type','sv_type')]
        has_type = bool(type_cols)
        if has_type:
            tcol = type_cols[0]
            is_ins = df[tcol].astype(str).str.upper() == 'INS'
            print(f"\nMERGE: using type column '{tcol}' -> rows marked INS: {int(is_ins.sum())} / {len(df)}")
        else:
            is_ins = pd.Series([False]*len(df), index=df.index)
            print("\nMERGE: no type column -> will fill ANN wherever exact (CHROM,POS) matches VCF INS.")
    
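        # Left merge on exact keys only (do not use tolerance for filling, just for diagnostics)
        left = df_norm.rename(columns={keys['CHROM']: 'CHROM', keys['POS']: 'POS'})
        # Drop ANN columns already present in the sheet (e.g. on a re-run) to avoid name clashes in the merge
        left = left.drop(columns=[c for c in ann_cols if c in left.columns])
        merged = left.merge(vcf_use[['CHROM','POS'] + ann_cols], on=['CHROM','POS'], how='left')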
    
        # Initialize ANN columns on original df
        for c in ann_cols:
            if c not in df.columns:
                df[c] = ''
    
        # Fill values:
        for c in ann_cols:
            values = merged[c]
            if has_type:
                df.loc[is_ins, c] = values[is_ins].fillna('').astype(str).values
            else:
                df[c] = values.fillna('').astype(str).values
    
        # Report matching stats on the actual merge
        matched = merged[ann_cols].notna().any(axis=1).sum()
        print(f"\nMERGE RESULT: rows with any ANN filled (exact VCF match): {int(matched)} / {len(df)}")
    
        # Additional hint if tolerance suggests many near-misses
        if pos_tolerance > 0:
            sheet_keys = keys_set(left, 'CHROM', 'POS')
            vcf_keys_unique = keys_set(vcf_use, 'CHROM', 'POS')
            approx = tolerance_match_count(sheet_keys, vcf_keys_unique, pos_tolerance)
            if approx > matched:
                print(f"NOTE: There appear to be {approx - matched} additional rows that would match within ±{pos_tolerance}.")
                print("      This often indicates a 0-based vs 1-based position shift or use of END instead of POS in the sheet.")
    
        return df
    
    def main():
        ap = argparse.ArgumentParser()
        ap.add_argument('--excel', default='merged_ngmlr+sniffles_variants.xlsx', help='Input Excel workbook')
        ap.add_argument('--sheet8', default='HD46-8', help='Sheet name for HD46-8 sample')
        ap.add_argument('--sheet13', default='HD46-13', help='Sheet name for HD46-13 sample')
        ap.add_argument('--vcf8', default='HD46-8_ngmlr+sniffles_filtered.annotated.vcf', help='Annotated VCF for HD46-8')
        ap.add_argument('--vcf13', default='HD46-13_ngmlr+sniffles_filtered.annotated.vcf', help='Annotated VCF for HD46-13')
        ap.add_argument('--out', default='merged_ngmlr+sniffles_variants_with_ANN.xlsx', help='Output Excel path')
        ap.add_argument('--debug_examples', type=int, default=15, help='How many non-match examples to print from each side')
        ap.add_argument('--pos_tolerance', type=int, default=1, help='Diagnostic tolerance (±N bp) for off-by-N checks (used for debug only)')
        args = ap.parse_args()
    
        excel_path = Path(args.excel)
        vcf8_path = Path(args.vcf8)
        vcf13_path = Path(args.vcf13)
        out_path = Path(args.out)
    
        # Load sheets (resolve case-insensitive names)
        xls = pd.ExcelFile(excel_path)
        def resolve_sheet(name: str) -> str:
            if name in xls.sheet_names:
                return name
            lower_map = {s.lower(): s for s in xls.sheet_names}
            return lower_map.get(name.lower(), name)
    
        sheet8 = resolve_sheet(args.sheet8)
        sheet13 = resolve_sheet(args.sheet13)
    
        df8 = pd.read_excel(excel_path, sheet_name=sheet8)
        df13 = pd.read_excel(excel_path, sheet_name=sheet13)
    
        # Parse VCFs (INS only)
        vcf8_df, ann_cols = parse_vcf_ann(vcf8_path)
        vcf13_df, _ = parse_vcf_ann(vcf13_path)
    
        print(f"VCF8 INS variants: {len(vcf8_df)}; VCF13 INS variants: {len(vcf13_df)}")
        print(f"ANN subfields ({len(ann_cols)}): {', '.join(ann_cols)}")
    
        # Merge with diagnostics
        df8_out = merge_ann_into_sheet(df8, vcf8_df, ann_cols,
                                    pos_tolerance=args.pos_tolerance,
                                    debug_examples=args.debug_examples)
        df13_out = merge_ann_into_sheet(df13, vcf13_df, ann_cols,
                                        pos_tolerance=args.pos_tolerance,
                                        debug_examples=args.debug_examples)
    
        # Save
        with pd.ExcelWriter(out_path, engine='xlsxwriter') as writer:
            df8_out.to_excel(writer, sheet_name=sheet8, index=False)
            df13_out.to_excel(writer, sheet_name=sheet13, index=False)
    
        print(f"\nDone. Wrote: {out_path.resolve()}")
    
    if __name__ == '__main__':
        main()
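
To illustrate why the script fills ANN only on exact (CHROM, POS) matches and uses the ±N tolerance purely for diagnostics, here is a small standalone sketch. It is not taken from the script: `normalize_key` and `near` are hypothetical helpers that mirror the normalization idea in simplified form, and the keys are made-up toy values.

```python
def normalize_key(chrom, pos):
    """Simplified key normalization: strip a 'chr' prefix, unify MT naming, coerce POS to int."""
    c = str(chrom).strip()
    if c.lower().startswith("chr"):
        c = c[3:]
    c = c.upper()
    if c in {"M", "MTDNA", "MITO", "MITOCHONDRION"}:
        c = "MT"
    return c, int(float(pos))

def near(key, keys, tol):
    """True if `key` matches any entry of `keys` on the same chromosome within ±tol bp."""
    chrom, pos = key
    return any((chrom, pos + d) in keys for d in range(-tol, tol + 1))

# Toy keys: the sheet uses 'chr' prefixes and float positions, the VCF does not.
sheet_keys = {normalize_key(c, p) for c, p in [("chr1", 100.0), ("1", 251), ("chrM", 16000)]}
vcf_keys = {normalize_key(c, p) for c, p in [("1", 100), ("1", 250), ("MT", 16000)]}

exact = sheet_keys & vcf_keys
within_1 = {k for k in sheet_keys if near(k, vcf_keys, 1)}

print("exact matches:", sorted(exact))
print("near misses (±1 only):", sorted(within_1 - exact))
```

In this toy case the key `('1', 251)` only matches within ±1, which is the kind of near miss the diagnostic report is meant to surface (for example a 0-based vs 1-based shift) without silently merging it.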
  3. Manually merge all contents of the ANN=? field into a separate column 'ANN' in the isolate-specific sheets of the Excel file.

    #Add CHROM and HD46_Ctrl.1 to first column of the input Excel-file
    (plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Patricia_Transposon_2025$ python add_ann_to_excel.py \
        --excel merged_ngmlr+sniffles_variants.xlsx \
        --sheet8 "HD46-8" \
        --sheet13 "HD46-13" \
        --vcf8 HD46-8_ngmlr+sniffles_filtered.annotated.vcf \
        --vcf13 HD46-13_ngmlr+sniffles_filtered.annotated.vcf \
        --out merged_ngmlr+sniffles_variants_with_ANN.xlsx
    #DEL some columns (INFO, ANN_Allele, ANN_Rank, ANN_Errors_Warnings_Info) from the table, and COPY the summary-sheet to the final table.
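
The manual steps in this item (collecting the ANN=... content into one 'ANN' column, then deleting the helper columns) could be approximated with pandas. This is a sketch under assumptions: it presumes the sheet still carries the raw VCF `INFO` string, the dropped column names follow the list in the comment above (with `ANN_Allele` assumed), and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for an isolate-specific sheet (values are illustrative only).
toy = pd.DataFrame({
    "CHROM": ["1", "1"],
    "POS": [100, 250],
    "INFO": [
        "SVTYPE=INS;ANN=A|intergenic_region|MODIFIER;SVLEN=50",
        "SVTYPE=DEL;SVLEN=-20",
    ],
    "ANN_Rank": ["", ""],
})

# Pull everything after 'ANN=' (up to the next ';') into a single 'ANN' column.
toy["ANN"] = toy["INFO"].str.extract(r"(?:^|;)ANN=([^;]+)", expand=False).fillna("")

# Remove the columns that are not wanted in the final table; missing ones are ignored.
cols_to_drop = ["INFO", "ANN_Allele", "ANN_Rank", "ANN_Errors_Warnings_Info"]
final = toy.drop(columns=cols_to_drop, errors="ignore")
print(final)
```

Rows without an ANN entry simply get an empty string, so the column can be reviewed in Excel without NaN placeholders.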

Bilateral Plantar Fasciitis and Extracorporeal Shock Wave Therapy (ESWT): A Complete Guide

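Overview

Bilateral plantar fasciitis is one of the most common causes of heel pain in adults. It typically presents as a stabbing pain on the inner side of the heel with the first steps in the morning or after prolonged sitting; the pain eases after a little walking but worsens again after prolonged standing, walking, or running. [31][32][33] The cause is mostly micro-tears and degenerative changes from repeated traction on the plantar fascia rather than simple acute inflammation. When both feet are affected, gait compensation is more pronounced and weight-bearing tolerance drops. [34][35][36]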

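Definition and Pathology

Despite the "-itis" in its name, microscopic and clinical findings point mainly to enthesopathy and degenerative change (micro-tears, collagen degeneration) rather than typical acute inflammatory cell infiltration. [34] Lesions usually arise at the proximal attachment of the plantar fascia on the medial calcaneal tubercle, with fascial thickening and tenderness; imaging may show a calcaneal spur, but the spur itself is not the direct cause of pain. [35][37][34] The plantar fascia runs from the calcaneus to the base of the toes, supports the arch, and absorbs impact, so it is prone to injury under repetitive loading such as running, jumping, or prolonged standing. [33][31]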

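Common Symptoms

Sharp or stabbing pain at the inner heel or in the arch, most marked with the first steps in the morning; it eases temporarily after some activity but worsens again after prolonged weight bearing. [36][31] In bilateral cases, weight bearing is limited symmetrically in both feet and compensatory gait is more prominent, which reduces tolerance for prolonged standing and walking. [36][34]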

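Risk Factors

  • Abnormal arches (flat feet or high arches) and alignment deviations increase stress on the fascia. [37][35][34]
  • Tight gastrocnemius/soleus muscles and a tight Achilles tendon limit ankle dorsiflexion. [35][34]
  • Prolonged standing, jumping, long-distance or downhill running, sudden increases in training volume, and loading on hard surfaces. [34][35][36]
  • Weight gain or obesity, degeneration of the heel fat pad, and unsuitable footwear (insufficient support or cushioning). [35][36][34]
  • Bilateral involvement often points to biomechanical and loading patterns shared by both feet, which calls for symmetrical management. [36][34]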

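Key Points in Diagnosis

Diagnosis is clinical and rests mainly on history and physical examination: plantar tenderness at the anteromedial calcaneus, first-step pain in the morning, and limited ankle dorsiflexion, combined with an assessment of risk factors. [31][35] Where needed, ultrasound can show fascial thickening and edema, and X-ray can show a calcaneal spur and help exclude other conditions such as a stress fracture. [35]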

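Conservative Treatment

Most patients improve within a few months under non-surgical treatment. The goals are pain relief, restoring flexibility of the fascia and the posterior calf muscles, optimizing arch alignment and support, and a gradual return to activity (with both feet managed in parallel). [33][35]

  • Load management: reduce high-impact activity and prolonged standing, split activity into segments, and increase training volume step by step (balanced across both feet). [34][36]
  • Icing, and short-term oral NSAIDs for pain relief as directed by a physician. [33][35]
  • Stretching: plantar-fascia-specific stretches and posterior calf (gastrocnemius/soleus) stretches, several times a day, with gradual progression. [33][35]
  • Night splints that hold the ankle in neutral to slight dorsiflexion to reduce first-step pain in the morning. [35][33]
  • Orthotic support: arch-support insoles, heel cups or pads, and short-term taping to correct alignment where needed. [33][35]
  • Physical therapy: proprioception and strength training, soft-tissue release, and ultrasound where needed, based on an individualized assessment. [38][35]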

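Extracorporeal Shock Wave Therapy (ESWT)

ESWT (extracorporeal shock wave therapy; German: Extrakorporale Stoßwellentherapie) is a non-invasive treatment that uses short, high-pressure acoustic pulses to stimulate tissue repair and relieve pain. It is often used for chronic cases that have not responded to conservative treatment, and in clinical practice it is commonly planned as an outpatient course of about three sessions. [34][35]

Modes and targeting: ESWT comes in focused (fESWT) and radial (rESWT) forms. It is applied through the skin at the point of maximal pain; a single session takes only a few minutes and can be done without anesthesia. [34][35]

Effect and time course: the goal is medium- to long-term pain reduction and functional improvement, which usually develops gradually over weeks to months. Combining it with stretching, insoles, and load management can improve the effect and its durability. [35][33]

Adverse effects: mostly transient tenderness, redness and swelling, or small bruises. Serious complications are rare and the overall safety profile is good. [34][35]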

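Other Options

Ultrasound-guided corticosteroid injection can provide short-term pain relief but carries risks of recurrence and fascial damage, so it has to be weighed carefully; for refractory pain, shock wave therapy is a further non-invasive option. [35][34] Acupuncture has some evidence for pain relief over 4–8 weeks, but no long-term advantage has been established, and it should be considered alongside standard treatment. [39] Surgery (such as partial fascial release) is reserved for long-standing refractory cases in which conservative treatment has failed, which is a very small proportion. [34]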

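Prognosis and Recurrence

Most patients improve markedly within a few months under proper conservative treatment, and some cases are self-limiting within a year. The pain can recur, however, so stretching and alignment management need to be maintained long term. [35][34] Because bilateral disease means a higher overall load and more compensation, rehabilitation should progress more gradually, with added emphasis on posterior calf flexibility and arch support. [33][34]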

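Everyday Advice

  • Choose shoes with good arch support and heel cushioning, replace worn soles promptly, and avoid long barefoot walks on hard surfaces. [33][35]
  • Increase exercise volume step by step, alternate with low-impact training (such as cycling or swimming), and manage body weight to reduce load on the heel. [36][35]
  • Stretch the plantar fascia and the posterior calf muscles 2–3 times a day, holding each stretch for 20–30 seconds and repeating for 3–5 sets. [33][35]
  • Before getting out of bed in the morning, do ankle pumps and towel stretches to reduce first-step pain; people who stand for long periods at work can use soft mats and take breaks in segments. [35][33]
  • If there is no improvement after several weeks, or if night-time rest pain or neurological symptoms appear, seek medical evaluation to exclude other causes. [35] [40][41][42][43][44][45][46][47]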