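World Cup Prediction Game (Tippspiel)
Because the 2026 World Cup in the USA, Canada and Mexico has expanded to 48 teams and 104 matches in total, predicting the exact score of every group match would take up a great deal of space. If you are taking part in a World Cup prediction game (Tippspiel) at work or among friends, I have put together a table of score predictions covering the most essential fixtures.
These predictions cover the key group-stage matches, the full knockout path (from the round of 16 to the final) and the bonus questions; you can copy them straight into your prediction sheet: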
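⚽ 1. Key Group-Stage Match Predictions (Gruppenphase)
In the group stage, strong teams usually play it safe, so scorelines tend not to be extreme.
🇺🇸🇨🇦🇲🇽 Hosts' opening matches:
- 🇲🇽 Mexico vs South Africa 🇿🇦 ➔ 2:0
- 🇨🇦 Canada vs Bosnia and Herzegovina 🇧🇦 ➔ 1:0
- 🇺🇸 USA vs Paraguay 🇵🇾 ➔ 2:1
🏆 Typical group-stage scores for title favourites and strong teams:
- 🇩🇪 Germany vs Ecuador/Australia ➔ 3:1
- 🇪🇸 Spain vs Croatia/New Zealand ➔ 2:0
- 🇫🇷 France vs Denmark/Canada ➔ 2:1
- 🇦🇷 Argentina vs Australia/Peru ➔ 2:0
- 🏴 England vs Serbia/Slovenia ➔ 2:0
- 🇧🇷 Brazil vs Scotland/Morocco ➔ 2:1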
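🏆 2. Full Knockout Path Predictions (K.O.-Phase)
Knockout matches are usually tighter defensively; one- or two-goal margins are the norm, and some matches go to a penalty shoot-out.
🥊 Round of 16 (Achtelfinale) – key fixtures
- 🇪🇸 Spain 2:0 Japan 🇯🇵
- 🇩🇪 Germany 2:1 Mexico 🇲🇽
- 🇫🇷 France 3:0 Poland 🇵🇱
- 🇦🇷 Argentina 2:1 Australia 🇦🇺
- 🏴 England 1:0 Colombia 🇨🇴
- 🇧🇷 Brazil 2:0 Switzerland 🇨🇭
- 🇵🇹 Portugal 1:0 Uruguay 🇺🇾
- 🇳🇱 Netherlands 2:1 USA 🇺🇸 (the hosts bow out in the round of 16)
🔥 Quarter-finals (Viertelfinale)
- 🇪🇸 Spain 2:1 Germany 🇩🇪 (marquee clash)
- 🇫🇷 France 2:1 Argentina 🇦🇷 (defending champions eliminated)
- 🏴 England 1:1 (4:3 on penalties) Brazil 🇧🇷
- 🇵🇹 Portugal 2:1 Netherlands 🇳🇱
🚀 Semi-finals (Halbfinale)
- 🇪🇸 Spain 2:1 France 🇫🇷
- 🏴 England 2:1 Portugal 🇵🇹
🥉 Third-place play-off (Spiel um Platz 3)
- 🇫🇷 France 2:0 Portugal 🇵🇹
🏆 Final (Finale) – MetLife Stadium, New Jersey
- 🇪🇸 Spain 2 : 1 England 🏴 (Prediction: with stronger midfield control (e.g. Rodri, Pedri) and an explosive threat out wide (Yamal), Spain score a late winner against England in normal time or extra time and lift the trophy for the second time in their history!)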
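🎯 3. Bonus Questions (Rahmenwetten)
If your prediction sheet includes these questions, you can fill in the following answers:
- 🏆 World Cup winner (Weltmeister): Spain (Spanien)
- 🥇 Top scorer / Golden Boot (Torschützenkönig): Kylian Mbappé – predicted goals: 6
- 🌟 Best young player (Bester junger Spieler): Lamine Yamal (Spain)
- 🧤 Best goalkeeper (Bester Torwart): Emiliano Martínez (Argentina) or Unai Simón (Spain)
- ⚽ Final score (Ergebnis des Finales): 2:1
- 📊 Total tournament goals (Anzahl der Turniertore): 268 (48 teams, 104 matches, roughly 2.5–2.8 goals per match)
💡 Tip (Tipp): When filling in your sheet, favour low scores such as 1:0, 2:0, 2:1 and 1:1 for the knockout rounds; in World Cup knockout matches these are far more likely than high scores (such as 3:2 or 4:1). Good luck with your prediction game (Viel Glück beim Tippspiel)!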
RNAseq processing (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE)
complete_deg_pipeline_custom_cutoff.R
-
Preparing raw data
mkdir raw_data; cd raw_data
# control samples (8)
ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_1.fq.gz AYE-WT_ctr_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_2.fq.gz AYE-WT_ctr_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_1.fq.gz AYE-WT_ctr_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_2.fq.gz AYE-WT_ctr_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_1.fq.gz AYE-WT_ctr_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_2.fq.gz AYE-WT_ctr_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_1.fq.gz AYE-T_ctr_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_2.fq.gz AYE-T_ctr_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_1.fq.gz AYE-T_ctr_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_2.fq.gz AYE-T_ctr_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_1.fq.gz AYE-T_ctr_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_2.fq.gz AYE-T_ctr_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_1.fq.gz AYE-O_ctr_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_2.fq.gz AYE-O_ctr_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_1.fq.gz AYE-O_ctr_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_2.fq.gz AYE-O_ctr_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_1.fq.gz AYE-O_ctr_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_2.fq.gz AYE-O_ctr_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_1.fq.gz O-Trans_ctr_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_2.fq.gz O-Trans_ctr_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_1.fq.gz O-Trans_ctr_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_2.fq.gz O-Trans_ctr_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_1.fq.gz O-Trans_ctr_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_2.fq.gz O-Trans_ctr_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_1.fq.gz WT-Trans_ctr_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_2.fq.gz WT-Trans_ctr_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_1.fq.gz WT-Trans_ctr_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_2.fq.gz WT-Trans_ctr_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_1.fq.gz WT-Trans_ctr_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_2.fq.gz WT-Trans_ctr_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_1.fq.gz AYE-WT_ctr_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_2.fq.gz AYE-WT_ctr_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_1.fq.gz AYE-WT_ctr_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_2.fq.gz AYE-WT_ctr_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_1.fq.gz AYE-WT_ctr_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_2.fq.gz AYE-WT_ctr_solid_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_1.fq.gz AYE-O_ctr_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_2.fq.gz AYE-O_ctr_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_1.fq.gz AYE-O_ctr_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_2.fq.gz AYE-O_ctr_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_1.fq.gz AYE-O_ctr_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_2.fq.gz AYE-O_ctr_solid_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_1.fq.gz AYE-T_ctr_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_2.fq.gz AYE-T_ctr_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_1.fq.gz AYE-T_ctr_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_2.fq.gz AYE-T_ctr_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_1.fq.gz AYE-T_ctr_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_2.fq.gz AYE-T_ctr_solid_r3_R2.fastq.gz
# Diclofenac treatment (6)
ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_1.fq.gz AYE-WT_Diclo750_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_2.fq.gz AYE-WT_Diclo750_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_1.fq.gz AYE-WT_Diclo750_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_2.fq.gz AYE-WT_Diclo750_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_1.fq.gz AYE-WT_Diclo750_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_2.fq.gz AYE-WT_Diclo750_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_1.fq.gz AYE-T_Diclo375_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_2.fq.gz AYE-T_Diclo375_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_1.fq.gz AYE-T_Diclo375_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_2.fq.gz AYE-T_Diclo375_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_1.fq.gz AYE-T_Diclo375_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_2.fq.gz AYE-T_Diclo375_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_1.fq.gz AYE-O_Diclo375_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_2.fq.gz AYE-O_Diclo375_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_1.fq.gz AYE-O_Diclo375_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_2.fq.gz AYE-O_Diclo375_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_1.fq.gz AYE-O_Diclo375_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_2.fq.gz AYE-O_Diclo375_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_1.fq.gz O-Trans_Diclo375_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_2.fq.gz O-Trans_Diclo375_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_1.fq.gz O-Trans_Diclo375_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_2.fq.gz O-Trans_Diclo375_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_1.fq.gz O-Trans_Diclo375_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_2.fq.gz O-Trans_Diclo375_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_1.fq.gz WT-Trans_Diclo750_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_2.fq.gz WT-Trans_Diclo750_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_1.fq.gz WT-Trans_Diclo750_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_2.fq.gz WT-Trans_Diclo750_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_1.fq.gz WT-Trans_Diclo750_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_2.fq.gz WT-Trans_Diclo750_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_1.fq.gz AYE-WT_Diclo1250_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_2.fq.gz AYE-WT_Diclo1250_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_1.fq.gz AYE-WT_Diclo1250_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_2.fq.gz AYE-WT_Diclo1250_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_1.fq.gz AYE-WT_Diclo1250_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_2.fq.gz AYE-WT_Diclo1250_solid_r3_R2.fastq.gz
# Rifampicin treatment (4)
ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_1.fq.gz AYE-WT_Rifampicin1.5_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_2.fq.gz AYE-WT_Rifampicin1.5_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_1.fq.gz AYE-WT_Rifampicin1.5_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_2.fq.gz AYE-WT_Rifampicin1.5_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_1.fq.gz AYE-WT_Rifampicin1.5_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_2.fq.gz AYE-WT_Rifampicin1.5_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_1.fq.gz AYE-T_Rifampicin2_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_2.fq.gz AYE-T_Rifampicin2_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_1.fq.gz AYE-T_Rifampicin2_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_2.fq.gz AYE-T_Rifampicin2_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_1.fq.gz AYE-T_Rifampicin2_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_2.fq.gz AYE-T_Rifampicin2_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_1.fq.gz AYE-O_Rifampicin2_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_2.fq.gz AYE-O_Rifampicin2_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_1.fq.gz AYE-O_Rifampicin2_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_2.fq.gz AYE-O_Rifampicin2_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_1.fq.gz AYE-O_Rifampicin2_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_2.fq.gz AYE-O_Rifampicin2_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_1.fq.gz O-Trans_Rifampicin2_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_2.fq.gz O-Trans_Rifampicin2_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_1.fq.gz O-Trans_Rifampicin2_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_2.fq.gz O-Trans_Rifampicin2_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_1.fq.gz O-Trans_Rifampicin2_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_2.fq.gz O-Trans_Rifampicin2_r3_R2.fastq.gz
# Meropenem treatment (4)
ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_1.fq.gz AYE-WT_Mero0.35-0.5_r1_R1.fastq.gz #AYE-WT_Mero0.5_r1
ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_2.fq.gz AYE-WT_Mero0.35-0.5_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_1.fq.gz AYE-WT_Mero0.35-0.5_r2_R1.fastq.gz #AYE-WT_YX_Mero0.35_r2
ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_2.fq.gz AYE-WT_Mero0.35-0.5_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_1.fq.gz AYE-WT_Mero0.35-0.5_r3_R1.fastq.gz #AYE-WT_public_Mero0.35_r3
ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_2.fq.gz AYE-WT_Mero0.35-0.5_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_1.fq.gz AYE-T_Mero0.15_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_2.fq.gz AYE-T_Mero0.15_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_1.fq.gz AYE-T_Mero0.15_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_2.fq.gz AYE-T_Mero0.15_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_1.fq.gz AYE-T_Mero0.15_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_2.fq.gz AYE-T_Mero0.15_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_1.fq.gz AYE-O_Mero0.5_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_2.fq.gz AYE-O_Mero0.5_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_1.fq.gz AYE-O_Mero0.5_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_2.fq.gz AYE-O_Mero0.5_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_1.fq.gz AYE-O_Mero0.5_r3_R1.fastq.gz #Mero0.45
ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_2.fq.gz AYE-O_Mero0.5_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_1.fq.gz O-Trans_Mero0.25_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_2.fq.gz O-Trans_Mero0.25_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_1.fq.gz O-Trans_Mero0.25_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_2.fq.gz O-Trans_Mero0.25_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_1.fq.gz O-Trans_Mero0.25_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_2.fq.gz O-Trans_Mero0.25_r3_R2.fastq.gz
# Azithromycin treatment (5); among these, F_ctr_solid is a clinical isolate.
ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_1.fq.gz F_ctr_solid_r1_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_2.fq.gz F_ctr_solid_r1_R2.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_1.fq.gz F_ctr_solid_r2_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_2.fq.gz F_ctr_solid_r2_R2.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_1.fq.gz F_ctr_solid_r3_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_2.fq.gz F_ctr_solid_r3_R2.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_1.fq.gz AYE-WT_Azi20_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_2.fq.gz AYE-WT_Azi20_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_1.fq.gz AYE-WT_Azi20_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_2.fq.gz AYE-WT_Azi20_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_1.fq.gz AYE-WT_Azi20_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_2.fq.gz AYE-WT_Azi20_solid_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_1.fq.gz AYE-T_Azi20_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_2.fq.gz AYE-T_Azi20_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_1.fq.gz AYE-T_Azi20_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_2.fq.gz AYE-T_Azi20_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_1.fq.gz AYE-T_Azi20_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_2.fq.gz AYE-T_Azi20_solid_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_1.fq.gz AYE-O_Azi20_solid_r1_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_2.fq.gz AYE-O_Azi20_solid_r1_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_1.fq.gz AYE-O_Azi20_solid_r2_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_2.fq.gz AYE-O_Azi20_solid_r2_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_1.fq.gz AYE-O_Azi20_solid_r3_R1.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_2.fq.gz AYE-O_Azi20_solid_r3_R2.fastq.gz
ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_1.fq.gz F_Azi20_solid_r1_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_2.fq.gz F_Azi20_solid_r1_R2.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_1.fq.gz F_Azi20_solid_r2_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_2.fq.gz F_Azi20_solid_r2_R2.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_1.fq.gz F_Azi20_solid_r3_R1.fastq.gz #clinical
ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_2.fq.gz F_Azi20_solid_r3_R2.fastq.gz #clinical
-
Preparing the directory trimmed
mkdir trimmed trimmed_unpaired;
for sample_id in \
    AYE-O_Azi20_solid_r1 AYE-O_Azi20_solid_r2 AYE-O_Azi20_solid_r3 \
    AYE-O_ctr_r1 AYE-O_ctr_r2 AYE-O_ctr_r3 \
    AYE-O_ctr_solid_r1 AYE-O_ctr_solid_r2 AYE-O_ctr_solid_r3 \
    AYE-O_Diclo375_r1 AYE-O_Diclo375_r2 AYE-O_Diclo375_r3 \
    AYE-O_Mero0.5_r1 AYE-O_Mero0.5_r2 AYE-O_Mero0.5_r3 \
    AYE-O_Rifampicin2_r1 AYE-O_Rifampicin2_r2 AYE-O_Rifampicin2_r3 \
    AYE-T_Azi20_solid_r1 AYE-T_Azi20_solid_r2 AYE-T_Azi20_solid_r3 \
    AYE-T_ctr_r1 AYE-T_ctr_r2 AYE-T_ctr_r3 \
    AYE-T_ctr_solid_r1 AYE-T_ctr_solid_r2 AYE-T_ctr_solid_r3 \
    AYE-T_Diclo375_r1 AYE-T_Diclo375_r2 AYE-T_Diclo375_r3 \
    AYE-T_Mero0.15_r1 AYE-T_Mero0.15_r2 AYE-T_Mero0.15_r3 \
    AYE-T_Rifampicin2_r1 AYE-T_Rifampicin2_r2 AYE-T_Rifampicin2_r3 \
    AYE-WT_Azi20_solid_r1 AYE-WT_Azi20_solid_r2 AYE-WT_Azi20_solid_r3 \
    AYE-WT_ctr_r1 AYE-WT_ctr_r2 AYE-WT_ctr_r3 \
    AYE-WT_ctr_solid_r1 AYE-WT_ctr_solid_r2 AYE-WT_ctr_solid_r3 \
    AYE-WT_Diclo1250_solid_r1 AYE-WT_Diclo1250_solid_r2 AYE-WT_Diclo1250_solid_r3 \
    AYE-WT_Diclo750_r1 AYE-WT_Diclo750_r2 AYE-WT_Diclo750_r3 \
    AYE-WT_Mero0.35-0.5_r1 AYE-WT_Mero0.35-0.5_r2 AYE-WT_Mero0.35-0.5_r3 \
    AYE-WT_Rifampicin1.5_r1 AYE-WT_Rifampicin1.5_r2 AYE-WT_Rifampicin1.5_r3 \
    F_Azi20_solid_r1 F_Azi20_solid_r2 F_Azi20_solid_r3 \
    F_ctr_solid_r1 F_ctr_solid_r2 F_ctr_solid_r3 \
    O-Trans_ctr_r1 O-Trans_ctr_r2 O-Trans_ctr_r3 \
    O-Trans_Diclo375_r1 O-Trans_Diclo375_r2 O-Trans_Diclo375_r3 \
    O-Trans_Mero0.25_r1 O-Trans_Mero0.25_r2 O-Trans_Mero0.25_r3 \
    O-Trans_Rifampicin2_r1 O-Trans_Rifampicin2_r2 O-Trans_Rifampicin2_r3 \
    WT-Trans_ctr_r1 WT-Trans_ctr_r2 WT-Trans_ctr_r3 \
    WT-Trans_Diclo750_r1 WT-Trans_Diclo750_r2 WT-Trans_Diclo750_r3; do
    # To rerun a single sample only, use this loop header instead of the one above:
    #for sample_id in AYE-T_Diclo375_r2; do
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 \
        raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz \
        trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz \
        trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz \
        ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
done 2> trimmomatic_pe.log
-
(Optional) Using Trinity to find the closest reference
#Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz --right trimmed/wt_r1_R2.fastq.gz --CPU 12
#https://www.genome.jp/kegg/tables/br08606.html#prok
acb KGB Acinetobacter baumannii ATCC 17978 2007 GenBank
abm KGB Acinetobacter baumannii SDF 2008 GenBank
aby KGB Acinetobacter baumannii AYE 2008 GenBank --> *
abc KGB Acinetobacter baumannii ACICU 2008 GenBank
abn KGB Acinetobacter baumannii AB0057 2008 GenBank
abb KGB Acinetobacter baumannii AB307-0294 2008 GenBank
abx KGB Acinetobacter baumannii 1656-2 2012 GenBank
abz KGB Acinetobacter baumannii MDR-ZJ06 2012 GenBank
abr KGB Acinetobacter baumannii MDR-TJ 2012 GenBank
abd KGB Acinetobacter baumannii TCDC-AB0715 2012 GenBank
abh KGB Acinetobacter baumannii TYTH-1 2012 GenBank
abad KGB Acinetobacter baumannii D1279779 2013 GenBank
abj KGB Acinetobacter baumannii BJAB07104 2013 GenBank
abab KGB Acinetobacter baumannii BJAB0715 2013 GenBank
abaj KGB Acinetobacter baumannii BJAB0868 2013 GenBank
abaz KGB Acinetobacter baumannii ZW85-1 2013 GenBank
abk KGB Acinetobacter baumannii AbH12O-A2 2014 GenBank
abau KGB Acinetobacter baumannii AB030 2014 GenBank
abaa KGB Acinetobacter baumannii AB031 2014 GenBank
abw KGB Acinetobacter baumannii AC29 2014 GenBank
abal KGB Acinetobacter baumannii LAC-4 2015 GenBank
#Note that the Acinetobacter baumannii strain AYE chromosome, complete genome (GenBank: CU459141.1) was chosen as reference!
-
Preparing samplesheet.csv
sample,fastq_1,fastq_2,strandedness
Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
...
-
Downloading CU459141.fasta and CU459141.gff from GenBank and preparing CU459141_m.gff
#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#Default NOT_WORKING: --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
# -- DEBUG_1 (CDS --> exon in CP059040.gff) --
#Checking the record (see below) in results/genome/CP059040.gtf
#In ./results/genome/CP059040.gtf e.g. "CP059040.1 Genbank transcript 1 1398 . + . transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
#--featurecounts_feature_type 'transcript' returns only the tRNA results
#The tRNA records have "transcript" and "exon" features, whereas gene records have "transcript" and "CDS"; therefore replace CDS with exon
grep -P "\texon\t" CP059040.gff | sort | wc -l #96
grep -P "cmsearch\texon\t" CP059040.gff | wc -l #=10 signal recognition particle sRNA small type, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP059040.gff | wc -l #=12 16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l #tRNA 74
wc -l star_salmon/AUM_r3/quant.genes.sf #--featurecounts_feature_type 'transcript' results in 96 records!
grep -P "\tCDS\t" CU459141.gff3 | wc -l #3659
sed 's/\tCDS\t/\texon\t/g' CU459141.gff3 > CU459141_m.gff
grep -P "\texon\t" CU459141_m.gff | sort | wc -l #3760
# -- DEBUG_2: combination of 'CU459141_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
#   --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" --featurecounts_feature_type 'transcript'
# -- DEBUG_3: make sure the header of the fasta matches the sequence ID used in the *_m.gff file
-
nextflow run
# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) mv trimmed/*.fastq.gz .
(host_env) nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
    --input samplesheet.csv --outdir results \
    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141.fasta" \
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" \
    -resume --max_cpus 90 --max_memory 900.GB --max_time 2400.h \
    --save_align_intermeds --save_unaligned --save_reference \
    --aligner 'star_salmon' \
    --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' \
    --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
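The DEBUG_3 note above (the FASTA header has to match the sequence ID used in the *_m.gff file) can be checked with a few lines of shell before launching the run. This is only a minimal sketch and not part of the original pipeline: the function name check_seqids is hypothetical, and it assumes a tab-separated GFF3 with the sequence ID in column 1 and FASTA IDs that end at the first whitespace of the header line.

```shell
# Hypothetical helper (an assumption, not from the original notes):
# report GFF seqids (column 1) that have no matching FASTA header ID.
# The body runs in a subshell, so variables stay local and "exit" only
# sets the function's return status.
check_seqids() (
    fasta="$1"
    gff="$2"
    ids_file=$(mktemp)
    # FASTA IDs: header text after '>' up to the first whitespace
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > "$ids_file"
    # GFF seqids: column 1 of every 9-column feature line (comment lines skipped),
    # keeping only those that are NOT an exact match for a FASTA ID
    missing=$(awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$gff" | sort -u | grep -Fxv -f "$ids_file")
    rm -f "$ids_file"
    if [ -n "$missing" ]; then
        echo "MISMATCH: GFF seqids missing from FASTA: $missing"
        exit 1
    fi
    echo "OK: all GFF seqids are present in the FASTA"
)
```

For the files used here, the call would be check_seqids CU459141.fasta CU459141_m.gff; a non-zero return status means the two files disagree on the sequence name and the run is likely to fail or produce empty counts.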
-
Import data and pca-plot
#mamba activate r_env
#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)
setwd("~/DATA/Data_Tam_RNAseq_2026_on_AYE/results/star_salmon")
# Define paths to your Salmon output quantification files
# Store sample names in a character vector
samples <- c(
  "AYE-O_Azi20_solid_r1", "AYE-O_Azi20_solid_r2", "AYE-O_Azi20_solid_r3",
  "AYE-O_ctr_r1", "AYE-O_ctr_r2", "AYE-O_ctr_r3",
  "AYE-O_ctr_solid_r1", "AYE-O_ctr_solid_r2", "AYE-O_ctr_solid_r3",
  "AYE-O_Diclo375_r1", "AYE-O_Diclo375_r2", "AYE-O_Diclo375_r3",
  "AYE-O_Mero0.5_r1", "AYE-O_Mero0.5_r2", "AYE-O_Mero0.5_r3",
  "AYE-O_Rifampicin2_r1", "AYE-O_Rifampicin2_r2", "AYE-O_Rifampicin2_r3",
  "AYE-T_Azi20_solid_r1", "AYE-T_Azi20_solid_r2", "AYE-T_Azi20_solid_r3",
  "AYE-T_ctr_r1", "AYE-T_ctr_r2", "AYE-T_ctr_r3",
  "AYE-T_ctr_solid_r1", "AYE-T_ctr_solid_r2", "AYE-T_ctr_solid_r3",
  "AYE-T_Diclo375_r1", "AYE-T_Diclo375_r2", "AYE-T_Diclo375_r3",
  "AYE-T_Mero0.15_r1", "AYE-T_Mero0.15_r2", "AYE-T_Mero0.15_r3",
  "AYE-T_Rifampicin2_r1", "AYE-T_Rifampicin2_r2", "AYE-T_Rifampicin2_r3",
  "AYE-WT_Azi20_solid_r1", "AYE-WT_Azi20_solid_r2", "AYE-WT_Azi20_solid_r3",
  "AYE-WT_ctr_r1", "AYE-WT_ctr_r2", "AYE-WT_ctr_r3",
  "AYE-WT_ctr_solid_r1", "AYE-WT_ctr_solid_r2", "AYE-WT_ctr_solid_r3",
  "AYE-WT_Diclo1250_solid_r1", "AYE-WT_Diclo1250_solid_r2", "AYE-WT_Diclo1250_solid_r3",
  "AYE-WT_Diclo750_r1", "AYE-WT_Diclo750_r2", "AYE-WT_Diclo750_r3",
  "AYE-WT_Mero0.35-0.5_r1", "AYE-WT_Mero0.35-0.5_r2", "AYE-WT_Mero0.35-0.5_r3",
  "AYE-WT_Rifampicin1.5_r1", "AYE-WT_Rifampicin1.5_r2", "AYE-WT_Rifampicin1.5_r3",
  "F_Azi20_solid_r1", "F_Azi20_solid_r2", "F_Azi20_solid_r3",
  "F_ctr_solid_r1", "F_ctr_solid_r2", "F_ctr_solid_r3",
  "O-Trans_ctr_r1", "O-Trans_ctr_r2", "O-Trans_ctr_r3",
  "O-Trans_Diclo375_r1", "O-Trans_Diclo375_r2", "O-Trans_Diclo375_r3",
  "O-Trans_Mero0.25_r1", "O-Trans_Mero0.25_r2", "O-Trans_Mero0.25_r3",
  "O-Trans_Rifampicin2_r1", "O-Trans_Rifampicin2_r2", "O-Trans_Rifampicin2_r3",
  "WT-Trans_ctr_r1", "WT-Trans_ctr_r2", "WT-Trans_ctr_r3",
  "WT-Trans_Diclo750_r1", "WT-Trans_Diclo750_r2", "WT-Trans_Diclo750_r3"
)
## Automatically generate the named vector
files <- setNames(paste0("./", samples, "/quant.sf"), samples)
# NOTE: txi is used below but is never created in these notes; the following
# tximport call (transcript-level Salmon import) is an assumption, adjust if needed
txi <- tximport(files, type = "salmon", txOut = TRUE)
# -----------------------------------------------------------------
# ---- Step 1: Create Detailed Metadata from Your Sample Names ----
# Extract metadata from sample names
samples <- names(files)
# Parse the complex sample names
metadata <- data.frame(
  sample = samples,
  stringsAsFactors = FALSE
)
# Extract strain (everything before first underscore or hyphen treatment)
metadata$strain <- sapply(strsplit(samples, "[-_]"), function(x) {
  if(x[1] %in% c("AYE", "O", "WT", "F")) {
    if(x[1] == "AYE" && length(x) > 1 && x[2] %in% c("WT", "T", "O")) {
      paste(x[1:2], collapse = "-")
    } else if(x[1] %in% c("O", "WT") && x[2] == "Trans") {
      paste(x[1:2], collapse = "-")
    } else {
      x[1]
    }
  } else {
    x[1]
  }
})
# Extract treatment type
metadata$treatment <- sapply(samples, function(x) {
  if(grepl("_ctr", x)) return("ctrl")
  if(grepl("Diclo", x)) return("Diclo")
  if(grepl("Mero", x)) return("Mero")
  if(grepl("Azi", x)) return("Azi")
  if(grepl("Rifampicin", x)) return("Rifampicin")
  return("ctrl")
})
# Extract concentration
metadata$concentration <- sapply(samples, function(x) {
  if(grepl("Diclo1250", x)) return("1250")
  if(grepl("Diclo750", x)) return("750")
  if(grepl("Diclo375", x)) return("375")
  if(grepl("Mero0.5", x)) return("0.5")
  if(grepl("Mero0.35", x)) return("0.35")
  if(grepl("Mero0.25", x)) return("0.25")
  if(grepl("Mero0.15", x)) return("0.15")
  if(grepl("Azi20", x)) return("20")
  if(grepl("Rifampicin2", x)) return("2")
  if(grepl("Rifampicin1.5", x)) return("1.5")
  return("0")
})
# Extract condition (solid vs liquid)
metadata$condition <- ifelse(grepl("_solid", samples), "solid", "liquid")
# Extract replicate
metadata$replicate <- sapply(strsplit(samples, "_"), function(x) {
  rep_part <- x[length(x)]
  gsub("r", "", rep_part)
})
# Create combined group for easy comparisons
metadata$group <- paste(metadata$strain, metadata$treatment, metadata$concentration, sep = "_")
# Set row names
rownames(metadata) <- metadata$sample
# Reorder to match txi columns
metadata <- metadata[colnames(txi$counts), ]
# ---------------------------------------------
# ---- Step 2: Choose Your Design Strategy ----
# Strategy A: Full Factorial Design (if balanced)
dds <- DESeqDataSetFromTximport(txi, metadata, design = ~ strain + treatment + condition)
# --> Strategy B: Combined Group Factor ⭐ RECOMMENDED
metadata$group <- factor(paste(metadata$strain, metadata$treatment, metadata$concentration, metadata$condition, sep = "_"))
dds <- DESeqDataSetFromTximport(txi, metadata, design = ~ group)
dds <- DESeq(dds)
# See all available comparisons
resultsNames(dds)
# -------------------------------------------------------------
# ---- Step 3: Set Up Specific Comparisons from Your Notes ----
# ==========================================
# 1. Define Exact Comparisons from Your Notes
# ==========================================
planned_comparisons <- list(
  # --- Baseline / Strain Controls ---
  AYE_T_ctr_vs_AYE_WT_ctr = list(treat = "AYE-T_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_O_ctr_vs_AYE_WT_ctr = list(treat = "AYE-O_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  O_Trans_ctr_vs_AYE_WT_ctr = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  WT_Trans_ctr_vs_AYE_WT_ctr = list(treat = "WT-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_O_ctr_vs_AYE_T = list(treat = "AYE-O_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
  O_Trans_ctr_vs_AYE_T = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
  WT_Trans_ctr_vs_AYE_T = list(treat = "WT-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
  # --- Condition Effects (Solid vs Liquid) ---
  AYE_WT_ctr_solid_vs_AYE_WT_ctr = list(treat = "AYE-WT_ctrl_0_solid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_O_ctr_solid_vs_AYE_O_ctr = list(treat = "AYE-O_ctrl_0_solid", ctrl = "AYE-O_ctrl_0_liquid"),
  AYE_T_ctr_solid_vs_AYE_T_ctr = list(treat = "AYE-T_ctrl_0_solid", ctrl = "AYE-T_ctrl_0_liquid"),
  AYE_O_ctr_solid_vs_AYE_WT_ctr_solid = list(treat = "AYE-O_ctrl_0_solid", ctrl = "AYE-WT_ctrl_0_solid"),
  AYE_T_ctr_solid_vs_AYE_WT_ctr_solid = list(treat = "AYE-T_ctrl_0_solid", ctrl = "AYE-WT_ctrl_0_solid"),
  # --- Diclofenac ---
  AYE_WT_Diclo750_vs_AYE_WT_ctr = list(treat = "AYE-WT_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_T_Diclo375_vs_AYE_WT_ctr = list(treat = "AYE-T_Diclo_375_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_O_Diclo375_vs_AYE_WT_ctr = list(treat = "AYE-O_Diclo_375_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  O_Trans_Diclo375_vs_AYE_WT_ctr = list(treat = "O-Trans_Diclo_375_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  WT_Trans_Diclo750_vs_AYE_WT_ctr = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  Diclo_AYE_WT_1250_solid_vs_solid_ctr = list(treat = "AYE-WT_Diclo_1250_solid", ctrl = "AYE-WT_ctrl_0_solid"),
  # --- Meropenem ---
  AYE_WT_Mero_vs_AYE_WT_ctr = list(treat = "AYE-WT_Mero_0.35_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_T_Mero_vs_AYE_WT_ctr = list(treat = "AYE-T_Mero_0.15_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_O_Mero_vs_AYE_WT_ctr = list(treat = "AYE-O_Mero_0.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  O_Trans_Mero_vs_AYE_WT_ctr = list(treat = "O-Trans_Mero_0.25_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_T_Mero_vs_AYE_T_ctr = list(treat = "AYE-T_Mero_0.15_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
  # --- Azithromycin (Solid) ---
  AYE_WT_Azi_vs_solid_ctr = list(treat = "AYE-WT_Azi_20_solid", ctrl = "AYE-WT_ctrl_0_solid"),
  AYE_T_Azi_vs_solid_ctr = list(treat = "AYE-T_Azi_20_solid", ctrl = "AYE-T_ctrl_0_solid"),
  AYE_O_Azi_vs_solid_ctr = list(treat = "AYE-O_Azi_20_solid", ctrl = "AYE-O_ctrl_0_solid"),
  F_Azi_vs_F_solid_ctr = list(treat = "F_Azi_20_solid", ctrl = "F_ctrl_0_solid"),
  # --- Rifampicin ---
  AYE_WT_Rif_vs_AYE_WT_ctr = list(treat = "AYE-WT_Rifampicin_1.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
  AYE_T_Rif_vs_AYE_T_ctr = list(treat = "AYE-T_Rifampicin_2_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
  AYE_O_Rif_vs_AYE_O_ctr = list(treat = "AYE-O_Rifampicin_2_liquid", ctrl = "AYE-O_ctrl_0_liquid"),
  O_Trans_Rif_vs_O_Trans_ctr = list(treat = "O-Trans_Rifampicin_2_liquid", ctrl = "O-Trans_ctrl_0_liquid")
)
# ==========================================
# 2. Verification & Validation Script
# ==========================================
# Identify which column in colData holds your group names
group_col <- if("group" %in% colnames(colData(dds))) "group" else if("treatment" %in% colnames(colData(dds))) "treatment" else stop("❌ Please specify the correct colData column containing group names.")
actual_groups <- unique(colData(dds)[[group_col]])
cat("\n", paste(rep("=", 85), collapse=""), "\n")
cat("📋 VERIFICATION OF NOTE-DERIVED COMPARISONS\n")
cat(paste(rep("=", 85), collapse=""), "\n\n")
validation_results <- data.frame(
  Comparison_Name = character(),
  Treatment_String = character(),
  Control_String = character(),
  Status = character(),
  Suggested_Contrast = character(),
  stringsAsFactors = FALSE
)
for(name in names(planned_comparisons)) {
  trt <- planned_comparisons[[name]]$treat
  ctl <- planned_comparisons[[name]]$ctrl
  # Find closest matches in actual data
  trt_match <- actual_groups[grepl(trt, actual_groups, fixed = TRUE)]
  ctl_match <- actual_groups[grepl(ctl, actual_groups, fixed = TRUE)]
  status <- if(length(trt_match) > 0 && length(ctl_match) > 0) "✅ VALID" else "⚠️ CHECK"
  contrast_str <- if(status == "✅ VALID") paste0('c("', group_col, '", "', trt_match[1], '", "', ctl_match[1], '")') else "N/A"
  validation_results <- rbind(validation_results, data.frame(
    Comparison_Name = name,
    Treatment_String = trt,
    Control_String = ctl,
    Status = status,
    Suggested_Contrast = contrast_str,
    stringsAsFactors = FALSE
  ))
  cat(sprintf("%-45s | T:%-25s C:%-20s | %s\n", name, trt, ctl, status))
  if(status == "⚠️ CHECK") {
    if(length(trt_match) == 0) cat(" 🔍 Treat not found. Closest: ", paste(head(actual_groups[grepl(strsplit(trt, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
    if(length(ctl_match) == 0) cat(" 🔍 Ctrl not found. Closest: ", paste(head(actual_groups[grepl(strsplit(ctl, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
  }
}
# ==========================================
# 3. Auto-Generate DESeq2 results() Calls (Optional)
# ==========================================
valid_comparisons <- validation_results[validation_results$Status == "✅ VALID", ]
if(nrow(valid_comparisons) > 0) {
  cat("\n📜 READY-TO-RUN DESeq2 CONTRASTS:\n")
  cat(paste(rep("-", 60), collapse=""), "\n")
  for(i in seq_len(nrow(valid_comparisons))) {
    cat(sprintf('res_%s <- results(dds, contrast = %s)\n',
                gsub("[^A-Za-z0-9]", "_", valid_comparisons$Comparison_Name[i]),
                valid_comparisons$Suggested_Contrast[i]))
  }
} else {
  cat("\n⚠️ No exact matches found. Check your colData group naming convention.\n")
}
# -----------------------------
# ---- Step 4: PCA figures ----
# 🔍 What each figure shows:
#
# 01_PCA_by_Strain.png → Tests if genetic background (AYE-WT, AYE-T, AYE-O, Trans, F) is the dominant source of variation.
# 02_PCA_by_Treatment.png → Shows clustering by antibiotic/drug exposure (ctrl, Diclo, Mero, Azi, Rifampicin).
# 03_PCA_by_Condition.png → Reveals batch/growth media effects (solid vs liquid).
# 04_PCA_CombinedGroups.png → Full experimental grouping with labeled sample names for quick outlier detection.
# 05_PCA_Ellipses.png → Adds 95% confidence boundaries per strain to visualize group spread and overlap.
#
# ⚠️ Quick Checklist Before Running:
#
# Ensure metadata columns (strain, treatment, condition, group) are attached to colData(dds).
# If ggrepel is missing, run install.packages("ggrepel").
# All PNGs will save to your current working directory (getwd()).
# Install if missing: install.packages(c("ggplot2", "ggrepel"))
library(DESeq2)
library(ggplot2)
library(ggrepel)
# 1. 
Variance Stabilizing Transformation & Extract PCA Data vsd <- vst(dds, blind = FALSE) pca_data <- plotPCA(vsd, intgroup = c("strain", "treatment", "condition", "group"), returnData = TRUE) percent_var <- round(100 * attr(pca_data, "percentVar")) # Consistent theme for all plots base_theme <- theme_bw(base_size = 12) + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13), legend.position = "right", legend.title = element_text(face = "bold"), panel.grid.major = element_line(color = "grey90"), panel.grid.minor = element_blank()) # --- Plot 1: Colored by Strain --- p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, shape = condition)) + geom_point(size = 3, alpha = 0.8) + geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20, show.legend = FALSE) + labs(x = paste0("PC1: ", percent_var[1], "% variance"), y = paste0("PC2: ", percent_var[2], "% variance"), title = "PCA: Samples Colored by Strain", color = "Strain", shape = "Condition") + base_theme ggsave("01_PCA_by_Strain.png", p1, width = 8, height = 6, dpi = 300) # --- Plot 2: Colored by Treatment --- p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = treatment, shape = condition)) + geom_point(size = 3, alpha = 0.8) + labs(x = paste0("PC1: ", percent_var[1], "% variance"), y = paste0("PC2: ", percent_var[2], "% variance"), title = "PCA: Samples Colored by Treatment", color = "Treatment", shape = "Condition") + base_theme ggsave("02_PCA_by_Treatment.png", p2, width = 8, height = 6, dpi = 300) # --- Plot 3: Colored by Condition (Solid vs Liquid) --- p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = strain)) + geom_point(size = 3, alpha = 0.8) + labs(x = paste0("PC1: ", percent_var[1], "% variance"), y = paste0("PC2: ", percent_var[2], "% variance"), title = "PCA: Samples Colored by Growth Condition", color = "Condition", shape = "Strain") + base_theme ggsave("03_PCA_by_Condition.png", p3, width = 8, height = 6, dpi = 300) # --- Plot 4: Combined Groups with 
Sample Labels --- p4 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = group)) + geom_point(size = 3, alpha = 0.8) + geom_text_repel(aes(label = name), size = 2, max.overlaps = 30, box.padding = 0.3) + labs(x = paste0("PC1: ", percent_var[1], "% variance"), y = paste0("PC2: ", percent_var[2], "% variance"), title = "PCA: Combined Experimental Groups", color = "Group") + base_theme + theme(legend.position = "none") ggsave("04_PCA_CombinedGroups.png", p4, width = 9, height = 7, dpi = 300) # --- Plot 5: 95% Confidence Ellipses (by Strain) --- p5 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, fill = strain)) + geom_point(size = 3, alpha = 0.7) + stat_ellipse(level = 0.95, alpha = 0.2, geom = "polygon", show.legend = FALSE) + labs(x = paste0("PC1: ", percent_var[1], "% variance"), y = paste0("PC2: ", percent_var[2], "% variance"), title = "PCA: 95% Confidence Ellipses by Strain", color = "Strain", fill = "Strain") + base_theme ggsave("05_PCA_Ellipses.png", p5, width = 8, height = 6, dpi = 300) message("✅ All 5 PCA plots saved to working directory!") -
Run Differential Expression & PCA Analysis Complete
(r_env) cd ~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/
#(r_env) Rscript complete_deg_pipeline.R
#For standard cutoff in the project, we use complete_deg_pipeline_custom_cutoff.R
# Adapted the script to the following requests:
# (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
# (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
# (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.
# How it works:
# * Rifampicin: The script looks for "Rif" in the comparison name (e.g., 28_AYE_WT_Rif_vs_Ctrl) and applies |log2FC| >= 1.2.
# * Baseline/Strain Controls: The script looks for "_ctr_vs_" in the comparison name (e.g., 01_AYE_T_ctr_vs_AYE_WT_ctr) and applies |log2FC| >= 1.4.
# * All Others: Falls back to the original 2.0 cutoff.
# * The console output will now explicitly print which cutoff is being used for each specific comparison.
(r_env) Rscript complete_deg_pipeline_custom_cutoff.R
-
KEGG and GO annotations in non-model organisms
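(a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
(b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
(c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.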
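The name-based cutoff rule can be sketched in a few lines of shell. This is only an illustration of the matching logic described in these notes, not the code inside complete_deg_pipeline_custom_cutoff.R; the function name pick_lfc_cutoff is hypothetical.

```shell
# Hypothetical helper: map a comparison name to its |log2FC| cutoff.
pick_lfc_cutoff() {
    case "$1" in
        *Rif*)      echo 1.2 ;;   # Rifampicin comparisons
        *_ctr_vs_*) echo 1.4 ;;   # baseline / strain controls
        *)          echo 2.0 ;;   # all other comparisons
    esac
}

# Comparison names taken from the notes above.
for name in 01_AYE_T_ctr_vs_AYE_WT_ctr 08_AYE_WT_ctr_solid_vs_liquid 28_AYE_WT_Rif_vs_Ctrl; do
    printf '%-32s |log2FC| >= %s\n' "$name" "$(pick_lfc_cutoff "$name")"
done
```

Note that 08_AYE_WT_ctr_solid_vs_liquid contains "_ctr_solid_vs_" rather than "_ctr_vs_", so it falls through to the 2.0 cutoff, which matches the grouping of the per-comparison scripts in section 10.2.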
10.1. Assign KEGG and GO Terms (see diagram above)
Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.
* Preparing file 1 eggnog_out.emapper.annotations.txt for the R-code below: (KEGG Terms): EggNog based on orthology and phylogenies
EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
Install EggNOG-mapper:
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda #eggnog-mapper_2.1.12
mamba activate eggnog_env
Run annotation:
#diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
#NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
#Download CU459141_protein_.fasta from NCBI
python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta
emapper.py -i CU459141_protein.fasta -o eggnog_out --cpu 60 --resume
#----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
#----> 470.IX87_14445:
* 470 is the NCBI taxonomy ID of Acinetobacter baumannii, so the prefix identifies the organism of the seed ortholog.
* IX87_14445 is the locus tag of the matching gene or protein within that genome.
Extract KEGG KO IDs from eggnog_out.emapper.annotations.txt (file 1 above).
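A minimal sketch of the KO extraction step, assuming the annotation file is tab-separated, has metadata lines starting with "##", and a header line starting with "#query" that contains a KEGG_ko column (the layout written by eggnog-mapper 2.1.x as far as I know). The function name extract_ko and the mock rows are hypothetical; the column is located by name rather than by position so that the sketch does not depend on the exact column order.

```shell
# Hypothetical helper: print "gene<TAB>KO" pairs from an emapper annotation table.
extract_ko() {
    awk -F'\t' '
        /^##/ { next }                       # skip metadata lines
        /^#query/ {                          # header: find the KEGG_ko column by name
            for (i = 1; i <= NF; i++) if ($i == "KEGG_ko") col = i
            next
        }
        col && $col != "-" && $col != "" {   # keep genes that have at least one KO
            n = split($col, kos, ",")
            for (j = 1; j <= n; j++) {
                sub(/^ko:/, "", kos[j])      # "ko:K00001" -> "K00001"
                print $1 "\t" kos[j]
            }
        }
    ' "$1"
}

# Mock input (hypothetical gene names and KO IDs) to show the behaviour.
mock=$(mktemp)
printf '## emapper mock file\n'                             >  "$mock"
printf '#query\tseed_ortholog\tKEGG_ko\n'                   >> "$mock"
printf 'geneA\t470.IX87_14445\tko:K00001,ko:K00002\n'       >> "$mock"
printf 'geneB\t470.IX87_00001\t-\n'                         >> "$mock"
extract_ko "$mock"
rm -f "$mock"
```

On the real data the call would be extract_ko eggnog_out.emapper.annotations.txt > gene2ko.tsv, where gene2ko.tsv is again only an example name.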
* Preparing file 2 blast2go_annot.annot2_ for the R-code below:
- Basic (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping
* 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
* Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
* Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
* Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
* Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO, and the position of the GO term in the Gene Ontology hierarchy. The annotation score cutoff selects the most specific GO term in a given GO branch whose score lies above this value.
* Used parameter 'GO Weight': a value that is added to the Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. As a result, more general GO terms that summarise many original source terms (those directly associated with the Blast hits) receive a higher Annotation Score.
- Advanced (GO Terms from 'Blast2GO 5 Basic'): Interpro based protein families / domains --> Button interpro
* Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
- MERGE the results of InterPro GO IDs (advanced) to GO IDs (basic) and generate final GO IDs, saved in blast2go_annot.annot2
* Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)."
* (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
- PREPARING go_terms and ec_terms: annot_* file (NOTE that blast2go_annot.annot2 is after merging InterPro_GO_IDs and GO_IDs):
cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
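To show what the cut step keeps, here is a small sketch on mock rows. It assumes the .annot2 file is tab-separated with the sequence name in column 1, the GO ID in column 2 and a description in column 3; the gene names are hypothetical and two_col is only an illustrative wrapper.

```shell
# Hypothetical wrapper around the cut call used above.
two_col() { cut -f1-2 "$1"; }   # tab is already the default delimiter of cut

annot=$(mktemp)
# Mock rows: sequence name, GO ID, description
printf 'gene_0001\tGO:0005524\tATP binding\n' >  "$annot"
printf 'gene_0002\tGO:0016020\tmembrane\n'    >> "$annot"

two_col "$annot"   # prints only the first two columns of each row
rm -f "$annot"
```

Because tab is the default delimiter, the bash-specific -d$'\t' in the original command is optional; dropping it makes the same line work in plain sh as well.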
10.2. Perform KEGG and GO Enrichment in R
(r_env) cd /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete
#For |deg_cutoff_log_foldchange| >=1.4
sed "s/01_AYE_T_ctr_vs_AYE_WT_ctr/02_AYE_O_ctr_vs_AYE_WT_ctr/g" 1.R > 2.R
...
#For |deg_cutoff_log_foldchange| >=2.0
sed "s/08_AYE_WT_ctr_solid_vs_liquid/09_AYE_O_ctr_solid_vs_liquid/g" 8.R > 9.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/10_AYE_T_ctr_solid_vs_liquid/g" 8.R > 10.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/11_AYE_O_ctr_solid_vs_AYE_WT_solid/g" 8.R > 11.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/12_AYE_T_ctr_solid_vs_AYE_WT_solid/g" 8.R > 12.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/13_AYE_WT_Diclo750_vs_Ctrl/g" 8.R > 13.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/14_AYE_T_Diclo375_vs_Ctrl/g" 8.R > 14.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/15_AYE_O_Diclo375_vs_Ctrl/g" 8.R > 15.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/16_O_Trans_Diclo375_vs_Ctrl/g" 8.R > 16.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/17_WT_Trans_Diclo750_vs_Ctrl/g" 8.R > 17.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid/g" 8.R > 18.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/19_AYE_WT_Mero_vs_Ctrl/g" 8.R > 19.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/20_AYE_T_Mero_vs_Ctrl/g" 8.R > 20.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/21_AYE_O_Mero_vs_Ctrl/g" 8.R > 21.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/22_O_Trans_Mero_vs_Ctrl/g" 8.R > 22.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/23_AYE_T_Mero_vs_AYE_T_Ctrl/g" 8.R > 23.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/24_AYE_WT_Azi_solid_vs_Ctrl_solid/g" 8.R > 24.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/25_AYE_T_Azi_solid_vs_Ctrl_solid/g" 8.R > 25.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/26_AYE_O_Azi_solid_vs_Ctrl_solid/g" 8.R > 26.R
sed "s/08_AYE_WT_ctr_solid_vs_liquid/27_F_Azi_solid_vs_Ctrl_solid/g" 8.R > 27.R
#For |deg_cutoff_log_foldchange| >=1.2
sed "s/28_AYE_WT_Rif_vs_Ctrl/29_AYE_T_Rif_vs_Ctrl/g" 28.R > 29.R
sed "s/28_AYE_WT_Rif_vs_Ctrl/30_AYE_O_Rif_vs_Ctrl/g" 28.R > 30.R
sed "s/28_AYE_WT_Rif_vs_Ctrl/31_O_Trans_Rif_vs_Ctrl/g" 28.R > 31.R
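The repeated sed calls above follow one pattern, so they can also be generated in a loop. The sketch below is an assumption about how one might automate it, not part of the original workflow; clone_script is a hypothetical helper, and the template content is a mock one-liner so the example runs anywhere.

```shell
# Hypothetical helper: copy a template script, replacing the comparison name.
# usage: clone_script TEMPLATE OLD_NAME NEW_NAME
clone_script() {
    num=${3%%_*}                                   # leading number, e.g. 29
    sed "s/$2/$3/g" "$1" > "$(dirname "$1")/$num.R"
}

workdir=$(mktemp -d)
printf 'comparison <- "28_AYE_WT_Rif_vs_Ctrl"\n' > "$workdir/28.R"   # mock template

for new in 29_AYE_T_Rif_vs_Ctrl 30_AYE_O_Rif_vs_Ctrl 31_O_Trans_Rif_vs_Ctrl; do
    clone_script "$workdir/28.R" 28_AYE_WT_Rif_vs_Ctrl "$new"
done

cat "$workdir/29.R"
```

The output file name is derived from the number in front of the first underscore, which mirrors the 28.R to 29.R naming used in these notes.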
(r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 1.R
#=== SUMMARY ===
#Up-regulated genes: 16
# Valid KEGG IDs: 4
# Enriched pathways: 0
#Down-regulated genes: 151
# Valid KEGG IDs: 50
# Enriched pathways: 4
#'select()' returned 1:1 mapping between keys and columns
#'select()' returned 1:1 mapping between keys and columns
#'select()' returned 1:1 mapping between keys and columns
#=== SUMMARY ===
#Up-regulated genes: 16
# Valid GO IDs: 16
# Enriched GO-terms: 0
#Down-regulated genes: 151
# Valid KEGG IDs: 151
# Enriched GO-terms: 3
#...
10.3. Finalizing the KEGG and GO Enrichment table
1. NOTE (already implemented in the code): geneIDs in KEGG_Enrichment have already been translated from ko to geneID in H0N29_* format. If not, redo the translation using the eggNOG results, because eggNOG contains the 1:1 mapping between KO name and gene ID.
2. NEED_MANUAL_DELETION (cutoff already set in the code): p.adjust values have been calculated, and all records in the GO_Enrichment results would have to be filtered by p.adjust <= 0.05. This is no longer needed, since pvalueCutoff = 0.05 is set in enricher. Alternatively, use pvalueCutoff = 1.0 and mark rows with p.adjust <= 0.05 in yellow in GO_enrichment.
3. NOTE (not occurring in the new dataset): In rare cases, the description is missing for some IDs, e.g. GO term GO:0006807: replace with "GO:0006807 obsolete nitrogen compound metabolic process"; ko00975: Metabolism, Biosynthesis of other secondary metabolites
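If an enrichment table is exported with pvalueCutoff = 1.0, the p.adjust <= 0.05 filter from item 2 can be applied afterwards on the exported file. The following is a sketch under the assumption that the table is tab-separated and has a header column named p.adjust; filter_padj and the mock rows are hypothetical.

```shell
# Hypothetical helper: keep the header plus rows with p.adjust <= threshold.
# usage: filter_padj TABLE.tsv [THRESHOLD]
filter_padj() {
    awk -F'\t' -v thr="${2:-0.05}" '
        NR == 1 {                                    # header: find p.adjust by name
            for (i = 1; i <= NF; i++) if ($i == "p.adjust") col = i
            print; next
        }
        col && ($col + 0) <= (thr + 0)               # numeric comparison, handles 4.1e-05
    ' "$1"
}

tbl=$(mktemp)
printf 'ID\tDescription\tp.adjust\n'          >  "$tbl"
printf 'GO:0000001\tmock term A\t0.003\n'     >> "$tbl"
printf 'GO:0000002\tmock term B\t0.2\n'       >> "$tbl"
printf 'GO:0000003\tmock term C\t4.1e-05\n'   >> "$tbl"
filter_padj "$tbl"        # keeps the header, term A and term C
rm -f "$tbl"
```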
Explanation for the “near-1” frequencies and why these weren’t reported before
Based on the data, here’s how to explain this:
Key Points:
- These are NOT newly acquired mutations during passaging
  - All three positions (15458, 22636, 24781) show frequency = 1.0 in the starting virus (hCoV229E_Rluc)
  - They remain fixed at 1.0 throughout all passages (p10, p16, p26)
  - These represent pre-existing differences between your experimental virus strain and the reference genome PP810610
- Why they weren’t reported in earlier analyses:
  - VPhaser2 is designed to detect intra-host variants (positions with heterogeneity, frequency < 1.0)
  - When a mutation is fixed at 100% (frequency = 1.0), it becomes part of the consensus sequence
  - Many variant callers either:
    - Don’t report fixed variants as “intra-host variants”
    - Filter them out as they represent consensus differences from reference, not within-host diversity
  - Your earlier analysis likely only reported positions with true intra-host heterogeneity (frequency between 0 and 1)
- The “near-1” values (0.997, 0.996, 0.978) in X7523_p26:
  These slight deviations from 1.0 are NOT calculation errors but can be explained by:
  - Sequencing errors: Even at high coverage, there’s a small error rate (~0.1-1%)
  - Mapping artifacts: Some reads may map incorrectly at the position
  - Very low-level mixed population: Possibly <3% wild-type contamination
  - Statistical noise: At very high frequencies, small absolute numbers of alternative reads can cause slight deviations
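To make the last point concrete, a frequency is simply the number of reads supporting the variant divided by the total depth, so a handful of discordant reads is enough to move a fixed variant slightly below 1.0. The read counts below are hypothetical and chosen only to reproduce the order of magnitude; alt_freq is an illustrative helper.

```shell
# Hypothetical helper: variant frequency = supporting reads / total reads.
# usage: alt_freq SUPPORTING_READS TOTAL_READS
alt_freq() {
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.3f\n", a / t }'
}

alt_freq 10000 10000   # every read supports the variant
alt_freq 9970 10000    # 30 discordant reads out of 10000
alt_freq 978 1000      # 22 discordant reads out of 1000
```

With an error rate of 0.1 to 1% per base, a few dozen discordant reads at a depth of several thousand are within what sequencing noise alone can produce.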
Suggested Response:
You could add this to your email:
Additional explanation for positions 15458, 22636 and 24781:
On closer inspection of the data, these three mutations were already present at 100% in the starting virus (hCoV229E_Rluc) and did not change during the entire passaging. They are therefore strain-specific differences from the reference genome PP810610, not newly acquired mutations.
Why were they not reported in earlier analyses?
- VPhaser2 primarily detects intra-host variants (positions with heterogeneity within a sample, frequency < 1.0)
- At a frequency of exactly 1.0 (100%), these positions are regarded as part of the consensus sequence and are often not reported as “variants” in the strict sense
- The earlier analysis listed only positions with true intra-host diversity
The slightly deviating values in X7523_p26 (0.997, 0.996, 0.978): these minimal deviations from 1.0 are technical in origin, caused by:
- Sequencing errors (typically 0.1-1% error rate)
- Mapping artifacts
- Statistical noise at very high frequencies
They do not represent biologically relevant heterogeneity but lie within the range of technical variability.
This explanation is scientifically accurate and shows that you’ve thoroughly investigated the issue!
Analysis metagenomics using Docker (Data_Tam_Metagenomics_2026*)
Whole metagenome shotgun sequencing data can be processed through read-level quality control (KneadData), taxonomic profiling (MetaPhlAn), functional profiling (HUMAnN), and strain profiling (StrainPhlAn) to generate a report with publication-ready figures with two workflow commands.
-
Prepare the toy datasets
# Install if needed: conda install -c bioconda seqtk
cd ~/DATA/Data_Tam_Metagenomics_2026_pre_vs_post_treatment/X101SC25123808-Z01-J002/01.RawData/A
seqtk sample -s100 A_1.fq.gz 0.01 | gzip > ../J002_A_1.fastq.gz
seqtk sample -s100 A_2.fq.gz 0.01 | gzip > ../J002_A_2.fastq.gz
cd ../B
seqtk sample -s100 B_1.fq.gz 0.01 | gzip > ../J002_B_1.fastq.gz
seqtk sample -s100 B_2.fq.gz 0.01 | gzip > ../J002_B_2.fastq.gz
mv J002*.fastq.gz /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test/
seqtk sample -s100 B_1.fastq.gz 0.01 | gzip > ../A_test/B_1.fastq.gz
seqtk sample -s100 B_2.fastq.gz 0.01 | gzip > ../A_test/B_2.fastq.gz
-
Pull the image (note: latest is actually an old build from 2019-2021)
docker pull biobakery/workflows:latest
-
Verify the version inside the container
docker run --rm biobakery/workflows:latest biobakery_workflows --version
-
Install Databases Inside Container
# Create persistent host directory for databases
mkdir -p /mnt/nvme4n1p1/biobakery_db
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  biobakery/workflows:latest \
  /bin/bash

# Inside container:
biobakery_workflows_databases --available
#There are five available database sets each corresponding to a data processing workflow.
#wmgx: The full databases for the whole metagenome workflow
#wmgx_demo: The demo databases for the whole metagenome workflow
#wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
#16s_usearch: The full databases for the 16s workflow
#16s_dada2: The full databases for the dada2 workflow
#16s_its: The unite database for the its workflow
#isolate_assembly: The eggnog-mapper databases for the assembly workflow
biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
biobakery_workflows_databases --install 16s_its --location /biobakery_databases
biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases
biobakery_workflows_databases --install wmgx --location /biobakery_databases

# ---- DOWNLOAD_LOG ----
1. INSTALLING humann utility mapping database
Creating directory to install database: /biobakery_databases/humann
Creating subdirectory to INSTALL database: /biobakery_databases/humann/utility_mapping
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
Downloading file of size: 2.55 GB
2.55 GB 100.00 % 4.70 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/full_mapping_v201901.tar.gz
Database installed: /biobakery_databases/humann/utility_mapping
HUMAnN configuration file updated: database_folders : utility_mapping = /biobakery_databases/humann/utility_mapping
Generating strainphlan fasta database
(FOR GENERATING DIRS strainphlan_db_reference and strainphlan_db_markers from humann/utility_mapping?), it is contradicted with the following assumption: bowtie2-inspect ${DB_DIR}/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503 > ${DB_DIR}/strainphlan_db_markers/all_markers.fasta

2. INSTALLING humann nucleotide and protein databases
Creating subdirectory to INSTALL database: /biobakery_databases/humann/chocophlan
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
Downloading file of size: 15.30 GB
15.30 GB 100.00 % 6.75 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/full_chocophlan.v296_201901.tar.gz
Database installed: /biobakery_databases/humann/chocophlan
HUMAnN configuration file updated: database_folders : nucleotide = /biobakery_databases/humann/chocophlan
Creating subdirectory to INSTALL database: /biobakery_databases/humann/uniref
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
Downloading file of size: 19.31 GB
19.31 GB 100.00 % 7.22 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/uniref90_annotated_v201901.tar.gz
Database installed: /biobakery_databases/humann/uniref
HUMAnN configuration file updated: database_folders : protein = /biobakery_databases/humann/uniref

3. INSTALLING hg kneaddata database
Creating directory to install database: /biobakery_databases/kneaddata_db_human_genome
Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
Downloading file of size: 3.48 GB
3.48 GB 100.00 % 7.17 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/kneaddata_db_human_genome/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
Database installed: /biobakery_databases/kneaddata_db_human_genome
A custom install location was selected. Please set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the install location.
-
DEBUGs
5.1. Unable to find fastqc
# Install the missing fastqc software
apt-get update
apt-get install -y fastqc
5.2. Install Java 11 to correctly run fastqc
# Install Java 11
apt-get install -y openjdk-11-jre-headless
5.3. Wrong version of kneaddata
# 🔍 BUG: The log shows two things: the installed kneaddata version is v0.7.10, and biobakery_workflows forces the --run-trf argument when it calls kneaddata.
# However, in kneaddata v0.7.4 and later, TRF (tandem repeat filtering) is enabled by default, so the developers removed the --run-trf command-line argument (only --bypass-trf remains, for skipping it).
# Because the biobakery_workflows script still hard-codes --run-trf, and kneaddata v0.7.10 does not recognize this argument, it exits with the error: unrecognized arguments: --run-trf. Since this first kneaddata step fails, all downstream tasks that depend on it (MetaPhlAn, HUMAnN, etc.) fail in cascade.
# Option 1: downgrade kneaddata to a compatible version (recommended)
# kneaddata has to be downgraded to version 0.7.3, the last stable version that natively supports the --run-trf argument and works with the current biobakery_workflows.
pip install --upgrade kneaddata
#Successfully installed kneaddata-0.12.4 (NOT_COMPATIBLE)
pip install kneaddata==0.7.10
#Successfully installed kneaddata-0.7.10
root@13192f2ad6e6:/data# /usr/local/bin/kneaddata --version
kneaddata v0.7.10 (NOT_COMPATIBLE)
pip uninstall -y kneaddata
pip install kneaddata==0.7.3
/usr/local/bin/kneaddata --version
# Expected output: kneaddata v0.7.3
5.4. ⚠️ IMPORTANT: Save Your Container (make this change permanent):
1. Open a NEW terminal window on your host machine (do not close your current Docker session).
2. Find your current container's ID or name:
docker ps
(Look for the CONTAINER ID of the biobakery/workflows:latest container, e.g., 13192f2ad6e6)
3. Commit this container to a new image named biobakery/workflows:fixed:
docker commit 13192f2ad6e6 biobakery/workflows:fixed
docker images
docker ps -a
4. From now on, whenever you want to run the workflow, use this new image name instead of :latest:
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test_sampled:/data \
  biobakery/workflows:fixed \
  /bin/bash
The kneaddata wrapper will already be there, and the workflow will run smoothly.
-
Rerun
6.1. Inside the environment (SUCCESSFUL!)
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
  biobakery/workflows:fixed \
  /bin/bash
export BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases
#OR to make this permanent, add that exact line to the ~/.bashrc file and run source ~/.bashrc.

# ---- Configure databases (read-level quality control (1_KneadData), taxonomic profiling (2_MetaPhlAn), functional profiling (3_HUMAnN), and strain profiling (4_StrainPhlAn)) ----
# Update the 2_MetaPhlAn_databases path
python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))"
# prints: /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases
ls -lh $(python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))")
#TODO: TRY the complete metaphlan_databases from host-env to docker-system, namely from ~/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases (v202503, 34G) to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases (v201901, 2.8G)
# Update the 3_HUMAnN configuration to point to these paths
humann_config --update database_folders nucleotide /biobakery_databases/humann/chocophlan
humann_config --update database_folders protein /biobakery_databases/humann/uniref
humann_config --update database_folders utility_mapping /biobakery_databases/humann/utility_mapping
humann_config --print
# 1_KneadData_databases path: /biobakery_databases/kneaddata_db_human_genome
# 4_StrainPhlAn_databases path: strainphlan_db_reference (empty) and strainphlan_db_markers (1.4G)

# ---- For a fresh run, optionally clean up the partial results from the failed run ----
rm -rf /data/results/
rm -rf /data/results/*fastqc.zip _fastqc #IMPORTANT, so that no fastqc-related files exist under /data/results/
rm -rf /data/results/humann

$ biobakery_workflows wmgx --input /data --output /data/results
$ biobakery_workflows wmgx_vis --input $OUTPUT_DATA --output $OUTPUT_VIS --project-name $PROJECT #for visualizations
* $INPUT : A directory containing shotgun sequencing data (i.e. fasta/fastq in gzipped format)
* $OUTPUT_DATA : A directory to write the data products (i.e. abundance tables). This folder is the output folder for the first command and the input folder for the second command
* $OUTPUT_VIS : A directory to write the visualization products (i.e. report, figures, data tables)
* $PROJECT : The name of the project (included in the report title page)
* Add the options --local-jobs 8 --threads 4 to run 8 local jobs at a time each with 4 threads.
* Add the option --grid-jobs 100 to run 100 grid jobs at a time.
# --qc-options="--bypass-trf" \
# --bypass-strain-profiling

# Run the workflow, explicitly pointing to the full databases you downloaded
biobakery_workflows wmgx \
  --input /data \
  --output /data/results \
  --threads 64 \
  --pair-identifier "_1"
#For A_R1.fastq.gz and A_R2.fastq.gz, the identifier is "_R1".
#For A1a_1.fq.gz and A1a_2.fq.gz, the identifier is just "_1" (because the files end in _1 and _2 directly).
#For A_1.fq.gz and A_2.fq.gz, the identifier is also "_1".
biobakery_workflows wmgx_vis \
  --input /data/results \
  --output /data/results_vis \
  --project-name wastewater_2026_A_test_sampled
#TODO_TOMORROW_2: rerun the 2 commands above using the complete metaphlan_databases: the database now under /data should be copied to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases; first back up the original simplified database as metaphlan_databases_simplified!
6.2. (NOT_TRIED): directly run under host-environment
docker run --rm \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
  -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
  biobakery/workflows:fixed \
  biobakery_workflows wmgx \
  -i /data \
  -o /data/results \
  --threads 32
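The --pair-identifier rule noted in 6.1 (the suffix of the forward-read file, directly in front of the extension) can be sketched as a small helper. This is only an illustration for the naming schemes that appear in these notes; pair_identifier is a hypothetical function, not part of biobakery_workflows.

```shell
# Hypothetical helper: guess the --pair-identifier from a forward-read file name.
# usage: pair_identifier FORWARD_READ_FILE
pair_identifier() {
    base=${1##*/}          # drop the directory part
    base=${base%.gz}       # drop compression suffix
    base=${base%.fastq}    # drop either extension
    base=${base%.fq}
    case "$base" in
        *_R1) echo "_R1" ;;
        *_1)  echo "_1" ;;
        *)    echo "unknown"; return 1 ;;
    esac
}

# File names from the notes above.
for f in A_R1.fastq.gz A1a_1.fq.gz A_1.fq.gz J002_A_1.fastq.gz; do
    printf '%-20s -> %s\n' "$f" "$(pair_identifier "$f")"
done
```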
biobakery_workflows wmgx under docker running log
🎉 CONGRATULATIONS! THE PIPELINE SUCCESSFULLY GENERATED ALL YOUR DATA! 🎉
Please do not worry about the Run Failed message and the traceback at the very end. Your actual scientific data is 100% complete and safe.
Here is exactly what happened and why you should be very happy with this result:
1. Why did it say “Run Failed” at the end?
If you look closely at the very first error in the log, it says:
Err: b'\n[e] The main inputs samples + references are less than 4\nThu Jun 4 11:10:02 2026: Stop StrainPhlAn 3.0 execution.\n'
StrainPhlAn is the tool that builds strain-level phylogenetic trees. As the error message states, it needs at least 4 inputs (samples plus references) to build a tree. Because you provided only 2 samples (Sample A and Sample B) and no references, StrainPhlAn could not build a tree and aborted.
Because the workflow manager (AnADAMA2) saw a task fail, it threw a generic RunFailed error at the very end. This is harmless and expected for a 2-sample run.
2. The Important Parts ALL SUCCEEDED!
Look at your logs for the actual data-generating tools:
(Jun 04 10:46:59) ... **Completed** Task 3: kneaddata____A_R1 (Quality Control & Trimming)
(Jun 04 10:47:43) ... **Completed** Task 8: metaphlan____A_R1 (Taxonomic Profiling)
(Jun 04 11:07:30) ... **Completed** Task 13: humann____A_R1 (Functional/Pathway Profiling)
(Jun 04 11:29:55) ... **Completed** Task 32: humann_merge_feature_counts
KneadData, MetaPhlAn, and HUMAnN all finished perfectly! You now have all the abundance tables you need for your analysis.
3. Where are your results?
You can find all your fully processed data in the /data/results/ directory:
- Taxonomy (Who is there?): /data/results/metaphlan/ (Look for the _taxonomic_profile.tsv files).
- Function/Pathways (What are they doing?): /data/results/humann/ (Look for the _genefamilies.tsv and _pathabundance.tsv files).
- Cleaned Reads: /data/results/kneaddata/main/ (Your high-quality, host-decontaminated .fastq files).
- Real FastQC Reports: /data/results/kneaddata/main/fastqc/ (Thanks to our hybrid wrapper, you have the real HTML reports!).
4. How to avoid this error next time
When you run the pipeline on a small number of samples (fewer than 4), simply add the --bypass-strain-profiling flag. This tells the workflow to skip StrainPhlAn entirely, and the pipeline will exit cleanly with a “Success” message:
biobakery_workflows wmgx \
--input /data \
--output /data/results \
--threads 64 \
--bypass-strain-profiling
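If you launch the workflow from a script, the same rule can be applied automatically. This is only a sketch: `build_wmgx_command` and its parameters are hypothetical helper names, the threshold of 4 is taken from the StrainPhlAn error message quoted above, and the flags are the ones shown in the command above.

```python
# Threshold taken from the StrainPhlAn error message (samples + references < 4).
MIN_STRAINPHLAN_INPUTS = 4


def build_wmgx_command(input_dir, output_dir, n_samples, n_references=0, threads=64):
    """Build the wmgx command line, skipping StrainPhlAn when there are too few inputs."""
    cmd = [
        "biobakery_workflows", "wmgx",
        "--input", input_dir,
        "--output", output_dir,
        "--threads", str(threads),
    ]
    if n_samples + n_references < MIN_STRAINPHLAN_INPUTS:
        # Too few inputs for a tree: skip strain profiling entirely.
        cmd.append("--bypass-strain-profiling")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_wmgx_command("/data", "/data/results", n_samples=2)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the container.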
5. Bonus: Generate the Visual Report!
Since your data is fully generated, you can now run the visualization workflow (just like the official documentation you found earlier) to get a beautiful HTML report:
biobakery_workflows wmgx_vis \
--input /data/results \
--output /data/results_vis \
--project-name "My_Wastewater_Project"
⚠️ FINAL STEP: Save Your Masterpiece!
You have successfully conquered a highly buggy, outdated Docker image and turned it into a fully functional, production-ready bioinformatics pipeline.
Before you do anything else, open a new terminal on your host machine and save this container so you never have to do these fixes again:
# Find your container ID (e.g., 13192f2ad6e6)
docker ps
# Save it to a new permanent image
docker commit 13192f2ad6e6 biobakery/workflows:fixed
Amazing job sticking with this! Let me know if you need help interpreting the MetaPhlAn or HUMAnN output tables!
docker images
Here is the formatted table of your Docker images. (Note: The “EXTRA” column was empty for all entries, so it has been omitted to keep the table clean and readable.)
| IMAGE | ID | DISK USAGE | CONTENT SIZE |
|---|---|---|---|
| alpine:latest | 1d34ffeaf190 | 7.79MB | 0B |
| biobakery/workflows:fixed | e62dd179c8f3 | 6.91GB | 0B |
| biobakery/workflows:latest | 24f1680c7004 | 6.68GB | 0B |
| community.wave.seqera.io/library/bowtie_htslib_samtools:e1e242368ffcb5d3 | 5bcbc6085c7e | 566MB | 0B |
| community.wave.seqera.io/library/busco_sepp:f2dbc18a2f7a5b64 | cd8a6234479d | 3.33GB | 0B |
| community.wave.seqera.io/library/clair3:1.2.0–b1b03d4e9d1b6a2e | 5fc8146c8cd7 | 2.93GB | 0B |
| community.wave.seqera.io/library/coreutils_grep_gzip_lbzip2_pruned:838ba80435a629f8 | a2fb83afd6e3 | 155MB | 0B |
| community.wave.seqera.io/library/fastp:0.24.0–62c97b06e8447690 | d53a563b3a42 | 125MB | 0B |
| community.wave.seqera.io/library/fastp:1.0.1–c8b87fe62dcc103c | f6fd98d3ddf5 | 124MB | 0B |
| community.wave.seqera.io/library/fastx_toolkit:0.0.14–2d5a3f28610ed585 | ad35b5b18cc8 | 1.39GB | 0B |
| community.wave.seqera.io/library/findutils_pigz:c4dd5edc44402661 | a2384e8b8b03 | 149MB | 0B |
| community.wave.seqera.io/library/htslib:1.21–ff8e28a189fbecaa | f838b0cd726d | 177MB | 0B |
| community.wave.seqera.io/library/jq:fee8aafd41d9e3aa | c25f40b12762 | 112MB | 0B |
| community.wave.seqera.io/library/kraken2_coreutils_pigz:45764814c4bb5bf3 | 0ff57d632526 | 1.15GB | 0B |
| community.wave.seqera.io/library/kraken2_coreutils_pigz:920ecc6b96e2ba71 | 1602ff822670 | 1.14GB | 0B |
| community.wave.seqera.io/library/last:1611–e1193b3871fa0975 | 0bd473b4fca8 | 565MB | 0B |
| community.wave.seqera.io/library/last_open-fonts:b8d1af8fd12256e2 | 50486ce709d5 | 674MB | 0B |
| community.wave.seqera.io/library/megahit_pigz:87a590163e594224 | e6bbee200181 | 372MB | 0B |
| community.wave.seqera.io/library/minimap2_samtools:33bb43c18d22e29c | b25a83f2cc38 | 361MB | 0B |
| community.wave.seqera.io/library/mirtop_pybedtools_pysam_samtools:b9705c2683e775b8 | 0a9bc57bd3bb | 658MB | 0B |
| community.wave.seqera.io/library/multiqc:1.32–d58f60e4deb769bf | d353c799d335 | 1.33GB | 0B |
| community.wave.seqera.io/library/multiqc:1.33–ee7739d47738383b | abc4ca8bc9cb | 1.36GB | 0B |
| community.wave.seqera.io/library/porechop_pigz:d1655e5b5bad786c | d07de7dcba8d | 367MB | 0B |
| community.wave.seqera.io/library/prokka_openjdk:10546cadeef11472 | 21f5dde146b5 | 3.66GB | 0B |
| community.wave.seqera.io/library/quast:5.3.0–755a216045b6dbdd | 22ac79f81331 | 2.49GB | 0B |
| community.wave.seqera.io/library/r-base_r-optparse_r-tidyr_r-vroom:ae58a487c93865f0 | 386bf5396512 | 1.54GB | 0B |
| community.wave.seqera.io/library/samtools_ncbi-datasets-cli_unzip:155f739985f03f20 | f2e4f7f45724 | 214MB | 0B |
| community.wave.seqera.io/library/semibin_igraph:fcb667d6c87bf3fd | 1750a92dbc47 | 1.53GB | 0B |
| community.wave.seqera.io/library/seqkit:2.6.1–49efc1ecf715e29f | 55f87270373f | 128MB | 0B |
| community.wave.seqera.io/library/spades:4.1.0–77799c52e1d1054a | 5ae8ace8cf67 | 409MB | 0B |
| community.wave.seqera.io/library/unicycler:0.5.1–b9d21c454db1e56b | 71f83d267a03 | 1.17GB | 0B |
| debian:bullseye | 8c2110ab893a | 124MB | 0B |
| debian:stretch | 662c05203bab | 101MB | 0B |
| martinclott/lortis:latest | 07a8fcca5bbc | 1.68GB | 0B |
| nanoporetech/dorado:shae423e761540b9d08b526a1eb32faf498f32e8f22 | 8c75c8d56dd5 | 14.9GB | 0B |
| nextstrain/base:latest | 11de17534fd8 | 2.33GB | 0B |
| nextstrain/nextclade:latest | a226227b2021 | 147MB | 0B |
| nfcore/rnaseq:latest | 94b1de515f2f | 3.27GB | 0B |
| nvidia/cuda:11.8.0-base-ubuntu22.04 | 1e75b7decac0 | 239MB | 0B |
| nvidia/cuda:12.2.0-base-ubuntu22.04 | 00d989b22f26 | 239MB | 0B |
| own_viral_ngs:latest | 5e07ca31d7c4 | 6.61GB | 0B |
| own_viral_ngs_gap2seq:latest | 7ffc275c57cc | 6.7GB | 0B |
| own_viral_ngs_with_gap2seq:latest | fa476ccfc849 | 7.78GB | 0B |
| plasmidfinder:latest | 0e02223e5603 | 761MB | 0B |
| quay.io/biocontainers/assembly-scan:1.0.0–pyhdfd78af_0 | 1a758d9951e1 | 175MB | 0B |
| quay.io/biocontainers/bakta:1.10.4–pyhdfd78af_0 | e53b9506b083 | 1.24GB | 0B |
| quay.io/biocontainers/bcftools:1.11–h7c999a4_0 | d27059dfedb4 | 224MB | 0B |
| quay.io/biocontainers/bedtools:2.30.0–hc088bd4_0 | a1ef590ebac8 | 94.7MB | 0B |
| quay.io/biocontainers/bioawk:1.0–h5bf99c6_6 | 4b17393adbed | 38.7MB | 0B |
| quay.io/biocontainers/bioconductor-dupradar:1.28.0–r42hdfd78af_0 | 86dc46869f96 | 879MB | 0B |
| quay.io/biocontainers/bioconductor-summarizedexperiment:1.24.0–r41hdfd78af_0 | bebb95995e92 | 829MB | 0B |
| quay.io/biocontainers/bioconductor-tximeta:1.12.0–r41hdfd78af_0 | a372d063a2e5 | 1.12GB | 0B |
| quay.io/biocontainers/biopython:1.78 | b3994fb399a0 | 266MB | 0B |
| quay.io/biocontainers/bowtie2:2.4.2–py38h1c8e9b9_1 | 4080fa94b7cc | 291MB | 0B |
| quay.io/biocontainers/bwa:0.7.17–hed695b0_7 | 821f214d9847 | 109MB | 0B |
| quay.io/biocontainers/comebin:1.0.4–hdfd78af_0 | 599b0c0037d3 | 3.81GB | 0B |
| quay.io/biocontainers/concoct:1.1.0–py39h8907335_8 | 5add18554050 | 495MB | 0B |
| quay.io/biocontainers/dragonflye:1.2.1–hdfd78af_0 | 693b54d1f475 | 2.74GB | 0B |
| quay.io/biocontainers/fastp:0.20.1–h8b12597_0 | 67b5ce22e807 | 55MB | 0B |
| quay.io/biocontainers/fastp:0.23.4–h5f740d0_0 | e6a8a9cadc08 | 39.4MB | 0B |
| quay.io/biocontainers/fastqc:0.11.9–0 | f2f14c82e6c2 | 531MB | 0B |
| quay.io/biocontainers/fastqc:0.12.1–hdfd78af_0 | dc85080d4574 | 614MB | 0B |
| quay.io/biocontainers/fq:0.9.1–h9ee0642_0 | 72527078e80a | 13.7MB | 0B |
| quay.io/biocontainers/ganon:2.1.0–py310hab1bfa5_1 | ebe3b49c734f | 499MB | 0B |
| quay.io/biocontainers/gawk:5.3.0 | 65b3ac68b33f | 64MB | 0B |
| quay.io/biocontainers/gffread:0.12.1–h8b12597_0 | a6c128d24e39 | 49MB | 0B |
| quay.io/biocontainers/hisat2:2.2.1–h1b792b2_3 | 336d8edb337f | 335MB | 0B |
| quay.io/biocontainers/kmerfinder:3.0.2–hdfd78af_0 | 6b960590bb04 | 167MB | 0B |
| quay.io/biocontainers/mash:2.3–he348c14_1 | 870b093fc25b | 126MB | 0B |
| quay.io/biocontainers/medaka:1.4.3–py38h130def0_0 | 7af1a4272629 | 2.09GB | 0B |
| quay.io/biocontainers/medaka:2.2.1–py312hc7af5e1_0 | 1325bf00934f | 2.33GB | 0B |
| quay.io/biocontainers/megahit:1.2.9–h2e03b76_1 | bc0c7f5c00a4 | 200MB | 0B |
| quay.io/biocontainers/metabat2:2.15–h986a166_1 | 2cec952009c9 | 167MB | 0B |
| quay.io/biocontainers/mirtrace:1.0.1–0 | a939d27879c2 | 790MB | 0B |
| quay.io/biocontainers/mulled-v2-1fa26d1ce03c295fe2fdcf85831a92fbcbd7e8c2:1df389393721fc66f3fd8778ad938ac711951107-0 | bc95522e1c82 | 77.7MB | 0B |
| quay.io/biocontainers/mulled-v2-1fa26d1ce03c295fe2fdcf85831a92fbcbd7e8c2:59cdd445419f14abac76b31dd0d71217994cbcc9-0 | 09068e32f7d4 | 113MB | 0B |
| quay.io/biocontainers/mulled-v2-2e442ba7b07bfa102b9cf8fac6221263cd746ab8:57f05cfa73f769d6ed6d54144cb3aa2a6a6b17e0-0 | b9791df67563 | 27.2MB | 0B |
| quay.io/biocontainers/mulled-v2-3a59640f3fe1ed11819984087d31d68600200c3f:185a25ca79923df85b58f42deb48f5ac4481e91f-0 | ec005263947d | 286MB | 0B |
| quay.io/biocontainers/mulled-v2-5799ab18b5fc681e75923b2450abaa969907ec98:87fc08d11968d081f3e8a37131c1f1f6715b6542-0 | 622d9c126807 | 283MB | 0B |
| quay.io/biocontainers/mulled-v2-8849acf39a43cdd6c839a369a74c0adc823e2f91:ab110436faf952a33575c64dd74615a84011450b-0 | 93e967cad095 | 973MB | 0B |
| quay.io/biocontainers/mulled-v2-ac74a7f02cebcfcc07d8e8d1d750af9c83b4d45a:577a697be67b5ae9b16f637fd723b8263a3898b3-0 | f26e4b265e39 | 329MB | 0B |
| quay.io/biocontainers/mulled-v2-cf0123ef83b3c38c13e3b0696a3f285d3f20f15b:64aad4a4e144878400649e71f42105311be7ed87-0 | eb7fe52d1201 | 946MB | 0B |
| quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40:eabfac3657eda5818bae4090db989e3d41b01542-0 | 97a71971e4ae | 166MB | 0B |
| quay.io/biocontainers/multiqc:1.10.1–py_0 | d7a084485324 | 419MB | 0B |
| quay.io/biocontainers/multiqc:1.14–pyhdfd78af_0 | c18fefe6c727 | 490MB | 0B |
| quay.io/biocontainers/multiqc:1.19–pyhdfd78af_0 | 055cf2c3dd59 | 486MB | 0B |
| quay.io/biocontainers/multiqc:1.29–pyhdfd78af_0 | c098db41e8dd | 1.04GB | 0B |
| quay.io/biocontainers/nanoplot:1.46.1–pyhdfd78af_0 | e4af559c79ea | 799MB | 0B |
| quay.io/biocontainers/ont-modkit:0.5.0–hcdda2d0_1 | 759271fac42a | 56.3MB | 0B |
| quay.io/biocontainers/ont-modkit:0.5.0–hcdda2d0_2 | 23338f1f3608 | 51MB | 0B |
| quay.io/biocontainers/perl:5.26.2 | d3998a9936be | 107MB | 0B |
| quay.io/biocontainers/picard:3.0.0–hdfd78af_1 | d395540d73c4 | 1.19GB | 0B |
| quay.io/biocontainers/pigz:2.8 | 044d2120894b | 14.1MB | 0B |
| quay.io/biocontainers/porechop:0.2.4–py39h7cff6ad_2 | 88875379657b | 191MB | 0B |
| quay.io/biocontainers/prokka:1.14.6–pl5321hdfd78af_4 | de20f4295af4 | 1.85GB | 0B |
| quay.io/biocontainers/python:3.8.3 | 7d255f0a290f | 207MB | 0B |
| quay.io/biocontainers/python:3.9–1 | 34c2b9e3810c | 191MB | 0B |
| quay.io/biocontainers/qualimap:2.2.2d–1 | 508a009a25da | 1.26GB | 0B |
| quay.io/biocontainers/qualimap:2.3–hdfd78af_0 | 305b0e7620e9 | 1.65GB | 0B |
| quay.io/biocontainers/quast:5.0.2–py37pl526hb5aa323_2 | 43d7b71dfd43 | 1.77GB | 0B |
| quay.io/biocontainers/quast:5.2.0–py39pl5321h2add14b_1 | b7b8479a9014 | 1.33GB | 0B |
| quay.io/biocontainers/rasusa:0.3.0–h779adbc_1 | d65888d0076a | 44MB | 0B |
| quay.io/biocontainers/requests:2.26.0 | e3166094707d | 181MB | 0B |
| quay.io/biocontainers/rseqc:3.0.1–py37h516909a_1 | 8e8d841718c7 | 802MB | 0B |
| quay.io/biocontainers/rseqc:5.0.3–py39hf95cd2a_0 | e9b5f9b8302a | 985MB | 0B |
| quay.io/biocontainers/salmon:1.10.1–h7e5ed60_0 | d2562f60654a | 266MB | 0B |
| quay.io/biocontainers/samtools:1.10–h9402c20_2 | 66dbf63b9173 | 90.2MB | 0B |
| quay.io/biocontainers/samtools:1.16.1–h6899075_1 | 09cd4486af55 | 62MB | 0B |
| quay.io/biocontainers/samtools:1.17–h00cdaf9_0 | 57a71725cb8a | 61.8MB | 0B |
| quay.io/biocontainers/samtools:1.22.1–h96c455f_0 | a5ee23aa5171 | 71.5MB | 0B |
| quay.io/biocontainers/seqcluster:1.2.9–pyh5e36f6f_0 | c5d422d60a7d | 669MB | 0B |
| quay.io/biocontainers/seqkit:2.9.0–h9ee0642_0 | aa3ad245f30e | 29MB | 0B |
| quay.io/biocontainers/seqtk:1.4–he4a0461_1 | cabd6bfde871 | 15.2MB | 0B |
| quay.io/biocontainers/snp-sites:2.5.1–hed695b0_0 | 55bb32a6d4ff | 33.9MB | 0B |
| quay.io/biocontainers/spades:3.15.3–h95f258a_0 | 3f81f9bb1c34 | 556MB | 0B |
| quay.io/biocontainers/stringtie:2.2.1–hecb563c_2 | bff12f9ac90d | 200MB | 0B |
| quay.io/biocontainers/subread:2.0.1–hed695b0_0 | bbbd1bbfb3bd | 96.8MB | 0B |
| quay.io/biocontainers/toulligqc:2.7.1–pyhdfd78af_0 | 55e97054bff0 | 1.1GB | 0B |
| quay.io/biocontainers/toulligqc:2.8.4–pyhdfd78af_0 | 355df1ba819f | 1.26GB | 0B |
| quay.io/biocontainers/trim-galore:0.6.7–hdfd78af_0 | 9786daa22bb1 | 714MB | 0B |
| quay.io/biocontainers/ubuntu:24.04 | 35a88802559d | 78.1MB | 0B |
| quay.io/biocontainers/ucsc-bedclip:377–h0b8a92a_2 | 94ef80b67886 | 70.1MB | 0B |
| quay.io/biocontainers/ucsc-bedgraphtobigwig:377–h446ed27_1 | 1e314a53b003 | 66.4MB | 0B |
| quay.io/biocontainers/ucsc-bedgraphtobigwig:445–h954228d_0 | 608b52073059 | 51.1MB | 0B |
| quay.io/biocontainers/whatshap:2.6–py39h2de1943_0 | 9c9c9b6edc9b | 453MB | 0B |
| quay.io/broadinstitute/viral-ngs:latest | 3b0a22aa2452 | 5.57GB | 0B |
| quay.io/nf-core/ubuntu:20.04 | 88bd68917189 | 72.8MB | 0B |
| quay.io/qiime2/core:2023.9 | a8adfed74f1b | 5.93GB | 0B |
| rkitchen/excerpt:latest | 38fceb372de2 | 2.02GB | 0B |
| sangerpathogens/circlator:latest | b475e326f98b | 1.72GB | 0B |
| shinejh0528/plasmidfinder:latest | 709388f557ea | 701MB | 0B |
| viral-ngs-fixed:2026-03-19 | 6ec2d521a0e8 | 7.89GB | 0B |
| viral-ngs-fixed:l | cb66bb9a9373 | 9.34GB | 0B |
| viral-ngs-fixed:la | c0390ae1e056 | 10.4GB | 0B |
| viral_ngs_with_gap2seq:latest | c0f397367599 | 6.7GB | 0B |
| zacharyfoster/main-report-r-packages:0.20 | c164ad489714 | 4.33GB | 0B |
Protected: Comparative Analysis of Two Documents
MetaPhlAn and StrainPhlAn Database Relationship Explained
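Short answer: they are related, but not identical. Both build on the same "marker genes" concept, but their purposes, file structures, and downloaded content are separate.
🔹 Key differences at a glance
| Feature | MetaPhlAn | StrainPhlAn |
|---|---|---|
| Main purpose | Species-level taxonomic profiling (who is there, and at what abundance?) | Strain-level evolutionary/variation analysis within a single species (is it the same strain? are there mutations?) |
| Core database | mpa_vJan25_CHOCOPhlAnSGB_202503_bt2/ (Bowtie2 index + marker gene list) | strainphlan_db_markers/ (marker gene FASTA) + strainphlan_db_reference/ (reference genomes, optional) |
| File size | ~16 GB (Bowtie2 index after extraction) | Marker genes ~50–200 MB; reference genomes must be supplied by the user or downloaded separately |
| Downloaded automatically? | ✅ metaphlan --install downloads the full database | ⚠️ Marker genes may be downloaded together with MetaPhlAn, but the reference genome directory is usually empty and must be configured manually |
| Dependencies | Runs standalone | Usually depends on MetaPhlAn first identifying the target species, then runs strain analysis on that species |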
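🔹 Directory layout (using your paths as an example)
# MetaPhlAn main database (required)
${DB_DIR}/
├── mpa_vJan25_CHOCOPhlAnSGB_202503_bt2/ # Bowtie2 index files (.bt2), used for fast alignment
├── mpa_vJan25_CHOCOPhlAnSGB_202503_bt2.tar # Archive
├── mpa_vJan25_CHOCOPhlAnSGB_202503_bt2.md5 # Checksum file
└── markers/ # Marker gene definition files (.mpa)
# StrainPhlAn-specific directories (optional/extension)
├── strainphlan_db_markers/
│ └── all_markers.fasta # Marker gene sequences for all species (used to extract the target species' markers)
└── strainphlan_db_reference/
└── [empty or user-defined] # Reference genome collections for specific species, used to build phylogenetic trees
💡 The strainphlan_db_reference/ directory you see is usually empty because:
- By default, StrainPhlAn works from the species identified by MetaPhlAn plus the built-in marker genes;
- You only need to download or supply a reference genome collection for a species when you want high-resolution strain tracking (for example, comparing E. coli SNP differences across several samples).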
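🔹 How do they work together? (workflow)
graph LR
A[Raw FASTQ] --> B(MetaPhlAn)
B --> C{Species composition table}
C -->|Target species: e.g. Escherichia coli | D[StrainPhlAn]
D --> E[Extract marker genes for that species]
E --> F[Multiple sequence alignment + phylogenetic tree construction]
F --> G[Strain-level difference / transmission analysis]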
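🔹 Practical advice (for your test environment)
- Current testing stage: you used --bypass-strain-profiling, so StrainPhlAn does not run at all; you only need to make sure the MetaPhlAn database is complete.
- When you enable strain analysis later:
# 1. First confirm that the marker gene file exists
ls -lh ${DB_DIR}/strainphlan_db_markers/all_markers.fasta
# 2. If it is missing, try reinstalling (some versions download it together)
metaphlan --install --bowtie2db ${DB_DIR}
# 3. If you need reference genomes (advanced use), download them manually into the reference directory,
#    for example .fna files for the target species from GTDB or NCBI
- Manage database paths in one place (to avoid duplicate downloads):
# Set in ~/.bashrc or in the conda env
export METAPHLAN_DB_PATH=/mnt/nvme4n1p1/biobakery_db/metaphlan_dbs
export STRAINPHLAN_DB_PATH=/mnt/nvme4n1p1/biobakery_db/strainphlan_dbs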
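🔹 Troubleshooting common problems
# Check that MetaPhlAn is installed and callable
metaphlan --version # if a version string is printed, the base environment is working
# Manually verify the MD5 (the key to your earlier error)
cd ${DB_DIR}
md5sum -c mpa_vJan25_CHOCOPhlAnSGB_202503_bt2.md5
# Correct output should be: mpa_vJan25_CHOCOPhlAnSGB_202503_bt2.tar: OK
# If StrainPhlAn reports that it cannot find the marker genes
strainphlan --print-default-db # show the default database path
# Then point to custom paths with the --marker_db and --reference_db parameters
# (option names can differ between versions; confirm with strainphlan --help)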
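The `md5sum -c` check above can also be done from Python, which is handy inside a container that lacks `md5sum`. This is a minimal sketch that assumes the `.md5` file uses the usual md5sum layout (hex digest first, then the file name); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file against a checksum file (assumed format: '<hex digest>  <name>')."""
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

Reading in chunks keeps memory use flat even for a multi-gigabyte database archive.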
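✅ Summary:
MetaPhlAn is the "census" (which organisms are present); StrainPhlAn is the "close inspection" (which strains of the same organism).
They share the underlying marker-gene concept, but their database files, purposes, and run logic are layered and independent.
For your current test you only need to focus on repairing the MetaPhlAn database; StrainPhlAn can be enabled later as needed.
If errors remain after fixing the MD5 problem, feel free to paste more logs and I will go through them line by line! 🔍🚀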
How do I obtain strainphlan_db_markers/all_markers.fasta?
Key conclusion: this file is not downloaded directly; it is generated locally from MetaPhlAn's Bowtie2 index files with the bowtie2-inspect tool.
🔹 How it works
all_markers.fasta contains every marker gene sequence in the database; StrainPhlAn uses it to extract the markers of the target species.
Generation logic (from the official source file biobakery_workflows_databases.py):
# 1. Locate the MetaPhlAn database directory
strainphlan_db_folder = os.path.join(os.path.dirname(metaphlan.__file__), "metaphlan_databases")
strainphlan_db = glob.glob(strainphlan_db_folder+"/mpa_*.pkl")[0].replace(".pkl", "")
# 2. Extract the FASTA with bowtie2-inspect
run_command("bowtie2-inspect " + strainphlan_db + " > " + install_folder + "/all_markers.fasta", shell=True)
🔹 Method 1: Generate automatically with the official command (recommended)
# 1. Make sure the MetaPhlAn database is installed correctly (first fix your earlier MD5 problem)
metaphlan --install --bowtie2db /home/jhuang/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases
# 2. Run the biobakery_workflows database installer, which creates all_markers.fasta automatically
biobakery_workflows_databases --install wmgx --location /mnt/nvme4n1p1/biobakery_db
✅ After it finishes, the following is created under the specified location:
/mnt/nvme4n1p1/biobakery_db/
├── strainphlan_db_markers/
│ └── all_markers.fasta # ← generated automatically, about 50–200 MB
└── strainphlan_db_reference/
└── (empty directory for custom reference genomes)
🔹 Method 2: Generate manually with bowtie2-inspect (advanced users)
If you have confirmed that the MetaPhlAn database is complete, you can run:
# 1. Find your MetaPhlAn Bowtie2 index prefix (without the .bt2 suffix)
# Usually under: ~/.conda/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases/
DB_PREFIX="/home/jhuang/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503_bt2"
# 2. Create the output directory
mkdir -p /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers
# 3. Extract the FASTA with bowtie2-inspect
bowtie2-inspect "$DB_PREFIX" > /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers/all_markers.fasta
# 4. Verify the result
ls -lh /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers/all_markers.fasta
head -n 4 /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers/all_markers.fasta
💡 bowtie2-inspect ships with Bowtie2 and is available as soon as bowtie2 is installed.
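🔹 Verify that the database is complete
# Check that MetaPhlAn can locate its database
metaphlan --print-default-db
# Check that all_markers.fasta exists and is not empty
test -s /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers/all_markers.fasta && echo "✅ all_markers.fasta OK" || echo "❌ File missing or empty"
# (Optional) Quickly count the marker genes
grep -c "^>" /mnt/nvme4n1p1/biobakery_db/strainphlan_db_markers/all_markers.fasta
# Expected output: tens of thousands of records (depends on the database version)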
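The `grep -c "^>"` count above has a direct Python equivalent, shown here as a small sketch in case you want to fold the check into a script. The function name is hypothetical; the logic simply counts FASTA header lines.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    count = 0
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count
```

A result of 0 means the file is empty or is not FASTA, which usually points to a wrong index prefix in the extraction step.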
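🔹 Troubleshooting common problems
| Problem | Solution |
|---|---|
| bowtie2-inspect: command not found | conda install -c bioconda bowtie2 |
| Unable to find strainphlan install | Make sure metaphlan is installed correctly and that $PYTHONPATH includes its path |
| Generated all_markers.fasta is empty | Check that $DB_PREFIX is correct; it should point to the common prefix of the .bt2 index files (without suffixes such as .1.bt2) |
| Not enough disk space | The extracted Bowtie2 index is about 16 GB; make sure the target partition has ≥20 GB free |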
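Getting the index prefix wrong is the most common cause of an empty output file, so here is a small sketch that derives candidate prefixes from a directory listing. It assumes the standard Bowtie2 naming (`.1.bt2` to `.4.bt2`, `.rev.1.bt2`, `.rev.2.bt2`, or the `.bt2l` large-index variants); the function name and the file names in the example are hypothetical.

```python
import re

# Matches the suffixes Bowtie2 appends to an index prefix.
_BT2_SUFFIX = re.compile(r"\.(rev\.)?[1-4]\.bt2l?$")


def bowtie2_prefixes(file_names):
    """Return the sorted set of index prefixes found among the given file names."""
    prefixes = set()
    for name in file_names:
        match = _BT2_SUFFIX.search(name)
        if match:
            prefixes.add(name[:match.start()])
    return sorted(prefixes)


if __name__ == "__main__":
    listing = ["example_db.1.bt2", "example_db.rev.1.bt2", "example_db.pkl"]
    print(bowtie2_prefixes(listing))
```

Pass it `os.listdir(db_dir)` and join the result with `db_dir` to obtain the value for `DB_PREFIX`.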
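🔹 Advice for the current test
You are currently running biobakery_workflows with --bypass-strain-profiling, so StrainPhlAn does not run at all and a missing all_markers.fasta will not affect your test. ✅
🎯 Fix the MetaPhlAn database MD5 problem first (delete the corrupted .tar and .md5 files, then rerun), and generate the StrainPhlAn database as needed once the workflow test passes.
If you would like me to write a one-step repair and database generation script, just let me know! 🚀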
Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)
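✅ Short answer: yes, nf-core/methylong accepts 2 biological replicates for DMR analysis
However, keep the limits on statistical power in mind and interpret the results cautiously.
🔧 Technical side: how methylong supports a 2-replicate design
1️⃣ Example parameter configuration
# --input: samplesheet containing 2 replicates per group
# --dmr_population_scale: ⭐ enable population-level comparison
# --population_dmrer dss: ⭐ use DSS (supports small sample sizes)
# --dmr_a / --dmr_b: group A and group B, 2 replicates each
# (comments are kept above the command because a comment after a trailing backslash breaks the line continuation)
nextflow run nf-core/methylong \
-r 2.0.0 \
-profile docker \
--input samplesheet_6mA.csv \
--outdir methylome_out_6mA \
--ont_aligner minimap2 \
--m6a \
--skip_snvs \
--dmr_population_scale \
--population_dmrer dss \
--dmr_a "An6_rep1,An6_rep2" \
--dmr_b "BG5_rep1,BG5_rep2" \
-resume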
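2️⃣ Samplesheet format requirements
group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont
⚠️ Key point: the values in the group column must match the group names used in --dmr_a/--dmr_b exactly. (The example command above passes sample names instead, so confirm in the pipeline documentation which column is matched.)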
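A mismatch between the samplesheet and the `--dmr_a`/`--dmr_b` values is easy to catch before launching the run. The sketch below checks the identifiers against one samplesheet column; `missing_dmr_ids` is a hypothetical helper, and the default column (`sample`) is an assumption based on the example command, so switch it to `group` if that is what your pipeline version matches.

```python
import csv
import io


def missing_dmr_ids(samplesheet_text, dmr_a, dmr_b, column="sample"):
    """Return the --dmr_a/--dmr_b identifiers that do not occur in the given column."""
    rows = csv.DictReader(io.StringIO(samplesheet_text))
    known = {row[column].strip() for row in rows}
    requested = [x.strip() for x in (dmr_a + "," + dmr_b).split(",") if x.strip()]
    return [x for x in requested if x not in known]


SAMPLESHEET = """group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont
"""

if __name__ == "__main__":
    print(missing_dmr_ids(SAMPLESHEET, "An6_rep1,An6_rep2", "BG5_rep1,BG5_rep2"))
```

An empty list means every identifier was found; anything else lists the names that would fail to match.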
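⚠️ Statistical side: limitations of 2 replicates and how to handle them
🔹 How DSS works with 2 replicates
| Feature | Description | What it means for you |
|---|---|---|
| Bayesian shrinkage | Borrows information across genome-wide sites to "shrink" the variance estimates | ✅ Improves robustness with small samples; this is the theoretical basis that makes 2 replicates workable |
| Empirical distribution modeling | Estimates background noise from the distribution of methylation probabilities across all sites | ✅ Global variability can be estimated without many replicates |
| FDR correction | Benjamini-Hochberg method to control for multiple testing | ⚠️ With 2 replicates, false-positive control may be too conservative or too lenient, so interpret p.adj with care |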
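To make the FDR correction in the table concrete, here is the textbook Benjamini-Hochberg adjustment as a short Python sketch. It illustrates the general procedure only and is not DSS's internal code; the function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by n/rank, and the running minimum guarantees that a smaller raw p-value never ends up with a larger adjusted value.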
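🔹 Statistical power: 2 replicates vs. 3+ replicates
| Metric | 2 replicates/group | 3 replicates/group | Recommendation |
|---|---|---|---|
| Power to detect moderate effects (\|diff\| = 20-30%) | ~40-60% | ~70-85% | The current results may miss some truly differential sites |
| Reliability of FDR control | Moderate | High | Significant sites (p.adj < 0.05) are credible, but "not significant" does not mean no difference |
| Robustness to outliers | Low (one aberrant sample can distort the results) | Moderate | Use MultiQC to check between-sample correlation and exclude technical outliers |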
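🎯 Practical advice: getting the most out of a 2-replicate analysis
✅ Strategy 1: Raise the filtering thresholds and focus on high-confidence results
# After the DSS output, apply an additional, stricter filter
# (awk has no abs() function, so the absolute difference is tested explicitly)
awk '$8 <= 0.05 && ($7 >= 0.3 || $7 <= -0.3)' An6_vs_BG5.DSS_DMLs.tsv > high_conf_diff_6mA.tsv
# $8 = p.adj, $7 = diff (methylation difference); verify these column positions against the file header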
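Filtering by column name rather than by position is less fragile than the awk one-liner. The following Python sketch applies the same idea (adjusted p-value plus absolute methylation difference); the column names `fdr` and `diff`, the default cutoffs, and the example rows are assumptions for illustration, so adapt them to the header of your own DSS output.

```python
import csv
import io


def high_confidence_sites(tsv_text, max_padj=0.05, min_abs_diff=0.3,
                          padj_col="fdr", diff_col="diff"):
    """Keep rows with adjusted p <= max_padj and |diff| >= min_abs_diff."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row[padj_col]) <= max_padj
        and abs(float(row[diff_col])) >= min_abs_diff
    ]


# Made-up rows for illustration only.
EXAMPLE_DML = (
    "chr\tpos\tdiff\tfdr\n"
    "contig_1\t100\t0.45\t0.001\n"
    "contig_1\t200\t-0.35\t0.020\n"
    "contig_1\t300\t0.10\t0.001\n"
    "contig_1\t400\t0.50\t0.200\n"
)

if __name__ == "__main__":
    for row in high_confidence_sites(EXAMPLE_DML):
        print(row["chr"], row["pos"], row["diff"], row["fdr"])
```

Using `abs()` keeps both hypermethylated and hypomethylated sites, which a one-sided comparison would miss.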
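✅ Strategy 2: Validate against prior biological knowledge
# Prioritize genes known to belong to restriction-modification systems
bedtools intersect -a high_conf_diff_6mA.tsv -b RM_system_genes.bed -wa > RM_related_diff.tsv
# Or focus on phenotype-related genes (e.g. virulence, stress response)
bedtools intersect -a high_conf_diff_6mA.tsv -b virulence_genes.bed -wa > phenotype_related_diff.tsv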
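✅ Strategy 3: Sensitivity analysis (bootstrap assessment of robustness)
# Example R script: assess the stability of the 2-replicate results
library(boot)
# For each significant site, estimate a confidence interval for the difference by bootstrap resampling
boot_diff <- function(data, indices) {
d <- data[indices, ]
mean(d$groupA) - mean(d$groupB) # simplified example
}
# Run 1000 bootstrap iterations for the top 100 significant sites
# If the 95% CI excludes 0, the result is robust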
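✅ Strategy 4: State the limitations transparently when reporting
Add to the methods section of the paper/report:
"Differential methylation analysis was performed with DSS (vX.X) using 2 biological replicates per group. Because of the limited sample size, we applied stricter significance thresholds (FDR < X.XX, |diff| > 30%) and prioritized methylation changes in genes of known function to reduce the risk of false positives."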
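📊 If you decide to extend to 3 replicates: lowest-cost options
| Option | Extra cost | Benefit |
|---|---|---|
| Technical replicate (same DNA, multiple library preps) | Low (sequencing cost only) | Assesses technical noise, but cannot replace biological replicates |
| Biological replicate (independent culture + extraction) | Medium (time + reagents) | ✅ Genuinely increases statistical power; recommended |
| Mixed strategy (2 biological replicates + 1 technical replicate) | Low to medium | Balances cost and reliability; suited to limited resources |
💡 One-sentence summary
🔑 Technically: methylong + DSS fully supports DMR analysis with 2 replicates per group; the pipeline runs normally and produces results.
🔑 Statistically: power is limited with 2 replicates. Recommendations: (1) use stricter significance thresholds, (2) validate against prior biological knowledge, (3) state the limitations transparently in the report.
🔑 Looking ahead: if key findings need deeper validation, extend to ≥3 biological replicates for more reliable FDR control.
If you would like me to write a stricter filtering script or sensitivity-analysis code, just ask! 🚀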
🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline
1. Base Modification Calling with modkit
Base modifications were summarized at single-nucleotide resolution using modkit pileup (v0.3.1), which aggregates the per-read modification calls stored in the aligned BAM files. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:
| Column | Name | Description | Example |
|---|---|---|---|
| 1 | chrom | Contig/chromosome identifier | contig_1 |
| 2 | start | Modification position (0-based, BED format) | 12345 |
| 3 | end | End position (start + 1 for single-base mods) | 12346 |
| 4 | name | Modification code + sequence context | a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC) |
| 5 | score | Total read coverage at the site | 45 |
| 6 | strand | Strand orientation (+, -, or . for unstranded) | + |
| 7-9 | thickStart/End, itemRgb | Visualization fields (unused in analysis) | – |
| 10 | blockCount | Reads supporting the modification call | 32 |
| 11 | blockSizes | Percent of reads called modified (0–100) ← core filtering metric | 85.3 |
| 12-18 | Extended QC metrics | mod_count, unmod_count, del_count, no_call_count, strand-specific coverage | – |
📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g., 85.3 = 85.3% of reads show 6mA at this adenine).
2. High-Confidence Site Filtering
Sites were filtered using two stringent thresholds to minimize false positives:
- Minimum coverage: ≥10 reads ($5 >= 10)
- Minimum modification rate:
  - 5mC: ≥70% ($11 >= 70)
  - 4mC: ≥50% ($11 >= 50)
  - 6mA: ≥70% ($11 >= 70)
Filtering was performed using awk:
# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed
Sites were further separated by:
- Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
- Sequence context: CG vs. non-CG (for cytosine modifications)
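The coverage and modification-rate thresholds, together with the split by modification type, can be sketched in Python as follows. This is an illustration rather than the pipeline's actual script: the names are hypothetical, the modification codes (`m`, `21839`, `a`) and column positions follow the table above, and the example line is made up.

```python
# Per-type minimum percent modified (column 11), keyed by the code in column 4.
THRESHOLDS = {"m": 70.0, "21839": 50.0, "a": 70.0}
MOD_NAMES = {"m": "5mC", "21839": "4mC", "a": "6mA"}
MIN_COVERAGE = 10  # column 5


def classify_site(bed_line):
    """Return '5mC', '4mC' or '6mA' if the site passes both filters, else None."""
    fields = bed_line.split()  # tolerant of tab- or space-delimited columns
    code = fields[3].split(",")[0]
    if code not in THRESHOLDS:
        return None
    coverage = int(fields[4])
    percent_modified = float(fields[10])
    if coverage < MIN_COVERAGE or percent_modified < THRESHOLDS[code]:
        return None
    return MOD_NAMES[code]


if __name__ == "__main__":
    example = "contig_1\t12345\t12346\ta,CG,ACGT\t45\t+\t12345\t12346\t255,0,0\t45\t85.3"
    print(classify_site(example))
```

Keeping the thresholds in one dictionary makes it straightforward to rerun the filter with a different cutoff for a single modification type.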
3. Flanking Sequence Extraction for Motif Analysis
For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:
Modification site: [position X]
Extracted window: [X-50] ................ [X] ................ [X+50]
↑
Total length = 101 bp
Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.
Extraction was performed using bedtools getfasta:
# Create extended regions file (BED6: score and strand are carried over so that getfasta -s can use the strand column)
awk -F'\t' -v flank=50 '{
c=$1; gsub(/^>/,"",c);
s=($2-flank>0)?$2-flank:0; e=$3+flank;
print c"\t"s"\t"e"\t"$4"_"NR"\t"$5"\t"$6
}' filtered.bed > regions.bed
# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s
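The window arithmetic is easy to check by hand. The sketch below mirrors the awk expression in Python (the function name is hypothetical): the start is moved back by the flank and clamped at 0, and the end is moved forward by the flank.

```python
def flank_window(start, end, flank=50):
    """Extend a 0-based, half-open BED interval by `flank` bp on each side."""
    return max(0, start - flank), end + flank


if __name__ == "__main__":
    window_start, window_end = flank_window(12345, 12346)
    print(window_start, window_end, window_end - window_start)
```

For a single-base site at 12345-12346 this gives 12295-12396, a 101 bp window. Near a contig start the window is clipped on the left, so it can be shorter than 101 bp.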
4. Motif Enrichment Analysis with HOMER
De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:
| Parameter | Value | Rationale |
|---|---|---|
| -len | 4,6,8,10 | Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs |
| -size | 50 | Intended to match the ±50 bp flanking window used for extraction; note that HOMER interprets -size as the total region width centered on each site, so a value of 50 covers ±25 bp |
| -mask | true | Masks low-complexity/repetitive regions to reduce spurious motifs |
| -S | 10 | Optimizes 10 putative motifs per dataset to balance sensitivity and specificity |
| -noknown | true | Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE |
| -useNewBg | true | Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability) |
| -p | 8–100 | Parallel threads (adjusted per compute resource) |
Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.
5. REBASE Annotation & Methylase Classification
Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.
Matching Logic:
- IUPAC-aware comparison: Motifs containing degenerate bases (e.g., R=A/G, Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences.
- Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
- Methylation notation parsing: REBASE entries with methylation annotations (e.g., 2(6) = N6-methyladenine at position 2) were parsed to distinguish:
  - REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
  - Methylase Hits: Entries where the enzyme name starts with M. (indicating a methyltransferase) AND the methylation notation matches the modification type:
    - (4) = N4-methylcytosine → 4mC
    - (5) = 5-methylcytosine → 5mC
    - (6) = N6-methyladenine → 6mA
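The IUPAC-to-regex conversion can be illustrated with a short Python sketch. It is not the code of annotate_motifs_rebase_fixed.py, which is not shown here; the names are hypothetical, and only the standard IUPAC nucleotide codes are assumed.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile a degenerate IUPAC motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def motif_matches(motif, sequence):
    """True if the concrete sequence is one instance of the degenerate motif."""
    return iupac_to_regex(motif).fullmatch(sequence.upper()) is not None


if __name__ == "__main__":
    print(motif_matches("CCWGG", "CCAGG"), motif_matches("CCWGG", "CCGGG"))
```

This checks a degenerate motif against a concrete sequence. Comparing two degenerate motifs with each other would instead require testing, position by position, whether their base sets overlap.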
Output Classification:
| Category | Definition | Interpretation |
|---|---|---|
| No REBASE match | Motif not found in REBASE | Likely a transcription factor binding site or novel methylase |
| REBASE match (restriction enzyme only) | Matches a restriction endonuclease but not its cognate methylase | Coincidental sequence similarity; methylation likely from orphan methylase |
| Methylase hit | Matches a methyltransferase with correct modification type notation | High-confidence candidate for the enzyme catalyzing the observed modification |
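The classification rule in the table can be written down as a small Python sketch. The function, the tuple layout of a hit, and the enzyme names (`ExaI`, `M.ExaI`) are made up for illustration; only the `M.` prefix rule and the `(4)`/`(5)`/`(6)` notation come from the description above.

```python
import re

NOTATION_TO_MOD = {"4": "4mC", "5": "5mC", "6": "6mA"}


def classify_hits(hits, observed_mod):
    """Classify REBASE hits for one motif.

    hits: list of (enzyme_name, methylation_notation) tuples, e.g. ("M.ExaI", "2(6)").
    observed_mod: "4mC", "5mC" or "6mA".
    """
    if not hits:
        return "No REBASE match"
    for enzyme, notation in hits:
        found = re.search(r"\((\d)\)", notation)
        if (enzyme.startswith("M.") and found
                and NOTATION_TO_MOD.get(found.group(1)) == observed_mod):
            return "Methylase hit"
    return "REBASE match"


if __name__ == "__main__":
    print(classify_hits([("ExaI", ""), ("M.ExaI", "2(6)")], "6mA"))
```

Both conditions must hold for a methylase hit, so a methyltransferase whose notation indicates a different modification type still falls into the plain "REBASE match" category.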
6. Output File Structure
Final results were compiled into Excel workbooks with the following sheets per sample:
| Sheet Name | Content | Source File |
|---|---|---|
| 4mC_CG | 4mC sites in CG context | 4mC_CG_filtered.bed + REBASE annotation |
| 4mC_nonCG | 4mC sites in non-CG context | 4mC_nonCG_filtered.bed + REBASE annotation |
| 5mC_CG | 5mC sites in CG context | 5mC_CG_filtered.bed + REBASE annotation |
| 5mC_nonCG | 5mC sites in non-CG context | 5mC_nonCG_filtered.bed + REBASE annotation |
| 6mA | All 6mA sites | 6mA_raw_filtered.bed + REBASE annotation |
Each sheet includes:
- Genomic coordinates and modification metrics (from modkit)
- HOMER motif enrichment statistics (p-value, motif sequence)
- REBASE annotation (matched enzymes, methylase status, methylation notation)
7. Software & Database Versions
| Tool/Database | Version | Purpose |
|---|---|---|
| modkit | 0.3.1 | Per-site aggregation (pileup) of base modification calls from aligned nanopore reads |
| bedtools | 2.31.1 | Sequence extraction and genomic interval operations |
| HOMER | 4.11 | De novo motif discovery and enrichment analysis |
| REBASE | 605 (Apr 2026) | Curated database of restriction enzymes and methyltransferases |
| Python | 3.10+ | Custom annotation and data integration scripts |
| pandas/openpyxl | Latest | Excel workbook generation |
🔑 Key Clarifications
- Flanking window: -f 50 = ±50 bp (101 bp total), not ±25 bp.
- Modification rate: Column 11 (blockSizes) = percent modified (0–100), not a probability score.
- Methylase identification: Requires BOTH M. prefix in enzyme name AND matching methylation notation (4)/(5)/(6).
- Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.
This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.