医院废水处理工艺显著降低了废水中“副血链球菌”(*Streptococcus parasanguinis*)的相对丰度

这个统计结果表明:在排除了不同采样时间(季节/月份)带来的自然波动影响后,医院废水处理工艺显著降低了废水中“副血链球菌”(Streptococcus parasanguinis)的相对丰度。

下面为您逐列详细解释这些统计指标的含义:

1. 字段详细解释

字段名 结果值 含义解释
feature Streptococcus.parasanguinis 特征名称:即发生显著变化的物种名称,中文通常译为“副血链球菌”。这是一种常见于人类口腔和肠道的细菌,属于机会致病菌。
value Post 比较的组别:因为我们在 MaAsLin2 中设置 Pre(处理前)为参考组(Reference),这里的 Post 表示这是 “处理后”相对于“处理前” 的比较结果。
coef -0.0006712921 效应系数(Coefficient):代表处理后该物种相对丰度的变化量。负值(-)表示丰度下降。因为您的输入数据是相对丰度(0~1之间),这意味着经过处理后,该菌的相对丰度比处理前降低了约 0.067%
pval 0.0001454243 原始 P 值(P-value):未经多重检验校正的显著性水平。这个值远小于 0.05(甚至小于 0.001),说明在单变量统计检验中,处理前后的差异是极其显著的。
qval 0.01915278 校正后的 Q 值(FDR):即错误发现率(False Discovery Rate)。因为宏基因组数据同时检验了成百上千个物种,必须进行多重检验校正(如 Benjamini-Hochberg 方法)以防止假阳性。Q值 < 0.05,说明即使经过了严格校正,这个差异依然是统计学显著的

2. 结合您的实验设计的深度解读

在您的实验设计中,MaAsLin2 模型同时纳入了 Treatment(处理前/后)和 TimePoint(11月、1月、3月、5月)两个变量。

  • 控制时间变量(TimePoint):废水中的微生物群落会随着季节、气温等时间因素自然变化。模型将 TimePoint 作为协变量(Covariate)剔除,意味着这个 -0.00067 的下降纯粹是由废水处理工艺(Pre vs Post)引起的,而不是因为采样月份不同造成的。
  • 统计学稳健性:原始 P 值(0.00014)和 Q 值(0.019)都非常小,说明这个结果非常稳健,不是偶然产生的假阳性。

3. 生物学与环境意义

  • 副血链球菌(Streptococcus parasanguinis 通常是人类口腔、呼吸道的正常菌群,但也可能引起心内膜炎等机会性感染。在医院废水中检测到它,说明它可能来源于医院的医疗排放或人类排泄物。
  • 环境意义:该菌在处理后(Post)显著减少,说明您的医院废水处理系统能够有效地截留、降解或灭活这种条件致病菌,降低了其排放到自然环境中带来的潜在生物安全风险。这是一个非常积极的工程效果!


Great question! The difference in min_abundance thresholds between species (0.001) and pathways (0.0001) is intentional and based on the fundamental differences in how MetaPhlAn and HUMAnN data are distributed:

1. Data Distribution Differences

Species data (MetaPhlAn):

  • Community composition is typically dominated by a few highly abundant species
  • Most species have relative abundances well above 0.1% (0.001)
  • A threshold of 0.001 (0.1%) effectively filters out “noise” while retaining biologically meaningful species

Pathway data (HUMAnN):

  • Functional profiles are much more evenly distributed across thousands of pathways (you have 15,785 pathways!)
  • Many biologically important pathways exist at very low relative abundances (e.g., antibiotic resistance, xenobiotic degradation, rare metabolic functions)
  • A higher threshold would eliminate too many pathways, potentially losing critical functional signals

2. Why 0.0001 (0.01%) for Pathways?

  • Captures the “long tail” of functional diversity: Many critical metabolic pathways operate at low abundance but are essential for ecosystem function
  • Balances sensitivity and specificity: Combined with min_prevalence = 0.1 (the pathway must appear in at least 10% of samples), this ensures we only test features that are consistently detected, even if at low levels
  • Standard practice in metagenomics: HUMAnN pathway analyses commonly use lower abundance thresholds because functional redundancy means even rare pathways can be important

3. You Can Adjust This

If you want to be more conservative (fewer pathways tested, faster computation, less multiple testing correction burden), you could increase it:

min_abundance = 0.001  # Same as species, 0.1% threshold

Or if you want to be even more sensitive:

min_abundance = 0.00001  # 0.001% threshold

4. Practical Recommendation

Given your dataset has only 11 samples, I’d actually recommend increasing the pathway threshold to match species (min_abundance = 0.001) because:

  • With small sample sizes, testing too many low-abundance features increases the multiple testing burden
  • It makes the Q-value correction more stringent
  • You’ll get more robust, interpretable results

Would you like me to update the code to use min_abundance = 0.001 for pathways as well?

小鼠肠道类器官的分化,用于耶尔森氏菌(Y. enterocolitica)感染研究,采用双极性构型。

一、Organoid(类器官)—— 深度解析

1. 定义

类器官是指利用成体干细胞(ASC)多能干细胞(PSC,包括胚胎干细胞和诱导多能干细胞),在体外三维(3D)培养体系中,通过自组织(self-organization)和定向分化,形成的微型三维组织样结构。它能够再现来源器官的部分细胞类型、空间架构和生理功能

2. 核心特征(四点)

特征 说明
三维结构 不同于传统的二维(2D)单层细胞培养,类器官具有立体的细胞极性和细胞间连接
多细胞类型 包含多种分化细胞,而非单一细胞系
自组织能力 细胞按照内在的发育程序自行排列成类似体内的结构,无需外部支架的精细引导
功能性 能够执行部分器官特有功能,如肠类器官的吸收/分泌、脑类器官的神经电活动

3. 来源分类

  • 成体干细胞来源(ASC-derived):取自组织中的成体干细胞(如肠道隐窝干细胞),分化潜能有限(只能形成该组织的细胞类型),但成熟度高、更接近体内真实状态。
  • 多能干细胞来源(PSC-derived):取自胚胎干细胞或iPS细胞,分化潜能大,可模拟胚胎发育过程,但成熟度相对较低。

4. 应用领域

  • 疾病建模(感染、肿瘤、遗传病)
  • 药物筛选与毒性测试
  • 再生医学(移植修复)
  • 发育生物学研究

二、Enteroid(肠类器官)—— 深度解析

1. 定义

肠类器官是类器官的一种特定亚型,特指来源于肠道上皮干细胞(主要是小肠或结肠的Lgr5⁺隐窝基底柱状细胞),在三维培养中形成的、包含所有肠道上皮主要细胞类型的微型肠道组织。

2. 肠类器官包含的细胞类型(核心)

细胞类型 功能
肠上皮吸收细胞(Enterocyte) 营养吸收、离子转运
杯状细胞(Goblet cell) 分泌黏液,保护上皮屏障
潘氏细胞(Paneth cell) 分泌抗菌肽(如溶菌酶),调节干细胞微环境
肠内分泌细胞(Enteroendocrine cell) 分泌激素(如GLP-1、5-羟色胺),调控消化和食欲
干细胞(Lgr5⁺ CBC细胞) 自我更新,分化补充其他细胞类型
簇细胞(Tuft cell) 参与免疫感知和寄生虫防御

3. 结构特征

  • 肠类器官在Matrigel(基底膜基质)中培养时,会形成极性的囊泡状结构
    • 管腔面(Apical side):面向内部空腔,带有微绒毛,对应肠道的内腔(接触食物和菌群的一面)
    • 基底侧面(Basolateral side):朝向外侧,对应肠道的基底膜侧(接触血管和结缔组织的一面)
  • 这种极性正是你原句中“two-polarity conformations(双极性构型)”所指的内容。

4. 与体内肠道的对应关系

  • 肠类器官的隐窝-绒毛轴在体外被简化为出芽结构(buds)——出芽部分富含干细胞和潘氏细胞(对应隐窝),中央囊泡部分富含分化细胞(对应绒毛区)。

三、Organoid vs Enteroid —— 更详细的对比表

对比维度 Organoid(类器官) Enteroid(肠类器官)
中文译名 类器官 肠类器官(或称肠道类器官)
层级关系 上位概念(总称) 下位概念(子类)
组织来源 可以是脑、肝、肾、肺、肠、胰腺、视网膜等任何器官 仅限于小肠或结肠
干细胞类型 ASC 或 PSC 均可 通常为 ASC(肠道隐窝干细胞)
分化方向 定向分化为对应器官的细胞谱系 定向分化为肠道上皮各细胞类型
培养条件 因器官而异,需不同的细胞因子组合(如脑需Noggin,肝需HGF等) 通常需EGF、Noggin、R-spondin、Wnt3a等(经典“肠类器官培养基”)
结构特点 因器官而异(如脑类器官呈脑区样结构,肝类器官呈肝板样) 呈囊泡状+出芽结构,具有肠上皮特有的隐窝-绒毛极性
典型应用 广泛(神经退行性疾病、肝癌、肾发育、肺纤维化等) 主要集中于肠道感染(如沙门氏菌、耶尔森氏菌)、炎症性肠病(IBD)、结直肠癌、营养代谢

四、为什么你的句子用 Enteroid 而不是 Organoid —— 深入解释

你的原句:

“Differentiation of murine enteroids for Y. enterocolitica infection in two-polarity conformations”

  • 研究对象耶尔森氏菌(Yersinia enterocolitica——这是一种肠道致病菌,主要通过侵入肠道上皮引起感染。
  • 要研究这类细菌,你需要一个能模拟肠道上皮屏障、黏液层、极性运输和细菌侵入机制的模型。
  • 肠类器官(enteroid)正好提供了:
    • 完整的肠道上皮细胞多样性
    • 生理性的顶-底极性(细菌从顶侧侵入,免疫信号从底侧传递)
    • 与体内更接近的微环境

如果用 organoid(泛称),就太笼统了,别人不知道你研究的是肠、胃还是其他器官的类器官。因此,enteroid 是此处最精确、最专业的术语。


五、延伸概念:Colonoid 和 Gastroid

为了帮你更清楚理解 enteroid 的范围,再补充两个相关术语:

术语 中文 来源 说明
Enteroid 肠类器官 小肠或结肠(广义) 有时狭义特指小肠来源
Colonoid 结肠类器官 结肠(大肠) 是 enteroid 的下位细分
Gastroid 胃类器官 胃组织 与 enteroid 同属消化道类器官但不是同一类

在严格学术语境中,enteroid 有时只指小肠来源,而colonoid结肠来源,但日常文献中常混用,统称 intestinal organoids(肠道类器官),而 enteroid 则是更专业的叫法。


六、补充:关于“two-polarity conformations(双极性构型)”

这一点与你问的概念密切相关,我额外解释一下:

  • 在体内,肠道上皮是单层柱状上皮,具有明确的顶底极性(apical-basal polarity)

    • 顶侧(Apical):朝向肠腔,接触食物、菌群、病原体
    • 底侧(Basolateral):朝向基底膜,接触固有层、血管、免疫细胞
  • 在体外肠类器官培养中:

    • 常规构型:类器官的顶侧朝向管腔内部(封闭腔),底侧朝外(接触培养基)。这种构型下,细菌难以直接接触到顶侧(因为顶侧在腔里面),研究感染时需要显微注射或破坏类器官。
    • 反转构型(反向极性):通过特殊培养方法(如去掉基质胶或使用transwell),使顶侧朝外建立开放顶面的单层,这样病原体可以直接从培养液接触顶侧,更便于感染实验。

你的句子中 “two-polarity conformations” 就是指这两种不同的极性排列方式(常规封闭腔 vs 开放/反转极性),用于比较哪种更适于研究耶尔森氏菌的侵入机制。


Further topics:

  • 肠类器官的具体培养方法(培养基成分、Matrigel操作)
  • “反转极性”如何实现(具体实验方案)
  • 耶尔森氏菌感染肠类器官的研究案例


这是一个非常关键且容易混淆的问题。直接给你结论:

隐窝(Crypt)既不是管腔面(Apical side),也不是基底侧面(Basolateral side)。

为了让你彻底明白,我从空间位置细胞极性两个维度来拆解:


一、从空间位置来看(解剖学视角)

在真实的肠道组织中,隐窝和绒毛是肠壁上的两种不同地形结构

  • 绒毛(Villus):像手指一样向上突起,伸向肠道内部(管腔)。
  • 隐窝(Crypt):像小口袋一样向下凹陷,嵌入肠壁深层的固有层中。

所以,隐窝是一个“凹陷的坑”,绒毛是一个“凸起的包”。它们之间的关系是“高低起伏”的地形关系,而不是“内外表面”的关系。


二、从细胞极性来看(细胞生物学视角)

关键是:无论是隐窝里的细胞,还是绒毛上的细胞,它们都是单层上皮细胞,每一颗细胞都有自己的“管腔面”和“基底侧面”。

细胞部位 管腔面(Apical,顶侧) 基底侧面(Basolateral,底侧)
绒毛上的细胞 朝向肠腔内部(接触食物、菌群) 朝向绒毛内部的固有层(接触血管)
隐窝里的细胞 朝向隐窝的管腔(即那个凹陷的空腔) 朝向隐窝外周的基底膜和固有层

也就是说:

隐窝本身是一个“下陷的管腔空间”。隐窝里的干细胞和潘氏细胞,它们的顶膜(Apical)是面向隐窝那个空洞的,而基底膜(Basolateral)是面向隐窝外壁的组织间隙的。


三、用一张图来理解(文字版)

        ═══════════════════════════════   ← 肠腔(真正的管腔,食物通过的地方)
        ↑  绒毛(Villus)            ↑
        |  细胞顶面(Apical)朝向肠腔   |
        |  细胞底面(Basolateral)朝内   |
        ═══════════════════════════════   ← 基底膜
                 ↓↓↓ 向下凹陷 ↓↓↓
        ═══════════════════════════════   ← 隐窝开口
        |  隐窝(Crypt)              |
        |  细胞顶面(Apical)朝向隐窝空腔 |
        |  细胞底面(Basolateral)朝向外周 |
        ═══════════════════════════════   ← 基底膜

四、回答你问题的核心

  • 问:隐窝是管腔面吗?

    • 不是。 隐窝是一个结构区域(凹陷区),而管腔面(Apical)是细胞的一个面(朝向空腔的那一面)。隐窝里的细胞自己的管腔面(朝向隐窝空洞),但隐窝本身不等于管腔面。
  • 问:隐窝是基底侧面吗?

    • 不是。 基底侧面(Basolateral)是细胞接触基底膜和细胞间质的那一面。隐窝里的细胞也有基底侧面(朝向隐窝外周的基底膜),但隐窝本身不等于基底侧面。

五、在肠类器官(Enteroid)中对应什么?

在体外3D培养的肠类器官中:

体内结构 类器官中的对应结构
绒毛(Villus) 类器官的中央囊泡(中央腔体),上皮较薄,含分化细胞
隐窝(Crypt) 类器官的出芽(Bud),向外凸起的芽状结构,富含Lgr5⁺干细胞和潘氏细胞

注意:这里的“出芽”是向外凸起的,而不是向下凹陷的,这是因为类器官的极性相对于基质胶是“内外颠倒”的(顶面朝内腔,基底面朝外)。


六、一句话总结

隐窝是一个“结构位置”(凹陷区域),不是细胞的“一个面”。隐窝里的细胞既有管腔面(朝隐窝空洞),也有基底侧面(朝隐窝外周的基底膜)。

Further topics:

  • 为什么类器官的极性是“内外颠倒”的?
  • 在Transwell系统中如何实现“顶面朝外”?


1. “单层柱状上皮”是指只有一层细胞吗?

是的,千真万确。

  • 单层”在组织学里就是字面意思:只有一层细胞,不像皮肤(复层鳞状上皮)那样有很多层叠加在一起。
  • 柱状”指的是细胞的形态:它们不是圆形的,也不是扁平的,而是像高脚杯或柱子一样,高度 > 宽度,看起来细细长长的。

正因为只有一层,所以这层细胞肩负着巨大的责任:既要当“屏障”挡住坏人(细菌毒素),又要当“搬运工”吸收营养。


2. 细胞的一侧朝向肠腔,另一侧朝向基底膜吗?

完全正确! 这正是单层上皮最经典的极性(Polarity)特征。

因为只有一层,所以这层细胞必须“一头一尾”各司其职:

  • 顶端(Apical side / 管腔面)朝向肠腔(也就是大便和食物残渣流过的地方)。这一侧长满了密密麻麻的微绒毛(Microvilli),用来增加吸收面积,且没有血管直接接触。
  • 基底侧(Basolateral side / 基底侧面)朝向基底膜(Basement membrane)。基底膜像一层“胶水”和“滤网”,把这层上皮和下面的固有层(里面有血管、淋巴管、免疫细胞)粘在一起。细胞吸收的营养物质,会从这一侧排出去,进入血管。

3. 帮你建立“立体空间感”(重要补充)

你可能会困惑:“隐窝不是凹进去的吗?那凹进去的坑里,哪边算肠腔?”

记住这个黄金法则:只要是“空腔”的那一面,就是顶侧(Apical)。

  • 在绒毛上:细胞顶侧 → 对着肠道中间的大洞(肠腔)。
  • 在隐窝里:细胞顶侧 → 对着隐窝那个“小坑”的洞。虽然这个洞是封闭的凹陷,但它依然属于“腔”的一部分。
  • 细胞基底侧:永远对着基底膜(即远离食物,靠近身体内部的那一侧)。

4. 套用在你研究的“双极性构型”上

你现在完全理解了这个结构,就很容易搞懂体外实验的痛点了:

  • 在体内:细菌(如耶尔森氏菌)站在肠腔里,直接接触的就是细胞的顶侧(Apical),然后细菌入侵。
  • 在常规类器官(3D胶里):细胞顶侧(有微绒毛的那面)被包在里面(朝向类器官的内腔),细菌在胶外面,接触的是细胞的基底侧。这就反了!所以科学家才要费尽心机做“双极性构型(反转极性)”,目的就是把细胞的顶侧翻到外面来,让细菌能像在体内一样,从顶侧发起攻击。

总结一句话: 单层 = 只有一层细胞极性 = 顶侧对肠腔(吸收/感染面),基底侧对基底膜(连接身体面)

Generating report using R (Data_Tam_Metagenomics_2026_Wastewater)

To analyze your metagenomics data with greater control, you can use the R packages phyloseq, ggplot2, vegan, and MaAsLin2. The biobakery workflow has already generated the merged tables you need in the results/ directory.

Below is the complete R code to:

  1. Load the data and reproduce the figures from wmgx_report.pdf (PCoA, Heatmap, Stacked Barplot).
  2. Perform differential abundance analysis between Pre-treatment and Post-treatment.
  3. Statistically test if the differences between time points are significant.

Prerequisites

Make sure you have the following R packages installed:

install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr", "tidyr", "pheatmap"))
# MaAsLin2 is installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("MaAsLin2")

Step 1: Load Data and Create Metadata

The biobakery workflow merged your samples into single tables. We will load the MetaPhlAn species table and HUMAnN pathway table, and map your experimental design.

library(phyloseq)
library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Load MetaPhlAn taxonomic profiles
mpa_file <- "results/metaphlan/merged/metaphlan_taxonomic_profiles.tsv"
mpa_raw <- read.table(mpa_file, header=TRUE, sep="\t", comment.char="", check.names=FALSE)
colnames(mpa_raw)[1] <- "clade_name"
rownames(mpa_raw) <- mpa_raw$clade_name
mpa_otu <- mpa_raw[, -1]

# Filter for species level (s__) and clean names
species_idx <- grep("s__", rownames(mpa_otu))
species_otu <- mpa_otu[species_idx, ]
species_names <- gsub(".*s__", "", rownames(species_otu))
species_names <- gsub("\\|.*", "", species_names)
species_names <- gsub("_", " ", species_names) # Replace underscores with spaces
rownames(species_otu) <- species_names

# 2. Create Metadata based on your file names
sample_names <- colnames(species_otu)
metadata <- data.frame(
  row.names = sample_names,
  TimePoint = case_when(
    grepl("Nov", sample_names) ~ "Nov",
    grepl("Jan", sample_names) ~ "Jan",
    grepl("Mar", sample_names) ~ "Mar",
    grepl("May", sample_names) ~ "May"
  ),
  Treatment = ifelse(grepl("Pre", sample_names), "Pre", "Post")
)

# 3. Create phyloseq object for Species
tax_mat <- matrix(rownames(species_otu), ncol=1, dimnames=list(rownames(species_otu), "Species"))
species_ps <- phyloseq(
  otu_table(species_otu, taxa_are_rows=TRUE),
  sample_data(metadata),
  tax_table(tax_mat)
)

# (Optional) Load HUMAnN pathways using the same logic
# path_file <- "results/humann/merged/pathabundance_relab.tsv"
# ... (similar loading steps as above) ...

Step 2: Reproduce Figures from wmgx_report.pdf

1. Ordination (PCoA)

Reproduces the Bray-Curtis Principal Coordinate Analysis.

library(vegan)

# Calculate Bray-Curtis distance and perform PCoA
ord <- ordinate(species_ps, method="PCoA", distance="bray")

# Plot
plot_ordination(species_ps, ord, color="Treatment", shape="TimePoint") +
  geom_point(size=4) +
  theme_minimal() +
  labs(title="PCoA of Species Profiles (Bray-Curtis)", 
       subtitle="Colored by Treatment, Shaped by Time Point") +
  theme(legend.position = "right")

2. Heatmap (Top 25 Species)

Reproduces the hierarchical clustering heatmap using Euclidean distance.

# Get top 25 most abundant species
top25 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:25]
ps_top25 <- prune_taxa(top25, species_ps)

# Plot heatmap (method=NULL defaults to hierarchical clustering)
plot_heatmap(ps_top25, method=NULL, distance="euclidean", 
             taxa.labels=TRUE, label.size=3, na.value=NA) +
  scale_fill_viridis_c() +
  labs(title="Heatmap of Top 25 Most Abundant Species")

3. Stacked Barplot (Top 15 Species)

Reproduces the stacked barplot, sorted by total abundance.

# Get top 15 species
top15 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:15]
ps_top15 <- prune_taxa(top15, species_ps)

# Melt data for ggplot2
df <- psmelt(ps_top15)

# Sort samples by the total abundance of these top 15 species (decreasing order)
sample_order <- df %>%
  group_by(Sample) %>%
  summarise(Total_Ab = sum(Abundance)) %>%
  arrange(desc(Total_Ab)) %>%
  pull(Sample)

df$Sample <- factor(df$Sample, levels=sample_order)

# Plot
ggplot(df, aes(x=Sample, y=Abundance, fill=Species)) +
  geom_bar(stat="identity", position="stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  labs(title="Relative Abundance of Top 15 Most Abundant Species", 
       y="Relative Abundance") +
  scale_fill_discrete(name = "Species")

Step 3: Differential Abundance Analysis (Pre vs. Post)

To find which specific species or pathways are significantly different between Pre- and Post-treatment, we use MaAsLin2 (the official biobakery recommendation). It allows us to test the effect of Treatment while adjusting for TimePoint as a covariate.

library(MaAsLin2)

# MaAsLin2 requires samples as rows and features as columns
species_t <- t(otu_table(species_ps))
species_df <- as.data.frame(species_t)
species_df$sample_id <- rownames(species_df)

metadata_df <- data.frame(
  sample_id = rownames(metadata),
  metadata
)

# Run MaAsLin2
# We set "Pre" and "Nov" as reference groups. 
# The output coefficients will represent the difference of Post vs Pre.
fit_results <- MaAsLin2(
  input_data = species_df,
  input_metadata = metadata_df,
  output = "maaslin2_species_output",
  fixed_effects = c("Treatment", "TimePoint"),
  reference = c("Treatment,Pre", "TimePoint,Nov"), 
  random_effects = c(),
  correction = "FDR",
  standardize = FALSE,
  min_abundance = 0.001, 
  min_prevalence = 0.1   # Feature must be present in at least 10% of samples
)

# View significant results (q-value < 0.05)
sig_results <- subset(fit_results$results, qval < 0.05 & metadata == "Treatment")
if(nrow(sig_results) > 0) {
  print(sig_results[, c("feature", "coef", "qval")] %>% arrange(qval))
} else {
  print("No significantly different species found at FDR < 0.05.")
}

Note: You can run the exact same code for pathways by replacing species_df with your HUMAnN pathway data frame.


Step 4: Is the difference between time points significant?

To determine if the microbial community composition changes significantly across the different time points (Nov, Jan, Mar, May), we use PERMANOVA (via the adonis2 function in the vegan package). This tests the overall community structure rather than individual features.

library(vegan)

# Calculate Bray-Curtis distance matrix
dist_bc <- vegdist(t(otu_table(species_ps)), method="bray")

# Run PERMANOVA
# This tests the main effects of Treatment and TimePoint, and their interaction
adonis_res <- adonis2(dist_bc ~ Treatment * TimePoint, data=metadata, permutations=999)
print(adonis_res)

How to interpret the adonis2 output:

  1. Treatment row: Look at the Pr(>F) (p-value). If it is < 0.05, it means there is a statistically significant difference in the overall microbiome between Pre-treatment and Post-treatment.
  2. TimePoint row: Look at the Pr(>F). If it is < 0.05, it means the microbial community composition differs significantly across the months (Nov, Jan, Mar, May).
  3. Treatment:TimePoint row: If this interaction is significant (< 0.05), it means the effect of the treatment depends on the time of year (e.g., the treatment works differently in November compared to May).

Important Note on Statistical Power: Your dataset has 11 samples total. In the PERMANOVA model, this leaves only 3 residual degrees of freedom. While this mathematically allows for testing, the statistical power is low. Significant results are highly trustworthy, but non-significant results might simply be due to the small sample size.

Processing RNAseq (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE)

  1. Preparing raw data

     mkdir raw_data; cd raw_data
    
     # control samples (8)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_1.fq.gz AYE-WT_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_2.fq.gz AYE-WT_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_1.fq.gz AYE-WT_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_2.fq.gz AYE-WT_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_1.fq.gz AYE-WT_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_2.fq.gz AYE-WT_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_1.fq.gz AYE-T_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_2.fq.gz AYE-T_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_1.fq.gz AYE-T_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_2.fq.gz AYE-T_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_1.fq.gz AYE-T_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_2.fq.gz AYE-T_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_1.fq.gz AYE-O_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_2.fq.gz AYE-O_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_1.fq.gz AYE-O_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_2.fq.gz AYE-O_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_1.fq.gz AYE-O_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_2.fq.gz AYE-O_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_1.fq.gz O-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_2.fq.gz O-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_1.fq.gz O-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_2.fq.gz O-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_1.fq.gz O-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_2.fq.gz O-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_1.fq.gz WT-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_2.fq.gz WT-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_1.fq.gz WT-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_2.fq.gz WT-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_1.fq.gz WT-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_2.fq.gz WT-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_1.fq.gz AYE-WT_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_2.fq.gz AYE-WT_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_1.fq.gz AYE-WT_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_2.fq.gz AYE-WT_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_1.fq.gz AYE-WT_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_2.fq.gz AYE-WT_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_1.fq.gz AYE-O_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_2.fq.gz AYE-O_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_1.fq.gz AYE-O_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_2.fq.gz AYE-O_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_1.fq.gz AYE-O_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_2.fq.gz AYE-O_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_1.fq.gz AYE-T_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_2.fq.gz AYE-T_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_1.fq.gz AYE-T_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_2.fq.gz AYE-T_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_1.fq.gz AYE-T_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_2.fq.gz AYE-T_ctr_solid_r3_R2.fastq.gz
    
     # Diclofenac(双氯芬酸)treatment (6)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_1.fq.gz AYE-WT_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_2.fq.gz AYE-WT_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_1.fq.gz AYE-WT_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_2.fq.gz AYE-WT_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_1.fq.gz AYE-WT_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_2.fq.gz AYE-WT_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_1.fq.gz AYE-T_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_2.fq.gz AYE-T_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_1.fq.gz AYE-T_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_2.fq.gz AYE-T_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_1.fq.gz AYE-T_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_2.fq.gz AYE-T_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_1.fq.gz AYE-O_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_2.fq.gz AYE-O_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_1.fq.gz AYE-O_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_2.fq.gz AYE-O_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_1.fq.gz AYE-O_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_2.fq.gz AYE-O_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_1.fq.gz O-Trans_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_2.fq.gz O-Trans_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_1.fq.gz O-Trans_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_2.fq.gz O-Trans_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_1.fq.gz O-Trans_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_2.fq.gz O-Trans_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_1.fq.gz WT-Trans_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_2.fq.gz WT-Trans_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_1.fq.gz WT-Trans_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_2.fq.gz WT-Trans_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_1.fq.gz WT-Trans_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_2.fq.gz WT-Trans_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_1.fq.gz AYE-WT_Diclo1250_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_2.fq.gz AYE-WT_Diclo1250_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_1.fq.gz AYE-WT_Diclo1250_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_2.fq.gz AYE-WT_Diclo1250_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_1.fq.gz AYE-WT_Diclo1250_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_2.fq.gz AYE-WT_Diclo1250_solid_r3_R2.fastq.gz
    
     # Rifampicin(利福平)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_1.fq.gz AYE-WT_Rifampicin1.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_2.fq.gz AYE-WT_Rifampicin1.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_1.fq.gz AYE-WT_Rifampicin1.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_2.fq.gz AYE-WT_Rifampicin1.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_1.fq.gz AYE-WT_Rifampicin1.5_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_2.fq.gz AYE-WT_Rifampicin1.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_1.fq.gz AYE-T_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_2.fq.gz AYE-T_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_1.fq.gz AYE-T_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_2.fq.gz AYE-T_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_1.fq.gz AYE-T_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_2.fq.gz AYE-T_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_1.fq.gz AYE-O_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_2.fq.gz AYE-O_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_1.fq.gz AYE-O_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_2.fq.gz AYE-O_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_1.fq.gz AYE-O_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_2.fq.gz AYE-O_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_1.fq.gz O-Trans_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_2.fq.gz O-Trans_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_1.fq.gz O-Trans_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_2.fq.gz O-Trans_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_1.fq.gz O-Trans_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_2.fq.gz O-Trans_Rifampicin2_r3_R2.fastq.gz
    
     # Meropenem(美罗培南)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_1.fq.gz AYE-WT_Mero0.35-0.5_r1_R1.fastq.gz  #AYE-WT_Mero0.5_r1
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_2.fq.gz AYE-WT_Mero0.35-0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_1.fq.gz AYE-WT_Mero0.35-0.5_r2_R1.fastq.gz  #AYE-WT_YX_Mero0.35_r2
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_2.fq.gz AYE-WT_Mero0.35-0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_1.fq.gz AYE-WT_Mero0.35-0.5_r3_R1.fastq.gz  #AYE-WT_public_Mero0.35_r3
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_2.fq.gz AYE-WT_Mero0.35-0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_1.fq.gz AYE-T_Mero0.15_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_2.fq.gz AYE-T_Mero0.15_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_1.fq.gz AYE-T_Mero0.15_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_2.fq.gz AYE-T_Mero0.15_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_1.fq.gz AYE-T_Mero0.15_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_2.fq.gz AYE-T_Mero0.15_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_1.fq.gz AYE-O_Mero0.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_2.fq.gz AYE-O_Mero0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_1.fq.gz AYE-O_Mero0.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_2.fq.gz AYE-O_Mero0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_1.fq.gz AYE-O_Mero0.5_r3_R1.fastq.gz  #Mero0.45
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_2.fq.gz AYE-O_Mero0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_1.fq.gz O-Trans_Mero0.25_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_2.fq.gz O-Trans_Mero0.25_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_1.fq.gz O-Trans_Mero0.25_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_2.fq.gz O-Trans_Mero0.25_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_1.fq.gz O-Trans_Mero0.25_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_2.fq.gz O-Trans_Mero0.25_r3_R2.fastq.gz
    
     # Azithromycin(阿奇霉素)treatment (5), among them, F_ctr_solid is clinical isolate.
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_1.fq.gz F_ctr_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_2.fq.gz F_ctr_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_1.fq.gz F_ctr_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_2.fq.gz F_ctr_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_1.fq.gz F_ctr_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_2.fq.gz F_ctr_solid_r3_R2.fastq.gz  #clinical
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_1.fq.gz AYE-WT_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_2.fq.gz AYE-WT_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_1.fq.gz AYE-WT_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_2.fq.gz AYE-WT_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_1.fq.gz AYE-WT_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_2.fq.gz AYE-WT_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_1.fq.gz AYE-T_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_2.fq.gz AYE-T_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_1.fq.gz AYE-T_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_2.fq.gz AYE-T_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_1.fq.gz AYE-T_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_2.fq.gz AYE-T_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_1.fq.gz AYE-O_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_2.fq.gz AYE-O_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_1.fq.gz AYE-O_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_2.fq.gz AYE-O_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_1.fq.gz AYE-O_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_2.fq.gz AYE-O_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_1.fq.gz F_Azi20_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_2.fq.gz F_Azi20_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_1.fq.gz F_Azi20_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_2.fq.gz F_Azi20_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_1.fq.gz F_Azi20_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_2.fq.gz F_Azi20_solid_r3_R2.fastq.gz  #clinical
  2. Preparing the directory trimmed

     mkdir trimmed trimmed_unpaired;
     for sample_id in AYE-O_Azi20_solid_r1 AYE-O_Azi20_solid_r2 AYE-O_Azi20_solid_r3 AYE-O_ctr_r1 AYE-O_ctr_r2 AYE-O_ctr_r3 AYE-O_ctr_solid_r1 AYE-O_ctr_solid_r2 AYE-O_ctr_solid_r3 AYE-O_Diclo375_r1 AYE-O_Diclo375_r2 AYE-O_Diclo375_r3 AYE-O_Mero0.5_r1 AYE-O_Mero0.5_r2 AYE-O_Mero0.5_r3 AYE-O_Rifampicin2_r1 AYE-O_Rifampicin2_r2 AYE-O_Rifampicin2_r3 AYE-T_Azi20_solid_r1 AYE-T_Azi20_solid_r2 AYE-T_Azi20_solid_r3 AYE-T_ctr_r1 AYE-T_ctr_r2 AYE-T_ctr_r3 AYE-T_ctr_solid_r1 AYE-T_ctr_solid_r2 AYE-T_ctr_solid_r3 AYE-T_Diclo375_r1 AYE-T_Diclo375_r2 AYE-T_Diclo375_r3 AYE-T_Mero0.15_r1 AYE-T_Mero0.15_r2 AYE-T_Mero0.15_r3 AYE-T_Rifampicin2_r1 AYE-T_Rifampicin2_r2 AYE-T_Rifampicin2_r3 AYE-WT_Azi20_solid_r1 AYE-WT_Azi20_solid_r2 AYE-WT_Azi20_solid_r3 AYE-WT_ctr_r1 AYE-WT_ctr_r2 AYE-WT_ctr_r3 AYE-WT_ctr_solid_r1 AYE-WT_ctr_solid_r2 AYE-WT_ctr_solid_r3 AYE-WT_Diclo1250_solid_r1 AYE-WT_Diclo1250_solid_r2 AYE-WT_Diclo1250_solid_r3 AYE-WT_Diclo750_r1 AYE-WT_Diclo750_r2 AYE-WT_Diclo750_r3 AYE-WT_Mero0.35-0.5_r1 AYE-WT_Mero0.35-0.5_r2 AYE-WT_Mero0.35-0.5_r3 AYE-WT_Rifampicin1.5_r1 AYE-WT_Rifampicin1.5_r2 AYE-WT_Rifampicin1.5_r3 F_Azi20_solid_r1 F_Azi20_solid_r2 F_Azi20_solid_r3 F_ctr_solid_r1 F_ctr_solid_r2 F_ctr_solid_r3 O-Trans_ctr_r1 O-Trans_ctr_r2 O-Trans_ctr_r3 O-Trans_Diclo375_r1 O-Trans_Diclo375_r2 O-Trans_Diclo375_r3 O-Trans_Mero0.25_r1 O-Trans_Mero0.25_r2 O-Trans_Mero0.25_r3 O-Trans_Rifampicin2_r1 O-Trans_Rifampicin2_r2 O-Trans_Rifampicin2_r3 WT-Trans_ctr_r1 WT-Trans_ctr_r2 WT-Trans_ctr_r3 WT-Trans_Diclo750_r1 WT-Trans_Diclo750_r2 WT-Trans_Diclo750_r3; do \
     for sample_id in AYE-T_Diclo375_r2; do \
             java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
     done
  3. (Optional) using trinity to find the most closely reference

     #Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
     #https://www.genome.jp/kegg/tables/br08606.html#prok
     acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
     abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
     aby     KGB     Acinetobacter baumannii AYE     2008    GenBank --> *
     abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
     abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
     abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
     abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
     abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
     abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
     abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
     abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
     abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
     abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
     abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
     abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
     abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
     abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
     abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
     abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
     abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
     abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
     #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CU459141.1) was choosen as reference!
  4. Preparing samplesheet.csv

     sample,fastq_1,fastq_2,strandedness
     Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
     ...
  5. Downloading CU459141.fasta and CU459141.gff from GenBank and preparing CU459141_m.gff

     #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
     #Default NOT_WORKING: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
     #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
     #Checking the record (see below) in results/genome/CP059040.gtf
     #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
     #--featurecounts_feature_type 'transcript' returns only the tRNA results
     #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    
     grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
     grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
     grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
     grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
     wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
     grep -P "\tCDS\t" CU459141.gff3 | wc -l  #3659
     sed 's/\tCDS\t/\texon\t/g' CU459141.gff3 > CU459141_m.gff
     grep -P "\texon\t" CU459141_m.gff | sort | wc -l  #3760
    
     # -- DEBUG_2: combination of 'CU459141_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
     --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  6. nextflow run

     # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
     (host_env) mv trimmed/*.fastq.gz .
     (host_env) nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
     --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff"        -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
  7. (OPTIONAL, since the R-script complete_deg_pipeline_custom_cutoff.R runs completely) Generate PCA-plot

     #mamba activate r_env
    
     #install.packages("ggfun")
     # Import the required libraries
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library(gplots)
     library(tximport)
     library(DESeq2)
     #library("org.Hs.eg.db")
     library(dplyr)
     library(tidyverse)
     #install.packages("devtools")
     #devtools::install_version("gtable", version = "0.3.0")
     library(gplots)
     library("RColorBrewer")
     #install.packages("ggrepel")
     library("ggrepel")
     # install.packages("openxlsx")
     library(openxlsx)
     library(EnhancedVolcano)
     library(DESeq2)
     library(edgeR)
    
     setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon")
     # Define paths to your Salmon output quantification files
    
     # Store sample names in a character vector
     samples <- c(
         "AYE-O_Azi20_solid_r1", "AYE-O_Azi20_solid_r2", "AYE-O_Azi20_solid_r3", "AYE-O_ctr_r1", "AYE-O_ctr_r2",
         "AYE-O_ctr_r3", "AYE-O_ctr_solid_r1", "AYE-O_ctr_solid_r2", "AYE-O_ctr_solid_r3",
         "AYE-O_Diclo375_r1", "AYE-O_Diclo375_r2", "AYE-O_Diclo375_r3", "AYE-O_Mero0.5_r1",
         "AYE-O_Mero0.5_r2", "AYE-O_Mero0.5_r3", "AYE-O_Rifampicin2_r1", "AYE-O_Rifampicin2_r2",
         "AYE-O_Rifampicin2_r3", "AYE-T_Azi20_solid_r1", "AYE-T_Azi20_solid_r2", "AYE-T_Azi20_solid_r3",
         "AYE-T_ctr_r1", "AYE-T_ctr_r2", "AYE-T_ctr_r3", "AYE-T_ctr_solid_r1", "AYE-T_ctr_solid_r2",
         "AYE-T_ctr_solid_r3", "AYE-T_Diclo375_r1", "AYE-T_Diclo375_r2", "AYE-T_Diclo375_r3",
         "AYE-T_Mero0.15_r1", "AYE-T_Mero0.15_r2", "AYE-T_Mero0.15_r3", "AYE-T_Rifampicin2_r1",
         "AYE-T_Rifampicin2_r2", "AYE-T_Rifampicin2_r3", "AYE-WT_Azi20_solid_r1", "AYE-WT_Azi20_solid_r2",
         "AYE-WT_Azi20_solid_r3", "AYE-WT_ctr_r1", "AYE-WT_ctr_r2", "AYE-WT_ctr_r3", "AYE-WT_ctr_solid_r1",
         "AYE-WT_ctr_solid_r2", "AYE-WT_ctr_solid_r3", "AYE-WT_Diclo1250_solid_r1", "AYE-WT_Diclo1250_solid_r2",
         "AYE-WT_Diclo1250_solid_r3", "AYE-WT_Diclo750_r1", "AYE-WT_Diclo750_r2", "AYE-WT_Diclo750_r3",
         "AYE-WT_Mero0.35-0.5_r1", "AYE-WT_Mero0.35-0.5_r2", "AYE-WT_Mero0.35-0.5_r3", "AYE-WT_Rifampicin1.5_r1",
         "AYE-WT_Rifampicin1.5_r2", "AYE-WT_Rifampicin1.5_r3", "F_Azi20_solid_r1", "F_Azi20_solid_r2",
         "F_Azi20_solid_r3", "F_ctr_solid_r1", "F_ctr_solid_r2", "F_ctr_solid_r3",
         "O-Trans_ctr_r1","O-Trans_ctr_r2","O-Trans_ctr_r3",  "O-Trans_Diclo375_r1","O-Trans_Diclo375_r2","O-Trans_Diclo375_r3",
         "O-Trans_Mero0.25_r1", "O-Trans_Mero0.25_r2", "O-Trans_Mero0.25_r3", "O-Trans_Rifampicin2_r1",
         "O-Trans_Rifampicin2_r2", "O-Trans_Rifampicin2_r3", "WT-Trans_ctr_r1", "WT-Trans_ctr_r2",
         "WT-Trans_ctr_r3", "WT-Trans_Diclo750_r1", "WT-Trans_Diclo750_r2", "WT-Trans_Diclo750_r3"
     )
    
     # Automatically generate the named vector
     files <- setNames(paste0("./", samples, "/quant.sf"), samples)
     # Import the transcript abundance data with tximport
     txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
     # -----------------------------------------------------------------
     # ---- Step 1: Create Detailed Metadata from Your Sample Names ----
    
     # Extract metadata from sample names
     samples <- names(files)
    
     # Parse the complex sample names
     metadata <- data.frame(
     sample = samples,
     stringsAsFactors = FALSE
     )
    
     # Extract strain (everything before first underscore or hyphen treatment)
     metadata$strain <- sapply(strsplit(samples, "[-_]"), function(x) {
     if(x[1] %in% c("AYE", "O", "WT", "F")) {
         if(x[1] == "AYE" && length(x) > 1 && x[2] %in% c("WT", "T", "O")) {
         paste(x[1:2], collapse = "-")
         } else if(x[1] %in% c("O", "WT") && x[2] == "Trans") {
         paste(x[1:2], collapse = "-")
         } else {
         x[1]
         }
     } else {
         x[1]
     }
     })
    
     # Extract treatment type
     metadata$treatment <- sapply(samples, function(x) {
         if(grepl("_ctr", x)) return("ctrl")
         if(grepl("Diclo", x)) return("Diclo")
         if(grepl("Mero", x)) return("Mero")
         if(grepl("Azi", x)) return("Azi")
         if(grepl("Rifampicin", x)) return("Rifampicin")
         return("ctrl")
     })
    
     # Extract concentration
     metadata$concentration <- sapply(samples, function(x) {
         if(grepl("Diclo1250", x)) return("1250")
         if(grepl("Diclo750", x)) return("750")
         if(grepl("Diclo375", x)) return("375")
         if(grepl("Mero0.5", x)) return("0.5")
         if(grepl("Mero0.35", x)) return("0.35")
         if(grepl("Mero0.25", x)) return("0.25")
         if(grepl("Mero0.15", x)) return("0.15")
         if(grepl("Azi20", x)) return("20")
         if(grepl("Rifampicin2", x)) return("2")
         if(grepl("Rifampicin1.5", x)) return("1.5")
         return("0")
     })
    
     # Extract condition (solid vs liquid)
     metadata$condition <- ifelse(grepl("_solid", samples), "solid", "liquid")
    
     # Extract replicate
     metadata$replicate <- sapply(strsplit(samples, "_"), function(x) {
     rep_part <- x[length(x)]
     gsub("r", "", rep_part)
     })
    
     # Create combined group for easy comparisons
     metadata$group <- paste(metadata$strain, metadata$treatment, metadata$concentration, sep = "_")
    
     # Set row names
     rownames(metadata) <- metadata$sample
    
     # Reorder to match txi columns
     metadata <- metadata[colnames(txi$counts), ]
    
     # ---------------------------------------------
     # ---- Step 2: Choose Your Design Strategy ----
    
     # Strategy A: Full Factorial Design (if balanced)
     #dds <- DESeqDataSetFromTximport(txi, metadata,
     #                         design = ~ strain + treatment + condition)
    
     # --> Strategy B: Combined Group Factor ⭐ RECOMMENDED
     metadata$group <- factor(paste(metadata$strain,
                                     metadata$treatment,
                                     metadata$concentration,
                                     metadata$condition,
                                     sep = "_"))
    
     dds <- DESeqDataSetFromTximport(txi, metadata,
                                     design = ~ group)
     dds <- DESeq(dds)
    
     # See all available comparisons
     resultsNames(dds)
    
     # -------------------------------------------------------------
     # ---- Step 3: Set Up Specific Comparisons from Your Notes ----
     # ==========================================
     # 1. Define Exact Comparisons from Your Notes
     # ==========================================
     planned_comparisons <- list(
     # --- Baseline / Strain Controls ---
     AYE_T_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-T_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_WT_ctr          = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_WT_ctr         = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_T                 = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-T_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_T               = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_T              = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Condition Effects (Solid vs Liquid) ---
     AYE_WT_ctr_solid_vs_AYE_WT_ctr     = list(treat = "AYE-WT_ctrl_0_solid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_O_ctr       = list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-O_ctrl_0_liquid"),
     AYE_T_ctr_solid_vs_AYE_T_ctr       = list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Diclofenac ---
     AYE_WT_Diclo750_vs_AYE_WT_ctr      = list(treat = "AYE-WT_Diclo_750_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-T_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-O_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_AYE_WT_ctr     = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_AYE_WT_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     Diclo_AYE_WT_1250_solid_vs_solid_ctr = list(treat = "AYE-WT_Diclo_1250_solid", ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Meropenem ---
     AYE_WT_Mero_vs_AYE_WT_ctr          = list(treat = "AYE-WT_Mero_0.35_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-O_Mero_0.5_liquid",       ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Mero_vs_AYE_WT_ctr         = list(treat = "O-Trans_Mero_0.25_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_T_ctr            = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Azithromycin (Solid) ---
     AYE_WT_Azi_vs_solid_ctr            = list(treat = "AYE-WT_Azi_20_solid", ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_Azi_vs_solid_ctr             = list(treat = "AYE-T_Azi_20_solid",  ctrl = "AYE-T_ctrl_0_solid"),
     AYE_O_Azi_vs_solid_ctr             = list(treat = "AYE-O_Azi_20_solid",  ctrl = "AYE-O_ctrl_0_solid"),
     F_Azi_vs_F_solid_ctr               = list(treat = "F_Azi_20_solid",      ctrl = "F_ctrl_0_solid"),
    
     # --- Rifampicin ---
     AYE_WT_Rif_vs_AYE_WT_ctr           = list(treat = "AYE-WT_Rifampicin_1.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Rif_vs_AYE_T_ctr             = list(treat = "AYE-T_Rifampicin_2_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Rif_vs_AYE_O_ctr             = list(treat = "AYE-O_Rifampicin_2_liquid",    ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Rif_vs_O_Trans_ctr         = list(treat = "O-Trans_Rifampicin_2_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # Additional Analyses
     planned_comparisons <- list(
     # --- Diclofenac_2 ---
     # * AYE-T_Diclo_375_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Diclo_375_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Diclo_375_liquid vs AYE-O-Trans_ctrl_0_liquid
     # * WT-Trans_Diclo_750_liquid vs AYE-WT-Trans_ctrl_0_liquid
     AYE_T_Diclo375_vs_AYE_T_ctr        = list(treat = "AYE-T_Diclo_375_liquid",        ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_O_ctr        = list(treat = "AYE-O_Diclo_375_liquid",        ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_O_Trans_ctr    = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "O-Trans_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_WT_Trans_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "WT-Trans_ctrl_0_liquid"),
    
     # --- Meropenem_2 ---
     # * AYE-T_Mero_0.15_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Mero_0.5_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Mero_0.25_liquid vs AYE-O-Trans_ctrl_0_liquid
     AYE_T_Mero_vs_AYE_T_ctr        = list(treat = "AYE-T_Mero_0.15_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_O_ctr        = list(treat = "AYE-O_Mero_0.5_liquid",     ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Mero_vs_O_Trans_ctr    = list(treat = "O-Trans_Mero_0.25_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # ==========================================
     # 2. Verification & Validation Script
     # ==========================================
     # Identify which column in colData holds your group names
     group_col <- if("group" %in% colnames(colData(dds))) "group" else
                 if("treatment" %in% colnames(colData(dds))) "treatment" else
                 stop("❌ Please specify the correct colData column containing group names.")
    
     actual_groups <- unique(colData(dds)[[group_col]])
    
     cat("\n", paste(rep("=", 85), collapse=""), "\n")
     cat("📋 VERIFICATION OF NOTE-DERIVED COMPARISONS\n")
     cat(paste(rep("=", 85), collapse=""), "\n\n")
    
     validation_results <- data.frame(
     Comparison_Name = character(),
     Treatment_String = character(),
     Control_String = character(),
     Status = character(),
     Suggested_Contrast = character(),
     stringsAsFactors = FALSE
     )
    
     for(name in names(planned_comparisons)) {
     trt <- planned_comparisons[[name]]$treat
     ctl <- planned_comparisons[[name]]$ctrl
    
     # Find closest matches in actual data
     trt_match <- actual_groups[grepl(trt, actual_groups, fixed = TRUE)]
     ctl_match <- actual_groups[grepl(ctl, actual_groups, fixed = TRUE)]
    
     status <- if(length(trt_match) > 0 && length(ctl_match) > 0) "✅ VALID" else "⚠️  CHECK"
     contrast_str <- if(status == "✅ VALID")
         paste0('c("', group_col, '", "', trt_match[1], '", "', ctl_match[1], '")') else "N/A"
    
     validation_results <- rbind(validation_results, data.frame(
         Comparison_Name = name,
         Treatment_String = trt,
         Control_String = ctl,
         Status = status,
         Suggested_Contrast = contrast_str,
         stringsAsFactors = FALSE
     ))
    
     cat(sprintf("%-45s | T:%-25s C:%-20s | %s\n", name, trt, ctl, status))
     if(status == "⚠️  CHECK") {
         if(length(trt_match) == 0) cat("   🔍 Treat not found. Closest: ", paste(head(actual_groups[grepl(strsplit(trt, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
         if(length(ctl_match) == 0) cat("   🔍 Ctrl not found.  Closest: ", paste(head(actual_groups[grepl(strsplit(ctl, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
     }
     }
    
     # ==========================================
     # 3. Auto-Generate DESeq2 results() Calls (Optional)
     # ==========================================
     valid_comparisons <- validation_results[validation_results$Status == "✅ VALID", ]
     if(nrow(valid_comparisons) > 0) {
     cat("\n📜 READY-TO-RUN DESeq2 CONTRASTS:\n")
     cat(paste(rep("-", 60), collapse=""), "\n")
     for(i in seq_len(nrow(valid_comparisons))) {
         cat(sprintf('res_%s <- results(dds, contrast = %s)\n',
                     gsub("[^A-Za-z0-9]", "_", valid_comparisons$Comparison_Name[i]),
                     valid_comparisons$Suggested_Contrast[i]))
     }
     } else {
     cat("\n⚠️  No exact matches found. Check your colData group naming convention.\n")
     }
    
     # -----------------------------
     # ---- Step 4: PCA figures ----
    
     # 🔍 What each figure shows:
     #
     #    01_PCA_by_Strain.png → Tests if genetic background (AYE-WT, AYE-T, AYE-O, Trans, F) is the dominant source of variation.
     #    02_PCA_by_Treatment.png → Shows clustering by antibiotic/drug exposure (ctrl, Diclo, Mero, Azi, Rifampicin).
     #    03_PCA_by_Condition.png → Reveals batch/growth media effects (solid vs liquid).
     #    04_PCA_CombinedGroups.png → Full experimental grouping with labeled sample names for quick outlier detection.
     #    05_PCA_Ellipses.png → Adds 95% confidence boundaries per strain to visualize group spread and overlap.
     #
     # ⚠️ Quick Checklist Before Running:
     #
     #    Ensure metadata columns (strain, treatment, condition, group) are attached to colData(dds).
     #    If ggrepel is missing, run install.packages("ggrepel").
     #    All PNGs will save to your current working directory (getwd()).
    
     # Install if missing: install.packages(c("ggplot2", "ggrepel"))
     library(DESeq2)
     library(ggplot2)
     library(ggrepel)
    
     # 1. Variance Stabilizing Transformation & Extract PCA Data
     vsd <- vst(dds, blind = FALSE)
     pca_data <- plotPCA(vsd, intgroup = c("strain", "treatment", "condition", "group"), returnData = TRUE)
     percent_var <- round(100 * attr(pca_data, "percentVar"))
    
     # Consistent theme for all plots
     base_theme <- theme_bw(base_size = 12) +
     theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
             legend.position = "right",
             legend.title = element_text(face = "bold"),
             panel.grid.major = element_line(color = "grey90"),
             panel.grid.minor = element_blank())
    
     # --- Plot 1: Colored by Strain ---
     p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20, show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Strain",
         color = "Strain", shape = "Condition") +
     base_theme
     ggsave("01_PCA_by_Strain.png", p1, width = 8, height = 6, dpi = 300)
    
     # --- Plot 2: Colored by Treatment ---
     p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = treatment, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Treatment",
         color = "Treatment", shape = "Condition") +
     base_theme
     ggsave("02_PCA_by_Treatment.png", p2, width = 8, height = 6, dpi = 300)
    
     # --- Plot 3: Colored by Condition (Solid vs Liquid) ---
     p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = strain)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Growth Condition",
         color = "Condition", shape = "Strain") +
     base_theme
     ggsave("03_PCA_by_Condition.png", p3, width = 8, height = 6, dpi = 300)
    
     # --- Plot 4: Combined Groups with Sample Labels ---
     p4 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = group)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2, max.overlaps = 30, box.padding = 0.3) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Combined Experimental Groups",
         color = "Group") +
     base_theme + theme(legend.position = "none")
     ggsave("04_PCA_CombinedGroups.png", p4, width = 9, height = 7, dpi = 300)
    
     # --- Plot 5: 95% Confidence Ellipses (by Strain) ---
     p5 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, fill = strain)) +
     geom_point(size = 3, alpha = 0.7) +
     stat_ellipse(level = 0.95, alpha = 0.2, geom = "polygon", show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: 95% Confidence Ellipses by Strain",
         color = "Strain", fill = "Strain") +
     base_theme
     ggsave("05_PCA_Ellipses.png", p5, width = 8, height = 6, dpi = 300)
    
     message("✅ All 5 PCA plots saved to working directory!")
  8. Run Differential Expression & PCA Analysis Complete

     (r_env) cd ~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/
     #(r_env) Rscript complete_deg_pipeline.R  #For standard cutoff in the project
    
     # Adapted the script to the following requests:
     # (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
     # (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
     # (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.
    
     # How it works:
     #   * Rifampicin: The script looks for "Rif" in the comparison name (e.g., 28_AYE_WT_Rif_vs_Ctrl) and applies |log2FC| >= 1.2.
     #   * Baseline/Strain Controls: The script looks for "_ctr_vs_" in the comparison name (e.g., 01_AYE_T_ctr_vs_AYE_WT_ctr) and applies |log2FC| >= 1.4.
     #   * All Others: Falls back to the original 2.0 cutoff.
     #   * The console output will now explicitly print which cutoff is being used for each specific comparison.
    
     (r_env) Rscript complete_deg_pipeline_custom_cutoff.R
  9. LOG of Rscript complete_deg_pipeline_custom_cutoff.R

     -rw-rw-r-- 1 jhuang jhuang 355609 Jun 22 12:53 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 253271 Jun 22 12:53 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 368843 Jun 22 12:53 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 358124 Jun 22 12:53 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 347126 Jun 22 12:53 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 283473 Jun 22 12:53 Volcano_31_O_Trans_Rif_vs_Ctrl.png
    
     -rw-rw-r-- 1 jhuang jhuang 352375 Jun 22 12:53 DEG_32_AYE_T_Diclo375_vs_AYE_T_ctr.xlsx
     -rw-rw-r-- 1 jhuang jhuang 257075 Jun 22 12:53 Volcano_32_AYE_T_Diclo375_vs_AYE_T_ctr.png
    
     -rw-rw-r-- 1 jhuang jhuang  355610 Jun  5 16:26 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  254804 Jun  5 16:26 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  368843 Jun  5 16:26 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  358417 Jun  5 16:26 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  347127 Jun  5 16:26 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  284084 Jun  5 16:26 Volcano_31_O_Trans_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang    1579 Jun  5 16:26 DEG_Summary_All_31.csv
    
     (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon$ Rscript complete_deg_pipeline_custom_cutoff.R
     There were 22 warnings (use warnings() to see them)
     📖 Parsing annotation file to build tx2gene mapping...
     ✅ Created tx2gene mapping with 7520 entries.
     📥 Importing Salmon quantifications...
     reading in files with read_tsv
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
     removing duplicated transcript rows from tx2gene
     summarizing abundance
     summarizing counts
     summarizing length
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using counts and average transcript lengths from tximport
     🚀 Running DESeq2 pipeline...
     estimating size factors
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using 'avgTxLength' from assays(dds), correcting for library size
     estimating dispersions
     gene-wise dispersion estimates
     mean-dispersion relationship
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     final dispersion estimates
     fitting model and testing
     Available groups in dds:
     [1] "AYE-O_Azi_20_solid"           "AYE-O_ctrl_0_liquid"
     [3] "AYE-O_ctrl_0_solid"           "AYE-O_Diclo_375_liquid"
     [5] "AYE-O_Mero_0.5_liquid"        "AYE-O_Rifampicin_2_liquid"
     [7] "AYE-T_Azi_20_solid"           "AYE-T_ctrl_0_liquid"
     [9] "AYE-T_ctrl_0_solid"           "AYE-T_Diclo_375_liquid"
     [11] "AYE-T_Mero_0.15_liquid"       "AYE-T_Rifampicin_2_liquid"
     [13] "AYE-WT_Azi_20_solid"          "AYE-WT_ctrl_0_liquid"
     [15] "AYE-WT_ctrl_0_solid"          "AYE-WT_Diclo_1250_solid"
     [17] "AYE-WT_Diclo_750_liquid"      "AYE-WT_Mero_0.35-0.5_liquid"
     [19] "AYE-WT_Rifampicin_1.5_liquid" "F_Azi_20_solid"
     [21] "F_ctrl_0_solid"               "O-Trans_ctrl_0_liquid"
     [23] "O-Trans_Diclo_375_liquid"     "O-Trans_Mero_0.25_liquid"
     [25] "O-Trans_Rifampicin_2_liquid"  "WT-Trans_ctrl_0_liquid"
     [27] "WT-Trans_Diclo_750_liquid"
     📖 Parsing annotation file for gene names...
    
     🚀 PROCESSING  38  COMPARISONS
     Base Thresholds: padj <= 0.05, |log2FC| >= 2 (Dynamic adjustments apply per comparison)
     ================================================================================
    
     [01/38] 01_AYE_T_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 16, Down: 151)
    
     [02/38] 02_AYE_O_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 9, Down: 9)
    
     [03/38] 03_O_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 44)
    
     [04/38] 04_WT_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 15, Down: 153)
    
     [05/38] 05_AYE_O_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 28, Down: 3)
    
     [06/38] 06_O_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 11, Down: 10)
    
     [07/38] 07_WT_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 4, Down: 14)
    
     [08/38] 08_AYE_WT_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 485, Down: 324)
    
     [09/38] 09_AYE_O_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 475, Down: 309)
    
     [10/38] 10_AYE_T_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 501, Down: 332)
    
     [11/38] 11_AYE_O_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 5)
    
     [12/38] 12_AYE_T_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 5, Down: 8)
    
     [13/38] 13_AYE_WT_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 45)
    
     [14/38] 14_AYE_T_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 75, Down: 181)
    
     [15/38] 15_AYE_O_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 26, Down: 240)
    
     [16/38] 16_O_Trans_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 17, Down: 82)
    
     [17/38] 17_WT_Trans_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 25, Down: 230)
    
     [18/38] 18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 66, Down: 26)
    
     [19/38] 19_AYE_WT_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 359, Down: 175)
    
     [20/38] 20_AYE_T_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 93, Down: 110)
    
     [21/38] 21_AYE_O_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 267, Down: 244)
    
     [22/38] 22_O_Trans_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 183, Down: 153)
    
     [23/38] 23_AYE_T_Mero_vs_AYE_T_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [24/38] 24_AYE_WT_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 341, Down: 329)
    
     [25/38] 25_AYE_T_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 510, Down: 433)
    
     [26/38] 26_AYE_O_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 484, Down: 443)
    
     [27/38] 27_F_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 147, Down: 271)
    
     [28/38] 28_AYE_WT_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 11)
    
     [29/38] 29_AYE_T_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 34, Down: 41)
    
     [30/38] 30_AYE_O_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 71, Down: 100)
    
     [31/38] 31_O_Trans_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 6, Down: 10)
    
     [32/38] 32_AYE_T_Diclo375_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 47, Down: 6)
    
     [33/38] 33_AYE_O_Diclo375_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 73)
    
     [34/38] 34_O_Trans_Diclo375_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 2, Down: 17)
    
     [35/38] 35_WT_Trans_Diclo750_vs_WT_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 27)
    
     [36/38] 36_AYE_T_Mero_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [37/38] 37_AYE_O_Mero_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 242, Down: 135)
    
     [38/38] 38_O_Trans_Mero_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 177, Down: 50)
    
     ================================================================================
     📊 FINAL SUMMARY OF ALL 38 COMPARISONS
                                         name total  up down sig_total pct_sig
               # |log2FC| >= 1.4
                   01_AYE_T_ctr_vs_AYE_WT_ctr  3609  16  151       167     4.6
                   02_AYE_O_ctr_vs_AYE_WT_ctr  3609   9    9        18     0.5
                  03_O_Trans_ctr_vs_AYE_WT_ctr  3609  10   44        54     1.5
               04_WT_Trans_ctr_vs_AYE_WT_ctr  3609  15  153       168     4.7
                   05_AYE_O_ctr_vs_AYE_T_ctr  3609  28    3        31     0.9
                 06_O_Trans_ctr_vs_AYE_T_ctr  3609  11   10        21     0.6
                 07_WT_Trans_ctr_vs_AYE_T_ctr  3609   4   14        18     0.5
    
               # |log2FC| >= 2
               08_AYE_WT_ctr_solid_vs_liquid  3609 485  324       809    22.4
                 09_AYE_O_ctr_solid_vs_liquid  3609 475  309       784    21.7
                 10_AYE_T_ctr_solid_vs_liquid  3609 501  332       833    23.1
           11_AYE_O_ctr_solid_vs_AYE_WT_solid  3609  12    5        17     0.5
           12_AYE_T_ctr_solid_vs_AYE_WT_solid  3609   5    8        13     0.4
                   13_AYE_WT_Diclo750_vs_Ctrl  3609  12   45        57     1.6
                   14_AYE_T_Diclo375_vs_Ctrl  3609  75  181       256     7.1
                   15_AYE_O_Diclo375_vs_Ctrl  3609  26  240       266     7.4
                 16_O_Trans_Diclo375_vs_Ctrl  3609  17   82        99     2.7
                 17_WT_Trans_Diclo750_vs_Ctrl  3609  25  230       255     7.1
     18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid  3609  66   26        92     2.5
                       19_AYE_WT_Mero_vs_Ctrl  3609 359  175       534    14.8
                       20_AYE_T_Mero_vs_Ctrl  3609  93  110       203     5.6
                       21_AYE_O_Mero_vs_Ctrl  3609 267  244       511    14.2
                     22_O_Trans_Mero_vs_Ctrl  3609 183  153       336     9.3
                 23_AYE_T_Mero_vs_AYE_T_Ctrl  3609  91   27       118     3.3
           24_AYE_WT_Azi_solid_vs_Ctrl_solid  3609 341  329       670    18.6
             25_AYE_T_Azi_solid_vs_Ctrl_solid  3609 510  433       943    26.1
             26_AYE_O_Azi_solid_vs_Ctrl_solid  3609 484  443       927    25.7
                 27_F_Azi_solid_vs_Ctrl_solid  3609 147  271       418    11.6
    
               # |log2FC| >= 1.2
                       28_AYE_WT_Rif_vs_Ctrl  3609   8   11        19     0.5
                         29_AYE_T_Rif_vs_Ctrl  3609  34   41        75     2.1
                         30_AYE_O_Rif_vs_Ctrl  3609  71  100       171     4.7
                       31_O_Trans_Rif_vs_Ctrl  3609   6   10        16     0.4
    
               # |log2FC| >= 2
                                         name total  up down sig_total
               32_AYE_T_Diclo375_vs_AYE_T_ctr  3609  46    6        52
               33_AYE_O_Diclo375_vs_AYE_O_ctr  3609   8   73        81     2.2
           34_O_Trans_Diclo375_vs_O_Trans_ctr  3609   2   17        19     0.5
         35_WT_Trans_Diclo750_vs_WT_Trans_ctr  3609  10   27        37     1.0
                   36_AYE_T_Mero_vs_AYE_T_ctr  3609  91   27       118     3.3
                   37_AYE_O_Mero_vs_AYE_O_ctr  3609 242  135       377    10.4
               38_O_Trans_Mero_vs_O_Trans_ctr  3609 177   50       227     6.3
    
     ✨ All files saved to: /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete
     📁 Each comparison contains:
         1️ DEG_
    .xlsx (3 sheets: Complete_Results, Up_Regulated, Down_Regulated) 2️ Volcano_ .png (GeneName labels for top significant genes) 📤 EXPORTING COMPLETE RESULTS TO CSV FOR KEGG/GO… ✅ Successfully exported 38 CSV files to DEG_Results_Complete
  10. KEGG and GO annotations in non-model organisms

    (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and 1.4 and < -1.4 for the KEGG and GO analyses. (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

* Preparing file 1 eggnog_out.emapper.annotations.txt for the R-code below: (KEGG Terms): EggNog based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download CU459141_protein_.fasta from NCBI
        python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta
        emapper.py -i CU459141_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.

    Extract KEGG KO IDs from annotations.emapper.annotations.

* Preparing file 2 blast2go_annot.annot2_ for the R-code below:

  - Basic (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    * Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

  - Advanced (GO Terms from 'Blast2GO 5 Basic'): Interpro based protein families / domains --> Button interpro

    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

  - MERGE the results of InterPro GO IDs (advanced) to GO IDs (basic) and generate final GO IDs, saved in blast2go_annot.annot2

    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)."
    * (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

  - PREPARING go_terms and ec_terms: annot_* file (NOTE that blast2go_annot.annot2 is after merging InterPro_GO_IDs and GO_IDs):

    cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_

10.2. Perform KEGG and GO Enrichment in R

      (r_env) cd /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete

      #For |deg_cutoff_log_foldchange| >=1.4
      sed "s/01_AYE_T_ctr_vs_AYE_WT_ctr/02_AYE_O_ctr_vs_AYE_WT_ctr/g" 1.R > 2.R
      ...

      #For |deg_cutoff_log_foldchange| >=2.0
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/09_AYE_O_ctr_solid_vs_liquid/g" 8.R > 9.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/10_AYE_T_ctr_solid_vs_liquid/g" 8.R > 10.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/11_AYE_O_ctr_solid_vs_AYE_WT_solid/g" 8.R > 11.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/12_AYE_T_ctr_solid_vs_AYE_WT_solid/g" 8.R > 12.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/13_AYE_WT_Diclo750_vs_Ctrl/g" 8.R > 13.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/14_AYE_T_Diclo375_vs_Ctrl/g" 8.R > 14.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/15_AYE_O_Diclo375_vs_Ctrl/g" 8.R > 15.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/16_O_Trans_Diclo375_vs_Ctrl/g" 8.R > 16.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/17_WT_Trans_Diclo750_vs_Ctrl/g" 8.R > 17.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid/g" 8.R > 18.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/19_AYE_WT_Mero_vs_Ctrl/g" 8.R > 19.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/20_AYE_T_Mero_vs_Ctrl/g" 8.R > 20.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/21_AYE_O_Mero_vs_Ctrl/g" 8.R > 21.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/22_O_Trans_Mero_vs_Ctrl/g" 8.R > 22.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/23_AYE_T_Mero_vs_AYE_T_Ctrl/g" 8.R > 23.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/24_AYE_WT_Azi_solid_vs_Ctrl_solid/g" 8.R > 24.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/25_AYE_T_Azi_solid_vs_Ctrl_solid/g" 8.R > 25.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/26_AYE_O_Azi_solid_vs_Ctrl_solid/g" 8.R > 26.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/27_F_Azi_solid_vs_Ctrl_solid/g" 8.R > 27.R

      #For |deg_cutoff_log_foldchange| >=1.2
      sed "s/28_AYE_WT_Rif_vs_Ctrl/29_AYE_T_Rif_vs_Ctrl/g" 28.R > 29.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/30_AYE_O_Rif_vs_Ctrl/g" 28.R > 30.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/31_O_Trans_Rif_vs_Ctrl/g" 28.R > 31.R

      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 1.R
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid KEGG IDs: 4
      #  Enriched pathways: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 50
      #  Enriched pathways: 4
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid GO IDs: 16
      #  Enriched GO-terms: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 151
      #  Enriched GO-terms: 3
      #...

10.3. Finalizing the KEGG and GO Enrichment table

      1. NOTE (Already realized in the code): geneIDs in KEGG_Enrichment have been already translated from ko to geneID in H0N29_*-format; If not, nachmachen using eggnog-res, 因为 eggnog里有1-1-mspping Info between ko-Name and GeneID.
      2. NEED_MANUAL_DELETION (Already setting the cutoff in the code): p.adjust values have been calculated, we have to filter all records in GO_Enrichment-results by |p.adjust|<=0.05. DON'T_NEED_ANY_MORE, since pvalueCutoff = 0.05 settings in enricher. Alternative using pvalueCutoff=1.0, marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment.
      3. NOTE (Not occuring in the new dataset): In rare case, the description is missing for some IDs, e.g. GO term: GO:0006807: replace GO:0006807    obsolete nitrogen compound metabolic process;  ko00975: Metabolism, Biosynthesis of other secondary metabolites
  1. KEGG and GO enrichments via 1.R, 2.R, …

      #Update "01_AYE_T_ctr_vs_AYE_WT_ctr" with "02_AYE_O_ctr_vs_AYE_WT_ctr" in 1.R and save the updated script as 2.R.
      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 2.R
      === SUMMARY ===
      Up-regulated genes: 9
        Valid KEGG IDs: 1
        Enriched pathways: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 6
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 9
        Valid GO IDs: 9
        Enriched GO-terms: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 9
        Enriched GO-terms: 2
    
      Rscript 3.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 2
        Enriched pathways: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 20
        Enriched pathways: 6
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 44
        Enriched GO-terms: 0
    
      Rscript 4.R
      === SUMMARY ===
      Up-regulated genes: 15
        Valid KEGG IDs: 7
        Enriched pathways: 6
      Down-regulated genes: 152
        Valid KEGG IDs: 43
        Enriched pathways: 11
      === SUMMARY ===
      Up-regulated genes: 15
        Valid GO IDs: 15
        Enriched GO-terms: 0
      Down-regulated genes: 152
        Valid KEGG IDs: 152
        Enriched GO-terms: 0
    
      Rscript 5.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 6.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 7.R
      === SUMMARY ===
      Up-regulated genes: 4
        Valid KEGG IDs: 0
        Enriched pathways: 0
      Down-regulated genes: 14
        Valid KEGG IDs: 10
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 4
        Valid GO IDs: 4
        Enriched GO-terms: 1
      Down-regulated genes: 14
        Valid KEGG IDs: 14
        Enriched GO-terms: 3
    
      Rscript 8.R
      === SUMMARY ===
      Up-regulated genes: 479
        Valid KEGG IDs: 246
        Enriched pathways: 30
      Down-regulated genes: 322
        Valid KEGG IDs: 186
        Enriched pathways: 13
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 479
        Valid GO IDs: 479
        Enriched GO-terms: 4
      Down-regulated genes: 322
        Valid KEGG IDs: 322
        Enriched GO-terms: 6
    
      Rscript 9.R
      === SUMMARY ===
      Up-regulated genes: 472
        Valid KEGG IDs: 230
        Enriched pathways: 25
      Down-regulated genes: 306
        Valid KEGG IDs: 217
        Enriched pathways: 19
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 472
        Valid GO IDs: 472
        Enriched GO-terms: 3
      Down-regulated genes: 306
        Valid KEGG IDs: 306
        Enriched GO-terms: 6
    
      Rscript 10.R
      === SUMMARY ===
      Up-regulated genes: 496
        Valid KEGG IDs: 236
        Enriched pathways: 28
      Down-regulated genes: 330
        Valid KEGG IDs: 250
        Enriched pathways: 26
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 496
        Valid GO IDs: 496
        Enriched GO-terms: 2
      Down-regulated genes: 330
        Valid KEGG IDs: 330
        Enriched GO-terms: 6
    
      Rscript 11.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 2
        Enriched pathways: 5
      Down-regulated genes: 5
        Valid KEGG IDs: 2
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 0
      Down-regulated genes: 5
        Valid KEGG IDs: 5
        Enriched GO-terms: 0
    
      Rscript 12.R
      === SUMMARY ===
      Up-regulated genes: 5
        Valid KEGG IDs: 1
        Enriched pathways: 3
      Down-regulated genes: 8
        Valid KEGG IDs: 3
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 5
        Valid GO IDs: 5
        Enriched GO-terms: 0
      Down-regulated genes: 8
        Valid KEGG IDs: 8
        Enriched GO-terms: 0
    
      Rscript 13.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 45
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 1
      Down-regulated genes: 45
        Valid KEGG IDs: 45
        Enriched GO-terms: 0
    
      Rscript 14.R
      === SUMMARY ===
      Up-regulated genes: 73
        Valid KEGG IDs: 37
        Enriched pathways: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 29
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 73
        Valid GO IDs: 73
        Enriched GO-terms: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 180
        Enriched GO-terms: 0
    
      Rscript 15.R
      === SUMMARY ===
      Up-regulated genes: 26
        Valid KEGG IDs: 14
        Enriched pathways: 10
      Down-regulated genes: 239
        Valid KEGG IDs: 38
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 26
        Valid GO IDs: 26
        Enriched GO-terms: 0
      Down-regulated genes: 239
        Valid KEGG IDs: 239
        Enriched GO-terms: 0
    
      Rscript 16.R
      === SUMMARY ===
      Up-regulated genes: 17
        Valid KEGG IDs: 8
        Enriched pathways: 11
      Down-regulated genes: 82
        Valid KEGG IDs: 14
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 17
        Valid GO IDs: 17
        Enriched GO-terms: 0
      Down-regulated genes: 82
        Valid KEGG IDs: 82
        Enriched GO-terms: 2
    
      Rscript 17.R
      === SUMMARY ===
      Up-regulated genes: 25
        Valid KEGG IDs: 15
        Enriched pathways: 3
      Down-regulated genes: 229
        Valid KEGG IDs: 46
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 25
        Valid GO IDs: 25
        Enriched GO-terms: 2
      Down-regulated genes: 229
        Valid KEGG IDs: 229
        Enriched GO-terms: 1
    
      Rscript 18.R
      === SUMMARY ===
      Up-regulated genes: 66
        Valid KEGG IDs: 18
        Enriched pathways: 8
      Down-regulated genes: 26
        Valid KEGG IDs: 19
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 66
        Valid GO IDs: 66
        Enriched GO-terms: 5
      Down-regulated genes: 26
        Valid KEGG IDs: 26
        Enriched GO-terms: 3
    
      Rscript 19.R
      === SUMMARY ===
      Up-regulated genes: 355
        Valid KEGG IDs: 200
        Enriched pathways: 25
      Down-regulated genes: 175
        Valid KEGG IDs: 70
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 355
        Valid GO IDs: 355
        Enriched GO-terms: 6
      Down-regulated genes: 175
        Valid KEGG IDs: 175
        Enriched GO-terms: 2
    
      Rscript 20.R
      === SUMMARY ===
      Up-regulated genes: 93
        Valid KEGG IDs: 48
        Enriched pathways: 21
      Down-regulated genes: 110
        Valid KEGG IDs: 32
        Enriched pathways: 4
      === SUMMARY ===
      Up-regulated genes: 93
        Valid GO IDs: 93
        Enriched GO-terms: 0
      Down-regulated genes: 110
        Valid KEGG IDs: 110
        Enriched GO-terms: 0
    
      Rscript 21.R
      === SUMMARY ===
      Up-regulated genes: 263
        Valid KEGG IDs: 154
        Enriched pathways: 26
      Down-regulated genes: 244
        Valid KEGG IDs: 72
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 263
        Valid GO IDs: 263
        Enriched GO-terms: 8
      Down-regulated genes: 244
        Valid KEGG IDs: 244
        Enriched GO-terms: 5
    
      Rscript 22.R
      === SUMMARY ===
      Up-regulated genes: 182
        Valid KEGG IDs: 115
        Enriched pathways: 23
      Down-regulated genes: 153
        Valid KEGG IDs: 40
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 182
        Valid GO IDs: 182
        Enriched GO-terms: 3
      Down-regulated genes: 153
        Valid KEGG IDs: 153
        Enriched GO-terms: 2
    
      Rscript 23.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 24.R
      === SUMMARY ===
      Up-regulated genes: 333
        Valid KEGG IDs: 193
        Enriched pathways: 12
      Down-regulated genes: 328
        Valid KEGG IDs: 175
        Enriched pathways: 24
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 333
        Valid GO IDs: 333
        Enriched GO-terms: 4
      Down-regulated genes: 328
        Valid KEGG IDs: 328
        Enriched GO-terms: 4
    
      Rscript 25.R
      === SUMMARY ===
      Up-regulated genes: 499
        Valid KEGG IDs: 264
        Enriched pathways: 19
      Down-regulated genes: 430
        Valid KEGG IDs: 239
        Enriched pathways: 36
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 499
        Valid GO IDs: 499
        Enriched GO-terms: 2
      Down-regulated genes: 430
        Valid KEGG IDs: 430
        Enriched GO-terms: 0
    
      Rscript 26.R
      === SUMMARY ===
      Up-regulated genes: 474
        Valid KEGG IDs: 267
        Enriched pathways: 19
      Down-regulated genes: 440
        Valid KEGG IDs: 216
        Enriched pathways: 32
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 474
        Valid GO IDs: 474
        Enriched GO-terms: 2
      Down-regulated genes: 440
        Valid KEGG IDs: 440
        Enriched GO-terms: 1
    
      Rscript 27.R
      === SUMMARY ===
      Up-regulated genes: 142
        Valid KEGG IDs: 69
        Enriched pathways: 1
      Down-regulated genes: 269
        Valid KEGG IDs: 148
        Enriched pathways: 27
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 142
        Valid GO IDs: 142
        Enriched GO-terms: 5
      Down-regulated genes: 269
        Valid KEGG IDs: 269
        Enriched GO-terms: 3
    
      Rscript 28.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 6
        Enriched pathways: 2
      Down-regulated genes: 11
        Valid KEGG IDs: 3
        Enriched pathways: 0
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 11
        Valid KEGG IDs: 11
        Enriched GO-terms: 0
    
      Rscript 29.R
      === SUMMARY ===
      Up-regulated genes: 34
        Valid KEGG IDs: 11
        Enriched pathways: 6
      Down-regulated genes: 41
        Valid KEGG IDs: 22
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 34
        Valid GO IDs: 34
        Enriched GO-terms: 0
      Down-regulated genes: 41
        Valid KEGG IDs: 41
    
      Rscript 30.R
      === SUMMARY ===
      Up-regulated genes: 70
        Valid KEGG IDs: 47
        Enriched pathways: 16
      Down-regulated genes: 99
        Valid KEGG IDs: 40
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 70
        Valid GO IDs: 70
        Enriched GO-terms: 0
      Down-regulated genes: 99
        Valid KEGG IDs: 99
        Enriched GO-terms: 0
    
      Rscript 31.R
      === SUMMARY ===
      Up-regulated genes: 6
        Valid KEGG IDs: 3
        Enriched pathways: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 6
        Valid GO IDs: 6
        Enriched GO-terms: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 10
        Enriched GO-terms: 5
    
      Rscript 32.R
      === SUMMARY ===
      Up-regulated genes: 46
        Valid KEGG IDs: 18
        Enriched pathways: 1
      Down-regulated genes: 6
        Valid KEGG IDs: 0
        Enriched pathways: 0
      No gene sets have size between 10 and 500 ...
      --> return NULL...
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 46
        Valid GO IDs: 46
        Enriched GO-terms: 4
      Down-regulated genes: 6
        Valid KEGG IDs: 6
        Enriched GO-terms: 0
    
      Rscript 33.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 9
        Enriched pathways: 8
      Down-regulated genes: 72
        Valid KEGG IDs: 7
        Enriched pathways: 3
    
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 72
        Valid KEGG IDs: 72
        Enriched GO-terms: 0
    
      Rscript 34.R
      === SUMMARY ===
      Up-regulated genes: 2
        Valid KEGG IDs: 2
        Enriched pathways: 3
      Down-regulated genes: 17
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 2
        Valid GO IDs: 2
        Enriched GO-terms: 0
      Down-regulated genes: 17
        Valid KEGG IDs: 17
        Enriched GO-terms: 0
    
      Rscript 35.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 5
      Down-regulated genes: 27
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 0
    
      Rscript 36.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 37.R
      === SUMMARY ===
      Up-regulated genes: 238
        Valid KEGG IDs: 141
        Enriched pathways: 28
      Down-regulated genes: 134
        Valid KEGG IDs: 35
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 238
        Valid GO IDs: 238
        Enriched GO-terms: 2
      Down-regulated genes: 134
        Valid KEGG IDs: 134
        Enriched GO-terms: 4
    
      Rscript 38.R
      === SUMMARY ===
      Up-regulated genes: 177
        Valid KEGG IDs: 104
        Enriched pathways: 22
      Down-regulated genes: 50
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 177
        Valid GO IDs: 177
        Enriched GO-terms: 2
      Down-regulated genes: 50
        Valid KEGG IDs: 50
        Enriched GO-terms: 4
  2. (DEPRECATED due to be WRONG and TIME-CONSUMING) KEGG and GO enrichments by replacing the sample name in the R-code, and by copying and pasting the R-code directly on console

      # For your reference, here is the exact list of the 31 comparisons and their assigned groupings:
    
      # Baseline / Strain Controls (|log2FC| >= 1.4)
      "01_AYE_T_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-T_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "02_AYE_O_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-O_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "03_O_Trans_ctr_vs_AYE_WT_ctr"     = c("group", "O-Trans_ctrl_0_liquid", "AYE-WT_ctrl_0_liquid"),
      "04_WT_Trans_ctr_vs_AYE_WT_ctr"    = c("group", "WT-Trans_ctrl_0_liquid","AYE-WT_ctrl_0_liquid"),
      "05_AYE_O_ctr_vs_AYE_T_ctr"        = c("group", "AYE-O_ctrl_0_liquid",   "AYE-T_ctrl_0_liquid"),
      "06_O_Trans_ctr_vs_AYE_T_ctr"      = c("group", "O-Trans_ctrl_0_liquid", "AYE-T_ctrl_0_liquid"),
      "07_WT_Trans_ctr_vs_AYE_T_ctr"     = c("group", "WT-Trans_ctrl_0_liquid","AYE-T_ctrl_0_liquid"),
    
      # Condition Effects (|log2FC| >= 2.0)
      "08_AYE_WT_ctr_solid_vs_liquid"    = c("group", "AYE-WT_ctrl_0_solid",   "AYE-WT_ctrl_0_liquid"),
      "09_AYE_O_ctr_solid_vs_liquid"     = c("group", "AYE-O_ctrl_0_solid",    "AYE-O_ctrl_0_liquid"),
      "10_AYE_T_ctr_solid_vs_liquid"     = c("group", "AYE-T_ctrl_0_solid",    "AYE-T_ctrl_0_liquid"),
      "11_AYE_O_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-O_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
      "12_AYE_T_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-T_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
    
      # Diclofenac (|log2FC| >= 2.0)
      "13_AYE_WT_Diclo750_vs_Ctrl"       = c("group", "AYE-WT_Diclo_750_liquid",   "AYE-WT_ctrl_0_liquid"),
      "14_AYE_T_Diclo375_vs_Ctrl"        = c("group", "AYE-T_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "15_AYE_O_Diclo375_vs_Ctrl"        = c("group", "AYE-O_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "16_O_Trans_Diclo375_vs_Ctrl"      = c("group", "O-Trans_Diclo_375_liquid",  "AYE-WT_ctrl_0_liquid"),
      "17_WT_Trans_Diclo750_vs_Ctrl"     = c("group", "WT-Trans_Diclo_750_liquid", "AYE-WT_ctrl_0_liquid"),
      "18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid" = c("group", "AYE-WT_Diclo_1250_solid", "AYE-WT_ctrl_0_solid"),
    
      # Meropenem (|log2FC| >= 2.0)
      "19_AYE_WT_Mero_vs_Ctrl"           = c("group", "AYE-WT_Mero_0.35-0.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "20_AYE_T_Mero_vs_Ctrl"            = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-WT_ctrl_0_liquid"),
      "21_AYE_O_Mero_vs_Ctrl"            = c("group", "AYE-O_Mero_0.5_liquid",       "AYE-WT_ctrl_0_liquid"),
      "22_O_Trans_Mero_vs_Ctrl"          = c("group", "O-Trans_Mero_0.25_liquid",    "AYE-WT_ctrl_0_liquid"),
      "23_AYE_T_Mero_vs_AYE_T_Ctrl"      = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-T_ctrl_0_liquid"),
    
      # Azithromycin (Solid) (|log2FC| >= 2.0)
      "24_AYE_WT_Azi_solid_vs_Ctrl_solid"= c("group", "AYE-WT_Azi_20_solid", "AYE-WT_ctrl_0_solid"),
      "25_AYE_T_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-T_Azi_20_solid",  "AYE-T_ctrl_0_solid"),
      "26_AYE_O_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-O_Azi_20_solid",  "AYE-O_ctrl_0_solid"),
      "27_F_Azi_solid_vs_Ctrl_solid"     = c("group", "F_Azi_20_solid",      "F_ctrl_0_solid"),
    
      # Rifampicin (|log2FC| >= 1.2)
      "28_AYE_WT_Rif_vs_Ctrl"            = c("group", "AYE-WT_Rifampicin_1.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "29_AYE_T_Rif_vs_Ctrl"             = c("group", "AYE-T_Rifampicin_2_liquid",    "AYE-T_ctrl_0_liquid"),
      "30_AYE_O_Rif_vs_Ctrl"             = c("group", "AYE-O_Rifampicin_2_liquid",    "AYE-O_ctrl_0_liquid"),
      "31_O_Trans_Rif_vs_Ctrl"           = c("group", "O-Trans_Rifampicin_2_liquid",  "O-Trans_ctrl_0_liquid")
    
      ./DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv
      ./DEG_02_AYE_O_ctr_vs_AYE_WT_ctr.csv
      ./DEG_03_O_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_04_WT_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_05_AYE_O_ctr_vs_AYE_T_ctr.csv
      ./DEG_06_O_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_07_WT_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_08_AYE_WT_ctr_solid_vs_liquid.csv
      ./DEG_09_AYE_O_ctr_solid_vs_liquid.csv
      ./DEG_10_AYE_T_ctr_solid_vs_liquid.csv
      ./DEG_11_AYE_O_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_12_AYE_T_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_13_AYE_WT_Diclo750_vs_Ctrl.csv
      ./DEG_14_AYE_T_Diclo375_vs_Ctrl.csv
      ./DEG_15_AYE_O_Diclo375_vs_Ctrl.csv
      ./DEG_16_O_Trans_Diclo375_vs_Ctrl.csv
      ./DEG_17_WT_Trans_Diclo750_vs_Ctrl.csv
      ./DEG_18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid.csv
      ./DEG_19_AYE_WT_Mero_vs_Ctrl.csv
      ./DEG_20_AYE_T_Mero_vs_Ctrl.csv
      ./DEG_21_AYE_O_Mero_vs_Ctrl.csv
      ./DEG_22_O_Trans_Mero_vs_Ctrl.csv
      ./DEG_23_AYE_T_Mero_vs_AYE_T_Ctrl.csv
      ./DEG_24_AYE_WT_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_25_AYE_T_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_26_AYE_O_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_27_F_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_28_AYE_WT_Rif_vs_Ctrl.csv
      ./DEG_29_AYE_T_Rif_vs_Ctrl.csv
      ./DEG_30_AYE_O_Rif_vs_Ctrl.csv
      ./DEG_31_O_Trans_Rif_vs_Ctrl.csv
    
        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete_manually")
    
        # 1. Blast2GO: Extract GO & EC terms (Primary source)
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_AYE/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
        colnames(annot_df) <- c("GeneID", "Term")
    
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        res <- read.csv("DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv")
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # Filter up- and down-regulated genes (NEEDS_TO_BE_ADAPTED_BY_1.4_or_2.0_or_1.2)
        up_regulated <- res_updated[res_updated$log2FoldChange >= 1.4 & res_updated$padj <= 0.05, ]
        down_regulated <- res_updated[res_updated$log2FoldChange <= -1.4 & res_updated$padj <= 0.05, ]
    
        # Create a new workbook
        wb <- createWorkbook()
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        saveWorkbook(wb, "Gene_Expression_with_Annotations_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
    
        # ---------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
    
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
    
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # -----------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
        #saveWorkbook(wb, "KEGG_Enrichment.xlsx", overwrite = TRUE)
    
        # ----------------------------------------
        # ---- Perform GO enrichment analysis ----
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis, where pAdjustMethod is one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
  3. BUG_1: The Count column does not match the number of gene IDs listed in the geneID column.

        The discrepancy between the Count and the number of listed GeneIDs is not an error, but a result of KEGG's many-to-many mapping between physical genes and KOs.
    
        KEGG enrichment statistics (Count, GeneRatio, p-values) are strictly calculated based on unique KOs, not GeneIDs. The geneID column simply maps these enriched KOs back to our specific genome for display.
        Concrete Examples (see KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx):
    
            * ko03070 (Bacterial secretion system): Count = 2 (KOs: K02456/K02457). However, both KOs map to the exact same physical gene (ABAYE2071), so only 1 geneID is listed.
            * ko05111 (Biofilm formation): Count = 3 (KOs: K02456/K02457/K03092), which map back to 2 unique genes (ABAYE2071/ABAYE3136).
    
        To ensure full transparency, I have added a new KEGG_ko column (placed between Count and geneID). It explicitly lists the exact KOs contributing to the Count.

DEBUG: The Count column does not match the number of gene IDs listed in the geneID column (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE$)

This is a brilliant and fundamental question. The short answer is: KEGG enrichment statistics (including GeneRatio, BgRatio, p-values, and the original Count) are strictly based on KOs (KEGG Orthologs), NOT your original GeneIDs.

In fact, my previous attempt to “fix” the discrepancy by changing the Count and GeneRatio to match the number of GeneIDs was statistically incorrect. I apologize for that confusion. By changing the ratio to match GeneIDs while leaving the denominator as the number of KOs, we created a mathematically invalid ratio that no longer matches the p-value calculated by clusterProfiler.

Here is the exact scientific explanation of why this happens, why your co-author saw a discrepancy, and how to properly handle it.


1. Why Enrichment is Based on KOs

KEGG pathways are defined by KOs (e.g., K00001), not by specific genomic loci (like ABAYE0067). When you run enrichKEGG(), the tool performs a hypergeometric test.

  • The Universe (Background): All KOs in the KEGG database.
  • The Sample (Your Input): The unique KOs extracted from your DEGs.
  • The Calculation: It calculates the probability of finding $k$ KOs in a pathway out of your $n$ input KOs.

Therefore, the Count column represents the number of unique KOs from your input list that map to that pathway. The GeneRatio is (Number of input KOs in pathway) / (Total input KOs).

2. Why the Co-author Saw a “Discrepancy”

Your co-author noticed that Count was 3, but only 2 GeneIDs were listed. This is not a bug; it is a biological reality of gene annotation.

Here are the two reasons why Count (KOs) $\neq$ Number of GeneIDs:

  1. Multiple KOs map to the same GeneID: A single gene in your annotation file might have multiple functional domains or be assigned to multiple KOs (e.g., ABAYE2071 might be annotated as K00001, K00002). If both K00001 and K00002 are enriched in a pathway, Count = 2, but they both map back to the exact same gene: ABAYE2071.
  2. Missing Mappings in the DEG list: The KEGG database says 3 KOs belong to this pathway. Your DEGs contain those 3 KOs. However, when your script tries to map those 3 KOs back to your specific Gene_Expression_with_Annotations file, one of the KOs might not have a corresponding GeneID in that specific file (or it was filtered out). Thus, Count = 3, but only 2 GeneIDs are revealed.

3. The Correct Solution

Do not change the Count or GeneRatio. They are statistically correct and mathematically tied to the p-value. The geneID column is purely for display purposes so the biologist can see which actual genes are involved. It is completely normal and expected that the number of displayed GeneIDs differs from the Count.

Here is the scientifically accurate version of the process_kegg_enrichment function. It maps the KOs back to GeneIDs for display, but leaves the statistical columns (Count, GeneRatio) completely untouched.

The Corrected R Function

Replace your process_kegg_enrichment function with this:

============================================================
HELPER FUNCTION: Process KEGG enrichment (STATISTICALLY CORRECT)
============================================================
process_kegg_enrichment <- function(enrichment_result, kegg_to_geneid_mapping) {
  if (is.null(enrichment_result)) {
    message("  -> Enrichment result is NULL")
    return(NULL)
  }

  enrichment_df <- as.data.frame(enrichment_result)
  if (nrow(enrichment_df) == 0) return(NULL)

  message("  -> Mapping KEGG KOs back to original GeneIDs for display...")

  # 1. Split the original KO IDs and map to GeneIDs
  expanded <- enrichment_df %>%
    tidyr::separate_rows(geneID, sep = "/") %>% # Split KOs
    dplyr::left_join(kegg_to_geneid_mapping, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
    dplyr::distinct() %>%
    dplyr::group_by(ID) %>%
    dplyr::summarise(
      # Concatenate all unique GeneIDs that map to the KOs in this pathway
      Mapped_GeneID = paste(unique(na.omit(GeneID)), collapse = "/"), 
      .groups = "drop"
    )

  # 2. Replace the geneID column, but DO NOT TOUCH Count or GeneRatio
  # The original Count and GeneRatio are based on KOs and are statistically valid.
  result <- enrichment_df %>%
    dplyr::select(-geneID) %>% # Drop the old KO-based geneID column
    dplyr::left_join(expanded, by = "ID") %>% # Add the new GeneID-based column
    dplyr::rename(geneID = Mapped_GeneID)

  # 3. Reorder columns for clean Excel output
  desired_order <- c(
    "category", "subcategory", "ID", "Description", "GeneRatio", "BgRatio", 
    "RichFactor", "FoldEnrichment", "zScore", "pvalue", "p.adjust", "qvalue", 
    "Count", "geneID"
  )
  result <- result %>% dplyr::select(dplyr::any_of(desired_order), dplyr::everything())

  message("  ✅ SUCCESS: GeneIDs mapped for display. Statistical columns (Count/GeneRatio) preserved based on KOs.")
  return(result)
}

💡 How to Reply to Your Co-author

You can use this exact explanation to clarify the situation with your co-author:

“Dear [Co-author’s Name],

Thank you for pointing out the discrepancy between the Count and the number of GeneIDs listed. I have investigated the pipeline and can clarify why this occurs.

The KEGG enrichment analysis (enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw GeneIDs. Because multiple GeneIDs can share the same KO annotation, and a single GeneID can be assigned multiple KOs, the mapping between genes and KOs is many-to-many.

Therefore:

  • Count and GeneRatio represent the number of unique KOs from our input list that fall into that pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain based on KOs to be statistically valid.
  • The geneID column is provided purely for display purposes, showing the actual genomic loci (e.g., ABAYE...) that correspond to those enriched KOs.

If Count is 3 but only 2 GeneIDs are listed, it simply means that 3 unique KOs were enriched in that pathway, but when mapped back to our specific annotation file, those 3 KOs correspond to only 2 unique genes (either because two KOs map to the same gene, or one KO lacked a matching gene in our specific DEG list).

I have updated the script to ensure the statistical columns remain perfectly accurate based on KOs, while the geneID column accurately reflects all available gene mappings for your review.”



结合您刚刚修复并导出的新 Excel 文件(其中成功增加了 KEGG_ko 列),我们可以用具体的数据例子来回复您的共同作者。这样不仅能解释原理,还能直接“用数据说话”,彻底打消他/她关于“annotation parsing error(注释解析错误)”的疑虑。

以下是为您改进后的回复草稿:


Subject: Re: Clarification on KEGG Enrichment Output (Count vs. GeneID discrepancy)

Dear [Co-author’s Name],

Thank you for pointing out the apparent discrepancy between the Count and the number of GeneIDs listed in the KEGG enrichment results. I have thoroughly investigated the pipeline and can clarify exactly why this occurs, and how I have updated the output files to make it completely transparent.

The core reason is that the KEGG enrichment analysis (enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw physical GeneIDs. The mapping between genes and KOs is many-to-many: multiple genes can share the same KO annotation, and a single gene can be assigned multiple KOs.

Therefore, the columns should be interpreted as follows:

  • Count and GeneRatio: Represent the number of unique KOs from our input DEG list that fall into that specific pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain strictly based on KOs to be statistically valid.
  • KEGG_ko (NEW COLUMN): To address your concern and provide full traceability, I have added this new column to the Excel output. It explicitly lists the exact KOs that contributed to the Count.
  • geneID: Shows the actual genomic loci (e.g., ABAYE…) that correspond to those enriched KOs.

To illustrate with the exact examples you mentioned: Let’s look at the updated results for ko03070 (Bacterial secretion system) and ko05111 (Biofilm formation):

  • For ko03070, the Count is 2. If you look at the new KEGG_ko column, it correctly lists 2 unique KOs: K02456/K02457. However, when we map these 2 KOs back to our specific genome annotation, they both happen to be carried by the exact same physical gene: ABAYE2071. Thus, the geneID column only reveals 1 gene.
  • Similarly, for ko05111, the Count is 3 (KOs: K02456/K02457/K03092), but they map back to only 2 unique physical genes (ABAYE2071/ABAYE3136).

This is not an annotation parsing error, but a biological reality of how gene functions are annotated in the KEGG database. I have updated the pipeline to ensure the statistical columns remain perfectly accurate based on KOs, while the newly added KEGG_ko column and the geneID column provide a complete, traceable, and biologically meaningful picture for your review.

Please let me know if you would like to discuss this further or if you need any additional adjustments to the output format.

Best regards,


💡 改进说明:

  1. 引入了新列 KEGG_ko 的解释:明确告诉共同作者,为了增加透明度,我们特意加上了这一列,让他能直接看到底是哪几个 KO 参与了计算。
  2. 使用了新文件中的真实数据作为例子:直接引用了 ko03070(2个KO对应1个基因)和 ko05111(3个KO对应2个基因)的具体数据(K02456/K02457 等),这比纯理论解释更有说服力,直接证明了我们没有算错,也没有解析错误。
  3. 语气专业且自信:明确界定这不是 “annotation parsing error”,而是 KEGG 数据库本身的 “many-to-many” 生物学特性,展现了您对生信分析流程的透彻理解。

Host and Viral Temporal Transcriptomics of VZV Infection in Human Skin Organoids (Data_Nina_RNAseq_2026)

draw_3D.py.zip

complete_deg_pipeline.R

  1. Preparing samplesheet.csv

         sample,fastq_1,fastq_2,strandedness
         control_r1,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R1.fastq.gz,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R2.fastq.gz,auto
         control_r2,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R1.fastq.gz,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R2.fastq.gz,auto
         control_r3,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R1.fastq.gz,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R2.fastq.gz,auto
         VZV.d10_r1,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R1.fastq.gz,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R2.fastq.gz,auto
         VZV.d10_r2,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R1.fastq.gz,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R2.fastq.gz,auto
         VZV.d10_r3,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R1.fastq.gz,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R2.fastq.gz,auto
         VZV.d15_r1,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R1.fastq.gz,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R2.fastq.gz,auto
         VZV.d15_r2,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R1.fastq.gz,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R2.fastq.gz,auto
         VZV.d15_r3,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R1.fastq.gz,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R2.fastq.gz,auto
         VZV.d20_r1,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R1.fastq.gz,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R2.fastq.gz,auto
         VZV.d20_r2,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R1.fastq.gz,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R2.fastq.gz,auto
         VZV.d20_r3,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R1.fastq.gz,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R2.fastq.gz,auto
  2. DO NOT RUN Trimmomatic before Nextflow.

         It is highly recommended to let the nf-core/rnaseq pipeline handle all trimming internally. Pre-trimming your data can break the pipeline's internal file tracking, interfere with the UMI extraction step, and make reproducibility harder.
         The nf-core/rnaseq pipeline has built-in support for both UMI extraction and read trimming, and it executes them in the correct order: it will extract the UMIs from the raw Read 1 first, and then trim the error-prone bases from Read 2.
         Here is how you should adjust your approach and your Nextflow command.
    
         不应该在运行 Nextflow 之前手动使用 Trimmomatic。
         强烈建议让 nf-core/rnaseq 流程在内部处理所有的修剪(trimming)工作。提前手动修剪数据可能会破坏流程内部的文件追踪机制,干扰 UMI 的提取步骤,并降低结果的可重复性。
         nf-core/rnaseq 流程内置了对 UMI 提取和 reads 修剪的支持,并且会按照正确的顺序执行:它会首先从原始的 Read 1 中提取 UMI,然后再修剪 Read 2 中容易出错的碱基。
  3. Nextflow running

         #yes, exactly, the Reads contain UMI: the first 12 nt of Read1.
         #The CORALL v2 Kit from Lexogen was used for library prep. According to the manufacturer it can additionally be beneficial for the mapping to trim the first 10 nt of Read2 as well, as this site is more error prone (due to priming they say).
         #If you need more information from my side, let me know!
    
         方法一:正则表达式法 (Regex) —— 你目前选择的方案
         根据你提供的文档,使用 regex 方法时,必须使用特定的命名捕获组。文档中明确指出:
         "The expected groups in the regex are: umi_n = UMI positions, where n can be any value (required)"
         这意味着,捕获 UMI 的组名必须是 umi_ 加上一个数字(例如 umi_1),而不能只是 umi。
         正确的正则表达式:'^(?P
    .{12}).*’ 原理解释: ^:匹配字符串的开头。 (?P .{12}):捕获前 12 个任意字符,并将其命名为 umi_1(这将作为 UMI 被提取并添加到 read name 中)。 .*:匹配并保留剩余的序列(文档说明:未被命名为 discard 的其他部分会被重新附加到 read 序列上)。 方法二:字符串法 (String) —— 更简单直观的替代方案 文档中提到: “string: This should be used where the barcodes are always in the same place in the read. N = UMI position” 因为你的 UMI 长度固定(12nt)且位置固定(永远在 Read 1 开头),你完全可以使用更简单的 string 方法,用 N 来代表 UMI 的位置。 对应的 Nextflow 参数: –umitools_extract_method string –umitools_bc_pattern NNNNNNNNNNNN (正好 12 个 N) # OLD_COMMANDS_DEPRECATED: under sage in the early running # ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq # nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results_GRCh38 –genome GRCh38 –with_umi –umitools_extract_method regex –umitools_bc_pattern ‘^(?P.{12}).*’ -profile docker -resume –max_cpus 54 –max_memory 120.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –pseudo_aligner salmon –umitools_grouping_method unique # #Debug the following error: added “–minAssignedFrags 0 \\” to modules/nf-core/salmon/quant/main.nf option “salmon quant” and added “–min_mapped_reads 0” in the nextflow command #nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \ nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf –help nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \ –input samplesheet.csv \ –outdir results_GRCh38 \ –genome GRCh38 \ –with_umi \ –umitools_extract_method regex \ –umitools_bc_pattern ‘^(?P .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon \ –pseudo_aligner salmon \ –umitools_grouping_method unique 关键执行逻辑(让你完全放心): 第一步 (umi_tools):流程会读取原始的 Read 1,根据 ^(?P .{12}).* 提取前 12 个碱基作为 UMI,并将其添加到 read 的名称中(例如 @READ_ID_UMISEQUENCE)。此时 Read 1 的序列中不再包含这 12 个碱基。 第二步 (fastp):流程接着运行 fastp。由于你设置了 –clip_r2 10,fastp 会精准地切除 Read 2 的前 10 个碱基(解决 Lexogen CORALL v2 的引物错误问题)。 互不干扰:umi_tools 只处理 Read 1 的 UMI,fastp 负责全局的质控和 Read 2 的修剪。你绝对不需要在运行 Nextflow 之前手动使用 Trimmomatic。
  4. To verify that fastp has successfully trimmed the first 10 nucleotides from Read 2, you can use any of the following three methods.

         *(Note: First, ensure that the `FASTP` step in your Nextflow terminal has actually finished running and shows a green checkmark ✔).*
    
         ### Method: Manually Measure the Sequence Length (Raw Data Proof)
         Because you did not use the `--save_trimmed` flag, the trimmed FASTQ files are not saved in your `results/` folder. However, they still exist temporarily in the `work/` directory. You can measure the exact length of the first read to prove 10bp was cut.
    
         *Note: Since you used `--with_umi`, the input file for fastp is the output from UMI-tools, so the filename will likely contain `umi_extract` instead of `raw`.*
    
         1. Navigate into the `work/` directory for the `FASTP` step:
         ```bash
         cd work/fc/def4de*/
         cd work/a1/b2c3d4*/
         ```
         2. List the files to see the exact naming convention:
         ```bash
         ls -lh *2.fastq.gz
         ```
         *(You will likely see an input file like `*.umi_extract_2.fastq.gz` and an output file like `*.fastp.trimmed_2.fastq.gz` or `*.trim_2.fastq.gz`).*
         3. **Measure the length of the first sequence in the INPUT Read 2:**
         ```bash
         zcat *.umi_extract_1.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat *.umi_extract_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #154
         ```
         *(Let's say this outputs `150`)*
         4. **Measure the length of the first sequence in the OUTPUT Read 2:**
         ```bash
         zcat *.trimmed_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_1.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_2.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #144
         ```
         *(This should output exactly `140`)*.
    
         If the output length is exactly 10 bases shorter than the input length, the trimming was 100% successful.
    
         ---
    
         ### 💡 Pro Tip for Future Runs
         If you want to easily inspect the trimmed FASTQ files in your `results/` folder without digging through the `work/` directory next time, simply add **`--save_trimmed`** to your `nextflow run` command. This will force the pipeline to copy the post-fastp FASTQ files to your main results directory.
  5. Nextflow running on virus

         # OLD_COMMANDS_DEPRECATED: under sage in the early running
         # The virus-referenced commands!
         # nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_chrHsv1 --fasta chrHsv1_s17.fasta --gtf chrHsv1_s17.gtf --with_umi --umitools_extract_method regex --umitools_bc_pattern '^(?P.{12}).*' --umitools_dedup_stats -profile test_full -resume --max_memory 256.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0
    
             #This error occurs because your GenBank file was likely exported from SnapGene (as seen in the COMMENT section of the file). SnapGene sometimes formats the date in the LOCUS line as DD-MM-YYYY (e.g., 13-06-2025), but Biopython's parser strictly expects the standard NCBI GenBank format, which requires a 3-letter month abbreviation like DD-MMM-YYYY (e.g., 13-JUN-2025).
             sed -i 's/13-06-2025/13-JUN-2025/' BB1528_nanopore_consensus.gb
    
             python3 gb_to_fasta_gtf_v2.py BB1528_nanopore_consensus.gb BB1528_nanopore_consensus.fasta BB1528_nanopore_consensus.gtf BB1528    # Generate the last two files in the command line
             python fix_gtf_to_hsv1_format.py    # Generate BB1528_final.gtf
    
             nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
             --input samplesheet_virus.csv \
             --outdir results_BB1528 \
             --fasta BB1528_nanopore_consensus.fasta \
             --gtf BB1528_final.gtf \
             --with_umi \
             --umitools_extract_method regex \
             --umitools_bc_pattern '^(?P
    .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon –pseudo_aligner salmon \ –gtf_extra_attributes gene_id \ –gtf_group_features transcript_id \ –featurecounts_group_type gene_id \ –featurecounts_feature_type transcript \ –skip_rseqc –skip_dupradar –skip_preseq –skip_biotype_qc –skip_deseq2_qc –skip_multiqc \ –umitools_grouping_method unique –min_mapped_reads 0 # —- DEBUG_LOG —- Good news: Your GTF and FASTA files are perfectly formatted! We can prove this from your log: [info] Index contained 78 targets This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct. What is causing the error? The error is purely a biological/sample-specific issue: [warning] salmon was only able to assign 9 fragments to transcripts in the index, but the minimum number of required assigned fragments (–minAssignedFrags) was 10. The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (–libType=A). Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus. Salmon has a built-in safety threshold (–minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step in the three control samples! Later on, we can manually set all counts for all virus transcripts in the three control-conditions as 0.
  6. A small BUG in both nextflow runs: versions.yml

         vim /mnt/md1/DATA/Data_Nina_RNAseq_2026/work/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml
         BUG_LINES:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-4ht1kmld because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
         1.1.4
         SHOULD_BE:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: 1.1.4
    
         ALTERNATIVE: Not correct the bug, copy the version log to mail directory:
             cp work_GRCh38_DEL/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml results_GRCh38
             cp work/bf/4a1e941bc32b97de1bc8f4fd47df79/collated_versions.yml results_BB1528
  7. import data and pca-plot

         # Import the required libraries
         library("AnnotationDbi")
         library("clusterProfiler")
         library("ReactomePA")
         library(gplots)
         library(tximport)
         library(DESeq2)
         library("org.Hs.eg.db")
         library(dplyr)
         library(tidyverse)
    
         # ------ For human genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("control_r1" = "./control_r1/quant.sf",
                 "control_r2" = "./control_r2/quant.sf",
                 "control_r3" = "./control_r3/quant.sf",
                 "VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # ------ For virus genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # MANUALLY set all counts for all virus transcripts in the three control-conditions ("control_r1","control_r2","control_r3") as 0 using Excel
    
         # ------ Merge the raw counts of human and microbe ------
         #cat ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv > merged_gene_counts.csv
         #DELETE the second line: "","control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"
    
         # 1. Remove the hidden ^M (Windows carriage returns) from both files
         sed -i 's/\r$//' ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv
         # 2. Extract the header from the first file to start the new merged file
         head -n 1 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv > merged_gene_counts.csv
         # 3. Append the data from the first file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv >> merged_gene_counts.csv
         # 4. Append the data from the second file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv >> merged_gene_counts.csv
    
         # Should be 60683 lines
         #~/Tools/csv2xls-0.4/csv_to_xls.py merged_gene_counts.csv -d',' -o raw_gene_counts.xls;
    
         # -- for merged analysis due to false normalization factors wenn alone analyzed on virus data --
         setwd("~/DATA/Data_Nina_RNAseq_2026/")
         d.raw <- read.csv("merged_gene_counts.csv", header=TRUE, row.names=1)
    
         # Re-define the replicates and condition for the merged counts
         rownames <- factor(c("control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"))
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         colData <- data.frame(condition=condition, replicate=replicate, row.names=rownames)
    
         #CORRECTED: dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+replicate)
         #⚠️ A Note on the Design Formula: In the snippet, I used design = ~condition+replicate. In DESeq2, adding replicate (r1, r2, r3) to the design formula tells the model to treat "r1" as a globally systematic batch effect across all conditions. Unless the replicates were processed in distinct batches (e.g., all r1 on Monday, all r2 on Tuesday), it is standard to use design = ~ condition to simply measure the biological variance. Therefore, I have updated the script to use ~ condition.
         dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition)
         dim(counts(dds))
         head(counts(dds), 10)
    
         rld <- rlogTransformation(dds)
    
         #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
         #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
         #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
         #So, the typical workflow is:
         #  - Create the DESeqDataSet object.
         #  - Use estimateSizeFactors to normalize for library size.
         #  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
         #  - Use DESeq for the differential expression analysis.
         #  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    
         # draw simple pca and heatmap
         library(gplots)
         library("RColorBrewer")
         #mat <- assay(rld)
         #mm <- model.matrix(~condition, colData(rld))
         #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
         #assay(rld) <- mat
         # -- pca --
         png("pca.png", 1200, 800)
         plotPCA(rld, intgroup=c("condition"))
         dev.off()
         # -- heatmap --
         png("heatmap.png", 1200, 800)
         distsRL <- dist(t(assay(rld)))
         mat <- as.matrix(distsRL)
         hc <- hclust(distsRL)
         hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
         heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
         dev.off()
  8. draw 3D PCA plots.

         library(gplots)
         library("RColorBrewer")
    
         library(ggplot2)
         data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
         write.csv(data, file="plotPCA_data.csv")
         #calculate all PCs including PC3 with the following codes
         library(genefilter)
         ntop <- 500
         rv <- rowVars(assay(rld))
         select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
         mat <- t( assay(rld)[select, ] )
         pc <- prcomp(mat)
         pc$x[,1:3]
         #df_pc <- data.frame(pc$x[,1:3])
         df_pc <- data.frame(pc$x)
         identical(rownames(data), rownames(df_pc)) #-->TRUE
    
         data$PC1 <- NULL
         data$PC2 <- NULL
         merged_df <- merge(data, df_pc, by = "row.names")
         #merged_df <- merged_df[, -1]
         row.names(merged_df) <- merged_df$Row.names
         merged_df$Row.names <- NULL  # remove the "name" column
         merged_df$name <- NULL
         merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","group","condition","replicate")]
         write.csv(merged_df, file="merged_df_12PCs.csv")
         summary(pc)
         #                           PC1     PC2     PC3
         #Proportion of Variance  0.5969  0.1732 0.09444
    
         python3 draw_3D.py
         # Edit on the generated SVG-figure.
         #/usr/bin/convert PCA_3D.png -crop 2900x1600+250+700 PCA_3D_cropped.png
  9. (Optional) estimate size factors and dispersion values.

         #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
         #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
         #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
         sizeFactors(dds)
         #NULL
         # Estimate size factors
         dds <- estimateSizeFactors(dds)
         # Estimate dispersions
         dds <- estimateDispersions(dds)
         #> sizeFactors(dds)
         #control_r1 control_r2 control_r3 VZV.d10_r1 VZV.d10_r2 VZV.d10_r3 VZV.d15_r1
         #1.2072182  1.1009757  0.8610196  1.1965615  0.8575169  1.1714809  0.8344160
         #VZV.d15_r2 VZV.d15_r3 VZV.d20_r1 VZV.d20_r2 VZV.d20_r3
         #0.5278583  1.1768657  1.1006353  1.1288974  1.2630530
    
         # If alone with virus data, the following BUG occured:
         #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                             HeLa_TO_r1                      HeLa_TO_r2
                             0.9978755                       1.1092227
         data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
         #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    
         1/0.9978755=1.002129023
         1/1.1092227=
    
         #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
    
         raw_counts <- counts(dds)
         normalized_counts <- counts(dds, normalized=TRUE)
         #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
         #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
         #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
         estimSf <- function (cds){
             # Get the count matrix
             cts <- counts(cds)
    
             # Compute the geometric mean
             geomMean <- function(x) prod(x)^(1/length(x))
    
             # Compute the geometric mean over the line
             gm.mean  <-  apply(cts, 1, geomMean)
    
             # Zero values are set to NA (avoid subsequentcdsdivision by 0)
             gm.mean[gm.mean == 0] <- NA
    
             # Divide each line by its corresponding geometric mean
             # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
             # MARGIN: 1 or 2 (line or columns)
             # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
             # FUN: the function to be applied
             cts <- sweep(cts, 1, gm.mean, FUN="/")
    
             # Compute the median over the columns
             med <- apply(cts, 2, median, na.rm=TRUE)
    
             # Return the scaling factor
             return(med)
         }
         #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
         #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
         #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
         #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
         #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
         #DESeq2’s median of ratios [1]
         #EdgeR’s trimmed mean of M values (TMM) [2]
         #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
         test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
         sum(test_normcount != normalized_counts)
  10. Merged Host-Virus Counts -> DESeq2 -> DEGs of 6 Comparisons

         Rscript complete_deg_pipeline.R
    
         # Key changes made:
         #     1. Virus gene identification: Added detection of virus genes using pattern matching ("^ORF|cat_RNA|redF_RNA|repE_RNA|sopA_RNA|sopB_RNA|UL45_RNA")
         #     2. Updated significance cutoff: Changed from |log2FC| >= 1 to |log2FC| >= 2
         #     3. Color scheme:
         #        * Green for virus RNAs (all virus genes)
         #        * Dark green for significantly regulated virus genes
         #        * Red for significantly up-regulated host genes
         #        * Blue for significantly down-regulated host genes
         #        * Gray for non-significant host genes
         #     4. Enhanced legend: The legend now clearly distinguishes between virus RNAs and host genes with their regulation status
         #     5. Summary statistics: Added detailed output showing counts of virus vs host genes and their regulation status
    
         #  ================================================================================
         # 📊 FINAL SUMMARY OF ALL 6 COMPARISONS
         #                 name total  up down virus sig_total pct_sig
         # 03_VZV.d20_vs_control 17594 193  220    77       413     2.3
         # 02_VZV.d15_vs_control 17594 139   94    77       233     1.3
         # 05_VZV.d20_vs_VZV.d10 17594  35  114    77       149     0.8
         # 01_VZV.d10_vs_control 17594  89   20    77       109     0.6
         # 06_VZV.d20_vs_VZV.d15 17594  28   38    77        66     0.4
         # 04_VZV.d15_vs_VZV.d10 17594   6   13    77        19     0.1

GOOD_QUESTIONS

  1. Validation of Mutant-Specific Effects The additional methodological details on mutant construction and genome validation are appreciated. However, the study still lacks functional complementation or equivalent validation to demonstrate that the observed phenotypes are directly attributable to loss of the targeted efflux pumps rather than secondary regulatory consequences. Reliance on knockout mutants alone limits confidence in causal interpretation, particularly for broad transcriptomic and stress-response phenotypes. 突变体特异性效应的验证 感谢作者补充了关于突变体构建和基因组验证的额外方法学细节。然而,本研究仍缺乏功能回补或等效的验证手段,来证明所观察到的表型是直接由靶向外排泵的缺失引起的,而非继发性的调控后果。仅依赖敲除突变体限制了因果推断的可靠性,尤其是对于广泛的转录组与应激反应表型而言。

93只恒生指数成分股名单

根据维基百科和恒生指数公司的最新信息,从2026年6月8日起,恒生指数成分股已增加至93只[[2]]。以下是完整的93只恒生指数成分股名单,按行业分类:

金融业(10只成份股)

  1. 汇丰控股 (0005)
  2. 友邦保险 (1299)
  3. 建设银行 (0939)
  4. 工商银行 (1398)
  5. 香港交易所 (0388)
  6. 中国平安 (2318)
  7. 中国银行 (3988)
  8. 中国人寿 (2628)
  9. 招商银行 (3968)
  10. 中银香港 (2388)

非必需性消费及必需性消费(30只成份股)

  1. 阿里巴巴-SW (9988)
  2. 美团-W (3690)
  3. 比亚迪股份 (1211)
  4. 京东集团-SW (9618)
  5. 百度集团-SW (9888)
  6. 创科实业 (0669)
  7. 快手-W (1024)
  8. 吉利汽车 (0175)
  9. 泡泡玛特 (9992)
  10. 安踏体育 (2020)
  11. 携程集团-S (9961)
  12. 农夫山泉 (9633)
  13. 理想汽车-W (2015)
  14. 万洲国际 (0288)
  15. 银河娱乐 (0027)
  16. 港铁公司 (0066)
  17. 美的集团 (0300)
  18. 蒙牛乳业 (2319)
  19. 海尔智家 (6690)
  20. 华润啤酒 (0291)
  21. 李宁 (2331)
  22. 申洲国际 (2313)
  23. 金沙中国 (1928)
  24. 老铺黄金 (6181)
  25. 新东方-S (9901)
  26. 海底捞 (6862)
  27. 康师傅控股 (0322)
  28. 周大福 (1929)
  29. 恒安国际 (1044)
  30. 百威亚太 (1876)

资讯科技业(6只成份股)

  1. 腾讯控股 (0700)
  2. 小米集团-W (1810)
  3. 中芯国际 (0981)
  4. 网易-S (9999)
  5. 联想集团 (0992)
  6. 比亚迪电子 (0285)

能源业、原材料业、工业及综合事业(18只成份股)

  1. 中国海洋石油 (0883)
  2. 中国石油股份 (0857)
  3. 紫金矿业 (2899)
  4. 长和 (0001)
  5. 中国神华 (1088)
  6. 宁德时代 (3000)
  7. 中国宏桥 (1378)
  8. 中国石油化工股份 (0386)
  9. 中信股份 (0267)
  10. 中通快递 (2057)
  11. 洛阳钼业 (3993)
  12. 极兔速递 (1519) 新增
  13. 舜宇光学科技 (2382)
  14. 中国铝业 (2600) 新增
  15. 京东物流 (2618)
  16. 信义玻璃 (0868)
  17. 东方海外国际 (0316)
  18. 信义光能 (0968)

电讯业及公用事业(9只成份股)

  1. 中国移动 (0941)
  2. 中电控股 (0002)
  3. 电能实业 (0006)
  4. 香港中华煤气 (0003)
  5. 中国电信 (0728)
  6. 中国联通 (0762)
  7. 新奥能源 (2688)
  8. 华润电力 (0836)
  9. 长江基建集团 (1038)

地产建筑业(10只成份股)

  1. 新鸿基地产 (0016)
  2. 华润置地 (1109)
  3. 领展房产基金 (0823)
  4. 长实集团 (1113)
  5. 中国海外发展 (0688)
  6. 恒基地产 (0012)
  7. 九龙仓置业 (1997)
  8. 华润万象生活 (1209)
  9. 龙湖集团 (0960)
  10. 恒隆地产 (0101)

医疗保健业(10只成份股)

  1. 百济神州 (6160) 新增
  2. 信达生物 (1801)
  3. 药明生物 (2269)
  4. 药明康德 (2359)
  5. 石药集团 (1093)
  6. 中国生物制药 (1177)
  7. 翰森制药 (3692)
  8. 京东健康 (6618)
  9. 阿里健康 (0241)
  10. 国药控股 (1099)

最新变动: 在2026年5月22日的季度检讨中,恒生指数公司宣布新增3只蓝筹股:极兔速递(1519)中国铝业(2600)百济神州(6160),令恒指成分股数目由90只增加至93只,并于2026年6月8日起生效[[3]][[7]]。



在港股市场中,股票名称后面的 -W-S 甚至 -SW-B 等字母后缀,是港交所为了区分公司的特殊股权结构或上市类型而设置的“身份标签”。

这能帮助投资者一眼看出这家公司的治理结构或上市背景。具体含义如下:

1. -W:同股不同权 (Weighted Voting Rights)

  • 全称:Weighted Voting Rights (WVR)。
  • 含义:指这家公司采用了“同股不同权”的架构。在这种架构下,公司的股票分为普通股份和特别表决权股份。普通投资者买到的股票(1股只有1票投票权),而创始人或管理层持有的特别股份(1股拥有多票投票权,比如1股抵10票)。
  • 目的:确保创始团队在公司经过多轮融资、自身股权被稀释的情况下,依然能牢牢掌握公司的控制权和经营决策权(这在互联网和科技公司中非常常见)。
  • 代表公司:美团-W (3690)、小米集团-W (1810)。

2. -S:第二上市 (Secondary Listing)

  • 全称:Secondary Listing。
  • 含义:指这家公司已经在海外(如美国纽交所、纳斯达克)完成了“主要上市”,然后再次来到港交所发行股票并挂牌交易。
  • 目的:这类公司通常是为了拓宽融资渠道、吸引亚洲及大中华区的投资者、提升股票流动性,或者为了规避单一海外市场的监管风险(近年来很多中概股回港都属于此类)。
  • 代表公司:百度集团-S (9888)。
  • (注:如果一家第二上市的公司后来申请转为“双重主要上市”,它的 -S 标记通常会被取消。)

3. 组合后缀:-SW

  • 含义:如果您看到像 阿里巴巴-SW (9988)京东集团-SW (9618) 这样的名字,说明这家公司既是第二上市(-S),又采用了同股不同权架构(-W)

补充:其他常见后缀

除了 -W-S,您在港股名单中可能还会看到:

  • -B:代表 Biotech(生物科技公司)。
    • 含义:指这家公司是未通过港交所常规盈利测试、但具备核心研发能力的生物科技公司。因为还在研发阶段,可能尚未实现盈利,带有 -B 是交易所特意提醒投资者注意其未盈利的风险。
    • 代表公司:歌礼制药-B (1672)。

总结: 这些字母后缀只是交易所的提示性标记,相当于公司的“特殊身份证”,它们不会影响股票的正常交易,也不直接代表公司基本面的好坏。普通投资者在买卖时,只需知道它们代表了特殊的股权或上市背景即可。



不是的,比亚迪并非只在香港上市。它是一家典型的 “A+H”两地上市的公司,同时在中国大陆的A股市场和香港的H股市场挂牌交易。

具体情况如下:

1. A股(中国大陆)

  • 上市地点:深圳证券交易所(深交所)
  • 股票简称:比亚迪
  • 股票代码002594
  • 上市时间:2011年6月(比亚迪最早是在A股上市的,当时发行的是中小板,后来中小板与主板合并)。

2. H股(中国香港)

  • 上市地点:香港联合交易所(港交所)
  • 股票简称:比亚迪股份
  • 股票代码01211(在港股软件中常显示为 1211)
  • 上市时间:2002年7月(比亚迪最早其实是先在港股上市的,比A股早了将近9年)。

为什么会产生“只在港股上市”的错觉?

在您之前看到的“恒生指数成分股”名单中,列出的是 “比亚迪股份 (1211)”。因为恒生指数是港股的指数,所以名单里只能收录它在港股上市的代码和名称。这可能会让人误以为它只有港股。

A股和H股是什么关系?

  • 同一家公司:002594(A股)和 01211(H股)代表的是完全同一家公司(比亚迪股份有限公司),享有同等的股东权益(同股同权)。
  • 价格不同:因为A股和H股是在两个完全独立的市场(大陆市场和香港市场)进行交易,面对的投资者群体、资金流动性、市场情绪都不同,所以两者的股价(折算成同一种货币后)通常是不一样的。这就产生了所谓的“A/H股溢价”。

补充:在其他地方上市了吗?

除了中国大陆(A股)和中国香港(H股)之外,比亚迪没有在美国(如纳斯达克、纽交所)或欧洲(如伦敦交易所)等其他海外资本市场上市。

总结:如果您在大陆的股票软件里搜索“比亚迪”(002594),或者在港股软件里搜索“比亚迪股份”(01211),买到的都是同一家公司的股票,只是交易的币种(人民币 vs 港币)和所在的市场不同而已。



这个问题需要分两个层面来回答:

如果按大众认知和实际市场主导地位来看,美国主要有 2个 核心股票交易所; 但如果按美国证券交易委员会(SEC)的官方注册来看,美国目前共有 16个 全国性的合法证券交易所。

为了让您清晰了解,我们可以将它们分为“两大巨头”、“其他专业交易所”和“场外市场”三个梯队:

第一梯队:两大绝对巨头(占据 99% 以上的市值和交易量)

这是普通投资者最熟悉、也是全球最著名的两家交易所。绝大多数知名的美国上市公司都在这两家挂牌:

  1. 纽约证券交易所 (NYSE – New York Stock Exchange)
    • 特点:历史最悠久(1792年成立),位于华尔街11号。采用“专家撮合”制度(虽然现在很多也电子化了),门槛极高。
    • 代表公司:主要是传统行业巨头、超级蓝筹股、大型金融机构。比如:伯克希尔·哈撒韦、沃尔玛、可口可乐、摩根大通、强生等。
  2. 纳斯达克 (Nasdaq)
    • 特点:成立于1971年,是全球第一个全电子化的股票交易市场。没有实体的交易大厅,门槛相对纽交所略低,对科技和创新企业非常友好。
    • 代表公司:全球顶尖的科技巨头、成长型公司。比如:苹果、微软、英伟达、特斯拉、亚马逊、Meta(脸书)、谷歌等。

第二梯队:其他全国性证券交易所(共 14 个左右)

除了纽交所和纳斯达克,美国SEC还注册了十几个其他的证券交易所。这些交易所普通散户平时很少直接感知,它们主要处理机构订单、提供流动性或主打特定理念。 它们包括:

  • 纽交所旗下的其他交易所
    • NYSE Arca: originally 太平洋证券交易所,现在是ETF(交易型开放式指数基金) 交易的绝对主力。
    • NYSE American:原美国证券交易所(AMEX),现在主要面向中小型市值公司和初创企业。
    • NYSE Chicago / NYSE National:主要处理部分电子订单,提供交易通道。
  • 芝加哥期权交易所 (Cboe) 旗下的股票交易所
    • Cboe 本身是期权交易巨头,但它旗下也拥有 4 个股票交易所(Cboe BZX, BYX, EDGA, EDGX)。很多券商(如Robinhood、盈透)会把散户的订单路由到这些交易所去撮合。
  • 特色/独立交易所
    • IEX (Investors Exchange):因畅销书《闪电侠》而闻名。它的特色是故意设置一个微小的“减速带”(350微秒的延迟),用来防止高频交易机构利用速度优势“插队”收割普通投资者。
    • MEMX (Members Exchange):由华尔街多家大型券商(如嘉信理财、摩根士丹利等)联合出资成立的交易所,目的是为了打破纽交所和纳斯达克的垄断,降低交易手续费
    • LTSE (Long-Term Stock Exchange):长期证券交易所,由硅谷著名风投家提出,旨在奖励长期持有的股东(比如持股时间越长,投票权越大),对抗华尔街的短期逐利行为。

补充:场外交易市场 (OTC – Over-The-Counter)

除了上述正规的“交易所”,美国还有一个庞大的场外交易市场(严格意义上它不是交易所,而是一个交易商网络)。

  • OTC Markets:大家常听说的 “粉单市场” (Pink Sheets) 就在这里。
  • 特点:这里上市的公司通常达不到纽交所或纳斯达克的财务标准。里面充斥着仙股(几毛钱甚至几分钱的股票)、退市公司、破产重组公司,以及一些不想承担高昂合规成本的微型企业。风险极高。

总结

  • 如果您问的是 “美国最主要的股票交易场所”,答案是 2个(纽交所和纳斯达克)。
  • 如果您问的是 “美国官方认可的证券交易所数量”,答案是 16个。美国之所以有这么多交易所,是因为美国有一个“全国市场系统(NMS)”,允许各家交易所通过降低手续费、提供更快的网速或更公平的机制来互相竞争,从而让整体的交易成本保持在极低的水平。


以下是截至2026年6月最新的纳斯达克100指数(NASDAQ-100)完整成分股列表及权重[[1]]。

重要说明:虽然指数名为“100”,但由于Alphabet(谷歌母公司) 同时发行了A类(GOOGL)和C类(GOOG)两种股票,这两类股票均被单独计入指数,因此下表实际包含101个条目,代表的是100家公司。权重数据基于各公司市值在指数总市值中的占比动态计算[[1]]。

纳斯达克100指数完整成分股及权重表

排名 公司中文名称 公司英文名称 股票代码 权重
1 英伟达 Nvidia Corp NVDA 12.74%
2 苹果 Apple Inc. AAPL 11.11%
3 微软 Microsoft Corp MSFT 7.19%
4 亚马逊 Amazon.Com Inc AMZN 6.55%
5 谷歌A类股 Alphabet Inc. Class A GOOGL 5.83%
6 谷歌C类股 Alphabet Inc. Class C GOOG 5.44%
7 博通 Broadcom Inc. AVGO 4.89%
8 特斯拉 Tesla, Inc. TSLA 3.80%
9 Meta (Facebook母公司) Meta Platforms, Inc. META 3.70%
10 美光科技 Micron Technology, Inc. MU 3.12%
11 沃尔玛 Walmart Inc. WMT 2.39%
12 超威半导体 (AMD) Advanced Micro Devices AMD 2.19%
13 阿斯麦 (ASML) ASML Holding NV ASML 1.88%
14 英特尔 Intel Corp INTC 1.67%
15 泛林集团 Lam Research Corp LRCX 1.25%
16 应用材料 Applied Materials Inc AMAT 1.25%
17 思科 Cisco Systems, Inc. CSCO 1.19%
18 Arm控股 Arm Holdings plc ARM 1.18%
19 开市客 (Costco) Costco Wholesale Corp COST 1.09%
20 奈飞 (Netflix) NetFlix Inc NFLX 0.83%
21 科磊 (KLA) KLA Corporation KLAC 0.83%
22 帕兰泰尔 (Palantir) Palantir Technologies Inc. PLTR 0.80%
23 闪迪 (SanDisk) Sandisk Corporation SNDK 0.77%
24 德州仪器 Texas Instruments Incorporated TXN 0.71%
25 美满电子 (Marvell) Marvell Technology, Inc. MRVL 0.68%
26 西部数据 Western Digital Corp. WDC 0.66%
27 希捷科技 Seagate Technology Holdings PLC STX 0.64%
28 林德气体 Linde plc LIN 0.61%
29 高通 Qualcomm Inc QCOM 0.59%
30 派拓网络 Palo Alto Networks, Inc. PANW 0.59%
31 亚德诺半导体 (ADI) Analog Devices, Inc. ADI 0.52%
32 T-Mobile T-Mobile US, Inc. TMUS 0.50%
33 百事可乐 PepsiCo, Inc. PEP 0.49%
34 安进 Amgen Inc AMGN 0.47%
35 CrowdStrike CrowdStrike Holdings, Inc. CRWD 0.44%
36 AppLovin Applovin Corporation APP 0.41%
37 吉利德科学 Gilead Sciences Inc GILD 0.40%
38 霍尼韦尔 Honeywell International, Inc. HON 0.37%
39 直观外科 Intuitive Surgical Inc. ISRG 0.37%
40 Shopify Shopify Inc. SHOP 0.36%
41 Booking Holdings Booking Holdings Inc. BKNG 0.34%
42 福泰制药 Vertex Pharmaceuticals Inc VRTX 0.29%
43 星巴克 Starbucks Corp SBUX 0.29%
44 拼多多 PDD Holdings Inc. PDD 0.29%
45 楷登电子 Cadence Design Systems CDNS 0.28%
46 飞塔 (Fortinet) Fortinet, Inc. FTNT 0.27%
47 万豪国际 Marriott International MAR 0.27%
48 星座能源 Constellation Energy Corporation CEG 0.24%
49 怪物饮料 Monster Beverage Corporation MNST 0.23%
50 新思科技 Synopsys Inc SNPS 0.23%
51 自动数据处理公司 (ADP) Automatic Data Processing ADP 0.22%
52 CSX运输 CSX Corporation CSX 0.22%
53 爱彼迎 Airbnb, Inc. ABNB 0.21%
54 美卡多 Mercado Libre, Inc MELI 0.21%
55 康卡斯特 Comcast Corp CMCSA 0.21%
56 Datadog Datadog, Inc. DDOG 0.20%
57 亿滋国际 Mondelez International, Inc. MDLZ 0.20%
58 奥多比 (Adobe) Adobe Inc. ADBE 0.20%
59 恩智浦半导体 NXP Semiconductors N.V. NXPI 0.20%
60 罗斯百货 Ross Stores Inc ROST 0.19%
61 单片电源系统 (MPS) Monolithic Power Systems, Inc. MPWR 0.19%
62 奥莱利汽车 O’Reilly Automotive, Inc. ORLY 0.19%
63 Intuit Intuit Inc INTU 0.19%
64 DoorDash DoorDash, Inc. DASH 0.18%
65 美国电力公司 (AEP) American Electric Power Company, Inc. AEP 0.18%
66 Lumentum Lumentum Holdings Inc. LITE 0.18%
67 Cintas Cintas Corp CTAS 0.17%
68 华纳兄弟探索 Warner Bros. Discovery, Inc. WBD 0.17%
69 再生元制药 Regeneron Pharmaceuticals Inc REGN 0.16%
70 帕卡 (Paccar) Paccar Inc PCAR 0.16%
71 贝克休斯 Baker Hughes Company BKR 0.15%
72 微芯科技 Microchip Technology Inc MCHP 0.13%
73 响尾蛇能源 Diamondback Energy, Inc. FANG 0.13%
74 艺电 (EA) Electronic Arts Inc EA 0.13%
75 快扣 (Fastenal) Fastenal Co FAST 0.13%
76 卓越能源 Xcel Energy, Inc. XEL 0.13%
77 法罗里奥 (Ferrovial) Ferrovial N.V. FER 0.13%
78 爱克斯龙 (Exelon) Exelon Corporation EXC 0.12%
79 老道明货运 Old Dominion Freight Line ODFL 0.12%
80 爱德士实验室 (IDEXX) Idexx Laboratories Inc IDXX 0.11%
81 可口可乐欧洲太平洋 Coca-Cola Europacific Partners plc CCEP 0.11%
82 Take-Two Interactive Take-Two Interactive Software Inc TTWO 0.11%
83 Keurig Dr Pepper Keurig Dr Pepper Inc. KDP 0.11%
84 Strategy Inc (原MicroStrategy) Strategy Inc MSTR 0.11%
85 欧特克 (Autodesk) Autodesk Inc ADSK 0.10%
86 Alnylam制药 Alnylam Pharmaceuticals, Inc. ALNY 0.10%
87 贝宝 (PayPal) PayPal Holdings, Inc. PYPL 0.10%
88 Paychex Paychex Inc PAYX 0.09%
89 汤森路透 Thomson Reuters Corporation TRI 0.09%
90 Axon Enterprise Axon Enterprise, Inc. AXON 0.09%
91 柔佛科技 (Roper) Roper Technologies, Inc. ROP 0.08%
92 Workday Workday, Inc. WDAY 0.08%
93 GE医疗 GE HealthCare Technologies Inc. GEHC 0.07%
94 德康医疗 (DexCom) DexCom, Inc. DXCM 0.07%
95 Copart Copart Inc CPRT 0.07%
96 卡夫亨氏 The Kraft Heinz Company KHC 0.07%
97 Verisk Analytics Verisk Analytics, Inc. VRSK 0.06%
98 高知特 (Cognizant) Cognizant Technology Solutions CTSH 0.06%
99 Insmed Insmed, Inc. INSM 0.05%
100 Zscaler Zscaler, Inc. ZS 0.05%
101 特许通讯 Charter Comm Inc Del CL A New CHTR 0.04%

注:由于股票市值每日随市场波动,上述权重数据为实时动态估算值,仅供参考。

关于Betano送你80欧元优惠的“流水要求”(Umsatzbedingungen)

关于Betano送你80欧元、需要充值80欧元的这个优惠,这里的关键信息都整理好了:

💰 这80欧元能提现吗?

可以,但有条件。

这80欧元属于奖金(Bonusguthaben),不能直接提现。你需要先满足 “流水要求”(Umsatzbedingungen) 后才能提现。

具体要求是:你需要用 “充值金额+奖金总额” (即80+80=160欧元)下注,且总下注额达到这个数的5倍,也就是 800欧元

并且,这些投注最低赔率必须是1.65。只有满足这些条件后,奖金及其产生的盈利才会转为可提现的余额。

⏳ 有使用期限吗?

有,而且有两个不同的截止日期需要注意:

  • 完成流水要求的期限:你有90天的时间来完成上述800欧元的流水要求。如果90天后没完成,奖金和相关的盈利可能会被收回。
  • 激活后投注的期限:奖金发放后,也需要在一定时间内用于下注,通常是90天内。

📝 其他重要规则

  • 这是一项新用户优惠:通常只针对新注册用户。
  • 最低充值额:要获得这个100%的奖金,最低充值额是10欧元
  • 注意支付方式:并非所有支付方式都符合条件,例如使用 Skrill 充值可能无法获得此奖金。
  • 某些投注类型不算:不是所有投注都算在流水里,比如“双 chance”等特殊玩法可能不被计入。
  • 可能还有额外赠礼:这个优惠有时还附带一张20欧元的免费投注券(Freebet)。它通常有单独的7-14天有效期,赢了只给你净利润,不包括本金。

💎 总结

简单来说,Betano的这笔奖金可以提现,但必须在90天内,用不低于1.65的赔率,完成总计800欧元的投注额。

建议你在参与前,务必登录Betano官网,仔细阅读最新的 “奖金条款”(Bonusbedingungen),因为具体规则可能会有调整。