Monthly Archives: June 2026

HUMAnN 通路丰度计算方法详解 (Data_Tam_Metagenomics_2026_Soil)

总体流程

HUMAnN (HMP Unified Metabolic Analysis Network) 是 bioBakery 工作流中用于分析宏基因组功能的核心工具[[12]]。通路丰度的计算是一个多步骤的递归过程:

计算步骤:

  1. 基因家族丰度 → 2. 反应丰度 → 3. 通路丰度

详细计算原理

第1步:基因家族丰度(Gene Family Abundance)

从原始测序 reads 开始:

  • 使用 BLASTX 将 reads 比对到参考数据库(如 UniRef)
  • 根据比对质量、覆盖度、序列长度进行加权
  • 生成 RPK(Reads Per Kilobase)值

公式:

基因丰度 = Σ(比对权重) / 基因长度(kb)

其中每个 read 的总权重为 1.0,根据比对质量分配到多个基因匹配上[[9]]。


第2步:反应丰度(Reaction Abundance)

每个生化反应由一个或多个基因催化:

反应丰度 = Σ(催化该反应的所有基因丰度)

第3步:通路丰度(Pathway Abundance)

这是最关键的一步。通路包含多个反应,反应之间有不同的关系:

核心原则: 通路丰度由”最弱环节”(weakest link)决定[[1]]

计算方法:

  • 串联反应(必须全部存在):使用调和平均数(harmonic mean)
  • 并联反应(可选路径):使用最大值(max)
  • 可选反应:只有当其丰度大于必需反应的调和平均数时才计入[[1]]

最终通路丰度 = 通路中丰度最低的关键反应


具体示例

示例场景:糖酵解通路(Glycolysis)

假设糖酵解通路包含 5 个关键反应(R1-R5):

葡萄糖 → R1 → G6P → R2 → F6P → R3 → ... → 丙酮酸

基因-反应关系:

  • R1: 由基因 GK1 和 GK2 催化(冗余)
  • R2: 由基因 PGI 催化
  • R3: 由基因 PFK 催化
  • R4: 由基因 ALDO 催化
  • R5: 由基因 GAPDH 催化

测序后得到的基因丰度(RPK单位):

GK1:  8.0
GK2:  4.0
PGI:  10.0
PFK:  6.0
ALDO: 7.0
GAPDH: 5.0

计算步骤:

① 计算反应丰度:

R1 = GK1 + GK2 = 8.0 + 4.0 = 12.0  (冗余基因相加)
R2 = PGI = 10.0
R3 = PFK = 6.0
R4 = ALDO = 7.0
R5 = GAPDH = 5.0

② 计算通路丰度: 由于糖酵解是串联反应(所有步骤必须完成),使用”最弱环节”原则:

通路丰度 = min(R1, R2, R3, R4, R5)
          = min(12.0, 10.0, 6.0, 7.0, 5.0)
          = 5.0 RPK

解释: 该样本中糖酵解通路的丰度为 5.0 RPK,意味着”最弱环节”(R5/GAPDH)的覆盖度为 5.0。这表示通路中至少有 5.0 个”完整拷贝”的活性[[1]]。


归一化处理

为什么需要归一化?

原始 RPK 值受测序深度影响,不能直接跨样本比较[[1]]。

示例:

  • 样本 A:总 reads = 1000万,通路丰度 = 5.0 RPK
  • 样本 B:总 reads = 2000万,通路丰度 = 5.0 RPK

虽然都是 5.0 RPK,但样本 A 的相对丰度更高!

归一化方法:

CPM(Counts Per Million)或 RPKM:

归一化丰度 = (原始RPK / 总RPK) × 1,000,000

HUMAnN 输出的 pathabundance_relab.tsv 文件已经是归一化后的相对丰度[[11]]。


输出文件说明

HUMAnN 生成两个关键文件:

文件 含义 取值范围
pathcoverage.tsv 通路覆盖度(存在/缺失概率) 0-1
pathabundance.tsv 通路丰度(相对拷贝数) 0 及以上

覆盖度 vs 丰度:

  • 覆盖度 = 0.8:表示 80% 置信度该通路存在
  • 丰度 = 5.0:表示该通路约有 5 个完整拷贝(在归一化前)[[9]]

分层分析(Stratification)

HUMAnN 还能计算每个物种对通路的贡献:

总通路丰度 = 物种A贡献 + 物种B贡献 + 未分类贡献

例如:

糖酵解通路总丰度: 5.0 RPK
├─ Escherichia coli:  3.0 RPK (60%)
├─ Bacteroides fragilis: 1.5 RPK (30%)
└─ Unclassified: 0.5 RPK (10%)

这帮助您理解哪些微生物在驱动特定功能[[21]]。


关键要点总结

通路丰度 = 最弱关键反应的丰度(木桶原理)
单位:RPK(归一化前)或 CPM/RPKM(归一化后)
跨样本比较:必须使用归一化值(pathabundance_relab.tsv)[[15]]
解释:丰度 5.0 表示通路中最弱环节有 5.0 RPK 的覆盖度
准确性:考虑了基因冗余、反应关系、可选步骤等生物学复杂性


参考文献

  1. bioBakery Forum – Pathway abundance calculation [[1]]
  2. HUMAnN SOP – HMP Data Coordination Center [[9]]
  3. HUMAnN3 Documentation – Huttenhower Lab [[12]]
  4. bioBakery 3 publication (Nature Methods, 2021) [[13]]


HUMAnN Pathway Abundance 计算方法

基本原理

HUMAnN (HMP Unified Metabolic Analysis Network) 是 biobakery 工具集的一部分,用于从宏基因组数据中计算功能通路的丰度。

计算步骤

1. 基因家族定量 (Gene Family Quantification)

  • 首先将测序 reads 比对到 UniRef 基因家族数据库
  • 计算每个基因家族的丰度(reads per kilobase, RPK)

2. 通路映射 (Pathway Mapping)

  • 将基因家族映射到 MetaCyc 通路
  • 一个通路通常包含多个反应步骤,每个步骤可能由多个基因家族催化

3. 通路丰度计算 (Pathway Abundance Calculation)

关键公式:

通路丰度 = min(该通路中所有反应步骤的丰度)

其中:

  • 每个反应步骤的丰度 = 该步骤中所有基因家族丰度的总和
  • 通路的最终丰度 = 所有反应步骤丰度的最小值(瓶颈原则)

具体示例

假设情况

假设有一个简单的代谢通路 “Glycolysis”(糖酵解),包含 3 个反应步骤:

反应步骤 1:葡萄糖 → 葡萄糖-6-磷酸

  • 由基因家族 UniRef90_A 和 UniRef90_B 催化
  • UniRef90_A 丰度 = 100 RPK
  • UniRef90_B 丰度 = 50 RPK
  • 步骤 1 丰度 = 100 + 50 = 150 RPK

反应步骤 2:葡萄糖-6-磷酸 → 果糖-6-磷酸

  • 由基因家族 UniRef90_C 催化
  • UniRef90_C 丰度 = 80 RPK
  • 步骤 2 丰度 = 80 RPK

反应步骤 3:果糖-6-磷酸 → 果糖-1,6-二磷酸

  • 由基因家族 UniRef90_D 和 UniRef90_E 催化
  • UniRef90_D 丰度 = 200 RPK
  • UniRef90_E 丰度 = 120 RPK
  • 步骤 3 丰度 = 200 + 120 = 320 RPK

通路丰度计算

Glycolysis 通路丰度 = min(步骤1, 步骤2, 步骤3)
                    = min(150, 80, 320)
                    = 80 RPK

为什么用最小值?

  • 这遵循”木桶原理”(瓶颈效应)
  • 通路的整体通量受限于最慢/最少的反应步骤
  • 步骤 2 只有 80 RPK,即使其他步骤丰度很高,整个通路的最大通量也只能是 80

归一化 (Normalization)

HUMAnN 还会进行归一化处理:

1. RPK (Reads Per Kilobase)

RPK = (比对到基因的 reads 数) / (基因长度 kb)

2. RPKM/CPM (可选)

RPKM = RPK / (总百万 reads)

3. 相对丰度 (Relative Abundance)

通路相对丰度 = (通路 RPKM) / (所有通路 RPKM 总和) × 1,000,000

单位:CPM (Counts Per Million)


您的数据示例

从您的输出看:

otu_table()   OTU Table:         [ 1097 taxa and 2 samples ]

这 1097 个 “taxa” 实际上是 1097 个 MetaCyc 通路,每个通路的丰度值是经过以下处理的:

  1. 原始丰度:基于上述最小值原则计算
  2. 归一化:转换为相对丰度(sum = 1 或 100%)
  3. 输出文件pathabundance_relab.tsv 中的值就是相对丰度

关键要点

特点 说明
计算方法 取通路中所有反应步骤的最小丰度
单位 通常是相对丰度(0-1 或 0-100%)
生物学意义 反映通路的潜在代谢能力
优势 考虑了通路的完整性,不是简单加和
局限性 无法区分活跃/非活跃通路(需要转录组验证)

注意事项

⚠️ 重要提醒

  • 通路丰度反映的是基因潜力(gene potential),不是实际代谢活性
  • 一个通路存在 ≠ 该通路正在被使用
  • 需要结合转录组(RNA-seq)或代谢组数据才能确定实际活性
  • 对于您的 n=1 样本,只能做描述性比较,无法统计推断

STEP 5 — Descriptive visualisations (appropriate for n = 1 per group) for Data_Tam_Metagenomics_2026_Soil

Updated Step 5: PNG Figures + Complete Excel Exports

Prerequisites Update

Add openxlsx to your package list (for Excel export):

# Install if needed
install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr",
                   "tidyr", "pheatmap", "openxlsx"))

And load it:

library(openxlsx)

Complete Step 5 — Replace your existing Step 5 with this

# =============================================================================
# STEP 5 — Visualisations (PNG) + Complete Excel exports
# =============================================================================

dir.create("figures", showWarnings = FALSE)
dir.create("tables",  showWarnings = FALSE)

# Helper: safe log2 fold change (adds pseudocount to avoid log(0))
safe_log2fc <- function(x, y, pseudo = 1e-6) {
  log2((x + pseudo) / (y + pseudo))
}

# =============================================================================
# 5a. Top-N species stacked bar plot (Loc1 vs Loc4)
# =============================================================================

top_n_sp <- 20
top_species <- names(sort(rowMeans(otu_table(species_ps)),
                          decreasing = TRUE))[1:top_n_sp]

ps_top <- prune_taxa(top_species, species_ps)

df_species <- psmelt(ps_top) %>%
  mutate(Species = factor(Species, levels = rev(top_species)))

p_species <- ggplot(df_species, aes(x = Location, y = Abundance, fill = Species)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "D") +
  labs(title = paste("Top", top_n_sp, "Species by Location"),
       x = "Location", y = "Relative Abundance", fill = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/species_top20_barplot.png", p_species,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/species_top20_barplot.png\n")

# =============================================================================
# 5b. Species heatmap (all species, row-scaled)
# =============================================================================

otu_mat <- as.matrix(otu_table(species_ps))

# Filter species with very low abundance (max < 0.01% across both samples)
keep <- apply(otu_mat, 1, max) > 0.0001
otu_filt <- otu_mat[keep, , drop = FALSE]

# Annotation column
ann_col <- data.frame(Location = metadata[colnames(otu_filt), "Location"],
                      row.names = colnames(otu_filt))

# Write heatmap to PNG
png("figures/species_heatmap.png",
    width = 10, height = max(8, 0.25 * nrow(otu_filt) + 2),
    units = "in", res = 300)

pheatmap(otu_filt,
         scale         = "row",
         clustering_distance_rows = "euclidean",
         clustering_method        = "complete",
         annotation_col = ann_col,
         main           = "Species Abundance Heatmap (row-scaled)",
         fontsize_row   = 6,
         fontsize_col   = 10,
         show_rownames  = nrow(otu_filt) <= 80)

dev.off()
cat("✅ Saved: figures/species_heatmap.png\n")

# =============================================================================
# 5c. Top pathways stacked bar plot
# =============================================================================

top_n_pw <- 20
top_pw_names <- names(sort(rowMeans(otu_table(pathway_ps)),
                           decreasing = TRUE))[1:top_n_pw]

ps_pw_top <- prune_taxa(top_pw_names, pathway_ps)

df_pw <- psmelt(ps_pw_top) %>%
  mutate(Pathway = factor(Pathway, levels = rev(top_pw_names)))

p_pw <- ggplot(df_pw, aes(x = Location, y = Abundance, fill = Pathway)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  labs(title = paste("Top", top_n_pw, "HUMAnN Pathways by Location"),
       x = "Location", y = "Relative Abundance", fill = "Pathway") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/pathways_top20_barplot.png", p_pw,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathways_top20_barplot.png\n")

# =============================================================================
# 5d. Pathway dot plot (Loc1 vs Loc4)
# =============================================================================

df_pw_wide <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway") %>%
  filter(Pathway %in% top_pw_names) %>%
  pivot_longer(-Pathway, names_to = "Sample", values_to = "Abundance") %>%
  left_join(data.frame(Sample = rownames(metadata_clean),
                       Location = metadata_clean$Location,
                       stringsAsFactors = FALSE),
            by = "Sample")

p_dot <- ggplot(df_pw_wide, aes(x = Location, y = Pathway,
                                 size = Abundance, color = Abundance)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Pathway Abundance: Loc1 vs Loc4",
       x = "Location", y = "Pathway") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7))

ggsave("figures/pathway_dotplot.png", p_dot,
       width = 8, height = 10, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathway_dotplot.png\n")

# =============================================================================
# 5e. COMPLETE species list → Excel
# =============================================================================

# Build full species table (ALL species, no cutoff)
sp_full <- as.data.frame(otu_table(species_ps)) %>%
  rownames_to_column("Species")

# Ensure columns exist (defensive)
if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(sp_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in species OTU table.")
}

sp_full <- sp_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Species,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for species
sp_wb <- createWorkbook()
addWorksheet(sp_wb, "All_Species")
addWorksheet(sp_wb, "Top50_by_Diff")
addWorksheet(sp_wb, "Top50_by_Abundance")
addWorksheet(sp_wb, "Loc1_only")
addWorksheet(sp_wb, "Loc4_only")
addWorksheet(sp_wb, "Shared")

writeData(sp_wb, "All_Species",       sp_full)
writeData(sp_wb, "Top50_by_Diff",     head(sp_full, 50))
writeData(sp_wb, "Top50_by_Abundance",
          sp_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(sp_wb, "Loc1_only",
          sp_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(sp_wb, "Loc4_only",
          sp_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(sp_wb, "Shared",
          sp_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Formatting
header_style <- createStyle(textDecoration = "bold", bgFill = "#D3D3D3")
for (sh in c("All_Species","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(sp_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(sp_wb, sh, cols = 1:ncol(sp_full), widths = "auto")
  freezePane(sp_wb, sh, firstRow = TRUE)
}

saveWorkbook(sp_wb, "tables/species_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/species_Loc1_vs_Loc4.xlsx  (%d species total)\n",
            nrow(sp_full)))

# =============================================================================
# 5f. COMPLETE pathway list → Excel
# =============================================================================

pw_full <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway")

if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(pw_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in pathway OTU table.")
}

pw_full <- pw_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Pathway,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for pathways
pw_wb <- createWorkbook()
addWorksheet(pw_wb, "All_Pathways")
addWorksheet(pw_wb, "Top50_by_Diff")
addWorksheet(pw_wb, "Top50_by_Abundance")
addWorksheet(pw_wb, "Loc1_only")
addWorksheet(pw_wb, "Loc4_only")
addWorksheet(pw_wb, "Shared")

writeData(pw_wb, "All_Pathways",       pw_full)
writeData(pw_wb, "Top50_by_Diff",      head(pw_full, 50))
writeData(pw_wb, "Top50_by_Abundance",
          pw_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(pw_wb, "Loc1_only",
          pw_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(pw_wb, "Loc4_only",
          pw_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(pw_wb, "Shared",
          pw_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

for (sh in c("All_Pathways","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(pw_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(pw_wb, sh, cols = 1:ncol(pw_full), widths = "auto")
  freezePane(pw_wb, sh, firstRow = TRUE)
}

saveWorkbook(pw_wb, "tables/pathways_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways total)\n",
            nrow(pw_full)))

# =============================================================================
# Summary
# =============================================================================

cat("\n========================================\n")
cat("STEP 5 COMPLETE\n")
cat("========================================\n")
cat("Figures (PNG, 300 dpi):\n")
cat("  • figures/species_top20_barplot.png\n")
cat("  • figures/species_heatmap.png\n")
cat("  • figures/pathways_top20_barplot.png\n")
cat("  • figures/pathway_dotplot.png\n")
cat("\nExcel tables (complete lists, no cutoff):\n")
cat(sprintf("  • tables/species_Loc1_vs_Loc4.xlsx   (%d species)\n", nrow(sp_full)))
cat(sprintf("  • tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways)\n", nrow(pw_full)))
cat("========================================\n")

What You Get

📊 Figures (PNG, 300 dpi, publication-ready)

File Content Size
species_top20_barplot.png Top 20 species stacked bar 10 × 8 in
species_heatmap.png All species above 0.01% threshold, row-scaled 10 × auto (scales with # species)
pathways_top20_barplot.png Top 20 pathways stacked bar 10 × 8 in
pathway_dotplot.png Top 20 pathways dot plot 8 × 10 in

📑 Excel Files (complete lists, no cutoff)

Each workbook contains 6 sheets:

Sheet Content
All_Species / All_Pathways Every detected feature, sorted by absolute difference
Top50_by_Diff Top 50 features by Loc4 − Loc1
Top50_by_Abundance Top 50 features by mean abundance
Loc1_only Features detected only in Loc1
Loc4_only Features detected only in Loc4
Shared Features detected in both locations

📋 Columns in each Excel file

Column Meaning
Species / Pathway Feature name
Abundance_Loc1 Raw relative abundance in Loc1
Abundance_Loc4 Raw relative abundance in Loc4
Diff_Loc4_minus_Loc1 Absolute difference (Loc4 − Loc1)
Log2FC_Loc4_vs_Loc1 Log₂ fold change (with pseudocount 1e-6 to handle zeros)
Present_in_Loc1 TRUE/FALSE
Present_in_Loc4 TRUE/FALSE
Total_Abundance Sum across both samples
Mean_Abundance Mean across both samples

Notes

  1. Log₂ fold change: Uses a small pseudocount (1e-6) to avoid log(0). For features absent in one location, this gives a large but finite fold change — interpret these as “presence/absence” rather than true fold change.

  2. Heatmap filtering: Species with maximum abundance < 0.01% across both samples are excluded to keep the heatmap readable. Adjust the threshold in keep <- apply(otu_mat, 1, max) > 0.0001 if needed.

  3. Excel formatting: Headers are bold with grey background, columns auto-sized, and the first row is frozen for easy scrolling.

  4. File locations: All outputs go into figures/ and tables/ subfolders inside your current working directory (reports/).

医院废水处理工艺显著降低了废水中“副血链球菌”(*Streptococcus parasanguinis*)的相对丰度

这个统计结果表明:在排除了不同采样时间(季节/月份)带来的自然波动影响后,医院废水处理工艺显著降低了废水中“副血链球菌”(Streptococcus parasanguinis)的相对丰度。

下面为您逐列详细解释这些统计指标的含义:

1. 字段详细解释

字段名 结果值 含义解释
feature Streptococcus.parasanguinis 特征名称:即发生显著变化的物种名称,中文通常译为“副血链球菌”。这是一种常见于人类口腔和肠道的细菌,属于机会致病菌。
value Post 比较的组别:因为我们在 MaAsLin2 中设置 Pre(处理前)为参考组(Reference),这里的 Post 表示这是 “处理后”相对于“处理前” 的比较结果。
coef -0.0006712921 效应系数(Coefficient):代表处理后该物种相对丰度的变化量。负值(-)表示丰度下降。因为您的输入数据是相对丰度(0~1之间),这意味着经过处理后,该菌的相对丰度比处理前降低了约 0.067%
pval 0.0001454243 原始 P 值(P-value):未经多重检验校正的显著性水平。这个值远小于 0.05(甚至小于 0.001),说明在单变量统计检验中,处理前后的差异是极其显著的。
qval 0.01915278 校正后的 Q 值(FDR):即错误发现率(False Discovery Rate)。因为宏基因组数据同时检验了成百上千个物种,必须进行多重检验校正(如 Benjamini-Hochberg 方法)以防止假阳性。Q值 < 0.05,说明即使经过了严格校正,这个差异依然是统计学显著的

2. 结合您的实验设计的深度解读

在您的实验设计中,MaAsLin2 模型同时纳入了 Treatment(处理前/后)和 TimePoint(11月、1月、3月、5月)两个变量。

  • 控制时间变量(TimePoint):废水中的微生物群落会随着季节、气温等时间因素自然变化。模型将 TimePoint 作为协变量(Covariate)剔除,意味着这个 -0.00067 的下降纯粹是由废水处理工艺(Pre vs Post)引起的,而不是因为采样月份不同造成的。
  • 统计学稳健性:原始 P 值(0.00014)和 Q 值(0.019)都非常小,说明这个结果非常稳健,不是偶然产生的假阳性。

3. 生物学与环境意义

  • 副血链球菌(Streptococcus parasanguinis 通常是人类口腔、呼吸道的正常菌群,但也可能引起心内膜炎等机会性感染。在医院废水中检测到它,说明它可能来源于医院的医疗排放或人类排泄物。
  • 环境意义:该菌在处理后(Post)显著减少,说明您的医院废水处理系统能够有效地截留、降解或灭活这种条件致病菌,降低了其排放到自然环境中带来的潜在生物安全风险。这是一个非常积极的工程效果!


Great question! The difference in min_abundance thresholds between species (0.001) and pathways (0.0001) is intentional and based on the fundamental differences in how MetaPhlAn and HUMAnN data are distributed:

1. Data Distribution Differences

Species data (MetaPhlAn):

  • Community composition is typically dominated by a few highly abundant species
  • Most species have relative abundances well above 0.1% (0.001)
  • A threshold of 0.001 (0.1%) effectively filters out “noise” while retaining biologically meaningful species

Pathway data (HUMAnN):

  • Functional profiles are much more evenly distributed across thousands of pathways (you have 15,785 pathways!)
  • Many biologically important pathways exist at very low relative abundances (e.g., antibiotic resistance, xenobiotic degradation, rare metabolic functions)
  • A higher threshold would eliminate too many pathways, potentially losing critical functional signals

2. Why 0.0001 (0.01%) for Pathways?

  • Captures the “long tail” of functional diversity: Many critical metabolic pathways operate at low abundance but are essential for ecosystem function
  • Balances sensitivity and specificity: Combined with min_prevalence = 0.1 (the pathway must appear in at least 10% of samples), this ensures we only test features that are consistently detected, even if at low levels
  • Standard practice in metagenomics: HUMAnN pathway analyses commonly use lower abundance thresholds because functional redundancy means even rare pathways can be important

3. You Can Adjust This

If you want to be more conservative (fewer pathways tested, faster computation, less multiple testing correction burden), you could increase it:

min_abundance = 0.001  # Same as species, 0.1% threshold

Or if you want to be even more sensitive:

min_abundance = 0.00001  # 0.001% threshold

4. Practical Recommendation

Given your dataset has only 11 samples, I’d actually recommend increasing the pathway threshold to match species (min_abundance = 0.001) because:

  • With small sample sizes, testing too many low-abundance features increases the multiple testing burden
  • It makes the Q-value correction more stringent
  • You’ll get more robust, interpretable results

Would you like me to update the code to use min_abundance = 0.001 for pathways as well?

Can differentiated organoid flip? 小鼠肠道类器官的分化,用于耶尔森氏菌(Y. enterocolitica)感染研究,采用双极性构型。

结合你提供的详细背景信息,幻灯片上提出的 “Can differentiated organoid flip?”(已分化的类器官能否翻转?) 是一个非常核心且切中实验痛点的问题。

在中文学术语境中,这里的 “Flip(翻转)”绝对不是指像翻煎饼一样把培养皿里的球体物理翻个面,而是指 “细胞极性的反转(Apical-Basal Polarity Reversal)”“培养维度的转换(从3D转为2D单层)”

结合你的背景信息,我为你深度解析这个问题到底在问什么,以及为什么它对你的研究至关重要:


一、 结合背景信息:为什么要探讨“Flip(翻转)”?

回顾你背景信息中提到的 “双极性构型(two-polarity conformations)”“耶尔森氏菌(Y. enterocolitica)感染”

  1. 体内的真实生理状态: 在真实肠道中,耶尔森氏菌等致病菌是存在于肠腔中的。因此,细菌入侵时,首先接触的是肠上皮细胞的管腔面(顶侧,Apical side,长有微绒毛和黏液层)
  2. 常规3D类器官的致命痛点: 在Matrigel(基质胶)中常规培养的3D肠类器官,其极性是“内向”的:顶侧朝内(朝向中央管腔),基底侧朝外(接触外面的培养基)。如果你把细菌加在培养基里,细菌只能接触到细胞的基底侧。这完全违背了生理状态,无法真实模拟细菌从肠腔入侵的过程。
  3. “Flip”的目的: 为了让实验符合生理状态,必须把类器官“Flip(翻转)”,让顶侧(Apical)朝外,直接暴露给含有细菌的培养基。这就是幻灯片探讨这个问题的核心动机。

二、 科学解答:已分化的类器官能否“Flip”?

答案是:能,但需要特定的技术手段。

当幻灯片问“已分化的类器官能否翻转”时,它实际上在探讨一个细胞生物学难题:类器官已经分化出了成熟的吸收细胞、杯状细胞、潘氏细胞等,这种高度分化的状态能否被打破并重新排列,以适应新的极性?

在实验中,科学家通常通过以下两种方式实现“Flip”:

1. 3D 转 2D 单层培养(最常用、最彻底的 Flip)

这是目前解决极性问题的“黄金标准”。

  • 操作方法:将3D胶里已经分化的肠类器官消化成单细胞或小细胞团,然后接种到 Transwell(半透膜小室) 上。
  • Flip 结果:细胞会在膜上重新生长,形成你背景信息中提到的 “单层柱状上皮”。此时,细胞的基底侧自然贴在膜上(朝下接触底部培养基),而顶侧(长满微绒毛) 自然朝上(暴露在顶部空间)。
  • 优势:完美实现了极性翻转。细菌加在顶部,直接感染顶侧;底部培养基可以模拟血液侧,加入免疫细胞或药物。

2. 微流控芯片(Gut-on-a-chip)

  • 这是 Transwell 的高级进阶版。在微流控通道中培养单层细胞,一侧通道模拟肠腔(加细菌),另一侧通道模拟血管(加流体剪切力和免疫细胞)。这也是一种高级的“Flip”应用,能更好地模拟体内动态环境。

3. 3D 囊泡的直接诱导反转(难度极高)

  • 通过特殊的离心力、局部酶解基质胶、或使用特定的小分子化合物,强行让3D球体的细胞重新排列,使顶侧翻到外面。但这种方法很难保证所有细胞极性一致,且容易破坏类器官结构,目前较少用于感染实验。

三、 幻灯片提出这个问题的深层科学考量

当讲者提出“Can differentiated organoid flip?”时,他/她实际上是在引导听众思考以下三个关键科学问题(你可以用来丰富你的理解或提问):

  1. 分化状态的可塑性(Plasticity): 类器官在3D胶里已经分化成熟了,当我们把它打散重铺成2D单层(Flip)时,这些细胞还能不能重新建立正确的顶底极性?还能不能保留原有的细胞类型(如杯状细胞还能不能分泌黏液)?
  2. 屏障功能的完整性(Barrier Integrity): 在翻转和重新铺板的过程中,细胞间的紧密连接(Tight junctions) 会不会受损?如果屏障漏了,细菌就不需要“入侵”细胞,而是直接从细胞缝隙漏过去了,这就失去了研究耶尔森氏菌侵入机制的意义。
  3. 微绒毛与黏液层的保留: 耶尔森氏菌感染需要穿透肠道黏液层。Flip 之后,顶侧朝外,杯状细胞分泌的黏液层和吸收细胞的微绒毛还能不能正常形成并发挥功能?

四、 一句话总结

结合你的背景信息,幻灯片上 “Can differentiated organoid flip?” 的准确中文解释是:

“已经分化成熟的肠类器官,能否通过改变培养方式(如从3D转为2D单层),实现细胞极性的反转(让顶侧朝外),从而克服常规3D培养中顶侧朝内的缺陷,完美模拟耶尔森氏菌从肠腔入侵的真实生理过程?”

结论是:能。 通过 Transwell 或微流控芯片等技术,将 3D 类器官转化为 2D 单层,是实现极性翻转(Flip)并研究肠道病原体感染的最佳策略。



杯状细胞 的英语是:

Goblet cell

(复数形式:Goblet cells)


补充说明:

  • 名称由来:因细胞形态像高脚杯(goblet)而得名
  • 主要功能:分泌黏液(mucin),形成黏液层保护上皮
  • 分布位置:肠道上皮、呼吸道上皮、结膜等


如果你说的 “flip” 是指类器官(organoid)在已经分化(differentiated)之后发生极性翻转(polarity flip,inside-out flipping),那么答案是:

可以,在一定条件下可以发生。

什么是“flip”?

许多上皮来源的类器官(如肠道类器官、气道类器官、肾脏类器官)具有明确的顶端-基底极性:

  • Apical(顶端面):通常朝向类器官内部腔体(lumen)
  • Basal(基底面):通常朝向外部基质(如 Matrigel)

正常情况下结构如下:

外部基质
    ↓
[Basal面]
  细胞层
[Apical面]
    ↓
   Lumen

当研究人员去除基质、改变培养条件或进行机械处理后,类器官可能重组为:

外部培养液
    ↓
[Apical面]
  细胞层
[Basal面]
    ↓
  内部

这就是常说的 inside-out organoidpolarity reversal(极性翻转)

已经分化后还能翻转吗?

通常可以

原因是:

  1. 分化状态和极性状态不是同一个概念

    • 细胞已经分化为吸收细胞、杯状细胞、纤毛细胞等,
    • 但其空间排列和极性仍具有一定可塑性。
  2. 细胞会感知外部ECM(细胞外基质)

    • Integrin等受体持续感知基质信号。
    • 当ECM环境改变时,细胞可能重新建立极性。
  3. 许多实验已经实现

    • 肠道类器官
    • 肺/气道类器官
    • 肾脏类器官

都报道过在成熟后发生极性反转。

会不会失去分化特征?

不一定。

很多情况下:

  • 分化标志物仍然存在;
  • 细胞类型组成变化不大;
  • 只是顶端和基底面的朝向改变。

但如果培养条件改变过大或时间过长:

  • 部分细胞可能去分化(dedifferentiation);
  • 分化谱系比例发生变化;
  • 类器官结构重塑。

如果你说的“flip”是另一种意思

有时研究者也用“flip”指:

  1. 类器官物理翻转(turn over)
  2. 极性翻转(polarity flip)
  3. 分化细胞重新回到干细胞状态(fate reversal)

其中:

  • 极性翻转 → 经常可以发生;
  • 完全命运逆转(differentiated → stem-like)→ 也可能发生,但机制不同,通常称为去分化(dedifferentiation)美式英语:/diˌdɪfəˌrɛnʃiˈeɪʃən/ ,而不是“flip”。

如果你是在阅读某篇论文或实验方案,可以把原句贴出来,我可以根据具体上下文解释这里的 “flip” 到底指哪一种现象。



一、Organoid(类器官)—— 深度解析

1. 定义

类器官是指利用成体干细胞(ASC)多能干细胞(PSC,包括胚胎干细胞和诱导多能干细胞),在体外三维(3D)培养体系中,通过自组织(self-organization)和定向分化,形成的微型三维组织样结构。它能够再现来源器官的部分细胞类型、空间架构和生理功能

2. 核心特征(四点)

特征 说明
三维结构 不同于传统的二维(2D)单层细胞培养,类器官具有立体的细胞极性和细胞间连接
多细胞类型 包含多种分化细胞,而非单一细胞系
自组织能力 细胞按照内在的发育程序自行排列成类似体内的结构,无需外部支架的精细引导
功能性 能够执行部分器官特有功能,如肠类器官的吸收/分泌、脑类器官的神经电活动

3. 来源分类

  • 成体干细胞来源(ASC-derived):取自组织中的成体干细胞(如肠道隐窝干细胞),分化潜能有限(只能形成该组织的细胞类型),但成熟度高、更接近体内真实状态。
  • 多能干细胞来源(PSC-derived):取自胚胎干细胞或iPS细胞,分化潜能大,可模拟胚胎发育过程,但成熟度相对较低。

4. 应用领域

  • 疾病建模(感染、肿瘤、遗传病)
  • 药物筛选与毒性测试
  • 再生医学(移植修复)
  • 发育生物学研究

二、Enteroid(肠类器官)—— 深度解析

1. 定义

肠类器官是类器官的一种特定亚型,特指来源于肠道上皮干细胞(主要是小肠或结肠的Lgr5⁺隐窝基底柱状细胞),在三维培养中形成的、包含所有肠道上皮主要细胞类型的微型肠道组织。

2. 肠类器官包含的细胞类型(核心)

细胞类型 功能
肠上皮吸收细胞(Enterocyte) 营养吸收、离子转运
杯状细胞(Goblet cell) 分泌黏液,保护上皮屏障
潘氏细胞(Paneth cell) 分泌抗菌肽(如溶菌酶),调节干细胞微环境
肠内分泌细胞(Enteroendocrine cell) 分泌激素(如GLP-1、5-羟色胺),调控消化和食欲
干细胞(Lgr5⁺ CBC细胞) 自我更新,分化补充其他细胞类型
簇细胞(Tuft cell) 参与免疫感知和寄生虫防御

3. 结构特征

  • 肠类器官在Matrigel(基底膜基质)中培养时,会形成极性的囊泡状结构
    • 管腔面(Apical side):面向内部空腔,带有微绒毛,对应肠道的内腔(接触食物和菌群的一面)
    • 基底侧面(Basolateral side):朝向外侧,对应肠道的基底膜侧(接触血管和结缔组织的一面)
  • 这种极性正是你原句中“two-polarity conformations(双极性构型)”所指的内容。

4. 与体内肠道的对应关系

  • 肠类器官的隐窝-绒毛轴在体外被简化为出芽结构(buds)——出芽部分富含干细胞和潘氏细胞(对应隐窝),中央囊泡部分富含分化细胞(对应绒毛区)。

三、Organoid vs Enteroid —— 更详细的对比表

对比维度 Organoid(类器官) Enteroid(肠类器官)
中文译名 类器官 肠类器官(或称肠道类器官)
层级关系 上位概念(总称) 下位概念(子类)
组织来源 可以是脑、肝、肾、肺、肠、胰腺、视网膜等任何器官 仅限于小肠或结肠
干细胞类型 ASC 或 PSC 均可 通常为 ASC(肠道隐窝干细胞)
分化方向 定向分化为对应器官的细胞谱系 定向分化为肠道上皮各细胞类型
培养条件 因器官而异,需不同的细胞因子组合(如脑需Noggin,肝需HGF等) 通常需EGF、Noggin、R-spondin、Wnt3a等(经典“肠类器官培养基”)
结构特点 因器官而异(如脑类器官呈脑区样结构,肝类器官呈肝板样) 呈囊泡状+出芽结构,具有肠上皮特有的隐窝-绒毛极性
典型应用 广泛(神经退行性疾病、肝癌、肾发育、肺纤维化等) 主要集中于肠道感染(如沙门氏菌、耶尔森氏菌)、炎症性肠病(IBD)、结直肠癌、营养代谢

四、为什么你的句子用 Enteroid 而不是 Organoid —— 深入解释

你的原句:

“Differentiation of murine enteroids for Y. enterocolitica infection in two-polarity conformations”

  • 研究对象耶尔森氏菌(Yersinia enterocolitica——这是一种肠道致病菌,主要通过侵入肠道上皮引起感染。
  • 要研究这类细菌,你需要一个能模拟肠道上皮屏障、黏液层、极性运输和细菌侵入机制的模型。
  • 肠类器官(enteroid)正好提供了:
    • 完整的肠道上皮细胞多样性
    • 生理性的顶-底极性(细菌从顶侧侵入,免疫信号从底侧传递)
    • 与体内更接近的微环境

如果用 organoid(泛称),就太笼统了,别人不知道你研究的是肠、胃还是其他器官的类器官。因此,enteroid 是此处最精确、最专业的术语。


五、延伸概念:Colonoid 和 Gastroid

为了帮你更清楚理解 enteroid 的范围,再补充两个相关术语:

术语 中文 来源 说明
Enteroid 肠类器官 小肠或结肠(广义) 有时狭义特指小肠来源
Colonoid 结肠类器官 结肠(大肠) 是 enteroid 的下位细分
Gastroid 胃类器官 胃组织 与 enteroid 同属消化道类器官但不是同一类

在严格学术语境中,enteroid 有时只指小肠来源,而colonoid结肠来源,但日常文献中常混用,统称 intestinal organoids(肠道类器官),而 enteroid 则是更专业的叫法。


六、补充:关于“two-polarity conformations(双极性构型)”

这一点与你问的概念密切相关,我额外解释一下:

  • 在体内,肠道上皮是单层柱状上皮,具有明确的顶底极性(apical-basal polarity)

    • 顶侧(Apical):朝向肠腔,接触食物、菌群、病原体
    • 底侧(Basolateral):朝向基底膜,接触固有层、血管、免疫细胞
  • 在体外肠类器官培养中:

    • 常规构型:类器官的顶侧朝向管腔内部(封闭腔),底侧朝外(接触培养基)。这种构型下,细菌难以直接接触到顶侧(因为顶侧在腔里面),研究感染时需要显微注射或破坏类器官。
    • 反转构型(反向极性):通过特殊培养方法(如去掉基质胶或使用transwell),使顶侧朝外建立开放顶面的单层,这样病原体可以直接从培养液接触顶侧,更便于感染实验。

你的句子中 “two-polarity conformations” 就是指这两种不同的极性排列方式(常规封闭腔 vs 开放/反转极性),用于比较哪种更适于研究耶尔森氏菌的侵入机制。


Further topics:

  • 肠类器官的具体培养方法(培养基成分、Matrigel操作)
  • “反转极性”如何实现(具体实验方案)
  • 耶尔森氏菌感染肠类器官的研究案例


这是一个非常关键且容易混淆的问题。直接给你结论:

隐窝(Crypt)既不是管腔面(Apical side),也不是基底侧面(Basolateral side)。

为了让你彻底明白,我从空间位置细胞极性两个维度来拆解:


一、从空间位置来看(解剖学视角)

在真实的肠道组织中,隐窝和绒毛是肠壁上的两种不同地形结构

  • 绒毛(Villus):像手指一样向上突起,伸向肠道内部(管腔)。
  • 隐窝(Crypt):像小口袋一样向下凹陷,嵌入肠壁深层的固有层中。

所以,隐窝是一个“凹陷的坑”,绒毛是一个“凸起的包”。它们之间的关系是“高低起伏”的地形关系,而不是“内外表面”的关系。


二、从细胞极性来看(细胞生物学视角)

关键是:无论是隐窝里的细胞,还是绒毛上的细胞,它们都是单层上皮细胞,每一颗细胞都有自己的“管腔面”和“基底侧面”。

细胞部位 管腔面(Apical,顶侧) 基底侧面(Basolateral,底侧)
绒毛上的细胞 朝向肠腔内部(接触食物、菌群) 朝向绒毛内部的固有层(接触血管)
隐窝里的细胞 朝向隐窝的管腔(即那个凹陷的空腔) 朝向隐窝外周的基底膜和固有层

也就是说:

隐窝本身是一个“下陷的管腔空间”。隐窝里的干细胞和潘氏细胞,它们的顶膜(Apical)是面向隐窝那个空洞的,而基底膜(Basolateral)是面向隐窝外壁的组织间隙的。


三、用一张图来理解(文字版)

        ═══════════════════════════════   ← 肠腔(真正的管腔,食物通过的地方)
        ↑  绒毛(Villus)            ↑
        |  细胞顶面(Apical)朝向肠腔   |
        |  细胞底面(Basolateral)朝内   |
        ═══════════════════════════════   ← 基底膜
                 ↓↓↓ 向下凹陷 ↓↓↓
        ═══════════════════════════════   ← 隐窝开口
        |  隐窝(Crypt)              |
        |  细胞顶面(Apical)朝向隐窝空腔 |
        |  细胞底面(Basolateral)朝向外周 |
        ═══════════════════════════════   ← 基底膜

四、回答你问题的核心

  • 问:隐窝是管腔面吗?

    • 不是。 隐窝是一个结构区域(凹陷区),而管腔面(Apical)是细胞的一个面(朝向空腔的那一面)。隐窝里的细胞自己的管腔面(朝向隐窝空洞),但隐窝本身不等于管腔面。
  • 问:隐窝是基底侧面吗?

    • 不是。 基底侧面(Basolateral)是细胞接触基底膜和细胞间质的那一面。隐窝里的细胞也有基底侧面(朝向隐窝外周的基底膜),但隐窝本身不等于基底侧面。

五、在肠类器官(Enteroid)中对应什么?

在体外3D培养的肠类器官中:

体内结构 类器官中的对应结构
绒毛(Villus) 类器官的中央囊泡(中央腔体),上皮较薄,含分化细胞
隐窝(Crypt) 类器官的出芽(Bud),向外凸起的芽状结构,富含Lgr5⁺干细胞和潘氏细胞

注意:这里的“出芽”是向外凸起的,而不是向下凹陷的,这是因为类器官的极性相对于基质胶是“内外颠倒”的(顶面朝内腔,基底面朝外)。


六、一句话总结

隐窝是一个“结构位置”(凹陷区域),不是细胞的“一个面”。隐窝里的细胞既有管腔面(朝隐窝空洞),也有基底侧面(朝隐窝外周的基底膜)。

Further topics:

  • 为什么类器官的极性是“内外颠倒”的?
  • 在Transwell系统中如何实现“顶面朝外”?


1. “单层柱状上皮”是指只有一层细胞吗?

是的,千真万确。

  • 单层”在组织学里就是字面意思:只有一层细胞,不像皮肤(复层鳞状上皮)那样有很多层叠加在一起。
  • 柱状”指的是细胞的形态:它们不是圆形的,也不是扁平的,而是像高脚杯或柱子一样,高度 > 宽度,看起来细细长长的。

正因为只有一层,所以这层细胞肩负着巨大的责任:既要当“屏障”挡住坏人(细菌毒素),又要当“搬运工”吸收营养。


2. 细胞的一侧朝向肠腔,另一侧朝向基底膜吗?

完全正确! 这正是单层上皮最经典的极性(Polarity)特征。

因为只有一层,所以这层细胞必须“一头一尾”各司其职:

  • 顶端(Apical side / 管腔面)朝向肠腔(也就是大便和食物残渣流过的地方)。这一侧长满了密密麻麻的微绒毛(Microvilli),用来增加吸收面积,且没有血管直接接触。
  • 基底侧(Basolateral side / 基底侧面)朝向基底膜(Basement membrane)。基底膜像一层“胶水”和“滤网”,把这层上皮和下面的固有层(里面有血管、淋巴管、免疫细胞)粘在一起。细胞吸收的营养物质,会从这一侧排出去,进入血管。

3. 帮你建立“立体空间感”(重要补充)

你可能会困惑:“隐窝不是凹进去的吗?那凹进去的坑里,哪边算肠腔?”

记住这个黄金法则:只要是“空腔”的那一面,就是顶侧(Apical)。

  • 在绒毛上:细胞顶侧 → 对着肠道中间的大洞(肠腔)。
  • 在隐窝里:细胞顶侧 → 对着隐窝那个“小坑”的洞。虽然这个洞是封闭的凹陷,但它依然属于“腔”的一部分。
  • 细胞基底侧:永远对着基底膜(即远离食物,靠近身体内部的那一侧)。

4. 套用在你研究的“双极性构型”上

你现在完全理解了这个结构,就很容易搞懂体外实验的痛点了:

  • 在体内:细菌(如耶尔森氏菌)站在肠腔里,直接接触的就是细胞的顶侧(Apical),然后细菌入侵。
  • 在常规类器官(3D胶里):细胞顶侧(有微绒毛的那面)被包在里面(朝向类器官的内腔),细菌在胶外面,接触的是细胞的基底侧。这就反了!所以科学家才要费尽心机做“双极性构型(反转极性)”,目的就是把细胞的顶侧翻到外面来,让细菌能像在体内一样,从顶侧发起攻击。

总结一句话: 单层 = 只有一层细胞极性 = 顶侧对肠腔(吸收/感染面),基底侧对基底膜(连接身体面)

Generating report using R (Data_Tam_Metagenomics_2026_Wastewater)

To analyze your metagenomics data with greater control, you can use the R packages phyloseq, ggplot2, vegan, and MaAsLin2. The biobakery workflow has already generated the merged tables you need in the results/ directory.

Below is the complete R code to:

  1. Load the data and reproduce the figures from wmgx_report.pdf (PCoA, Heatmap, Stacked Barplot).
  2. Perform differential abundance analysis between Pre-treatment and Post-treatment.
  3. Statistically test if the differences between time points are significant.

Prerequisites

Make sure you have the following R packages installed:

install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr", "tidyr", "pheatmap"))
# MaAsLin2 is installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("MaAsLin2")

Step 1: Load Data and Create Metadata

The biobakery workflow merged your samples into single tables. We will load the MetaPhlAn species table and HUMAnN pathway table, and map your experimental design.

library(phyloseq)
library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Load MetaPhlAn taxonomic profiles
mpa_file <- "results/metaphlan/merged/metaphlan_taxonomic_profiles.tsv"
mpa_raw <- read.table(mpa_file, header=TRUE, sep="\t", comment.char="", check.names=FALSE)
colnames(mpa_raw)[1] <- "clade_name"
rownames(mpa_raw) <- mpa_raw$clade_name
mpa_otu <- mpa_raw[, -1]

# Filter for species level (s__) and clean names
species_idx <- grep("s__", rownames(mpa_otu))
species_otu <- mpa_otu[species_idx, ]
species_names <- gsub(".*s__", "", rownames(species_otu))
species_names <- gsub("\\|.*", "", species_names)
species_names <- gsub("_", " ", species_names) # Replace underscores with spaces
rownames(species_otu) <- species_names

# 2. Create Metadata based on your file names
sample_names <- colnames(species_otu)
metadata <- data.frame(
  row.names = sample_names,
  TimePoint = case_when(
    grepl("Nov", sample_names) ~ "Nov",
    grepl("Jan", sample_names) ~ "Jan",
    grepl("Mar", sample_names) ~ "Mar",
    grepl("May", sample_names) ~ "May"
  ),
  Treatment = ifelse(grepl("Pre", sample_names), "Pre", "Post")
)

# 3. Create phyloseq object for Species
tax_mat <- matrix(rownames(species_otu), ncol=1, dimnames=list(rownames(species_otu), "Species"))
species_ps <- phyloseq(
  otu_table(species_otu, taxa_are_rows=TRUE),
  sample_data(metadata),
  tax_table(tax_mat)
)

# (Optional) Load HUMAnN pathways using the same logic
# path_file <- "results/humann/merged/pathabundance_relab.tsv"
# ... (similar loading steps as above) ...

Step 2: Reproduce Figures from wmgx_report.pdf

1. Ordination (PCoA)

Reproduces the Bray-Curtis Principal Coordinate Analysis.

library(vegan)

# Calculate Bray-Curtis distance and perform PCoA
ord <- ordinate(species_ps, method="PCoA", distance="bray")

# Plot
plot_ordination(species_ps, ord, color="Treatment", shape="TimePoint") +
  geom_point(size=4) +
  theme_minimal() +
  labs(title="PCoA of Species Profiles (Bray-Curtis)", 
       subtitle="Colored by Treatment, Shaped by Time Point") +
  theme(legend.position = "right")

2. Heatmap (Top 25 Species)

Reproduces the hierarchical clustering heatmap using Euclidean distance.

# Get top 25 most abundant species
top25 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:25]
ps_top25 <- prune_taxa(top25, species_ps)

# Plot heatmap (method=NULL defaults to hierarchical clustering)
plot_heatmap(ps_top25, method=NULL, distance="euclidean", 
             taxa.labels=TRUE, label.size=3, na.value=NA) +
  scale_fill_viridis_c() +
  labs(title="Heatmap of Top 25 Most Abundant Species")

3. Stacked Barplot (Top 15 Species)

Reproduces the stacked barplot, sorted by total abundance.

# Get top 15 species
top15 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:15]
ps_top15 <- prune_taxa(top15, species_ps)

# Melt data for ggplot2
df <- psmelt(ps_top15)

# Sort samples by the total abundance of these top 15 species (decreasing order)
sample_order <- df %>%
  group_by(Sample) %>%
  summarise(Total_Ab = sum(Abundance)) %>%
  arrange(desc(Total_Ab)) %>%
  pull(Sample)

df$Sample <- factor(df$Sample, levels=sample_order)

# Plot
ggplot(df, aes(x=Sample, y=Abundance, fill=Species)) +
  geom_bar(stat="identity", position="stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  labs(title="Relative Abundance of Top 15 Most Abundant Species", 
       y="Relative Abundance") +
  scale_fill_discrete(name = "Species")

Step 3: Differential Abundance Analysis (Pre vs. Post)

To find which specific species or pathways are significantly different between Pre- and Post-treatment, we use MaAsLin2 (the official biobakery recommendation). It allows us to test the effect of Treatment while adjusting for TimePoint as a covariate.

library(MaAsLin2)

# MaAsLin2 requires samples as rows and features as columns
species_t <- t(otu_table(species_ps))
species_df <- as.data.frame(species_t)
species_df$sample_id <- rownames(species_df)

metadata_df <- data.frame(
  sample_id = rownames(metadata),
  metadata
)

# Run MaAsLin2
# We set "Pre" and "Nov" as reference groups. 
# The output coefficients will represent the difference of Post vs Pre.
fit_results <- MaAsLin2(
  input_data = species_df,
  input_metadata = metadata_df,
  output = "maaslin2_species_output",
  fixed_effects = c("Treatment", "TimePoint"),
  reference = c("Treatment,Pre", "TimePoint,Nov"), 
  random_effects = c(),
  correction = "FDR",
  standardize = FALSE,
  min_abundance = 0.001, 
  min_prevalence = 0.1   # Feature must be present in at least 10% of samples
)

# View significant results (q-value < 0.05)
sig_results <- subset(fit_results$results, qval < 0.05 & metadata == "Treatment")
if(nrow(sig_results) > 0) {
  print(sig_results[, c("feature", "coef", "qval")] %>% arrange(qval))
} else {
  print("No significantly different species found at FDR < 0.05.")
}

Note: You can run the exact same code for pathways by replacing species_df with your HUMAnN pathway data frame.


Step 4: Is the difference between time points significant?

To determine if the microbial community composition changes significantly across the different time points (Nov, Jan, Mar, May), we use PERMANOVA (via the adonis2 function in the vegan package). This tests the overall community structure rather than individual features.

library(vegan)

# Calculate Bray-Curtis distance matrix
dist_bc <- vegdist(t(otu_table(species_ps)), method="bray")

# Run PERMANOVA
# This tests the main effects of Treatment and TimePoint, and their interaction
adonis_res <- adonis2(dist_bc ~ Treatment * TimePoint, data=metadata, permutations=999)
print(adonis_res)

How to interpret the adonis2 output:

  1. Treatment row: Look at the Pr(>F) (p-value). If it is < 0.05, it means there is a statistically significant difference in the overall microbiome between Pre-treatment and Post-treatment.
  2. TimePoint row: Look at the Pr(>F). If it is < 0.05, it means the microbial community composition differs significantly across the months (Nov, Jan, Mar, May).
  3. Treatment:TimePoint row: If this interaction is significant (< 0.05), it means the effect of the treatment depends on the time of year (e.g., the treatment works differently in November compared to May).

Important Note on Statistical Power: Your dataset has 11 samples total. In the PERMANOVA model, this leaves only 3 residual degrees of freedom. While this mathematically allows for testing, the statistical power is low. Significant results are highly trustworthy, but non-significant results might simply be due to the small sample size.

Processing RNAseq (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE)

  1. Preparing raw data

     mkdir raw_data; cd raw_data
    
     # control samples (8)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_1.fq.gz AYE-WT_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_2.fq.gz AYE-WT_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_1.fq.gz AYE-WT_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_2.fq.gz AYE-WT_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_1.fq.gz AYE-WT_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_2.fq.gz AYE-WT_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_1.fq.gz AYE-T_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_2.fq.gz AYE-T_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_1.fq.gz AYE-T_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_2.fq.gz AYE-T_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_1.fq.gz AYE-T_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_2.fq.gz AYE-T_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_1.fq.gz AYE-O_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_2.fq.gz AYE-O_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_1.fq.gz AYE-O_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_2.fq.gz AYE-O_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_1.fq.gz AYE-O_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_2.fq.gz AYE-O_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_1.fq.gz O-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_2.fq.gz O-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_1.fq.gz O-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_2.fq.gz O-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_1.fq.gz O-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_2.fq.gz O-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_1.fq.gz WT-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_2.fq.gz WT-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_1.fq.gz WT-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_2.fq.gz WT-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_1.fq.gz WT-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_2.fq.gz WT-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_1.fq.gz AYE-WT_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_2.fq.gz AYE-WT_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_1.fq.gz AYE-WT_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_2.fq.gz AYE-WT_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_1.fq.gz AYE-WT_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_2.fq.gz AYE-WT_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_1.fq.gz AYE-O_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_2.fq.gz AYE-O_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_1.fq.gz AYE-O_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_2.fq.gz AYE-O_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_1.fq.gz AYE-O_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_2.fq.gz AYE-O_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_1.fq.gz AYE-T_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_2.fq.gz AYE-T_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_1.fq.gz AYE-T_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_2.fq.gz AYE-T_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_1.fq.gz AYE-T_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_2.fq.gz AYE-T_ctr_solid_r3_R2.fastq.gz
    
     # Diclofenac(双氯芬酸)treatment (6)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_1.fq.gz AYE-WT_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_2.fq.gz AYE-WT_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_1.fq.gz AYE-WT_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_2.fq.gz AYE-WT_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_1.fq.gz AYE-WT_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_2.fq.gz AYE-WT_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_1.fq.gz AYE-T_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_2.fq.gz AYE-T_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_1.fq.gz AYE-T_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_2.fq.gz AYE-T_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_1.fq.gz AYE-T_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_2.fq.gz AYE-T_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_1.fq.gz AYE-O_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_2.fq.gz AYE-O_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_1.fq.gz AYE-O_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_2.fq.gz AYE-O_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_1.fq.gz AYE-O_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_2.fq.gz AYE-O_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_1.fq.gz O-Trans_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_2.fq.gz O-Trans_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_1.fq.gz O-Trans_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_2.fq.gz O-Trans_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_1.fq.gz O-Trans_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_2.fq.gz O-Trans_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_1.fq.gz WT-Trans_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_2.fq.gz WT-Trans_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_1.fq.gz WT-Trans_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_2.fq.gz WT-Trans_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_1.fq.gz WT-Trans_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_2.fq.gz WT-Trans_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_1.fq.gz AYE-WT_Diclo1250_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_2.fq.gz AYE-WT_Diclo1250_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_1.fq.gz AYE-WT_Diclo1250_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_2.fq.gz AYE-WT_Diclo1250_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_1.fq.gz AYE-WT_Diclo1250_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_2.fq.gz AYE-WT_Diclo1250_solid_r3_R2.fastq.gz
    
     # Rifampicin(利福平)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_1.fq.gz AYE-WT_Rifampicin1.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_2.fq.gz AYE-WT_Rifampicin1.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_1.fq.gz AYE-WT_Rifampicin1.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_2.fq.gz AYE-WT_Rifampicin1.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_1.fq.gz AYE-WT_Rifampicin1.5_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_2.fq.gz AYE-WT_Rifampicin1.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_1.fq.gz AYE-T_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_2.fq.gz AYE-T_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_1.fq.gz AYE-T_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_2.fq.gz AYE-T_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_1.fq.gz AYE-T_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_2.fq.gz AYE-T_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_1.fq.gz AYE-O_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_2.fq.gz AYE-O_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_1.fq.gz AYE-O_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_2.fq.gz AYE-O_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_1.fq.gz AYE-O_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_2.fq.gz AYE-O_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_1.fq.gz O-Trans_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_2.fq.gz O-Trans_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_1.fq.gz O-Trans_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_2.fq.gz O-Trans_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_1.fq.gz O-Trans_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_2.fq.gz O-Trans_Rifampicin2_r3_R2.fastq.gz
    
     # Meropenem(美罗培南)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_1.fq.gz AYE-WT_Mero0.35-0.5_r1_R1.fastq.gz  #AYE-WT_Mero0.5_r1
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_2.fq.gz AYE-WT_Mero0.35-0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_1.fq.gz AYE-WT_Mero0.35-0.5_r2_R1.fastq.gz  #AYE-WT_YX_Mero0.35_r2
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_2.fq.gz AYE-WT_Mero0.35-0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_1.fq.gz AYE-WT_Mero0.35-0.5_r3_R1.fastq.gz  #AYE-WT_public_Mero0.35_r3
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_2.fq.gz AYE-WT_Mero0.35-0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_1.fq.gz AYE-T_Mero0.15_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_2.fq.gz AYE-T_Mero0.15_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_1.fq.gz AYE-T_Mero0.15_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_2.fq.gz AYE-T_Mero0.15_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_1.fq.gz AYE-T_Mero0.15_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_2.fq.gz AYE-T_Mero0.15_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_1.fq.gz AYE-O_Mero0.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_2.fq.gz AYE-O_Mero0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_1.fq.gz AYE-O_Mero0.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_2.fq.gz AYE-O_Mero0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_1.fq.gz AYE-O_Mero0.5_r3_R1.fastq.gz  #Mero0.45
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_2.fq.gz AYE-O_Mero0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_1.fq.gz O-Trans_Mero0.25_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_2.fq.gz O-Trans_Mero0.25_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_1.fq.gz O-Trans_Mero0.25_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_2.fq.gz O-Trans_Mero0.25_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_1.fq.gz O-Trans_Mero0.25_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_2.fq.gz O-Trans_Mero0.25_r3_R2.fastq.gz
    
     # Azithromycin(阿奇霉素)treatment (5), among them, F_ctr_solid is clinical isolate.
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_1.fq.gz F_ctr_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_2.fq.gz F_ctr_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_1.fq.gz F_ctr_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_2.fq.gz F_ctr_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_1.fq.gz F_ctr_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_2.fq.gz F_ctr_solid_r3_R2.fastq.gz  #clinical
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_1.fq.gz AYE-WT_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_2.fq.gz AYE-WT_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_1.fq.gz AYE-WT_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_2.fq.gz AYE-WT_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_1.fq.gz AYE-WT_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_2.fq.gz AYE-WT_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_1.fq.gz AYE-T_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_2.fq.gz AYE-T_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_1.fq.gz AYE-T_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_2.fq.gz AYE-T_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_1.fq.gz AYE-T_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_2.fq.gz AYE-T_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_1.fq.gz AYE-O_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_2.fq.gz AYE-O_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_1.fq.gz AYE-O_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_2.fq.gz AYE-O_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_1.fq.gz AYE-O_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_2.fq.gz AYE-O_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_1.fq.gz F_Azi20_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_2.fq.gz F_Azi20_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_1.fq.gz F_Azi20_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_2.fq.gz F_Azi20_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_1.fq.gz F_Azi20_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_2.fq.gz F_Azi20_solid_r3_R2.fastq.gz  #clinical
  2. Preparing the directory trimmed

     mkdir trimmed trimmed_unpaired;
     for sample_id in AYE-O_Azi20_solid_r1 AYE-O_Azi20_solid_r2 AYE-O_Azi20_solid_r3 AYE-O_ctr_r1 AYE-O_ctr_r2 AYE-O_ctr_r3 AYE-O_ctr_solid_r1 AYE-O_ctr_solid_r2 AYE-O_ctr_solid_r3 AYE-O_Diclo375_r1 AYE-O_Diclo375_r2 AYE-O_Diclo375_r3 AYE-O_Mero0.5_r1 AYE-O_Mero0.5_r2 AYE-O_Mero0.5_r3 AYE-O_Rifampicin2_r1 AYE-O_Rifampicin2_r2 AYE-O_Rifampicin2_r3 AYE-T_Azi20_solid_r1 AYE-T_Azi20_solid_r2 AYE-T_Azi20_solid_r3 AYE-T_ctr_r1 AYE-T_ctr_r2 AYE-T_ctr_r3 AYE-T_ctr_solid_r1 AYE-T_ctr_solid_r2 AYE-T_ctr_solid_r3 AYE-T_Diclo375_r1 AYE-T_Diclo375_r2 AYE-T_Diclo375_r3 AYE-T_Mero0.15_r1 AYE-T_Mero0.15_r2 AYE-T_Mero0.15_r3 AYE-T_Rifampicin2_r1 AYE-T_Rifampicin2_r2 AYE-T_Rifampicin2_r3 AYE-WT_Azi20_solid_r1 AYE-WT_Azi20_solid_r2 AYE-WT_Azi20_solid_r3 AYE-WT_ctr_r1 AYE-WT_ctr_r2 AYE-WT_ctr_r3 AYE-WT_ctr_solid_r1 AYE-WT_ctr_solid_r2 AYE-WT_ctr_solid_r3 AYE-WT_Diclo1250_solid_r1 AYE-WT_Diclo1250_solid_r2 AYE-WT_Diclo1250_solid_r3 AYE-WT_Diclo750_r1 AYE-WT_Diclo750_r2 AYE-WT_Diclo750_r3 AYE-WT_Mero0.35-0.5_r1 AYE-WT_Mero0.35-0.5_r2 AYE-WT_Mero0.35-0.5_r3 AYE-WT_Rifampicin1.5_r1 AYE-WT_Rifampicin1.5_r2 AYE-WT_Rifampicin1.5_r3 F_Azi20_solid_r1 F_Azi20_solid_r2 F_Azi20_solid_r3 F_ctr_solid_r1 F_ctr_solid_r2 F_ctr_solid_r3 O-Trans_ctr_r1 O-Trans_ctr_r2 O-Trans_ctr_r3 O-Trans_Diclo375_r1 O-Trans_Diclo375_r2 O-Trans_Diclo375_r3 O-Trans_Mero0.25_r1 O-Trans_Mero0.25_r2 O-Trans_Mero0.25_r3 O-Trans_Rifampicin2_r1 O-Trans_Rifampicin2_r2 O-Trans_Rifampicin2_r3 WT-Trans_ctr_r1 WT-Trans_ctr_r2 WT-Trans_ctr_r3 WT-Trans_Diclo750_r1 WT-Trans_Diclo750_r2 WT-Trans_Diclo750_r3; do \
     for sample_id in AYE-T_Diclo375_r2; do \
             java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
     done
  3. (Optional) using trinity to find the most closely reference

     #Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
     #https://www.genome.jp/kegg/tables/br08606.html#prok
     acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
     abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
     aby     KGB     Acinetobacter baumannii AYE     2008    GenBank --> *
     abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
     abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
     abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
     abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
     abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
     abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
     abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
     abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
     abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
     abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
     abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
     abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
     abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
     abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
     abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
     abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
     abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
     abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
     #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CU459141.1) was choosen as reference!
  4. Preparing samplesheet.csv

     sample,fastq_1,fastq_2,strandedness
     Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
     ...
  5. Downloading CU459141.fasta and CU459141.gff from GenBank and preparing CU459141_m.gff

     #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
     #Default NOT_WORKING: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
     #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
     #Checking the record (see below) in results/genome/CP059040.gtf
     #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
     #--featurecounts_feature_type 'transcript' returns only the tRNA results
     #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    
     grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
     grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
     grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
     grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
     wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
     grep -P "\tCDS\t" CU459141.gff3 | wc -l  #3659
     sed 's/\tCDS\t/\texon\t/g' CU459141.gff3 > CU459141_m.gff
     grep -P "\texon\t" CU459141_m.gff | sort | wc -l  #3760
    
     # -- DEBUG_2: combination of 'CU459141_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
     --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  6. nextflow run

     # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
     (host_env) mv trimmed/*.fastq.gz .
     (host_env) nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
     --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff"        -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
  7. (OPTIONAL, since the R-script complete_deg_pipeline_custom_cutoff.R runs completely) Generate PCA-plot

     #mamba activate r_env
    
     #install.packages("ggfun")
     # Import the required libraries
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library(gplots)
     library(tximport)
     library(DESeq2)
     #library("org.Hs.eg.db")
     library(dplyr)
     library(tidyverse)
     #install.packages("devtools")
     #devtools::install_version("gtable", version = "0.3.0")
     library(gplots)
     library("RColorBrewer")
     #install.packages("ggrepel")
     library("ggrepel")
     # install.packages("openxlsx")
     library(openxlsx)
     library(EnhancedVolcano)
     library(DESeq2)
     library(edgeR)
    
     setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon")
     # Define paths to your Salmon output quantification files
    
     # Store sample names in a character vector
     samples <- c(
         "AYE-O_Azi20_solid_r1", "AYE-O_Azi20_solid_r2", "AYE-O_Azi20_solid_r3", "AYE-O_ctr_r1", "AYE-O_ctr_r2",
         "AYE-O_ctr_r3", "AYE-O_ctr_solid_r1", "AYE-O_ctr_solid_r2", "AYE-O_ctr_solid_r3",
         "AYE-O_Diclo375_r1", "AYE-O_Diclo375_r2", "AYE-O_Diclo375_r3", "AYE-O_Mero0.5_r1",
         "AYE-O_Mero0.5_r2", "AYE-O_Mero0.5_r3", "AYE-O_Rifampicin2_r1", "AYE-O_Rifampicin2_r2",
         "AYE-O_Rifampicin2_r3", "AYE-T_Azi20_solid_r1", "AYE-T_Azi20_solid_r2", "AYE-T_Azi20_solid_r3",
         "AYE-T_ctr_r1", "AYE-T_ctr_r2", "AYE-T_ctr_r3", "AYE-T_ctr_solid_r1", "AYE-T_ctr_solid_r2",
         "AYE-T_ctr_solid_r3", "AYE-T_Diclo375_r1", "AYE-T_Diclo375_r2", "AYE-T_Diclo375_r3",
         "AYE-T_Mero0.15_r1", "AYE-T_Mero0.15_r2", "AYE-T_Mero0.15_r3", "AYE-T_Rifampicin2_r1",
         "AYE-T_Rifampicin2_r2", "AYE-T_Rifampicin2_r3", "AYE-WT_Azi20_solid_r1", "AYE-WT_Azi20_solid_r2",
         "AYE-WT_Azi20_solid_r3", "AYE-WT_ctr_r1", "AYE-WT_ctr_r2", "AYE-WT_ctr_r3", "AYE-WT_ctr_solid_r1",
         "AYE-WT_ctr_solid_r2", "AYE-WT_ctr_solid_r3", "AYE-WT_Diclo1250_solid_r1", "AYE-WT_Diclo1250_solid_r2",
         "AYE-WT_Diclo1250_solid_r3", "AYE-WT_Diclo750_r1", "AYE-WT_Diclo750_r2", "AYE-WT_Diclo750_r3",
         "AYE-WT_Mero0.35-0.5_r1", "AYE-WT_Mero0.35-0.5_r2", "AYE-WT_Mero0.35-0.5_r3", "AYE-WT_Rifampicin1.5_r1",
         "AYE-WT_Rifampicin1.5_r2", "AYE-WT_Rifampicin1.5_r3", "F_Azi20_solid_r1", "F_Azi20_solid_r2",
         "F_Azi20_solid_r3", "F_ctr_solid_r1", "F_ctr_solid_r2", "F_ctr_solid_r3",
         "O-Trans_ctr_r1","O-Trans_ctr_r2","O-Trans_ctr_r3",  "O-Trans_Diclo375_r1","O-Trans_Diclo375_r2","O-Trans_Diclo375_r3",
         "O-Trans_Mero0.25_r1", "O-Trans_Mero0.25_r2", "O-Trans_Mero0.25_r3", "O-Trans_Rifampicin2_r1",
         "O-Trans_Rifampicin2_r2", "O-Trans_Rifampicin2_r3", "WT-Trans_ctr_r1", "WT-Trans_ctr_r2",
         "WT-Trans_ctr_r3", "WT-Trans_Diclo750_r1", "WT-Trans_Diclo750_r2", "WT-Trans_Diclo750_r3"
     )
    
     # Automatically generate the named vector
     files <- setNames(paste0("./", samples, "/quant.sf"), samples)
     # Import the transcript abundance data with tximport
     txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
     # -----------------------------------------------------------------
     # ---- Step 1: Create Detailed Metadata from Your Sample Names ----
    
     # Extract metadata from sample names
     samples <- names(files)
    
     # Parse the complex sample names
     metadata <- data.frame(
     sample = samples,
     stringsAsFactors = FALSE
     )
    
     # Extract strain (everything before first underscore or hyphen treatment)
     metadata$strain <- sapply(strsplit(samples, "[-_]"), function(x) {
     if(x[1] %in% c("AYE", "O", "WT", "F")) {
         if(x[1] == "AYE" && length(x) > 1 && x[2] %in% c("WT", "T", "O")) {
         paste(x[1:2], collapse = "-")
         } else if(x[1] %in% c("O", "WT") && x[2] == "Trans") {
         paste(x[1:2], collapse = "-")
         } else {
         x[1]
         }
     } else {
         x[1]
     }
     })
    
     # Extract treatment type
     metadata$treatment <- sapply(samples, function(x) {
         if(grepl("_ctr", x)) return("ctrl")
         if(grepl("Diclo", x)) return("Diclo")
         if(grepl("Mero", x)) return("Mero")
         if(grepl("Azi", x)) return("Azi")
         if(grepl("Rifampicin", x)) return("Rifampicin")
         return("ctrl")
     })
    
     # Extract concentration
     metadata$concentration <- sapply(samples, function(x) {
         if(grepl("Diclo1250", x)) return("1250")
         if(grepl("Diclo750", x)) return("750")
         if(grepl("Diclo375", x)) return("375")
         if(grepl("Mero0.5", x)) return("0.5")
         if(grepl("Mero0.35", x)) return("0.35")
         if(grepl("Mero0.25", x)) return("0.25")
         if(grepl("Mero0.15", x)) return("0.15")
         if(grepl("Azi20", x)) return("20")
         if(grepl("Rifampicin2", x)) return("2")
         if(grepl("Rifampicin1.5", x)) return("1.5")
         return("0")
     })
    
     # Extract condition (solid vs liquid)
     metadata$condition <- ifelse(grepl("_solid", samples), "solid", "liquid")
    
     # Extract replicate
     metadata$replicate <- sapply(strsplit(samples, "_"), function(x) {
     rep_part <- x[length(x)]
     gsub("r", "", rep_part)
     })
    
     # Create combined group for easy comparisons
     metadata$group <- paste(metadata$strain, metadata$treatment, metadata$concentration, sep = "_")
    
     # Set row names
     rownames(metadata) <- metadata$sample
    
     # Reorder to match txi columns
     metadata <- metadata[colnames(txi$counts), ]
    
     # ---------------------------------------------
     # ---- Step 2: Choose Your Design Strategy ----
    
     # Strategy A: Full Factorial Design (if balanced)
     #dds <- DESeqDataSetFromTximport(txi, metadata,
     #                         design = ~ strain + treatment + condition)
    
     # --> Strategy B: Combined Group Factor ⭐ RECOMMENDED
     metadata$group <- factor(paste(metadata$strain,
                                     metadata$treatment,
                                     metadata$concentration,
                                     metadata$condition,
                                     sep = "_"))
    
     dds <- DESeqDataSetFromTximport(txi, metadata,
                                     design = ~ group)
     dds <- DESeq(dds)
    
     # See all available comparisons
     resultsNames(dds)
    
     # -------------------------------------------------------------
     # ---- Step 3: Set Up Specific Comparisons from Your Notes ----
     # ==========================================
     # 1. Define Exact Comparisons from Your Notes
     # ==========================================
     planned_comparisons <- list(
     # --- Baseline / Strain Controls ---
     AYE_T_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-T_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_WT_ctr          = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_WT_ctr         = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_T                 = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-T_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_T               = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_T              = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Condition Effects (Solid vs Liquid) ---
     AYE_WT_ctr_solid_vs_AYE_WT_ctr     = list(treat = "AYE-WT_ctrl_0_solid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_O_ctr       = list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-O_ctrl_0_liquid"),
     AYE_T_ctr_solid_vs_AYE_T_ctr       = list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Diclofenac ---
     AYE_WT_Diclo750_vs_AYE_WT_ctr      = list(treat = "AYE-WT_Diclo_750_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-T_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-O_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_AYE_WT_ctr     = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_AYE_WT_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     Diclo_AYE_WT_1250_solid_vs_solid_ctr = list(treat = "AYE-WT_Diclo_1250_solid", ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Meropenem ---
     AYE_WT_Mero_vs_AYE_WT_ctr          = list(treat = "AYE-WT_Mero_0.35_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-O_Mero_0.5_liquid",       ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Mero_vs_AYE_WT_ctr         = list(treat = "O-Trans_Mero_0.25_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_T_ctr            = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Azithromycin (Solid) ---
     AYE_WT_Azi_vs_solid_ctr            = list(treat = "AYE-WT_Azi_20_solid", ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_Azi_vs_solid_ctr             = list(treat = "AYE-T_Azi_20_solid",  ctrl = "AYE-T_ctrl_0_solid"),
     AYE_O_Azi_vs_solid_ctr             = list(treat = "AYE-O_Azi_20_solid",  ctrl = "AYE-O_ctrl_0_solid"),
     F_Azi_vs_F_solid_ctr               = list(treat = "F_Azi_20_solid",      ctrl = "F_ctrl_0_solid"),
    
     # --- Rifampicin ---
     AYE_WT_Rif_vs_AYE_WT_ctr           = list(treat = "AYE-WT_Rifampicin_1.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Rif_vs_AYE_T_ctr             = list(treat = "AYE-T_Rifampicin_2_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Rif_vs_AYE_O_ctr             = list(treat = "AYE-O_Rifampicin_2_liquid",    ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Rif_vs_O_Trans_ctr         = list(treat = "O-Trans_Rifampicin_2_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # Additional Analyses
     planned_comparisons <- list(
     # --- Diclofenac_2 ---
     # * AYE-T_Diclo_375_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Diclo_375_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Diclo_375_liquid vs AYE-O-Trans_ctrl_0_liquid
     # * WT-Trans_Diclo_750_liquid vs AYE-WT-Trans_ctrl_0_liquid
     AYE_T_Diclo375_vs_AYE_T_ctr        = list(treat = "AYE-T_Diclo_375_liquid",        ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_O_ctr        = list(treat = "AYE-O_Diclo_375_liquid",        ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_O_Trans_ctr    = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "O-Trans_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_WT_Trans_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "WT-Trans_ctrl_0_liquid"),
    
     # --- Meropenem_2 ---
     # * AYE-T_Mero_0.15_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Mero_0.5_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Mero_0.25_liquid vs AYE-O-Trans_ctrl_0_liquid
     AYE_T_Mero_vs_AYE_T_ctr        = list(treat = "AYE-T_Mero_0.15_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_O_ctr        = list(treat = "AYE-O_Mero_0.5_liquid",     ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Mero_vs_O_Trans_ctr    = list(treat = "O-Trans_Mero_0.25_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # ==========================================
     # 2. Verification & Validation Script
     # ==========================================
     # Identify which column in colData holds your group names
     group_col <- if("group" %in% colnames(colData(dds))) "group" else
                 if("treatment" %in% colnames(colData(dds))) "treatment" else
                 stop("❌ Please specify the correct colData column containing group names.")
    
     actual_groups <- unique(colData(dds)[[group_col]])
    
     cat("\n", paste(rep("=", 85), collapse=""), "\n")
     cat("📋 VERIFICATION OF NOTE-DERIVED COMPARISONS\n")
     cat(paste(rep("=", 85), collapse=""), "\n\n")
    
     validation_results <- data.frame(
     Comparison_Name = character(),
     Treatment_String = character(),
     Control_String = character(),
     Status = character(),
     Suggested_Contrast = character(),
     stringsAsFactors = FALSE
     )
    
     for(name in names(planned_comparisons)) {
     trt <- planned_comparisons[[name]]$treat
     ctl <- planned_comparisons[[name]]$ctrl
    
     # Find closest matches in actual data
     trt_match <- actual_groups[grepl(trt, actual_groups, fixed = TRUE)]
     ctl_match <- actual_groups[grepl(ctl, actual_groups, fixed = TRUE)]
    
     status <- if(length(trt_match) > 0 && length(ctl_match) > 0) "✅ VALID" else "⚠️  CHECK"
     contrast_str <- if(status == "✅ VALID")
         paste0('c("', group_col, '", "', trt_match[1], '", "', ctl_match[1], '")') else "N/A"
    
     validation_results <- rbind(validation_results, data.frame(
         Comparison_Name = name,
         Treatment_String = trt,
         Control_String = ctl,
         Status = status,
         Suggested_Contrast = contrast_str,
         stringsAsFactors = FALSE
     ))
    
     cat(sprintf("%-45s | T:%-25s C:%-20s | %s\n", name, trt, ctl, status))
     if(status == "⚠️  CHECK") {
         if(length(trt_match) == 0) cat("   🔍 Treat not found. Closest: ", paste(head(actual_groups[grepl(strsplit(trt, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
         if(length(ctl_match) == 0) cat("   🔍 Ctrl not found.  Closest: ", paste(head(actual_groups[grepl(strsplit(ctl, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
     }
     }
    
     # ==========================================
     # 3. Auto-Generate DESeq2 results() Calls (Optional)
     # ==========================================
     valid_comparisons <- validation_results[validation_results$Status == "✅ VALID", ]
     if(nrow(valid_comparisons) > 0) {
     cat("\n📜 READY-TO-RUN DESeq2 CONTRASTS:\n")
     cat(paste(rep("-", 60), collapse=""), "\n")
     for(i in seq_len(nrow(valid_comparisons))) {
         cat(sprintf('res_%s <- results(dds, contrast = %s)\n',
                     gsub("[^A-Za-z0-9]", "_", valid_comparisons$Comparison_Name[i]),
                     valid_comparisons$Suggested_Contrast[i]))
     }
     } else {
     cat("\n⚠️  No exact matches found. Check your colData group naming convention.\n")
     }
    
     # -----------------------------
     # ---- Step 4: PCA figures ----
    
     # 🔍 What each figure shows:
     #
     #    01_PCA_by_Strain.png → Tests if genetic background (AYE-WT, AYE-T, AYE-O, Trans, F) is the dominant source of variation.
     #    02_PCA_by_Treatment.png → Shows clustering by antibiotic/drug exposure (ctrl, Diclo, Mero, Azi, Rifampicin).
     #    03_PCA_by_Condition.png → Reveals batch/growth media effects (solid vs liquid).
     #    04_PCA_CombinedGroups.png → Full experimental grouping with labeled sample names for quick outlier detection.
     #    05_PCA_Ellipses.png → Adds 95% confidence boundaries per strain to visualize group spread and overlap.
     #
     # ⚠️ Quick Checklist Before Running:
     #
     #    Ensure metadata columns (strain, treatment, condition, group) are attached to colData(dds).
     #    If ggrepel is missing, run install.packages("ggrepel").
     #    All PNGs will save to your current working directory (getwd()).
    
     # Install if missing: install.packages(c("ggplot2", "ggrepel"))
     library(DESeq2)
     library(ggplot2)
     library(ggrepel)
    
     # 1. Variance Stabilizing Transformation & Extract PCA Data
     vsd <- vst(dds, blind = FALSE)
     pca_data <- plotPCA(vsd, intgroup = c("strain", "treatment", "condition", "group"), returnData = TRUE)
     percent_var <- round(100 * attr(pca_data, "percentVar"))
    
     # Consistent theme for all plots
     base_theme <- theme_bw(base_size = 12) +
     theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
             legend.position = "right",
             legend.title = element_text(face = "bold"),
             panel.grid.major = element_line(color = "grey90"),
             panel.grid.minor = element_blank())
    
     # --- Plot 1: Colored by Strain ---
     p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20, show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Strain",
         color = "Strain", shape = "Condition") +
     base_theme
     ggsave("01_PCA_by_Strain.png", p1, width = 8, height = 6, dpi = 300)
    
     # --- Plot 2: Colored by Treatment ---
     p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = treatment, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Treatment",
         color = "Treatment", shape = "Condition") +
     base_theme
     ggsave("02_PCA_by_Treatment.png", p2, width = 8, height = 6, dpi = 300)
    
     # --- Plot 3: Colored by Condition (Solid vs Liquid) ---
     p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = strain)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Growth Condition",
         color = "Condition", shape = "Strain") +
     base_theme
     ggsave("03_PCA_by_Condition.png", p3, width = 8, height = 6, dpi = 300)
    
     # --- Plot 4: Combined Groups with Sample Labels ---
     p4 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = group)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2, max.overlaps = 30, box.padding = 0.3) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Combined Experimental Groups",
         color = "Group") +
     base_theme + theme(legend.position = "none")
     ggsave("04_PCA_CombinedGroups.png", p4, width = 9, height = 7, dpi = 300)
    
     # --- Plot 5: 95% Confidence Ellipses (by Strain) ---
     p5 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, fill = strain)) +
     geom_point(size = 3, alpha = 0.7) +
     stat_ellipse(level = 0.95, alpha = 0.2, geom = "polygon", show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: 95% Confidence Ellipses by Strain",
         color = "Strain", fill = "Strain") +
     base_theme
     ggsave("05_PCA_Ellipses.png", p5, width = 8, height = 6, dpi = 300)
    
     message("✅ All 5 PCA plots saved to working directory!")
  8. Run Differential Expression & PCA Analysis Complete

     (r_env) cd ~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/
     #(r_env) Rscript complete_deg_pipeline.R  #For standard cutoff in the project
    
     # Adapted the script to the following requests:
     # (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
     # (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
     # (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.
    
     # How it works:
     #   * Rifampicin: The script looks for "Rif" in the comparison name (e.g., 28_AYE_WT_Rif_vs_Ctrl) and applies |log2FC| >= 1.2.
     #   * Baseline/Strain Controls: The script looks for "_ctr_vs_" in the comparison name (e.g., 01_AYE_T_ctr_vs_AYE_WT_ctr) and applies |log2FC| >= 1.4.
     #   * All Others: Falls back to the original 2.0 cutoff.
     #   * The console output will now explicitly print which cutoff is being used for each specific comparison.
    
     (r_env) Rscript complete_deg_pipeline_custom_cutoff.R
  9. LOG of Rscript complete_deg_pipeline_custom_cutoff.R

     -rw-rw-r-- 1 jhuang jhuang 355609 Jun 22 12:53 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 253271 Jun 22 12:53 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 368843 Jun 22 12:53 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 358124 Jun 22 12:53 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 347126 Jun 22 12:53 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 283473 Jun 22 12:53 Volcano_31_O_Trans_Rif_vs_Ctrl.png
    
     -rw-rw-r-- 1 jhuang jhuang 352375 Jun 22 12:53 DEG_32_AYE_T_Diclo375_vs_AYE_T_ctr.xlsx
     -rw-rw-r-- 1 jhuang jhuang 257075 Jun 22 12:53 Volcano_32_AYE_T_Diclo375_vs_AYE_T_ctr.png
    
     -rw-rw-r-- 1 jhuang jhuang  355610 Jun  5 16:26 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  254804 Jun  5 16:26 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  368843 Jun  5 16:26 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  358417 Jun  5 16:26 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  347127 Jun  5 16:26 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  284084 Jun  5 16:26 Volcano_31_O_Trans_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang    1579 Jun  5 16:26 DEG_Summary_All_31.csv
    
     (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon$ Rscript complete_deg_pipeline_custom_cutoff.R
     There were 22 warnings (use warnings() to see them)
     📖 Parsing annotation file to build tx2gene mapping...
     ✅ Created tx2gene mapping with 7520 entries.
     📥 Importing Salmon quantifications...
     reading in files with read_tsv
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
     removing duplicated transcript rows from tx2gene
     summarizing abundance
     summarizing counts
     summarizing length
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using counts and average transcript lengths from tximport
     🚀 Running DESeq2 pipeline...
     estimating size factors
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using 'avgTxLength' from assays(dds), correcting for library size
     estimating dispersions
     gene-wise dispersion estimates
     mean-dispersion relationship
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     final dispersion estimates
     fitting model and testing
     Available groups in dds:
     [1] "AYE-O_Azi_20_solid"           "AYE-O_ctrl_0_liquid"
     [3] "AYE-O_ctrl_0_solid"           "AYE-O_Diclo_375_liquid"
     [5] "AYE-O_Mero_0.5_liquid"        "AYE-O_Rifampicin_2_liquid"
     [7] "AYE-T_Azi_20_solid"           "AYE-T_ctrl_0_liquid"
     [9] "AYE-T_ctrl_0_solid"           "AYE-T_Diclo_375_liquid"
     [11] "AYE-T_Mero_0.15_liquid"       "AYE-T_Rifampicin_2_liquid"
     [13] "AYE-WT_Azi_20_solid"          "AYE-WT_ctrl_0_liquid"
     [15] "AYE-WT_ctrl_0_solid"          "AYE-WT_Diclo_1250_solid"
     [17] "AYE-WT_Diclo_750_liquid"      "AYE-WT_Mero_0.35-0.5_liquid"
     [19] "AYE-WT_Rifampicin_1.5_liquid" "F_Azi_20_solid"
     [21] "F_ctrl_0_solid"               "O-Trans_ctrl_0_liquid"
     [23] "O-Trans_Diclo_375_liquid"     "O-Trans_Mero_0.25_liquid"
     [25] "O-Trans_Rifampicin_2_liquid"  "WT-Trans_ctrl_0_liquid"
     [27] "WT-Trans_Diclo_750_liquid"
     📖 Parsing annotation file for gene names...
    
     🚀 PROCESSING  38  COMPARISONS
     Base Thresholds: padj <= 0.05, |log2FC| >= 2 (Dynamic adjustments apply per comparison)
     ================================================================================
    
     [01/38] 01_AYE_T_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 16, Down: 151)
    
     [02/38] 02_AYE_O_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 9, Down: 9)
    
     [03/38] 03_O_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 44)
    
     [04/38] 04_WT_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 15, Down: 153)
    
     [05/38] 05_AYE_O_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 28, Down: 3)
    
     [06/38] 06_O_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 11, Down: 10)
    
     [07/38] 07_WT_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 4, Down: 14)
    
     [08/38] 08_AYE_WT_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 485, Down: 324)
    
     [09/38] 09_AYE_O_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 475, Down: 309)
    
     [10/38] 10_AYE_T_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 501, Down: 332)
    
     [11/38] 11_AYE_O_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 5)
    
     [12/38] 12_AYE_T_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 5, Down: 8)
    
     [13/38] 13_AYE_WT_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 45)
    
     [14/38] 14_AYE_T_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 75, Down: 181)
    
     [15/38] 15_AYE_O_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 26, Down: 240)
    
     [16/38] 16_O_Trans_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 17, Down: 82)
    
     [17/38] 17_WT_Trans_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 25, Down: 230)
    
     [18/38] 18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 66, Down: 26)
    
     [19/38] 19_AYE_WT_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 359, Down: 175)
    
     [20/38] 20_AYE_T_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 93, Down: 110)
    
     [21/38] 21_AYE_O_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 267, Down: 244)
    
     [22/38] 22_O_Trans_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 183, Down: 153)
    
     [23/38] 23_AYE_T_Mero_vs_AYE_T_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [24/38] 24_AYE_WT_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 341, Down: 329)
    
     [25/38] 25_AYE_T_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 510, Down: 433)
    
     [26/38] 26_AYE_O_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 484, Down: 443)
    
     [27/38] 27_F_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 147, Down: 271)
    
     [28/38] 28_AYE_WT_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 11)
    
     [29/38] 29_AYE_T_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 34, Down: 41)
    
     [30/38] 30_AYE_O_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 71, Down: 100)
    
     [31/38] 31_O_Trans_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 6, Down: 10)
    
     [32/38] 32_AYE_T_Diclo375_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 47, Down: 6)
    
     [33/38] 33_AYE_O_Diclo375_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 73)
    
     [34/38] 34_O_Trans_Diclo375_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 2, Down: 17)
    
     [35/38] 35_WT_Trans_Diclo750_vs_WT_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 27)
    
     [36/38] 36_AYE_T_Mero_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [37/38] 37_AYE_O_Mero_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 242, Down: 135)
    
     [38/38] 38_O_Trans_Mero_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 177, Down: 50)
    
     ================================================================================
     📊 FINAL SUMMARY OF ALL 38 COMPARISONS
                                         name total  up down sig_total pct_sig
               # |log2FC| >= 1.4
                   01_AYE_T_ctr_vs_AYE_WT_ctr  3609  16  151       167     4.6
                   02_AYE_O_ctr_vs_AYE_WT_ctr  3609   9    9        18     0.5
                  03_O_Trans_ctr_vs_AYE_WT_ctr  3609  10   44        54     1.5
               04_WT_Trans_ctr_vs_AYE_WT_ctr  3609  15  153       168     4.7
                   05_AYE_O_ctr_vs_AYE_T_ctr  3609  28    3        31     0.9
                 06_O_Trans_ctr_vs_AYE_T_ctr  3609  11   10        21     0.6
                 07_WT_Trans_ctr_vs_AYE_T_ctr  3609   4   14        18     0.5
    
               # |log2FC| >= 2
               08_AYE_WT_ctr_solid_vs_liquid  3609 485  324       809    22.4
                 09_AYE_O_ctr_solid_vs_liquid  3609 475  309       784    21.7
                 10_AYE_T_ctr_solid_vs_liquid  3609 501  332       833    23.1
           11_AYE_O_ctr_solid_vs_AYE_WT_solid  3609  12    5        17     0.5
           12_AYE_T_ctr_solid_vs_AYE_WT_solid  3609   5    8        13     0.4
                   13_AYE_WT_Diclo750_vs_Ctrl  3609  12   45        57     1.6
                   14_AYE_T_Diclo375_vs_Ctrl  3609  75  181       256     7.1
                   15_AYE_O_Diclo375_vs_Ctrl  3609  26  240       266     7.4
                 16_O_Trans_Diclo375_vs_Ctrl  3609  17   82        99     2.7
                 17_WT_Trans_Diclo750_vs_Ctrl  3609  25  230       255     7.1
     18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid  3609  66   26        92     2.5
                       19_AYE_WT_Mero_vs_Ctrl  3609 359  175       534    14.8
                       20_AYE_T_Mero_vs_Ctrl  3609  93  110       203     5.6
                       21_AYE_O_Mero_vs_Ctrl  3609 267  244       511    14.2
                     22_O_Trans_Mero_vs_Ctrl  3609 183  153       336     9.3
                 23_AYE_T_Mero_vs_AYE_T_Ctrl  3609  91   27       118     3.3
           24_AYE_WT_Azi_solid_vs_Ctrl_solid  3609 341  329       670    18.6
             25_AYE_T_Azi_solid_vs_Ctrl_solid  3609 510  433       943    26.1
             26_AYE_O_Azi_solid_vs_Ctrl_solid  3609 484  443       927    25.7
                 27_F_Azi_solid_vs_Ctrl_solid  3609 147  271       418    11.6
    
               # |log2FC| >= 1.2
                       28_AYE_WT_Rif_vs_Ctrl  3609   8   11        19     0.5
                         29_AYE_T_Rif_vs_Ctrl  3609  34   41        75     2.1
                         30_AYE_O_Rif_vs_Ctrl  3609  71  100       171     4.7
                       31_O_Trans_Rif_vs_Ctrl  3609   6   10        16     0.4
    
               # |log2FC| >= 2
                                         name total  up down sig_total
               32_AYE_T_Diclo375_vs_AYE_T_ctr  3609  46    6        52
               33_AYE_O_Diclo375_vs_AYE_O_ctr  3609   8   73        81     2.2
           34_O_Trans_Diclo375_vs_O_Trans_ctr  3609   2   17        19     0.5
         35_WT_Trans_Diclo750_vs_WT_Trans_ctr  3609  10   27        37     1.0
                   36_AYE_T_Mero_vs_AYE_T_ctr  3609  91   27       118     3.3
                   37_AYE_O_Mero_vs_AYE_O_ctr  3609 242  135       377    10.4
               38_O_Trans_Mero_vs_O_Trans_ctr  3609 177   50       227     6.3
    
     ✨ All files saved to: /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete
     📁 Each comparison contains:
         1️ DEG_
    .xlsx (3 sheets: Complete_Results, Up_Regulated, Down_Regulated) 2️ Volcano_ .png (GeneName labels for top significant genes) 📤 EXPORTING COMPLETE RESULTS TO CSV FOR KEGG/GO… ✅ Successfully exported 38 CSV files to DEG_Results_Complete
  10. KEGG and GO annotations in non-model organisms

    (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and 1.4 and < -1.4 for the KEGG and GO analyses. (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

* Preparing file 1 eggnog_out.emapper.annotations.txt for the R-code below: (KEGG Terms): EggNog based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download CU459141_protein_.fasta from NCBI
        python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta
        emapper.py -i CU459141_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.

    Extract KEGG KO IDs from annotations.emapper.annotations.

* Preparing file 2 blast2go_annot.annot2_ for the R-code below:

  - Basic (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    * Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

  - Advanced (GO Terms from 'Blast2GO 5 Basic'): Interpro based protein families / domains --> Button interpro

    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

  - MERGE the results of InterPro GO IDs (advanced) to GO IDs (basic) and generate final GO IDs, saved in blast2go_annot.annot2

    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)."
    * (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

  - PREPARING go_terms and ec_terms: annot_* file (NOTE that blast2go_annot.annot2 is after merging InterPro_GO_IDs and GO_IDs):

    cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_

10.2. Perform KEGG and GO Enrichment in R

      (r_env) cd /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete

      #For |deg_cutoff_log_foldchange| >=1.4
      sed "s/01_AYE_T_ctr_vs_AYE_WT_ctr/02_AYE_O_ctr_vs_AYE_WT_ctr/g" 1.R > 2.R
      ...

      #For |deg_cutoff_log_foldchange| >=2.0
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/09_AYE_O_ctr_solid_vs_liquid/g" 8.R > 9.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/10_AYE_T_ctr_solid_vs_liquid/g" 8.R > 10.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/11_AYE_O_ctr_solid_vs_AYE_WT_solid/g" 8.R > 11.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/12_AYE_T_ctr_solid_vs_AYE_WT_solid/g" 8.R > 12.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/13_AYE_WT_Diclo750_vs_Ctrl/g" 8.R > 13.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/14_AYE_T_Diclo375_vs_Ctrl/g" 8.R > 14.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/15_AYE_O_Diclo375_vs_Ctrl/g" 8.R > 15.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/16_O_Trans_Diclo375_vs_Ctrl/g" 8.R > 16.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/17_WT_Trans_Diclo750_vs_Ctrl/g" 8.R > 17.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid/g" 8.R > 18.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/19_AYE_WT_Mero_vs_Ctrl/g" 8.R > 19.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/20_AYE_T_Mero_vs_Ctrl/g" 8.R > 20.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/21_AYE_O_Mero_vs_Ctrl/g" 8.R > 21.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/22_O_Trans_Mero_vs_Ctrl/g" 8.R > 22.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/23_AYE_T_Mero_vs_AYE_T_Ctrl/g" 8.R > 23.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/24_AYE_WT_Azi_solid_vs_Ctrl_solid/g" 8.R > 24.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/25_AYE_T_Azi_solid_vs_Ctrl_solid/g" 8.R > 25.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/26_AYE_O_Azi_solid_vs_Ctrl_solid/g" 8.R > 26.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/27_F_Azi_solid_vs_Ctrl_solid/g" 8.R > 27.R

      #For |deg_cutoff_log_foldchange| >=1.2
      sed "s/28_AYE_WT_Rif_vs_Ctrl/29_AYE_T_Rif_vs_Ctrl/g" 28.R > 29.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/30_AYE_O_Rif_vs_Ctrl/g" 28.R > 30.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/31_O_Trans_Rif_vs_Ctrl/g" 28.R > 31.R

      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 1.R
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid KEGG IDs: 4
      #  Enriched pathways: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 50
      #  Enriched pathways: 4
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid GO IDs: 16
      #  Enriched GO-terms: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 151
      #  Enriched GO-terms: 3
      #...

10.3. Finalizing the KEGG and GO Enrichment table

      1. NOTE (Already realized in the code): geneIDs in KEGG_Enrichment have been already translated from ko to geneID in H0N29_*-format; If not, nachmachen using eggnog-res, 因为 eggnog里有1-1-mspping Info between ko-Name and GeneID.
      2. NEED_MANUAL_DELETION (Already setting the cutoff in the code): p.adjust values have been calculated, we have to filter all records in GO_Enrichment-results by |p.adjust|<=0.05. DON'T_NEED_ANY_MORE, since pvalueCutoff = 0.05 settings in enricher. Alternative using pvalueCutoff=1.0, marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment.
      3. NOTE (Not occuring in the new dataset): In rare case, the description is missing for some IDs, e.g. GO term: GO:0006807: replace GO:0006807    obsolete nitrogen compound metabolic process;  ko00975: Metabolism, Biosynthesis of other secondary metabolites
  1. KEGG and GO enrichments via 1.R, 2.R, …

      #Update "01_AYE_T_ctr_vs_AYE_WT_ctr" with "02_AYE_O_ctr_vs_AYE_WT_ctr" in 1.R and save the updated script as 2.R.
      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 2.R
      === SUMMARY ===
      Up-regulated genes: 9
        Valid KEGG IDs: 1
        Enriched pathways: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 6
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 9
        Valid GO IDs: 9
        Enriched GO-terms: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 9
        Enriched GO-terms: 2
    
      Rscript 3.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 2
        Enriched pathways: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 20
        Enriched pathways: 6
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 44
        Enriched GO-terms: 0
    
      Rscript 4.R
      === SUMMARY ===
      Up-regulated genes: 15
        Valid KEGG IDs: 7
        Enriched pathways: 6
      Down-regulated genes: 152
        Valid KEGG IDs: 43
        Enriched pathways: 11
      === SUMMARY ===
      Up-regulated genes: 15
        Valid GO IDs: 15
        Enriched GO-terms: 0
      Down-regulated genes: 152
        Valid KEGG IDs: 152
        Enriched GO-terms: 0
    
      Rscript 5.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 6.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 7.R
      === SUMMARY ===
      Up-regulated genes: 4
        Valid KEGG IDs: 0
        Enriched pathways: 0
      Down-regulated genes: 14
        Valid KEGG IDs: 10
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 4
        Valid GO IDs: 4
        Enriched GO-terms: 1
      Down-regulated genes: 14
        Valid KEGG IDs: 14
        Enriched GO-terms: 3
    
      Rscript 8.R
      === SUMMARY ===
      Up-regulated genes: 479
        Valid KEGG IDs: 246
        Enriched pathways: 30
      Down-regulated genes: 322
        Valid KEGG IDs: 186
        Enriched pathways: 13
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 479
        Valid GO IDs: 479
        Enriched GO-terms: 4
      Down-regulated genes: 322
        Valid KEGG IDs: 322
        Enriched GO-terms: 6
    
      Rscript 9.R
      === SUMMARY ===
      Up-regulated genes: 472
        Valid KEGG IDs: 230
        Enriched pathways: 25
      Down-regulated genes: 306
        Valid KEGG IDs: 217
        Enriched pathways: 19
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 472
        Valid GO IDs: 472
        Enriched GO-terms: 3
      Down-regulated genes: 306
        Valid KEGG IDs: 306
        Enriched GO-terms: 6
    
      Rscript 10.R
      === SUMMARY ===
      Up-regulated genes: 496
        Valid KEGG IDs: 236
        Enriched pathways: 28
      Down-regulated genes: 330
        Valid KEGG IDs: 250
        Enriched pathways: 26
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 496
        Valid GO IDs: 496
        Enriched GO-terms: 2
      Down-regulated genes: 330
        Valid KEGG IDs: 330
        Enriched GO-terms: 6
    
      Rscript 11.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 2
        Enriched pathways: 5
      Down-regulated genes: 5
        Valid KEGG IDs: 2
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 0
      Down-regulated genes: 5
        Valid KEGG IDs: 5
        Enriched GO-terms: 0
    
      Rscript 12.R
      === SUMMARY ===
      Up-regulated genes: 5
        Valid KEGG IDs: 1
        Enriched pathways: 3
      Down-regulated genes: 8
        Valid KEGG IDs: 3
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 5
        Valid GO IDs: 5
        Enriched GO-terms: 0
      Down-regulated genes: 8
        Valid KEGG IDs: 8
        Enriched GO-terms: 0
    
      Rscript 13.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 45
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 1
      Down-regulated genes: 45
        Valid KEGG IDs: 45
        Enriched GO-terms: 0
    
      Rscript 14.R
      === SUMMARY ===
      Up-regulated genes: 73
        Valid KEGG IDs: 37
        Enriched pathways: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 29
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 73
        Valid GO IDs: 73
        Enriched GO-terms: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 180
        Enriched GO-terms: 0
    
      Rscript 15.R
      === SUMMARY ===
      Up-regulated genes: 26
        Valid KEGG IDs: 14
        Enriched pathways: 10
      Down-regulated genes: 239
        Valid KEGG IDs: 38
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 26
        Valid GO IDs: 26
        Enriched GO-terms: 0
      Down-regulated genes: 239
        Valid KEGG IDs: 239
        Enriched GO-terms: 0
    
      Rscript 16.R
      === SUMMARY ===
      Up-regulated genes: 17
        Valid KEGG IDs: 8
        Enriched pathways: 11
      Down-regulated genes: 82
        Valid KEGG IDs: 14
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 17
        Valid GO IDs: 17
        Enriched GO-terms: 0
      Down-regulated genes: 82
        Valid KEGG IDs: 82
        Enriched GO-terms: 2
    
      Rscript 17.R
      === SUMMARY ===
      Up-regulated genes: 25
        Valid KEGG IDs: 15
        Enriched pathways: 3
      Down-regulated genes: 229
        Valid KEGG IDs: 46
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 25
        Valid GO IDs: 25
        Enriched GO-terms: 2
      Down-regulated genes: 229
        Valid KEGG IDs: 229
        Enriched GO-terms: 1
    
      Rscript 18.R
      === SUMMARY ===
      Up-regulated genes: 66
        Valid KEGG IDs: 18
        Enriched pathways: 8
      Down-regulated genes: 26
        Valid KEGG IDs: 19
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 66
        Valid GO IDs: 66
        Enriched GO-terms: 5
      Down-regulated genes: 26
        Valid KEGG IDs: 26
        Enriched GO-terms: 3
    
      Rscript 19.R
      === SUMMARY ===
      Up-regulated genes: 355
        Valid KEGG IDs: 200
        Enriched pathways: 25
      Down-regulated genes: 175
        Valid KEGG IDs: 70
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 355
        Valid GO IDs: 355
        Enriched GO-terms: 6
      Down-regulated genes: 175
        Valid KEGG IDs: 175
        Enriched GO-terms: 2
    
      Rscript 20.R
      === SUMMARY ===
      Up-regulated genes: 93
        Valid KEGG IDs: 48
        Enriched pathways: 21
      Down-regulated genes: 110
        Valid KEGG IDs: 32
        Enriched pathways: 4
      === SUMMARY ===
      Up-regulated genes: 93
        Valid GO IDs: 93
        Enriched GO-terms: 0
      Down-regulated genes: 110
        Valid KEGG IDs: 110
        Enriched GO-terms: 0
    
      Rscript 21.R
      === SUMMARY ===
      Up-regulated genes: 263
        Valid KEGG IDs: 154
        Enriched pathways: 26
      Down-regulated genes: 244
        Valid KEGG IDs: 72
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 263
        Valid GO IDs: 263
        Enriched GO-terms: 8
      Down-regulated genes: 244
        Valid KEGG IDs: 244
        Enriched GO-terms: 5
    
      Rscript 22.R
      === SUMMARY ===
      Up-regulated genes: 182
        Valid KEGG IDs: 115
        Enriched pathways: 23
      Down-regulated genes: 153
        Valid KEGG IDs: 40
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 182
        Valid GO IDs: 182
        Enriched GO-terms: 3
      Down-regulated genes: 153
        Valid KEGG IDs: 153
        Enriched GO-terms: 2
    
      Rscript 23.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 24.R
      === SUMMARY ===
      Up-regulated genes: 333
        Valid KEGG IDs: 193
        Enriched pathways: 12
      Down-regulated genes: 328
        Valid KEGG IDs: 175
        Enriched pathways: 24
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 333
        Valid GO IDs: 333
        Enriched GO-terms: 4
      Down-regulated genes: 328
        Valid KEGG IDs: 328
        Enriched GO-terms: 4
    
      Rscript 25.R
      === SUMMARY ===
      Up-regulated genes: 499
        Valid KEGG IDs: 264
        Enriched pathways: 19
      Down-regulated genes: 430
        Valid KEGG IDs: 239
        Enriched pathways: 36
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 499
        Valid GO IDs: 499
        Enriched GO-terms: 2
      Down-regulated genes: 430
        Valid KEGG IDs: 430
        Enriched GO-terms: 0
    
      Rscript 26.R
      === SUMMARY ===
      Up-regulated genes: 474
        Valid KEGG IDs: 267
        Enriched pathways: 19
      Down-regulated genes: 440
        Valid KEGG IDs: 216
        Enriched pathways: 32
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 474
        Valid GO IDs: 474
        Enriched GO-terms: 2
      Down-regulated genes: 440
        Valid KEGG IDs: 440
        Enriched GO-terms: 1
    
      Rscript 27.R
      === SUMMARY ===
      Up-regulated genes: 142
        Valid KEGG IDs: 69
        Enriched pathways: 1
      Down-regulated genes: 269
        Valid KEGG IDs: 148
        Enriched pathways: 27
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 142
        Valid GO IDs: 142
        Enriched GO-terms: 5
      Down-regulated genes: 269
        Valid KEGG IDs: 269
        Enriched GO-terms: 3
    
      Rscript 28.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 6
        Enriched pathways: 2
      Down-regulated genes: 11
        Valid KEGG IDs: 3
        Enriched pathways: 0
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 11
        Valid KEGG IDs: 11
        Enriched GO-terms: 0
    
      Rscript 29.R
      === SUMMARY ===
      Up-regulated genes: 34
        Valid KEGG IDs: 11
        Enriched pathways: 6
      Down-regulated genes: 41
        Valid KEGG IDs: 22
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 34
        Valid GO IDs: 34
        Enriched GO-terms: 0
      Down-regulated genes: 41
        Valid KEGG IDs: 41
    
      Rscript 30.R
      === SUMMARY ===
      Up-regulated genes: 70
        Valid KEGG IDs: 47
        Enriched pathways: 16
      Down-regulated genes: 99
        Valid KEGG IDs: 40
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 70
        Valid GO IDs: 70
        Enriched GO-terms: 0
      Down-regulated genes: 99
        Valid KEGG IDs: 99
        Enriched GO-terms: 0
    
      Rscript 31.R
      === SUMMARY ===
      Up-regulated genes: 6
        Valid KEGG IDs: 3
        Enriched pathways: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 6
        Valid GO IDs: 6
        Enriched GO-terms: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 10
        Enriched GO-terms: 5
    
      Rscript 32.R
      === SUMMARY ===
      Up-regulated genes: 46
        Valid KEGG IDs: 18
        Enriched pathways: 1
      Down-regulated genes: 6
        Valid KEGG IDs: 0
        Enriched pathways: 0
      No gene sets have size between 10 and 500 ...
      --> return NULL...
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 46
        Valid GO IDs: 46
        Enriched GO-terms: 4
      Down-regulated genes: 6
        Valid KEGG IDs: 6
        Enriched GO-terms: 0
    
      Rscript 33.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 9
        Enriched pathways: 8
      Down-regulated genes: 72
        Valid KEGG IDs: 7
        Enriched pathways: 3
    
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 72
        Valid KEGG IDs: 72
        Enriched GO-terms: 0
    
      Rscript 34.R
      === SUMMARY ===
      Up-regulated genes: 2
        Valid KEGG IDs: 2
        Enriched pathways: 3
      Down-regulated genes: 17
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 2
        Valid GO IDs: 2
        Enriched GO-terms: 0
      Down-regulated genes: 17
        Valid KEGG IDs: 17
        Enriched GO-terms: 0
    
      Rscript 35.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 5
      Down-regulated genes: 27
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 0
    
      Rscript 36.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 37.R
      === SUMMARY ===
      Up-regulated genes: 238
        Valid KEGG IDs: 141
        Enriched pathways: 28
      Down-regulated genes: 134
        Valid KEGG IDs: 35
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 238
        Valid GO IDs: 238
        Enriched GO-terms: 2
      Down-regulated genes: 134
        Valid KEGG IDs: 134
        Enriched GO-terms: 4
    
      Rscript 38.R
      === SUMMARY ===
      Up-regulated genes: 177
        Valid KEGG IDs: 104
        Enriched pathways: 22
      Down-regulated genes: 50
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 177
        Valid GO IDs: 177
        Enriched GO-terms: 2
      Down-regulated genes: 50
        Valid KEGG IDs: 50
        Enriched GO-terms: 4
  2. (DEPRECATED due to be WRONG and TIME-CONSUMING) KEGG and GO enrichments by replacing the sample name in the R-code, and by copying and pasting the R-code directly on console

      # For your reference, here is the exact list of the 31 comparisons and their assigned groupings:
    
      # Baseline / Strain Controls (|log2FC| >= 1.4)
      "01_AYE_T_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-T_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "02_AYE_O_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-O_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "03_O_Trans_ctr_vs_AYE_WT_ctr"     = c("group", "O-Trans_ctrl_0_liquid", "AYE-WT_ctrl_0_liquid"),
      "04_WT_Trans_ctr_vs_AYE_WT_ctr"    = c("group", "WT-Trans_ctrl_0_liquid","AYE-WT_ctrl_0_liquid"),
      "05_AYE_O_ctr_vs_AYE_T_ctr"        = c("group", "AYE-O_ctrl_0_liquid",   "AYE-T_ctrl_0_liquid"),
      "06_O_Trans_ctr_vs_AYE_T_ctr"      = c("group", "O-Trans_ctrl_0_liquid", "AYE-T_ctrl_0_liquid"),
      "07_WT_Trans_ctr_vs_AYE_T_ctr"     = c("group", "WT-Trans_ctrl_0_liquid","AYE-T_ctrl_0_liquid"),
    
      # Condition Effects (|log2FC| >= 2.0)
      "08_AYE_WT_ctr_solid_vs_liquid"    = c("group", "AYE-WT_ctrl_0_solid",   "AYE-WT_ctrl_0_liquid"),
      "09_AYE_O_ctr_solid_vs_liquid"     = c("group", "AYE-O_ctrl_0_solid",    "AYE-O_ctrl_0_liquid"),
      "10_AYE_T_ctr_solid_vs_liquid"     = c("group", "AYE-T_ctrl_0_solid",    "AYE-T_ctrl_0_liquid"),
      "11_AYE_O_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-O_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
      "12_AYE_T_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-T_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
    
      # Diclofenac (|log2FC| >= 2.0)
      "13_AYE_WT_Diclo750_vs_Ctrl"       = c("group", "AYE-WT_Diclo_750_liquid",   "AYE-WT_ctrl_0_liquid"),
      "14_AYE_T_Diclo375_vs_Ctrl"        = c("group", "AYE-T_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "15_AYE_O_Diclo375_vs_Ctrl"        = c("group", "AYE-O_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "16_O_Trans_Diclo375_vs_Ctrl"      = c("group", "O-Trans_Diclo_375_liquid",  "AYE-WT_ctrl_0_liquid"),
      "17_WT_Trans_Diclo750_vs_Ctrl"     = c("group", "WT-Trans_Diclo_750_liquid", "AYE-WT_ctrl_0_liquid"),
      "18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid" = c("group", "AYE-WT_Diclo_1250_solid", "AYE-WT_ctrl_0_solid"),
    
      # Meropenem (|log2FC| >= 2.0)
      "19_AYE_WT_Mero_vs_Ctrl"           = c("group", "AYE-WT_Mero_0.35-0.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "20_AYE_T_Mero_vs_Ctrl"            = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-WT_ctrl_0_liquid"),
      "21_AYE_O_Mero_vs_Ctrl"            = c("group", "AYE-O_Mero_0.5_liquid",       "AYE-WT_ctrl_0_liquid"),
      "22_O_Trans_Mero_vs_Ctrl"          = c("group", "O-Trans_Mero_0.25_liquid",    "AYE-WT_ctrl_0_liquid"),
      "23_AYE_T_Mero_vs_AYE_T_Ctrl"      = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-T_ctrl_0_liquid"),
    
      # Azithromycin (Solid) (|log2FC| >= 2.0)
      "24_AYE_WT_Azi_solid_vs_Ctrl_solid"= c("group", "AYE-WT_Azi_20_solid", "AYE-WT_ctrl_0_solid"),
      "25_AYE_T_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-T_Azi_20_solid",  "AYE-T_ctrl_0_solid"),
      "26_AYE_O_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-O_Azi_20_solid",  "AYE-O_ctrl_0_solid"),
      "27_F_Azi_solid_vs_Ctrl_solid"     = c("group", "F_Azi_20_solid",      "F_ctrl_0_solid"),
    
      # Rifampicin (|log2FC| >= 1.2)
      "28_AYE_WT_Rif_vs_Ctrl"            = c("group", "AYE-WT_Rifampicin_1.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "29_AYE_T_Rif_vs_Ctrl"             = c("group", "AYE-T_Rifampicin_2_liquid",    "AYE-T_ctrl_0_liquid"),
      "30_AYE_O_Rif_vs_Ctrl"             = c("group", "AYE-O_Rifampicin_2_liquid",    "AYE-O_ctrl_0_liquid"),
      "31_O_Trans_Rif_vs_Ctrl"           = c("group", "O-Trans_Rifampicin_2_liquid",  "O-Trans_ctrl_0_liquid")
    
      ./DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv
      ./DEG_02_AYE_O_ctr_vs_AYE_WT_ctr.csv
      ./DEG_03_O_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_04_WT_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_05_AYE_O_ctr_vs_AYE_T_ctr.csv
      ./DEG_06_O_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_07_WT_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_08_AYE_WT_ctr_solid_vs_liquid.csv
      ./DEG_09_AYE_O_ctr_solid_vs_liquid.csv
      ./DEG_10_AYE_T_ctr_solid_vs_liquid.csv
      ./DEG_11_AYE_O_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_12_AYE_T_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_13_AYE_WT_Diclo750_vs_Ctrl.csv
      ./DEG_14_AYE_T_Diclo375_vs_Ctrl.csv
      ./DEG_15_AYE_O_Diclo375_vs_Ctrl.csv
      ./DEG_16_O_Trans_Diclo375_vs_Ctrl.csv
      ./DEG_17_WT_Trans_Diclo750_vs_Ctrl.csv
      ./DEG_18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid.csv
      ./DEG_19_AYE_WT_Mero_vs_Ctrl.csv
      ./DEG_20_AYE_T_Mero_vs_Ctrl.csv
      ./DEG_21_AYE_O_Mero_vs_Ctrl.csv
      ./DEG_22_O_Trans_Mero_vs_Ctrl.csv
      ./DEG_23_AYE_T_Mero_vs_AYE_T_Ctrl.csv
      ./DEG_24_AYE_WT_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_25_AYE_T_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_26_AYE_O_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_27_F_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_28_AYE_WT_Rif_vs_Ctrl.csv
      ./DEG_29_AYE_T_Rif_vs_Ctrl.csv
      ./DEG_30_AYE_O_Rif_vs_Ctrl.csv
      ./DEG_31_O_Trans_Rif_vs_Ctrl.csv
    
        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete_manually")
    
        # 1. Blast2GO: Extract GO & EC terms (Primary source)
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_AYE/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
        colnames(annot_df) <- c("GeneID", "Term")
    
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        res <- read.csv("DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv")
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # Filter up- and down-regulated genes (NEEDS_TO_BE_ADAPTED_BY_1.4_or_2.0_or_1.2)
        up_regulated <- res_updated[res_updated$log2FoldChange >= 1.4 & res_updated$padj <= 0.05, ]
        down_regulated <- res_updated[res_updated$log2FoldChange <= -1.4 & res_updated$padj <= 0.05, ]
    
        # Create a new workbook
        wb <- createWorkbook()
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        saveWorkbook(wb, "Gene_Expression_with_Annotations_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
    
        # ---------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
    
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
    
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # -----------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
        #saveWorkbook(wb, "KEGG_Enrichment.xlsx", overwrite = TRUE)
    
        # ----------------------------------------
        # ---- Perform GO enrichment analysis ----
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis, where pAdjustMethod is one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
  3. BUG_1: The Count column does not match the number of gene IDs listed in the geneID column.

        The discrepancy between the Count and the number of listed GeneIDs is not an error, but a result of KEGG's many-to-many mapping between physical genes and KOs.
    
        KEGG enrichment statistics (Count, GeneRatio, p-values) are strictly calculated based on unique KOs, not GeneIDs. The geneID column simply maps these enriched KOs back to our specific genome for display.
        Concrete Examples (see KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx):
    
            * ko03070 (Bacterial secretion system): Count = 2 (KOs: K02456/K02457). However, both KOs map to the exact same physical gene (ABAYE2071), so only 1 geneID is listed.
            * ko05111 (Biofilm formation): Count = 3 (KOs: K02456/K02457/K03092), which map back to 2 unique genes (ABAYE2071/ABAYE3136).
    
        To ensure full transparency, I have added a new KEGG_ko column (placed between Count and geneID). It explicitly lists the exact KOs contributing to the Count.

DEBUG: The Count column does not match the number of gene IDs listed in the geneID column (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE$)

This is a brilliant and fundamental question. The short answer is: KEGG enrichment statistics (including GeneRatio, BgRatio, p-values, and the original Count) are strictly based on KOs (KEGG Orthologs), NOT your original GeneIDs.

In fact, my previous attempt to “fix” the discrepancy by changing the Count and GeneRatio to match the number of GeneIDs was statistically incorrect. I apologize for that confusion. By changing the ratio to match GeneIDs while leaving the denominator as the number of KOs, we created a mathematically invalid ratio that no longer matches the p-value calculated by clusterProfiler.

Here is the exact scientific explanation of why this happens, why your co-author saw a discrepancy, and how to properly handle it.


1. Why Enrichment is Based on KOs

KEGG pathways are defined by KOs (e.g., K00001), not by specific genomic loci (like ABAYE0067). When you run enrichKEGG(), the tool performs a hypergeometric test.

  • The Universe (Background): All KOs in the KEGG database.
  • The Sample (Your Input): The unique KOs extracted from your DEGs.
  • The Calculation: It calculates the probability of finding $k$ KOs in a pathway out of your $n$ input KOs.

Therefore, the Count column represents the number of unique KOs from your input list that map to that pathway. The GeneRatio is (Number of input KOs in pathway) / (Total input KOs).

2. Why the Co-author Saw a “Discrepancy”

Your co-author noticed that Count was 3, but only 2 GeneIDs were listed. This is not a bug; it is a biological reality of gene annotation.

Here are the two reasons why Count (KOs) $\neq$ Number of GeneIDs:

  1. Multiple KOs map to the same GeneID: A single gene in your annotation file might have multiple functional domains or be assigned to multiple KOs (e.g., ABAYE2071 might be annotated as K00001, K00002). If both K00001 and K00002 are enriched in a pathway, Count = 2, but they both map back to the exact same gene: ABAYE2071.
  2. Missing Mappings in the DEG list: The KEGG database says 3 KOs belong to this pathway. Your DEGs contain those 3 KOs. However, when your script tries to map those 3 KOs back to your specific Gene_Expression_with_Annotations file, one of the KOs might not have a corresponding GeneID in that specific file (or it was filtered out). Thus, Count = 3, but only 2 GeneIDs are revealed.

3. The Correct Solution

Do not change the Count or GeneRatio. They are statistically correct and mathematically tied to the p-value. The geneID column is purely for display purposes so the biologist can see which actual genes are involved. It is completely normal and expected that the number of displayed GeneIDs differs from the Count.

Here is the scientifically accurate version of the process_kegg_enrichment function. It maps the KOs back to GeneIDs for display, but leaves the statistical columns (Count, GeneRatio) completely untouched.

The Corrected R Function

Replace your process_kegg_enrichment function with this:

============================================================
HELPER FUNCTION: Process KEGG enrichment (STATISTICALLY CORRECT)
============================================================
process_kegg_enrichment <- function(enrichment_result, kegg_to_geneid_mapping) {
  if (is.null(enrichment_result)) {
    message("  -> Enrichment result is NULL")
    return(NULL)
  }

  enrichment_df <- as.data.frame(enrichment_result)
  if (nrow(enrichment_df) == 0) return(NULL)

  message("  -> Mapping KEGG KOs back to original GeneIDs for display...")

  # 1. Split the original KO IDs and map to GeneIDs
  expanded <- enrichment_df %>%
    tidyr::separate_rows(geneID, sep = "/") %>% # Split KOs
    dplyr::left_join(kegg_to_geneid_mapping, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
    dplyr::distinct() %>%
    dplyr::group_by(ID) %>%
    dplyr::summarise(
      # Concatenate all unique GeneIDs that map to the KOs in this pathway
      Mapped_GeneID = paste(unique(na.omit(GeneID)), collapse = "/"), 
      .groups = "drop"
    )

  # 2. Replace the geneID column, but DO NOT TOUCH Count or GeneRatio
  # The original Count and GeneRatio are based on KOs and are statistically valid.
  result <- enrichment_df %>%
    dplyr::select(-geneID) %>% # Drop the old KO-based geneID column
    dplyr::left_join(expanded, by = "ID") %>% # Add the new GeneID-based column
    dplyr::rename(geneID = Mapped_GeneID)

  # 3. Reorder columns for clean Excel output
  desired_order <- c(
    "category", "subcategory", "ID", "Description", "GeneRatio", "BgRatio", 
    "RichFactor", "FoldEnrichment", "zScore", "pvalue", "p.adjust", "qvalue", 
    "Count", "geneID"
  )
  result <- result %>% dplyr::select(dplyr::any_of(desired_order), dplyr::everything())

  message("  ✅ SUCCESS: GeneIDs mapped for display. Statistical columns (Count/GeneRatio) preserved based on KOs.")
  return(result)
}

💡 How to Reply to Your Co-author

You can use this exact explanation to clarify the situation with your co-author:

“Dear [Co-author’s Name],

Thank you for pointing out the discrepancy between the Count and the number of GeneIDs listed. I have investigated the pipeline and can clarify why this occurs.

The KEGG enrichment analysis (enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw GeneIDs. Because multiple GeneIDs can share the same KO annotation, and a single GeneID can be assigned multiple KOs, the mapping between genes and KOs is many-to-many.

Therefore:

  • Count and GeneRatio represent the number of unique KOs from our input list that fall into that pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain based on KOs to be statistically valid.
  • The geneID column is provided purely for display purposes, showing the actual genomic loci (e.g., ABAYE...) that correspond to those enriched KOs.

If Count is 3 but only 2 GeneIDs are listed, it simply means that 3 unique KOs were enriched in that pathway, but when mapped back to our specific annotation file, those 3 KOs correspond to only 2 unique genes (either because two KOs map to the same gene, or one KO lacked a matching gene in our specific DEG list).

I have updated the script to ensure the statistical columns remain perfectly accurate based on KOs, while the geneID column accurately reflects all available gene mappings for your review.”



结合您刚刚修复并导出的新 Excel 文件(其中成功增加了 KEGG_ko 列),我们可以用具体的数据例子来回复您的共同作者。这样不仅能解释原理,还能直接“用数据说话”,彻底打消他/她关于“annotation parsing error(注释解析错误)”的疑虑。

以下是为您改进后的回复草稿:


Subject: Re: Clarification on KEGG Enrichment Output (Count vs. GeneID discrepancy)

Dear [Co-author’s Name],

Thank you for pointing out the apparent discrepancy between the Count and the number of GeneIDs listed in the KEGG enrichment results. I have thoroughly investigated the pipeline and can clarify exactly why this occurs, and how I have updated the output files to make it completely transparent.

The core reason is that the KEGG enrichment analysis (enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw physical GeneIDs. The mapping between genes and KOs is many-to-many: multiple genes can share the same KO annotation, and a single gene can be assigned multiple KOs.

Therefore, the columns should be interpreted as follows:

  • Count and GeneRatio: Represent the number of unique KOs from our input DEG list that fall into that specific pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain strictly based on KOs to be statistically valid.
  • KEGG_ko (NEW COLUMN): To address your concern and provide full traceability, I have added this new column to the Excel output. It explicitly lists the exact KOs that contributed to the Count.
  • geneID: Shows the actual genomic loci (e.g., ABAYE…) that correspond to those enriched KOs.

To illustrate with the exact examples you mentioned: Let’s look at the updated results for ko03070 (Bacterial secretion system) and ko05111 (Biofilm formation):

  • For ko03070, the Count is 2. If you look at the new KEGG_ko column, it correctly lists 2 unique KOs: K02456/K02457. However, when we map these 2 KOs back to our specific genome annotation, they both happen to be carried by the exact same physical gene: ABAYE2071. Thus, the geneID column only reveals 1 gene.
  • Similarly, for ko05111, the Count is 3 (KOs: K02456/K02457/K03092), but they map back to only 2 unique physical genes (ABAYE2071/ABAYE3136).

This is not an annotation parsing error, but a biological reality of how gene functions are annotated in the KEGG database. I have updated the pipeline to ensure the statistical columns remain perfectly accurate based on KOs, while the newly added KEGG_ko column and the geneID column provide a complete, traceable, and biologically meaningful picture for your review.

Please let me know if you would like to discuss this further or if you need any additional adjustments to the output format.

Best regards,


💡 改进说明:

  1. 引入了新列 KEGG_ko 的解释:明确告诉共同作者,为了增加透明度,我们特意加上了这一列,让他能直接看到底是哪几个 KO 参与了计算。
  2. 使用了新文件中的真实数据作为例子:直接引用了 ko03070(2个KO对应1个基因)和 ko05111(3个KO对应2个基因)的具体数据(K02456/K02457 等),这比纯理论解释更有说服力,直接证明了我们没有算错,也没有解析错误。
  3. 语气专业且自信:明确界定这不是 “annotation parsing error”,而是 KEGG 数据库本身的 “many-to-many” 生物学特性,展现了您对生信分析流程的透彻理解。

Host and Viral Temporal Transcriptomics of VZV Infection in Human Skin Organoids (Data_Nina_RNAseq_2026)

draw_3D.py.zip

complete_deg_pipeline.R

  1. Preparing samplesheet.csv

         sample,fastq_1,fastq_2,strandedness
         control_r1,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R1.fastq.gz,./20260527_AV243904_0078_B/01_negative_34_170_34_171_R2.fastq.gz,auto
         control_r2,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R1.fastq.gz,./20260527_AV243904_0078_B/02_negative_34_182_34_206_R2.fastq.gz,auto
         control_r3,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R1.fastq.gz,./20260527_AV243904_0078_B/03_negative_34_216_34_218_R2.fastq.gz,auto
         VZV.d10_r1,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R1.fastq.gz,./20260527_AV243904_0078_B/04_VZV_d10_34_11_34_12_R2.fastq.gz,auto
         VZV.d10_r2,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R1.fastq.gz,./20260527_AV243904_0078_B/05_VZV_d10_34_20_34_23_R2.fastq.gz,auto
         VZV.d10_r3,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R1.fastq.gz,./20260527_AV243904_0078_B/06_VZV_d10_34_25_34_29_R2.fastq.gz,auto
         VZV.d15_r1,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R1.fastq.gz,./20260527_AV243904_0078_B/07_VZV_d15_34_30_34_31_R2.fastq.gz,auto
         VZV.d15_r2,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R1.fastq.gz,./20260527_AV243904_0078_B/08_VZV_d15_34_35_34_39_R2.fastq.gz,auto
         VZV.d15_r3,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R1.fastq.gz,./20260527_AV243904_0078_B/09_VZV_d15_34_56_34_60_R2.fastq.gz,auto
         VZV.d20_r1,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R1.fastq.gz,./20260527_AV243904_0078_B/10_VZV_d20_34_71_34_85_R2.fastq.gz,auto
         VZV.d20_r2,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R1.fastq.gz,./20260527_AV243904_0078_B/11_VZV_d20_34_88_34_91_R2.fastq.gz,auto
         VZV.d20_r3,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R1.fastq.gz,./20260527_AV243904_0078_B/12_VZV_d20_34_94_34_100_R2.fastq.gz,auto
  2. DO NOT RUN Trimmomatic before Nextflow.

         It is highly recommended to let the nf-core/rnaseq pipeline handle all trimming internally. Pre-trimming your data can break the pipeline's internal file tracking, interfere with the UMI extraction step, and make reproducibility harder.
         The nf-core/rnaseq pipeline has built-in support for both UMI extraction and read trimming, and it executes them in the correct order: it will extract the UMIs from the raw Read 1 first, and then trim the error-prone bases from Read 2.
         Here is how you should adjust your approach and your Nextflow command.
    
         不应该在运行 Nextflow 之前手动使用 Trimmomatic。
         强烈建议让 nf-core/rnaseq 流程在内部处理所有的修剪(trimming)工作。提前手动修剪数据可能会破坏流程内部的文件追踪机制,干扰 UMI 的提取步骤,并降低结果的可重复性。
         nf-core/rnaseq 流程内置了对 UMI 提取和 reads 修剪的支持,并且会按照正确的顺序执行:它会首先从原始的 Read 1 中提取 UMI,然后再修剪 Read 2 中容易出错的碱基。
  3. Nextflow running

         #yes, exactly, the Reads contain UMI: the first 12 nt of Read1.
         #The CORALL v2 Kit from Lexogen was used for library prep. According to the manufacturer it can additionally be beneficial for the mapping to trim the first 10 nt of Read2 as well, as this site is more error prone (due to priming they say).
         #If you need more information from my side, let me know!
    
         方法一:正则表达式法 (Regex) —— 你目前选择的方案
         根据你提供的文档,使用 regex 方法时,必须使用特定的命名捕获组。文档中明确指出:
         "The expected groups in the regex are: umi_n = UMI positions, where n can be any value (required)"
         这意味着,捕获 UMI 的组名必须是 umi_ 加上一个数字(例如 umi_1),而不能只是 umi。
         正确的正则表达式:'^(?P
    .{12}).*’ 原理解释: ^:匹配字符串的开头。 (?P .{12}):捕获前 12 个任意字符,并将其命名为 umi_1(这将作为 UMI 被提取并添加到 read name 中)。 .*:匹配并保留剩余的序列(文档说明:未被命名为 discard 的其他部分会被重新附加到 read 序列上)。 方法二:字符串法 (String) —— 更简单直观的替代方案 文档中提到: “string: This should be used where the barcodes are always in the same place in the read. N = UMI position” 因为你的 UMI 长度固定(12nt)且位置固定(永远在 Read 1 开头),你完全可以使用更简单的 string 方法,用 N 来代表 UMI 的位置。 对应的 Nextflow 参数: –umitools_extract_method string –umitools_bc_pattern NNNNNNNNNNNN (正好 12 个 N) # OLD_COMMANDS_DEPRECATED: under sage in the early running # ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq # nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results_GRCh38 –genome GRCh38 –with_umi –umitools_extract_method regex –umitools_bc_pattern ‘^(?P.{12}).*’ -profile docker -resume –max_cpus 54 –max_memory 120.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –pseudo_aligner salmon –umitools_grouping_method unique # #Debug the following error: added “–minAssignedFrags 0 \\” to modules/nf-core/salmon/quant/main.nf option “salmon quant” and added “–min_mapped_reads 0” in the nextflow command #nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \ nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf –help nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \ –input samplesheet.csv \ –outdir results_GRCh38 \ –genome GRCh38 \ –with_umi \ –umitools_extract_method regex \ –umitools_bc_pattern ‘^(?P .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon \ –pseudo_aligner salmon \ –umitools_grouping_method unique 关键执行逻辑(让你完全放心): 第一步 (umi_tools):流程会读取原始的 Read 1,根据 ^(?P .{12}).* 提取前 12 个碱基作为 UMI,并将其添加到 read 的名称中(例如 @READ_ID_UMISEQUENCE)。此时 Read 1 的序列中不再包含这 12 个碱基。 第二步 (fastp):流程接着运行 fastp。由于你设置了 –clip_r2 10,fastp 会精准地切除 Read 2 的前 10 个碱基(解决 Lexogen CORALL v2 的引物错误问题)。 互不干扰:umi_tools 只处理 Read 1 的 UMI,fastp 负责全局的质控和 Read 2 的修剪。你绝对不需要在运行 Nextflow 之前手动使用 Trimmomatic。
  4. To verify that fastp has successfully trimmed the first 10 nucleotides from Read 2, you can use any of the following three methods.

         *(Note: First, ensure that the `FASTP` step in your Nextflow terminal has actually finished running and shows a green checkmark ✔).*
    
         ### Method: Manually Measure the Sequence Length (Raw Data Proof)
         Because you did not use the `--save_trimmed` flag, the trimmed FASTQ files are not saved in your `results/` folder. However, they still exist temporarily in the `work/` directory. You can measure the exact length of the first read to prove 10bp was cut.
    
         *Note: Since you used `--with_umi`, the input file for fastp is the output from UMI-tools, so the filename will likely contain `umi_extract` instead of `raw`.*
    
         1. Navigate into the `work/` directory for the `FASTP` step:
         ```bash
         cd work/fc/def4de*/
         cd work/a1/b2c3d4*/
         ```
         2. List the files to see the exact naming convention:
         ```bash
         ls -lh *2.fastq.gz
         ```
         *(You will likely see an input file like `*.umi_extract_2.fastq.gz` and an output file like `*.fastp.trimmed_2.fastq.gz` or `*.trim_2.fastq.gz`).*
         3. **Measure the length of the first sequence in the INPUT Read 2:**
         ```bash
         zcat *.umi_extract_1.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat *.umi_extract_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #154
         ```
         *(Let's say this outputs `150`)*
         4. **Measure the length of the first sequence in the OUTPUT Read 2:**
         ```bash
         zcat *.trimmed_2.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_1.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         zcat control_r3_2.fastp.fastq.gz | head -n 2 | tail -n 1 | awk '{print length}'
         #142
         #144
         ```
         *(This should output exactly `140`)*.
    
         If the output length is exactly 10 bases shorter than the input length, the trimming was 100% successful.
    
         ---
    
         ### 💡 Pro Tip for Future Runs
         If you want to easily inspect the trimmed FASTQ files in your `results/` folder without digging through the `work/` directory next time, simply add **`--save_trimmed`** to your `nextflow run` command. This will force the pipeline to copy the post-fastp FASTQ files to your main results directory.
  5. Nextflow running on virus

         # OLD_COMMANDS_DEPRECATED: under sage in the early running
         # The virus-referenced commands!
         # nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_chrHsv1 --fasta chrHsv1_s17.fasta --gtf chrHsv1_s17.gtf --with_umi --umitools_extract_method regex --umitools_bc_pattern '^(?P.{12}).*' --umitools_dedup_stats -profile test_full -resume --max_memory 256.GB --max_time 2400.h --save_reference --aligner star_salmon --gtf_extra_attributes gene_id --gtf_group_features transcript_id --featurecounts_group_type gene_id --featurecounts_feature_type transcript --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --skip_multiqc --min_mapped_reads 0
    
             #This error occurs because your GenBank file was likely exported from SnapGene (as seen in the COMMENT section of the file). SnapGene sometimes formats the date in the LOCUS line as DD-MM-YYYY (e.g., 13-06-2025), but Biopython's parser strictly expects the standard NCBI GenBank format, which requires a 3-letter month abbreviation like DD-MMM-YYYY (e.g., 13-JUN-2025).
             sed -i 's/13-06-2025/13-JUN-2025/' BB1528_nanopore_consensus.gb
    
             python3 gb_to_fasta_gtf_v2.py BB1528_nanopore_consensus.gb BB1528_nanopore_consensus.fasta BB1528_nanopore_consensus.gtf BB1528    # Generate the last two files in the command line
             python fix_gtf_to_hsv1_format.py    # Generate BB1528_final.gtf
    
             nextflow run /home/jhuang/Tools/nf-core-rnaseq-3.12.0/main.nf -profile docker \
             --input samplesheet_virus.csv \
             --outdir results_BB1528 \
             --fasta BB1528_nanopore_consensus.fasta \
             --gtf BB1528_final.gtf \
             --with_umi \
             --umitools_extract_method regex \
             --umitools_bc_pattern '^(?P
    .{12}).*’ \ –trimmer fastp \ –extra_fastp_args “–trim_front2 10” \ -resume \ –max_cpus 54 \ –max_memory 120.GB \ –max_time 2400.h \ –save_align_intermeds \ –save_unaligned \ –save_reference \ –aligner star_salmon –pseudo_aligner salmon \ –gtf_extra_attributes gene_id \ –gtf_group_features transcript_id \ –featurecounts_group_type gene_id \ –featurecounts_feature_type transcript \ –skip_rseqc –skip_dupradar –skip_preseq –skip_biotype_qc –skip_deseq2_qc –skip_multiqc \ –umitools_grouping_method unique –min_mapped_reads 0 # —- DEBUG_LOG —- Good news: Your GTF and FASTA files are perfectly formatted! We can prove this from your log: [info] Index contained 78 targets This means Salmon successfully built the transcriptome index and found all 78 transcripts from your BB1528_final.gtf. The reference files are completely correct. What is causing the error? The error is purely a biological/sample-specific issue: [warning] salmon was only able to assign 9 fragments to transcripts in the index, but the minimum number of required assigned fragments (–minAssignedFrags) was 10. The nf-core/rnaseq pipeline automatically subsamples your reads (in this case, 1,000,000 fragments) to run a quick Salmon quantification and check the library strandedness (–libType=A). Because you are mapping to a specific viral reference (BB1528), it is highly likely that control_r3 is a negative control or a sample with an extremely low viral load. Out of the 1,000,000 subsampled reads, only 9 reads actually belonged to the virus. Salmon has a built-in safety threshold (–minAssignedFrags 10) that intentionally crashes the pipeline if fewer than 10 reads map to the transcriptome, to prevent “garbage” quantification of empty samples. Since 9 The pipeline is failing during the FASTQ_SUBSAMPLE_FQ_SALMON step in the three control samples! Later on, we can manually set all counts for all virus transcripts in the three control-conditions as 0.
  6. A small BUG in both nextflow runs: versions.yml

         vim /mnt/md1/DATA/Data_Nina_RNAseq_2026/work/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml
         BUG_LINES:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-4ht1kmld because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
         1.1.4
         SHOULD_BE:
         "NFCORE_RNASEQ:RNASEQ:UMITOOLS_PREPAREFORSALMON":
             umitools: 1.1.4
    
         ALTERNATIVE: Not correct the bug, copy the version log to mail directory:
             cp work_GRCh38_DEL/f0/712c00bea105e90714e8df47d3579a/collated_versions.yml results_GRCh38
             cp work/bf/4a1e941bc32b97de1bc8f4fd47df79/collated_versions.yml results_BB1528
  7. import data and pca-plot

         # Import the required libraries
         library("AnnotationDbi")
         library("clusterProfiler")
         library("ReactomePA")
         library(gplots)
         library(tximport)
         library(DESeq2)
         library("org.Hs.eg.db")
         library(dplyr)
         library(tidyverse)
    
         # ------ For human genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("control_r1" = "./control_r1/quant.sf",
                 "control_r2" = "./control_r2/quant.sf",
                 "control_r3" = "./control_r3/quant.sf",
                 "VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # ------ For virus genome ------
         setwd("~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon")
         # Define paths to your Salmon output quantification files
         files <- c("VZV.d10_r1" = "./VZV.d10_r1/quant.sf",
                 "VZV.d10_r2" = "./VZV.d10_r2/quant.sf",
                 "VZV.d10_r3" = "./VZV.d10_r3/quant.sf",
                 "VZV.d15_r1" = "./VZV.d15_r1/quant.sf",
                 "VZV.d15_r2" = "./VZV.d15_r2/quant.sf",
                 "VZV.d15_r3" = "./VZV.d15_r3/quant.sf",
                 "VZV.d20_r1" = "./VZV.d20_r1/quant.sf",
                 "VZV.d20_r2" = "./VZV.d20_r2/quant.sf",
                 "VZV.d20_r3" = "./VZV.d20_r3/quant.sf")
         # Import the transcript abundance data with tximport
         txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
         # Define the replicates and condition of the samples
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         # Define the colData for DESeq2
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
         # -- transcript-level count data --
         # Create DESeqDataSet object
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         write.csv(counts(dds), file="transcript_counts.csv")
    
         # -- gene-level count data --
         # Read in the tx2gene map from salmon_tx2gene.tsv
         tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
         # Set the column names
         colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
         # Remove the gene_name column if not needed
         tx2gene <- tx2gene[,1:2]
         # Import and summarize the Salmon data with tximport
         txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
         # Continue with the DESeq2 workflow as before...
         colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
         dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
         #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
         write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
         # MANUALLY set all counts for all virus transcripts in the three control-conditions ("control_r1","control_r2","control_r3") as 0 using Excel
    
         # ------ Merge the raw counts of human and microbe ------
         #cat ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv > merged_gene_counts.csv
         #DELETE the second line: "","control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"
    
         # 1. Remove the hidden ^M (Windows carriage returns) from both files
         sed -i 's/\r$//' ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv
         # 2. Extract the header from the first file to start the new merged file
         head -n 1 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv > merged_gene_counts.csv
         # 3. Append the data from the first file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_GRCh38/star_salmon/gene_counts.csv >> merged_gene_counts.csv
         # 4. Append the data from the second file (skipping its header)
         tail -n +2 ~/DATA/Data_Nina_RNAseq_2026/results_BB1528/star_salmon/gene_counts.csv >> merged_gene_counts.csv
    
         # Should be 60683 lines
         #~/Tools/csv2xls-0.4/csv_to_xls.py merged_gene_counts.csv -d',' -o raw_gene_counts.xls;
    
         # -- for merged analysis due to false normalization factors wenn alone analyzed on virus data --
         setwd("~/DATA/Data_Nina_RNAseq_2026/")
         d.raw <- read.csv("merged_gene_counts.csv", header=TRUE, row.names=1)
    
         # Re-define the replicates and condition for the merged counts
         rownames <- factor(c("control_r1","control_r2","control_r3","VZV.d10_r1","VZV.d10_r2","VZV.d10_r3","VZV.d15_r1","VZV.d15_r2","VZV.d15_r3","VZV.d20_r1","VZV.d20_r2","VZV.d20_r3"))
         replicate <- factor(c("r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
         condition <- factor(c("control", "control", "control", "VZV.d10", "VZV.d10", "VZV.d10", "VZV.d15", "VZV.d15", "VZV.d15", "VZV.d20", "VZV.d20", "VZV.d20"))
         colData <- data.frame(condition=condition, replicate=replicate, row.names=rownames)
    
         #CORRECTED: dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition+replicate)
         #⚠️ A Note on the Design Formula: In the snippet, I used design = ~condition+replicate. In DESeq2, adding replicate (r1, r2, r3) to the design formula tells the model to treat "r1" as a globally systematic batch effect across all conditions. Unless the replicates were processed in distinct batches (e.g., all r1 on Monday, all r2 on Tuesday), it is standard to use design = ~ condition to simply measure the biological variance. Therefore, I have updated the script to use ~ condition.
         dds <- DESeqDataSetFromMatrix(countData=d.raw, colData=colData, design=~condition)
         dim(counts(dds))
         head(counts(dds), 10)
    
         rld <- rlogTransformation(dds)
    
         #We don't need to run DESeq(dds) before estimateSizeFactors(dds). In fact, the typical workflow in DESeq2 is the opposite: we usually run estimateSizeFactors(dds) (and other preprocessing functions) before running the main DESeq(dds) function.
         #The estimateSizeFactors function is used to calculate size factors for normalization, which corrects for differences in library size (i.e., the number of read counts) between samples. This normalization step is crucial to ensure that differences in gene expression aren't merely due to differences in sequencing depth between samples.
         #The DESeq function, on the other hand, performs the main differential expression analysis, comparing gene expression between different conditions or groups.
         #So, the typical workflow is:
         #  - Create the DESeqDataSet object.
         #  - Use estimateSizeFactors to normalize for library size.
         #  - (Optionally, estimate dispersion with estimateDispersions if not using the full DESeq function later.)
         #  - Use DESeq for the differential expression analysis.
         #  - However, it's worth noting that if you run the main DESeq function directly after creating the DESeqDataSet object, it will automatically perform the normalization (using estimateSizeFactors) and dispersion estimation steps for you. In that case, there's no need to run estimateSizeFactors separately before DESeq.
    
         # draw simple pca and heatmap
         library(gplots)
         library("RColorBrewer")
         #mat <- assay(rld)
         #mm <- model.matrix(~condition, colData(rld))
         #mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
         #assay(rld) <- mat
         # -- pca --
         png("pca.png", 1200, 800)
         plotPCA(rld, intgroup=c("condition"))
         dev.off()
         # -- heatmap --
         png("heatmap.png", 1200, 800)
         distsRL <- dist(t(assay(rld)))
         mat <- as.matrix(distsRL)
         hc <- hclust(distsRL)
         hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
         heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
         dev.off()
  8. draw 3D PCA plots.

         library(gplots)
         library("RColorBrewer")
    
         library(ggplot2)
         data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
         write.csv(data, file="plotPCA_data.csv")
         #calculate all PCs including PC3 with the following codes
         library(genefilter)
         ntop <- 500
         rv <- rowVars(assay(rld))
         select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
         mat <- t( assay(rld)[select, ] )
         pc <- prcomp(mat)
         pc$x[,1:3]
         #df_pc <- data.frame(pc$x[,1:3])
         df_pc <- data.frame(pc$x)
         identical(rownames(data), rownames(df_pc)) #-->TRUE
    
         data$PC1 <- NULL
         data$PC2 <- NULL
         merged_df <- merge(data, df_pc, by = "row.names")
         #merged_df <- merged_df[, -1]
         row.names(merged_df) <- merged_df$Row.names
         merged_df$Row.names <- NULL  # remove the "name" column
         merged_df$name <- NULL
         merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","group","condition","replicate")]
         write.csv(merged_df, file="merged_df_12PCs.csv")
         summary(pc)
         #                           PC1     PC2     PC3
         #Proportion of Variance  0.5969  0.1732 0.09444
    
         python3 draw_3D.py
         # Edit on the generated SVG-figure.
         #/usr/bin/convert PCA_3D.png -crop 2900x1600+250+700 PCA_3D_cropped.png
  9. (Optional) estimate size factors and dispersion values.

         #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
         #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
         #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).
    
         sizeFactors(dds)
         #NULL
         # Estimate size factors
         dds <- estimateSizeFactors(dds)
         # Estimate dispersions
         dds <- estimateDispersions(dds)
         #> sizeFactors(dds)
         #control_r1 control_r2 control_r3 VZV.d10_r1 VZV.d10_r2 VZV.d10_r3 VZV.d15_r1
         #1.2072182  1.1009757  0.8610196  1.1965615  0.8575169  1.1714809  0.8344160
         #VZV.d15_r2 VZV.d15_r3 VZV.d20_r1 VZV.d20_r2 VZV.d20_r3
         #0.5278583  1.1768657  1.1006353  1.1288974  1.2630530
    
         # If alone with virus data, the following BUG occured:
         #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                             HeLa_TO_r1                      HeLa_TO_r2
                             0.9978755                       1.1092227
         data.frame(genes = rownames(dds), dispersions = dispersions(dds))
    
         #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
    
         1/0.9978755=1.002129023
         1/1.1092227=
    
         #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
         bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
    
         raw_counts <- counts(dds)
         normalized_counts <- counts(dds, normalized=TRUE)
         #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
         #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
         #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
         estimSf <- function (cds){
             # Get the count matrix
             cts <- counts(cds)
    
             # Compute the geometric mean
             geomMean <- function(x) prod(x)^(1/length(x))
    
             # Compute the geometric mean over the line
             gm.mean  <-  apply(cts, 1, geomMean)
    
             # Zero values are set to NA (avoid subsequentcdsdivision by 0)
             gm.mean[gm.mean == 0] <- NA
    
             # Divide each line by its corresponding geometric mean
             # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
             # MARGIN: 1 or 2 (line or columns)
             # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
             # FUN: the function to be applied
             cts <- sweep(cts, 1, gm.mean, FUN="/")
    
             # Compute the median over the columns
             med <- apply(cts, 2, median, na.rm=TRUE)
    
             # Return the scaling factor
             return(med)
         }
         #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
         #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
         #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
         #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
         #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
         #DESeq2’s median of ratios [1]
         #EdgeR’s trimmed mean of M values (TMM) [2]
         #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
         test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
         sum(test_normcount != normalized_counts)
  10. Merged Host-Virus Counts -> DESeq2 -> DEGs of 6 Comparisons

         Rscript complete_deg_pipeline.R
    
         # Key changes made:
         #     1. Virus gene identification: Added detection of virus genes using pattern matching ("^ORF|cat_RNA|redF_RNA|repE_RNA|sopA_RNA|sopB_RNA|UL45_RNA")
         #     2. Updated significance cutoff: Changed from |log2FC| >= 1 to |log2FC| >= 2
         #     3. Color scheme:
         #        * Green for virus RNAs (all virus genes)
         #        * Dark green for significantly regulated virus genes
         #        * Red for significantly up-regulated host genes
         #        * Blue for significantly down-regulated host genes
         #        * Gray for non-significant host genes
         #     4. Enhanced legend: The legend now clearly distinguishes between virus RNAs and host genes with their regulation status
         #     5. Summary statistics: Added detailed output showing counts of virus vs host genes and their regulation status
    
         #  ================================================================================
         # 📊 FINAL SUMMARY OF ALL 6 COMPARISONS
         #                 name total  up down virus sig_total pct_sig
         # 03_VZV.d20_vs_control 17594 193  220    77       413     2.3
         # 02_VZV.d15_vs_control 17594 139   94    77       233     1.3
         # 05_VZV.d20_vs_VZV.d10 17594  35  114    77       149     0.8
         # 01_VZV.d10_vs_control 17594  89   20    77       109     0.6
         # 06_VZV.d20_vs_VZV.d15 17594  28   38    77        66     0.4
         # 04_VZV.d15_vs_VZV.d10 17594   6   13    77        19     0.1