Author Archives: gene_x

力–光子信号相关性分析流程（单分子光学镊子+荧光数据）

Leave a reply

一、背景假说与验证目标

采用单分子光学镊子+荧光，研究LT蛋白在DNA上的装配、熔解与RAD51/RPA70共定位。
工作假说：如果“拉伸力”是主要驱动力，则力阶跃应伴随显著光子状态转变，否则支持“多聚化主导”模型，与论文结论一致。
胡卓伟 https://baike.baidu.com/item/%E8%83%A1%E5%8D%93%E4%BC%9F/8629598

二、数据准备与时间对齐

统一时间轴
- 将 force(t), photons(t), HMM状态序列重采样到同一采样率（如100 Hz）
- 校正曝光延迟及采样偏移，保证时间戳一致
漂白校正
- 光子信号做指数衰减或局部基线漂移校正
- 剔除长段漂白区或作为协变量标注
去噪与标准化
- 适度滤波（如中位数或Savitzky–Golay）
- 光子信号与力均应Z-score化，便于对比

三、HMM结果信号选取

并行追踪：
- 校正后的连续光子信号 photons(t)
- HMM输出的状态转移与后验概率（state(t)、P(switch at t)）

四、三层相关性分析流程

A. 全局时间序列相关性

互相关函数（CCF）
- 计算 force(t) 与 photons(t)/state(t) 的互相关，扫 ±L 秒时滞
- 用块置换/相位随机化检验显著性，获得p值和置信区间
事件触发平均（ETA）
- 以每个 change_time_s 为触发点，窗口 [t0−W, t0+W] 内叠加平均光子/状态曲线
- 判断ETA在0附近是否有系统响应、显著性超置信区间

B. 事件级判定“相关/不相关”

对每个 change_time_s = t_i:
1. 设窗口 Δ（如2秒），判断窗口内是否有：
  - HMM状态跃迁且后验>0.9
  - 或光子信号显著突变（|Δphotons| > 3×局部噪声SD）
2. 标注候选“相关”（Yes），记录证据
3. 用随机对照/置换检验（假阳性率，Fisher/置换测试，q<0.05）确定最终Yes/No
4. 可选：检查方向一致性（如force上升是否更常引起信号/状态上升），用logistic回归检验

C. 位置（x坐标）相关性（若有位置信息）

按位置bin计算每bin相关率、驻留时间、事件占空比
热点分析：检验特定位点相关率是否显著高于其他bin或随机

五、统计模型拓展（可选）

点过程/危险率模型（Cox/Poisson回归），自变量含force、Δforce、方向、时间、位置等
加入分子ID/实验天数作为随机效应，提升稳健性

六、生物学意义解读

1. 若观察到“强相关”：

说明外力促进DNA局部解链或状态切换，可能改变蛋白装配/停留稳定性
若力学区间与论文设定不同，则为新发现或说明实验几何有差异

2. 若观察到“不相关”：

支持“多聚化驱动”主模型：LT多聚化优先结合/侵入ssDNA，之后ATP驱动解旋
与论文结论一致：10pN张力下力–光子信号无必然联系

七、change_time_s事件判定操作清单

设定窗口Δ（如2秒）和阈值（HMM后验>0.9, |Δphotons|>3σ）
对每个t_i，查找HMM跃迁或光子突变，候选Yes/No、证据类型及最大后验
对全体事件做置换检验，统计显著性
可选：检验方向一致性

八、注意事项与易错点

必做光漂白校正
多重比较做FDR校正
时间序列强自相关优先用互相关+置换检验
同步偏移一次性估计修正
Δ窗口做敏感性分析

九、建议的最终交付物

互相关分析图 (force vs photons/HMM-state)
事件触发平均（ETA）图，含置信区间
change_time_s事件级表：is_correlated, evidence, p_value, lag
位置热点图（如有）
结论文字：是否支持“多聚化主导/与力弱相关”模型

十、论文引用要点（适用于报告撰写）

不依赖ATP水解即可熔解，RAD51/RPA70可共定位
多聚化而非解旋活性是关键驱动力
10pN张力与无张力均可见LT与RAD51共定位，张力不是必要条件
多聚化数量与熔解标志强度/速度成正比（三聚体慢、双六聚体快且稳）

附：可运行脚本支持

如需自动化，可将上述判定规则写成Python/R脚本，读入 force.csv、photons.csv、HMM_state.csv，并一键输出相关性图表和事件判定表。

Detect jump with ruptures on Data_Vero_Kymographs force data

Leave a reply

# ------------------------------------------------------------
# Requirements:
# pip install pandas numpy ruptures openpyxl
# ------------------------------------------------------------

import pandas as pd
import numpy as np
import ruptures as rpt
import os

########################
# 1. User I/O settings #
########################

infile = "./ForceHFCorrectedForce2x_p853_p502_b3_1_downsampled1000_HF_HF.csv"
outfile = "force_steps.xlsx"

MIN_STEP = 0.4          # pN threshold to report
FALLBACK_FS = 78.1      # Hz, if no time column
PENALTY = 5             # ruptures sensitivity

# post-processing params:
MERGE_WINDOW = 0.10     # sec: merge same-direction steps < this apart
CANCEL_WINDOW = 0.20    # sec: remove opposite-direction spike pairs < this apart

#####################################
# 2. Load data and pick time/force  #
#####################################

df = pd.read_csv(infile)

# Guess force column
force_col_candidates = [
        "Corrected Force 2x",
        "Corrected Force",
        "Force",
        "Force HF",
        "force",
        "force_pN",
]
force_col = None
for c in force_col_candidates:
        if c in df.columns:
                force_col = c
                break
if force_col is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) == 0:
                raise ValueError("No numeric force column found in CSV.")
        force_col = numeric_cols[-1]

# Guess time column
time_col_candidates = [
        "Time",
        "time",
        "t",
        "timestamp",
        "Timestamp (s)",
        "Seconds",
]
time_col = None
for c in time_col_candidates:
        if c in df.columns:
                time_col = c
                break
if time_col is None:
        n = len(df)
        dt = 1.0 / FALLBACK_FS
        df["__time_s"] = np.arange(n) * dt
        time_col = "__time_s"

time = df[time_col].to_numpy()
force = df[force_col].to_numpy()

###########################################
# 3. Changepoint detection (ruptures PELT)#
###########################################

algo = rpt.Pelt(model="l2").fit(force)
bkpts = algo.predict(pen=PENALTY)
# bkpts are 1-based segment ends

seg_starts = [0] + [b for b in bkpts[:-1]]
seg_ends   = [b for b in bkpts]  # 1-based inclusive ends

segments = []
for start, end in zip(seg_starts, seg_ends):
        seg_force = force[start:end]
        seg_time = time[start:end]
        mean_force = np.mean(seg_force)
        mid_idx = start + (len(seg_force) // 2)
        mid_time = time[mid_idx]
        segments.append(
                {
                        "start_idx": start,
                        "end_idx": end - 1,
                        "mean_force": mean_force,
                        "mid_time": mid_time,
                }
        )

# Build raw step list
raw_changes = []
for i in range(len(segments) - 1):
        f_before = segments[i]["mean_force"]
        f_after = segments[i + 1]["mean_force"]
        delta_f = f_after - f_before
        boundary_idx = segments[i]["end_idx"]
        change_time = time[boundary_idx]
        raw_changes.append(
                {
                        "change_time_s": change_time,
                        "force_before_pN": f_before,
                        "force_after_pN": f_after,
                        "delta_force_pN": delta_f,
                        "direction": "up" if delta_f > 0 else "down",
                }
        )

changes_df = pd.DataFrame(raw_changes)

###########################################################
# 4. POST-PROCESSING STEPS
#
# 4a. Merge consecutive steps with SAME direction if they
#     happen within MERGE_WINDOW seconds.
#
#     Example: down at t=58.63s then down again at t=58.69s
###########################################################

def merge_close_same_direction(df_steps, merge_window):
        if df_steps.empty:
                return df_steps.copy()

        merged = []
        cur = df_steps.iloc[0].to_dict()

        for idx in range(1, len(df_steps)):
                row = df_steps.iloc[idx].to_dict()

                same_dir = (row["direction"] == cur["direction"])
                close_in_time = (row["change_time_s"] - cur["change_time_s"]) <= merge_window

                if same_dir and close_in_time:
                        # merge cur and row into a single bigger jump:
                        # new before: cur.force_before_pN
                        # new after: row.force_after_pN
                        # new delta: after - before
                        merged_force_before = cur["force_before_pN"]
                        merged_force_after  = row["force_after_pN"]
                        merged_delta        = merged_force_after - merged_force_before

                        cur["force_after_pN"] = merged_force_after
                        cur["delta_force_pN"] = merged_delta
                        cur["direction"] = "up" if merged_delta > 0 else "down"
                        # keep cur["change_time_s"] as first event time
                else:
                        merged.append(cur)
                        cur = row

        merged.append(cur)
        return pd.DataFrame(merged)

merged_df = merge_close_same_direction(changes_df, MERGE_WINDOW)

###########################################################
# 4b. Remove spike pairs:
#     If we have two consecutive steps with OPPOSITE direction
#     within CANCEL_WINDOW sec, drop both (treat as noise).
#
#     Example: up at 284.302s, down at 284.430s (~0.13s)
###########################################################

def remove_opposite_spikes(df_steps, cancel_window):
        if len(df_steps) <= 1:
                return df_steps.copy()

        to_drop = set()

        for i in range(len(df_steps) - 1):
                t1 = df_steps.iloc[i]["change_time_s"]
                t2 = df_steps.iloc[i+1]["change_time_s"]
                d1 = df_steps.iloc[i]["direction"]
                d2 = df_steps.iloc[i+1]["direction"]

                if d1 != d2 and (t2 - t1) <= cancel_window:
                        # mark both i and i+1 to be dropped
                        to_drop.add(i)
                        to_drop.add(i+1)

        keep_rows = [i for i in range(len(df_steps)) if i not in to_drop]
        return df_steps.iloc[keep_rows].reset_index(drop=True)

clean_df = remove_opposite_spikes(merged_df, CANCEL_WINDOW)

###########################################################
# 5. Apply amplitude filter |Δforce| >= 1 pN
###########################################################

filtered_df = clean_df[np.abs(clean_df["delta_force_pN"]) >= MIN_STEP].copy()

##########################################
# 6. Save all versions to Excel
##########################################

# =========================
# Helpers
# =========================

def round_df(d: pd.DataFrame, nd=3) -> pd.DataFrame:
        d = d.copy()
        float_cols = d.select_dtypes(include=[np.floating]).columns
        if len(float_cols):
                d[float_cols] = d[float_cols].round(nd)
        return d

with pd.ExcelWriter(outfile) as writer:
        changes_df.to_excel(writer, sheet_name="raw_steps", index=False)
        merged_df.to_excel(writer, sheet_name="merged_same_dir", index=False)
        clean_df.to_excel(writer, sheet_name="after_spike_removal", index=False)
        round_df(filtered_df).to_excel(writer, sheet_name="final_ge_1pN", index=False)

print(f"Done. Wrote: {outfile}")
print("Sheets in Excel:")
print(" - raw_steps              (direct from ruptures)")
print(" - merged_same_dir        (after merging close same-direction jumps)")
print(" - after_spike_removal    (after removing fast opposite blips)")
print(" - final_ge_1pN           (after |Δforce|≥1 pN filter)")

Calculate alpha-diversty for Group9/10/11

Leave a reply

(a tight drop-in that restricts to Group9/Group10/Group11 and runs stats/plotting for Shannon only)

## ---- Subset to Group9/10/11 and keep factor levels tidy ---------------------
library(dplyr)
library(ggplot2)
library(ggpubr)
library(rstatix)
library(knitr)
library(kableExtra)

shannon_g9_11 <- div.df2 %>%
filter(Group %in% c("Group9", "Group10", "Group11")) %>%
mutate(Group = factor(Group, levels = c("Group9","Group10","Group11"))) %>%
droplevels()

## Quick table of the data used
knitr::kable(shannon_g9_11[, c("Sample name","Group","Shannon")]) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))

## ---- Summary stats (report in text/table) -----------------------------------
sum_stats <- shannon_g9_11 %>%
group_by(Group) %>%
summarise(n = n(),
                        mean = mean(Shannon, na.rm = TRUE),
                        sd = sd(Shannon, na.rm = TRUE),
                        median = median(Shannon, na.rm = TRUE),
                        IQR = IQR(Shannon, na.rm = TRUE),
                        .groups = "drop")

write.csv(sum_stats, "Shannon_Group9_11_summary.csv", row.names = FALSE)
knitr::kable(sum_stats, digits = 3) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))

## ---- Overall test (3 groups) -------------------------------------------------
# Nonparametric overall test (robust default)
overall_kw <- compare_means(Shannon ~ Group, data = shannon_g9_11, method = "kruskal.test")
knitr::kable(overall_kw, digits = 3) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(overall_kw, "Shannon_Group9_11_overall_Kruskal.csv", row.names = FALSE)

## Optional parametric overall (use if assumptions OK)
overall_anova <- compare_means(Shannon ~ Group, data = shannon_g9_11, method = "anova")
write.csv(overall_anova, "Shannon_Group9_11_overall_ANOVA.csv", row.names = FALSE)

## Assumption checks (optional; helps decide ANOVA vs KW)
shapiro_res <- shannon_g9_11 %>% group_by(Group) %>% shapiro_test(Shannon)
levene_res  <- shannon_g9_11 %>% levene_test(Shannon ~ Group)
write.csv(shapiro_res, "Shannon_Group9_11_shapiro.csv", row.names = FALSE)
write.csv(levene_res,  "Shannon_Group9_11_levene.csv",  row.names = FALSE)

## ---- Pairwise tests with BH correction --------------------------------------
# Wilcoxon (nonparametric)
pw_wilcox <- pairwise_wilcox_test(shannon_g9_11, Shannon ~ Group,
                                                                p.adjust.method = "BH", exact = FALSE)
write.csv(pw_wilcox, "Shannon_Group9_11_pairwise_wilcox_BH.csv", row.names = FALSE)

# t-tests (parametric, optional)
pw_t <- pairwise_t_test(shannon_g9_11, Shannon ~ Group, p.adjust.method = "BH")
write.csv(pw_t, "Shannon_Group9_11_pairwise_t_BH.csv", row.names = FALSE)

## ---- Plot: box/jitter with overall & pairwise p-values ----------------------
my_comparisons <- list(c("Group9","Group10"),
                                        c("Group9","Group11"),
                                        c("Group10","Group11"))

p_shannon_g9_11 <- ggplot(shannon_g9_11, aes(x = Group, y = Shannon)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.12, alpha = 0.7) +
labs(y = "Shannon diversity", x = NULL,
        title = "Shannon diversity: Group9 vs Group10 vs Group11") +
theme_bw() +
stat_compare_means(method = "kruskal.test", label = "p.format", label.y.npc = "top") +  # overall
stat_compare_means(comparisons = my_comparisons, method = "wilcox.test",
                                        label = "p.signif", hide.ns = TRUE)

ggsave("Shannon_Group9_10_11_boxplot.pdf", p_shannon_g9_11, width = 5.5, height = 4.2)

#How to report: cite the Kruskal–Wallis p for the overall 3-group comparison and the BH-adjusted Wilcoxon p-values for pairwise contrasts (use the ANOVA/t-test outputs only if normality and homogeneity look acceptable from the Shapiro/Levene tables).

Automated Step Detection in Force Spectroscopy Data

Leave a reply

The script loads raw force-vs-time data from a CSV file, automatically identifies step-like changes in the force signal, and exports a cleaned list of meaningful force transitions to Excel. It first detects changepoints in the force trace using a PELT-based algorithm from the ruptures library, which segments the signal into plateaus and finds the boundaries between them. Each boundary is treated as a candidate step, with its timestamp, the average force before and after, and the delta force (direction up or down). After detection, the script applies two biologically motivated cleanup rules: (1) it merges consecutive steps in the same direction that occur within a short time window, treating them as a single physical event instead of multiple tiny jumps; and (2) it removes fast “spike” pairs where the force briefly jumps up and then down (or down then up) within a very short interval, which are likely noise rather than true mechanical steps. Finally, it filters for steps with an absolute amplitude ≥ 1 pN and writes several worksheets to an Excel file, including the raw detection, intermediate cleanup stages, and the final curated step list. This makes it easy to batch-process kymograph force traces and consistently quantify discrete force changes.

# ------------------------------------------------------------
# Requirements:
# pip install pandas numpy ruptures openpyxl
# ------------------------------------------------------------

import pandas as pd
import numpy as np
import ruptures as rpt
import os

########################
# 1. User I/O settings #
########################

infile = "./ForceHFCorrectedForce2x_p853_p502_b3_1_downsampled1000_HF_HF.csv"
outfile = "force_steps.xlsx"

MIN_STEP = 1.0          # pN threshold to report
FALLBACK_FS = 78.1      # Hz, if no time column
PENALTY = 5             # ruptures sensitivity

# post-processing params:
MERGE_WINDOW = 0.10     # sec: merge same-direction steps < this apart
CANCEL_WINDOW = 0.20    # sec: remove opposite-direction spike pairs < this apart

#####################################
# 2. Load data and pick time/force  #
#####################################

df = pd.read_csv(infile)

# Guess force column
force_col_candidates = [
    "Corrected Force 2x",
    "Corrected Force",
    "Force",
    "Force HF",
    "force",
    "force_pN",
]
force_col = None
for c in force_col_candidates:
    if c in df.columns:
        force_col = c
        break
if force_col is None:
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        raise ValueError("No numeric force column found in CSV.")
    force_col = numeric_cols[-1]

# Guess time column
time_col_candidates = [
    "Time",
    "time",
    "t",
    "timestamp",
    "Timestamp (s)",
    "Seconds",
]
time_col = None
for c in time_col_candidates:
    if c in df.columns:
        time_col = c
        break
if time_col is None:
    n = len(df)
    dt = 1.0 / FALLBACK_FS
    df["__time_s"] = np.arange(n) * dt
    time_col = "__time_s"

time = df[time_col].to_numpy()
force = df[force_col].to_numpy()

###########################################
# 3. Changepoint detection (ruptures PELT)#
###########################################

algo = rpt.Pelt(model="l2").fit(force)
bkpts = algo.predict(pen=PENALTY)
# bkpts are 1-based segment ends

seg_starts = [0] + [b for b in bkpts[:-1]]
seg_ends   = [b for b in bkpts]  # 1-based inclusive ends

segments = []
for start, end in zip(seg_starts, seg_ends):
    seg_force = force[start:end]
    seg_time = time[start:end]
    mean_force = np.mean(seg_force)
    mid_idx = start + (len(seg_force) // 2)
    mid_time = time[mid_idx]
    segments.append(
        {
            "start_idx": start,
            "end_idx": end - 1,
            "mean_force": mean_force,
            "mid_time": mid_time,
        }
    )

# Build raw step list
raw_changes = []
for i in range(len(segments) - 1):
    f_before = segments[i]["mean_force"]
    f_after = segments[i + 1]["mean_force"]
    delta_f = f_after - f_before
    boundary_idx = segments[i]["end_idx"]
    change_time = time[boundary_idx]
    raw_changes.append(
        {
            "change_time_s": change_time,
            "force_before_pN": f_before,
            "force_after_pN": f_after,
            "delta_force_pN": delta_f,
            "direction": "up" if delta_f > 0 else "down",
        }
    )

changes_df = pd.DataFrame(raw_changes)

###########################################################
# 4. POST-PROCESSING STEPS
#
# 4a. Merge consecutive steps with SAME direction if they
#     happen within MERGE_WINDOW seconds.
#
#     Example: down at t=58.63s then down again at t=58.69s
###########################################################

def merge_close_same_direction(df_steps, merge_window):
    if df_steps.empty:
        return df_steps.copy()

    merged = []
    cur = df_steps.iloc[0].to_dict()

    for idx in range(1, len(df_steps)):
        row = df_steps.iloc[idx].to_dict()

        same_dir = (row["direction"] == cur["direction"])
        close_in_time = (row["change_time_s"] - cur["change_time_s"]) <= merge_window

        if same_dir and close_in_time:
            # merge cur and row into a single bigger jump:
            # new before: cur.force_before_pN
            # new after: row.force_after_pN
            # new delta: after - before
            merged_force_before = cur["force_before_pN"]
            merged_force_after  = row["force_after_pN"]
            merged_delta        = merged_force_after - merged_force_before

            cur["force_after_pN"] = merged_force_after
            cur["delta_force_pN"] = merged_delta
            cur["direction"] = "up" if merged_delta > 0 else "down"
            # keep cur["change_time_s"] as first event time
        else:
            merged.append(cur)
            cur = row

    merged.append(cur)
    return pd.DataFrame(merged)

merged_df = merge_close_same_direction(changes_df, MERGE_WINDOW)

###########################################################
# 4b. Remove spike pairs:
#     If we have two consecutive steps with OPPOSITE direction
#     within CANCEL_WINDOW sec, drop both (treat as noise).
#
#     Example: up at 284.302s, down at 284.430s (~0.13s)
###########################################################

def remove_opposite_spikes(df_steps, cancel_window):
    if len(df_steps) <= 1:
        return df_steps.copy()

    to_drop = set()

    for i in range(len(df_steps) - 1):
        t1 = df_steps.iloc[i]["change_time_s"]
        t2 = df_steps.iloc[i+1]["change_time_s"]
        d1 = df_steps.iloc[i]["direction"]
        d2 = df_steps.iloc[i+1]["direction"]

        if d1 != d2 and (t2 - t1) <= cancel_window:
            # mark both i and i+1 to be dropped
            to_drop.add(i)
            to_drop.add(i+1)

    keep_rows = [i for i in range(len(df_steps)) if i not in to_drop]
    return df_steps.iloc[keep_rows].reset_index(drop=True)

clean_df = remove_opposite_spikes(merged_df, CANCEL_WINDOW)

###########################################################
# 5. Apply amplitude filter |Δforce| >= 1 pN
###########################################################

filtered_df = clean_df[np.abs(clean_df["delta_force_pN"]) >= MIN_STEP].copy()

##########################################
# 6. Save all versions to Excel
##########################################

with pd.ExcelWriter(outfile) as writer:
    changes_df.to_excel(writer, sheet_name="raw_steps", index=False)
    merged_df.to_excel(writer, sheet_name="merged_same_dir", index=False)
    clean_df.to_excel(writer, sheet_name="after_spike_removal", index=False)
    filtered_df.to_excel(writer, sheet_name="final_ge_1pN", index=False)

print(f"Done. Wrote: {outfile}")
print("Sheets in Excel:")
print(" - raw_steps              (direct from ruptures)")
print(" - merged_same_dir        (after merging close same-direction jumps)")
print(" - after_spike_removal    (after removing fast opposite blips)")
print(" - final_ge_1pN           (after |Δforce|≥1 pN filter)")

Processing Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606

Leave a reply

Vorgabe

 #perform PCA analysis, Venn diagram analysis, as well as KEGG and GO annotations. We would also appreciate it if you could include CPM calculations for this dataset (gene_cpm_counts.xlsx). For comparative analysis, we are particularly interested in identifying DEGs between WT and ΔIJ across the different treatments and time points.

 I have already performed the six comparisons, using WT as the reference:

     ΔIJ-17 vs WT-17 – no treatment
     ΔIJ-24 vs WT-24 – no treatment
     preΔIJ-17 vs preWT-17 – Treatment A
     preΔIJ-24 vs preWT-24 – Treatment A
     0_5ΔIJ-17 vs 0_5WT-17 – Treatment B
     0_5ΔIJ-24 vs 0_5WT-24 – Treatment B

 To gain a deeper understanding of how the ∆adeIJ mutation influences response dynamics over time and under different stimuli, would you also be interested in the following additional comparisons?

 Within-strain treatment responses
 (to explore how each strain responds to treatments):

 WT:

     preWT-17 vs WT-17 → response to Treatment A at 17 h
     preWT-24 vs WT-24 → response to Treatment A at 24 h
     0_5WT-17 vs WT-17 → response to Treatment B at 17 h
     0_5WT-24 vs WT-24 → response to Treatment B at 24 h

 ∆adeIJ:

     preΔIJ-17 vs ΔIJ-17 → response to Treatment A at 17 h
     preΔIJ-24 vs ΔIJ-24 → response to Treatment A at 24 h
     0_5ΔIJ-17 vs ΔIJ-17 → response to Treatment B at 17 h
     0_5ΔIJ-24 vs ΔIJ-24 → response to Treatment B at 24 h

 Time-course comparisons
 (to investigate time-dependent changes within each condition):

     WT-24 vs WT-17
     ΔIJ-24 vs ΔIJ-17
     preWT-24 vs preWT-17
     preΔIJ-24 vs preΔIJ-17
     0_5WT-24 vs 0_5WT-17
     0_5ΔIJ-24 vs 0_5ΔIJ-17

 I reviewed the datasets again and noticed that there are no ∆adeAB samples included. Should we try to obtain ∆adeAB data from other datasets? However, I’m a bit concerned that batch effects might pose a challenge when integrating data from different datasets.

 > It is possible to analyze DEGs across various time points (17 and 24 h) and stimuli (treatment A and B, and without treatment) iswithin both the ∆adeIJ mutant and the WT strain as our phenotypic characterization of these strains across two times points and stimuli shows significant differences but the other mutant ∆adeAB (similar function as AdeIJ) shows no difference compared to WT, therefore we are wondering what's happened to ∆adeIJ.

 deltaIJ_17, WT_17 – ΔadeIJ and wildtype strains w/o exposure at 17 h (No treatment)
 deltaIJ_24, WT_24 – ΔadeIJ and wildtype strains w/o exposure at 24 h (No treatment)
 pre_deltaIJ_17, pre_WT_17 – ΔadeIJ and wildtype strains with 1 exposure at 17 h (Treatment A)
 pre_deltaIJ_24, pre_WT_24 – ΔadeIJ and wildtype strains with 1 exposure at 24 h (Treatment A)
 0_5_deltaIJ_17, 0_5_WT_17 – ΔadeIJ and wildtype strains with 2 exposure at 17 h (Treatment B)
 0_5_deltaIJ_24, 0_5_WT_24 – ΔadeIJ and wildtype strains with 2 exposure at 24 h (Treatment B)

Preparing raw data

 mkdir raw_data; cd raw_data
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_1.fq.gz WT-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_2.fq.gz WT-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_1.fq.gz WT-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_2.fq.gz WT-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_1.fq.gz WT-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_2.fq.gz WT-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_1.fq.gz WT-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_2.fq.gz WT-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_1.fq.gz WT-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_2.fq.gz WT-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_1.fq.gz WT-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_2.fq.gz WT-24-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_1.fq.gz deltaIJ-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_2.fq.gz deltaIJ-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_1.fq.gz deltaIJ-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_2.fq.gz deltaIJ-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_1.fq.gz deltaIJ-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_2.fq.gz deltaIJ-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_1.fq.gz deltaIJ-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_2.fq.gz deltaIJ-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_1.fq.gz deltaIJ-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_2.fq.gz deltaIJ-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_1.fq.gz deltaIJ-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_2.fq.gz deltaIJ-24-r3_R2.fq.gz

 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_1.fq.gz pre_WT-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_2.fq.gz pre_WT-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_1.fq.gz pre_WT-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_2.fq.gz pre_WT-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_1.fq.gz pre_WT-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_2.fq.gz pre_WT-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_1.fq.gz pre_WT-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_2.fq.gz pre_WT-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_1.fq.gz pre_WT-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_2.fq.gz pre_WT-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_1.fq.gz pre_WT-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_2.fq.gz pre_WT-24-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_1.fq.gz pre_deltaIJ-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_2.fq.gz pre_deltaIJ-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_1.fq.gz pre_deltaIJ-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_2.fq.gz pre_deltaIJ-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_1.fq.gz pre_deltaIJ-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_2.fq.gz pre_deltaIJ-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_1.fq.gz pre_deltaIJ-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_2.fq.gz pre_deltaIJ-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_1.fq.gz pre_deltaIJ-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_2.fq.gz pre_deltaIJ-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_1.fq.gz pre_deltaIJ-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_2.fq.gz pre_deltaIJ-24-r3_R2.fq.gz

 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_1.fq.gz 0_5_WT-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_2.fq.gz 0_5_WT-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_1.fq.gz 0_5_WT-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_2.fq.gz 0_5_WT-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_1.fq.gz 0_5_WT-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_2.fq.gz 0_5_WT-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_1.fq.gz 0_5_WT-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_2.fq.gz 0_5_WT-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_1.fq.gz 0_5_WT-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_2.fq.gz 0_5_WT-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_1.fq.gz 0_5_WT-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_2.fq.gz 0_5_WT-24-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_1.fq.gz 0_5_deltaIJ-17-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_2.fq.gz 0_5_deltaIJ-17-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_1.fq.gz 0_5_deltaIJ-17-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_2.fq.gz 0_5_deltaIJ-17-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_1.fq.gz 0_5_deltaIJ-17-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_2.fq.gz 0_5_deltaIJ-17-r3_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_1.fq.gz 0_5_deltaIJ-24-r1_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_2.fq.gz 0_5_deltaIJ-24-r1_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_1.fq.gz 0_5_deltaIJ-24-r2_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_2.fq.gz 0_5_deltaIJ-24-r2_R2.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_1.fq.gz 0_5_deltaIJ-24-r3_R1.fq.gz
 ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_2.fq.gz 0_5_deltaIJ-24-r3_R2.fq.gz

(Done) Downloading CP059040.fasta and CP059040.gff from GenBank

Preparing the directory trimmed

 mkdir trimmed trimmed_unpaired;
 for sample_id in WT-17-r1 WT-17-r2 WT-17-r3 WT-24-r1 WT-24-r2 WT-24-r3 deltaIJ-17-r1 deltaIJ-17-r2 deltaIJ-17-r3 deltaIJ-24-r1 deltaIJ-24-r2 deltaIJ-24-r3  pre_WT-17-r1 pre_WT-17-r2 pre_WT-17-r3 pre_WT-24-r1 pre_WT-24-r2 pre_WT-24-r3 pre_deltaIJ-17-r1 pre_deltaIJ-17-r2 pre_deltaIJ-17-r3 pre_deltaIJ-24-r1 pre_deltaIJ-24-r2 pre_deltaIJ-24-r3  0_5_WT-17-r1 0_5_WT-17-r2 0_5_WT-17-r3 0_5_WT-24-r1 0_5_WT-24-r2 0_5_WT-24-r3 0_5_deltaIJ-17-r1 0_5_deltaIJ-17-r2 0_5_deltaIJ-17-r3 0_5_deltaIJ-24-r1 0_5_deltaIJ-24-r2 0_5_deltaIJ-24-r3; do \
         java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
 done

Preparing samplesheet.csv

 sample,fastq_1,fastq_2,strandedness
 WT_17_r1,WT-17-r1_R1.fq.gz,WT-17-r1_R2.fq.gz,auto
 WT_17_r2,WT-17-r2_R1.fq.gz,WT-17-r2_R2.fq.gz,auto
 WT_17_r3,WT-17-r3_R1.fq.gz,WT-17-r3_R2.fq.gz,auto
 WT_24_r1,WT-24-r1_R1.fq.gz,WT-24-r1_R2.fq.gz,auto
 WT_24_r2,WT-24-r2_R1.fq.gz,WT-24-r2_R2.fq.gz,auto
 WT_24_r3,WT-24-r3_R1.fq.gz,WT-24-r3_R2.fq.gz,auto
 deltaIJ_17_r1,deltaIJ-17-r1_R1.fq.gz,deltaIJ-17-r1_R2.fq.gz,auto
 deltaIJ_17_r2,deltaIJ-17-r2_R1.fq.gz,deltaIJ-17-r2_R2.fq.gz,auto
 deltaIJ_17_r3,deltaIJ-17-r3_R1.fq.gz,deltaIJ-17-r3_R2.fq.gz,auto
 deltaIJ_24_r1,deltaIJ-24-r1_R1.fq.gz,deltaIJ-24-r1_R2.fq.gz,auto
 deltaIJ_24_r2,deltaIJ-24-r2_R1.fq.gz,deltaIJ-24-r2_R2.fq.gz,auto
 deltaIJ_24_r3,deltaIJ-24-r3_R1.fq.gz,deltaIJ-24-r3_R2.fq.gz,auto
 pre_WT_17_r1,pre_WT-17-r1_R1.fq.gz,pre_WT-17-r1_R2.fq.gz,auto
 pre_WT_17_r2,pre_WT-17-r2_R1.fq.gz,pre_WT-17-r2_R2.fq.gz,auto
 pre_WT_17_r3,pre_WT-17-r3_R1.fq.gz,pre_WT-17-r3_R2.fq.gz,auto
 pre_WT_24_r1,pre_WT-24-r1_R1.fq.gz,pre_WT-24-r1_R2.fq.gz,auto
 pre_WT_24_r2,pre_WT-24-r2_R1.fq.gz,pre_WT-24-r2_R2.fq.gz,auto
 pre_WT_24_r3,pre_WT-24-r3_R1.fq.gz,pre_WT-24-r3_R2.fq.gz,auto
 pre_deltaIJ_17_r1,pre_deltaIJ-17-r1_R1.fq.gz,pre_deltaIJ-17-r1_R2.fq.gz,auto
 pre_deltaIJ_17_r2,pre_deltaIJ-17-r2_R1.fq.gz,pre_deltaIJ-17-r2_R2.fq.gz,auto
 pre_deltaIJ_17_r3,pre_deltaIJ-17-r3_R1.fq.gz,pre_deltaIJ-17-r3_R2.fq.gz,auto
 pre_deltaIJ_24_r1,pre_deltaIJ-24-r1_R1.fq.gz,pre_deltaIJ-24-r1_R2.fq.gz,auto
 pre_deltaIJ_24_r2,pre_deltaIJ-24-r2_R1.fq.gz,pre_deltaIJ-24-r2_R2.fq.gz,auto
 pre_deltaIJ_24_r3,pre_deltaIJ-24-r3_R1.fq.gz,pre_deltaIJ-24-r3_R2.fq.gz,auto
 0_5_WT_17_r1,0_5_WT-17-r1_R1.fq.gz,0_5_WT-17-r1_R2.fq.gz,auto
 0_5_WT_17_r2,0_5_WT-17-r2_R1.fq.gz,0_5_WT-17-r2_R2.fq.gz,auto
 0_5_WT_17_r3,0_5_WT-17-r3_R1.fq.gz,0_5_WT-17-r3_R2.fq.gz,auto
 0_5_WT_24_r1,0_5_WT-24-r1_R1.fq.gz,0_5_WT-24-r1_R2.fq.gz,auto
 0_5_WT_24_r2,0_5_WT-24-r2_R1.fq.gz,0_5_WT-24-r2_R2.fq.gz,auto
 0_5_WT_24_r3,0_5_WT-24-r3_R1.fq.gz,0_5_WT-24-r3_R2.fq.gz,auto
 0_5_deltaIJ_17_r1,0_5_deltaIJ-17-r1_R1.fq.gz,0_5_deltaIJ-17-r1_R2.fq.gz,auto
 0_5_deltaIJ_17_r2,0_5_deltaIJ-17-r2_R1.fq.gz,0_5_deltaIJ-17-r2_R2.fq.gz,auto
 0_5_deltaIJ_17_r3,0_5_deltaIJ-17-r3_R1.fq.gz,0_5_deltaIJ-17-r3_R2.fq.gz,auto
 0_5_deltaIJ_24_r1,0_5_deltaIJ-24-r1_R1.fq.gz,0_5_deltaIJ-24-r1_R2.fq.gz,auto
 0_5_deltaIJ_24_r2,0_5_deltaIJ-24-r2_R1.fq.gz,0_5_deltaIJ-24-r2_R2.fq.gz,auto
 0_5_deltaIJ_24_r3,0_5_deltaIJ-24-r3_R1.fq.gz,0_5_deltaIJ-24-r3_R2.fq.gz,auto

nextflow run

 #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

 docker pull nfcore/rnaseq
 ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

 #Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
 #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

 # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
 #Checking the record (see below) in results/genome/CP059040.gtf
 #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
 #--featurecounts_feature_type 'transcript' returns only the tRNA results
 #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon

 grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
 grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
 grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
 grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
 wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!

 grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
 sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
 grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797

 # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
 --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

 # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
 mv trimmed/*.fq.gz .; rmdir trimmed
 (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040_m.gff"        -profile docker -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

 # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file

Prepare counts_fixed by hand: delete all “””, “gene-“, replace , to ‘\t’.

 cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

 #keep only gene_id
 cut -f1 -d',' counts.tsv > f1
 cut -f3- -d',' counts.tsv > f3_
 paste -d',' f1 f3_ > counts_fixed.tsv

 Rscript rna_timecourse_bacteria.R \
   --counts counts_fixed.tsv \
   --samples samples.tsv \
   --condition_col condition \
   --time_col time_h \
   --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
   --volcano_csvs contrasts/ctrl_vs_treat.csv \
   --outdir results_bacteria

 #Delete the repliate 2 of ΔadeIJ_two_17 and repliate 1 of ΔadeIJ_two_24 are outlier.
 paste -d$'\t' f1_32 f34 f36_ > counts_fixed_2.tsv

 Rscript rna_timecourse_bacteria.R \
   --counts counts_fixed_2.tsv \
   --samples samples_2.tsv \
   --condition_col condition \
   --time_col time_h \
   --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
   --volcano_csvs contrasts/ctrl_vs_treat.csv \
   --outdir results_bacteria_2

Import data and pca-plot

 #mamba activate r_env

 #install.packages("ggfun")
 # Import the required libraries
 library("AnnotationDbi")
 library("clusterProfiler")
 library("ReactomePA")
 library(gplots)
 library(tximport)
 library(DESeq2)
 #library("org.Hs.eg.db")
 library(dplyr)
 library(tidyverse)
 #install.packages("devtools")
 #devtools::install_version("gtable", version = "0.3.0")
 library(gplots)
 library("RColorBrewer")
 #install.packages("ggrepel")
 library("ggrepel")
 # install.packages("openxlsx")
 library(openxlsx)
 library(EnhancedVolcano)
 library(DESeq2)
 library(edgeR)

 setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon")
 # Define paths to your Salmon output quantification files
 files <- c("WT_17_r1" = "./WT_17_r1/quant.sf",
            "WT_17_r2" = "./WT_17_r2/quant.sf",
            "WT_17_r3" = "./WT_17_r3/quant.sf",
            "WT_24_r1" = "./WT_24_r1/quant.sf",
            "WT_24_r2" = "./WT_24_r2/quant.sf",
            "WT_24_r3" = "./WT_24_r3/quant.sf",
            "deltaIJ_17_r1" = "./deltaIJ_17_r1/quant.sf",
            "deltaIJ_17_r2" = "./deltaIJ_17_r2/quant.sf",
            "deltaIJ_17_r3" = "./deltaIJ_17_r3/quant.sf",
            "deltaIJ_24_r1" = "./deltaIJ_24_r1/quant.sf",
            "deltaIJ_24_r2" = "./deltaIJ_24_r2/quant.sf",
            "deltaIJ_24_r3" = "./deltaIJ_24_r3/quant.sf",
            "pre_WT_17_r1" = "./pre_WT_17_r1/quant.sf",
            "pre_WT_17_r2" = "./pre_WT_17_r2/quant.sf",
            "pre_WT_17_r3" = "./pre_WT_17_r3/quant.sf",
            "pre_WT_24_r1" = "./pre_WT_24_r1/quant.sf",
            "pre_WT_24_r2" = "./pre_WT_24_r2/quant.sf",
            "pre_WT_24_r3" = "./pre_WT_24_r3/quant.sf",
            "pre_deltaIJ_17_r1" = "./pre_deltaIJ_17_r1/quant.sf",
            "pre_deltaIJ_17_r2" = "./pre_deltaIJ_17_r2/quant.sf",
            "pre_deltaIJ_17_r3" = "./pre_deltaIJ_17_r3/quant.sf",
            "pre_deltaIJ_24_r1" = "./pre_deltaIJ_24_r1/quant.sf",
            "pre_deltaIJ_24_r2" = "./pre_deltaIJ_24_r2/quant.sf",
            "pre_deltaIJ_24_r3" = "./pre_deltaIJ_24_r3/quant.sf",
            "0_5_WT_17_r1" = "./0_5_WT_17_r1/quant.sf",
            "0_5_WT_17_r2" = "./0_5_WT_17_r2/quant.sf",
            "0_5_WT_17_r3" = "./0_5_WT_17_r3/quant.sf",
            "0_5_WT_24_r1" = "./0_5_WT_24_r1/quant.sf",
            "0_5_WT_24_r2" = "./0_5_WT_24_r2/quant.sf",
            "0_5_WT_24_r3" = "./0_5_WT_24_r3/quant.sf",
            "0_5_deltaIJ_17_r1" = "./0_5_deltaIJ_17_r1/quant.sf",
            "0_5_deltaIJ_17_r2" = "./0_5_deltaIJ_17_r2/quant.sf",
            "0_5_deltaIJ_17_r3" = "./0_5_deltaIJ_17_r3/quant.sf",
            "0_5_deltaIJ_24_r1" = "./0_5_deltaIJ_24_r1/quant.sf",
            "0_5_deltaIJ_24_r2" = "./0_5_deltaIJ_24_r2/quant.sf",
            "0_5_deltaIJ_24_r3" = "./0_5_deltaIJ_24_r3/quant.sf")
 # Import the transcript abundance data with tximport
 txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
 # Define the replicates and condition of the samples
 replicate <- factor(c("r1", "r2", "r3",  "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",     "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",      "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
 condition <- factor(c("WT_none_17","WT_none_17","WT_none_17","WT_none_24","WT_none_24","WT_none_24", "deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_24","deltaadeIJ_none_24","deltaadeIJ_none_24",   "WT_one_17","WT_one_17","WT_one_17","WT_one_24","WT_one_24","WT_one_24", "deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_24","deltaadeIJ_one_24","deltaadeIJ_one_24",   "WT_two_17","WT_two_17","WT_two_17","WT_two_24","WT_two_24","WT_two_24", "deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_24","deltaadeIJ_two_24","deltaadeIJ_two_24"))
 # Construct colData manually
 colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
 #dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
 dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

 # -- Save the rlog-transformed counts --
 dim(counts(dds))
 head(counts(dds), 10)
 rld <- rlogTransformation(dds)
 rlog_counts <- assay(rld)
 write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

 # -- pca --
 png("pca2.png", 1200, 800)
 plotPCA(rld, intgroup=c("condition"))
 dev.off()

 png("pca3.png", 1200, 800)
 plotPCA(rld, intgroup=c("replicate"))
 dev.off()

 pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
 percentVar <- round(100 * attr(pdat, "percentVar"))
 # 1) keep only non-WT samples
 #pdat <- subset(pdat, !grepl("^WT_", condition))
 # drop unused factor levels so empty WT facets disappear
 pdat$condition <- droplevels(pdat$condition)
 # 2) pretty condition names: deltaadeIJ -> ΔadeIJ
 pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
 png("pca4.png", 1200, 800)
 ggplot(pdat, aes(PC1, PC2, color = replicate)) +
   geom_point(size = 3) +
   facet_wrap(~ condition) +
   xlab(paste0("PC1: ", percentVar[1], "% variance")) +
   ylab(paste0("PC2: ", percentVar[2], "% variance")) +
   theme_classic()
 dev.off()

 pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
 percentVar <- round(100 * attr(pdat, "percentVar"))
 # Drop WT_* conditions from the data and from factor levels
 pdat <- subset(pdat, !grepl("^WT_", condition))
 pdat$condition <- droplevels(pdat$condition)
 # Prettify condition labels for the legend: deltaadeIJ -> ΔadeIJ
 pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
 p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
   geom_point(size = 3) +
   xlab(paste0("PC1: ", percentVar[1], "% variance")) +
   ylab(paste0("PC2: ", percentVar[2], "% variance")) +
   theme_classic()
 png("pca5.png", 1200, 800); print(p); dev.off()

 pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
 percentVar <- round(100 * attr(pdat, "percentVar"))
 p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
   geom_point(size = 3) +
   facet_wrap(~ condition) +
   xlab(paste0("PC1: ", percentVar[1], "% variance")) +
   ylab(paste0("PC2: ", percentVar[2], "% variance")) +
   theme_classic()
 png("pca6.png", 1200, 800); print(p_fac); dev.off()

 # -- heatmap --
 png("heatmap2.png", 1200, 800)
 distsRL <- dist(t(assay(rld)))
 mat <- as.matrix(distsRL)
 hc <- hclust(distsRL)
 hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
 heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
 dev.off()

 # -- pca_media_strain --
 #png("pca_media.png", 1200, 800)
 #plotPCA(rld, intgroup=c("media"))
 #dev.off()
 #png("pca_strain.png", 1200, 800)
 #plotPCA(rld, intgroup=c("strain"))
 #dev.off()
 #png("pca_time.png", 1200, 800)
 #plotPCA(rld, intgroup=c("time"))
 #dev.off()

Select the differentially expressed genes

 #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
 #https://www.biostars.org/p/282295/
 #https://www.biostars.org/p/335751/
 dds$condition
 [1] WT_none_17         WT_none_17         WT_none_17         WT_none_24
 [5] WT_none_24         WT_none_24         deltaadeIJ_none_17 deltaadeIJ_none_17
 [9] deltaadeIJ_none_17 deltaadeIJ_none_24 deltaadeIJ_none_24 deltaadeIJ_none_24
 [13] WT_one_17          WT_one_17          WT_one_17          WT_one_24
 [17] WT_one_24          WT_one_24          deltaadeIJ_one_17  deltaadeIJ_one_17
 [21] deltaadeIJ_one_17  deltaadeIJ_one_24  deltaadeIJ_one_24  deltaadeIJ_one_24
 [25] WT_two_17          WT_two_17          WT_two_17          WT_two_24
 [29] WT_two_24          WT_two_24          deltaadeIJ_two_17  deltaadeIJ_two_17
 [33] deltaadeIJ_two_17  deltaadeIJ_two_24  deltaadeIJ_two_24  deltaadeIJ_two_24
 12 Levels: deltaadeIJ_none_17 deltaadeIJ_none_24 ... WT_two_24

 #CONSOLE: mkdir star_salmon/degenes

 setwd("degenes")

 # Construct colData automatically
 sample_table <- data.frame(
     condition = condition,
     replicate = replicate
 )
 split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
 colnames(split_cond) <- c("genotype", "exposure", "time")
 colData <- cbind(sample_table, split_cond)
 colData$genotype <- factor(colData$genotype)
 colData$exposure  <- factor(colData$exposure)
 colData$time   <- factor(colData$time)
 colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
 # Construct colData manually
 colData2 <- data.frame(condition=condition, row.names=names(files))

 # 确保因子顺序（可选）
 colData$genotype <- relevel(factor(colData$genotype), ref = "WT")
 colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
 colData$time   <- relevel(factor(colData$time), ref = "17")

 dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * exposure * time)
 dds <- DESeq(dds, betaPrior = FALSE)
 resultsNames(dds)
 [1] "Intercept"
 [2] "genotype_deltaadeIJ_vs_WT"
 [3] "exposure_one_vs_none"
 [4] "exposure_two_vs_none"
 [5] "time_24_vs_17"
 [6] "genotypedeltaadeIJ.exposureone"
 [7] "genotypedeltaadeIJ.exposuretwo"
 [8] "genotypedeltaadeIJ.time24"
 [9] "exposureone.time24"
 [10] "exposuretwo.time24"
 [11] "genotypedeltaadeIJ.exposureone.time24"
 [12] "genotypedeltaadeIJ.exposuretwo.time24"

 # 提取 genotype 的主效应: up 10, down 4
 contrast <- "genotype_deltaadeIJ_vs_WT"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 # 提取 one exposure 的主效应: up 196; down 298
 contrast <- "exposure_one_vs_none"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 # 提取 two exposure 的主效应: up 80; down 105
 contrast <- "exposure_two_vs_none"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 # 提取 time 的主效应 up 10; down 2
 contrast <- "time_24_vs_17"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 #1.)  ΔadeIJ_none 17h vs WT_none 17h
 #2.)  ΔadeIJ_none 24h vs WT_none 24h
 #3.)  ΔadeIJ_one 17h vs WT_one 17h
 #4.)  ΔadeIJ_one 24h vs WT_one 24h
 #5.)  ΔadeIJ_two 17h vs WT_two 17h
 #6.)  ΔadeIJ_two 24h vs WT_two 24h

 #---- relevel to control ----
 dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
 dds$condition <- relevel(dds$condition, "WT_none_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_none_17_vs_WT_none_17")

 dds$condition <- relevel(dds$condition, "WT_none_24")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_none_24_vs_WT_none_24")

 dds$condition <- relevel(dds$condition, "WT_one_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_one_17_vs_WT_one_17")

 dds$condition <- relevel(dds$condition, "WT_one_24")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_one_24_vs_WT_one_24")

 dds$condition <- relevel(dds$condition, "WT_two_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_two_17_vs_WT_two_17")

 dds$condition <- relevel(dds$condition, "WT_two_24")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_two_24_vs_WT_two_24")

 # WT_none_xh
 dds$condition <- relevel(dds$condition, "WT_none_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_none_24_vs_WT_none_17")

 # WT_one_xh
 dds$condition <- relevel(dds$condition, "WT_one_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_one_24_vs_WT_one_17")

 # WT_two_xh
 dds$condition <- relevel(dds$condition, "WT_two_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_two_24_vs_WT_two_17")

 # deltaadeIJ_none_xh
 dds$condition <- relevel(dds$condition, "deltaadeIJ_none_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_none_24_vs_deltaadeIJ_none_17")

 # deltaadeIJ_one_xh
 dds$condition <- relevel(dds$condition, "deltaadeIJ_one_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_one_24_vs_deltaadeIJ_one_17")

 # deltaadeIJ_two_xh
 dds$condition <- relevel(dds$condition, "deltaadeIJ_two_17")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltaadeIJ_two_24_vs_deltaadeIJ_two_17")

 for (i in clist) {
   contrast = paste("condition", i, sep="_")
   #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
   res = results(dds, name=contrast)
   res <- res[!is.na(res$log2FoldChange),]
   res_df <- as.data.frame(res)

   write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
   #res$log2FoldChange < -2 & res$padj < 5e-2
   up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
   down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
   write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
   write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
 }

 # -- Under host-env (mamba activate plot-numpy1) --
 mamba activate plot-numpy1
 grep -P "\tgene\t" CP059040_m.gff > CP059040_gene.gff

 for cmp in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-all.txt ${cmp}-all.csv
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-up.txt ${cmp}-up.csv
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-down.txt ${cmp}-down.csv
 done
 #deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
 #deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
 #deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)

(NOT_PERFORMED) Volcano plots

 # ---- delta sbp TSB 2h vs WT TSB 2h ----
 res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
 #print(duplicated_genes)
 # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
 #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
 #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
 #res[res$GeneName == "bfr", ]

 #1st_strategy First occurrence is kept and Subsequent duplicates are removed
 #res <- res[!duplicated(res$GeneName), ]
 #2nd_strategy keep the row with the smallest padj value for each GeneName
 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 ## Ensure the data frame matches the expected format
 ## For example, it should have columns: log2FoldChange, padj, etc.
 #res <- as.data.frame(res)
 ## Remove rows with NA in log2FoldChange (if needed)
 #res <- res[!is.na(res$log2FoldChange),]

 # Replace padj = 0 with a small value
 #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
 dev.off()

 # ---- delta sbp TSB 4h vs WT TSB 4h ----
 res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
 dev.off()

 # ---- delta sbp TSB 18h vs WT TSB 18h ----
 res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
 dev.off()

 # ---- delta sbp MH 2h vs WT MH 2h ----
 res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
 #print(duplicated_genes)
 # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
 #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
 #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
 #res[res$GeneName == "bfr", ]

 #1st_strategy First occurrence is kept and Subsequent duplicates are removed
 #res <- res[!duplicated(res$GeneName), ]
 #2nd_strategy keep the row with the smallest padj value for each GeneName
 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 ## Ensure the data frame matches the expected format
 ## For example, it should have columns: log2FoldChange, padj, etc.
 #res <- as.data.frame(res)
 ## Remove rows with NA in log2FoldChange (if needed)
 #res <- res[!is.na(res$log2FoldChange),]

 # Replace padj = 0 with a small value
 #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
 dev.off()

 # ---- delta sbp MH 4h vs WT MH 4h ----
 res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
 dev.off()

 # ---- delta sbp MH 18h vs WT MH 18h ----
 res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
 dev.off()

 #Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

Clustering the genes and draw heatmap

 #http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
 #http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
 # save the Up-regulated and Down-regulated genes into -up.id and -down.id

 for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
   echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
   echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
 done

 # ------------------ Heatmap generation for two samples ----------------------

 ## ------------------------------------------------------------
 ## DEGs heatmap (dynamic GOI + dynamic column tags)
 ## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
 ## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
 ## ------------------------------------------------------------

    #12 deltaIJ_17_vs_WT_17-up.txt
    #1 deltaIJ_24_vs_WT_24-up.txt
    #239 pre_deltaIJ_17_vs_pre_WT_17-up.txt
    #84 pre_deltaIJ_24_vs_pre_WT_24-up.txt
    #75 0_5_deltaIJ_17_vs_0_5_WT_17-up.txt
    #2 0_5_deltaIJ_24_vs_0_5_WT_24-up.txt

    #4 deltaIJ_17_vs_WT_17-down.txt
    #3 deltaIJ_24_vs_WT_24-down.txt
    #91 pre_deltaIJ_17_vs_pre_WT_17-down.txt
    #65 pre_deltaIJ_24_vs_pre_WT_24-down.txt
    #15 0_5_deltaIJ_17_vs_0_5_WT_17-down.txt
    #4 0_5_deltaIJ_24_vs_0_5_WT_24-down.txt

 ## 0) Config ---------------------------------------------------
 contrast <- "deltaadeIJ_none_17_vs_WT_none_17"  #up 11, down 3 vs. (10,4)
 contrast <- "deltaadeIJ_none_24_vs_WT_none_24"  #up 0, down 2 vs. (0,2)
 contrast <- "deltaadeIJ_one_17_vs_WT_one_17"    #up 238, down 90 vs. (239,89)  --> height 2600
 contrast <- "deltaadeIJ_one_24_vs_WT_one_24"    #up 83, down 64 vs. (64,71) --> height 1600
 contrast <- "deltaadeIJ_two_17_vs_WT_two_17"    #up 74, down 14 vs. (75,9) --> height 1000
 contrast <- "deltaadeIJ_two_24_vs_WT_two_24"    #up 1, down 3 vs. (3,3)

 contrast <- "WT_none_24_vs_WT_none_17"  #(up 10, down 2)
 contrast <- "WT_one_24_vs_WT_one_17"    #(up 97, down 3)
 contrast <- "WT_two_24_vs_WT_two_17"    #(up 12, down 1)
 contrast <- "deltaadeIJ_none_24_vs_deltaadeIJ_none_17" #(up 0, down 0)
 contrast <- "deltaadeIJ_one_24_vs_deltaadeIJ_one_17"   #(up 0, down 10)
 contrast <- "deltaadeIJ_two_24_vs_deltaadeIJ_two_17"   #(up 8, down 51)

 ## 1) Packages -------------------------------------------------
 need <- c("gplots")
 to_install <- setdiff(need, rownames(installed.packages()))
 if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
 suppressPackageStartupMessages(library(gplots))

 ## 2) Helpers --------------------------------------------------
 # Read IDs from a file that may be:
 #  - one column with or without header "Gene_Id"
 #  - may contain quotes
 read_ids_from_file <- function(path) {
   if (!file.exists(path)) stop("File not found: ", path)
   df <- tryCatch(
     read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
     error = function(e) NULL
   )
   if (!is.null(df) && ncol(df) >= 1) {
     if ("Gene_Id" %in% names(df)) {
       ids <- df[["Gene_Id"]]
     } else if (ncol(df) == 1L) {
       ids <- df[[1]]
     } else {
       first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
       if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
       ids <- df[[first_nonempty]]
     }
   } else {
     df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
     if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
     ids <- df2[[1]]
   }
   ids <- trimws(gsub('"', "", ids))
   ids[nzchar(ids)]
 }

 #BREAK_LINE

 # From "A_vs_B" get c("A","B")
 split_contrast_groups <- function(x) {
   parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
   if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
   parts
 }

 # Match whole tags at boundaries or underscores
 match_tags <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 ## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
 # Use rld if present; otherwise try vsd
 if (exists("rld")) {
   expr_all <- assay(rld)
 } else if (exists("vsd")) {
   expr_all <- assay(vsd)
 } else {
   stop("Neither 'rld' nor 'vsd' object is available in the environment.")
 }
 RNASeq.NoCellLine <- as.matrix(expr_all)
 colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

 # -- RUN the code with the new contract from HERE after first run --

 ## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
 up_file   <- paste0(contrast, "-up.id")
 down_file <- paste0(contrast, "-down.id")
 GOI_up   <- read_ids_from_file(up_file)
 GOI_down <- read_ids_from_file(down_file)
 GOI <- unique(c(GOI_up, GOI_down))
 if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

 # GOI are already 'gene-*' in your data — use them directly for matching
 present <- intersect(rownames(RNASeq.NoCellLine), GOI)
 if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
 # Optional: report truly missing IDs (on the same 'gene-*' format)
 missing <- setdiff(GOI, present)
 if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

 ## 5) Keep ONLY columns for the two groups in the contrast -----
 groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
 keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
 if (!any(keep_cols)) {
   stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
       ". Check your column names or implement colData-based filtering.")
 }
 cols_idx <- which(keep_cols)
 sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

 # Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
 ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

 # Subset safely
 expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

 ## 6) Remove constant/NA rows ----------------------------------
 row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
 if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
 datamat <- expr_sub[row_ok, , drop = FALSE]

 # Save the filtered matrix used for the heatmap (optional)
 out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
 write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

 ## 6.1) Pretty labels (display only) ---------------------------
 # Row labels: strip 'gene-'
 labRow_pretty <- sub("^gene-", "", rownames(datamat))

 # Column labels: 'deltaadeIJ' -> 'ΔadeIJ' and nicer spacing
 labCol_pretty <- colnames(datamat)
 labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty)
 labCol_pretty <- gsub("_", " ", labCol_pretty)   # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
 # If you prefer to drop replicate suffixes, uncomment:
 # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

 ## 7) Clustering -----------------------------------------------
 # Row clustering with Pearson distance
 hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
 #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
 #row_cor[!is.finite(row_cor)] <- 0
 #hr <- hclust(as.dist(1 - row_cor), method = "complete")

 # Color row-side groups by cutting the dendrogram
 mycl <- cutree(hr, h = max(hr$height) / 1.1)
 palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                   "lightblue","pink","purple","lightcyan","salmon","lightgreen")
 mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

 #BREAK_LINE

 png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600)
 heatmap.2(datamat,
         Rowv = as.dendrogram(hr),
         col = bluered(75),
         scale = "row",
         RowSideColors = mycol,
         trace = "none",
         margin = c(10, 20),         # bottom, left
         sepwidth = c(0, 0),
         dendrogram = 'row',
         Colv = 'false',
         density.info = 'none',
         labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
         labCol     = labCol_pretty,   # col labels with Δsbp + spaces
         cexRow = 2.5,
         cexCol = 2.5,
         srtCol = 15,
         lhei = c(0.6, 4),           # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large'
         lwid = c(0.8, 4))           # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large'
 dev.off()

 png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 1000)
 heatmap.2(
   datamat,
   Rowv = as.dendrogram(hr),
   Colv = FALSE,
   dendrogram = "row",
   col = bluered(75),
   scale = "row",
   trace = "none",
   density.info = "none",
   RowSideColors = mycol,
   margins = c(10, 15),      # c(bottom, left)
   sepwidth = c(0, 0),
   labRow = labRow_pretty,
   labCol = labCol_pretty,
   cexRow = 1.3,
   cexCol = 1.8,
   srtCol = 15,
   lhei = c(0.01, 4),
   lwid = c(0.5, 4),
   key = FALSE               # safer; add manual z-score key if you want (see note below)
 )
 dev.off()

 # ------------------ Heatmap generation for three samples ----------------------

 ## ============================================================
 ## Three-condition DEGs heatmap from multiple pairwise contrasts
 ## Example contrasts:
 ##   "WT_MH_4h_vs_WT_MH_2h",
 ##   "WT_MH_18h_vs_WT_MH_2h",
 ##   "WT_MH_18h_vs_WT_MH_4h"
 ## Output shows the union of DEGs across all contrasts and
 ## only the columns (samples) for the 3 conditions.
 ## ============================================================

 ## -------- 0) User inputs ------------------------------------
 contrasts <- c(
   "WT_MH_4h_vs_WT_MH_2h",
   "WT_MH_18h_vs_WT_MH_2h",
   "WT_MH_18h_vs_WT_MH_4h"
 )
 contrasts <- c(
   "WT_TSB_4h_vs_WT_TSB_2h",
   "WT_TSB_18h_vs_WT_TSB_2h",
   "WT_TSB_18h_vs_WT_TSB_4h"
 )
 contrasts <- c(
   "deltasbp_MH_4h_vs_deltasbp_MH_2h",
   "deltasbp_MH_18h_vs_deltasbp_MH_2h",
   "deltasbp_MH_18h_vs_deltasbp_MH_4h"
 )
 contrasts <- c(
   "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
   "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
   "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
 )
 ## Optionally force a condition display order (defaults to order of first appearance)
 cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h")
 cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h")
 cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h")
 cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h")
 #cond_order <- NULL

 ## -------- 1) Packages ---------------------------------------
 need <- c("gplots")
 to_install <- setdiff(need, rownames(installed.packages()))
 if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
 suppressPackageStartupMessages(library(gplots))

 ## -------- 2) Helpers ----------------------------------------
 read_ids_from_file <- function(path) {
   if (!file.exists(path)) stop("File not found: ", path)
   df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE,
                             quote = "\"'", comment.char = ""), error = function(e) NULL)
   if (!is.null(df) && ncol(df) >= 1) {
     ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
   } else {
     df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE,
                       quote = "\"'", comment.char = "")
     ids <- df2[[1]]
   }
   ids <- trimws(gsub('"', "", ids))
   ids[nzchar(ids)]
 }

 # From "A_vs_B" return c("A","B")
 split_contrast_groups <- function(x) {
   parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
   if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
   parts
 }

 # Grep whole tag between start/end or underscores
 match_tags <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 # Pretty labels for columns (optional tweaks)
 prettify_col_labels <- function(x) {
   x <- gsub("^deltasbp", "\u0394sbp", x)  # example from your earlier case
   x <- gsub("_", " ", x)
   x
 }

 # BREAK_LINE

 ## -------- 3) Build GOI (union across contrasts) -------------
 up_files   <- paste0(contrasts, "-up.id")
 down_files <- paste0(contrasts, "-down.id")

 GOI <- unique(unlist(c(
   lapply(up_files,   read_ids_from_file),
   lapply(down_files, read_ids_from_file)
 )))
 if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

 ## -------- 4) Expression matrix (rld or vsd) -----------------
 if (exists("rld")) {
   expr_all <- assay(rld)
 } else if (exists("vsd")) {
   expr_all <- assay(vsd)
 } else {
   stop("Neither 'rld' nor 'vsd' object is available in the environment.")
 }
 expr_all <- as.matrix(expr_all)

 present <- intersect(rownames(expr_all), GOI)
 if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
 missing <- setdiff(GOI, present)
 if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

 ## -------- 5) Infer the THREE condition tags -----------------
 pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B)
 cond_tags <- unique(unlist(pair_groups))
 if (length(cond_tags) != 3L) {
   stop("Expected exactly three unique condition tags across the contrasts, got: ",
       paste(cond_tags, collapse = ", "))
 }

 # If user provided an explicit order, use it; else keep first-appearance order
 if (!is.null(cond_order)) {
   if (!setequal(cond_order, cond_tags))
     stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
   cond_tags <- cond_order
 }

 #BREAK_LINE

 ## -------- 6) Subset columns to those 3 conditions -----------
 # helper: does a name contain any of the tags?
 match_any_tag <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 # helper: return the specific tag that a single name matches
 detect_tag <- function(nm, tags) {
   hits <- vapply(tags, function(t)
     grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
   if (!any(hits)) NA_character_ else tags[which(hits)[1]]
 }

 keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
 if (!any(keep_cols)) {
   stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
 }

 sub_idx <- which(keep_cols)
 sub_colnames <- colnames(expr_all)[sub_idx]

 # find the tag for each kept column (this is the part that was wrong before)
 cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)

 # rank columns by your desired condition order, then by name within each condition
 cond_rank <- match(cond_for_col, cond_tags)
 ord <- order(cond_rank, sub_colnames)

 expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

 ## -------- 7) Remove constant/NA rows ------------------------
 row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
 if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
 datamat <- expr_sub[row_ok, , drop = FALSE]

 ## -------- 8) Labels ----------------------------------------
 labRow_pretty <- sub("^gene-", "", rownames(datamat))
 labCol_pretty <- prettify_col_labels(colnames(datamat))

 ## -------- 9) Clustering (rows) ------------------------------
 hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")

 # Color row-side groups by cutting the dendrogram
 mycl <- cutree(hr, h = max(hr$height) / 1.3)
 palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                   "lightblue","pink","purple","lightcyan","salmon","lightgreen")
 mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

 ## -------- 10) Save the matrix used --------------------------
 out_tag <- paste(cond_tags, collapse = "_")
 write.csv(as.data.frame(datamat),
           file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"),
           quote = FALSE)

 ## -------- 11) Plot heatmap ----------------------------------
 png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 5000)
 heatmap.2(
   datamat,
   Rowv = as.dendrogram(hr),
   Colv = FALSE,
   dendrogram = "row",
   col = bluered(75),
   scale = "row",
   trace = "none",
   density.info = "none",
   RowSideColors = mycol,
   margins = c(10, 15),      # c(bottom, left)
   sepwidth = c(0, 0),
   labRow = labRow_pretty,
   labCol = labCol_pretty,
   cexRow = 1.3,
   cexCol = 1.8,
   srtCol = 15,
   lhei = c(0.01, 4),
   lwid = c(0.5, 4),
   key = FALSE               # safer; add manual z-score key if you want (see note below)
 )
 dev.off()

 mv DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png DEGs_heatmap_WT_MH.png
 mv DEGs_heatmap_WT_TSB_2h_WT_TSB_4h_WT_TSB_18h.png DEGs_heatmap_WT_TSB.png
 mv DEGs_heatmap_deltasbp_MH_2h_deltasbp_MH_4h_deltasbp_MH_18h.png DEGs_heatmap_deltasbp_MH.png
 mv DEGs_heatmap_deltasbp_TSB_2h_deltasbp_TSB_4h_deltasbp_TSB_18h.png DEGs_heatmap_deltasbp_TSB.png
 # ------------------ Heatmap generation for three samples END ----------------------

 # ==== (NOT_USED) Ultra-robust heatmap.2 plotter with many attempts ====
 # Inputs:
 #   mat        : numeric matrix (genes x samples)
 #   hr         : hclust for rows (or TRUE/FALSE)
 #   row_colors : vector of RowSideColors of length nrow(mat) or NULL
 #   labRow     : character vector of row labels (display only)
 #   labCol     : character vector of col labels (display only)
 #   outfile    : output PNG path
 #   base_res   : DPI for PNG (default 150)
 # ==== Slide-tuned heatmap.2 plotter (moderate size, larger fonts, 45° labels) ====
 safe_heatmap2 <- function(mat, hr, row_colors, labRow, labCol, outfile, base_res = 150) {
   stopifnot(is.matrix(mat))
   nr <- nrow(mat); nc <- ncol(mat)

   # Target slide size & sensible caps
   #target_w <- 2400; target_h <- 1600
   #max_w <- 3000; max_h <- 2000
   target_w <- 800; target_h <- 2000
   max_w <- 1500; max_h <- 1500

   # Label stats
   max_row_chars <- if (length(labRow)) max(nchar(labRow), na.rm = TRUE) else 1
   max_col_chars <- if (length(labCol)) max(nchar(labCol), na.rm = TRUE) else 1

   #add_attempt(target_w, target_h, 0.90, 1.00, 45, NULL, TRUE,  TRUE,  TRUE)
   attempts <- list()
   add_attempt <- function(w, h, cr, cc, rot, mar = NULL, key = TRUE, showR = TRUE, showC = TRUE, trunc_rows = 0) {
     attempts[[length(attempts) + 1]] <<- list(
       w = w, h = h, cr = cr, cc = cc, rot = rot, mar = mar,
       key = key, showR = showR, showC = showC, trunc_rows = trunc_rows
     )
   }

   # Note that if the key is FALSE, all works, if the key is TRUE, none works!
   # 1) Preferred look: moderate size, biggish fonts, 45° labels
   add_attempt(target_w,           target_h,           0.90, 1.00, 30, c(2,1), TRUE, TRUE, TRUE)
   # 2) Same, explicit margins computed later
   add_attempt(target_w,           target_h,           0.85, 0.95, 45, c(10,15), TRUE, TRUE, TRUE)
   # 3) Slightly bigger canvas
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.85, 0.95, 30, c(10,15), TRUE, TRUE, TRUE)
   # 4) Make margins more generous (in lines)
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(10,14), FALSE, TRUE, TRUE)
   # 5) Reduce rotation to 30 if still tight
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(8,12),  FALSE, TRUE, TRUE)
   # 6) Final fallback: keep fonts reasonable, 0° labels, slightly bigger margins
   add_attempt(min(target_w+500,   max_w), min(target_h+300, max_h), 0.80, 0.90,  45, c(8,12),  FALSE, TRUE, TRUE)
   # 7) Last resort: truncate long row labels (keeps readability)
   if (max_row_chars > 20) {
     add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE, trunc_rows = 18)
   }

   for (i in seq_along(attempts)) {
     a <- attempts[[i]]

     # Compute margins if not provided
     if (is.null(a$mar)) {
       col_margin <- if (a$showC) {
         if (a$rot > 0) max(6, ceiling(0.45 * max_col_chars * max(a$cc, 0.8))) else
                       max(5, ceiling(0.22 * max_col_chars * max(a$cc, 0.8)))
       } else 4
       row_margin <- if (a$showR) max(6, ceiling(0.55 * max_row_chars * max(a$cr, 0.8))) else 4
       mar <- c(col_margin, row_margin)
     } else {
       mar <- a$mar
     }

     # Prepare labels for this attempt
     lr <- if (a$showR) labRow else rep("", nr)
     if (a$trunc_rows > 0 && a$showR) {
       lr <- ifelse(nchar(lr) > a$trunc_rows, paste0(substr(lr, 1, a$trunc_rows), "…"), lr)
     }
     lc <- if (a$showC) labCol else rep("", nc)

     # Close any open device
     if (dev.cur() != 1) try(dev.off(), silent = TRUE)

     ok <- FALSE
     try({
       png(outfile, width = ceiling(a$w), height = ceiling(a$h), res = base_res)
       gplots::heatmap.2(
         mat,
         Rowv = as.dendrogram(hr),
         Colv = FALSE,
         dendrogram = "row",
         col = gplots::bluered(75),
         scale = "row",
         trace = "none",
         density.info = "none",
         RowSideColors = row_colors,
         key = a$key,
         margins = mar,           # c(col, row) in lines
         sepwidth = c(0, 0),
         labRow = lr,
         labCol = lc,
         cexRow = a$cr,
         cexCol = a$cc,
         srtCol = a$rot,
         lhei = c(0.1, 4),
         lwid = c(0.1, 4)
       )
       dev.off()
       ok <- TRUE
     }, silent = TRUE)

     if (ok) {
       message(sprintf(
         "✓ Heatmap saved: %s  (attempt %d)  size=%dx%d  margins=c(%d,%d)  cexRow=%.2f  cexCol=%.2f  srtCol=%d",
         outfile, i, ceiling(a$w), ceiling(a$h), mar[1], mar[2], a$cr, a$cc, a$rot
       ))
       return(invisible(TRUE))
     } else {
       if (dev.cur() != 1) try(dev.off(), silent = TRUE)
       message(sprintf("Attempt %d failed; retrying...", i))
     }
   }

   stop("Failed to draw heatmap after tuned attempts. Consider ComplexHeatmap if this persists.")
 }

 safe_heatmap2(
   mat        = datamat,
   hr         = hr,
   row_colors = mycol,
   labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
   labCol     = labCol_pretty,   # col labels with Δsbp + spaces
   outfile    = paste0("DEGs_heatmap_", contrast, ".png"),
   #base_res   = 150
 )

 # -- (OLD CODE) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h --
 cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids
 #add Gene_Id in the first line, delete the ""  #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID.
 GOI <- read.csv("ids")$Gene_Id
 RNASeq.NoCellLine <- assay(rld)
 #install.packages("gplots")
 library("gplots")
 #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
 datamat = RNASeq.NoCellLine[GOI, ]
 #datamat = RNASeq.NoCellLine
 write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")

 constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
 if(any(constant_rows)) {
   cat("Removing", sum(constant_rows), "constant rows.\n")
   datamat <- datamat[!constant_rows, ]
 }
 hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
 hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
 mycl = cutree(hr, h=max(hr$height)/1.1)
 mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
 mycol = mycol[as.vector(mycl)]

 png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000)
 heatmap.2(datamat,
         Rowv = as.dendrogram(hr),
         col = bluered(75),
         scale = "row",
         RowSideColors = mycol,
         trace = "none",
         margin = c(10, 15),         # bottom, left
         sepwidth = c(0, 0),
         dendrogram = 'row',
         Colv = 'false',
         density.info = 'none',
         labRow = rownames(datamat),
         cexRow = 1.5,
         cexCol = 1.5,
         srtCol = 35,
         lhei = c(0.2, 4),           # reduce top space (was 1 or more)
         lwid = c(0.4, 4))           # reduce left space (was 1 or more)
 dev.off()

 # -------------- Cluster members ----------------
 write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
 write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
 write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
 write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
 write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
 #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
 #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;

 #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) #####
 subset_1<-names(subset(mycl, mycl == '1'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
 subset_2<-names(subset(mycl, mycl == '2'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
 subset_3<-names(subset(mycl, mycl == '3'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
 subset_4<-names(subset(mycl, mycl == '4'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
 subset_5<-names(subset(mycl, mycl == '5'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
 # Initialize an empty data frame for the annotated data
 annotated_data <- data.frame()
 # Determine total number of genes
 total_genes <- length(rownames(data))
 # Loop through each gene to annotate
 for (i in 1:total_genes) {
     gene <- rownames(data)[i]
     result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                     filters = 'ensembl_gene_id',
                     values = gene,
                     mart = ensembl)
     # If multiple rows are returned, take the first one
     if (nrow(result) > 1) {
         result <- result[1, ]
     }
     # Check if the result is empty
     if (nrow(result) == 0) {
         result <- data.frame(ensembl_gene_id = gene,
                             external_gene_name = NA,
                             gene_biotype = NA,
                             entrezgene_id = NA,
                             chromosome_name = NA,
                             start_position = NA,
                             end_position = NA,
                             strand = NA,
                             description = NA)
     }
     # Transpose expression values
     expression_values <- t(data.frame(t(data[gene, ])))
     colnames(expression_values) <- colnames(data)
     # Combine gene information and expression data
     combined_result <- cbind(result, expression_values)
     # Append to the final dataframe
     annotated_data <- rbind(annotated_data, combined_result)
     # Print progress every 100 genes
     if (i %% 100 == 0) {
         cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
     }
 }
 # Save the annotated data to a new CSV file
 write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
 #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Assign KEGG and GO Terms (see diagram above)

 Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

 Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

     EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

     Install EggNOG-mapper:

         mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
         mamba activate eggnog_env

     Run annotation:

         #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
         mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
         #Download the protein sequences from Genbank
         mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
         python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
         emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
         #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
         #---->  470.IX87_14445:
             * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
             * IX87_14445 would refer to a specific gene or protein within that genome.

     Extract KEGG KO IDs from annotations.emapper.annotations.

 Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

 * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
 * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
 * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
         QBlast finished with warnings!
         Blasted Sequences: 2084
         Sequences without results: 105
         Check the Job log for details and try to submit again.
         Restarting QBlast may result in additional results, depending on the error type.
         "Blast (CP020463_protein) Done"
 * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
         "Mapping (CP020463_protein) Done"
         "Mapping finished - Please proceed now to annotation."
 * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
         * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
         * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
         "Annotation (CP020463_protein) Done"
         "Annotation finished."
 or blast2go_cli_v1.5.1 (NOT_USED)

         #https://help.biobam.com/space/BCD/2250407989/Installation
         #see ~/Scripts/blast2go_pipeline.sh

 Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
     * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
         "InterProScan (CP020463_protein) Done"
         "InterProScan Finished - You can now merge the obtained GO Annotations."
 MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
     * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
         "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
         "Finished merging GO terms from InterPro with annotations."
         "Maybe you want to run ANNEX (Annotation Augmentation)."
     #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
 File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
         #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

     #-- before merging (blast2go_annot.annot) --
     #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
     #H0N29_18790     GO:0085020
     #-- after merging (blast2go_annot.annot2) -->
     #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
     #H0N29_18790     GO:0070531
     #H0N29_18790     GO:0004842
     #H0N29_18790     GO:0005515
     #H0N29_18790     GO:0085020

     cp blast2go_annot.annot blast2go_annot.annot2

 Option 4 (NOT_USED): RFAM for non-colding RNA

 Option 5 (NOT_USED): PSORTb for subcellular localizations

 Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

 * Go to KAAS
 * Upload your FASTA file.
 * Select an appropriate gene set.
 * Download the KO assignments.

Find the Closest KEGG Organism Code (NOT_USED)

 Since your species isn't directly in KEGG, use a closely related organism.

 * Check available KEGG organisms:

         library(clusterProfiler)
         library(KEGGREST)

         kegg_organisms <- keggList("organism")

         Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

         # Search for Acinetobacter in the list
         grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
         # Gammaproteobacteria
         #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

 If your organism is not in KEGG, search for the closest relative:

         grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

 For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

         kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

     #BiocManager::install("GO.db")
     #BiocManager::install("AnnotationDbi")

     # Load required libraries
     library(openxlsx)  # For Excel file handling
     library(dplyr)     # For data manipulation
     library(tidyr)
     library(stringr)
     library(clusterProfiler)  # For KEGG and GO enrichment analysis
     #library(org.Hs.eg.db)  # Replace with appropriate organism database
     library(GO.db)
     library(AnnotationDbi)

     setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
     # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
     # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
     #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
     #1,5c1
     #< ## Thu Jan 30 16:34:52 2025
     #< ## emapper-2.1.12
     #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
     #< ##
     #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
     #---
     #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
     #3620,3622d3615
     #< ## 3614 queries scanned
     #< ## Total time (seconds): 8.176708459854126

     # Step 1: Load the blast2go annotation file with a check for missing columns
     annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

     # If the structure is inconsistent, we can make sure there are exactly 3 columns:
     colnames(annot_df) <- c("GeneID", "Term")
     # Step 2: Filter and aggregate GO and EC terms as before
     go_terms <- annot_df %>%
     filter(grepl("^GO:", Term)) %>%
     group_by(GeneID) %>%
     summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
     ec_terms <- annot_df %>%
     filter(grepl("^EC:", Term)) %>%
     group_by(GeneID) %>%
     summarize(EC = paste(Term, collapse = ","), .groups = "drop")

     # Key Improvements:
     #    * Looped processing of all 6 input files to avoid redundancy.
     #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
     #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
     #    * Error handling for GO term descriptions via tryCatch.
     #    * Improved clarity and modular structure for easier maintenance and future additions.

     #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

     # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
     suppressPackageStartupMessages({
       library(readr)
       library(dplyr)
       library(stringr)
       library(tidyr)
       library(openxlsx)
       library(clusterProfiler)
       library(AnnotationDbi)
       library(GO.db)
     })

     # ---- PARAMETERS ----
     PADJ_CUT <- 5e-2
     LFC_CUT  <- 2

     # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
     emapper_path <- "~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt"

     # Input files (you can add/remove here)
     input_files <- c(

     "deltaadeIJ_none_17_vs_WT_none_17-all.csv",  #up 11, down 3 vs. (10,4)
     "deltaadeIJ_none_24_vs_WT_none_24-all.csv",  #up 0, down 2 vs. (0,2)
     "deltaadeIJ_one_17_vs_WT_one_17-all.csv",    #up 238, down 90 vs. (239,89)  --> height 2600
     "deltaadeIJ_one_24_vs_WT_one_24-all.csv",    #up 83, down 64 vs. (64,71) --> height 1600
     "deltaadeIJ_two_17_vs_WT_two_17-all.csv",    #up 74, down 14 vs. (75,9) --> height 1000
     "deltaadeIJ_two_24_vs_WT_two_24-all.csv",    #up 1, down 3 vs. (3,3)

     "WT_none_24_vs_WT_none_17-all.csv",  #(up 10, down 2)
     "WT_one_24_vs_WT_one_17-all.csv",    #(up 97, down 3)
     "WT_two_24_vs_WT_two_17-all.csv",    #(up 12, down 1)

     "deltaadeIJ_two_24_vs_deltaadeIJ_two_17-all.csv",   #(up 8, down 51)
     "deltaadeIJ_one_24_vs_deltaadeIJ_one_17-all.csv",   #(up 0, down 10)
     "deltaadeIJ_none_24_vs_deltaadeIJ_none_17-all.csv" #(up 0, down 0)

     )

     # ---- HELPERS ----
     # Robust reader (CSV first, then TSV)
     read_table_any <- function(path) {
       tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                     error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                   error = function(e2) NULL))
       tb
     }

     # Return a nice Excel-safe base name
     xlsx_name_from_file <- function(path) {
       base <- tools::file_path_sans_ext(basename(path))
       paste0("DEG_KEGG_GO_", base, ".xlsx")
     }

     # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
     expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
       if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
       kdf <- as.data.frame(kegg_res)
       if (!"geneID" %in% names(kdf)) return(kdf)
       # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
       map_clean <- mapping_tbl %>%
         dplyr::select(KEGG_ko, GeneID) %>%
         filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
         mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
         tidyr::separate_rows(KEGG_ko, sep = ",") %>%
         distinct()

       if (!nrow(map_clean)) {
         return(kdf)
       }

       expanded <- kdf %>%
         tidyr::separate_rows(geneID, sep = "/") %>%
         dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
         distinct() %>%
         dplyr::group_by(ID) %>%
         dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

       kdf %>%
         dplyr::select(-geneID) %>%
         dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
         dplyr::rename(geneID = GeneID)
     }

     # ---- LOAD emapper annotations ----
     eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
     # Ensure character columns for joins
     eggnog_data$query   <- as.character(eggnog_data$query)
     eggnog_data$GOs     <- as.character(eggnog_data$GOs)
     eggnog_data$EC      <- as.character(eggnog_data$EC)
     eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

     # ---- MAIN LOOP ----
     for (f in input_files) {
       if (!file.exists(f)) { message("Missing: ", f); next }

       message("Processing: ", f)
       res <- read_table_any(f)
       if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

       # Coerce expected columns if present
       if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
       if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

       # Ensure GeneID & GeneName exist
       if (!"GeneID" %in% names(res)) {
         # Try to infer from a generic 'gene' column
         if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
       }
       if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

       # Fill missing GeneName from GeneID (drop "gene-")
       res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                             gsub("^gene-", "", as.character(res$GeneID)),
                             as.character(res$GeneName))

       # De-duplicate by GeneName, keep smallest padj
       if (!"padj" %in% names(res)) res$padj <- NA_real_
       res <- res %>%
         group_by(GeneName) %>%
         slice_min(padj, with_ties = FALSE) %>%
         ungroup() %>%
         as.data.frame()

       # Sort by padj asc, then log2FC desc
       if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
       res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

       # Join emapper (strip "gene-" from GeneID to match emapper 'query')
       res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
       res_ann <- res %>%
         left_join(eggnog_data, by = c("GeneID_plain" = "query"))

       # --- Split by UP/DOWN using your volcano cutoffs ---
       up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
       down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

       # --- KEGG enrichment (using K numbers in KEGG_ko) ---
       # Prepare KO lists (remove "ko:" if present)
       k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
       k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
       k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

       # BREAK_LINE

       kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
       kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

       # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
       kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
       kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

       # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
       # Background gene set = all genes in this comparison
       background_genes <- unique(res_ann$GeneID_plain)
       # TERM2GENE table (GO -> GeneID_plain)
       go_annotation <- res_ann %>%
         dplyr::select(GeneID_plain, GOs) %>%
         mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
         tidyr::separate_rows(GOs, sep = ",") %>%
         filter(GOs != "") %>%
         dplyr::select(GOs, GeneID_plain) %>%
         distinct()

       # Gene lists for GO enricher
       go_list_up   <- unique(up_regulated$GeneID_plain)
       go_list_down <- unique(down_regulated$GeneID_plain)

       go_up <- tryCatch(
         enricher(gene = go_list_up, TERM2GENE = go_annotation,
                 pvalueCutoff = 0.05, pAdjustMethod = "BH",
                 universe = background_genes),
         error = function(e) NULL
       )
       go_down <- tryCatch(
         enricher(gene = go_list_down, TERM2GENE = go_annotation,
                 pvalueCutoff = 0.05, pAdjustMethod = "BH",
                 universe = background_genes),
         error = function(e) NULL
       )

       go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
       go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

       # Add GO term descriptions via GO.db (best-effort)
       add_go_term_desc <- function(df) {
         if (!nrow(df) || !"ID" %in% names(df)) return(df)
         df$Description <- sapply(df$ID, function(go_id) {
           term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                 columns = "TERM", keytype = "GOID"),
                           error = function(e) NULL)
           if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
         })
         df
       }
       go_up_df   <- add_go_term_desc(go_up_df)
       go_down_df <- add_go_term_desc(go_down_df)

       # ---- Write Excel workbook ----
       out_xlsx <- xlsx_name_from_file(f)
       wb <- createWorkbook()

       addWorksheet(wb, "Complete")
       writeData(wb, "Complete", res_ann)

       addWorksheet(wb, "Up_Regulated")
       writeData(wb, "Up_Regulated", up_regulated)

       addWorksheet(wb, "Down_Regulated")
       writeData(wb, "Down_Regulated", down_regulated)

       addWorksheet(wb, "KEGG_Enrichment_Up")
       writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

       addWorksheet(wb, "KEGG_Enrichment_Down")
       writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

       addWorksheet(wb, "GO_Enrichment_Up")
       writeData(wb, "GO_Enrichment_Up", go_up_df)

       addWorksheet(wb, "GO_Enrichment_Down")
       writeData(wb, "GO_Enrichment_Down", go_down_df)

       saveWorkbook(wb, out_xlsx, overwrite = TRUE)
       message("Saved: ", out_xlsx)
     }

     # -------------------------------- OLD_CODE not automatized with loop ----------------------------
     # Load the results
     res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
     res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

     res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
     res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
     res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
     res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

     res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
     res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

     # Replace empty GeneName with modified GeneID
     res$GeneName <- ifelse(
         res$GeneName == "" | is.na(res$GeneName),
         gsub("gene-", "", res$GeneID),
         res$GeneName
     )

     # Remove duplicated genes by selecting the gene with the smallest padj
     duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

     res <- res %>%
     group_by(GeneName) %>%
     slice_min(padj, with_ties = FALSE) %>%
     ungroup()

     res <- as.data.frame(res)
     # Sort res first by padj (ascending) and then by log2FoldChange (descending)
     res <- res[order(res$padj, -res$log2FoldChange), ]
     # Read eggnog annotations
     eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
     # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
     res$GeneID <- gsub("gene-", "", res$GeneID)
     # Merge eggnog data with res based on GeneID
     res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

     # Merge with the res dataframe
     # Perform the left joins and rename columns
     res_updated <- res %>%
     left_join(go_terms, by = "GeneID") %>%
     left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

     # Filter up-regulated genes
     up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.05, ]
     # Filter down-regulated genes
     down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.05, ]

     # Create a new workbook
     wb <- createWorkbook()
     # Add the complete dataset as the first sheet (with annotations)
     addWorksheet(wb, "Complete")
     writeData(wb, "Complete_Data", res_updated)
     # Add the up-regulated genes as the second sheet (with annotations)
     addWorksheet(wb, "Up_Regulated")
     writeData(wb, "Up_Regulated", up_regulated)
     # Add the down-regulated genes as the third sheet (with annotations)
     addWorksheet(wb, "Down_Regulated")
     writeData(wb, "Down_Regulated", down_regulated)
     # Save the workbook to a file
     #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
     #NOTE: The generated annotation-files contains all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) + almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs) except for -[GOs, EC] + two columns from Blast2GO (COs, EC); In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

     #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

     # Set GeneName as row names after the join
     rownames(res_updated) <- res_updated$GeneName
     res_updated <- res_updated %>% dplyr::select(-GeneName)
     ## Set the 'GeneName' column as row.names
     #rownames(res_updated) <- res_updated$GeneName
     ## Drop the 'GeneName' column since it's now the row names
     #res_updated$GeneName <- NULL
     # -- BREAK_1 --

     # ---- Perform KEGG enrichment analysis (up_regulated) ----
     gene_list_kegg_up <- up_regulated$KEGG_ko
     gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
     kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
     # -- convert the GeneID (Kxxxxxx) to the true GeneID --
     # Step 0: Create KEGG to GeneID mapping
     kegg_to_geneid_up <- up_regulated %>%
     dplyr::select(KEGG_ko, GeneID) %>%
     filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
     mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
     # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
     kegg_to_geneid_clean <- kegg_to_geneid_up %>%
     mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
     separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
     filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
     distinct()  # Remove any duplicate mappings
     # Step 2.1: Expand geneID column in kegg_enrichment_up
     expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
     distinct() %>%  # Remove duplicate matches
     group_by(ID) %>%
     summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
     #dplyr::glimpse(expanded_kegg)
     # Step 3.1: Replace geneID column in the original dataframe
     kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
     # Remove old geneID column and merge new one
     kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

     # ---- Perform KEGG enrichment analysis (down_regulated) ----
     # Step 1: Extract KEGG KO terms from down-regulated genes
     gene_list_kegg_down <- down_regulated$KEGG_ko
     gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
     # Step 2: Perform KEGG enrichment analysis
     kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
     # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
     # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
     kegg_to_geneid_down <- down_regulated %>%
     dplyr::select(KEGG_ko, GeneID) %>%
     filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
     mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
     # -- BREAK_2 --

     # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
     kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
     mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
     separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
     filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
     distinct()  # Remove duplicate mappings

     # Step 5: Expand geneID column in kegg_enrichment_down
     expanded_kegg_down <- kegg_enrichment_down %>%
     as.data.frame() %>%
     separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
     left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
     distinct() %>%  # Remove duplicate matches
     group_by(ID) %>%
     summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

     # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
     kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
     dplyr::select(-geneID) %>%  # Remove old geneID column
     left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
     dplyr::rename(geneID = GeneID)  # Rename column back to geneID
     # View the updated dataframe
     head(kegg_enrichment_down_df)

     # Create a new workbook
     #wb <- createWorkbook()
     # Save enrichment results to the workbook
     addWorksheet(wb, "KEGG_Enrichment_Up")
     writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
     # Save enrichment results to the workbook
     addWorksheet(wb, "KEGG_Enrichment_Down")
     writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

     # Define gene list (up-regulated genes)
     gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
     gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

     # Define background gene set (all genes in res)
     background_genes <- res_updated$GeneID  # Extract the 3646 background genes

     # Prepare GO annotation data from res
     go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
     go_annotation <- go_annotation %>%
     tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
     # -- BREAK_3 --

     go_enrichment_up <- enricher(
         gene = gene_list_go_up,                # Up-regulated genes
         TERM2GENE = go_annotation,       # Custom GO annotation
         pvalueCutoff = 0.05,             # Significance threshold
         pAdjustMethod = "BH",
         universe = background_genes      # Define the background gene set
     )
     go_enrichment_up <- as.data.frame(go_enrichment_up)

     go_enrichment_down <- enricher(
         gene = gene_list_go_down,                # Up-regulated genes
         TERM2GENE = go_annotation,       # Custom GO annotation
         pvalueCutoff = 0.05,             # Significance threshold
         pAdjustMethod = "BH",
         universe = background_genes      # Define the background gene set
     )
     go_enrichment_down <- as.data.frame(go_enrichment_down)

     ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
     #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

     # Update the Description column with the term descriptions
     go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
     # Using select to get the term description
     term <- tryCatch({
         AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
     }, error = function(e) {
         message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
         return(data.frame(TERM = NA))  # In case of error, return NA
     })
     if (nrow(term) > 0) {
         return(term$TERM)
     } else {
         return(NA)  # If no description found, return NA
     }
     })
     ## Print the updated data frame
     #print(go_enrichment_up)

     ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
     #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
     # Update the Description column with the term descriptions
     go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
     # Using select to get the term description
     term <- tryCatch({
         AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
     }, error = function(e) {
         message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
         return(data.frame(TERM = NA))  # In case of error, return NA
     })

     if (nrow(term) > 0) {
         return(term$TERM)
     } else {
         return(NA)  # If no description found, return NA
     }
     })

     addWorksheet(wb, "GO_Enrichment_Up")
     writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

     addWorksheet(wb, "GO_Enrichment_Down")
     writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

     # Save the workbook with enrichment results
     saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

     #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
     #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
     #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!

     #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
     #Mac_vs_LB
     #LB.AB_vs_LB.WT19606
     #LB.IJ_vs_LB.WT19606
     #LB.W1_vs_LB.WT19606
     #LB.Y1_vs_LB.WT19606
     #Mac.AB_vs_Mac.WT19606
     #Mac.IJ_vs_Mac.WT19606
     #Mac.W1_vs_Mac.WT19606
     #Mac.Y1_vs_Mac.WT19606

     #TODO: write reply hints in KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.

(DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

         library(openxlsx)

         # Function to read and clean gene ID files
         read_gene_ids <- function(file_path) {
         # Read the gene IDs from the file
         gene_ids <- readLines(file_path)

         # Remove any quotes and trim whitespaces
         gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
         gene_ids <- trimws(gene_ids)  # Trim whitespaces

         # Remove empty entries or NAs
         gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

         return(gene_ids)
         }

         # Example list of LB files with both -up.id and -down.id for each condition
         lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                         "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
         lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                         "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

         # Combine both up and down files for each condition
         lb_files <- c(lb_files_up, lb_files_down)

         # Read gene IDs for each file in LB group
         #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
         lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

         lb_degs_ <- list()
         combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
         #unique_combined_set <- unique(combined_set)
         lb_degs_$AB <- combined_set
         combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
         lb_degs_$IJ <- combined_set
         combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
         lb_degs_$W1 <- combined_set
         combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
         lb_degs_$Y1 <- combined_set

         # Example list of Mac files with both -up.id and -down.id for each condition
         mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                         "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
         mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                         "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

         # Combine both up and down files for each condition in Mac group
         mac_files <- c(mac_files_up, mac_files_down)

         # Read gene IDs for each file in Mac group
         mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

         mac_degs_ <- list()
         combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
         mac_degs_$AB <- combined_set
         combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
         mac_degs_$IJ <- combined_set
         combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
         mac_degs_$W1 <- combined_set
         combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
         mac_degs_$Y1 <- combined_set

         # Function to clean sheet names to ensure no sheet name exceeds 31 characters
         truncate_sheet_name <- function(names_list) {
         sapply(names_list, function(name) {
         if (nchar(name) > 31) {
         return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
         }
         return(name)
         })
         }

         # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

         # Define intersections between different conditions for LB
         inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
         inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
         inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
         inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
         inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
         inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

         # Define intersections between three conditions for LB
         inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
         inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
         inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
         inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

         # Define intersection between all four conditions for LB
         inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

         # Now remove the intersected genes from each original set for LB
         venn_list_lb <- list()

         # For LB.AB, remove genes that are also in other conditions
         venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

         # For LB.IJ, remove genes that are also in other conditions
         venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

         # For LB.W1, remove genes that are also in other conditions
         venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))

         # For LB.Y1, remove genes that are also in other conditions
         venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))

         # Add the intersections for LB (same as before)
         venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
         venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
         venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
         venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
         venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
         venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

         # Define intersections between three conditions for LB
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
         venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
         venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

         # Define intersection between all four conditions for LB
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

         # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

         # Define intersections between different conditions
         inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
         inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
         inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
         inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
         inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
         inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

         # Define intersections between three conditions
         inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
         inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
         inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
         inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

         # Define intersection between all four conditions
         inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

         # Now remove the intersected genes from each original set
         venn_list_mac <- list()

         # For Mac.AB, remove genes that are also in other conditions
         venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

         # For Mac.IJ, remove genes that are also in other conditions
         venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

         # For Mac.W1, remove genes that are also in other conditions
         venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))

         # For Mac.Y1, remove genes that are also in other conditions
         venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))

         # Add the intersections (same as before)
         venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
         venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
         venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
         venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
         venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
         venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

         # Define intersections between three conditions
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
         venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
         venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

         # Define intersection between all four conditions
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

         # Save the gene IDs to Excel for further inspection (optional)
         write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
         write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

         # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
         write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
         write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

         # Venn Diagram for LB group
         venn1 <- ggvenn(lb_degs_,
                         fill_color = c("skyblue", "tomato", "gold", "orchid"),
                         stroke_size = 0.4,
                         set_name_size = 5)
         ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

         # Venn Diagram for Mac group
         venn2 <- ggvenn(mac_degs_,
                         fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                         stroke_size = 0.4,
                         set_name_size = 5)
         ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

         cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Processing Data_Michelle_RNAseq_2025 including automatizing DEG(Annotated)_KEGG_GO_* and splitting DEG-Heatmaps

Leave a reply

In the current results, I extract the main effect. I also compared the condition deltasbp_MH_18h to WT_MH_18h, if you are interested in specific comparison between conditions, please let me know, I can perform differentially expressed analysis and draw corresponding volcano plots for them.

Targets

 The experiment we did so far:
 I have two strains:
 1. 1457 wildtype
 2. 1457Δsbp (sbp knock out strain)

 I have grown these two strains in two media for 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase), 18h (mature biofilm phase) respectively
 1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth visible (sbp knockout shows less biofilm formation and a growth deficit)
 2. medium MH -> nutrient-poor medium: differences between wild type more obvious (sbp knockout shows stronger growth deficit)

 Our idea/hypothesis of what we hope to achieve with the RNA-Seq:
 Since we already see differences in growth and biofilm formation and also differences in the proteome (through cooperation with mass spectrometry), we also expect differences in the transcription of the genes in the RNA-Seq. Could you analyze the RNA-Seq data for me and compare the strains at the different time points? But maybe also compare the different time points of one strain with each other?
 The following would be interesting for me:
 - PCA plot (sample comparison)
 - Heatmaps (wild type vs. sbp knockout)
 - Volcano plots (significant genes)
 - Gene Ontology (GO) analyses

Download the raw data

 Mail von BGI (RNA-SEQ Institute):
 The data from project F25A430000603 are uploaded to AWS.
 Please download the data as below:
 URL：https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
 Project：F25A430000603-01-STAkimxP
 Alias ID：598731762349
 S3 Bucket：stakimxp-598731762349
 Account：stakimxp
 Password：qR0'A7[o9Ql|
 Region：eu-central-1
 Aws_access_key_id：AKIAYWZZRVKW72S4SCPG
 Aws_secret_access_key：fo5ousM4ThvsRrOFVuxVhGv2qnzf+aiDZTmE3aho

 aws s3 cp s3://stakimxp-598731762349/ ./ --recursive

 cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
 rsync -avzP /local/dir/ user@remote:/remote/dir/
 rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Prepare raw data

 mkdir raw_data; cd raw_data

 #Δsbp->deltasbp
 #1457.1_2h_MH,WT,MH,2h,1
 ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
 #1457.2_2h_
 ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
 #1457.3_2h_
 ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
 #1457.1_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
 #1457.2_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
 #1457.3_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
 #1457.1_18h_
 ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
 #1457.2_18h_
 ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
 #1457.3_18h_
 ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
 #1457dsbp1_2h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
 #1457dsbp2_2h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
 #1457dsbp3_2h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
 #1457dsbp1_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
 #1457dsbp2_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
 #1457dsbp3_4h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
 #1457dsbp118h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
 #1457dsbp218h_
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
 ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz

 #1457.1_2h_
 ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz  WT_TSB_2h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz  WT_TSB_2h_1_R2.fastq.gz
 #1457.2_2h_
 ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz  WT_TSB_2h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz  WT_TSB_2h_2_R2.fastq.gz
 #1457.3_2h_
 ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz  WT_TSB_2h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz  WT_TSB_2h_3_R2.fastq.gz
 #1457.1_4h_
 ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz  WT_TSB_4h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz  WT_TSB_4h_1_R2.fastq.gz
 #1457.2_4h_
 ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz  WT_TSB_4h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz  WT_TSB_4h_2_R2.fastq.gz
 #1457.3_4h_
 ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz  WT_TSB_4h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz  WT_TSB_4h_3_R2.fastq.gz
 #1457.1_18h_
 ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz  WT_TSB_18h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz  WT_TSB_18h_1_R2.fastq.gz
 #1457.2_18h_
 ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz  WT_TSB_18h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz  WT_TSB_18h_2_R2.fastq.gz
 #1457.3_18h_
 ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz  WT_TSB_18h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz  WT_TSB_18h_3_R2.fastq.gz
 #1457dsbp1_2h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
 #1457dsbp2_2h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
 #1457dsbp3_2h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
 #1457dsbp1_4h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
 #1457dsbp2_4h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
 #1457dsbp3_4h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
 #1457dsbp1_18h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
 #1457dsbp2_18h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
 #1457dsbp3_18h_
 ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
 ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
 #END

Preparing the directory trimmed

 mkdir trimmed trimmed_unpaired;
 for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3  deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 deltasbp_MH_18h_1 deltasbp_MH_18h_2 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
         java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
 done
 mv trimmed/*.fastq.gz .

Preparing samplesheet.csv

 sample,fastq_1,fastq_2,strandedness
 WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
 ...

nextflow run

 #See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
 #docker pull nfcore/rnaseq
 ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

 # -- DEBUG_1 (CDS --> exon in CP020463.gff) --
 grep -P "\texon\t" CP020463.gff | sort | wc -l    #=81
 grep -P "cmsearch\texon\t" CP020463.gff | wc -l   #=11  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
 grep -P "Genbank\texon\t" CP020463.gff | wc -l    #=12  16S and 23S ribosomal RNA
 grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l    #tRNA 58
 grep -P "\tCDS\t" CP020463.gff | wc -l  #3701-->2324
 sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
 grep -P "\texon\t" CP020463_m.gff | sort | wc -l  #3797-->2405

 # -- NOTE that combination of 'CP020463_m.gff' and 'exon' in the command will result in ERROR, using 'transcript' instead in the command line!
 --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

 # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
 (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

 # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file, both are "CP020463.1"

Import data and pca-plot

 #mamba activate r_env

 #install.packages("ggfun")
 # Import the required libraries
 library("AnnotationDbi")
 library("clusterProfiler")
 library("ReactomePA")
 library(gplots)
 library(tximport)
 library(DESeq2)
 #library("org.Hs.eg.db")
 library(dplyr)
 library(tidyverse)
 #install.packages("devtools")
 #devtools::install_version("gtable", version = "0.3.0")
 library(gplots)
 library("RColorBrewer")
 #install.packages("ggrepel")
 library("ggrepel")
 # install.packages("openxlsx")
 library(openxlsx)
 library(EnhancedVolcano)
 library(DESeq2)
 library(edgeR)

 setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")
 # Define paths to your Salmon output quantification files
 files <- c(
         "deltasbp_MH_2h_r1" = "./deltasbp_MH_2h_1/quant.sf",
         "deltasbp_MH_2h_r2" = "./deltasbp_MH_2h_2/quant.sf",
         "deltasbp_MH_2h_r3" = "./deltasbp_MH_2h_3/quant.sf",
         "deltasbp_MH_4h_r1" = "./deltasbp_MH_4h_1/quant.sf",
         "deltasbp_MH_4h_r2" = "./deltasbp_MH_4h_2/quant.sf",
         "deltasbp_MH_4h_r3" = "./deltasbp_MH_4h_3/quant.sf",
         "deltasbp_MH_18h_r1" = "./deltasbp_MH_18h_1/quant.sf",
         "deltasbp_MH_18h_r2" = "./deltasbp_MH_18h_2/quant.sf",
         "deltasbp_TSB_2h_r1" = "./deltasbp_TSB_2h_1/quant.sf",
         "deltasbp_TSB_2h_r2" = "./deltasbp_TSB_2h_2/quant.sf",
         "deltasbp_TSB_2h_r3" = "./deltasbp_TSB_2h_3/quant.sf",
         "deltasbp_TSB_4h_r1" = "./deltasbp_TSB_4h_1/quant.sf",
         "deltasbp_TSB_4h_r2" = "./deltasbp_TSB_4h_2/quant.sf",
         "deltasbp_TSB_4h_r3" = "./deltasbp_TSB_4h_3/quant.sf",
         "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
         "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
         "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
         "WT_MH_2h_r1" = "./WT_MH_2h_1/quant.sf",
         "WT_MH_2h_r2" = "./WT_MH_2h_2/quant.sf",
         "WT_MH_2h_r3" = "./WT_MH_2h_3/quant.sf",
         "WT_MH_4h_r1" = "./WT_MH_4h_1/quant.sf",
         "WT_MH_4h_r2" = "./WT_MH_4h_2/quant.sf",
         "WT_MH_4h_r3" = "./WT_MH_4h_3/quant.sf",
         "WT_MH_18h_r1" = "./WT_MH_18h_1/quant.sf",
         "WT_MH_18h_r2" = "./WT_MH_18h_2/quant.sf",
         "WT_MH_18h_r3" = "./WT_MH_18h_3/quant.sf",
         "WT_TSB_2h_r1" = "./WT_TSB_2h_1/quant.sf",
         "WT_TSB_2h_r2" = "./WT_TSB_2h_2/quant.sf",
         "WT_TSB_2h_r3" = "./WT_TSB_2h_3/quant.sf",
         "WT_TSB_4h_r1" = "./WT_TSB_4h_1/quant.sf",
         "WT_TSB_4h_r2" = "./WT_TSB_4h_2/quant.sf",
         "WT_TSB_4h_r3" = "./WT_TSB_4h_3/quant.sf",
         "WT_TSB_18h_r1" = "./WT_TSB_18h_1/quant.sf",
         "WT_TSB_18h_r2" = "./WT_TSB_18h_2/quant.sf",
         "WT_TSB_18h_r3" = "./WT_TSB_18h_3/quant.sf")

 # Import the transcript abundance data with tximport
 txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
 # Define the replicates and condition of the samples
 replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
 condition <- factor(c("deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_18h","deltasbp_MH_18h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_18h","deltasbp_TSB_18h","deltasbp_TSB_18h","WT_MH_2h","WT_MH_2h","WT_MH_2h","WT_MH_4h","WT_MH_4h","WT_MH_4h","WT_MH_18h","WT_MH_18h","WT_MH_18h","WT_TSB_2h","WT_TSB_2h","WT_TSB_2h","WT_TSB_4h","WT_TSB_4h","WT_TSB_4h","WT_TSB_18h","WT_TSB_18h","WT_TSB_18h"))

 sample_table <- data.frame(
     condition = condition,
     replicate = replicate
 )
 split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
 colnames(split_cond) <- c("strain", "media", "time")
 colData <- cbind(sample_table, split_cond)
 colData$strain <- factor(colData$strain)
 colData$media  <- factor(colData$media)
 colData$time   <- factor(colData$time)
 #colData$group  <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
 # Define the colData for DESeq2
 #colData <- data.frame(condition=condition, row.names=names(files))

 #grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
 #grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

 # ------------------------
 # 1️⃣ Setup and input files
 # ------------------------

 # Read in transcript-to-gene mapping
 tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
 colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

 # Prepare tx2gene for gene-level summarization (remove gene_name if needed)
 tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

 # -------------------------------
 # 2️⃣ Transcript-level counts
 # -------------------------------
 # Create DESeqDataSet directly from tximport (transcript-level)
 dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
 write.csv(counts(dds_tx), file="transcript_counts.csv")

 # --------------------------------
 # 3️⃣ Gene-level summarization
 # --------------------------------
 # Re-import Salmon data summarized at gene level
 txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)

 # Create DESeqDataSet for gene-level counts
 #dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
 dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
 #dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media + strain + media:strain + strain:time)
 #或更简单地写为（推荐）：dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media * strain)
 #dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ strain * media * time)
 #~ strain * media * time    主效应 + 所有交互（推荐）  ✅
 #~ time + media * strain    主效应 + media:strain 交互   ⚠️ 有限制

 # --------------------------------
 # 4️⃣ Raw counts table (with gene names)
 # --------------------------------
 # Extract raw gene-level counts
 counts_data <- as.data.frame(counts(dds, normalized=FALSE))
 counts_data$gene_id <- rownames(counts_data)

 # Add gene names
 tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
 counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

 # Reorder columns: gene_id, gene_name, then counts
 count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
 counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

 # --------------------------------
 # 5️⃣ Calculate CPM
 # --------------------------------
 library(edgeR)
 library(openxlsx)

 # Prepare count matrix for CPM calculation
 count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

 # Calculate CPM
 #cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
 total_counts <- colSums(count_matrix)
 cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
 cpm_matrix <- as.data.frame(cpm_matrix)

 # Add gene_id and gene_name back to CPM table
 cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

 # --------------------------------
 # 6️⃣ Save outputs
 # --------------------------------
 write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
 write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
 write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)

 # -- (Optional) Save the rlog-transformed counts --
 dim(counts(dds))
 head(counts(dds), 10)
 rld <- rlogTransformation(dds)
 rlog_counts <- assay(rld)
 write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

 # -- pca --
 png("pca2.png", 1200, 800)
 plotPCA(rld, intgroup=c("condition"))
 dev.off()
 # -- heatmap --
 png("heatmap2.png", 1200, 800)
 distsRL <- dist(t(assay(rld)))
 mat <- as.matrix(distsRL)
 hc <- hclust(distsRL)
 hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
 heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
 dev.off()

 # -- pca_media_strain --
 png("pca_media.png", 1200, 800)
 plotPCA(rld, intgroup=c("media"))
 dev.off()
 png("pca_strain.png", 1200, 800)
 plotPCA(rld, intgroup=c("strain"))
 dev.off()
 png("pca_time.png", 1200, 800)
 plotPCA(rld, intgroup=c("time"))
 dev.off()

(Optional; ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

 #Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
 #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
 #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

 sizeFactors(dds)
 #NULL
 # Estimate size factors
 dds <- estimateSizeFactors(dds)
 # Estimate dispersions
 dds <- estimateDispersions(dds)
 #> sizeFactors(dds)

 #control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
 #2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
 #HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
 #0.4604461  0.5799834  0.6803681

 # (DEBUG) If avgTxLength is Necessary
 #To simplify the computation and ensure sizeFactors are calculated:
 assays(dds)$avgTxLength <- NULL
 dds <- estimateSizeFactors(dds)
 sizeFactors(dds)
 #If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
 dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
 sizeFactors(dds)

 # If alone with virus data, the following BUG occured:
 #Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                     HeLa_TO_r1                      HeLa_TO_r2
                     0.9978755                       1.1092227
 data.frame(genes = rownames(dds), dispersions = dispersions(dds))

 #Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
 1/0.9978755=1.002129023
 1/1.1092227=
 #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
 bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
 bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220

 raw_counts <- counts(dds)
 normalized_counts <- counts(dds, normalized=TRUE)
 #write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
 #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
 #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
 estimSf <- function (cds){
     # Get the count matrix
     cts <- counts(cds)
     # Compute the geometric mean
     geomMean <- function(x) prod(x)^(1/length(x))
     # Compute the geometric mean over the line
     gm.mean  <-  apply(cts, 1, geomMean)
     # Zero values are set to NA (avoid subsequentcdsdivision by 0)
     gm.mean[gm.mean == 0] <- NA
     # Divide each line by its corresponding geometric mean
     # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
     # MARGIN: 1 or 2 (line or columns)
     # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
     # FUN: the function to be applied
     cts <- sweep(cts, 1, gm.mean, FUN="/")
     # Compute the median over the columns
     med <- apply(cts, 2, median, na.rm=TRUE)
     # Return the scaling factor
     return(med)
 }
 #https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
 #http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
 #https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
 #https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
 #https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
 #DESeq2’s median of ratios [1]
 #EdgeR’s trimmed mean of M values (TMM) [2]
 #http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
 test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
 sum(test_normcount != normalized_counts)

Select the differentially expressed genes

 #https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
 #https://www.biostars.org/p/282295/
 #https://www.biostars.org/p/335751/
 dds$condition
 [1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
 [5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
 [9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
 [13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
 [17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
 [21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
 [25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
 [29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
 [33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
 12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

 #CONSOLE: mkdir star_salmon/degenes

 setwd("degenes")

 # 确保因子顺序（可选）
 colData$strain <- relevel(factor(colData$strain), ref = "WT")
 colData$media  <- relevel(factor(colData$media), ref = "TSB")
 colData$time   <- relevel(factor(colData$time), ref = "2h")

 dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
 dds <- DESeq(dds, betaPrior = FALSE)
 resultsNames(dds)
 #[1] "Intercept"                      "strain_deltasbp_vs_WT"
 #[3] "media_MH_vs_TSB"                "time_18h_vs_2h"
 #[5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
 #[7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
 #[9] "mediaMH.time18h"                "mediaMH.time4h"
 #[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

 🔹 Main effects for each factor:

 表达量
 ▲
 │       ┌────── WT-TSB
 │      /
 │     /     ┌────── WT-MH
 │    /     /
 │   /     /     ┌────── deltasbp-TSB
 │  /     /     /
 │ /     /     /     ┌────── deltasbp-MH
 └──────────────────────────────▶ 时间（2h, 4h, 18h）

     * strain_deltasbp_vs_WT
     * media_MH_vs_TSB
     * time_18h_vs_2h
     * time_4h_vs_2h

 🔹 两因素交互作用（Two-way interactions）
 这些项表示两个实验因素（如菌株、培养基、时间）之间的组合效应——也就是说，其中一个因素的影响取决于另一个因素的水平。

 表达量
 ▲
 │
 │             WT ────────┐
 │                        └─↘
 │                           ↘
 │                        deltasbp ←←←← 显著交互（方向/幅度不同）
 └──────────────────────────────▶ 时间

 straindeltasbp.mediaMH
 表示 菌株（strain）和培养基（media）之间的交互作用。
 ➤ 这意味着：deltasbp 这个突变菌株在 MH 培养基中的表现与它在 TSB 中的不同，不能仅通过菌株和培养基的单独效应来解释。

 straindeltasbp.time18h
 表示 菌株（strain）和时间（time, 18h）之间的交互作用。
 ➤ 即：突变菌株在 18 小时时的表达变化不只是菌株效应或时间效应的简单相加，而有协同作用。

 straindeltasbp.time4h
 同上，是菌株和时间（4h）之间的交互作用。

 mediaMH.time18h
 表示 培养基（MH）与时间（18h）之间的交互作用。
 ➤ 即：在 MH 培养基中，18 小时时的表达水平与在其他时间点（例如 2h）不同，且该变化不完全可以用时间和培养基各自单独的效应来解释。

 mediaMH.time4h
 与上面类似，是 MH 培养基与 4 小时之间的交互作用。

 🔹 三因素交互作用（Three-way interactions）
 三因素交互作用表示：菌株、培养基和时间这三个因素在一起时，会产生一个新的效应，这种效应无法通过任何两个因素的组合来完全解释。

 表达量（TSB）
 ▲
 │
 │        WT ──────→→
 │        deltasbp ─────→→
 └────────────────────────▶ 时间（2h, 4h, 18h）

 表达量（MH）
 ▲
 │
 │        WT ──────→→
 │        deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
 └────────────────────────▶ 时间（2h, 4h, 18h）

 straindeltasbp.mediaMH.time18h
 表示 菌株 × 培养基 × 时间（18h） 三者之间的交互作用。
 ➤ 即：突变菌株在 MH 培养基下的 18 小时表达模式，与其他组合（比如 WT 在 MH 培养基下，或者在 TSB 下）都不相同。

 straindeltasbp.mediaMH.time4h
 同上，只是观察的是 4 小时下的三因素交互效应。

 ✅ 总结：
 交互作用项的存在意味着你不能仅通过单个变量（如菌株、时间或培养基）的影响来解释基因表达的变化，必须同时考虑它们之间的组合关系。在 DESeq2 模型中，这些交互项的显著性可以揭示特定条件下是否有特异的调控行为。

 # 提取 strain 的主效应: up 2, down 16
 contrast <- "strain_deltasbp_vs_WT"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 # 提取 media 的主效应: up 76; down 128
 contrast <- "media_MH_vs_TSB"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 # 提取 time 的主效应 up 228, down 98; up 17, down 2
 contrast <- "time_18h_vs_2h"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 contrast <- "time_4h_vs_2h"
 res = results(dds, name=contrast)
 res <- res[!is.na(res$log2FoldChange),]
 res_df <- as.data.frame(res)
 write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
 up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
 down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
 write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
 write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

 #1.)  delta sbp 2h TSB vs WT 2h TSB
 #2.)  delta sbp 4h TSB vs WT 4h TSB
 #3.)  delta sbp 18h TSB vs WT 18h TSB
 #4.)  delta sbp 2h MH vs WT 2h MH
 #5.)  delta sbp 4h MH vs WT 4h MH
 #6.)  delta sbp 18h MH vs WT 18h MH

 #---- relevel to control ----
 #design=~condition+replicate
 dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
 dds$condition <- relevel(dds$condition, "WT_TSB_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

 dds$condition <- relevel(dds$condition, "WT_TSB_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

 dds$condition <- relevel(dds$condition, "WT_TSB_18h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

 dds$condition <- relevel(dds$condition, "WT_MH_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

 dds$condition <- relevel(dds$condition, "WT_MH_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

 dds$condition <- relevel(dds$condition, "WT_MH_18h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

 # WT_MH_xh
 dds$condition <- relevel(dds$condition, "WT_MH_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")
 dds$condition <- relevel(dds$condition, "WT_MH_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_MH_18h_vs_WT_MH_4h")

 # WT_TSB_xh
 dds$condition <- relevel(dds$condition, "WT_TSB_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")
 dds$condition <- relevel(dds$condition, "WT_TSB_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

 # deltasbp_MH_xh
 dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")
 dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

 # deltasbp_TSB_xh
 dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")
 dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
 dds = DESeq(dds, betaPrior=FALSE)
 resultsNames(dds)
 clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

 for (i in clist) {
   contrast = paste("condition", i, sep="_")
   #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
   res = results(dds, name=contrast)
   res <- res[!is.na(res$log2FoldChange),]
   res_df <- as.data.frame(res)

   write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
   #res$log2FoldChange < -2 & res$padj < 5e-2
   up <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
   down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
   write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
   write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
 }

 # -- Under host-env (mamba activate plot-numpy1) --
 mamba activate plot-numpy1
 grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff

 for cmp in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
   python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
 done

 # ---- delta sbp TSB 2h vs WT TSB 2h ----
 res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
 #print(duplicated_genes)
 # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
 #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
 #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
 #res[res$GeneName == "bfr", ]

 #1st_strategy First occurrence is kept and Subsequent duplicates are removed
 #res <- res[!duplicated(res$GeneName), ]
 #2nd_strategy keep the row with the smallest padj value for each GeneName
 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 ## Ensure the data frame matches the expected format
 ## For example, it should have columns: log2FoldChange, padj, etc.
 #res <- as.data.frame(res)
 ## Remove rows with NA in log2FoldChange (if needed)
 #res <- res[!is.na(res$log2FoldChange),]

 # Replace padj = 0 with a small value
 #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
 dev.off()

 # ---- delta sbp TSB 4h vs WT TSB 4h ----
 res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
 dev.off()

 # ---- delta sbp TSB 18h vs WT TSB 18h ----
 res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
 dev.off()

 # ---- delta sbp MH 2h vs WT MH 2h ----
 res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
 #print(duplicated_genes)
 # [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
 #[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
 #[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
 #res[res$GeneName == "bfr", ]

 #1st_strategy First occurrence is kept and Subsequent duplicates are removed
 #res <- res[!duplicated(res$GeneName), ]
 #2nd_strategy keep the row with the smallest padj value for each GeneName
 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 ## Ensure the data frame matches the expected format
 ## For example, it should have columns: log2FoldChange, padj, etc.
 #res <- as.data.frame(res)
 ## Remove rows with NA in log2FoldChange (if needed)
 #res <- res[!is.na(res$log2FoldChange),]

 # Replace padj = 0 with a small value
 #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
 dev.off()

 # ---- delta sbp MH 4h vs WT MH 4h ----
 res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
 dev.off()

 # ---- delta sbp MH 18h vs WT MH 18h ----
 res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
 # Replace empty GeneName with modified GeneID
 res$GeneName <- ifelse(
   res$GeneName == "" | is.na(res$GeneName),
   gsub("gene-", "", res$GeneID),
   res$GeneName
 )
 duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

 res <- res %>%
   group_by(GeneName) %>%
   slice_min(padj, with_ties = FALSE) %>%
   ungroup()
 res <- as.data.frame(res)
 # Sort res first by padj (ascending) and then by log2FoldChange (descending)
 res <- res[order(res$padj, -res$log2FoldChange), ]

 # Assuming res is your dataframe and already processed
 # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
 # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
 # Create a new workbook
 wb <- createWorkbook()
 # Add the complete dataset as the first sheet
 addWorksheet(wb, "Complete_Data")
 writeData(wb, "Complete_Data", res)
 # Add the up-regulated genes as the second sheet
 addWorksheet(wb, "Up_Regulated")
 writeData(wb, "Up_Regulated", up_regulated)
 # Add the down-regulated genes as the third sheet
 addWorksheet(wb, "Down_Regulated")
 writeData(wb, "Down_Regulated", down_regulated)
 # Save the workbook to a file
 saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

 # Set the 'GeneName' column as row.names
 rownames(res) <- res$GeneName
 # Drop the 'GeneName' column since it's now the row names
 res$GeneName <- NULL
 head(res)

 #library(EnhancedVolcano)
 # Assuming res is already sorted and processed
 png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
 #max.overlaps = 10
 EnhancedVolcano(res,
                 lab = rownames(res),
                 x = 'log2FoldChange',
                 y = 'padj',
                 pCutoff = 5e-2,
                 FCcutoff = 2,
                 title = '',
                 subtitleLabSize = 18,
                 pointSize = 3.0,
                 labSize = 5.0,
                 colAlpha = 1,
                 legendIconSize = 4.0,
                 drawConnectors = TRUE,
                 widthConnectors = 0.5,
                 colConnectors = 'black',
                 subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
 dev.off()

 #Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

Clustering the genes and draw heatmap

 #http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
 #http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
 # save the Up-regulated and Down-regulated genes into -up.id and -down.id

 # Attached are the heatmap only shown the genes cutoff with padj < 5e-2 and |log2FC| > 2
 #- 1457 TSB vs 1457deltasbp TSB early  timepoint (1-2h)
 #- 1457 MH vs 1457deltasbp MH early timepoint (1-2h)
 #- 1457 TSB vs 1457deltasbp TSB 4h
 #- 1457 MH vs 1457deltasbp MH 4h
 #- 1457 TSB vs 1457deltasbp TSB 18h
 #- 1457 MH vs 1457deltasbp MH 18h

 # Attached shown the genes padj < 5e-2 and |log2FC| > 2 betwen 18h and time timepoint (1-2h) or betwen 18h and time timepoint 4h
 # The project of Tam_RNAseq can also make it similar to this project!
 #- 1457 TSB  early  timepoint (1-2h) vs 4h vs 18h
 #- 1457 MH early  timepoint (1-2h) vs 4h vs 18h
 #- 1457deltasbp TSB early  timepoint (1-2h) vs 4h vs 18h
 #- 1457deltasbp MH early  timepoint (1-2h) vs 4h vs 18h

 for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
   echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
   echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
 done

 ## ------------------------------------------------------------
 ## DEGs heatmap (dynamic GOI + dynamic column tags)
 ## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
 ## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
 ## ------------------------------------------------------------

 ## 0) Config ---------------------------------------------------
 contrast <- "deltasbp_TSB_2h_vs_WT_TSB_2h"
 contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h"
 contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h"
 contrast <- "deltasbp_MH_2h_vs_WT_MH_2h"
 contrast <- "deltasbp_MH_4h_vs_WT_MH_4h"
 contrast <- "deltasbp_MH_18h_vs_WT_MH_18h"

 up_file   <- paste0(contrast, "-up.id")
 down_file <- paste0(contrast, "-down.id")

 ## 1) Packages -------------------------------------------------
 need <- c("gplots")
 to_install <- setdiff(need, rownames(installed.packages()))
 if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
 suppressPackageStartupMessages(library(gplots))

 ## 2) Helpers --------------------------------------------------
 # Read IDs from a file that may be:
 #  - one column with or without header "Gene_Id"
 #  - may contain quotes
 read_ids_from_file <- function(path) {
   if (!file.exists(path)) stop("File not found: ", path)
   df <- tryCatch(
     read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
     error = function(e) NULL
   )
   if (!is.null(df) && ncol(df) >= 1) {
     if ("Gene_Id" %in% names(df)) {
       ids <- df[["Gene_Id"]]
     } else if (ncol(df) == 1L) {
       ids <- df[[1]]
     } else {
       first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
       if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
       ids <- df[[first_nonempty]]
     }
   } else {
     df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
     if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
     ids <- df2[[1]]
   }
   ids <- trimws(gsub('"', "", ids))
   ids[nzchar(ids)]
 }

 #BREAK_LINE

 # From "A_vs_B" get c("A","B")
 split_contrast_groups <- function(x) {
   parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
   if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
   parts
 }

 # Match whole tags at boundaries or underscores
 match_tags <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 ## 3) Build GOI from the two .id files -------------------------
 GOI_up   <- read_ids_from_file(up_file)
 GOI_down <- read_ids_from_file(down_file)
 GOI <- unique(c(GOI_up, GOI_down))
 if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

 ## 4) Expression matrix (DESeq2 rlog/vst) ----------------------
 # Use rld if present; otherwise try vsd
 if (exists("rld")) {
   expr_all <- assay(rld)
 } else if (exists("vsd")) {
   expr_all <- assay(vsd)
 } else {
   stop("Neither 'rld' nor 'vsd' object is available in the environment.")
 }
 RNASeq.NoCellLine <- as.matrix(expr_all)

 # GOI are already 'gene-*' in your data — use them directly for matching
 present <- intersect(rownames(RNASeq.NoCellLine), GOI)
 if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
 # Optional: report truly missing IDs (on the same 'gene-*' format)
 missing <- setdiff(GOI, present)
 if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

 ## 5) Keep ONLY columns for the two groups in the contrast -----
 groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
 keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
 if (!any(keep_cols)) {
   stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
       ". Check your column names or implement colData-based filtering.")
 }
 cols_idx <- which(keep_cols)
 sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

 # Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
 ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

 # Subset safely
 expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

 ## 6) Remove constant/NA rows ----------------------------------
 row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
 if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
 datamat <- expr_sub[row_ok, , drop = FALSE]

 # Save the filtered matrix used for the heatmap (optional)
 out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
 write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

 ## 6.1) Pretty labels (display only) ---------------------------
 # Row labels: strip 'gene-'
 labRow_pretty <- sub("^gene-", "", rownames(datamat))

 # Column labels: 'deltasbp' -> 'Δsbp' and nicer spacing
 labCol_pretty <- colnames(datamat)
 labCol_pretty <- gsub("^deltasbp", "\u0394sbp", labCol_pretty)
 labCol_pretty <- gsub("_", " ", labCol_pretty)   # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
 # If you prefer to drop replicate suffixes, uncomment:
 # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

 #BREAK_LINE

 ## 7) Clustering -----------------------------------------------
 # Row clustering with Pearson distance
 hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
 #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
 #row_cor[!is.finite(row_cor)] <- 0
 #hr <- hclust(as.dist(1 - row_cor), method = "complete")

 # Color row-side groups by cutting the dendrogram
 mycl <- cutree(hr, h = max(hr$height) / 1.1)
 palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                   "lightblue","pink","purple","lightcyan","salmon","lightgreen")
 mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

 png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=950)
 heatmap.2(datamat,
         Rowv = as.dendrogram(hr),
         col = bluered(75),
         scale = "row",
         RowSideColors = mycol,
         trace = "none",
         margin = c(10, 15),         # bottom, left
         sepwidth = c(0, 0),
         dendrogram = 'row',
         Colv = 'false',
         density.info = 'none',
         labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
         labCol     = labCol_pretty,   # col labels with Δsbp + spaces
         cexRow = 2.5,
         cexCol = 2.5,
         srtCol = 15,
         lhei = c(0.6, 4),           # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large'
         lwid = c(0.8, 4))           # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large'
 dev.off()

 # ------------------ Heatmap generation for three samples ----------------------

 ## ============================================================
 ## Three-condition DEGs heatmap from multiple pairwise contrasts
 ## Example contrasts:
 ##   "WT_MH_4h_vs_WT_MH_2h",
 ##   "WT_MH_18h_vs_WT_MH_2h",
 ##   "WT_MH_18h_vs_WT_MH_4h"
 ## Output shows the union of DEGs across all contrasts and
 ## only the columns (samples) for the 3 conditions.
 ## ============================================================

 ## -------- 0) User inputs ------------------------------------
 contrasts <- c(
   "WT_MH_4h_vs_WT_MH_2h",
   "WT_MH_18h_vs_WT_MH_2h",
   "WT_MH_18h_vs_WT_MH_4h"
 )
 contrasts <- c(
   "WT_TSB_4h_vs_WT_TSB_2h",
   "WT_TSB_18h_vs_WT_TSB_2h",
   "WT_TSB_18h_vs_WT_TSB_4h"
 )
 contrasts <- c(
   "deltasbp_MH_4h_vs_deltasbp_MH_2h",
   "deltasbp_MH_18h_vs_deltasbp_MH_2h",
   "deltasbp_MH_18h_vs_deltasbp_MH_4h"
 )
 contrasts <- c(
   "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
   "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
   "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
 )
 ## Optionally force a condition display order (defaults to order of first appearance)
 cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h")
 cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h")
 cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h")
 cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h")
 #cond_order <- NULL

 ## -------- 1) Packages ---------------------------------------
 need <- c("gplots")
 to_install <- setdiff(need, rownames(installed.packages()))
 if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
 suppressPackageStartupMessages(library(gplots))

 ## -------- 2) Helpers ----------------------------------------
 read_ids_from_file <- function(path) {
   if (!file.exists(path)) stop("File not found: ", path)
   df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE,
                             quote = "\"'", comment.char = ""), error = function(e) NULL)
   if (!is.null(df) && ncol(df) >= 1) {
     ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
   } else {
     df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE,
                       quote = "\"'", comment.char = "")
     ids <- df2[[1]]
   }
   ids <- trimws(gsub('"', "", ids))
   ids[nzchar(ids)]
 }

 # From "A_vs_B" return c("A","B")
 split_contrast_groups <- function(x) {
   parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
   if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
   parts
 }

 # Grep whole tag between start/end or underscores
 match_tags <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 # Pretty labels for columns (optional tweaks)
 prettify_col_labels <- function(x) {
   x <- gsub("^deltasbp", "\u0394sbp", x)  # example from your earlier case
   x <- gsub("_", " ", x)
   x
 }

 # BREAK_LINE

 ## -------- 3) Build GOI (union across contrasts) -------------
 up_files   <- paste0(contrasts, "-up.id")
 down_files <- paste0(contrasts, "-down.id")

 GOI <- unique(unlist(c(
   lapply(up_files,   read_ids_from_file),
   lapply(down_files, read_ids_from_file)
 )))
 if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

 ## -------- 4) Expression matrix (rld or vsd) -----------------
 if (exists("rld")) {
   expr_all <- assay(rld)
 } else if (exists("vsd")) {
   expr_all <- assay(vsd)
 } else {
   stop("Neither 'rld' nor 'vsd' object is available in the environment.")
 }
 expr_all <- as.matrix(expr_all)

 present <- intersect(rownames(expr_all), GOI)
 if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
 missing <- setdiff(GOI, present)
 if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

 ## -------- 5) Infer the THREE condition tags -----------------
 pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B)
 cond_tags <- unique(unlist(pair_groups))
 if (length(cond_tags) != 3L) {
   stop("Expected exactly three unique condition tags across the contrasts, got: ",
       paste(cond_tags, collapse = ", "))
 }

 # If user provided an explicit order, use it; else keep first-appearance order
 if (!is.null(cond_order)) {
   if (!setequal(cond_order, cond_tags))
     stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
   cond_tags <- cond_order
 }

 #BREAK_LINE

 ## -------- 6) Subset columns to those 3 conditions -----------
 # helper: does a name contain any of the tags?
 match_any_tag <- function(nms, tags) {
   pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
   grepl(pat, nms, perl = TRUE)
 }

 # helper: return the specific tag that a single name matches
 detect_tag <- function(nm, tags) {
   hits <- vapply(tags, function(t)
     grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
   if (!any(hits)) NA_character_ else tags[which(hits)[1]]
 }

 keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
 if (!any(keep_cols)) {
   stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
 }

 sub_idx <- which(keep_cols)
 sub_colnames <- colnames(expr_all)[sub_idx]

 # find the tag for each kept column (this is the part that was wrong before)
 cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)

 # rank columns by your desired condition order, then by name within each condition
 cond_rank <- match(cond_for_col, cond_tags)
 ord <- order(cond_rank, sub_colnames)

 expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

 ## -------- 7) Remove constant/NA rows ------------------------
 row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
 if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
 datamat <- expr_sub[row_ok, , drop = FALSE]

 ## -------- 8) Labels ----------------------------------------
 labRow_pretty <- sub("^gene-", "", rownames(datamat))
 labCol_pretty <- prettify_col_labels(colnames(datamat))

 ## -------- 9) Clustering (rows) ------------------------------
 hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")

 # Color row-side groups by cutting the dendrogram
 mycl <- cutree(hr, h = max(hr$height) / 1.3)
 palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                   "lightblue","pink","purple","lightcyan","salmon","lightgreen")
 mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

 ## -------- 10) Save the matrix used --------------------------
 out_tag <- paste(cond_tags, collapse = "_")
 write.csv(as.data.frame(datamat),
           file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"),
           quote = FALSE)

 ## -------- 11) Plot heatmap ----------------------------------
 png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 5000)
 heatmap.2(
   datamat,
   Rowv = as.dendrogram(hr),
   Colv = FALSE,
   dendrogram = "row",
   col = bluered(75),
   scale = "row",
   trace = "none",
   density.info = "none",
   RowSideColors = mycol,
   margins = c(10, 15),      # c(bottom, left)
   sepwidth = c(0, 0),
   labRow = labRow_pretty,
   labCol = labCol_pretty,
   cexRow = 1.3,
   cexCol = 1.8,
   srtCol = 15,
   lhei = c(0.01, 4),
   lwid = c(0.5, 4),
   key = FALSE               # safer; add manual z-score key if you want (see note below)
 )
 dev.off()

 mv DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png DEGs_heatmap_WT_MH.png
 mv DEGs_heatmap_WT_TSB_2h_WT_TSB_4h_WT_TSB_18h.png DEGs_heatmap_WT_TSB.png
 mv DEGs_heatmap_deltasbp_MH_2h_deltasbp_MH_4h_deltasbp_MH_18h.png DEGs_heatmap_deltasbp_MH.png
 mv DEGs_heatmap_deltasbp_TSB_2h_deltasbp_TSB_4h_deltasbp_TSB_18h.png DEGs_heatmap_deltasbp_TSB.png
 # ------------------ Heatmap generation for three samples END ----------------------

 # ==== (NOT_USED) Ultra-robust heatmap.2 plotter with many attempts ====
 # Inputs:
 #   mat        : numeric matrix (genes x samples)
 #   hr         : hclust for rows (or TRUE/FALSE)
 #   row_colors : vector of RowSideColors of length nrow(mat) or NULL
 #   labRow     : character vector of row labels (display only)
 #   labCol     : character vector of col labels (display only)
 #   outfile    : output PNG path
 #   base_res   : DPI for PNG (default 150)
 # ==== Slide-tuned heatmap.2 plotter (moderate size, larger fonts, 45° labels) ====
 safe_heatmap2 <- function(mat, hr, row_colors, labRow, labCol, outfile, base_res = 150) {
   stopifnot(is.matrix(mat))
   nr <- nrow(mat); nc <- ncol(mat)

   # Target slide size & sensible caps
   #target_w <- 2400; target_h <- 1600
   #max_w <- 3000; max_h <- 2000
   target_w <- 800; target_h <- 2000
   max_w <- 1500; max_h <- 1500

   # Label stats
   max_row_chars <- if (length(labRow)) max(nchar(labRow), na.rm = TRUE) else 1
   max_col_chars <- if (length(labCol)) max(nchar(labCol), na.rm = TRUE) else 1

   #add_attempt(target_w, target_h, 0.90, 1.00, 45, NULL, TRUE,  TRUE,  TRUE)
   attempts <- list()
   add_attempt <- function(w, h, cr, cc, rot, mar = NULL, key = TRUE, showR = TRUE, showC = TRUE, trunc_rows = 0) {
     attempts[[length(attempts) + 1]] <<- list(
       w = w, h = h, cr = cr, cc = cc, rot = rot, mar = mar,
       key = key, showR = showR, showC = showC, trunc_rows = trunc_rows
     )
   }

   # Note that if the key is FALSE, all works, if the key is TRUE, none works!
   # 1) Preferred look: moderate size, biggish fonts, 45° labels
   add_attempt(target_w,           target_h,           0.90, 1.00, 30, c(2,1), TRUE, TRUE, TRUE)
   # 2) Same, explicit margins computed later
   add_attempt(target_w,           target_h,           0.85, 0.95, 45, c(10,15), TRUE, TRUE, TRUE)
   # 3) Slightly bigger canvas
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.85, 0.95, 30, c(10,15), TRUE, TRUE, TRUE)
   # 4) Make margins more generous (in lines)
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(10,14), FALSE, TRUE, TRUE)
   # 5) Reduce rotation to 30 if still tight
   add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(8,12),  FALSE, TRUE, TRUE)
   # 6) Final fallback: keep fonts reasonable, 0° labels, slightly bigger margins
   add_attempt(min(target_w+500,   max_w), min(target_h+300, max_h), 0.80, 0.90,  45, c(8,12),  FALSE, TRUE, TRUE)
   # 7) Last resort: truncate long row labels (keeps readability)
   if (max_row_chars > 20) {
     add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE, trunc_rows = 18)
   }

   for (i in seq_along(attempts)) {
     a <- attempts[[i]]

     # Compute margins if not provided
     if (is.null(a$mar)) {
       col_margin <- if (a$showC) {
         if (a$rot > 0) max(6, ceiling(0.45 * max_col_chars * max(a$cc, 0.8))) else
                       max(5, ceiling(0.22 * max_col_chars * max(a$cc, 0.8)))
       } else 4
       row_margin <- if (a$showR) max(6, ceiling(0.55 * max_row_chars * max(a$cr, 0.8))) else 4
       mar <- c(col_margin, row_margin)
     } else {
       mar <- a$mar
     }

     # Prepare labels for this attempt
     lr <- if (a$showR) labRow else rep("", nr)
     if (a$trunc_rows > 0 && a$showR) {
       lr <- ifelse(nchar(lr) > a$trunc_rows, paste0(substr(lr, 1, a$trunc_rows), "…"), lr)
     }
     lc <- if (a$showC) labCol else rep("", nc)

     # Close any open device
     if (dev.cur() != 1) try(dev.off(), silent = TRUE)

     ok <- FALSE
     try({
       png(outfile, width = ceiling(a$w), height = ceiling(a$h), res = base_res)
       gplots::heatmap.2(
         mat,
         Rowv = as.dendrogram(hr),
         Colv = FALSE,
         dendrogram = "row",
         col = gplots::bluered(75),
         scale = "row",
         trace = "none",
         density.info = "none",
         RowSideColors = row_colors,
         key = a$key,
         margins = mar,           # c(col, row) in lines
         sepwidth = c(0, 0),
         labRow = lr,
         labCol = lc,
         cexRow = a$cr,
         cexCol = a$cc,
         srtCol = a$rot,
         lhei = c(0.1, 4),
         lwid = c(0.1, 4)
       )
       dev.off()
       ok <- TRUE
     }, silent = TRUE)

     if (ok) {
       message(sprintf(
         "✓ Heatmap saved: %s  (attempt %d)  size=%dx%d  margins=c(%d,%d)  cexRow=%.2f  cexCol=%.2f  srtCol=%d",
         outfile, i, ceiling(a$w), ceiling(a$h), mar[1], mar[2], a$cr, a$cc, a$rot
       ))
       return(invisible(TRUE))
     } else {
       if (dev.cur() != 1) try(dev.off(), silent = TRUE)
       message(sprintf("Attempt %d failed; retrying...", i))
     }
   }

   stop("Failed to draw heatmap after tuned attempts. Consider ComplexHeatmap if this persists.")
 }

 safe_heatmap2(
   mat        = datamat,
   hr         = hr,
   row_colors = mycol,
   labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
   labCol     = labCol_pretty,   # col labels with Δsbp + spaces
   outfile    = paste0("DEGs_heatmap_", contrast, ".png"),
   #base_res   = 150
 )

 # -- (OLD CODE) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h --
 cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids
 #add Gene_Id in the first line, delete the ""  #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID.
 GOI <- read.csv("ids")$Gene_Id
 RNASeq.NoCellLine <- assay(rld)
 #install.packages("gplots")
 library("gplots")
 #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
 datamat = RNASeq.NoCellLine[GOI, ]
 #datamat = RNASeq.NoCellLine
 write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")

 constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
 if(any(constant_rows)) {
   cat("Removing", sum(constant_rows), "constant rows.\n")
   datamat <- datamat[!constant_rows, ]
 }
 hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
 hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
 mycl = cutree(hr, h=max(hr$height)/1.1)
 mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
 mycol = mycol[as.vector(mycl)]

 png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000)
 heatmap.2(datamat,
         Rowv = as.dendrogram(hr),
         col = bluered(75),
         scale = "row",
         RowSideColors = mycol,
         trace = "none",
         margin = c(10, 15),         # bottom, left
         sepwidth = c(0, 0),
         dendrogram = 'row',
         Colv = 'false',
         density.info = 'none',
         labRow = rownames(datamat),
         cexRow = 1.5,
         cexCol = 1.5,
         srtCol = 35,
         lhei = c(0.2, 4),           # reduce top space (was 1 or more)
         lwid = c(0.4, 4))           # reduce left space (was 1 or more)
 dev.off()

 # -------------- Cluster members ----------------
 write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
 write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
 write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
 write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
 write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
 #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
 #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;

 #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) #####
 subset_1<-names(subset(mycl, mycl == '1'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
 subset_2<-names(subset(mycl, mycl == '2'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
 subset_3<-names(subset(mycl, mycl == '3'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
 subset_4<-names(subset(mycl, mycl == '4'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
 subset_5<-names(subset(mycl, mycl == '5'))
 data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
 # Initialize an empty data frame for the annotated data
 annotated_data <- data.frame()
 # Determine total number of genes
 total_genes <- length(rownames(data))
 # Loop through each gene to annotate
 for (i in 1:total_genes) {
     gene <- rownames(data)[i]
     result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                     filters = 'ensembl_gene_id',
                     values = gene,
                     mart = ensembl)
     # If multiple rows are returned, take the first one
     if (nrow(result) > 1) {
         result <- result[1, ]
     }
     # Check if the result is empty
     if (nrow(result) == 0) {
         result <- data.frame(ensembl_gene_id = gene,
                             external_gene_name = NA,
                             gene_biotype = NA,
                             entrezgene_id = NA,
                             chromosome_name = NA,
                             start_position = NA,
                             end_position = NA,
                             strand = NA,
                             description = NA)
     }
     # Transpose expression values
     expression_values <- t(data.frame(t(data[gene, ])))
     colnames(expression_values) <- colnames(data)
     # Combine gene information and expression data
     combined_result <- cbind(result, expression_values)
     # Append to the final dataframe
     annotated_data <- rbind(annotated_data, combined_result)
     # Print progress every 100 genes
     if (i %% 100 == 0) {
         cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
     }
 }
 # Save the annotated data to a new CSV file
 write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
 write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
 #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Assign KEGG and GO Terms (see diagram above)

 Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

 Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

     EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

     Install EggNOG-mapper:

         mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
         mamba activate eggnog_env

     Run annotation:

         #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
         mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
         #Download the protein sequences from Genbank
         mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
         python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
         emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
         #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
         #---->  470.IX87_14445:
             * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
             * IX87_14445 would refer to a specific gene or protein within that genome.

     Extract KEGG KO IDs from annotations.emapper.annotations.

 Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

 * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
 * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
 * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
         QBlast finished with warnings!
         Blasted Sequences: 2084
         Sequences without results: 105
         Check the Job log for details and try to submit again.
         Restarting QBlast may result in additional results, depending on the error type.
         "Blast (CP020463_protein) Done"
 * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
         "Mapping (CP020463_protein) Done"
         "Mapping finished - Please proceed now to annotation."
 * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
         * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
         * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
         "Annotation (CP020463_protein) Done"
         "Annotation finished."
 or blast2go_cli_v1.5.1 (NOT_USED)

         #https://help.biobam.com/space/BCD/2250407989/Installation
         #see ~/Scripts/blast2go_pipeline.sh

 Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
     * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
         "InterProScan (CP020463_protein) Done"
         "InterProScan Finished - You can now merge the obtained GO Annotations."
 MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
     * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
         "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
         "Finished merging GO terms from InterPro with annotations."
         "Maybe you want to run ANNEX (Annotation Augmentation)."
     #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
 File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
         #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

     #-- before merging (blast2go_annot.annot) --
     #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
     #H0N29_18790     GO:0085020
     #-- after merging (blast2go_annot.annot2) -->
     #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
     #H0N29_18790     GO:0070531
     #H0N29_18790     GO:0004842
     #H0N29_18790     GO:0005515
     #H0N29_18790     GO:0085020

     cp blast2go_annot.annot blast2go_annot.annot2

 Option 4 (NOT_USED): RFAM for non-colding RNA

 Option 5 (NOT_USED): PSORTb for subcellular localizations

 Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

 * Go to KAAS
 * Upload your FASTA file.
 * Select an appropriate gene set.
 * Download the KO assignments.

Find the Closest KEGG Organism Code (NOT_USED)

 Since your species isn't directly in KEGG, use a closely related organism.

 * Check available KEGG organisms:

         library(clusterProfiler)
         library(KEGGREST)

         kegg_organisms <- keggList("organism")

         Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

         # Search for Acinetobacter in the list
         grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
         # Gammaproteobacteria
         #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

 If your organism is not in KEGG, search for the closest relative:

         grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

 For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

         kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

     #BiocManager::install("GO.db")
     #BiocManager::install("AnnotationDbi")

     # Load required libraries
     library(openxlsx)  # For Excel file handling
     library(dplyr)     # For data manipulation
     library(tidyr)
     library(stringr)
     library(clusterProfiler)  # For KEGG and GO enrichment analysis
     #library(org.Hs.eg.db)  # Replace with appropriate organism database
     library(GO.db)
     library(AnnotationDbi)

     setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes")
     # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
     # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
     #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
     #1,5c1
     #< ## Thu Jan 30 16:34:52 2025
     #< ## emapper-2.1.12
     #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
     #< ##
     #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
     #---
     #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
     #3620,3622d3615
     #< ## 3614 queries scanned
     #< ## Total time (seconds): 8.176708459854126

     # Step 1: Load the blast2go annotation file with a check for missing columns
     annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

     # If the structure is inconsistent, we can make sure there are exactly 3 columns:
     colnames(annot_df) <- c("GeneID", "Term")
     # Step 2: Filter and aggregate GO and EC terms as before
     go_terms <- annot_df %>%
     filter(grepl("^GO:", Term)) %>%
     group_by(GeneID) %>%
     summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
     ec_terms <- annot_df %>%
     filter(grepl("^EC:", Term)) %>%
     group_by(GeneID) %>%
     summarize(EC = paste(Term, collapse = ","), .groups = "drop")

     # Key Improvements:
     #    * Looped processing of all 6 input files to avoid redundancy.
     #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
     #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
     #    * Error handling for GO term descriptions via tryCatch.
     #    * Improved clarity and modular structure for easier maintenance and future additions.

     # Define the filenames and output suffixes
     file_list <- c(
       "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
       "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
       "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
       "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
       "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
       "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

       "WT_MH_4h_vs_WT_MH_2h",
       "WT_MH_18h_vs_WT_MH_2h",
       "WT_MH_18h_vs_WT_MH_4h",
       "WT_TSB_4h_vs_WT_TSB_2h",
       "WT_TSB_18h_vs_WT_TSB_2h",
       "WT_TSB_18h_vs_WT_TSB_4h",

       "deltasbp_MH_4h_vs_deltasbp_MH_2h",
       "deltasbp_MH_18h_vs_deltasbp_MH_2h",
       "deltasbp_MH_18h_vs_deltasbp_MH_4h",
       "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
       "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
       "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
     )

     #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

     # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
     suppressPackageStartupMessages({
       library(readr)
       library(dplyr)
       library(stringr)
       library(tidyr)
       library(openxlsx)
       library(clusterProfiler)
       library(AnnotationDbi)
       library(GO.db)
     })

     # ---- PARAMETERS ----
     PADJ_CUT <- 5e-2
     LFC_CUT  <- 2

     # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
     emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt"

     # Input files (you can add/remove here)
     input_files <- c(
       "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
       "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
       "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
       "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
       "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
       "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

       "WT_MH_4h_vs_WT_MH_2h-all.csv",
       "WT_MH_18h_vs_WT_MH_2h-all.csv",
       "WT_MH_18h_vs_WT_MH_4h-all.csv",
       "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
       "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
       "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

       "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
       "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
       "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
       "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
       "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
       "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
     )

     # ---- HELPERS ----
     # Robust reader (CSV first, then TSV)
     read_table_any <- function(path) {
       tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                     error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                   error = function(e2) NULL))
       tb
     }

     # Return a nice Excel-safe base name
     xlsx_name_from_file <- function(path) {
       base <- tools::file_path_sans_ext(basename(path))
       paste0("DEG_KEGG_GO_", base, ".xlsx")
     }

     # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
     expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
       if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
       kdf <- as.data.frame(kegg_res)
       if (!"geneID" %in% names(kdf)) return(kdf)
       # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
       map_clean <- mapping_tbl %>%
         dplyr::select(KEGG_ko, GeneID) %>%
         filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
         mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
         tidyr::separate_rows(KEGG_ko, sep = ",") %>%
         distinct()

       if (!nrow(map_clean)) {
         return(kdf)
       }

       expanded <- kdf %>%
         tidyr::separate_rows(geneID, sep = "/") %>%
         dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
         distinct() %>%
         dplyr::group_by(ID) %>%
         dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

       kdf %>%
         dplyr::select(-geneID) %>%
         dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
         dplyr::rename(geneID = GeneID)
     }

     # ---- LOAD emapper annotations ----
     eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
     # Ensure character columns for joins
     eggnog_data$query   <- as.character(eggnog_data$query)
     eggnog_data$GOs     <- as.character(eggnog_data$GOs)
     eggnog_data$EC      <- as.character(eggnog_data$EC)
     eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

     # ---- MAIN LOOP ----
     for (f in input_files) {
       if (!file.exists(f)) { message("Missing: ", f); next }

       message("Processing: ", f)
       res <- read_table_any(f)
       if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

       # Coerce expected columns if present
       if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
       if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

       # Ensure GeneID & GeneName exist
       if (!"GeneID" %in% names(res)) {
         # Try to infer from a generic 'gene' column
         if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
       }
       if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

       # Fill missing GeneName from GeneID (drop "gene-")
       res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                             gsub("^gene-", "", as.character(res$GeneID)),
                             as.character(res$GeneName))

       # De-duplicate by GeneName, keep smallest padj
       if (!"padj" %in% names(res)) res$padj <- NA_real_
       res <- res %>%
         group_by(GeneName) %>%
         slice_min(padj, with_ties = FALSE) %>%
         ungroup() %>%
         as.data.frame()

       # Sort by padj asc, then log2FC desc
       if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
       res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

       # Join emapper (strip "gene-" from GeneID to match emapper 'query')
       res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
       res_ann <- res %>%
         left_join(eggnog_data, by = c("GeneID_plain" = "query"))

       # --- Split by UP/DOWN using your volcano cutoffs ---
       up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
       down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

       # --- KEGG enrichment (using K numbers in KEGG_ko) ---
       # Prepare KO lists (remove "ko:" if present)
       k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
       k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
       k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

       # BREAK_LINE

       kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
       kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

       # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
       kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
       kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

       # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
       # Background gene set = all genes in this comparison
       background_genes <- unique(res_ann$GeneID_plain)
       # TERM2GENE table (GO -> GeneID_plain)
       go_annotation <- res_ann %>%
         dplyr::select(GeneID_plain, GOs) %>%
         mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
         tidyr::separate_rows(GOs, sep = ",") %>%
         filter(GOs != "") %>%
         dplyr::select(GOs, GeneID_plain) %>%
         distinct()

       # Gene lists for GO enricher
       go_list_up   <- unique(up_regulated$GeneID_plain)
       go_list_down <- unique(down_regulated$GeneID_plain)

       go_up <- tryCatch(
         enricher(gene = go_list_up, TERM2GENE = go_annotation,
                 pvalueCutoff = 0.05, pAdjustMethod = "BH",
                 universe = background_genes),
         error = function(e) NULL
       )
       go_down <- tryCatch(
         enricher(gene = go_list_down, TERM2GENE = go_annotation,
                 pvalueCutoff = 0.05, pAdjustMethod = "BH",
                 universe = background_genes),
         error = function(e) NULL
       )

       go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
       go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

       # Add GO term descriptions via GO.db (best-effort)
       add_go_term_desc <- function(df) {
         if (!nrow(df) || !"ID" %in% names(df)) return(df)
         df$Description <- sapply(df$ID, function(go_id) {
           term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                 columns = "TERM", keytype = "GOID"),
                           error = function(e) NULL)
           if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
         })
         df
       }
       go_up_df   <- add_go_term_desc(go_up_df)
       go_down_df <- add_go_term_desc(go_down_df)

       # ---- Write Excel workbook ----
       out_xlsx <- xlsx_name_from_file(f)
       wb <- createWorkbook()

       addWorksheet(wb, "Complete")
       writeData(wb, "Complete", res_ann)

       addWorksheet(wb, "Up_Regulated")
       writeData(wb, "Up_Regulated", up_regulated)

       addWorksheet(wb, "Down_Regulated")
       writeData(wb, "Down_Regulated", down_regulated)

       addWorksheet(wb, "KEGG_Enrichment_Up")
       writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

       addWorksheet(wb, "KEGG_Enrichment_Down")
       writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

       addWorksheet(wb, "GO_Enrichment_Up")
       writeData(wb, "GO_Enrichment_Up", go_up_df)

       addWorksheet(wb, "GO_Enrichment_Down")
       writeData(wb, "GO_Enrichment_Down", go_down_df)

       saveWorkbook(wb, out_xlsx, overwrite = TRUE)
       message("Saved: ", out_xlsx)
     }

     # -------------------------------- OLD_CODE not automatized with loop ----------------------------
     # Load the results
     res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
     res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

     res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
     res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
     res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
     res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
     res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

     res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
     res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
     res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
     res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

     # Replace empty GeneName with modified GeneID
     res$GeneName <- ifelse(
         res$GeneName == "" | is.na(res$GeneName),
         gsub("gene-", "", res$GeneID),
         res$GeneName
     )

     # Remove duplicated genes by selecting the gene with the smallest padj
     duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

     res <- res %>%
     group_by(GeneName) %>%
     slice_min(padj, with_ties = FALSE) %>%
     ungroup()

     res <- as.data.frame(res)
     # Sort res first by padj (ascending) and then by log2FoldChange (descending)
     res <- res[order(res$padj, -res$log2FoldChange), ]
     # Read eggnog annotations
     eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
     # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
     res$GeneID <- gsub("gene-", "", res$GeneID)
     # Merge eggnog data with res based on GeneID
     res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

     # Merge with the res dataframe
     # Perform the left joins and rename columns
     res_updated <- res %>%
     left_join(go_terms, by = "GeneID") %>%
     left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

     # Filter up-regulated genes
     up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.05, ]
     # Filter down-regulated genes
     down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.05, ]

     # Create a new workbook
     wb <- createWorkbook()
     # Add the complete dataset as the first sheet (with annotations)
     addWorksheet(wb, "Complete")
     writeData(wb, "Complete_Data", res_updated)
     # Add the up-regulated genes as the second sheet (with annotations)
     addWorksheet(wb, "Up_Regulated")
     writeData(wb, "Up_Regulated", up_regulated)
     # Add the down-regulated genes as the third sheet (with annotations)
     addWorksheet(wb, "Down_Regulated")
     writeData(wb, "Down_Regulated", down_regulated)
     # Save the workbook to a file
     #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
     #NOTE: The generated annotation-files contains all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) + almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs) except for -[GOs, EC] + two columns from Blast2GO (COs, EC); In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

     #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

     # Set GeneName as row names after the join
     rownames(res_updated) <- res_updated$GeneName
     res_updated <- res_updated %>% dplyr::select(-GeneName)
     ## Set the 'GeneName' column as row.names
     #rownames(res_updated) <- res_updated$GeneName
     ## Drop the 'GeneName' column since it's now the row names
     #res_updated$GeneName <- NULL
     # -- BREAK_1 --

     # ---- Perform KEGG enrichment analysis (up_regulated) ----
     gene_list_kegg_up <- up_regulated$KEGG_ko
     gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
     kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
     # -- convert the GeneID (Kxxxxxx) to the true GeneID --
     # Step 0: Create KEGG to GeneID mapping
     kegg_to_geneid_up <- up_regulated %>%
     dplyr::select(KEGG_ko, GeneID) %>%
     filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
     mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
     # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
     kegg_to_geneid_clean <- kegg_to_geneid_up %>%
     mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
     separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
     filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
     distinct()  # Remove any duplicate mappings
     # Step 2.1: Expand geneID column in kegg_enrichment_up
     expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
     distinct() %>%  # Remove duplicate matches
     group_by(ID) %>%
     summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
     #dplyr::glimpse(expanded_kegg)
     # Step 3.1: Replace geneID column in the original dataframe
     kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
     # Remove old geneID column and merge new one
     kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

     # ---- Perform KEGG enrichment analysis (down_regulated) ----
     # Step 1: Extract KEGG KO terms from down-regulated genes
     gene_list_kegg_down <- down_regulated$KEGG_ko
     gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
     # Step 2: Perform KEGG enrichment analysis
     kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
     # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
     # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
     kegg_to_geneid_down <- down_regulated %>%
     dplyr::select(KEGG_ko, GeneID) %>%
     filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
     mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
     # -- BREAK_2 --

     # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
     kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
     mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
     separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
     filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
     distinct()  # Remove duplicate mappings

     # Step 5: Expand geneID column in kegg_enrichment_down
     expanded_kegg_down <- kegg_enrichment_down %>%
     as.data.frame() %>%
     separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
     left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
     distinct() %>%  # Remove duplicate matches
     group_by(ID) %>%
     summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

     # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
     kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
     dplyr::select(-geneID) %>%  # Remove old geneID column
     left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
     dplyr::rename(geneID = GeneID)  # Rename column back to geneID
     # View the updated dataframe
     head(kegg_enrichment_down_df)

     # Create a new workbook
     #wb <- createWorkbook()
     # Save enrichment results to the workbook
     addWorksheet(wb, "KEGG_Enrichment_Up")
     writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
     # Save enrichment results to the workbook
     addWorksheet(wb, "KEGG_Enrichment_Down")
     writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

     # Define gene list (up-regulated genes)
     gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
     gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

     # Define background gene set (all genes in res)
     background_genes <- res_updated$GeneID  # Extract the 3646 background genes

     # Prepare GO annotation data from res
     go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
     go_annotation <- go_annotation %>%
     tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
     # -- BREAK_3 --

     go_enrichment_up <- enricher(
         gene = gene_list_go_up,                # Up-regulated genes
         TERM2GENE = go_annotation,       # Custom GO annotation
         pvalueCutoff = 0.05,             # Significance threshold
         pAdjustMethod = "BH",
         universe = background_genes      # Define the background gene set
     )
     go_enrichment_up <- as.data.frame(go_enrichment_up)

     go_enrichment_down <- enricher(
         gene = gene_list_go_down,                # Up-regulated genes
         TERM2GENE = go_annotation,       # Custom GO annotation
         pvalueCutoff = 0.05,             # Significance threshold
         pAdjustMethod = "BH",
         universe = background_genes      # Define the background gene set
     )
     go_enrichment_down <- as.data.frame(go_enrichment_down)

     ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
     #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

     # Update the Description column with the term descriptions
     go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
     # Using select to get the term description
     term <- tryCatch({
         AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
     }, error = function(e) {
         message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
         return(data.frame(TERM = NA))  # In case of error, return NA
     })
     if (nrow(term) > 0) {
         return(term$TERM)
     } else {
         return(NA)  # If no description found, return NA
     }
     })
     ## Print the updated data frame
     #print(go_enrichment_up)

     ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
     #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
     # Update the Description column with the term descriptions
     go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
     # Using select to get the term description
     term <- tryCatch({
         AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
     }, error = function(e) {
         message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
         return(data.frame(TERM = NA))  # In case of error, return NA
     })

     if (nrow(term) > 0) {
         return(term$TERM)
     } else {
         return(NA)  # If no description found, return NA
     }
     })

     addWorksheet(wb, "GO_Enrichment_Up")
     writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

     addWorksheet(wb, "GO_Enrichment_Down")
     writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

     # Save the workbook with enrichment results
     saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

     #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
     #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
     #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!

     #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
     #Mac_vs_LB
     #LB.AB_vs_LB.WT19606
     #LB.IJ_vs_LB.WT19606
     #LB.W1_vs_LB.WT19606
     #LB.Y1_vs_LB.WT19606
     #Mac.AB_vs_Mac.WT19606
     #Mac.IJ_vs_Mac.WT19606
     #Mac.W1_vs_Mac.WT19606
     #Mac.Y1_vs_Mac.WT19606

     #TODO: write reply hints in KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.

(DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

         library(openxlsx)

         # Function to read and clean gene ID files
         read_gene_ids <- function(file_path) {
         # Read the gene IDs from the file
         gene_ids <- readLines(file_path)

         # Remove any quotes and trim whitespaces
         gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
         gene_ids <- trimws(gene_ids)  # Trim whitespaces

         # Remove empty entries or NAs
         gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

         return(gene_ids)
         }

         # Example list of LB files with both -up.id and -down.id for each condition
         lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                         "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
         lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                         "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

         # Combine both up and down files for each condition
         lb_files <- c(lb_files_up, lb_files_down)

         # Read gene IDs for each file in LB group
         #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
         lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

         lb_degs_ <- list()
         combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
         #unique_combined_set <- unique(combined_set)
         lb_degs_$AB <- combined_set
         combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
         lb_degs_$IJ <- combined_set
         combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
         lb_degs_$W1 <- combined_set
         combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
         lb_degs_$Y1 <- combined_set

         # Example list of Mac files with both -up.id and -down.id for each condition
         mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                         "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
         mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                         "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

         # Combine both up and down files for each condition in Mac group
         mac_files <- c(mac_files_up, mac_files_down)

         # Read gene IDs for each file in Mac group
         mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

         mac_degs_ <- list()
         combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
         mac_degs_$AB <- combined_set
         combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
         mac_degs_$IJ <- combined_set
         combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
         mac_degs_$W1 <- combined_set
         combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
         mac_degs_$Y1 <- combined_set

         # Function to clean sheet names to ensure no sheet name exceeds 31 characters
         truncate_sheet_name <- function(names_list) {
         sapply(names_list, function(name) {
         if (nchar(name) > 31) {
         return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
         }
         return(name)
         })
         }

         # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

         # Define intersections between different conditions for LB
         inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
         inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
         inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
         inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
         inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
         inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

         # Define intersections between three conditions for LB
         inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
         inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
         inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
         inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

         # Define intersection between all four conditions for LB
         inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

         # Now remove the intersected genes from each original set for LB
         venn_list_lb <- list()

         # For LB.AB, remove genes that are also in other conditions
         venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

         # For LB.IJ, remove genes that are also in other conditions
         venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

         # For LB.W1, remove genes that are also in other conditions
         venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))

         # For LB.Y1, remove genes that are also in other conditions
         venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))

         # Add the intersections for LB (same as before)
         venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
         venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
         venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
         venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
         venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
         venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

         # Define intersections between three conditions for LB
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
         venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
         venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

         # Define intersection between all four conditions for LB
         venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

         # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

         # Define intersections between different conditions
         inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
         inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
         inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
         inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
         inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
         inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

         # Define intersections between three conditions
         inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
         inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
         inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
         inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

         # Define intersection between all four conditions
         inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

         # Now remove the intersected genes from each original set
         venn_list_mac <- list()

         # For Mac.AB, remove genes that are also in other conditions
         venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

         # For Mac.IJ, remove genes that are also in other conditions
         venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

         # For Mac.W1, remove genes that are also in other conditions
         venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))

         # For Mac.Y1, remove genes that are also in other conditions
         venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))

         # Add the intersections (same as before)
         venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
         venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
         venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
         venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
         venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
         venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

         # Define intersections between three conditions
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
         venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
         venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

         # Define intersection between all four conditions
         venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

         # Save the gene IDs to Excel for further inspection (optional)
         write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
         write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

         # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
         write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
         write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

         # Venn Diagram for LB group
         venn1 <- ggvenn(lb_degs_,
                         fill_color = c("skyblue", "tomato", "gold", "orchid"),
                         stroke_size = 0.4,
                         set_name_size = 5)
         ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

         # Venn Diagram for Mac group
         venn2 <- ggvenn(mac_degs_,
                         fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                         stroke_size = 0.4,
                         set_name_size = 5)
         ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

         cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Report:
```
 Please find the heat maps attached.

 Comparisons included

     1. 1457 TSB vs 1457Δsbp TSB (early timepoint, 1–2 h)

     2. 1457 MH vs 1457Δsbp MH (early timepoint, 1–2 h)

     3. 1457 TSB vs 1457Δsbp TSB (4 h)

     4. 1457 MH vs 1457Δsbp MH (4 h)

     5. 1457 TSB vs 1457Δsbp TSB (18 h)

     6. 1457 MH vs 1457Δsbp MH (18 h)

     7. 1457 TSB: early (1–2 h) vs 4 h vs 18 h

     8. 1457 MH: early (1–2 h) vs 4 h vs 18 h

     9. 1457Δsbp TSB: early (1–2 h) vs 4 h vs 18 h

     10. 1457Δsbp MH: early (1–2 h) vs 4 h vs 18 h

 What each heat map shows

     Rows: significant DEGs only (padj < 0.05 and |log2(fold-change)| > 2, i.e., > 2 or < −2).

     Columns: the samples for the comparison (2 conditions × replicates) or, for the three-condition panels, 3 conditions × replicates = 9 columns.

 How the files were generated

     Choose contrasts.

         Two-condition heat maps (items 1–6): contrast <- "<A>_vs_<B>".

         Three-condition panels (items 7–10):

         contrasts <- c("<A2>_vs_<A1>", "<A3>_vs_<A1>", "<A3>_vs_<A2>")
         cond_order <- c("<A1>", "<A2>", "<A3>")  # e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h

     Build the Genes Of Interest (GOI) dynamically. For each contrast, the script reads 
```
-up.id and -down.id and takes the union; for three-condition panels it unions across the three pairwise contrasts. Subset the expression matrix (assay(rld)): keep only the rows in GOI and the sample columns matching the condition tags (e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h). DEG annotation and enrichment The annotated significant DEGs (padj 2, i.e., > 2 or < −2) used in the steps above and the KEGG/GO enrichment results for each comparison are provided in the corresponding Excel files: DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx DEG_KEGG_GO_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx DEG_KEGG_GO_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx DEG_KEGG_GO_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx DEG_KEGG_GO_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx DEG_KEGG_GO_WT_MH_4h_vs_WT_MH_2h-all.xlsx DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_2h-all.xlsx DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_4h-all.xlsx DEG_KEGG_GO_WT_TSB_4h_vs_WT_TSB_2h-all.xlsx DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_2h-all.xlsx DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_4h-all.xlsx DEG_KEGG_GO_deltasbp_MH_4h_vs_deltasbp_MH_2h-all.xlsx DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_2h-all.xlsx DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_4h-all.xlsx DEG_KEGG_GO_deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.xlsx DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.xlsx DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.xlsx DEG_KEGG_GO_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx

Comprehensive RNA-seq Time-Course Analysis Pipeline for Bacterial Stress-Related Genes (Data_Michelle_RNAseq_2025/README_oxidoreductases)

Leave a reply

This article summarizes the pipeline and preserves all major code used for preparing metadata, processing raw counts, performing statistical analysis, integrating annotations, and reporting results. The workflow is designed for bacterial RNA-seq time-course analysis focusing on stress-related genes and oxidoreductases.

1. Prepare `samples.tsv` from the samplesheet

vim samples.tsv

Example contents:

sample  condition   time_h  batch   genotype    medium  replicate
WT_MH_2h_1  WT_MH   2       WT  MH  1
WT_MH_2h_2  WT_MH   2       WT  MH  2
WT_MH_2h_3  WT_MH   2       WT  MH  3
WT_MH_4h_1  WT_MH   4       WT  MH  1
WT_MH_4h_2  WT_MH   4       WT  MH  2
WT_MH_4h_3  WT_MH   4       WT  MH  3
WT_MH_18h_1 WT_MH   18      WT  MH  1
WT_MH_18h_2 WT_MH   18      WT  MH  2
WT_MH_18h_3 WT_MH   18      WT  MH  3
deltasbp_MH_2h_1    deltasbp_MH 2       deltasbp    MH  1
deltasbp_MH_2h_2    deltasbp_MH 2       deltasbp    MH  2
deltasbp_MH_2h_3    deltasbp_MH 2       deltasbp    MH  3
...
deltasbp_TSB_18h_3  deltasbp_TSB    18      deltasbp    TSB 3

2. Reformat `counts.tsv` from STAR/Salmon

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

Clean file manually:

Remove any double quotes ("), remove gene- from first column, replace delimiters to tab.

Clean file in R:

cts <- read.delim("counts.tsv", check.names = FALSE)
names(cts)[1] <- "gene_id"
if ("gene_name" %in% names(cts)) cts$gene_name <- NULL
names(cts) <- sub("_r([0-9]+)$", "_\\1", names(cts))
write.table(cts, file="counts_fixed.tsv", sep="\t", quote=FALSE, row.names=FALSE)
smp <- read.delim("samples.tsv", check.names = FALSE)
setdiff(colnames(cts)[-1], smp$sample)
setdiff(smp$sample, colnames(cts)[-1])

3. Run the R time-course analysis

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

4. Summarize and convert results

~/Tools/csv2xls-0.4/csv_to_xls.py oxidoreductases_time_trends.tsv stress_genes_time_trends.tsv -d$'\t' -o oxidoreductases_and_stress_genes_time_trends.xls

5. Key summary and reporting

PCA plots, time trends, and top decreasing/increasing genes by condition are summarized. For further filtering, decreasing genes can be extracted by filtering direction == "decreasing" in the results tables.

6. Full main R script: `rna_timecourse_bacteria.R`

#!/usr/bin/env Rscript

# ===============================
# RNA-seq time-course helper (Bacteria) — DESeq2
# Uses eggNOG emapper annotations (GOs & EC) for oxidoreductases + stress genes
# ===============================
# Example:
#   Rscript rna_timecourse_bacteria.R \
#     --counts counts.tsv \
#     --samples samples.tsv \
#     --condition_col condition \
#     --time_col time_h \
#     --batch_col batch \
#     --emapper ~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt \
#     --volcano_csvs contrasts/ctrl_vs_treat.csv \
#     --outdir results_bacteria
#
# Assumptions:
#   - counts.tsv: first column gene_id matching 'query' in emapper file
#   - samples.tsv: columns 'sample', condition/time (numeric), optional batch
#
suppressPackageStartupMessages({
  library(optparse)
  library(DESeq2)
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(ComplexHeatmap)
  library(circlize)
  library(ggplot2)
  library(purrr)
})

opt_list <- list(
  make_option("--counts", type="character", help="Counts matrix TSV (genes x samples). First col = gene_id"),
  make_option("--samples", type="character", help="Sample metadata TSV with 'sample' column."),
  make_option("--gene_id_col", type="character", default="gene_id", help="Counts gene id column name."),
  make_option("--condition_col", type="character", default="condition", help="Condition column in samples."),
  make_option("--time_col", type="character", default="time", help="Numeric time column in samples."),
  make_option("--batch_col", type="character", default=NULL, help="Optional batch column."),
  make_option("--emapper", type="character", help="eggNOG emapper annotations file (tab-delimited)."),
  make_option("--volcano_csvs", type="character", default=NULL, help="Comma-separated volcano CSV/TSV files (must have a 'gene' column)."),
  make_option("--outdir", type="character", default="results_bacteria", help="Output directory.")
)

opt <- parse_args(OptionParser(option_list = opt_list))
dir.create(opt$outdir, showWarnings = FALSE, recursive = TRUE)

message("[1/7] Load data")
counts <- read_tsv(opt$counts, col_types = cols())
stopifnot(opt$gene_id_col %in% colnames(counts))
counts <- as.data.frame(counts)
rownames(counts) <- counts[[opt$gene_id_col]]
counts[[opt$gene_id_col]] <- NULL

samples <- read_tsv(opt$samples, col_types = cols()) %>%
  filter(sample %in% colnames(counts))
samples <- as.data.frame(samples)
rownames(samples) <- samples$sample
samples$sample <- NULL
samples[[opt$time_col]] <- as.numeric(samples[[opt$time_col]])

# --- Coerce counts to numeric and validate ---
# Remove any commas and coerce to numeric
counts[] <- lapply(counts, function(x) {
  if (is.character(x)) x <- gsub(",", "", x, fixed = TRUE)
  suppressWarnings(as.numeric(x))
})
# Report any NA introduced by coercion
na_cols <- vapply(counts, function(x) any(is.na(x)), logical(1))
if (any(na_cols)) {
  bad <- names(which(na_cols))
  message("WARNING: Non-numeric values detected in count columns; introduced NAs in: ", paste(bad, collapse=", "))
  # Replace NA with 0 (safe fallback) and continue
  counts[bad] <- lapply(counts[bad], function(x) { x[is.na(x)] <- 0; x })
}

# Ensure samples and counts columns align 1:1 and reorder counts accordingly
missing_in_samples <- setdiff(colnames(counts), rownames(samples))
missing_in_counts  <- setdiff(rownames(samples), colnames(counts))
if (length(missing_in_samples) > 0) {
  stop("These count columns have no matching row in samples.tsv: ", paste(missing_in_samples, collapse=", "))
}
if (length(missing_in_counts) > 0) {
  stop("These samples.tsv rows have no matching column in counts.tsv: ", paste(missing_in_counts, collapse=", "))
}
counts <- counts[, rownames(samples), drop=FALSE]
# Finally, round to integers as required by DESeq2
counts <- round(as.matrix(counts))

message("[2/7] DESeq2 model (time-course)")
design_terms <- c()
if (!is.null(opt$batch_col) && opt$batch_col %in% colnames(samples)) {
  design_terms <- c(design_terms, opt$batch_col)
}
design_terms <- c(design_terms, opt$condition_col, opt$time_col, paste0(opt$condition_col, ":", opt$time_col))
design_formula <- as.formula(paste("~", paste(design_terms, collapse=" + ")))

dds <- DESeqDataSetFromMatrix(countData = round(as.matrix(counts)),
                              colData = samples,
                              design = design_formula)
dds <- dds[rowSums(counts(dds)) > 1, ]
dds <- DESeq(dds, test="LRT",
             full = design_formula,
             reduced = as.formula(paste("~", paste(setdiff(design_terms, paste0(opt$condition_col, ":", opt$time_col)), collapse=" + "))))

vsd <- vst(dds, blind=FALSE)
vsd_mat <- assay(vsd)

message("[3/7] Parse emapper for GO/EC (oxidoreductases & stress genes)")
stopifnot(!is.null(opt$emapper))
emap <- read_tsv(opt$emapper, comment = "#", col_types = cols(.default = "c"))
# Expecting columns: query, GOs, EC, Description, Preferred_name, etc.
emap <- emap %>%
  transmute(gene = query,
            GOs = ifelse(is.na(GOs), "", GOs),
            EC = ifelse(is.na(EC), "", EC),
            Description = ifelse(is.na(Description), "", Description),
            Preferred_name = ifelse(is.na(Preferred_name), "", Preferred_name))
emap <- emap %>% distinct(gene, .keep_all = TRUE)

# Flags:
# 1) oxidoreductase: EC starts with "1." OR GO includes GO:0016491
is_ox_by_ec <- grepl("^1\\.", emap$EC)
is_ox_by_go <- grepl("\\bGO:0016491\\b", emap$GOs)
emap$is_oxidoreductase <- is_ox_by_ec | is_ox_by_go

# 2) stress-related: search for stress GO ids in GOs
stress_gos <- c("GO:0006950","GO:0033554","GO:0006979","GO:0006974","GO:0009408","GO:0009266") # response to stress; cellular response; oxidative stress; DNA damage; response to heat; response to starvation
re_pat <- paste(stress_gos, collapse="|")
emap$is_stress <- grepl(re_pat, emap$GOs)

write_tsv(emap, file.path(opt$outdir, "emapper_flags.tsv"))

message("[4/7] Per-gene time slopes within each condition")
cond_levels <- unique(samples[[opt$condition_col]])
slope_summaries <- list()

for (cond in cond_levels) {
  sel <- samples[[opt$condition_col]] == cond
  mat <- vsd_mat[, sel, drop=FALSE]
  tvec <- samples[[opt$time_col]][sel]

  slopes <- apply(mat, 1, function(y) {
    fit <- try(lm(y ~ tvec), silent = TRUE)
    if (inherits(fit, "try-error")) return(c(NA, NA))
    co <- summary(fit)$coefficients
    c(beta=unname(co["tvec","Estimate"]), p=unname(co["tvec","Pr(>|t|)"]))
  })
  slopes <- t(slopes)
  df <- as.data.frame(slopes)
  df$gene <- rownames(mat)
  df$condition <- cond
  slope_summaries[[cond]] <- df
}

# (keep whatever you have above this point unchanged)
slope_df <- bind_rows(slope_summaries) %>%
  mutate(padj = p.adjust(p, method="BH")) %>%
  relocate(gene, condition, beta, p, padj)

# ---- Robust join to emapper ----
# clean IDs: trim; strip version suffixes like ".1"
emap$gene <- trimws(emap$gene)
emap$gene_clean <- sub("\\.\\d+$", "", emap$gene)

slope_df$gene <- trimws(slope_df$gene)
slope_df$gene_clean <- sub("\\.\\d+$", "", slope_df$gene)

# join on cleaned key
slope_df <- slope_df %>%
  dplyr::left_join(
    emap %>% dplyr::select(gene_clean, GOs, EC, Description, Preferred_name,
                           is_oxidoreductase, is_stress),
    by = "gene_clean"
  )

# recompute flags from EC/GOs when missing
slope_df <- slope_df %>%
  mutate(
    is_oxidoreductase = ifelse(
      is.na(is_oxidoreductase),
      (!is.na(EC) & grepl("^1\\.", EC)) | (!is.na(GOs) & grepl("\\bGO:0016491\\b", GOs)),
      is_oxidoreductase
    ),
    is_stress = ifelse(
      is.na(is_stress),
      (!is.na(GOs) & grepl("GO:0006950|GO:0033554|GO:0006979|GO:0006974|GO:0009408|GO:0009266", GOs)),
      is_stress
    )
  ) %>%
  dplyr::select(-gene_clean)

# write full slopes
readr::write_tsv(slope_df, file.path(opt$outdir, "time_slopes_by_condition.tsv"))

# summaries
ox_summary <- slope_df %>%
  dplyr::filter(!is.na(is_oxidoreductase) & is_oxidoreductase) %>%
  dplyr::mutate(direction = dplyr::case_when(
    beta < 0 & padj < 0.05 ~ "decreasing",
    beta > 0 & padj < 0.05 ~ "increasing",
    TRUE ~ "ns"
  )) %>%
  dplyr::arrange(padj, beta)
readr::write_tsv(ox_summary, file.path(opt$outdir, "oxidoreductases_time_trends.tsv"))

stress_summary <- slope_df %>%
  dplyr::filter(!is.na(is_stress) & is_stress) %>%
  dplyr::mutate(direction = dplyr::case_when(
    beta < 0 & padj < 0.05 ~ "decreasing",
    beta > 0 & padj < 0.05 ~ "increasing",
    TRUE ~ "ns"
  )) %>%
  dplyr::arrange(padj, beta)
readr::write_tsv(stress_summary, file.path(opt$outdir, "stress_genes_time_trends.tsv"))

message("[5/7] Heatmaps from volcano gene lists (plus per-gene)")
make_heatmap <- function(glist, tag, kmeans_rows=NA) {
  sub <- vsd_mat[rownames(vsd_mat) %in% glist, , drop=FALSE]
  if (nrow(sub) == 0) {
    message("No overlap for ", tag)
    return(invisible(NULL))
  }
  z <- t(scale(t(sub)))
  ha_col <- HeatmapAnnotation(
    df = data.frame(
      condition = samples[[opt$condition_col]],
      time = samples[[opt$time_col]]
    )
  )
  png(file.path(opt$outdir, paste0("heatmap_", tag, ".png")), width=1400, height=1000, res=140)
  print(Heatmap(z, name="z", top_annotation = ha_col,
                clustering_distance_rows = "euclidean",
                clustering_method_rows = "ward.D2",
                show_row_names = FALSE, show_column_names = TRUE,
                row_km = kmeans_rows))
  dev.off()
}

make_single_gene_heatmaps <- function(glist, tag) {
  for (g in glist) {
    if (!(g %in% rownames(vsd_mat))) next
    z <- t(scale(t(vsd_mat[g,,drop=FALSE])))
    png(file.path(opt$outdir, paste0("heatmap_", tag, "_", g, ".png")), width=1200, height=400, res=150)
    print(Heatmap(z, name="z", cluster_rows=FALSE, cluster_columns=FALSE,
                  show_row_names=TRUE, show_column_names=TRUE))
    dev.off()
  }
}

if (!is.null(opt$volcano_csvs) && nzchar(opt$volcano_csvs)) {
  files <- str_split(opt$volcano_csvs, ",")[[1]] %>% trimws()
  for (f in files) {
    df <- tryCatch({
      if (grepl("\\.tsv$", f, ignore.case = TRUE)) read_tsv(f, col_types=cols())
      else read_csv(f, col_types=cols())
    }, error=function(e) NULL)
    if (is.null(df) || !("gene" %in% names(df))) next
    genes <- df$gene %>% unique()
    tag <- tools::file_path_sans_ext(basename(f))
    make_heatmap(genes, tag, kmeans_rows = 4)
    make_single_gene_heatmaps(genes, paste0(tag, "_single"))
  }
}

message("[6/7] PCA")
pca <- plotPCA(vsd, intgroup=c(opt$condition_col, opt$time_col), returnData = TRUE)
percentVar <- round(100 * attr(pca, "percentVar"))

# change 'deltasbp_*' to 'Δ_*' for legend labels
pca[[opt$condition_col]] <- gsub("^deltasbp", "Δsbp", pca[[opt$condition_col]])

p <- ggplot(pca, aes(PC1, PC2,
                     color = .data[[opt$condition_col]],
                     shape = factor(.data[[opt$time_col]]))) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  labs(color = "Condition", shape = "Factor(time_h)") +  # legend titles
  theme_bw()
ggsave(file.path(opt$outdir, "PCA_condition_time.png"), p, width=10, height=7, dpi=150)

message("[7/7] Summary")
summary_txt <- file.path(opt$outdir, "SUMMARY.txt")
sink(summary_txt)
cat("Bacterial time-course summary\n")
cat("Date:", as.character(Sys.time()), "\n\n")
cat("Conditions:", paste(unique(samples[[opt$condition_col]]), collapse=", "), "\n\n")
cat("Top oxidoreductases decreasing over time:\n")
print(ox_summary %>% filter(direction=="decreasing") %>% head(20))
cat("\nTop stress genes decreasing over time:\n")
print(stress_summary %>% filter(direction=="decreasing") %>% head(20))
sink()
message("Done -> ", opt$outdir)

This code-rich summary provides a replicable basis for advanced bacterial RNA-seq time-course analysis and reporting. All main code steps and script logic are retained for transparency and practical reuse.

Tenergy / Dignics 与 Butterfly Timo Boll ALC 终极实用手册

Leave a reply

版本：2025-10-27（Europe/Berlin）

TODO: 第一选择：想要抓球+弧线：正手 Dignics 09C 黑 2.1(起下旋最稳)，反手 Dignics 64 红 1.9；第二选择：想要直接、快出：正手 Tenergy 05 红 2.1，反手 Tenergy 80 黑 1.9。

面向：使用/考虑使用 Butterfly Timo Boll ALC（TB ALC） 的选手，搭配 Dignics / Tenergy 胶皮（含 09C、64、64 FX、05、80）。目标：快速定型正反手配置，理解每款胶的手感、弧线、速度与台内表现，并给出厚度/配色建议。

快速结论（给着急的人）
TB ALC 底板一览
胶皮速览卡（D09C / D64 / T05 / T80 / T64 / T64 FX）
关键差异对照表
正反手搭配方案（按打法）
厚度与配色（红/黑）选择
与 TB ALC 的化学反应：上手感受与注意事项
台内/发接发与相持要点
维护与更换周期
常见问答（FAQ）
术语小词典

1) 快速结论（给着急的人）

稳健FH弧圈：Dignics 09C（黑）2.1 或 Tenergy 05（红）2.1
凌厉BH快带/反拉：Dignics 64（1.9） 或 Tenergy 64（1.9）
一张通吃/省心：Tenergy 80（FH 2.1 / BH 1.9）
BH 要容错/易起球：Tenergy 64 FX（1.9 或 2.1）
为什么少推 T64 做正手：低弧直线，台内活、起下旋容错较低；正手通常更需要抓球+弧线（09C/05/80 更贴合）。
红/黑颜色：规则只要求“一面黑一面非黑”。普遍经验：黑皮更黏/质感更实（适合FH抓摩），红皮更通透爽快（常见BH或快出风格）。

2) TB ALC 底板一览

类型：OFF / OFF-（进攻），ALC 纤维（Arylate-Carbon）
层数（经典 ALC 叠层）：Koto – ALC – Limba – Kiri（芯）– Limba – ALC – Koto
常见参数：厚 ~5.7–5.9 mm；重 ~86–90 g；板面 ~157×150 mm
打感：甜区大、低震动、击球干净略“闷”，速度快但可控；弧圈中等抛物线，挡/对冲稳、响应线性。

与相近底板

Viscaria：整体更柔一点，抛物线略高；TB ALC 更直接、低一丝抛。
TB ZLC：更快更脆，持球/容错降低。
TB ZLF：更软更吃球，但顶速低。

3) 胶皮速览卡

Dignics 09C（D09C）

特性：微黏顶皮，抓球最强、吃球最深；中高弧线，起下旋最稳；前台不弹，中远台后劲足。
定位：FH 现代弧圈核心；在 TB ALC 上中和“干脆”，提升弧线与控制。

Dignics 64（D64）

特性：直线、反弹快，借力好；中低弧线，出速高；台内略活。
定位：BH 强势快带/对冲；在 TB ALC 上 BH 非常顺手。

Tenergy 05（T05）

特性：高摩擦颗粒设定，抓球强、弧线中高，台内更稳，响应线性。
定位：FH 万金油弧圈/拉冲；在 TB ALC 上成熟耐用的经典搭配。

Tenergy 80（T80）

特性：位于 05 与 64 之间的平衡点；弧线中等，速度/控制均衡。
定位：双面皆可，适合想“一张打全场”的简化方案。

Tenergy 64（T64）

特性：出球直、弧线中低、顶速与穿透强；台内更易“活”。
定位：BH 快带/反拉导向；FH 少量玩家偏好直线穿透可选。

Tenergy 64 FX（T64 FX）

特性：更软海绵版本；易起球、容错高，低中功率更轻松；台内更活、顶速略降。
定位：BH 容错/易操控路线；入门到中级/小力量友好。

4) 关键差异对照表

型号	抓球/持球	弧线高度	出速/穿透	台内控制	起下旋容错	最佳位面
D09C	最大	中-高	中后程强	稳	最优	FH
T05	高	中-高	中高	稳	很好	FH / BH 控制
T80	中高	中	中高	稳定	好	FH / BH
D64	中	中-低	最高	中等（活）	中等	BH
T64	中	低-中	最高	中等（活）	较低	BH / 少数 FH
T64 FX	中高（软）	中	高（低中功率更易）	偏活	较高	BH 初中级

注：表中“台内活”=反弹系数高、对小动作更敏感；需要更细腻的手上控制。

5) 正反手搭配方案（按打法）

A. FH 现代弧圈 + BH 快带/反拉（主流）

FH：D09C 2.1（黑） / T05 2.1（红）!!!!
BH：D64 1.9（红）/ T64 1.9 / T80 1.9（黑）!!!!

B. 近台借力对冲 + 二速快上手

FH：T05 2.1 / T80 2.1
BH：T64 1.9 / T64 FX 1.9（要容错）

C. 中远台大力爆冲

FH：D09C 2.1
BH：D64 2.1 或 T80 2.1（看稳定需求）

D. 一套省心通吃

FH：T80 2.1
BH：T80 1.9

E. 坚持 FH 直线穿透风

FH：T64 1.9（控台内）
BH：T80 1.9 / T64 FX 1.9（容错）

常见选择思路

粘性/中国套（如狂飙、09C）做正手 → 多数人选黑色; 黑皮通常略更黏、更实、更“顶”，适合发力刷摩、前冲。
日德套（如 Tenergy/Dignics/ESN）做正手 → 很多人选红色; 红皮一般手感更通透一点、出球更爽快，弧线略高、速度更轻快。

结合你这块 TB ALC：

想要抓球+弧线：正手 Dignics 09C 黑 2.1，反手 Dignics 64 红 1.9。
想要直接、快出：正手 Tenergy 05 红 2.1，反手 Tenergy 80 黑 1.9。
想要更稳控：都用 1.9 厚度即可。

6) 厚度与配色（红/黑）选择

厚度（1.9 vs 2.1）

1.9 mm：更稳，台内控制好、起板成功率高，适合反手或控球为先。
2.1 mm：最大威力与后程顶速，适合正手或追求爆冲者。

配色（规则与习惯）

规则：一面黑，一面非黑（通常红），没有“正手必须红/黑”的硬性规定。
经验：黑皮往往更黏、更扎实（FH 抓摩）；红皮更通透快出（BH/快攻）。
推荐：FH 黑 / BH 红（若选 09C/T05 等抓球型）；若使用 64/64 FX 做 BH，红/黑皆可。

7) 与 TB ALC 的化学反应：上手感受与注意事项

TB ALC 出球干净、回弹快，与 D09C/T05 组合能获得更稳的弧线与起下旋容错；
与 D64/T64 叠加会更“利落”，BH 爽快但台内更活；
若 FH 也选 64：建议 1.9 厚，台内要格外细腻；或更柔的底板来“降躁”。

8) 台内/发接发与相持要点

台内：D09C/T05/T80 更易控短与摆短；64/64 FX 需降低击球力度与板形开合，加大摩擦比重。
起下旋：D09C 最稳、T05 次之；64/64 FX 要注意摩擦角度与击球深度。
对拉/反拉：64/D64 出速高、直线穿透强；T05/T80 更有弧线安全窗。
借力：64 系列占优；D09C 需要主动发力，更吃你质量。

9) 维护与更换周期

清洁：每次打完用微湿海绵/专用清洁剂轻拭，贴保护膜；微黏（09C）表面避免硬擦。
更换：高频训练（>3 次/周）约 2–3 月更换一侧；普通强度 3–6 月。
粘贴：使用水溶性无机胶；避免非法改装（如违规增黏/增弹）。

10) 常见问答（FAQ）

Q1：为什么很多人不推荐 T64 做正手？
A：直线低弧、台内活、起下旋容错低，与 TB ALC 叠加更“躁”。大多数 FH 更需要抓球与弧线（09C/05/80）。

Q2：D64 vs T64？
A：D64 抓球与容错略好，仍保持直线与高出速；T64 更“经典 64 味”，更凌厉但更挑台内控制。

Q3：T64 vs T64 FX？
A：FX 更软，低中功率更容易、容错高；顶速与台内稳定性不及 T64。BH 入门到中级偏向 FX。

Q4：T80 能否双面？
A：可以。它在 05 与 64 之间，速度/弧线/控制均衡，是省心选择。

Q5：红黑是否影响性能？
A：配方/批次差异外，普遍经验是黑略黏、红更通透；按手感与需求定，无硬性规则。

11) 术语小词典

抓球/持球：球在胶皮上的“停留与摩擦”感。越强越有利于起下旋与弧圈稳定。
台内活：小力量下的反弹灵敏度高，容易弹起，控制难度增加。
弧线高度：出球抛物线的“拱度”，高抛提供更大安全窗。
直线穿透：出球平直、速度快，吃台后“冲”的感觉。
容错：击球角度/力量略有偏差时，仍能上台/有效的宽容度。

小结

FH 选抓球与弧线（D09C/T05），BH 选直线出速与借力（D64/T64/T80）。
T80 是“两边都不极端”的万能解；T64 FX 给 BH 轻松与容错。
TB ALC 性格“干净+线性”，合理胶皮搭配即可兼顾台内与相持。

ONT Methylation Analysis — Comprehensive Summary

Leave a reply

Scope: What methylation is (5‑mC, 6‑mA, 4‑mC), how Oxford Nanopore (ONT) detects it, how it differs from bisulfite sequencing, required coverage, file types (modBAM/CRAM with MM/ML tags), basecalling models (Dorado), practical workflows, pipelines (nf-core/methylong), deliverables to request from providers (e.g., Novogene), and specific advice for bacterial projects.

1) What are 5‑mC, 6‑mA, 4‑mC?

5‑mC (5‑methylcytosine): methyl group on cytosine C5 carbon. In eukaryotes strongly linked to gene regulation (CpG), chromatin state, imprinting. Also present in some bacteria (e.g., Dcm at CCWGG).
6‑mA (N6‑methyladenine): methyl on adenine N6. Very common in bacteria/archaea (e.g., Dam at GATC), functions in restriction–modification (R–M), mismatch repair, replication control, and gene regulation.
4‑mC (N4‑methylcytosine): methyl on cytosine N4, mostly in bacteria/archaea (R–M and regulation).

Coverage guidance (ONT direct detection):

≥ 10× for 5‑mC calling/quantification.
≥ 50× for 6‑mA and 4‑mC (signals are weaker; models need depth).

2) How ONT detects methylation (no chemical conversion)

ONT does not convert bases (unlike bisulfite sequencing which converts un‑methylated C → U → read as T). ONT reads remain A/C/G/T.
ONT measures ionic current while DNA k‑mers pass the pore. Modified bases (5‑mC/6‑mA/4‑mC) slightly shift current distributions.
A modified‑base basecaller (now Dorado; historically Guppy+Remora) decodes those shifts and writes methylation annotations into aligned BAM/CRAM as MM/ML tags:
- MM: modified motif and per‑read positions.
- ML: per‑site modification probabilities/scores.
Downstream tools (e.g., modkit, methylartist, nf‑core/methylong) summarize per‑site/per‑region methylation and export BED/bedGraph/bigWig for visualization/statistics.

Key contrast with bisulfite (BS‑seq):

BS‑seq chemically converts un‑methylated C to U (sequenced as T) → uses base changes to infer methylation.
ONT uses signal differences; no base letters change. Methylation is metadata in BAM tags, not edits in the sequence.

3) Data types & what you need (modBAM vs “assembly” reads)

Previous ONT reads used for genome assembly are typically standard basecalls (A/C/G/T only) and lack MM/ML tags, so not suitable for methylation quantification.
For methylation analysis you need either:
1. Provider delivers aligned modified‑base BAM/CRAM (modBAM/CRAM) with MM/ML tags and indices (.bai/.crai).
2. Or you re‑basecall FAST5/FASTQ with a modified‑base Dorado model and then align to your reference (producing modBAM).

Reference genome requirement:

For aligned BAM, you (or the provider) must map to a reference FASTA. Keep the exact FASTA (and .fai) used for reproducibility and downstream summarization.

4) Practical workflow (bacteria)

A. Planning & sequencing

Decide targets: in bacteria prioritize 6‑mA/4‑mC; optionally 5‑mC (if Dam/Dcm enzymes present).
Coverage targets: ≥50× (6‑mA/4‑mC), ≥10× (5‑mC).
Ask provider to run Dorado (modified‑base model) and deliver aligned modBAM/CRAM with MM/ML tags.

B. Inputs/outputs to request from provider (e.g., Novogene)

Deliverables:
- modBAM/CRAM (aligned to our provided reference), with MM/ML tags + .bai/.crai.
- Optional per‑site tracks: BED/bedGraph/bigWig and a QC report.
Reference:
- Can we provide bacterial reference FASTA? Will they return the exact FASTA (.fai) used?
Models & modifications:
- Which Dorado model version and which mods (5‑mC, 6‑mA, 4‑mC) are called by default?
Unaligned data:
- If delivering unmapped uBAM/FASTQ, request that modified‑base calls (tags) are still included, or obtain raw signal/FAST5 if re‑calling in‑house.

C. In‑house analysis (outline)

Align mod‑called reads to reference (if not already) → modBAM.
Run modkit to summarize per‑site methylation frequencies and export bedGraph/bigWig.
Use methylartist for regional plots, motif‑centric views, metaplots over features (promoters, operons, RND genes, etc.).
Integrate with other omics (RNA‑seq) by averaging methylation in promoter/operon windows and correlating with expression changes.

5) nf‑core/methylong (pipeline overview)

Community Nextflow pipeline for ONT methylation. Typical features:
- Supports Dorado modified‑base calling (or consumes modBAM/CRAM).
- Performs alignment (e.g., minimap2) to your reference, keeps MM/ML tags.
- Generates per‑site/ per‑region summaries, tracks (bedGraph/bigWig), and QC.
Inputs: reads (FASTQ/FAST5) or modBAM + reference FASTA; sample sheet with metadata.
Outputs: modBAM/CRAM + indices, per‑site methylation tables, genome tracks, multiQC‑style reports.

(Exact CLI flags vary by version; coordinate with the provider or your compute environment.)

6) QC & caveats

Depth matters: 6‑mA/4‑mC need higher coverage than 5‑mC.
Model choice: Use the correct Dorado modified‑base model for your chemistry/flow cell and target modifications.
Reference fidelity: Use the same reference throughout (and document version).
BAM integrity: Verify MM/ML tags exist; confirm alignment header matches the provided FASTA.
Context effects: Methylation calling is k‑mer context‑dependent; some motifs are easier/harder.
Biological interpretation: In bacteria, methylation is often tied to R–M systems, replication, and gene regulation; interpret rates in motif/operon context, not only at single CpG‑style sites.

7) What to ask a provider (email checklist)

Will you deliver aligned modBAM/CRAM with MM/ML tags (+ index)?
Which modified bases are called (5‑mC, 6‑mA, 4‑mC)? Which Dorado model/version?
Do you require us to provide a bacterial reference FASTA for alignment? Will you return the exact reference used?
Can you also provide per‑site methylation tracks (bedGraph/bigWig) and a QC report?
What coverage will be achieved per sample (target ≥10× for 5‑mC; ≥50× for 6‑mA/4‑mC)?

8) Suggested minimal deliverables

modBAM/CRAM aligned to our provided reference (+ .bai/.crai).
Reference FASTA and .fai used in alignment/calling.
Per‑site tables (tsv) and tracks (bedGraph/bigWig).
Brief QC (coverage, fraction modified by motif, per‑site confidence).

9) Bacterial project recommendation (one‑liner)

For bacteria, profile 6‑mA (and 4‑mC) as primary targets (≥50×), optionally 5‑mC (≥10× if Dcm‑like activity expected), using Dorado modified‑base calling and aligned modBAM/CRAM with MM/ML tags; summarize with modkit/methylartist and integrate with RNA‑seq.

10) Handy pointers & checks (quick ref)

Check BAM has mods: samtools view -h mod.bam | head → look for MM:Z: and ML:B:C tags.
Confirm reference: samtools view -H mod.bam | grep '^@SQ' and keep the FASTA.
Summarize (example modkit): modkit pileup mod.bam ref.fa --bedgraph out.bg --min-mapq 20
Visualize: Load bigWig/bedGraph in IGV/JBrowse; overlay RNA‑seq coverage/DE results.

Prepared from the morning discussion to serve as a self‑contained guide and hand‑off document.

RNA-seq Sample Interpretation & Design (Updated Mapping) for RNAseq_2025_LB_vs_Mac_ATCC19606

Leave a reply

This document explains the background and rationale of your RNA-seq dataset based on sample names, updates the genotype mapping (AB = ΔadeAB; IJ = ΔadeIJ; WT19606 = WT; W1/Y1 = clinical WT isolates), and provides ready-to-run analysis contrasts aligned with your research proposal.

1) Naming logic (decoded)

Medium prefix
- LB- → growth in LB (non-selective baseline; total transcriptome without membrane/efflux stress).
- Mac- → growth in MacConkey (bile salts + crystal violet; membrane/outer-membrane/efflux stress condition).
Strain/genotype tag
- WT19606 → A. baumannii ATCC 19606, wild-type (reference WT).
- IJ → ΔadeIJ (AdeIJK RND efflux knockout).
- AB → ΔadeAB (AdeAB subfamily knockout).
- W1, Y1 → two clinical wild-type isolates (distinct lineages).
Replicates
- r1 / r2 / r3 / r4 → biological replicates.

Note on proposal text: The provided proposal did not explicitly name W1/Y1; it referred to “clinical isolates” generally. If you analyze W1/Y1, add a Methods line defining them (e.g., source, genotype as WT).

2) Updated mapping table

Sample prefix	Medium	Strain	Genotype	Purpose / Rationale
LB-AB-*	LB	AB	ΔadeAB	Baseline transcriptome of AdeAB knockout (no stress)
Mac-AB-*	MacConkey	AB	ΔadeAB	Stress response without AdeAB (efflux-dependent programs)
LB-IJ-*	LB	IJ	ΔadeIJ	Baseline of AdeIJK knockout
Mac-IJ-*	MacConkey	IJ	ΔadeIJ	Stress response without AdeIJK
LB-WT19606-*	LB	ATCC 19606	WT	Reference WT baseline (anchor strain)
Mac-WT19606-*	MacConkey	ATCC 19606	WT	Reference WT under stress (gold-standard contrast vs ΔadeAB/ΔadeIJ)
LB-W1-*	LB	W1	WT (clinical)	Baseline of clinical WT lineage W1
Mac-W1-*	MacConkey	W1	WT (clinical)	W1 under stress
LB-Y1-*	LB	Y1	WT (clinical)	Baseline of clinical WT lineage Y1
Mac-Y1-*	MacConkey	Y1	WT (clinical)	Y1 under stress

Why these conditions matter (proposal-aligned):

LB captures baseline networks.
Mac induces the membrane/efflux stress program that revealed R vs S behavior in your proposal and is tightly linked to RND function.
Contrasting WT vs efflux knockouts (ΔadeAB/ΔadeIJ) under LB/Mac tests both genotype main effects and genotype × stress interactions.
Multiple WT lineages (WT19606/W1/Y1) allow testing conservation vs isolate-specificity of stress responses (avoid overfitting to one WT).

3) Suggested metadata file (`samples.csv`)

sample,fastq_1,fastq_2,medium,strain,genotype,replicate,notes,strandedness
LB-AB-r1,LB-AB-r1_R1.fq.gz,LB-AB-r1_R2.fq.gz,LB,AB,ΔadeAB,r1,AdeAB knockout baseline,auto
LB-AB-r2,LB-AB-r2_R1.fq.gz,LB-AB-r2_R2.fq.gz,LB,AB,ΔadeAB,r2,AdeAB knockout baseline,auto
LB-AB-r3,LB-AB-r3_R1.fq.gz,LB-AB-r3_R2.fq.gz,LB,AB,ΔadeAB,r3,AdeAB knockout baseline,auto
LB-IJ-r1,LB-IJ-r1_R1.fq.gz,LB-IJ-r1_R2.fq.gz,LB,IJ,ΔadeIJ,r1,AdeIJK knockout baseline,auto
LB-IJ-r2,LB-IJ-r2_R1.fq.gz,LB-IJ-r2_R2.fq.gz,LB,IJ,ΔadeIJ,r2,AdeIJK knockout baseline,auto
LB-IJ-r4,LB-IJ-r4_R1.fq.gz,LB-IJ-r4_R2.fq.gz,LB,IJ,ΔadeIJ,r4,AdeIJK knockout baseline,auto
LB-W1-r1,LB-W1-r1_R1.fq.gz,LB-W1-r1_R2.fq.gz,LB,W1,WT,r1,clinical WT W1 baseline,auto
LB-W1-r2,LB-W1-r2_R1.fq.gz,LB-W1-r2_R2.fq.gz,LB,W1,WT,r2,clinical WT W1 baseline,auto
LB-W1-r3,LB-W1-r3_R1.fq.gz,LB-W1-r3_R2.fq.gz,LB,W1,WT,r3,clinical WT W1 baseline,auto
LB-WT19606-r2,LB-WT19606-r2_R1.fq.gz,LB-WT19606-r2_R2.fq.gz,LB,WT19606,WT,r2,reference WT baseline,auto
LB-WT19606-r3,LB-WT19606-r3_R1.fq.gz,LB-WT19606-r3_R2.fq.gz,LB,WT19606,WT,r3,reference WT baseline,auto
LB-WT19606-r4,LB-WT19606-r4_R1.fq.gz,LB-WT19606-r4_R2.fq.gz,LB,WT19606,WT,r4,reference WT baseline,auto
LB-Y1-r2,LB-Y1-r2_R1.fq.gz,LB-Y1-r2_R2.fq.gz,LB,Y1,WT,r2,clinical WT Y1 baseline,auto
LB-Y1-r3,LB-Y1-r3_R1.fq.gz,LB-Y1-r3_R2.fq.gz,LB,Y1,WT,r3,clinical WT Y1 baseline,auto
LB-Y1-r4,LB-Y1-r4_R1.fq.gz,LB-Y1-r4_R2.fq.gz,LB,Y1,WT,r4,clinical WT Y1 baseline,auto
Mac-AB-r1,Mac-AB-r1_R1.fq.gz,Mac-AB-r1_R2.fq.gz,MacConkey,AB,ΔadeAB,r1,AdeAB knockout under stress,auto
Mac-AB-r2,Mac-AB-r2_R1.fq.gz,Mac-AB-r2_R2.fq.gz,MacConkey,AB,ΔadeAB,r2,AdeAB knockout under stress,auto
Mac-AB-r3,Mac-AB-r3_R1.fq.gz,Mac-AB-r3_R2.fq.gz,MacConkey,AB,ΔadeAB,r3,AdeAB knockout under stress,auto
Mac-IJ-r1,Mac-IJ-r1_R1.fq.gz,Mac-IJ-r1_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r1,AdeIJK knockout under stress,auto
Mac-IJ-r2,Mac-IJ-r2_R1.fq.gz,Mac-IJ-r2_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r2,AdeIJK knockout under stress,auto
Mac-IJ-r4,Mac-IJ-r4_R1.fq.gz,Mac-IJ-r4_R2.fq.gz,MacConkey,IJ,ΔadeIJ,r4,AdeIJK knockout under stress,auto
Mac-W1-r1,Mac-W1-r1_R1.fq.gz,Mac-W1-r1_R2.fq.gz,MacConkey,W1,WT,r1,clinical WT W1 under stress,auto
Mac-W1-r2,Mac-W1-r2_R1.fq.gz,Mac-W1-r2_R2.fq.gz,MacConkey,W1,WT,r2,clinical WT W1 under stress,auto
Mac-W1-r3,Mac-W1-r3_R1.fq.gz,Mac-W1-r3_R2.fq.gz,MacConkey,W1,WT,r3,clinical WT W1 under stress,auto
Mac-WT19606-r2,Mac-WT19606-r2_R1.fq.gz,Mac-WT19606-r2_R2.fq.gz,MacConkey,WT19606,WT,r2,reference WT under stress,auto
Mac-WT19606-r3,Mac-WT19606-r3_R1.fq.gz,Mac-WT19606-r3_R2.fq.gz,MacConkey,WT19606,WT,r3,reference WT under stress,auto
Mac-WT19606-r4,Mac-WT19606-r4_R1.fq.gz,Mac-WT19606-r4_R2.fq.gz,MacConkey,WT19606,WT,r4,reference WT under stress,auto
Mac-Y1-r2,Mac-Y1-r2_R1.fq.gz,Mac-Y1-r2_R2.fq.gz,MacConkey,Y1,WT,r2,clinical WT Y1 under stress,auto
Mac-Y1-r3,Mac-Y1-r3_R1.fq.gz,Mac-Y1-r3_R2.fq.gz,MacConkey,Y1,WT,r3,clinical WT Y1 under stress,auto
Mac-Y1-r4,Mac-Y1-r4_R1.fq.gz,Mac-Y1-r4_R2.fq.gz,MacConkey,Y1,WT,r4,clinical WT Y1 under stress,auto

4) DE design & contrasts (DESeq2/edgeR)

Model

Reference-genotype focus (WT19606 vs knockouts):
~ medium * genotype
where medium ∈ {LB, MacConkey}, genotype ∈ {WT, ΔadeAB, ΔadeIJ}.
(Set WT19606_LB as reference level.)
All strains (WT lineages):
~ medium * strain
where strain ∈ {WT19606, W1, Y1, ΔadeAB, ΔadeIJ}.

Hypothesis-driven contrasts

Stress response within each background:
Mac vs LB for WT19606, ΔadeAB, ΔadeIJ, W1, Y1.
→ Membrane/efflux-stress regulons; check adeA/B/C, adeI/J/K, envelope stress, OM biogenesis, osmotic/ions, ribosome PQC.
Genotype main effect at baseline (LB):
ΔadeAB vs WT19606 (LB), ΔadeIJ vs WT19606 (LB).
→ Efflux-independent differences; compensatory pathways.
Genotype effect under stress (Mac):
ΔadeAB vs WT19606 (Mac), ΔadeIJ vs WT19606 (Mac).
→ How loss of AdeAB/AdeIJK alters the stress transcriptome.
Interaction (genotype × medium):
(ΔadeAB_Mac − ΔadeAB_LB) − (WT19606_Mac − WT19606_LB); same for ΔadeIJ.
→ Core test: does RND loss change the stress response itself?
Conservation across WT lineages:
Intersect Mac vs LB DEGs across WT19606/W1/Y1 to define a conserved “Mac stress signature”; then identify isolate-specific modules.

QC

Verify strandedness with RSeQC (you set auto), mapping rate, rRNA %, insert size.
PCA: expect Mac vs LB separation; ΔadeIJ should diverge strongly under Mac.
Validate a small panel via RT-qPCR (ade genes + OM stress markers).

5) Where this plugs into the proposal

Mac vs LB embodies the proposal’s “LB = who is alive” vs “Mac = who can cope” paradigm (membrane/efflux stress).
ΔadeAB / ΔadeIJ model the role of RND efflux in stress adaptation and heterogeneity.
Multiple WT lineages prevent overfitting to ATCC 19606 and test generalizability.
Downstream integration with PAP/MDK/R% metrics links transcriptome to phenotype (R vs S).

If you want, I can also generate a starter DESeq2 R script that reads this samples.csv, sets factors/contrasts, and outputs PCA, volcano plots, and KEGG/GO enrichment stubs.

一、背景假说与验证目标

二、数据准备与时间对齐

三、HMM结果信号选取

四、三层相关性分析流程

A. 全局时间序列相关性

B. 事件级判定“相关/不相关”

C. 位置（x坐标）相关性（若有位置信息）

五、统计模型拓展（可选）

六、生物学意义解读

1. 若观察到“强相关”：

2. 若观察到“不相关”：

七、change_time_s事件判定操作清单

八、注意事项与易错点

九、建议的最终交付物

十、论文引用要点（适用于报告撰写）

附：可运行脚本支持

1. Prepare samples.tsv from the samplesheet

2. Reformat counts.tsv from STAR/Salmon

3. Run the R time-course analysis

4. Summarize and convert results

5. Key summary and reporting

6. Full main R script: rna_timecourse_bacteria.R

目录

1) 快速结论（给着急的人）

2) TB ALC 底板一览

3) 胶皮速览卡

Dignics 09C（D09C）

Dignics 64（D64）

Tenergy 05（T05）

Tenergy 80（T80）

Tenergy 64（T64）

Tenergy 64 FX（T64 FX）

4) 关键差异对照表

5) 正反手搭配方案（按打法）

A. FH 现代弧圈 + BH 快带/反拉（主流）

B. 近台借力对冲 + 二速快上手

C. 中远台大力爆冲

D. 一套省心通吃

E. 坚持 FH 直线穿透风

6) 厚度与配色（红/黑）选择

厚度（1.9 vs 2.1）

配色（规则与习惯）

7) 与 TB ALC 的化学反应：上手感受与注意事项

8) 台内/发接发与相持要点

9) 维护与更换周期

10) 常见问答（FAQ）

11) 术语小词典

小结

1) What are 5‑mC, 6‑mA, 4‑mC?

2) How ONT detects methylation (no chemical conversion)

3) Data types & what you need (modBAM vs “assembly” reads)

4) Practical workflow (bacteria)

5) nf‑core/methylong (pipeline overview)

6) QC & caveats

7) What to ask a provider (email checklist)

8) Suggested minimal deliverables

9) Bacterial project recommendation (one‑liner)

10) Handy pointers & checks (quick ref)

1) Naming logic (decoded)

2) Updated mapping table

3) Suggested metadata file (samples.csv)

4) DE design & contrasts (DESeq2/edgeR)

Model

Hypothesis-driven contrasts

QC

5) Where this plugs into the proposal

1. Prepare `samples.tsv` from the samplesheet

2. Reformat `counts.tsv` from STAR/Salmon

6. Full main R script: `rna_timecourse_bacteria.R`

3) Suggested metadata file (`samples.csv`)