Author Archives: gene_x

结合 HMM 光漂白分级的一种 DNA-蛋白组装定量分析方法

Quantitative Analysis of LT Protein Assembly on DNA Using HMM-Guided Photobleaching Step Detection (结合 HMM 光漂白分级的一种 DNA-蛋白组装定量分析方法)

TODO: 改为 12 rather than 10 states, since 12 is dodecamer!!!!!!

stoichiometry of mN-LT assemblies on Ori98 DNA 实际上就是在问：

每次 binding 事件上，有多少个 LT 分子同时在 DNA 上？

分布是 3-mer, 4-mer, …, 12-mer 各占多少比例？这就是 “binding stoichiometry”。

可以简单记成一句话：

Stoichiometry = “参与的分子各有多少个？” 在你的项目里 → “每次在 DNA 上到底有几个 LT？”

mN-LT 在 Ori98 DNA 上组装成十二聚体（dodecamer）

mN-LT Assembles as a Dodecamer on Ori98 DNA. 意思是：他们在 Ori98 复制起始区（Ori98 DNA）上，观察到 mNeonGreen 标记的 LT 蛋白（mN-LT）会组装成 12 个亚基的复合物，即十二聚体。这个十二聚体很可能是两个六聚体（double hexamer）组成的。
HMM 在这篇文章里是怎么用的？

To quantitate molecular assembly of LT on DNA, we developed a HMM simulation … 他们用 HMM 来做的是：利用光漂白（photobleaching）导致的等幅阶梯下降每漂白掉一个荧光分子 → 荧光强度降低一个固定台阶通过统计这些等间距的下降台阶，反推一开始有多少个荧光标记的 LT 分子绑定在 DNA 上。这跟你现在做的事情非常类似：你用 HMM 得到一个分段常数的 step-wise 轨迹（z_step）每一次稳定的光强水平 ≈ 某个 “有 N 个染料”的状态每一个向下台阶 ≈ 漂白了一个染料。 For technical reasons, the HMM could not reliably distinguish between monomer and dimer binding events. Therefore, these values were not included in the quantitative analysis. 这句话很关键：他们的 HMM 区分不可靠： 1 个分子（单体） 2 个分子（二聚体）所以所有 **1-mer 2（续）。为什么不统计 monomer / dimer？ For technical reasons, the HMM could not reliably distinguish between monomer and dimer binding events. Therefore, these values were not included in the quantitative analysis. 意思是：在他们的 HMM + 光漂白分析里， 1 个分子（monomer）和 2 个分子（dimer）之间的光强区别太小 / 太噪，很难可靠地区分。所以在最后的统计（Fig. 4C）里，他们只看 ≥3 个分子的组装情况。 1-mer 和 2-mer 直接不算在分布里。这跟你现在的情况很像：对于较小的 state jump / 小台阶，你也是用阈值把它们当成“噪声或者不可靠”处理。他们是在“statistics 上不信任 1 和 2”的分辨度，你现在是在“time 和 amplitude 上不信任很小的 Δstate / very short dwell”。
Fig. 4C：3–14 mer 的分布，3-mer 和 12-mer 是高峰

LT molecular assembly on Ori98 for 308 protein binding events, obtained from 30 captured DNAs, ranged from 3 to 14 mN-LT molecules, with notable maxima at 3-mer (32%) and 12-mer (22%) LT complexes (blue bars, Fig. 4C). 这句话说的是：他们总共统计了 308 个 binding events，来自 30 条 DNA。每个事件，对应一个“有多少个 mN-LT 同时在 DNA 上”的状态。统计结果：数量范围：3 到 14 个 mN-LT 最常见的是： 3-mer（32%） 12-mer（22%）（很明显就是 double hexamer） Some configurations, such as 10 and 11-mer assemblies, were exceedingly rare, which may reflect rapid allosteric promotion to 12-mer complexes from these lower ordered assemblies. 10-mer、11-mer 很罕见，可能的解释是：一旦接近 12，就很快“冲”到 12，不太停留在 10 或 11 状态。所以在 HMM + 漂白统计里，这些 intermediate 很少被看到。你现在做的 HMM 分级（L = 10 等级 + state 跳变）其实在概念上就是想得到类似的 “N-mer 分布”（只是你目前还在多 track/accumulated signal 层面没完全拆成“每个 binding episode 的 N 值直方图”）。
12-mer = double hexamer

The dodecameric assembly most likely represents two separate hexamers (a double hexamer), and the term double hexamer is used below, although we could not directly determine this assembly by C-Trap due to optical resolution limits. Other 12-mer assemblies remain formally possible. 意思是：他们认为 12-mer 很可能就是两个六聚体，一个 double hexamer。但 C-Trap 的光学分辨率没办法直接看到“两个 ring”的形状，只能从分子数间接推断。理论上也不能完全排除别的 12 聚体构象，但 double hexamer 是最合理的模型。所以： “dodecameric mN-LT complex” ≈ “LT 以 double hexamer 形式在 origin 上组装” 这也解释了你之前问的： confirmed 是 hexamer / double hexamer，monomer binding 并没有被可靠确认是的，他们明确说了 monomer/dimer 不进最后的统计，而 12-mer 是他们很关注的 stable 状态。
WT Ori98 vs mutant Ori98.Rep 的对比

In contrast, when tumor-derived Ori98.Rep-DNA … was substituted for Ori98, 12-mer assembly was not seen in 178 binding events (yellow bars, Fig. 4C). Maximum assembly on Ori98.Rep- reached only 6 to 8 mN-LT molecules… 重点： WT Ori98：能形成 12-mer（double hexamer） Mutant Ori98.Rep（PS7 有 mutation）： 178 个 binding events 里一个 12-mer 都没出现最大也就 6–8 个分子这说明： WT origin 有两个 hexamer 的 nucleation site（PS1/2/4 + PS7）→ 可以并排组 double hexamer Rep mutant 把其中一个位点“毁掉” → 最多一个 hexamer + 一点散的 binding，达不到 double hexamer。你如果将来想做类似分析：一种 DNA 序列（类似 WT），你会在 Fig.4C 看到 12-mer 的峰；另一种变体（类似 Rep），你 HMM 出来的 N 分布里就“看不到 12 的那一根 bar”。
Fig. 4D：不同 N-mer 的寿命（binding lifetime）

The mean LT–DNA binding lifetime increased from 36 s to 88 s for 3-mer and 6-mer assemblies, respectively… In contrast, mN-LT 12-mer assemblies … had calculated mean binding lifetimes >1500 s … 意思是： 3-mer：平均寿命 ~36 s 6-mer：~88 s 12-mer：>1500 s（比单个 hexamer 寿命长 17+ 倍）也就是： double hexamer 不仅“存在”，而且是极其稳定的 state。你现在做的 dwell time 分析，其实可以直接用来检查类似的问题：大 binding state（大 Δstate / 高 intensity）是不是寿命明显更长？
和你现在的 HMM + event detection 怎么对上？

你目前做的事情，和 paper 的逻辑高度一致，只是你多了一些技术细节： ICON HMM → m_mod(t) 把 m_mod(t) 等间距分级 → L=10 个 level → 得到 z_step(t) 用 state 跳变 + Δstate 阈值：大幅上跳 + 从低基线 → binding event 一步步往下 → photobleaching steps 用 dwell_min 把很短的 binding–bleach 对删掉（模拟“blinking / 不可靠 binding”） paper 是：全部聚焦在下阶（漂白）的统计上（初始有多少 dye）不太关心“binding 的 exact time point” monomer/dimer 直接放弃，只统计 ≥3 你是：同时要：找 binding 时间找 bleaching 时间还要跟 force trace 的 step 做 correlation → 更严格筛选。

Methods: HMM-Based Quantification of mN-LT Assembly on DNA

To quantify the molecular assembly of mNeonGreen-labeled LT (mN-LT) proteins on DNA substrates, we implemented a custom Hidden Markov Model (HMM) analysis workflow, closely paralleling approaches previously established for photobleaching-based stoichiometry estimation (see [reference]). Our analysis leverages the fact that photobleaching of individual fluorophores produces quantized, stepwise decreases in integrated fluorescence intensity. By statistically resolving these steps, we infer the number and stability of mN-LT complexes assembled on single DNA molecules.

1. HMM Analysis and Stepwise Discretization: Raw intensity trajectories were extracted for each DNA molecule and analyzed using the ICON algorithm to fit a continuous-time HMM. The resulting mean trajectory, \$ m{mod}(t) \$, was discretized into \$ L \$ equally spaced intensity levels (typically \$ L=10 \$), yielding a stepwise trace, \$ z{step}(t) \$. Each plateau in this trace approximates a molecular “N-mer” state (i.e., with N active fluorophores), while downward steps represent photobleaching events.

2. Event Detection and Thresholding: To robustly define binding and bleaching events, we implemented the following criteria: a binding event is identified as an upward jump of at least three intensity levels (\$ \Delta \geq 3 $), starting from a baseline state of ≤5; bleaching events are defined as downward jumps of at least two levels ($ \Delta \leq -2 $). Dwell time filtering ($ dwell_{min} = 0.2\, s \$) was applied, recursively removing short-lived binding–bleaching episodes to minimize contributions from transient blinking or unreliable detections.

3. Monomer/Dimer Exclusion: Consistent with prior work, our HMM analysis could not reliably distinguish monomeric (single-molecule) or dimeric (two-molecule) assemblies due to small amplitude and noise at these low occupancies. Therefore, binding events corresponding to 1-mer and 2-mer states were excluded from quantitative aggregation, and our statistical interpretation focuses on assemblies of three or more mN-LT molecules.

4. Distribution and Stability Analysis: Event tables were constructed by compiling all detected binding and bleaching episodes across up to 30 DNA molecules and 300+ events. The apparent stoichiometry of mN-LT assemblies ranged principally from 3-mer to 14-mer states, with notable maxima at 3-mer (~32%) and 12-mer (~22%), paralleling DNA double-hexamer formation. Rare occurrences of intermediates (e.g., 10-mer or 11-mer) may reflect rapid cooperative transitions to the most stable 12-mer complexes. Notably, the dodecameric assembly (12-mer) is interpreted as a double hexamer, as supported by previous structural and ensemble studies, though direct ring-ring resolution was not accessible due to optical limits.

5. DNA Sequence Dependence and Controls: Wild-type (WT) Ori98 DNA supported robust 12-mer (double hexamer) assembly across binding events. In contrast, Ori98.Rep—bearing a PS7 mutation—never showed 12-mer formation (n=178 events), with assembly restricted to ≤6–8 mN-LT, consistent with disruption of one hexamer nucleation site. This differential stoichiometry was further validated by size-exclusion chromatography and qPCR on nuclear extracts.

6. Binding Lifetimes by Stoichiometry: Mean dwell times for assembly states were extracted, revealing markedly increased stability with higher-order assemblies. The 3-mer and 6-mer states exhibited mean lifetimes of 36 s and 88 s, respectively, while 12-mers exceeded 1500 s—over 17-fold more stable than single hexamers. These measurements were conducted under active-flow to preclude reassembly artifacts.

7. Correspondence to Present Analysis: Our current pipeline follows a near-identical logic:

HMM (ICON) yields a denoised mean (\$ m_{mod}(t) \$),
Discretization into L equal levels produces interpretable stepwise traces,
Event detection applies amplitude and dwell time thresholds (e.g., state jumps, short-lived removal). Unlike the original work, we also extract and explicitly analyze both binding (upward) and bleaching (downward) time points, enabling future force-correlation studies.

8. Software and Reproducibility: All intensity traces were processed using the ICON HMM scripts in Octave/MATLAB, with subsequent discretization and event detection implemented in Python. Complete code and workflow commands are provided in the supplementary materials.

This formulation retains all core technical details: double hexamer assembly, stepwise photobleaching strategy, monomer/dimer filtering, state distribution logic, sequence controls, dwell time quantification, and the direct logic links between your pipeline and the referenced published methodology.

English Methods-Style Text

To quantify the assembly of mNeonGreen-labeled LT (mN-LT) proteins on DNA, we constructed an automated workflow based on Hidden Markov Model (HMM) segmentation of single-molecule fluorescence intensity trajectories. This approach utilizes the property that each photobleaching event yields a stepwise, quantized intensity decrease, enabling reconstruction of the number of LT subunits present on the DNA.

First, fluorescence intensity data from individual molecules or foci were modeled using an HMM (ICON algorithm), yielding a denoised mean trajectory \$ m{mod}(t) \$. This trajectory was discretized into \$ L \$ equally spaced intensity levels, matching the expected single-fluorophore step size, to produce a segmented, stepwise intensity trace (\$ z{step}(t) \$). Each plateau in the trace reflected a state with a specific number of active fluorophores (N-mers), while downward steps corresponded to successive photobleaching events.

Binding and bleaching events were automatically detected:

A binding event was defined as an upward jump of at least 3 levels, starting from a baseline state ≤5;
A bleaching event was defined as a downward jump of at least 2 levels.
Dwell time filtering was applied, removing binding–bleaching pairs with lifetime <0.2 s to exclude short blinks and unreliable events.

Due to limited resolution, HMM step amplitudes for monomer and dimer states could not be reliably distinguished from noise, so only events representing ≥3 bound LT molecules were included in further quantification (consistent with prior literature). Multimer distributions were then compiled from all detected events, typically ranging from 3-mer to 14-mer, with 12-mer “double hexamer” complexes as a prominent, highly stable state; rare intermediates (10- or 11-mer) likely reflected rapid cooperative assembly into higher order structures. Parallel analysis of wild-type and mutant origins demonstrated nucleation site dependence for 12-mer assembly. Binding dwell times were quantified for each stoichiometry and increased with N, with 12-mer complexes showing dramatically extended stability.

This HMM-based approach thus enables automated, objective quantification of DNA–protein assembly stoichiometry and kinetics using high-throughput, single-molecule photobleaching trajectories.

中文方法学描述

具体流程如下：首先，对每一个分子的光强轨迹进行 HMM (ICON 算法) 拟合，得到一个去噪的均值轨迹 \$ m{mod}(t) \$。将此轨迹离散为 \$ L \$ 等间距台阶（对应单分子漂白的幅度），得到分段常数的 step-wise 曲线（\$ z{step}(t) \$）。各平台高度对应于指定数量（N-mer）的 mN-LT，向下台阶代表一个荧光蛋白分子的漂白。

结合与漂白事件的自动检测逻辑为：

结合事件：台阶跳升≥3级，且起始状态≤5；
漂白事件：台阶跳降≥2级；
添加 dwell_min （停留过滤，典型值为0.2 s），滤除短暂 binding–bleach 对（模拟“blink”或识别误差）。

由于分子数为1/2的台阶幅度与噪声幅度接近，本方法无法可靠地区分单体和二聚体的组装阶段，因此所有统计仅计入≥3个亚基的结合事件。最终统计出的多聚体分布从3-mer到14-mer不等，其中 12-mer （即 double hexamer）最为显著且稳定（如 Fig. 4C 红/蓝柱所示）；10-mer、11-mer等中间体极为罕见，原因可能是组装过程高度协作性，迅速跃迁到高阶结构。对比野生型和突变型 DNA 可揭示核化位点对双六聚体形成的依赖性。不同 N 值的多聚体 binding dwell（结合寿命）也可自动统计，发现 N 越大，寿命越长，12-mer 远高于 single hexamer。

该 HMM 分析流程可实现 DNA–蛋白结合构象的全自动、高通量定量，并为动力学机制研究提供单分子分辨率的坚实基础。] ¹²³⁴⁵⁶⁷⁸⁹

⁂

Single-Molecule Binding/Bleaching Detection Pipeline for Data_Vero_Kymographs

Leave a reply

Single-Molecule Binding/Bleaching Detection Pipeline for Data_Vero_Kymographs

“What is confirmed is hexamer and double-hexamer binding to DNA, whereas monomer/dimer binding to DNA is not confirmed.”

Overview

This workflow robustly detects and quantifies molecular binding and bleaching events from single-molecule fluorescence trajectories. It employs Hidden Markov Model (HMM) analysis to convert noisy intensity data into interpretable discrete state transitions, using a combination of MATLAB/Octave and Python scripts.

Step 1: ICON HMM Fitting per Track

Runs icon_from_track_csv.m, loading each track’s photon count data, fitting a HMM (via the ICON algorithm), and saving results (icon_analysis_track_XX.mat).
Key outputs:
- Raw time series × photon counts (used for the black curve in plot, top and background of bottom plot)
- HMM mean state sequence (m_mod)
Example command:

for track_id in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
  octave icon_post_equal_levels.m icon_analysis_track_${track_id}.mat
done

Step 2: Discretize HMM Means (Not Used for Plot Generation)

(Optional) Runs icon_post_equal_levels.m to generate equal_levels_track_XX.mat, which contains a stepwise, discretized version of the HMM fit.
This step is designed for diagnostic parameter tuning (finding L_best), but the plotting script does not use these files for figure generation.

后处理成“等间距台阶 + bleaching step” (这些文件主要用来“告诉你 L 选多少比较合适”(namely L_best)，而不是直接给 Python 画图用).

for track_id in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
octave icon_post_equal_levels.m icon_analysis_track_${track_id}.mat
done
# 生成 equal_levels_track_100.mat for Step 4

Step 3: Event Detection \& Visualization (Python)

Core script: plot_fig4AB_python.py
Loads icon_analysis_track_XX_matlab.mat produced by Step 1.
Black plot (top and gray in lower panel): Raw photon count data.
Red plot (lower panel): Stepwise HMM fit — generated by mapping the HMM mean trajectory to L (e.g. 10) equally spaced photon count levels (L_best).
Event detection:
- Blue triangles: Up-step events (binding)
- Black triangles: Down-step events (bleaching)
- Uses in-script logic to detect transitions meeting user-set thresholds:
  - min_levels_bind=3, min_levels_bleach=2, dwell_min=0.2s, baseline_state_max=5
- Produces both plots and machine-readable event tables (CSV, Excel)
Sample commands:

mamba activate kymo_plots
for track_id in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 100; do
  python plot_fig4AB_python.py ${track_id} 10 3 2 0.2 5
done

Step 4: (For Future) Aggregation

icon_multimer_histogram.m is prepared for future use to aggregate results from many tracks (e.g., make multimer/stoichiometry histograms).
This step is not used for the current plots.

octave icon_multimer_histogram.m
# for subset use:
octave icon_multimer_histogram.m "equal_levels_track_1*.mat"
#→ 得到 multimer_histogram.png, 就是你的 Fig. 4C-like 图.

Figure Explanation (e.g. Track 14 and Track 100)

Top panel (black line): Raw photon counts (z).
- Direct output from Step 1 HMM analysis; visualizes the original noisy fluorescence trace.
Bottom panel:
- Gray line: Raw photon counts for direct comparison.
- Red line: Step-wise fit produced by discretizing the HMM mean (m_mod) from Step 1 directly inside the python script.
- Blue “▲”: Detected binding (upward) events.
- Black “▼”: Detected bleaching (downward) events.
Event Table: Both “binding” and “bleach” events are exported with details: time, photon count, state transition, and dwell time.

Note:

For these figures, only Step 1 and Step 3 are used.
Step 2 is for diagnostic/discretization, but in our current pipeline, L_best was given directly with 10, was not calculated from Step 2, therefore Step 2 was not used.
Step 4 is left for future population summaries.

Key Script Function Descriptions

1. icon_from_track_csv.m (Octave/MATLAB)

Loads a csv photon count time sequence for a single molecule track.
Infers hidden states and mean trajectory with ICON/HMM.
Saves all variables for python to use.

2. plot_fig4AB_python.py (Python)

Loads .mat results: time (t), photons (z), and HMM means (m_mod).
Discretizes the mean trajectory into L equal steps, maps continuous states to nearest discrete, and fits photon counts by linear regression for step heights.
Detects step transitions corresponding to binding or bleaching based on user parameters (size thresholds, dwell filters).
Plots raw data + stepwise fit, annotates events, and saves tables.

(Script excerpt, see full file for details):

def assign_to_levels(m_mod, levels):
    # Map every m_mod value to the nearest discrete level
    ...

def detect_binding_bleach_from_state(...):
    # Identify up/down steps using given jump sizes and baseline cutoff
    ...

def filter_short_episodes_by_dwell(...):
    # Filter events with insufficient dwell time
    ...
...
if __name__ == "__main__":
    # Parse command line args, load .mat, process and plot

Complete Scripts

See attached files for full script code:

icon_from_track_csv.m
plot_fig4AB_python.py
icon_post_equal_levels.m (diagnostic, not used for current figures)
icon_multimer_histogram.m (future)

Example Event Table Output

The python script automatically produces a CSV/Excel file summarizing:

Event type (“binding” or “bleach”)
Time (seconds)
Photon count (at event)
States before/after the event
Dwell time (for binding)

In summary:

Figures output from plot_fig4AB_python.py directly visualize both binding and bleaching events as blue (▲) and black (▼) markers, using logic based on HMM analysis and transition detection within the Python code, without any direct dependence on Step 2 “equal_levels” files. This approach is both robust and reproducible for detailed single-molecule state analysis. [^1][^2]

⁂

icon_from_track_csv.m

clear all;
%% icon_from_track_csv.m
%% 用 ICON HMM 分析你自己的 track_intensity CSV 数据
%%
%% 用法（命令行）：
%%   octave icon_from_track_csv.m your_track_intensity_file.csv [track_id]
%%
%% - your_track_intensity_file.csv:
%%     3 列分号分隔：
%%     # track index;time (seconds);track intensity (photon counts)
%% - track_id（可选）:
%%     要分析的 track_index 数值，例如 0、2、10、100 等
%%     如果不给，则自动：
%%       - 若存在 track 100，则用 100
%%       - 否则用第一个 track

%%--------------------------------------------
%% 0. 处理命令行参数
%%--------------------------------------------
arg_list = argv();
if numel(arg_list) < 1
    error("Usage: octave icon_from_track_csv.m 
<track_intensity_csv> [track_id]");
end

input_file = arg_list{1};
track_id_arg = NaN;
if numel(arg_list) >= 2
    track_id_arg = str2double(arg_list{2});
end

fprintf("Input CSV : %s\n", input_file);
if ~isnan(track_id_arg)
    fprintf("Requested track_id: %g\n", track_id_arg);
end

%% 链接 ICON sampler 的源码
%addpath('sampler_SRC');

% 尝试加载 statistics 包（gamrnd 在这里）
try
    pkg load statistics;
catch
    warning("Could not load 'statistics' package. Please install it via 'pkg install -forge statistics'.");
end

%%--------------------------------------------
%% 1. 读入 3 列 CSV: track_index;time;counts
%%--------------------------------------------
fid = fopen(input_file, 'r');
if fid < 0
    error("Cannot open file: %s", input_file);
end

% 第一行是注释头 "# track index;time (seconds);track intensity (photon counts)"
header_line = fgetl(fid);  %#ok
<NASGU>

% 后面每行: track_index;time_sec;intensity
data = textscan(fid, "%f%f%f", "Delimiter", ";");
fclose(fid);

track_idx = data{1};
time_sec  = data{2};
counts    = data{3};

% 按 track 和时间排序，保证序列有序
[~, order] = sortrows([track_idx, time_sec], [1, 2]);
track_idx = track_idx(order);
time_sec  = time_sec(order);
counts    = counts(order);

tracks = unique(track_idx);
fprintf("Found %d tracks: ", numel(tracks));
fprintf("%g ", tracks);
fprintf("\n");

%%--------------------------------------------
%% 2. 选择要分析的 track
%%--------------------------------------------
if ~isnan(track_id_arg)
    tr = track_id_arg;
else
    % 如果存在 track 100，则优先选 100（你自己定义的 accumulated 轨迹）
    if any(tracks == 100)
        tr = 100;
    else
        tr = tracks(1);   % 否则就选第一个
    end
end

if ~any(tracks == tr)
    error("Requested track_id = %g not found in file.", tr);
end

fprintf("Using track_id = %g for ICON analysis.\n", tr);

sel = (track_idx == tr);
t   = time_sec(sel);
z   = counts(sel);

% 按时间排序（理论上已经排过一次，这里再保险）
[ t, order_t ] = sort(t);
z = z(order_t);

z = z(:);  % 列向量
N = numel(z);
fprintf("Track %g has %d time points.\n", tr, N);

%%--------------------------------------------
%% 3. 设置 ICON 参数（与原始脚本一致）
%%--------------------------------------------

% 浓度（Dirichlet 相关超参数）
opts.a = 1;   % Transitions (alpha)
opts.g = 1;   % Base (gamma)

% 超参数 Q
opts.Q(1) = mean(z);         % mean of means (lambda)
opts.Q(2) = 1 / std(z)^2;    % Precision of means (rho)
opts.Q(3) = 0.1;             % Shape of precisions (beta)
opts.Q(4) = 0.00001;         % Scale of precisions (omega)

opts.M = 10;                 % Nodes in the interpolation

% 采样器参数
opts.dr_sk  = 1;             % Stride: 每多少步保存一次 sample
opts.K_init = 50;            % 初始 photon levels 数量（ICON 会自动收缩）

% 输出标志
flag_stat = true;            % 在命令行里打印进度
flag_anim = false;            % 是否弹出动画（Octave 下可以改成 false 更稳定）

%%--------------------------------------------
%% 4. 运行 ICON 采样器
%%--------------------------------------------
R = 1000;   % 样本数，可按需要调整

fprintf("Running ICON sampler on track %g ...\n", tr);
chain = chainer_main( z, R, [], opts, flag_stat, flag_anim );

%%--------------------------------------------
%% 5. 导出 samples 方便后处理
%%--------------------------------------------
fr = 0.25;   % burn-in 比例（前 25%% 样本丢掉）
dr = 2;      % sample stride（每隔多少个样本导出一个）

out_prefix = sprintf('samples_track_%g', tr);
chainer_export(chain, fr, dr, out_prefix, 'mat');
fprintf("Samples exported to %s.mat\n", out_prefix);

%%--------------------------------------------
%% 6. 基本后验分析：state trajectory / transitions / drift
%%--------------------------------------------

% 离散化 state space （在 [0,1] 上划分 25 个 bin）
m_min = 0;
m_max = 1;
m_num = 25;

% (1) 状态轨迹（归一化）：m_mod
[m_mod, m_red] = chainer_analyze_means(chain, fr, dr, m_min, m_max, m_num, z);

% (2) 转移概率
[m_edges, p_mean, d_dist] = chainer_analyze_transitions( ...
    chain, fr, dr, m_min, m_max, m_num, true);

% (3) 漂移轨迹
[y_mean, y_std] = chainer_analyze_drift(chain, fr, dr, z);

% 1) Load the original Octave .mat file
%load('icon_analysis_track_100.mat');
% 2) Re-save in MATLAB-compatible v5/v7 format, with new name
%save('-mat', 'icon_analysis_track_100_matlab.mat');

%% 存到一个 mat 文件里，方便之后画图
%icon_mat = sprintf('icon_analysis_track_%g.mat', tr);
%save(icon_mat, 't', 'z', 'm_mod', 'm_red', 'm_edges', 'p_mean', ...
%               'd_dist', 'y_mean', 'y_std');

% 原来的 Octave 版本（可留可不留）
mat_out = sprintf("icon_analysis_track_%d.mat", tr);
save(mat_out, "m_mod", "m_red", "m_edges", "p_mean", "d_dist", ...
              "y_mean", "y_std", "t", "z");

% 额外保存一个专门给 Python / SciPy 用的 MATLAB-compatible 版本
mat_out_matlab = sprintf("icon_analysis_track_%d_matlab.mat", tr);
save('-mat', mat_out_matlab, "m_mod", "m_red", "m_edges", "p_mean", "d_dist", ...
                          "y_mean", "y_std", "t", "z");

fprintf("ICON analysis saved to %s\n", icon_mat);

fprintf("Done.\n");

icon_post_equal_levels.m

% icon_post_equal_levels.m  (script version)
%
% 用法（终端）:
%   octave icon_post_equal_levels.m icon_analysis_track_XXX.mat
%
% 对 ICON 分析结果 (icon_analysis_track_XXX.mat) 做等间距 photon level 后处理

arg_list = argv();
if numel(arg_list) < 1
    error("Usage: octave icon_post_equal_levels.m icon_analysis_track_XXX.mat");
end

mat_file = arg_list{1};

fprintf("Loading ICON analysis file: %s\n", mat_file);
S = load(mat_file);

if ~isfield(S, 'm_mod') || ~isfield(S, 't') || ~isfield(S, 'z')
    error("File %s must contain variables: t, z, m_mod.", mat_file);
end

t     = S.t(:);
z     = S.z(:);
m_mod = S.m_mod(:);

N = numel(m_mod);
if numel(t) ~= N || numel(z) ~= N
    error("t, z, m_mod must have the same length.");
end

% 从文件名里解析 track_id（如果有）
[~, name, ~] = fileparts(mat_file);
track_id = NaN;
tokens = regexp(name, 'icon_analysis_track_([0-9]+)', 'tokens');
if ~isempty(tokens)
    track_id = str2double(tokens{1}{1});
    fprintf("Detected track_id = %g from file name.\n", track_id);
else
    fprintf("Could not parse track_id from file name, set to NaN.\n");
end

%--------------------------------------------
% 1. 在 m_mod 的范围上搜索等间距 level 数 L
%--------------------------------------------
m_min = min(m_mod);
m_max = max(m_mod);
fprintf("m_mod range: [%.4f, %.4f]\n", m_min, m_max);

L_min = 2;
L_max = 12;   % 可以按需要改大一些

best_score = Inf;
L_best     = L_min;

fprintf("Scanning candidate level numbers L = %d .. %d ...\n", L_min, L_max);

for L = L_min:L_max
    levels = linspace(m_min, m_max, L);

    % 把每个 m_mod(t) 映射到最近的 level：state_temp ∈ {1..L}
    state_temp = assign_to_levels(m_mod, levels);

    % 用这些 level 生成 step-wise 轨迹
    m_step_temp = levels(state_temp);

    % 计算拟合误差 SSE（在归一化空间）
    residual = m_mod - m_step_temp;
    sse = sum(residual.^2);

    % 类 BIC 评分：误差 + 惩罚 L
    score = N * log(sse / N + eps) + L * log(N);

    fprintf("  L = %2d -> SSE = %.4g, score = %.4g\n", L, sse, score);

    if score < best_score
        best_score = score;
        L_best     = L;
    end
end

fprintf("Best L (number of equally spaced levels) = %d\n", L_best);

%--------------------------------------------
% 2. 用最优 L_best 构建最终的等间距 level & state
%--------------------------------------------
levels_norm = linspace(m_min, m_max, L_best);
state       = assign_to_levels(m_mod, levels_norm);   % 1..L_best
m_step      = levels_norm(state);                     % 归一化台阶轨迹

%--------------------------------------------
% 3. 将归一化台阶轨迹线性映射回 photon counts
%    z ≈ a * m_step + b  （最小二乘）
%--------------------------------------------
A = [m_step(:), ones(N,1)];
theta = A \ z;        % 最小二乘拟合 [a; b]
a = theta(1);
b = theta(2);
z_step = A * theta;   % 拟合出来的 photon counts 台阶轨迹

fprintf("Fitted z ≈ a * m_step + b with a = %.4f, b = %.4f\n", a, b);

%--------------------------------------------
% 4. 检测 bleaching 步：state 下降的时刻
%--------------------------------------------
%s_prev = state(1:end-1);
%s_next = state(2:end);
%
%bleach_idx   = find(s_next < s_prev) + 1;
%bleach_times = t(bleach_idx);
%
%fprintf("Found %d bleaching step(s).\n", numel(bleach_idx));

%--------------------------------------------
% Detect upward (binding) and downward (bleach) steps
%--------------------------------------------
s_prev = state(1:end-1);
s_next = state(2:end);

% raw step indices
bind_idx_raw   = find(s_next > s_prev) + 1;  % upward jumps
bleach_idx_raw = find(s_next < s_prev) + 1;  % downward jumps

% Optional: threshold by intensity change to ignore tiny noisy steps
dz = z_step(2:end) - z_step(1:end-1);

min_jump = 30;   % <-- choose something ~ one level step or larger
keep_bind   = dz >  min_jump;
keep_bleach = dz < -min_jump;

bind_idx   = bind_idx_raw(keep_bind(bind_idx_raw-1));
bleach_idx = bleach_idx_raw(keep_bleach(bleach_idx_raw-1));

bind_times   = t(bind_idx);
bleach_times = t(bleach_idx);

fprintf("Found %d binding and %d bleaching steps (with threshold %.1f).\n", ...
        numel(bind_idx), numel(bleach_idx), min_jump);

%--------------------------------------------
% 5. 保存结果
%--------------------------------------------
out_name = sprintf('equal_levels_track_%s.mat', ...
                   ternary(isnan(track_id), 'X', num2str(track_id)));

fprintf("Saving equal-level analysis to %s\n", out_name);

% Save them in the .mat file
%save(out_name, ...
%     't', 'z', 'm_mod', ...
%     'L_best', 'levels_norm', ...
%     'state', 'm_step', 'z_step', ...
%     'bleach_idx', 'bleach_times', ...
%     'track_id');
save(out_name, ...
     't', 'z', 'm_mod', ...
     'L_best', 'levels_norm', ...
     'state', 'm_step', 'z_step', ...
     'bind_idx', 'bind_times', ...
     'bleach_idx', 'bleach_times', ...
     'track_id');

fprintf("Done.\n");

plot_fig4AB_python.py

#!/usr/bin/env python3
"""
plot_fig4AB_python.py

Usage:
    python plot_fig4AB_python.py 
<track_id> [L_best] [min_levels_bind] [min_levels_bleach] [dwell_min] [baseline_state_max]

Arguments
---------

<track_id> : int
    e.g. 10 → uses icon_analysis_track_10_matlab.mat

[L_best]  : int, optional
    Number of equally spaced levels to use for the step-wise fit
    in Fig. 4B. If omitted, default = 3.

[min_levels_bind] : int, optional
    Minimum number of state levels for an upward jump to count as a
    binding event. Default = 3.

[min_levels_bleach] : int, optional
    Minimum number of state levels for a downward jump to count as a
    bleaching step. Default = 1.

[dwell_min] : float, optional
    Minimum allowed dwell time between a binding event and the NEXT
    bleaching event. If Δt = t_bleach - t_bind < dwell_min, then:
        - that binding is removed
        - the paired bleaching event is also removed
    Default = 0 (no dwell-based filtering).

[baseline_state_max] : int, optional
    Highest state index (0-based) that is still considered "baseline /
    unbound" before a binding jump.
    Binding condition becomes:
        dstate >= min_levels_bind  AND  state_before <= baseline_state_max
    If omitted → no baseline constraint (any state can be a start).

Input file
----------
icon_analysis_track_
<track_id>_matlab.mat

Expected variables inside the .mat file:
    t      : time vector (1D, seconds)
    z      : photon counts (1D)
    m_mod  : ICON mean trajectory (1D), same length as t and z

Outputs
-------
1) Figure:
    fig4AB_track_
<track_id>_L<L_best>.png

2) Event tables:
    binding_bleach_events_track_
<track_id>_L<L_best>.csv
    binding_bleach_events_track_
<track_id>_L<L_best>.xlsx   (if pandas available)
"""

import sys
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

# Try to import pandas for Excel output
try:
    import pandas as pd
    HAS_PANDAS = True
except ImportError:
    HAS_PANDAS = False
    print("[WARN] pandas not available: Excel output (.xlsx) will be skipped.")

def assign_to_levels(m_mod, levels):
    """
    Map each point in m_mod to the nearest level.

    Parameters
    ----------
    m_mod : array-like, shape (N,)
        Normalized ICON mean trajectory.
    levels : array-like, shape (L,)
        Candidate level values.

    Returns
    -------
    state : ndarray, shape (N,)
        Integer state indices in {0, 1, ..., L-1}.
    """
    m_mod = np.asarray(m_mod).ravel()
    levels = np.asarray(levels).ravel()
    diff = np.abs(m_mod[:, None] - levels[None, :])  # (N, L)
    state = np.argmin(diff, axis=1)  # 0..L-1
    return state

def detect_binding_bleach_from_state(
    z_step,
    t,
    state,
    levels,
    min_levels_bind=3,
    min_levels_bleach=1,
    baseline_state_max=None,
):
    """
    Detect binding (big upward state jumps) and bleaching (downward jumps)
    using the discrete state sequence.

    - Binding: large upward jump (>= min_levels_bind) starting from a
      "baseline" state (state_before <= baseline_state_max) if that
      parameter is given.

    - Bleaching: downward jump (<= -min_levels_bleach)

    Parameters
    ----------
    z_step : array-like, shape (N,)
        Step-wise photon counts.
    t : array-like, shape (N,)
        Time vector (seconds).
    state : array-like, shape (N,)
        Integer states 0..L-1 from assign_to_levels().
    levels : array-like, shape (L,)
        Level values (in m_mod space, not directly photon counts).
    min_levels_bind : int
        Minimum number of levels for an upward jump to be
        considered a binding event.
    min_levels_bleach : int
        Minimum number of levels for a downward jump to be
        considered a bleaching event.
    baseline_state_max : int or None
        Highest state index considered "baseline" before binding.
        If None, any state can be the start of a binding jump.

    Returns
    -------
    bind_idx, bleach_idx : np.ndarray of indices
    bind_times, bleach_times : np.ndarray of times (seconds)
    bind_values, bleach_values : np.ndarray of photon counts at those events
    bind_state_before, bind_state_after : np.ndarray of integer states
    bleach_state_before, bleach_state_after : np.ndarray of integer states
    """

    z_step = np.asarray(z_step).ravel()
    t = np.asarray(t).ravel()
    state = np.asarray(state).ravel()
    levels = np.asarray(levels).ravel()

    N = len(t)
    dstate = np.diff(state)  # length N-1
    idx = np.arange(N - 1)

    # ----- Binding: big upward jump, optionally from baseline only -----
    bind_mask = (dstate >= min_levels_bind)
    if baseline_state_max is not None:
        bind_mask &= (state[idx] <= baseline_state_max)

    bind_idx = idx[bind_mask] + 1

    # ----- Bleaching: downward jump -----
    bleach_mask = (dstate <= -min_levels_bleach)
    bleach_idx = idx[bleach_mask] + 1

    bind_times = t[bind_idx]
    bleach_times = t[bleach_idx]

    bind_values = z_step[bind_idx]
    bleach_values = z_step[bleach_idx]

    bind_state_before = state[bind_idx - 1]
    bind_state_after = state[bind_idx]

    bleach_state_before = state[bleach_idx - 1]
    bleach_state_after = state[bleach_idx]

    return (
        bind_idx,
        bleach_idx,
        bind_times,
        bleach_times,
        bind_values,
        bleach_values,
        bind_state_before,
        bind_state_after,
        bleach_state_before,
        bleach_state_after,
    )

def filter_short_episodes_by_dwell(
    bind_idx,
    bleach_idx,
    bind_times,
    bleach_times,
    bind_values,
    bleach_values,
    bind_state_before,
    bind_state_after,
    bleach_state_before,
    bleach_state_after,
    dwell_min,
):
    """
    Remove short binding episodes based on dwell time and also remove
    the paired bleaching step.

    Rule:
        For each binding, find the first bleaching with t_bleach > t_bind.
        If Δt = t_bleach - t_bind < dwell_min, then:
            - remove this binding
            - remove this bleaching
        All other bleaching events remain.

    Returns
    -------
    (filtered binding arrays, dwell_times) + filtered bleaching arrays
    """

    if dwell_min <= 0 or len(bind_idx) == 0 or len(bleach_idx) == 0:
        # no filtering requested or missing events
        dwell_times = np.full(len(bind_idx), np.nan)
        for i in range(len(bind_idx)):
            future = np.where(bleach_times > bind_times[i])[0]
            if len(future) > 0:
                dwell_times[i] = bleach_times[future[0]] - bind_times[i]
        return (
            bind_idx,
            bleach_idx,
            bind_times,
            bleach_times,
            bind_values,
            bleach_values,
            bind_state_before,
            bind_state_after,
            bleach_state_before,
            bleach_state_after,
            dwell_times,
        )

    bind_idx = np.asarray(bind_idx)
    bleach_idx = np.asarray(bleach_idx)
    bind_times = np.asarray(bind_times)
    bleach_times = np.asarray(bleach_times)
    bind_values = np.asarray(bind_values)
    bleach_values = np.asarray(bleach_values)
    bind_state_before = np.asarray(bind_state_before)
    bind_state_after = np.asarray(bind_state_after)
    bleach_state_before = np.asarray(bleach_state_before)
    bleach_state_after = np.asarray(bleach_state_after)

    keep_bind = np.ones(len(bind_idx), dtype=bool)
    keep_bleach = np.ones(len(bleach_idx), dtype=bool)
    dwell_times = np.full(len(bind_idx), np.nan)

    removed_bind = 0
    removed_bleach = 0

    for i in range(len(bind_idx)):
        t_b = bind_times[i]
        future = np.where(bleach_times > t_b)[0]
        if len(future) == 0:
            # no bleaching afterwards → dwell undefined, keep binding
            dwell_times[i] = np.nan
            continue

        j = future[0]
        dt = bleach_times[j] - t_b
        dwell_times[i] = dt

        if dt < dwell_min:
            # remove this binding and its paired bleaching
            keep_bind[i] = False
            if keep_bleach[j]:
                keep_bleach[j] = False
                removed_bleach += 1
            removed_bind += 1

    print(
        f"[INFO] Dwell-based filter: removed {removed_bind} binding(s) and "
        f"{removed_bleach} paired bleaching step(s) with Δt < {dwell_min} s; "
        f"{np.sum(keep_bind)} binding(s) and {np.sum(keep_bleach)} bleaching step(s) kept."
    )

    # Apply masks
    bind_idx = bind_idx[keep_bind]
    bind_times = bind_times[keep_bind]
    bind_values = bind_values[keep_bind]
    bind_state_before = bind_state_before[keep_bind]
    bind_state_after = bind_state_after[keep_bind]
    dwell_times = dwell_times[keep_bind]

    bleach_idx = bleach_idx[keep_bleach]
    bleach_times = bleach_times[keep_bleach]
    bleach_values = bleach_values[keep_bleach]
    bleach_state_before = bleach_state_before[keep_bleach]
    bleach_state_after = bleach_state_after[keep_bleach]

    return (
        bind_idx,
        bleach_idx,
        bind_times,
        bleach_times,
        bind_values,
        bleach_values,
        bind_state_before,
        bind_state_after,
        bleach_state_before,
        bleach_state_after,
        dwell_times,
    )

def plot_fig4AB(
    track_id,
    L_best=None,
    min_levels_bind=3,
    min_levels_bleach=1,
    dwell_min=0.0,
    baseline_state_max=None,
):
    # --------------------------
    # 1. Load ICON analysis file
    # --------------------------
    mat_file = f"icon_analysis_track_{track_id}_matlab.mat"
    print(f"Loading {mat_file}")
    if not os.path.exists(mat_file):
        raise FileNotFoundError(f"{mat_file} does not exist in current directory.")

    data = loadmat(mat_file)

    def extract_vector(name):
        if name not in data:
            raise KeyError(f"Variable '{name}' not found in {mat_file}")
        v = data[name]
        return np.squeeze(v)

    t = extract_vector("t")
    z = extract_vector("z")
    m_mod = extract_vector("m_mod")

    if not (len(t) == len(z) == len(m_mod)):
        raise ValueError("t, z, and m_mod must have the same length.")

    N = len(t)
    print(f"Track {track_id}: N = {N} points")

    # --------------------------
    # 2. Choose L (number of levels)
    # --------------------------
    if L_best is None:
        L_best = 3  # fallback default
        print(f"No L_best provided, using default L_best = {L_best}")
    else:
        print(f"Using user-specified L_best = {L_best}")

    m_min = np.min(m_mod)
    m_max = np.max(m_mod)
    levels = np.linspace(m_min, m_max, L_best)
    print(f"m_mod range: [{m_min:.4f}, {m_max:.4f}]")
    print(f"Equally spaced levels ({L_best}): {levels}")

    # --------------------------
    # 3. Build step-wise trajectory from m_mod
    # --------------------------
    state = assign_to_levels(m_mod, levels)  # 0..L_best-1
    m_step = levels[state]

    # Map back to photon counts via linear fit z ≈ a*m_step + b
    A = np.column_stack([m_step, np.ones(N)])
    theta, *_ = np.linalg.lstsq(A, z, rcond=None)
    a, b = theta
    z_step = A @ theta
    print(f"Fitted z ≈ a * m_step + b with a = {a:.4f}, b = {b:.4f}")

    # --------------------------
    # 4. Detect binding / bleaching events (state-based)
    # --------------------------
    (
        bind_idx,
        bleach_idx,
        bind_times,
        bleach_times,
        bind_values,
        bleach_values,
        bind_state_before,
        bind_state_after,
        bleach_state_before,
        bleach_state_after,
    ) = detect_binding_bleach_from_state(
        z_step,
        t,
        state,
        levels,
        min_levels_bind=min_levels_bind,
        min_levels_bleach=min_levels_bleach,
        baseline_state_max=baseline_state_max,
    )

    base_msg = (
        f"baseline_state_max={baseline_state_max}"
        if baseline_state_max is not None
        else "baseline_state_max=None (no baseline restriction)"
    )
    print(
        f"Initial detection: {len(bind_idx)} binding and {len(bleach_idx)} bleaching "
        f"events (min_levels_bind={min_levels_bind}, "
        f"min_levels_bleach={min_levels_bleach}, {base_msg})."
    )

    # --------------------------
    # 5. Apply dwell-time filter to binding + paired bleach
    # --------------------------
    (
        bind_idx,
        bleach_idx,
        bind_times,
        bleach_times,
        bind_values,
        bleach_values,
        bind_state_before,
        bind_state_after,
        bleach_state_before,
        bleach_state_after,
        dwell_times,
    ) = filter_short_episodes_by_dwell(
        bind_idx,
        bleach_idx,
        bind_times,
        bleach_times,
        bind_values,
        bleach_values,
        bind_state_before,
        bind_state_after,
        bleach_state_before,
        bleach_state_after,
        dwell_min=dwell_min,
    )

    print(
        f"After dwell filter (dwell_min={dwell_min}s): "
        f"{len(bind_idx)} binding and {len(bleach_idx)} bleaching events remain."
    )

    # --------------------------
    # 6. Build event table & save CSV / Excel
    # --------------------------
    rows = []

    # Binding events
    for i in range(len(bind_idx)):
        idx = int(bind_idx[i])
        rows.append({
            "event_type": "binding",
            "sample_index": idx,
            "time_seconds": float(bind_times[i]),
            "photon_count": float(bind_values[i]),
            "state_before": int(bind_state_before[i]),
            "state_after": int(bind_state_after[i]),
            "level_before_norm": float(levels[bind_state_before[i]]),
            "level_after_norm": float(levels[bind_state_after[i]]),
            "dwell_time": float(dwell_times[i]) if not np.isnan(dwell_times[i]) else "",
        })

    # Bleaching events
    for i in range(len(bleach_idx)):
        idx = int(bleach_idx[i])
        rows.append({
            "event_type": "bleach",
            "sample_index": idx,
            "time_seconds": float(bleach_times[i]),
            "photon_count": float(bleach_values[i]),
            "state_before": int(bleach_state_before[i]),
            "state_after": int(bleach_state_after[i]),
            "level_before_norm": float(levels[bleach_state_before[i]]),
            "level_after_norm": float(levels[bleach_state_after[i]]),
            "dwell_time": "",
        })

    # Sort by time
    rows = sorted(rows, key=lambda r: r["time_seconds"])

    # Write CSV
    csv_name = f"binding_bleach_events_track_{track_id}_L{L_best}.csv"
    import csv
    with open(csv_name, "w", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=[
                "event_type",
                "sample_index",
                "time_seconds",
                "photon_count",
                "state_before",
                "state_after",
                "level_before_norm",
                "level_after_norm",
                "dwell_time",
            ],
        )
        writer.writeheader()
        for r in rows:
            writer.writerow(r)

    print(f"Saved event table to {csv_name}")

    # Write Excel (if pandas available)
    if HAS_PANDAS:
        df = pd.DataFrame(rows)
        xlsx_name = f"binding_bleach_events_track_{track_id}_L{L_best}.xlsx"
        df.to_excel(xlsx_name, index=False)
        print(f"Saved event table to {xlsx_name}")
    else:
        print("[INFO] pandas not installed → skipped Excel (.xlsx) output.")

    # --------------------------
    # 7. Make a figure similar to Fig. 4A + 4B
    # --------------------------
    fig, axes = plt.subplots(2, 1, figsize=(7, 6), sharex=True)

    # ---- Fig. 4A-like: raw intensity vs time ----
    ax1 = axes[0]
    ax1.plot(t, z, color="black", linewidth=0.8)
    ax1.set_ylabel("Photon counts")
    ax1.set_title(f"Track {track_id}: raw intensity")  #(Fig. 4A-like)

    # ---- Fig. 4B-like: step-wise HMM fit vs time ----
    ax2 = axes[1]
    ax2.plot(t, z, color="0.8", linewidth=0.5, label="raw")
    ax2.plot(t, z_step, color="red", linewidth=1.5,
             label=f"equal levels (L={L_best})")

    # Mark binding (up-steps) and bleaching (down-steps) AFTER filtering
    if len(bind_idx) > 0:
        ax2.scatter(
            bind_times,
            bind_values,
            marker="^",
            color="blue",
            s=30,
            label="binding",
        )
    if len(bleach_idx) > 0:
        ax2.scatter(
            bleach_times,
            bleach_values,
            marker="v",
            color="black",
            s=30,
            label="bleach",
        )

    ax2.set_xlabel("Time (s)")
    ax2.set_ylabel("Photon counts")
    ax2.set_title(
        f"Step-wise HMM fit ("    #Fig. 4B-like,
        f"min_bind_levels={min_levels_bind}, "
        f"min_bleach_levels={min_levels_bleach}, "
        f"dwell_min={dwell_min}s, "
        f"baseline_state_max={baseline_state_max})"
    )
    ax2.legend(loc="best")

    fig.tight_layout()
    out_png = f"fig4AB_track_{track_id}_L{L_best}.png"
    fig.savefig(out_png, dpi=300)
    plt.close(fig)
    print(f"Saved figure to {out_png}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python plot_fig4AB_python.py 
<track_id> [L_best] [min_levels_bind] [min_levels_bleach] [dwell_min] [baseline_state_max]")
        sys.exit(1)

    track_id = int(sys.argv[1])

    # Defaults
    L_best = None
    min_levels_bind = 3
    min_levels_bleach = 1
    dwell_min = 0.0
    baseline_state_max = None

    if len(sys.argv) >= 3:
        L_best = int(sys.argv[2])
    if len(sys.argv) >= 4:
        min_levels_bind = int(sys.argv[3])
    if len(sys.argv) >= 5:
        min_levels_bleach = int(sys.argv[4])
    if len(sys.argv) >= 6:
        dwell_min = float(sys.argv[5])
    if len(sys.argv) >= 7:
        baseline_state_max = int(sys.argv[6])

    plot_fig4AB(
        track_id,
        L_best=L_best,
        min_levels_bind=min_levels_bind,
        min_levels_bleach=min_levels_bleach,
        dwell_min=dwell_min,
        baseline_state_max=baseline_state_max,
    )

icon_multimer_histogram.m

% icon_multimer_histogram.m
%
% 用法（终端）:
%   octave icon_multimer_histogram.m [pattern]
%
% 默认 pattern = "equal_levels_track_*.mat"
%
% 要求每个 mat 文件里至少有：
%   state      : 等间距 level 的索引 (1..L_best)
%   t          : 时间向量
%   z          : 原始 photon counts
%   track_id   : (可选) 轨迹编号，用于打印信息
%
% 输出：
%   - 在命令行打印每条轨迹估计出的 multimer 数目
%   - 生成 Fig. 4C 风格直方图：
%       multimer_histogram.png

arg_list = argv();
if numel(arg_list) >= 1
    pattern = arg_list{1};
else
    pattern = "equal_levels_track_*.mat";
end

fprintf("Searching files with pattern: %s\n", pattern);
files = dir(pattern);

if isempty(files)
    error("No files matched pattern: %s", pattern);
end

multimers = [];      % 保存每条轨迹的 multimer size
track_ids = [];      % 保存对应的 track_id（若存在）

fprintf("Found %d files.\n", numel(files));

for i = 1:numel(files)
    fname = files(i).name;
    fprintf("\nLoading %s ...\n", fname);
    S = load(fname);

    if ~isfield(S, "state") || ~isfield(S, "t")
        warning("  File %s does not contain 'state' or 't'. Skipped.", fname);
        continue;
    end

    state = S.state(:);
    t     = S.t(:);

    N = numel(state);
    if N < 5
        warning("  File %s has too few points (N=%d). Skipped.", fname, N);
        continue;
    end

    % 解析 track_id（如果有）
    tr_id = NaN;
    if isfield(S, "track_id")
        tr_id = S.track_id;
    else
        % 尝试从文件名里解析
        tokens = regexp(fname, 'equal_levels_track_([0-9]+)', 'tokens');
        if ~isempty(tokens)
            tr_id = str2double(tokens{1}{1});
        end
    end

    % 取前 10%% 和后 10%% 时间段的 state 中位数
    n_head = max(1, round(0.1 * N));
    n_tail = max(1, round(0.1 * N));

    head_idx = 1:n_head;
    tail_idx = (N - n_tail + 1):N;

    initial_state = round(median(state(head_idx)));
    final_state   = round(median(state(tail_idx)));

    multimer_size = initial_state - final_state;

    if multimer_size <= 0
        fprintf("  Track %g: initial_state=%d, final_state=%d -> multimer_size=%d (ignored)\n", ...
                tr_id, initial_state, final_state, multimer_size);
        continue;
    end

    fprintf("  Track %g: initial_state=%d, final_state=%d -> multimer_size=%d\n", ...
            tr_id, initial_state, final_state, multimer_size);

    multimers(end+1,1) = multimer_size;
    track_ids(end+1,1) = tr_id;
end

if isempty(multimers)
    error("No valid multimer sizes estimated. Check your data or thresholds.");
end

% 像论文那样，可以选择去掉 monomer/dimer
fprintf("\nTotal %d events (including monomer/dimer).\n", numel(multimers));

% 可选：过滤掉 ≤2 的（monomer / dimer）
mask = multimers > 2;
multimers_filtered = multimers(mask);
fprintf("After removing monomer/dimer (<=2): %d events.\n", numel(multimers_filtered));

if isempty(multimers_filtered)
    error("No events left after filtering monomer/dimer. Try including them.");
end

% 计算直方图
max_mult = max(multimers_filtered);
edges = 0.5:(max_mult + 0.5);
[counts, edges_out] = histcounts(multimers_filtered, edges);
centers = 1:max_mult;

% 画 Fig. 4C 风格的柱状图
figure;
bar(centers, counts, 'FaceColor', [0.2 0.6 0.8]); % 颜色随便，Octave 会忽略
xlabel('Number of mN-LT per origin (multimer size)');
ylabel('Frequency');
title('Distribution of multimer sizes (Fig. 4C-like)');
xlim([0.5, max_mult + 0.5]);

% 在每个柱子上标一下计数
hold on;
for k = 1:max_mult
    if counts(k) > 0
        text(centers(k), counts(k) + 0.1, sprintf('%d', counts(k)), ...
             'HorizontalAlignment', 'center');
    end
end
hold off;

print('-dpng', 'multimer_histogram.png');
fprintf("\n[INFO] Multimer histogram saved to multimer_histogram.png\n");

Bioinformatics Pipelines for DNA Sequencing: From Raw Reads to Biological Insight

Leave a reply

Abstract

English: Advances in DNA sequencing have revolutionized biology, but converting vast sequencing data into usable, robust biological knowledge depends on sophisticated bioinformatics. This review details computational strategies spanning all phases of DNA sequence analysis, starting from raw reads through to functional interpretation and reporting. It begins by characterizing the main sequencing platforms (short-read, long-read, targeted, and metagenomic), describes critical pipeline steps (sample tracking, quality control, read alignment, error correction, variant and structural variant detection, copy number analysis, de novo assembly), and considers the impact of reference genome choice and computational algorithms. Recent machine learning advances for variant annotation and integration with other omics are discussed, with applications highlighted in rare disease diagnostics, cancer genomics, and infectious disease surveillance. Emphasis is placed on reproducible, scalable, and well-documented pipelines using open-source tools, workflow management (Snakemake, Nextflow), containerization, versioning, and FAIR data principles. The review concludes with discussion of ongoing challenges (heterogeneous data, batch effects, benchmarking, privacy) and practical recommendations for robust, interpretable analyses for both experimental biologists and computational practitioners.

Chinese: DNA测序的持续进步彻底改变了生物学和医学研究，而要将庞大的测序数据转化为可靠的生物学知识，则高度依赖高水平的生物信息学方法。本文详细介绍了DNA序列分析全流程的主流计算策略，涵盖原始reads到功能性注释乃至标准化报告的各个环节。首先评述主流测序技术平台（短读长读、靶向、宏基因组），系统阐述实验设计、样本追踪、数据质控、比对、纠错、变异与结构变异检测、拷贝数分析和de novo组装等流程要点，并分析参考基因组和比对算法对结果的影响。文章还总结了机器学习在变异注释、多组学整合中的最新进展，结合罕见病诊断、肿瘤基因组和病原体监测等实际案例深入说明其应用场景。着重强调可重复、高效、透明的分析流程，包括开源工具、流程管理(Snakemake、Nextflow)、容器化、版本管理与FAIR原则。最后讨论了异质数据、批次效应、评测标准和隐私保护等挑战，并为实验与计算生物学研究者提供实用建议。

Detailed Structure \& Outline

Introduction
- Historical overview of DNA sequencing and bioinformatics development
- The necessity of bioinformatics for handling scale, complexity, and error sources in modern sequence data
- Scope: DNA focus (excluding RNA/proteome)
DNA Sequencing Technologies \& Data Properties
- 2.1 Short-read platforms (e.g., Illumina): read length, quality, use cases
- 2.2 Long-read platforms (PacBio, Nanopore): strengths, error profiles, applications
- 2.3 Specialized applications: targeted/exome panels, metagenomics, amplicon/barcode-based diagnostics
Core Bioinformatics Pipeline Components
- 3.1 Experimental metadata, sample barcoding, batch tracking: Crucial for reproducibility and QC
- 3.2 Raw read QC: base quality, adapter/contaminant trimming, typical software/plots
- 3.3 Read alignment/mapping: reference choice (GRCh38, hg19), algorithmic details (FM-index, seed-and-extend), uniqueness/multimapping
- 3.4 Post-alignment processing: file sorting, duplicate marking, base recalibration, depth analysis
- 3.5 Variant calling: SNVs/indels, somatic vs germline separation, quality filters and validation strategies
- 3.6 Structural variant and CNV analysis: breakpoints, split/discordant reads, long-read tools
- 3.7 De novo assembly, polishing, and consensus generation where relevant
Functional Interpretation
- 4.1 Annotation: gene models, regulatory regions, predictive algorithms and public databases
- 4.2 Multi-omics integration: joint analysis of genome, epigenome, transcriptome, regulatory networks
- 4.3 Machine learning/AI approaches: variant scoring, prioritization, deep learning for sequence features
Reproducible and Scalable Workflows
- 5.1 Pipeline frameworks: Snakemake, Nextflow, CWL, and workflow description languages
- 5.2 Containerization: Docker, Singularity for reproducible deployments
- 5.3 Version control/documentation: workflow hubs, deployment on GitHub, FAIR-compliant reporting
- 5.4 Data management: standard formats (FASTQ/BAM/CRAM/VCF), secure storage, metadata
Applications \& Case Studies
- Rare disease genomics: WGS for diagnosis
- Cancer genomics: tumor heterogeneity, evolution, therapy response
- Pathogen surveillance: rapid outbreak detection, resistance tracking
- Other applications to match research interests
Challenges and Future Prospects
- Technical: population-scale analysis, batch correction, pangenomes, benchmarking complexities
- Practical: workflow sharing, legal/ethical/privacy issues
- Methodological: handling new sequencing chemistries, multi-modal omics
Conclusions
- Recap essential lessons
- Actionable recommendations for robust design and execution
- Prospects for further automation, integration, and clinical translation

Section Opening (English / 中文): High-throughput DNA sequencing has fundamentally transformed modern genomics, enabling detailed investigation of human diseases, microbial ecology, and evolution. However, the raw output—massive quantities of short or long reads—is only the starting point; extracting meaningful, robust insights requires optimized bioinformatics pipelines that ensure data integrity and biological relevance.

高通量DNA测序极大地推动了现代基因组学，助力人类疾病、微生物生态与进化等领域的深入探索。但测序仪输出的原始reads只是起点——要获得有意义、可靠的生物学结论，必须依赖优化的生物信息学流程以保证数据质量和生物学解释的可信度。

⁂

Interpreting Binding Events in Single-Molecule Fluorescence Tracking: Conceptual Clarification and Practical Proposal

Leave a reply

Concept: is the start of a track a binding event? From your plot: Each track appears suddenly at some time, then decays / fluctuates and disappears. The accumulated trace shows clear upward steps at those times. In typical kymograph / spot-tracking data: A track is created when the tracking algorithm first detects a fluorescent spot above background. That is usually when a fluorophore binds (or comes into focus / into the ROI). The end of a track is when the spot disappears, often due to bleaching or unbinding. So, conceptually: Yes, in many experiments the first time point of a track is a natural candidate for the binding time of that molecule. But there are a few caveats you might want to mention to Vero: Some molecules may already be bound before the camera started recording – in that case the first frame is not the true binding time, just the first time you see it. Detection thresholds may delay the detected “start” by a few frames. If two tracks overlap in time, the accumulated signal will have multiple steps, and the HMM sees “binding” whenever the sum intensity increases. That also explains your observations: Per-track HMM: each trace is relatively short and noisy → few clear transitions, so the HMM doesn’t detect many events. HMM on the accumulated signal: all tracks summed → big, clean steps when a new molecule appears → HMM picks up almost all track starts as binding events. So it’s very reasonable to treat track starts as a kind of “ground truth” binding time, at least for benchmarking.
Improved text you can send to Vero Here’s a cleaned-up version of what you wrote, phrased as a question: From the plots of the individual tracks and the accumulated photon counts (see attached figure), I have a conceptual question about how to define the binding time. In the track-intensity file, each track appears at some time point and then decays. Would you consider the first time point of each track to be the “binding time” of that molecule? To me this looks like a very natural definition, because the track starts exactly when the signal first appears. Note: – When I run the HMM on individual tracks, only very few binding events are detected. – When I run the HMM on the accumulated signal, I can detect almost all of the track starts as binding events. This makes me think that the track starting times might be a good gold standard for binding time points. Does this interpretation match your understanding of the experiment? If you want I can also help you write a follow-up mail where you propose to use “track start = binding” and “track end = bleaching/unbinding” as ground truth for validating your HMM.

Here is a concise bilingual summary of your text explaining the idea and a cleaned-up proposal to send to Vero:

English Summary

Concept: Is the start of a track a binding event?

Each fluorescent track appears suddenly, fluctuates, then disappears; cumulative traces show clear stepwise increments.
A track starts when a fluorophore binds or enters detection (fluorescent) focus, and ends due to bleaching or unbinding.
Thus, the start of a track commonly represents the molecule’s binding time.
Caveats:
- Some molecules may be bound before recording starts, so first observed frame is not true binding time.
- Detection thresholds may delay start detection.
- Overlapping tracks cause summed signal steps, and HMM detects any increase as binding.
Per-track HMM sees few transitions due to noise and short traces; summed HMM on total signal detects most binding events.

Cleaned-up question to Vero: “From individual and cumulative photon count plots (see attached), does the first time point of each track represent the molecule’s binding time?

HMM on single tracks detects very few events;
HMM on accumulated trace detects nearly all track starts as bindings;
This suggests track start time is a good benchmark for binding. Does this match your understanding? I can help draft a follow-up email proposing to use ‘track start = binding’ and ‘track end = bleaching/unbinding’ as ground truth for HMM validation.”

中文总结

概念解释：轨迹开始是否代表结合事件？

每条轨迹突然出现，随后衰减或波动后消失，累积轨迹中出现明显的阶梯式上升。
轨迹的开始通常是荧光标记分子结合或进入焦点检测区域的时刻，结束则因漂白或解离。
因此轨迹开始时间通常被视为该分子结合的时间点。
注意事项：
- 某些分子可能在录像开始前已结合，首帧不是准确结合时间；
- 检测阈值可能导致结合时间存在几十帧延迟；
- 轨迹重叠造成信号累积，HMM会将信号上升视为结合事件。
针对单条轨迹的HMM因噪声和轨迹短小，事件检测少；对累积信号的HMM检测到几乎所有结合。

整理后可以发给Vero的文本： “根据个别轨迹和累积光子计数图（见附件），我有个问题：是否可以将每条轨迹的首个时间点视为该分子的结合时刻？

针对单轨迹的HMM只检测到少量结合事件；
针对累积轨迹的HMM则几乎检测到所有轨迹期初的结合；
这可能说明轨迹起始时间是一个可靠的结合时间‘黄金标准’。这种理解符合你的看法吗？我也可以帮你起草一封邮件，提议用“轨迹开始=结合”，“轨迹结束=漂白/解离”作为验证HMM的真实标准。”

¹⁰¹¹¹²¹³¹⁴¹⁵¹⁶¹⁷

⁂

使用 ICON HMM 在 Octave 中自动检测单分子 binding/bleach 事件并估计 dwell time

Leave a reply

% detect_binding_bleach.m
% 用法（在终端）:
%   octave detect_binding_bleach.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv

%--------------------------------------------
% 基本设置
%--------------------------------------------
arg_list = argv();
if numel(arg_list) < 1
    error("Usage: octave detect_binding_bleach.m 
<track_intensity_csv>");
end
input_file = arg_list{1};

% 加载 HMM 采样器
%addpath("sampler_SRC");

% 如果安装了 statistics 包，用于 kmeans
try
    pkg load statistics;
catch
    warning("Could not load 'statistics' package. Make sure it is installed if kmeans is missing.");
end

%--------------------------------------------
% 读入 track intensity 文件（分号分隔）
%--------------------------------------------
fprintf("Reading file: %s\n", input_file);

fid = fopen(input_file, "r");
if fid < 0
    error("Cannot open file: %s", input_file);
end

% 第一行是注释头 "# track index;time (seconds);track intensity (photon counts)"
header_line = fgetl(fid);  % 忽略内容，只是读掉这一行

% 后面每行: track_index;time_sec;intensity
data = textscan(fid, "%f%f%f", "Delimiter", ";");
fclose(fid);

track_idx = data{1};
time_sec  = data{2};
counts    = data{3};

% 排序（确保同一个 track 内按时间排序）
[~, order] = sortrows([track_idx, time_sec], [1, 2]);
track_idx = track_idx(order);
time_sec  = time_sec(order);
counts    = counts(order);

tracks = unique(track_idx);
n_tracks = numel(tracks);

fprintf("Found %d tracks.\n", n_tracks);

%--------------------------------------------
% 结果结构体
%--------------------------------------------
results = struct( ...
    "track_id", {}, ...
    "binding_indices", {}, ...
    "binding_times", {}, ...
    "bleach_indices", {}, ...
    "bleach_times", {} );

%--------------------------------------------
% 循环每个 track，跑 HMM + 找事件
%--------------------------------------------
for ti = 1:n_tracks
    tr = tracks(ti);
    fprintf("\n===== Track %d =====\n", tr);

    sel = (track_idx == tr);
    t = time_sec(sel);
    z = counts(sel);
    z = z(:);      % 列向量

    %-------------------------
    % 1) 设定 ICON HMM 的参数
    %-------------------------
    opts = struct();

    % 超参数（和原 MATLAB 代码一致）
    opts.a = 1;
    opts.g = 1;

    opts.Q(1) = mean(z);
    opts.Q(2) = 1 / (std(z)^2 + eps);   % 防止除零
    opts.Q(3) = 0.1;
    opts.Q(4) = 0.00001;

    opts.M = 10;

    % 采样参数
    opts.dr_sk  = 1;
    opts.K_init = 50;

    flag_stat = true;
    flag_anim = false;    % 建议在 Octave 里关掉动画更稳定

    R = 1000;             % 采样次数，可根据需要调小/调大

    %-------------------------
    % 2) 运行采样器
    %-------------------------
    fprintf("  Running HMM sampler...\n");
    chain = chainer_main(z, R, [], opts, flag_stat, flag_anim);

    %-------------------------
    % 3) 做后验分析，得到平滑后的光子水平轨迹
    %-------------------------
    fr = 0.25;        % burn-in 比例
    dr = 2;           % sample 步距

    m_min = 0;
    m_max = 1;
    m_num = 25;

    [m_mod, m_red] = chainer_analyze_means(chain, fr, dr, m_min, m_max, m_num, z);

    % m_mod 是一个和 z 同长度的向量（通常是 [0,1] 范围的归一化水平）
    m_mod = m_mod(:);

    %-------------------------
    % 4) 把 m_mod 聚类成 3 个光子强度状态
    %    状态 1 = 最低强度 (dark / unbound / bleached)
    %    状态 3 = 最高强度 (bound)
    %-------------------------
    K = 3;   % 你可以改成 2 或其他
    try
        [idx_raw, centers] = kmeans(m_mod, K);
    catch
        % 如果没 kmeans，就简单按数值分位数来分三类
        warning("kmeans not available, using simple quantile-based clustering.");
        q1 = quantile(m_mod, 1/3);
        q2 = quantile(m_mod, 2/3);
        idx_raw = ones(size(m_mod));
        idx_raw(m_mod > q1 & m_mod <= q2) = 2;
        idx_raw(m_mod > q2) = 3;
        centers = zeros(K,1);
        for kk = 1:K
            centers(kk) = mean(m_mod(idx_raw == kk));
        end
    end

    % 根据中心值从小到大重排状态编号
    [~, order_centers] = sort(centers);
    state_seq = zeros(size(idx_raw));
    for k = 1:K
        state_seq(idx_raw == order_centers(k)) = k;
    end

    low_state  = 1;
    high_state = K;

    %-------------------------
    % 5) 检测 binding / bleach 跳变
    %    binding:  low -> high
    %    bleach : high -> low
    %-------------------------
    s = state_seq(:);
    % 前一时刻和后一时刻
    s_prev = s(1:end-1);
    s_next = s(2:end);

    bind_idx   = find(s_prev == low_state  & s_next == high_state) + 1;
    bleach_idx = find(s_prev == high_state & s_next == low_state) + 1;

    bind_times   = t(bind_idx);
    bleach_times = t(bleach_idx);

    fprintf("  Found %d binding event(s) and %d bleaching event(s).\n", ...
            numel(bind_idx), numel(bleach_idx));

    % 存入结果
    results(ti).track_id        = tr;
    results(ti).binding_indices = bind_idx;
    results(ti).binding_times   = bind_times;
    results(ti).bleach_indices  = bleach_idx;
    results(ti).bleach_times    = bleach_times;
end

%--------------------------------------------
% 6) 把结果写成 CSV
%--------------------------------------------
out_csv = "binding_bleach_events.csv";
fid_out = fopen(out_csv, "w");
if fid_out < 0
    error("Cannot open output file: %s", out_csv);
end

fprintf(fid_out, "track_index,event_type,sample_index,time_seconds\n");
for ti = 1:numel(results)
    tr = results(ti).track_id;

    % binding
    for k = 1:numel(results(ti).binding_indices)
        fprintf(fid_out, "%d,binding,%d,%.6f\n", tr, ...
                results(ti).binding_indices(k), ...
                results(ti).binding_times(k));
    end

    % bleach
    for k = 1:numel(results(ti).bleach_indices)
        fprintf(fid_out, "%d,bleach,%d,%.6f\n", tr, ...
                results(ti).bleach_indices(k), ...
                results(ti).bleach_times(k));
    end
end
fclose(fid_out);

fprintf("\nAll done. Events written to: %s\n", out_csv);

运行：

octave detect_binding_bleach.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv

保留原始 ICON 后验分析：
- chainer_export 导出样本
- chainer_analyze_means 得到 m_mod
- chainer_analyze_transitions 得到 m_edges, p_mean, d_dist
- chainer_analyze_drift 得到 y_mean, y_std
使用 binding/bleach 时间点 计算 dwell time 分布并做单指数 rate 拟合，并直接画直方图 + 指数拟合曲线。
输出一个 CSV 表，包含所有 binding/bleach 时间点（这是最重要的输出）。

说明：

代码仍然基于你已有的 detect_binding_bleach.m 结构，假定当前目录下有 sampler_SRC（含 chainer_main.m、chainer_analyze_means.m、chainer_analyze_transitions.m、chainer_analyze_drift.m 等）。¹⁸¹⁹

Dwell time 这里使用“连续 binding 事件之间的时间差”来估计 bound-state dwell time（简化版）；如果 bleaching 明确是轨迹末端终止，也可以用 bleach_time - last_binding_time 作为一个 right‑censored dwell（这里暂不做 censoring 修正，只输出原始分布和单指数拟合）。

完整脚本：`detect_binding_bleach_dwell_simple.m`

% detect_binding_bleach_dwell.m
%
% 用法（终端）:
%   octave detect_binding_bleach_dwell.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv
%
% 输入 CSV 格式（分号分隔）:
%   # track index;time (seconds);track intensity (photon counts)
%   1;0.00;123
%   1;0.05;118
%   2;0.00; 95
%   ...

%--------------------------------------------
% 0. 基本设置
%--------------------------------------------
arg_list = argv();

if numel(arg_list) < 1
    error("Usage: octave detect_binding_bleach_dwell.m 
<input_csv>");
end

input_file = arg_list{1};

% 加载 HMM 采样器代码（确保 sampler_SRC 在当前目录下）
addpath("sampler_SRC");

% 如果安装了 statistics 包，用于 kmeans
try
    pkg load statistics;
catch
    warning("Could not load 'statistics' package. Make sure it is installed if kmeans is missing.");
end

%--------------------------------------------
% 1. 读入 track intensity 文件（分号分隔）
%--------------------------------------------
fprintf("Reading file: %s\n", input_file);

fid = fopen(input_file, "r");
if fid < 0
    error("Cannot open file: %s", input_file);
end

% 第一行是注释头 "# track index;time (seconds);track intensity (photon counts)"
header_line = fgetl(fid); % 忽略内容，只是读掉这一行

% 后面每行: track_index;time_sec;intensity
data = textscan(fid, "%f%f%f", "Delimiter", ";");
fclose(fid);

track_idx = data{1};
time_sec  = data{2};
counts    = data{3};

% 排序（确保同一个 track 内按时间排序）
[~, order] = sortrows([track_idx, time_sec], [1, 2]);
track_idx = track_idx(order);
time_sec  = time_sec(order);
counts    = counts(order);

tracks   = unique(track_idx);
n_tracks = numel(tracks);
fprintf("Found %d tracks.\n", n_tracks);

%--------------------------------------------
% 2. 结果结构体（binding / bleach 时间点）
%--------------------------------------------
results = struct( ...
    "track_id",        {}, ...
    "binding_indices", {}, ...
    "binding_times",   {}, ...
    "bleach_indices",  {}, ...
    "bleach_times",    {} );

% 用于 dwell time 收集的数组
all_binding_times = [];   % 所有 track 的 binding time 合并
all_bleach_times  = [];   % 所有 track 的 bleach time 合并

%--------------------------------------------
% 3. 循环每个 track，跑 HMM + ICON 分析 + 事件检测
%--------------------------------------------
for ti = 1:n_tracks

    tr = tracks(ti);
    fprintf("\n===== Track %d =====\n", tr);

    sel = (track_idx == tr);
    t   = time_sec(sel);
    z   = counts(sel);
    z   = z(:); % 列向量

    %-------------------------
    % 3.1 设定 ICON HMM 的参数
    %-------------------------
    opts        = struct();
    % 超参数（与 level_analysis.m 一致）
    opts.a      = 1;
    opts.g      = 1;
    opts.Q(1)   = mean(z);
    opts.Q(2)   = 1 / (std(z)^2 + eps); % 防止 std(z)=0 除零
    opts.Q(3)   = 0.1;
    opts.Q(4)   = 0.00001;
    opts.M      = 10;

    % 采样参数
    opts.dr_sk  = 1;
    opts.K_init = 50;

    flag_stat   = true;
    flag_anim   = false;  % 建议在 Octave 里关掉动画
    R           = 1000;   % 采样次数，可按需求调整

    %-------------------------
    % 3.2 运行采样器（ICON HMM）
    %-------------------------
    fprintf(" Running HMM sampler...\n");
    chain = chainer_main(z, R, [], opts, flag_stat, flag_anim);

    %-------------------------
    % 3.3 ICON 后验分析（与 level_analysis.m 等价）
    %-------------------------
    fr    = 0.25; % burn-in 比例
    dr    = 2;    % sample 步距
    m_min = 0;
    m_max = 1;
    m_num = 25;

    % (1) 均值轨迹：m_mod
    [m_mod, m_red] = chainer_analyze_means(chain, fr, dr, m_min, m_max, m_num, z);
    m_mod = m_mod(:);

    % (2) 导出 samples（为每条 track 单独存一个文件）
    sample_file = sprintf("samples_track_%d", tr);
    chainer_export(chain, fr, dr, sample_file, "mat");

    % (3) 转移概率 / 跃迁统计
    [m_edges, p_mean, d_dist] = chainer_analyze_transitions( ...
        chain, fr, dr, m_min, m_max, m_num, true);

    % (4) 漂移分析
    [y_mean, y_std] = chainer_analyze_drift(chain, fr, dr, z);

    % 你可以根据需要保存这些 ICON 分析结果
    mat_out = sprintf("icon_analysis_track_%d.mat", tr);
    save(mat_out, "m_mod", "m_red", "m_edges", "p_mean", "d_dist", ...
                  "y_mean", "y_std", "t", "z");

    %-------------------------
    % 3.4 把 m_mod 聚类成 3 个光子强度状态（用于事件检测）
    %-------------------------
    K = 3; % 可按物理需求修改

    try
        [idx_raw, centers] = kmeans(m_mod, K);
    catch
        warning("kmeans not available, using simple quantile-based clustering.");
        q1 = quantile(m_mod, 1/3);
        q2 = quantile(m_mod, 2/3);

        idx_raw = ones(size(m_mod));
        idx_raw(m_mod > q1 & m_mod <= q2) = 2;
        idx_raw(m_mod > q2)              = 3;

        centers = zeros(K,1);
        for kk = 1:K
            centers(kk) = mean(m_mod(idx_raw == kk));
        end
    end

    % 根据中心值从小到大重排状态编号，使 state_seq = 1..K 对应 low->high
    [~, order_centers] = sort(centers);
    state_seq = zeros(size(idx_raw));
    for k = 1:K
        state_seq(idx_raw == order_centers(k)) = k;
    end

    low_state  = 1;
    high_state = K;

    %-------------------------
    % 3.5 检测 binding / bleaching
    %
    %   binding: low_state -> high_state
    %   bleach : high_state -> low_state
    %-------------------------
    s      = state_seq(:);
    s_prev = s(1:end-1);
    s_next = s(2:end);

    bind_idx   = find(s_prev == low_state  & s_next == high_state) + 1;
    bleach_idx = find(s_prev == high_state & s_next == low_state) + 1;

    bind_times   = t(bind_idx);
    bleach_times = t(bleach_idx);

    fprintf(" Found %d binding event(s) and %d bleaching event(s).\n", ...
            numel(bind_idx), numel(bleach_idx));

    % 存入结果结构体
    results(ti).track_id        = tr;
    results(ti).binding_indices = bind_idx;
    results(ti).binding_times   = bind_times;
    results(ti).bleach_indices  = bleach_idx;
    results(ti).bleach_times    = bleach_times;

    % 用于全局 dwell-time 统计
    all_binding_times = [all_binding_times; bind_times(:)];
    all_bleach_times  = [all_bleach_times;  bleach_times(:)];

end

%--------------------------------------------
% 4. 输出 binding/bleach 时间点的总表（最重要输出）
%--------------------------------------------
out_csv = "binding_bleach_events.csv";
fid_out = fopen(out_csv, "w");
if fid_out < 0
    error("Cannot open output file: %s", out_csv);
end

fprintf(fid_out, "track_index,event_type,sample_index,time_seconds\n");

for ti = 1:numel(results)
    tr = results(ti).track_id;

    % binding
    for k = 1:numel(results(ti).binding_indices)
        fprintf(fid_out, "%d,binding,%d,%.6f\n", tr, ...
                results(ti).binding_indices(k), ...
                results(ti).binding_times(k));
    end

    % bleach
    for k = 1:numel(results(ti).bleach_indices)
        fprintf(fid_out, "%d,bleach,%d,%.6f\n", tr, ...
                results(ti).bleach_indices(k), ...
                results(ti).bleach_times(k));
    end
end

fclose(fid_out);
fprintf("\n[INFO] Binding/bleach events written to: %s\n", out_csv);

%--------------------------------------------
% 5. Dwell-time 直方图 + rate 拟合
%
%   简化假设：
%   - 把每个 track 中连续 binding 事件之间的时间差视为
%     bound state dwell time（中间没有 bleaching，就重新 binding）。
%   - 如果有明确的最后一个 binding 到 bleaching 的时间，
%     你也可以扩展这里把它们当作 dwell time 加进去。
%--------------------------------------------

% 5.1 收集 bound-state dwell times（简单用 binding 时间差）
dwell_times = [];

for ti = 1:numel(results)
    bt = sort(results(ti).binding_times(:)); % 单个 track 的 binding times
    if numel(bt) >= 2
        dt = diff(bt);    % 相邻 binding 之间的时间差
        dwell_times = [dwell_times; dt];
    end

    % 可选：如果你想把最后一个 binding -> 第一个 bleach 也算入 dwell:
    % if ~isempty(results(ti).binding_times) && ~isempty(results(ti).bleach_times)
    %     last_binding  = max(results(ti).binding_times);
    %     first_bleach  = min(results(ti).bleach_times);
    %     if first_bleach > last_binding
    %         dwell_times = [dwell_times; first_bleach - last_binding];
    %     end
    % end
end

if isempty(dwell_times)
    fprintf("[WARN] No sufficient binding events to compute dwell-time distribution.\n");
else
    % 5.2 绘制 dwell-time 直方图
    figure;
    hold on;
    nbins = max(10, round(sqrt(numel(dwell_times)))); % 简单经验 bin 数
    [counts_hist, edges] = histcounts(dwell_times, nbins);
    centers = (edges(1:end-1) + edges(2:end)) / 2;

    bar(centers, counts_hist, "hist");
    xlabel("Dwell time (s)");
    ylabel("Counts");
    title("Bound-state dwell-time histogram");

    % 5.3 拟合单指数: p(t) ~ (1/tau) * exp(-t/tau)
    % 使用简单的最小二乘拟合 log(count) 对 t
    valid = counts_hist > 0;
    t_fit = centers(valid);
    y_fit = log(counts_hist(valid));

    % 线性拟合: log(N) = a + b * t, 其中 b ~= -1/tau
    p = polyfit(t_fit, y_fit, 1);
    a = p(1);
    b = p(2);

    tau_est = -1 / b;
    fprintf("[INFO] Fitted single-exponential dwell time tau = %.4f s\n", tau_est);

    % 画出拟合曲线（归一到 histogram 的总 counts 约束）
    t_plot = linspace(min(dwell_times), max(dwell_times), 200);
    % 预测密度 ~ exp(a + b t)，但我们要匹配 histogram 的离散计数
    y_model = exp(a + b * t_plot);

    plot(t_plot, y_model, "r-", "LineWidth", 2);
    legend("Histogram", "Single-exp fit");
    hold off;

    % 可选：保存图像
    print("-dpng", "dwell_time_histogram_fit.png");
    fprintf("[INFO] Dwell-time histogram and fit saved to dwell_time_histogram_fit.png\n");
end

fprintf("\nAll done.\n");

使用与扩展建议

运行：

octave detect_binding_bleach_dwell_simple.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv

会生成：

binding_bleach_events.csv：所有 track 的 binding/bleach 时间点（你最关心的表格）。
icon_analysis_track_X.mat：每个 track 的 ICON 分析结果（m_mod, m_edges, p_mean, d_dist, y_mean, y_std 等）。
如果有足够事件，生成 dwell_time_histogram_fit.png，包含 dwell-time 直方图和单指数拟合曲线。
- dwell time 定义：当前用的是“相邻 binding 时间点之差”作为 dwell；你如果有更明确的物理模型（例如每条轨迹只有 1 次 binding + 1 次 bleach），可以把注释里那段 last_binding -> first_bleach 的逻辑打开或进一步改写成更严格的两态 dwell 定义。
- rate constant：对于单指数 $p(t) \propto \exp(-t/\tau)$，tau_est 是平均 dwell time，解离 rate 大致是 $k{\text{off}} = 1/\tau{\text{est}}$。如果你需要可以在脚本里再打印一条 k_off_est。

如果你希望 dwell-time 统计严格区分 “bound state dwell vs unbound state dwell”，或者想直接从 p_mean / d_dist 中做 rate 矩阵估计，可以在这份脚本基础上再往前一步；可以一起设计一个更“教科书版”的 HMM‑rate 推断流程。

特性概览：

使用 ICON HMM（chainer_main）分析每个 track。¹⁸
做原始 level_analysis.m 的四类后验分析：chainer_export、chainer_analyze_means、chainer_analyze_transitions、chainer_analyze_drift。¹⁸
基于 m_mod 聚类成 3 个强度状态，检测 binding / bleaching 事件。¹⁹
输出 binding_bleach_events.csv，包含所有 binding/bleach 时间点。¹⁹
用 binding 时间点计算 dwell-time 分布，做单指数拟合；绘图部分只在内存中生成，用 print + close 保存 PNG（类似 R 的 dev.off()），并使用 hist 而不是 histcounts。¹⁹

将以下内容保存为 detect_binding_bleach_dwell.m 即可直接运行。

`detect_binding_bleach_dwell.m` 完整代码

% detect_binding_bleach_dwell.m
%
% 用法（终端）:
%   octave detect_binding_bleach_dwell.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv
%
% 输入 CSV 格式（分号分隔）:
%   # track index;time (seconds);track intensity (photon counts)
%   1;0.00;123
%   1;0.05;118
%   2;0.00; 95
%   ...

%--------------------------------------------
% 0. 基本设置
%--------------------------------------------
arg_list = argv();

if numel(arg_list) < 1
    error("Usage: octave detect_binding_bleach_dwell.m 
<input_csv>");
end

input_file = arg_list{1};

% 加载 HMM 采样器代码（确保 sampler_SRC 在当前目录下）
addpath("sampler_SRC");

% 如果安装了 statistics 包，用于 kmeans
try
    pkg load statistics;
catch
    warning("Could not load 'statistics' package. Make sure it is installed if kmeans is missing.");
end

%--------------------------------------------
% 1. 读入 track intensity 文件（分号分隔）
%--------------------------------------------
fprintf("Reading file: %s\n", input_file);

fid = fopen(input_file, "r");
if fid < 0
    error("Cannot open file: %s", input_file);
end

% 第一行是注释头 "# track index;time (seconds);track intensity (photon counts)"
header_line = fgetl(fid); % 忽略内容，只是读掉这一行

% 后面每行: track_index;time_sec;intensity
data = textscan(fid, "%f%f%f", "Delimiter", ";");
fclose(fid);

track_idx = data{1};
time_sec  = data{2};
counts    = data{3};

% 排序（确保同一个 track 内按时间排序）
[~, order] = sortrows([track_idx, time_sec], [1, 2]);
track_idx = track_idx(order);
time_sec  = time_sec(order);
counts    = counts(order);

tracks   = unique(track_idx);
n_tracks = numel(tracks);
fprintf("Found %d tracks.\n", n_tracks);

%--------------------------------------------
% 2. 结果结构体（binding / bleach 时间点）
%--------------------------------------------
results = struct( ...
    "track_id",        {}, ...
    "binding_indices", {}, ...
    "binding_times",   {}, ...
    "bleach_indices",  {}, ...
    "bleach_times",    {} );

% 用于 dwell time 收集的数组（跨所有 track）
all_binding_times = [];   % 所有 track 的 binding time 合并
all_bleach_times  = [];   % 所有 track 的 bleach time 合并

%--------------------------------------------
% 3. 循环每个 track，跑 HMM + ICON 分析 + 事件检测
%--------------------------------------------
for ti = 1:n_tracks

    tr = tracks(ti);
    fprintf("\n===== Track %d =====\n", tr);

    sel = (track_idx == tr);
    t   = time_sec(sel);
    z   = counts(sel);
    z   = z(:); % 列向量

    %-------------------------
    % 3.1 设定 ICON HMM 的参数
    %-------------------------
    opts        = struct();
    % 超参数（与 level_analysis.m 一致）
    opts.a      = 1;
    opts.g      = 1;
    opts.Q(1)   = mean(z);
    opts.Q(2)   = 1 / (std(z)^2 + eps); % 防止 std(z)=0 除零
    opts.Q(3)   = 0.1;
    opts.Q(4)   = 0.00001;
    opts.M      = 10;

    % 采样参数
    opts.dr_sk  = 1;
    opts.K_init = 50;

    flag_stat   = true;
    flag_anim   = false;  % 建议在 Octave 里关掉动画
    R           = 1000;   % 采样次数，可按需求调整

    %-------------------------
    % 3.2 运行采样器（ICON HMM）
    %-------------------------
    fprintf(" Running HMM sampler...\n");
    chain = chainer_main(z, R, [], opts, flag_stat, flag_anim);

    %-------------------------
    % 3.3 ICON 后验分析（与 level_analysis.m 等价）
    %-------------------------
    fr    = 0.25; % burn-in 比例
    dr    = 2;    % sample 步距
    m_min = 0;
    m_max = 1;
    m_num = 25;

    % (1) 均值轨迹：m_mod
    [m_mod, m_red] = chainer_analyze_means(chain, fr, dr, m_min, m_max, m_num, z);
    m_mod = m_mod(:);

    % (2) 导出 samples（为每条 track 单独存一个文件）
    sample_file = sprintf("samples_track_%d", tr);
    chainer_export(chain, fr, dr, sample_file, "mat");
    fprintf("%s --- Exported\n", [sample_file ".mat"]);

    % (3) 转移概率 / 跃迁统计
    [m_edges, p_mean, d_dist] = chainer_analyze_transitions( ...
        chain, fr, dr, m_min, m_max, m_num, true);

    % (4) 漂移分析
    [y_mean, y_std] = chainer_analyze_drift(chain, fr, dr, z);

    % 保存这些 ICON 分析结果（可选）
    mat_out = sprintf("icon_analysis_track_%d.mat", tr);
    save(mat_out, "m_mod", "m_red", "m_edges", "p_mean", "d_dist", ...
                  "y_mean", "y_std", "t", "z");

    %-------------------------
    % 3.4 把 m_mod 聚类成 3 个光子强度状态（用于事件检测）
    %-------------------------
    K = 3; % 可按物理需求修改

    try
        [idx_raw, centers] = kmeans(m_mod, K);
    catch
        warning("kmeans not available, using simple quantile-based clustering.");
        q1 = quantile(m_mod, 1/3);
        q2 = quantile(m_mod, 2/3);

        idx_raw = ones(size(m_mod));
        idx_raw(m_mod > q1 & m_mod <= q2) = 2;
        idx_raw(m_mod > q2)              = 3;

        centers = zeros(K,1);
        for kk = 1:K
            centers(kk) = mean(m_mod(idx_raw == kk));
        end
    end

    % 根据中心值从小到大重排状态编号，使 state_seq = 1..K 对应 low->high
    [~, order_centers] = sort(centers);
    state_seq = zeros(size(idx_raw));
    for k = 1:K
        state_seq(idx_raw == order_centers(k)) = k;
    end

    low_state  = 1;
    high_state = K;

    %-------------------------
    % 3.5 检测 binding / bleaching
    %
    %   binding: low_state -> high_state
    %   bleach : high_state -> low_state
    %-------------------------
    s      = state_seq(:);
    s_prev = s(1:end-1);
    s_next = s(2:end);

    bind_idx   = find(s_prev == low_state  & s_next == high_state) + 1;
    bleach_idx = find(s_prev == high_state & s_next == low_state) + 1;

    bind_times   = t(bind_idx);
    bleach_times = t(bleach_idx);

    fprintf("Found %d binding event(s) and %d bleaching event(s).\n", ...
            numel(bind_idx), numel(bleach_idx));

    % 存入结果结构体
    results(ti).track_id        = tr;
    results(ti).binding_indices = bind_idx;
    results(ti).binding_times   = bind_times;
    results(ti).bleach_indices  = bleach_idx;
    results(ti).bleach_times    = bleach_times;

    % 用于全局 dwell-time 统计
    all_binding_times = [all_binding_times; bind_times(:)];
    all_bleach_times  = [all_bleach_times;  bleach_times(:)];

end

%--------------------------------------------
% 4. 输出 binding/bleach 时间点的总表（最重要输出）
%--------------------------------------------
out_csv = "binding_bleach_events.csv";
fid_out = fopen(out_csv, "w");
if fid_out < 0
    error("Cannot open output file: %s", out_csv);
end

fprintf(fid_out, "track_index,event_type,sample_index,time_seconds\n");

for ti = 1:numel(results)
    tr = results(ti).track_id;

    % binding
    for k = 1:numel(results(ti).binding_indices)
        fprintf(fid_out, "%d,binding,%d,%.6f\n", tr, ...
                results(ti).binding_indices(k), ...
                results(ti).binding_times(k));
    end

    % bleach
    for k = 1:numel(results(ti).bleach_indices)
        fprintf(fid_out, "%d,bleach,%d,%.6f\n", tr, ...
                results(ti).bleach_indices(k), ...
                results(ti).bleach_times(k));
    end
end

fclose(fid_out);
fprintf("\n[INFO] Binding/bleach events written to: %s\n", out_csv);

%--------------------------------------------
% 5. Dwell-time 直方图 + rate 拟合
%
%   简化假设：
%   - 把每个 track 中连续 binding 事件之间的时间差视为
%     bound state dwell time（中间没有 bleaching，就重新 binding）。
%   - 如果有明确的最后一个 binding 到 bleaching 的时间，
%     可以扩展此处加入（当前先保留简单版本）。
%--------------------------------------------

% 5.1 收集 bound-state dwell times（简单用 binding 时间差）
dwell_times = [];

for ti = 1:numel(results)
    bt = sort(results(ti).binding_times(:)); % 单个 track 的 binding times
    if numel(bt) >= 2
        dt = diff(bt);    % 相邻 binding 之间的时间差
        dwell_times = [dwell_times; dt];
    end

    % 可选：如果你想把最后一个 binding -> 第一个 bleach 也算入 dwell:
    % if ~isempty(results(ti).binding_times) && ~isempty(results(ti).bleach_times)
    %     last_binding  = max(results(ti).binding_times);
    %     first_bleach  = min(results(ti).bleach_times);
    %     if first_bleach > last_binding
    %         dwell_times = [dwell_times; first_bleach - last_binding];
    %     end
    % end
end

if isempty(dwell_times)
    fprintf("[WARN] No sufficient binding events to compute dwell-time distribution.\n");
else
    % 5.2 绘制 dwell-time 直方图（只在内存中画，不弹窗）
    fig = figure("visible", "off");
    hold on;

    nbins = max(10, round(sqrt(numel(dwell_times))));

    % 使用 hist（Octave 自带）代替 histcounts
    [counts_hist, centers] = hist(dwell_times, nbins);

    bar(centers, counts_hist, "hist");
    xlabel("Dwell time (s)");
    ylabel("Counts");
    title("Bound-state dwell-time histogram");

    % 5.3 拟合单指数: p(t) ~ (1/tau) * exp(-t/tau)
    valid = counts_hist > 0;
    t_fit = centers(valid);
    y_fit = log(counts_hist(valid));

    p = polyfit(t_fit, y_fit, 1);
    a = p(1);
    b = p(2);

    tau_est = -1 / b;
    fprintf("[INFO] Fitted single-exponential dwell time tau = %.4f s\n", tau_est);

    t_plot = linspace(min(dwell_times), max(dwell_times), 200);
    y_model = exp(a + b * t_plot);

    plot(t_plot, y_model, "r-", "LineWidth", 2);
    legend("Histogram", "Single-exp fit");
    hold off;

    % 保存图像并关闭 figure（类似 R 的 dev.off()）
    out_png = "dwell_time_histogram_fit.png";
    print(fig, out_png, "-dpng");
    close(fig);

    fprintf("[INFO] Dwell-time histogram and fit saved to %s\n", out_png);
end

fprintf("\nAll done.\n");

运行方式：

octave detect_binding_bleach_dwell.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv

如有下一步你想基于 d_dist 做更“纯 HMM”的 dwell 分布 / rate 矩阵，也可以在这个脚本上再加一节专门分析。

⁂

level_analysis.m vs detect_binding_bleach.m

这两个脚本在核心 HMM 采样算法上是一致的，但整体流程和用途并不完全相同。level_analysis.m只是做“ICON HMM 拟合 + 若干通用后验分析”，而 detect_binding_bleach.m在此基础上，额外做了“多 track 输入 + 聚类成离散状态 + 自动识别 binding/bleach 跳变 + 输出 CSV”。因此不能说是完全相同算法的简单移植，而是“在相同 HMM 内核上扩展出的一个更完整事件检测流程”。¹⁸¹⁹

下面分两部分说明：先列出关键差异，然后用中文系统讲解 detect_binding_bleach.m 的整个处理流程（包括和原始 level_analysis.m 共享的 HMM 部分）。¹⁹¹⁸

一、算法差异概览

输入数据形式不同
- level_analysis.m：直接从 photoncount.txt 读入一个向量 z，视为单一时间序列（一个 DNA tether 的 photon counts）。¹⁸
- detect_binding_bleach.m：从命令行传入一个分号分隔的 CSV，结构为 track_index;time;intensity，先按 (track_index, time) 排序，然后对每个 track 单独取出 z 与时间 t 做分析，相当于一次运行中批量处理多条轨迹。¹⁹
HMM 部分是否一致
- 两者都使用 chainer_main 这一 ICON HMM 采样器，参数设置非常接近：
  - opts.a = 1; opts.g = 1;（浓度参数）。¹⁸¹⁹
  - opts.Q(1) = mean(z);（均值超分布的均值）。¹⁹¹⁸
  - opts.Q(2) = 1/std(z)^2; vs 1 / (std(z)^2 + eps)（只是在 Octave 版本里加了 eps 防除零，对数值稳定性有轻微改进，但理论上等价）。¹⁸¹⁹
  - opts.Q(3) = 0.1; opts.Q(4) = 0.00001; opts.M = 10; 完全一致。¹⁹¹⁸
  - 采样设置：opts.dr_sk = 1; opts.K_init = 50; R = 1000; 也一致。¹⁸¹⁹
- 因此，给定同一 z，ICON HMM 的抽样和后验均值轨迹的生成逻辑是相同的，只是 Octave 版本做了防数值异常的小修正。整体可以认为 HMM 核心算法是等价的。¹⁹¹⁸
后验分析部分的差异
- level_analysis.m：侧重“通用分析”，调用：
  - chainer_export 导出样本。¹⁸
  - chainer_analyze_means 得到状态均值轨迹 m_mod。¹⁸
  - chainer_analyze_transitions 得到转移概率矩阵、边界等。¹⁸
  - chainer_analyze_drift 得到漂移轨迹 y_mean, y_std。¹⁸
- detect_binding_bleach.m：只调用 chainer_analyze_means 得到 m_mod，然后完全走另一条分析线：
  - 用 kmeans 或分位数把 m_mod 聚成 3 个强度状态（低、中、高）。¹⁹
  - 通过状态序列中的 low→high 和 high→low 跳变来定义 binding 和 bleaching 事件，并输出事件时间。¹⁹
- 所以 HMM 后处理的目标和输出是完全不同的：一个是做统计分析、可视化；另一个是做自动事件检测。
输出结果不同
- level_analysis.m：
  - 导出 MCMC 样本 samples.mat，方便后续任意分析；同时有若干变量在 workspace 中（m_mod, p_mean, d_dist, y_mean 等）。¹⁸
- detect_binding_bleach.m：
  - 对每条 track 计算并存储 binding_indices/binding_times 和 bleach_indices/bleach_times，最后全部写入统一的 binding_bleach_events.csv，用于后续统计或可视化。¹⁹
附加逻辑（聚类与事件判定）是新加的
- kmeans（或分位数替代）、状态重标号、状态序列差分、事件 CSV 输出等逻辑，在 level_analysis.m 中完全不存在，是在 Octave 版本里新增的功能层。也就是说，detect_binding_bleach.m 不只是移植，而是在 HMM 内核基础上的一个“面向 binding/bleach 识别”的上层算法。¹⁹¹⁸

结论：

HMM 模型和采样部分与原 MATLAB 代码在理论上是相同的，只做了数值稳定性改动。
整体脚本逻辑不相同，Octave 版多了一整套“轨迹遍历 + 状态聚类 + binding/bleach 检测 + CSV 输出”的算法层。

二、用中文说明 `detect_binding_bleach.m` 的完整流程（含与原脚本共用的 HMM 思路）

下面把 detect_binding_bleach.m 当作主角，用中文分步骤解释其处理流程，并在相关步骤指出它与原 level_analysis.m 的对应关系。这样方便你检查是否满足你对“算法一致性”的期待。¹⁹¹⁸

1. 命令行输入与文件读取

脚本通过 argv() 读取命令行参数，第一个参数是输入文件名，例如： octave detect_binding_bleach.m p853_250706_p502_10pN_ch5_0bar_b3_1_track_intensity_data_blue_5s.csv。¹⁹
输入文件是以分号分隔的文本，第一行为注释头，后面每行包含三列：
- track_index：轨迹编号（一个 DNA tether 或一个粒子）。
- time_seconds：该轨迹下某一帧的时间点。
- track_intensity：该时间点的 photon counts。
读入后，把三列分别存成 track_idx, time_sec, counts。然后用 sortrows([track_idx, time_sec]) 按 “轨迹号优先，再按时间” 排好顺序，保证每条轨迹都是时间有序的。之后提取所有不同的 tracks，准备对每一条轨迹单独做分析。¹⁹

这一步在原 level_analysis.m 中要简单得多，那里只是 z = load('photoncount.txt');，默认只处理单一系列，没有 track 结构。¹⁸

2. 对每条轨迹循环，构建输入序列 z

外层 for ti = 1:n_tracks 遍历每一个 track。
对于当前轨迹 tr，用 sel = (track_idx == tr) 挑出这一条轨迹对应的时间和强度，得到：
- t = time_sec(sel)：该轨迹的时间轴。
- z = counts(sel)：该轨迹的 photon count 序列。
把 z 转成列向量，作为后续 HMM 的观测数据。

到这里为止，z 的角色与原脚本中的 z = load('photoncount.txt') 完全一致，只是这里是“多轨迹版本”的 z。¹⁸¹⁹

3. 设置 ICON HMM 的超参数和采样参数

这一段是两份脚本的核心重合部分，也是你最关心的“算法一致性”所在。¹⁸¹⁹

先构建 opts 结构体，用于向 chainer_main 传参。
超参数设置为：
- opts.a = 1;：控制 HMM 中状态转换的 Dirichlet 浓度（类似“转移稀疏度”的先验强度）。
- opts.g = 1;：控制状态先验分布的基本参数（base measure 的浓度）。
- opts.Q(1) = mean(z);：假设状态均值的先验均值等于观测平均值。
- opts.Q(2) = 1 / (std(z)^2 + eps);：状态均值先验的精度，与观测标准差成反比；加上 eps 是为了避免在极端情形下 std(z) = 0 导致除零。原 MATLAB 版本是 1/std(z)^2，数学上是一致的设定。
- opts.Q(3) = 0.1;：方差逆数的 Gamma 先验的 shape。
- opts.Q(4) = 0.00001;：方差逆数的 Gamma 先验的 scale。
- opts.M = 10;：插值节点数（ICON 模型内部数值近似需要）。
采样相关的参数：
- opts.dr_sk = 1;：保存样本的间隔步长。
- opts.K_init = 50;：初始假设的隐状态数（实际有效状态数由采样自动调整）。
- R = 1000;：采样迭代次数。

这些设置在 level_analysis.m 中完全一致，只是缺少 eps。所以从贝叶斯 HMM 模型结构和先验设定来看，两者是同一个 ICON 模型。¹⁹¹⁸

4. 运行 HMM 采样器，得到 MCMC 链

调用 chain = chainer_main(z, R, [], opts, flag_stat, flag_anim); 来运行 ICON HMM 的 Gibbs sampler 或类似的 MCMC 算法。
- z 是观测序列。
- R 是采样轮数。
- 第三个参数为空 []，表示不提供初始链，由函数内部自行初始化。
- opts 是前面设定好的超参数和控制参数。
- flag_stat = true 打开进度输出，flag_anim = false 关闭动画（相比原 MATLAB 代码那里默认 flag_anim = true，这是在 Octave 里为了稳定性和性能做的小改动，对推断结果没有理论影响）。¹⁸¹⁹

chainer_main 内部会对每个时间点的隐状态、每个状态的均值和方差、转移矩阵等进行采样，得到一条或多条后验链 chain，用来近似整个后验分布。这个步骤在两个脚本里完全相同，只是 Octave 版本对可视化做了简化。¹⁹¹⁸

5. 后验均值轨迹分析：`chainer_analyze_means`

设置后处理参数：
- fr = 0.25：前 25% 的样本视为 burn-in，丢弃不用。
- dr = 2：之后每隔 2 个样本取一个，用来减弱自相关。
- m_min = 0; m_max = 1; m_num = 25;：把状态均值的可能范围离散成 $[0,1]$ 区间上 25 个格点，用于后验平均的插值或投影。
调用： [m_mod, m_red] = chainer_analyze_means(chain, fr, dr, m_min, m_max, m_num, z);
输出 m_mod 可以理解为“在每一个时间点上的后验平均光子水平”，往往已被归一化到 $[0,1]$ 区间，是一个与 z 等长的一维轨迹。m_red 是某种降维或简化形式，这里没有用到。¹⁸¹⁹

这一部分与 level_analysis.m 完全一致，区别在于：原脚本后面还继续做转移概率估计和漂移分析；Octave 版则转向“离散状态 + 事件检测”。¹⁹¹⁸

6. 把连续水平轨迹聚类成几个离散状态

接下来是 Octave 脚本中新加的上层算法部分，用来从连续的 m_mod 中自动提取“低/中/高”强度状态：

设定状态数 K = 3（可手动改成 2 或其他）。
首选使用 kmeans(m_mod, K) 对 m_mod 做聚类，得到：
- idx_raw：每个时间点所属的簇编号。
- centers：每个簇的中心值（大致对应不同光子水平）。
如果 Octave 没有统计包或 kmeans 不可用，就退化为基于分位数的简单阈值法：
- 用三分位数 q1 = quantile(m_mod, 1/3); q2 = quantile(m_mod, 2/3); 把轨迹分为低、中、高三段。
- 然后为每类算一个均值当作 centers。
得到 centers 后，按照中心值从小到大排序，重排状态编号，使得：
- 状态 1：最低强度（low_state）。
- 状态 K（=3）：最高强度（high_state）。

这一段在 level_analysis.m 中没有任何对应逻辑，是 Octave 版本为 binding/bleach 检测专门设计的。其本质是：用连续的后验均值轨迹做无监督聚类，把连续空间映射为少数几个离散层级状态。¹⁸¹⁹

7. 从状态序列中检测 binding 和 bleaching 事件

有了状态序列 state_seq 后，就可以做简单的“状态跳变检测”来定义事件：

令 s = state_seq(:)。
构造前后相邻状态：
- s_prev = s(1:end-1);
- s_next = s(2:end);
binding 事件：
- 当某一时间点出现 s_prev == low_state 且 s_next == high_state 时，即从低强度直接跳到高强度，视为一次“分子结合”（binding）。
- 对应的索引为 find(...) + 1，因为跳变是在第二个点发生。
bleach 事件：
- 当出现 s_prev == high_state 且 s_next == low_state 时，即从高强度直接掉回低强度，视为一次“光漂白或解离”（bleaching/unbinding）。
- 索引同样是 find(...) + 1。
最后用 t(bind_idx) 和 t(bleach_idx) 把这些索引转换成真实时间点（秒），分别作为 binding_times 和 bleach_times。

这一步把 HMM 的连续后验轨迹，最终转化成了一个离散事件时间列表。在原 level_analysis.m 中，没有这类应用导向的 event detection，只停留在“描述状态分布和转移概率”的层面。¹⁹¹⁸

8. 汇总所有轨迹结果并写出 CSV

在 for ti = 1:n_tracks 循环内，把每条轨迹的事件信息存入 results(ti)：
- track_id
- binding_indices, binding_times
- bleach_indices, bleach_times。¹⁹
循环结束后，统一写出一个 CSV 文件 binding_bleach_events.csv，格式为：
- track_index,event_type,sample_index,time_seconds
- 其中 event_type 是 "binding" 或 "bleach"。
最终在命令行打印一行提示，告诉你输出文件名。

相较之下，level_analysis.m 用的是 chainer_export 把整条 MCMC 链以 .mat 形式保存，供交互式分析和可视化使用，而不是面向“事件列表”的输出。¹⁸¹⁹

总结（中文角度）

从 HMM 模型和抽样的角度看：两个脚本在 ICON HMM 的参数设定和采样过程上是同一套算法，仅在 Octave 版加入了 eps 防止除零，以及关闭动画提高稳定性，这不会改变算法本质。
从整个脚本的“算法流程”看： level_analysis.m 是一个“通用 ICON HMM 演示与分析脚本”，只做单一序列的建模和一些统计性后验分析，不涉及具体事件检测逻辑。 detect_binding_bleach.m 在沿用同一 HMM 内核的基础上，增加了：多轨迹处理、对后验均值轨迹的聚类、将轨迹离散为若干强度状态、基于状态跳变的 binding/bleach 自动识别以及 CSV 输出。这是一层新的、面向应用的算法，不存在于原 MATLAB 脚本中。

如果你的目标是“在 Octave 中复用原 ICON HMM 的统计建模部分，并进一步实现自动 binding/bleaching 检测”，那么当前 detect_binding_bleach.m 的设计是符合这一思路的：核心推断算法相同，外层分析逻辑则是针对具体实验需求进行了扩展。

⁂

level_analysis.m ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩
detect_binding_bleach.m ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩ ↩

Processing Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606 v3

Leave a reply

Vorgabe

#perform PCA analysis, Venn diagram analysis, as well as KEGG and GO annotations. We would also appreciate it if you could include CPM calculations for this dataset (gene_cpm_counts.xlsx). For comparative analysis, we are particularly interested in identifying DEGs between WT and ΔIJ across the different treatments and time points.

I have already performed the six comparisons, using WT as the reference:

    ΔIJ-17 vs WT-17 – no treatment
    ΔIJ-24 vs WT-24 – no treatment
    preΔIJ-17 vs preWT-17 – Treatment A
    preΔIJ-24 vs preWT-24 – Treatment A
    0_5ΔIJ-17 vs 0_5WT-17 – Treatment B
    0_5ΔIJ-24 vs 0_5WT-24 – Treatment B

To gain a deeper understanding of how the ∆adeIJ mutation influences response dynamics over time and under different stimuli, would you also be interested in the following additional comparisons?

Within-strain treatment responses
(to explore how each strain responds to treatments):

WT:

    preWT-17 vs WT-17 → response to Treatment A at 17 h
    preWT-24 vs WT-24 → response to Treatment A at 24 h
    0_5WT-17 vs WT-17 → response to Treatment B at 17 h
    0_5WT-24 vs WT-24 → response to Treatment B at 24 h

∆adeIJ:

    preΔIJ-17 vs ΔIJ-17 → response to Treatment A at 17 h
    preΔIJ-24 vs ΔIJ-24 → response to Treatment A at 24 h
    0_5ΔIJ-17 vs ΔIJ-17 → response to Treatment B at 17 h
    0_5ΔIJ-24 vs ΔIJ-24 → response to Treatment B at 24 h

Time-course comparisons
(to investigate time-dependent changes within each condition):

    WT-24 vs WT-17
    ΔIJ-24 vs ΔIJ-17
    preWT-24 vs preWT-17
    preΔIJ-24 vs preΔIJ-17
    0_5WT-24 vs 0_5WT-17
    0_5ΔIJ-24 vs 0_5ΔIJ-17

I reviewed the datasets again and noticed that there are no ∆adeAB samples included. Should we try to obtain ∆adeAB data from other datasets? However, I’m a bit concerned that batch effects might pose a challenge when integrating data from different datasets.

> It is possible to analyze DEGs across various time points (17 and 24 h) and stimuli (treatment A and B, and without treatment) iswithin both the ∆adeIJ mutant and the WT strain as our phenotypic characterization of these strains across two times points and stimuli shows significant differences but the other mutant ∆adeAB (similar function as AdeIJ) shows no difference compared to WT, therefore we are wondering what's happened to ∆adeIJ.

deltaIJ_17, WT_17 – ΔadeIJ and wildtype strains w/o exposure at 17 h (No treatment)
deltaIJ_24, WT_24 – ΔadeIJ and wildtype strains w/o exposure at 24 h (No treatment)
pre_deltaIJ_17, pre_WT_17 – ΔadeIJ and wildtype strains with 1 exposure at 17 h (Treatment A)
pre_deltaIJ_24, pre_WT_24 – ΔadeIJ and wildtype strains with 1 exposure at 24 h (Treatment A)
0_5_deltaIJ_17, 0_5_WT_17 – ΔadeIJ and wildtype strains with 2 exposure at 17 h (Treatment B)
0_5_deltaIJ_24, 0_5_WT_24 – ΔadeIJ and wildtype strains with 2 exposure at 24 h (Treatment B)

Preparing raw data

mkdir raw_data; cd raw_data
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_1.fq.gz WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_2.fq.gz WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_1.fq.gz WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_2.fq.gz WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_1.fq.gz WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_2.fq.gz WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_1.fq.gz WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_2.fq.gz WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_1.fq.gz WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_2.fq.gz WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_1.fq.gz WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_2.fq.gz WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_1.fq.gz deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_2.fq.gz deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_1.fq.gz deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_2.fq.gz deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_1.fq.gz deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_2.fq.gz deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_1.fq.gz deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_2.fq.gz deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_1.fq.gz deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_2.fq.gz deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_1.fq.gz deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_2.fq.gz deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_1.fq.gz pre_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_2.fq.gz pre_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_1.fq.gz pre_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_2.fq.gz pre_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_1.fq.gz pre_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_2.fq.gz pre_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_1.fq.gz pre_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_2.fq.gz pre_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_1.fq.gz pre_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_2.fq.gz pre_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_1.fq.gz pre_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_2.fq.gz pre_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_1.fq.gz pre_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_2.fq.gz pre_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_1.fq.gz pre_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_2.fq.gz pre_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_1.fq.gz pre_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_2.fq.gz pre_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_1.fq.gz pre_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_2.fq.gz pre_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_1.fq.gz pre_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_2.fq.gz pre_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_1.fq.gz pre_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_2.fq.gz pre_deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_1.fq.gz 0_5_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_2.fq.gz 0_5_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_1.fq.gz 0_5_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_2.fq.gz 0_5_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_1.fq.gz 0_5_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_2.fq.gz 0_5_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_1.fq.gz 0_5_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_2.fq.gz 0_5_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_1.fq.gz 0_5_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_2.fq.gz 0_5_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_1.fq.gz 0_5_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_2.fq.gz 0_5_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_1.fq.gz 0_5_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_2.fq.gz 0_5_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_1.fq.gz 0_5_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_2.fq.gz 0_5_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_1.fq.gz 0_5_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_2.fq.gz 0_5_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_1.fq.gz 0_5_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_2.fq.gz 0_5_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_1.fq.gz 0_5_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_2.fq.gz 0_5_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_1.fq.gz 0_5_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_2.fq.gz 0_5_deltaIJ-24-r3_R2.fq.gz

(Done) Downloading CP059040.fasta and CP059040.gff from GenBank

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT-17-r1 WT-17-r2 WT-17-r3 WT-24-r1 WT-24-r2 WT-24-r3 deltaIJ-17-r1 deltaIJ-17-r2 deltaIJ-17-r3 deltaIJ-24-r1 deltaIJ-24-r2 deltaIJ-24-r3  pre_WT-17-r1 pre_WT-17-r2 pre_WT-17-r3 pre_WT-24-r1 pre_WT-24-r2 pre_WT-24-r3 pre_deltaIJ-17-r1 pre_deltaIJ-17-r2 pre_deltaIJ-17-r3 pre_deltaIJ-24-r1 pre_deltaIJ-24-r2 pre_deltaIJ-24-r3  0_5_WT-17-r1 0_5_WT-17-r2 0_5_WT-17-r3 0_5_WT-24-r1 0_5_WT-24-r2 0_5_WT-24-r3 0_5_deltaIJ-17-r1 0_5_deltaIJ-17-r2 0_5_deltaIJ-17-r3 0_5_deltaIJ-24-r1 0_5_deltaIJ-24-r2 0_5_deltaIJ-24-r3; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_17_r1,WT-17-r1_R1.fq.gz,WT-17-r1_R2.fq.gz,auto
WT_17_r2,WT-17-r2_R1.fq.gz,WT-17-r2_R2.fq.gz,auto
WT_17_r3,WT-17-r3_R1.fq.gz,WT-17-r3_R2.fq.gz,auto
WT_24_r1,WT-24-r1_R1.fq.gz,WT-24-r1_R2.fq.gz,auto
WT_24_r2,WT-24-r2_R1.fq.gz,WT-24-r2_R2.fq.gz,auto
WT_24_r3,WT-24-r3_R1.fq.gz,WT-24-r3_R2.fq.gz,auto
deltaIJ_17_r1,deltaIJ-17-r1_R1.fq.gz,deltaIJ-17-r1_R2.fq.gz,auto
deltaIJ_17_r2,deltaIJ-17-r2_R1.fq.gz,deltaIJ-17-r2_R2.fq.gz,auto
deltaIJ_17_r3,deltaIJ-17-r3_R1.fq.gz,deltaIJ-17-r3_R2.fq.gz,auto
deltaIJ_24_r1,deltaIJ-24-r1_R1.fq.gz,deltaIJ-24-r1_R2.fq.gz,auto
deltaIJ_24_r2,deltaIJ-24-r2_R1.fq.gz,deltaIJ-24-r2_R2.fq.gz,auto
deltaIJ_24_r3,deltaIJ-24-r3_R1.fq.gz,deltaIJ-24-r3_R2.fq.gz,auto
pre_WT_17_r1,pre_WT-17-r1_R1.fq.gz,pre_WT-17-r1_R2.fq.gz,auto
pre_WT_17_r2,pre_WT-17-r2_R1.fq.gz,pre_WT-17-r2_R2.fq.gz,auto
pre_WT_17_r3,pre_WT-17-r3_R1.fq.gz,pre_WT-17-r3_R2.fq.gz,auto
pre_WT_24_r1,pre_WT-24-r1_R1.fq.gz,pre_WT-24-r1_R2.fq.gz,auto
pre_WT_24_r2,pre_WT-24-r2_R1.fq.gz,pre_WT-24-r2_R2.fq.gz,auto
pre_WT_24_r3,pre_WT-24-r3_R1.fq.gz,pre_WT-24-r3_R2.fq.gz,auto
pre_deltaIJ_17_r1,pre_deltaIJ-17-r1_R1.fq.gz,pre_deltaIJ-17-r1_R2.fq.gz,auto
pre_deltaIJ_17_r2,pre_deltaIJ-17-r2_R1.fq.gz,pre_deltaIJ-17-r2_R2.fq.gz,auto
pre_deltaIJ_17_r3,pre_deltaIJ-17-r3_R1.fq.gz,pre_deltaIJ-17-r3_R2.fq.gz,auto
pre_deltaIJ_24_r1,pre_deltaIJ-24-r1_R1.fq.gz,pre_deltaIJ-24-r1_R2.fq.gz,auto
pre_deltaIJ_24_r2,pre_deltaIJ-24-r2_R1.fq.gz,pre_deltaIJ-24-r2_R2.fq.gz,auto
pre_deltaIJ_24_r3,pre_deltaIJ-24-r3_R1.fq.gz,pre_deltaIJ-24-r3_R2.fq.gz,auto
0_5_WT_17_r1,0_5_WT-17-r1_R1.fq.gz,0_5_WT-17-r1_R2.fq.gz,auto
0_5_WT_17_r2,0_5_WT-17-r2_R1.fq.gz,0_5_WT-17-r2_R2.fq.gz,auto
0_5_WT_17_r3,0_5_WT-17-r3_R1.fq.gz,0_5_WT-17-r3_R2.fq.gz,auto
0_5_WT_24_r1,0_5_WT-24-r1_R1.fq.gz,0_5_WT-24-r1_R2.fq.gz,auto
0_5_WT_24_r2,0_5_WT-24-r2_R1.fq.gz,0_5_WT-24-r2_R2.fq.gz,auto
0_5_WT_24_r3,0_5_WT-24-r3_R1.fq.gz,0_5_WT-24-r3_R2.fq.gz,auto
0_5_deltaIJ_17_r1,0_5_deltaIJ-17-r1_R1.fq.gz,0_5_deltaIJ-17-r1_R2.fq.gz,auto
0_5_deltaIJ_17_r2,0_5_deltaIJ-17-r2_R1.fq.gz,0_5_deltaIJ-17-r2_R2.fq.gz,auto
0_5_deltaIJ_17_r3,0_5_deltaIJ-17-r3_R1.fq.gz,0_5_deltaIJ-17-r3_R2.fq.gz,auto
0_5_deltaIJ_24_r1,0_5_deltaIJ-24-r1_R1.fq.gz,0_5_deltaIJ-24-r1_R2.fq.gz,auto
0_5_deltaIJ_24_r2,0_5_deltaIJ-24-r2_R1.fq.gz,0_5_deltaIJ-24-r2_R2.fq.gz,auto
0_5_deltaIJ_24_r3,0_5_deltaIJ-24-r3_R1.fq.gz,0_5_deltaIJ-24-r3_R2.fq.gz,auto

nextflow run

#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

#Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_1 (CDS --> exon in CP059040.gff) --
#Checking the record (see below) in results/genome/CP059040.gtf
#In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
#--featurecounts_feature_type 'transcript' returns only the tRNA results
#Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon

grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!

grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
mv trimmed/*.fq.gz .; rmdir trimmed
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040_m.gff"        -profile docker -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file

Prepare counts_fixed by hand: delete all “””, “gene-“, replace , to ‘\t’.

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

#keep only gene_id
cut -f1 -d',' counts.tsv > f1
cut -f3- -d',' counts.tsv > f3_
paste -d',' f1 f3_ > counts_fixed.tsv

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

#Delete the repliate 2 of ΔadeIJ_two_17 and repliate 1 of ΔadeIJ_two_24 are outlier.
paste -d$'\t' f1_32 f34 f36_ > counts_fixed_2.tsv

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed_2.tsv \
  --samples samples_2.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria_2

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)
library(edgeR)

setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c("WT_17_r1" = "./WT_17_r1/quant.sf",
           "WT_17_r2" = "./WT_17_r2/quant.sf",
           "WT_17_r3" = "./WT_17_r3/quant.sf",
           "WT_24_r1" = "./WT_24_r1/quant.sf",
           "WT_24_r2" = "./WT_24_r2/quant.sf",
           "WT_24_r3" = "./WT_24_r3/quant.sf",
           "deltaIJ_17_r1" = "./deltaIJ_17_r1/quant.sf",
           "deltaIJ_17_r2" = "./deltaIJ_17_r2/quant.sf",
           "deltaIJ_17_r3" = "./deltaIJ_17_r3/quant.sf",
           "deltaIJ_24_r1" = "./deltaIJ_24_r1/quant.sf",
           "deltaIJ_24_r2" = "./deltaIJ_24_r2/quant.sf",
           "deltaIJ_24_r3" = "./deltaIJ_24_r3/quant.sf",
           "pre_WT_17_r1" = "./pre_WT_17_r1/quant.sf",
           "pre_WT_17_r2" = "./pre_WT_17_r2/quant.sf",
           "pre_WT_17_r3" = "./pre_WT_17_r3/quant.sf",
           "pre_WT_24_r1" = "./pre_WT_24_r1/quant.sf",
           "pre_WT_24_r2" = "./pre_WT_24_r2/quant.sf",
           "pre_WT_24_r3" = "./pre_WT_24_r3/quant.sf",
           "pre_deltaIJ_17_r1" = "./pre_deltaIJ_17_r1/quant.sf",
           "pre_deltaIJ_17_r2" = "./pre_deltaIJ_17_r2/quant.sf",
           "pre_deltaIJ_17_r3" = "./pre_deltaIJ_17_r3/quant.sf",
           "pre_deltaIJ_24_r1" = "./pre_deltaIJ_24_r1/quant.sf",
           "pre_deltaIJ_24_r2" = "./pre_deltaIJ_24_r2/quant.sf",
           "pre_deltaIJ_24_r3" = "./pre_deltaIJ_24_r3/quant.sf",
           "0_5_WT_17_r1" = "./0_5_WT_17_r1/quant.sf",
           "0_5_WT_17_r2" = "./0_5_WT_17_r2/quant.sf",
           "0_5_WT_17_r3" = "./0_5_WT_17_r3/quant.sf",
           "0_5_WT_24_r1" = "./0_5_WT_24_r1/quant.sf",
           "0_5_WT_24_r2" = "./0_5_WT_24_r2/quant.sf",
           "0_5_WT_24_r3" = "./0_5_WT_24_r3/quant.sf",
           "0_5_deltaIJ_17_r1" = "./0_5_deltaIJ_17_r1/quant.sf",
           "0_5_deltaIJ_17_r2" = "./0_5_deltaIJ_17_r2/quant.sf",
           "0_5_deltaIJ_17_r3" = "./0_5_deltaIJ_17_r3/quant.sf",
           "0_5_deltaIJ_24_r1" = "./0_5_deltaIJ_24_r1/quant.sf",
           "0_5_deltaIJ_24_r2" = "./0_5_deltaIJ_24_r2/quant.sf",
           "0_5_deltaIJ_24_r3" = "./0_5_deltaIJ_24_r3/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r3",  "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",     "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",      "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
condition <- factor(c("WT_none_17","WT_none_17","WT_none_17","WT_none_24","WT_none_24","WT_none_24", "deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_24","deltaadeIJ_none_24","deltaadeIJ_none_24",   "WT_one_17","WT_one_17","WT_one_17","WT_one_24","WT_one_24","WT_one_24", "deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_24","deltaadeIJ_one_24","deltaadeIJ_one_24",   "WT_two_17","WT_two_17","WT_two_17","WT_two_24","WT_two_24","WT_two_24", "deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_24","deltaadeIJ_two_24","deltaadeIJ_two_24"))
# Construct colData manually
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()

png("pca3.png", 1200, 800)
plotPCA(rld, intgroup=c("replicate"))
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# 1) keep only non-WT samples
#pdat <- subset(pdat, !grepl("^WT_", condition))
# drop unused factor levels so empty WT facets disappear
pdat$condition <- droplevels(pdat$condition)
# 2) pretty condition names: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
png("pca4.png", 1200, 800)
ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# Drop WT_* conditions from the data and from factor levels
pdat <- subset(pdat, !grepl("^WT_", condition))
pdat$condition <- droplevels(pdat$condition)
# Prettify condition labels for the legend: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca5.png", 1200, 800); print(p); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca6.png", 1200, 800); print(p_fac); dev.off()

# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
#png("pca_media.png", 1200, 800)
#plotPCA(rld, intgroup=c("media"))
#dev.off()
#png("pca_strain.png", 1200, 800)
#plotPCA(rld, intgroup=c("strain"))
#dev.off()
#png("pca_time.png", 1200, 800)
#plotPCA(rld, intgroup=c("time"))
#dev.off()

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] WT_none_17         WT_none_17         WT_none_17         WT_none_24
[5] WT_none_24         WT_none_24         deltaadeIJ_none_17 deltaadeIJ_none_17
[9] deltaadeIJ_none_17 deltaadeIJ_none_24 deltaadeIJ_none_24 deltaadeIJ_none_24
[13] WT_one_17          WT_one_17          WT_one_17          WT_one_24
[17] WT_one_24          WT_one_24          deltaadeIJ_one_17  deltaadeIJ_one_17
[21] deltaadeIJ_one_17  deltaadeIJ_one_24  deltaadeIJ_one_24  deltaadeIJ_one_24
[25] WT_two_17          WT_two_17          WT_two_17          WT_two_24
[29] WT_two_24          WT_two_24          deltaadeIJ_two_17  deltaadeIJ_two_17
[33] deltaadeIJ_two_17  deltaadeIJ_two_24  deltaadeIJ_two_24  deltaadeIJ_two_24
12 Levels: deltaadeIJ_none_17 deltaadeIJ_none_24 ... WT_two_24

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Construct colData automatically
sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("genotype", "exposure", "time")
colData <- cbind(sample_table, split_cond)
colData$genotype <- factor(colData$genotype)
colData$exposure  <- factor(colData$exposure)
colData$time   <- factor(colData$time)
colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
# Construct colData manually
colData2 <- data.frame(condition=condition, row.names=names(files))

# 确保因子顺序（可选）
colData$genotype <- relevel(factor(colData$genotype), ref = "WT")
colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
colData$time   <- relevel(factor(colData$time), ref = "17")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * exposure * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
[1] "Intercept"
[2] "genotype_deltaadeIJ_vs_WT"
[3] "exposure_one_vs_none"
[4] "exposure_two_vs_none"
[5] "time_24_vs_17"
[6] "genotypedeltaadeIJ.exposureone"
[7] "genotypedeltaadeIJ.exposuretwo"
[8] "genotypedeltaadeIJ.time24"
[9] "exposureone.time24"
[10] "exposuretwo.time24"
[11] "genotypedeltaadeIJ.exposureone.time24"
[12] "genotypedeltaadeIJ.exposuretwo.time24"

# 提取 genotype 的主效应: up 10, down 4
contrast <- "genotype_deltaadeIJ_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 one exposure 的主效应: up 196; down 298
contrast <- "exposure_one_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 two exposure 的主效应: up 80; down 105
contrast <- "exposure_two_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 10; down 2
contrast <- "time_24_vs_17"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  ΔadeIJ_none 17h vs WT_none 17h
#2.)  ΔadeIJ_none 24h vs WT_none 24h
#3.)  ΔadeIJ_one 17h vs WT_one 17h
#4.)  ΔadeIJ_one 24h vs WT_one 24h
#5.)  ΔadeIJ_two 17h vs WT_two 17h
#6.)  ΔadeIJ_two 24h vs WT_two 24h

#---- relevel to control ----
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_17_vs_WT_none_17")

dds$condition <- relevel(dds$condition, "WT_none_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_WT_none_24")

dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_17_vs_WT_one_17")

dds$condition <- relevel(dds$condition, "WT_one_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_WT_one_24")

dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_17_vs_WT_two_17")

dds$condition <- relevel(dds$condition, "WT_two_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_WT_two_24")

# WT_none_xh
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_none_24_vs_WT_none_17")

# WT_one_xh
dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_one_24_vs_WT_one_17")

# WT_two_xh
dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_two_24_vs_WT_two_17")

# deltaadeIJ_none_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_deltaadeIJ_none_17")

# deltaadeIJ_one_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_deltaadeIJ_one_17")

# deltaadeIJ_two_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_deltaadeIJ_two_17")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP059040_m.gff > CP059040_gene.gff

for cmp in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
#deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
#deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
#deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)

(NOT_PERFORMED) Volcano plots

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
            emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  470.IX87_14445:
                * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
                * IX87_14445 would refer to a specific gene or protein within that genome.

        Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
    * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            QBlast finished with warnings!
            Blasted Sequences: 2084
            Sequences without results: 105
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP020463_protein) Done"
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
            "Mapping (CP020463_protein) Done"
            "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
            "Annotation (CP020463_protein) Done"
            "Annotation finished."
    or blast2go_cli_v1.5.1 (NOT_USED)

            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
        * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
            "InterProScan (CP020463_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
        * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
        #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
    File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
            #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0085020
        #-- after merging (blast2go_annot.annot2) -->
        #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0070531
        #H0N29_18790     GO:0004842
        #H0N29_18790     GO:0005515
        #H0N29_18790     GO:0085020

        cp blast2go_annot.annot blast2go_annot.annot2

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all 6 input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(

        "deltaadeIJ_none_17_vs_WT_none_17-all.csv",  #up 11, down 3 vs. (10,4)
        "deltaadeIJ_none_24_vs_WT_none_24-all.csv",  #up 0, down 2 vs. (0,2)
        "deltaadeIJ_one_17_vs_WT_one_17-all.csv",    #up 238, down 90 vs. (239,89)  --> height 2600
        "deltaadeIJ_one_24_vs_WT_one_24-all.csv",    #up 83, down 64 vs. (64,71) --> height 1600
        "deltaadeIJ_two_17_vs_WT_two_17-all.csv",    #up 74, down 14 vs. (75,9) --> height 1000
        "deltaadeIJ_two_24_vs_WT_two_24-all.csv",    #up 1, down 3 vs. (3,3)

        "WT_none_24_vs_WT_none_17-all.csv",  #(up 10, down 2)
        "WT_one_24_vs_WT_one_17-all.csv",    #(up 97, down 3)
        "WT_two_24_vs_WT_two_17-all.csv",    #(up 12, down 1)

        "deltaadeIJ_two_24_vs_deltaadeIJ_two_17-all.csv",   #(up 8, down 51)
        "deltaadeIJ_one_24_vs_deltaadeIJ_one_17-all.csv",   #(up 0, down 10)
        "deltaadeIJ_none_24_vs_deltaadeIJ_none_17-all.csv" #(up 0, down 0)

        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
          # Prepare KO lists (remove "ko:" if present)
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
          k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }

        # -------------------------------- OLD_CODE not automatized with loop ----------------------------
        # Load the results
        res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
        res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

        res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
        res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
        res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
        res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

        res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
        res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )

        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()

        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

        # Filter up-regulated genes
        up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.05, ]
        # Filter down-regulated genes
        down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.05, ]

        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete")
        writeData(wb, "Complete_Data", res_updated)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
        #NOTE: The generated annotation-files contains all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) + almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs) except for -[GOs, EC] + two columns from Blast2GO (COs, EC); In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

        #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
        ## Set the 'GeneName' column as row.names
        #rownames(res_updated) <- res_updated$GeneName
        ## Drop the 'GeneName' column since it's now the row names
        #res_updated$GeneName <- NULL
        # -- BREAK_1 --

        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # -- BREAK_2 --

        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings

        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)

        # Create a new workbook
        #wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes

        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
        # -- BREAK_3 --

        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)

        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)

        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)

        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })

        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })

        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

        # Save the workbook with enrichment results
        saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

        #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
        #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
        #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!

        #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
        #Mac_vs_LB
        #LB.AB_vs_LB.WT19606
        #LB.IJ_vs_LB.WT19606
        #LB.W1_vs_LB.WT19606
        #LB.Y1_vs_LB.WT19606
        #Mac.AB_vs_Mac.WT19606
        #Mac.IJ_vs_Mac.WT19606
        #Mac.W1_vs_Mac.WT19606
        #Mac.Y1_vs_Mac.WT19606

        #TODO: write reply hints in KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.

10.5. (DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

            library(openxlsx)

            # Function to read and clean gene ID files
            read_gene_ids <- function(file_path) {
            # Read the gene IDs from the file
            gene_ids <- readLines(file_path)

            # Remove any quotes and trim whitespaces
            gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
            gene_ids <- trimws(gene_ids)  # Trim whitespaces

            # Remove empty entries or NAs
            gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

            return(gene_ids)
            }

            # Example list of LB files with both -up.id and -down.id for each condition
            lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                            "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
            lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                            "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

            # Combine both up and down files for each condition
            lb_files <- c(lb_files_up, lb_files_down)

            # Read gene IDs for each file in LB group
            #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
            lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

            lb_degs_ <- list()
            combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
            #unique_combined_set <- unique(combined_set)
            lb_degs_$AB <- combined_set
            combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
            lb_degs_$IJ <- combined_set
            combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
            lb_degs_$W1 <- combined_set
            combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
            lb_degs_$Y1 <- combined_set

            # Example list of Mac files with both -up.id and -down.id for each condition
            mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                            "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
            mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                            "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

            # Combine both up and down files for each condition in Mac group
            mac_files <- c(mac_files_up, mac_files_down)

            # Read gene IDs for each file in Mac group
            mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

            mac_degs_ <- list()
            combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
            mac_degs_$AB <- combined_set
            combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
            mac_degs_$IJ <- combined_set
            combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
            mac_degs_$W1 <- combined_set
            combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
            mac_degs_$Y1 <- combined_set

            # Function to clean sheet names to ensure no sheet name exceeds 31 characters
            truncate_sheet_name <- function(names_list) {
            sapply(names_list, function(name) {
            if (nchar(name) > 31) {
            return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
            }
            return(name)
            })
            }

            # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

            # Define intersections between different conditions for LB
            inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
            inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
            inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
            inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
            inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
            inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

            # Define intersections between three conditions for LB
            inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
            inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
            inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
            inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Define intersection between all four conditions for LB
            inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Now remove the intersected genes from each original set for LB
            venn_list_lb <- list()

            # For LB.AB, remove genes that are also in other conditions
            venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

            # For LB.IJ, remove genes that are also in other conditions
            venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

            # For LB.W1, remove genes that are also in other conditions
            venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))

            # For LB.Y1, remove genes that are also in other conditions
            venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))

            # Add the intersections for LB (same as before)
            venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
            venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
            venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
            venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
            venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
            venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

            # Define intersections between three conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
            venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
            venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

            # Define intersection between all four conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

            # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

            # Define intersections between different conditions
            inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
            inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
            inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
            inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
            inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
            inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

            # Define intersections between three conditions
            inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
            inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
            inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
            inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Define intersection between all four conditions
            inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Now remove the intersected genes from each original set
            venn_list_mac <- list()

            # For Mac.AB, remove genes that are also in other conditions
            venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

            # For Mac.IJ, remove genes that are also in other conditions
            venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

            # For Mac.W1, remove genes that are also in other conditions
            venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))

            # For Mac.Y1, remove genes that are also in other conditions
            venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))

            # Add the intersections (same as before)
            venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
            venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
            venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
            venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
            venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

            # Define intersections between three conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
            venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

            # Define intersection between all four conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

            # Save the gene IDs to Excel for further inspection (optional)
            write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
            write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

            # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
            write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
            write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

            # Venn Diagram for LB group
            venn1 <- ggvenn(lb_degs_,
                            fill_color = c("skyblue", "tomato", "gold", "orchid"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

            # Venn Diagram for Mac group
            venn2 <- ggvenn(mac_degs_,
                            fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

            cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
# -- IMPORTANT_ADAPTION: the script by chaning "HJI06_" with "H0N29_"
for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config ---------------------------------------------------
contrast <- "deltaadeIJ_none_17_vs_WT_none_17"  #up 11, down 3 vs. (10,4) --> height 600 heatmap_pattern1
contrast <- "deltaadeIJ_none_24_vs_WT_none_24"  #up 0, down 2 vs. (0,2) --> height 600 pattern1
contrast <- "deltaadeIJ_one_17_vs_WT_one_17"    #up 238, down 90 vs. (239,89)  --> height 4000 pattern2
contrast <- "deltaadeIJ_one_24_vs_WT_one_24"    #up 83, down 64 vs. (64,71) --> height 1800 pattern2
contrast <- "deltaadeIJ_two_17_vs_WT_two_17"    #up 74, down 14 vs. (75,9) --> height 1100 pattern2
contrast <- "deltaadeIJ_two_24_vs_WT_two_24"    #up 1, down 3 vs. (3,3) --> height 600 pattern1

contrast <- "WT_none_24_vs_WT_none_17"  #(up 10, down 2) --> height 600 pattern1
contrast <- "WT_one_24_vs_WT_one_17"    #(up 97, down 3) --> height 1400 pattern2
contrast <- "WT_two_24_vs_WT_two_17"    #(up 12, down 1) --> height 600 pattern1
contrast <- "deltaadeIJ_none_24_vs_deltaadeIJ_none_17" #(up 0, down 0)
contrast <- "deltaadeIJ_one_24_vs_deltaadeIJ_one_17"   #(up 0, down 10) --> height 600 pattern1
contrast <- "deltaadeIJ_two_24_vs_deltaadeIJ_two_17"   #(up 8, down 51) --> height 1000 pattern2

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- RUN the code with the new contract from HERE after first run --

## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
#GOI <- GOI_down
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "
```
-all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } # Column labels: 'deltaadeIJ' -> ‘ΔadeIJ’ and nicer spacing labCol_pretty <- colnames(datamat) labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> “WT TSB 2h r1” # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) ## 8) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] #BREAK_LINE labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 20), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = labRow_pretty, # row labels WITHOUT "gene-" labCol = labCol_pretty, # col labels with Δsbp + spaces cexRow = 2.5, cexCol = 2.5, srtCol = 20, lhei = c(0.6, 4), # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' lwid = c(0.8, 4)) # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' dev.off() # DEBUG for some items starting with "gene-" labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 1000) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.4, # ↓ smaller column label font (was 1.3) cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples ---------------------- ## ============================================================ ## Three-condition DEGs heatmap from multiple pairwise contrasts ## Example contrasts: ## "WT_MH_4h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_4h" ## Output shows the union of DEGs across all contrasts and ## only the columns (samples) for the 3 conditions. ## ============================================================ ## -------- 0) User inputs ------------------------------------ # --> NOT_USED since no three time point comparison exists! #contrasts <- c( # "WT_MH_4h_vs_WT_MH_2h", # "WT_MH_18h_vs_WT_MH_2h", # "WT_MH_18h_vs_WT_MH_4h" #) ### Optionally force a condition display order (defaults to order of first appearance) #cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h") ##cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltaadeIJ", "\u0394adeIJ", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE # -- RUN the code with the new contract from HERE after first run -- ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message(“Removing “, sum(!row_ok), ” constant/NA rows.”) datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- rownames(datamat) # ---- Replace GeneID with GeneName from " -all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 2600) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples END ---------------------- # -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if (nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Processing Data_JuliaFuchs_RNAseq_2025 v3

Leave a reply

Targets

> Which genes are differentially expressed between the conditions for each time point.
> Also, from our pulldown experiment, we identified several potential
> target genes, and I’d be particularly interested to see if there are
> expression changes for those in the RNA-seq data. I’ll include the
> list of targets below once it’s ready.
简要结论：莫西沙星（Moxifloxacin）= 抗生素；丝裂霉素C（Mitomycin C）= 临床上不作为抗感染用抗生素。
莫西沙星（Moxifloxacin）：第四代氟喹诺酮类抗生素，用于治疗细菌感染（如呼吸道、皮肤等）。作用机制是抑制DNA旋转酶和拓扑异构酶IV，阻断细菌DNA复制。
丝裂霉素C（Mitomycin C）：本质上是来源于链霉菌的一类“抗肿瘤抗生素”，通过烷基化DNA 造成交联破坏，因此主要用于肿瘤化疗，以及少数局部应用（眼科/耳鼻喉科术中抑制瘢痕增生等）。尽管名字里有“抗生素”，也确有抗菌活性，但全身毒性过大，不用于治疗感染。

> Additionally, I have a specific question regarding the toxin–antitoxin
> system I’m studying. The toxin and antitoxin genes are:
>
> Toxin:
> ttatttacaatgcctcttgatccatgtctcaattccctcaagagtaagatttttgtcgtttactactcttaaagtaaactgaaccgcttcatcttgagtgcattcaaaattaatactatttaacttcaaaaatattaccatagatgtaaaagctgttcttttattcgcattatggaatgcgtgcttttgagctatatttctatatataaaagctgcttttctctcgattgtttcatatagttcaactccaccgaatgattgtttaactccttcaatagtagcattaagaacttctggaactttaacaccaacttgttcttttggtgagaaatcttgtattgcttttacattaatggcaatcacttgtttttcagttaaatatttagtgctttgcat
>
> Antitoxin:
> ttataagtcaaccatcctttttaaagcttggttatactcagtgaatgtttcatccaacaatttaaaaaactcctcgtcctctcttacctccttttcgatggttactttattatcttttacattaaatttaagattatcaccatttgatattccgagtgctgcgatcacttctgtcggtacagaaacaactgaactattaccagcttttcttagttttcttgtagtaatcat
>
> I’m wondering whether you can check if these two genes might share the
> same promoter and whether any RNA-seq signal supports their
> co-expression.

Download and prepare raw data

# ---- Dataset_1 ----
aws configure
> Aws_access_key_id：AKIAYWZZRVKWTQDI4CHT
> Aws_secret_access_key：hbFnZYBlNc1QP6hjm8fpCIXQsvUhLvWTBAaonH8D
>
>
aws s3 cp s3://staefgap-598731762349/ ./ --recursive  #S3 Bucket

# ---- Dataset_2 ----
aws configure
> Aws_access_key_id：AKIAYWZZRVKWXL5FYUBC
> Aws_secret_access_key：Nb9PMn3FywZ7UT4FOkVYPi0HFmk/S3uSCX/D9kmx
>
>
aws s3 cp s3://stavoupp-598731762349/ ./ --recursive  #S3 Bucket

mkdir raw_data; cd raw_data

ln -s ../F25A430001462_STAvoupP/1a_untreated_4h/1a_untreated_4h_1.fq.gz Untreated_4h_1a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1a_untreated_4h/1a_untreated_4h_2.fq.gz Untreated_4h_1a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1b_untreated_4h/1b_untreated_4h_1.fq.gz Untreated_4h_1b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1b_untreated_4h/1b_untreated_4h_2.fq.gz Untreated_4h_1b_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1c_untreated_4h/1c_untreated_4h_1.fq.gz Untreated_4h_1c_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1c_untreated_4h/1c_untreated_4h_2.fq.gz Untreated_4h_1c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1d_untreated_8h/1d_untreated_8h_1.fq.gz Untreated_8h_1d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1d_untreated_8h/1d_untreated_8h_2.fq.gz Untreated_8h_1d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1e_untreated_8h/1e_untreated_8h_1.fq.gz Untreated_8h_1e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1e_untreated_8h/1e_untreated_8h_2.fq.gz Untreated_8h_1e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1f_untreated_8h/1f_untreated_8h_1.fq.gz Untreated_8h_1f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1f_untreated_8h/1f_untreated_8h_2.fq.gz Untreated_8h_1f_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1g_untreated18h/1g_untreated18h_1.fq.gz Untreated_18h_1g_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1g_untreated18h/1g_untreated18h_2.fq.gz Untreated_18h_1g_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/1h_untreated18h/1h_untreated18h_1.fq.gz Untreated_18h_1h_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/1h_untreated18h/1h_untreated18h_2.fq.gz Untreated_18h_1h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1i_untreated18h/1i_untreated18h_1.fq.gz Untreated_18h_1i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1i_untreated18h/1i_untreated18h_2.fq.gz Untreated_18h_1i_R2.fastq.gz

ln -s ../F25A430001462_STAvoupP/2a_Mitomycin_4h/2a_Mitomycin_4h_1.fq.gz Mitomycin_4h_2a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2a_Mitomycin_4h/2a_Mitomycin_4h_2.fq.gz Mitomycin_4h_2a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2b_Mitomycin_4h/2b_Mitomycin_4h_1.fq.gz Mitomycin_4h_2b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2b_Mitomycin_4h/2b_Mitomycin_4h_2.fq.gz Mitomycin_4h_2b_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/2c_Mitomycin_4h/2c_Mitomycin_4h_1.fq.gz Mitomycin_4h_2c_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/2c_Mitomycin_4h/2c_Mitomycin_4h_2.fq.gz Mitomycin_4h_2c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2d_Mitomycin_8h/2d_Mitomycin_8h_1.fq.gz Mitomycin_8h_2d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2d_Mitomycin_8h/2d_Mitomycin_8h_2.fq.gz Mitomycin_8h_2d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2e_Mitomycin_8h/2e_Mitomycin_8h_1.fq.gz Mitomycin_8h_2e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2e_Mitomycin_8h/2e_Mitomycin_8h_2.fq.gz Mitomycin_8h_2e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2f_Mitomycin_8h/2f_Mitomycin_8h_1.fq.gz Mitomycin_8h_2f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2f_Mitomycin_8h/2f_Mitomycin_8h_2.fq.gz Mitomycin_8h_2f_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2g_Mitomycin18h/2g_Mitomycin18h_1.fq.gz Mitomycin_18h_2g_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2g_Mitomycin18h/2g_Mitomycin18h_2.fq.gz Mitomycin_18h_2g_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2h_Mitomycin18h/2h_Mitomycin18h_1.fq.gz Mitomycin_18h_2h_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2h_Mitomycin18h/2h_Mitomycin18h_2.fq.gz Mitomycin_18h_2h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2i_Mitomycin18h/2i_Mitomycin18h_1.fq.gz Mitomycin_18h_2i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2i_Mitomycin18h/2i_Mitomycin18h_2.fq.gz Mitomycin_18h_2i_R2.fastq.gz

ln -s ../F25A430001462_STAvoupP/3a_Moxi_4h/3a_Moxi_4h_1.fq.gz Moxi_4h_3a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3a_Moxi_4h/3a_Moxi_4h_2.fq.gz Moxi_4h_3a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3b_Moxi_4h/3b_Moxi_4h_1.fq.gz Moxi_4h_3b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3b_Moxi_4h/3b_Moxi_4h_2.fq.gz Moxi_4h_3b_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3c_Moxi_4h/3c_Moxi_4h_1.fq.gz Moxi_4h_3c_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3c_Moxi_4h/3c_Moxi_4h_2.fq.gz Moxi_4h_3c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3d_Moxi_8h/3d_Moxi_8h_1.fq.gz Moxi_8h_3d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3d_Moxi_8h/3d_Moxi_8h_2.fq.gz Moxi_8h_3d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3e_Moxi_8h/3e_Moxi_8h_1.fq.gz Moxi_8h_3e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3e_Moxi_8h/3e_Moxi_8h_2.fq.gz Moxi_8h_3e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3f_Moxi_8h/3f_Moxi_8h_1.fq.gz Moxi_8h_3f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3f_Moxi_8h/3f_Moxi_8h_2.fq.gz Moxi_8h_3f_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/3g_Moxi_18h/3g_Moxi_18h_1.fq.gz Moxi_18h_3g_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/3g_Moxi_18h/3g_Moxi_18h_2.fq.gz Moxi_18h_3g_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3h_Moxi_18h/3h_Moxi_18h_1.fq.gz Moxi_18h_3h_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3h_Moxi_18h/3h_Moxi_18h_2.fq.gz Moxi_18h_3h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3i_Moxi_18h/3i_Moxi_18h_1.fq.gz Moxi_18h_3i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3i_Moxi_18h/3i_Moxi_18h_2.fq.gz Moxi_18h_3i_R2.fastq.gz

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in Untreated_4h_1a Untreated_4h_1a Untreated_4h_1b Untreated_4h_1b Untreated_4h_1c Untreated_4h_1c Untreated_8h_1d Untreated_8h_1d Untreated_8h_1e Untreated_8h_1e Untreated_8h_1f Untreated_8h_1f Untreated_18h_1g Untreated_18h_1g Untreated_18h_1h Untreated_18h_1h Untreated_18h_1i Untreated_18h_1i  Mitomycin_4h_2a Mitomycin_4h_2a Mitomycin_4h_2b Mitomycin_4h_2b Mitomycin_4h_2c Mitomycin_4h_2c Mitomycin_8h_2d Mitomycin_8h_2d Mitomycin_8h_2e Mitomycin_8h_2e Mitomycin_8h_2f Mitomycin_8h_2f Mitomycin_18h_2g Mitomycin_18h_2g Mitomycin_18h_2h Mitomycin_18h_2h Mitomycin_18h_2i Mitomycin_18h_2i  Moxi_4h_3a Moxi_4h_3a Moxi_4h_3b Moxi_4h_3b Moxi_4h_3c Moxi_4h_3c Moxi_8h_3d Moxi_8h_3d Moxi_8h_3e Moxi_8h_3e Moxi_8h_3f Moxi_8h_3f Moxi_18h_3g Moxi_18h_3g Moxi_18h_3h Moxi_18h_3h Moxi_18h_3i Moxi_18h_3i; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done
mv trimmed/*.fastq.gz .

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
Untreated_4h_1a,Untreated_4h_1a_R1.fastq.gz,Untreated_4h_1a_R2.fastq.gz,auto
Untreated_4h_1a,Untreated_4h_1a_R1.fastq.gz,Untreated_4h_1a_R2.fastq.gz,auto
...

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP052959.gff) --
grep -P "\texon\t" CP052959.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP052959.gff | wc -l   #=10  signal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP052959.gff | wc -l    #=10  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP052959.gff | wc -l    #61  tRNA
grep -P "\tCDS\t" CP052959.gff | wc -l  #2581
sed 's/\tCDS\t/\texon\t/g' CP052959.gff > CP052959_m.gff
grep -P "\texon\t" CP052959_m.gff | sort | wc -l  #2662 (81 more comparing with 'CDS')

# -- NOTE that combination of 'CP052959_m.gff' and 'exon' in the command will result in ERROR, using 'transcript' instead in the command line!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP052959_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_JuliaFuchs_RNAseq/CP052959.fasta" --gff "/home/jhuang/DATA/Data_JuliaFuchs_RNAseq/CP052959_m.gff"        -profile docker -resume  --max_cpus 100 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file, both are "CP052959.1"

Generate advanced PCA-plot

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

#keep only gene_id
cut -f1 -d',' counts.tsv > f1
cut -f3- -d',' counts.tsv > f3_
paste -d',' f1 f3_ > counts_fixed.tsv

#IMPORTANT_EDIT:  delete all """, "gene-", replace ',' to '\t' in counts_fixed.tsv.
#IMPORTANT_ENV: mamba activate r_env
#IMPORTANT_NOTE: rownames of samples.tsv and columns of counts.tsv should algin!!!!
Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_JuliaFuchs_RNAseq_2025/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)
library(edgeR)

setwd("~/DATA/Data_JuliaFuchs_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files

files <- c("Untreated_4h_r1" = "./Untreated_4h_1a/quant.sf",
           "Untreated_4h_r2" = "./Untreated_4h_1b/quant.sf",
           "Untreated_4h_r3" = "./Untreated_4h_1c/quant.sf",
           "Untreated_8h_r1" = "./Untreated_8h_1d/quant.sf",
           "Untreated_8h_r2" = "./Untreated_8h_1e/quant.sf",
           "Untreated_8h_r3" = "./Untreated_8h_1f/quant.sf",
           "Untreated_18h_r1" = "./Untreated_18h_1g/quant.sf",
           "Untreated_18h_r2" = "./Untreated_18h_1h/quant.sf",
           "Untreated_18h_r3" = "./Untreated_18h_1i/quant.sf",
           "Mitomycin_4h_r1" = "./Mitomycin_4h_2a/quant.sf",
           "Mitomycin_4h_r2" = "./Mitomycin_4h_2b/quant.sf",
           "Mitomycin_4h_r3" = "./Mitomycin_4h_2c/quant.sf",
           "Mitomycin_8h_r1" = "./Mitomycin_8h_2d/quant.sf",
           "Mitomycin_8h_r2" = "./Mitomycin_8h_2e/quant.sf",
           "Mitomycin_8h_r3" = "./Mitomycin_8h_2f/quant.sf",
           "Mitomycin_18h_r1" = "./Mitomycin_18h_2g/quant.sf",
           "Mitomycin_18h_r2" = "./Mitomycin_18h_2h/quant.sf",
           "Mitomycin_18h_r3" = "./Mitomycin_18h_2i/quant.sf",
           "Moxi_4h_r1" = "./Moxi_4h_3a/quant.sf",
           "Moxi_4h_r2" = "./Moxi_4h_3b/quant.sf",
           "Moxi_4h_r3" = "./Moxi_4h_3c/quant.sf",
           "Moxi_8h_r1" = "./Moxi_8h_3d/quant.sf",
           "Moxi_8h_r2" = "./Moxi_8h_3e/quant.sf",
           "Moxi_8h_r3" = "./Moxi_8h_3f/quant.sf",
           "Moxi_18h_r1" = "./Moxi_18h_3g/quant.sf",
           "Moxi_18h_r2" = "./Moxi_18h_3h/quant.sf",
           "Moxi_18h_r3" = "./Moxi_18h_3i/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r3",  "r1", "r2", "r3", "r1", "r2", "r3",    "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",       "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
condition <- factor(c("Untreated_4h","Untreated_4h","Untreated_4h","Untreated_8h","Untreated_8h","Untreated_8h","Untreated_18h","Untreated_18h","Untreated_18h", "Mitomycin_4h","Mitomycin_4h","Mitomycin_4h","Mitomycin_8h","Mitomycin_8h","Mitomycin_8h","Mitomycin_18h","Mitomycin_18h","Mitomycin_18h", "Moxi_4h","Moxi_4h","Moxi_4h","Moxi_8h","Moxi_8h","Moxi_8h","Moxi_18h","Moxi_18h","Moxi_18h"))
# Construct colData manually
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()

png("pca3.png", 1200, 800)
plotPCA(rld, intgroup=c("replicate"))
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# 1) keep only non-WT samples
#pdat <- subset(pdat, !grepl("^WT_", condition))
# drop unused factor levels so empty WT facets disappear
pdat$condition <- droplevels(pdat$condition)
# 2) pretty condition names: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltasbp", "\u0394sbp", pdat$condition)
png("pca4.png", 1200, 800)
ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# Drop WT_* conditions from the data and from factor levels
pdat <- subset(pdat, !grepl("^WT_", condition))
pdat$condition <- droplevels(pdat$condition)
# Prettify condition labels for the legend: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltasbp", "\u0394sbp", pdat$condition)
p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca5.png", 1200, 800); print(p); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca6.png", 1200, 800); print(p_fac); dev.off()

# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
#png("pca_media.png", 1200, 800)
#plotPCA(rld, intgroup=c("media"))
#dev.off()
#png("pca_strain.png", 1200, 800)
#plotPCA(rld, intgroup=c("strain"))
#dev.off()
#png("pca_time.png", 1200, 800)
#plotPCA(rld, intgroup=c("time"))
#dev.off()

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
#  [1] Untreated_4h  Untreated_4h  Untreated_4h  Untreated_8h  Untreated_8h
#  [6] Untreated_8h  Untreated_18h Untreated_18h Untreated_18h Mitomycin_4h
#  [11] Mitomycin_4h  Mitomycin_4h  Mitomycin_8h  Mitomycin_8h  Mitomycin_8h
#  [16] Mitomycin_18h Mitomycin_18h Mitomycin_18h Moxi_4h       Moxi_4h
#  [21] Moxi_4h       Moxi_8h       Moxi_8h       Moxi_8h       Moxi_18h
#  [26] Moxi_18h      Moxi_18h
#  9 Levels: Mitomycin_18h Mitomycin_4h Mitomycin_8h Moxi_18h Moxi_4h ... Untreated_8h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Construct colData automatically
sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
#colnames(split_cond) <- c("genotype", "exposure", "time")
colnames(split_cond) <- c("genotype", "time")
colData <- cbind(sample_table, split_cond)
colData$genotype <- factor(colData$genotype)
#colData$exposure  <- factor(colData$exposure)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
colData$group  <- factor(paste(colData$genotype, colData$time, sep = "_"))
colData2 <- data.frame(condition=condition, row.names=names(files))

# 确保因子顺序（可选）
colData$genotype <- relevel(factor(colData$genotype), ref = "Untreated")
#colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
colData$time   <- relevel(factor(colData$time), ref = "4h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                       "genotype_Mitomycin_vs_Untreated"
#[3] "genotype_Moxi_vs_Untreated"      "time_18h_vs_4h"
#[5] "time_8h_vs_4h"                   "genotypeMitomycin.time18h"
#[7] "genotypeMoxi.time18h"            "genotypeMitomycin.time8h"
#[9] "genotypeMoxi.time8h"

#Mitomycin（丝裂霉素）通常特指丝裂霉素C（Mitomycin C, MMC），是一类来自放线菌（Streptomyces）的抗肿瘤抗生素。它在体内被还原后转化为活性烷化剂，可与DNA发生交联，阻断复制与转录，从而抑制细胞增殖。
#一句话理解：Mitomycin C 是一种能让DNA“粘住”的抗癌药，既可全身化疗，也常被医生小剂量局部用来防止疤痕组织长回来。
# 提取 genotype 的主效应: up 489, down 67
contrast <- "genotype_Mitomycin_vs_Untreated"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#莫西沙星（Moxifloxacin）是一种第四代氟喹诺酮类抗生素，常见商品名如 Avelox（口服/静脉）与 Vigamox（0.5% 眼用滴剂）。
#作用机制: 抑制细菌的DNA 回旋酶（DNA gyrase）和拓扑异构酶 IV，阻断细菌 DNA 复制与修复，属杀菌作用。
# 提取 genotype 的主效应: up 349, down 118
contrast <- "genotype_Moxi_vs_Untreated"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 262; down 51
contrast <- "time_18h_vs_4h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 90; down 18
contrast <- "time_8h_vs_4h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

colData$genotype <- relevel(factor(colData$genotype), ref = "Moxi")
colData$time   <- relevel(factor(colData$time), ref = "8h")
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                  "genotype_Untreated_vs_Moxi"
#[3] "genotype_Mitomycin_vs_Moxi" "time_4h_vs_8h"
#[5] "time_18h_vs_8h"             "genotypeUntreated.time4h"
#[7] "genotypeMitomycin.time4h"   "genotypeUntreated.time18h"
#[9] "genotypeMitomycin.time18h"

# 提取 genotype 的主效应: up 361, down 6
contrast <- "genotype_Mitomycin_vs_Moxi"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 15; down 3
contrast <- "time_18h_vs_8h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  Moxi_4h_vs_Untreated_4h
#2.)  Mitomycin_4h_vs_Untreated_4h
#3.)  Moxi_8h_vs_Untreated_8h
#4.)  Mitomycin_8h_vs_Untreated_8h
#5.)  Moxi_18h_vs_Untreated_18h
#6.)  Mitomycin_18h_vs_Untreated_18h

#---- relevel to control ----
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "Untreated_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_4h_vs_Untreated_4h", "Mitomycin_4h_vs_Untreated_4h")

dds$condition <- relevel(dds$condition, "Untreated_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_8h_vs_Untreated_8h", "Mitomycin_8h_vs_Untreated_8h")

dds$condition <- relevel(dds$condition, "Untreated_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Untreated_18h", "Mitomycin_18h_vs_Untreated_18h")

# Mitomycin_xh
dds$condition <- relevel(dds$condition, "Mitomycin_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Mitomycin_18h_vs_Mitomycin_4h", "Mitomycin_8h_vs_Mitomycin_4h")

dds$condition <- relevel(dds$condition, "Mitomycin_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Mitomycin_18h_vs_Mitomycin_8h")

# Moxi_xh
dds$condition <- relevel(dds$condition, "Moxi_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Moxi_4h", "Moxi_8h_vs_Moxi_4h")

dds$condition <- relevel(dds$condition, "Moxi_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Moxi_8h")

# Untreated_xh
dds$condition <- relevel(dds$condition, "Untreated_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Untreated_18h_vs_Untreated_4h", "Untreated_8h_vs_Untreated_4h")

dds$condition <- relevel(dds$condition, "Untreated_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Untreated_18h_vs_Untreated_8h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP052959_m.gff > CP052959_gene.gff

#NOTE that the script replace_gene_names.py was improved with a single fallback rule: after the initial mapping, any still empty/NA GeneName will be filled with the GeneID stripped of the gene-/rna- prefix. Nothing else changes.
for cmp in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
#deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
#deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
#deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)

(NOT_PERFORMED) Volcano plots

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("Mitomycin_18h_vs_Untreated_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Mitomycin_18h_vs_Untreated_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Mitomycin_18h_vs_Untreated_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Mitomycin_18h_vs_Untreated_18h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP052959_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\(10\).txt CP052959_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP052959_protein_.fasta CP052959_protein.fasta
            emapper.py -i CP052959_protein.fasta -o eggnog_out --cpu 20  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  470.IX87_14445:
                * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
                * IX87_14445 would refer to a specific gene or protein within that genome.

        Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot and blast2go_annot.annot2): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * jhuang@WS-2290C:~/DATA/Data_JuliaFuchs_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_JuliaFuchs_RNAseq_2025; cp /mnt/md1/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_protein.fasta ~/b2gWorkspace_JuliaFuchs_RNAseq_2025;"
    # ------ STEP_1: 100% Load Sequences (CP052959_protein): done ------
    * Button 'File' --> 'Load' --> 'Load Sequences' --> 'Load Fasta File (.fasta)' Choose a protein sequence file (e.g. CP052959_protein.fasta) (Tags: NONE, generated columns: Nr, SeqName) as input
    # ------ STEP_2: 100% QBlast (CP052959_protein): done with warnings [4-5 days]; similar to DAMIAN and the most time-consuming step is blastn/blastp ------
    * Button 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            -- QBlast (CP052959_protein) Warning! --
            QBlast finished with warnings!
            Blasted Sequences: 2011
            Sequences without results: 99
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP052959_protein) Done"
    # ------ STEP_3: 100% Mapping (CP052959_protein): done [3h56m10s] ------
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names)
            -- Mapping (CP052959_protein) Done --
            "Mapping finished - Please proceed now to annotation."
    # ------ STEP_4: 100% Annotation (CP052959_protein): done [7m56s] ------
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names)
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
            -- Annotation (CP052959_protein) Done --
            "Annotation finished."
    #(NOT_USED) or blast2go_cli_v1.5.1
            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    # ------ STEP_5: 100% Export Annotations (CP052959_protein): done (for before_merging) ------
    + Button 'File' -> 'Export' -> 'Export Annotations' -> 'Export Annotations (.annot, custom, etc.)' as ~/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot.

    + Option 3 (GO Terms from 'Blast2GO 5 Basic' using interpro): Interpro based protein families / domains --> Button interpro, Export Format XML (e.g. HJI06_00260.xml) to Folder "/home/jhuang/b2gWorkspace_JuliaFuchs_RNAseq_2025"
    # ------ STEP_6: 100% InterProSacn (CP052959_protein): done [1d6h41m51s] ------
    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names)
            -- InterProScan Finished, You can now merge the obtained GO Annotations. --
            "InterProScan (CP052959_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    + MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
    # ------ STEP_7: 100% Merge InterProScan GOs to Annotation (CP052959_protein): done [1s] ------
    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            -- Merge InterProScan GOs to Annotation (CP052959_protein) Done --
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
    #* (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

    # ------ STEP_8: 100% Export Annotations (CP052959_protein): done (for after_merging) ------
    + Button 'File' -> 'Export' -> 'Export Annotations' -> 'Export Annotations (.annot, custom, etc.)' as ~/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot2.

    #NOTE that annotations are different between before_merging and after_merging; after_merging has more annotation-items.
    #-- before merging (blast2go_annot.annot) --
    #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0085020
    #None for HJI06_00005

    #-- after merging (blast2go_annot.annot2) -->
    #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0070531
    #H0N29_18790     GO:0004842
    #H0N29_18790     GO:0005515
    #H0N29_18790     GO:0085020

    #HJI06_00005     GO:0005737      chromosomal replication initiator protein DnaA
    #HJI06_00005     GO:0005886
    #HJI06_00005     GO:0003688
    #HJI06_00005     GO:0005524
    #HJI06_00005     GO:0008289
    #HJI06_00005     GO:0016887
    #HJI06_00005     GO:0006270
    #HJI06_00005     GO:0006275
    #HJI06_00005     EC:3.6.1
    #HJI06_00005     EC:3.6
    #HJI06_00005     EC:3
    #HJI06_00005     EC:3.6.1.15

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines at the beginning and END and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_JuliaFuchs_RNAseq_2025$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all 6 input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_JuliaFuchs_RNAseq_2025/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(
          "Mitomycin_18h_vs_Untreated_18h-all.csv",  #up 576, down 307 --> height 11000
          "Mitomycin_8h_vs_Untreated_8h-all.csv",    #up 580, down 201 --> height 11000
          "Mitomycin_4h_vs_Untreated_4h-all.csv",    #up 489, down 67  --> height 6400
          "Moxi_18h_vs_Untreated_18h-all.csv",       #up 472, down 317 --> height 10500
          "Moxi_8h_vs_Untreated_8h-all.csv",         #up 486, down 307 --> height 10500
          "Moxi_4h_vs_Untreated_4h-all.csv",         #up 349, down 118 --> height 6400
          "Untreated_18h_vs_Untreated_4h-all.csv",   #(up 262, down 51)
          "Untreated_18h_vs_Untreated_8h-all.csv",   #(up 124, down 26)
          "Untreated_8h_vs_Untreated_4h-all.csv",     #(up 90, down 18) --> in total 368 --> height 5000
          "Mitomycin_18h_vs_Mitomycin_4h-all.csv",   #(up 161, down 63)
          "Mitomycin_18h_vs_Mitomycin_8h-all.csv",   #(up 61, down 28)
          "Mitomycin_8h_vs_Mitomycin_4h-all.csv",     #(up 47, down 10) --> in total 279 --> height 3500
          "Moxi_18h_vs_Moxi_4h-all.csv",   #(up 141, down 29)
          "Moxi_18h_vs_Moxi_8h-all.csv",   #(up 15, down 3)
          "Moxi_8h_vs_Moxi_4h-all.csv"     #(up 67, down 2) --> in total 196 --> height 2600
        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
          # Prepare KO lists (remove "ko:" if present)
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
          k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }

10.5. (TODO) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

            library(openxlsx)

            # Function to read and clean gene ID files
            read_gene_ids <- function(file_path) {
            # Read the gene IDs from the file
            gene_ids <- readLines(file_path)

            # Remove any quotes and trim whitespaces
            gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
            gene_ids <- trimws(gene_ids)  # Trim whitespaces

            # Remove empty entries or NAs
            gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

            return(gene_ids)
            }

            # Example list of LB files with both -up.id and -down.id for each condition
            lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                            "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
            lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                            "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

            # Combine both up and down files for each condition
            lb_files <- c(lb_files_up, lb_files_down)

            # Read gene IDs for each file in LB group
            #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
            lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

            lb_degs_ <- list()
            combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
            #unique_combined_set <- unique(combined_set)
            lb_degs_$AB <- combined_set
            combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
            lb_degs_$IJ <- combined_set
            combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
            lb_degs_$W1 <- combined_set
            combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
            lb_degs_$Y1 <- combined_set

            # Example list of Mac files with both -up.id and -down.id for each condition
            mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                            "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
            mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                            "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

            # Combine both up and down files for each condition in Mac group
            mac_files <- c(mac_files_up, mac_files_down)

            # Read gene IDs for each file in Mac group
            mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

            mac_degs_ <- list()
            combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
            mac_degs_$AB <- combined_set
            combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
            mac_degs_$IJ <- combined_set
            combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
            mac_degs_$W1 <- combined_set
            combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
            mac_degs_$Y1 <- combined_set

            # Function to clean sheet names to ensure no sheet name exceeds 31 characters
            truncate_sheet_name <- function(names_list) {
            sapply(names_list, function(name) {
            if (nchar(name) > 31) {
            return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
            }
            return(name)
            })
            }

            # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

            # Define intersections between different conditions for LB
            inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
            inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
            inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
            inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
            inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
            inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

            # Define intersections between three conditions for LB
            inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
            inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
            inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
            inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Define intersection between all four conditions for LB
            inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Now remove the intersected genes from each original set for LB
            venn_list_lb <- list()

            # For LB.AB, remove genes that are also in other conditions
            venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

            # For LB.IJ, remove genes that are also in other conditions
            venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

            # For LB.W1, remove genes that are also in other conditions
            venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))

            # For LB.Y1, remove genes that are also in other conditions
            venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))

            # Add the intersections for LB (same as before)
            venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
            venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
            venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
            venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
            venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
            venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

            # Define intersections between three conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
            venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
            venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

            # Define intersection between all four conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

            # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

            # Define intersections between different conditions
            inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
            inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
            inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
            inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
            inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
            inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

            # Define intersections between three conditions
            inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
            inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
            inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
            inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Define intersection between all four conditions
            inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Now remove the intersected genes from each original set
            venn_list_mac <- list()

            # For Mac.AB, remove genes that are also in other conditions
            venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

            # For Mac.IJ, remove genes that are also in other conditions
            venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

            # For Mac.W1, remove genes that are also in other conditions
            venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))

            # For Mac.Y1, remove genes that are also in other conditions
            venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))

            # Add the intersections (same as before)
            venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
            venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
            venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
            venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
            venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

            # Define intersections between three conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
            venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

            # Define intersection between all four conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

            # Save the gene IDs to Excel for further inspection (optional)
            write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
            write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

            # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
            write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
            write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

            # Venn Diagram for LB group
            venn1 <- ggvenn(lb_degs_,
                            fill_color = c("skyblue", "tomato", "gold", "orchid"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

            # Venn Diagram for Mac group
            venn2 <- ggvenn(mac_degs_,
                            fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

            cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

for i in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
for i in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config ---------------------------------------------------
contrast <- "Mitomycin_18h_vs_Untreated_18h"  #up 576, down 307 --> height 11000
contrast <- "Mitomycin_8h_vs_Untreated_8h"    #up 580, down 201 --> height 11000
contrast <- "Mitomycin_4h_vs_Untreated_4h"    #up 489, down 67  --> height 6500
contrast <- "Moxi_18h_vs_Untreated_18h"       #up 472, down 317 --> height 10500
contrast <- "Moxi_8h_vs_Untreated_8h"         #up 486, down 307 --> height 10500
contrast <- "Moxi_4h_vs_Untreated_4h"         #up 349, down 118 --> height 6500

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
#colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- RUN the code with the new contract from HERE after first run --

## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "
```
-all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } # Column labels: 'deltaadeIJ' -> ‘ΔadeIJ’ and nicer spacing labCol_pretty <- colnames(datamat) labCol_pretty <- gsub("^deltasbp", "\u0394sbp", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> “WT TSB 2h r1” # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) ## 8) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] #BREAK_LINE labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 20), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = labRow_pretty, # row labels WITHOUT "gene-" labCol = labCol_pretty, # col labels with Δsbp + spaces cexRow = 2.5, cexCol = 2.5, srtCol = 15, lhei = c(0.6, 4), # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' lwid = c(0.8, 4)) # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' dev.off() # DEBUG for some items starting with "gene-" labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 6500) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.4, # ↓ smaller column label font (was 1.3) cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples ---------------------- ## ============================================================ ## Three-condition DEGs heatmap from multiple pairwise contrasts ## Example contrasts: ## "WT_MH_4h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_4h" ## Output shows the union of DEGs across all contrasts and ## only the columns (samples) for the 3 conditions. ## ============================================================ ## -------- 0) User inputs ------------------------------------ contrasts <- c( "Untreated_18h_vs_Untreated_4h", #(up 262, down 51) "Untreated_18h_vs_Untreated_8h", #(up 124, down 26) "Untreated_8h_vs_Untreated_4h" #(up 90, down 18) --> in total 368 –> height 5000 ) contrasts <- c( "Mitomycin_18h_vs_Mitomycin_4h", #(up 161, down 63) "Mitomycin_18h_vs_Mitomycin_8h", #(up 61, down 28) "Mitomycin_8h_vs_Mitomycin_4h" #(up 47, down 10) --> in total 279 –> height 3500 ) contrasts <- c( "Moxi_18h_vs_Moxi_4h", #(up 141, down 29) "Moxi_18h_vs_Moxi_8h", #(up 15, down 3) "Moxi_8h_vs_Moxi_4h" #(up 67, down 2) --> in total 196 –> height 2600 ) ## Optionally force a condition display order (defaults to order of first appearance) cond_order <- c("Untreated_4h","Untreated_8h","Untreated_18h") cond_order <- c("Mitomycin_4h","Mitomycin_8h","Mitomycin_18h") cond_order <- c("Moxi_4h","Moxi_8h","Moxi_18h") #cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltasbp", "\u0394sbp", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE # -- RUN the code with the new contract from HERE after first run -- ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message(“Removing “, sum(!row_ok), ” constant/NA rows.”) datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- rownames(datamat) # ---- Replace GeneID with GeneName from " -all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 2600) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples END ---------------------- # -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if (nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Processing Data_Michelle_RNAseq_2025 v3

Leave a reply

Targets

The experiment we did so far:
I have two strains:
1. 1457 wildtype
2. 1457Δsbp (sbp knock out strain)

I have grown these two strains in two media for 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase), 18h (mature biofilm phase) respectively
1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth visible (sbp knockout shows less biofilm formation and a growth deficit)
2. medium MH -> nutrient-poor medium: differences between wild type more obvious (sbp knockout shows stronger growth deficit)

Our idea/hypothesis of what we hope to achieve with the RNA-Seq:
Since we already see differences in growth and biofilm formation and also differences in the proteome (through cooperation with mass spectrometry), we also expect differences in the transcription of the genes in the RNA-Seq. Could you analyze the RNA-Seq data for me and compare the strains at the different time points? But maybe also compare the different time points of one strain with each other?
The following would be interesting for me:
- PCA plot (sample comparison)
- Heatmaps (wild type vs. sbp knockout)
- Volcano plots (significant genes)
- Gene Ontology (GO) analyses

Download the raw data

Mail von BGI (RNA-SEQ Institute):
The data from project F25A430000603 are uploaded to AWS.
Please download the data as below:
URL：https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
Project：F25A430000603-01-STAkimxP
Alias ID：598731762349
S3 Bucket：stakimxp-598731762349
Account：stakimxp
Password：qR0'A7[o9Ql|
Region：eu-central-1
Aws_access_key_id：AKIAYWZZRVKW72S4SCPG
Aws_secret_access_key：fo5ousM4ThvsRrOFVuxVhGv2qnzf+aiDZTmE3aho

aws s3 cp s3://stakimxp-598731762349/ ./ --recursive

cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
rsync -avzP /local/dir/ user@remote:/remote/dir/
rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Prepare raw data

mkdir raw_data; cd raw_data

#Δsbp->deltasbp
#1457.1_2h_MH,WT,MH,2h,1
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
#1457dsbp118h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
#1457dsbp218h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz

#1457.1_2h_
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz  WT_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz  WT_TSB_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz  WT_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz  WT_TSB_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz  WT_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz  WT_TSB_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz  WT_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz  WT_TSB_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz  WT_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz  WT_TSB_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz  WT_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz  WT_TSB_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz  WT_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz  WT_TSB_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz  WT_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz  WT_TSB_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz  WT_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz  WT_TSB_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
#1457dsbp1_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
#1457dsbp2_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
#1457dsbp3_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
#END

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3  deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 deltasbp_MH_18h_1 deltasbp_MH_18h_2 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done
mv trimmed/*.fastq.gz .

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
...

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP020463.gff) --
grep -P "\texon\t" CP020463.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP020463.gff | wc -l   #=11  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP020463.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l    #tRNA 58
grep -P "\tCDS\t" CP020463.gff | wc -l  #3701-->2324
sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
grep -P "\texon\t" CP020463_m.gff | sort | wc -l  #3797-->2405

# -- NOTE that combination of 'CP020463_m.gff' and 'exon' in the command will result in ERROR, using 'transcript' instead in the command line!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file, both are "CP020463.1"

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)
library(edgeR)

setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c(
        "deltasbp_MH_2h_r1" = "./deltasbp_MH_2h_1/quant.sf",
        "deltasbp_MH_2h_r2" = "./deltasbp_MH_2h_2/quant.sf",
        "deltasbp_MH_2h_r3" = "./deltasbp_MH_2h_3/quant.sf",
        "deltasbp_MH_4h_r1" = "./deltasbp_MH_4h_1/quant.sf",
        "deltasbp_MH_4h_r2" = "./deltasbp_MH_4h_2/quant.sf",
        "deltasbp_MH_4h_r3" = "./deltasbp_MH_4h_3/quant.sf",
        "deltasbp_MH_18h_r1" = "./deltasbp_MH_18h_1/quant.sf",
        "deltasbp_MH_18h_r2" = "./deltasbp_MH_18h_2/quant.sf",
        "deltasbp_TSB_2h_r1" = "./deltasbp_TSB_2h_1/quant.sf",
        "deltasbp_TSB_2h_r2" = "./deltasbp_TSB_2h_2/quant.sf",
        "deltasbp_TSB_2h_r3" = "./deltasbp_TSB_2h_3/quant.sf",
        "deltasbp_TSB_4h_r1" = "./deltasbp_TSB_4h_1/quant.sf",
        "deltasbp_TSB_4h_r2" = "./deltasbp_TSB_4h_2/quant.sf",
        "deltasbp_TSB_4h_r3" = "./deltasbp_TSB_4h_3/quant.sf",
        "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
        "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
        "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
        "WT_MH_2h_r1" = "./WT_MH_2h_1/quant.sf",
        "WT_MH_2h_r2" = "./WT_MH_2h_2/quant.sf",
        "WT_MH_2h_r3" = "./WT_MH_2h_3/quant.sf",
        "WT_MH_4h_r1" = "./WT_MH_4h_1/quant.sf",
        "WT_MH_4h_r2" = "./WT_MH_4h_2/quant.sf",
        "WT_MH_4h_r3" = "./WT_MH_4h_3/quant.sf",
        "WT_MH_18h_r1" = "./WT_MH_18h_1/quant.sf",
        "WT_MH_18h_r2" = "./WT_MH_18h_2/quant.sf",
        "WT_MH_18h_r3" = "./WT_MH_18h_3/quant.sf",
        "WT_TSB_2h_r1" = "./WT_TSB_2h_1/quant.sf",
        "WT_TSB_2h_r2" = "./WT_TSB_2h_2/quant.sf",
        "WT_TSB_2h_r3" = "./WT_TSB_2h_3/quant.sf",
        "WT_TSB_4h_r1" = "./WT_TSB_4h_1/quant.sf",
        "WT_TSB_4h_r2" = "./WT_TSB_4h_2/quant.sf",
        "WT_TSB_4h_r3" = "./WT_TSB_4h_3/quant.sf",
        "WT_TSB_18h_r1" = "./WT_TSB_18h_1/quant.sf",
        "WT_TSB_18h_r2" = "./WT_TSB_18h_2/quant.sf",
        "WT_TSB_18h_r3" = "./WT_TSB_18h_3/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
condition <- factor(c("deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_18h","deltasbp_MH_18h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_18h","deltasbp_TSB_18h","deltasbp_TSB_18h","WT_MH_2h","WT_MH_2h","WT_MH_2h","WT_MH_4h","WT_MH_4h","WT_MH_4h","WT_MH_18h","WT_MH_18h","WT_MH_18h","WT_TSB_2h","WT_TSB_2h","WT_TSB_2h","WT_TSB_4h","WT_TSB_4h","WT_TSB_4h","WT_TSB_18h","WT_TSB_18h","WT_TSB_18h"))

sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("strain", "media", "time")
colData <- cbind(sample_table, split_cond)
colData$strain <- factor(colData$strain)
colData$media  <- factor(colData$media)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
# Define the colData for DESeq2
#colData <- data.frame(condition=condition, row.names=names(files))

#grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
#grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# -------------------------------
# 2️⃣ Transcript-level counts
# -------------------------------
# Create DESeqDataSet directly from tximport (transcript-level)
dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds_tx), file="transcript_counts.csv")

# --------------------------------
# 3️⃣ Gene-level summarization
# --------------------------------
# Re-import Salmon data summarized at gene level
txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)

# Create DESeqDataSet for gene-level counts
#dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media + strain + media:strain + strain:time)
#或更简单地写为（推荐）：dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media * strain)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ strain * media * time)
#~ strain * media * time    主效应 + 所有交互（推荐）  ✅
#~ time + media * strain    主效应 + media:strain 交互   ⚠️ 有限制

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
png("pca_media.png", 1200, 800)
plotPCA(rld, intgroup=c("media"))
dev.off()
png("pca_strain.png", 1200, 800)
plotPCA(rld, intgroup=c("strain"))
dev.off()
png("pca_time.png", 1200, 800)
plotPCA(rld, intgroup=c("time"))
dev.off()

(Optional; ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
#Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
#So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

sizeFactors(dds)
#NULL
# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
#> sizeFactors(dds)

#control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
#2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
#HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
#0.4604461  0.5799834  0.6803681

# (DEBUG) If avgTxLength is Necessary
#To simplify the computation and ensure sizeFactors are calculated:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
#If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
sizeFactors(dds)

# If alone with virus data, the following BUG occured:
#Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                    HeLa_TO_r1                      HeLa_TO_r2
                    0.9978755                       1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

#Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
1/0.9978755=1.002129023
1/1.1092227=
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))
    # Compute the geometric mean over the line
    gm.mean  <-  apply(cts, 1, geomMean)
    # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factor
    return(med)
}
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
[5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
[9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
[13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
[17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
[21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
[25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
[29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
[33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# 确保因子顺序（可选）
colData$strain <- relevel(factor(colData$strain), ref = "WT")
colData$media  <- relevel(factor(colData$media), ref = "TSB")
colData$time   <- relevel(factor(colData$time), ref = "2h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                      "strain_deltasbp_vs_WT"
#[3] "media_MH_vs_TSB"                "time_18h_vs_2h"
#[5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
#[7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
#[9] "mediaMH.time18h"                "mediaMH.time4h"
#[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

🔹 Main effects for each factor:

表达量
▲
│       ┌────── WT-TSB
│      /
│     /     ┌────── WT-MH
│    /     /
│   /     /     ┌────── deltasbp-TSB
│  /     /     /
│ /     /     /     ┌────── deltasbp-MH
└──────────────────────────────▶ 时间（2h, 4h, 18h）

    * strain_deltasbp_vs_WT
    * media_MH_vs_TSB
    * time_18h_vs_2h
    * time_4h_vs_2h

🔹 两因素交互作用（Two-way interactions）
这些项表示两个实验因素（如菌株、培养基、时间）之间的组合效应——也就是说，其中一个因素的影响取决于另一个因素的水平。

表达量
▲
│
│             WT ────────┐
│                        └─↘
│                           ↘
│                        deltasbp ←←←← 显著交互（方向/幅度不同）
└──────────────────────────────▶ 时间

straindeltasbp.mediaMH
表示 菌株（strain）和培养基（media）之间的交互作用。
➤ 这意味着：deltasbp 这个突变菌株在 MH 培养基中的表现与它在 TSB 中的不同，不能仅通过菌株和培养基的单独效应来解释。

straindeltasbp.time18h
表示 菌株（strain）和时间（time, 18h）之间的交互作用。
➤ 即：突变菌株在 18 小时时的表达变化不只是菌株效应或时间效应的简单相加，而有协同作用。

straindeltasbp.time4h
同上，是菌株和时间（4h）之间的交互作用。

mediaMH.time18h
表示 培养基（MH）与时间（18h）之间的交互作用。
➤ 即：在 MH 培养基中，18 小时时的表达水平与在其他时间点（例如 2h）不同，且该变化不完全可以用时间和培养基各自单独的效应来解释。

mediaMH.time4h
与上面类似，是 MH 培养基与 4 小时之间的交互作用。

🔹 三因素交互作用（Three-way interactions）
三因素交互作用表示：菌株、培养基和时间这三个因素在一起时，会产生一个新的效应，这种效应无法通过任何两个因素的组合来完全解释。

表达量（TSB）
▲
│
│        WT ──────→→
│        deltasbp ─────→→
└────────────────────────▶ 时间（2h, 4h, 18h）

表达量（MH）
▲
│
│        WT ──────→→
│        deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
└────────────────────────▶ 时间（2h, 4h, 18h）

straindeltasbp.mediaMH.time18h
表示 菌株 × 培养基 × 时间（18h） 三者之间的交互作用。
➤ 即：突变菌株在 MH 培养基下的 18 小时表达模式，与其他组合（比如 WT 在 MH 培养基下，或者在 TSB 下）都不相同。

straindeltasbp.mediaMH.time4h
同上，只是观察的是 4 小时下的三因素交互效应。

✅ 总结：
交互作用项的存在意味着你不能仅通过单个变量（如菌株、时间或培养基）的影响来解释基因表达的变化，必须同时考虑它们之间的组合关系。在 DESeq2 模型中，这些交互项的显著性可以揭示特定条件下是否有特异的调控行为。

# 提取 strain 的主效应: up 2, down 16
contrast <- "strain_deltasbp_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 media 的主效应: up 76; down 128
contrast <- "media_MH_vs_TSB"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 228, down 98; up 17, down 2
contrast <- "time_18h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

contrast <- "time_4h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  delta sbp 2h TSB vs WT 2h TSB
#2.)  delta sbp 4h TSB vs WT 4h TSB
#3.)  delta sbp 18h TSB vs WT 18h TSB
#4.)  delta sbp 2h MH vs WT 2h MH
#5.)  delta sbp 4h MH vs WT 4h MH
#6.)  delta sbp 18h MH vs WT 18h MH

#---- relevel to control ----
#design=~condition+replicate
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

dds$condition <- relevel(dds$condition, "WT_TSB_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

dds$condition <- relevel(dds$condition, "WT_MH_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

# WT_MH_xh
dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")
dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_18h_vs_WT_MH_4h")

# WT_TSB_xh
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")
dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

# deltasbp_MH_xh
dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")
dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

# deltasbp_TSB_xh
dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")
dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff

for cmp in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
            emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  470.IX87_14445:
                * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
                * IX87_14445 would refer to a specific gene or protein within that genome.

        Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
    * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            QBlast finished with warnings!
            Blasted Sequences: 2084
            Sequences without results: 105
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP020463_protein) Done"
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
            "Mapping (CP020463_protein) Done"
            "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
            "Annotation (CP020463_protein) Done"
            "Annotation finished."
    or blast2go_cli_v1.5.1 (NOT_USED)

            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
        * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
            "InterProScan (CP020463_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
        * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
        #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
    File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
            #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0085020
        #-- after merging (blast2go_annot.annot2) -->
        #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0070531
        #H0N29_18790     GO:0004842
        #H0N29_18790     GO:0005515
        #H0N29_18790     GO:0085020

        cp blast2go_annot.annot blast2go_annot.annot2

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all 6 input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        # Define the filenames and output suffixes
        file_list <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h",
          "WT_MH_18h_vs_WT_MH_2h",
          "WT_MH_18h_vs_WT_MH_4h",
          "WT_TSB_4h_vs_WT_TSB_2h",
          "WT_TSB_18h_vs_WT_TSB_2h",
          "WT_TSB_18h_vs_WT_TSB_4h",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
        )

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_4h-all.csv",
          "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
          # Prepare KO lists (remove "ko:" if present)
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
          k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
# -- IMPORTANT_ADAPTION: the script by chaning "H0N29_" with "B4U56_"
for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

#for f in *-all_annotated.csv; do sed -i '1s/^GeneName,/GeneName2,/;1s/,Description,/,GeneName,/' "$f"; done

for f in *-all_annotated.csv; do
  awk -v FPAT='([^,]*)|("[^"]*")' -v OFS=',' '
    NR==1{
      for(i=1;i<=NF;i++){ if($i=="GeneName") g=i; if($i=="Description") d=i }
      print; next
    }
    { $g=$g" ("$d")"; print }' "$f" > tmp && mv tmp "$f"
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config ---------------------------------------------------
contrast <- "deltasbp_TSB_2h_vs_WT_TSB_2h"    #17, height=600, heatmap_pattern1
contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h"    #25, height=800, heatmap_pattern1
contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h"  #34, height=1000, heatmap_pattern1
contrast <- "deltasbp_MH_2h_vs_WT_MH_2h"      #43, height=1200, heatmap_pattern1
contrast <- "deltasbp_MH_4h_vs_WT_MH_4h"      #26, height=800, heatmap_pattern1
contrast <- "deltasbp_MH_18h_vs_WT_MH_18h"    #41, height=1200, heatmap_pattern1

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
#NOT_NECCESSARY since it was already sorted: colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- RUN the code with the new contract from HERE after first run --

## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "
```
-all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } # Column labels: 'deltaadeIJ' -> ‘ΔadeIJ’ and nicer spacing labCol_pretty <- colnames(datamat) labCol_pretty <- gsub("^deltasbp", "\u0394sbp", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> “WT TSB 2h r1” # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) ## 8) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] #BREAK_LINE # DEBUG for some items starting with "gene-" labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) labRow_pretty <- gsub('"', "", labRow_pretty) #labRow_pretty <- gsub('\\"', "", labRow_pretty) # truncate if longer than threshold (e.g. 30 characters) threshold <- 152 labRow_pretty <- ifelse( nchar(labRow_pretty) > threshold, paste0(substr(labRow_pretty, 1, threshold), “…”), labRow_pretty ) # heatmap_pattern1 png(paste0(“DEGs_heatmap_”, contrast, “.png”), width=1800, height=1200) heatmap.2(datamat, Rowv = as.dendrogram(hr), Colv = FALSE, # no column clustering dendrogram = “none”, #’row’, col = bluered(75), scale = “row”, #RowSideColors = mycol, trace = “none”, margin = c(10, 120), # bottom, left, old is 20 sepwidth = c(0, 0), density.info = ‘none’, labRow = labRow_pretty, # row labels WITHOUT “gene-” labCol = labCol_pretty, # col labels with Δsbp + spaces cexRow = 2.0, cexCol = 2.0, srtCol = 40, lhei = c(0.6, 4), # enlarge the first number when reduce the plot size to avoid the error ‘Error in plot.new() : figure margins too large’ lwid = c(0.2, 4)) # enlarge the first number when reduce the plot size to avoid the error ‘Error in plot.new() : figure margins too large’ dev.off() # heatmap_pattern2 png(paste0(“DEGs_heatmap_”, contrast, “.png”), width = 1800, height = 6500) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = “row”, col = bluered(75), scale = “row”, trace = “none”, density.info = “none”, RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.4, # ↓ smaller column label font (was 1.3) cexCol = 1.8, srtCol = 20, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # —————— Heatmap generation for three samples ———————- ## ============================================================ ## Three-condition DEGs heatmap from multiple pairwise contrasts ## Example contrasts: ## “WT_MH_4h_vs_WT_MH_2h”, ## “WT_MH_18h_vs_WT_MH_2h”, ## “WT_MH_18h_vs_WT_MH_4h” ## Output shows the union of DEGs across all contrasts and ## only the columns (samples) for the 3 conditions. ## ============================================================ ## ——– 0) User inputs ———————————— contrasts <- c( "WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_4h" #--> 424 genes, height=6000, heatmap_pattern2 ) contrasts <- c( "WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_4h" #--> 358 genes, height=5200, heatmap_pattern2 ) contrasts <- c( "deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_4h" #--> 345 genes, height=5120, heatmap_pattern2 ) contrasts <- c( "deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_4h" #--> 276 genes, height=4000, heatmap_pattern2 ) ## Optionally force a condition display order (defaults to order of first appearance) cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h") cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h") cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h") cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h") #cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltasbp", "\u0394sbp", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE # -- RUN the code with the new contract from HERE after first run -- ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message(“Removing “, sum(!row_ok), ” constant/NA rows.”) datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- rownames(datamat) # ---- Replace GeneID with GeneName from " -all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) labRow_pretty <- gsub('"', "", labRow_pretty) threshold <- 240 labRow_pretty <- ifelse( nchar(labRow_pretty) > threshold, paste0(substr(labRow_pretty, 1, threshold), “…”), labRow_pretty ) png(paste0(“DEGs_heatmap_”, out_tag, “.png”), width = 1800, height = 4000) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = “none”, #’row’, col = bluered(75), scale = “row”, trace = “none”, density.info = “none”, #RowSideColors = mycol, margins = c(10, 120), # c(bottom, left), 15–>120 sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.5, srtCol = 60, lhei = c(0.01, 4), lwid = c(0.1, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # —————— Heatmap generation for three samples END ———————- # — (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h — cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if (nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Processing Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606 v2

Leave a reply

Vorgabe

#perform PCA analysis, Venn diagram analysis, as well as KEGG and GO annotations. We would also appreciate it if you could include CPM calculations for this dataset (gene_cpm_counts.xlsx). For comparative analysis, we are particularly interested in identifying DEGs between WT and ΔIJ across the different treatments and time points.

I have already performed the six comparisons, using WT as the reference:

    ΔIJ-17 vs WT-17 – no treatment
    ΔIJ-24 vs WT-24 – no treatment
    preΔIJ-17 vs preWT-17 – Treatment A
    preΔIJ-24 vs preWT-24 – Treatment A
    0_5ΔIJ-17 vs 0_5WT-17 – Treatment B
    0_5ΔIJ-24 vs 0_5WT-24 – Treatment B

To gain a deeper understanding of how the ∆adeIJ mutation influences response dynamics over time and under different stimuli, would you also be interested in the following additional comparisons?

Within-strain treatment responses
(to explore how each strain responds to treatments):

WT:

    preWT-17 vs WT-17 → response to Treatment A at 17 h
    preWT-24 vs WT-24 → response to Treatment A at 24 h
    0_5WT-17 vs WT-17 → response to Treatment B at 17 h
    0_5WT-24 vs WT-24 → response to Treatment B at 24 h

∆adeIJ:

    preΔIJ-17 vs ΔIJ-17 → response to Treatment A at 17 h
    preΔIJ-24 vs ΔIJ-24 → response to Treatment A at 24 h
    0_5ΔIJ-17 vs ΔIJ-17 → response to Treatment B at 17 h
    0_5ΔIJ-24 vs ΔIJ-24 → response to Treatment B at 24 h

Time-course comparisons
(to investigate time-dependent changes within each condition):

    WT-24 vs WT-17
    ΔIJ-24 vs ΔIJ-17
    preWT-24 vs preWT-17
    preΔIJ-24 vs preΔIJ-17
    0_5WT-24 vs 0_5WT-17
    0_5ΔIJ-24 vs 0_5ΔIJ-17

I reviewed the datasets again and noticed that there are no ∆adeAB samples included. Should we try to obtain ∆adeAB data from other datasets? However, I’m a bit concerned that batch effects might pose a challenge when integrating data from different datasets.

> It is possible to analyze DEGs across various time points (17 and 24 h) and stimuli (treatment A and B, and without treatment) iswithin both the ∆adeIJ mutant and the WT strain as our phenotypic characterization of these strains across two times points and stimuli shows significant differences but the other mutant ∆adeAB (similar function as AdeIJ) shows no difference compared to WT, therefore we are wondering what's happened to ∆adeIJ.

deltaIJ_17, WT_17 – ΔadeIJ and wildtype strains w/o exposure at 17 h (No treatment)
deltaIJ_24, WT_24 – ΔadeIJ and wildtype strains w/o exposure at 24 h (No treatment)
pre_deltaIJ_17, pre_WT_17 – ΔadeIJ and wildtype strains with 1 exposure at 17 h (Treatment A)
pre_deltaIJ_24, pre_WT_24 – ΔadeIJ and wildtype strains with 1 exposure at 24 h (Treatment A)
0_5_deltaIJ_17, 0_5_WT_17 – ΔadeIJ and wildtype strains with 2 exposure at 17 h (Treatment B)
0_5_deltaIJ_24, 0_5_WT_24 – ΔadeIJ and wildtype strains with 2 exposure at 24 h (Treatment B)

Preparing raw data

mkdir raw_data; cd raw_data
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_1.fq.gz WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_2.fq.gz WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_1.fq.gz WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_2.fq.gz WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_1.fq.gz WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_2.fq.gz WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_1.fq.gz WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_2.fq.gz WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_1.fq.gz WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_2.fq.gz WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_1.fq.gz WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_2.fq.gz WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_1.fq.gz deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_2.fq.gz deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_1.fq.gz deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_2.fq.gz deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_1.fq.gz deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_2.fq.gz deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_1.fq.gz deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_2.fq.gz deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_1.fq.gz deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_2.fq.gz deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_1.fq.gz deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_2.fq.gz deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_1.fq.gz pre_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_2.fq.gz pre_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_1.fq.gz pre_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_2.fq.gz pre_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_1.fq.gz pre_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_2.fq.gz pre_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_1.fq.gz pre_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_2.fq.gz pre_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_1.fq.gz pre_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_2.fq.gz pre_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_1.fq.gz pre_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_2.fq.gz pre_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_1.fq.gz pre_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_2.fq.gz pre_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_1.fq.gz pre_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_2.fq.gz pre_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_1.fq.gz pre_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_2.fq.gz pre_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_1.fq.gz pre_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_2.fq.gz pre_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_1.fq.gz pre_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_2.fq.gz pre_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_1.fq.gz pre_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_2.fq.gz pre_deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_1.fq.gz 0_5_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_2.fq.gz 0_5_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_1.fq.gz 0_5_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_2.fq.gz 0_5_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_1.fq.gz 0_5_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_2.fq.gz 0_5_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_1.fq.gz 0_5_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_2.fq.gz 0_5_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_1.fq.gz 0_5_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_2.fq.gz 0_5_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_1.fq.gz 0_5_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_2.fq.gz 0_5_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_1.fq.gz 0_5_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_2.fq.gz 0_5_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_1.fq.gz 0_5_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_2.fq.gz 0_5_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_1.fq.gz 0_5_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_2.fq.gz 0_5_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_1.fq.gz 0_5_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_2.fq.gz 0_5_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_1.fq.gz 0_5_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_2.fq.gz 0_5_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_1.fq.gz 0_5_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_2.fq.gz 0_5_deltaIJ-24-r3_R2.fq.gz

(Done) Downloading CP059040.fasta and CP059040.gff from GenBank

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT-17-r1 WT-17-r2 WT-17-r3 WT-24-r1 WT-24-r2 WT-24-r3 deltaIJ-17-r1 deltaIJ-17-r2 deltaIJ-17-r3 deltaIJ-24-r1 deltaIJ-24-r2 deltaIJ-24-r3  pre_WT-17-r1 pre_WT-17-r2 pre_WT-17-r3 pre_WT-24-r1 pre_WT-24-r2 pre_WT-24-r3 pre_deltaIJ-17-r1 pre_deltaIJ-17-r2 pre_deltaIJ-17-r3 pre_deltaIJ-24-r1 pre_deltaIJ-24-r2 pre_deltaIJ-24-r3  0_5_WT-17-r1 0_5_WT-17-r2 0_5_WT-17-r3 0_5_WT-24-r1 0_5_WT-24-r2 0_5_WT-24-r3 0_5_deltaIJ-17-r1 0_5_deltaIJ-17-r2 0_5_deltaIJ-17-r3 0_5_deltaIJ-24-r1 0_5_deltaIJ-24-r2 0_5_deltaIJ-24-r3; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_17_r1,WT-17-r1_R1.fq.gz,WT-17-r1_R2.fq.gz,auto
WT_17_r2,WT-17-r2_R1.fq.gz,WT-17-r2_R2.fq.gz,auto
WT_17_r3,WT-17-r3_R1.fq.gz,WT-17-r3_R2.fq.gz,auto
WT_24_r1,WT-24-r1_R1.fq.gz,WT-24-r1_R2.fq.gz,auto
WT_24_r2,WT-24-r2_R1.fq.gz,WT-24-r2_R2.fq.gz,auto
WT_24_r3,WT-24-r3_R1.fq.gz,WT-24-r3_R2.fq.gz,auto
deltaIJ_17_r1,deltaIJ-17-r1_R1.fq.gz,deltaIJ-17-r1_R2.fq.gz,auto
deltaIJ_17_r2,deltaIJ-17-r2_R1.fq.gz,deltaIJ-17-r2_R2.fq.gz,auto
deltaIJ_17_r3,deltaIJ-17-r3_R1.fq.gz,deltaIJ-17-r3_R2.fq.gz,auto
deltaIJ_24_r1,deltaIJ-24-r1_R1.fq.gz,deltaIJ-24-r1_R2.fq.gz,auto
deltaIJ_24_r2,deltaIJ-24-r2_R1.fq.gz,deltaIJ-24-r2_R2.fq.gz,auto
deltaIJ_24_r3,deltaIJ-24-r3_R1.fq.gz,deltaIJ-24-r3_R2.fq.gz,auto
pre_WT_17_r1,pre_WT-17-r1_R1.fq.gz,pre_WT-17-r1_R2.fq.gz,auto
pre_WT_17_r2,pre_WT-17-r2_R1.fq.gz,pre_WT-17-r2_R2.fq.gz,auto
pre_WT_17_r3,pre_WT-17-r3_R1.fq.gz,pre_WT-17-r3_R2.fq.gz,auto
pre_WT_24_r1,pre_WT-24-r1_R1.fq.gz,pre_WT-24-r1_R2.fq.gz,auto
pre_WT_24_r2,pre_WT-24-r2_R1.fq.gz,pre_WT-24-r2_R2.fq.gz,auto
pre_WT_24_r3,pre_WT-24-r3_R1.fq.gz,pre_WT-24-r3_R2.fq.gz,auto
pre_deltaIJ_17_r1,pre_deltaIJ-17-r1_R1.fq.gz,pre_deltaIJ-17-r1_R2.fq.gz,auto
pre_deltaIJ_17_r2,pre_deltaIJ-17-r2_R1.fq.gz,pre_deltaIJ-17-r2_R2.fq.gz,auto
pre_deltaIJ_17_r3,pre_deltaIJ-17-r3_R1.fq.gz,pre_deltaIJ-17-r3_R2.fq.gz,auto
pre_deltaIJ_24_r1,pre_deltaIJ-24-r1_R1.fq.gz,pre_deltaIJ-24-r1_R2.fq.gz,auto
pre_deltaIJ_24_r2,pre_deltaIJ-24-r2_R1.fq.gz,pre_deltaIJ-24-r2_R2.fq.gz,auto
pre_deltaIJ_24_r3,pre_deltaIJ-24-r3_R1.fq.gz,pre_deltaIJ-24-r3_R2.fq.gz,auto
0_5_WT_17_r1,0_5_WT-17-r1_R1.fq.gz,0_5_WT-17-r1_R2.fq.gz,auto
0_5_WT_17_r2,0_5_WT-17-r2_R1.fq.gz,0_5_WT-17-r2_R2.fq.gz,auto
0_5_WT_17_r3,0_5_WT-17-r3_R1.fq.gz,0_5_WT-17-r3_R2.fq.gz,auto
0_5_WT_24_r1,0_5_WT-24-r1_R1.fq.gz,0_5_WT-24-r1_R2.fq.gz,auto
0_5_WT_24_r2,0_5_WT-24-r2_R1.fq.gz,0_5_WT-24-r2_R2.fq.gz,auto
0_5_WT_24_r3,0_5_WT-24-r3_R1.fq.gz,0_5_WT-24-r3_R2.fq.gz,auto
0_5_deltaIJ_17_r1,0_5_deltaIJ-17-r1_R1.fq.gz,0_5_deltaIJ-17-r1_R2.fq.gz,auto
0_5_deltaIJ_17_r2,0_5_deltaIJ-17-r2_R1.fq.gz,0_5_deltaIJ-17-r2_R2.fq.gz,auto
0_5_deltaIJ_17_r3,0_5_deltaIJ-17-r3_R1.fq.gz,0_5_deltaIJ-17-r3_R2.fq.gz,auto
0_5_deltaIJ_24_r1,0_5_deltaIJ-24-r1_R1.fq.gz,0_5_deltaIJ-24-r1_R2.fq.gz,auto
0_5_deltaIJ_24_r2,0_5_deltaIJ-24-r2_R1.fq.gz,0_5_deltaIJ-24-r2_R2.fq.gz,auto
0_5_deltaIJ_24_r3,0_5_deltaIJ-24-r3_R1.fq.gz,0_5_deltaIJ-24-r3_R2.fq.gz,auto

nextflow run

#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

#Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_1 (CDS --> exon in CP059040.gff) --
#Checking the record (see below) in results/genome/CP059040.gtf
#In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
#--featurecounts_feature_type 'transcript' returns only the tRNA results
#Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon

grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!

grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
mv trimmed/*.fq.gz .; rmdir trimmed
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040_m.gff"        -profile docker -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file

Prepare counts_fixed by hand: delete all “””, “gene-“, replace , to ‘\t’.

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

#keep only gene_id
cut -f1 -d',' counts.tsv > f1
cut -f3- -d',' counts.tsv > f3_
paste -d',' f1 f3_ > counts_fixed.tsv

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

#Delete the repliate 2 of ΔadeIJ_two_17 and repliate 1 of ΔadeIJ_two_24 are outlier.
paste -d$'\t' f1_32 f34 f36_ > counts_fixed_2.tsv

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed_2.tsv \
  --samples samples_2.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria_2

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)
library(edgeR)

setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c("WT_17_r1" = "./WT_17_r1/quant.sf",
           "WT_17_r2" = "./WT_17_r2/quant.sf",
           "WT_17_r3" = "./WT_17_r3/quant.sf",
           "WT_24_r1" = "./WT_24_r1/quant.sf",
           "WT_24_r2" = "./WT_24_r2/quant.sf",
           "WT_24_r3" = "./WT_24_r3/quant.sf",
           "deltaIJ_17_r1" = "./deltaIJ_17_r1/quant.sf",
           "deltaIJ_17_r2" = "./deltaIJ_17_r2/quant.sf",
           "deltaIJ_17_r3" = "./deltaIJ_17_r3/quant.sf",
           "deltaIJ_24_r1" = "./deltaIJ_24_r1/quant.sf",
           "deltaIJ_24_r2" = "./deltaIJ_24_r2/quant.sf",
           "deltaIJ_24_r3" = "./deltaIJ_24_r3/quant.sf",
           "pre_WT_17_r1" = "./pre_WT_17_r1/quant.sf",
           "pre_WT_17_r2" = "./pre_WT_17_r2/quant.sf",
           "pre_WT_17_r3" = "./pre_WT_17_r3/quant.sf",
           "pre_WT_24_r1" = "./pre_WT_24_r1/quant.sf",
           "pre_WT_24_r2" = "./pre_WT_24_r2/quant.sf",
           "pre_WT_24_r3" = "./pre_WT_24_r3/quant.sf",
           "pre_deltaIJ_17_r1" = "./pre_deltaIJ_17_r1/quant.sf",
           "pre_deltaIJ_17_r2" = "./pre_deltaIJ_17_r2/quant.sf",
           "pre_deltaIJ_17_r3" = "./pre_deltaIJ_17_r3/quant.sf",
           "pre_deltaIJ_24_r1" = "./pre_deltaIJ_24_r1/quant.sf",
           "pre_deltaIJ_24_r2" = "./pre_deltaIJ_24_r2/quant.sf",
           "pre_deltaIJ_24_r3" = "./pre_deltaIJ_24_r3/quant.sf",
           "0_5_WT_17_r1" = "./0_5_WT_17_r1/quant.sf",
           "0_5_WT_17_r2" = "./0_5_WT_17_r2/quant.sf",
           "0_5_WT_17_r3" = "./0_5_WT_17_r3/quant.sf",
           "0_5_WT_24_r1" = "./0_5_WT_24_r1/quant.sf",
           "0_5_WT_24_r2" = "./0_5_WT_24_r2/quant.sf",
           "0_5_WT_24_r3" = "./0_5_WT_24_r3/quant.sf",
           "0_5_deltaIJ_17_r1" = "./0_5_deltaIJ_17_r1/quant.sf",
           "0_5_deltaIJ_17_r2" = "./0_5_deltaIJ_17_r2/quant.sf",
           "0_5_deltaIJ_17_r3" = "./0_5_deltaIJ_17_r3/quant.sf",
           "0_5_deltaIJ_24_r1" = "./0_5_deltaIJ_24_r1/quant.sf",
           "0_5_deltaIJ_24_r2" = "./0_5_deltaIJ_24_r2/quant.sf",
           "0_5_deltaIJ_24_r3" = "./0_5_deltaIJ_24_r3/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r3",  "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",     "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",      "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
condition <- factor(c("WT_none_17","WT_none_17","WT_none_17","WT_none_24","WT_none_24","WT_none_24", "deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_17","deltaadeIJ_none_24","deltaadeIJ_none_24","deltaadeIJ_none_24",   "WT_one_17","WT_one_17","WT_one_17","WT_one_24","WT_one_24","WT_one_24", "deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_17","deltaadeIJ_one_24","deltaadeIJ_one_24","deltaadeIJ_one_24",   "WT_two_17","WT_two_17","WT_two_17","WT_two_24","WT_two_24","WT_two_24", "deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_17","deltaadeIJ_two_24","deltaadeIJ_two_24","deltaadeIJ_two_24"))
# Construct colData manually
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()

png("pca3.png", 1200, 800)
plotPCA(rld, intgroup=c("replicate"))
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# 1) keep only non-WT samples
#pdat <- subset(pdat, !grepl("^WT_", condition))
# drop unused factor levels so empty WT facets disappear
pdat$condition <- droplevels(pdat$condition)
# 2) pretty condition names: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
png("pca4.png", 1200, 800)
ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# Drop WT_* conditions from the data and from factor levels
pdat <- subset(pdat, !grepl("^WT_", condition))
pdat$condition <- droplevels(pdat$condition)
# Prettify condition labels for the legend: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca5.png", 1200, 800); print(p); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca6.png", 1200, 800); print(p_fac); dev.off()

# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
#png("pca_media.png", 1200, 800)
#plotPCA(rld, intgroup=c("media"))
#dev.off()
#png("pca_strain.png", 1200, 800)
#plotPCA(rld, intgroup=c("strain"))
#dev.off()
#png("pca_time.png", 1200, 800)
#plotPCA(rld, intgroup=c("time"))
#dev.off()

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] WT_none_17         WT_none_17         WT_none_17         WT_none_24
[5] WT_none_24         WT_none_24         deltaadeIJ_none_17 deltaadeIJ_none_17
[9] deltaadeIJ_none_17 deltaadeIJ_none_24 deltaadeIJ_none_24 deltaadeIJ_none_24
[13] WT_one_17          WT_one_17          WT_one_17          WT_one_24
[17] WT_one_24          WT_one_24          deltaadeIJ_one_17  deltaadeIJ_one_17
[21] deltaadeIJ_one_17  deltaadeIJ_one_24  deltaadeIJ_one_24  deltaadeIJ_one_24
[25] WT_two_17          WT_two_17          WT_two_17          WT_two_24
[29] WT_two_24          WT_two_24          deltaadeIJ_two_17  deltaadeIJ_two_17
[33] deltaadeIJ_two_17  deltaadeIJ_two_24  deltaadeIJ_two_24  deltaadeIJ_two_24
12 Levels: deltaadeIJ_none_17 deltaadeIJ_none_24 ... WT_two_24

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Construct colData automatically
sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("genotype", "exposure", "time")
colData <- cbind(sample_table, split_cond)
colData$genotype <- factor(colData$genotype)
colData$exposure  <- factor(colData$exposure)
colData$time   <- factor(colData$time)
colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
# Construct colData manually
colData2 <- data.frame(condition=condition, row.names=names(files))

# 确保因子顺序（可选）
colData$genotype <- relevel(factor(colData$genotype), ref = "WT")
colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
colData$time   <- relevel(factor(colData$time), ref = "17")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * exposure * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
[1] "Intercept"
[2] "genotype_deltaadeIJ_vs_WT"
[3] "exposure_one_vs_none"
[4] "exposure_two_vs_none"
[5] "time_24_vs_17"
[6] "genotypedeltaadeIJ.exposureone"
[7] "genotypedeltaadeIJ.exposuretwo"
[8] "genotypedeltaadeIJ.time24"
[9] "exposureone.time24"
[10] "exposuretwo.time24"
[11] "genotypedeltaadeIJ.exposureone.time24"
[12] "genotypedeltaadeIJ.exposuretwo.time24"

# 提取 genotype 的主效应: up 10, down 4
contrast <- "genotype_deltaadeIJ_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 one exposure 的主效应: up 196; down 298
contrast <- "exposure_one_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 two exposure 的主效应: up 80; down 105
contrast <- "exposure_two_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 10; down 2
contrast <- "time_24_vs_17"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  ΔadeIJ_none 17h vs WT_none 17h
#2.)  ΔadeIJ_none 24h vs WT_none 24h
#3.)  ΔadeIJ_one 17h vs WT_one 17h
#4.)  ΔadeIJ_one 24h vs WT_one 24h
#5.)  ΔadeIJ_two 17h vs WT_two 17h
#6.)  ΔadeIJ_two 24h vs WT_two 24h

#---- relevel to control ----
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_17_vs_WT_none_17")

dds$condition <- relevel(dds$condition, "WT_none_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_WT_none_24")

dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_17_vs_WT_one_17")

dds$condition <- relevel(dds$condition, "WT_one_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_WT_one_24")

dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_17_vs_WT_two_17")

dds$condition <- relevel(dds$condition, "WT_two_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_WT_two_24")

# WT_none_xh
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_none_24_vs_WT_none_17")

# WT_one_xh
dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_one_24_vs_WT_one_17")

# WT_two_xh
dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_two_24_vs_WT_two_17")

# deltaadeIJ_none_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_deltaadeIJ_none_17")

# deltaadeIJ_one_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_deltaadeIJ_one_17")

# deltaadeIJ_two_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_deltaadeIJ_two_17")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP059040_m.gff > CP059040_gene.gff

for cmp in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
#deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
#deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
#deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)

(NOT_PERFORMED) Volcano plots

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
            emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  470.IX87_14445:
                * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
                * IX87_14445 would refer to a specific gene or protein within that genome.

        Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
    * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            QBlast finished with warnings!
            Blasted Sequences: 2084
            Sequences without results: 105
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP020463_protein) Done"
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
            "Mapping (CP020463_protein) Done"
            "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
            "Annotation (CP020463_protein) Done"
            "Annotation finished."
    or blast2go_cli_v1.5.1 (NOT_USED)

            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
        * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
            "InterProScan (CP020463_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
        * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
        #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
    File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
            #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0085020
        #-- after merging (blast2go_annot.annot2) -->
        #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0070531
        #H0N29_18790     GO:0004842
        #H0N29_18790     GO:0005515
        #H0N29_18790     GO:0085020

        cp blast2go_annot.annot blast2go_annot.annot2

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all 6 input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(

        "deltaadeIJ_none_17_vs_WT_none_17-all.csv",  #up 11, down 3 vs. (10,4)
        "deltaadeIJ_none_24_vs_WT_none_24-all.csv",  #up 0, down 2 vs. (0,2)
        "deltaadeIJ_one_17_vs_WT_one_17-all.csv",    #up 238, down 90 vs. (239,89)  --> height 2600
        "deltaadeIJ_one_24_vs_WT_one_24-all.csv",    #up 83, down 64 vs. (64,71) --> height 1600
        "deltaadeIJ_two_17_vs_WT_two_17-all.csv",    #up 74, down 14 vs. (75,9) --> height 1000
        "deltaadeIJ_two_24_vs_WT_two_24-all.csv",    #up 1, down 3 vs. (3,3)

        "WT_none_24_vs_WT_none_17-all.csv",  #(up 10, down 2)
        "WT_one_24_vs_WT_one_17-all.csv",    #(up 97, down 3)
        "WT_two_24_vs_WT_two_17-all.csv",    #(up 12, down 1)

        "deltaadeIJ_two_24_vs_deltaadeIJ_two_17-all.csv",   #(up 8, down 51)
        "deltaadeIJ_one_24_vs_deltaadeIJ_one_17-all.csv",   #(up 0, down 10)
        "deltaadeIJ_none_24_vs_deltaadeIJ_none_17-all.csv" #(up 0, down 0)

        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
          # Prepare KO lists (remove "ko:" if present)
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
          k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }

        # -------------------------------- OLD_CODE not automatized with loop ----------------------------
        # Load the results
        res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
        res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

        res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
        res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
        res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
        res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
        res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

        res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
        res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
        res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
        res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )

        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()

        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

        # Filter up-regulated genes
        up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.05, ]
        # Filter down-regulated genes
        down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.05, ]

        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete")
        writeData(wb, "Complete_Data", res_updated)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
        #NOTE: The generated annotation-files contains all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) + almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs) except for -[GOs, EC] + two columns from Blast2GO (COs, EC); In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

        #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
        ## Set the 'GeneName' column as row.names
        #rownames(res_updated) <- res_updated$GeneName
        ## Drop the 'GeneName' column since it's now the row names
        #res_updated$GeneName <- NULL
        # -- BREAK_1 --

        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # -- BREAK_2 --

        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings

        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)

        # Create a new workbook
        #wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes

        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
        # -- BREAK_3 --

        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)

        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)

        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)

        ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })

        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })

        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

        # Save the workbook with enrichment results
        saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

        #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
        #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
        #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!

        #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
        #Mac_vs_LB
        #LB.AB_vs_LB.WT19606
        #LB.IJ_vs_LB.WT19606
        #LB.W1_vs_LB.WT19606
        #LB.Y1_vs_LB.WT19606
        #Mac.AB_vs_Mac.WT19606
        #Mac.IJ_vs_Mac.WT19606
        #Mac.W1_vs_Mac.WT19606
        #Mac.Y1_vs_Mac.WT19606

        #TODO: write reply hints in KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.

10.5. (DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

            library(openxlsx)

            # Function to read and clean gene ID files
            read_gene_ids <- function(file_path) {
            # Read the gene IDs from the file
            gene_ids <- readLines(file_path)

            # Remove any quotes and trim whitespaces
            gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
            gene_ids <- trimws(gene_ids)  # Trim whitespaces

            # Remove empty entries or NAs
            gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

            return(gene_ids)
            }

            # Example list of LB files with both -up.id and -down.id for each condition
            lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                            "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
            lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                            "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

            # Combine both up and down files for each condition
            lb_files <- c(lb_files_up, lb_files_down)

            # Read gene IDs for each file in LB group
            #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
            lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

            lb_degs_ <- list()
            combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
            #unique_combined_set <- unique(combined_set)
            lb_degs_$AB <- combined_set
            combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
            lb_degs_$IJ <- combined_set
            combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
            lb_degs_$W1 <- combined_set
            combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
            lb_degs_$Y1 <- combined_set

            # Example list of Mac files with both -up.id and -down.id for each condition
            mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                            "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
            mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                            "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

            # Combine both up and down files for each condition in Mac group
            mac_files <- c(mac_files_up, mac_files_down)

            # Read gene IDs for each file in Mac group
            mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

            mac_degs_ <- list()
            combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
            mac_degs_$AB <- combined_set
            combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
            mac_degs_$IJ <- combined_set
            combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
            mac_degs_$W1 <- combined_set
            combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
            mac_degs_$Y1 <- combined_set

            # Function to clean sheet names to ensure no sheet name exceeds 31 characters
            truncate_sheet_name <- function(names_list) {
            sapply(names_list, function(name) {
            if (nchar(name) > 31) {
            return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
            }
            return(name)
            })
            }

            # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

            # Define intersections between different conditions for LB
            inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
            inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
            inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
            inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
            inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
            inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

            # Define intersections between three conditions for LB
            inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
            inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
            inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
            inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Define intersection between all four conditions for LB
            inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Now remove the intersected genes from each original set for LB
            venn_list_lb <- list()

            # For LB.AB, remove genes that are also in other conditions
            venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

            # For LB.IJ, remove genes that are also in other conditions
            venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

            # For LB.W1, remove genes that are also in other conditions
            venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_ab_w1_y1)))

            # For LB.Y1, remove genes that are also in other conditions
            venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_ab_w1_y1)))

            # Add the intersections for LB (same as before)
            venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
            venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
            venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
            venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
            venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
            venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

            # Define intersections between three conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
            venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
            venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

            # Define intersection between all four conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

            # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

            # Define intersections between different conditions
            inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
            inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
            inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
            inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
            inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
            inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

            # Define intersections between three conditions
            inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
            inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
            inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
            inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Define intersection between all four conditions
            inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Now remove the intersected genes from each original set
            venn_list_mac <- list()

            # For Mac.AB, remove genes that are also in other conditions
            venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

            # For Mac.IJ, remove genes that are also in other conditions
            venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

            # For Mac.W1, remove genes that are also in other conditions
            venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_ab_w1_y1)))

            # For Mac.Y1, remove genes that are also in other conditions
            venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_ab_w1_y1)))

            # Add the intersections (same as before)
            venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
            venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
            venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
            venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
            venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

            # Define intersections between three conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
            venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

            # Define intersection between all four conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

            # Save the gene IDs to Excel for further inspection (optional)
            write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
            write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

            # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
            write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
            write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

            # Venn Diagram for LB group
            venn1 <- ggvenn(lb_degs_,
                            fill_color = c("skyblue", "tomato", "gold", "orchid"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

            # Venn Diagram for Mac group
            venn2 <- ggvenn(mac_degs_,
                            fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

            cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
# -- IMPORTANT_ADAPTION: the script by chaning "HJI06_" with "H0N29_"
for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config ---------------------------------------------------
contrast <- "deltaadeIJ_none_17_vs_WT_none_17"  #up 11, down 3 vs. (10,4) --> height 600 heatmap_pattern1
contrast <- "deltaadeIJ_none_24_vs_WT_none_24"  #up 0, down 2 vs. (0,2) --> height 600 pattern1
contrast <- "deltaadeIJ_one_17_vs_WT_one_17"    #up 238, down 90 vs. (239,89)  --> height 4000 pattern2
contrast <- "deltaadeIJ_one_24_vs_WT_one_24"    #up 83, down 64 vs. (64,71) --> height 1800 pattern2
contrast <- "deltaadeIJ_two_17_vs_WT_two_17"    #up 74, down 14 vs. (75,9) --> height 1100 pattern2
contrast <- "deltaadeIJ_two_24_vs_WT_two_24"    #up 1, down 3 vs. (3,3) --> height 600 pattern1

contrast <- "WT_none_24_vs_WT_none_17"  #(up 10, down 2) --> height 600 pattern1
contrast <- "WT_one_24_vs_WT_one_17"    #(up 97, down 3) --> height 1400 pattern2
contrast <- "WT_two_24_vs_WT_two_17"    #(up 12, down 1) --> height 600 pattern1
contrast <- "deltaadeIJ_none_24_vs_deltaadeIJ_none_17" #(up 0, down 0)
contrast <- "deltaadeIJ_one_24_vs_deltaadeIJ_one_17"   #(up 0, down 10) --> height 600 pattern1
contrast <- "deltaadeIJ_two_24_vs_deltaadeIJ_two_17"   #(up 8, down 51) --> height 1000 pattern2

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- RUN the code with the new contract from HERE after first run --

## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
#GOI <- GOI_down
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "
```
-all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } # Column labels: 'deltaadeIJ' -> ‘ΔadeIJ’ and nicer spacing labCol_pretty <- colnames(datamat) #labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> “WT TSB 2h r1” # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) ## 8) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] #BREAK_LINE labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 20), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = labRow_pretty, # row labels WITHOUT "gene-" labCol = labCol_pretty, # col labels with Δsbp + spaces cexRow = 2.5, cexCol = 2.5, srtCol = 20, lhei = c(0.6, 4), # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' lwid = c(0.8, 4)) # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' dev.off() # DEBUG for some items starting with "gene-" labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 1000) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.4, # ↓ smaller column label font (was 1.3) cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples ---------------------- ## ============================================================ ## Three-condition DEGs heatmap from multiple pairwise contrasts ## Example contrasts: ## "WT_MH_4h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_4h" ## Output shows the union of DEGs across all contrasts and ## only the columns (samples) for the 3 conditions. ## ============================================================ ## -------- 0) User inputs ------------------------------------ # --> NOT_USED since no three time point comparison exists! #contrasts <- c( # "WT_MH_4h_vs_WT_MH_2h", # "WT_MH_18h_vs_WT_MH_2h", # "WT_MH_18h_vs_WT_MH_4h" #) ### Optionally force a condition display order (defaults to order of first appearance) #cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h") ##cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltasbp", "\u0394sbp", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE # -- RUN the code with the new contract from HERE after first run -- ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message(“Removing “, sum(!row_ok), ” constant/NA rows.”) datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- rownames(datamat) # ---- Replace GeneID with GeneName from " -all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 2600) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples END ---------------------- # -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if (nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Processing Data_Michelle_RNAseq_2025 v2

Leave a reply

In the current results, I extract the main effect. I also compared the condition deltasbp_MH_18h to WT_MH_18h, if you are interested in specific comparison between conditions, please let me know, I can perform differentially expressed analysis and draw corresponding volcano plots for them.

Targets

The experiment we did so far:
I have two strains:
1. 1457 wildtype
2. 1457Δsbp (sbp knock out strain)

I have grown these two strains in two media for 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase), 18h (mature biofilm phase) respectively
1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth visible (sbp knockout shows less biofilm formation and a growth deficit)
2. medium MH -> nutrient-poor medium: differences between wild type more obvious (sbp knockout shows stronger growth deficit)

Our idea/hypothesis of what we hope to achieve with the RNA-Seq:
Since we already see differences in growth and biofilm formation and also differences in the proteome (through cooperation with mass spectrometry), we also expect differences in the transcription of the genes in the RNA-Seq. Could you analyze the RNA-Seq data for me and compare the strains at the different time points? But maybe also compare the different time points of one strain with each other?
The following would be interesting for me:
- PCA plot (sample comparison)
- Heatmaps (wild type vs. sbp knockout)
- Volcano plots (significant genes)
- Gene Ontology (GO) analyses

Download the raw data

Mail von BGI (RNA-SEQ Institute):
The data from project F25A430000603 are uploaded to AWS.
Please download the data as below:
URL：https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
Project：F25A430000603-01-STAkimxP
Alias ID：598731762349
S3 Bucket：stakimxp-598731762349
Account：stakimxp
Password：qR0'A7[o9Ql|
Region：eu-central-1
Aws_access_key_id：AKIAYWZZRVKW72S4SCPG
Aws_secret_access_key：fo5ousM4ThvsRrOFVuxVhGv2qnzf+aiDZTmE3aho

aws s3 cp s3://stakimxp-598731762349/ ./ --recursive

cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
rsync -avzP /local/dir/ user@remote:/remote/dir/
rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Prepare raw data

mkdir raw_data; cd raw_data

#Δsbp->deltasbp
#1457.1_2h_MH,WT,MH,2h,1
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
#1457dsbp118h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
#1457dsbp218h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz

#1457.1_2h_
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz  WT_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz  WT_TSB_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz  WT_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz  WT_TSB_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz  WT_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz  WT_TSB_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz  WT_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz  WT_TSB_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz  WT_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz  WT_TSB_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz  WT_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz  WT_TSB_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz  WT_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz  WT_TSB_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz  WT_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz  WT_TSB_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz  WT_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz  WT_TSB_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
#1457dsbp1_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
#1457dsbp2_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
#1457dsbp3_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
#END

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3  deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 deltasbp_MH_18h_1 deltasbp_MH_18h_2 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
done
mv trimmed/*.fastq.gz .

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
...

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP020463.gff) --
grep -P "\texon\t" CP020463.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP020463.gff | wc -l   #=11  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP020463.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l    #tRNA 58
grep -P "\tCDS\t" CP020463.gff | wc -l  #3701-->2324
sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
grep -P "\texon\t" CP020463_m.gff | sort | wc -l  #3797-->2405

# -- NOTE that combination of 'CP020463_m.gff' and 'exon' in the command will result in ERROR, using 'transcript' instead in the command line!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file, both are "CP020463.1"

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library(gplots)
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(DESeq2)
library(edgeR)

setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c(
        "deltasbp_MH_2h_r1" = "./deltasbp_MH_2h_1/quant.sf",
        "deltasbp_MH_2h_r2" = "./deltasbp_MH_2h_2/quant.sf",
        "deltasbp_MH_2h_r3" = "./deltasbp_MH_2h_3/quant.sf",
        "deltasbp_MH_4h_r1" = "./deltasbp_MH_4h_1/quant.sf",
        "deltasbp_MH_4h_r2" = "./deltasbp_MH_4h_2/quant.sf",
        "deltasbp_MH_4h_r3" = "./deltasbp_MH_4h_3/quant.sf",
        "deltasbp_MH_18h_r1" = "./deltasbp_MH_18h_1/quant.sf",
        "deltasbp_MH_18h_r2" = "./deltasbp_MH_18h_2/quant.sf",
        "deltasbp_TSB_2h_r1" = "./deltasbp_TSB_2h_1/quant.sf",
        "deltasbp_TSB_2h_r2" = "./deltasbp_TSB_2h_2/quant.sf",
        "deltasbp_TSB_2h_r3" = "./deltasbp_TSB_2h_3/quant.sf",
        "deltasbp_TSB_4h_r1" = "./deltasbp_TSB_4h_1/quant.sf",
        "deltasbp_TSB_4h_r2" = "./deltasbp_TSB_4h_2/quant.sf",
        "deltasbp_TSB_4h_r3" = "./deltasbp_TSB_4h_3/quant.sf",
        "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
        "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
        "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
        "WT_MH_2h_r1" = "./WT_MH_2h_1/quant.sf",
        "WT_MH_2h_r2" = "./WT_MH_2h_2/quant.sf",
        "WT_MH_2h_r3" = "./WT_MH_2h_3/quant.sf",
        "WT_MH_4h_r1" = "./WT_MH_4h_1/quant.sf",
        "WT_MH_4h_r2" = "./WT_MH_4h_2/quant.sf",
        "WT_MH_4h_r3" = "./WT_MH_4h_3/quant.sf",
        "WT_MH_18h_r1" = "./WT_MH_18h_1/quant.sf",
        "WT_MH_18h_r2" = "./WT_MH_18h_2/quant.sf",
        "WT_MH_18h_r3" = "./WT_MH_18h_3/quant.sf",
        "WT_TSB_2h_r1" = "./WT_TSB_2h_1/quant.sf",
        "WT_TSB_2h_r2" = "./WT_TSB_2h_2/quant.sf",
        "WT_TSB_2h_r3" = "./WT_TSB_2h_3/quant.sf",
        "WT_TSB_4h_r1" = "./WT_TSB_4h_1/quant.sf",
        "WT_TSB_4h_r2" = "./WT_TSB_4h_2/quant.sf",
        "WT_TSB_4h_r3" = "./WT_TSB_4h_3/quant.sf",
        "WT_TSB_18h_r1" = "./WT_TSB_18h_1/quant.sf",
        "WT_TSB_18h_r2" = "./WT_TSB_18h_2/quant.sf",
        "WT_TSB_18h_r3" = "./WT_TSB_18h_3/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
condition <- factor(c("deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_18h","deltasbp_MH_18h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_18h","deltasbp_TSB_18h","deltasbp_TSB_18h","WT_MH_2h","WT_MH_2h","WT_MH_2h","WT_MH_4h","WT_MH_4h","WT_MH_4h","WT_MH_18h","WT_MH_18h","WT_MH_18h","WT_TSB_2h","WT_TSB_2h","WT_TSB_2h","WT_TSB_4h","WT_TSB_4h","WT_TSB_4h","WT_TSB_18h","WT_TSB_18h","WT_TSB_18h"))

sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("strain", "media", "time")
colData <- cbind(sample_table, split_cond)
colData$strain <- factor(colData$strain)
colData$media  <- factor(colData$media)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
# Define the colData for DESeq2
#colData <- data.frame(condition=condition, row.names=names(files))

#grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
#grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# -------------------------------
# 2️⃣ Transcript-level counts
# -------------------------------
# Create DESeqDataSet directly from tximport (transcript-level)
dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds_tx), file="transcript_counts.csv")

# --------------------------------
# 3️⃣ Gene-level summarization
# --------------------------------
# Re-import Salmon data summarized at gene level
txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)

# Create DESeqDataSet for gene-level counts
#dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media + strain + media:strain + strain:time)
#或更简单地写为（推荐）：dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media * strain)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ strain * media * time)
#~ strain * media * time    主效应 + 所有交互（推荐）  ✅
#~ time + media * strain    主效应 + media:strain 交互   ⚠️ 有限制

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
png("pca_media.png", 1200, 800)
plotPCA(rld, intgroup=c("media"))
dev.off()
png("pca_strain.png", 1200, 800)
plotPCA(rld, intgroup=c("strain"))
dev.off()
png("pca_time.png", 1200, 800)
plotPCA(rld, intgroup=c("time"))
dev.off()

(Optional; ERROR–>need to be debugged!) ) estimate size factors and dispersion values.

#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
#Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
#So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

sizeFactors(dds)
#NULL
# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
#> sizeFactors(dds)

#control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
#2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
#HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
#0.4604461  0.5799834  0.6803681

# (DEBUG) If avgTxLength is Necessary
#To simplify the computation and ensure sizeFactors are calculated:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
#If you want to retain avgTxLength but suspect it is causing issues, you can explicitly instruct DESeq2 to compute size factors without correcting for library size with average transcript lengths:
dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
sizeFactors(dds)

# If alone with virus data, the following BUG occured:
#Still NULL --> BUG --> using manual calculation method for sizeFactor calculation!
                    HeLa_TO_r1                      HeLa_TO_r2
                    0.9978755                       1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

#Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
1/0.9978755=1.002129023
1/1.1092227=
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))
    # Compute the geometric mean over the line
    gm.mean  <-  apply(cts, 1, geomMean)
    # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factor
    return(med)
}
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
[5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
[9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
[13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
[17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
[21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
[25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
[29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
[33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# 确保因子顺序（可选）
colData$strain <- relevel(factor(colData$strain), ref = "WT")
colData$media  <- relevel(factor(colData$media), ref = "TSB")
colData$time   <- relevel(factor(colData$time), ref = "2h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                      "strain_deltasbp_vs_WT"
#[3] "media_MH_vs_TSB"                "time_18h_vs_2h"
#[5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
#[7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
#[9] "mediaMH.time18h"                "mediaMH.time4h"
#[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

🔹 Main effects for each factor:

表达量
▲
│       ┌────── WT-TSB
│      /
│     /     ┌────── WT-MH
│    /     /
│   /     /     ┌────── deltasbp-TSB
│  /     /     /
│ /     /     /     ┌────── deltasbp-MH
└──────────────────────────────▶ 时间（2h, 4h, 18h）

    * strain_deltasbp_vs_WT
    * media_MH_vs_TSB
    * time_18h_vs_2h
    * time_4h_vs_2h

🔹 两因素交互作用（Two-way interactions）
这些项表示两个实验因素（如菌株、培养基、时间）之间的组合效应——也就是说，其中一个因素的影响取决于另一个因素的水平。

表达量
▲
│
│             WT ────────┐
│                        └─↘
│                           ↘
│                        deltasbp ←←←← 显著交互（方向/幅度不同）
└──────────────────────────────▶ 时间

straindeltasbp.mediaMH
表示 菌株（strain）和培养基（media）之间的交互作用。
➤ 这意味着：deltasbp 这个突变菌株在 MH 培养基中的表现与它在 TSB 中的不同，不能仅通过菌株和培养基的单独效应来解释。

straindeltasbp.time18h
表示 菌株（strain）和时间（time, 18h）之间的交互作用。
➤ 即：突变菌株在 18 小时时的表达变化不只是菌株效应或时间效应的简单相加，而有协同作用。

straindeltasbp.time4h
同上，是菌株和时间（4h）之间的交互作用。

mediaMH.time18h
表示 培养基（MH）与时间（18h）之间的交互作用。
➤ 即：在 MH 培养基中，18 小时时的表达水平与在其他时间点（例如 2h）不同，且该变化不完全可以用时间和培养基各自单独的效应来解释。

mediaMH.time4h
与上面类似，是 MH 培养基与 4 小时之间的交互作用。

🔹 三因素交互作用（Three-way interactions）
三因素交互作用表示：菌株、培养基和时间这三个因素在一起时，会产生一个新的效应，这种效应无法通过任何两个因素的组合来完全解释。

表达量（TSB）
▲
│
│        WT ──────→→
│        deltasbp ─────→→
└────────────────────────▶ 时间（2h, 4h, 18h）

表达量（MH）
▲
│
│        WT ──────→→
│        deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
└────────────────────────▶ 时间（2h, 4h, 18h）

straindeltasbp.mediaMH.time18h
表示 菌株 × 培养基 × 时间（18h） 三者之间的交互作用。
➤ 即：突变菌株在 MH 培养基下的 18 小时表达模式，与其他组合（比如 WT 在 MH 培养基下，或者在 TSB 下）都不相同。

straindeltasbp.mediaMH.time4h
同上，只是观察的是 4 小时下的三因素交互效应。

✅ 总结：
交互作用项的存在意味着你不能仅通过单个变量（如菌株、时间或培养基）的影响来解释基因表达的变化，必须同时考虑它们之间的组合关系。在 DESeq2 模型中，这些交互项的显著性可以揭示特定条件下是否有特异的调控行为。

# 提取 strain 的主效应: up 2, down 16
contrast <- "strain_deltasbp_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 media 的主效应: up 76; down 128
contrast <- "media_MH_vs_TSB"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# 提取 time 的主效应 up 228, down 98; up 17, down 2
contrast <- "time_18h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

contrast <- "time_4h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  delta sbp 2h TSB vs WT 2h TSB
#2.)  delta sbp 4h TSB vs WT 4h TSB
#3.)  delta sbp 18h TSB vs WT 18h TSB
#4.)  delta sbp 2h MH vs WT 2h MH
#5.)  delta sbp 4h MH vs WT 4h MH
#6.)  delta sbp 18h MH vs WT 18h MH

#---- relevel to control ----
#design=~condition+replicate
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

dds$condition <- relevel(dds$condition, "WT_TSB_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

dds$condition <- relevel(dds$condition, "WT_MH_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

# WT_MH_xh
dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")
dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_18h_vs_WT_MH_4h")

# WT_TSB_xh
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")
dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

# deltasbp_MH_xh
dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")
dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

# deltasbp_TSB_xh
dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")
dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff

for cmp in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
            emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  470.IX87_14445:
                * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
                * IX87_14445 would refer to a specific gene or protein within that genome.

        Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher setting the workspace "mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
    * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            QBlast finished with warnings!
            Blasted Sequences: 2084
            Sequences without results: 105
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP020463_protein) Done"
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
            "Mapping (CP020463_protein) Done"
            "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
            "Annotation (CP020463_protein) Done"
            "Annotation finished."
    or blast2go_cli_v1.5.1 (NOT_USED)

            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
        * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
            "InterProScan (CP020463_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
        * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
        #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
    File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
            #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0085020
        #-- after merging (blast2go_annot.annot2) -->
        #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0070531
        #H0N29_18790     GO:0004842
        #H0N29_18790     GO:0005515
        #H0N29_18790     GO:0085020

        cp blast2go_annot.annot blast2go_annot.annot2

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all 6 input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        # Define the filenames and output suffixes
        file_list <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h",
          "WT_MH_18h_vs_WT_MH_2h",
          "WT_MH_18h_vs_WT_MH_4h",
          "WT_TSB_4h_vs_WT_TSB_2h",
          "WT_TSB_18h_vs_WT_TSB_2h",
          "WT_TSB_18h_vs_WT_TSB_4h",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
        )

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_4h-all.csv",
          "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
          # Prepare KO lists (remove "ko:" if present)
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up)]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)]
          k_up <- gsub("ko:", "", k_up);  k_dn <- gsub("ko:", "", k_dn)

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
# -- IMPORTANT_ADAPTION: the script by chaning "H0N29_" with "B4U56_"
for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config ---------------------------------------------------
contrast <- "deltasbp_TSB_2h_vs_WT_TSB_2h"    #17, height=600, heatmap_pattern1
contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h"    #25, height=800, heatmap_pattern1
contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h"  #34, height=1000, heatmap_pattern1
contrast <- "deltasbp_MH_2h_vs_WT_MH_2h"      #43, height=1200, heatmap_pattern1
contrast <- "deltasbp_MH_4h_vs_WT_MH_4h"      #26, height=800, heatmap_pattern1
contrast <- "deltasbp_MH_18h_vs_WT_MH_18h"    #41, height=1200, heatmap_pattern1

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
#NOT_NECCESSARY since it was already sorted: colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- RUN the code with the new contract from HERE after first run --

## 4) Build GOI from the two .id files (Note that if empty not run!)-------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "
```
-all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } # Column labels: 'deltaadeIJ' -> ‘ΔadeIJ’ and nicer spacing labCol_pretty <- colnames(datamat) #labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> “WT TSB 2h r1” # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) ## 8) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] #BREAK_LINE labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=1200) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 20), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = labRow_pretty, # row labels WITHOUT "gene-" labCol = labCol_pretty, # col labels with Δsbp + spaces cexRow = 2.5, cexCol = 2.5, srtCol = 20, lhei = c(0.6, 4), # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' lwid = c(0.8, 4)) # enlarge the first number when reduce the plot size to avoid the error 'Error in plot.new() : figure margins too large' dev.off() # DEBUG for some items starting with "gene-" labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 6500) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.4, # ↓ smaller column label font (was 1.3) cexCol = 1.8, srtCol = 20, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples ---------------------- ## ============================================================ ## Three-condition DEGs heatmap from multiple pairwise contrasts ## Example contrasts: ## "WT_MH_4h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_2h", ## "WT_MH_18h_vs_WT_MH_4h" ## Output shows the union of DEGs across all contrasts and ## only the columns (samples) for the 3 conditions. ## ============================================================ ## -------- 0) User inputs ------------------------------------ contrasts <- c( "WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_4h" #--> 424 genes, height=6000, heatmap_pattern2 ) contrasts <- c( "WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_4h" #--> 358 genes, height=5200, heatmap_pattern2 ) contrasts <- c( "deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_4h" #--> 345 genes, height=5120, heatmap_pattern2 ) contrasts <- c( "deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_4h" #--> 276 genes, height=4000, heatmap_pattern2 ) ## Optionally force a condition display order (defaults to order of first appearance) cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h") cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h") cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h") cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h") #cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltasbp", "\u0394sbp", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE # -- RUN the code with the new contract from HERE after first run -- ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message(“Removing “, sum(!row_ok), ” constant/NA rows.”) datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- rownames(datamat) # ---- Replace GeneID with GeneName from " -all_annotated.csv” all_path <- paste0(contrast, "-all_annotated.csv") if (file.exists(all_path)) { ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE) pick_col <- function(df, candidates) { hit <- intersect(candidates, names(df)) if (length(hit) == 0) return(NA_character_) hit[1] } id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 4000) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 20, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples END ---------------------- # -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if (nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Quantitative Analysis of LT Protein Assembly on DNA Using HMM-Guided Photobleaching Step Detection (结合 HMM 光漂白分级的一种 DNA-蛋白组装定量分析方法)

Methods: HMM-Based Quantification of mN-LT Assembly on DNA

English Methods-Style Text

中文方法学描述

Single-Molecule Binding/Bleaching Detection Pipeline for Data_Vero_Kymographs

Overview

Step 1: ICON HMM Fitting per Track

Step 2: Discretize HMM Means (Not Used for Plot Generation)

Step 3: Event Detection \& Visualization (Python)

Step 4: (For Future) Aggregation

Figure Explanation (e.g. Track 14 and Track 100)

Key Script Function Descriptions

Complete Scripts

Example Event Table Output

Abstract

Detailed Structure \& Outline

English Summary

中文总结

完整脚本：detect_binding_bleach_dwell_simple.m

使用与扩展建议

detect_binding_bleach_dwell.m 完整代码

level_analysis.m vs detect_binding_bleach.m

一、算法差异概览

二、用中文说明 detect_binding_bleach.m 的完整流程（含与原脚本共用的 HMM 思路）

1. 命令行输入与文件读取

2. 对每条轨迹循环，构建输入序列 z

3. 设置 ICON HMM 的超参数和采样参数

4. 运行 HMM 采样器，得到 MCMC 链

5. 后验均值轨迹分析：chainer_analyze_means

6. 把连续水平轨迹聚类成几个离散状态

7. 从状态序列中检测 binding 和 bleaching 事件

8. 汇总所有轨迹结果并写出 CSV

总结（中文角度）

完整脚本：`detect_binding_bleach_dwell_simple.m`

`detect_binding_bleach_dwell.m` 完整代码

二、用中文说明 `detect_binding_bleach.m` 的完整流程（含与原脚本共用的 HMM 思路）

5. 后验均值轨迹分析：`chainer_analyze_means`