Processing Data_Michelle_RNAseq_2025 v2

The current results contain the main effects. I also compared the condition deltasbp_MH_18h with WT_MH_18h. If you are interested in specific comparisons between conditions, please let me know; I can run the differential expression analysis and draw the corresponding volcano plots for them.

Targets

The experiments done so far:
I have two strains:
1. 1457 wildtype
2. 1457Δsbp (sbp knockout strain)

I have grown these two strains in two media for 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase) and 18h (mature biofilm phase):
1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth visible (sbp knockout shows less biofilm formation and a growth deficit)
2. medium MH -> nutrient-poor medium: differences from the wild type more obvious (sbp knockout shows a stronger growth deficit)

Our idea/hypothesis of what we hope to achieve with the RNA-Seq:
Since we already see differences in growth and biofilm formation and also differences in the proteome (through cooperation with mass spectrometry), we also expect differences in the transcription of the genes in the RNA-Seq. Could you analyze the RNA-Seq data for me and compare the strains at the different time points? But maybe also compare the different time points of one strain with each other?
The following would be interesting for me:
- PCA plot (sample comparison)
- Heatmaps (wild type vs. sbp knockout)
- Volcano plots (significant genes)
- Gene Ontology (GO) analyses

Download the raw data

Email from BGI (RNA-seq provider):
The data from project F25A430000603 are uploaded to AWS.
Please download the data as below:
URL：https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
Project：F25A430000603-01-STAkimxP
Alias ID：598731762349
S3 Bucket：stakimxp-598731762349
Account：stakimxp
Password：<redacted>
Region：eu-central-1
Aws_access_key_id：<redacted>
Aws_secret_access_key：<redacted>

aws s3 cp s3://stakimxp-598731762349/ ./ --recursive

cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
# General form: rsync -avzP /local/dir/ user@remote:/remote/dir/
rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Prepare raw data

mkdir raw_data; cd raw_data

#Δsbp->deltasbp
#1457.1_2h_MH,WT,MH,2h,1
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
#1457dsbp118h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
#1457dsbp218h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz

#1457.1_2h_
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz  WT_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz  WT_TSB_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz  WT_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz  WT_TSB_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz  WT_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz  WT_TSB_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz  WT_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz  WT_TSB_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz  WT_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz  WT_TSB_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz  WT_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz  WT_TSB_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz  WT_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz  WT_TSB_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz  WT_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz  WT_TSB_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz  WT_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz  WT_TSB_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
#1457dsbp1_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
#1457dsbp2_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
#1457dsbp3_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
#END
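
The links above follow a regular pattern, so they could also be generated instead of typed by hand. The following is only a minimal sketch, not the procedure actually used here: the helper name bgi_to_sample_id and its regular expressions are assumptions inferred from the directory names listed above, and the loop only prints the ln commands (dry run) so they can be checked before anything is created.

```shell
# Hypothetical helper (not part of the original workflow): map a BGI directory
# name such as 1457.1_2h_MH or 1457dsbp118hTSB to a sample ID such as WT_MH_2h_1.
bgi_to_sample_id() {
    bgi_name=$1
    # 1457 + ("." for WT | "dsbp") + replicate + optional "_" + time + optional "_" + medium
    bgi_id=$(printf '%s\n' "$bgi_name" | sed -n -E \
        -e 's/^1457\.([0-9])_?(2h|4h|18h)_?(MH|TSB)$/WT_\3_\2_\1/p' \
        -e 's/^1457dsbp([0-9])_?(2h|4h|18h)_?(MH|TSB)$/deltasbp_\3_\2_\1/p')
    if [ -n "$bgi_id" ]; then
        printf '%s\n' "$bgi_id"
    else
        echo "unrecognized directory name: $bgi_name" >&2
        return 1
    fi
}

# Dry run from inside raw_data/: print the ln commands instead of executing them.
for dir in ../F25A430000603-01_STAkimxP/*/ ../F25A430000603_STAmsvaP/*/; do
    [ -d "$dir" ] || continue
    base=$(basename "$dir")
    id=$(bgi_to_sample_id "$base") || continue
    echo ln -s "${dir}${base}_1.fq.gz" "${id}_R1.fastq.gz"
    echo ln -s "${dir}${base}_2.fq.gz" "${id}_R2.fastq.gz"
done
```

Removing the two echo words would create the links; directory names that do not fit the assumed pattern are reported on stderr and skipped.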

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3  deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 deltasbp_MH_18h_1 deltasbp_MH_18h_2 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20
done 2> trimmomatic_pe.log
mv trimmed/*.fastq.gz .
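
The hard-coded sample list in the loop above has to be kept in sync with raw_data/ by hand. As a hedged alternative, the sample IDs could be derived from the files themselves. The helper name list_paired_samples is hypothetical; it only assumes the _R1.fastq.gz / _R2.fastq.gz naming used above.

```shell
# Hypothetical helper: print the IDs of all samples that have both an R1 and
# an R2 file in the given directory; incomplete pairs are reported on stderr.
list_paired_samples() {
    fastq_dir=$1
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern (no matches) and broken symlinks.
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ -e "$fastq_dir/${sample}_R2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            echo "missing R2 for $sample" >&2
        fi
    done
}

# Example: iterate over whatever is present in raw_data/.
for sample_id in $(list_paired_samples raw_data); do
    echo "would trim: $sample_id"
done
```

Because broken symlinks are skipped, this also catches typos in the ln -s targets before Trimmomatic is started.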

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
...
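
The remaining rows of samplesheet.csv follow the same pattern as the first one, so the file could be written by a short loop. This is a sketch under the assumption that the trimmed FASTQ files sit in one directory and are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz; make_samplesheet is a hypothetical helper name.

```shell
# Hypothetical helper: print a samplesheet with one row per R1 file found in
# the given directory, using the four columns shown above.
make_samplesheet() {
    fastq_dir=$1
    echo "sample,fastq_1,fastq_2,strandedness"
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue
        sample=$(basename "$r1" _R1.fastq.gz)
        echo "${sample},${sample}_R1.fastq.gz,${sample}_R2.fastq.gz,auto"
    done
}

# Preview for the current directory; redirect to samplesheet.csv once it looks right.
make_samplesheet . | head -n 3
```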

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP020463.gff) --
grep -P "\texon\t" CP020463.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP020463.gff | wc -l   #=11  signal recognition particle sRNA small type, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP020463.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l    #tRNA 58
grep -P "\tCDS\t" CP020463.gff | wc -l  #3701-->2324
sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
grep -P "\texon\t" CP020463_m.gff | sort | wc -l  #3797-->2405
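
The individual grep counts above can also be obtained in one pass, and the CDS-to-exon replacement can be restricted explicitly to column 3 (the feature type). The following sketch uses awk for both. The function names are hypothetical; only the file name CP020463.gff is taken from the commands above.

```shell
# Hypothetical helper: tabulate column 3 (feature type) of a GFF file,
# skipping comment lines. Output: count <TAB> feature type.
count_feature_types() {
    awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ } END { for (t in n) print n[t] "\t" t }' "$1" | sort -k2,2
}

# Hypothetical helper: rename feature type CDS to exon in column 3 only;
# comment lines and all other columns pass through unchanged.
cds_to_exon() {
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ && $3 == "CDS" { $3 = "exon" } { print }' "$1"
}

if [ -f CP020463.gff ]; then
    count_feature_types CP020463.gff
    # "-" makes awk read the converted GFF from standard input.
    cds_to_exon CP020463.gff | count_feature_types -
fi
```

After the conversion, the exon count should equal the previous exon count plus the previous CDS count.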

# -- NOTE: combining 'CP020463_m.gff' with the feature type 'exon' on the command line results in an ERROR; use 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the FASTA header matches the sequence ID used in the *_m.gff file; both should be "CP020463.1"
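
The DEBUG_3 check can be automated before starting the pipeline. This is a minimal sketch with hypothetical helper names; it assumes the sequence ID is the FASTA header text up to the first whitespace and column 1 of each GFF feature line.

```shell
# Hypothetical helper: sequence IDs from FASTA headers (text after ">" up to
# the first whitespace).
fasta_seq_ids() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u
}

# Hypothetical helper: sequence IDs from column 1 of the GFF feature lines.
gff_seq_ids() {
    awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | sort -u
}

# Succeeds only if both files use exactly the same set of sequence IDs.
same_seq_ids() {
    [ "$(fasta_seq_ids "$1")" = "$(gff_seq_ids "$2")" ]
}

if [ -f CP020463.fasta ] && [ -f CP020463_m.gff ]; then
    if same_seq_ids CP020463.fasta CP020463_m.gff; then
        echo "sequence IDs match"
    else
        echo "sequence IDs differ" >&2
    fi
fi
```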

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)

setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c(
        "deltasbp_MH_2h_r1" = "./deltasbp_MH_2h_1/quant.sf",
        "deltasbp_MH_2h_r2" = "./deltasbp_MH_2h_2/quant.sf",
        "deltasbp_MH_2h_r3" = "./deltasbp_MH_2h_3/quant.sf",
        "deltasbp_MH_4h_r1" = "./deltasbp_MH_4h_1/quant.sf",
        "deltasbp_MH_4h_r2" = "./deltasbp_MH_4h_2/quant.sf",
        "deltasbp_MH_4h_r3" = "./deltasbp_MH_4h_3/quant.sf",
        "deltasbp_MH_18h_r1" = "./deltasbp_MH_18h_1/quant.sf",
        "deltasbp_MH_18h_r2" = "./deltasbp_MH_18h_2/quant.sf",
        "deltasbp_TSB_2h_r1" = "./deltasbp_TSB_2h_1/quant.sf",
        "deltasbp_TSB_2h_r2" = "./deltasbp_TSB_2h_2/quant.sf",
        "deltasbp_TSB_2h_r3" = "./deltasbp_TSB_2h_3/quant.sf",
        "deltasbp_TSB_4h_r1" = "./deltasbp_TSB_4h_1/quant.sf",
        "deltasbp_TSB_4h_r2" = "./deltasbp_TSB_4h_2/quant.sf",
        "deltasbp_TSB_4h_r3" = "./deltasbp_TSB_4h_3/quant.sf",
        "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
        "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
        "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
        "WT_MH_2h_r1" = "./WT_MH_2h_1/quant.sf",
        "WT_MH_2h_r2" = "./WT_MH_2h_2/quant.sf",
        "WT_MH_2h_r3" = "./WT_MH_2h_3/quant.sf",
        "WT_MH_4h_r1" = "./WT_MH_4h_1/quant.sf",
        "WT_MH_4h_r2" = "./WT_MH_4h_2/quant.sf",
        "WT_MH_4h_r3" = "./WT_MH_4h_3/quant.sf",
        "WT_MH_18h_r1" = "./WT_MH_18h_1/quant.sf",
        "WT_MH_18h_r2" = "./WT_MH_18h_2/quant.sf",
        "WT_MH_18h_r3" = "./WT_MH_18h_3/quant.sf",
        "WT_TSB_2h_r1" = "./WT_TSB_2h_1/quant.sf",
        "WT_TSB_2h_r2" = "./WT_TSB_2h_2/quant.sf",
        "WT_TSB_2h_r3" = "./WT_TSB_2h_3/quant.sf",
        "WT_TSB_4h_r1" = "./WT_TSB_4h_1/quant.sf",
        "WT_TSB_4h_r2" = "./WT_TSB_4h_2/quant.sf",
        "WT_TSB_4h_r3" = "./WT_TSB_4h_3/quant.sf",
        "WT_TSB_18h_r1" = "./WT_TSB_18h_1/quant.sf",
        "WT_TSB_18h_r2" = "./WT_TSB_18h_2/quant.sf",
        "WT_TSB_18h_r3" = "./WT_TSB_18h_3/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
condition <- factor(c("deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_18h","deltasbp_MH_18h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_18h","deltasbp_TSB_18h","deltasbp_TSB_18h","WT_MH_2h","WT_MH_2h","WT_MH_2h","WT_MH_4h","WT_MH_4h","WT_MH_4h","WT_MH_18h","WT_MH_18h","WT_MH_18h","WT_TSB_2h","WT_TSB_2h","WT_TSB_2h","WT_TSB_4h","WT_TSB_4h","WT_TSB_4h","WT_TSB_18h","WT_TSB_18h","WT_TSB_18h"))

sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("strain", "media", "time")
colData <- cbind(sample_table, split_cond)
colData$strain <- factor(colData$strain)
colData$media  <- factor(colData$media)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
# Define the colData for DESeq2
#colData <- data.frame(condition=condition, row.names=names(files))

#grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
#grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# -------------------------------
# 2️⃣ Transcript-level counts
# -------------------------------
# Create DESeqDataSet directly from tximport (transcript-level)
dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds_tx), file="transcript_counts.csv")

# --------------------------------
# 3️⃣ Gene-level summarization
# --------------------------------
# Re-import Salmon data summarized at gene level
txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)

# Create DESeqDataSet for gene-level counts
#dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media + strain + media:strain + strain:time)
# A simpler alternative (note: this drops the strain:time term): dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media * strain)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ strain * media * time)
#~ strain * media * time    main effects + all interactions (recommended)  ✅
#~ time + media * strain    main effects + media:strain interaction only   ⚠️ limited

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
png("pca_media.png", 1200, 800)
plotPCA(rld, intgroup=c("media"))
dev.off()
png("pca_strain.png", 1200, 800)
plotPCA(rld, intgroup=c("strain"))
dev.off()
png("pca_time.png", 1200, 800)
plotPCA(rld, intgroup=c("time"))
dev.off()

(Optional; ERROR --> needs debugging!) Estimate size factors and dispersion values

#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. DESeq2 computes them with the median-of-ratios method: for each sample, the size factor is the median over genes of the ratio between the sample's count and the gene's geometric mean across all samples, assuming that most genes are not differentially expressed.
#Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
#So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

sizeFactors(dds)
#NULL
# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
#> sizeFactors(dds)

#control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
#2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
#HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
#0.4604461  0.5799834  0.6803681

# (DEBUG) Why sizeFactors(dds) stays NULL
# With tximport input, dds carries an avgTxLength assay. estimateSizeFactors() then stores gene- and sample-specific normalizationFactors(dds) instead of per-sample size factors, so sizeFactors(dds) returns NULL.
# To get plain per-sample size factors, drop the assay first:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
# NOTE: the call below fails because 'use' is not an argument of estimateSizeFactors(); it is kept commented out for reference only.
#dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
#sizeFactors(dds)

# When run on the virus data alone, the following BUG occurred:
# sizeFactors(dds) was still NULL --> BUG --> use the manual size-factor calculation (estimSf below)!
#                    HeLa_TO_r1                      HeLa_TO_r2
#                    0.9978755                       1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

# DESeq2 divides the raw counts by the size factor, so a sample with a lower sequencing depth gets a size factor below 1 and its normalized counts are scaled up. bamCoverage multiplies the coverage by --scaleFactor, so the inverse of the size factor has to be passed:
# 1/0.9978755 = 1.002129023
# 1/1.1092227 = 0.901532217
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
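
The inverse size factors used as --scaleFactor can be computed on the command line instead of by hand. The sketch below uses the two size factors quoted above; inverse_size_factor is a hypothetical helper, and the loop only prints the bamCoverage commands (dry run).

```shell
# Hypothetical helper: bamCoverage scale factor = 1 / DESeq2 size factor.
inverse_size_factor() {
    awk -v sf="$1" 'BEGIN { if (sf + 0 <= 0) exit 1; printf "%.9f\n", 1 / sf }'
}

# Dry run: print one bamCoverage command per "sample size_factor" pair.
while read -r sample sf; do
    scale=$(inverse_size_factor "$sf") || continue
    echo bamCoverage --bam "../markDuplicates/${sample}Aligned.sortedByCoord.out.markDups.bam" \
        -o "${sample}.bw" --binSize 10 --scaleFactor "$scale" --effectiveGenomeSize 2864785220
done <<EOF
HeLa_TO_r1 0.9978755
HeLa_TO_r2 1.1092227
EOF
```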

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))
    # Compute the geometric mean of each row (gene)
    gm.mean  <-  apply(cts, 1, geomMean)
    # Zero values are set to NA (avoids a subsequent division by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each row by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factor
    return(med)
}
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
[5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
[9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
[13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
[17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
[21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
[25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
[29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
[33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Set the reference level of each factor (optional)
colData$strain <- relevel(factor(colData$strain), ref = "WT")
colData$media  <- relevel(factor(colData$media), ref = "TSB")
colData$time   <- relevel(factor(colData$time), ref = "2h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                      "strain_deltasbp_vs_WT"
#[3] "media_MH_vs_TSB"                "time_18h_vs_2h"
#[5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
#[7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
#[9] "mediaMH.time18h"                "mediaMH.time4h"
#[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

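🔹 Main effects for each factor:

Expression level
▲
│       ┌────── WT-TSB
│      /
│     /     ┌────── WT-MH
│    /     /
│   /     /     ┌────── deltasbp-TSB
│  /     /     /
│ /     /     /     ┌────── deltasbp-MH
└──────────────────────────────▶ Time (2h, 4h, 18h)

    * strain_deltasbp_vs_WT
    * media_MH_vs_TSB
    * time_18h_vs_2h
    * time_4h_vs_2h

🔹 Two-way interactions
These terms describe the combined effect of two experimental factors (strain, medium, time): the effect of one factor depends on the level of the other.

Expression level
▲
│
│             WT ────────┐
│                        └─↘
│                           ↘
│                        deltasbp ←←←← significant interaction (different direction/magnitude)
└──────────────────────────────▶ Time

straindeltasbp.mediaMH
The interaction between strain and medium.
➤ The deltasbp mutant behaves differently in MH than in TSB, and this cannot be explained by the separate effects of strain and medium alone.

straindeltasbp.time18h
The interaction between strain and time (18h).
➤ The expression change of the mutant at 18h is not simply the sum of the strain effect and the time effect; there is an additional combined effect.

straindeltasbp.time4h
As above, for the interaction between strain and time (4h).

mediaMH.time18h
The interaction between medium (MH) and time (18h).
➤ In MH, the expression level at 18h differs from that at other time points (for example 2h), and this change cannot be fully explained by the separate effects of time and medium.

mediaMH.time4h
As above, for the interaction between MH medium and 4h.

🔹 Three-way interactions
A three-way interaction means that strain, medium and time together produce an additional effect that cannot be fully explained by any combination of two factors.

Expression level (TSB)
▲
│
│        WT ──────→→
│        deltasbp ─────→→
└────────────────────────▶ Time (2h, 4h, 18h)

Expression level (MH)
▲
│
│        WT ──────→→
│        deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
└────────────────────────▶ Time (2h, 4h, 18h)

straindeltasbp.mediaMH.time18h
The three-way interaction strain × medium × time (18h).
➤ The expression pattern of the mutant in MH at 18h differs from that of the other combinations (for example WT in MH, or either strain in TSB).

straindeltasbp.mediaMH.time4h
As above, for the three-way interaction at 4h.

✅ Summary:
The presence of interaction terms means that changes in gene expression cannot be explained by the effect of a single variable (strain, time or medium) alone; the combinations of the factors have to be considered as well. In the DESeq2 model, the significance of these interaction terms can reveal regulation that is specific to particular conditions.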

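# Extract the main effect of strain: up 2, down 16. With the design ~ strain * media * time, this coefficient is the deltasbp vs WT effect at the reference levels (TSB, 2h).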
contrast <- "strain_deltasbp_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

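# Extract the main effect of media: up 76; down 128. With the design ~ strain * media * time, this coefficient is the MH vs TSB effect at the reference levels (WT, 2h).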
contrast <- "media_MH_vs_TSB"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

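# Extract the main effects of time: up 228, down 98; up 17, down 2. With the design ~ strain * media * time, these coefficients are the time effects at the reference levels (WT, TSB).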
contrast <- "time_18h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

contrast <- "time_4h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))
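
The up/down numbers noted in the comments above can be re-checked from the shell using the exported "-all.txt" files. This is a sketch, not part of the original analysis: count_degs is a hypothetical helper, and it assumes the comma-separated layout produced by write.csv with a header row containing log2FoldChange and padj. The thresholds (padj <= 0.05, |log2FoldChange| >= 2) are the ones used in the R code.

```shell
# Hypothetical helper: count up- and down-regulated genes in a DESeq2 result
# table written by write.csv. Columns are located by header name; rows with
# padj == NA are ignored.
count_degs() {
    awk -F ',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                gsub(/"/, "", $i)
                if ($i == "log2FoldChange") lfc = i
                if ($i == "padj") padj = i
            }
            next
        }
        $padj != "NA" && $padj + 0 <= 0.05 {
            if ($lfc + 0 >= 2) up++
            else if ($lfc + 0 <= -2) down++
        }
        END { printf "up=%d down=%d\n", up, down }
    ' "$1"
}

# Summarize every exported contrast in the current directory (degenes/).
for f in ./*-all.txt; do
    [ -f "$f" ] || continue
    printf '%s\t%s\n' "$f" "$(count_degs "$f")"
done
```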

#1.)  delta sbp 2h TSB vs WT 2h TSB
#2.)  delta sbp 4h TSB vs WT 4h TSB
#3.)  delta sbp 18h TSB vs WT 18h TSB
#4.)  delta sbp 2h MH vs WT 2h MH
#5.)  delta sbp 4h MH vs WT 4h MH
#6.)  delta sbp 18h MH vs WT 18h MH

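#---- relevel to control ----
# NOTE: every block below refits dds and overwrites clist. Run ONE block, then run the
# for-loop at the end of this section to export its contrasts; repeat for the next block.
#design=~condition+replicate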
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

dds$condition <- relevel(dds$condition, "WT_TSB_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

dds$condition <- relevel(dds$condition, "WT_MH_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

# WT_MH_xh
dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")
dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_18h_vs_WT_MH_4h")

# WT_TSB_xh
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")
dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

# deltasbp_MH_xh
dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")
dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

# deltasbp_TSB_xh
dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")
dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
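  # NOTE: the up/down tables written here use padj <= 0.01; the Excel sheets and volcano
  # plots generated further below filter the same "-all" tables at padj < 5e-2.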
  up <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff

for cmp in \
  deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h \
  deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h \
  WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h \
  WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h \
  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h \
  deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
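
The helper ~/Scripts/replace_gene_names.py is not shown in these notes. As a rough, hypothetical sketch of what such a step could look like: the function names below, the use of the GFF attributes `ID` and `gene`, and the output column names `GeneID`/`GeneName` are all assumptions, inferred only from how the R code further down reads the resulting CSV files (it expects a `GeneName` column that may be empty and a `GeneID` column starting with "gene-").

```python
import csv
import io


def parse_gff_gene_names(gff_text):
    """Map gene ID (e.g. 'gene-<locus_tag>') to gene symbol ('' if none) from GFF 'gene' rows."""
    names = {}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        # Column 9 holds 'key=value' pairs separated by ';'
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        gene_id = attrs.get("ID")
        if gene_id:
            # Assumption: the symbol is stored in the 'gene' attribute; empty if absent
            names[gene_id] = attrs.get("gene", "")
    return names


def add_gene_names(deseq_csv_text, names):
    """Return CSV text with 'GeneID' and 'GeneName' as the first two columns."""
    reader = csv.reader(io.StringIO(deseq_csv_text))
    header = next(reader)  # write.csv leaves the row-name column header empty
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["GeneID", "GeneName"] + header[1:])
    for row in reader:
        writer.writerow([row[0], names.get(row[0], "")] + row[1:])
    return out.getvalue()
```

Genes without a symbol keep an empty GeneName in this sketch, which matches the later R step that replaces empty names with the GeneID minus its "gene-" prefix.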

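# ---- delta sbp TSB 2h vs WT TSB 2h ----
# Back in R: packages used by the blocks below
library(dplyr)            # %>%, group_by(), slice_min()
library(openxlsx)         # createWorkbook(), addWorksheet(), writeData(), saveWorkbook()
library(EnhancedVolcano)  # EnhancedVolcano()
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")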
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

# 1st strategy: keep the first occurrence and remove subsequent duplicates
#res <- res[!duplicated(res$GeneName), ]
# 2nd strategy (used here): keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

# 1st strategy: keep the first occurrence and remove subsequent duplicates
#res <- res[!duplicated(res$GeneName), ]
# 2nd strategy (used here): keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() skips rows with NA padj, which would otherwise come back as all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
            emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
            #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
            #---->  Example of a seed_ortholog value: 470.IX87_14445
                * 470 is the NCBI Taxonomy ID of the organism the seed ortholog comes from (470 = Acinetobacter baumannii).
                * IX87_14445 is the locus tag of the matching gene or protein within that genome.
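
The header-rewriting helper used above (~/Scripts/update_fasta_header.py) is not included in these notes. The following is only a hypothetical sketch of such a step. It assumes the downloaded GenBank protein FASTA carries bracketed fields such as `[locus_tag=...]` in its header lines and that the locus tag is the identifier wanted downstream; the function name is made up for illustration.

```python
import re

# Matches the bracketed locus_tag field in a GenBank-style FASTA header
LOCUS_TAG = re.compile(r"\[locus_tag=([^\]]+)\]")


def simplify_headers(fasta_text):
    """Replace each FASTA header by its locus_tag when one is present; keep it otherwise."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = LOCUS_TAG.search(line)
            out.append(">" + match.group(1) if match else line)
        else:
            out.append(line)  # sequence lines are passed through unchanged
    return "\n".join(out) + "\n"
```

Short, unique headers of this kind make it easier to join the emapper `query` column with the gene IDs used in the DESeq2 tables.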

        Extract KEGG KO IDs from eggnog_out.emapper.annotations (column KEGG_ko).
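
A minimal sketch of that extraction, assuming the usual emapper layout: a tab-separated table with `##` comment lines, a header line starting with `#query`, and a `KEGG_ko` column holding comma-separated entries such as `ko:K02313`, or `-` when nothing was assigned. The function name and the toy rows in the usage note are illustrative only.

```python
import csv
import io


def extract_ko_ids(emapper_text):
    """Return {query: [KO IDs]} from emapper annotation text; queries without a KO are skipped."""
    # Drop '##' comment lines and strip the leading '#' from the '#query' header
    lines = [ln for ln in emapper_text.splitlines() if ln and not ln.startswith("##")]
    if lines and lines[0].startswith("#"):
        lines[0] = lines[0][1:]
    reader = csv.DictReader(io.StringIO("\n".join(lines)),
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    ko_map = {}
    for row in reader:
        raw = row.get("KEGG_ko") or "-"
        kos = [k.replace("ko:", "") for k in raw.split(",") if k and k != "-"]
        if kos:
            ko_map[row["query"]] = kos
    return ko_map
```

Called on the contents of eggnog_out.emapper.annotations, this would give one list of KO IDs per annotated protein, which is the shape needed for a gene-to-KO table in the enrichment step.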

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * Start the GUI from the project directory (jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher) and set the workspace to ~/b2gWorkspace_Michelle_RNAseq_2025, prepared with: mkdir ~/b2gWorkspace_Michelle_RNAseq_2025; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
    * Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            QBlast finished with warnings!
            Blasted Sequences: 2084
            Sequences without results: 105
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP020463_protein) Done"
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
            "Mapping (CP020463_protein) Done"
            "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO and the position of the particular GO term in the Gene Ontology hierarchy. The annotation cutoff selects, for a given GO branch, the most specific GO term whose score lies above this value.
            * Used parameter 'GO Weight': a value that is added to the Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. As a result, more general GO terms that summarise many original source terms (those directly associated with the Blast hits) receive a higher Annotation Score.
            "Annotation (CP020463_protein) Done"
            "Annotation finished."
    or blast2go_cli_v1.5.1 (NOT_USED)

            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
        * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
            "InterProScan (CP020463_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
        * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
        #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
    File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
            #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

        #-- before merging (blast2go_annot.annot) --
        #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0085020
        #-- after merging (blast2go_annot.annot2) -->
        #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
        #H0N29_18790     GO:0070531
        #H0N29_18790     GO:0004842
        #H0N29_18790     GO:0005515
        #H0N29_18790     GO:0085020

        cp blast2go_annot.annot blast2go_annot.annot2
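
The before/after listing above can also be checked mechanically. A minimal sketch, assuming the exported .annot file is tab-separated with the sequence name in column 1 and the GO ID in column 2 (the function name is made up for illustration; the data are the H0N29_18790 lines quoted above):

```python
from collections import defaultdict


def go_terms_by_gene(annot_text):
    """Collect the GO IDs listed for each sequence in Blast2GO .annot text (tab-separated)."""
    terms = defaultdict(set)
    for line in annot_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].startswith("GO:"):
            terms[fields[0]].add(fields[1])
    return dict(terms)


before = go_terms_by_gene(
    "H0N29_18790\tGO:0004842\tankyrin repeat domain-containing protein\n"
    "H0N29_18790\tGO:0085020\n"
)
after = go_terms_by_gene(
    "H0N29_18790\tGO:0031436\tankyrin repeat domain-containing protein\n"
    "H0N29_18790\tGO:0070531\n"
    "H0N29_18790\tGO:0004842\n"
    "H0N29_18790\tGO:0005515\n"
    "H0N29_18790\tGO:0085020\n"
)
# GO IDs gained through the InterPro merge, per sequence
gained = {gene: sorted(after[gene] - before.get(gene, set())) for gene in after}
print(gained)  # {'H0N29_18790': ['GO:0005515', 'GO:0031436', 'GO:0070531']}
```

Running the same comparison on the two full export files would show for how many sequences the merge actually added terms.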

    Option 4 (NOT_USED): RFAM for non-coding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            # Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (this code has been integrated into step 10.4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_LB_vs_Mac_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126
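
The preparation shown in the diff above (drop the `##` lines, rename `#query` to `query`) can be scripted instead of done by hand. A small sketch; the function name is an assumption, and the file names in the commented usage are the ones mentioned in these notes:

```python
def clean_emapper_annotations(text):
    """Drop '##' comment lines and rename the '#query' header column to 'query'."""
    kept = []
    for line in text.splitlines():
        if line.startswith("##"):
            continue  # emapper run metadata and summary lines
        if line.startswith("#query"):
            line = line[1:]  # '#query' -> 'query'
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical usage:
# with open("eggnog_out.emapper.annotations") as src, \
#         open("eggnog_out.emapper.annotations.txt", "w") as dst:
#     dst.write(clean_emapper_annotations(src.read()))
```

The cleaned file then has a plain header row, so it can be read as an ordinary tab-separated table.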

        # Step 1: Load the blast2go annotation file (fill = TRUE pads short rows with NA)
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # The file was cut to its first two columns (cut -f1-2); keep exactly those 2 columns and name them
        annot_df <- annot_df[, 1:2]
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms (one comma-separated string per gene)
        go_terms <- annot_df %>%
          filter(grepl("^GO:", Term)) %>%
          group_by(GeneID) %>%
          summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
          filter(grepl("^EC:", Term)) %>%
          group_by(GeneID) %>%
          summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all input files to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        # Define the filenames and output suffixes
        file_list <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_4h-all.csv",
          "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
        )

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(
          "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
          "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
          "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
          "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
          "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
          "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

          "WT_MH_4h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_2h-all.csv",
          "WT_MH_18h_vs_WT_MH_4h-all.csv",
          "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
          "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

          "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
          "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
          "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
          "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
        )

        # ---- HELPERS ----
        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                      error = function(e2) NULL))
          tb
        }

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
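          # Prepare KO lists: drop NA and "-" placeholders, strip "ko:",
          # and split comma-separated entries so each K number is its own ID
          clean_ko <- function(x) {
            x <- x[!is.na(x) & x != "-" & x != ""]
            as.character(unique(unlist(strsplit(gsub("ko:", "", x), ",", fixed = TRUE))))
          }
          k_up <- clean_ko(up_regulated$KEGG_ko)
          k_dn <- clean_ko(down_regulated$KEGG_ko)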

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "") %>%
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }
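
The GO step above calls `enricher()`, which performs an over-representation analysis: for each term it asks whether the up- or down-regulated list contains more annotated genes than expected by chance, using a hypergeometric test. The following Python sketch illustrates only that calculation; the function name and the numbers in the demo are made up for illustration, and clusterProfiler applies further filtering (for example on gene-set size and on which universe genes carry annotations) that is not reproduced here.

```python
from math import comb


def hypergeom_enrichment_p(k, n, M, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: number of genes in the background (universe)
    M: number of background genes annotated with the term
    n: number of genes in the query list (e.g. up-regulated genes)
    k: number of query genes annotated with the term
    """
    if not (0 <= M <= N and 0 <= n <= N):
        raise ValueError("need 0 <= M <= N and 0 <= n <= N")
    k = max(k, 0)
    upper = min(n, M)          # the overlap cannot exceed either set size
    total = comb(N, n)         # all ways to draw n genes from the universe
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 2000 background genes, 40 carry the GO term,
# 100 genes are up-regulated and 8 of those carry the term.
p = hypergeom_enrichment_p(k=8, n=100, M=40, N=2000)
print(f"p = {p:.3g}")
```

The per-term p-values obtained this way are then adjusted across all tested terms (Benjamini-Hochberg in the call above, via `pAdjustMethod = "BH"`).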

Clustering the genes and draw heatmap
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
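# Save the up-regulated and down-regulated gene IDs into -up.id and -down.id.
# Note: the loop below only prints the cut commands; paste the output into the shell (or append "| bash" after "done") to execute them.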

for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
# -- IMPORTANT_ADAPTATION: adapt the script by replacing "H0N29_" with "B4U56_"
for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config (keep exactly one 'contrast' line active; the last assignment wins) ---
contrast <- "deltasbp_TSB_2h_vs_WT_TSB_2h"    #17, height=600, heatmap_pattern1
contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h"    #25, height=800, heatmap_pattern1
contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h"  #34, height=1000, heatmap_pattern1
contrast <- "deltasbp_MH_2h_vs_WT_MH_2h"      #43, height=1200, heatmap_pattern1
contrast <- "deltasbp_MH_4h_vs_WT_MH_4h"      #26, height=800, heatmap_pattern1
contrast <- "deltasbp_MH_18h_vs_WT_MH_18h"    #41, height=1200, heatmap_pattern1

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
#NOT_NECESSARY: the columns are already sorted, so colnames(RNASeq.NoCellLine) does not need to be reassigned.

# -- RUN the code with the new contrast from HERE after first run --

## 4) Build GOI from the two .id files (do not run if both files are empty!) -------------------------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "<contrast>-all_annotated.csv"
all_path <- paste0(contrast, "-all_annotated.csv")
if (file.exists(all_path)) {
  ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE)
  pick_col <- function(df, candidates) {
    hit <- intersect(candidates, names(df))
    if (length(hit) == 0) return(NA_character_)
    hit[1]
  }
  id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID"))
  nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL"))
  if (!is.na(id_col) && !is.na(nm_col)) {
    ann[[nm_col]][is.na(ann[[nm_col]])] <- ""
    id2name <- setNames(ann[[nm_col]], ann[[id_col]])
    id2name <- id2name[nzchar(id2name)]  # drop empty names
    hits <- match(rownames(datamat), names(id2name))
    repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits])
    # avoid duplicate labels on the plot
    labRow_pretty <- make.unique(repl, sep = "_")
  } else {
    warning("Could not find GeneID/GeneName columns in ", all_path)
  }
} else {
  warning("File not found: ", all_path)
}

# Column labels: 'deltaadeIJ' -> 'ΔadeIJ' and nicer spacing
labCol_pretty <- colnames(datamat)
#labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty)
labCol_pretty <- gsub("_", " ", labCol_pretty)  # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
# If you prefer to drop replicate suffixes, uncomment:
# labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

## 8) Clustering -----------------------------------------------
# Row clustering with Pearson distance
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
#row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
#row_cor[!is.finite(row_cor)] <- 0
#hr <- hclust(as.dist(1 - row_cor), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.1)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

#BREAK_LINE

labRow_pretty <- sub("^gene-", "", labRow_pretty)
labRow_pretty <- sub("^rna-", "", labRow_pretty)
png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=1200)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 20),  # bottom, left
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,
        density.info = 'none',
        labRow = labRow_pretty,  # row labels WITHOUT "gene-"
        labCol = labCol_pretty,  # col labels with Δsbp + spaces
        cexRow = 2.5,
        cexCol = 2.5,
        srtCol = 20,
        lhei = c(0.6, 4),  # enlarge the first number when reducing the plot size to avoid the error 'Error in plot.new() : figure margins too large'
        lwid = c(0.8, 4))  # enlarge the first number when reducing the plot size to avoid the error 'Error in plot.new() : figure margins too large'
dev.off()

# DEBUG for some items starting with "gene-"
labRow_pretty <- sub("^gene-", "", labRow_pretty)
labRow_pretty <- sub("^rna-", "", labRow_pretty)
png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 6500)
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),  # c(bottom, left)
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.4,  # ↓ smaller column label font (was 1.3)
  cexCol = 1.8,
  srtCol = 20,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE  # safer; add manual z-score key if you want (see note below)
)
dev.off()

# ------------------ Heatmap generation for three samples ----------------------

## ============================================================
## Three-condition DEGs heatmap from multiple pairwise contrasts
## Example contrasts:
##   "WT_MH_4h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_4h"
## Output shows the union of DEGs across all contrasts and
## only the columns (samples) for the 3 conditions.
## ============================================================

## -------- 0) User inputs ------------------------------------
## Keep exactly one 'contrasts' block (and the matching 'cond_order') active; the last assignment wins.
contrasts <- c(
  "WT_MH_4h_vs_WT_MH_2h",
  "WT_MH_18h_vs_WT_MH_2h",
  "WT_MH_18h_vs_WT_MH_4h"  #--> 424 genes, height=6000, heatmap_pattern2
)
contrasts <- c(
  "WT_TSB_4h_vs_WT_TSB_2h",
  "WT_TSB_18h_vs_WT_TSB_2h",
  "WT_TSB_18h_vs_WT_TSB_4h"  #--> 358 genes, height=5200, heatmap_pattern2
)
contrasts <- c(
  "deltasbp_MH_4h_vs_deltasbp_MH_2h",
  "deltasbp_MH_18h_vs_deltasbp_MH_2h",
  "deltasbp_MH_18h_vs_deltasbp_MH_4h"  #--> 345 genes, height=5120, heatmap_pattern2
)
contrasts <- c(
  "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"  #--> 276 genes, height=4000, heatmap_pattern2
)
## Optionally force a condition display order (defaults to order of first appearance)
cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h")
cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h")
cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h")
cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h")
#cond_order <- NULL

## -------- 1) Packages ---------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## -------- 2) Helpers ----------------------------------------
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
                 error = function(e) NULL)
  if (!is.null(df) && ncol(df) >= 1) {
    ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

# From "A_vs_B" return c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
  parts
}

# Grep whole tag between start/end or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Pretty labels for columns (optional tweaks)
prettify_col_labels <- function(x) {
  x <- gsub("^deltasbp", "\u0394sbp", x)  # example from your earlier case
  x <- gsub("_", " ", x)
  x
}

# BREAK_LINE

# -- RUN the code with the new contrasts from HERE after first run --

## -------- 3) Build GOI (union across contrasts) -------------
up_files   <- paste0(contrasts, "-up.id")
down_files <- paste0(contrasts, "-down.id")
GOI <- unique(unlist(c(
  lapply(up_files,   read_ids_from_file),
  lapply(down_files, read_ids_from_file)
)))
if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

## -------- 4) Expression matrix (rld or vsd) -----------------
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
expr_all <- as.matrix(expr_all)
present <- intersect(rownames(expr_all), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## -------- 5) Infer the THREE condition tags -----------------
pair_groups <- lapply(contrasts, split_contrast_groups)  # list of c(A,B)
cond_tags <- unique(unlist(pair_groups))
if (length(cond_tags) != 3L) {
  stop("Expected exactly three unique condition tags across the contrasts, got: ",
       paste(cond_tags, collapse = ", "))
}
# If user provided an explicit order, use it; else keep first-appearance order
if (!is.null(cond_order)) {
  if (!setequal(cond_order, cond_tags))
    stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
  cond_tags <- cond_order
}

#BREAK_LINE

## -------- 6) Subset columns to those 3 conditions -----------
# helper: does a name contain any of the tags?
match_any_tag <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}
# helper: return the specific tag that a single name matches
detect_tag <- function(nm, tags) {
  hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
  if (!any(hits)) NA_character_ else tags[which(hits)[1]]
}
keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
if (!any(keep_cols)) {
  stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
}
sub_idx <- which(keep_cols)
sub_colnames <- colnames(expr_all)[sub_idx]
# find the tag for each kept column (this is the part that was wrong before)
cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)
# rank columns by your desired condition order, then by name within each condition
cond_rank <- match(cond_for_col, cond_tags)
ord <- order(cond_rank, sub_colnames)
expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

## -------- 7) Remove constant/NA rows ------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

## -------- 8) Labels ----------------------------------------
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "<contrast>-all_annotated.csv"
# Use the first of the current contrasts rather than a stale 'contrast' variable
# left over from the two-sample section; every "-all" file lists all genes.
all_path <- paste0(contrasts[1], "-all_annotated.csv")
if (file.exists(all_path)) {
  ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE)
  pick_col <- function(df, candidates) {
    hit <- intersect(candidates, names(df))
    if (length(hit) == 0) return(NA_character_)
    hit[1]
  }
  id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID"))
  nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL"))
  if (!is.na(id_col) && !is.na(nm_col)) {
    ann[[nm_col]][is.na(ann[[nm_col]])] <- ""
    id2name <- setNames(ann[[nm_col]], ann[[id_col]])
    id2name <- id2name[nzchar(id2name)]  # drop empty names
    hits <- match(rownames(datamat), names(id2name))
    repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits])
    # avoid duplicate labels on the plot
    labRow_pretty <- make.unique(repl, sep = "_")
  } else {
    warning("Could not find GeneID/GeneName columns in ", all_path)
  }
} else {
  warning("File not found: ", all_path)
}
labCol_pretty <- prettify_col_labels(colnames(datamat))

#BREAK_LINE

## -------- 9) Clustering (rows) ------------------------------
hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")
# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.3)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

## -------- 10) Save the matrix used --------------------------
out_tag <- paste(cond_tags, collapse = "_")
write.csv(as.data.frame(datamat),
          file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"),
          quote = FALSE)

## -------- 11) Plot heatmap ----------------------------------
labRow_pretty <- sub("^gene-", "", labRow_pretty)
labRow_pretty <- sub("^rna-", "", labRow_pretty)
png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 4000)
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),  # c(bottom, left)
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.3,
  cexCol = 1.8,
  srtCol = 20,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE  # safer; add manual z-score key if you want (see note below)
)
dev.off()
# ------------------ Heatmap generation for three samples END ----------------------

# -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h --
cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids
#add Gene_Id in the first line, delete the ""
#Note that GeneID is used as index, rather than GeneName, since the .txt contains only GeneID.
GOI <- read.csv("ids")$Gene_Id
RNASeq.NoCellLine <- assay(rld)
#install.packages("gplots")
library("gplots")
#clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman
datamat = RNASeq.NoCellLine[GOI, ]
#datamat = RNASeq.NoCellLine
write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")
constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
if(any(constant_rows)) {
  cat("Removing", sum(constant_rows), "constant rows.\n")
  datamat <- datamat[!constant_rows, ]
}
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.1)
mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN");
mycol = mycol[as.vector(mycl)]
png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 15),  # bottom, left
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,
        density.info = 'none',
        labRow = rownames(datamat),
        cexRow = 1.5,
        cexCol = 1.5,
        srtCol = 35,
        lhei = c(0.2, 4),  # reduce top space (was 1 or more)
        lwid = c(0.4, 4))  # reduce left space (was 1 or more)
dev.off()

# -------------- Cluster members ----------------
write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
#~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;

#### (NOT_WORKING) cluster members (adding annotations; this does not work for the bacteria, since it is not a model species and we cannot use mart=ensembl) #####
subset_1<-names(subset(mycl, mycl == '1'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
subset_2<-names(subset(mycl, mycl == '2'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
subset_3<-names(subset(mycl, mycl == '3'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
subset_4<-names(subset(mycl, mycl == '4'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
subset_5<-names(subset(mycl, mycl == '5'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #

# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()
# Determine total number of genes
total_genes <- length(rownames(data))
# Loop through each gene to annotate
for (i in 1:total_genes) {
  gene <- rownames(data)[i]
  result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                  filters = 'ensembl_gene_id',
                  values = gene,
                  mart = ensembl)
  # If multiple rows are returned, take the first one
  if (nrow(result) > 1) {
    result <- result[1, ]
  }
  # Check if the result is empty
  if (nrow(result) == 0) {
    result <- data.frame(ensembl_gene_id = gene,
                         external_gene_name = NA,
                         gene_biotype = NA,
                         entrezgene_id = NA,
                         chromosome_name = NA,
                         start_position = NA,
                         end_position = NA,
                         strand = NA,
                         description = NA)
  }
  # Transpose expression values
  expression_values <- t(data.frame(t(data[gene, ])))
  colnames(expression_values) <- colnames(data)
  # Combine gene information and expression data
  combined_result <- cbind(result, expression_values)
  # Append to the final dataframe
  annotated_data <- rbind(annotated_data, combined_result)
  # Print progress every 100 genes
  if (i %% 100 == 0) {
    cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
  }
}
# Save the annotated data to a new CSV file
write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls
```

Processing Data_JuliaFuchs_RNAseq_2025 v2

DEGs_heatmap_Mitomycin_18h_vs_Untreated_18h

Targets

> Which genes are differentially expressed between the conditions for each time point.
> Also, from our pulldown experiment, we identified several potential
> target genes, and I’d be particularly interested to see if there are
> expression changes for those in the RNA-seq data. I’ll include the
> list of targets below once it’s ready.
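In brief: moxifloxacin is an antibiotic; mitomycin C is not used clinically as an anti-infective antibiotic.
Moxifloxacin: a fourth-generation fluoroquinolone antibiotic used to treat bacterial infections (e.g. of the respiratory tract or skin). It inhibits DNA gyrase and topoisomerase IV and thereby blocks bacterial DNA replication.
Mitomycin C: an "antitumor antibiotic" derived from Streptomyces that damages DNA by alkylation and cross-linking. It is therefore used mainly in cancer chemotherapy and in a few local applications (e.g. intraoperative suppression of scar formation in ophthalmology and ENT surgery). Although it is called an antibiotic and does have antibacterial activity, its systemic toxicity is too high for it to be used to treat infections.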

> Additionally, I have a specific question regarding the toxin–antitoxin
> system I’m studying. The toxin and antitoxin genes are:
>
> Toxin:
> ttatttacaatgcctcttgatccatgtctcaattccctcaagagtaagatttttgtcgtttactactcttaaagtaaactgaaccgcttcatcttgagtgcattcaaaattaatactatttaacttcaaaaatattaccatagatgtaaaagctgttcttttattcgcattatggaatgcgtgcttttgagctatatttctatatataaaagctgcttttctctcgattgtttcatatagttcaactccaccgaatgattgtttaactccttcaatagtagcattaagaacttctggaactttaacaccaacttgttcttttggtgagaaatcttgtattgcttttacattaatggcaatcacttgtttttcagttaaatatttagtgctttgcat
>
> Antitoxin:
> ttataagtcaaccatcctttttaaagcttggttatactcagtgaatgtttcatccaacaatttaaaaaactcctcgtcctctcttacctccttttcgatggttactttattatcttttacattaaatttaagattatcaccatttgatattccgagtgctgcgatcacttctgtcggtacagaaacaactgaactattaccagcttttcttagttttcttgtagtaatcat
>
> I’m wondering whether you can check if these two genes might share the
> same promoter and whether any RNA-seq signal supports their
> co-expression.
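
A small observation on the two sequences above: as pasted, both start with "tta" and end with "cat", which is what an ATG...TAA open reading frame looks like when it is written as its reverse complement. A minimal sketch for checking this is given below; the helper names and the toy input are hypothetical, and the real sequences would be pasted in place of `toy`.

```python
# Illustrative helpers (not part of the original pipeline).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def looks_like_minus_strand_orf(seq: str) -> bool:
    """True if the reverse complement starts with ATG and ends with a stop codon."""
    rc = reverse_complement(seq).upper()
    return rc.startswith("ATG") and rc[-3:] in STOP_CODONS


# Toy input (hypothetical, NOT one of the genes above): reverse complement of atgaaataa
toy = "ttatttcat"
print(reverse_complement(toy))           # atgaaataa
print(looks_like_minus_strand_orf(toy))  # True
```

If both genes turn out to lie on the same strand and close to each other, that would be compatible with a shared promoter, but it does not prove it; the RNA-seq coverage across the intergenic region still has to be inspected.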

Download and prepare raw data

# ---- Dataset_1 ----
aws configure
> Aws_access_key_id：AKIAYWZZRVKWTQDI4CHT
> Aws_secret_access_key：hbFnZYBlNc1QP6hjm8fpCIXQsvUhLvWTBAaonH8D
>
>
aws s3 cp s3://staefgap-598731762349/ ./ --recursive  #S3 Bucket

# ---- Dataset_2 ----
aws configure
> Aws_access_key_id：AKIAYWZZRVKWXL5FYUBC
> Aws_secret_access_key：Nb9PMn3FywZ7UT4FOkVYPi0HFmk/S3uSCX/D9kmx
>
>
aws s3 cp s3://stavoupp-598731762349/ ./ --recursive  #S3 Bucket

mkdir raw_data; cd raw_data

ln -s ../F25A430001462_STAvoupP/1a_untreated_4h/1a_untreated_4h_1.fq.gz Untreated_4h_1a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1a_untreated_4h/1a_untreated_4h_2.fq.gz Untreated_4h_1a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1b_untreated_4h/1b_untreated_4h_1.fq.gz Untreated_4h_1b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1b_untreated_4h/1b_untreated_4h_2.fq.gz Untreated_4h_1b_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1c_untreated_4h/1c_untreated_4h_1.fq.gz Untreated_4h_1c_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1c_untreated_4h/1c_untreated_4h_2.fq.gz Untreated_4h_1c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1d_untreated_8h/1d_untreated_8h_1.fq.gz Untreated_8h_1d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1d_untreated_8h/1d_untreated_8h_2.fq.gz Untreated_8h_1d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1e_untreated_8h/1e_untreated_8h_1.fq.gz Untreated_8h_1e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1e_untreated_8h/1e_untreated_8h_2.fq.gz Untreated_8h_1e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1f_untreated_8h/1f_untreated_8h_1.fq.gz Untreated_8h_1f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1f_untreated_8h/1f_untreated_8h_2.fq.gz Untreated_8h_1f_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1g_untreated18h/1g_untreated18h_1.fq.gz Untreated_18h_1g_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1g_untreated18h/1g_untreated18h_2.fq.gz Untreated_18h_1g_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/1h_untreated18h/1h_untreated18h_1.fq.gz Untreated_18h_1h_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/1h_untreated18h/1h_untreated18h_2.fq.gz Untreated_18h_1h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/1i_untreated18h/1i_untreated18h_1.fq.gz Untreated_18h_1i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/1i_untreated18h/1i_untreated18h_2.fq.gz Untreated_18h_1i_R2.fastq.gz

ln -s ../F25A430001462_STAvoupP/2a_Mitomycin_4h/2a_Mitomycin_4h_1.fq.gz Mitomycin_4h_2a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2a_Mitomycin_4h/2a_Mitomycin_4h_2.fq.gz Mitomycin_4h_2a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2b_Mitomycin_4h/2b_Mitomycin_4h_1.fq.gz Mitomycin_4h_2b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2b_Mitomycin_4h/2b_Mitomycin_4h_2.fq.gz Mitomycin_4h_2b_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/2c_Mitomycin_4h/2c_Mitomycin_4h_1.fq.gz Mitomycin_4h_2c_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/2c_Mitomycin_4h/2c_Mitomycin_4h_2.fq.gz Mitomycin_4h_2c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2d_Mitomycin_8h/2d_Mitomycin_8h_1.fq.gz Mitomycin_8h_2d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2d_Mitomycin_8h/2d_Mitomycin_8h_2.fq.gz Mitomycin_8h_2d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2e_Mitomycin_8h/2e_Mitomycin_8h_1.fq.gz Mitomycin_8h_2e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2e_Mitomycin_8h/2e_Mitomycin_8h_2.fq.gz Mitomycin_8h_2e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2f_Mitomycin_8h/2f_Mitomycin_8h_1.fq.gz Mitomycin_8h_2f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2f_Mitomycin_8h/2f_Mitomycin_8h_2.fq.gz Mitomycin_8h_2f_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2g_Mitomycin18h/2g_Mitomycin18h_1.fq.gz Mitomycin_18h_2g_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2g_Mitomycin18h/2g_Mitomycin18h_2.fq.gz Mitomycin_18h_2g_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2h_Mitomycin18h/2h_Mitomycin18h_1.fq.gz Mitomycin_18h_2h_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2h_Mitomycin18h/2h_Mitomycin18h_2.fq.gz Mitomycin_18h_2h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/2i_Mitomycin18h/2i_Mitomycin18h_1.fq.gz Mitomycin_18h_2i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/2i_Mitomycin18h/2i_Mitomycin18h_2.fq.gz Mitomycin_18h_2i_R2.fastq.gz

ln -s ../F25A430001462_STAvoupP/3a_Moxi_4h/3a_Moxi_4h_1.fq.gz Moxi_4h_3a_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3a_Moxi_4h/3a_Moxi_4h_2.fq.gz Moxi_4h_3a_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3b_Moxi_4h/3b_Moxi_4h_1.fq.gz Moxi_4h_3b_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3b_Moxi_4h/3b_Moxi_4h_2.fq.gz Moxi_4h_3b_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3c_Moxi_4h/3c_Moxi_4h_1.fq.gz Moxi_4h_3c_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3c_Moxi_4h/3c_Moxi_4h_2.fq.gz Moxi_4h_3c_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3d_Moxi_8h/3d_Moxi_8h_1.fq.gz Moxi_8h_3d_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3d_Moxi_8h/3d_Moxi_8h_2.fq.gz Moxi_8h_3d_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3e_Moxi_8h/3e_Moxi_8h_1.fq.gz Moxi_8h_3e_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3e_Moxi_8h/3e_Moxi_8h_2.fq.gz Moxi_8h_3e_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3f_Moxi_8h/3f_Moxi_8h_1.fq.gz Moxi_8h_3f_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3f_Moxi_8h/3f_Moxi_8h_2.fq.gz Moxi_8h_3f_R2.fastq.gz
ln -s ../F25A430001462_STAefgaP/3g_Moxi_18h/3g_Moxi_18h_1.fq.gz Moxi_18h_3g_R1.fastq.gz
ln -s ../F25A430001462_STAefgaP/3g_Moxi_18h/3g_Moxi_18h_2.fq.gz Moxi_18h_3g_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3h_Moxi_18h/3h_Moxi_18h_1.fq.gz Moxi_18h_3h_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3h_Moxi_18h/3h_Moxi_18h_2.fq.gz Moxi_18h_3h_R2.fastq.gz
ln -s ../F25A430001462_STAvoupP/3i_Moxi_18h/3i_Moxi_18h_1.fq.gz Moxi_18h_3i_R1.fastq.gz
ln -s ../F25A430001462_STAvoupP/3i_Moxi_18h/3i_Moxi_18h_2.fq.gz Moxi_18h_3i_R2.fastq.gz
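
The link names above follow one pattern (provider directory `1a_untreated_4h` becomes `Untreated_4h_1a_R1/R2.fastq.gz`), so the commands could also be generated instead of typed. The following is a minimal sketch under that assumption; the function names and the regular expression are illustrative, and the batch directory (STAvoupP or STAefgaP) still has to be supplied per sample.

```python
import re

# Provider directory names look like "1a_untreated_4h" or "1g_untreated18h"
# (the underscore before the time point is sometimes missing).
DIR_RE = re.compile(r"^(\d[a-z])_([A-Za-z]+)_?(\d+h)$")


def link_prefix(dirname: str) -> str:
    """Map a provider directory name to the sample prefix used in raw_data/."""
    m = DIR_RE.match(dirname)
    if m is None:
        raise ValueError(f"unexpected directory name: {dirname}")
    tag, treatment, time = m.groups()
    return f"{treatment[0].upper()}{treatment[1:]}_{time}_{tag}"


def ln_commands(batch_dir: str, dirname: str) -> list:
    """Return the two 'ln -s' command lines (R1 and R2) for one sample."""
    prefix = link_prefix(dirname)
    return [
        f"ln -s ../{batch_dir}/{dirname}/{dirname}_{mate}.fq.gz {prefix}_R{mate}.fastq.gz"
        for mate in (1, 2)
    ]


for cmd in ln_commands("F25A430001462_STAvoupP", "1g_untreated18h"):
    print(cmd)
```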

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
# Each sample ID is listed once; the R1/R2 file names are derived from it inside the loop.
for sample_id in Untreated_4h_1a Untreated_4h_1b Untreated_4h_1c Untreated_8h_1d Untreated_8h_1e Untreated_8h_1f Untreated_18h_1g Untreated_18h_1h Untreated_18h_1i  Mitomycin_4h_2a Mitomycin_4h_2b Mitomycin_4h_2c Mitomycin_8h_2d Mitomycin_8h_2e Mitomycin_8h_2f Mitomycin_18h_2g Mitomycin_18h_2h Mitomycin_18h_2i  Moxi_4h_3a Moxi_4h_3b Moxi_4h_3c Moxi_8h_3d Moxi_8h_3e Moxi_8h_3f Moxi_18h_3g Moxi_18h_3h Moxi_18h_3i; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
done 2> trimmomatic_pe.log
mv trimmed/*.fastq.gz .

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
Untreated_4h_1a,Untreated_4h_1a_R1.fastq.gz,Untreated_4h_1a_R2.fastq.gz,auto
Untreated_4h_1b,Untreated_4h_1b_R1.fastq.gz,Untreated_4h_1b_R2.fastq.gz,auto
...
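
The remaining rows follow the same pattern, so the sheet can be generated from the list of sample IDs. A minimal sketch, assuming the trimmed files sit next to samplesheet.csv and are named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` as above; the function name is illustrative.

```python
import csv
import io


def samplesheet(sample_ids, strandedness="auto"):
    """Build the CSV text for an nf-core/rnaseq sample sheet (one row per sample)."""
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("duplicate sample IDs in the list")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for sid in sample_ids:
        writer.writerow([sid, f"{sid}_R1.fastq.gz", f"{sid}_R2.fastq.gz", strandedness])
    return buf.getvalue()


# Only the first two samples are shown here; extend the list with the other IDs.
print(samplesheet(["Untreated_4h_1a", "Untreated_4h_1b"]), end="")
```

Rejecting duplicate IDs is a deliberate choice: rows that share a sample name are treated by the pipeline as repeated sequencing of the same sample and merged.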

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP052959.gff) --
grep -P "\texon\t" CP052959.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP052959.gff | wc -l   #=10  signal recognition particle sRNA (small type), transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP052959.gff | wc -l    #=10  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP052959.gff | wc -l    #61  tRNA
grep -P "\tCDS\t" CP052959.gff | wc -l  #2581
sed 's/\tCDS\t/\texon\t/g' CP052959.gff > CP052959_m.gff
grep -P "\texon\t" CP052959_m.gff | sort | wc -l  #2662 (81 more comparing with 'CDS')

# -- NOTE: combining 'CP052959_m.gff' with feature type 'exon' on the command line results in an ERROR; use 'transcript' as the feature type instead:
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP052959_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_JuliaFuchs_RNAseq/CP052959.fasta" --gff "/home/jhuang/DATA/Data_JuliaFuchs_RNAseq/CP052959_m.gff"        -profile docker -resume  --max_cpus 100 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the FASTA header matches the sequence ID in the *_m.gff file; both must be "CP052959.1"
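
The `sed` call in DEBUG_1 replaces the string `\tCDS\t` wherever it occurs in a line. A column-aware variant that only touches the feature-type column (column 3) and reports the feature counts could look like the sketch below; the function names and the toy GFF lines are hypothetical.

```python
from collections import Counter


def cds_to_exon(gff_lines):
    """Rename feature type CDS to exon in column 3; leave all other lines unchanged."""
    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if not line.startswith("#") and len(fields) == 9 and fields[2] == "CDS":
            fields[2] = "exon"
            line = "\t".join(fields)
        out.append(line)
    return out


def feature_counts(gff_lines):
    """Count feature types (column 3), ignoring comment and malformed lines."""
    counts = Counter()
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Toy records (hypothetical coordinates and IDs)
toy = [
    "##gff-version 3",
    "CP052959.1\tGenbank\tgene\t1\t900\t.\t+\t.\tID=gene-X",
    "CP052959.1\tGenbank\tCDS\t1\t900\t.\t+\t0\tID=cds-X;Parent=gene-X",
    "CP052959.1\ttRNAscan-SE\texon\t1000\t1075\t.\t-\t.\tID=exon-Y",
]
print(feature_counts(toy))
print(feature_counts(cds_to_exon(toy)))
```

On the real annotation, the exon count after conversion should equal the original exon count plus the CDS count (81 + 2581 = 2662 in the numbers above).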

Generate advanced PCA-plot

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

#keep only gene_id
cut -f1 -d',' counts.tsv > f1
cut -f3- -d',' counts.tsv > f3_
paste -d',' f1 f3_ > counts_fixed.tsv

#IMPORTANT_EDIT: in counts_fixed.tsv, delete all double quotes (") and all "gene-" prefixes, and replace ',' with '\t'.
#IMPORTANT_ENV: mamba activate r_env
#IMPORTANT_NOTE: the row names of samples.tsv and the column names of counts_fixed.tsv must align (same samples, same order)!
Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_JuliaFuchs_RNAseq_2025/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria
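
The manual edits described under IMPORTANT_EDIT (drop the gene_name column, remove quotes and the "gene-" prefix, switch to tabs) can be scripted. A minimal sketch, assuming the layout of gene_raw_counts.csv is gene_id, gene_name, then one column per sample; the function name and the toy table are illustrative.

```python
import csv
import io


def fix_counts(csv_text: str) -> str:
    """Convert the comma-separated counts table to the tab-separated form expected downstream."""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        gene_id = row[0]
        if gene_id.startswith("gene-"):
            gene_id = gene_id[len("gene-"):]
        # row[1] is gene_name and is dropped; csv.reader has already removed the quotes
        out_lines.append("\t".join([gene_id] + row[2:]))
    return "\n".join(out_lines) + "\n"


# Toy table (hypothetical IDs and counts)
toy_csv = (
    '"gene_id","gene_name","Untreated_4h_1a","Untreated_4h_1b"\n'
    '"gene-ABC_00005","dnaA",10,12\n'
    '"gene-ABC_00010","ABC_00010",0,3\n'
)
print(fix_counts(toy_csv), end="")
```

Unlike `cut -d','`, the CSV reader also copes with gene names that contain a comma inside quotes.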

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries (each package is loaded once)
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)

setwd("~/DATA/Data_JuliaFuchs_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files

files <- c("Untreated_4h_r1" = "./Untreated_4h_1a/quant.sf",
           "Untreated_4h_r2" = "./Untreated_4h_1b/quant.sf",
           "Untreated_4h_r3" = "./Untreated_4h_1c/quant.sf",
           "Untreated_8h_r1" = "./Untreated_8h_1d/quant.sf",
           "Untreated_8h_r2" = "./Untreated_8h_1e/quant.sf",
           "Untreated_8h_r3" = "./Untreated_8h_1f/quant.sf",
           "Untreated_18h_r1" = "./Untreated_18h_1g/quant.sf",
           "Untreated_18h_r2" = "./Untreated_18h_1h/quant.sf",
           "Untreated_18h_r3" = "./Untreated_18h_1i/quant.sf",
           "Mitomycin_4h_r1" = "./Mitomycin_4h_2a/quant.sf",
           "Mitomycin_4h_r2" = "./Mitomycin_4h_2b/quant.sf",
           "Mitomycin_4h_r3" = "./Mitomycin_4h_2c/quant.sf",
           "Mitomycin_8h_r1" = "./Mitomycin_8h_2d/quant.sf",
           "Mitomycin_8h_r2" = "./Mitomycin_8h_2e/quant.sf",
           "Mitomycin_8h_r3" = "./Mitomycin_8h_2f/quant.sf",
           "Mitomycin_18h_r1" = "./Mitomycin_18h_2g/quant.sf",
           "Mitomycin_18h_r2" = "./Mitomycin_18h_2h/quant.sf",
           "Mitomycin_18h_r3" = "./Mitomycin_18h_2i/quant.sf",
           "Moxi_4h_r1" = "./Moxi_4h_3a/quant.sf",
           "Moxi_4h_r2" = "./Moxi_4h_3b/quant.sf",
           "Moxi_4h_r3" = "./Moxi_4h_3c/quant.sf",
           "Moxi_8h_r1" = "./Moxi_8h_3d/quant.sf",
           "Moxi_8h_r2" = "./Moxi_8h_3e/quant.sf",
           "Moxi_8h_r3" = "./Moxi_8h_3f/quant.sf",
           "Moxi_18h_r1" = "./Moxi_18h_3g/quant.sf",
           "Moxi_18h_r2" = "./Moxi_18h_3h/quant.sf",
           "Moxi_18h_r3" = "./Moxi_18h_3i/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1", "r2", "r3",  "r1", "r2", "r3", "r1", "r2", "r3",    "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3",       "r1", "r2", "r3", "r1", "r2", "r3", "r1", "r2", "r3"))
condition <- factor(c("Untreated_4h","Untreated_4h","Untreated_4h","Untreated_8h","Untreated_8h","Untreated_8h","Untreated_18h","Untreated_18h","Untreated_18h", "Mitomycin_4h","Mitomycin_4h","Mitomycin_4h","Mitomycin_8h","Mitomycin_8h","Mitomycin_8h","Mitomycin_18h","Mitomycin_18h","Mitomycin_18h", "Moxi_4h","Moxi_4h","Moxi_4h","Moxi_8h","Moxi_8h","Moxi_8h","Moxi_18h","Moxi_18h","Moxi_18h"))
# Construct colData manually
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx", rowNames = TRUE)  # rowNames = TRUE keeps the gene IDs in the first column

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()

png("pca3.png", 1200, 800)
plotPCA(rld, intgroup=c("replicate"))
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# The WT_/deltaadeIJ steps were carried over from an earlier project; they match no condition in this dataset and are kept commented out.
#pdat <- subset(pdat, !grepl("^WT_", condition))
# drop unused factor levels
pdat$condition <- droplevels(pdat$condition)
#pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
png("pca4.png", 1200, 800)
ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# The WT_/deltaadeIJ steps were carried over from an earlier project; they match no condition in this dataset and are kept commented out.
#pdat <- subset(pdat, !grepl("^WT_", condition))
pdat$condition <- droplevels(pdat$condition)
#pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
# ggplot2's default shape palette handles at most 6 levels; with 9 conditions the shapes are set explicitly.
p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
  geom_point(size = 3) +
  scale_shape_manual(values = seq_len(nlevels(pdat$condition))) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca5.png", 1200, 800); print(p); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca6.png", 1200, 800); print(p_fac); dev.off()

# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
#png("pca_media.png", 1200, 800)
#plotPCA(rld, intgroup=c("media"))
#dev.off()
#png("pca_strain.png", 1200, 800)
#plotPCA(rld, intgroup=c("strain"))
#dev.off()
#png("pca_time.png", 1200, 800)
#plotPCA(rld, intgroup=c("time"))
#dev.off()

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", row.names=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", row.names=FALSE)
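
For reference, the CPM step above divides every count by its sample's total count and multiplies by one million, without any library-size normalization factors. The same arithmetic as a minimal, dependency-free sketch; the function name and the toy matrix are hypothetical.

```python
def cpm(matrix):
    """Counts per million for a genes x samples matrix given as a list of rows."""
    totals = [sum(col) for col in zip(*matrix)]
    return [
        [value * 1e6 / total if total else 0.0 for value, total in zip(row, totals)]
        for row in matrix
    ]


# Toy matrix: 3 genes x 2 samples, each sample sums to 100 reads
toy_counts = [
    [10, 0],
    [30, 50],
    [60, 50],
]
for row in cpm(toy_counts):
    print(row)
```

Every column of the result sums to 1e6, which is a quick sanity check for the exported gene_cpm_counts.xlsx as well.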

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
#  [1] Untreated_4h  Untreated_4h  Untreated_4h  Untreated_8h  Untreated_8h
#  [6] Untreated_8h  Untreated_18h Untreated_18h Untreated_18h Mitomycin_4h
#  [11] Mitomycin_4h  Mitomycin_4h  Mitomycin_8h  Mitomycin_8h  Mitomycin_8h
#  [16] Mitomycin_18h Mitomycin_18h Mitomycin_18h Moxi_4h       Moxi_4h
#  [21] Moxi_4h       Moxi_8h       Moxi_8h       Moxi_8h       Moxi_18h
#  [26] Moxi_18h      Moxi_18h
#  9 Levels: Mitomycin_18h Mitomycin_4h Mitomycin_8h Moxi_18h Moxi_4h ... Untreated_8h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Construct colData automatically
sample_table <- data.frame(
    condition = condition,
    replicate = replicate
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
#colnames(split_cond) <- c("genotype", "exposure", "time")
colnames(split_cond) <- c("genotype", "time")
colData <- cbind(sample_table, split_cond)
colData$genotype <- factor(colData$genotype)
#colData$exposure  <- factor(colData$exposure)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
colData$group  <- factor(paste(colData$genotype, colData$time, sep = "_"))
colData2 <- data.frame(condition=condition, row.names=names(files))

# Set the reference levels of the factors (optional)
colData$genotype <- relevel(factor(colData$genotype), ref = "Untreated")
#colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
colData$time   <- relevel(factor(colData$time), ref = "4h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                       "genotype_Mitomycin_vs_Untreated"
#[3] "genotype_Moxi_vs_Untreated"      "time_18h_vs_4h"
#[5] "time_8h_vs_4h"                   "genotypeMitomycin.time18h"
#[7] "genotypeMoxi.time18h"            "genotypeMitomycin.time8h"
#[9] "genotypeMoxi.time8h"

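#Mitomycin usually refers to mitomycin C (MMC), an antitumor antibiotic produced by actinomycetes (Streptomyces). After reduction in vivo it is converted into an active alkylating agent that cross-links DNA, blocking replication and transcription and thereby inhibiting cell proliferation.
#In one sentence: mitomycin C is an anticancer drug that makes DNA "stick together"; it is used for systemic chemotherapy and is also often applied locally at low doses to keep scar tissue from growing back.
# Extract the main effect of genotype (here the factor "genotype" holds the treatment): up 489, down 67
# Note: because the design contains the genotype:time interaction, this coefficient is the Mitomycin vs Untreated effect at the reference time point (4h), not an average over all time points.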
contrast <- "genotype_Mitomycin_vs_Untreated"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

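#Moxifloxacin is a fourth-generation fluoroquinolone antibiotic; common brand names are Avelox (oral/intravenous) and Vigamox (0.5% eye drops).
#Mechanism of action: inhibits bacterial DNA gyrase and topoisomerase IV, blocking bacterial DNA replication and repair; it is bactericidal.
# Extract the main effect of genotype (Moxi vs Untreated at the reference time point, 4h): up 349, down 118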
contrast <- "genotype_Moxi_vs_Untreated"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the main effect of time (18h vs 4h in the reference group, Untreated): up 262; down 51
contrast <- "time_18h_vs_4h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the main effect of time (8h vs 4h in the reference group, Untreated): up 90; down 18
contrast <- "time_8h_vs_4h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

colData$genotype <- relevel(factor(colData$genotype), ref = "Moxi")
colData$time   <- relevel(factor(colData$time), ref = "8h")
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                  "genotype_Untreated_vs_Moxi"
#[3] "genotype_Mitomycin_vs_Moxi" "time_4h_vs_8h"
#[5] "time_18h_vs_8h"             "genotypeUntreated.time4h"
#[7] "genotypeMitomycin.time4h"   "genotypeUntreated.time18h"
#[9] "genotypeMitomycin.time18h"

# Extract the main effect of genotype (Mitomycin vs Moxi at the reference time point, 8h): up 361, down 6
contrast <- "genotype_Mitomycin_vs_Moxi"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the main effect of time (18h vs 8h in the reference group, Moxi): up 15; down 3
contrast <- "time_18h_vs_8h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))
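
A note on reading the coefficients of the `~ genotype * time` design: in the first model (reference levels Untreated and 4h), `genotype_Mitomycin_vs_Untreated` is the treatment effect at 4h only, and the effect at 18h is that coefficient plus the interaction term `genotypeMitomycin.time18h`. The arithmetic is sketched below with hypothetical log2 fold changes for a single gene; the helper name is illustrative and the numbers are not results from this dataset.

```python
def treatment_effect_at_time(main_effect: float, interaction: float = 0.0) -> float:
    """log2FC of treatment vs reference at a given time point.

    main_effect: coefficient at the reference time point
    interaction: additional treatment-specific change at the other time point
    """
    return main_effect + interaction


# Hypothetical coefficients for one gene (NOT taken from the real results)
coef = {
    "genotype_Mitomycin_vs_Untreated": 1.5,   # effect at 4h (reference time)
    "genotypeMitomycin.time18h": -0.5,        # how the effect changes at 18h
}

lfc_4h = treatment_effect_at_time(coef["genotype_Mitomycin_vs_Untreated"])
lfc_18h = treatment_effect_at_time(
    coef["genotype_Mitomycin_vs_Untreated"], coef["genotypeMitomycin.time18h"]
)
print(lfc_4h, lfc_18h)  # 1.5 1.0
```

In DESeq2 the same sum can be requested by passing both coefficient names as the first element of a list to the `contrast` argument of `results()`; the per-time-point comparisons in the next part of this section reach the same kind of answer through the single-factor `~ condition` design.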

#1.)  Moxi_4h_vs_Untreated_4h
#2.)  Mitomycin_4h_vs_Untreated_4h
#3.)  Moxi_8h_vs_Untreated_8h
#4.)  Mitomycin_8h_vs_Untreated_8h
#5.)  Moxi_18h_vs_Untreated_18h
#6.)  Mitomycin_18h_vs_Untreated_18h

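#---- relevel to control ----
# NOTE: each block below overwrites clist. Run the export loop at the end of this section once after each block, before moving on to the next relevel.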
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "Untreated_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_4h_vs_Untreated_4h", "Mitomycin_4h_vs_Untreated_4h")

dds$condition <- relevel(dds$condition, "Untreated_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_8h_vs_Untreated_8h", "Mitomycin_8h_vs_Untreated_8h")

dds$condition <- relevel(dds$condition, "Untreated_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Untreated_18h", "Mitomycin_18h_vs_Untreated_18h")

# Mitomycin_xh
dds$condition <- relevel(dds$condition, "Mitomycin_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Mitomycin_18h_vs_Mitomycin_4h", "Mitomycin_8h_vs_Mitomycin_4h")

dds$condition <- relevel(dds$condition, "Mitomycin_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Mitomycin_18h_vs_Mitomycin_8h")

# Moxi_xh
dds$condition <- relevel(dds$condition, "Moxi_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Moxi_4h", "Moxi_8h_vs_Moxi_4h")

dds$condition <- relevel(dds$condition, "Moxi_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Moxi_18h_vs_Moxi_8h")

# Untreated_xh
dds$condition <- relevel(dds$condition, "Untreated_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Untreated_18h_vs_Untreated_4h", "Untreated_8h_vs_Untreated_4h")

dds$condition <- relevel(dds$condition, "Untreated_8h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("Untreated_18h_vs_Untreated_8h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP052959_m.gff > CP052959_gene.gff

#NOTE that the script replace_gene_names.py was improved with a single fallback rule: after the initial mapping, any still empty/NA GeneName will be filled with the GeneID stripped of the gene-/rna- prefix. Nothing else changes.
for cmp in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
#deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
#deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
#deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)
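The fallback rule noted above for replace_gene_names.py can be pictured with a small sketch. This is not the actual script (its contents are not shown here); the function name fill_gene_name, the treatment of the literal string "NA" as missing, and the example gene name are assumptions made for illustration.

```python
import re

def fill_gene_name(gene_id, gene_name):
    """Return gene_name, or the GeneID without its gene-/rna- prefix if the name is empty/NA."""
    # Assumption: a missing name appears as None, an empty string, or the literal "NA"
    if gene_name is None or gene_name.strip() in ("", "NA"):
        return re.sub(r"^(gene-|rna-)", "", gene_id)
    return gene_name

# Toy rows: (GeneID, GeneName after the initial GFF mapping)
rows = [
    ("gene-H0N29_11475", ""),       # no name found -> falls back to the locus tag
    ("rna-H0N29_01015", "NA"),      # NA -> falls back as well
    ("gene-H0N29_01030", "geneX"),  # hypothetical named gene -> left unchanged
]
for gene_id, name in rows:
    print(gene_id, "->", fill_gene_name(gene_id, name))
```

Anchoring the pattern with ^ keeps a "gene-" or "rna-" substring elsewhere in an ID from being removed.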

(NOT_PERFORMED) Volcano plots

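# ---- Mitomycin 18h vs Untreated 18h ----
library(dplyr)            # %>%, group_by, slice_min
library(openxlsx)         # createWorkbook, addWorksheet, writeData, saveWorkbook
library(EnhancedVolcano)  # EnhancedVolcano
res <- read.csv("Mitomycin_18h_vs_Untreated_18h-all.csv")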
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
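# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]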
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Mitomycin_18h_vs_Untreated_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Mitomycin_18h_vs_Untreated_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Mitomycin_18h_vs_Untreated_18h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
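# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]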
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
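# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]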
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
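# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]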
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
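# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]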
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Assuming res is your dataframe and already processed
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
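# which() drops rows whose padj is NA; plain logical indexing would return all-NA rows for them
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]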
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)
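The two steps repeated in every block above (keep the row with the smallest padj per GeneName, then split by padj < 0.05 and |log2FoldChange| > 2) can be illustrated outside R with a minimal standard-library sketch. The function names keep_min_padj and split_deg and the toy values are assumptions for illustration only; they are not results from this dataset.

```python
import math

def keep_min_padj(rows):
    """Keep one row per GeneName: the one with the smallest padj (missing padj ranks last)."""
    best = {}
    for row in rows:
        rank = math.inf if row["padj"] is None else row["padj"]
        key = row["GeneName"]
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]

def split_deg(rows, padj_cut=0.05, lfc_cut=2.0):
    """Split rows into (up, down); rows with a missing padj are skipped."""
    tested = [r for r in rows if r["padj"] is not None and r["padj"] < padj_cut]
    up = [r for r in tested if r["log2FoldChange"] > lfc_cut]
    down = [r for r in tested if r["log2FoldChange"] < -lfc_cut]
    return up, down

# Toy input with a duplicated gene name and one missing padj
rows = [
    {"GeneName": "rrf", "log2FoldChange": 2.5, "padj": 0.04},
    {"GeneName": "rrf", "log2FoldChange": 3.1, "padj": 0.001},
    {"GeneName": "tuf", "log2FoldChange": -2.7, "padj": 0.01},
    {"GeneName": "bfr", "log2FoldChange": 4.0, "padj": None},
]
dedup = keep_min_padj(rows)
up, down = split_deg(dedup)
print([r["GeneName"] for r in up], [r["GeneName"] for r in down])
```

Skipping rows with a missing padj explicitly mirrors what DESeq2 reports as NA after independent filtering; such genes are neither up nor down.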

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNOG-mapper, based on orthology and phylogeny

        EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

        Install EggNOG-mapper:

            mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
            mamba activate eggnog_env

        Run annotation:

            #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
            mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
            #NOT_WORKING: emapper.py -i CP052959_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
            #Download the protein sequences from Genbank
            mv ~/Downloads/sequence\(10\).txt CP052959_protein_.fasta
            python ~/Scripts/update_fasta_header.py CP052959_protein_.fasta CP052959_protein.fasta
            emapper.py -i CP052959_protein.fasta -o eggnog_out --cpu 20  #--resume
            #----> result eggnog_out.emapper.annotations: contains KEGG, GO, and other functional annotations.
            #---->  seed_ortholog example 470.IX87_14445:
                * 470 is the NCBI Taxonomy ID of the organism the seed ortholog comes from (470 = Acinetobacter baumannii).
                * IX87_14445 is the locus tag of the specific gene or protein within that genome.

        Extract KEGG KO IDs from eggnog_out.emapper.annotations (column KEGG_ko).
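As a rough illustration of the KO extraction step, the sketch below reads an emapper annotations table and collects the KO IDs per query. It assumes the layout shown later in this post ('##' comment lines, a header starting with '#query', a KEGG_ko column with comma-separated 'ko:' entries and '-' for none); the function name extract_kos and the toy rows are hypothetical.

```python
import csv
import io

def extract_kos(handle):
    """Map each query ID to its list of KO IDs (without the 'ko:' prefix)."""
    # Skip '##' comment lines; the header line itself starts with '#query'
    lines = (line for line in handle if not line.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t")
    result = {}
    for row in reader:
        # Accept both the raw header (#query) and the cleaned one (query)
        query = row.get("#query") or row.get("query")
        kos = row.get("KEGG_ko") or "-"
        if kos != "-":
            result[query] = [k.replace("ko:", "") for k in kos.split(",")]
    return result

# Toy table with only three of the real columns; IDs are illustrative
example = io.StringIO(
    "## emapper-2.1.12\n"
    "#query\tseed_ortholog\tKEGG_ko\n"
    "geneA\t470.IX87_14445\tko:K00001,ko:K00002\n"
    "geneB\t470.IX87_14445\t-\n"
    "## 2 queries scanned\n"
)
print(extract_kos(example))
```

For a real run, the StringIO object would be replaced by an open file handle on the annotations file.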

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot and blast2go_annot.annot2): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * Prepare the workspace with "mkdir ~/b2gWorkspace_JuliaFuchs_RNAseq_2025; cp /mnt/md1/DATA/Data_JuliaFuchs_RNAseq_2025/CP052959_protein.fasta ~/b2gWorkspace_JuliaFuchs_RNAseq_2025;", then start the GUI (jhuang@WS-2290C:~/DATA/Data_JuliaFuchs_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher) and set this directory as the workspace.
    # ------ STEP_1: 100% Load Sequences (CP052959_protein): done ------
    * Button 'File' --> 'Load' --> 'Load Sequences' --> 'Load Fasta File (.fasta)' Choose a protein sequence file (e.g. CP052959_protein.fasta) (Tags: NONE, generated columns: Nr, SeqName) as input
    # ------ STEP_2: 100% QBlast (CP052959_protein): done with warnings [4-5 days]; as with DAMIAN, the most time-consuming step is blastn/blastp ------
    * Button 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
            -- QBlast (CP052959_protein) Warning! --
            QBlast finished with warnings!
            Blasted Sequences: 2011
            Sequences without results: 99
            Check the Job log for details and try to submit again.
            Restarting QBlast may result in additional results, depending on the error type.
            "Blast (CP052959_protein) Done"
    # ------ STEP_3: 100% Mapping (CP052959_protein): done [3h56m10s] ------
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names)
            -- Mapping (CP052959_protein) Done --
            "Mapping finished - Please proceed now to annotation."
    # ------ STEP_4: 100% Annotation (CP052959_protein): done [7m56s] ------
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names)
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO term, and the position of the particular GO term in the Gene Ontology hierarchy. The annotation score cutoff selects, for a given GO branch, the most specific GO term that lies above this value.
            * Used parameter 'GO Weight': a value that is added to the Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. As a result, more general GO terms that summarise many original source terms (those directly associated with the Blast hits) will have a higher Annotation Score.
            -- Annotation (CP052959_protein) Done --
            "Annotation finished."
    #(NOT_USED) or blast2go_cli_v1.5.1
            #https://help.biobam.com/space/BCD/2250407989/Installation
            #see ~/Scripts/blast2go_pipeline.sh

    # ------ STEP_5: 100% Export Annotations (CP052959_protein): done (for before_merging) ------
    + Button 'File' -> 'Export' -> 'Export Annotations' -> 'Export Annotations (.annot, custom, etc.)' as ~/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot.

    + Option 3 (GO Terms from 'Blast2GO 5 Basic' using interpro): Interpro based protein families / domains --> Button interpro, Export Format XML (e.g. HJI06_00260.xml) to Folder "/home/jhuang/b2gWorkspace_JuliaFuchs_RNAseq_2025"
    # ------ STEP_6: 100% InterProScan (CP052959_protein): done [1d6h41m51s] ------
    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names)
            -- InterProScan Finished, You can now merge the obtained GO Annotations. --
            "InterProScan (CP052959_protein) Done"
            "InterProScan Finished - You can now merge the obtained GO Annotations."
    + MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
    # ------ STEP_7: 100% Merge InterProScan GOs to Annotation (CP052959_protein): done [1s] ------
    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
            -- Merge InterProScan GOs to Annotation (CP052959_protein) Done --
            "Finished merging GO terms from InterPro with annotations."
            "Maybe you want to run ANNEX (Annotation Augmentation)."
    #* (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

    # ------ STEP_8: 100% Export Annotations (CP052959_protein): done (for after_merging) ------
    + Button 'File' -> 'Export' -> 'Export Annotations' -> 'Export Annotations (.annot, custom, etc.)' as ~/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot2.

    #NOTE that annotations are different between before_merging and after_merging; after_merging has more annotation-items.
    #-- before merging (blast2go_annot.annot) --
    #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0085020
    #None for HJI06_00005

    #-- after merging (blast2go_annot.annot2) -->
    #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0070531
    #H0N29_18790     GO:0004842
    #H0N29_18790     GO:0005515
    #H0N29_18790     GO:0085020

    #HJI06_00005     GO:0005737      chromosomal replication initiator protein DnaA
    #HJI06_00005     GO:0005886
    #HJI06_00005     GO:0003688
    #HJI06_00005     GO:0005524
    #HJI06_00005     GO:0008289
    #HJI06_00005     GO:0016887
    #HJI06_00005     GO:0006270
    #HJI06_00005     GO:0006275
    #HJI06_00005     EC:3.6.1
    #HJI06_00005     EC:3.6
    #HJI06_00005     EC:3
    #HJI06_00005     EC:3.6.1.15
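To see what the merge added per sequence, the two exports can be compared term by term. The sketch below is one way to do that, assuming the .annot files are tab-separated with the sequence name in column 1 and the term in column 2 (consistent with the cut -f1-2 call used later); the function name terms_per_gene is hypothetical, and the example lines are the H0N29_18790 entries listed above.

```python
from collections import defaultdict

def terms_per_gene(lines):
    """Collect the set of GO/EC terms per sequence from .annot-style lines."""
    terms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            terms[fields[0]].add(fields[1])
    return terms

# Entries copied from the before/after listings above
before = terms_per_gene([
    "H0N29_18790\tGO:0004842\tankyrin repeat domain-containing protein",
    "H0N29_18790\tGO:0085020",
])
after = terms_per_gene([
    "H0N29_18790\tGO:0031436\tankyrin repeat domain-containing protein",
    "H0N29_18790\tGO:0070531",
    "H0N29_18790\tGO:0004842",
    "H0N29_18790\tGO:0005515",
    "H0N29_18790\tGO:0085020",
])

for gene in sorted(after):
    added = sorted(after[gene] - before.get(gene, set()))
    print(gene, "added by merge:", added)
```

With real files, the lists would be replaced by open file handles on blast2go_annot.annot and blast2go_annot.annot2.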

    Option 4 (NOT_USED): RFAM for non-coding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    * Go to KAAS
    * Upload your FASTA file.
    * Select an appropriate gene set.
    * Download the KO assignments.

10.2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    * Check available KEGG organisms:

            library(clusterProfiler)
            library(KEGGREST)

            kegg_organisms <- keggList("organism")

            # Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

            # Search for Acinetobacter in the list
            grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
            # Gammaproteobacteria
            #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

10.3. Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

    If your organism is not in KEGG, search for the closest relative:

            grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (this call has been integrated into section 10.4):

            kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
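For orientation, the over-representation test behind this kind of enrichment is a one-sided hypergeometric test. The sketch below computes that p-value with the standard library only; it is a simplified illustration and not clusterProfiler's code (no multiple-testing correction, no gene-set size filters), and the function name and the toy numbers are assumptions.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway genes among n selected genes,
    when K of the N background genes belong to the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Toy numbers: background of 20 KOs, 5 in the pathway, 4 DEGs of which 2 are in the pathway
p = hypergeom_pvalue(k=2, n=4, K=5, N=20)
print(f"p = {p:.4f}")
```

With organism = "ko", the gene identifiers are K-numbers, so N and K are counted over KO IDs rather than over the organism's own locus tags.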

10.4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")

        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)

        setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines at the beginning and END and renaming #query to query
        #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_JuliaFuchs_RNAseq_2025$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
        #1,5c1
        #< ## Thu Jan 30 16:34:52 2025
        #< ## emapper-2.1.12
        #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
        #< ##
        #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #---
        #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
        #3620,3622d3615
        #< ## 3614 queries scanned
        #< ## Total time (seconds): 8.176708459854126

        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_JuliaFuchs_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

        # After `cut -f1-2` the file has exactly 2 columns; keep only those in case fill = TRUE added more:
        annot_df <- annot_df[, 1:2]
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
          filter(grepl("^GO:", Term)) %>%
          group_by(GeneID) %>%
          summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
          filter(grepl("^EC:", Term)) %>%
          group_by(GeneID) %>%
          summarize(EC = paste(Term, collapse = ","), .groups = "drop")

        # Key Improvements:
        #    * Looped processing of all input files (see input_files) to avoid redundancy.
        #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
        #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
        #    * Error handling for GO term descriptions via tryCatch.
        #    * Improved clarity and modular structure for easier maintenance and future additions.

        #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

        # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
        suppressPackageStartupMessages({
          library(readr)
          library(dplyr)
          library(stringr)
          library(tidyr)
          library(openxlsx)
          library(clusterProfiler)
          library(AnnotationDbi)
          library(GO.db)
        })

        # ---- PARAMETERS ----
        PADJ_CUT <- 5e-2
        LFC_CUT  <- 2

        # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
        emapper_path <- "~/DATA/Data_JuliaFuchs_RNAseq_2025/eggnog_out.emapper.annotations.txt"

        # Input files (you can add/remove here)
        input_files <- c(
          "Mitomycin_18h_vs_Untreated_18h-all.csv",  #up 576, down 307 --> height 11000
          "Mitomycin_8h_vs_Untreated_8h-all.csv",    #up 580, down 201 --> height 11000
          "Mitomycin_4h_vs_Untreated_4h-all.csv",    #up 489, down 67  --> height 6400
          "Moxi_18h_vs_Untreated_18h-all.csv",       #up 472, down 317 --> height 10500
          "Moxi_8h_vs_Untreated_8h-all.csv",         #up 486, down 307 --> height 10500
          "Moxi_4h_vs_Untreated_4h-all.csv",         #up 349, down 118 --> height 6400
          "Untreated_18h_vs_Untreated_4h-all.csv",   #(up 262, down 51)
          "Untreated_18h_vs_Untreated_8h-all.csv",   #(up 124, down 26)
          "Untreated_8h_vs_Untreated_4h-all.csv",     #(up 90, down 18) --> in total 368 --> height 5000
          "Mitomycin_18h_vs_Mitomycin_4h-all.csv",   #(up 161, down 63)
          "Mitomycin_18h_vs_Mitomycin_8h-all.csv",   #(up 61, down 28)
          "Mitomycin_8h_vs_Mitomycin_4h-all.csv",     #(up 47, down 10) --> in total 279 --> height 3500
          "Moxi_18h_vs_Moxi_4h-all.csv",   #(up 141, down 29)
          "Moxi_18h_vs_Moxi_8h-all.csv",   #(up 15, down 3)
          "Moxi_8h_vs_Moxi_4h-all.csv"     #(up 67, down 2) --> in total 196 --> height 2600
        )

        # ---- HELPERS ----
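        # Robust reader (CSV first, then TSV)
        read_table_any <- function(path) {
          tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                        error = function(e) NULL)
          # read_csv does not fail on tab-separated input; it returns a single column instead,
          # so retry as TSV when the CSV attempt failed or produced only one column
          if (is.null(tb) || ncol(tb) <= 1) {
            tb <- tryCatch(readr::read_tsv(path, show_col_types = FALSE),
                          error = function(e2) NULL)
          }
          tb
        }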

        # Return a nice Excel-safe base name
        xlsx_name_from_file <- function(path) {
          base <- tools::file_path_sans_ext(basename(path))
          paste0("DEG_KEGG_GO_", base, ".xlsx")
        }

        # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
        expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
          if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
          kdf <- as.data.frame(kegg_res)
          if (!"geneID" %in% names(kdf)) return(kdf)
          # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
          map_clean <- mapping_tbl %>%
            dplyr::select(KEGG_ko, GeneID) %>%
            filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
            mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
            tidyr::separate_rows(KEGG_ko, sep = ",") %>%
            distinct()

          if (!nrow(map_clean)) {
            return(kdf)
          }

          expanded <- kdf %>%
            tidyr::separate_rows(geneID, sep = "/") %>%
            dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
            distinct() %>%
            dplyr::group_by(ID) %>%
            dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

          kdf %>%
            dplyr::select(-geneID) %>%
            dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
            dplyr::rename(geneID = GeneID)
        }

        # ---- LOAD emapper annotations ----
        eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
        # Ensure character columns for joins
        eggnog_data$query   <- as.character(eggnog_data$query)
        eggnog_data$GOs     <- as.character(eggnog_data$GOs)
        eggnog_data$EC      <- as.character(eggnog_data$EC)
        eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

        # ---- MAIN LOOP ----
        for (f in input_files) {
          if (!file.exists(f)) { message("Missing: ", f); next }

          message("Processing: ", f)
          res <- read_table_any(f)
          if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

          # Coerce expected columns if present
          if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
          if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

          # Ensure GeneID & GeneName exist
          if (!"GeneID" %in% names(res)) {
            # Try to infer from a generic 'gene' column
            if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
          }
          if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

          # Fill missing GeneName from GeneID (drop "gene-")
          res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                                gsub("^gene-", "", as.character(res$GeneID)),
                                as.character(res$GeneName))

          # De-duplicate by GeneName, keep smallest padj
          if (!"padj" %in% names(res)) res$padj <- NA_real_
          res <- res %>%
            group_by(GeneName) %>%
            slice_min(padj, with_ties = FALSE) %>%
            ungroup() %>%
            as.data.frame()

          # Sort by padj asc, then log2FC desc
          if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
          res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

          # Join emapper (strip "gene-" from GeneID to match emapper 'query')
          res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
          res_ann <- res %>%
            left_join(eggnog_data, by = c("GeneID_plain" = "query"))

          # --- Split by UP/DOWN using your volcano cutoffs ---
          up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
          down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

          # --- KEGG enrichment (using K numbers in KEGG_ko) ---
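          # Prepare KO lists: drop missing entries ("-" or NA), remove "ko:" and
          # split comma-separated entries so that each element is a single K number
          k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up) & k_up != "-"]
          k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn) & k_dn != "-"]
          k_up <- unique(unlist(strsplit(gsub("ko:", "", k_up), ",")))
          k_dn <- unique(unlist(strsplit(gsub("ko:", "", k_dn), ",")))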

          # BREAK_LINE

          kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
          kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

          # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
          kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
          kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

          # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
          # Background gene set = all genes in this comparison
          background_genes <- unique(res_ann$GeneID_plain)
          # TERM2GENE table (GO -> GeneID_plain)
          go_annotation <- res_ann %>%
            dplyr::select(GeneID_plain, GOs) %>%
            mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
            tidyr::separate_rows(GOs, sep = ",") %>%
            filter(GOs != "", GOs != "-") %>%   # "-" marks genes without GO terms
            dplyr::select(GOs, GeneID_plain) %>%
            distinct()

          # Gene lists for GO enricher
          go_list_up   <- unique(up_regulated$GeneID_plain)
          go_list_down <- unique(down_regulated$GeneID_plain)

          go_up <- tryCatch(
            enricher(gene = go_list_up, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )
          go_down <- tryCatch(
            enricher(gene = go_list_down, TERM2GENE = go_annotation,
                    pvalueCutoff = 0.05, pAdjustMethod = "BH",
                    universe = background_genes),
            error = function(e) NULL
          )

          go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
          go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

          # Add GO term descriptions via GO.db (best-effort)
          add_go_term_desc <- function(df) {
            if (!nrow(df) || !"ID" %in% names(df)) return(df)
            df$Description <- sapply(df$ID, function(go_id) {
              term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                    columns = "TERM", keytype = "GOID"),
                              error = function(e) NULL)
              if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
            })
            df
          }
          go_up_df   <- add_go_term_desc(go_up_df)
          go_down_df <- add_go_term_desc(go_down_df)

          # ---- Write Excel workbook ----
          out_xlsx <- xlsx_name_from_file(f)
          wb <- createWorkbook()

          addWorksheet(wb, "Complete")
          writeData(wb, "Complete", res_ann)

          addWorksheet(wb, "Up_Regulated")
          writeData(wb, "Up_Regulated", up_regulated)

          addWorksheet(wb, "Down_Regulated")
          writeData(wb, "Down_Regulated", down_regulated)

          addWorksheet(wb, "KEGG_Enrichment_Up")
          writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

          addWorksheet(wb, "KEGG_Enrichment_Down")
          writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

          addWorksheet(wb, "GO_Enrichment_Up")
          writeData(wb, "GO_Enrichment_Up", go_up_df)

          addWorksheet(wb, "GO_Enrichment_Down")
          writeData(wb, "GO_Enrichment_Down", go_down_df)

          saveWorkbook(wb, out_xlsx, overwrite = TRUE)
          message("Saved: ", out_xlsx)
        }
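
The KEGG step above depends on how the eggNOG-mapper `KEGG_ko` strings are unpacked. The following is a minimal base-R sketch, assuming (as `expand_kegg_geneIDs` does) that entries are comma-separated, prefixed with `ko:`, and that `-` marks a missing annotation. The helper name `split_ko`, the gene IDs and the K numbers are made up for illustration and are not part of the pipeline.

```r
# Toy annotation table; gene IDs and K numbers are hypothetical.
toy <- data.frame(
  GeneID  = c("geneA", "geneB", "geneC", "geneD"),
  KEGG_ko = c("ko:K00001,ko:K00002", "-", NA, "ko:K00002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per (GeneID, K number), dropping "-" and NA.
split_ko <- function(tbl) {
  keep <- !is.na(tbl$KEGG_ko) & tbl$KEGG_ko != "-"
  tbl  <- tbl[keep, , drop = FALSE]
  # Strip the "ko:" prefix, then split multi-KO entries on commas
  kos  <- strsplit(gsub("ko:", "", tbl$KEGG_ko, fixed = TRUE), ",", fixed = TRUE)
  unique(data.frame(
    GeneID  = rep(tbl$GeneID, lengths(kos)),
    KEGG_ko = as.character(unlist(kos)),
    stringsAsFactors = FALSE
  ))
}

ko_long <- split_ko(toy)
print(ko_long)

# Unique K numbers: the kind of vector that is passed as `gene` to enrichKEGG()
ko_vec <- unique(ko_long$KEGG_ko)
print(ko_vec)
```

The long table keeps the gene-to-KO link, so the same object can serve both as the enrichment input (`ko_vec`) and as the mapping used to translate enriched K numbers back to gene IDs.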

10.5. (TODO) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

            library(openxlsx)
            library(ggplot2)  # ggsave()
            library(ggvenn)   # ggvenn()

            # Function to read and clean gene ID files
            read_gene_ids <- function(file_path) {
              # Read the gene IDs from the file
              gene_ids <- readLines(file_path)

              # Remove any quotes and trim whitespace
              gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
              gene_ids <- trimws(gene_ids)         # Trim whitespace

              # Remove empty entries or NAs
              gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

              return(gene_ids)
            }

            # Example list of LB files with both -up.id and -down.id for each condition
            lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                            "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
            lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                            "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

            # Combine both up and down files for each condition
            lb_files <- c(lb_files_up, lb_files_down)

            # Read gene IDs for each file in LB group
            #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
            lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down)\\.id$", "", lb_files)))

            lb_degs_ <- list()
            combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
            #unique_combined_set <- unique(combined_set)
            lb_degs_$AB <- combined_set
            combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
            lb_degs_$IJ <- combined_set
            combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
            lb_degs_$W1 <- combined_set
            combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
            lb_degs_$Y1 <- combined_set

            # Example list of Mac files with both -up.id and -down.id for each condition
            mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                            "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
            mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                            "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

            # Combine both up and down files for each condition in Mac group
            mac_files <- c(mac_files_up, mac_files_down)

            # Read gene IDs for each file in Mac group
            mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down)\\.id$", "", mac_files)))

            mac_degs_ <- list()
            combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
            mac_degs_$AB <- combined_set
            combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
            mac_degs_$IJ <- combined_set
            combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
            mac_degs_$W1 <- combined_set
            combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
            mac_degs_$Y1 <- combined_set

            # Function to clean sheet names to ensure no sheet name exceeds 31 characters
            truncate_sheet_name <- function(names_list) {
              sapply(names_list, function(name) {
                if (nchar(name) > 31) {
                  return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
                }
                return(name)
              })
            }

            # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

            # Define intersections between different conditions for LB
            inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
            inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
            inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
            inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
            inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
            inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

            # Define intersections between three conditions for LB
            inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
            inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
            inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
            inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Define intersection between all four conditions for LB
            inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

            # Now remove the intersected genes from each original set for LB
            venn_list_lb <- list()

            # For LB.AB, remove genes that are also in other conditions
            venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

            # For LB.IJ, remove genes that are also in other conditions
            venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

            # For LB.W1, remove genes that are also in other conditions
            venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_w1_y1)))

            # For LB.Y1, remove genes that are also in other conditions
            venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_w1_y1)))

            # Add the intersections for LB (same as before)
            venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
            venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
            venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
            venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
            venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
            venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

            # Define intersections between three conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
            venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
            venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

            # Define intersection between all four conditions for LB
            venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

            # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

            # Define intersections between different conditions
            inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
            inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
            inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
            inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
            inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
            inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

            # Define intersections between three conditions
            inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
            inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
            inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
            inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Define intersection between all four conditions
            inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

            # Now remove the intersected genes from each original set
            venn_list_mac <- list()

            # For Mac.AB, remove genes that are also in other conditions
            venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

            # For Mac.IJ, remove genes that are also in other conditions
            venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

            # For Mac.W1, remove genes that are also in other conditions
            venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_w1_y1)))

            # For Mac.Y1, remove genes that are also in other conditions
            venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_w1_y1)))

            # Add the intersections (same as before)
            venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
            venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
            venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
            venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
            venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

            # Define intersections between three conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
            venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
            venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

            # Define intersection between all four conditions
            venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

            # Save the gene IDs to Excel for further inspection (optional)
            write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
            write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

            # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
            write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
            write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

            # Venn Diagram for LB group
            venn1 <- ggvenn(lb_degs_,
                            fill_color = c("skyblue", "tomato", "gold", "orchid"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

            # Venn Diagram for Mac group
            venn2 <- ggvenn(mac_degs_,
                            fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                            stroke_size = 0.4,
                            set_name_size = 5)
            ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

            cat("✅ All Venn intersection sets exported to Excel successfully.\n")
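
The pairwise, three-way and four-way intersections above are written out by hand, which is easy to get wrong when sets are added. As a hedged alternative, the base-R sketch below labels every gene with the exact combination of sets it belongs to and splits on that label. Note the difference: these are exclusive Venn regions (a gene appears in exactly one region), whereas the `_AND_` sheets above are inclusive intersections. The function name `venn_regions` and the toy gene IDs are hypothetical.

```r
# Toy gene sets; the IDs are made up for illustration.
sets <- list(
  AB = c("g1", "g2", "g3"),
  IJ = c("g2", "g3", "g4"),
  W1 = c("g3", "g5"),
  Y1 = c("g5", "g6")
)

# Hypothetical helper: split all genes into exclusive Venn regions.
venn_regions <- function(sets) {
  genes <- unique(unlist(sets))
  # Membership matrix: rows = genes, columns = sets
  member <- vapply(sets, function(s) genes %in% s, logical(length(genes)))
  member <- matrix(member, nrow = length(genes),
                   dimnames = list(genes, names(sets)))
  # Label each gene by the exact combination of sets that contain it
  label <- apply(member, 1, function(m) paste(names(sets)[m], collapse = "_AND_"))
  split(genes, label)
}

regions <- venn_regions(sets)
print(lengths(regions))  # number of genes per exclusive region
```

Because every gene lands in exactly one region, the region sizes sum to the number of distinct genes, which is a convenient consistency check against the counts drawn by `ggvenn`.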

Cluster the genes and draw heatmaps
```
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# Print the commands that save the up- and down-regulated gene IDs into -up.id and -down.id
# (the loop only echoes the commands; pipe its output to bash to run them)

for i in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done

#The row’s description column says “TsaE,” but the preferred_name is ydiB (shikimate/quinate dehydrogenase).
#Length = 301 aa — that fits YdiB much better. TsaE (YjeE) is a small P-loop ATPase, typically ~150–170 aa, not ~300 aa.
#The COG/orthology hit and the very strong e-value also point to a canonical enzyme rather than the tiny TsaE ATPase.
#What likely happened
#The “GeneName” (tsaE) was inherited from a prior/automated annotation.
#Orthology mapping (preferred_name) recognizes the protein as YdiB; the free-text product line didn’t update, leaving a label clash.
#What to do
#Treat this locus as ydiB (shikimate dehydrogenase; aka AroE-II), not TsaE.
#If you want to be thorough, BLAST the sequence and/or run InterPro/eggNOG: you should see SDR/oxidoreductase motifs for YdiB, not the P-loop NTPase (Walker A) you’d expect for TsaE.
#Check your genome for the true t6A genes (tsaB/tsaD/tsaE/tsaC); the real tsaE should be a much smaller ORF.
# -- Replace GeneName with Preferred_name when Preferred_name is non-empty and not '-' (first sheet). --
for i in Mitomycin_18h_vs_Untreated_18h Mitomycin_8h_vs_Untreated_8h Mitomycin_4h_vs_Untreated_4h Moxi_18h_vs_Untreated_18h Moxi_8h_vs_Untreated_8h Moxi_4h_vs_Untreated_4h Mitomycin_18h_vs_Mitomycin_4h Mitomycin_18h_vs_Mitomycin_8h Mitomycin_8h_vs_Mitomycin_4h  Moxi_18h_vs_Moxi_4h Moxi_18h_vs_Moxi_8h Moxi_8h_vs_Moxi_4h  Untreated_18h_vs_Untreated_4h Untreated_18h_vs_Untreated_8h Untreated_8h_vs_Untreated_4h; do
  python ~/Scripts/replace_with_preferred_name.py DEG_KEGG_GO_${i}-all.xlsx -o ${i}-all_annotated.csv
done
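
# --- Illustrative R sketch of the replacement rule stated above ---
# Assumption: ~/Scripts/replace_with_preferred_name.py is not shown in this post, so this is
# NOT that script; it only mimics the stated rule (use Preferred_name when it is non-empty
# and not '-') on a toy table with made-up values.
toy_ann <- data.frame(
  GeneName       = c("tsaE", "recA", "geneX"),
  Preferred_name = c("ydiB", "-", ""),
  stringsAsFactors = FALSE
)
use_pref <- !is.na(toy_ann$Preferred_name) &
  nzchar(toy_ann$Preferred_name) &
  toy_ann$Preferred_name != "-"
toy_ann$GeneName[use_pref] <- toy_ann$Preferred_name[use_pref]
print(toy_ann$GeneName)  # only the first entry is replaced (tsaE becomes ydiB)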

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

#RUN rld generation code (see the first part of the file)
setwd("degenes")
## 0) Config (run only ONE 'contrast' line per pass; if all are run, the last one wins) ---
contrast <- "Mitomycin_18h_vs_Untreated_18h"  #up 576, down 307 --> height 11000
contrast <- "Mitomycin_8h_vs_Untreated_8h"    #up 580, down 201 --> height 11000
contrast <- "Mitomycin_4h_vs_Untreated_4h"    #up 489, down 67  --> height 6500
contrast <- "Moxi_18h_vs_Untreated_18h"       #up 472, down 317 --> height 10500
contrast <- "Moxi_8h_vs_Untreated_8h"         #up 486, down 307 --> height 10500
contrast <- "Moxi_4h_vs_Untreated_4h"         #up 349, down 118 --> height 6500

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  #path <- up_file
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
#colnames(RNASeq.NoCellLine) <- c("WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3", "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3", "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3", "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3", "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3", "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3", "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3", "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3", "WT_two_17_r1",      "WT_two_17_r2", "WT_two_17_r3", "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3", "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3", "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3")

# -- After the first run, RE-RUN the code from HERE with the new contrast --

## 4) Build GOI from the two .id files (note: do not run if both files are empty) ----------
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

#BREAK_LINE

## 7) Pretty labels (display only) ---------------------------
# Start from rownames(datamat) (assumed to be GeneID)
labRow_pretty <- rownames(datamat)
# ---- Replace GeneID with GeneName from "<contrast>-all_annotated.csv" ----
all_path <- paste0(contrast, "-all_annotated.csv")
if (file.exists(all_path)) {
  ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE)
  pick_col <- function(df, candidates) {
    hit <- intersect(candidates, names(df))
    if (length(hit) == 0) return(NA_character_)
    hit[1]
  }
  id_col <- pick_col(ann, c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID"))
  nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL"))
  if (!is.na(id_col) && !is.na(nm_col)) {
    ann[[nm_col]][is.na(ann[[nm_col]])] <- ""
    id2name <- setNames(ann[[nm_col]], ann[[id_col]])
    id2name <- id2name[nzchar(id2name)]  # drop empty names
    hits <- match(rownames(datamat), names(id2name))
    repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits])
    # avoid duplicate labels on the plot
    labRow_pretty <- make.unique(repl, sep = "_")
  } else {
    warning("Could not find GeneID/GeneName columns in ", all_path)
  }
} else {
  warning("File not found: ", all_path)
}

# Column labels: 'deltaadeIJ' -> 'ΔadeIJ' and nicer spacing
labCol_pretty <- colnames(datamat)
#labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty)
labCol_pretty <- gsub("_", " ", labCol_pretty)  # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
# If you prefer to drop replicate suffixes, uncomment:
# labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

## 8) Clustering -----------------------------------------------
# Row clustering with Pearson distance
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
#row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
#row_cor[!is.finite(row_cor)] <- 0
#hr <- hclust(as.dist(1 - row_cor), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.1)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

#BREAK_LINE

labRow_pretty <- sub("^gene-", "", labRow_pretty)
labRow_pretty <- sub("^rna-", "", labRow_pretty)

png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600)
heatmap.2(datamat,
  Rowv = as.dendrogram(hr),
  col = bluered(75),
  scale = "row",
  RowSideColors = mycol,
  trace = "none",
  margins = c(10, 20),      # bottom, left
  sepwidth = c(0, 0),
  dendrogram = "row",
  Colv = FALSE,
  density.info = "none",
  labRow = labRow_pretty,   # row labels WITHOUT "gene-"
  labCol = labCol_pretty,   # col labels with Δsbp + spaces
  cexRow = 2.5,
  cexCol = 2.5,
  srtCol = 15,
  lhei = c(0.6, 4),   # enlarge the first number when reducing the plot size to avoid the error 'Error in plot.new() : figure margins too large'
  lwid = c(0.8, 4))   # enlarge the first number when reducing the plot size to avoid the error 'Error in plot.new() : figure margins too large'
dev.off()

# DEBUG for some items starting with "gene-"
labRow_pretty <- sub("^gene-", "", labRow_pretty)
labRow_pretty <- sub("^rna-", "", labRow_pretty)

png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 6500)
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),      # c(bottom, left)
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.4,
  # ↓ smaller column label font (was 1.3)
  cexCol = 1.8,
  srtCol = 15,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE               # safer; add manual z-score key if you want (see note below)
)
dev.off()

# ------------------ Heatmap generation for three samples ----------------------

## ============================================================
## Three-condition DEGs heatmap from multiple pairwise contrasts
## Example contrasts:
##   "WT_MH_4h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_4h"
## Output shows the union of DEGs across all contrasts and
## only the columns (samples) for the 3 conditions.
## ============================================================

## -------- 0) User inputs ------------------------------------
contrasts <- c(
  "Untreated_18h_vs_Untreated_4h",   #(up 262, down 51)
  "Untreated_18h_vs_Untreated_8h",   #(up 124, down 26)
  "Untreated_8h_vs_Untreated_4h"     #(up 90, down 18) --> in total 368 --> height 5000
)
contrasts <- c(
  "Mitomycin_18h_vs_Mitomycin_4h",   #(up 161, down 63)
  "Mitomycin_18h_vs_Mitomycin_8h",   #(up 61, down 28)
  "Mitomycin_8h_vs_Mitomycin_4h"     #(up 47, down 10) --> in total 279 --> height 3500
)
contrasts <- c(
  "Moxi_18h_vs_Moxi_4h",   #(up 141, down 29)
  "Moxi_18h_vs_Moxi_8h",   #(up 15, down 3)
  "Moxi_8h_vs_Moxi_4h"     #(up 67, down 2) --> in total 196 --> height 2600
)
## Optionally force a condition display order (defaults to order of first appearance)
cond_order <- c("Untreated_4h","Untreated_8h","Untreated_18h")
cond_order <- c("Mitomycin_4h","Mitomycin_8h","Mitomycin_18h")
cond_order <- c("Moxi_4h","Moxi_8h","Moxi_18h")
#cond_order <- NULL

## -------- 1) Packages ---------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## -------- 2) Helpers ----------------------------------------
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
                 error = function(e) NULL)
  if (!is.null(df) && ncol(df) >= 1) {
    ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

# From "A_vs_B" return c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
  parts
}

# Grep whole tag between start/end or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Pretty labels for columns (optional tweaks)
prettify_col_labels <- function(x) {
  x <- gsub("^deltasbp", "\u0394sbp", x)  # example from your earlier case
  x <- gsub("_", " ", x)
  x
}

# BREAK_LINE

# -- After the first run, RE-RUN the code from HERE with the new contrasts --

## -------- 3) Build GOI (union across contrasts) -------------
up_files   <- paste0(contrasts, "-up.id")
down_files <- paste0(contrasts, "-down.id")
GOI <- unique(unlist(c(
  lapply(up_files,   read_ids_from_file),
  lapply(down_files, read_ids_from_file)
)))
if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

## -------- 4) Expression matrix (rld or vsd) -----------------
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
expr_all <- as.matrix(expr_all)

present <- intersect(rownames(expr_all), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## -------- 5) Infer the THREE condition tags -----------------
pair_groups <- lapply(contrasts, split_contrast_groups)  # list of c(A,B)
cond_tags <- unique(unlist(pair_groups))
if (length(cond_tags) != 3L) {
  stop("Expected exactly three unique condition tags across the contrasts, got: ",
       paste(cond_tags, collapse = ", "))
}
# If user provided an explicit order, use it; else keep first-appearance order
if (!is.null(cond_order)) {
  if (!setequal(cond_order, cond_tags))
    stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
  cond_tags <- cond_order
}

#BREAK_LINE

## -------- 6) Subset columns to those 3 conditions -----------
# helper: does a name contain any of the tags?
match_any_tag <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# helper: return the specific tag that a single name matches
detect_tag <- function(nm, tags) {
  hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
  if (!any(hits)) NA_character_ else tags[which(hits)[1]]
}

keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
if (!any(keep_cols)) {
  stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
}
sub_idx <- which(keep_cols)
sub_colnames <- colnames(expr_all)[sub_idx]

# find the tag for each kept column (this is the part that was wrong before)
cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)

# rank columns by your desired condition order, then by name within each condition
cond_rank <- match(cond_for_col, cond_tags)
ord <- order(cond_rank, sub_colnames)

expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

## -------- 7) Remove constant/NA rows ------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

## -------- 8) Labels ----------------------------------------
labRow_pretty <- rownames(datamat)

# ---- Replace GeneID with GeneName from "<contrast>-all_annotated.csv" ----
all_path <- paste0(contrast, "-all_annotated.csv")
if (file.exists(all_path)) {
  ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE)
  pick_col <- function(df, candidates) {
    hit <- intersect(candidates, names(df))
    if (length(hit) == 0) return(NA_character_)
    hit[1]
  }
  id_col <- pick_col(ann,
```
c("GeneID","Gene.ID","Gene_Id","Gene","gene_id","LocusTag","locus_tag","ID")) nm_col <- pick_col(ann, c("GeneName","Gene.Name","Gene_Name","Symbol","gene_name","Name","SYMBOL")) if (!is.na(id_col) && !is.na(nm_col)) { ann[[nm_col]][is.na(ann[[nm_col]])] <- "" id2name <- setNames(ann[[nm_col]], ann[[id_col]]) id2name <- id2name[nzchar(id2name)] # drop empty names hits <- match(rownames(datamat), names(id2name)) repl <- ifelse(is.na(hits), rownames(datamat), id2name[hits]) # avoid duplicate labels on the plot labRow_pretty <- make.unique(repl, sep = "_") } else { warning("Could not find GeneID/GeneName columns in ", all_path) } } else { warning("File not found: ", all_path) } labCol_pretty <- prettify_col_labels(colnames(datamat)) #BREAK_LINE ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used -------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- labRow_pretty <- sub("^gene-", "", labRow_pretty) labRow_pretty <- sub("^rna-", "", labRow_pretty) png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 2600) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 15, 
lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() # ------------------ Heatmap generation for three samples END ---------------------- # -- (OLD ORIGINAL CODE for heatmap containing all samples) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the “” #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. GOI <- read.csv("ids")$Gene_Id RNASeq.NoCellLine <- assay(rld) #install.packages("gplots") library("gplots") #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). pearson or spearman datamat = RNASeq.NoCellLine[GOI, ] #datamat = RNASeq.NoCellLine write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt") constant_rows <- apply(datamat, 1, function(row) var(row) == 0) if(any(constant_rows)) { cat("Removing", sum(constant_rows), "constant rows.\n") datamat <- datamat[!constant_rows, ] } hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN", "LIGHTRED", "LIGHTGREEN"); mycol = mycol[as.vector(mycl)] png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000) heatmap.2(datamat, Rowv = as.dendrogram(hr), col = bluered(75), scale = "row", RowSideColors = mycol, trace = "none", margin = c(10, 15), # bottom, left sepwidth = c(0, 0), dendrogram = 'row', Colv = 'false', density.info = 'none', labRow = rownames(datamat), cexRow = 1.5, cexCol = 1.5, srtCol = 35, lhei = c(0.2, 4), # reduce top space (was 1 or more) lwid = c(0.4, 4)) # reduce left space (was 1 
or more) dev.off() # -------------- Cluster members ---------------- write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt') write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt') write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt') #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls #~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls; #### (NOT_WORKING) cluster members (adding annotations, note that it does not work for the bacteria, since it is not model-speices and we cannot use mart=ensembl) ##### subset_1<-names(subset(mycl, mycl == '1')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ]) #2575 subset_2<-names(subset(mycl, mycl == '2')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ]) #1855 subset_3<-names(subset(mycl, mycl == '3')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ]) #217 subset_4<-names(subset(mycl, mycl == '4')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ]) # subset_5<-names(subset(mycl, mycl == '5')) data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ]) # # Initialize an empty data frame for the annotated data annotated_data <- data.frame() # Determine total number of genes total_genes <- length(rownames(data)) # Loop through each gene to annotate for (i in 1:total_genes) { gene <- rownames(data)[i] result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = gene, mart = ensembl) # If multiple rows are returned, take the first one if (nrow(result) > 1) { result <- result[1, ] } # Check if the result is empty if 
(nrow(result) == 0) { result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA, entrezgene_id = NA, chromosome_name = NA, start_position = NA, end_position = NA, strand = NA, description = NA) } # Transpose expression values expression_values <- t(data.frame(t(data[gene, ]))) colnames(expression_values) <- colnames(data) # Combine gene information and expression data combined_result <- cbind(result, expression_values) # Append to the final dataframe annotated_data <- rbind(annotated_data, combined_result) # Print progress every 100 genes if (i %% 100 == 0) { cat(sprintf("Processed gene %d out of %d\n", i, total_genes)) } } # Save the annotated data to a new CSV file write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE) write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE) write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE) write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE) write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE) #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

Macrophage Phagocytosis / Phagocytic Podosomes


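I. Structures involved in the full phagocytosis process (see Figures 1 and 2)

Main steps and structures:

1. Pathogen: a foreign microorganism (such as a bacterium or fungus) that the macrophage recognizes as foreign.
2. Phagosome: the membrane-bound vesicle formed when the macrophage membrane wraps around the pathogen; the first internalized compartment.
3. Lysosome: a small vesicle containing many digestive enzymes; it fuses with the phagosome and supplies the degradative activity.
4. Debris/Waste: the small molecules or fragments left after the digestive enzymes break down the pathogen.
5. Cytoplasm: the fluid matrix surrounding all organelles; the environment for material exchange and biochemical reactions.
6. Plasma membrane: the physical boundary between the cell and its surroundings; both phagocytosis and exocytosis are carried out by membrane remodeling.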

Core annotations of the process (a-b-c, matching the steps in the original figure):

a. Phagocytosis: the pathogen is engulfed and a phagosome forms.
b. The lysosome fuses with the phagosome to form a phagolysosome, whose enzymes digest the pathogen.
c. Debris is either excreted by exocytosis or assimilated by the cell (the latter is not shown in the diagram).

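Legend supplement (elements of the original English figure):

bacterium: the kind of pathogen recognized by the macrophage
receptors: protein complexes on the plasma membrane that recognize and bind foreign material
phagocytosis: the overall process of taking up foreign material
phagosome: the vesicle containing the engulfed material
lysosome: a small vesicle of digestive enzymes
phagolysosome: the fusion product of a phagosome and a lysosome, containing degradative enzymes and breakdown products
soluble debris
exocytosis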

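II. Further background: what are phagocytic podosomes?

Phagocytic podosomes are F-actin-rich cytoskeletal microstructures located beneath the macrophage membrane, appearing as dots, discs, or "rosettes".
They act as "adhesive feet" that anchor the cell firmly to a substrate or pathogen surface, and they also serve as signaling and mechanical-support structures that mobilize the membrane and cytoskeleton to surround and internalize foreign material.
Phagocytic podosomes are distinguished by being protein-rich, substrate-adherent, and dynamically remodeled (lifetimes are typically a few minutes). During "frustrated phagocytosis", or when breaking down large targets such as biofilms, macrophages form multiple cooperating podosomes beneath the target to support and assist engulfment.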

Examples from the English-language literature and images

Morphology: “Phagocytic podosomes display an hourglass actin-rich core (∼300–600 nm tall, ∼0.3 μm radius) with a ring of integrins and adaptor proteins at their base. They form clusters or rosettes underneath phagocytic cups.”
High-resolution image example: nature.com podosome iPALM image

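Summary

Macrophages recognize pathogens through membrane receptors and, with the help of phagocytic podosomes, adhere precisely and firmly, wrap the target, and internalize it into a phagosome.
The phagosome fuses with a lysosome, the pathogen is destroyed, and the remnants are expelled by exocytosis.
Phagocytic podosomes provide the key scaffold for efficient recognition, engulfment, mechanical support, and signal integration, and are commonly seen in front-line immune responses and complex phagocytic behavior.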

References:

Actin nano-architecture of phagocytic podosomes. Nature Communications, 2022.
"Phagosome" (吞噬体) and "Pathogen" (病原体), Chinese Wikipedia.


NCBI AMRFinderPlus resistance gene detection tool: processing Data_Patricia_AMRFinderPlus_2025


A structured guide to NCBI AMRFinderPlus for antimicrobial resistance gene detection

I. Tool overview

NCBI AMRFinderPlus is a tool developed by the US National Center for Biotechnology Information (NCBI) for detecting antimicrobial resistance and related genes in microbes.
It identifies not just acquired AMR genes but also point mutations and virulence, metal, biocide, and stress genes, using curated reference data and Hidden Markov Models (HMMs) from NCBI.

II. Database and dependencies

The core database is the Bacterial Antimicrobial Resistance Reference Gene Database (BioProject PRJNA313047, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA313047/), which contains manually curated resistance genes, HMMs, and a naming hierarchy.
Cite this BioProject accession for the reference gene database; it supports scholarly citation and comparative studies.

III. Supported inputs and detection scope

Accepts protein sequences and assembled genomes (nucleotide sequences) as input and can detect:
- acquired resistance genes
- key resistance-associated point mutations
- virulence factors and metal, stress, and biocide resistance genes (with the --plus option)

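IV. Installation and updates

Install via Bioconda or from GitHub, and update the database regularly:
- Bioconda: mamba install -c conda-forge -c bioconda ncbi-amrfinderplus / mamba update ncbi-amrfinderplus
- Update the database: amrfinder -u
- Check versions: amrfinder -V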

V. Usage (command structure)

Typical commands and input options:
- Assembled genome (nucleotide FASTA): amrfinder -n assembly.fna -o amr.tsv --plus
- Predicted proteins (often better for specificity): amrfinder -p proteins.faa -o amr.tsv --plus
- For species with point-mutation rules (e.g., E. coli, Salmonella), add --organism, for example --organism Escherichia

VI. Interpreting the output

The output includes:
- gene/allele name
- class
- detection method (Exact / BLAST / HMM)
- percent identity and coverage
The Reference Gene Hierarchy explains why HMM hits may be mapped to gene families.

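VII. Comparison with ResFinder

AMRFinderPlus emphasizes a manually curated database, hierarchical naming, and extensive HMMs, and supports point-mutation and virulence-factor categories.
ResFinder primarily uses BLAST for acquired resistance genes; both are widely used.
The results of the two tools can complement each other; for critical cases, report both and compare the conclusions.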

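VIII. β-lactamase detection

AMRFinderPlus can identify the various β-lactamase families and specific alleles (KPC, NDM, OXA-48, etc.), but susceptibility to β-lactamase inhibitors still has to be assessed with phenotypic MICs and manual allele review.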

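IX. Practical advice and data management

Run QC before analysis (coverage, contamination, N50, etc.), prefer protein input to reduce false positives, and always record the software and database versions used.
Reference all results and analyses to your BioProject for project transparency and reproducibility.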

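X. Data submission and BioProject linkage

Submitting data to NCBI and linking it to a BioProject is strongly recommended; cite the accession in reports and papers. Compliant data enter Pathogen Detection/Isolate Browser, which supports standardized downstream analysis and traceability.
Follow NCBI pathogen/AMR submission guidelines for maximum interoperability and future-proofing.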

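Summary of main advantages

Broad coverage from AMR genes and point mutations to virulence factors, backed by an authoritative, continuously updated database; combining several detection methods improves accuracy. Suitable for resistance-gene surveillance in metagenomes/microbiomes, clinical infection source tracing, research data management, and similar scenarios.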

Citation:
Reference database: BioProject PRJNA313047
Official tool page: https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

This structured guide is suitable for scientific communication, teaching, or lab documentation. Please cite the BioProject and include the official tool URL when republishing.


What AMRFinderPlus is good for

Detects acquired AMR genes and key resistance-associated point mutations from assembled genomes (protein or nucleotide).
It also optionally reports virulence, stress/biocide/metal genes (“Plus”).

Powered by NCBI’s curated Bacterial AMR Reference Gene Database (with HMMs and a naming hierarchy), used in NCBI’s Pathogen Detection pipeline.

Database you’ll cite/use

Bacterial Antimicrobial Resistance Reference Gene Database
BioProject accession: PRJNA313047. Keep this in methods/notes.

Install & update (typical options)

Bioconda or GitHub.
After install, update DB regularly.
Check versions with amrfinder -V.

Minimal commands you’ll actually run

On an assembly (nucleotide FASTA): amrfinder -n assembly.fna -o amr.tsv --plus

If you have predicted proteins (often best for specificity): amrfinder -p proteins.faa -o amr.tsv --plus

Add organism when supported for point-mutation rules (e.g., Salmonella, E. coli, Campylobacter, S. aureus): amrfinder -n assembly.fna --organism Escherichia -o amr.tsv
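
The three invocations above differ only in the input flag and the optional organism, so they are easy to assemble programmatically when many genomes have to be processed. The following is a minimal Python sketch under that assumption; `build_amrfinder_cmd` is a hypothetical helper (not part of AMRFinderPlus) and it only uses the flags already shown above (`-n`, `-p`, `-o`, `--organism`, `--plus`).

```python
from typing import List, Optional


def build_amrfinder_cmd(fasta: str, out_tsv: str, protein: bool = False,
                        organism: Optional[str] = None,
                        plus: bool = True) -> List[str]:
    """Assemble an amrfinder argument list (hypothetical helper)."""
    # -p for predicted proteins, -n for an assembled nucleotide FASTA
    cmd = ["amrfinder", "-p" if protein else "-n", fasta, "-o", out_tsv]
    if organism:
        # enables organism-specific point-mutation rules where supported
        cmd += ["--organism", organism]
    if plus:
        # also report virulence/stress/biocide/metal genes
        cmd.append("--plus")
    return cmd


cmd = build_amrfinder_cmd("assembly.fna", "amr.tsv", organism="Escherichia")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where `amrfinder` is installed; building a list rather than a shell string avoids quoting problems with file names.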

Reading the output (what to look for)

You’ll get gene/allele names, class, method (Exact/Blast/HMM), and % identity/coverage.
Use the Reference Gene Hierarchy for how borderline hits are named and why an HMM hit maps to a family.
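
For a first look at the result table, it can help to count hits per class and detection method. The sketch below uses only the Python standard library. The column names `Class` and `Method` and the rows in `demo` are assumptions made for illustration; check the header line of your own `amr.tsv`, since column names can differ between AMRFinderPlus versions.

```python
import csv
import io
from collections import Counter


def summarize_amr(handle, class_col="Class", method_col="Method"):
    """Count hits per (class, method) in an AMRFinderPlus-style TSV.

    Column names are assumptions; adjust them to your file's header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[(row.get(class_col, "NA"), row.get(method_col, "NA"))] += 1
    return counts


# Synthetic rows for illustration only (not real results)
demo = io.StringIO(
    "Gene symbol\tClass\tMethod\n"
    "blaKPC-2\tBETA-LACTAM\tEXACTX\n"
    "blaCTX-M\tBETA-LACTAM\tHMM\n"
    "aac(6')-Ib\tAMINOGLYCOSIDE\tBLASTX\n"
)
demo_counts = summarize_amr(demo)
for (amr_class, method), n in sorted(demo_counts.items()):
    print(amr_class, method, n)
```

With a real run, replace `demo` with `open("amr.tsv", newline="")`.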

How it complements (and differs from) ResFinder

AMRFinderPlus: includes a curated hierarchy, extensive HMMs, and point-mutation rules for several taxa; “Plus” genes beyond classic AMR.
ResFinder: strong for acquired genes via BLAST rules.
Both are widely used; results can differ slightly by database scope and curation. Consider running both for critical cases and reconciling (multiple reviews/benchmarks discuss trade-offs).

β-lactamase & BLI angle (important in the context)

AMRFinderPlus will identify β-lactamase families and many specific alleles (e.g., KPC, OXA-48-like, NDM, CTX-M).
Neither AMRFinderPlus nor ResFinder guarantees inhibitor efficacy (e.g., avibactam, vaborbactam) because allele-specific substitutions, expression, porins/efflux affect phenotype.
Use the call to narrow likely classes (KPC vs OXA-48 vs MBL) and confirm with MICs.

Tie-in with BioProject

BioProject is the umbrella record linking all your project’s sequence data (assemblies, SRA reads, BioSamples).
Create one for your study; reference it in your reports.
If you submit to NCBI (recommended), follow the Pathogen/AMR submission guidance so your data appear in Pathogen Detection/Isolate Browser and benefit from standardized pipelines.

Practical checkpoints / best practices

QC first (coverage, contamination, N50); subpar assemblies yield spurious or missed hits.
Run with proteins when available (often fewer false positives); still keep nucleotide run if you lack annotation.
Record software & DB versions in your methods (e.g., “AMRFinderPlus vX.Y; database YYYY-MM-DD; PRJNA313047”).
Corroborate critical clinical calls (e.g., suspected KPC escape mutants) with phenotypic MICs and, if needed, manual allele review.

Run through an interpretation plan (what AMR genes/mutations to report, and how to summarize likely β-lactam/BLI implications).

Metagenomic Analysis – Patient Sample

Key Findings

Primary suspected pathogen: Klebsiella pneumoniae
- Clinically meaningful if abundant with good genome coverage; confirm with orthogonal methods.
Background organisms/contaminants:
- Skin flora (e.g., Staphylococcus haemolyticus, Cutibacterium acnes, Corynebacterium spp.)
- Environmental bacteria (e.g., Pseudomonas fluorescens, Dietzia, Massilia, Sphingomonas)
- Non-human DNA and synthetic controls (e.g., Tapinoma sessile, synthetic constructs)
- Taxonomically non-informative bins.

Next Steps

Verification:
- Confirm K. pneumoniae via culture or PCR.
- Cross-check read abundance and genome breadth for clinically relevant taxa.
AMR Profiling:
- Run AMRFinderPlus or ResFinder on Klebsiella-assigned contigs to identify β-lactamases (KPC, OXA-48-like, NDM, CTX-M) and other resistance genes.
- Document software and database versions for reproducibility.
Reporting:
- Only report organisms that are both high-abundance and match the clinical scenario/specimen type.
- Consider all others likely contaminants or non-informative.

Table: Species Interpretation

Species/Taxon	Likelihood	Interpretation/Notes
Klebsiella pneumoniae	High (pathogen)	Common pathogen; clinically relevant if abundant
Moraxella (genus)	Uncertain	Specify species/abundance for clinical interpretation
Staphylococcus haemolyticus	Background/Contam	CoNS; likely skin flora contamination
Cutibacterium acnes, Corynebacterium spp.	Background	Typical skin flora/commensals
Micrococcus, Rothia, Anaerococcus	Background	Skin/oral commensals, rarely pathogenic
Pseudomonas fluorescens, Dietzia, Massilia, etc.	Environmental	Soil, water, or industrial source; not pathogenic here
Tapinoma sessile, synthetic constructs	Contaminant	Non-patient origin DNA, likely process or control contamination
Malassezia (fungus)	Background	Skin yeast; relevant only in dermatologic context
[Unassigned], Bacteria, bacterium	Non-informative	Unresolved taxonomic assignment

Summary: Klebsiella pneumoniae is the likely pathogen based on abundance and breadth; confirmation and AMR genotyping are recommended. Most other taxa are consistent with background skin, environment, or contamination and should generally not be included in clinical reporting unless concordant with the sample type and clinical scenario.

Force–photon signal correlation analysis workflow (single-molecule optical tweezers + fluorescence data)


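I. Background hypothesis and validation goal

Single-molecule optical tweezers combined with fluorescence are used to study the assembly of LT protein on DNA, DNA melting, and co-localization with RAD51/RPA70.
Working hypothesis: if stretching force is the main driver, force steps should be accompanied by clear photon-state transitions; if they are not, the data support the "multimerization-driven" model, consistent with the paper's conclusion.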
胡卓伟 https://baike.baidu.com/item/%E8%83%A1%E5%8D%93%E4%BC%9F/8629598

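II. Data preparation and time alignment

Common time axis
- Resample force(t), photons(t), and the HMM state sequence to the same sampling rate (e.g., 100 Hz)
- Correct for exposure delay and sampling offset so that the timestamps agree
Bleaching correction
- Correct the photon signal for exponential decay or local baseline drift
- Exclude long bleached stretches, or flag them as a covariate
Denoising and standardization
- Filter moderately (e.g., median or Savitzky–Golay)
- Z-score both the photon signal and the force so they can be compared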
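
The resampling and standardization steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the analysis code used for the data: `resample_linear` and `zscore` are hypothetical helper names, linear interpolation is one possible resampling choice, timestamps are assumed strictly increasing, and both traces are assumed to share a clock origin once the offset correction has been applied.

```python
import statistics
from bisect import bisect_right


def resample_linear(t, y, rate_hz):
    """Linearly interpolate samples (t, y) onto a uniform grid starting at t[0]."""
    dt = 1.0 / rate_hz
    n = int((t[-1] - t[0]) / dt + 1e-9) + 1
    grid = [t[0] + i * dt for i in range(n)]
    out = []
    for g in grid:
        # index of the sample at or just before g, clamped to a valid interval
        j = min(max(bisect_right(t, g) - 1, 0), len(t) - 2)
        w = (g - t[j]) / (t[j + 1] - t[j])
        out.append(y[j] + w * (y[j + 1] - y[j]))
    return grid, out


def zscore(y):
    """Standardize to zero mean and unit (population) SD; assumes SD > 0."""
    mu = statistics.fmean(y)
    sd = statistics.pstdev(y)
    return [(v - mu) / sd for v in y]


# Tiny synthetic trace: three samples resampled to 4 Hz
t_demo = [0.0, 0.5, 1.0]
y_demo = [0.0, 1.0, 2.0]
grid, y_res = resample_linear(t_demo, y_demo, rate_hz=4.0)
z = zscore(y_res)
```

Applying the same `rate_hz` to force and photons yields sample-aligned lists that can go straight into a cross-correlation.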

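III. Choosing signals from the HMM results

Track in parallel:
- the corrected continuous photon signal photons(t)
- the HMM state transitions and posterior probabilities (state(t), P(switch at t))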

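IV. Three-level correlation analysis

A. Global time-series correlation

Cross-correlation function (CCF)
- Compute the cross-correlation between force(t) and photons(t)/state(t), scanning lags of ±L seconds
- Assess significance with block permutation or phase randomization to obtain p-values and confidence intervals
Event-triggered average (ETA)
- Using each change_time_s as the trigger, average the photon/state traces within the window [t0−W, t0+W]
- Check whether the ETA shows a systematic response near 0 that exceeds the confidence interval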
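
As an illustration of the cross-correlation step, the sketch below computes Pearson correlations over a range of sample lags and estimates a p-value for the largest |r|. Note the substitution: instead of block permutation or phase randomization it uses random circular shifts of the photon trace, a simpler surrogate that also preserves the autocorrelation of each series. All function names are hypothetical, the inputs are assumed to be equally long, already aligned lists, and the series must be longer than 4 × max_lag + 1 samples.

```python
import random
import statistics


def pearson(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den > 0 else 0.0


def lagged_corr(force, photons, max_lag):
    """Pearson r per lag k in samples; k > 0 means photons lag force by k."""
    n = len(force)
    out = {}
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            a, b = force[: n - k], photons[k:]
        else:
            a, b = force[-k:], photons[: n + k]
        out[k] = pearson(a, b)
    return out


def circular_shift_pvalue(force, photons, max_lag, n_perm=200, seed=0):
    """P-value for max |r| over the scanned lags, using circular-shift surrogates."""
    rng = random.Random(seed)
    obs = max(abs(r) for r in lagged_corr(force, photons, max_lag).values())
    n = len(photons)
    hits = 0
    for _ in range(n_perm):
        # shift far enough that a true lag cannot re-enter the scanned range
        s = rng.randrange(2 * max_lag + 1, n - 2 * max_lag)
        shifted = photons[s:] + photons[:s]
        null = max(abs(r) for r in lagged_corr(force, shifted, max_lag).values())
        hits += null >= obs
    return obs, (hits + 1) / (n_perm + 1)


# Synthetic demo: photons copy force with a 3-sample delay
rng = random.Random(1)
force = [rng.gauss(0.0, 1.0) for _ in range(200)]
photons = [0.0, 0.0, 0.0] + force[:-3]
cc = lagged_corr(force, photons, max_lag=5)
best_lag = max(cc, key=lambda k: abs(cc[k]))
obs, p_value = circular_shift_pvalue(force, photons, max_lag=5)
```

Lags are in samples; multiply by the sampling interval (e.g., 0.01 s at 100 Hz) to report them in seconds.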

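B. Event-level "correlated / not correlated" calls

For each change_time_s = t_i:
1. Set a window Δ (e.g., 2 s) and check whether the window contains:
  - an HMM state transition with posterior > 0.9
  - or a significant photon jump (|Δphotons| > 3 × local noise SD)
2. Mark the event as a candidate "correlated" (Yes) and record the evidence
3. Decide the final Yes/No with random controls/permutation tests (false-positive rate, Fisher or permutation test, q < 0.05)
4. Optional: check directional consistency (e.g., whether a force increase more often coincides with a signal/state increase) using logistic regression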
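
Steps 1 and 2 of this rule (the candidate call) can be sketched as follows. This is a minimal illustration, not a finished implementation: `call_events` is a hypothetical function, HMM transitions are assumed to be available as (time, posterior) pairs, and the "local noise SD" is estimated here from the photon samples in the window before the event, which is only one possible choice. The permutation test and the direction check (steps 3 and 4) are not covered.

```python
import statistics


def call_events(change_times, hmm_transitions, photon_t, photons,
                window_s=2.0, post_min=0.9, k_sd=3.0):
    """Return one candidate call per force change time.

    hmm_transitions: list of (time_s, posterior) pairs (assumed format).
    """
    calls = []
    for t0 in change_times:
        # Evidence 1: HMM transition inside the window with a high posterior
        posts = [p for (t, p) in hmm_transitions
                 if abs(t - t0) <= window_s and p > post_min]
        # Evidence 2: photon level change across t0 relative to pre-event noise
        before = [y for t, y in zip(photon_t, photons) if t0 - window_s <= t < t0]
        after = [y for t, y in zip(photon_t, photons) if t0 <= t <= t0 + window_s]
        jump = False
        if len(before) >= 2 and len(after) >= 2:
            noise = statistics.stdev(before)
            delta = abs(statistics.fmean(after) - statistics.fmean(before))
            jump = noise > 0 and delta > k_sd * noise
        evidence = [name for name, ok in (("hmm", bool(posts)), ("photon_jump", jump)) if ok]
        calls.append({
            "change_time_s": t0,
            "candidate": bool(evidence),
            "evidence": "+".join(evidence) or "none",
            "max_posterior": max(posts) if posts else None,
        })
    return calls


# Synthetic demo: photon level steps from ~10 to ~20 at t = 5 s
photon_t = [i * 0.1 for i in range(100)]
photons = [(10.0 if i < 50 else 20.0) + (0.1 if i % 2 else -0.1) for i in range(100)]
demo_calls = call_events([5.0, 8.0], [(5.3, 0.95), (8.1, 0.5)], photon_t, photons)
```

Each returned dict maps directly onto a row of the event-level table (change time, candidate call, evidence type, maximum posterior).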

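C. Position (x-coordinate) correlation (if position information is available)

Compute the correlation rate, dwell time, and event duty cycle per position bin
Hotspot analysis: test whether the correlation rate at specific positions is significantly higher than in other bins or than expected by chance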

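V. Statistical model extensions (optional)

Point-process/hazard models (Cox or Poisson regression) with force, Δforce, direction, time, position, etc. as covariates
Add molecule ID/experiment day as random effects to improve robustness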

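VI. Biological interpretation

1. If a strong correlation is observed:

External force promotes local DNA melting or state switching and may alter protein assembly or dwell stability
If the force range differs from the one used in the paper, this is either a new finding or points to a difference in experimental geometry

2. If no correlation is observed:

Supports the main "multimerization-driven" model: LT multimerization first binds/invades ssDNA, followed by ATP-driven unwinding
Consistent with the paper's conclusion: at 10 pN tension there is no necessary link between force and photon signal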

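VII. Checklist for calling change_time_s events

Set the window Δ (e.g., 2 s) and thresholds (HMM posterior > 0.9, |Δphotons| > 3σ)
For each t_i, look for an HMM transition or photon jump; record candidate Yes/No, evidence type, and maximum posterior
Run a permutation test over all events to assess significance
Optional: test directional consistency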

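VIII. Caveats and common pitfalls

Photobleaching correction is mandatory
Apply FDR correction for multiple comparisons
Time series are strongly autocorrelated, so prefer cross-correlation plus permutation tests
Estimate and correct the synchronization offset once, globally
Run a sensitivity analysis on the window Δ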
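
For the FDR correction mentioned above, the Benjamini–Hochberg procedure is a common choice (the text does not name a specific method, so this is an assumption). A minimal standard-library sketch that turns per-event p-values into adjusted q-values:

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


q_demo = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [qv < 0.05 for qv in q_demo]
```

Events with q < 0.05 would then receive the final "Yes" in the event-level table.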

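IX. Suggested final deliverables

Cross-correlation plot (force vs photons/HMM state)
Event-triggered average (ETA) plot with confidence intervals
Event-level table for change_time_s: is_correlated, evidence, p_value, lag
Position hotspot map (if available)
Written conclusion: whether the data support the "multimerization-driven / weakly force-dependent" model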

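X. Key points from the paper to cite (for report writing)

Melting occurs without ATP hydrolysis, and RAD51/RPA70 can co-localize
Multimerization, not helicase activity, is the key driver
LT and RAD51 co-localize both at 10 pN tension and without tension, so tension is not required
The degree of multimerization scales with the strength/speed of the melting signature (trimers are slow; double hexamers are fast and stable)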

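Appendix: runnable script support

To automate this, the rules above can be written as a Python/R script that reads force.csv, photons.csv, and HMM_state.csv and outputs the correlation plots and the event-call table in one step.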

Detect jump with ruptures on Data_Vero_Kymographs force data


# ------------------------------------------------------------
# Requirements:
# pip install pandas numpy ruptures openpyxl
# ------------------------------------------------------------

import pandas as pd
import numpy as np
import ruptures as rpt
import os

########################
# 1. User I/O settings #
########################

infile = "./ForceHFCorrectedForce2x_p853_p502_b3_1_downsampled1000_HF_HF.csv"
outfile = "force_steps.xlsx"

MIN_STEP = 0.4          # pN threshold to report
FALLBACK_FS = 78.1      # Hz, if no time column
PENALTY = 5             # ruptures sensitivity

# post-processing params:
MERGE_WINDOW = 0.10     # sec: merge same-direction steps < this apart
CANCEL_WINDOW = 0.20    # sec: remove opposite-direction spike pairs < this apart

#####################################
# 2. Load data and pick time/force  #
#####################################

df = pd.read_csv(infile)

# Guess force column
force_col_candidates = [
        "Corrected Force 2x",
        "Corrected Force",
        "Force",
        "Force HF",
        "force",
        "force_pN",
]
force_col = None
for c in force_col_candidates:
        if c in df.columns:
                force_col = c
                break
if force_col is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) == 0:
                raise ValueError("No numeric force column found in CSV.")
        force_col = numeric_cols[-1]

# Guess time column
time_col_candidates = [
        "Time",
        "time",
        "t",
        "timestamp",
        "Timestamp (s)",
        "Seconds",
]
time_col = None
for c in time_col_candidates:
        if c in df.columns:
                time_col = c
                break
if time_col is None:
        n = len(df)
        dt = 1.0 / FALLBACK_FS
        df["__time_s"] = np.arange(n) * dt
        time_col = "__time_s"

time = df[time_col].to_numpy()
force = df[force_col].to_numpy()

###########################################
# 3. Changepoint detection (ruptures PELT)#
###########################################

algo = rpt.Pelt(model="l2").fit(force)
bkpts = algo.predict(pen=PENALTY)
# bkpts are exclusive segment end indices (0-based); the last one equals len(force)

seg_starts = [0] + [b for b in bkpts[:-1]]
seg_ends   = [b for b in bkpts]  # exclusive ends, used directly as force[start:end]

segments = []
for start, end in zip(seg_starts, seg_ends):
        seg_force = force[start:end]
        seg_time = time[start:end]
        mean_force = np.mean(seg_force)
        mid_idx = start + (len(seg_force) // 2)
        mid_time = time[mid_idx]
        segments.append(
                {
                        "start_idx": start,
                        "end_idx": end - 1,
                        "mean_force": mean_force,
                        "mid_time": mid_time,
                }
        )

# Build raw step list
raw_changes = []
for i in range(len(segments) - 1):
        f_before = segments[i]["mean_force"]
        f_after = segments[i + 1]["mean_force"]
        delta_f = f_after - f_before
        boundary_idx = segments[i]["end_idx"]
        change_time = time[boundary_idx]
        raw_changes.append(
                {
                        "change_time_s": change_time,
                        "force_before_pN": f_before,
                        "force_after_pN": f_after,
                        "delta_force_pN": delta_f,
                        "direction": "up" if delta_f > 0 else "down",
                }
        )

changes_df = pd.DataFrame(raw_changes)

###########################################################
# 4. POST-PROCESSING STEPS
#
# 4a. Merge consecutive steps with SAME direction if they
#     happen within MERGE_WINDOW seconds.
#
#     Example: down at t=58.63s then down again at t=58.69s
###########################################################

def merge_close_same_direction(df_steps, merge_window):
        if df_steps.empty:
                return df_steps.copy()

        merged = []
        cur = df_steps.iloc[0].to_dict()

        for idx in range(1, len(df_steps)):
                row = df_steps.iloc[idx].to_dict()

                same_dir = (row["direction"] == cur["direction"])
                close_in_time = (row["change_time_s"] - cur["change_time_s"]) <= merge_window

                if same_dir and close_in_time:
                        # merge cur and row into a single bigger jump:
                        # new before: cur.force_before_pN
                        # new after: row.force_after_pN
                        # new delta: after - before
                        merged_force_before = cur["force_before_pN"]
                        merged_force_after  = row["force_after_pN"]
                        merged_delta        = merged_force_after - merged_force_before

                        cur["force_after_pN"] = merged_force_after
                        cur["delta_force_pN"] = merged_delta
                        cur["direction"] = "up" if merged_delta > 0 else "down"
                        # keep cur["change_time_s"] as first event time
                else:
                        merged.append(cur)
                        cur = row

        merged.append(cur)
        return pd.DataFrame(merged)

merged_df = merge_close_same_direction(changes_df, MERGE_WINDOW)

###########################################################
# 4b. Remove spike pairs:
#     If we have two consecutive steps with OPPOSITE direction
#     within CANCEL_WINDOW sec, drop both (treat as noise).
#
#     Example: up at 284.302s, down at 284.430s (~0.13s)
###########################################################

def remove_opposite_spikes(df_steps, cancel_window):
        if len(df_steps) <= 1:
                return df_steps.copy()

        to_drop = set()

        for i in range(len(df_steps) - 1):
                t1 = df_steps.iloc[i]["change_time_s"]
                t2 = df_steps.iloc[i+1]["change_time_s"]
                d1 = df_steps.iloc[i]["direction"]
                d2 = df_steps.iloc[i+1]["direction"]

                if d1 != d2 and (t2 - t1) <= cancel_window:
                        # mark both i and i+1 to be dropped
                        to_drop.add(i)
                        to_drop.add(i+1)

        keep_rows = [i for i in range(len(df_steps)) if i not in to_drop]
        return df_steps.iloc[keep_rows].reset_index(drop=True)

clean_df = remove_opposite_spikes(merged_df, CANCEL_WINDOW)

###########################################################
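# 5. Apply amplitude filter |Δforce| >= MIN_STEP (0.4 pN as set above;
#    the sheet name "final_ge_1pN" below still refers to an earlier 1 pN cutoff)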
###########################################################

filtered_df = clean_df[np.abs(clean_df["delta_force_pN"]) >= MIN_STEP].copy()

##########################################
# 6. Save all versions to Excel
##########################################

# =========================
# Helpers
# =========================

def round_df(d: pd.DataFrame, nd=3) -> pd.DataFrame:
        d = d.copy()
        float_cols = d.select_dtypes(include=[np.floating]).columns
        if len(float_cols):
                d[float_cols] = d[float_cols].round(nd)
        return d

with pd.ExcelWriter(outfile) as writer:
        changes_df.to_excel(writer, sheet_name="raw_steps", index=False)
        merged_df.to_excel(writer, sheet_name="merged_same_dir", index=False)
        clean_df.to_excel(writer, sheet_name="after_spike_removal", index=False)
        round_df(filtered_df).to_excel(writer, sheet_name="final_ge_1pN", index=False)

print(f"Done. Wrote: {outfile}")
print("Sheets in Excel:")
print(" - raw_steps              (direct from ruptures)")
print(" - merged_same_dir        (after merging close same-direction jumps)")
print(" - after_spike_removal    (after removing fast opposite blips)")
print(f" - final_ge_1pN           (after |Δforce|≥{MIN_STEP} pN filter)")

Calculate alpha-diversity for Group9/10/11


(a tight drop-in that restricts to Group9/Group10/Group11 and runs stats/plotting for Shannon only)

## ---- Subset to Group9/10/11 and keep factor levels tidy ---------------------
library(dplyr)
library(ggplot2)
library(ggpubr)
library(rstatix)
library(knitr)
library(kableExtra)

shannon_g9_11 <- div.df2 %>%
filter(Group %in% c("Group9", "Group10", "Group11")) %>%
mutate(Group = factor(Group, levels = c("Group9","Group10","Group11"))) %>%
droplevels()

## Quick table of the data used
knitr::kable(shannon_g9_11[, c("Sample name","Group","Shannon")]) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))

## ---- Summary stats (report in text/table) -----------------------------------
sum_stats <- shannon_g9_11 %>%
group_by(Group) %>%
summarise(n = n(),
                        mean = mean(Shannon, na.rm = TRUE),
                        sd = sd(Shannon, na.rm = TRUE),
                        median = median(Shannon, na.rm = TRUE),
                        IQR = IQR(Shannon, na.rm = TRUE),
                        .groups = "drop")

write.csv(sum_stats, "Shannon_Group9_11_summary.csv", row.names = FALSE)
knitr::kable(sum_stats, digits = 3) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))

## ---- Overall test (3 groups) -------------------------------------------------
# Nonparametric overall test (robust default)
overall_kw <- compare_means(Shannon ~ Group, data = shannon_g9_11, method = "kruskal.test")
knitr::kable(overall_kw, digits = 3) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(overall_kw, "Shannon_Group9_11_overall_Kruskal.csv", row.names = FALSE)

## Optional parametric overall (use if assumptions OK)
overall_anova <- compare_means(Shannon ~ Group, data = shannon_g9_11, method = "anova")
write.csv(overall_anova, "Shannon_Group9_11_overall_ANOVA.csv", row.names = FALSE)

## Assumption checks (optional; helps decide ANOVA vs KW)
shapiro_res <- shannon_g9_11 %>% group_by(Group) %>% shapiro_test(Shannon)
levene_res  <- shannon_g9_11 %>% levene_test(Shannon ~ Group)
write.csv(shapiro_res, "Shannon_Group9_11_shapiro.csv", row.names = FALSE)
write.csv(levene_res,  "Shannon_Group9_11_levene.csv",  row.names = FALSE)

## ---- Pairwise tests with BH correction --------------------------------------
# Wilcoxon (nonparametric)
pw_wilcox <- pairwise_wilcox_test(shannon_g9_11, Shannon ~ Group,
                                                                p.adjust.method = "BH", exact = FALSE)
write.csv(pw_wilcox, "Shannon_Group9_11_pairwise_wilcox_BH.csv", row.names = FALSE)

# t-tests (parametric, optional)
pw_t <- pairwise_t_test(shannon_g9_11, Shannon ~ Group, p.adjust.method = "BH")
write.csv(pw_t, "Shannon_Group9_11_pairwise_t_BH.csv", row.names = FALSE)

## ---- Plot: box/jitter with overall & pairwise p-values ----------------------
my_comparisons <- list(c("Group9","Group10"),
                                        c("Group9","Group11"),
                                        c("Group10","Group11"))

p_shannon_g9_11 <- ggplot(shannon_g9_11, aes(x = Group, y = Shannon)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.12, alpha = 0.7) +
labs(y = "Shannon diversity", x = NULL,
        title = "Shannon diversity: Group9 vs Group10 vs Group11") +
theme_bw() +
stat_compare_means(method = "kruskal.test", label = "p.format", label.y.npc = "top") +  # overall
stat_compare_means(comparisons = my_comparisons, method = "wilcox.test",
                                        label = "p.signif", hide.ns = TRUE)

ggsave("Shannon_Group9_10_11_boxplot.pdf", p_shannon_g9_11, width = 5.5, height = 4.2)

#How to report: cite the Kruskal–Wallis p for the overall 3-group comparison and the BH-adjusted Wilcoxon p-values for pairwise contrasts (use the ANOVA/t-test outputs only if normality and homogeneity look acceptable from the Shapiro/Levene tables).
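
To make the "BH-adjusted" p-values in the pairwise tables easier to interpret, here is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment. The function name `bh_adjust` and the three example p-values are illustrative assumptions, not results from the Group9/10/11 data; the R calls above remain the actual analysis.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for three pairwise contrasts (illustration only)
raw_p = [0.01, 0.04, 0.03]
adj_p = bh_adjust(raw_p)
print(adj_p)
```

Each raw p-value is scaled by (number of tests / rank), and the running minimum keeps a smaller raw p-value from ending up with a larger adjusted value than its neighbour.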

Automated Step Detection in Force Spectroscopy Data


The script loads raw force-vs-time data from a CSV file, automatically identifies step-like changes in the force signal, and exports a cleaned list of meaningful force transitions to Excel. It first detects changepoints in the force trace using a PELT-based algorithm from the ruptures library, which segments the signal into plateaus and finds the boundaries between them. Each boundary is treated as a candidate step, with its timestamp, the average force before and after, and the delta force (direction up or down). After detection, the script applies two biologically motivated cleanup rules: (1) it merges consecutive steps in the same direction that occur within a short time window, treating them as a single physical event instead of multiple tiny jumps; and (2) it removes fast “spike” pairs where the force briefly jumps up and then down (or down then up) within a very short interval, which are likely noise rather than true mechanical steps. Finally, it filters for steps with an absolute amplitude ≥ 1 pN and writes several worksheets to an Excel file, including the raw detection, intermediate cleanup stages, and the final curated step list. This makes it easy to batch-process kymograph force traces and consistently quantify discrete force changes.
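
The two cleanup rules and the final amplitude filter described above can be tried on a hand-written step list before running the full script. The following is a simplified sketch in plain Python: the `(time, delta)` tuples, the helper names and all numbers are made up for illustration, and it does not use ruptures or pandas.

```python
MERGE_WINDOW = 0.10   # s, same meaning as in the script below
CANCEL_WINDOW = 0.20  # s
MIN_STEP = 1.0        # pN


def direction(delta):
    return "up" if delta > 0 else "down"


def merge_same_direction(steps, window):
    """Fuse a step into the previous one if both point the same way and lie within `window` s."""
    merged = []
    for t, delta in steps:
        if merged:
            t0, d0 = merged[-1]
            if direction(d0) == direction(delta) and (t - t0) <= window:
                merged[-1] = (t0, d0 + delta)  # keep the time of the first event
                continue
        merged.append((t, delta))
    return merged


def drop_spike_pairs(steps, window):
    """Remove consecutive opposite-direction steps that are at most `window` s apart."""
    drop = set()
    for i in range(len(steps) - 1):
        (t1, d1), (t2, d2) = steps[i], steps[i + 1]
        if direction(d1) != direction(d2) and (t2 - t1) <= window:
            drop.update((i, i + 1))
    return [s for i, s in enumerate(steps) if i not in drop]


# (time in s, delta force in pN); toy values
steps = [
    (10.00, 2.0),      # isolated up-step
    (58.63, -0.5),     # two small down-steps 0.06 s apart
    (58.69, -0.75),
    (284.302, 1.5),    # up then down about 0.13 s apart
    (284.430, -1.5),
]
merged = merge_same_direction(steps, MERGE_WINDOW)
cleaned = drop_spike_pairs(merged, CANCEL_WINDOW)
final = [s for s in cleaned if abs(s[1]) >= MIN_STEP]
print(final)
```

In this toy list the two small down-steps would each fail the 1 pN filter on their own, but survive once merged into a single event, while the up/down pair is discarded as a spike.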

# ------------------------------------------------------------
# Requirements:
# pip install pandas numpy ruptures openpyxl
# ------------------------------------------------------------

import pandas as pd
import numpy as np
import ruptures as rpt
import os

########################
# 1. User I/O settings #
########################

infile = "./ForceHFCorrectedForce2x_p853_p502_b3_1_downsampled1000_HF_HF.csv"
outfile = "force_steps.xlsx"

MIN_STEP = 1.0          # pN threshold to report
FALLBACK_FS = 78.1      # Hz, if no time column
PENALTY = 5             # ruptures sensitivity

# post-processing params:
MERGE_WINDOW = 0.10     # sec: merge same-direction steps < this apart
CANCEL_WINDOW = 0.20    # sec: remove opposite-direction spike pairs < this apart

#####################################
# 2. Load data and pick time/force  #
#####################################

df = pd.read_csv(infile)

# Guess force column
force_col_candidates = [
    "Corrected Force 2x",
    "Corrected Force",
    "Force",
    "Force HF",
    "force",
    "force_pN",
]
force_col = None
for c in force_col_candidates:
    if c in df.columns:
        force_col = c
        break
if force_col is None:
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) == 0:
        raise ValueError("No numeric force column found in CSV.")
    force_col = numeric_cols[-1]

# Guess time column
time_col_candidates = [
    "Time",
    "time",
    "t",
    "timestamp",
    "Timestamp (s)",
    "Seconds",
]
time_col = None
for c in time_col_candidates:
    if c in df.columns:
        time_col = c
        break
if time_col is None:
    n = len(df)
    dt = 1.0 / FALLBACK_FS
    df["__time_s"] = np.arange(n) * dt
    time_col = "__time_s"

time = df[time_col].to_numpy()
force = df[force_col].to_numpy()

###########################################
# 3. Changepoint detection (ruptures PELT)#
###########################################

algo = rpt.Pelt(model="l2").fit(force)
bkpts = algo.predict(pen=PENALTY)
# bkpts are the exclusive end indices of the segments; the last one equals len(force)

seg_starts = [0] + [b for b in bkpts[:-1]]
seg_ends   = [b for b in bkpts]  # exclusive ends, usable directly as Python slice ends

segments = []
for start, end in zip(seg_starts, seg_ends):
    seg_force = force[start:end]
    seg_time = time[start:end]
    mean_force = np.mean(seg_force)
    mid_idx = start + (len(seg_force) // 2)
    mid_time = time[mid_idx]
    segments.append(
        {
            "start_idx": start,
            "end_idx": end - 1,
            "mean_force": mean_force,
            "mid_time": mid_time,
        }
    )

# Build raw step list
raw_changes = []
for i in range(len(segments) - 1):
    f_before = segments[i]["mean_force"]
    f_after = segments[i + 1]["mean_force"]
    delta_f = f_after - f_before
    boundary_idx = segments[i]["end_idx"]
    change_time = time[boundary_idx]
    raw_changes.append(
        {
            "change_time_s": change_time,
            "force_before_pN": f_before,
            "force_after_pN": f_after,
            "delta_force_pN": delta_f,
            "direction": "up" if delta_f > 0 else "down",
        }
    )

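# Fix the column set so the later steps also work when no step is detected
changes_df = pd.DataFrame(
    raw_changes,
    columns=["change_time_s", "force_before_pN", "force_after_pN", "delta_force_pN", "direction"],
)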

###########################################################
# 4. POST-PROCESSING STEPS
#
# 4a. Merge consecutive steps with SAME direction if they
#     happen within MERGE_WINDOW seconds.
#
#     Example: down at t=58.63s then down again at t=58.69s
###########################################################

def merge_close_same_direction(df_steps, merge_window):
    if df_steps.empty:
        return df_steps.copy()

    merged = []
    cur = df_steps.iloc[0].to_dict()

    for idx in range(1, len(df_steps)):
        row = df_steps.iloc[idx].to_dict()

        same_dir = (row["direction"] == cur["direction"])
        close_in_time = (row["change_time_s"] - cur["change_time_s"]) <= merge_window

        if same_dir and close_in_time:
            # merge cur and row into a single bigger jump:
            # new before: cur.force_before_pN
            # new after: row.force_after_pN
            # new delta: after - before
            merged_force_before = cur["force_before_pN"]
            merged_force_after  = row["force_after_pN"]
            merged_delta        = merged_force_after - merged_force_before

            cur["force_after_pN"] = merged_force_after
            cur["delta_force_pN"] = merged_delta
            cur["direction"] = "up" if merged_delta > 0 else "down"
            # keep cur["change_time_s"] as first event time
        else:
            merged.append(cur)
            cur = row

    merged.append(cur)
    return pd.DataFrame(merged)

merged_df = merge_close_same_direction(changes_df, MERGE_WINDOW)

###########################################################
# 4b. Remove spike pairs:
#     If we have two consecutive steps with OPPOSITE direction
#     within CANCEL_WINDOW sec, drop both (treat as noise).
#
#     Example: up at 284.302s, down at 284.430s (~0.13s)
###########################################################

def remove_opposite_spikes(df_steps, cancel_window):
    if len(df_steps) <= 1:
        return df_steps.copy()

    to_drop = set()

    for i in range(len(df_steps) - 1):
        t1 = df_steps.iloc[i]["change_time_s"]
        t2 = df_steps.iloc[i+1]["change_time_s"]
        d1 = df_steps.iloc[i]["direction"]
        d2 = df_steps.iloc[i+1]["direction"]

        if d1 != d2 and (t2 - t1) <= cancel_window:
            # mark both i and i+1 to be dropped
            to_drop.add(i)
            to_drop.add(i+1)

    keep_rows = [i for i in range(len(df_steps)) if i not in to_drop]
    return df_steps.iloc[keep_rows].reset_index(drop=True)

clean_df = remove_opposite_spikes(merged_df, CANCEL_WINDOW)

###########################################################
# 5. Apply amplitude filter |Δforce| >= 1 pN
###########################################################

filtered_df = clean_df[np.abs(clean_df["delta_force_pN"]) >= MIN_STEP].copy()

##########################################
# 6. Save all versions to Excel
##########################################

with pd.ExcelWriter(outfile) as writer:
    changes_df.to_excel(writer, sheet_name="raw_steps", index=False)
    merged_df.to_excel(writer, sheet_name="merged_same_dir", index=False)
    clean_df.to_excel(writer, sheet_name="after_spike_removal", index=False)
    filtered_df.to_excel(writer, sheet_name="final_ge_1pN", index=False)

print(f"Done. Wrote: {outfile}")
print("Sheets in Excel:")
print(" - raw_steps              (direct from ruptures)")
print(" - merged_same_dir        (after merging close same-direction jumps)")
print(" - after_spike_removal    (after removing fast opposite blips)")
print(" - final_ge_1pN           (after |Δforce|≥1 pN filter)")

Processing Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606


Requirements

#perform PCA analysis, Venn diagram analysis, as well as KEGG and GO annotations. We would also appreciate it if you could include CPM calculations for this dataset (gene_cpm_counts.xlsx). For comparative analysis, we are particularly interested in identifying DEGs between WT and ΔIJ across the different treatments and time points.
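
As a reminder of what the requested CPM values are, here is a minimal sketch: each raw count is divided by the total counts of its sample (library size) and multiplied by one million. The function name `cpm` and the toy counts are assumptions for illustration; this is plain CPM without any normalization factors, so values from a package that rescales library sizes may differ slightly. It also assumes every sample has a non-zero library size.

```python
def cpm(counts):
    """counts: {gene: [raw count per sample]} -> {gene: [CPM per sample]}."""
    n_samples = len(next(iter(counts.values())))
    # Library size = sum of raw counts in each sample (column sums)
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


# Toy matrix: two genes, two samples (made-up numbers)
toy_counts = {"geneA": [10, 20], "geneB": [30, 80]}
toy_cpm = cpm(toy_counts)
print(toy_cpm)
```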

I have already performed the six comparisons, using WT as the reference:

    ΔIJ-17 vs WT-17 – no treatment
    ΔIJ-24 vs WT-24 – no treatment
    preΔIJ-17 vs preWT-17 – Treatment A
    preΔIJ-24 vs preWT-24 – Treatment A
    0_5ΔIJ-17 vs 0_5WT-17 – Treatment B
    0_5ΔIJ-24 vs 0_5WT-24 – Treatment B

To gain a deeper understanding of how the ∆adeIJ mutation influences response dynamics over time and under different stimuli, would you also be interested in the following additional comparisons?

Within-strain treatment responses
(to explore how each strain responds to treatments):

WT:

    preWT-17 vs WT-17 → response to Treatment A at 17 h
    preWT-24 vs WT-24 → response to Treatment A at 24 h
    0_5WT-17 vs WT-17 → response to Treatment B at 17 h
    0_5WT-24 vs WT-24 → response to Treatment B at 24 h

∆adeIJ:

    preΔIJ-17 vs ΔIJ-17 → response to Treatment A at 17 h
    preΔIJ-24 vs ΔIJ-24 → response to Treatment A at 24 h
    0_5ΔIJ-17 vs ΔIJ-17 → response to Treatment B at 17 h
    0_5ΔIJ-24 vs ΔIJ-24 → response to Treatment B at 24 h

Time-course comparisons
(to investigate time-dependent changes within each condition):

    WT-24 vs WT-17
    ΔIJ-24 vs ΔIJ-17
    preWT-24 vs preWT-17
    preΔIJ-24 vs preΔIJ-17
    0_5WT-24 vs 0_5WT-17
    0_5ΔIJ-24 vs 0_5ΔIJ-17

I reviewed the datasets again and noticed that there are no ∆adeAB samples included. Should we try to obtain ∆adeAB data from other datasets? However, I’m a bit concerned that batch effects might pose a challenge when integrating data from different datasets.

> It is possible to analyze DEGs across the time points (17 and 24 h) and stimuli (Treatment A, Treatment B, and no treatment) within both the ∆adeIJ mutant and the WT strain. Our phenotypic characterization of these strains across the two time points and stimuli shows significant differences, whereas the other mutant, ∆adeAB (similar function to AdeIJ), shows no difference compared to WT; we are therefore wondering what has happened in ∆adeIJ.

deltaIJ_17, WT_17 – ΔadeIJ and wildtype strains w/o exposure at 17 h (No treatment)
deltaIJ_24, WT_24 – ΔadeIJ and wildtype strains w/o exposure at 24 h (No treatment)
pre_deltaIJ_17, pre_WT_17 – ΔadeIJ and wildtype strains with 1 exposure at 17 h (Treatment A)
pre_deltaIJ_24, pre_WT_24 – ΔadeIJ and wildtype strains with 1 exposure at 24 h (Treatment A)
0_5_deltaIJ_17, 0_5_WT_17 – ΔadeIJ and wildtype strains with 2 exposures at 17 h (Treatment B)
0_5_deltaIJ_24, 0_5_WT_24 – ΔadeIJ and wildtype strains with 2 exposures at 24 h (Treatment B)
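
The three families of comparisons listed above (mutant vs WT, treated vs untreated, 24 h vs 17 h) follow directly from the group names, so they can be enumerated instead of typed by hand. This is a small sketch; the variable and function names are hypothetical, and only the group labels (`WT_17`, `pre_deltaIJ_24`, ...) are taken from the list above.

```python
from itertools import product

TREATMENTS = {"": "no treatment", "pre_": "Treatment A", "0_5_": "Treatment B"}
STRAINS = ["WT", "deltaIJ"]
TIMES = [17, 24]


def group(prefix, strain, time_h):
    return f"{prefix}{strain}_{time_h}"


# 1) Mutant vs WT within each treatment and time point
strain_contrasts = [
    (group(p, "deltaIJ", t), group(p, "WT", t)) for p, t in product(TREATMENTS, TIMES)
]
# 2) Treated vs untreated within each strain and time point
treatment_contrasts = [
    (group(p, s, t), group("", s, t))
    for s, p, t in product(STRAINS, ["pre_", "0_5_"], TIMES)
]
# 3) 24 h vs 17 h within each treatment and strain
time_contrasts = [
    (group(p, s, 24), group(p, s, 17)) for p, s in product(TREATMENTS, STRAINS)
]

for test, ref in strain_contrasts + treatment_contrasts + time_contrasts:
    print(f"{test} vs {ref}")
```

Each tuple is (test group, reference group), matching the "X vs Y" direction used in the lists above.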

Preparing raw data

mkdir raw_data; cd raw_data
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_1.fq.gz WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-1/WT-17-1_2.fq.gz WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_1.fq.gz WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-2/WT-17-2_2.fq.gz WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_1.fq.gz WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-17-3/WT-17-3_2.fq.gz WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_1.fq.gz WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-1/WT-24-1_2.fq.gz WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_1.fq.gz WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-2/WT-24-2_2.fq.gz WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_1.fq.gz WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT-24-3/WT-24-3_2.fq.gz WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_1.fq.gz deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-1/ΔIJ-17-1_2.fq.gz deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_1.fq.gz deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-2/ΔIJ-17-2_2.fq.gz deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_1.fq.gz deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-17-3/ΔIJ-17-3_2.fq.gz deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_1.fq.gz deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-1/ΔIJ-24-1_2.fq.gz deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_1.fq.gz deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-2/ΔIJ-24-2_2.fq.gz deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_1.fq.gz deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/ΔIJ-24-3/ΔIJ-24-3_2.fq.gz deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_1.fq.gz pre_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-1/preWT-17-1_2.fq.gz pre_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_1.fq.gz pre_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-2/preWT-17-2_2.fq.gz pre_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_1.fq.gz pre_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-17-3/preWT-17-3_2.fq.gz pre_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_1.fq.gz pre_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-1/preWT-24-1_2.fq.gz pre_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_1.fq.gz pre_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-2/preWT-24-2_2.fq.gz pre_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_1.fq.gz pre_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preWT-24-3/preWT-24-3_2.fq.gz pre_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_1.fq.gz pre_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-1/preΔIJ-17-1_2.fq.gz pre_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_1.fq.gz pre_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-2/preΔIJ-17-2_2.fq.gz pre_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_1.fq.gz pre_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-17-3/preΔIJ-17-3_2.fq.gz pre_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_1.fq.gz pre_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-1/preΔIJ-24-1_2.fq.gz pre_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_1.fq.gz pre_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-2/preΔIJ-24-2_2.fq.gz pre_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_1.fq.gz pre_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/preΔIJ-24-3/preΔIJ-24-3_2.fq.gz pre_deltaIJ-24-r3_R2.fq.gz

ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_1.fq.gz 0_5_WT-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-1/WT0_5-17-1_2.fq.gz 0_5_WT-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_1.fq.gz 0_5_WT-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-2/WT0_5-17-2_2.fq.gz 0_5_WT-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_1.fq.gz 0_5_WT-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-17-3/WT0_5-17-3_2.fq.gz 0_5_WT-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_1.fq.gz 0_5_WT-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-1/WT0_5-24-1_2.fq.gz 0_5_WT-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_1.fq.gz 0_5_WT-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-2/WT0_5-24-2_2.fq.gz 0_5_WT-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_1.fq.gz 0_5_WT-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/WT0_5-24-3/WT0_5-24-3_2.fq.gz 0_5_WT-24-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_1.fq.gz 0_5_deltaIJ-17-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-1/0_5ΔIJ-17-1_2.fq.gz 0_5_deltaIJ-17-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_1.fq.gz 0_5_deltaIJ-17-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-2/0_5ΔIJ-17-2_2.fq.gz 0_5_deltaIJ-17-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_1.fq.gz 0_5_deltaIJ-17-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-17-3/0_5ΔIJ-17-3_2.fq.gz 0_5_deltaIJ-17-r3_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_1.fq.gz 0_5_deltaIJ-24-r1_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-1/0_5ΔIJ-24-1_2.fq.gz 0_5_deltaIJ-24-r1_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_1.fq.gz 0_5_deltaIJ-24-r2_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-2/0_5ΔIJ-24-2_2.fq.gz 0_5_deltaIJ-24-r2_R2.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_1.fq.gz 0_5_deltaIJ-24-r3_R1.fq.gz
ln -s ../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData/0_5ΔIJ-24-3/0_5ΔIJ-24-3_2.fq.gz 0_5_deltaIJ-24-r3_R2.fq.gz
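
The 72 `ln -s` commands above follow one regular pattern, so they can also be generated. The sketch below only builds the command strings and does not touch the file system; `PREFIX_MAP` and `link_commands` are hypothetical names, and the mapping assumes three replicates and both time points exist for every group, as in the list above.

```python
RAW_DIR = "../RSMR00204/X101SC25062155-Z01/X101SC25062155-Z01-J001/01.RawData"

# Raw directory prefix -> ASCII prefix used for the link names.
# Note the irregular raw naming: "WT0_5" but "0_5ΔIJ".
PREFIX_MAP = {
    "WT": "WT",
    "ΔIJ": "deltaIJ",
    "preWT": "pre_WT",
    "preΔIJ": "pre_deltaIJ",
    "WT0_5": "0_5_WT",
    "0_5ΔIJ": "0_5_deltaIJ",
}


def link_commands():
    """Return the ln -s commands in the same order as the hand-written list."""
    cmds = []
    for raw, new in PREFIX_MAP.items():
        for time_h in (17, 24):
            for rep in (1, 2, 3):
                for read in (1, 2):
                    name = f"{raw}-{time_h}-{rep}"
                    src = f"{RAW_DIR}/{name}/{name}_{read}.fq.gz"
                    dst = f"{new}-{time_h}-r{rep}_R{read}.fq.gz"
                    cmds.append(f"ln -s {src} {dst}")
    return cmds


commands = link_commands()
print("\n".join(commands))
```

The printed output can be reviewed and then piped into a shell from inside raw_data.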

(Done) Downloading CP059040.fasta and CP059040.gff from GenBank

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT-17-r1 WT-17-r2 WT-17-r3 WT-24-r1 WT-24-r2 WT-24-r3 deltaIJ-17-r1 deltaIJ-17-r2 deltaIJ-17-r3 deltaIJ-24-r1 deltaIJ-24-r2 deltaIJ-24-r3  pre_WT-17-r1 pre_WT-17-r2 pre_WT-17-r3 pre_WT-24-r1 pre_WT-24-r2 pre_WT-24-r3 pre_deltaIJ-17-r1 pre_deltaIJ-17-r2 pre_deltaIJ-17-r3 pre_deltaIJ-24-r1 pre_deltaIJ-24-r2 pre_deltaIJ-24-r3  0_5_WT-17-r1 0_5_WT-17-r2 0_5_WT-17-r3 0_5_WT-24-r1 0_5_WT-24-r2 0_5_WT-24-r3 0_5_deltaIJ-17-r1 0_5_deltaIJ-17-r2 0_5_deltaIJ-17-r3 0_5_deltaIJ-24-r1 0_5_deltaIJ-24-r2 0_5_deltaIJ-24-r3; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_17_r1,WT-17-r1_R1.fq.gz,WT-17-r1_R2.fq.gz,auto
WT_17_r2,WT-17-r2_R1.fq.gz,WT-17-r2_R2.fq.gz,auto
WT_17_r3,WT-17-r3_R1.fq.gz,WT-17-r3_R2.fq.gz,auto
WT_24_r1,WT-24-r1_R1.fq.gz,WT-24-r1_R2.fq.gz,auto
WT_24_r2,WT-24-r2_R1.fq.gz,WT-24-r2_R2.fq.gz,auto
WT_24_r3,WT-24-r3_R1.fq.gz,WT-24-r3_R2.fq.gz,auto
deltaIJ_17_r1,deltaIJ-17-r1_R1.fq.gz,deltaIJ-17-r1_R2.fq.gz,auto
deltaIJ_17_r2,deltaIJ-17-r2_R1.fq.gz,deltaIJ-17-r2_R2.fq.gz,auto
deltaIJ_17_r3,deltaIJ-17-r3_R1.fq.gz,deltaIJ-17-r3_R2.fq.gz,auto
deltaIJ_24_r1,deltaIJ-24-r1_R1.fq.gz,deltaIJ-24-r1_R2.fq.gz,auto
deltaIJ_24_r2,deltaIJ-24-r2_R1.fq.gz,deltaIJ-24-r2_R2.fq.gz,auto
deltaIJ_24_r3,deltaIJ-24-r3_R1.fq.gz,deltaIJ-24-r3_R2.fq.gz,auto
pre_WT_17_r1,pre_WT-17-r1_R1.fq.gz,pre_WT-17-r1_R2.fq.gz,auto
pre_WT_17_r2,pre_WT-17-r2_R1.fq.gz,pre_WT-17-r2_R2.fq.gz,auto
pre_WT_17_r3,pre_WT-17-r3_R1.fq.gz,pre_WT-17-r3_R2.fq.gz,auto
pre_WT_24_r1,pre_WT-24-r1_R1.fq.gz,pre_WT-24-r1_R2.fq.gz,auto
pre_WT_24_r2,pre_WT-24-r2_R1.fq.gz,pre_WT-24-r2_R2.fq.gz,auto
pre_WT_24_r3,pre_WT-24-r3_R1.fq.gz,pre_WT-24-r3_R2.fq.gz,auto
pre_deltaIJ_17_r1,pre_deltaIJ-17-r1_R1.fq.gz,pre_deltaIJ-17-r1_R2.fq.gz,auto
pre_deltaIJ_17_r2,pre_deltaIJ-17-r2_R1.fq.gz,pre_deltaIJ-17-r2_R2.fq.gz,auto
pre_deltaIJ_17_r3,pre_deltaIJ-17-r3_R1.fq.gz,pre_deltaIJ-17-r3_R2.fq.gz,auto
pre_deltaIJ_24_r1,pre_deltaIJ-24-r1_R1.fq.gz,pre_deltaIJ-24-r1_R2.fq.gz,auto
pre_deltaIJ_24_r2,pre_deltaIJ-24-r2_R1.fq.gz,pre_deltaIJ-24-r2_R2.fq.gz,auto
pre_deltaIJ_24_r3,pre_deltaIJ-24-r3_R1.fq.gz,pre_deltaIJ-24-r3_R2.fq.gz,auto
0_5_WT_17_r1,0_5_WT-17-r1_R1.fq.gz,0_5_WT-17-r1_R2.fq.gz,auto
0_5_WT_17_r2,0_5_WT-17-r2_R1.fq.gz,0_5_WT-17-r2_R2.fq.gz,auto
0_5_WT_17_r3,0_5_WT-17-r3_R1.fq.gz,0_5_WT-17-r3_R2.fq.gz,auto
0_5_WT_24_r1,0_5_WT-24-r1_R1.fq.gz,0_5_WT-24-r1_R2.fq.gz,auto
0_5_WT_24_r2,0_5_WT-24-r2_R1.fq.gz,0_5_WT-24-r2_R2.fq.gz,auto
0_5_WT_24_r3,0_5_WT-24-r3_R1.fq.gz,0_5_WT-24-r3_R2.fq.gz,auto
0_5_deltaIJ_17_r1,0_5_deltaIJ-17-r1_R1.fq.gz,0_5_deltaIJ-17-r1_R2.fq.gz,auto
0_5_deltaIJ_17_r2,0_5_deltaIJ-17-r2_R1.fq.gz,0_5_deltaIJ-17-r2_R2.fq.gz,auto
0_5_deltaIJ_17_r3,0_5_deltaIJ-17-r3_R1.fq.gz,0_5_deltaIJ-17-r3_R2.fq.gz,auto
0_5_deltaIJ_24_r1,0_5_deltaIJ-24-r1_R1.fq.gz,0_5_deltaIJ-24-r1_R2.fq.gz,auto
0_5_deltaIJ_24_r2,0_5_deltaIJ-24-r2_R1.fq.gz,0_5_deltaIJ-24-r2_R2.fq.gz,auto
0_5_deltaIJ_24_r3,0_5_deltaIJ-24-r3_R1.fq.gz,0_5_deltaIJ-24-r3_R2.fq.gz,auto
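
The samplesheet above can be generated from the same naming scheme rather than typed by hand. A minimal sketch, assuming every group has replicates r1 to r3 at 17 h and 24 h; `PREFIXES` and `samplesheet_rows` are hypothetical names introduced here.

```python
import csv
import io

PREFIXES = ["WT", "deltaIJ", "pre_WT", "pre_deltaIJ", "0_5_WT", "0_5_deltaIJ"]


def samplesheet_rows():
    """Header plus one row per sample: ID with underscores, file names with hyphens."""
    rows = [["sample", "fastq_1", "fastq_2", "strandedness"]]
    for prefix in PREFIXES:
        for time_h in (17, 24):
            for rep in (1, 2, 3):
                fq = f"{prefix}-{time_h}-r{rep}"
                rows.append(
                    [f"{prefix}_{time_h}_r{rep}", f"{fq}_R1.fq.gz", f"{fq}_R2.fq.gz", "auto"]
                )
    return rows


buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(samplesheet_rows())
samplesheet_text = buf.getvalue()
print(samplesheet_text, end="")
```

Redirecting the printed text to samplesheet.csv reproduces the table above.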

nextflow run

#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

#Default: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
#(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_1 (CDS --> exon in CP059040.gff) --
#Checking the record (see below) in results/genome/CP059040.gtf
#In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
#--featurecounts_feature_type 'transcript' returns only the tRNA results,
#because the tRNA records have "transcript" and "exon" features, whereas the gene records have "transcript" and "CDS". Fix: replace CDS with exon.

grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  signal recognition particle sRNA (small type), transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!

grep -P "\tCDS\t" CP059040.gff | wc -l  #3701
sed 's/\tCDS\t/\texon\t/g' CP059040.gff > CP059040_m.gff
grep -P "\texon\t" CP059040_m.gff | sort | wc -l  #3797
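
The sed call above rewrites any tab-delimited "CDS" field. As a more conservative variant, the sketch below changes only the feature-type column (column 3) of a GFF line and leaves comments and attributes untouched. The function name `cds_to_exon` and the toy records are assumptions for illustration; they are not lines from CP059040.gff.

```python
def cds_to_exon(gff_lines):
    """Yield GFF lines with the feature type (column 3) changed from CDS to exon."""
    for line in gff_lines:
        if line.startswith("#"):
            yield line  # header and comment lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "CDS":
            fields[2] = "exon"
        yield "\t".join(fields) + "\n"


# Toy records (made up)
toy_gff = [
    "##gff-version 3\n",
    "chr1\tGenbank\tgene\t1\t1398\t.\t+\t.\tID=gene-X\n",
    "chr1\tGenbank\tCDS\t1\t1398\t.\t+\t0\tID=cds-X;Note=CDS\n",
]
converted = list(cds_to_exon(toy_gff))
print("".join(converted), end="")
```

To apply it to a real file, the lines of the input GFF would be passed in and the result written to a new file such as CP059040_m.gff.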

# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
mv trimmed/*.fq.gz .; rmdir trimmed
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/CP059040_m.gff"        -profile docker -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the FASTA header matches the sequence ID used in the *_m.gff file

Prepare counts_fixed.tsv by hand: delete all double quotes (") and all "gene-" prefixes, and replace the commas with tabs ('\t').

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

#keep only gene_id
cut -f1 -d',' counts.tsv > f1
cut -f3- -d',' counts.tsv > f3_
paste -d',' f1 f3_ > counts_fixed.tsv
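
The manual cleanup (drop the second column, remove quotes and the "gene-" prefix, switch to tabs) can also be scripted. A minimal sketch, assuming the first two columns are a gene ID and a gene name as implied by the cut commands above; the function name `clean_counts`, the header names and the counts in the toy input are made up for illustration.

```python
import csv
import io


def clean_counts(csv_text):
    """Drop column 2, strip quotes and the 'gene-' prefix, return tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # csv.reader removes the surrounding double quotes while parsing
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        gene_id = row[0][len("gene-"):] if row[0].startswith("gene-") else row[0]
        writer.writerow([gene_id] + row[2:])
    return out.getvalue()


# Toy input in the assumed layout
toy_csv = (
    '"gene_id","gene_name","WT_17_r1","WT_17_r2"\n'
    '"gene-H0N29_00005","dnaA",12,7\n'
)
cleaned_text = clean_counts(toy_csv)
print(cleaned_text, end="")
```

Unlike the by-hand edit, this removes "gene-" only at the start of the first column, which avoids touching any other field that happens to contain the same string.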

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

#Delete replicate 2 of ΔadeIJ_two_17 and replicate 1 of ΔadeIJ_two_24, which are outliers.
paste -d$'\t' f1_32 f34 f36_ > counts_fixed_2.tsv

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed_2.tsv \
  --samples samples_2.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria_2

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)

setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c("WT_17_r1" = "./WT_17_r1/quant.sf",
           "WT_17_r2" = "./WT_17_r2/quant.sf",
           "WT_17_r3" = "./WT_17_r3/quant.sf",
           "WT_24_r1" = "./WT_24_r1/quant.sf",
           "WT_24_r2" = "./WT_24_r2/quant.sf",
           "WT_24_r3" = "./WT_24_r3/quant.sf",
           "deltaIJ_17_r1" = "./deltaIJ_17_r1/quant.sf",
           "deltaIJ_17_r2" = "./deltaIJ_17_r2/quant.sf",
           "deltaIJ_17_r3" = "./deltaIJ_17_r3/quant.sf",
           "deltaIJ_24_r1" = "./deltaIJ_24_r1/quant.sf",
           "deltaIJ_24_r2" = "./deltaIJ_24_r2/quant.sf",
           "deltaIJ_24_r3" = "./deltaIJ_24_r3/quant.sf",
           "pre_WT_17_r1" = "./pre_WT_17_r1/quant.sf",
           "pre_WT_17_r2" = "./pre_WT_17_r2/quant.sf",
           "pre_WT_17_r3" = "./pre_WT_17_r3/quant.sf",
           "pre_WT_24_r1" = "./pre_WT_24_r1/quant.sf",
           "pre_WT_24_r2" = "./pre_WT_24_r2/quant.sf",
           "pre_WT_24_r3" = "./pre_WT_24_r3/quant.sf",
           "pre_deltaIJ_17_r1" = "./pre_deltaIJ_17_r1/quant.sf",
           "pre_deltaIJ_17_r2" = "./pre_deltaIJ_17_r2/quant.sf",
           "pre_deltaIJ_17_r3" = "./pre_deltaIJ_17_r3/quant.sf",
           "pre_deltaIJ_24_r1" = "./pre_deltaIJ_24_r1/quant.sf",
           "pre_deltaIJ_24_r2" = "./pre_deltaIJ_24_r2/quant.sf",
           "pre_deltaIJ_24_r3" = "./pre_deltaIJ_24_r3/quant.sf",
           "0_5_WT_17_r1" = "./0_5_WT_17_r1/quant.sf",
           "0_5_WT_17_r2" = "./0_5_WT_17_r2/quant.sf",
           "0_5_WT_17_r3" = "./0_5_WT_17_r3/quant.sf",
           "0_5_WT_24_r1" = "./0_5_WT_24_r1/quant.sf",
           "0_5_WT_24_r2" = "./0_5_WT_24_r2/quant.sf",
           "0_5_WT_24_r3" = "./0_5_WT_24_r3/quant.sf",
           "0_5_deltaIJ_17_r1" = "./0_5_deltaIJ_17_r1/quant.sf",
           "0_5_deltaIJ_17_r2" = "./0_5_deltaIJ_17_r2/quant.sf",
           "0_5_deltaIJ_17_r3" = "./0_5_deltaIJ_17_r3/quant.sf",
           "0_5_deltaIJ_24_r1" = "./0_5_deltaIJ_24_r1/quant.sf",
           "0_5_deltaIJ_24_r2" = "./0_5_deltaIJ_24_r2/quant.sf",
           "0_5_deltaIJ_24_r3" = "./0_5_deltaIJ_24_r3/quant.sf")
# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(rep(c("r1", "r2", "r3"), times = 12))
condition <- factor(rep(c("WT_none_17", "WT_none_24", "deltaadeIJ_none_17", "deltaadeIJ_none_24",
                          "WT_one_17", "WT_one_24", "deltaadeIJ_one_17", "deltaadeIJ_one_24",
                          "WT_two_17", "WT_two_24", "deltaadeIJ_two_17", "deltaadeIJ_two_24"),
                        each = 3))
# Construct colData manually
colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
#dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition + batch)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# -- Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
# print() makes sure the ggplot is drawn even when the code is run via source()
png("pca2.png", 1200, 800)
print(plotPCA(rld, intgroup=c("condition")))
dev.off()

png("pca3.png", 1200, 800)
print(plotPCA(rld, intgroup=c("replicate")))
dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# 1) optionally keep only non-WT samples (disabled here, so all conditions are shown)
#pdat <- subset(pdat, !grepl("^WT_", condition))
# drop unused factor levels (only matters when the subset above is enabled)
pdat$condition <- droplevels(pdat$condition)
# 2) pretty condition names: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
p4 <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca4.png", 1200, 800); print(p4); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
# Drop WT_* conditions from the data and from factor levels
pdat <- subset(pdat, !grepl("^WT_", condition))
pdat$condition <- droplevels(pdat$condition)
# Prettify condition labels for the legend: deltaadeIJ -> ΔadeIJ
pdat$condition <- gsub("^deltaadeIJ", "\u0394adeIJ", pdat$condition)
p <- ggplot(pdat, aes(PC1, PC2, color = replicate, shape = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca5.png", 1200, 800); print(p); dev.off()

pdat <- plotPCA(rld, intgroup = c("condition","replicate"), returnData = TRUE)
percentVar <- round(100 * attr(pdat, "percentVar"))
p_fac <- ggplot(pdat, aes(PC1, PC2, color = replicate)) +
  geom_point(size = 3) +
  facet_wrap(~ condition) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_classic()
png("pca6.png", 1200, 800); print(p_fac); dev.off()

# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
#png("pca_media.png", 1200, 800)
#plotPCA(rld, intgroup=c("media"))
#dev.off()
#png("pca_strain.png", 1200, 800)
#plotPCA(rld, intgroup=c("strain"))
#dev.off()
#png("pca_time.png", 1200, 800)
#plotPCA(rld, intgroup=c("time"))
#dev.off()

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] WT_none_17         WT_none_17         WT_none_17         WT_none_24
[5] WT_none_24         WT_none_24         deltaadeIJ_none_17 deltaadeIJ_none_17
[9] deltaadeIJ_none_17 deltaadeIJ_none_24 deltaadeIJ_none_24 deltaadeIJ_none_24
[13] WT_one_17          WT_one_17          WT_one_17          WT_one_24
[17] WT_one_24          WT_one_24          deltaadeIJ_one_17  deltaadeIJ_one_17
[21] deltaadeIJ_one_17  deltaadeIJ_one_24  deltaadeIJ_one_24  deltaadeIJ_one_24
[25] WT_two_17          WT_two_17          WT_two_17          WT_two_24
[29] WT_two_24          WT_two_24          deltaadeIJ_two_17  deltaadeIJ_two_17
[33] deltaadeIJ_two_17  deltaadeIJ_two_24  deltaadeIJ_two_24  deltaadeIJ_two_24
12 Levels: deltaadeIJ_none_17 deltaadeIJ_none_24 ... WT_two_24

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

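# Construct colData automatically (row names are set to the sample names so they line up with the columns of txi)
sample_table <- data.frame(
    condition = condition,
    replicate = replicate,
    row.names = names(files)
)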
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("genotype", "exposure", "time")
colData <- cbind(sample_table, split_cond)
colData$genotype <- factor(colData$genotype)
colData$exposure  <- factor(colData$exposure)
colData$time   <- factor(colData$time)
colData$group  <- factor(paste(colData$genotype, colData$exposure, colData$time, sep = "_"))
# Construct colData manually
colData2 <- data.frame(condition=condition, row.names=names(files))

# Set the reference level of each factor (optional)
colData$genotype <- relevel(factor(colData$genotype), ref = "WT")
colData$exposure  <- relevel(factor(colData$exposure), ref = "none")
colData$time   <- relevel(factor(colData$time), ref = "17")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ genotype * exposure * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
[1] "Intercept"
[2] "genotype_deltaadeIJ_vs_WT"
[3] "exposure_one_vs_none"
[4] "exposure_two_vs_none"
[5] "time_24_vs_17"
[6] "genotypedeltaadeIJ.exposureone"
[7] "genotypedeltaadeIJ.exposuretwo"
[8] "genotypedeltaadeIJ.time24"
[9] "exposureone.time24"
[10] "exposuretwo.time24"
[11] "genotypedeltaadeIJ.exposureone.time24"
[12] "genotypedeltaadeIJ.exposuretwo.time24"

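# Extract the genotype effect: up 10, down 4
# Note: because the design contains interaction terms, this coefficient is the genotype effect
# at the reference levels (exposure = none, time = 17), not an effect averaged over all conditions.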
contrast <- "genotype_deltaadeIJ_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the effect of exposure "one" vs "none" (at the reference levels genotype = WT, time = 17): up 196; down 298
contrast <- "exposure_one_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the effect of exposure "two" vs "none" (at the reference levels genotype = WT, time = 17): up 80; down 105
contrast <- "exposure_two_vs_none"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the time effect 24 vs 17 (at the reference levels genotype = WT, exposure = none): up 10; down 2
contrast <- "time_24_vs_17"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

#1.)  ΔadeIJ_none 17h vs WT_none 17h
#2.)  ΔadeIJ_none 24h vs WT_none 24h
#3.)  ΔadeIJ_one 17h vs WT_one 17h
#4.)  ΔadeIJ_one 24h vs WT_one 24h
#5.)  ΔadeIJ_two 17h vs WT_two 17h
#6.)  ΔadeIJ_two 24h vs WT_two 24h

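#---- relevel to control ----
# Each block below sets a new reference level, refits the model and defines clist.
# Run the for-loop at the end of this section after each block to write the tables for that comparison.
# (The same comparison can also be obtained without refitting via
#  results(dds, contrast = c("condition", "<numerator>", "<reference>")).)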
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_17_vs_WT_none_17")

dds$condition <- relevel(dds$condition, "WT_none_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_WT_none_24")

dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_17_vs_WT_one_17")

dds$condition <- relevel(dds$condition, "WT_one_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_WT_one_24")

dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_17_vs_WT_two_17")

dds$condition <- relevel(dds$condition, "WT_two_24")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_WT_two_24")

# WT_none_xh
dds$condition <- relevel(dds$condition, "WT_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_none_24_vs_WT_none_17")

# WT_one_xh
dds$condition <- relevel(dds$condition, "WT_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_one_24_vs_WT_one_17")

# WT_two_xh
dds$condition <- relevel(dds$condition, "WT_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_two_24_vs_WT_two_17")

# deltaadeIJ_none_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_none_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_none_24_vs_deltaadeIJ_none_17")

# deltaadeIJ_one_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_one_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_one_24_vs_deltaadeIJ_one_17")

# deltaadeIJ_two_xh
dds$condition <- relevel(dds$condition, "deltaadeIJ_two_17")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltaadeIJ_two_24_vs_deltaadeIJ_two_17")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP059040_m.gff > CP059040_gene.gff

for cmp in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/CP059040_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done
#deltaadeIJ_none_24_vs_deltaadeIJ_none_17  up(0) down(0)
#deltaadeIJ_one_24_vs_deltaadeIJ_one_17    up(0) down(8: gabT, H0N29_11475, H0N29_01015, H0N29_01030, ...)
#deltaadeIJ_two_24_vs_deltaadeIJ_two_17    up(8) down(51)
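
The up/down numbers noted above can be collected automatically instead of by eye. A short sketch, assuming each table has exactly one header line (as written by write.csv); the toy_degenes/ directory and the comparison name demo_vs_ctrl are made up:

```shell
# Toy tables (hypothetical): two up-regulated genes, no down-regulated gene
mkdir -p toy_degenes
printf '"","baseMean","log2FoldChange"\n"gene-A",10,2.5\n"gene-B",20,3.1\n' > toy_degenes/demo_vs_ctrl-up.txt
printf '"","baseMean","log2FoldChange"\n' > toy_degenes/demo_vs_ctrl-down.txt

for cmp in demo_vs_ctrl; do
  # Number of genes = number of lines minus the header line
  up=$(( $(wc -l < "toy_degenes/${cmp}-up.txt") - 1 ))
  down=$(( $(wc -l < "toy_degenes/${cmp}-down.txt") - 1 ))
  printf '%s\tup(%d)\tdown(%d)\n' "$cmp" "$up" "$down"
done > toy_degenes/summary.tsv

cat toy_degenes/summary.tsv
```

For the real data the loop would run over the same list of comparisons as the renaming loop, inside the degenes directory.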

(NOT_PERFORMED) Volcano plots

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows with NA padj instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

Cluster the genes and draw heatmaps

#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
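# Save the IDs of the up-regulated and down-regulated genes into -up.id and -down.id files.
# Note: the loop only prints the cut commands; run the printed lines afterwards to create the files.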

for i in deltaadeIJ_none_17_vs_WT_none_17 deltaadeIJ_none_24_vs_WT_none_24 deltaadeIJ_one_17_vs_WT_one_17 deltaadeIJ_one_24_vs_WT_one_24 deltaadeIJ_two_17_vs_WT_two_17 deltaadeIJ_two_24_vs_WT_two_24    WT_none_24_vs_WT_none_17 WT_one_24_vs_WT_one_17 WT_two_24_vs_WT_two_17 deltaadeIJ_none_24_vs_deltaadeIJ_none_17 deltaadeIJ_one_24_vs_deltaadeIJ_one_17 deltaadeIJ_two_24_vs_deltaadeIJ_two_17; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done
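
If printing the commands first is not needed, the files could be written directly. A possible variant that produces the same output as the printed cut commands (header line and quotes are kept) and skips missing tables; toy_ids/, demo_vs_ctrl and missing_cmp are invented names for the example:

```shell
# Toy tables (hypothetical); missing_cmp has no files on purpose
mkdir -p toy_ids
printf '"","baseMean"\n"gene-A",10\n"gene-B",20\n' > toy_ids/demo_vs_ctrl-up.txt
printf '"","baseMean"\n"gene-C",5\n' > toy_ids/demo_vs_ctrl-down.txt

for i in demo_vs_ctrl missing_cmp; do
  for dir in up down; do
    src="toy_ids/${i}-${dir}.txt"
    # Skip comparisons whose table was never written
    if [ ! -f "$src" ]; then
      echo "skipping ${src}: not found" >&2
      continue
    fi
    # First comma-separated column = gene ID column of the write.csv output
    cut -d',' -f1 "$src" > "toy_ids/${i}-${dir}.id"
  done
done

cat toy_ids/demo_vs_ctrl-up.id
```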

# ------------------ Heatmap generation for two samples ----------------------

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

   #12 deltaIJ_17_vs_WT_17-up.txt
   #1 deltaIJ_24_vs_WT_24-up.txt
   #239 pre_deltaIJ_17_vs_pre_WT_17-up.txt
   #84 pre_deltaIJ_24_vs_pre_WT_24-up.txt
   #75 0_5_deltaIJ_17_vs_0_5_WT_17-up.txt
   #2 0_5_deltaIJ_24_vs_0_5_WT_24-up.txt

   #4 deltaIJ_17_vs_WT_17-down.txt
   #3 deltaIJ_24_vs_WT_24-down.txt
   #91 pre_deltaIJ_17_vs_pre_WT_17-down.txt
   #65 pre_deltaIJ_24_vs_pre_WT_24-down.txt
   #15 0_5_deltaIJ_17_vs_0_5_WT_17-down.txt
   #4 0_5_deltaIJ_24_vs_0_5_WT_24-down.txt

## 0) Config (run only ONE of the lines below; a later assignment overwrites the earlier ones) ---
contrast <- "deltaadeIJ_none_17_vs_WT_none_17"  #up 11, down 3 vs. (10,4)
contrast <- "deltaadeIJ_none_24_vs_WT_none_24"  #up 0, down 2 vs. (0,2)
contrast <- "deltaadeIJ_one_17_vs_WT_one_17"    #up 238, down 90 vs. (239,89)  --> height 2600
contrast <- "deltaadeIJ_one_24_vs_WT_one_24"    #up 83, down 64 vs. (64,71) --> height 1600
contrast <- "deltaadeIJ_two_17_vs_WT_two_17"    #up 74, down 14 vs. (75,9) --> height 1000
contrast <- "deltaadeIJ_two_24_vs_WT_two_24"    #up 1, down 3 vs. (3,3)

contrast <- "WT_none_24_vs_WT_none_17"  #(up 10, down 2)
contrast <- "WT_one_24_vs_WT_one_17"    #(up 97, down 3)
contrast <- "WT_two_24_vs_WT_two_17"    #(up 12, down 1)
contrast <- "deltaadeIJ_none_24_vs_deltaadeIJ_none_17" #(up 0, down 0)
contrast <- "deltaadeIJ_one_24_vs_deltaadeIJ_one_17"   #(up 0, down 10)
contrast <- "deltaadeIJ_two_24_vs_deltaadeIJ_two_17"   #(up 8, down 51)

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
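# Read IDs from a one-column file that:
#  - starts with a header line ("Gene_Id" or an empty "")
#  - may contain quotes
# Caveat: header = TRUE is tried first, so a file WITHOUT a header line loses its first ID.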
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

## 3) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)
colnames(RNASeq.NoCellLine) <- c(
  "WT_none_17_r1", "WT_none_17_r2", "WT_none_17_r3",
  "WT_none_24_r1", "WT_none_24_r2", "WT_none_24_r3",
  "deltaadeIJ_none_17_r1", "deltaadeIJ_none_17_r2", "deltaadeIJ_none_17_r3",
  "deltaadeIJ_none_24_r1", "deltaadeIJ_none_24_r2", "deltaadeIJ_none_24_r3",
  "WT_one_17_r1", "WT_one_17_r2", "WT_one_17_r3",
  "WT_one_24_r1", "WT_one_24_r2", "WT_one_24_r3",
  "deltaadeIJ_one_17_r1", "deltaadeIJ_one_17_r2", "deltaadeIJ_one_17_r3",
  "deltaadeIJ_one_24_r1", "deltaadeIJ_one_24_r2", "deltaadeIJ_one_24_r3",
  "WT_two_17_r1", "WT_two_17_r2", "WT_two_17_r3",
  "WT_two_24_r1", "WT_two_24_r2", "WT_two_24_r3",
  "deltaadeIJ_two_17_r1", "deltaadeIJ_two_17_r2", "deltaadeIJ_two_17_r3",
  "deltaadeIJ_two_24_r1", "deltaadeIJ_two_24_r2", "deltaadeIJ_two_24_r3"
)

# -- After the first run, re-run the code from HERE for each new contrast --

## 4) Build GOI from the two .id files (skip contrasts whose .id files are empty) ---
up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

# The GOI already carry the 'gene-*' prefix, so they can be matched directly against the rownames
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

## 6.1) Pretty labels (display only) ---------------------------
# Row labels: strip 'gene-'
labRow_pretty <- sub("^gene-", "", rownames(datamat))

# Column labels: 'deltaadeIJ' -> 'ΔadeIJ' and nicer spacing
labCol_pretty <- colnames(datamat)
labCol_pretty <- gsub("^deltaadeIJ", "\u0394adeIJ", labCol_pretty)
labCol_pretty <- gsub("_", " ", labCol_pretty)   # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
# If you prefer to drop replicate suffixes, uncomment:
# labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

## 7) Clustering -----------------------------------------------
# Row clustering with Pearson distance
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
#row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
#row_cor[!is.finite(row_cor)] <- 0
#hr <- hclust(as.dist(1 - row_cor), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.1)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

#BREAK_LINE

png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=600)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 20),        # c(bottom, right): space for column and row labels
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,               # keep the column order set above
        density.info = 'none',
        labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
        labCol     = labCol_pretty,   # col labels with Δ prefix + spaces
        cexRow = 2.5,
        cexCol = 2.5,
        srtCol = 15,
        lhei = c(0.6, 4),           # if a smaller plot size triggers 'Error in plot.new() : figure margins too large', increase the first number
        lwid = c(0.8, 4))           # same rule as for lhei
dev.off()

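# Alternative layout: taller canvas, smaller fonts, no colour key (overwrites the PNG written above)
png(paste0("DEGs_heatmap_", contrast, ".png"), width = 800, height = 1000)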
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),      # c(bottom, right): space for column and row labels
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.3,
  cexCol = 1.8,
  srtCol = 15,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE               # safer; the colour key does not fit into lhei/lwid cells this small
)
dev.off()
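
To check which samples a contrast name will select before plotting, the logic of `split_contrast_groups` and `match_tags` above can be reproduced on the command line. This is only a sketch: the sample names are hypothetical, and `grep -E` stands in for the Perl-style pattern used in R.

```shell
#!/usr/bin/env bash
set -euo pipefail

contrast="deltasbp_TSB_2h_vs_WT_TSB_2h"

# Split "A_vs_B" into its two group tags with parameter expansion.
group_a="${contrast%%_vs_*}"     # text before the first "_vs_"
group_b="${contrast##*_vs_}"     # text after the last "_vs_"

# Hypothetical sample (column) names, one per line.
samples=$(printf '%s\n' WT_TSB_2h_r1 WT_TSB_2h_r2 WT_TSB_18h_r1 \
                        deltasbp_TSB_2h_r1 deltasbp_MH_2h_r1 mockWT_TSB_2h_r1)

# A tag only counts when it is delimited by the string start/end or by underscores,
# so "mockWT_TSB_2h_r1" is not selected by the tag "WT_TSB_2h".
matched=$(printf '%s\n' "$samples" | grep -E "(^|_)(${group_a}|${group_b})(_|\$)")

echo "groups: ${group_a} / ${group_b}"
echo "$matched"
```

If the list of matched samples is not the expected set of replicates, the column names of the expression matrix need to be fixed before the heatmap code is run.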

# ------------------ Heatmap generation for three samples ----------------------

## ============================================================
## Three-condition DEGs heatmap from multiple pairwise contrasts
## Example contrasts:
##   "WT_MH_4h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_4h"
## Output shows the union of DEGs across all contrasts and
## only the columns (samples) for the 3 conditions.
## ============================================================

## -------- 0) User inputs (keep ONE contrasts block and the matching cond_order line; later assignments overwrite earlier ones) --
contrasts <- c(
  "WT_MH_4h_vs_WT_MH_2h",
  "WT_MH_18h_vs_WT_MH_2h",
  "WT_MH_18h_vs_WT_MH_4h"
)
contrasts <- c(
  "WT_TSB_4h_vs_WT_TSB_2h",
  "WT_TSB_18h_vs_WT_TSB_2h",
  "WT_TSB_18h_vs_WT_TSB_4h"
)
contrasts <- c(
  "deltasbp_MH_4h_vs_deltasbp_MH_2h",
  "deltasbp_MH_18h_vs_deltasbp_MH_2h",
  "deltasbp_MH_18h_vs_deltasbp_MH_4h"
)
contrasts <- c(
  "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
)
## Optionally force a condition display order (defaults to order of first appearance)
cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h")
cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h")
cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h")
cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h")
#cond_order <- NULL

## -------- 1) Packages ---------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## -------- 2) Helpers ----------------------------------------
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE,
                            quote = "\"'", comment.char = ""), error = function(e) NULL)
  if (!is.null(df) && ncol(df) >= 1) {
    ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE,
                      quote = "\"'", comment.char = "")
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

# From "A_vs_B" return c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
  parts
}

# Grep whole tag between start/end or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Pretty labels for columns (optional tweaks)
prettify_col_labels <- function(x) {
  x <- gsub("^deltasbp", "\u0394sbp", x)  # 'deltasbp' -> 'Δsbp'
  x <- gsub("_", " ", x)
  x
}

# BREAK_LINE

## -------- 3) Build GOI (union across contrasts) -------------
up_files   <- paste0(contrasts, "-up.id")
down_files <- paste0(contrasts, "-down.id")

GOI <- unique(unlist(c(
  lapply(up_files,   read_ids_from_file),
  lapply(down_files, read_ids_from_file)
)))
if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

## -------- 4) Expression matrix (rld or vsd) -----------------
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
expr_all <- as.matrix(expr_all)

present <- intersect(rownames(expr_all), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## -------- 5) Infer the THREE condition tags -----------------
pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B)
cond_tags <- unique(unlist(pair_groups))
if (length(cond_tags) != 3L) {
  stop("Expected exactly three unique condition tags across the contrasts, got: ",
      paste(cond_tags, collapse = ", "))
}

# If user provided an explicit order, use it; else keep first-appearance order
if (!is.null(cond_order)) {
  if (!setequal(cond_order, cond_tags))
    stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
  cond_tags <- cond_order
}

#BREAK_LINE

## -------- 6) Subset columns to those 3 conditions -----------
# helper: does a name contain any of the tags?
match_any_tag <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# helper: return the specific tag that a single name matches
detect_tag <- function(nm, tags) {
  hits <- vapply(tags, function(t)
    grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
  if (!any(hits)) NA_character_ else tags[which(hits)[1]]
}

keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
if (!any(keep_cols)) {
  stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
}

sub_idx <- which(keep_cols)
sub_colnames <- colnames(expr_all)[sub_idx]

# find the condition tag for each kept column (each column is tested against every tag separately)
cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)

# rank columns by your desired condition order, then by name within each condition
cond_rank <- match(cond_for_col, cond_tags)
ord <- order(cond_rank, sub_colnames)

expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

## -------- 7) Remove constant/NA rows ------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

## -------- 8) Labels ----------------------------------------
labRow_pretty <- sub("^gene-", "", rownames(datamat))
labCol_pretty <- prettify_col_labels(colnames(datamat))

## -------- 9) Clustering (rows) ------------------------------
hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.3)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

## -------- 10) Save the matrix used --------------------------
out_tag <- paste(cond_tags, collapse = "_")
write.csv(as.data.frame(datamat),
          file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"),
          quote = FALSE)

## -------- 11) Plot heatmap ----------------------------------
png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 5000)
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),      # c(bottom, right): space for column and row labels
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.3,
  cexCol = 1.8,
  srtCol = 15,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE               # safer; the colour key does not fit into lhei/lwid cells this small
)
dev.off()

mv DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png DEGs_heatmap_WT_MH.png
mv DEGs_heatmap_WT_TSB_2h_WT_TSB_4h_WT_TSB_18h.png DEGs_heatmap_WT_TSB.png
mv DEGs_heatmap_deltasbp_MH_2h_deltasbp_MH_4h_deltasbp_MH_18h.png DEGs_heatmap_deltasbp_MH.png
mv DEGs_heatmap_deltasbp_TSB_2h_deltasbp_TSB_4h_deltasbp_TSB_18h.png DEGs_heatmap_deltasbp_TSB.png
# ------------------ Heatmap generation for three samples END ----------------------
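
The four `mv` commands above follow one pattern, so they can also be generated in a loop. This is a sketch that assumes the PNG names match the `out_tag` built in step 10 (strain, medium and the three time points). The empty placeholder file created here is only for demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"
# Placeholder standing in for one real heatmap PNG.
touch DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png

for strain in WT deltasbp; do
  for medium in MH TSB; do
    cond="${strain}_${medium}"
    src="DEGs_heatmap_${cond}_2h_${cond}_4h_${cond}_18h.png"
    dst="DEGs_heatmap_${cond}.png"
    if [ -e "$src" ]; then
      mv "$src" "$dst"
      echo "renamed: ${src} -> ${dst}"
    else
      echo "not found, skipped: ${src}" >&2
    fi
  done
done
```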

# ==== (NOT_USED) Ultra-robust heatmap.2 plotter with many attempts ====
# Inputs:
#   mat        : numeric matrix (genes x samples)
#   hr         : hclust for rows (or TRUE/FALSE)
#   row_colors : vector of RowSideColors of length nrow(mat) or NULL
#   labRow     : character vector of row labels (display only)
#   labCol     : character vector of col labels (display only)
#   outfile    : output PNG path
#   base_res   : DPI for PNG (default 150)
# ==== Slide-tuned heatmap.2 plotter (moderate size, larger fonts, 45° labels) ====
safe_heatmap2 <- function(mat, hr, row_colors, labRow, labCol, outfile, base_res = 150) {
  stopifnot(is.matrix(mat))
  nr <- nrow(mat); nc <- ncol(mat)

  # Target slide size & sensible caps
  #target_w <- 2400; target_h <- 1600
  #max_w <- 3000; max_h <- 2000
  target_w <- 800; target_h <- 2000
  max_w <- 1500; max_h <- 1500

  # Label stats
  max_row_chars <- if (length(labRow)) max(nchar(labRow), na.rm = TRUE) else 1
  max_col_chars <- if (length(labCol)) max(nchar(labCol), na.rm = TRUE) else 1

  #add_attempt(target_w, target_h, 0.90, 1.00, 45, NULL, TRUE,  TRUE,  TRUE)
  attempts <- list()
  add_attempt <- function(w, h, cr, cc, rot, mar = NULL, key = TRUE, showR = TRUE, showC = TRUE, trunc_rows = 0) {
    attempts[[length(attempts) + 1]] <<- list(
      w = w, h = h, cr = cr, cc = cc, rot = rot, mar = mar,
      key = key, showR = showR, showC = showC, trunc_rows = trunc_rows
    )
  }

  # Note: in practice only the attempts with key = FALSE succeed; every attempt with key = TRUE fails.
  # 1) Preferred look: target size, biggish fonts, 30° labels, tight margins
  add_attempt(target_w,           target_h,           0.90, 1.00, 30, c(2,1), TRUE, TRUE, TRUE)
  # 2) Same size, slightly smaller fonts, 45° labels, explicit margins
  add_attempt(target_w,           target_h,           0.85, 0.95, 45, c(10,15), TRUE, TRUE, TRUE)
  # 3) Wider canvas (both dimensions capped at max_w/max_h), 30° labels
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.85, 0.95, 30, c(10,15), TRUE, TRUE, TRUE)
  # 4) Same canvas, smaller fonts, no colour key
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(10,14), FALSE, TRUE, TRUE)
  # 5) Same as 4) with tighter margins
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(8,12),  FALSE, TRUE, TRUE)
  # 6) Final fallback: widest canvas, 45° labels
  add_attempt(min(target_w+500,   max_w), min(target_h+300, max_h), 0.80, 0.90,  45, c(8,12),  FALSE, TRUE, TRUE)
  # 7) Last resort: truncate long row labels (keeps readability)
  if (max_row_chars > 20) {
    add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE, trunc_rows = 18)
  }

  for (i in seq_along(attempts)) {
    a <- attempts[[i]]

    # Compute margins if not provided
    if (is.null(a$mar)) {
      col_margin <- if (a$showC) {
        if (a$rot > 0) max(6, ceiling(0.45 * max_col_chars * max(a$cc, 0.8))) else
                      max(5, ceiling(0.22 * max_col_chars * max(a$cc, 0.8)))
      } else 4
      row_margin <- if (a$showR) max(6, ceiling(0.55 * max_row_chars * max(a$cr, 0.8))) else 4
      mar <- c(col_margin, row_margin)
    } else {
      mar <- a$mar
    }

    # Prepare labels for this attempt
    lr <- if (a$showR) labRow else rep("", nr)
    if (a$trunc_rows > 0 && a$showR) {
      lr <- ifelse(nchar(lr) > a$trunc_rows, paste0(substr(lr, 1, a$trunc_rows), "…"), lr)
    }
    lc <- if (a$showC) labCol else rep("", nc)

    # Close any open device
    if (dev.cur() != 1) try(dev.off(), silent = TRUE)

    ok <- FALSE
    try({
      png(outfile, width = ceiling(a$w), height = ceiling(a$h), res = base_res)
      gplots::heatmap.2(
        mat,
        Rowv = as.dendrogram(hr),
        Colv = FALSE,
        dendrogram = "row",
        col = gplots::bluered(75),
        scale = "row",
        trace = "none",
        density.info = "none",
        RowSideColors = row_colors,
        key = a$key,
        margins = mar,           # c(col, row) in lines
        sepwidth = c(0, 0),
        labRow = lr,
        labCol = lc,
        cexRow = a$cr,
        cexCol = a$cc,
        srtCol = a$rot,
        lhei = c(0.1, 4),
        lwid = c(0.1, 4)
      )
      dev.off()
      ok <- TRUE
    }, silent = TRUE)

    if (ok) {
      message(sprintf(
        "✓ Heatmap saved: %s  (attempt %d)  size=%dx%d  margins=c(%d,%d)  cexRow=%.2f  cexCol=%.2f  srtCol=%d",
        outfile, i, ceiling(a$w), ceiling(a$h), mar[1], mar[2], a$cr, a$cc, a$rot
      ))
      return(invisible(TRUE))
    } else {
      if (dev.cur() != 1) try(dev.off(), silent = TRUE)
      message(sprintf("Attempt %d failed; retrying...", i))
    }
  }

  stop("Failed to draw heatmap after tuned attempts. Consider ComplexHeatmap if this persists.")
}

safe_heatmap2(
  mat        = datamat,
  hr         = hr,
  row_colors = mycol,
  labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
  labCol     = labCol_pretty,   # col labels with Δsbp + spaces
  outfile    = paste0("DEGs_heatmap_", contrast, ".png")
  # base_res = 150 (default)
)

# -- (OLD CODE) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h --
cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids
#add Gene_Id in the first line, delete the ""  #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID.
GOI <- read.csv("ids")$Gene_Id
RNASeq.NoCellLine <- assay(rld)
#install.packages("gplots")
library("gplots")
#clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
datamat = RNASeq.NoCellLine[GOI, ]
#datamat = RNASeq.NoCellLine
write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")

constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
if(any(constant_rows)) {
  cat("Removing", sum(constant_rows), "constant rows.\n")
  datamat <- datamat[!constant_rows, ]
}
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.1)
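# "LIGHTRED" is not a valid R colour name and "MAGENTA" was listed twice; same palette as in the newer code above
mycol = c("yellow", "blue", "orange", "magenta", "cyan", "red", "green", "maroon", "lightblue", "pink", "purple", "lightcyan", "salmon", "lightgreen");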
mycol = mycol[as.vector(mycl)]

png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 15),        # c(bottom, right): space for column and row labels
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,
        density.info = 'none',
        labRow = rownames(datamat),
        cexRow = 1.5,
        cexCol = 1.5,
        srtCol = 35,
        lhei = c(0.2, 4),           # reduce top space (was 1 or more)
        lwid = c(0.4, 4))           # reduce left space (was 1 or more)
dev.off()

# -------------- Cluster members ----------------
write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
#~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;
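
Before converting the cluster files to Excel, a quick command-line check of the cluster sizes can help. This sketch assumes the layout that `write.csv` produces for a character vector (a header line, then one quoted row number and one quoted gene ID per line). The demo file, its gene IDs and the output name `cluster_sizes.tsv` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical cluster file in write.csv layout; use the real cluster*.txt files instead.
printf '"","x"\n"1","gene-X_00010"\n"2","gene-X_00020"\n' > cluster1_YELLOW.txt

: > cluster_sizes.tsv                       # start with an empty summary table
for f in cluster*.txt; do
  # Second comma-separated field = gene ID; skip the header line and drop the quotes.
  ids=$(tail -n +2 "$f" | cut -d',' -f2 | tr -d '"')
  # Count non-empty lines; "|| true" keeps the script alive when a cluster is empty.
  n=$(printf '%s\n' "$ids" | grep -c . || true)
  printf '%s\t%s\n' "$f" "$n" >> cluster_sizes.tsv
done

cat cluster_sizes.tsv
```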

#### (NOT_WORKING) cluster members with annotations (this does not work for bacteria: the organism is not a model species, so mart=ensembl cannot be used) #####
subset_1<-names(subset(mycl, mycl == '1'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
subset_2<-names(subset(mycl, mycl == '2'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
subset_3<-names(subset(mycl, mycl == '3'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
subset_4<-names(subset(mycl, mycl == '4'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
subset_5<-names(subset(mycl, mycl == '5'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()
# Determine total number of genes
total_genes <- length(rownames(data))
# Loop through each gene to annotate
for (i in 1:total_genes) {
    gene <- rownames(data)[i]
    result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                    filters = 'ensembl_gene_id',
                    values = gene,
                    mart = ensembl)
    # If multiple rows are returned, take the first one
    if (nrow(result) > 1) {
        result <- result[1, ]
    }
    # Check if the result is empty
    if (nrow(result) == 0) {
        result <- data.frame(ensembl_gene_id = gene,
                            external_gene_name = NA,
                            gene_biotype = NA,
                            entrezgene_id = NA,
                            chromosome_name = NA,
                            start_position = NA,
                            end_position = NA,
                            strand = NA,
                            description = NA)
    }
    # Transpose expression values
    expression_values <- t(data.frame(t(data[gene, ])))
    colnames(expression_values) <- colnames(data)
    # Combine gene information and expression data
    combined_result <- cbind(result, expression_values)
    # Append to the final dataframe
    annotated_data <- rbind(annotated_data, combined_result)
    # Print progress every 100 genes
    if (i %% 100 == 0) {
        cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    }
}
# Save the annotated data to a new CSV file
write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

Option 1 (KEGG Terms): eggNOG-mapper, based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download the protein sequences from Genbank
        mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
        python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
        emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
        #----> result eggnog_out.emapper.annotations: contains KEGG, GO, and other functional annotations.
        #----> example seed ortholog 470.IX87_14445:
            * 470 is the NCBI taxonomy ID of the organism the seed ortholog comes from (470 = Acinetobacter baumannii).
            * IX87_14445 is the locus tag of the specific gene or protein within that genome.

    Extract the KEGG KO IDs (column KEGG_ko) from eggnog_out.emapper.annotations.
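
A possible way to pull the KO IDs out of the annotation table, sketched with awk. It assumes the eggNOG-mapper v2 layout: tab-separated, metadata lines starting with `##`, a header line starting with `#query`, a column named `KEGG_ko` holding comma-separated `ko:Kxxxxx` entries, and `-` for genes without a hit. The demo table, its gene IDs and the output file name `gene2ko.tsv` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Tiny stand-in for eggnog_out.emapper.annotations (only three of the real columns).
printf '## demo metadata line\n'               >  demo.emapper.annotations
printf '#query\tPreferred_name\tKEGG_ko\n'     >> demo.emapper.annotations
printf 'geneA\tabcA\tko:K00001,ko:K00002\n'    >> demo.emapper.annotations
printf 'geneB\t-\t-\n'                         >> demo.emapper.annotations
printf 'geneC\txyzC\tko:K01234\n'              >> demo.emapper.annotations

awk -F'\t' '
  /^##/     { next }                           # skip metadata lines
  /^#query/ {                                  # header: locate the KEGG_ko column by name
    for (i = 1; i <= NF; i++) if ($i == "KEGG_ko") col = i
    next
  }
  col && $col != "-" && $col != "" {           # genes with at least one KO
    n = split($col, kos, ",")
    for (j = 1; j <= n; j++) {
      sub(/^ko:/, "", kos[j])                  # strip the "ko:" prefix
      print $1 "\t" kos[j]                     # one gene/KO pair per line
    }
  }
' demo.emapper.annotations > gene2ko.tsv

cat gene2ko.tsv
```

Looking the column up by its header name rather than by position keeps the sketch independent of the exact column order of the installed eggNOG-mapper version.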

Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

* Start the GUI from ~/DATA/Data_Michelle_RNAseq_2025 with ~/Tools/Blast2GO/Blast2GO_Launcher and set the workspace to ~/b2gWorkspace_Michelle_RNAseq_2025 (mkdir ~/b2gWorkspace_Michelle_RNAseq_2025; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025)
* 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
* Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
        QBlast finished with warnings!
        Blasted Sequences: 2084
        Sequences without results: 105
        Check the Job log for details and try to submit again.
        Restarting QBlast may result in additional results, depending on the error type.
        "Blast (CP020463_protein) Done"
* Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
        "Mapping (CP020463_protein) Done"
        "Mapping finished - Please proceed now to annotation."
* Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
        * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO and the position of the particular GO term in the Gene Ontology hierarchy. The annotation score cutoff selects the most specific GO term for a given GO branch that lies above this value.
        * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.
        "Annotation (CP020463_protein) Done"
        "Annotation finished."
or blast2go_cli_v1.5.1 (NOT_USED)

        #https://help.biobam.com/space/BCD/2250407989/Installation
        #see ~/Scripts/blast2go_pipeline.sh

Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
        "InterProScan (CP020463_protein) Done"
        "InterProScan Finished - You can now merge the obtained GO Annotations."
MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
        "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
        "Finished merging GO terms from InterPro with annotations."
        "Maybe you want to run ANNEX (Annotation Augmentation)."
    #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
        #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

    #-- before merging (blast2go_annot.annot) --
    #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0085020
    #-- after merging (blast2go_annot.annot2) -->
    #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0070531
    #H0N29_18790     GO:0004842
    #H0N29_18790     GO:0005515
    #H0N29_18790     GO:0085020

    cp blast2go_annot.annot blast2go_annot.annot2
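
The before/after listing above can be checked programmatically. The following is a minimal sketch in base R, assuming the .annot export is tab-separated (gene ID, GO ID, optional description); `read_annot_terms` is a hypothetical helper, not part of Blast2GO, and the lines are the example records shown above.

```r
# Hypothetical helper: parse .annot lines (gene ID <TAB> term <TAB> optional description)
# and return the GO IDs per gene as a named list.
read_annot_terms <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  gene <- vapply(fields, function(x) x[1], character(1))
  term <- vapply(fields, function(x) x[2], character(1))
  keep <- !is.na(term) & grepl("^GO:", term)  # ignore EC: terms and malformed lines
  split(term[keep], gene[keep])
}

# Example records copied from the listing above (tab separation is an assumption)
before <- c("H0N29_18790\tGO:0004842\tankyrin repeat domain-containing protein",
            "H0N29_18790\tGO:0085020")
after  <- c("H0N29_18790\tGO:0031436\tankyrin repeat domain-containing protein",
            "H0N29_18790\tGO:0070531",
            "H0N29_18790\tGO:0004842",
            "H0N29_18790\tGO:0005515",
            "H0N29_18790\tGO:0085020")

terms_before <- read_annot_terms(before)
terms_after  <- read_annot_terms(after)

# Terms added by the InterPro merge, and terms that disappeared (expected: none)
added <- setdiff(terms_after[["H0N29_18790"]], terms_before[["H0N29_18790"]])
lost  <- setdiff(terms_before[["H0N29_18790"]], terms_after[["H0N29_18790"]])
print(added)
print(lost)
```

For real files, the `before` and `after` vectors would come from `readLines()` on the two exported .annot files.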

Option 4 (NOT_USED): RFAM for non-coding RNA

Option 5 (NOT_USED): PSORTb for subcellular localizations

Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

* Go to KAAS
* Upload your FASTA file.
* Select an appropriate gene set.
* Download the KO assignments.

Find the Closest KEGG Organism Code (NOT_USED)

Since your species isn't directly in KEGG, use a closely related organism.

* Check available KEGG organisms:

        library(clusterProfiler)
        library(KEGGREST)

        kegg_organisms <- keggList("organism")

        # Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

        # Search for Acinetobacter in the list
        grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
        # Gammaproteobacteria
        #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (this code has been integrated in point 4):

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
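
With organism = "ko", the gene vector has to contain plain K numbers. In the eggNOG-mapper output the KEGG_ko column holds entries such as "ko:K02003,ko:K02004", with "-" for genes without an assignment, so the values need cleaning first. A minimal sketch in base R; `clean_ko` and the example values are hypothetical and only illustrate the cleaning step.

```r
# Hypothetical helper: turn emapper KEGG_ko strings into a clean vector of K numbers.
clean_ko <- function(kegg_ko) {
  kegg_ko <- kegg_ko[!is.na(kegg_ko)]
  # one entry may hold several comma-separated KOs
  ids <- as.character(unlist(strsplit(kegg_ko, ",", fixed = TRUE)))
  ids <- sub("^ko:", "", trimws(ids))
  # keep only well-formed K numbers (K + 5 digits); this also drops "-" and ""
  unique(ids[grepl("^K[0-9]{5}$", ids)])
}

# Made-up example values in the emapper format
example_ko <- c("ko:K02003", "ko:K02003,ko:K02004", "-", NA, "")
gene_list <- clean_ko(example_ko)
print(gene_list)
```

The resulting `gene_list` is what would be passed to `enrichKEGG(gene = gene_list, organism = "ko")`.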

Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes)

    #BiocManager::install("GO.db")
    #BiocManager::install("AnnotationDbi")

    # Load required libraries
    library(openxlsx)  # For Excel file handling
    library(dplyr)     # For data manipulation
    library(tidyr)
    library(stringr)
    library(clusterProfiler)  # For KEGG and GO enrichment analysis
    #library(org.Hs.eg.db)  # Replace with appropriate organism database
    library(GO.db)
    library(AnnotationDbi)

    setwd("~/DATA/Data_Tam_RNAseq_2025_subMIC_exposure_ATCC19606/results/star_salmon/degenes")
    # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
    # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
    #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
    #1,5c1
    #< ## Thu Jan 30 16:34:52 2025
    #< ## emapper-2.1.12
    #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
    #< ##
    #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
    #---
    #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
    #3620,3622d3615
    #< ## 3614 queries scanned
    #< ## Total time (seconds): 8.176708459854126

    # Step 1: Load the blast2go annotation file with a check for missing columns
    annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

    # The file was cut to its first two columns (see above), so keep exactly 2 columns:
    annot_df <- annot_df[, 1:2]
    colnames(annot_df) <- c("GeneID", "Term")
    # Step 2: Filter and aggregate GO and EC terms as before
    go_terms <- annot_df %>%
    filter(grepl("^GO:", Term)) %>%
    group_by(GeneID) %>%
    summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    ec_terms <- annot_df %>%
    filter(grepl("^EC:", Term)) %>%
    group_by(GeneID) %>%
    summarize(EC = paste(Term, collapse = ","), .groups = "drop")

    # Key Improvements:
    #    * Looped processing of all 6 input files to avoid redundancy.
    #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
    #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
    #    * Error handling for GO term descriptions via tryCatch.
    #    * Improved clarity and modular structure for easier maintenance and future additions.

    #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

    # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
    suppressPackageStartupMessages({
      library(readr)
      library(dplyr)
      library(stringr)
      library(tidyr)
      library(openxlsx)
      library(clusterProfiler)
      library(AnnotationDbi)
      library(GO.db)
    })

    # ---- PARAMETERS ----
    PADJ_CUT <- 5e-2
    LFC_CUT  <- 2

    # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
    emapper_path <- "~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606/eggnog_out.emapper.annotations.txt"

    # Input files (you can add/remove here)
    input_files <- c(

    "deltaadeIJ_none_17_vs_WT_none_17-all.csv",  #up 11, down 3 vs. (10,4)
    "deltaadeIJ_none_24_vs_WT_none_24-all.csv",  #up 0, down 2 vs. (0,2)
    "deltaadeIJ_one_17_vs_WT_one_17-all.csv",    #up 238, down 90 vs. (239,89)  --> height 2600
    "deltaadeIJ_one_24_vs_WT_one_24-all.csv",    #up 83, down 64 vs. (64,71) --> height 1600
    "deltaadeIJ_two_17_vs_WT_two_17-all.csv",    #up 74, down 14 vs. (75,9) --> height 1000
    "deltaadeIJ_two_24_vs_WT_two_24-all.csv",    #up 1, down 3 vs. (3,3)

    "WT_none_24_vs_WT_none_17-all.csv",  #(up 10, down 2)
    "WT_one_24_vs_WT_one_17-all.csv",    #(up 97, down 3)
    "WT_two_24_vs_WT_two_17-all.csv",    #(up 12, down 1)

    "deltaadeIJ_two_24_vs_deltaadeIJ_two_17-all.csv",   #(up 8, down 51)
    "deltaadeIJ_one_24_vs_deltaadeIJ_one_17-all.csv",   #(up 0, down 10)
    "deltaadeIJ_none_24_vs_deltaadeIJ_none_17-all.csv" #(up 0, down 0)

    )

    # ---- HELPERS ----
    # Robust reader (CSV first, then TSV)
    read_table_any <- function(path) {
      tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                    error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()),
                                                  error = function(e2) NULL))
      tb
    }

    # Return a nice Excel-safe base name
    xlsx_name_from_file <- function(path) {
      base <- tools::file_path_sans_ext(basename(path))
      paste0("DEG_KEGG_GO_", base, ".xlsx")
    }

    # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
    expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
      if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
      kdf <- as.data.frame(kegg_res)
      if (!"geneID" %in% names(kdf)) return(kdf)
      # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
      map_clean <- mapping_tbl %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
        tidyr::separate_rows(KEGG_ko, sep = ",") %>%
        distinct()

      if (!nrow(map_clean)) {
        return(kdf)
      }

      expanded <- kdf %>%
        tidyr::separate_rows(geneID, sep = "/") %>%
        dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
        distinct() %>%
        dplyr::group_by(ID) %>%
        dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

      kdf %>%
        dplyr::select(-geneID) %>%
        dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
        dplyr::rename(geneID = GeneID)
    }

    # ---- LOAD emapper annotations ----
    eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
    # Ensure character columns for joins
    eggnog_data$query   <- as.character(eggnog_data$query)
    eggnog_data$GOs     <- as.character(eggnog_data$GOs)
    eggnog_data$EC      <- as.character(eggnog_data$EC)
    eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

    # ---- MAIN LOOP ----
    for (f in input_files) {
      if (!file.exists(f)) { message("Missing: ", f); next }

      message("Processing: ", f)
      res <- read_table_any(f)
      if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

      # Coerce expected columns if present
      if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
      if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

      # Ensure GeneID & GeneName exist
      if (!"GeneID" %in% names(res)) {
        # Try to infer from a generic 'gene' column
        if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
      }
      if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

      # Fill missing GeneName from GeneID (drop "gene-")
      res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                            gsub("^gene-", "", as.character(res$GeneID)),
                            as.character(res$GeneName))

      # De-duplicate by GeneName, keep smallest padj
      if (!"padj" %in% names(res)) res$padj <- NA_real_
      res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup() %>%
        as.data.frame()

      # Sort by padj asc, then log2FC desc
      if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
      res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

      # Join emapper (strip "gene-" from GeneID to match emapper 'query')
      res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
      res_ann <- res %>%
        left_join(eggnog_data, by = c("GeneID_plain" = "query"))

      # --- Split by UP/DOWN using your volcano cutoffs ---
      up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
      down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

      # --- KEGG enrichment (using K numbers in KEGG_ko) ---
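      # Prepare KO lists: drop NA and "-", split comma-separated entries, remove "ko:" prefixes
      k_up <- up_regulated$KEGG_ko;   k_up <- k_up[!is.na(k_up) & k_up != "-"]
      k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn) & k_dn != "-"]
      k_up <- unique(gsub("ko:", "", unlist(strsplit(k_up, ","))))
      k_dn <- unique(gsub("ko:", "", unlist(strsplit(k_dn, ","))))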

      # BREAK_LINE

      kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
      kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

      # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
      kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
      kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

      # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
      # Background gene set = all genes in this comparison
      background_genes <- unique(res_ann$GeneID_plain)
      # TERM2GENE table (GO -> GeneID_plain)
      go_annotation <- res_ann %>%
        dplyr::select(GeneID_plain, GOs) %>%
        mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
        tidyr::separate_rows(GOs, sep = ",") %>%
        filter(GOs != "") %>%
        dplyr::select(GOs, GeneID_plain) %>%
        distinct()

      # Gene lists for GO enricher
      go_list_up   <- unique(up_regulated$GeneID_plain)
      go_list_down <- unique(down_regulated$GeneID_plain)

      go_up <- tryCatch(
        enricher(gene = go_list_up, TERM2GENE = go_annotation,
                pvalueCutoff = 0.05, pAdjustMethod = "BH",
                universe = background_genes),
        error = function(e) NULL
      )
      go_down <- tryCatch(
        enricher(gene = go_list_down, TERM2GENE = go_annotation,
                pvalueCutoff = 0.05, pAdjustMethod = "BH",
                universe = background_genes),
        error = function(e) NULL
      )

      go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
      go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

      # Add GO term descriptions via GO.db (best-effort)
      add_go_term_desc <- function(df) {
        if (!nrow(df) || !"ID" %in% names(df)) return(df)
        df$Description <- sapply(df$ID, function(go_id) {
          term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                columns = "TERM", keytype = "GOID"),
                          error = function(e) NULL)
          if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
        })
        df
      }
      go_up_df   <- add_go_term_desc(go_up_df)
      go_down_df <- add_go_term_desc(go_down_df)

      # ---- Write Excel workbook ----
      out_xlsx <- xlsx_name_from_file(f)
      wb <- createWorkbook()

      addWorksheet(wb, "Complete")
      writeData(wb, "Complete", res_ann)

      addWorksheet(wb, "Up_Regulated")
      writeData(wb, "Up_Regulated", up_regulated)

      addWorksheet(wb, "Down_Regulated")
      writeData(wb, "Down_Regulated", down_regulated)

      addWorksheet(wb, "KEGG_Enrichment_Up")
      writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

      addWorksheet(wb, "KEGG_Enrichment_Down")
      writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

      addWorksheet(wb, "GO_Enrichment_Up")
      writeData(wb, "GO_Enrichment_Up", go_up_df)

      addWorksheet(wb, "GO_Enrichment_Down")
      writeData(wb, "GO_Enrichment_Down", go_down_df)

      saveWorkbook(wb, out_xlsx, overwrite = TRUE)
      message("Saved: ", out_xlsx)
    }

    # -------------------------------- OLD_CODE not automatized with loop ----------------------------
    # Load the results (run exactly ONE of the following read.csv lines per pass; each line overwrites res)
    res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
    res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

    res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
    res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
    res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
    res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

    res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
    res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
        res$GeneName == "" | is.na(res$GeneName),
        gsub("gene-", "", res$GeneID),
        res$GeneName
    )

    # Remove duplicated genes by selecting the gene with the smallest padj
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

    res <- res %>%
    group_by(GeneName) %>%
    slice_min(padj, with_ties = FALSE) %>%
    ungroup()

    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    # Read eggnog annotations
    eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
    # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
    res$GeneID <- gsub("gene-", "", res$GeneID)
    # Merge eggnog data with res based on GeneID
    res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

    # Merge with the res dataframe
    # Perform the left joins and rename columns
    res_updated <- res %>%
    left_join(go_terms, by = "GeneID") %>%
    left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

    # Filter up-regulated genes (which() drops rows where padj or log2FoldChange is NA)
    up_regulated <- res_updated[which(res_updated$log2FoldChange > 2 & res_updated$padj < 0.05), ]
    # Filter down-regulated genes
    down_regulated <- res_updated[which(res_updated$log2FoldChange < -2 & res_updated$padj < 0.05), ]

    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet (with annotations)
    addWorksheet(wb, "Complete")
    writeData(wb, "Complete", res_updated)
    # Add the up-regulated genes as the second sheet (with annotations)
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet (with annotations)
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
    #NOTE: The generated annotation files contain all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) + almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs), except for [GOs, EC] + two columns from Blast2GO (GOs, EC). In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

    #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

    # Set GeneName as row names after the join
    rownames(res_updated) <- res_updated$GeneName
    res_updated <- res_updated %>% dplyr::select(-GeneName)
    ## Set the 'GeneName' column as row.names
    #rownames(res_updated) <- res_updated$GeneName
    ## Drop the 'GeneName' column since it's now the row names
    #res_updated$GeneName <- NULL
    # -- BREAK_1 --

    # ---- Perform KEGG enrichment analysis (up_regulated) ----
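    gene_list_kegg_up <- up_regulated$KEGG_ko
    # Split comma-separated KO entries, strip the "ko:" prefix, and drop missing values ("-", NA)
    gene_list_kegg_up <- unique(gsub("ko:", "", unlist(strsplit(as.character(gene_list_kegg_up), ","))))
    gene_list_kegg_up <- gene_list_kegg_up[!is.na(gene_list_kegg_up) & gene_list_kegg_up != "-"]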
    kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    # -- convert the GeneID (Kxxxxxx) to the true GeneID --
    # Step 0: Create KEGG to GeneID mapping
    kegg_to_geneid_up <- up_regulated %>%
    dplyr::select(KEGG_ko, GeneID) %>%
    filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
    mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
    kegg_to_geneid_clean <- kegg_to_geneid_up %>%
    mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
    separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
    filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
    distinct()  # Remove any duplicate mappings
    # Step 2.1: Expand geneID column in kegg_enrichment_up
    expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
    distinct() %>%  # Remove duplicate matches
    group_by(ID) %>%
    summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
    #dplyr::glimpse(expanded_kegg)
    # Step 3.1: Replace geneID column in the original dataframe
    kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
    # Remove old geneID column and merge new one
    kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

    # ---- Perform KEGG enrichment analysis (down_regulated) ----
    # Step 1: Extract KEGG KO terms from down-regulated genes
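    gene_list_kegg_down <- down_regulated$KEGG_ko
    # Split comma-separated KO entries, strip the "ko:" prefix, and drop missing values ("-", NA)
    gene_list_kegg_down <- unique(gsub("ko:", "", unlist(strsplit(as.character(gene_list_kegg_down), ","))))
    gene_list_kegg_down <- gene_list_kegg_down[!is.na(gene_list_kegg_down) & gene_list_kegg_down != "-"]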
    # Step 2: Perform KEGG enrichment analysis
    kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
    # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
    # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
    kegg_to_geneid_down <- down_regulated %>%
    dplyr::select(KEGG_ko, GeneID) %>%
    filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
    mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    # -- BREAK_2 --

    # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
    kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
    mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
    separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
    filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
    distinct()  # Remove duplicate mappings

    # Step 5: Expand geneID column in kegg_enrichment_down
    expanded_kegg_down <- kegg_enrichment_down %>%
    as.data.frame() %>%
    separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
    left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
    distinct() %>%  # Remove duplicate matches
    group_by(ID) %>%
    summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

    # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
    kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
    dplyr::select(-geneID) %>%  # Remove old geneID column
    left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
    dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    # View the updated dataframe
    head(kegg_enrichment_down_df)

    # Create a new workbook
    #wb <- createWorkbook()
    # Save enrichment results to the workbook
    addWorksheet(wb, "KEGG_Enrichment_Up")
    writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
    # Save enrichment results to the workbook
    addWorksheet(wb, "KEGG_Enrichment_Down")
    writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

    # Define gene list (up-regulated genes)
    gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
    gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

    # Define background gene set (all genes in res)
    background_genes <- res_updated$GeneID  # Extract the 3646 background genes

    # Prepare GO annotation data from res
    go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
    go_annotation <- go_annotation %>%
    tidyr::separate_rows(GOs, sep = ",") %>%  # Split multiple GO terms into separate rows
    dplyr::filter(!is.na(GOs), GOs != "", GOs != "-")  # Drop genes without a GO annotation
    # -- BREAK_3 --

    go_enrichment_up <- enricher(
        gene = gene_list_go_up,                # Up-regulated genes
        TERM2GENE = go_annotation,       # Custom GO annotation
        pvalueCutoff = 0.05,             # Significance threshold
        pAdjustMethod = "BH",
        universe = background_genes      # Define the background gene set
    )
    go_enrichment_up <- as.data.frame(go_enrichment_up)

    go_enrichment_down <- enricher(
        gene = gene_list_go_down,                # Down-regulated genes
        TERM2GENE = go_annotation,       # Custom GO annotation
        pvalueCutoff = 0.05,             # Significance threshold
        pAdjustMethod = "BH",
        universe = background_genes      # Define the background gene set
    )
    go_enrichment_down <- as.data.frame(go_enrichment_down)

    ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

    # Update the Description column with the term descriptions
    go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
    # Using select to get the term description
    term <- tryCatch({
        AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    }, error = function(e) {
        message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
        return(data.frame(TERM = NA))  # In case of error, return NA
    })
    if (nrow(term) > 0) {
        return(term$TERM)
    } else {
        return(NA)  # If no description found, return NA
    }
    })
    ## Print the updated data frame
    #print(go_enrichment_up)

    ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    # Update the Description column with the term descriptions
    go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
    # Using select to get the term description
    term <- tryCatch({
        AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    }, error = function(e) {
        message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
        return(data.frame(TERM = NA))  # In case of error, return NA
    })

    if (nrow(term) > 0) {
        return(term$TERM)
    } else {
        return(NA)  # If no description found, return NA
    }
    })

    addWorksheet(wb, "GO_Enrichment_Up")
    writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

    addWorksheet(wb, "GO_Enrichment_Down")
    writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

    # Save the workbook with enrichment results
    saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

    #Error for GO term: GO:0006807: replace "GO:0006807 obsolete nitrogen compound metabolic process"
    #obsolete nitrogen compound metabolic process #https://www.ebi.ac.uk/QuickGO/term/GO:0006807
    #TODO: highlight rows in yellow where p.adjust <= 0.05 in the GO enrichment sheets!

    #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
    #Mac_vs_LB
    #LB.AB_vs_LB.WT19606
    #LB.IJ_vs_LB.WT19606
    #LB.W1_vs_LB.WT19606
    #LB.Y1_vs_LB.WT19606
    #Mac.AB_vs_Mac.WT19606
    #Mac.IJ_vs_Mac.WT19606
    #Mac.W1_vs_Mac.WT19606
    #Mac.Y1_vs_Mac.WT19606

    #TODO: in the reply, point out that KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.
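
For interpreting the GO_Enrichment_* sheets: enricher() performs an over-representation analysis based on the hypergeometric distribution. The sketch below reproduces that p-value in base R for a single term. All counts are made-up toy numbers, not results from this dataset; for a real term they would be taken from the GeneRatio and BgRatio columns of the enricher() output, since clusterProfiler may restrict the universe to annotated genes.

```r
# Toy counts (made up for illustration only)
N <- 3000  # background genes (universe)
M <- 40    # background genes annotated with the GO term
n <- 150   # genes in the DEG list (e.g. up-regulated)
k <- 8     # genes in the DEG list annotated with the GO term

# Over-representation p-value: P(X >= k) under the hypergeometric distribution
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Cross-check: the one-sided Fisher exact test on the 2x2 table gives the same value
tab <- matrix(c(k, M - k, n - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Benjamini-Hochberg adjustment over several terms (second and third p-values are placeholders)
p_adj <- p.adjust(c(p_hyper, 0.20, 0.70), method = "BH")

print(c(p_hyper = p_hyper, p_fisher = p_fisher))
print(p_adj)
```

The BH step corresponds to pAdjustMethod = "BH" in the enricher() calls above; the adjusted values are never smaller than the raw p-values.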

(DEBUG) Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

        library(openxlsx)

        # Function to read and clean gene ID files
        read_gene_ids <- function(file_path) {
        # Read the gene IDs from the file
        gene_ids <- readLines(file_path)

        # Remove any quotes and trim whitespaces
        gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
        gene_ids <- trimws(gene_ids)  # Trim whitespaces

        # Remove empty entries or NAs
        gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

        return(gene_ids)
        }

        # Example list of LB files with both -up.id and -down.id for each condition
        lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                        "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
        lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                        "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

        # Combine both up and down files for each condition
        lb_files <- c(lb_files_up, lb_files_down)

        # Read gene IDs for each file in LB group
        #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
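        # Strip the "-up.id"/"-down.id" suffix (dot escaped, anchored at the end); make.unique() appends ".1" to the down sets
        lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down)\\.id$", "", lb_files)))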

        lb_degs_ <- list()
        combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
        #unique_combined_set <- unique(combined_set)
        lb_degs_$AB <- combined_set
        combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
        lb_degs_$IJ <- combined_set
        combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
        lb_degs_$W1 <- combined_set
        combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
        lb_degs_$Y1 <- combined_set

        # Example list of Mac files with both -up.id and -down.id for each condition
        mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                        "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
        mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                        "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

        # Combine both up and down files for each condition in Mac group
        mac_files <- c(mac_files_up, mac_files_down)

        # Read gene IDs for each file in Mac group
        mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down)\\.id$", "", mac_files)))

        mac_degs_ <- list()
        combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
        mac_degs_$AB <- combined_set
        combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
        mac_degs_$IJ <- combined_set
        combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
        mac_degs_$W1 <- combined_set
        combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
        mac_degs_$Y1 <- combined_set

        # Function to clean sheet names to ensure no sheet name exceeds 31 characters
        truncate_sheet_name <- function(names_list) {
        sapply(names_list, function(name) {
        if (nchar(name) > 31) {
        return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
        }
        return(name)
        })
        }

        # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

        # Define intersections between different conditions for LB
        inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
        inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
        inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
        inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
        inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
        inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

        # Define intersections between three conditions for LB
        inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
        inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
        inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
        inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

        # Define intersection between all four conditions for LB
        inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

        # Now remove the intersected genes from each original set for LB
        venn_list_lb <- list()

        # For LB.AB, remove genes that are also in other conditions
        venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

        # For LB.IJ, remove genes that are also in other conditions
        venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

        # For LB.W1, remove genes that are also in other conditions
        venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_w1_y1)))

        # For LB.Y1, remove genes that are also in other conditions
        venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_w1_y1)))

        # Add the intersections for LB (same as before)
        venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
        venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
        venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
        venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
        venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
        venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

        # Define intersections between three conditions for LB
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
        venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
        venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

        # Define intersection between all four conditions for LB
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

        # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

        # Define intersections between different conditions
        inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
        inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
        inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
        inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
        inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
        inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

        # Define intersections between three conditions
        inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
        inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
        inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
        inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

        # Define intersection between all four conditions
        inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

        # Now remove the intersected genes from each original set
        venn_list_mac <- list()

        # For Mac.AB, remove genes that are also in other conditions
        venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

        # For Mac.IJ, remove genes that are also in other conditions
        venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

        # For Mac.W1, remove genes that are also in other conditions
        venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_w1_y1)))

        # For Mac.Y1, remove genes that are also in other conditions
        venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_w1_y1)))

        # Add the intersections (same as before)
        venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
        venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
        venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
        venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
        venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
        venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

        # Define intersections between three conditions
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
        venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
        venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

        # Define intersection between all four conditions
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

        # Save the gene IDs to Excel for further inspection (optional)
        write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
        write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

        # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
        write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
        write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

        # Venn Diagram for LB group
        venn1 <- ggvenn(lb_degs_,
                        fill_color = c("skyblue", "tomato", "gold", "orchid"),
                        stroke_size = 0.4,
                        set_name_size = 5)
        ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

        # Venn Diagram for Mac group
        venn2 <- ggvenn(mac_degs_,
                        fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                        stroke_size = 0.4,
                        set_name_size = 5)
        ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

        cat("✅ All Venn intersection sets exported to Excel successfully.\n")

Processing Data_Michelle_RNAseq_2025 including automatizing DEG(Annotated)_KEGG_GO_* and splitting DEG-Heatmaps

Targets

The experiment we did so far:
I have two strains:
1. 1457 wildtype
2. 1457Δsbp (sbp knock out strain)

I have grown these two strains in two media for 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase), 18h (mature biofilm phase) respectively
1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth visible (sbp knockout shows less biofilm formation and a growth deficit)
2. medium MH -> nutrient-poor medium: differences from the wild type more obvious (sbp knockout shows stronger growth deficit)

Our idea/hypothesis of what we hope to achieve with the RNA-Seq:
Since we already see differences in growth and biofilm formation and also differences in the proteome (through cooperation with mass spectrometry), we also expect differences in the transcription of the genes in the RNA-Seq. Could you analyze the RNA-Seq data for me and compare the strains at the different time points? But maybe also compare the different time points of one strain with each other?
The following would be interesting for me:
- PCA plot (sample comparison)
- Heatmaps (wild type vs. sbp knockout)
- Volcano plots (significant genes)
- Gene Ontology (GO) analyses

Download the raw data

Mail from BGI (RNA-Seq institute):
The data from project F25A430000603 are uploaded to AWS.
Please download the data as below:
URL：https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
Project：F25A430000603-01-STAkimxP
Alias ID：598731762349
S3 Bucket：stakimxp-598731762349
Account：stakimxp
Password：qR0'A7[o9Ql|
Region：eu-central-1
Aws_access_key_id：AKIAYWZZRVKW72S4SCPG
Aws_secret_access_key：fo5ousM4ThvsRrOFVuxVhGv2qnzf+aiDZTmE3aho

aws s3 cp s3://stakimxp-598731762349/ ./ --recursive

cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
# rsync template: rsync -avzP /local/dir/ user@remote:/remote/dir/
rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Prepare raw data

mkdir raw_data; cd raw_data

#Δsbp->deltasbp
#1457.1_2h_MH,WT,MH,2h,1
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
#1457dsbp118h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
#1457dsbp218h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz

#1457.1_2h_
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz  WT_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz  WT_TSB_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz  WT_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz  WT_TSB_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz  WT_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz  WT_TSB_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz  WT_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz  WT_TSB_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz  WT_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz  WT_TSB_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz  WT_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz  WT_TSB_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz  WT_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz  WT_TSB_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz  WT_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz  WT_TSB_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz  WT_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz  WT_TSB_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
#1457dsbp1_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
#1457dsbp2_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
#1457dsbp3_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
#END
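
The links above follow one pattern per sample, so they could also be generated from a small mapping table. The following is a minimal sketch, assuming the BGI directory layout used above; the helper names link_sample and link_from_table and the two-column mapping format are hypothetical and not part of the original workflow.

```shell
# link_sample SRC_DIR BGI_NAME NEW_NAME   (hypothetical helper)
# Creates NEW_NAME_R1.fastq.gz and NEW_NAME_R2.fastq.gz in the current
# directory as symlinks to SRC_DIR/BGI_NAME/BGI_NAME_1.fq.gz and ..._2.fq.gz.
link_sample() {
    src_dir=$1
    bgi_name=$2
    new_name=$3
    for mate in 1 2; do
        target="${src_dir}/${bgi_name}/${bgi_name}_${mate}.fq.gz"
        if [ ! -e "$target" ]; then
            echo "missing: $target" >&2
            return 1
        fi
        ln -sf "$target" "${new_name}_R${mate}.fastq.gz"
    done
}

# link_from_table SRC_DIR   (hypothetical helper)
# Reads whitespace-separated "BGI_NAME NEW_NAME" pairs from stdin.
link_from_table() {
    while read -r bgi_name_in new_name_in; do
        [ -n "$bgi_name_in" ] || continue   # skip empty lines
        link_sample "$1" "$bgi_name_in" "$new_name_in" || return 1
    done
}

# Example (run inside raw_data/):
# link_from_table ../F25A430000603-01_STAkimxP <<'EOF'
# 1457.1_2h_MH    WT_MH_2h_1
# 1457dsbp118h_MH deltasbp_MH_18h_1
# EOF
```

Keeping the BGI name and the new name side by side in a table makes irregular directory names such as 1457dsbp118h_MH easy to spot, and the existence check stops at the first mistyped path instead of creating a dangling link.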

Preparing the directory trimmed

mkdir trimmed trimmed_unpaired;
for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3  deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 deltasbp_MH_18h_1 deltasbp_MH_18h_2 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20
done 2> trimmomatic_pe.log
mv trimmed/*.fastq.gz .
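
Rather than typing all 35 sample IDs into the loop, the list could be derived from the files in raw_data. The following is a minimal sketch, assuming the *_R1.fastq.gz / *_R2.fastq.gz naming set up above; list_samples is a hypothetical helper name.

```shell
# list_samples DIR   (hypothetical helper)
# Prints one sample ID per line for every DIR/<id>_R1.fastq.gz that has a
# matching DIR/<id>_R2.fastq.gz; warns on stderr when the R2 file is missing.
list_samples() {
    dir=$1
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue   # glob matched nothing, or a dangling symlink
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ -e "${dir}/${sample}_R2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            printf 'warning: no R2 file for %s\n' "$sample" >&2
        fi
    done
}

# Example:
# for sample_id in $(list_samples raw_data); do echo "$sample_id"; done
```

Because unpaired samples are reported on stderr instead of being passed on, a missing R2 file shows up before Trimmomatic is started.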

Preparing samplesheet.csv

sample,fastq_1,fastq_2,strandedness
WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
...
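
The remaining rows have the same shape as the first one, so the sheet could be written by a small loop. The following is a minimal sketch, assuming the four-column header shown above and FASTQ files in the current directory; make_samplesheet is a hypothetical helper name.

```shell
# make_samplesheet   (hypothetical helper)
# Reads sample IDs from stdin (one per line) and prints a samplesheet with the
# columns sample,fastq_1,fastq_2,strandedness; strandedness is always "auto".
make_samplesheet() {
    printf 'sample,fastq_1,fastq_2,strandedness\n'
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue   # skip empty lines
        printf '%s,%s_R1.fastq.gz,%s_R2.fastq.gz,auto\n' "$sample" "$sample" "$sample"
    done
}

# Example:
# printf '%s\n' WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 | make_samplesheet > samplesheet.csv
```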

nextflow run

#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP020463.gff) --
grep -P "\texon\t" CP020463.gff | sort | wc -l    #=81
grep -P "cmsearch\texon\t" CP020463.gff | wc -l   #=11  signal recognition particle sRNA small type, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP020463.gff | wc -l    #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l    #tRNA 58
grep -P "\tCDS\t" CP020463.gff | wc -l  #3701-->2324
sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
grep -P "\texon\t" CP020463_m.gff | sort | wc -l  #3797-->2405
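
The sed command above rewrites every tab-delimited CDS token. An awk variant that only touches the feature-type column (column 3) and counts features before and after could look as follows. This is a minimal sketch, assuming a tab-separated GFF3 file; cds_to_exon and count_type are hypothetical helper names.

```shell
# cds_to_exon IN.gff OUT.gff   (hypothetical helper)
# Copies IN to OUT, replacing the feature type "CDS" with "exon" in column 3
# only; comment lines and all other columns are left untouched.
cds_to_exon() {
    awk 'BEGIN { FS = OFS = "\t" } $3 == "CDS" { $3 = "exon" } { print }' "$1" > "$2"
}

# count_type FILE.gff TYPE   (hypothetical helper)
# Prints the number of lines whose column 3 equals TYPE (0 if there are none).
count_type() {
    awk -F '\t' -v t="$2" '$3 == t { n++ } END { print n + 0 }' "$1"
}

# Example:
# cds_to_exon CP020463.gff CP020463_m.gff
# count_type CP020463.gff CDS
# count_type CP020463_m.gff exon
```

After the conversion, the exon count of the modified file should equal the sum of the CDS and exon counts of the original file, which is the same check the grep | wc -l lines above perform by hand.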

# -- NOTE: combining 'CP020463_m.gff' with the feature type 'exon' on the command line results in an ERROR; use 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the FASTA header matches the sequence ID in the *_m.gff file; both should be "CP020463.1"
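
The DEBUG_3 check can be scripted so that it runs before every nextflow launch. The following is a minimal sketch, assuming a tab-separated GFF3 file with nine columns; fasta_ids, gff_seqids and check_seqids are hypothetical helper names.

```shell
# fasta_ids FASTA: sequence IDs, i.e. the text after '>' up to the first whitespace.
fasta_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | sort -u
}

# gff_seqids GFF: distinct values of column 1 on non-comment feature lines.
gff_seqids() {
    awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | sort -u
}

# check_seqids FASTA GFF   (hypothetical helper)
# Succeeds only when both files contain exactly the same set of sequence IDs.
check_seqids() {
    ids_fa=$(fasta_ids "$1")
    ids_gff=$(gff_seqids "$2")
    if [ -n "$ids_fa" ] && [ "$ids_fa" = "$ids_gff" ]; then
        printf 'OK: %s\n' "$ids_fa"
    else
        printf 'MISMATCH: fasta=[%s] gff=[%s]\n' "$ids_fa" "$ids_gff" >&2
        return 1
    fi
}

# Example:
# check_seqids CP020463.fasta CP020463_m.gff
```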

Import data and pca-plot

#mamba activate r_env

#install.packages("ggfun")
# Import the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
# install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)

setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")
# Define paths to your Salmon output quantification files
files <- c(
        "deltasbp_MH_2h_r1" = "./deltasbp_MH_2h_1/quant.sf",
        "deltasbp_MH_2h_r2" = "./deltasbp_MH_2h_2/quant.sf",
        "deltasbp_MH_2h_r3" = "./deltasbp_MH_2h_3/quant.sf",
        "deltasbp_MH_4h_r1" = "./deltasbp_MH_4h_1/quant.sf",
        "deltasbp_MH_4h_r2" = "./deltasbp_MH_4h_2/quant.sf",
        "deltasbp_MH_4h_r3" = "./deltasbp_MH_4h_3/quant.sf",
        "deltasbp_MH_18h_r1" = "./deltasbp_MH_18h_1/quant.sf",
        "deltasbp_MH_18h_r2" = "./deltasbp_MH_18h_2/quant.sf",
        "deltasbp_TSB_2h_r1" = "./deltasbp_TSB_2h_1/quant.sf",
        "deltasbp_TSB_2h_r2" = "./deltasbp_TSB_2h_2/quant.sf",
        "deltasbp_TSB_2h_r3" = "./deltasbp_TSB_2h_3/quant.sf",
        "deltasbp_TSB_4h_r1" = "./deltasbp_TSB_4h_1/quant.sf",
        "deltasbp_TSB_4h_r2" = "./deltasbp_TSB_4h_2/quant.sf",
        "deltasbp_TSB_4h_r3" = "./deltasbp_TSB_4h_3/quant.sf",
        "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
        "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
        "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
        "WT_MH_2h_r1" = "./WT_MH_2h_1/quant.sf",
        "WT_MH_2h_r2" = "./WT_MH_2h_2/quant.sf",
        "WT_MH_2h_r3" = "./WT_MH_2h_3/quant.sf",
        "WT_MH_4h_r1" = "./WT_MH_4h_1/quant.sf",
        "WT_MH_4h_r2" = "./WT_MH_4h_2/quant.sf",
        "WT_MH_4h_r3" = "./WT_MH_4h_3/quant.sf",
        "WT_MH_18h_r1" = "./WT_MH_18h_1/quant.sf",
        "WT_MH_18h_r2" = "./WT_MH_18h_2/quant.sf",
        "WT_MH_18h_r3" = "./WT_MH_18h_3/quant.sf",
        "WT_TSB_2h_r1" = "./WT_TSB_2h_1/quant.sf",
        "WT_TSB_2h_r2" = "./WT_TSB_2h_2/quant.sf",
        "WT_TSB_2h_r3" = "./WT_TSB_2h_3/quant.sf",
        "WT_TSB_4h_r1" = "./WT_TSB_4h_1/quant.sf",
        "WT_TSB_4h_r2" = "./WT_TSB_4h_2/quant.sf",
        "WT_TSB_4h_r3" = "./WT_TSB_4h_3/quant.sf",
        "WT_TSB_18h_r1" = "./WT_TSB_18h_1/quant.sf",
        "WT_TSB_18h_r2" = "./WT_TSB_18h_2/quant.sf",
        "WT_TSB_18h_r3" = "./WT_TSB_18h_3/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
# Define the replicates and condition of the samples
replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
condition <- factor(c("deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_4h","deltasbp_MH_18h","deltasbp_MH_18h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_4h","deltasbp_TSB_18h","deltasbp_TSB_18h","deltasbp_TSB_18h","WT_MH_2h","WT_MH_2h","WT_MH_2h","WT_MH_4h","WT_MH_4h","WT_MH_4h","WT_MH_18h","WT_MH_18h","WT_MH_18h","WT_TSB_2h","WT_TSB_2h","WT_TSB_2h","WT_TSB_4h","WT_TSB_4h","WT_TSB_4h","WT_TSB_18h","WT_TSB_18h","WT_TSB_18h"))

sample_table <- data.frame(
    condition = condition,
    replicate = replicate,
    row.names = names(files)  # sample names, so the colData rows match the columns of txi
)
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("strain", "media", "time")
colData <- cbind(sample_table, split_cond)
colData$strain <- factor(colData$strain)
colData$media  <- factor(colData$media)
colData$time   <- factor(colData$time)
#colData$group  <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
# Define the colData for DESeq2
#colData <- data.frame(condition=condition, row.names=names(files))

#grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
#grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

# ------------------------
# 1️⃣ Setup and input files
# ------------------------

# Read in transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")

# Prepare tx2gene for gene-level summarization (remove gene_name if needed)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# -------------------------------
# 2️⃣ Transcript-level counts
# -------------------------------
# Create DESeqDataSet directly from tximport (transcript-level)
dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds_tx), file="transcript_counts.csv")

# --------------------------------
# 3️⃣ Gene-level summarization
# --------------------------------
# Re-import Salmon data summarized at gene level
txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)

# Create DESeqDataSet for gene-level counts
#dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media + strain + media:strain + strain:time)
#Or, written more simply (recommended): dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ time + media * strain)
#dds <- DESeqDataSetFromTximport(txi, colData = colData, design = ~ strain * media * time)
#~ strain * media * time    main effects + all interactions (recommended)  ✅
#~ time + media * strain    main effects + media:strain interaction   ⚠️ limited

# --------------------------------
# 4️⃣ Raw counts table (with gene names)
# --------------------------------
# Extract raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)

# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)

# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# --------------------------------
# 5️⃣ Calculate CPM
# --------------------------------
library(edgeR)
library(openxlsx)

# Prepare count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])

# Calculate CPM
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)

# Add gene_id and gene_name back to CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# --------------------------------
# 6️⃣ Save outputs
# --------------------------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx", rowNames=FALSE)
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx", rowNames=FALSE)

# -- (Optional) Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- pca --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()
# -- heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
dev.off()

# -- pca_media_strain --
png("pca_media.png", 1200, 800)
plotPCA(rld, intgroup=c("media"))
dev.off()
png("pca_strain.png", 1200, 800)
plotPCA(rld, intgroup=c("strain"))
dev.off()
png("pca_time.png", 1200, 800)
plotPCA(rld, intgroup=c("time"))
dev.off()

(Optional; ERROR --> needs to be debugged!) Estimate size factors and dispersion values

#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed.
#Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes.
#So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression).

sizeFactors(dds)
#NULL
# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)
#> sizeFactors(dds)

#control_r1 control_r2  HSV.d2_r1  HSV.d2_r2  HSV.d4_r1  HSV.d4_r2  HSV.d6_r1
#2.3282468  2.0251928  1.8036883  1.3767551  0.9341929  1.0911693  0.5454526
#HSV.d6_r2  HSV.d8_r1  HSV.d8_r2
#0.4604461  0.5799834  0.6803681

# (DEBUG) If avgTxLength is Necessary
#To simplify the computation and ensure sizeFactors are calculated:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
#If you want to retain avgTxLength, note that estimateSizeFactors() then stores gene-specific normalization factors instead of size factors, so sizeFactors(dds) stays NULL; inspect them with normalizationFactors(dds) instead:
#dds <- estimateSizeFactors(dds)
#head(normalizationFactors(dds))

# If only the virus data are used, the following BUG occurred:
#Still NULL --> BUG --> using the manual calculation method for the size factors!
#                    HeLa_TO_r1                      HeLa_TO_r2
#                    0.9978755                       1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

#Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower sequencing depth (total read count) than the other samples. Therefore, when normalization methods are applied, the normalization factors for these control samples will be relatively high, boosting the normalized counts.
#1/0.9978755=1.002129023
#1/1.1092227=0.901532217
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
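
The --scaleFactor values above are the reciprocals of the DESeq2 size factors. They could be computed on the command line instead of by hand. The following is a minimal sketch, assuming awk is available; inv_size_factor is a hypothetical helper name.

```shell
# inv_size_factor SIZE_FACTOR   (hypothetical helper)
# Prints 1 / SIZE_FACTOR with nine decimal places; fails for values <= 0.
inv_size_factor() {
    awk -v sf="$1" 'BEGIN {
        if (sf + 0 <= 0) {
            print "size factor must be > 0" > "/dev/stderr"
            exit 1
        }
        printf "%.9f\n", 1 / sf
    }'
}

# Example:
# inv_size_factor 0.9978755   # prints 1.002129023
# inv_size_factor 1.1092227   # prints 0.901532217
```

The printed value can be passed directly to bamCoverage, for example as --scaleFactor "$(inv_size_factor 0.9978755)".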

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Geometric mean, computed on the log scale to avoid overflow of prod()
    geomMean <- function(x) exp(mean(log(x)))
    # Compute the geometric mean of each row (gene)
    gm.mean  <-  apply(cts, 1, geomMean)
    # Zero values are set to NA (avoid subsequent division by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factor
    return(med)
}
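
To see what the median-of-ratios function above computes, here is a small worked sketch on a made-up count matrix (gene and sample names are hypothetical). It uses the log form of the geometric mean, which is an assumption of this sketch rather than the code above, chosen because it avoids overflow of prod() on large counts:

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns).
# s2 has twice the depth of s1, s3 half the depth of s1; g4 contains a zero.
cts <- matrix(c( 10,  20,  5,
                100, 200, 50,
                 30,  60, 15,
                  0,   4,  2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("g", 1:4), c("s1", "s2", "s3")))

# Geometric mean per gene via logs; a zero count gives -Inf
log_gm <- rowMeans(log(cts))
use <- is.finite(log_gm)   # drop genes with a zero count in any sample

# Size factor = median over genes of (count / geometric mean)
sf <- apply(cts[use, , drop = FALSE], 2,
            function(col) median(col / exp(log_gm[use])))
print(sf)
```

The three usable genes all have the ratio pattern 1 : 2 : 0.5, so the size factors recover exactly the relative depths of the three samples.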
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2’s median of ratios [1]
#EdgeR’s trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  #very good website!
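# Sanity check: dividing each column of the raw counts by its size factor should reproduce the normalized counts
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)  # 0 means the two matrices are identical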

Select the differentially expressed genes

#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/
dds$condition
[1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
[5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
[9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
[13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
[17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
[21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
[25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
[29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
[33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

#CONSOLE: mkdir star_salmon/degenes

setwd("degenes")

# Set the reference level of each factor (optional)
colData$strain <- relevel(factor(colData$strain), ref = "WT")
colData$media  <- relevel(factor(colData$media), ref = "TSB")
colData$time   <- relevel(factor(colData$time), ref = "2h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
#[1] "Intercept"                      "strain_deltasbp_vs_WT"
#[3] "media_MH_vs_TSB"                "time_18h_vs_2h"
#[5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
#[7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
#[9] "mediaMH.time18h"                "mediaMH.time4h"
#[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

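🔹 Main effects for each factor:

Expression
▲
│       ┌────── WT-TSB
│      /
│     /     ┌────── WT-MH
│    /     /
│   /     /     ┌────── deltasbp-TSB
│  /     /     /
│ /     /     /     ┌────── deltasbp-MH
└──────────────────────────────▶ Time (2h, 4h, 18h)

    * strain_deltasbp_vs_WT
    * media_MH_vs_TSB
    * time_18h_vs_2h
    * time_4h_vs_2h

Because the design contains interaction terms, each of these coefficients is the effect of one factor at the reference levels of the other two (WT, TSB, 2h), not an effect averaged over all conditions.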

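🔹 Two-way interactions
These terms describe the combined effect of two experimental factors (strain, media, time): the effect of one factor depends on the level of the other.

Expression
▲
│
│             WT ────────┐
│                        └─↘
│                           ↘
│                        deltasbp ←←←← significant interaction (different direction/magnitude)
└──────────────────────────────▶ Time

straindeltasbp.mediaMH
Interaction between strain and media.
➤ The deltasbp mutant behaves differently in MH than in TSB, and this difference cannot be explained by the separate strain and media effects alone.

straindeltasbp.time18h
Interaction between strain and time (18h).
➤ The expression change of the mutant at 18h is not simply the sum of the strain effect and the time effect; there is an additional combined effect.

straindeltasbp.time4h
As above: the interaction between strain and time (4h).

mediaMH.time18h
Interaction between media (MH) and time (18h).
➤ The change in expression between 2h and 18h in MH differs from the corresponding change in TSB, and it cannot be fully explained by the separate time and media effects.

mediaMH.time4h
Similar to the above: the interaction between MH medium and the 4h time point.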

🔹 Three-way interactions
A three-way interaction means that strain, media and time together produce an effect that cannot be fully explained by any combination of two factors.

Expression (TSB)
▲
│
│        WT ──────→→
│        deltasbp ─────→→
└────────────────────────▶ Time (2h, 4h, 18h)

Expression (MH)
▲
│
│        WT ──────→→
│        deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
└────────────────────────▶ Time (2h, 4h, 18h)

straindeltasbp.mediaMH.time18h
Interaction between strain × media × time (18h).
➤ The 18h expression pattern of the mutant in MH differs from that of the other combinations (for example WT in MH, or in TSB).

straindeltasbp.mediaMH.time4h
As above, but for the three-way interaction at 4h.

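✅ Summary:
The presence of interaction terms means that changes in gene expression cannot be explained by the effect of a single variable (strain, time or media) alone; the combinations of factors must be considered as well. In the DESeq2 model, the significance of these interaction terms can reveal whether there is condition-specific regulation.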
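
A specific comparison such as deltasbp vs WT in MH at 18h is a sum of several coefficients of the interaction model. The following sketch shows which ones, using only base R: it rebuilds the 2 x 2 x 3 design with model.matrix (whose column names use ":" where resultsNames(dds) shows ".") and subtracts the design row of the denominator group from that of the numerator group. The condition labels used as row names are constructed here for illustration.

```r
# Rebuild the design grid with the same reference levels as above
grid <- expand.grid(strain = c("WT", "deltasbp"),
                    media  = c("TSB", "MH"),
                    time   = c("2h", "4h", "18h"),
                    stringsAsFactors = FALSE)
grid$strain <- relevel(factor(grid$strain), ref = "WT")
grid$media  <- relevel(factor(grid$media),  ref = "TSB")
grid$time   <- relevel(factor(grid$time),   ref = "2h")

mm <- model.matrix(~ strain * media * time, data = grid)
rownames(mm) <- paste(grid$strain, grid$media, grid$time, sep = "_")

# Contrast = design row of the numerator minus design row of the denominator
contrast_vec <- mm["deltasbp_MH_18h", ] - mm["WT_MH_18h", ]
print(contrast_vec[contrast_vec != 0])
```

The non-zero entries are the strain coefficient plus the strain:media, strain:time18h and strain:media:time18h interactions. In DESeq2, a numeric vector like this can be passed as results(dds, contrast = contrast_vec), provided its order matches resultsNames(dds).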

# Extract the strain main effect: up 2, down 16
contrast <- "strain_deltasbp_vs_WT"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the media main effect: up 76; down 128
contrast <- "media_MH_vs_TSB"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the time main effects (counts listed in the order of the two contrasts below): up 228, down 98; up 17, down 2
contrast <- "time_18h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

contrast <- "time_4h_vs_2h"
res = results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))
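
The four blocks above repeat the same filtering with only the contrast name changing. As a sketch of how that step could be factored out, the hypothetical helper split_degs below applies the same thresholds (padj <= 0.05, |log2FoldChange| >= 2) to a toy table standing in for as.data.frame(results(dds, name = contrast)); it is an illustration, not the code that produced the counts quoted above.

```r
# Split a DESeq2-style results table into up- and down-regulated genes.
# Rows with NA padj are treated as not significant.
split_degs <- function(res_df, padj_cut = 0.05, lfc_cut = 2) {
  sig  <- !is.na(res_df$padj) & res_df$padj <= padj_cut
  up   <- res_df[sig & res_df$log2FoldChange >=  lfc_cut, , drop = FALSE]
  down <- res_df[sig & res_df$log2FoldChange <= -lfc_cut, , drop = FALSE]
  list(up   = up[order(up$log2FoldChange, decreasing = TRUE), , drop = FALSE],
       down = down[order(down$log2FoldChange), , drop = FALSE])
}

# Toy table with made-up gene IDs and values
toy <- data.frame(log2FoldChange = c(3.1, -2.5, 0.4, 2.2, -4.0),
                  padj           = c(0.001, 0.02, 0.9, NA, 0.0005),
                  row.names      = paste0("gene-", 1:5))
degs <- split_degs(toy)
print(rownames(degs$up))
print(rownames(degs$down))
```

Sorting the down-regulated genes by ascending log2FoldChange gives the same order as sorting by decreasing absolute value, which is what the blocks above do.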

#1.)  delta sbp 2h TSB vs WT 2h TSB
#2.)  delta sbp 4h TSB vs WT 4h TSB
#3.)  delta sbp 18h TSB vs WT 18h TSB
#4.)  delta sbp 2h MH vs WT 2h MH
#5.)  delta sbp 4h MH vs WT 4h MH
#6.)  delta sbp 18h MH vs WT 18h MH

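#---- relevel to control ----
# Each block below relevels, refits and overwrites clist: run one block, then the export loop (for (i in clist) ...) further down, before moving on to the next block.
#design=~condition+replicate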
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

dds$condition <- relevel(dds$condition, "WT_TSB_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

dds$condition <- relevel(dds$condition, "WT_MH_18h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

# WT_MH_xh
dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")
dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_18h_vs_WT_MH_4h")

# WT_TSB_xh
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")
dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

# deltasbp_MH_xh
dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")
dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

# deltasbp_TSB_xh
dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")
dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #for_Mac_vs_LB  contrast = paste("media", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)

  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  # NOTE: this loop uses padj <= 0.01, whereas the main-effect exports above and the volcano plots below use 0.05
  up <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}
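
The relevel-and-refit blocks above exist only so that each comparison shows up in resultsNames(dds). An alternative, sketched here under the assumption that the model is the same ~ condition design, is to fit once and request each pair through the contrast argument of results(). The helper make_contrast is hypothetical; it only parses the comparison names used in this post.

```r
# Turn "A_vs_B" into the c(factor, numerator, denominator) triple
make_contrast <- function(cmp, factor_name = "condition") {
  parts <- strsplit(cmp, "_vs_", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2L)
  c(factor_name, parts[1], parts[2])
}

comparisons <- c("deltasbp_TSB_2h_vs_WT_TSB_2h",
                 "deltasbp_MH_18h_vs_WT_MH_18h",
                 "WT_MH_18h_vs_WT_MH_4h")
contrasts <- lapply(comparisons, make_contrast)
names(contrasts) <- comparisons
print(contrasts[["WT_MH_18h_vs_WT_MH_4h"]])

# With a fitted dds (design = ~ condition), each triple could then be used as
#   res <- results(dds, contrast = contrasts[[cmp]])
# Not run here because it needs the DESeq2 object.
```

This keeps a single fitted model for all eighteen comparisons and removes the risk of exporting results from a stale clist.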

# -- Under host-env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff

for cmp in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done

# ---- delta sbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]
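
The deduplication above keeps, for each GeneName, the row with the smallest padj. The same rule can be written in base R, shown here on a made-up table (the GeneID values are hypothetical) so the effect on duplicated names such as rrf is easy to follow:

```r
toy <- data.frame(GeneID   = c("gene-A1", "gene-A2", "gene-B1", "gene-C1", "gene-C2"),
                  GeneName = c("rrf", "rrf", "tuf", "bfr", "bfr"),
                  padj     = c(0.20, 0.01, 0.03, NA, 0.50),
                  stringsAsFactors = FALSE)

# Sort by padj (NA last), then keep the first row seen for each GeneName
sorted  <- toy[order(toy$padj, na.last = TRUE), ]
deduped <- sorted[!duplicated(sorted$GeneName), ]
print(deduped)
```

Note that the rows dropped this way belong to different loci sharing one gene name, so the exported sheets contain one representative per name rather than every locus.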

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 2h versus WT TSB 2h"))
dev.off()

# ---- delta sbp TSB 4h vs WT TSB 4h ----
res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 4h versus WT TSB 4h"))
dev.off()

# ---- delta sbp TSB 18h vs WT TSB 18h ----
res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp TSB 18h versus WT TSB 18h"))
dev.off()

# ---- delta sbp MH 2h vs WT MH 2h ----
res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]

#1st_strategy First occurrence is kept and Subsequent duplicates are removed
#res <- res[!duplicated(res$GeneName), ]
#2nd_strategy keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

## Ensure the data frame matches the expected format
## For example, it should have columns: log2FoldChange, padj, etc.
#res <- as.data.frame(res)
## Remove rows with NA in log2FoldChange (if needed)
#res <- res[!is.na(res$log2FoldChange),]

# Replace padj = 0 with a small value
#NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 2h versus WT MH 2h"))
dev.off()

# ---- delta sbp MH 4h vs WT MH 4h ----
res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 4h versus WT MH 4h"))
dev.off()

# ---- delta sbp MH 18h vs WT MH 18h ----
res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# res has been deduplicated and sorted above
# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
# (which() drops rows whose padj is NA instead of returning all-NA rows)
up_regulated <- res[which(res$log2FoldChange > 2 & res$padj < 5e-2), ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[which(res$log2FoldChange < -2 & res$padj < 5e-2), ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE)

# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
head(res)

#library(EnhancedVolcano)
# Assuming res is already sorted and processed
png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200)
#max.overlaps = 10
EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 5e-2,
                FCcutoff = 2,
                title = '',
                subtitleLabSize = 18,
                pointSize = 3.0,
                labSize = 5.0,
                colAlpha = 1,
                legendIconSize = 4.0,
                drawConnectors = TRUE,
                widthConnectors = 0.5,
                colConnectors = 'black',
                subtitle = expression("Δsbp MH 18h versus WT MH 18h"))
dev.off()
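
The six blocks above differ only in the comparison name, from which the input file, the workbook name, the PNG name and the plot subtitle all follow. A sketch of deriving those four strings from the comparison ID, so the blocks could be driven from one loop (the plan table and its column names are introduced here for illustration):

```r
comparisons <- c("deltasbp_TSB_2h_vs_WT_TSB_2h", "deltasbp_TSB_4h_vs_WT_TSB_4h",
                 "deltasbp_TSB_18h_vs_WT_TSB_18h", "deltasbp_MH_2h_vs_WT_MH_2h",
                 "deltasbp_MH_4h_vs_WT_MH_4h", "deltasbp_MH_18h_vs_WT_MH_18h")

# Display form: leading 'deltasbp' becomes the Greek-letter spelling used in the outputs
pretty <- sub("^deltasbp", "\u0394sbp", comparisons)

plan <- data.frame(
  input    = paste0(comparisons, "-all.csv"),
  xlsx     = paste0("Gene_Expression_", pretty, ".xlsx"),
  png      = paste0(pretty, ".png"),
  subtitle = gsub("_", " ", sub("_vs_", "_versus_", pretty, fixed = TRUE)),
  stringsAsFactors = FALSE
)
print(plan$subtitle)
```

Each row of plan then supplies the arguments for read.csv, saveWorkbook, png and the EnhancedVolcano subtitle of one comparison.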

#Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx)

Cluster the genes and draw heatmaps

#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/
# save the Up-regulated and Down-regulated genes into -up.id and -down.id

# The attached heatmaps show only the genes passing the cutoffs padj < 5e-2 and |log2FC| > 2
#- 1457 TSB vs 1457deltasbp TSB early  timepoint (1-2h)
#- 1457 MH vs 1457deltasbp MH early timepoint (1-2h)
#- 1457 TSB vs 1457deltasbp TSB 4h
#- 1457 MH vs 1457deltasbp MH 4h
#- 1457 TSB vs 1457deltasbp TSB 18h
#- 1457 MH vs 1457deltasbp MH 18h

# The attached heatmaps show the genes with padj < 5e-2 and |log2FC| > 2 between 18h and the early timepoint (1-2h), or between 18h and the 4h timepoint
# The Tam_RNAseq project can be handled in the same way as this one.
#- 1457 TSB  early  timepoint (1-2h) vs 4h vs 18h
#- 1457 MH early  timepoint (1-2h) vs 4h vs 18h
#- 1457deltasbp TSB early  timepoint (1-2h) vs 4h vs 18h
#- 1457deltasbp MH early  timepoint (1-2h) vs 4h vs 18h

for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h    WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h  deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  # echo only prints the commands; pipe the loop output to bash to actually run them
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id";
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id";
done
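
The same ID extraction can also be done from within R. The sketch below writes a toy stand-in for a "-up.txt" file (made-up gene IDs and values) to a temporary location, reads the gene IDs back from the first CSV column, and writes them with a Gene_Id header line, which is one of the layouts the read_ids_from_file helper further down accepts.

```r
# Toy stand-in for a "<contrast>-up.txt" file as written by write.csv()
up_txt <- tempfile(fileext = ".txt")
toy <- data.frame(log2FoldChange = c(3.2, 2.4), padj = c(1e-4, 2e-3),
                  row.names = c("gene-toy_0001", "gene-toy_0002"))
write.csv(toy, file = up_txt)

# The gene IDs are the row names, i.e. the first CSV column
ids <- rownames(read.csv(up_txt, row.names = 1, check.names = FALSE))

# One unquoted ID per line below a 'Gene_Id' header
up_id <- sub("\\.txt$", ".id", up_txt)
writeLines(c("Gene_Id", ids), up_id)
print(readLines(up_id))
```

Writing an explicit header matters because a reader called with header = TRUE would otherwise consume the first gene ID as the column name.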

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

## 0) Config (keep exactly one contrast line active; as written, the last assignment wins) ---
contrast <- "deltasbp_TSB_2h_vs_WT_TSB_2h"
contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h"
contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h"
contrast <- "deltasbp_MH_2h_vs_WT_MH_2h"
contrast <- "deltasbp_MH_4h_vs_WT_MH_4h"
contrast <- "deltasbp_MH_18h_vs_WT_MH_18h"

up_file   <- paste0(contrast, "-up.id")
down_file <- paste0(contrast, "-down.id")

## 1) Packages -------------------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## 2) Helpers --------------------------------------------------
# Read IDs from a file that may be:
#  - one column with or without header "Gene_Id"
#  - may contain quotes
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(
    read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""),
    error = function(e) NULL
  )
  if (!is.null(df) && ncol(df) >= 1) {
    if ("Gene_Id" %in% names(df)) {
      ids <- df[["Gene_Id"]]
    } else if (ncol(df) == 1L) {
      ids <- df[[1]]
    } else {
      first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1]
      if (is.na(first_nonempty)) stop("No usable IDs in: ", path)
      ids <- df[[first_nonempty]]
    }
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "")
    if (ncol(df2) < 1L) stop("No usable IDs in: ", path)
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

#BREAK_LINE

# From "A_vs_B" get c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}

# Match whole tags at boundaries or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}
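
A short demonstration of the two helpers above, repeated here so the block runs on its own. The sample names are hypothetical and assume a "<condition>_r<replicate>" naming scheme; the point is that only whole tags match (WT_TSB_2h does not pick up WT_TSB_18h) and that the ordering step used later puts the second group of the contrast first.

```r
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'")
  parts
}
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Hypothetical column names of the expression matrix
samples <- c("deltasbp_TSB_2h_r1", "WT_TSB_2h_r1", "WT_TSB_2h_r2",
             "WT_TSB_18h_r1", "deltasbp_MH_2h_r1")

groups <- split_contrast_groups("deltasbp_TSB_2h_vs_WT_TSB_2h")
kept   <- samples[match_tags(samples, groups)]

# Put the second group (here WT_TSB_2h) first, keeping the original order within groups
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), kept, perl = TRUE))
print(kept[ord])
```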

## 3) Build GOI from the two .id files -------------------------
GOI_up   <- read_ids_from_file(up_file)
GOI_down <- read_ids_from_file(down_file)
GOI <- unique(c(GOI_up, GOI_down))
if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.")

## 4) Expression matrix (DESeq2 rlog/vst) ----------------------
# Use rld if present; otherwise try vsd
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
RNASeq.NoCellLine <- as.matrix(expr_all)

# GOI are already 'gene-*' in your data — use them directly for matching
present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
# Optional: report truly missing IDs (on the same 'gene-*' format)
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## 5) Keep ONLY columns for the two groups in the contrast -----
groups <- split_contrast_groups(contrast)  # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h")
keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups)
if (!any(keep_cols)) {
  stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "),
      ". Check your column names or implement colData-based filtering.")
}
cols_idx <- which(keep_cols)
sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx]

# Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...')
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))

# Subset safely
expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE]

## 6) Remove constant/NA rows ----------------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

# Save the filtered matrix used for the heatmap (optional)
out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt")
write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE)

## 6.1) Pretty labels (display only) ---------------------------
# Row labels: strip 'gene-'
labRow_pretty <- sub("^gene-", "", rownames(datamat))

# Column labels: 'deltasbp' -> 'Δsbp' and nicer spacing
labCol_pretty <- colnames(datamat)
labCol_pretty <- gsub("^deltasbp", "\u0394sbp", labCol_pretty)
labCol_pretty <- gsub("_", " ", labCol_pretty)   # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1"
# If you prefer to drop replicate suffixes, uncomment:
# labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty)

#BREAK_LINE

## 7) Clustering -----------------------------------------------
# Row clustering with Pearson distance
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
#row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs"))
#row_cor[!is.finite(row_cor)] <- 0
#hr <- hclust(as.dist(1 - row_cor), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.1)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=950)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 15),        # c(bottom, right): space for column labels and row labels
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,
        density.info = 'none',
        labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
        labCol     = labCol_pretty,   # col labels with Δsbp + spaces
        cexRow = 2.5,
        cexCol = 2.5,
        srtCol = 15,
        lhei = c(0.6, 4),           # if the plot size is reduced, increase the first number to avoid 'Error in plot.new() : figure margins too large'
        lwid = c(0.8, 4))           # same remedy as for lhei
dev.off()
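
The column selection and ordering from step 5 can be checked on toy column names before running it on the real matrix. The following is a minimal sketch; the sample names are made up for illustration and only assume the 'strain_medium_time_replicate' naming used above.

```r
# Toy column names (hypothetical; the real ones come from colnames(assay(rld)))
nms <- c("deltasbp_TSB_2h_r1", "WT_TSB_2h_r1", "deltasbp_TSB_2h_r2",
         "WT_TSB_2h_r2", "WT_TSB_18h_r1", "deltasbp_MH_2h_r1")

# Same helper as above: match whole tags delimited by start/end or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Split the contrast name into its two groups and keep only their columns
groups <- strsplit("deltasbp_TSB_2h_vs_WT_TSB_2h", "_vs_", fixed = TRUE)[[1]]
kept <- nms[match_tags(nms, groups)]

# Put the reference group (second element) first; ties keep their original order
ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), kept, perl = TRUE))
kept_ordered <- kept[ord]
print(kept_ordered)
```

Columns from other time points or media (here WT_TSB_18h and deltasbp_MH_2h) are dropped, and the WT replicates end up to the left of the Δsbp replicates.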

# ------------------ Heatmap generation for three samples ----------------------

## ============================================================
## Three-condition DEGs heatmap from multiple pairwise contrasts
## Example contrasts:
##   "WT_MH_4h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_4h"
## Output shows the union of DEGs across all contrasts and
## only the columns (samples) for the 3 conditions.
## ============================================================

## -------- 0) User inputs ------------------------------------
## Choose ONE contrast set and the matching cond_order, then run the rest of the script.
## Each assignment overwrites the previous one, so only the uncommented set is used.
#contrasts <- c(
#  "WT_MH_4h_vs_WT_MH_2h",
#  "WT_MH_18h_vs_WT_MH_2h",
#  "WT_MH_18h_vs_WT_MH_4h"
#)
#contrasts <- c(
#  "WT_TSB_4h_vs_WT_TSB_2h",
#  "WT_TSB_18h_vs_WT_TSB_2h",
#  "WT_TSB_18h_vs_WT_TSB_4h"
#)
#contrasts <- c(
#  "deltasbp_MH_4h_vs_deltasbp_MH_2h",
#  "deltasbp_MH_18h_vs_deltasbp_MH_2h",
#  "deltasbp_MH_18h_vs_deltasbp_MH_4h"
#)
contrasts <- c(
  "deltasbp_TSB_4h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_2h",
  "deltasbp_TSB_18h_vs_deltasbp_TSB_4h"
)
## Optionally force a condition display order (defaults to order of first appearance);
## it must name the same three conditions as the chosen contrast set
#cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h")
#cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h")
#cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h")
cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h")
#cond_order <- NULL

## -------- 1) Packages ---------------------------------------
need <- c("gplots")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(gplots))

## -------- 2) Helpers ----------------------------------------
read_ids_from_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE,
                            quote = "\"'", comment.char = ""), error = function(e) NULL)
  if (!is.null(df) && ncol(df) >= 1) {
    ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]]
  } else {
    df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE,
                      quote = "\"'", comment.char = "")
    ids <- df2[[1]]
  }
  ids <- trimws(gsub('"', "", ids))
  ids[nzchar(ids)]
}

# From "A_vs_B" return c("A","B")
split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x)
  parts
}

# Grep whole tag between start/end or underscores
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# Pretty labels for columns (optional tweaks)
prettify_col_labels <- function(x) {
  x <- gsub("^deltasbp", "\u0394sbp", x)  # example from your earlier case
  x <- gsub("_", " ", x)
  x
}

# BREAK_LINE

## -------- 3) Build GOI (union across contrasts) -------------
up_files   <- paste0(contrasts, "-up.id")
down_files <- paste0(contrasts, "-down.id")

GOI <- unique(unlist(c(
  lapply(up_files,   read_ids_from_file),
  lapply(down_files, read_ids_from_file)
)))
if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.")

## -------- 4) Expression matrix (rld or vsd) -----------------
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' object is available in the environment.")
}
expr_all <- as.matrix(expr_all)

present <- intersect(rownames(expr_all), GOI)
if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.")
missing <- setdiff(GOI, present)
if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.")

## -------- 5) Infer the THREE condition tags -----------------
pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B)
cond_tags <- unique(unlist(pair_groups))
if (length(cond_tags) != 3L) {
  stop("Expected exactly three unique condition tags across the contrasts, got: ",
      paste(cond_tags, collapse = ", "))
}

# If user provided an explicit order, use it; else keep first-appearance order
if (!is.null(cond_order)) {
  if (!setequal(cond_order, cond_tags))
    stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", "))
  cond_tags <- cond_order
}

#BREAK_LINE

## -------- 6) Subset columns to those 3 conditions -----------
# helper: does a name contain any of the tags?
match_any_tag <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# helper: return the specific tag that a single name matches
detect_tag <- function(nm, tags) {
  hits <- vapply(tags, function(t)
    grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
  if (!any(hits)) NA_character_ else tags[which(hits)[1]]
}

keep_cols <- match_any_tag(colnames(expr_all), cond_tags)
if (!any(keep_cols)) {
  stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", "))
}

sub_idx <- which(keep_cols)
sub_colnames <- colnames(expr_all)[sub_idx]

# find the condition tag that each kept column belongs to
cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags)

# rank columns by your desired condition order, then by name within each condition
cond_rank <- match(cond_for_col, cond_tags)
ord <- order(cond_rank, sub_colnames)

expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE]

## -------- 7) Remove constant/NA rows ------------------------
row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.")
datamat <- expr_sub[row_ok, , drop = FALSE]

## -------- 8) Labels ----------------------------------------
labRow_pretty <- sub("^gene-", "", rownames(datamat))
labCol_pretty <- prettify_col_labels(colnames(datamat))

## -------- 9) Clustering (rows) ------------------------------
hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")

# Color row-side groups by cutting the dendrogram
mycl <- cutree(hr, h = max(hr$height) / 1.3)
palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                  "lightblue","pink","purple","lightcyan","salmon","lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

## -------- 10) Save the matrix used --------------------------
out_tag <- paste(cond_tags, collapse = "_")
write.csv(as.data.frame(datamat),
          file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"),
          quote = FALSE)

## -------- 11) Plot heatmap ----------------------------------
png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 5000)
heatmap.2(
  datamat,
  Rowv = as.dendrogram(hr),
  Colv = FALSE,
  dendrogram = "row",
  col = bluered(75),
  scale = "row",
  trace = "none",
  density.info = "none",
  RowSideColors = mycol,
  margins = c(10, 15),      # c(bottom, right): column labels, row labels
  sepwidth = c(0, 0),
  labRow = labRow_pretty,
  labCol = labCol_pretty,
  cexRow = 1.3,
  cexCol = 1.8,
  srtCol = 15,
  lhei = c(0.01, 4),
  lwid = c(0.5, 4),
  key = FALSE               # safer; add a manual z-score key if you want one
)
dev.off()

# Shell (not R): give the outputs shorter names; each file exists only after a run with the matching contrast set
mv DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png DEGs_heatmap_WT_MH.png
mv DEGs_heatmap_WT_TSB_2h_WT_TSB_4h_WT_TSB_18h.png DEGs_heatmap_WT_TSB.png
mv DEGs_heatmap_deltasbp_MH_2h_deltasbp_MH_4h_deltasbp_MH_18h.png DEGs_heatmap_deltasbp_MH.png
mv DEGs_heatmap_deltasbp_TSB_2h_deltasbp_TSB_4h_deltasbp_TSB_18h.png DEGs_heatmap_deltasbp_TSB.png
# ------------------ Heatmap generation for three samples END ----------------------
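
To see what the row clustering and the dendrogram cut in steps 9 of the script do, here is a small sketch on a toy matrix. The gene names and values are invented for illustration; the distance (1 - Pearson correlation), the complete linkage, the cut height max/1.3 and the palette cycling are the ones used above.

```r
# Toy matrix (hypothetical): 6 genes x 6 samples with two opposite expression trends
up  <- c(1, 2, 3, 4, 5, 6)
toy <- rbind(
  g1 = up,
  g2 = up * 2 + 1,
  g3 = up + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1),
  g4 = rev(up),
  g5 = rev(up) * 3,
  g6 = rev(up) + c(0.2, 0, -0.2, 0, 0.2, 0)
)

# Pearson distance between rows: 0 for perfectly correlated, 2 for anti-correlated
hr <- hclust(as.dist(1 - cor(t(toy), method = "pearson")), method = "complete")

# Cut the tree below its maximum height; a larger divisor gives more, smaller clusters
mycl <- cutree(hr, h = max(hr$height) / 1.3)

# Cycle through the palette so that cluster numbers beyond its length still get a color
palette_base <- c("yellow", "blue", "orange", "magenta", "cyan", "red", "green", "maroon",
                  "lightblue", "pink", "purple", "lightcyan", "salmon", "lightgreen")
mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]
print(data.frame(gene = names(mycl), cluster = as.vector(mycl), color = mycol))
```

Because the row scaling in heatmap.2 and the Pearson distance both ignore absolute expression level, g1 and g2 fall into the same cluster although g2 is expressed at a higher level.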

# ==== (NOT_USED) heatmap.2 plotter that retries with several layouts (slide-sized output) ====
# Inputs:
#   mat        : numeric matrix (genes x samples)
#   hr         : hclust object for rows (converted with as.dendrogram)
#   row_colors : vector of RowSideColors of length nrow(mat)
#   labRow     : character vector of row labels (display only)
#   labCol     : character vector of col labels (display only)
#   outfile    : output PNG path
#   base_res   : DPI for PNG (default 150)
safe_heatmap2 <- function(mat, hr, row_colors, labRow, labCol, outfile, base_res = 150) {
  stopifnot(is.matrix(mat))
  nr <- nrow(mat); nc <- ncol(mat)

  # Target slide size & sensible caps
  #target_w <- 2400; target_h <- 1600
  #max_w <- 3000; max_h <- 2000
  target_w <- 800; target_h <- 2000
  max_w <- 1500; max_h <- 1500

  # Label stats
  max_row_chars <- if (length(labRow)) max(nchar(labRow), na.rm = TRUE) else 1
  max_col_chars <- if (length(labCol)) max(nchar(labCol), na.rm = TRUE) else 1

  #add_attempt(target_w, target_h, 0.90, 1.00, 45, NULL, TRUE,  TRUE,  TRUE)
  attempts <- list()
  add_attempt <- function(w, h, cr, cc, rot, mar = NULL, key = TRUE, showR = TRUE, showC = TRUE, trunc_rows = 0) {
    attempts[[length(attempts) + 1]] <<- list(
      w = w, h = h, cr = cr, cc = cc, rot = rot, mar = mar,
      key = key, showR = showR, showC = showC, trunc_rows = trunc_rows
    )
  }

  # Observed: the attempts with key = FALSE work; the attempts with key = TRUE all fail.
  # 1) Preferred look: largest fonts, 30° labels, minimal margins c(2,1), with key
  add_attempt(target_w,           target_h,           0.90, 1.00, 30, c(2,1), TRUE, TRUE, TRUE)
  # 2) Slightly smaller fonts, 45° labels, explicit margins c(10,15), with key
  add_attempt(target_w,           target_h,           0.85, 0.95, 45, c(10,15), TRUE, TRUE, TRUE)
  # 3) Wider canvas (width and height are capped at max_w/max_h), 30° labels, with key
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.85, 0.95, 30, c(10,15), TRUE, TRUE, TRUE)
  # 4) Drop the key, smaller fonts, margins c(10,14)
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(10,14), FALSE, TRUE, TRUE)
  # 5) Same as 4) with tighter margins c(8,12)
  add_attempt(min(target_w+300,   max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(8,12),  FALSE, TRUE, TRUE)
  # 6) Widest canvas, 45° labels, margins c(8,12)
  add_attempt(min(target_w+500,   max_w), min(target_h+300, max_h), 0.80, 0.90,  45, c(8,12),  FALSE, TRUE, TRUE)
  # 7) Last resort: truncate long row labels to 18 characters (keeps them readable)
  if (max_row_chars > 20) {
    add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE, trunc_rows = 18)
  }
  }

  for (i in seq_along(attempts)) {
    a <- attempts[[i]]

    # Compute margins if not provided
    if (is.null(a$mar)) {
      col_margin <- if (a$showC) {
        if (a$rot > 0) max(6, ceiling(0.45 * max_col_chars * max(a$cc, 0.8))) else
                      max(5, ceiling(0.22 * max_col_chars * max(a$cc, 0.8)))
      } else 4
      row_margin <- if (a$showR) max(6, ceiling(0.55 * max_row_chars * max(a$cr, 0.8))) else 4
      mar <- c(col_margin, row_margin)
    } else {
      mar <- a$mar
    }

    # Prepare labels for this attempt
    lr <- if (a$showR) labRow else rep("", nr)
    if (a$trunc_rows > 0 && a$showR) {
      lr <- ifelse(nchar(lr) > a$trunc_rows, paste0(substr(lr, 1, a$trunc_rows), "…"), lr)
    }
    lc <- if (a$showC) labCol else rep("", nc)

    # Close any open device
    if (dev.cur() != 1) try(dev.off(), silent = TRUE)

    ok <- FALSE
    try({
      png(outfile, width = ceiling(a$w), height = ceiling(a$h), res = base_res)
      gplots::heatmap.2(
        mat,
        Rowv = as.dendrogram(hr),
        Colv = FALSE,
        dendrogram = "row",
        col = gplots::bluered(75),
        scale = "row",
        trace = "none",
        density.info = "none",
        RowSideColors = row_colors,
        key = a$key,
        margins = mar,           # c(col, row) in lines
        sepwidth = c(0, 0),
        labRow = lr,
        labCol = lc,
        cexRow = a$cr,
        cexCol = a$cc,
        srtCol = a$rot,
        lhei = c(0.1, 4),
        lwid = c(0.1, 4)
      )
      dev.off()
      ok <- TRUE
    }, silent = TRUE)

    if (ok) {
      message(sprintf(
        "✓ Heatmap saved: %s  (attempt %d)  size=%dx%d  margins=c(%d,%d)  cexRow=%.2f  cexCol=%.2f  srtCol=%d",
        outfile, i, ceiling(a$w), ceiling(a$h), mar[1], mar[2], a$cr, a$cc, a$rot
      ))
      return(invisible(TRUE))
    } else {
      if (dev.cur() != 1) try(dev.off(), silent = TRUE)
      message(sprintf("Attempt %d failed; retrying...", i))
    }
  }

  stop("Failed to draw heatmap after tuned attempts. Consider ComplexHeatmap if this persists.")
}

safe_heatmap2(
  mat        = datamat,
  hr         = hr,
  row_colors = mycol,
  labRow     = labRow_pretty,   # row labels WITHOUT "gene-"
  labCol     = labCol_pretty,   # col labels with Δsbp + spaces
  outfile    = paste0("DEGs_heatmap_", contrast, ".png")
  # base_res is left at its default of 150
)

# -- (OLD CODE) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h --
# Shell (not R): merge the up and down ID lists into one file
cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids
# Manually add "Gene_Id" as the first line of 'ids' and delete the double quotes.
# Note: GeneID, not GeneName, is used as the index because the .txt files contain only GeneID.
GOI <- read.csv("ids")$Gene_Id
RNASeq.NoCellLine <- assay(rld)
#install.packages("gplots")
library("gplots")
#clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
datamat = RNASeq.NoCellLine[GOI, ]
#datamat = RNASeq.NoCellLine
write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")

constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
if(any(constant_rows)) {
  cat("Removing", sum(constant_rows), "constant rows.\n")
  datamat <- datamat[!constant_rows, ]
}
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.1)
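# "LIGHTRED" is not an R color name and "MAGENTA" was listed twice; replaced by "SALMON" and "PURPLE" as in the newer script
mycol = c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON", "LIGHTBLUE", "PINK", "PURPLE", "LIGHTCYAN", "SALMON", "LIGHTGREEN");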
mycol = mycol[as.vector(mycl)]

png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=2000)
heatmap.2(datamat,
        Rowv = as.dendrogram(hr),
        col = bluered(75),
        scale = "row",
        RowSideColors = mycol,
        trace = "none",
        margins = c(10, 15),        # c(bottom, right): column labels, row labels
        sepwidth = c(0, 0),
        dendrogram = 'row',
        Colv = FALSE,
        density.info = 'none',
        labRow = rownames(datamat),
        cexRow = 1.5,
        cexCol = 1.5,
        srtCol = 35,
        lhei = c(0.2, 4),           # reduce top space (was 1 or more)
        lwid = c(0.4, 4))           # reduce left space (was 1 or more)
dev.off()

# -------------- Cluster members ----------------
write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
#~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls;
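
The five hard-coded write.csv lines only cover clusters 1 to 5. If the dendrogram cut yields a different number of clusters, a loop over the cluster numbers may be more convenient. This is only a sketch: the toy 'mycl' vector, the short palette and the output directory are assumptions for illustration, and the file names follow the palette names rather than the DARKBLUE-style names used above.

```r
# Toy cluster assignment (hypothetical); in the workflow this is cutree(hr, ...)
mycl <- c(geneA = 1L, geneB = 2L, geneC = 1L, geneD = 3L)
palette_base <- c("yellow", "blue", "orange", "magenta", "cyan")

out_dir <- tempdir()  # assumption: temporary directory; use "." for the working directory
written <- character(0)
for (k in sort(unique(mycl))) {
  # Same palette cycling as for the row-side colors
  colour <- palette_base[(k - 1) %% length(palette_base) + 1]
  f <- file.path(out_dir, sprintf("cluster%d_%s.txt", k, toupper(colour)))
  # One gene ID per line under a Gene_Id header
  write.csv(data.frame(Gene_Id = names(mycl)[mycl == k]), file = f,
            row.names = FALSE, quote = FALSE)
  written <- c(written, f)
}
print(basename(written))
```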

#### (NOT_WORKING) cluster members with annotations. This does not work for this bacterium: it is not a model species, so mart=ensembl cannot be used. #####
# Run ONE subset_*/data pair at a time; each 'data' assignment overwrites the previous one.
subset_1<-names(subset(mycl, mycl == '1'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
subset_2<-names(subset(mycl, mycl == '2'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
subset_3<-names(subset(mycl, mycl == '3'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
subset_4<-names(subset(mycl, mycl == '4'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])  #
subset_5<-names(subset(mycl, mycl == '5'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])  #
# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()
# Determine total number of genes
total_genes <- length(rownames(data))
# Loop through each gene to annotate
for (i in 1:total_genes) {
    gene <- rownames(data)[i]
    result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                    filters = 'ensembl_gene_id',
                    values = gene,
                    mart = ensembl)
    # If multiple rows are returned, take the first one
    if (nrow(result) > 1) {
        result <- result[1, ]
    }
    # Check if the result is empty
    if (nrow(result) == 0) {
        result <- data.frame(ensembl_gene_id = gene,
                            external_gene_name = NA,
                            gene_biotype = NA,
                            entrezgene_id = NA,
                            chromosome_name = NA,
                            start_position = NA,
                            end_position = NA,
                            strand = NA,
                            description = NA)
    }
    # Transpose expression values
    expression_values <- t(data.frame(t(data[gene, ])))
    colnames(expression_values) <- colnames(data)
    # Combine gene information and expression data
    combined_result <- cbind(result, expression_values)
    # Append to the final dataframe
    annotated_data <- rbind(annotated_data, combined_result)
    # Print progress every 100 genes
    if (i %% 100 == 0) {
        cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
    }
}
# Save the annotated data to a new CSV file (run only the line that matches the subset selected above)
write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names=FALSE)
write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names=FALSE)
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

Option 1 (KEGG Terms): EggNOG-mapper, based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download the protein sequences from Genbank
        mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
        python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
        emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
        #----> result eggnog_out.emapper.annotations: contains KEGG, GO, and other functional annotations.
        #----> How to read a seed_ortholog ID such as 470.IX87_14445:
            * 470 is the NCBI Taxonomy ID of the organism the seed ortholog comes from (470 = Acinetobacter baumannii).
            * IX87_14445 is the locus tag of the specific gene or protein within that genome.

    Extract KEGG KO IDs from eggnog_out.emapper.annotations.
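
A minimal sketch of this extraction step in base R. The three toy rows are invented; the sketch assumes that the KEGG_ko column holds comma-separated entries with a "ko:" prefix and that unannotated proteins carry "-", which is how the emapper output looked in my runs but should be checked against the actual file.

```r
# Toy rows in the layout of the emapper annotations table (hypothetical values);
# in practice something like:
#   emap <- read.delim("eggnog_out.emapper.annotations.txt", check.names = FALSE)
emap <- data.frame(
  query   = c("prot_1", "prot_2", "prot_3"),
  KEGG_ko = c("ko:K00001,ko:K00002", "-", "ko:K02313"),
  stringsAsFactors = FALSE
)

# Keep annotated proteins, split the comma-separated KO list, strip the "ko:" prefix
has_ko  <- emap$KEGG_ko != "-" & nzchar(emap$KEGG_ko)
ko_list <- strsplit(emap$KEGG_ko[has_ko], ",", fixed = TRUE)
gene2ko <- data.frame(
  query = rep(emap$query[has_ko], lengths(ko_list)),
  KO    = sub("^ko:", "", unlist(ko_list)),
  stringsAsFactors = FALSE
)
print(gene2ko)
```

The result is a long table with one row per protein and KO pair, which is the shape needed later for a KO-based enrichment.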

Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

* Prepare the workspace and start Blast2GO (run from ~/DATA/Data_Michelle_RNAseq_2025): mkdir ~/b2gWorkspace_Michelle_RNAseq_2025; cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025; ~/Tools/Blast2GO/Blast2GO_Launcher (set the workspace to ~/b2gWorkspace_Michelle_RNAseq_2025)
* 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) by choosing the file CP020463_protein.fasta as input -->
* Buttons 'blast' at the NCBI (Parameters: blastp, nr, ...) (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
        QBlast finished with warnings!
        Blasted Sequences: 2084
        Sequences without results: 105
        Check the Job log for details and try to submit again.
        Restarting QBlast may result in additional results, depending on the error type.
        "Blast (CP020463_protein) Done"
* Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
        "Mapping (CP020463_protein) Done"
        "Mapping finished - Please proceed now to annotation."
* Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
        * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term; it is composed of the sequence similarity of the Blast hit, the evidence code of the source GO term and the position of the particular GO term in the Gene Ontology hierarchy. The annotation score cutoff selects the most specific GO term for a given GO branch that lies above this value.
        * Used parameter 'GO Weight': a value that is added to the Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. As a result, more general GO terms that summarise many original source terms (those directly associated with the Blast hits) get a higher Annotation Score.
        "Annotation (CP020463_protein) Done"
        "Annotation finished."
or blast2go_cli_v1.5.1 (NOT_USED)

        #https://help.biobam.com/space/BCD/2250407989/Installation
        #see ~/Scripts/blast2go_pipeline.sh

Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro
    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."
        "InterProScan (CP020463_protein) Done"
        "InterProScan Finished - You can now merge the obtained GO Annotations."
MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs
    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation."
        "Merge InterProScan GOs to Annotation (CP020463_protein) Done"
        "Finished merging GO terms from InterPro with annotations."
        "Maybe you want to run ANNEX (Annotation Augmentation)."
    #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."
File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
        #~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!

    #-- before merging (blast2go_annot.annot) --
    #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0085020
    #-- after merging (blast2go_annot.annot2) -->
    #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0070531
    #H0N29_18790     GO:0004842
    #H0N29_18790     GO:0005515
    #H0N29_18790     GO:0085020

    cp blast2go_annot.annot blast2go_annot.annot2

Option 4 (NOT_USED): RFAM for non-coding RNA

Option 5 (NOT_USED): PSORTb for subcellular localizations

Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

* Go to KAAS
* Upload your FASTA file.
* Select an appropriate gene set.
* Download the KO assignments.

Find the Closest KEGG Organism Code (NOT_USED)

Since your species isn't directly in KEGG, use a closely related organism.

* Check available KEGG organisms:

        library(clusterProfiler)
        library(KEGGREST)

        kegg_organisms <- keggList("organism")

        Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

        # Search for Acinetobacter in the list
        grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
        # Gammaproteobacteria
        #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"

Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)

If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (this code has been integrated into point 4):

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
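
For orientation, the kind of over-representation test behind such an enrichment can be reproduced by hand for a single pathway. The sketch below uses invented counts and is not the clusterProfiler implementation; it only illustrates the hypergeometric test, which to my understanding is what enrichKEGG applies per pathway before the p-value adjustment.

```r
# Toy numbers (hypothetical)
N <- 2000  # background genes with a KO annotation
M <- 40    # background genes assigned to the pathway
n <- 100   # DEGs with a KO annotation
k <- 8     # DEGs assigned to the pathway

# P(X >= k) when n genes are drawn without replacement from the background
p_over <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
tab <- matrix(c(k, n - k, M - k, N - M - (n - k)), nrow = 2,
              dimnames = list(c("in_pathway", "not_in_pathway"), c("DEG", "non_DEG")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_over, fisher = p_fisher))
```

With 100 DEGs and 2% of the background in the pathway, about 2 pathway genes are expected by chance, so 8 observed genes give a small p-value.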

Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes)

    #BiocManager::install("GO.db")
    #BiocManager::install("AnnotationDbi")

    # Load required libraries
    library(openxlsx)  # For Excel file handling
    library(dplyr)     # For data manipulation
    library(tidyr)
    library(stringr)
    library(clusterProfiler)  # For KEGG and GO enrichment analysis
    #library(org.Hs.eg.db)  # Replace with appropriate organism database
    library(GO.db)
    library(AnnotationDbi)

    setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes")
    # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
    # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query
    #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt
    #1,5c1
    #< ## Thu Jan 30 16:34:52 2025
    #< ## emapper-2.1.12
    #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60
    #< ##
    #< #query        seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
    #---
    #> query seed_ortholog   evalue  score   eggNOG_OGs      max_annot_lvl   COG_category    Description     Preferred_name  GOs     EC      KEGG_ko KEGG_Pathway   KEGG_Module      KEGG_Reaction   KEGG_rclass     BRITE   KEGG_TC CAZy    BiGG_Reaction   PFAMs
    #3620,3622d3615
    #< ## 3614 queries scanned
    #< ## Total time (seconds): 8.176708459854126

    # Step 1: Load the two-column blast2go annotation file (GeneID, Term); fill = TRUE pads incomplete lines
    annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

    # The file was cut to its first two columns above (cut -f1-2), so it has exactly 2 columns:
    colnames(annot_df) <- c("GeneID", "Term")
    # Step 2: Filter GO and EC terms and aggregate them per gene as comma-separated strings
    go_terms <- annot_df %>%
      filter(grepl("^GO:", Term)) %>%
      group_by(GeneID) %>%
      summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    ec_terms <- annot_df %>%
      filter(grepl("^EC:", Term)) %>%
      group_by(GeneID) %>%
      summarize(EC = paste(Term, collapse = ","), .groups = "drop")

    # Key Improvements:
    #    * Looped processing of all 6 input files to avoid redundancy.
    #    * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations.
    #    * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists.
    #    * Error handling for GO term descriptions via tryCatch.
    #    * Improved clarity and modular structure for easier maintenance and future additions.

    # Define the filenames and output suffixes
    file_list <- c(
      "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
      "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
      "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
      "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
      "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
      "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

      "WT_MH_4h_vs_WT_MH_2h-all.csv",
      "WT_MH_18h_vs_WT_MH_2h-all.csv",
      "WT_MH_18h_vs_WT_MH_4h-all.csv",
      "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
      "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
      "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

      "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
      "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
      "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
      "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
      "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
      "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
    )

    #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv"

    # ---------------------- Generated DEG(Annotated)_KEGG_GO_* -----------------------
    suppressPackageStartupMessages({
      library(readr)
      library(dplyr)
      library(stringr)
      library(tidyr)
      library(openxlsx)
      library(clusterProfiler)
      library(AnnotationDbi)
      library(GO.db)
    })

    # ---- PARAMETERS ----
    PADJ_CUT <- 5e-2
    LFC_CUT  <- 2

    # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... )
    emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt"

    # Input files (you can add/remove here)
    input_files <- c(
      "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv",
      "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv",
      "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv",
      "deltasbp_MH_2h_vs_WT_MH_2h-all.csv",
      "deltasbp_MH_4h_vs_WT_MH_4h-all.csv",
      "deltasbp_MH_18h_vs_WT_MH_18h-all.csv",

      "WT_MH_4h_vs_WT_MH_2h-all.csv",
      "WT_MH_18h_vs_WT_MH_2h-all.csv",
      "WT_MH_18h_vs_WT_MH_4h-all.csv",
      "WT_TSB_4h_vs_WT_TSB_2h-all.csv",
      "WT_TSB_18h_vs_WT_TSB_2h-all.csv",
      "WT_TSB_18h_vs_WT_TSB_4h-all.csv",

      "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv",
      "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv",
      "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv",
      "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv",
      "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv",
      "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv"
    )

    # ---- HELPERS ----
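    # Robust reader (CSV first, then TSV)
    read_table_any <- function(path) {
      tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE),
                    error = function(e) NULL)
      # read_csv rarely errors on a TSV file; it returns a single column instead,
      # so fall back to read_tsv in that case as well
      if (is.null(tb) || ncol(tb) <= 1) {
        tb <- tryCatch(readr::read_tsv(path, show_col_types = FALSE),
                      error = function(e) NULL)
      }
      tb
    }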

    # Return a nice Excel-safe base name
    xlsx_name_from_file <- function(path) {
      base <- tools::file_path_sans_ext(basename(path))
      paste0("DEG_KEGG_GO_", base, ".xlsx")
    }

    # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table
    expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) {
      if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame())
      kdf <- as.data.frame(kegg_res)
      if (!"geneID" %in% names(kdf)) return(kdf)
      # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID
      map_clean <- mapping_tbl %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko), KEGG_ko != "-") %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
        tidyr::separate_rows(KEGG_ko, sep = ",") %>%
        distinct()

      if (!nrow(map_clean)) {
        return(kdf)
      }

      expanded <- kdf %>%
        tidyr::separate_rows(geneID, sep = "/") %>%
        dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
        distinct() %>%
        dplyr::group_by(ID) %>%
        dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")

      kdf %>%
        dplyr::select(-geneID) %>%
        dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>%
        dplyr::rename(geneID = GeneID)
    }

    # ---- LOAD emapper annotations ----
    eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE)
    # Ensure character columns for joins
    eggnog_data$query   <- as.character(eggnog_data$query)
    eggnog_data$GOs     <- as.character(eggnog_data$GOs)
    eggnog_data$EC      <- as.character(eggnog_data$EC)
    eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko)

    # ---- MAIN LOOP ----
    for (f in input_files) {
      if (!file.exists(f)) { message("Missing: ", f); next }

      message("Processing: ", f)
      res <- read_table_any(f)
      if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next }

      # Coerce expected columns if present
      if ("padj" %in% names(res))   res$padj <- suppressWarnings(as.numeric(res$padj))
      if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange))

      # Ensure GeneID & GeneName exist
      if (!"GeneID" %in% names(res)) {
        # Try to infer from a generic 'gene' column
        if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_
      }
      if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_

      # Fill missing GeneName from GeneID (drop "gene-")
      res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "",
                            gsub("^gene-", "", as.character(res$GeneID)),
                            as.character(res$GeneName))

      # De-duplicate by GeneName, keep smallest padj
      if (!"padj" %in% names(res)) res$padj <- NA_real_
      res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup() %>%
        as.data.frame()

      # Sort by padj asc, then log2FC desc
      if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_
      res <- res[order(res$padj, -res$log2FoldChange), , drop = FALSE]

      # Join emapper (strip "gene-" from GeneID to match emapper 'query')
      res$GeneID_plain <- gsub("^gene-", "", res$GeneID)
      res_ann <- res %>%
        left_join(eggnog_data, by = c("GeneID_plain" = "query"))

      # --- Split by UP/DOWN using your volcano cutoffs ---
      up_regulated   <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange >  LFC_CUT)
      down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT,  log2FoldChange < -LFC_CUT)

      # --- KEGG enrichment (using K numbers in KEGG_ko) ---
      # Prepare KO lists: drop NA and "-", remove "ko:", and split comma-separated
      # entries so that each K number is its own element
      clean_ko <- function(x) {
        x <- as.character(x)
        x <- x[!is.na(x) & x != "" & x != "-"]
        unique(unlist(strsplit(gsub("ko:", "", x), ",")))
      }
      k_up <- clean_ko(up_regulated$KEGG_ko)
      k_dn <- clean_ko(down_regulated$KEGG_ko)

      # BREAK_LINE

      kegg_up   <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL)
      kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL)

      # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set)
      kegg_up_df   <- expand_kegg_geneIDs(kegg_up,   up_regulated)
      kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated)

      # --- GO enrichment (custom TERM2GENE built from emapper GOs) ---
      # Background gene set = all genes in this comparison
      background_genes <- unique(res_ann$GeneID_plain)
      # TERM2GENE table (GO -> GeneID_plain)
      go_annotation <- res_ann %>%
        dplyr::select(GeneID_plain, GOs) %>%
        mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>%
        tidyr::separate_rows(GOs, sep = ",") %>%
        filter(GOs != "") %>%
        dplyr::select(GOs, GeneID_plain) %>%
        distinct()

      # Gene lists for GO enricher
      go_list_up   <- unique(up_regulated$GeneID_plain)
      go_list_down <- unique(down_regulated$GeneID_plain)

      go_up <- tryCatch(
        enricher(gene = go_list_up, TERM2GENE = go_annotation,
                pvalueCutoff = 0.05, pAdjustMethod = "BH",
                universe = background_genes),
        error = function(e) NULL
      )
      go_down <- tryCatch(
        enricher(gene = go_list_down, TERM2GENE = go_annotation,
                pvalueCutoff = 0.05, pAdjustMethod = "BH",
                universe = background_genes),
        error = function(e) NULL
      )

      go_up_df   <- if (!is.null(go_up))   as.data.frame(go_up)   else data.frame()
      go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame()

      # Add GO term descriptions via GO.db (best-effort)
      add_go_term_desc <- function(df) {
        if (!nrow(df) || !"ID" %in% names(df)) return(df)
        df$Description <- sapply(df$ID, function(go_id) {
          term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id,
                                                columns = "TERM", keytype = "GOID"),
                          error = function(e) NULL)
          if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
        })
        df
      }
      go_up_df   <- add_go_term_desc(go_up_df)
      go_down_df <- add_go_term_desc(go_down_df)

      # ---- Write Excel workbook ----
      out_xlsx <- xlsx_name_from_file(f)
      wb <- createWorkbook()

      addWorksheet(wb, "Complete")
      writeData(wb, "Complete", res_ann)

      addWorksheet(wb, "Up_Regulated")
      writeData(wb, "Up_Regulated", up_regulated)

      addWorksheet(wb, "Down_Regulated")
      writeData(wb, "Down_Regulated", down_regulated)

      addWorksheet(wb, "KEGG_Enrichment_Up")
      writeData(wb, "KEGG_Enrichment_Up", kegg_up_df)

      addWorksheet(wb, "KEGG_Enrichment_Down")
      writeData(wb, "KEGG_Enrichment_Down", kegg_down_df)

      addWorksheet(wb, "GO_Enrichment_Up")
      writeData(wb, "GO_Enrichment_Up", go_up_df)

      addWorksheet(wb, "GO_Enrichment_Down")
      writeData(wb, "GO_Enrichment_Down", go_down_df)

      saveWorkbook(wb, out_xlsx, overwrite = TRUE)
      message("Saved: ", out_xlsx)
    }
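
    # ---- Optional sanity check (sketch, not part of the original pipeline) ----
    # A minimal, self-contained illustration of the UP/DOWN split used in the loop above,
    # run on a toy table. count_degs() and the toy data are hypothetical names introduced
    # only for this example; the default cutoffs mirror PADJ_CUT and LFC_CUT.
    count_degs <- function(df, padj_cut = 5e-2, lfc_cut = 2) {
      # A gene is only counted if both padj and log2FoldChange are available
      keep <- !is.na(df$padj) & !is.na(df$log2FoldChange) & df$padj < padj_cut
      c(up   = sum(keep & df$log2FoldChange >  lfc_cut),
        down = sum(keep & df$log2FoldChange < -lfc_cut))
    }
    toy <- data.frame(
      GeneName       = c("g1", "g2", "g3", "g4", "g5"),
      log2FoldChange = c(3.1, -2.5, 2.4, 0.3, -4.0),
      padj           = c(0.001, 0.01, 0.2, 0.001, NA)
    )
    # g1 passes as up, g2 as down; g3 fails padj, g4 fails the fold change, g5 has NA padj
    print(count_degs(toy))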

    # -------------------------------- OLD_CODE not automatized with loop ----------------------------
    # Load the results (run exactly ONE of the following lines per pass; each line overwrites res)
    res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv")
    res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv")

    res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv")
    res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv")
    res <- read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv")
    res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv")
    res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv")

    res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv")
    res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv")
    res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv")
    res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv")

    # Replace empty GeneName with modified GeneID
    res$GeneName <- ifelse(
        res$GeneName == "" | is.na(res$GeneName),
        gsub("gene-", "", res$GeneID),
        res$GeneName
    )

    # Remove duplicated genes by selecting the gene with the smallest padj
    duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]

    res <- res %>%
    group_by(GeneName) %>%
    slice_min(padj, with_ties = FALSE) %>%
    ungroup()

    res <- as.data.frame(res)
    # Sort res first by padj (ascending) and then by log2FoldChange (descending)
    res <- res[order(res$padj, -res$log2FoldChange), ]
    # Read eggnog annotations
    eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
    # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
    res$GeneID <- gsub("gene-", "", res$GeneID)
    # Merge eggnog data with res based on GeneID
    res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query"))

    # Merge with the res dataframe
    # Perform the left joins and rename columns
    res_updated <- res %>%
    left_join(go_terms, by = "GeneID") %>%
    left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)

    # Filter up-regulated genes (which() drops rows where padj or log2FoldChange is NA)
    up_regulated <- res_updated[which(res_updated$log2FoldChange > 2 & res_updated$padj < 0.05), ]
    # Filter down-regulated genes
    down_regulated <- res_updated[which(res_updated$log2FoldChange < -2 & res_updated$padj < 0.05), ]

    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet (with annotations)
    addWorksheet(wb, "Complete")
    writeData(wb, "Complete", res_updated)
    # Add the up-regulated genes as the second sheet (with annotations)
    addWorksheet(wb, "Up_Regulated")
    writeData(wb, "Up_Regulated", up_regulated)
    # Add the down-regulated genes as the third sheet (with annotations)
    addWorksheet(wb, "Down_Regulated")
    writeData(wb, "Down_Regulated", down_regulated)
    # Save the workbook to a file
    #saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)
    #NOTE: The generated annotation files contain all columns of DESeq2 (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj), almost all columns of eggNOG (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs; the eggNOG columns GOs and EC are dropped), and two columns from Blast2GO (GOs, EC). In the code below, we use the columns KEGG_ko and GOs for the KEGG and GO enrichments.

    #TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.

    # Set GeneName as row names after the join
    rownames(res_updated) <- res_updated$GeneName
    res_updated <- res_updated %>% dplyr::select(-GeneName)
    ## Set the 'GeneName' column as row.names
    #rownames(res_updated) <- res_updated$GeneName
    ## Drop the 'GeneName' column since it's now the row names
    #res_updated$GeneName <- NULL
    # -- BREAK_1 --

    # ---- Perform KEGG enrichment analysis (up_regulated) ----
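    gene_list_kegg_up <- up_regulated$KEGG_ko
    gene_list_kegg_up <- gene_list_kegg_up[!is.na(gene_list_kegg_up) & gene_list_kegg_up != "-"]
    # Remove "ko:" and split comma-separated entries so each K number is its own element
    gene_list_kegg_up <- unique(unlist(strsplit(gsub("ko:", "", gene_list_kegg_up), ",")))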
    kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    # -- convert the GeneID (Kxxxxxx) to the true GeneID --
    # Step 0: Create KEGG to GeneID mapping
    kegg_to_geneid_up <- up_regulated %>%
    dplyr::select(KEGG_ko, GeneID) %>%
    filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
    mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
    kegg_to_geneid_clean <- kegg_to_geneid_up %>%
    mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
    separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
    filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
    distinct()  # Remove any duplicate mappings
    # Step 2.1: Expand geneID column in kegg_enrichment_up
    expanded_kegg <- kegg_enrichment_up %>% as.data.frame() %>% separate_rows(geneID, sep = "/") %>%  left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
    distinct() %>%  # Remove duplicate matches
    group_by(ID) %>%
    summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
    #dplyr::glimpse(expanded_kegg)
    # Step 3.1: Replace geneID column in the original dataframe
    kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
    # Remove old geneID column and merge new one
    kegg_enrichment_up_df <- kegg_enrichment_up_df %>% dplyr::select(-geneID) %>%  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  dplyr::rename(geneID = GeneID)  # Rename column back to geneID

    # ---- Perform KEGG enrichment analysis (down_regulated) ----
    # Step 1: Extract KEGG KO terms from down-regulated genes
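    gene_list_kegg_down <- down_regulated$KEGG_ko
    gene_list_kegg_down <- gene_list_kegg_down[!is.na(gene_list_kegg_down) & gene_list_kegg_down != "-"]
    # Remove "ko:" and split comma-separated entries so each K number is its own element
    gene_list_kegg_down <- unique(unlist(strsplit(gsub("ko:", "", gene_list_kegg_down), ",")))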
    # Step 2: Perform KEGG enrichment analysis
    kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
    # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
    # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
    kegg_to_geneid_down <- down_regulated %>%
    dplyr::select(KEGG_ko, GeneID) %>%
    filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
    mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    # -- BREAK_2 --

    # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
    kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
    mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
    separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
    filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
    distinct()  # Remove duplicate mappings

    # Step 5: Expand geneID column in kegg_enrichment_down
    expanded_kegg_down <- kegg_enrichment_down %>%
    as.data.frame() %>%
    separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
    left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
    distinct() %>%  # Remove duplicate matches
    group_by(ID) %>%
    summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results

    # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
    kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
    dplyr::select(-geneID) %>%  # Remove old geneID column
    left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
    dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    # View the updated dataframe
    head(kegg_enrichment_down_df)

    # Create a new workbook
    #wb <- createWorkbook()
    # Save enrichment results to the workbook
    addWorksheet(wb, "KEGG_Enrichment_Up")
    writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
    # Save enrichment results to the workbook
    addWorksheet(wb, "KEGG_Enrichment_Down")
    writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

    # Define gene list (up-regulated genes)
    gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
    gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes

    # Define background gene set (all genes in res)
    background_genes <- res_updated$GeneID  # Extract the 3646 background genes

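    # Prepare GO annotation data from res_updated (TERM2GENE: GO term first, gene second)
    go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
    go_annotation <- go_annotation %>%
    tidyr::separate_rows(GOs, sep = ",") %>%  # Split multiple GO terms into separate rows
    filter(!is.na(GOs), GOs != "", GOs != "-")  # Drop genes without any GO annotation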
    # -- BREAK_3 --

    go_enrichment_up <- enricher(
        gene = gene_list_go_up,                # Up-regulated genes
        TERM2GENE = go_annotation,       # Custom GO annotation
        pvalueCutoff = 0.05,             # Significance threshold
        pAdjustMethod = "BH",
        universe = background_genes      # Define the background gene set
    )
    go_enrichment_up <- as.data.frame(go_enrichment_up)

    go_enrichment_down <- enricher(
        gene = gene_list_go_down,        # Down-regulated genes
        TERM2GENE = go_annotation,       # Custom GO annotation
        pvalueCutoff = 0.05,             # Significance threshold
        pAdjustMethod = "BH",
        universe = background_genes      # Define the background gene set
    )
    go_enrichment_down <- as.data.frame(go_enrichment_down)

    ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]

    # Update the Description column with the term descriptions
    go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
    # Using select to get the term description
    term <- tryCatch({
        AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    }, error = function(e) {
        message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
        return(data.frame(TERM = NA))  # In case of error, return NA
    })
    if (nrow(term) > 0) {
        return(term$TERM)
    } else {
        return(NA)  # If no description found, return NA
    }
    })
    ## Print the updated data frame
    #print(go_enrichment_up)

    ## Remove the 'p.adjust' column since no adjusted methods have been applied --> In this version we have used pvalue filtering (see above)!
    #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    # Update the Description column with the term descriptions
    go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
    # Using select to get the term description
    term <- tryCatch({
        AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
    }, error = function(e) {
        message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
        return(data.frame(TERM = NA))  # In case of error, return NA
    })

    if (nrow(term) > 0) {
        return(term$TERM)
    } else {
        return(NA)  # If no description found, return NA
    }
    })

    addWorksheet(wb, "GO_Enrichment_Up")
    writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))

    addWorksheet(wb, "GO_Enrichment_Down")
    writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

    # Save the workbook with enrichment results
    saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

    #Error for GO term: GO:0006807: fill in the description manually as "obsolete nitrogen compound metabolic process"
    #(see https://www.ebi.ac.uk/QuickGO/term/GO:0006807)
    #TODO: mark rows in yellow if p.adjust <= 0.05 in the GO_Enrichment sheets!
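
    # Sketch for the TODO above (an assumption about how it could be done; it was not used
    # for the delivered files): highlight rows with p.adjust <= 0.05 in yellow via openxlsx styles.
    # highlight_padj() is a hypothetical helper and must be called after writeData() and
    # before saveWorkbook(). demo_wb and demo_df are toy objects for illustration only.
    library(openxlsx)
    highlight_padj <- function(wb, sheet, df, cutoff = 0.05) {
      if (!nrow(df) || !"p.adjust" %in% names(df)) return(invisible(integer(0)))
      hit <- which(!is.na(df$p.adjust) & df$p.adjust <= cutoff)
      if (length(hit)) {
        yellow <- createStyle(fgFill = "#FFFF00")
        # +1 because row 1 of the sheet holds the column headers
        addStyle(wb, sheet, style = yellow, rows = hit + 1,
                 cols = seq_len(ncol(df)), gridExpand = TRUE, stack = TRUE)
      }
      invisible(hit)
    }
    demo_wb <- createWorkbook()
    demo_df <- data.frame(ID = c("GO:0000001", "GO:0000002"), p.adjust = c(0.01, 0.30))
    addWorksheet(demo_wb, "GO_Enrichment_Up")
    writeData(demo_wb, "GO_Enrichment_Up", demo_df)
    highlighted <- highlight_padj(demo_wb, "GO_Enrichment_Up", demo_df)  # only the first data row qualifies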

    #mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
    #Mac_vs_LB
    #LB.AB_vs_LB.WT19606
    #LB.IJ_vs_LB.WT19606
    #LB.W1_vs_LB.WT19606
    #LB.Y1_vs_LB.WT19606
    #Mac.AB_vs_Mac.WT19606
    #Mac.IJ_vs_Mac.WT19606
    #Mac.W1_vs_Mac.WT19606
    #Mac.Y1_vs_Mac.WT19606

    #TODO: in the reply, point out that KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.

(DEBUG) Draw Venn diagrams to compare the total DEGs of the AB, IJ, W1, and Y1 comparisons (each vs. WT19606) within the LB and Mac groups, irrespective of up- or down-regulation.

        library(openxlsx)
        library(ggplot2)  # provides ggsave()
        library(ggvenn)   # provides ggvenn()

        # Function to read and clean gene ID files
        read_gene_ids <- function(file_path) {
            # Read the gene IDs from the file
            gene_ids <- readLines(file_path)

            # Remove any quotes and trim whitespaces
            gene_ids <- gsub('"', '', gene_ids)  # Remove quotes
            gene_ids <- trimws(gene_ids)  # Trim whitespaces

            # Remove empty entries or NAs
            gene_ids <- gene_ids[gene_ids != "" & !is.na(gene_ids)]

            return(gene_ids)
        }

        # Example list of LB files with both -up.id and -down.id for each condition
        lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                        "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
        lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                        "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")

        # Combine both up and down files for each condition
        lb_files <- c(lb_files_up, lb_files_down)

        # Read gene IDs for each file in LB group
        #lb_degs <- setNames(lapply(lb_files, read_gene_ids), gsub("-(up|down).id", "", lb_files))
        lb_degs <- setNames(lapply(lb_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", lb_files)))

        lb_degs_ <- list()
        combined_set <- c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]])
        #unique_combined_set <- unique(combined_set)
        lb_degs_$AB <- combined_set
        combined_set <- c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]])
        lb_degs_$IJ <- combined_set
        combined_set <- c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]])
        lb_degs_$W1 <- combined_set
        combined_set <- c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
        lb_degs_$Y1 <- combined_set

        # Example list of Mac files with both -up.id and -down.id for each condition
        mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                        "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
        mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                        "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")

        # Combine both up and down files for each condition in Mac group
        mac_files <- c(mac_files_up, mac_files_down)

        # Read gene IDs for each file in Mac group
        mac_degs <- setNames(lapply(mac_files, read_gene_ids), make.unique(gsub("-(up|down).id", "", mac_files)))

        mac_degs_ <- list()
        combined_set <- c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]])
        mac_degs_$AB <- combined_set
        combined_set <- c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]])
        mac_degs_$IJ <- combined_set
        combined_set <- c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]])
        mac_degs_$W1 <- combined_set
        combined_set <- c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
        mac_degs_$Y1 <- combined_set

        # Function to clean sheet names to ensure no sheet name exceeds 31 characters
        truncate_sheet_name <- function(names_list) {
            sapply(names_list, function(name) {
                if (nchar(name) > 31) {
                    return(substr(name, 1, 31))  # Truncate sheet name to 31 characters
                }
                return(name)
            })
        }

        # Assuming lb_degs_ is already a list of gene sets (LB.AB, LB.IJ, etc.)

        # Define intersections between different conditions for LB
        inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
        inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
        inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
        inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
        inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
        inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

        # Define intersections between three conditions for LB
        inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
        inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
        inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
        inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

        # Define intersection between all four conditions for LB
        inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

        # Now remove the intersected genes from each original set for LB
        venn_list_lb <- list()

        # For LB.AB, remove genes that are also in other conditions
        venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))

        # For LB.IJ, remove genes that are also in other conditions
        venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))

        # For LB.W1, remove genes that are also in other conditions
        venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_w1_y1)))

        # For LB.Y1, remove genes that are also in other conditions
        venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_w1_y1)))

        # Add the intersections for LB (same as before)
        venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
        venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
        venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
        venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
        venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
        venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

        # Define intersections between three conditions for LB
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
        venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
        venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1

        # Define intersection between all four conditions for LB
        venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

        # Assuming mac_degs_ is already a list of gene sets (Mac.AB, Mac.IJ, etc.)

        # Define intersections between different conditions
        inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
        inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
        inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
        inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
        inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
        inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

        # Define intersections between three conditions
        inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
        inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
        inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
        inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

        # Define intersection between all four conditions
        inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

        # Now remove the intersected genes from each original set
        venn_list_mac <- list()

        # For Mac.AB, remove genes that are also in other conditions
        venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))

        # For Mac.IJ, remove genes that are also in other conditions
        venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))

        # For Mac.W1, remove genes that are also in other conditions
        venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_w1_y1)))

        # For Mac.Y1, remove genes that are also in other conditions
        venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_w1_y1)))

        # Add the intersections (same as before)
        venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
        venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
        venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
        venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
        venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
        venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

        # Add the intersections between three conditions
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
        venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
        venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1

        # Add the intersection between all four conditions
        venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

        # Save the gene IDs to Excel for further inspection (optional)
        write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
        write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

        # Clean sheet names and write the Venn intersection sets for LB and Mac groups into Excel files
        write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
        write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx", sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

        # Venn Diagram for LB group
        venn1 <- ggvenn(lb_degs_,
                        fill_color = c("skyblue", "tomato", "gold", "orchid"),
                        stroke_size = 0.4,
                        set_name_size = 5)
        ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

        # Venn Diagram for Mac group
        venn2 <- ggvenn(mac_degs_,
                        fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                        stroke_size = 0.4,
                        set_name_size = 5)
        ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

        cat("✅ All Venn intersection sets exported to Excel successfully.\n")
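The pairwise and three-way sets exported above (for example Mac.AB_AND_Mac.IJ) are plain intersections, so they still contain genes that are also shared with further conditions. If mutually exclusive Venn regions are wanted instead, where every gene lands in exactly one region, the following minimal sketch may help. The helper name `exclusive_regions` and the toy gene IDs are hypothetical and not part of the original script; only base R is used.

```r
# Hypothetical helper (not in the original script): for a named list of
# gene-ID vectors, return the genes that belong to exactly each combination
# of sets, so that every gene is assigned to one region only.
exclusive_regions <- function(sets) {
  stopifnot(is.list(sets), !is.null(names(sets)))
  all_ids <- unique(unlist(sets, use.names = FALSE))
  if (length(all_ids) == 0) return(list())

  # Membership matrix: one row per gene, one logical column per set
  member <- matrix(FALSE, nrow = length(all_ids), ncol = length(sets),
                   dimnames = list(all_ids, names(sets)))
  for (nm in names(sets)) member[, nm] <- all_ids %in% sets[[nm]]

  # Label each gene with the exact combination of sets that contain it
  labels <- unname(apply(member, 1, function(r) {
    paste(colnames(member)[r], collapse = "_AND_")
  }))

  # Group gene IDs by region label
  split(all_ids, labels)
}

# Toy example with made-up gene IDs (assumption, for illustration only)
toy <- list(Mac.AB = c("g1", "g2", "g3", "g5"),
            Mac.IJ = c("g2", "g3", "g4"),
            Mac.W1 = c("g3", "g5"))
regions <- exclusive_regions(toy)
print(lengths(regions))
```

Because each gene receives a single label, the region sizes add up to the number of distinct genes, which is a quick sanity check against the counts shown in the ggvenn plot. In the script above, the input would be `mac_degs_` or `lb_degs_`.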

Report:
```
Please find the heat maps attached.

Comparisons included

    1. 1457 TSB vs 1457Δsbp TSB (early timepoint, 1–2 h)

    2. 1457 MH vs 1457Δsbp MH (early timepoint, 1–2 h)

    3. 1457 TSB vs 1457Δsbp TSB (4 h)

    4. 1457 MH vs 1457Δsbp MH (4 h)

    5. 1457 TSB vs 1457Δsbp TSB (18 h)

    6. 1457 MH vs 1457Δsbp MH (18 h)

    7. 1457 TSB: early (1–2 h) vs 4 h vs 18 h

    8. 1457 MH: early (1–2 h) vs 4 h vs 18 h

    9. 1457Δsbp TSB: early (1–2 h) vs 4 h vs 18 h

    10. 1457Δsbp MH: early (1–2 h) vs 4 h vs 18 h

What each heat map shows

    Rows: significant DEGs only (padj < 0.05 and |log2(fold-change)| > 2, i.e., > 2 or < −2).

    Columns: the samples for the comparison (2 conditions × replicates) or, for the three-condition panels, 3 conditions × replicates = 9 columns.

How the files were generated

    Choose contrasts.

        Two-condition heat maps (items 1–6): contrast <- "<A>_vs_<B>".

        Three-condition panels (items 7–10):

        contrasts <- c("<A2>_vs_<A1>", "<A3>_vs_<A1>", "<A3>_vs_<A2>")
        cond_order <- c("<A1>", "<A2>", "<A3>")  # e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h

    Build the Genes Of Interest (GOI) dynamically. For each contrast, the script reads the -up.id and -down.id files of that contrast and takes the union; for three-condition panels it unions across the three pairwise contrasts.

    Subset the expression matrix (assay(rld)): keep only the rows in GOI and the sample columns matching the condition tags (e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h).

DEG annotation and enrichment

    The annotated significant DEGs (padj < 0.05 and |log2(fold-change)| > 2, i.e., > 2 or < −2) used in the steps above and the KEGG/GO enrichment results for each comparison are provided in the corresponding Excel files:

        DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx
        DEG_KEGG_GO_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx
        DEG_KEGG_GO_WT_MH_4h_vs_WT_MH_2h-all.xlsx
        DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_2h-all.xlsx
        DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_4h-all.xlsx
        DEG_KEGG_GO_WT_TSB_4h_vs_WT_TSB_2h-all.xlsx
        DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_2h-all.xlsx
        DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_4h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_4h_vs_deltasbp_MH_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_4h-all.xlsx
        DEG_KEGG_GO_deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.xlsx
        DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.xlsx
        DEG_KEGG_GO_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx
```
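The GOI construction and matrix subsetting described in the report could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `build_goi` and `subset_expression` are hypothetical, the ID files are assumed to be named `<contrast>-up.id` and `<contrast>-down.id` with one gene ID per line, and sample columns are assumed to start with the condition tag followed by an underscore (e.g. `WT_MH_2h_r1`). A toy matrix stands in for `assay(rld)`.

```r
# Hypothetical helper: union of up- and down-regulated gene IDs across contrasts.
# Assumes files "<contrast>-up.id" and "<contrast>-down.id", one ID per line.
build_goi <- function(contrasts, dir = ".") {
  files <- file.path(dir, c(paste0(contrasts, "-up.id"),
                            paste0(contrasts, "-down.id")))
  ids <- unlist(lapply(files, readLines))
  unique(ids[nzchar(ids)])  # drop empty lines and duplicates
}

# Hypothetical helper: keep GOI rows and the sample columns whose names
# start with one of the condition tags, in the order given by cond_order.
subset_expression <- function(mat, goi, cond_order) {
  cols <- unlist(lapply(cond_order, function(tag) {
    colnames(mat)[startsWith(colnames(mat), paste0(tag, "_"))]
  }))
  mat[intersect(goi, rownames(mat)), cols, drop = FALSE]
}

# Toy demonstration with made-up IDs and values (assumptions, not real data)
tmp <- tempfile("goi_demo")
dir.create(tmp)
writeLines(c("geneA", "geneB"), file.path(tmp, "WT_MH_4h_vs_WT_MH_2h-up.id"))
writeLines("geneC", file.path(tmp, "WT_MH_4h_vs_WT_MH_2h-down.id"))

goi <- build_goi("WT_MH_4h_vs_WT_MH_2h", dir = tmp)

# Stand-in for assay(rld): genes in rows, samples in columns
mat <- matrix(seq_len(20), nrow = 4,
              dimnames = list(paste0("gene", LETTERS[1:4]),
                              c("WT_MH_2h_r1", "WT_MH_2h_r2",
                                "WT_MH_4h_r1", "WT_MH_4h_r2",
                                "WT_MH_18h_r1")))

sub <- subset_expression(mat, goi, cond_order = c("WT_MH_2h", "WT_MH_4h"))
print(sub)
```

For a three-condition panel, passing all three pairwise contrasts to `build_goi` gives the union across them, and `cond_order` then fixes the left-to-right order of the sample columns in the heat map.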