#!/usr/bin/env python3
import os
import sys
import logging
import tempfile
import shutil
import datetime
from pathlib import Path
import psycopg2
from config import *
from lib.options import Options
from lib.wrap.trimmomatic import Trimmomatic
from lib.wrap.bowtie2 import Bowtie2
from lib.wrap.hmmsearch import HMMsearch
from lib.wrap.assembler import Assembler
from lib.wrap.trinity import Trinity
from lib.wrap.spades import SPAdes
from lib.wrap.idba import IDBA
from lib.wrap.no_assembler import No_Assembler
from lib.wrap.blast import Blast
from lib.wrap.blastn import Blastn
from lib.wrap.blastp import Blastp
from lib.usage import Usage
from lib.prepare_run import PrepareRun
from lib.contig_ops import ContigOps
from lib.ranking_ops import RankingOps
from lib.task_manager import TaskManager
from lib.taxonomy_ops import TaxonomyOps
from lib.check_dependencies import CheckDependencies
START_TIME = datetime.datetime.now()
options = Options.parse(sys.argv[1:])
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stderr)
logger.addHandler(handler)
ContigOps = ContigOps()
RankingOps = RankingOps()
TaskManager = TaskManager()
checkpoint_num = 0

def checkpoint(msg, error_code=None):
    # Minimal stand-in for the pipeline's own checkpoint routine: logs a numbered status message;
    # an error_code marks a failing checkpoint.
    global checkpoint_num
    checkpoint_num += 1
    if error_code is not None:
        logger.error(f"[checkpoint {checkpoint_num}] {msg} (code {error_code})")
    else:
        logger.info(f"[checkpoint {checkpoint_num}] {msg}")
trimmomatic = Trimmomatic(TRIMMOMATIC_PATH, TRIMMOMATIC_PARAMS, TRIMMOMATIC_PE_PARAMS,
TRIMMOMATIC_SE_PARAMS, options.threads, logger)
bowtie2 = Bowtie2(BOWTIE2_PATH, BOWTIE2_PARAMS, options.threads, logger)
assembler = None
if options.no_assembly:
    assembler = No_Assembler(None, None, options.threads, logger)
else:
    if ASSEMBLER == "trinity":
        assembler = Trinity(TRINITY_PATH, TRINITY_PARAMS, options.threads, logger)
    elif ASSEMBLER == "spades":
        assembler = SPAdes(SPADES_PATH, SPADES_PARAMS, options.threads, logger)
    elif ASSEMBLER == "idba":
        assembler = IDBA(IDBA_PATH, IDBA_PARAMS, options.threads, logger)
hmmsearch = HMMsearch(HMMSEARCH_PATH, f"{HMMSEARCH_PARAMS} --domE {HMMSEARCH_EVALUE}",
PFAM_PATH, options.threads, logger)
blastn = Blastn(BLASTN_PATH, BLASTN_PARAMS, BLASTN_NT_PATH, options.threads, logger)
blastp = Blastp(BLASTP_PATH, BLASTP_PARAMS, BLASTP_NR_PATH, options.threads, logger)
def finish(failure_msg=None):
    if options.clean:
        try:
            shutil.rmtree(options.derived["sample_folder"])
        except OSError as e:
            logger.error(f"{e}\nFailed to remove folder '{options.derived['sample_folder']}'. Please remove it manually.")
    if options.debug:
        logger.debug(options.derived)
    if failure_msg:
        checkpoint(failure_msg, 20)
        sys.exit(1)
    else:
        checkpoint('Analysis Complete')
        cmd = "INSERT INTO analysis (finished) VALUES ('now')"
        logger.debug(cmd)
        # assumes a psycopg2 connection opened earlier, e.g. conn = psycopg2.connect(...)
        conn.cursor().execute(cmd)
        conn.commit()
        sys.exit(0)
# ... (other parts of the script) ...
checkpoint('Generating summary report')
from lib.wrap.report import Report
report = Report(options, logger)
report.create()
# Upload results to the user-specified storage (e.g., AWS S3, Google Cloud Storage)
checkpoint('Uploading results')
if options.upload_results:
    from lib.wrap.upload_results import UploadResults
    uploader = UploadResults(options, logger)
    uploader.upload()
finish()
How to use Biopython to perform phylogenetic analysis?
Biopython is a powerful library for bioinformatics that provides tools for manipulating biological sequences, working with 3D structures, performing genome analysis, and more. Among its many features, it also has tools for phylogenetic analysis.
Here’s a simple example of how you can use Biopython to perform phylogenetic analysis using the following steps:
- Obtain sequence data (e.g., from a FASTA file or GenBank).
- Perform multiple sequence alignment (MSA) using a suitable algorithm.
- Construct a phylogenetic tree using a tree-building method.
- Visualize the phylogenetic tree.
Note that Biopython does not implement multiple sequence alignment itself, so you will need an external tool like MUSCLE or Clustal Omega for this step. Biopython can, however, parse the output of these tools.
Here’s an example workflow:
Step 1: Obtain sequence data
Let’s assume you have a FASTA file named sequences.fasta containing multiple sequences.
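Optionally, you can sanity-check the input before aligning; a minimal sketch using Biopython's SeqIO (assuming the sequences.fasta file above):
from Bio import SeqIO
# Print the ID and length of every input sequence
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))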
Step 2: Perform multiple sequence alignment
First, install the necessary libraries:
pip install biopython
Now, align your sequences using an external tool like MUSCLE or Clustal Omega. You can also use Biopython’s AlignIO module to parse the output.
For example, if you’re using Clustal Omega, you can run:
clustalo -i sequences.fasta -o aligned_sequences.clustal --outfmt=clustal
Step 3: Construct a phylogenetic tree
We will use the Phylo module in Biopython to construct a tree using the neighbor-joining method.
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import Phylo
# Read the alignment file
alignment = AlignIO.read("aligned_sequences.clustal", "clustal")
# Calculate the distance matrix
calculator = DistanceCalculator("identity")
distance_matrix = calculator.get_distance(alignment)
# Construct the tree using the neighbor-joining method
constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)
# Save the tree to a file in Newick format
Phylo.write(tree, "phylogenetic_tree.newick", "newick")
Step 4: Visualize the phylogenetic tree
You can visualize the tree using Biopython’s Phylo.draw() function or export it to a file and use external tools like FigTree, iTOL, or Dendroscope.
To visualize the tree using Biopython’s Phylo.draw() function:
import matplotlib.pyplot as plt
# Draw the tree without blocking, so the figure can still be saved afterwards
Phylo.draw(tree, do_show=False)
# Save the tree as an image
plt.savefig("phylogenetic_tree.png")
These steps provide a basic example of how to perform phylogenetic analysis using Biopython. You can further customize the analysis and visualization by exploring the various features and options available in the library.
Powerful oneliners for quick data processing in bioinformatics
There are many command-line tools and utilities that can be useful in bioinformatics for quick data processing, analysis, and manipulation. Some of these oneliner tools include:
- awk: A versatile text processing tool that can be used to filter, reformat, and transform data.
  awk '{print $1}' input.txt
- sed: A stream editor for filtering and transforming text.
  sed 's/A/T/g' input.txt > output.txt
- grep: A tool to search for patterns in text files.
  grep "ATG" input.fasta
- sort: Sorts lines in a text file.
  sort input.txt > sorted_input.txt
- uniq: Removes duplicate lines from a sorted file or displays the number of occurrences of each line.
  uniq -c input.txt > unique_counts.txt
- wc: Counts lines, words, and characters in a file.
  wc -l input.txt
- cut: Removes sections from each line of a file.
  cut -f1,3 input.txt > output.txt
- paste: Joins corresponding lines of multiple files.
  paste file1.txt file2.txt > combined.txt
- tr: Translates or deletes characters.
  tr 'atcg' 'TAGC' < input.txt > output.txt
- curl: Transfers data from or to a server.
  curl -O "https://example.com/file.fasta"
- bioawk: An extension of awk with built-in functions for biological data.
  bioawk -c fastx '{print $name, length($seq)}' input.fasta
- seqkit: A cross-platform toolkit for FASTA/Q file manipulation.
  seqkit stat input.fasta
Samtools is a widely-used suite of tools for manipulating and analyzing high-throughput sequencing (HTS) data in the SAM, BAM, and CRAM formats. Here are some examples of how you can use Samtools for various tasks:
- Convert SAM to BAM format: To convert a SAM (Sequence Alignment/Map) file to a compressed binary BAM (Binary Alignment/Map) file, you can use the samtools view command with the -bS option:
  samtools view -bS input.sam > output.bam
- Sort a BAM file: To sort a BAM file by genomic coordinates, you can use the samtools sort command:
  samtools sort input.bam -o sorted_output.bam
- Index a sorted BAM file: To create an index for a sorted BAM file, which allows you to quickly access alignments overlapping particular genomic regions, you can use the samtools index command:
  samtools index sorted_output.bam
- Generate an alignment statistics report: To create a summary report of alignment statistics, such as the number of mapped and unmapped reads, you can use the samtools flagstat command:
  samtools flagstat input.bam > report.txt
- Extract reads aligned to a specific region: To extract reads aligned to a specific genomic region (the BAM file must be coordinate-sorted and indexed first), you can use the samtools view command with the -b option and the region of interest:
  samtools view -b input.bam chr1:10000-20000 > region_output.bam
- Filter alignments: To filter alignments based on specific criteria, such as minimum mapping quality, you can use the samtools view command with the -q option (adding -b to keep BAM output):
  samtools view -b -q 20 input.bam > filtered_output.bam
- Generate a pileup: To create a pileup file from a BAM file, which displays the sequencing depth at each position of the reference genome, you can use the samtools mpileup command:
  samtools mpileup -f reference.fasta input.bam > output.pileup
Bcftools is a set of utilities for variant calling and manipulating VCF (Variant Call Format) and BCF (Binary Call Format) files. Here are some examples of how you can use Bcftools for various tasks:
- Call variants: To call variants from a BAM or CRAM file using the multiallelic caller (-m), you can use the bcftools mpileup command followed by bcftools call:
  bcftools mpileup -Ou -f reference.fasta input.bam | bcftools call -mv -Ov -o output.vcf
- Filter variants: To filter variants based on specific criteria, such as quality or depth, you can use the bcftools filter command:
  bcftools filter -i 'QUAL > 20 && DP > 10' input.vcf -o filtered_output.vcf
- View VCF/BCF file: To view the contents of a VCF or BCF file, you can use the bcftools view command:
  bcftools view input.vcf
- Sort a VCF file: To sort a VCF file by genomic coordinates, you can use the bcftools sort command:
  bcftools sort input.vcf -o sorted_output.vcf
- Index a VCF file: To create an index for a VCF or BCF file, which allows you to quickly access variants overlapping specific genomic regions, you can use the bcftools index command (a plain VCF must be bgzip-compressed first):
  bgzip sorted_output.vcf
  bcftools index sorted_output.vcf.gz
- Concatenate multiple VCF files: To concatenate multiple VCF files, ensuring that they have the same sample columns in the same order, you can use the bcftools concat command:
  bcftools concat input1.vcf input2.vcf -o combined_output.vcf
- Generate consensus sequences: To create consensus sequences by applying VCF variants to a reference genome, you can use the bcftools consensus command:
  bcftools consensus -f reference.fasta input.vcf.gz > consensus.fasta
- Normalize indels: To normalize indels in a VCF file (left-align and trim indels), you can use the bcftools norm command:
  bcftools norm -f reference.fasta input.vcf -o normalized_output.vcf
Bedtools is a powerful suite of tools for working with genomic intervals in various file formats, such as BED, GFF/GTF, and VCF. Here are some examples of how you can use Bedtools for various tasks:
- Intersect intervals: To find overlapping intervals between two files, you can use the bedtools intersect command:
  bedtools intersect -a input1.bed -b input2.bed > output.bed
- Merge intervals: To merge overlapping or adjacent intervals in a single file (the input must be position-sorted), you can use the bedtools merge command:
  bedtools merge -i input.bed > output.bed
- Subtract intervals: To subtract intervals in one file from another, you can use the bedtools subtract command:
  bedtools subtract -a input1.bed -b input2.bed > output.bed
- Get genomic sequences: To extract sequences from a reference genome corresponding to intervals in a BED file, you can use the bedtools getfasta command:
  bedtools getfasta -fi reference.fasta -bed input.bed -fo output.fasta
- Sort intervals: To sort genomic intervals by chromosome and start position, you can use the bedtools sort command:
  bedtools sort -i input.bed > sorted_output.bed
- Calculate coverage: To compute the depth at each position or the depth for each interval in a BED file, you can use the bedtools coverage command:
  bedtools coverage -a input1.bed -b input2.bed > output.bed
- Find closest features: To find the closest feature in another file for each feature in a BED file, you can use the bedtools closest command:
  bedtools closest -a input1.bed -b input2.bed > output.bed
- Compute statistics: To calculate various summary statistics for each feature in a BED file, such as the mean, median, or standard deviation of scores, you can use the bedtools groupby command:
  bedtools groupby -i input.bed -g 1 -c 5 -o mean,median,stdev > output.bed
You can often combine these tools using pipes (|) to create powerful oneliners for complex data processing tasks. For example, samtools view -F 4 input.bam | cut -f3 | sort | uniq -c | sort -rn tabulates the number of mapped reads per reference sequence.
Draw scatter_3d using plotly 4.10 (+ellipses?)
import plotly.graph_objects as go
import pandas as pd
from sklearn.decomposition import PCA
import numpy as np
from scipy.linalg import eigh, sqrtm
# Read in data as a pandas dataframe
#df = pd.DataFrame({
# 'PC1': [-13.999925, -12.504291, -12.443057, -13.065235, -17.316215],
# 'PC2': [-1.498823, -3.342411, -6.067055, -8.205809, 3.293993],
# 'PC3': [-3.335085, 15.207755, -14.725450, 15.078469, -6.917358],
# 'condition': ['GFP d3', 'GFP d3', 'GFP d8', 'GFP d8', 'GFP+mCh d9/12'],
# 'donor': ['DI', 'DII', 'DI', 'DII', 'DI']
#})
df = pd.read_csv('merged_df_28PCs.csv', index_col=0, header=0)
df['condition'] = df['condition'].replace({
    "GFP+mCh d9/12": "ctrl LTtr+sT d9/12",
    "sT+LTtr d9/12": "LTtr+sT d9/12",
    "GFP d3": "ctrl LT/LTtr d3",
    "GFP d8": "ctrl LT/LTtr d8",
    "mCh d3": "ctrl sT d3",
    "mCh d8": "ctrl sT d8",
})
# Fit PCA model to reduce data dimensions to 3
pca = PCA(n_components=3)
pca.fit(df.iloc[:, :-3])
X_reduced = pca.transform(df.iloc[:, :-3])
# Add reduced data back to dataframe
df['PC1'] = X_reduced[:, 0]
df['PC2'] = X_reduced[:, 1]
df['PC3'] = X_reduced[:, 2]
# Create PCA plot with 3D scatter
fig = go.Figure()
# Available marker symbols: 'circle', 'circle-open', 'square', 'square-open', 'diamond', 'diamond-open', 'cross', 'x'
# Diamonds are drawn slightly smaller than circles below so that both appear visually equal in size.
# Colour scheme: three families, each running from light to deep (the first close to grey/black):
#   family 1: 'ctrl LTtr+sT d9/12', 'LTtr+sT d9/12'
#   family 2: 'ctrl LT/LTtr d3', 'ctrl LT/LTtr d8', 'LT d3', 'LT d8', 'LTtr d3', 'LTtr d8'
#   family 3: 'ctrl sT d3', 'ctrl sT d8', 'sT d3', 'sT d8', 'sT+LT d3'
condition_color_map_untreated = {'untreated':'black'}
donor_symbol_map_untreated = {'DI': 'circle-open', 'DII': 'diamond-open'}
#condition_color_map = {'ctrl LTtr+sT d9/12': 'green', 'GFP d3': 'blue', 'GFP d8': 'red', 'GFP+mCh d9/12': 'green', 'LT d3': 'orange'}
condition_color_map = {
'ctrl LTtr+sT d9/12': 'black',
'LTtr+sT d9/12': '#a14a1a',
'ctrl LT/LTtr d3': '#D3D3D3',
'ctrl LT/LTtr d8': '#A9A9A9',
'LT d3': '#b2df8a',
'LT d8': '#33a02c',
'LTtr d3': '#a6cee3',
'LTtr d8': '#1f78b4',
'ctrl sT d3': '#696969',
'ctrl sT d8': '#2F4F4F',
'sT d3': '#fb9a99',
'sT d8': '#e31a1c',
'sT+LT d3': 'magenta'
}
donor_symbol_map = {'DI': 'circle', 'DII': 'diamond'}
for donor, donor_symbol in donor_symbol_map_untreated.items():
    for condition, condition_color in condition_color_map_untreated.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=4 if donor_symbol in ['diamond-open'] else 6, opacity=0.8, color=condition_color, symbol=donor_symbol)))
for donor, donor_symbol in donor_symbol_map.items():
    for condition, condition_color in condition_color_map.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=4 if donor_symbol in ['diamond'] else 6, opacity=0.8, color=condition_color, symbol=donor_symbol)))
# Invisible traces so that the donor symbols get their own legend entries
for donor, donor_symbol in donor_symbol_map.items():
    fig.add_trace(go.Scatter3d(x=[None], y=[None], z=[None],
                               mode='markers',
                               name=donor,
                               legendgroup=f'{donor}',
                               showlegend=True,
                               marker=dict(size=6, opacity=1, color='black', symbol=donor_symbol),
                               hoverinfo='none'))
'''
# Calculate mean and covariance for each subset of data for ellipsoid drawing
DI_mask = (df['donor'] == 'DI') & (~df['condition'].isin(['untreated']))
DII_mask = (df['donor'] == 'DII') & (~df['condition'].isin(['untreated']))
untreated_mask = (df['condition'] == 'untreated')
mean_DI = df.loc[DI_mask, ['PC1', 'PC2', 'PC3']].mean().values
cov_DI = df.loc[DI_mask, ['PC1', 'PC2', 'PC3']].cov().values
mean_DII = df.loc[DII_mask, ['PC1', 'PC2', 'PC3']].mean().values
cov_DII = df.loc[DII_mask, ['PC1', 'PC2', 'PC3']].cov().values
mean_untreated = df.loc[untreated_mask, ['PC1', 'PC2', 'PC3']].mean().values
cov_untreated = df.loc[untreated_mask, ['PC1', 'PC2', 'PC3']].cov().values
# Functions to create confidence ellipsoids
from scipy.stats import chi2

def create_ellipse(center, radii, rotation_matrix, resolution=100):
    # Parametrize the unit sphere, scale by the radii, then rotate and translate
    u = np.linspace(0, 2 * np.pi, resolution)
    v = np.linspace(0, np.pi, resolution)
    x = radii[0] * np.outer(np.cos(u), np.sin(v))
    y = radii[1] * np.outer(np.sin(u), np.sin(v))
    z = radii[2] * np.outer(np.ones_like(u), np.cos(v))
    for i in range(len(x)):
        for j in range(len(x)):
            [x[i, j], y[i, j], z[i, j]] = np.dot([x[i, j], y[i, j], z[i, j]], rotation_matrix) + center
    return x, y, z

def ellipsoid_params(mean, cov, conf_interval=0.9):
    # Eigendecompose the covariance and scale the axes to the requested confidence level
    vals, vecs = eigh(cov)
    vals = vals[::-1]
    vecs = vecs[:, ::-1]
    radii = np.sqrt(vals * chi2.ppf(conf_interval, df=3))
    rotation_matrix = vecs
    return mean, radii, rotation_matrix
center_DI, radii_DI, rotation_matrix_DI = ellipsoid_params(mean_DI, cov_DI)
center_DII, radii_DII, rotation_matrix_DII = ellipsoid_params(mean_DII, cov_DII)
#center_untreated, radii_untreated, rotation_matrix_untreated = ellipsoid_params(mean_untreated, cov_untreated)
# Generate ellipses using create_ellipse function
x_DI, y_DI, z_DI = create_ellipse(center_DI, radii_DI, rotation_matrix_DI)
x_DII, y_DII, z_DII = create_ellipse(center_DII, radii_DII, rotation_matrix_DII)
#x_untreated, y_untreated, z_untreated = create_ellipse(center_untreated, radii_untreated, rotation_matrix_untreated)
fig.add_trace(go.Mesh3d(x=x_DI.flatten(), y=y_DI.flatten(), z=z_DI.flatten(),
alphahull=5,
opacity=0.1,
color='blue',
name='DI ellipse',
showlegend=False))
fig.add_trace(go.Mesh3d(x=x_DII.flatten(), y=y_DII.flatten(), z=z_DII.flatten(),
alphahull=5,
opacity=0.1,
color='red',
name='DII ellipse',
showlegend=False))
'''
# Annotations for the legend blocks
fig.update_layout(
    annotations=[
        dict(x=1.1, y=1.0, xref='paper', yref='paper', showarrow=False,
             text='Condition', font=dict(size=15)),
        dict(x=1.1, y=0.6, xref='paper', yref='paper', showarrow=False,
             text='Donor', font=dict(size=15))
    ],
    scene=dict(
        aspectmode='cube',
        xaxis=dict(gridcolor='lightgrey', linewidth=2, linecolor='black', backgroundcolor='white', zerolinecolor='black', zerolinewidth=2, title='PC1: 36% variance'),
        yaxis=dict(gridcolor='lightgrey', linewidth=2, linecolor='black', backgroundcolor='white', zerolinecolor='black', zerolinewidth=2, title='PC2: 17% variance'),
        zaxis=dict(gridcolor='lightgrey', linewidth=2, linecolor='black', backgroundcolor='white', zerolinecolor='black', zerolinewidth=2, title='PC3: 15% variance'),
        bgcolor='white'
    ),
    margin=dict(l=5, r=5, b=5, t=0)  # Adjust the margins to prevent clipping of axis titles
)
#fig.show()
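# Note: static image export via fig.write_image() requires an export backend such as the kaleido package (pip install kaleido)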
fig.write_image("fig4.svg")
Creating and Training a Toy Transformer Model with Python: A Step-by-Step Guide
GPT (short for “Generative Pre-trained Transformer”) is a type of transformer model, which is an advanced deep learning architecture. It is based on the Transformer architecture introduced by Vaswani et al. in the paper “Attention is All You Need” in 2017. GPT and its successors, such as GPT-2 and GPT-3, have been developed and released by OpenAI.
Yes, you can train a toy transformer model with Python. It will likely be smaller and less powerful than GPT models, but it can help you understand the core concepts and workings of the transformer architecture. There are several libraries and resources available in Python for creating and training transformer models, including TensorFlow and PyTorch.
Here is a high-level overview of the process:
- Install necessary libraries: Install TensorFlow or PyTorch, depending on which library you prefer for deep learning tasks.
- Prepare the dataset: Collect and preprocess the text data you will use to train your model. The data should be tokenized (converted into smaller units, such as words or subwords) and encoded into a suitable format for the model.
- Define the model architecture: Implement the transformer model architecture, including the attention mechanism, multi-head self-attention, positional encoding, feed-forward layers, and other necessary components.
- Train the model: Train your toy transformer model on the prepared dataset using an appropriate optimizer, learning rate schedule, and loss function (e.g., cross-entropy loss for language modeling tasks). Make sure to set aside a portion of the data for validation to monitor the model's performance during training and prevent overfitting.
- Evaluate the model: After training, evaluate your toy transformer model on a test dataset to measure its performance. Common evaluation metrics for language models include perplexity, BLEU score, and ROUGE score, depending on the specific task you're focusing on.
- Fine-tune and experiment: You can fine-tune your toy transformer model by adjusting hyperparameters, such as learning rate, batch size, and model size (e.g., the number of layers and attention heads). Experiment with different settings to improve the model's performance.
Keep in mind that training a transformer model, even a toy one, can be computationally expensive and may require a significant amount of time and resources. Using a GPU or cloud-based services, such as Google Colab, can help speed up the training process.
There are also many tutorials and code examples available online to help you get started with building and training your own toy transformer model in Python. Remember to consult the official documentation of the deep learning libraries (TensorFlow or PyTorch) and any other resources to ensure you’re using the most up-to-date practices and techniques.
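To make these steps concrete, here is a minimal PyTorch sketch of a toy character-level transformer language model. Everything in it (the repeated-sentence corpus, model size, and hyperparameters) is an illustrative assumption rather than a GPT recipe, and it leans on PyTorch's built-in nn.TransformerEncoder with a causal mask instead of implementing attention from scratch:
import torch
import torch.nn as nn

class ToyTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        h = self.embed(x) + self.pos(positions)
        # Causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        return self.lm_head(self.encoder(h, mask=mask))

# Toy character-level corpus; a real dataset would be far larger
text = "the quick brown fox jumps over the lazy dog " * 50
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

model = ToyTransformerLM(vocab_size=len(chars))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

block = 32
for step in range(200):
    # Sample a random block; the targets are the inputs shifted by one character
    i = torch.randint(0, len(data) - block - 1, (1,)).item()
    x = data[i:i + block].unsqueeze(0)
    y = data[i + 1:i + block + 1].unsqueeze(0)
    loss = loss_fn(model(x).reshape(-1, len(chars)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
On a CPU this trains in seconds, and a steadily dropping loss is usually enough to confirm that the masking and target shifting are wired correctly before scaling anything up.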
Downloading DNA and Protein Sequences from NCBI Using Biopython and Taxonomy ID
Biopython is a useful library for working with biological data in Python. You can use it to download sequences from GenBank by specifying a taxonomy ID. Here’s a code example to download sequences using Biopython’s Entrez module, which is an interface to the NCBI Entrez databases, including GenBank:
from Bio import Entrez
from Bio import SeqIO
# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"
# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"
# Search for records in the nucleotide database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="nucleotide", term=search_query)
# Parse the search results
record = Entrez.read(handle)
handle.close()
# Get the GenBank IDs of the records
genbank_ids = record["IdList"]
# Fetch the sequences using the GenBank IDs
sequences = []
for genbank_id in genbank_ids:
    handle = Entrez.efetch(db="nucleotide", id=genbank_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)
# Print the fetched sequences
for seq in sequences:
    print(seq)
Replace “your_email@example.com” with your actual email address, and “your_taxonomy_id_here” with the specific taxonomy ID you’re interested in. This script will search the nucleotide database in NCBI GenBank using the specified taxonomy ID and download the corresponding sequences.
Here’s a brief explanation of the code:
- Import the necessary modules from Biopython: Entrez for accessing the NCBI databases and SeqIO for handling sequence records.
- Set your email address. This is required by NCBI to track the usage of their resources.
- Specify the taxonomy ID you want to search for.
- Create a search query using the taxonomy ID, and search the nucleotide database.
- Parse the search results to get a list of GenBank IDs.
- Use the GenBank IDs to fetch the sequences, and store them in a list.
- Print the fetched sequences in the console.
Please note that Entrez.esearch returns only the first 20 record IDs by default; pass a larger retmax value (e.g., Entrez.esearch(db="nucleotide", term=search_query, retmax=1000)) if you want more of the sequences associated with the specified taxonomy ID. If there are many sequences, it might take a while to download them all. Also, make sure you have installed Biopython in your Python environment before running the script. If you haven’t, you can install it using pip:
pip install biopython
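If you would rather save the downloaded records than print them, SeqIO.write can serialize the list directly; a small optional addition to the script above (the output file name is arbitrary):
# Write the fetched records to disk; use format "genbank" instead of "fasta" to keep annotations
SeqIO.write(sequences, "downloaded_sequences.fasta", "fasta")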
If you want to download protein sequences instead of nucleotide sequences, you can make a few modifications to the code I provided earlier. The main change is to search and fetch data from the “protein” database instead of the “nucleotide” database. Here’s the updated code:
from Bio import Entrez
from Bio import SeqIO
# Set your email address (required by NCBI)
Entrez.email = "your_email@example.com"
# Specify the taxonomy ID
taxonomy_id = "your_taxonomy_id_here"
# Search for records in the protein database using the taxonomy ID
search_query = f"txid{taxonomy_id}[Organism:exp]"
handle = Entrez.esearch(db="protein", term=search_query)
# Parse the search results
record = Entrez.read(handle)
handle.close()
# Get the protein IDs of the records
protein_ids = record["IdList"]
# Fetch the sequences using the protein IDs
sequences = []
for protein_id in protein_ids:
    handle = Entrez.efetch(db="protein", id=protein_id, rettype="gb", retmode="text")
    seq_record = SeqIO.read(handle, "genbank")
    handle.close()
    sequences.append(seq_record)
# Print the fetched sequences
for seq in sequences:
    print(seq)
This code is almost identical to the previous one, but with two important changes:
- The Entrez.esearch() call has its db parameter set to "protein" instead of "nucleotide", so the search is performed in the protein database rather than the nucleotide database.
- The Entrez.efetch() call has its db parameter set to "protein" for the same reason, so the sequences are fetched from the protein database.
Preparing custom gtf and bed files for nextflow RNA-seq pipeline
-- 0, generate the ref.gtf and ref.bed from GenBank file: HD21-2_new.gtf and HD21-2_new.bed --
#[PREPARE gtf from gff] https://github.com/gpertea/gffread/issues/3
#DEL gb2gff.py HD21-2_chr.gbk > HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_1.gbk >> HD21-2.gtf
#DEL gb2gff.py HD21-2_plasmid_pHD21-2_2.gbk >> HD21-2.gtf
#-- STEP 0.0: download the gff3 files --
cat HD21-2_chr.gff3 HD21-2_plasmid1.gff3 HD21-2_plasmid2.gff3 > HD21-2.gff3
#-- STEP 0.1: rename CDS to exon! --
gffread -E -F -O -T HD21-2.gff3 -o HD21-2.gtf
sed -i -e "s/\tCDS\t/\texon\t/g" HD21-2.gtf
#Genbank_CDS --> Genbank_exon
#cmsearch_CDS --> cmsearch_exon
#tRNAscan-SE_CDS --> tRNAscan-SE_exon
#or xxx_CDS --> xxx_exon --> xxx[cmsearch, Genbank, tRNAscan] exon
#
#-- STEP 0.2: add protein_coding "xxxx" to type exon
#exon (former Genbank CDS): 2657 = (Genbank transcript + cmsearch transcript + tRNAscan-SE transcript) = (2586+61+10)
#Genbank region: 7
# transcript: 2649
#Genbank_exon == Genbank transcript 2586
#tRNAscan-SE_exon == tRNAscan-SE transcript 61
#cmsearch_exon == cmsearch transcript 10
------------------------------------------------------
#total 5313 (transcript 2649-->bed, exon 2657, region 7)
#in kate "\texon" --> "_exon"
grep "Genbank_exon" HD21-2.gtf > HD21-2_Genbank_exon.gtf #add gene_biotype "protein_coding"; at the end of line (2586)
grep "tRNAscan-SE_exon" HD21-2.gtf > HD21-2_tRNAscan-SE_exon.gtf #add gene_biotype "tRNA"; at the end of line (61)
grep "cmsearch_exon" HD21-2.gtf > HD21-2_cmsearch_exon.gtf #add gene_biotype "RNase_P_RNA";| gene_biotype "ncRNA";| gene_biotype "rRNA";| gene_biotype "tmRNA";| gene_biotype "SRP_RNA"; at the end (10)
grep -i -e "cmsearch" HD21-2.gtf > temp
grep -i -e "gene_biotype" temp > temp2
CP052994.1 cmsearch transcript 1286239 1286622 . - . transcript_id "rna-HKH70_06005"; gene_id "gene-HKH70_06005"; gene_name "rnpB"; Dbxref "RFAM:RF00011"; gbkey "ncRNA"; gene "rnpB"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06005"; product "RNase P RNA component class B"; Name "rnpB"; gene_biotype "RNase_P_RNA";
CP052994.1 cmsearch transcript 1452715 1452905 . - . transcript_id "rna-HKH70_06865"; gene_id "gene-HKH70_06865"; gene_name "ssrS"; Dbxref "RFAM:RF00013"; gbkey "ncRNA"; gene "ssrS"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_06865"; product "6S RNA"; Name "ssrS"; gene_biotype "ncRNA";
CP052994.1 cmsearch transcript 1706296 1706410 . - . transcript_id "rna-HKH70_08190"; gene_id "gene-HKH70_08190"; gene_name "rrf"; Dbxref "RFAM:RF00001"; gbkey "rRNA"; gene "rrf"; inference "COORDINATES: profile:INFERNAL:1.1.1"; locus_tag "HKH70_08190"; product "5S ribosomal RNA"; Name "rrf"; gene_biotype "rRNA";
...
grep -v "_exon" HD21-2.gtf > HD21-2_non_exon.gtf #2656
cat HD21-2_non_exon.gtf HD21-2_Genbank_exon.gtf HD21-2_tRNAscan-SE_exon.gtf HD21-2_cmsearch_exon.gtf > HD21-2_new.gtf #5313
#in kate "exon" --> "\texon"
#[Eventually] sort the HD21-2_new.gtf using libreoffice
#[Eventually] replace "" with "; "transcript_id with transcript_id; "\n with \n in kate
#-- STEP 0.3: PREPARE bed from gtf
#[Eventually] replace '\t0\t' with '\t.\t' in the gff-file in the kate-editor
gffread -E -F --bed HD21-2_new.gtf -o HD21-2_new.bed # .. loaded 2649 genomic features from HD21-2_new.gtf
#[IMPORTANT] delete [*.1] in gtf, bed. In fa it could be e.g. ">CP052994 Staphylococcus epidermidis strain HD21-2 chromosome, complete genome"
#GTF: CP052994.1 Genbank exon 1 1356 . + . transcript_id "gene-HKH70_00005"; gene_id "gene-HKH70_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QRJ41002.1"; Name "QRJ41002.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_017723675.1"; locus_tag "HKH70_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QRJ41002.1"; transl_table "11"; gene_biotype "protein_coding";
#BED: CP052994.1 0 1356 gene-HKH70_00005 100 + 0 1356 0,0,0 1 1356, 0, CDS=0:1356;CDSphase=0;geneID=gene-HKH70_00005;gene_name=dnaA;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=HKH70_00005
-------
-- 2 --
conda activate rnaseq
# --> !!!! SUCCESSFUL !!!!
nextflow run rnaseq --reads '/home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/raw_data/*_R{1,2}.fq.gz' --fasta /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2.fa --gtf /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.gtf --bed12 /home/jhuang/DATA/Data_Anna_HAPDICS_RNASeq/HD21-2_new.bed -profile standard --aligner hisat2 --fcGroupFeatures gene_id --fcGroupFeaturesType gene_biotype --saveReference --skip_genebody_coverage --skip_dupradar --skip_preseq --skip_edger -resume
#default fcGroupFeatures="gene_biotype" (or gbkey)
--skip_qc Skip all QC steps apart from MultiQC
--skip_fastqc Skip FastQC
--skip_rseqc Skip RSeQC
--skip_genebody_coverage Skip calculating genebody coverage
--skip_preseq Skip Preseq
--skip_dupradar Skip dupRadar (and Picard MarkDups)
--skip_edger Skip edgeR MDS plot and heatmap
--skip_multiqc Skip MultiQC
47 popular functions in Biopython
Here are 47 popular functions in Biopython:
- Seq: Represents a biological sequence, such as DNA or protein.
- SeqRecord: Represents a sequence with metadata, including ID and description.
- SeqIO: Handles reading and writing sequence files in various formats.
- AlignIO: Handles reading and writing multiple sequence alignment files in various formats.
- Bio.Alphabet: Defines sequence alphabets, such as IUPAC codes for DNA and protein (removed in Biopython 1.78; sequences are now handled as plain strings).
- Bio.SeqUtils: Provides various utilities for working with sequences, such as computing molecular weight and codon usage.
- Bio.SeqIO.QualityIO: Handles reading and writing FASTQ files.
- Bio.SeqIO.PhdIO: Handles reading and writing PHD files.
- Bio.SeqIO.FastaIO: Handles reading and writing FASTA files.
- Bio.AlignIO.ClustalIO: Handles reading and writing Clustal alignment files.
- Bio.AlignIO.EmbossIO: Handles reading and writing EMBOSS alignment files.
- Bio.SeqIO.InsdcIO: Handles reading and writing GenBank and EMBL files.
- Bio.SeqIO.SwissIO: Handles reading and writing Swiss-Prot files.
- Bio.SeqIO.TabIO: Handles reading and writing tab-separated files.
- Bio.SeqIO.NexusIO: Handles reading and writing NEXUS files.
- Bio.SearchIO: Parses search results from various tools, such as BLAST and HMMER.
- Bio.Blast: Runs BLAST searches against various databases.
- Bio.KEGG: Provides access to the KEGG database.
- Bio.PDB: Handles reading and writing PDB files, as well as structure manipulation and analysis.
- Bio.Align: Represents a multiple sequence alignment, with associated metadata and annotations.
- Bio.Align.Applications: Wrappers for running multiple sequence alignment tools, such as ClustalW and Muscle.
- Bio.Restriction: Provides tools for simulating and analyzing DNA restriction digests.
- Bio.SeqFeature: Represents a feature or annotation on a sequence, such as a gene or a mutation.
- Bio.Graphics.GenomeDiagram: Generates graphical representations of genome sequences, with features and annotations.
- Bio.Graphics.BasicChromosome: Generates basic chromosome diagrams.
- Bio.Emboss.Applications: Wrappers for running EMBOSS tools, such as Needle and Water.
- Bio.SeqIO.index: Provides dictionary-like, on-demand random access to records in large sequence files.
- Bio.AlignIO.convert: Converts alignment files between formats in a single call.
- Bio.Align.AlignInfo: Computes various summary statistics and information content for a multiple sequence alignment.
- Bio.Align.PairwiseAligner: Computes pairwise alignments between two sequences.
- Bio.motifs (formerly Bio.Motif): Represents sequence motifs, such as transcription factor binding sites.
- Bio.motifs.matrix.PositionWeightMatrix: Represents a position weight matrix, commonly used to represent sequence motifs.
- Bio.Alphabet.IUPAC: Defines IUPAC codes for various sequence alphabets, such as DNA and protein.
- Bio.Alphabet.Gapped: Defines a gapped sequence alphabet, used for representing alignments.
- Bio.SeqIO.AceIO: Handles reading and writing ACE files, commonly used for sequencing data.
- Bio.SeqIO.GameXMLIO: Handles reading and writing GAME XML files, used for gene annotation.
- Bio.SeqIO.SffIO: Handles reading and writing SFF (Standard Flowgram Format) files, the native format for Roche 454 sequencing data.
- Bio.SeqUtils.molecular_weight(seq): Calculates the molecular weight of a sequence.
- Bio.Phylo.PAML.codeml: A module for running the PAML codeml program for estimating rates of evolution and selection.
- Bio.AlignIO.parse(handle, format): Parses multiple sequence alignments from a file.
- Bio.Alphabet.IUPAC.unambiguous_dna: Defines the unambiguous DNA alphabet (pre-1.78 Biopython).
- Bio.Seq.reverse_complement(seq): Returns the reverse complement of a sequence.
- Bio.Seq.translate(seq): Translates a nucleotide sequence into the corresponding amino acid sequence.
- Bio.SeqUtils.seq1(seq): Converts a protein sequence from three-letter code to one-letter code.
- Bio.SeqIO.write(records, handle, format): Writes a sequence record or a list of sequence records to a file.
- Bio.PDB.PDBList(): A class for downloading PDB files from the RCSB PDB server.
- Bio.Blast.NCBIWWW.qblast(program, database, sequence): Sends a sequence to the NCBI BLAST server for homology search.
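As a quick illustration, here are a few of these functions in action (the sequence is an arbitrary example):
from Bio.Seq import Seq
from Bio.SeqUtils import molecular_weight

s = Seq("ATGGTGCATTAA")
print(s.reverse_complement())               # TTAATGCACCAT
print(s.translate())                        # MVH* (standard codon table)
print(molecular_weight(s, seq_type="DNA"))  # molecular weight of the single-stranded DNA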
GAMOLA2 pre- and postprocessing
#-- view process.py --
with open("MT880872_CDS.fasta", "r") as f:
    for line in f:
        if line.startswith(">"):
            # Use the locus_tag value from the header as the new, simplified header
            header_words = line.strip().split("locus_tag=")
            third_word = header_words[1]
            third_word_ = third_word.strip().split("]")
            third_word__ = third_word_[0]
            print(">" + third_word__)
        else:
            print(line.strip())
#input_file: MT880870_.fasta as pan_genome_reference_.fa; MT880870_.fasta is modified from MT880870.fasta by simplifying the FASTA headers.
echo ">MT880870_genes" > MT880870_genes.fa;
merge_seq.py MT880870_.fasta >> MT880870_genes.fa;
samtools faidx MT880870_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880870_.fasta > length.txt
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880870_genes.fa.combined;
cp MT880870_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences
python process.py > MT880871_.fasta
echo ">MT880871_genes" > MT880871_genes.fa;
merge_seq.py MT880871_.fasta >> MT880871_genes.fa;
samtools faidx MT880871_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880871_.fasta > length.txt
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880871_genes.fa.combined;
cp MT880871_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences
python process.py > MT880872_.fasta
echo ">MT880872_genes" > MT880872_genes.fa;
merge_seq.py MT880872_.fasta >> MT880872_genes.fa;
samtools faidx MT880872_genes.fa
bioawk -c fastx '{ print $name, length($seq) }' < MT880872_.fasta > length.txt
generate_gene_model.py length.txt > /media/jhuang/Titisee/GAMOLA2/Results/gene_models/MT880872_genes.fa.combined;
cp MT880872_genes.fa /media/jhuang/Titisee/GAMOLA2/Input_sequences
#---- 4 ----
cd /media/jhuang/Titisee/GAMOLA2
#WARNING: swissprot is the default database; manually choosing nr as Blast_db is too slow, better to stay with swissprot, which is quick!
./Gamola.pl #No Glimmer model and Critica database due to self-extracted ORF; choosing nr.pal or swissprot.pal as Blast_db; COG2014 as COG_db; Pfam-A.hmm as Pfam_db; No Rfam_db; TIGRFAMS_15.0_HMM.LIB as TIGRfam_db
/media/jhuang/Titisee/GAMOLA2/Results/COG_results
#grep "Class: ," roary_182s_95.fa_COG_*
for sample in 10601 10710 11187 11377 12474 12504 1520 1551 2092 2160 2467 289 3455 4694 5862 5863 6166 6464 7618 8007 8406 8784 9157 9373 953; do
sed -i -e 's/Class: ,/Class: None, None/g' roary_182s_95.fa_COG_${sample}
done
./Gamola.pl
#---- 5 ---- update locustag
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880870_genes.fa/MT880870_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880870_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880870_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880870_genes_.fa.gb
#89600270 Jan 26 13:44 roary_182s_95__.fa.gb
#-- if ran [3.2 selected]--
python ~/Scripts/update_locustag.py MT880870_genes_.fa.gb MT880870_.fasta.fai > MT880870_genes__.fa.gb
> MT880870_genes__.fa.gb #clean old *__.fa.gb
#---- 6.6 ----
cut -d$'\t' -f1 MT880870_.fasta.fai > MT880870_f1
process_gamola_gb_plus.py MT880870_f1 MT880870_genes__.fa.gb > annotated_MT880870.txt #!!
#process_gamola_gb_plus.py may emit Biopython parser warnings such as:
#  /usr/local/lib/python2.7/dist-packages/Bio/GenBank/__init__.py:1047: BiopythonParserWarning: Ignoring invalid location: '1186620..1185596' (some fields may be wrong)
#fix the inverted coordinates and rerun:
sed -i -e 's/1186620\.\.1185596/1185596\.\.1186620/g' roary__.fa.gb
#DEBUG "group_12676" -> "group_12667"
#---- ADD DNA sequences at the end of the table as the last column ----
cut -d',' -f1-1 annotated_MT880870.txt > get_seq_ORF.sh
#extend the file get_seq_ORF.sh to extract all DNA-sequences
#mv the bash file under DIR of MT880870_.fasta.fai
#samtools faidx MT880870_.fasta "PLKLOBMN_00001" > seq_ORF.fasta
#samtools faidx MT880870_.fasta "PLKLOBMN_00002" >> seq_ORF.fasta
#or in the case simply "cp MT880870_.fasta seq_ORF.fasta"
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < seq_ORF.fasta > seq_ORF_.fasta
grep -v ">" seq_ORF_.fasta > seq_ORF__.txt
#add "Sequence" to the first line of seq_ORF__.txt
paste -d',' annotated_MT880870.txt seq_ORF__.txt > annotated_MT880870_.txt
#---- repeat 5-->6.6 for MT880871 and MT880872 --------------
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880871_genes.fa/MT880871_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880871_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880871_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880871_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880871_genes_.fa.gb MT880871_.fasta.fai > MT880871_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880871_.fasta.fai > MT880871_f1
process_gamola_gb_plus.py MT880871_f1 MT880871_genes__.fa.gb > annotated_MT880871.txt
cp MT880871_.fasta seq_ORF_71.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < seq_ORF_71.fasta > seq_ORF_71_.fasta
grep -v ">" seq_ORF_71_.fasta > seq_ORF_71__.txt
#add 'Sequence' in the first line in seq_ORF_71__.txt
paste -d',' annotated_MT880871.txt seq_ORF_71__.txt > annotated_MT880871_.txt
#--
cp /media/jhuang/Titisee/GAMOLA2/Consolidated_results/MT880872_genes.fa/MT880872_genes.fa.gb ./
awk '{print $0 "ORIGIN"> "file" NR}' RS='ORIGIN' MT880872_genes.fa.gb
sed -i 's/</ /g' file2
sed -i 's/^ //g' file2
cat file1 file2 > MT880872_genes_.fa.gb
iconv -t UTF-8 -f Windows-1252 MT880872_genes_.fa.gb
python ~/Scripts/update_locustag.py MT880872_genes_.fa.gb MT880872_.fasta.fai > MT880872_genes__.fa.gb
#remove the last line 'ORIGIN'
cut -d$'\t' -f1 MT880872_.fasta.fai > MT880872_f1
process_gamola_gb_plus.py MT880872_f1 MT880872_genes__.fa.gb > annotated_MT880872.txt
cp MT880872_.fasta seq_ORF_72.fasta
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < seq_ORF_72.fasta > seq_ORF_72_.fasta
grep -v ">" seq_ORF_72_.fasta > seq_ORF_72__.txt
#add 'Sequence' in the first line in seq_ORF_72__.txt
paste -d',' annotated_MT880872.txt seq_ORF_72__.txt > annotated_MT880872_.txt
mv annotated_MT880870_.txt annotated__MT880870_.txt
#~/Tools/csv2xls-0.4/csv_to_xls.py annotated__MT880870_.txt annotated_MT880871_.txt annotated_MT880872_.txt -d',' -o annotated_MT880870-MT880872.xls
A script that takes three file names as command-line arguments: it merges file1 and file2 on matching first (comma-separated) fields, appends the matching file2 content to each file1 line, drops every line whose key does not occur in file2, and writes the merged result to the output file.
import sys
#"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj","Gene","Product","PFAM Label","PFAM Match", "PFAM Interpro", "PFAM Product", "TIGR Label", "TIGR Product","COG Label", "COG Code","COG Product","Translation" #delete "Locus Tag",
#python3 merge2.py hS_vs_TSB_output/background annotated_CP052994_2996.txt hS_vs_TSB_output/background_
#python3 merge2.py hS_vs_TSB_output/downregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/downregulated_filtered_
#python3 merge2.py hS_vs_TSB_output/upregulated_filtered annotated_CP052994_2996.txt hS_vs_TSB_output/upregulated_filtered_
#python3 merge2.py infection_vs_nose_output/background annotated_CP052994_2996.txt infection_vs_nose_output/background_
#python3 merge2.py infection_vs_nose_output/downregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/downregulated_filtered_
#python3 merge2.py infection_vs_nose_output/upregulated_filtered annotated_CP052994_2996.txt infection_vs_nose_output/upregulated_filtered_
#sed -i -e 's/HKH70_09175/HKH70_09175 (PLKLOBMN_00173)/g' background_
#sed -i -e 's/HKH70_05165/HKH70_05165 (PLKLOBMN_00016)/g' background_
#sed -i -e 's/HKH70_05170/HKH70_05170 (PLKLOBMN_00017)/g' background_
#sed -i -e 's/HKH70_05175/HKH70_05175 (PLKLOBMN_00018)/g' background_
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_hS_vs_TSB.xls
#python3 ~/Tools/csv2xls-0.4/csv_to_xls.py background_ upregulated_filtered_ downregulated_filtered_ -d',' -o degenes_infection_vs_nose.xls
# Check that all three file names were provided as command-line arguments
if len(sys.argv) != 4:
    print("Usage: python merge2.py file1.txt file2.txt output_file.txt")
    sys.exit(1)

# Open file1 for reading and file2 for editing
with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'r+') as f2:
    # Read file2 into a dictionary keyed on the first comma-separated field
    f2_dict = {}
    for line in f2:
        key, content = line.strip().split(',', 1)
        f2_dict[key] = content
    # Merge file1 and file2: keep only file1 lines whose key occurs in file2
    merged_lines = []
    for line in f1:
        key = line.split(',')[0]
        if key in f2_dict:
            content = f2_dict[key]
            merged_lines.append(line.strip() + ',' + content)
    ## Move the file pointer to the beginning of file2
    #f2.seek(0)
    ## Truncate file2 to delete all its contents
    #f2.truncate()
    ## Rewrite file2 with the updated content
    #for key, content in f2_dict.items():
    #    f2.write(content + '\n')

# Write the merged content to the output file
with open(sys.argv[3], 'w') as f3:
    f3.write('\n'.join(merged_lines))
Retrieving KEGG Genes Using Bioservices in Python
Biopython's built-in KEGG support (Bio.KEGG) is fairly minimal, so the bioservices library is a convenient way to retrieve and interact with KEGG data. To fetch all available genes in the KEGG database, you would need to iterate through each organism and collect all their genes. Note that this process might take a long time and may not be efficient, as there are thousands of organisms and millions of genes in the KEGG database.
Use the bioservices library to fetch the list of available organisms and retrieve genes for the first few organisms:
from bioservices import KEGG
# Initialize KEGG API
kegg_api = KEGG()
# Get the list of organisms
organisms_raw = kegg_api.list("organism")
organisms = [entry.split("\t")[1] for entry in organisms_raw.split("\n") if entry]
#['hsa', 'ptr', 'pps', 'ggo', 'pon', 'nle', 'hmh', 'mcc', 'mcf', 'mthb', 'mni', 'csab', 'caty', 'panu', 'tge', 'mleu', 'rro', 'rbb', 'tfn', 'pteh', 'cang', 'cjc', 'sbq', 'cimi', 'csyr', 'mmur', 'lcat', 'pcoq', 'oga', 'mmu', 'mcal', ... , 'loki', 'psyt', 'agw', 'arg']
# Limit the number of organisms and genes for demonstration purposes
organism_limit = 3
gene_limit = 10
# Iterate through the organisms
for organism in organisms[:organism_limit]:
    print(f"Organism: {organism}")
    # Get the list of genes for the current organism
    genes = kegg_api.list(f"{organism}").split("\n")[:gene_limit]
    # Iterate through the genes and print gene identifiers
    for gene_entry in genes:
        gene_id = gene_entry.split("\t")[0]
        print(f"Gene ID: {gene_id}")
    print("\n")
print("\n")
#Organism: hsa
#Gene ID: hsa:102466751
#Gene ID: hsa:100302278
#Gene ID: hsa:79501
#Gene ID: hsa:112268260
#Gene ID: hsa:729759
#Gene ID: hsa:124904706
#Gene ID: hsa:105378947
#Gene ID: hsa:113219467
#Gene ID: hsa:81399
#Gene ID: hsa:148398
#...
This code will print the gene identifiers of the first 10 genes for the first 3 organisms in the KEGG database. You can modify the organism_limit and gene_limit variables to change the number of organisms and genes processed.
Remember that fetching all genes from the KEGG database might take a significant amount of time and may not be efficient. It’s usually more practical to focus on specific organisms or pathways of interest.
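In practice, querying a specific entry directly is much faster than crawling; for example, bioservices' get method returns the KEGG flat-file record for a single gene (hsa:7157, human TP53, is just an illustrative identifier):
# Fetch one KEGG entry instead of iterating over everything
entry = kegg_api.get("hsa:7157")  # hsa:7157 = human TP53
print(entry[:500])                # first part of the flat-file record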