In the current results, I extracted the main effects. I also compared the condition deltasbp_MH_18h to WT_MH_18h. If you are interested in specific comparisons between conditions, please let me know; I can perform the differential expression analysis and draw the corresponding volcano plots for them.
-
Targets
The experiment so far: we have two strains:
1. 1457 wild type
2. 1457Δsbp (sbp knockout strain)

Both strains were grown in two media and sampled at 2h (early biofilm phase, primary attachment), 4h (biofilm accumulation phase), and 18h (mature biofilm phase):
1. medium TSB -> nutrient-rich medium: differences in biofilm formation and growth are visible (the sbp knockout shows less biofilm formation and a growth deficit)
2. medium MH -> nutrient-poor medium: the differences from the wild type are more obvious (the sbp knockout shows a stronger growth deficit)

Our idea/hypothesis for the RNA-seq: since we already see differences in growth and biofilm formation, and also differences in the proteome (through a cooperation with mass spectrometry), we also expect differences in gene transcription in the RNA-seq data.

Requested analyses: compare the strains at the different time points, and also compare the different time points of one strain with each other. The following would be of particular interest:
- PCA plot (sample comparison)
- Heatmaps (wild type vs. sbp knockout)
- Volcano plots (significant genes)
- Gene Ontology (GO) analyses
-
Download the raw data
Mail from BGI (RNA-seq institute): the data from project F25A430000603 are uploaded to AWS. Please download the data as below:

URL: https://s3.console.aws.amazon.com/s3/buckets/stakimxp-598731762349?region=eu-central-1&tab=objects
Project: F25A430000603-01-STAkimxP
Alias ID: 598731762349
S3 Bucket: stakimxp-598731762349
Account: stakimxp
Password: qR0'A7[o9Ql|
Region: eu-central-1
Aws_access_key_id: AKIAYWZZRVKW72S4SCPG
Aws_secret_access_key: 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

aws s3 cp s3://stakimxp-598731762349/ ./ --recursive
cp -r raw_data/ /media/jhuang/Smarty/Data_Michelle_RNAseq_2025_raw_data_DEL
rsync -avzP /local/dir/ user@remote:/remote/dir/
rsync -avzP raw_data jhuang@10.169.63.113:/home/jhuang/DATA/Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO
-
Prepare raw data
mkdir raw_data; cd raw_data

#Δsbp -> deltasbp
#1457.1_2h_MH,WT,MH,2h,1
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_1.fq.gz WT_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_2h_MH/1457.2_2h_MH_2.fq.gz WT_MH_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_1.fq.gz WT_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_2h_MH/1457.3_2h_MH_2.fq.gz WT_MH_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_1.fq.gz WT_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_4h_MH/1457.1_4h_MH_2.fq.gz WT_MH_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_1.fq.gz WT_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_4h_MH/1457.2_4h_MH_2.fq.gz WT_MH_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_1.fq.gz WT_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_4h_MH/1457.3_4h_MH_2.fq.gz WT_MH_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_1.fq.gz WT_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_18h_MH/1457.1_18h_MH_2.fq.gz WT_MH_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_1.fq.gz WT_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.2_18h_MH/1457.2_18h_MH_2.fq.gz WT_MH_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_1.fq.gz WT_MH_18h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.3_18h_MH/1457.3_18h_MH_2.fq.gz WT_MH_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_1.fq.gz deltasbp_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_2h_MH/1457dsbp1_2h_MH_2.fq.gz deltasbp_MH_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_1.fq.gz deltasbp_MH_2h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_2h_MH/1457dsbp2_2h_MH_2.fq.gz deltasbp_MH_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_1.fq.gz deltasbp_MH_2h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_2h_MH/1457dsbp3_2h_MH_2.fq.gz deltasbp_MH_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_1.fq.gz deltasbp_MH_4h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp1_4h_MH/1457dsbp1_4h_MH_2.fq.gz deltasbp_MH_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_1.fq.gz deltasbp_MH_4h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp2_4h_MH/1457dsbp2_4h_MH_2.fq.gz deltasbp_MH_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_1.fq.gz deltasbp_MH_4h_3_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp3_4h_MH/1457dsbp3_4h_MH_2.fq.gz deltasbp_MH_4h_3_R2.fastq.gz
#1457dsbp118h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_1.fq.gz deltasbp_MH_18h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp118h_MH/1457dsbp118h_MH_2.fq.gz deltasbp_MH_18h_1_R2.fastq.gz
#1457dsbp218h_
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_1.fq.gz deltasbp_MH_18h_2_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457dsbp218h_MH/1457dsbp218h_MH_2.fq.gz deltasbp_MH_18h_2_R2.fastq.gz
#1457.1_2h_
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_1.fq.gz WT_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_2h_TSB/1457.1_2h_TSB_2.fq.gz WT_TSB_2h_1_R2.fastq.gz
#1457.2_2h_
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_1.fq.gz WT_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_2h_TSB/1457.2_2h_TSB_2.fq.gz WT_TSB_2h_2_R2.fastq.gz
#1457.3_2h_
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_1.fq.gz WT_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_2h_TSB/1457.3_2h_TSB_2.fq.gz WT_TSB_2h_3_R2.fastq.gz
#1457.1_4h_
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_1.fq.gz WT_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_4h_TSB/1457.1_4h_TSB_2.fq.gz WT_TSB_4h_1_R2.fastq.gz
#1457.2_4h_
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_1.fq.gz WT_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_4h_TSB/1457.2_4h_TSB_2.fq.gz WT_TSB_4h_2_R2.fastq.gz
#1457.3_4h_
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_1.fq.gz WT_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_4h_TSB/1457.3_4h_TSB_2.fq.gz WT_TSB_4h_3_R2.fastq.gz
#1457.1_18h_
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_1.fq.gz WT_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.1_18h_TSB/1457.1_18h_TSB_2.fq.gz WT_TSB_18h_1_R2.fastq.gz
#1457.2_18h_
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_1.fq.gz WT_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.2_18h_TSB/1457.2_18h_TSB_2.fq.gz WT_TSB_18h_2_R2.fastq.gz
#1457.3_18h_
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_1.fq.gz WT_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457.3_18h_TSB/1457.3_18h_TSB_2.fq.gz WT_TSB_18h_3_R2.fastq.gz
#1457dsbp1_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_1.fq.gz deltasbp_TSB_2h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_2hTSB/1457dsbp1_2hTSB_2.fq.gz deltasbp_TSB_2h_1_R2.fastq.gz
#1457dsbp2_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_1.fq.gz deltasbp_TSB_2h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_2hTSB/1457dsbp2_2hTSB_2.fq.gz deltasbp_TSB_2h_2_R2.fastq.gz
#1457dsbp3_2h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_1.fq.gz deltasbp_TSB_2h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_2hTSB/1457dsbp3_2hTSB_2.fq.gz deltasbp_TSB_2h_3_R2.fastq.gz
#1457dsbp1_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_1.fq.gz deltasbp_TSB_4h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp1_4hTSB/1457dsbp1_4hTSB_2.fq.gz deltasbp_TSB_4h_1_R2.fastq.gz
#1457dsbp2_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_1.fq.gz deltasbp_TSB_4h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp2_4hTSB/1457dsbp2_4hTSB_2.fq.gz deltasbp_TSB_4h_2_R2.fastq.gz
#1457dsbp3_4h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_1.fq.gz deltasbp_TSB_4h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp3_4hTSB/1457dsbp3_4hTSB_2.fq.gz deltasbp_TSB_4h_3_R2.fastq.gz
#1457dsbp1_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_1.fq.gz deltasbp_TSB_18h_1_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp118hTSB/1457dsbp118hTSB_2.fq.gz deltasbp_TSB_18h_1_R2.fastq.gz
#1457dsbp2_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_1.fq.gz deltasbp_TSB_18h_2_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp218hTSB/1457dsbp218hTSB_2.fq.gz deltasbp_TSB_18h_2_R2.fastq.gz
#1457dsbp3_18h_
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_1.fq.gz deltasbp_TSB_18h_3_R1.fastq.gz
ln -s ../F25A430000603_STAmsvaP/1457dsbp318hTSB/1457dsbp318hTSB_2.fq.gz deltasbp_TSB_18h_3_R2.fastq.gz
#END
-
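After creating the symlinks it is worth checking that every link resolves and that each sample has both mates before trimming. A minimal, self-contained sketch of the two checks (the `demo` directory and the single sample below are toy stand-ins; run the same two commands inside the real raw_data):

```shell
# Toy stand-in for one sample so the snippet runs anywhere
mkdir -p demo/F25A430000603-01_STAkimxP/1457.1_2h_MH demo/raw_data
touch demo/F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz \
      demo/F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz
cd demo/raw_data
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_1.fq.gz WT_MH_2h_1_R1.fastq.gz
ln -s ../F25A430000603-01_STAkimxP/1457.1_2h_MH/1457.1_2h_MH_2.fq.gz WT_MH_2h_1_R2.fastq.gz
# Check 1: broken symlinks (prints nothing when every target resolves)
find . -xtype l
# Check 2: R1 and R2 counts must be equal (35 each in the real raw_data)
echo "R1: $(ls *_R1.fastq.gz | wc -l)  R2: $(ls *_R2.fastq.gz | wc -l)"
```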
Preparing the directory trimmed
mkdir trimmed trimmed_unpaired
for sample_id in WT_MH_2h_1 WT_MH_2h_2 WT_MH_2h_3 WT_MH_4h_1 WT_MH_4h_2 WT_MH_4h_3 \
                 WT_MH_18h_1 WT_MH_18h_2 WT_MH_18h_3 \
                 WT_TSB_2h_1 WT_TSB_2h_2 WT_TSB_2h_3 WT_TSB_4h_1 WT_TSB_4h_2 WT_TSB_4h_3 \
                 WT_TSB_18h_1 WT_TSB_18h_2 WT_TSB_18h_3 \
                 deltasbp_MH_2h_1 deltasbp_MH_2h_2 deltasbp_MH_2h_3 \
                 deltasbp_MH_4h_1 deltasbp_MH_4h_2 deltasbp_MH_4h_3 \
                 deltasbp_MH_18h_1 deltasbp_MH_18h_2 \
                 deltasbp_TSB_2h_1 deltasbp_TSB_2h_2 deltasbp_TSB_2h_3 \
                 deltasbp_TSB_4h_1 deltasbp_TSB_4h_2 deltasbp_TSB_4h_3 \
                 deltasbp_TSB_18h_1 deltasbp_TSB_18h_2 deltasbp_TSB_18h_3; do
  java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 \
    raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz \
    trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz \
    trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz \
    ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20
done 2> trimmomatic_pe.log
mv trimmed/*.fastq.gz .
-
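Because Trimmomatic is run in PE mode, the paired R1/R2 outputs must stay in sync read-for-read. A quick way to check is to count FASTQ records (4 lines per read) in each mate file; the toy pair below just demonstrates the arithmetic:

```shell
# Toy FASTQ pair, one read each (4 lines per record), standing in for trimmed output
printf '@r1\nACGT\n+\nIIII\n' | gzip > toy_R1.fastq.gz
printf '@r1\nTGCA\n+\nIIII\n' | gzip > toy_R2.fastq.gz
# Read counts of paired mate files must be identical after PE trimming
n1=$(zcat toy_R1.fastq.gz | wc -l)
n2=$(zcat toy_R2.fastq.gz | wc -l)
echo "reads R1: $((n1/4))  reads R2: $((n2/4))"
```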
Preparing samplesheet.csv
sample,fastq_1,fastq_2,strandedness
WT_MH_2h_1,WT_MH_2h_1_R1.fastq.gz,WT_MH_2h_1_R2.fastq.gz,auto
...
-
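Rather than typing all 35 rows by hand, the samplesheet can be derived from the file names, since each sample follows the `<sample>_R1/_R2.fastq.gz` convention. A sketch (the `touch` line only creates one toy file so the loop has input; in the real run, execute this where the trimmed fastq.gz files live):

```shell
# Work in a scratch directory with one toy sample
mkdir -p sheet_demo && cd sheet_demo
touch WT_MH_2h_1_R1.fastq.gz WT_MH_2h_1_R2.fastq.gz
# Build samplesheet.csv from the *_R1 files; the R2 name is derived from the R1 name
echo "sample,fastq_1,fastq_2,strandedness" > samplesheet.csv
for r1 in *_R1.fastq.gz; do
  sample=${r1%_R1.fastq.gz}
  echo "${sample},${r1},${sample}_R2.fastq.gz,auto" >> samplesheet.csv
done
cat samplesheet.csv
```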
nextflow run
#See an example: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
#docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq

# -- DEBUG_1 (CDS --> exon in CP020463.gff) --
grep -P "\texon\t" CP020463.gff | sort | wc -l     #=81
grep -P "cmsearch\texon\t" CP020463.gff | wc -l    #=11  signal recognition particle sRNA, transfer-messenger RNA, 5S ribosomal RNA
grep -P "Genbank\texon\t" CP020463.gff | wc -l     #=12  16S and 23S ribosomal RNA
grep -P "tRNAscan-SE\texon\t" CP020463.gff | wc -l #=58  tRNA
grep -P "\tCDS\t" CP020463.gff | wc -l             #3701 --> 2324
sed 's/\tCDS\t/\texon\t/g' CP020463.gff > CP020463_m.gff
grep -P "\texon\t" CP020463_m.gff | sort | wc -l   #3797 --> 2405

# -- NOTE: the combination of 'CP020463_m.gff' and 'exon' in the command results in an ERROR;
#    use 'transcript' instead on the command line:
#    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP020463_m.gff" --featurecounts_feature_type 'transcript'

# ---- SUCCESSFUL with gff3 and fasta downloaded directly from NCBI, run via docker, after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results \
  --fasta "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463.fasta" \
  --gff "/home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_m.gff" \
  -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h \
  --save_align_intermeds --save_unaligned --save_reference \
  --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' \
  --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'

# -- DEBUG_3: make sure the header of the fasta is the same as the seqid in the *_m.gff file; both are "CP020463.1" --
-
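The DEBUG_3 check (fasta header vs. GFF seqid) can be scripted so it does not rely on eyeballing. A self-contained sketch with toy stand-ins for CP020463.fasta / CP020463_m.gff (swap in the real file names for the actual check):

```shell
# Toy genome files carrying the seqid the pipeline expects
printf '>CP020463.1 Staphylococcus epidermidis 1457\nACGT\n' > genome.fasta
printf 'CP020463.1\tGenbank\texon\t1\t4\t.\t+\t.\tID=g1\n' > genome.gff
# First token of the fasta header vs. the set of seqids in column 1 of the GFF
fa_id=$(grep '^>' genome.fasta | head -1 | sed 's/^>//; s/ .*//')
gff_id=$(grep -v '^#' genome.gff | cut -f1 | sort -u)
[ "$fa_id" = "$gff_id" ] && echo "IDs match: $fa_id" || echo "MISMATCH: $fa_id vs $gff_id"
```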
Import data and PCA plot
#mamba activate r_env
#install.packages("ggfun")

# Load the required libraries
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library(gplots)
library(tximport)
library(DESeq2)
#library("org.Hs.eg.db")
library(dplyr)
library(tidyverse)
#install.packages("devtools")
#devtools::install_version("gtable", version = "0.3.0")
library("RColorBrewer")
#install.packages("ggrepel")
library("ggrepel")
#install.packages("openxlsx")
library(openxlsx)
library(EnhancedVolcano)
library(edgeR)

setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon")

# Define paths to the Salmon quantification files
# (note: deltasbp_MH_18h has only two replicates)
files <- c(
  "deltasbp_MH_2h_r1"   = "./deltasbp_MH_2h_1/quant.sf",
  "deltasbp_MH_2h_r2"   = "./deltasbp_MH_2h_2/quant.sf",
  "deltasbp_MH_2h_r3"   = "./deltasbp_MH_2h_3/quant.sf",
  "deltasbp_MH_4h_r1"   = "./deltasbp_MH_4h_1/quant.sf",
  "deltasbp_MH_4h_r2"   = "./deltasbp_MH_4h_2/quant.sf",
  "deltasbp_MH_4h_r3"   = "./deltasbp_MH_4h_3/quant.sf",
  "deltasbp_MH_18h_r1"  = "./deltasbp_MH_18h_1/quant.sf",
  "deltasbp_MH_18h_r2"  = "./deltasbp_MH_18h_2/quant.sf",
  "deltasbp_TSB_2h_r1"  = "./deltasbp_TSB_2h_1/quant.sf",
  "deltasbp_TSB_2h_r2"  = "./deltasbp_TSB_2h_2/quant.sf",
  "deltasbp_TSB_2h_r3"  = "./deltasbp_TSB_2h_3/quant.sf",
  "deltasbp_TSB_4h_r1"  = "./deltasbp_TSB_4h_1/quant.sf",
  "deltasbp_TSB_4h_r2"  = "./deltasbp_TSB_4h_2/quant.sf",
  "deltasbp_TSB_4h_r3"  = "./deltasbp_TSB_4h_3/quant.sf",
  "deltasbp_TSB_18h_r1" = "./deltasbp_TSB_18h_1/quant.sf",
  "deltasbp_TSB_18h_r2" = "./deltasbp_TSB_18h_2/quant.sf",
  "deltasbp_TSB_18h_r3" = "./deltasbp_TSB_18h_3/quant.sf",
  "WT_MH_2h_r1"         = "./WT_MH_2h_1/quant.sf",
  "WT_MH_2h_r2"         = "./WT_MH_2h_2/quant.sf",
  "WT_MH_2h_r3"         = "./WT_MH_2h_3/quant.sf",
  "WT_MH_4h_r1"         = "./WT_MH_4h_1/quant.sf",
  "WT_MH_4h_r2"         = "./WT_MH_4h_2/quant.sf",
  "WT_MH_4h_r3"         = "./WT_MH_4h_3/quant.sf",
  "WT_MH_18h_r1"        = "./WT_MH_18h_1/quant.sf",
  "WT_MH_18h_r2"        = "./WT_MH_18h_2/quant.sf",
  "WT_MH_18h_r3"        = "./WT_MH_18h_3/quant.sf",
  "WT_TSB_2h_r1"        = "./WT_TSB_2h_1/quant.sf",
  "WT_TSB_2h_r2"        = "./WT_TSB_2h_2/quant.sf",
  "WT_TSB_2h_r3"        = "./WT_TSB_2h_3/quant.sf",
  "WT_TSB_4h_r1"        = "./WT_TSB_4h_1/quant.sf",
  "WT_TSB_4h_r2"        = "./WT_TSB_4h_2/quant.sf",
  "WT_TSB_4h_r3"        = "./WT_TSB_4h_3/quant.sf",
  "WT_TSB_18h_r1"       = "./WT_TSB_18h_1/quant.sf",
  "WT_TSB_18h_r2"       = "./WT_TSB_18h_2/quant.sf",
  "WT_TSB_18h_r3"       = "./WT_TSB_18h_3/quant.sf")

# Import the transcript abundance data with tximport
txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)

# Define the replicates and conditions of the samples (same order as `files`)
replicate <- factor(c("r1","r2","r3", "r1","r2","r3", "r1","r2",
                      "r1","r2","r3", "r1","r2","r3", "r1","r2","r3",
                      "r1","r2","r3", "r1","r2","r3", "r1","r2","r3",
                      "r1","r2","r3", "r1","r2","r3", "r1","r2","r3"))
condition <- factor(c(rep("deltasbp_MH_2h",3),  rep("deltasbp_MH_4h",3),  rep("deltasbp_MH_18h",2),
                      rep("deltasbp_TSB_2h",3), rep("deltasbp_TSB_4h",3), rep("deltasbp_TSB_18h",3),
                      rep("WT_MH_2h",3),        rep("WT_MH_4h",3),        rep("WT_MH_18h",3),
                      rep("WT_TSB_2h",3),       rep("WT_TSB_4h",3),       rep("WT_TSB_18h",3)))

sample_table <- data.frame(condition = condition, replicate = replicate)
# Split "strain_media_time" into its three components
split_cond <- do.call(rbind, strsplit(as.character(condition), "_"))
colnames(split_cond) <- c("strain", "media", "time")
colData <- cbind(sample_table, split_cond)
colData$strain <- factor(colData$strain)
colData$media  <- factor(colData$media)
colData$time   <- factor(colData$time)
#colData$group <- factor(paste(colData$strain, colData$media, colData$time, sep = "_"))
# Alternative colData definition for DESeq2:
#colData <- data.frame(condition=condition, row.names=names(files))

#grep "gene_name" ./results/genome/CP059040_m.gtf | wc -l  #1701
#grep "gene_name" ./results/genome/CP020463_m.gtf | wc -l  #50

# ------------------------
# 1. Setup and input files
# ------------------------
# Read in the transcript-to-gene mapping
tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
# Prepare tx2gene for gene-level summarization (drop gene_name)
tx2gene_geneonly <- tx2gene[, c("transcript_id", "gene_id")]

# ---------------------------
# 2. Transcript-level counts
# ---------------------------
# Create a DESeqDataSet directly from tximport (transcript-level)
dds_tx <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
write.csv(counts(dds_tx), file="transcript_counts.csv")

# ---------------------------
# 3. Gene-level summarization
# ---------------------------
# Re-import the Salmon data summarized at gene level
txi_gene <- tximport(files, type="salmon", tx2gene=tx2gene_geneonly, txOut=FALSE)
# Create a DESeqDataSet for gene-level counts
#dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition+replicate)
dds <- DESeqDataSetFromTximport(txi_gene, colData=colData, design=~condition)
# Alternative designs:
#dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~time+media+strain+media:strain+strain:time)
#  or, more simply (recommended):
#dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~time+media*strain)
#dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~strain*media*time)
#  ~ strain * media * time : main effects + all interactions (recommended)
#  ~ time + media * strain : main effects + media:strain interaction only (restricted)

# --------------------------------------
# 4. Raw counts table (with gene names)
# --------------------------------------
# Extract the raw gene-level counts
counts_data <- as.data.frame(counts(dds, normalized=FALSE))
counts_data$gene_id <- rownames(counts_data)
# Add gene names
tx2gene_unique <- unique(tx2gene[, c("gene_id", "gene_name")])
counts_data <- merge(counts_data, tx2gene_unique, by="gene_id", all.x=TRUE)
# Reorder columns: gene_id, gene_name, then counts
count_cols <- setdiff(colnames(counts_data), c("gene_id", "gene_name"))
counts_data <- counts_data[, c("gene_id", "gene_name", count_cols)]

# ----------------
# 5. Calculate CPM
# ----------------
# Prepare the count matrix for CPM calculation
count_matrix <- as.matrix(counts_data[, !(colnames(counts_data) %in% c("gene_id", "gene_name"))])
# Calculate CPM (count / library size * 1e6)
#cpm_matrix <- cpm(count_matrix, normalized.lib.sizes=FALSE)
total_counts <- colSums(count_matrix)
cpm_matrix <- t(t(count_matrix) / total_counts) * 1e6
cpm_matrix <- as.data.frame(cpm_matrix)
# Add gene_id and gene_name back to the CPM table
cpm_counts <- cbind(counts_data[, c("gene_id", "gene_name")], cpm_matrix)

# ---------------
# 6. Save outputs
# ---------------
write.csv(counts_data, "gene_raw_counts.csv", row.names=FALSE)
write.xlsx(counts_data, "gene_raw_counts.xlsx")
write.xlsx(cpm_counts, "gene_cpm_counts.xlsx")

# -- (Optional) Save the rlog-transformed counts --
dim(counts(dds))
head(counts(dds), 10)
rld <- rlogTransformation(dds)
rlog_counts <- assay(rld)
write.xlsx(as.data.frame(rlog_counts), "gene_rlog_transformed_counts.xlsx")

# -- PCA --
png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("condition"))
dev.off()

# -- Sample-distance heatmap --
png("heatmap2.png", 1200, 800)
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
hc <- hclust(distsRL)
hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
heatmap.2(mat, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col=rev(hmcol), margin=c(13, 13))
dev.off()

# -- PCA colored by media / strain / time --
png("pca_media.png", 1200, 800)
plotPCA(rld, intgroup=c("media"))
dev.off()
png("pca_strain.png", 1200, 800)
plotPCA(rld, intgroup=c("strain"))
dev.off()
png("pca_time.png", 1200, 800)
plotPCA(rld, intgroup=c("time"))
dev.off()
-
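The manual CPM formula above (count divided by library total, times 1e6) is easy to sanity-check outside R. A toy shell version with assumed numbers (three genes, one sample, library size 1000):

```shell
# Toy counts table: gene <tab> count (assumed numbers)
printf 'g1\t100\ng2\t300\ng3\t600\n' > toy_counts.tsv
# Library size = column sum = 1000
total=$(awk '{s+=$2} END{print s}' toy_counts.tsv)
# CPM = count / total * 1e6; g1 -> 100000, g2 -> 300000, g3 -> 600000
awk -v t="$total" '{print $1, $2/t*1e6}' toy_counts.tsv
```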
(Optional; ERROR --> needs debugging) Estimate size factors and dispersion values
#Size Factors: These are used to normalize the read counts across different samples. The size factor for a sample accounts for differences in sequencing depth (i.e., the total number of reads) and other technical biases between samples. After normalization with size factors, the counts should be comparable across samples. Size factors are usually calculated in a way that they reflect the median or mean ratio of gene expression levels between samples, assuming that most genes are not differentially expressed. #Dispersion: This refers to the variability or spread of gene expression measurements. In RNA-seq data analysis, each gene has its own dispersion value, which reflects how much the counts for that gene vary between different samples, more than what would be expected just due to the Poisson variation inherent in counting. Dispersion is important for accurately modeling the data and for detecting differentially expressed genes. #So in summary, size factors are specific to samples (used to make counts comparable across samples), and dispersion values are specific to genes (reflecting variability in gene expression). 
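DESeq2 computes size factors with the median-of-ratios method: each gene's count is divided by that gene's geometric mean across samples, and the per-sample median of these ratios is the size factor. A toy numeric sketch with assumed counts, where sample B is sequenced exactly twice as deep as sample A, so the expected size factors are about 0.707 and 1.414 (here every gene has the same ratio, so a single gene's ratio already equals the median):

```shell
# Toy counts: gene <tab> sampleA <tab> sampleB (B = 2x depth of A)
printf 'g1\t100\t200\ng2\t50\t100\ng3\t10\t20\n' > toy.tsv
# Per-gene geometric mean gm = sqrt(A*B); ratio A/gm and B/gm give the size factors
awk 'NR==2 { gm=sqrt($2*$3); printf "sizeFactor A=%.3f B=%.3f\n", $2/gm, $3/gm }' toy.tsv
```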
sizeFactors(dds)  #NULL

# Estimate size factors
dds <- estimateSizeFactors(dds)
# Estimate dispersions
dds <- estimateDispersions(dds)

#> sizeFactors(dds)   (example output from another project)
#control_r1 control_r2 HSV.d2_r1 HSV.d2_r2 HSV.d4_r1 HSV.d4_r2 HSV.d6_r1
# 2.3282468  2.0251928 1.8036883 1.3767551 0.9341929 1.0911693 0.5454526
#HSV.d6_r2  HSV.d8_r1 HSV.d8_r2
#0.4604461  0.5799834 0.6803681

# (DEBUG) If avgTxLength is the problem:
# To simplify the computation and ensure size factors are calculated, drop avgTxLength:
assays(dds)$avgTxLength <- NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
# If you want to retain avgTxLength but suspect it is causing issues, you can explicitly
# instruct DESeq2 to compute size factors without the average-transcript-length correction:
dds <- estimateSizeFactors(dds, controlGenes = NULL, use = FALSE)
sizeFactors(dds)

# When run with virus data alone, the following BUG occurred: sizeFactors(dds) stayed NULL
# --> fall back to the manual size-factor calculation (estimSf) below!
#HeLa_TO_r1 HeLa_TO_r2
# 0.9978755  1.1092227
data.frame(genes = rownames(dds), dispersions = dispersions(dds))

# Given the raw counts, the control_r1 and control_r2 samples seem to have a much lower
# sequencing depth (total read count) than the other samples. Therefore, when normalization
# is applied, the normalization factors for these control samples will be relatively high,
# boosting the normalized counts.

#1/0.9978755 = 1.002129023
#1/1.1092227 = 0.901532217
# Convert bam to bigwig with deepTools, feeding the inverse of DESeq2's size factor:
#bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023 --effectiveGenomeSize 2864785220
bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor 0.901532217 --effectiveGenomeSize 2864785220

raw_counts <- counts(dds)
normalized_counts <- counts(dds, normalized=TRUE)
#write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
#write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)

# Manual implementation of DESeq2's median-of-ratios size-factor estimation
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    # Geometric mean of a vector
    geomMean <- function(x) prod(x)^(1/length(x))
    # Compute the geometric mean over each row (gene)
    gm.mean <- apply(cts, 1, geomMean)
    # Zero values are set to NA (avoids subsequent division by 0)
    gm.mean[gm.mean == 0] <- NA
    # Divide each row by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    #   MARGIN: 1 or 2 (rows or columns)
    #   STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    #   FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    # Return the scaling factors
    return(med)
}

#References:
#https://dputhier.github.io/ASG/practicals/rnaseq_diff_Snf2/rnaseq_diff_Snf2.html
#http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
#https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
#https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html
#https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/
#DESeq2's median of ratios [1]
#EdgeR's trimmed mean of M values (TMM) [2]
#http://www.nathalievialaneix.eu/doc/html/TP1_normalization.html  (very good website!)

# Sanity check: dividing raw counts by the size factors reproduces DESeq2's normalized counts
test_normcount <- sweep(raw_counts, 2, sizeFactors(dds), "/")
sum(test_normcount != normalized_counts)
-
Select the differentially expressed genes
#https://galaxyproject.eu/posts/2020/08/22/three-steps-to-galaxify-your-tool/
#https://www.biostars.org/p/282295/
#https://www.biostars.org/p/335751/

dds$condition
# [1] deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_2h   deltasbp_MH_4h
# [5] deltasbp_MH_4h   deltasbp_MH_4h   deltasbp_MH_18h  deltasbp_MH_18h
# [9] deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_2h  deltasbp_TSB_4h
#[13] deltasbp_TSB_4h  deltasbp_TSB_4h  deltasbp_TSB_18h deltasbp_TSB_18h
#[17] deltasbp_TSB_18h WT_MH_2h         WT_MH_2h         WT_MH_2h
#[21] WT_MH_4h         WT_MH_4h         WT_MH_4h         WT_MH_18h
#[25] WT_MH_18h        WT_MH_18h        WT_TSB_2h        WT_TSB_2h
#[29] WT_TSB_2h        WT_TSB_4h        WT_TSB_4h        WT_TSB_4h
#[33] WT_TSB_18h       WT_TSB_18h       WT_TSB_18h
#12 Levels: deltasbp_MH_18h deltasbp_MH_2h deltasbp_MH_4h ... WT_TSB_4h

#CONSOLE: mkdir star_salmon/degenes
setwd("degenes")

# Set the factor reference levels (optional)
colData$strain <- relevel(factor(colData$strain), ref = "WT")
colData$media  <- relevel(factor(colData$media),  ref = "TSB")
colData$time   <- relevel(factor(colData$time),   ref = "2h")

dds <- DESeqDataSetFromTximport(txi, colData, design = ~ strain * media * time)
dds <- DESeq(dds, betaPrior = FALSE)
resultsNames(dds)
# [1] "Intercept"                      "strain_deltasbp_vs_WT"
# [3] "media_MH_vs_TSB"                "time_18h_vs_2h"
# [5] "time_4h_vs_2h"                  "straindeltasbp.mediaMH"
# [7] "straindeltasbp.time18h"         "straindeltasbp.time4h"
# [9] "mediaMH.time18h"                "mediaMH.time4h"
#[11] "straindeltasbp.mediaMH.time18h" "straindeltasbp.mediaMH.time4h"

Main effects of each factor:

Expression ▲
          │        ┌────── WT-TSB
          │       /
          │      /   ┌────── WT-MH
          │     /   /
          │    /   /   ┌────── deltasbp-TSB
          │   /   /   /
          │  /   /   /   ┌────── deltasbp-MH
          └──────────────────────────────▶ Time (2h, 4h, 18h)

* strain_deltasbp_vs_WT
* media_MH_vs_TSB
* time_18h_vs_2h
* time_4h_vs_2h

Two-way interactions: these terms describe the combined effect of two experimental factors (strain, medium, time); that is, the effect of one factor depends on the level of the other.

Expression ▲
          │
          │  WT ────────┐
          │             └─↘
          │                ↘
          │  deltasbp ←←←← significant interaction (different direction/magnitude)
          └──────────────────────────────▶ Time

straindeltasbp.mediaMH: interaction between strain and medium.
  The deltasbp mutant behaves differently in MH medium than in TSB in a way that cannot
  be explained by the separate strain and medium effects alone.
straindeltasbp.time18h: interaction between strain and time (18h).
  The expression change of the mutant at 18 h is not simply the sum of the strain and
  time effects; there is a synergistic component.
straindeltasbp.time4h: as above, for strain and time (4h).
mediaMH.time18h: interaction between medium (MH) and time (18h).
  In MH medium, expression at 18 h differs from the other time points (e.g. 2h) in a way
  not fully explained by the separate time and medium effects.
mediaMH.time4h: as above, for MH medium and 4 h.

Three-way interactions: strain, medium, and time together produce an additional effect that cannot be fully explained by any two-factor combination.

Expression (TSB) ▲
                │
                │  WT ──────→→
                │  deltasbp ─────→→
                └────────────────────────▶ Time (2h, 4h, 18h)

Expression (MH) ▲
               │
               │  WT ──────→→
               │  deltasbp ─────⬈⬈⬈⬈⬈⬈⬈
               └────────────────────────▶ Time (2h, 4h, 18h)

straindeltasbp.mediaMH.time18h: strain × medium × time (18h) interaction.
  The 18 h expression pattern of the mutant in MH medium differs from all other
  combinations (e.g. WT in MH, or either strain in TSB).
straindeltasbp.mediaMH.time4h: as above, for the 4 h time point.

Summary: the presence of interaction terms means that changes in gene expression cannot be explained by any single variable (strain, time, or medium) alone; their combinations must be considered jointly. In the DESeq2 model, the significance of these interaction terms can reveal regulation that is specific to particular conditions.

# Extract the main effect of strain: up 2, down 16
contrast <- "strain_deltasbp_vs_WT"
res <- results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up   <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the main effect of media: up 76, down 128
contrast <- "media_MH_vs_TSB"
res <- results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up   <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Extract the main effects of time: 18h vs 2h up 228, down 98; 4h vs 2h up 17, down 2
contrast <- "time_18h_vs_2h"
res <- results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up   <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

contrast <- "time_4h_vs_2h"
res <- results(dds, name=contrast)
res <- res[!is.na(res$log2FoldChange),]
res_df <- as.data.frame(res)
write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
up   <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))

# Pairwise comparisons of interest:
#1.) deltasbp 2h TSB  vs WT 2h TSB
#2.) deltasbp 4h TSB  vs WT 4h TSB
#3.) deltasbp 18h TSB vs WT 18h TSB
#4.) deltasbp 2h MH   vs WT 2h MH
#5.) deltasbp 4h MH   vs WT 4h MH
#6.) deltasbp 18h MH  vs WT 18h MH

#---- relevel to the control, then run the extraction loop below for each clist ----
#design = ~condition (optionally ~condition+replicate)
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

# knockout vs. wild type, per medium and time point
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_2h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_WT_TSB_4h")

dds$condition <- relevel(dds$condition, "WT_TSB_18h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_WT_TSB_18h")

dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_2h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_WT_MH_4h")

dds$condition <- relevel(dds$condition, "WT_MH_18h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_WT_MH_18h")

# WT_MH_xh
dds$condition <- relevel(dds$condition, "WT_MH_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h")

dds$condition <- relevel(dds$condition, "WT_MH_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_MH_18h_vs_WT_MH_4h")

# WT_TSB_xh
dds$condition <- relevel(dds$condition, "WT_TSB_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h")

dds$condition <- relevel(dds$condition, "WT_TSB_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("WT_TSB_18h_vs_WT_TSB_4h")

# deltasbp_MH_xh
dds$condition <- relevel(dds$condition, "deltasbp_MH_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h")

dds$condition <- relevel(dds$condition, "deltasbp_MH_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_MH_18h_vs_deltasbp_MH_4h")

# deltasbp_TSB_xh
dds$condition <- relevel(dds$condition, "deltasbp_TSB_2h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h")

dds$condition <- relevel(dds$condition, "deltasbp_TSB_4h")
dds <- DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("deltasbp_TSB_18h_vs_deltasbp_TSB_4h")

# Extraction loop: run after each relevel/DESeq/clist block above
for (i in clist) {
  contrast = paste("condition", i, sep="_")
  #contrast = paste("media", i, sep="_")   # only for the Mac_vs_LB design (other project)
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  #res$log2FoldChange < -2 & res$padj < 5e-2
  up   <- subset(res_df, padj<=0.01 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.01 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

# -- Under the host env (mamba activate plot-numpy1) --
mamba activate plot-numpy1
grep -P "\tgene\t" CP020463.gff > CP020463_gene.gff
for cmp in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h \
           deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h \
           WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h \
           WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h \
           deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h \
           deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-all.txt ${cmp}-all.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-up.txt ${cmp}-up.csv
  python3 ~/Scripts/replace_gene_names.py /home/jhuang/DATA/Data_Michelle_RNAseq_2025/CP020463_gene.gff ${cmp}-down.txt ${cmp}-down.csv
done

# ---- deltasbp TSB 2h vs WT TSB 2h ----
res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv")
# Replace empty GeneName with the modified GeneID
res$GeneName <- ifelse(
  res$GeneName == "" | is.na(res$GeneName),
  gsub("gene-", "", res$GeneID),
  res$GeneName
)
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
#print(duplicated_genes)
# [1] "bfr"  "lipA" "ahpF" "pcaF" "alr"  "pcaD" "cydB" "lpdA" "pgaC" "ppk1"
#[11] "pcaF" "tuf"  "galE" "murI" "yccS" "rrf"  "rrf"  "arsB" "ptsP" "umuD"
#[21] "map"  "pgaB" "rrf"  "rrf"  "rrf"  "pgaD" "uraH" "benE"
#res[res$GeneName == "bfr", ]
# 1st strategy: keep the first occurrence, remove subsequent duplicates
#res <- res[!duplicated(res$GeneName), ]
# 2nd strategy: keep the row with the smallest padj value for each GeneName
res <- res %>%
  group_by(GeneName) %>%
  slice_min(padj, with_ties = FALSE) %>%
  ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending), then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]

# Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ]

# Create a new workbook with three sheets
wb <- createWorkbook()
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook
to a file saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) ## Ensure the data frame matches the expected format ## For example, it should have columns: log2FoldChange, padj, etc. #res <- as.data.frame(res) ## Remove rows with NA in log2FoldChange (if needed) #res <- res[!is.na(res$log2FoldChange),] # Replace padj = 0 with a small value #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150 #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_TSB_2h_vs_WT_TSB_2h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp TSB 2h versus WT TSB 2h")) dev.off() # ---- delta sbp TSB 4h vs WT TSB 4h ---- res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Assuming res is your dataframe and already processed # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ] # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ] # Create a new workbook wb <- 
createWorkbook() # Add the complete dataset as the first sheet addWorksheet(wb, "Complete_Data") writeData(wb, "Complete_Data", res) # Add the up-regulated genes as the second sheet addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) # Add the down-regulated genes as the third sheet addWorksheet(wb, "Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) # Save the workbook to a file saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_TSB_4h_vs_WT_TSB_4h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp TSB 4h versus WT TSB 4h")) dev.off() # ---- delta sbp TSB 18h vs WT TSB 18h ---- res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Assuming res is your dataframe and already processed # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ] # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2 down_regulated <- 
res[res$log2FoldChange < -2 & res$padj < 5e-2, ] # Create a new workbook wb <- createWorkbook() # Add the complete dataset as the first sheet addWorksheet(wb, "Complete_Data") writeData(wb, "Complete_Data", res) # Add the up-regulated genes as the second sheet addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) # Add the down-regulated genes as the third sheet addWorksheet(wb, "Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) # Save the workbook to a file saveWorkbook(wb, "Gene_Expression_Δsbp_TSB_18h_vs_WT_TSB_18h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_TSB_18h_vs_WT_TSB_18h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp TSB 18h versus WT TSB 18h")) dev.off() # ---- delta sbp MH 2h vs WT MH 2h ---- res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] #print(duplicated_genes) # [1] "bfr" "lipA" "ahpF" "pcaF" "alr" "pcaD" "cydB" "lpdA" "pgaC" "ppk1" #[11] "pcaF" "tuf" "galE" "murI" "yccS" "rrf" "rrf" "arsB" "ptsP" "umuD" #[21] "map" "pgaB" "rrf" "rrf" "rrf" "pgaD" "uraH" "benE" #res[res$GeneName == "bfr", ] #1st_strategy First occurrence is kept and Subsequent duplicates are removed #res <- res[!duplicated(res$GeneName), ] #2nd_strategy keep the row with the smallest padj value for 
each GeneName res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Assuming res is your dataframe and already processed # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ] # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ] # Create a new workbook wb <- createWorkbook() # Add the complete dataset as the first sheet addWorksheet(wb, "Complete_Data") writeData(wb, "Complete_Data", res) # Add the up-regulated genes as the second sheet addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) # Add the down-regulated genes as the third sheet addWorksheet(wb, "Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) # Save the workbook to a file saveWorkbook(wb, "Gene_Expression_Δsbp_MH_2h_vs_WT_MH_2h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) ## Ensure the data frame matches the expected format ## For example, it should have columns: log2FoldChange, padj, etc. 
#res <- as.data.frame(res) ## Remove rows with NA in log2FoldChange (if needed) #res <- res[!is.na(res$log2FoldChange),] # Replace padj = 0 with a small value #NO_SUCH_RECORDS: res$padj[res$padj == 0] <- 1e-150 #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_MH_2h_vs_WT_MH_2h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp MH 2h versus WT MH 2h")) dev.off() # ---- delta sbp MH 4h vs WT MH 4h ---- res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Assuming res is your dataframe and already processed # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ] # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ] # Create a new workbook wb <- createWorkbook() # Add the complete dataset as the first sheet addWorksheet(wb, "Complete_Data") writeData(wb, "Complete_Data", res) # Add the up-regulated genes as the second sheet addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) # Add the down-regulated genes as the third sheet addWorksheet(wb, "Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) # 
Save the workbook to a file saveWorkbook(wb, "Gene_Expression_Δsbp_MH_4h_vs_WT_MH_4h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_MH_4h_vs_WT_MH_4h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp MH 4h versus WT MH 4h")) dev.off() # ---- delta sbp MH 18h vs WT MH 18h ---- res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Assuming res is your dataframe and already processed # Filter up-regulated genes: log2FoldChange > 2 and padj < 5e-2 up_regulated <- res[res$log2FoldChange > 2 & res$padj < 5e-2, ] # Filter down-regulated genes: log2FoldChange < -2 and padj < 5e-2 down_regulated <- res[res$log2FoldChange < -2 & res$padj < 5e-2, ] # Create a new workbook wb <- createWorkbook() # Add the complete dataset as the first sheet addWorksheet(wb, "Complete_Data") writeData(wb, "Complete_Data", res) # Add the up-regulated genes as the second sheet addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) # Add the down-regulated genes as the third sheet addWorksheet(wb, 
"Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) # Save the workbook to a file saveWorkbook(wb, "Gene_Expression_Δsbp_MH_18h_vs_WT_MH_18h.xlsx", overwrite = TRUE) # Set the 'GeneName' column as row.names rownames(res) <- res$GeneName # Drop the 'GeneName' column since it's now the row names res$GeneName <- NULL head(res) #library(EnhancedVolcano) # Assuming res is already sorted and processed png("Δsbp_MH_18h_vs_WT_MH_18h.png", width=1200, height=1200) #max.overlaps = 10 EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'padj', pCutoff = 5e-2, FCcutoff = 2, title = '', subtitleLabSize = 18, pointSize = 3.0, labSize = 5.0, colAlpha = 1, legendIconSize = 4.0, drawConnectors = TRUE, widthConnectors = 0.5, colConnectors = 'black', subtitle = expression("Δsbp MH 18h versus WT MH 18h")) dev.off() #Annotate the Gene_Expression_xxx_vs_yyy.xlsx in the next steps (see below e.g. Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx) -
Clustering the genes and drawing the heatmap
#http://xgenes.com/article/article-content/150/draw-venn-diagrams-using-matplotlib/
#http://xgenes.com/article/article-content/276/go-terms-for-s-epidermidis/

# Save the up-regulated and down-regulated genes into -up.id and -down.id files.
# Attached are the heatmaps showing only the genes passing the cutoff padj < 5e-2 and |log2FC| > 2:
#- 1457 TSB vs 1457deltasbp TSB early timepoint (1-2h)
#- 1457 MH vs 1457deltasbp MH early timepoint (1-2h)
#- 1457 TSB vs 1457deltasbp TSB 4h
#- 1457 MH vs 1457deltasbp MH 4h
#- 1457 TSB vs 1457deltasbp TSB 18h
#- 1457 MH vs 1457deltasbp MH 18h

# Attached are the genes with padj < 5e-2 and |log2FC| > 2 between 18h and the early timepoint (1-2h), or between 18h and the 4h timepoint.
# The Tam_RNAseq project can be processed in the same way as this project!
#- 1457 TSB early timepoint (1-2h) vs 4h vs 18h
#- 1457 MH early timepoint (1-2h) vs 4h vs 18h
#- 1457deltasbp TSB early timepoint (1-2h) vs 4h vs 18h
#- 1457deltasbp MH early timepoint (1-2h) vs 4h vs 18h

for i in deltasbp_TSB_2h_vs_WT_TSB_2h deltasbp_TSB_4h_vs_WT_TSB_4h deltasbp_TSB_18h_vs_WT_TSB_18h deltasbp_MH_2h_vs_WT_MH_2h deltasbp_MH_4h_vs_WT_MH_4h deltasbp_MH_18h_vs_WT_MH_18h WT_MH_4h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_2h WT_MH_18h_vs_WT_MH_4h WT_TSB_4h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_2h WT_TSB_18h_vs_WT_TSB_4h deltasbp_MH_4h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_2h deltasbp_MH_18h_vs_deltasbp_MH_4h deltasbp_TSB_4h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_2h deltasbp_TSB_18h_vs_deltasbp_TSB_4h; do
  echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"
  echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"
done

## ------------------------------------------------------------
## DEGs heatmap (dynamic GOI + dynamic column tags)
## Example contrast: deltasbp_TSB_2h_vs_WT_TSB_2h
## Assumes 'rld' (or 'vsd') is in the environment (DESeq2 transform)
## ------------------------------------------------------------

## 0) Config ---------------------------------------------------
contrast <-
"deltasbp_TSB_2h_vs_WT_TSB_2h" contrast <- "deltasbp_TSB_4h_vs_WT_TSB_4h" contrast <- "deltasbp_TSB_18h_vs_WT_TSB_18h" contrast <- "deltasbp_MH_2h_vs_WT_MH_2h" contrast <- "deltasbp_MH_4h_vs_WT_MH_4h" contrast <- "deltasbp_MH_18h_vs_WT_MH_18h" up_file <- paste0(contrast, "-up.id") down_file <- paste0(contrast, "-down.id") ## 1) Packages ------------------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## 2) Helpers -------------------------------------------------- # Read IDs from a file that may be: # - one column with or without header "Gene_Id" # - may contain quotes read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch( read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL ) if (!is.null(df) && ncol(df) >= 1) { if ("Gene_Id" %in% names(df)) { ids <- df[["Gene_Id"]] } else if (ncol(df) == 1L) { ids <- df[[1]] } else { first_nonempty <- which(colSums(df != "", na.rm = TRUE) > 0)[1] if (is.na(first_nonempty)) stop("No usable IDs in: ", path) ids <- df[[first_nonempty]] } } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") if (ncol(df2) < 1L) stop("No usable IDs in: ", path) ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } #BREAK_LINE # From "A_vs_B" get c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be in the form 'GroupA_vs_GroupB'") parts } # Match whole tags at boundaries or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } ## 3) Build GOI from the two .id files 
------------------------- GOI_up <- read_ids_from_file(up_file) GOI_down <- read_ids_from_file(down_file) GOI <- unique(c(GOI_up, GOI_down)) if (length(GOI) == 0) stop("No gene IDs found in up/down .id files.") ## 4) Expression matrix (DESeq2 rlog/vst) ---------------------- # Use rld if present; otherwise try vsd if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } RNASeq.NoCellLine <- as.matrix(expr_all) # GOI are already 'gene-*' in your data — use them directly for matching present <- intersect(rownames(RNASeq.NoCellLine), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") # Optional: report truly missing IDs (on the same 'gene-*' format) missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## 5) Keep ONLY columns for the two groups in the contrast ----- groups <- split_contrast_groups(contrast) # e.g., c("deltasbp_TSB_2h", "WT_TSB_2h") keep_cols <- match_tags(colnames(RNASeq.NoCellLine), groups) if (!any(keep_cols)) { stop("No columns matched the contrast groups: ", paste(groups, collapse = " and "), ". 
Check your column names or implement colData-based filtering.") } cols_idx <- which(keep_cols) sub_colnames <- colnames(RNASeq.NoCellLine)[cols_idx] # Put the second group first (e.g., WT first in 'deltasbp..._vs_WT...') ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE)) # Subset safely expr_sub <- RNASeq.NoCellLine[present, cols_idx, drop = FALSE][, ord, drop = FALSE] ## 6) Remove constant/NA rows ---------------------------------- row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.") datamat <- expr_sub[row_ok, , drop = FALSE] # Save the filtered matrix used for the heatmap (optional) out_mat <- paste0("DEGs_heatmap_expression_data_", contrast, ".txt") write.csv(as.data.frame(datamat), file = out_mat, quote = FALSE) ## 6.1) Pretty labels (display only) --------------------------- # Row labels: strip 'gene-' labRow_pretty <- sub("^gene-", "", rownames(datamat)) # Column labels: 'deltasbp' -> 'Δsbp' and nicer spacing labCol_pretty <- colnames(datamat) labCol_pretty <- gsub("^deltasbp", "\u0394sbp", labCol_pretty) labCol_pretty <- gsub("_", " ", labCol_pretty) # e.g., WT_TSB_2h_r1 -> "WT TSB 2h r1" # If you prefer to drop replicate suffixes, uncomment: # labCol_pretty <- gsub(" r\\d+$", "", labCol_pretty) #BREAK_LINE ## 7) Clustering ----------------------------------------------- # Row clustering with Pearson distance hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") #row_cor <- suppressWarnings(cor(t(datamat), method = "pearson", use = "pairwise.complete.obs")) #row_cor[!is.finite(row_cor)] <- 0 #hr <- hclust(as.dist(1 - row_cor), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.1) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol 
<- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

png(paste0("DEGs_heatmap_", contrast, ".png"), width=800, height=950)
heatmap.2(datamat,
          Rowv = as.dendrogram(hr),
          col = bluered(75),
          scale = "row",
          RowSideColors = mycol,
          trace = "none",
          margin = c(10, 15),   # bottom, left
          sepwidth = c(0, 0),
          dendrogram = 'row',
          Colv = FALSE,         # logical, not the string 'false'
          density.info = 'none',
          labRow = labRow_pretty,  # row labels WITHOUT "gene-"
          labCol = labCol_pretty,  # col labels with Δsbp + spaces
          cexRow = 2.5, cexCol = 2.5,
          srtCol = 15,
          lhei = c(0.6, 4),  # enlarge the first number when reducing the plot size to avoid 'Error in plot.new() : figure margins too large'
          lwid = c(0.8, 4))  # enlarge the first number when reducing the plot size to avoid 'Error in plot.new() : figure margins too large'
dev.off()

# ------------------ Heatmap generation for three samples ----------------------
## ============================================================
## Three-condition DEGs heatmap from multiple pairwise contrasts
## Example contrasts:
##   "WT_MH_4h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_2h",
##   "WT_MH_18h_vs_WT_MH_4h"
## Output shows the union of DEGs across all contrasts and
## only the columns (samples) for the 3 conditions.
## ============================================================ ## -------- 0) User inputs ------------------------------------ contrasts <- c( "WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_4h" ) contrasts <- c( "WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_4h" ) contrasts <- c( "deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_4h" ) contrasts <- c( "deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_4h" ) ## Optionally force a condition display order (defaults to order of first appearance) cond_order <- c("WT_MH_2h","WT_MH_4h","WT_MH_18h") cond_order <- c("WT_TSB_2h","WT_TSB_4h","WT_TSB_18h") cond_order <- c("deltasbp_MH_2h","deltasbp_MH_4h","deltasbp_MH_18h") cond_order <- c("deltasbp_TSB_2h","deltasbp_TSB_4h","deltasbp_TSB_18h") #cond_order <- NULL ## -------- 1) Packages --------------------------------------- need <- c("gplots") to_install <- setdiff(need, rownames(installed.packages())) if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org") suppressPackageStartupMessages(library(gplots)) ## -------- 2) Helpers ---------------------------------------- read_ids_from_file <- function(path) { if (!file.exists(path)) stop("File not found: ", path) df <- tryCatch(read.table(path, header = TRUE, stringsAsFactors = FALSE, quote = "\"'", comment.char = ""), error = function(e) NULL) if (!is.null(df) && ncol(df) >= 1) { ids <- if ("Gene_Id" %in% names(df)) df[["Gene_Id"]] else df[[1]] } else { df2 <- read.table(path, header = FALSE, stringsAsFactors = FALSE, quote = "\"'", comment.char = "") ids <- df2[[1]] } ids <- trimws(gsub('"', "", ids)) ids[nzchar(ids)] } # From "A_vs_B" return c("A","B") split_contrast_groups <- function(x) { parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]] if (length(parts) != 2L) stop("Contrast must be 'GroupA_vs_GroupB': ", x) parts } # Grep 
whole tag between start/end or underscores match_tags <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # Pretty labels for columns (optional tweaks) prettify_col_labels <- function(x) { x <- gsub("^deltasbp", "\u0394sbp", x) # example from your earlier case x <- gsub("_", " ", x) x } # BREAK_LINE ## -------- 3) Build GOI (union across contrasts) ------------- up_files <- paste0(contrasts, "-up.id") down_files <- paste0(contrasts, "-down.id") GOI <- unique(unlist(c( lapply(up_files, read_ids_from_file), lapply(down_files, read_ids_from_file) ))) if (!length(GOI)) stop("No gene IDs found in any up/down .id files for the given contrasts.") ## -------- 4) Expression matrix (rld or vsd) ----------------- if (exists("rld")) { expr_all <- assay(rld) } else if (exists("vsd")) { expr_all <- assay(vsd) } else { stop("Neither 'rld' nor 'vsd' object is available in the environment.") } expr_all <- as.matrix(expr_all) present <- intersect(rownames(expr_all), GOI) if (!length(present)) stop("None of the GOI were found among the rownames of the expression matrix.") missing <- setdiff(GOI, present) if (length(missing)) message("Note: ", length(missing), " GOI not found and will be skipped.") ## -------- 5) Infer the THREE condition tags ----------------- pair_groups <- lapply(contrasts, split_contrast_groups) # list of c(A,B) cond_tags <- unique(unlist(pair_groups)) if (length(cond_tags) != 3L) { stop("Expected exactly three unique condition tags across the contrasts, got: ", paste(cond_tags, collapse = ", ")) } # If user provided an explicit order, use it; else keep first-appearance order if (!is.null(cond_order)) { if (!setequal(cond_order, cond_tags)) stop("cond_order must contain exactly these tags: ", paste(cond_tags, collapse = ", ")) cond_tags <- cond_order } #BREAK_LINE ## -------- 6) Subset columns to those 3 conditions ----------- # helper: does a name contain any of the tags? 
match_any_tag <- function(nms, tags) { pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)") grepl(pat, nms, perl = TRUE) } # helper: return the specific tag that a single name matches detect_tag <- function(nm, tags) { hits <- vapply(tags, function(t) grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1)) if (!any(hits)) NA_character_ else tags[which(hits)[1]] } keep_cols <- match_any_tag(colnames(expr_all), cond_tags) if (!any(keep_cols)) { stop("No columns matched any of the three condition tags: ", paste(cond_tags, collapse = ", ")) } sub_idx <- which(keep_cols) sub_colnames <- colnames(expr_all)[sub_idx] # find the tag for each kept column (this is the part that was wrong before) cond_for_col <- vapply(sub_colnames, detect_tag, character(1), tags = cond_tags) # rank columns by your desired condition order, then by name within each condition cond_rank <- match(cond_for_col, cond_tags) ord <- order(cond_rank, sub_colnames) expr_sub <- expr_all[present, sub_idx, drop = FALSE][, ord, drop = FALSE] ## -------- 7) Remove constant/NA rows ------------------------ row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0) if (any(!row_ok)) message("Removing ", sum(!row_ok), " constant/NA rows.") datamat <- expr_sub[row_ok, , drop = FALSE] ## -------- 8) Labels ---------------------------------------- labRow_pretty <- sub("^gene-", "", rownames(datamat)) labCol_pretty <- prettify_col_labels(colnames(datamat)) ## -------- 9) Clustering (rows) ------------------------------ hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete") # Color row-side groups by cutting the dendrogram mycl <- cutree(hr, h = max(hr$height) / 1.3) palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon", "lightblue","pink","purple","lightcyan","salmon","lightgreen") mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1] ## -------- 10) Save the matrix used 
-------------------------- out_tag <- paste(cond_tags, collapse = "_") write.csv(as.data.frame(datamat), file = paste0("DEGs_heatmap_expression_data_", out_tag, ".txt"), quote = FALSE) ## -------- 11) Plot heatmap ---------------------------------- png(paste0("DEGs_heatmap_", out_tag, ".png"), width = 1000, height = 5000) heatmap.2( datamat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = mycol, margins = c(10, 15), # c(bottom, left) sepwidth = c(0, 0), labRow = labRow_pretty, labCol = labCol_pretty, cexRow = 1.3, cexCol = 1.8, srtCol = 15, lhei = c(0.01, 4), lwid = c(0.5, 4), key = FALSE # safer; add manual z-score key if you want (see note below) ) dev.off() mv DEGs_heatmap_WT_MH_2h_WT_MH_4h_WT_MH_18h.png DEGs_heatmap_WT_MH.png mv DEGs_heatmap_WT_TSB_2h_WT_TSB_4h_WT_TSB_18h.png DEGs_heatmap_WT_TSB.png mv DEGs_heatmap_deltasbp_MH_2h_deltasbp_MH_4h_deltasbp_MH_18h.png DEGs_heatmap_deltasbp_MH.png mv DEGs_heatmap_deltasbp_TSB_2h_deltasbp_TSB_4h_deltasbp_TSB_18h.png DEGs_heatmap_deltasbp_TSB.png # ------------------ Heatmap generation for three samples END ---------------------- # ==== (NOT_USED) Ultra-robust heatmap.2 plotter with many attempts ==== # Inputs: # mat : numeric matrix (genes x samples) # hr : hclust for rows (or TRUE/FALSE) # row_colors : vector of RowSideColors of length nrow(mat) or NULL # labRow : character vector of row labels (display only) # labCol : character vector of col labels (display only) # outfile : output PNG path # base_res : DPI for PNG (default 150) # ==== Slide-tuned heatmap.2 plotter (moderate size, larger fonts, 45° labels) ==== safe_heatmap2 <- function(mat, hr, row_colors, labRow, labCol, outfile, base_res = 150) { stopifnot(is.matrix(mat)) nr <- nrow(mat); nc <- ncol(mat) # Target slide size & sensible caps #target_w <- 2400; target_h <- 1600 #max_w <- 3000; max_h <- 2000 target_w <- 800; target_h <- 2000 max_w <- 1500; max_h <- 
1500

  # Label stats
  max_row_chars <- if (length(labRow)) max(nchar(labRow), na.rm = TRUE) else 1
  max_col_chars <- if (length(labCol)) max(nchar(labCol), na.rm = TRUE) else 1

  #add_attempt(target_w, target_h, 0.90, 1.00, 45, NULL, TRUE, TRUE, TRUE)
  attempts <- list()
  add_attempt <- function(w, h, cr, cc, rot, mar = NULL, key = TRUE,
                          showR = TRUE, showC = TRUE, trunc_rows = 0) {
    attempts[[length(attempts) + 1]] <<- list(
      w = w, h = h, cr = cr, cc = cc, rot = rot, mar = mar,
      key = key, showR = showR, showC = showC, trunc_rows = trunc_rows
    )
  }

  # Note: in practice every attempt with key = FALSE succeeds, while every attempt
  # with key = TRUE fails (the small lhei/lwid panels leave no room for the color key).
  # 1) Preferred look: moderate size, biggish fonts, 45° labels
  add_attempt(target_w, target_h, 0.90, 1.00, 30, c(2,1), TRUE, TRUE, TRUE)
  # 2) Same, explicit margins computed later
  add_attempt(target_w, target_h, 0.85, 0.95, 45, c(10,15), TRUE, TRUE, TRUE)
  # 3) Slightly bigger canvas
  add_attempt(min(target_w+300, max_w), min(target_h+200, max_h), 0.85, 0.95, 30, c(10,15), TRUE, TRUE, TRUE)
  # 4) Make margins more generous (in lines)
  add_attempt(min(target_w+300, max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(10,14), FALSE, TRUE, TRUE)
  # 5) Reduce rotation to 30 if still tight
  add_attempt(min(target_w+300, max_w), min(target_h+200, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE)
  # 6) Final fallback: keep fonts reasonable, slightly bigger margins
  add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 45, c(8,12), FALSE, TRUE, TRUE)
  # 7) Last resort: truncate long row labels (keeps readability)
  if (max_row_chars > 20) {
    add_attempt(min(target_w+500, max_w), min(target_h+300, max_h), 0.80, 0.90, 30, c(8,12), FALSE, TRUE, TRUE,
                trunc_rows = 18)
  }

  for (i in seq_along(attempts)) {
    a <- attempts[[i]]

    # Compute margins if not provided
    if (is.null(a$mar)) {
      col_margin <- if (a$showC) {
        if (a$rot > 0) max(6, ceiling(0.45 * max_col_chars * max(a$cc, 0.8)))
        else max(5, ceiling(0.22 * max_col_chars * max(a$cc, 0.8)))
      } else 4
row_margin <- if (a$showR) max(6, ceiling(0.55 * max_row_chars * max(a$cr, 0.8))) else 4 mar <- c(col_margin, row_margin) } else { mar <- a$mar } # Prepare labels for this attempt lr <- if (a$showR) labRow else rep("", nr) if (a$trunc_rows > 0 && a$showR) { lr <- ifelse(nchar(lr) > a$trunc_rows, paste0(substr(lr, 1, a$trunc_rows), "…"), lr) } lc <- if (a$showC) labCol else rep("", nc) # Close any open device if (dev.cur() != 1) try(dev.off(), silent = TRUE) ok <- FALSE try({ png(outfile, width = ceiling(a$w), height = ceiling(a$h), res = base_res) gplots::heatmap.2( mat, Rowv = as.dendrogram(hr), Colv = FALSE, dendrogram = "row", col = gplots::bluered(75), scale = "row", trace = "none", density.info = "none", RowSideColors = row_colors, key = a$key, margins = mar, # c(col, row) in lines sepwidth = c(0, 0), labRow = lr, labCol = lc, cexRow = a$cr, cexCol = a$cc, srtCol = a$rot, lhei = c(0.1, 4), lwid = c(0.1, 4) ) dev.off() ok <- TRUE }, silent = TRUE) if (ok) { message(sprintf( "✓ Heatmap saved: %s (attempt %d) size=%dx%d margins=c(%d,%d) cexRow=%.2f cexCol=%.2f srtCol=%d", outfile, i, ceiling(a$w), ceiling(a$h), mar[1], mar[2], a$cr, a$cc, a$rot )) return(invisible(TRUE)) } else { if (dev.cur() != 1) try(dev.off(), silent = TRUE) message(sprintf("Attempt %d failed; retrying...", i)) } } stop("Failed to draw heatmap after tuned attempts. Consider ComplexHeatmap if this persists.") } safe_heatmap2( mat = datamat, hr = hr, row_colors = mycol, labRow = labRow_pretty, # row labels WITHOUT "gene-" labCol = labCol_pretty, # col labels with Δsbp + spaces outfile = paste0("DEGs_heatmap_", contrast, ".png"), #base_res = 150 ) # -- (OLD CODE) DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h -- cat deltasbp_TSB_2h_vs_WT_TSB_2h-up.id deltasbp_TSB_2h_vs_WT_TSB_2h-down.id | sort -u > ids #add Gene_Id in the first line, delete the "" #Note that using GeneID as index, rather than GeneName, since .txt contains only GeneID. 
GOI <- read.csv("ids")$Gene_Id
RNASeq.NoCellLine <- assay(rld)

#install.packages("gplots")
library("gplots")

# Clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA),
# "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC); pearson or spearman correlation.
datamat = RNASeq.NoCellLine[GOI, ]
#datamat = RNASeq.NoCellLine
write.csv(as.data.frame(datamat), file ="DEGs_heatmap_expression_data.txt")

# Drop zero-variance rows, which would break the correlation-based distance
constant_rows <- apply(datamat, 1, function(row) var(row) == 0)
if (any(constant_rows)) {
  cat("Removing", sum(constant_rows), "constant rows.\n")
  datamat <- datamat[!constant_rows, ]
}

hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")
hc <- hclust(as.dist(1 - cor(datamat, method = "spearman")), method = "complete")
mycl <- cutree(hr, h = max(hr$height) / 1.1)
mycol <- c("YELLOW", "BLUE", "ORANGE", "MAGENTA", "CYAN", "RED", "GREEN", "MAROON",
           "LIGHTBLUE", "PINK", "PURPLE", "LIGHTCYAN", "SALMON", "LIGHTGREEN")  # valid R color names, duplicates removed
mycol <- mycol[as.vector(mycl)]

png("DEGs_heatmap_deltasbp_TSB_2h_vs_WT_TSB_2h.png", width = 1200, height = 2000)
heatmap.2(datamat,
          Rowv = as.dendrogram(hr),
          Colv = FALSE,             # was 'false' (a string), which heatmap.2 does not treat as FALSE
          col = bluered(75),
          scale = "row",
          RowSideColors = mycol,
          trace = "none",
          margins = c(10, 15),      # bottom, left
          sepwidth = c(0, 0),
          dendrogram = "row",
          density.info = "none",
          labRow = rownames(datamat),
          cexRow = 1.5, cexCol = 1.5,
          srtCol = 35,
          lhei = c(0.2, 4),         # reduce top space
          lwid = c(0.4, 4))         # reduce left space
dev.off()

# -------------- Cluster members ----------------
write.csv(names(subset(mycl, mycl == '1')), file = 'cluster1_YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')), file = 'cluster2_DARKBLUE.txt')
write.csv(names(subset(mycl, mycl == '3')), file = 'cluster3_DARKORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')), file = 'cluster4_DARKMAGENTA.txt')
write.csv(names(subset(mycl, mycl == '5')), file = 'cluster5_DARKCYAN.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
#~/Tools/csv2xls-0.4/csv_to_xls.py DEGs_heatmap_expression_data.txt -d',' -o DEGs_heatmap_expression_data.xls

#### (NOT_WORKING) Cluster members with annotations: this does not work for bacteria,
#### since they are not model species and we cannot use mart = ensembl. #####
subset_1 <- names(subset(mycl, mycl == '1'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #2575
subset_2 <- names(subset(mycl, mycl == '2'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #1855
subset_3 <- names(subset(mycl, mycl == '3'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #217
subset_4 <- names(subset(mycl, mycl == '4'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_4, ])
subset_5 <- names(subset(mycl, mycl == '5'))
data <- as.data.frame(datamat[rownames(datamat) %in% subset_5, ])

# Initialize an empty data frame for the annotated data
annotated_data <- data.frame()

# Determine the total number of genes
total_genes <- length(rownames(data))

# Loop through each gene to annotate
for (i in 1:total_genes) {
  gene <- rownames(data)[i]
  result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype',
                                 'entrezgene_id', 'chromosome_name', 'start_position',
                                 'end_position', 'strand', 'description'),
                  filters = 'ensembl_gene_id',
                  values = gene,
                  mart = ensembl)
  # If multiple rows are returned, take the first one
  if (nrow(result) > 1) {
    result <- result[1, ]
  }
  # If the result is empty, keep the gene with NA annotations
  if (nrow(result) == 0) {
    result <- data.frame(ensembl_gene_id = gene, external_gene_name = NA, gene_biotype = NA,
                         entrezgene_id = NA, chromosome_name = NA, start_position = NA,
                         end_position = NA, strand = NA, description = NA)
  }
  # Transpose expression values
  expression_values <- t(data.frame(t(data[gene, ])))
  colnames(expression_values) <- colnames(data)
  # Combine gene information and expression data
  combined_result <- cbind(result, expression_values)
  # Append to the final data frame
  annotated_data <- rbind(annotated_data, combined_result)
  # Print progress every 100 genes
  if (i %% 100 == 0) {
    cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
  }
}

# Save the annotated data to a new CSV file (run the line matching the cluster processed above)
write.csv(annotated_data, "cluster1_YELLOW.csv", row.names = FALSE)
write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names = FALSE)
write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names = FALSE)
write.csv(annotated_data, "cluster4_DARKMAGENTA.csv", row.names = FALSE)
write.csv(annotated_data, "cluster5_DARKCYAN.csv", row.names = FALSE)
#~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o DEGs_heatmap_clusters.xls
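Before bundling the cluster lists into one workbook, a quick per-file row count (minus the CSV header that write.csv adds) helps confirm the cutree split. A minimal sketch, assuming the cluster*.txt filenames written by the cluster-member export above:

```shell
# Count genes per exported cluster file (write.csv adds one header line).
for f in cluster[0-9]_*.txt; do
  n=$(( $(wc -l < "$f") - 1 ))
  echo "$f: $n genes"
done
```

The counts should sum to the number of rows in datamat after constant-row removal.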
KEGG and GO annotations in non-model organisms
-
Assign KEGG and GO Terms (see diagram above)
Since your organism is a non-model species, the standard R annotation databases (org.Hs.eg.db, etc.) won't work; KEGG and GO annotations must be retrieved manually.

Option 1 (KEGG terms): eggNOG-mapper, based on orthology and phylogenies. EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

Install EggNOG-mapper:

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
mamba activate eggnog_env

Run the annotation:

#diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
#NOT_WORKING: emapper.py -i CP020463_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd

#Download the protein sequences from GenBank
mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60  #--resume
#----> result annotations.tsv: contains KEGG, GO, and other functional annotations.
#----> 470.IX87_14445: 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or a related strain); IX87_14445 refers to a specific gene or protein within that genome.

Extract the KEGG KO IDs from annotations.emapper.annotations.
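The KO extraction can be done on the command line. A minimal sketch, assuming the standard emapper tab-separated layout with `#`-prefixed header/comment lines and KEGG_ko in column 12 (as in the header shown later in this document); genes without a KO carry "-":

```shell
# Pull query -> KO assignments (column 12, KEGG_ko), stripping the "ko:" prefix
# and skipping comment lines and genes without a KO.
awk -F'\t' '!/^#/ && $12 != "-" && $12 != "" { gsub(/ko:/, "", $12); print $1 "\t" $12 }' \
    eggnog_out.emapper.annotations > gene2ko.tsv
```

The resulting two-column gene2ko.tsv is convenient for spot checks before the R-based enrichment below.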
Option 2 (GO terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Blast/Diamond + the Blast2GO GUI, based on sequence alignment and GO mapping.

* jhuang@WS-2290C:~/DATA/Data_Michelle_RNAseq_2025$ ~/Tools/Blast2GO/Blast2GO_Launcher
* Set the workspace ("mkdir ~/b2gWorkspace_Michelle_RNAseq_2025"); cp /mnt/md1/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025
* 'Load protein sequences' (tags: NONE; generated columns: Nr, SeqName), choosing the file CP020463_protein.fasta as input.
* Button 'blast' at the NCBI (parameters: blastp, nr, ...) (tags: BLASTED; generated columns: Description, Length, #Hits, e-Value, sim mean). QBlast finished with warnings: 2084 sequences blasted, 105 sequences without results. Check the job log for details and try to submit again; restarting QBlast may yield additional results, depending on the error type. "Blast (CP020463_protein) Done"
* Button 'mapping' (tags: MAPPED; generated columns: #GO, GO IDs, GO Names). "Mapping (CP020463_protein) Done" "Mapping finished - Please proceed now to annotation."
* Button 'annot' (tags: ANNOTATED; generated columns: Enzyme Codes, Enzyme Names). "Annotation (CP020463_protein) Done" "Annotation finished."
* Parameter 'Annotation CutOff': the Blast2GO annotation rule seeks the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO term from the sequence similarity of the BLAST hit, the evidence code of the source GO, and the position of the term in the Gene Ontology hierarchy. The cutoff selects, within each GO branch, the most specific term whose score lies above this value.
* Parameter 'GO Weight': a value added to the annotation score of a more general/abstract GO term for each of its more specific, original source GO terms. More general GO terms that summarize many original source terms (those directly associated with the BLAST hits) therefore receive a higher annotation score.

Alternative: blast2go_cli_v1.5.1 (NOT_USED)
#https://help.biobam.com/space/BCD/2250407989/Installation
#see ~/Scripts/blast2go_pipeline.sh

Option 3 (GO terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): InterPro-based protein families/domains.

* Button 'interpro' (tags: INTERPRO; generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names). "InterProScan (CP020463_protein) Done" "InterProScan Finished - You can now merge the obtained GO Annotations."

MERGE the InterPro GO IDs (Option 3) into the GO IDs (Option 2) to generate the final GO IDs:

* Button 'interpro' / 'Merge InterProScan GOs to Annotation': "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." "Merge InterProScan GOs to Annotation (CP020463_protein) Done" "Finished merging GO terms from InterPro with annotations." "Maybe you want to run ANNEX (Annotation Augmentation)."
#* Button 'annot' / 'ANNEX': "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

File -> Export -> Export Annotations -> Export Annotations (.annot, custom, etc.)
#~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot is generated!
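After the merge, the post-merge export should carry more GO assignments than the pre-merge one (the before/after example below illustrates this for a single gene). A minimal check, assuming both files are tab-separated with one GeneID/GO pair per line as in that example:

```shell
# Count GO assignments per file in the pre-merge and post-merge .annot exports.
for f in blast2go_annot.annot blast2go_annot.annot2; do
  echo "$f: $(grep -c 'GO:' "$f") GO assignments"
done
```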
#-- before merging (blast2go_annot.annot) --
#H0N29_18790  GO:0004842  ankyrin repeat domain-containing protein
#H0N29_18790  GO:0085020
#-- after merging (blast2go_annot.annot2) -->
#H0N29_18790  GO:0031436  ankyrin repeat domain-containing protein
#H0N29_18790  GO:0070531
#H0N29_18790  GO:0004842
#H0N29_18790  GO:0005515
#H0N29_18790  GO:0085020
cp blast2go_annot.annot blast2go_annot.annot2

Option 4 (NOT_USED): Rfam for non-coding RNA
Option 5 (NOT_USED): PSORTb for subcellular localizations
Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)
* Go to KAAS.
* Upload your FASTA file.
* Select an appropriate gene set.
* Download the KO assignments.
-
Find the Closest KEGG Organism Code (NOT_USED)
Since your species isn't directly in KEGG, use a closely related organism.

* Check the available KEGG organisms:

library(clusterProfiler)
library(KEGGREST)
kegg_organisms <- keggList("organism")

Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).

# Search for Acinetobacter in the list
grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Gammaproteobacteria

#Extract KO IDs from the eggNOG results for "Acinetobacter baumannii strain ATCC 19606"
-
Find the Closest KEGG Organism for a Non-Model Species (NOT_USED)
If your organism is not in KEGG, search for the closest relative:

grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (this has been integrated in point 4):

kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
-
Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes)
#BiocManager::install("GO.db") #BiocManager::install("AnnotationDbi") # Load required libraries library(openxlsx) # For Excel file handling library(dplyr) # For data manipulation library(tidyr) library(stringr) library(clusterProfiler) # For KEGG and GO enrichment analysis #library(org.Hs.eg.db) # Replace with appropriate organism database library(GO.db) library(AnnotationDbi) setwd("~/DATA/Data_Michelle_RNAseq_2025/results/star_salmon/degenes") # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_ # PREPARING eggnog_out.emapper.annotations.txt from eggnog_out.emapper.annotations by removing ## lines and renaming #query to query #(plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024_AUM_MHB_Urine_ATCC19606$ diff eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt #1,5c1 #< ## Thu Jan 30 16:34:52 2025 #< ## emapper-2.1.12 #< ## /home/jhuang/mambaforge/envs/eggnog_env/bin/emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 #< ## #< #query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs #--- #> query seed_ortholog evalue score eggNOG_OGs max_annot_lvl COG_category Description Preferred_name GOs EC KEGG_ko KEGG_Pathway KEGG_Module KEGG_Reaction KEGG_rclass BRITE KEGG_TC CAZy BiGG_Reaction PFAMs #3620,3622d3615 #< ## 3614 queries scanned #< ## Total time (seconds): 8.176708459854126 # Step 1: Load the blast2go annotation file with a check for missing columns annot_df <- read.table("/home/jhuang/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2_", header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE) # If the structure is inconsistent, we can make sure there are exactly 3 columns: colnames(annot_df) <- c("GeneID", "Term") # Step 2: Filter and aggregate GO and EC terms as before go_terms <- annot_df %>% 
filter(grepl("^GO:", Term)) %>% group_by(GeneID) %>% summarize(GOs = paste(Term, collapse = ","), .groups = "drop") ec_terms <- annot_df %>% filter(grepl("^EC:", Term)) %>% group_by(GeneID) %>% summarize(EC = paste(Term, collapse = ","), .groups = "drop") # Key Improvements: # * Looped processing of all 6 input files to avoid redundancy. # * Robust handling of empty KEGG and GO enrichment results to prevent contamination of results between iterations. # * File-safe output: Each dataset creates a separate Excel workbook with enriched sheets only if data exists. # * Error handling for GO term descriptions via tryCatch. # * Improved clarity and modular structure for easier maintenance and future additions. # Define the filenames and output suffixes file_list <- c( "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv", "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv", "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv", "deltasbp_MH_2h_vs_WT_MH_2h-all.csv", "deltasbp_MH_4h_vs_WT_MH_4h-all.csv", "deltasbp_MH_18h_vs_WT_MH_18h-all.csv", "WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_4h", "WT_TSB_4h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_2h", "WT_TSB_18h_vs_WT_TSB_4h", "deltasbp_MH_4h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_2h", "deltasbp_MH_18h_vs_deltasbp_MH_4h", "deltasbp_TSB_4h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h", "deltasbp_TSB_18h_vs_deltasbp_TSB_4h" ) #file_name = "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv" # ---------------------- Generated DEG(Annotated)_KEGG_GO_* ----------------------- suppressPackageStartupMessages({ library(readr) library(dplyr) library(stringr) library(tidyr) library(openxlsx) library(clusterProfiler) library(AnnotationDbi) library(GO.db) }) # ---- PARAMETERS ---- PADJ_CUT <- 5e-2 LFC_CUT <- 2 # Your emapper annotations (with columns: query, GOs, EC, KEGG_ko, KEGG_Pathway, KEGG_Module, ... 
) emapper_path <- "~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt" # Input files (you can add/remove here) input_files <- c( "deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv", "deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv", "deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv", "deltasbp_MH_2h_vs_WT_MH_2h-all.csv", "deltasbp_MH_4h_vs_WT_MH_4h-all.csv", "deltasbp_MH_18h_vs_WT_MH_18h-all.csv", "WT_MH_4h_vs_WT_MH_2h-all.csv", "WT_MH_18h_vs_WT_MH_2h-all.csv", "WT_MH_18h_vs_WT_MH_4h-all.csv", "WT_TSB_4h_vs_WT_TSB_2h-all.csv", "WT_TSB_18h_vs_WT_TSB_2h-all.csv", "WT_TSB_18h_vs_WT_TSB_4h-all.csv", "deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv", "deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv", "deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv", "deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv", "deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv", "deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv" ) # ---- HELPERS ---- # Robust reader (CSV first, then TSV) read_table_any <- function(path) { tb <- tryCatch(readr::read_csv(path, show_col_types = FALSE), error = function(e) tryCatch(readr::read_tsv(path, col_types = cols()), error = function(e2) NULL)) tb } # Return a nice Excel-safe base name xlsx_name_from_file <- function(path) { base <- tools::file_path_sans_ext(basename(path)) paste0("DEG_KEGG_GO_", base, ".xlsx") } # KEGG expand helper: replace K-numbers with GeneIDs using mapping from the same result table expand_kegg_geneIDs <- function(kegg_res, mapping_tbl) { if (is.null(kegg_res) || nrow(as.data.frame(kegg_res)) == 0) return(data.frame()) kdf <- as.data.frame(kegg_res) if (!"geneID" %in% names(kdf)) return(kdf) # mapping_tbl: columns KEGG_ko (possibly multiple separated by commas) and GeneID map_clean <- mapping_tbl %>% dplyr::select(KEGG_ko, GeneID) %>% filter(!is.na(KEGG_ko), KEGG_ko != "-") %>% mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>% tidyr::separate_rows(KEGG_ko, sep = ",") %>% distinct() if (!nrow(map_clean)) { return(kdf) } expanded <- kdf %>% tidyr::separate_rows(geneID, sep = "/") %>% 
dplyr::left_join(map_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>% distinct() %>% dplyr::group_by(ID) %>% dplyr::summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop") kdf %>% dplyr::select(-geneID) %>% dplyr::left_join(expanded %>% dplyr::select(ID, GeneID), by = "ID") %>% dplyr::rename(geneID = GeneID) } # ---- LOAD emapper annotations ---- eggnog_data <- read.delim(emapper_path, header = TRUE, sep = "\t", quote = "", check.names = FALSE) # Ensure character columns for joins eggnog_data$query <- as.character(eggnog_data$query) eggnog_data$GOs <- as.character(eggnog_data$GOs) eggnog_data$EC <- as.character(eggnog_data$EC) eggnog_data$KEGG_ko <- as.character(eggnog_data$KEGG_ko) # ---- MAIN LOOP ---- for (f in input_files) { if (!file.exists(f)) { message("Missing: ", f); next } message("Processing: ", f) res <- read_table_any(f) if (is.null(res) || nrow(res) == 0) { message("Empty/unreadable: ", f); next } # Coerce expected columns if present if ("padj" %in% names(res)) res$padj <- suppressWarnings(as.numeric(res$padj)) if ("log2FoldChange" %in% names(res)) res$log2FoldChange <- suppressWarnings(as.numeric(res$log2FoldChange)) # Ensure GeneID & GeneName exist if (!"GeneID" %in% names(res)) { # Try to infer from a generic 'gene' column if ("gene" %in% names(res)) res$GeneID <- as.character(res$gene) else res$GeneID <- NA_character_ } if (!"GeneName" %in% names(res)) res$GeneName <- NA_character_ # Fill missing GeneName from GeneID (drop "gene-") res$GeneName <- ifelse(is.na(res$GeneName) | res$GeneName == "", gsub("^gene-", "", as.character(res$GeneID)), as.character(res$GeneName)) # De-duplicate by GeneName, keep smallest padj if (!"padj" %in% names(res)) res$padj <- NA_real_ res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() %>% as.data.frame() # Sort by padj asc, then log2FC desc if (!"log2FoldChange" %in% names(res)) res$log2FoldChange <- NA_real_ res <- 
res[order(res$padj, -res$log2FoldChange), , drop = FALSE] # Join emapper (strip "gene-" from GeneID to match emapper 'query') res$GeneID_plain <- gsub("^gene-", "", res$GeneID) res_ann <- res %>% left_join(eggnog_data, by = c("GeneID_plain" = "query")) # --- Split by UP/DOWN using your volcano cutoffs --- up_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT, log2FoldChange > LFC_CUT) down_regulated <- res_ann %>% filter(!is.na(padj), padj < PADJ_CUT, log2FoldChange < -LFC_CUT) # --- KEGG enrichment (using K numbers in KEGG_ko) --- # Prepare KO lists (remove "ko:" if present) k_up <- up_regulated$KEGG_ko; k_up <- k_up[!is.na(k_up)] k_dn <- down_regulated$KEGG_ko; k_dn <- k_dn[!is.na(k_dn)] k_up <- gsub("ko:", "", k_up); k_dn <- gsub("ko:", "", k_dn) # BREAK_LINE kegg_up <- tryCatch(enrichKEGG(gene = k_up, organism = "ko"), error = function(e) NULL) kegg_down <- tryCatch(enrichKEGG(gene = k_dn, organism = "ko"), error = function(e) NULL) # Convert KEGG K-numbers to your GeneIDs (using mapping from the same result set) kegg_up_df <- expand_kegg_geneIDs(kegg_up, up_regulated) kegg_down_df <- expand_kegg_geneIDs(kegg_down, down_regulated) # --- GO enrichment (custom TERM2GENE built from emapper GOs) --- # Background gene set = all genes in this comparison background_genes <- unique(res_ann$GeneID_plain) # TERM2GENE table (GO -> GeneID_plain) go_annotation <- res_ann %>% dplyr::select(GeneID_plain, GOs) %>% mutate(GOs = ifelse(is.na(GOs), "", GOs)) %>% tidyr::separate_rows(GOs, sep = ",") %>% filter(GOs != "") %>% dplyr::select(GOs, GeneID_plain) %>% distinct() # Gene lists for GO enricher go_list_up <- unique(up_regulated$GeneID_plain) go_list_down <- unique(down_regulated$GeneID_plain) go_up <- tryCatch( enricher(gene = go_list_up, TERM2GENE = go_annotation, pvalueCutoff = 0.05, pAdjustMethod = "BH", universe = background_genes), error = function(e) NULL ) go_down <- tryCatch( enricher(gene = go_list_down, TERM2GENE = go_annotation, pvalueCutoff = 0.05, 
pAdjustMethod = "BH", universe = background_genes), error = function(e) NULL ) go_up_df <- if (!is.null(go_up)) as.data.frame(go_up) else data.frame() go_down_df <- if (!is.null(go_down)) as.data.frame(go_down) else data.frame() # Add GO term descriptions via GO.db (best-effort) add_go_term_desc <- function(df) { if (!nrow(df) || !"ID" %in% names(df)) return(df) df$Description <- sapply(df$ID, function(go_id) { term <- tryCatch(AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"), error = function(e) NULL) if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_ }) df } go_up_df <- add_go_term_desc(go_up_df) go_down_df <- add_go_term_desc(go_down_df) # ---- Write Excel workbook ---- out_xlsx <- xlsx_name_from_file(f) wb <- createWorkbook() addWorksheet(wb, "Complete") writeData(wb, "Complete", res_ann) addWorksheet(wb, "Up_Regulated") writeData(wb, "Up_Regulated", up_regulated) addWorksheet(wb, "Down_Regulated") writeData(wb, "Down_Regulated", down_regulated) addWorksheet(wb, "KEGG_Enrichment_Up") writeData(wb, "KEGG_Enrichment_Up", kegg_up_df) addWorksheet(wb, "KEGG_Enrichment_Down") writeData(wb, "KEGG_Enrichment_Down", kegg_down_df) addWorksheet(wb, "GO_Enrichment_Up") writeData(wb, "GO_Enrichment_Up", go_up_df) addWorksheet(wb, "GO_Enrichment_Down") writeData(wb, "GO_Enrichment_Down", go_down_df) saveWorkbook(wb, out_xlsx, overwrite = TRUE) message("Saved: ", out_xlsx) } # -------------------------------- OLD_CODE not automatized with loop ---------------------------- # Load the results res <- read.csv("deltasbp_TSB_2h_vs_WT_TSB_2h-all.csv") res <- read.csv("deltasbp_TSB_4h_vs_WT_TSB_4h-all.csv") res <- read.csv("deltasbp_TSB_18h_vs_WT_TSB_18h-all.csv") res <- read.csv("deltasbp_MH_2h_vs_WT_MH_2h-all.csv") res <- read.csv("deltasbp_MH_4h_vs_WT_MH_4h-all.csv") res <- read.csv("deltasbp_MH_18h_vs_WT_MH_18h-all.csv") res <- read.csv("WT_MH_4h_vs_WT_MH_2h-all.csv") res <- read.csv("WT_MH_18h_vs_WT_MH_2h-all.csv") res <- 
read.csv("WT_MH_18h_vs_WT_MH_4h-all.csv") res <- read.csv("WT_TSB_4h_vs_WT_TSB_2h-all.csv") res <- read.csv("WT_TSB_18h_vs_WT_TSB_2h-all.csv") res <- read.csv("WT_TSB_18h_vs_WT_TSB_4h-all.csv") res <- read.csv("deltasbp_MH_4h_vs_deltasbp_MH_2h-all.csv") res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_2h-all.csv") res <- read.csv("deltasbp_MH_18h_vs_deltasbp_MH_4h-all.csv") res <- read.csv("deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.csv") res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.csv") res <- read.csv("deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.csv") # Replace empty GeneName with modified GeneID res$GeneName <- ifelse( res$GeneName == "" | is.na(res$GeneName), gsub("gene-", "", res$GeneID), res$GeneName ) # Remove duplicated genes by selecting the gene with the smallest padj duplicated_genes <- res[duplicated(res$GeneName), "GeneName"] res <- res %>% group_by(GeneName) %>% slice_min(padj, with_ties = FALSE) %>% ungroup() res <- as.data.frame(res) # Sort res first by padj (ascending) and then by log2FoldChange (descending) res <- res[order(res$padj, -res$log2FoldChange), ] # Read eggnog annotations eggnog_data <- read.delim("~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t") # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format res$GeneID <- gsub("gene-", "", res$GeneID) # Merge eggnog data with res based on GeneID res <- res %>% left_join(eggnog_data, by = c("GeneID" = "query")) # Merge with the res dataframe # Perform the left joins and rename columns res_updated <- res %>% left_join(go_terms, by = "GeneID") %>% left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y) # Filter up-regulated genes up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.05, ] # Filter down-regulated genes down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.05, ] # Create a new workbook wb <- 
createWorkbook()

# Add the complete dataset as the first sheet (with annotations)
addWorksheet(wb, "Complete")
writeData(wb, "Complete", res_updated)  # sheet name must match the one added above

# Add the up-regulated genes as the second sheet (with annotations)
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)

# Add the down-regulated genes as the third sheet (with annotations)
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)

# Save the workbook to a file
#saveWorkbook(wb, "Gene_Expression_with_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx", overwrite = TRUE)

#NOTE: The generated annotation files contain all DESeq2 columns (GeneName, GeneID, baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) plus almost all eggNOG columns (GeneID, seed_ortholog, evalue, score, eggNOG_OGs, max_annot_lvl, COG_category, Description, Preferred_name, KEGG_ko, KEGG_Pathway, KEGG_Module, KEGG_Reaction, KEGG_rclass, BRITE, KEGG_TC, CAZy, BiGG_Reaction, PFAMs) except GOs and EC, plus two columns from Blast2GO (GOs, EC). The code below uses the KEGG_ko and GOs columns for the KEGG and GO enrichments.
#TODO: for Michelle's data, we can also perform both KEGG and GO enrichments.
# Set GeneName as row names after the join
rownames(res_updated) <- res_updated$GeneName
res_updated <- res_updated %>% dplyr::select(-GeneName)

# -- BREAK_1 --

# ---- KEGG enrichment analysis (up-regulated genes) ----
gene_list_kegg_up <- gsub("ko:", "", up_regulated$KEGG_ko)
kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')

# -- Convert the KEGG orthology IDs (Kxxxxx) back to the true GeneIDs --
# Step 0: Create a KEGG-to-GeneID mapping
kegg_to_geneid_up <- up_regulated %>%
  dplyr::select(KEGG_ko, GeneID) %>%
  filter(!is.na(KEGG_ko)) %>%                     # Remove missing KEGG KO entries
  mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))    # Remove 'ko:' prefix if present

# Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
kegg_to_geneid_clean <- kegg_to_geneid_up %>%
  mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove remaining 'ko:' prefixes
  separate_rows(KEGG_ko, sep = ",") %>%                 # One KEGG ID per row
  filter(KEGG_ko != "-") %>%                            # Remove invalid KEGG IDs ("-")
  distinct()                                            # Remove duplicate mappings

# Step 2: Expand the geneID column of kegg_enrichment_up and attach GeneIDs
expanded_kegg <- kegg_enrichment_up %>%
  as.data.frame() %>%
  separate_rows(geneID, sep = "/") %>%
  left_join(kegg_to_geneid_clean,
            by = c("geneID" = "KEGG_ko"),
            relationship = "many-to-many") %>%  # Explicitly allow many-to-many matches
  distinct() %>%                               # Remove duplicate matches
  group_by(ID) %>%
  summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")),
            .groups = "drop")                  # Re-collapse per pathway

# Step 3: Replace the geneID column in the enrichment result
kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up) %>%
  dplyr::select(-geneID) %>%
  left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%
  dplyr::rename(geneID = GeneID)

# ---- KEGG enrichment analysis (down-regulated genes) ----
gene_list_kegg_down <- gsub("ko:", "", down_regulated$KEGG_ko)
kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')

# --- Convert KEGG orthology IDs (Kxxxxx) to actual GeneIDs ---
kegg_to_geneid_down <- down_regulated %>%
  dplyr::select(KEGG_ko, GeneID) %>%
  filter(!is.na(KEGG_ko)) %>%
  mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))

# -- BREAK_2 --

kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
  mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%
  separate_rows(KEGG_ko, sep = ",") %>%
  filter(KEGG_ko != "-") %>%
  distinct()

expanded_kegg_down <- kegg_enrichment_down %>%
  as.data.frame() %>%
  separate_rows(geneID, sep = "/") %>%
  left_join(kegg_to_geneid_down_clean,
            by = c("geneID" = "KEGG_ko"),
            relationship = "many-to-many") %>%
  distinct() %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")),
            .groups = "drop")

kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
  dplyr::select(-geneID) %>%
  left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%
  dplyr::rename(geneID = GeneID)

# View the updated dataframe
head(kegg_enrichment_down_df)

# Save the KEGG enrichment results to the workbook
# (wb was created earlier in the script; otherwise: wb <- createWorkbook())
addWorksheet(wb, "KEGG_Enrichment_Up")
writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
addWorksheet(wb, "KEGG_Enrichment_Down")
writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))

# ---- GO enrichment with a custom annotation (enricher) ----
gene_list_go_up   <- up_regulated$GeneID     # e.g., the 149 up-regulated genes
gene_list_go_down <- down_regulated$GeneID   # e.g., the 65 down-regulated genes
background_genes  <- res_updated$GeneID      # e.g., the 3646 background genes

# TERM2GENE annotation: first column = GO term, second column = gene ID
go_annotation <- res_updated[, c("GOs", "GeneID")] %>%
  tidyr::separate_rows(GOs, sep = ",")       # Split multiple GO terms into separate rows

# -- BREAK_3 --

go_enrichment_up <- enricher(
  gene          = gene_list_go_up,    # Up-regulated genes
  TERM2GENE     = go_annotation,      # Custom GO annotation
  pvalueCutoff  = 0.05,               # Significance threshold
  pAdjustMethod = "BH",
  universe      = background_genes    # Background gene set
)
go_enrichment_up <- as.data.frame(go_enrichment_up)

go_enrichment_down <- enricher(
  gene          = gene_list_go_down,  # Down-regulated genes
  TERM2GENE     = go_annotation,
  pvalueCutoff  = 0.05,
  pAdjustMethod = "BH",
  universe      = background_genes
)
go_enrichment_down <- as.data.frame(go_enrichment_down)

## Note: earlier versions removed the 'p.adjust' column; this version keeps it
## (BH adjustment is applied in the enricher() calls above).

# Fill the Description column with GO term names from GO.db
go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
  term <- tryCatch({
    AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
  }, error = function(e) {
    message(paste("Error for GO term:", go_id))  # Report which GO ID failed
    data.frame(TERM = NA)
  })
  if (nrow(term) > 0) term$TERM else NA
})

go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
  term <- tryCatch({
    AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
  }, error = function(e) {
    message(paste("Error for GO term:", go_id))
    data.frame(TERM = NA)
  })
  if (nrow(term) > 0) term$TERM else NA
})

addWorksheet(wb, "GO_Enrichment_Up")
writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
addWorksheet(wb, "GO_Enrichment_Down")
writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))

# Save the workbook with enrichment results
saveWorkbook(wb, "DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h.xlsx", overwrite = TRUE)

# Note: "Error for GO term: GO:0006807" — GO:0006807 ("nitrogen compound metabolic
# process") is obsolete in GO.db; fill in its description manually.
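The per-ID `tryCatch`/`sapply` lookup above can also be done with a single vectorized `select()` call; a minimal sketch (restricting to keys known to GO.db sidesteps the error raised for obsolete IDs such as GO:0006807, which then simply get `NA`):

```r
# Vectorized alternative to the per-ID sapply loop above (sketch).
library(GO.db)
library(AnnotationDbi)

# Keep only IDs that GO.db knows about; obsolete/unknown IDs will map to NA below
valid_ids <- intersect(unique(go_enrichment_up$ID), keys(GO.db, keytype = "GOID"))
term_map  <- AnnotationDbi::select(GO.db, keys = valid_ids,
                                   columns = "TERM", keytype = "GOID")

go_enrichment_up$Description <- term_map$TERM[match(go_enrichment_up$ID, term_map$GOID)]
```

This queries GO.db once instead of once per term, which matters when many GO IDs are enriched.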
#https://www.ebi.ac.uk/QuickGO/term/GO:0006807
#TODO: mark GO enrichment rows with p.adjust <= 0.05 in yellow.

# Comparison names from another project (the file names are reused in the DEBUG Venn section below):
#mv KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx KEGG_and_GO_Enrichments_Mac_vs_LB.xlsx
#Mac_vs_LB
#LB.AB_vs_LB.WT19606
#LB.IJ_vs_LB.WT19606
#LB.W1_vs_LB.WT19606
#LB.Y1_vs_LB.WT19606
#Mac.AB_vs_Mac.WT19606
#Mac.IJ_vs_Mac.WT19606
#Mac.W1_vs_Mac.WT19606
#Mac.Y1_vs_Mac.WT19606

#TODO: write reply hints — KEGG_and_GO_Enrichments_deltasbp_TSB_4h_vs_WT_TSB_4h.xlsx contains icaABCD, gtf1 and gtf2.
-
(DEBUG) Draw Venn diagrams comparing the total DEGs (up- and down-regulated combined) across the four strains, separately for the LB and Mac groups.
library(openxlsx)
library(ggvenn)    # needed for the Venn diagrams below
library(ggplot2)   # for ggsave()

# Function to read and clean gene ID files
read_gene_ids <- function(file_path) {
  gene_ids <- readLines(file_path)
  gene_ids <- gsub('"', '', gene_ids)          # Remove quotes
  gene_ids <- trimws(gene_ids)                 # Trim whitespace
  gene_ids[gene_ids != "" & !is.na(gene_ids)]  # Drop empty entries and NAs
}

# LB files with both -up.id and -down.id for each condition
lb_files_up <- c("LB.AB_vs_LB.WT19606-up.id", "LB.IJ_vs_LB.WT19606-up.id",
                 "LB.W1_vs_LB.WT19606-up.id", "LB.Y1_vs_LB.WT19606-up.id")
lb_files_down <- c("LB.AB_vs_LB.WT19606-down.id", "LB.IJ_vs_LB.WT19606-down.id",
                   "LB.W1_vs_LB.WT19606-down.id", "LB.Y1_vs_LB.WT19606-down.id")
lb_files <- c(lb_files_up, lb_files_down)

# Read gene IDs for each file; make.unique() keeps the up (".") and down (".1") lists apart
lb_degs <- setNames(lapply(lb_files, read_gene_ids),
                    make.unique(gsub("-(up|down).id", "", lb_files)))

# Combine up- and down-regulated IDs per strain
lb_degs_ <- list(
  AB = c(lb_degs[["LB.AB_vs_LB.WT19606"]], lb_degs[["LB.AB_vs_LB.WT19606.1"]]),
  IJ = c(lb_degs[["LB.IJ_vs_LB.WT19606"]], lb_degs[["LB.IJ_vs_LB.WT19606.1"]]),
  W1 = c(lb_degs[["LB.W1_vs_LB.WT19606"]], lb_degs[["LB.W1_vs_LB.WT19606.1"]]),
  Y1 = c(lb_degs[["LB.Y1_vs_LB.WT19606"]], lb_degs[["LB.Y1_vs_LB.WT19606.1"]])
)

# Same for the Mac group
mac_files_up <- c("Mac.AB_vs_Mac.WT19606-up.id", "Mac.IJ_vs_Mac.WT19606-up.id",
                  "Mac.W1_vs_Mac.WT19606-up.id", "Mac.Y1_vs_Mac.WT19606-up.id")
mac_files_down <- c("Mac.AB_vs_Mac.WT19606-down.id", "Mac.IJ_vs_Mac.WT19606-down.id",
                    "Mac.W1_vs_Mac.WT19606-down.id", "Mac.Y1_vs_Mac.WT19606-down.id")
mac_files <- c(mac_files_up, mac_files_down)

mac_degs <- setNames(lapply(mac_files, read_gene_ids),
                     make.unique(gsub("-(up|down).id", "", mac_files)))

mac_degs_ <- list(
  AB = c(mac_degs[["Mac.AB_vs_Mac.WT19606"]], mac_degs[["Mac.AB_vs_Mac.WT19606.1"]]),
  IJ = c(mac_degs[["Mac.IJ_vs_Mac.WT19606"]], mac_degs[["Mac.IJ_vs_Mac.WT19606.1"]]),
  W1 = c(mac_degs[["Mac.W1_vs_Mac.WT19606"]], mac_degs[["Mac.W1_vs_Mac.WT19606.1"]]),
  Y1 = c(mac_degs[["Mac.Y1_vs_Mac.WT19606"]], mac_degs[["Mac.Y1_vs_Mac.WT19606.1"]])
)

# Excel sheet names must not exceed 31 characters
truncate_sheet_name <- function(names_list) {
  sapply(names_list, function(name) {
    if (nchar(name) > 31) substr(name, 1, 31) else name
  })
}

# ---- Pairwise, triple, and quadruple intersections for LB ----
inter_lb_ab_ij <- intersect(lb_degs_$AB, lb_degs_$IJ)
inter_lb_ab_w1 <- intersect(lb_degs_$AB, lb_degs_$W1)
inter_lb_ab_y1 <- intersect(lb_degs_$AB, lb_degs_$Y1)
inter_lb_ij_w1 <- intersect(lb_degs_$IJ, lb_degs_$W1)
inter_lb_ij_y1 <- intersect(lb_degs_$IJ, lb_degs_$Y1)
inter_lb_w1_y1 <- intersect(lb_degs_$W1, lb_degs_$Y1)

inter_lb_ab_ij_w1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1))
inter_lb_ab_ij_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$Y1))
inter_lb_ab_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$W1, lb_degs_$Y1))
inter_lb_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))
inter_lb_ab_ij_w1_y1 <- Reduce(intersect, list(lb_degs_$AB, lb_degs_$IJ, lb_degs_$W1, lb_degs_$Y1))

# ---- "Only" sets: genes unique to one condition ----
# Subtracting the three pairwise intersections suffices, since every
# higher-order intersection is contained in a pairwise one.
venn_list_lb <- list()
venn_list_lb[["LB.AB_only"]] <- setdiff(lb_degs_$AB, union(inter_lb_ab_ij, union(inter_lb_ab_w1, inter_lb_ab_y1)))
venn_list_lb[["LB.IJ_only"]] <- setdiff(lb_degs_$IJ, union(inter_lb_ab_ij, union(inter_lb_ij_w1, inter_lb_ij_y1)))
# Fixed: the last term must be the pairwise W1/Y1 intersection,
# not the AB/W1/Y1 triple intersection as in the original script
venn_list_lb[["LB.W1_only"]] <- setdiff(lb_degs_$W1, union(inter_lb_ab_w1, union(inter_lb_ij_w1, inter_lb_w1_y1)))
venn_list_lb[["LB.Y1_only"]] <- setdiff(lb_degs_$Y1, union(inter_lb_ab_y1, union(inter_lb_ij_y1, inter_lb_w1_y1)))

# Pairwise intersections
venn_list_lb[["LB.AB_AND_LB.IJ"]] <- inter_lb_ab_ij
venn_list_lb[["LB.AB_AND_LB.W1"]] <- inter_lb_ab_w1
venn_list_lb[["LB.AB_AND_LB.Y1"]] <- inter_lb_ab_y1
venn_list_lb[["LB.IJ_AND_LB.W1"]] <- inter_lb_ij_w1
venn_list_lb[["LB.IJ_AND_LB.Y1"]] <- inter_lb_ij_y1
venn_list_lb[["LB.W1_AND_LB.Y1"]] <- inter_lb_w1_y1

# Triple and quadruple intersections
venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1"]] <- inter_lb_ab_ij_w1
venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.Y1"]] <- inter_lb_ab_ij_y1
venn_list_lb[["LB.AB_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_w1_y1
venn_list_lb[["LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ij_w1_y1
venn_list_lb[["LB.AB_AND_LB.IJ_AND_LB.W1_AND_LB.Y1"]] <- inter_lb_ab_ij_w1_y1

# ---- Same for Mac ----
inter_mac_ab_ij <- intersect(mac_degs_$AB, mac_degs_$IJ)
inter_mac_ab_w1 <- intersect(mac_degs_$AB, mac_degs_$W1)
inter_mac_ab_y1 <- intersect(mac_degs_$AB, mac_degs_$Y1)
inter_mac_ij_w1 <- intersect(mac_degs_$IJ, mac_degs_$W1)
inter_mac_ij_y1 <- intersect(mac_degs_$IJ, mac_degs_$Y1)
inter_mac_w1_y1 <- intersect(mac_degs_$W1, mac_degs_$Y1)

inter_mac_ab_ij_w1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1))
inter_mac_ab_ij_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$Y1))
inter_mac_ab_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$W1, mac_degs_$Y1))
inter_mac_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))
inter_mac_ab_ij_w1_y1 <- Reduce(intersect, list(mac_degs_$AB, mac_degs_$IJ, mac_degs_$W1, mac_degs_$Y1))

venn_list_mac <- list()
venn_list_mac[["Mac.AB_only"]] <- setdiff(mac_degs_$AB, union(inter_mac_ab_ij, union(inter_mac_ab_w1, inter_mac_ab_y1)))
venn_list_mac[["Mac.IJ_only"]] <- setdiff(mac_degs_$IJ, union(inter_mac_ab_ij, union(inter_mac_ij_w1, inter_mac_ij_y1)))
# Fixed: pairwise W1/Y1 intersection, not the triple intersection
venn_list_mac[["Mac.W1_only"]] <- setdiff(mac_degs_$W1, union(inter_mac_ab_w1, union(inter_mac_ij_w1, inter_mac_w1_y1)))
venn_list_mac[["Mac.Y1_only"]] <- setdiff(mac_degs_$Y1, union(inter_mac_ab_y1, union(inter_mac_ij_y1, inter_mac_w1_y1)))

venn_list_mac[["Mac.AB_AND_Mac.IJ"]] <- inter_mac_ab_ij
venn_list_mac[["Mac.AB_AND_Mac.W1"]] <- inter_mac_ab_w1
venn_list_mac[["Mac.AB_AND_Mac.Y1"]] <- inter_mac_ab_y1
venn_list_mac[["Mac.IJ_AND_Mac.W1"]] <- inter_mac_ij_w1
venn_list_mac[["Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ij_y1
venn_list_mac[["Mac.W1_AND_Mac.Y1"]] <- inter_mac_w1_y1

venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1"]] <- inter_mac_ab_ij_w1
venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.Y1"]] <- inter_mac_ab_ij_y1
venn_list_mac[["Mac.AB_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_w1_y1
venn_list_mac[["Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ij_w1_y1
venn_list_mac[["Mac.AB_AND_Mac.IJ_AND_Mac.W1_AND_Mac.Y1"]] <- inter_mac_ab_ij_w1_y1

# Save the raw gene ID lists for inspection (optional)
write.xlsx(lb_degs, file = "LB_DEGs.xlsx")
write.xlsx(mac_degs, file = "Mac_DEGs.xlsx")

# Write the Venn intersection sets to Excel (sheet names truncated to 31 characters)
write.xlsx(venn_list_lb, file = "Venn_LB_Genes_Intersect.xlsx",
           sheetName = truncate_sheet_name(names(venn_list_lb)), rowNames = FALSE)
write.xlsx(venn_list_mac, file = "Venn_Mac_Genes_Intersect.xlsx",
           sheetName = truncate_sheet_name(names(venn_list_mac)), rowNames = FALSE)

# Venn diagram for the LB group
venn1 <- ggvenn(lb_degs_, fill_color = c("skyblue", "tomato", "gold", "orchid"),
                stroke_size = 0.4, set_name_size = 5)
ggsave("Venn_LB_Genes.png", plot = venn1, width = 7, height = 7, dpi = 300)

# Venn diagram for the Mac group
venn2 <- ggvenn(mac_degs_, fill_color = c("lightgreen", "slateblue", "plum", "orange"),
                stroke_size = 0.4, set_name_size = 5)
ggsave("Venn_Mac_Genes.png", plot = venn2, width = 7, height = 7, dpi = 300)

cat("✅ All Venn intersection sets exported to Excel successfully.\n")
-
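The hand-written intersections above generalize to any number of sets. A minimal sketch (the helper name `venn_regions` is illustrative, not part of the original script) that computes every exclusive Venn region, i.e. genes present in exactly the sets of a given combination and absent from all others:

```r
# Compute all exclusive Venn regions for a named list of gene sets.
# Each region holds the elements found in every set of the combination
# and in none of the remaining sets.
venn_regions <- function(sets) {
  nms <- names(sets)
  out <- list()
  for (k in seq_along(nms)) {
    for (combo in combn(nms, k, simplify = FALSE)) {
      inside  <- Reduce(intersect, sets[combo])
      outside <- unique(unlist(sets[setdiff(nms, combo)]))
      out[[paste(combo, collapse = "_AND_")]] <- setdiff(inside, outside)
    }
  }
  out
}

# Toy example:
regions <- venn_regions(list(A = c("g1", "g2", "g3"), B = c("g2", "g3"), C = "g3"))
# regions[["A"]] == "g1"; regions[["A_AND_B"]] == "g2"; regions[["A_AND_B_AND_C"]] == "g3"
```

Note one design difference: the script above exports full pairwise/triple intersections (a gene shared by all four strains also appears in every pairwise sheet), whereas `venn_regions` returns disjoint regions whose sizes match the numbers printed in a Venn diagram.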
Report:
Please find the heat maps attached. Comparisons included:
1. 1457 TSB vs 1457Δsbp TSB (early timepoint, 1–2 h)
2. 1457 MH vs 1457Δsbp MH (early timepoint, 1–2 h)
3. 1457 TSB vs 1457Δsbp TSB (4 h)
4. 1457 MH vs 1457Δsbp MH (4 h)
5. 1457 TSB vs 1457Δsbp TSB (18 h)
6. 1457 MH vs 1457Δsbp MH (18 h)
7. 1457 TSB: early (1–2 h) vs 4 h vs 18 h
8. 1457 MH: early (1–2 h) vs 4 h vs 18 h
9. 1457Δsbp TSB: early (1–2 h) vs 4 h vs 18 h
10. 1457Δsbp MH: early (1–2 h) vs 4 h vs 18 h

What each heat map shows
- Rows: significant DEGs only (padj < 0.05 and |log2 fold change| > 2, i.e. log2FC > 2 or < −2).
- Columns: the samples of the comparison (2 conditions × replicates), or, for the three-condition panels, 3 conditions × replicates = 9 columns.

How the files were generated
1. Choose contrasts. Two-condition heat maps (items 1–6): contrast <- "<A>_vs_<B>". Three-condition panels (items 7–10):
   contrasts <- c("<A2>_vs_<A1>", "<A3>_vs_<A1>", "<A3>_vs_<A2>")
   cond_order <- c("<A1>", "<A2>", "<A3>")   # e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h
2. Build the genes of interest (GOI) dynamically. For each contrast, the script reads the -up.id and -down.id files and takes their union; for three-condition panels it takes the union across the three pairwise contrasts.
3. Subset the expression matrix (assay(rld)): keep only the rows in the GOI and the sample columns matching the condition tags (e.g., WT_MH_2h, WT_MH_4h, WT_MH_18h).
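The GOI construction and subsetting steps described above can be sketched as follows. This is an outline, not the exact production script: it assumes the `<contrast>-up.id` / `<contrast>-down.id` file convention and the `rld` (rlog-transformed) object from the DESeq2 workflow, and it assumes pheatmap as the plotting backend.

```r
# Sketch: build the heat-map input for a three-condition panel.
library(DESeq2)
library(pheatmap)   # assumed plotting backend

contrasts  <- c("WT_MH_4h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_2h", "WT_MH_18h_vs_WT_MH_4h")
cond_order <- c("WT_MH_2h", "WT_MH_4h", "WT_MH_18h")

# Genes of interest: union of significant up/down IDs across all pairwise contrasts
goi <- unique(unlist(lapply(contrasts, function(ct) {
  ids <- c(readLines(paste0(ct, "-up.id")), readLines(paste0(ct, "-down.id")))
  trimws(gsub('"', '', ids))   # strip quotes and whitespace, as in read_gene_ids()
})))

# Subset assay(rld): GOI rows, and only sample columns carrying a condition tag
mat <- assay(rld)
keep_cols <- grepl(paste(cond_order, collapse = "|"), colnames(mat))
mat_sub <- mat[rownames(mat) %in% goi, keep_cols, drop = FALSE]

pheatmap(mat_sub, scale = "row", cluster_cols = FALSE,
         main = "WT MH: 2h vs 4h vs 18h")
```

Row scaling (`scale = "row"`) shows each gene's profile across conditions rather than absolute expression, which is what the attached heat maps display.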
DEG annotation and enrichment
The annotated significant DEGs (padj < 0.05 and |log2 fold change| > 2) used in the steps above, together with the KEGG/GO enrichment results for each comparison, are provided in the corresponding Excel files:
- DEG_KEGG_GO_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx
- DEG_KEGG_GO_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx
- DEG_KEGG_GO_WT_MH_4h_vs_WT_MH_2h-all.xlsx
- DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_2h-all.xlsx
- DEG_KEGG_GO_WT_MH_18h_vs_WT_MH_4h-all.xlsx
- DEG_KEGG_GO_WT_TSB_4h_vs_WT_TSB_2h-all.xlsx
- DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_2h-all.xlsx
- DEG_KEGG_GO_WT_TSB_18h_vs_WT_TSB_4h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_4h_vs_deltasbp_MH_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_18h_vs_deltasbp_MH_4h-all.xlsx
- DEG_KEGG_GO_deltasbp_TSB_4h_vs_deltasbp_TSB_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_2h-all.xlsx
- DEG_KEGG_GO_deltasbp_TSB_18h_vs_deltasbp_TSB_4h-all.xlsx
- DEG_KEGG_GO_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx