Analyzing WaGa and MKL-1 Cell Line miRNA (Data_Ute_smallRNA_via_exceRpt_workspace)

manhattan_plot_Carmen_custom_labels_WaGa.R

manhattan_plot_Carmen_custom_labels_MKL-1.R

For example, MKL-1 Cell Line miRNA Analysis Results are as follows.

* Raw count data (d_raw_MKL-1.xlsx): Contains the raw, unnormalized read counts for all miRNAs.
* Mapping heatmap (mapping_heatmap3_MKL-1.pdf)
* Volcano plot (MKL.1_wt_EV_vs_MKL.1_wt_cells.png and .svg)
* PCA plot (pca_MKL-1.png)
* Manhattan plot and data (manhattan_plot_MKL1_vs_EV.png, .svg, and manhattan_plot_MKL1_data.xlsx)
  1. Input data

     WaGa wt cells (nf774* (Considering to be deleted, due to possibly be an outlier, but in the current version, it is still included in the analysis), nf961, nf962)
     WaGa wt_EV_RNA (nf657* (The sample was EXCLUDED, since it is obviously a outlier, not clustered with the other 2 samples), nf930, nf935)
     WaGa_sT_DMSO_EV_RNA (nf931, nf936, nf971)
     WaGa_sT_Dox_EV_RNA (nf932, nf937, nf972)
     WaGa_scr_DMSO_EV_RNA (nf933, nf938, nf973)
     WaGa_scr_Dox_EV_RNA (nf934, nf939, nf974)
     # --> In total, 17 samples
    
     MKL-1 wt cells (nf780*, nf796*, nf797*)
     MKL-1 wt_EV_RNA (nf655* (The sample was EXCLUDED), 2404, 2608)
     MKL-1_sT_DMSO_EV_RNA (2608, 2701, 2802)
     MKL-1_sT_Dox_EV_RNA (2608, 2701, 2802)
     MKL-1_scr_DMSO_EV_RNA (2608, 2701, 2802)
     MKL-1_scr_Dox_EV_RNA (2608, 2701, 2802)
     # --> In total, 18 samples
    
     #Note that the real paths are as follows:
     #./20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz, ./20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz
     #./20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz
  2. Adapter trimming

     #some common adapter sequences from different kits for reference:
     #    - TruSeq Small RNA (Illumina): TGGAATTCTCGGGTGCCAAGG
     #    - Small RNA Kits V1 (Illumina): TCGTATGCCGTCTTCTGCTTGT
     #    - Small RNA Kits V1.5 (Illumina): ATCTCGTATGCCGTCTTCTGCTTG
     #    - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific): TGGAATTCTCGGGTGCCAAGG
     #    - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina): TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC *
     mkdir Data_Ute_smallRNA_via_exceRpt_workspace/trimmed; cd Data_Ute_smallRNA_via_exceRpt_workspace/trimmed
    
     echo "------------------------------------ cutadapting nf774 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf774.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf657 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf657.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf655 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf655.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf780 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf780.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf796 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf796.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf797 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf797.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf930 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf930.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf931 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf931.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf932 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf932.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf933 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf933.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf934 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf934.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf935 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf935.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf936 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf936.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf937 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf937.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf938 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf938.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf939 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf939.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf940 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf940.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf941 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf941.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf961 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf961.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf962 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf962.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf971 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf971.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf972 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf972.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf973 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf973.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting nf974 -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf974.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2404_MKL1_wt_EVs -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2404_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_wt_EVs -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_sT_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_sT_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_scr_DMSO -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2608_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2701_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz >> LOG
    
     echo "------------------------------------ cutadapting 2802_MKL1_scr_Dox -----------------------------------" >> LOG
     cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz >> LOG
  3. Install exceRpt (https://github.gersteinlab.org/exceRpt/)

     docker pull rkitchen/excerpt
     mkdir MyexceRptDatabase
     cd /mnt/nvme0n1p1/MyexceRptDatabase
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     tar -xvf exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg19_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
     #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_mm10_lowmem.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOmiRNArRNA.tgz
     tar -xvf exceRptDB_v4_EXOmiRNArRNA.tgz
     wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOGenomes.tgz
     tar -xvf exceRptDB_v4_EXOGenomes.tgz
    
     # List extracted hg38 directory structure
     find hg38 -type f | sed 's|^hg38/||' | sort > extracted_hg38.txt
     comm -3 extracted_hg38.txt <(tar -tf exceRptDB_v4_hg38_lowmem.tgz | grep '^hg38/' | sed 's|^hg38/||' | sort)  # --> DIR hg38
     tar -tf exceRptDB_v4_EXOmiRNArRNA.tgz  # --> DIR ribosomeDatabase, NCBI_taxonomy_taxdump, miRBase
     tar -tf exceRptDB_v4_EXOGenomes.tgz  # --> Genomes_BacteriaFungiMammalPlantProtistVirus
  4. Run exceRpt

     #[---- REAL_RUNNING_COMPLETE_DB ---->]
     #NOTE that if not renamed in the input files, then have to RENAME all files recursively by removing "_cutadapted.fastq" in all names in _CORE_RESULTS_v4.6.3.tgz (first unzip, removing, then zip, mv to ../results_g).
     cd trimmed
     for file in *.fastq.gz; do
         echo "mv \"$file\" \"${file/.fastq/}\""
     done
    
     mkdir results
     for sample in nf780 nf796 nf797  nf655    nf774 nf961 nf962  nf657 nf930 nf935  nf931 nf936 nf971  nf932 nf937 nf972  nf933 nf938 nf973  nf934 nf939 nf974; do
         docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                    -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
                   -v /mnt/nvme0n1p1/MyexceRptDatabase:/exceRpt_DB \
                   -t rkitchen/excerpt \
                   INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
     done
    
     for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
         docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                    -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
                   -v /mnt/nvme3n1p1/MyexceRptDatabase:/exceRpt_DB \
                   -t rkitchen/excerpt \
                   INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
     done
    
     #DEBUG the excerpt env
     docker inspect rkitchen/excerpt:latest
     # Without /bin/bash → May run and exit immediately
     #docker run -it rkitchen/excerpt
     # With /bin/bash → Stays open for interaction
     docker run -it --entrypoint /bin/bash rkitchen/excerpt
  5. Processing exceRpt output from multiple samples

     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master
     mamba activate r_env
     mamba install -c conda-forge -c bioconda \
         bioconductor-marray \
         bioconductor-rgraphviz \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx r-rcurl r-xml \
         -y
     mamba install -c conda-forge -c bioconda \
         r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx \
         bioconductor-marray bioconductor-rgraphviz \
         -y
    
     #mkdir summaries heatmap_all_WaGa+4_MKL-1
     mkdir results_WaGa_EXCLUDED results_MKL-1 summaries_WaGa summaries_MKL-1 heatmap_WaGa heatmap_MKL-1
     #! EXCLUDE some isolates since they have total different pattern or due to bad quality --> outliner, until now only one sample, namely nf657 from WaGa wt EV:
     sudo mv results/nf657* results_WaGa_EXCLUDED/
     sudo mv results/nf780* results_MKL-1/
     sudo mv results/nf796* results_MKL-1/
     sudo mv results/nf797* results_MKL-1/
     sudo mv results/nf655* results_MKL-1/
     for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
         echo "sudo mv results/${sample}* results_MKL-1/"
     done
     #Following our initial QC, I noticed that one of the MKL-1 wt-EV samples (nf655) is a clear outlier, clustering far apart from the other two wt-EV replicates in the PCoA plots. I recommend removing nf655 from the downstream MKL-1 analysis, which is similar to our earlier analysis for MKL-1, in which we removed the outlier nf657. Please see the attached figures for reference.
     mv results_MKL-1/nf655* results_MKL-1_EXCLUDED/
    
     (r_env) jhuang@WS-2290C:~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master$ R
     #WARNING: need to reload the R-script after each change of the script.
     source("mergePipelineRuns_functions.R")
     processSamplesInDir("../results_WaGa/", "../summaries_WaGa")
     processSamplesInDir("../results_MKL-1/", "../summaries_MKL-1")
    
     #mkdir heatmap_WaGa; cp summaries_WaGa/*.RData heatmap_WaGa; rm heatmap_WaGa/exceRpt_sampleGroupDefinitions.txt;
     source("mergePipelineRuns_functions_addSampleGroupInfo_WaGa.R")
     processSamplesInDir("../results_WaGa/", "../heatmap_WaGa")
    
     #mkdir heatmap_MKL-1; cp summaries_MKL-1/*.RData heatmap_MKL-1; rm heatmap_MKL-1/exceRpt_sampleGroupDefinitions.txt;
     source("mergePipelineRuns_functions_addSampleGroupInfo_MKL-1.R")
     processSamplesInDir("../results_MKL-1/", "../heatmap_MKL-1")
    
     #!!!!! IMPORTANT: REPORT heatmap_MKL-1/exceRpt_DiagnosticPlots.pdf and heatmap_MKL-1/mapping_heatmap3.pdf (They are almost the same, mapping_heatmap3.pdf is better due to bigger font size) !!!!
     #CONSIDERING_TO_DEL_nf774 since it is very far to another two samples (MAYBE BETTER NOT DO THIS, SINCE I HAVE TO GENERATE PCA- and MANHATTAN PLOTS!!): now the sample nf774 was kept in the WaGa results.
    
     #~/Tools/csv2xls-0.4/csv_to_xls.py exceRpt_miRNA_ReadsPerMillion.txt exceRpt_tRNA_ReadsPerMillion.txt exceRpt_piRNA_ReadsPerMillion.txt -d$'\t' -o exceRpt_results_detailed.xls
    
     # Report summaries_WaGa/exceRpt_mapping_heatmaps_WaGa.xlsx or summaries_MKL-1/exceRpt_mapping_heatmaps_MKL-1.xlsx;
     #        summaries_WaGa/exceRpt_results_detailed_WaGa.xls or summaries_MKL-1/exceRpt_results_detailed_MKL-1.xls;
     #        heatmap_WaGa/mapping_heatmap3_WaGa.pdf or heatmap_MKL-1/mapping_heatmap3_MKL-1.pdf
  6. Downstream analyis using R for miRNAs (17 WaGa samples)

     #Input file
     #exceRpt_miRNA_ReadCounts.txt
     #exceRpt_piRNA_ReadCounts.txt
    
     ## WaGa experimental groups (scr = scramble control; sT = target knockdown)
     #WaGa_scr_DMSO_EV (nf933, nf938, nf973)
     #WaGa_scr_Dox_EV (nf934, nf939, nf974)
     #WaGa_sT_DMSO_EV (nf931, nf936, nf971)
     #WaGa_sT_Dox_EV (nf932, nf937, nf972)
     #
     ## WaGa wild-type controls
     #WaGa_wt_cells (nf774, nf961, nf962)
     #WaGa_wt_EV (nf930, nf935)
    
     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_WaGa
     mamba activate r_env
     R
    
     #BiocManager::install("AnnotationDbi")
     #BiocManager::install("clusterProfiler")
     #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
     #BiocManager::install("limma")
     #BiocManager::install("sva")
     #install.packages("writexl")
     #install.packages("openxlsx")
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library("org.Hs.eg.db")
     library(DESeq2)
     library(gplots)
     library(limma)
     library(sva)
     #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
     library(openxlsx)
    
     d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)
    
     # Desired column order
     desired_order <- c(
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"
     )
    
     # Reorder columns
     d.raw <- d.raw[, desired_order]
     setdiff(desired_order, colnames(d.raw))  # Shows missing or misnamed columns
     #sapply(d.raw, is.numeric)
     d.raw[] <- lapply(d.raw, as.numeric)
     #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
     d.raw <- round(d.raw)
     write.csv(d.raw, file ="d_raw.csv")
     write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)
    
     # ------ Code sent to Ute ------
     #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
     Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
     replicates = as.factor(c("WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV",     "WaGa_scr_Dox_EV","WaGa_scr_Dox_EV","WaGa_scr_Dox_EV",  "WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV",  "WaGa_sT_Dox_EV","WaGa_sT_Dox_EV","WaGa_sT_Dox_EV",  "WaGa_wt_cells", "WaGa_wt_cells","WaGa_wt_cells",  "WaGa_wt_EV", "WaGa_wt_EV"))
     ids = as.factor(c(
         "nf933", "nf938", "nf973",
         "nf934", "nf939", "nf974",
         "nf931", "nf936", "nf971",
         "nf932", "nf937", "nf972",
         "nf774", "nf961", "nf962",
         "nf930", "nf935"))
     cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     # Filter low-count miRNAs
     dds <- dds[ rowSums(counts(dds)) > 10, ]
     rld <- rlogTransformation(dds)
    
     # -- before pca --
     png("pca.png", 1200, 800)
     plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     #plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
     png("pca2.png", 1200, 800)
     #plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
    
     # Batch Effect Removal Methods (Non-batch effect removal applied!)
    
     #### STEP2: DEGs ####
     #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
     #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
     #- Manhattan plot miRNAs; 1x for WaGa cell line
     #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
     #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line
    
     #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
     sizeFactors(dds)
     #NULL
     dds <- estimateSizeFactors(dds)
     sizeFactors(dds)
     normalized_counts <- counts(dds, normalized=TRUE)
     write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
     write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
    
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     dds$replicates <- relevel(dds$replicates, "WaGa_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("WaGa_wt_EV_vs_WaGa_wt_cells")
    
     #NOTE that the results sent to Ute is |padj|<=0.1.
     for (i in clist) {
         contrast = paste("replicates", i, sep="_")
         res = results(dds, name=contrast)
         res <- res[!is.na(res$log2FoldChange),]
         #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
         res$padj <- ifelse(is.na(res$padj), 1, res$padj)
         res_df <- as.data.frame(res)
         write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
         up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
         down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
         write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
         write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
     }
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     WaGa_wt_EV_vs_WaGa_wt_cells-all.txt \
     WaGa_wt_EV_vs_WaGa_wt_cells-up.txt \
     WaGa_wt_EV_vs_WaGa_wt_cells-down.txt \
     -d$',' -o WaGa_wt_EV_vs_WaGa_wt_cells.xls;
    
     # ------------------- volcano_plot -------------------
     library(ggplot2)
     library(ggrepel)
    
     geness_res <- read.csv(file = paste("WaGa_wt_EV_vs_WaGa_wt_cells", "all.txt", sep="-"), row.names=1)
    
     external_gene_name <- rownames(geness_res)
     geness_res <- cbind(geness_res, external_gene_name)
     #top_g are from ids
     top_g <- c("hsa-let-7b-5p","hsa-let-7g-5p","hsa-let-7i-5p","hsa-miR-103a-3p","hsa-miR-107","hsa-miR-1224-5p","hsa-miR-122-5p","hsa-miR-1226-5p","hsa-miR-1246","hsa-miR-127-3p","hsa-miR-1290","hsa-miR-130a-3p","hsa-miR-139-3p","hsa-miR-141-3p","hsa-miR-143-3p","hsa-miR-148b-3p","hsa-miR-155-5p","hsa-miR-15a-5p","hsa-miR-17-5p","hsa-miR-184","hsa-miR-18a-3p","hsa-miR-18a-5p","hsa-miR-190a-5p","hsa-miR-191-5p","hsa-miR-193b-5p","hsa-miR-197-5p","hsa-miR-200a-3p","hsa-miR-200b-5p","hsa-miR-206","hsa-miR-20a-5p","hsa-miR-210-3p","hsa-miR-2110","hsa-miR-21-5p","hsa-miR-218-5p","hsa-miR-219a-1-3p","hsa-miR-221-3p","hsa-miR-23b-3p","hsa-miR-27a-3p","hsa-miR-27b-3p","hsa-miR-27b-5p","hsa-miR-28-3p","hsa-miR-30a-5p","hsa-miR-30c-5p","hsa-miR-30e-5p","hsa-miR-3127-5p","hsa-miR-3131","hsa-miR-3180|hsa-miR-3180-3p","hsa-miR-320a","hsa-miR-320b","hsa-miR-320c","hsa-miR-320d","hsa-miR-330-3p","hsa-miR-335-3p","hsa-miR-33b-5p","hsa-miR-340-5p","hsa-miR-342-5p","hsa-miR-3605-5p","hsa-miR-361-3p","hsa-miR-365a-5p","hsa-miR-374b-5p","hsa-miR-378i","hsa-miR-379-5p","hsa-miR-3940-5p","hsa-miR-409-3p","hsa-miR-411-5p","hsa-miR-423-3p","hsa-miR-423-5p","hsa-miR-4286","hsa-miR-429","hsa-miR-432-5p","hsa-miR-4326","hsa-miR-451a","hsa-miR-4520-3p","hsa-miR-454-3p","hsa-miR-4646-5p","hsa-miR-4667-5p","hsa-miR-4748","hsa-miR-483-5p","hsa-miR-486-5p","hsa-miR-5010-5p","hsa-miR-504-3p","hsa-miR-5187-5p","hsa-miR-590-3p","hsa-miR-6128","hsa-miR-625-5p","hsa-miR-6726-5p","hsa-miR-6730-5p","hsa-miR-676-3p","hsa-miR-6767-5p","hsa-miR-6777-5p","hsa-miR-6780a-5p","hsa-miR-6794-5p","hsa-miR-6817-3p","hsa-miR-708-5p","hsa-miR-7-5p","hsa-miR-766-5p","hsa-miR-7854-3p","hsa-miR-873-3p","hsa-miR-885-3p","hsa-miR-92b-5p","hsa-miR-93-5p","hsa-miR-937-3p","hsa-miR-9-5p","hsa-miR-98-5p")
     subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
     geness_res$Color <- "NS or log2FC < 2.0"
     geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
     geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
     geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
    
     write.csv(geness_res, "WaGa_wt_EV_vs_WaGa_wt_cells_with_Category.csv")
     geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
    
     geness_res <- geness_res[, -1*ncol(geness_res)]
     png("WaGa_wt_EV_vs_WaGa_wt_cells.png",width=1200, height=1400)
     #svg("WaGa_wt_EV_vs_WaGa_wt_cells.svg",width=12, height=14)
     ggplot(geness_res,       aes(x = log2FoldChange, y = -log10(pvalue),           color = Color, label = external_gene_name)) +       geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +       geom_hline(yintercept = -log10(0.05), lty = "dashed") +       geom_point() +       labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +       scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) +       geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) +       theme_bw(base_size = 16) +       theme(legend.position = "bottom")
     dev.off()
    
     # ----------------------------------------
     # ----------- manhattan_plot -------------
    
     Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt
  7. Downstream analyis using R for miRNAs (17 MKL-1 samples)

     #Input file
     #exceRpt_miRNA_ReadCounts.txt
     #exceRpt_piRNA_ReadCounts.txt
    
     #MKL-1_sT_DMSO_EV ("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO")
     #MKL-1_sT_Dox_EV ("X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox")
     #MKL-1_scr_DMSO_EV ("X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO")
     #MKL-1_scr_Dox_EV ()"X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox")
     #MKL-1_wt_cells ("nf780","nf796","nf797")
     #MKL-1_wt_EV ("X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs")
    
     cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_MKL-1
     mamba activate r_env
     R
    
     #BiocManager::install("AnnotationDbi")
     #BiocManager::install("clusterProfiler")
     #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
     #BiocManager::install("limma")
     #BiocManager::install("sva")
     #install.packages("writexl")
     #install.packages("openxlsx")
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library("org.Hs.eg.db")
     library(DESeq2)
     library(gplots)
     library(limma)
     library(sva)
     #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
     library(openxlsx)
    
     d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)
    
     # Desired column order
     desired_order <- c(
         "X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
         "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"
     )
    
     # Reorder columns
     d.raw <- d.raw[, desired_order]
     setdiff(desired_order, colnames(d.raw))  # Shows missing or misnamed columns
     #sapply(d.raw, is.numeric)
     d.raw[] <- lapply(d.raw, as.numeric)
     #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
     d.raw <- round(d.raw)
     write.csv(d.raw, file ="d_raw.csv")
     write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)
    
     #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
     Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
     replicates = as.factor(c("MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV",     "MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV",  "MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV",  "MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV",    "MKL-1_wt_cells", "MKL-1_wt_cells","MKL-1_wt_cells",  "MKL-1_wt_EV","MKL-1_wt_EV"))
     ids = as.factor(c("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
         "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"))
     cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     # Filter low-count miRNAs
     dds <- dds[ rowSums(counts(dds)) > 10, ]
     rld <- rlogTransformation(dds)
    
     # -- before pca --
     png("pca.png", 1200, 800)
     plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     #plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
     png("pca2.png", 1200, 800)
     #plotPCA(rld, intgroup=c("replicates"))
     #plotPCA(rld, intgroup = c("replicates", "batch"))
     plotPCA(rld, intgroup = c("replicates", "ids"))
     #plotPCA(rld, "batch")
     dev.off()
    
     # Batch Effect Removal Methods (Non-batch effect removal applied!)
    
     #### STEP2: DEGs ####
     #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
     #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
     #- Manhattan plot miRNAs; 1x for WaGa cell line
     #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
     #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line
    
     #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
     sizeFactors(dds)
     #NULL
     dds <- estimateSizeFactors(dds)
     sizeFactors(dds)
     normalized_counts <- counts(dds, normalized=TRUE)
     write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
     write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
    
     dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)
    
     dds$replicates <- relevel(dds$replicates, "MKL-1_wt_cells")
     dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
     resultsNames(dds)
     clist <- c("MKL.1_wt_EV_vs_MKL.1_wt_cells")
    
     #NOTE that the results sent to Ute is |padj|<=0.1.
     for (i in clist) {
         contrast = paste("replicates", i, sep="_")
         res = results(dds, name=contrast)
         res <- res[!is.na(res$log2FoldChange),]
         #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
         res$padj <- ifelse(is.na(res$padj), 1, res$padj)
         res_df <- as.data.frame(res)
         write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
         up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
         down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
         write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
         write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
     }
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-all.txt \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-up.txt \
     MKL.1_wt_EV_vs_MKL.1_wt_cells-down.txt \
     -d$',' -o MKL.1_wt_EV_vs_MKL.1_wt_cells.xls;
    
     # ------------------- volcano_plot -------------------
     library(ggplot2)
     library(ggrepel)
    
     geness_res <- read.csv(file = paste("MKL.1_wt_EV_vs_MKL.1_wt_cells", "all.txt", sep="-"), row.names=1)
    
     external_gene_name <- rownames(geness_res)
     geness_res <- cbind(geness_res, external_gene_name)
     #top_g are from ids
    
     top_g <- c("hsa-miR-203a-3p","hsa-miR-6850-5p","hsa-miR-4511","hsa-miR-5187-5p","hsa-miR-133b","hsa-miR-1246","hsa-miR-625-3p","hsa-miR-6741-3p","hsa-miR-192-5p","hsa-miR-10b-5p","hsa-miR-885-5p","hsa-miR-30e-3p","hsa-miR-101-3p","hsa-miR-1307-5p","hsa-miR-95-3p","hsa-miR-889-3p","hsa-miR-206","hsa-miR-301a-3p","hsa-miR-1-3p","hsa-let-7c-5p","hsa-miR-196a-5p","hsa-let-7f-5p","hsa-let-7e-5p","hsa-miR-30c-5p","hsa-miR-30a-3p","hsa-miR-146b-5p","hsa-miR-25-3p","hsa-miR-182-5p","hsa-miR-98-5p","hsa-let-7a-5p","hsa-miR-149-5p","hsa-miR-148a-3p","hsa-miR-873-3p","hsa-miR-19b-3p","hsa-miR-320c","hsa-miR-375","hsa-miR-30a-5p","hsa-miR-877-5p","hsa-miR-34a-5p","hsa-miR-324-5p","hsa-miR-652-3p","hsa-miR-342-5p","hsa-miR-7706","hsa-miR-361-3p","hsa-miR-361-5p","hsa-miR-1180-3p","hsa-miR-217","hsa-miR-1307-3p","hsa-miR-1908-5p","hsa-miR-15b-5p","hsa-miR-92b-5p","hsa-miR-484","hsa-miR-197-3p","hsa-miR-200c-3p","hsa-miR-671-5p","hsa-miR-339-5p","hsa-miR-1301-3p","hsa-miR-769-5p","hsa-miR-328-3p","hsa-miR-93-5p","hsa-miR-103a-3p")
     subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
     geness_res$Color <- "NS or log2FC < 2.0"
     geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
     geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
     geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
    
     write.csv(geness_res, "MKL.1_wt_EV_vs_MKL.1_wt_cells_with_Category.csv")
     geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
    
     geness_res <- geness_res[, -1*ncol(geness_res)]
     png("MKL.1_wt_EV_vs_MKL.1_wt_cells.png",width=1200, height=1400)
     #svg("MKL.1_wt_EV_vs_MKL.1_wt_cells.svg",width=12, height=14)
     ggplot(geness_res,       aes(x = log2FoldChange, y = -log10(pvalue),           color = Color, label = external_gene_name)) +       geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +       geom_hline(yintercept = -log10(0.05), lty = "dashed") +       geom_point() +       labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +       scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) +       geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) +       theme_bw(base_size = 16) +       theme(legend.position = "bottom")
     dev.off()
    
     # ----------------------------------------
     # ----------- manhattan_plot -------------
    
     Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt

Comparing https://chat.z.ai/, https://chat.qwen.ai/, and https://chat.deepseek.com/ to explain Maaslin2

Maaslin2 解释

什么是Maaslin2?

Maaslin2(Microbiome Multivariable Association with Linear Models)是一个用于微生物组数据多变量关联分析的R语言包。

主要用途

  • 寻找微生物特征(如细菌丰度)与元数据(如年龄、饮食、疾病状态)之间的关联
  • 适用于16S rRNA测序、宏基因组等微生物组数据

核心特点

1. 统计方法

  • 使用线性模型广义线性模型
  • 支持固定效应和随机效应
  • 可处理重复测量数据

2. 数据标准化

  • 自动处理微生物数据的组成性问题
  • 提供 TSS、CSS、CLR 等多种标准化方法

3. 多变量分析

  • 可同时调整多个协变量
  • 控制混杂因素

基本工作流程

# 安装
library(Maaslin2)

# 运行分析
results <- Maaslin2(
  input_data = microbiome_data,    # 微生物丰度表
  input_metadata = metadata,       # 元数据
  output = "results",              # 输出目录
  fixed_effects = c("age", "diet"), # 固定效应
  random_effects = c("subject_id"), # 随机效应
  normalization = "CLR",           # 标准化方法
  transformation = "LOG"           # 数据转换
)

输出结果

  • 显著性关联表:包含p值、q值、效应大小
  • 可视化图表:热图、箱线图等
  • 汇总统计

优势

  1. 用户友好:简单的接口,默认参数合理
  2. 灵活性强:支持多种标准化和转换方法
  3. 统计严谨:多重检验校正(FDR)
  4. 可重复性:结果易于复现

应用场景

  • 疾病与肠道菌群关联研究
  • 饮食干预对微生物组影响
  • 药物处理前后菌群变化
  • 环境因素与微生物组关系


Here’s Python code to read a file and output to console:

# Method 1: Read entire file at once
with open('filename.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)

# Method 2: Read line by line
with open('filename.txt', 'r', encoding='utf-8') as file:
    for line in file:
        print(line.strip())  # strip() removes newline characters

# Method 3: Read all lines into a list
with open('filename.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

# With error handling
try:
    with open('filename.txt', 'r', encoding='utf-8') as file:
        content = file.read()
        print(content)
except FileNotFoundError:
    print("File not found!")
except Exception as e:
    print(f"An error occurred: {e}")

Key points:

  • Use with statement to automatically close the file
  • Specify encoding='utf-8' to handle special characters
  • strip() removes trailing newlines when reading line by line
  • Always include error handling for robust code

Replace 'filename.txt' with your actual file path.

Processing DNAseq for Data_Tam_DNAseq_2026_2605_2617_2631_2914_Acinetobacter_sp

  1. Run nextflow bacass

     conda deactivate
    
     # Downlod k2_standard_08_GB_20251015.tar.gz from https://benlangmead.github.io/aws-indexes/k2#kraken2--bracken
     # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056; 'tar xzf 20190108_kmerfinder_stable_dirs.tar.gz'  #The database does not work!
     # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz  #The database works!
    
     # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working!
    
     nextflow run nf-core/bacass -r 2.6.0 -profile docker --help
    
     # -- Hybrid assembly --
     nextflow run nf-core/bacass -r 2.6.0 -profile docker \
       --input samplesheet_bacass.tsv \
       --outdir bacass_out \
       --assembly_type hybrid \
       --assembler unicycler,dragonflye \
       --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
       --skip_kmerfinder \
       -resume \
       -work-dir bacass_out/work
    
     # -- Short assembly --
     #Maybe BUG is from '--skip_kmerfinder for -r 2.6.0, using db in 2.5.0'
     nextflow run nf-core/bacass -r 2.5.0 -profile docker \
       --input samplesheet.tsv \
       --outdir bacass_out \
       --assembly_type short \
       --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
       --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
       -resume \
       -work-dir bacass_out/work
  2. Verify if the genome is pure

     # 1. Go up one level to the main 'bacass_out' directory
     cd ..
    
     # 2. Create directories for CheckM inputs and outputs
     mkdir -p checkm_input checkm_output
    
     # 3. Copy all .fna files into the 'checkm_input' folder
     # (CheckM cannot search subdirectories, so they must be in one folder)
     find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;
    
     # 4. Run CheckM on all 4 assemblies
     checkm lineage_wf -x fna checkm_input checkm_output
  3. Species Identification: 快速筛查用 Mash → 精确分类用 GTDB-Tk → 种级验证用 FastANI,三者结合可最大限度提高物种鉴定的准确性和可解释性。

     # 1. 创建环境(推荐 mamba)
     mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk
     mamba activate gtdbtk
    
     # 2. 下载数据库(仅需首次,约 60GB)
     gtdbtk download --data_dir ./gtdb_data --release 220
    
     wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.g
     mamba env config vars set GTDBTK_DATA_PATH="/mnt/nvme4n1p1/gtdb_data/release232"
     # 先退出当前环境,再重新激活
     mamba deactivate
     mamba activate gtdbtk
    
     # 验证环境变量是否加载成功
     echo $GTDBTK_DATA_PATH
     # 应输出:/mnt/nvme4n1p1/gtdb_data/release232
    
     # 3. 运行分类(你提供的命令 + 实用参数)
     gtdbtk classify_wf \
       --genome_dir ./checkm_input \
       --out_dir gtdb_out \
       --cpus 64 \
       --extension .fna \
       --prefix mygenome
    
     # 4. 查看结果
     cat gtdb_out/classify/mygenome.bac120.summary.tsv   # 细菌结果
  4. Antimicrobial resistance gene profiling and Resistome and Virulence Profiling with Abricate and RGI (Reisistance Gene Identifier)

     conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
     abricate --list
    
     conda deactivate
    
     ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
     ASM=bacass_out/checkm_input/2914_.fna \
     SAMPLE=2914 \
     OUTDIR=resistome_virulence_2914 \
     MINID=80 MINCOV=60 \
     THREADS=32 \
     ~/Scripts/run_abricate_resistome_virulome_one_per_gene.sh
    
     #ABRicate thresholds: MINID=80 MINCOV=60
     Database        Hit_lines       File
     MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
     CARD    21      resistome_virulence_2605/raw/2605.card.tab
     ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
     VFDB    0       resistome_virulence_2605/raw/2605.vfdb.tab
    
     # Database        Hit_lines       File
     # MEGARes 42      resistome_virulence_2631/raw/2631.megares.tab
     # CARD    37      resistome_virulence_2631/raw/2631.card.tab
     # ResFinder       16      resistome_virulence_2631/raw/2631.resfinder.tab
     # VFDB    0       resistome_virulence_2631/raw/2631.vfdb.tab
    
     Database        Hit_lines       File
     MEGARes 35      resistome_virulence_2914/raw/2914.megares.tab
     CARD    31      resistome_virulence_2914/raw/2914.card.tab
     ResFinder       11      resistome_virulence_2914/raw/2914.resfinder.tab
     VFDB    0       resistome_virulence_2914/raw/2914.vfdb.tab
    
     # #ABRicate thresholds: MINID=70 MINCOV=50
     # Database        Hit_lines       File
     # MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
     # CARD    21      resistome_virulence_2605/raw/2605.card.tab
     # ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
     # VFDB    3       resistome_virulence_2605/raw/2605.vfdb.tab
    
     conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
     #NEED_TO_ADAPT: OUTDIR = Path("resistome_virulence_An7")
     #NEED_TO_ADAPT: SAMPLE = "An7"
     #DEPRECATED_DUE_TO_NEED_MANULL_SETTING: python ~/Scripts/merge_amr_sources_by_gene.py
    
     python ~/Scripts/export_resistome_virulence_to_excel_py36.py \
       --workdir resistome_virulence_2914 \
       --sample 2914 \
       --out Resistome_Virulence_2914.xlsx
     # Delete the column 'COVERAGE_MAP' in all 'Raw_*' sheets
  5. Report

     Please find below a summary of genomic analyses for samples 2605, 2617, 2631 and 2914.
    
     ### 1. Assembly and checkM
    
             ------------------------------------------------------------------------------------------------------------------------------------------------------------------
             Bin Id            Completeness   Contamination   Strain heterogeneity
             ------------------------------------------------------------------------------------------------------------------------------------------------------------------
             2631_       100.00          100.00             78.57
             2617_          100.00          100.00             78.57
             2605_     100.00           0.00               0.00
             2914_         99.98            0.63               0.00
             ----------------------------------------------------------------------------------------------------------------------------------------------------------------
    
             From the results of checkM, we see the samples 2631_ and 2617_ both are genomes between 7.0-7.1 M. and the contamination is 100.00, which means the DNA sample contained two closely related strains of the same species from a non-clonal culture. If the true genome size is a standard ~3.7 Mb  and the assembler couldn't merge the two highly similar strains, it would build both side-by-side. This results in a ~7.0 Mb assembly where every gene is duplicated.
             The sample 2605_.fna is 3.7 M and 2914_.fna is about 3.9M. they are pure isolates.
    
             ### 1. Species Identification
    
             **Sample 2605_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.02% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.874 | ✅ 87.4% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2605_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2617_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.00% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.859 | ✅ 85.9% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2617_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2631_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.07% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.860 | ✅ 86.0% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2631_ is confidently assigned to *Acinetobacter baumannii*.
    
             ***
    
             **Sample 2914_:** *Acinetobacter baumannii* ✅ Confirmed
    
             | Parameter | Value | Interpretation |
             |---|---|---|
             | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
             | ANI | 98.11% | ✅ Well above 95% species threshold |
             | AF (Alignment Fraction) | 0.873 | ✅ 87.3% of genome aligns; ANI estimate is robust |
             | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |
    
             🟢 **Conclusion:** 2914_ is confidently assigned to *Acinetobacter baumannii*.
    
     ### 3. Since 2631_ and 2617_ are not a pure isolates, they are the mixed of two strains. I exclude the two samples from AMR and VFDB analysis. AMR Genes and Virulence Factors (VFDB) Summary, see the Resistome_Virulence_2605.xlsx and Resistome_Virulence_2914.xlsx.


Since CheckM requires all the genome files (bins) to be in the same directory, you first need to gather your .fna files into a single folder before running the command.

Here are the exact bash commands to prepare your files and run CheckM for all four samples (2605, 2617, 2631, and 2914).

Step 1: Prepare the Input and Run CheckM

Run the following commands in your terminal:

# 1. Go up one level to the main 'bacass_out' directory
cd ..

# 2. Create directories for CheckM inputs and outputs
mkdir -p checkm_input checkm_output

# 3. Copy all .fna files into the 'checkm_input' folder
# (CheckM cannot search subdirectories, so they must be in one folder)
find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;

# 4. Run CheckM on all 4 assemblies
checkm lineage_wf -x fna checkm_input checkm_output

(Note: If you get an error saying the CheckM database is not found, you need to set the path to your CheckM data folder first using: checkm data setRoot /path/to/your/checkm_data)

Step 2: How to Interpret the Results

CheckM will print a summary table directly to your terminal, and it will also save detailed statistics in checkm_output/bin_stats_ext.tsv.

Since you are working with Acinetobacter sp. (which typically has a genome size of ~3.5 to 4.5 Mb), a 7.0 Mb assembly is almost certainly a mix of two closely related strains that the assembler failed to collapse.

Look specifically at these three columns in the CheckM output for your 2631_ sample:

  1. Completeness: This should be high (ideally > 95%). Because the assembler built two genomes, CheckM will likely still find all the marker genes, so completeness might look deceptively high.
  2. Contamination: This is the most critical metric. For a pure isolate, this should be < 5%. If your 2631 assembly is a mixed strain, CheckM will detect the duplicated marker genes and flag this as high contamination (often > 20-50%).
  3. Strain heterogeneity: This column specifically calculates if there are multiple strains of the same species present. A high value here confirms that your assembly contains a mixture of closely related strains.

What to do next?

  • If 2631 has high contamination/strain heterogeneity: You have confirmed it is a mixed assembly. You can either use a tool like purge_dups to separate the two strains, or map your raw reads back to the assembly and filter out the contigs with half the average coverage.
  • If 2631 has low contamination: The 7.0 Mb size might be real (some Acinetobacter strains have large plasmids or are naturally larger), but you should still verify this using a BlobPlot (GC vs. Coverage) to be absolutely sure there is no foreign DNA.
  • Check the other 3 samples: Compare the CheckM results of 2605, 2617, and 2914 to see if they are pure, high-quality genomes (high completeness, low contamination).

            ------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Bin Id         Marker lineage         # genomes   # markers   # marker sets   0    1     2    3   4   5+   Completeness   Contamination   Strain heterogeneity
            ------------------------------------------------------------------------------------------------------------------------------------------------------------------
            2631_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
            2617_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
            2605_    f__Moraxellaceae (UID4680)       86         689           365        0   689    0    0   0   0       100.00           0.00               0.00        
            2914_    g__Acinetobacter (UID4685)       42         1193          254        1   1188   4    0   0   0       99.98            0.63               0.00        
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------

Ah, I understand now! Thank you for clarifying. The CheckM results you showed earlier (with 100% contamination and 56 markers in the ‘2’ column) were indeed from this Unicycler assembly.

Looking at your Unicycler log, we can see exactly why it didn’t fix the problem: Total Length: 7,199,947 bp (~7.2 Mb) Main Chromosome: Component 1 consists of 416 fragmented contigs totaling 7,165,229 bp.

Why Unicycler Failed to Collapse the Genome

The log states: “Since you provided only short reads, Unicycler will essentially function as a SPAdes-optimiser.”

Because you only have short reads (Illumina), Unicycler builds a de Bruijn graph. When a bacterial sample contains a mixture of two closely related strains (a very common scenario in non-clonal isolates), the assembler sees the slight genetic differences between the two strains as “bubbles” in the graph.

Without long reads to span across these bubbles and prove which path belongs to which strain, the software conservatively outputs both paths. This results in an assembly that contains both strains side-by-side, doubling your genome size to ~7.2 Mb and causing the 100% duplication flagged by CheckM.

How to Fix This (Since Re-assembling Won’t Work)

Running another short-read assembler will just result in the exact same 7.2 Mb duplicated assembly. You now need to computationally “purge” the duplicate contigs.

Option 1: The Quick Fix using BBMap (dedupe.sh)

If you have the BBMap suite installed, it has a fantastic tool specifically for removing duplicate sequences. It will compare all your contigs to each other and remove the redundant ones.

dedupe.sh in=assembly.fasta out=assembly_purged.fasta minidentity=95
  • minidentity=95 tells it to remove contigs that are 95% identical or more to another contig. You can adjust this (e.g., 90 or 98) depending on how closely related the two strains are.
  • After running this, run CheckM again on assembly_purged.fasta. You should see the genome size drop to ~3.5 Mb and the contamination drop to 0%.

Option 2: The Thorough Fix using purge_dups

purge_dups is the gold standard for separating primary contigs from duplicated “haplotigs” in an assembly. It works by aligning your assembly to itself and using read depth to separate the true genome from the duplicates.

  1. Calculate coverage cutoffs: calcuts assembly.fasta > cutoffs 2>calcults.log
  2. Split the assembly: split_fa assembly.fasta > assembly.split.fa
  3. Align to itself: minimap2 -xasm5 assembly.split.fa assembly.split.fa > alignment.paf
  4. Get coverage: pbcstat alignment.paf and calcuts PB.base.cov
  5. Purge: purge_dups -2 -T 1024 -c CUTOFFS assembly.split.fa alignment.paf > dups.bed
  6. Extract pure genome: get_seqs dups.bed assembly.split.fa (This will output assembly.split.fa.purged.fa, which is your pure ~3.5 Mb genome).

Option 3: Do you have Long Reads?

If you happen to have Oxford Nanopore (ONT) or PacBio reads for this sample, you can solve this biologically rather than computationally. Running Unicycler with long reads (unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq) will allow the assembler to easily span across the “bubbles” and collapse the two strains into a single, perfect, circular ~3.5 Mb chromosome.


Bonus Observation: You likely have 3 plasmids!

Looking at the end of your log, Unicycler successfully closed 3 circular contigs:

  • 8,731 bp (7.89x coverage)
  • 5,837 bp (5.27x coverage)
  • 2,120 bp (7.74x coverage)

These are almost certainly plasmids. The massive ~7.2 Mb “Component 1” is your duplicated chromosome. When you use purge_dups or dedupe.sh to fix the chromosomal duplication, these plasmids should remain intact in your final assembly!



是的,这种强制“去冗余”(Purging duplicates / Haplotype purging)的操作确实有可能引入组装错误或偏差。

在真核生物(如动植物)的组装中,分离单倍型是常规操作;但在细菌(单倍体) 基因组中,强行使用去冗余工具来“修复”7.2Mb的组装结果,是一把双刃剑。

以下是这种操作可能引入的具体错误和风险,以及背后的科学原因:

1. 误删真实的生物学重复序列(Over-purging / 假阴性缺失)

细菌基因组中天然存在许多真实的重复序列,例如:

  • rRNA 操纵子(通常有 5-10 个高度相似的拷贝)。
  • IS 元件(插入序列)和转座子(在 Acinetobacter 等细菌中非常活跃,经常有多个拷贝)。
  • 基因家族或旁系同源基因(Paralogs)。

风险:如果你使用 bbmap (dedupe.sh) 并设置了一个较高的相似度阈值(例如 minidentity=95),去冗余工具无法区分“组装错误导致的重复”和“基因组天然存在的重复”。它可能会把你基因组中真实存在的、具有重要功能(如耐药性、毒力)的 IS 元件或 rRNA 拷贝当作“冗余的单倍型”直接删除,导致你的最终基因组缺失关键基因

2. 产生“嵌合体”基因组(Chimeric Assembly)

你的 CheckM 结果显示完美的 1:1 重复(所有 marker 基因都在 ‘2’ 列)。这通常意味着两种可能:

  • 情况 A(组装软件的 Bug):样本是纯的,但 SPAdes/Unicycler 因为某些复杂的局部重复或测序偏好性,把同一段序列组装了两遍。
  • 情况 B(样本不纯/混合菌株):你的培养物中混入了两个亲缘关系极近的菌株(Mixed strains)。短读长无法跨越它们之间的 SNP/Indel 差异(即 de Bruijn 图中的 bubbles),所以软件把两套基因组都保留了下来。

风险:如果是情况 B,去冗余工具在决定“保留哪一套、丢弃哪一套”时,可能会在两个菌株的序列之间来回切换。最终你得到的并不是一个真实的单一基因组,而是一个自然界中不存在的“弗兰肯斯坦(Frankenstein)”嵌合体

  • 后果:这种嵌合体会严重影响后续的 SNP calling(产生大量假阳性突变)、进化树构建(Phylogeny)以及耐药基因(AMR)的准确定位。

3. 破坏基因组的连续性(Structural Breaks)

风险:在剥离“副拷贝”的过程中,去冗余算法可能会在原本连续的 contig 上造成人为的断裂。这会导致你的组装结果碎片化(Contig 数量增加,N50 降低),原本可以闭合的环状染色体可能会断裂成多个线性片段。


如何安全地处理 2631 和 2617?(最佳实践)

为了避免引入上述错误,不要盲目直接运行去冗余工具。建议按照以下步骤进行排查和处理:

第一步:回贴原始 Reads,查看覆盖度(Coverage)分布

将你的原始短读长(clean reads)比对回这 7.2Mb 的组装结果上(使用 bwabowtie2),然后统计每个 Contig 的平均覆盖度。

  • 如果所有 Contig 的覆盖度都非常均匀(例如都在 100x 左右):这说明是情况 A(组装软件的 Bug)。基因组是纯的,只是被错误地组装了两遍。此时去冗余是相对安全的,但仍需谨慎。
  • 如果 Contig 的覆盖度呈现明显的两个峰(例如一半是 100x,另一半是 50x):这证实了是情况 B(混合菌株)
  • 如果覆盖度极其混乱:说明可能存在更复杂的污染。

第二步:根据覆盖度结果选择策略

  • 如果是混合菌株(情况 B)
    • 最完美的生物学解决方案:重新划线分离单克隆(Single colony isolation),重新提取 DNA 测序。这是唯一能获得 100% 纯基因组的方法。
    • 计算解决方案:不要使用 dedupe.sh。你应该使用基于覆盖度的分离工具(如 metaSPAdes 的 binning 功能,或者 BlobTools),利用 100x 和 50x 的覆盖度差异,将两个菌株的 Contig 彻底分开,分别得到两个独立的基因组,然后再分别跑 CheckM。
  • 如果是组装 Bug(情况 A)
    • 你可以尝试使用 purge_dups,它比 dedupe.sh 更智能,会结合自比对(self-alignment)和覆盖度来区分“主序列(Primary)”和“副序列(Haplotig)”。
    • 关键操作:在运行 purge_dups 后,必须将提取出的“Primary”序列再次运行 CheckM 和 BUSCO
    • 验收标准:如果 CheckM 显示 Completeness > 95%,Contamination < 5%,并且基因组大小降到了 ~3.5 Mb,说明去冗余成功且没有误删核心基因。如果 Completeness 大幅下降,说明你误删了真实的基因,需要调高 purge_dups 的保守参数重新运行。

第三步:尝试重新组装(替代方案)

有时候,Unicycler 的默认参数对某些特定的 Acinetobacter 菌株不够友好。你可以尝试直接使用 SPAdes 并开启特定的模式:

spades.py --isolate --careful -1 2631__1.fastp.fastq.gz -2 2631__2.fastp.fastq.gz -o spades_2631_careful

--careful 参数会让 SPAdes 在组装后尝试纠正错配和 Indel,有时它内部的 repeat resolution 机制能更好地 collapse 那些被 Unicycler 错误保留的重复区域。

总结

强行去冗余确实有引入嵌合体或误删真实重复序列的风险。对于细菌基因组,“计算修补”永远是下策。最严谨的做法是通过 Read Mapping 确认覆盖度,如果是混合菌株则进行 Binning 分离或重新实验;如果是组装冗余,则使用 purge_dups严格依赖二次 CheckM 结果来验证基因组的完整性。

HUMAnN 通路丰度计算方法详解 (Data_Tam_Metagenomics_2026_Soil)

总体流程

HUMAnN (HMP Unified Metabolic Analysis Network) 是 bioBakery 工作流中用于分析宏基因组功能的核心工具[[12]]。通路丰度的计算是一个多步骤的递归过程:

计算步骤:

  1. 基因家族丰度 → 2. 反应丰度 → 3. 通路丰度

详细计算原理

第1步:基因家族丰度(Gene Family Abundance)

从原始测序 reads 开始:

  • 使用 BLASTX 将 reads 比对到参考数据库(如 UniRef)
  • 根据比对质量、覆盖度、序列长度进行加权
  • 生成 RPK(Reads Per Kilobase)值

公式:

基因丰度 = Σ(比对权重) / 基因长度(kb)

其中每个 read 的总权重为 1.0,根据比对质量分配到多个基因匹配上[[9]]。


第2步:反应丰度(Reaction Abundance)

每个生化反应由一个或多个基因催化:

反应丰度 = Σ(催化该反应的所有基因丰度)

第3步:通路丰度(Pathway Abundance)

这是最关键的一步。通路包含多个反应,反应之间有不同的关系:

核心原则: 通路丰度由”最弱环节”(weakest link)决定[[1]]

计算方法:

  • 串联反应(必须全部存在):使用调和平均数(harmonic mean)
  • 并联反应(可选路径):使用最大值(max)
  • 可选反应:只有当其丰度大于必需反应的调和平均数时才计入[[1]]

最终通路丰度 = 通路中丰度最低的关键反应


具体示例

示例场景:糖酵解通路(Glycolysis)

假设糖酵解通路包含 5 个关键反应(R1-R5):

葡萄糖 → R1 → G6P → R2 → F6P → R3 → ... → 丙酮酸

基因-反应关系:

  • R1: 由基因 GK1 和 GK2 催化(冗余)
  • R2: 由基因 PGI 催化
  • R3: 由基因 PFK 催化
  • R4: 由基因 ALDO 催化
  • R5: 由基因 GAPDH 催化

测序后得到的基因丰度(RPK单位):

GK1:  8.0
GK2:  4.0
PGI:  10.0
PFK:  6.0
ALDO: 7.0
GAPDH: 5.0

计算步骤:

① 计算反应丰度:

R1 = GK1 + GK2 = 8.0 + 4.0 = 12.0  (冗余基因相加)
R2 = PGI = 10.0
R3 = PFK = 6.0
R4 = ALDO = 7.0
R5 = GAPDH = 5.0

② 计算通路丰度: 由于糖酵解是串联反应(所有步骤必须完成),使用”最弱环节”原则:

通路丰度 = min(R1, R2, R3, R4, R5)
          = min(12.0, 10.0, 6.0, 7.0, 5.0)
          = 5.0 RPK

解释: 该样本中糖酵解通路的丰度为 5.0 RPK,意味着”最弱环节”(R5/GAPDH)的覆盖度为 5.0。这表示通路中至少有 5.0 个”完整拷贝”的活性[[1]]。


归一化处理

为什么需要归一化?

原始 RPK 值受测序深度影响,不能直接跨样本比较[[1]]。

示例:

  • 样本 A:总 reads = 1000万,通路丰度 = 5.0 RPK
  • 样本 B:总 reads = 2000万,通路丰度 = 5.0 RPK

虽然都是 5.0 RPK,但样本 A 的相对丰度更高!

归一化方法:

CPM(Counts Per Million)或 RPKM:

归一化丰度 = (原始RPK / 总RPK) × 1,000,000

HUMAnN 输出的 pathabundance_relab.tsv 文件已经是归一化后的相对丰度[[11]]。


输出文件说明

HUMAnN 生成两个关键文件:

文件 含义 取值范围
pathcoverage.tsv 通路覆盖度(存在/缺失概率) 0-1
pathabundance.tsv 通路丰度(相对拷贝数) 0 及以上

覆盖度 vs 丰度:

  • 覆盖度 = 0.8:表示 80% 置信度该通路存在
  • 丰度 = 5.0:表示该通路约有 5 个完整拷贝(在归一化前)[[9]]

分层分析(Stratification)

HUMAnN 还能计算每个物种对通路的贡献:

总通路丰度 = 物种A贡献 + 物种B贡献 + 未分类贡献

例如:

糖酵解通路总丰度: 5.0 RPK
├─ Escherichia coli:  3.0 RPK (60%)
├─ Bacteroides fragilis: 1.5 RPK (30%)
└─ Unclassified: 0.5 RPK (10%)

这帮助您理解哪些微生物在驱动特定功能[[21]]。


关键要点总结

通路丰度 = 最弱关键反应的丰度(木桶原理)
单位:RPK(归一化前)或 CPM/RPKM(归一化后)
跨样本比较:必须使用归一化值(pathabundance_relab.tsv)[[15]]
解释:丰度 5.0 表示通路中最弱环节有 5.0 RPK 的覆盖度
准确性:考虑了基因冗余、反应关系、可选步骤等生物学复杂性


参考文献

  1. bioBakery Forum – Pathway abundance calculation [[1]]
  2. HUMAnN SOP – HMP Data Coordination Center [[9]]
  3. HUMAnN3 Documentation – Huttenhower Lab [[12]]
  4. bioBakery 3 publication (Nature Methods, 2021) [[13]]


HUMAnN Pathway Abundance 计算方法

基本原理

HUMAnN (HMP Unified Metabolic Analysis Network) 是 biobakery 工具集的一部分,用于从宏基因组数据中计算功能通路的丰度。

计算步骤

1. 基因家族定量 (Gene Family Quantification)

  • 首先将测序 reads 比对到 UniRef 基因家族数据库
  • 计算每个基因家族的丰度(reads per kilobase, RPK)

2. 通路映射 (Pathway Mapping)

  • 将基因家族映射到 MetaCyc 通路
  • 一个通路通常包含多个反应步骤,每个步骤可能由多个基因家族催化

3. 通路丰度计算 (Pathway Abundance Calculation)

关键公式:

通路丰度 = min(该通路中所有反应步骤的丰度)

其中:

  • 每个反应步骤的丰度 = 该步骤中所有基因家族丰度的总和
  • 通路的最终丰度 = 所有反应步骤丰度的最小值(瓶颈原则)

具体示例

假设情况

假设有一个简单的代谢通路 “Glycolysis”(糖酵解),包含 3 个反应步骤:

反应步骤 1:葡萄糖 → 葡萄糖-6-磷酸

  • 由基因家族 UniRef90_A 和 UniRef90_B 催化
  • UniRef90_A 丰度 = 100 RPK
  • UniRef90_B 丰度 = 50 RPK
  • 步骤 1 丰度 = 100 + 50 = 150 RPK

反应步骤 2:葡萄糖-6-磷酸 → 果糖-6-磷酸

  • 由基因家族 UniRef90_C 催化
  • UniRef90_C 丰度 = 80 RPK
  • 步骤 2 丰度 = 80 RPK

反应步骤 3:果糖-6-磷酸 → 果糖-1,6-二磷酸

  • 由基因家族 UniRef90_D 和 UniRef90_E 催化
  • UniRef90_D 丰度 = 200 RPK
  • UniRef90_E 丰度 = 120 RPK
  • 步骤 3 丰度 = 200 + 120 = 320 RPK

通路丰度计算

Glycolysis 通路丰度 = min(步骤1, 步骤2, 步骤3)
                    = min(150, 80, 320)
                    = 80 RPK

为什么用最小值?

  • 这遵循”木桶原理”(瓶颈效应)
  • 通路的整体通量受限于最慢/最少的反应步骤
  • 步骤 2 只有 80 RPK,即使其他步骤丰度很高,整个通路的最大通量也只能是 80

归一化 (Normalization)

HUMAnN 还会进行归一化处理:

1. RPK (Reads Per Kilobase)

RPK = (比对到基因的 reads 数) / (基因长度 kb)

2. RPKM/CPM (可选)

RPKM = RPK / (总百万 reads)

3. 相对丰度 (Relative Abundance)

通路相对丰度 = (通路 RPKM) / (所有通路 RPKM 总和) × 1,000,000

单位:CPM (Counts Per Million)


您的数据示例

从您的输出看:

otu_table()   OTU Table:         [ 1097 taxa and 2 samples ]

这 1097 个 “taxa” 实际上是 1097 个 MetaCyc 通路,每个通路的丰度值是经过以下处理的:

  1. 原始丰度:基于上述最小值原则计算
  2. 归一化:转换为相对丰度(sum = 1 或 100%)
  3. 输出文件pathabundance_relab.tsv 中的值就是相对丰度

关键要点

特点 说明
计算方法 取通路中所有反应步骤的最小丰度
单位 通常是相对丰度(0-1 或 0-100%)
生物学意义 反映通路的潜在代谢能力
优势 考虑了通路的完整性,不是简单加和
局限性 无法区分活跃/非活跃通路(需要转录组验证)

注意事项

⚠️ 重要提醒

  • 通路丰度反映的是基因潜力(gene potential),不是实际代谢活性
  • 一个通路存在 ≠ 该通路正在被使用
  • 需要结合转录组(RNA-seq)或代谢组数据才能确定实际活性
  • 对于您的 n=1 样本,只能做描述性比较,无法统计推断

STEP 5 — Descriptive visualisations (appropriate for n = 1 per group) for Data_Tam_Metagenomics_2026_Soil

Updated Step 5: PNG Figures + Complete Excel Exports

Prerequisites Update

Add openxlsx to your package list (for Excel export):

# Install if needed
install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr",
                   "tidyr", "pheatmap", "openxlsx"))

And load it:

library(openxlsx)

Complete Step 5 — Replace your existing Step 5 with this

# =============================================================================
# STEP 5 — Visualisations (PNG) + Complete Excel exports
# =============================================================================

dir.create("figures", showWarnings = FALSE)
dir.create("tables",  showWarnings = FALSE)

# Helper: safe log2 fold change (adds pseudocount to avoid log(0))
safe_log2fc <- function(x, y, pseudo = 1e-6) {
  log2((x + pseudo) / (y + pseudo))
}

# =============================================================================
# 5a. Top-N species stacked bar plot (Loc1 vs Loc4)
# =============================================================================

top_n_sp <- 20
top_species <- names(sort(rowMeans(otu_table(species_ps)),
                          decreasing = TRUE))[1:top_n_sp]

ps_top <- prune_taxa(top_species, species_ps)

df_species <- psmelt(ps_top) %>%
  mutate(Species = factor(Species, levels = rev(top_species)))

p_species <- ggplot(df_species, aes(x = Location, y = Abundance, fill = Species)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "D") +
  labs(title = paste("Top", top_n_sp, "Species by Location"),
       x = "Location", y = "Relative Abundance", fill = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/species_top20_barplot.png", p_species,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/species_top20_barplot.png\n")

# =============================================================================
# 5b. Species heatmap (all species, row-scaled)
# =============================================================================

otu_mat <- as.matrix(otu_table(species_ps))

# Filter species with very low abundance (max < 0.01% across both samples)
keep <- apply(otu_mat, 1, max) > 0.0001
otu_filt <- otu_mat[keep, , drop = FALSE]

# Annotation column
ann_col <- data.frame(Location = metadata[colnames(otu_filt), "Location"],
                      row.names = colnames(otu_filt))

# Write heatmap to PNG
png("figures/species_heatmap.png",
    width = 10, height = max(8, 0.25 * nrow(otu_filt) + 2),
    units = "in", res = 300)

pheatmap(otu_filt,
         scale         = "row",
         clustering_distance_rows = "euclidean",
         clustering_method        = "complete",
         annotation_col = ann_col,
         main           = "Species Abundance Heatmap (row-scaled)",
         fontsize_row   = 6,
         fontsize_col   = 10,
         show_rownames  = nrow(otu_filt) <= 80)

dev.off()
cat("✅ Saved: figures/species_heatmap.png\n")

# =============================================================================
# 5c. Top pathways stacked bar plot
# =============================================================================

top_n_pw <- 20
top_pw_names <- names(sort(rowMeans(otu_table(pathway_ps)),
                           decreasing = TRUE))[1:top_n_pw]

ps_pw_top <- prune_taxa(top_pw_names, pathway_ps)

df_pw <- psmelt(ps_pw_top) %>%
  mutate(Pathway = factor(Pathway, levels = rev(top_pw_names)))

p_pw <- ggplot(df_pw, aes(x = Location, y = Abundance, fill = Pathway)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  labs(title = paste("Top", top_n_pw, "HUMAnN Pathways by Location"),
       x = "Location", y = "Relative Abundance", fill = "Pathway") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/pathways_top20_barplot.png", p_pw,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathways_top20_barplot.png\n")

# =============================================================================
# 5d. Pathway dot plot (Loc1 vs Loc4)
# =============================================================================

df_pw_wide <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway") %>%
  filter(Pathway %in% top_pw_names) %>%
  pivot_longer(-Pathway, names_to = "Sample", values_to = "Abundance") %>%
  left_join(data.frame(Sample = rownames(metadata_clean),
                       Location = metadata_clean$Location,
                       stringsAsFactors = FALSE),
            by = "Sample")

p_dot <- ggplot(df_pw_wide, aes(x = Location, y = Pathway,
                                 size = Abundance, color = Abundance)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Pathway Abundance: Loc1 vs Loc4",
       x = "Location", y = "Pathway") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7))

ggsave("figures/pathway_dotplot.png", p_dot,
       width = 8, height = 10, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathway_dotplot.png\n")

# =============================================================================
# 5e. COMPLETE species list → Excel
# =============================================================================

# Build full species table (ALL species, no cutoff)
sp_full <- as.data.frame(otu_table(species_ps)) %>%
  rownames_to_column("Species")

# Ensure columns exist (defensive)
if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(sp_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in species OTU table.")
}

sp_full <- sp_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Species,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for species
sp_wb <- createWorkbook()
addWorksheet(sp_wb, "All_Species")
addWorksheet(sp_wb, "Top50_by_Diff")
addWorksheet(sp_wb, "Top50_by_Abundance")
addWorksheet(sp_wb, "Loc1_only")
addWorksheet(sp_wb, "Loc4_only")
addWorksheet(sp_wb, "Shared")

writeData(sp_wb, "All_Species",       sp_full)
writeData(sp_wb, "Top50_by_Diff",     head(sp_full, 50))
writeData(sp_wb, "Top50_by_Abundance",
          sp_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(sp_wb, "Loc1_only",
          sp_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(sp_wb, "Loc4_only",
          sp_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(sp_wb, "Shared",
          sp_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Formatting
header_style <- createStyle(textDecoration = "bold", bgFill = "#D3D3D3")
for (sh in c("All_Species","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(sp_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(sp_wb, sh, cols = 1:ncol(sp_full), widths = "auto")
  freezePane(sp_wb, sh, firstRow = TRUE)
}

saveWorkbook(sp_wb, "tables/species_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/species_Loc1_vs_Loc4.xlsx  (%d species total)\n",
            nrow(sp_full)))

# =============================================================================
# 5f. COMPLETE pathway list → Excel
# =============================================================================

pw_full <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway")

if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(pw_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in pathway OTU table.")
}

pw_full <- pw_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Pathway,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for pathways
pw_wb <- createWorkbook()
addWorksheet(pw_wb, "All_Pathways")
addWorksheet(pw_wb, "Top50_by_Diff")
addWorksheet(pw_wb, "Top50_by_Abundance")
addWorksheet(pw_wb, "Loc1_only")
addWorksheet(pw_wb, "Loc4_only")
addWorksheet(pw_wb, "Shared")

writeData(pw_wb, "All_Pathways",       pw_full)
writeData(pw_wb, "Top50_by_Diff",      head(pw_full, 50))
writeData(pw_wb, "Top50_by_Abundance",
          pw_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(pw_wb, "Loc1_only",
          pw_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(pw_wb, "Loc4_only",
          pw_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(pw_wb, "Shared",
          pw_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

for (sh in c("All_Pathways","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(pw_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(pw_wb, sh, cols = 1:ncol(pw_full), widths = "auto")
  freezePane(pw_wb, sh, firstRow = TRUE)
}

saveWorkbook(pw_wb, "tables/pathways_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways total)\n",
            nrow(pw_full)))

# =============================================================================
# Summary
# =============================================================================

cat("\n========================================\n")
cat("STEP 5 COMPLETE\n")
cat("========================================\n")
cat("Figures (PNG, 300 dpi):\n")
cat("  • figures/species_top20_barplot.png\n")
cat("  • figures/species_heatmap.png\n")
cat("  • figures/pathways_top20_barplot.png\n")
cat("  • figures/pathway_dotplot.png\n")
cat("\nExcel tables (complete lists, no cutoff):\n")
cat(sprintf("  • tables/species_Loc1_vs_Loc4.xlsx   (%d species)\n", nrow(sp_full)))
cat(sprintf("  • tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways)\n", nrow(pw_full)))
cat("========================================\n")

What You Get

📊 Figures (PNG, 300 dpi, publication-ready)

File Content Size
species_top20_barplot.png Top 20 species stacked bar 10 × 8 in
species_heatmap.png All species above 0.01% threshold, row-scaled 10 × auto (scales with # species)
pathways_top20_barplot.png Top 20 pathways stacked bar 10 × 8 in
pathway_dotplot.png Top 20 pathways dot plot 8 × 10 in

📑 Excel Files (complete lists, no cutoff)

Each workbook contains 6 sheets:

Sheet Content
All_Species / All_Pathways Every detected feature, sorted by absolute difference
Top50_by_Diff Top 50 features by Loc4 − Loc1
Top50_by_Abundance Top 50 features by mean abundance
Loc1_only Features detected only in Loc1
Loc4_only Features detected only in Loc4
Shared Features detected in both locations

📋 Columns in each Excel file

Column Meaning
Species / Pathway Feature name
Abundance_Loc1 Raw relative abundance in Loc1
Abundance_Loc4 Raw relative abundance in Loc4
Diff_Loc4_minus_Loc1 Absolute difference (Loc4 − Loc1)
Log2FC_Loc4_vs_Loc1 Log₂ fold change (with pseudocount 1e-6 to handle zeros)
Present_in_Loc1 TRUE/FALSE
Present_in_Loc4 TRUE/FALSE
Total_Abundance Sum across both samples
Mean_Abundance Mean across both samples

Notes

  1. Log₂ fold change: Uses a small pseudocount (1e-6) to avoid log(0). For features absent in one location, this gives a large but finite fold change — interpret these as “presence/absence” rather than true fold change.

  2. Heatmap filtering: Species with maximum abundance < 0.01% across both samples are excluded to keep the heatmap readable. Adjust the threshold in keep <- apply(otu_mat, 1, max) > 0.0001 if needed.

  3. Excel formatting: Headers are bold with grey background, columns auto-sized, and the first row is frozen for easy scrolling.

  4. File locations: All outputs go into figures/ and tables/ subfolders inside your current working directory (reports/).

医院废水处理工艺显著降低了废水中“副血链球菌”(*Streptococcus parasanguinis*)的相对丰度

这个统计结果表明:在排除了不同采样时间(季节/月份)带来的自然波动影响后,医院废水处理工艺显著降低了废水中“副血链球菌”(Streptococcus parasanguinis)的相对丰度。

下面为您逐列详细解释这些统计指标的含义:

1. 字段详细解释

字段名 结果值 含义解释
feature Streptococcus.parasanguinis 特征名称:即发生显著变化的物种名称,中文通常译为“副血链球菌”。这是一种常见于人类口腔和肠道的细菌,属于机会致病菌。
value Post 比较的组别:因为我们在 MaAsLin2 中设置 Pre(处理前)为参考组(Reference),这里的 Post 表示这是 “处理后”相对于“处理前” 的比较结果。
coef -0.0006712921 效应系数(Coefficient):代表处理后该物种相对丰度的变化量。负值(-)表示丰度下降。因为您的输入数据是相对丰度(0~1之间),这意味着经过处理后,该菌的相对丰度比处理前降低了约 0.067%
pval 0.0001454243 原始 P 值(P-value):未经多重检验校正的显著性水平。这个值远小于 0.05(甚至小于 0.001),说明在单变量统计检验中,处理前后的差异是极其显著的。
qval 0.01915278 校正后的 Q 值(FDR):即错误发现率(False Discovery Rate)。因为宏基因组数据同时检验了成百上千个物种,必须进行多重检验校正(如 Benjamini-Hochberg 方法)以防止假阳性。Q值 < 0.05,说明即使经过了严格校正,这个差异依然是统计学显著的

2. 结合您的实验设计的深度解读

在您的实验设计中,MaAsLin2 模型同时纳入了 Treatment(处理前/后)和 TimePoint(11月、1月、3月、5月)两个变量。

  • 控制时间变量(TimePoint):废水中的微生物群落会随着季节、气温等时间因素自然变化。模型将 TimePoint 作为协变量(Covariate)剔除,意味着这个 -0.00067 的下降纯粹是由废水处理工艺(Pre vs Post)引起的,而不是因为采样月份不同造成的。
  • 统计学稳健性:原始 P 值(0.00014)和 Q 值(0.019)都非常小,说明这个结果非常稳健,不是偶然产生的假阳性。

3. 生物学与环境意义

  • 副血链球菌(Streptococcus parasanguinis 通常是人类口腔、呼吸道的正常菌群,但也可能引起心内膜炎等机会性感染。在医院废水中检测到它,说明它可能来源于医院的医疗排放或人类排泄物。
  • 环境意义:该菌在处理后(Post)显著减少,说明您的医院废水处理系统能够有效地截留、降解或灭活这种条件致病菌,降低了其排放到自然环境中带来的潜在生物安全风险。这是一个非常积极的工程效果!


Great question! The difference in min_abundance thresholds between species (0.001) and pathways (0.0001) is intentional and based on the fundamental differences in how MetaPhlAn and HUMAnN data are distributed:

1. Data Distribution Differences

Species data (MetaPhlAn):

  • Community composition is typically dominated by a few highly abundant species
  • Most species have relative abundances well above 0.1% (0.001)
  • A threshold of 0.001 (0.1%) effectively filters out “noise” while retaining biologically meaningful species

Pathway data (HUMAnN):

  • Functional profiles are much more evenly distributed across thousands of pathways (you have 15,785 pathways!)
  • Many biologically important pathways exist at very low relative abundances (e.g., antibiotic resistance, xenobiotic degradation, rare metabolic functions)
  • A higher threshold would eliminate too many pathways, potentially losing critical functional signals

2. Why 0.0001 (0.01%) for Pathways?

  • Captures the “long tail” of functional diversity: Many critical metabolic pathways operate at low abundance but are essential for ecosystem function
  • Balances sensitivity and specificity: Combined with min_prevalence = 0.1 (the pathway must appear in at least 10% of samples), this ensures we only test features that are consistently detected, even if at low levels
  • Standard practice in metagenomics: HUMAnN pathway analyses commonly use lower abundance thresholds because functional redundancy means even rare pathways can be important

3. You Can Adjust This

If you want to be more conservative (fewer pathways tested, faster computation, less multiple testing correction burden), you could increase it:

min_abundance = 0.001  # Same as species, 0.1% threshold

Or if you want to be even more sensitive:

min_abundance = 0.00001  # 0.001% threshold

4. Practical Recommendation

Given your dataset has only 11 samples, I’d actually recommend increasing the pathway threshold to match species (min_abundance = 0.001) because:

  • With small sample sizes, testing too many low-abundance features increases the multiple testing burden
  • It makes the Q-value correction more stringent
  • You’ll get more robust, interpretable results

Would you like me to update the code to use min_abundance = 0.001 for pathways as well?

Can differentiated organoid flip? 小鼠肠道类器官的分化,用于耶尔森氏菌(Y. enterocolitica)感染研究,采用双极性构型。

结合你提供的详细背景信息,幻灯片上提出的 “Can differentiated organoid flip?”(已分化的类器官能否翻转?) 是一个非常核心且切中实验痛点的问题。

在中文学术语境中,这里的 “Flip(翻转)”绝对不是指像翻煎饼一样把培养皿里的球体物理翻个面,而是指 “细胞极性的反转(Apical-Basal Polarity Reversal)”“培养维度的转换(从3D转为2D单层)”

结合你的背景信息,我为你深度解析这个问题到底在问什么,以及为什么它对你的研究至关重要:


一、 结合背景信息:为什么要探讨“Flip(翻转)”?

回顾你背景信息中提到的 “双极性构型(two-polarity conformations)”“耶尔森氏菌(Y. enterocolitica)感染”

  1. 体内的真实生理状态: 在真实肠道中,耶尔森氏菌等致病菌是存在于肠腔中的。因此,细菌入侵时,首先接触的是肠上皮细胞的管腔面(顶侧,Apical side,长有微绒毛和黏液层)
  2. 常规3D类器官的致命痛点: 在Matrigel(基质胶)中常规培养的3D肠类器官,其极性是“内向”的:顶侧朝内(朝向中央管腔),基底侧朝外(接触外面的培养基)。如果你把细菌加在培养基里,细菌只能接触到细胞的基底侧。这完全违背了生理状态,无法真实模拟细菌从肠腔入侵的过程。
  3. “Flip”的目的: 为了让实验符合生理状态,必须把类器官“Flip(翻转)”,让顶侧(Apical)朝外,直接暴露给含有细菌的培养基。这就是幻灯片探讨这个问题的核心动机。

二、 科学解答:已分化的类器官能否“Flip”?

答案是:能,但需要特定的技术手段。

当幻灯片问“已分化的类器官能否翻转”时,它实际上在探讨一个细胞生物学难题:类器官已经分化出了成熟的吸收细胞、杯状细胞、潘氏细胞等,这种高度分化的状态能否被打破并重新排列,以适应新的极性?

在实验中,科学家通常通过以下两种方式实现“Flip”:

1. 3D 转 2D 单层培养(最常用、最彻底的 Flip)

这是目前解决极性问题的“黄金标准”。

  • 操作方法:将3D胶里已经分化的肠类器官消化成单细胞或小细胞团,然后接种到 Transwell(半透膜小室) 上。
  • Flip 结果:细胞会在膜上重新生长,形成你背景信息中提到的 “单层柱状上皮”。此时,细胞的基底侧自然贴在膜上(朝下接触底部培养基),而顶侧(长满微绒毛) 自然朝上(暴露在顶部空间)。
  • 优势:完美实现了极性翻转。细菌加在顶部,直接感染顶侧;底部培养基可以模拟血液侧,加入免疫细胞或药物。

2. 微流控芯片(Gut-on-a-chip)

  • 这是 Transwell 的高级进阶版。在微流控通道中培养单层细胞,一侧通道模拟肠腔(加细菌),另一侧通道模拟血管(加流体剪切力和免疫细胞)。这也是一种高级的“Flip”应用,能更好地模拟体内动态环境。

3. 3D 囊泡的直接诱导反转(难度极高)

  • 通过特殊的离心力、局部酶解基质胶、或使用特定的小分子化合物,强行让3D球体的细胞重新排列,使顶侧翻到外面。但这种方法很难保证所有细胞极性一致,且容易破坏类器官结构,目前较少用于感染实验。

三、 幻灯片提出这个问题的深层科学考量

当讲者提出“Can differentiated organoid flip?”时,他/她实际上是在引导听众思考以下三个关键科学问题(你可以用来丰富你的理解或提问):

  1. 分化状态的可塑性(Plasticity): 类器官在3D胶里已经分化成熟了,当我们把它打散重铺成2D单层(Flip)时,这些细胞还能不能重新建立正确的顶底极性?还能不能保留原有的细胞类型(如杯状细胞还能不能分泌黏液)?
  2. 屏障功能的完整性(Barrier Integrity): 在翻转和重新铺板的过程中,细胞间的紧密连接(Tight junctions) 会不会受损?如果屏障漏了,细菌就不需要“入侵”细胞,而是直接从细胞缝隙漏过去了,这就失去了研究耶尔森氏菌侵入机制的意义。
  3. 微绒毛与黏液层的保留: 耶尔森氏菌感染需要穿透肠道黏液层。Flip 之后,顶侧朝外,杯状细胞分泌的黏液层和吸收细胞的微绒毛还能不能正常形成并发挥功能?

四、 一句话总结

结合你的背景信息,幻灯片上 “Can differentiated organoid flip?” 的准确中文解释是:

“已经分化成熟的肠类器官,能否通过改变培养方式(如从3D转为2D单层),实现细胞极性的反转(让顶侧朝外),从而克服常规3D培养中顶侧朝内的缺陷,完美模拟耶尔森氏菌从肠腔入侵的真实生理过程?”

结论是:能。 通过 Transwell 或微流控芯片等技术,将 3D 类器官转化为 2D 单层,是实现极性翻转(Flip)并研究肠道病原体感染的最佳策略。



杯状细胞 的英语是:

Goblet cell

(复数形式:Goblet cells)


补充说明:

  • 名称由来:因细胞形态像高脚杯(goblet)而得名
  • 主要功能:分泌黏液(mucin),形成黏液层保护上皮
  • 分布位置:肠道上皮、呼吸道上皮、结膜等


如果你说的 “flip” 是指类器官(organoid)在已经分化(differentiated)之后发生极性翻转(polarity flip,inside-out flipping),那么答案是:

可以,在一定条件下可以发生。

什么是“flip”?

许多上皮来源的类器官(如肠道类器官、气道类器官、肾脏类器官)具有明确的顶端-基底极性:

  • Apical(顶端面):通常朝向类器官内部腔体(lumen)
  • Basal(基底面):通常朝向外部基质(如 Matrigel)

正常情况下结构如下:

外部基质
    ↓
[Basal面]
  细胞层
[Apical面]
    ↓
   Lumen

当研究人员去除基质、改变培养条件或进行机械处理后,类器官可能重组为:

外部培养液
    ↓
[Apical面]
  细胞层
[Basal面]
    ↓
  内部

这就是常说的 inside-out organoidpolarity reversal(极性翻转)

已经分化后还能翻转吗?

通常可以

原因是:

  1. 分化状态和极性状态不是同一个概念

    • 细胞已经分化为吸收细胞、杯状细胞、纤毛细胞等,
    • 但其空间排列和极性仍具有一定可塑性。
  2. 细胞会感知外部ECM(细胞外基质)

    • Integrin等受体持续感知基质信号。
    • 当ECM环境改变时,细胞可能重新建立极性。
  3. 许多实验已经实现

    • 肠道类器官
    • 肺/气道类器官
    • 肾脏类器官

都报道过在成熟后发生极性反转。

会不会失去分化特征?

不一定。

很多情况下:

  • 分化标志物仍然存在;
  • 细胞类型组成变化不大;
  • 只是顶端和基底面的朝向改变。

但如果培养条件改变过大或时间过长:

  • 部分细胞可能去分化(dedifferentiation);
  • 分化谱系比例发生变化;
  • 类器官结构重塑。

如果你说的“flip”是另一种意思

有时研究者也用“flip”指:

  1. 类器官物理翻转(turn over)
  2. 极性翻转(polarity flip)
  3. 分化细胞重新回到干细胞状态(fate reversal)

其中:

  • 极性翻转 → 经常可以发生;
  • 完全命运逆转(differentiated → stem-like)→ 也可能发生,但机制不同,通常称为去分化(dedifferentiation)美式英语:/diˌdɪfəˌrɛnʃiˈeɪʃən/ ,而不是“flip”。

如果你是在阅读某篇论文或实验方案,可以把原句贴出来,我可以根据具体上下文解释这里的 “flip” 到底指哪一种现象。



一、Organoid(类器官)—— 深度解析

1. 定义

类器官是指利用成体干细胞(ASC)多能干细胞(PSC,包括胚胎干细胞和诱导多能干细胞),在体外三维(3D)培养体系中,通过自组织(self-organization)和定向分化,形成的微型三维组织样结构。它能够再现来源器官的部分细胞类型、空间架构和生理功能

2. 核心特征(四点)

特征 说明
三维结构 不同于传统的二维(2D)单层细胞培养,类器官具有立体的细胞极性和细胞间连接
多细胞类型 包含多种分化细胞,而非单一细胞系
自组织能力 细胞按照内在的发育程序自行排列成类似体内的结构,无需外部支架的精细引导
功能性 能够执行部分器官特有功能,如肠类器官的吸收/分泌、脑类器官的神经电活动

3. 来源分类

  • 成体干细胞来源(ASC-derived):取自组织中的成体干细胞(如肠道隐窝干细胞),分化潜能有限(只能形成该组织的细胞类型),但成熟度高、更接近体内真实状态。
  • 多能干细胞来源(PSC-derived):取自胚胎干细胞或iPS细胞,分化潜能大,可模拟胚胎发育过程,但成熟度相对较低。

4. 应用领域

  • 疾病建模(感染、肿瘤、遗传病)
  • 药物筛选与毒性测试
  • 再生医学(移植修复)
  • 发育生物学研究

二、Enteroid(肠类器官)—— 深度解析

1. 定义

肠类器官是类器官的一种特定亚型,特指来源于肠道上皮干细胞(主要是小肠或结肠的Lgr5⁺隐窝基底柱状细胞),在三维培养中形成的、包含所有肠道上皮主要细胞类型的微型肠道组织。

2. 肠类器官包含的细胞类型(核心)

细胞类型 功能
肠上皮吸收细胞(Enterocyte) 营养吸收、离子转运
杯状细胞(Goblet cell) 分泌黏液,保护上皮屏障
潘氏细胞(Paneth cell) 分泌抗菌肽(如溶菌酶),调节干细胞微环境
肠内分泌细胞(Enteroendocrine cell) 分泌激素(如GLP-1、5-羟色胺),调控消化和食欲
干细胞(Lgr5⁺ CBC细胞) 自我更新,分化补充其他细胞类型
簇细胞(Tuft cell) 参与免疫感知和寄生虫防御

3. 结构特征

  • 肠类器官在Matrigel(基底膜基质)中培养时,会形成极性的囊泡状结构
    • 管腔面(Apical side):面向内部空腔,带有微绒毛,对应肠道的内腔(接触食物和菌群的一面)
    • 基底侧面(Basolateral side):朝向外侧,对应肠道的基底膜侧(接触血管和结缔组织的一面)
  • 这种极性正是你原句中“two-polarity conformations(双极性构型)”所指的内容。

4. 与体内肠道的对应关系

  • 肠类器官的隐窝-绒毛轴在体外被简化为出芽结构(buds)——出芽部分富含干细胞和潘氏细胞(对应隐窝),中央囊泡部分富含分化细胞(对应绒毛区)。

三、Organoid vs Enteroid —— 更详细的对比表

对比维度 Organoid(类器官) Enteroid(肠类器官)
中文译名 类器官 肠类器官(或称肠道类器官)
层级关系 上位概念(总称) 下位概念(子类)
组织来源 可以是脑、肝、肾、肺、肠、胰腺、视网膜等任何器官 仅限于小肠或结肠
干细胞类型 ASC 或 PSC 均可 通常为 ASC(肠道隐窝干细胞)
分化方向 定向分化为对应器官的细胞谱系 定向分化为肠道上皮各细胞类型
培养条件 因器官而异,需不同的细胞因子组合(如脑需Noggin,肝需HGF等) 通常需EGF、Noggin、R-spondin、Wnt3a等(经典“肠类器官培养基”)
结构特点 因器官而异(如脑类器官呈脑区样结构,肝类器官呈肝板样) 呈囊泡状+出芽结构,具有肠上皮特有的隐窝-绒毛极性
典型应用 广泛(神经退行性疾病、肝癌、肾发育、肺纤维化等) 主要集中于肠道感染(如沙门氏菌、耶尔森氏菌)、炎症性肠病(IBD)、结直肠癌、营养代谢

四、为什么你的句子用 Enteroid 而不是 Organoid —— 深入解释

你的原句:

“Differentiation of murine enteroids for Y. enterocolitica infection in two-polarity conformations”

  • 研究对象耶尔森氏菌(Yersinia enterocolitica——这是一种肠道致病菌,主要通过侵入肠道上皮引起感染。
  • 要研究这类细菌,你需要一个能模拟肠道上皮屏障、黏液层、极性运输和细菌侵入机制的模型。
  • 肠类器官(enteroid)正好提供了:
    • 完整的肠道上皮细胞多样性
    • 生理性的顶-底极性(细菌从顶侧侵入,免疫信号从底侧传递)
    • 与体内更接近的微环境

如果用 organoid(泛称),就太笼统了,别人不知道你研究的是肠、胃还是其他器官的类器官。因此,enteroid 是此处最精确、最专业的术语。


五、延伸概念:Colonoid 和 Gastroid

为了帮你更清楚理解 enteroid 的范围,再补充两个相关术语:

术语 中文 来源 说明
Enteroid 肠类器官 小肠或结肠(广义) 有时狭义特指小肠来源
Colonoid 结肠类器官 结肠(大肠) 是 enteroid 的下位细分
Gastroid 胃类器官 胃组织 与 enteroid 同属消化道类器官但不是同一类

在严格学术语境中,enteroid 有时只指小肠来源,而colonoid结肠来源,但日常文献中常混用,统称 intestinal organoids(肠道类器官),而 enteroid 则是更专业的叫法。


六、补充:关于“two-polarity conformations(双极性构型)”

这一点与你问的概念密切相关,我额外解释一下:

  • 在体内,肠道上皮是单层柱状上皮,具有明确的顶底极性(apical-basal polarity)

    • 顶侧(Apical):朝向肠腔,接触食物、菌群、病原体
    • 底侧(Basolateral):朝向基底膜,接触固有层、血管、免疫细胞
  • 在体外肠类器官培养中:

    • 常规构型:类器官的顶侧朝向管腔内部(封闭腔),底侧朝外(接触培养基)。这种构型下,细菌难以直接接触到顶侧(因为顶侧在腔里面),研究感染时需要显微注射或破坏类器官。
    • 反转构型(反向极性):通过特殊培养方法(如去掉基质胶或使用transwell),使顶侧朝外建立开放顶面的单层,这样病原体可以直接从培养液接触顶侧,更便于感染实验。

你的句子中 “two-polarity conformations” 就是指这两种不同的极性排列方式(常规封闭腔 vs 开放/反转极性),用于比较哪种更适于研究耶尔森氏菌的侵入机制。


Further topics:

  • 肠类器官的具体培养方法(培养基成分、Matrigel操作)
  • “反转极性”如何实现(具体实验方案)
  • 耶尔森氏菌感染肠类器官的研究案例


这是一个非常关键且容易混淆的问题。直接给你结论:

隐窝(Crypt)既不是管腔面(Apical side),也不是基底侧面(Basolateral side)。

为了让你彻底明白,我从空间位置细胞极性两个维度来拆解:


一、从空间位置来看(解剖学视角)

在真实的肠道组织中,隐窝和绒毛是肠壁上的两种不同地形结构

  • 绒毛(Villus):像手指一样向上突起,伸向肠道内部(管腔)。
  • 隐窝(Crypt):像小口袋一样向下凹陷,嵌入肠壁深层的固有层中。

所以,隐窝是一个“凹陷的坑”,绒毛是一个“凸起的包”。它们之间的关系是“高低起伏”的地形关系,而不是“内外表面”的关系。


二、从细胞极性来看(细胞生物学视角)

关键是:无论是隐窝里的细胞,还是绒毛上的细胞,它们都是单层上皮细胞,每一颗细胞都有自己的“管腔面”和“基底侧面”。

细胞部位 管腔面(Apical,顶侧) 基底侧面(Basolateral,底侧)
绒毛上的细胞 朝向肠腔内部(接触食物、菌群) 朝向绒毛内部的固有层(接触血管)
隐窝里的细胞 朝向隐窝的管腔(即那个凹陷的空腔) 朝向隐窝外周的基底膜和固有层

也就是说:

隐窝本身是一个“下陷的管腔空间”。隐窝里的干细胞和潘氏细胞,它们的顶膜(Apical)是面向隐窝那个空洞的,而基底膜(Basolateral)是面向隐窝外壁的组织间隙的。


三、用一张图来理解(文字版)

        ═══════════════════════════════   ← 肠腔(真正的管腔,食物通过的地方)
        ↑  绒毛(Villus)            ↑
        |  细胞顶面(Apical)朝向肠腔   |
        |  细胞底面(Basolateral)朝内   |
        ═══════════════════════════════   ← 基底膜
                 ↓↓↓ 向下凹陷 ↓↓↓
        ═══════════════════════════════   ← 隐窝开口
        |  隐窝(Crypt)              |
        |  细胞顶面(Apical)朝向隐窝空腔 |
        |  细胞底面(Basolateral)朝向外周 |
        ═══════════════════════════════   ← 基底膜

四、回答你问题的核心

  • 问:隐窝是管腔面吗?

    • 不是。 隐窝是一个结构区域(凹陷区),而管腔面(Apical)是细胞的一个面(朝向空腔的那一面)。隐窝里的细胞自己的管腔面(朝向隐窝空洞),但隐窝本身不等于管腔面。
  • 问:隐窝是基底侧面吗?

    • 不是。 基底侧面(Basolateral)是细胞接触基底膜和细胞间质的那一面。隐窝里的细胞也有基底侧面(朝向隐窝外周的基底膜),但隐窝本身不等于基底侧面。

五、在肠类器官(Enteroid)中对应什么?

在体外3D培养的肠类器官中:

体内结构 类器官中的对应结构
绒毛(Villus) 类器官的中央囊泡(中央腔体),上皮较薄,含分化细胞
隐窝(Crypt) 类器官的出芽(Bud),向外凸起的芽状结构,富含Lgr5⁺干细胞和潘氏细胞

注意:这里的“出芽”是向外凸起的,而不是向下凹陷的,这是因为类器官的极性相对于基质胶是“内外颠倒”的(顶面朝内腔,基底面朝外)。


六、一句话总结

隐窝是一个“结构位置”(凹陷区域),不是细胞的“一个面”。隐窝里的细胞既有管腔面(朝隐窝空洞),也有基底侧面(朝隐窝外周的基底膜)。

Further topics:

  • 为什么类器官的极性是“内外颠倒”的?
  • 在Transwell系统中如何实现“顶面朝外”?


1. “单层柱状上皮”是指只有一层细胞吗?

是的,千真万确。

  • 单层”在组织学里就是字面意思:只有一层细胞,不像皮肤(复层鳞状上皮)那样有很多层叠加在一起。
  • 柱状”指的是细胞的形态:它们不是圆形的,也不是扁平的,而是像高脚杯或柱子一样,高度 > 宽度,看起来细细长长的。

正因为只有一层,所以这层细胞肩负着巨大的责任:既要当“屏障”挡住坏人(细菌毒素),又要当“搬运工”吸收营养。


2. 细胞的一侧朝向肠腔,另一侧朝向基底膜吗?

完全正确! 这正是单层上皮最经典的极性(Polarity)特征。

因为只有一层,所以这层细胞必须“一头一尾”各司其职:

  • 顶端(Apical side / 管腔面)朝向肠腔(也就是大便和食物残渣流过的地方)。这一侧长满了密密麻麻的微绒毛(Microvilli),用来增加吸收面积,且没有血管直接接触。
  • 基底侧(Basolateral side / 基底侧面)朝向基底膜(Basement membrane)。基底膜像一层“胶水”和“滤网”,把这层上皮和下面的固有层(里面有血管、淋巴管、免疫细胞)粘在一起。细胞吸收的营养物质,会从这一侧排出去,进入血管。

3. 帮你建立“立体空间感”(重要补充)

你可能会困惑:“隐窝不是凹进去的吗?那凹进去的坑里,哪边算肠腔?”

记住这个黄金法则:只要是“空腔”的那一面,就是顶侧(Apical)。

  • 在绒毛上:细胞顶侧 → 对着肠道中间的大洞(肠腔)。
  • 在隐窝里:细胞顶侧 → 对着隐窝那个“小坑”的洞。虽然这个洞是封闭的凹陷,但它依然属于“腔”的一部分。
  • 细胞基底侧:永远对着基底膜(即远离食物,靠近身体内部的那一侧)。

4. 套用在你研究的“双极性构型”上

你现在完全理解了这个结构,就很容易搞懂体外实验的痛点了:

  • 在体内:细菌(如耶尔森氏菌)站在肠腔里,直接接触的就是细胞的顶侧(Apical),然后细菌入侵。
  • 在常规类器官(3D胶里):细胞顶侧(有微绒毛的那面)被包在里面(朝向类器官的内腔),细菌在胶外面,接触的是细胞的基底侧。这就反了!所以科学家才要费尽心机做“双极性构型(反转极性)”,目的就是把细胞的顶侧翻到外面来,让细菌能像在体内一样,从顶侧发起攻击。

总结一句话: 单层 = 只有一层细胞极性 = 顶侧对肠腔(吸收/感染面),基底侧对基底膜(连接身体面)

Generating report using R (Data_Tam_Metagenomics_2026_Wastewater)

To analyze your metagenomics data with greater control, you can use the R packages phyloseq, ggplot2, vegan, and MaAsLin2. The biobakery workflow has already generated the merged tables you need in the results/ directory.

Below is the complete R code to:

  1. Load the data and reproduce the figures from wmgx_report.pdf (PCoA, Heatmap, Stacked Barplot).
  2. Perform differential abundance analysis between Pre-treatment and Post-treatment.
  3. Statistically test if the differences between time points are significant.

Prerequisites

Make sure you have the following R packages installed:

install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr", "tidyr", "pheatmap"))
# MaAsLin2 is installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("MaAsLin2")

Step 1: Load Data and Create Metadata

The biobakery workflow merged your samples into single tables. We will load the MetaPhlAn species table and HUMAnN pathway table, and map your experimental design.

library(phyloseq)
library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Load MetaPhlAn taxonomic profiles
mpa_file <- "results/metaphlan/merged/metaphlan_taxonomic_profiles.tsv"
mpa_raw <- read.table(mpa_file, header=TRUE, sep="\t", comment.char="", check.names=FALSE)
colnames(mpa_raw)[1] <- "clade_name"
rownames(mpa_raw) <- mpa_raw$clade_name
mpa_otu <- mpa_raw[, -1]

# Filter for species level (s__) and clean names
species_idx <- grep("s__", rownames(mpa_otu))
species_otu <- mpa_otu[species_idx, ]
species_names <- gsub(".*s__", "", rownames(species_otu))
species_names <- gsub("\\|.*", "", species_names)
species_names <- gsub("_", " ", species_names) # Replace underscores with spaces
rownames(species_otu) <- species_names

# 2. Create Metadata based on your file names
sample_names <- colnames(species_otu)
metadata <- data.frame(
  row.names = sample_names,
  TimePoint = case_when(
    grepl("Nov", sample_names) ~ "Nov",
    grepl("Jan", sample_names) ~ "Jan",
    grepl("Mar", sample_names) ~ "Mar",
    grepl("May", sample_names) ~ "May"
  ),
  Treatment = ifelse(grepl("Pre", sample_names), "Pre", "Post")
)

# 3. Create phyloseq object for Species
tax_mat <- matrix(rownames(species_otu), ncol=1, dimnames=list(rownames(species_otu), "Species"))
species_ps <- phyloseq(
  otu_table(species_otu, taxa_are_rows=TRUE),
  sample_data(metadata),
  tax_table(tax_mat)
)

# (Optional) Load HUMAnN pathways using the same logic
# path_file <- "results/humann/merged/pathabundance_relab.tsv"
# ... (similar loading steps as above) ...

Step 2: Reproduce Figures from wmgx_report.pdf

1. Ordination (PCoA)

Reproduces the Bray-Curtis Principal Coordinate Analysis.

library(vegan)

# Calculate Bray-Curtis distance and perform PCoA
ord <- ordinate(species_ps, method="PCoA", distance="bray")

# Plot
plot_ordination(species_ps, ord, color="Treatment", shape="TimePoint") +
  geom_point(size=4) +
  theme_minimal() +
  labs(title="PCoA of Species Profiles (Bray-Curtis)", 
       subtitle="Colored by Treatment, Shaped by Time Point") +
  theme(legend.position = "right")

2. Heatmap (Top 25 Species)

Reproduces the hierarchical clustering heatmap using Euclidean distance.

# Get top 25 most abundant species
top25 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:25]
ps_top25 <- prune_taxa(top25, species_ps)

# Plot heatmap (method=NULL defaults to hierarchical clustering)
plot_heatmap(ps_top25, method=NULL, distance="euclidean", 
             taxa.labels=TRUE, label.size=3, na.value=NA) +
  scale_fill_viridis_c() +
  labs(title="Heatmap of Top 25 Most Abundant Species")

3. Stacked Barplot (Top 15 Species)

Reproduces the stacked barplot, sorted by total abundance.

# Get top 15 species
top15 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:15]
ps_top15 <- prune_taxa(top15, species_ps)

# Melt data for ggplot2
df <- psmelt(ps_top15)

# Sort samples by the total abundance of these top 15 species (decreasing order)
sample_order <- df %>%
  group_by(Sample) %>%
  summarise(Total_Ab = sum(Abundance)) %>%
  arrange(desc(Total_Ab)) %>%
  pull(Sample)

df$Sample <- factor(df$Sample, levels=sample_order)

# Plot
ggplot(df, aes(x=Sample, y=Abundance, fill=Species)) +
  geom_bar(stat="identity", position="stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  labs(title="Relative Abundance of Top 15 Most Abundant Species", 
       y="Relative Abundance") +
  scale_fill_discrete(name = "Species")

Step 3: Differential Abundance Analysis (Pre vs. Post)

To find which specific species or pathways are significantly different between Pre- and Post-treatment, we use MaAsLin2 (the official biobakery recommendation). It allows us to test the effect of Treatment while adjusting for TimePoint as a covariate.

library(MaAsLin2)

# MaAsLin2 requires samples as rows and features as columns
species_t <- t(otu_table(species_ps))
species_df <- as.data.frame(species_t)
species_df$sample_id <- rownames(species_df)

metadata_df <- data.frame(
  sample_id = rownames(metadata),
  metadata
)

# Run MaAsLin2
# We set "Pre" and "Nov" as reference groups. 
# The output coefficients will represent the difference of Post vs Pre.
fit_results <- MaAsLin2(
  input_data = species_df,
  input_metadata = metadata_df,
  output = "maaslin2_species_output",
  fixed_effects = c("Treatment", "TimePoint"),
  reference = c("Treatment,Pre", "TimePoint,Nov"), 
  random_effects = c(),
  correction = "FDR",
  standardize = FALSE,
  min_abundance = 0.001, 
  min_prevalence = 0.1   # Feature must be present in at least 10% of samples
)

# View significant results (q-value < 0.05)
sig_results <- subset(fit_results$results, qval < 0.05 & metadata == "Treatment")
if(nrow(sig_results) > 0) {
  print(sig_results[, c("feature", "coef", "qval")] %>% arrange(qval))
} else {
  print("No significantly different species found at FDR < 0.05.")
}

Note: You can run the exact same code for pathways by replacing species_df with your HUMAnN pathway data frame.


Step 4: Is the difference between time points significant?

To determine if the microbial community composition changes significantly across the different time points (Nov, Jan, Mar, May), we use PERMANOVA (via the adonis2 function in the vegan package). This tests the overall community structure rather than individual features.

library(vegan)

# Calculate Bray-Curtis distance matrix
dist_bc <- vegdist(t(otu_table(species_ps)), method="bray")

# Run PERMANOVA
# This tests the main effects of Treatment and TimePoint, and their interaction
adonis_res <- adonis2(dist_bc ~ Treatment * TimePoint, data=metadata, permutations=999)
print(adonis_res)

How to interpret the adonis2 output:

  1. Treatment row: Look at the Pr(>F) (p-value). If it is < 0.05, it means there is a statistically significant difference in the overall microbiome between Pre-treatment and Post-treatment.
  2. TimePoint row: Look at the Pr(>F). If it is < 0.05, it means the microbial community composition differs significantly across the months (Nov, Jan, Mar, May).
  3. Treatment:TimePoint row: If this interaction is significant (< 0.05), it means the effect of the treatment depends on the time of year (e.g., the treatment works differently in November compared to May).

Important Note on Statistical Power: Your dataset has 11 samples total. In the PERMANOVA model, this leaves only 3 residual degrees of freedom. While this mathematically allows for testing, the statistical power is low. Significant results are highly trustworthy, but non-significant results might simply be due to the small sample size.

Processing RNAseq (Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE)

  1. Preparing raw data

     mkdir raw_data; cd raw_data
    
     # control samples (8)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_1.fq.gz AYE-WT_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1/1_2.fq.gz AYE-WT_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_1.fq.gz AYE-WT_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2/2_2.fq.gz AYE-WT_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_1.fq.gz AYE-WT_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3/3_2.fq.gz AYE-WT_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_1.fq.gz AYE-T_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4/4_2.fq.gz AYE-T_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_1.fq.gz AYE-T_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5/5_2.fq.gz AYE-T_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_1.fq.gz AYE-T_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6/6_2.fq.gz AYE-T_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_1.fq.gz AYE-O_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/7/7_2.fq.gz AYE-O_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_1.fq.gz AYE-O_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/8/8_2.fq.gz AYE-O_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_1.fq.gz AYE-O_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/9/9_2.fq.gz AYE-O_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_1.fq.gz O-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/10/10_2.fq.gz O-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_1.fq.gz O-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/11/11_2.fq.gz O-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_1.fq.gz O-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/12/12_2.fq.gz O-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_1.fq.gz WT-Trans_ctr_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/1new/1new_2.fq.gz WT-Trans_ctr_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_1.fq.gz WT-Trans_ctr_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/2new/2new_2.fq.gz WT-Trans_ctr_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_1.fq.gz WT-Trans_ctr_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/3new/3new_2.fq.gz WT-Trans_ctr_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_1.fq.gz AYE-WT_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/49/49_2.fq.gz AYE-WT_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_1.fq.gz AYE-WT_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/50/50_2.fq.gz AYE-WT_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_1.fq.gz AYE-WT_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/51/51_2.fq.gz AYE-WT_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_1.fq.gz AYE-O_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/52/52_2.fq.gz AYE-O_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_1.fq.gz AYE-O_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/53/53_2.fq.gz AYE-O_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_1.fq.gz AYE-O_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/54/54_2.fq.gz AYE-O_ctr_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_1.fq.gz AYE-T_ctr_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/55/55_2.fq.gz AYE-T_ctr_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_1.fq.gz AYE-T_ctr_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/56/56_2.fq.gz AYE-T_ctr_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_1.fq.gz AYE-T_ctr_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/57/57_2.fq.gz AYE-T_ctr_solid_r3_R2.fastq.gz
    
     # Diclofenac(双氯芬酸)treatment (6)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_1.fq.gz AYE-WT_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/25/25_2.fq.gz AYE-WT_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_1.fq.gz AYE-WT_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/26/26_2.fq.gz AYE-WT_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_1.fq.gz AYE-WT_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/27/27_2.fq.gz AYE-WT_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_1.fq.gz AYE-T_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/28/28_2.fq.gz AYE-T_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_1.fq.gz AYE-T_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/29/29_2.fq.gz AYE-T_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_1.fq.gz AYE-T_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/30/30_2.fq.gz AYE-T_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_1.fq.gz AYE-O_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/31/31_2.fq.gz AYE-O_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_1.fq.gz AYE-O_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/32/32_2.fq.gz AYE-O_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_1.fq.gz AYE-O_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/33/33_2.fq.gz AYE-O_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_1.fq.gz O-Trans_Diclo375_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/34/34_2.fq.gz O-Trans_Diclo375_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_1.fq.gz O-Trans_Diclo375_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/35/35_2.fq.gz O-Trans_Diclo375_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_1.fq.gz O-Trans_Diclo375_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/36/36_2.fq.gz O-Trans_Diclo375_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_1.fq.gz WT-Trans_Diclo750_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/4new/4new_2.fq.gz WT-Trans_Diclo750_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_1.fq.gz WT-Trans_Diclo750_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/5new/5new_2.fq.gz WT-Trans_Diclo750_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_1.fq.gz WT-Trans_Diclo750_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/6new/6new_2.fq.gz WT-Trans_Diclo750_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_1.fq.gz AYE-WT_Diclo1250_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/73/73_2.fq.gz AYE-WT_Diclo1250_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_1.fq.gz AYE-WT_Diclo1250_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/74/74_2.fq.gz AYE-WT_Diclo1250_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_1.fq.gz AYE-WT_Diclo1250_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/75/75_2.fq.gz AYE-WT_Diclo1250_solid_r3_R2.fastq.gz
    
     # Rifampicin(利福平)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_1.fq.gz AYE-WT_Rifampicin1.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/13/13_2.fq.gz AYE-WT_Rifampicin1.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_1.fq.gz AYE-WT_Rifampicin1.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/14/14_2.fq.gz AYE-WT_Rifampicin1.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_1.fq.gz AYE-WT_Rifampicin1.5_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/15/15_2.fq.gz AYE-WT_Rifampicin1.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_1.fq.gz AYE-T_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/16/16_2.fq.gz AYE-T_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_1.fq.gz AYE-T_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/17/17_2.fq.gz AYE-T_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_1.fq.gz AYE-T_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/18/18_2.fq.gz AYE-T_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_1.fq.gz AYE-O_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/19/19_2.fq.gz AYE-O_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_1.fq.gz AYE-O_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/20/20_2.fq.gz AYE-O_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_1.fq.gz AYE-O_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/21/21_2.fq.gz AYE-O_Rifampicin2_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_1.fq.gz O-Trans_Rifampicin2_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/22/22_2.fq.gz O-Trans_Rifampicin2_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_1.fq.gz O-Trans_Rifampicin2_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/23/23_2.fq.gz O-Trans_Rifampicin2_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_1.fq.gz O-Trans_Rifampicin2_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/24/24_2.fq.gz O-Trans_Rifampicin2_r3_R2.fastq.gz
    
     # Meropenem(美罗培南)treatment (4)
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_1.fq.gz AYE-WT_Mero0.35-0.5_r1_R1.fastq.gz  #AYE-WT_Mero0.5_r1
     ln -s ../X101SC26025981-Z01-J001/01.RawData/37/37_2.fq.gz AYE-WT_Mero0.35-0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_1.fq.gz AYE-WT_Mero0.35-0.5_r2_R1.fastq.gz  #AYE-WT_YX_Mero0.35_r2
     ln -s ../X101SC26025981-Z01-J001/01.RawData/38/38_2.fq.gz AYE-WT_Mero0.35-0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_1.fq.gz AYE-WT_Mero0.35-0.5_r3_R1.fastq.gz  #AYE-WT_public_Mero0.35_r3
     ln -s ../X101SC26025981-Z01-J001/01.RawData/39/39_2.fq.gz AYE-WT_Mero0.35-0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_1.fq.gz AYE-T_Mero0.15_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/40/40_2.fq.gz AYE-T_Mero0.15_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_1.fq.gz AYE-T_Mero0.15_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/41/41_2.fq.gz AYE-T_Mero0.15_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_1.fq.gz AYE-T_Mero0.15_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/42/42_2.fq.gz AYE-T_Mero0.15_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_1.fq.gz AYE-O_Mero0.5_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/43/43_2.fq.gz AYE-O_Mero0.5_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_1.fq.gz AYE-O_Mero0.5_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/44/44_2.fq.gz AYE-O_Mero0.5_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_1.fq.gz AYE-O_Mero0.5_r3_R1.fastq.gz  #Mero0.45
     ln -s ../X101SC26025981-Z01-J001/01.RawData/45/45_2.fq.gz AYE-O_Mero0.5_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_1.fq.gz O-Trans_Mero0.25_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/46/46_2.fq.gz O-Trans_Mero0.25_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_1.fq.gz O-Trans_Mero0.25_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/47/47_2.fq.gz O-Trans_Mero0.25_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_1.fq.gz O-Trans_Mero0.25_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/48/48_2.fq.gz O-Trans_Mero0.25_r3_R2.fastq.gz
    
     # Azithromycin(阿奇霉素)treatment (5), among them, F_ctr_solid is clinical isolate.
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_1.fq.gz F_ctr_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/58/58_2.fq.gz F_ctr_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_1.fq.gz F_ctr_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/59/59_2.fq.gz F_ctr_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_1.fq.gz F_ctr_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/60/60_2.fq.gz F_ctr_solid_r3_R2.fastq.gz  #clinical
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_1.fq.gz AYE-WT_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/61/61_2.fq.gz AYE-WT_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_1.fq.gz AYE-WT_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/62/62_2.fq.gz AYE-WT_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_1.fq.gz AYE-WT_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/63/63_2.fq.gz AYE-WT_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_1.fq.gz AYE-T_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/67/67_2.fq.gz AYE-T_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_1.fq.gz AYE-T_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/68/68_2.fq.gz AYE-T_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_1.fq.gz AYE-T_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/69/69_2.fq.gz AYE-T_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_1.fq.gz AYE-O_Azi20_solid_r1_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/64/64_2.fq.gz AYE-O_Azi20_solid_r1_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_1.fq.gz AYE-O_Azi20_solid_r2_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/65/65_2.fq.gz AYE-O_Azi20_solid_r2_R2.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_1.fq.gz AYE-O_Azi20_solid_r3_R1.fastq.gz
     ln -s ../X101SC26025981-Z01-J001/01.RawData/66/66_2.fq.gz AYE-O_Azi20_solid_r3_R2.fastq.gz
    
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_1.fq.gz F_Azi20_solid_r1_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/70/70_2.fq.gz F_Azi20_solid_r1_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_1.fq.gz F_Azi20_solid_r2_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/71/71_2.fq.gz F_Azi20_solid_r2_R2.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_1.fq.gz F_Azi20_solid_r3_R1.fastq.gz  #clinical
     ln -s ../X101SC26025981-Z01-J001/01.RawData/72/72_2.fq.gz F_Azi20_solid_r3_R2.fastq.gz  #clinical
  2. Preparing the directory trimmed

     mkdir trimmed trimmed_unpaired;
     for sample_id in AYE-O_Azi20_solid_r1 AYE-O_Azi20_solid_r2 AYE-O_Azi20_solid_r3 AYE-O_ctr_r1 AYE-O_ctr_r2 AYE-O_ctr_r3 AYE-O_ctr_solid_r1 AYE-O_ctr_solid_r2 AYE-O_ctr_solid_r3 AYE-O_Diclo375_r1 AYE-O_Diclo375_r2 AYE-O_Diclo375_r3 AYE-O_Mero0.5_r1 AYE-O_Mero0.5_r2 AYE-O_Mero0.5_r3 AYE-O_Rifampicin2_r1 AYE-O_Rifampicin2_r2 AYE-O_Rifampicin2_r3 AYE-T_Azi20_solid_r1 AYE-T_Azi20_solid_r2 AYE-T_Azi20_solid_r3 AYE-T_ctr_r1 AYE-T_ctr_r2 AYE-T_ctr_r3 AYE-T_ctr_solid_r1 AYE-T_ctr_solid_r2 AYE-T_ctr_solid_r3 AYE-T_Diclo375_r1 AYE-T_Diclo375_r2 AYE-T_Diclo375_r3 AYE-T_Mero0.15_r1 AYE-T_Mero0.15_r2 AYE-T_Mero0.15_r3 AYE-T_Rifampicin2_r1 AYE-T_Rifampicin2_r2 AYE-T_Rifampicin2_r3 AYE-WT_Azi20_solid_r1 AYE-WT_Azi20_solid_r2 AYE-WT_Azi20_solid_r3 AYE-WT_ctr_r1 AYE-WT_ctr_r2 AYE-WT_ctr_r3 AYE-WT_ctr_solid_r1 AYE-WT_ctr_solid_r2 AYE-WT_ctr_solid_r3 AYE-WT_Diclo1250_solid_r1 AYE-WT_Diclo1250_solid_r2 AYE-WT_Diclo1250_solid_r3 AYE-WT_Diclo750_r1 AYE-WT_Diclo750_r2 AYE-WT_Diclo750_r3 AYE-WT_Mero0.35-0.5_r1 AYE-WT_Mero0.35-0.5_r2 AYE-WT_Mero0.35-0.5_r3 AYE-WT_Rifampicin1.5_r1 AYE-WT_Rifampicin1.5_r2 AYE-WT_Rifampicin1.5_r3 F_Azi20_solid_r1 F_Azi20_solid_r2 F_Azi20_solid_r3 F_ctr_solid_r1 F_ctr_solid_r2 F_ctr_solid_r3 O-Trans_ctr_r1 O-Trans_ctr_r2 O-Trans_ctr_r3 O-Trans_Diclo375_r1 O-Trans_Diclo375_r2 O-Trans_Diclo375_r3 O-Trans_Mero0.25_r1 O-Trans_Mero0.25_r2 O-Trans_Mero0.25_r3 O-Trans_Rifampicin2_r1 O-Trans_Rifampicin2_r2 O-Trans_Rifampicin2_r3 WT-Trans_ctr_r1 WT-Trans_ctr_r2 WT-Trans_ctr_r3 WT-Trans_Diclo750_r1 WT-Trans_Diclo750_r2 WT-Trans_Diclo750_r3; do \
     for sample_id in AYE-T_Diclo375_r2; do \
             java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fastq.gz raw_data/${sample_id}_R2.fastq.gz trimmed/${sample_id}_R1.fastq.gz trimmed_unpaired/${sample_id}_R1.fastq.gz trimmed/${sample_id}_R2.fastq.gz trimmed_unpaired/${sample_id}_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
     done
  3. (Optional) using trinity to find the most closely reference

     #Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
     #https://www.genome.jp/kegg/tables/br08606.html#prok
     acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
     abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
     aby     KGB     Acinetobacter baumannii AYE     2008    GenBank --> *
     abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
     abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
     abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
     abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
     abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
     abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
     abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
     abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
     abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
     abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
     abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
     abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
     abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
     abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
     abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
     abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
     abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
     abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
     #Note that the Acinetobacter baumannii strain ATCC 19606 chromosome, complete genome (GenBank: CU459141.1) was choosen as reference!
  4. Preparing samplesheet.csv

     sample,fastq_1,fastq_2,strandedness
     Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
     ...
  5. Downloading CU459141.fasta and CU459141.gff from GenBank and preparing CU459141_m.gff

     #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
     #Default NOT_WORKING: --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
     #(host_env) !NOT_WORKING! jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CU459141.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_1 (CDS --> exon in CP059040.gff) --
     #Checking the record (see below) in results/genome/CP059040.gtf
     #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"
     #--featurecounts_feature_type 'transcript' returns only the tRNA results
     #Since the tRNA records have "transcript and exon". In gene records, we have "transcript and CDS". replace the CDS with exon
    
     grep -P "\texon\t" CP059040.gff | sort | wc -l    #96
     grep -P "cmsearch\texon\t" CP059040.gff | wc -l    #=10  ignal recognition particle sRNA small typ, transfer-messenger RNA, 5S ribosomal RNA
     grep -P "Genbank\texon\t" CP059040.gff | wc -l    #=12  16S and 23S ribosomal RNA
     grep -P "tRNAscan-SE\texon\t" CP059040.gff | wc -l    #tRNA 74
     wc -l star_salmon/AUM_r3/quant.genes.sf  #--featurecounts_feature_type 'transcript' results in 96 records!
    
     grep -P "\tCDS\t" CU459141.gff3 | wc -l  #3659
     sed 's/\tCDS\t/\texon\t/g' CU459141.gff3 > CU459141_m.gff
     grep -P "\texon\t" CU459141_m.gff | sort | wc -l  #3760
    
     # -- DEBUG_2: combination of 'CU459141_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
     --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff" --featurecounts_feature_type 'transcript'
    
     # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
  6. nextflow run

     # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
     (host_env) mv trimmed/*.fastq.gz .
     (host_env) nextflow run nf-core/rnaseq -r 3.14.0 -profile docker \
     --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2026_on_AYE/CU459141_m.gff"        -resume  --max_cpus 90 --max_memory 900.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
  7. (OPTIONAL, since the R-script complete_deg_pipeline_custom_cutoff.R runs completely) Generate PCA-plot

     #mamba activate r_env
    
     #install.packages("ggfun")
     # Import the required libraries
     library("AnnotationDbi")
     library("clusterProfiler")
     library("ReactomePA")
     library(gplots)
     library(tximport)
     library(DESeq2)
     #library("org.Hs.eg.db")
     library(dplyr)
     library(tidyverse)
     #install.packages("devtools")
     #devtools::install_version("gtable", version = "0.3.0")
     library(gplots)
     library("RColorBrewer")
     #install.packages("ggrepel")
     library("ggrepel")
     # install.packages("openxlsx")
     library(openxlsx)
     library(EnhancedVolcano)
     library(DESeq2)
     library(edgeR)
    
     setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon")
     # Define paths to your Salmon output quantification files
    
     # Store sample names in a character vector
     samples <- c(
         "AYE-O_Azi20_solid_r1", "AYE-O_Azi20_solid_r2", "AYE-O_Azi20_solid_r3", "AYE-O_ctr_r1", "AYE-O_ctr_r2",
         "AYE-O_ctr_r3", "AYE-O_ctr_solid_r1", "AYE-O_ctr_solid_r2", "AYE-O_ctr_solid_r3",
         "AYE-O_Diclo375_r1", "AYE-O_Diclo375_r2", "AYE-O_Diclo375_r3", "AYE-O_Mero0.5_r1",
         "AYE-O_Mero0.5_r2", "AYE-O_Mero0.5_r3", "AYE-O_Rifampicin2_r1", "AYE-O_Rifampicin2_r2",
         "AYE-O_Rifampicin2_r3", "AYE-T_Azi20_solid_r1", "AYE-T_Azi20_solid_r2", "AYE-T_Azi20_solid_r3",
         "AYE-T_ctr_r1", "AYE-T_ctr_r2", "AYE-T_ctr_r3", "AYE-T_ctr_solid_r1", "AYE-T_ctr_solid_r2",
         "AYE-T_ctr_solid_r3", "AYE-T_Diclo375_r1", "AYE-T_Diclo375_r2", "AYE-T_Diclo375_r3",
         "AYE-T_Mero0.15_r1", "AYE-T_Mero0.15_r2", "AYE-T_Mero0.15_r3", "AYE-T_Rifampicin2_r1",
         "AYE-T_Rifampicin2_r2", "AYE-T_Rifampicin2_r3", "AYE-WT_Azi20_solid_r1", "AYE-WT_Azi20_solid_r2",
         "AYE-WT_Azi20_solid_r3", "AYE-WT_ctr_r1", "AYE-WT_ctr_r2", "AYE-WT_ctr_r3", "AYE-WT_ctr_solid_r1",
         "AYE-WT_ctr_solid_r2", "AYE-WT_ctr_solid_r3", "AYE-WT_Diclo1250_solid_r1", "AYE-WT_Diclo1250_solid_r2",
         "AYE-WT_Diclo1250_solid_r3", "AYE-WT_Diclo750_r1", "AYE-WT_Diclo750_r2", "AYE-WT_Diclo750_r3",
         "AYE-WT_Mero0.35-0.5_r1", "AYE-WT_Mero0.35-0.5_r2", "AYE-WT_Mero0.35-0.5_r3", "AYE-WT_Rifampicin1.5_r1",
         "AYE-WT_Rifampicin1.5_r2", "AYE-WT_Rifampicin1.5_r3", "F_Azi20_solid_r1", "F_Azi20_solid_r2",
         "F_Azi20_solid_r3", "F_ctr_solid_r1", "F_ctr_solid_r2", "F_ctr_solid_r3",
         "O-Trans_ctr_r1","O-Trans_ctr_r2","O-Trans_ctr_r3",  "O-Trans_Diclo375_r1","O-Trans_Diclo375_r2","O-Trans_Diclo375_r3",
         "O-Trans_Mero0.25_r1", "O-Trans_Mero0.25_r2", "O-Trans_Mero0.25_r3", "O-Trans_Rifampicin2_r1",
         "O-Trans_Rifampicin2_r2", "O-Trans_Rifampicin2_r3", "WT-Trans_ctr_r1", "WT-Trans_ctr_r2",
         "WT-Trans_ctr_r3", "WT-Trans_Diclo750_r1", "WT-Trans_Diclo750_r2", "WT-Trans_Diclo750_r3"
     )
    
     # Automatically generate the named vector
     files <- setNames(paste0("./", samples, "/quant.sf"), samples)
     # Import the transcript abundance data with tximport
     txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
     # -----------------------------------------------------------------
     # ---- Step 1: Create Detailed Metadata from Your Sample Names ----
    
     # Extract metadata from sample names
     samples <- names(files)
    
     # Parse the complex sample names
     metadata <- data.frame(
     sample = samples,
     stringsAsFactors = FALSE
     )
    
     # Extract strain (everything before first underscore or hyphen treatment)
     metadata$strain <- sapply(strsplit(samples, "[-_]"), function(x) {
     if(x[1] %in% c("AYE", "O", "WT", "F")) {
         if(x[1] == "AYE" && length(x) > 1 && x[2] %in% c("WT", "T", "O")) {
         paste(x[1:2], collapse = "-")
         } else if(x[1] %in% c("O", "WT") && x[2] == "Trans") {
         paste(x[1:2], collapse = "-")
         } else {
         x[1]
         }
     } else {
         x[1]
     }
     })
    
     # Extract treatment type
     metadata$treatment <- sapply(samples, function(x) {
         if(grepl("_ctr", x)) return("ctrl")
         if(grepl("Diclo", x)) return("Diclo")
         if(grepl("Mero", x)) return("Mero")
         if(grepl("Azi", x)) return("Azi")
         if(grepl("Rifampicin", x)) return("Rifampicin")
         return("ctrl")
     })
    
     # Extract concentration
     metadata$concentration <- sapply(samples, function(x) {
         if(grepl("Diclo1250", x)) return("1250")
         if(grepl("Diclo750", x)) return("750")
         if(grepl("Diclo375", x)) return("375")
         if(grepl("Mero0.5", x)) return("0.5")
         if(grepl("Mero0.35", x)) return("0.35")
         if(grepl("Mero0.25", x)) return("0.25")
         if(grepl("Mero0.15", x)) return("0.15")
         if(grepl("Azi20", x)) return("20")
         if(grepl("Rifampicin2", x)) return("2")
         if(grepl("Rifampicin1.5", x)) return("1.5")
         return("0")
     })
    
     # Extract condition (solid vs liquid)
     metadata$condition <- ifelse(grepl("_solid", samples), "solid", "liquid")
    
     # Extract replicate
     metadata$replicate <- sapply(strsplit(samples, "_"), function(x) {
     rep_part <- x[length(x)]
     gsub("r", "", rep_part)
     })
    
     # Create combined group for easy comparisons
     metadata$group <- paste(metadata$strain, metadata$treatment, metadata$concentration, sep = "_")
    
     # Set row names
     rownames(metadata) <- metadata$sample
    
     # Reorder to match txi columns
     metadata <- metadata[colnames(txi$counts), ]
    
     # ---------------------------------------------
     # ---- Step 2: Choose Your Design Strategy ----
    
     # Strategy A: Full Factorial Design (if balanced)
     #dds <- DESeqDataSetFromTximport(txi, metadata,
     #                         design = ~ strain + treatment + condition)
    
     # --> Strategy B: Combined Group Factor ⭐ RECOMMENDED
     metadata$group <- factor(paste(metadata$strain,
                                     metadata$treatment,
                                     metadata$concentration,
                                     metadata$condition,
                                     sep = "_"))
    
     dds <- DESeqDataSetFromTximport(txi, metadata,
                                     design = ~ group)
     dds <- DESeq(dds)
    
     # See all available comparisons
     resultsNames(dds)
    
     # -------------------------------------------------------------
     # ---- Step 3: Set Up Specific Comparisons from Your Notes ----
     # ==========================================
     # 1. Define Exact Comparisons from Your Notes
     # ==========================================
     planned_comparisons <- list(
     # --- Baseline / Strain Controls ---
     AYE_T_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-T_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_WT_ctr            = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_WT_ctr          = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_WT_ctr         = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_vs_AYE_T                 = list(treat = "AYE-O_ctrl_0_liquid",   ctrl = "AYE-T_ctrl_0_liquid"),
     O_Trans_ctr_vs_AYE_T               = list(treat = "O-Trans_ctrl_0_liquid", ctrl = "AYE-T_ctrl_0_liquid"),
     WT_Trans_ctr_vs_AYE_T              = list(treat = "WT-Trans_ctrl_0_liquid",ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Condition Effects (Solid vs Liquid) ---
     AYE_WT_ctr_solid_vs_AYE_WT_ctr     = list(treat = "AYE-WT_ctrl_0_solid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_O_ctr       = list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-O_ctrl_0_liquid"),
     AYE_T_ctr_solid_vs_AYE_T_ctr       = list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-O_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_ctr_solid_vs_AYE_WT_ctr_solid= list(treat = "AYE-T_ctrl_0_solid",    ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Diclofenac ---
     AYE_WT_Diclo750_vs_AYE_WT_ctr      = list(treat = "AYE-WT_Diclo_750_liquid",   ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-T_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_WT_ctr       = list(treat = "AYE-O_Diclo_375_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_AYE_WT_ctr     = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "AYE-WT_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_AYE_WT_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     Diclo_AYE_WT_1250_solid_vs_solid_ctr = list(treat = "AYE-WT_Diclo_1250_solid", ctrl = "AYE-WT_ctrl_0_solid"),
    
     # --- Meropenem ---
     AYE_WT_Mero_vs_AYE_WT_ctr          = list(treat = "AYE-WT_Mero_0.35_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_WT_ctr           = list(treat = "AYE-O_Mero_0.5_liquid",       ctrl = "AYE-WT_ctrl_0_liquid"),
     O_Trans_Mero_vs_AYE_WT_ctr         = list(treat = "O-Trans_Mero_0.25_liquid",    ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Mero_vs_AYE_T_ctr            = list(treat = "AYE-T_Mero_0.15_liquid",      ctrl = "AYE-T_ctrl_0_liquid"),
    
     # --- Azithromycin (Solid) ---
     AYE_WT_Azi_vs_solid_ctr            = list(treat = "AYE-WT_Azi_20_solid", ctrl = "AYE-WT_ctrl_0_solid"),
     AYE_T_Azi_vs_solid_ctr             = list(treat = "AYE-T_Azi_20_solid",  ctrl = "AYE-T_ctrl_0_solid"),
     AYE_O_Azi_vs_solid_ctr             = list(treat = "AYE-O_Azi_20_solid",  ctrl = "AYE-O_ctrl_0_solid"),
     F_Azi_vs_F_solid_ctr               = list(treat = "F_Azi_20_solid",      ctrl = "F_ctrl_0_solid"),
    
     # --- Rifampicin ---
     AYE_WT_Rif_vs_AYE_WT_ctr           = list(treat = "AYE-WT_Rifampicin_1.5_liquid", ctrl = "AYE-WT_ctrl_0_liquid"),
     AYE_T_Rif_vs_AYE_T_ctr             = list(treat = "AYE-T_Rifampicin_2_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Rif_vs_AYE_O_ctr             = list(treat = "AYE-O_Rifampicin_2_liquid",    ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Rif_vs_O_Trans_ctr         = list(treat = "O-Trans_Rifampicin_2_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # Additional Analyses
     planned_comparisons <- list(
     # --- Diclofenac_2 ---
     # * AYE-T_Diclo_375_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Diclo_375_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Diclo_375_liquid vs AYE-O-Trans_ctrl_0_liquid
     # * WT-Trans_Diclo_750_liquid vs AYE-WT-Trans_ctrl_0_liquid
     AYE_T_Diclo375_vs_AYE_T_ctr        = list(treat = "AYE-T_Diclo_375_liquid",        ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Diclo375_vs_AYE_O_ctr        = list(treat = "AYE-O_Diclo_375_liquid",        ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Diclo375_vs_O_Trans_ctr    = list(treat = "O-Trans_Diclo_375_liquid",  ctrl = "O-Trans_ctrl_0_liquid"),
     WT_Trans_Diclo750_vs_WT_Trans_ctr    = list(treat = "WT-Trans_Diclo_750_liquid", ctrl = "WT-Trans_ctrl_0_liquid"),
    
     # --- Meropenem_2 ---
     # * AYE-T_Mero_0.15_liquid vs AYE-T_ctrl_0_liquid
     # * AYE-O_Mero_0.5_liquid vs AYE-O_ctrl_0_liquid
     # * O-Trans_Mero_0.25_liquid vs AYE-O-Trans_ctrl_0_liquid
     AYE_T_Mero_vs_AYE_T_ctr        = list(treat = "AYE-T_Mero_0.15_liquid",    ctrl = "AYE-T_ctrl_0_liquid"),
     AYE_O_Mero_vs_AYE_O_ctr        = list(treat = "AYE-O_Mero_0.5_liquid",     ctrl = "AYE-O_ctrl_0_liquid"),
     O_Trans_Mero_vs_O_Trans_ctr    = list(treat = "O-Trans_Mero_0.25_liquid",  ctrl = "O-Trans_ctrl_0_liquid")
     )
    
     # ==========================================
     # 2. Verification & Validation Script
     # ==========================================
     # Identify which column in colData holds your group names
     group_col <- if("group" %in% colnames(colData(dds))) "group" else
                 if("treatment" %in% colnames(colData(dds))) "treatment" else
                 stop("❌ Please specify the correct colData column containing group names.")
    
     actual_groups <- unique(colData(dds)[[group_col]])
    
     cat("\n", paste(rep("=", 85), collapse=""), "\n")
     cat("📋 VERIFICATION OF NOTE-DERIVED COMPARISONS\n")
     cat(paste(rep("=", 85), collapse=""), "\n\n")
    
     validation_results <- data.frame(
     Comparison_Name = character(),
     Treatment_String = character(),
     Control_String = character(),
     Status = character(),
     Suggested_Contrast = character(),
     stringsAsFactors = FALSE
     )
    
     for(name in names(planned_comparisons)) {
     trt <- planned_comparisons[[name]]$treat
     ctl <- planned_comparisons[[name]]$ctrl
    
     # Find closest matches in actual data
     trt_match <- actual_groups[grepl(trt, actual_groups, fixed = TRUE)]
     ctl_match <- actual_groups[grepl(ctl, actual_groups, fixed = TRUE)]
    
     status <- if(length(trt_match) > 0 && length(ctl_match) > 0) "✅ VALID" else "⚠️  CHECK"
     contrast_str <- if(status == "✅ VALID")
         paste0('c("', group_col, '", "', trt_match[1], '", "', ctl_match[1], '")') else "N/A"
    
     validation_results <- rbind(validation_results, data.frame(
         Comparison_Name = name,
         Treatment_String = trt,
         Control_String = ctl,
         Status = status,
         Suggested_Contrast = contrast_str,
         stringsAsFactors = FALSE
     ))
    
     cat(sprintf("%-45s | T:%-25s C:%-20s | %s\n", name, trt, ctl, status))
     if(status == "⚠️  CHECK") {
         if(length(trt_match) == 0) cat("   🔍 Treat not found. Closest: ", paste(head(actual_groups[grepl(strsplit(trt, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
         if(length(ctl_match) == 0) cat("   🔍 Ctrl not found.  Closest: ", paste(head(actual_groups[grepl(strsplit(ctl, "_")[[1]][1], actual_groups)], 3), collapse=", "), "\n")
     }
     }
    
     # ==========================================
     # 3. Auto-Generate DESeq2 results() Calls (Optional)
     # ==========================================
     valid_comparisons <- validation_results[validation_results$Status == "✅ VALID", ]
     if(nrow(valid_comparisons) > 0) {
     cat("\n📜 READY-TO-RUN DESeq2 CONTRASTS:\n")
     cat(paste(rep("-", 60), collapse=""), "\n")
     for(i in seq_len(nrow(valid_comparisons))) {
         cat(sprintf('res_%s <- results(dds, contrast = %s)\n',
                     gsub("[^A-Za-z0-9]", "_", valid_comparisons$Comparison_Name[i]),
                     valid_comparisons$Suggested_Contrast[i]))
     }
     } else {
     cat("\n⚠️  No exact matches found. Check your colData group naming convention.\n")
     }
    
     # -----------------------------
     # ---- Step 4: PCA figures ----
    
     # 🔍 What each figure shows:
     #
     #    01_PCA_by_Strain.png → Tests if genetic background (AYE-WT, AYE-T, AYE-O, Trans, F) is the dominant source of variation.
     #    02_PCA_by_Treatment.png → Shows clustering by antibiotic/drug exposure (ctrl, Diclo, Mero, Azi, Rifampicin).
     #    03_PCA_by_Condition.png → Reveals batch/growth media effects (solid vs liquid).
     #    04_PCA_CombinedGroups.png → Full experimental grouping with labeled sample names for quick outlier detection.
     #    05_PCA_Ellipses.png → Adds 95% confidence boundaries per strain to visualize group spread and overlap.
     #
     # ⚠️ Quick Checklist Before Running:
     #
     #    Ensure metadata columns (strain, treatment, condition, group) are attached to colData(dds).
     #    If ggrepel is missing, run install.packages("ggrepel").
     #    All PNGs will save to your current working directory (getwd()).
    
     # Install if missing: install.packages(c("ggplot2", "ggrepel"))
     library(DESeq2)
     library(ggplot2)
     library(ggrepel)
    
     # 1. Variance Stabilizing Transformation & Extract PCA Data
     vsd <- vst(dds, blind = FALSE)
     pca_data <- plotPCA(vsd, intgroup = c("strain", "treatment", "condition", "group"), returnData = TRUE)
     percent_var <- round(100 * attr(pca_data, "percentVar"))
    
     # Consistent theme for all plots
     base_theme <- theme_bw(base_size = 12) +
     theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
             legend.position = "right",
             legend.title = element_text(face = "bold"),
             panel.grid.major = element_line(color = "grey90"),
             panel.grid.minor = element_blank())
    
     # --- Plot 1: Colored by Strain ---
     p1 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2.5, max.overlaps = 20, show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Strain",
         color = "Strain", shape = "Condition") +
     base_theme
     ggsave("01_PCA_by_Strain.png", p1, width = 8, height = 6, dpi = 300)
    
     # --- Plot 2: Colored by Treatment ---
     p2 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = treatment, shape = condition)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Treatment",
         color = "Treatment", shape = "Condition") +
     base_theme
     ggsave("02_PCA_by_Treatment.png", p2, width = 8, height = 6, dpi = 300)
    
     # --- Plot 3: Colored by Condition (Solid vs Liquid) ---
     p3 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = strain)) +
     geom_point(size = 3, alpha = 0.8) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Samples Colored by Growth Condition",
         color = "Condition", shape = "Strain") +
     base_theme
     ggsave("03_PCA_by_Condition.png", p3, width = 8, height = 6, dpi = 300)
    
     # --- Plot 4: Combined Groups with Sample Labels ---
     p4 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = group)) +
     geom_point(size = 3, alpha = 0.8) +
     geom_text_repel(aes(label = name), size = 2, max.overlaps = 30, box.padding = 0.3) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: Combined Experimental Groups",
         color = "Group") +
     base_theme + theme(legend.position = "none")
     ggsave("04_PCA_CombinedGroups.png", p4, width = 9, height = 7, dpi = 300)
    
     # --- Plot 5: 95% Confidence Ellipses (by Strain) ---
     p5 <- ggplot(pca_data, aes(x = PC1, y = PC2, color = strain, fill = strain)) +
     geom_point(size = 3, alpha = 0.7) +
     stat_ellipse(level = 0.95, alpha = 0.2, geom = "polygon", show.legend = FALSE) +
     labs(x = paste0("PC1: ", percent_var[1], "% variance"),
         y = paste0("PC2: ", percent_var[2], "% variance"),
         title = "PCA: 95% Confidence Ellipses by Strain",
         color = "Strain", fill = "Strain") +
     base_theme
     ggsave("05_PCA_Ellipses.png", p5, width = 8, height = 6, dpi = 300)
    
     message("✅ All 5 PCA plots saved to working directory!")
  8. Run Differential Expression & PCA Analysis Complete

     (r_env) cd ~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/
     #(r_env) Rscript complete_deg_pipeline.R  #For standard cutoff in the project
    
     # Adapted the script to the following requests:
     # (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and < -1.2 for the KEGG and GO analyses.
     # (b) Baseline / Strain Controls: use genes with a cutoff of log2 fold change > 1.4 and < -1.4 for the KEGG and GO analyses.
     # (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.
    
     # How it works:
     #   * Rifampicin: The script looks for "Rif" in the comparison name (e.g., 28_AYE_WT_Rif_vs_Ctrl) and applies |log2FC| >= 1.2.
     #   * Baseline/Strain Controls: The script looks for "_ctr_vs_" in the comparison name (e.g., 01_AYE_T_ctr_vs_AYE_WT_ctr) and applies |log2FC| >= 1.4.
     #   * All Others: Falls back to the original 2.0 cutoff.
     #   * The console output will now explicitly print which cutoff is being used for each specific comparison.
    
     (r_env) Rscript complete_deg_pipeline_custom_cutoff.R
  9. LOG of Rscript complete_deg_pipeline_custom_cutoff.R

     -rw-rw-r-- 1 jhuang jhuang 355609 Jun 22 12:53 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 253271 Jun 22 12:53 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 368843 Jun 22 12:53 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 358124 Jun 22 12:53 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang 347126 Jun 22 12:53 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang 283473 Jun 22 12:53 Volcano_31_O_Trans_Rif_vs_Ctrl.png
    
     -rw-rw-r-- 1 jhuang jhuang 352375 Jun 22 12:53 DEG_32_AYE_T_Diclo375_vs_AYE_T_ctr.xlsx
     -rw-rw-r-- 1 jhuang jhuang 257075 Jun 22 12:53 Volcano_32_AYE_T_Diclo375_vs_AYE_T_ctr.png
    
     -rw-rw-r-- 1 jhuang jhuang  355610 Jun  5 16:26 DEG_29_AYE_T_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  254804 Jun  5 16:26 Volcano_29_AYE_T_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  368843 Jun  5 16:26 DEG_30_AYE_O_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  358417 Jun  5 16:26 Volcano_30_AYE_O_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang  347127 Jun  5 16:26 DEG_31_O_Trans_Rif_vs_Ctrl.xlsx
     -rw-rw-r-- 1 jhuang jhuang  284084 Jun  5 16:26 Volcano_31_O_Trans_Rif_vs_Ctrl.png
     -rw-rw-r-- 1 jhuang jhuang    1579 Jun  5 16:26 DEG_Summary_All_31.csv
    
     (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon$ Rscript complete_deg_pipeline_custom_cutoff.R
     There were 22 warnings (use warnings() to see them)
     📖 Parsing annotation file to build tx2gene mapping...
     ✅ Created tx2gene mapping with 7520 entries.
     📥 Importing Salmon quantifications...
     reading in files with read_tsv
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
     removing duplicated transcript rows from tx2gene
     summarizing abundance
     summarizing counts
     summarizing length
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using counts and average transcript lengths from tximport
     🚀 Running DESeq2 pipeline...
     estimating size factors
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     using 'avgTxLength' from assays(dds), correcting for library size
     estimating dispersions
     gene-wise dispersion estimates
     mean-dispersion relationship
       Note: levels of factors in the design contain characters other than
       letters, numbers, '_' and '.'. It is recommended (but not required) to use
       only letters, numbers, and delimiters '_' or '.', as these are safe characters
       for column names in R. [This is a message, not a warning or an error]
     final dispersion estimates
     fitting model and testing
     Available groups in dds:
     [1] "AYE-O_Azi_20_solid"           "AYE-O_ctrl_0_liquid"
     [3] "AYE-O_ctrl_0_solid"           "AYE-O_Diclo_375_liquid"
     [5] "AYE-O_Mero_0.5_liquid"        "AYE-O_Rifampicin_2_liquid"
     [7] "AYE-T_Azi_20_solid"           "AYE-T_ctrl_0_liquid"
     [9] "AYE-T_ctrl_0_solid"           "AYE-T_Diclo_375_liquid"
     [11] "AYE-T_Mero_0.15_liquid"       "AYE-T_Rifampicin_2_liquid"
     [13] "AYE-WT_Azi_20_solid"          "AYE-WT_ctrl_0_liquid"
     [15] "AYE-WT_ctrl_0_solid"          "AYE-WT_Diclo_1250_solid"
     [17] "AYE-WT_Diclo_750_liquid"      "AYE-WT_Mero_0.35-0.5_liquid"
     [19] "AYE-WT_Rifampicin_1.5_liquid" "F_Azi_20_solid"
     [21] "F_ctrl_0_solid"               "O-Trans_ctrl_0_liquid"
     [23] "O-Trans_Diclo_375_liquid"     "O-Trans_Mero_0.25_liquid"
     [25] "O-Trans_Rifampicin_2_liquid"  "WT-Trans_ctrl_0_liquid"
     [27] "WT-Trans_Diclo_750_liquid"
     📖 Parsing annotation file for gene names...
    
     🚀 PROCESSING  38  COMPARISONS
     Base Thresholds: padj <= 0.05, |log2FC| >= 2 (Dynamic adjustments apply per comparison)
     ================================================================================
    
     [01/38] 01_AYE_T_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 16, Down: 151)
    
     [02/38] 02_AYE_O_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 9, Down: 9)
    
     [03/38] 03_O_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 44)
    
     [04/38] 04_WT_Trans_ctr_vs_AYE_WT_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 15, Down: 153)
    
     [05/38] 05_AYE_O_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 28, Down: 3)
    
     [06/38] 06_O_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 11, Down: 10)
    
     [07/38] 07_WT_Trans_ctr_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 1.4
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 4, Down: 14)
    
     [08/38] 08_AYE_WT_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 485, Down: 324)
    
     [09/38] 09_AYE_O_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 475, Down: 309)
    
     [10/38] 10_AYE_T_ctr_solid_vs_liquid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 501, Down: 332)
    
     [11/38] 11_AYE_O_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 5)
    
     [12/38] 12_AYE_T_ctr_solid_vs_AYE_WT_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 5, Down: 8)
    
     [13/38] 13_AYE_WT_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 12, Down: 45)
    
     [14/38] 14_AYE_T_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 75, Down: 181)
    
     [15/38] 15_AYE_O_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 26, Down: 240)
    
     [16/38] 16_O_Trans_Diclo375_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 17, Down: 82)
    
     [17/38] 17_WT_Trans_Diclo750_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 25, Down: 230)
    
     [18/38] 18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 66, Down: 26)
    
     [19/38] 19_AYE_WT_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 359, Down: 175)
    
     [20/38] 20_AYE_T_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 93, Down: 110)
    
     [21/38] 21_AYE_O_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 267, Down: 244)
    
     [22/38] 22_O_Trans_Mero_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 183, Down: 153)
    
     [23/38] 23_AYE_T_Mero_vs_AYE_T_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [24/38] 24_AYE_WT_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 341, Down: 329)
    
     [25/38] 25_AYE_T_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 510, Down: 433)
    
     [26/38] 26_AYE_O_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 484, Down: 443)
    
     [27/38] 27_F_Azi_solid_vs_Ctrl_solid
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 147, Down: 271)
    
     [28/38] 28_AYE_WT_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 11)
    
     [29/38] 29_AYE_T_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 34, Down: 41)
    
     [30/38] 30_AYE_O_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 71, Down: 100)
    
     [31/38] 31_O_Trans_Rif_vs_Ctrl
       -> Using dynamic LFC cutoff: |log2FC| >= 1.2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 6, Down: 10)
    
     [32/38] 32_AYE_T_Diclo375_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 47, Down: 6)
    
     [33/38] 33_AYE_O_Diclo375_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 8, Down: 73)
    
     [34/38] 34_O_Trans_Diclo375_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 2, Down: 17)
    
     [35/38] 35_WT_Trans_Diclo750_vs_WT_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 10, Down: 27)
    
     [36/38] 36_AYE_T_Mero_vs_AYE_T_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 91, Down: 27)
    
     [37/38] 37_AYE_O_Mero_vs_AYE_O_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 242, Down: 135)
    
     [38/38] 38_O_Trans_Mero_vs_O_Trans_ctr
       -> Using dynamic LFC cutoff: |log2FC| >= 2
    
     🔍 DEBUG: Columns in res_df  : baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean
     🔍 DEBUG: Columns in anno_df : gene_id, gene_id_clean, gene_name
     🔍 DEBUG: Columns after join:  baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, GeneID, GeneID_clean, gene_id, gene_name
       ✅ Annotation step completed successfully.
       ✅ Excel + Volcano saved (Up: 177, Down: 50)
    
     ================================================================================
     📊 FINAL SUMMARY OF ALL 38 COMPARISONS
                                         name total  up down sig_total pct_sig
               # |log2FC| >= 1.4
                   01_AYE_T_ctr_vs_AYE_WT_ctr  3609  16  151       167     4.6
                   02_AYE_O_ctr_vs_AYE_WT_ctr  3609   9    9        18     0.5
                  03_O_Trans_ctr_vs_AYE_WT_ctr  3609  10   44        54     1.5
               04_WT_Trans_ctr_vs_AYE_WT_ctr  3609  15  153       168     4.7
                   05_AYE_O_ctr_vs_AYE_T_ctr  3609  28    3        31     0.9
                 06_O_Trans_ctr_vs_AYE_T_ctr  3609  11   10        21     0.6
                 07_WT_Trans_ctr_vs_AYE_T_ctr  3609   4   14        18     0.5
    
               # |log2FC| >= 2
               08_AYE_WT_ctr_solid_vs_liquid  3609 485  324       809    22.4
                 09_AYE_O_ctr_solid_vs_liquid  3609 475  309       784    21.7
                 10_AYE_T_ctr_solid_vs_liquid  3609 501  332       833    23.1
           11_AYE_O_ctr_solid_vs_AYE_WT_solid  3609  12    5        17     0.5
           12_AYE_T_ctr_solid_vs_AYE_WT_solid  3609   5    8        13     0.4
                   13_AYE_WT_Diclo750_vs_Ctrl  3609  12   45        57     1.6
                   14_AYE_T_Diclo375_vs_Ctrl  3609  75  181       256     7.1
                   15_AYE_O_Diclo375_vs_Ctrl  3609  26  240       266     7.4
                 16_O_Trans_Diclo375_vs_Ctrl  3609  17   82        99     2.7
                 17_WT_Trans_Diclo750_vs_Ctrl  3609  25  230       255     7.1
     18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid  3609  66   26        92     2.5
                       19_AYE_WT_Mero_vs_Ctrl  3609 359  175       534    14.8
                       20_AYE_T_Mero_vs_Ctrl  3609  93  110       203     5.6
                       21_AYE_O_Mero_vs_Ctrl  3609 267  244       511    14.2
                     22_O_Trans_Mero_vs_Ctrl  3609 183  153       336     9.3
                 23_AYE_T_Mero_vs_AYE_T_Ctrl  3609  91   27       118     3.3
           24_AYE_WT_Azi_solid_vs_Ctrl_solid  3609 341  329       670    18.6
             25_AYE_T_Azi_solid_vs_Ctrl_solid  3609 510  433       943    26.1
             26_AYE_O_Azi_solid_vs_Ctrl_solid  3609 484  443       927    25.7
                 27_F_Azi_solid_vs_Ctrl_solid  3609 147  271       418    11.6
    
               # |log2FC| >= 1.2
                       28_AYE_WT_Rif_vs_Ctrl  3609   8   11        19     0.5
                         29_AYE_T_Rif_vs_Ctrl  3609  34   41        75     2.1
                         30_AYE_O_Rif_vs_Ctrl  3609  71  100       171     4.7
                       31_O_Trans_Rif_vs_Ctrl  3609   6   10        16     0.4
    
               # |log2FC| >= 2
                                         name total  up down sig_total
               32_AYE_T_Diclo375_vs_AYE_T_ctr  3609  46    6        52
               33_AYE_O_Diclo375_vs_AYE_O_ctr  3609   8   73        81     2.2
           34_O_Trans_Diclo375_vs_O_Trans_ctr  3609   2   17        19     0.5
         35_WT_Trans_Diclo750_vs_WT_Trans_ctr  3609  10   27        37     1.0
                   36_AYE_T_Mero_vs_AYE_T_ctr  3609  91   27       118     3.3
                   37_AYE_O_Mero_vs_AYE_O_ctr  3609 242  135       377    10.4
               38_O_Trans_Mero_vs_O_Trans_ctr  3609 177   50       227     6.3
    
     ✨ All files saved to: /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete
     📁 Each comparison contains:
         1️ DEG_
    .xlsx (3 sheets: Complete_Results, Up_Regulated, Down_Regulated) 2️ Volcano_ .png (GeneName labels for top significant genes) 📤 EXPORTING COMPLETE RESULTS TO CSV FOR KEGG/GO… ✅ Successfully exported 38 CSV files to DEG_Results_Complete
  10. KEGG and GO annotations in non-model organisms

    (a) Rifampicin: use genes with a cutoff of log2 fold change > 1.2 and 1.4 and < -1.4 for the KEGG and GO analyses. (c) All other comparisons: please retain the same selection criteria as in the previous analysis you sent to me.

https://www.biobam.com/functional-analysis/

10.1. Assign KEGG and GO Terms (see diagram above)

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

* Preparing file 1 eggnog_out.emapper.annotations.txt for the R-code below: (KEGG Terms): EggNog based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

    Install EggNOG-mapper:

        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env

    Run annotation:

        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        #Download CU459141_protein_.fasta from NCBI
        python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta
        emapper.py -i CU459141_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.

    Extract KEGG KO IDs from annotations.emapper.annotations.

* Preparing file 2 blast2go_annot.annot2_ for the R-code below:

  - Basic (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    * 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    * Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    * Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    * Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished."
            * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
            * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

  - Advanced (GO Terms from 'Blast2GO 5 Basic'): Interpro based protein families / domains --> Button interpro

    * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

  - MERGE the results of InterPro GO IDs (advanced) to GO IDs (basic) and generate final GO IDs, saved in blast2go_annot.annot2

    * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)."
    * (NOT_USED) Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

  - PREPARING go_terms and ec_terms: annot_* file (NOTE that blast2go_annot.annot2 is after merging InterPro_GO_IDs and GO_IDs):

    cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_

10.2. Perform KEGG and GO Enrichment in R

      (r_env) cd /mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete

      #For |deg_cutoff_log_foldchange| >=1.4
      sed "s/01_AYE_T_ctr_vs_AYE_WT_ctr/02_AYE_O_ctr_vs_AYE_WT_ctr/g" 1.R > 2.R
      ...

      #For |deg_cutoff_log_foldchange| >=2.0
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/09_AYE_O_ctr_solid_vs_liquid/g" 8.R > 9.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/10_AYE_T_ctr_solid_vs_liquid/g" 8.R > 10.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/11_AYE_O_ctr_solid_vs_AYE_WT_solid/g" 8.R > 11.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/12_AYE_T_ctr_solid_vs_AYE_WT_solid/g" 8.R > 12.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/13_AYE_WT_Diclo750_vs_Ctrl/g" 8.R > 13.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/14_AYE_T_Diclo375_vs_Ctrl/g" 8.R > 14.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/15_AYE_O_Diclo375_vs_Ctrl/g" 8.R > 15.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/16_O_Trans_Diclo375_vs_Ctrl/g" 8.R > 16.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/17_WT_Trans_Diclo750_vs_Ctrl/g" 8.R > 17.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid/g" 8.R > 18.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/19_AYE_WT_Mero_vs_Ctrl/g" 8.R > 19.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/20_AYE_T_Mero_vs_Ctrl/g" 8.R > 20.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/21_AYE_O_Mero_vs_Ctrl/g" 8.R > 21.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/22_O_Trans_Mero_vs_Ctrl/g" 8.R > 22.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/23_AYE_T_Mero_vs_AYE_T_Ctrl/g" 8.R > 23.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/24_AYE_WT_Azi_solid_vs_Ctrl_solid/g" 8.R > 24.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/25_AYE_T_Azi_solid_vs_Ctrl_solid/g" 8.R > 25.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/26_AYE_O_Azi_solid_vs_Ctrl_solid/g" 8.R > 26.R
      sed "s/08_AYE_WT_ctr_solid_vs_liquid/27_F_Azi_solid_vs_Ctrl_solid/g" 8.R > 27.R

      #For |deg_cutoff_log_foldchange| >=1.2
      sed "s/28_AYE_WT_Rif_vs_Ctrl/29_AYE_T_Rif_vs_Ctrl/g" 28.R > 29.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/30_AYE_O_Rif_vs_Ctrl/g" 28.R > 30.R
      sed "s/28_AYE_WT_Rif_vs_Ctrl/31_O_Trans_Rif_vs_Ctrl/g" 28.R > 31.R

      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 1.R
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid KEGG IDs: 4
      #  Enriched pathways: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 50
      #  Enriched pathways: 4
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #'select()' returned 1:1 mapping between keys and columns
      #=== SUMMARY ===
      #Up-regulated genes: 16
      #  Valid GO IDs: 16
      #  Enriched GO-terms: 0
      #Down-regulated genes: 151
      #  Valid KEGG IDs: 151
      #  Enriched GO-terms: 3
      #...

10.3. Finalizing the KEGG and GO Enrichment table

      1. NOTE (Already realized in the code): geneIDs in KEGG_Enrichment have been already translated from ko to geneID in H0N29_*-format; If not, nachmachen using eggnog-res, 因为 eggnog里有1-1-mspping Info between ko-Name and GeneID.
      2. NEED_MANUAL_DELETION (Already setting the cutoff in the code): p.adjust values have been calculated, we have to filter all records in GO_Enrichment-results by |p.adjust|<=0.05. DON'T_NEED_ANY_MORE, since pvalueCutoff = 0.05 settings in enricher. Alternative using pvalueCutoff=1.0, marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment.
      3. NOTE (Not occuring in the new dataset): In rare case, the description is missing for some IDs, e.g. GO term: GO:0006807: replace GO:0006807    obsolete nitrogen compound metabolic process;  ko00975: Metabolism, Biosynthesis of other secondary metabolites
  1. KEGG and GO enrichments via 1.R, 2.R, …

      #Update "01_AYE_T_ctr_vs_AYE_WT_ctr" with "02_AYE_O_ctr_vs_AYE_WT_ctr" in 1.R and save the updated script as 2.R.
      (r_env) jhuang@WS-2290C:/mnt/md1/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete$ Rscript 2.R
      === SUMMARY ===
      Up-regulated genes: 9
        Valid KEGG IDs: 1
        Enriched pathways: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 6
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 9
        Valid GO IDs: 9
        Enriched GO-terms: 0
      Down-regulated genes: 9
        Valid KEGG IDs: 9
        Enriched GO-terms: 2
    
      Rscript 3.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 2
        Enriched pathways: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 20
        Enriched pathways: 6
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 44
        Valid KEGG IDs: 44
        Enriched GO-terms: 0
    
      Rscript 4.R
      === SUMMARY ===
      Up-regulated genes: 15
        Valid KEGG IDs: 7
        Enriched pathways: 6
      Down-regulated genes: 152
        Valid KEGG IDs: 43
        Enriched pathways: 11
      === SUMMARY ===
      Up-regulated genes: 15
        Valid GO IDs: 15
        Enriched GO-terms: 0
      Down-regulated genes: 152
        Valid KEGG IDs: 152
        Enriched GO-terms: 0
    
      Rscript 5.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 6.R
      === SUMMARY ===
      Up-regulated genes: 28
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 3
        Valid KEGG IDs: 2
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 28
        Valid GO IDs: 28
        Enriched GO-terms: 0
      Down-regulated genes: 3
        Valid KEGG IDs: 3
        Enriched GO-terms: 1
    
      Rscript 7.R
      === SUMMARY ===
      Up-regulated genes: 4
        Valid KEGG IDs: 0
        Enriched pathways: 0
      Down-regulated genes: 14
        Valid KEGG IDs: 10
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 4
        Valid GO IDs: 4
        Enriched GO-terms: 1
      Down-regulated genes: 14
        Valid KEGG IDs: 14
        Enriched GO-terms: 3
    
      Rscript 8.R
      === SUMMARY ===
      Up-regulated genes: 479
        Valid KEGG IDs: 246
        Enriched pathways: 30
      Down-regulated genes: 322
        Valid KEGG IDs: 186
        Enriched pathways: 13
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 479
        Valid GO IDs: 479
        Enriched GO-terms: 4
      Down-regulated genes: 322
        Valid KEGG IDs: 322
        Enriched GO-terms: 6
    
      Rscript 9.R
      === SUMMARY ===
      Up-regulated genes: 472
        Valid KEGG IDs: 230
        Enriched pathways: 25
      Down-regulated genes: 306
        Valid KEGG IDs: 217
        Enriched pathways: 19
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 472
        Valid GO IDs: 472
        Enriched GO-terms: 3
      Down-regulated genes: 306
        Valid KEGG IDs: 306
        Enriched GO-terms: 6
    
      Rscript 10.R
      === SUMMARY ===
      Up-regulated genes: 496
        Valid KEGG IDs: 236
        Enriched pathways: 28
      Down-regulated genes: 330
        Valid KEGG IDs: 250
        Enriched pathways: 26
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 496
        Valid GO IDs: 496
        Enriched GO-terms: 2
      Down-regulated genes: 330
        Valid KEGG IDs: 330
        Enriched GO-terms: 6
    
      Rscript 11.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 2
        Enriched pathways: 5
      Down-regulated genes: 5
        Valid KEGG IDs: 2
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 0
      Down-regulated genes: 5
        Valid KEGG IDs: 5
        Enriched GO-terms: 0
    
      Rscript 12.R
      === SUMMARY ===
      Up-regulated genes: 5
        Valid KEGG IDs: 1
        Enriched pathways: 3
      Down-regulated genes: 8
        Valid KEGG IDs: 3
        Enriched pathways: 2
      === SUMMARY ===
      Up-regulated genes: 5
        Valid GO IDs: 5
        Enriched GO-terms: 0
      Down-regulated genes: 8
        Valid KEGG IDs: 8
        Enriched GO-terms: 0
    
      Rscript 13.R
      === SUMMARY ===
      Up-regulated genes: 12
        Valid KEGG IDs: 8
        Enriched pathways: 4
      Down-regulated genes: 45
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 12
        Valid GO IDs: 12
        Enriched GO-terms: 1
      Down-regulated genes: 45
        Valid KEGG IDs: 45
        Enriched GO-terms: 0
    
      Rscript 14.R
      === SUMMARY ===
      Up-regulated genes: 73
        Valid KEGG IDs: 37
        Enriched pathways: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 29
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 73
        Valid GO IDs: 73
        Enriched GO-terms: 5
      Down-regulated genes: 180
        Valid KEGG IDs: 180
        Enriched GO-terms: 0
    
      Rscript 15.R
      === SUMMARY ===
      Up-regulated genes: 26
        Valid KEGG IDs: 14
        Enriched pathways: 10
      Down-regulated genes: 239
        Valid KEGG IDs: 38
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 26
        Valid GO IDs: 26
        Enriched GO-terms: 0
      Down-regulated genes: 239
        Valid KEGG IDs: 239
        Enriched GO-terms: 0
    
      Rscript 16.R
      === SUMMARY ===
      Up-regulated genes: 17
        Valid KEGG IDs: 8
        Enriched pathways: 11
      Down-regulated genes: 82
        Valid KEGG IDs: 14
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 17
        Valid GO IDs: 17
        Enriched GO-terms: 0
      Down-regulated genes: 82
        Valid KEGG IDs: 82
        Enriched GO-terms: 2
    
      Rscript 17.R
      === SUMMARY ===
      Up-regulated genes: 25
        Valid KEGG IDs: 15
        Enriched pathways: 3
      Down-regulated genes: 229
        Valid KEGG IDs: 46
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 25
        Valid GO IDs: 25
        Enriched GO-terms: 2
      Down-regulated genes: 229
        Valid KEGG IDs: 229
        Enriched GO-terms: 1
    
      Rscript 18.R
      === SUMMARY ===
      Up-regulated genes: 66
        Valid KEGG IDs: 18
        Enriched pathways: 8
      Down-regulated genes: 26
        Valid KEGG IDs: 19
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 66
        Valid GO IDs: 66
        Enriched GO-terms: 5
      Down-regulated genes: 26
        Valid KEGG IDs: 26
        Enriched GO-terms: 3
    
      Rscript 19.R
      === SUMMARY ===
      Up-regulated genes: 355
        Valid KEGG IDs: 200
        Enriched pathways: 25
      Down-regulated genes: 175
        Valid KEGG IDs: 70
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 355
        Valid GO IDs: 355
        Enriched GO-terms: 6
      Down-regulated genes: 175
        Valid KEGG IDs: 175
        Enriched GO-terms: 2
    
      Rscript 20.R
      === SUMMARY ===
      Up-regulated genes: 93
        Valid KEGG IDs: 48
        Enriched pathways: 21
      Down-regulated genes: 110
        Valid KEGG IDs: 32
        Enriched pathways: 4
      === SUMMARY ===
      Up-regulated genes: 93
        Valid GO IDs: 93
        Enriched GO-terms: 0
      Down-regulated genes: 110
        Valid KEGG IDs: 110
        Enriched GO-terms: 0
    
      Rscript 21.R
      === SUMMARY ===
      Up-regulated genes: 263
        Valid KEGG IDs: 154
        Enriched pathways: 26
      Down-regulated genes: 244
        Valid KEGG IDs: 72
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 263
        Valid GO IDs: 263
        Enriched GO-terms: 8
      Down-regulated genes: 244
        Valid KEGG IDs: 244
        Enriched GO-terms: 5
    
      Rscript 22.R
      === SUMMARY ===
      Up-regulated genes: 182
        Valid KEGG IDs: 115
        Enriched pathways: 23
      Down-regulated genes: 153
        Valid KEGG IDs: 40
        Enriched pathways: 3
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 182
        Valid GO IDs: 182
        Enriched GO-terms: 3
      Down-regulated genes: 153
        Valid KEGG IDs: 153
        Enriched GO-terms: 2
    
      Rscript 23.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 24.R
      === SUMMARY ===
      Up-regulated genes: 333
        Valid KEGG IDs: 193
        Enriched pathways: 12
      Down-regulated genes: 328
        Valid KEGG IDs: 175
        Enriched pathways: 24
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 333
        Valid GO IDs: 333
        Enriched GO-terms: 4
      Down-regulated genes: 328
        Valid KEGG IDs: 328
        Enriched GO-terms: 4
    
      Rscript 25.R
      === SUMMARY ===
      Up-regulated genes: 499
        Valid KEGG IDs: 264
        Enriched pathways: 19
      Down-regulated genes: 430
        Valid KEGG IDs: 239
        Enriched pathways: 36
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 499
        Valid GO IDs: 499
        Enriched GO-terms: 2
      Down-regulated genes: 430
        Valid KEGG IDs: 430
        Enriched GO-terms: 0
    
      Rscript 26.R
      === SUMMARY ===
      Up-regulated genes: 474
        Valid KEGG IDs: 267
        Enriched pathways: 19
      Down-regulated genes: 440
        Valid KEGG IDs: 216
        Enriched pathways: 32
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 474
        Valid GO IDs: 474
        Enriched GO-terms: 2
      Down-regulated genes: 440
        Valid KEGG IDs: 440
        Enriched GO-terms: 1
    
      Rscript 27.R
      === SUMMARY ===
      Up-regulated genes: 142
        Valid KEGG IDs: 69
        Enriched pathways: 1
      Down-regulated genes: 269
        Valid KEGG IDs: 148
        Enriched pathways: 27
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 142
        Valid GO IDs: 142
        Enriched GO-terms: 5
      Down-regulated genes: 269
        Valid KEGG IDs: 269
        Enriched GO-terms: 3
    
      Rscript 28.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 6
        Enriched pathways: 2
      Down-regulated genes: 11
        Valid KEGG IDs: 3
        Enriched pathways: 0
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 11
        Valid KEGG IDs: 11
        Enriched GO-terms: 0
    
      Rscript 29.R
      === SUMMARY ===
      Up-regulated genes: 34
        Valid KEGG IDs: 11
        Enriched pathways: 6
      Down-regulated genes: 41
        Valid KEGG IDs: 22
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 34
        Valid GO IDs: 34
        Enriched GO-terms: 0
      Down-regulated genes: 41
        Valid KEGG IDs: 41
    
      Rscript 30.R
      === SUMMARY ===
      Up-regulated genes: 70
        Valid KEGG IDs: 47
        Enriched pathways: 16
      Down-regulated genes: 99
        Valid KEGG IDs: 40
        Enriched pathways: 3
      === SUMMARY ===
      Up-regulated genes: 70
        Valid GO IDs: 70
        Enriched GO-terms: 0
      Down-regulated genes: 99
        Valid KEGG IDs: 99
        Enriched GO-terms: 0
    
      Rscript 31.R
      === SUMMARY ===
      Up-regulated genes: 6
        Valid KEGG IDs: 3
        Enriched pathways: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      === SUMMARY ===
      Up-regulated genes: 6
        Valid GO IDs: 6
        Enriched GO-terms: 0
      Down-regulated genes: 10
        Valid KEGG IDs: 10
        Enriched GO-terms: 5
    
      Rscript 32.R
      === SUMMARY ===
      Up-regulated genes: 46
        Valid KEGG IDs: 18
        Enriched pathways: 1
      Down-regulated genes: 6
        Valid KEGG IDs: 0
        Enriched pathways: 0
      No gene sets have size between 10 and 500 ...
      --> return NULL...
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 46
        Valid GO IDs: 46
        Enriched GO-terms: 4
      Down-regulated genes: 6
        Valid KEGG IDs: 6
        Enriched GO-terms: 0
    
      Rscript 33.R
      === SUMMARY ===
      Up-regulated genes: 8
        Valid KEGG IDs: 9
        Enriched pathways: 8
      Down-regulated genes: 72
        Valid KEGG IDs: 7
        Enriched pathways: 3
    
      === SUMMARY ===
      Up-regulated genes: 8
        Valid GO IDs: 8
        Enriched GO-terms: 0
      Down-regulated genes: 72
        Valid KEGG IDs: 72
        Enriched GO-terms: 0
    
      Rscript 34.R
      === SUMMARY ===
      Up-regulated genes: 2
        Valid KEGG IDs: 2
        Enriched pathways: 3
      Down-regulated genes: 17
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 2
        Valid GO IDs: 2
        Enriched GO-terms: 0
      Down-regulated genes: 17
        Valid KEGG IDs: 17
        Enriched GO-terms: 0
    
      Rscript 35.R
      === SUMMARY ===
      Up-regulated genes: 10
        Valid KEGG IDs: 5
        Enriched pathways: 5
      Down-regulated genes: 27
        Valid KEGG IDs: 3
        Enriched pathways: 0
    
      === SUMMARY ===
      Up-regulated genes: 10
        Valid GO IDs: 10
        Enriched GO-terms: 0
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 0
    
      Rscript 36.R
      === SUMMARY ===
      Up-regulated genes: 91
        Valid KEGG IDs: 37
        Enriched pathways: 17
      Down-regulated genes: 27
        Valid KEGG IDs: 14
        Enriched pathways: 1
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 91
        Valid GO IDs: 91
        Enriched GO-terms: 1
      Down-regulated genes: 27
        Valid KEGG IDs: 27
        Enriched GO-terms: 4
    
      Rscript 37.R
      === SUMMARY ===
      Up-regulated genes: 238
        Valid KEGG IDs: 141
        Enriched pathways: 28
      Down-regulated genes: 134
        Valid KEGG IDs: 35
        Enriched pathways: 2
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 238
        Valid GO IDs: 238
        Enriched GO-terms: 2
      Down-regulated genes: 134
        Valid KEGG IDs: 134
        Enriched GO-terms: 4
    
      Rscript 38.R
      === SUMMARY ===
      Up-regulated genes: 177
        Valid KEGG IDs: 104
        Enriched pathways: 22
      Down-regulated genes: 50
        Valid KEGG IDs: 13
        Enriched pathways: 4
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
      'select()' returned 1:1 mapping between keys and columns
    
      === SUMMARY ===
      Up-regulated genes: 177
        Valid GO IDs: 177
        Enriched GO-terms: 2
      Down-regulated genes: 50
        Valid KEGG IDs: 50
        Enriched GO-terms: 4
  2. (DEPRECATED due to be WRONG and TIME-CONSUMING) KEGG and GO enrichments by replacing the sample name in the R-code, and by copying and pasting the R-code directly on console

      # For your reference, here is the exact list of the 31 comparisons and their assigned groupings:
    
      # Baseline / Strain Controls (|log2FC| >= 1.4)
      "01_AYE_T_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-T_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "02_AYE_O_ctr_vs_AYE_WT_ctr"       = c("group", "AYE-O_ctrl_0_liquid",   "AYE-WT_ctrl_0_liquid"),
      "03_O_Trans_ctr_vs_AYE_WT_ctr"     = c("group", "O-Trans_ctrl_0_liquid", "AYE-WT_ctrl_0_liquid"),
      "04_WT_Trans_ctr_vs_AYE_WT_ctr"    = c("group", "WT-Trans_ctrl_0_liquid","AYE-WT_ctrl_0_liquid"),
      "05_AYE_O_ctr_vs_AYE_T_ctr"        = c("group", "AYE-O_ctrl_0_liquid",   "AYE-T_ctrl_0_liquid"),
      "06_O_Trans_ctr_vs_AYE_T_ctr"      = c("group", "O-Trans_ctrl_0_liquid", "AYE-T_ctrl_0_liquid"),
      "07_WT_Trans_ctr_vs_AYE_T_ctr"     = c("group", "WT-Trans_ctrl_0_liquid","AYE-T_ctrl_0_liquid"),
    
      # Condition Effects (|log2FC| >= 2.0)
      "08_AYE_WT_ctr_solid_vs_liquid"    = c("group", "AYE-WT_ctrl_0_solid",   "AYE-WT_ctrl_0_liquid"),
      "09_AYE_O_ctr_solid_vs_liquid"     = c("group", "AYE-O_ctrl_0_solid",    "AYE-O_ctrl_0_liquid"),
      "10_AYE_T_ctr_solid_vs_liquid"     = c("group", "AYE-T_ctrl_0_solid",    "AYE-T_ctrl_0_liquid"),
      "11_AYE_O_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-O_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
      "12_AYE_T_ctr_solid_vs_AYE_WT_solid"=c("group", "AYE-T_ctrl_0_solid",    "AYE-WT_ctrl_0_solid"),
    
      # Diclofenac (|log2FC| >= 2.0)
      "13_AYE_WT_Diclo750_vs_Ctrl"       = c("group", "AYE-WT_Diclo_750_liquid",   "AYE-WT_ctrl_0_liquid"),
      "14_AYE_T_Diclo375_vs_Ctrl"        = c("group", "AYE-T_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "15_AYE_O_Diclo375_vs_Ctrl"        = c("group", "AYE-O_Diclo_375_liquid",    "AYE-WT_ctrl_0_liquid"),
      "16_O_Trans_Diclo375_vs_Ctrl"      = c("group", "O-Trans_Diclo_375_liquid",  "AYE-WT_ctrl_0_liquid"),
      "17_WT_Trans_Diclo750_vs_Ctrl"     = c("group", "WT-Trans_Diclo_750_liquid", "AYE-WT_ctrl_0_liquid"),
      "18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid" = c("group", "AYE-WT_Diclo_1250_solid", "AYE-WT_ctrl_0_solid"),
    
      # Meropenem (|log2FC| >= 2.0)
      "19_AYE_WT_Mero_vs_Ctrl"           = c("group", "AYE-WT_Mero_0.35-0.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "20_AYE_T_Mero_vs_Ctrl"            = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-WT_ctrl_0_liquid"),
      "21_AYE_O_Mero_vs_Ctrl"            = c("group", "AYE-O_Mero_0.5_liquid",       "AYE-WT_ctrl_0_liquid"),
      "22_O_Trans_Mero_vs_Ctrl"          = c("group", "O-Trans_Mero_0.25_liquid",    "AYE-WT_ctrl_0_liquid"),
      "23_AYE_T_Mero_vs_AYE_T_Ctrl"      = c("group", "AYE-T_Mero_0.15_liquid",      "AYE-T_ctrl_0_liquid"),
    
      # Azithromycin (Solid) (|log2FC| >= 2.0)
      "24_AYE_WT_Azi_solid_vs_Ctrl_solid"= c("group", "AYE-WT_Azi_20_solid", "AYE-WT_ctrl_0_solid"),
      "25_AYE_T_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-T_Azi_20_solid",  "AYE-T_ctrl_0_solid"),
      "26_AYE_O_Azi_solid_vs_Ctrl_solid" = c("group", "AYE-O_Azi_20_solid",  "AYE-O_ctrl_0_solid"),
      "27_F_Azi_solid_vs_Ctrl_solid"     = c("group", "F_Azi_20_solid",      "F_ctrl_0_solid"),
    
      # Rifampicin (|log2FC| >= 1.2)
      "28_AYE_WT_Rif_vs_Ctrl"            = c("group", "AYE-WT_Rifampicin_1.5_liquid", "AYE-WT_ctrl_0_liquid"),
      "29_AYE_T_Rif_vs_Ctrl"             = c("group", "AYE-T_Rifampicin_2_liquid",    "AYE-T_ctrl_0_liquid"),
      "30_AYE_O_Rif_vs_Ctrl"             = c("group", "AYE-O_Rifampicin_2_liquid",    "AYE-O_ctrl_0_liquid"),
      "31_O_Trans_Rif_vs_Ctrl"           = c("group", "O-Trans_Rifampicin_2_liquid",  "O-Trans_ctrl_0_liquid")
    
      ./DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv
      ./DEG_02_AYE_O_ctr_vs_AYE_WT_ctr.csv
      ./DEG_03_O_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_04_WT_Trans_ctr_vs_AYE_WT_ctr.csv
      ./DEG_05_AYE_O_ctr_vs_AYE_T_ctr.csv
      ./DEG_06_O_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_07_WT_Trans_ctr_vs_AYE_T_ctr.csv
      ./DEG_08_AYE_WT_ctr_solid_vs_liquid.csv
      ./DEG_09_AYE_O_ctr_solid_vs_liquid.csv
      ./DEG_10_AYE_T_ctr_solid_vs_liquid.csv
      ./DEG_11_AYE_O_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_12_AYE_T_ctr_solid_vs_AYE_WT_solid.csv
      ./DEG_13_AYE_WT_Diclo750_vs_Ctrl.csv
      ./DEG_14_AYE_T_Diclo375_vs_Ctrl.csv
      ./DEG_15_AYE_O_Diclo375_vs_Ctrl.csv
      ./DEG_16_O_Trans_Diclo375_vs_Ctrl.csv
      ./DEG_17_WT_Trans_Diclo750_vs_Ctrl.csv
      ./DEG_18_AYE_WT_Diclo1250_solid_vs_Ctrl_solid.csv
      ./DEG_19_AYE_WT_Mero_vs_Ctrl.csv
      ./DEG_20_AYE_T_Mero_vs_Ctrl.csv
      ./DEG_21_AYE_O_Mero_vs_Ctrl.csv
      ./DEG_22_O_Trans_Mero_vs_Ctrl.csv
      ./DEG_23_AYE_T_Mero_vs_AYE_T_Ctrl.csv
      ./DEG_24_AYE_WT_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_25_AYE_T_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_26_AYE_O_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_27_F_Azi_solid_vs_Ctrl_solid.csv
      ./DEG_28_AYE_WT_Rif_vs_Ctrl.csv
      ./DEG_29_AYE_T_Rif_vs_Ctrl.csv
      ./DEG_30_AYE_O_Rif_vs_Ctrl.csv
      ./DEG_31_O_Trans_Rif_vs_Ctrl.csv
    
        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        setwd("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/results/star_salmon/DEG_Results_Complete_manually")
    
        # 1. Blast2GO: Extract GO & EC terms (Primary source)
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_AYE/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
        colnames(annot_df) <- c("GeneID", "Term")
    
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        res <- read.csv("DEG_01_AYE_T_ctr_vs_AYE_WT_ctr.csv")
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2026_Dicl_Mero_Azith_Rifa_on_AYE/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # Filter up- and down-regulated genes (NEEDS_TO_BE_ADAPTED_BY_1.4_or_2.0_or_1.2)
        up_regulated <- res_updated[res_updated$log2FoldChange >= 1.4 & res_updated$padj <= 0.05, ]
        down_regulated <- res_updated[res_updated$log2FoldChange <= -1.4 & res_updated$padj <= 0.05, ]
    
        # Create a new workbook
        wb <- createWorkbook()
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        saveWorkbook(wb, "Gene_Expression_with_Annotations_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
    
        # ---------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
    
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
    
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
    
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
    
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # -----------------------------------------------------------
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
        #saveWorkbook(wb, "KEGG_Enrichment.xlsx", overwrite = TRUE)
    
        # ----------------------------------------
        # ---- Perform GO enrichment analysis ----
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis, where pAdjustMethod is one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 0.05,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
    
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx", overwrite = TRUE)
  3. BUG_1: The Count column does not match the number of gene IDs listed in the geneID column.

        The discrepancy between the Count and the number of listed GeneIDs is not an error, but a result of KEGG's many-to-many mapping between physical genes and KOs.
    
        KEGG enrichment statistics (Count, GeneRatio, p-values) are strictly calculated based on unique KOs, not GeneIDs. The geneID column simply maps these enriched KOs back to our specific genome for display.
        Concrete Examples (see KEGG_and_GO_Enrichments_01_AYE_T_ctr_vs_AYE_WT_ctr.xlsx):
    
            * ko03070 (Bacterial secretion system): Count = 2 (KOs: K02456/K02457). However, both KOs map to the exact same physical gene (ABAYE2071), so only 1 geneID is listed.
            * ko05111 (Biofilm formation): Count = 3 (KOs: K02456/K02457/K03092), which map back to 2 unique genes (ABAYE2071/ABAYE3136).
    
        To ensure full transparency, I have added a new KEGG_ko column (placed between Count and geneID). It explicitly lists the exact KOs contributing to the Count.