Monthly Archives: June 2026

Protected: Online training courses of the Hamburg Chess School for children aged 6-12

Protected: The top 16 of the HJET (Hamburg youth individual tournament) early next year qualify for the HJEM (Hamburger Jugendeinzelmeisterschaften, the Hamburg youth individual championships)

Analyzing WaGa and MKL-1 Cell Line miRNA (Data_Ute_smallRNA_via_exceRpt_workspace)

manhattan_plot_Carmen_custom_labels_WaGa.R

manhattan_plot_Carmen_custom_labels_MKL-1.R

For example, the MKL-1 cell line miRNA analysis produced the following results:

* Raw count data (d_raw_MKL-1.xlsx): Contains the raw, unnormalized read counts for all miRNAs.
* Mapping heatmap (mapping_heatmap3_MKL-1.pdf)
* Volcano plot (MKL.1_wt_EV_vs_MKL.1_wt_cells.png and .svg)
* PCA plot (pca_MKL-1.png)
* Manhattan plot and data (manhattan_plot_MKL1_vs_EV.png, .svg, and manhattan_plot_MKL1_data.xlsx)

Input data

 WaGa wt cells (nf774* (a possible outlier and candidate for removal, but still included in the current version of the analysis), nf961, nf962)
 WaGa wt_EV_RNA (nf657* (EXCLUDED: a clear outlier that does not cluster with the other two samples), nf930, nf935)
 WaGa_sT_DMSO_EV_RNA (nf931, nf936, nf971)
 WaGa_sT_Dox_EV_RNA (nf932, nf937, nf972)
 WaGa_scr_DMSO_EV_RNA (nf933, nf938, nf973)
 WaGa_scr_Dox_EV_RNA (nf934, nf939, nf974)
 # --> In total, 17 samples (after excluding nf657)

 MKL-1 wt cells (nf780*, nf796*, nf797*)
 MKL-1 wt_EV_RNA (nf655* (The sample was EXCLUDED), 2404, 2608)
 MKL-1_sT_DMSO_EV_RNA (2608, 2701, 2802)
 MKL-1_sT_Dox_EV_RNA (2608, 2701, 2802)
 MKL-1_scr_DMSO_EV_RNA (2608, 2701, 2802)
 MKL-1_scr_Dox_EV_RNA (2608, 2701, 2802)
 # --> In total, 18 samples (17 after excluding nf655)

 #Note that the real paths are as follows:
 #./20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz, ./20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz
 #./20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz
 #./20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz
 #./20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz
 #./20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz, ./20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz

Adapter trimming

 #some common adapter sequences from different kits for reference:
 #    - TruSeq Small RNA (Illumina): TGGAATTCTCGGGTGCCAAGG
 #    - Small RNA Kits V1 (Illumina): TCGTATGCCGTCTTCTGCTTGT
 #    - Small RNA Kits V1.5 (Illumina): ATCTCGTATGCCGTCTTCTGCTTG
 #    - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific): TGGAATTCTCGGGTGCCAAGG
 #    - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina): TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC *
 mkdir -p Data_Ute_smallRNA_via_exceRpt_workspace/trimmed && cd Data_Ute_smallRNA_via_exceRpt_workspace/trimmed

 echo "------------------------------------ cutadapting nf774 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf774.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf657 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf657.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf655 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf655.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf780 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf780.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf796 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf796.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf797 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf797.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf930 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf930.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf931 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf931.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf932 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf932.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf933 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf933.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf934 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf934.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf935 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf935.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf936 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf936.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf937 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf937.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf938 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf938.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf939 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf939.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf940 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf940.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf941 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf941.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf961 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf961.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf962 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf962.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf971 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf971.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf972 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf972.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf973 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf973.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting nf974 -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf974.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2404_MKL1_wt_EVs -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2404_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2404_MKL1_wt_EVs/2404_MKL1_wt_EVs_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2608_MKL1_wt_EVs -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_wt_EVs.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_wt_EVs/2608_MKL1_wt_EVs_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2608_MKL1_sT_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_DMSO/2608_MKL1_sT_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2701_MKL1_sT_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_DMSO/2701_MKL1_sT_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2802_MKL1_sT_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_DMSO/2802_MKL1_sT_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2608_MKL1_sT_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_sT_Dox/2608_MKL1_sT_Dox_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2701_MKL1_sT_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_sT_Dox/2701_MKL1_sT_Dox_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2802_MKL1_sT_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_sT_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_sT_Dox/2802_MKL1_sT_Dox_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2608_MKL1_scr_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_DMSO/2608_MKL1_scr_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2701_MKL1_scr_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_DMSO/2701_MKL1_scr_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2802_MKL1_scr_DMSO -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_DMSO.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_DMSO/2802_MKL1_scr_DMSO_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2608_MKL1_scr_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2608_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2608_MKL1_scr_Dox/2608_MKL1_scr_Dox_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2701_MKL1_scr_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2701_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2701_MKL1_scr_Dox/2701_MKL1_scr_Dox_R1.fastq.gz >> LOG

 echo "------------------------------------ cutadapting 2802_MKL1_scr_Dox -----------------------------------" >> LOG
 cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o 2802_MKL1_scr_Dox.fastq.gz ~/DATA/Data_Ute_smallRNA/20260506_AV243904_0073_A/2802_MKL1_scr_Dox/2802_MKL1_scr_Dox_R1.fastq.gz >> LOG
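
The per-sample commands above differ only in the sample ID and the input path, so they could also be generated from a small manifest. The following is a minimal sketch, not the script that was actually run: the function names and the tab-separated manifest (sample ID, then input path) are assumptions made for illustration, and the functions only print the commands.

```shell
#!/bin/sh
# Adapter used for all samples in this post (the LEXOGEN entry marked with * above).
ADAPTER="TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"

# build_cutadapt_cmd SAMPLE INPUT_FASTQ   (hypothetical helper)
# Prints, but does not run, the cutadapt call for one sample,
# using the same options as the commands above.
build_cutadapt_cmd() {
    printf 'cutadapt -a %s -q 20 --minimum-length 5 --trim-n -o %s.fastq.gz %s\n' \
        "$ADAPTER" "$1" "$2"
}

# print_all_cmds MANIFEST   (hypothetical helper)
# MANIFEST is an assumed tab-separated file: sample ID <TAB> input path.
print_all_cmds() {
    while IFS="$(printf '\t')" read -r sample input; do
        [ -n "$sample" ] || continue    # skip empty lines
        build_cutadapt_cmd "$sample" "$input"
    done < "$1"
}

# Demo for a single sample with a placeholder path
build_cutadapt_cmd nf930 /path/to/nf930_R1_001.fastq.gz
```

Printing the commands first keeps the dry-run habit used elsewhere in these notes; after review, the output can be piped to sh and redirected to LOG. Note that a literal ~ inside the manifest would not be expanded, so absolute paths are the safer choice there.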

Install exceRpt (https://github.gersteinlab.org/exceRpt/)

 docker pull rkitchen/excerpt
 mkdir -p /mnt/nvme0n1p1/MyexceRptDatabase
 cd /mnt/nvme0n1p1/MyexceRptDatabase
 wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
 tar -xvf exceRptDB_v4_hg38_lowmem.tgz
 #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg19_lowmem.tgz
 #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz
 #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_mm10_lowmem.tgz
 wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOmiRNArRNA.tgz
 tar -xvf exceRptDB_v4_EXOmiRNArRNA.tgz
 wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOGenomes.tgz
 tar -xvf exceRptDB_v4_EXOGenomes.tgz

 # List extracted hg38 directory structure
 find hg38 -type f | sed 's|^hg38/||' | sort > extracted_hg38.txt
 comm -3 extracted_hg38.txt <(tar -tf exceRptDB_v4_hg38_lowmem.tgz | grep '^hg38/' | sed 's|^hg38/||' | sort)  # --> DIR hg38
 tar -tf exceRptDB_v4_EXOmiRNArRNA.tgz  # --> DIR ribosomeDatabase, NCBI_taxonomy_taxdump, miRBase
 tar -tf exceRptDB_v4_EXOGenomes.tgz  # --> Genomes_BacteriaFungiMammalPlantProtistVirus
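
The archive-versus-disk comparison above can be wrapped in a small function so that it works for any of the three archives. This is a sketch under assumptions: the function name is made up, it uses temporary files instead of process substitution so that it also runs in plain sh, and it assumes the archive entries start with the top-level directory name (as hg38/ does above). Unlike the one-liner, it drops directory entries from the archive listing so that only regular files are compared.

```shell
#!/bin/sh
# compare_tar_to_dir ARCHIVE TOPDIR   (hypothetical helper)
# Prints files that are listed in ARCHIVE but missing under TOPDIR
# (second, tab-indented column) or present on disk but not in the
# archive (first column). No output means both listings agree.
compare_tar_to_dir() {
    archive=$1
    top=$2
    listed=$(mktemp)
    found=$(mktemp)
    # Archive listing: keep entries below TOPDIR, drop directories (trailing slash)
    tar -tf "$archive" | grep "^$top/" | grep -v '/$' | sed "s|^$top/||" | sort > "$listed"
    # Files actually present on disk
    find "$top" -type f | sed "s|^$top/||" | sort > "$found"
    comm -3 "$found" "$listed"
    rm -f "$listed" "$found"
}

# Usage (run inside the database directory):
#   compare_tar_to_dir exceRptDB_v4_hg38_lowmem.tgz hg38
```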

Run exceRpt

 #[---- REAL_RUNNING_COMPLETE_DB ---->]
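 #NOTE: if the input files are not renamed first, every file inside _CORE_RESULTS_v4.6.3.tgz has to be renamed recursively by removing "_cutadapted.fastq" from its name (unzip, rename, re-zip, then mv to ../results_g).
 #The loop below only PRINTS the mv commands that strip ".fastq" from the names (e.g. nf774.fastq.gz -> nf774.gz); review the output, then execute it.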
 cd trimmed
 for file in *.fastq.gz; do
     echo "mv \"$file\" \"${file/.fastq/}\""
 done
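
The renaming step can be sketched as a function with an explicit dry-run default. The helper names are assumptions; the suffix handling uses the POSIX form ${name%.fastq.gz} rather than the bash-only ${file/.fastq/}, which gives the same result for the file names used here.

```shell
#!/bin/sh
# strip_fastq_ext NAME   (hypothetical helper)
# Prints NAME with the ".fastq" part of a ".fastq.gz" suffix removed.
strip_fastq_ext() {
    case $1 in
        *.fastq.gz) printf '%s.gz\n' "${1%.fastq.gz}" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}

# rename_trimmed DIR [run]   (hypothetical helper)
# Without "run" it only prints the mv commands; with "run" it renames.
rename_trimmed() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue    # glob matched nothing
        new="$1/$(strip_fastq_ext "$(basename "$f")")"
        if [ "${2:-}" = "run" ]; then
            mv -- "$f" "$new"
        else
            echo "mv \"$f\" \"$new\""
        fi
    done
}

strip_fastq_ext nf774.fastq.gz   # prints nf774.gz
```

After the rename, the files match the INPUT_FILE_PATH=/exceRptInput/${sample}.gz pattern used in the docker calls.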

 mkdir -p ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results
 for sample in nf780 nf796 nf797  nf655    nf774 nf961 nf962  nf657 nf930 nf935  nf931 nf936 nf971  nf932 nf937 nf972  nf933 nf938 nf973  nf934 nf939 nf974; do
     docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
               -v /mnt/nvme0n1p1/MyexceRptDatabase:/exceRpt_DB \
               -t rkitchen/excerpt \
               INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
 done

 for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
     docker run -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed:/exceRptInput \
                -v ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/results:/exceRptOutput \
               -v /mnt/nvme3n1p1/MyexceRptDatabase:/exceRpt_DB \
               -t rkitchen/excerpt \
               INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on
 done
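
Since each exceRpt run takes a long time, it may be worth checking that every expected input file exists before starting the loops. A minimal sketch, with a made-up function name and the assumption that the inputs have already been renamed to <sample>.gz:

```shell
#!/bin/sh
# missing_inputs DIR SAMPLE...   (hypothetical helper)
# Prints every SAMPLE for which DIR/SAMPLE.gz does not exist.
missing_inputs() {
    dir=$1
    shift
    for s in "$@"; do
        [ -f "$dir/$s.gz" ] || printf '%s\n' "$s"
    done
}

# Usage:
#   missing_inputs ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/trimmed nf780 nf796 nf797
# An empty result means all listed samples are ready for the docker loop.
```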

 #DEBUG the excerpt env
 docker inspect rkitchen/excerpt:latest
 # Without /bin/bash → May run and exit immediately
 #docker run -it rkitchen/excerpt
 # With /bin/bash → Stays open for interaction
 docker run -it --entrypoint /bin/bash rkitchen/excerpt

Processing exceRpt output from multiple samples

 cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master
 mamba activate r_env
 mamba install -c conda-forge -c bioconda \
     bioconductor-marray \
     bioconductor-rgraphviz \
     r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx r-rcurl r-xml \
     -y

 #mkdir summaries heatmap_all_WaGa+4_MKL-1
 mkdir results_WaGa_EXCLUDED results_MKL-1 results_MKL-1_EXCLUDED summaries_WaGa summaries_MKL-1 heatmap_WaGa heatmap_MKL-1
 #! EXCLUDE samples that show a completely different pattern or poor quality (outliers). For WaGa this is so far only one sample, nf657 (WaGa wt EV):
 sudo mv results/nf657* results_WaGa_EXCLUDED/
 sudo mv results/nf780* results_MKL-1/
 sudo mv results/nf796* results_MKL-1/
 sudo mv results/nf797* results_MKL-1/
 sudo mv results/nf655* results_MKL-1/
 for sample in 2404_MKL1_wt_EVs 2608_MKL1_wt_EVs    2608_MKL1_sT_DMSO 2701_MKL1_sT_DMSO 2802_MKL1_sT_DMSO    2608_MKL1_sT_Dox 2701_MKL1_sT_Dox 2802_MKL1_sT_Dox    2608_MKL1_scr_DMSO 2701_MKL1_scr_DMSO 2802_MKL1_scr_DMSO    2608_MKL1_scr_Dox 2701_MKL1_scr_Dox 2802_MKL1_scr_Dox; do
     echo "sudo mv results/${sample}* results_MKL-1/"
 done
 #Following our initial QC, I noticed that one of the MKL-1 wt-EV samples (nf655) is a clear outlier, clustering far apart from the other two wt-EV replicates in the PCoA plots. I recommend removing nf655 from the downstream MKL-1 analysis, similar to our earlier analysis for WaGa, in which we removed the outlier nf657. Please see the attached figures for reference.
 mkdir -p results_MKL-1_EXCLUDED
 mv results_MKL-1/nf655* results_MKL-1_EXCLUDED/
 #NOTE: the R calls below read ../results_WaGa/, so the remaining results/ directory (now WaGa only) presumably has to be renamed accordingly (mv results results_WaGa).
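
The sorting of per-sample results into group directories can be sketched as one function that also creates the destination directory. The function name is an assumption, and the prefix matching mirrors the results/nf657* style used above. The sketch does not call sudo; the original commands presumably need it because the Docker container writes the output as root.

```shell
#!/bin/sh
# move_sample_results SRC DEST SAMPLE...   (hypothetical helper)
# Moves every entry in SRC whose name starts with one of the SAMPLE
# prefixes into DEST, creating DEST if necessary.
move_sample_results() {
    src=$1
    dest=$2
    shift 2
    mkdir -p "$dest"
    for s in "$@"; do
        for f in "$src/$s"*; do
            [ -e "$f" ] || continue    # glob matched nothing
            mv -- "$f" "$dest/"
        done
    done
}

# Usage:
#   move_sample_results results results_MKL-1 nf780 nf796 nf797
#   move_sample_results results_MKL-1 results_MKL-1_EXCLUDED nf655
```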

 (r_env) jhuang@WS-2290C:~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/exceRpt-master$ R
 #WARNING: need to reload the R-script after each change of the script.
 source("mergePipelineRuns_functions.R")
 processSamplesInDir("../results_WaGa/", "../summaries_WaGa")
 processSamplesInDir("../results_MKL-1/", "../summaries_MKL-1")

 #mkdir heatmap_WaGa; cp summaries_WaGa/*.RData heatmap_WaGa; rm heatmap_WaGa/exceRpt_sampleGroupDefinitions.txt;
 source("mergePipelineRuns_functions_addSampleGroupInfo_WaGa.R")
 processSamplesInDir("../results_WaGa/", "../heatmap_WaGa")

 #mkdir heatmap_MKL-1; cp summaries_MKL-1/*.RData heatmap_MKL-1; rm heatmap_MKL-1/exceRpt_sampleGroupDefinitions.txt;
 source("mergePipelineRuns_functions_addSampleGroupInfo_MKL-1.R")
 processSamplesInDir("../results_MKL-1/", "../heatmap_MKL-1")

 #!!!!! IMPORTANT: REPORT heatmap_MKL-1/exceRpt_DiagnosticPlots.pdf and heatmap_MKL-1/mapping_heatmap3.pdf (they are almost identical; mapping_heatmap3.pdf is preferable because of its larger font size) !!!!
 #CONSIDERING_TO_DEL_nf774 since it lies far from the other two samples (MAYBE BETTER NOT TO DO THIS, SINCE I STILL HAVE TO GENERATE THE PCA AND MANHATTAN PLOTS!!): for now, nf774 is kept in the WaGa results.

 #~/Tools/csv2xls-0.4/csv_to_xls.py exceRpt_miRNA_ReadsPerMillion.txt exceRpt_tRNA_ReadsPerMillion.txt exceRpt_piRNA_ReadsPerMillion.txt -d$'\t' -o exceRpt_results_detailed.xls

 # Report summaries_WaGa/exceRpt_mapping_heatmaps_WaGa.xlsx or summaries_MKL-1/exceRpt_mapping_heatmaps_MKL-1.xlsx;
 #        summaries_WaGa/exceRpt_results_detailed_WaGa.xls or summaries_MKL-1/exceRpt_results_detailed_MKL-1.xls;
 #        heatmap_WaGa/mapping_heatmap3_WaGa.pdf or heatmap_MKL-1/mapping_heatmap3_MKL-1.pdf

Downstream analysis using R for miRNAs (17 WaGa samples)

 #Input file
 #exceRpt_miRNA_ReadCounts.txt
 #exceRpt_piRNA_ReadCounts.txt

 ## WaGa experimental groups (scr = scramble control; sT = target knockdown)
 #WaGa_scr_DMSO_EV (nf933, nf938, nf973)
 #WaGa_scr_Dox_EV (nf934, nf939, nf974)
 #WaGa_sT_DMSO_EV (nf931, nf936, nf971)
 #WaGa_sT_Dox_EV (nf932, nf937, nf972)
 #
 ## WaGa wild-type controls
 #WaGa_wt_cells (nf774, nf961, nf962)
 #WaGa_wt_EV (nf930, nf935)

 cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_WaGa
 mamba activate r_env
 R

 #BiocManager::install("AnnotationDbi")
 #BiocManager::install("clusterProfiler")
 #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
 #BiocManager::install("limma")
 #BiocManager::install("sva")
 #install.packages("writexl")
 #install.packages("openxlsx")
 library("AnnotationDbi")
 library("clusterProfiler")
 library("ReactomePA")
 library("org.Hs.eg.db")
 library(DESeq2)
 library(gplots)
 library(limma)
 library(sva)
 #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
 library(openxlsx)

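 # read.delim (decimal point ".") parses fractional read counts directly as numeric; read.delim2 expects "," as the decimal mark
 d.raw <- read.delim("exceRpt_miRNA_ReadCounts.txt", sep="\t", header=TRUE, row.names=1)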

 # Desired column order
 desired_order <- c(
     "nf933", "nf938", "nf973",
     "nf934", "nf939", "nf974",
     "nf931", "nf936", "nf971",
     "nf932", "nf937", "nf972",
     "nf774", "nf961", "nf962",
     "nf930", "nf935"
 )

 # Check for missing or misnamed columns before reordering (should return character(0))
 setdiff(desired_order, colnames(d.raw))
 # Reorder columns
 d.raw <- d.raw[, desired_order]
 #sapply(d.raw, is.numeric)
 d.raw[] <- lapply(d.raw, as.numeric)
 #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
 d.raw <- round(d.raw)
 write.csv(d.raw, file ="d_raw.csv")
 write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)

 # ------ Code sent to Ute ------
 #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
 Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
 replicates = as.factor(c("WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV",     "WaGa_scr_Dox_EV","WaGa_scr_Dox_EV","WaGa_scr_Dox_EV",  "WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV",  "WaGa_sT_Dox_EV","WaGa_sT_Dox_EV","WaGa_sT_Dox_EV",  "WaGa_wt_cells", "WaGa_wt_cells","WaGa_wt_cells",  "WaGa_wt_EV", "WaGa_wt_EV"))
 ids = as.factor(c(
     "nf933", "nf938", "nf973",
     "nf934", "nf939", "nf974",
     "nf931", "nf936", "nf971",
     "nf932", "nf937", "nf972",
     "nf774", "nf961", "nf962",
     "nf930", "nf935"))
 cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
 dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)

 # Filter low-count miRNAs
 dds <- dds[ rowSums(counts(dds)) > 10, ]
 rld <- rlogTransformation(dds)

 # -- before pca --
 png("pca.png", 1200, 800)
 plotPCA(rld, intgroup=c("replicates"))
 #plotPCA(rld, intgroup = c("replicates", "batch"))
 #plotPCA(rld, intgroup = c("replicates", "ids"))
 #plotPCA(rld, "batch")
 dev.off()
 png("pca2.png", 1200, 800)
 #plotPCA(rld, intgroup=c("replicates"))
 #plotPCA(rld, intgroup = c("replicates", "batch"))
 plotPCA(rld, intgroup = c("replicates", "ids"))
 #plotPCA(rld, "batch")
 dev.off()

 # Batch effect removal (NOTE: no batch-effect removal was applied!)

 #### STEP2: DEGs ####
 #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
 #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
 #- Manhattan plot miRNAs; 1x for WaGa cell line
 #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
 #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line

 #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
 sizeFactors(dds)
 #NULL
 dds <- estimateSizeFactors(dds)
 sizeFactors(dds)
 normalized_counts <- counts(dds, normalized=TRUE)
 write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
 write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
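
The comment above about feeding the inverse of DESeq2's size factors to deepTools can be sketched in shell as follows. Everything beyond bamCoverage and its --scaleFactor option is an assumption: the size factors are taken to have been exported from R into a two-column, tab-separated file (sample, size factor), the BAM files are taken to be named <sample>.bam, and the function names are made up. The functions only print the commands.

```shell
#!/bin/sh
# inverse_size_factor SF   (hypothetical helper)
# Prints 1/SF with six decimals; fails for zero or negative values.
inverse_size_factor() {
    LC_ALL=C awk -v sf="$1" 'BEGIN { if (sf + 0 <= 0) exit 1; printf "%.6f\n", 1 / sf }'
}

# print_bamcoverage_cmds FILE   (hypothetical helper)
# FILE is an assumed export with lines of the form: sample <TAB> sizeFactor
print_bamcoverage_cmds() {
    while read -r sample sf; do
        [ -n "$sample" ] || continue
        scale=$(inverse_size_factor "$sf") || continue
        printf 'bamCoverage -b %s.bam -o %s.bw --scaleFactor %s\n' \
            "$sample" "$sample" "$scale"
    done < "$1"
}

inverse_size_factor 2   # prints 0.500000
```

The inverse is used because DESeq2 divides raw counts by the size factor, whereas bamCoverage multiplies the coverage by the scale factor.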

 dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)

 dds$replicates <- relevel(dds$replicates, "WaGa_wt_cells")
 dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
 resultsNames(dds)
 clist <- c("WaGa_wt_EV_vs_WaGa_wt_cells")

 #NOTE: the results sent to Ute used the cutoff padj<=0.1, whereas the loop below uses padj<=0.05.
 for (i in clist) {
     contrast = paste("replicates", i, sep="_")
     res = results(dds, name=contrast)
     res <- res[!is.na(res$log2FoldChange),]
     #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
     res$padj <- ifelse(is.na(res$padj), 1, res$padj)
     res_df <- as.data.frame(res)
     write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
     up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
     down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
     write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
     write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
 }
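
The up/down selection from the loop above can be double-checked outside R directly on the -all.txt file written by write.csv. This is a sketch under assumptions: the function name is made up, the file is taken to be comma-separated with a quoted header line containing log2FoldChange and padj, and no field is taken to contain a comma. It applies the same cutoffs (padj<=0.05, |log2FoldChange|>=2) but does not re-sort the rows.

```shell
#!/bin/sh
# filter_de CSV up|down   (hypothetical helper)
# Prints the header plus all rows passing the cutoffs used above.
filter_de() {
    LC_ALL=C awk -F, -v dir="$2" '
        NR == 1 {
            # Locate the columns by name; strip the quotes added by write.csv
            for (i = 1; i <= NF; i++) { h = $i; gsub(/"/, "", h); col[h] = i }
            print
            next
        }
        {
            padj = $(col["padj"])
            if (padj == "NA") next
            padj += 0
            lfc = $(col["log2FoldChange"]) + 0
            if (dir == "up"   && padj <= 0.05 && lfc >=  2) print
            if (dir == "down" && padj <= 0.05 && lfc <= -2) print
        }' "$1"
}

# Usage:
#   filter_de WaGa_wt_EV_vs_WaGa_wt_cells-all.txt up | wc -l
```

The line count minus one (the header) should agree with the number of rows in the corresponding -up.txt or -down.txt file.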

 ~/Tools/csv2xls-0.4/csv_to_xls.py \
 WaGa_wt_EV_vs_WaGa_wt_cells-all.txt \
 WaGa_wt_EV_vs_WaGa_wt_cells-up.txt \
 WaGa_wt_EV_vs_WaGa_wt_cells-down.txt \
 -d$',' -o WaGa_wt_EV_vs_WaGa_wt_cells.xls;

 # ------------------- volcano_plot -------------------
 library(ggplot2)
 library(ggrepel)

 geness_res <- read.csv(file = paste("WaGa_wt_EV_vs_WaGa_wt_cells", "all.txt", sep="-"), row.names=1)

 external_gene_name <- rownames(geness_res)
 geness_res <- cbind(geness_res, external_gene_name)
 #top_g are from ids
 top_g <- c("hsa-let-7b-5p","hsa-let-7g-5p","hsa-let-7i-5p","hsa-miR-103a-3p","hsa-miR-107","hsa-miR-1224-5p","hsa-miR-122-5p","hsa-miR-1226-5p","hsa-miR-1246","hsa-miR-127-3p","hsa-miR-1290","hsa-miR-130a-3p","hsa-miR-139-3p","hsa-miR-141-3p","hsa-miR-143-3p","hsa-miR-148b-3p","hsa-miR-155-5p","hsa-miR-15a-5p","hsa-miR-17-5p","hsa-miR-184","hsa-miR-18a-3p","hsa-miR-18a-5p","hsa-miR-190a-5p","hsa-miR-191-5p","hsa-miR-193b-5p","hsa-miR-197-5p","hsa-miR-200a-3p","hsa-miR-200b-5p","hsa-miR-206","hsa-miR-20a-5p","hsa-miR-210-3p","hsa-miR-2110","hsa-miR-21-5p","hsa-miR-218-5p","hsa-miR-219a-1-3p","hsa-miR-221-3p","hsa-miR-23b-3p","hsa-miR-27a-3p","hsa-miR-27b-3p","hsa-miR-27b-5p","hsa-miR-28-3p","hsa-miR-30a-5p","hsa-miR-30c-5p","hsa-miR-30e-5p","hsa-miR-3127-5p","hsa-miR-3131","hsa-miR-3180|hsa-miR-3180-3p","hsa-miR-320a","hsa-miR-320b","hsa-miR-320c","hsa-miR-320d","hsa-miR-330-3p","hsa-miR-335-3p","hsa-miR-33b-5p","hsa-miR-340-5p","hsa-miR-342-5p","hsa-miR-3605-5p","hsa-miR-361-3p","hsa-miR-365a-5p","hsa-miR-374b-5p","hsa-miR-378i","hsa-miR-379-5p","hsa-miR-3940-5p","hsa-miR-409-3p","hsa-miR-411-5p","hsa-miR-423-3p","hsa-miR-423-5p","hsa-miR-4286","hsa-miR-429","hsa-miR-432-5p","hsa-miR-4326","hsa-miR-451a","hsa-miR-4520-3p","hsa-miR-454-3p","hsa-miR-4646-5p","hsa-miR-4667-5p","hsa-miR-4748","hsa-miR-483-5p","hsa-miR-486-5p","hsa-miR-5010-5p","hsa-miR-504-3p","hsa-miR-5187-5p","hsa-miR-590-3p","hsa-miR-6128","hsa-miR-625-5p","hsa-miR-6726-5p","hsa-miR-6730-5p","hsa-miR-676-3p","hsa-miR-6767-5p","hsa-miR-6777-5p","hsa-miR-6780a-5p","hsa-miR-6794-5p","hsa-miR-6817-3p","hsa-miR-708-5p","hsa-miR-7-5p","hsa-miR-766-5p","hsa-miR-7854-3p","hsa-miR-873-3p","hsa-miR-885-3p","hsa-miR-92b-5p","hsa-miR-93-5p","hsa-miR-937-3p","hsa-miR-9-5p","hsa-miR-98-5p")
 subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
 geness_res$Color <- "NS or log2FC < 2.0"
 geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
 geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
 geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"

 write.csv(geness_res, "WaGa_wt_EV_vs_WaGa_wt_cells_with_Category.csv")
 geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)

 geness_res <- geness_res[, -1*ncol(geness_res)]
 png("WaGa_wt_EV_vs_WaGa_wt_cells.png",width=1200, height=1400)
 #svg("WaGa_wt_EV_vs_WaGa_wt_cells.svg",width=12, height=14)
 ggplot(geness_res,
        aes(x = log2FoldChange, y = -log10(pvalue),
            color = Color, label = external_gene_name)) +
     geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +
     geom_hline(yintercept = -log10(0.05), lty = "dashed") +
     geom_point() +
     labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +
     scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),
                        guide = guide_legend(override.aes = list(size = 4))) +
     scale_y_continuous(expand = expansion(mult = c(0,0.05))) +
     geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)),
                     size = 4, point.padding = 0.15, color = "black",
                     min.segment.length = .1, box.padding = .2, lwd = 2) +
     theme_bw(base_size = 16) +
     theme(legend.position = "bottom")
 dev.off()

 # ----------------------------------------
 # ----------- manhattan_plot -------------

 Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt

Downstream analysis using R for miRNAs (17 MKL-1 samples)

 #Input file
 #exceRpt_miRNA_ReadCounts.txt
 #exceRpt_piRNA_ReadCounts.txt

 #MKL-1_sT_DMSO_EV ("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO")
 #MKL-1_sT_Dox_EV ("X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox")
 #MKL-1_scr_DMSO_EV ("X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO")
 #MKL-1_scr_Dox_EV ("X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox")
 #MKL-1_wt_cells ("nf780","nf796","nf797")
 #MKL-1_wt_EV ("X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs")

 cd ~/DATA/Data_Ute_smallRNA_via_exceRpt_workspace/summaries_MKL-1
 mamba activate r_env
 R

 #BiocManager::install("AnnotationDbi")
 #BiocManager::install("clusterProfiler")
 #BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
 #BiocManager::install("limma")
 #BiocManager::install("sva")
 #install.packages("writexl")
 #install.packages("openxlsx")
 library("AnnotationDbi")
 library("clusterProfiler")
 library("ReactomePA")
 library("org.Hs.eg.db")
 library(DESeq2)
 library(gplots)
 library(limma)
 library(sva)
 #library(writexl)  #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx");
 library(openxlsx)

 d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1)

 # Desired column order
 desired_order <- c(
     "X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
     "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"
 )

 # Check for missing or misnamed columns before reordering (should return character(0))
 setdiff(desired_order, colnames(d.raw))
 # Reorder columns
 d.raw <- d.raw[, desired_order]
 #sapply(d.raw, is.numeric)
 d.raw[] <- lapply(d.raw, as.numeric)
 #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x)))
 d.raw <- round(d.raw)
 write.csv(d.raw, file ="d_raw.csv")
 write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE)

 #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1)
 Cell_or_EV = as.factor(c("EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "EV","EV","EV",  "Cell","Cell","Cell",  "EV","EV"))
 replicates = as.factor(c("MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV","MKL-1_sT_DMSO_EV",     "MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV","MKL-1_sT_Dox_EV",  "MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV","MKL-1_scr_DMSO_EV",  "MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV","MKL-1_scr_Dox_EV",    "MKL-1_wt_cells", "MKL-1_wt_cells","MKL-1_wt_cells",  "MKL-1_wt_EV","MKL-1_wt_EV"))
 ids = as.factor(c("X2608_MKL1_sT_DMSO","X2701_MKL1_sT_DMSO","X2802_MKL1_sT_DMSO", "X2608_MKL1_sT_Dox","X2701_MKL1_sT_Dox","X2802_MKL1_sT_Dox", "X2608_MKL1_scr_DMSO","X2701_MKL1_scr_DMSO","X2802_MKL1_scr_DMSO", "X2608_MKL1_scr_Dox","X2701_MKL1_scr_Dox","X2802_MKL1_scr_Dox",
     "nf780","nf796","nf797", "X2404_MKL1_wt_EVs","X2608_MKL1_wt_EVs"))
 cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV)
 dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)

 # Filter low-count miRNAs
 dds <- dds[ rowSums(counts(dds)) > 10, ]
 rld <- rlogTransformation(dds)

 # -- before pca --
 png("pca.png", 1200, 800)
 plotPCA(rld, intgroup=c("replicates"))
 #plotPCA(rld, intgroup = c("replicates", "batch"))
 #plotPCA(rld, intgroup = c("replicates", "ids"))
 #plotPCA(rld, "batch")
 dev.off()
 png("pca2.png", 1200, 800)
 #plotPCA(rld, intgroup=c("replicates"))
 #plotPCA(rld, intgroup = c("replicates", "batch"))
 plotPCA(rld, intgroup = c("replicates", "ids"))
 #plotPCA(rld, "batch")
 dev.off()

 # Batch effect removal (no batch-effect removal was applied!)

 #### STEP2: DEGs ####
 #- Heatmap untreated/wt vs parental; 1x for WaGa cell line
 #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line
 #- Manhattan plot miRNAs; 1x for WaGa cell line
 #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line
 #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line

 #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
 sizeFactors(dds)
 #NULL
 dds <- estimateSizeFactors(dds)
 sizeFactors(dds)
 normalized_counts <- counts(dds, normalized=TRUE)
 write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
 write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE)
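
The normalization step above divides each sample's counts by its size factor. As a rough illustration of the median-of-ratios idea behind estimateSizeFactors, here is a minimal Python sketch; the function names and the toy count matrix are assumptions made for illustration, and it is not DESeq2's actual code.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per miRNA), each a list of raw counts per sample.
    """
    n_samples = len(counts[0])
    # Keep only features with a non-zero count in every sample,
    # so that the geometric mean is defined and non-zero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts, factors):
    """Divide every count by the size factor of its sample."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]

# Hypothetical toy matrix: 4 miRNAs x 2 samples, sample 2 sequenced twice as deep.
toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy)
norm = normalize(toy, sf)
print(sf)
print(norm[0])
```

In this toy case the two samples end up with equal normalized counts for the miRNAs that differ only by sequencing depth.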

 dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates)

 dds$replicates <- relevel(dds$replicates, "MKL-1_wt_cells")
 dds = DESeq(dds, betaPrior=FALSE)  #default betaPrior is FALSE
 resultsNames(dds)
 clist <- c("MKL.1_wt_EV_vs_MKL.1_wt_cells")

 # NOTE: the results sent to Ute were filtered with padj <= 0.1.
 for (i in clist) {
     contrast = paste("replicates", i, sep="_")
     res = results(dds, name=contrast)
     res <- res[!is.na(res$log2FoldChange),]
     #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
     res$padj <- ifelse(is.na(res$padj), 1, res$padj)
     res_df <- as.data.frame(res)
     write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
     up <- subset(res_df, padj<=0.05 & log2FoldChange>=2)
     down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2)
     write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
     write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
 }

 ~/Tools/csv2xls-0.4/csv_to_xls.py \
 MKL.1_wt_EV_vs_MKL.1_wt_cells-all.txt \
 MKL.1_wt_EV_vs_MKL.1_wt_cells-up.txt \
 MKL.1_wt_EV_vs_MKL.1_wt_cells-down.txt \
 -d$',' -o MKL.1_wt_EV_vs_MKL.1_wt_cells.xls;

 # ------------------- volcano_plot -------------------
 library(ggplot2)
 library(ggrepel)

 geness_res <- read.csv(file = paste("MKL.1_wt_EV_vs_MKL.1_wt_cells", "all.txt", sep="-"), row.names=1)

 external_gene_name <- rownames(geness_res)
 geness_res <- cbind(geness_res, external_gene_name)
 #top_g are from ids

 top_g <- c("hsa-miR-203a-3p","hsa-miR-6850-5p","hsa-miR-4511","hsa-miR-5187-5p","hsa-miR-133b","hsa-miR-1246","hsa-miR-625-3p","hsa-miR-6741-3p","hsa-miR-192-5p","hsa-miR-10b-5p","hsa-miR-885-5p","hsa-miR-30e-3p","hsa-miR-101-3p","hsa-miR-1307-5p","hsa-miR-95-3p","hsa-miR-889-3p","hsa-miR-206","hsa-miR-301a-3p","hsa-miR-1-3p","hsa-let-7c-5p","hsa-miR-196a-5p","hsa-let-7f-5p","hsa-let-7e-5p","hsa-miR-30c-5p","hsa-miR-30a-3p","hsa-miR-146b-5p","hsa-miR-25-3p","hsa-miR-182-5p","hsa-miR-98-5p","hsa-let-7a-5p","hsa-miR-149-5p","hsa-miR-148a-3p","hsa-miR-873-3p","hsa-miR-19b-3p","hsa-miR-320c","hsa-miR-375","hsa-miR-30a-5p","hsa-miR-877-5p","hsa-miR-34a-5p","hsa-miR-324-5p","hsa-miR-652-3p","hsa-miR-342-5p","hsa-miR-7706","hsa-miR-361-3p","hsa-miR-361-5p","hsa-miR-1180-3p","hsa-miR-217","hsa-miR-1307-3p","hsa-miR-1908-5p","hsa-miR-15b-5p","hsa-miR-92b-5p","hsa-miR-484","hsa-miR-197-3p","hsa-miR-200c-3p","hsa-miR-671-5p","hsa-miR-339-5p","hsa-miR-1301-3p","hsa-miR-769-5p","hsa-miR-328-3p","hsa-miR-93-5p","hsa-miR-103a-3p")
 subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0))
 geness_res$Color <- "NS or log2FC < 2.0"
 geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
 geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
 geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"

 write.csv(geness_res, "MKL.1_wt_EV_vs_MKL.1_wt_cells_with_Category.csv")
 geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)

 geness_res <- geness_res[, -1*ncol(geness_res)]
 png("MKL.1_wt_EV_vs_MKL.1_wt_cells.png",width=1200, height=1400)
 #svg("MKL.1_wt_EV_vs_MKL.1_wt_cells.svg",width=12, height=14)
 ggplot(geness_res,
        aes(x = log2FoldChange, y = -log10(pvalue),
            color = Color, label = external_gene_name)) +
   geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +
   geom_hline(yintercept = -log10(0.05), lty = "dashed") +
   geom_point() +
   labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") +
   scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),
                      guide = guide_legend(override.aes = list(size = 4))) +
   scale_y_continuous(expand = expansion(mult = c(0,0.05))) +
   geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)),
                   size = 4, point.padding = 0.15, color = "black",
                   min.segment.length = .1, box.padding = .2, lwd = 2) +
   theme_bw(base_size = 16) +
   theme(legend.position = "bottom")
 dev.off()

 # ----------------------------------------
 # ----------- manhattan_plot -------------

 Rscript manhattan_plot_Carmen_custom_labels.R  #exceRpt_miRNA_ReadCounts.txt
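
The colour categories of the volcano plots above are assigned by four successive overwrites of the Color column. The same decision logic can be written as one function; this is a minimal Python sketch in which the function name and the example values are hypothetical, and the thresholds (|log2FC| >= 2.0, 0.05) are the ones used in the R code.

```python
def volcano_category(pvalue, padj, log2fc, lfc_cutoff=2.0, alpha=0.05):
    """Return the volcano-plot colour category for one miRNA.

    Mirrors the order of the R assignments: the fold-change filter
    wins over everything, and padj wins over the raw p-value.
    """
    if abs(log2fc) < lfc_cutoff:
        return "NS or log2FC < 2.0"
    if padj < alpha:
        return "P-adj < 0.05"
    if pvalue < alpha:
        return "P < 0.05"
    return "NS or log2FC < 2.0"

# Hypothetical values, for illustration only.
print(volcano_category(pvalue=0.001, padj=0.01, log2fc=3.0))
print(volcano_category(pvalue=0.01, padj=0.20, log2fc=-2.5))
print(volcano_category(pvalue=0.001, padj=0.01, log2fc=1.0))
```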

Comparing https://chat.z.ai/, https://chat.qwen.ai/, and https://chat.deepseek.com/ to explain Maaslin2


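Maaslin2 explained

What is Maaslin2?

Maaslin2 (Microbiome Multivariable Association with Linear Models) is an R package for multivariable association analysis of microbiome data.

Main uses

Finding associations between microbial features (e.g. bacterial abundances) and metadata (e.g. age, diet, disease status)
Suitable for microbiome data such as 16S rRNA sequencing and metagenomics

Core features

1. Statistical methods

Uses linear models or generalized linear models
Supports fixed effects and random effects
Can handle repeated-measures data

2. Data normalization

Addresses the compositional nature of microbiome data through normalization
Offers several normalization methods, including TSS, CSS and CLR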
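
To make two of the normalization methods named above concrete, here is a minimal Python sketch of TSS (total sum scaling) and CLR (centred log-ratio) for a single sample. The pseudocount of 0.5 and the example counts are assumptions chosen for illustration; this is not Maaslin2's own implementation.

```python
import math

def tss(counts):
    """Total sum scaling: convert counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(counts, pseudocount=0.5):
    """Centred log-ratio: log of each value minus the mean log.

    A pseudocount (an assumption here) avoids log(0) for absent taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [10, 0, 90]  # hypothetical counts for three taxa
print(tss(sample))
print(clr(sample))
```

CLR values sum to zero and can be negative, which is why a log transform is not applied on top of them.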

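3. Multivariable analysis

Can adjust for several covariates at once
Controls for confounders

Basic workflow

# Load the package (install it first, e.g. with BiocManager::install("Maaslin2"))
library(Maaslin2)

# Run the analysis
results <- Maaslin2(
  input_data = microbiome_data,    # microbial abundance table
  input_metadata = metadata,       # metadata
  output = "results",              # output directory
  fixed_effects = c("age", "diet"), # fixed effects
  random_effects = c("subject_id"), # random effects
  normalization = "CLR",           # normalization method
  transform = "NONE"               # no extra transform: CLR values can be negative, so LOG does not apply
)

Output

Table of significant associations: includes p-values, q-values and effect sizes
Plots: heatmaps, boxplots, etc.
Summary statistics

Advantages

User-friendly: simple interface with sensible defaults
Flexible: supports several normalization and transformation methods
Statistically rigorous: multiple-testing correction (FDR)
Reproducible: results are easy to reproduce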
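
The q-values mentioned above come from a false-discovery-rate correction. As a minimal sketch of what such a correction does, here is the Benjamini-Hochberg procedure in plain Python; the function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    Each p-value is scaled by m / rank, then a running minimum from the
    largest rank downwards keeps the q-values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

pvals = [0.01, 0.04, 0.03, 0.2]  # hypothetical p-values
qvals = benjamini_hochberg(pvals)
print(qvals)
```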

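Use cases

Associations between disease and the gut microbiota
Effects of dietary interventions on the microbiome
Changes in the microbiota before and after drug treatment
Relationships between environmental factors and the microbiome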

Here’s Python code to read a file and output to console:

# Method 1: Read entire file at once
with open('filename.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)

# Method 2: Read line by line
with open('filename.txt', 'r', encoding='utf-8') as file:
    for line in file:
        print(line.strip())  # strip() removes leading/trailing whitespace, including the newline

# Method 3: Read all lines into a list
with open('filename.txt', 'r', encoding='utf-8') as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

# With error handling
try:
    with open('filename.txt', 'r', encoding='utf-8') as file:
        content = file.read()
        print(content)
except FileNotFoundError:
    print("File not found!")
except Exception as e:
    print(f"An error occurred: {e}")

Key points:

Use with statement to automatically close the file
Specify encoding='utf-8' to handle special characters
strip() removes leading and trailing whitespace, including the newline, when reading line by line
Always include error handling for robust code

Replace 'filename.txt' with your actual file path.

Processing DNAseq for Data_Tam_DNAseq_2026_2605_2617_2631_2914_Acinetobacter_sp


Run nextflow bacass

 conda deactivate

 # Download k2_standard_08_GB_20251015.tar.gz from https://benlangmead.github.io/aws-indexes/k2#kraken2--bracken
 # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056; 'tar xzf 20190108_kmerfinder_stable_dirs.tar.gz'  #The database does not work!
 # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz  #The database works!

 # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working!

 nextflow run nf-core/bacass -r 2.6.0 -profile docker --help

 # -- Hybrid assembly --
 nextflow run nf-core/bacass -r 2.6.0 -profile docker \
   --input samplesheet_bacass.tsv \
   --outdir bacass_out \
   --assembly_type hybrid \
   --assembler unicycler,dragonflye \
   --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
   --skip_kmerfinder \
   -resume \
   -work-dir bacass_out/work

 # -- Short assembly --
 # The bug may come from '--skip_kmerfinder' in -r 2.6.0; the KmerFinder database is used with -r 2.5.0 instead
 nextflow run nf-core/bacass -r 2.5.0 -profile docker \
   --input samplesheet.tsv \
   --outdir bacass_out \
   --assembly_type short \
   --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
   --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
   -resume \
   -work-dir bacass_out/work

Verify if the genome is pure

 # 1. Go up one level to the main 'bacass_out' directory
 cd ..

 # 2. Create directories for CheckM inputs and outputs
 mkdir -p checkm_input checkm_output

 # 3. Copy all .fna files into the 'checkm_input' folder
 # (CheckM cannot search subdirectories, so they must be in one folder)
 find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;

 # 4. Run CheckM on all 4 assemblies
 checkm lineage_wf -x fna checkm_input checkm_output

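Species Identification: use Mash for rapid screening → GTDB-Tk for precise classification → FastANI for species-level verification; combining the three improves the accuracy and interpretability of species identification.

 # 1. Create the environment (mamba recommended)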
 mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk
 mamba activate gtdbtk

 # 2. Download the database (first time only, about 60 GB)
 gtdbtk download --data_dir ./gtdb_data --release 220

 wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.g
 mamba env config vars set GTDBTK_DATA_PATH="/mnt/nvme4n1p1/gtdb_data/release232"
 # Deactivate the current environment first, then activate it again
 mamba deactivate
 mamba activate gtdbtk

 # Verify that the environment variable was loaded
 echo $GTDBTK_DATA_PATH
 # Expected output: /mnt/nvme4n1p1/gtdb_data/release232

 # 3. Run the classification (the command you provided + practical parameters)
 gtdbtk classify_wf \
   --genome_dir ./checkm_input \
   --out_dir gtdb_out \
   --cpus 64 \
   --extension .fna \
   --prefix mygenome

 # 4. Inspect the results
 cat gtdb_out/classify/mygenome.bac120.summary.tsv   # bacterial results
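
Instead of reading the summary table by eye, the species call per genome can be pulled out with a few lines of Python. This is a minimal sketch that assumes the summary file is tab-separated and has the columns user_genome and classification; the helper names are hypothetical, and the in-memory example row reuses the taxonomy string reported for sample 2605_.

```python
import csv
import io

def read_gtdbtk_summary(handle):
    """Map each genome name to its GTDB lineage string."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["user_genome"]: row["classification"] for row in reader}

def species_of(classification):
    """Return the species name from a GTDB lineage, or '' if unassigned."""
    last_rank = classification.split(";")[-1]
    return last_rank[3:] if last_rank.startswith("s__") else ""

# In-memory stand-in for gtdb_out/classify/mygenome.bac120.summary.tsv
example = io.StringIO(
    "user_genome\tclassification\n"
    "2605_\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;"
    "s__Acinetobacter baumannii\n"
)
calls = read_gtdbtk_summary(example)
for genome, lineage in calls.items():
    print(genome, species_of(lineage))
```

For the real file, pass an open file handle (opened with newline="") instead of the StringIO object.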

Antimicrobial resistance gene, resistome and virulence profiling with ABRicate and RGI (Resistance Gene Identifier)

 conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
 abricate --list

 conda deactivate

 ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
 ASM=bacass_out/checkm_input/2914_.fna \
 SAMPLE=2914 \
 OUTDIR=resistome_virulence_2914 \
 MINID=80 MINCOV=60 \
 THREADS=32 \
 ~/Scripts/run_abricate_resistome_virulome_one_per_gene.sh

 #ABRicate thresholds: MINID=80 MINCOV=60
 Database        Hit_lines       File
 MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
 CARD    21      resistome_virulence_2605/raw/2605.card.tab
 ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
 VFDB    0       resistome_virulence_2605/raw/2605.vfdb.tab

 # Database        Hit_lines       File
 # MEGARes 42      resistome_virulence_2631/raw/2631.megares.tab
 # CARD    37      resistome_virulence_2631/raw/2631.card.tab
 # ResFinder       16      resistome_virulence_2631/raw/2631.resfinder.tab
 # VFDB    0       resistome_virulence_2631/raw/2631.vfdb.tab

 Database        Hit_lines       File
 MEGARes 35      resistome_virulence_2914/raw/2914.megares.tab
 CARD    31      resistome_virulence_2914/raw/2914.card.tab
 ResFinder       11      resistome_virulence_2914/raw/2914.resfinder.tab
 VFDB    0       resistome_virulence_2914/raw/2914.vfdb.tab

 # #ABRicate thresholds: MINID=70 MINCOV=50
 # Database        Hit_lines       File
 # MEGARes 24      resistome_virulence_2605/raw/2605.megares.tab
 # CARD    21      resistome_virulence_2605/raw/2605.card.tab
 # ResFinder       4       resistome_virulence_2605/raw/2605.resfinder.tab
 # VFDB    3       resistome_virulence_2605/raw/2605.vfdb.tab

 conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
 #NEED_TO_ADAPT: OUTDIR = Path("resistome_virulence_An7")
 #NEED_TO_ADAPT: SAMPLE = "An7"
 #DEPRECATED (requires manual settings): python ~/Scripts/merge_amr_sources_by_gene.py

 python ~/Scripts/export_resistome_virulence_to_excel_py36.py \
   --workdir resistome_virulence_2914 \
   --sample 2914 \
   --out Resistome_Virulence_2914.xlsx
 # Delete the column 'COVERAGE_MAP' in all 'Raw_*' sheets

Report

 Please find below a summary of genomic analyses for samples 2605, 2617, 2631 and 2914.

 ### 1. Assembly and checkM

         ------------------------------------------------------------------------
         Bin Id     Completeness   Contamination   Strain heterogeneity
         ------------------------------------------------------------------------
         2631_         100.00          100.00             78.57
         2617_         100.00          100.00             78.57
         2605_         100.00            0.00              0.00
         2914_          99.98            0.63              0.00
         ------------------------------------------------------------------------

         The CheckM results show that the assemblies of 2631_ and 2617_ are both 7.0-7.1 Mb in size with 100.00% contamination. This pattern suggests that each DNA sample contained two closely related strains of the same species, i.e. a non-clonal culture. If the true genome size is the typical ~3.7 Mb and the assembler could not merge the two highly similar strains, it assembles both side by side, giving a ~7.0 Mb assembly in which every marker gene is duplicated.
         The assemblies 2605_.fna (3.7 Mb) and 2914_.fna (about 3.9 Mb) are pure isolates.

         ### 2. Species Identification

         **Sample 2605_:** *Acinetobacter baumannii* ✅ Confirmed

         | Parameter | Value | Interpretation |
         |---|---|---|
         | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
         | ANI | 98.02% | ✅ Well above 95% species threshold |
         | AF (Alignment Fraction) | 0.874 | ✅ 87.4% of genome aligns; ANI estimate is robust |
         | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |

         🟢 **Conclusion:** 2605_ is confidently assigned to *Acinetobacter baumannii*.

         ***

         **Sample 2617_:** *Acinetobacter baumannii* ✅ Confirmed

         | Parameter | Value | Interpretation |
         |---|---|---|
         | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
         | ANI | 98.00% | ✅ Well above 95% species threshold |
         | AF (Alignment Fraction) | 0.859 | ✅ 85.9% of genome aligns; ANI estimate is robust |
         | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |

         🟢 **Conclusion:** 2617_ is confidently assigned to *Acinetobacter baumannii*.

         ***

         **Sample 2631_:** *Acinetobacter baumannii* ✅ Confirmed

         | Parameter | Value | Interpretation |
         |---|---|---|
         | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
         | ANI | 98.07% | ✅ Well above 95% species threshold |
         | AF (Alignment Fraction) | 0.860 | ✅ 86.0% of genome aligns; ANI estimate is robust |
         | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |

         🟢 **Conclusion:** 2631_ is confidently assigned to *Acinetobacter baumannii*.

         ***

         **Sample 2914_:** *Acinetobacter baumannii* ✅ Confirmed

         | Parameter | Value | Interpretation |
         |---|---|---|
         | Closest Reference | GCF_009759685.1 | Reference genome of *A. baumannii* |
         | ANI | 98.11% | ✅ Well above 95% species threshold |
         | AF (Alignment Fraction) | 0.873 | ✅ 87.3% of genome aligns; ANI estimate is robust |
         | Final Taxonomy | `d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__Acinetobacter baumannii` | Consistent with genomic expectations |

         🟢 **Conclusion:** 2914_ is confidently assigned to *Acinetobacter baumannii*.

 ### 3. AMR Genes and Virulence Factors (VFDB) Summary

 Since 2631_ and 2617_ are not pure isolates but mixtures of two strains, I excluded them from the AMR and VFDB analysis. For the other two samples, see Resistome_Virulence_2605.xlsx and Resistome_Virulence_2914.xlsx.

Since CheckM requires all the genome files (bins) to be in the same directory, you first need to gather your .fna files into a single folder before running the command.

Here are the exact bash commands to prepare your files and run CheckM for all four samples (2605, 2617, 2631, and 2914).

Step 1: Prepare the Input and Run CheckM

Run the following commands in your terminal:

# 1. Go up one level to the main 'bacass_out' directory
cd ..

# 2. Create directories for CheckM inputs and outputs
mkdir -p checkm_input checkm_output

# 3. Copy all .fna files into the 'checkm_input' folder
# (CheckM cannot search subdirectories, so they must be in one folder)
find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;

# 4. Run CheckM on all 4 assemblies
checkm lineage_wf -x fna checkm_input checkm_output

(Note: If you get an error saying the CheckM database is not found, you need to set the path to your CheckM data folder first using: checkm data setRoot /path/to/your/checkm_data)

Step 2: How to Interpret the Results

CheckM will print a summary table directly to your terminal, and it will also save detailed statistics in checkm_output/bin_stats_ext.tsv.

Since you are working with Acinetobacter sp. (which typically has a genome size of ~3.5 to 4.5 Mb), a 7.0 Mb assembly is almost certainly a mix of two closely related strains that the assembler failed to collapse.

Look specifically at these three columns in the CheckM output for your 2631_ sample:

Completeness: This should be high (ideally > 95%). Because the assembler built two genomes, CheckM will likely still find all the marker genes, so completeness might look deceptively high.
Contamination: This is the most critical metric. For a pure isolate, this should be < 5%. If your 2631 assembly is a mixed strain, CheckM will detect the duplicated marker genes and flag this as high contamination (often > 20-50%).
Strain heterogeneity: This column specifically calculates if there are multiple strains of the same species present. A high value here confirms that your assembly contains a mixture of closely related strains.

What to do next?

If 2631 has high contamination/strain heterogeneity: You have confirmed it is a mixed assembly. You can either use a tool like purge_dups to separate the two strains, or map your raw reads back to the assembly and filter out the contigs with half the average coverage.
If 2631 has low contamination: The 7.0 Mb size might be real (some Acinetobacter strains have large plasmids or are naturally larger), but you should still verify this using a BlobPlot (GC vs. Coverage) to be absolutely sure there is no foreign DNA.

Check the other 3 samples: Compare the CheckM results of 2605, 2617, and 2914 to see if they are pure, high-quality genomes (high completeness, low contamination).

        ------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Bin Id         Marker lineage         # genomes   # markers   # marker sets   0    1     2    3   4   5+   Completeness   Contamination   Strain heterogeneity
        ------------------------------------------------------------------------------------------------------------------------------------------------------------------
        2631_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
        2617_           root (UID1)              5656         56            24        0    0     56   0   0   0       100.00          100.00             78.57        
        2605_    f__Moraxellaceae (UID4680)       86         689           365        0   689    0    0   0   0       100.00           0.00               0.00        
        2914_    g__Acinetobacter (UID4685)       42         1193          254        1   1188   4    0   0   0       99.98            0.63               0.00        
        ----------------------------------------------------------------------------------------------------------------------------------------------------------------
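
The purity check on a table like this can also be done programmatically. Below is a minimal Python sketch that takes the last three columns of each row and applies the thresholds mentioned earlier (completeness > 95%, contamination < 5%); the function names are hypothetical and the rows are copied from the table above.

```python
def parse_checkm_row(line):
    """Extract bin id and the last three numeric columns of a CheckM row."""
    fields = line.split()
    return {
        "bin": fields[0],
        "completeness": float(fields[-3]),
        "contamination": float(fields[-2]),
        "heterogeneity": float(fields[-1]),
    }

def looks_pure(row, min_completeness=95.0, max_contamination=5.0):
    """Apply the completeness and contamination thresholds."""
    return (row["completeness"] > min_completeness
            and row["contamination"] < max_contamination)

rows = [
    "2631_ root (UID1) 5656 56 24 0 0 56 0 0 0 100.00 100.00 78.57",
    "2617_ root (UID1) 5656 56 24 0 0 56 0 0 0 100.00 100.00 78.57",
    "2605_ f__Moraxellaceae (UID4680) 86 689 365 0 689 0 0 0 0 100.00 0.00 0.00",
    "2914_ g__Acinetobacter (UID4685) 42 1193 254 1 1188 4 0 0 0 99.98 0.63 0.00",
]
parsed = [parse_checkm_row(r) for r in rows]
pure_bins = [p["bin"] for p in parsed if looks_pure(p)]
print(pure_bins)
```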

Ah, I understand now! Thank you for clarifying. The CheckM results you showed earlier (with 100% contamination and 56 markers in the ‘2’ column) were indeed from this Unicycler assembly.

Looking at your Unicycler log, we can see exactly why it didn’t fix the problem:

Total Length: 7,199,947 bp (~7.2 Mb)
Main Chromosome: Component 1 consists of 416 fragmented contigs totaling 7,165,229 bp.

Why Unicycler Failed to Collapse the Genome

The log states: “Since you provided only short reads, Unicycler will essentially function as a SPAdes-optimiser.”

Because you only have short reads (Illumina), Unicycler builds a de Bruijn graph. When a bacterial sample contains a mixture of two closely related strains (a very common scenario in non-clonal isolates), the assembler sees the slight genetic differences between the two strains as “bubbles” in the graph.

Without long reads to span across these bubbles and prove which path belongs to which strain, the software conservatively outputs both paths. This results in an assembly that contains both strains side-by-side, doubling your genome size to ~7.2 Mb and causing the 100% duplication flagged by CheckM.

How to Fix This (Since Re-assembling Won’t Work)

Running another short-read assembler will just result in the exact same 7.2 Mb duplicated assembly. You now need to computationally “purge” the duplicate contigs.

Option 1: The Quick Fix using BBMap (`dedupe.sh`)

If you have the BBMap suite installed, it has a fantastic tool specifically for removing duplicate sequences. It will compare all your contigs to each other and remove the redundant ones.

dedupe.sh in=assembly.fasta out=assembly_purged.fasta minidentity=95

minidentity=95 tells it to remove contigs that are 95% identical or more to another contig. You can adjust this (e.g., 90 or 98) depending on how closely related the two strains are.
After running this, run CheckM again on assembly_purged.fasta. If the duplicated copy was removed, the genome size should drop to about 3.5 Mb and the contamination to near 0%.

Option 2: The Thorough Fix using `purge_dups`

purge_dups is the gold standard for separating primary contigs from duplicated “haplotigs” in an assembly. It works by aligning your assembly to itself and using read depth to separate the true genome from the duplicates.

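Map reads to the assembly and compute read depth: minimap2 -xmap-pb assembly.fasta long_reads.fastq | gzip -c - > reads.paf.gz, then pbcstat reads.paf.gz (writes PB.base.cov and PB.stat; this step expects long reads, and the purge_dups README describes ngscstat on BAM files as the short-read equivalent)
Calculate coverage cutoffs: calcuts PB.stat > cutoffs 2>calcults.log
Split the assembly: split_fa assembly.fasta > assembly.split.fa
Align to itself: minimap2 -xasm5 -DP assembly.split.fa assembly.split.fa > alignment.paf
Purge: purge_dups -2 -T cutoffs -c PB.base.cov alignment.paf > dups.bed 2> purge_dups.log
Extract pure genome: get_seqs -e dups.bed assembly.fasta (This writes purged.fa, the de-duplicated candidate genome of about 3.5 Mb, and hap.fa with the removed copies).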

Option 3: Do you have Long Reads?

If you happen to have Oxford Nanopore (ONT) or PacBio reads for this sample, you can solve this biologically rather than computationally. Running Unicycler with long reads (unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq) will allow the assembler to easily span across the “bubbles” and collapse the two strains into a single, perfect, circular ~3.5 Mb chromosome.

Bonus Observation: You likely have 3 plasmids!

Looking at the end of your log, Unicycler successfully closed 3 circular contigs:

8,731 bp (7.89x coverage)
5,837 bp (5.27x coverage)
2,120 bp (7.74x coverage)

These are almost certainly plasmids. The massive ~7.2 Mb “Component 1” is your duplicated chromosome. When you use purge_dups or dedupe.sh to fix the chromosomal duplication, these plasmids should remain intact in your final assembly!

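Yes, this kind of forced de-duplication (purging duplicates / haplotype purging) can indeed introduce assembly errors or bias.

In eukaryotic assemblies (e.g. animals and plants), separating haplotypes is routine; in bacterial (haploid) genomes, however, forcing a de-duplication tool to "repair" the 7.2 Mb assembly is a double-edged sword.

Below are the specific errors and risks this operation can introduce, together with the reasons behind them:

1. Deleting genuine biological repeats (over-purging / false-negative loss)

Bacterial genomes naturally contain many genuine repeats, for example:

rRNA operons (often several highly similar copies).
IS elements (insertion sequences) and transposons (very active in bacteria such as Acinetobacter, often present in multiple copies).
Gene families or paralogs.

Risk: If you use BBMap (dedupe.sh) with a high similarity threshold (e.g. minidentity=95), the tool cannot distinguish "duplication caused by an assembly artifact" from "repeats naturally present in the genome". It may treat genuine, functionally important IS elements or rRNA copies (e.g. those relevant to resistance or virulence) as "redundant haplotypes" and delete them, leaving the final genome without key genes.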

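2. Producing a chimeric assembly

Your CheckM results show a perfect 1:1 duplication (all marker genes are in the '2' column). This usually points to one of two scenarios:

Case A (assembler artifact): the sample is pure, but SPAdes/Unicycler assembled the same sequence twice because of complex local repeats or sequencing bias.
Case B (impure sample / mixed strains): two very closely related strains are mixed in your culture. Short reads cannot span the SNP/indel differences between them (the bubbles in the de Bruijn graph), so the assembler keeps both genome copies.

Risk: In case B, when deciding which copy to keep and which to discard, the de-duplication tool may switch back and forth between the sequences of the two strains. The result is not a real single genome but a "Frankenstein" chimera that does not exist in nature.

Consequence: Such a chimera severely affects downstream SNP calling (many false-positive variants), phylogeny reconstruction, and the accurate placement of antimicrobial resistance (AMR) genes.

3. Breaking genome contiguity (structural breaks)

Risk: While stripping out the "secondary copies", the de-duplication algorithm may introduce artificial breaks into otherwise contiguous contigs. This fragments the assembly (more contigs, lower N50), and a circular chromosome that could have been closed may break into several linear pieces.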

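How to handle 2631 and 2617 safely (best practice)

To avoid the errors described above, do not run a de-duplication tool blindly. Work through the following checks first:

Step 1: Map the raw reads back and inspect the coverage distribution

Map your raw short reads (clean reads) back to the 7.2 Mb assembly (with bwa or bowtie2), then compute the mean coverage of each contig.

If coverage is very even across all contigs (e.g. all around 100x): this indicates case A (assembler artifact). The genome is pure and was merely assembled twice by mistake. De-duplication is relatively safe here, but still requires care.
If contig coverage shows two clear peaks (e.g. half at 100x and half at 50x): this confirms case B (mixed strains).
If coverage is highly erratic: more complex contamination may be present.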
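
Once per-contig mean coverage is available (for example from the read mapping described above), the "even versus two peaks" decision can be roughed out in code. The following Python sketch splits contigs at the midpoint between the lowest and highest coverage and compares the length-weighted means of the two groups; the function name, the cutoffs (0.7 and 0.2) and all coverage numbers are assumptions for illustration, and small high-copy plasmids would need to be excluded first.

```python
def coverage_profile(contigs, ratio_cutoff=0.7, min_share=0.2):
    """Classify per-contig coverage as 'uniform' or 'two peaks'.

    contigs: list of (length_bp, mean_coverage) tuples.
    """
    coverages = [cov for _, cov in contigs]
    midpoint = (min(coverages) + max(coverages)) / 2
    low = [(l, c) for l, c in contigs if c < midpoint]
    high = [(l, c) for l, c in contigs if c >= midpoint]
    if not low or not high:
        return "uniform"

    def weighted_mean(group):
        return sum(l * c for l, c in group) / sum(l for l, _ in group)

    total_length = sum(l for l, _ in contigs)
    low_share = sum(l for l, _ in low) / total_length
    # Two peaks: the low group is clearly shallower AND both groups are large.
    two_peaks = (weighted_mean(low) < ratio_cutoff * weighted_mean(high)
                 and min(low_share, 1 - low_share) >= min_share)
    return "two peaks" if two_peaks else "uniform"

# Hypothetical contigs: (length in bp, mean coverage)
even = [(50000, 98), (40000, 102), (60000, 100)]
mixed = [(50000, 100), (45000, 52), (55000, 98), (50000, 49)]
print(coverage_profile(even))
print(coverage_profile(mixed))
```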

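Step 2: Choose a strategy based on the coverage result

For mixed strains (case B):
- The best biological solution: re-streak to isolate a single colony, then re-extract and re-sequence the DNA. This is the most reliable way to obtain a pure genome.
- Computational solution: do not use dedupe.sh. Use a coverage-based separation approach instead (e.g. a metagenomic binning tool or BlobTools) that exploits the 100x vs 50x coverage difference to separate the contigs of the two strains into two independent genomes, then run CheckM on each of them.
For an assembler artifact (case A):
- You can try purge_dups, which is more sophisticated than dedupe.sh and combines self-alignment with coverage to distinguish primary sequences from haplotigs.
- Key step: after running purge_dups, run CheckM and BUSCO again on the extracted primary sequences.
- Acceptance criteria: if CheckM reports Completeness > 95% and Contamination < 5%, and the genome size has dropped to ~3.5 Mb, de-duplication succeeded without removing core genes. If Completeness drops sharply, genuine genes were removed and purge_dups should be re-run with more conservative settings.

Step 3: Try re-assembling (alternative)

Sometimes Unicycler's default parameters do not suit particular Acinetobacter strains. You can try running SPAdes directly in a specific mode:

spades.py --careful -1 2631__1.fastp.fastq.gz -2 2631__2.fastp.fastq.gz -o spades_2631_careful

The --careful option makes SPAdes try to correct mismatches and short indels after assembly; sometimes its internal repeat resolution collapses the duplicated regions that Unicycler wrongly retained. Note that SPAdes does not accept --careful together with --isolate, so choose one of the two modes.

Summary

Forced de-duplication does carry the risk of creating chimeras or deleting genuine repeats. For bacterial genomes, "computational patching" is a last resort. The most rigorous approach is to confirm coverage by read mapping; for mixed strains, separate them by binning or repeat the experiment; for assembly redundancy, use purge_dups and rely strictly on a second CheckM run to verify genome completeness.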

Protected: First Response of Data_Tam_Metagenomics_2026_Wastewater


HUMAnN pathway abundance calculation explained in detail (Data_Tam_Metagenomics_2026_Soil)


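Overall workflow

HUMAnN (HMP Unified Metabolic Analysis Network) is the core bioBakery tool for functional profiling of metagenomes[[12]]. Pathway abundance is computed in a multi-step process:

Calculation steps:

1. Gene family abundance → 2. Reaction abundance → 3. Pathway abundance

Detailed calculation principles

Step 1: Gene family abundance

Starting from the raw sequencing reads:

Reads are aligned to reference databases (nucleotide search with Bowtie2 against pangenomes, then translated search with DIAMOND against a protein database such as UniRef)
Alignments are weighted by alignment quality, coverage and sequence length
RPK (reads per kilobase) values are produced

Formula:

gene abundance = Σ(alignment weights) / gene length (kb)

Each read has a total weight of 1.0, which is distributed across its gene matches according to alignment quality[[9]].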
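
The RPK formula above can be illustrated with a small Python sketch. It assumes, for simplicity, that each read's weight of 1.0 is split across its hits in proportion to an alignment score; the function name, the scores and the gene lengths are hypothetical, and HUMAnN's real weighting is more involved.

```python
def gene_family_rpk(alignments, gene_lengths_bp):
    """Compute RPK per gene from weighted read alignments.

    alignments: dict read_id -> list of (gene, alignment_score).
    gene_lengths_bp: dict gene -> length in base pairs.
    """
    weighted_reads = {}
    for hits in alignments.values():
        total_score = sum(score for _, score in hits)
        for gene, score in hits:
            # Each read contributes a total weight of 1.0 across its hits.
            share = score / total_score
            weighted_reads[gene] = weighted_reads.get(gene, 0.0) + share
    return {gene: weight / (gene_lengths_bp[gene] / 1000)
            for gene, weight in weighted_reads.items()}

# Hypothetical alignments and gene lengths.
alignments = {
    "read1": [("geneA", 1.0)],
    "read2": [("geneA", 3.0), ("geneB", 1.0)],
}
lengths = {"geneA": 1000, "geneB": 500}
rpk = gene_family_rpk(alignments, lengths)
print(rpk)
```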

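Step 2: Reaction abundance

Each biochemical reaction is catalyzed by one or more genes:

Reaction abundance = Σ(abundances of all genes catalyzing the reaction)

Step 3: Pathway abundance

This is the key step. A pathway contains several reactions, and the reactions relate to one another in different ways:

Core principle: pathway abundance is limited by the "weakest link"[[1]]

Method:

Reactions in series (all required): harmonic mean
Alternative reactions (parallel routes): maximum (max)
Optional reactions: counted only when their abundance exceeds the harmonic mean of the required reactions[[1]]

Because the harmonic mean is pulled strongly toward the smallest value, the final pathway abundance stays close to that of the least abundant required reaction. Taking the minimum over the key reactions is a convenient approximation, not the exact value.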
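
The series/alternative rules can be sketched as a small recursive function. The nested ("AND"/"OR", children) tuples and the function name are hypothetical and do not reflect HUMAnN's internal pathway format; optional reactions are left out for brevity.

```python
from statistics import harmonic_mean


def pathway_abundance(node, reaction_abundance):
    """Evaluate a pathway structure recursively.

    node: a reaction ID (str), or a tuple (op, children) with op "AND" or "OR".
    reaction_abundance: dict mapping reaction ID -> abundance (missing = 0).
    """
    if isinstance(node, str):
        return reaction_abundance.get(node, 0.0)
    op, children = node
    values = [pathway_abundance(child, reaction_abundance) for child in children]
    if op == "AND":
        # Required steps in series: one missing step zeroes the whole pathway.
        if any(v <= 0 for v in values):
            return 0.0
        return harmonic_mean(values)
    if op == "OR":
        # Alternative routes: the best-supported route is used.
        return max(values)
    raise ValueError(f"unknown operator: {op}")


structure = ("AND", ["R1", ("OR", ["R2a", "R2b"]), "R3"])
print(pathway_abundance(structure, {"R1": 4.0, "R2a": 2.0, "R2b": 4.0, "R3": 4.0}))
```

Dropping R3 from the abundance dictionary makes the result 0.0, which is the "weakest link" behaviour in its most extreme form.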

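Worked example

Scenario: glycolysis

Assume the glycolysis pathway contains 5 key reactions (R1-R5):

Glucose → R1 → G6P → R2 → F6P → R3 → ... → Pyruvate

Gene-reaction mapping:

R1: catalyzed by genes GK1 and GK2 (redundant)
R2: catalyzed by gene PGI
R3: catalyzed by gene PFK
R4: catalyzed by gene ALDO
R5: catalyzed by gene GAPDH

Gene abundances after sequencing (in RPK):

GK1:  8.0
GK2:  4.0
PGI:  10.0
PFK:  6.0
ALDO: 7.0
GAPDH: 5.0

Calculation:

① Reaction abundances:

R1 = GK1 + GK2 = 8.0 + 4.0 = 12.0  (redundant genes are summed)
R2 = PGI = 10.0
R3 = PFK = 6.0
R4 = ALDO = 7.0
R5 = GAPDH = 5.0

② Pathway abundance: glycolysis is a series of reactions (every step must occur), so apply the simplified "weakest link" rule:

Pathway abundance ≈ min(R1, R2, R3, R4, R5)
          = min(12.0, 10.0, 6.0, 7.0, 5.0)
          = 5.0 RPK

Interpretation: under this simplification, the glycolysis pathway abundance in this sample is 5.0 RPK, set by the weakest link (R5/GAPDH). It suggests the community can support roughly 5.0 "complete copies" of the pathway[[1]]. The harmonic mean of the same five values is about 7.2 RPK, so the two rules agree on the bottleneck but not on the exact number.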
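
The arithmetic of this example can be re-checked in a few lines. The dictionaries below simply restate the example values; the harmonic mean is computed alongside the minimum for comparison.

```python
from statistics import harmonic_mean

# Gene abundances (RPK) and gene-reaction mapping from the example.
gene_rpk = {"GK1": 8.0, "GK2": 4.0, "PGI": 10.0,
            "PFK": 6.0, "ALDO": 7.0, "GAPDH": 5.0}
reaction_genes = {"R1": ["GK1", "GK2"], "R2": ["PGI"], "R3": ["PFK"],
                  "R4": ["ALDO"], "R5": ["GAPDH"]}

# Reaction abundance = sum over the genes that catalyze it.
reactions = {r: sum(gene_rpk[g] for g in genes)
             for r, genes in reaction_genes.items()}

weakest_link = min(reactions.values())             # simplified rule
hmean = harmonic_mean(list(reactions.values()))    # rule for steps in series

print(reactions)
print(weakest_link)        # 5.0
print(round(hmean, 2))     # 7.22
```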

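Normalization

Why normalize?

Raw RPK values depend on sequencing depth and cannot be compared directly across samples[[1]].

Example:

Sample A: total reads = 10 million, pathway abundance = 5.0 RPK
Sample B: total reads = 20 million, pathway abundance = 5.0 RPK

Both are 5.0 RPK, but the relative abundance is higher in sample A.

Normalization method:

CPM (copies per million in HUMAnN's usage; this is not RPKM, which divides by sequencing depth in millions of reads):

Normalized abundance = (raw RPK / total RPK) × 1,000,000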
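
A minimal sketch of this sum-normalization, assuming the RPK values of one sample are held in a dictionary (the function name and the pathway labels are made up for the example):

```python
def to_cpm(rpk_by_feature):
    """Scale the RPK values of one sample so that they sum to 1,000,000."""
    total = sum(rpk_by_feature.values())
    if total == 0:
        return {name: 0.0 for name in rpk_by_feature}
    return {name: value / total * 1_000_000
            for name, value in rpk_by_feature.items()}


sample = {"pathway_1": 5.0, "pathway_2": 15.0}
print(to_cpm(sample))
```

Dividing by 1,000,000 afterwards gives relative abundance on a 0-1 scale.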

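A pathabundance_relab.tsv file already holds normalized relative abundances. It is not written by the main HUMAnN run itself; it is typically produced by renormalizing the default RPK output (for example with humann_renorm_table --units relab)[[11]].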

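Output files

HUMAnN produces two key pathway files:

File	Meaning	Range
`pathcoverage.tsv`	Pathway coverage (presence/absence score)	0-1
`pathabundance.tsv`	Pathway abundance (relative copy number)	0 and above

Coverage vs abundance:

Coverage = 0.8: a score of 0.8 that the pathway is present (loosely, 80% confidence)
Abundance = 5.0: roughly 5 complete copies of the pathway (before normalization)[[9]]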

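Stratification

HUMAnN can also report each species' contribution to a pathway:

Total pathway abundance ≈ species A contribution + species B contribution + unclassified contribution

For example:

Glycolysis total abundance: 5.0 RPK
├─ Escherichia coli:  3.0 RPK (60%)
├─ Bacteroides fragilis: 1.5 RPK (30%)
└─ Unclassified: 0.5 RPK (10%)

For pathways this sum is only approximate: the community total is computed from community-level reaction abundances, so the stratified values need not add up exactly. The breakdown still helps you see which microbes drive a given function[[21]].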

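Key points

✅ Pathway abundance ≈ abundance of the weakest key reaction (the bucket principle)
✅ Units: RPK (before normalization), or CPM / relative abundance (after normalization)
✅ Cross-sample comparison: use normalized values (pathabundance_relab.tsv)[[15]]
✅ Interpretation: an abundance of 5.0 means the weakest link of the pathway is covered at about 5.0 RPK
✅ Accuracy: the calculation accounts for gene redundancy, reaction relationships, optional steps, and similar biological complexity

References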

bioBakery Forum – Pathway abundance calculation [[1]]
HUMAnN SOP – HMP Data Coordination Center [[9]]
HUMAnN3 Documentation – Huttenhower Lab [[12]]
bioBakery 3 publication (Nature Methods, 2021) [[13]]

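HUMAnN pathway abundance calculation

Basic principle

HUMAnN (HMP Unified Metabolic Analysis Network) is part of the bioBakery tool suite and computes the abundance of functional pathways from metagenomic data.

Steps

1. Gene family quantification

Align the sequencing reads to the UniRef gene family database
Compute the abundance of each gene family (reads per kilobase, RPK)

2. Pathway mapping

Map gene families to MetaCyc pathways
A pathway usually contains several reaction steps, and each step may be catalyzed by several gene families

3. Pathway abundance calculation

Key formula (simplified):

Pathway abundance ≈ min(abundances of all reaction steps in the pathway)

where:

Abundance of a reaction step = sum of the abundances of all gene families for that step
Final pathway abundance ≈ minimum over all reaction step abundances (bottleneck principle)

HUMAnN itself combines required steps with a harmonic mean, which stays close to this minimum; the min rule is used here because it is easier to follow by hand.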

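Worked example

Assumptions

Take a simple metabolic pathway, "Glycolysis", with 3 reaction steps:

Reaction step 1: glucose → glucose-6-phosphate

Catalyzed by gene families UniRef90_A and UniRef90_B
UniRef90_A abundance = 100 RPK
UniRef90_B abundance = 50 RPK
Step 1 abundance = 100 + 50 = 150 RPK

Reaction step 2: glucose-6-phosphate → fructose-6-phosphate

Catalyzed by gene family UniRef90_C
UniRef90_C abundance = 80 RPK
Step 2 abundance = 80 RPK

Reaction step 3: fructose-6-phosphate → fructose-1,6-bisphosphate

Catalyzed by gene families UniRef90_D and UniRef90_E
UniRef90_D abundance = 200 RPK
UniRef90_E abundance = 120 RPK
Step 3 abundance = 200 + 120 = 320 RPK

Pathway abundance

Glycolysis pathway abundance = min(step 1, step 2, step 3)
                    = min(150, 80, 320)
                    = 80 RPK

Why the minimum?

It follows the "bucket principle" (bottleneck effect)
The overall capacity of the pathway is limited by its least abundant reaction step
Step 2 has only 80 RPK, so even though the other steps are abundant, the pathway as a whole cannot exceed 80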

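Normalization

HUMAnN values can also be normalized:

1. RPK (reads per kilobase)

RPK = (reads aligned to the gene) / (gene length in kb)

2. RPKM (optional)

RPKM = RPK / (total reads in millions)

3. Relative abundance

Pathway relative abundance = (pathway RPKM) / (sum of RPKM over all pathways) × 1,000,000

Unit: CPM (copies per million). Without the factor of 1,000,000 the same ratio is the relative abundance on a 0-1 scale.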

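Example with your data

From your output:

otu_table()   OTU Table:         [ 1097 taxa and 2 samples ]

These 1097 "taxa" are actually 1097 MetaCyc pathways. Each pathway's abundance value has gone through:

Raw abundance: computed with the bottleneck rule described above
Normalization: converted to relative abundance (sum = 1 or 100%)
Output file: the values in pathabundance_relab.tsv are relative abundances

Key points

Feature	Description
Calculation	Approximately the minimum abundance over all reaction steps in the pathway
Unit	Usually relative abundance (0-1 or 0-100%)
Biological meaning	Reflects the pathway's potential metabolic capacity
Advantage	Accounts for pathway completeness rather than simply summing
Limitation	Cannot distinguish active from inactive pathways (needs transcriptomic validation)

Caveats

⚠️ Important:

Pathway abundance reflects gene potential, not actual metabolic activity
A pathway being present ≠ the pathway being used
Transcriptomic (RNA-seq) or metabolomic data are needed to establish actual activity
With n = 1 sample per group, only descriptive comparisons are possible, not statistical inference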

STEP 5 — Descriptive visualisations (appropriate for n = 1 per group) for Data_Tam_Metagenomics_2026_Soil


Updated Step 5: PNG Figures + Complete Excel Exports

Prerequisites Update

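Add openxlsx to your package list (for Excel export). phyloseq is distributed through Bioconductor rather than CRAN, so install it with BiocManager:

# Install if needed
install.packages(c("ggplot2", "vegan", "dplyr", "tidyr", "tibble",
                   "pheatmap", "openxlsx", "BiocManager"))
BiocManager::install("phyloseq")

And load them (tibble provides rownames_to_column(), which Step 5 uses):

library(openxlsx)
library(tibble)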

Complete Step 5 — Replace your existing Step 5 with this

# =============================================================================
# STEP 5 — Visualisations (PNG) + Complete Excel exports
# =============================================================================

dir.create("figures", showWarnings = FALSE)
dir.create("tables",  showWarnings = FALSE)

# Helper: safe log2 fold change (adds pseudocount to avoid log(0))
safe_log2fc <- function(x, y, pseudo = 1e-6) {
  log2((x + pseudo) / (y + pseudo))
}

# =============================================================================
# 5a. Top-N species stacked bar plot (Loc1 vs Loc4)
# =============================================================================

top_n_sp <- 20
top_species <- names(sort(rowMeans(otu_table(species_ps)),
                          decreasing = TRUE))[1:top_n_sp]

ps_top <- prune_taxa(top_species, species_ps)

df_species <- psmelt(ps_top) %>%
  mutate(Species = factor(Species, levels = rev(top_species)))

p_species <- ggplot(df_species, aes(x = Location, y = Abundance, fill = Species)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "D") +
  labs(title = paste("Top", top_n_sp, "Species by Location"),
       x = "Location", y = "Relative Abundance", fill = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/species_top20_barplot.png", p_species,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/species_top20_barplot.png\n")

# =============================================================================
# 5b. Species heatmap (all species, row-scaled)
# =============================================================================

otu_mat <- as.matrix(otu_table(species_ps))

# Filter species with very low abundance (max < 0.01% across both samples)
keep <- apply(otu_mat, 1, max) > 0.0001
otu_filt <- otu_mat[keep, , drop = FALSE]

# Annotation column
ann_col <- data.frame(Location = metadata[colnames(otu_filt), "Location"],
                      row.names = colnames(otu_filt))

# Write heatmap to PNG
png("figures/species_heatmap.png",
    width = 10, height = max(8, 0.25 * nrow(otu_filt) + 2),
    units = "in", res = 300)

pheatmap(otu_filt,
         scale         = "row",
         clustering_distance_rows = "euclidean",
         clustering_method        = "complete",
         annotation_col = ann_col,
         main           = "Species Abundance Heatmap (row-scaled)",
         fontsize_row   = 6,
         fontsize_col   = 10,
         show_rownames  = nrow(otu_filt) <= 80)

dev.off()
cat("✅ Saved: figures/species_heatmap.png\n")

# =============================================================================
# 5c. Top pathways stacked bar plot
# =============================================================================

top_n_pw <- 20
top_pw_names <- names(sort(rowMeans(otu_table(pathway_ps)),
                           decreasing = TRUE))[1:top_n_pw]

ps_pw_top <- prune_taxa(top_pw_names, pathway_ps)

df_pw <- psmelt(ps_pw_top) %>%
  mutate(Pathway = factor(Pathway, levels = rev(top_pw_names)))

p_pw <- ggplot(df_pw, aes(x = Location, y = Abundance, fill = Pathway)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  labs(title = paste("Top", top_n_pw, "HUMAnN Pathways by Location"),
       x = "Location", y = "Relative Abundance", fill = "Pathway") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/pathways_top20_barplot.png", p_pw,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathways_top20_barplot.png\n")

# =============================================================================
# 5d. Pathway dot plot (Loc1 vs Loc4)
# =============================================================================

df_pw_wide <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway") %>%
  filter(Pathway %in% top_pw_names) %>%
  pivot_longer(-Pathway, names_to = "Sample", values_to = "Abundance") %>%
  left_join(data.frame(Sample = rownames(metadata_clean),
                       Location = metadata_clean$Location,
                       stringsAsFactors = FALSE),
            by = "Sample")

p_dot <- ggplot(df_pw_wide, aes(x = Location, y = Pathway,
                                 size = Abundance, color = Abundance)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Pathway Abundance: Loc1 vs Loc4",
       x = "Location", y = "Pathway") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7))

ggsave("figures/pathway_dotplot.png", p_dot,
       width = 8, height = 10, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathway_dotplot.png\n")

# =============================================================================
# 5e. COMPLETE species list → Excel
# =============================================================================

# Build full species table (ALL species, no cutoff)
sp_full <- as.data.frame(otu_table(species_ps)) %>%
  rownames_to_column("Species")

# Ensure columns exist (defensive)
if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(sp_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in species OTU table.")
}

sp_full <- sp_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Species,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for species
sp_wb <- createWorkbook()
addWorksheet(sp_wb, "All_Species")
addWorksheet(sp_wb, "Top50_by_Diff")
addWorksheet(sp_wb, "Top50_by_Abundance")
addWorksheet(sp_wb, "Loc1_only")
addWorksheet(sp_wb, "Loc4_only")
addWorksheet(sp_wb, "Shared")

writeData(sp_wb, "All_Species",       sp_full)
writeData(sp_wb, "Top50_by_Diff",     head(sp_full, 50))
writeData(sp_wb, "Top50_by_Abundance",
          sp_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(sp_wb, "Loc1_only",
          sp_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(sp_wb, "Loc4_only",
          sp_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(sp_wb, "Shared",
          sp_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Formatting
# fgFill sets the solid cell fill; addStyle() needs an explicit cols argument
header_style <- createStyle(textDecoration = "bold", fgFill = "#D3D3D3")
for (sh in c("All_Species","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(sp_wb, sh, style = header_style, rows = 1,
           cols = 1:ncol(sp_full), gridExpand = TRUE)
  setColWidths(sp_wb, sh, cols = 1:ncol(sp_full), widths = "auto")
  freezePane(sp_wb, sh, firstRow = TRUE)
}

saveWorkbook(sp_wb, "tables/species_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/species_Loc1_vs_Loc4.xlsx  (%d species total)\n",
            nrow(sp_full)))

# =============================================================================
# 5f. COMPLETE pathway list → Excel
# =============================================================================

pw_full <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway")

if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(pw_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in pathway OTU table.")
}

pw_full <- pw_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Pathway,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for pathways
pw_wb <- createWorkbook()
addWorksheet(pw_wb, "All_Pathways")
addWorksheet(pw_wb, "Top50_by_Diff")
addWorksheet(pw_wb, "Top50_by_Abundance")
addWorksheet(pw_wb, "Loc1_only")
addWorksheet(pw_wb, "Loc4_only")
addWorksheet(pw_wb, "Shared")

writeData(pw_wb, "All_Pathways",       pw_full)
writeData(pw_wb, "Top50_by_Diff",      head(pw_full, 50))
writeData(pw_wb, "Top50_by_Abundance",
          pw_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(pw_wb, "Loc1_only",
          pw_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(pw_wb, "Loc4_only",
          pw_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(pw_wb, "Shared",
          pw_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Reuses header_style from 5e; addStyle() needs an explicit cols argument
for (sh in c("All_Pathways","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(pw_wb, sh, style = header_style, rows = 1,
           cols = 1:ncol(pw_full), gridExpand = TRUE)
  setColWidths(pw_wb, sh, cols = 1:ncol(pw_full), widths = "auto")
  freezePane(pw_wb, sh, firstRow = TRUE)
}

saveWorkbook(pw_wb, "tables/pathways_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways total)\n",
            nrow(pw_full)))

# =============================================================================
# Summary
# =============================================================================

cat("\n========================================\n")
cat("STEP 5 COMPLETE\n")
cat("========================================\n")
cat("Figures (PNG, 300 dpi):\n")
cat("  • figures/species_top20_barplot.png\n")
cat("  • figures/species_heatmap.png\n")
cat("  • figures/pathways_top20_barplot.png\n")
cat("  • figures/pathway_dotplot.png\n")
cat("\nExcel tables (complete lists, no cutoff):\n")
cat(sprintf("  • tables/species_Loc1_vs_Loc4.xlsx   (%d species)\n", nrow(sp_full)))
cat(sprintf("  • tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways)\n", nrow(pw_full)))
cat("========================================\n")

What You Get

📊 Figures (PNG, 300 dpi, publication-ready)

File	Content	Size
`species_top20_barplot.png`	Top 20 species stacked bar	10 × 8 in
`species_heatmap.png`	All species above 0.01% threshold, row-scaled	10 × auto (scales with # species)
`pathways_top20_barplot.png`	Top 20 pathways stacked bar	10 × 8 in
`pathway_dotplot.png`	Top 20 pathways dot plot	8 × 10 in

📑 Excel Files (complete lists, no cutoff)

Each workbook contains 6 sheets:

Sheet	Content
`All_Species` / `All_Pathways`	Every detected feature, sorted by absolute difference
`Top50_by_Diff`	Top 50 features by |Loc4 − Loc1|
`Top50_by_Abundance`	Top 50 features by mean abundance
`Loc1_only`	Features detected only in Loc1
`Loc4_only`	Features detected only in Loc4
`Shared`	Features detected in both locations

📋 Columns in each Excel file

Column	Meaning
`Species` / `Pathway`	Feature name
`Abundance_Loc1`	Raw relative abundance in Loc1
`Abundance_Loc4`	Raw relative abundance in Loc4
`Diff_Loc4_minus_Loc1`	Absolute difference (Loc4 − Loc1)
`Log2FC_Loc4_vs_Loc1`	Log₂ fold change (with pseudocount 1e-6 to handle zeros)
`Present_in_Loc1`	TRUE/FALSE
`Present_in_Loc4`	TRUE/FALSE
`Total_Abundance`	Sum across both samples
`Mean_Abundance`	Mean across both samples

Notes

Log₂ fold change: Uses a small pseudocount (1e-6) to avoid log(0). For features absent in one location, this gives a large but finite fold change — interpret these as “presence/absence” rather than true fold change.
Heatmap filtering: Species with maximum abundance < 0.01% across both samples are excluded to keep the heatmap readable. Adjust the threshold in keep <- apply(otu_mat, 1, max) > 0.0001 if needed.
Excel formatting: Headers are bold with grey background, columns auto-sized, and the first row is frozen for easy scrolling.
File locations: All outputs go into figures/ and tables/ subfolders inside your current working directory (reports/).
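
To see why fold changes for features absent in one location are better read as presence/absence, it helps to vary the pseudocount. The sketch below restates the log₂ fold change formula in Python purely as an illustration; the function name and the second pseudocount (1e-9) are assumptions and are not part of the R script above.

```python
import math


def log2fc_with_pseudocount(x, y, pseudo=1e-6):
    """log2 fold change of x over y, with a pseudocount added to both."""
    return math.log2((x + pseudo) / (y + pseudo))


# Feature present in both locations: the pseudocount barely matters.
print(log2fc_with_pseudocount(0.02, 0.01))

# Feature absent in one location: the value is driven by the pseudocount.
print(log2fc_with_pseudocount(0.01, 0.0, pseudo=1e-6))
print(log2fc_with_pseudocount(0.01, 0.0, pseudo=1e-9))
```

The last two calls differ by almost 10 log₂ units although the data are identical, so the magnitude carries no biological meaning in that case.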

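The hospital wastewater treatment process significantly reduced the relative abundance of Streptococcus parasanguinis in the wastewater

This result indicates that, after accounting for natural variation between sampling times (season/month), the hospital wastewater treatment process significantly reduced the relative abundance of Streptococcus parasanguinis in the wastewater.

The statistical fields are explained column by column below:

1. Field-by-field explanation

Field	Value	Explanation
feature	Streptococcus.parasanguinis	Feature name: the species that changed significantly. It is a common bacterium of the human oral cavity and gut and an opportunistic pathogen.
value	Post	Group being compared: because `Pre` (before treatment) was set as the reference level in MaAsLin2, `Post` means this row is the "after treatment" vs "before treatment" comparison.
coef	-0.0006712921	Effect size (coefficient): the change in the species' abundance after treatment, on the scale the model was fitted on. The negative sign indicates a decrease. If the input relative abundances (0 to 1) were modeled without transformation, this corresponds to a drop of about 0.067 percentage points; if a log transformation was applied, the coefficient is on the log scale and should not be read as a percentage.
pval	0.0001454243	Raw p-value: the significance level before multiple-testing correction. It is far below 0.05 (and below 0.001), so the difference between before and after treatment is highly significant for this single feature.
qval	0.01915278	Adjusted q-value (FDR): the false discovery rate. Metagenomic data test hundreds to thousands of species at once, so multiple-testing correction (for example Benjamini-Hochberg) is needed to limit false positives. The q-value is below 0.05, so the difference remains statistically significant after correction.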
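
The relationship between the pval and qval columns can be illustrated with a plain implementation of the Benjamini-Hochberg procedure. This is a generic sketch with made-up p-values; it is not MaAsLin2's code, and the q-values in your table depend on all features tested in that run.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A q-value is never smaller than the corresponding p-value, and it grows with the number of features tested, which is why filtering rare features before testing matters.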

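2. Interpretation in the context of your experimental design

In your design, the MaAsLin2 model included two variables: Treatment (before/after) and TimePoint (November, January, March, May).

Controlling for time (TimePoint): the microbial community in wastewater changes naturally with season, temperature, and other time-related factors. Including TimePoint as a covariate adjusts for this, so the decrease of 0.00067 is the Pre vs Post effect after accounting for sampling month rather than an artifact of it.
Statistical support: the raw p-value (0.00014) is small and the q-value (0.019) is below the 0.05 threshold, so the result survives multiple-testing correction. With a small number of samples it should still be confirmed on additional data.

3. Biological and environmental significance

Streptococcus parasanguinis is normally part of the human oral and respiratory flora, but it can also cause opportunistic infections such as endocarditis. Detecting it in hospital wastewater suggests that it comes from the hospital's medical discharge or from human excreta.
Environmental significance: the species is significantly reduced after treatment (Post), which suggests that your hospital wastewater treatment system retains, degrades, or inactivates this opportunistic pathogen and lowers the potential biosafety risk of releasing it into the environment. This is an encouraging engineering outcome.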

The difference in min_abundance thresholds between species (0.001) and pathways (0.0001) is intentional and reflects how differently MetaPhlAn and HUMAnN data are distributed:

1. Data Distribution Differences

Species data (MetaPhlAn):

Community composition is typically dominated by a few highly abundant species
The species that matter for the comparison usually have relative abundances above 0.1% (0.001), while the many rarer ones are close to the detection limit
A threshold of 0.001 (0.1%) filters out “noise” while retaining biologically meaningful species

Pathway data (HUMAnN):

Functional profiles are much more evenly distributed across thousands of pathways (you have 15,785 pathways!)
Many biologically important pathways exist at very low relative abundances (e.g., antibiotic resistance, xenobiotic degradation, rare metabolic functions)
A higher threshold would eliminate too many pathways, potentially losing critical functional signals

2. Why 0.0001 (0.01%) for Pathways?

Captures the “long tail” of functional diversity: Many critical metabolic pathways operate at low abundance but are essential for ecosystem function
Balances sensitivity and specificity: Combined with min_prevalence = 0.1 (the pathway must appear in at least 10% of samples), this ensures we only test features that are consistently detected, even if at low levels
Standard practice in metagenomics: HUMAnN pathway analyses commonly use lower abundance thresholds because functional redundancy means even rare pathways can be important
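
How min_abundance and min_prevalence act together can be sketched as follows. The function is a hypothetical stand-in written for this explanation; MaAsLin2's own filter may use slightly different comparison operators.

```python
def passes_filter(abundances, min_abundance=0.0001, min_prevalence=0.1):
    """Keep a feature if enough samples reach the abundance threshold.

    abundances: relative abundances of one feature across all samples.
    """
    n_samples = len(abundances)
    if n_samples == 0:
        return False
    detected = sum(1 for a in abundances if a >= min_abundance)
    return detected / n_samples >= min_prevalence


# With 11 samples, 10% prevalence means at least 2 samples (1/11 is below 0.1).
one_hit = [0.0005] + [0.0] * 10
two_hits = [0.0005, 0.0002] + [0.0] * 9
print(passes_filter(one_hit), passes_filter(two_hits))
```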

3. You Can Adjust This

If you want to be more conservative (fewer pathways tested, faster computation, less multiple testing correction burden), you could increase it:

min_abundance = 0.001  # Same as species, 0.1% threshold

Or if you want to be even more sensitive:

min_abundance = 0.00001  # 0.001% threshold

4. Practical Recommendation

Given your dataset has only 11 samples, I’d actually recommend increasing the pathway threshold to match species (min_abundance = 0.001) because:

With small sample sizes, testing many low-abundance features increases the multiple testing burden
Testing fewer features makes the Q-value correction less stringent
You’ll get more robust, interpretable results


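Can differentiated organoid flip? Differentiation of murine intestinal organoids for Yersinia enterocolitica (Y. enterocolitica) infection studies, in two-polarity conformations.

Given the detailed background you provided, the slide's question “Can differentiated organoid flip?” is a central one that goes straight to a practical pain point of the experiment.

Here “flip” does not mean physically turning over the spheres in the dish like a pancake. It refers to apical-basal polarity reversal, or to a change of culture format (from 3D to a 2D monolayer).

Below is what the question is really asking and why it matters for your research:

I. Background: why discuss a “flip”?

Recall the “two-polarity conformations” and the Y. enterocolitica infection mentioned in your background:

Physiological state in vivo: in the real gut, pathogens such as Yersinia are located in the intestinal lumen. When they invade, they first meet the luminal (apical) side of the epithelial cells, which carries the microvilli and the mucus layer.
The key limitation of conventional 3D organoids: intestinal organoids grown in Matrigel have an “inward” polarity. The apical side faces inward (toward the central lumen) and the basal side faces outward (toward the matrix and medium). Bacteria added to the medium only reach the basal side, which does not reflect the physiological route of invasion from the lumen.
Purpose of the “flip”: to make the experiment physiological, the organoid has to be flipped so that the apical side faces outward and is directly exposed to the bacteria-containing medium. This is the motivation behind the slide's question.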

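II. Scientific answer: can a differentiated organoid “flip”?

The answer: yes, but it requires specific techniques.

When the slide asks whether a differentiated organoid can flip, it is really addressing a cell-biology question: the organoid has already differentiated into mature absorptive cells, goblet cells, Paneth cells, and so on. Can this differentiated state be disrupted and rearranged to adopt a new polarity?

In practice, a “flip” is usually achieved in one of the following ways:

1. 3D to 2D monolayer culture (the most common and most complete flip)

This is currently a widely used way to solve the polarity problem.

Procedure: dissociate the differentiated 3D organoids from the gel into single cells or small clumps and seed them onto a Transwell (permeable membrane insert).
Result: the cells regrow on the membrane and form the “simple columnar epithelium” mentioned in your background. The basal side attaches to the membrane (facing the lower compartment) and the apical side, with its microvilli, faces up (toward the upper compartment).
Advantage: the apical side becomes directly accessible. Bacteria added on top infect the apical side directly; the lower compartment can model the blood side and receive immune cells or drugs.

2. Microfluidic chips (gut-on-a-chip)

This is an advanced version of the Transwell. A cell monolayer is grown in a microfluidic channel; one channel models the gut lumen (bacteria are added here) and the other models the vasculature (fluid shear and immune cells). This is also a form of “flip” and models the dynamic in vivo environment more closely.

3. Direct polarity reversal of 3D cysts (technically demanding)

The 3D spheres are induced to rearrange so that the apical side faces outward, for example by centrifugal force, by partial enzymatic digestion of the Matrigel, or with specific small molecules. It is hard to guarantee uniform polarity across all cells and the organoid structure is easily damaged, so this route is used less often for infection experiments than monolayers.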

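III. The scientific considerations behind the slide's question

By asking “Can differentiated organoid flip?”, the speaker is leading the audience toward three key questions (useful for deepening your understanding or for asking questions):

Plasticity of the differentiated state: the organoid has matured in the 3D gel. When it is dissociated and re-plated as a 2D monolayer (the flip), can the cells re-establish correct apical-basal polarity? Do they keep their original cell types (for example, can goblet cells still secrete mucus)?
Barrier integrity: are the tight junctions between cells damaged during flipping and re-plating? If the barrier leaks, bacteria no longer need to invade cells and simply pass through the gaps, which defeats the purpose of studying the invasion mechanism of Yersinia.
Preservation of microvilli and mucus layer: Yersinia infection requires crossing the intestinal mucus layer. After the flip, with the apical side facing out, do the mucus layer from goblet cells and the microvilli of absorptive cells still form and function normally?

IV. One-sentence summary

Given your background, the accurate reading of “Can differentiated organoid flip?” on the slide is:

“Can an already differentiated, mature intestinal organoid undergo a reversal of cell polarity (apical side out) through a change of culture format (for example from 3D to a 2D monolayer), so as to overcome the apical-in limitation of conventional 3D culture and model the invasion of Yersinia from the gut lumen?”

The conclusion is: yes. Converting 3D organoids into a 2D monolayer with Transwell or microfluidic chip technology is a well-suited strategy for achieving the polarity flip and studying infection by intestinal pathogens.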

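The English term for 杯状细胞 is:

Goblet cell

(plural: Goblet cells)

Additional notes:

Origin of the name: the cell is shaped like a goblet
Main function: secretes mucus (mucin), forming the mucus layer that protects the epithelium
Location: intestinal epithelium, airway epithelium, conjunctiva, and others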

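If by “flip” you mean that an organoid undergoes a polarity flip (inside-out flipping) after it has already differentiated, the answer is:

Yes, under certain conditions it can.

What is a “flip”?

Many epithelium-derived organoids (such as intestinal, airway, and kidney organoids) have a clear apical-basal polarity:

Apical side: usually faces the internal lumen of the organoid
Basal side: usually faces the external matrix (such as Matrigel)

Normally the structure is:

External matrix
    ↓
[Basal side]
  Cell layer
[Apical side]
    ↓
   Lumen

When researchers remove the matrix, change the culture conditions, or apply mechanical treatment, the organoid may reorganize into:

External medium
    ↓
[Apical side]
  Cell layer
[Basal side]
    ↓
  Interior

This is what is usually called an inside-out organoid, or polarity reversal.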

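Can it still flip after differentiation?

Usually yes.

The reasons are:

Differentiation state and polarity state are not the same thing
- The cells have differentiated into absorptive cells, goblet cells, ciliated cells, and so on,
- but their spatial arrangement and polarity remain plastic to some degree.
Cells sense the external ECM (extracellular matrix)
- Receptors such as integrins continuously sense matrix signals.
- When the ECM environment changes, cells may re-establish their polarity.
It has been achieved in many experiments
- Intestinal organoids
- Lung/airway organoids
- Kidney organoids

Polarity reversal after maturation has been reported for all of these.

Will the differentiated features be lost?

Not necessarily.

In many cases:

differentiation markers are still present;
the cell type composition changes little;
only the orientation of the apical and basal sides changes.

But if the culture conditions change too much or for too long:

some cells may dedifferentiate;
the proportions of differentiated lineages may shift;
the organoid structure may remodel.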

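If you mean something else by “flip”

Researchers sometimes also use “flip” to mean:

Physically turning the organoid over
Polarity flip
Differentiated cells returning to a stem-cell state (fate reversal)

Of these:

Polarity flip → can often occur;
Complete fate reversal (differentiated → stem-like) → may also occur, but through a different mechanism, and it is usually called dedifferentiation (US pronunciation: /diˌdɪfəˌrɛnʃiˈeɪʃən/) rather than “flip”.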


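I. Organoid: in depth

1. Definition

An organoid is a miniature three-dimensional tissue-like structure formed in vitro, in a three-dimensional (3D) culture system, from adult stem cells (ASC) or pluripotent stem cells (PSC, including embryonic stem cells and induced pluripotent stem cells) through self-organization and directed differentiation. It reproduces part of the cell types, spatial architecture, and physiological functions of the organ of origin.

2. Core features (four points)

Feature	Description
Three-dimensional structure	Unlike conventional two-dimensional (2D) monolayer culture, organoids have three-dimensional cell polarity and cell-cell junctions
Multiple cell types	Contain several differentiated cell types rather than a single cell line
Self-organization	Cells arrange themselves into in vivo-like structures following their intrinsic developmental program, without fine guidance from an external scaffold
Functionality	Can perform some organ-specific functions, such as absorption/secretion in intestinal organoids or neural electrical activity in brain organoids

3. Classification by source

Adult stem cell-derived (ASC-derived): taken from adult stem cells in a tissue (such as intestinal crypt stem cells). Differentiation potential is limited (only the cell types of that tissue), but maturity is high and closer to the in vivo state.
Pluripotent stem cell-derived (PSC-derived): taken from embryonic stem cells or iPS cells. Differentiation potential is broad and embryonic development can be modeled, but maturity is comparatively low.

4. Applications

Disease modeling (infection, tumors, genetic disease)
Drug screening and toxicity testing
Regenerative medicine (transplantation and repair)
Developmental biology research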

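II. Enteroid: in depth

1. Definition

An enteroid is a specific subtype of organoid. It refers to a miniature intestinal tissue derived from intestinal epithelial stem cells (mainly Lgr5⁺ crypt base columnar cells of the small intestine or colon) that forms in 3D culture and contains all major intestinal epithelial cell types.

2. Cell types in enteroids (core)

Cell type	Function
Enterocyte (absorptive cell)	Nutrient absorption, ion transport
Goblet cell	Secretes mucus, protects the epithelial barrier
Paneth cell	Secretes antimicrobial peptides (such as lysozyme), regulates the stem cell niche
Enteroendocrine cell	Secretes hormones (such as GLP-1, serotonin), regulates digestion and appetite
Stem cell (Lgr5⁺ CBC cell)	Self-renews and differentiates to replenish the other cell types
Tuft cell	Involved in immune sensing and defense against parasites

3. Structural features

When cultured in Matrigel (basement membrane matrix), enteroids form polarized cyst-like structures:
- Luminal side (apical side): faces the internal cavity, carries microvilli, and corresponds to the gut lumen (the side in contact with food and microbiota)
- Basolateral side: faces outward and corresponds to the basement membrane side of the gut (the side in contact with blood vessels and connective tissue)
This polarity is what “two-polarity conformations” in your original sentence refers to.

4. Correspondence with the intestine in vivo

The crypt-villus axis of the intestine is simplified in vitro into budding structures: the buds are rich in stem cells and Paneth cells (corresponding to crypts), and the central cyst is rich in differentiated cells (corresponding to the villus region).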

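III. Organoid vs enteroid: detailed comparison

Aspect	Organoid	Enteroid
Chinese term	类器官	肠类器官 (or 肠道类器官)
Hierarchy	Umbrella term	Subcategory
Tissue of origin	Any organ: brain, liver, kidney, lung, intestine, pancreas, retina, and others	Small intestine or colon only
Stem cell type	ASC or PSC	Usually ASC (intestinal crypt stem cells)
Differentiation	Directed toward the cell lineages of the corresponding organ	Directed toward the intestinal epithelial cell types
Culture conditions	Organ-dependent, requiring different factor combinations (for example Noggin for brain, HGF for liver)	Usually EGF, Noggin, R-spondin, Wnt3a (the classic “enteroid medium”)
Structure	Organ-dependent (for example brain-region-like structures in brain organoids, hepatic-plate-like structures in liver organoids)	Cyst with buds, with the crypt-villus polarity typical of intestinal epithelium
Typical applications	Broad (neurodegenerative disease, liver cancer, kidney development, lung fibrosis, and others)	Mainly intestinal infection (such as Salmonella, Yersinia), inflammatory bowel disease (IBD), colorectal cancer, nutrient metabolism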

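IV. Why your sentence uses enteroid rather than organoid

Your original sentence:

“Differentiation of murine enteroids for Y. enterocolitica infection in two-polarity conformations”

The subject of study is Yersinia enterocolitica, an enteric pathogen that causes infection mainly by invading the intestinal epithelium.
To study such bacteria, you need a model that reproduces the intestinal epithelial barrier, the mucus layer, polarized transport, and the mechanism of bacterial invasion.
Enteroids provide exactly that:
- the full diversity of intestinal epithelial cells
- physiological apical-basal polarity (bacteria invade from the apical side, immune signals are passed on from the basal side)
- a microenvironment closer to the in vivo situation

Using organoid (the general term) would be too vague: readers could not tell whether you are working with organoids from intestine, stomach, or another organ. Enteroid is therefore the most precise and most specific term here.

V. Related terms: colonoid and gastroid

To make the scope of enteroid clearer, here are two related terms:

Term	Chinese	Origin	Note
Enteroid	肠类器官	Small intestine or colon (broad sense)	Sometimes used narrowly for small intestine only
Colonoid	结肠类器官	Colon (large intestine)	A subdivision of enteroid
Gastroid	胃类器官	Stomach tissue	A digestive tract organoid like the enteroid, but a separate category

In strict usage, enteroid sometimes refers only to small-intestine-derived and colonoid to colon-derived structures. In everyday literature the terms are often mixed and grouped as intestinal organoids, with enteroid being the more specialized term.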

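VI. Note on “two-polarity conformations”

This point is closely related to the concepts you asked about, so here is a separate explanation:

In vivo, the intestinal epithelium is a simple columnar epithelium with clear apical-basal polarity:
- Apical side: faces the gut lumen and is in contact with food, microbiota, and pathogens
- Basolateral side: faces the basement membrane and is in contact with the lamina propria, blood vessels, and immune cells
In enteroid culture in vitro:
- Conventional conformation: the apical side faces the internal lumen (a closed cavity) and the basal side faces outward (toward the medium). In this conformation bacteria cannot easily reach the apical side (it is inside the cavity), so infection studies require microinjection or disruption of the organoid.
- Reversed conformation (reversed polarity): special culture methods (such as removing the Matrigel or using a Transwell) turn the apical side outward or establish a monolayer with an open apical surface, so that pathogens in the medium can reach the apical side directly. This is more convenient for infection experiments.

In your sentence, “two-polarity conformations” refers to these two polarity arrangements (conventional closed lumen vs open/reversed polarity), used to compare which is better suited to studying the invasion mechanism of Yersinia.

Further topics:

Culture methods for enteroids (medium components, Matrigel handling)
How “reversed polarity” is achieved (specific protocols)
Case studies of Yersinia infection in enteroids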

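This is an important question and an easy one to get confused about. The conclusion first:

The crypt is neither the luminal (apical) side nor the basolateral side.

To make this clear, it helps to separate two dimensions: spatial position and cell polarity.

I. Spatial position (anatomical view)

In real intestinal tissue, crypts and villi are two different kinds of surface relief on the gut wall:

Villus: projects upward like a finger into the inside of the gut (the lumen).
Crypt: dips downward like a small pocket, embedded in the lamina propria deeper in the gut wall.

So the crypt is a “pit” and the villus is a “bump”. Their relationship is one of relief (high and low), not of inner and outer surface.

II. Cell polarity (cell biology view)

The key point: whether in the crypt or on the villus, the cells form a single-layered epithelium, and every cell has its own luminal side and its own basolateral side.

Cell location	Luminal side (apical)	Basolateral side
Cells on the villus	Faces the gut lumen (in contact with food and microbiota)	Faces the lamina propria inside the villus (in contact with blood vessels)
Cells in the crypt	Faces the crypt lumen (the cavity of the pocket)	Faces the basement membrane and lamina propria around the crypt

In other words:

The crypt itself is a “sunken luminal space”. The stem cells and Paneth cells in the crypt have their apical membrane facing the cavity of the crypt and their basolateral membrane facing the tissue space around the crypt wall.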

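III. A diagram (text version)

        ═══════════════════════════════   ← Gut lumen (the true lumen, where food passes)
        ↑  Villus                    ↑
        |  Apical side of cells faces the gut lumen |
        |  Basolateral side of cells faces inward   |
        ═══════════════════════════════   ← Basement membrane
                 ↓↓↓ dips downward ↓↓↓
        ═══════════════════════════════   ← Crypt opening
        |  Crypt                      |
        |  Apical side of cells faces the crypt cavity   |
        |  Basolateral side of cells faces the periphery |
        ═══════════════════════════════   ← Basement membrane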

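IV. Answering your question directly

Q: Is the crypt the luminal side?
- No. The crypt is a structural region (a depression), whereas the luminal (apical) side is one face of a cell (the face toward the cavity). Cells in the crypt have their own luminal side (facing the crypt cavity), but the crypt itself is not the luminal side.
Q: Is the crypt the basolateral side?
- No. The basolateral side is the face of a cell that contacts the basement membrane and the interstitium. Cells in the crypt also have a basolateral side (facing the basement membrane around the crypt), but the crypt itself is not the basolateral side.

V. What corresponds to this in an enteroid?

In enteroids cultured in 3D in vitro:

In vivo structure	Corresponding structure in the organoid
Villus	Central cyst (central cavity) of the organoid, with thinner epithelium containing differentiated cells
Crypt	Bud of the organoid, a bud-like outward protrusion rich in Lgr5⁺ stem cells and Paneth cells

Note: the “bud” protrudes outward rather than dipping downward, because the organoid's polarity is “inside-out” relative to the Matrigel (apical side toward the inner cavity, basal side outward).

VI. One-sentence summary

The crypt is a “structural position” (a depressed region), not “a face” of a cell. Cells in the crypt have both a luminal side (toward the crypt cavity) and a basolateral side (toward the basement membrane around the crypt).

Further topics:

Why is organoid polarity “inside-out”?
How is “apical side out” achieved in a Transwell system?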

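1. Does “simple columnar epithelium” mean there is only one layer of cells?

Yes, exactly.

“Simple” in histology is literal: a single layer of cells, unlike skin (stratified squamous epithelium), where many layers are stacked.
“Columnar” describes the cell shape: not round and not flat, but like a column, taller than wide, so the cells look long and slender.

Because there is only one layer, these cells carry a heavy load: they act as a barrier against harmful agents (bacteria, toxins) and as carriers that absorb nutrients.

2. Does one side of the cell face the gut lumen and the other side face the basement membrane?

Exactly. This is the classic polarity of a single-layered epithelium.

Because there is only one layer, each cell has to divide its tasks between its two ends:

Apical side (luminal side): faces the gut lumen (where digesta pass). This side is densely covered with microvilli that increase the absorptive surface, and it has no direct contact with blood vessels.
Basolateral side: faces the basement membrane. The basement membrane acts like a layer of glue and a filter that joins the epithelium to the lamina propria beneath it (which contains blood vessels, lymphatics, and immune cells). Nutrients absorbed by the cell are released on this side and enter the blood vessels.

3. Building a three-dimensional picture (important addition)

You may wonder: “The crypt dips inward, so inside that pit, which side counts as the lumen?”

Remember this rule: whichever face looks onto the cavity is the apical side.

On the villus: apical side → faces the large cavity in the middle of the gut (the gut lumen).
In the crypt: apical side → faces the cavity of the crypt “pit”. Although this cavity is a blind-ended depression, it is still part of the lumen.
Basolateral side: always faces the basement membrane (away from the food, toward the inside of the body).

4. Applying this to the “two-polarity conformations” in your work

With this structure in mind, the pain point of the in vitro experiment is easy to see:

In vivo: bacteria (such as Yersinia) are in the gut lumen and directly contact the apical side of the cells, and then invade.
In conventional organoids (in 3D gel): the apical side (the face with microvilli) is enclosed inside (facing the organoid's inner cavity), while the bacteria are outside the gel and contact the basal side. This is the wrong way round. That is why researchers go to the effort of setting up “two-polarity conformations” (reversed polarity): the aim is to turn the apical side outward so that bacteria can attack from the apical side as they do in vivo.

In one sentence: simple = a single layer of cells; polarity = apical side toward the gut lumen (absorption/infection side), basolateral side toward the basement membrane (the side connected to the body).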
