-
Input data
#Data_Ute_smallRNA_3/ Data_Ute_smallRNA_4/ Data_Ute_smallRNA_7/ #Data_Ute_smallRNA_3_nextflow/ Data_Ute_smallRNA_6_size_selected/ Data_Ute_smallRNA_7_out/ See samples in Sequencing_overview.xlsx Batch,Date,Status 1,August 2021,⚪ Experimental (nf655, nf657) 2,February 2022,❌ Not found in my processed datasets (nf701, nf705) 3,August 2022,⚪ Experimental (nf774, nf780) 4,November 2022,⚪ Experimental (nf796, nf797) 5,June 2023,⚪ Experimental (nf876, nf887 (MKL-1 but scr_DMSO!)) 6,August 2023,⚪ Experimental (nf916, nf917, nf918 (MKL-1 but scr_DMSO!), nf919 (MKL-1 but scr_DMSO!)) #? Maybe we can include the parental cells, EVs (see 3 MKL-1_wt, 1 MKL_1_derived_EV_miRNA in the point 'BATCH_1, 3, 4') and scr_DMSO_EVs for MKL-1 (3 samples in BATCH_5 and BATCH_6) for exceRpt running! #BATCH_1: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_3$ find . -name "*fastq.gz" (6 samples) ./210817_NB501882_0294_AHW5Y2BGXJ/nf656/MKL_1_derived_EV_RNA_S13_R1_001.fastq.gz ./210817_NB501882_0294_AHW5Y2BGXJ/nf658/WaGa_derived_EV_RNA_S14_R1_001.fastq.gz #ERROR_wrong_demulti ./210817_NB501882_0294_AHW5Y2BGXJ_neu/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz #ERROR_wrong_demulti ./210817_NB501882_0294_AHW5Y2BGXJ_neu/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz #BATCH_3: ./220617_NB501882_0371_AH7572BGXM/nf774/0403_WaGa_wt_S20_R1_001.fastq.gz #* ./220617_NB501882_0371_AH7572BGXM/nf780/0505_MKL1_wt_S26_R1_001.fastq.gz # newDemulti of BATCH_1, 3, 4: * ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz * * ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz * ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz * ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz * * ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz * * ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz * #BATCH_5: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_4$ find . -name "*.fastq.gz" () #ERROR_wrong_adapter ./230602_NB501882_0428_AHKG53BGXT/demulti_wrong_adapter_trimming/nf876/1002_WaGa_sT_Dox_S1_R1_001.fastq.gz #ERROR_wrong_adapter ./230602_NB501882_0428_AHKG53BGXT/demulti_wrong_adapter_trimming/nf887/2312_MKL_1_scr_DMSO_S2_R1_001.fastq.gz ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf876/1002_WaGa_sT_Dox_S1_R1_001.fastq.gz * NOT_USED: MKL-1 but scr_DMSO: ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf887/2312_MKL_1_scr_DMSO_S2_R1_001.fastq.gz #BATCH_6: (plot-numpy1) jhuang@WS-2290C:/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_6_size_selected$ find . -name "*.fastq.gz" ./230811_NB501882_0432_AHMW2CBGXT/nf916/1002_WaGa_wt_S1_R1_001.fastq.gz ./230811_NB501882_0432_AHMW2CBGXT/nf917/1002_WaGa_sT_Dox_S2_R1_001.fastq.gz * NOT_USED: MKL-1 but scr_DMSO: ./230811_NB501882_0432_AHMW2CBGXT/nf918/2312_MKL_1_scr_DMSO_S3_R1_001.fastq.gz * NOT_USED: MKL-1 but scr_DMSO: ./230811_NB501882_0432_AHMW2CBGXT/nf919/2403_MKL_1_scr_DMSO_S4_R1_001.fastq.gz # BATCH_100 # Data_Ute_smallRNA_7/$ find . -name "*fastq.gz" (10 samples) ./231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz ./231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz #I can’t find the nf940/11_control_MKL1_S11_R1_001.fastq.gz sample so I am not sure what kind of sample or condition it is. #NOT_USED_SINCE_IT_DOES_NOT_BELONG_TO_THE_PROJECT: ./231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz #NOT_USED_SINCE_IT_DOES_NOT_BELONG_TO_THE_PROJECT: ./231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz # Data_Ute_smallRNA_7$ find . -name "*.fastq.gz" (6 samples) ./250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R2_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R2_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R2_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R2_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R2_001.fastq.gz ./250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R2_001.fastq.gz #941,07 € 1051,79*0.85=894.0215 #FINAL DATASETS for running: newDemulti_of_BATCH_1_3_4 + BATCH_100 - Samples[nf941, nf940]!, Using directly with the orignal sample names nf*. MKL-1 wt cells (nf780*, nf796*, nf797*) MKL-1 wt_EV_RNA (nf655*) WaGa wt cells (nf774* (Considering to be deleted, due to possibly be an outlier, but in the current version, it is still included in the analysis), nf961, nf962) WaGa wt_EV_RNA (nf657* (The sample was excluded, since it is obviously a outlier, not clustered with the other 2 samples), nf930, nf935) WaGa_sT_DMSO_EV_RNA (nf931, nf936, nf971) WaGa_sT_Dox_EV_RNA (nf932, nf937, nf972) WaGa_scr_DMSO_EV_RNA (nf933, nf938, nf973) WaGa_scr_Dox_EV_RNA (nf934, nf939, nf974) -
Adapter trimming
#some common adapter sequences from different kits for reference: # - TruSeq Small RNA (Illumina): TGGAATTCTCGGGTGCCAAGG # - Small RNA Kits V1 (Illumina): TCGTATGCCGTCTTCTGCTTGT # - Small RNA Kits V1.5 (Illumina): ATCTCGTATGCCGTCTTCTGCTTG # - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific): TGGAATTCTCGGGTGCCAAGG # - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina): TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC * mkdir Data_Ute_exceRpt_workspace/trimmed; cd Data_Ute_exceRpt_workspace/trimmed echo "------------------------------------ cutadapting nf774 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf774.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf657 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf657.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf655 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf655.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf780 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf780.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf796 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf796.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf797 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf797.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_4/230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf930 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf930.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf930/01_0505_WaGa_wt_EV_RNA_S1_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf931 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf931.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf931/02_0505_WaGa_sT_DMSO_EV_RNA_S2_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf932 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf932.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf932/03_0505_WaGa_sT_Dox_EV_RNA_S3_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf933 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf933.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf933/04_0505_WaGa_scr_DMSO_EV_RNA_S4_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf934 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf934.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf934/05_0505_WaGa_scr_Dox_EV_RNA_S5_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf935 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf935.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf935/06_1905_WaGa_wt_EV_RNA_S6_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf936 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf936.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf936/07_1905_WaGa_sT_DMSO_EV_RNA_S7_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf937 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf937.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf937/08_1905_WaGa_sT_Dox_EV_RNA_S8_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf938 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf938.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf938/09_1905_WaGa_scr_DMSO_EV_RNA_S9_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf939 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf939.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf939/10_1905_WaGa_scr_Dox_EV_RNA_S10_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf940 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf940.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf940/11_control_MKL1_S11_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf941 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf941.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/231016_NB501882_0435_AHG7HMBGXV/nf941/12_control_WaGa_S12_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf961 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf961.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf961/WaGaWTcells_1_S1_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf962 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf962.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf962/WaGaWTcells_2_S2_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf971 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf971.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf971/2001_WaGa_sT_DMSO_S3_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf972 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf972.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf972/2001_WaGa_sT_Dox_S4_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf973 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf973.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf973/2001_WaGa_scr_DMSO_S5_R1_001.fastq.gz >> LOG echo "------------------------------------ cutadapting nf974 -----------------------------------" >> LOG cutadapt -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -q 20 --minimum-length 5 --trim-n -o nf974.fastq.gz ~/DATA/Data_Ute/Data_Ute_smallRNA_7/250411_VH00358_135_AAGKGLHM5/nf974/2001_WaGa_scr_Dox_S6_R1_001.fastq.gz >> LOG -
Install exceRpt (https://github.gersteinlab.org/exceRpt/)
docker pull rkitchen/excerpt mkdir MyexceRptDatabase cd /mnt/nvme0n1p1/MyexceRptDatabase wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz tar -xvf exceRptDB_v4_hg38_lowmem.tgz #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg19_lowmem.tgz #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_hg38_lowmem.tgz #http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_mm10_lowmem.tgz wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOmiRNArRNA.tgz tar -xvf exceRptDB_v4_EXOmiRNArRNA.tgz wget http://org.gersteinlab.excerpt.s3-website-us-east-1.amazonaws.com/exceRptDB_v4_EXOGenomes.tgz tar -xvf exceRptDB_v4_EXOGenomes.tgz -
Run exceRpt
#[---- REAL_RUNNING_COMPLETE_DB ---->] #NOTE that if not renamed in the input files, then have to RENAME all files recursively by removing "_cutadapted.fastq" in all names in _CORE_RESULTS_v4.6.3.tgz (first unzip, removing, then zip, mv to ../results_g). cd trimmed for file in *.fastq.gz; do echo "mv \"$file\" \"${file/.fastq/}\"" done mkdir results for sample in nf780 nf796 nf797 nf655 nf774 nf961 nf962 nf657 nf930 nf935 nf931 nf936 nf971 nf932 nf937 nf972 nf933 nf938 nf973 nf934 nf939 nf974; do docker run -v ~/DATA/Data_Ute_exceRpt_workspace/trimmed:/exceRptInput \ -v ~/DATA/Data_Ute_exceRpt_workspace/results:/exceRptOutput \ -v /mnt/nvme0n1p1/MyexceRptDatabase:/exceRpt_DB \ -t rkitchen/excerpt \ INPUT_FILE_PATH=/exceRptInput/${sample}.gz MAIN_ORGANISM_GENOME_ID=hg38 N_THREADS=50 JAVA_RAM='200G' MAP_EXOGENOUS=on done #DEBUG the excerpt env docker inspect rkitchen/excerpt:latest # Without /bin/bash → May run and exit immediately #docker run -it rkitchen/excerpt # With /bin/bash → Stays open for interaction docker run -it --entrypoint /bin/bash rkitchen/excerpt -
Processing exceRpt output from multiple samples
cd ~/DATA/Data_Ute_exceRpt_workspace/exceRpt-master mamba activate r_env mamba install -c conda-forge -c bioconda \ bioconductor-marray \ bioconductor-rgraphviz \ r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx r-rcurl r-xml \ -y mamba install -c conda-forge -c bioconda \ r-plyr r-gplots r-reshape2 r-ggplot2 r-scales r-openxlsx \ bioconductor-marray bioconductor-rgraphviz \ -y #EXCLUDE some isolates since they have total different pattern --> outliner, e.g. nf657 from WaGa wt EV: sudo mv nf657* ../results_EXCLUDED/ mkdir summaries summariesWithSampleGroupInfo (r_env) jhuang@WS-2290C:~/DATA/Data_Ute_exceRpt_workspace/exceRpt-master$ R #WARNING: need to reload the R-script after each change of the script. source("mergePipelineRuns_functions.R") processSamplesInDir("../results/", "../summaries") #mkdir summariesWithSampleGroupInfo; cp summaries/*.RData summariesWithSampleGroupInfo; rm summariesWithSampleGroupInfo/exceRpt_sampleGroupDefinitions.txt; source("mergePipelineRuns_functions_addSampleGroupInfo.R") processSamplesInDir("../results/", "../summariesWithSampleGroupInfo") #!!!!! IMPORTANT: REPORT summariesWithSampleGroupInfo/exceRpt_DiagnosticPlots.pdf !!!!! #POSSIBLE_TODO (MAYBE BETTER NOT DO THIS, SINCE I HAVE TO GENERATE PCA- and MANHATTAN PLOTS!!): Wir need to delete nf774 from WaGa_wt_cells, since it is very far to another two samples! #~/Tools/csv2xls-0.4/csv_to_xls.py exceRpt_miRNA_ReadsPerMillion.txt exceRpt_tRNA_ReadsPerMillion.txt exceRpt_piRNA_ReadsPerMillion.txt -d$'\t' -o exceRpt_results_detailed.xls -
Downstream analyis using R for miRNAs (4 MKL-1 + 17 WaGa samples)
#Input file #exceRpt_miRNA_ReadCounts.txt #exceRpt_piRNA_ReadCounts.txt ## MKL-1 wild-type controls #MKL-1_wt_cells (nf780, nf796, nf797) #MKL-1_wt_EV (nf655) # ## WaGa experimental groups (scr = scramble control; sT = target knockdown) #WaGa_scr_DMSO_EV (nf933, nf938, nf973) #WaGa_scr_Dox_EV (nf934, nf939, nf974) #WaGa_sT_DMSO_EV (nf931, nf936, nf971) #WaGa_sT_Dox_EV (nf932, nf937, nf972) # ## WaGa wild-type controls #WaGa_wt_cells (nf774, nf961, nf962) #WaGa_wt_EV (nf657 (deleted), nf930, nf935) cd ~/DATA/Data_Ute_exceRpt_workspace/summaries mamba activate r_env R #> .libPaths() #[1] "/home/jhuang/mambaforge/envs/r_env/lib/R/library" #BiocManager::install("AnnotationDbi") #BiocManager::install("clusterProfiler") #BiocManager::install(c("ReactomePA","org.Hs.eg.db")) #BiocManager::install("limma") #BiocManager::install("sva") #install.packages("writexl") #install.packages("openxlsx") library("AnnotationDbi") library("clusterProfiler") library("ReactomePA") library("org.Hs.eg.db") library(DESeq2) library(gplots) library(limma) library(sva) #library(writexl) #d.raw_with_rownames <- cbind(RowNames = rownames(d.raw), d.raw); write_xlsx(d.raw, path = "d_raw.xlsx"); library(openxlsx) setwd("../summaries/") d.raw<- read.delim2("exceRpt_miRNA_ReadCounts.txt",sep="\t", header=TRUE, row.names=1) # Desired column order desired_order <- c( "nf780", "nf796", "nf797", "nf655", "nf933", "nf938", "nf973", "nf934", "nf939", "nf974", "nf931", "nf936", "nf971", "nf932", "nf937", "nf972", "nf774", "nf961", "nf962", "nf930", "nf935" #delete "nf657", ) # Reorder columns d.raw <- d.raw[, desired_order] setdiff(desired_order, colnames(d.raw)) # Shows missing or misnamed columns #sapply(d.raw, is.numeric) d.raw[] <- lapply(d.raw, as.numeric) #d.raw[] <- lapply(d.raw, function(x) as.numeric(as.character(x))) d.raw <- round(d.raw) write.csv(d.raw, file ="d_raw.csv") write.xlsx(d.raw, file = "d_raw.xlsx", rowNames = TRUE) # ------ Code sent to Ute ------ #d.raw <- read.delim2("d_raw.csv",sep=",", header=TRUE, row.names=1) Cell_or_EV = as.factor(c("Cell","Cell","Cell", "EV", "EV","EV","EV", "EV","EV","EV", "EV","EV","EV", "EV","EV","EV", "Cell","Cell","Cell", "EV","EV")) #delete "EV", replicates = as.factor(c("MKL-1_wt_cells","MKL-1_wt_cells","MKL-1_wt_cells", "MKL-1_wt_EV", "WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV","WaGa_scr_DMSO_EV", "WaGa_scr_Dox_EV","WaGa_scr_Dox_EV","WaGa_scr_Dox_EV", "WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV","WaGa_sT_DMSO_EV", "WaGa_sT_Dox_EV","WaGa_sT_Dox_EV","WaGa_sT_Dox_EV", "WaGa_wt_cells", "WaGa_wt_cells","WaGa_wt_cells", "WaGa_wt_EV", "WaGa_wt_EV")) #delete "WaGa_wt_EV", ids = as.factor(c("nf780", "nf796", "nf797", "nf655", "nf933", "nf938", "nf973", "nf934", "nf939", "nf974", "nf931", "nf936", "nf971", "nf932", "nf937", "nf972", "nf774", "nf961", "nf962", "nf930", "nf935")) #delete "nf657", cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, Cell_or_EV=Cell_or_EV) dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates) # Filter low-count miRNAs dds <- dds[ rowSums(counts(dds)) > 10, ] #1322-->903 rld <- rlogTransformation(dds) # -- before pca -- png("pca.png", 1200, 800) plotPCA(rld, intgroup=c("replicates")) #plotPCA(rld, intgroup = c("replicates", "batch")) #plotPCA(rld, intgroup = c("replicates", "ids")) #plotPCA(rld, "batch") dev.off() png("pca2.png", 1200, 800) #plotPCA(rld, intgroup=c("replicates")) #plotPCA(rld, intgroup = c("replicates", "batch")) plotPCA(rld, intgroup = c("replicates", "ids")) #plotPCA(rld, "batch") dev.off() # Batch Effect Removal Methods (Non-batch effect removal applied!) #### STEP2: DEGs #### #- Heatmap untreated/wt vs parental; 1x for WaGa cell line #- Volcano plot untreated/wt vs parental; 1x for WaGa cell line #- Manhattan plot miRNAs; 1x for WaGa cell line #- Distribution of different small RNA species untreated/wt and parental; 1x for WaGa cell line #- Motif analysis: identify RNA-binding proteins that may regulate small RNA loading; 1x for WaGa cell line #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor sizeFactors(dds) #NULL dds <- estimateSizeFactors(dds) sizeFactors(dds) normalized_counts <- counts(dds, normalized=TRUE) write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA) write.xlsx(normalized_counts, file = "normalized_counts.xlsx", rowNames = TRUE) #---- untreated, scr_control, DMSO_control, scr_DMSO_control, sT_knockdown to parental_cells ---- dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates) dds$replicates <- relevel(dds$replicates, "untreated") dds = DESeq(dds, betaPrior=FALSE) resultsNames(dds) clist <- c("DMSO_control_vs_untreated", "scr_control_vs_untreated", "scr_DMSO_control_vs_untreated", "sT_knockdown_vs_untreated") dds$replicates <- relevel(dds$replicates, "DMSO_control") dds = DESeq(dds, betaPrior=FALSE) resultsNames(dds) clist <- c("sT_knockdown_vs_DMSO_control") dds$replicates <- relevel(dds$replicates, "scr_control") dds = DESeq(dds, betaPrior=FALSE) resultsNames(dds) clist <- c("sT_knockdown_vs_scr_control") dds$replicates <- relevel(dds$replicates, "scr_DMSO_control") dds = DESeq(dds, betaPrior=FALSE) resultsNames(dds) clist <- c("sT_knockdown_vs_scr_DMSO_control") dds$replicates <- relevel(dds$replicates, "WaGa_wt_cells") dds = DESeq(dds, betaPrior=FALSE) #default betaPrior is FALSE resultsNames(dds) clist <- c("WaGa_wt_EV_vs_WaGa_wt_cells") dds$replicates <- relevel(dds$replicates, "MKL-1_wt_cells") dds = DESeq(dds, betaPrior=FALSE) #default betaPrior is FALSE resultsNames(dds) clist <- c("MKL.1_wt_EV_vs_MKL.1_wt_cells") #NOTE that the results sent to Ute is |padj|<=0.1. for (i in clist) { contrast = paste("replicates", i, sep="_") res = results(dds, name=contrast) res <- res[!is.na(res$log2FoldChange),] #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na res$padj <- ifelse(is.na(res$padj), 1, res$padj) res_df <- as.data.frame(res) write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-")) up <- subset(res_df, padj<=0.05 & log2FoldChange>=2) down <- subset(res_df, padj<=0.05 & log2FoldChange<=-2) write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-")) write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-")) } ~/Tools/csv2xls-0.4/csv_to_xls.py \ untreated_vs_parental_cells-all.txt \ untreated_vs_parental_cells-up.txt \ untreated_vs_parental_cells-down.txt \ -d$',' -o untreated_vs_parental_cells.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ DMSO_control_vs_untreated-all.txt \ DMSO_control_vs_untreated-up.txt \ DMSO_control_vs_untreated-down.txt \ -d$',' -o DMSO_control_vs_untreated.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ scr_control_vs_untreated-all.txt \ scr_control_vs_untreated-up.txt \ scr_control_vs_untreated-down.txt \ -d$',' -o scr_control_vs_untreated.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ scr_DMSO_control_vs_untreated-all.txt \ scr_DMSO_control_vs_untreated-up.txt \ scr_DMSO_control_vs_untreated-down.txt \ -d$',' -o scr_DMSO_control_vs_untreated.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ sT_knockdown_vs_untreated-all.txt \ sT_knockdown_vs_untreated-up.txt \ sT_knockdown_vs_untreated-down.txt \ -d$',' -o sT_knockdown_vs_untreated.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ sT_knockdown_vs_DMSO_control-all.txt \ sT_knockdown_vs_DMSO_control-up.txt \ sT_knockdown_vs_DMSO_control-down.txt \ -d$',' -o sT_knockdown_vs_DMSO_control.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ sT_knockdown_vs_scr_control-all.txt \ sT_knockdown_vs_scr_control-up.txt \ sT_knockdown_vs_scr_control-down.txt \ -d$',' -o sT_knockdown_vs_scr_control.xls; ~/Tools/csv2xls-0.4/csv_to_xls.py \ sT_knockdown_vs_scr_DMSO_control-all.txt \ sT_knockdown_vs_scr_DMSO_control-up.txt \ sT_knockdown_vs_scr_DMSO_control-down.txt \ -d$',' -o sT_knockdown_vs_scr_DMSO_control.xls; # ------------------- volcano_plot ------------------- library(ggplot2) library(ggrepel) geness_res <- read.csv(file = paste("untreated_vs_parental_cells", "all.txt", sep="-"), row.names=1) external_gene_name <- rownames(geness_res) geness_res <- cbind(geness_res, external_gene_name) #top_g are from ids top_g <- c("hsa-let-7b-5p","hsa-let-7g-5p","hsa-let-7i-5p","hsa-miR-103a-3p","hsa-miR-107","hsa-miR-1224-5p","hsa-miR-122-5p","hsa-miR-1226-5p","hsa-miR-1246","hsa-miR-127-3p","hsa-miR-1290","hsa-miR-130a-3p","hsa-miR-139-3p","hsa-miR-141-3p","hsa-miR-143-3p","hsa-miR-148b-3p","hsa-miR-155-5p","hsa-miR-15a-5p","hsa-miR-17-5p","hsa-miR-184","hsa-miR-18a-3p","hsa-miR-18a-5p","hsa-miR-190a-5p","hsa-miR-191-5p","hsa-miR-193b-5p","hsa-miR-197-5p","hsa-miR-200a-3p","hsa-miR-200b-5p","hsa-miR-206","hsa-miR-20a-5p","hsa-miR-210-3p","hsa-miR-2110","hsa-miR-21-5p","hsa-miR-218-5p","hsa-miR-219a-1-3p","hsa-miR-221-3p","hsa-miR-23b-3p","hsa-miR-27a-3p","hsa-miR-27b-3p","hsa-miR-27b-5p","hsa-miR-28-3p","hsa-miR-30a-5p","hsa-miR-30c-5p","hsa-miR-30e-5p","hsa-miR-3127-5p","hsa-miR-3131","hsa-miR-3180|hsa-miR-3180-3p","hsa-miR-320a","hsa-miR-320b","hsa-miR-320c","hsa-miR-320d","hsa-miR-330-3p","hsa-miR-335-3p","hsa-miR-33b-5p","hsa-miR-340-5p","hsa-miR-342-5p","hsa-miR-3605-5p","hsa-miR-361-3p","hsa-miR-365a-5p","hsa-miR-374b-5p","hsa-miR-378i","hsa-miR-379-5p","hsa-miR-3940-5p","hsa-miR-409-3p","hsa-miR-411-5p","hsa-miR-423-3p","hsa-miR-423-5p","hsa-miR-4286","hsa-miR-429","hsa-miR-432-5p","hsa-miR-4326","hsa-miR-451a","hsa-miR-4520-3p","hsa-miR-454-3p","hsa-miR-4646-5p","hsa-miR-4667-5p","hsa-miR-4748","hsa-miR-483-5p","hsa-miR-486-5p","hsa-miR-5010-5p","hsa-miR-504-3p","hsa-miR-5187-5p","hsa-miR-590-3p","hsa-miR-6128","hsa-miR-625-5p","hsa-miR-6726-5p","hsa-miR-6730-5p","hsa-miR-676-3p","hsa-miR-6767-5p","hsa-miR-6777-5p","hsa-miR-6780a-5p","hsa-miR-6794-5p","hsa-miR-6817-3p","hsa-miR-708-5p","hsa-miR-7-5p","hsa-miR-766-5p","hsa-miR-7854-3p","hsa-miR-873-3p","hsa-miR-885-3p","hsa-miR-92b-5p","hsa-miR-93-5p","hsa-miR-937-3p","hsa-miR-9-5p","hsa-miR-98-5p") subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)) geness_res$Color <- "NS or log2FC < 2.0" geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05" geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05" geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0" write.csv(geness_res, "untreated_vs_parental_cells_with_Category.csv") geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange) geness_res <- geness_res[, -1*ncol(geness_res)] png("volcano_plot_untreated_vs_parental_cells.png",width=1200, height=1400) #svg("untreated_vs_parental_cells.svg",width=12, height=14) ggplot(geness_res, aes(x = log2FoldChange, y = -log10(pvalue), color = Color, label = external_gene_name)) + geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") + geom_hline(yintercept = -log10(0.05), lty = "dashed") + geom_point() + labs(x = "log2(FC)", y = "Significance, -log10(P)", color = "Significance") + scale_color_manual(values = c("P < 0.05"="orange","P-adj < 0.05"="red","NS or log2FC < 2.0"="darkgray"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) + geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = "black", min.segment.length = .1, box.padding = .2, lwd = 2) + theme_bw(base_size = 16) + theme(legend.position = "bottom") dev.off() # ------------------ differentially_expressed_miRNAs_heatmap ----------------- # prepare all_genes rld <- rlogTransformation(dds) mat <- assay(rld) mm <- model.matrix(~replicates, colData(rld)) mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm) assay(rld) <- mat RNASeq.NoCellLine <- assay(rld) # reorder the columns #colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa") #col.order <-c("control MKL1", "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox") #RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order] #Option4: manully defining #for i in untreated_vs_parental_cells sT_knockdown_vs_untreated DMSO_control_vs_untreated scr_control_vs_untreated scr_DMSO_control_vs_untreated sT_knockdown_vs_DMSO_control sT_knockdown_vs_scr_control sT_knockdown_vs_scr_DMSO_control; do # echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; # echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; #done #cat *.id | sort -u > ids ##add Gene_Id in the first line, delete the "" GOI <- read.csv("ids")$Gene_Id datamat = RNASeq.NoCellLine[GOI, ] # clustering the genes and draw heatmap #datamat <- datamat[,-1] #delete the sample "control MKL1" #datamat <- datamat[, 1:5] #parental_cells_1 parental_cells_2 parental_cells_3 untreated_1 untreated_2 scr_control_1 scr_control_2 scr_control_3 DMSO_control_1 DMSO_control_2 DMSO_control_3 scr_DMSO_control_1 scr_DMSO_control_2 scr_DMSO_control_3 sT_knockdown_1 sT_knockdown_2 sT_knockdown_3 --> #parental cells 1 parental cells 2 parental cells 3 untreated 1 untreated 2 scr control 1 scr control 2 scr control 3 DMSO control 1 DMSO control 2 DMSO control 3 scr DMSO control 1 scr DMSO control 2 scr DMSO control 3 sT knockdown 1 sT knockdown 2 sT knockdown 3 colnames(datamat)[1] <- "parental cells 1" colnames(datamat)[2] <- "parental cells 2" colnames(datamat)[3] <- "parental cells 3" colnames(datamat)[4] <- "untreated 1" colnames(datamat)[5] <- "untreated 2" colnames(datamat)[6] <- "scr control 1" colnames(datamat)[7] <- "scr control 2" colnames(datamat)[8] <- "scr control 3" colnames(datamat)[9] <- "DMSO control 1" colnames(datamat)[10] <- "DMSO control 2" colnames(datamat)[11] <- "DMSO control 3" colnames(datamat)[12] <- "scr DMSO control 1" colnames(datamat)[13] <- "scr DMSO control 2" colnames(datamat)[14] <- "scr DMSO control 3" colnames(datamat)[15] <- "sT knockdown 1" colnames(datamat)[16] <- "sT knockdown 2" colnames(datamat)[17] <- "sT knockdown 3" write.csv(datamat, file ="gene_expression_keeping_replicates.txt") write.xlsx(datamat, file = "gene_expression_keeping_replicates.xlsx", rowNames = TRUE) #"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC) hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete") hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete") mycl = cutree(hr, h=max(hr$height)/1.1) mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED", "PINK", "DARKORANGE", "MAROON", "LIGHTGREEN", "DARKBLUE", "DARKRED", "LIGHTBLUE", "DARKCYAN", "DARKGREEN", "DARKMAGENTA"); mycol = mycol[as.vector(mycl)] rownames(datamat) <- sub("\\|.*", "", rownames(datamat)) png("DEGs_heatmap_keeping_replicates.png", width=1000, height=1400) #svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8) heatmap.2(as.matrix(datamat), Rowv=as.dendrogram(hr), Colv=NA, dendrogram='row', labRow=row.names(datamat), scale='row', trace='none', col=bluered(75), RowSideColors=mycol, srtCol=30, lhei=c(1,8), cexRow=1.4, # Increase row label font size cexCol=1.7, # Increase column label font size margin=c(8, 12) ) dev.off() # ---------------------------------------- # ----------- manhattan_plot ------------- Rscript manhattan_plot_Carmen_custom_labels.R #exceRpt_miRNA_ReadCounts.txt
Protected: 税务等级和养老保险豁免问题
exceRpt流程热图分析 (Data_Ute_exceRpt_workspace)
exceRpt流程热图分析(中文翻译)
行标签详细说明
根据您提供的mapping_heatmap3.pdf热图,以下是各类别的详细解释:
1. 外源性序列(Exogenous Sequences)
- exogenous_genomes: 比对到外源基因组(病毒/细菌)的reads
- exogenous_rRNA: 比对到外源rRNA的reads
- exogenous_miRNA: 比对到外源miRNA的reads
2. 内源性基因组特征
- genome: 比对到参考基因组的reads(这是主要类别)
- endogenous_gapped: 需要间隔比对的内源性reads(剪接reads)
- repetitiveElements: 比对到重复序列的reads
- not_mapped_to_genome_or_libs: 未比对到基因组或小RNA文库的reads
3. 编码/非编码RNA(GENCODE)
- gencode_sense: 比对到GENCODE注释基因的正链reads(mRNA、lncRNA等)
- gencode_antisense: 反义链reads
4. 小RNA类别
- miRNA_sense: 成熟miRNA正链(重要类别)
- miRNA_antisense: 成熟miRNA反义链
- miRNAprecursor_sense: miRNA前体正链
- miRNAprecursor_antisense: miRNA前体反义链
- tRNA_sense/antisense: 转运RNA
- piRNA_sense/antisense: PIWI互作RNA
- circularRNA_sense/antisense: 环状RNA
不同组别的主要内容变化趋势
📊 关键发现:
1. 基因组比对效率(genome行)
- MKL-1 wt: 82.9%, 61.4%, 97.4% – 变异很大
- MKL-1 wt EV: 97.9% – 非常高
- WaGa wt: 99.0%, 99.0%, 91.9% – 最高且稳定
- WaGa wt EV: 91.0%, 90.6%, 89.2% – 略低于wt
趋势: WaGa野生型样本显示最高的基因组比对效率(91-99%)
2. miRNA含量(miRNA_sense行)- 🔴 最重要发现
- MKL-1 wt: 41.0%, 21.1%, 3.7% – 剧烈变化
- MKL-1 wt EV: 1.9% – 极低
- WaGa scr DMSO EV: 1.6%, 1.0%, 7.1% – 低miRNA
- WaGa wt: 88.7%, 86.6%, 81.1% – 异常高的miRNA!
- WaGa wt EV: 79.3%, 77.9%, 70.9% – 仍然非常高
关键发现: WaGa野生型细胞的miRNA含量是MKL-1的20-40倍!这表明细胞系之间存在根本性的生物学差异。
3. 未比对reads(not_mapped_to_genome_or_libs)
- MKL-1 wt: 14.8%, 21.7%, 0.5% – 变异大
- WaGa wt: 2.9%, 3.0%, 1.3% – 最低
- WaGa wt EV: 1.4%, 1.2%, 1.7% – 非常低
趋势: WaGa样本总体比对效率更好(<3%未比对)
4. gencode(mRNA/lncRNA)
- MKL-1 wt: 0.0-1.0% – 极低
- WaGa wt: 0.4%, 0.5%, 1.6% – 较高
- WaGa wt EV: 1.4%, 1.1%, 0.9% – EV中mRNA升高
5. 外源性内容
- 所有组: exogenous_miRNA, exogenous_rRNA, exogenous_genomes均为0.0%
✅ 好消息: 所有样本均未检测到病毒或细菌污染
🎯 生物学意义总结
1. 细胞系特异性miRNA谱
- WaGa细胞: miRNA丰富(70-89%的reads)
- MKL-1细胞: miRNA贫乏(2-41%的reads)
- 意义: WaGa可能是更好的miRNA研究模型
2. EV vs 细胞RNA
- MKL-1 wt EV: miRNA降低(1.9%)vs 细胞(3.7-41%)
- WaGa wt EV: 维持高miRNA(71-79%)vs 细胞(81-89%)
- 解释: WaGa细胞优先将miRNA包装到EV中
3. 质量控制
- 最佳质量: WaGa wt和WaGa wt EV(低未比对,高基因组比对)
- 变异质量: MKL-1 wt(样本间变异大)
4. 建议
- 调查MKL-1变异: 为什么nf796只有21% miRNA而nf780有41%?
- WaGa作为miRNA模型: WaGa细胞更适合miRNA/EV-miRNA研究
- EV包装机制: WaGa细胞显示高效的miRNA包装到EV中 – 值得进一步研究
TODO: 进行统计分析或创建可视化图表来突出这些趋势!
Detailed Explanation of exceRpt Pipeline Row Labels
Based on your heatmap from the exceRpt pipeline, let me explain each category and specifically address your question about the miRNA percentages.
Pipeline Flow Overview
The exceRpt pipeline processes reads through sequential filtering and alignment stages. Each percentage is normalized to reads_used_for_alignment (the denominator after quality control).
Row Label Categories Explained
1. Initial Processing Stages
- input: Total raw reads entering the pipeline (100%)
- successfully_clipped: Reads with adapters successfully removed
- failed_quality_filter: Reads failing quality thresholds (Phred score, length)
- failed_homopolymer_filter: Reads with excessive homopolymer stretches
- calibrator: Reads mapping to spike-in calibrator controls
- UniVec_contaminants: Reads identified as vector/contaminant sequences
- rRNA: Reads mapping to ribosomal RNA (typically removed)
- reads_used_for_alignment: Clean reads proceeding to alignment (this is the new 100% for downstream percentages)
2. Endogenous Alignment (Human/Host Genome)
These reads map to the reference genome of the host organism:
- genome: Reads mapping to the reference genome
- miRNA_sense/antisense: Reads mapping to endogenous miRNAs in sense/antisense orientation
- miRNAprecursor_sense/antisense: Reads mapping to miRNA precursor hairpins
- tRNA_sense/antisense: Transfer RNA mappings
- piRNA_sense/antisense: PIWI-interacting RNA mappings
- gencode_sense/antisense: Reads mapping to GENCODE gene annotations (mRNA, lncRNA, etc.)
- circularRNA_sense/antisense: Circular RNA (circRNA) mappings
- not_mapped_to_genome_or_libs: Reads that didn’t map to genome or small RNA libraries
- repetitiveElements: Reads mapping to repetitive regions (LINE, SINE, etc.)
- endogenous_gapped: Reads requiring gapped alignment (spliced reads)
3. Exogenous Alignment (Foreign/Viral Sequences) ⭐
This is the key to your question!
After endogenous alignment, unmapped reads are tested against exogenous (foreign) databases:
- input_to_exogenous_miRNA: Reads sent to exogenous miRNA alignment
- exogenous_miRNA: Reads mapping to viral/foreign miRNAs
- input_to_exogenous_rRNA: Reads sent to exogenous rRNA alignment
- exogenous_rRNA: Reads mapping to foreign rRNA (bacterial, viral)
- input_to_exogenous_genomes: Reads sent to exogenous genome alignment
- exogenous_genomes: Reads mapping to viral/bacterial genomes
🔍 Answering Your Specific Question
For sample nf796:
miRNA_sense: 1.3%
miRNA_antisense: 0.0%
input_to_exogenous_miRNA: 4.4%
exogenous_miRNA: 0.0%
Why this pattern?
The pipeline works sequentially:
-
Endogenous miRNA alignment (first pass):
- 1.3% of reads mapped to human endogenous miRNAs (sense strand)
- 0.0% mapped to antisense miRNAs
- These are known human miRNAs from miRBase
-
Exogenous miRNA alignment (second pass):
- The 4.4% represents reads that:
- Did NOT map to endogenous human miRNAs
- Were sent to search viral/foreign miRNA databases
- The 0.0% means none of these 4.4% actually matched viral miRNAs
- These are likely:
- Degraded RNA fragments
- Novel miRNAs not in databases
- Sequencing artifacts
- miRNAs from organisms not in the exogenous database
- The 4.4% represents reads that:
Visual Flow:
reads_used_for_alignment (100%)
↓
[Endogenous alignment]
↓
miRNA_sense (1.3%) ← Human miRNAs
miRNA_antisense (0.0%)
↓
Unmapped reads
↓
[Exogenous alignment]
↓
input_to_exogenous_miRNA (4.4%) ← Sent to viral DB
↓
exogenous_miRNA (0.0%) ← Actually matched viral miRNAs
↓
Remaining 4.4% → discarded or used for other exogenous searches
Key Distinction
| Category | Database | Example |
|---|---|---|
| miRNA_sense | Human miRBase | hsa-miR-21-5p |
| exogenous_miRNA | Viral miRNA DB | EBV-miR-BART1, KSHV-miR-K12 |
Why is input_to_exogenous_miRNA > miRNA_sense?
This is normal and indicates:
- Most reads did NOT map to endogenous miRNAs
- The pipeline casts a wide net, sending many unmapped reads to exogenous searches
- The 4.4% represents “candidate” reads that might be foreign miRNAs
- The 0.0% final match means no viral miRNAs were detected in nf796
If you had a virus-infected sample, you’d see:
input_to_exogenous_miRNA: e.g., 5%exogenous_miRNA: e.g., 3.2% (actual viral miRNA matches)
Other Notable Patterns in Your Heatmap
Looking at your samples:
- nf796, nf797, nf655: Very low genome mapping (~21-45%), high “not_mapped” (~55-79%)
- nf930-nf974: High genome mapping (~66-89%), suggesting better quality or different sample type
- MKL-1 vs WaGa: Different cell lines showing distinct mapping profiles
This suggests nf796/nf797/nf655 might be:
- Lower quality RNA
- Different library preparation
- Higher contamination
- Or genuinely different biology (e.g., more extracellular vesicles with fragmented RNA)
Does this clarify the pipeline stages and the miRNA percentages? Let me know if you need more details about any specific category!
Used samples (Manuscript_Marius_Karoline_2026)
🐭 实验小鼠组别详解(中文版)
以下是本研究中使用的全部 14 个实验组别的详细说明,按功能分类整理:
🔹 第一类:中风模型组(用于图 4 和补充图 3)
| 组号 | 样本前缀 | 完整样本 | 性别/年龄 | 状态 | 用途 |
|---|---|---|---|---|---|
| 1 | sample-A* |
A1–A11 | ♂ 老年 | 中风后 3 天 | 图 4D–F(血液/脑组织 SCFA 检测) |
| 2 | sample-B* |
B1–B16 | ♀ 老年 | 中风后 3 天 | 图 4D–F(血液/脑组织 SCFA 检测) |
📌 说明:这两组用于比较中风后老年雄性和雌性小鼠的微生物代谢物(短链脂肪酸)水平差异。
🔹 第二类:基线供体组(用于图 4、补充图 4、图 5C)
| 组号 | 样本前缀 | 完整样本 | 性别/年龄 | 状态 | 用途 |
|---|---|---|---|---|---|
| 3 | sample-C* |
C1–C10 | ♀ 老年 | 基线,FMT 供体 | 图 4A–C(16S 测序)、补充图 4、图 5C(Boxplot 3) |
| 4 | sample-E* |
E1–E10 | ♂ 老年 | 基线,FMT 供体 | 图 4A–C(16S 测序)、补充图 4、图 5C(Boxplot 2) |
| 5 | sample-F* |
F1–F5 | ♂ 年轻 | 基线,FMT 供体 | 对照供体,未在主图中展示 |
📌 关键说明:
- 组 3 和组 4 是粪菌移植(FMT)的供体小鼠,用于提供老年雌/雄肠道菌群
- 图 4 和补充图 4 中实际使用的样本为:♀供体 C1–C6(n=6),♂供体 E1–E8(n=8),其余样本因年龄偏小或测序深度不足被排除
🔹 第三类:FMT 预处理组(用于图 5B 紫色点)
| 组号 | 样本前缀 | 完整样本 | 性别/年龄 | 状态 | 用途 |
|---|---|---|---|---|---|
| 6 | sample-G* |
G1–G6 | ♂ 老年 | FMT 前,抗生素处理前,批次 I | 图 5B(紫色,pre-FMT 基线) |
| 7 | sample-H* |
H1–H6 | ♀ 老年 | FMT 前,抗生素处理前,批次 I | 图 5B(紫色,pre-FMT 基线) |
| 8 | sample-I* |
I1–I6 | ♂ 年轻 | FMT 前,抗生素处理前,批次 II | 图 5B(紫色,pre-FMT 基线) |
📌 说明:这三组合并为”pre-FMT”基线组(n=18),代表年轻雄性受体小鼠在接受粪菌移植之前的肠道菌群状态。
🔹 第四类:FMT 受体组(用于图 5)
| 组号 | 样本前缀 | 完整样本 | 性别/年龄 | 接受供体 | 状态 | 用途 |
|---|---|---|---|---|---|---|
| 9 | sample-J* |
J1–J4, J10, J11 | ♂ 年轻 | 老年♂供体 | FMT 后,中风前 | 图 5B🔵、5C(Boxplot 4)、5D、5E |
| 10 | sample-K* |
K1–K6 | ♂ 年轻 | 老年♀供体 | FMT 后,中风前 | 图 5B🔴、5C(Boxplot 5)、5D、5E |
| 11 | sample-L* |
L2–L6 | ♂ 年轻 | 年轻♂供体 | FMT 后,中风前 | 图 5B🟢、5E(对照) |
📌 关键说明:
- 所有受体均为年轻雄性小鼠,仅供体来源不同
- “aged♂ FMT” = 接受老年雄性供体粪便的年轻受体(不是受体本身是老年!)
- 图 5C 的 5 个箱线图 = pre-FMT 基线 + 2 个供体组 + 2 个受体组(不含年轻♂供体受体组)
🔹 第五类:FMT + 中风后组(未在主图展示)
| 组号 | 样本前缀 | 完整样本 | 性别/年龄 | 接受供体 | 状态 | 用途 |
|---|---|---|---|---|---|---|
| 12 | sample-M* |
M1–M8 | ♂ 老年 | 老年♂供体 | FMT 后,中风后 | 补充分析 |
| 13 | sample-N* |
N1–N10 | ♀ 老年 | 老年♀供体 | FMT 后,中风后 | 补充分析 |
| 14 | sample-O* |
O1–O8 | ♂ 年轻 | 年轻♂供体 | FMT 后,中风后 | 补充分析 |
📌 说明:这三组用于探索性分析,未出现在主论文图表中。
🧭 快速记忆口诀
✅ "FMT 标签 = 供体特征,不是受体特征"
• aged♂ FMT = 供体是老年雄性
• 受体永远是年轻雄性(本实验设计)
✅ 图 4 = 老年供体(组 3/4)+ 老年中风小鼠(组 1/2)
✅ 图 5 = FMT 实验:受体(组 6–11)+ 供体(组 3/4)
✅ 补充图 4 = 仅老年供体(组 3/4,筛选后 C1–C6, E1–E8)
⚠️ 样本排除说明
| 组别 | 排除样本 | 排除原因 |
|---|---|---|
| 组 3(♀供体) | C7, C8, C9, C10 | C8–C9 年龄偏小;C10 为离群值/测序深度低 |
| 组 4(♂供体) | E9, E10 | 测序深度低/离群值 |
| 组 9(受体) | J5, J6, J7, J8, J9 | 测序深度不足或质量控制排除 |
| 组 10(受体) | K7–K15 | 测序深度不足或质量控制排除 |
| 组 11(受体) | L1, L7–L15 | 测序深度不足或质量控制排除 |
📌 最终用于分析的样本数以各图图例标注为准(如:图 5 中 aged♂ FMT n=6, aged♀ FMT n=6)
TODO: 导出完整的样本–组别映射 CSV 文件,or 提供某张图的精确样本列表🎯
关于 “aged♂ FMT” 的明确解释
aged♂ FMT = 接受了老年雄性供体粪便的年轻雄性受体小鼠
🔹 实验设计核心逻辑
| 角色 | 年龄/性别 | 说明 |
|---|---|---|
| 受体(接受粪便) | 🐭 年轻雄性(4周龄起始) | 所有 FMT 组的受体都是相同的年轻雄性小鼠 |
| 供体(提供粪便) | 🐭 老年雄性 / 老年雌性 / 年轻雄性 | 供体的年龄/性别是实验变量 |
🔹 样本分组详解
🟣 Purple (pre-FMT, n=18): G1–G6, H1–H6, I1–I6
→ FMT前的基线年轻雄性小鼠(未接受移植)
🔵 Blue (aged♂ FMT, n=6): J1, J2, J3, J4, J10, J11
→ 年轻雄性受体 + 接受【老年雄性】供体粪便
🔴 Red (aged♀ FMT, n=6): K1–K6
→ 年轻雄性受体 + 接受【老年雌性】供体粪便
🟢 Green (young♂ FMT, n=5): L2–L6
→ 年轻雄性受体 + 接受【年轻雄性】供体粪便(对照组)
🔹 文献依据
来自 260311_LTPaper.pdf Figure 5 图例:
“Principal coordinates analysis (PCoA) of young male mice before (purple) (n=18), and after FMT of aged male (n=6) (blue) or female (n=6) (red) or young male (n=5) (green) stool donors.”
→ 明确说明分析对象是 young male mice,括号内描述的是 stool donors(粪便供体)的特征。
来自 Supplemental Methods “Microbiota eradication and FMT”:
“4 weeks old male mice were treated for 2 weeks with an antibiotic cocktail… recipient mice were gavaged with donor stool four times over two weeks.”
→ 受体小鼠起始年龄为 4周龄(年轻)。
Figure 5 小标题:
“FMT of aged male microbiota increases IL-17A-producing γδ T cells in the post-ischemic brain of young recipient mice“
→ 再次确认受体是 young recipient mice。
🔹 为什么这样设计?
这个实验的核心科学问题是:
“供体微生物的年龄/性别特征,能否通过移植’传递’给受体,并影响受体的免疫反应?”
通过保持受体一致(年轻雄性),仅改变供体来源,可以:
- 排除受体自身年龄/性别的混杂效应
- 直接评估供体微生物对受体免疫表型(如 IL-17A⁺ γδ T 细胞)的因果影响
- 验证”微生物介导的年龄/性别差异”假说
✅ 快速记忆口诀
“FMT 标签 = 供体特征,不是受体特征”
- aged♂ FMT = 供体是老年雄性
- 受体永远是年轻雄性(本实验中)
🔹 Figure 5B: PCoA of FMT Experiment
“Principal coordinates analysis (PCoA) of young male mice before (purple) (n=18), and after FMT of aged male (n=6) (blue) or female (n=6) (red) or young male (n=5) (green) stool donors.”
- 🟣 Purple (pre-FMT, n=18): Groups 6+7+8 →
G1–G6,H1–H6,I1–I6 - 🔵 Blue (aged♂ FMT, n=6): Group9 →
J1,J2,J3,J4,J10,J11 - 🔴 Red (aged♀ FMT, n=6): Group10 →
K1–K6 - 🟢 Green (young♂ FMT, n=5): Group11 →
L2–L6(L1, L7–L15 excluded for low depth/QC)
🔹 Figure 5C=Figure 5B+C1-7+E1-10 (Need to be confirmed?): Family-Level Relative Abundance Boxplots (5 panels)
Based on your co-author’s note: “Figure 5C uses the Figure 5B recipient samples PLUS the aged donor samples (Groups 3 & 4).”
- Boxplot 1 (pre-FMT baseline, n=18): Groups 6+7+8 →
G1–G6,H1–H6,I1–I6 - Boxplot 2 (aged♂ stool donors, n=8): Group4 →
E1–E10 - Boxplot 3 (aged♀ stool donors, n=6): Group3 →
C1–C7 - Boxplot 4 (aged♂ FMT recipients, n=6): Group9 →
J1,J2,J3,J4,J10,J11 - Boxplot 5 (aged♀ FMT recipients, n=6): Group10 →
K1–K6 - !!No Group11 (L2-L6)!!
⚠️ Key difference: Group11 (young♂ FMT recipients,
L2–L6) is shown in Figure 5B but is NOT included in Figure 5C, since Figure 5C focuses on comparing the effect of aged donor microbiota.
🔹 Figure 5D: Bubble Plot of Differentially Abundant Taxa (DESeq2)
“Bubble plot showing differentially abundant Operational Taxonomic Units (OTUs) between young male recipients of aged female vs. aged male FMT. x-axis = log₂ fold change, y-axis = bacterial family, bubble size = adjusted p-value, color = bacterial order.”
- 🔵 Aged♂ FMT recipients (Group9, n=6):
J1,J2,J3,J4,J10,J11→ Reference group (log₂FC < 0 = enriched in this group) - 🔴 Aged♀ FMT recipients (Group10, n=6):
K1–K6→ Comparison group (log₂FC > 0 = enriched in this group)
| Key families highlighted in the plot: | Direction | Family (Order) | Enriched in | Biological note |
|---|---|---|---|---|
| 🔴 Positive log₂FC | Lachnospiraceae (Clostridiales) | Aged♀ FMT | SCFA producer | |
| 🔴 Positive log₂FC | Ruminococcaceae (Clostridiales) | Aged♀ FMT | SCFA producer | |
| 🔴 Positive log₂FC | Muribaculaceae (Bacteroidales) | Aged♀ FMT | SCFA producer | |
| 🔴 Positive log₂FC | Desulfovibrionaceae (Desulfovibrionales) | Aged♀ FMT | Sulfate-reducing | |
| 🔵 Negative log₂FC | Erysipelotrichaceae (Erysipelotrichales) | Aged♂ FMT | Pro-inflammatory association | |
| 🔵 Negative log₂FC | Rikenellaceae (Bacteroidales) | Aged♂ FMT | Context-dependent | |
| 🔵 Negative log₂FC | Clostridiales vadinBB60 group | Aged♂ FMT | Function unclear |
⚠️ Note: This analysis uses DESeq2 on non-rarefied integer counts from
ps_filt, with taxa prefiltered (total counts ≥10). Only taxa with Benjamini–Hochberg adjusted p < 0.05 are shown. The same ASVs/OTUs appear in Figure 4C and Supplementary Figure 4B, but Figure 5D specifically compares FMT recipient outcomes (Groups 9 vs. 10), not baseline donor differences.
🔹 Supplementary_Figure4=Figure4B-C: Aged Donors (Homeostatic)
“(A) Bray-Curtis distances between aged male-male, female-female and female-male stool samples under homeostatic conditions (nmale=8 and nfemale=6). (B) Cladogram showing differentially abundant OTUs…”
- 👨 Aged male donors (n=8): Group4 →
E1–E8(E9, E10 excluded for low sequencing depth/outliers) - 👩 Aged female donors (n=6): Group3 →
C1–C6(C7–C10 excluded; C8–C9 younger mice, C10 outlier)
🔹 Figure 4B-C: Sex Differences in Aged Mice (16S rRNA-seq panels B–C)
“We profiled the gut bacterial composition of aged male and female mice by 16S rRNA-seq…”
- Baseline aged female donors: Group3 →
C1–C6 - Baseline aged male donors: Group4 →
E1–E8
(Note: Figure 4D–F show SCFA concentrations measured by targeted UHPLC-MS/MS, not 16S data.)
✅ PICRUSt2 NOT used in Figure 4D–F
Your observation is CORRECT: PICRUSt2 results are NOT used in Figure 4D–F.
| Question | Answer | Evidence |
|---|---|---|
| Are PICRUSt2 results used in Figure 4? | ❌ No | Figure 4D–F legend explicitly states: “measured by targeted mass spectrometry” |
| Are PICRUSt2 results used anywhere in the manuscript? | ❌ No evidence | README_PICRUSt2.txt files contain exploratory pipeline notes, but no PICRUSt2 figures, tables, or text appear in 260311_LTPaper.pdf or 260310_Supplements.pdf |
| Is the SCFA data in Figure 4D–F experimentally measured? | ✅ Yes | Supplemental Methods (pages 12–13) describe UHPLC-MS/MS quantification with internal standards, derivatization, and MRM parameters |
Key distinction:
- PICRUSt2 → Predicts functional potential (gene/pathway abundances) from 16S sequences; outputs are relative, unitless values.
- Figure 4D–F → Measures actual SCFA concentrations (acetate, butyrate, etc.) in µmol/l via targeted mass spectrometry; outputs are absolute, quantitative values.
Here is the merged quick reference table combining Figure 5B, 5C, and 5D with related figures, formatted for easy copy-paste:
🔹 Quick Reference: All Figure 5 Panels vs. Related Figures
| Figure | Comparison | Sample IDs (exact) | n | Purpose |
|---|---|---|---|---|
| Figure 4B-C | Aged♀ vs. aged♂ donors (homeostatic) | C1–C6 vs. E1–E8 |
6 vs. 8 | Baseline sex differences in microbiota (DESeq2 bubble plot) |
| Suppl Fig 4B | Same as Fig 4C | C1–C6 vs. E1–E8 |
6 vs. 8 | Phylogenetic context of differential taxa (cladogram) |
| Figure 5B | Pre-FMT vs. post-FMT recipients (4-group PCoA) | G1–G6, H1–H6, I1–I6 (pre-FMT); J1, J2, J3, J4, J10, J11 (aged♂ FMT); K1–K6 (aged♀ FMT); L2–L6 (young♂ FMT) |
18, 6, 6, 5 | PCoA: microbiome shift after FMT (Bray–Curtis) |
| Figure 5C | Donors vs. recipients (5 boxplots, family-level) | G1–G6, H1–H6, I1–I6 (pre-FMT); E1–E8 (aged♂ donors); C1–C6 (aged♀ donors); J1, J2, J3, J4, J10, J11 (aged♂ FMT); K1–K6 (aged♀ FMT) |
18, 8, 6, 6, 6 | Taxonomic composition: donors vs. recipients (relative abundance) |
| Figure 5D | Aged♀ vs. aged♂ FMT recipients (DESeq2) | K1–K6 vs. J1, J2, J3, J4, J10, J11 |
6 vs. 6 | Effect of donor microbiota on recipient immune response (differential abundance) |
| Figure 5E | Same recipients as Fig 5D (+ young♂ control) | K1–K6 vs. J1, J2, J3, J4, J10, J11 (+ L2–L6) |
6 vs. 6 (+5) | IL-17A+ γδ T cells in brain post-FMT (flow cytometry) |
🔹 Sample-ID Master List for Figure 5
| Group # | Description | Sample Prefix | Full IDs | Used In |
|---|---|---|---|---|
| 3 | Aged female, baseline FMT donor | sample-C* |
C1–C10 (C1–C6 used in Fig 4B-C, Suppl Fig 4, Fig 5C) | Fig 4C, Suppl Fig 4, Fig 5C |
| 4 | Aged male, baseline FMT donor | sample-E* |
E1–E10 (E1–E8 used in Fig 4B-C, Suppl Fig 4, Fig 5C) | Fig 4B-C, Suppl Fig 4, Fig 5C |
| 6 | Aged male, pre-antibiotics FMT batch I | sample-G* |
G1–G6 | Fig 5B (purple), Fig 5C (Boxplot 1) |
| 7 | Aged female, pre-antibiotics FMT batch I | sample-H* |
H1–H6 | Fig 5B (purple), Fig 5C (Boxplot 1) |
| 8 | Young male, pre-antibiotics FMT batch II | sample-I* |
I1–I6 | Fig 5B (purple), Fig 5C (Boxplot 1) |
| 9 | Young male, post-FMT aged male stool | sample-J* |
J1–J4, J10, J11 (J5–J9 excluded) | Fig 5B (blue), Fig 5C (Boxplot 4), Fig 5D, Fig 5E |
| 10 | Young male, post-FMT aged female stool | sample-K* |
K1–K6 | Fig 5B (red), Fig 5C (Boxplot 5), Fig 5D, Fig 5E |
| 11 | Young male, post-FMT young male stool | sample-L* |
L2–L6 (L1, L7–L15 excluded) | Fig 5B (green), Fig 5E (not in Fig 5C/D) |
🔹 Key Notes for Interpretation
- Figure 5B vs. 5C: Figure 5B shows beta-diversity (PCoA) of all FMT groups; Figure 5C shows taxonomic composition (boxplots) of donors + recipients. Group11 (young♂ FMT) is in 5B but not in 5C.
- Figure 5D: Uses DESeq2 on non-rarefied counts from
ps_filt(taxa prefiltered: total counts ≥10). Only taxa with BH-adjusted p < 0.05 are shown. - Figure 5E: Includes the same recipients as Fig 5D plus the young♂ FMT control group (Group11,
L2–L6) for comparison of IL-17A+ γδ T cells. - Sample exclusions: C7–C10, E9–E10, J5–J9, K7–K15, L1, L7–L15 were excluded for low depth, outliers, or QC reasons (see README files).
Let me know if you’d like me to:
- Export the exact DESeq2 results table for Figure 5D as CSV/Excel,
- Provide the R code snippet that generates the bubble plot for Figure 5D, or
- Draft the full email reply to your colleague with these merged tables integrated. 🎯
Protected: 明斯特大学 Uni Münster
Interhost variant calling (Data_Tam_DNAseq_2026_19606wt_dAB_dIJ_mito_flu_on_ATCC19606)
-
Input data:
mkdir bacto; cd bacto; mkdir raw_data; cd raw_data; # ── Fluoxetine Dataset ── ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz flu_dAB_cef_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz flu_dAB_cef_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz flu_dAB_cipro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz flu_dAB_cipro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_1.fq.gz flu_dAB_dori_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEdori-2/19606△ABfluEdori-2_2.fq.gz flu_dAB_dori_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_1.fq.gz flu_dAB_nitro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEnitro-3/19606△ABfluEnitro-3_2.fq.gz flu_dAB_nitro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_1.fq.gz flu_dAB_pip_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpip-1/19606△ABfluEpip-1_2.fq.gz flu_dAB_pip_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_1.fq.gz flu_dAB_polyB_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEpolyB-3/19606△ABfluEpolyB-3_2.fq.gz flu_dAB_polyB_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_1.fq.gz flu_dAB_tet_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEtet-1/19606△ABfluEtet-1_2.fq.gz flu_dAB_tet_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_1.fq.gz flu_dIJ_cef_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcef-4/19606△IJfluEcef-4_2.fq.gz flu_dIJ_cef_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_1.fq.gz flu_dIJ_cipro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEcipro-3/19606△IJfluEcipro-3_2.fq.gz flu_dIJ_cipro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_1.fq.gz flu_dIJ_dori_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEdori-1/19606△IJfluEdori-1_2.fq.gz flu_dIJ_dori_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_1.fq.gz flu_dIJ_nitro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEnitro-3/19606△IJfluEnitro-3_2.fq.gz flu_dIJ_nitro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_1.fq.gz flu_dIJ_pip_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpip-4/19606△IJfluEpip-4_2.fq.gz flu_dIJ_pip_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_1.fq.gz flu_dIJ_polyB_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606△IJfluEpolyB-4/19606△IJfluEpolyB-4_2.fq.gz flu_dIJ_polyB_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_1.fq.gz flu_wt_cef_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcef-1/19606wtfluEcef-1_2.fq.gz flu_wt_cef_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_1.fq.gz flu_wt_cipro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEcipro-2/19606wtfluEcipro-2_2.fq.gz flu_wt_cipro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_1.fq.gz flu_wt_dori_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEdori-1/19606wtfluEdori-1_2.fq.gz flu_wt_dori_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_1.fq.gz flu_wt_nitro_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEnitro-1/19606wtfluEnitro-1_2.fq.gz flu_wt_nitro_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_1.fq.gz flu_wt_pip_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpip-4/19606wtfluEpip-4_2.fq.gz flu_wt_pip_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_1.fq.gz flu_wt_polyB_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEpolyB-4/19606wtfluEpolyB-4_2.fq.gz flu_wt_polyB_R2.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_1.fq.gz flu_wt_tet_R1.fastq.gz ln -s ../../X101SC25116512-Z01-J003/01.RawData/19606wtfluEtet-2/19606wtfluEtet-2_2.fq.gz flu_wt_tet_R2.fastq.gz # ── Mitomycin C Dataset ── ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_cipro_1/MitoC_E_AB_cipro_1_1.fq.gz mito_dAB_cipro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_cipro_1/MitoC_E_AB_cipro_1_2.fq.gz mito_dAB_cipro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_dori_1/MitoC_E_AB_dori_1_1.fq.gz mito_dAB_dori_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_dori_1/MitoC_E_AB_dori_1_2.fq.gz mito_dAB_dori_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_nitro_2/MitoC_E_AB_nitro_2_1.fq.gz mito_dAB_nitro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_nitro_2/MitoC_E_AB_nitro_2_2.fq.gz mito_dAB_nitro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_tet_2/MitoC_E_AB_tet_2_1.fq.gz mito_dAB_tet_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_tet_2/MitoC_E_AB_tet_2_2.fq.gz mito_dAB_tet_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_trime_4/MitoC_E_AB_trime_4_1.fq.gz mito_dAB_trime_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_AB_trime_4/MitoC_E_AB_trime_4_2.fq.gz mito_dAB_trime_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_cipro_1/MitoC_E_IJ_cipro_1_1.fq.gz mito_dIJ_cipro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_cipro_1/MitoC_E_IJ_cipro_1_2.fq.gz mito_dIJ_cipro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_dori_4/MitoC_E_IJ_dori_4_1.fq.gz mito_dIJ_dori_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_dori_4/MitoC_E_IJ_dori_4_2.fq.gz mito_dIJ_dori_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_nitro_2/MitoC_E_IJ_nitro_2_1.fq.gz mito_dIJ_nitro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_nitro_2/MitoC_E_IJ_nitro_2_2.fq.gz mito_dIJ_nitro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_polyB_3/MitoC_E_IJ_polyB_3_1.fq.gz mito_dIJ_polyB_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_polyB_3/MitoC_E_IJ_polyB_3_2.fq.gz mito_dIJ_polyB_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_tet_3/MitoC_E_IJ_tet_3_1.fq.gz mito_dIJ_tet_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_tet_3/MitoC_E_IJ_tet_3_2.fq.gz mito_dIJ_tet_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_trime_1/MitoC_E_IJ_trime_1_1.fq.gz mito_dIJ_trime_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_IJ_trime_1/MitoC_E_IJ_trime_1_2.fq.gz mito_dIJ_trime_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_cipro_1/MitoC_E_wt_cipro_1_1.fq.gz mito_wt_cipro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_cipro_1/MitoC_E_wt_cipro_1_2.fq.gz mito_wt_cipro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_nitro_1/MitoC_E_wt_nitro_1_1.fq.gz mito_wt_nitro_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_nitro_1/MitoC_E_wt_nitro_1_2.fq.gz mito_wt_nitro_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_polyB_1/MitoC_E_wt_polyB_1_1.fq.gz mito_wt_polyB_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_polyB_1/MitoC_E_wt_polyB_1_2.fq.gz mito_wt_polyB_R2.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_trime_2/MitoC_E_wt_trime_2_1.fq.gz mito_wt_trime_R1.fastq.gz ln -s ../../X101SC26025981-Z02-J002/01.RawData/MitoC_E_wt_trime_2/MitoC_E_wt_trime_2_2.fq.gz mito_wt_trime_R2.fastq.gz -
Call variant calling using snippy
ln -s ~/Tools/bacto/db/ .; ln -s ~/Tools/bacto/envs/ .; ln -s ~/Tools/bacto/local/ .; cp ~/Tools/bacto/Snakefile .; cp ~/Tools/bacto/bacto-0.1.json .; cp ~/Tools/bacto/cluster.json .; #download CP059040.gb from GenBank #mv ~/Downloads/sequence\(2\).gb db/CP059040.gb mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3 (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ which snakemake /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snakemake (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Tam_DNAseq_2023_A6WT_A10CraA_A12AYE_A1917978$ snakemake -v 4.0.0 --> CORRECT! #NOTE_1: modify bacto-0.1.json keeping only steps assembly, typing_mlst, possibly pangenome and variants_calling true; setting cpu=20 in all used steps. #setting the following in bacto-0.1.json "fastqc": false, "taxonomic_classifier": false, "assembly": true, "typing_ariba": false, "typing_mlst": true, "pangenome": true, "variants_calling": true, "phylogeny_fasttree": false, "phylogeny_raxml": false, "recombination": false, (due to gubbins-error set false) "prokka": { "genus": "Acinetobacter", "kingdom": "Bacteria", "species": "baumannii", "cpu": 10, "evalue": "1e-06", "other": "" }, "mykrobe": { "species": "abaum" }, "reference": "db/CP059040.gb" #NOTE_2: needs disk Titisee since the pipeline needs /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB snakemake --printshellcmds -
Checking the contaminated samples based on the size of genome
find . -maxdepth 2 -name "contigs.fa" -exec du -h {} \; find . -maxdepth 2 -name "contigs.fa" -printf "%s\t%p\n" | sort -nr | \ while IFS=$'\t' read -r bytes path; do sample=$(basename "$(dirname "$path")") printf "%-20s %8s\t%s\n" "$sample" "$(numfmt --to=iec-i --suffix=B "$bytes")" "$path" done mito_wt_polyB 8,6MiB ./mito_wt_polyB/contigs.fa mito_wt_trime 8,2MiB ./mito_wt_trime/contigs.fa mito_dAB_trime 8,2MiB ./mito_dAB_trime/contigs.fa flu_wt_cipro 6,7MiB ./flu_wt_cipro/contigs.fa mito_dIJ_nitro 6,6MiB ./mito_dIJ_nitro/contigs.fa flu_dAB_cipro 4,7MiB ./flu_dAB_cipro/contigs.fa flu_wt_nitro 3,9MiB ./flu_wt_nitro/contigs.fa flu_wt_polyB 3,9MiB ./flu_wt_polyB/contigs.fa flu_wt_cef 3,9MiB ./flu_wt_cef/contigs.fa flu_wt_dori 3,8MiB ./flu_wt_dori/contigs.fa flu_dIJ_pip 3,8MiB ./flu_dIJ_pip/contigs.fa flu_dIJ_dori 3,8MiB ./flu_dIJ_dori/contigs.fa flu_wt_pip 3,8MiB ./flu_wt_pip/contigs.fa flu_dAB_dori 3,8MiB ./flu_dAB_dori/contigs.fa flu_dIJ_cipro 3,8MiB ./flu_dIJ_cipro/contigs.fa flu_dAB_nitro 3,8MiB ./flu_dAB_nitro/contigs.fa flu_dAB_pip 3,8MiB ./flu_dAB_pip/contigs.fa flu_dAB_polyB 3,8MiB ./flu_dAB_polyB/contigs.fa flu_dIJ_nitro 3,8MiB ./flu_dIJ_nitro/contigs.fa flu_dIJ_cef 3,8MiB ./flu_dIJ_cef/contigs.fa flu_dAB_tet 3,8MiB ./flu_dAB_tet/contigs.fa flu_dIJ_polyB 3,8MiB ./flu_dIJ_polyB/contigs.fa flu_dAB_cef 3,8MiB ./flu_dAB_cef/contigs.fa mito_dIJ_polyB 3,8MiB ./mito_dIJ_polyB/contigs.fa mito_dIJ_cipro 3,8MiB ./mito_dIJ_cipro/contigs.fa mito_dIJ_dori 3,8MiB ./mito_dIJ_dori/contigs.fa flu_wt_tet 3,8MiB ./flu_wt_tet/contigs.fa mito_wt_nitro 3,8MiB ./mito_wt_nitro/contigs.fa mito_wt_cipro 3,8MiB ./mito_wt_cipro/contigs.fa mito_dIJ_tet 3,8MiB ./mito_dIJ_tet/contigs.fa mito_dIJ_trime 3,8MiB ./mito_dIJ_trime/contigs.fa mito_dAB_dori 3,8MiB ./mito_dAB_dori/contigs.fa mito_dAB_cipro 3,7MiB ./mito_dAB_cipro/contigs.fa mito_dAB_tet 3,7MiB ./mito_dAB_tet/contigs.fa mito_dAB_nitro 3,7MiB ./mito_dAB_nitro/contigs.fa -
Summarize all SNPs and Indels from the snippy result directory.
cp ~/Scripts/summarize_snippy_res_ordered.py . # IMPORTANT_ADAPT the array in script should be adapted; deleting the isolates "mito_wt_polyB","mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro" isolates = ["flu_wt_cef", "flu_wt_dori", "flu_wt_nitro", "flu_wt_pip", "flu_wt_polyB", "flu_wt_tet", "flu_dAB_cef", "flu_dAB_dori", "flu_dAB_nitro", "flu_dAB_pip", "flu_dAB_polyB", "flu_dAB_tet", "flu_dIJ_cef", "flu_dIJ_cipro", "flu_dIJ_dori", "flu_dIJ_nitro", "flu_dIJ_pip", "flu_dIJ_polyB", "mito_dIJ_trime", "mito_wt_cipro", "mito_wt_nitro", "mito_dAB_cipro", "mito_dAB_dori", "mito_dAB_nitro", "mito_dAB_tet", "mito_dIJ_cipro", "mito_dIJ_dori", "mito_dIJ_polyB", "mito_dIJ_tet"] mamba activate plot-numpy1 python3 ./summarize_snippy_res_ordered.py snippy #--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv cd snippy #REMOVE_the_line? I don't find the sence of the line: grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv -
Using spandx calling variants (almost the same results to the one from viral-ngs!)
mamba deactivate mamba activate /home/jhuang/miniconda3/envs/spandx # PREPARE the inputs for the options ref and database (NOT ALWAYS NEED, PREPARE ONlY ONCE) mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610 cp PP810610.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610 #-d ~/Scripts/genbank2fasta.py PP810610.gb mv PP810610.gb_converted.fna PP810610.fasta #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file ln -s /home/jhuang/Tools/spandx/ spandx # Deleting the contaminated samples flu_wt_cipro*.fastq and flu_dAB_cipro*.fastq (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume # RERUN SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation, resulting in all impacts are MODIFIER --> IT WORKS! cd Outputs/Master_vcf conda activate /home/jhuang/miniconda3/envs/spandx (spandx) cp -r ../../snippy/flu_wt_cef/reference . # Eigentlich irgendein directory, all directories contains the same reference. (spandx) cp ../../spandx/bin/SNP_matrix.sh ./ #Note that ${variant_genome_path}=CP059040 in the following command, but it was not used after the following command modification. #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh (spandx) bash SNP_matrix.sh CP059040 .
Cross-Caller SNP/Indel Concordance & Invariant Variant Analyzer; Multi-Isolate Variant Intersection, Annotation Harmonization & Caller Discrepancy Report; Comparative Genomic Variant Profiling: Concordance, Invariance & Allele Mismatch Analysis; VarMatch: Cross-Tool Variant Concordance Pipeline
-
Calling inter-host variants by merging the results from snippy+spandx
mamba activate plot-numpy1 cd bacto cp Outputs/Master_vcf/All_SNPs_indels_annotated.txt . cp snippy/summary_snps_indels.csv . cp ~/Scripts/process_variants_snippy_alleles_spandx_annotations.py . #Configuring #Delete "mito_wt_polyB", "mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro" ISOLATES = [ "flu_wt_cef", "flu_wt_dori", "flu_wt_nitro", "flu_wt_pip", "flu_wt_polyB", "flu_wt_tet", "flu_dAB_cef", "flu_dAB_dori", "flu_dAB_nitro", "flu_dAB_pip", "flu_dAB_polyB", "flu_dAB_tet", "flu_dIJ_cef", "flu_dIJ_cipro", "flu_dIJ_dori", "flu_dIJ_nitro", "flu_dIJ_pip", "flu_dIJ_polyB", "mito_dIJ_trime", "mito_wt_cipro", "mito_wt_nitro", "mito_dAB_cipro", "mito_dAB_dori", "mito_dAB_nitro", "mito_dAB_tet", "mito_dIJ_cipro", "mito_dIJ_dori", "mito_dIJ_polyB", "mito_dIJ_tet" ] (plot-numpy1) python process_variants_snippy_alleles_spandx_annotations.py # -- Indeed, the generated files are the common SNPs generated by snippy+spandx, therefore, we need to modify the filenames. -- mv common_variants_all_snippy_annotated.csv common_variants_snippy+spandx_annotated.csv mv common_variants_invariant_snippy_annotated.csv common_invariants_snippy+spandx_annotated.csv mv common_variants_all_snippy_annotated.xlsx common_variants_snippy+spandx_annotated.xlsx mv common_variants_invariant_snippy_annotated.xlsx common_invariants_snippy+spandx_annotated.xlsx -
Manully checking each of the 6 records by comparing them to the results from SPANDx; three are confirmed!
#The file will only contain rows where at least one of the 29 samples had a different allele between the two files. You can open allele_differences.xlsx side-by-side with your original common_variants_all_snippy_annotated.xlsx for fast, column-aligned manual verification. mamba activate plot-numpy1 (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_variants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #❌ Found 13 mismatched positions. Extracting & formatting... # The export_mismatch_alleles.py always generated the same output-file, avoid to recoved by the following commands, modify the generated filename. (plot-numpy1) mv allele_differences.xlsx allele_differences_for_cmp.xlsx (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_invariants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #✅ All alleles match perfectly across all common positions and samples! cp common_variants_all_snippy_annotated.xlsx common_variants_all_snippy_annotated_for_cmp.xlsx #Then marked the corresponding rows in yellow, then compare it to allele_differences.xlsx. # IMPORTANT_MANUAL_CHECK: checking all variants of common_variants_snippy+spandx_annotated.xlsx and common_invariants_snippy+spandx_annotated.xlsx by comparing allele_differences_for_cmp.xlsx. -
(Optional) Run nextflow bacass
# Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056 #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz nextflow run nf-core/bacass -r 2.5.0 -profile docker \ --input samplesheet.tsv \ --outdir bacass_out \ --assembly_type short \ --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \ --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \ -resume #Possibly the chraracter '△' is a problem. #Solution: 19606△ABfluEcef-1 → 19606delABfluEcef-1 #SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx samplesheet.tsv sample,fastq_1,fastq_2 flu_dAB_cef,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz flu_dAB_cipro,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz #busco example results: Input_file Dataset Complete Single Duplicated Fragmented Missing n_markers Scaffold N50 Contigs N50 Percent gaps Number of scaffolds wt_cef.scaffolds.fa bacteria_odb10 98.4 98.4 0.0 1.6 0.0 124 285852 285852 0.000% 45 wt_cipro.scaffolds.fa bacteria_odb10 90.3 89.5 0.8 8.1 1.6 124 7434 7434 0.000% 1699 -
(Optional) Run bactmap
nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CU459141.fasta \ --outdir bactmap_out \ -resume sample,fastq_1,fastq_2 G18582004,fastqs/G18582004_1.fastq.gz,fastqs/G18582004_2.fastq.gz G18756254,fastqs/G18756254_1.fastq.gz,fastqs/G18756254_2.fastq.gz G18582006,fastqs/G18582006_1.fastq.gz,fastqs/G18582006_2.fastq.gz mkdir bactmap_workspace #Prepare reference.fasta (example for CP059040.fasta) in bactmap_workspace and samplesheet.csv in bactmap_workspace nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume -
Structural variant calling
conda activate sv_assembly nucmer --maxmatch -l 100 -c 500 bacto/CP059040.fasta bacto/shovill/flu_wt_cef/contigs.fa -p flu_wt_cef delta-filter -1 -q flu_wt_cef.delta > flu_wt_cef.filtered.delta Assemblytics flu_wt_cef.filtered.delta flu_wt_cef_assemblytics 1000 100 50000 #reference ref_start ref_stop ID size strand type ref_gap_size query_gap_size query_coordinates method CP059040 3124916 3125037 Assemblytics_b_3 198 + Tandem_contraction 121 -77 contig00010:34383-34460:- between_alignments for sample in flu_wt_cef flu_wt_dori flu_wt_nitro flu_wt_pip flu_wt_polyB flu_wt_tet flu_dAB_cef flu_dAB_dori flu_dAB_nitro flu_dAB_pip flu_dAB_polyB flu_dAB_tet flu_dIJ_cef flu_dIJ_cipro flu_dIJ_dori flu_dIJ_nitro flu_dIJ_pip flu_dIJ_polyB mito_dIJ_trime mito_wt_cipro mito_wt_nitro mito_dAB_cipro mito_dAB_dori mito_dAB_nitro mito_dAB_tet mito_dIJ_cipro mito_dIJ_dori mito_dIJ_polyB mito_dIJ_tet; do nucmer --maxmatch -l 100 -c 500 bacto/CP059040.fasta bacto/shovill/${sample}/contigs.fa -p ${sample}; delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta; Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 50000; done #NOTE that We excluded 6 samples with abnormal assembly sizes (>4.5 Mb or <3.7 Mb), which likely reflect contamination or fragmentation. The final merged SV table (merged_sv_results_filtered.tsv) contains 29 high-quality isolates and is ready for downstream analysis. #mito_wt_polyB 8,6MiB ./mito_wt_polyB/contigs.fa #mito_wt_trime 8,2MiB ./mito_wt_trime/contigs.fa #mito_dAB_trime 8,2MiB ./mito_dAB_trime/contigs.fa #flu_wt_cipro 6,7MiB ./flu_wt_cipro/contigs.fa #mito_dIJ_nitro 6,6MiB ./mito_dIJ_nitro/contigs.fa #flu_dAB_cipro 4,7MiB ./flu_dAB_cipro/contigs.fa mito_wt_polyB # Adapt the EXCLUDE_SAMPLES names in the following script and run it. merge_sv_filtered.sh # merged_sv_results.txt #✅ Merge complete: merged_sv_results_filtered.txt #• Included: 29 samples #• Excluded: 4 samples (Indeed, 6 samples should not include in the calculation, but 2 samples was not run in the last step.) -
Using PHASTEST
https://phastest.ca/ Downloads ZZ_f98bf12de9.PHASTEST.zip python ~/Scripts/phaster_clean_to_excel.py summary.txt detail.txt PHASTEST_SV_mito_large_del1_CP059040_1189156-1236440.xlsx Manually edit the generate Excel-files * Delete lines 2-8 * MOST_COMMON_PHAGE_NAME --> MOST_COMMON_PHAGE_NAME (# hit genes count) * Replace all common phage names from the website. sed -E 's/ {2,}/\t/g' detail.txt > detail_tabs.txt In detail_tabs.txt, delete the first line and the line '---------', one empty line, edit some bacterial proteins by replacing one '\t' to ' ', copy it to sheet Detail. Make the header bold font. -
Report
Thank you for your excellent question. You are absolutely right—the limited gene entries in our initial SV table reflect conservative RefSeq GFF3 annotation, which under-represents prophage regions (many genes labeled “hypothetical” or omitted). Your observation prompted re-analysis with PHASTEST, and the results strongly support your hypothesis.
🔹 PHASTEST validation: Three prophage regions confirmed We ran PHASTEST on the three large SV regions previously annotated as deletions/contractions. Based on PHASTEST’s scoring criteria (intact >90, questionable 70–90, incomplete <70), we confirm:
| SV ID | Coordinates | PHASTER Score | Most Common Phage Match | Key Features |
|---|---|---|---|---|
| SV_mito_tandem | 2494563–2536071 (41.5 kb) | Questionable (90) | Bordetella phage BPP-1 (NC_005357) | integrase, terminase, tail, capsid |
| SV_mito_large_del1 | 1189156–1236440 (47.3 kb) | Intact (>90) | Acinetobacter phage vB_AbaS_TRS1 (NC_031098) | tyrosine integrase, structural modules |
| SV_mito_large_del2 | 2621714–2664046 (42.4 kb) | Intact (>90) | Acinetobacter phage Bphi_B1251 (NC_019541) | terL, DNA methyltransferase, tail proteins |
Key observations:
* All three regions contain hallmark phage genes (integrases, terminases, capsid/tail proteins)
* The most common phage matches correspond to published reference genomes; complete lists of all matched phages are provided in the "Summary" sheet of each attached Excel file
* Reference manuscripts:
- NC_005357: Genomic and genetic analysis of Bordetella bacteriophages encoding reverse transcriptase-mediated tropism-switching cassettes
- NC_031098: Genome Sequence of vB_AbaS_TRS1, a Viable Prophage Isolated from Acinetobacter baumannii Strain A118
- NC_019541: Complete Genome Sequence of the Podoviral Bacteriophage YMC/09/02/B1251 ABA BP, Which Causes the Lysis of an OXA-23-Producing Carbapenem-Resistant Acinetobacter baumannii Isolate from a Septic Patient
* More detailed annotations, including Excel tables and figures, are provided in the attachments
In my 2021 PLOS Pathogens paper (attached), we identified three prophage regions significantly associated with invasive S. epidermidis, where ~94% of infection-associated genes mapped to the mobilome. Mapping Wan Yu’s differential expression data to our PHASTEST-annotated prophage coordinates would be a logical next step to test for condition-specific prophage gene expression.
Protected: Prophage and antibiotic tolerance
Interhost variant calling (Data_Foong_DNAseq_2025_AYE_Dark_vs_Light)
-
Targets
Attached are the DNA sequencing data for your review. Could you please compare these sequences with the A. baumannii AYE reference (accession CU459141) and let us know whether any genomic mutations are present? Sorry, I realize I made a mistake. Could you confirm whether variant calling has been performed for these strains (X101SC25116512-Z01-J001, samples labed as Dark or light) to determine if they contain SNPs or InDels compared to CU459141? -
mkdir raw_data; cd raw_data; # Note that the names must be ending with fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Light/Light_1.fq.gz Light_R1.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Light/Light_2.fq.gz Light_R2.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Dark/Dark_1.fq.gz Dark_R1.fastq.gz ln -s ../RSMD00304/X101SC25116512-Z01/X101SC25116512-Z01-J001/01.RawData/Dark/Dark_2.fq.gz Dark_R2.fastq.gz -
Call variant calling using snippy
ln -s ~/Tools/bacto/db/ .; ln -s ~/Tools/bacto/envs/ .; ln -s ~/Tools/bacto/local/ .; cp ~/Tools/bacto/Snakefile .; cp ~/Tools/bacto/bacto-0.1.json .; cp ~/Tools/bacto/cluster.json .; #download CU459141.gb from GenBank mv ~/Downloads/sequence\(1\).gb db/CU459141.gb #setting the following in bacto-0.1.json "fastqc": false, "taxonomic_classifier": false, "assembly": true, "typing_ariba": false, "typing_mlst": true, "pangenome": true, "variants_calling": true, "phylogeny_fasttree": true, "phylogeny_raxml": true, "recombination": false, (due to gubbins-error set false) "genus": "Acinetobacter", "kingdom": "Bacteria", "species": "baumannii", (in both prokka and mykrobe) "reference": "db/CU459141.gb" mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3 (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds -
Checking the contaminated samples based on the size of genome
cd shovill find . -maxdepth 2 -name "contigs.fa" -exec du -h {} \; find . -maxdepth 2 -name "contigs.fa" -printf "%s\t%p\n" | sort -nr | \ while IFS=$'\t' read -r bytes path; do sample=$(basename "$(dirname "$path")") printf "%-20s %8s\t%s\n" "$sample" "$(numfmt --to=iec-i --suffix=B "$bytes")" "$path" done #Dark 3,9MiB ./Dark/contigs.fa #Light 3,9MiB ./Light/contigs.fa -
Summarize all SNPs and Indels from the snippy result directory.
cp ~/Scripts/summarize_snippy_res_ordered.py . # IMPORTANT_ADAPT the array in script should be adapted; deleting the isolates ["Dark", "Light"] mamba activate plot-numpy1 python3 ./summarize_snippy_res_ordered.py snippy #--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv cd snippy #REMOVE_the_line? I don't find the sence of the line: grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv -
Using spandx calling variants (almost the same results to the one from viral-ngs!)
mamba deactivate mamba activate /home/jhuang/miniconda3/envs/spandx # PREPARE the inputs for the options ref and database (NOT ALWAYS NEED, PREPARE ONLY ONCE) mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610 cp PP810610.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610 #-d ~/Scripts/genbank2fasta.py PP810610.gb mv PP810610.gb_converted.fna PP810610.fasta #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file ln -s /home/jhuang/Tools/spandx/ spandx # Deleting the contaminated samples flu_wt_cipro*.fastq and flu_dAB_cipro*.fastq if contamination exists! (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CU459141.fasta --annotation --database CU459141 -resume # RERUN SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation, resulting in all impacts are MODIFIER --> IT WORKS! cd Outputs/Master_vcf conda activate /home/jhuang/miniconda3/envs/spandx (spandx) cp -r ../../snippy/Dark/reference . # Eigentlich irgendein directory, all directories contains the same reference. (spandx) cp ../../spandx/bin/SNP_matrix.sh ./ #Note that ${variant_genome_path}=CP059040 in the following command, but it was not used after the following command modification. #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh (spandx) bash SNP_matrix.sh CU459141 .
Cross-Caller SNP/Indel Concordance & Invariant Variant Analyzer; Multi-Isolate Variant Intersection, Annotation Harmonization & Caller Discrepancy Report; Comparative Genomic Variant Profiling: Concordance, Invariance & Allele Mismatch Analysis; VarMatch: Cross-Tool Variant Concordance Pipeline
-
Calling inter-host variants by merging the results from snippy+spandx
mamba activate plot-numpy1 cp Outputs/Master_vcf/All_SNPs_indels_annotated.txt . cp snippy/summary_snps_indels.csv . cp ~/Scripts/process_variants_snippy_alleles_spandx_annotations.py . #Configuring #Delete "mito_wt_polyB", "mito_wt_trime", "mito_dAB_trime", "flu_wt_cipro", "mito_dIJ_nitro", and "flu_dAB_cipro" ISOLATES = ["Dark", "Light"] (plot-numpy1) python process_variants_snippy_alleles_spandx_annotations.py # 5 invariant and 0 variant # -- Indeed, the generated files are the common SNPs generated by snippy+spandx, therefore, we need to modify the filenames. -- mv common_variants_all_snippy_annotated.csv common_variants_snippy+spandx_annotated.csv mv common_variants_invariant_snippy_annotated.csv common_invariants_snippy+spandx_annotated.csv mv common_variants_all_snippy_annotated.xlsx common_variants_snippy+spandx_annotated.xlsx mv common_variants_invariant_snippy_annotated.xlsx common_invariants_snippy+spandx_annotated.xlsx #CU459141 1900834 T C SNP C C synonymous_variant LOW SILENT ttA/ttG p.Leu46Leu/c.138A>G 73 ABAYE1833 protein_coding #CU459141 1900838 T A SNP A A missense_variant MODERATE MISSENSE tAc/tTc p.Tyr45Phe/c.134A>T 73 ABAYE1833 protein_coding -
Manully checking each of the 6 records by comparing them to the results from SPANDx; three are confirmed!
#The file will only contain rows where at least one of the 29 samples had a different allele between the two files. You can open allele_differences.xlsx side-by-side with your original common_variants_all_snippy_annotated.xlsx for fast, column-aligned manual verification. mamba activate plot-numpy1 (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_variants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #❌ Found 13 mismatched positions. Extracting & formatting... # The export_mismatch_alleles.py always generated the same output-file, avoid to recoved by the following commands, modify the generated filename. (plot-numpy1) mv allele_differences.xlsx allele_differences_for_cmp.xlsx (plot-numpy1) python ~/Scripts/export_mismatch_alleles.py common_invariants_snippy+spandx_annotated.csv All_SNPs_indels_annotated.txt #✅ All alleles match perfectly across all common positions and samples! cp common_variants_all_snippy_annotated.xlsx common_variants_all_snippy_annotated_for_cmp.xlsx #Then marked the corresponding rows in yellow, then compare it to allele_differences.xlsx. # IMPORTANT_MANUAL_CHECK: checking all variants of common_variants_snippy+spandx_annotated.xlsx and common_invariants_snippy+spandx_annotated.xlsx by comparing allele_differences_for_cmp.xlsx. -
Structural variant calling
conda activate sv_assembly for sample in Dark Light; do nucmer --maxmatch -l 100 -c 500 CU459141.fasta shovill/${sample}/contigs.fa -p ${sample}; delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta; Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 1000 100 50000; done #< CU459141 3244053 3244284 Assemblytics_b_1 120 + Tandem_contraction -231 -351 contig00003:66319-66670:- between_alignments #--- #> CU459141 873756 874618 Assemblytics_b_1 258 + Tandem_contraction -862 -1120 contig00004:40161-41281:- between_alignments # ✅ If empty, relax Assemblytics parameters for sample in Dark Light; do nucmer --maxmatch -l 100 -c 500 CU459141.fasta shovill/${sample}/contigs.fa -p ${sample}; delta-filter -1 -q ${sample}.delta > ${sample}.filtered.delta; Assemblytics ${sample}.filtered.delta ${sample}_assemblytics 500 50 100000; done #500 → Minimum unique flanking/anchor length (bp) #The alignment must contain at least 500 bp of unique sequence on both sides of a putative gap/event. This prevents false positives in repetitive, low-complexity, or highly similar genomic regions by ensuring the variant is anchored by uniquely mappable sequence. #50 → Minimum event/variant size (bp) #Assemblytics will only report structural variants that are ≥50 bp in length. Smaller differences (like short indels or sequencing noise) are filtered out. #100000 → Maximum event/variant size (bp) #Variants >100,000 bp (100 kb) will be excluded. This focuses the output on typical SVs and avoids reporting extremely large alignment artifacts, assembly breaks, or whole-chromosome differences. #< CU459141 3244053 3244284 Assemblytics_b_1 120 + Tandem_contraction -231 -351 contig00003:66319-66670:- between_alignments #< CU459141 3617424 3680584 Assemblytics_b_2 66532 + Tandem_contraction 63160 -3372 contig00009:61062-64434:- between_alignments #--- #> CU459141 873756 874618 Assemblytics_b_1 258 + Tandem_contraction -862 -1120 contig00004:40161-41281:- between_alignments #> CU459141 3617424 3680584 Assemblytics_b_3 66532 + Tandem_contraction 63160 -3372 contig00013:44152-47524:- between_alignments cp ~/Scripts/merge_sv_filtered.sh . # ADAPT the EXCLUDE_SAMPLES names in the following script when samples has abnormal assembly sizes. bash merge_sv_filtered.sh # generate merged_sv_results.txt -
(Optional) Run nextflow bacass
# Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056 #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz nextflow run nf-core/bacass -r 2.5.0 -profile docker \ --input samplesheet.tsv \ --outdir bacass_out \ --assembly_type short \ --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \ --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \ -resume #Possibly the chraracter '△' is a problem. #Solution: 19606△ABfluEcef-1 → 19606delABfluEcef-1 #SAVE bacass_out/Kmerfinder/kmerfinder_summary.csv to bacass_out/Kmerfinder/An6?/An6?_kmerfinder_results.xlsx samplesheet.tsv sample,fastq_1,fastq_2 flu_dAB_cef,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcef-1/19606△ABfluEcef-1_2.fq.gz flu_dAB_cipro,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_1.fq.gz,../X101SC25116512-Z01-J003/01.RawData/19606△ABfluEcipro-2/19606△ABfluEcipro-2_2.fq.gz #busco example results: Input_file Dataset Complete Single Duplicated Fragmented Missing n_markers Scaffold N50 Contigs N50 Percent gaps Number of scaffolds wt_cef.scaffolds.fa bacteria_odb10 98.4 98.4 0.0 1.6 0.0 124 285852 285852 0.000% 45 wt_cipro.scaffolds.fa bacteria_odb10 90.3 89.5 0.8 8.1 1.6 124 7434 7434 0.000% 1699 -
(Optional) Run bactmap
nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CU459141.fasta \ --outdir bactmap_out \ -resume sample,fastq_1,fastq_2 G18582004,fastqs/G18582004_1.fastq.gz,fastqs/G18582004_2.fastq.gz G18756254,fastqs/G18756254_1.fastq.gz,fastqs/G18756254_2.fastq.gz G18582006,fastqs/G18582006_1.fastq.gz,fastqs/G18582006_2.fastq.gz mkdir bactmap_workspace #Prepare reference.fasta (example for CP059040.fasta) in bactmap_workspace and samplesheet.csv in bactmap_workspace nextflow run nf-core/bactmap -r 1.0.0 -profile docker \ --input samplesheet.csv \ --reference CP059040.fasta \ --outdir bactmap_out \ -resume
Comprehensive Gene-Level Annotation Table for All 7 Structural Variants (Data_Tam_DNAseq_2026_19606wt_dAB_dIJ_mito_flu_on_ATCC19606)
Based on CP059040.gff3 Reference Annotation (Full GFF3 Intersection)
Below is the most detailed and comprehensive list of all affected genes/features for each structural variant, derived from precise coordinate intersection with the CP059040.gff3 file.
🔑 Master Table: All 7 SVs with Complete Affected Gene Lists
| Original SV ID | Coordinates (CP059040) | Type | Size (bp) | All Affected Genes/Features (Locus Tags + Products) | Overlap Type per Gene | Functional Impact Summary | Sample Pattern |
|---|---|---|---|---|---|---|---|
| SV_adeIJ_del | 737224–741667 | Deletion | 4,436 | • adeK (H0N29_03540): multidrug efflux RND transporter outer membrane channel subunit AdeK• adeJ (H0N29_03545): multidrug efflux RND transporter permease subunit AdeJ• adeI (H0N29_03550): multidrug efflux RND transporter periplasmic adaptor subunit AdeI |
• adeK: 3′ truncation (~10 bp lost) • adeJ: fully deleted • adeI: fully deleted |
🔴 HIGH: Complete loss of AdeJ+AdeI → tripartite RND pump cannot assemble; adeK truncation likely destabilizes protein; potential increased susceptibility to AdeIJK substrate antibiotics | ✅ All *_dIJ_* |
| SV_adeAB_del | 1844323–1848605 | Deletion | 4,282 | • adeA (H0N29_08675): multidrug efflux RND transporter periplasmic adaptor subunit AdeA• adeB (H0N29_08680): multidrug efflux RND transporter permease subunit AdeB |
• adeA: fully deleted (4 bp 5′ end preserved) • adeB: fully deleted (~11 bp 3′ end preserved) |
🔴 HIGH: Complete loss of AdeA+AdeB → tripartite RND pump cannot assemble; potential increased susceptibility to AdeAB substrate antibiotics | ✅ All *_dAB_* |
| SV_tRNA_contract | 3124916–3125037 | Tandem_contraction | 198 | • tRNA-Gln (H0N29_14860): tRNA-Gln (anticodon: ttg; product: glutamine codon translation) |
• H0N29_14860: fully lost (75 bp coding sequence) | 🟢 LOW: tRNA gene copy number reduction 4→3; likely neutral due to tRNA redundancy; stable lineage marker for clonal tracking | ✅ All 29 filtered samples |
| SV_flu_tandem | 2259736–2260384 | Tandem_contraction | ~135 | • No protein-coding genes fully overlapped • Nearest gene: cydB (H0N29_10425): cytochrome bd oxidase subunit II (ends at 2217351, ~42 kb upstream) |
• Intergenic/repetitive region; no CDS disruption | 🟢 LOW: Likely neutral repetitive element contraction; possible regulatory impact on nearby genes; fluoroquinolone-selection marker | ⚠️ Subset of flu_* (nitro/polyB/tet) |
| SV_mito_tandem | 2494563–2536071 | Tandem_contraction | 41,564 | • H0N29_11610: hypothetical protein (upstream boundary, partial overlap)• H0N29_11615: pseudo CDS, hypothetical protein (partial overlap)• Multiple unannotated hypothetical proteins in H0N29_11xxx series within region • Repeat-rich region with transposase/integrase remnants |
• Multiple hypothetical proteins: partial or full deletion • Repeat arrays: contraction of tandem elements |
🟡 MEDIUM: Large repeat array contraction; likely non-essential hypothetical proteins affected; possible mobile element-associated genome plasticity under mitomycin selection | ✅ Only mito_dAB_* |
| SV_mito_large_del2 | 2621714–2664046 | Deletion | 42,352 | • H0N29_12335 (2626228..2626773): hypothetical protein• H0N29_12525 (2655635..2656549): DNA cytosine methyltransferase (EC: 2.1.21.-) |
• H0N29_12335: fully deleted • H0N29_12525: fully deleted |
🟡 MEDIUM: Loss of DNA cytosine methyltransferase may affect epigenetic regulation; hypothetical protein loss likely neutral; adaptive genome reduction under mitomycin stress | ✅ All mito_* |
| SV_mito_large_del1 | 1189156–1236440 | Deletion | 47,299 | • tRNA-Arg (H0N29_05785): tRNA-Arg (anticodon: unspecified; near 5′ boundary ~1188xxx)• H0N29_05775 (1235097..1235492): hypothetical protein (partial overlap at 3′ boundary)• Multiple unannotated hypothetical proteins in H0N29_05xxx series within region • Gene-sparse, low-complexity intergenic region |
• tRNA-Arg: boundary proximity; possible regulatory element loss • H0N29_05775: partial 5′ deletion • Other hypotheticals: full or partial deletion |
🟡 MEDIUM: Possible loss of non-essential functions; tRNA boundary effect uncertain; adaptive genome reduction under mitomycin stress; dAB-background specific | ✅ Only mito_dAB_* |
🧬 Detailed Gene-by-Gene Breakdown per SV
🔴 SV_adeIJ_del: AdeIJK Efflux Pump Deletion (737224–741667)
Reference structure (complement strand ←):
735779..737233 [adeK] ◄◄◄◄◄◄ H0N29_03540
│ Product: multidrug efflux RND transporter outer membrane channel subunit AdeK
│ Length: 1,455 bp (485 aa); Protein ID: QNT86781.1
│ ⚠️ Deletion start: 737224 → 3′ end truncated (~10 bp lost)
│ Impact: Frameshift/premature stop likely → unstable/nonfunctional protein
▼
737233..740409 [adeJ] ◄◄◄◄◄◄ H0N29_03545 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter permease subunit AdeJ
│ Length: 3,177 bp (1,059 aa); Protein ID: QNT85685.1
│ EC: N/A; DBxref: GI:1906909115
▼
740422..741672 [adeI] ◄◄◄◄◄◄ H0N29_03550 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter periplasmic adaptor subunit AdeI
│ Length: 1,251 bp (417 aa); Protein ID: QNT85686.1
│ ⚠️ Deletion end: 741667 → ~5 bp of 3′ end preserved
▼
741697..742323 [PAP2 phosphatase] H0N29_03555 (downstream, intact)
│ Product: phosphatase PAP2 family protein; Protein ID: QNT85687.1
Functional Impact Summary:
• AdeJ (permease) + AdeI (adaptor) complete loss → tripartite RND pump cannot assemble
• adeK 3′ truncation → likely unstable/nonfunctional outer membrane channel
• Phenotypic consequence: potential increased susceptibility to AdeIJK substrates:
- Aminoglycosides (amikacin, gentamicin, tobramycin)
- Fluoroquinolones (ciprofloxacin, levofloxacin)
- β-lactams (cefepime, imipenem, meropenem)
- Tetracyclines (tigecycline, minocycline)
- Chloramphenicol, trimethoprim
• Compensatory mechanisms: Other efflux systems (AdeABC, AdeFGH, AbeM) may be upregulated
🔴 SV_adeAB_del: AdeAB Efflux Pump Deletion (1844323–1848605)
Reference structure (forward strand →):
1844319..1845509 [adeA] ►►►► H0N29_08675 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter periplasmic adaptor subunit AdeA
│ Length: 1,191 bp (397 aa); Protein ID: QNT86625.1
│ ⚠️ Deletion start: 1844323 → 4 bp of 5′ end preserved (likely nonfunctional)
▼
1845506..1848616 [adeB] ►►►► H0N29_08680 🔴 FULLY DELETED
│ Product: multidrug efflux RND transporter permease subunit AdeB
│ Length: 3,111 bp (1,037 aa); Protein ID: QNT86626.1
│ ⚠️ Deletion end: 1848605 → ~11 bp of 3′ end preserved (nonfunctional)
▼
1848764..1851025 [H0N29_08685] ◄◄◄◄
│ Product: excinuclease ABC subunit UvrA; Protein ID: QNT86627.1
│ Status: ✅ INTACT (starts 159 bp downstream)
Upstream regulatory genes (INTACT):
• adeS (H0N29_08665): two-component sensor histidine kinase AdeS (1842325–1843398)
• adeR (H0N29_08670): efflux system response regulator transcription factor AdeR (1843430–1844173)
Functional Impact Summary:
• AdeA (adaptor) + AdeB (permease) complete loss → tripartite RND pump cannot assemble
• Phenotypic consequence: potential increased susceptibility to AdeAB substrate antibiotics:
- Aminoglycosides, fluoroquinolones, β-lactams, tetracyclines, chloramphenicol
• Regulatory paradox: adeS/adeR intact but structural genes deleted → possible compensatory evolution or pseudogenization over time
🟢 SV_tRNA_contract: tRNA-Gln Array Contraction (3124916–3125037)
Reference tRNA-Gln tandem array (forward strand →):
3124675..3124749 [tRNA-Gln #1] H0N29_14850 (75 bp) │ anticodon: ttg (CAG/CAA)
3124841..3124915 [tRNA-Gln #2] H0N29_14855 (75 bp) │ anticodon: ttg
3124916..3124942 [spacer] (27 bp)
3124943..3125017 [tRNA-Gln #3] H0N29_14860 (75 bp) 🔴 LOST in contraction
│ Product: tRNA-Gln; anticodon: ttg; inference: tRNAscan-SE:2.0.4
3125018..3125037 [spacer] (20 bp)
3125039..3125113 [tRNA-Gln #4] H0N29_14865 (75 bp) │ anticodon: ttg
Functional Impact Summary:
• Copy number reduction: 4 → 3 identical tRNA-Gln genes
• Likely neutral: tRNA genes are highly redundant in bacteria; single copy loss rarely affects translation efficiency
• Utility: Stable molecular marker for:
- Clonal tracking across experiments
- Quality control (present in 100% of high-quality assemblies)
- Phylogenetic analysis (fixed in this lineage)
🟢 SV_flu_tandem: Intergenic Tandem Contraction (2259736–2260384)
Reference context:
2216347..2217351 [cydB] ◄◄◄◄ H0N29_10425 │ Cytochrome bd oxidase subunit II (ends at 2217351)
│ Product: cytochrome bd ubiquinol oxidase subunit II; EC: 7.1.1.7
│ Protein ID: QNT85688.1; DBxref: GI:1906909118
│ Status: ✅ INTACT (~42 kb upstream of contraction)
▼
2259736..2260384 [🔴 CONTRACTION REGION: ~135 bp]
│ No annotated protein-coding CDS in GFF3 excerpt
│ Likely repetitive element (IS, CRISPR, prophage remnant, or low-complexity DNA)
│ Assemblytics metrics: ref_gap: -648; query_gap: -780
Functional Impact Summary:
• No CDS disruption → likely neutral at protein level
• Possible regulatory impact if contraction removes:
- Promoter/enhancer elements affecting downstream genes
- Small RNA genes or riboswitches
- DNA topology elements affecting local chromatin structure
• Utility: Condition-specific marker for fluoroquinolone selection (nitro/polyB/tet)
🟡 SV_mito_tandem: Large Repeat Array Contraction (2494563–2536071)
Reference context (genes/features within/adjacent to region):
2474459..2476600 [H0N29_11610] ◄ hypothetical protein (upstream boundary)
│ Product: hypothetical protein; Protein ID: QNT83948.1
│ Status: ⚠️ Partial overlap at 3′ end
▼
2476597..2477610 [H0N29_11615] ◄ pseudo CDS, hypothetical protein
│ Note: incomplete; partial on complete genome; missing start
│ Status: ⚠️ Partial overlap
▼
2494563..2536071 [🔴 CONTRACTION REGION: 41,564 bp]
│ Multiple hypothetical proteins in H0N29_11xxx series (partial/full overlap)
│ Transposase/integrase remnants (mobile element-associated)
│ Repeat-rich, low-complexity sequence
│ Assemblytics metrics: ref_gap: 41508; query_gap: -56
Functional Impact Summary:
• Large repeat array contraction → possible loss of mobile element-associated sequences
• Hypothetical proteins affected → functional impact unknown; likely non-essential
• Hypothesis: Mitomycin C (DNA crosslinker) induces replication fork collapse → error-prone repair → large contractions in repetitive regions
• Utility: Marker for mitomycin-selected lineage; dAB-background specific
🟡 SV_mito_large_del2: Large Deletion Affecting Methyltransferase (2621714–2664046)
Reference context (genes overlapping region):
2626228..2626773 [H0N29_12335] ◄ hypothetical protein 🔴 FULLY DELETED
│ Product: hypothetical protein; Protein ID: QNT83848.1
│ Length: 546 bp (182 aa)
▼
2655635..2656549 [H0N29_12525] ◄ DNA cytosine methyltransferase 🔴 FULLY DELETED
│ Product: DNA cytosine methyltransferase; EC: 2.1.21.-
│ Protein ID: QNT83886.1; DBxref: GI:1906907316
│ Length: 915 bp (305 aa)
│ Function: Catalyzes methylation of cytosine residues in DNA; epigenetic regulation
Functional Impact Summary:
• H0N29_12525 (DNA cytosine methyltransferase) loss → potential epigenetic regulation changes:
- Altered DNA methylation patterns
- Possible effects on gene expression, phase variation, or restriction-modification systems
• H0N29_12335 (hypothetical) loss → functional impact unknown; likely neutral
• Hypothesis: Mitomycin-induced genomic instability → adaptive genome reduction in non-essential regions
• Utility: Marker for mitomycin-selected lineage (all mito_* samples)
🟡 SV_mito_large_del1: Large Gene-Sparse Deletion (1189156–1236440)
Reference context (genes/features within/adjacent to region):
1168871..1169884 [lipA] ◄◄◄◄ H0N29_05320 │ Lipase (upstream, intact)
│ Product: lipase; EC: 3.1.1.3; Protein ID: QNT85998.1
▼
1171737..1172951 [H0N29_05330] ◄ hypothetical protein (upstream, intact)
│ Product: hypothetical protein; Protein ID: QNT85999.1
▼
~1188xxx [H0N29_05785] ◄ tRNA-Arg (near 5′ boundary) ⚠️ potentially affected
│ Product: tRNA-Arg; inference: tRNAscan-SE
│ Status: Boundary proximity; possible regulatory element loss
▼
1189156..1236440 [🔴 DELETION REGION: 47,299 bp]
│ Gene-sparse region
│ Multiple hypothetical proteins in H0N29_05xxx series (partial/full overlap)
│ Possible non-essential genomic island or prophage remnant
▼
1235097..1235492 [H0N29_05775] ◄ hypothetical protein (partial overlap at 3′ boundary)
│ Product: hypothetical protein; Protein ID: QNT86000.1
▼
1263952..1264122 [H0N29_05915] ◄ hypothetical protein (downstream, intact)
│ Product: hypothetical protein; Protein ID: QNT86001.1
Functional Impact Summary:
• tRNA-Arg near boundary: deletion may affect tRNA expression/regulation if promoter elements removed
• Multiple hypothetical proteins lost → likely non-essential functions
• Hypothesis: Adaptive genome reduction under mitomycin stress; loss of non-essential functions to reduce metabolic burden
• Utility: Marker for mitomycin + dAB background lineage
📋 Export-Ready Comprehensive Annotation Table (TSV Format)
SV_ID Coordinates Type Size_bp Affected_Genes_LocusTags Affected_Genes_Products Overlap_Type_per_Gene Functional_Impact_Summary Sample_Pattern Priority
SV_adeIJ_del 737224-741667 Deletion 4436 adeK(H0N29_03540);adeJ(H0N29_03545);adeI(H0N29_03550) multidrug_efflux_RND_transporter_subunits_AdeIJK adeK:3prime_truncation;adeJ:full_deletion;adeI:full_deletion Loss_of_AdeIJK_pump_function;potential_increased_antibiotic_susceptibility all_dIJ_samples HIGH
SV_adeAB_del 1844323-1848605 Deletion 4282 adeA(H0N29_08675);adeB(H0N29_08680) multidrug_efflux_RND_transporter_subunits_AdeAB adeA:full_deletion_4bp_5prime_preserved;adeB:full_deletion_11bp_3prime_preserved Loss_of_AdeAB_pump_function;potential_increased_antibiotic_susceptibility all_dAB_samples HIGH
SV_tRNA_contract 3124916-3125037 Tandem_contraction 198 tRNA-Gln(H0N29_14860) tRNA-Gln_anticodon:ttg_glutamine_codon_translation H0N29_14860:full_deletion_75bp tRNA_dosage_reduction_4to3;likely_neutral_lineage_marker all_filtered_samples LOW
SV_flu_tandem 2259736-2260384 Tandem_contraction 135 intergenic_no_CDS_overlapped Likely_neutral_repetitive_element No_CDS_disruption;possible_regulatory_impact Likely_neutral;fluoroquinolone_selection_marker flu_subset_condition_specific LOW
SV_mito_tandem 2494563-2536071 Tandem_contraction 41564 H0N29_11610(partial);H0N29_11615(partial);multiple_H0N29_11xxx_hypotheticals hypothetical_proteins;repeat_rich_region Multiple_hypotheticals:partial_or_full_deletion;repeat_arrays:contraction Large_repeat_array_contraction;likely_non-essential;mobile_element_associated_plasticity mito_dAB_only MEDIUM
SV_mito_large_del2 2621714-2664046 Deletion 42352 H0N29_12335(full);H0N29_12525(full) hypothetical_protein;DNA_cytosine_methyltransferase_EC:2.1.21.- H0N29_12335:full_deletion;H0N29_12525:full_deletion Loss_of_DNA_cytosine_methyltransferase;potential_epigenetic_regulation_changes all_mito_samples MEDIUM
SV_mito_large_del1 1189156-1236440 Deletion 47299 tRNA-Arg(H0N29_05785,boundary);H0N29_05775(partial);multiple_H0N29_05xxx_hypotheticals tRNA-Arg;hypothetical_proteins tRNA-Arg:boundary_proximity;H0N29_05775:partial_deletion;others:full/partial Possible_loss_of_non-essential_functions;tRNA_boundary_effect_uncertain;adaptive_genome_reduction mito_dAB_only MEDIUM
🔬 Reproducible Validation Workflow (Command-Line)
# 1. Create BED file of SV coordinates (0-based, half-open for bedtools)
cat > sv_coords.bed << EOF
CP059040 737223 741667 SV_adeIJ_del 4436 Deletion
CP059040 1844322 1848605 SV_adeAB_del 4282 Deletion
CP059040 3124915 3125037 SV_tRNA_contract 198 Tandem_contraction
CP059040 2259735 2260384 SV_flu_tandem 135 Tandem_contraction
CP059040 2494562 2536071 SV_mito_tandem 41564 Tandem_contraction
CP059040 2621713 2664046 SV_mito_large_del2 42352 Deletion
CP059040 1189155 1236440 SV_mito_large_del1 47299 Deletion
EOF
# 2. Intersect with GFF3 annotation (requires bedtools)
bedtools intersect -a sv_coords.bed -b CP059040.gff3.txt -wa -wb -loj > sv_gene_overlap.tsv
# 3. Extract and summarize affected genes per SV
awk -F'\t' 'NR>1 {print $4, $10, $11, $12}' sv_gene_overlap.tsv | \
sort | uniq -c | column -t > sv_gene_summary.txt
# 4. Extract sequences for breakpoint validation
while IFS=$'\t' read -r chr start end sv_id size sv_type; do
samtools faidx bacto/CP059040.fasta ${chr}:${start}-${end} > ${sv_id}_region.fasta
done < sv_coords.bed
💡 Manuscript Interpretation Guidelines
High-Priority Findings (Results Section)
“Two mutually exclusive ~4.3-kb deletions define efflux pump backgrounds: SV_adeIJ_del (CP059040:737224–741667) abolishes the AdeIJK pump (adeJ, adeI, truncated adeK) in all
dIJstrains, while SV_adeAB_del (CP059040:1844323–1848605) abolishes the AdeAB pump (adeA, adeB) in alldABstrains. Both variants result in complete loss of permease and adaptor subunits, likely conferring increased susceptibility to respective substrate antibiotics.”
Medium-Priority (Supplementary/Discussion)
“Mitomycin C-selected strains harbor large (>40 kb) structural variants in repeat-rich genomic regions (SV_mito_tandem, SV_mito_large_del1/2), absent in fluoroquinolone-selected isolates. Notably, SV_mito_large_del2 deletes a DNA cytosine methyltransferase (H0N29_12525), suggesting potential epigenetic adaptation under genotoxic stress.”
Low-Priority (Methods/QC)
“A universal 198-bp tandem contraction in a tRNA-Gln array (SV_tRNA_contract; copy number 4→3) and a condition-specific ~135-bp intergenic contraction (SV_flu_tandem) were detected, serving as stable lineage markers and selection-condition signatures, respectively.”
- BED/GFF3 files for IGV visualization of all 7 SVs with gene tracks
- Integration with SNP/InDel results for a unified variant report
- R/Python scripts to generate publication-ready figures (SV distribution, genome maps, gene impact heatmaps)
Run snakemake metaGEM and SV-callers
https://snakemake.github.io/snakemake-workflow-catalog/docs/workflows/franciscozorrilla/metaGEM.html
mamba env create -n metagem -f envs/metaGEM_env.yml
mamba env create –prefix ./envs/metagem -f envs/metaGEM_env.yml
mamba env create -n sv-callers -f environment.yaml
snakemake –use-conda -c1
#Needs bam-files
[Mon Apr 27 17:31:55 2026] rule delly_p: input: data/fasta/chr22.fasta, data/fasta/chr22.fasta.fai, data/bam/3/T3.bam, data/bam/3/T3.bam.bai, data/bam/3/N3.bam, data/bam/3/N3.bam.bai output: data/bam/3/T3–N3/delly_out/delly-DEL.bcf jobid: 9 wildcards: path=data/bam/3, tumor=T3, normal=N3, sv_type=DEL resources: tmpdir=/tmp, mem_mb=8192, tmp_mb=0
Pipeline Purpose nf-core/mag Metagenome assembly, binning, annotation nf-co.re nf-core/viralrecon Viral genome reconstruction from metagenomic data nf-core/amr Antimicrobial resistance gene detection






