Variant Calling Results and Analysis Summary for the 6 Pairs of DNA-seq Data (Data_Holger_DNAseq_2026_Sepi_Pairs)

Run nextflow bacass

 conda deactivate

 # Downlod k2_standard_08_GB_20251015.tar.gz from https://benlangmead.github.io/aws-indexes/k2#kraken2--bracken
 # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056; 'tar xzf 20190108_kmerfinder_stable_dirs.tar.gz'  #The database does not work!
 # Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz  #The database works!

 # DEBUG: --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ not working!

 nextflow run nf-core/bacass -r 2.6.0 -profile docker --help

 # -- Short assembly --
 #Maybe BUG is from '--skip_kmerfinder for -r 2.6.0, using db in 2.5.0'
 nextflow run nf-core/bacass -r 2.5.0 -profile docker \
   --input samplesheet.tsv \
   --outdir bacass_out \
   --assembly_type short \
   --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
   --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
   -resume \
   -work-dir bacass_out/work

Verify if the genome is pure

 # 1. Go up one level to the main 'bacass_out' directory
 cd ..

 # 2. Create directories for CheckM inputs and outputs
 mkdir -p checkm_input checkm_output

 # 3. Copy all .fna files into the 'checkm_input' folder
 # (CheckM cannot search subdirectories, so they must be in one folder)
 find ./Prokka -name "*.fna" -exec cp {} checkm_input/ \;

 # 4. Run CheckM on all 4 assemblies
 (checkm_env2) checkm lineage_wf -x fna checkm_input checkm_output

mlst

 (bacto) mlst *.fna > mlst

 #Pair1
 2780_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)
 12835_.fna  sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)

 #Pair2
 4667_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)
 9940_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)

 #Pair3
 3139_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 19296_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair4
 9849_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 11635_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair5
 3006_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 10393_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair6
 1736_.fna   sepidermidis    59  arcC(2) aroE(1) gtr(1)  mutS(1) pyrR(2) tpiA(1) yqiL(1)
 8583_.fna   sepidermidis    59  arcC(2) aroE(1) gtr(1)  mutS(1) pyrR(2) tpiA(1) yqiL(1)

#Compared to other isolates from the lab: patient HD04 [ST5], HD21 [ST2], HD26 [ST5], HD27 [ST5], HD29 [ST5], HD59 [ST5], HD33 [ST87]

Species Identification: 快速筛查用 Mash → 精确分类用 GTDB-Tk → 种级验证用 FastANI，三者结合可最大限度提高物种鉴定的准确性和可解释性。

 # 1. 创建环境（推荐 mamba）
 mamba create -n gtdbtk -c conda-forge -c bioconda gtdbtk
 mamba activate gtdbtk

 # 2. 下载数据库（仅需首次，约 60GB）
 gtdbtk download --data_dir ./gtdb_data --release 220

 wget https://data.gtdb.aau.ecogenomic.org/releases/release232/232.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r232_data.tar.g
 mamba env config vars set GTDBTK_DATA_PATH="/mnt/nvme4n1p1/gtdb_data/release232"
 # 先退出当前环境，再重新激活
 mamba deactivate
 mamba activate gtdbtk

 # 验证环境变量是否加载成功
 echo $GTDBTK_DATA_PATH
 # 应输出：/mnt/nvme4n1p1/gtdb_data/release232

 # 3. 运行分类（你提供的命令 + 实用参数）
 gtdbtk classify_wf \
   --genome_dir ./checkm_input \
   --out_dir gtdb_out \
   --cpus 64 \
   --extension .fna \
   --prefix mygenome

 # 4. 查看结果
 cat gtdb_out/classify/mygenome.bac120.summary.tsv   # 细菌结果

 #Pair1
 2780_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)
 12835_.fna  sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)

 #Pair2
 4667_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)
 9940_.fna   sepidermidis    2   arcC(7) aroE(1) gtr(2)  mutS(2) pyrR(4) tpiA(1) yqiL(1)

 #Pair3
 3139_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 19296_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair4
 9849_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 11635_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair5
 3006_.fna   sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)
 10393_.fna  sepidermidis    5   arcC(1) aroE(1) gtr(1)  mutS(2) pyrR(2) tpiA(1) yqiL(1)

 #Pair6
 1736_.fna   sepidermidis    59  arcC(2) aroE(1) gtr(1)  mutS(1) pyrR(2) tpiA(1) yqiL(1)
 8583_.fna   sepidermidis    59  arcC(2) aroE(1) gtr(1)  mutS(1) pyrR(2) tpiA(1) yqiL(1)

 ----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bin Id         Marker lineage         # genomes   # markers   # marker sets   0    1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------
 2780_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00
 12835_   g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00

 4667_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00
 9940_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00

 3139_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00
 19296_   g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00

 9849_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00
 11635_   g__Staphylococcus (UID294)       60         773           178        2   771   0   0   0   0       99.73            0.00               0.00

 3006_    g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00
 10393_   g__Staphylococcus (UID294)       60         773           178        1   772   0   0   0   0       99.81            0.00               0.00

 1736_    g__Staphylococcus (UID294)       60         773           178        1   771   1   0   0   0       99.81            0.28               0.00
 8583_    g__Staphylococcus (UID294)       60         773           178        1   771   1   0   0   0       99.81            0.28               0.00
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------

Antimicrobial resistance gene profiling and Resistome and Virulence Profiling with Abricate and RGI (Reisistance Gene Identifier)

 conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
 abricate --list

 conda deactivate

 # 1. Set both required Java variables to bypass the buggy conda activation script
 export JAVA_HOME=/home/jhuang/miniconda3/envs/bengal3_ac3
 export JAVA_LD_LIBRARY_PATH=$JAVA_HOME/lib/server

 # 2. Run your script
 for sample in 2780 12835 4667 9940 3139 19296 9849 11635 3006 10393 1736 8583; do
     ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
     ASM=bacass_out/checkm_input/${sample}_.fna \
     SAMPLE=${sample} \
     OUTDIR=resistome_virulence_${sample} \
     MINID=80 MINCOV=60 \
     THREADS=32 \
     ~/Scripts/run_abricate_resistome_virulome_one_per_gene.sh
 done

 conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
 #NEED_TO_ADAPT: OUTDIR = Path("resistome_virulence_An7")
 #NEED_TO_ADAPT: SAMPLE = "An7"
 #DEPRECATED_DUE_TO_NEED_MANULL_SETTING: python ~/Scripts/merge_amr_sources_by_gene.py

 for sample in 2780 12835 4667 9940 3139 19296 9849 11635 3006 10393 1736 8583; do
     python ~/Scripts/export_resistome_virulence_to_excel_py36.py \
     --workdir resistome_virulence_${sample} \
     --sample ${sample} \
     --out Resistome_Virulence_${sample}.xlsx
 done

 Wrote Excel: Resistome_Virulence_2780.xlsx
 AMR genes (one row per gene): 29
 VFDB hits: 0
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_12835.xlsx
 AMR genes (one row per gene): 30
 VFDB hits: 0
 DB hit counts saved.

 Wrote Excel: Resistome_Virulence_4667.xlsx
 AMR genes (one row per gene): 33
 VFDB hits: 0
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_9940.xlsx
 AMR genes (one row per gene): 30
 VFDB hits: 0
 DB hit counts saved.

 Wrote Excel: Resistome_Virulence_3139.xlsx
 AMR genes (one row per gene): 27
 VFDB hits: 1
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_19296.xlsx
 AMR genes (one row per gene): 27
 VFDB hits: 1
 DB hit counts saved.

 Wrote Excel: Resistome_Virulence_9849.xlsx
 AMR genes (one row per gene): 20
 VFDB hits: 1
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_11635.xlsx
 AMR genes (one row per gene): 20
 VFDB hits: 1
 DB hit counts saved.

 Wrote Excel: Resistome_Virulence_3006.xlsx
 AMR genes (one row per gene): 21
 VFDB hits: 1
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_10393.xlsx
 AMR genes (one row per gene): 24
 VFDB hits: 1
 DB hit counts saved.

 Wrote Excel: Resistome_Virulence_1736.xlsx
 AMR genes (one row per gene): 22
 VFDB hits: 0
 DB hit counts saved.
 Wrote Excel: Resistome_Virulence_8583.xlsx
 AMR genes (one row per gene): 22
 VFDB hits: 0
 DB hit counts saved.

 # TODO: Delete the column 'COVERAGE_MAP' in all 'Raw_*' sheets

Rename the files

 mv ./raw_data/Pair1/2780_S71_1_sequence.fastq.gz ./raw_data/Pair1/2780_S71_R1.fastq.gz
 mv ./raw_data/Pair1/2780_S71_2_sequence.fastq.gz ./raw_data/Pair1/2780_S71_R2.fastq.gz
 mv ./raw_data/Pair1/12835_S72_1_sequence.fastq.gz ./raw_data/Pair1/12835_S72_R1.fastq.gz
 mv ./raw_data/Pair1/12835_S72_2_sequence.fastq.gz ./raw_data/Pair1/12835_S72_R2.fastq.gz

 mv ./raw_data/Pair2/4667_S75_1_sequence.fastq.gz ./raw_data/Pair2/4667_S75_R1.fastq.gz
 mv ./raw_data/Pair2/4667_S75_2_sequence.fastq.gz ./raw_data/Pair2/4667_S75_R2.fastq.gz
 mv ./raw_data/Pair2/9940_S78_1_sequence.fastq.gz ./raw_data/Pair2/9940_S78_R1.fastq.gz
 mv ./raw_data/Pair2/9940_S78_2_sequence.fastq.gz ./raw_data/Pair2/9940_S78_R2.fastq.gz

 mv ./raw_data/Pair3/3139_S74_1_sequence.fastq.gz ./raw_data/Pair3/3139_S74_R1.fastq.gz
 mv ./raw_data/Pair3/3139_S74_2_sequence.fastq.gz ./raw_data/Pair3/3139_S74_R2.fastq.gz
 mv ./raw_data/Pair3/19296_S81_1_sequence.fastq.gz ./raw_data/Pair3/19296_S81_R1.fastq.gz
 mv ./raw_data/Pair3/19296_S81_2_sequence.fastq.gz ./raw_data/Pair3/19296_S81_R2.fastq.gz

 mv ./raw_data/Pair4/11635_S80_1_sequence.fastq.gz ./raw_data/Pair4/11635_S80_R1.fastq.gz
 mv ./raw_data/Pair4/11635_S80_2_sequence.fastq.gz ./raw_data/Pair4/11635_S80_R2.fastq.gz
 mv ./raw_data/Pair4/9849_S77_1_sequence.fastq.gz ./raw_data/Pair4/9849_S77_R1.fastq.gz
 mv ./raw_data/Pair4/9849_S77_2_sequence.fastq.gz ./raw_data/Pair4/9849_S77_R2.fastq.gz

 mv ./raw_data/Pair5/3006_S73_1_sequence.fastq.gz ./raw_data/Pair5/3006_S73_R1.fastq.gz
 mv ./raw_data/Pair5/3006_S73_2_sequence.fastq.gz ./raw_data/Pair5/3006_S73_R2.fastq.gz
 mv ./raw_data/Pair5/10393_S79_1_sequence.fastq.gz ./raw_data/Pair5/10393_S79_R1.fastq.gz
 mv ./raw_data/Pair5/10393_S79_2_sequence.fastq.gz ./raw_data/Pair5/10393_S79_R2.fastq.gz

 mv ./raw_data/Pair6/1736_S70_1_sequence.fastq.gz ./raw_data/Pair6/1736_S70_R1.fastq.gz
 mv ./raw_data/Pair6/1736_S70_2_sequence.fastq.gz ./raw_data/Pair6/1736_S70_R2.fastq.gz
 mv ./raw_data/Pair6/8583_S76_1_sequence.fastq.gz ./raw_data/Pair6/8583_S76_R1.fastq.gz
 mv ./raw_data/Pair6/8583_S76_2_sequence.fastq.gz ./raw_data/Pair6/8583_S76_R2.fastq.gz

 mv ./Pair1/2780_S71_R1.fastq.gz ./Pair1_2780_S71_R1.fastq.gz
 mv ./Pair1/2780_S71_R2.fastq.gz ./Pair1_2780_S71_R2.fastq.gz
 mv ./Pair1/12835_S72_R1.fastq.gz ./Pair1_12835_S72_R1.fastq.gz
 mv ./Pair1/12835_S72_R2.fastq.gz ./Pair1_12835_S72_R2.fastq.gz

 mv ./Pair2/4667_S75_R1.fastq.gz ./Pair2_4667_S75_R1.fastq.gz
 mv ./Pair2/4667_S75_R2.fastq.gz ./Pair2_4667_S75_R2.fastq.gz
 mv ./Pair2/9940_S78_R1.fastq.gz ./Pair2_9940_S78_R1.fastq.gz
 mv ./Pair2/9940_S78_R2.fastq.gz ./Pair2_9940_S78_R2.fastq.gz

 mv ./Pair3/3139_S74_R1.fastq.gz ./Pair3_3139_S74_R1.fastq.gz
 mv ./Pair3/3139_S74_R2.fastq.gz ./Pair3_3139_S74_R2.fastq.gz
 mv ./Pair3/19296_S81_R1.fastq.gz ./Pair3_19296_S81_R1.fastq.gz
 mv ./Pair3/19296_S81_R2.fastq.gz ./Pair3_19296_S81_R2.fastq.gz

 mv ./Pair4/9849_S77_R1.fastq.gz ./Pair4_9849_S77_R1.fastq.gz
 mv ./Pair4/9849_S77_R2.fastq.gz ./Pair4_9849_S77_R2.fastq.gz
 mv ./Pair4/11635_S80_R1.fastq.gz ./Pair4_11635_S80_R1.fastq.gz
 mv ./Pair4/11635_S80_R2.fastq.gz ./Pair4_11635_S80_R2.fastq.gz

 mv ./Pair5/3006_S73_R1.fastq.gz ./Pair5_3006_S73_R1.fastq.gz
 mv ./Pair5/3006_S73_R2.fastq.gz ./Pair5_3006_S73_R2.fastq.gz
 mv ./Pair5/10393_S79_R1.fastq.gz ./Pair5_10393_S79_R1.fastq.gz
 mv ./Pair5/10393_S79_R2.fastq.gz ./Pair5_10393_S79_R2.fastq.gz

 mv ./Pair6/1736_S70_R1.fastq.gz ./Pair6_1736_S70_R1.fastq.gz
 mv ./Pair6/1736_S70_R2.fastq.gz ./Pair6_1736_S70_R2.fastq.gz
 mv ./Pair6/8583_S76_R1.fastq.gz ./Pair6_8583_S76_R1.fastq.gz
 mv ./Pair6/8583_S76_R2.fastq.gz ./Pair6_8583_S76_R2.fastq.gz

https://bioinformatics.cc/call-and-merge-snp-and-indel-results-from-using-two-variant-calling-methods/

SNP-calling on AP019721 using snippy, but too far is the reference –> throw the results away!

 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto

 #Prepare raw_data and bacto-0.1.json
 #Set "reference": "db/CP133***.gb" in bacto-0.1.json

 conda activate bengal3_ac3
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

https://bioinformatics.cc/analysis-of-snps-indels-transposons-and-is-elements-in-5-a-baumannii-strains/

Annotate a sequence as reference

 conda activate /home/jhuang/miniconda3/envs/bakta
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair1_2780_S71/contigs.fa --prefix 2780
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair2_4667_S75/contigs.fa --prefix 4667
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair3_3139_S74/contigs.fa --prefix 3139
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair4_9849_S77/contigs.fa --prefix 9849
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair5_3006_S73/contigs.fa --prefix 3006
 bakta --db /mnt/nvme0n1p1/bakta_db shovill/Pair6_1736_S70/contigs.fa --prefix 1736
 #END

Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

 * Preparing file 1 eggnog_out.emapper.annotations.txt for the R-code below: (KEGG Terms): EggNog based on orthology and phylogenies

     EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.

     Install EggNOG-mapper:

         mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
         mamba activate eggnog_env

     Run annotation:

         #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
         mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
         #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd

         # (DO_NOT_NEED_HERE) Download CU459141_protein_.fasta from NCBI
         # python ~/Scripts/update_fasta_header.py CU459141_protein_.fasta CU459141_protein.fasta

         emapper.py -i 2780.faa -o 2780_eggnog_out --cpu 80    #--resume
         emapper.py -i 4667.faa -o 4667_eggnog_out --cpu 80
         emapper.py -i 3139.faa -o 3139_eggnog_out --cpu 80
         emapper.py -i 9849.faa -o 9849_eggnog_out --cpu 80
         emapper.py -i 3006.faa -o 3006_eggnog_out --cpu 80
         emapper.py -i 1736.faa -o 1736_eggnog_out --cpu 80
         #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
         #---->  470.IX87_14445:
             * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
             * IX87_14445 would refer to a specific gene or protein within that genome.

     Extract KEGG KO IDs from annotations.emapper.annotations.

Call variant calling on self using snippy

 mkdir snippy_Pair1
 mv 2780* snippy_Pair1
 cd snippy_Pair1

 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto

 #Prepare raw_data and bacto-0.1.json
 mkdir raw_data
 mv ../raw_data/Pair1_2780_S71_R1.fastq.gz raw_data
 mv ../raw_data/Pair1_2780_S71_R2.fastq.gz raw_data
 mv ../raw_data/Pair1_12835_S72_R1.fastq.gz raw_data
 mv ../raw_data/Pair1_12835_S72_R2.fastq.gz raw_data

 #Copy the reference 2780.gbff to db, and set "reference": "db/2780.gbff" in bacto-0.1.json

 conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

 --

 mkdir snippy_Pair2
 mv 4667* snippy_Pair2
 cd snippy_Pair2
 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto
 mkdir raw_data
 mv ../raw_data/Pair2_4667_S75_R1.fastq.gz raw_data
 mv ../raw_data/Pair2_4667_S75_R2.fastq.gz raw_data
 mv ../raw_data/Pair2_9940_S78_R1.fastq.gz raw_data
 mv ../raw_data/Pair2_9940_S78_R2.fastq.gz raw_data
 #Copy the reference 4667.gbff to db, and set "reference": "db/2780.gbff" in bacto-0.1.json
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

 --

 mkdir snippy_Pair3
 mv 3139* snippy_Pair3
 cd snippy_Pair3
 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto
 mkdir raw_data
 mv ../raw_data/Pair3_3139_S74_R1.fastq.gz raw_data
 mv ../raw_data/Pair3_3139_S74_R2.fastq.gz raw_data
 mv ../raw_data/Pair3_19296_S81_R1.fastq.gz raw_data
 mv ../raw_data/Pair3_19296_S81_R2.fastq.gz raw_data
 #Copy the reference 3139.gbff to db, and set "reference": "db/3139.gbff" in bacto-0.1.json
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

 --

 mkdir snippy_Pair4
 mv 9849* snippy_Pair4
 cd snippy_Pair4
 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto
 mkdir raw_data
 mv ../raw_data/Pair4_9849_S77_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair4_9849_S77_R2.fastq.gz ./raw_data
 mv ../raw_data/Pair4_11635_S80_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair4_11635_S80_R2.fastq.gz ./raw_data
 #Copy the reference 9849.gbff to db, and set "reference": "db/9849.gbff" in bacto-0.1.json
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

 --

 mkdir snippy_Pair5
 mv 3006* snippy_Pair5
 cd snippy_Pair5
 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto
 mkdir raw_data
 mv ../raw_data/Pair5_3006_S73_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair5_3006_S73_R2.fastq.gz ./raw_data
 mv ../raw_data/Pair5_10393_S79_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair5_10393_S79_R2.fastq.gz ./raw_data
 #Copy the reference 3006.gbff to db, and set "reference": "db/3006.gbff" in bacto-0.1.json
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

 --

 mkdir snippy_Pair6
 mv 1736* snippy_Pair6
 cd snippy_Pair6
 git clone https://github.com/huang/bacto
 mv bacto/* ./
 rm -rf bacto
 mkdir raw_data
 mv ../raw_data/Pair6_1736_S70_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair6_1736_S70_R2.fastq.gz ./raw_data
 mv ../raw_data/Pair6_8583_S76_R1.fastq.gz ./raw_data
 mv ../raw_data/Pair6_8583_S76_R2.fastq.gz ./raw_data
 #Copy the reference 1736.gbff to db, and set "reference": "db/1736.gbff" in bacto-0.1.json
 /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

TODOs: # Merge the annotated information 2780_eggnog_out.emapper.annotations into the SNP-Indels-results after making concensus SNP-Indels-results merging results of SPANDx!

!!!! OPTIONALLY (for the project, 我觉得用snippy就够了: simply only taking the results of snippy, do not do the concensus of the results using snippy and SPANDx !!!!)

https://bioinformatics.cc/interhost-variant-calling-data_tam_dnaseq_2025_y1y2y3y4w1w2w3w4_tig1_tig2_dij_on_atcc19606/

Summarize all SNPs and Indels from the snippy result directory.

cp ~/Scripts/summarize_snippy_res_ordered.py .
# IMPORTANT_ADAPT the array in script should be adapted
isolates = ["W1", "W2", "W3", "W4", "Y1", "Y2", "Y3", "Y4"]
mamba activate plot-numpy1
python3 ./summarize_snippy_res_ordered.py snippy
#--> Summary CSV file created successfully at: snippy/summary_snps_indels.csv
cd snippy
#REMOVE_the_line? I don't find the sence of the line:    grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv

Convert the results to Excel-format

#REPLCE_HEAD: CHROM,POS,REF,TYPE,Pair1_2780_S71,Pair1_12835_S72,Functional_Class,Codon_change,Protein_and_nucleotide_change,Gene_ID,Gene_name
#Replace ' ' --> ','; Replace ",,,,,," --> ','; Remove "nan"
#Check the first sample is the same to the reference, otherwise delete it.
#Sort according contig number and position

~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair1/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair1_on_2780.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair2/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair2_on_4667.xls
#EMPTY ~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair3/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair3_on_3139.xls
#EMPTY ~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair4/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair4_on_9849.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair5/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair5_on_3006.xls
#EMPTY ~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair6/snippy/summary_snps_indels.csv -d',' -o SNPs_Indels_pair6_on_1736.xls

~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair1/2780_eggnog_out.emapper.annotations -d$'\t' -o 2780_ORF_annotations.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair2/4667_eggnog_out.emapper.annotations -d$'\t' -o 4667_ORF_annotations.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair3/3139_eggnog_out.emapper.annotations -d$'\t' -o 3139_ORF_annotations.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair4/9849_eggnog_out.emapper.annotations -d$'\t' -o 9849_ORF_annotations.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair5/3006_eggnog_out.emapper.annotations -d$'\t' -o 3006_ORF_annotations.xls
~/Tools/csv2xls-0.4/csv_to_xls.py ./snippy_Pair6/1736_eggnog_out.emapper.annotations -d$'\t' -o 1736_ORF_annotations.xls

~/Tools/csv2xls-0.4/csv_to_xls.py mlst.txt -d$'\t' -o mlst.xls

Report

I have completed the variant calling and functional annotation for the 6 pairs of Staphylococcus epidermidis DNA-seq data.

Key Findings:
Out of the 6 pairs, 3 pairs reported zero variants (Pairs 3, 4, and 6), while the other 3 pairs (Pairs 1, 2, and 5) yielded SNPs and Indels.
This result aligns with the resistome and virulence profiling. For the pairs with no genetic variants (Pairs 3, 4, and 6), the isolates have identical AMR and VFDB gene profiles (e.g., both isolates in Pair 3 have 27 AMR genes and 1 VFDB hit; both in Pair 4 have 20 AMR genes). In contrast, the pairs that showed SNPs/Indels (Pairs 1, 2, and 5) also exhibit differences in their AMR gene counts (e.g., Pair 2 has 33 vs. 30 AMR genes; Pair 5 has 21 vs. 24 AMR genes).
Attached Files:
Please find the following results attached:

    MLST calling: mlst.xls.
    Resistome_Virulence results: 12 Excel files
    SNP/Indel Results (for the pairs with variants):
        SNPs_Indels_pair1_on_2780.xls
        SNPs_Indels_pair2_on_4667.xls
        SNPs_Indels_pair5_on_3006.xls
    ORF Annotations (EggNOG-mapper results with KEGG/GO terms for all 6 reference genomes in the 6 paired comparisons):
        2780_ORF_annotations.xls, 4667_ORF_annotations.xls, 3139_ORF_annotations.xls, 9849_ORF_annotations.xls, 3006_ORF_annotations.xls, 1736_ORF_annotations.xls

Brief Methodology Summary:

    Assembly & QC: We used nf-core/bacass for short-read assembly. CheckM confirmed high genome purity (>99.7% completeness, ~0% contamination), and GTDB-Tk confirmed the species as Staphylococcus epidermidis.
    Typing & AMR: MLST was performed (identifying ST2, ST5, and ST59 lineages). Resistome and virulence profiling were done using Abricate and RGI.
    Variant Calling: We used snippy to map each pair against its own assembled reference genome to ensure high-confidence SNP/Indel calling.
    Functional Annotation: Bakta was used for structural annotation, and EggNOG-mapper was utilized to assign KEGG and GO terms to the ORFs for all 6 reference genomes in the 6 paired comparisons.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Variant Calling Results and Analysis Summary for the 6 Pairs of DNA-seq Data (Data_Holger_DNAseq_2026_Sepi_Pairs)

Leave a Reply Cancel reply