Variant calling for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs)

  1. Input data:

     ln -s ../raw_data_2024/hCoV229E_Rluc_R1.fastq.gz hCoV229E_Rluc_R1.fastq.gz
     ln -s ../raw_data_2024/hCoV229E_Rluc_R2.fastq.gz hCoV229E_Rluc_R2.fastq.gz
     ln -s ../raw_data_2024/p10_DMSO_R1.fastq.gz p10_DMSO_R1.fastq.gz
     ln -s ../raw_data_2024/p10_DMSO_R2.fastq.gz p10_DMSO_R2.fastq.gz
     ln -s ../raw_data_2024/p10_K22_R1.fastq.gz p10_K22_R1.fastq.gz
     ln -s ../raw_data_2024/p10_K22_R2.fastq.gz p10_K22_R2.fastq.gz
     ln -s ../raw_data_2024/p10_K7523_R1.fastq.gz p10_K7523_R1.fastq.gz
     ln -s ../raw_data_2024/p10_K7523_R2.fastq.gz p10_K7523_R2.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R1_001.fastq.gz p16_DMSO_R1.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20606/p16_DMSO_S29_R2_001.fastq.gz p16_DMSO_R2.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R1_001.fastq.gz p16_K22_R1.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20607/p16_K22_S30_R2_001.fastq.gz p16_K22_R2.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R1_001.fastq.gz p16_X7523_R1.fastq.gz
     ln -s ../raw_data_2025/250506_VH00358_136_AAG3YJ5M5/p20608/p16_X7523_S31_R2_001.fastq.gz p16_X7523_R2.fastq.gz
  2. Call variant calling using snippy

     ln -s ~/Tools/bacto/db/ .;
     ln -s ~/Tools/bacto/envs/ .;
     ln -s ~/Tools/bacto/local/ .;
     cp ~/Tools/bacto/Snakefile .;
     cp ~/Tools/bacto/bacto-0.1.json .;
     cp ~/Tools/bacto/cluster.json .;
    
     #download CU459141.gb from GenBank
     mv ~/Downloads/sequence\(2\).gb db/PP810610.gb
    
     #setting the following in bacto-0.1.json
         "fastqc": false,
         "taxonomic_classifier": false,
         "assembly": true,
         "typing_ariba": false,
         "typing_mlst": true,
         "pangenome": true,
         "variants_calling": true,
         "phylogeny_fasttree": true,
         "phylogeny_raxml": true,
         "recombination": false, (due to gubbins-error set false)
         "genus": "Alphacoronavirus",
         "kingdom": "Viruses",
         "species": "Human coronavirus 229E",
         "mykrobe": {
             "species": "corona"
         },
         "reference": "db/PP810610.gb"
    
     mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
     (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  3. Summarize all SNPs and Indels from the snippy result directory.

     #Output: snippy/summary_snps_indels.csv
     # IMPORTANT_ADAPT the array isolates = ["AYE-S", "AYE-Q", "AYE-WT on Tig4", "AYE-craA on Tig4", "AYE-craA-1 on Cm200", "AYE-craA-2 on Cm200"]
     python3 ~/Scripts/summarize_snippy_res.py snippy
     cd snippy
     #grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
  4. Using spandx calling variants (almost the same results to the one from viral-ngs!)

     mamba activate /home/jhuang/miniconda3/envs/spandx
     mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
     cp PP810610.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
     vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
     /home/jhuang/miniconda3/envs/spandx/bin/snpEff build PP810610    #-d
     ~/Scripts/genbank2fasta.py PP810610.gb
     mv PP810610.gb_converted.fna PP810610.fasta    #rename "NC_001348.1 xxxxx" to "NC_001348" in the fasta-file
     ln -s /home/jhuang/Tools/spandx/ spandx
     (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref PP810610.fasta --annotation --database PP810610 -resume
    
     # Rerun SNP_matrix.sh due to the error ERROR_CHROMOSOME_NOT_FOUND in the variants annotation
     cd Outputs/Master_vcf
     (spandx) cp -r ../../snippy/hCoV229E_Rluc/reference .
     (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
     #Note that ${variant_genome_path}=NC_001348 in the following command, but it was not used after command replacement.
     #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
     "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
     (spandx) bash SNP_matrix.sh PP810610 .
  5. Calling inter-host variants by merging the results from snippy+spandx (Manually!)

     # Inter-host variants(宿主间变异):一种病毒在两个人之间有不同的基因变异,这些变异可能与宿主的免疫反应、疾病表现或病毒传播的方式相关。
     cp All_SNPs_indels_annotated.txt All_SNPs_indels_annotated_backup.txt
     vim All_SNPs_indels_annotated.txt
    
     #in the file ids: grep "$(echo -e '\t')353$(echo -e '\t')" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt
     #Replace \n with " All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated_.txt\ngrep "
     #Replace grep " --> grep "$(echo -e '\t')
     #Replace " All_ --> $(echo -e '\t')" All_
    
     # Potential intra-host variants: 10871, 19289, 23435.
     CHROM   POS     REF     ALT     TYPE    hCoV229E_Rluc_trimmed   p10_DMSO_trimmed        p10_K22_trimmed p10_K7523_trimmed       p16_DMSO_trimmed        p16_K22_trimmed p16_X7523_trimmed       Effect  Impact  Functional_Class        Codon_change    Protein_and_nucleotide_change   Amino_Acid_Length       Gene_name       Biotype
     PP810610        1464    T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        gTt/gCt p.Val416Ala/c.1247T>C   6757    CDS_1   protein_coding
     PP810610        1699    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val494Val/c.1482C>T   6757    CDS_1   protein_coding
     PP810610        6691    C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  tgC/tgT p.Cys2158Cys/c.6474C>T  6757    CDS_1   protein_coding
     PP810610        6919    C       G       SNP     G       G       G       G       G       G       G       synonymous_variant      LOW     SILENT  ggC/ggG p.Gly2234Gly/c.6702C>G  6757    CDS_1   protein_coding
     PP810610        7294    T       A       SNP     A       A       A       A       A       A       A       missense_variant        MODERATE        MISSENSE        agT/agA p.Ser2359Arg/c.7077T>A  6757    CDS_1   protein_coding
     * PP810610       10871   C       T       SNP     C       C/T     T       C/T     C/T     T       C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu3552Phe/c.10654C>T 6757    CDS_1   protein_coding
     PP810610        14472   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        aTg/aCg p.Met4752Thr/c.14255T>C 6757    CDS_1   protein_coding
     PP810610        15458   T       C       SNP     C       C       C       C       C       C       C       synonymous_variant      LOW     SILENT  Ttg/Ctg p.Leu5081Leu/c.15241T>C 6757    CDS_1   protein_coding
     PP810610        16035   C       A       SNP     A       A       A       A       A       A       A       stop_gained     HIGH    NONSENSE        tCa/tAa p.Ser5273*/c.15818C>A   6757    CDS_1   protein_coding
     PP810610        17430   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu5738Ser/c.17213T>C 6757    CDS_1   protein_coding
     * PP810610       19289   G       T       SNP     G       G       T       G       G       G/T     G       missense_variant        MODERATE        MISSENSE        Gtt/Ttt p.Val6358Phe/c.19072G>T 6757    CDS_1   protein_coding
     PP810610        21183   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        tTt/tGt p.Phe230Cys/c.689T>G    1173    CDS_2   protein_coding
     PP810610        22636   T       G       SNP     G       G       G       G       G       G       G       missense_variant        MODERATE        MISSENSE        aaT/aaG p.Asn714Lys/c.2142T>G   1173    CDS_2   protein_coding
     PP810610        23022   T       C       SNP     C       C       C       C       C       C       C       missense_variant        MODERATE        MISSENSE        tTa/tCa p.Leu843Ser/c.2528T>C   1173    CDS_2   protein_coding
     * PP810610       23435   C       T       SNP     C       C       T       C/T     C       C/T     C/T     missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu981Phe/c.2941C>T   1173    CDS_2   protein_coding
     PP810610        24512   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctc/Ttc p.Leu36Phe/c.106C>T     88      CDS_4   protein_coding
     PP810610        24781   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        aCt/aTt p.Thr36Ile/c.107C>T     77      CDS_5   protein_coding
     PP810610        25163   C       T       SNP     T       T       T       T       T       T       T       missense_variant        MODERATE        MISSENSE        Ctt/Ttt p.Leu82Phe/c.244C>T     225     CDS_6   protein_coding
     PP810610        25264   C       T       SNP     T       T       T       T       T       T       T       synonymous_variant      LOW     SILENT  gtC/gtT p.Val115Val/c.345C>T    225     CDS_6   protein_coding
     PP810610        26838   G       T       SNP     T       T       T       T       T       T       T
  6. Calling intra-host variants using viral-ngs

     # Intra-host variants(宿主内变异):同一个人感染了某种病毒,但在其体内的不同细胞或器官中可能存在多个不同的病毒变异株。
    
     #How to run and debug the viral-ngs docker?
     # ---- DEBUG_2025_1: using docker instead ----
     mkdir viralngs; cd viralngs
     ln -s ~/Tools/viral-ngs_docker/Snakefile Snakefile
     ln -s  ~/Tools/viral-ngs_docker/bin bin
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/refsel.acids refsel.acids
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/lastal.acids lastal.acids
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/config.yaml config.yaml
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-runs.txt samples-runs.txt
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-depletion.txt samples-depletion.txt
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-metagenomics.txt samples-metagenomics.txt
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly.txt samples-assembly.txt
     cp  ~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2024/samples-assembly-failures.txt samples-assembly-failures.txt
     # Adapt the sample-*.txt
    
     mkdir viralngs/data
     mkdir viralngs/data/00_raw
    
     mkdir bams
     ref_fa="PP810610.fasta";
     #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
     for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
         bwa index ${ref_fa}; \
         bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_trimmed_P_1.fastq trimmed/${sample}_trimmed_P_2.fastq | samtools view -bS - > bams/${sample}_genome_alignment.bam; \
     done
    
     conda activate viral-ngs4
     #for sample in hCoV229E_Rluc p10_DMSO p10_K22; do
     #for sample in p10_K7523 p16_DMSO p16_K22 p16_X7523; do
     for sample in p16_K22; do
         picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=~/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs/data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
     done
     conda deactivate
    
     # -- ! Firstly set the samples-assembly.txt empty, so that only focus on running depletion!
     docker run -it -v /mnt/md1/DATA_D/Data_Pietschmann_229ECoronavirus_Mutations_2025/viralngs:/work -v /home/jhuang/Tools/viral-ngs_docker:/home/jhuang/Tools/viral-ngs_docker -v /home/jhuang/REFs:/home/jhuang/REFs -v /home/jhuang/Tools/GenomeAnalysisTK-3.6:/home/jhuang/Tools/GenomeAnalysisTK-3.6 -v /home/jhuang/Tools/novocraft_v3:/home/jhuang/Tools/novocraft_v3 -v /usr/local/bin/gatk:/usr/local/bin/gatk   own_viral_ngs bash
     cd /work
     snakemake --directory /work --printshellcmds --cores 40
    
     # -- ! Secondly manully run assembly steps
     # --> By itereative add the unfinished assembly in the list, each time replace one, and run "snakemake --directory /work --printshellcmds --cores 40"
    
         # # ---- NOTE that the following steps need rerun --> DOES NOT WORK, USE STRATEGY ABOVE ----
         # #for sample in p10_K22 p10_K7523; do
         # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
         #     bin/read_utils.py merge_bams data/01_cleaned/${sample}.cleaned.bam tmp/01_cleaned/${sample}.cleaned.bam --picardOptions SORT_ORDER=queryname
         #     bin/read_utils.py rmdup_mvicuna_bam tmp/01_cleaned/${sample}.cleaned.bam data/01_per_sample/${sample}.cleaned.bam --JVMmemory 30g
         # done
         #
         # #Note that the error generated by nextflow is from the step gapfill_gap2seq!
         # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
         #     bin/assembly.py assemble_spades data/01_per_sample/${sample}.taxfilt.bam /home/jhuang/REFs/viral_ngs_dbs/trim_clip/contaminants.fasta tmp/02_assembly/${sample}.assembly1-spades.fasta --nReads 10000000 --threads 15 --memLimitGb 12
         # done
         # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
         # for sample in p10_K22 p10_K7523; do
         #     bin/assembly.py order_and_orient tmp/02_assembly/${sample}.assembly1-spades.fasta refsel_db/refsel.fasta tmp/02_assembly/${sample}.assembly2-scaffolded.fasta --min_pct_contig_aligned 0.05 --outAlternateContigs tmp/02_assembly/${sample}.assembly2-alternate_sequences.fasta --nGenomeSegments 1 --outReference tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta --threads 15
         # done
         #
         # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
         #     bin/assembly.py gapfill_gap2seq tmp/02_assembly/${sample}.assembly2-scaffolded.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
         # done
    
     #IMPORTANT: Reun the following commands!
     for sample in hCoV229E_Rluc  p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
    
         bin/assembly.py impute_from_reference tmp/02_assembly/${sample}.assembly2-gapfilled.fasta tmp/02_assembly/${sample}.assembly2-scaffold_ref.fasta tmp/02_assembly/${sample}.assembly3-modify.fasta --newName ${sample} --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index  --loglevel DEBUG
     done
    
         # for sample in hCoV229E_Rluc p10_DMSO p10_K22 p10_K7523  p16_DMSO p16_K22 p16_X7523; do
         #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly3-modify.fasta data/01_per_sample/${sample}.cleaned.bam tmp/02_assembly/${sample}.assembly4-refined.fasta --outVcf tmp/02_assembly/${sample}.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 15  --loglevel DEBUG
         #     bin/assembly.py refine_assembly tmp/02_assembly/${sample}.assembly4-refined.fasta data/01_per_sample/${sample}.cleaned.bam data/02_assembly/${sample}.fasta --outVcf tmp/02_assembly/${sample}.assembly4.vcf.gz --min_coverage 3 --novo_params '-r Random -l 20 -g 40 -x 20 -t 100' --threads 15  --loglevel DEBUG
         # done
    
     # -- ! Thirdly set the samples-assembly.txt completely and run "snakemake --directory /work --printshellcmds --cores 40"
  7. Merge intra- and inter-host variants, comparing the variants to the alignments of the assemblies to confirm its correctness.

     cat NC_001348.fasta viralngs/data/02_assembly/VZV_20S.fasta viralngs/data/02_assembly/VZV_60S.fasta > aligned_1.fasta
     mafft --clustalout aligned_1.fasta > aligned_1.aln
     #~/Scripts/convert_fasta_to_clustal.py aligned_1.fasta_orig aligned_1.aln
     ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
     #manully delete the postion with all or '-' in aligned_1_.aln
     ~/Scripts/check_sequence_differences.py aligned_1_.aln
     ~/Scripts/check_sequence_differences.py aligned_1_.aln > aligned_1.res
     grep -v " = n" aligned_1.res > aligned_1_.res
    
     cat NC_001348.fasta viralngs/tmp/02_assembly/VZV_20S.assembly4-refined.fasta viralngs/tmp/02_assembly/VZV_60S.assembly4-refined.fasta > aligned_1.fasta
     mafft --clustalout aligned_1.fasta > aligned_1.aln
     ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
     ~/Scripts/check_sequence_differences.py aligned_1_.aln > aligned_1.res
     grep -v " = n" aligned_1.res > aligned_1_.res
    
     #Differences found at the following positions (150):
     Position 8956: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
     Position 8991: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
     Position 8992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
     Position 8995: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
     Position 9190: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
     * Position 13659: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
     * Position 47969: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
     * Position 53691: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
     * Position 55501: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
     * Position 63248: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
     Position 63799: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
     * Position 64328: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
     Position 65179: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
     * Position 65225: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
     * Position 95302: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    
     gunzip isnvs.annot.txt.gz
     ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
     cut -d$'\t' filtered_isnvs.annot.txt -f1-7
     chr     pos     sample  patient time    alleles iSNV_freq
     OP297860        13203   HSV1_S1 HSV1_S1         T,C,A   1.0
     OP297860        13203   HSV-Klinik_S2   HSV-Klinik_S2           T,C,A   1.0
     OP297860        13522   HSV1_S1 HSV1_S1         G,T     1.0
     OP297860        13522   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008905554253573941
     OP297860        13659   HSV1_S1 HSV1_S1         G,T     1.0
     OP297860        13659   HSV-Klinik_S2   HSV-Klinik_S2           G,T     0.008383233532934131
    
     ~/Scripts/convert_clustal_to_fasta.py aligned_1_.aln aligned_1.fasta
     samtools faidx aligned_1.fasta
     samtools faidx aligned_1.fasta OP297860.1 > OP297860.1.fasta
     samtools faidx aligned_1.fasta HSV1_S1-1 > HSV1_S1-1.fasta
     samtools faidx aligned_1.fasta HSV-Klinik_S2-1 > HSV-Klinik_S2-1.fasta
     seqkit seq OP297860.1.fasta -w 70 > OP297860.1_w70.fasta
     diff OP297860.1_w70.fasta ../../refsel_db/refsel.fasta
  8. Consensus sequences of each and of all isolates

     cp data/02_assembly/*.fasta ./
     for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
     for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do \
     mv ${sample}.fasta ${sample}.fa
     cat all.fa ${sample}.fa >> all.fa
     done
     cat RSV_dedup.fa all.fa > RSV_all.fa
     mafft --adjustdirection RSV_all.fa > RSV_all.aln
     snp-sites RSV_all.aln -o RSV_all_.aln
  9. Download all Human alphaherpesvirus 3 (Varicella-zoster virus) genomes

     Human alphaherpesvirus 3
     acronym: HHV-3 VZV
     equivalent: Human herpes virus 3
    
     Human alphaherpesvirus 3 (Varicella-zoster virus)
         * Human herpesvirus 3 strain Dumas
         * Human herpesvirus 3 strain Oka vaccine
         * Human herpesvirus 3 VZV-32
    
     #Taxonomy ID: 10335
     esearch -db nucleotide -query "txid10335[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10335_ncbi.fasta
     python ~/Scripts/filter_fasta.py genome_10335_ncbi.fasta complete_genome_10335_ncbi.fasta  #2041-->165
     # ---- Download related genomes from ENA ----
     https://www.ebi.ac.uk/ena/browser/view/10335
     #Click "Sequence" and download "Counts" (2003) and "Taxon descendants count" (2005) if there is enough time! Downloading time points is 11.03.2025.
     python ~/Scripts/filter_fasta.py  ena_10335_sequence.fasta complete_genome_10335_ena_taxon_descendants_count.fasta  #2005-->153
     #python ~/Scripts/filter_fasta.py ena_10335_sequence_Counts.fasta complete_genome_10335_ena_Counts.fasta  #xxx, 5.8G
     https://www.ebi.ac.uk/ena/browser/view/10239
     https://www.ebi.ac.uk/ena/browser/view/2497569
     https://www.ebi.ac.uk/ena/browser/view/Taxon:2497569
     ena_10239_sequence.fasta
     esearch -db nucleotide -query "txid10239[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_10239_ncbi.fasta
  10. Using Multi-CAR for scaffolding the contigs (If not useful, choose another scaffolding tool, e.g. https://github.com/malonge/RagTag)

      All contigs over 500 bp were successfully scaffolded to the graft genome using Multi-CAR (13), resulting in a chromosomal assembly of 4,506,689 bp.
      https://genome.cs.nthu.edu.tw/Multi-CAR/
      https://github.com/ablab-nthu/Multi-CSAR
  11. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

     (vrap) vrap/vrap.py  -1 trimmed/VZV_20S_trimmed_P_1.fastq -2 trimmed/VZV_20S_trimmed_P_2.fastq  -o VZV_20S_on_X04370 --host /home/jhuang/DATA/Data_Huang_Human_herpesvirus_3/X04370.fasta   -t 100 -l 200  -g
     cd bowtie
     mv mapped mapped.sam
     samtools view -S -b mapped.sam > mapped.bam
     samtools sort mapped.bam -o mapped_sorted.bam
     samtools index mapped_sorted.bam
     samtools view -H mapped_sorted.bam
     samtools flagstat mapped_sorted.bam
  12. Show the bw on IGV

  13. Reports

     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly4-refined.fasta
    
     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly1-spades.fasta
     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly2-scaffolded.fasta
     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly2-gapfilled.fasta
     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly3-modify.fasta
     diff data/02_assembly/2040_04.fasta tmp/02_assembly/2040_04.assembly4-refined.fasta
     ./2040_04.assembly2-alternate_sequences.fasta
     ./2040_04.assembly2-scaffold_ref.fasta

Leave a Reply

Your email address will not be published. Required fields are marked *