Variant Calling for Herpes Simplex Virus 1 from Patient Sample Using Capture Probe Sequencing

gene_x 0 like s 44 view s

Tags: processing, pipeline

  1. Using bacto pipeline to get trimmed reads

    cp /home/jhuang/Tools/bacto/bacto-0.1.json .
    cp /home/jhuang/Tools/bacto/cluster.json .
    cp /home/jhuang/Tools/bacto/Snakefile .
    ln -s /home/jhuang/Tools/bacto/local .
    ln -s /home/jhuang/Tools/bacto/db .
    ln -s /home/jhuang/Tools/bacto/envs .
    mkdir raw_data; cd raw_data
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R1_001.fastq.gz HSV1_S1_R1.fastq.gz
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R2_001.fastq.gz HSV1_S1_R2.fastq.gz
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R1_001.fastq.gz HSV-Klinik_S2_R1.fastq.gz
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R2_001.fastq.gz HSV-Klinik_S2_R2.fastq.gz
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R1_001.fastq.gz NTC_S3_R1.fastq.gz
    ln -s ../20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R2_001.fastq.gz NTC_S3_R2.fastq.gz
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    
    #0       HSV1_S1 snp=8   del=1   ins=0   het=97  unaligned=125842
    #1       HSV-Klinik_S2   snp=94  del=1   ins=2   het=550 unaligned=78080
    
  2. Filtering low complexity

    fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    
        Read1 before filtering:
        total reads: 1755209
        total bases: 163663141
        Q20 bases: 162306612(99.1711%)
        Q30 bases: 159234526(97.2941%)
    
        Read2 before filtering:
        total reads: 1755209
        total bases: 163045950
        Q20 bases: 161178082(98.8544%)
        Q30 bases: 157052184(96.3239%)
    
        Read1 after filtering:
        total reads: 1733241
        total bases: 161547828
        Q20 bases: 160217907(99.1768%)
        Q30 bases: 157196236(97.3063%)
    
        Read2 aftering filtering:
        total reads: 1733241
        total bases: 160825521
        Q20 bases: 159057902(98.9009%)
        Q30 bases: 155354052(96.5979%)
    
        Filtering result:
        reads passed filter: 3466482
        reads failed due to low quality: 550
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads failed due to low complexity: 43386
        reads with adapter trimmed: 21424
        bases trimmed due to adapters: 159261
    
        Duplication rate: 14.2379%
    
        Insert size peak (evaluated by paired-end reads): 41
    
        JSON report: fastp.json
        HTML report: fastp.html
    
        fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
        fastp v0.20.1, time used: 7 seconds
        Read1 before filtering:
        total reads: 2688264
        total bases: 330035144
        Q20 bases: 326999269(99.0801%)
        Q30 bases: 320136918(97.0009%)
    
        Read2 before filtering:
        total reads: 2688264
        total bases: 327364405
        Q20 bases: 323331005(98.7679%)
        Q30 bases: 314500076(96.0703%)
    
        Read1 after filtering:
        total reads: 2660598
        total bases: 326564634
        Q20 bases: 323572956(99.0839%)
        Q30 bases: 316783667(97.0049%)
    
        Read2 aftering filtering:
        total reads: 2660598
        total bases: 324709841
        Q20 bases: 320840657(98.8084%)
        Q30 bases: 312570288(96.2614%)
    
        Filtering result:
        reads passed filter: 5321196
        reads failed due to low quality: 1110
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads failed due to low complexity: 54222
        reads with adapter trimmed: 39080
        bases trimmed due to adapters: 357915
    
        Duplication rate: 9.91821%
    
        Insert size peak (evaluated by paired-end reads): 96
    
        JSON report: fastp.json
        HTML report: fastp.html
    
        fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
        fastp v0.20.1, time used: 15 seconds
    
  3. Using vrap to assembly and annotate the contigs, the spades-step was replaced with idba of DAMIAN

    ln -s ~/Tools/vrap/ .
    #CHANGE the txid10298 in download_db.py: txid10298[Organism] AND complete genome[Title]
    gzip trimmed/*_R1.fastq trimmed/*_R2.fastq
    mv trimmed/*.gz ./
    #--host /home/jhuang/REFs/genome.fa -n /mnt/nvme0n1p1/blast/nt -a /mnt/nvme0n1p1/blast/nr
    vrap/vrap.py  -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out   -t 40 -l 100
    vrap/vrap.py  -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out   -t 40 -l 100
    #--> ERROR in spades-assembly, we usding idba from DAMIAN assembly, copy the assembly to spades. IT WORKS!
    
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R2.fastq.gz --sample HSV1_S1_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R2.fastq.gz --sample HSV-Klinik_S2_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    
    cp ~/rtpd_files/Affe30_megablast/idba_ud_assembly/contig.fa contigs.fasta
    cp ~/rtpd_files/Affe31_megablast/idba_ud_assembly/contig.fa contigs.fasta
    
    #RERUN vrap/vrap.py again with the replaced contigs.fasta!
    #vrap/vrap.py  -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out   -t 40 -l 100
    #vrap/vrap.py  -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out   -t 40 -l 100
    
  4. Analyses using viral-ngs

    mv Snakefile Snakefile_bacto
    mv trimmed trimmed_bacto
    
    conda activate viral3
    #conda install -c anaconda openjdk=8
    
    ln -s ~/Tools/viral-ngs/Snakefile Snakefile
    ln -s ~/Tools/viral-ngs/bin bin
    
    cp  ~/Tools/viral-ngs/refsel.acids refsel.acids
    cp  ~/Tools/viral-ngs/lastal.acids lastal.acids
    cp  ~/Tools/viral-ngs/config.yaml config.yaml
    cp  ~/Tools/viral-ngs/samples-runs.txt samples-runs.txt
    cp  ~/Tools/viral-ngs/samples-depletion.txt samples-depletion.txt
    cp  ~/Tools/viral-ngs/samples-metagenomics.txt samples-metagenomics.txt
    cp  ~/Tools/viral-ngs/samples-assembly.txt samples-assembly.txt
    cp  ~/Tools/viral-ngs/samples-assembly-failures.txt samples-assembly-failures.txt
    mkdir data
    cd data
    mkdir 00_raw
    cd ../..
    
  5. Prepare lastal.acids, refsel.acids and accessions_for_ref_genome_build in config.yaml

    #Herpes simplex virus 1 (HSV-1) and Human alphaherpesvirus 1 (also known as Simplexvirus humanalpha1) are indeed the same virus.
    #The different names result from varied naming conventions:
    #    * Herpes simplex virus 1 (HSV-1) is the common name, often used in clinical and general contexts.
    #    * Human alphaherpesvirus 1 is the official taxonomic name, as defined by the International Committee on Taxonomy of Viruses (ICTV). This name is used in scientific classifications and databases like NCBI to specify its place in the Herpesviridae family under the Alphaherpesvirinae subfamily.
    #In some databases or references, it might also appear under Simplexvirus humanalpha1, which refers to its taxonomic classification at the genus level (Simplexvirus) and species level (Human alphaherpesvirus 1). However, all these terms refer to the same virus, commonly known as HSV-1.
    
    #https://www.uniprot.org/taxonomy?query=Human+herpesvirus
    #https://www.uniprot.org/taxonomy/3050292
    esearch -db nuccore -query "txid3050292[Organism]" | efetch -format fasta > taxon_3050292_sequences.fasta
    esearch -db nuccore -query "txid3050292[Organism]" | efetch -format acc > taxon_3050292_accessions.txt
    esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format fasta > taxon_3050292_complete_genomes.fasta
    esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format acc > taxon_10298_complete_genomes.acc  # 161 genomes
    mv taxon_10298_complete_genomes.acc lastal.acids
    
    https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10298
    Human alphaherpesvirus 1 (Herpes simplex virus type 1)     Click on organism name to get more information.
        Human alphaherpesvirus 1 strain 17
        Human alphaherpesvirus 1 strain A44
        Human alphaherpesvirus 1 strain Angelotti
        Human alphaherpesvirus 1 strain CL101
        Human alphaherpesvirus 1 strain CVG-2
        Human alphaherpesvirus 1 strain F
        Human alphaherpesvirus 1 strain H129
        Human alphaherpesvirus 1 strain HFEM
        Human alphaherpesvirus 1 strain HZT
        Human alphaherpesvirus 1 strain KOS
        Human alphaherpesvirus 1 strain MGH-10
        Human alphaherpesvirus 1 strain MP
        Human alphaherpesvirus 1 strain Patton
        Human alphaherpesvirus 1 strain R-15
        Human alphaherpesvirus 1 strain R19
        Human alphaherpesvirus 1 strain RH2
        Human alphaherpesvirus 1 strain SC16
    
  6. Trimming using trimmomatic

    mkdir trimmed bams
    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    for sample in HSV1_S1; do
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ./raw_data/${sample}_R1.fastq.gz ./raw_data/${sample}_R2.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz  ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; \
    done
    
  7. Mapping

    cd trimmed
    seqtk sample -s100 HSV1_S1_R1.fastq.gz 0.1 > HSV1_S1_sampled_R1.fastq
    seqtk sample -s100 HSV1_S1_R2.fastq.gz 0.1 > HSV1_S1_sampled_R2.fastq
    gzip HSV1_S1_sampled_R1.fastq HSV1_S1_sampled_R2.fastq
    
    ref_fa="NC_001806.fasta";
    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    for sample in HSV1_S1; do
    for sample in HSV1_S1_sampled; do
        bwa index ${ref_fa}; \
        bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz | samtools view -bS - > bams/${sample}_genome_alignment.bam; \
        #for table filling using the following commands! -->3000000 \
        #bwa mem -M -t 14 ${ref_fa} ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz | samtools view -bS -F 256 - > bams/${sample}_uniqmap.bam; \
    done
    
  8. AddOrReplaceReadGroup is IMPORTANT step, otherwise the step viral_ngs cannot run correctly

    for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    for sample in HSV1_S1; do
    for sample in HSV1_S1_sampled; do
        picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
    done
    
  9. Configure the viral-ngs conda environment

    conda config --add channels r
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda config --add channels broad-viral
    
    mamba env remove -n viral-ngs3
    mamba create -n viral-ngs3 python=3.6 blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    conda activate viral-ngs3
    
    mamba create -n viral-ngs4 python=3.6
    conda activate viral-ngs4
    #vim requirements-conda.txt
    mamba install blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    # -- Eventually DEBUG --
    #mamba remove picard
    #mamba clean --packages
    #mamba install -c bioconda picard
    ##mamba install libgfortran=5 sqlite=3.46.0
    ##mamba install picard --clobber
    ##mamba create -n viral-ngs-fresh -c bioconda -c conda-forge picard python=3.6 sqlite=3.46.0 libgfortran=5
    
    mamba install cd-hit cd-hit-auxtools diamond gap2seq=2.1 mafft=7.221 mummer4 muscle=3.8 parallel pigz prinseq samtools=1.6 tbl2asn trimmomatic trinity unzip vphaser2 bedtools -c r -c defaults -c conda-forge -c bioconda  #-c broad-viral
    mamba install snpeff=4.1l
    mamba install gatk=3.6
    mamba install bwa
    #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    #IMPORTANT_CHECK if it works
    #        java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    #        /usr/local/bin/gatk -T RealignerTargetCreator --help
    #IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
    
    # -- NO ERROR --> INSTALL END HERE --
    
    # -- DEBUG: ClobberError: This transaction has incompatible packages due to a shared path. --
    # SafetyError: The package for snpeff located at /home/jhuang/miniconda3/pkgs/snpeff-4.1l-hdfd78af_8
    # appears to be corrupted. The path 'share/snpeff-4.1l-8/snpEff.config'
    # has an incorrect size.
    # reported size: 9460047 bytes
    # actual size: 9460357 bytes
    #
    # ClobberError: This transaction has incompatible packages due to a shared path.
    # packages: bioconda/linux-64::bowtie2-2.5.4-h7071971_4, bioconda/linux-64::bowtie-1.3.1-py36h769816f_3
    # path: 'bin/scripts/convert_quals.pl'
    
    # sovle confilict between bowtie, bowtie2 and snpeff
    mamba remove bowtie
    mamba install bowtie2
    mamba remove snpeff
    mamba install snpeff=4.1l
    
    # -- WITH ERROR caused by bowtie and snpeff --> INSTALL END HERE --
    
    #mamba install -c bioconda viral-ngs  #so that gatk3-register and novoalign-license-register available --> ERROR
        #Due to license restrictions, the viral-ngs conda package cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK v3.8 from the Broad Institute, and call “gatk3-register,” which will copy GATK into your viral-ngs conda environment:
            mkdir -p /path/to/gatk_dir
            wget -O - 'https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.6-0-g89b7209' | tar -xjvC /path/to/gatk_dir
            gatk3-register /path/to/gatk_dir/GenomeAnalysisTK.jar
        #The single-threaded version of Novoalign is installed by default. If you have a license for Novoalign to enable multi-threaded operation, viral-ngs will copy it to the viral-ngs conda environment if the NOVOALIGN_LICENSE_PATH environment variable is set. Alternatively, the conda version of Novoalign can be overridden if the NOVOALIGN_PATH environment variable is set. If you obtain a Novoalign license after viral-ngs has already been installed, it can be added to the conda environment by calling:
            # obtain a Novoalign license file: novoalign.lic
            novoalign-license-register /path/to/novoalign.lic
    
    # # --We don't have registers, so we have to manually install novoalign and gatk--
    # #At first install novoalign, then samtools
    # mamba remove samtools
    # mamba install -c bioconda novoalign  # Eventually not necessary, since the path is defined in config.yaml NOVOALIGN_PATH: "/home/jhuang/Tools/novocraft_v3", and novoalign.lic is also in the same path.
    # mamba install -c bioconda samtools
    #
    # mamba install -c bioconda gatk #(3.8)  #IN /usr/local/bin/gatk FROM /home/jhuang/Tools/SPANDx_v3.2/GenomeAnalysisTK.jar
    # #UPDATED TO: '/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar'
    
    # # If necessary, clean up the conda cache. This will remove any partially installed or corrupted packages.
    # conda clean --all
    
    ## reinstall samtools 1.6 --> NOT RELEVANT
    #mamba install samtools=1.6
    
  10. Run snakemake

    #Set values in samples-*.txt before running viral-ngs
    
    #conda list --export > environment2.yml
    #mamba create --name viral-ngs2 --file environment2.yml
    #samtools fastq HSV1_S1_sampled.bam > HSV1_S1_sampled.fastq
    snakemake --printshellcmds --cores 4
    
    /usr/local/bin/gatk
    https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    
    java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    java -jar ~/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    /usr/local/bin/gatk -T RealignerTargetCreator --help
    
    https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    Djava.io.tmpdir=/tmp/tmp-assembly-refine_assembly-2d9z3pcr
    java -jar ~/Tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    ~/Tools/GenomeAnalysisTK-4.1.2.0/gatk -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    
    # -- DEBUG_1: Configure the Conda Environment to Use Host's Java (version 17) while keeping BLAST 2.6.0+ --
    
    bin/taxon_filter.py deplete data/00_raw/HSV1_S1.bam tmp/01_cleaned/HSV1_S1.raw.bam tmp/01_cleaned/HSV1_S1.bmtagger_depleted.bam tmp/01_cleaned/HSV1_S1.rmdup.bam data/01_cleaned/HSV1_S1.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters --threads 120 --srprismMemory 142500000 --JVMmemory 256g --loglevel DEBUG
    
    #2024-11-06 15:55:01,162 - __init__:444:_attempt_install - DEBUG - Currently installed version of blast: 2.16.0-hc155240_2
    #2024-11-06 15:55:01,162 - __init__:448:_attempt_install - DEBUG - Expected version of blast:            2.6.0
    #2024-11-06 15:55:01,162 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
    
    #  + (blast 2.6.0 needs java 17, therefore java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java" in /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard) blast                             2.6.0  boost1.64_2          bioconda        Cached
    #  + (bmtagger 3.101 needs blast 2.6.0) blast=2.6.0 + bmtagger 3.101  h470a237_4           bioconda        Cached
    #  + pango                            1.50.7  hbd2fdc8_0           conda-forge     Cached
    #  + openjdk                         11.0.15  hc6918da_0           conda-forge     Cached
    #  + r-base                            4.2.0  h1ae530e_0           pkgs/r          Cached
    #  + picard                            3.0.0  hdfd78af_0           bioconda        Cached
    #  + java -version                     openjdk version "11.0.15-internal" 2022-04-19
    
    Then, edit in the following file so that it can use the host java (version 17) for the viral-ngs2 picard 3.0.0! --
    vim /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard
    
    # ---------------------------------------------------------
    # Use Java installed with Anaconda to ensure correct version
    java="$ENV_PREFIX/bin/java"
    
    # if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    if [ -n "${JAVA_HOME:=}" ]; then
    if [ -e "$JAVA_HOME/bin/java" ]; then
        java="$JAVA_HOME/bin/java"
    fi
    fi
    # -------------------------------------------------------->
    #COMMENTED
    # Use Java installed with Anaconda to ensure correct version
    #java="$ENV_PREFIX/bin/java"
    
    #MODIFIED
    ## if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    #if [ -n "${JAVA_HOME:=}" ]; then
    #  if [ -e "$JAVA_HOME/bin/java" ]; then
    #      java="$JAVA_HOME/bin/java"
    #  fi
    #fi
    java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
    # ---------------------------------------------------------
    
    # -- DEBUG_2: lastal version not compatible --
    bin/ncbi.py fetch_fastas j.huang@uke.de lastal_db NC_001806.2 --combinedFilePrefix lastal --removeSeparateFiles --forceOverwrite --chunkSize 300
    bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    mamba remove last
    mamba install -c bioconda last=876
    lastal -V
    bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    
    # -- DEBUG_3: lastal version not compatible --
    bin/assembly.py gapfill_gap2seq tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffolded.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
    
    #2024-11-07 12:34:14,732 - __init__:460:_attempt_install - DEBUG - Attempting install...
    #2024-11-07 12:34:14,733 - __init__:545:install_package - DEBUG - Creating conda environment and installing package gap2seq=2.1
    mamba install gap2seq=2.1
    
    # -- DEBUG_4 --
    bin/assembly.py impute_from_reference tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffold_ref.fasta tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta --newName HSV1_S1_sampled --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
    
    2024-11-07 14:05:20,438 - __init__:445:_attempt_install - DEBUG - Currently installed version of muscle: 5.2-h4ac6f70_0
    2024-11-07 14:05:20,438 - __init__:448:_attempt_install - DEBUG - Expected version of muscle:            3.8.1551
    2024-11-07 14:05:20,438 - __init__:449:_attempt_install - DEBUG - Incorrect version of muscle installed. Removing it...
    mamba install muscle=3.8
    #- muscle       5.2  h4ac6f70_0  bioconda     Cached
    #+ muscle  3.8.1551  h7d875b9_6  bioconda     Cached
    
    #/home/jhuang/Tools/novocraft_v3/novoalign -f data/01_per_sample/HSV1_S1.cleaned.bam -r Random -l 20 -g 40 -x 20 -t 100 -F BAM -d tmp/02_assembly/HSV1_S1.assembly4-refined.nix -o SAM
    
    # -- DEBUG_5 --
    bin/assembly.py refine_assembly tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV1_S1_sampled.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    #Shebang in /usr/local/bin/gatk is corrupt.
    
    # -- DEBUG_6 --
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1_sampled.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    2024-11-07 14:47:34,163 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.526-h4bc722e_0
    2024-11-07 14:47:34,163 - __init__:448:_attempt_install - DEBUG - Expected version of mafft:            7.221
    2024-11-07 14:47:34,164 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
    mamba install mafft=7.221
    
    # -- DEBUG_7 --
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1_sampled.mapped.bam data/02_assembly/HSV1_S1_sampled.fasta data/04_intrahost/vphaser2.HSV1_S1_sampled.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    
    export TMPDIR=/home/jhuang/tmp
    (viral-ngs) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing$ /home/jhuang/miniconda3/envs/viral-ngs/bin/vphaser2 -i /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam -o /home/jhuang/tmp/tmpyg8mlj5qvphaser2
    
    samtools depth /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam > coverage.txt
    
    # -- DEBUG_8 --
    snakemake --printshellcmds --cores 100
    bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    snakemake --printshellcmds --cores 100
    
    # -- DEBUG_9 --
    bin/assembly.py refine_assembly tmp/02_assembly/HSV-Klinik_S2.assembly3-modify.fasta data/01_per_sample/HSV-Klinik_S2.cleaned.bam tmp/02_assembly/HSV-Klinik_S2.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV-Klinik_S2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    /usr/local/bin/gatk -Xmx20g -Djava.io.tmpdir=/home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p -T RealignerTargetCreator -I /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpwbzvjo9j.rmdup.bam -R /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpxq4obe29.deambig.fasta -o /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmptkw8zcf3.intervals --num_threads 120
    
    mamba install gatk=3.6
    #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    #IMPORTANT_CHECK if it works
    #        java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    #        /usr/local/bin/gatk -T RealignerTargetCreator --help
    #IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
    
    # -- DEBUG_10 (if the sequencing is too shawlow, then seperate running) --
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam -o /tmp/tmp1x6jsiu_vphaser2
    [EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam
    # Run seperate intrahost.py --> no error:
    #342 reads
    
    2024-11-08 14:27:33,575 - intrahost:223:compute_library_bias - DEBUG - LB:standard has 161068 reads in 1 read group(s) (HSV-Klinik_S2)
    2024-11-08 14:27:34,875 - __init__:445:_attempt_install - DEBUG - Currently installed version of vphaser2: 2.0-h7a259b3_14
    
    samtools index HSV1_S1.mapped.bam
    samtools index HSV-Klinik_S2.mapped.bam
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz   --minReadsEach 1 --maxBias 2 --loglevel DEBUG   # --vphaserNumThreads 120 --removeDoublyMappedReads
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    
    samtools idxstats data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    samtools view -H data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/output_dir
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    
    samtools view -b data/02_align_to_self/HSV-Klinik_S2.mapped.bam "HSV-Klinik_S2-1" > subset.bam
    samtools index subset.bam
    @SQ     SN:HSV-Klinik_S2-1      LN:141125       AS:tmp35_s3ghx.ref_copy.nix
    samtools view -b subset.bam "HSV-Klinik_S2-1:1-10000" > small_subset.bam
    samtools index small_subset.bam
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i small_subset.bam -o /tmp/output_dir
    /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i subset.bam -o vphaser2_out
    
    # -- DEBUG_11 in step multi_align_mafft: aligned_1.fasta is always empty, we need generate it manually with mafft and mark it as complete --
    
    #[Fri Nov  8 14:51:45 2024]
    #rule multi_align_mafft:
    #    input: data/02_assembly/HSV1_S1.fasta, data/02_assembly/HSV-Klinik_S2.fasta, ref_genome/reference.fasta
    #    output: data/03_multialign_to_ref/sampleNameList.txt, data/03_multialign_to_ref/aligned_1.fasta, data/03_multialign_to_ref/aligned_2.fasta, ... data/03_multialign_to_ref/aligned_161.fasta
    #    jobid: 24
    #    resources: tmpdir=/tmp, mem=8, threads=120
    
    bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    #b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    #-------
    #2024-11-08 14:51:46,324 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, #Feb 28 2019, 09:07:38)
    #[GCC 7.3.0]
    #2024-11-08 15:00:26,375 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/HSV1_S1.fasta', 'data/02_assembly/HSV-Klinik_S2.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=120 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
    #2024-11-08 15:00:26,375 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-sw91_svl
    #2024-11-08 15:00:27,718 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.221-0
    #2024-11-08 15:00:27,719 - mafft:141:execute - DEBUG - /home/jhuang/miniconda3/envs/viral-ngs4/bin/mafft --thread 120 --localpair --preservecase --op 1.53 --ep 0.123 --quiet --maxiterate 1000 /tmp/tmp-interhost-multichr_mafft-sw91_svl/tmp68_ln_ha.fasta
    
    snakemake --cleanup-metadata 03_multialign_to_ref --cores 4
    
  11. (Optional)

    GapFiller.pl -l libraries_p2564.txt -s data/02_assembly/p2564.fasta
    #parainfluenza bwa /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R1.fastq.gz /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R2.fastq.gz 300 1.0 FR
    
    #since HSV1 and HSV-Klinik_S2 has different regions covered --> multialign_to_ref is none!
    bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    
    samtools flagstat HSV1_S1.taxfilt.bam
    45674 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 mapped (0.00% : N/A)
    45674 + 0 paired in sequencing
    22837 + 0 read1
    22837 + 0 read2
    0 + 0 properly paired (0.00% : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    342 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    342 + 0 mapped (100.00% : N/A)
    342 + 0 paired in sequencing
    163 + 0 read1
    179 + 0 read2
    258 + 0 properly paired (75.44% : N/A)
    264 + 0 with itself and mate mapped
    78 + 0 singletons (22.81% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.mapped.bam
    162156 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    162156 + 0 mapped (100.00% : N/A)
    162156 + 0 paired in sequencing
    81048 + 0 read1
    81108 + 0 read2
    161068 + 0 properly paired (99.33% : N/A)
    161630 + 0 with itself and mate mapped
    526 + 0 singletons (0.32% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/01_per_sample$ samtools flagstat HSV-Klinik_S2.taxfilt.bam
    800454 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 mapped (0.00% : N/A)
    800454 + 0 paired in sequencing
    400227 + 0 read1
    400227 + 0 read2
    0 + 0 properly paired (0.00% : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (0.00% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.bam
    885528 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    191932 + 0 duplicates
    354088 + 0 mapped (39.99% : N/A)
    885528 + 0 paired in sequencing
    442764 + 0 read1
    442764 + 0 read2
    323502 + 0 properly paired (36.53% : N/A)
    324284 + 0 with itself and mate mapped
    29804 + 0 singletons (3.37% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
  12. Summarize statistics from snakemake-output

    samples-runs.txt
    
    samtools flagstat data/02_align_to_self/838_S1.mapped.bam
    samtools flagstat data/02_align_to_self/840_S2.mapped.bam
    samtools flagstat data/02_align_to_self/820_S3.mapped.bam
    samtools flagstat data/02_align_to_self/828_S4.mapped.bam
    samtools flagstat data/02_align_to_self/815_S5.mapped.bam
    samtools flagstat data/02_align_to_self/834_S6.mapped.bam
    samtools flagstat data/02_align_to_self/808_S7.mapped.bam
    samtools flagstat data/02_align_to_self/811_S8.mapped.bam
    samtools flagstat data/02_align_to_self/837_S9.mapped.bam
    samtools flagstat data/02_align_to_self/768_S10.mapped.bam
    samtools flagstat data/02_align_to_self/773_S11.mapped.bam
    samtools flagstat data/02_align_to_self/767_S12.mapped.bam
    samtools flagstat data/02_align_to_self/810_S13.mapped.bam
    samtools flagstat data/02_align_to_self/814_S14.mapped.bam
    samtools flagstat data/02_align_to_self/10121-16_S15.mapped.bam     -->           3c
    Origin of hepatitis C virus genotype 3 in Africa as estimated
                through an evolutionary analysis of the full-length genomes of nine
                subtypes, including the newly sequenced 3d and 3e
    
    samtools flagstat data/02_align_to_self/7510-15_S16.mapped.bam      -->
    samtools flagstat data/02_align_to_self/828-17_S17.mapped.bam
    samtools flagstat data/02_align_to_self/8806-15_S18.mapped.bam
    samtools flagstat data/02_align_to_self/9881-16_S19.mapped.bam
    samtools flagstat data/02_align_to_self/8981-14_S20.mapped.bam
    
  13. Consensus sequences of each and of all isolates

    cp data/02_assembly/*.fasta ./
    for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
    for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do \
    mv ${sample}.fasta ${sample}.fa
    cat all.fa ${sample}.fa >> all.fa
    done
    
    cat RSV_dedup.fa all.fa > RSV_all.fa
    mafft --adjustdirection RSV_all.fa > RSV_all.aln
    snp-sites RSV_all.aln -o RSV_all_.aln
    
  14. Finding the next strain with Phylogenetics: send both HCV231_all.png and HCV231_all.pdf to the Nicole

    #1, generate tree
    cat SARS-CoV-2_len25000_w60_newheader.fa ~/rtpd_files/2029-AW_S5/idba_ud_assembly/gapped_contig.fa > CoV2_all.fa
    mafft --adjustdirection CoV2_all.fa > CoV2_all.aln
    snp-sites CoV2_all.aln -o CoV2_all_.aln
    fasttree -gtr -nt RSV_all_.aln > RSV_all.tree
    fasttree -gtr -nt Ortho_SNP_matrix_RAxML.fa > Ortho_SNP_matrix_RAxML.tree
    raxml-ng --all --model GTR+G+ASC_LEWIS --prefix CoV2_all_raxml.aln --threads 1 --msa CoV2_all_.aln --bs-trees 1000 --redo
    #raxml-ng --all --model GTR+G+ASC_LEWIS --prefix raxml-ng/snippy.core.aln --threads 1 --msa variants/snippy.core.aln --bs-trees 1000 --redo
    
    #2, open tree on Dendroscope, from phylogenetic tree, get genotype-refs as follows,
    1a: S10, S11, 814_S14(3-->1a?), S18 --> 1a_EF407457
    1b: S12 --> 1b_M58335
    2a: 815_S5(3-->2a?) --> 2a_D00944
    2c: S20 --> 2c_D50409
    3a: S3, S7, S8, S13, S15, S16, S19 --> 3c_KY620605
    4d: S1, S2, S9 --> 4d_EU392172
    4k: S4, S6 --> 4k_EU392173
    
    --> KX249682.1
    --> KX765935.1
    --> KM517573.1
    
    cd data/02_assembly/
    cat p2.fasta p3e.fasta p4e.fasta p5e.fasta > all.fasta
    sed -i -e 's/-1//g' all.fasta
    #sed -i -e 's/e-1//g' all.fasta
    mafft --adjustdirection --clustalout all.fasta > all.aln
    # MANUALLY CORRECTION!
    
    ##POLISH the assembled contigs
    #for sample in p953 p938 p942 p943 p944 p947 p948  p955 p954 p952 p951 p946 p945 p940; do
    #  rm ${sample}_polished.fa
    #  #seqtk sample ../../trimmed/${sample}_R1.fastq.gz 0.1 > ${sample}_0.1_R1.fastq
    #  #seqtk sample ../../trimmed/${sample}_R2.fastq.gz 0.1 > ${sample}_0.1_R2.fastq
    #  polish_viral_ref.sh -1 ../../trimmed/${sample}_R1.fastq.gz -2 ../../trimmed/${sample}_R2.fastq.gz -r ${sample}.fasta -o ${sample}_polished.fa -t 6
    #done
    
    for sample in p946 p954 p952 p948 p945 p947  p955 p943 p951 p942; do  #all.aln
    for sample in p944 p938 p953 p940; do  #all2.aln
    for sample in p2 p3 p4 p5; do
    grep "${sample}" all.aln > REF${sample}.fasta
    #cut -f2-2 -d$'\t' REF${sample}.fasta > REF${sample}.fast
    sed -i -e "s/${sample}              //g" REF${sample}.fasta
    sed -i -e "s/${sample}-1            //g" REF${sample}.fasta
    sed -i -e 's/-//g' REF${sample}.fasta
    echo ">REF${sample}" > REF${sample}.header
    cat REF${sample}.header REF${sample}.fasta > REF${sample}.fas
    seqkit seq -u REF${sample}.fas -o REF${sample}.fa
    cp REF${sample}.fa ${sample}.fa
    mv REF${sample}.fa ../..
    sed -i -e "s/REF//g" ${sample}.fa    #still under data/02_assembly/
    done
    
    #ReferenceSeeker determines closely related reference genomes
    #https://github.com/oschwengers/referenceseeker
    (referenceseeker) jhuang@hamburg:~/DATA/Data_Holger_Efaecium$ ~/Tools/referenceseeker/bin/referenceseeker -v ~/REFs/bacteria-refseq/ shovill/noAB_wildtype/contigs.fasta
    
    # Annotating the fasta using VAPiD
    makeblastdb -in *.fasta -dbtype nucl
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta p946R.fa ~/REFs/template_Less.sbt
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta REFp944.fa ~/REFs/template_Less.sbt   # KT581445.1 selected!
    python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta contigs_final.fasta ~/REFs/template_Amir.sbt
    python ~/Tools/VAPiD/vapid3.py --online contigs_final.fasta ~/REFs/template_Amir.sbt
    
  15. All packages under the viral-ngs4 env, note that novoalign is not installed. The used Novoalign path: /home/jhuang/Tools/novocraft_v3/novoalign; the used gatk: /usr/local/bin/gatk using /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar (see the point 9).

    jhuang@WS-2290C:~$ conda activate viral-ngs4
    (viral-ngs4) jhuang@WS-2290C:~$ conda list
    # packages in environment at /home/jhuang/miniconda3/envs/viral-ngs4:
    #
    # Name                    Version                   Build  Channel
    _libgcc_mutex             0.1                 conda_forge    conda-forge
    _openmp_mutex             4.5                       2_gnu    conda-forge
    _r-mutex                  1.0.1               anacondar_1    conda-forge
    alsa-lib                  1.2.3.2              h166bdaf_0    conda-forge
    bamtools                  2.5.2                hdcf5f25_5    bioconda
    bedtools                  2.31.1               hf5e1c6e_2    bioconda
    binutils_impl_linux-64    2.43                 h4bf12b8_2    conda-forge
    binutils_linux-64         2.43                 h4852527_2    conda-forge
    biopython                 1.79             py36h8f6f2f9_0    conda-forge
    blast                     2.6.0               boost1.64_2    bioconda
    bmfilter                  3.101                h4ac6f70_5    bioconda
    bmtagger                  3.101                h470a237_4    bioconda
    bmtool                    3.101                hdbdd923_5    bioconda
    boost                     1.64.0                   py36_4    conda-forge
    boost-cpp                 1.64.0                        1    conda-forge
    bowtie                    1.3.1            py36h769816f_3    bioconda
    bowtie2                   2.5.4                h7071971_4    bioconda
    bwa                       0.7.18               he4a0461_1    bioconda
    bwidget                   1.9.14               ha770c72_1    conda-forge
    bzip2                     1.0.8                h4bc722e_7    conda-forge
    c-ares                    1.34.2               heb4867d_0    conda-forge
    ca-certificates           2024.9.24            h06a4308_0
    cairo                     1.16.0            h18b612c_1001    conda-forge
    cd-hit                    4.8.1               h43eeafb_10    bioconda
    cd-hit-auxtools           4.8.1                h4ac6f70_3    bioconda
    certifi                   2021.5.30        py36h5fab9bb_0    conda-forge
    curl                      7.68.0               hf8cf82a_0    conda-forge
    cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
    dbus                      1.13.6               hfdff14a_1    conda-forge
    diamond                   2.1.10               h43eeafb_2    bioconda
    expat                     2.6.4                h5888daf_0    conda-forge
    extract_fullseq           3.101                h4ac6f70_5    bioconda
    fastqc                    0.12.1               hdfd78af_0    bioconda
    font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
    fontconfig                2.14.1               hef1e5e3_0
    freetype                  2.12.1               h267a509_2    conda-forge
    fribidi                   1.0.10               h36c2ea0_0    conda-forge
    future                    0.18.2           py36h5fab9bb_3    conda-forge
    gap2seq                   2.1                 boost1.64_1    bioconda
    gatk                      3.6                 hdfd78af_11    bioconda
    gcc_impl_linux-64         14.2.0               h6b349bd_1    conda-forge
    gcc_linux-64              14.2.0               h5910c8f_5    conda-forge
    gettext                   0.22.5               he02047a_3    conda-forge
    gettext-tools             0.22.5               he02047a_3    conda-forge
    gfortran_impl_linux-64    14.2.0               hc73f493_1    conda-forge
    gfortran_linux-64         14.2.0               hda50785_5    conda-forge
    giflib                    5.2.2                hd590300_0    conda-forge
    glib                      2.66.3               h58526e2_0    conda-forge
    graphite2                 1.3.13            h59595ed_1003    conda-forge
    gsl                       2.4               h294904e_1006    conda-forge
    gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
    gstreamer                 1.14.5               h36ae1b5_2    conda-forge
    gxx_impl_linux-64         14.2.0               h2c03514_1    conda-forge
    gxx_linux-64              14.2.0               h9423afd_5    conda-forge
    harfbuzz                  2.4.0                h37c48d4_1    conda-forge
    icu                       58.2              hf484d3e_1000    conda-forge
    jpeg                      9e                   h0b41bf4_3    conda-forge
    kernel-headers_linux-64   3.10.0              he073ed8_18    conda-forge
    keyutils                  1.6.1                h166bdaf_0    conda-forge
    kiwisolver                1.3.1            py36h605e78d_1    conda-forge
    kmer-jellyfish            2.3.1                h4ac6f70_2    bioconda
    krb5                      1.16.4               h2fd8d38_0    conda-forge
    last                      876                      py36_0    bioconda
    lcms2                     2.12                 hddcbb42_0    conda-forge
    ld_impl_linux-64          2.43                 h712a8e2_2    conda-forge
    libasprintf               0.22.5               he8f35ee_3    conda-forge
    libasprintf-devel         0.22.5               he8f35ee_3    conda-forge
    libblas                   3.9.0           25_linux64_openblas    conda-forge
    libcblas                  3.9.0           25_linux64_openblas    conda-forge
    libcurl                   7.68.0               hda55be3_0    conda-forge
    libdeflate                1.21                 h4bc722e_0    conda-forge
    libedit                   3.1.20191231         he28a2e2_2    conda-forge
    libev                     4.33                 hd590300_2    conda-forge
    libexpat                  2.6.4                h5888daf_0    conda-forge
    libffi                    3.2.1             he1b5a44_1007    conda-forge
    libgcc                    14.2.0               h77fa898_1    conda-forge
    libgcc-devel_linux-64     14.2.0             h41c2201_101    conda-forge
    libgcc-ng                 14.2.0               h69a702a_1    conda-forge
    libgettextpo              0.22.5               he02047a_3    conda-forge
    libgettextpo-devel        0.22.5               he02047a_3    conda-forge
    libgfortran-ng            7.5.0               h14aa051_20    conda-forge
    libgfortran4              7.5.0               h14aa051_20    conda-forge
    libgfortran5              14.2.0               hd5240d6_1    conda-forge
    libglib                   2.66.3               hbe7bbb4_0    conda-forge
    libgomp                   14.2.0               h77fa898_1    conda-forge
    libiconv                  1.17                 hd590300_2    conda-forge
    libidn11                  1.33                 h7b6447c_0
    liblapack                 3.9.0           25_linux64_openblas    conda-forge
    libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
    libnsl                    2.0.1                hd590300_0    conda-forge
    libopenblas               0.3.28          pthreads_h94d23a6_0    conda-forge
    libpng                    1.6.43               h2797004_0    conda-forge
    libsanitizer              14.2.0               h2a3dede_1    conda-forge
    libsqlite                 3.46.0               hde9e2c9_0    conda-forge
    libssh2                   1.10.0               haa6b8db_3    conda-forge
    libstdcxx                 14.2.0               hc0a3c3a_1    conda-forge
    libstdcxx-devel_linux-64  14.2.0             h41c2201_101    conda-forge
    libstdcxx-ng              14.2.0               h4852527_1    conda-forge
    libtiff                   4.2.0                hf544144_3    conda-forge
    libuuid                   1.0.3                h7f8727e_2
    libwebp-base              1.4.0                hd590300_0    conda-forge
    libxcb                    1.17.0               h8a09558_0    conda-forge
    libxcrypt                 4.4.36               hd590300_1    conda-forge
    libxml2                   2.9.14               h74e7548_0
    libzlib                   1.2.13               h4ab18f5_6    conda-forge
    llvm-openmp               8.0.1                hc9558a2_0    conda-forge
    mafft                     7.221                         0    bioconda
    make                      4.4.1                hb9d3cd8_2    conda-forge
    matplotlib                3.3.4            py36h5fab9bb_0    conda-forge
    matplotlib-base           3.3.4            py36hd391965_0    conda-forge
    mummer4                   4.0.0rc1        pl5321hdbdd923_7    bioconda
    muscle                    3.8.1551             h7d875b9_6    bioconda
    mvicuna                   1.0                 h4ac6f70_10    bioconda
    ncurses                   6.5                  he02047a_1    conda-forge
    numpy                     1.19.5           py36hfc0c790_2    conda-forge
    olefile                   0.46               pyh9f0ad1d_1    conda-forge
    openjdk                   8.0.412              hd590300_1    conda-forge
    openjpeg                  2.4.0                hb52868f_1    conda-forge
    openmp                    8.0.1                         0    conda-forge
    openssl                   1.1.1w               hd590300_0    conda-forge
    pandas                    1.1.5            py36h284efc9_0    conda-forge
    pango                     1.42.4               h7062337_4    conda-forge
    parallel                  20240922             ha770c72_0    conda-forge
    pcre                      8.45                 h9c3ff4c_0    conda-forge
    perl                      5.32.1          7_hd590300_perl5    conda-forge
    picard                    3.0.0                hdfd78af_0    bioconda
    pigz                      2.6                  h27cfd23_0
    pillow                    8.2.0            py36ha6010c0_1    conda-forge
    pip                       21.3.1             pyhd8ed1ab_0    conda-forge
    pixman                    0.38.0            h516909a_1003    conda-forge
    prinseq                   0.20.4               hdfd78af_5    bioconda
    pthread-stubs             0.4               hb9d3cd8_1002    conda-forge
    pybedtools                0.9.0            py36h7281c5b_1    bioconda
    pyparsing                 3.1.4              pyhd8ed1ab_0    conda-forge
    pyqt                      5.9.2            py36hcca6a23_4    conda-forge
    pysam                     0.16.0           py36h873a209_0    bioconda
    python                    3.6.7             h381d211_1004    conda-forge
    python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
    python_abi                3.6                     2_cp36m    conda-forge
    pytz                      2023.3.post1       pyhd8ed1ab_0    conda-forge
    pyyaml                    5.4.1            py36h8f6f2f9_1    conda-forge
    qt                        5.9.7                h52cfd70_2    conda-forge
    r-assertthat              0.2.1             r36h6115d3f_2    conda-forge
    r-backports               1.2.1             r36hcfec24a_0    conda-forge
    r-base                    3.6.1                h9bb98a2_1
    r-bitops                  1.0_7             r36hcfec24a_0    conda-forge
    r-brio                    1.1.2             r36hcfec24a_0    conda-forge
    r-callr                   3.7.0             r36hc72bb7e_0    conda-forge
    r-catools                 1.18.2            r36h03ef668_0    conda-forge
    r-cli                     2.5.0             r36hc72bb7e_0    conda-forge
    r-colorspace              2.0_1             r36hcfec24a_0    conda-forge
    r-crayon                  1.4.1             r36hc72bb7e_0    conda-forge
    r-desc                    1.3.0             r36hc72bb7e_0    conda-forge
    r-diffobj                 0.3.4             r36hcfec24a_0    conda-forge
    r-digest                  0.6.27            r36h03ef668_0    conda-forge
    r-ellipsis                0.3.2             r36hcfec24a_0    conda-forge
    r-evaluate                0.14              r36h6115d3f_2    conda-forge
    r-fansi                   0.4.2             r36hcfec24a_0    conda-forge
    r-farver                  2.1.0             r36h03ef668_0    conda-forge
    r-ggplot2                 3.3.3             r36hc72bb7e_0    conda-forge
    r-glue                    1.4.2             r36hcfec24a_0    conda-forge
    r-gplots                  3.1.1             r36hc72bb7e_0    conda-forge
    r-gsalib                  2.1                    r36_1002    conda-forge
    r-gtable                  0.3.0             r36h6115d3f_3    conda-forge
    r-gtools                  3.8.2             r36hcdcec82_1    conda-forge
    r-isoband                 0.2.4             r36h03ef668_0    conda-forge
    r-jsonlite                1.7.2             r36hcfec24a_0    conda-forge
    r-kernsmooth              2.23_20           r36h742201e_0    conda-forge
    r-labeling                0.4.2             r36h142f84f_0    conda-forge
    r-lattice                 0.20_44           r36hcfec24a_0    conda-forge
    r-lifecycle               1.0.0             r36hc72bb7e_0    conda-forge
    r-magrittr                2.0.1             r36hcfec24a_1    conda-forge
    r-mass                    7.3_54            r36hcfec24a_0    conda-forge
    r-matrix                  1.3_3             r36he454529_0    conda-forge
    r-mgcv                    1.8_35            r36he454529_0    conda-forge
    r-munsell                 0.5.0           r36h6115d3f_1003    conda-forge
    r-nlme                    3.1_152           r36h859d828_0    conda-forge
    r-pillar                  1.6.1             r36hc72bb7e_0    conda-forge
    r-pkgconfig               2.0.3             r36h6115d3f_1    conda-forge
    r-pkgload                 1.2.1             r36h03ef668_0    conda-forge
    r-plyr                    1.8.6             r36h0357c0b_1    conda-forge
    r-praise                  1.0.0           r36h6115d3f_1004    conda-forge
    r-processx                3.5.2             r36hcfec24a_0    conda-forge
    r-ps                      1.6.0             r36hcfec24a_0    conda-forge
    r-r6                      2.5.0             r36hc72bb7e_0    conda-forge
    r-rcolorbrewer            1.1_2           r36h6115d3f_1003    conda-forge
    r-rcpp                    1.0.6             r36h03ef668_0    conda-forge
    r-rematch2                2.1.2             r36h6115d3f_1    conda-forge
    r-reshape                 0.8.8             r36hcdcec82_2    conda-forge
    r-rlang                   0.4.11            r36hcfec24a_0    conda-forge
    r-rprojroot               2.0.2             r36hc72bb7e_0    conda-forge
    r-rstudioapi              0.13              r36hc72bb7e_0    conda-forge
    r-scales                  1.1.1             r36h6115d3f_0    conda-forge
    r-testthat                3.0.2             r36h03ef668_0    conda-forge
    r-tibble                  3.1.2             r36hcfec24a_0    conda-forge
    r-utf8                    1.2.1             r36hcfec24a_0    conda-forge
    r-vctrs                   0.3.8             r36hcfec24a_1    conda-forge
    r-viridislite             0.4.0             r36hc72bb7e_0    conda-forge
    r-waldo                   0.2.5             r36hc72bb7e_0    conda-forge
    r-withr                   2.4.2             r36hc72bb7e_0    conda-forge
    readline                  7.0               hf8c457e_1001    conda-forge
    salmon                    0.14.2               ha0cc327_0    bioconda
    samtools                  1.6                  h244ad75_5    bioconda
    setuptools                58.0.4           py36h5fab9bb_2    conda-forge
    sip                       4.19.8          py36hf484d3e_1000    conda-forge
    six                       1.16.0             pyh6c4a22f_0    conda-forge
    snpeff                    4.1l                 hdfd78af_8    bioconda
    spades                    3.15.5               h95f258a_1    bioconda
    sqlite                    3.28.0               h8b20d00_0    conda-forge
    srprism                   2.4.24               h6a68c12_5    bioconda
    sysroot_linux-64          2.17                h4a8ded7_18    conda-forge
    tbb                       2020.3               hfd86e86_0
    tbl2asn                   25.7                 h9ee0642_1    bioconda
    tk                        8.6.13          noxft_h4845f30_101    conda-forge
    tktable                   2.10                 h8bc8fbc_6    conda-forge
    tornado                   6.1              py36h8f6f2f9_1    conda-forge
    trimmomatic               0.39                 hdfd78af_2    bioconda
    trinity                   2.8.5                h8b12597_5    bioconda
    tzdata                    2024b                hc8b5060_0    conda-forge
    unzip                     6.0                  h611a1e1_0
    vphaser2                  2.0                 h7a259b3_14    bioconda
    wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
    xorg-libice               1.0.10               h7f98852_0    conda-forge
    xorg-libsm                1.2.2                h470a237_5    conda-forge
    xorg-libx11               1.8.10               h4f16b4b_0    conda-forge
    xorg-libxau               1.0.11               hb9d3cd8_1    conda-forge
    xorg-libxdmcp             1.1.5                hb9d3cd8_0    conda-forge
    xorg-libxext              1.3.6                hb9d3cd8_0    conda-forge
    xorg-libxfixes            6.0.1                hb9d3cd8_0    conda-forge
    xorg-libxi                1.8.2                hb9d3cd8_0    conda-forge
    xorg-libxrender           0.9.11               hb9d3cd8_1    conda-forge
    xorg-libxtst              1.2.5                hb9d3cd8_3    conda-forge
    xorg-xorgproto            2024.1               hb9d3cd8_1    conda-forge
    xz                        5.2.6                h166bdaf_0    conda-forge
    yaml                      0.2.5                h7f98852_2    conda-forge
    zlib                      1.2.13               h4ab18f5_6    conda-forge
    zstd                      1.5.6                ha6fb4c9_0    conda-forge
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum