Enriching Monkeypox virus using xGen™ Monkeypox Virus Amplicon Panel

Tags: processing, pipeline

https://www.idtdna.com/pages/products/next-generation-sequencing/workflow/xgen-ngs-amplicon-sequencing/predesigned-amplicon-panels/xgen-monkeypox-virus-amplicon-panel

Monkeypox has been declared a global health emergency. For researchers interested in tracking the evolution of the monkeypox virus genome, obtaining sufficient coverage can be difficult. The xGen Monkeypox Virus Amplicon Panel provides a single-tube, two-step PCR amplification workflow with primer sets designed to create amplicons across the entire genome.*

If coverage is too low, you may need to sequence more reads to ensure uniform coverage. For many assembly algorithms, 30x coverage or higher is recommended.
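
As a quick sanity check against that threshold, the mean depth can be computed from a sorted BAM file with samtools and awk, as done in step 4 below; a minimal sketch (the BAM file name is a placeholder):

    # Flag samples whose mean depth falls below the ~30x rule of thumb
    samtools depth -a SAMPLE_reads_aligned_sorted.bam \
      | awk '{sum+=$3; n++} END {m=(n?sum/n:0); printf "Average coverage = %.1f%s\n", m, (m<30 ? " (below 30x)" : "")}'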

Columns of the BLAST annotation tables (see step 5): Query Seq-id, Start of alignment in query, End of alignment in query, Start of alignment in subject, End of alignment in subject, Expect value, Alignment length, Percentage of identical matches, Subject Seq-id, Subject Title, Query Coverage Per Subject, Query Coverage Per HSP, Subject accession, Subject sequence length, Query sequence length.
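
These column names correspond to standard BLAST+ tabular-output specifiers; a minimal sketch of how such an annotation table could be generated (the query file and database are placeholders, and a pre-built BLAST database is assumed):

    blastn -query contigs.fasta -db nt \
      -outfmt '6 qseqid qstart qend sstart send evalue length pident sseqid stitle qcovs qcovhsp sacc slen qlen' \
      -out contigs_annotation.tsv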

  1. Using the bacto pipeline to obtain trimmed reads

    cp /home/jhuang/Tools/bacto/bacto-0.1.json .
    cp /home/jhuang/Tools/bacto/cluster.json .
    cp /home/jhuang/Tools/bacto/Snakefile .
    ln -s /home/jhuang/Tools/bacto/local .
    ln -s /home/jhuang/Tools/bacto/db .
    ln -s /home/jhuang/Tools/bacto/envs .
    
    mkdir raw_data; cd raw_data
    ln -s ../raw_data_downloads/Affe30_S1_L001_R1_001.fastq.gz Affe30_R1.fastq.gz
    ln -s ../raw_data_downloads/Affe30_S1_L001_R2_001.fastq.gz Affe30_R2.fastq.gz
    ln -s ../raw_data_downloads/Affe31_S2_L001_R1_001.fastq.gz Affe31_R1.fastq.gz
    ln -s ../raw_data_downloads/Affe31_S2_L001_R2_001.fastq.gz Affe31_R2.fastq.gz
    
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    
  2. Filtering low-complexity reads

    fastp -i Affe30_trimmed_P_1.fastq -I Affe30_trimmed_P_2.fastq -o Affe30_trimmed_R1.fastq -O Affe30_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    fastp -i Affe31_trimmed_P_1.fastq -I Affe31_trimmed_P_2.fastq -o Affe31_trimmed_R1.fastq -O Affe31_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    
  3. Using vrap to assemble and annotate the contigs; the SPAdes step was replaced with the idba_ud assembly from DAMIAN

    #--host /home/jhuang/REFs/genome.fa -n /mnt/nvme0n1p1/blast/nt -a /mnt/nvme0n1p1/blast/nr
    vrap/vrap.py  -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out   -t 40 -l 100
    vrap/vrap.py  -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out   -t 40 -l 100
    
    #--> The SPAdes assembly step failed, so we used the idba_ud assembly from DAMIAN instead and copied it into the vrap spades output as contigs.fasta. It works!
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Susanne_MPox/Affe31_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Susanne_MPox/Affe31_trimmed_R2.fastq.gz --sample Affe31_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Susanne_MPox/Affe30_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Susanne_MPox/Affe30_trimmed_R2.fastq.gz --sample Affe30_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Susanne_MPox/Affe31_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Susanne_MPox/Affe31_trimmed_R2.fastq.gz --sample Affe31_blastn --blastn progressive --blastp never --min_contiglength 100 --threads 56 --force
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Susanne_MPox/Affe30_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Susanne_MPox/Affe30_trimmed_R2.fastq.gz --sample Affe30_blastn --blastn progressive --blastp never --min_contiglength 100 --threads 56 --force
    cp ~/rtpd_files/Affe30_megablast/idba_ud_assembly/contig.fa contigs.fasta
    cp ~/rtpd_files/Affe31_megablast/idba_ud_assembly/contig.fa contigs.fasta
    
    vrap/vrap.py  -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out   -t 40 -l 100
    vrap/vrap.py  -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out   -t 40 -l 100
    
  4. Step-by-step process for read and contig mapping profile plots

    a. Mapping the contigs to the reference JX878414
    
            bowtie2-build JX878414.1.fasta JX878414.1_index
    
            bowtie2 -f -x JX878414.1_index -U Affe30_trimmed_vrap_out/spades/contigs.fasta -S Affe30_contigs_aligned.sam
            samtools view -S -b Affe30_contigs_aligned.sam > Affe30_contigs_aligned.bam
            samtools sort Affe30_contigs_aligned.bam -o Affe30_contigs_aligned_sorted.bam
            samtools index Affe30_contigs_aligned_sorted.bam
    
            bowtie2 -f -x JX878414.1_index -U Affe31_trimmed_vrap_out/spades/contigs.fasta -S Affe31_contigs_aligned.sam
            samtools view -S -b Affe31_contigs_aligned.sam > Affe31_contigs_aligned.bam
            samtools sort Affe31_contigs_aligned.bam -o Affe31_contigs_aligned_sorted.bam
            samtools index Affe31_contigs_aligned_sorted.bam
    
            # Affe30 contigs:
            62 reads; of these:
            62 (100.00%) were unpaired; of these:
                1 (1.61%) aligned 0 times
                61 (98.39%) aligned exactly 1 time
                0 (0.00%) aligned >1 times
            98.39% overall alignment rate
            # Affe31 contigs:
            17 reads; of these:
            17 (100.00%) were unpaired; of these:
                1 (5.88%) aligned 0 times
                16 (94.12%) aligned exactly 1 time
                0 (0.00%) aligned >1 times
            94.12% overall alignment rate
    
    b. Candidate methods for mapping the contigs to a reference (a post-processing sketch for the minimap2 output follows this list)
    
            #1. Minimap2 – Efficient for aligning large contigs or entire genomes.
            minimap2 -ax asm5 JX878414.1.fasta contigs_Affe30.fasta > output.sam

            #2. BLAST – More sensitive but slower; good for divergent sequences (the reference must first be formatted as a BLAST database).
            makeblastdb -in JX878414.1.fasta -dbtype nucl
            blastn -query contigs_Affe30.fasta -db JX878414.1.fasta -out output.blast -outfmt 6

            #3. LAST – Suitable for more distant relationships and large genomic changes.
            lastdb my_reference_db JX878414.1.fasta
            lastal my_reference_db contigs_Affe30.fasta > output.maf

            #4. MUMmer (nucmer) – Best for comparing large contigs or entire genomes, especially with structural variations.
            nucmer -p output JX878414.1.fasta contigs_Affe30.fasta
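
        A minimal sketch for post-processing the minimap2 output above into a sorted, indexed BAM (mirroring the samtools steps used for the bowtie2 alignments; the output BAM name is illustrative):

            samtools view -Sb output.sam | samtools sort -o Affe30_contigs_minimap2_sorted.bam
            samtools index Affe30_contigs_minimap2_sorted.bam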
    
    c. Index the Reference Genome (JX878414.1.fasta): You'll need to index your reference genome to align the contigs and reads against it.
    
            samtools faidx JX878414.1.fasta
    
    d. Generate Coverage Profile for Reads (from Fastq): Align the trimmed fastq reads (Affe31_trimmed_R1.fastq.gz and Affe31_trimmed_R2.fastq.gz) to the reference genome using a mapper like BWA or Bowtie2.
    
            bwa index JX878414.1.fasta
            bwa mem JX878414.1.fasta Affe31_trimmed_R1.fastq.gz Affe31_trimmed_R2.fastq.gz > Affe31_reads_aligned.sam
    
        Convert the SAM file to BAM and sort it:
    
            samtools view -Sb Affe31_reads_aligned.sam | samtools sort -o Affe31_reads_aligned_sorted.bam
            samtools index Affe31_reads_aligned_sorted.bam
            samtools flagstat Affe31_reads_aligned_sorted.bam
            #6743 + 0 in total (QC-passed reads + QC-failed reads)
            #0 + 0 secondary
            #9 + 0 supplementary
            #0 + 0 duplicates
            #5312 + 0 mapped (78.78% : N/A)
            #6734 + 0 paired in sequencing
            #3367 + 0 read1
            #3367 + 0 read2
            #4822 + 0 properly paired (71.61% : N/A)
            #4844 + 0 with itself and mate mapped
            #459 + 0 singletons (6.82% : N/A)
            #0 + 0 with mate mapped to a different chr
            #0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    e. Generate Coverage Profile for Contigs: The file Affe31_contigs_aligned_sorted.bam (from step a above) is already aligned against the reference genome. You can use it directly to visualize contig coverage.
    
        Ensure the BAM file is indexed:
    
            samtools index Affe31_contigs_aligned_sorted.bam
    
    f. Generate Coverage Tracks: Use bamCoverage (from deepTools) to generate coverage files (in bigWig format) for both the reads and the contigs.
    
        For reads coverage:
    
            bamCoverage -b Affe31_reads_aligned_sorted.bam -o Affe31_reads_coverage.bw
    
        For contigs coverage:
    
            bamCoverage -b Affe31_contigs_aligned_sorted.bam -o Affe31_contigs_coverage.bw
    
    g. Generate Coverage Profile for Reads (from Fastq): Align the trimmed fastq reads (Affe30_trimmed_R1.fastq.gz and Affe30_trimmed_R2.fastq.gz) to the reference genome using a mapper like BWA or Bowtie2.
    
            bwa index JX878414.1.fasta
            bwa mem JX878414.1.fasta Affe30_trimmed_R1.fastq.gz Affe30_trimmed_R2.fastq.gz > Affe30_reads_aligned.sam
    
        Convert the SAM file to BAM and sort it:
    
            samtools view -Sb Affe30_reads_aligned.sam | samtools sort -o Affe30_reads_aligned_sorted.bam
            samtools index Affe30_reads_aligned_sorted.bam
            samtools flagstat Affe30_reads_aligned_sorted.bam
            #20556 + 0 in total (QC-passed reads + QC-failed reads)
            #0 + 0 secondary
            #42 + 0 supplementary
            #0 + 0 duplicates
            #13645 + 0 mapped (66.38% : N/A)
            #20514 + 0 paired in sequencing
            #10257 + 0 read1
            #10257 + 0 read2
            #12582 + 0 properly paired (61.33% : N/A)
            #12648 + 0 with itself and mate mapped
            #955 + 0 singletons (4.66% : N/A)
            #0 + 0 with mate mapped to a different chr
            #0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    h. Generate Coverage Profile for Contigs: The file Affe30_contigs_aligned_sorted.bam (from step a above) is already aligned against the reference genome. You can use it directly to visualize contig coverage.
    
        Ensure the BAM file is indexed:
    
            samtools index Affe30_contigs_aligned_sorted.bam
    
    i. Generate Coverage Tracks: Use bamCoverage (from deepTools) to generate coverage files (in bigWig format) for both the reads and the contigs.
    
        For reads coverage:
    
            bamCoverage -b Affe30_reads_aligned_sorted.bam -o Affe30_reads_coverage.bw
    
        For contigs coverage:
    
            bamCoverage -b Affe30_contigs_aligned_sorted.bam -o Affe30_contigs_coverage.bw
    
    j. Calculate average coverage
    
            samtools depth -a Affe30_reads_aligned_sorted.bam > coverage.txt
            awk '{sum+=$3} END {print "Average coverage = ",sum/NR}' coverage.txt
            #Average coverage =  26.6073
            samtools depth -a Affe31_reads_aligned_sorted.bam > coverage.txt
            awk '{sum+=$3} END {print "Average coverage = ",sum/NR}' coverage.txt
            #Average coverage =  22.0228
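
        For a per-reference summary (covered bases, percent coverage, and mean depth) in a single step, samtools coverage can be used instead; a minimal sketch, assuming samtools >= 1.10:

            samtools coverage Affe30_reads_aligned_sorted.bam
            samtools coverage Affe31_reads_aligned_sorted.bam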
    
    k. Visualize Alignments: Use tools like IGV (Integrative Genomics Viewer) or the following Python script to visualize the alignments (an IGV batch-script sketch follows the Python script).
    
            import matplotlib.pyplot as plt
            import pandas as pd
            import pysam

            # File paths
            contig_bam_file = 'Affe30_contigs_aligned_sorted.bam'
            reads_bam_file = 'Affe30_reads_aligned_sorted.bam'

            # Function to calculate per-position coverage from a BAM file
            def calculate_coverage(bam_file):
                positions = []
                coverage = []

                # Iterate over the pileup; positions with zero coverage are not reported
                with pysam.AlignmentFile(bam_file, "rb") as samfile:
                    for pileupcolumn in samfile.pileup():
                        positions.append(pileupcolumn.reference_pos)  # genomic position on the reference
                        coverage.append(pileupcolumn.nsegments)       # number of reads covering this position

                return positions, coverage

            # Calculate read coverage
            read_positions, read_coverage = calculate_coverage(reads_bam_file)
            read_data = pd.DataFrame({'position': read_positions, 'coverage': read_coverage})

            # Plotting Read Coverage
            plt.figure(figsize=(12, 6))
            plt.plot(read_data['position'], read_data['coverage'], color='blue')
            plt.title("Read Coverage Profile")
            plt.xlabel("Genomic Position")
            plt.ylabel("Coverage")
            plt.grid()
            plt.savefig("reads_coverage_profile.png")  # Save the plot to a file
            plt.close()

            # Calculate contig coverage
            contig_positions, contig_coverage = calculate_coverage(contig_bam_file)
            contig_data = pd.DataFrame({'position': contig_positions, 'coverage': contig_coverage})

            # Plotting Contig Coverage
            plt.figure(figsize=(12, 6))
            plt.plot(contig_data['position'], contig_data['coverage'], color='green')
            plt.title("Contigs Coverage Profile")
            plt.xlabel("Genomic Position")
            plt.ylabel("Coverage")
            plt.grid()
            plt.savefig("contigs_coverage_profile.png")  # Save the plot to a file
            plt.close()

            print("Coverage plots saved successfully.")
    
  5. Reporting

    cp ./Affe30_trimmed_vrap_out/spades/contigs.fasta Affe30_contigs.fasta
    cp ./Affe31_trimmed_vrap_out/spades/contigs.fasta Affe31_contigs.fasta
    cp ./Affe30_trimmed_vrap_out/blast/blastn.xlsx Affe30_contigs_annotation.xlsx
    cp ./Affe31_trimmed_vrap_out/blast/blastn.xlsx Affe31_contigs_annotation.xlsx
    
    1. The contigs from the de novo assembly are incomplete and far from covering the full genome (see attached: Affe30_contigs.fasta, Affe31_contigs.fasta, Affe30_contigs_annotation.xlsx, Affe31_contigs_annotation.xlsx).
    
    2. I've generated a plot showing both the reads and assembled contigs mapped to the closely related reference genome JX878414 (see attached: reads_and_contigs_on_JX878414.png).
    
    3. One major issue I've noted is the low sequencing depth. After processing (trimming, adapter removal, and filtering out low-complexity reads), the remaining read count is very low. In comparison, the benchmark data from IDT shows they generated approximately 2-2.7 million paired reads per sample for a similar study. Our read count after quality control is significantly lower: Affe30 has only 10,257 paired-end reads, and Affe31 has 3,367 paired-end reads.
    
    Here's a summary of the read counts:
    
    Summary of Read Numbers
    
    * Raw read count:
    - Affe30: 88,183 x 2 paired-end reads
    - Affe31: 70,896 x 2 paired-end reads
    
    * After trimming (adapter sequences, bases with quality < Q30, and reads < 36 nt):
    - Affe30: 18,789 x 2 paired-end reads
    - Affe31: 11,282 x 2 paired-end reads
    
    * After removing low-complexity reads (e.g., GGGATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT):
    - Affe30: 10,257 x 2 paired-end reads
    - Affe31: 3,367 x 2 paired-end reads
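
    The per-stage read counts above can be reproduced with a read-statistics tool; a minimal sketch using seqkit (file names follow steps 1-2; seqkit is not part of the pipeline above):

    seqkit stats raw_data/Affe30_R1.fastq.gz Affe30_trimmed_P_1.fastq Affe30_trimmed_R1.fastq
    seqkit stats raw_data/Affe31_R1.fastq.gz Affe31_trimmed_P_1.fastq Affe31_trimmed_R1.fastq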
    
