Deciphering S. aureus with metatranscriptomics (Marc's project)

gene_x 0 like s 15 view s

Tags: human, bacterium, pipeline, RNA-seq

RNAseq analaysis using nextflow set reference as CP000255

  1. prepare reference

    mkdir raw_data; cd raw_data
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
    ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz
    
  2. Downloading CP000255.fasta and CP000255.gff from GenBank

    mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
    mv ~/Downloads/sequence\(2\).gff3 CP000255.gff
    
    MRSA: USA300
    EMRSA (CC22): CC22 strain
    MSSA: Newmann strain
    
    #https://journals.asm.org/doi/10.1128/jb.01000-07
    #Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
    #https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430
    
    #Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
    https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1
    
    #Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
    https://pubmed.ncbi.nlm.nih.gov/30473348/
    
    sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
    grep -P "\texon\t" CP000255_m.gff | sort | wc -l  #2630
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
    --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'
    
  3. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in P50_LH1and2; do \
        java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
    
  4. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto
    
  5. nextflow run

    mv trimmed/*.fq.gz .
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
    --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
    (host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- in results/genome/CP059040_genes.gtf -->
    #P059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
    #CP059040.1      Genbank exon    1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
    #CP059040.1      Genbank transcript      1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
    #CP059040.1      Genbank exon    1496    2644    .       +       .       transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";
    
    # -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
    >CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
    

Calculate Host and Bacterial Compositions using vrap

  1. download all S epidermidis genomes and identified all ST2 isolates from them!

    Staphylococcus aureus  ID: 1280
    Staphylococcus epidermidis Taxonomy ID: 1282
    
    esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788
    
    esearch -db nucleotide -query "txid1280[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
    esearch -db nucleotide -query "txid1282[Organism:exp]" | \
    efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
    cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
    python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta  #3478, 9.2G
    python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta  #447, 1.1G
    
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/1280
    #Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
    https://www.ebi.ac.uk/ena/browser/view/1282
    
    #python ~/Scripts/filter_fasta.py  ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
    #S.aureus, Staphylococcus epidermidis
    python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta  #2187, 5.8G
    python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta  #243, 596M
    
    replace virus to S.aureus
    
  2. run vrap

    ln -s ~/Tools/vrap/ .
    conda activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    
    #TODO: change virus_user_db --> S.aureus_user_db
    #ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome
    
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    
    @PG     ID:bowtie2      PN:bowtie2      VN:2.4.4        CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
    24061654 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    22492206 + 0 mapped (93.48% : N/A)
    12342028 + 0 paired in sequencing
    6171014 + 0 read1
    6171014 + 0 read2
    10326710 + 0 properly paired (83.67% : N/A)
    11134274 + 0 with itself and mate mapped
    56254 + 0 singletons (0.46% : N/A)
    421598 + 0 with mate mapped to a different chr
    15942 + 0 with mate mapped to a different chr (mapQ>=5)
    
  3. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    vrap/vrap.py  -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #24061651 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    541762 + 0 mapped (2.25% : N/A)
    12342022 + 0 paired in sequencing
    6171011 + 0 read1
    6171011 + 0 read2
    382256 + 0 properly paired (3.10% : N/A)
    388562 + 0 with itself and mate mapped
    22349 + 0 singletons (0.18% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    
    bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
    #21.4043
    
    samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
    samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
    #21.4043
    samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
    samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'
    
    python ~/Scripts/calculate_coverage.py mapped_sorted.bam
    #Average coverage: 24.41
    
  4. Report

    I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:
    
        Host and Bacterial Composition:
            * The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).
    
        Genome Assembly and Annotation:
            * After removing the human reads, we assembled the bacterial genome.
            * For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
            * The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).
    
        Mapping and Coverage:
            * Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
            * The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
            * The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.
    
    Conclusion:
    The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.
    

Calculate the percentages of reads are human rRNA and human mRNA, see the attached review "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics".

  1. run nextflow set human as reference

    (OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.
    
            diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa
    
            featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam
    
    Here, gene_type in the GTF file may distinguish mRNA and rRNA features.
    
    /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf
    
    (OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"   --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"    -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'
    #--pseudo_aligner 'salmon'
    
    'GRCh38' {
        fasta       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
        bwa         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
        bowtie2     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
        star        = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
        bismark     = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
        gtf         = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
        bed12       = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
        mito_name   = "chrM"
        macs_gsize  = "2.7e9"
        blacklist   = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"
    
  2. Report

    Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.
    
    However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.
    
    This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.
    
    Based on these findings, depleting human mRNA appears to be the most critical step for improvement.
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum