gene_x 0 like s 15 view s
Tags: human, bacterium, pipeline, RNA-seq
RNAseq analaysis using nextflow set reference as CP000255
prepare reference
mkdir raw_data; cd raw_data
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_1.fq.gz P50_LH1and2_R1.fq.gz
ln -s ../F24A430002261_BACwhbaP/P50_LH1+2/P50_LH1+2_2.fq.gz P50_LH1and2_R2.fq.gz
Downloading CP000255.fasta and CP000255.gff from GenBank
mv ~/Downloads/sequence\(3\).fasta CP000255.fasta
mv ~/Downloads/sequence\(2\).gff3 CP000255.gff
MRSA: USA300
EMRSA (CC22): CC22 strain
MSSA: Newmann strain
#https://journals.asm.org/doi/10.1128/jb.01000-07
#Staphylococcus aureus subsp. aureus strain Newman (Taxonomy ID: 426430)
#https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=426430
#Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
https://www.ncbi.nlm.nih.gov/nuccore/CP000255.1
#Methicillin-resistant Staphylococcus aureus CC22-MRSA-IV as an agent of dairy cow intramammary infections
https://pubmed.ncbi.nlm.nih.gov/30473348/
sed 's/\tCDS\t/\texon\t/g' CP000255.gff > CP000255_m.gff
grep -P "\texon\t" CP000255_m.gff | sort | wc -l #2630
# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead! (but there is no 'transcript' in the column 3, strange????) --
--gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" --featurecounts_feature_type 'transcript'
Preparing the directory trimmed
mkdir trimmed trimmed_unpaired;
for sample_id in P50_LH1and2; do \
java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
done
Preparing samplesheet.csv
sample,fastq_1,fastq_2,strandedness
P50_LH1and2,P50_LH1and2_R1.fq.gz,P50_LH1and2_R2.fq.gz,auto
nextflow run
mv trimmed/*.fq.gz .
# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
docker pull nfcore/rnaseq
ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255.fasta" --gff "/home/jhuang/DATA/Data_Marc_RNAseq_2024/CP000255_m.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
# -- DEBUG_2: combination of 'CP059040_m.gff' and 'exon' results in ERROR, using 'transcript' instead!
--gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" --featurecounts_feature_type 'transcript'
# ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker after replacing 'CDS' with 'exon' ----
(host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040_m.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
# -- in results/genome/CP059040_genes.gtf -->
#P059040.1 Genbank transcript 1 1398 . + . transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";
#CP059040.1 Genbank exon 1 1398 . + . transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Dbxref "NCBI_GP:QNT85048.1"; Name "QNT85048.1"; gbkey "CDS"; gene "dnaA"; inference "COORDINATES: similar to AA sequence:RefSeq:YP_004996698.1"; locus_tag "H0N29_00005"; product "chromosomal replication initiator protein DnaA"; protein_id "QNT85048.1"; transl_table "11";
#CP059040.1 Genbank transcript 1496 2644 . + . transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Name "H0N29_00010"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "H0N29_00010";
#CP059040.1 Genbank exon 1496 2644 . + . transcript_id "gene-H0N29_00010"; gene_id "gene-H0N29_00010"; Dbxref "NCBI_GP:QNT85049.1"; Name "QNT85049.1"; gbkey "CDS"; inference "COORDINATES: similar to AA sequence:RefSeq:WP_010116376.1"; locus_tag "H0N29_00010"; product "DNA polymerase III subunit beta"; protein_id "QNT85049.1"; transl_table "11";
# -- DEBUG_3: make sure the header of fasta is the same to the *_m.gff file
>CP000255.1 Staphylococcus aureus subsp. aureus USA300_FPR3757, complete genome
Calculate Host and Bacterial Compositions using vrap
download all S epidermidis genomes and identified all ST2 isolates from them!
Staphylococcus aureus ID: 1280
Staphylococcus epidermidis Taxonomy ID: 1282
esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta # 23788
esearch -db nucleotide -query "txid1280[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1280_ncbi.fasta
esearch -db nucleotide -query "txid1282[Organism:exp]" | \
efetch -format fasta -email j.huang@uke.de > genome_1282_ncbi.fasta
cd /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024
python ~/Scripts/filter_fasta.py genome_1280_ncbi.fasta complete_genome_1280_ncbi.fasta #3478, 9.2G
python ~/Scripts/filter_fasta.py genome_1282_ncbi.fasta complete_genome_1282_ncbi.fasta #447, 1.1G
# ---- Download related genomes from ENA ----
https://www.ebi.ac.uk/ena/browser/view/1280
#Click "Sequence" and download "Counts" and "Taxon descendants count" if there is enough time! Downloading time points is 10.01.2025.
https://www.ebi.ac.uk/ena/browser/view/1282
#python ~/Scripts/filter_fasta.py ena_1282_sequence_taxon_descendants_count.fasta complete_genome_1282_ena_taxon_descendants.fasta
#S.aureus, Staphylococcus epidermidis
python ~/Scripts/filter_fasta.py ena_1280_sequence.fasta complete_genome_1280_ena.fasta #2187, 5.8G
python ~/Scripts/filter_fasta.py ena_1282_sequence.fasta complete_genome_1282_ena.fasta #243, 596M
replace virus to S.aureus
run vrap
ln -s ~/Tools/vrap/ .
conda activate /home/jhuang/miniconda3/envs/vrap
vrap/vrap.py -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
vrap/vrap.py -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_1280 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/genomes_by_taxid/complete_genome_1280_ncbi.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
#TODO: change virus_user_db --> S.aureus_user_db
#ref|NZ_CP075562.1| --> Staphylococcus aureus strain Sta2021010 chromosome, complete genome
mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam
@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/usr/bin/bowtie2-align-s --wrapper basic-0 -x /home/jhuang/REFs/genome -q --fast -p 100 --passthrough -1 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_1.fastq -2 /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.notCombined_2.fastq -U /mnt/md1/DATA_md1/Data_Marc_RNAseq_2024/P50_LH1and2/flash/flash.extendedFrags.fastq"
24061654 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
22492206 + 0 mapped (93.48% : N/A)
12342028 + 0 paired in sequencing
6171014 + 0 read1
6171014 + 0 read2
10326710 + 0 properly paired (83.67% : N/A)
11134274 + 0 with itself and mate mapped
56254 + 0 singletons (0.46% : N/A)
421598 + 0 with mate mapped to a different chr
15942 + 0 with mate mapped to a different chr (mapQ>=5)
Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)
vrap/vrap.py -1 trimmed/P50_LH1and2_R1.fq.gz -2 trimmed/P50_LH1and2_R2.fq.gz -o P50_LH1and2_on_CP015646 --host CP015646.fasta -t 100 -l 200 -g
cd bowtie
mv mapped mapped.sam
samtools view -S -b mapped.sam > mapped.bam
samtools sort mapped.bam -o mapped_sorted.bam
samtools index mapped_sorted.bam
samtools view -H mapped_sorted.bam
samtools flagstat mapped_sorted.bam
#24061651 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
541762 + 0 mapped (2.25% : N/A)
12342022 + 0 paired in sequencing
6171011 + 0 read1
6171011 + 0 read2
382256 + 0 properly paired (3.10% : N/A)
388562 + 0 with itself and mate mapped
22349 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
bamCoverage -b mapped_sorted.bam -o ../../P50_LH1and2_on_CP015646_reads_coverage.bw
samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}'
#21.4043
samtools depth -a mapped_sorted.bam | awk '{sum+=$3; count++} END {print sum/count}'
samtools depth -a mapped_sorted.bam | awk -v len=2742807 '{sum+=$3} END {print sum/len}'
#21.4043
samtools depth mapped_sorted.bam | awk '$3 > 0 {count++} END {print count}'
samtools depth mapped_sorted.bam | awk '$3 > 10 {count++} END {print count}'
python ~/Scripts/calculate_coverage.py mapped_sorted.bam
#Average coverage: 24.41
Report
I hope this email finds you well. Below is a summary of the analysis results based on the sequencing data:
Host and Bacterial Composition:
* The human genome accounted for 22,492,206 reads out of the total 24,061,654 reads (93.48%).
Genome Assembly and Annotation:
* After removing the human reads, we assembled the bacterial genome.
* For annotation, we used the S. aureus user database (S.aureus_user_db), which includes 3,478 complete S. aureus genomes available in GenBank as of January 10, 2025.
* The annotation summary is provided in the attached file assembly_summary.xlsx. The sequence CP015646.1 appears to be the closest genome for the data, as indicated in the attached table (highlighted in yellow).
Mapping and Coverage:
* Using CP015646.1 as the reference genome, 541,762 reads (2.25%) were mapped to this reference.
* The average coverage is approximately 21x, with 951,154 positions out of 2,742,807 in the reference genome covered by at least 10 reads.
* The attached figure xxx.png shows the coverage of reads mapped to CP015646.1.
Conclusion:
The analysis suggests that the majority of reads originate from the human genome, and the S. aureus reads are insufficient for a robust RNA-seq analysis.
Calculate the percentages of reads are human rRNA and human mRNA, see the attached review "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics".
run nextflow set human as reference
(OPTION_1, NOT_USED) Use Annotation Tools: Annotation pipelines like FeatureCounts or HTSeq can classify reads based on genomic annotations (e.g., GTF/GFF files). Use a comprehensive human genome annotation to distinguish between mRNA and rRNA regions.
diff /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa /home/jhuang/REFs/genome.fa
featureCounts -a /mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf -o counts.txt -t exon -g gene_type aligned.bam
Here, gene_type in the GTF file may distinguish mRNA and rRNA features.
/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/gencode.v27.annotation.gtf
(OPTION_2, USED, under host_env) /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf" --star_index "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon'
#--pseudo_aligner 'salmon'
'GRCh38' {
fasta = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa"
bwa = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BWAIndex/version0.6.0/"
bowtie2 = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/"
star = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/STARIndex/"
bismark = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Sequence/BismarkIndex/"
gtf = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"
bed12 = "/mnt/nvme0n1p1/ref/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.bed"
mito_name = "chrM"
macs_gsize = "2.7e9"
blacklist = "/mnt/nvme0n1p1/ref/Homo_sapiens/UCSC/hg38/blacklists/hg38-blacklist.bed"
Report
Indeed, until now I haven’t had direct experience with analyzing RNA-seq data from metatranscriptomic samples, as my expertise has primarily been with data derived from cultured samples. Ideally, sequencing from cultured samples would be the optimal approach.
However, if RNA-seq from metatranscriptomic samples becomes necessary, it would be crucial to deplete host (human) mRNA, host (human) rRNA, and microbial rRNA before sequencing (refer to the attached review PIIS0168952523001270.pdf, "Current concepts, advances, and challenges in deciphering the human microbiota with metatranscriptomics"). This step is important for enriching bacterial mRNA and could potentially enhance the quality of sequencing data.
This morning, I explored the origins of reads within the human genome. Despite over 20 million reads originating from humans, only a small fraction successfully mapped to unique positions (83,500 reads). The remaining reads failed due to reasons such as Unassigned_MultiMapping (the most common issue) and Unassigned_NoFeatures. Among the 83,500 mapped reads, only 0.01% originated from rRNA. For more detailed biotype analysis, please refer to featurecounts_biotype_plot-1.png.
Based on these findings, depleting human mRNA appears to be the most critical step for improvement.
点赞本文的读者
还没有人对此文章表态
没有评论
How to use H3K27ac, H3K4me1, and RNA-seq to identify enhancers and their target genes?
RNA-seq 2024 Ute from raw counts
Preparing a GTF file from GenBank for bacterial RNA-seq analysis, using the example of WA
Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585
© 2023 XGenes.com Impressum