http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
Methods
Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).
The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:
nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript
-
prepare reference
They are wildtype strains grown in different medium. AUM - artificial urine medium Urine - human urine MHB - Mueller-Hinton broth mkdir raw_data; cd raw_data ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
-
(Optional) using trinity to find the most closely reference
Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz --right trimmed/wt_r1_R2.fastq.gz --CPU 12 #https://www.genome.jp/kegg/tables/br08606.html#prok acb KGB Acinetobacter baumannii ATCC 17978 2007 GenBank abm KGB Acinetobacter baumannii SDF 2008 GenBank aby KGB Acinetobacter baumannii AYE 2008 GenBank abc KGB Acinetobacter baumannii ACICU 2008 GenBank abn KGB Acinetobacter baumannii AB0057 2008 GenBank abb KGB Acinetobacter baumannii AB307-0294 2008 GenBank abx KGB Acinetobacter baumannii 1656-2 2012 GenBank abz KGB Acinetobacter baumannii MDR-ZJ06 2012 GenBank abr KGB Acinetobacter baumannii MDR-TJ 2012 GenBank abd KGB Acinetobacter baumannii TCDC-AB0715 2012 GenBank abh KGB Acinetobacter baumannii TYTH-1 2012 GenBank abad KGB Acinetobacter baumannii D1279779 2013 GenBank abj KGB Acinetobacter baumannii BJAB07104 2013 GenBank abab KGB Acinetobacter baumannii BJAB0715 2013 GenBank abaj KGB Acinetobacter baumannii BJAB0868 2013 GenBank abaz KGB Acinetobacter baumannii ZW85-1 2013 GenBank abk KGB Acinetobacter baumannii AbH12O-A2 2014 GenBank abau KGB Acinetobacter baumannii AB030 2014 GenBank abaa KGB Acinetobacter baumannii AB031 2014 GenBank abw KGB Acinetobacter baumannii AC29 2014 GenBank abal KGB Acinetobacter baumannii LAC-4 2015 GenBank
-
Downloading CP059040.fasta and CP059040.gff from GenBank
-
(Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed
#Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040 cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta . # Elements (Anna C.arnes) cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 . cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf . cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed . rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/ rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/ rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/ (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*" ./CP059040.fasta ./CP059040.bed ./CP059040.gb ./CP059040.gff3 ./CP059040.gff3_backup ./CP059040_full.gb ./CP059040_gene.gff3 ./CP059040_gene.gtf ./CP059040_gene_old.gff3 ./CP059040_rRNA.gff3 ./CP059040_rRNA_v.gff3 # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ---- #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_ #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l #69=3887-3803 cp CP059040.gff3 CP059040_backup.gff3 sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3 grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3 sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3 #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42 # -------------------------------------------------------------------------------------------------------------------------------------------------- # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------ cp CP059040.gff3 CP059040_backup.gff3 sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3 grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3 sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3 grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3 #-->3796 #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing. # from gff3 to gtf sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf sed -i -e "s/;/\"; /g" CP059040_gene.gtf sed -i -e "s/=/=\"/g" CP059040_gene.gtf #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf #using editor instead! #The following is GTF-format. CP000521.1 Genbank exon 95 1492 . + . transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001"; #NZ_MJHA01000001.1 RefSeq region 1 8663 . + . ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii #NZ_MJHA01000001.1 RefSeq gene 228 746 . - . ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795 #NZ_MJHA01000001.1 Protein Homology CDS 228 746 . - 0 ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11 ##gff-version 3 ##sequence-region CP059040.1 1 3980852 ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470 gffread -E -F --bed CP059040.gff3 -o CP059040.bed #-->3796 ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3 ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown #[01/21 10:57:46] Loading reference annotation (guides).. #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810) #[01/21 10:57:46] 3796 reference transcripts loaded. #Default stack size for threads: 8388608 #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?) #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! #Please make sure the -G annotation file uses the same naming convention for the genome sequences. #[01/21 10:58:30] All threads finished. # ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. # The specified gene identifier attribute is 'Name' # An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA' # The program has to termin # ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. # The specified gene identifier attribute is 'gene_biotype' # An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA' # The program has to terminate. #grep "ID=cds-" CP059040.gff3 | wc -l #grep "ID=exon-" CP059040.gff3 | wc -l #grep "ID=gene-" CP059040.gff3 | wc -l #the same as H0N29_18980/5=3796 grep "gbkey=" CP059040.gff3 | wc -l 7695 grep "ID=id-" CP059040.gff3 | wc -l 5 grep "locus_tag=" CP059040.gff3 | wc -l 7689 #... cds 3701 locus_tag=xxxx, no gene_biotype exon 96 locus_tag=xxxx, no gene_biotype gene 3796 locus_tag=xxxx, gene_biotype=xxxx, id (riboswitch+direct_repeat,5) both no --> ignoring them!! # grep "ID=id-" CP059040.gff3 rna 96 locus_tag=xxxx, no gene_biotype ------------------ 7694 cp CP059040.gff3_backup CP059040.gff3 grep "^##" CP059040.gff3 > CP059040_gene.gff3 grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3 #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'! sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
-
Preparing the directory trimmed
mkdir trimmed trimmed_unpaired; for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \ for sample_id in MHB_r1 MHB_r2 MHB_r3; do \ java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log; done
-
Preparing samplesheet.csv
sample,fastq_1,fastq_2,strandedness AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
-
nextflow run
#Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/ # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ---- docker pull nfcore/rnaseq ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq (host_env) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript' /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff" -profile docker -resume --max_cpus 55 --max_memory 512.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript' # -- DEBUG_1 -- #After checking the record (see below) in results/genome/CP059040.gtf, we have to change 'exon' to 'transcript', the default values are --gtf_group_features 'gene_id' --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon' #In ./results/genome/CP059040.gtf e.g. "CP059040.1 Genbank transcript 1 1398 . + . transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"