gene_x 0 like s 608 view s
Tags: processing, bacterium, RNA-seq
In the example RP62A
#[IMPORTANT FINAL STEPS]
[1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf or python3 ~/Scripts/genbank_to_gtf.py 1585.gb 1585.gtf
[2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
python3 modify_gtf.py 1585.gtf 1585_m.gtf #replace CDS with exon!
gene_biotype "protein_coding";
[3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf
# --gtf_extra_attributes [string] By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when
running Salmon. [default: gene_name]
# --gtf_group_features [string] Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
# --featurecounts_group_type [string] The attribute type used to group feature types in the GTF file when generating the biotype plot with
featureCounts. [default: gene_biotype]
# --featurecounts_feature_type [string] By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]
#--gtf_extra_attributes gene
#--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3])
# ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
#Upload the scripts gff_to_gtf.py and modify_gtf.py.
python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf
#[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
#[CHANGE2] "Protein Homology" --> "RefSeq"
#[CHANGE3] CDS --> exon
#[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)
In the example of 1585 wildtype genome 1585.gb (CP143516-CP143519)
Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA
Genes (total) :: 2,332 (CDS + RNA)
CDSs (total) :: 2,240 (CDS) grep "CDS" 1585.gtf | wc -l
Genes (coding) :: 2,183 (Pseudo has also CDS?)
CDSs (with protein) :: 2,183
Genes (RNA) :: 92
rRNAs :: 7, 6, 6 (5S, 16S, 23S) grep "rRNA" 1585.gtf | wc -l
complete rRNAs :: 7, 6, 6 (5S, 16S, 23S)
tRNAs :: 69 grep "tRNA" 1585.gtf | wc -l
ncRNAs :: 4 grep "ncRNA" 1585.gtf | wc -l
Pseudo Genes (total) :: 57
CDSs (without protein) :: 57
Pseudo Genes (ambiguous residues) :: 0 of 57
Pseudo Genes (frameshifted) :: 33 of 57
Pseudo Genes (incomplete) :: 30 of 57
Pseudo Genes (internal stop) :: 25 of 57
Pseudo Genes (multiple problems) :: 21 of 57
gene 730216..731766
/locus_tag="V0R30_03330"
rRNA 730216..731766
/locus_tag="V0R30_03330"
#Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
cut -d$'\t' -f3 1585_m.gtf | sort -u
exon
#gene
ncRNA
rRNA
tmRNA
tRNA
#Ausführung: STEP0: one record (tmRNA) lacking the "tmRNA" row, manually corrected!
# Transfer-messenger RNA (tmRNA, 10Sa RNA, ssrA) is bacterial RNA having both tRNA and mRNA properties
# STEP1: python3 genbank_to_gtf.py 1585.gb 1585.gtf
# STEP2: python3 modify_gtf.py 1585.gtf 1585_m.gtf #"CDS" --> "exon"
# STEP3: replace transcript_id " --> transcript_id "tx-
# STEP4: ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
# using "python3 add_geneid_genename_genebiotype.py 1585_m.gtf 1585_m_.gtf
#DEBUG ERROR Stop at line : CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
# Error Message: Cannot find transcript_id!
# STEP5: Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described. with "python3 add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py 1585_m_.gtf 1585_m__.gtf"
# STEP6: MANUALLY modify the lines ncRNA (3), tmRNA (1), rRNA (19), tRNA (69): tRNA --> exon, 'gene_biotype "tRNA";';
# with "grep "tRNA" 1585_m_.gtf | wc -l" check they are correctly changed!
#RESULTS:
#CP143516 biopython gene 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
#CP143516 biopython exon 1 1356 . + . gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
#...
#/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
#grep ">" genome.transcripts.fa | wc -l
#2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!
# STEP7 (Optional) To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
# see 1585_m___.gtf
# grep "exon" 1585_m___.gtf | wc -l
# 2332
# grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
# grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
# 2332
#For example exon-V0R30_00090 is a tRNA.
#>tx-V0R30_00090
#GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA
#We can also filter the count number in the downstream analysis from the big matrix!
#tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
>tx-V0R30_05655
AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
>tx-V0R30_06520
TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
>tx-V0R30_11060
CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG
nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m___.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0
点赞本文的读者
还没有人对此文章表态
没有评论
Deciphering S. aureus with metatranscriptomics (Marc's project)
RNA-seq 2024 Ute from raw counts
© 2023 XGenes.com Impressum