Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585

gene_x 0 like s 1170 view s

Tags: processing, bacterium, RNA-seq

In the example RP62A

#[IMPORTANT FINAL STEPS]
[1] python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf or python3 ~/Scripts/genbank_to_gtf.py 1585.gb 1585.gtf

[2] python3 modify_gtf.py RP62A.gtf RP62A_m.gtf
    python3 modify_gtf.py 1585.gtf 1585_m.gtf  #replace CDS with exon!

    gene_biotype "protein_coding"; 
[3] sed -i -e 's/gene_id "rna-/gene_id "gene-/g' RP62A_m.gtf

#  --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when 
                                          running Salmon. [default: gene_name] 
#  --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id]
#  --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with 
                                          featureCounts. [default: gene_biotype] 
#  --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]

#--gtf_extra_attributes gene 
#--gtf_group_features Parent --featurecounts_group_type gene_biotype --featurecounts_feature_type CDS (outdated since [CHANGE1,2,3])

# ------------ gff_to_gtf.py, then modify *.gtf file as follows -------------
#Upload the scripts gff_to_gtf.py and modify_gtf.py.
python3 gff_to_gtf.py RP62A.gff3 RP62A.gtf

#[CHANGE1] ID in gene and Parent in CDS --> gene_id; ID in exon --> transcript_id; Name in gene --> gene_name
#[CHANGE2] "Protein Homology" --> "RefSeq"
#[CHANGE3] CDS --> exon
#[CHANGE4?] --gtf_extra_attributes ||||gene|||| refers to ID=gene-SERP_RS00220;Name=noc;gbkey=Gene;||||gene=noc||||; "gene=" exists in both gene and exon. choose gene as gtf_extra_attributes! (see --gtf_extra_attributes gene in the next command)

In the example of 1585 wildtype genome 1585.gb (CP143516-CP143519)

            Features Annotated                :: Gene; CDS; rRNA; tRNA; ncRNA
            Genes (total)                     :: 2,332 (CDS + RNA)
            CDSs (total)                      :: 2,240 (CDS)  grep "CDS" 1585.gtf | wc -l
            Genes (coding)                    :: 2,183 (Pseudo has also CDS?)
            CDSs (with protein)               :: 2,183

            Genes (RNA)                       :: 92  
            rRNAs                             :: 7, 6, 6 (5S, 16S, 23S)  grep "rRNA" 1585.gtf | wc -l
            complete rRNAs                    :: 7, 6, 6 (5S, 16S, 23S)
            tRNAs                             :: 69  grep "tRNA" 1585.gtf | wc -l
            ncRNAs                            :: 4  grep "ncRNA" 1585.gtf | wc -l

            Pseudo Genes (total)              :: 57
            CDSs (without protein)            :: 57
            Pseudo Genes (ambiguous residues) :: 0 of 57
            Pseudo Genes (frameshifted)       :: 33 of 57
            Pseudo Genes (incomplete)         :: 30 of 57
            Pseudo Genes (internal stop)      :: 25 of 57
            Pseudo Genes (multiple problems)  :: 21 of 57

    gene            730216..731766
                    /locus_tag="V0R30_03330"
    rRNA            730216..731766
                    /locus_tag="V0R30_03330"

#Ziel all ncRNA, rRNA, tmRNA, tRNA as exon, the original types set as gene_biotype "protein_coding"; gene_biotype "tRNA"; ...
cut -d$'\t' -f3 1585_m.gtf | sort -u
exon
#gene
ncRNA
rRNA
tmRNA            
tRNA
#Ausführung: STEP0: one record (tmRNA) lacking the "tmRNA" row, manually corrected! 
#                Transfer-messenger RNA (tmRNA, 10Sa RNA, ssrA) is bacterial RNA having both tRNA and mRNA properties
#            STEP1: python3 genbank_to_gtf.py 1585.gb 1585.gtf
#            STEP2: python3 modify_gtf.py 1585.gtf 1585_m.gtf  #"CDS" --> "exon"
#            STEP3: replace transcript_id " --> transcript_id "tx-
#            STEP4: ergänzen all lines as gene_id; gene_name; transcript_id; and add all lines annotated with "gene" at the end with 'gene_biotype "protein_coding";'
#                   using "python3 add_geneid_genename_genebiotype.py 1585_m.gtf 1585_m_.gtf

#DEBUG ERROR Stop at line : CP143516       biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"
#    Error Message: Cannot find transcript_id!
#            STEP5:     Process features of types Gene, CDS, rRNA, tRNA, ncRNA, and tmRNA; For rRNA, tRNA, tmRNA, and ncRNA features, it will add an additional line with exon_id following the pattern described.   with "python3 add_exonid_to_exon_tRNA_rRNA_tmRNA_ncRNA.py 1585_m_.gtf 1585_m__.gtf"

#            STEP6: MANUALLY modify the lines ncRNA (3), tmRNA (1), rRNA (19), tRNA (69): tRNA --> exon, 'gene_biotype "tRNA";';
#                   with "grep "tRNA" 1585_m_.gtf | wc -l" check they are correctly changed!

#RESULTS:
#CP143516        biopython       gene    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; gene_biotype "protein_coding"
#CP143516        biopython       exon    1       1356    .       +       .       gene_id "dnaA"; gene_name "dnaA"; transcript_id "tx-V0R30_00005"; exon_id "exon-V0R30_00005";
#...

#/home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem
#grep ">" genome.transcripts.fa  | wc -l
#2240 only exon level, but not tRNA, rRNA, tmRNA, ncRNA records!

#            STEP7 (Optional) To include also the 92 records in the results, we can change the annotations records of rRNA, tRNA, tmRNA, ncRNA to exon, so that we can also the records!
#                see 1585_m___.gtf
#                grep "exon" 1585_m___.gtf | wc -l
#                2332
#                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/genome.transcripts.fa | wc -l
#                grep ">" /home/jhuang/DATA/Data_Marc_RNA-seq_Sepidermidis/results_1585/genome/rsem/genome.transcripts.fa | wc -l
#                2332

#For example exon-V0R30_00090 is a tRNA.
#>tx-V0R30_00090
#GGCTCCTTGGTCAAGCGGTTAAGACACCGCCCTTTCACGGCGGTAACACGGGTTCGAGTCCCGTAGGAGTCACCA

#We can also filter the count number in the downstream analysis from the big matrix!
#tmRNA: V0R30_09330; ncRNA: V0R30_05655, V0R30_06520, V0R30_11060
>tx-V0R30_05655
AGTCATCTGTGGTGTTCGTAAGTTTGCTTTTTATTTGGGCCTAACACTCTTTGATCAGGGAGCCCAATAGGTTTTCTCGCAGCGCACACGCCTCTATAGGAGGACTTGCAAAACGAGAAACAGGGCACCCACCTGTATATAGCAGGCCGAATGATCAAGCTATTTATAACTACGGCATCAACGGACTCTAT
>tx-V0R30_06520
TATTTCGGGTAATCGCTATAGCAATATAGAGGAAAGTCCATGCTCACACAATCTGAGATGATTGTAGTGTTCGTGCTTGATGAAACAATAAATCAAGGCATTAATTTGACGGCAATGAAATAACCTAAGTCATTGGATATGGTTAGAATAGTTTGAAAGTGCCACAGTGACGTAGCTTTTATAGAAATATAAAAGGTGGAACGCGGTAAACCCCTCGAGTGAGCAATCCAAATTGGGTAGGAGCACTTGTTTAGCGGAATTCAACGTATAGACGAGACGATTTTTACGCGAAAGTAAAAATATGTAGACAGATGGTTACCACCGACGTACCAGTGTAACTAGTACACATTATGAGTACAACGGAACAGAACATGGCTTACAGAA
>tx-V0R30_11060
CCGTGCTAGGTGGGGAGGTAGCGGTTCCCTGTACTCGAAATCCGCTTTATGCGAGACTTAATTCCTTTGTTGAGGGCGTATTTTTGTGAAGTCTGCCCGAAGCACGTAGTGTTTGAAGATTTCGGTCCTATGCAATATGAACCCATGAACCATGTCAGGTCCTGACGGAAGCAGCATTAAGTGGATCCTCATATGTGCCGTAGGGTAGCCGAGATTTAGCTGTCGACTTCGGTAACGTTTATGATATGCGTTCGATACAAAGGTGCATGG

nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_1585 --fasta 1585.fasta --gtf 1585_m___.gtf -profile test_full -resume --max_memory 200.GB --max_time 2400.h --save_reference --aligner star_salmon   --skip_rseqc --skip_dupradar --skip_preseq --skip_biotype_qc --skip_deseq2_qc --min_mapped_reads 0

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Preparing GTF file for nextflow rnaseq regarding S. epidermidis 1585

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章