A common tool for using RNAseq data to refine a reference, especially for viruses, is StringTie. It is primarily designed for transcript assembly and quantification from RNAseq read alignments, so what it improves is the transcript annotation of the reference rather than the genome sequence itself.
Here’s a basic pipeline using StringTie with a reference virus sequence:
Mapping/Aligning Reads to the Reference:
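- You’ll first need to align the RNAseq reads to the reference genome. You can use STAR or HISAT2 for this (with HISAT2, add --dta so the alignments are tailored for transcript assemblers). StringTie expects coordinate-sorted alignments, so have STAR write a sorted BAM directly (it will be named <prefix>Aligned.sortedByCoord.out.bam):
STAR --runThreadN 4 --genomeDir /path/to/genomeDir --readFilesIn /path/to/rnaseq.fastq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output_prefix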
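The alignment step assumes /path/to/genomeDir already holds a STAR index. As a rough sketch rather than a definitive recipe: for a genome as small as a virus, the STAR manual recommends scaling --genomeSAindexNbases down to min(14, log2(GenomeLength)/2 - 1). The helper name sa_index_nbases, the lower clamp at 1, and the FASTA path are assumptions made for illustration.

```shell
# Sketch: choose --genomeSAindexNbases for a small genome using the STAR
# manual's rule of thumb min(14, log2(GenomeLength)/2 - 1).
# sa_index_nbases is a hypothetical helper, not part of STAR.
sa_index_nbases() {
    # $1: reference FASTA; sequence length is summed over all non-header lines
    awk '!/^>/ { gsub(/[ \t\r]/, ""); len += length($0) }
         END {
             if (len == 0) { print "no sequence found" > "/dev/stderr"; exit 1 }
             n = int(log(len) / log(2) / 2 - 1)
             if (n > 14) n = 14   # upper bound from the STAR manual
             if (n < 1)  n = 1    # assumption: keep the value positive
             print n
         }' "$1"
}

# Example usage (paths are placeholders):
# STAR --runMode genomeGenerate --runThreadN 4 \
#      --genomeDir /path/to/genomeDir \
#      --genomeFastaFiles /path/to/virus.fasta \
#      --genomeSAindexNbases "$(sa_index_nbases /path/to/virus.fasta)"
```

For a genome of roughly 30 kb this works out to 6, well below STAR's default of 14.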
Transcript Assembly using StringTie:
- With the alignment file produced (it must be coordinate-sorted; convert SAM or unsorted output with samtools sort first), you can use StringTie to assemble the transcripts.
stringtie -p 4 -o output.gtf -l virus /path/to/aligned_data.bam
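Before moving on, a quick sanity check of the assembly can help. The sketch below counts transcripts and multi-exon transcripts in a StringTie GTF, relying only on the feature type in column 3 and the transcript_id attribute in column 9. The function name gtf_summary is made up for this example.

```shell
# Sketch: summarise a StringTie GTF.
# gtf_summary is a hypothetical helper name.
gtf_summary() {
    # $1: GTF written by stringtie (e.g. output.gtf)
    awk -F '\t' '
        /^#/ { next }                          # skip header comment lines
        $3 == "transcript" { tx++ }            # one line per assembled transcript
        $3 == "exon" {
            # group exons by their transcript_id attribute
            if (match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART, RLENGTH)
                exons[id]++
            }
        }
        END {
            for (id in exons) if (exons[id] > 1) multi++
            printf "transcripts\t%d\nmulti_exon\t%d\n", tx, multi
        }' "$1"
}

# Example usage:
# gtf_summary output.gtf
```

A count of zero multi-exon transcripts is not necessarily a problem for a virus with little or no splicing, but an unexpectedly large number is worth a look in a genome browser.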
Compare and Correct the Reference:
- If you have a reference annotation, you can compare the assembled transcripts against the annotations to identify novel transcripts or refine existing ones.
- GFFCompare can be used to compare and evaluate the accuracy of the assembled transcripts against the reference annotation.
gffcompare -r reference_annotation.gtf -G -o comparison_output output.gtf
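To see where the assembly agrees or disagrees with the annotation, you can tally the class codes GFFCompare assigns. The sketch below assumes the run above wrote comparison_output.annotated.gtf and that each transcript line carries a class_code attribute, which is how GFFCompare normally reports a single input file. The function name class_code_counts is hypothetical.

```shell
# Sketch: count transcripts per gffcompare class code.
# class_code_counts is a hypothetical helper name.
class_code_counts() {
    # $1: <prefix>.annotated.gtf written by gffcompare
    awk -F '\t' '
        $3 == "transcript" && match($9, /class_code "[^"]+"/) {
            # skip the 12 characters of: class_code "  and drop the closing quote
            code = substr($9, RSTART + 12, RLENGTH - 13)
            n[code]++
        }
        END { for (c in n) printf "%s\t%d\n", c, n[c] }' "$1" | sort
}

# Example usage:
# class_code_counts comparison_output.annotated.gtf
```

Code "=" marks a transcript whose intron chain matches a reference transcript exactly, while "u" marks one that falls outside any annotated feature, so "u" entries are the first candidates for novel transcripts.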
Visual Inspection (Optional but Recommended):
- You can visualize the alignments and the assembled transcripts using a genome browser like IGV (Integrative Genomics Viewer). This can give you insights into regions with alternative splicing, novel exons, or discrepancies between your data and the reference annotation.
Further Analysis:
- Based on the assembled transcripts, you can proceed to differential expression analysis, identification of novel transcripts, or even SNP/variant detection if needed.
For the entire process, you’d need:
- Reference genome (for STAR or HISAT2 alignment).
- RNAseq FASTQ files.
- (Optionally) Reference annotation for comparison.
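A small pre-flight check along these lines can catch a missing input before a long alignment run. This is only a sketch: check_inputs and check_tools are hypothetical helper names, and the paths in the usage comment are placeholders.

```shell
# Sketch: verify inputs and tools before starting the pipeline.
# check_inputs and check_tools are hypothetical helper names.
check_inputs() {
    # Succeeds only if every file given exists and is non-empty
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

check_tools() {
    # Succeeds only if every named program is found on PATH
    for t in "$@"; do
        if ! command -v "$t" >/dev/null 2>&1; then
            echo "not on PATH: $t" >&2
            return 1
        fi
    done
}

# Example usage (paths are placeholders):
# check_tools STAR stringtie gffcompare &&
#     check_inputs /path/to/virus.fasta /path/to/rnaseq.fastq
```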