Small RNA-seq analysis pipeline XICRA

gene_x

Tags: pipeline

https://github.com/cougarlj/COMPSRA/issues/18

https://github.com/HCGB-IGTP/XICRA

XICRA_pipeline

  • This pipeline takes paired-end reads in FASTQ format, trims adapters and low-quality base positions, and merges read pairs (R1 & R2) that overlap.
  • A mapping step against the user-defined reference genome assigns joined reads to all major RNA biotypes, including miRNAs and isomiRs, tRNA fragments (tRFs) and Piwi-interacting RNAs (piRNAs).
  • XICRA then performs a miRNA analysis at the isomiR level on the joined reads. The user can choose among several tools, all of which yield standardized output.
  • Results are generated for each sample, then analyzed and summarized for all samples in a single expression matrix.
  • This information can be processed at the miRNA or isomiR level (single sequence), or summarized by isomiR variant type.
  • Statistical summaries can be easily accessed using the accompanying R package XICRA.stats (https://github.com/HCGB-IGTP/XICRA.stats).
  • Although the pipeline is designed for paired-end reads, it also accepts single-end reads.

  • XICRA uses cutadapt [30] for the adapter trimming analysis.

  • The default trimming settings are: keep all reads whether or not the adapter is found, a 10% maximum adapter-matching error rate (mismatches, insertions and deletions), and a 3 bp minimum overlap length.
  • The user must provide the specific adapter sequences for the trimming analysis.
  • An optional quality-check step can be run on each sample with FastQC [31] before the trimming analysis.
  • Results are summarized for all samples using MultiQC software [32].
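
The default settings listed above map onto standard cutadapt command-line options. The following is a minimal sketch that only assembles such a command for paired-end input. The file names and the adapter strings are placeholders (assumptions, not values from XICRA), and the way XICRA itself invokes cutadapt may differ.

```python
import shlex

def build_cutadapt_command(r1, r2, out1, out2, adapter_r1, adapter_r2,
                           error_rate=0.1, min_overlap=3):
    """Assemble a paired-end cutadapt command as an argument list."""
    return [
        "cutadapt",
        "-a", adapter_r1,        # 3' adapter to remove from R1
        "-A", adapter_r2,        # 3' adapter to remove from R2
        "-e", str(error_rate),   # maximum error rate in the adapter match
        "-O", str(min_overlap),  # minimum overlap between read and adapter
        "-o", out1,              # trimmed R1 output
        "-p", out2,              # trimmed R2 output
        r1, r2,
    ]

# Placeholder inputs: replace with real files and the adapters of your kit.
cmd = build_cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_trim_R1.fastq.gz", "sample_trim_R2.fastq.gz",
    adapter_r1="ADAPTER_R1_SEQUENCE", adapter_r2="ADAPTER_R2_SEQUENCE",
)
print(shlex.join(cmd))
```

Reads without an adapter are kept simply because no option that discards untrimmed reads is passed.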

  • Once all reads are adapter trimmed, the tool uses fastq-join from ea-utils [33] to join the two PE reads, if provided, on the overlapping ends.

  • Apart from the joined reads, this tool also generates two files with the R1 and R2 reads that cannot be joined.
  • By default, the minimum overlap is 6 bp and the maximum allowed difference between the reads to be joined is 0%, so only perfectly matching read pairs are retained, which ensures high-quality sequence information.
  • Parameters can be modified using the different options provided.
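
To make the two joining parameters concrete, here is a toy illustration of overlap-based joining. It is not the fastq-join algorithm, only a simplified sketch under the assumption that R2 is the reverse complement of the far end of the fragment. The function names and the example read are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def join_pair(r1, r2, min_overlap=6, max_diff_pct=0):
    """Join R1 with the reverse complement of R2 on their overlapping ends.

    Tries the longest overlap first and accepts the first one whose
    mismatch percentage is at most max_diff_pct. Returns None if no
    overlap of at least min_overlap bases qualifies.
    """
    r2_rc = reverse_complement(r2)
    for overlap in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        tail, head = r1[-overlap:], r2_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches * 100 <= max_diff_pct * overlap:
            # Keep the R1 bases in the overlap, then append the rest of R2.
            return r1 + r2_rc[overlap:]
    return None

# Example 22-nt insert: after adapter trimming, R1 and R2 overlap completely.
fragment = "TGAGGTAGTAGGTTGTATAGTT"
joined = join_pair(fragment, reverse_complement(fragment))
print(joined)
```

With `max_diff_pct=0`, a single mismatch between the two reads leaves the pair unjoined, which is the strict behaviour the defaults aim for.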

  • The XICRA pipeline can then continue with either joined paired-end (PE) reads or single-end reads (SR).

  • Two levels of mapping are implemented. The first level profiles RNA biotypes using STAR [34] to map reads against the reference genome and featureCounts [35] to extract and quantify numbers of reads by class. The second level focuses specifically on small RNA subclasses.
  • Here we describe the miRNA analysis implemented within XICRA, but the modularity and versatility of the pipeline make it straightforward to add detailed analyses of other RNA biotypes.
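
The idea behind the first mapping level, counting aligned reads by RNA class, can be sketched with a toy example. This is neither STAR nor featureCounts: the annotation, the coordinates and the rule for ambiguous reads are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical annotation: (chromosome, start, end, biotype), 1-based inclusive.
FEATURES = [
    ("chr1", 100, 180, "miRNA"),
    ("chr1", 500, 575, "tRNA"),
    ("chr2", 40, 70, "piRNA"),
]

def assign_biotype(chrom, start, end, features=FEATURES):
    """Return the biotype overlapped by an alignment, or a status label."""
    hits = {biotype for f_chrom, f_start, f_end, biotype in features
            if f_chrom == chrom and start <= f_end and end >= f_start}
    if not hits:
        return "unassigned"
    if len(hits) > 1:
        return "ambiguous"
    return hits.pop()

def count_by_biotype(alignments):
    """Tally alignments (chrom, start, end) per biotype."""
    return Counter(assign_biotype(*aln) for aln in alignments)

alignments = [
    ("chr1", 120, 141),  # overlaps the miRNA feature
    ("chr1", 130, 151),  # overlaps the miRNA feature
    ("chr1", 510, 540),  # overlaps the tRNA feature
    ("chr2", 200, 222),  # overlaps nothing
]
counts = count_by_biotype(alignments)
print(dict(counts))
```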

  • For miRNA analysis at isomiR resolution, XICRA lets the user choose between miraligner [26], sRNAbench from sRNAtoolbox [27] and OPTIMIR [28].

  • Each tool uses a different strategy and may produce different results [36].
  • They were included because their output can be standardized with miRTOP and converted to the miR.gff3 file format [37].
  • Again, the modular implementation of the pipeline would allow adding further tools whose output can be adapted to miRTOP and the miR.gff3 format.
  • Each of the tools mentioned above is run with its default parameters within the XICRA miRNA module.
  • Some of these parameters can be modified using the options provided.
  • The miRNA module produces an annotation that categorizes isomiRs into classes based on their sequence modifications (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl), following the classification scheme suggested by miRTOP.
  • A final conversion step from individual per sample miR.gff3 files into a single expression matrix is performed.
  • This file serves as input for differential expression (DE) analysis.
  • Information is provided for each unique sequence; index names contain the miRNA, the variant type and the license plate (unique identifier, UID) provided by miRTOP.
  • Duplicated entries at the sequence level, produced by different modifications of the same or different miRNAs, are discarded.
  • An additional matrix is provided containing the sequence information for each encoded UID.
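
The final conversion step, from per-sample counts to a single expression matrix, can be sketched as follows. This is a minimal illustration and not XICRA's code. The sample names, the feature identifiers and the counts are hypothetical, and real input would come from the per-sample miR.gff3 files.

```python
def build_expression_matrix(per_sample_counts):
    """Merge {sample: {feature_id: count}} into one matrix.

    Returns (feature_ids, sample_names, rows), where rows[i][j] is the
    count of feature i in sample j and missing features count as 0.
    """
    samples = sorted(per_sample_counts)
    features = sorted({f for counts in per_sample_counts.values() for f in counts})
    rows = [[per_sample_counts[s].get(f, 0) for s in samples] for f in features]
    return features, samples, rows

# Hypothetical per-sample counts keyed by a placeholder sequence identifier.
per_sample = {
    "sample_A": {"seq_1": 12, "seq_2": 3},
    "sample_B": {"seq_2": 7, "seq_3": 1},
}
features, samples, rows = build_expression_matrix(per_sample)

# Print as tab-separated text, one feature per line.
print("\t".join(["feature"] + samples))
for feature, row in zip(features, rows):
    print("\t".join([feature] + [str(value) for value in row]))
```

Filling absent features with zero keeps every sample in the same feature order, which is what downstream DE tools expect from a count matrix.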

  • Per-sample read count matrices at the isomiR level are summarized into a single expression matrix that serves as input for DE analysis between the comparison groups of interest.

  • We have generated an additional R package (XICRA.stats) that facilitates the retrieval of these matrices and parses the information included within each unique index name provided.
  • The DE analysis can be done by aggregating data at the mature miRNA level (e.g. hsa-miR-501-3p), by isomiR class (e.g. hsa-miR-501-3p_iso_5p), by specific length variant cluster (e.g. hsa-miR-501-3p_iso_3p:-2), or with the sequence of the read itself as the counting unit.
  • This is useful because several types of modification may coexist in a single sequence, and non-templated additions and internally edited sequences can yield isomiRs that fall into different categories or derive from different mature miRNAs.
  • DE analysis is performed outside the tool with the DESeq2 package in R [38].
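
The four aggregation levels described above can be sketched in a few lines. This is an illustration only, not the XICRA.stats implementation. The record layout, the UIDs and the counts are hypothetical, and sequences carrying several variant types at once are not handled.

```python
from collections import defaultdict

# Hypothetical records for one sample: (mature miRNA, variant, UID, count).
records = [
    ("hsa-miR-501-3p", "iso_3p:-2", "uid_1", 10),
    ("hsa-miR-501-3p", "iso_3p:-1", "uid_2", 5),
    ("hsa-miR-501-3p", "iso_5p:+1", "uid_3", 3),
    ("hsa-let-7a-5p", "iso_3p:-2", "uid_4", 8),
]

def aggregate(records, level):
    """Sum counts at the 'mirna', 'class', 'variant' or 'sequence' level."""
    totals = defaultdict(int)
    for mirna, variant, uid, count in records:
        if level == "mirna":
            key = mirna
        elif level == "class":
            # Drop the length suffix, e.g. iso_3p:-2 becomes iso_3p.
            key = f"{mirna}_{variant.split(':')[0]}"
        elif level == "variant":
            key = f"{mirna}_{variant}"
        elif level == "sequence":
            key = uid
        else:
            raise ValueError(f"unknown level: {level}")
        totals[key] += count
    return dict(totals)

for level in ("mirna", "class", "variant", "sequence"):
    print(level, aggregate(records, level))
```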

Types of small RNAs

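  • miRNA (microRNA): MicroRNAs are short RNA molecules about 20-24 nucleotides long that regulate gene expression at the post-transcriptional level. By binding to complementary sequences on mRNA molecules, miRNAs can repress translation or trigger mRNA degradation. They take part in many cellular processes, including development, differentiation and cell-cycle control.

  • tRNA (transfer RNA): Transfer RNAs are adaptor molecules about 70-90 nucleotides long that play a key role in translation during protein biosynthesis. They deliver the corresponding amino acid to the ribosome and recognize codons on the mRNA through their anticodon loop, ensuring that amino acids are placed in the correct order in the growing protein.

  • piRNA (Piwi-interacting RNA): Piwi-interacting RNAs are a class of small RNAs, typically 24 to 31 nucleotides long, found mainly in animal germ cells, where they contribute to transposon silencing and genome stability. They interact with Piwi proteins to help repress the transcription of transposable elements and protect genome integrity.

  • snRNA (small nuclear RNA): Small nuclear RNAs are a group of small RNA molecules found in the nucleus of eukaryotic cells, about 100-200 nucleotides long. They are essential components of the spliceosome, the complex that removes introns from pre-mRNA molecules.

  • snoRNA (small nucleolar RNA): Small nucleolar RNAs are specialized RNA molecules found in the nucleolus, about 60-300 nucleotides long. They take part in rRNA modification and maturation, in particular chemical modifications such as methylation and pseudouridylation.

  • circRNA (circular RNA): Circular RNAs have a closed circular structure and lack the usual 5' and 3' ends. They can arise from exons or introns and have a range of functions, including acting as molecular sponges for miRNAs, influencing transcription and regulating gene expression. circRNAs take part in many biological processes and have been linked to various diseases.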

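In biological research, differential expression analysis is commonly used to compare how the expression levels of RNA molecules change across samples or conditions. For the RNA types above:

  • miRNA (microRNA): Differential expression analysis is meaningful. miRNAs play key roles in regulating gene expression and in many biological processes, so analyzing how their expression differs between conditions helps reveal their function in disease onset and progression and in other biological processes.

  • tRNA (transfer RNA): Although tRNAs are indispensable for protein synthesis, differences in their abundance between states are usually not the main focus of research. Changes in tRNAs can sometimes reflect the metabolic state of the cell or its ability to cope with stress, but this is not a common application of differential expression analysis.

  • piRNA (Piwi-interacting RNA): Differential expression analysis is also meaningful, especially in research on germ cells, stem cells and cancer. piRNAs are involved in transposon silencing and the maintenance of genome stability, so analyzing their differential expression helps reveal their role in these processes.

  • snRNA (small nuclear RNA) and snoRNA (small nucleolar RNA): These two RNA types are mainly involved in RNA processing and modification, such as splicing and rRNA modification. Differential expression analysis of them is not very common, because they mostly take part in basic cellular processes. However, if the study targets changes in a specific RNA processing pathway, analyzing their expression can be meaningful.

  • circRNA (circular RNA): Differential expression analysis is meaningful. circRNAs are linked to many physiological processes, including acting as miRNA sponges, influencing gene transcription and contributing to disease development. Analyzing differences in circRNA expression therefore helps clarify their function and role in different biological contexts.

Overall, differential expression analysis of miRNAs, piRNAs and circRNAs is meaningful in many research areas because of their roles in disease and biological processes. Differential expression analysis of tRNAs, snRNAs and snoRNAs is less common, unless the study targets their specific changes in a particular biological process or condition.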

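Can mapping cDNA sequencing reads to the human genome tell us whether those reads come from circRNA?

Yes, mapping cDNA reads to the human genome can help determine whether they come from circRNA. In circRNA studies, the key feature to look for during mapping is the back-splicing event, that is, reads in which the normal linear splicing order is joined in reverse.

In a standard gene expression study, mRNA is reverse-transcribed into cDNA, and the resulting sequencing reads usually map to the genome in an order consistent with linear splicing. circRNAs, however, are formed by back-splicing and therefore have a distinctive "head-to-tail" structure. In sequencing data, a read that spans such a junction maps to non-contiguous regions of the genome in reversed order: the end of a downstream exon is joined to the start of an upstream exon, unlike the normal linear splicing order.

Detecting this atypical, out-of-order mapping pattern makes it possible to infer that reads come from circRNA. Dedicated bioinformatics tools and algorithms are needed to recognize these mapping patterns and thereby identify circRNAs. Such tools detect reads that span a unique back-splice junction, helping researchers determine which reads originate from circRNAs.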
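
The back-splice signature described above can be illustrated with a toy classifier for split reads. This is a simplified sketch, not the method of any particular circRNA tool; real tools also check splice-site signals, annotation and mapping quality. It assumes each read has already been aligned as two segments given in 5' to 3' order along the transcript. The segment tuples and coordinates are hypothetical.

```python
def classify_split_read(first, second):
    """Classify a two-segment alignment as linear, back-splice or other.

    Each segment is (chrom, strand, start, end) with start <= end.
    `first` is the 5' part of the read in transcript orientation and
    `second` is the 3' part.
    """
    chrom1, strand1, start1, end1 = first
    chrom2, strand2, start2, end2 = second
    if chrom1 != chrom2 or strand1 != strand2:
        return "other"
    if strand1 == "+":
        if end1 < start2:
            return "linear"        # 5' part lies upstream, as expected
        if end2 < start1:
            return "back-splice"   # 3' part maps upstream of the 5' part
    else:
        if end2 < start1:
            return "linear"        # minus strand: order is mirrored
        if end1 < start2:
            return "back-splice"
    return "other"

# Two hypothetical exons on the plus strand: 1000-1100 and 2000-2100.
linear_read = (("chr1", "+", 1080, 1100), ("chr1", "+", 2000, 2020))
circ_read = (("chr1", "+", 2080, 2100), ("chr1", "+", 1000, 1020))
print(classify_split_read(*linear_read))
print(classify_split_read(*circ_read))
```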


© 2023 XGenes.com Impressum