Small RNA processing

Small RNA sequencing is a type of RNA-sequencing (RNA-seq) that specifically targets and sequences small RNA molecules in a sample.

RNA-seq is a technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, capturing a snapshot of the transcriptome.

Small RNAs, including microRNAs (miRNAs), small interfering RNAs (siRNAs), and piwi-interacting RNAs (piRNAs), play crucial roles in gene regulation. They typically range from 20 to 30 nucleotides in length.

prepare raw data

#mv Data_Ute_smallRNA_3/bundle_v1 Data_Ute_smallRNA_5
ln -s ../Data_Ute_smallRNA_3/bundle_v1 .

#OLD    MKL_1_wt_1_221216.fastq.gz -> ../221216_NB501882_0404_AHLVNMBGXM/ute/nf796/MKL_1_wt_1_S16_R1_001.fastq.gz
#OLD    MKL_1_wt_2_221216.fastq.gz -> ../221216_NB501882_0404_AHLVNMBGXM/ute/nf797/MKL_1_wt_2_S17_R1_001.fastq.gz

ln -s ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf655/MKL_1_derived_EV_miRNA_S1_R1_001.fastq.gz 2021_August_nf655_MKL-1_EV-miRNA.fastq.gz
ln -s ./230623_newDemulti_smallRNAs/210817_NB501882_0294_AHW5Y2BGXJ_smallRNA_Ute_newDemulti/2021_nf_ute_smallRNA/nf657/WaGa_derived_EV_miRNA_S2_R1_001.fastq.gz 2021_August_nf657_WaGa_EV-miRNA.fastq.gz

ln -s ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf774/0403_WaGa_wt_S1_R1_001.fastq.gz 2022_August_nf774_0403_WaGa_wt.fastq.gz
ln -s ./230623_newDemulti_smallRNAs/220617_NB501882_0371_AH7572BGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf780/0505_MKL1_wt_S2_R1_001.fastq.gz 2022_August_nf780_0505_MKL-1_wt.fastq.gz

ln -s ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf796/MKL-1_wt_1_S1_R1_001.fastq.gz 2022_November_nf796_MKL-1_wt_1.fastq.gz
ln -s ./230623_newDemulti_smallRNAs/221216_NB501882_0404_AHLVNMBGXM_smallRNA_Ute_newDemulti/2022_nf_ute_smallRNA/nf797/MKL-1_wt_2_S2_R1_001.fastq.gz 2022_November_nf797_MKL-1_wt_2.fastq.gz

ln -s ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf876/1002_WaGa_sT_Dox_S1_R1_001.fastq.gz    2023_June_nf876_1002_WaGa_sT_Dox.fastq.gz
ln -s ./230602_NB501882_0428_AHKG53BGXT/demulti_new/nf887/2312_MKL_1_scr_DMSO_S2_R1_001.fastq.gz 2023_June_nf887_2312_MKL-1_scr_DMSO.fastq.gz

main run

mkdir our_out
# -qc -ra TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -rb 4 #NOT_USED
# -mic -mtool Blast -mdb viruses #IGNORING Microbe Module since it is very time-consuming!

#jhuang@hamburg:~/DATA/Data_Ute/Data_Ute_smallRNA_5$ java -jar COMPSRA.jar -ref hg38 -qc -ra TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC -rb 4 -rh 20 -rt 20 -rr 20 -rlh 8,17 -aln -mt star -ann -ac 1,2,3,4,5,6 -mic -mtool Blast -mdb viruses -in 2021_August_nf655_MKL-1_EV-miRNA.fastq.gz -out ./our_out/
for sample in 2021_August_nf655_MKL-1_EV-miRNA 2021_August_nf657_WaGa_EV-miRNA    2022_August_nf774_0403_WaGa_wt 2022_August_nf780_0505_MKL-1_wt    2022_November_nf796_MKL-1_wt_1 2022_November_nf797_MKL-1_wt_2    2023_June_nf876_1002_WaGa_sT_Dox 2023_June_nf887_2312_MKL-1_scr_DMSO; do
  mkdir our_out/${sample}/
  java -jar COMPSRA.jar -ref hg38   -rh 20 -rt 20 -rr 20 -rlh 8,17 -aln -mt star -ann -ac 1,2,3,4,5,6  -in ${sample}.fastq.gz -out ./our_out/
done

prepare Data_Ute/Data_Ute_smallRNA_4/sample.list

2021_August_nf655_MKL-1_EV-miRNA
2021_August_nf657_WaGa_EV-miRNA
2022_August_nf774_0403_WaGa_wt
2022_August_nf780_0505_MKL-1_wt
2022_November_nf796_MKL-1_wt_1
2022_November_nf797_MKL-1_wt_2
2023_June_nf876_1002_WaGa_sT_Dox
2023_June_nf887_2312_MKL-1_scr_DMSO

extract the raw counts and perform statistical test on pre-defined groups.

#The following results calculate raw counts (Note: If you only want to merge the count files, you can use -fm -fms.)
java -jar COMPSRA.jar -ref hg38 -fun -fm -fms 1-8 -fdclass 1,2,3,4,5 -fdann -pro COMPSRA_MERGE -inf ./sample.list -out ./our_out/
#The following command calculate statistical test after defining case and control.
java -jar COMPSRA.jar -ref hg38 -fun -fd -fdclass 1,2,3,4,5,6 -fdcase 1-2 -fdctrl 3-6 -fdnorm cpm -fdtest mwu -fdann -pro COMPSRA_DEG -inf ./sample.list -out ./our_out/

sed -i -e 's/_August/-08/g' COMPSRA_MERGE_0_miRNA.txt
sed -i -e 's/_November/-11/g' COMPSRA_MERGE_0_miRNA.txt
sed -i -e 's/_June/-06/g' COMPSRA_MERGE_0_miRNA.txt
sed -i -e 's/_STAR_Aligned_miRNA.txt//g' COMPSRA_MERGE_0_miRNA.txt
#sed -i -e 's/_piRNA.txt//g' COMPSRA_MERGE_0_piRNA.txt
#sed -i -e 's/_tRNA.txt//g' COMPSRA_MERGE_0_tRNA.txt
#sed -i -e 's/_snoRNA.txt//g' COMPSRA_MERGE_0_snoRNA.txt
#sed -i -e 's/_snRNA.txt//g' COMPSRA_DEG_0_snRNA.txt
sed -i -e 's/_August/-08/g' COMPSRA_DEG_0_miRNA.txt
sed -i -e 's/_November/-11/g' COMPSRA_DEG_0_miRNA.txt
sed -i -e 's/_June/-06/g' COMPSRA_DEG_0_miRNA.txt
sed -i -e 's/_STAR_Aligned_miRNA.txt//g' COMPSRA_DEG_0_miRNA.txt

import pandas as pd
df = pd.read_csv('COMPSRA_MERGE_0_miRNA.txt', sep='\t', index_col=0)
df = df.drop(columns=['Unnamed: 9'])
# Assuming df is your DataFrame
df.loc['Sum'] = df.sum(numeric_only=True)

df.to_csv('COMPSRA_MERGE_0_miRNA_.txt', sep='\t')
df = pd.read_csv('COMPSRA_MERGE_0_piRNA.txt', sep='\t', index_col=0)
df = df.drop(columns=['Unnamed: 9'])
df.loc['Sum'] = df.sum(numeric_only=True)
df.to_csv('COMPSRA_MERGE_0_piRNA_.txt', sep='\t')
df = pd.read_csv('COMPSRA_MERGE_0_snoRNA.txt', sep='\t', index_col=0)
df = df.drop(columns=['Unnamed: 9'])
df.loc['Sum'] = df.sum(numeric_only=True)
df.to_csv('COMPSRA_MERGE_0_snoRNA_.txt', sep='\t')
df = pd.read_csv('COMPSRA_MERGE_0_snRNA.txt', sep='\t', index_col=0)
df = df.drop(columns=['Unnamed: 9'])
df.loc['Sum'] = df.sum(numeric_only=True)
df.to_csv('COMPSRA_MERGE_0_snRNA_.txt', sep='\t')
df = pd.read_csv('COMPSRA_MERGE_0_tRNA.txt', sep='\t', index_col=0)
df = df.drop(columns=['Unnamed: 9'])
df.loc['Sum'] = df.sum(numeric_only=True)
df.to_csv('COMPSRA_MERGE_0_tRNA_.txt', sep='\t')

#samtools flagstat **.bam
#47217410 + 0 in total (QC-passed reads + QC-failed reads)
#45166321 + 0 mapped (95.66% : N/A)
#2051089 + 0 in total (QC-passed reads + QC-failed reads)

#TODO: check the microRNA in the paper mentioned in which position?
#Single publications on EVs as transport vehicles for specific miRNAs in the pathogenesis of Merkel cell carcinoma have also been reported, such as miR-375 and its function in proliferation

~/Tools/csv2xls-0.4/csv_to_xls.py COMPSRA_MERGE_0_miRNA_.txt \
COMPSRA_MERGE_0_piRNA_.txt \
COMPSRA_MERGE_0_tRNA_.txt \
COMPSRA_MERGE_0_snoRNA_.txt \
COMPSRA_MERGE_0_snRNA_.txt \
-d$'\t' -o raw_counts.xls;

# sorting the DEG table, change the sheet labels to 'miRNA_between_columns_B-C_and_columns_D-G'
~/Tools/csv2xls-0.4/csv_to_xls.py COMPSRA_DEG_0_miRNA.txt -d$'\t' -o normalized_and_significance_test_miRNA.xls;

##merging the row counts and statical values
#cut -f1-1 COMPSRA_MERGE_0_snoRNA.txt > f1_MERGE
#cut -f1-1 COMPSRA_DEG_0_snoRNA.txt > f1_DEG
#cut -f1-1 COMPSRA_MERGE_0_miRNA.txt > f1_MERGE
#cut -f1-1 COMPSRA_DEG_0_miRNA.txt > f1_DEG
#diff f1_MERGE f1_DEG

calculate the number of total reads, total human reads and total non-human reads.

for sample in 2021_August_nf655_MKL-1_EV-miRNA 2021_August_nf657_WaGa_EV-miRNA    2022_August_nf774_0403_WaGa_wt 2022_August_nf780_0505_MKL-1_wt    2022_November_nf796_MKL-1_wt_1 2022_November_nf797_MKL-1_wt_2    2023_June_nf876_1002_WaGa_sT_Dox 2023_June_nf887_2312_MKL-1_scr_DMSO; do
  echo "--------------- ${sample} ---------------"
  samtools flagstat ./${sample}/${sample}_STAR_Aligned.out.bam
  samtools flagstat ./${sample}/${sample}_STAR_Aligned_UnMapped.bam
done

COMPSRA was composed of five functional modules: Quality Control, Alignment, Annotation, Microbe and Function. They are integrated into a pipeline and each module can also process independently.

Quality Control: To deal with fastq files and filter out the adapter sequences and reads with low quality.
- FASTQ files from the small RNA sequencing of biological samples are the default input.
- First, the adapter portions of a read are trimmed along with any randomized bases at ligation junctions that are produced by some small RNA-seq kits (e.g., NEXTflexT M Small RNA-Seq kit). The adapter sequences, typically situated at the 3′ (3-prime) end, vary across different kits. Some commonly employed adapter sequences are listed below:
  - TruSeq Small RNA (Illumina) TGGAATTCTCGGGTGCCAAGG
  - Small RNA Kits V1 (Illumina) TCGTATGCCGTCTTCTGCTTGT
  - Small RNA Kits V1.5 (Illumina) ATCTCGTATGCCGTCTTCTGCTTG
  - NEXTflex Small RNA Sequencing Kit v3 for Illumina Platforms (Bioo Scientific) TGGAATTCTCGGGTGCCAAGG
  - LEXOGEN Small RNA-Seq Library Prep Kit (Illumina) TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC
- The read quality of the remaining sequence is evaluated using its corresponding Phred score.
- Poor quality reads are removed according to quality control parameters set in the command line (-rh 20 –rt 20 –rr 20).
- Users can specify qualified reads of specific length intervals for input into subse-quent modules.
Alignment:
- To align the clean reads to the reference genome. COMPSRA uses STAR as its default RNA sequence aligner with default parameters which are customizable on the command line.
- Qualified reads from the QC module output are first mapped to the human genome hg19/hg38, and then aligned reads are quantified and annotated in the Annotation Module.
- Reads that could not be mapped to the human genome are saved into a FASTA file for input into the Microbe Module.
Annotation:
- To annotate different kinds of circulating RNAs based on the alignment result.
- COMPSRA currently uses several different small RNA databases for annotating human genome mapped reads and provides all the possible annotations:
  - miRBase (Kozomara and Griffiths-Jones, 2011) for miRNA;
  - piRNABank (Sai Lakshmi and Agrawal, 2008);
  - piRBase (Zhang, et al., 2014) and piRNACluster (Rosenkranz, 2016) for piRNA;
  - gtRNAdb (Chan and Lowe, 2016) for tRNA;
  - GENCODE release 27 (Harrow, et al., 2012) for snRNA and snoRNA;
  - circBase (Glazar, et al., 2014) for circular RNA.
- To conform the different reference human genome versions in these databases, we use an automatic LiftOver created by the UCSC Genome Browser Group.
- All the databases used are already pre-built, enabling speedy annotation.
Microbe:
- To predict the possible species of microbes existed in the samples.
- The qualified reads that could not be mapped to the human genome in the Alignment Module are aligned to the nucleotide (nt) database (Coordinators, 2013) from UCSC using BLAST.
- The four major microbial taxons archaea, bacteria, fungi and viruses are supported.
Function:
- To perform differential expression analysis and other functional studies to be extended.
- The read count of each RNA molecule that is identified in the Annotation Module is outputted as a tab-delimited text file according to RNA type.
- With more than one sample FASTQ file inputs, the output are further aggregated into a data matrix of RNA molecules as rows and samples as columns showing the read counts of an RNA molecule across different samples.
- The user can mark each sample FASTQ file column as either a case or a control in the command line, and perform a case versus control differential expression analysis for each RNA molecule using the Mann-Whitney rank sum test (Wilcoxon Rank Sum Test) as the default statistical test.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Leave a Reply Cancel reply