Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v2)

There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:

| Tool / Package           | Enrichment        | Custom Motifs   | CLI or R? | RNA-specific?  |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich**            | ✅                 | ✅               | R         | ✅              |
| **RBPmap**               | ✅                 | ❌ (uses own db) | Web/CLI   | ✅              |
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅               | R         | ❌              |
| **rmap**                 | ✅ (CLIP-based)    | ❌               | R         | ✅              |
| **Homer**                | ✅                 | ✅               | CLI       | ⚠ RNA optional |
| **MEME (AME, FIMO)**     | ✅                 | ✅               | Web/CLI   | ⚠ Generic      |

Notes:
* RBPmap ----> try RBPmap_results + enrichments!
* Biostrings/TFBSTools ----> use together with ATtRACT motifs (ATtRACT + Biostrings / TFBSTools)

Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta

        mRNA Transcript
┌────────────┬────────────┬────────────┐
│   5′ UTR   │     CDS    │   3′ UTR   │
└────────────┴────────────┴────────────┘
↑            ↑            ↑            ↑
Start        Start        Stop         End
of           Codon       Codon        of
Transcript                             Transcript
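
To make the diagram concrete, here is a minimal sketch of how the three regions follow from the CDS boundaries of a spliced transcript. The function name and the 0-based, end-exclusive coordinate convention are assumptions for illustration; this is not necessarily how extract_transcript_parts.py works internally.

```python
def split_transcript(seq, cds_start, cds_end):
    """Split a spliced mRNA sequence into 5' UTR, CDS and 3' UTR.

    cds_start and cds_end are 0-based, end-exclusive positions within the
    transcript (a convention chosen for this sketch). The stop codon is
    counted as part of the CDS here.
    """
    if not 0 <= cds_start <= cds_end <= len(seq):
        raise ValueError("CDS coordinates fall outside the transcript")
    return {
        "5UTR": seq[:cds_start],         # transcript start up to the start codon
        "CDS": seq[cds_start:cds_end],   # start codon through stop codon
        "3UTR": seq[cds_end:],           # after the stop codon to transcript end
    }


# Toy transcript: 6 nt 5' UTR + 9 nt CDS (ATG GCT TAA) + 6 nt 3' UTR
parts = split_transcript("GGCACCATGGCTTAAGTTTCA", 6, 15)
print(parts)
```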

✅ Option 1: Use GENCODE and python scripts (CHOSEN!)

~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt    #20086
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt  #634
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt     #23832
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt   #375

# Filter the up- and down-regulated gene lists to protein_coding genes before extracting 3' UTRs, because:
# 1. Only protein_coding genes have well-annotated 3' UTRs.
#    The 3' UTR is the region after the CDS (coding sequence) and before the poly-A tail.
#    Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) have no CDS and therefore no canonical 3' UTR.
# 2. In GENCODE, most UTR annotations are only provided for transcripts with gene_type = "protein_coding".

grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt

#Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
    * GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
    * Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v48.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up  #5988
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down  #93
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up  #6538

✅ Options 2-5: see the end of this post!

Why 3′ UTR?

🧬 miRNA, RBP, or translation/post-transcriptional regulation
➡️ Use 3' UTR sequences

Because:

Most miRNA binding and many RBP motifs are located in the 3' UTR.

It’s the primary region for mRNA stability, localization, and translation regulation.

🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
✅ Input = 3UTR.fasta

🧪 If you're testing RBPs related to:
- Translation initiation, upstream ORFs, or 5' cap interaction:
➡️ Use 5' UTR

- Coding mutations, protein-level motifs, or translational efficiency:
➡️ Use CDS

- General transcriptome-wide motif search (no preference):
➡️ Use transcripts, or test all regions separately to localize signal

Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)

RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.

To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.

    1. Prepare foreground (target) and background sequences

        Extract 3′ UTRs of:

        📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs

        ⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes

            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt

            cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #5988
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #112
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt  #13139

            cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #6538
            cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #93
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt  #12608

            python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
            python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background

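            foreground.fasta: your target (foreground) sequences, e.g., the 3′ UTRs of downregulated genes.
            background.fasta: your background control sequences, e.g., the 3′ UTRs of genes that are not significantly differentially expressed.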
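
The sort/comm/grep steps above can also be sketched in Python. This is a minimal illustration, assuming the gene ID sits in the first comma-separated field; the function names and toy IDs are hypothetical. One design difference: rows are selected by exact match on the first field, whereas `grep -Ff` matches the ID as a substring anywhere in the line and may therefore keep extra rows.

```python
def background_genes(all_genes, up_genes, down_genes):
    """Genes present in all_genes that are neither up- nor down-regulated."""
    regulated = set(up_genes) | set(down_genes)
    return sorted(set(all_genes) - regulated)


def select_rows(rows, keep_ids):
    """Keep rows whose first comma-separated field is exactly one of keep_ids."""
    keep = set(keep_ids)
    return [row for row in rows if row.split(",", 1)[0] in keep]


# Toy IDs, for illustration only
all_ids = ["G1", "G2", "G3", "G4", "G10"]
bg_ids = background_genes(all_ids, up_genes=["G2"], down_genes=["G4", "G10"])

rows = ["G1,1.0", "G10,2.0", "G3,0.5"]
bg_rows = select_rows(rows, bg_ids)
print(bg_ids, bg_rows)
```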

    2. Run RBPmap separately on each set (6 runs in total)

        * Submit both sets of UTRs to RBPmap.
        * Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
        * Choose all RBPs
        * Download motif match outputs for both sets

    3. Count motif hits per RBP in each set

        You now have:
        For each RBP:
        a: number of target 3′ UTRs with a motif match
        b: number of background UTRs with a motif match
        c: total number of target UTRs
        d: total number of background UTRs
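
As a minimal sketch of this counting step, assuming the scan output has already been reduced to (RBP, sequence ID) pairs: each UTR is counted once per RBP, however many motif matches it contains. The function names and toy hits are hypothetical.

```python
from collections import defaultdict


def count_sequences_with_hits(hits):
    """hits: iterable of (rbp, sequence_id) -> {rbp: number of distinct sequences}."""
    seqs = defaultdict(set)
    for rbp, seq_id in hits:
        seqs[rbp].add(seq_id)  # a set ignores repeated matches in one UTR
    return {rbp: len(ids) for rbp, ids in seqs.items()}


def contingency_counts(fg_hits, bg_hits, n_fg, n_bg):
    """Return {rbp: (a, b, c, d)} using the a/b/c/d definitions above."""
    fg = count_sequences_with_hits(fg_hits)
    bg = count_sequences_with_hits(bg_hits)
    return {
        rbp: (fg.get(rbp, 0), bg.get(rbp, 0), n_fg, n_bg)
        for rbp in sorted(set(fg) | set(bg))
    }


# Toy input: ELAVL1 matches utr1 twice, which still counts as one UTR
fg_hits = [("ELAVL1", "utr1"), ("ELAVL1", "utr1"), ("ELAVL1", "utr2"), ("PUM1", "utr3")]
bg_hits = [("ELAVL1", "utrA")]
counts = contingency_counts(fg_hits, bg_hits, n_fg=3, n_bg=4)
print(counts)
```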

    4. Perform Fisher’s Exact Test per RBP

        For each RBP, construct a 2x2 table:

                                Motif Present   Motif Absent
        Foreground (targets)    a               c - a
        Background              b               d - b
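
For illustration, the one-sided (enrichment) p-value for this table can be computed directly from the hypergeometric distribution. The sketch below uses only the standard library (Python 3.8+ for `math.comb`) and a hypothetical function name; in a real analysis `scipy.stats.fisher_exact` with `alternative="greater"` or R's `fisher.test()` would be the usual choice.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for motif enrichment in the foreground.

    a: foreground UTRs with a motif match   c: total foreground UTRs
    b: background UTRs with a motif match   d: total background UTRs
    Returns P(X >= a), X being the number of motif-containing UTRs that land
    in the foreground when c UTRs are drawn from all c + d UTRs.
    """
    total = c + d
    with_motif = a + b
    upper = min(c, with_motif)
    numerator = sum(
        comb(with_motif, k) * comb(total - with_motif, c - k)
        for k in range(a, upper + 1)
    )
    return min(numerator / comb(total, c), 1.0)


# Toy table: 3 of 4 foreground UTRs and 1 of 6 background UTRs carry the motif
p = fisher_enrichment_p(a=3, b=1, c=4, d=6)
print(p)
```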

    5. Adjust p-values for multiple testing
    Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
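
A minimal pure-Python sketch of the Benjamini-Hochberg adjustment, intended to mirror what `p.adjust(p, method = "BH")` in R computes (the function name here is hypothetical): each sorted p-value is multiplied by n/rank, then monotonicity is enforced from the largest p-value downwards.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values, one per RBP
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```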

    6. ✅ Summary

        Step                                      Tool
        Prepare database of RNA-binding motifs    ATtRACT
        3′ UTR extraction                         extract_transcript_parts.py
        Motif scan                                RBPmap or FIMO
        Count motif hits                          Your own parser (Python or R)
        Fisher’s exact test                       scipy.stats or fisher.test()
        FDR correction                            multipletests() or p.adjust()

    python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv

Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)

1. [ATtRACT + FIMO (MEME suite)]

    ATtRACT: Database of RNA-binding motifs.
    FIMO: Fast and scriptable motif scanning tool.

    #Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences

    grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt

    #cut -f12 attract_human.txt | sort | uniq > valid_ids.txt

    python convert_attract_pwm_to_meme.py

    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
    #end
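
Each FIMO run above writes a fimo.tsv into its output folder. Below is a minimal sketch of turning that file into (motif, sequence) pairs for counting. It assumes the column names `motif_id`, `sequence_name` and `p-value` as written by recent MEME Suite versions, plus trailing comment lines starting with `#`; check the header of your own file. The example rows are made up.

```python
import csv


def read_fimo_hits(lines, max_p=1e-4):
    """Yield (motif_id, sequence_name) for FIMO matches with p-value <= max_p."""
    # Skip blank lines and the '#' comment footer before parsing the table
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for row in csv.DictReader(data, delimiter="\t"):
        if float(row["p-value"]) <= max_p:
            yield row["motif_id"], row["sequence_name"]


# Made-up rows in the assumed fimo.tsv layout
example = (
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n"
    "M001\tELAVL1\tutr1\t5\t11\t+\t12.3\t2e-05\t0.01\tTTTATTT\n"
    "M001\tELAVL1\tutr1\t40\t46\t+\t11.0\t8e-05\t0.02\tTTTGTTT\n"
    "M002\tPUM1\tutr2\t7\t14\t+\t9.1\t5e-04\t0.2\tTGTAAATA\n"
    "\n"
    "# illustrative comment footer\n"
)
hits = list(read_fimo_hits(example.splitlines()))
print(hits)
```

With a real file, `read_fimo_hits(open("fimo_foreground_WaGa_down/fimo.tsv"))` would be the equivalent call.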

    #TODO_TOMORROW: mv PBS_analysis RBP_analysis

    #Test
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_down/fimo.tsv \
        --fimo_bg fimo_foreground2/fimo.tsv \
        --output rbp_enrichment_test.csv

    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_up/fimo.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.tsv \
        --output rbp_enrichment_MKL-1_up.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_down/fimo.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.tsv \
        --output rbp_enrichment_MKL-1_down.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_up/fimo.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.tsv \
        --output rbp_enrichment_WaGa_up.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_down/fimo.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.tsv \
        --output rbp_enrichment_WaGa_down.csv

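    #Tool   Function                                Focus                                              Use case
    FIMO    Locates exact motif occurrences         Where a motif occurs                               Finding specific binding sites
    AME     Tests motif enrichment statistically    Which motifs are enriched in a set of sequences    Checking whether a motif occurs significantly more often

    If you are doing RBP enrichment analysis after differential expression, you can scan with FIMO first and then use your own code + Fisher’s exact test to do roughly what AME does, or simply run AME directly.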

    # Generate attract_human.meme incl. Gene_name!
    #python generate_named_meme.py pwm.txt attract_human.txt
    python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt
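
The conversion script itself is not shown here, so the following is only a sketch of what writing a minimal MEME-format motif file can look like. It assumes each motif is available as rows of (A, C, G, T) probabilities; the function name, the uniform background and the `nsites= 20` placeholder are assumptions, not the content of generate_attract_human_meme.py.

```python
def pwm_to_meme(motifs, background=(0.25, 0.25, 0.25, 0.25)):
    """motifs: list of (motif_id, gene_name, rows), rows = [(pA, pC, pG, pT), ...].

    Returns the text of a minimal MEME-format motif file.
    """
    out = [
        "MEME version 4", "",
        "ALPHABET= ACGT", "",
        "strands: +", "",
        "Background letter frequencies",
        "A %.3f C %.3f G %.3f T %.3f" % tuple(background), "",
    ]
    for motif_id, gene_name, rows in motifs:
        out.append(f"MOTIF {motif_id} {gene_name}")
        # nsites is a placeholder; the true site count is not known here
        out.append(f"letter-probability matrix: alength= 4 w= {len(rows)} nsites= 20 E= 0")
        out.extend(" ".join(f"{p:.6f}" for p in row) for row in rows)
        out.append("")
    return "\n".join(out)


# Toy two-position motif, for illustration only
text = pwm_to_meme([("M001", "ELAVL1", [(0.1, 0.1, 0.1, 0.7), (0.0, 0.0, 0.0, 1.0)])])
print(text)
```

Since RNA is single-stranded, it may also be worth considering FIMO's `--norc` option so that only the given strand is scanned.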

    #ERROR during running ame --> DEBUG!
    #--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
    ame --control --shuffle-- \
    --oc ame_out \
    --scoring avg \
    --method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme

2. GraphProt2 (ALTERNATIVE_TODO)

    ML-based tool using sequence + structure

    Pre-trained models for many RBPs

    ✅ Advantages:

    Local, GPU/CPU supported

    More biologically realistic (includes structure)

miRNA motif analysis using ATtRACT + FIMO

✅ Goal

    * Extract their sequences
    * Generate a background set
    * Run RBP enrichment (e.g., with RBPmap or FIMO)
    * Get p-adjusted enrichment stats (e.g., Fisher + BH)

    5.1 (Optional)
    Input_1. DE results (differential expression file from smallRNA-seq)
        Example file: smallRNA_upregulated.txt
        Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.

    Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
        From miRBase:
        mature.fa.gz → contains mature miRNA sequences
        hairpin.fa.gz → for pre-miRNAs

        python extract_miRNA_fasta.py smallRNA_upregulated.txt mature.fa up_mature_miRNAs.fa
        python extract_miRNA_fasta.py smallRNA_downregulated.txt hairpin.fa down_precursor_miRNAs.fa

    5.2 (Advanced)
        Extract Sequences + Background Set

        Inputs:
            * up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
            * mature.fa or hairpin.fa from miRBase

        Outputs:
            * mirna_up.fa
            * mirna_down.fa
            * mirna_background.fa

        python prepare_miRNA_sets.py up_miRNA.txt down_miRNA.txt mature.fa mirna
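
prepare_miRNA_sets.py is not listed in this post; the sketch below only illustrates the idea under stated assumptions: FASTA record names are the first word of the header (as in miRBase mature.fa), and the background is every other miRNA of the same species that is neither up- nor down-regulated. All function names are hypothetical and the input is a toy example.

```python
def read_fasta(lines):
    """Parse FASTA lines into {record_name: sequence}; name = first header word."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def split_mirna_sets(records, up_ids, down_ids, species_prefix="hsa-"):
    """Return (up, down, background) dicts of miRNA sequences."""
    up_ids, down_ids = set(up_ids), set(down_ids)
    up = {n: s for n, s in records.items() if n in up_ids}
    down = {n: s for n, s in records.items() if n in down_ids}
    background = {
        n: s for n, s in records.items()
        if n.startswith(species_prefix) and n not in up_ids and n not in down_ids
    }
    return up, down, background


# Toy input in miRBase header style
fasta = """>hsa-miR-21-5p example
UAGCUUAUCAGACUGAUGUUGA
>hsa-let-7a-5p example
UGAGGUAGUAGGUUGUAUAGUU
>mmu-miR-1a-3p example
UGGAAUGUAAAGAAGUAUGUAU
>hsa-miR-16-5p example
UAGCAGCACGUAAAUAUUGGCG
""".splitlines()

up, down, background = split_mirna_sets(
    read_fasta(fasta), up_ids=["hsa-miR-21-5p"], down_ids=["hsa-miR-16-5p"]
)
print(sorted(up), sorted(down), sorted(background))
```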

    🔬 What You Can Do Next

    Goal                                      Tool                             Input
    * RBP motif enrichment in pre-miRNAs      RBPmap, FIMO, AME                up_precursor_miRNAs.fa
    * Motif comparison (up vs down miRNAs)    DREME, MEME, HOMER               Up/down mature miRNAs
    * Build background for enrichment         Random subset of other miRNAs    Filtered from hairpin.fa

    ✅ RBP Enrichment from RBPmap Results
    🔹 Use RBPmap output (typically CSV or TSV)
    🔹 Compare hit counts in input vs background
    🔹 Perform Fisher's exact test + Benjamini-Hochberg correction
    🔹 Plot significantly enriched RBPs

    📁 Requirements
    You’ll need:

    File                     Description
    rbpmap_up.tsv            RBPmap result file for upregulated set
    rbpmap_background.tsv    RBPmap result file for background set

    📝 These should have columns like:

    Motif Name or Protein

    Sequence Name or Sequence ID

    python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png

    ✅ Output
    enriched_up.csv
    RBP      FG_hits  BG_hits  pval    padj   enriched
    ELAVL1   24       2        0.0001  0.003  ✅
    HNRNPA1  15       10       0.04    0.06   ❌

    enriched_up_plot.png
    Barplot showing top significant RBPs (lowest FDR)


RBP enrichments via FIMO (the same as the workflow in 4)

1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.

2. Prepare Motif Database (MEME format)

    * ATtRACT: https://attract.cnic.es
    * RBPDB: http://rbpdb.ccbr.utoronto.ca
    * Ray2013 (CISBP-RNA motifs) — available via MEME Suite
    * [RBPmap motifs (if downloadable)]
    #Example format: rbp_motifs.meme

2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)

    fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
    fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
    fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
    #This produces fimo.tsv in each output folder.

3. Run RBP motif enrichment with AME (Analysis of Motif Enrichment) from the MEME Suite:

    ame \
    --control control_3UTRs.fasta \
    --oc ame_out \
    --scoring avg \
    --method fisher \
    3UTR.fasta \
    rbp_motifs.meme

    Where:

    * 3UTR.fasta = your downregulated genes’ 3′ UTRs
    * control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
    * rbp_motifs.meme = motif file from RBPDB or Ray2013

4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.

    You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.

5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs

    Use tools like: miRanda, TargetScan, miRDB

    Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
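
    A minimal sketch of that intersection, assuming the predictions have already been parsed into a mapping from miRNA to predicted target genes. The function name, the data layout and the placeholder gene names are hypothetical; the actual output formats of miRanda, TargetScan and miRDB differ.

```python
def likely_targets(predicted, down_genes):
    """predicted: {mirna: iterable of predicted target genes}.

    Returns {mirna: sorted list of predicted targets that are downregulated},
    dropping miRNAs without any overlap.
    """
    down = set(down_genes)
    overlap = {m: sorted(set(t) & down) for m, t in predicted.items()}
    return {m: genes for m, genes in overlap.items() if genes}


# Placeholder names, for illustration only
predicted = {
    "miR-A": ["GENE1", "GENE2", "GENE3"],
    "miR-B": ["GENE4"],
}
result = likely_targets(predicted, down_genes=["GENE3", "GENE1", "GENE9"])
print(result)
```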

6. Summary

    Goal                     Input                                                 Tool / Approach
    RBP enrichment           3UTR.fasta of downregulated genes                     AME with RBP motifs
    Background/control       3′ UTRs from non-differential or upregulated genes
    Link miRNA to targets    Use TargetScan / miRanda                              Intersect with down genes


Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts

✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> Takes too long!

    Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61

    Select:

    Database: Ensembl Genes
    Dataset: Homo sapiens genes (GRCh38 or latest)

    Click on “Filters” → expand Region or Gene to limit your selection (optional).
    Click on “Attributes”:
    Under Sequences, check:
    Sequences
    3' UTR sequences

    Optionally add gene IDs, transcript IDs, etc.

    Click “Results” to view/download the FASTA of 3' UTRs.

✅ Option 3: Use GENCODE (precompiled annotations) and gffread

    Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:

        #gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
        #gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf

        grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf
        gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
        bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa

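        gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_with_utrs.fa

    Note: in gffread, -w writes the spliced exon sequence of each transcript (UTRs included), while -U discards single-exon transcripts and does not extract UTRs. To get only 3' UTRs, trim the transcript sequences post hoc or use the GTF + bedtools route above.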

✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)

    # Install if not already installed
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("GenomicFeatures")
    BiocManager::install("txdbmaker")
    #sudo apt-get update
    #sudo apt-get install libmariadb-dev
    #(optional)sudo apt-get install libmysqlclient-dev
    install.packages("RMariaDB")

    # Load library
    library(GenomicFeatures)

    # Create TxDb object for human genome
    txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")

    # Extract 3' UTRs by transcript
    utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)

    # View or export as needed

✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
    🔗 Website:
    UCSC Table Browser

    🔹 Step-by-Step Instructions
    1. Set the basic parameters:
    Clade: Mammal

    Genome: Human

    Assembly: GRCh38/hg38

    Group: Genes and Gene Predictions

    Track: GENCODE v44 (or latest)

    Table: knownGene or wgEncodeGencodeBasicV44

    Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE

    2. Region:
    Select: genome (default)

    3. Output format:
    Select: sequence

    4. Click "get output"
    🔹 Sequence Retrieval Options:
    On the next page (after clicking "get output"), you’ll see sequence options.

    Configure as follows:
    ✅ Output format: FASTA

    ✅ Which part of the gene: Select only
    → UTRs → 3' UTR only

    ✅ Header options: choose if you want gene name,

⚡️ Bonus: Combine with miRNA-mRNA predictions

Once you have RBPs enriched in downregulated mRNAs, you can intersect:
    * Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
    * Check if miRNAs and RBPs compete or co-bind
This can lead to identifying miRNA-RBP regulatory modules.
