Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v3)

gene_x 0 like s 331 view s

Tags: pipeline

There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:

| Tool / Package           | Enrichment        | Custom Motifs   | CLI or R? | RNA-specific?  |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich**            | ✅                 | ✅               | R         | ✅              | --> tried (see pipeline.v1-block3)!
| **RBPmap**               | ✅                 | ❌ (uses own db) | Web/CLI   | ✅              | --> tried RBPmap, but it is too slow!
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅               | R         | ❌              |  #ATtRACT+Biostrings/TFBSTools (tried, pipeline.v1-block3)
| **rmap**                 | ✅ (CLIP-based)    | ❌               | R         | ✅              |
| **Homer**                | ✅                 | ✅               | CLI       | ⚠ RNA optional |
| **MEME (AME, FIMO)**     | ✅                 | ✅               | Web/CLI   | ⚠ Generic      | --> Finally using ATtRACT+FIMO, AMD has BUG, not runnable!

#For me it was suggested to use “RBPmap” or “GraphProt” to do this analysis.

Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta

        mRNA Transcript
┌────────────┬────────────┬────────────┐
│   5′ UTR   │     CDS    │   3′ UTR   │
└────────────┴────────────┴────────────┘
↑            ↑            ↑            ↑
Start        Start        Stop         End
of           Codon       Codon        of
Transcript                             Transcript

✅ Option 1: Use GENCODE and python scripts (CHOSEN!)

#Input: up- and down-, all-regulated files
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt    #20086
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt  #634
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt     #23832
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt   #375

#Filtering the down-regulated genes to include only protein_coding genes before extracting 3' UTRs, because
#1. Only protein_coding genes have well-annotated 3' UTRs
#3' UTRs are defined as the region after the CDS (coding sequence) and before the poly-A tail.
#Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) do not have CDS, and therefore don't have canonical 3' UTRs.
#2. In GENCODE, most UTR annotations are only provided for transcripts of gene_type = "protein_coding".

cd ~/DATA/Data_Ute/RBPs_analysis/extract_3UTR_5UTR_CDS_transcript
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt

#Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
    * GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
    * Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v48.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz

python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up  #5988
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down  #93
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up  #6538

✅ Option 2-5 see at the end!

Why 3' UTR?

🧬 miRNA, RBP, or translation/post-transcriptional regulation
➡️ Use 3' UTR sequences

Because:

Most miRNA binding and many RBP motifs are located in the 3' UTR.

It’s the primary region for mRNA stability, localization, and translation regulation.

🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
✅ Input = 3UTR.fasta

🧪 If you're testing PBRs related to:
- Translation initiation, upstream ORFs, or 5' cap interaction:
➡️ Use 5' UTR

- Coding mutations, protein-level motifs, or translational efficiency:
➡️ Use CDS

- General transcriptome-wide motif search (no preference):
➡️ Use transcripts, or test all regions separately to localize signal

Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)

RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.

To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.

    1. Prepare foreground (target) and background sequences

        Extract 3′ UTRs of:

        📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs

        ⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes

            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
            grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt

            cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #5988
            cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #112
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt  #13139

            cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt  #19239
            cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt  #6538
            cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt  #93
            cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
            comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
            grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt  #12608

            python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
            python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background

            foreground.fasta: 你的目标（前景）序列，例如下调基因的 3′UTRs。
            background.fasta: 你的背景对照序列，例如未显著差异表达的基因的 3′UTRs。

    2. Run RBPmap separately on both sets (in total of 6 calculations)

        * Submit both sets of UTRs to RBPmap.
        * Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
        * Choose all RBPs
        * Download motif match outputs for both sets

    3. Count motif hits per RBP in each set

        You now have:
        For each RBP:
        a: number of target 3′ UTRs with a motif match
        b: number of background 3′ UTRs with a motif match
        c: total number of target 3′ UTRs
        d: total number of background 3′ UTRs

    4. Perform Fisher’s Exact Test per RBP

        For each RBP, construct a 2x2 table:

        Motif Present   Motif Absent
        Foreground (targets)    a   c - a
        Background  b   d - b

    5. Adjust p-values for multiple testing
    Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.

    6.✅ Summary

        Step    Tool
        Prepare Database of RNA-binding motifs  ATtRACT
        3′ UTR extraction   extract_transcript_parts.py
        Motif scan  RBPmap or FIMO
        Count motif hits    Your own parser (Python or R)
        Fisher’s exact test scipy.stats or fisher.test()
        FDR correction  multipletests() or p.adjust()

    python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv

Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)

1. [ATtRACT + FIMO (MEME suite)]

    ATtRACT: Database of RNA-binding motifs.
    FIMO: Fast and scriptable motif scanning tool.

    #Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences

    grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt

    #cut -f12 attract_human.txt | sort | uniq > valid_ids.txt

    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta MKL-1_background.filtered.3UTR.fasta
    ✅ 筛选完成: 总序列 = 70650
    🧹 已移除过短序列 (<16 nt): 1760
    🟢 保留有效序列 (≥16 nt): 68890
    📁 新背景文件保存为: MKL-1_background.filtered.3UTR.fasta
    # 检查背景文件中有多少序列：
    grep -c "^>" MKL-1_background.filtered.3UTR.fasta
    68890
    # 检查背景 FIMO 命中的总序列数：
    cut -f3 fimo_background_MKL-1_background/fimo.tsv | sort | uniq | wc -l
    67841
    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta MKL-1_up.filtered.3UTR.fasta
    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta MKL-1_down.filtered.3UTR.fasta
    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta WaGa_background.filtered.3UTR.fasta
    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta WaGa_up.filtered.3UTR.fasta
    python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta WaGa_down.filtered.3UTR.fasta

    python convert_attract_pwm_to_meme.py

    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
    fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta

    #Explanation for the table from FIMO (Find Individual Motif Occurrences), which scans sequences to find statistically significant matches to known motifs (e.g., RNA or DNA binding sites).

    Column  Meaning
    motif_id    ID of the motif, as defined in the .meme file
    motif_alt_id    Alternative ID or name for the motif (may be blank or unused)
    sequence_name   Name of the sequence where the motif was found (e.g., gene
    start   Start position (1-based) of the motif match within the sequence
    stop    End position of the motif match
    strand  Strand on which the motif was found: + (forward) or - (reverse)
    score   Motif match score; higher scores indicate better matches
    p-value Statistical significance of the match (lower is more significant)
    q-value Adjusted p-value (False Discovery Rate corrected)
    matched_sequence    The actual sequence in the input that matches the motif

    ✅ Example Interpretation
    1338 ENSG00000134871|ENST00000714397|3UTR 103 114 + 23.4126 5.96e-08 0.111 GGAGAGAAGGGA
    motif_id: 1338 — a numeric ID from your motif file
    sequence_name: ENSG00000134871|ENST00000714397|3UTR — refers to the gene, transcript, and region (3′ UTR)
    start–stop: 103–114 — the motif occurs from position 103 to 114
    strand: + — found on the positive strand
    score: 23.41 — high score means strong motif match
    p-value: 5.96e-08 — very statistically significant
    q-value: 0.111 — FDR-corrected p-value
    matched_sequence: GGAGAGAAGGGA — the actual sequence match in the UTR

    💡 Tips
    You can map motif_id to RBP (RNA-binding protein) names using an annotation file like ATtRACT_db.txt.
    Typically, q-value < 0.05 is considered significant.
    Duplicate matches in different transcripts of the same gene may occur and are valid.
    Would you like help converting motif_id to RBP names for clarity?

    🧠 In most biological contexts:
        * Counting a motif as present multiple times because it's in several transcripts can inflate significance.
        * If you're using Fisher's exact test (as in enrichment), this transcript-level duplication can bias results.

    ⚠️ Caveat: If you're studying isoform-specific regulation, then transcript-level data may be valuable and shouldn't be collapsed. But for most general RBP enrichment or gene expression studies, the gene-level collapse is preferred.

    #Keep only one match per gene (based on Ensembl Gene ID like ENSG00000134871) for each RBP motif, even if multiple transcripts have hits.
    #python filter_fimo_best_per_gene.py --input fimo_foreground/fimo.tsv --output fimo_foreground/fimo.filtered.tsv
    convert_gtf_to_Gene_annotation_TSV_file.py  #generate gene_annotation.tsv
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_foreground_MKL-1_down/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_foreground_MKL-1_down/fimo.filtered.tsv \
    --output_annotated fimo_foreground_MKL-1_down/fimo.filtered.annotated.tsv
    #21559
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_foreground_MKL-1_up/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_foreground_MKL-1_up/fimo.filtered.tsv \
    --output_annotated fimo_foreground_MKL-1_up/fimo.filtered.annotated.tsv
    #(736661 rows)
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_background_MKL-1_background/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_background_MKL-1_background/fimo.filtered.tsv \
    --output_annotated fimo_background_MKL-1_background/fimo.filtered.annotated.tsv
    #(1869075 rows)
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_foreground_WaGa_down/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_foreground_WaGa_down/fimo.filtered.tsv \
    --output_annotated fimo_foreground_WaGa_down/fimo.filtered.annotated.tsv
    #(20364 rows)
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_foreground_WaGa_up/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_foreground_WaGa_up/fimo.filtered.tsv \
    --output_annotated fimo_foreground_WaGa_up/fimo.filtered.annotated.tsv
    #(805634 rows)
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_background_WaGa_background/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_background_WaGa_background/fimo.filtered.tsv \
    --output_annotated fimo_background_WaGa_background/fimo.filtered.annotated.tsv
    #(1811615 rows)

    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_up/fimo.filtered.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
        --output rbp_enrichment_MKL-1_up.csv \
        --strategy inclusive
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_MKL-1_down/fimo.filtered.tsv \
        --fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
        --output rbp_enrichment_MKL-1_down.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_up/fimo.filtered.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
        --output rbp_enrichment_WaGa_up.csv
    python run_enrichment.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_foreground_WaGa_down/fimo.filtered.tsv \
        --fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
        --output rbp_enrichment_WaGa_down.csv

    python plot_volcano.py --csv rbp_enrichment_MKL-1_up.csv --output MKL-1_volcano_up.pdf --title "Upregulated MKL-1"
    python plot_rbp_heatmap.py \
    --csvs rbp_enrichment_MKL-1_up.csv rbp_enrichment_MKL-1_down.csv \
    --labels Upregulated Downregulated \
    --output MKL-1_rbp_enrichment_heatmap.pdf

    #Column Meaning
    #a  Number of unique foreground UTRs hit by the RBP
    #b  Number of unique background UTRs hit by the RBP
    #c  Total number of foreground UTRs
    #d  Total number of background UTRs (⬅️ this is the value you're asking about)
    #p_value, fdr   From Fisher's exact test on enrichment

    #-- Get all genes the number 1621 refers to --
    #AGO2,1621,5050,5732,12987,1.0,1.0   #MKL-1_up
    #motif_ids are 414 and 399
    grep "^414" fimo.filtered.annotated.tsv > AGO2.txt
    grep "^399" fimo.filtered.annotated.tsv >> AGO2.txt
    cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
    wc -l AGO2_uniq.txt
    #1621 AGO2_uniq.txt

    #工具 功能  关注点 应用场景
    FIMO    精确查找 motif 出现位置 motif 在什么位置出现   找出具体结合位点
    AME 统计 motif 富集情况   哪些 motif 在某组序列中更富集  比较 motif 是否显著出现更多

    如你还在做差异表达后的RBP富集分析，可以考虑先用 FIMO 扫描，再用你自己写的代码 + Fisher’s exact test 做类似 AME 的工作，或直接用 AME 做分析

    # Generate the attract_human.meme inkl. Gene_name!
    #python generate_named_meme.py pwm.txt attract_human.txt
    python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt

    #ERROR during running ame --> DEBUG!
    #--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
    ame --control --shuffle-- \
    --oc ame_out \
    --scoring avg \
    --method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme

2. GraphProt2 (ALTERNATIVE_TODO)

    ML-based tool using sequence + structure

    Pre-trained models for many RBPs

    ✅ Advantages:

    Local, GPU/CPU supported

    More biologically realistic (includes structure)

miRNAs motif analysis using ATtRACT + FIMO

✅ Goal

    * Extract their sequences
    * Generate a background set
    * Run RBP enrichment (e.g., with RBPmap or FIMO)
    * Get p-adjusted enrichment stats (e.g., Fisher + BH)

    Input_1. DE results (differential expression file from smallRNA-seq)
        #Input: up- and down-, all-regulated files
        #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-up.txt  #83
        #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-down.txt  #34
        #~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-all.txt  #1304
        ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-up.txt  #66
        ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-down.txt  #38
        ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-all.txt  #1304
        #Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.

    Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
        #From miRBase: https://mirbase.org/download/  https://mirbase.org/download/CURRENT/
        ##miRBase_v21
        #mature.fa.gz → contains mature miRNA sequences
        #hairpin.fa.gz → for pre-miRNAs

        cp ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-*.txt .
        #"hsa-miR-3180|hsa-miR-3180-3p"
        #>hsa-miR-3180 MIMAT0018178 Homo sapiens miR-3180
        #UGGGGCGGAGCUUCCGGAG
        #>hsa-miR-3180-3p MIMAT0015058 Homo sapiens miR-3180-3p
        #UGGGGCGGAGCUUCCGGAGGCC

    5.1 (Optional, not used!)

        #python extract_miRNA_fasta.py EV_vs_parental-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #84+0
        #python extract_miRNA_fasta.py EV_vs_parental-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
        #python extract_miRNA_fasta.py EV_vs_parental-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #34+0
        #python extract_miRNA_fasta.py EV_vs_parental-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
        #python extract_miRNA_fasta.py EV_vs_parental-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
        #python extract_miRNA_fasta.py EV_vs_parental-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16
        python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt  #67+0
        python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt  #0
        python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt  #38+0
        python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt  #0
        python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt         #1304+16
        python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt  #16

    5.2 (Advanced)
        Extract Sequences + Background Set

        Inputs:
            * up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
            * mature.fa or hairpin.fa from miRBase

        Outputs:
            * mirna_up.fa
            * mirna_down.fa
            * mirna_background.fa

        #Use all remaining miRNAs as background:
        python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna --full-background
        mv mirna_background.fa mirna_full-background.fa
        #Use random subset background. Note that the generated background has the number of maxsize(up, down), in the case is up (84 records):
        python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna
        # grep ">" mature_v21.fa | wc -l  #35828
        # grep ">" mirna_full-background.fa | wc -l  #35710-->35723
        # grep ">" mirna_up.fa | wc -l  #84
        # grep ">" mirna_down.fa | wc -l  #34
        # grep ">" mirna_background.fa | wc -l  #84-->67
        # #35,710 + 84 + 34 = 35,828

    🔬 What You Can Do Next
    Goal    Tool    Input
    * RBP motif enrichment in pre-miRNAs    RBPmap, FIMO, AME   up_precursor_miRNAs.fa
    * Motif comparison (up vs down miRNAs)  DREME, MEME, HOMER  Up/down mature miRNAs
    * Build background for enrichment   Random subset of other miRNAs   Filtered from hairpin.fa

    fimo --thresh 1e-4 --oc fimo_mirna_down attract_human.meme mirna_down.fa
    fimo --thresh 1e-4 --oc fimo_mirna_up attract_human.meme mirna_up.fa
    fimo --thresh 1e-4 --oc fimo_mirna_full-background attract_human.meme mirna_full-background.fa
    fimo --thresh 1e-4 --oc fimo_mirna_background attract_human.meme mirna_background.fa
    #END

    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_mirna_down/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_mirna_down/fimo.filtered.tsv \
    --output_annotated fimo_mirna_down/fimo.filtered.annotated.tsv  #21
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_mirna_up/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_mirna_up/fimo.filtered.tsv \
    --output_annotated fimo_mirna_up/fimo.filtered.annotated.tsv  #48
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_mirna_full-background/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_mirna_full-background/fimo.filtered.tsv \
    --output_annotated fimo_mirna_full-background/fimo.filtered.annotated.tsv  #896
    python filter_fimo_best_per_gene_annotated.py \
    --input fimo_mirna_background/fimo.tsv \
    --annot Homo_sapiens.GRCh38.gene_annotation.tsv \
    --output_filtered fimo_mirna_background/fimo.filtered.tsv \
    --output_annotated fimo_mirna_background/fimo.filtered.annotated.tsv  #57

    python run_enrichment_miRNAs.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
        --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
        --output rbp_enrichment_mirna_up.csv \
        --strategy inclusive
    python run_enrichment_miRNAs.py \
        --attract ATtRACT_db.txt \
        --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
        --fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
        --output rbp_enrichment_mirna_down.csv \
        --strategy inclusive
    #python run_enrichment_miRNAs.py \
    #    --attract ATtRACT_db.txt \
    #    --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
    #    --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
    #    --output rbp_enrichment_mirna_up_on_subset-background.csv \
    #    --strategy inclusive
    #python run_enrichment_miRNAs.py \
    #    --attract ATtRACT_db.txt \
    #    --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
    #    --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
    #    --output rbp_enrichment_mirna_down_on_subset-background.csv \
    #    --strategy inclusive

    #FXR2   1 (hsa-miR-92b-5p)  1   1   118 0.0168067226890756  0.365546218487395
    #ORB2   1 (hsa-miR-4748)    1   1   118 0.0168067226890756  0.365546218487395

    #-- Get all genes the number 1621 refers to --
    grep "^FXR2" ATtRACT_db.txt
    #motif_ids is M020_0.6
    grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv > FXR2.txt
    grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv
    #cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
    #wc -l AGO2_uniq.txt (1621 records)

    grep "^ORB2" ATtRACT_db.txt
    grep "^M120_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv

RBP Enrichment from RBPmap Results (NOT implemented!) 🔹 Use RBPmap output (typically CSV or TSV) 🔹 Compare hit counts in input vs background 🔹 Perform Fisher's exact test + Benjamini-Hochberg correction 🔹 Plot significantly enriched RBPs

    📁 Requirements
    You’ll need:

    File    Description
    rbpmap_up.tsv   RBPmap result file for upregulated set
    rbpmap_background.tsv   RBPmap result file for background set

    📝 These should have columns like:

    Motif Name or Protein

    Sequence Name or Sequence ID
    (If different, I’ll show you how to adjust.

    python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png

    ✅ Output
    enriched_up.csv
    RBP FG_hits BG_hits pval    padj    enriched
    ELAVL1  24  2   0.0001  0.003   ✅
    HNRNPA1 15  10  0.04    0.06    ❌

    enriched_up_plot.png
    Barplot showing top significant RBPs (lowest FDR)

    🧰 Customization Options
    Would you like:

        * Support for multiple RBPmap files at once?
        * To match by RBP family?
        * A full report (PDF/HTML) of top hits?
        * Let me know, and I’ll tailor the next script!

The planned pipeline is very reasonable and scientifically sound for RBP (RNA-binding protein) enrichment using miRNAs as input for FIMO motif scanning and downstream analysis.

Here’s a breakdown of why your steps make sense, plus a few important suggestions:

✅ YOUR PIPELINE LOGIC (EXCELLENT)
1. Motif Scanning with FIMO

fimo --thresh 1e-4 --oc fimo_mirna_* attract_human.meme mirna_*.fa
✔️ Uses ATtRACT motifs to scan miRNA sequences.

✔️ Applies a reasonable p-value threshold (1e-4) to detect motif hits.

📌 If you want more sensitivity (but more noise), you could try 1e-3; for higher stringency, 1e-5.

2. Filter FIMO hits to best per gene and annotate

python filter_fimo_best_per_gene_annotated.py --input ... --annot ... --output ...
✔️ Filters redundant hits.

✔️ Annotates with gene symbols or other metadata — essential for interpretability.

3. RBP Enrichment

python run_enrichment.py --fimo_fg ... --fimo_bg ... --attract ... --output ...
✔️ Uses foreground (up/down-regulated) and background (non-differential) miRNAs.

✔️ The --strategy inclusive is typical and includes all motif hits, not just strongest per gene or region.

✔️ Running up and down separately = ✅ best practice.

🔍 MINOR IMPROVEMENTS / SUGGESTIONS
✅ Use the same background for both up and down analyses:
You already are — good!

This ensures your enrichment comparison is balanced and interpretable.

⚠️ Check motif directionality (optional):
Some RBPs are strand-specific — but since miRNAs are single-stranded, and your FASTA is from mature sequences, you’re probably okay.

✅ Consider including summary stats:
How many motifs hit in up/down/background?

How many unique RBPs?

This helps interpret enrichment p-values.

🧪 Statistical validation (optional but helpful):
If you have many miRNAs (e.g. >30), consider bootstrapping enrichment from random subsets of the background to assess robustness.

✅ FINAL VERDICT
Your pipeline is:

✔️ Technically correct

✔️ Biologically appropriate

✔️ Reproducible and clear

🧠 Well-structured for miRNA-RBP motif enrichment

Why background differs between RNA-seq and miRNA motif analysis:

In RNA-seq differential expression analysis, especially for gene-level enrichment (e.g., GO or pathway analysis), the background typically includes all genes detected or expressed, not just a random subset. The logic is:

✅ “If a gene was testable (i.e., expressed above a threshold), it should be part of the universe we consider when evaluating enrichment.”

This is because the statistical model assumes:

Your universe is all genes tested.

You're asking: Are my differentially expressed genes overrepresented in some category, compared to all testable genes?

🔬 Why a random subset background was used in the miRNA motif pipeline (e.g., RBP enrichment):
In the prepare_miRNA_sets.py script, the subset was used for one reason only:

📉 To balance the number of background sequences relative to the foreground (up/down) — especially for statistical tests like Fisher's exact test.

But this is optional and not required for correctness. A larger background improves statistical power, but must match distributional properties (e.g., GC content, sequence length) to avoid bias.

✅ So: Should you use all non-differentially expressed miRNAs as background?
Yes, that is often more appropriate, assuming:

You have a full set of detectable/testable miRNAs (e.g., all from the FASTA).

You exclude the foreground (up/down) sets.

The background matches the general properties of the foreground set.

RBP enrichments via FIMO (Another description; it is the same to the workflow in the point 4)

1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.

2. Prepare Motif Database (MEME format)

    * ATtRACT: https://attract.cnic.es
    * RBPDB: http://rbpdb.ccbr.utoronto.ca
    * Ray2013 (CISBP-RNA motifs) — available via MEME Suite
    * [RBPmap motifs (if downloadable)]
    #Example format: rbp_motifs.meme

2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)

    fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
    fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
    fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
    #This produces fimo.tsv in each output folder.

3. Run RBP motif enrichment using MEME Suite using AME (Analysis of Motif Enrichment). Note that FIMO+run_enrichment.py=AME, however, directly using AME returns ERROR:

    ame \
    --control control_3UTRs.fasta \
    --oc ame_out \
    --scoring avg \
    --method fisher \
    3UTR.fasta \
    rbp_motifs.meme

    Where:

    * 3UTR.fasta = your downregulated genes’ 3′ UTRs
    * control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
    * rbp_motifs.meme = motif file from RBPDB or Ray2013

4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.

    You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.

5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs

    Use tools like: miRanda, TargetScan, miRDB

    Then intersect predicted targets with your downregulated genes to identify likely functional interactions.

6. Summary

    Goal    Input   Tool / Approach
    RBP enrichment  3UTR.fasta of downregulated genes   AME with RBP motifs
    Background/control  3′ UTRs from non-differential or upregulated genes
    Link miRNA to targets   Use TargetScan / miRanda    Intersect with down genes

7. Would you like:

    * Ready-to-use RBP motif .meme file?
    * Script to generate background sequences?
    * Visualization options for the enrichment results?

Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts

✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> Lasting too long!

    Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61

    Select:

    Database: Ensembl Genes
    Dataset: Homo sapiens genes (GRCh38 or latest)

    Click on “Filters” → expand Region or Gene to limit your selection (optional).
    Click on “Attributes”:
    Under Sequences, check:
    Sequences
    3' UTR sequences

    Optionally add gene IDs, transcript IDs, etc.

    Click “Results” to view/download the FASTA of 3' UTRs.

✅ Option 3: Use GENCODE (precompiled annotations) and gffread

    Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:

        #gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
        #gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf

        grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf
        gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
        bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa

        gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa

    Add -U flag to extract UTRs, and filter post hoc for only 3' UTRs if needed.

✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)

    # Install if not already installed
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("GenomicFeatures")
    BiocManager::install("txdbmaker")
    #sudo apt-get update
    #sudo apt-get install libmariadb-dev
    #(optional)sudo apt-get install libmysqlclient-dev
    install.packages("RMariaDB")

    # Load library
    library(GenomicFeatures)

    # Create TxDb object for human genome
    txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")

    # Extract 3' UTRs by transcript
    utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)

# View or export as needed

✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
    🔗 Website:
    UCSC Table Browser

    🔹 Step-by-Step Instructions
    1. Set the basic parameters:
    Clade: Mammal

    Genome: Human

    Assembly: GRCh38/hg38

    Group: Genes and Gene Predictions

    Track: GENCODE v44 (or latest)

    Table: knownGene or wgEncodeGencodeBasicV44

    Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE

    2. Region:
    Select: genome (default)

    3. Output format:
    Select: sequence

    4. Click "get output"
    🔹 Sequence Retrieval Options:
    On the next page (after clicking "get output"), you’ll see sequence options.

    Configure as follows:
    ✅ Output format: FASTA

    ✅ Which part of the gene: Select only
    → UTRs → 3' UTR only

    ✅ Header options: choose if you want gene name,

⚡️ Bonus: Combine with miRNA-mRNA predictions

Once you have RBPs enriched in downregulated mRNAs, you can intersect:
    * Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
    * Check if miRNAs and RBPs compete or co-bind
This can lead to identifying miRNA-RBP regulatory modules.

Reports

Please find attached the results of the RNA-binding protein (RBP) enrichment analysis using FIMO and the ATtRACT motif database, along with a brief description of the procedures used for both the 3′ UTR-based analysis (RNA-seq) and the miRNA-based analysis (small RNA-seq).

    1. RBP Motif Enrichment from RNA-seq (3′ UTRs)

    We focused on 3′ UTRs, as they are key regulatory regions for RBPs. Sequences shorter than 16 nucleotides were excluded. Using FIMO (from the MEME suite) with motifs from the ATtRACT database, we scanned both foreground and background 3′ UTR sets to identify motif occurrences.

    Foreground: Differentially expressed transcripts (e.g., MKL-1 up/down, WaGa up/down)
    Background: All non-differentially expressed transcripts

    Analysis: Fisher’s exact test was used to assess motif enrichment; p-values were adjusted using the Benjamini–Hochberg method.

    Output files (RNA-seq):

        * rbp_enrichment_MKL-1_down.xlsx / .png
        * rbp_enrichment_MKL-1_up.xlsx / .png
        * rbp_enrichment_WaGa_down.xlsx / .png
        * rbp_enrichment_WaGa_up.xlsx / .png


    2. RBP Motif Enrichment from Small RNA-seq (miRNAs)

    This analysis focused on differentially expressed miRNAs, using either mature miRNA sequences from miRBase. We scanned for RBP binding motifs within these sequences using FIMO and assessed motif enrichment relative to background sets.

    Foreground: DE miRNAs (up/down) from small RNA-seq comparisons
    Background: All other miRNAs from miRBase

    Analysis: FIMO was used with --thresh 1e-4, followed by annotation and filtering. Enrichment was assessed using Fisher’s test + BH correction.

    Output files (miRNAs):

        * rbp_enrichment_mirna_down.xlsx
        * rbp_enrichment_mirna_up.xlsx

    How to Interpret the Numbers
    Each row in the result tables represents one RBP and its enrichment statistics:

    a: foreground genes/sequences with the motif
    b: background genes/sequences with the motif
    c: total number of foreground genes/sequences
    d: total number of background genes/sequences

    These values are used to compute p-values and FDRs.

    For example, in rbp_enrichment_MKL-1_up.xlsx, AGO2 has a = 1621, meaning FIMO detected AGO2 motifs in 1,621 genes in the MKL-1 upregulated set. These genes are listed in AGO2_uniq.txt.

    Similarly, for the miRNA analysis (e.g., rbp_enrichment_mirna_up.xlsx and rbp_enrichment_mirna_down.xlsx), the numbers represent counts of unique miRNAs with at least one significant motif hit. As examples, I calculated the detailed membership for FXR2 and ORB2.

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v3)

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章