gene_x 0 like s 95 view s
Tags: pipeline
There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:
| Tool / Package | Enrichment | Custom Motifs | CLI or R? | RNA-specific? |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich** | ✅ | ✅ | R | ✅ | --> tried (see pipeline.v1-block3)!
| **RBPmap** | ✅ | ❌ (uses own db) | Web/CLI | ✅ | --> tried RBPmap, but it is too slow!
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅ | R | ❌ | #ATtRACT+Biostrings/TFBSTools (tried, pipeline.v1-block3)
| **rmap** | ✅ (CLIP-based) | ❌ | R | ✅ |
| **Homer** | ✅ | ✅ | CLI | ⚠ RNA optional |
| **MEME (AME, FIMO)** | ✅ | ✅ | Web/CLI | ⚠ Generic | --> Finally using ATtRACT+FIMO, AMD has BUG, not runnable!
#For me it was suggested to use “RBPmap” or “GraphProt” to do this analysis.
Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta
mRNA Transcript
┌────────────┬────────────┬────────────┐
│ 5′ UTR │ CDS │ 3′ UTR │
└────────────┴────────────┴────────────┘
↑ ↑ ↑ ↑
Start Start Stop End
of Codon Codon of
Transcript Transcript
✅ Option 1: Use GENCODE and python scripts (CHOSEN!)
#Input: up- and down-, all-regulated files
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt #20086
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt #634
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt #23832
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt #375
#Filtering the down-regulated genes to include only protein_coding genes before extracting 3' UTRs, because
#1. Only protein_coding genes have well-annotated 3' UTRs
#3' UTRs are defined as the region after the CDS (coding sequence) and before the poly-A tail.
#Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) do not have CDS, and therefore don't have canonical 3' UTRs.
#2. In GENCODE, most UTR annotations are only provided for transcripts of gene_type = "protein_coding".
cd ~/DATA/Data_Ute/RBPs_analysis/extract_3UTR_5UTR_CDS_transcript
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt
#Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
* GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
* Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v48.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up #5988
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down #93
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up #6538
✅ Option 2-5 see at the end!
Why 3' UTR?
🧬 miRNA, RBP, or translation/post-transcriptional regulation
➡️ Use 3' UTR sequences
Because:
Most miRNA binding and many RBP motifs are located in the 3' UTR.
It’s the primary region for mRNA stability, localization, and translation regulation.
🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
✅ Input = 3UTR.fasta
🧪 If you're testing PBRs related to:
- Translation initiation, upstream ORFs, or 5' cap interaction:
➡️ Use 5' UTR
- Coding mutations, protein-level motifs, or translational efficiency:
➡️ Use CDS
- General transcriptome-wide motif search (no preference):
➡️ Use transcripts, or test all regions separately to localize signal
Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)
RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.
To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.
1. Prepare foreground (target) and background sequences
Extract 3′ UTRs of:
📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs
⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt
cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt #19239
cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt #5988
cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt #112
cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt #13139
cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt #19239
cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt #6538
cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt #93
cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt #12608
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background
foreground.fasta: 你的目标(前景)序列,例如下调基因的 3′UTRs。
background.fasta: 你的背景对照序列,例如未显著差异表达的基因的 3′UTRs。
2. Run RBPmap separately on both sets (in total of 6 calculations)
* Submit both sets of UTRs to RBPmap.
* Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
* Choose all RBPs
* Download motif match outputs for both sets
3. Count motif hits per RBP in each set
You now have:
For each RBP:
a: number of target 3′ UTRs with a motif match
b: number of background 3′ UTRs with a motif match
c: total number of target 3′ UTRs
d: total number of background 3′ UTRs
4. Perform Fisher’s Exact Test per RBP
For each RBP, construct a 2x2 table:
Motif Present Motif Absent
Foreground (targets) a c - a
Background b d - b
5. Adjust p-values for multiple testing
Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
6.✅ Summary
Step Tool
Prepare Database of RNA-binding motifs ATtRACT
3′ UTR extraction extract_transcript_parts.py
Motif scan RBPmap or FIMO
Count motif hits Your own parser (Python or R)
Fisher’s exact test scipy.stats or fisher.test()
FDR correction multipletests() or p.adjust()
python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv
Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)
1. [ATtRACT + FIMO (MEME suite)]
ATtRACT: Database of RNA-binding motifs.
FIMO: Fast and scriptable motif scanning tool.
#Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences
grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt
#cut -f12 attract_human.txt | sort | uniq > valid_ids.txt
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta MKL-1_background.filtered.3UTR.fasta
✅ 筛选完成: 总序列 = 70650
🧹 已移除过短序列 (<16 nt): 1760
🟢 保留有效序列 (≥16 nt): 68890
📁 新背景文件保存为: MKL-1_background.filtered.3UTR.fasta
# 检查背景文件中有多少序列:
grep -c "^>" MKL-1_background.filtered.3UTR.fasta
68890
# 检查背景 FIMO 命中的总序列数:
cut -f3 fimo_background_MKL-1_background/fimo.tsv | sort | uniq | wc -l
67841
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta MKL-1_up.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta MKL-1_down.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta WaGa_background.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta WaGa_up.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta WaGa_down.filtered.3UTR.fasta
python convert_attract_pwm_to_meme.py
fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
#Explanation for the table from FIMO (Find Individual Motif Occurrences), which scans sequences to find statistically significant matches to known motifs (e.g., RNA or DNA binding sites).
Column Meaning
motif_id ID of the motif, as defined in the .meme file
motif_alt_id Alternative ID or name for the motif (may be blank or unused)
sequence_name Name of the sequence where the motif was found (e.g., gene
start Start position (1-based) of the motif match within the sequence
stop End position of the motif match
strand Strand on which the motif was found: + (forward) or - (reverse)
score Motif match score; higher scores indicate better matches
p-value Statistical significance of the match (lower is more significant)
q-value Adjusted p-value (False Discovery Rate corrected)
matched_sequence The actual sequence in the input that matches the motif
✅ Example Interpretation
1338 ENSG00000134871|ENST00000714397|3UTR 103 114 + 23.4126 5.96e-08 0.111 GGAGAGAAGGGA
motif_id: 1338 — a numeric ID from your motif file
sequence_name: ENSG00000134871|ENST00000714397|3UTR — refers to the gene, transcript, and region (3′ UTR)
start–stop: 103–114 — the motif occurs from position 103 to 114
strand: + — found on the positive strand
score: 23.41 — high score means strong motif match
p-value: 5.96e-08 — very statistically significant
q-value: 0.111 — FDR-corrected p-value
matched_sequence: GGAGAGAAGGGA — the actual sequence match in the UTR
💡 Tips
You can map motif_id to RBP (RNA-binding protein) names using an annotation file like ATtRACT_db.txt.
Typically, q-value < 0.05 is considered significant.
Duplicate matches in different transcripts of the same gene may occur and are valid.
Would you like help converting motif_id to RBP names for clarity?
🧠 In most biological contexts:
* Counting a motif as present multiple times because it's in several transcripts can inflate significance.
* If you're using Fisher's exact test (as in enrichment), this transcript-level duplication can bias results.
⚠️ Caveat: If you're studying isoform-specific regulation, then transcript-level data may be valuable and shouldn't be collapsed. But for most general RBP enrichment or gene expression studies, the gene-level collapse is preferred.
#Keep only one match per gene (based on Ensembl Gene ID like ENSG00000134871) for each RBP motif, even if multiple transcripts have hits.
#python filter_fimo_best_per_gene.py --input fimo_foreground/fimo.tsv --output fimo_foreground/fimo.filtered.tsv
convert_gtf_to_Gene_annotation_TSV_file.py #generate gene_annotation.tsv
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_MKL-1_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_MKL-1_down/fimo.filtered.tsv \
--output_annotated fimo_foreground_MKL-1_down/fimo.filtered.annotated.tsv
#21559
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_MKL-1_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_MKL-1_up/fimo.filtered.tsv \
--output_annotated fimo_foreground_MKL-1_up/fimo.filtered.annotated.tsv
#(736661 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_background_MKL-1_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_background_MKL-1_background/fimo.filtered.tsv \
--output_annotated fimo_background_MKL-1_background/fimo.filtered.annotated.tsv
#(1869075 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_WaGa_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_WaGa_down/fimo.filtered.tsv \
--output_annotated fimo_foreground_WaGa_down/fimo.filtered.annotated.tsv
#(20364 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_WaGa_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_WaGa_up/fimo.filtered.tsv \
--output_annotated fimo_foreground_WaGa_up/fimo.filtered.annotated.tsv
#(805634 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_background_WaGa_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_background_WaGa_background/fimo.filtered.tsv \
--output_annotated fimo_background_WaGa_background/fimo.filtered.annotated.tsv
#(1811615 rows)
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_MKL-1_up/fimo.filtered.tsv \
--fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
--output rbp_enrichment_MKL-1_up.csv \
--strategy inclusive
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_MKL-1_down/fimo.filtered.tsv \
--fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
--output rbp_enrichment_MKL-1_down.csv
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_WaGa_up/fimo.filtered.tsv \
--fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
--output rbp_enrichment_WaGa_up.csv
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_WaGa_down/fimo.filtered.tsv \
--fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
--output rbp_enrichment_WaGa_down.csv
python plot_volcano.py --csv rbp_enrichment_MKL-1_up.csv --output MKL-1_volcano_up.pdf --title "Upregulated MKL-1"
python plot_rbp_heatmap.py \
--csvs rbp_enrichment_MKL-1_up.csv rbp_enrichment_MKL-1_down.csv \
--labels Upregulated Downregulated \
--output MKL-1_rbp_enrichment_heatmap.pdf
#Column Meaning
#a Number of unique foreground UTRs hit by the RBP
#b Number of unique background UTRs hit by the RBP
#c Total number of foreground UTRs
#d Total number of background UTRs (⬅️ this is the value you're asking about)
#p_value, fdr From Fisher's exact test on enrichment
#-- Get all genes the number 1621 refers to --
#AGO2,1621,5050,5732,12987,1.0,1.0 #MKL-1_up
#motif_ids are 414 and 399
grep "^414" fimo.filtered.annotated.tsv > AGO2.txt
grep "^399" fimo.filtered.annotated.tsv >> AGO2.txt
cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
wc -l AGO2_uniq.txt
#1621 AGO2_uniq.txt
#工具 功能 关注点 应用场景
FIMO 精确查找 motif 出现位置 motif 在什么位置出现 找出具体结合位点
AME 统计 motif 富集情况 哪些 motif 在某组序列中更富集 比较 motif 是否显著出现更多
如你还在做差异表达后的RBP富集分析,可以考虑先用 FIMO 扫描,再用你自己写的代码 + Fisher’s exact test 做类似 AME 的工作,或直接用 AME 做分析
# Generate the attract_human.meme inkl. Gene_name!
#python generate_named_meme.py pwm.txt attract_human.txt
python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt
#ERROR during running ame --> DEBUG!
#--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
ame --control --shuffle-- \
--oc ame_out \
--scoring avg \
--method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme
2. GraphProt2 (ALTERNATIVE_TODO)
ML-based tool using sequence + structure
Pre-trained models for many RBPs
✅ Advantages:
Local, GPU/CPU supported
More biologically realistic (includes structure)
miRNAs motif analysis using ATtRACT + FIMO
✅ Goal
* Extract their sequences
* Generate a background set
* Run RBP enrichment (e.g., with RBPmap or FIMO)
* Get p-adjusted enrichment stats (e.g., Fisher + BH)
Input_1. DE results (differential expression file from smallRNA-seq)
#Input: up- and down-, all-regulated files
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-up.txt #83
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-down.txt #34
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-all.txt #1304
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-up.txt #66
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-down.txt #38
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-all.txt #1304
#Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.
Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
#From miRBase: https://mirbase.org/download/ https://mirbase.org/download/CURRENT/
##miRBase_v21
#mature.fa.gz → contains mature miRNA sequences
#hairpin.fa.gz → for pre-miRNAs
cp ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-*.txt .
#"hsa-miR-3180|hsa-miR-3180-3p"
#>hsa-miR-3180 MIMAT0018178 Homo sapiens miR-3180
#UGGGGCGGAGCUUCCGGAG
#>hsa-miR-3180-3p MIMAT0015058 Homo sapiens miR-3180-3p
#UGGGGCGGAGCUUCCGGAGGCC
5.1 (Optional, not used!)
#python extract_miRNA_fasta.py EV_vs_parental-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt #84+0
#python extract_miRNA_fasta.py EV_vs_parental-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt #0
#python extract_miRNA_fasta.py EV_vs_parental-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt #34+0
#python extract_miRNA_fasta.py EV_vs_parental-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt #0
#python extract_miRNA_fasta.py EV_vs_parental-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt #1304+16
#python extract_miRNA_fasta.py EV_vs_parental-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt #16
python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt #67+0
python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt #0
python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt #38+0
python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt #0
python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt #1304+16
python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt #16
5.2 (Advanced)
Extract Sequences + Background Set
Inputs:
* up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
* mature.fa or hairpin.fa from miRBase
Outputs:
* mirna_up.fa
* mirna_down.fa
* mirna_background.fa
#Use all remaining miRNAs as background:
python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna --full-background
mv mirna_background.fa mirna_full-background.fa
#Use random subset background. Note that the generated background has the number of maxsize(up, down), in the case is up (84 records):
python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna
# grep ">" mature_v21.fa | wc -l #35828
# grep ">" mirna_full-background.fa | wc -l #35710-->35723
# grep ">" mirna_up.fa | wc -l #84
# grep ">" mirna_down.fa | wc -l #34
# grep ">" mirna_background.fa | wc -l #84-->67
# #35,710 + 84 + 34 = 35,828
🔬 What You Can Do Next
Goal Tool Input
* RBP motif enrichment in pre-miRNAs RBPmap, FIMO, AME up_precursor_miRNAs.fa
* Motif comparison (up vs down miRNAs) DREME, MEME, HOMER Up/down mature miRNAs
* Build background for enrichment Random subset of other miRNAs Filtered from hairpin.fa
fimo --thresh 1e-4 --oc fimo_mirna_down attract_human.meme mirna_down.fa
fimo --thresh 1e-4 --oc fimo_mirna_up attract_human.meme mirna_up.fa
fimo --thresh 1e-4 --oc fimo_mirna_full-background attract_human.meme mirna_full-background.fa
fimo --thresh 1e-4 --oc fimo_mirna_background attract_human.meme mirna_background.fa
#END
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_down/fimo.filtered.tsv \
--output_annotated fimo_mirna_down/fimo.filtered.annotated.tsv #21
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_up/fimo.filtered.tsv \
--output_annotated fimo_mirna_up/fimo.filtered.annotated.tsv #48
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_full-background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_full-background/fimo.filtered.tsv \
--output_annotated fimo_mirna_full-background/fimo.filtered.annotated.tsv #896
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_background/fimo.filtered.tsv \
--output_annotated fimo_mirna_background/fimo.filtered.annotated.tsv #57
python run_enrichment_miRNAs.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_up/fimo.filtered.tsv \
--fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_up.csv \
--strategy inclusive
python run_enrichment_miRNAs.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_down/fimo.filtered.tsv \
--fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_down.csv \
--strategy inclusive
#python run_enrichment_miRNAs.py \
# --attract ATtRACT_db.txt \
# --fimo_fg fimo_mirna_up/fimo.filtered.tsv \
# --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
# --output rbp_enrichment_mirna_up_on_subset-background.csv \
# --strategy inclusive
#python run_enrichment_miRNAs.py \
# --attract ATtRACT_db.txt \
# --fimo_fg fimo_mirna_down/fimo.filtered.tsv \
# --fimo_bg fimo_mirna_background/fimo.filtered.tsv \
# --output rbp_enrichment_mirna_down_on_subset-background.csv \
# --strategy inclusive
#FXR2 1 (hsa-miR-92b-5p) 1 1 118 0.0168067226890756 0.365546218487395
#ORB2 1 (hsa-miR-4748) 1 1 118 0.0168067226890756 0.365546218487395
#-- Get all genes the number 1621 refers to --
grep "^FXR2" ATtRACT_db.txt
#motif_ids is M020_0.6
grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv > FXR2.txt
grep "^M020_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv
#cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
#wc -l AGO2_uniq.txt (1621 records)
grep "^ORB2" ATtRACT_db.txt
grep "^M120_0.6" fimo_mirna_up/fimo.filtered.annotated.tsv
RBP Enrichment from RBPmap Results (NOT implemented!) 🔹 Use RBPmap output (typically CSV or TSV) 🔹 Compare hit counts in input vs background 🔹 Perform Fisher's exact test + Benjamini-Hochberg correction 🔹 Plot significantly enriched RBPs
📁 Requirements
You’ll need:
File Description
rbpmap_up.tsv RBPmap result file for upregulated set
rbpmap_background.tsv RBPmap result file for background set
📝 These should have columns like:
Motif Name or Protein
Sequence Name or Sequence ID
(If different, I’ll show you how to adjust.
python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png
✅ Output
enriched_up.csv
RBP FG_hits BG_hits pval padj enriched
ELAVL1 24 2 0.0001 0.003 ✅
HNRNPA1 15 10 0.04 0.06 ❌
enriched_up_plot.png
Barplot showing top significant RBPs (lowest FDR)
🧰 Customization Options
Would you like:
* Support for multiple RBPmap files at once?
* To match by RBP family?
* A full report (PDF/HTML) of top hits?
* Let me know, and I’ll tailor the next script!
The planned pipeline is very reasonable and scientifically sound for RBP (RNA-binding protein) enrichment using miRNAs as input for FIMO motif scanning and downstream analysis.
Here’s a breakdown of why your steps make sense, plus a few important suggestions:
✅ YOUR PIPELINE LOGIC (EXCELLENT)
1. Motif Scanning with FIMO
fimo --thresh 1e-4 --oc fimo_mirna_* attract_human.meme mirna_*.fa
✔️ Uses ATtRACT motifs to scan miRNA sequences.
✔️ Applies a reasonable p-value threshold (1e-4) to detect motif hits.
📌 If you want more sensitivity (but more noise), you could try 1e-3; for higher stringency, 1e-5.
2. Filter FIMO hits to best per gene and annotate
python filter_fimo_best_per_gene_annotated.py --input ... --annot ... --output ...
✔️ Filters redundant hits.
✔️ Annotates with gene symbols or other metadata — essential for interpretability.
3. RBP Enrichment
python run_enrichment.py --fimo_fg ... --fimo_bg ... --attract ... --output ...
✔️ Uses foreground (up/down-regulated) and background (non-differential) miRNAs.
✔️ The --strategy inclusive is typical and includes all motif hits, not just strongest per gene or region.
✔️ Running up and down separately = ✅ best practice.
🔍 MINOR IMPROVEMENTS / SUGGESTIONS
✅ Use the same background for both up and down analyses:
You already are — good!
This ensures your enrichment comparison is balanced and interpretable.
⚠️ Check motif directionality (optional):
Some RBPs are strand-specific — but since miRNAs are single-stranded, and your FASTA is from mature sequences, you’re probably okay.
✅ Consider including summary stats:
How many motifs hit in up/down/background?
How many unique RBPs?
This helps interpret enrichment p-values.
🧪 Statistical validation (optional but helpful):
If you have many miRNAs (e.g. >30), consider bootstrapping enrichment from random subsets of the background to assess robustness.
✅ FINAL VERDICT
Your pipeline is:
✔️ Technically correct
✔️ Biologically appropriate
✔️ Reproducible and clear
🧠 Well-structured for miRNA-RBP motif enrichment
Why background differs between RNA-seq and miRNA motif analysis:
In RNA-seq differential expression analysis, especially for gene-level enrichment (e.g., GO or pathway analysis), the background typically includes all genes detected or expressed, not just a random subset. The logic is:
✅ “If a gene was testable (i.e., expressed above a threshold), it should be part of the universe we consider when evaluating enrichment.”
This is because the statistical model assumes:
Your universe is all genes tested.
You're asking: Are my differentially expressed genes overrepresented in some category, compared to all testable genes?
🔬 Why a random subset background was used in the miRNA motif pipeline (e.g., RBP enrichment):
In the prepare_miRNA_sets.py script, the subset was used for one reason only:
📉 To balance the number of background sequences relative to the foreground (up/down) — especially for statistical tests like Fisher's exact test.
But this is optional and not required for correctness. A larger background improves statistical power, but must match distributional properties (e.g., GC content, sequence length) to avoid bias.
✅ So: Should you use all non-differentially expressed miRNAs as background?
Yes, that is often more appropriate, assuming:
You have a full set of detectable/testable miRNAs (e.g., all from the FASTA).
You exclude the foreground (up/down) sets.
The background matches the general properties of the foreground set.
RBP enrichments via FIMO (Another description; it is the same to the workflow in the point 4)
1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.
2. Prepare Motif Database (MEME format)
* ATtRACT: https://attract.cnic.es
* RBPDB: http://rbpdb.ccbr.utoronto.ca
* Ray2013 (CISBP-RNA motifs) — available via MEME Suite
* [RBPmap motifs (if downloadable)]
#Example format: rbp_motifs.meme
2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)
fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
#This produces fimo.tsv in each output folder.
3. Run RBP motif enrichment using MEME Suite using AME (Analysis of Motif Enrichment). Note that FIMO+run_enrichment.py=AME, however, directly using AME returns ERROR:
ame \
--control control_3UTRs.fasta \
--oc ame_out \
--scoring avg \
--method fisher \
3UTR.fasta \
rbp_motifs.meme
Where:
* 3UTR.fasta = your downregulated genes’ 3′ UTRs
* control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
* rbp_motifs.meme = motif file from RBPDB or Ray2013
4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.
You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.
5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs
Use tools like: miRanda, TargetScan, miRDB
Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
6. Summary
Goal Input Tool / Approach
RBP enrichment 3UTR.fasta of downregulated genes AME with RBP motifs
Background/control 3′ UTRs from non-differential or upregulated genes
Link miRNA to targets Use TargetScan / miRanda Intersect with down genes
7. Would you like:
* Ready-to-use RBP motif .meme file?
* Script to generate background sequences?
* Visualization options for the enrichment results?
Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts
✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> Lasting too long!
Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61
Select:
Database: Ensembl Genes
Dataset: Homo sapiens genes (GRCh38 or latest)
Click on “Filters” → expand Region or Gene to limit your selection (optional).
Click on “Attributes”:
Under Sequences, check:
Sequences
3' UTR sequences
Optionally add gene IDs, transcript IDs, etc.
Click “Results” to view/download the FASTA of 3' UTRs.
✅ Option 3: Use GENCODE (precompiled annotations) and gffread
Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:
#gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
#gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf
grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf
gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa
gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa
Add -U flag to extract UTRs, and filter post hoc for only 3' UTRs if needed.
✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)
# Install if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicFeatures")
BiocManager::install("txdbmaker")
#sudo apt-get update
#sudo apt-get install libmariadb-dev
#(optional)sudo apt-get install libmysqlclient-dev
install.packages("RMariaDB")
# Load library
library(GenomicFeatures)
# Create TxDb object for human genome
txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")
# Extract 3' UTRs by transcript
utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)
# View or export as needed
✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
🔗 Website:
UCSC Table Browser
🔹 Step-by-Step Instructions
1. Set the basic parameters:
Clade: Mammal
Genome: Human
Assembly: GRCh38/hg38
Group: Genes and Gene Predictions
Track: GENCODE v44 (or latest)
Table: knownGene or wgEncodeGencodeBasicV44
Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE
2. Region:
Select: genome (default)
3. Output format:
Select: sequence
4. Click "get output"
🔹 Sequence Retrieval Options:
On the next page (after clicking "get output"), you’ll see sequence options.
Configure as follows:
✅ Output format: FASTA
✅ Which part of the gene: Select only
→ UTRs → 3' UTR only
✅ Header options: choose if you want gene name,
⚡️ Bonus: Combine with miRNA-mRNA predictions
Once you have RBPs enriched in downregulated mRNAs, you can intersect:
* Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
* Check if miRNAs and RBPs compete or co-bind
This can lead to identifying miRNA-RBP regulatory modules.
Reports
Please find attached the results of the RNA-binding protein (RBP) enrichment analysis using FIMO and the ATtRACT motif database, along with a brief description of the procedures used for both the 3′ UTR-based analysis (RNA-seq) and the miRNA-based analysis (small RNA-seq).
1. RBP Motif Enrichment from RNA-seq (3′ UTRs)
We focused on 3′ UTRs, as they are key regulatory regions for RBPs. Sequences shorter than 16 nucleotides were excluded. Using FIMO (from the MEME suite) with motifs from the ATtRACT database, we scanned both foreground and background 3′ UTR sets to identify motif occurrences.
Foreground: Differentially expressed transcripts (e.g., MKL-1 up/down, WaGa up/down)
Background: All non-differentially expressed transcripts
Analysis: Fisher’s exact test was used to assess motif enrichment; p-values were adjusted using the Benjamini–Hochberg method.
Output files (RNA-seq):
* rbp_enrichment_MKL-1_down.xlsx / .png
* rbp_enrichment_MKL-1_up.xlsx / .png
* rbp_enrichment_WaGa_down.xlsx / .png
* rbp_enrichment_WaGa_up.xlsx / .png
2. RBP Motif Enrichment from Small RNA-seq (miRNAs)
This analysis focused on differentially expressed miRNAs, using either mature miRNA sequences from miRBase. We scanned for RBP binding motifs within these sequences using FIMO and assessed motif enrichment relative to background sets.
Foreground: DE miRNAs (up/down) from small RNA-seq comparisons
Background: All other miRNAs from miRBase
Analysis: FIMO was used with --thresh 1e-4, followed by annotation and filtering. Enrichment was assessed using Fisher’s test + BH correction.
Output files (miRNAs):
* rbp_enrichment_mirna_down.xlsx
* rbp_enrichment_mirna_up.xlsx
How to Interpret the Numbers
Each row in the result tables represents one RBP and its enrichment statistics:
a: foreground genes/sequences with the motif
b: background genes/sequences with the motif
c: total number of foreground genes/sequences
d: total number of background genes/sequences
These values are used to compute p-values and FDRs.
For example, in rbp_enrichment_MKL-1_up.xlsx, AGO2 has a = 1621, meaning FIMO detected AGO2 motifs in 1,621 genes in the MKL-1 upregulated set. These genes are listed in AGO2_uniq.txt.
Similarly, for the miRNA analysis (e.g., rbp_enrichment_mirna_up.xlsx and rbp_enrichment_mirna_down.xlsx), the numbers represent counts of unique miRNAs with at least one significant motif hit. As examples, I calculated the detailed membership for FXR2 and ORB2.
点赞本文的读者
还没有人对此文章表态
没有评论
Variant calling for Data_Pietschmann_229ECoronavirus_Mutations_2025 (via docker own_viral_ngs)
How to debug and construct the docker docker own_viral_ngs?
Visualization and Export of miRNA Expression Profiles Using Manhattan Plots in R
Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v2)
© 2023 XGenes.com Impressum