RNA-seq + Proteomics Integration Workflow (Δsbp; MH & TSB)

cd results/star_salmon/degenes

The article uses the following Python scripts (please see Data_Michelle_RNAseq_2025_scripts.zip).

* map_orf_1to1_progress_safe.py
* map_orf_1to1_progress.py (deprecated)
* check_missing_uniprot_fastas.py
* rescue_missing_uniprot.py
* compare_transcript_protein.py
* make_rna_only_present_excels.py (needs debugging)
* make_rna_only_not_in_protDE.py (needs debugging)
* map_symbols.py (deprecated)

0) Goal

Match IDs: convert UniProt protein IDs to gene symbols to allow direct comparison with RNA-seq.
Merge datasets by gene_symbol and retain: protein_log2FC, protein_padj, rna_log2FC, rna_padj.
Classify each gene/protein into:
- both_significant_same_direction
- both_significant_opposite_direction
- protein_only_significant
- rna_only_significant
Subset analysis: examine proteins significant in proteomics but not in RNA-seq (and vice versa) to identify post-transcriptional regulation, stability, secretion/export, or translational effects.

Comparisons (process first: 1–6)

MH medium

Index	Comparison (normalized)	RNA-seq	Proteomics
1	Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)	✓	✓
2	Δsbp_4h_vs_WT_4h	✓	✓
3	Δsbp_18h_vs_WT_18h	✓	✓

TSB medium

Index	Comparison (normalized)	RNA-seq	Proteomics
4	Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)	✓	✓
5	Δsbp_4h_vs_WT_4h	✓	✓
6	Δsbp_18h_vs_WT_18h	✓	✓
7	Δsbp_1h_vs_Δsbp_4h	–	✓
8	Δsbp_4h_vs_Δsbp_18h	–	✓
9	WT_1h_vs_WT_4h	–	✓
10	WT_4h_vs_WT_18h	–	✓

1) Environment

mamba activate rnaseq_prot_comparison
mamba install -c conda-forge biopython pandas openpyxl requests requests-cache tqdm -y
mamba activate rnaseq_prot_comparison

2) Populate cache once (per‑ID fetch via aligner)

# MH (example outcome: mapped 589)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh.csv

# Expected console example:
# ORFs translated: 2319; proteins: 616; mapped: 589
# Output: map_mh.csv

3) Check which UniProt IDs are missing (cache and stream)

python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report.txt

# Example:
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1296
# Missing in cache: 485
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

4) Rescue missing accessions (UniSave → UniParc → UniProtKB remap)

After each run, check rescue_summary.txt and rerun the checker before redoing alignments.

# 4.1) Rescue cache-missing
python rescue_missing_uniprot.py --miss_file missing_cache.txt --cache cache/uniprot_fastas

# 4.2) Re-check
python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report_after_rescue.txt

# Example (OK to proceed):
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1781
# Missing in cache: 0
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

# 4.3) Optional: rescue stream-missing
python rescue_missing_uniprot.py --miss_file missing_stream.txt --cache cache/uniprot_fastas

Notes:

If “Present via UniProt stream now” is 0, it usually indicates a streaming query/parse/limit issue; cached per‑accession FASTAs are valid and sufficient for alignment.

5) Rerun the aligner with rescued FASTAs

# MH final
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh_final.csv

# TSB (cache already populated; re-run if desired)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb \
  --min_aa 10 \
  --requests_cache \
  --out map_tsb.csv

6) Final comparison (merge, classify, export)

# Build merged comparisons (1–6), classify categories
python compare_transcript_protein.py \
  --map_mh map_mh_final.csv \
  --map_tsb map_tsb.csv \
  --out_dir results_compare \
  --alpha 0.05

# Convert CSVs to a single Excel workbook
~/Tools/csv2xls-0.4/csv_to_xls.py \
  summary_counts.csv MH_2h_merged.csv MH_4h_merged.csv MH_18h_merged.csv \
  TSB_2h_merged.csv TSB_4h_merged.csv TSB_18h_merged.csv \
  -d',' -o RNAseq-Proteomics_deltasbp_MH-TSB_2h4h18h_summary_20251002.xlsx

# Optional: export RNA-only-present genes per comparison (empty proteomics fields)
python make_rna_only_present_excels.py --out_dir rna_only_present_excels

7) Why counts differ (e.g., 1173(1319) vs 1107; TSB 663 vs 650)

The aligner uses “unique, usable” FASTA sequences; duplicates, deprecated/secondary or invalid accessions, and empty/invalid sequences are filtered, so usable proteins can be fewer than attempted fetches.
TSB 663 vs 650: about 13 entries typically drop due to duplicate/merged IDs, obsolete accessions, decoy/contaminant-like strings, or missing gene symbols that prevent a 1‑to‑1 map with RNA-seq.

中文备注：结论：对齐阶段用的是“可用且唯一”的蛋白序列集合，因此数量会小于最初尝试获取的条目数；对 TSB，663 行在规范化与去重后得到 650 个可用记录属正常现象。

8) Report (email-ready text)

As mentioned, the IDs between proteomics and RNA‑seq were matched based on direct sequence alignment (ORF nucleotide sequences translated and aligned to UniProt protein FASTA) to ensure 1‑to‑1 mapping and unambiguous ID correspondence.

Please note that the recorded counts are constrained by the proteomics differentially expressed (DE) tables: for MH medium, 1107 proteins/genes; for TSB medium, 650. The RNA‑seq DE lists are larger overall, but the integration and comparisons are based on entities present in both proteomics and RNA‑seq.

For the overlap between the two datasets, each gene/protein was placed into one of five categories:

not_significant_in_either
rna_only_significant
protein_only_significant
both_significant_same_direction
both_significant_opposite_direction

The attached Excel file contains per‑comparison merged tables (protein_log2FC, protein_padj, rna_log2FC, rna_padj), category calls for each entry, and summary counts for quick review.

9) Notes \& assumptions

MH comparison blocks are read in 18h, 4h, 1h order from Ord_609 (three repeated “Log2 difference / q-value (Benjamini FDR)” blocks after intensities). If the template differs, adjust the mapping.
TSB comparisons are explicitly labeled (sbp vs Control at 1h/4h/18h) and are used verbatim.
Gene symbol preference: use proteomics-provided symbols (MH via GN=; TSB via Gene Symbol); otherwise fall back to RNA Preferred_name by locus to unify merge keys.
Significance: default padj < 0.05 for both datasets (set via –alpha). Add effect-size filters if needed.

10) TODOs

Cluster by presence across platforms: using TSB 650 and MH 1319 proteins, derive 5+1 groups; the +1 group contains genes not occurring in the proteomics results; Namely, RNA-only-present category: implemented—prefer set-difference on gene_symbol (RNA DE minus proteomics DE) per comparison to produce concise lists and avoid many‑to‑many expansions.
Re-run proteomics DE with Gunnar’s project scripts to expand DE coverage if needed.

Appendix: Deprecated symbol-only mapping (for reference)

# (Deprecated) Symbol-based mapping reached ~35% overlap → switched to sequence-alignment-based mapping.

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx \
  --prot Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 1280
# Intersect (symbols): 519
# Coverage vs RNA symbols:        35.5%
# Coverage vs Proteomics symbols:  40.5%
# Coverage vs all RNA rows:        22.1%

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx \
  --prot 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 568
# Intersect (symbols): 512
# Coverage vs RNA symbols:        35.1%
# Coverage vs Proteomics symbols:  90.1%
# Coverage vs all RNA rows:        21.8%

¹²³⁴⁵⁶⁷⁸⁹¹⁰

⁂

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.