How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:

Preprocessing and Normalization

For RNA-Seq data:
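- Obtain gene-level expression data, usually as raw counts or as TPM (transcripts per million) / FPKM (fragments per kilobase of transcript per million mapped reads).
- Normalize raw counts (e.g., using DESeq2’s variance stabilizing transformation (VST) or edgeR’s TMM normalization). TPM and FPKM values are already scaled for sequencing depth and are usually just log-transformed.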

For MS proteomics data:
- Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
- Log-transform the data if needed to stabilize variance.
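
As a rough illustration of the preprocessing above, the sketch below scales raw counts to log2 counts per million, and log2-transforms and median-centers protein intensities. It is a simple stand-in written with pandas and NumPy, not DESeq2's VST or edgeR's TMM. The function names `log_cpm` and `log_median_center`, the pseudocount of 1, and the choice to treat zero intensities as missing are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def log_cpm(counts: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Scale raw counts (genes x samples) to counts per million, then take log2."""
    library_sizes = counts.sum(axis=0)               # total counts per sample
    cpm = counts.div(library_sizes, axis=1) * 1e6    # divide each column by its library size
    return np.log2(cpm + pseudocount)


def log_median_center(intensities: pd.DataFrame) -> pd.DataFrame:
    """Log2-transform intensities (proteins x samples) and subtract each sample's median."""
    # Assumption: a zero intensity means "not detected", so it is treated as missing.
    logged = np.log2(intensities.where(intensities > 0))
    return logged.sub(logged.median(axis=0), axis=1)
```

Both functions expect rows to be genes/proteins and columns to be samples, the same layout used in the scripts further down.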

Data Mapping and Integration

- Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.
- Common Identifiers:
  - Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
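  - Remove entries without a match in the other dataset, and collapse duplicates (e.g., several isoforms or protein groups mapping to one gene) so that each gene pairs with exactly one protein entry.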
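
A minimal sketch of the mapping step, assuming you already have a lookup table from protein IDs to gene IDs (for example, one exported from UniProt). The function name `proteins_to_genes`, the `id_map` Series, and the rule of keeping the most abundant entry per gene are illustrative assumptions, not a fixed convention.

```python
import pandas as pd


def proteins_to_genes(protein_df: pd.DataFrame, id_map: pd.Series) -> pd.DataFrame:
    """Re-index a proteins x samples table by gene ID, one row per gene.

    id_map: hypothetical lookup Series, indexed by protein ID (unique), values are gene IDs.
    """
    genes = protein_df.index.map(id_map)     # NaN where a protein has no mapping
    keep = genes.notna()
    mapped = protein_df[keep].copy()
    mapped.index = genes[keep]

    # Several proteins/isoforms can map to one gene: keep the one with the
    # highest mean abundance (one possible rule; summing is another).
    ranked = mapped.assign(_mean=mapped.mean(axis=1).to_numpy())
    ranked = ranked.sort_values("_mean", ascending=False)
    deduplicated = ranked[~ranked.index.duplicated(keep="first")]
    return deduplicated.drop(columns="_mean")
```

After this step the protein table shares its index with a gene-level RNA table, so the two can be intersected directly.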

Data Filtering

- Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
- Optionally, keep only genes/proteins of interest or those with high coverage.
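
One way to express this filter is sketched below. The function name `detected_in_both` and the thresholds (expression above 1 in at least half of the samples, protein quantified in at least half of the samples) are arbitrary assumptions for illustration; choose cutoffs that suit your data.

```python
import pandas as pd


def detected_in_both(rna: pd.DataFrame, protein: pd.DataFrame,
                     min_rna: float = 1.0, min_fraction: float = 0.5) -> pd.Index:
    """Return IDs passing a detection filter in both tables (rows = genes, columns = samples)."""
    shared = rna.index.intersection(protein.index)
    # RNA: expressed above min_rna in at least min_fraction of samples.
    rna_ok = (rna.loc[shared] > min_rna).mean(axis=1) >= min_fraction
    # Protein: quantified (not missing) in at least min_fraction of samples.
    protein_ok = protein.loc[shared].notna().mean(axis=1) >= min_fraction
    return shared[(rna_ok & protein_ok).to_numpy()]
```

The returned index can be passed to `.loc[...]` on both tables to keep only the genes/proteins that survive the filter.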

Correlation Analysis

- For each matched gene/protein pair, calculate the correlation (usually Pearson or Spearman) across samples. This requires the same samples to be measured in both datasets.
  
  Steps:
  - Construct a table with rows as genes/proteins and columns as samples.
  - For each row, you’ll have two vectors:
    - RNA expression (e.g., normalized RNA counts)
    - Protein abundance (e.g., log-transformed LFQ intensity)
  - Calculate:
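```
from scipy.stats import pearsonr, spearmanr

# Made-up values for one gene across five samples; replace with your own data.
rna_vector = [5.1, 6.3, 5.8, 7.2, 6.9]            # e.g., normalized RNA expression
protein_vector = [20.4, 21.0, 20.9, 22.1, 21.5]   # e.g., log2 LFQ intensity

# Each call returns (coefficient, p-value); only the coefficient is kept here.
pearson_corr, _ = pearsonr(rna_vector, protein_vector)
spearman_corr, _ = spearmanr(rna_vector, protein_vector)
```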

Visualize and Interpret

- Plot scatter plots of RNA vs protein levels for:
  - All genes/proteins together (aggregate view)
  - Selected genes of interest
- Plot correlation coefficients:
  - Histogram of all gene/protein correlations
  - Heatmap if you have sample-wise data
- Assess overall agreement:
  - Many studies report only moderate mRNA–protein correlation (roughly 0.3–0.6), so values in that range are not by themselves a sign of a technical problem.
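
For the aggregate, sample-wise view, a sketch like the following computes one Spearman coefficient per sample by correlating RNA and protein values across all shared genes. The function name `samplewise_spearman` is an assumption for this example. Spearman's rho is computed here as the Pearson correlation of ranks, after dropping genes with a missing value in either dataset.

```python
import pandas as pd


def samplewise_spearman(rna: pd.DataFrame, protein: pd.DataFrame) -> pd.Series:
    """Spearman correlation across genes, computed separately for each shared sample."""
    genes = rna.index.intersection(protein.index)
    samples = rna.columns.intersection(protein.columns)
    result = {}
    for sample in samples:
        pair = pd.DataFrame({
            "rna": rna.loc[genes, sample],
            "protein": protein.loc[genes, sample],
        }).dropna()                      # keep genes measured in both datasets
        # Spearman's rho equals the Pearson correlation of the ranks.
        result[sample] = pair["rna"].rank().corr(pair["protein"].rank())
    return pd.Series(result, name="spearman")
```

The resulting Series (one value per sample) can be plotted as a bar chart or used to flag samples whose RNA and protein profiles agree unusually poorly.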

Consider Batch Effects and Biological Variability

- If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).
- Be mindful that:
  - Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
  - Some genes/proteins might show no correlation due to translational regulation, stability, etc.
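
To make the idea of batch adjustment concrete, here is a deliberately crude sketch that centers each gene within each batch on log-scale data. It is not ComBat and does none of its empirical Bayes shrinkage. The function name `center_by_batch` and the `batches` metadata Series are assumptions for this example. Note that any centering of this kind also removes real biology when batch is confounded with the condition of interest.

```python
import pandas as pd


def center_by_batch(data: pd.DataFrame, batches: pd.Series) -> pd.DataFrame:
    """Subtract each gene's per-batch mean from log-scale data (rows = genes, columns = samples).

    batches: hypothetical metadata Series, indexed by sample name, giving a batch label.
    Every sample in `data` must have a label.
    """
    labels = batches.loc[data.columns]
    # Per-gene mean within each batch, broadcast back to genes x samples.
    batch_means = data.T.groupby(labels.to_numpy()).transform("mean").T
    # Add the overall gene mean back so values stay on the original scale.
    return data.sub(batch_means).add(data.mean(axis=1), axis=0)
```

For real analyses, a dedicated method such as ComBat is usually the better choice; this sketch only shows what "removing a batch offset" means in practice.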

Summary Workflow

✅ Preprocess & normalize both datasets
✅ Map genes/proteins to common IDs
✅ Filter to shared, high-quality data
✅ Calculate correlations
✅ Visualize and interpret

Python script that walks through the key steps of correlating RNA-seq data with proteomics data:

 import pandas as pd
 import numpy as np
 from scipy.stats import pearsonr, spearmanr
 import matplotlib.pyplot as plt
 import seaborn as sns

 # --- Step 1: Load your data ---

 # Example: CSVs with genes/proteins as rows, samples as columns
 rna_data = pd.read_csv('rna_seq_data.csv', index_col=0)  # genes x samples
 protein_data = pd.read_csv('proteomics_data.csv', index_col=0)  # proteins x samples

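 # --- Step 2: Map genes to proteins (assuming same identifiers) ---

 # Filter to common genes/proteins and put the shared samples (columns) in the same order,
 # so that position i in an RNA row and a protein row refers to the same sample
 common_genes = rna_data.index.intersection(protein_data.index)
 common_samples = rna_data.columns.intersection(protein_data.columns)
 rna_data_filtered = rna_data.loc[common_genes, common_samples]
 protein_data_filtered = protein_data.loc[common_genes, common_samples]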

 print(f"Number of common genes/proteins: {len(common_genes)}")

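 # --- Step 3: Log transform if needed (optional) ---
 # Skip this step if your values are already on a log scale
 # (e.g., DESeq2 VST output or log2 LFQ intensities); transforming twice distorts the data.

 rna_data_log = np.log2(rna_data_filtered + 1)
 protein_data_log = np.log2(protein_data_filtered + 1)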

 # --- Step 4: Calculate gene-wise correlations across samples ---

 pearson_corrs = []
 spearman_corrs = []

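 for gene in common_genes:
     rna_vector = rna_data_log.loc[gene].to_numpy(dtype=float)
     protein_vector = protein_data_log.loc[gene].to_numpy(dtype=float)

     # Use only samples with a value in both datasets; proteomics tables often contain NaN
     valid = np.isfinite(rna_vector) & np.isfinite(protein_vector)
     x, y = rna_vector[valid], protein_vector[valid]

     if valid.sum() < 3 or np.ptp(x) == 0 or np.ptp(y) == 0:
         # Too few points or a constant vector: the correlation is undefined
         pearson_corr, spearman_corr = np.nan, np.nan
     else:
         pearson_corr, _ = pearsonr(x, y)
         spearman_corr, _ = spearmanr(x, y)

     pearson_corrs.append(pearson_corr)
     spearman_corrs.append(spearman_corr)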

 # Save results
 correlation_df = pd.DataFrame({
     'Gene': common_genes,
     'Pearson': pearson_corrs,
     'Spearman': spearman_corrs
 })
 correlation_df.to_csv('gene_protein_correlations.csv', index=False)
 print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")

 # --- Step 5: Visualize the correlation distributions ---

 sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
 plt.xlabel('Pearson Correlation')
 plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
 plt.show()

 sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
 plt.xlabel('Spearman Correlation')
 plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
 plt.show()

 # --- Step 6: Scatter plot for a selected gene/protein ---

 example_gene = common_genes[0]  # change to your gene of interest
 plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
 plt.xlabel('Log2 RNA Expression')
 plt.ylabel('Log2 Protein Abundance')
 plt.title(f'RNA vs Protein for {example_gene}')
 plt.grid(True)
 plt.show()

 # Key Notes:
 #✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
 #✅ The script expects rows to be genes/proteins and columns to be samples.
 #✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).

R script that covers the same steps as above:

 # --- Load libraries ---
 library(ggplot2)
 library(dplyr)

 # --- Step 1: Load your data ---
 # Example: CSVs with genes/proteins as rows, samples as columns
 rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
 protein_data <- read.csv("proteomics_data.csv", row.names = 1)

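 # --- Step 2: Find common genes/proteins and common samples ---
 common_genes <- intersect(rownames(rna_data), rownames(protein_data))
 common_samples <- intersect(colnames(rna_data), colnames(protein_data))
 # Subset both tables the same way so rows and columns line up
 rna_data_filtered <- rna_data[common_genes, common_samples]
 protein_data_filtered <- protein_data[common_genes, common_samples]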

 cat("Number of common genes/proteins:", length(common_genes), "\n")

 # --- Step 3: Log-transform if needed (optional) ---
 rna_data_log <- log2(rna_data_filtered + 1)
 protein_data_log <- log2(protein_data_filtered + 1)

 # --- Step 4: Calculate gene-wise correlations across samples ---
 pearson_corrs <- numeric(length(common_genes))
 spearman_corrs <- numeric(length(common_genes))

 for (i in seq_along(common_genes)) {
   gene <- common_genes[i]
   rna_vector <- as.numeric(rna_data_log[gene, ])
   protein_vector <- as.numeric(protein_data_log[gene, ])

   # use = "pairwise.complete.obs" skips samples with a missing value in either vector
   pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson",
                           use = "pairwise.complete.obs")
   spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman",
                            use = "pairwise.complete.obs")
 }

 # Save the results
 correlation_df <- data.frame(
   Gene = common_genes,
   Pearson = pearson_corrs,
   Spearman = spearman_corrs
 )

 write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
 cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")

 # --- Step 5: Visualize the correlation distributions ---
 ggplot(correlation_df, aes(x = Pearson)) +
   geom_histogram(bins = 30, fill = "skyblue", color = "black") +
   labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
        x = "Pearson Correlation", y = "Frequency") +
   theme_minimal()

 ggplot(correlation_df, aes(x = Spearman)) +
   geom_histogram(bins = 30, fill = "salmon", color = "black") +
   labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
        x = "Spearman Correlation", y = "Frequency") +
   theme_minimal()

 # --- Step 6: Scatter plot for a selected gene/protein ---
 example_gene <- common_genes[1]  # change this to your gene of interest
 df_example <- data.frame(
   RNA = as.numeric(rna_data_log[example_gene, ]),
   Protein = as.numeric(protein_data_log[example_gene, ])
 )

 ggplot(df_example, aes(x = RNA, y = Protein)) +
   geom_point() +
   labs(title = paste("RNA vs Protein for", example_gene),
        x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
   theme_minimal() +
   geom_smooth(method = "lm", se = FALSE, color = "red")

 # Key Notes:
 #✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
 #✅ Rows: genes/proteins, columns: samples.
 #✅ Change example_gene to any gene of interest for plotting.
 #✅ Adapt the script to your dataset, or extend it with batch correction or other normalization steps as needed.

