How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:

Preprocessing and Normalization

For RNA-Seq data:
- Obtain gene-level expression data, usually as raw counts or TPM (transcripts per million) / FPKM (fragments per kilobase million).
- Normalize the data (e.g., using DESeq2’s variance stabilizing transformation (VST) or edgeR’s TMM normalization).
For MS proteomics data:
- Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
- Log-transform the data if needed to stabilize variance.
Data Mapping and Integration
- Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.
- Common Identifiers:
  - Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
  - Remove entries without matching pairs to ensure one-to-one correspondence.
Data Filtering
- Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
- Optionally, keep only genes/proteins of interest or those with high coverage.
Correlation Analysis
- For each matched gene/protein pair, calculate correlation (usually Pearson or Spearman) across the samples.
  
  Steps:
  - Construct a table with rows as genes/proteins and columns as samples.
  - For each row, you’ll have two vectors:
    - RNA expression (e.g., normalized RNA counts)
    - Protein abundance (e.g., log-transformed LFQ intensity)
  - Calculate:
```
from scipy.stats import pearsonr, spearmanr

rna_vector = [...]
protein_vector = [...]

pearson_corr, _ = pearsonr(rna_vector, protein_vector)
spearman_corr, _ = spearmanr(rna_vector, protein_vector)
```
Visualize and Interpret
- Plot scatter plots of RNA vs protein levels for:
  - All genes/proteins together (aggregate view)
  - Selected genes of interest
- Plot correlation coefficients:
  - Histogram of all gene/protein correlations
  - Heatmap if you have sample-wise data
- Assess overall agreement:
  - Typically, moderate correlation (~0.3–0.6) is observed in many studies.
Consider Batch Effects and Biological Variability
- If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).
- Be mindful that:
  - Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
  - Some genes/proteins might show no correlation due to translational regulation, stability, etc.
Summary Workflow

✅ Preprocess & normalize both datasets ✅ Map genes/proteins to common IDs ✅ Filter to shared, high-quality data ✅ Calculate correlations ✅ Visualize and interpret

Python script that walks through the key steps of correlating RNA-seq data with proteomics data:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr, spearmanr
import matplotlib.pyplot as plt
import seaborn as sns

# --- Step 1: Load your data ---

# Example: CSVs with genes/proteins as rows, samples as columns
rna_data = pd.read_csv('rna_seq_data.csv', index_col=0)  # genes x samples
protein_data = pd.read_csv('proteomics_data.csv', index_col=0)  # proteins x samples

# --- Step 2: Map genes to proteins (assuming same identifiers) ---

# Filter to common genes/proteins
common_genes = rna_data.index.intersection(protein_data.index)
rna_data_filtered = rna_data.loc[common_genes]
protein_data_filtered = protein_data.loc[common_genes]

print(f"Number of common genes/proteins: {len(common_genes)}")

# --- Step 3: Log transform if needed (optional) ---

rna_data_log = np.log2(rna_data_filtered + 1)
protein_data_log = np.log2(protein_data_filtered + 1)

# --- Step 4: Calculate gene-wise correlations across samples ---

pearson_corrs = []
spearman_corrs = []

for gene in common_genes:
    rna_vector = rna_data_log.loc[gene]
    protein_vector = protein_data_log.loc[gene]

    pearson_corr, _ = pearsonr(rna_vector, protein_vector)
    spearman_corr, _ = spearmanr(rna_vector, protein_vector)

    pearson_corrs.append(pearson_corr)
    spearman_corrs.append(spearman_corr)

# Save results
correlation_df = pd.DataFrame({
    'Gene': common_genes,
    'Pearson': pearson_corrs,
    'Spearman': spearman_corrs
})
correlation_df.to_csv('gene_protein_correlations.csv', index=False)
print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")

# --- Step 5: Visualize the correlation distributions ---

sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
plt.xlabel('Pearson Correlation')
plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
plt.show()

sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
plt.xlabel('Spearman Correlation')
plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
plt.show()

# --- Step 6: Scatter plot for a selected gene/protein ---

example_gene = common_genes[0]  # change to your gene of interest
plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
plt.xlabel('Log2 RNA Expression')
plt.ylabel('Log2 Protein Abundance')
plt.title(f'RNA vs Protein for {example_gene}')
plt.grid(True)
plt.show()

# Key Notes:
#✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
#✅ The script expects rows to be genes/proteins and columns to be samples.
#✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).

R script that covers the same steps as above:

# --- Load libraries ---
library(ggplot2)
library(dplyr)

# --- Step 1: Load your data ---
# Example: CSVs with genes/proteins as rows, samples as columns
rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
protein_data <- read.csv("proteomics_data.csv", row.names = 1)

# --- Step 2: Find common genes/proteins ---
common_genes <- intersect(rownames(rna_data), rownames(protein_data))
rna_data_filtered <- rna_data[common_genes, ]
protein_data_filtered <- protein_data[common_genes, ]

cat("Number of common genes/proteins:", length(common_genes), "\n")

# --- Step 3: Log-transform if needed (optional) ---
rna_data_log <- log2(rna_data_filtered + 1)
protein_data_log <- log2(protein_data_filtered + 1)

# --- Step 4: Calculate gene-wise correlations across samples ---
pearson_corrs <- numeric(length(common_genes))
spearman_corrs <- numeric(length(common_genes))

for (i in seq_along(common_genes)) {
gene <- common_genes[i]
rna_vector <- as.numeric(rna_data_log[gene, ])
protein_vector <- as.numeric(protein_data_log[gene, ])

pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson")
spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman")
}

# Save the results
correlation_df <- data.frame(
Gene = common_genes,
Pearson = pearson_corrs,
Spearman = spearman_corrs
)

write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")

# --- Step 5: Visualize the correlation distributions ---
ggplot(correlation_df, aes(x = Pearson)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
    x = "Pearson Correlation", y = "Frequency") +
theme_minimal()

ggplot(correlation_df, aes(x = Spearman)) +
geom_histogram(bins = 30, fill = "salmon", color = "black") +
labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
    x = "Spearman Correlation", y = "Frequency") +
theme_minimal()

# --- Step 6: Scatter plot for a selected gene/protein ---
example_gene <- common_genes[1]  # change this to your gene of interest
df_example <- data.frame(
RNA = as.numeric(rna_data_log[example_gene, ]),
Protein = as.numeric(protein_data_log[example_gene, ])
)

ggplot(df_example, aes(x = RNA, y = Protein)) +
geom_point() +
labs(title = paste("RNA vs Protein for", example_gene),
    x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
theme_minimal() +
geom_smooth(method = "lm", se = FALSE, color = "red")

# Key Notes:
#✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
#✅ Rows: genes/proteins, columns: samples.
#✅ Change example_gene to any gene of interest for plotting.
#Tweak this for the new dataset or extend it with batch correction or other normalizations?

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

Leave a Reply Cancel reply