Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:
-
Preprocessing and Normalization
For RNA-Seq data:
- Obtain gene-level expression data, usually as raw counts or TPM (transcripts per million) / FPKM (fragments per kilobase million).
- Normalize the data (e.g., using DESeq2’s variance stabilizing transformation (VST) or edgeR’s TMM normalization).
For MS proteomics data:
- Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
- Log-transform the data if needed to stabilize variance.
-
Data Mapping and Integration
-
Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.
-
Common Identifiers:
- Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
- Remove entries without matching pairs to ensure one-to-one correspondence.
-
-
Data Filtering
- Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
- Optionally, keep only genes/proteins of interest or those with high coverage.
-
Correlation Analysis
-
For each matched gene/protein pair, calculate correlation (usually Pearson or Spearman) across the samples.
Steps:
- Construct a table with rows as genes/proteins and columns as samples.
- For each row, you’ll have two vectors:
- RNA expression (e.g., normalized RNA counts)
- Protein abundance (e.g., log-transformed LFQ intensity)
-
Calculate:
from scipy.stats import pearsonr, spearmanr rna_vector = [...] protein_vector = [...] pearson_corr, _ = pearsonr(rna_vector, protein_vector) spearman_corr, _ = spearmanr(rna_vector, protein_vector)
-
-
Visualize and Interpret
-
Plot scatter plots of RNA vs protein levels for:
- All genes/proteins together (aggregate view)
- Selected genes of interest
-
Plot correlation coefficients:
- Histogram of all gene/protein correlations
- Heatmap if you have sample-wise data
-
Assess overall agreement:
- Typically, moderate correlation (~0.3–0.6) is observed in many studies.
-
-
Consider Batch Effects and Biological Variability
-
If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).
-
Be mindful that:
- Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
- Some genes/proteins might show no correlation due to translational regulation, stability, etc.
-
-
Summary Workflow
✅ Preprocess & normalize both datasets ✅ Map genes/proteins to common IDs ✅ Filter to shared, high-quality data ✅ Calculate correlations ✅ Visualize and interpret
-
Python script that walks through the key steps of correlating RNA-seq data with proteomics data:
import pandas as pd import numpy as np from scipy.stats import pearsonr, spearmanr import matplotlib.pyplot as plt import seaborn as sns # --- Step 1: Load your data --- # Example: CSVs with genes/proteins as rows, samples as columns rna_data = pd.read_csv('rna_seq_data.csv', index_col=0) # genes x samples protein_data = pd.read_csv('proteomics_data.csv', index_col=0) # proteins x samples # --- Step 2: Map genes to proteins (assuming same identifiers) --- # Filter to common genes/proteins common_genes = rna_data.index.intersection(protein_data.index) rna_data_filtered = rna_data.loc[common_genes] protein_data_filtered = protein_data.loc[common_genes] print(f"Number of common genes/proteins: {len(common_genes)}") # --- Step 3: Log transform if needed (optional) --- rna_data_log = np.log2(rna_data_filtered + 1) protein_data_log = np.log2(protein_data_filtered + 1) # --- Step 4: Calculate gene-wise correlations across samples --- pearson_corrs = [] spearman_corrs = [] for gene in common_genes: rna_vector = rna_data_log.loc[gene] protein_vector = protein_data_log.loc[gene] pearson_corr, _ = pearsonr(rna_vector, protein_vector) spearman_corr, _ = spearmanr(rna_vector, protein_vector) pearson_corrs.append(pearson_corr) spearman_corrs.append(spearman_corr) # Save results correlation_df = pd.DataFrame({ 'Gene': common_genes, 'Pearson': pearson_corrs, 'Spearman': spearman_corrs }) correlation_df.to_csv('gene_protein_correlations.csv', index=False) print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'") # --- Step 5: Visualize the correlation distributions --- sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue') plt.xlabel('Pearson Correlation') plt.title('Distribution of Pearson Correlations (RNA vs Protein)') plt.show() sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon') plt.xlabel('Spearman Correlation') plt.title('Distribution of Spearman Correlations (RNA vs Protein)') plt.show() # --- Step 6: Scatter plot for a selected gene/protein --- example_gene = common_genes[0] # change to your gene of interest plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene]) plt.xlabel('Log2 RNA Expression') plt.ylabel('Log2 Protein Abundance') plt.title(f'RNA vs Protein for {example_gene}') plt.grid(True) plt.show() # Key Notes: #✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files. #✅ The script expects rows to be genes/proteins and columns to be samples. #✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).
-
R script that covers the same steps as above:
# --- Load libraries --- library(ggplot2) library(dplyr) # --- Step 1: Load your data --- # Example: CSVs with genes/proteins as rows, samples as columns rna_data <- read.csv("rna_seq_data.csv", row.names = 1) protein_data <- read.csv("proteomics_data.csv", row.names = 1) # --- Step 2: Find common genes/proteins --- common_genes <- intersect(rownames(rna_data), rownames(protein_data)) rna_data_filtered <- rna_data[common_genes, ] protein_data_filtered <- protein_data[common_genes, ] cat("Number of common genes/proteins:", length(common_genes), "\n") # --- Step 3: Log-transform if needed (optional) --- rna_data_log <- log2(rna_data_filtered + 1) protein_data_log <- log2(protein_data_filtered + 1) # --- Step 4: Calculate gene-wise correlations across samples --- pearson_corrs <- numeric(length(common_genes)) spearman_corrs <- numeric(length(common_genes)) for (i in seq_along(common_genes)) { gene <- common_genes[i] rna_vector <- as.numeric(rna_data_log[gene, ]) protein_vector <- as.numeric(protein_data_log[gene, ]) pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson") spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman") } # Save the results correlation_df <- data.frame( Gene = common_genes, Pearson = pearson_corrs, Spearman = spearman_corrs ) write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE) cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n") # --- Step 5: Visualize the correlation distributions --- ggplot(correlation_df, aes(x = Pearson)) + geom_histogram(bins = 30, fill = "skyblue", color = "black") + labs(title = "Distribution of Pearson Correlations (RNA vs Protein)", x = "Pearson Correlation", y = "Frequency") + theme_minimal() ggplot(correlation_df, aes(x = Spearman)) + geom_histogram(bins = 30, fill = "salmon", color = "black") + labs(title = "Distribution of Spearman Correlations (RNA vs Protein)", x = "Spearman Correlation", y = "Frequency") + theme_minimal() # --- Step 6: Scatter plot for a selected gene/protein --- example_gene <- common_genes[1] # change this to your gene of interest df_example <- data.frame( RNA = as.numeric(rna_data_log[example_gene, ]), Protein = as.numeric(protein_data_log[example_gene, ]) ) ggplot(df_example, aes(x = RNA, y = Protein)) + geom_point() + labs(title = paste("RNA vs Protein for", example_gene), x = "Log2 RNA Expression", y = "Log2 Protein Abundance") + theme_minimal() + geom_smooth(method = "lm", se = FALSE, color = "red") # Key Notes: #✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names. #✅ Rows: genes/proteins, columns: samples. #✅ Change example_gene to any gene of interest for plotting. #Tweak this for the new dataset or extend it with batch correction or other normalizations?