How to correlate RNA-seq Data with Mass Spectrometry Proteomics Data?

Correlating RNA-seq data with mass spectrometry (MS)-based proteomics data is a powerful way to link transcript-level expression with protein-level abundance. Here’s a step-by-step outline of how to approach it:

  1. Preprocessing and Normalization

    For RNA-Seq data:

    • Obtain gene-level expression data, usually as raw counts or TPM (transcripts per million) / FPKM (fragments per kilobase million).
    • Normalize the data (e.g., using DESeq2’s variance stabilizing transformation (VST) or edgeR’s TMM normalization).

    For MS proteomics data:

    • Quantify protein abundances, often using spectral counts, iBAQ, LFQ intensities, or other measures.
    • Log-transform the data if needed to stabilize variance.
  2. Data Mapping and Integration

    • Gene/Protein Mapping: Use gene symbols, Ensembl IDs, or UniProt IDs to map transcript-level data (RNA-seq) to protein-level data (MS). Be cautious of differences in annotation – e.g., some genes might have multiple protein isoforms.

    • Common Identifiers:

      • Convert all IDs to a common identifier (e.g., gene symbols or Ensembl IDs).
      • Remove entries without matching pairs to ensure one-to-one correspondence.
  3. Data Filtering

    • Filter out lowly expressed genes/proteins or those not reliably detected in both datasets.
    • Optionally, keep only genes/proteins of interest or those with high coverage.
  4. Correlation Analysis

    • For each matched gene/protein pair, calculate correlation (usually Pearson or Spearman) across the samples.

      Steps:

      • Construct a table with rows as genes/proteins and columns as samples.
      • For each row, you’ll have two vectors:
        • RNA expression (e.g., normalized RNA counts)
        • Protein abundance (e.g., log-transformed LFQ intensity)
      • Calculate:

          from scipy.stats import pearsonr, spearmanr
        
          rna_vector = [...]
          protein_vector = [...]
        
          pearson_corr, _ = pearsonr(rna_vector, protein_vector)
          spearman_corr, _ = spearmanr(rna_vector, protein_vector)
  5. Visualize and Interpret

    • Plot scatter plots of RNA vs protein levels for:

      • All genes/proteins together (aggregate view)
      • Selected genes of interest
    • Plot correlation coefficients:

      • Histogram of all gene/protein correlations
      • Heatmap if you have sample-wise data
    • Assess overall agreement:

      • Typically, moderate correlation (~0.3–0.6) is observed in many studies.
  6. Consider Batch Effects and Biological Variability

    • If the datasets come from different experiments or platforms, consider batch correction methods (e.g., ComBat from the sva R package).

    • Be mindful that:

      • Post-transcriptional regulation affects how well mRNA levels correlate with protein levels.
      • Some genes/proteins might show no correlation due to translational regulation, stability, etc.
  7. Summary Workflow

    ✅ Preprocess & normalize both datasets ✅ Map genes/proteins to common IDs ✅ Filter to shared, high-quality data ✅ Calculate correlations ✅ Visualize and interpret

  8. Python script that walks through the key steps of correlating RNA-seq data with proteomics data:

     import pandas as pd
     import numpy as np
     from scipy.stats import pearsonr, spearmanr
     import matplotlib.pyplot as plt
     import seaborn as sns
    
     # --- Step 1: Load your data ---
    
     # Example: CSVs with genes/proteins as rows, samples as columns
     rna_data = pd.read_csv('rna_seq_data.csv', index_col=0)  # genes x samples
     protein_data = pd.read_csv('proteomics_data.csv', index_col=0)  # proteins x samples
    
     # --- Step 2: Map genes to proteins (assuming same identifiers) ---
    
     # Filter to common genes/proteins
     common_genes = rna_data.index.intersection(protein_data.index)
     rna_data_filtered = rna_data.loc[common_genes]
     protein_data_filtered = protein_data.loc[common_genes]
    
     print(f"Number of common genes/proteins: {len(common_genes)}")
    
     # --- Step 3: Log transform if needed (optional) ---
    
     rna_data_log = np.log2(rna_data_filtered + 1)
     protein_data_log = np.log2(protein_data_filtered + 1)
    
     # --- Step 4: Calculate gene-wise correlations across samples ---
    
     pearson_corrs = []
     spearman_corrs = []
    
     for gene in common_genes:
         rna_vector = rna_data_log.loc[gene]
         protein_vector = protein_data_log.loc[gene]
    
         pearson_corr, _ = pearsonr(rna_vector, protein_vector)
         spearman_corr, _ = spearmanr(rna_vector, protein_vector)
    
         pearson_corrs.append(pearson_corr)
         spearman_corrs.append(spearman_corr)
    
     # Save results
     correlation_df = pd.DataFrame({
         'Gene': common_genes,
         'Pearson': pearson_corrs,
         'Spearman': spearman_corrs
     })
     correlation_df.to_csv('gene_protein_correlations.csv', index=False)
     print("Saved gene-wise correlation data to 'gene_protein_correlations.csv'")
    
     # --- Step 5: Visualize the correlation distributions ---
    
     sns.histplot(correlation_df['Pearson'], bins=30, kde=True, color='skyblue')
     plt.xlabel('Pearson Correlation')
     plt.title('Distribution of Pearson Correlations (RNA vs Protein)')
     plt.show()
    
     sns.histplot(correlation_df['Spearman'], bins=30, kde=True, color='salmon')
     plt.xlabel('Spearman Correlation')
     plt.title('Distribution of Spearman Correlations (RNA vs Protein)')
     plt.show()
    
     # --- Step 6: Scatter plot for a selected gene/protein ---
    
     example_gene = common_genes[0]  # change to your gene of interest
     plt.scatter(rna_data_log.loc[example_gene], protein_data_log.loc[example_gene])
     plt.xlabel('Log2 RNA Expression')
     plt.ylabel('Log2 Protein Abundance')
     plt.title(f'RNA vs Protein for {example_gene}')
     plt.grid(True)
     plt.show()
    
     # Key Notes:
     #✅ Replace the filenames (rna_seq_data.csv and proteomics_data.csv) with your actual files.
     #✅ The script expects rows to be genes/proteins and columns to be samples.
     #✅ Modify or add steps if you have different normalization needs (e.g., DESeq2 normalization).
  9. R script that covers the same steps as above:

     # --- Load libraries ---
     library(ggplot2)
     library(dplyr)
    
     # --- Step 1: Load your data ---
     # Example: CSVs with genes/proteins as rows, samples as columns
     rna_data <- read.csv("rna_seq_data.csv", row.names = 1)
     protein_data <- read.csv("proteomics_data.csv", row.names = 1)
    
     # --- Step 2: Find common genes/proteins ---
     common_genes <- intersect(rownames(rna_data), rownames(protein_data))
     rna_data_filtered <- rna_data[common_genes, ]
     protein_data_filtered <- protein_data[common_genes, ]
    
     cat("Number of common genes/proteins:", length(common_genes), "\n")
    
     # --- Step 3: Log-transform if needed (optional) ---
     rna_data_log <- log2(rna_data_filtered + 1)
     protein_data_log <- log2(protein_data_filtered + 1)
    
     # --- Step 4: Calculate gene-wise correlations across samples ---
     pearson_corrs <- numeric(length(common_genes))
     spearman_corrs <- numeric(length(common_genes))
    
     for (i in seq_along(common_genes)) {
     gene <- common_genes[i]
     rna_vector <- as.numeric(rna_data_log[gene, ])
     protein_vector <- as.numeric(protein_data_log[gene, ])
    
     pearson_corrs[i] <- cor(rna_vector, protein_vector, method = "pearson")
     spearman_corrs[i] <- cor(rna_vector, protein_vector, method = "spearman")
     }
    
     # Save the results
     correlation_df <- data.frame(
     Gene = common_genes,
     Pearson = pearson_corrs,
     Spearman = spearman_corrs
     )
    
     write.csv(correlation_df, "gene_protein_correlations.csv", row.names = FALSE)
     cat("Saved gene-wise correlation data to 'gene_protein_correlations.csv'\n")
    
     # --- Step 5: Visualize the correlation distributions ---
     ggplot(correlation_df, aes(x = Pearson)) +
     geom_histogram(bins = 30, fill = "skyblue", color = "black") +
     labs(title = "Distribution of Pearson Correlations (RNA vs Protein)",
         x = "Pearson Correlation", y = "Frequency") +
     theme_minimal()
    
     ggplot(correlation_df, aes(x = Spearman)) +
     geom_histogram(bins = 30, fill = "salmon", color = "black") +
     labs(title = "Distribution of Spearman Correlations (RNA vs Protein)",
         x = "Spearman Correlation", y = "Frequency") +
     theme_minimal()
    
     # --- Step 6: Scatter plot for a selected gene/protein ---
     example_gene <- common_genes[1]  # change this to your gene of interest
     df_example <- data.frame(
     RNA = as.numeric(rna_data_log[example_gene, ]),
     Protein = as.numeric(protein_data_log[example_gene, ])
     )
    
     ggplot(df_example, aes(x = RNA, y = Protein)) +
     geom_point() +
     labs(title = paste("RNA vs Protein for", example_gene),
         x = "Log2 RNA Expression", y = "Log2 Protein Abundance") +
     theme_minimal() +
     geom_smooth(method = "lm", se = FALSE, color = "red")
    
     # Key Notes:
     #✅ Replace "rna_seq_data.csv" and "proteomics_data.csv" with your real file names.
     #✅ Rows: genes/proteins, columns: samples.
     #✅ Change example_gene to any gene of interest for plotting.
     #Tweak this for the new dataset or extend it with batch correction or other normalizations?

Leave a Reply

Your email address will not be published. Required fields are marked *