Author Archives: gene_x

Crosslinking mass spectrometry (XL-MS or CLMS)

Crosslinking mass spectrometry (XL-MS or CLMS) is a powerful technique used to study protein-protein interactions, protein complex structures, and protein conformational changes. The method involves covalently linking interacting or closely located amino acids within a protein or between proteins using chemical crosslinkers. The crosslinked proteins are then digested into peptides, and the crosslinked peptides are analyzed using mass spectrometry.

The general workflow of crosslinking mass spectrometry is as follows:

  1. Crosslinking: Proteins or protein complexes are treated with a chemical crosslinker that covalently links amino acids in close proximity. The choice of crosslinker depends on the experimental requirements, such as the desired crosslinking distance and the specific amino acids targeted for crosslinking.

  2. Digestion: Crosslinked proteins are digested into peptides using proteolytic enzymes such as trypsin, which cleaves on the C-terminal side of lysine and arginine residues.

  3. Enrichment: Depending on the crosslinker used and the experimental design, it may be necessary to enrich crosslinked peptides from the complex peptide mixture. This can be achieved using various approaches, including affinity purification, size-exclusion chromatography, or strong cation exchange chromatography.

  4. Mass spectrometry analysis: The crosslinked peptides are analyzed using high-resolution mass spectrometry. The mass spectrometer measures the mass-to-charge ratios (m/z) of the peptides, providing information about their mass and charge. Tandem mass spectrometry (MS/MS) is used to fragment the crosslinked peptides and obtain their sequence information.

  5. Data analysis: The resulting mass spectrometry data are analyzed using specialized bioinformatics tools and algorithms. These tools help identify the crosslinked peptides and infer the interacting amino acids or protein regions. The identified crosslinks can be used to generate structural models or map protein interaction interfaces.
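
Steps 2 and 4 come down to simple arithmetic, which the following minimal sketch illustrates. It is not a crosslink search engine: it only digests a sequence at tryptic sites and computes the precursor m/z expected for two peptides joined by one crosslinker. The function names and the example peptides are hypothetical, and the linker constant is the mass added by a DSS/BS3 crosslink; another reagent would need its own value.

```python
import re

# Monoisotopic residue masses in Da
RESIDUE = {
    'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P': 97.052764,
    'V': 99.068414, 'T': 101.047679, 'C': 103.009185, 'L': 113.084064,
    'I': 113.084064, 'N': 114.042927, 'D': 115.026943, 'Q': 128.058578,
    'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912,
    'F': 147.068414, 'R': 156.101111, 'Y': 163.063329, 'W': 186.079313,
}
WATER = 18.010565
PROTON = 1.007276
DSS_LINK = 138.068080  # mass added by one DSS/BS3 crosslink (assumed reagent)


def trypsin_digest(seq):
    """Cleave after K or R, but not before P; no missed cleavages."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]


def peptide_mass(pep):
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE[aa] for aa in pep) + WATER


def crosslink_mz(pep_a, pep_b, charge, linker=DSS_LINK):
    """m/z of two peptides joined by one crosslinker at the given charge."""
    mass = peptide_mass(pep_a) + peptide_mass(pep_b) + linker
    return (mass + charge * PROTON) / charge


print(trypsin_digest("AKRPGK"))
print(round(crosslink_mz("GK", "AK", 2), 4))
```

Real crosslinked peptides usually carry a missed cleavage at the modified lysine, which this sketch ignores.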

Crosslinking mass spectrometry has been successfully applied to study various biological systems, including protein-protein interactions, protein-nucleic acid interactions, and protein-lipid interactions. It has also been used to investigate the structure and dynamics of large protein complexes and membrane proteins.

Crosslinking mass spectrometry (XL-MS) has been under development since the late 1990s and early 2000s. Early work in the field focused on optimizing the crosslinking and mass spectrometry techniques, as well as developing computational methods to analyze the resulting data. The technique has evolved significantly over the past two decades, thanks to advancements in mass spectrometry instrumentation, crosslinking reagents, and bioinformatics tools.

One of the pioneering studies in XL-MS was published in 2000 by Rappsilber, Siniossoglou, et al., who used the technique to study the structure of protein complexes in yeast cells. Since then, numerous improvements have been made to the method, leading to its widespread adoption in the field of structural biology and proteomics.

As the technology continues to mature, XL-MS has become an essential tool for studying protein-protein interactions, protein complex structures, and protein conformational changes, providing valuable insights into the molecular mechanisms underlying various biological processes.

Fan Liu’s lab, based at the Leibniz-Forschungsinstitut für Molekulare Pharmakologie (FMP) in Berlin, Germany, has made significant contributions to the field of crosslinking mass spectrometry (XL-MS). The group focuses on developing innovative XL-MS methods, as well as computational tools for data analysis, to study protein structures and interactions.

One of Fan Liu’s key contributions is XlinkX, a search engine for identifying crosslinked peptides in mass spectrometry data acquired with MS-cleavable crosslinkers. The xQuest/xProphet software suite is also widely used in the field, but it was developed in the Aebersold group at ETH Zurich: xQuest is an algorithm for identifying crosslinked peptides from mass spectrometry data, while xProphet is a post-processing tool that statistically validates the xQuest results.

The lab has also contributed to methodologies for studying protein interactions with MS-cleavable crosslinking reagents. For example, they have applied disuccinimidyl sulfoxide (DSSO) crosslinking, which enables the identification of interacting protein regions. DSSO is an amine-reactive crosslinker that is cleaved inside the mass spectrometer by collision-induced dissociation; it is not photoactivatable.

Furthermore, Fan Liu’s group has applied their XL-MS expertise to investigate the structure and function of various biologically significant protein complexes, such as the 26S proteasome, the Mediator complex, and the nuclear pore complex, providing valuable insights into the molecular mechanisms of these systems.

Overall, the Fan Liu lab has played an important role in advancing the field of crosslinking mass spectrometry and continues to push the boundaries of what can be achieved with this powerful technique.

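Disuccinimidyl sulfoxide (DSSO) crosslinking is a crosslinking method used in studies of protein structure and protein interactions. With this MS-cleavable crosslinker, interacting protein regions can be identified. Within crosslinking mass spectrometry (XL-MS), DSSO crosslinking provides an effective tool for studying the structure of protein complexes.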

10 NGS Categories

  1. RNA Sequencing: RNA sequencing (RNA-Seq) is a technique used to study the transcriptome, or the complete set of RNA molecules expressed within a cell or tissue. This approach allows researchers to analyze gene expression patterns, identify novel transcripts, and detect alternative splicing events.

  2. Whole Genome Sequencing: Whole genome sequencing (WGS) is a method that determines the complete DNA sequence of an organism’s genome. This technique provides comprehensive information about an organism’s genetic makeup, including the identification of genes, regulatory elements, and variations such as single nucleotide polymorphisms (SNPs) and structural variants.

  3. Amplicon Sequencing: Amplicon sequencing involves the targeted sequencing of specific genomic regions or genes using polymerase chain reaction (PCR) to amplify the regions of interest before sequencing. This approach is often used to study specific genetic variations or target known functional regions in the genome.

  4. Exome Sequencing: Exome sequencing targets the protein-coding regions of the genome, known as the exome. These regions account for approximately 1-2% of the genome but contain the majority of disease-causing genetic variations. Exome sequencing is used to identify novel disease-associated genes and mutations in known genes.

  5. CRISPR Validation (genoTYPER-NEXT): CRISPR validation is a process to assess the efficiency and specificity of CRISPR/Cas9-mediated genome editing. genoTYPER-NEXT is a high-throughput sequencing platform used to validate CRISPR/Cas9-induced mutations by sequencing the target sites and identifying the exact mutations generated by the editing process.

  6. Targeted Sequencing: Targeted sequencing is a method that focuses on sequencing specific genomic regions or genes of interest, rather than the entire genome. This approach is more cost-effective and allows for higher sequencing depth, providing better resolution for detecting low-frequency genetic variations.

  7. Metagenomics: Metagenomics is the study of genetic material obtained directly from environmental samples, such as soil or water, without the need for culturing individual organisms. This approach allows researchers to analyze the composition, diversity, and functional potential of complex microbial communities.

  8. Epigenomics: Epigenomics is the study of epigenetic modifications on a genome-wide scale, including DNA methylation, histone modifications, and non-coding RNA molecules. These modifications play essential roles in gene regulation and can have long-lasting effects on an organism’s phenotype without changing the DNA sequence.

  9. Immunogenomics: Immunogenomics is the study of the genetic and epigenetic factors that influence the immune system and its response to various stimuli, including pathogens, allergens, and self-antigens. This field integrates genomic, transcriptomic, and epigenomic data to better understand the molecular mechanisms underlying immune responses and develop novel therapies for immune-related diseases.

  10. Proteomics: Proteomics is a branch of molecular biology that focuses on the large-scale study of proteins within an organism or a specific biological system. Unlike genomics, which deals with the study of DNA and gene sequences, proteomics investigates the structure, function, and interactions of proteins. Proteins are crucial for many cellular processes, and their functions are dictated by their structure, abundance, and modifications.

    There are several key aspects and techniques in proteomics:

    • Protein identification: Identifying the proteins present in a given sample, such as a cell, tissue, or body fluid, is a fundamental aspect of proteomics. Mass spectrometry is the most commonly used technique for protein identification, often combined with liquid chromatography to separate complex protein mixtures before analysis.

    • Protein quantification: Measuring the relative or absolute abundance of proteins in a sample is important for understanding their biological roles and potential involvement in disease processes. Techniques like label-free quantification, stable isotope labeling, and targeted mass spectrometry approaches (e.g., selected reaction monitoring) are commonly used for protein quantification.

    • Post-translational modifications (PTMs): PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, glycosylation, and acetylation. They can alter a protein’s activity, stability, localization, or interaction with other molecules. Mass spectrometry-based approaches and specific enrichment techniques are used to identify and quantify PTMs.

    • Protein-protein interactions: Investigating how proteins interact with one another is crucial for understanding cellular processes and signaling pathways. Techniques like yeast two-hybrid screens, affinity purification-mass spectrometry (AP-MS), and proximity labeling methods can reveal protein-protein interactions.

    • Structural proteomics: This area focuses on determining the three-dimensional structures of proteins, which provide insights into their functions and interactions. Techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) are used to determine protein structures.

    • Functional proteomics: This aspect of proteomics aims to understand the biological roles of proteins and their involvement in cellular processes. Techniques like RNA interference (RNAi), CRISPR/Cas9-mediated gene editing, and activity-based protein profiling (ABPP) are used to study protein functions and identify potential drug targets.

Draw 3D PCA and calculate background genes

Download python3 script for generating the plot

Download raw data 1 for generating the plot

Download raw data 2 for generating the plot

PCA_3D_2.png

construct a data structure (merged_df) as above with data and pc

library(DESeq2)      # provides plotPCA() and assay(); rld is assumed to be the transformed DESeq2 object
library(ggplot2)
library(genefilter)  # provides rowVars()

# PC1 and PC2 plus the sample annotation, as returned by DESeq2
data <- plotPCA(rld, intgroup=c("condition", "donor"), returnData=TRUE)
write.csv(data, file="plotPCA_data.csv")

# calculate all PCs, including PC3, from the 500 most variable genes
ntop <- 500
rv <- rowVars(assay(rld))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
mat <- t( assay(rld)[select, ] )
pc <- prcomp(mat)
pc$x[,1:3]
df_pc <- data.frame(pc$x)

# both tables must list the samples in the same order
stopifnot(identical(rownames(data), rownames(df_pc)))

# drop PC1/PC2 from the plotPCA table so they are not duplicated by the merge
data$PC1 <- NULL
data$PC2 <- NULL
merged_df <- merge(data, df_pc, by = "row.names")
row.names(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL
merged_df$name <- NULL
# put all PC columns first; this works for 28 PCs as well as for 26 PCs
pc_cols <- colnames(df_pc)
merged_df <- merged_df[, c(pc_cols, "group", "condition", "donor")]
write.csv(merged_df, file="merged_df_28PCs.csv")
#> summary(pc)  #--> the proportions of variance of PC1, PC2 and PC3 are 0.6795, 0.08596 and 0.06599
#Importance of components:
#                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
#Standard deviation     2.6011 0.92510 0.81059 0.67065 0.51952 0.46429 0.41632
#Proportion of Variance 0.6795 0.08596 0.06599 0.04517 0.02711 0.02165 0.01741
#Cumulative Proportion  0.6795 0.76548 0.83148 0.87665 0.90376 0.92541 0.94282
#                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
#Standard deviation     0.38738 0.32048 0.27993 0.24977 0.20217 0.17316 0.16293
#Proportion of Variance 0.01507 0.01032 0.00787 0.00627 0.00411 0.00301 0.00267
#Cumulative Proportion  0.95789 0.96821 0.97608 0.98234 0.98645 0.98946 0.99213
#                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
#Standard deviation     0.15058 0.13083 0.12449 0.08789 0.06933 0.06064 0.04999
#Proportion of Variance 0.00228 0.00172 0.00156 0.00078 0.00048 0.00037 0.00025
#Cumulative Proportion  0.99441 0.99612 0.99768 0.99846 0.99894 0.99931 0.99956
#                          PC22    PC23    PC24    PC25     PC26
#Standard deviation     0.04143 0.03876 0.03202 0.01054 0.005059
#Proportion of Variance 0.00017 0.00015 0.00010 0.00001 0.000000
#Cumulative Proportion  0.99973 0.99988 0.99999 1.00000 1.000000                                                                     
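
The two numerical steps in the R code above, picking the most variable genes and turning standard deviations into proportions of variance, can be checked by hand. The following is a minimal Python sketch using only the standard library. The function names and the toy expression table are hypothetical; the standard deviations are copied from the summary(pc) output above.

```python
from statistics import variance


def top_variance_rows(expr, ntop=500):
    """Return the names of the ntop rows with the largest sample variance.

    expr maps a gene name to its list of values across samples,
    mirroring rowVars() followed by order(..., decreasing = TRUE).
    """
    rv = {gene: variance(values) for gene, values in expr.items()}
    ranked = sorted(rv, key=rv.get, reverse=True)
    return ranked[:min(ntop, len(ranked))]


def proportion_of_variance(sdev):
    """Proportion of variance per PC: sdev^2 divided by the sum of all sdev^2."""
    total = sum(s * s for s in sdev)
    return [s * s / total for s in sdev]


# Toy expression table (hypothetical values)
expr = {'g1': [1, 1, 1], 'g2': [0, 5, 10], 'g3': [1, 2, 3]}
print(top_variance_rows(expr, ntop=2))

# Standard deviations of the 26 PCs from the summary(pc) output
sdev = [2.6011, 0.92510, 0.81059, 0.67065, 0.51952, 0.46429, 0.41632,
        0.38738, 0.32048, 0.27993, 0.24977, 0.20217, 0.17316, 0.16293,
        0.15058, 0.13083, 0.12449, 0.08789, 0.06933, 0.06064, 0.04999,
        0.04143, 0.03876, 0.03202, 0.01054, 0.005059]
props = proportion_of_variance(sdev)
print([round(p, 4) for p in props[:3]])
```

The first three proportions are the values that belong in the axis titles of a 3D PCA plot of this data set.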

draw 3D with merged_df using plotly

import plotly.graph_objects as go  # fig.write_image() below also needs the kaleido package
import pandas as pd
from sklearn.decomposition import PCA

# Read in data as a pandas dataframe
#df = pd.DataFrame({
#    'PC1': [-13.999925, -12.504291, -12.443057, -13.065235, -17.316215],
#    'PC2': [-1.498823, -3.342411, -6.067055, -8.205809, 3.293993],
#    'PC3': [-3.335085, 15.207755, -14.725450, 15.078469, -6.917358],
#    'condition': ['GFP d3', 'GFP d3', 'GFP d8', 'GFP d8', 'GFP+mCh d9/12'],
#    'donor': ['DI', 'DII', 'DI', 'DII', 'DI']
#})
df = pd.read_csv('merged_df_28PCs.csv', index_col=0, header=0)
df['condition'] = df['condition'].replace("GFP+mCh d9/12", "ctrl LTtr+sT d9/12")
df['condition'] = df['condition'].replace("sT+LTtr d9/12", "LTtr+sT d9/12")
df['condition'] = df['condition'].replace("GFP d3", "ctrl LT/LTtr d3")
df['condition'] = df['condition'].replace("GFP d8", "ctrl LT/LTtr d8")
df['condition'] = df['condition'].replace("mCh d3", "ctrl sT d3")
df['condition'] = df['condition'].replace("mCh d8", "ctrl sT d8")

# Fit PCA model to reduce data dimensions to 3
pca = PCA(n_components=3)
pca.fit(df.iloc[:, :-3])
X_reduced = pca.transform(df.iloc[:, :-3])

# Add reduced data back to dataframe
df['PC1'] = X_reduced[:, 0]
df['PC2'] = X_reduced[:, 1]
df['PC3'] = X_reduced[:, 2]

# Create PCA plot with 3D scatter
fig = go.Figure()

# Marker symbols available for Scatter3d:
# ['circle', 'circle-open', 'square', 'square-open', 'diamond', 'diamond-open', 'cross', 'x']

# Marker sizes: diamonds are drawn at size 6, circles at size 10.
# Colors come in three families, each running from light to deep:
#   first series:  'ctrl LTtr+sT d9/12', 'LTtr+sT d9/12'
#   second series: 'ctrl LT/LTtr d3', 'ctrl LT/LTtr d8', 'LT d3', 'LT d8', 'LTtr d3', 'LTtr d8'
#   third series:  'ctrl sT d3', 'ctrl sT d8', 'sT d3', 'sT d8', 'sT+LT d3'

condition_color_map_untreated = {'untreated':'black'}
donor_symbol_map_untreated = {'DI': 'circle-open', 'DII': 'diamond-open'}
#condition_color_map = {'ctrl LTtr+sT d9/12': 'green', 'GFP d3': 'blue', 'GFP d8': 'red', 'GFP+mCh d9/12': 'green', 'LT d3': 'orange'}
condition_color_map = {
    'ctrl LTtr+sT d9/12': 'black',
    'LTtr+sT d9/12': '#a14a1a',

    'ctrl LT/LTtr d3': '#fdbf6f',
    'ctrl LT/LTtr d8': '#ff7f00',
    'LT d3': '#b2df8a',
    'LT d8': '#33a02c',
    'LTtr d3': '#a6cee3',
    'LTtr d8': '#1f78b4',

    'ctrl sT d3': 'rgb(200, 200, 200)',
    'ctrl sT d8': 'rgb(100, 100, 100)',
    'sT d3': '#fb9a99',
    'sT d8': '#e31a1c',
    'sT+LT d3': 'magenta'
}
donor_symbol_map = {'DI': 'circle', 'DII': 'diamond'}

for donor, donor_symbol in donor_symbol_map_untreated.items():
    for condition, condition_color in condition_color_map_untreated.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=6 if donor_symbol in ['diamond-open'] else 10, opacity=0.8, color=condition_color, symbol=donor_symbol)))

for donor, donor_symbol in donor_symbol_map.items():
    for condition, condition_color in condition_color_map.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=6 if donor_symbol in ['diamond'] else 10, opacity=0.8, color=condition_color, symbol=donor_symbol)))

for donor, donor_symbol in donor_symbol_map.items():
    fig.add_trace(go.Scatter3d(x=[None], y=[None], z=[None],
                               mode='markers',
                               name=donor,
                               legendgroup=f'{donor}',
                               showlegend=True,
                               marker=dict(size=10, opacity=1, color='black', symbol=donor_symbol),
                               hoverinfo='none'))

# Annotations for the legend blocks
fig.update_layout(
    annotations=[
        dict(x=1.1, y=1.0, xref='paper', yref='paper', showarrow=False,
             text='Condition', font=dict(size=15)),
        dict(x=1.1, y=0.6, xref='paper', yref='paper', showarrow=False,
             text='Donor', font=dict(size=15))
    ],
    scene=dict(
        aspectmode='cube',
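        # axis titles take the explained variance from the fitted PCA instead of hard-coded values
        xaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title=f'PC1: {pca.explained_variance_ratio_[0]:.0%} variance'),
        yaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title=f'PC2: {pca.explained_variance_ratio_[1]:.0%} variance'),
        zaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title=f'PC3: {pca.explained_variance_ratio_[2]:.0%} variance'),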
        bgcolor='white'
    ),
    margin=dict(l=5, r=5, b=5, t=0)  # Adjust the margins to prevent clipping of axis titles
)

#fig.show()
fig.write_image("fig1.svg")
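
The color map in the script is written out by hand. If the families of colors running from light to deep should be generated instead, one possible approach is sketched below with the standard-library colorsys module. The function name `color_family`, its default lightness and saturation values, and the chosen hue are assumptions for illustration, not the colors used in the figure.

```python
import colorsys


def color_family(hue, n, light=0.80, deep=0.35, saturation=0.65):
    """Return n hex colors of one hue, ordered from light to deep.

    hue is a fraction of the color wheel in [0, 1); lightness is
    interpolated linearly from `light` down to `deep`.
    """
    colors = []
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.0
        lightness = light + (deep - light) * frac
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


# Example: an orange-toned family for the six conditions of the second series
second_series = ['ctrl LT/LTtr d3', 'ctrl LT/LTtr d8', 'LT d3', 'LT d8',
                 'LTtr d3', 'LTtr d8']
family_map = dict(zip(second_series, color_family(0.08, len(second_series))))
print(family_map)
```

The resulting dictionary has the same shape as condition_color_map, so its entries could be merged into that map.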

calculate background genes

    # [1] ensembl_gene_id         gene_name               MKL.1_RNA              
    # [4] MKL.1_RNA_118           MKL.1_RNA_147           MKL.1_EV.RNA           
    # [7] MKL.1_EV.RNA_2          MKL.1_EV.RNA_118        MKL.1_EV.RNA_87        
    #[10] MKL.1_EV.RNA_27         X042_MKL.1_wt_EV        X0505_MKL.1_wt_EV      
    #[13] X042_MKL.1_sT_DMSO      X0505_MKL.1_sT_DMSO_EV  X042_MKL.1_scr_DMSO_EV 
    #[16] X0505_MKL.1_scr_DMSO_EV X042_MKL.1_sT_Dox       X0505_MKL.1_sT_Dox_EV  
    #[19] X042_MKL.1_scr_Dox_EV   X0505_MKL.1_scr_Dox_EV 

# columns 1 and 2 (ensembl_gene_id, gene_name) are identifiers; the sample columns start at 3
python process.py ../fpkm_values_MKL-1.csv 3 > MKL.1_RNA 
python process.py ../fpkm_values_MKL-1.csv 4 > MKL.1_RNA_118 
python process.py ../fpkm_values_MKL-1.csv 5 > MKL.1_RNA_147 
python process.py ../fpkm_values_MKL-1.csv 6 > MKL.1_EV.RNA 
python process.py ../fpkm_values_MKL-1.csv 7 > MKL.1_EV.RNA_2 
python process.py ../fpkm_values_MKL-1.csv 8 > MKL.1_EV.RNA_118 
python process.py ../fpkm_values_MKL-1.csv 9 > MKL.1_EV.RNA_87 
python process.py ../fpkm_values_MKL-1.csv 10 > MKL.1_EV.RNA_27 
python process.py ../fpkm_values_MKL-1.csv 11 > X042_MKL.1_wt_EV 
python process.py ../fpkm_values_MKL-1.csv 12 > X0505_MKL.1_wt_EV 
python process.py ../fpkm_values_MKL-1.csv 13 > X042_MKL.1_sT_DMSO 
python process.py ../fpkm_values_MKL-1.csv 14 > X0505_MKL.1_sT_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 15 > X042_MKL.1_scr_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 16 > X0505_MKL.1_scr_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 17 > X042_MKL.1_sT_Dox 
python process.py ../fpkm_values_MKL-1.csv 18 > X0505_MKL.1_sT_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 19 > X042_MKL.1_scr_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 20 > X0505_MKL.1_scr_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 21 > median_length 

cut -f1 -d',' MKL.1_RNA>_MKL-1_RNA
cut -f1 -d',' MKL.1_RNA_118>_MKL-1_RNA_118
cut -f1 -d',' MKL.1_RNA_147>_MKL-1_RNA_147
cut -f1 -d',' MKL.1_EV.RNA>_MKL-1_EV_RNA
cut -f1 -d',' MKL.1_EV.RNA_2>_MKL-1_EV_RNA_2
cut -f1 -d',' MKL.1_EV.RNA_118>_MKL-1_EV_RNA_118
cut -f1 -d',' MKL.1_EV.RNA_87>_MKL-1_EV_RNA_87
cut -f1 -d',' MKL.1_EV.RNA_27>_MKL-1_EV_RNA_27
cut -f1 -d',' X042_MKL.1_wt_EV>_X042_MKL-1_wt_EV
cut -f1 -d',' X0505_MKL.1_wt_EV>_X0505_MKL-1_wt_EV
cut -f1 -d',' X042_MKL.1_sT_DMSO>_X042_MKL-1_sT_DMSO
cut -f1 -d',' X0505_MKL.1_sT_DMSO_EV>_X0505_MKL-1_sT_DMSO_EV
cut -f1 -d',' X042_MKL.1_scr_DMSO_EV>_X042_MKL-1_scr_DMSO_EV
cut -f1 -d',' X0505_MKL.1_scr_DMSO_EV>_X0505_MKL-1_scr_DMSO_EV
cut -f1 -d',' X042_MKL.1_sT_Dox>_X042_MKL-1_sT_Dox
cut -f1 -d',' X0505_MKL.1_sT_Dox_EV>_X0505_MKL-1_sT_Dox_EV
cut -f1 -d',' X042_MKL.1_scr_Dox_EV>_X042_MKL-1_scr_Dox_EV
cut -f1 -d',' X0505_MKL.1_scr_Dox_EV>_X0505_MKL-1_scr_Dox_EV

~/Tools/csv2xls-0.4/csv_to_xls.py  _MKL-1_RNA _MKL-1_RNA_118 _MKL-1_RNA_147 _MKL-1_EV_RNA _MKL-1_EV_RNA_2 _MKL-1_EV_RNA_118 _MKL-1_EV_RNA_87 _MKL-1_EV_RNA_27 _X042_MKL-1_wt_EV _X0505_MKL-1_wt_EV _X042_MKL-1_sT_DMSO _X0505_MKL-1_sT_DMSO_EV _X042_MKL-1_scr_DMSO_EV _X0505_MKL-1_scr_DMSO_EV _X042_MKL-1_sT_Dox _X0505_MKL-1_sT_Dox_EV _X042_MKL-1_scr_Dox_EV _X0505_MKL-1_scr_Dox_EV -d',' -o background_genes_MKL-1.xls;

process.py

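import sys

filename = sys.argv[1]
n = int(sys.argv[2])  # 1-based index of the column to test


def is_nonzero(value):
    """Treat '0', '0.0' etc. as zero; keep headers and other non-numeric fields."""
    try:
        return float(value) != 0
    except ValueError:
        return True


with open(filename, 'r') as file:
    for line in file:
        # strip the newline before splitting so the last column compares cleanly
        words = line.strip().split(",")
        if len(words) >= n and is_nonzero(words[n-1]):
            print(line.strip())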
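
The repeated per-column runs of process.py followed by cut can also be done in a single pass over the file. The following is a minimal sketch, assuming a comma-separated file with a header row whose first column holds the gene identifier. The function name `background_genes` and the parameter `first_sample_col` are illustrative and not part of the original scripts. Unlike process.py, it skips empty or non-numeric cells.

```python
import csv
import io


def background_genes(handle, first_sample_col=3):
    """Map each sample column to the gene ids with a nonzero value.

    first_sample_col is 1-based, matching the column numbers passed
    to process.py.
    """
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[first_sample_col - 1:]
    result = {sample: [] for sample in samples}
    for row in reader:
        for sample, value in zip(samples, row[first_sample_col - 1:]):
            try:
                nonzero = float(value) != 0
            except ValueError:
                nonzero = False  # empty or non-numeric cell
            if nonzero:
                result[sample].append(row[0])
    return result


# Small in-memory example (hypothetical values)
demo = io.StringIO(
    "ensembl_gene_id,gene_name,S1,S2\n"
    "ENSG1,A,0,3.5\n"
    "ENSG2,B,1,0\n"
    "ENSG3,C,0.0,\n"
)
res = background_genes(demo)
print(res)
```

For a real file, the StringIO object would be replaced by an open file handle, for example `open('../fpkm_values_MKL-1.csv', newline='')`.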

# -- in the example of K331A data --
#FORMAT normalized_counts.txt to normalized_counts.csv
#DEBUG1: add gene_symbol in the first line
#DEBUG2: '\n' --> ',\n'
#RESULT: it looks like #gene_symbol,ctrl_d3_DonorI,ctrl_d3_DonorII,ctrl_d8_DonorI,ctrl_d8_DonorII,LT_d3_DonorI,LT_d3_DonorII,LT_d8_DonorI,LT_d8_DonorII,LTtr_d3_DonorI,LTtr_d3_DonorII,LTtr_d8_DonorI,LTtr_d8_DonorII,K331A_DonorI,K331A_DonorII,
#DDX11L1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
python process.py normalized_counts.csv 2 > ctrl_d3_DonorI
python process.py normalized_counts.csv 3 > ctrl_d3_DonorII 
python process.py normalized_counts.csv 4 > ctrl_d8_DonorI
python process.py normalized_counts.csv 5 > ctrl_d8_DonorII
python process.py normalized_counts.csv 6 > LT_d3_DonorI
python process.py normalized_counts.csv 7 > LT_d3_DonorII
python process.py normalized_counts.csv 8 > LT_d8_DonorI
python process.py normalized_counts.csv 9 > LT_d8_DonorII
python process.py normalized_counts.csv 10 > LTtr_d3_DonorI
python process.py normalized_counts.csv 11 > LTtr_d3_DonorII
python process.py normalized_counts.csv 12 > LTtr_d8_DonorI
python process.py normalized_counts.csv 13 > LTtr_d8_DonorII
python process.py normalized_counts.csv 14 > K331A_DonorI
python process.py normalized_counts.csv 15 > K331A_DonorII

cut -f1 -d',' ctrl_d3_DonorI > _ctrl_d3_DonorI
cut -f1 -d',' ctrl_d3_DonorII > _ctrl_d3_DonorII
cut -f1 -d',' ctrl_d8_DonorI > _ctrl_d8_DonorI
cut -f1 -d',' ctrl_d8_DonorII > _ctrl_d8_DonorII
cut -f1 -d',' LT_d3_DonorI > _LT_d3_DonorI
cut -f1 -d',' LT_d3_DonorII > _LT_d3_DonorII
cut -f1 -d',' LT_d8_DonorI > _LT_d8_DonorI
cut -f1 -d',' LT_d8_DonorII > _LT_d8_DonorII
cut -f1 -d',' LTtr_d3_DonorI > _LTtr_d3_DonorI
cut -f1 -d',' LTtr_d3_DonorII > _LTtr_d3_DonorII
cut -f1 -d',' LTtr_d8_DonorI > _LTtr_d8_DonorI
cut -f1 -d',' LTtr_d8_DonorII > _LTtr_d8_DonorII
cut -f1 -d',' K331A_DonorI > _K331A_DonorI
cut -f1 -d',' K331A_DonorII > _K331A_DonorII

~/Tools/csv2xls-0.4/csv_to_xls.py _ctrl_d3_DonorI _ctrl_d3_DonorII _ctrl_d8_DonorI _ctrl_d8_DonorII   _LT_d3_DonorI _LT_d3_DonorII _LT_d8_DonorI _LT_d8_DonorII  _LTtr_d3_DonorI _LTtr_d3_DonorII _LTtr_d8_DonorI _LTtr_d8_DonorII  _K331A_DonorI _K331A_DonorII  -d',' -o background_genes.xls;

Top 15 Sequencing Companies

To the best of our knowledge, the following are some of the top sequencing companies in the genomics industry.

  1. Illumina: A leading provider of DNA sequencing technologies, with platforms such as NovaSeq, HiSeq, MiSeq, and iSeq.

  2. Thermo Fisher Scientific: Offers the Ion Torrent sequencing platform, based on semiconductor sequencing technology.

  3. BGI (Beijing Genomics Institute): A Chinese company that provides sequencing services and has developed its own sequencing platforms, such as the BGISEQ and MGISEQ series.

  4. Pacific Biosciences (PacBio): Known for its long-read sequencing technology, Single Molecule, Real-Time (SMRT) sequencing.

  5. Oxford Nanopore Technologies: Develops nanopore sequencing technology, with platforms like MinION, GridION, and PromethION.

  6. 10x Genomics: Provides genomics and single-cell solutions, including the Chromium platform for linked-read sequencing and single-cell analysis.

  7. Qiagen: A global provider of sample and assay technologies, including the GeneReader NGS system for targeted sequencing applications.

  8. Roche: Offers various sequencing technologies, including the 454 Life Sciences platform for pyrosequencing and the Roche NimbleGen target enrichment system.

  9. DNA Electronics (DNAe): A company working on semiconductor-based DNA sequencing technology, with applications in various fields, including diagnostics and personalized medicine.

  10. GenapSys: Developer of the GENIUS system, a compact and cost-effective next-generation sequencing (NGS) platform.

  11. GENEWIZ: GENEWIZ is a global genomics service company specializing in DNA sequencing, gene synthesis, molecular biology, and genomic services. They provide Sanger sequencing, next-generation sequencing (NGS), and a variety of other services to researchers in academia, pharmaceuticals, and biotechnology. Although GENEWIZ does not develop sequencing platforms like some of the companies mentioned earlier, its expertise in delivering high-quality sequencing services has made it a popular choice for researchers in need of genomics services. It was founded in 1999 by Dr. Steve Sun and Dr. Amy Liao. Both of them have extensive experience in molecular biology and genomics. Dr. Sun holds a Ph.D. in Molecular Biology from the University of California, Berkeley, and an MBA from Rutgers University, while Dr. Liao holds a Ph.D. in Molecular Biology from the University of Southern California. Their combined expertise and vision for providing genomics services led to the establishment of GENEWIZ as a global genomics services company.

  12. Novogene: Founded in 2011, Novogene is a leading provider of genomic services and solutions with cutting-edge next-generation sequencing (NGS) and bioinformatics expertise. The company offers a wide range of services, including whole-genome sequencing, transcriptome sequencing, single-cell sequencing, and various other genomics services for research and clinical applications. Novogene has grown rapidly and has established a strong global presence, serving customers in academia, pharmaceuticals, biotechnology, agriculture, and more.

  13. Biomarker Technologies: Biomarker Technologies is a Chinese company founded in 2010 that focuses on providing genomics services and developing genomic technologies for the agricultural and environmental sectors. The company offers a variety of sequencing services, such as whole-genome sequencing, transcriptome sequencing, and metagenomic sequencing, to support research in plant and animal breeding, environmental monitoring, and other fields. Biomarker Technologies also develops molecular markers and other genomic tools to facilitate the application of genomics in agriculture and environmental sciences.

  14. Singleron: Singleron’s products and services include single-cell RNA sequencing, single-cell ATAC sequencing, single-cell multi-omics sequencing, and related bioinformatics services. These offerings allow researchers to explore gene expression, chromatin accessibility, and other molecular features at the single-cell level, enabling a deeper understanding of cellular heterogeneity and the underlying mechanisms in various biological systems.

  15. NanoString Technologies: NanoString Technologies is a biotechnology company that provides life science tools for translational research and molecular diagnostics. The company’s flagship product, the nCounter Analysis System, offers a simple and efficient way to analyze gene expression, protein expression, and nucleic acid abundance in a single experiment.

    The nCounter platform is based on a digital barcoding technology that enables the direct measurement and counting of individual RNA or DNA molecules without the need for amplification, such as PCR. The system allows for multiplexing of up to 800 targets in a single reaction, providing a highly sensitive, accurate, and reproducible method for gene expression analysis. Moreover, the platform is compatible with various sample types, including fresh, frozen, and formalin-fixed, paraffin-embedded (FFPE) tissues.

    NanoString’s technology has been widely adopted in various research fields, such as oncology, immunology, and neuroscience, as well as in clinical applications, including companion diagnostics and prognostic tests. The nCounter platform can help researchers identify biomarkers, understand gene regulation, and elucidate the mechanisms underlying diseases and treatments. Additionally, NanoString has expanded its product offerings to include new applications like spatial transcriptomics with the GeoMx Digital Spatial Profiler, enabling researchers to study gene expression in the context of tissue architecture.

These companies offer a wide range of sequencing technologies and services to cater to different research and diagnostic needs. The industry is continuously evolving, with new players and technologies entering the market regularly.

How to use H3K27me3 and H3K4me3 to identify transcription factors?

H3K27me3 (histone H3 lysine 27 trimethylation) and H3K4me3 (histone H3 lysine 4 trimethylation) are histone marks associated with gene repression and activation, respectively. While these marks can provide insights into the chromatin state and regulation of gene expression, they are not directly used to identify transcription factors. Instead, they can be used to identify regions of interest where transcription factors might bind and regulate gene expression.

To identify transcription factors, you would typically perform a chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiment using antibodies specific for the transcription factors of interest. However, you can use the information from H3K27me3 and H3K4me3 ChIP-seq data to narrow down the regions where you may expect to find transcription factors binding.

Here’s a general approach:

  1. Identify regions marked by H3K4me3, which indicates active promoters, and H3K27me3, which indicates repressed regions, using peak calling tools such as MACS or HOMER.

  2. Exclude regions marked by H3K27me3, as transcription factors are less likely to bind and regulate genes in repressed chromatin.

  3. Focus on the regions marked by H3K4me3, as these represent active promoters where transcription factors might bind to regulate gene expression.

  4. Perform motif analysis on these active promoter regions using motif discovery tools like MEME or HOMER to identify overrepresented DNA motifs that could be the binding sites of transcription factors.

  5. Compare the identified motifs with known transcription factor binding motifs in databases such as JASPAR, TRANSFAC, or HOCOMOCO to predict the transcription factors that may bind to these regions.

By integrating the information from H3K27me3 and H3K4me3 ChIP-seq data, you can prioritize regions for further analysis and potentially identify transcription factors that may be regulating gene expression in your system.
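
Steps 2 and 3 of the approach above amount to an interval filter: keep the H3K4me3 peaks that no H3K27me3 peak overlaps. The following is a minimal sketch of that filter, assuming peaks are already available as BED-style `(chrom, start, end)` tuples; the function names and the toy coordinates are hypothetical and only illustrate the logic that a tool such as BEDTools would apply to real peak files.

```python
from collections import defaultdict

def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def active_promoter_peaks(h3k4me3_peaks, h3k27me3_peaks):
    """Return the H3K4me3 peaks that do not overlap any H3K27me3 peak."""
    repressed_by_chrom = defaultdict(list)
    for chrom, start, end in h3k27me3_peaks:
        repressed_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in h3k4me3_peaks:
        hit = any(overlaps(start, end, r_start, r_end)
                  for r_start, r_end in repressed_by_chrom[chrom])
        if not hit:
            kept.append((chrom, start, end))
    return kept

# Toy peaks with hypothetical coordinates (not real data)
h3k4me3 = [("chr1", 1000, 2000), ("chr1", 5000, 6000), ("chr2", 300, 900)]
h3k27me3 = [("chr1", 5500, 9000), ("chr2", 900, 1500)]

candidates = active_promoter_peaks(h3k4me3, h3k27me3)
print(candidates)
```

The nested scan is quadratic, which is fine for a sketch; for genome-wide peak sets a sorted sweep or an interval tree would be the usual choice.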

How to use H3K27ac, H3K4me1, and RNA-seq to identify enhancers and their target genes?

  1. Identify the overlapping peaks of H3K27ac and H3K4me1 using tools like BEDTools intersect or HOMER mergePeaks. H3K4me1 (histone H3 lysine 4 monomethylation) is commonly found in both active and poised enhancer regions. On the other hand, H3K27ac (histone H3 lysine 27 acetylation) is more specific to active enhancers. By combining both H3K27ac and H3K4me1 histone marks, we can more confidently identify active enhancer regions, as the overlapping regions marked by both histone modifications are more likely to represent true enhancers with regulatory roles in gene expression. This approach helps to reduce the number of false positives.

  2. Exclude promoter regions (TSS +/- 2 kb) from the overlapping peaks using BEDTools subtract, or BEDTools intersect with the -v option to drop every peak that overlaps a promoter. This is because the histone marks H3K27ac and H3K4me1 can be present in both promoters and enhancers. By excluding the promoter regions, we can reduce the likelihood of identifying promoter-driven regulatory elements and focus on the distal enhancers that are more likely to have long-range regulatory effects.

  3. For each remaining distal enhancer region (overlapping peak), find the nearest gene(s) using tools like BEDTools closest or HOMER annotatePeaks.

  4. Analyze the expression patterns of these nearest genes in the RNA-seq data to determine if they are differentially expressed or show any interesting expression patterns that could be related to the putative enhancer activity.
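
Steps 1 to 3 above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a replacement for BEDTools or HOMER: peaks are `(chrom, start, end)` tuples, each gene is represented by a single TSS position, and all names and coordinates (`putative_enhancers`, `nearest_gene`, `GENE_A`, `GENE_B`) are hypothetical.

```python
def intersect(a, b):
    """Return the shared part of two (chrom, start, end) intervals, or None."""
    if a[0] != b[0]:
        return None
    start, end = max(a[1], b[1]), min(a[2], b[2])
    return (a[0], start, end) if start < end else None

def putative_enhancers(h3k27ac, h3k4me1, tss_list, flank=2000):
    """Regions carrying both marks that lie outside every TSS +/- flank window."""
    distal = []
    for a in h3k27ac:
        for b in h3k4me1:
            shared = intersect(a, b)
            if shared is None:
                continue
            chrom, start, end = shared
            near_promoter = any(
                c == chrom and start < pos + flank and pos - flank < end
                for c, pos, _gene in tss_list)
            if not near_promoter:
                distal.append(shared)
    return distal

def nearest_gene(region, tss_list):
    """Gene whose TSS is closest to the region midpoint on the same chromosome."""
    chrom, start, end = region
    mid = (start + end) // 2
    distances = [(abs(pos - mid), gene) for c, pos, gene in tss_list if c == chrom]
    return min(distances)[1] if distances else None

# Toy input with hypothetical coordinates and gene names
h3k27ac = [("chr1", 10000, 11000), ("chr1", 50000, 51000)]
h3k4me1 = [("chr1", 10500, 12000), ("chr1", 50200, 50800), ("chr1", 80000, 81000)]
tss = [("chr1", 11500, "GENE_A"), ("chr1", 60000, "GENE_B")]

enhancers = putative_enhancers(h3k27ac, h3k4me1, tss)
assignments = {region: nearest_gene(region, tss) for region in enhancers}
print(assignments)
```

Nearest-gene assignment is only a first approximation, since enhancers can act on genes that are not their closest neighbour; that is why step 4 checks the assignment against the RNA-seq data.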

How to extract promoter positions?

#https://charlesjb.github.io/How_to_extract_promoters_positions/  
setwd("/media/jhuang/Elements1/Data_Denise_LT_DNA_Bindung")

# Load required libraries
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicFeatures)
library(rtracklayer)
library(GenomicRanges)
library(org.Hs.eg.db)

# Load TxDb object
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Get genes and their promoters (promoters() defaults: 2000 bp upstream, 200 bp downstream of the TSS)
genes_txdb <- genes(txdb)
promoters_txdb <- promoters(genes_txdb)

# Each promoter keeps the gene_id (Entrez ID) of the gene it was derived from,
# so no overlap search is needed; an overlap search can assign a neighbouring gene instead
promoters_with_gene_names <- promoters_txdb
gene_ids <- mcols(promoters_with_gene_names)$gene_id

# Add gene names and scores to promoters (strand is already stored in the GRanges object)
gene_symbols <- mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = "ENTREZID", multiVals = "first")
# Fall back to the Entrez ID when no symbol is available
gene_symbols[is.na(gene_symbols)] <- gene_ids[is.na(gene_symbols)]
mcols(promoters_with_gene_names)$gene_name <- unname(gene_symbols)
mcols(promoters_with_gene_names)$score <- 0

# Custom function to export the data in the desired BED format
export_bed_with_extra_columns <- function(gr, file) {
  bed_data <- data.frame(chrom = as.character(seqnames(gr)),
                         start = pmax(start(gr) - 1, 0), # Ensure start is never negative
                         end = end(gr),
                         gene_name = mcols(gr)$gene_name,
                         score = mcols(gr)$score,
                         strand = as.character(strand(gr)),
                         stringsAsFactors = FALSE)
  write.table(bed_data, file, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# Save promoters with gene names, scores, and strands as a BED file
export_bed_with_extra_columns(promoters_with_gene_names, "promoters_with_gene_names.bed")
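
The coordinate arithmetic behind the script above is easy to get wrong on the minus strand, so here is a small Python sketch of it. It assumes the documented behaviour of GenomicRanges `promoters()` (default 2000 bp upstream and 200 bp downstream of the TSS, with the TSS at the gene end on the minus strand) and the same 1-based to 0-based conversion used in `export_bed_with_extra_columns`. The function name `promoter_bed_record` and the example genes are hypothetical.

```python
def promoter_bed_record(chrom, gene_start, gene_end, strand, name,
                        upstream=2000, downstream=200):
    """Return a BED6 tuple for the promoter of one gene.

    gene_start and gene_end are 1-based inclusive, as in a GRanges object.
    """
    if strand == "+":
        tss = gene_start
        start, end = tss - upstream, tss + downstream - 1
    else:
        tss = gene_end
        start, end = tss - downstream + 1, tss + upstream
    # 1-based inclusive -> 0-based half-open, clamped at the chromosome start
    bed_start = max(start - 1, 0)
    return (chrom, bed_start, end, name, 0, strand)

# Hypothetical genes used only to show the strand handling
plus_gene = promoter_bed_record("chr1", 10000, 20000, "+", "GENE_PLUS")
minus_gene = promoter_bed_record("chr1", 10000, 20000, "-", "GENE_MINUS")
edge_gene = promoter_bed_record("chr1", 500, 3000, "+", "GENE_EDGE")

for record in (plus_gene, minus_gene, edge_gene):
    print("\t".join(str(field) for field in record))
```

Both strands give a 2200 bp window, and the promoter of a gene close to the chromosome start is clamped at 0 instead of producing a negative BED start.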

Why Do Significant Gene Lists Change After Adding Additional Conditions in Differential Gene Expression Analysis?

DESeq2 is a popular method for differential expression analysis of count data from RNA-seq experiments. It estimates fold changes and tests for differential expression using a negative binomial generalized linear model. When you perform a differential expression analysis using DESeq2, the results may vary depending on the experimental conditions included in the analysis. In your case, the significant gene list differs between the two analyses: one with condition A and B, and the other with condition A, B, and C.

There are several reasons for this discrepancy:

  1. Normalization factors: DESeq2 estimates size factors with the median-of-ratios method, in which each gene's count is compared with its geometric mean across all samples in the dataset. Adding the samples of condition C changes those reference values, so the size factors of the A and B samples can shift, which affects the fold change estimates and ultimately the list of differentially expressed genes.

  2. Dispersion estimation: DESeq2 uses a shrinkage estimator for dispersions, which are gene-specific measures of biological variability across replicates. Each gene receives a single dispersion estimate based on all samples in the dataset, so the replicates of condition C feed into the estimate used for the A vs B test and can change which genes are called differentially expressed.

  3. Multiple testing correction: DESeq2 uses the Benjamini-Hochberg procedure to control the false discovery rate (FDR) when testing for differentially expressed genes. Its independent filtering step drops low-count genes based on the mean normalized count across all samples, so adding condition C can change which genes are tested. This shifts the adjusted p-values and consequently the list of significant genes.

  4. Model fitting: DESeq2 fits a negative binomial generalized linear model to the count data. Including an additional condition may affect the model fitting, leading to different estimates of the coefficients and p-values, which in turn can affect the list of differentially expressed genes.

These factors can contribute to the differences in the significant gene list between the two analyses. To minimize discrepancies, ensure that you have a well-designed experiment with biological replicates for each condition and carefully consider the comparisons of interest when setting up the design matrix for DESeq2 analysis.
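
The first cause, normalization, can be made concrete with a small sketch of the median-of-ratios idea. This is a simplified pure-Python illustration, not DESeq2's own code: the function name `size_factors` and the toy counts are hypothetical, and genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts with one entry per sample.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) == 0:
            continue  # geometric mean would be zero
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Toy counts: columns are samples, rows are genes (hypothetical values)
counts_ab = [[100, 200], [50, 100], [30, 90]]               # samples A, B
counts_abc = [[100, 200, 0], [50, 100, 50], [30, 90, 30]]   # samples A, B, C

sf_ab = size_factors(counts_ab)
sf_abc = size_factors(counts_abc)
print("B/A size factor ratio without C:", sf_ab[1] / sf_ab[0])
print("B/A size factor ratio with C:   ", sf_abc[1] / sf_abc[0])
```

In this toy case the first gene has a zero count in sample C and drops out of the calculation, so the relative scaling between samples A and B changes even though their own counts are identical in both runs.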

The choice between inputting A and B or inputting A, B, and C depends on your research goals and the specific comparisons you want to make. Both methods can be appropriate, but they address different questions.

  • Input A and B: If you are interested in comparing the gene expression profiles between conditions A and B, and condition C is not relevant to this specific comparison, then inputting only A and B would be a better choice. By analyzing only the conditions relevant to your research question, you can focus on the specific contrasts of interest, making the interpretation of the results more straightforward.

  • Input A, B, and C: If your research goals involve understanding the gene expression profiles in a broader context, such as comparing all three conditions or investigating how the expression profiles change across a time course or gradient, then inputting A, B, and C would be more appropriate. In this case, including all conditions in the analysis will provide a more comprehensive view of the gene expression changes, and the comparisons between the different conditions can help identify genes that show unique expression patterns or that are specific to certain conditions.

In summary, the choice between inputting A and B or inputting A, B, and C depends on your research question and objectives. It is crucial to clearly define the comparisons you are interested in and to consider the biological context of your study before deciding which method to use.

When adding an additional condition to a differential gene expression analysis, it’s natural for the results to change due to factors such as normalization, dispersion estimation, model fitting, and multiple testing correction, as previously discussed. However, there are some strategies you can apply to minimize the impact of adding an additional condition and make your results more robust:

  1. Independent analysis: Analyze the pairwise comparisons separately (A vs B, A vs C, and B vs C), and then compare the lists of significant genes obtained from each analysis. This approach keeps the comparisons of interest independent of the other conditions.

  2. Batch effect correction: If the additional condition introduces batch effects, you can use methods like ComBat from the sva package or limma’s removeBatchEffect function to correct for these batch effects before running differential expression analysis. This can help reduce the impact of adding an additional condition on the results.

  3. Consistent normalization methods: Use consistent normalization methods across all samples, even if you are adding a new condition. This will ensure that the read counts are comparable across all samples, reducing the impact of the additional condition on the results.

  4. Incorporate the additional condition in the model: If you want to include the additional condition in your analysis while still testing A against B, fit all three conditions as levels of a single factor in the design and extract the A vs B comparison as a contrast. The model then accounts for the additional condition while still allowing you to make the comparisons of interest.

  5. Intersection of significant genes: Perform the differential expression analysis with and without the additional condition, and then take the intersection of the significant genes from both analyses. The intersecting set of genes is likely to be more robust against the addition of the extra condition.

However, it’s important to note that adding an additional condition will inherently influence the results to some extent, as the analysis is dependent on the data provided. The strategies mentioned above can help to minimize the impact of adding an additional condition, but they cannot completely eliminate it.
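
The effect of multiple testing correction and the intersection strategy (item 5) can both be shown with a short sketch. The Benjamini-Hochberg implementation below is a plain-Python illustration; the gene labels and p-values are hypothetical, and the raw p-values are deliberately kept identical between the two runs so that only the number of tests differs.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant(genes, pvalues, alpha=0.05):
    """Set of genes whose adjusted p-value is below alpha."""
    return {g for g, q in zip(genes, benjamini_hochberg(pvalues)) if q < alpha}

# Hypothetical results: the same three genes, tested alone or with two more genes
genes_small = ["g1", "g2", "g3"]
p_small = [0.001, 0.01, 0.04]
genes_large = ["g1", "g2", "g3", "g4", "g5"]
p_large = [0.001, 0.01, 0.04, 0.5, 0.9]

sig_small = significant(genes_small, p_small)
sig_large = significant(genes_large, p_large)
robust = sig_small & sig_large
print(sorted(sig_small), sorted(sig_large), sorted(robust))
```

Gene `g3` passes the 0.05 threshold when three genes are tested but not when five are tested, although its raw p-value never changed. The intersection keeps only the genes that are significant in both settings.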

H3K4me3, H3K27me3 and TF

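H3K4me3 (trimethylated histone H3 lysine 4) and H3K27me3 (trimethylated histone H3 lysine 27) are histone modifications closely linked to the regulation of gene expression. Transcription factors are proteins that regulate gene transcription.

  1. H3K4me3: Usually associated with gene promoter regions, it marks active chromatin and favors transcription. As an epigenetic mark, H3K4me3 can recruit and bind transcription factors and other transcriptional cofactors, thereby initiating gene transcription.

  2. H3K27me3: In contrast to H3K4me3, H3K27me3 is usually associated with gene silencing. It represses transcription by recruiting and binding the PRC2 complex (Polycomb Repressive Complex 2). When PRC2 raises the methylation level at H3K27 to trimethylation, gene expression is repressed.

  3. Transcription factors: A special class of proteins that bind specific DNA sequences and thereby regulate the transcription of the corresponding genes. They can be divided into activators and repressors, which enhance and repress transcription respectively. By recognizing and binding regulatory elements such as promoters and enhancers, transcription factors recruit or block RNA polymerase and so regulate gene transcription.

As epigenetic modifications, H3K4me3 and H3K27me3 take part in the regulation of gene expression together with transcription factors. In some cases the two modifications coexist on the same gene, forming so-called "bivalent domains". This state keeps the gene poised, so that it can be rapidly activated or further repressed when the appropriate signal arrives. This form of epigenetic regulation plays an important role in development and in disease.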

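Transcription factors cannot be computed directly from H3K4me3 and H3K27me3 data. Both are histone modifications that influence gene expression, but they do not take part directly in transcription factor binding. To identify transcription factor binding sites, use experimental methods such as ChIP-seq (chromatin immunoprecipitation sequencing).

However, H3K4me3 and H3K27me3 data can help predict likely transcription start sites (TSS) and active or inactive genes. That information can then be used to analyze the binding patterns and functions of transcription factors.

One possible strategy:

  • Analyze the H3K4me3 and H3K27me3 ChIP-seq data to find the genomic regions enriched for the two modifications.
  • Use the enrichment information to predict likely transcription start sites (TSS) and active or inactive genes.
  • For known or predicted active genes, analyze transcription factor binding patterns further. This requires transcription factor ChIP-seq data or other experimental data such as DNase-seq or ATAC-seq, which reveal open chromatin regions.
  • Analyze the transcription factor binding sites with bioinformatics tools such as MEME or HOMER to identify the consensus sequence features (motifs) of the binding sites.
  • Combine the results with gene expression data to study how transcription factors regulate specific genes.

In short, H3K4me3 and H3K27me3 data provide information about the regulation of gene expression, but they must be combined with other experimental data and analysis methods before the binding and function of transcription factors can be studied.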
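
The second step of the strategy above, predicting active and inactive genes from the two marks, can be sketched as a small classifier that also flags the bivalent state. It is a minimal illustration assuming promoters and peaks are given as `(chrom, start, end)` tuples; the function names, the state labels, and the toy genes are hypothetical.

```python
def is_marked(region, peaks):
    """True when any peak overlaps the region (half-open coordinates)."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in peaks)

def classify_promoters(promoters, h3k4me3_peaks, h3k27me3_peaks):
    """Label each promoter as active, repressed, bivalent, or unmarked."""
    states = {}
    for gene, region in promoters.items():
        k4 = is_marked(region, h3k4me3_peaks)
        k27 = is_marked(region, h3k27me3_peaks)
        if k4 and k27:
            states[gene] = "bivalent"
        elif k4:
            states[gene] = "active"
        elif k27:
            states[gene] = "repressed"
        else:
            states[gene] = "unmarked"
    return states

# Toy promoters and peaks with hypothetical coordinates
promoters = {
    "GENE_A": ("chr1", 1000, 3200),
    "GENE_B": ("chr1", 10000, 12200),
    "GENE_C": ("chr2", 500, 2700),
    "GENE_D": ("chr2", 9000, 11200),
}
h3k4me3 = [("chr1", 1500, 2500), ("chr2", 600, 1200)]
h3k27me3 = [("chr1", 9000, 15000), ("chr2", 1000, 4000)]

states = classify_promoters(promoters, h3k4me3, h3k27me3)
print(states)
```

These labels describe chromatin state only. Confirming that a gene is actually expressed still requires RNA-seq data, as the strategy notes.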

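A detailed strategy for inferring transcription factor activity by combining H3K4me3 and H3K27me3 ChIP-seq data with RNA-seq data:

  1. Data preparation:

    • Collect the H3K4me3 and H3K27me3 ChIP-seq data and the RNA-seq data.
    • Run quality control and preprocessing on the ChIP-seq and RNA-seq data.
  2. ChIP-seq data analysis:

    • After aligning the reads to the reference genome, call peaks with software such as MACS or SICER to find the genomic regions enriched for H3K4me3 and H3K27me3.
    • Predict potential transcription start sites (TSS) and active or inactive genes. H3K4me3-enriched regions usually lie near the promoters of active genes, whereas H3K27me3-enriched regions are associated with inactive genes.
  3. RNA-seq data analysis:

    • Align the RNA-seq reads to the reference genome with software such as HISAT2 or STAR.
    • Estimate gene expression levels: count the reads per gene with featureCounts, HTSeq, or similar software, then normalize the counts with DESeq2, edgeR, or similar software.
  4. Differential expression analysis:

    • Combine the gene expression levels with the H3K4me3 and H3K27me3 enrichment information to determine which genes are active under specific conditions.
    • Run a differential expression analysis with DESeq2, edgeR, or similar software to find the genes that differ significantly between conditions.
  5. Predicting transcription factor binding sites:

    • For the differentially expressed genes, examine their promoters and regulatory elements for potential transcription factor binding sites.
    • Use bioinformatics tools such as MEME or HOMER to identify the consensus sequence features (motifs) of the binding sites.
  6. Building a transcription factor regulatory network:

    • From the association between transcription factor binding sites and gene expression changes, infer which transcription factors may be acting under specific conditions.
    • Assess the potential activity of the transcription factors with Gene Set Enrichment Analysis (GSEA) or similar methods.
    • Link the predicted transcription factors to the differentially expressed genes they may regulate to build the regulatory network.

Note that although this strategy can infer transcription factor activity to some extent, it cannot directly determine the transcription factors themselves.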

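H3K4me3 peak regions are usually associated with promoter regions near transcription start sites (TSS). These regions often contain transcription factor binding sites. However, an H3K4me3-enriched region is not the same thing as a transcription factor binding region; it is only an epigenetic mark associated with active genes.

Transcription factor binding sites usually lie in open chromatin, which can be detected with experimental methods such as DNase-seq or ATAC-seq. To study the binding and activity of a specific transcription factor more directly, use ChIP-seq to map where that factor binds in the genome.

In summary, H3K4me3 peak regions may contain transcription factor binding sites, but they are not equivalent to transcription factor binding regions. Locating binding sites more accurately requires experimental methods that target transcription factor binding specifically.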

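H3K27me3 (histone H3 trimethylated at lysine 27) is a histone modification usually associated with gene silencing and repression. It is deposited mainly by the PRC2 complex (Polycomb Repressive Complex 2). H3K27me3 is found mainly in the following genomic regions:

  • Promoter regions: H3K27me3 can mark the promoter of a gene, which inhibits RNA polymerase binding and transcription of the gene.

  • Gene bodies: H3K27me3 can also be enriched in chromatin within gene bodies, where it is associated with repressed gene expression.

  • Intergenic regions: In some cases H3K27me3 forms large enriched blocks in intergenic regions, known as "Polycomb domains". These regions are usually associated with heterochromatin structure, genome stability, and long-term gene silencing.

H3K27me3 locations can be identified with ChIP-seq (chromatin immunoprecipitation sequencing). Analyzing H3K27me3 ChIP-seq data reveals the genomic regions enriched for the mark, which tend to correspond to regions where gene expression is repressed.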

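Predicting transcription factor (TF) activity from RNA-seq data is possible, but only through indirect methods. RNA-seq data give the expression level of each gene under a given condition, but they cannot directly show where TFs bind in the genome. TF activity can nevertheless be inferred by analyzing differentially expressed genes together with TF regulatory networks.

One strategy for predicting TF activity from RNA-seq data:

  1. RNA-seq data analysis:

    • Align the RNA-seq reads to the reference genome with software such as HISAT2 or STAR.
    • Estimate gene expression levels: count the reads per gene with featureCounts, HTSeq, or similar software, then normalize the counts with DESeq2, edgeR, or similar software.
  2. Differential expression analysis:

    • Run a differential expression analysis with DESeq2, edgeR, or similar software to find the genes that differ significantly between conditions.
  3. Gene regulatory network inference:

    • Use known TF-target relationships (for example from databases such as TRANSFAC or JASPAR) or gene regulatory network inference tools (such as GENIE3 or ARACNe) to infer likely TF-target gene relationships from the expression patterns of the differentially expressed genes.
  4. TF activity assessment:

    • Combine the differentially expressed genes with the inferred TF-target relationships and use Gene Set Enrichment Analysis (GSEA) or a similar method to assess the potential activity of each TF.

Note that this strategy depends on predicted TF-target relationships and gene expression patterns, so it may not accurately reflect the true activity of a TF. To study the binding and activity of a specific TF more directly, use methods such as ChIP-seq, DNase-seq, or ATAC-seq to detect where the TF binds in the genome.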
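
The final assessment step can be approximated with a hypergeometric over-representation test, which asks whether the targets of a TF appear among the differentially expressed genes more often than chance would predict. This is a simpler stand-in for GSEA, not GSEA itself, and everything in the sketch (the function name, the gene labels, the TF names `TF_X` and `TF_Y`, and the target sets) is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_targets, n_de, n_overlap):
    """P(overlap >= n_overlap) when n_de genes are drawn at random from a
    universe of n_universe genes that contains n_targets TF targets."""
    upper = min(n_targets, n_de)
    total = comb(n_universe, n_de)
    tail = sum(comb(n_targets, k) * comb(n_universe - n_targets, n_de - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical universe of 20 genes, two TFs with five targets each
universe = [f"g{i}" for i in range(1, 21)]
tf_targets = {
    "TF_X": {"g1", "g2", "g3", "g4", "g5"},
    "TF_Y": {"g6", "g7", "g8", "g9", "g10"},
}
de_genes = {"g1", "g2", "g3", "g4", "g10", "g11"}

p_values = {
    tf: hypergeom_enrichment_p(len(universe), len(targets),
                               len(de_genes), len(targets & de_genes))
    for tf, targets in tf_targets.items()
}
for tf, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(tf, round(p, 4))
```

With real data the p-values would need multiple testing correction across all TFs tested, and the result is only as reliable as the TF-target annotations that go into it.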

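Potential motifs (the consensus sequence features of transcription factor binding sites) can be derived by analyzing the promoter regions of differentially expressed genes. One strategy for doing so:

  1. Extract the promoter sequences of the differentially expressed genes: For each differentially expressed gene, extract a stretch of sequence around its transcription start site (TSS) from the reference genome as the promoter region. A common choice is 1000 bp upstream to 200 bp downstream of the TSS, but the range can be adjusted as needed.

  2. Collect the promoter sequences: Gather the promoter sequences of all differentially expressed genes into a single multi-sequence file, one record per promoter, as input for motif discovery. Keep the sequences as separate records and do not join them end to end, because joining creates artificial junctions and hides how many promoters contain a given motif.

  3. Search for motifs: Use bioinformatics tools such as MEME, HOMER, or DREME to analyze the promoter sequences and find consensus sequence features that recur across multiple promoters. These features may represent potential transcription factor binding sites.

  4. Compare with known motifs: Compare the discovered motifs with known transcription factor binding sites to identify candidate transcription factors. Binding site databases such as JASPAR, TRANSFAC, or HOCOMOCO can be used to compare motif similarity.

  5. Analyze and visualize the motifs: Software tools such as Seq2Logo or WebLogo can generate sequence logos that show how conserved each nucleotide is within the binding site. The distribution and co-occurrence of motifs across the promoters of different differentially expressed genes can also be analyzed.

Note that this method assumes the differentially expressed genes are regulated mainly through transcription factor binding in promoter regions. In practice, transcription factors can also regulate gene expression through enhancers far from the TSS or through other regulatory elements, so promoter motif analysis may miss important regulatory information.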
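
Once a candidate motif is available as a count matrix, checking promoters for it comes down to scoring each sequence window against a position weight matrix (PWM). The sketch below shows that scoring step only, on the forward strand, with a uniform background. It is not a substitute for MEME or HOMER, and the count matrix, the promoter sequences, and the function names are hypothetical.

```python
import math

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm

def best_hit(sequence, pwm):
    """Return (score, position) of the highest-scoring window, or None."""
    width = len(pwm)
    best = None
    for pos in range(len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, pos)
    return best

# Hypothetical TATA-like count matrix and toy promoter sequences
motif_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
pwm = pwm_from_counts(motif_counts)
promoter_seqs = {"GENE_A": "GGCGTATAGGC", "GENE_B": "GGCGCGCGGCC"}

hits = {gene: best_hit(seq, pwm) for gene, seq in promoter_seqs.items()}
for gene, (score, pos) in hits.items():
    print(gene, round(score, 2), pos)
```

A positive log-odds score means the window resembles the motif more than the background. A full scan would also score the reverse complement and choose a score threshold, for example from shuffled sequences.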

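A ChIP-seq experiment with an antibody against a specific protein can determine where that protein binds in the genome. If the protein is a known transcriptional co-factor or a factor associated with enhancer activity, its binding sites can be analyzed to identify potential enhancer regions.

A strategy for finding enhancers from ChIP-seq data for a specific protein:

  1. Data preparation:

    • Collect ChIP-seq data for the target protein.
    • Run quality control and preprocessing on the ChIP-seq data.
  2. ChIP-seq data analysis:

    • After aligning the reads to the reference genome, call peaks with software such as MACS or SICER to find the binding sites of the target protein.
  3. Filtering for potential enhancer regions:

    • Keep binding sites located upstream of genes, in introns, or in intergenic regions, since these regions are more likely to contain enhancers.
    • Combine with other relevant histone modification data (such as H3K4me1 and H3K27ac) or open chromatin data (such as DNase-seq or ATAC-seq) to further select binding sites that carry these features.
  4. Validation and functional analysis:

    • Validate the function of the selected candidate enhancers with reporter assays, CRISPR/Cas9, or other experimental methods.
    • Use gene expression data or other functional analyses to understand how binding of the target protein relates to enhancer activity.

Note that this strategy depends on the protein under study being related to enhancer activity. For more general enhancer prediction, consider histone modification data (such as H3K4me1 and H3K27ac) or open chromatin data (such as DNase-seq or ATAC-seq).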

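An enhancer is a non-coding regulatory sequence that can influence gene expression from a position far from the promoter by serving as a binding platform for transcription factors. The main molecules that bind enhancers and take part in gene regulation are:

  • Transcription factors (TFs): Proteins that bind specific DNA sequences. They can regulate gene expression by binding enhancers. Typical transcription factor families include bZIP, bHLH, zinc finger, and homeodomain proteins.

  • Transcription co-factors: These act together with transcription factors, helping them bind enhancers and regulate gene expression. Co-factors can be histone-modifying enzymes, chromatin remodelers, or other regulatory proteins.

  • Chromatin modifiers: These enzymes adjust chromatin accessibility and activity by adding or removing histone modifications such as acetylation and methylation. For example, enhancer activity is usually associated with H3K4me1 (histone H3 monomethylated at lysine 4) and H3K27ac (histone H3 acetylated at lysine 27).

  • Chromatin remodelers: These proteins adjust DNA accessibility by changing chromatin structure, which affects the binding of transcription factors and other regulatory factors to enhancers.

  • Mediator complex: Mediator is a multi-protein complex that acts as a bridge in transcriptional regulation. It can bind the transcription factors on an enhancer and form a looped structure with RNA polymerase II at the promoter, thereby promoting transcription of the gene.

  • Long non-coding RNAs (lncRNAs): Some lncRNAs influence gene expression by associating with enhancers and modulating the activity of transcription factors and other regulatory factors.

These molecules act together at enhancers to regulate genes. When studying enhancer function, techniques such as ChIP-seq, DNase-seq, and ATAC-seq can detect where these molecules bind in the genome and thus reveal the regulatory network of an enhancer.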

Creating an hg38 Database File for DiffReps Analysis

To generate a database file for the hg38 human genome assembly for use with the DiffReps tool, you will first need to download the hg38 annotation file in GTF or GFF format, and then convert it to the BED format that DiffReps accepts.

Here are the steps to create the database file:

  1. Download the hg38 annotation file in GTF or GFF format from a reputable source like GENCODE or UCSC Genome Browser. For example, you can download the GTF file from GENCODE: https://www.gencodegenes.org/human/.

  2. Install the necessary tools. bedtools has no GTF-to-BED subcommand, so here we use the gtf2bed converter from BEDOPS. You can install it using conda or any other package manager:

    conda install -c bioconda bedops

  3. Convert the GTF file to the BED format (gtf2bed reads from standard input):

    gtf2bed < gencode.v38.annotation.gtf > gencode.v38.annotation.bed

    Replace “gencode.v38.annotation.gtf” with the name of the GTF file you downloaded in step 1.

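  4. The resulting “gencode.v38.annotation.bed” file can be used as a database file for the DiffReps analysis. Provide this file as input to the -db option when running DiffReps. Confirm the option names and accepted input formats against the help output of your installed DiffReps version before running: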

    diffreps -tr treatment.bam -co control.bam -db gencode.v38.annotation.bed -o output_dir

Please note that these instructions are for generating a database file for hg38, but you can follow similar steps for other genome assemblies as well.
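
What the conversion in step 3 does to the coordinates can be seen in a short Python sketch that turns GTF gene records into BED6 lines. It is an illustration only: the function name `gtf_gene_lines_to_bed` and the toy GTF records are hypothetical, only records with feature type `gene` are kept, and whether DiffReps accepts this exact column layout is an assumption that should be checked against its documentation.

```python
import re

def gtf_gene_lines_to_bed(gtf_lines):
    """Yield BED6 lines for GTF records whose feature type is 'gene'."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])
        match = (re.search(r'gene_name "([^"]+)"', attrs)
                 or re.search(r'gene_id "([^"]+)"', attrs))
        name = match.group(1) if match else "."
        # GTF is 1-based inclusive; BED is 0-based half-open
        yield "\t".join([chrom, str(start - 1), str(end), name, "0", strand])

# Toy GTF content with hypothetical gene identifiers
toy_gtf = [
    "##description: toy annotation",
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG_TOY1"; gene_name "TOY1";',
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG_TOY1"; gene_name "TOY1";',
    'chr2\tHAVANA\tgene\t500\t900\t.\t-\t.\tgene_id "ENSG_TOY2";',
]

bed_lines = list(gtf_gene_lines_to_bed(toy_gtf))
print("\n".join(bed_lines))
```

Only the start coordinate shifts by one, because BED intervals are 0-based and exclude the end position. The second toy gene has no `gene_name` attribute, so its `gene_id` is used as the name.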