Author Archives: gene_x

多瘤病毒科家族中的MCPyV与TSPyV

MCPyV(梅克尔细胞多瘤病毒,Merkel cell polyomavirus)和TSPyV(纺锤状毛发发育不良相关多瘤病毒,Trichodysplasia spinulosa-associated polyomavirus)都是多瘤病毒科(Polyomaviridae)家族的成员,这是一类可以感染各种脊椎动物的小型双链DNA病毒。尽管它们都属于同一家族,但它们与不同类型的疾病相关。

MCPyV与一种罕见的皮肤癌——梅克尔细胞癌(Merkel cell carcinoma,MCC)有关。梅克尔细胞癌是一种快速生长的神经内分泌肿瘤,主要发生在皮肤表面。MCPyV在大约80%的梅克尔细胞癌患者中被发现。MCPyV感染通常是无害的,但在某些情况下,病毒可能会整合到宿主细胞的基因组中,导致细胞恶性转化和肿瘤发展。

与之相反,TSPyV与一种罕见的皮肤病——纺锤状毛发发育不良(Trichodysplasia spinulosa,简称TS)相关。这种病状的特点是毛囊纺锤状突起、脱发和毛囊异常生长,主要影响免疫受损的个体,如器官移植受者或艾滋病患者。

尽管MCPyV和TSPyV都属于多瘤病毒科家族,它们在致病机制、相关疾病和受影响人群方面存在显著差异。研究这些病毒将有助于更好地了解它们的感染和致病机制,以及为相关疾病的患者开发有效的治疗策略。

Guide to Submitting Data to GEO (Gene Expression Omnibus)

  1. Create an account: First, create a GEO account at https://www.ncbi.nlm.nih.gov/geo/submission/. If you already have an NCBI account, you can use the same credentials to log in.

  2. Upload data files via FTP: Upload your raw data and processed data files to the GEO server using an FTP client. Please refer to GEO’s FTP upload instructions: https://www.ncbi.nlm.nih.gov/geo/info/ftp.html.

  3. Download the appropriate template: Based on your data type, download the corresponding Excel template (called “SOFT” files) from the GEO submission guidelines page: https://www.ncbi.nlm.nih.gov/geo/info/seq.html. There are different templates for platforms, samples, and series.

  4. Prepare metadata in the Excel template: Fill out the Excel template with the required information about your samples, platform, and series (experiment). Be sure to follow the GEO guidelines for formatting and required fields.

    • Platform: Describe the technology used for data generation (e.g., microarray or RNA-seq). Provide platform details like manufacturer, layout, probe sequences, etc.

    • Samples: Provide sample details such as source, treatment, extraction protocol, labeling, and hybridization methods. Also, include any relevant clinical or phenotypic data.

    • Series: Describe the overall experiment design and goals, as well as any related publications or supplementary files.

  5. Submit the Excel template: Log in to the GEO Submission Portal (https://www.ncbi.nlm.nih.gov/geo/submission/) using your NCBI account. Click “Submit” to start a new submission and upload the completed Excel template.

    Download an Excel template

    Download an example Excel file for ChIP-seq submission

    Download an example Excel file for RNA-seq submission

  6. Notify GEO about your FTP file transfer (suitable for high-throughput sequencing or large microarray submissions and updates). GEO_notify_screenshot

  7. Wait for the review: The GEO team will review your submission and may contact you for additional information or clarification. Once your submission is approved, you will receive a confirmation email containing your GEO accession number(s).

  8. Cite your data: Include the GEO accession number(s) in any related publications or presentations to ensure proper attribution and facilitate data discovery.

For more detailed instructions and guidelines, visit the GEO Submission Guidelines page: https://www.ncbi.nlm.nih.gov/geo/info/submission.html.

Quick Instructions

  1. Check that GEO accepts your data type.
  2. Gather raw data files.
  3. Gather processed data files .
  4. Fill in Metadata Template (one seq type per template). Please review “Before completing your Metadata Template” below.
  5. Fill in MD5 Checksums sheet for any raw data files and processed data files referenced in Metadata Template.
  6. Create a folder on your computer that contains all raw and processed files and your completed Metadata Template in Excel format.
  7. FTP the entire data folder to GEO.
  8. Notify GEO using the ‘Submit to GEO’ web form, after the FTP transfer is complete; unannounced files will not be processed.
  9. Your submission is placed into the processing queue and reviewed within 5 business days; expect to receive an email from GEO curators with questions about your submission or the GEO accession numbers.

* Updating GEO records (that have been processed and approved) can be labor-intensive and time-consuming, so please carefully prepare your submission before you transfer your files to the GEO FTP server.

* A complete GEO submission consists of the following 3 components. If your transfer does not include all 3 components, please explain the reason in the comment box below. An incomplete submission may result in processing delays.

  • Completed metadata worksheet
  • Raw data
  • Processed data

* When this submission should be released to the public (more information about release dates)

  • Keep my existing release date
  • Specify a new future release date for the submission being updated (up to 4 years from today). New release dates apply only to submissions that are still private.

https://submit.ncbi.nlm.nih.gov/geo/submission/

https://www.ncbi.nlm.nih.gov/geo/info/faq.html#holduntilpublished

https://www.ncbi.nlm.nih.gov/geo/submitter/

https://www.ncbi.nlm.nih.gov/geo/subs/

耶尔森氏菌Type III分泌系统效应蛋白

  1. type III secretion system translocon subunit YopB:YopB (Yersinia outer protein B) 是Yersinia属细菌的一种效应蛋白,与YopD共同参与形成跨膜通道,使其他Yop效应蛋白能够穿过宿主细胞膜进入宿主细胞。

  2. type III secretion system translocon subunit YopD:YopD (Yersinia outer protein D) 与YopB共同作用,形成一种跨膜通道,有助于其他Yop效应蛋白进入宿主细胞。此外,YopD还参与调控Type III分泌系统(T3SS)的效应蛋白的分泌。

  3. type III secretion system effector YopK:YopK 是一种调控蛋白,主要作用是细调Type III分泌系统的效应蛋白进入宿主细胞的程度,从而平衡细菌的毒力和免疫逃避。

  4. T3SS effector protein-tyrosine-phosphatase YopH:YopH 是一种酪氨酸磷酸酶,可以抑制宿主细胞的信号传导,从而干扰细胞粘附和免疫细胞的功能,有助于细菌逃避宿主的免疫系统。

  5. type III secretion system effector acetyltransferase YopJ:YopJ 是一种酰基酶,可以抑制宿主细胞内的NF-κB和MAPK信号通路,进而抑制炎症反应和细胞凋亡,有助于细菌逃避宿主免疫系统。

  6. SctW family type III secretion system gatekeeper subunit YopN:YopN 主要作为Type III分泌系统的分泌调控蛋白,可防止Yop效应蛋白在细菌内过早分泌,以确保在适当时机释放。

  7. type III secretion system effector YopM:YopM 是一种调节蛋白,可以促进炎性细胞凋亡,抑制细胞因子的产生,从而降低宿主的炎症反应。

  8. T3SS polymerization control protein YopR:关于YopR的信息有限,可能是一个错误的命名,与YopM和YopN重复。

  9. type III secretion system effector GTPase activator YopE:YopE 是一种GTP酶激活蛋白(GAP),通过抑制宿主细胞的Rho GTP酶家族成员来破坏细胞骨架,从而削弱宿主细胞的免疫反应。

  10. type III secretion system effector protein kinase YopO/YpkA:YopO(又称YpkA)是一种丝氨酸/苏氨酸蛋白激酶,可以通过干扰宿主细胞的细胞骨架,抑制细胞迁移,从而影响免疫细胞的功能。此外,YopO还可以激活Rho GTP酶家族成员,影响宿主细胞的信号传导。

  11. T3SS effector cysteine protease YopT:YopT 是一种具有蛋白酶活性的效应蛋白,主要通过切割宿主细胞的Rho GTP酶家族成员,破坏宿主细胞的细胞骨架,从而干扰细胞粘附和迁移。

这11种Yersinia属细菌的Type III分泌系统效应蛋白在病原体侵染过程中具有重要作用。它们通过破坏宿主细胞的信号传导和细胞骨架、抑制炎症反应和细胞凋亡等方式,协同作用以维持病原体在宿主体内的生存和繁殖。了解这些效应蛋白的功能和作用机制对于研究Yersinia属细菌的致病机制和寻找新的治疗方法具有重要意义。

How to run AlphaFold2?

AlphaFold2 is a protein structure prediction model developed by DeepMind. To run AlphaFold2, you’ll need to follow these steps:

  1. Clone the AlphaFold repository:

    git clone https://github.com/deepmind/alphafold.git
    cd alphafold
  2. Set up the environment:

    You will need to install the necessary dependencies for AlphaFold. It’s recommended to use a Python virtual environment or a Conda environment.

    If you’re using a Python virtual environment, create and activate it:

    python3 -m venv alphafold_venv
    source alphafold_venv/bin/activate

    Then install the required packages:

    pip install -r requirements.txt

    If you prefer to use Conda, create a Conda environment and activate it:

    conda create -n alphafold python=3.8
    conda activate alphafold

    Then install the required packages:

    conda install -c conda-forge openmm
    conda install -c conda-forge pdbfixer
    pip install -r requirements.txt
  3. Download the necessary model data:

    You need to download the model parameters and databases. Create a directory to store the data:

    mkdir data

    Download the model parameters from the AlphaFold GitHub repository:

    wget -P data/ https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar
    tar -xf data/alphafold_params_2021-07-14.tar -C data/

    Download the necessary databases (e.g., UniRef, BFD, and MGnify). You can find instructions on how to download them in the README.md file in the AlphaFold repository or on their respective websites.

  4. Run AlphaFold2:

    You can run the AlphaFold2 using the provided run_alphafold.py script. For example, to predict the structure of a protein with the sequence in input.fasta, you can use the following command:

    python run_alphafold.py --fasta_paths=input.fasta --output_dir=output/ --preset=full_dbs --max_template_date=2099-12-31 --data_dir=data/

    This command will run the full AlphaFold2 pipeline with all available databases and store the resulting structures in the output/ directory.

    Make sure to replace input.fasta with the path to your input FASTA file, and adjust other options as needed.

  5. Analyze the results:

    After the prediction is finished, you can find the predicted structures in the output/ directory. The PDB files can be visualized using molecular visualization software such as PyMOL, Chimera, or VMD.

Plot phylogenetic tree_heatmap and MSA on yopBDJTEMKOH[NR]

Isolate yopJ yopB yopT yopE yopD yopM yopK yopO yopH
Yersinia_enterocolitica_1055Rr 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_2516-87 0 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_8081 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_8081_bis 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_KNG22703 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_WA 1 1 1 1 1 0 1 1 1
Yersinia_enterocolitica_Y1 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_Y11 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE1 1 0 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE165 1 1 1 1 0 1 1 0 1
Yersinia_enterocolitica_YE3 1 0 1 1 1 1 1 0 1
Yersinia_enterocolitica_YE5 1 0 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE6 1 1 1 1 1 1 1 0 1
Yersinia_enterocolitica_YE7 1 1 1 1 1 1 1 1 1
Yersinia_pestis_1045 1 1 1 1 1 1 1 1 1
Yersinia_pestis_1412 1 1 0 1 1 1 1 1 1
Yersinia_pestis_1413 1 1 0 1 1 1 1 1 1
Yersinia_pestis_1522 1 1 0 0 1 1 1 1 1
Yersinia_pestis_20 1 1 1 1 1 1 1 1 1
Yersinia_pestis_2944 1 1 1 1 1 1 1 1 1
Yersinia_pestis_3067 1 1 0 1 1 1 1 1 1
Yersinia_pestis_3770 1 1 0 1 1 1 1 1 1
Yersinia_pestis_790 0 1 1 1 1 1 1 0 1
Yersinia_pestis_8787 1 1 0 1 1 1 1 1 1
Yersinia_pestis_91001 1 1 1 1 1 1 1 1 1
Yersinia_pestis_94 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Angola 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Angola_bis 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Antiqua 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Antiqua_bis 0 1 1 1 1 1 1 1 1
Yersinia_pestis_CO92 1 1 1 1 1 1 1 1 1
Yersinia_pestis_CO92_pgm-_pPCP1- 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Cadman 1 1 1 1 1 1 1 1 1
Yersinia_pestis_D106004 1 1 1 1 1 1 1 1 1
Yersinia_pestis_D182038 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Dodson 1 1 1 1 1 1 1 1 1
Yersinia_pestis_EV76-CN 1 1 1 1 1 1 1 1 1
Yersinia_pestis_El_Dorado 1 1 1 1 1 1 1 1 1
Yersinia_pestis_FDAARGOS_601 1 1 1 1 1 1 1 0 1
Yersinia_pestis_FDAARGOS_602 0 1 1 1 1 0 1 1 1
Yersinia_pestis_FDAARGOS_603 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Harbin_35 1 1 1 1 1 1 1 0 1
Yersinia_pestis_Harbin_35_bis 1 0 1 1 1 1 1 0 1
Yersinia_pestis_Java9 1 1 1 1 1 1 1 0 1
Yersinia_pestis_KIM5 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Nicholisk_41 1 1 1 0 1 1 1 0 1
Yersinia_pestis_PBM19 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Pestoides_B 0 1 1 1 1 1 1 1 1
Yersinia_pestis_Pestoides_F 1 1 0 1 1 1 1 1 1
Yersinia_pestis_Pestoides_F_bis 1 1 0 1 1 1 1 1 1
Yersinia_pestis_Pestoides_G 1 1 0 1 1 1 1 1 1
Yersinia_pestis_R 1 1 1 1 1 1 1 1 1
Yersinia_pestis_S19960127 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-5935_I-1996 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-5942_I-2638 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-6530 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-6899_231 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-DNA-18_I-3113 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Shasta 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Z176003 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_598 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_EP2+ 0 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_579 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_580 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_581 1 1 1 0 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_582 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_583 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP2666pIB1 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP32953 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP32953_bis 1 1 1 1 0 1 1 1 1
Yersinia_pseudotuberculosis_NZYP4713 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_PA3606 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_PB1+_bis 1 1 1 1 1 0 1 1 1

ggtree_and_gheatmap_yopK

alignment_yopK

  1. This step uses rsync to download data from the NCBI server to a local directory, save all gff-files in the directory prokka.

    rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 Yersinia_pestis_1045
  2. The step processes GFF files containing gene annotations for a set of samples in the directory prokka. The primary goal is to modify the GFF files and create new ones with specific changes and to save them in the directory prokka_plus. The script operates on each sample one by one, and for each sample, it performs the following steps:

    • Replace all occurrences of \tCDS\t with CDS in the original GFF file.

    • Extract all lines containing CDS and save them in a new file with the suffix _CDS.gff.

    • Replace all occurrences of ID= with ID_old= in the new _CDS.gff file.

    • Cut the second field (delimited by ;) from the _CDS.gff file and save it in a new file with the suffix _CDS_f2.

    • Replace all occurrences of Parent=gene- with ID= in the _CDS_f2 file.

    • Paste the contents of the _CDS.gff and _CDS_f2 files side by side, with a ; delimiter, and save the result in a new file with the suffix CDS.gff.

    • Run the enum.py script on the CDS.gff file to add line numbers at the end, and save the result in a new file with the suffix _CDS__.gff. import sys

        if len(sys.argv) < 2:
            print("Please provide a filename as an argument.")
            sys.exit(1)
      
        filename = sys.argv[1]
      
        try:
            with open(filename) as f:
                for i, line in enumerate(f):
                    print(f"{line.strip()}_{i+1}")
        except FileNotFoundError:
            print(f"File {filename} not found.")
    • Extract all lines from the original GFF file that do not contain CDS and save them in a new file with the suffix _nonCDS.gff.

    • Remove all lines containing ### from the _nonCDS.gff file and save the result in a new file with the suffix nonCDS.gff.

    • Concatenate the contents of the nonCDS.gff and _CDS__.gff files and save the result in a new file with the suffix _nonCDS_CDS.gff.

    • Replace all occurrences of CDS with \tCDS\t in the _nonCDS_CDS.gff file.

    • Append the string ##FASTA to the end of the _nonCDS_CDS.gff file.

    • Modify the FASTA file associated with the sample by replacing the first field (delimited by a space) with the corresponding sample name.

    • Concatenate the modified GFF file (_nonCDS_CDS.gff) and the modified FASTA file, and save the result in the ../prokka_plus/ directory with a new name based on the sample name.

    • After processing all samples, the script removes intermediate files generated during the process.

      for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 Yersinia_pestis_2944 Yersinia_pestis_KIM10+ Yersinia_pestis_M-1482; do
        sed -i 's/\tCDS\t/_CDS_/g' ${sample}.gff
        grep "_CDS_" ${sample}.gff > ${sample}_CDS.gff
        sed -i 's/ID=/ID_old=/g' ${sample}_CDS.gff
        cut -d';' -f2 ${sample}_CDS.gff > ${sample}_CDS_f2
        sed -i 's/Parent=gene-/ID=/g' ${sample}_CDS_f2
        paste -d';' ${sample}_CDS.gff ${sample}_CDS_f2 > ${sample}_CDS_.gff
        python enum.py ${sample}_CDS_.gff > ${sample}_CDS__.gff   # add a line number to end to avoid the sameple Gene_ID
      
        grep -v "_CDS_" ${sample}.gff > ${sample}_nonCDS.gff
        grep -v "###" ${sample}_nonCDS.gff > ${sample}_nonCDS_.gff
      
        cat ${sample}_nonCDS_.gff ${sample}_CDS__.gff > ${sample}_nonCDS_CDS.gff
        sed -i 's/_CDS_/\tCDS\t/g' ${sample}_nonCDS_CDS.gff
        echo "##FASTA" >> ${sample}_nonCDS_CDS.gff
      
        cut -d' ' -f1 ../assembly/${sample}.fna > ../assembly/${sample}.fasta;
        cat ${sample}_nonCDS_CDS.gff ../assembly/${sample}.fasta > ../prokka_plus/$(echo $sample | cut -d'_' -f3- | tr " " "_").gff;
      done
      rm *_CDS.gff *_CDS_f2 *_CDS_.gff *_CDS__.gff *_nonCDS.gff *_nonCDS_.gff
  3. This step runs Roary, a tool for pan-genome analysis. It takes annotated bacterial genomes in GFF3 format as input and clusters the genes based on sequence similarity.

    roary -p 4 -f ./roary -i 95 -cd 99 -s -e -n -v  prokka_plus/1045.gff prokka_plus/SCPM-O-B-6291_C-25.gff prokka_plus/2944.gff prokka_plus/KIM10+.gff
  4. This step extracts the coding sequences (CDS) of specific genes from multiple genome files and saves them to an output file. Start-files: roary/pan_genome_reference.fa and roary/gene_presence_absence.csv

    #grep "yopT" roary/gene_presence_absence.csv
    > yopT_seq.txt
    for gene_id in M486_RS20945_3996 YE105_RS20560_4012; do
      for gbff in  Yersinia_massiliensis_2011N-4075/GCF_013282765.1_ASM1328276v1/GCF_013282765.1_ASM1328276v1_genomic.gbff.gz Yersinia_pestis_EV_NIIEG/GCF_000590535.2_ASM59053v2/GCF_000590535.2_ASM59053v2_genomic.gbff.gz Yersinia_pestis_Shasta/GCF_000834335.1_ASM83433v1/GCF_000834335.1_ASM83433v1_genomic.gbff.gz Yersinia_ruckeri_NVI-492/GCF_023212565.2_ASM2321256v2/GCF_023212565.2_ASM2321256v2_genomic.gbff.gz Yersinia_pestis_Pestoides_G/GCF_000834985.1_ASM83498v1/GCF_000834985.1_ASM83498v1_genomic.gbff.gz; do
        output=$(python3 extract_CDS_of_a_locus_tag.py ${gbff} $(echo "${gene_id}" | cut -d '_' -f 1-2))
        if [[ ! -z "${output}" ]]; then
            gbff_short=$(echo "${gbff}" | cut -d '/' -f 1)
            printf "%s\t%s\n" "${gbff_short}" "${output}" >> yopT_seq.txt
        fi
      done
    done
    
    #-- code snippet of extract_CDS_of_a_locus_tag.py --
    from Bio import SeqIO
    import sys
    import gzip
    
    #python3 extract_CDS_of_a_locus_tag.py GCF_001188735.1_ASM118873v1_genomic.gbff.gz M486_RS20950
    
    # Get the file name from the command-line argument
    if len(sys.argv) == 3:
        filename = sys.argv[1]
        locus_tag = sys.argv[2]
    else:
        print(sys.argv)
        print("Error: no file name provided")
        sys.exit(1)
    
    # Open the compressed file and read the sequences
    with gzip.open(filename, "rt") as f:
    
        # Open the GenBank file and read the first record
        #    record = SeqIO.read("GCF_001188735.1_ASM118873v1_genomic.gbff", "genbank")
        for record in SeqIO.parse(f, "genbank"):
            #print("%s %i" % (record.id, len(record)))
    
            # Define the locus_tag you want to extract
            #locus_tag = "M486_RS20950"
    
            # Loop through the features and extract the CDS with the specified locus_tag
            for feature in record.features:
                if feature.type == "CDS" and "locus_tag" in feature.qualifiers and feature.qualifiers["locus_tag"][0] == locus_tag:
                    # Extract the CDS location information and sequence
                    location = feature.location
                    seq = location.extract(record).seq
                    #print(f">{locus_tag}")
                    print(f"{seq}")
                    # Translate the nucleotide sequence to protein sequence
                    #protein_seq = seq.translate()
                    #print(f"Locus tag: {locus_tag}\nProtein sequence: {protein_seq}")
  5. The given code converts a DNA sequence file to a protein sequence file, aligns the protein sequences using MAFFT or MUSCLE, and constructs a phylogenetic tree using FastTree.

    #mafft --clustalout --adjustdirection yopK_seqs.fasta > yopK_seqs.output
    #fasttree -gtr -gamma -nt  yopK_alignment.fasta > yopK_alignment.tree
    #raxml-ng --all --model GTR+G+ASC_LEWIS --prefix core_gene_raxml --threads 6 --msa yopK_alignment_.fasta --bs-trees 1000 -slow 
    #fasttree 
    > for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do python3 ../../../plotTreeHeatmap/dna_to_protein.py ${yop}_seq.txt ${yop}_protein.fasta python3 ../../../plotTreeHeatmap/protein_alignment.py ${yop}_protein.fasta ${yop}_aligned_protein.fasta mafft awk -F ‘_’ ‘/^>/ { printf(“>%s”, $3); for (i = 4; i <= NF; ++i) printf("_%s", $i); printf("\n"); next } { print }' ${yop}_aligned_protein.fasta > ${yop}_aligned_protein_.fasta done fasttree yopE_aligned_protein_.fasta > yopE_aligned_protein.tree fasttree yopO_aligned_protein_.fasta > yopO_aligned_protein.tree fasttree yopT_aligned_protein_.fasta > yopT_aligned_protein.tree grep “>” yopE_aligned_protein.fasta > typing_yopE.csv grep “>” yopO_aligned_protein.fasta > typing_yopO.csv grep “>” yopT_aligned_protein.fasta > typing_yopT.csv #– construct typing_for yopE — cut -f1 -d’_’ typing_yopE.csv > f1 cut -f2 -d’_’ typing_yopE.csv > f2 cut -f3- -d’_’ typing_yopE.csv > f3_ paste f3_ typing_yopE.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopE_.csv #Isolate,Name,Genus,Species,yopE #R,Yersinia_pestis_R,Yersinia,pestis,Yes #20,Yersinia_pestis_20,Yersinia,pestis,Yes #… #– for yopO — cut -f1 -d’_’ typing_yopO.csv > f1 cut -f2 -d’_’ typing_yopO.csv > f2 cut -f3- -d’_’ typing_yopO.csv > f3_ paste f3_ typing_yopO.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopO_.csv #– for yopT — cut -f1 -d’_’ typing_yopT.csv > f1 cut -f2 -d’_’ typing_yopT.csv > f2 cut -f3- -d’_’ typing_yopT.csv > f3_ paste f3_ typing_yopT.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopT_.csv # code snippet of dna_to_protein.py from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord import sys def translate_dna_to_protein(dna_fasta_file, protein_fasta_file): protein_records = [] for record in SeqIO.parse(dna_fasta_file, “fasta”): protein_seq = record.seq.translate(to_stop=True) protein_record = SeqRecord(protein_seq, id=record.id, description=record.description) protein_records.append(protein_record) SeqIO.write(protein_records, protein_fasta_file, “fasta”) if __name__ == “__main__”: input_file = sys.argv[1] output_file = sys.argv[2] translate_dna_to_protein(input_file, output_file) # code snippet of protein_alignment.py import sys from Bio.Align.Applications import MafftCommandline, MuscleCommandline from Bio import SeqIO from Bio import AlignIO def run_alignment(input_file, output_file, aligner): if aligner.lower() == “mafft”: mafft_cline = MafftCommandline(input=input_file) stdout, stderr = mafft_cline() elif aligner.lower() == “muscle”: muscle_cline = MuscleCommandline(input=input_file, out=output_file) stdout, stderr = muscle_cline() else: print(“Invalid aligner. Please choose ‘mafft’ or ‘muscle’.”) sys.exit(1) with open(output_file, “w”) as aligned_fasta: aligned_fasta.write(stdout) if __name__ == “__main__”: if len(sys.argv) < 4: print("Usage: python protein_alignment.py input_fasta output_fasta aligner") print("Example: python protein_alignment.py input_protein.fasta aligned_protein.fasta mafft") sys.exit(1) else: input_fasta = sys.argv[1] output_fasta = sys.argv[2] aligner_choice = sys.argv[3] run_alignment(input_fasta, output_fasta, aligner_choice)
  6. This step generates a circular phylogenetic tree, a heatmap, and a multiple alignment sequence in a single figure. The tree is constructed from the aligned protein sequence “yopE_alignedprotein.fasta” and the heatmap is based on the “typingyopE.csv” data. The multiple alignment sequence is added into the final figure by extracting specific sequences from the aligned protein sequence file.

    library(ggtree)
    library(ggplot2)
    library(Biostrings)
    library(stringr)
    library(dplyr)
    
    library(ape)
    library(ggmsa)
    
    #install.packages("cowplot")
    library(cowplot)
    
    setwd("/home/jhuang/DATA/Data_Gunnar_Yersiniomics/plotTreeHeatmap/")
    
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    #info <- read.csv("typing_198.csv")
    info <- read.csv("typing_yopE_.csv")
    
    info$name <- info$Isolate
    #rownames(info) <- info$name
    info$yopE <- factor(info$yopE)
    
    #tree <- read.tree("core_gene_alignment_fasttree_directly_from_186isoaltes.tree")  --> NOT GOOD!
    #tree <- read.tree("raxml.tree")
    #tree <- read.tree("yopK_alignment_modified.tree") 
    tree <- read.tree("yopE_aligned_protein.tree") 
    #tree <- read.tree("core_gene_raxml.raxml.bestTree") 
    cols <- c(Yes='purple2', No='skyblue2')     
    
    # -- tree --
    tree_plot <- ggtree(tree, layout='circular', branch.length='none') %<+% info + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1) + geom_tippoint(aes(color=yopE)) 
    #+ scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    #, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
    #difference between geom_tiplab and geom_tiplab2?
    #+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
    #font.size=10, 
    png("ggtree2.png", width=1260, height=1260)
    #png("ggtree.png", width=1000, height=1000)
    #svg("ggtree.svg", width=1260, height=1260)
    tree_plot
    dev.off()
    
    # -- heatmap --
    #heatmapData2 <- info %>% select(Isolate, cgMLST, Lineage)
    heatmapData2 <- info %>% select(Isolate, Species)
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
      #"1","2","3","4","9","12","13","14","16","18",  "19","30","32","36","39","41","42","43","44","53",  "64","79","83","84","92","140","148","154","171","172", "173","194","215","217","252","277","279","282","290","312", "335","NA"
    #https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
    #"blue","cyan", "skyblue2", "azure3","blueviolet","darkgoldenrod",  "tomato","mediumpurple4","indianred", 
    #"cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta",     "cornflowerblue","darkgreen","red","tan","brown",
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "lightcyan3","green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki",  "maroon","lightgreen",      "darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "lightcyan3", "maroon","lightgreen", "darkgrey")
    #names(heatmap.colours) <- c("pestis","pseudotuberculosis","similis","enterocolitica","frederiksenii","kristensenii","occitanica","intermedia","hibernica","canariae","alsatica","rohdei","massiliensis","bercovieri","aleksiciae","mollaretii","aldovae","ruckeri","entomophaga",     "1","1Aa","1B","2","2/3-9a","2/3-9b","4","5","6","8","10","11","12","13","14","16","17","29","NA")
    
    heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "lightcyan3","green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki",  "maroon","lightgreen")
    names(heatmap.colours) <- c("pestis","pseudotuberculosis","similis","enterocolitica","frederiksenii","kristensenii","occitanica","intermedia","hibernica","canariae","alsatica","rohdei","massiliensis","bercovieri","aleksiciae","mollaretii","aldovae","ruckeri","entomophaga")
    
    #"cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cornflowerblue",
    #"2.MED","4.ANT","2.ANT","1.ORI","1.IN","1.ANT","0.ANT3","0.PE4","0.PE3","0.PE2","1a",
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #rochesterensis-->occitanica
    
    png("ggtree_and_gheatmap_yopE.png", width=1000, height=800)
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #png("ggtree_and_gheatmap.png", width=1690, height=1400)
    #svg("ggtree_and_gheatmap.svg", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    tree_heatmap_plot <- gheatmap(tree_plot, heatmapData2, width=0.2,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
    tree_heatmap_plot
    dev.off()
    
    #samtools faidx yopE_aligned_protein_.fasta "1522" > yopE_aligned_protein_sorted.fasta
    #samtools faidx yopE_aligned_protein_.fasta "Pestoides_F_bis" >> yopE_aligned_protein_sorted.fasta
    #...
    #samtools faidx yopE_aligned_protein_.fasta "8081" >> yopE_aligned_protein_sorted.fasta
    #samtools faidx yopE_aligned_protein_.fasta "WA" >> yopE_aligned_protein_sorted.fasta
    
    #https://bioconductor.org/packages/devel/bioc/vignettes/ggtreeExtra/inst/doc/ggtreeExtra.html
    #https://github.com/YuLab-SMU/supplemental-ggmsa
    #https://github.com/YuLab-SMU/ggmsa
    
    library(ggmsa)
    library(ggplot2)
    library(ggtree)
    #library(gggenes)
    library(ape)
    library(Biostrings)
    library(ggnewscale)
    library(dplyr)
    library(ggtreeExtra)
    library(phangorn)
    library(RColorBrewer)
    library(patchwork)
    library(ggplotify)
    library(aplot)
    library(magick)
    library(treeio)
    
    #data <- "../supplemental-ggmsa-main/data/s_RBD.fasta"
    data <- "yopE_aligned_protein_.fasta"
    #data <- "yopE_aligned_protein_sorted.fasta"
    #dms <- read.csv("data/DMS.csv")
    #del <- c("expr_lib1", "expr_lib2","expr_avg","bind_lib1","bind_lib2")
    #dms <- dms[,!colnames(dms) %in% del]
    tidymsa <- tidy_msa(data)
    #tidymsa <- assign_dms(tidymsa, dms)
    #Mapping the position to the protein-protein interaction plot
    #tidymsa$position <- tidymsa$position + 330
    
    #dms = TRUE, 
    png("alignment_yopE.png", width=1100, height=5400)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    #scale_fill_gradientn(name = "ACE2 binding", colors = c(colRD(75),colBU(25)))
    msa_plot
    dev.off()
    
    # #Combine the ggtree and ggmsa plots using the cowplot package:
    # combined_plot <- cowplot::plot_grid(msa_plot, tree_heatmap_plot, ncol = 1, align = "v", axis = "l", rel_heights = c(3, 2))
    # ggsave("combined_plot.png", combined_plot, width = 10, height = 10, dpi = 300)

Episome (外质体)

在中文里,“episome”可以翻译为“外质体”。外质体是一种遗传元素,它可以作为独立的环状DNA分子存在,也可以整合到宿主的染色体中。外质体与宿主的染色体DNA分开复制,并在细胞分裂过程中在细胞之间传递。无论是原核生物还是真核生物,都可以找到外质体。

在细菌中,外质体通常是质粒,质粒是一种小的环状DNA分子,携带对宿主有益的基因,如抗生素抗性或产生毒素的能力。细菌质粒可以通过称为“共轭”的过程在细菌细胞之间传递,从而促进有益基因在细菌种群中的传播。

在真核细胞中,例如人类细胞,外质体可以是病毒基因组,如Epstein-Barr病毒(EBV)基因组。感染后,EBV使其线性DNA环化并作为外质体建立潜伏在宿主细胞的细胞核中。在潜伏期间,病毒外质体在宿主细胞中保持存在,但不会立即造成损害,病毒可以在以后重新激活,可能导致疾病。作为外质体,病毒基因组可以避免被宿主免疫系统检测和清除,同时保持在重新激活后复制并产生新的病毒颗粒的能力。

An episome is a genetic element that can exist either independently as a circular DNA molecule or integrate into a host’s chromosome. Episomes replicate separately from the host’s chromosomal DNA and can be transferred between cells during cell division. They are found in both prokaryotic and eukaryotic organisms.

In bacteria, episomes are often plasmids, which are small, circular DNA molecules that carry genes beneficial to the host, such as antibiotic resistance or the ability to produce toxins. Bacterial plasmids can be transferred between bacterial cells through a process called conjugation, which promotes the spread of beneficial genes within a bacterial population.

In eukaryotic cells, such as human cells, episomes can be viral genomes, like the Epstein-Barr virus (EBV) genome. Upon infection, EBV circularizes its linear DNA and establishes latency as an episome in the host cell’s nucleus. During latency, the viral episome persists in the host cell without causing immediate damage, and the virus can reactivate at a later time, potentially causing disease. By existing as an episome, the viral genome can avoid detection and clearance by the host’s immune system while maintaining the ability to replicate and produce new virus particles upon reactivation.

Human gammaherpesvirus 4 (HHV-4) is another name for Epstein-Barr virus (EBV). EBV is a member of the Herpesviridae family and belongs to the Gammaherpesvirinae subfamily. It is a double-stranded DNA virus that infects humans and is associated with various diseases, including infectious mononucleosis, certain types of lymphomas, and nasopharyngeal carcinoma.

EBV establishes a latent infection in the host’s B cells, where it can persist as an episome. The virus can reactivate at a later time, potentially leading to disease.

EV (细胞外囊泡:生物学中的关键细胞间通讯调控者)

In biology, “EV” often stands for “extracellular vesicle.” Extracellular vesicles are membrane-bound particles released by cells into the extracellular environment. They play essential roles in cell-to-cell communication, both within the same organism and between different organisms.

Extracellular vesicles can range in size from approximately 30 nm to 1 µm or more, and their contents include proteins, lipids, and nucleic acids (such as DNA, mRNA, and microRNAs). These vesicles can be classified into various categories, such as exosomes, microvesicles, and apoptotic bodies, based on their biogenesis, size, and molecular composition.

EVs are involved in numerous physiological and pathological processes, including immune responses, angiogenesis, cell differentiation, and cancer progression. They can transfer their contents from one cell to another, influencing the recipient cell’s behavior and function. Given their importance in intercellular communication, extracellular vesicles have become a significant area of research in recent years, with potential applications in diagnostics, therapeutics, and drug delivery.

在生物学中,“EV”通常表示“细胞外囊泡”(extracellular vesicle)。细胞外囊泡是由细胞释放到细胞外环境的膜结构颗粒。它们在同一生物体内以及不同生物体之间的细胞间通讯中起着至关重要的作用。

细胞外囊泡的大小范围从约30纳米到1微米或更大,其内容包括蛋白质、脂质和核酸(如DNA、mRNA和microRNAs)。根据它们的生物生成途径、大小和分子组成,这些囊泡可以被分为不同的类别,例如外泌体(exosomes)、微囊泡(microvesicles)和凋亡小体(apoptotic bodies)。

EVs参与许多生理和病理过程,包括免疫应答、血管生成、细胞分化和癌症进展。它们可以将其内容从一个细胞传递到另一个细胞,影响受体细胞的行为和功能。鉴于它们在细胞间通信中的重要性,近年来细胞外囊泡已成为一个重要的研究领域,具有潜在的诊断、治疗和药物递送应用。

有趣的基因轶事

基因学是一个充满惊奇和趣事的领域。以下是一些有趣的基因轶事:

  1. 鳄梨基因:一项研究表明,鳄梨的基因组有一个特别之处,那就是它保留了一些远古恐龙时代的基因。这些基因可能帮助鳄梨适应恶劣的环境,使它们在漫长的历史中得以幸存。
  2. CCR5-Δ32:CCR5-Δ32是一种罕见的基因突变,能让人免疫艾滋病毒(HIV)。这种突变使得艾滋病毒无法进入人体免疫系统的细胞,从而保护突变基因携带者不受感染。
  3. 嗅觉差异:我们对气味的感知能力主要取决于基因。有一种名为OR6A2的基因,它决定了我们对香草和肉桂气味的喜好。携带某种突变形式的人会觉得香草和肉桂有恶臭。
  4. 抗草酸基因:有一种名为“抗草酸”基因,它使得某些人在进食菠菜、瑞士甜菜和甜菜等含草酸高的食物时不会感到不适。草酸会与钙结合形成结石,而携带这种基因的人则能够防止草酸在体内形成结石。
  5. 猫咪拟态:澳洲的一种叫做“猫咪拟态”的植物,通过模拟猫咪的气味来保护自己。这种植物产生一种化合物,它的气味类似猫咪的信息素,使得其天敌如老鼠等动物误以为有猫咪在附近,从而避开它们。
  6. 左撇子基因:尽管左撇子在全球人口中只占10%左右,但研究发现,一个名为LRRTM1的基因与左撇子有关。具有这种基因的人更有可能成为左撇子。
  7. 蓝眼睛基因:蓝眼睛实际上是一种罕见的基因突变。研究表明,所有蓝眼睛的人都有一个共同的祖先,这个祖先生活在大约6,000-10,000年前。这种突变影响了名为OCA2的基因,导致黑色素在虹膜中的减少,使眼睛呈现蓝色。
  8. 咖啡因代谢:每个人对咖啡因的反应不同,这主要归功于基因。携带CYP1A2基因特定突变的人能更快地代谢咖啡因,因此对咖啡因不那么敏感。
  9. 甜品基因:有些人对甜食特别钟爱,这可能与名为TAS1R2的基因有关。这个基因对甜味敏感,携带特定突变的人更喜欢甜食。
  10. 花生过敏基因:花生过敏是一种常见的食物过敏,携带名为HLA-DQ和HLA-DR的基因突变的人更容易对花生过敏。
  11. 音乐天赋基因:名为AVPR1A的基因与音乐天赋有关。携带这个基因特定突变的人在音乐创作、即兴表演和音乐记忆方面表现更好。
  12. 超级熬夜基因:有些人几乎不需要睡觉就能保持清醒和精力充沛。这与一种名为“超级熬夜基因”的基因突变有关,携带这种突变的人每天只需要较少的睡眠。
  13. 超级记忆基因:有些人拥有惊人的记忆力,能够轻松记住大量的细节。这可能与一种名为BDNF的基因变异有关。BDNF基因编码一种在大脑中起到关键作用的神经营养因子。携带这种基因变异的人可能具有更强大的记忆力。
  14. 恐怖片爱好者基因:有些人对恐怖片情有独钟,而这可能与一种名为COMT的基因变异有关。这个基因影响多巴胺代谢,多巴胺是一种与愉悦感相关的神经递质。携带这种基因变异的人在观看恐怖片时可能感到更多的刺激和愉悦。
  15. 不喜欢西兰花的基因:有些人对西兰花的味道特别反感,这可能与一种名为TAS2R38的基因变异有关。这个基因编码一种味觉受体,携带这种基因变异的人对西兰花中的苦味非常敏感。

这些有趣的基因轶事展示了基因对我们生活的各个方面的影响,从我们的生理特征到我们的行为习惯。随着科学家们对基因学的进一步研究,我们将不断发现更多有趣的基因现象。

交联质谱(XL-MS 或 CLMS)

交联质谱(XL-MS 或 CLMS)是一种强大的技术,用于研究蛋白质-蛋白质相互作用、蛋白质复合物结构和蛋白质构象变化。该方法包括使用化学交联剂将蛋白质或蛋白质之间相互作用或紧密位于的氨基酸共价连接。然后将交联蛋白质消化成肽,通过质谱分析交联肽。

交联质谱的一般工作流程如下:

  1. 交联:将蛋白质或蛋白质复合物与化学交联剂处理,使其在近距离的氨基酸之间形成共价连接。交联剂的选择取决于实验要求,例如所需的交联距离和用于交联的特定氨基酸。

  2. 消化:将交联蛋白质消化成肽,使用蛋白酶如胰蛋白酶,它在特定的氨基酸序列处切割蛋白质。

  3. 富集:根据使用的交联剂和实验设计,可能需要从复杂的肽混合物中富集交联肽。这可以通过多种方法实现,包括亲和纯化、大小排阻层析或强阳离子交换层析。

  4. 质谱分析:使用高分辨率质谱对交联肽进行分析。质谱仪测量肽的质量与电荷比(m/z),提供关于其质量和电荷的信息。串联质谱(MS/MS)用于将交联肽片段化并获得其序列信息。

  5. 数据分析:使用专门的生物信息学工具和算法分析结果的质谱数据。这些工具有助于识别交联肽并推断相互作用的氨基酸或蛋白质区域。所识别的交联可用于生成结构模型或绘制蛋白质相互作用界面。

交联质谱已成功应用于研究各种生物系统,包括蛋白质-蛋白质相互作用、蛋白质-核酸相互作用和蛋白质-脂质相互作用。它还被用于研究大型蛋白质复合物和膜蛋白的结构和动态。

Crosslinking mass spectrometry (XL-MS or CLMS) is a powerful technique used to study protein-protein interactions, protein complex structures, and protein conformational changes. The method involves covalently linking interacting or closely located amino acids within a protein or between proteins using chemical crosslinkers. The crosslinked proteins are then digested into peptides, and the crosslinked peptides are analyzed using mass spectrometry.

The general workflow of crosslinking mass spectrometry is as follows:

  1. Crosslinking: Proteins or protein complexes are treated with a chemical crosslinker that covalently links amino acids in close proximity. The choice of crosslinker depends on the experimental requirements, such as the desired crosslinking distance and the specific amino acids targeted for crosslinking.

  2. Digestion: Crosslinked proteins are digested into peptides using proteolytic enzymes such as trypsin, which cleaves proteins at specific amino acid sequences.

  3. Enrichment: Depending on the crosslinker used and the experimental design, it may be necessary to enrich crosslinked peptides from the complex peptide mixture. This can be achieved using various approaches, including affinity purification, size-exclusion chromatography, or strong cation exchange chromatography.

  4. Mass spectrometry analysis: The crosslinked peptides are analyzed using high-resolution mass spectrometry. The mass spectrometer measures the mass-to-charge ratios (m/z) of the peptides, providing information about their mass and charge. Tandem mass spectrometry (MS/MS) is used to fragment the crosslinked peptides and obtain their sequence information.

  5. Data analysis: The resulting mass spectrometry data are analyzed using specialized bioinformatics tools and algorithms. These tools help identify the crosslinked peptides and infer the interacting amino acids or protein regions. The identified crosslinks can be used to generate structural models or map protein interaction interfaces.

Crosslinking mass spectrometry has been successfully applied to study various biological systems, including protein-protein interactions, protein-nucleic acid interactions, and protein-lipid interactions. It has also been used to investigate the structure and dynamics of large protein complexes and membrane proteins.

Crosslinking mass spectrometry (XL-MS) has been under development since the late 1990s and early 2000s. Early work in the field focused on optimizing the crosslinking and mass spectrometry techniques, as well as developing computational methods to analyze the resulting data. The technique has evolved significantly over the past two decades, thanks to advancements in mass spectrometry instrumentation, crosslinking reagents, and bioinformatics tools.

One of the pioneering studies in XL-MS was published in 2000 by Rappsilber, Siniossoglou, et al., who used the technique to study the structure of protein complexes in yeast cells. Since then, numerous improvements have been made to the method, leading to its widespread adoption in the field of structural biology and proteomics.

As the technology continues to mature, XL-MS has become an essential tool for studying protein-protein interactions, protein complex structures, and protein conformational changes, providing valuable insights into the molecular mechanisms underlying various biological processes.

Fan Liu’s lab, based at the Max Planck Institute of Biochemistry in Germany, has made significant contributions to the field of crosslinking mass spectrometry (XL-MS). The group focuses on developing innovative XL-MS methods, as well as computational tools for data analysis, to study protein structures and interactions.

One of the key contributions from the Fan Liu lab is the development of the xQuest/xProphet software suite. xQuest is an algorithm for identifying crosslinked peptides from mass spectrometry data, while xProphet is a post-processing tool that statistically validates the xQuest results. This software suite has greatly facilitated the analysis of XL-MS data and is widely used in the field.

The lab has also contributed to the development of novel crosslinking reagents and methodologies for studying protein interactions. For example, they have developed a method called “disuccinimidyl sulfoxide (DSSO) crosslinking” that enables the identification of interacting protein regions by using a photoactivatable crosslinker.

Furthermore, Fan Liu’s group has applied their XL-MS expertise to investigate the structure and function of various biologically significant protein complexes, such as the 26S proteasome, the Mediator complex, and the nuclear pore complex, providing valuable insights into the molecular mechanisms of these systems.

Overall, the Fan Liu lab has played an important role in advancing the field of crosslinking mass spectrometry and continues to push the boundaries of what can be achieved with this powerful technique.

二琥珀酰亚砜基 (disuccinimidyl sulfoxide) 交联是一种在蛋白质结构和相互作用研究领域中使用的交联方法。通过使用这种光活化交联剂,可以识别相互作用的蛋白质区域。在交联质谱(XL-MS)技术中,二琥珀酰亚砜基 (DSSO) 交联为研究蛋白质复合物结构提供了一种有效的工具。

10 NGS Categories

  1. RNA Sequencing: RNA sequencing (RNA-Seq) is a technique used to study the transcriptome, or the complete set of RNA molecules expressed within a cell or tissue. This approach allows researchers to analyze gene expression patterns, identify novel transcripts, and detect alternative splicing events.

  2. Whole Genome Sequencing: Whole genome sequencing (WGS) is a method that determines the complete DNA sequence of an organism’s genome. This technique provides comprehensive information about an organism’s genetic makeup, including the identification of genes, regulatory elements, and variations such as single nucleotide polymorphisms (SNPs) and structural variants.

  3. Amplicon Sequencing: Amplicon sequencing involves the targeted sequencing of specific genomic regions or genes using polymerase chain reaction (PCR) to amplify the regions of interest before sequencing. This approach is often used to study specific genetic variations or target known functional regions in the genome.

  4. Exome Sequencing: Exome sequencing targets the protein-coding regions of the genome, known as the exome. These regions account for approximately 1-2% of the genome but contain the majority of disease-causing genetic variations. Exome sequencing is used to identify novel disease-associated genes and mutations in known genes.

  5. CRISPR Validation (genoTYPER-NEXT): CRISPR validation is a process to assess the efficiency and specificity of CRISPR/Cas9-mediated genome editing. genoTYPER-NEXT is a high-throughput sequencing platform used to validate CRISPR/Cas9-induced mutations by sequencing the target sites and identifying the exact mutations generated by the editing process.

  6. Targeted Sequencing: Targeted sequencing is a method that focuses on sequencing specific genomic regions or genes of interest, rather than the entire genome. This approach is more cost-effective and allows for higher sequencing depth, providing better resolution for detecting low-frequency genetic variations.

  7. Metagenomics: Metagenomics is the study of genetic material obtained directly from environmental samples, such as soil or water, without the need for culturing individual organisms. This approach allows researchers to analyze the composition, diversity, and functional potential of complex microbial communities.

  8. Epigenomics: Epigenomics is the study of epigenetic modifications on a genome-wide scale, including DNA methylation, histone modifications, and non-coding RNA molecules. These modifications play essential roles in gene regulation and can have long-lasting effects on an organism’s phenotype without changing the DNA sequence.

  9. Immunogenomics: Immunogenomics is the study of the genetic and epigenetic factors that influence the immune system and its response to various stimuli, including pathogens, allergens, and self-antigens. This field integrates genomic, transcriptomic, and epigenomic data to better understand the molecular mechanisms underlying immune responses and develop novel therapies for immune-related diseases.

  10. Proteomics is a branch of molecular biology that focuses on the large-scale study of proteins within an organism or a specific biological system. Unlike genomics, which deals with the study of DNA and gene sequences, proteomics investigates the structure, function, and interactions of proteins. Proteins are crucial for many cellular processes, and their functions are dictated by their structure, abundance, and modifications.

    There are several key aspects and techniques in proteomics:

    • Protein identification: Identifying the proteins present in a given sample, such as a cell, tissue, or body fluid, is a fundamental aspect of proteomics. Mass spectrometry is the most commonly used technique for protein identification, often combined with liquid chromatography to separate complex protein mixtures before analysis.

    • Protein quantification: Measuring the relative or absolute abundance of proteins in a sample is important for understanding their biological roles and potential involvement in disease processes. Techniques like label-free quantification, stable isotope labeling, and targeted mass spectrometry approaches (e.g., selected reaction monitoring) are commonly used for protein quantification.

    • Post-translational modifications (PTMs): PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, glycosylation, and acetylation. They can alter a protein’s activity, stability, localization, or interaction with other molecules. Mass spectrometry-based approaches and specific enrichment techniques are used to identify and quantify PTMs.

    • Protein-protein interactions: Investigating how proteins interact with one another is crucial for understanding cellular processes and signaling pathways. Techniques like yeast two-hybrid screens, affinity purification-mass spectrometry (AP-MS), and proximity labeling methods can reveal protein-protein interactions.

    • Structural proteomics: This area focuses on determining the three-dimensional structures of proteins, which provide insights into their functions and interactions. Techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) are used to determine protein structures.

    • Functional proteomics: This aspect of proteomics aims to understand the biological roles of proteins and their involvement in cellular processes. Techniques like RNA interference (RNAi), CRISPR/Cas9-mediated gene editing, and activity-based protein profiling (ABPP) are used to study protein functions and identify potential drug targets.