Author Archives: gene_x

Plot phylogenetic tree_heatmap and MSA on yopBDJTEMKOH[NR]

Isolate yopJ yopB yopT yopE yopD yopM yopK yopO yopH
Yersinia_enterocolitica_1055Rr 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_2516-87 0 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_8081 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_8081_bis 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_KNG22703 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_WA 1 1 1 1 1 0 1 1 1
Yersinia_enterocolitica_Y1 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_Y11 1 1 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE1 1 0 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE165 1 1 1 1 0 1 1 0 1
Yersinia_enterocolitica_YE3 1 0 1 1 1 1 1 0 1
Yersinia_enterocolitica_YE5 1 0 1 1 1 1 1 1 1
Yersinia_enterocolitica_YE6 1 1 1 1 1 1 1 0 1
Yersinia_enterocolitica_YE7 1 1 1 1 1 1 1 1 1
Yersinia_pestis_1045 1 1 1 1 1 1 1 1 1
Yersinia_pestis_1412 1 1 0 1 1 1 1 1 1
Yersinia_pestis_1413 1 1 0 1 1 1 1 1 1
Yersinia_pestis_1522 1 1 0 0 1 1 1 1 1
Yersinia_pestis_20 1 1 1 1 1 1 1 1 1
Yersinia_pestis_2944 1 1 1 1 1 1 1 1 1
Yersinia_pestis_3067 1 1 0 1 1 1 1 1 1
Yersinia_pestis_3770 1 1 0 1 1 1 1 1 1
Yersinia_pestis_790 0 1 1 1 1 1 1 0 1
Yersinia_pestis_8787 1 1 0 1 1 1 1 1 1
Yersinia_pestis_91001 1 1 1 1 1 1 1 1 1
Yersinia_pestis_94 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Angola 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Angola_bis 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Antiqua 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Antiqua_bis 0 1 1 1 1 1 1 1 1
Yersinia_pestis_CO92 1 1 1 1 1 1 1 1 1
Yersinia_pestis_CO92_pgm-_pPCP1- 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Cadman 1 1 1 1 1 1 1 1 1
Yersinia_pestis_D106004 1 1 1 1 1 1 1 1 1
Yersinia_pestis_D182038 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Dodson 1 1 1 1 1 1 1 1 1
Yersinia_pestis_EV76-CN 1 1 1 1 1 1 1 1 1
Yersinia_pestis_El_Dorado 1 1 1 1 1 1 1 1 1
Yersinia_pestis_FDAARGOS_601 1 1 1 1 1 1 1 0 1
Yersinia_pestis_FDAARGOS_602 0 1 1 1 1 0 1 1 1
Yersinia_pestis_FDAARGOS_603 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Harbin_35 1 1 1 1 1 1 1 0 1
Yersinia_pestis_Harbin_35_bis 1 0 1 1 1 1 1 0 1
Yersinia_pestis_Java9 1 1 1 1 1 1 1 0 1
Yersinia_pestis_KIM5 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Nicholisk_41 1 1 1 0 1 1 1 0 1
Yersinia_pestis_PBM19 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Pestoides_B 0 1 1 1 1 1 1 1 1
Yersinia_pestis_Pestoides_F 1 1 0 1 1 1 1 1 1
Yersinia_pestis_Pestoides_F_bis 1 1 0 1 1 1 1 1 1
Yersinia_pestis_Pestoides_G 1 1 0 1 1 1 1 1 1
Yersinia_pestis_R 1 1 1 1 1 1 1 1 1
Yersinia_pestis_S19960127 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-5935_I-1996 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-5942_I-2638 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-6530 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-B-6899_231 1 1 1 1 1 1 1 1 1
Yersinia_pestis_SCPM-O-DNA-18_I-3113 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Shasta 1 1 1 1 1 1 1 1 1
Yersinia_pestis_Z176003 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_598 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_EP2+ 0 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_579 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_580 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_581 1 1 1 0 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_582 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_FDAARGOS_583 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP2666pIB1 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP32953 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_IP32953_bis 1 1 1 1 0 1 1 1 1
Yersinia_pseudotuberculosis_NZYP4713 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_PA3606 1 1 1 1 1 1 1 1 1
Yersinia_pseudotuberculosis_PB1+_bis 1 1 1 1 1 0 1 1 1

ggtree_and_gheatmap_yopK

alignment_yopK

  1. This step uses rsync to download data from the NCBI server to a local directory, save all gff-files in the directory prokka.

    rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 Yersinia_pestis_1045
  2. The step processes GFF files containing gene annotations for a set of samples in the directory prokka. The primary goal is to modify the GFF files and create new ones with specific changes and to save them in the directory prokka_plus. The script operates on each sample one by one, and for each sample, it performs the following steps:

    • Replace all occurrences of \tCDS\t with CDS in the original GFF file.

    • Extract all lines containing CDS and save them in a new file with the suffix _CDS.gff.

    • Replace all occurrences of ID= with ID_old= in the new _CDS.gff file.

    • Cut the second field (delimited by ;) from the _CDS.gff file and save it in a new file with the suffix _CDS_f2.

    • Replace all occurrences of Parent=gene- with ID= in the _CDS_f2 file.

    • Paste the contents of the _CDS.gff and _CDS_f2 files side by side, with a ; delimiter, and save the result in a new file with the suffix CDS.gff.

    • Run the enum.py script on the CDS.gff file to add line numbers at the end, and save the result in a new file with the suffix _CDS__.gff. import sys

        if len(sys.argv) < 2:
            print("Please provide a filename as an argument.")
            sys.exit(1)
      
        filename = sys.argv[1]
      
        try:
            with open(filename) as f:
                for i, line in enumerate(f):
                    print(f"{line.strip()}_{i+1}")
        except FileNotFoundError:
            print(f"File {filename} not found.")
    • Extract all lines from the original GFF file that do not contain CDS and save them in a new file with the suffix _nonCDS.gff.

    • Remove all lines containing ### from the _nonCDS.gff file and save the result in a new file with the suffix nonCDS.gff.

    • Concatenate the contents of the nonCDS.gff and _CDS__.gff files and save the result in a new file with the suffix _nonCDS_CDS.gff.

    • Replace all occurrences of CDS with \tCDS\t in the _nonCDS_CDS.gff file.

    • Append the string ##FASTA to the end of the _nonCDS_CDS.gff file.

    • Modify the FASTA file associated with the sample by replacing the first field (delimited by a space) with the corresponding sample name.

    • Concatenate the modified GFF file (_nonCDS_CDS.gff) and the modified FASTA file, and save the result in the ../prokka_plus/ directory with a new name based on the sample name.

    • After processing all samples, the script removes intermediate files generated during the process.

      for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 Yersinia_pestis_2944 Yersinia_pestis_KIM10+ Yersinia_pestis_M-1482; do
        sed -i 's/\tCDS\t/_CDS_/g' ${sample}.gff
        grep "_CDS_" ${sample}.gff > ${sample}_CDS.gff
        sed -i 's/ID=/ID_old=/g' ${sample}_CDS.gff
        cut -d';' -f2 ${sample}_CDS.gff > ${sample}_CDS_f2
        sed -i 's/Parent=gene-/ID=/g' ${sample}_CDS_f2
        paste -d';' ${sample}_CDS.gff ${sample}_CDS_f2 > ${sample}_CDS_.gff
        python enum.py ${sample}_CDS_.gff > ${sample}_CDS__.gff   # add a line number to end to avoid the sameple Gene_ID
      
        grep -v "_CDS_" ${sample}.gff > ${sample}_nonCDS.gff
        grep -v "###" ${sample}_nonCDS.gff > ${sample}_nonCDS_.gff
      
        cat ${sample}_nonCDS_.gff ${sample}_CDS__.gff > ${sample}_nonCDS_CDS.gff
        sed -i 's/_CDS_/\tCDS\t/g' ${sample}_nonCDS_CDS.gff
        echo "##FASTA" >> ${sample}_nonCDS_CDS.gff
      
        cut -d' ' -f1 ../assembly/${sample}.fna > ../assembly/${sample}.fasta;
        cat ${sample}_nonCDS_CDS.gff ../assembly/${sample}.fasta > ../prokka_plus/$(echo $sample | cut -d'_' -f3- | tr " " "_").gff;
      done
      rm *_CDS.gff *_CDS_f2 *_CDS_.gff *_CDS__.gff *_nonCDS.gff *_nonCDS_.gff
  3. This step runs Roary, a tool for pan-genome analysis. It takes annotated bacterial genomes in GFF3 format as input and clusters the genes based on sequence similarity.

    roary -p 4 -f ./roary -i 95 -cd 99 -s -e -n -v  prokka_plus/1045.gff prokka_plus/SCPM-O-B-6291_C-25.gff prokka_plus/2944.gff prokka_plus/KIM10+.gff
  4. This step extracts the coding sequences (CDS) of specific genes from multiple genome files and saves them to an output file. Start-files: roary/pan_genome_reference.fa and roary/gene_presence_absence.csv

    #grep "yopT" roary/gene_presence_absence.csv
    > yopT_seq.txt
    for gene_id in M486_RS20945_3996 YE105_RS20560_4012; do
      for gbff in  Yersinia_massiliensis_2011N-4075/GCF_013282765.1_ASM1328276v1/GCF_013282765.1_ASM1328276v1_genomic.gbff.gz Yersinia_pestis_EV_NIIEG/GCF_000590535.2_ASM59053v2/GCF_000590535.2_ASM59053v2_genomic.gbff.gz Yersinia_pestis_Shasta/GCF_000834335.1_ASM83433v1/GCF_000834335.1_ASM83433v1_genomic.gbff.gz Yersinia_ruckeri_NVI-492/GCF_023212565.2_ASM2321256v2/GCF_023212565.2_ASM2321256v2_genomic.gbff.gz Yersinia_pestis_Pestoides_G/GCF_000834985.1_ASM83498v1/GCF_000834985.1_ASM83498v1_genomic.gbff.gz; do
        output=$(python3 extract_CDS_of_a_locus_tag.py ${gbff} $(echo "${gene_id}" | cut -d '_' -f 1-2))
        if [[ ! -z "${output}" ]]; then
            gbff_short=$(echo "${gbff}" | cut -d '/' -f 1)
            printf "%s\t%s\n" "${gbff_short}" "${output}" >> yopT_seq.txt
        fi
      done
    done
    
    #-- code snippet of extract_CDS_of_a_locus_tag.py --
    from Bio import SeqIO
    import sys
    import gzip
    
    #python3 extract_CDS_of_a_locus_tag.py GCF_001188735.1_ASM118873v1_genomic.gbff.gz M486_RS20950
    
    # Get the file name from the command-line argument
    if len(sys.argv) == 3:
        filename = sys.argv[1]
        locus_tag = sys.argv[2]
    else:
        print(sys.argv)
        print("Error: no file name provided")
        sys.exit(1)
    
    # Open the compressed file and read the sequences
    with gzip.open(filename, "rt") as f:
    
        # Open the GenBank file and read the first record
        #    record = SeqIO.read("GCF_001188735.1_ASM118873v1_genomic.gbff", "genbank")
        for record in SeqIO.parse(f, "genbank"):
            #print("%s %i" % (record.id, len(record)))
    
            # Define the locus_tag you want to extract
            #locus_tag = "M486_RS20950"
    
            # Loop through the features and extract the CDS with the specified locus_tag
            for feature in record.features:
                if feature.type == "CDS" and "locus_tag" in feature.qualifiers and feature.qualifiers["locus_tag"][0] == locus_tag:
                    # Extract the CDS location information and sequence
                    location = feature.location
                    seq = location.extract(record).seq
                    #print(f">{locus_tag}")
                    print(f"{seq}")
                    # Translate the nucleotide sequence to protein sequence
                    #protein_seq = seq.translate()
                    #print(f"Locus tag: {locus_tag}\nProtein sequence: {protein_seq}")
  5. The given code converts a DNA sequence file to a protein sequence file, aligns the protein sequences using MAFFT or MUSCLE, and constructs a phylogenetic tree using FastTree.

    #mafft --clustalout --adjustdirection yopK_seqs.fasta > yopK_seqs.output
    #fasttree -gtr -gamma -nt  yopK_alignment.fasta > yopK_alignment.tree
    #raxml-ng --all --model GTR+G+ASC_LEWIS --prefix core_gene_raxml --threads 6 --msa yopK_alignment_.fasta --bs-trees 1000 -slow 
    #fasttree 
    > for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do python3 ../../../plotTreeHeatmap/dna_to_protein.py ${yop}_seq.txt ${yop}_protein.fasta python3 ../../../plotTreeHeatmap/protein_alignment.py ${yop}_protein.fasta ${yop}_aligned_protein.fasta mafft awk -F ‘_’ ‘/^>/ { printf(“>%s”, $3); for (i = 4; i <= NF; ++i) printf("_%s", $i); printf("\n"); next } { print }' ${yop}_aligned_protein.fasta > ${yop}_aligned_protein_.fasta done fasttree yopE_aligned_protein_.fasta > yopE_aligned_protein.tree fasttree yopO_aligned_protein_.fasta > yopO_aligned_protein.tree fasttree yopT_aligned_protein_.fasta > yopT_aligned_protein.tree grep “>” yopE_aligned_protein.fasta > typing_yopE.csv grep “>” yopO_aligned_protein.fasta > typing_yopO.csv grep “>” yopT_aligned_protein.fasta > typing_yopT.csv #– construct typing_for yopE — cut -f1 -d’_’ typing_yopE.csv > f1 cut -f2 -d’_’ typing_yopE.csv > f2 cut -f3- -d’_’ typing_yopE.csv > f3_ paste f3_ typing_yopE.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopE_.csv #Isolate,Name,Genus,Species,yopE #R,Yersinia_pestis_R,Yersinia,pestis,Yes #20,Yersinia_pestis_20,Yersinia,pestis,Yes #… #– for yopO — cut -f1 -d’_’ typing_yopO.csv > f1 cut -f2 -d’_’ typing_yopO.csv > f2 cut -f3- -d’_’ typing_yopO.csv > f3_ paste f3_ typing_yopO.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopO_.csv #– for yopT — cut -f1 -d’_’ typing_yopT.csv > f1 cut -f2 -d’_’ typing_yopT.csv > f2 cut -f3- -d’_’ typing_yopT.csv > f3_ paste f3_ typing_yopT.csv > temp1 paste f1 f2 > temp2 paste temp1 temp2 > typing_yopT_.csv # code snippet of dna_to_protein.py from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord import sys def translate_dna_to_protein(dna_fasta_file, protein_fasta_file): protein_records = [] for record in SeqIO.parse(dna_fasta_file, “fasta”): protein_seq = record.seq.translate(to_stop=True) protein_record = SeqRecord(protein_seq, id=record.id, description=record.description) protein_records.append(protein_record) SeqIO.write(protein_records, protein_fasta_file, “fasta”) if __name__ == “__main__”: input_file = sys.argv[1] output_file = sys.argv[2] translate_dna_to_protein(input_file, output_file) # code snippet of protein_alignment.py import sys from Bio.Align.Applications import MafftCommandline, MuscleCommandline from Bio import SeqIO from Bio import AlignIO def run_alignment(input_file, output_file, aligner): if aligner.lower() == “mafft”: mafft_cline = MafftCommandline(input=input_file) stdout, stderr = mafft_cline() elif aligner.lower() == “muscle”: muscle_cline = MuscleCommandline(input=input_file, out=output_file) stdout, stderr = muscle_cline() else: print(“Invalid aligner. Please choose ‘mafft’ or ‘muscle’.”) sys.exit(1) with open(output_file, “w”) as aligned_fasta: aligned_fasta.write(stdout) if __name__ == “__main__”: if len(sys.argv) < 4: print("Usage: python protein_alignment.py input_fasta output_fasta aligner") print("Example: python protein_alignment.py input_protein.fasta aligned_protein.fasta mafft") sys.exit(1) else: input_fasta = sys.argv[1] output_fasta = sys.argv[2] aligner_choice = sys.argv[3] run_alignment(input_fasta, output_fasta, aligner_choice)
  6. This step generates a circular phylogenetic tree, a heatmap, and a multiple alignment sequence in a single figure. The tree is constructed from the aligned protein sequence “yopE_alignedprotein.fasta” and the heatmap is based on the “typingyopE.csv” data. The multiple alignment sequence is added into the final figure by extracting specific sequences from the aligned protein sequence file.

    library(ggtree)
    library(ggplot2)
    library(Biostrings)
    library(stringr)
    library(dplyr)
    
    library(ape)
    library(ggmsa)
    
    #install.packages("cowplot")
    library(cowplot)
    
    setwd("/home/jhuang/DATA/Data_Gunnar_Yersiniomics/plotTreeHeatmap/")
    
    # -- edit tree --
    #https://icytree.org/
    #0.000780
    #info <- read.csv("typing_198.csv")
    info <- read.csv("typing_yopE_.csv")
    
    info$name <- info$Isolate
    #rownames(info) <- info$name
    info$yopE <- factor(info$yopE)
    
    #tree <- read.tree("core_gene_alignment_fasttree_directly_from_186isoaltes.tree")  --> NOT GOOD!
    #tree <- read.tree("raxml.tree")
    #tree <- read.tree("yopK_alignment_modified.tree") 
    tree <- read.tree("yopE_aligned_protein.tree") 
    #tree <- read.tree("core_gene_raxml.raxml.bestTree") 
    cols <- c(Yes='purple2', No='skyblue2')     
    
    # -- tree --
    tree_plot <- ggtree(tree, layout='circular', branch.length='none') %<+% info + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1) + geom_tippoint(aes(color=yopE)) 
    #+ scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    #, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
    #difference between geom_tiplab and geom_tiplab2?
    #+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
    #font.size=10, 
    png("ggtree2.png", width=1260, height=1260)
    #png("ggtree.png", width=1000, height=1000)
    #svg("ggtree.svg", width=1260, height=1260)
    tree_plot
    dev.off()
    
    # -- heatmap --
    #heatmapData2 <- info %>% select(Isolate, cgMLST, Lineage)
    heatmapData2 <- info %>% select(Isolate, Species)
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
      #"1","2","3","4","9","12","13","14","16","18",  "19","30","32","36","39","41","42","43","44","53",  "64","79","83","84","92","140","148","154","171","172", "173","194","215","217","252","277","279","282","290","312", "335","NA"
    #https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
    #"blue","cyan", "skyblue2", "azure3","blueviolet","darkgoldenrod",  "tomato","mediumpurple4","indianred", 
    #"cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta",     "cornflowerblue","darkgreen","red","tan","brown",
    
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "lightcyan3","green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki",  "maroon","lightgreen",      "darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "lightcyan3", "maroon","lightgreen", "darkgrey")
    #names(heatmap.colours) <- c("pestis","pseudotuberculosis","similis","enterocolitica","frederiksenii","kristensenii","occitanica","intermedia","hibernica","canariae","alsatica","rohdei","massiliensis","bercovieri","aleksiciae","mollaretii","aldovae","ruckeri","entomophaga",     "1","1Aa","1B","2","2/3-9a","2/3-9b","4","5","6","8","10","11","12","13","14","16","17","29","NA")
    
    heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "lightcyan3","green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki",  "maroon","lightgreen")
    names(heatmap.colours) <- c("pestis","pseudotuberculosis","similis","enterocolitica","frederiksenii","kristensenii","occitanica","intermedia","hibernica","canariae","alsatica","rohdei","massiliensis","bercovieri","aleksiciae","mollaretii","aldovae","ruckeri","entomophaga")
    
    #"cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cornflowerblue",
    #"2.MED","4.ANT","2.ANT","1.ORI","1.IN","1.ANT","0.ANT3","0.PE4","0.PE3","0.PE2","1a",
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #rochesterensis-->occitanica
    
    png("ggtree_and_gheatmap_yopE.png", width=1000, height=800)
    #png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #png("ggtree_and_gheatmap.png", width=1690, height=1400)
    #svg("ggtree_and_gheatmap.svg", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    tree_heatmap_plot <- gheatmap(tree_plot, heatmapData2, width=0.2,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
    tree_heatmap_plot
    dev.off()
    
    #samtools faidx yopE_aligned_protein_.fasta "1522" > yopE_aligned_protein_sorted.fasta
    #samtools faidx yopE_aligned_protein_.fasta "Pestoides_F_bis" >> yopE_aligned_protein_sorted.fasta
    #...
    #samtools faidx yopE_aligned_protein_.fasta "8081" >> yopE_aligned_protein_sorted.fasta
    #samtools faidx yopE_aligned_protein_.fasta "WA" >> yopE_aligned_protein_sorted.fasta
    
    #https://bioconductor.org/packages/devel/bioc/vignettes/ggtreeExtra/inst/doc/ggtreeExtra.html
    #https://github.com/YuLab-SMU/supplemental-ggmsa
    #https://github.com/YuLab-SMU/ggmsa
    
    library(ggmsa)
    library(ggplot2)
    library(ggtree)
    #library(gggenes)
    library(ape)
    library(Biostrings)
    library(ggnewscale)
    library(dplyr)
    library(ggtreeExtra)
    library(phangorn)
    library(RColorBrewer)
    library(patchwork)
    library(ggplotify)
    library(aplot)
    library(magick)
    library(treeio)
    
    #data <- "../supplemental-ggmsa-main/data/s_RBD.fasta"
    data <- "yopE_aligned_protein_.fasta"
    #data <- "yopE_aligned_protein_sorted.fasta"
    #dms <- read.csv("data/DMS.csv")
    #del <- c("expr_lib1", "expr_lib2","expr_avg","bind_lib1","bind_lib2")
    #dms <- dms[,!colnames(dms) %in% del]
    tidymsa <- tidy_msa(data)
    #tidymsa <- assign_dms(tidymsa, dms)
    #Mapping the position to the protein-protein interaction plot
    #tidymsa$position <- tidymsa$position + 330
    
    #dms = TRUE, 
    png("alignment_yopE.png", width=1100, height=5400)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    #scale_fill_gradientn(name = "ACE2 binding", colors = c(colRD(75),colBU(25)))
    msa_plot
    dev.off()
    
    # #Combine the ggtree and ggmsa plots using the cowplot package:
    # combined_plot <- cowplot::plot_grid(msa_plot, tree_heatmap_plot, ncol = 1, align = "v", axis = "l", rel_heights = c(3, 2))
    # ggsave("combined_plot.png", combined_plot, width = 10, height = 10, dpi = 300)

Episome (外质体)

在中文里,“episome”可以翻译为“外质体”。外质体是一种遗传元素,它可以作为独立的环状DNA分子存在,也可以整合到宿主的染色体中。外质体与宿主的染色体DNA分开复制,并在细胞分裂过程中在细胞之间传递。无论是原核生物还是真核生物,都可以找到外质体。

在细菌中,外质体通常是质粒,质粒是一种小的环状DNA分子,携带对宿主有益的基因,如抗生素抗性或产生毒素的能力。细菌质粒可以通过称为“共轭”的过程在细菌细胞之间传递,从而促进有益基因在细菌种群中的传播。

在真核细胞中,例如人类细胞,外质体可以是病毒基因组,如Epstein-Barr病毒(EBV)基因组。感染后,EBV使其线性DNA环化并作为外质体建立潜伏在宿主细胞的细胞核中。在潜伏期间,病毒外质体在宿主细胞中保持存在,但不会立即造成损害,病毒可以在以后重新激活,可能导致疾病。作为外质体,病毒基因组可以避免被宿主免疫系统检测和清除,同时保持在重新激活后复制并产生新的病毒颗粒的能力。

An episome is a genetic element that can exist either independently as a circular DNA molecule or integrate into a host’s chromosome. Episomes replicate separately from the host’s chromosomal DNA and can be transferred between cells during cell division. They are found in both prokaryotic and eukaryotic organisms.

In bacteria, episomes are often plasmids, which are small, circular DNA molecules that carry genes beneficial to the host, such as antibiotic resistance or the ability to produce toxins. Bacterial plasmids can be transferred between bacterial cells through a process called conjugation, which promotes the spread of beneficial genes within a bacterial population.

In eukaryotic cells, such as human cells, episomes can be viral genomes, like the Epstein-Barr virus (EBV) genome. Upon infection, EBV circularizes its linear DNA and establishes latency as an episome in the host cell’s nucleus. During latency, the viral episome persists in the host cell without causing immediate damage, and the virus can reactivate at a later time, potentially causing disease. By existing as an episome, the viral genome can avoid detection and clearance by the host’s immune system while maintaining the ability to replicate and produce new virus particles upon reactivation.

Human gammaherpesvirus 4 (HHV-4) is another name for Epstein-Barr virus (EBV). EBV is a member of the Herpesviridae family and belongs to the Gammaherpesvirinae subfamily. It is a double-stranded DNA virus that infects humans and is associated with various diseases, including infectious mononucleosis, certain types of lymphomas, and nasopharyngeal carcinoma.

EBV establishes a latent infection in the host’s B cells, where it can persist as an episome. The virus can reactivate at a later time, potentially leading to disease.

EV (细胞外囊泡:生物学中的关键细胞间通讯调控者)

In biology, “EV” often stands for “extracellular vesicle.” Extracellular vesicles are membrane-bound particles released by cells into the extracellular environment. They play essential roles in cell-to-cell communication, both within the same organism and between different organisms.

Extracellular vesicles can range in size from approximately 30 nm to 1 µm or more, and their contents include proteins, lipids, and nucleic acids (such as DNA, mRNA, and microRNAs). These vesicles can be classified into various categories, such as exosomes, microvesicles, and apoptotic bodies, based on their biogenesis, size, and molecular composition.

EVs are involved in numerous physiological and pathological processes, including immune responses, angiogenesis, cell differentiation, and cancer progression. They can transfer their contents from one cell to another, influencing the recipient cell’s behavior and function. Given their importance in intercellular communication, extracellular vesicles have become a significant area of research in recent years, with potential applications in diagnostics, therapeutics, and drug delivery.

在生物学中,“EV”通常表示“细胞外囊泡”(extracellular vesicle)。细胞外囊泡是由细胞释放到细胞外环境的膜结构颗粒。它们在同一生物体内以及不同生物体之间的细胞间通讯中起着至关重要的作用。

细胞外囊泡的大小范围从约30纳米到1微米或更大,其内容包括蛋白质、脂质和核酸(如DNA、mRNA和microRNAs)。根据它们的生物生成途径、大小和分子组成,这些囊泡可以被分为不同的类别,例如外泌体(exosomes)、微囊泡(microvesicles)和凋亡小体(apoptotic bodies)。

EVs参与许多生理和病理过程,包括免疫应答、血管生成、细胞分化和癌症进展。它们可以将其内容从一个细胞传递到另一个细胞,影响受体细胞的行为和功能。鉴于它们在细胞间通信中的重要性,近年来细胞外囊泡已成为一个重要的研究领域,具有潜在的诊断、治疗和药物递送应用。

有趣的基因轶事

基因学是一个充满惊奇和趣事的领域。以下是一些有趣的基因轶事:

  1. 鳄梨基因:一项研究表明,鳄梨的基因组有一个特别之处,那就是它保留了一些远古恐龙时代的基因。这些基因可能帮助鳄梨适应恶劣的环境,使它们在漫长的历史中得以幸存。
  2. CCR5-Δ32:CCR5-Δ32是一种罕见的基因突变,能让人免疫艾滋病毒(HIV)。这种突变使得艾滋病毒无法进入人体免疫系统的细胞,从而保护突变基因携带者不受感染。
  3. 嗅觉差异:我们对气味的感知能力主要取决于基因。有一种名为OR6A2的基因,它决定了我们对香草和肉桂气味的喜好。携带某种突变形式的人会觉得香草和肉桂有恶臭。
  4. 抗草酸基因:有一种名为“抗草酸”基因,它使得某些人在进食菠菜、瑞士甜菜和甜菜等含草酸高的食物时不会感到不适。草酸会与钙结合形成结石,而携带这种基因的人则能够防止草酸在体内形成结石。
  5. 猫咪拟态:澳洲的一种叫做“猫咪拟态”的植物,通过模拟猫咪的气味来保护自己。这种植物产生一种化合物,它的气味类似猫咪的信息素,使得其天敌如老鼠等动物误以为有猫咪在附近,从而避开它们。
  6. 左撇子基因:尽管左撇子在全球人口中只占10%左右,但研究发现,一个名为LRRTM1的基因与左撇子有关。具有这种基因的人更有可能成为左撇子。
  7. 蓝眼睛基因:蓝眼睛实际上是一种罕见的基因突变。研究表明,所有蓝眼睛的人都有一个共同的祖先,这个祖先生活在大约6,000-10,000年前。这种突变影响了名为OCA2的基因,导致黑色素在虹膜中的减少,使眼睛呈现蓝色。
  8. 咖啡因代谢:每个人对咖啡因的反应不同,这主要归功于基因。携带CYP1A2基因特定突变的人能更快地代谢咖啡因,因此对咖啡因不那么敏感。
  9. 甜品基因:有些人对甜食特别钟爱,这可能与名为TAS1R2的基因有关。这个基因对甜味敏感,携带特定突变的人更喜欢甜食。
  10. 花生过敏基因:花生过敏是一种常见的食物过敏,携带名为HLA-DQ和HLA-DR的基因突变的人更容易对花生过敏。
  11. 音乐天赋基因:名为AVPR1A的基因与音乐天赋有关。携带这个基因特定突变的人在音乐创作、即兴表演和音乐记忆方面表现更好。
  12. 超级熬夜基因:有些人几乎不需要睡觉就能保持清醒和精力充沛。这与一种名为“超级熬夜基因”的基因突变有关,携带这种突变的人每天只需要较少的睡眠。
  13. 超级记忆基因:有些人拥有惊人的记忆力,能够轻松记住大量的细节。这可能与一种名为BDNF的基因变异有关。BDNF基因编码一种在大脑中起到关键作用的神经营养因子。携带这种基因变异的人可能具有更强大的记忆力。
  14. 恐怖片爱好者基因:有些人对恐怖片情有独钟,而这可能与一种名为COMT的基因变异有关。这个基因影响多巴胺代谢,多巴胺是一种与愉悦感相关的神经递质。携带这种基因变异的人在观看恐怖片时可能感到更多的刺激和愉悦。
  15. 不喜欢西兰花的基因:有些人对西兰花的味道特别反感,这可能与一种名为TAS2R38的基因变异有关。这个基因编码一种味觉受体,携带这种基因变异的人对西兰花中的苦味非常敏感。

这些有趣的基因轶事展示了基因对我们生活的各个方面的影响,从我们的生理特征到我们的行为习惯。随着科学家们对基因学的进一步研究,我们将不断发现更多有趣的基因现象。

交联质谱(XL-MS 或 CLMS)

交联质谱(XL-MS 或 CLMS)是一种强大的技术,用于研究蛋白质-蛋白质相互作用、蛋白质复合物结构和蛋白质构象变化。该方法包括使用化学交联剂将蛋白质或蛋白质之间相互作用或紧密位于的氨基酸共价连接。然后将交联蛋白质消化成肽,通过质谱分析交联肽。

交联质谱的一般工作流程如下:

  1. 交联:将蛋白质或蛋白质复合物与化学交联剂处理,使其在近距离的氨基酸之间形成共价连接。交联剂的选择取决于实验要求,例如所需的交联距离和用于交联的特定氨基酸。

  2. 消化:将交联蛋白质消化成肽,使用蛋白酶如胰蛋白酶,它在特定的氨基酸序列处切割蛋白质。

  3. 富集:根据使用的交联剂和实验设计,可能需要从复杂的肽混合物中富集交联肽。这可以通过多种方法实现,包括亲和纯化、大小排阻层析或强阳离子交换层析。

  4. 质谱分析:使用高分辨率质谱对交联肽进行分析。质谱仪测量肽的质量与电荷比(m/z),提供关于其质量和电荷的信息。串联质谱(MS/MS)用于将交联肽片段化并获得其序列信息。

  5. 数据分析:使用专门的生物信息学工具和算法分析结果的质谱数据。这些工具有助于识别交联肽并推断相互作用的氨基酸或蛋白质区域。所识别的交联可用于生成结构模型或绘制蛋白质相互作用界面。

交联质谱已成功应用于研究各种生物系统,包括蛋白质-蛋白质相互作用、蛋白质-核酸相互作用和蛋白质-脂质相互作用。它还被用于研究大型蛋白质复合物和膜蛋白的结构和动态。

Crosslinking mass spectrometry (XL-MS or CLMS) is a powerful technique used to study protein-protein interactions, protein complex structures, and protein conformational changes. The method involves covalently linking interacting or closely located amino acids within a protein or between proteins using chemical crosslinkers. The crosslinked proteins are then digested into peptides, and the crosslinked peptides are analyzed using mass spectrometry.

The general workflow of crosslinking mass spectrometry is as follows:

  1. Crosslinking: Proteins or protein complexes are treated with a chemical crosslinker that covalently links amino acids in close proximity. The choice of crosslinker depends on the experimental requirements, such as the desired crosslinking distance and the specific amino acids targeted for crosslinking.

  2. Digestion: Crosslinked proteins are digested into peptides using proteolytic enzymes such as trypsin, which cleaves proteins at specific amino acid sequences.

  3. Enrichment: Depending on the crosslinker used and the experimental design, it may be necessary to enrich crosslinked peptides from the complex peptide mixture. This can be achieved using various approaches, including affinity purification, size-exclusion chromatography, or strong cation exchange chromatography.

  4. Mass spectrometry analysis: The crosslinked peptides are analyzed using high-resolution mass spectrometry. The mass spectrometer measures the mass-to-charge ratios (m/z) of the peptides, providing information about their mass and charge. Tandem mass spectrometry (MS/MS) is used to fragment the crosslinked peptides and obtain their sequence information.

  5. Data analysis: The resulting mass spectrometry data are analyzed using specialized bioinformatics tools and algorithms. These tools help identify the crosslinked peptides and infer the interacting amino acids or protein regions. The identified crosslinks can be used to generate structural models or map protein interaction interfaces.

Crosslinking mass spectrometry has been successfully applied to study various biological systems, including protein-protein interactions, protein-nucleic acid interactions, and protein-lipid interactions. It has also been used to investigate the structure and dynamics of large protein complexes and membrane proteins.

Crosslinking mass spectrometry (XL-MS) has been under development since the late 1990s and early 2000s. Early work in the field focused on optimizing the crosslinking and mass spectrometry techniques, as well as developing computational methods to analyze the resulting data. The technique has evolved significantly over the past two decades, thanks to advancements in mass spectrometry instrumentation, crosslinking reagents, and bioinformatics tools.

One of the pioneering studies in XL-MS was published in 2000 by Rappsilber, Siniossoglou, et al., who used the technique to study the structure of protein complexes in yeast cells. Since then, numerous improvements have been made to the method, leading to its widespread adoption in the field of structural biology and proteomics.

As the technology continues to mature, XL-MS has become an essential tool for studying protein-protein interactions, protein complex structures, and protein conformational changes, providing valuable insights into the molecular mechanisms underlying various biological processes.

Fan Liu’s lab, based at the Max Planck Institute of Biochemistry in Germany, has made significant contributions to the field of crosslinking mass spectrometry (XL-MS). The group focuses on developing innovative XL-MS methods, as well as computational tools for data analysis, to study protein structures and interactions.

One of the key contributions from the Fan Liu lab is the development of the xQuest/xProphet software suite. xQuest is an algorithm for identifying crosslinked peptides from mass spectrometry data, while xProphet is a post-processing tool that statistically validates the xQuest results. This software suite has greatly facilitated the analysis of XL-MS data and is widely used in the field.

The lab has also contributed to the development of novel crosslinking reagents and methodologies for studying protein interactions. For example, they have developed a method called “disuccinimidyl sulfoxide (DSSO) crosslinking” that enables the identification of interacting protein regions by using a photoactivatable crosslinker.

Furthermore, Fan Liu’s group has applied their XL-MS expertise to investigate the structure and function of various biologically significant protein complexes, such as the 26S proteasome, the Mediator complex, and the nuclear pore complex, providing valuable insights into the molecular mechanisms of these systems.

Overall, the Fan Liu lab has played an important role in advancing the field of crosslinking mass spectrometry and continues to push the boundaries of what can be achieved with this powerful technique.

二琥珀酰亚砜基 (disuccinimidyl sulfoxide) 交联是一种在蛋白质结构和相互作用研究领域中使用的交联方法。通过使用这种光活化交联剂,可以识别相互作用的蛋白质区域。在交联质谱(XL-MS)技术中,二琥珀酰亚砜基 (DSSO) 交联为研究蛋白质复合物结构提供了一种有效的工具。

10 NGS Categories

  1. RNA Sequencing: RNA sequencing (RNA-Seq) is a technique used to study the transcriptome, or the complete set of RNA molecules expressed within a cell or tissue. This approach allows researchers to analyze gene expression patterns, identify novel transcripts, and detect alternative splicing events.

  2. Whole Genome Sequencing: Whole genome sequencing (WGS) is a method that determines the complete DNA sequence of an organism’s genome. This technique provides comprehensive information about an organism’s genetic makeup, including the identification of genes, regulatory elements, and variations such as single nucleotide polymorphisms (SNPs) and structural variants.

  3. Amplicon Sequencing: Amplicon sequencing involves the targeted sequencing of specific genomic regions or genes using polymerase chain reaction (PCR) to amplify the regions of interest before sequencing. This approach is often used to study specific genetic variations or target known functional regions in the genome.

  4. Exome Sequencing: Exome sequencing targets the protein-coding regions of the genome, known as the exome. These regions account for approximately 1-2% of the genome but contain the majority of disease-causing genetic variations. Exome sequencing is used to identify novel disease-associated genes and mutations in known genes.

  5. CRISPR Validation (genoTYPER-NEXT): CRISPR validation is a process to assess the efficiency and specificity of CRISPR/Cas9-mediated genome editing. genoTYPER-NEXT is a high-throughput sequencing platform used to validate CRISPR/Cas9-induced mutations by sequencing the target sites and identifying the exact mutations generated by the editing process.

  6. Targeted Sequencing: Targeted sequencing is a method that focuses on sequencing specific genomic regions or genes of interest, rather than the entire genome. This approach is more cost-effective and allows for higher sequencing depth, providing better resolution for detecting low-frequency genetic variations.

  7. Metagenomics: Metagenomics is the study of genetic material obtained directly from environmental samples, such as soil or water, without the need for culturing individual organisms. This approach allows researchers to analyze the composition, diversity, and functional potential of complex microbial communities.

  8. Epigenomics: Epigenomics is the study of epigenetic modifications on a genome-wide scale, including DNA methylation, histone modifications, and non-coding RNA molecules. These modifications play essential roles in gene regulation and can have long-lasting effects on an organism’s phenotype without changing the DNA sequence.

  9. Immunogenomics: Immunogenomics is the study of the genetic and epigenetic factors that influence the immune system and its response to various stimuli, including pathogens, allergens, and self-antigens. This field integrates genomic, transcriptomic, and epigenomic data to better understand the molecular mechanisms underlying immune responses and develop novel therapies for immune-related diseases.

  10. Proteomics is a branch of molecular biology that focuses on the large-scale study of proteins within an organism or a specific biological system. Unlike genomics, which deals with the study of DNA and gene sequences, proteomics investigates the structure, function, and interactions of proteins. Proteins are crucial for many cellular processes, and their functions are dictated by their structure, abundance, and modifications.

    There are several key aspects and techniques in proteomics:

    • Protein identification: Identifying the proteins present in a given sample, such as a cell, tissue, or body fluid, is a fundamental aspect of proteomics. Mass spectrometry is the most commonly used technique for protein identification, often combined with liquid chromatography to separate complex protein mixtures before analysis.

    • Protein quantification: Measuring the relative or absolute abundance of proteins in a sample is important for understanding their biological roles and potential involvement in disease processes. Techniques like label-free quantification, stable isotope labeling, and targeted mass spectrometry approaches (e.g., selected reaction monitoring) are commonly used for protein quantification.

    • Post-translational modifications (PTMs): PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, glycosylation, and acetylation. They can alter a protein’s activity, stability, localization, or interaction with other molecules. Mass spectrometry-based approaches and specific enrichment techniques are used to identify and quantify PTMs.

    • Protein-protein interactions: Investigating how proteins interact with one another is crucial for understanding cellular processes and signaling pathways. Techniques like yeast two-hybrid screens, affinity purification-mass spectrometry (AP-MS), and proximity labeling methods can reveal protein-protein interactions.

    • Structural proteomics: This area focuses on determining the three-dimensional structures of proteins, which provide insights into their functions and interactions. Techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) are used to determine protein structures.

    • Functional proteomics: This aspect of proteomics aims to understand the biological roles of proteins and their involvement in cellular processes. Techniques like RNA interference (RNAi), CRISPR/Cas9-mediated gene editing, and activity-based protein profiling (ABPP) are used to study protein functions and identify potential drug targets.

Draw 3D PCA and calculate background genes

Download python3 script for generating the plot

Download raw data 1 for generating the plot

Download raw data 2 for generating the plot

PCA_3D_2.png

construct a data structure (merged_df) as above with data and pc

library(ggplot2)
data <- plotPCA(rld, intgroup=c("condition", "donor"), returnData=TRUE)
write.csv(data, file="plotPCA_data.csv")
#calculate all PCs including PC3 with the following codes
library(genefilter)
ntop <- 500
rv <- rowVars(assay(rld))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
mat <- t( assay(rld)[select, ] )
pc <- prcomp(mat)
pc$x[,1:3]
#df_pc <- data.frame(pc$x[,1:3])
df_pc <- data.frame(pc$x)

identical(rownames(data), rownames(df_pc)) #-->TRUE
## define the desired order of row names
#desired_order <- rownames(data)
## sort the data frame by the desired order of row names
#df <- df[match(desired_order, rownames(df_pc)), ]

data$PC1 <- NULL
data$PC2 <- NULL
merged_df <- merge(data, df_pc, by = "row.names")
#merged_df <- merged_df[, -1]
row.names(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL  # remove the "name" column
merged_df$name <- NULL
merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","PC15","PC16","PC17","PC18","PC19","PC20","PC21","PC22","PC23","PC24","PC25","PC26","PC27","PC28","group","condition","donor")]
#results in 26PCs: merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","PC15","PC16","PC17","PC18","PC19","PC20","PC21","PC22","PC23","PC24","PC25","PC26","group","condition","donor")]
write.csv(merged_df, file="merged_df_28PCs.csv")
#> summary(pc)  #--> variances of PC1, PC2 and PC3 are 0.6795 0.08596 0.06599
#Importance of components:
#                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
#Standard deviation     2.6011 0.92510 0.81059 0.67065 0.51952 0.46429 0.41632
#Proportion of Variance 0.6795 0.08596 0.06599 0.04517 0.02711 0.02165 0.01741
#Cumulative Proportion  0.6795 0.76548 0.83148 0.87665 0.90376 0.92541 0.94282
#                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
#Standard deviation     0.38738 0.32048 0.27993 0.24977 0.20217 0.17316 0.16293
#Proportion of Variance 0.01507 0.01032 0.00787 0.00627 0.00411 0.00301 0.00267
#Cumulative Proportion  0.95789 0.96821 0.97608 0.98234 0.98645 0.98946 0.99213
#                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
#Standard deviation     0.15058 0.13083 0.12449 0.08789 0.06933 0.06064 0.04999
#Proportion of Variance 0.00228 0.00172 0.00156 0.00078 0.00048 0.00037 0.00025
#Cumulative Proportion  0.99441 0.99612 0.99768 0.99846 0.99894 0.99931 0.99956
#                          PC22    PC23    PC24    PC25     PC26
#Standard deviation     0.04143 0.03876 0.03202 0.01054 0.005059
#Proportion of Variance 0.00017 0.00015 0.00010 0.00001 0.000000
#Cumulative Proportion  0.99973 0.99988 0.99999 1.00000 1.000000                                                                     

draw 3D with merged_df using plot3D

import plotly.graph_objects as go
import pandas as pd
from sklearn.decomposition import PCA
import numpy as np

# Read in data as a pandas dataframe
#df = pd.DataFrame({
#    'PC1': [-13.999925, -12.504291, -12.443057, -13.065235, -17.316215],
#    'PC2': [-1.498823, -3.342411, -6.067055, -8.205809, 3.293993],
#    'PC3': [-3.335085, 15.207755, -14.725450, 15.078469, -6.917358],
#    'condition': ['GFP d3', 'GFP d3', 'GFP d8', 'GFP d8', 'GFP+mCh d9/12'],
#    'donor': ['DI', 'DII', 'DI', 'DII', 'DI']
#})
df = pd.read_csv('merged_df_28PCs.csv', index_col=0, header=0)
df['condition'] = df['condition'].replace("GFP+mCh d9/12", "ctrl LTtr+sT d9/12")
df['condition'] = df['condition'].replace("sT+LTtr d9/12", "LTtr+sT d9/12")
df['condition'] = df['condition'].replace("GFP d3", "ctrl LT/LTtr d3")
df['condition'] = df['condition'].replace("GFP d8", "ctrl LT/LTtr d8")
df['condition'] = df['condition'].replace("mCh d3", "ctrl sT d3")
df['condition'] = df['condition'].replace("mCh d8", "ctrl sT d8")

# Fit PCA model to reduce data dimensions to 3
pca = PCA(n_components=3)
pca.fit(df.iloc[:, :-3])
X_reduced = pca.transform(df.iloc[:, :-3])

# Add reduced data back to dataframe
df['PC1'] = X_reduced[:, 0]
df['PC2'] = X_reduced[:, 1]
df['PC3'] = X_reduced[:, 2]

# Create PCA plot with 3D scatter
fig = go.Figure()

#['circle', 'circle-open', 'square', 'square-open', 'diamond', 'diamond-open', 'cross', 'x']
# if donor == 'DI' else marker=dict(size=2, opacity=0.8, color=condition_color, symbol=donor_symbol)

#decrease diamond size to 6 while keep the circle as size 10 in the following code:
#'rgb(128, 150, 128)'
#I need three families of colors, always from light to deep, the first one should close to grey.
#the first serie for 'ctrl LTtr+sT d9/12', 'LTtr+sT d9/12' 
#the second serie for 'ctrl LT/LTtr d3', 'ctrl LT/LTtr d8', 'LT d3', 'LT d8', 'LTtr d3', 'LTtr d8'
#the third serie for 'ctrl sT d3', 'ctrl sT d8', 'sT d3', 'sT d8', 'sT+LT d3'

condition_color_map_untreated = {'untreated':'black'}
donor_symbol_map_untreated = {'DI': 'circle-open', 'DII': 'diamond-open'}
#condition_color_map = {'ctrl LTtr+sT d9/12': 'green', 'GFP d3': 'blue', 'GFP d8': 'red', 'GFP+mCh d9/12': 'green', 'LT d3': 'orange'}
condition_color_map = {
    'ctrl LTtr+sT d9/12': 'black',
    'LTtr+sT d9/12': '#a14a1a',

    'ctrl LT/LTtr d3': '#fdbf6f',
    'ctrl LT/LTtr d8': '#ff7f00',
    'LT d3': '#b2df8a',
    'LT d8': '#33a02c',
    'LTtr d3': '#a6cee3',
    'LTtr d8': '#1f78b4',

    'ctrl sT d3': 'rgb(200, 200, 200)',
    'ctrl sT d8': 'rgb(100, 100, 100)',
    'sT d3': '#fb9a99',
    'sT d8': '#e31a1c',
    'sT+LT d3': 'magenta'
}
donor_symbol_map = {'DI': 'circle', 'DII': 'diamond'}

for donor, donor_symbol in donor_symbol_map_untreated.items():
    for condition, condition_color in condition_color_map_untreated.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=6 if donor_symbol in ['diamond-open'] else 10, opacity=0.8, color=condition_color, symbol=donor_symbol)))

for donor, donor_symbol in donor_symbol_map.items():
    for condition, condition_color in condition_color_map.items():
        mask = (df['condition'] == condition) & (df['donor'] == donor)
        fig.add_trace(go.Scatter3d(x=df.loc[mask, 'PC1'], y=df.loc[mask, 'PC2'], z=df.loc[mask, 'PC3'],
                                   mode='markers',
                                   name=f'{condition}' if donor == 'DI' else None,
                                   legendgroup=f'{condition}',
                                   showlegend=True if donor == 'DI' else False,
                                   marker=dict(size=6 if donor_symbol in ['diamond'] else 10, opacity=0.8, color=condition_color, symbol=donor_symbol)))

for donor, donor_symbol in donor_symbol_map.items():
    fig.add_trace(go.Scatter3d(x=[None], y=[None], z=[None],
                               mode='markers',
                               name=donor,
                               legendgroup=f'{donor}',
                               showlegend=True,
                               marker=dict(size=10, opacity=1, color='black', symbol=donor_symbol),
                               hoverinfo='none'))

# Annotations for the legend blocks
fig.update_layout(
    annotations=[
        dict(x=1.1, y=1.0, xref='paper', yref='paper', showarrow=False,
             text='Condition', font=dict(size=15)),
        dict(x=1.1, y=0.6, xref='paper', yref='paper', showarrow=False,
             text='Donor', font=dict(size=15))
    ],
    scene=dict(
        aspectmode='cube',
        xaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title='PC1: 36% v.'),
        yaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title='PC2: 17% v.'),
        zaxis=dict(gridcolor='black', backgroundcolor='white', zerolinecolor='black', title='PC3: 15% variance'),
        bgcolor='white'
    ),
    margin=dict(l=5, r=5, b=5, t=0)  # Adjust the margins to prevent clipping of axis titles
)

#fig.show()
fig.write_image("fig1.svg")

calculate background genes

    # [1] ensembl_gene_id         gene_name               MKL.1_RNA              
    # [4] MKL.1_RNA_118           MKL.1_RNA_147           MKL.1_EV.RNA           
    # [7] MKL.1_EV.RNA_2          MKL.1_EV.RNA_118        MKL.1_EV.RNA_87        
    #[10] MKL.1_EV.RNA_27         X042_MKL.1_wt_EV        X0505_MKL.1_wt_EV      
    #[13] X042_MKL.1_sT_DMSO      X0505_MKL.1_sT_DMSO_EV  X042_MKL.1_scr_DMSO_EV 
    #[16] X0505_MKL.1_scr_DMSO_EV X042_MKL.1_sT_Dox       X0505_MKL.1_sT_Dox_EV  
    #[19] X042_MKL.1_scr_Dox_EV   X0505_MKL.1_scr_Dox_EV 

"ensembl_gene_id
gene_name 
python process.py ../fpkm_values_MKL-1.csv 3 > MKL.1_RNA 
python process.py ../fpkm_values_MKL-1.csv 4 > MKL.1_RNA_118 
python process.py ../fpkm_values_MKL-1.csv 5 > MKL.1_RNA_147 
python process.py ../fpkm_values_MKL-1.csv 6 > MKL.1_EV.RNA 
python process.py ../fpkm_values_MKL-1.csv 7 > MKL.1_EV.RNA_2 
python process.py ../fpkm_values_MKL-1.csv 8 > MKL.1_EV.RNA_118 
python process.py ../fpkm_values_MKL-1.csv 9 > MKL.1_EV.RNA_87 
python process.py ../fpkm_values_MKL-1.csv 10 > MKL.1_EV.RNA_27 
python process.py ../fpkm_values_MKL-1.csv 11 > X042_MKL.1_wt_EV 
python process.py ../fpkm_values_MKL-1.csv 12 > X0505_MKL.1_wt_EV 
python process.py ../fpkm_values_MKL-1.csv 13 > X042_MKL.1_sT_DMSO 
python process.py ../fpkm_values_MKL-1.csv 14 > X0505_MKL.1_sT_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 15 > X042_MKL.1_scr_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 16 > X0505_MKL.1_scr_DMSO_EV 
python process.py ../fpkm_values_MKL-1.csv 17 > X042_MKL.1_sT_Dox 
python process.py ../fpkm_values_MKL-1.csv 18 > X0505_MKL.1_sT_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 19 > X042_MKL.1_scr_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 20 > X0505_MKL.1_scr_Dox_EV 
python process.py ../fpkm_values_MKL-1.csv 21 > median_length 

cut -f1 -d',' MKL.1_RNA>_MKL-1_RNA
cut -f1 -d',' MKL.1_RNA_118>_MKL-1_RNA_118
cut -f1 -d',' MKL.1_RNA_147>_MKL-1_RNA_147
cut -f1 -d',' MKL.1_EV.RNA>_MKL-1_EV_RNA
cut -f1 -d',' MKL.1_EV.RNA_2>_MKL-1_EV_RNA_2
cut -f1 -d',' MKL.1_EV.RNA_118>_MKL-1_EV_RNA_118
cut -f1 -d',' MKL.1_EV.RNA_87>_MKL-1_EV_RNA_87
cut -f1 -d',' MKL.1_EV.RNA_27>_MKL-1_EV_RNA_27
cut -f1 -d',' X042_MKL.1_wt_EV>_X042_MKL-1_wt_EV
cut -f1 -d',' X0505_MKL.1_wt_EV>_X0505_MKL-1_wt_EV
cut -f1 -d',' X042_MKL.1_sT_DMSO>_X042_MKL-1_sT_DMSO
cut -f1 -d',' X0505_MKL.1_sT_DMSO_EV>_X0505_MKL-1_sT_DMSO_EV
cut -f1 -d',' X042_MKL.1_scr_DMSO_EV>_X042_MKL-1_scr_DMSO_EV
cut -f1 -d',' X0505_MKL.1_scr_DMSO_EV>_X0505_MKL-1_scr_DMSO_EV
cut -f1 -d',' X042_MKL.1_sT_Dox>_X042_MKL-1_sT_Dox
cut -f1 -d',' X0505_MKL.1_sT_Dox_EV>_X0505_MKL-1_sT_Dox_EV
cut -f1 -d',' X042_MKL.1_scr_Dox_EV>_X042_MKL-1_scr_Dox_EV
cut -f1 -d',' X0505_MKL.1_scr_Dox_EV>_X0505_MKL-1_scr_Dox_EV

~/Tools/csv2xls-0.4/csv_to_xls.py  _MKL-1_RNA _MKL-1_RNA_118 _MKL-1_RNA_147 _MKL-1_EV_RNA _MKL-1_EV_RNA_2 _MKL-1_EV_RNA_118 _MKL-1_EV_RNA_87 _MKL-1_EV_RNA_27 _X042_MKL-1_wt_EV _X0505_MKL-1_wt_EV _X042_MKL-1_sT_DMSO _X0505_MKL-1_sT_DMSO_EV _X042_MKL-1_scr_DMSO_EV _X0505_MKL-1_scr_DMSO_EV _X042_MKL-1_sT_Dox _X0505_MKL-1_sT_Dox_EV _X042_MKL-1_scr_Dox_EV _X0505_MKL-1_scr_Dox_EV -d',' -o background_genes_MKL-1.xls;

process.py

import sys

filename = sys.argv[1]
n = int(sys.argv[2])

with open(filename, 'r') as file:
    for line in file:
        words = line.split(",")
        if len(words) >= n and words[n-1] != '0':
            print(line.strip())

# -- in the example of K331A data --
#FORMAT normalized_counts.txt to normalized_counts.csv
#DEBUG1: add gene_symbol in the first line
#DEBUG2: '\n' --> ',\n'
#RESULT: it looks like #gene_symbol,ctrl_d3_DonorI,ctrl_d3_DonorII,ctrl_d8_DonorI,ctrl_d8_DonorII,LT_d3_DonorI,LT_d3_DonorII,LT_d8_DonorI,LT_d8_DonorII,LTtr_d3_DonorI,LTtr_d3_DonorII,LTtr_d8_DonorI,LTtr_d8_DonorII,K331A_DonorI,K331A_DonorII,
#DDX11L1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
python process.py normalized_counts.csv 2 > ctrl_d3_DonorI
python process.py normalized_counts.csv 3 > ctrl_d3_DonorII 
python process.py normalized_counts.csv 4 > ctrl_d8_DonorI
python process.py normalized_counts.csv 5 > ctrl_d8_DonorII
python process.py normalized_counts.csv 6 > LT_d3_DonorI
python process.py normalized_counts.csv 7 > LT_d3_DonorII
python process.py normalized_counts.csv 8 > LT_d8_DonorI
python process.py normalized_counts.csv 9 > LT_d8_DonorII
python process.py normalized_counts.csv 10 > LTtr_d3_DonorI
python process.py normalized_counts.csv 11 > LTtr_d3_DonorII
python process.py normalized_counts.csv 12 > LTtr_d8_DonorI
python process.py normalized_counts.csv 13 > LTtr_d8_DonorII
python process.py normalized_counts.csv 14 > K331A_DonorI
python process.py normalized_counts.csv 15 > K331A_DonorII

cut -f1 -d',' ctrl_d3_DonorI > _ctrl_d3_DonorI
cut -f1 -d',' ctrl_d3_DonorII > _ctrl_d3_DonorII
cut -f1 -d',' ctrl_d8_DonorI > _ctrl_d8_DonorI
cut -f1 -d',' ctrl_d8_DonorII > _ctrl_d8_DonorII
cut -f1 -d',' LT_d3_DonorI > _LT_d3_DonorI
cut -f1 -d',' LT_d3_DonorII > _LT_d3_DonorII
cut -f1 -d',' LT_d8_DonorI > _LT_d8_DonorI
cut -f1 -d',' LT_d8_DonorII > _LT_d8_DonorII
cut -f1 -d',' LTtr_d3_DonorI > _LTtr_d3_DonorI
cut -f1 -d',' LTtr_d3_DonorII > _LTtr_d3_DonorII
cut -f1 -d',' LTtr_d8_DonorI > _LTtr_d8_DonorI
cut -f1 -d',' LTtr_d8_DonorII > _LTtr_d8_DonorII
cut -f1 -d',' K331A_DonorI > _K331A_DonorI
cut -f1 -d',' K331A_DonorII > _K331A_DonorII

~/Tools/csv2xls-0.4/csv_to_xls.py _ctrl_d3_DonorI _ctrl_d3_DonorII _ctrl_d8_DonorI _ctrl_d8_DonorII   _LT_d3_DonorI _LT_d3_DonorII _LT_d8_DonorI _LT_d8_DonorII  _LTtr_d3_DonorI _LTtr_d3_DonorII _LTtr_d8_DonorI _LTtr_d8_DonorII  _K331A_DonorI _K331A_DonorII  -d',' -o background_genes.xls;

Top 15 Sequencing Companies

As of our best knowledge, the following are some of the top sequencing companies in the genomics industry.

  1. Illumina: A leading provider of DNA sequencing technologies, with platforms such as NovaSeq, HiSeq, MiSeq, and iSeq.

  2. Thermo Fisher Scientific: Offers the Ion Torrent sequencing platform, based on semiconductor sequencing technology.

  3. BGI (Beijing Genomics Institute): A Chinese company that provides sequencing services and has developed its own sequencing platforms, such as the BGISEQ and MGISEQ series.

  4. Pacific Biosciences (PacBio): Known for its long-read sequencing technology, Single Molecule, Real-Time (SMRT) sequencing.

  5. Oxford Nanopore Technologies: Develops nanopore sequencing technology, with platforms like MinION, GridION, and PromethION.

  6. 10x Genomics: Provides genomics and single-cell solutions, including the Chromium platform for linked-read sequencing and single-cell analysis.

  7. Qiagen: A global provider of sample and assay technologies, including the GeneReader NGS system for targeted sequencing applications.

  8. Roche: Offers various sequencing technologies, including the 454 Life Sciences platform for pyrosequencing and the Roche NimbleGen target enrichment system.

  9. DNA Electronics (DNAe): A company working on semiconductor-based DNA sequencing technology, with applications in various fields, including diagnostics and personalized medicine.

  10. GenapSys: Developer of the GENIUS system, a compact and cost-effective next-generation sequencing (NGS) platform.

  11. GENEWIZ is a global genomics service company specializing in DNA sequencing, gene synthesis, molecular biology, and genomic services. They provide Sanger sequencing, next-generation sequencing (NGS), and a variety of other services to researchers in academia, pharmaceuticals, and biotechnology. Although GENEWIZ does not develop sequencing platforms like some of the companies mentioned earlier, its expertise in delivering high-quality sequencing services has made it a popular choice for researchers in need of genomics services. It was founded in 1999 by Dr. Steve Sun and Dr. Amy Liao. Both of them have extensive experience in molecular biology and genomics. Dr. Sun holds a Ph.D. in Molecular Biology from the University of California, Berkeley, and an MBA from Rutgers University, while Dr. Liao holds a Ph.D. in Molecular Biology from the University of Southern California. Their combined expertise and vision for providing genomics services led to the establishment of GENEWIZ as a global genomics services company.

  12. Novogene: Founded in 2011, Novogene is a leading provider of genomic services and solutions with cutting-edge next-generation sequencing (NGS) and bioinformatics expertise. The company offers a wide range of services, including whole-genome sequencing, transcriptome sequencing, single-cell sequencing, and various other genomics services for research and clinical applications. Novogene has grown rapidly and has established a strong global presence, serving customers in academia, pharmaceuticals, biotechnology, agriculture, and more.

  13. Biomarker Technologies: Biomarker Technologies is a Chinese company founded in 2010 that focuses on providing genomics services and developing genomic technologies for the agricultural and environmental sectors. The company offers a variety of sequencing services, such as whole-genome sequencing, transcriptome sequencing, and metagenomic sequencing, to support research in plant and animal breeding, environmental monitoring, and other fields. Biomarker Technologies also develops molecular markers and other genomic tools to facilitate the application of genomics in agriculture and environmental sciences.

  14. Singleron’s products and services include single-cell RNA sequencing, single-cell ATAC sequencing, single-cell multi-omics sequencing, and related bioinformatics services. These offerings allow researchers to explore gene expression, chromatin accessibility, and other molecular features at the single-cell level, enabling a deeper understanding of cellular heterogeneity and the underlying mechanisms in various biological systems.

  15. NanoString Technologies is a biotechnology company that provides life science tools for translational research and molecular diagnostics. The company’s flagship product, the nCounter Analysis System, offers a simple and efficient way to analyze gene expression, protein expression, and nucleic acid abundance in a single experiment.

    The nCounter platform is based on a digital barcoding technology that enables the direct measurement and counting of individual RNA or DNA molecules without the need for amplification, such as PCR. The system allows for multiplexing of up to 800 targets in a single reaction, providing a highly sensitive, accurate, and reproducible method for gene expression analysis. Moreover, the platform is compatible with various sample types, including fresh, frozen, and formalin-fixed, paraffin-embedded (FFPE) tissues.

    NanoString’s technology has been widely adopted in various research fields, such as oncology, immunology, and neuroscience, as well as in clinical applications, including companion diagnostics and prognostic tests. The nCounter platform can help researchers identify biomarkers, understand gene regulation, and elucidate the mechanisms underlying diseases and treatments. Additionally, NanoString has expanded its product offerings to include new applications like spatial transcriptomics with the GeoMx Digital Spatial Profiler (GeoMx Digital Spatial Profiling), enabling researchers to study gene expression in the context of tissue architecture.

These companies offer a wide range of sequencing technologies and services to cater to different research and diagnostic needs. The industry is continuously evolving, with new players and technologies entering the market regularly.

How to use H3K27me3 and H3K4me3 to identify transcription factors?

H3K27me3 (histone H3 lysine 27 trimethylation) and H3K4me3 (histone H3 lysine 4 trimethylation) are histone marks associated with gene repression and activation, respectively. While these marks can provide insights into the chromatin state and regulation of gene expression, they are not directly used to identify transcription factors. Instead, they can be used to identify regions of interest where transcription factors might bind and regulate gene expression.

To identify transcription factors, you would typically perform a chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiment using antibodies specific for the transcription factors of interest. However, you can use the information from H3K27me3 and H3K4me3 ChIP-seq data to narrow down the regions where you may expect to find transcription factors binding.

Here’s a general approach:

  1. Identify regions marked by H3K4me3, which indicates active promoters, and H3K27me3, which indicates repressed regions, using peak calling tools such as MACS or HOMER.

  2. Exclude regions marked by H3K27me3, as transcription factors are less likely to bind and regulate genes in repressed chromatin.

  3. Focus on the regions marked by H3K4me3, as these represent active promoters where transcription factors might bind to regulate gene expression.

  4. Perform motif analysis on these active promoter regions using tools like MEME, HOMER, or JASPAR to identify overrepresented DNA motifs that could be the binding sites of transcription factors.

  5. Compare the identified motifs with known transcription factor binding motifs in databases such as JASPAR, TRANSFAC, or HOCOMOCO to predict the transcription factors that may bind to these regions.

By integrating the information from H3K27me3 and H3K4me3 ChIP-seq data, you can prioritize regions for further analysis and potentially identify transcription factors that may be regulating gene expression in your system.

How to use H3K27ac, H3K4me1, and RNA-seq to identify enhancers and their target genes?

  1. Identify the overlapping peaks of H3K27ac and H3K4me1 using tools like BEDTools intersect or HOMER mergePeaks. H3K4me1 (histone H3 lysine 4 monomethylation) is commonly found in both active and poised enhancer regions. On the other hand, H3K27ac (histone H3 lysine 27 acetylation) is more specific to active enhancers. By combining both H3K27ac and H3K4me1 histone marks, we can more confidently identify active enhancer regions, as the overlapping regions marked by both histone modifications are more likely to represent true enhancers with regulatory roles in gene expression. This approach helps to reduce the number of false positives.

  2. Exclude promoter regions (TSS +/- 2 kb) from the overlapping peaks using BEDTools subtract or HOMER mergePeaks with the -exclude option. This is because the histone marks H3K27ac and H3K4me1 can be present in both promoters and enhancers. By excluding the promoter regions, we can reduce the likelihood of identifying promoter-driven regulatory elements and focus on the distal enhancers that are more likely to have long-range regulatory effects.

  3. For each remaining distal enhancer region (overlapping peak), find the nearest gene(s) using tools like BEDTools closest or HOMER annotatePeaks.

  4. Analyze the expression patterns of these nearest genes in the RNA-seq data to determine if they are differentially expressed or show any interesting expression patterns that could be related to the putative enhancer activity.