Author Archives: gene_x

Yops (Yersinia outer proteins) analysis 2

http://xgenes.com/article/article-content/162/yersinia-outer-proteins-yops-analysis/

  1. extract all plasmids of the 50 isolates with plasmids but no yopK

    python3 extract_plasmids_from_gff.py ../prokka_plus/1045.gff  #reference
    for sample in SCPM-O-B-6291_C-25 KIM10+ 195P Nepal516 A1122 A1122_bis Nairobi IP31758 228 NW57 NW117 NW56 NW115 FORC_002 FORC_002_bis Gp200 NW116 Gp169 Y225 ATCC_BAA-2637 CFS1934 LC20 GTA 2011N-4075 ATCC_43970 NHV_3758 NVI-10705 NVI-1292 NVI-4570 NVI-6614 NVI-11267 NVI-11294 NVI-10571 NVI-8524 NVI-1176 NVI-701 17Y0412 17Y0414 NVI-492 NVI-9681 SC09 17Y0189 17Y0153 17Y0155 KMM821 16Y0180 NVI-5089 NVI-10587 NVI-4840 17Y0159; do
        python3 extract_plasmids_from_gff.py ../prokka_plus/${sample}.gff
    done
    
    grep "yop" *.gff3
    grep "Yop" *.gff3
    (yopH, yopO, yopE, yopT, yopM, yopD, yopB, yopN) and (YopH, YopJ, YopO, YopE, YopT, YopM, YopD, YopB, YopN, YopR) in 195P_NZ_CP019710
    
    # code of extract_plasmids_from_gff.py
    import sys
    import os
    from Bio import SeqIO
    from Bio.Alphabet import generic_dna
    
    if len(sys.argv) != 2:
        print("Usage: python script_name.py your_input.gff3")
        sys.exit(1)
    
    input_gff = sys.argv[1]
    base_filename = os.path.splitext(os.path.basename(input_gff))[0]
    
    # Split the GFF file into annotations and sequences
    with open(input_gff, 'r') as f:
        lines = f.readlines()
        fasta_start = lines.index("##FASTA\n")
        gff_lines = lines[:fasta_start]
        fasta_lines = lines[fasta_start + 1:]
    
    # Separate GFF content for each plasmid/chromosome
    gff_dict = {}
    for line in gff_lines:
        if not line.startswith("#"):
            record_id = line.split("\t")[0]
            if record_id not in gff_dict:
                gff_dict[record_id] = []
            gff_dict[record_id].append(line)
    
    # Write the sequences temporarily to a file
    with open("temp.fasta", 'w') as f:
        f.writelines(fasta_lines)
    
    # Read the sequences from the temporary file
    records = list(SeqIO.parse("temp.fasta", format="fasta"))
    
    for idx, rec in enumerate(records):
        # Skip the chromosome (the first record)
        if idx == 0:
            continue
    
        # Write GFF3
        with open(f"{base_filename}_{rec.id}.gff3", "w") as output_handle:
            output_handle.writelines(gff_dict.get(rec.id, []))
            output_handle.write("##FASTA\n")
            SeqIO.write(rec, output_handle, "fasta")
    
        ## Write GenBank (without annotations)
        #with open(f"plasmid_{rec.id}.gbk", "w") as output_handle:
        #    rec.seq.alphabet = generic_dna  # Add temporary alphabet
        #    SeqIO.write(rec, output_handle, "genbank")
    
        # Write FASTA
        with open(f"{base_filename}_{rec.id}.fasta", "w") as output_handle:
            SeqIO.write(rec, output_handle, "fasta")
  2. (optional) cluster all plasmids against reference using fastANI

    git clone https://github.com/ParBLiSS/FastANI.git
    cd FastANI
    ./bootstrap.sh
    ./configure
    make
    ~/Tools/FastANI/fastANI -q 1045_NZ_CP006795.1.fasta --rl plasmids.txt -o output_ani.txt

The output of FastANI is a tab-separated file, typically with four columns:

  • Query genome path: The path (or filename) of the query genome.

  • Reference genome path: The path (or filename) of the reference genome.

  • ANI percentage: The average nucleotide identity (ANI) value between the query and reference genomes. This is expressed as a percentage and represents the average nucleotide identity of orthologous gene pairs between the two genomes.

  • Orthologous fragment count: The number of orthologous fragments that were found and compared between the two genomes. FastANI breaks genomes into fixed-size fragments (default is 3kb) and then identifies orthologous fragments between genomes for ANI calculation. This column indicates the number of such orthologous fragment pairs used in the ANI calculation.

    1045_NZ_CP006795.1.fasta     ./SCPM-O-B-6291_C-25_NZ_CP045165.1.fasta   94.3585 1       23
  • Query: plasmid_NZ_CP006795.1.fasta

  • Reference: ./plasmid_NZ_CP045165.1.fasta

  • ANI: 94.3585%

  • Orthologous fragments: 1 out of 23

  • Here, the plasmid 1045_NZ_CP006795.1.fasta is being compared to SCPM-O-B-6291_C-25_NZ_CP045165.1.fasta. They have an ANI of approximately 94.36%. However, only 1 out of the 23 fragments in the query plasmid was found to be orthologous with the reference plasmid.

  1. The 50 isolates with plasmids but no yopK are as follows.

    SCPM-O-B-6291_C-25.gff  2   Yersinia pestis SCPM-O-B-6291 C-25  aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.MED   2   _plasmid    No  NA
    KIM10+  4   Yersinia pestis KIM10+  aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.MED   1   _plasmid    No  NA
    195P    19  Yersinia pestis 195P    aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.ANT   3   _plasmid    No  NA
    Nepal516    20  Yersinia pestis Nepal516    aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.ANT   2   _plasmid    No  NA
    A1122   24  Yersinia pestis A1122   aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ORI   2   _plasmid    No  NA
    A1122_bis   26  Yersinia pestis A1122 bis   aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ORI   2   _plasmid    No  NA
    Nairobi 43  Yersinia pestis Nairobi aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ANT   1   _plasmid    No  NA
    IP31758 82  Yersinia pseudotuberculosis IP31758 adk(1)  argA(2) aroA(1) glnA(6) thrA(8) tmk(3)  trpE(2) pseudotuberculosis  pseudotuberculosis  2   8       2   _plasmid    No  NA
    228 94  Yersinia similis 228    adk(5)  argA(4) aroA(12)    glnA(12)    thrA(15)    tmk(9)  trpE(9) similis similis 92          1   _plasmid    No  NA
    NW57    115 Yersinia enterocolitica NW57    adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW117   116 Yersinia enterocolitica NW117   adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW56    118 Yersinia enterocolitica NW56    adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW115   119 Yersinia enterocolitica NW115   adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    FORC_002    121 Yersinia enterocolitica FORC_002    adk(12) argA(19)    aroA(21)    glnA(22)    thrA(25)    tmk(24) trpE(19)    enterocolitica  enterocolitica  252 1Aa     1   _plasmid    No  NA
    FORC_002_bis    122 Yersinia enterocolitica FORC_002 bis    adk(12) argA(19)    aroA(21)    glnA(22)    thrA(25)    tmk(24) trpE(19)    enterocolitica  enterocolitica  252 1Aa     1   _plasmid    No  NA
    Gp200   129 Yersinia enterocolitica Gp200   adk(20) argA(21)    aroA(85)    glnA(32)    thrA(25)    tmk(~71)    trpE(19)    enterocolitica  enterocolitica      1Aa     1   _plasmid    No  NA
    NW116   130 Yersinia enterocolitica NW116   adk(86) argA(41)    aroA(31)    glnA(83)    thrA(31)    tmk(104)    trpE(16)    enterocolitica  enterocolitica  335 1Aa     1   _plasmid    No  NA
    Gp169   131 Yersinia enterocolitica Gp169   adk(86) argA(41)    aroA(31)    glnA(83)    thrA(31)    tmk(104)    trpE(16)    enterocolitica  enterocolitica  335 1Aa     1   _plasmid    No  NA
    Y225    134 Yersinia frederiksenii Y225 aarF(43)    dfp(41) galR(50)    glnS(47)    hemA(48)    rfaE(41)    speA(39)    frederiksenii   occitanica  83          1   _plasmid    No  NA
    ATCC_BAA-2637   137 Yersinia rochesterensis ATCC BAA-2637   aarF(43)    dfp(41) galR(50)    glnS(10)    hemA(58)    rfaE(41)    speA(39)    rochesterensis  occitanica  84          2   _plasmid    No  NA
    CFS1934 140 Yersinia hibernica CFS1934                              hibernica   hibernica               1   _plasmid    No  NA
    LC20    141 Yersinia hibernica LC20 adk(-)  argA(66)    aroA(-) glnA(68)    thrA(78)    tmk(85) trpE(76)    hibernica   hibernica               2   _plasmid    No  NA
    GTA 146 Yersinia massiliensis GTA   aarF(15)    dfp(~31)    galR(32)    glnS(15)    hemA(30)    rfaE(32)    speA(16)    massiliensis    massiliensis        2       2   _plasmid    No  NA
    2011N-4075  147 Yersinia massiliensis 2011N-4075    aarF(15)    dfp(~31)    galR(~32)   glnS(15)    hemA(30)    rfaE(32)    speA(16)    massiliensis    massiliensis        2       2   _plasmid    No  NA
    ATCC_43970  151 Yersinia bercovieri ATCC 43970  aarF(47)    dfp(45) galR(54)    glnS(61)    hemA(63)    rfaE(45)    speA(9) bercovieri  bercovieri  30          1   _plasmid    No  NA
    NHV_3758    163 Yersinia ruckeri NHV_3758   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-10705   164 Yersinia ruckeri NVI-10705  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-1292    165 Yersinia ruckeri NVI-1292   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-4570    166 Yersinia ruckeri NVI-4570   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             3   _plasmid    No  NA
    NVI-6614    167 Yersinia ruckeri NVI-6614   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             3   _plasmid    No  NA
    NVI-11267   168 Yersinia ruckeri NVI-11267  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-11294   169 Yersinia ruckeri NVI-11294  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-10571   170 Yersinia ruckeri NVI-10571  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-8524    171 Yersinia ruckeri NVI-8524   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-1176    172 Yersinia ruckeri NVI-1176   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-701 173 Yersinia ruckeri NVI-701    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    17Y0412 174 Yersinia ruckeri 17Y0412    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    17Y0414 175 Yersinia ruckeri 17Y0414    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-492 176 Yersinia ruckeri NVI-492    aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-9681    177 Yersinia ruckeri NVI-9681   adk(64) argA(67)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    SC09    178 Yersinia ruckeri SC09   aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri             2   _plasmid    No  NA
    17Y0189 180 Yersinia ruckeri 17Y0189    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    17Y0153 181 Yersinia ruckeri 17Y0153    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    17Y0155 182 Yersinia ruckeri 17Y0155    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    KMM821  183 Yersinia ruckeri KMM821 aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          2   _plasmid    No  NA
    16Y0180 184 Yersinia ruckeri 16Y0180    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    NVI-5089    189 Yersinia ruckeri NVI-5089   adk(75) argA(76)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-10587   190 Yersinia ruckeri NVI-10587  adk(75) argA(76)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-4840    191 Yersinia ruckeri NVI-4840   adk(75) argA(76)    aroA(72)    glnA(79)    thrA(90)    tmk(96) trpE(89)    ruckeri ruckeri             2   _plasmid    No  NA
    17Y0159 197 Yersinia ruckeri 17Y0159    aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    RfaE(14)    speA(14)    ruckeri ruckeri             3   _plasmid    No  NA

Should the inputs for GSVA be normalized or raw?

Gene Set Variation Analysis (GSVA) is a non-parametric and unsupervised method used for estimating the variation of gene set enrichment through the samples of a gene expression matrix. Given its nature and the underlying computations, there are specific recommendations for input data preprocessing.

  • Normalization: Gene expression data should generally be normalized before applying GSVA. This ensures that the different scales and potential batch effects from different experiments or different runs are accounted for. There are several normalization methods available depending on the type of gene expression data. For RNA-seq data, methods like TMM (Trimmed Mean of M-values) or RLE (Relative Log Expression) are popular. For microarray data, quantile normalization is commonly used.

  • Log Transformation: It is generally recommended to log-transform gene expression data before using GSVA. The rationale is similar to the reasons for normalizing the data: taking the logarithm compresses the range of expression values, making highly expressed genes and lowly expressed genes more comparable in scale. Typically, for RNA-seq count data, one might use a transformation like log2(CPM+1) or log2(FPKM/TPM + 1). The “+1” is added to handle zero counts.

However, the best practices can vary based on the specifics of the dataset and the research question. It’s crucial to consult the GSVA documentation, relevant literature, and potentially perform some exploratory analysis on your dataset to determine the best preprocessing steps.

Lastly, always remember to ensure that the gene identifiers in your expression dataset match those in the gene sets you are using for the enrichment analysis. This often requires additional preprocessing steps to map between different types of gene identifiers (e.g., gene symbols, Entrez IDs, Ensembl IDs).

https://bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html

Input arguments of gsva(): There are four classes of parameter objects corresponding to the methods listed above, and may have different additional parameters to tune, but all of them require at least the following two input arguments:

  • A normalized gene expression dataset, which can be provided in one of the following containers:
    • A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    • An ExpressionSet object; see package Biobase.
    • A SummarizedExperiment object, see package SummarizedExperiment.
  • A collection of gene sets; which can be provided in one of the following containers:
    • A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    • A GeneSetCollection object; see package GSEABase.

In the context of gene expression data or other biological datasets, the term “non-log space” usually refers to the original, untransformed measurements or values. This is in contrast to “log space,” where values have been transformed using a logarithm, typically the natural logarithm or the base-2 logarithm.

Here’s a bit more on why and when these terms are used:

  • Log Transformation: For various types of data, including gene expression data, a logarithm transformation is often applied. One reason to do this is to stabilize variance or to make the data distribution more normal-like, especially when measurements can span several orders of magnitude. For example, gene expression data from microarray or RNA-seq experiments might have a long-tailed distribution, and taking the logarithm can help in compressing extreme values.

  • Non-log Space: When we refer to values in “non-log space,” we’re talking about the original measurements, before any logarithm transformation. For gene expression, these might be raw read counts, FPKM values (for RNA-seq data), or probe intensities (for microarrays).

  • Back Transformation: Sometimes, after performing computations or analyses in the log space, one might want to transform the results back to the original scale. This “back transformation” involves taking the antilogarithm (exponential) of the log-transformed values.

For example, consider the average expression of a gene. If you take the average of log-transformed expression values and then back-transform this average by taking the exponential, you won’t get the same result as taking the average of the original, non-log-transformed values. This is because the logarithm is a non-linear transformation.

In summary, “non-log space” refers to data that hasn’t been log-transformed, and it represents the original scale of the measurements.

GSVA calculation for RNA-seq data

#------------------------------
#---- 1. prepare gene sets ----

library(org.Hs.eg.db)
library(dplyr)
library(AnnotationDbi)

#1. Platelets
data <- data.frame(
  geneSymbol = c("GP1BA","GP5","GP6","GP9","LY6G6D","MMRN1","PEAR1","PF4","PF4V1","PPBP","SLC35D3"),
  geneEntrezID = c("2811","2814","51206","2815","58530","22915","375033","5196","5197","5473","340146"),
  GeneSet = rep("Platelets", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Platelets.csv", row.names = FALSE)

#2. Granulocytes
data <- data.frame(
  geneSymbol = c("CXCR2","CD177","CLC","CTSS","DEFA1","FUT7","LTB4R","MMP25","OSM","RETN"),
  geneEntrezID = c("3579","57126","1178","1520","1667","2529","1241","64386","5008","56729"),
  GeneSet = rep("Granulocytes", 10)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Granulocytes.csv", row.names = FALSE)

#3. LDG
data <- data.frame(
  geneSymbol = c("AZU1","CAMP","CEACAM6","CEACAM8","CTSG","DEFA4","ELANE","LCN2","LTF","MPO","OLFM4","RNASE3"),
  geneEntrezID = c("566","820","4680","1088","1511","1669","1991","3934","4057","4353","10562","6037"),
  GeneSet = rep("LDG", 12)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "LDG.csv", row.names = FALSE)

#4. pDC
data <- data.frame(
  geneSymbol = c("CLEC4C","NRP1"),
  geneEntrezID = c("170482","8829"),
  GeneSet = rep("pDC", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "pDC.csv", row.names = FALSE)

#5. Anti-inflammation
data <- data.frame(
  geneSymbol = c("IL1RN","TNFAIP3","SOCS3"),
  geneEntrezID = c("3557","7128","9021"),
  GeneSet = rep("Anti-inflammation", 3)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Anti-inflammation.csv", row.names = FALSE)

#6. Pro-inflam. IL-1
data <- data.frame(
  geneSymbol = c("IL1A","IL1B","IL18","IL36A","IL36B","IL36G","IL33","IL1R1","IL1RAP","IL18R1","IL18RAP"),
  geneEntrezID = c("3552","3553","3606","27179","27177","56300","90865","3554","3556","8809","8807"),
  GeneSet = rep("Pro-inflam. IL-1", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Pro-inflam._IL-1.csv", row.names = FALSE)

#7. Dendritic cells
data <- data.frame(
  geneSymbol = c("CLEC12A","CLEC10A","CLEC9A","CSF1R","IGIP","LILRA4","LY75","XCR1"),
  geneEntrezID = c("160364","10462","283420","1436","492311","23547","4065","2829"),
  GeneSet = rep("Dendritic cells", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Dendritic_cells.csv", row.names = FALSE)

#8. MHC II
data <- data.frame(
  geneSymbol = c("HLA-DMA","HLA-DMB","HLA-DPA1","HLA-DPB1","HLA-DPB2","HLA-DQA1","HLA-DQA2","HLA-DQB1","HLA-DQB2","HLA-DRA","HLA-DRB1","HLA-DRB3","HLA-DRB4","HLA-DRB5","HLA-DRB6"),
  geneEntrezID = c("3108","3109","3113","3115","3116","3117","3118","3119","3120","3122","3123","3125","3126","3127","3128"),
  GeneSet = rep("MHC II", 15)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "MHC_II.csv", row.names = FALSE)

#9. Alt. complement
data <- data.frame(
  geneSymbol = c("C3","C5","C6","C7","C8A","C9","CFB","CFD","CFH","CFHR5","CFP","CR1","GZMM","MIR520B","MIR520E","VSIG4"),
  geneEntrezID = c("718","727","729","730","731","735","629","1675","3075","81494","5199","1378","3004","574473","574461","11326"),
  GeneSet = rep("Alt. complement", 16)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Alt._complement.csv", row.names = FALSE)

#10. TNF
data <- data.frame(
  geneSymbol = c("BRCA1","CASP1","CXCL1","CXCL2","GBP1","GP1BA","IFI44","IL18","IL1B","IL1RN","MX1","NAMPT","NR3C1","OAS3","PATJ","SH3BP5","TAP2","TNF","TNFAIP3","TNFRSF11A","WT1","ACLY","ACSL1","ADGRE2","AK3","AKAP10","AMPD3","APOL3","ARID3A","ARSE","ASAP1","B4GALT5","BCL2A1","BHLHE41","BHMT","BIRC3","CALD1","CASP10","CCL15","CCL20","CCL23","CCL3L1","CD37","CD38","CD83","CDKN3","CKB","CR2","CTNND2","CXCL3","CXCL8","CYP27B1","DAB2","EBI3","EGR1","EGR2","EPB41","EREG","ETAA1","F3","FABP1","FBXL2","FCER2","FCGR2A","FLJ11129","FLNA","G0S2","GCH1","GJB2","GLS","GMIP","GRK3","HCAR3","HHEX","HOMER2","HP","ICAM1","IDO1","IKBKG","IL16","IL1A","IL6","INHBA","INSIG1","ITGA6","KITLG","KLF1","KMO","LGALS3BP","MAP3K4","MARCKS","MGLL","MMP19","MN1","MRPS15","MSC","MTF1","NELL2","NFKB1","NFKB2","NFKBIA","NFKBIZ","NKX3-2","PDE4DIP","PDPN","PIAS4","PLAUR","PTGES","PTGS2","RELB","RPGR","RPS9","SDC4","SERPIND1","SFRP1","SLAMF1","SLC30A4","SOD2","SPI1","SSPN","STAT4","TAF15","TBX3","TFF1","TNFAIP2","TRAF1","TSC22D1","TYROBP","UBE2C","VEGFA"),
  geneEntrezID = c("672","834","2919","2920","2633","2811","10561","3606","3553","3557","4599","10135","2908","4940","10207","9467","6891","7124","7128","8792","7490","47","2180","30817","50808","11216","272","80833","1820","415","50807","9334","597","79365","635","330","800","843","6359","6364","6368","6349","951","952","9308","1033","1152","1380","1501","2921","3576","1594","1601","10148","1958","1959","2035","2069","54465","2152","2168","25827","2208","2212","54674","2316","50486","2643","2706","2744","51291","157","8843","3087","9455","3240","3383","3620","8517","3603","3552","3569","3624","3638","3655","4254","10661","8564","3959","4216","4082","11343","4327","4330","64960","9242","4520","4753","4790","4791","4792","64332","579","9659","10630","51588","5329","9536","5743","5971","6103","6203","6385","3053","6422","6504","7782","6648","6688","8082","6775","8148","6926","7031","7127","7185","8848","7305","11065","7422"),
  GeneSet = rep("TNF", 130)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TNF.csv", row.names = FALSE)

#11. NLRP3 inflammasome
data <- data.frame(
  geneSymbol = c("APP","ATAT1","CARD8","CASP1","CASP4","CD36","CPTP","DHX33","EIF2AK2","GBP5","GSDMD","HSP90AB1","MEFV","NFKB1","NFKB2","NLRC3","NLRP3","P2RX7","PANX1","PSTPIP1","PYCARD","PYDC2","RELA","SIRT2","SUGT1","TLR4","TLR6","TXN","TXNIP","USP50"),
  geneEntrezID = c("351","79969","22900","834","837","948","80772","56919","5610","115362","79792","3326","4210","4790","4791","197358","114548","5027","24145","9051","29108","152138","5970","22933","10910","7099","10333","7295","10628","373509"),
  GeneSet = rep("NLRP3 inflammasome", 30)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "NLRP3_inflammasome.csv", row.names = FALSE)

#12. Unfolded protein
data <- data.frame(
  geneSymbol = c("B4GALT3","CALR","CALU","CANX","CDS2","CHST12","CHST2","DERL1","DERL2","DNAJC3","EDEM2","EDEM3","EMC9","ERAP1","ERGIC2","ERO1L","EXT1","GALNT2","GOLT1B","HERPUD1","HYOU1","IER3IP1","IMPAD1","KDELC1","KDELR2","LMAN2","LPGAT1","MAN1A1","MANEA","MANF","NUCB2","PDIA4","PDIA6","PIGK","PPIB","SEC24D","SEC61G","SPCS3","SSR1","SSR3","TRAM1","TRAM2","UGGT1","XBP1"),
  geneEntrezID = c("8703","811","813","821","8760","55501","9435","79139","51009","5611","55741","80267","51016","51752","51290","30001","2131","2590","51026","9709","10525","51124","54928","79070","11014","10960","9926","4121","79694","7873","4925","9601","10130","10026","5479","9871","23480","60559","6745","6747","23471","9697","56886","7494"),
  GeneSet = rep("Unfolded protein", 44)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Unfolded_protein.csv", row.names = FALSE)

#13. B cells
data <- data.frame(
  geneSymbol = c("IGHD","SH3BP5","BANK1","BLK","BLNK","CD19","CD22","CD79A","CD79B","DAPP1","FCRL1","FCRL2","FCRL3","FCRLA","GON4L","GPR183","IGHM","KLHL6","MS4A1","PAX5","PLCL2","VPREB1","ZNF318"),
  geneEntrezID = c("3495","9467","55024","640","29760","930","933","973","974","27071","115350","79368","115352","84824","54856","1880","3507","89857","931","5079","23228","7441","24149"),
  GeneSet = rep("B cells", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "B_cells.csv", row.names = FALSE)

#14. Monocyte cell surface
data <- data.frame(
  geneSymbol = c("CD14","CD300C","CD33","CD68","CLEC12A","CLEC4D","CLEC4E","FCGR1A","FCGR1B","FCGR3B","LILRA5","LILRA6","LILRB2","LILRB3","OSCAR","SEMA4A","SIGLEC1"),
  geneEntrezID = c("929","10871","945","968","160364","338339","26253","2209","2210","2215","353514","79168","10288","11025","126014","64218","6614"),
  GeneSet = rep("Monocyte cell surface", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocyte_cell_surface.csv", row.names = FALSE)

#15. Inflammasome
data <- data.frame(
  geneSymbol = c("CASP1","AIM2","CASP5","CTSB","GSDMB","GSDMD","NAIP","NEK7","NLRC4","NLRP1","NLRP3","NOD2","P2RX7","PANX1","PYCARD","RIPK1"),
  geneEntrezID = c("834","9447","838","1508","55876","79792","4671","140609","58484","22861","114548","64127","5027","24145","29108","8737"),
  GeneSet = rep("Inflammasome", 16)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Inflammasome.csv", row.names = FALSE)

#16. Monocytes_40
data <- data.frame(
  geneSymbol = c("C1QA","C1QB","C1QC","C1RL","CCL2","CCL8","CD14","CD300C","CD33","CD68","CLEC12A","CLEC4D","CLEC4E","CXCL1","CXCL2","CXCR2","FCGR1A","FCGR1B","FCGR3B","GRN","IK","IL18RAP","IL1B","IL1RN","LILRA5","LILRA6","LILRB2","LILRB3","OSCAR","S100A8","SEMA4A","SIGLEC1","THBD","BST1","C1R","CD163","CSF2","FUT4","MNDA","MRC1"),
  geneEntrezID = c("712","713","714","51279","6347","6355","929","10871","945","968","160364","338339","26253","2919","2920","3579","2209","2210","2215","2896","3550","8807","3553","3557","353514","79168","10288","11025","126014","6279","64218","6614","7056","683","715","9332","1437","2526","4332","4360"),
  GeneSet = rep("Monocytes_40", 40)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocytes_40.csv", row.names = FALSE)

#17. Monocyte secreted
data <- data.frame(
  geneSymbol = c("C1QA","C1QB","C1QC","C1RL","CCL2","CCL8","CXCL1","CXCL2","GRN","IK","IL18RAP","IL1B","IL1RN","S100A8","THBD","TNF","CXCL10"),
  geneEntrezID = c("712","713","714","51279","6347","6355","2919","2920","2896","3550","8807","3553","3557","6279","7056","7124","3627"),
  GeneSet = rep("Monocyte secreted", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocyte_secreted.csv", row.names = FALSE)

#18. IL-1 cytokines
data <- data.frame(
  geneSymbol = c("IL18","IL1B"),
  geneEntrezID = c("3606","3553"),
  GeneSet = rep("IL-1 cytokines", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IL-1_cytokines.csv", row.names = FALSE)

#19. SNOR low UP
data <- data.frame(
  geneSymbol = c("FCGR1A","CEACAM1","LGALS1","SNORD24","SNORD44","SNORD47","SNORD80"),
  geneEntrezID = c("2209","634","3956","26820","26806","26802","26774"),
  GeneSet = rep("SNOR low UP", 7)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "SNOR_low_UP.csv", row.names = FALSE)

#20. CD40 activated
data <- data.frame(
  geneSymbol = c("BUB1B","CCL22","CD58","DBI","DUSP4","FABP5","FDPS","FEN1","GAPDH","GARS","GRHPR","H2AFX","H2AFZ","HMGB2","HMMR","HPRT1","IMPDH2","LDHA","LDHB","LMNB1","LPXN","MCM2","MCM3","MCM6","MSH6","MTHFD2","MYBL2","NDC80","NME1","PAICS","PCNA","PKM","PRDX3","RFTN1","RGS10","SLAMF1","TK1","TOP2A","TPX2","TRAF1","TUBA1B","TXN","TYMS","UBE2S","VDAC1","WEE1","WSB2","ZWINT"),
  geneEntrezID = c("701","6367","965","1622","1846","2171","2224","2237","2597","2617","9380","3014","3015","3148","3161","3251","3615","3939","3945","4001","9404","4171","4172","4175","2956","10797","4605","10403","4830","10606","5111","5315","10935","23180","6001","6504","7083","7153","22974","7185","10376","7295","7298","27338","7416","7465","55884","11130"),
  GeneSet = rep("CD40 activated", 48)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "CD40_activated.csv", row.names = FALSE)

#21. Lectin complement
data <- data.frame(
  geneSymbol = c("A2M","C2","C3","C4A","C4B","C4B_2","C5","C6","C7","C8A","C9","COLEC10","COLEC11","FCN1","FCN2","FCN3","KRT1","MASP1","MASP2","MBL2","SERPING1"),
  geneEntrezID = c("2","717","718","720","721","100293534","727","729","730","731","735","10584","78989","2219","2220","8547","3848","5648","10747","4153","710"),
  GeneSet = rep("Lectin complement", 21)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Lectin_complement.csv", row.names = FALSE)

#22. Classical complement
data <- data.frame(
  geneSymbol = c("C1QA","CIQB","C1QC","C1R","C1S","C2","C3","C4A","C4B","C4B_2","C5","C6","C7","C8A","C9"),
  geneEntrezID = c("712","713","714","715","716","717","718","720","721","100293534","727","729","730","731","735"),
  GeneSet = rep("Classical complement", 15)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Classical_complement.csv", row.names = FALSE)

#23. Cell cycle
data <- data.frame(
  geneSymbol = c("BRCA1","ASPM","AURKA","AURKB","CCNB1","CCNB2","CCNE1","CDC20","CENPM","CEP55","E2F3","GINS2","MCM10","MCM2","MKI67","NCAPG","NDC80","PTTG1","TYMS"),
  geneEntrezID = c("672","259266","6790","9212","891","9133","898","991","79019","55165","1871","51659","55388","4171","4288","64151","10403","9232","7298"),
  GeneSet = rep("Cell cycle", 19)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Cell_cycle.csv", row.names = FALSE)

#24. Plasma cells
data <- data.frame(
  geneSymbol = c("IGHV4-28","IGHV4-34","IGHD","IGHG1","IGHV2-5","IGHV3-20","IGHV3-23","IGKC","IGLV1-40","IGLV3-1","IGLV3-19","IGLV3-25","IGLV4-3","IGLV4-60","IGLV5-45","IGLV6-57","IGLVI-70","C19orf10","IGH","IGHMBP2","IGHV4-31","IGK","IGL","IGLJ3","IGLL1","IGLV@","IGLV1-44","IGLV2-14","IGLV2-5","MZB1","PRDM1","SDC1","THEMIS2","TNFRSF17"),
  geneEntrezID = c("28400","28395","3495","3500","28457","28445","28442","3514","28825","28809","28797","28793","28786","28785","28781","28778","28763","56005","3492","3508","28396","50802","3535","28831","3543","3546","28823","28815","28818","51237","639","6382","9473","608"),
  GeneSet = rep("Plasma cells", 34)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Plasma_cells.csv", row.names = FALSE)

#25. IG chains
data <- data.frame(
  geneSymbol = c("IGHG1","IGHV2-5","IGHV3-20","IGHV3-23","IGHV4-28","IGHV4-34","IGKC","IGLV1-40","IGLV3-1","IGLV3-19","IGLV3-25","IGLV4-3","IGLV4-60","IGLV5-45","IGLV6-57","IGLVI-70","IGHA1","IGHA2","IGHD2-15","IGHD2-2","IGHD2-21","IGHD3-10","IGHD3-16","IGHD3-3","IGHD3-9","IGHG2","IGHG3","IGHG4","IGHJ1","IGHJ2","IGHJ3","IGHJ4","IGHJ5","IGHJ6","IGHV1-18","IGHV1-2","IGHV1-24","IGHV1-3","IGHV1-45","IGHV1-46","IGHV1-58","IGHV1-69-2","IGHV2-26","IGHV2-70","IGHV3-13","IGHV3-15","IGHV3-21","IGHV3-33","IGHV3-43","IGHV3-48","IGHV3-49","IGHV3-53","IGHV3-62","IGHV3-64","IGHV3-7","IGHV3-72","IGHV3-73","IGHV3-74","IGHV4-30-2","IGHV4-39","IGHV4-59","IGHV4-61","IGHV5-51","IGHV6-1","IGHV7-81","IGKJ1","IGKJ2","IGKJ3","IGKJ4","IGKJ5","IGKV1-16","IGKV1-17","IGKV1-27","IGKV1-5","IGKV1-6","IGKV1-9","IGKV1D-16","IGKV1D-17","IGKV1D-43","IGKV1D-8","IGKV2-24","IGKV2D-26","IGKV2D-29","IGKV2D-30","IGKV3-20","IGKV3D-20","IGKV3D-7","IGKV4-1","IGKV5-2","IGLC2","IGLC7","IGLJ6","IGLV10-54","IGLV1-36","IGLV1-47","IGLV2-11","IGLV2-18","IGLV2-23","IGLV2-33","IGLV2-8","IGLV3-10","IGLV3-12","IGLV3-16","IGLV3-21","IGLV3-27","IGLV3-32","IGLV4-69","IGLV5-37","IGLV7-43","IGLV8-61","IGLV9-49"),
  geneEntrezID = c("3500","28457","28445","28442","28400","28395","3514","28825","28809","28797","28793","28786","28785","28781","28778","28763","3493","3494","28503","28505","28502","28499","28498","28501","28500","3501","3502","3503","28483","28481","28479","28477","28476","28475","28468","28474","28467","28473","28466","28465","28464","28458","28455","28454","28449","28448","28444","28434","28426","28424","28423","28420","28416","28414","28452","28410","28409","28408","28398","28394","28392","28391","28388","28385","28378","28950","28949","28948","28947","28946","28938","28937","28935","28299","28943","28941","28901","28900","28891","28904","28923","28884","28882","28881","28912","28874","28877","28908","28907","3538","28834","28828","28772","28826","28822","28816","28814","28813","28811","28817","28803","28802","28799","28796","28791","28787","28784","28783","28776","28774","28773"),
  GeneSet = rep("IG chains", 111)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IG_chains.csv", row.names = FALSE)

#26. Erythrocytes
data <- data.frame(
  geneSymbol = c("EPO","GFI1B","GYPA","GYPB","GYPE","ICAM4","NFE2","SLC4A1","TRIM10","TSPO2","ZNF367"),
  geneEntrezID = c("2056","8328","2993","2994","2996","3386","4778","6521","10107","222642","195828"),
  GeneSet = rep("Erythrocytes", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Erythrocytes.csv", row.names = FALSE)

#27. IL-6R complex
data <- data.frame(
  geneSymbol = c("IL6","IL6R","IL6ST"),
  geneEntrezID = c("3569","3570","3572"),
  GeneSet = rep("IL-6R complex", 3)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IL-6R_complex.csv", row.names = FALSE)

#28. IFN
data <- data.frame(
  geneSymbol = c("EIF2AK2","GBP1","GBP2","GBP4","HERC5","HERC6","IFI27","IFI30","IFI35","IFI44","IFI44L","IFI6","IFIT1","IFIT2","IFIT3","IFIT5","IFITM1","IFITM2","IFITM3","ISG15","ISG20","MX1","MX2","OAS1","OAS2","OAS3","OASL","RSAD2","SAMD9","SAMD9L","SP100","SP110"),
  geneEntrezID = c("5610","2633","2634","115361","51191","55008","3429","10437","3430","10561","10964","2537","3434","3433","3437","24138","8519","10581","10410","9636","3669","4599","4600","4938","4939","4940","8638","91543","54809","219285","6672","3431"),
  GeneSet = rep("IFN", 32)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IFN.csv", row.names = FALSE)

#29. TCRB
data <- data.frame(
  geneSymbol = c("TRBC2","TRBJ2-1","TRBJ2-2","TRBJ2-2P","TRBJ2-3","TRBJ2-4","TRBJ2-5","TRBJ2-6","TRBJ2-7","TRBV1","TRBV10-1","TRBV10-2","TRBV11-1","TRBV11-2","TRBV19","TRBV2","TRBV20-1","TRBV21-1","TRBV23-1","TRBV24-1","TRBV25-1","TRBV27","TRBV28","TRBV3-1","TRBV4-1","TRBV4-2","TRBV5-1","TRBV5-3","TRBV5-4","TRBV5-5","TRBV5-6","TRBV5-7","TRBV6-1","TRBV6-4","TRBV6-5","TRBV6-6","TRBV6-7","TRBV6-8","TRBV7-1","TRBV7-3","TRBV7-4","TRBV7-5","TRBV7-6","TRBV7-7","TRBV9"),
  geneEntrezID = c("28638","28629","28628","28627","28626","28625","28624","28623","28622","28621","28585","28584","28582","28581","28568","28620","28567","28566","28564","28563","28562","28560","28559","28619","28617","28616","28614","28612","28611","28610","28609","28608","28606","28603","28602","28601","28600","28599","28597","28595","28594","28593","28592","28591","28586"),
  GeneSet = rep("TCRB", 45)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRB.csv", row.names = FALSE)

#30. TCRA
data <- data.frame(
  geneSymbol = c("TRAV10","TRAV1-1","TRAV1-2","TRAV12-1","TRAV12-2","TRAV12-3","TRAV13-1","TRAV13-2","TRAV14DV4","TRAV16","TRAV17","TRAV18","TRAV19","TRAV2","TRAV20","TRAV21","TRAV22","TRAV23DV6","TRAV24","TRAV25","TRAV26-1","TRAV26-2","TRAV27","TRAV29DV5","TRAV3","TRAV30","TRAV34","TRAV35","TRAV36DV7","TRAV38-1","TRAV38-2DV8","TRAV39","TRAV4","TRAV40","TRAV41","TRAV5","TRAV7","TRAV8-1","TRAV8-2","TRAV8-3","TRAV8-4","TRAV8-6","TRAV8-7","TRAV9-1","TRAV9-2"),
  geneEntrezID = c("28676","28693","28692","28674","28673","28672","28671","28670","28669","28667","28666","28665","28664","28691","28663","28662","28661","28660","28659","28658","28657","28656","28655","28653","28690","28652","28648","28647","28646","28644","28643","28642","28689","28641","28640","28688","28686","28685","28684","28683","28682","28680","28679","28678","28677"),
  GeneSet = rep("TCRA", 45)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRA.csv", row.names = FALSE)

#31. Cyt. act. T cells
data <- data.frame(
  geneSymbol = c("GZMB","TAGAP","CD69","EOMES","GZMH","IFNG","IL2RB","PRF1","SGK1","TBX21","TFRC","ZNF683"),
  geneEntrezID = c("3002","117289","969","8320","2999","3458","3560","5551","6446","30009","7037","257101"),
  GeneSet = rep("Cyt. act. T cells", 12)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Cyt._act._T_cells.csv", row.names = FALSE)

#32. TCRG
data <- data.frame(
  geneSymbol = c("TRGC2","TRGV1","TRGV10","TRGV11","TRGV2","TRGV3","TRGV4","TRGV5","TRGV7","TRGV8","TRGV9"),
  geneEntrezID = c("6967","6973","6984","6985","6974","6976","6977","6978","6981","6982","6983"),
  GeneSet = rep("TCRG", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRG.csv", row.names = FALSE)

#33. T cells
data <- data.frame(
  geneSymbol = c("CD8A","CD8B","TRDC","CCR3","CD226","CD247","CD28","CD3D","CD3E","CD3G","CD4","CD5","ETS1","GATA3","GRAP2","LEF1","SH2D1A","TRAC","TRBC1"),
  geneEntrezID = c("925","926","28526","1232","10666","919","940","915","916","917","920","921","2113","2625","9402","51176","4068","28755","28639"),
  GeneSet = rep("T cells", 19)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_cells.csv", row.names = FALSE)

#34. CD8T-NK-NKT
data <- data.frame(
  geneSymbol = c("CD8A","CD8B","GZMB","NKTR","CD2","CD7","CRTAM","GNLY","GZMA","GZMK","GZMM","HCST","KIR2DL3","KIR3DL1","KIR3DL2","KLRB1","KLRC3","KLRC4","KLRD1","NKG7","RASAL3","TIA1","TXK"),
  geneEntrezID = c("925","926","3002","4820","914","924","56253","10578","3001","3003","3004","10870","3804","3811","3812","3820","3823","8302","3824","4818","64926","7072","7294"),
  GeneSet = rep("CD8T-NK-NKT", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "CD8T-NK-NKT.csv", row.names = FALSE)

#35. Anergic or act. T cells
data <- data.frame(
  geneSymbol = c("CD160","CD244","CTLA4","HAVCR2","ICOS","KLRG1","LAG3","PDCD1"),
  geneEntrezID = c("11126","51744","1493","84868","29851","10219","3902","5133"),
  GeneSet = rep("Anergic or act. T cells", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Anergic_or_act._T_cells.csv", row.names = FALSE)

#36. T activated
data <- data.frame(
  geneSymbol = c("FASLG","TAGAP","CD40LG","CREM","IL17A","IL17F","IL23R","JAKMIP1","KCNA3","KCNN4","P2RX5","PRKCQ","RELT","RNF125","SATB1","TNFRSF4","TNFRSF9"),
  geneEntrezID = c("356","117289","959","1390","3605","112744","149233","152789","3738","3783","5026","5588","84957","54941","6304","7293","3604"),
  GeneSet = rep("T activated", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_activated.csv", row.names = FALSE)

#37. NK cells
data <- data.frame(
  geneSymbol = c("KLRF1","NCAM1","NCR1","NCR3","SH2D1B"),
  geneEntrezID = c("51348","4684","9437","259197","117157"),
  GeneSet = rep("NK cells", 5)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "NK_cells.csv", row.names = FALSE)

#38. TCRD
data <- data.frame(
  geneSymbol = c("TRDC","TRDJ1","TRDJ2","TRDJ3","TRDJ4","TRDV1","TRDV2","TRDV3"),
  geneEntrezID = c("28526","28522","28521","28520","28519","28518","28517","28516"),
  GeneSet = rep("TCRD", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRD.csv", row.names = FALSE)

#39. T regs
data <- data.frame(
  geneSymbol = c("FOXP3","IKZF2"),
  geneEntrezID = c("50943","22807"),
  GeneSet = rep("T regs", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_regs.csv", row.names = FALSE)

#40. SNOR low DOWN
data <- data.frame(
  geneSymbol = c("RNU4ATAC","SNORA11","SNORA16A","SNORA23","SNORA38B","SNORA73A","SNORA73B"),
  geneEntrezID = c("100151683","677799","692073","677808","100124536","6080","26768"),
  GeneSet = rep("SNOR low DOWN", 7)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "SNOR_low_DOWN.csv", row.names = FALSE)

#--

#41. Monocytes
data <- data.frame(
  geneSymbol = c("ACE","ADAM8","ADAMDEC1","ADGRE1","ADGRE2","APOBEC3B","APOBEC3G","APOBR","ART4","C1QA","C1QC","C2","C4A","C4B","C4BPA","C4BPB","C5","C6","C8A","C9","CCL17","CCL18","CCL22","CCL28","CCL7","CCL8","CD14","CD163","CD209","CD300C","CD300E","CD5L","CD80","CFB","CFD","CFP","CHI3L1","CHIT1","CLEC12A","CLEC12B","CLEC4D","CLEC4E","CLEC5A","CLEC7A","CLEC9A","CSF1R","CXCL1","CXCL10","CXCL11","CXCL13","CXCL8","CXCL9","CYBB","F12","FCGR3B","FFAR2","HMMR","HVCN1","IGSF6","IL10RA","IL12A","IL12B","IL1RAP","IL20","IL23A","IL27","IL31RA","LAMP3","LGALS12","LGALS4","LGALS9","LGALS9B","LGALS9C","LILRA2","LILRA5","LILRA6","LILRB5","LMNB1","LY86","LYZ","MARCO","MEGF10","MERTK","MFGE8","MPEG1","MRC1","MS4A4A","MSR1","NTSR1","OCM","OLR1","OSCAR","OTOF","PDCD1LG2","PILRA","PLA2G2D","PLA2G5","PYHIN1","S100A8","S100A9","S1PR5","SCARB1","SCARF2","SECTM1","SEMA4A","SERPINB9","SERPING1","SIGLEC1","SIGLEC14","SIGLEC5","SIGLEC7","SLC11A1","SLITRK4","SMPDL3B","SPIC","STAB2","STAP2","TEK","TGM2","TIMD4","TLR2","TLR8","TNF","TNFAIP8L2","TNFRSF1B","TNIP3","TULP1","UBD","VENTX","VSTM1"),
  geneEntrezID = c("1636","101","27299","2015","30817","9582","60489","55911","420","712","714","717","720","721","722","725","727","729","731","735","6361","6362","6367","56477","6354","6355","929","9332","30835","10871","342510","922","941","629","1675","5199","1116","1118","160364","387837","338339","26253","23601","64581","283420","1436","2919","3627","6373","10563","3576","4283","1536","2161","2215","2867","3161","84329","10261","3587","3592","3593","3556","50604","51561","246778","133396","27074","85329","3960","3965","284194","654346","11027","353514","79168","10990","4001","9450","4069","8685","84466","10461","4240","219972","4360","51338","4481","4923","654231","4973","126014","9381","80380","29992","26279","5322","149628","6279","6280","53637","949","91179","6398","64218","5272","710","6614","100049587","8778","27036","6556","139065","27293","121599","55576","55620","7010","7052","91937","7097","51311","7124","79626","7133","79931","7287","10537","27287","284415"),
  GeneSet = rep("Monocytes", 130)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocytes.csv", row.names = FALSE)

#42. Myeloid cells
data <- data.frame(
  geneSymbol = c("ADGRE3","AIF1","AOAH","APOC1","BMX","C1QB","CD101","CD300LF","CD33","CD68","CLEC10A","CLEC16A","CLEC1A","CLEC2B","CLEC4A","CLEC4G","CLEC6A","CSF2RA","CSF2RB","CSF3R","CST3","CYBA","FCER1A","FCER1G","FCGR1A","FCGR1B","FCGR2A","FCGR2B","FCGR2C","FCGR3A","FGR","FLT3","FOLR2","IFI30","IFNL1","IFNL2","IFNL3","IL18","IL1A","IL1B","IL1F10","IL1RN","IL26","IL37","ITGAX","LILRA1","LILRA3","LILRB1","LILRB2","LILRB3","LILRB4","LYVE1","MEFV","MNDA","MS4A6A","MS4A7","NLRP12","NLRP3","NOD2","PECAM1","PLEK","PRAM1","RNASE3","RNASE6","RNASE7","SIGLEC10","SKAP2","SLAMF8","SLPI","SPI1","TNFSF13B","TNFSF14","TREM1","TREML2","TREML4","TWIST1","TYROBP","UNC93B1","VSIG4"),
  geneEntrezID = c("84658","199","313","341","660","713","9398","146722","945","968","10462","23274","51267","9976","50856","339390","93978","1438","1439","1441","1471","1535","2205","2207","2209","2210","2212","2213","9103","2214","2268","2322","2350","10437","282618","282616","282617","3606","3552","3553","84639","3557","55801","27178","3687","11024","11026","10859","10288","11025","11006","10894","4210","4332","64231","58475","91662","114548","64127","5175","5341","84106","6037","6039","84659","89790","8935","56833","6590","6688","10673","8740","54210","79865","285852","7291","7305","81622","11326"),
  GeneSet = rep("Myeloid cells", 79)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Myeloid_cells.csv", row.names = FALSE)

#43. Neutrophils
data <- data.frame(
  geneSymbol = c("ABHD3","ARG1","BPI","CAMP","CD177","CD83","CRISP3","DEFA1","DEFA1B","DEFA3","DEFB103A","DEFB103B","DEFB106B","DEFB136","DEFB4A","FUT4","LY6E","MMP8","OLFM4","OR1J2","PGLYRP1","PRTN3","S100A6"),
  geneEntrezID = c("171586","383","671","820","57126","9308","10321","1667","728358","1668","414325","55894","503841","613210","1673","2526","4061","4317","10562","26740","8993","5657","6277"),
  GeneSet = rep("Neutrophils", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Neutrophils.csv", row.names = FALSE)

    #44. Neutrophils_
    data <- data.frame(
     geneSymbol = c("EMR2","C5AR2","PLEKHG3","DPEP2","PADI4","FAM212B","TREML2","REPS2","MAK","STEAP4","PGLYRP1","MXD1","MEFV","BTNL8","CREB5","ALOX5","BST1","CEACAM3","VNN3","LILRA2","HSPA6","NFE2","HAL","FFAR2","MMP25","QPCT","CDA","P2RY13","CHST15","TNFRSF10C","FPR2","MGAM","APOBEC3A","TLR2","CXCR1","FAM65B","LST1","TREM1","CSF3R","C5AR1","SELL","AQP9","CXCR2","VNN2","MNDA","FPR1","FCGR3B"),
     GeneSet = rep("Neutrophils_", 47)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Neutrophils_.csv", row.names = FALSE)

    #45. Inflammatory_neutrophils
    data <- data.frame(
     geneSymbol = c("CD177.1","ZDHHC19.1","PLAC8.1","IFI6.2","MT2A.2","SERPING1.2","RSAD2.2","GBP1.2","ISG15.2","LY6E.2","S100A12.2","GBP5.2","XAF1.2","IFI44L.2","IFIT3.2","IFITM3.2","SELL.2","TRIM22.2","EPSTI1.2","FCER1G.2","IFI44.2","CST7.1","OAS1.2","TNFSF13B.2","SLFN5","IFIT2.2","MX1.2","PLSCR1.2","MMP9.2","LAP3","OAS2.2","IFIT1.2","TAP1.2","SAMD9L.2","HIST1H2AC.1","APOL6.2","OASL.1","DDX58.1","WSB1.1","RPL28.1","IL1RN.2","MCEMP1.1","HERC5.1","IRF7.2","CKAP4.1","UBE2L6.1","RNF213.2","GBP2.2","CLEC4D.1","LILRA6.1","SHISA5.2","PHF11.1","IFI16.2","OAS3.1","PIM1.1","SERPINB1.1","GYG1.1","IFITM1.1","PSTPIP2","TMSB10.1","GBP4","NFKBIA.2","EIF2AK2.1","FFAR2.2","PARP9.1","NT5C3A","TNFSF10.2","MYL12A.1","PARP14.1","DDX60","DDX60L.2","STAT1.1","XRN1","SAMD9.2","SNX3","HIST1H2BD","PGD.1","FCGR1A","JUN","CD44","PLP2.1","CARD16.2","LIMK2.2","CYSTM1.1","HMGB2.1","IFIH1.2","C1orf162","KLF4","STAT2","ZBP1","ALPL.1","GSTK1","SP110.2","ANXA1","ANXA3.1","METTL9","PML","NUCB1","CEACAM1","SECTM1","PSMB9.1","SAT1","RNF10.1","MAPK14.1","LILRA5","C3AR1","ZCCHC2","SAMHD1","LGALS9","CR1","C4orf3","IRF1.1","CD63","GRINA","TXN.1","MX2.1","GAPDH.1","RBCK1","NBN.2","NMI.1","CAST","HIST2H2BE","NTNG2","SP100.1","CASP1.1","TNFAIP6","PLEK.1","LMNB1","NFIL3","APOL2","BST1","NUB1","PIK3AP1","CCR1.2","ALOX5AP.1","EMB","ISG20","TMEM123.1","ADAR","GIMAP4","SPTLC2","SAMSN1.1","SPI1","GRN.1","APOBEC3A","GLRX","CD53.1","TRIM38","CD82","SH3GLB1","VIM.1","ADM.1","CASP4.1","S100A6.1","RAC2.1","BAZ1A","ADD3","ITGAM","LY96.2","MSRB1.1","DYSF.1","MOB1A","TPM3.1","FGR","ACSL1","IL2RG.1","CDKN2D","FYB1.1","CLEC4E.1","HLA-F","LILRB3","RBMS1","PLBD1","FLOT1","GNG5","PTEN.1","PROK2.1","UBE2J1","CD37.1","KCNJ15.1","BRI3.1","HIF1A","PRR13.1","LRG1","MAX","RHOG","NFE2","STXBP2","B4GALT5","CAPZA1","SERPINA1","ZYX","RGS19","FKBP5","GCA.1","FKBP1A","MTPN","VNN2","AC245128.3","CD55","CFL1.1"),  # shortened for clarity
     GeneSet = rep("Inflammatory neutrophils", 201)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Inflammatory_neutrophils.csv", row.names = FALSE)

    #46. Suppressive_neutrophils
    data <- data.frame(
     geneSymbol = c("LY6E","XAF1","RSAD2","ISG15","IFI44L","IFITM3","MT2A","IFIT3","EPSTI1","IFIT1","IFI6","IFIT2","MX1","GBP5","PLSCR1","SELL","SERPING1","SHISA5","TRIM22","GBP1","IFI44","IRF7","FFAR2","OAS1","OAS2","IL1RN","IFI16","CCR1","SP110","APOL6","NFKBIA","TNFSF10","FCER1G","IFIH1","TNFSF13B","SAMD9","SAMD9L","DDX60L","TAP1","NBN","RNF213","CARD16","GBP2","LY96","CLEC2B","LIMK2"),
     GeneSet = rep("Suppressive neutrophils", 46)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Suppressive_neutrophils.csv", row.names = FALSE)

    #47. Apoptosis
    data <- data.frame(
      geneSymbol = c("AIFM1","APAF1","BAD","BAK1","BAX","BCL2L11","BID","CASP10","CASP2","CASP3","CASP6","CASP7","CASP8","CASP9","DIABLO","ENDOG","FAS","FASLG","HTRA2","TNFRSF1A","TP53"),
      geneEntrezID = c("9131","317","572","578","581","10018","637","843","835","836","839","840","841","842","56616","2021","355","356","27429","7132","7157"),
      GeneSet = rep("Apoptosis", 21)
    )
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneEntrezID, 
                          keytype="ENTREZID", 
                          columns="ENSEMBL")
    result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
    sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Apoptosis.csv", row.names = FALSE)

    #48. NFkB complex
    data <- data.frame(
      geneSymbol = c("IKBKB","IKBKG","NFKB1","NFKB2","RELA","RELB","REL"),
      geneEntrezID = c("3551","8517","4790","4791","5970","5971","5966"),
      GeneSet = rep("NFkB complex", 7)
    )
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneEntrezID, 
                          keytype="ENTREZID", 
                          columns="ENSEMBL")
    result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
    sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "NFkB_complex.csv", row.names = FALSE)

#~/Tools/csv2xls-0.4/csv_to_xls.py Alt._complement.csv Anergic_or_act._T_cells.csv Anti-inflammation.csv B_cells.csv CD40_activated.csv CD8T-NK-NKT.csv Cell_cycle.csv Classical_complement.csv Cyt._act._T_cells.csv Dendritic_cells.csv Erythrocytes.csv Granulocytes.csv IFN.csv IG_chains.csv IL-1_cytokines.csv IL-6R_complex.csv Inflammasome.csv LDG.csv Lectin_complement.csv MHC_II.csv Monocytes.csv Monocytes_40.csv Monocyte_cell_surface.csv Monocyte_secreted.csv Myeloid_cells.csv Neutrophils.csv NK_cells.csv NLRP3_inflammasome.csv pDC.csv Plasma_cells.csv Platelets.csv Pro-inflam._IL-1.csv SNOR_low_DOWN.csv SNOR_low_UP.csv TCRA.csv TCRB.csv TCRD.csv TCRG.csv TNF.csv T_activated.csv T_cells.csv T_regs.csv Unfolded_protein.csv Neutrophils_.csv Inflammatory_neutrophils.csv Suppressive_neutrophils.csv Apoptosis.csv NFkB_complex.csv -d',' -o Signatures.xls

#---------------------------------------------------------------------
#---- 2. Prepare gene expression matrix: calculate DESeq2 results ----

setwd("/home/jhuang/DATA/Data_Susanne_Carotis_RNASeq/results/featureCounts_2023")

library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
#BiocManager::install("org.Hs.eg.db")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)

d.raw<- read.delim2("merged_gene_counts.txt",sep="\t", header=TRUE, row.names=1)

colnames(d.raw)<-c("gene_name", "leer_mock_2h_r2", "Ace2_mock_2h_r2", "leer_inf_24h_r1", "Ace2_inf_2h_r1", "leer_inf_24h_r2", "leer_inf_2h_r1", "leer_mock_2h_r1", "leer_inf_2h_r2", "Ace2_inf_2h_r2", "Ace2_mock_2h_r1", "Ace2_inf_24h_r2", "Ace2_inf_24h_r1")

col_order <- c("gene_name", "leer_mock_2h_r1","leer_mock_2h_r2","leer_inf_2h_r1","leer_inf_2h_r2","leer_inf_24h_r1","leer_inf_24h_r2","Ace2_mock_2h_r1","Ace2_mock_2h_r2","Ace2_inf_2h_r1","Ace2_inf_2h_r2","Ace2_inf_24h_r1","Ace2_inf_24h_r2")

reordered.raw <- d.raw[,col_order]
reordered.raw$gene_name <- NULL
#d <- d.raw[rowSums(reordered.raw>3)>2,]

condition = as.factor(c("leer_mock_2h","leer_mock_2h","leer_inf_2h","leer_inf_2h","leer_inf_24h","leer_inf_24h","Ace2_mock_2h","Ace2_mock_2h","Ace2_inf_2h","Ace2_inf_2h","Ace2_inf_24h","Ace2_inf_24h"))
ids = as.factor(c("leer_mock_2h_r1","leer_mock_2h_r2","leer_inf_2h_r1","leer_inf_2h_r2","leer_inf_24h_r1","leer_inf_24h_r2","Ace2_mock_2h_r1","Ace2_mock_2h_r2","Ace2_inf_2h_r1","Ace2_inf_2h_r2","Ace2_inf_24h_r1","Ace2_inf_24h_r2"))

#cData = data.frame(row.names=colnames(reordered.raw), condition=condition,  batch=batch, ids=ids)
#dds<-DESeqDataSetFromMatrix(countData=reordered.raw, colData=cData, design=~batch+condition)
cData = data.frame(row.names=colnames(reordered.raw), condition=condition, ids=ids)
dds<-DESeqDataSetFromMatrix(countData=reordered.raw, colData=cData, design=~condition)

#----more detailed and specific with the following code!----
dds$condition <- relevel(dds$condition, "Ace2_mock_2h")
dds = DESeq(dds, betaPrior=FALSE)  # betaPrior default value is FALSE
resultsNames(dds)

#--------------------------------------
#---- 3. draw violin on signatures ----

library(GSVA)
library(ggplot2)
#install.packages("readxl")
library(readxl)

# Path to the Excel file
file_path <- "Signatures.xls"

#example of a signature:
#geneSymbol geneEntrezID    ENSEMBL GeneSet
#CD160  11126   ENSG00000117281 Anergic or act. T cells
#CD244  51744   ENSG00000122223 Anergic or act. T cells
#CTLA4  1493    ENSG00000163599 Anergic or act. T cells
#HAVCR2 84868   ENSG00000135077 Anergic or act. T cells
#ICOS   29851   ENSG00000163600 Anergic or act. T cells
#KLRG1  10219   ENSG00000139187 Anergic or act. T cells
#LAG3   3902    ENSG00000089692 Anergic or act. T cells
#PDCD1  5133    ENSG00000188389 Anergic or act. T cells
#PDCD1  5133    ENSG00000276977 Anergic or act. T cells

# Get the names of the sheets
sheet_names <- excel_sheets(file_path)

# Initialize an empty list to hold gene sets
geneSets <- list()

# Loop over each sheet, extract the ENSEMBL IDs, and add to the list
for (sheet in sheet_names) {
  # Read the sheet
  data <- read_excel(file_path, sheet = sheet)

  # Process the GeneSet names (replacing spaces with underscores, for example)
  gene_set_name <- gsub(" ", "_", unique(data$GeneSet)[1])

  # Add ENSEMBL IDs to the list
  geneSets[[gene_set_name]] <- as.character(data$ENSEMBL)
}

# Print the result to check
print(geneSets)

# 0. for Nanostring, the GSVA input requires a gene expression matrix 'exprs' where rows are genes and columns are samples.
#exprs <- exprs(filtered_or_neg_target_m666Data)
# 0. for RNAseq, the GSVA input requires a gene expression matrix where rows are genes and columns are samples.
exprs <- counts(dds, normalized=TRUE)

# 1. Compute GSVA scores:
library(GSVA)
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- dds$condition

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("Ace2_mock_2h", "Ace2_inf_24h"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("Ace2_mock_2h", "mock", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("Ace2_inf_24h", "infection", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("mock", "infection"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "mock", gene_name]
  condition2_data <- data[data$Condition == "infection", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value

  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))

  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )

  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)

  return(p)
}

# 6. Generate the list of plots:
#genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
#plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 6. Generate the list of plots in a predefined order:
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated",  "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes",  "Myeloid_cells","Neutrophils")
genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("All_Violin_Plots_RNAseq.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

# ------------------------ Heatmap Presentation --------------------

#-- NanoString GSVA scores --
exprs <- exprs(target_m666Data)
#exprs <- exprs(filtered_or_neg_target_m666Data)

# 1. Compute GSVA scores:
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Grp
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1", "Group1", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1", "Group3"))

# Filter the data for "Group1" samples
group1_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1", genes]

# Set the median values for each column in the data.frame
# Calculate median values for each column
group1_medians <- apply(group1_data, 2, median)

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Group
#identical(rownames(gsva_df_filtered), rownames(pData(target_m666Data)) )
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1a", "1b", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1a", "Group1a", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("1b", "Group1b", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1a", "Group1b", "Group3"))

# Filter the data for "Group1a" samples
group1a_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1a", genes]
group1a_medians <- apply(group1a_data, 2, median)

# Filter the data for "Group1b" samples
group1b_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1b", genes]
group1b_medians <- apply(group1b_data, 2, median)

# Filter the data for "Group3" samples
group3_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group3", genes]
group3_medians <- apply(group3_data, 2, median)

#-- RNAseq GSVA scores --
exprs <- counts(dds, normalized=TRUE)

# 1. Compute GSVA scores:
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- dds$condition

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("Ace2_mock_2h", "Ace2_inf_24h"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("Ace2_mock_2h", "mock", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("Ace2_inf_24h", "infection", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("mock", "infection"))

# Filter the data for "mock" samples
mock_data <- gsva_df_filtered[gsva_df_filtered$Condition == "mock", genes]
mock_medians <- apply(mock_data, 2, median)

# Filter the data for "infection" samples
infection_data <- gsva_df_filtered[gsva_df_filtered$Condition == "infection", genes]
infection_medians <- apply(infection_data, 2, median)

# Create a new data frame with the middle values for both groups
heatmap_data_rnaseq <- data.frame(mock = mock_medians, infection = infection_medians)

# Transpose the data matrix to switch rows and columns
#heatmap_data <- t(heatmap_data)

# Convert the data frame to a numeric matrix
heatmap_matrix <- as.matrix(heatmap_data)

# Specify heatmap colors
color_scheme <- colorRampPalette(c("blue", "white", "red"))(50)

#TODO: adjust the color, so that it is not so dark and light!!!!, add RNAseq data for mock and infection!!
# Create the heatmap without hierarchical clustering
library(gplots)
png("Heatmap.png", width=1000, height=1000)
heatmap.2(
  heatmap_matrix,
  Colv = FALSE,  # Turn off column dendrogram
  Rowv = FALSE,  # Turn off row dendrogram
  trace = "none",  # Turn off row and column dendrograms
  scale = "row",   # Scale the rows
  col = color_scheme,
  main = "GSVA Heatmap",  # Title of the heatmap
  key = TRUE,            # Show color key
  key.title = "GSVA Score",  # Color key title
  key.xlab = NULL,       # Remove x-axis label from color key
  density.info = "none",  # Remove density plot
  cexRow = 1.2,  # Increase font size of row names
  cexCol = 1.2,  # Increase font size of column names
  srtCol = 40, keysize = 0.72, margins = c(5, 14)
)
dev.off()

GSVA calculation for NanoString data

library("rmarkdown")
library("tidyverse")
library(rmarkdown)
library("GeomxTools")
library("GeoMxWorkflows")
library("NanoStringNCTools")
setwd("/home/jhuang/DATA/Data_Susanne_Carotis_spatialRNA_PUBLISHING/run_2023")
render("run.Rmd", c("html_document"))

# ------------------------ Violin Presentation 1 --------------------
#identical(exprs(target_m666Data), assay(target_m666Data, "exprs")): contains the gene expression values that have been both normalized and log-transformed, which are often used for downstream analysis and visualization in many genomic studies.

#assay(target_m666Data, "log_q"): These are likely log2-transformed values of the quantile-normalized data. Quantile normalization ensures that the distributions of gene expression values across samples are the same, and taking the log2 of these normalized values is a common step in the analysis of high-throughput gene expression data. It's a way to make the data more suitable for downstream statistical analysis and visualization.

#assay(target_m666Data, "q_norm"): This typically represents the quantile-normalized gene expression values. Quantile normalization adjusts the expression values in each sample so that they have the same distribution. This normalization method helps remove systematic biases and makes the data more comparable across samples.

library(readxl)

# Path to the Excel file
file_path <- "Signatures.xls"

#example of a signature:
#geneSymbol geneEntrezID    ENSEMBL GeneSet
#CD160  11126   ENSG00000117281 Anergic or act. T cells
#CD244  51744   ENSG00000122223 Anergic or act. T cells
#CTLA4  1493    ENSG00000163599 Anergic or act. T cells
#HAVCR2 84868   ENSG00000135077 Anergic or act. T cells
#ICOS   29851   ENSG00000163600 Anergic or act. T cells
#KLRG1  10219   ENSG00000139187 Anergic or act. T cells
#LAG3   3902    ENSG00000089692 Anergic or act. T cells
#PDCD1  5133    ENSG00000188389 Anergic or act. T cells
#PDCD1  5133    ENSG00000276977 Anergic or act. T cells

# Get the names of the sheets
sheet_names <- excel_sheets(file_path)

# Initialize an empty list to hold gene sets
geneSets <- list()

# Loop over each sheet, extract the ENSEMBL IDs, and add to the list
for (sheet in sheet_names) {
  # Read the sheet
  data <- read_excel(file_path, sheet = sheet)

  # Process the GeneSet names (replacing spaces with underscores, for example)
  gene_set_name <- gsub(" ", "_", unique(data$GeneSet)[1])

  # Add ENSEMBL IDs to the list
  geneSets[[gene_set_name]] <- unique(as.character(data$geneSymbol))
}

# Print the result to check
print(geneSets)

library(gridExtra)
library(ggplot2)
library(GSVA)

# 0. for the following calculation, the GSVA input requires a gene expression matrix 'exprs' where rows are genes and columns are samples.
exprs <- exprs(target_m666Data)
#exprs <- exprs(filtered_or_neg_target_m666Data)

# 1. Compute GSVA scores:
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Grp
#identical(rownames(gsva_df_filtered), rownames(pData(target_m666Data)) )
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1", "Group1", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1", "Group3"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "Group1", gene_name]
  condition2_data <- data[data$Condition == "Group3", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value

  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))

  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )

  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)

  return(p)
}

# 6. Generate the list of plots:
#genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
#plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 6. Generate the list of plots in a predefined order:
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated",  "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes",  "Myeloid_cells","Neutrophils")
genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
genes <- genes[!is.na(genes)]
plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 7. Pad the list of plots:
#plots_needed_for_full_grid <- ceiling(length(plots_list) / 6) * 6
#remaining_plots <- plots_needed_for_full_grid - length(plots_list)
#plots_list <- c(plots_list, rep(list(NULL), remaining_plots))

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("All_Violin_Plots_NanoString_3_vs_1.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

# ------------------------ Violin Presentation 2 --------------------

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Group
#identical(rownames(gsva_df_filtered), rownames(pData(target_m666Data)) )
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1a", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1a", "Group1a", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1a", "Group3"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "Group1a", gene_name]
  condition2_data <- data[data$Condition == "Group3", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value

  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))

  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )

  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)

  return(p)
}

# 6. Generate the list of plots:
#genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
#plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 6. Generate the list of plots in a predefined order:
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated",  "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes",  "Myeloid_cells","Neutrophils")
genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
genes <- genes[!is.na(genes)]
plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 7. Pad the list of plots:
#plots_needed_for_full_grid <- ceiling(length(plots_list) / 6) * 6
#remaining_plots <- plots_needed_for_full_grid - length(plots_list)
#plots_list <- c(plots_list, rep(list(NULL), remaining_plots))

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("All_Violin_Plots_NanoString_3_vs_1a.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

# ------------------------ Heatmap Presentation --------------------

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Grp
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1", "Group1", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1", "Group3"))

# Filter the data for "Group1" samples
group1_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1", genes]

# Set the median values for each column in the data.frame
# Calculate median values for each column
group1_medians <- apply(group1_data, 2, median)

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Group
#identical(rownames(gsva_df_filtered), rownames(pData(target_m666Data)) )
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1a", "1b", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1a", "Group1a", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("1b", "Group1b", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1a", "Group1b", "Group3"))

# Filter the data for "Group1a" samples
group1a_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1a", genes]
group1a_medians <- apply(group1a_data, 2, median)

# Filter the data for "Group1b" samples
group1b_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1b", genes]
group1b_medians <- apply(group1b_data, 2, median)

# Filter the data for "Group3" samples
group3_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group3", genes]
group3_medians <- apply(group3_data, 2, median)

# Create a new data frame with the middle values for both groups
heatmap_data_nanostring <- data.frame(Group1 = group1_medians, Group1a = group1a_medians, Group1b = group1b_medians, Group3 = group3_medians)

# Merge two heatmap_data_* to a matrix
heatmap_data <- merge(heatmap_data_nanostring, heatmap_data_rnaseq, by="row.names", all=TRUE)
rownames(heatmap_data) <- heatmap_data$Row.names
heatmap_data$Row.names <- NULL

# Transpose the data matrix to switch rows and columns
#heatmap_data <- t(heatmap_data)

# Convert the data frame to a numeric matrix
heatmap_matrix <- as.matrix(heatmap_data)

# Transpose the data matrix to switch rows and columns
#heatmap_data <- t(heatmap_data)

# Convert the data frame to a numeric matrix
heatmap_matrix <- as.matrix(heatmap_data)

# Specify heatmap colors
color_scheme <- colorRampPalette(c("blue", "white", "red"))(50)

#TODO: adjust the color, so that it is not so dark and light!!!!, add RNAseq data for mock and infection!!
# Create the heatmap without hierarchical clustering
library(gplots)
png("Heatmap.png", width=1000, height=1000)
heatmap.2(
  heatmap_matrix,
  Colv = FALSE,  # Turn off column dendrogram
  Rowv = FALSE,  # Turn off row dendrogram
  trace = "none",  # Turn off row and column dendrograms
  scale = "row",   # Scale the rows
  col = color_scheme,
  main = "GSVA Heatmap",  # Title of the heatmap
  key = TRUE,            # Show color key
  key.title = "GSVA Score",  # Color key title
  key.xlab = NULL,       # Remove x-axis label from color key
  density.info = "none",  # Remove density plot
  cexRow = 1.2,  # Increase font size of row names
  cexCol = 1.2,  # Increase font size of column names
  srtCol = 40, keysize = 0.72, margins = c(5, 14)
)
dev.off()

# Write the table to Excel-file
library(writexl)
# Specify the file path where you want to save the Excel file
excel_file <- "heatmap_data.xlsx"

# Save the data frame to the Excel file
write_xlsx(heatmap_data, excel_file)

# Verify that the Excel file has been saved
if (file.exists(excel_file)) {
  cat("Data frame has been saved to", excel_file, "\n")
} else {
  cat("Error: Failed to save the data frame to", excel_file, "\n")
}

Ordinary vs. Moderated P-values: A Key Comparison in limma Differential Expression Analysis

In the context of differential expression analysis, limma is a popular R package that originally was designed for microarray data but has since been adapted for RNA-seq data (using voom transformation). One of the unique features of limma is its use of moderated statistics. Here’s a breakdown of the difference between ordinary p-values and moderated p-values in limma:

  • Ordinary P-values:

    • These are the p-values you would get if you were to do standard hypothesis testing on each gene individually without borrowing information from other genes.
    • Calculated based on the ordinary standard errors.
  • Moderated P-values:

    • Limma borrows information across genes to get more precise estimates of the variability for each gene. This is especially useful when the number of samples (replicates) is small.
    • The process involves “shrinking” the gene-wise sample variances towards a pooled estimate, resulting in moderated t-statistics, which are more stable than ordinary t-statistics.
    • Moderated p-values are calculated based on these moderated t-statistics.
    • As a result, these moderated p-values tend to be more reliable, especially in experiments with small sample sizes.
  • Why is Moderation Necessary?:

    • In many genomics experiments, there’s a challenge: While there are thousands of genes (or more), there might be a relatively small number of replicates or samples. This can make estimates of variance for each gene unreliable.
    • By borrowing strength from the ensemble of genes, limma can stabilize these variance estimates, which, in turn, makes the resulting p-values more reliable.
  • Empirical Bayes Method:

    • The moderation in limma is achieved through an empirical Bayes method. This doesn’t mean it’s a fully Bayesian approach but rather that it borrows some concepts from Bayesian statistics to stabilize variance estimates across genes.

In practice, when using limma, researchers often focus on the moderated p-values because of their enhanced reliability, especially in the context of multiple hypothesis testing in genomics. The moderated statistics help reduce the number of false positives that might arise from genes with unusually low variance estimates due to chance alone.

GSVA calculation for Mass spectrometry (MS)-based proteomics

library("rmarkdown")
library("tidyverse")
library(rmarkdown)
setwd("/home/jhuang/DATA/Data_Susanne_Carotis_MS/LIMMA-pipeline-proteomics/Results_20231006_165122")

# -1. prepare and read signatures (namely gene sets)
#Monocytes.csv #66
#Monocytes2.csv #201
#Neutrophils.csv #40
#Neutrophils2.csv #61
#~/Tools/csv2xls-0.4/csv_to_xls.py Alt._complement.csv Anergic_or_act._T_cells.csv Anti-inflammation.csv B_cells.csv CD40_activated.csv CD8T-NK-NKT.csv Cell_cycle.csv Classical_complement.csv Cyt._act._T_cells.csv Dendritic_cells.csv Erythrocytes.csv Granulocytes.csv IFN.csv IG_chains.csv IL-1_cytokines.csv IL-6R_complex.csv Inflammasome.csv LDG.csv Lectin_complement.csv MHC_II.csv Monocytes.csv Monocyte_cell_surface.csv Monocyte_secreted.csv Myeloid_cells.csv Neutrophils.csv NK_cells.csv NLRP3_inflammasome.csv pDC.csv Plasma_cells.csv Platelets.csv Pro-inflam._IL-1.csv SNOR_low_DOWN.csv SNOR_low_UP.csv TCRA.csv TCRB.csv TCRD.csv TCRG.csv TNF.csv T_activated.csv T_cells.csv T_regs.csv Unfolded_protein.csv  Inflammatory_neutrophils.csv Suppressive_neutrophils.csv Apoptosis.csv NFkB_complex.csv -d',' -o Signatures_46.xls 

library(readxl)
file_path <- "../../Signatures_46.xls"
# Get the names of the sheets
sheet_names <- excel_sheets(file_path)
# Initialize an empty list to hold gene sets
geneSets <- list()
# Loop over each sheet, extract the ENSEMBL IDs, and add to the list
for (sheet in sheet_names) {
  # Read the sheet
  data <- read_excel(file_path, sheet = sheet)
  # Process the GeneSet names (replacing spaces with underscores, for example)
  gene_set_name <- gsub(" ", "_", unique(data$GeneSet)[1])
  # Add ENSEMBL IDs to the list
  geneSets[[gene_set_name]] <- unique(as.character(data$geneSymbol))
}
# Print the result to check
print(geneSets)
length(geneSets)

# 0. for the following calculation, the GSVA input requires a gene expression matrix 'exprs' where rows are genes and columns are samples.
library(gridExtra)
library(ggplot2)
library(GSVA)
exprs_ <- read.csv(sep="\t", file="20231006_165122_final_data.tsv")

# List of desired columns
desired_columns <- c("CA1", "CA2", "CA3", "CA10", "CA5", "CA6", "CA8", "CA14", "CA12", "CA13", "Con1", "Con2", "Con3", "Con4", "Con5", "Con6")
# # Find duplicated symbols
# duplicated_symbols <- exprs_$Symbol[duplicated(exprs_$Symbol) | duplicated(exprs_$Symbol, fromLast = TRUE)]
# print(duplicated_symbols)
# Filter out rows with empty 'Symbol' values
exprs_ <- exprs_[exprs_$Symbol != "", ]
# Subset the dataframe
exprs <- exprs_[, desired_columns]
# Set the 'Symbol' column as row names
rownames(exprs) <- exprs_$Symbol
exprs_matrix <- as.matrix(exprs)

# 1. Compute GSVA scores:
library(GSVA)
gsva_scores <- gsva(exprs_matrix, geneSets, method="gsva", useNames=TRUE)

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- c("COVID","COVID","COVID","COVID","COVID","COVID","COVID","COVID","COVID","COVID","control","control","control","control","control","control")

# 4. Filter the gsva_df to retain only the desired conditions:

# 5. Define a function to plot violin plots:
gsva_df$Condition <- factor(gsva_df$Condition, levels = c("control", "COVID"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "control", gene_name]
  condition2_data <- data[data$Condition == "COVID", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value
  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))
  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )
  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)
  return(p)
}

# 6. Generate the list of plots in a predefined order:
# [1] "Platelets"                "Granulocytes"            
# [3] "LDG"                      "pDC"                     
# [5] "Anti-inflammation"        "Pro-inflam._IL-1"        
# [7] "Dendritic_cells"          "MHC_II"                  
# [9] "Alt._complement"          "TNF"                     
#[11] "NLRP3_inflammasome"       "Unfolded_protein"        
#[13] "B_cells"                  "Monocyte_cell_surface"   
#[15] "Inflammasome"             "Monocyte_secreted"       
#[17] "IL-1_cytokines"           "SNOR_low_UP"             
#[19] "CD40_activated"           "Lectin_complement"       
#[21] "Classical_complement"     "Cell_cycle"              
#[23] "Plasma_cells"             "IG_chains"               
#[25] "Erythrocytes"             "IL-6R_complex"           
#[27] "IFN"                      "TCRB"                    
#[29] "TCRA"                     "Cyt._act._T_cells"       
#[31] "TCRG"                     "T_cells"                 
#[33] "CD8T-NK-NKT"              "Anergic_or_act._T_cells" 
#[35] "T_activated"              "Neutrophils"             
#[37] "NK_cells"                 "TCRD"                    
#[39] "T_regs"                   "SNOR_low_DOWN"           
#[41] "Monocytes"                "Myeloid_cells"           
#[43] "Inflammatory_neutrophils" "Suppressive_neutrophils" 
#[45] "Apoptosis"                "NFkB_complex"
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated", "Neutrophils", "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes", "Myeloid_cells", "Inflammatory_neutrophils", "Suppressive_neutrophils", "Apoptosis", "NFkB_complex")
genes <- colnames(gsva_df)[!colnames(gsva_df) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
filtered_genes <- genes[!is.na(genes)]
print(filtered_genes)
plots_list <- lapply(filtered_genes, function(filtered_genes) plot_violin(gsva_df, filtered_genes))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("Violin_Plots_GSVA_MS.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

# 10. Save GSVA scores as Excel-file.
library(writexl)
write_xlsx(gsva_df, "GSVA_Scores_MS.xlsx")

LIMMA pipeline processing proteomics (MS)

Statistical analysis on LC-MS data

In order to detect significant changes between two experimental groups we performed statistical analysis on LC-MS data. We have chosen LIMMA moderated t test statistics as proposed in Kammers et al. (2015) over standard t test, since it utilizes full data to shrink the observed sample variance toward a pooled estimate, resulting in far more stable and powerful inference compared to ordinary t test. Missing values in proteomic data are a common problem and it is widely used in practice to impute missing information before applying statistical methods. We used the imputation algorithm proposed in Tyanova et al. (2016) because it allows us to randomly draw values from a distribution meant to simulate values below the detection limit of the MS instrument. We implemented statistical analysis in R and codes are freely available at the following GitHub repository https://github.com/wasimaftab/LIMMA-pipeline– proteomics/tree/master/Mitochondrial_Loop. The final volcano plots only feature proteins that are part of the high-confidence mito- chondrial proteome (Morgenstern et al., 2017).

DATA AND CODE AVAILABILITY

The codes generated during this study to perform statistical analysis on the BioID LC-MS data are freely available at GitHub repository https://github.com/wasimaftab/LIMMA-pipeline-proteomics/tree/master/Mitochondrial_Loop. The data supporting this study have been uploaded on the Mendeley Dataset public repository at https://doi.org/10.17632/9symckg9wn.

#This is a pipeline to analyze proteiomic data in a proteinGroups.txt (MaxQuant output) file for two group comparision
#install.packages("matlab")
library(dplyr)
library(stringr)
library(MASS)
library(matlab)
library(plotly)
library(limma)
library(qvalue)
library(htmlwidgets)
library(rstudioapi)

source("limma_helper_functions.R")

# ---- INPUT for Gunnar Project, directly read the sheet "Gene Prot (FOUND)" ----
## Load the proteingroups file
proteingroups_ <- read.csv("210618_MG_Carotisartery_consensus.csv", sep=",", header=TRUE)
proteingroups <- proteingroups_ %>% filter(Checked == "WAHR")

#Gene Symbol    Accession   CA1 CA2 CA3 CA10    CA5 CA6 CA8 CA14    CA12    CA13    CA11    CA4 Con1    Con2    Con3    Con4    Con5    Con6
#IGKV3-7    A0A075B6H7      195532822   276145022   75324344    179451608   93312672    188741058   244072994   540080672   133307714   102293349.5 133798065   166553494.5 138312451   45263876    361462136   185500736   357708246   156721253

### Kill the code if proteingroups does not contain crap columns
#temp <-
#  select(
#    proteingroups,
#    matches("(Reverse|Potential.contaminant|Only.identified.by.site)")
#  )
#if (!nrow(temp) * ncol(temp)) {
#  stop("File error, It does not contain crap...enter another file with crap")
#}

## Display data to faciliate choice of treatment and control
temp <- select(proteingroups, matches("Abundance"))
print(names(temp))

### Remove "+" identifeid rows from proteingroups
#idx <- NULL
#temp <-
#  select(
#    proteingroups,
#    matches("(Only.identified.by.site|Reverse|Potential.contaminant)")
#  )
#for (i in 1:ncol(temp)) {
#  index <- which(unlist(!is.na(match(temp[, i], "+"))))
#  idx <- union(idx, index)
#}
#proteingroups <- proteingroups[-idx, ] # removing indexed rows

### Remove proteins with Sequence coverage [%] < 5 [%] and Unique peptides == 1
#temp <-
#  select(proteingroups,
#         matches("(^Sequence.coverage(.){4}$|^Unique.peptides$)"))
#proteingroups <-
#  proteingroups[-union(which(temp[, 1] == 1), which(temp[, 2] < 5)) , ] # removing indexed rows

## Extract Uniprot and gene symbols
#splits <- unlist(strsplit("CAD protein OS=Mus musculus OX=10090 GN=Cad PE=1 SV=1", '\\|'))
#splits <- unlist(str_match("CAD protein OS=Mus musculus OX=10090 GN=Cad PE=1 SV=1", "GN=(.*?) PE="))
Uniprot <- character(length = nrow(proteingroups))
Symbol <- character(length = nrow(proteingroups))
Descriptions <- character(length = nrow(proteingroups))
for (i in 1:nrow(proteingroups)) {
  temp <- as.character(proteingroups$Description[i])
  Descriptions[i] <- temp
  Uniprot[i] <- as.character(proteingroups$Accession[i])
  splits <- unlist(str_match(temp, "GN=(.*?) PE="))
  Symbol[i] <- splits[2]
}

# WAP_1, WAP_2, WAP_3, 
# D-ALPHA_1, D-ALPHA_2, D-ALPHA_3
# Q-ALPHA_1, Q-ALPHA_2, Q-ALPHA_3
# pTTSS+Q_1, pTTSS+Q_2, pTTSS+Q_3
# pTTSS+Q-ALPHA_1, pTTSS+Q-ALPHA_2, pTTSS+Q-ALPHA_3
##Extract required data for Limma
#treatment <- readline('Enter treatment name(case insensitive) as it appeared in the iBAQ/LFQ column= ')
#control <- readline('Enter control name(case insensitive) as it appeared in the iBAQ/LFQ column= ')
#ibaq <- readinteger_binary('Enter 1 for iBAQ or 0 for LFQ= ')
treatment <- c("Abundance..F4..Sample","Abundance..F5..Sample","Abundance..F6..Sample")
control <- c("Abundance..F1..Sample","Abundance..F2..Sample","Abundance..F3..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))  #LFQ
treatment_reps <- select(temp, matches(treatment))
#treatment_reps <- data_sanity_check(temp, 'treatment', treatment)
control_reps <- select(temp, matches(control))
#control_reps <- data_sanity_check(temp, 'control', control)
data <- cbind(treatment_reps,
        control_reps,
        #select(proteingroups, matches("^id$")),  #maybe problematic.
        Uniprot,
        Symbol)
##Extract required data for Limma
treatment <- c("Abundance..F7..Sample","Abundance..F8..Sample","Abundance..F9..Sample")
control <- c("Abundance..F1..Sample","Abundance..F2..Sample","Abundance..F3..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))
treatment_reps <- select(temp, matches(treatment))
control_reps <- select(temp, matches(control))
data <- cbind(treatment_reps,
        control_reps,
        Uniprot,
        Symbol)
##Extract required data for Limma
treatment <- c("Abundance..F13..Sample","Abundance..F14..Sample","Abundance..F15..Sample")
control <- c("Abundance..F10..Sample","Abundance..F11..Sample","Abundance..F12..Sample")
temp <- select(proteingroups, matches(paste('^.*', "Abundance", '.*$', sep = '')))
treatment_reps <- select(temp, matches(treatment))
control_reps <- select(temp, matches(control))
data <- cbind(treatment_reps,
        control_reps,
        Uniprot,
        Symbol)
#> head(data)
#  Abundance..F13..Sample Abundance..F14..Sample Abundance..F15..Sample
#1            74399257195             7824880491            2,67259E+11
#2             7002793339             2132521146            23458705782
#3             1060277580              620505532             5544368780
#4            442697991,6            613281893,8             1971535002
#5             1683646539            904648237,7             8359400640
#6            360135421,6             1339034150             1614352533
#  Abundance..F10..Sample Abundance..F11..Sample Abundance..F12..Sample Uniprot
#1             4487833586            4,47435E+11            2,23515E+11  B2RQC6
#2             2172021935            39427573076            28112131746  Q6ZQA0
#3            426743632,6             4947260656             6773100611  Q3UJB9
#4            926489052,4             1636804139             1250211974  Q8BX70
#5            797097872,6            10397417632             9384902608  Q9EP71
#6             2428231339             2254991154             6269016713  P26039
#  Symbol
#1    Cad
#2 Nbeal2
#3   Edc4
#4 Vps13c
#5  Rai14
#6   Tln1

# ---- INPUT for Susanne Project, directly read the sheet "Gene Prot (FOUND)" ----
data <- read.csv("210618_MG_Carotisartery_consensus.csv", dec=",")

#                          ,,1,1,1,1,1,1,1,1,1,1,                           2,2,         4,4,4,4,4,4
#Gene Symbol,Accession,    CA1,CA2,CA3,CA10,CA5,CA6,CA8,CA14,CA12,CA13,     CA11,CA4,    Con1,Con2,Con3,Con4,Con5,Con6
# Drop columns "CA11" and "CA4"
data <- data[, !(names(data) %in% c("CA11", "CA4"))]

## Impute data
#print(names(data))
#rep_treats <- readinteger("Enter the number of treatment replicates=")
#rep_conts <- readinteger("Enter the number of control replicates=")
#FC_Cutoff <- readfloat("Enter the log fold change cut off=")
rep_treats = 10
rep_conts = 6
FC_Cutoff = 2

## removing blank rows
#NO_BLANK_ROWS --> deactivate the following code.
#For Gunnar data: temp <- as.matrix(rowSums(apply(data[, 1:(rep_treats + rep_conts)], 2, as.numeric)))
temp <- as.matrix(rowSums(apply(data[, 3:(rep_treats + rep_conts + 2)], 2, as.numeric)))
idx <- which(temp == 0)
if (length(idx)) {
  data <- data[-idx,]
}
#NO INFINITE --> deactivate!  
#  data_limma <- log2(as.matrix( as.numeric(data[c(1:(rep_treats + rep_conts))]) ))
#data[,1] <- gsub(',', '.', data[,1])
#data[,2] <- gsub(',', '.', data[,2])
#data[,3] <- gsub(',', '.', data[,3])
#data[,4] <- gsub(',', '.', data[,4])
#data[,5] <- gsub(',', '.', data[,5])
#data[,6] <- gsub(',', '.', data[,6])

#For Gunnar data: data_limma <- log2(as.matrix( apply(data[, 1:(rep_treats + rep_conts)], 2, as.numeric) ))
data_limma <- log2(as.matrix( apply(data[, 3:(rep_treats + rep_conts + 2)], 2, as.numeric) ))
data_limma[is.infinite(data_limma)] <- NA
nan_idx <- which(is.na(data_limma))
#temp <- reshape(temp, nrow(data_limma)*ncol(data_limma), 1)
#hist(temp, na.rm = TRUE, xlab = "log2(intensity)", ylab = "Frequency", main =  "All data: before imputation")
# replace the following for the code before
temp <- as.vector(data_limma)
temp <- temp[!is.na(temp)]
png("Freq_log2_before_imputation.png", width=800, height=800)
hist(temp, xlab = "log2(intensity)", ylab = "Frequency", main = "All data: before imputation")
dev.off()
fit <- fitdistr(c(na.exclude(data_limma)), "normal")
mu <- as.double(fit$estimate[1])
sigma <- as.double(fit$estimate[2])
sigma_cutoff <- 6
new_width_cutoff <- 0.3
downshift <- 1.8
width <- sigma_cutoff * sigma
new_width <- width * new_width_cutoff
new_sigma <- new_width / sigma_cutoff
new_mean <- mu - downshift * sigma
imputed_vals_my = rnorm(length(nan_idx), new_mean, new_sigma)
# scaling_factor <- readfloat_0_1("Enter a number > 0 and <=1 to scale imputed values = ")
scaling_factor = 1.0
data_limma[nan_idx] <- imputed_vals_my*scaling_factor
#data_limma[nan_idx] <- imputed_vals_my
#NOTE: [10857] values are imputed!!!

#> tail(data_limma)
#            CA1      CA2      CA3     CA10      CA5      CA6      CA8     CA14
#[1785,] 24.04443 23.32959 22.77167 23.65635 21.63752 24.01719 22.81896 23.32914
#[1786,]       NA       NA       NA       NA       NA 22.37408 22.64151       NA
#[1787,]       NA 22.91260 22.48334       NA       NA 22.02580 21.90541       NA
#[1788,] 24.19555 22.50627       NA 22.73289 23.85307 23.26974 22.76822       NA
#[1789,]       NA 23.84345 23.01388 22.02657       NA       NA 22.55753 21.60069
#[1790,] 24.50819       NA       NA 25.07117 22.63140 21.78845 22.61968 20.99800
#            CA12     CA13     Con1     Con2     Con3     Con4     Con5     Con6
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197 22.85973 21.98199
#[1786,]       NA       NA       NA       NA       NA       NA       NA       NA
#[1787,]       NA       NA       NA       NA       NA       NA       NA       NA
#[1788,]       NA       NA       NA       NA       NA 21.93513       NA 23.16174
#[1789,] 23.15220 21.70746 20.82720       NA       NA       NA       NA       NA
#[1790,]       NA 23.65803       NA 24.47523       NA 23.56952 22.89298       NA
#----impute---->
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197
#[1786,] 23.01317 23.12527 21.11763 20.60838 20.00608 20.90938
#[1787,] 22.20547 22.69346 23.68541 22.05089 21.50437 21.32925
#[1788,] 21.47150 22.78137 22.29056 20.30675 21.50243 21.93513
#[1789,] 23.15220 21.70746 20.82720 20.68821 22.90624 22.58886
#[1790,] 21.16174 23.65803 21.56143 24.47523 21.91832 23.56952
#            CA1      CA2      CA3     CA10      CA5      CA6      CA8     CA14
#[1785,] 24.04443 23.32959 22.77167 23.65635 21.63752 24.01719 22.81896 23.32914
#[1786,] 23.37888 21.17247 21.03109 21.87904 21.97014 22.37408 22.64151 22.80035
#[1787,] 21.57987 22.91260 22.48334 21.98497 22.34736 22.02580 21.90541 20.12150
#[1788,] 24.19555 22.50627 21.69512 22.73289 23.85307 23.26974 22.76822 21.56332
#[1789,] 20.48388 23.84345 23.01388 22.02657 23.42263 21.80419 22.55753 21.60069
#[1790,] 24.50819 22.03749 21.90290 25.07117 22.63140 21.78845 22.61968 20.99800
#            CA12     CA13     Con1     Con2     Con3     Con4
#[1785,] 22.64312 23.35157 22.65619 22.88652 23.56822 22.20197
#[1786,] 20.89429 23.68783 22.26621 21.77314 22.70896 22.14385
#[1787,] 22.28350 22.87021 21.09328 23.33020 21.88887 20.88960
#[1788,] 22.25577 21.17894 21.69301 22.28463 22.30470 21.93513
#[1789,] 23.15220 21.70746 20.82720 20.97298 19.93577 23.94133
#[1790,] 21.19956 23.65803 21.75923 24.47523 22.53592 23.56952
# temp <- reshape(temp, nrow(data_limma)*ncol(data_limma), 1)
temp <- as.vector(data_limma)
png("Freq_log2_after_imputation.png", width=800, height=800)
#na.rm = TRUE, 
hist(temp, xlab = "log2(intensity)", ylab = "Frequency", main =  "All data: after imputation")
dev.off()

## Median Normalization Module
#want_normalization <- as.integer(readline(
#  cat(
#    'Enter 1: if you want to normalize the protein intensities in each experiemnt by substrating the median of the corresponding experiment\n',
#    '\bEnter 2: if you want to perform column wise median normalization of the data matrix\n',
#    '\b(for definition of "column wise median normalization" see README)= '
#  )
#))
#  if (want_normalization == 1) {
  # browser()
  # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates before normalization"))
  # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates before normalization"))

  # Before normalization
  png("before_normalization.png", width=800, height=800)
  boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data before median substract normalization")
  dev.off()

  # Subtracting the column medians
  #Subtracting Median of Columns (samples): When you subtract the median of columns (samples), you are ensuring that every sample has a median expression level set to zero (or another reference level if not using median). This is useful when you want to compare expression levels across different samples and want to correct for any systematic bias that might be present in a given sample. This type of normalization is particularly common when dealing with samples that may have been processed or sequenced at different times, by different technicians, or with slightly different protocols, leading to varying baseline levels.
  #Subtracting Median of Rows (genes): When you subtract the median of rows (genes), you are normalizing expression profiles of each gene across different samples. This might be useful in cases where you want to compare the expression profiles of different genes and you are interested in relative changes in gene expression across samples, not the absolute expression level of a gene.
  #In most gene expression studies, normalization by subtracting the median (or mean) of columns (samples) is the typical approach to account for variations between samples.
  col_med <- matrixStats::colMedians(data_limma)
  med_mat <- matlab::repmat(col_med, nrow(data_limma), 1)
  data_limma <- data_limma - med_mat
  #(optional) using sweep instead of matlab::repmat
  #col_med <- apply(data_limma, 2, median, na.rm = TRUE)
  #data_limma <- sweep(data_limma, 2, col_med)

  # After normalization
  png("after_normalization.png", width=800, height=800)
  boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data after median substract normalization")
  dev.off()
  # browser()
  # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates after normalization"))
  # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates after normalization"))
#  } else if (want_normalization == 2) {
#    boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data before column wise median normalization")
#    data_limma <- median_normalization(data_limma)
#    boxplot(data_limma[, 1:(rep_treats + rep_conts)], main = "data after column wise median normalization")
#    # browser()
#    # boxplot(data_limma[,1:rep_treats], main = paste(treatment, "replicates after normalization"))
#    # boxplot(data_limma[,(rep_treats+1):(rep_treats+rep_conts)], main = paste(control, "replicates after normalization"))
#  }

#For Gunnar's data
#Symbol <- data$Symbol
#Uniprot <- data$Uniprot
#For Susanne's data
Symbol <- data$Gene.Symbol
Uniprot <- data$Accession
##Limma main code
design <- model.matrix( ~ factor(c(rep(2, rep_treats), rep(1, rep_conts))))
colnames(design) <- c("Intercept", "Diff")
res.eb <- eb.fit(data_limma, design, Symbol)
Sig_FC_idx <- union(which(res.eb$logFC < (-FC_Cutoff)), which(res.eb$logFC > FC_Cutoff))
Sig_Pval_mod_idx <- which(res.eb$p.mod < 0.05)
Sig_Pval_ord_idx <- which(res.eb$p.ord < 0.05)
Sig_mod_idx <-  intersect(Sig_FC_idx, Sig_Pval_mod_idx)
Sig_ord_idx <-  intersect(Sig_FC_idx, Sig_Pval_ord_idx)
categ_Ord <- rep(c("Not Significant"), times = length(Symbol))
categ_Mod <- categ_Ord
categ_Mod[Sig_mod_idx] <- "Significant"
categ_Ord[Sig_ord_idx] <- "Significant"
dat <-
  cbind(
    res.eb,
    categ_Ord,
    categ_Mod,
    NegLogPvalMod = (-log10(res.eb$p.mod)),
    NegLogPvalOrd = (-log10(res.eb$p.ord))
  )

##Save the data file
setwd("/home/jhuang/DATA/Data_Susanne_MS/LIMMA-pipeline-proteomics")
final_data <-
  cbind(#DELETED select(data, matches("^id$")),
        Uniprot,
        Symbol,
        #Descriptions,
        data_limma,
        dat)
#DELETED final_data <- select(final_data, -matches("^gene$"))
filename_final_data <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_final_data')
# readline('Enter a filename for final data= ')

##Create plotly object and save plot as html
filename_mod <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_limma_plot')
# readline('Enter a filename for limma plot= ')
filename_ord <-
  paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), '_ord_plot')
# readline('Enter a filename for ordinary t-test plot= ')

display_plotly_figs(final_data, FC_Cutoff, filename_mod, filename_ord)

write.table(
  final_data,
  paste(filename_final_data, '.tsv', sep = ''),
  sep = '\t',
  row.names = FALSE,
  col.names = TRUE
)
#setwd(cur_dir)

#sorting, add description.
#delete t.ord .. s2.post
# senden auch die Method-part!
# check if Symbol==gene  -log10(p.mod)
#cut -f2-2 20220221_173330_final_data.tsv > Symbol.txt
#cut -f14-14 20220221_173330_final_data.tsv > gene.txt

#https://bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html#6_Example_applications
#Input of GSVA should log and normalized or not normalized?

https://www.mcponline.org/article/S1535-9476(20)34997-5/fulltext DEqMS: A Method for Accurate Variance Estimation in Differential Protein Expression Analysis* https://genome.duke.edu/sites/default/files/2013Lecture3-DifferentialExpressionProteomicsforweb.pdf DESeq2 https://bioconductor.statistik.tu-dortmund.de/packages/3.5/bioc/vignettes/DESeq2/inst/doc/DESeq2.html Analyzing RNA-seq data with DESeq2 Comparative analysis of statistical methods used for detecting differential expression in label-free mass spectrometry proteomics

labeled mass spectrometry counts The only type of MS-based quantitation that is amenable to DESeq2 (or similar approaches) is spectral counting. Other approaches, whether labelled of unlabelled, are continuous data; I would also recommend limma for their analysis. https://www.researchgate.net/publication/221925844_Labeling_Methods_in_Mass_Spectrometry_Based_Quantitative_Proteomics https://de.wikipedia.org/wiki/Markierungsfreie_massenspektrometrische_Quantifizierung Label-free quantification is a method in mass spectrometry that aims to determine the relative amount of proteins in two or more biological samples. Unlike other methods for protein quantification, label-free quantification does not use a stable isotope containing compound to chemically bind to and thus label the protein.[1][2] https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-179 https://www.agilent.com/en/product/software-informatics/mass-spectrometry-software/data-analysis https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6412084/ https://bioinformatics.stackexchange.com/questions/16227/help-with-dia-mass-spectrometry-data-analysis-with-several-conditions-limma https://www.biostars.org/p/95405/#google_vignette https://www.sciencedirect.com/science/article/pii/S1570963913001866 https://github.com/wasimaftab/LIMMA-pipeline-proteomics/blob/master/limma_main.R

Statistical analysis on LC-MS data

  • To detect significant changes between two experimental groups we performed statistical analysis on LC-MS data.

Some comments to the analysis:

  • The LIMMA moderated t-test statistics are chosen for two-group comparison in a proteomic experiment.
  • Normalise by subtracting median: This method normalizes the protein intensities in each experiment by substrating the median of the corresponding experiment. We implemented statistical analysis in R and codes are freely available at the following GitHub repository https://github.com/wasimaftab/LIMMA-pipeline– proteomics/tree/master/Mitochondrial_Loop. The final volcano plots only feature proteins that are part of the high-confidence mito- chondrial proteome (Morgenstern et al., 2017).
    1. Normalise by subtracting median: This method normalizes the protein intensities in each experiemnt by substrating the median of the corresponding experiment.
    2. Column wise median normalization of the data matrix: This method calculates for each sample the median change (i.e. the difference between the observed value and the row average) and subtracts it from each row. Missing values are ignored in the procedure. The method is based on the assumption that a majority of the rows did not change. By default, all the rows are used for normalization but if we assume that the first 3 proteins are spike-ins then the call to median_normalization function needs to be modified as: median_normalization(data, spike_in_rows = 1:3)
  • The imputation algorithm proposed in Tyanova et al. [1] was used to impute missing values before the performance of statistical methods: 1. Tyanova, S., Temu, T., Sinitcyn, P., Carlson, A., Hein, M.Y., Geiger, T., Mann,M., and Cox, J. (2016). The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 13, 731–740.

域名预订类型

snapnames、namejet、dropcatch、youdot、mediaon、namecatch、name、domainmonster、hexonet、xz、epik、pool、asiaregister、dynadot、pheenix2、backorder、godaddy、ename、flappy、hupo、hooyoo、Juming、west263、yijie、cndns、bizcn、zgsj、17ex、66cn、aliy-pre、ename-pre、xw、22cn、dcpre、epre、Other-Pre、Gname-Pre、190

  1. 前言

    过期删除域名中,会看到域名有多种预订类型:Delete、GD-Pre、Ename-Pre、West-Pre、Aliy-Pre、Xw-Pre、22CN-Pre、Xz-Pre、Drop-Pre、SN-Pre、E-Pre、Other-pre、Gname-Pre。

  2. 预订类型:删除Delete

    介绍说明:“Delete”是正常过期删除域名,域名因过期后未续费,从而进入删除列表;

    注册时间:等同于新注册,注册时间按抢注成功后重新计算;

    域名过户:得标后0-3天内会完成域名过户,特殊情况除外;

    得标后是否会被赎回:不会,正常删除后原所有者并无权限赎回。

  3. 预订类型:Gd-Pre

    介绍说明:“Gd-Pre”过期提前释放的域名,是属于某国外注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要10-20天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  4. 预订类型:Ename-Pre

    介绍说明:“Ename-Pre”过期提前释放的域名,属于易名平台内由于域名过期后但注册者未续费,注册商提前开放预订;

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要24-28天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  5. 预订类型:West-Pre

    介绍说明:“West-Pre”过期提前释放的域名,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要20-30天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  6. 预订类型:Aliy-Pre

    介绍说明:“Aliy-Pre”过期提前释放的域名,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:32-75天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  7. 预订类型:Xw-Pre

    介绍说明:“Xw-Pre”过期提前释放的域名,是属于某国内注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要35-42天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  8. 预订类型:22CN-Pre

    介绍说明:“22CN-Pre”提前释放拍卖域名,由用户提交或其它合作商过期提前释放拍卖的域名。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:付款完毕后30-35天完成过户,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  9. 预订类型:Xz-Pre

    介绍说明:“Xz-Pre”过期提前释放的域名,是国内某注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要15天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  10. 预订类型:Drop-Pre

    介绍说明:“Drop-Pre”过期提前释放的域名,是属于某国外注册商,由于域名过期后但注册者未续费,注册商提前开放预订。 注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要10天,特殊情况除外; 得标后是否会被赎回:被赎回的概率较小。

  11. 预订类型:SN-Pre

    介绍说明:“SN-Pre”过期提前释放的域名,是属于某国外注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要10天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  12. 预订类型:E-Pre

    介绍说明:E-PRE(易释放)易释放是易名平台用户自主提交预释放交易的交易类型,用户发布后,域名将同步到抢注平台进行预订竞价。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:付款后将会过户; 得标后是否会被赎回:被赎回的概率较小。

  13. 预订类型:Other-Pre

    介绍说明:“Other-Pre”过期提前释放的域名,是属于国内外多个注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:注册的时间不变,为得标前域名注册成功的时间;

    得标过户:需要15-42天,特殊情况除外;

    得标后是否会被赎回:被赎回的概率较小。

  14. 预订类型:Gname-Pre

    介绍说明:“Gname-Pre”过期提前释放的域名,是属于某国外注册商,由于域名过期后但注册者未续费,注册商提前开放预订。

    注册时间:域名注册时间与以前的不变;

    得标过户:需要0-20天,特殊情况除外; 得标后是否会被赎回:得标后30天内可能被赎回,概率比较小。

SnapNames 是一家域名回购服务公司,帮助个人和企业在域名过期并再次可供注册时获取域名。以下是 SnapNames 回购过程的一般操作方法:

域名回购下单:
    访问 SnapNames 网站。
    搜索所需的域名。
    如果该域名已被注册,但您预期它可能很快会变得可用(由于过期或其他原因),则可以对其进行回购下单。

监控并尝试注册:
    一旦您下了回购订单,SnapNames 会监控该域名。
    如果域名到期并被当前所有者释放,SnapNames 会尝试代您注册。这通常涉及快速、自动化的过程,以确保在其他人之前有更高的机会获取域名。

拍卖流程:
    如果多个 SnapNames 用户回购同一域名,且 SnapNames 成功注册了该域名,则该域名将进入拍卖。
    用户可以对该域名进行出价,拍卖结束时出价最高的用户将赢得该域名。

    在 SnapNames 的系统中,当一个域名被成功捕获后,如果有多个用户对同一域名进行了回购下单,该域名将进入拍卖。通常情况下,只有那些对该域名进行了回购下单的用户有资格参与该域名的拍卖。这意味着,如果您没有提前进行回购下单,您通常无法参与那个域名的拍卖。

    但是,请注意,每个域名后购服务的具体规定可能会有所不同,因此建议您直接查看 SnapNames 或其他域名后购服务的官方文档或与他们联系,以获取确切和最新的信息。

域名转移:
    如果 SnapNames 成功地获得了域名(您是唯一的回购者或拍卖中的获胜者),该域名将转移到您的账户。
    然后,您可以管理它、将其转移到另一个注册商,或设置网络托管和电子邮件等。

费用:
    通常只有在 SnapNames 成功为您获取域名时才会收费。如果 SnapNames 无法获取域名,您通常不会收到费用。

注意事项:
    成功并不是保证的。即使使用回购服务,也存在获取理想域名的竞争。其他的回购服务、个人或当前域名所有者(如果他们选择续约)可能在您之前获得域名。

如果您考虑使用 SnapNames 或类似的服务,请始终查看他们的服务条款和任何相关费用,以确保您了解过程和潜在成本。

SPANDx Genomic Profiling: Verifying LTtr and K331A RNASeq Mutations

  1. Set up the directory for raw data.

    #Replace "p600" with "control", "p602" with "LT", "p605" with "LTtr", and "p783" with "K331A". Please note that the RNAseq data from the LT_K331A_d8 replicates were sequenced alongside the ChIPseq batch.
    ln -s /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/Raw_Data_ChIPseq/230306_NB501882_0417_AHMVHHBGXN/2023_022_nf_denise/nf857/1_NHDF_Donor_1_p783_S1_R1_001.fastq.gz LT_K331A_d8_DonorI.fastq.gz 
    ln -s /home/jhuang/DATA/Data_Denise_LT_DNA_Bindung/Raw_Data_ChIPseq/230306_NB501882_0417_AHMVHHBGXN/2023_022_nf_denise/nf858/2_NHDF_Donor_2_p783_S2_R1_001.fastq.gz LT_K331A_d8_DonorII.fastq.gz
    mv V_8_2_4_p600_d8_DonorI.fastq.gz    control_d8_DonorI.fastq.gz
    mv V_8_2_3_p600_d8_DonorII.fastq.gz   control_d8_DonorII.fastq.gz
    mv V_8_4_2_p602_d8_DonorI.fastq.gz    LT_d8_DonorI.fastq.gz
    mv V_8_4_1_p602_d8_DonorII.fastq.gz   LT_d8_DonorII.fastq.gz
    mv V_8_2_4_p605_d8_DonorI.fastq.gz    LTtr_d8_DonorI.fastq.gz
    mv V_8_2_3_p605_d8_DonorII.fastq.gz   LTtr_d8_DonorII.fastq.gz
  2. Execute SPANDx to produce the mapping profile. For the GenBank file used, refer to the provided link.

    LT_wt.gbk

    conda activate spandx
    [genbank copying]
    mkdir ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/LT_wildtype
    cp LT_wt.gbk ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/data/LT_wildtype/genes.gbk
    vim ~/anaconda3/envs/spandx/share/snpeff-4.3.1t-5/snpEff.config
    /home/jhuang/anaconda3/envs/spandx/bin/snpEff build -genbank LT_wildtype      -d
    
    #The mutation in LTtr is located approximately at position 780, while the K331A mutation in LT is found near position 993 out of a total of 2454 nt.
    MDLVLNRKEREALCKLLEIAPNCYGNIPLMKAAFKRSCLKHHPDKGGNPVIMMELNTLWSKFQQNIHKLRSDFSMFDEVDEAPIYGTTKFKEWWRSGGFSFGKAYEYGPNPHGTNSRSRKPSSNASRGAPSGSSPPHSQSSSSGYGSFSASQASDSQSRGPDIPPEHHEEPTSSSGSSSREETTNSGRESSTPNGTSVPRNSSRTDGTWEDLFCDESLSSPEPPSSSEEPEEPPSSRSSPRQPPSSSAEEASSSQFTDEEYRSSSFTTPKTPPPFSRKRKFGGSRSSASSASSASFTSTPPKPKKNRETPVPTDFPIDLSDYLSHAVYSN
    TVSCFAIYTTSDKAIELYDKIEKFKVDFKSRHACELGCILLFITLSKHRVSAIKNFCSTFCTISFLICKGVNKMPEMYNNLCKPPYKLLQENKPLLNYEFQEKEKEASCNWNLVAEFACEYELDDHFIILAHYLDFAKPFPCQKCENRSRLKPHKAHEAHHSNAKLFYESKSQKTICQQAADTVLAKRRLEMLEMTRTEMLCKKFKKHLERLRDLDTIDLLYYMGGVAWYCCLFEEFEKKLQKIIQLLTENIPKYRNIWFKGPINSGKTSFAAALIDLLEGKALNINCPSDKLPFELGCALDKFMVVFEDVKGQNSLNKDLQPGQGINNLDNLRDHLDGAVAVSLEKKHVNKKHQIFPPCIVTANDYFIPKTLIARFSYTLHFSPKANLRDSLDQNMEIRKRRILQSGTTLLLCLIWCLPDTTFKPCLQEEIKNWKQILQSEISYGKFCQMIENVEAGQDPLLNILIEEEGPEETEETQDSGTFSQ* nextflow run spandx/main.nf –fastq “Raw_Data_RNAseq_K331A_d8_SPANDx/*.fastq.gz” –ref LT_wt.fasta –annotation –database LT_wildtype –pairing SE -resume
  3. Open the BAM files created in the previous step using IGV.

mutations_on_LT_IGV

Native elongating transcript sequencing (NET-seq)

2023-10-10 16:18:07 星期二

Net-seq(原位延伸转录测序)是一种用于以核苷酸分辨率跨基因组分析RNA聚合酶(Pol)活性的方法。以下是Net-seq的简要概述:

目的: Net-seq旨在捕获并测序仍与RNA聚合酶结合在一起的活跃转录过程中的RNA的3’末端。这有助于创建跨基因组的活跃转录位点的高分辨率图。

工作原理:

  • 从细胞中提取RNA聚合酶及其绑定的RNA。
  • 使用特定的方法仅选择与RNA聚合酶结合的RNA的3’端。
  • 对这些3’端进行测序,从而获得在转录过程中的RNA的精确位置。

这种技术为研究基因的精确转录动态提供了一个强大的工具。

BRD4, or Bromodomain-containing protein 4, is a member of the bromodomain and extraterminal (BET) family of proteins. Here’s a brief overview:

Function: BRD4 is involved in several crucial cellular processes:

  • Transcriptional Regulation: BRD4 plays a vital role in promoting the transcription of many genes by associating with acetylated chromatin. It’s particularly significant in the regulation of genes associated with the cell cycle and can recruit positive transcription elongation factor b (P-TEFb) to promote transcriptional elongation.
  • Cell Cycle Progression: BRD4 is essential for progressing through various cell cycle stages, especially the G1/S transition.

Medical Relevance: Given its role in transcription and cell cycle regulation, it’s not surprising that BRD4 has been implicated in cancer biology. Inhibitors targeting the bromodomains of BRD4, commonly known as BET inhibitors, are being explored as potential therapeutic agents for various cancers.

Structure: BRD4 contains two bromodomains. Bromodomains can recognize and bind acetylated lysine residues, which are common post-translational modifications on histones and other proteins.

Associated Diseases: Alterations, mutations, or aberrant expression of BRD4 has been implicated in several cancers, including acute myeloid leukemia (AML), midline carcinoma, and others.

Inhibitors: Several small molecules that inhibit BRD4 have been developed as potential therapeutic agents. Notably, drugs like JQ1 and I-BET762 bind to the bromodomains of BRD4, inhibiting its interaction with acetylated proteins.

BRD4 is an active area of research, and new findings related to its function, regulation, and potential as a therapeutic target are regularly published.

HiS-NET-seq profiles the active enhancer landscape Putative enhancers show enhancer activity HiS-NET-seq reveals 11,007 new transcribed putative enhancers in K562 cells providing a more complete view on active enhancers. ChIP-Rx vs HiS-NET-seq: BRD4 is required for genome-wide enhancer transcription. HiChIP (H3K27Ac) is for discovering Enhancer connecting Promoter Enhancer and traget gene transcription is regulated by BRD4 Identify other post-initiation transcription regulators ablation: 切除,风化,消融,除去,切除 Acute BRD4-loss impairs transcription at the 3′-end genes. Acute BRD4-loss impairs transcription at the 5′ and 3′-end genes. BET & BRD4 degradation prompt (激励,鼓动,提示) 3′-processing defects Readthrough gene How do the distinct molecular interactions mediated by BRD4’s bromodomains contribute to its diverse roles in cellular processes, and how can this knowledge be leveraged to develop more targeted and effective therapeutic strategies for BRD4-associated diseases?