Author Archives: gene_x

Unveiling Difference in Promoter regions of MeDIP data 2

  1. extract all genes by giving GO-terms

    # # Using Bioconductor in R
    # library(org.Mm.eg.db)
    # genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    # > genes
  2. extract annotation.gtf for mouse.

    • (Optional) UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: “Genes and Gene Predictions” track: “RefSeq Genes” (or another gene prediction track if you have a preference) output format: “GTF – gene transfer format” Click “get output”, and you can then save the result as a .gtf file.

    • (Optional) Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading. GENCODE:

    • (Used) Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      awk ‘OFS=”\t” {if ($3 == “gene”) print $1, $4, $4+1, $10, $7, $12, $14}’ gencode.vM25.annotation.gtf > tss_.bed #55401 (Note not under kate)

      grep “proteincoding” tss.bed > tss.bed #21859 records

  3. In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the “promoter region” can vary between studies and depends on the specific context. For the mouse genome (mm10), a commonly used range for the promoter region is:

    • (Used) -2000 to +500 bp relative to the TSS: This defines a 2.5 kb region centered around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene. Some studies might define a narrower region, such as:

    • (Optional) -1000 to +200 bp relative to the TSS Or even a broader one:

    • (Optional) -5000 to +1000 bp relative to the TSS The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

  4. Convert TSS positions to promoter regions (Note that $7 is SYMBOL, $4 ENSEMBL-ID).

    # For genes on the '+' strand:
    awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    # For genes on the '-' strand:
    awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    # Combine the two promoter files.
    cat promoters_plus.bed promoters_minus.bed > promoters.bed
    
    # #(Optional) Extract sequences for these promoter regions.
    # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
  5. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    #Control females:
    bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    #Control males:
    bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    #Stress females:
    bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    #Stress males:
    bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
  6. Parse and consolidate coverage files:

    # delete the 'chr positions '
    for file in ./promoters_coverage_*.txt; do
        awk '{$1=$2=$3=""; print substr($0, 4)}' $file > tmp.txt && mv tmp.txt $file
    done
    for file in ./promoters_coverage_*.txt; do
        sed 's/^ *//' $file > tmp.txt && mv tmp.txt $file
    done
    # text if they promoters_coverage_*.txt contain exactly the same set of genes (and in the same order),
    cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    for file in ./promoters_coverage_*.txt; do
        if ! cmp -s <(cut -d " " -f1 $file) reference_genes.txt; then
            echo "$file does not match reference!"
        fi
    done
    #merge the counts
    cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    for file in ./promoters_coverage_*.txt; do
        paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    done
    #add the sample names
    for file in ./promoters_coverage_*.txt; do
        basename $file | sed 's/promoters_coverage_//;s/.txt//'
    done > sample_names.txt
    echo -n "Gene " > header.txt
    paste -s -d ' ' sample_names.txt >> header.txt
    {
        cat header.txt
        cat read_counts_table.txt
    } > read_counts_table_with_header.txt
    
    #MANUALLY replace "\n" in the first line, DELETE " and "; 
    #DELETE the version of EnsembleID with the following command: 21860-1 records remaining.
    cp read_counts_table_with_header.txt temp
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt
    
    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_3")
    # Load the data
    count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    row.names(sampleInfo) <- sampleInfo$sampleID
    dds <- DESeqDataSetFromMatrix(
      countData = count_data,
      colData = sampleInfo,
      design = ~ condition
    )
    
    # Plot PCA
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    # Differential analysis
    dds$condition <- relevel(dds$condition, "control_female")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_female_vs_control_female")
    
    dds$condition <- relevel(dds$condition, "control_male")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_male_vs_control_male")
    
    library(biomaRt)
    listEnsembl()
    listMarts()
    listDatasets(ensembl)
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    datasets <- listDatasets(ensembl)
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    for (i in clist) {
      #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = gene_ids_no_version, 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
      #merge by column by common colunmn name, in the case "GENEID"
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    #~/Tools/csv2xls-0.4/csv_to_xls.py stress_female_vs_control_female-all.txt stress_female_vs_control_female-up.txt stress_female_vs_control_female-down.txt -d',' -o stress_female_vs_control_female.xls
    #~/Tools/csv2xls-0.4/csv_to_xls.py stress_male_vs_control_male-all.txt stress_male_vs_control_male-up.txt stress_male_vs_control_male-down.txt -d',' -o stress_male_vs_control_male.xls
    
    # Clustering the genes and draw heatmap for DEGs (limiting the genes with ENSEMBL by giving GO-term or give all genes via the file 'ids_')
    #install.packages("gplots")
    library("gplots")
    
    #(Optional) cat *.id | sort -u > ids  #add Gene_Id in the first line, delete the ""
    #(Used) cut -d$'\t' -f1 read_counts_table_with_header.txt > ids  #Rename the first line to Gene_Id
    
    GOI <- read.csv("ids")$Gene_Id  #21859 genes
    RNASeq.NoCellLine <- assay(rld)
    # Defining the custom order
    #column_order <- c(
    #  "uninfected.2h_r1", "uninfected.2h_r2", "uninfected.3h_r3", 
    #  "uninfected.3d_r1", "uninfected.3d_r2", 
    #  "1457-M10.2h_r1", "1457-M10.2h_r2", "1457-M10.2h_r3","1457.2h_r1", "1457.2h_r2", "1457.2h_r3",  
    #  "1457-M10.3d_r1", "1457-M10.3d_r2", "1457-M10.3d_r3","1457.3d_r1", "1457.3d_r2"
    #)
    #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    #head(RNASeq.NoCellLine_reordered)
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.01)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    png("DEGs_heatmap.png", width=1000, height=1010)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), margins=c(4,22),
                RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4, labRow="")
    dev.off()
    
    # Generate the table for cluster members
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    # Take all
    data <- as.data.frame(datamat)  #21859
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        #colnames(expression_values) <- colnames(data)
        colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
        # # Combine gene information and expression data
        # combined_result <- cbind(result, expression_values)
        # Fetch raw counts for the gene from count_data
        raw_counts <- count_data[gene, , drop = FALSE]
        colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
        # Combine gene information, expression data, and raw counts
        combined_result <- cbind(result, expression_values, raw_counts)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    write.csv(annotated_data, "annotated.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py annotated.csv -d',' -o raw_and_normalized_counts.xls

plot treeHeatmap using ggtree

ggtree_and_gheatmap

library(ggtree)
library(ggplot2)

setwd("/home/jhuang/DATA/Data_Luise_Epidome_batch3/plotTreeHeatmap/")

# -- edit tree --
#https://icytree.org/
info <- read.csv("typing_189.csv")
info$name <- info$Isolate
#tree <- read.tree("core_gene_alignment_fasttree_directly_from_186isoaltes.tree")  --> NOT GOOD!
tree <- read.tree("471.tree")
cols <- c(infection='purple2', commensalism='skyblue2')     

library(dplyr)
heatmapData2 <- info %>% select(Isolate, ST, Clonal.Complex, Phylotype)
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "lightcyan3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "azure3","blueviolet","darkgoldenrod",  "tomato","mediumpurple4","indianred", 
                      "cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta",     "cornflowerblue","darkgreen","red","tan","brown",      "darkgrey")
names(heatmap.colours) <- c("1","2","3","4","5", "6","7",   "20","21","22", "28","30","33","42","43","52","53", "66","68",    "100","105","124","133","134","135","137",     "159","161",    "CC1","CC2","CC3","CC4","CC5","CC6","CC30","CC72","CC77","Singleton",    "IA1","IA2","IB","II","III",    "NA")
#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))

#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + 
  geom_tippoint(aes(color=Type)) + 
  scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
#, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
#difference between geom_tiplab and geom_tiplab2?
#+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
#font.size=10, 
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

png("ggtree_and_gheatmap.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=0.1,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
dev.off()

Intragenic Methylation Patterns and CpG Island Dynamics: Insights into the Methylome

MeDIP (Methylated DNA Immunoprecipitation) is a technique used to enrich for methylated DNA sequences. When you obtain a lot of reads mapping to the gene body in your MeDIP data, it may indicate the presence of intragenic DNA methylation. Here are a few reasons and implications for observing such a pattern:

  1. Intragenic Methylation is Common: DNA methylation within the gene body (especially in exonic regions) is quite common in many organisms, including humans. Intragenic methylation is not as well-understood as promoter methylation, but it does occur frequently.

  2. Positive Correlation with Gene Expression: Contrary to promoter methylation, which is often associated with gene repression, gene body methylation in many cases has been correlated with active gene expression. The exact reason is still under investigation, but some studies suggest that methylation in the gene body might help in the prevention of spurious transcription initiation.

  3. Alternative Promoters: In some cases, genes have alternative promoters within introns. Methylation around these intragenic promoters can regulate the alternative splicing or transcription from these internal sites.

  4. Role in Splicing: Some studies suggest a role for intragenic DNA methylation in alternative splicing, where methylation might influence the choice of exons included in the mRNA.

  5. Technical Bias: It’s essential to ensure that the observed pattern is not due to some technical biases in the MeDIP protocol, library preparation, or sequencing. For instance, biases could arise from the inefficiency of the immunoprecipitation step or the fragmentation of DNA.

  6. Comparison with Other Data: It would be beneficial to compare MeDIP data with other datasets like bisulfite sequencing (which gives single-nucleotide resolution of methylation) or with RNA-seq data (to correlate gene body methylation with gene expression levels).

If you’re consistently observing heavy methylation in the gene bodies across your data, it might be interesting to further investigate the implications and potential regulatory roles of this methylation in your system of interest.

CpG islands are regions of DNA that have a high frequency of CpG sites (cytosine-phosphate-guanine sequences). Specifically, a CpG island is typically defined by the following criteria:

  • Length > 200 bp
  • GC content > 50%
  • Ratio of observed to expected CpG > 0.6

Here’s where CpG islands are typically located:

  1. Promoter Regions: Most CpG islands (around 70% in the human genome) are found in the promoter regions of genes. These regions are upstream of the transcription start site (TSS) and play a crucial role in the regulation of gene expression.

  2. Exons: Some CpG islands can be found within the exons of genes, both in coding sequences and untranslated regions (UTRs).

  3. Intergenic Regions: CpG islands can also be found in regions between genes, though this is less common than promoter-associated or exon-associated islands.

  4. Introns: Less frequently, CpG islands can be located in intronic regions.

The methylation status of CpG islands plays a significant role in gene regulation. When a CpG island in a gene’s promoter region is methylated, that gene is typically repressed or silenced. Conversely, unmethylated CpG islands in promoter regions are usually associated with actively transcribed genes. This pattern of methylation is often involved in cellular differentiation, genomic imprinting, X-chromosome inactivation, and is also implicated in various diseases, including cancer, when it becomes dysregulated.

Both “intergenic” and “intragenic” refer to locations in the genome, but they denote different regions. In essence, while “intragenic” pertains to regions within genes, “intergenic” pertains to regions between genes.

  • Intragenic:

    • Refers to the regions within a gene.
    • This encompasses all sequences that lie within the boundaries of a gene, including both exons (coding regions) and introns (non-coding regions).
    • Modifications or mutations in these regions can potentially affect the function, expression, or regulation of the gene.
  • Intergenic:

    • Refers to the regions between genes.
    • These are sequences that lie outside the boundaries of genes.
    • While these regions were once thought to be “junk” DNA with no function, it’s now understood that many intergenic regions play critical roles in regulating gene expression, among other functions. They may contain regulatory elements, non-coding RNA genes, or sequences that have yet to be characterized for function.

Unveiling Difference in Promoter regions of MeDIP data

  1. Select Genes of Interest: Once Emilia provides the GO term, use a database like Gene Ontology or Bioconductor’s “org.Hs.eg.db” package (assuming human genes) to fetch the list of genes associated with that GO term.

    # Using Bioconductor in R
    library(org.Mm.eg.db)
    genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    > genes
              GO EVIDENCE ONTOLOGY    SYMBOL
    1  GO:0048813      IDA       BP      Abi1
    2  GO:0048813      IGI       BP      Abl2
    3  GO:0048813      IMP       BP      Abl2
    4  GO:0048813      IMP       BP     Atp7a
    5  GO:0048813      IMP       BP   Cacna1a
    6  GO:0048813      IMP       BP    Camk2a
    7  GO:0048813      IMP       BP    Ctnna2
    8  GO:0048813      IMP       BP      Cdk5
    9  GO:0048813      IGI       BP     Dclk1
    10 GO:0048813      IGI       BP       Dcx
    11 GO:0048813      IMP       BP     Dscam
    12 GO:0048813      IMP       BP      Dvl1
    13 GO:0048813      IGI       BP      Fmn1
    14 GO:0048813      IMP       BP      Fmn1
    15 GO:0048813      IMP       BP       Fyn
    16 GO:0048813      IMP       BP      Hprt
    17 GO:0048813      IBA       BP      Sdc2
    18 GO:0048813      IMP       BP    Elavl4
    19 GO:0048813      IGI       BP     Itgb1
    20 GO:0048813      IMP       BP     Itgb1
    21 GO:0048813      IMP       BP      Lrp8
    22 GO:0048813      ISO       BP     Mef2a
    23 GO:0048813      IBA       BP      Map6
    24 GO:0048813      ISO       BP      Map6
    25 GO:0048813      IMP       BP   Slc11a2
    26 GO:0048813      IGI       BP      Rac1
    27 GO:0048813      IGI       BP     Rock2
    28 GO:0048813      IMP       BP    Sema3a
    29 GO:0048813      IMP       BP     Vldlr
    30 GO:0048813      IMP       BP     Mapk8
    31 GO:0048813      ISO       BP     Mink1
    32 GO:0048813      IMP       BP      Tet1
    33 GO:0048813      ISO       BP    Celsr2
    34 GO:0048813      IMP       BP   Cacna1f
    35 GO:0048813      IGI       BP     Dact1
    36 GO:0048813      IMP       BP     Dact1
    37 GO:0048813      IMP       BP  Mapk8ip2
    38 GO:0048813      ISO       BP     Trak1
    39 GO:0048813      IMP       BP      Rere
    40 GO:0048813      ISO       BP     Trak2
    41 GO:0048813      IBA       BP  Tmem106b
    42 GO:0048813      ISO       BP  Tmem106b
    43 GO:0048813      IMP       BP   Slitrk5
    44 GO:0048813      IMP       BP Kidins220
    45 GO:0048813      IMP       BP    Rbfox2
    46 GO:0048813      IMP       BP      Klf7
    47 GO:0048813      IMP       BP    Dtnbp1
    48 GO:0048813      IMP       BP     Prex2
    49 GO:0048813      IBA       BP    Dcdc2a
    50 GO:0048813      IGI       BP    Dcdc2a
    51 GO:0048813      ISO       BP     Farp1
    52 GO:0048813      ISO       BP      Lrp4
    53 GO:0048813      IBA       BP     Btbd3
    54 GO:0048813      IDA       BP     Btbd3
    55 GO:0048813      IBA       BP   Abitram
    56 GO:0048813      IMP       BP   Abitram
    57 GO:0048813      IMP       BP    Picalm
    58 GO:0048813      ISO       BP    Picalm
  2. Extract Promoter Regions: Typically, the promoter region can be defined as a certain number of base pairs (e.g., 1000-2000 bp) upstream of the transcription start site (TSS) of a gene. Tools like BEDTools can help you extract these regions from a genome reference.

    2.1 extract annotation.gtf for mouse.

    • UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: “Genes and Gene Predictions” track: “RefSeq Genes” (or another gene prediction track if you have a preference) output format: “GTF – gene transfer format” Click “get output”, and you can then save the result as a .gtf file.

    • Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading. GENCODE:

    • Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      awk ‘OFS=”\t” {if ($3 == “gene”) print $1, $4, $4+1, $10, $7, $12, $14}’ gencode.vM25.annotation.gtf > tss.bed

      grep “protein_coding” tss.bed | wc -l #21859 records

    In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the “promoter region” can vary between studies and depends on the specific context.

    For the mouse genome (mm10), a commonly used range for the promoter region is:

    -2000 to +500 bp relative to the TSS: This defines a 2.5 kb region centered around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene. Some studies might define a narrower region, such as:

    -1000 to +200 bp relative to the TSS Or even a broader one:

    -5000 to +1000 bp relative to the TSS The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

    For many standard analyses, the -2000 to +500 bp window is a reasonable compromise that captures a significant portion of the regulatory elements without being overly broad. However, always consider the biological question and the specifics of your dataset when choosing the appropriate range for promoter regions.

    #I used the two methods to extract promoter regions, why the result is different as shown below.
    #promoters.bed:chrM      -498    2002    "ENSMUSG00000064336.1";
    #promoters_.bed:chrM     -999    1       "ENSMUSG00000064336.1";
    
    # Convert TSS positions to promoter regions. Here, we're considering a promoter to be 1kb upstream of the TSS ($7 is SYMBOL, $4 ENSEMBL-ID).
    # For genes on the '+' strand:
    awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    # For genes on the '-' strand:
    awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    # Combine the two promoter files.
    cat promoters_plus.bed promoters_minus.bed > promoters.bed
    
    # ERROR --> NEED to be DEBUGGED!
    # # Initialize an empty promoters.bed file
    # > promoters.bed
    # # Loop through each line of tss.bed
    # while IFS=$'\t' read -r chr start end name strand; do
    #     if [[ $strand == "+" ]]; then
    #         # Adjust for '+' strand genes, set minimum start to 0
    #         adjusted_start=$(( $start - 2000 > 0 ? $start - 2000 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$(($start+500))\t$name" >> promoters.bed
    #     else
    #         # Adjust for '-' strand genes using the `end` column as the TSS, set minimum start to 0
    #         adjusted_start=$(( $end - 2500 > 0 ? $end - 2500 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$(($end-1000))\t$name" >> promoters.bed
    #     fi
    # done < tss.bed
    
    # # Extract sequences for these promoter regions.
    # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
    
    # # Using bedtools. Assuming a BED file with TSS positions (tss.bed)
    # bedtools flank -i tss.bed -g mouse.genome -l 1000 -r 0 > promoters.bed
  3. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    #Control females:
    bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    #Control males:
    bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    #Stress females:
    bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    #Stress males:
    bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
  4. Parse and consolidate coverage files:

    # delete the 'chr positions '
    for file in ./promoters_coverage_*.txt; do
        awk '{$1=$2=$3=""; print substr($0, 4)}' $file > tmp.txt && mv tmp.txt $file
    done
    for file in ./promoters_coverage_*.txt; do
        sed 's/^ *//' $file > tmp.txt && mv tmp.txt $file
    done
    
    # text if they promoters_coverage_*.txt contain exactly the same set of genes (and in the same order),
    cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    for file in ./promoters_coverage_*.txt; do
        if ! cmp -s <(cut -d " " -f1 $file) reference_genes.txt; then
            echo "$file does not match reference!"
        fi
    done
    
    #merge the counts
    cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    for file in ./promoters_coverage_*.txt; do
        paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    done
    
    #add the sample names
    for file in ./promoters_coverage_*.txt; do
        basename $file | sed 's/promoters_coverage_//;s/.txt//'
    done > sample_names.txt
    echo -n "Gene " > header.txt
    paste -s -d ' ' sample_names.txt >> header.txt
    {
        cat header.txt
        cat read_counts_table.txt
    } > read_counts_table_with_header.txt
    #Manually replace "\n" in the first line, DELETE " and "; 
    #Delete the version of EnsembleID
    cp read_counts_table_with_header.txt temp
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt 
    
    #grep "ENSMUSG00000104217.1" *.txt
    #Gene    1.0.1   1.0.2   1.0.3   1.1.2b  1.1.3   1.1.4   2.0.1b  2.0.2   2.0.3   2.1.1   2.1.2   2.1.3
    #promoters_coverage_1.0.1.txt:"ENSMUSG00000104217.1"; 9 347 2500 0.1388000
    #promoters_coverage_1.0.2.txt:"ENSMUSG00000104217.1"; 22 491 2500 0.1964000
    #promoters_coverage_1.0.3.txt:"ENSMUSG00000104217.1"; 18 525 2500 0.2100000
    #promoters_coverage_1.1.2b.txt:"ENSMUSG00000104217.1"; 5 255 2500 0.1020000
    #promoters_coverage_1.1.3.txt:"ENSMUSG00000104217.1"; 11 315 2500 0.1260000
    #promoters_coverage_1.1.4.txt:"ENSMUSG00000104217.1"; 8 300 2500 0.1200000
    #promoters_coverage_2.0.1b.txt:"ENSMUSG00000104217.1"; 18 375 2500 0.1500000
    #promoters_coverage_2.0.2.txt:"ENSMUSG00000104217.1"; 9 458 2500 0.1832000
    #promoters_coverage_2.0.3.txt:"ENSMUSG00000104217.1"; 8 430 2500 0.1720000
    #promoters_coverage_2.1.1.txt:"ENSMUSG00000104217.1"; 17 389 2500 0.1556000
    #promoters_coverage_2.1.2.txt:"ENSMUSG00000104217.1"; 17 428 2500 0.1712000
    #promoters_coverage_2.1.3.txt:"ENSMUSG00000104217.1"; 15 456 2500 0.1824000
    #read_counts_table.txt:"ENSMUSG00000104217.1";   9       22      18      5       11      8       18      9       8       17      17      15
    #read_counts_table_with_header.txt:"ENSMUSG00000104217.1"        9       22      18      5       11      8       18      9       8       17      17   15
    #reference_genes.txt:"ENSMUSG00000104217.1";
  5. Load into R and format for DESeq2:

    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_2")
    
    # Load the data
    count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    
    ## Define conditions (you can replace this with your experimental design)
    #conditions <- c(rep("control_female", 3), rep("control_male", 3), rep("stress_female", 3), rep("stress_male", 3))
    
    ## Create DESeq2 object
    #dds <- DESeqDataSetFromMatrix(countData = count_data, colData = data.frame(condition = conditions), design = ~condition)
    
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #55401-->41415
    #dds <- DESeq(dds)
    #rld <- rlogTransformation(dds)
    
    #dim(counts(dds))
    #head(counts(dds), 10)
    
    ## This is an example using DESeq2, and it assumes count data rather than coverage.
    ## Proper normalization might need more manual adjustments based on the nature of MeDIP-seq data.
    #library(DESeq2)
    #dds <- DESeqDataSetFromMatrix(countData = countMatrix, colData = sampleInfo, design = ~ condition)
    #dds <- DESeq(dds)
    #normalized_counts <- counts(dds, normalized=TRUE)
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
    
    #-------
    
    sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    row.names(sampleInfo) <- sampleInfo$sampleID
    
    library(DESeq2)
    dds <- DESeqDataSetFromMatrix(
      countData = count_data,
      colData = sampleInfo,
      design = ~ condition
    )
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
  6. differential analysis

    dds$condition <- relevel(dds$condition, "control_female")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_female_vs_control_female")
    
    dds$condition <- relevel(dds$condition, "control_male")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_male_vs_control_male")
    
    library(biomaRt)
    listEnsembl()
    listMarts()
    listDatasets(ensembl)
    
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    datasets <- listDatasets(ensembl)
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    for (i in clist) {
      #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = gene_ids_no_version, 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
      #merge by column by common colunmn name, in the case "GENEID"
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
  7. clustering the genes and draw heatmap (limiting the genes with ENSEMBL by giving GO-term)

    install.packages("gplots")
    library("gplots")
    
    # Combine the STEP 1
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090399", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090399.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090398", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090398.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:2000035", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:2000035.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0017145", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0017145.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002761", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002761.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002244", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002244.csv", genes)
    
    #write.csv(file="ids_", unique(genes$ENSEMBL))
    ##cut -d',' -f2 ids_ > ids
    ##add Gene_Id in the first line, delete the ""
    #GOI <- read.csv("ids")$Gene_Id  
    GOI <- unique(genes$ENSEMBL)
    RNASeq.NoCellLine <- assay(rld)
    # Defining the custom order
    #column_order <- c(
    #  "uninfected.2h_r1"
    #)
    #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    #head(RNASeq.NoCellLine_reordered)
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    GOI_clean <- GOI[GOI %in% rownames(RNASeq.NoCellLine) & !is.na(GOI)]
    #GOI_clean <- GOI_clean[GOI_clean != "ENSMUSG00000094658"]
    datamat <- RNASeq.NoCellLine[GOI_clean, ]
    #Remove Problematic Rows whose var is 0
    datamat <- datamat[!apply(datamat, 1, var) == 0, ]
    #datamat = RNASeq.NoCellLine[GOI, ]
    #rownames(datamat) <- genes$external_gene_name #DEBUG since genes has 58 records, datamat has only 48 records.
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.00)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    
    png("DEGs_heatmap.png", width=1000, height=1010)
    #labRow="",
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), margins=c(4,22),
                RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4)
    dev.off()
    
    # Extract rows from datamat where the row names match the identifiers in subset_1
    #### cluster members #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    #subset_2<-names(subset(mycl, mycl == '2'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    #subset_3<-names(subset(mycl, mycl == '3'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        #colnames(expression_values) <- colnames(data)
        colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
    
        # # Combine gene information and expression data
        # combined_result <- cbind(result, expression_values)
    
        # Fetch raw counts for the gene from count_data
        raw_counts <- count_data[gene, , drop = FALSE]
        colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
        # Combine gene information, expression data, and raw counts
        combined_result <- cbind(result, expression_values, raw_counts)
    
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o gene_clusters.xls
  8. Add pvalues of p-adjusted to the end of output using stress_female_vs_control_female-all.txt and stress_male_vs_control_male-all.txt

    #python3 run.py
    import pandas as pd
    
    # Load the male and female stress data
    female_df = pd.read_csv('stress_female_vs_control_female-all.txt')
    male_df = pd.read_csv('stress_male_vs_control_male-all.txt')
    
    # List of files to process
    files_to_process = [
        './raw_and_normalized_counts_GO:0090399.csv',
        './raw_and_normalized_counts_GO:0017145.csv',
        './raw_and_normalized_counts_GO:0090398.csv',
        './raw_and_normalized_counts_GO:0002761.csv',
        './raw_and_normalized_counts_GO:0002244.csv',
        './raw_and_normalized_counts_GO:2000035.csv'
    ]
    
    for filepath in files_to_process:
        # Load the raw data
        raw_df = pd.read_csv(filepath)
    
        # Extracting the pvalue and padj for female dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(female_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_female_stress_vs_control', 
                                        'padj': 'padj_female_stress_vs_control'})
    
        # Extracting the pvalue and padj for male dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(male_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_male_stress_vs_control', 
                                        'padj': 'padj_male_stress_vs_control'})
    
        # Drop the merged columns 'Unnamed: 0' from the final dataframe
        raw_df = raw_df.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y'])
    
        # Save the raw_df with the added pvalues and padj to a new CSV file
        output_filepath = filepath.replace('.csv', '_with_pvalues_and_padj.csv')
        raw_df.to_csv(output_filepath, index=False)
    
    #~/Tools/csv2xls-0.4/csv_to_xls.py raw_and_normalized*with_pvalues_and_padj.csv -d',' -o raw_and_normalized_with_pvalues_and_padj.xls

Merkel细胞多瘤病毒大T抗原(LT)、截短的大T抗原(LTtr; MCC350)和小T抗原(sT)对初级新生儿人类真皮成纤维细胞(nHDFs)中的mRNA水平的影响

1: 概要: 此部分讨论研究的优点/缺点、新颖性/意义、总体执行和学术性。

1. 研究了Merkel细胞多瘤病毒大T抗原(LT)、截短的大T抗原(LTtr; MCC350)和小T抗原(sT)对初级新生儿人类真皮成纤维细胞(nHDFs)中的mRNA水平的影响。病毒cDNA通过慢病毒(sT-IRES-mCherry)和LeGO-iG2(LT/LTtr - eGFP)转导,然后在第2天为荧光蛋白排序,然后在d3和d8进行分析。虽然LT诱导了DNA复制基因,但sT的表达似乎减少了干扰素相关的基因。ISGF3成分和下游靶标的mRNA水平都减少了,这意味着sT在上游发挥了其效应。值得注意的是,sT通过RT-qPCR和免疫荧光染色证明了IRF9 mRNA和蛋白的水平降低。为了验证sT对I型IFN信号基因的影响,他们在H1299和HEK293细胞中进行了启动子报告者测定。他们观察到,当与TBK1、IRF3、RIG-I、cGAS和STING共转染时,sT减少了IFN-B1报告者的活性。基于这些结果,作者提出sT干扰TBK1对IFN-B1报告者的信号。然而,除了启动子报告者测定外,没有直接证据表明sT干扰TBK1。图9中的卡通图暗示sT直接影响TBK1,但至多,下游有一种效应,TBK1似乎在sT存在时更不容易诱导转录反应。

2. 在这篇论文中,作者发现Merkel细胞多瘤病毒小肿瘤抗原(sT)可能通过在IFNB1上游,可能是在TBK1水平上,抑制干扰素刺激基因(ISG)的表达,从而有了一个潜在的新功能。他们通过转导新生儿人类真皮成纤维细胞(nHDFs)(一个可能的宿主细胞)与编码sT、大肿瘤抗原(LT)和一个来自病毒阳性Merkel细胞癌肿瘤的截短LT变种(LTtr)的慢病毒颗粒的组合,开发了一个新的MCPyV感染模型。他们在sT和LT以及sT和LTtr表示早期和晚期感染的两个不同时间点后,模拟肿瘤发生进行了RNA测序分析。他们发现了数百个与ST/LT/LTtr表达有关的差异表达基因,涉及不同的基因本体和生物过程。有趣的是,他们观察到sT导致I型IFN反应基因和先天免疫基因的显著下调。sT下调的基因与巨噬细胞中的ISGF3调节基因有很强的重叠,这表明sT抑制了ISGF3复合体的功能。他们接下来表明,sT减少了活跃转录的组蛋白标记(H3K4me3和H3K27Ac)在sT下调的先天免疫基因中,但对抑制性组蛋白标记(H3K27me3或H3K9me3)的影响很小,这表明sT阻止了这些基因的转录,而不是诱导表观遗传沉默。后续实验表明,sT在nHDFs中下调了ISGF3复合体的一个亚基IRF9的水平。通过在H1299和293细胞中进行的荧光素酶报告者测定实验,作者表明sT似乎不抑制对IFNB1刺激的ISGs的转录,这表明sT在IFNB1和ISGF3的上游起作用,与他们在nHDFs中的先前结果相矛盾。他们表明sT阻止cGAS-STING、RIG-I和TBK1介导的IFNB1-luciferase报告者的诱导,但不阻止组成活性的IRF3介导的诱导。他们得出结论,sT在TBK1水平上作用,下调IFNB1的表达,从而导致ISGs的下调。返回nHDF感染模型,作者观察到,LT在早期时间点瞬时上调IFNB1和ISGs,但在后期时间点没有持续。sT单独下调ISGs,在早期时间点减弱了LT介导的ISG诱导。作者得出结论,他们已经确定了sT的一个基本的新功能 - 在MCPyV感染期间阻止LT诱导的I型IFN调节基因。
论文以该领域当前知识的详细回顾开始,做得很好。该论文提出了一个有趣的观点,即sT对抗由LT引起的天生免疫应答,且评审者赞赏作者研究sT和LT单独和结合使用的努力。然而,存在几个主要问题,使得论文的结论值得质疑(参见第二部分)。该论文对sT如何下调ISGs的机械性理解非常有限且令人困惑。H1299/HEK293细胞的luciferase实验和nHDFs中的实验结果存在冲突,作者并未解决或解答这一问题。sT通过与NEMO结合调控NF-kB信号传导,这可能有助于ISG的下调,但作者并未提及。作者也未提及他们的LT构造可能表达的ALTO的贡献。因为目前需要进行多次主要的修订,所以这位评审者建议拒绝当前形式的论文。

3. 本研究探索了MCPyV(Merkel细胞多形性病毒)T抗原,特别是sT、LT和LTtr,对初生人皮肤成纤细胞(nHDFs)及随后的上皮源性细胞系的影响。现有的研究MCPyV的体外模型在捕获病毒生命周期的所有方面上都有很大的局限性。本研究使用慢病毒转导在nHDFs中过表达T抗原,以模拟初次感染以及在癌症中的整合后的早期和长期病毒基因表达,并研究了基因表达的变化。虽然sT的表达增强了细胞增殖和趋化相关基因,但它抑制了许多基因,包括LT和LTtr上调的一些基因。值得注意的是,它还抑制了I型干扰素(IFN)应答基因和MHC I类基因。这种效应与激活基因相关的改变的组蛋白标记相关。研究表明,sT对基因表达的影响与干扰素刺激基因因子3(ISGF3)复合物无关,主要在激酶TBK1的水平上破坏了I型IFN信号途径。研究有助于理解MCPyV对宿主细胞基因表达和免疫逃逸策略的影响。总体来说,这是一篇写得很好的稿件,对生成的数据进行了彻底的分析。然而,下面还有一些详细的担忧。
  1. 主要问题:为接受所需的关键实验: 请使用此部分详述应绝对要求的关键新实验或对现有实验的修改,以验证研究结论。

      • 图1C中的热图在每个时间点只显示一个复制品。不清楚进行了多少RNAseq复制品。每个病毒结构和控制是否都只进行了一次复制品?所有构造体都在两个供体中进行了测试吗?图S2G中的PCA图表显示两个供体细胞的转录谱之间存在很大的差异。观察到的基因表达差异(表1)是否反映了两个供体细胞?这些值是否反映了两个或更多的复制品?

      • 图5:H3K27me3和H3K9me3的差异似乎不是很大(5C),也许是因为这些标记的水平较低,所以5B中的相关图与低r值有些误导。或许列举sT下调的特定基因对于测试的每个组蛋白标记会更有说明性。图5C的颜色编码丢失了。执行了多少ChIP-seq复制品?供体1和2都进行了描述吗?

      • 图7:对于IFNB1启动子报告酶测定,sT或V5-sT似乎减少了TBK1、IRF3-5D(持续活跃)、cGAS-STING和RIG-I-N的共转染的影响。sT是否影响这些共转染蛋白的水平?TBK1、IRF3-5D、cGAS-STING和RIG-I-N的Western印迹可能会有用。尽管sT在IRF3-5D存在下的效果不显著(图7D),但活动呈下降趋势。这次测定的结果是否认为sT对TBK1的影响与IRF3没有显著区别?

      • 在图1C中,sT、LT、LTtr、sT+LT 和 sT+LTtr 的 RNAseq 数据并排展示,允许轻松比较不同基因和簇,以及它们是上调还是下调。令人惊讶的是,图1B中的印迹是在单独的印迹上运行的,这只能确认某种病毒蛋白是否被表达。无法比较sT在sT单独和sT+LT或sT+LTtr中的表达,限制了读者解释每种蛋白对基因表达变化的相对贡献的能力。审稿人建议作者在相同的凝胶上重复运行这些印迹,以所有条件和给定供体的时间点,以便可以对每种条件下的相对sT/LT表达进行比较。

      • 作者提到IFNB1受NF-kB信号调控,并引用了两篇显示sT调节NF-kB活性的文章(Griffiths等人,2013 & Abdul-Sada等人,2017)。然而,作者并没有评估sT调节NF-kB途径对sT调节IFNB1和ISGs的贡献。目前尚不清楚作者是否发现了sT的全新功能,还是他们的观察结果是sT抑制NF-kB信号的结果。应该使用不与PP4C/NEMO结合的sT突变体进行额外实验,以评估NF-kB信号对ISG由sT调节的贡献。

      • 作者提供了来自nHDFs、H1299和HEK293细胞的冲突数据,但在论文中并未解决。在图6A中,作者显示了nHDFs在3和8 dpt的STAT1和IRF9 mRNA水平显著下调,并在图6C中显示了8 dpt的IRF9蛋白下调。审稿人对于为什么在图6C中对未分类的nHDFs 2 dpt进行印迹感到困惑,因为这些细胞与图6A或图8中之前分析的任何其他时间点都不相关。审稿人希望看到nHDFs在3和8 dpt的STAT1,p-STAT1和IRF9印迹。到目前为止,图6中的数据并不支持结论“MCPyV sT抑制ISG转录而不针对ISGF3复合物”。所示数据表明,sT确实通过下调IRF9 mRNA和蛋白以及可能的STAT1来抑制ISGF3复合物的活性。还应该解决STAT2的水平。

      • 在图7A/B/C中,作者在H1299细胞中进行荧光素酶实验,观察到sT并不抑制对IFNB1治疗的hIFIT1-/hIFIT-2/mMX1-荧光素酶报告基因的下调。鉴于此前显示sT在nHDFs中下调ISGF3活性,这些结果令人惊讶。在图8中,作者展示了数据,表明sT在nHDFs的3和8 dpt中下调IFIT1,IFIT2和MX1。sT在H1299细胞中是否像在nHDFs中那样下调STAT1,p-STAT1或IRF9?sT对hIFIT1-/hIFIT-2/mMX1-荧光素酶报告基因在没有IFNB1的情况下有任何影响吗?审稿人认为H1299细胞并不是一个好的模型系统,以剖析sT如何调节ISGs。

      • 在图7D中,作者评估sT调节IFNB1启动子活性的能力,使用通过与持续活化的磷酸模拟IRF3 (IRF3-5D) 或wt TBK1共转染刺激的IFNB1-荧光素酶报告基因。他们观察到sT不阻止IRF3-5D激活IFNB1-荧光素酶报告基因,但确实阻止了TBK1介导的IFNB1-荧光素酶诱导。众所周知,TBK1可以激活IRF3和典型的(p65/RelA) NF-kB信号,IRF3和NF-kB都可以驱动IFNB1的表达。作者确实承认IFNB1启动子包含NF-kB结合位点。已经显示TBK1可以激活IKK复合物,sT已被显示可以通过与NEMO结合来结合并抑制IKK。审稿人希望看到sT与NF-kB信号相关的突变体对IFNB1启动子的影响,以确定本文的发现的新颖性和重要性。

      • 在Fig7F/G/H中,作者从H1299切换到293细胞,因为“cGAS-STING的过表达或持续活化的RIG-I-N在H1299细胞中没有导致足够的报告基因诱导”。这些发现再次加强了H1299细胞不是研究sT调节ISGs的好模型的观念。审稿人欣赏作者重复了实验,以测试sT对TBK1介导的IFNB1-荧光素酶在293细胞中的激活的影响,但想知道他们为什么没有重复IFIT1/IFIT2/MX1荧光素酶结果在293细胞中,并建议他们检查sT是否下调IRF9、STAT1和p-STAT1。

      • 最后,作者将注意力集中在sT和LT上,这是MCPyV早期蛋白中的两种。作者没有引用ALTO (Carter等人,2013)。已经显示ALTO是在LT的第二外显子上过度打印的,并在正常的病毒生命周期中被表达。Yang等人(2021)显示ALTO可以刺激各种荧光素酶报告构建,表明ALTO对基因转录和表达有影响。除非他们评估他们的LT和LTtr慈善病毒构建是否编码ALTO,否则关于LT对转录的影响的结论会受到质疑。

      • 总之,作者在重新提交之前需要解决的三个主要问题是:

        • sT是否针对ISGF3复合物,如果是,是如何做到的?在nHDFs中的数据表明它确实如此,但并没有解决如何下调IRF9和STAT1 mRNA。

        • sT是否通过一个与其通过结合NEMO调节NF-kB信号的能力无关的机制来下调ISGs?

        • LT和LTtr构建是否表达ALTO?如果是,那么LT刺激ISGs的结论可能是不准确的。

      • 该研究适当地指出,不同的细胞类型可能会对病毒蛋白产生不同的反应,特别是在研究的天然免疫应答,如BKPyV和JCPyV在各种细胞类型中所见。然而,该研究使用源于上皮的H1299和HEK293来研究RNA-测序的发现,却没有任何理由。后者细胞系还包括腺病毒E1A,这也可能影响这些途径的信号。从nHDFs中IRF9定位的代表性IF图像来看,sT存在与对照和未转导的情况相比,可能存在改变的定位,或者这是由于成像过程中曝光设置不一致导致的。考虑到该研究指出,“由于cGAS-STING的过表达或RIG-I-N的固有活性在H1299细胞中没有导致足够的报告基因诱导”,荧光素酶和IRF9实验应在更合适的细胞系中重复进行。

      • 虽然纤维细胞明显容易受到MCPyV感染,但这些是否是MCC的起源细胞类型尚不清楚。对于UV介导的、病毒阴性的MCC,基于MCC/SCC的联合形态与共享的主干突变,上皮细胞几乎被证实是前体。考虑到这一点,以及上面提到的实验已经在上皮源性细胞中进行,使用sT、LT和LTtr对初级角质细胞进行慈善病毒转导实验将更好地解决nHDFs的观察结果有多通用的问题。

3: 次要问题:编辑和数据呈现的修改:请在此部分提供编辑建议,以及对现有数据进行相对较小的修改以增强清晰度。

1.
    * 为供体1显示的生长曲线(图S2F)显示了LT +/- sT的缓慢生长率,尽管在转录组配置文件中没有观察到明显的差异。出乎意料的是,结合的sT和LTtr增长速度最慢。供体2观察到了类似的动力学吗?
* 图3A。火山图可以显示小T抗原的表达

2.
    * 该论文可以从额外的校对中受益,因为某些短语和标点符号使用的英语不是标准的。
    * 为什么在图7的荧光素酶实验中同时使用了sT-V5和sT?文本中没有解释。
    * 选择了一个trLT变体 - 它是否代表所有其他MCC trLTs?
    * 在所有的图中,标签应该得到改进,以便读者可以看到哪个细胞系和哪个T-抗原正在被表达,以及在什么时点收获细胞,而不必阅读图例。特别是,图3A(哪个构造,nHDF sT?),图7(哪个细胞?它在子图之间改变,所以需要额外说明在哪里使用了哪个细胞系)。
    * 图7F/G/H中的印迹被裁剪,使读者无法比较印迹之间的肌动蛋白表达水平。应该在同一张印迹上重新运行印迹,将样本放在上面。

3.
    * 第138-140行:句子笨拙且不清晰
    * 该研究应列出每个样本生成了多少读数。
    * 为什么MACS2和SICER用于不同的组蛋白ChIP seq?
    * 使用了所有软件的什么版本?
    * ChIP seq实验生成了复制品吗?

From epidome to treeHeatmap

  1. (namely epidome-processing step 2.6) Classification using 98% similarity.

    #Input Files: yycH_seqtab_nochim.csv and ../epidome/DB/yycH_ref_aln.fasta
    #Output Files: yycH_seqtab_ASV_seqs.fasta, yycH_seqtab_ASV_blast.txt, and yycH_seqtab.csv.classified.csv
    
    python3 ../epidome/scripts/ASV_blast_classification.py yycH_seqtab_nochim.csv yycH_seqtab_ASV_seqs.fasta ../epidome/DB/yycH_ref_aln.fasta yycH_seqtab_ASV_blast.txt yycH_seqtab.csv.classified.csv 98
    python3 ../epidome/scripts/ASV_blast_classification.py g216_seqtab_nochim.csv g216_seqtab_ASV_seqs.fasta ../epidome/DB/g216_ref_aln.fasta g216_seqtab_ASV_blast.txt g216_seqtab.csv.classified.csv 98
  2. Manually check the reults, ensuring no seq34;seq35 occurs.

    #Delete all records in *_seqtab.csv.classified.csv the alignment length < 448 (delete records 110 and 111) in g216, or < 440 (del record 90) in yycH.
    
    #Confirm the results
    jhuang@hamburg:/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98$ cut -d$'\t' -f4 g216_seqtab_ASV_blast.txt | sort -u
    446
    447
    448
    jhuang@hamburg:/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98$ cut -d$'\t' -f4 yycH_seqtab_ASV_blast.txt | sort -u
    475
    476
    
    #It is not neccesary to delete correponding records 90 in yycH_* and g216_seqtab_ASV_blast.txt, since the two files are not used. 
    #sed -i -e s/seq//g yycH_seqtab_ASV_blast.txt   #length=476
    #sed -i -e s/seq//g g216_seqtab_ASV_blast.txt   #length=448
    ##;-->""
    #sed -i -e s/';'//g yycH_seqtab_ASV_blast.txt
    #sed -i -e s/';'//g g216_seqtab_ASV_blast.txt
    
    cp g216_seqtab.csv.classified.csv g216_seqtab.csv.classified.backup
    cp yycH_seqtab.csv.classified.csv yycH_seqtab.csv.classified.backup
    
    # (Option 1): Processing *.classified.csv, ensuring the format seq24,21 or seq24.
    #https://github.com/ssi-dk/epidome/blob/master/example_data/190920_run1_G216_seqtab_from_dada2.csv.classified.csv
    ##DEBUG using LibreOffice, e.g. libreoffice --calc yycH_seqtab.csv.classified_noNA.csv after adding "ID"; at the corner.
    #Final goal: "seqseq31;,seq30;" --> "seq31,30"
    #;,seq --> ,
    sed -i -e s/";,seq"/","/g g216_seqtab.csv.classified.csv
    sed -i -e s/";,seq"/","/g yycH_seqtab.csv.classified.csv
    #;"; --> ";
    sed -i -e s/";\";"/"\";"/g g216_seqtab.csv.classified.csv
    sed -i -e s/";\";"/"\";"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/seqseq/seq/g g216_seqtab.csv.classified.csv
    sed -i -e s/seqseq/seq/g yycH_seqtab.csv.classified.csv
    #confirm!
    cut -d';' -f3 g216_seqtab.csv.classified.csv
    cut -d';' -f3 yycH_seqtab.csv.classified.csv
    
    # (Option 2): To reduce the unclassified, rename seq31,seq30 --> seq31 in g216,  seq37,seq36 --> seq36 in yycH.
    sed -i -e s/"seq40,seq31,seq30"/"seq40"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq31,seq30"/"seq31"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq5,seq2"/"seq2"/g        sed -i -e s/"seq40,seq31,seq30"/"seq40"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq31,seq30"/"seq31"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq5,seq2"/"seq2"/g g216_seqtab.csv.classified.csv
    grep ",seq" g216_seqtab.csv.classified.csv  #should return no record.
    #sed -i -e s/"seq22,20"/"seq20"/g g216_seqtab.csv.classified.csv
    #sed -i -e s/"seq24,21"/"seq21"/g g216_seqtab.csv.classified.csv
    #sed -i -e s/"seq21,17"/"seq21"/g g216_seqtab.csv.classified.csv
    > epidome_object$p1_seqs    #vim g216_seqtab.csv.classified_noNA.csv
      [1] "seq28"       "seq5"        "seq40"       "seq31,30"(seq31)     "seq14"      
      [6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    [11] "seq29"       "seq24"       "seq1"        "seq26"       "seq40"      
    [16] "seq21"       "seq18"       "seq40"       "seq16"       "seq37"      
    [21] "seq5,2"      "seq18"       "seq11"       "seq14"       "seq26"      
    [26] "seq31,30"(seq31)     "seq37"       "seq26"       "seq1"        "seq21"      
    [31] "seq3"        "seq40"       "seq26"       "seq28"       "seq40"      
    [36] "seq10"       "seq5"        "seq40,31,30"(seq40)  "seq37"       "seq37"      
    [41] "seq37"       "seq40"       "seq28"       "seq26"       "seq37"      
    [46] "seq22"       "seq28"       "seq19"       "seq5,2"      "seq26"      
    [51] "seq40"       "seq40"       "seq26"       "seq40"       "seq28"      
    [56] "seq28"       "seq28"       "seq37"       "seq26"       "seq40"      
    [61] "seq26"       "seq37"       "seq28"       "seq28"       "seq1"       
    [66] "seq28"       "seq5,2"(seq5)      "seq40"       "seq21"       "seq28"      
    [71] "seq29"       
    #"seq37"       "seq22"       "seq22,20"(seq20)    "seq31,30"(seq31)    
    [76] "seq22"       "seq20"       "seq28"       "seq37"       "seq14"      
    [81] "seq28"       "seq20"       "seq40"       "seq28"       "seq24,21" 
    [86] "seq40"       "seq14"       "seq31,30"(seq31)      "seq1"        "seq1"       
    [91] "seq31,30"(seq31)     "seq1"        "seq40"       "seq20"       "seq1"       
    [96] "seq21"       "seq26"       "seq20"       "seq20"       "seq21,17"   
    [101] "seq37"       "seq26"       "seq31,30"(seq31)      "seq31,30"(seq31)    "seq40"      
    [106] "seq5"        "seq12"             
    
    sed -i -e s/"seq37,seq36"/"seq36"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq58,seq25,seq21,seq10"/"seq21"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq23,seq22"/"seq23"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq65,seq63,seq61,seq60"/"seq63"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq65,seq63,seq53"/"seq63"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq61,seq55"/"seq55"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq34,seq28,seq25,seq10,seq9"/"seq34"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq25,seq10,seq9"/"seq9"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq25,seq10"/"seq10"/g yycH_seqtab.csv.classified.csv
    grep ",seq" yycH_seqtab.csv.classified.csv  #should return no record.
    > epidome_object$p2_seqs
    [1] "34"            "43"            "seq37,seq36"-->36         "42"           
    [5] "26"            "63"            "3"             "2"            
    [9] "11"            "seq65,seq63,seq53"-->63      "14"            "38"           
    [13] "55"            "3"             "65"            "seq58,seq25,seq21,seq10"-->21  
    [17] "seq23,seq22"-->23         "53"            "56"            "34"           
    [21] "62"            "14"            "35"            "seq61,seq55"-->55        
    [25] "52"            "seq25,seq10"-->10         "seq65,seq63,seq61,seq60"-->63   "21"           
    [29] "55"            "15"            "34"            "49"           
    [33] "19"            "63"            "63"            "34"           
    [37] "43"            "12"            "34"            "seq34,seq28,seq25,seq10,seq9"-->34
    [41] "43"            "61"            "25"            "34"           
    [45] "34"            "5"             "18"            "43"           
    [49] "62"            "34"            "19"            "63"           
    [53] "7"             
    --------------
    "seq37,seq36"-->36         "seq25,seq10,seq9"-->9       "56"           
    [57] "11"            "49"            "10"            "63"           
    [61] "seq37,seq36"-->36         "14"            "55"  
    #-------------
    "55"           
    [65] "14"            "37,36"         "43"            "45"           
    [69] "34"            "34"            "43"            "43"           
    [73] "25"            "14"            "65,64,61,56"   "15,9"         
    [77] "43"            "38"            "43"            "53"           
    [81] "11"            "9"             "34"            "15"           
    [85] "9"             "23,22"         "62"            "43"           
    [89] "37,36"
    
    # Common step after option 1 or option 2.
    grep -v ";NA;" g216_seqtab.csv.classified.csv > g216_seqtab.csv.classified_noNA.csv
    grep -v ";NA;" yycH_seqtab.csv.classified.csv > yycH_seqtab.csv.classified_noNA.csv
    
    #filtering the small ASV in which the number < 500: INPUT: *_noNA.csv; OUTPUT: *_seqtab.csv.classified_filtered.csv
    python3 g216_filtering.py  
        import pandas as pd
        # Load the file
        data = pd.read_csv('g216_seqtab.csv.classified_noNA.csv', delimiter=';')
        # Compute row sums excluding the first 2 columns
        data['row_sum'] = data.iloc[:, 2:].sum(axis=1)
        print(data.iloc[:, 2:])
        print(data.iloc[:, 2:].sum(axis=1))
        # Filter out rows where the sum is less than 100
        filtered_data = data[data['row_sum'] >= 500].drop(columns=['row_sum'])
        # Save the resulting dataframe to a new file
        filtered_data.to_csv('g216_seqtab.csv.classified_filtered.csv', sep=';', index=False)
    python3 yycH_filtering.py 
  3. draw plot from three amplicons: cutadapted_g216, cutadapted_yycH

    #Taxonomic database setup and classification
    #- Custom databases of all unique g216 and yycH target sequences can be found at https://github.com/ssi-dk/epidome/tree/master/DB. 
    #- We formatted our g216 and yycH gene databases to be compatible with DADA2’s assign-Taxonomy function and used it to classify the S. epidermidis ASVs with the RDP naive Bayesian classifier method (https://github.com/ssi-dk/epidome/tree/master/scripts).
    #- ST classification of samples was performed using the g216 target sequence as the primary identifier. 
    #- All g216 sequences unique to a single clonal cluster in the database were immediately classified as the matching clone, and in cases were the g216 sequence matched multiple clones, the secondary yycH target sequences were parsed to determine which clone was present. When this classification failed to resolve due to multiple potential combinations of sequences, ASVs were categorized as “Unclassified”. Similarly, g216 sequences not found in the database were labelled as “Novel”. 
    
    #install.packages("pls")
    #library(pls)
    #install.packages("reshape")
    #library(reshape)
    #install.packages("vegan")
    library('vegan') 
    library(scales)
    library(RColorBrewer)
    
    #~/Tools/csv2xls-0.4/csv_to_xls.py g216_seqtab.csv.classified_noNA.csv yycH_seqtab.csv.classified_noNA.csv -d$'\t' -o counts.xls
    
    setwd("/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98_2")
    #under r4-base
    source("../epidome/scripts/epidome_functions.R")
    
    ST_amplicon_table = read.table("../epidome/DB/epidome_ST_amplicon_frequencies.txt",sep = "\t")
    
    epi01_table = read.table("g216_seqtab.csv.classified_filtered.csv",sep = ";",header=TRUE,row.names=1)
    epi02_table = read.table("yycH_seqtab.csv.classified_filtered.csv",sep = ";",header=TRUE,row.names=1)
    #> sum(epi01_table$AH.LH)
    #[1] 78872
    #> sum(epi02_table$AH.LH)
    #[1] 86949
    
    metadata_table = read.table("../metadata.txt",header=TRUE,row.names=1)
    metadata_table$patient.ID <- factor(metadata_table$patient.ID, levels=c("P7","P8","P9","P10","P11","P12","P13","P14","P15","P16","P17","P18","P19","P20", "AH","AL","CB","HR","KK",  "MC","MR","PT2","RP","SA","XN"))
    
    #epidome_object = setup_epidome_object(epi01_table,epi02_table, metadata_table=metadata_table)
    primer1_table <- epi01_table
    primer2_table <- epi02_table
    #setup_epidome_object <- function(primer1_table,primer2_table,metadata_table) {
      #DEBUG: 3:ncol --> 2:ncol
      primer1_counts = primer1_table[,2:ncol(primer1_table)]
      primer2_counts = primer2_table[,2:ncol(primer2_table)]
      primer1_all_sample_names = colnames(primer1_counts)
      primer2_all_sample_names = colnames(primer2_counts)
      samples_with_both_primers = primer1_all_sample_names[which(primer1_all_sample_names %in% primer2_all_sample_names)]
      samples_missing_primer1_data = primer2_all_sample_names[which(!primer2_all_sample_names %in% primer1_all_sample_names)]
      samples_missing_primer2_data = primer1_all_sample_names[which(!primer1_all_sample_names %in% primer2_all_sample_names)]
      primer1_seqs = as.vector(primer1_table$Seq_number)
      primer2_seqs = as.vector(primer2_table$Seq_number)
      primer1_seqs[which(is.na(primer1_seqs))] = "seqUnclassified"
      primer2_seqs[which(is.na(primer2_seqs))] = "seqUnclassified"
      primer1_counts = primer1_table[,which(colnames(primer1_table) %in% samples_with_both_primers)]
      primer2_counts = primer2_table[,which(colnames(primer2_table) %in% samples_with_both_primers)]
      if (!missing(metadata_table)) {
        metadata_names = rownames(metadata_table)
        samples_missing_metadata = samples_with_both_primers[which(!samples_with_both_primers %in% metadata_names)]
        samples_missing_primer_data = metadata_names[which(!metadata_names %in% samples_with_both_primers)]
        include_names = metadata_names[which(metadata_names %in% samples_with_both_primers)]
        metadata_include = match(include_names,metadata_names)
        metadata_include = metadata_include[which(!is.na(metadata_include))]
        metadata_table = metadata_table[metadata_include,]
        #primer1_include = match(colnames(primer1_counts),include_names)
        primer1_include = match(include_names,colnames(primer1_counts))
        primer1_include = primer1_include[which(!is.na(primer1_include))]
        #primer2_include = match(colnames(primer2_counts),include_names)
        primer2_include = match(include_names,colnames(primer2_counts))
        primer2_include = primer2_include[which(!is.na(primer2_include))]
        epi1_table = primer1_counts[,primer1_include]
        epi2_table = primer2_counts[,primer2_include]
        epidome_object = list('p1_seqs'=primer1_seqs,'p1_table'=epi1_table,'p2_seqs'=primer2_seqs,'p2_table'=epi2_table,'metadata'=metadata_table,'sample_names'=include_names,'meta_variables'=colnames(metadata_table))
        print(paste0("Metadata loaded with ",length(metadata_names)," samples and ",ncol(metadata_table)," variables"))
        print(paste0(length(samples_missing_metadata)," samples are found in both primer sets but not in metadata: ",paste0(samples_missing_metadata,collapse = " ")))
        print(paste0(length(samples_missing_primer_data)," samples are found in metadata but is missing from one or both primer sets: ",paste0(samples_missing_primer_data,collapse = " ")))
        print(paste0(length(include_names)," samples are found in metadata and both tables and are included in epidome object"))
    
      } else {
        epi1_table = primer1_counts[,match(colnames(primer1_counts),samples_with_both_primers)]
        epi2_table = primer2_counts[,match(colnames(primer2_counts),samples_with_both_primers)]
        epidome_object = list('p1_seqs'=primer1_seqs,'p1_table'=epi1_table,'p2_seqs'=primer2_seqs,'p2_table'=epi2_table,'sample_names'=samples_with_both_primers)
        print(paste0("No metadata loaded"))
        print(paste0(length(samples_missing_primer2_data)," samples are found in p1 table but not in p2 table: ",paste0(samples_missing_primer2_data,collapse = " ")))
        print(paste0(length(samples_missing_primer1_data)," samples are found in p2 table but not in p1 table: ",paste0(samples_missing_primer1_data,collapse = " ")))
        print(paste0(length(samples_with_both_primers)," samples are found in both tables and are included in epidome object"))
      }
      #return(epidome_object)
    #}
    
    str(epidome_object)
    #List of 7
    # $ p1_seqs       : chr [1:57] "seq28" "seq5" "seq40" "seq31,30" ...
    # $ p1_table      :'data.frame':    57 obs. of  51 variables:
    # $ p2_seqs       : chr [1:53] "seq34" "seq43" "seq37,36" "seq42" ...
    # $ p2_table      :'data.frame':    53 obs. of  51 variables:
    # $ metadata      :'data.frame':    51 obs. of  4 variables:
    # $ sample_names  : chr [1:51] "P7.Nose" "P8.Nose" "P9.Nose" "P10.Nose" ...
    # $ meta_variables: chr [1:4] "patient.ID" "sample.site" "sample.type" "patient.sample.site"
    unique(sort(epidome_object$p1_seqs))  # 57
    #[1] "seq28"       "seq5"        "seq40"       "seq31,30"    "seq14"      
    #[6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    #[11] "seq29"       "seq24"       "seq1"        "seq26"       "seq40"      
    #[16] "seq21"       "seq18"       "seq40"       "seq16"       "seq37"      
    #[21] "seq5,2"      "seq18"       "seq11"       "seq14"       "seq26"      
    #[26] "seq31,30"    "seq37"       "seq26"       "seq1"        "seq21"      
    #[31] "seq3"        "seq40"       "seq26"       "seq28"       "seq40"      
    #[36] "seq10"       "seq5"        "seq40,31,30" "seq37"       "seq37"      
    #[41] "seq37"       "seq40"       "seq28"       "seq26"       "seq37"      
    #[46] "seq22"       "seq28"       "seq19"       "seq5,2"      "seq26"      
    #[51] "seq40"       "seq40"       "seq26"       "seq40"       "seq28"      
    #[56] "seq28"       "seq28"
    unique(sort(epidome_object$p2_seqs))
    
    #Image1
    primer_compare = compare_primer_output(epidome_object,color_variable = "sample.type")
    png("image1.png")
    primer_compare$plot
    dev.off()
    
    eo_ASV_combined = combine_ASVs_epidome(epidome_object)
    #> str(eo_ASV_combined)
    #List of 7
    #$ p1_seqs       : chr [1:21] "seq28" "seq5" "seq40" "seq31,30" ...
    #$ p1_table      :'data.frame': 21 obs. of  51 variables:
    #> eo_ASV_combined$p1_seqs  #21
    #[1] "seq28"       "seq5"        "seq40"       "seq31,30"    "seq14"      
    #[6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    #[11] "seq29"       "seq24"       "seq1"        "seq18"       "seq16"      
    #[16] "seq5,2"      "seq11"       "seq3"        "seq10"       "seq40,31,30"
    #[21] "seq19"
    #eo_filtered = filter_lowcount_samples_epidome(eo_ASV_combined,500,500)
    
    #Note that we run the following code instead run the method from the script due to modified code: 
    count_table_checked = classify_epidome(eo_ASV_combined,ST_amplicon_table)
    write.csv(count_table_checked, file="count_table_checked.csv")
    strict_classifier=FALSE
    #classify_epidome = function(epidome_object,ST_amplicon_table,strict_classifier=FALSE) {
    #Adapt the code to the true epi01 and epi02 values, please give the complete code:
    epidome_object_norm = normalize_epidome_object(eo_ASV_combined)
    p1_seqs = unlist(lapply(as.vector(eo_ASV_combined$p1_seqs),function(x) substr(x,4,nchar(x))))
    #> p1_seqs
    #[1] "28"       "5"        "40"       "31,30"    "14"       "20"      
    #[7] "22"       "26"       "37"       "21"       "29"       "24"      
    #[13] "1"        "26"       "40"       "21"       "18"       "40"      
    #[19] "16"       "37"       "5,2"      "18"       "11"       "14"      
    #[25] "26"       "31,30"    "37"       "26"       "1"        "21"      
    #[31] "3"        "40"       "26"       "28"       "40"       "10"      
    #[37] "5"        "40,31,30" "37"       "37"       "37"       "40"      
    #[43] "28"       "26"       "37"       "22"       "28"       "19"      
    #[49] "5,2"      "26"       "40"       "40"       "26"       "40"      
    #[55] "28"       "28"       "28" 
    p2_seqs = unlist(lapply(as.vector(eo_ASV_combined$p2_seqs),function(x) substr(x,4,nchar(x))))
    n_samples = length(eo_ASV_combined$sample_names)
    n_p1_seqs = length(p1_seqs)
    count_table = matrix(nrow = 0, ncol = n_samples,dimnames = list('ST'=c(),'Samples'=eo_ASV_combined$sample_names))
    match_type_table = matrix(nrow = n_p1_seqs, ncol = n_samples)
    count_table_names = c()
    unclassified_count_vec = rep(0,n_samples)
    g1_unclassified_count_vec = rep(0,n_samples)
    for (i in 1:n_p1_seqs) {
      #i=4: "31,30" --> length(p1_seq_split)>1 --> possible_STs==("-","-","8","215") --> Unclassified!
      p1_seq = p1_seqs[i]
      p1_seq_split = strsplit(p1_seq,',')[[1]]
      if (length(p1_seq_split)>1) {
        possible_STs = as.vector(ST_amplicon_table$ST)[which(ST_amplicon_table$epi01_ASV %in% p1_seq_split)]
        if (length(possible_STs)==1) {
          p1_seq = p1_seq_split[1]
        } else {
          p1_seq = "Unclassified"
        }
      }
      if (p1_seq != "Unclassified") {
        p1_seq_ST_table = ST_amplicon_table[which(ST_amplicon_table$epi01_ASV==p1_seq),]
        unique_p1_ASVs = unique(as.vector(p1_seq_ST_table$ST))
        p2_ASVs_matching_p1 = p1_seq_ST_table$epi02_ASV
        count_vec = rep(0,n_samples)
        for (j in 1:n_samples) {
          p1_percent = epidome_object_norm$p1_table[i,j]
          p1_count = eo_ASV_combined$p1_table[i,j]
          if (p1_percent > 0.01) {
            p2_seqs_present_within_difference_threshold_idx = which(epidome_object_norm$p2_table[,j]>(p1_percent-10) & epidome_object_norm$p2_table[,j]>1)
            p2_seqs_present_ASVs = p2_seqs[p2_seqs_present_within_difference_threshold_idx]
            p2_seqs_present_ASVs_matching_p1 = p2_seqs_present_ASVs[which(p2_seqs_present_ASVs %in% p2_ASVs_matching_p1)]
            p2_percent = epidome_object_norm$p2_table[which(p2_seqs %in% p2_seqs_present_ASVs_matching_p1),j]
            p2_count = eo_ASV_combined$p2_table[which(p2_seqs %in% p2_seqs_present_ASVs_matching_p1),j]
            if (length(p2_seqs_present_ASVs_matching_p1) == 0) {
              ST_order = order(p1_seq_ST_table$freq,decreasing = T)
              classification_group = as.vector(p1_seq_ST_table$ST)[ST_order[1]]
              if (classification_group %in% count_table_names) {
                classification_idx = which(count_table_names == classification_group)
                count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
              } else {
                count_vec = rep(0,n_samples)
                count_vec[j] = p1_count
                count_table = rbind(count_table,count_vec)
                count_table_names = c(count_table_names,classification_group)
              }
              #match_type_table[i,j] = "epi01 match without epi02 match"
              # Replace the match type with actual epi01 and epi02 values
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:NA", " ", classification_group)
            }
            else if (length(p2_seqs_present_ASVs_matching_p1) == 1) {
              p1_p2_seq_ST_table = p1_seq_ST_table[which(p1_seq_ST_table$epi02_ASV==p2_seqs_present_ASVs_matching_p1[1]),]
              ST_order = order(p1_p2_seq_ST_table$freq,decreasing = T)
              classification_group = as.vector(p1_p2_seq_ST_table$ST)[ST_order[1]]
              if (classification_group %in% count_table_names) {
                classification_idx = which(count_table_names == classification_group)
                count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
              } else {
                count_vec = rep(0,n_samples)
                count_vec[j] = p1_count
                count_table = rbind(count_table,count_vec)
                count_table_names = c(count_table_names,classification_group)
              }
              #match_type_table[i,j] = "Unique epi01 epi02 combination"
              # Replace the match type with actual epi01 and epi02 values
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:", p2_seqs_present_ASVs_matching_p1[1], " ", classification_group)
            } else {
              if (strict_classifier) {
                unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
              } else {
                p1_p2_seq_ST_table = p1_seq_ST_table[which(p1_seq_ST_table$epi02_ASV %in% p2_seqs_present_ASVs_matching_p1),]
                ST_order = order(p1_p2_seq_ST_table$freq,decreasing = T)
                classification_group = as.vector(p1_p2_seq_ST_table$ST)[ST_order[1]]
                if (classification_group %in% count_table_names) {
                  classification_idx = which(count_table_names == classification_group)
                  count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
                } else {
                  count_vec = rep(0,n_samples)
                  count_vec[j] = p1_count
                  count_table = rbind(count_table,count_vec)
                  count_table_names = c(count_table_names,classification_group)
                }
              }
    
              ##unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
              #match_type_table[i,j] = c("Non unique epi01 epi02 combination")
              # Concatenate all found epi02 values and replace the match type
              epi02_values = paste(p2_seqs_present_ASVs_matching_p1, collapse = ",")
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:", epi02_values, " ", classification_group)
            }
    
          } else {
            match_type_table[i,j] = "Low counts"
            unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
          }
    
        }
      } else {
        count_vec = as.numeric(as.vector(eo_ASV_combined$p1_table[i,]))
        unclassified_count_vec = unclassified_count_vec+count_vec
        g1_unclassified_count_vec = g1_unclassified_count_vec+count_vec
        match_type_vec = rep("Unclassified epi01 match",n_samples)
        match_type_table[i,] = match_type_vec
      }
    
    }
    count_table = rbind(count_table,unclassified_count_vec)
    rownames(count_table) = c(count_table_names,"Unclassified")
    count_table_df <- as.data.frame(count_table)
    write.csv(count_table_df, file="count_table_df.csv")
    #TODO: delete the last record "Unclassified" due to the few reads!
    #count_table_df = count_table_df[-32,]
    
    # --> FOR STEP4: prepare the file match_type_table.csv for treeHeatmap drawing.
    write.csv(match_type_table, file = "match_type_table.csv", row.names = TRUE)
    
    count_table_df_ordered = count_table_df[order(rowSums(count_table_df),decreasing = T),]
    row_sums <- apply(count_table_df_ordered, MARGIN=1, FUN=sum)
    #TODO: the ST in which the counts <= 500 not shown!
    #NOTE: ST225 is "epi01:3 epi02:3 225"!
    # Since "epi01:3 epi02:NA 225","epi01:3 epi02:NA 225" are 225 disappeared in the match_type_table.csv_____
    #Note that epi01 is g216, epi02 is yycH.
    #        215            -            8           59          130          297 
    #      611637       401179       341187       337409       316389       316273 
    #         331            2           73          384          200            5 
    #      178616       125734        99899        97885        93613        92385 
    #          14          218           23          278          487          290 
    #       71417        62467        60192        57213        53630        51454 
    #          87           89          100          329          153           83 
    #       49000        23390        21880        21635        19453        10539 
    #          19          170          225*         570          184           88 
    #        9465         1839         1487          863          437          318 
    #         114 Unclassified 
    #         108           11 
    #> row_sums
    #         215            -           59 Unclassified          297          130 
    #      611915       442192       359708       357868       335880       315734 
    #           5           73          331          384          200           14 
    #      145578       134301       127734       126465        94917        85440 
    #           2          218           23          278          290           87 
    #       85438        62492        60192        57213        51363        49000 
    #          89          100          329          210           83           19 
    #       23390        21880        21635        12663        10642         9465 
    #         170          225          570           88          114           10 
    #        1839         1487          863          318          108           91 
    # > unique(sort(eo_ASV_combined$p1_seqs))
    #  [1] "seq1"  "seq10" "seq11" "seq14" "seq16" "seq18" "seq19" "seq2"  "seq20"
    # [10] "seq21" "seq22" "seq24" "seq26" "seq28" "seq29" "seq3"  "seq31" "seq37"
    # [19] "seq40" "seq5" 
    # > unique(sort(eo_ASV_combined$p2_seqs))
    #  [1] "seq10" "seq11" "seq12" "seq14" "seq15" "seq18" "seq19" "seq2"  "seq21"
    # [10] "seq23" "seq25" "seq26" "seq3"  "seq34" "seq35" "seq36" "seq38" "seq42"
    # [19] "seq43" "seq49" "seq5"  "seq52" "seq53" "seq55" "seq56" "seq61" "seq62"
    # [28] "seq63" "seq65" "seq7"
    #row.names(count_table_df) <- c("ST215","ST130","ST278","ST200","ST184","ST5","ST59","ST83","ST487","ST114","ST8","ST297","ST153","ST384","ST2","ST14","ST89","ST570","-","ST290","ST331","ST73","ST88","ST100","ST87","ST23","ST218","ST329","ST19","ST225","ST170")
    #row.names(count_table_df) <- c("215","130","278","200","184","5","59","83","487","114","8","297","153","384","2","14","89","570","-","290","331","73","88","100","87","23","218","329","19","225","170")
    #row.names(count_table_df) <- c("ST215","ST130","ST278","ST200","ST5","ST59","ST83","ST114","ST297","ST384","ST14","ST89","ST210","-","ST328","ST331","ST73","ST2","ST88","ST100","ST10","ST290","ST87","ST23","ST218","ST329","ST19","ST225","ST170","Unclassified")
    
    col_order <- c("P7.Nose","P8.Nose","P9.Nose","P10.Nose","P11.Nose","P12.Nose","P13.Nose","P14.Nose","P15.Nose",     "P16.Foot","P17.Foot","P18.Foot","P19.Foot","P20.Foot","P16.Nose","P17.Nose","P18.Nose","P19.Nose","P20.Nose",    "AH.LH","AH.NLH","AH.Nose", "AL.LH","AL.NLH","AL.Nose", "CB.LH","CB.NLH","CB.Nose", "HR.LH","HR.NLH","HR.Nose", "KK.LH","KK.NLH","KK.Nose", "MC.LH","MC.NLH","MC.Nose", "MR.LH","MR.NLH","MR.Nose",  "PT2.LH","PT2.NLH","PT2.Nose", "RP.LH","RP.NLH","RP.Nose", "SA.LH","SA.NLH","SA.Nose", "XN.LH","XN.NLH","XN.Nose") 
    count_table_df_reordered <- count_table_df_ordered[,col_order]
    p = make_barplot_epidome(count_table_df_reordered,reorder=FALSE,normalize=TRUE)
    #p = make_barplot_epidome(count_table_df_reordered,reorder=TRUE,normalize=TRUE)
    png("Barplot_All.png", width=1600, height=900)
    p
    dev.off()
    
    #Change rowname from '-' to 'Novel'
    rownames(count_table_df_reordered)[rownames(count_table_df_reordered) == "-"] <- "Novel"
    write.csv(file="count_table_df_reordered.txt", count_table_df_reordered)
  4. prepare data for plotTreeHeatmap: table+tree.

      #replace "," to \n in match_type_table.csv
      sed 's/\",\"/\n/g' match_type_table.csv > match_type_table.csv_
      grep "epi02" match_type_table.csv_ > match_type_table.csv__
      grep -v "NA" match_type_table.csv__ > match_type_table.csv___
      #(Deprecated) awk -F' ' '{ split($2, a, ":"); split(a[2], b, ","); print $1 " epi02:" b[1] " " $3 }' match_type_table.csv___ > match_type_table.csv____
      #(Deprecated) sed -i -e 's/\"//g' match_type_table.csv____
      #(Deprecated) sort match_type_table.csv____ -u > match_type_table.csv_____
    
      #Manually select and merge items from match_type_table.csv___ and save it to match_type_table.csv____ (see content as below)!
          epi01:1 epi02:36 2
          epi01:3 epi02:3 225
          epi01:5 epi02:34 130
          epi01:5 epi02:3 278
          epi01:5 epi02:43 200
          epi01:5 epi02:36 184
          epi01:11 epi02:19 19
          epi01:14 epi02:34 297
          epi01:14 epi02:36 153
          epi01:16 epi02:56 329
          epi01:16 epi02:55 329
          epi01:18 epi02:14 218
          epi01:19 epi02:43 170
          epi01:2 epi02:15 -
          epi01:20 epi02:36 2
          epi01:20 epi02:43 14
          epi01:20 epi02:42 89
          epi01:20 epi02:34 384
          epi01:20 epi02:11 570
          epi01:21 epi02:34 100
          epi01:21 epi02:63 290
          epi01:24 epi02:38 23
          epi01:24 epi02:2 87
          epi01:26 epi02:34 -
          epi01:26 epi02:53 290
          epi01:26 epi02:63 331
          epi01:28 epi02:43 215
          epi01:29 epi02:43 215
          epi01:29 epi02:55 -
          epi01:31 epi02:43 8
          epi01:37 epi02:36 2
          epi01:37 epi02:26 88
          epi01:37 epi02:43 73
          epi01:40 epi02:2 5
          epi01:40 epi02:34 5
          epi01:40 epi02:34 59
          epi01:40 epi02:26 114
          epi01:40 epi02:43 83
          epi01:40 epi02:36 487
          #---- v2 ('-' three times, ST215, ST290, ST329, ST59 two times) --> delete 'epi01:28 epi02:43 215' ----
          epi01:1 epi02:34 -
          epi01:3 epi02:3 225
          epi01:5 epi02:3 278
          epi01:5 epi02:34 130
          epi01:5 epi02:43 200
          epi01:11 epi02:19 19
          epi01:14 epi02:34 297
          epi01:16 epi02:55 329
          epi01:16 epi02:56 329
          epi01:18 epi02:14 218
          epi01:19 epi02:43 170
          epi01:20 epi02:42 89
          epi01:20 epi02:43 14
          epi01:20 epi02:34 384
          epi01:20 epi02:11 570
          epi01:20 epi02:3 210
          epi01:21 epi02:38 10
          epi01:21 epi02:34 100
          epi01:21 epi02:63 290
          epi01:24 epi02:38 23
          epi01:24 epi02:2 87
          epi01:26 epi02:34 -
          epi01:26 epi02:53 290
          epi01:26 epi02:63 331
          epi01:29 epi02:43 215
          epi01:29 epi02:55 -
          epi01:37 epi02:62 2
          epi01:37 epi02:26 88
          epi01:37 epi02:43 73
          epi01:37 epi02:34 59
          epi01:40 epi02:2 5
          epi01:40 epi02:26 114
          epi01:40 epi02:43 83
          epi01:40 epi02:34 59
    
    python3 concat_epi01_epi02.py
        # Define the function to extract fasta sequences
        def fasta_to_dict(filename):
            sequences = {}
            header = None
            sequence = []
            with open(filename, 'r') as file:
                for line in file:
                    line = line.strip()
                    if line.startswith('>'):
                        if header:
                            sequences[header] = ''.join(sequence)
                        header = line[1:]
                        sequence = []
                    else:
                        sequence.append(line)
                if header:
                    sequences[header] = ''.join(sequence)
            return sequences
        # Read combinations from match_type_table.csv____
        combinations = []
        with open("match_type_table.csv____", 'r') as csv_file:
            for line in csv_file:
                parts = line.strip().split()
                epi01 = parts[0].split(":")[1]
                epi02 = parts[1].split(":")[1]
                st = parts[2]
                combinations.append((epi01, epi02, st))
    
        # Read both fasta files into dictionaries
        g216_sequences = fasta_to_dict("../epidome/DB/epi01_ref_aln.fasta")
        yycH_sequences = fasta_to_dict("../epidome/DB/epi02_ref_aln.fasta")
    
        output_filename = "concatenated_sequences.fasta"
        with open(output_filename, 'w') as output_file:
            for epi01, epi02, st in combinations:
                if epi01 in g216_sequences and epi02 in yycH_sequences:
                    concatenated_seq = g216_sequences[epi01] + "NNN" + yycH_sequences[epi02]
                    header = f">g216-{epi01}_yycH-{epi02}_ST{st}"
                    output_file.write(f"{header}\n{concatenated_seq}\n")
    
        print(f"Concatenated sequences saved to {output_filename}")
    
    sed -i -e 's/ST-/-/g' concatenated_sequences.fasta
    muscle -in concatenated_sequences.fasta -out aligned.fasta
    muscle -clw -in concatenated_sequences.fasta -out aligned.clw
    FastTree -nt < aligned.fasta > concatenated_sequences.tree
    # Run plotTree.R to draw ggtree_and_gheatmap.png.
    
    library(ggtree)
    library(ggplot2)
    
    setwd("/home/jhuang/DATA/Data_Luise_Epidome_batch3/run_2023_98_2/")
    
    # -- edit tree --
    #https://icytree.org/
    info <- read.csv("typing_34.csv")
    info$name <- info$Isolate
    tree <- read.tree("concatenated_sequences.tree")
    #cols <- c(infection='purple2', commensalism='skyblue2')     
    
    library(dplyr)
    heatmapData2 <- info %>% select(Isolate,  g216, ST)
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
    #"g216:16","g216:18","g216:19","g216:20"(4),"g216:21"(2),"g216:24"(1),"g216:26"(2)
    heatmap.colours <- c("darkgrey",   "palegreen","seagreen","seagreen","seagreen4","seagreen4","yellowgreen",    "orange4","orange4","orange4","orange4","orange4",    "dodgerblue3","blue4","dodgerblue3","dodgerblue3",  "azure4",    "magenta2","maroon2","mediumorchid1","mediumorchid1","mediumorchid1","mediumorchid1","mediumorchid1", "mediumorchid3","mediumorchid3","mediumorchid3","mediumpurple1","mediumpurple4","mediumpurple4",         "red4",                                            "olivedrab3","olivedrab4","seagreen","seagreen4","yellowgreen",    "orange4",    "blue4","dodgerblue3",           "magenta2","maroon2","mediumorchid1", "mediumorchid3","mediumpurple1","mediumpurple4",         "red4")
    names(heatmap.colours) <- c("-",    "ST225","ST130","ST278","ST19","ST200", "ST297",                                             "ST5","ST59","ST114","ST83","ST487",   "ST2","ST215","ST73","ST88",   "ST8",                                                               "ST329","ST170", "ST89","ST14","ST570","ST384","ST210",                  "ST10","ST100","ST290","ST23","ST87", "ST331",     "ST218",      "g216:1","g216:3","g216:5","g216:11","g216:14",   "g216:40",   "g216:29","g216:37",        "g216:16","g216:19","g216:20","g216:21","g216:24","g216:26",    "g216:18")
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    
    #circular
    #geom_tippoint(aes(color=Type)) + scale_color_manual(values=cols) + 
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tiplab2(aes(label=name), offset=1)
    #, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
    #difference between geom_tiplab and geom_tiplab2?
    #+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
    #font.size=10, 
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    desired_order <- c("-",    "ST225","ST130","ST278","ST19","ST200", "ST297",                                             "ST5","ST59","ST114","ST83","ST487",   "ST2","ST215","ST73","ST88",   "ST8",                                                               "ST329","ST170", "ST89","ST14","ST570","ST384","ST210",                  "ST10","ST100","ST290","ST23","ST87", "ST331",     "ST218",      "g216:1","g216:3","g216:5","g216:11","g216:14",   "g216:40",   "g216:29","g216:37",        "g216:16","g216:19","g216:20","g216:21","g216:24","g216:26",    "g216:18")
    
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=0.15,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 9) + scale_fill_manual(values=heatmap.colours, breaks=desired_order) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
    dev.off()
  5. prepre the table for the alpha diversity calculation

    To calculate the alpha diversity values for each row based on the provided metrics (chao1, observed_otus, shannon, PD_whole_tree), you’ll typically use specialized software packages such as QIIME or R packages like vegan. Here, I can provide a high-level approach and calculations for some of the metrics. However, some of the metrics like PD_whole_tree require phylogenetic trees, which can’t be derived from the table you provided.

    import pandas as pd
    import numpy as np
    
    # Read the data from CSV
    df = pd.read_csv('count_table_df_reordered.csv', index_col=0)
    
    def chao1(s):
    s = np.array(s)
    n = np.sum(s)
    s_obs = np.count_nonzero(s)
    s1 = np.sum(s == 1)
    s2 = np.sum(s == 2)
    return s_obs + ((s1 ** 2) / (2 * s2)) if s2 != 0 else s_obs
    
    def observed_sts(s):
    return np.count_nonzero(s)
    
    def shannon2(s):
    s = np.array(s)
    p = s / np.sum(s)
    return -np.sum(p * np.log(p + 1e-10))  # added small value to avoid log(0)
    
    def shannon(s):
    s = np.array(s)
    total = np.sum(s)
    p = s / total  # Convert absolute counts to proportions
    return -np.sum(p * np.log(p + 1e-10))  # added small value to avoid log(0)
    
    # Calculate the metrics for each sample (column)
    results = {
    'sample': [],
    'chao1': [],
    'observed_sts': [],
    'shannon': []
    }
    
    for column in df.columns:
    results['sample'].append(column)
    results['chao1'].append(chao1(df[column]))
    results['observed_sts'].append(observed_sts(df[column]))
    results['shannon'].append(shannon(df[column]))
    
    # Convert results to DataFrame and save to a new CSV
    result_df = pd.DataFrame(results)
    result_df.to_csv('alpha_diversity_metrics_samples.csv', index=False)

    For PD_whole_tree, you would require a phylogenetic tree of the OTUs and then compute the sum of branch lengths. This is more complex and requires specialized software and the phylogenetic tree information, which isn’t present in the table you provided.

    The input data being absolute counts versus relative abundance does influence certain alpha diversity metrics. Let’s re-evaluate:

    • Observed OTUs: Whether the data is in absolute counts or relative abundance doesn’t impact the “Observed OTUs” metric. It’s simply counting how many OTUs have a non-zero count.

    • Chao1: Chao1 is intended for absolute counts. The metric is based on the number of singletons and doubletons, which are derived from raw counts and not relative abundances. Hence, the calculation remains accurate.

    • Shannon Diversity Index: The Shannon Diversity Index is based on the relative abundances of OTUs, not their absolute counts. The formula requires the proportion of each species (or OTU) in the sample. Thus, you’d need to convert absolute counts to proportions (or relative abundance) before calculating this metric.

  6. calculate the alpha diversity in Phyloseq.Rmd. The code for the calculation is as follows.

    \pagebreak
    
    # Alpha diversity for ST
    Plot Shannon index and observed STs. 
    Regroup together samples from the same group.
    
    ```{r, echo=TRUE, warning=FALSE}
    hmp.div_st <- read.csv("alpha_diversity_metrics_samples.csv", sep=",") 
    colnames(hmp.div_st) <- c("sample", "chao1", "observed_sts", "shannon")
    row.names(hmp.div_st) <- hmp.div_st$sample
    div.df <- merge(hmp.div_st, hmp.meta, by.x="sample", by.y = "sam_name")
    div.df2 <- div.df[, c("Group", "shannon", "observed_sts")]
    colnames(div.df2) <- c("Group", "Shannon", "ST")
    #colnames(div.df2)
    options(max.print=999999)
    
    stat.test.Shannon <- compare_means(
    Shannon ~ Group, data = div.df2,
    method = "t.test"
    )
    knitr::kable(stat.test.Shannon) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
    
    stat.test.ST <- compare_means(
    ST ~ Group, data = div.df2,
    method = "t.test"
    )
    knitr::kable(stat.test.ST) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
    
    div_df_melt <- reshape2::melt(div.df2)
    
    p <- ggboxplot(div_df_melt, x = "Group", y = "value",
                  facet.by = "variable", 
                  scales = "free",
                  width = 0.5,
                  fill = "gray", legend= "right")
    #ggpar(p, xlab = FALSE, ylab = FALSE)
    lev <- levels(factor(div_df_melt$Group)) # get the variables
    L.pairs <- combn(seq_along(lev), 2, simplify = FALSE, FUN = function(i) lev[i]) #%>% filter(p.signif != "ns")
    my_stat_compare_means  <- function (mapping = NULL, data = NULL, method = NULL, paired = FALSE, 
        method.args = list(), ref.group = NULL, comparisons = NULL, 
        hide.ns = FALSE, label.sep = ", ", label = NULL, label.x.npc = "left", 
        label.y.npc = "top", label.x = NULL, label.y = NULL, tip.length = 0.03, 
        symnum.args = list(), geom = "text", position = "identity", 
        na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...) 
    {
        if (!is.null(comparisons)) {
            method.info <- ggpubr:::.method_info(method)
            method <- method.info$method
            method.args <- ggpubr:::.add_item(method.args, paired = paired)
            if (method == "wilcox.test") 
                method.args$exact <- FALSE
            pms <- list(...)
            size <- ifelse(is.null(pms$size), 0.3, pms$size)
            color <- ifelse(is.null(pms$color), "black", pms$color)
            map_signif_level <- FALSE
            if (is.null(label)) 
                label <- "p.format"
            if (ggpubr:::.is_p.signif_in_mapping(mapping) | (label %in% "p.signif")) {
                if (ggpubr:::.is_empty(symnum.args)) {
                    map_signif_level <- c(`****` = 1e-04, `***` = 0.001, 
                      `**` = 0.01, `*` = 0.05, ns = 1)
                } else {
                  map_signif_level <- symnum.args
                } 
                if (hide.ns) 
                    names(map_signif_level)[5] <- " "
            }
            step_increase <- ifelse(is.null(label.y), 0.12, 0)
            ggsignif::geom_signif(comparisons = comparisons, y_position = label.y, 
                test = method, test.args = method.args, step_increase = step_increase, 
                size = size, color = color, map_signif_level = map_signif_level, 
                tip_length = tip.length, data = data)
        } else {
            mapping <- ggpubr:::.update_mapping(mapping, label)
            layer(stat = StatCompareMeans, data = data, mapping = mapping, 
                geom = geom, position = position, show.legend = show.legend, 
                inherit.aes = inherit.aes, params = list(label.x.npc = label.x.npc, 
                    label.y.npc = label.y.npc, label.x = label.x, 
                    label.y = label.y, label.sep = label.sep, method = method, 
                    method.args = method.args, paired = paired, ref.group = ref.group, 
                    symnum.args = symnum.args, hide.ns = hide.ns, 
                    na.rm = na.rm, ...))
        }
    }
    
    p3 <- p + 
      stat_compare_means(
        method="t.test",
        comparisons = list(c("P16-P20.Foot", "P16-P20.Nose"), c("AH-XN.LH", "AH-XN.Nose"), c("AH-XN.NLH", "AH-XN.Nose")), 
        label = "p.signif",
        symnum.args <- list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns")),
      ) +
      theme(axis.text.x = element_text(angle = 30, hjust = 1))  # add this line to rotate x-axis text
    print(p3)
    
    ggsave("./figures/alpha_diversity_ST_Group2.png", device="png", height = 10, width = 12)
    ggsave("./figures/alpha_diversity_ST_Group2.svg", device="svg", height = 10, width = 12)
    ```

细胞分子的降解机制与cDNA的制备及用途

  1. mRNA、DNA 和蛋白质都是细胞内的重要分子,它们会在一定的时候被降解以维持细胞的稳态。以下是它们在人体中被降解的方式:

    • mRNA降解:

      • 5′->3′ 衰减:这是在真核生物中最常见的mRNA降解途径。当mRNA分子的5’端被去除帽子结构时,一种称为Xrn1的酶开始从5’端逐渐降解mRNA。
      • 3′->5′ 衰减:多蛋白复合体,如核糖核酸酶体(exosome complex)可以从3’端逐渐降解mRNA。
      • 内部切割:一些内部RNase可以特异性地识别并切割mRNA,从而导致其降解。
    • DNA降解:

      • DNA的降解主要是为了修复损伤或者进行细胞程序性死亡。
      • 细胞中有许多DNA修复酶,它们可以修复DNA损伤。当DNA受到损伤时,某些酶可以识别并去除损伤部位,然后由其他酶填充并连接这些缺失部位。
      • 在程序性细胞死亡(如凋亡)中,DNA会被内部的核酸酶切割,导致细胞内容的泄漏并促进细胞的死亡。
    • 蛋白质降解:

      • 蛋白酶体途径:蛋白酶体是一个大型的蛋白复合体,可以识别并降解标记有泛素的蛋白质。被泛素化的蛋白质会被蛋白酶体识别并降解为小的多肽片段。
      • 溶酶体途径:溶酶体是含有各种酶的酸性囊泡。蛋白质可以被送入溶酶体并在那里被降解。
      • 自噬:细胞可以通过一个叫做自噬的过程降解其自身的组分。在这个过程中,细胞形成一个围绕蛋白质或细胞器的膜囊,然后与溶酶体融合并降解其内容物。
  2. mRNA, DNA, and proteins are crucial molecules within cells. They undergo degradation to maintain cellular homeostasis. Here’s how they are degraded in the human body:

    • mRNA degradation:

      • 5′->3′ decay: This is the most common mRNA degradation pathway in eukaryotes. When the cap structure at the 5′ end of an mRNA molecule is removed, an enzyme called Xrn1 begins to degrade the mRNA progressively from the 5′ end.
      • 3′->5′ decay: Multi-protein complexes, like the exosome complex, can degrade mRNA progressively from the 3′ end.
      • Endonucleolytic cleavage: Some internal RNases can specifically recognize and cleave mRNAs, leading to their degradation.
    • DNA degradation:

      • DNA degradation is primarily for damage repair or programmed cell death.
      • Many DNA repair enzymes exist in cells. When DNA is damaged, certain enzymes can identify and remove the damaged parts, which are then filled and ligated by other enzymes.
      • In programmed cell death (such as apoptosis), DNA is cleaved by internal nucleases, leading to the leakage of cellular contents and promoting cell death.
    • Protein degradation:

      • Proteasome pathway: The proteasome is a large protein complex that identifies and degrades proteins marked with ubiquitin. Ubiquitinated proteins are recognized and degraded into short peptides by the proteasome.
      • Lysosomal pathway: Lysosomes are acidic vesicles containing various enzymes. Proteins can be delivered to lysosomes and degraded there.
      • Autophagy: Cells can degrade their components through a process called autophagy. Here, the cell forms a membranous sac around proteins or organelles, which then fuses with a lysosome and degrades its contents.
  3. cDNA(complementary DNA)是通过反转录酶从mRNA模板上合成的双链DNA。它是mRNA的补足DNA,因此其序列与mRNA的编码区域相对应,但不包括内含子。

    • cDNA的制备过程如下:

      • 反转录:首先,将纯化的mRNA和一个短的寡腺苷酸引物(通常是多T引物,也称为oligo(dT)引物)混合。这个引物能与mRNA的多A尾结合。
      • 使用反转录酶,这个引物起始合成一个cDNA的单链。
      • 该单链cDNA可以被用作模板,利用DNA聚合酶进行第二条链的合成,从而得到双链cDNA。
    • cDNA在分子生物学研究中有许多用途:

      • cDNA图书馆:由于cDNA是由mRNA模板合成的,因此只代表被转录的基因。通过制备特定组织或细胞类型的cDNA图书馆,研究者可以确定哪些基因在特定条件下被表达。
      • 克隆与表达:cDNA可以被克隆到表达载体中,然后在宿主细胞中产生蛋白质。这是因为cDNA只包含编码区域,不包含内含子,所以可以在原核细胞(如大肠杆菌)中得到正确的蛋白质表达。
      • 基因芯片和RNA测序:cDNA也常用于基因芯片技术和RNA测序中,作为对比或参考。
  4. cDNA (complementary DNA) is double-stranded DNA synthesized from an mRNA template through the action of reverse transcriptase. It is the complementary version of mRNA. Thus, its sequence corresponds to the coding region of mRNA but excludes introns.

    • The preparation of cDNA is as follows:

      • Reverse transcription: First, purified mRNA is mixed with a short oligoadenylate primer (often a poly-T primer or oligo(dT) primer). This primer can bind to the poly-A tail of mRNA.
      • Using reverse transcriptase, this primer starts synthesizing a single-stranded cDNA.
      • This single-stranded cDNA can then serve as a template, with DNA polymerase synthesizing the second strand to produce double-stranded cDNA.
    • cDNA has several uses in molecular biology research:

      • cDNA libraries: Since cDNA is synthesized from mRNA, it represents only transcribed genes. By preparing a cDNA library from specific tissues or cell types, researchers can determine which genes are expressed under specific conditions.
      • Cloning and expression: cDNA can be cloned into expression vectors and then produce proteins in host cells. Since cDNA only contains coding regions and excludes introns, it allows for the correct protein expression in prokaryotic cells (like E. coli).
      • Gene chips and RNA sequencing: cDNA is also frequently used in gene chip technology and RNA sequencing as a reference or comparator.

Yops (Yersinia outer proteins) analysis 2

http://xgenes.com/article/article-content/162/yersinia-outer-proteins-yops-analysis/

  1. extract all plasmids of the 50 isolates with plasmids but no yopK

    python3 extract_plasmids_from_gff.py ../prokka_plus/1045.gff  #reference
    for sample in SCPM-O-B-6291_C-25 KIM10+ 195P Nepal516 A1122 A1122_bis Nairobi IP31758 228 NW57 NW117 NW56 NW115 FORC_002 FORC_002_bis Gp200 NW116 Gp169 Y225 ATCC_BAA-2637 CFS1934 LC20 GTA 2011N-4075 ATCC_43970 NHV_3758 NVI-10705 NVI-1292 NVI-4570 NVI-6614 NVI-11267 NVI-11294 NVI-10571 NVI-8524 NVI-1176 NVI-701 17Y0412 17Y0414 NVI-492 NVI-9681 SC09 17Y0189 17Y0153 17Y0155 KMM821 16Y0180 NVI-5089 NVI-10587 NVI-4840 17Y0159; do
        python3 extract_plasmids_from_gff.py ../prokka_plus/${sample}.gff
    done
    
    grep "yop" *.gff3
    grep "Yop" *.gff3
    (yopH, yopO, yopE, yopT, yopM, yopD, yopB, yopN) and (YopH, YopJ, YopO, YopE, YopT, YopM, YopD, YopB, YopN, YopR) in 195P_NZ_CP019710
    
    # code of extract_plasmids_from_gff.py
    import sys
    import os
    from Bio import SeqIO
    from Bio.Alphabet import generic_dna
    
    if len(sys.argv) != 2:
        print("Usage: python script_name.py your_input.gff3")
        sys.exit(1)
    
    input_gff = sys.argv[1]
    base_filename = os.path.splitext(os.path.basename(input_gff))[0]
    
    # Split the GFF file into annotations and sequences
    with open(input_gff, 'r') as f:
        lines = f.readlines()
        fasta_start = lines.index("##FASTA\n")
        gff_lines = lines[:fasta_start]
        fasta_lines = lines[fasta_start + 1:]
    
    # Separate GFF content for each plasmid/chromosome
    gff_dict = {}
    for line in gff_lines:
        if not line.startswith("#"):
            record_id = line.split("\t")[0]
            if record_id not in gff_dict:
                gff_dict[record_id] = []
            gff_dict[record_id].append(line)
    
    # Write the sequences temporarily to a file
    with open("temp.fasta", 'w') as f:
        f.writelines(fasta_lines)
    
    # Read the sequences from the temporary file
    records = list(SeqIO.parse("temp.fasta", format="fasta"))
    
    for idx, rec in enumerate(records):
        # Skip the chromosome (the first record)
        if idx == 0:
            continue
    
        # Write GFF3
        with open(f"{base_filename}_{rec.id}.gff3", "w") as output_handle:
            output_handle.writelines(gff_dict.get(rec.id, []))
            output_handle.write("##FASTA\n")
            SeqIO.write(rec, output_handle, "fasta")
    
        ## Write GenBank (without annotations)
        #with open(f"plasmid_{rec.id}.gbk", "w") as output_handle:
        #    rec.seq.alphabet = generic_dna  # Add temporary alphabet
        #    SeqIO.write(rec, output_handle, "genbank")
    
        # Write FASTA
        with open(f"{base_filename}_{rec.id}.fasta", "w") as output_handle:
            SeqIO.write(rec, output_handle, "fasta")
  2. (optional) cluster all plasmids against reference using fastANI

    git clone https://github.com/ParBLiSS/FastANI.git
    cd FastANI
    ./bootstrap.sh
    ./configure
    make
    ~/Tools/FastANI/fastANI -q 1045_NZ_CP006795.1.fasta --rl plasmids.txt -o output_ani.txt

The output of FastANI is a tab-separated file, typically with four columns:

  • Query genome path: The path (or filename) of the query genome.

  • Reference genome path: The path (or filename) of the reference genome.

  • ANI percentage: The average nucleotide identity (ANI) value between the query and reference genomes. This is expressed as a percentage and represents the average nucleotide identity of orthologous gene pairs between the two genomes.

  • Orthologous fragment count: The number of orthologous fragments that were found and compared between the two genomes. FastANI breaks genomes into fixed-size fragments (default is 3kb) and then identifies orthologous fragments between genomes for ANI calculation. This column indicates the number of such orthologous fragment pairs used in the ANI calculation.

    1045_NZ_CP006795.1.fasta     ./SCPM-O-B-6291_C-25_NZ_CP045165.1.fasta   94.3585 1       23
  • Query: plasmid_NZ_CP006795.1.fasta

  • Reference: ./plasmid_NZ_CP045165.1.fasta

  • ANI: 94.3585%

  • Orthologous fragments: 1 out of 23

  • Here, the plasmid 1045_NZ_CP006795.1.fasta is being compared to SCPM-O-B-6291_C-25_NZ_CP045165.1.fasta. They have an ANI of approximately 94.36%. However, only 1 out of the 23 fragments in the query plasmid was found to be orthologous with the reference plasmid.

  1. The 50 isolates with plasmids but no yopK are as follows.

    SCPM-O-B-6291_C-25.gff  2   Yersinia pestis SCPM-O-B-6291 C-25  aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.MED   2   _plasmid    No  NA
    KIM10+  4   Yersinia pestis KIM10+  aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.MED   1   _plasmid    No  NA
    195P    19  Yersinia pestis 195P    aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.ANT   3   _plasmid    No  NA
    Nepal516    20  Yersinia pestis Nepal516    aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  2.ANT   2   _plasmid    No  NA
    A1122   24  Yersinia pestis A1122   aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ORI   2   _plasmid    No  NA
    A1122_bis   26  Yersinia pestis A1122 bis   aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ORI   2   _plasmid    No  NA
    Nairobi 43  Yersinia pestis Nairobi aarF(39)    dfp(37) galR(48)    glnS(41)    hemA(44)    rfaE(38)    speA(35)    pestis  pestis  79  pestis  1.ANT   1   _plasmid    No  NA
    IP31758 82  Yersinia pseudotuberculosis IP31758 adk(1)  argA(2) aroA(1) glnA(6) thrA(8) tmk(3)  trpE(2) pseudotuberculosis  pseudotuberculosis  2   8       2   _plasmid    No  NA
    228 94  Yersinia similis 228    adk(5)  argA(4) aroA(12)    glnA(12)    thrA(15)    tmk(9)  trpE(9) similis similis 92          1   _plasmid    No  NA
    NW57    115 Yersinia enterocolitica NW57    adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW117   116 Yersinia enterocolitica NW117   adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW56    118 Yersinia enterocolitica NW56    adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    NW115   119 Yersinia enterocolitica NW115   adk(20) argA(85)    aroA(21)    glnA(22)    thrA(21)    tmk(28) trpE(81)    enterocolitica  enterocolitica  312 1Aa     2   _plasmid    No  NA
    FORC_002    121 Yersinia enterocolitica FORC_002    adk(12) argA(19)    aroA(21)    glnA(22)    thrA(25)    tmk(24) trpE(19)    enterocolitica  enterocolitica  252 1Aa     1   _plasmid    No  NA
    FORC_002_bis    122 Yersinia enterocolitica FORC_002 bis    adk(12) argA(19)    aroA(21)    glnA(22)    thrA(25)    tmk(24) trpE(19)    enterocolitica  enterocolitica  252 1Aa     1   _plasmid    No  NA
    Gp200   129 Yersinia enterocolitica Gp200   adk(20) argA(21)    aroA(85)    glnA(32)    thrA(25)    tmk(~71)    trpE(19)    enterocolitica  enterocolitica      1Aa     1   _plasmid    No  NA
    NW116   130 Yersinia enterocolitica NW116   adk(86) argA(41)    aroA(31)    glnA(83)    thrA(31)    tmk(104)    trpE(16)    enterocolitica  enterocolitica  335 1Aa     1   _plasmid    No  NA
    Gp169   131 Yersinia enterocolitica Gp169   adk(86) argA(41)    aroA(31)    glnA(83)    thrA(31)    tmk(104)    trpE(16)    enterocolitica  enterocolitica  335 1Aa     1   _plasmid    No  NA
    Y225    134 Yersinia frederiksenii Y225 aarF(43)    dfp(41) galR(50)    glnS(47)    hemA(48)    rfaE(41)    speA(39)    frederiksenii   occitanica  83          1   _plasmid    No  NA
    ATCC_BAA-2637   137 Yersinia rochesterensis ATCC BAA-2637   aarF(43)    dfp(41) galR(50)    glnS(10)    hemA(58)    rfaE(41)    speA(39)    rochesterensis  occitanica  84          2   _plasmid    No  NA
    CFS1934 140 Yersinia hibernica CFS1934                              hibernica   hibernica               1   _plasmid    No  NA
    LC20    141 Yersinia hibernica LC20 adk(-)  argA(66)    aroA(-) glnA(68)    thrA(78)    tmk(85) trpE(76)    hibernica   hibernica               2   _plasmid    No  NA
    GTA 146 Yersinia massiliensis GTA   aarF(15)    dfp(~31)    galR(32)    glnS(15)    hemA(30)    rfaE(32)    speA(16)    massiliensis    massiliensis        2       2   _plasmid    No  NA
    2011N-4075  147 Yersinia massiliensis 2011N-4075    aarF(15)    dfp(~31)    galR(~32)   glnS(15)    hemA(30)    rfaE(32)    speA(16)    massiliensis    massiliensis        2       2   _plasmid    No  NA
    ATCC_43970  151 Yersinia bercovieri ATCC 43970  aarF(47)    dfp(45) galR(54)    glnS(61)    hemA(63)    rfaE(45)    speA(9) bercovieri  bercovieri  30          1   _plasmid    No  NA
    NHV_3758    163 Yersinia ruckeri NHV_3758   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-10705   164 Yersinia ruckeri NVI-10705  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-1292    165 Yersinia ruckeri NVI-1292   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-4570    166 Yersinia ruckeri NVI-4570   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             3   _plasmid    No  NA
    NVI-6614    167 Yersinia ruckeri NVI-6614   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             3   _plasmid    No  NA
    NVI-11267   168 Yersinia ruckeri NVI-11267  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-11294   169 Yersinia ruckeri NVI-11294  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-10571   170 Yersinia ruckeri NVI-10571  adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-8524    171 Yersinia ruckeri NVI-8524   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             2   _plasmid    No  NA
    NVI-1176    172 Yersinia ruckeri NVI-1176   adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-701 173 Yersinia ruckeri NVI-701    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    17Y0412 174 Yersinia ruckeri 17Y0412    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    17Y0414 175 Yersinia ruckeri 17Y0414    adk(64) argA(76)    aroA(78)    glnA(78)    thrA(90)    tmk(86) trpE(77)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-492 176 Yersinia ruckeri NVI-492    aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-9681    177 Yersinia ruckeri NVI-9681   adk(64) argA(67)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    SC09    178 Yersinia ruckeri SC09   aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri             2   _plasmid    No  NA
    17Y0189 180 Yersinia ruckeri 17Y0189    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    17Y0153 181 Yersinia ruckeri 17Y0153    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    17Y0155 182 Yersinia ruckeri 17Y0155    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    KMM821  183 Yersinia ruckeri KMM821 aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          2   _plasmid    No  NA
    16Y0180 184 Yersinia ruckeri 16Y0180    aarF(13)    dfp(10) galR(14)    glnS(13)    hemA(15)    rfaE(14)    speA(14)    ruckeri ruckeri 44          1   _plasmid    No  NA
    NVI-5089    189 Yersinia ruckeri NVI-5089   adk(75) argA(76)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-10587   190 Yersinia ruckeri NVI-10587  adk(75) argA(76)    aroA(72)    glnA(78)    thrA(90)    tmk(86) trpE(89)    ruckeri ruckeri             1   _plasmid    No  NA
    NVI-4840    191 Yersinia ruckeri NVI-4840   adk(75) argA(76)    aroA(72)    glnA(79)    thrA(90)    tmk(96) trpE(89)    ruckeri ruckeri             2   _plasmid    No  NA
    17Y0159 197 Yersinia ruckeri 17Y0159    aarF(76)    dfp(40) galR(14)    glnS(13)    hemA(15)    RfaE(14)    speA(14)    ruckeri ruckeri             3   _plasmid    No  NA

Should the inputs for GSVA be normalized or raw?

Gene Set Variation Analysis (GSVA) is a non-parametric and unsupervised method used for estimating the variation of gene set enrichment through the samples of a gene expression matrix. Given its nature and the underlying computations, there are specific recommendations for input data preprocessing.

  • Normalization: Gene expression data should generally be normalized before applying GSVA. This ensures that the different scales and potential batch effects from different experiments or different runs are accounted for. There are several normalization methods available depending on the type of gene expression data. For RNA-seq data, methods like TMM (Trimmed Mean of M-values) or RLE (Relative Log Expression) are popular. For microarray data, quantile normalization is commonly used.

  • Log Transformation: It is generally recommended to log-transform gene expression data before using GSVA. The rationale is similar to the reasons for normalizing the data: taking the logarithm compresses the range of expression values, making highly expressed genes and lowly expressed genes more comparable in scale. Typically, for RNA-seq count data, one might use a transformation like log2(CPM+1) or log2(FPKM/TPM + 1). The “+1” is added to handle zero counts.

However, the best practices can vary based on the specifics of the dataset and the research question. It’s crucial to consult the GSVA documentation, relevant literature, and potentially perform some exploratory analysis on your dataset to determine the best preprocessing steps.

Lastly, always remember to ensure that the gene identifiers in your expression dataset match those in the gene sets you are using for the enrichment analysis. This often requires additional preprocessing steps to map between different types of gene identifiers (e.g., gene symbols, Entrez IDs, Ensembl IDs).

https://bioconductor.org/packages/devel/bioc/vignettes/GSVA/inst/doc/GSVA.html

Input arguments of gsva(): There are four classes of parameter objects corresponding to the methods listed above, and may have different additional parameters to tune, but all of them require at least the following two input arguments:

  • A normalized gene expression dataset, which can be provided in one of the following containers:
    • A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
    • An ExpressionSet object; see package Biobase.
    • A SummarizedExperiment object, see package SummarizedExperiment.
  • A collection of gene sets; which can be provided in one of the following containers:
    • A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
    • A GeneSetCollection object; see package GSEABase.

In the context of gene expression data or other biological datasets, the term “non-log space” usually refers to the original, untransformed measurements or values. This is in contrast to “log space,” where values have been transformed using a logarithm, typically the natural logarithm or the base-2 logarithm.

Here’s a bit more on why and when these terms are used:

  • Log Transformation: For various types of data, including gene expression data, a logarithm transformation is often applied. One reason to do this is to stabilize variance or to make the data distribution more normal-like, especially when measurements can span several orders of magnitude. For example, gene expression data from microarray or RNA-seq experiments might have a long-tailed distribution, and taking the logarithm can help in compressing extreme values.

  • Non-log Space: When we refer to values in “non-log space,” we’re talking about the original measurements, before any logarithm transformation. For gene expression, these might be raw read counts, FPKM values (for RNA-seq data), or probe intensities (for microarrays).

  • Back Transformation: Sometimes, after performing computations or analyses in the log space, one might want to transform the results back to the original scale. This “back transformation” involves taking the antilogarithm (exponential) of the log-transformed values.

For example, consider the average expression of a gene. If you take the average of log-transformed expression values and then back-transform this average by taking the exponential, you won’t get the same result as taking the average of the original, non-log-transformed values. This is because the logarithm is a non-linear transformation.

In summary, “non-log space” refers to data that hasn’t been log-transformed, and it represents the original scale of the measurements.

GSVA calculation for RNA-seq data

#------------------------------
#---- 1. prepare gene sets ----

library(org.Hs.eg.db)
library(dplyr)
library(AnnotationDbi)

#1. Platelets
data <- data.frame(
  geneSymbol = c("GP1BA","GP5","GP6","GP9","LY6G6D","MMRN1","PEAR1","PF4","PF4V1","PPBP","SLC35D3"),
  geneEntrezID = c("2811","2814","51206","2815","58530","22915","375033","5196","5197","5473","340146"),
  GeneSet = rep("Platelets", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Platelets.csv", row.names = FALSE)

#2. Granulocytes
data <- data.frame(
  geneSymbol = c("CXCR2","CD177","CLC","CTSS","DEFA1","FUT7","LTB4R","MMP25","OSM","RETN"),
  geneEntrezID = c("3579","57126","1178","1520","1667","2529","1241","64386","5008","56729"),
  GeneSet = rep("Granulocytes", 10)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Granulocytes.csv", row.names = FALSE)

#3. LDG
data <- data.frame(
  geneSymbol = c("AZU1","CAMP","CEACAM6","CEACAM8","CTSG","DEFA4","ELANE","LCN2","LTF","MPO","OLFM4","RNASE3"),
  geneEntrezID = c("566","820","4680","1088","1511","1669","1991","3934","4057","4353","10562","6037"),
  GeneSet = rep("LDG", 12)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "LDG.csv", row.names = FALSE)

#4. pDC
data <- data.frame(
  geneSymbol = c("CLEC4C","NRP1"),
  geneEntrezID = c("170482","8829"),
  GeneSet = rep("pDC", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "pDC.csv", row.names = FALSE)

#5. Anti-inflammation
data <- data.frame(
  geneSymbol = c("IL1RN","TNFAIP3","SOCS3"),
  geneEntrezID = c("3557","7128","9021"),
  GeneSet = rep("Anti-inflammation", 3)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Anti-inflammation.csv", row.names = FALSE)

#6. Pro-inflam. IL-1
data <- data.frame(
  geneSymbol = c("IL1A","IL1B","IL18","IL36A","IL36B","IL36G","IL33","IL1R1","IL1RAP","IL18R1","IL18RAP"),
  geneEntrezID = c("3552","3553","3606","27179","27177","56300","90865","3554","3556","8809","8807"),
  GeneSet = rep("Pro-inflam. IL-1", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Pro-inflam._IL-1.csv", row.names = FALSE)

#7. Dendritic cells
data <- data.frame(
  geneSymbol = c("CLEC12A","CLEC10A","CLEC9A","CSF1R","IGIP","LILRA4","LY75","XCR1"),
  geneEntrezID = c("160364","10462","283420","1436","492311","23547","4065","2829"),
  GeneSet = rep("Dendritic cells", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Dendritic_cells.csv", row.names = FALSE)

#8. MHC II
data <- data.frame(
  geneSymbol = c("HLA-DMA","HLA-DMB","HLA-DPA1","HLA-DPB1","HLA-DPB2","HLA-DQA1","HLA-DQA2","HLA-DQB1","HLA-DQB2","HLA-DRA","HLA-DRB1","HLA-DRB3","HLA-DRB4","HLA-DRB5","HLA-DRB6"),
  geneEntrezID = c("3108","3109","3113","3115","3116","3117","3118","3119","3120","3122","3123","3125","3126","3127","3128"),
  GeneSet = rep("MHC II", 15)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "MHC_II.csv", row.names = FALSE)

#9. Alt. complement
data <- data.frame(
  geneSymbol = c("C3","C5","C6","C7","C8A","C9","CFB","CFD","CFH","CFHR5","CFP","CR1","GZMM","MIR520B","MIR520E","VSIG4"),
  geneEntrezID = c("718","727","729","730","731","735","629","1675","3075","81494","5199","1378","3004","574473","574461","11326"),
  GeneSet = rep("Alt. complement", 16)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Alt._complement.csv", row.names = FALSE)

#10. TNF
data <- data.frame(
  geneSymbol = c("BRCA1","CASP1","CXCL1","CXCL2","GBP1","GP1BA","IFI44","IL18","IL1B","IL1RN","MX1","NAMPT","NR3C1","OAS3","PATJ","SH3BP5","TAP2","TNF","TNFAIP3","TNFRSF11A","WT1","ACLY","ACSL1","ADGRE2","AK3","AKAP10","AMPD3","APOL3","ARID3A","ARSE","ASAP1","B4GALT5","BCL2A1","BHLHE41","BHMT","BIRC3","CALD1","CASP10","CCL15","CCL20","CCL23","CCL3L1","CD37","CD38","CD83","CDKN3","CKB","CR2","CTNND2","CXCL3","CXCL8","CYP27B1","DAB2","EBI3","EGR1","EGR2","EPB41","EREG","ETAA1","F3","FABP1","FBXL2","FCER2","FCGR2A","FLJ11129","FLNA","G0S2","GCH1","GJB2","GLS","GMIP","GRK3","HCAR3","HHEX","HOMER2","HP","ICAM1","IDO1","IKBKG","IL16","IL1A","IL6","INHBA","INSIG1","ITGA6","KITLG","KLF1","KMO","LGALS3BP","MAP3K4","MARCKS","MGLL","MMP19","MN1","MRPS15","MSC","MTF1","NELL2","NFKB1","NFKB2","NFKBIA","NFKBIZ","NKX3-2","PDE4DIP","PDPN","PIAS4","PLAUR","PTGES","PTGS2","RELB","RPGR","RPS9","SDC4","SERPIND1","SFRP1","SLAMF1","SLC30A4","SOD2","SPI1","SSPN","STAT4","TAF15","TBX3","TFF1","TNFAIP2","TRAF1","TSC22D1","TYROBP","UBE2C","VEGFA"),
  geneEntrezID = c("672","834","2919","2920","2633","2811","10561","3606","3553","3557","4599","10135","2908","4940","10207","9467","6891","7124","7128","8792","7490","47","2180","30817","50808","11216","272","80833","1820","415","50807","9334","597","79365","635","330","800","843","6359","6364","6368","6349","951","952","9308","1033","1152","1380","1501","2921","3576","1594","1601","10148","1958","1959","2035","2069","54465","2152","2168","25827","2208","2212","54674","2316","50486","2643","2706","2744","51291","157","8843","3087","9455","3240","3383","3620","8517","3603","3552","3569","3624","3638","3655","4254","10661","8564","3959","4216","4082","11343","4327","4330","64960","9242","4520","4753","4790","4791","4792","64332","579","9659","10630","51588","5329","9536","5743","5971","6103","6203","6385","3053","6422","6504","7782","6648","6688","8082","6775","8148","6926","7031","7127","7185","8848","7305","11065","7422"),
  GeneSet = rep("TNF", 130)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TNF.csv", row.names = FALSE)

#11. NLRP3 inflammasome
data <- data.frame(
  geneSymbol = c("APP","ATAT1","CARD8","CASP1","CASP4","CD36","CPTP","DHX33","EIF2AK2","GBP5","GSDMD","HSP90AB1","MEFV","NFKB1","NFKB2","NLRC3","NLRP3","P2RX7","PANX1","PSTPIP1","PYCARD","PYDC2","RELA","SIRT2","SUGT1","TLR4","TLR6","TXN","TXNIP","USP50"),
  geneEntrezID = c("351","79969","22900","834","837","948","80772","56919","5610","115362","79792","3326","4210","4790","4791","197358","114548","5027","24145","9051","29108","152138","5970","22933","10910","7099","10333","7295","10628","373509"),
  GeneSet = rep("NLRP3 inflammasome", 30)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "NLRP3_inflammasome.csv", row.names = FALSE)

#12. Unfolded protein
data <- data.frame(
  geneSymbol = c("B4GALT3","CALR","CALU","CANX","CDS2","CHST12","CHST2","DERL1","DERL2","DNAJC3","EDEM2","EDEM3","EMC9","ERAP1","ERGIC2","ERO1L","EXT1","GALNT2","GOLT1B","HERPUD1","HYOU1","IER3IP1","IMPAD1","KDELC1","KDELR2","LMAN2","LPGAT1","MAN1A1","MANEA","MANF","NUCB2","PDIA4","PDIA6","PIGK","PPIB","SEC24D","SEC61G","SPCS3","SSR1","SSR3","TRAM1","TRAM2","UGGT1","XBP1"),
  geneEntrezID = c("8703","811","813","821","8760","55501","9435","79139","51009","5611","55741","80267","51016","51752","51290","30001","2131","2590","51026","9709","10525","51124","54928","79070","11014","10960","9926","4121","79694","7873","4925","9601","10130","10026","5479","9871","23480","60559","6745","6747","23471","9697","56886","7494"),
  GeneSet = rep("Unfolded protein", 44)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Unfolded_protein.csv", row.names = FALSE)

#13. B cells
data <- data.frame(
  geneSymbol = c("IGHD","SH3BP5","BANK1","BLK","BLNK","CD19","CD22","CD79A","CD79B","DAPP1","FCRL1","FCRL2","FCRL3","FCRLA","GON4L","GPR183","IGHM","KLHL6","MS4A1","PAX5","PLCL2","VPREB1","ZNF318"),
  geneEntrezID = c("3495","9467","55024","640","29760","930","933","973","974","27071","115350","79368","115352","84824","54856","1880","3507","89857","931","5079","23228","7441","24149"),
  GeneSet = rep("B cells", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "B_cells.csv", row.names = FALSE)

#14. Monocyte cell surface
data <- data.frame(
  geneSymbol = c("CD14","CD300C","CD33","CD68","CLEC12A","CLEC4D","CLEC4E","FCGR1A","FCGR1B","FCGR3B","LILRA5","LILRA6","LILRB2","LILRB3","OSCAR","SEMA4A","SIGLEC1"),
  geneEntrezID = c("929","10871","945","968","160364","338339","26253","2209","2210","2215","353514","79168","10288","11025","126014","64218","6614"),
  GeneSet = rep("Monocyte cell surface", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocyte_cell_surface.csv", row.names = FALSE)

#15. Inflammasome
data <- data.frame(
  geneSymbol = c("CASP1","AIM2","CASP5","CTSB","GSDMB","GSDMD","NAIP","NEK7","NLRC4","NLRP1","NLRP3","NOD2","P2RX7","PANX1","PYCARD","RIPK1"),
  geneEntrezID = c("834","9447","838","1508","55876","79792","4671","140609","58484","22861","114548","64127","5027","24145","29108","8737"),
  GeneSet = rep("Inflammasome", 16)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Inflammasome.csv", row.names = FALSE)

#16. Monocytes_40
data <- data.frame(
  geneSymbol = c("C1QA","C1QB","C1QC","C1RL","CCL2","CCL8","CD14","CD300C","CD33","CD68","CLEC12A","CLEC4D","CLEC4E","CXCL1","CXCL2","CXCR2","FCGR1A","FCGR1B","FCGR3B","GRN","IK","IL18RAP","IL1B","IL1RN","LILRA5","LILRA6","LILRB2","LILRB3","OSCAR","S100A8","SEMA4A","SIGLEC1","THBD","BST1","C1R","CD163","CSF2","FUT4","MNDA","MRC1"),
  geneEntrezID = c("712","713","714","51279","6347","6355","929","10871","945","968","160364","338339","26253","2919","2920","3579","2209","2210","2215","2896","3550","8807","3553","3557","353514","79168","10288","11025","126014","6279","64218","6614","7056","683","715","9332","1437","2526","4332","4360"),
  GeneSet = rep("Monocytes_40", 40)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocytes_40.csv", row.names = FALSE)

#17. Monocyte secreted
data <- data.frame(
  geneSymbol = c("C1QA","C1QB","C1QC","C1RL","CCL2","CCL8","CXCL1","CXCL2","GRN","IK","IL18RAP","IL1B","IL1RN","S100A8","THBD","TNF","CXCL10"),
  geneEntrezID = c("712","713","714","51279","6347","6355","2919","2920","2896","3550","8807","3553","3557","6279","7056","7124","3627"),
  GeneSet = rep("Monocyte secreted", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocyte_secreted.csv", row.names = FALSE)

#18. IL-1 cytokines
data <- data.frame(
  geneSymbol = c("IL18","IL1B"),
  geneEntrezID = c("3606","3553"),
  GeneSet = rep("IL-1 cytokines", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IL-1_cytokines.csv", row.names = FALSE)

#19. SNOR low UP
data <- data.frame(
  geneSymbol = c("FCGR1A","CEACAM1","LGALS1","SNORD24","SNORD44","SNORD47","SNORD80"),
  geneEntrezID = c("2209","634","3956","26820","26806","26802","26774"),
  GeneSet = rep("SNOR low UP", 7)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "SNOR_low_UP.csv", row.names = FALSE)

#20. CD40 activated
data <- data.frame(
  geneSymbol = c("BUB1B","CCL22","CD58","DBI","DUSP4","FABP5","FDPS","FEN1","GAPDH","GARS","GRHPR","H2AFX","H2AFZ","HMGB2","HMMR","HPRT1","IMPDH2","LDHA","LDHB","LMNB1","LPXN","MCM2","MCM3","MCM6","MSH6","MTHFD2","MYBL2","NDC80","NME1","PAICS","PCNA","PKM","PRDX3","RFTN1","RGS10","SLAMF1","TK1","TOP2A","TPX2","TRAF1","TUBA1B","TXN","TYMS","UBE2S","VDAC1","WEE1","WSB2","ZWINT"),
  geneEntrezID = c("701","6367","965","1622","1846","2171","2224","2237","2597","2617","9380","3014","3015","3148","3161","3251","3615","3939","3945","4001","9404","4171","4172","4175","2956","10797","4605","10403","4830","10606","5111","5315","10935","23180","6001","6504","7083","7153","22974","7185","10376","7295","7298","27338","7416","7465","55884","11130"),
  GeneSet = rep("CD40 activated", 48)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "CD40_activated.csv", row.names = FALSE)

#21. Lectin complement
data <- data.frame(
  geneSymbol = c("A2M","C2","C3","C4A","C4B","C4B_2","C5","C6","C7","C8A","C9","COLEC10","COLEC11","FCN1","FCN2","FCN3","KRT1","MASP1","MASP2","MBL2","SERPING1"),
  geneEntrezID = c("2","717","718","720","721","100293534","727","729","730","731","735","10584","78989","2219","2220","8547","3848","5648","10747","4153","710"),
  GeneSet = rep("Lectin complement", 21)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Lectin_complement.csv", row.names = FALSE)

#22. Classical complement
data <- data.frame(
  geneSymbol = c("C1QA","CIQB","C1QC","C1R","C1S","C2","C3","C4A","C4B","C4B_2","C5","C6","C7","C8A","C9"),
  geneEntrezID = c("712","713","714","715","716","717","718","720","721","100293534","727","729","730","731","735"),
  GeneSet = rep("Classical complement", 15)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Classical_complement.csv", row.names = FALSE)

#23. Cell cycle
data <- data.frame(
  geneSymbol = c("BRCA1","ASPM","AURKA","AURKB","CCNB1","CCNB2","CCNE1","CDC20","CENPM","CEP55","E2F3","GINS2","MCM10","MCM2","MKI67","NCAPG","NDC80","PTTG1","TYMS"),
  geneEntrezID = c("672","259266","6790","9212","891","9133","898","991","79019","55165","1871","51659","55388","4171","4288","64151","10403","9232","7298"),
  GeneSet = rep("Cell cycle", 19)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Cell_cycle.csv", row.names = FALSE)

#24. Plasma cells
data <- data.frame(
  geneSymbol = c("IGHV4-28","IGHV4-34","IGHD","IGHG1","IGHV2-5","IGHV3-20","IGHV3-23","IGKC","IGLV1-40","IGLV3-1","IGLV3-19","IGLV3-25","IGLV4-3","IGLV4-60","IGLV5-45","IGLV6-57","IGLVI-70","C19orf10","IGH","IGHMBP2","IGHV4-31","IGK","IGL","IGLJ3","IGLL1","IGLV@","IGLV1-44","IGLV2-14","IGLV2-5","MZB1","PRDM1","SDC1","THEMIS2","TNFRSF17"),
  geneEntrezID = c("28400","28395","3495","3500","28457","28445","28442","3514","28825","28809","28797","28793","28786","28785","28781","28778","28763","56005","3492","3508","28396","50802","3535","28831","3543","3546","28823","28815","28818","51237","639","6382","9473","608"),
  GeneSet = rep("Plasma cells", 34)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Plasma_cells.csv", row.names = FALSE)

#25. IG chains
data <- data.frame(
  geneSymbol = c("IGHG1","IGHV2-5","IGHV3-20","IGHV3-23","IGHV4-28","IGHV4-34","IGKC","IGLV1-40","IGLV3-1","IGLV3-19","IGLV3-25","IGLV4-3","IGLV4-60","IGLV5-45","IGLV6-57","IGLVI-70","IGHA1","IGHA2","IGHD2-15","IGHD2-2","IGHD2-21","IGHD3-10","IGHD3-16","IGHD3-3","IGHD3-9","IGHG2","IGHG3","IGHG4","IGHJ1","IGHJ2","IGHJ3","IGHJ4","IGHJ5","IGHJ6","IGHV1-18","IGHV1-2","IGHV1-24","IGHV1-3","IGHV1-45","IGHV1-46","IGHV1-58","IGHV1-69-2","IGHV2-26","IGHV2-70","IGHV3-13","IGHV3-15","IGHV3-21","IGHV3-33","IGHV3-43","IGHV3-48","IGHV3-49","IGHV3-53","IGHV3-62","IGHV3-64","IGHV3-7","IGHV3-72","IGHV3-73","IGHV3-74","IGHV4-30-2","IGHV4-39","IGHV4-59","IGHV4-61","IGHV5-51","IGHV6-1","IGHV7-81","IGKJ1","IGKJ2","IGKJ3","IGKJ4","IGKJ5","IGKV1-16","IGKV1-17","IGKV1-27","IGKV1-5","IGKV1-6","IGKV1-9","IGKV1D-16","IGKV1D-17","IGKV1D-43","IGKV1D-8","IGKV2-24","IGKV2D-26","IGKV2D-29","IGKV2D-30","IGKV3-20","IGKV3D-20","IGKV3D-7","IGKV4-1","IGKV5-2","IGLC2","IGLC7","IGLJ6","IGLV10-54","IGLV1-36","IGLV1-47","IGLV2-11","IGLV2-18","IGLV2-23","IGLV2-33","IGLV2-8","IGLV3-10","IGLV3-12","IGLV3-16","IGLV3-21","IGLV3-27","IGLV3-32","IGLV4-69","IGLV5-37","IGLV7-43","IGLV8-61","IGLV9-49"),
  geneEntrezID = c("3500","28457","28445","28442","28400","28395","3514","28825","28809","28797","28793","28786","28785","28781","28778","28763","3493","3494","28503","28505","28502","28499","28498","28501","28500","3501","3502","3503","28483","28481","28479","28477","28476","28475","28468","28474","28467","28473","28466","28465","28464","28458","28455","28454","28449","28448","28444","28434","28426","28424","28423","28420","28416","28414","28452","28410","28409","28408","28398","28394","28392","28391","28388","28385","28378","28950","28949","28948","28947","28946","28938","28937","28935","28299","28943","28941","28901","28900","28891","28904","28923","28884","28882","28881","28912","28874","28877","28908","28907","3538","28834","28828","28772","28826","28822","28816","28814","28813","28811","28817","28803","28802","28799","28796","28791","28787","28784","28783","28776","28774","28773"),
  GeneSet = rep("IG chains", 111)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IG_chains.csv", row.names = FALSE)

#26. Erythrocytes
data <- data.frame(
  geneSymbol = c("EPO","GFI1B","GYPA","GYPB","GYPE","ICAM4","NFE2","SLC4A1","TRIM10","TSPO2","ZNF367"),
  geneEntrezID = c("2056","8328","2993","2994","2996","3386","4778","6521","10107","222642","195828"),
  GeneSet = rep("Erythrocytes", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Erythrocytes.csv", row.names = FALSE)

#27. IL-6R complex
data <- data.frame(
  geneSymbol = c("IL6","IL6R","IL6ST"),
  geneEntrezID = c("3569","3570","3572"),
  GeneSet = rep("IL-6R complex", 3)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IL-6R_complex.csv", row.names = FALSE)

#28. IFN
data <- data.frame(
  geneSymbol = c("EIF2AK2","GBP1","GBP2","GBP4","HERC5","HERC6","IFI27","IFI30","IFI35","IFI44","IFI44L","IFI6","IFIT1","IFIT2","IFIT3","IFIT5","IFITM1","IFITM2","IFITM3","ISG15","ISG20","MX1","MX2","OAS1","OAS2","OAS3","OASL","RSAD2","SAMD9","SAMD9L","SP100","SP110"),
  geneEntrezID = c("5610","2633","2634","115361","51191","55008","3429","10437","3430","10561","10964","2537","3434","3433","3437","24138","8519","10581","10410","9636","3669","4599","4600","4938","4939","4940","8638","91543","54809","219285","6672","3431"),
  GeneSet = rep("IFN", 32)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "IFN.csv", row.names = FALSE)

#29. TCRB
data <- data.frame(
  geneSymbol = c("TRBC2","TRBJ2-1","TRBJ2-2","TRBJ2-2P","TRBJ2-3","TRBJ2-4","TRBJ2-5","TRBJ2-6","TRBJ2-7","TRBV1","TRBV10-1","TRBV10-2","TRBV11-1","TRBV11-2","TRBV19","TRBV2","TRBV20-1","TRBV21-1","TRBV23-1","TRBV24-1","TRBV25-1","TRBV27","TRBV28","TRBV3-1","TRBV4-1","TRBV4-2","TRBV5-1","TRBV5-3","TRBV5-4","TRBV5-5","TRBV5-6","TRBV5-7","TRBV6-1","TRBV6-4","TRBV6-5","TRBV6-6","TRBV6-7","TRBV6-8","TRBV7-1","TRBV7-3","TRBV7-4","TRBV7-5","TRBV7-6","TRBV7-7","TRBV9"),
  geneEntrezID = c("28638","28629","28628","28627","28626","28625","28624","28623","28622","28621","28585","28584","28582","28581","28568","28620","28567","28566","28564","28563","28562","28560","28559","28619","28617","28616","28614","28612","28611","28610","28609","28608","28606","28603","28602","28601","28600","28599","28597","28595","28594","28593","28592","28591","28586"),
  GeneSet = rep("TCRB", 45)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRB.csv", row.names = FALSE)

#30. TCRA
data <- data.frame(
  geneSymbol = c("TRAV10","TRAV1-1","TRAV1-2","TRAV12-1","TRAV12-2","TRAV12-3","TRAV13-1","TRAV13-2","TRAV14DV4","TRAV16","TRAV17","TRAV18","TRAV19","TRAV2","TRAV20","TRAV21","TRAV22","TRAV23DV6","TRAV24","TRAV25","TRAV26-1","TRAV26-2","TRAV27","TRAV29DV5","TRAV3","TRAV30","TRAV34","TRAV35","TRAV36DV7","TRAV38-1","TRAV38-2DV8","TRAV39","TRAV4","TRAV40","TRAV41","TRAV5","TRAV7","TRAV8-1","TRAV8-2","TRAV8-3","TRAV8-4","TRAV8-6","TRAV8-7","TRAV9-1","TRAV9-2"),
  geneEntrezID = c("28676","28693","28692","28674","28673","28672","28671","28670","28669","28667","28666","28665","28664","28691","28663","28662","28661","28660","28659","28658","28657","28656","28655","28653","28690","28652","28648","28647","28646","28644","28643","28642","28689","28641","28640","28688","28686","28685","28684","28683","28682","28680","28679","28678","28677"),
  GeneSet = rep("TCRA", 45)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRA.csv", row.names = FALSE)

#31. Cyt. act. T cells
data <- data.frame(
  geneSymbol = c("GZMB","TAGAP","CD69","EOMES","GZMH","IFNG","IL2RB","PRF1","SGK1","TBX21","TFRC","ZNF683"),
  geneEntrezID = c("3002","117289","969","8320","2999","3458","3560","5551","6446","30009","7037","257101"),
  GeneSet = rep("Cyt. act. T cells", 12)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Cyt._act._T_cells.csv", row.names = FALSE)

#32. TCRG
data <- data.frame(
  geneSymbol = c("TRGC2","TRGV1","TRGV10","TRGV11","TRGV2","TRGV3","TRGV4","TRGV5","TRGV7","TRGV8","TRGV9"),
  geneEntrezID = c("6967","6973","6984","6985","6974","6976","6977","6978","6981","6982","6983"),
  GeneSet = rep("TCRG", 11)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRG.csv", row.names = FALSE)

#33. T cells
data <- data.frame(
  geneSymbol = c("CD8A","CD8B","TRDC","CCR3","CD226","CD247","CD28","CD3D","CD3E","CD3G","CD4","CD5","ETS1","GATA3","GRAP2","LEF1","SH2D1A","TRAC","TRBC1"),
  geneEntrezID = c("925","926","28526","1232","10666","919","940","915","916","917","920","921","2113","2625","9402","51176","4068","28755","28639"),
  GeneSet = rep("T cells", 19)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_cells.csv", row.names = FALSE)

#34. CD8T-NK-NKT
data <- data.frame(
  geneSymbol = c("CD8A","CD8B","GZMB","NKTR","CD2","CD7","CRTAM","GNLY","GZMA","GZMK","GZMM","HCST","KIR2DL3","KIR3DL1","KIR3DL2","KLRB1","KLRC3","KLRC4","KLRD1","NKG7","RASAL3","TIA1","TXK"),
  geneEntrezID = c("925","926","3002","4820","914","924","56253","10578","3001","3003","3004","10870","3804","3811","3812","3820","3823","8302","3824","4818","64926","7072","7294"),
  GeneSet = rep("CD8T-NK-NKT", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "CD8T-NK-NKT.csv", row.names = FALSE)

#35. Anergic or act. T cells
data <- data.frame(
  geneSymbol = c("CD160","CD244","CTLA4","HAVCR2","ICOS","KLRG1","LAG3","PDCD1"),
  geneEntrezID = c("11126","51744","1493","84868","29851","10219","3902","5133"),
  GeneSet = rep("Anergic or act. T cells", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Anergic_or_act._T_cells.csv", row.names = FALSE)

#36. T activated
data <- data.frame(
  geneSymbol = c("FASLG","TAGAP","CD40LG","CREM","IL17A","IL17F","IL23R","JAKMIP1","KCNA3","KCNN4","P2RX5","PRKCQ","RELT","RNF125","SATB1","TNFRSF4","TNFRSF9"),
  geneEntrezID = c("356","117289","959","1390","3605","112744","149233","152789","3738","3783","5026","5588","84957","54941","6304","7293","3604"),
  GeneSet = rep("T activated", 17)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_activated.csv", row.names = FALSE)

#37. NK cells
data <- data.frame(
  geneSymbol = c("KLRF1","NCAM1","NCR1","NCR3","SH2D1B"),
  geneEntrezID = c("51348","4684","9437","259197","117157"),
  GeneSet = rep("NK cells", 5)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "NK_cells.csv", row.names = FALSE)

#38. TCRD
data <- data.frame(
  geneSymbol = c("TRDC","TRDJ1","TRDJ2","TRDJ3","TRDJ4","TRDV1","TRDV2","TRDV3"),
  geneEntrezID = c("28526","28522","28521","28520","28519","28518","28517","28516"),
  GeneSet = rep("TCRD", 8)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "TCRD.csv", row.names = FALSE)

#39. T regs
data <- data.frame(
  geneSymbol = c("FOXP3","IKZF2"),
  geneEntrezID = c("50943","22807"),
  GeneSet = rep("T regs", 2)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "T_regs.csv", row.names = FALSE)

#40. SNOR low DOWN
data <- data.frame(
  geneSymbol = c("RNU4ATAC","SNORA11","SNORA16A","SNORA23","SNORA38B","SNORA73A","SNORA73B"),
  geneEntrezID = c("100151683","677799","692073","677808","100124536","6080","26768"),
  GeneSet = rep("SNOR low DOWN", 7)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "SNOR_low_DOWN.csv", row.names = FALSE)

#--

#41. Monocytes
data <- data.frame(
  geneSymbol = c("ACE","ADAM8","ADAMDEC1","ADGRE1","ADGRE2","APOBEC3B","APOBEC3G","APOBR","ART4","C1QA","C1QC","C2","C4A","C4B","C4BPA","C4BPB","C5","C6","C8A","C9","CCL17","CCL18","CCL22","CCL28","CCL7","CCL8","CD14","CD163","CD209","CD300C","CD300E","CD5L","CD80","CFB","CFD","CFP","CHI3L1","CHIT1","CLEC12A","CLEC12B","CLEC4D","CLEC4E","CLEC5A","CLEC7A","CLEC9A","CSF1R","CXCL1","CXCL10","CXCL11","CXCL13","CXCL8","CXCL9","CYBB","F12","FCGR3B","FFAR2","HMMR","HVCN1","IGSF6","IL10RA","IL12A","IL12B","IL1RAP","IL20","IL23A","IL27","IL31RA","LAMP3","LGALS12","LGALS4","LGALS9","LGALS9B","LGALS9C","LILRA2","LILRA5","LILRA6","LILRB5","LMNB1","LY86","LYZ","MARCO","MEGF10","MERTK","MFGE8","MPEG1","MRC1","MS4A4A","MSR1","NTSR1","OCM","OLR1","OSCAR","OTOF","PDCD1LG2","PILRA","PLA2G2D","PLA2G5","PYHIN1","S100A8","S100A9","S1PR5","SCARB1","SCARF2","SECTM1","SEMA4A","SERPINB9","SERPING1","SIGLEC1","SIGLEC14","SIGLEC5","SIGLEC7","SLC11A1","SLITRK4","SMPDL3B","SPIC","STAB2","STAP2","TEK","TGM2","TIMD4","TLR2","TLR8","TNF","TNFAIP8L2","TNFRSF1B","TNIP3","TULP1","UBD","VENTX","VSTM1"),
  geneEntrezID = c("1636","101","27299","2015","30817","9582","60489","55911","420","712","714","717","720","721","722","725","727","729","731","735","6361","6362","6367","56477","6354","6355","929","9332","30835","10871","342510","922","941","629","1675","5199","1116","1118","160364","387837","338339","26253","23601","64581","283420","1436","2919","3627","6373","10563","3576","4283","1536","2161","2215","2867","3161","84329","10261","3587","3592","3593","3556","50604","51561","246778","133396","27074","85329","3960","3965","284194","654346","11027","353514","79168","10990","4001","9450","4069","8685","84466","10461","4240","219972","4360","51338","4481","4923","654231","4973","126014","9381","80380","29992","26279","5322","149628","6279","6280","53637","949","91179","6398","64218","5272","710","6614","100049587","8778","27036","6556","139065","27293","121599","55576","55620","7010","7052","91937","7097","51311","7124","79626","7133","79931","7287","10537","27287","284415"),
  GeneSet = rep("Monocytes", 130)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Monocytes.csv", row.names = FALSE)

#42. Myeloid cells
data <- data.frame(
  geneSymbol = c("ADGRE3","AIF1","AOAH","APOC1","BMX","C1QB","CD101","CD300LF","CD33","CD68","CLEC10A","CLEC16A","CLEC1A","CLEC2B","CLEC4A","CLEC4G","CLEC6A","CSF2RA","CSF2RB","CSF3R","CST3","CYBA","FCER1A","FCER1G","FCGR1A","FCGR1B","FCGR2A","FCGR2B","FCGR2C","FCGR3A","FGR","FLT3","FOLR2","IFI30","IFNL1","IFNL2","IFNL3","IL18","IL1A","IL1B","IL1F10","IL1RN","IL26","IL37","ITGAX","LILRA1","LILRA3","LILRB1","LILRB2","LILRB3","LILRB4","LYVE1","MEFV","MNDA","MS4A6A","MS4A7","NLRP12","NLRP3","NOD2","PECAM1","PLEK","PRAM1","RNASE3","RNASE6","RNASE7","SIGLEC10","SKAP2","SLAMF8","SLPI","SPI1","TNFSF13B","TNFSF14","TREM1","TREML2","TREML4","TWIST1","TYROBP","UNC93B1","VSIG4"),
  geneEntrezID = c("84658","199","313","341","660","713","9398","146722","945","968","10462","23274","51267","9976","50856","339390","93978","1438","1439","1441","1471","1535","2205","2207","2209","2210","2212","2213","9103","2214","2268","2322","2350","10437","282618","282616","282617","3606","3552","3553","84639","3557","55801","27178","3687","11024","11026","10859","10288","11025","11006","10894","4210","4332","64231","58475","91662","114548","64127","5175","5341","84106","6037","6039","84659","89790","8935","56833","6590","6688","10673","8740","54210","79865","285852","7291","7305","81622","11326"),
  GeneSet = rep("Myeloid cells", 79)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Myeloid_cells.csv", row.names = FALSE)

#43. Neutrophils
data <- data.frame(
  geneSymbol = c("ABHD3","ARG1","BPI","CAMP","CD177","CD83","CRISP3","DEFA1","DEFA1B","DEFA3","DEFB103A","DEFB103B","DEFB106B","DEFB136","DEFB4A","FUT4","LY6E","MMP8","OLFM4","OR1J2","PGLYRP1","PRTN3","S100A6"),
  geneEntrezID = c("171586","383","671","820","57126","9308","10321","1667","728358","1668","414325","55894","503841","613210","1673","2526","4061","4317","10562","26740","8993","5657","6277"),
  GeneSet = rep("Neutrophils", 23)
)
ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                      keys=data$geneEntrezID, 
                      keytype="ENTREZID", 
                      columns="ENSEMBL")
result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
write.csv(sorted_result, file = "Neutrophils.csv", row.names = FALSE)

    #44. Neutrophils_
    data <- data.frame(
     geneSymbol = c("EMR2","C5AR2","PLEKHG3","DPEP2","PADI4","FAM212B","TREML2","REPS2","MAK","STEAP4","PGLYRP1","MXD1","MEFV","BTNL8","CREB5","ALOX5","BST1","CEACAM3","VNN3","LILRA2","HSPA6","NFE2","HAL","FFAR2","MMP25","QPCT","CDA","P2RY13","CHST15","TNFRSF10C","FPR2","MGAM","APOBEC3A","TLR2","CXCR1","FAM65B","LST1","TREM1","CSF3R","C5AR1","SELL","AQP9","CXCR2","VNN2","MNDA","FPR1","FCGR3B"),
     GeneSet = rep("Neutrophils_", 47)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Neutrophils_.csv", row.names = FALSE)

    #45. Inflammatory_neutrophils
    data <- data.frame(
     geneSymbol = c("CD177.1","ZDHHC19.1","PLAC8.1","IFI6.2","MT2A.2","SERPING1.2","RSAD2.2","GBP1.2","ISG15.2","LY6E.2","S100A12.2","GBP5.2","XAF1.2","IFI44L.2","IFIT3.2","IFITM3.2","SELL.2","TRIM22.2","EPSTI1.2","FCER1G.2","IFI44.2","CST7.1","OAS1.2","TNFSF13B.2","SLFN5","IFIT2.2","MX1.2","PLSCR1.2","MMP9.2","LAP3","OAS2.2","IFIT1.2","TAP1.2","SAMD9L.2","HIST1H2AC.1","APOL6.2","OASL.1","DDX58.1","WSB1.1","RPL28.1","IL1RN.2","MCEMP1.1","HERC5.1","IRF7.2","CKAP4.1","UBE2L6.1","RNF213.2","GBP2.2","CLEC4D.1","LILRA6.1","SHISA5.2","PHF11.1","IFI16.2","OAS3.1","PIM1.1","SERPINB1.1","GYG1.1","IFITM1.1","PSTPIP2","TMSB10.1","GBP4","NFKBIA.2","EIF2AK2.1","FFAR2.2","PARP9.1","NT5C3A","TNFSF10.2","MYL12A.1","PARP14.1","DDX60","DDX60L.2","STAT1.1","XRN1","SAMD9.2","SNX3","HIST1H2BD","PGD.1","FCGR1A","JUN","CD44","PLP2.1","CARD16.2","LIMK2.2","CYSTM1.1","HMGB2.1","IFIH1.2","C1orf162","KLF4","STAT2","ZBP1","ALPL.1","GSTK1","SP110.2","ANXA1","ANXA3.1","METTL9","PML","NUCB1","CEACAM1","SECTM1","PSMB9.1","SAT1","RNF10.1","MAPK14.1","LILRA5","C3AR1","ZCCHC2","SAMHD1","LGALS9","CR1","C4orf3","IRF1.1","CD63","GRINA","TXN.1","MX2.1","GAPDH.1","RBCK1","NBN.2","NMI.1","CAST","HIST2H2BE","NTNG2","SP100.1","CASP1.1","TNFAIP6","PLEK.1","LMNB1","NFIL3","APOL2","BST1","NUB1","PIK3AP1","CCR1.2","ALOX5AP.1","EMB","ISG20","TMEM123.1","ADAR","GIMAP4","SPTLC2","SAMSN1.1","SPI1","GRN.1","APOBEC3A","GLRX","CD53.1","TRIM38","CD82","SH3GLB1","VIM.1","ADM.1","CASP4.1","S100A6.1","RAC2.1","BAZ1A","ADD3","ITGAM","LY96.2","MSRB1.1","DYSF.1","MOB1A","TPM3.1","FGR","ACSL1","IL2RG.1","CDKN2D","FYB1.1","CLEC4E.1","HLA-F","LILRB3","RBMS1","PLBD1","FLOT1","GNG5","PTEN.1","PROK2.1","UBE2J1","CD37.1","KCNJ15.1","BRI3.1","HIF1A","PRR13.1","LRG1","MAX","RHOG","NFE2","STXBP2","B4GALT5","CAPZA1","SERPINA1","ZYX","RGS19","FKBP5","GCA.1","FKBP1A","MTPN","VNN2","AC245128.3","CD55","CFL1.1"),  # shortened for clarity
     GeneSet = rep("Inflammatory neutrophils", 201)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Inflammatory_neutrophils.csv", row.names = FALSE)

    #46. Suppressive_neutrophils
    data <- data.frame(
     geneSymbol = c("LY6E","XAF1","RSAD2","ISG15","IFI44L","IFITM3","MT2A","IFIT3","EPSTI1","IFIT1","IFI6","IFIT2","MX1","GBP5","PLSCR1","SELL","SERPING1","SHISA5","TRIM22","GBP1","IFI44","IRF7","FFAR2","OAS1","OAS2","IL1RN","IFI16","CCR1","SP110","APOL6","NFKBIA","TNFSF10","FCER1G","IFIH1","TNFSF13B","SAMD9","SAMD9L","DDX60L","TAP1","NBN","RNF213","CARD16","GBP2","LY96","CLEC2B","LIMK2"),
     GeneSet = rep("Suppressive neutrophils", 46)
    )
    data$geneSymbol <- sub("\\..*", "", data$geneSymbol)

    # Fetch ENSEMBL IDs for your gene symbols
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneSymbol, 
                          keytype="SYMBOL", 
                          columns=c("ENSEMBL", "ENTREZID"))

    # Join the original data with the fetched IDs
    result <- left_join(data, ensembl_ids, by = c("geneSymbol"="SYMBOL"))

    # Arrange columns as desired
    sorted_result <- result[, c("geneSymbol", "ENTREZID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Suppressive_neutrophils.csv", row.names = FALSE)

    #47. Apoptosis
    data <- data.frame(
      geneSymbol = c("AIFM1","APAF1","BAD","BAK1","BAX","BCL2L11","BID","CASP10","CASP2","CASP3","CASP6","CASP7","CASP8","CASP9","DIABLO","ENDOG","FAS","FASLG","HTRA2","TNFRSF1A","TP53"),
      geneEntrezID = c("9131","317","572","578","581","10018","637","843","835","836","839","840","841","842","56616","2021","355","356","27429","7132","7157"),
      GeneSet = rep("Apoptosis", 21)
    )
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneEntrezID, 
                          keytype="ENTREZID", 
                          columns="ENSEMBL")
    result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
    sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "Apoptosis.csv", row.names = FALSE)

    #48. NFkB complex
    data <- data.frame(
      geneSymbol = c("IKBKB","IKBKG","NFKB1","NFKB2","RELA","RELB","REL"),
      geneEntrezID = c("3551","8517","4790","4791","5970","5971","5966"),
      GeneSet = rep("NFkB complex", 7)
    )
    ensembl_ids <- AnnotationDbi::select(org.Hs.eg.db, 
                          keys=data$geneEntrezID, 
                          keytype="ENTREZID", 
                          columns="ENSEMBL")
    result <- left_join(data, ensembl_ids, by = c("geneEntrezID"="ENTREZID"))
    sorted_result <- result[, c("geneSymbol", "geneEntrezID", "ENSEMBL", "GeneSet")]
    write.csv(sorted_result, file = "NFkB_complex.csv", row.names = FALSE)

#~/Tools/csv2xls-0.4/csv_to_xls.py Alt._complement.csv Anergic_or_act._T_cells.csv Anti-inflammation.csv B_cells.csv CD40_activated.csv CD8T-NK-NKT.csv Cell_cycle.csv Classical_complement.csv Cyt._act._T_cells.csv Dendritic_cells.csv Erythrocytes.csv Granulocytes.csv IFN.csv IG_chains.csv IL-1_cytokines.csv IL-6R_complex.csv Inflammasome.csv LDG.csv Lectin_complement.csv MHC_II.csv Monocytes.csv Monocytes_40.csv Monocyte_cell_surface.csv Monocyte_secreted.csv Myeloid_cells.csv Neutrophils.csv NK_cells.csv NLRP3_inflammasome.csv pDC.csv Plasma_cells.csv Platelets.csv Pro-inflam._IL-1.csv SNOR_low_DOWN.csv SNOR_low_UP.csv TCRA.csv TCRB.csv TCRD.csv TCRG.csv TNF.csv T_activated.csv T_cells.csv T_regs.csv Unfolded_protein.csv Neutrophils_.csv Inflammatory_neutrophils.csv Suppressive_neutrophils.csv Apoptosis.csv NFkB_complex.csv -d',' -o Signatures.xls

#---------------------------------------------------------------------
#---- 2. Prepare gene expression matrix: calculate DESeq2 results ----

setwd("/home/jhuang/DATA/Data_Susanne_Carotis_RNASeq/results/featureCounts_2023")

library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
#BiocManager::install("org.Hs.eg.db")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)

d.raw<- read.delim2("merged_gene_counts.txt",sep="\t", header=TRUE, row.names=1)

colnames(d.raw)<-c("gene_name", "leer_mock_2h_r2", "Ace2_mock_2h_r2", "leer_inf_24h_r1", "Ace2_inf_2h_r1", "leer_inf_24h_r2", "leer_inf_2h_r1", "leer_mock_2h_r1", "leer_inf_2h_r2", "Ace2_inf_2h_r2", "Ace2_mock_2h_r1", "Ace2_inf_24h_r2", "Ace2_inf_24h_r1")

col_order <- c("gene_name", "leer_mock_2h_r1","leer_mock_2h_r2","leer_inf_2h_r1","leer_inf_2h_r2","leer_inf_24h_r1","leer_inf_24h_r2","Ace2_mock_2h_r1","Ace2_mock_2h_r2","Ace2_inf_2h_r1","Ace2_inf_2h_r2","Ace2_inf_24h_r1","Ace2_inf_24h_r2")

reordered.raw <- d.raw[,col_order]
reordered.raw$gene_name <- NULL
#d <- d.raw[rowSums(reordered.raw>3)>2,]

condition = as.factor(c("leer_mock_2h","leer_mock_2h","leer_inf_2h","leer_inf_2h","leer_inf_24h","leer_inf_24h","Ace2_mock_2h","Ace2_mock_2h","Ace2_inf_2h","Ace2_inf_2h","Ace2_inf_24h","Ace2_inf_24h"))
ids = as.factor(c("leer_mock_2h_r1","leer_mock_2h_r2","leer_inf_2h_r1","leer_inf_2h_r2","leer_inf_24h_r1","leer_inf_24h_r2","Ace2_mock_2h_r1","Ace2_mock_2h_r2","Ace2_inf_2h_r1","Ace2_inf_2h_r2","Ace2_inf_24h_r1","Ace2_inf_24h_r2"))

#cData = data.frame(row.names=colnames(reordered.raw), condition=condition,  batch=batch, ids=ids)
#dds<-DESeqDataSetFromMatrix(countData=reordered.raw, colData=cData, design=~batch+condition)
cData = data.frame(row.names=colnames(reordered.raw), condition=condition, ids=ids)
dds<-DESeqDataSetFromMatrix(countData=reordered.raw, colData=cData, design=~condition)

#----more detailed and specific with the following code!----
dds$condition <- relevel(dds$condition, "Ace2_mock_2h")
dds = DESeq(dds, betaPrior=FALSE)  # betaPrior default value is FALSE
resultsNames(dds)

#--------------------------------------
#---- 3. draw violin on signatures ----

library(GSVA)
library(ggplot2)
#install.packages("readxl")
library(readxl)

# Path to the Excel file
file_path <- "Signatures.xls"

#example of a signature:
#geneSymbol geneEntrezID    ENSEMBL GeneSet
#CD160  11126   ENSG00000117281 Anergic or act. T cells
#CD244  51744   ENSG00000122223 Anergic or act. T cells
#CTLA4  1493    ENSG00000163599 Anergic or act. T cells
#HAVCR2 84868   ENSG00000135077 Anergic or act. T cells
#ICOS   29851   ENSG00000163600 Anergic or act. T cells
#KLRG1  10219   ENSG00000139187 Anergic or act. T cells
#LAG3   3902    ENSG00000089692 Anergic or act. T cells
#PDCD1  5133    ENSG00000188389 Anergic or act. T cells
#PDCD1  5133    ENSG00000276977 Anergic or act. T cells

# Get the names of the sheets
sheet_names <- excel_sheets(file_path)

# Initialize an empty list to hold gene sets
geneSets <- list()

# Loop over each sheet, extract the ENSEMBL IDs, and add to the list
for (sheet in sheet_names) {
  # Read the sheet
  data <- read_excel(file_path, sheet = sheet)

  # Process the GeneSet names (replacing spaces with underscores, for example)
  gene_set_name <- gsub(" ", "_", unique(data$GeneSet)[1])

  # Add ENSEMBL IDs to the list
  geneSets[[gene_set_name]] <- as.character(data$ENSEMBL)
}

# Print the result to check
print(geneSets)

# 0. for Nanostring, the GSVA input requires a gene expression matrix 'exprs' where rows are genes and columns are samples.
#exprs <- exprs(filtered_or_neg_target_m666Data)
# 0. for RNAseq, the GSVA input requires a gene expression matrix where rows are genes and columns are samples.
exprs <- counts(dds, normalized=TRUE)

# 1. Compute GSVA scores:
library(GSVA)
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- dds$condition

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("Ace2_mock_2h", "Ace2_inf_24h"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("Ace2_mock_2h", "mock", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("Ace2_inf_24h", "infection", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("mock", "infection"))
plot_violin <- function(data, gene_name) {
  # Calculate the t-test p-value for the two conditions
  condition1_data <- data[data$Condition == "mock", gene_name]
  condition2_data <- data[data$Condition == "infection", gene_name]
  p_value <- t.test(condition1_data, condition2_data)$p.value

  # Convert p-value to annotation
  p_annotation <- ifelse(p_value < 0.01, "**", ifelse(p_value < 0.05, "*", ""))
  rounded_p_value <- paste0("p = ", round(p_value, 2))

  plot_title <- gsub("_", " ", gene_name)
  p <- ggplot(data, aes(x=Condition, y=!!sym(gene_name))) +
    geom_violin(linewidth=1.2) + 
    labs(title=plot_title, y="GSVA Score") +
    ylim(-1, 1) +
    theme_light() +
    theme(
      axis.title.x = element_text(size=12),
      axis.title.y = element_text(size=12),
      axis.text.x  = element_text(size=10),
      axis.text.y  = element_text(size=10),
      plot.title   = element_text(size=12, hjust=0.5)
    )

  # Add p-value annotation to the plot
  p <- p + annotate("text", x=1.5, y=0.9, label=paste0(p_annotation, " ", rounded_p_value), size=5, hjust=0.5)

  return(p)
}

# 6. Generate the list of plots:
#genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
#plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 6. Generate the list of plots in a predefined order:
desired_order <- c("Platelets","Granulocytes","LDG","pDC","Anti-inflammation",  "Pro-inflam._IL-1","Dendritic_cells","MHC_II","Alt._complement","TNF",  "NLRP3_inflammasome","Unfolded_protein","B_cells","Monocyte_cell_surface","Inflammasome",  "Monocyte_secreted","IL-1_cytokines","SNOR_low_UP","CD40_activated","Lectin_complement",  "Classical_complement","Cell_cycle","Plasma_cells","IG_chains","Erythrocytes",  "IL-6R_complex","IFN","TCRB","TCRA","Cyt._act._T_cells",  "TCRG","T_cells","CD8T-NK-NKT","Anergic_or_act._T_cells","T_activated",  "NK_cells","TCRD","T_regs","SNOR_low_DOWN","Monocytes",  "Myeloid_cells","Neutrophils")
genes <- colnames(gsva_df_filtered)[!colnames(gsva_df_filtered) %in% "Condition"]
genes <- genes[match(desired_order, genes)]
plots_list <- lapply(genes, function(gene) plot_violin(gsva_df_filtered, gene))

# 7. Pad the list of plots:
remaining_plots <- 6 - (length(plots_list) %% 6)
if (remaining_plots != 6) {
  plots_list <- c(plots_list, rep(list(NULL), remaining_plots))
}

# 8. Create the plots and arrange them in a grid:
library(gridExtra)
plots_matrix <- matrix(plots_list, ncol=6, byrow=T)
#do.call("grid.arrange", c(plots_matrix, list(ncol=6)))

# 9. Save the plots to a PNG:
png("All_Violin_Plots_RNAseq.png", width=1000, height=1000)
do.call("grid.arrange", c(plots_matrix, list(ncol=6)))
dev.off()

# ------------------------ Heatmap Presentation --------------------

#-- NanoString GSVA scores --
exprs <- exprs(target_m666Data)
#exprs <- exprs(filtered_or_neg_target_m666Data)

# 1. Compute GSVA scores:
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Grp
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1", "Group1", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1", "Group3"))

# Filter the data for "Group1" samples
group1_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1", genes]

# Set the median values for each column in the data.frame
# Calculate median values for each column
group1_medians <- apply(group1_data, 2, median)

# 3. Add conditions to gsva_df:
identical(rownames(pData(target_m666Data)), rownames(gsva_df))
gsva_df$Condition <- pData(target_m666Data)$Group
#identical(rownames(gsva_df_filtered), rownames(pData(target_m666Data)) )
gsva_df$SampleID <- pData(target_m666Data)$SampleID

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("1a", "1b", "3"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("1a", "Group1a", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("1b", "Group1b", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("3", "Group3", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("Group1a", "Group1b", "Group3"))

# Filter the data for "Group1a" samples
group1a_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1a", genes]
group1a_medians <- apply(group1a_data, 2, median)

# Filter the data for "Group1b" samples
group1b_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group1b", genes]
group1b_medians <- apply(group1b_data, 2, median)

# Filter the data for "Group3" samples
group3_data <- gsva_df_filtered[gsva_df_filtered$Condition == "Group3", genes]
group3_medians <- apply(group3_data, 2, median)

#-- RNAseq GSVA scores --
exprs <- counts(dds, normalized=TRUE)

# 1. Compute GSVA scores:
gsva_scores <- gsva(exprs, geneSets, method="gsva")

# 2. Convert to data.frame for ggplot:
gsva_df <- as.data.frame(t(gsva_scores))

# 3. Add conditions to gsva_df:
gsva_df$Condition <- dds$condition

# 4. Filter the gsva_df to retain only the desired conditions:
#group 1 vs. group 3 in the nanostring data
gsva_df_filtered <- gsva_df[gsva_df$Condition %in% c("Ace2_mock_2h", "Ace2_inf_24h"), ]

# 5. Define a function to plot violin plots:
# Update the condition levels in gsva_df_filtered to ensure the desired order on x-axis:
gsva_df_filtered$Condition <- gsub("Ace2_mock_2h", "mock", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- gsub("Ace2_inf_24h", "infection", gsva_df_filtered$Condition)
gsva_df_filtered$Condition <- factor(gsva_df_filtered$Condition, levels = c("mock", "infection"))

# Filter the data for "mock" samples
mock_data <- gsva_df_filtered[gsva_df_filtered$Condition == "mock", genes]
mock_medians <- apply(mock_data, 2, median)

# Filter the data for "infection" samples
infection_data <- gsva_df_filtered[gsva_df_filtered$Condition == "infection", genes]
infection_medians <- apply(infection_data, 2, median)

# Create a new data frame with the middle values for both groups
heatmap_data_rnaseq <- data.frame(mock = mock_medians, infection = infection_medians)

# Transpose the data matrix to switch rows and columns
#heatmap_data <- t(heatmap_data)

# Convert the data frame to a numeric matrix
heatmap_matrix <- as.matrix(heatmap_data)

# Specify heatmap colors
color_scheme <- colorRampPalette(c("blue", "white", "red"))(50)

#TODO: adjust the color, so that it is not so dark and light!!!!, add RNAseq data for mock and infection!!
# Create the heatmap without hierarchical clustering
library(gplots)
png("Heatmap.png", width=1000, height=1000)
heatmap.2(
  heatmap_matrix,
  Colv = FALSE,  # Turn off column dendrogram
  Rowv = FALSE,  # Turn off row dendrogram
  trace = "none",  # Turn off row and column dendrograms
  scale = "row",   # Scale the rows
  col = color_scheme,
  main = "GSVA Heatmap",  # Title of the heatmap
  key = TRUE,            # Show color key
  key.title = "GSVA Score",  # Color key title
  key.xlab = NULL,       # Remove x-axis label from color key
  density.info = "none",  # Remove density plot
  cexRow = 1.2,  # Increase font size of row names
  cexCol = 1.2,  # Increase font size of column names
  srtCol = 40, keysize = 0.72, margins = c(5, 14)
)
dev.off()