KEGG and GO annotations in non-model organisms

gene_x 0 like s 208 view s

Tags: pipeline

https://www.biobam.com/functional-analysis/

Blast2GO_workflow

Venn_Diagram_Tam_2025_1

  1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
    
    Install EggNOG-mapper:
    
        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env
    
    Run annotation:
    
        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
        emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.
    
    Extract KEGG KO IDs from annotations.emapper.annotations.
    

    Option 2 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    • 'Load protein sequences' (Tags: NONE, generated columns: Nr, SeqName) -->
    • Buttons 'blast' (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    • Button 'mapping' (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), "Mapping finished - Please proceed now to annotation."
    • Button 'annot' (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), "Annotation finished." * Used parameter 'Annotation CutOff': The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value. * Used parameter 'GO Weight' is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

    or blast2go_cli_v1.5.1 (NOT_USED)

        #https://help.biobam.com/space/BCD/2250407989/Installation
        #see ~/Scripts/blast2go_pipeline.sh
    

    Option 3 (GO Terms from 'Blast2GO 5 Basic', saved in blast2go_annot.annot2): Interpro based protein families / domains --> Button interpro * Button 'interpro' (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) --> "InterProScan Finished - You can now merge the obtained GO Annotations."

    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs * Button 'interpro'/'Merge InterProScan GOs to Annotation' --> "Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation." --> "Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation)." #* Button 'annot'/'ANNEX' --> "ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping."

    #-- before merging (blast2go_annot.annot) --
    #H0N29_18790     GO:0004842      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0085020
    #-- after merging (blast2go_annot.annot2) -->
    #H0N29_18790     GO:0031436      ankyrin repeat domain-containing protein
    #H0N29_18790     GO:0070531
    #H0N29_18790     GO:0004842
    #H0N29_18790     GO:0005515
    #H0N29_18790     GO:0085020
    

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    • Go to KAAS
    • Upload your FASTA file.
    • Select an appropriate gene set.
    • Download the KO assignments.
  2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn't directly in KEGG, use a closely related organism.

    • Check available KEGG organisms:
      library(clusterProfiler)
      library(KEGGREST)
      
      kegg_organisms <- keggList("organism")
      
      Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
      
      # Search for Acinetobacter in the list
      grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
      # Gammaproteobacteria
      #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"
      
  3. Find the Closest KEGG Organism for a Non-Model Species

    If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search
    

    For KEGG pathway enrichment in non-model species, use "ko" instead of a species code (the code has been intergrated in the point 4):

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
    
  4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2024/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
    
        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        #res <- read.csv("AUM_vs_MHB-all.csv")     #up149, down65
        #res <- read.csv("Urine_vs_AUM-all.csv")    #up155, down105
        res <- read.csv("Urine_vs_MHB-all.csv")   #up259, down138
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # DEBUG: NOT_NECESSARY, since res has already GeneName
        ##Convert row names to a new column 'GeneName' in res
        #res_with_geneName <- res %>%
        #mutate(GeneName = rownames(res)) %>%
        #as.data.frame()  # Ensure that it's a regular data frame without row names
        ## View the result
        #head(res_with_geneName)
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # DEBUG: NOT_NECESSARY, since 'GeneName' is already the first column.
        ## Reorder columns to move 'GeneName' as the first column in res_updated
        #res_updated <- res_updated %>%
        #select(GeneName, everything())
    
        ## Count the number of rows in the KEGG_ko, GOs, EC columns that have non-missing values
        #num_non_missing_KEGG_ko <- sum(res_updated$KEGG_ko != "-" & !is.na(res_updated$KEGG_ko))
        #print(num_non_missing_KEGG_ko)
        ##[1] 2030
        #num_non_missing_GOs <- sum(res_updated$GOs != "-" & !is.na(res_updated$GOs))
        #print(num_non_missing_GOs)
        ##[1] 2865 --> 2875
        #num_non_missing_EC <- sum(res_updated$EC != "-" & !is.na(res_updated$EC))
        #print(num_non_missing_EC)
        ##[1] 1701
    
        # Filter up-regulated genes
        up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        # Filter down-regulated genes
        down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        saveWorkbook(wb, "Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
        ## Set the 'GeneName' column as row.names
        #rownames(res_updated) <- res_updated$GeneName
        ## Drop the 'GeneName' column since it's now the row names
        #res_updated$GeneName <- NULL
    
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
    
        # ---- Perform GO enrichment analysis (TODO: extract the merged GO IDs from 'Blast2GO 5 Basic' and adapt the code below!)----
        #http://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html
        #If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.
        ## Perform GO enrichment analysis (Using clusterProfiler does not work, use enricher instead.)
        #go_enrichment_up <- enrichGO(gene = gene_list_go_up, OrgDb, ont = "BP")  #keyType = "ENSEMBL", organism = 'abn',
        #To adapt your code for GO enrichment analysis using the up-regulated genes (149 genes) as the test set and all genes in res (3646 genes) as the background, we need to:
        # * Extract the GeneIDs from up_regulated and use them as gene_list (the test set).
        # * Extract the GeneIDs from res and use them as background_genes (the universe).
        # * Create a TERM2GENE dataframe by extracting GO terms from res.
        # * Run the enricher() function using these updated inputs.
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis, where pAdjustMethod is one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
        #Here’s a ranking from least restrictive to most restrictive:
        #    * none – No correction at all.
        #    * fdr (same as BH, Benjamini-Hochberg) – Controls the false discovery rate rather than the family-wise error rate (FWER), allowing more significant results.
        #    * BY (Benjamini-Yekutieli) – Similar to BH but more conservative, accounting for dependence in test statistics.
        #    * Holm – A step-down version of Bonferroni, less strict than Bonferroni but still controls FWER.
        #    * Hochberg – Slightly more powerful than Holm, but still controlling FWER.
        #    * Hommel – A step-up approach controlling FWER, similar to Hochberg but slightly more powerful in some cases.
        #    * Bonferroni – The most conservative method, dividing the p-value threshold by the number of comparisons, leading to very strict significance criteria.
        #So, if you want the least restrictive correction while still controlling for multiple testing, use "fdr" (or "BH"). If you need a more stringent control over Type I errors, methods like Bonferroni or Holm are preferable.
    
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 1.0,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 1.0,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        #Error for GO term: GO:0006807: replace GO:0006807  obsolete nitrogen compound metabolic process
        #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!
    
  5. Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

        # 安装 & 加载必要的库
        if (!requireNamespace("ggvenn", quietly = TRUE)) install.packages("ggvenn")
        if (!requireNamespace("openxlsx", quietly = TRUE)) install.packages("openxlsx")
        #install.packages("ggvenn")
        #install.packages("VennDiagram")
        #install.packages("ggVennDiagram")
    
        library(ggvenn)
        library(openxlsx)
        library(VennDiagram)
    
        res <- read.csv("AUM_vs_MHB-all.csv")     #up149, down65
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_AUM_vs_MHB <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_AUM_vs_MHB <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        res <- read.csv("Urine_vs_AUM-all.csv")    #up155, down105
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_Urine_vs_AUM <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_Urine_vs_AUM <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        res <- read.csv("Urine_vs_MHB-all.csv")   #up259, down138
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_Urine_vs_MHB <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_Urine_vs_MHB <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        # 1. 生成基因集
        AUM_DEGs <- unique(c(up_regulated_AUM_vs_MHB$GeneID, down_regulated_AUM_vs_MHB$GeneID))
        Urine_DEGs <- unique(c(up_regulated_Urine_vs_AUM$GeneID, down_regulated_Urine_vs_AUM$GeneID))
        MHB_DEGs <- unique(c(up_regulated_Urine_vs_MHB$GeneID, down_regulated_Urine_vs_MHB$GeneID))
    
        # 2. 生成 Venn 图
        venn_plot <- ggvenn(
        list(AUM = AUM_DEGs, Urine = Urine_DEGs, MHB = MHB_DEGs),
        fill_color = c("blue", "red", "green"),  # 颜色设定
        stroke_size = 0.5,  # 线条粗细
        set_name_size = 5  # 组名字体大小
        )
    
        # 3. 保存 Venn 图为 PNG
        ggsave("Venn_Diagram.png", plot = venn_plot, width = 6, height = 6, dpi = 300)
    
        # 4. 计算各个集合的交集
        AUM_Urine <- intersect(AUM_DEGs, Urine_DEGs)
        AUM_MHB <- intersect(AUM_DEGs, MHB_DEGs)
        Urine_MHB <- intersect(Urine_DEGs, MHB_DEGs)
        AUM_Urine_MHB <- Reduce(intersect, list(AUM_DEGs, Urine_DEGs, MHB_DEGs))
    
        # 5. 保存基因列表到 Excel 文件
        #venn_gene_list <- list(
        #  "AUM_DEGs" = AUM_DEGs,
        #  "Urine_DEGs" = Urine_DEGs,
        #  "MHB_DEGs" = MHB_DEGs,
        #  "AUM_Urine" = AUM_Urine,
        #  "AUM_MHB" = AUM_MHB,
        #  "Urine_MHB" = Urine_MHB,
        #  "AUM_Urine_MHB" = AUM_Urine_MHB
        #)
        #write.xlsx(venn_gene_list, file = "Venn_GeneLists.xlsx", rowNames = FALSE)
    
        # 创建 Excel 文件,存储各个 Venn 区域的基因
        venn_list <- list(
        "AUM_only" = setdiff(AUM_DEGs, c(Urine_DEGs, MHB_DEGs)),
        "Urine_only" = setdiff(Urine_DEGs, c(AUM_DEGs, MHB_DEGs)),
        "MHB_only" = setdiff(MHB_DEGs, c(AUM_DEGs, Urine_DEGs)),
        "AUM_Urine" = intersect(AUM_DEGs, Urine_DEGs),
        "AUM_MHB" = intersect(AUM_DEGs, MHB_DEGs),
        "Urine_MHB" = intersect(Urine_DEGs, MHB_DEGs),
        "All_three" = Reduce(intersect, list(AUM_DEGs, Urine_DEGs, MHB_DEGs))
        )
        # 写入 Excel
        write.xlsx(venn_list, file = "Venn_Diagram_Genes.xlsx")
    
        # 打印完成信息
        print("Venn diagram saved as 'Venn_Diagram.png' and gene lists saved in 'Venn_Diagram_Genes.xlsx'.")
    

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum