This is a brilliant and fundamental question. The short answer is: KEGG enrichment statistics (including GeneRatio, BgRatio, p-values, and the original Count) are strictly based on KOs (KEGG Orthologs), NOT your original GeneIDs.
In fact, my previous attempt to “fix” the discrepancy by changing the Count and GeneRatio to match the number of GeneIDs was statistically incorrect. I apologize for that confusion. By changing the ratio to match GeneIDs while leaving the denominator as the number of KOs, we created a mathematically invalid ratio that no longer matches the p-value calculated by clusterProfiler.
Here is the exact scientific explanation of why this happens, why your co-author saw a discrepancy, and how to properly handle it.
1. Why Enrichment is Based on KOs
KEGG pathways are defined by KOs (e.g., K00001), not by specific genomic loci (like ABAYE0067).
When you run enrichKEGG(), the tool performs a hypergeometric test.
- The Universe (Background): All KOs in the KEGG database.
- The Sample (Your Input): The unique KOs extracted from your DEGs.
- The Calculation: It calculates the probability of finding $k$ KOs in a pathway out of your $n$ input KOs.
Therefore, the Count column represents the number of unique KOs from your input list that map to that pathway. The GeneRatio is (Number of input KOs in pathway) / (Total input KOs).
2. Why the Co-author Saw a “Discrepancy”
Your co-author noticed that Count was 3, but only 2 GeneIDs were listed. This is not a bug; it is a biological reality of gene annotation.
Here are the two reasons why Count (KOs) $\neq$ Number of GeneIDs:
- Multiple KOs map to the same GeneID: A single gene in your annotation file might have multiple functional domains or be assigned to multiple KOs (e.g.,
ABAYE2071might be annotated asK00001, K00002). If both K00001 and K00002 are enriched in a pathway,Count= 2, but they both map back to the exact same gene:ABAYE2071. - Missing Mappings in the DEG list: The KEGG database says 3 KOs belong to this pathway. Your DEGs contain those 3 KOs. However, when your script tries to map those 3 KOs back to your specific
Gene_Expression_with_Annotationsfile, one of the KOs might not have a corresponding GeneID in that specific file (or it was filtered out). Thus,Count= 3, but only 2 GeneIDs are revealed.
3. The Correct Solution
Do not change the Count or GeneRatio. They are statistically correct and mathematically tied to the p-value.
The geneID column is purely for display purposes so the biologist can see which actual genes are involved. It is completely normal and expected that the number of displayed GeneIDs differs from the Count.
Here is the scientifically accurate version of the process_kegg_enrichment function. It maps the KOs back to GeneIDs for display, but leaves the statistical columns (Count, GeneRatio) completely untouched.
The Corrected R Function
Replace your process_kegg_enrichment function with this:
============================================================
HELPER FUNCTION: Process KEGG enrichment (STATISTICALLY CORRECT)
============================================================
process_kegg_enrichment <- function(enrichment_result, kegg_to_geneid_mapping) {
if (is.null(enrichment_result)) {
message(" -> Enrichment result is NULL")
return(NULL)
}
enrichment_df <- as.data.frame(enrichment_result)
if (nrow(enrichment_df) == 0) return(NULL)
message(" -> Mapping KEGG KOs back to original GeneIDs for display...")
# 1. Split the original KO IDs and map to GeneIDs
expanded <- enrichment_df %>%
tidyr::separate_rows(geneID, sep = "/") %>% # Split KOs
dplyr::left_join(kegg_to_geneid_mapping, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%
dplyr::distinct() %>%
dplyr::group_by(ID) %>%
dplyr::summarise(
# Concatenate all unique GeneIDs that map to the KOs in this pathway
Mapped_GeneID = paste(unique(na.omit(GeneID)), collapse = "/"),
.groups = "drop"
)
# 2. Replace the geneID column, but DO NOT TOUCH Count or GeneRatio
# The original Count and GeneRatio are based on KOs and are statistically valid.
result <- enrichment_df %>%
dplyr::select(-geneID) %>% # Drop the old KO-based geneID column
dplyr::left_join(expanded, by = "ID") %>% # Add the new GeneID-based column
dplyr::rename(geneID = Mapped_GeneID)
# 3. Reorder columns for clean Excel output
desired_order <- c(
"category", "subcategory", "ID", "Description", "GeneRatio", "BgRatio",
"RichFactor", "FoldEnrichment", "zScore", "pvalue", "p.adjust", "qvalue",
"Count", "geneID"
)
result <- result %>% dplyr::select(dplyr::any_of(desired_order), dplyr::everything())
message(" ✅ SUCCESS: GeneIDs mapped for display. Statistical columns (Count/GeneRatio) preserved based on KOs.")
return(result)
}
💡 How to Reply to Your Co-author
You can use this exact explanation to clarify the situation with your co-author:
“Dear [Co-author’s Name],
Thank you for pointing out the discrepancy between the
Countand the number of GeneIDs listed. I have investigated the pipeline and can clarify why this occurs.The KEGG enrichment analysis (
enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw GeneIDs. Because multiple GeneIDs can share the same KO annotation, and a single GeneID can be assigned multiple KOs, the mapping between genes and KOs is many-to-many.Therefore:
CountandGeneRatiorepresent the number of unique KOs from our input list that fall into that pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain based on KOs to be statistically valid.- The
geneIDcolumn is provided purely for display purposes, showing the actual genomic loci (e.g.,ABAYE...) that correspond to those enriched KOs.If
Countis 3 but only 2 GeneIDs are listed, it simply means that 3 unique KOs were enriched in that pathway, but when mapped back to our specific annotation file, those 3 KOs correspond to only 2 unique genes (either because two KOs map to the same gene, or one KO lacked a matching gene in our specific DEG list).I have updated the script to ensure the statistical columns remain perfectly accurate based on KOs, while the
geneIDcolumn accurately reflects all available gene mappings for your review.”
结合您刚刚修复并导出的新 Excel 文件(其中成功增加了 KEGG_ko 列),我们可以用具体的数据例子来回复您的共同作者。这样不仅能解释原理,还能直接“用数据说话”,彻底打消他/她关于“annotation parsing error(注释解析错误)”的疑虑。
以下是为您改进后的回复草稿:
Subject: Re: Clarification on KEGG Enrichment Output (Count vs. GeneID discrepancy)
Dear [Co-author’s Name],
Thank you for pointing out the apparent discrepancy between the Count and the number of GeneIDs listed in the KEGG enrichment results. I have thoroughly investigated the pipeline and can clarify exactly why this occurs, and how I have updated the output files to make it completely transparent.
The core reason is that the KEGG enrichment analysis (enrichKEGG) is statistically performed based on KEGG Orthologs (KOs), not the raw physical GeneIDs. The mapping between genes and KOs is many-to-many: multiple genes can share the same KO annotation, and a single gene can be assigned multiple KOs.
Therefore, the columns should be interpreted as follows:
CountandGeneRatio: Represent the number of unique KOs from our input DEG list that fall into that specific pathway. These values are mathematically tied to the calculated p-values and FoldEnrichment, so they must remain strictly based on KOs to be statistically valid.KEGG_ko(NEW COLUMN): To address your concern and provide full traceability, I have added this new column to the Excel output. It explicitly lists the exact KOs that contributed to theCount.geneID: Shows the actual genomic loci (e.g., ABAYE…) that correspond to those enriched KOs.
To illustrate with the exact examples you mentioned:
Let’s look at the updated results for ko03070 (Bacterial secretion system) and ko05111 (Biofilm formation):
- For ko03070, the
Countis 2. If you look at the newKEGG_kocolumn, it correctly lists 2 unique KOs:K02456/K02457. However, when we map these 2 KOs back to our specific genome annotation, they both happen to be carried by the exact same physical gene:ABAYE2071. Thus, thegeneIDcolumn only reveals 1 gene. - Similarly, for ko05111, the
Countis 3 (KOs:K02456/K02457/K03092), but they map back to only 2 unique physical genes (ABAYE2071/ABAYE3136).
This is not an annotation parsing error, but a biological reality of how gene functions are annotated in the KEGG database. I have updated the pipeline to ensure the statistical columns remain perfectly accurate based on KOs, while the newly added KEGG_ko column and the geneID column provide a complete, traceable, and biologically meaningful picture for your review.
Please let me know if you would like to discuss this further or if you need any additional adjustments to the output format.
Best regards,
💡 改进说明:
- 引入了新列
KEGG_ko的解释:明确告诉共同作者,为了增加透明度,我们特意加上了这一列,让他能直接看到底是哪几个 KO 参与了计算。 - 使用了新文件中的真实数据作为例子:直接引用了
ko03070(2个KO对应1个基因)和ko05111(3个KO对应2个基因)的具体数据(K02456/K02457等),这比纯理论解释更有说服力,直接证明了我们没有算错,也没有解析错误。 - 语气专业且自信:明确界定这不是 “annotation parsing error”,而是 KEGG 数据库本身的 “many-to-many” 生物学特性,展现了您对生信分析流程的透彻理解。