Daily Archives: 2026年6月24日

HUMAnN 通路丰度计算方法详解 (Data_Tam_Metagenomics_2026_Soil)

总体流程

HUMAnN (HMP Unified Metabolic Analysis Network) 是 bioBakery 工作流中用于分析宏基因组功能的核心工具[[12]]。通路丰度的计算是一个多步骤的递归过程:

计算步骤:

  1. 基因家族丰度 → 2. 反应丰度 → 3. 通路丰度

详细计算原理

第1步:基因家族丰度(Gene Family Abundance)

从原始测序 reads 开始:

  • 使用 BLASTX 将 reads 比对到参考数据库(如 UniRef)
  • 根据比对质量、覆盖度、序列长度进行加权
  • 生成 RPK(Reads Per Kilobase)值

公式:

基因丰度 = Σ(比对权重) / 基因长度(kb)

其中每个 read 的总权重为 1.0,根据比对质量分配到多个基因匹配上[[9]]。


第2步:反应丰度(Reaction Abundance)

每个生化反应由一个或多个基因催化:

反应丰度 = Σ(催化该反应的所有基因丰度)

第3步:通路丰度(Pathway Abundance)

这是最关键的一步。通路包含多个反应,反应之间有不同的关系:

核心原则: 通路丰度由”最弱环节”(weakest link)决定[[1]]

计算方法:

  • 串联反应(必须全部存在):使用调和平均数(harmonic mean)
  • 并联反应(可选路径):使用最大值(max)
  • 可选反应:只有当其丰度大于必需反应的调和平均数时才计入[[1]]

最终通路丰度 = 通路中丰度最低的关键反应


具体示例

示例场景:糖酵解通路(Glycolysis)

假设糖酵解通路包含 5 个关键反应(R1-R5):

葡萄糖 → R1 → G6P → R2 → F6P → R3 → ... → 丙酮酸

基因-反应关系:

  • R1: 由基因 GK1 和 GK2 催化(冗余)
  • R2: 由基因 PGI 催化
  • R3: 由基因 PFK 催化
  • R4: 由基因 ALDO 催化
  • R5: 由基因 GAPDH 催化

测序后得到的基因丰度(RPK单位):

GK1:  8.0
GK2:  4.0
PGI:  10.0
PFK:  6.0
ALDO: 7.0
GAPDH: 5.0

计算步骤:

① 计算反应丰度:

R1 = GK1 + GK2 = 8.0 + 4.0 = 12.0  (冗余基因相加)
R2 = PGI = 10.0
R3 = PFK = 6.0
R4 = ALDO = 7.0
R5 = GAPDH = 5.0

② 计算通路丰度: 由于糖酵解是串联反应(所有步骤必须完成),使用”最弱环节”原则:

通路丰度 = min(R1, R2, R3, R4, R5)
          = min(12.0, 10.0, 6.0, 7.0, 5.0)
          = 5.0 RPK

解释: 该样本中糖酵解通路的丰度为 5.0 RPK,意味着”最弱环节”(R5/GAPDH)的覆盖度为 5.0。这表示通路中至少有 5.0 个”完整拷贝”的活性[[1]]。


归一化处理

为什么需要归一化?

原始 RPK 值受测序深度影响,不能直接跨样本比较[[1]]。

示例:

  • 样本 A:总 reads = 1000万,通路丰度 = 5.0 RPK
  • 样本 B:总 reads = 2000万,通路丰度 = 5.0 RPK

虽然都是 5.0 RPK,但样本 A 的相对丰度更高!

归一化方法:

CPM(Counts Per Million)或 RPKM:

归一化丰度 = (原始RPK / 总RPK) × 1,000,000

HUMAnN 输出的 pathabundance_relab.tsv 文件已经是归一化后的相对丰度[[11]]。


输出文件说明

HUMAnN 生成两个关键文件:

文件 含义 取值范围
pathcoverage.tsv 通路覆盖度(存在/缺失概率) 0-1
pathabundance.tsv 通路丰度(相对拷贝数) 0 及以上

覆盖度 vs 丰度:

  • 覆盖度 = 0.8:表示 80% 置信度该通路存在
  • 丰度 = 5.0:表示该通路约有 5 个完整拷贝(在归一化前)[[9]]

分层分析(Stratification)

HUMAnN 还能计算每个物种对通路的贡献:

总通路丰度 = 物种A贡献 + 物种B贡献 + 未分类贡献

例如:

糖酵解通路总丰度: 5.0 RPK
├─ Escherichia coli:  3.0 RPK (60%)
├─ Bacteroides fragilis: 1.5 RPK (30%)
└─ Unclassified: 0.5 RPK (10%)

这帮助您理解哪些微生物在驱动特定功能[[21]]。


关键要点总结

通路丰度 = 最弱关键反应的丰度(木桶原理)
单位:RPK(归一化前)或 CPM/RPKM(归一化后)
跨样本比较:必须使用归一化值(pathabundance_relab.tsv)[[15]]
解释:丰度 5.0 表示通路中最弱环节有 5.0 RPK 的覆盖度
准确性:考虑了基因冗余、反应关系、可选步骤等生物学复杂性


参考文献

  1. bioBakery Forum – Pathway abundance calculation [[1]]
  2. HUMAnN SOP – HMP Data Coordination Center [[9]]
  3. HUMAnN3 Documentation – Huttenhower Lab [[12]]
  4. bioBakery 3 publication (Nature Methods, 2021) [[13]]


HUMAnN Pathway Abundance 计算方法

基本原理

HUMAnN (HMP Unified Metabolic Analysis Network) 是 biobakery 工具集的一部分,用于从宏基因组数据中计算功能通路的丰度。

计算步骤

1. 基因家族定量 (Gene Family Quantification)

  • 首先将测序 reads 比对到 UniRef 基因家族数据库
  • 计算每个基因家族的丰度(reads per kilobase, RPK)

2. 通路映射 (Pathway Mapping)

  • 将基因家族映射到 MetaCyc 通路
  • 一个通路通常包含多个反应步骤,每个步骤可能由多个基因家族催化

3. 通路丰度计算 (Pathway Abundance Calculation)

关键公式:

通路丰度 = min(该通路中所有反应步骤的丰度)

其中:

  • 每个反应步骤的丰度 = 该步骤中所有基因家族丰度的总和
  • 通路的最终丰度 = 所有反应步骤丰度的最小值(瓶颈原则)

具体示例

假设情况

假设有一个简单的代谢通路 “Glycolysis”(糖酵解),包含 3 个反应步骤:

反应步骤 1:葡萄糖 → 葡萄糖-6-磷酸

  • 由基因家族 UniRef90_A 和 UniRef90_B 催化
  • UniRef90_A 丰度 = 100 RPK
  • UniRef90_B 丰度 = 50 RPK
  • 步骤 1 丰度 = 100 + 50 = 150 RPK

反应步骤 2:葡萄糖-6-磷酸 → 果糖-6-磷酸

  • 由基因家族 UniRef90_C 催化
  • UniRef90_C 丰度 = 80 RPK
  • 步骤 2 丰度 = 80 RPK

反应步骤 3:果糖-6-磷酸 → 果糖-1,6-二磷酸

  • 由基因家族 UniRef90_D 和 UniRef90_E 催化
  • UniRef90_D 丰度 = 200 RPK
  • UniRef90_E 丰度 = 120 RPK
  • 步骤 3 丰度 = 200 + 120 = 320 RPK

通路丰度计算

Glycolysis 通路丰度 = min(步骤1, 步骤2, 步骤3)
                    = min(150, 80, 320)
                    = 80 RPK

为什么用最小值?

  • 这遵循”木桶原理”(瓶颈效应)
  • 通路的整体通量受限于最慢/最少的反应步骤
  • 步骤 2 只有 80 RPK,即使其他步骤丰度很高,整个通路的最大通量也只能是 80

归一化 (Normalization)

HUMAnN 还会进行归一化处理:

1. RPK (Reads Per Kilobase)

RPK = (比对到基因的 reads 数) / (基因长度 kb)

2. RPKM/CPM (可选)

RPKM = RPK / (总百万 reads)

3. 相对丰度 (Relative Abundance)

通路相对丰度 = (通路 RPKM) / (所有通路 RPKM 总和) × 1,000,000

单位:CPM (Counts Per Million)


您的数据示例

从您的输出看:

otu_table()   OTU Table:         [ 1097 taxa and 2 samples ]

这 1097 个 “taxa” 实际上是 1097 个 MetaCyc 通路,每个通路的丰度值是经过以下处理的:

  1. 原始丰度:基于上述最小值原则计算
  2. 归一化:转换为相对丰度(sum = 1 或 100%)
  3. 输出文件pathabundance_relab.tsv 中的值就是相对丰度

关键要点

特点 说明
计算方法 取通路中所有反应步骤的最小丰度
单位 通常是相对丰度(0-1 或 0-100%)
生物学意义 反映通路的潜在代谢能力
优势 考虑了通路的完整性,不是简单加和
局限性 无法区分活跃/非活跃通路(需要转录组验证)

注意事项

⚠️ 重要提醒

  • 通路丰度反映的是基因潜力(gene potential),不是实际代谢活性
  • 一个通路存在 ≠ 该通路正在被使用
  • 需要结合转录组(RNA-seq)或代谢组数据才能确定实际活性
  • 对于您的 n=1 样本,只能做描述性比较,无法统计推断

STEP 5 — Descriptive visualisations (appropriate for n = 1 per group) for Data_Tam_Metagenomics_2026_Soil

Updated Step 5: PNG Figures + Complete Excel Exports

Prerequisites Update

Add openxlsx to your package list (for Excel export):

# Install if needed
install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr",
                   "tidyr", "pheatmap", "openxlsx"))

And load it:

library(openxlsx)

Complete Step 5 — Replace your existing Step 5 with this

# =============================================================================
# STEP 5 — Visualisations (PNG) + Complete Excel exports
# =============================================================================

dir.create("figures", showWarnings = FALSE)
dir.create("tables",  showWarnings = FALSE)

# Helper: safe log2 fold change (adds pseudocount to avoid log(0))
safe_log2fc <- function(x, y, pseudo = 1e-6) {
  log2((x + pseudo) / (y + pseudo))
}

# =============================================================================
# 5a. Top-N species stacked bar plot (Loc1 vs Loc4)
# =============================================================================

top_n_sp <- 20
top_species <- names(sort(rowMeans(otu_table(species_ps)),
                          decreasing = TRUE))[1:top_n_sp]

ps_top <- prune_taxa(top_species, species_ps)

df_species <- psmelt(ps_top) %>%
  mutate(Species = factor(Species, levels = rev(top_species)))

p_species <- ggplot(df_species, aes(x = Location, y = Abundance, fill = Species)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "D") +
  labs(title = paste("Top", top_n_sp, "Species by Location"),
       x = "Location", y = "Relative Abundance", fill = "Species") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/species_top20_barplot.png", p_species,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/species_top20_barplot.png\n")

# =============================================================================
# 5b. Species heatmap (all species, row-scaled)
# =============================================================================

otu_mat <- as.matrix(otu_table(species_ps))

# Filter species with very low abundance (max < 0.01% across both samples)
keep <- apply(otu_mat, 1, max) > 0.0001
otu_filt <- otu_mat[keep, , drop = FALSE]

# Annotation column
ann_col <- data.frame(Location = metadata[colnames(otu_filt), "Location"],
                      row.names = colnames(otu_filt))

# Write heatmap to PNG
png("figures/species_heatmap.png",
    width = 10, height = max(8, 0.25 * nrow(otu_filt) + 2),
    units = "in", res = 300)

pheatmap(otu_filt,
         scale         = "row",
         clustering_distance_rows = "euclidean",
         clustering_method        = "complete",
         annotation_col = ann_col,
         main           = "Species Abundance Heatmap (row-scaled)",
         fontsize_row   = 6,
         fontsize_col   = 10,
         show_rownames  = nrow(otu_filt) <= 80)

dev.off()
cat("✅ Saved: figures/species_heatmap.png\n")

# =============================================================================
# 5c. Top pathways stacked bar plot
# =============================================================================

top_n_pw <- 20
top_pw_names <- names(sort(rowMeans(otu_table(pathway_ps)),
                           decreasing = TRUE))[1:top_n_pw]

ps_pw_top <- prune_taxa(top_pw_names, pathway_ps)

df_pw <- psmelt(ps_pw_top) %>%
  mutate(Pathway = factor(Pathway, levels = rev(top_pw_names)))

p_pw <- ggplot(df_pw, aes(x = Location, y = Abundance, fill = Pathway)) +
  geom_bar(stat = "identity", position = "fill", width = 0.6) +
  coord_flip() +
  scale_fill_viridis_d(option = "C") +
  labs(title = paste("Top", top_n_pw, "HUMAnN Pathways by Location"),
       x = "Location", y = "Relative Abundance", fill = "Pathway") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        legend.key.size  = unit(0.4, "cm"),
        legend.text      = element_text(size = 7))

ggsave("figures/pathways_top20_barplot.png", p_pw,
       width = 10, height = 8, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathways_top20_barplot.png\n")

# =============================================================================
# 5d. Pathway dot plot (Loc1 vs Loc4)
# =============================================================================

df_pw_wide <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway") %>%
  filter(Pathway %in% top_pw_names) %>%
  pivot_longer(-Pathway, names_to = "Sample", values_to = "Abundance") %>%
  left_join(data.frame(Sample = rownames(metadata_clean),
                       Location = metadata_clean$Location,
                       stringsAsFactors = FALSE),
            by = "Sample")

p_dot <- ggplot(df_pw_wide, aes(x = Location, y = Pathway,
                                 size = Abundance, color = Abundance)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(title = "Pathway Abundance: Loc1 vs Loc4",
       x = "Location", y = "Pathway") +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 7))

ggsave("figures/pathway_dotplot.png", p_dot,
       width = 8, height = 10, dpi = 300, bg = "white")
cat("✅ Saved: figures/pathway_dotplot.png\n")

# =============================================================================
# 5e. COMPLETE species list → Excel
# =============================================================================

# Build full species table (ALL species, no cutoff)
sp_full <- as.data.frame(otu_table(species_ps)) %>%
  rownames_to_column("Species")

# Ensure columns exist (defensive)
if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(sp_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in species OTU table.")
}

sp_full <- sp_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Species,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for species
sp_wb <- createWorkbook()
addWorksheet(sp_wb, "All_Species")
addWorksheet(sp_wb, "Top50_by_Diff")
addWorksheet(sp_wb, "Top50_by_Abundance")
addWorksheet(sp_wb, "Loc1_only")
addWorksheet(sp_wb, "Loc4_only")
addWorksheet(sp_wb, "Shared")

writeData(sp_wb, "All_Species",       sp_full)
writeData(sp_wb, "Top50_by_Diff",     head(sp_full, 50))
writeData(sp_wb, "Top50_by_Abundance",
          sp_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(sp_wb, "Loc1_only",
          sp_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(sp_wb, "Loc4_only",
          sp_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(sp_wb, "Shared",
          sp_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

# Formatting
header_style <- createStyle(textDecoration = "bold", bgFill = "#D3D3D3")
for (sh in c("All_Species","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(sp_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(sp_wb, sh, cols = 1:ncol(sp_full), widths = "auto")
  freezePane(sp_wb, sh, firstRow = TRUE)
}

saveWorkbook(sp_wb, "tables/species_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/species_Loc1_vs_Loc4.xlsx  (%d species total)\n",
            nrow(sp_full)))

# =============================================================================
# 5f. COMPLETE pathway list → Excel
# =============================================================================

pw_full <- as.data.frame(otu_table(pathway_ps)) %>%
  rownames_to_column("Pathway")

if (!all(c("Soil_Loc1", "Soil_Loc4") %in% colnames(pw_full))) {
  stop("⚠️  Expected columns 'Soil_Loc1' and 'Soil_Loc4' not found in pathway OTU table.")
}

pw_full <- pw_full %>%
  mutate(
    Abundance_Loc1         = Soil_Loc1,
    Abundance_Loc4         = Soil_Loc4,
    Diff_Loc4_minus_Loc1   = Soil_Loc4 - Soil_Loc1,
    Log2FC_Loc4_vs_Loc1    = safe_log2fc(Soil_Loc4, Soil_Loc1),
    Present_in_Loc1        = Soil_Loc1 > 0,
    Present_in_Loc4        = Soil_Loc4 > 0,
    Total_Abundance        = Soil_Loc1 + Soil_Loc4,
    Mean_Abundance         = (Soil_Loc1 + Soil_Loc4) / 2
  ) %>%
  select(Pathway,
         Abundance_Loc1, Abundance_Loc4,
         Diff_Loc4_minus_Loc1, Log2FC_Loc4_vs_Loc1,
         Present_in_Loc1, Present_in_Loc4,
         Total_Abundance, Mean_Abundance) %>%
  arrange(desc(abs(Diff_Loc4_minus_Loc1)))

# Write multi-sheet Excel workbook for pathways
pw_wb <- createWorkbook()
addWorksheet(pw_wb, "All_Pathways")
addWorksheet(pw_wb, "Top50_by_Diff")
addWorksheet(pw_wb, "Top50_by_Abundance")
addWorksheet(pw_wb, "Loc1_only")
addWorksheet(pw_wb, "Loc4_only")
addWorksheet(pw_wb, "Shared")

writeData(pw_wb, "All_Pathways",       pw_full)
writeData(pw_wb, "Top50_by_Diff",      head(pw_full, 50))
writeData(pw_wb, "Top50_by_Abundance",
          pw_full %>% arrange(desc(Mean_Abundance)) %>% head(50))
writeData(pw_wb, "Loc1_only",
          pw_full %>% filter(Present_in_Loc1 & !Present_in_Loc4))
writeData(pw_wb, "Loc4_only",
          pw_full %>% filter(!Present_in_Loc1 & Present_in_Loc4))
writeData(pw_wb, "Shared",
          pw_full %>% filter(Present_in_Loc1 & Present_in_Loc4))

for (sh in c("All_Pathways","Top50_by_Diff","Top50_by_Abundance",
             "Loc1_only","Loc4_only","Shared")) {
  addStyle(pw_wb, sh, style = header_style, rows = 1, gridExpand = TRUE)
  setColWidths(pw_wb, sh, cols = 1:ncol(pw_full), widths = "auto")
  freezePane(pw_wb, sh, firstRow = TRUE)
}

saveWorkbook(pw_wb, "tables/pathways_Loc1_vs_Loc4.xlsx", overwrite = TRUE)
cat(sprintf("✅ Saved: tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways total)\n",
            nrow(pw_full)))

# =============================================================================
# Summary
# =============================================================================

cat("\n========================================\n")
cat("STEP 5 COMPLETE\n")
cat("========================================\n")
cat("Figures (PNG, 300 dpi):\n")
cat("  • figures/species_top20_barplot.png\n")
cat("  • figures/species_heatmap.png\n")
cat("  • figures/pathways_top20_barplot.png\n")
cat("  • figures/pathway_dotplot.png\n")
cat("\nExcel tables (complete lists, no cutoff):\n")
cat(sprintf("  • tables/species_Loc1_vs_Loc4.xlsx   (%d species)\n", nrow(sp_full)))
cat(sprintf("  • tables/pathways_Loc1_vs_Loc4.xlsx  (%d pathways)\n", nrow(pw_full)))
cat("========================================\n")

What You Get

📊 Figures (PNG, 300 dpi, publication-ready)

File Content Size
species_top20_barplot.png Top 20 species stacked bar 10 × 8 in
species_heatmap.png All species above 0.01% threshold, row-scaled 10 × auto (scales with # species)
pathways_top20_barplot.png Top 20 pathways stacked bar 10 × 8 in
pathway_dotplot.png Top 20 pathways dot plot 8 × 10 in

📑 Excel Files (complete lists, no cutoff)

Each workbook contains 6 sheets:

Sheet Content
All_Species / All_Pathways Every detected feature, sorted by absolute difference
Top50_by_Diff Top 50 features by Loc4 − Loc1
Top50_by_Abundance Top 50 features by mean abundance
Loc1_only Features detected only in Loc1
Loc4_only Features detected only in Loc4
Shared Features detected in both locations

📋 Columns in each Excel file

Column Meaning
Species / Pathway Feature name
Abundance_Loc1 Raw relative abundance in Loc1
Abundance_Loc4 Raw relative abundance in Loc4
Diff_Loc4_minus_Loc1 Absolute difference (Loc4 − Loc1)
Log2FC_Loc4_vs_Loc1 Log₂ fold change (with pseudocount 1e-6 to handle zeros)
Present_in_Loc1 TRUE/FALSE
Present_in_Loc4 TRUE/FALSE
Total_Abundance Sum across both samples
Mean_Abundance Mean across both samples

Notes

  1. Log₂ fold change: Uses a small pseudocount (1e-6) to avoid log(0). For features absent in one location, this gives a large but finite fold change — interpret these as “presence/absence” rather than true fold change.

  2. Heatmap filtering: Species with maximum abundance < 0.01% across both samples are excluded to keep the heatmap readable. Adjust the threshold in keep <- apply(otu_mat, 1, max) > 0.0001 if needed.

  3. Excel formatting: Headers are bold with grey background, columns auto-sized, and the first row is frozen for easy scrolling.

  4. File locations: All outputs go into figures/ and tables/ subfolders inside your current working directory (reports/).