Daily Archives: 2025年12月15日

Phyloseq objects used in the workflow (ps_*)

!!!!! Good candidate for the workshop for the clinicians !!!!!

“First, I generate ps_rarefied from ps_filt. Then, for cleaner composition plots, I create ps_abund / ps_abund_rel by filtering taxa for plotting (e.g., keeping taxa with mean relative abundance > 0.1%; resulting in ~95 taxa across 239 samples). Is it correct to compute alpha diversity and beta diversity using ps_abund (the taxa-filtered object)?”

Not really. ps_abund / ps_abund_rel (taxa filtered “for plotting”) is generally not the right object for alpha- or beta-diversity, unless you explicitly want diversity calculated only within that filtered subset. Luckily, I directly using alpha- and beta diversity from qiime2-result and MicrobiotaProcess –> which means we should feed MicrobiotaProcess with ps_filt (not rarefied) and for differential analysis also using ps_filt (not rarefied)

For that sentence, I mean these specific objects:

  • For Jaccard / unweighted UniFrac (presence–absence / detection-sensitive):
    ✅ compute beta diversity on ps_rarefied
    = the rarefied-count phyloseq object created from ps_filt (with the full taxon set, i.e., no plotting-only taxon filter).

  • For Bray–Curtis (abundance-based):
    ✅ compute beta diversity on a non-rarefied transformed object, typically:

    • ps_rel = relative abundances computed from ps_filt (not rarefied), or
    • a Hellinger-transformed version derived from ps_filt (if that is your chosen workflow).
  • For CLR / Aitchison distance:
    ✅ compute beta diversity on a CLR-transformed version of the non-rarefied counts from ps_filt
    (often named ps_clr / otu_clr, created from ps_filt + pseudocount), not on ps_abund.

The “plotting-filtered object” I’m warning against is:

  • ps_abund / ps_abund_rel = taxa filtered for visualization (e.g., mean relative abundance > 0.1%).

Why it’s not “correct” for diversity

Filtering taxa by mean relative abundance (e.g., >0.1%) removes many low-abundance / low-prevalence taxa and changes the community profile.

Alpha diversity

  • Observed richness will be artificially lower (because taxa were removed).
  • Shannon/Simpson can also change because you altered the abundance distribution. ✅ Therefore, compute alpha diversity on ps_rarefied (unfiltered taxa), not on ps_abund.

Beta diversity

This depends on the distance metric, but heavy taxon filtering can distort results:

  • Presence/absence distances (e.g., Jaccard, unweighted UniFrac) are very sensitive to removing taxa → filtering can strongly change distances and clustering.
  • Abundance-weighted distances (e.g., Bray–Curtis, weighted UniFrac) can also shift after filtering; mild filters (e.g., remove singletons) may be acceptable, but mean rel. abundance >0.1% can be quite aggressive (95 taxa suggests strong pruning). ✅ Best practice: compute beta diversity on ps_rarefied (for Jaccard/unweighted UniFrac) or on non-rarefied transformed data (for Bray/CLR workflows), not on a plotting-filtered object.

Recommended workflow (clean separation of purposes)

For analysis (diversity)

  • Alpha diversity: ps_rarefied (derived from ps_filt)
  • Beta diversity:
    • Jaccard / unweighted UniFrac: ps_rarefied
    • Bray–Curtis: usually from ps_rel (or Hellinger), often without rarefaction
    • Aitchison/CLR: CLR-transformed (non-rarefied) data (no rarefaction)

For visualization (composition plots only)

  • Optional cleaner composition plots: ps_abund / ps_abund_rel derived from ps_rel (taxa filtered to make stacked bars/heatmaps readable)

When would ps_abund be acceptable for beta diversity?

Only if your question is explicitly:

“How do samples differ considering only dominant taxa (≥0.1% mean abundance)?”

Then it’s a valid different analysis, but you should label it clearly as “dominant-taxa-only beta diversity.”

If you tell me which beta distance(s) you use (Bray, Jaccard, weighted/unweighted UniFrac, CLR/Aitchison), I can recommend the best object + transformation for your exact setup.


The correct answer is: beta diversity does not universally require rarefaction.

Why this confusion happens

Historically, many microbiome pipelines used rarefied counts for beta diversity to “standardize depth.” That’s why you often see “beta diversity → rarefy” in older workflows. But in modern practice, rarefaction for beta diversity is optional and sometimes not recommended, because it discards data and adds randomness.

When rarefaction is a good idea for beta diversity

Use ps_rarefied when your beta metric is very sensitive to library size, especially:

  • Presence/absence distances: Jaccard, unweighted UniFrac These are strongly influenced by detection of rare taxa, which depends on depth. Rarefaction is a common way to make them comparable.
  • If your field/reviewer expectations specifically require rarefied beta diversity for certain metrics.

When rarefaction is not necessary (often better avoided)

Prefer non-rarefied approaches when you use:

  • Bray–Curtis on relative abundance (or Hellinger-transformed data) Rarefaction can reduce power and add noise; relative abundance already removes the “sum” effect (though compositionality remains).
  • Aitchison distance / PCA via CLR transform (Euclidean on CLR) This is a standard compositional approach and typically does not use rarefaction.
  • DESeq2 VST / rlog for ordination-like PCA (count-model-based normalization) Again: no rarefaction.

Important note about your original code

Doing rarefaction → compositional (“relative abundance”) is usually redundant for beta diversity/ordination, because after rarefaction every sample has the same total, and compositional transformation just divides by that constant.

Corrected “rule of thumb”

  • Alpha diversity: ps_rarefied ✅ (common)
  • Beta diversity:
    • Unweighted UniFrac / Jaccard: ps_rarefied ✅ (often recommended)
    • Bray–Curtis / ordination on abundances: ps_rel or Hellinger ✅ (rarefaction optional)
    • Aitchison (CLR): CLR-transformed (non-rarefied) ✅ (no rarefaction)

If you tell me exactly which beta-diversity distance(s) you compute (Bray–Curtis? weighted/unweighted UniFrac? Jaccard? Aitchison/CLR?), I can recommend the cleanest “use this object + this transformation” setup.


Doing rarefaction and then compositional normalization (relative abundance) is often redundant.

Short answer: Don’t need both — rarefaction + compositional normalization is redundant in most cases. Doing rarefaction and then compositional normalization (relative abundance) is often redundant.

# RAREFACTION
set.seed(9242)  # This will help in reproducing the filtering and nomalisation.
ps.ng.tax <- rarefy_even_depth(ps.ng.tax, sample.size = 6389)
total <- 6389

# NORMALIZE number of reads in each sample using median sequencing depth.
total = median(sample_sums(ps.ng.tax))
#> total
#[1] 42369
standf = function(x, t=total) round(t * (x / sum(x)))
ps.ng.tax = transform_sample_counts(ps.ng.tax, standf)
ps.ng.tax_rel <- microbiome::transform(ps.ng.tax, "compositional")

saveRDS(ps.ng.tax, "./ps.ng.tax.rds")

What happens mathematically

1) Rarefaction forces every sample to the same library size (e.g., 6,389 reads). 2) Compositional normalization divides each sample by its total so it sums to 1.

After rarefaction, every sample has the same total, so compositional normalization is essentially: relative_abundance = rarefied_counts / constant_depth. That step is valid, but typically unnecessary.

Practical implication

  • If your goal is relative abundance–based visualization or many beta-diversity analyses, you can usually skip rarefaction and compute relative abundances (or other transformations) on the full data.
  • If your goal is alpha diversity, rarefaction is commonly used, but then you typically keep that rarefied object for alpha diversity only (not as a general-purpose normalized dataset).

So: rarefaction alone can be fine for specific tasks (especially alpha diversity), but rarefaction + compositional normalization together is usually not needed.


Mapping new ps_* objects to the old ps.ng.tax* objects (and overwrite notes)

ps_* object What it contains Old R-script equivalent (ps.ng.tax*) Overwrite notes for old object(s) Taxonomic composition Alpha diversity Beta diversity DESeq2 differential abundance
ps_raw Raw imported phyloseq object (integer counts; as imported) (no direct equivalent) ⚠️ only if you also apply sample filtering first
ps_base ps_raw + taxonomy + sample metadata aligned (still raw counts) (closest to) ps.ng.tax before overwrite Old ps.ng.tax BEFORE overwrite: integer count table (absolute abundance). ⚠️ only if you also apply sample filtering first
ps_pruned Optional subset of ps_base (e.g., remove unwanted samples by ID/pattern); still raw counts (subset of) ps.ng.tax before overwrite Old ps.ng.tax BEFORE overwrite: same as above, but without any sample subsetting unless you did it. ⚠️ only if you also apply low-depth filtering
ps_filt Filtered samples (low-depth samples removed) + taxa with nonzero totals; absolute counts ps.ng.tax before overwrite (but with explicit low-depth filtering) Old ps.ng.tax BEFORE overwrite: raw integer counts; new ps_filt is the “cleaned” version after sample-depth QC. ✅ as a starting point (but plot on ps_rel) ✅ as input to rarefaction ✅ as input to rarefaction ✅ as input to ps_deseq
ps_rel Relative abundance (compositional) computed from ps_filt ps.ng.tax_rel (conceptually) In the old script, ps.ng.tax_rel is “relative abundance of ps.ng.tax”, but its meaning depends on whether it was computed before or after ps.ng.tax was overwritten. ✅ primary
ps_abund Absolute counts after “plotting taxa filter” (e.g., mean rel. abundance > 0.1%), derived from ps_filt via ps_rel ps.ng.tax_abund Old ps.ng.tax_abund was created by filtering taxa using mean relative abundance (> 0.001) and then pruning counts. ✅ (if you want cleaner plots) (not recommended)
ps_abund_rel Relative abundance computed from ps_abund (filtered taxa set) ps.ng.tax_abund_rel Old ps.ng.tax_abund_rel was relative abundance of the filtered-taxa object. ✅ (clean composition plots)
ps_rarefied Rarefied counts from ps_filt (even depth) ps.ng.tax after overwrite Old ps.ng.tax was overwritten 1× by rarefy_even_depth(ps.ng.tax, sample.size = 41764) → after this, ps.ng.tax no longer meant raw counts; it meant rarefied counts. ✅ primary ✅ primary
ps_deseq Non-rarefied integer counts from ps_filt + optional count-based taxon prefilter (e.g., total ≥ 10) (no direct equivalent) Old ps.ng.tax_abund is not a good DESeq2 analogue because it used a mean-relative-abundance filter; ps_deseq uses count-based prefiltering (optional) and keeps integer counts. ✅ primary

Overwrite summary (old script):

  • ps.ng.tax was overwritten 1 time:
    • Before overwrite: ps.ng.tax = absolute abundance (raw integer counts).
    • After overwrite: ps.ng.tax = rarefied counts, produced by rarefy_even_depth(ps.ng.tax, sample.size = 41764).
ps_* object What it contains Taxonomic composition Alpha diversity Beta diversity DESeq2 differential abundance
ps_raw Raw imported phyloseq object (integer counts; as imported) ❌ (not recommended directly) ⚠️ only if you also apply sample filtering first
ps_base ps_raw + taxonomy + sample metadata aligned (still raw counts) ⚠️ only if you also apply sample filtering first
ps_pruned Optional subset of ps_base (e.g., remove unwanted samples by ID/pattern); still raw counts ⚠️ only if you also apply low-depth filtering
ps_filt Filtered samples (low-depth samples removed) + taxa with nonzero totals; absolute counts ✅ as a starting point (but plot on ps_rel) ✅ as the input to rarefaction ✅ as the input to rarefaction ✅ as the input to ps_deseq
ps_rel Relative abundance (compositional) computed from ps_filt ✅ primary
ps_abund Absolute counts after “plotting taxa filter” (e.g., mean rel. abundance > 0.1%), derived from ps_filt via ps_rel ✅ (if you want cleaner plots) ❌ (not recommended)
ps_abund_rel Relative abundance computed from ps_abund (filtered taxa set) ✅ (clean composition plots)
ps_rarefied Rarefied counts from ps_filt (even depth) ✅ primary ✅ primary
ps_deseq Non-rarefied integer counts from ps_filt + optional count-based taxon prefilter (e.g., total ≥ 10) ✅ primary

Why the ΔadeIJ Evidence Chain Is Not Fully “Closed”

efflux_logica_table

这里说“证据链没有闭环”,不是否定结果,而是从审稿人最严格的因果标准来看,ΔadeIJ 这条线目前更像“相关性很强 + 合理推断”,但还缺少几步能把“推断”锁死成“因果”的关键证据。

1) 目前已有的证据(强相关)

  • 表型:ΔadeIJ 的 ROS 更高,对 Cu²⁺ / SNP 更敏感
  • 转录组:ΔadeIJ 上调金属外排/解毒/应激相关基因

因此很自然会推断:

AdeIJ 参与金属/氧化应激稳态(redox & metal homeostasis)

这属于 “一致性证据(consistent with)”,说服力已经很不错。

2) 为什么说“还没闭环”:少了把“推断 → 因果”钉死的步骤

审稿人可能会问:
“你怎么证明是 缺失 AdeIJ 导致金属/ROS 失衡,而不是其他原因?”

通常闭环至少需要补上一类证据(不一定要做,但逻辑上缺):

A. 互补实验(complementation / rescue)

理想闭环:

  • ΔadeIJ 表型变差
  • 把 adeIJ(或 adeIJK)补回去 → ROS/Cu²⁺/SNP 表型恢复到 WT(rescue)

没有这一步,审稿人可能会担心:

  • 极性效应(polar effect)
  • 二次突变(secondary mutation)
  • 背景差异(background effects)

B. 直接测量“因果中间量”(mechanistic intermediate)

你们推断的是“金属/毒性底物积累 → ROS 上升 → Cu²⁺/SNP 敏感”。
但目前缺少对“中间量”的直接测量,例如:

  • 细胞内 铜含量 是否在 ΔadeIJ 增加(ICP-MS/比色法等)?
  • 是否有更明确的蛋白氧化损伤/Fe–S 破坏指标?
  • 是否存在呼吸链/膜电位异常导致 ROS 增加?

目前是“结果(ROS 高)+ 反应(相关基因上调)”,但还没直接证明“金属积累/底物积累”这一环节。

C. 排除替代解释(alternative explanations)

ΔadeIJ 的高 ROS 也可能来自:

  • 代谢状态改变导致呼吸链泄漏增加
  • 生长状态差异影响 ROS 测定
  • Cu²⁺ 敏感来自包膜改变而非金属外排不足

如果未排除,结论更适合写成 consistent with,而非 crucial determinant

3) “闭环”长什么样(最理想的因果链)

AdeIJ 缺失
细胞内金属/毒性底物积累(直接测到)
ROS 上升(你们已测到)
Cu²⁺/SNP 敏感(你们已测到)
补回 adeIJ 后表型恢复(rescue)

这就是“证据链闭环”。

4) 对写作的建议(不一定要加实验)

投稿时完全可以不补实验,但建议在文字上更稳健:

  • 避免:“AdeIJ is crucial/essential for maintaining …”
  • 推荐:
    • “These data support / are consistent with a role for AdeIJ in …”
    • “We suggest AdeIJ contributes to …”
    • “Further work (e.g., complementation or intracellular metal quantification) would be needed to establish causality.”