✅ 简短回答:是的,nf-core/methylong 接受 2 个生物学重复进行 DMR 分析
但需要注意统计效力的限制和结果解读的谨慎性。
🔧 技术层面:methylong 如何支持 2 重复设计
1️⃣ 参数配置示例
nextflow run nf-core/methylong \
-r 2.0.0 \
-profile docker \
--input samplesheet_6mA.csv \ # 包含 2 个重复/组
--outdir methylome_out_6mA \
--ont_aligner minimap2 \
--m6a \
--skip_snvs \
--dmr_population_scale \ # ⭐ 启用群体水平比较
--population_dmrer dss \ # ⭐ 使用 DSS (支持小样本)
--dmr_a "An6_rep1,An6_rep2" \ # 组 A: 2 个重复
--dmr_b "BG5_rep1,BG5_rep2" \ # 组 B: 2 个重复
-resume
2️⃣ samplesheet 格式要求
group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont
⚠️ 关键:
group列的值必须与--dmr_a/--dmr_b中的组名完全一致。
⚠️ 统计层面:2 重复的局限性与应对策略
🔹 DSS 在 2 重复下的工作原理
| 特性 | 说明 | 对您的影响 |
|---|---|---|
| 贝叶斯收缩 | 借用全基因组位点信息”收缩”方差估计 | ✅ 提升小样本稳健性,是 2 重复可行的理论基础 |
| 经验分布建模 | 基于所有位点的甲基化概率分布估计背景噪声 | ✅ 不需要大量重复也能估计全局变异 |
| FDR 校正 | Benjamini-Hochberg 方法控制多重检验 | ⚠️ 2 重复时假阳性控制可能偏保守或偏宽松,需谨慎解读 p.adj |
🔹 2 重复 vs 3+ 重复的统计效力对比
| 指标 | 2 重复/组 | 3 重复/组 | 对您的建议 | ||
|---|---|---|---|---|---|
| 检测中等效应 ( | diff | =20-30%) 的功率 | ~40-60% | ~70-85% | 当前结果可能漏检部分真实差异位点 |
| FDR 控制的可靠性 | 中等 | 高 | 显著位点 (p.adj<0.05) 可信,但”不显著”不代表无差异 |
||
| 异常值鲁棒性 | 低 (1 个异常样本可扭曲结果) | 中 | 建议用 MultiQC 检查样本间相关性,排除技术异常 |
🎯 实用建议:如何在 2 重复下最大化分析价值
✅ 策略 1:提高过滤阈值,聚焦高置信结果
# 在 DSS 输出后,额外应用更严格的过滤
awk '$8 >= 0.95 && abs($7) >= 0.3' An6_vs_BG5.DSS_DMLs.tsv > high_conf_diff_6mA.tsv
# $8 = p.adj, $7 = diff (甲基化差异)
✅ 策略 2:结合生物学先验知识验证
# 优先关注已知限制修饰系统相关基因
bedtools intersect -a high_conf_diff_6mA.tsv -b RM_system_genes.bed -wa > RM_related_diff.tsv
# 或关注表型相关基因 (如毒力、应激响应)
bedtools intersect -a high_conf_diff_6mA.tsv -b virulence_genes.bed -wa > phenotype_related_diff.tsv
✅ 策略 3:敏感性分析 (Bootstrap 评估稳健性)
# R 脚本示例:评估 2 重复结果的稳定性
library(boot)
# 对每个显著位点,用 bootstrap 重采样估计差异的置信区间
boot_diff <- function(data, indices) {
d <- data[indices, ]
mean(d$groupA) - mean(d$groupB) # 简化示例
}
# 对 top 100 显著位点运行 1000 次 bootstrap
# 如果 95% CI 不包含 0,则结果稳健
✅ 策略 4:报告时透明说明局限性
在论文/报告方法部分添加:
“差异甲基化分析使用 DSS (vX.X) 进行,每组 2 个生物学重复。由于样本量限制,我们采用更严格的显著性阈值 (FDR 30%) 并优先关注已知功能基因的甲基化变化,以降低假阳性风险。”
📊 如果您决定扩展至 3 重复:最小成本方案
| 方案 | 额外成本 | 收益 |
|---|---|---|
| 技术重复 (同一 DNA 多次建库) | 低 (仅测序成本) | 评估技术噪声,但不能替代生物学重复 |
| 生物学重复 (独立培养 + 提取) | 中 (时间 + 试剂) | ✅ 真正提升统计效力,推荐 |
| 混合策略 (2 生物重复 + 1 技术重复) | 中低 | 平衡成本与可靠性,适合资源有限场景 |
💡 一句话总结
🔑 技术上:methylong + DSS 完全支持 2 重复/组的 DMR 分析,管道可正常运行并输出结果。
🔑 统计上:2 重复的统计效力有限,建议:(1) 使用更严格的显著性阈值,(2) 结合生物学先验知识验证,(3) 在报告中透明说明局限性。
🔑 未来规划:如有关键发现需深入验证,建议扩展至≥3 个生物学重复以获得更可靠的 FDR 控制。
如需我帮您编写更严格的过滤脚本或敏感性分析代码,请随时告知!🚀
🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline
1. Base Modification Calling with modkit
Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:
| Column | Name | Description | Example |
|---|---|---|---|
| 1 | chrom |
Contig/chromosome identifier | contig_1 |
| 2 | start |
Modification position (0-based, BED format) | 12345 |
| 3 | end |
End position (start + 1 for single-base mods) |
12346 |
| 4 | name |
Modification code + sequence context | a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC) |
| 5 | score |
Total read coverage at the site | 45 |
| 6 | strand |
Strand orientation (+, -, or . for unstranded) |
+ |
| 7-9 | thickStart/End, itemRgb |
Visualization fields (unused in analysis) | – |
| 10 | blockCount |
Reads supporting the modification call | 32 |
| 11 | blockSizes |
Modification probability (0–100%) ← core filtering metric | 85.3 |
| 12-18 | Extended QC metrics | mod_count, unmod_count, del_count, no_call_count, strand-specific coverage |
– |
📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g.,
85.3= 85.3% of reads show 6mA at this adenine).
2. High-Confidence Site Filtering
Sites were filtered using two stringent thresholds to minimize false positives:
- Minimum coverage: ≥10 reads (
$5 >= 10) - Minimum modification rate:
- 5mC: ≥70% (
$11 >= 70) - 4mC: ≥50% (
$11 >= 50) - 6mA: ≥70% (
$11 >= 70)
- 5mC: ≥70% (
Filtering was performed using awk:
# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed
Sites were further separated by:
- Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
- Sequence context: CG vs. non-CG (for cytosine modifications)
3. Flanking Sequence Extraction for Motif Analysis
For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:
Modification site: [position X]
Extracted window: [X-50] ................ [X] ................ [X+50]
↑
Total length = 101 bp
Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.
Extraction was performed using bedtools getfasta:
# Create extended regions file
awk -F'\t' -v flank=50 '{
c=$1; gsub(/^>/,"",c);
s=($2-flank>0)?$2-flank:0; e=$3+flank;
print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed
# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s
4. Motif Enrichment Analysis with HOMER
De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:
| Parameter | Value | Rationale |
|---|---|---|
-len |
4,6,8,10 |
Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs |
-size |
50 |
Matches the ±50 bp flanking window used for extraction |
-mask |
true |
Masks low-complexity/repetitive regions to reduce spurious motifs |
-S |
10 |
Optimizes 10 putative motifs per dataset to balance sensitivity and specificity |
-noknown |
true |
Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE |
-useNewBg |
true |
Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability) |
-p |
8–100 |
Parallel threads (adjusted per compute resource) |
Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.
5. REBASE Annotation & Methylase Classification
Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.
Matching Logic:
- IUPAC-aware comparison: Motifs containing degenerate bases (e.g.,
R=A/G,Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences. - Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
- Methylation notation parsing: REBASE entries with methylation annotations (e.g.,
2(6)= N6-methyladenine at position 2) were parsed to distinguish:- REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
- Methylase Hits: Entries where the enzyme name starts with
M.(indicating a methyltransferase) AND the methylation notation matches the modification type:(4)= N4-methylcytosine → 4mC(5)= 5-methylcytosine → 5mC(6)= N6-methyladenine → 6mA
Output Classification:
| Category | Definition | Example |
|---|---|---|
| No REBASE match | Motif not found in REBASE | Likely a transcription factor binding site or novel methylase |
| REBASE match (restriction enzyme only) | Matches a restriction endonuclease but not its cognate methylase | Coincidental sequence similarity; methylation likely from orphan methylase |
| Methylase hit | Matches a methyltransferase with correct modification type notation | High-confidence candidate for the enzyme catalyzing the observed modification |
6. Output File Structure
Final results were compiled into Excel workbooks with the following sheets per sample:
| Sheet Name | Content | Source File |
|---|---|---|
4mC_CG |
4mC sites in CG context | 4mC_CG_filtered.bed + REBASE annotation |
4mC_nonCG |
4mC sites in non-CG context | 4mC_nonCG_filtered.bed + REBASE annotation |
5mC_CG |
5mC sites in CG context | 5mC_CG_filtered.bed + REBASE annotation |
5mC_nonCG |
5mC sites in non-CG context | 5mC_nonCG_filtered.bed + REBASE annotation |
6mA |
All 6mA sites | 6mA_raw_filtered.bed + REBASE annotation |
Each sheet includes:
- Genomic coordinates and modification metrics (from
modkit) - HOMER motif enrichment statistics (p-value, motif sequence)
- REBASE annotation (matched enzymes, methylase status, methylation notation)
7. Software & Database Versions
| Tool/Database | Version | Purpose |
|---|---|---|
modkit |
0.3.1 | Base modification calling from nanopore signals |
bedtools |
2.31.1 | Sequence extraction and genomic interval operations |
HOMER |
4.11 | De novo motif discovery and enrichment analysis |
REBASE |
605 (Apr 2026) | Curated database of restriction enzymes and methyltransferases |
Python |
3.10+ | Custom annotation and data integration scripts |
pandas/openpyxl |
Latest | Excel workbook generation |
🔑 Key Clarifications
- Flanking window:
-f 50= ±50 bp (101 bp total), not ±25 bp. - Modification rate: Column 11 (
blockSizes) = percent modified (0–100), not a probability score. - Methylase identification: Requires BOTH
M.prefix in enzyme name AND matching methylation notation(4)/(5)/(6). - Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.
This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.