Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)

✅ 简短回答:是的,nf-core/methylong 接受 2 个生物学重复进行 DMR 分析

但需要注意统计效力的限制结果解读的谨慎性


🔧 技术层面:methylong 如何支持 2 重复设计

1️⃣ 参数配置示例

nextflow run nf-core/methylong \
  -r 2.0.0 \
  -profile docker \
  --input samplesheet_6mA.csv \      # 包含 2 个重复/组
  --outdir methylome_out_6mA \
  --ont_aligner minimap2 \
  --m6a \
  --skip_snvs \
  --dmr_population_scale \           # ⭐ 启用群体水平比较
  --population_dmrer dss \           # ⭐ 使用 DSS (支持小样本)
  --dmr_a "An6_rep1,An6_rep2" \      # 组 A: 2 个重复
  --dmr_b "BG5_rep1,BG5_rep2" \      # 组 B: 2 个重复
  -resume

2️⃣ samplesheet 格式要求

group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont

⚠️ 关键group 列的值必须与 --dmr_a/--dmr_b 中的组名完全一致


⚠️ 统计层面:2 重复的局限性与应对策略

🔹 DSS 在 2 重复下的工作原理

特性 说明 对您的影响
贝叶斯收缩 借用全基因组位点信息”收缩”方差估计 ✅ 提升小样本稳健性,是 2 重复可行的理论基础
经验分布建模 基于所有位点的甲基化概率分布估计背景噪声 ✅ 不需要大量重复也能估计全局变异
FDR 校正 Benjamini-Hochberg 方法控制多重检验 ⚠️ 2 重复时假阳性控制可能偏保守或偏宽松,需谨慎解读 p.adj

🔹 2 重复 vs 3+ 重复的统计效力对比

指标 2 重复/组 3 重复/组 对您的建议
检测中等效应 ( diff =20-30%) 的功率 ~40-60% ~70-85% 当前结果可能漏检部分真实差异位点
FDR 控制的可靠性 中等 显著位点 (p.adj<0.05) 可信,但”不显著”不代表无差异
异常值鲁棒性 低 (1 个异常样本可扭曲结果) 建议用 MultiQC 检查样本间相关性,排除技术异常

🎯 实用建议:如何在 2 重复下最大化分析价值

✅ 策略 1:提高过滤阈值,聚焦高置信结果

# 在 DSS 输出后,额外应用更严格的过滤
awk '$8 >= 0.95 && abs($7) >= 0.3' An6_vs_BG5.DSS_DMLs.tsv > high_conf_diff_6mA.tsv
# $8 = p.adj, $7 = diff (甲基化差异)

✅ 策略 2:结合生物学先验知识验证

# 优先关注已知限制修饰系统相关基因
bedtools intersect -a high_conf_diff_6mA.tsv -b RM_system_genes.bed -wa > RM_related_diff.tsv

# 或关注表型相关基因 (如毒力、应激响应)
bedtools intersect -a high_conf_diff_6mA.tsv -b virulence_genes.bed -wa > phenotype_related_diff.tsv

✅ 策略 3:敏感性分析 (Bootstrap 评估稳健性)

# R 脚本示例:评估 2 重复结果的稳定性
library(boot)

# 对每个显著位点,用 bootstrap 重采样估计差异的置信区间
boot_diff <- function(data, indices) {
  d <- data[indices, ]
  mean(d$groupA) - mean(d$groupB)  # 简化示例
}

# 对 top 100 显著位点运行 1000 次 bootstrap
# 如果 95% CI 不包含 0,则结果稳健

✅ 策略 4:报告时透明说明局限性

在论文/报告方法部分添加:

“差异甲基化分析使用 DSS (vX.X) 进行,每组 2 个生物学重复。由于样本量限制,我们采用更严格的显著性阈值 (FDR 30%) 并优先关注已知功能基因的甲基化变化,以降低假阳性风险。”


📊 如果您决定扩展至 3 重复:最小成本方案

方案 额外成本 收益
技术重复 (同一 DNA 多次建库) 低 (仅测序成本) 评估技术噪声,但不能替代生物学重复
生物学重复 (独立培养 + 提取) 中 (时间 + 试剂) ✅ 真正提升统计效力,推荐
混合策略 (2 生物重复 + 1 技术重复) 中低 平衡成本与可靠性,适合资源有限场景

💡 一句话总结

🔑 技术上:methylong + DSS 完全支持 2 重复/组的 DMR 分析,管道可正常运行并输出结果。
🔑 统计上:2 重复的统计效力有限,建议:(1) 使用更严格的显著性阈值,(2) 结合生物学先验知识验证,(3) 在报告中透明说明局限性。
🔑 未来规划:如有关键发现需深入验证,建议扩展至≥3 个生物学重复以获得更可靠的 FDR 控制。

如需我帮您编写更严格的过滤脚本或敏感性分析代码,请随时告知!🚀

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with modkit

Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:

Column Name Description Example
1 chrom Contig/chromosome identifier contig_1
2 start Modification position (0-based, BED format) 12345
3 end End position (start + 1 for single-base mods) 12346
4 name Modification code + sequence context a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC)
5 score Total read coverage at the site 45
6 strand Strand orientation (+, -, or . for unstranded) +
7-9 thickStart/End, itemRgb Visualization fields (unused in analysis)
10 blockCount Reads supporting the modification call 32
11 blockSizes Modification probability (0–100%) ← core filtering metric 85.3
12-18 Extended QC metrics mod_count, unmod_count, del_count, no_call_count, strand-specific coverage

📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g., 85.3 = 85.3% of reads show 6mA at this adenine).


2. High-Confidence Site Filtering

Sites were filtered using two stringent thresholds to minimize false positives:

  • Minimum coverage: ≥10 reads ($5 >= 10)
  • Minimum modification rate:
    • 5mC: ≥70% ($11 >= 70)
    • 4mC: ≥50% ($11 >= 50)
    • 6mA: ≥70% ($11 >= 70)

Filtering was performed using awk:

# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed

Sites were further separated by:

  • Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
  • Sequence context: CG vs. non-CG (for cytosine modifications)

3. Flanking Sequence Extraction for Motif Analysis

For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:

Modification site:          [position X]
Extracted window:  [X-50] ................ [X] ................ [X+50]
                          ↑
                  Total length = 101 bp

Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.

Extraction was performed using bedtools getfasta:

# Create extended regions file
awk -F'\t' -v flank=50 '{
    c=$1; gsub(/^>/,"",c);
    s=($2-flank>0)?$2-flank:0; e=$3+flank;
    print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed

# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s

4. Motif Enrichment Analysis with HOMER

De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:

Parameter Value Rationale
-len 4,6,8,10 Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs
-size 50 Matches the ±50 bp flanking window used for extraction
-mask true Masks low-complexity/repetitive regions to reduce spurious motifs
-S 10 Optimizes 10 putative motifs per dataset to balance sensitivity and specificity
-noknown true Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE
-useNewBg true Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability)
-p 8–100 Parallel threads (adjusted per compute resource)

Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.


5. REBASE Annotation & Methylase Classification

Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.

Matching Logic:

  1. IUPAC-aware comparison: Motifs containing degenerate bases (e.g., R=A/G, Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences.
  2. Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
  3. Methylation notation parsing: REBASE entries with methylation annotations (e.g., 2(6) = N6-methyladenine at position 2) were parsed to distinguish:
    • REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
    • Methylase Hits: Entries where the enzyme name starts with M. (indicating a methyltransferase) AND the methylation notation matches the modification type:
      • (4) = N4-methylcytosine → 4mC
      • (5) = 5-methylcytosine → 5mC
      • (6) = N6-methyladenine → 6mA

Output Classification:

Category Definition Example
No REBASE match Motif not found in REBASE Likely a transcription factor binding site or novel methylase
REBASE match (restriction enzyme only) Matches a restriction endonuclease but not its cognate methylase Coincidental sequence similarity; methylation likely from orphan methylase
Methylase hit Matches a methyltransferase with correct modification type notation High-confidence candidate for the enzyme catalyzing the observed modification

6. Output File Structure

Final results were compiled into Excel workbooks with the following sheets per sample:

Sheet Name Content Source File
4mC_CG 4mC sites in CG context 4mC_CG_filtered.bed + REBASE annotation
4mC_nonCG 4mC sites in non-CG context 4mC_nonCG_filtered.bed + REBASE annotation
5mC_CG 5mC sites in CG context 5mC_CG_filtered.bed + REBASE annotation
5mC_nonCG 5mC sites in non-CG context 5mC_nonCG_filtered.bed + REBASE annotation
6mA All 6mA sites 6mA_raw_filtered.bed + REBASE annotation

Each sheet includes:

  • Genomic coordinates and modification metrics (from modkit)
  • HOMER motif enrichment statistics (p-value, motif sequence)
  • REBASE annotation (matched enzymes, methylase status, methylation notation)

7. Software & Database Versions

Tool/Database Version Purpose
modkit 0.3.1 Base modification calling from nanopore signals
bedtools 2.31.1 Sequence extraction and genomic interval operations
HOMER 4.11 De novo motif discovery and enrichment analysis
REBASE 605 (Apr 2026) Curated database of restriction enzymes and methyltransferases
Python 3.10+ Custom annotation and data integration scripts
pandas/openpyxl Latest Excel workbook generation

🔑 Key Clarifications

  1. Flanking window: -f 50 = ±50 bp (101 bp total), not ±25 bp.
  2. Modification rate: Column 11 (blockSizes) = percent modified (0–100), not a probability score.
  3. Methylase identification: Requires BOTH M. prefix in enzyme name AND matching methylation notation (4)/(5)/(6).
  4. Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.

This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.

Leave a Reply

Your email address will not be published. Required fields are marked *