Daily Archives: 2026年5月29日

Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)

✅ 简短回答:是的,nf-core/methylong 接受 2 个生物学重复进行 DMR 分析

但需要注意统计效力的限制结果解读的谨慎性


🔧 技术层面:methylong 如何支持 2 重复设计

1️⃣ 参数配置示例

nextflow run nf-core/methylong \
  -r 2.0.0 \
  -profile docker \
  --input samplesheet_6mA.csv \      # 包含 2 个重复/组
  --outdir methylome_out_6mA \
  --ont_aligner minimap2 \
  --m6a \
  --skip_snvs \
  --dmr_population_scale \           # ⭐ 启用群体水平比较
  --population_dmrer dss \           # ⭐ 使用 DSS (支持小样本)
  --dmr_a "An6_rep1,An6_rep2" \      # 组 A: 2 个重复
  --dmr_b "BG5_rep1,BG5_rep2" \      # 组 B: 2 个重复
  -resume

2️⃣ samplesheet 格式要求

group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont

⚠️ 关键group 列的值必须与 --dmr_a/--dmr_b 中的组名完全一致


⚠️ 统计层面:2 重复的局限性与应对策略

🔹 DSS 在 2 重复下的工作原理

特性 说明 对您的影响
贝叶斯收缩 借用全基因组位点信息”收缩”方差估计 ✅ 提升小样本稳健性,是 2 重复可行的理论基础
经验分布建模 基于所有位点的甲基化概率分布估计背景噪声 ✅ 不需要大量重复也能估计全局变异
FDR 校正 Benjamini-Hochberg 方法控制多重检验 ⚠️ 2 重复时假阳性控制可能偏保守或偏宽松,需谨慎解读 p.adj

🔹 2 重复 vs 3+ 重复的统计效力对比

指标 2 重复/组 3 重复/组 对您的建议
检测中等效应 ( diff =20-30%) 的功率 ~40-60% ~70-85% 当前结果可能漏检部分真实差异位点
FDR 控制的可靠性 中等 显著位点 (p.adj<0.05) 可信,但”不显著”不代表无差异
异常值鲁棒性 低 (1 个异常样本可扭曲结果) 建议用 MultiQC 检查样本间相关性,排除技术异常

🎯 实用建议:如何在 2 重复下最大化分析价值

✅ 策略 1:提高过滤阈值,聚焦高置信结果

# 在 DSS 输出后,额外应用更严格的过滤
awk '$8 >= 0.95 && abs($7) >= 0.3' An6_vs_BG5.DSS_DMLs.tsv > high_conf_diff_6mA.tsv
# $8 = p.adj, $7 = diff (甲基化差异)

✅ 策略 2:结合生物学先验知识验证

# 优先关注已知限制修饰系统相关基因
bedtools intersect -a high_conf_diff_6mA.tsv -b RM_system_genes.bed -wa > RM_related_diff.tsv

# 或关注表型相关基因 (如毒力、应激响应)
bedtools intersect -a high_conf_diff_6mA.tsv -b virulence_genes.bed -wa > phenotype_related_diff.tsv

✅ 策略 3:敏感性分析 (Bootstrap 评估稳健性)

# R 脚本示例:评估 2 重复结果的稳定性
library(boot)

# 对每个显著位点,用 bootstrap 重采样估计差异的置信区间
boot_diff <- function(data, indices) {
  d <- data[indices, ]
  mean(d$groupA) - mean(d$groupB)  # 简化示例
}

# 对 top 100 显著位点运行 1000 次 bootstrap
# 如果 95% CI 不包含 0,则结果稳健

✅ 策略 4:报告时透明说明局限性

在论文/报告方法部分添加:

“差异甲基化分析使用 DSS (vX.X) 进行,每组 2 个生物学重复。由于样本量限制,我们采用更严格的显著性阈值 (FDR 30%) 并优先关注已知功能基因的甲基化变化,以降低假阳性风险。”


📊 如果您决定扩展至 3 重复:最小成本方案

方案 额外成本 收益
技术重复 (同一 DNA 多次建库) 低 (仅测序成本) 评估技术噪声,但不能替代生物学重复
生物学重复 (独立培养 + 提取) 中 (时间 + 试剂) ✅ 真正提升统计效力,推荐
混合策略 (2 生物重复 + 1 技术重复) 中低 平衡成本与可靠性,适合资源有限场景

💡 一句话总结

🔑 技术上:methylong + DSS 完全支持 2 重复/组的 DMR 分析,管道可正常运行并输出结果。
🔑 统计上:2 重复的统计效力有限,建议:(1) 使用更严格的显著性阈值,(2) 结合生物学先验知识验证,(3) 在报告中透明说明局限性。
🔑 未来规划:如有关键发现需深入验证,建议扩展至≥3 个生物学重复以获得更可靠的 FDR 控制。

如需我帮您编写更严格的过滤脚本或敏感性分析代码,请随时告知!🚀

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with modkit

Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:

Column Name Description Example
1 chrom Contig/chromosome identifier contig_1
2 start Modification position (0-based, BED format) 12345
3 end End position (start + 1 for single-base mods) 12346
4 name Modification code + sequence context a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC)
5 score Total read coverage at the site 45
6 strand Strand orientation (+, -, or . for unstranded) +
7-9 thickStart/End, itemRgb Visualization fields (unused in analysis)
10 blockCount Reads supporting the modification call 32
11 blockSizes Modification probability (0–100%) ← core filtering metric 85.3
12-18 Extended QC metrics mod_count, unmod_count, del_count, no_call_count, strand-specific coverage

📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g., 85.3 = 85.3% of reads show 6mA at this adenine).


2. High-Confidence Site Filtering

Sites were filtered using two stringent thresholds to minimize false positives:

  • Minimum coverage: ≥10 reads ($5 >= 10)
  • Minimum modification rate:
    • 5mC: ≥70% ($11 >= 70)
    • 4mC: ≥50% ($11 >= 50)
    • 6mA: ≥70% ($11 >= 70)

Filtering was performed using awk:

# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed

Sites were further separated by:

  • Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
  • Sequence context: CG vs. non-CG (for cytosine modifications)

3. Flanking Sequence Extraction for Motif Analysis

For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:

Modification site:          [position X]
Extracted window:  [X-50] ................ [X] ................ [X+50]
                          ↑
                  Total length = 101 bp

Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.

Extraction was performed using bedtools getfasta:

# Create extended regions file
awk -F'\t' -v flank=50 '{
    c=$1; gsub(/^>/,"",c);
    s=($2-flank>0)?$2-flank:0; e=$3+flank;
    print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed

# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s

4. Motif Enrichment Analysis with HOMER

De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:

Parameter Value Rationale
-len 4,6,8,10 Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs
-size 50 Matches the ±50 bp flanking window used for extraction
-mask true Masks low-complexity/repetitive regions to reduce spurious motifs
-S 10 Optimizes 10 putative motifs per dataset to balance sensitivity and specificity
-noknown true Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE
-useNewBg true Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability)
-p 8–100 Parallel threads (adjusted per compute resource)

Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.


5. REBASE Annotation & Methylase Classification

Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.

Matching Logic:

  1. IUPAC-aware comparison: Motifs containing degenerate bases (e.g., R=A/G, Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences.
  2. Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
  3. Methylation notation parsing: REBASE entries with methylation annotations (e.g., 2(6) = N6-methyladenine at position 2) were parsed to distinguish:
    • REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
    • Methylase Hits: Entries where the enzyme name starts with M. (indicating a methyltransferase) AND the methylation notation matches the modification type:
      • (4) = N4-methylcytosine → 4mC
      • (5) = 5-methylcytosine → 5mC
      • (6) = N6-methyladenine → 6mA

Output Classification:

Category Definition Example
No REBASE match Motif not found in REBASE Likely a transcription factor binding site or novel methylase
REBASE match (restriction enzyme only) Matches a restriction endonuclease but not its cognate methylase Coincidental sequence similarity; methylation likely from orphan methylase
Methylase hit Matches a methyltransferase with correct modification type notation High-confidence candidate for the enzyme catalyzing the observed modification

6. Output File Structure

Final results were compiled into Excel workbooks with the following sheets per sample:

Sheet Name Content Source File
4mC_CG 4mC sites in CG context 4mC_CG_filtered.bed + REBASE annotation
4mC_nonCG 4mC sites in non-CG context 4mC_nonCG_filtered.bed + REBASE annotation
5mC_CG 5mC sites in CG context 5mC_CG_filtered.bed + REBASE annotation
5mC_nonCG 5mC sites in non-CG context 5mC_nonCG_filtered.bed + REBASE annotation
6mA All 6mA sites 6mA_raw_filtered.bed + REBASE annotation

Each sheet includes:

  • Genomic coordinates and modification metrics (from modkit)
  • HOMER motif enrichment statistics (p-value, motif sequence)
  • REBASE annotation (matched enzymes, methylase status, methylation notation)

7. Software & Database Versions

Tool/Database Version Purpose
modkit 0.3.1 Base modification calling from nanopore signals
bedtools 2.31.1 Sequence extraction and genomic interval operations
HOMER 4.11 De novo motif discovery and enrichment analysis
REBASE 605 (Apr 2026) Curated database of restriction enzymes and methyltransferases
Python 3.10+ Custom annotation and data integration scripts
pandas/openpyxl Latest Excel workbook generation

🔑 Key Clarifications

  1. Flanking window: -f 50 = ±50 bp (101 bp total), not ±25 bp.
  2. Modification rate: Column 11 (blockSizes) = percent modified (0–100), not a probability score.
  3. Methylase identification: Requires BOTH M. prefix in enzyme name AND matching methylation notation (4)/(5)/(6).
  4. Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.

This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.

dorado + nf-core/methylong + modkit pileup + interpretation using own scripts processing Data_Tam_DNAseq_2026_An6_BG5 methylome data

dna_methylation_analysis_4mc_5mc.sh dna_methylation_analysis_6mA.sh annotate_motifs_rebase.py generate_methylation_excel.py

  1. Run nextflow methylong after running nextflow bacass

     ✅ ---- 终极解决方案:使用本地 Dorado 预生成 modBAM + --skip_basecalling (绕过容器 CUDA 限制)----
    
     鉴于容器镜像的复杂性和 CUDA 版本限制,最可靠、最快速的路径是:
     使用您已安装的本地 Dorado v2.0.0 生成已比对的 modBAM
     运行 methylong 时添加 --skip_basecalling,跳过内部容器化 Dorado
     这样完全绕过容器、CUDA、GPU 传递等所有问题!
    
     步骤 1:如果您已安装本地 Dorado(非容器版),可预先生成 modBAM:
    
     # 安装本地 Dorado (CPU 或匹配您 CUDA 的版本)
     #    下载: https://github.com/nanoporetech/dorado/releases; https://software-docs.nanoporetech.com/dorado/latest/
     curl "https://cdn.oxfordnanoportal.com/software/analysis/dorado-2.0.0-linux-x64.tar.gz" -o dorado-2.0.0-linux-x64.tar.gz
     tar -xzf dorado-2.0.0-linux-x64.tar.gz
    
     # 添加到 PATH (永久生效)
     echo 'export PATH="/home/jhuang/Tools/dorado-2.0.0-linux-x64/bin:$PATH"' >> ~/.bashrc
     source ~/.bashrc
    
     # 验证安装: 2.0.0+20e87c8b
     dorado --version
    
     # 查看可用的碱基调用模型
     Positional arguments:
       model                             Model selection {fast,hac,sup}@v{version} for automatic model selection including modbases, or path to existing model directory.
       data                              The data directory or POD5 file path.
    
     # 查看可用的修饰检测模型
     dorado basecaller --help | grep -A5 "modified-bases"
     # ✅ 应看到: 6mA, 4mC_5mC, 5mC_5hmC 等
    
     步骤 2: 生成 modBAM
     ./generate_mapped_modbam.sh
    
         #!/bin/bash
         #===============================================================================
         # generate_mapped_modbam.sh - 本地 Dorado 生成已比对 modBAM
         #===============================================================================
         set -euo pipefail
    
         # === 配置 ===
         DORADO="/home/jhuang/Tools/dorado-2.0.0-linux-x64/bin/dorado"  # 您的本地 Dorado 路径
         REF_AN6="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa"
         REF_BG5="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa"
         POD5_AN6="./X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/An6/2157_4C_PBK79106_7ec05c46/An6_pod5_pass"
         POD5_BG5="./X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/BG5/2157_4C_PBK79106_7ec05c46/BG5_pod5_pass"
         OUTDIR="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5"
    
         # 确保参考基因组已索引
         samtools faidx "${REF_AN6}"
         samtools faidx "${REF_BG5}"
    
         echo "🚀 生成 An6 6mA modBAM (已比对)..."
         "${DORADO}" basecaller \
           --modified-bases 6mA \
           --emit-moves \
           --device cuda:0 \
           --reference "${REF_AN6}" \
           dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
           "${POD5_AN6}" | samtools view -b - > "${OUTDIR}/An6_6mA_mapped.mod.bam"
    
         echo "🚀 生成 BG5 6mA modBAM (已比对)..."
         "${DORADO}" basecaller \
           --modified-bases 6mA \
           --emit-moves \
           --device cuda:0 \
           --reference "${REF_BG5}" \
           dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
           "${POD5_BG5}" | samtools view -b - > "${OUTDIR}/BG5_6mA_mapped.mod.bam"
    
         echo "🚀 生成 An6 6mA modBAM (已比对)..."
         "${DORADO}" basecaller \
           --modified-bases 4mC_5mC \
           --emit-moves \
           --device cuda:0 \
           --reference "${REF_AN6}" \
           dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
           "${POD5_AN6}" | samtools view -b - > "${OUTDIR}/An6_4mC_5mC_mapped.mod.bam"
    
         echo "🚀 生成 BG5 6mA modBAM (已比对)..."
         "${DORADO}" basecaller \
           --modified-bases 4mC_5mC \
           --emit-moves \
           --device cuda:0 \
           --reference "${REF_BG5}" \
           dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
           "${POD5_BG5}" | samtools view -b - > "${OUTDIR}/BG5_4mC_5mC_mapped.mod.bam"
    
         # 验证输出
         echo "🔍 验证 BAM 文件..."
         for BAM in "${OUTDIR}"/*_6mA_mapped.mod.bam; do
           if samtools quickcheck "${BAM}" 2>/dev/null; then
             READS=$(samtools view -c "${BAM}")
             TAGS=$(samtools view "${BAM}" | head -50 | grep -o "MM:Z:[^ ]*" | sort -u | tr '\n' ' ')
             echo "✅ $(basename "${BAM}"): ${READS} reads, tags: ${TAGS}"
           else
             echo "❌ $(basename "${BAM}"): 文件损坏或格式错误"
           fi
         done
    
         echo "🎉 modBAM 生成完成!"
    
     步骤 3: 创建 methylong samplesheet(使用预生成 modBAM)
     See samplesheet_6mA.csv and samplesheet_4mC_5mC.csv
    
     步骤 4: 运行 methylong with --skip_basecalling
    
         ## Check if BAMs are read-name sorted:
         #samtools view -H An6_6mA_mapped.mod.bam | grep "^@HD"
         ## Should show: @HD VN:1.6 SO:queryname
    
         ## If not sorted correctly:
         #samtools sort -n -o An6.sorted_by_name.bam An6_6mA_mapped.mod.bam
         #samtools sort -n -o BG5.sorted_by_name.bam BG5_6mA_mapped.mod.bam
    
         ## Check BAM integrity:
         #samtools quickcheck -v An6_6mA_mapped.mod.bam
         #samtools quickcheck -v An6_after_trim.bam
    
         ## Use modkit to check tags:
         #modkit modbam check-tags An6_6mA_mapped.mod.bam
         #modkit modbam check-tags An6_after_trim.bam
    
         #BUG: '--skip_basecalling' does not exist (see the complete option list below.)
         nextflow run nf-core/methylong \
             -r 2.0.0 \
             -profile docker \
             --input samplesheet_6mA.csv \
             --outdir methylome_out_6mA \
             --skip_basecalling \          # ⭐ 关键: 跳过内部 Dorado,使用预生成 modBAM
             -resume \
             -work-dir methylome_out_6mA/work
    
         # Ist die BAM-Datei unaligniert? (keine @SQ Header mit Alignment)
         samtools view -H An6_6mA_mapped.mod.bam | head -10
    
         # Sind MM/ML-Tags für 6mA vorhanden?
         samtools view An6_6mA_mapped.mod.bam | head -5 | grep -o "MM:Z:[^ ]*"
    
         #NEXT_WEEK FOLLOWING the command under the website
         #https://bioinformatics.cc/debug-methylomic-analysis-of-data_tam_dnaseq_2026_an6_bg5/
    
         # FIRSTLY RUNNING NEXTFLOW
         # 1️⃣ 处理 6mA BAM(仅含 6mA 信号)
         nextflow run nf-core/methylong -r 2.0.0 -profile docker \
             --input samplesheet_6mA.csv \
             --outdir methylome_out_6mA \
             --ont_aligner minimap2 \
             --m6a \
             --skip_snvs \
             -resume \
             -work-dir methylome_out_6mA/work
         # 2️⃣ 处理 4mC_5mC BAM(含 4mC + 5mC 信号)
         nextflow run nf-core/methylong -r 2.0.0 -profile docker \
             --input samplesheet_4mC_5mC.csv \
             --outdir methylome_out_4mC_5mC \
             --ont_aligner minimap2 \
             --all_contexts \
             --skip_snvs \
             -resume \
             -work-dir methylome_out_4mC_5mC/work
    
     步骤 5: Check output
     [ ] ✅ 本地 Dorado 生成已比对 modBAM: samtools quickcheck An6_6mA_mapped.mod.bam 通过
     [ ] ✅ BAM 含 @SQ 头: samtools view -H An6_6mA_mapped.mod.bam | grep "^@SQ"
     [ ] ✅ BAM 含修饰标签: samtools view An6_6mA_mapped.mod.bam | grep "MM:Z:6mA" | head -3 returns nothing, but samtools view An6_6mA_mapped.mod.bam | grep "MM:Z:" | head -3
     [ ] ✅ samplesheet 使用绝对路径 + 精确 sample 名称
     [ ] ✅ 运行命令添加 --skip_basecalling
     [ ] ✅ 输出目录存在: methylome_out_6mA/methylation_calls/*.bedMethyl.gz
  2. SECONDLY RERUN modkit pileup to correct the BUG from nf-core/methylong

     # ---- 6mA 对应过滤命令 ----
     # 方案 A:降低阈值测试 (NOT_USED)
     # docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam An6_6mA_test.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif A 0 --filter-threshold A:0.7 --threads 8
    
     方案 B:6mA: 无过滤 + 手动后处理 (CHOSEN)
     # ✅ 第 1 步:生成无过滤的原始结果(6mA)
     docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam An6_6mA_raw.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif A 0 --no-filtering --threads 8
     docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/BG5/alignment/BG5_aligned.bam BG5_6mA_raw.bed --ref bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa --motif A 0 --no-filtering --threads 8
    
     # # ✅ 第 2 步:检查原始概率最大值
     # awk '{print $11}' An6_6mA_raw.bed | sort -rn | head -5
     # # 如果最大值 < 70 → 降低分析阈值
    
     # # ✅ 第 3 步:手动过滤生成最终结果
     # # ✅ 过滤高置信度 6mA 位点:概率≥70% + 覆盖度≥10
     # awk '$4 ~ /^A,/ && $11 >= 70 && $5 >= 10' An6_6mA_raw.bed > An6_6mA_final.bed
    
     # # ✅ 第 4 步:验证并统计
     # echo "=== 6mA 最终结果 ==="
     # # 📊 统计结果
     # echo "=== 6mA 高置信度位点统计 ==="
     # echo "总位点数: $(wc -l < An6_6mA_final.bed)"
     # echo ""
     # echo "概率分布 (前10名):"
     # awk '{print $11}' An6_6mA_final.bed | sort -rn | uniq -c | head -10
     # echo ""
     # echo "按染色体分布:"
     # cut -f1 An6_6mA_final.bed | sort | uniq -c | sort -rn
    
     # 方案 C:检查 BAM 信号原始分布 (For DEBUG)
     # # 提取 ML 标签中的概率值 (0-255 缩放 → 转换为 0-1)
     # samtools view ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam | head -100 | \
     #   grep -oP 'ML:B:C,\K[0-9,]+' | tr ',' '\n' | \
     #   awk '{if($1>0) print $1/255}' | sort -n | uniq -c | head -10
    
     # ---- 4mC_5mC 对应过滤命令 ----
     docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_4mC_5mC/ont/An6/alignment/An6_aligned.bam An6_4mC_5mC_raw.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif CG 0 --motif C 0 --no-filtering --threads 8
     docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_4mC_5mC/ont/BG5/alignment/BG5_aligned.bam BG5_4mC_5mC_raw.bed --ref bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa --motif CG 0 --motif C 0 --no-filtering --threads 8
    
     #修饰类型   列4匹配正则  说明
     #5mC @ CG   ^m,CG,  5-甲基胞嘧啶,在CG motif上
     #5mC @ C    ^m,C,   5-甲基胞嘧啶,在单碱基C上
     #4mC @ CG   ^21839,CG,  4-甲基胞嘧啶,在CG motif上
     #4mC @ C    ^21839,C,   4-甲基胞嘧啶,在单碱基C上
    
     ┌─────────────────────────────────────────────────────┐
     │  Column 4 格式:[修饰代码],[motif],[offset]         │
     ├─────────────────────────────────────────────────────┤
     │  m,CG,0      → 5mC @ CG 位点                        │
     │  m,C,0       → 5mC @ 任意 C 位点(包含 CG+非CG)    │
     │  21839,CG,0  → 4mC @ CG 位点                        │
     │  21839,C,0   → 4mC @ 任意 C 位点(包含 CG+非CG)    │
     └─────────────────────────────────────────────────────┘
     ⚠️ 重要提示:m,C,0 包含 所有 C 位点(CG + CHG + CHH),与 m,CG,0 有重叠!
     优先级 1: 4mC @ CG (21839,CG,0)  ← 细菌最主要的修饰
     优先级 2: 5mC @ CG (m,CG,0)      ← 次要修饰
     优先级 3: 非 CG 修饰             ← 探索性分析
     重点关注:
     ✅ 4mC @ CG - 细菌限制修饰系统的核心
     ✅ 5mC @ CG - 部分细菌也有 5mC
     ⚠️ 非 CG 修饰 - 细菌中较少,可先忽略
    
     修饰类型    位点数量    占比  生物学解释
     5mC @ CG    437 33% 经典 CpG 甲基化
     5mC @ non-CG    870 67% ⭐ 非 CG 位点更多,正常!
     4mC @ CG    7   17% 4mC 很少在 CG 上
     4mC @ non-CG    34  83% ⭐ 4mC 偏好非 CG 位点
  3. DNA methylation analysis (Sample An6: Acinetobacter harbinensis and Sample BG5: Pedobacter sp. strain BG5, both are bacteria, not plants.)

     mamba create -n methylation -c conda-forge -c bioconda \
       r-base=4.3 \
       r-dss \
       r-methylkit \
       r-chipseeker \
       r-qvalue \
       metilene \
       bedtools \
       samtools \
       ucsc-bedgraphtobigwig \
       -y
     mamba activate methylation
    
     #chmod +x dna_methylation_analysis_4mC_5mC.sh
     (methylation) bash dna_methylation_analysis_4mC_5mC.sh \
         -i An6_4mC_5mC_raw.bed \
         -r bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa \
         -o An6_4mC_5mC_methylation_results \
         -c 10 -m 70 -n 50 -f 50 -t 100
    
     #Using homerMotifs.all.motifs rather than nonRedundant.motifs
     (methylation) python annotate_motifs_rebase.py \
         --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/5mC_CG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output An6_5mC_CG_results.tsv
     #  Total motifs: 35
     #  REBASE matches: 8 (22.9%)
     #  ⚠️ Methylase matches: 2
    
     python annotate_motifs_rebase.py \
         --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/4mC_CG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output An6_4mC_CG_results.tsv
     #  Total motifs: Total motifs: 35
     # REBASE matches: 6 (17.1%)
     #⚠️ Methylase matches: 0
    
     python annotate_motifs_rebase.py \
         --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/5mC_nonCG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output An6_5mC_nonCG_results.tsv
     #📊 Summary:
     #Total motifs: 37
     #REBASE matches: 4 (10.8%)
     #⚠️ Methylase matches: 0
    
     python annotate_motifs_rebase.py \
         --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/4mC_nonCG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output An6_4mC_nonCG_results.tsv
     #    Total motifs: 35
     #REBASE matches: 10 (28.6%)
     #⚠️ Methylase matches: 2
    
     bash dna_methylation_analysis_4mC_5mC.sh \
         -i BG5_4mC_5mC_raw.bed \
         -r bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa \
         -o BG5_4mC_5mC_methylation_results \
         -c 10 -m 70 -n 50 -f 50 -t 100
    
     python annotate_motifs_rebase.py \
         --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/5mC_CG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output BG5_5mC_CG_results.tsv
     #  📊 Summary:
     #Total motifs: 36
     #REBASE matches: 7 (19.4%)
     #⚠️ Methylase matches: 1
    
     python annotate_motifs_rebase.py \
         --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/4mC_CG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output BG5_4mC_CG_results.tsv
     # Total motifs: 35
     #REBASE matches: 8 (22.9%)
     #⚠️ Methylase matches: 3
    
     python annotate_motifs_rebase.py \
         --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/5mC_nonCG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output BG5_5mC_nonCG_results.tsv
     #  Total motifs: 39
     #REBASE matches: 7 (17.9%)
     #⚠️ Methylase matches: 0
    
     python annotate_motifs_rebase.py \
         --homer  BG5_4mC_5mC_methylation_results/motif_analysis/homer/4mC_nonCG/homerMotifs.all.motifs \
         --rebase withrefm.txt \
         --output  BG5_4mC_nonCG_results.tsv
     #Total motifs: 35
     #REBASE matches: 9 (25.7%)
     #⚠️ Methylase matches: 3
    
     # -- For 6mA --
     ./dna_methylation_analysis_6mA.sh \
         -i An6_6mA_raw.bed \
         -r bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa \
         -o An6_6mA_methylation_results \
         -c 10 -m 70 -n 50 -f 50 -t 100
    
     python annotate_motifs_rebase.py \
         --homer An6_6mA_methylation_results/nonRedundant.motifs.txt \
         --rebase withrefm.txt \
         --output An6_6mA_results.tsv
     #📊 Summary:
     #Total motifs: 25
     #REBASE matches: 4 (16.0%)
     #⚠️ Methylase matches: 1
    
     ./dna_methylation_analysis_6mA.sh \
         -i BG5_6mA_raw.bed \
         -r bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa \
         -o BG5_6mA_methylation_results \
         -c 10 -m 70 -n 50 -f 50 -t 100
    
     python annotate_motifs_rebase.py \
         --homer BG5_6mA_methylation_results/nonRedundant.motifs.txt \
         --rebase withrefm.txt \
         --output BG5_6mA_results.tsv
     #Total motifs: 25
     #REBASE matches: 7 (28.0%)
     #⚠️ Methylase matches: 2
    
     (table1_env) python3 generate_methylation_excel.py
     #Delete the column 'bestguess_tf'; Mark rebase- and methylase-annotated cells yellow; Move 6mA as the first sheet!