Daily Archives: 2026年5月29日

Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)

✅ 简短回答：是的，nf-core/methylong 接受 2 个生物学重复进行 DMR 分析

但需要注意统计效力的限制和结果解读的谨慎性。

🔧 技术层面：methylong 如何支持 2 重复设计

1️⃣ 参数配置示例

nextflow run nf-core/methylong \
  -r 2.0.0 \
  -profile docker \
  --input samplesheet_6mA.csv \      # 包含 2 个重复/组
  --outdir methylome_out_6mA \
  --ont_aligner minimap2 \
  --m6a \
  --skip_snvs \
  --dmr_population_scale \           # ⭐ 启用群体水平比较
  --population_dmrer dss \           # ⭐ 使用 DSS (支持小样本)
  --dmr_a "An6_rep1,An6_rep2" \      # 组 A: 2 个重复
  --dmr_b "BG5_rep1,BG5_rep2" \      # 组 B: 2 个重复
  -resume

2️⃣ samplesheet 格式要求

group,sample,path,ref,method
An6,An6_rep1,/path/An6_rep1.mod.bam,/path/genome.fa,ont
An6,An6_rep2,/path/An6_rep2.mod.bam,/path/genome.fa,ont
BG5,BG5_rep1,/path/BG5_rep1.mod.bam,/path/genome.fa,ont
BG5,BG5_rep2,/path/BG5_rep2.mod.bam,/path/genome.fa,ont

⚠️ 关键：group 列的值必须与 --dmr_a/--dmr_b 中的组名完全一致。

⚠️ 统计层面：2 重复的局限性与应对策略

🔹 DSS 在 2 重复下的工作原理

特性	说明	对您的影响
贝叶斯收缩	借用全基因组位点信息”收缩”方差估计	✅ 提升小样本稳健性，是 2 重复可行的理论基础
经验分布建模	基于所有位点的甲基化概率分布估计背景噪声	✅ 不需要大量重复也能估计全局变异
FDR 校正	Benjamini-Hochberg 方法控制多重检验	⚠️ 2 重复时假阳性控制可能偏保守或偏宽松，需谨慎解读 `p.adj`

🔹 2 重复 vs 3+ 重复的统计效力对比

指标	2 重复/组	3 重复/组	对您的建议
检测中等效应 (	diff	=20-30%) 的功率	~40-60%	~70-85%	当前结果可能漏检部分真实差异位点
FDR 控制的可靠性	中等	高	显著位点 (`p.adj<0.05`) 可信，但”不显著”不代表无差异
异常值鲁棒性	低 (1 个异常样本可扭曲结果)	中	建议用 MultiQC 检查样本间相关性，排除技术异常

🎯 实用建议：如何在 2 重复下最大化分析价值

✅ 策略 1：提高过滤阈值，聚焦高置信结果

# 在 DSS 输出后，额外应用更严格的过滤
awk '$8 >= 0.95 && abs($7) >= 0.3' An6_vs_BG5.DSS_DMLs.tsv > high_conf_diff_6mA.tsv
# $8 = p.adj, $7 = diff (甲基化差异)

✅ 策略 2：结合生物学先验知识验证

# 优先关注已知限制修饰系统相关基因
bedtools intersect -a high_conf_diff_6mA.tsv -b RM_system_genes.bed -wa > RM_related_diff.tsv

# 或关注表型相关基因 (如毒力、应激响应)
bedtools intersect -a high_conf_diff_6mA.tsv -b virulence_genes.bed -wa > phenotype_related_diff.tsv

✅ 策略 3：敏感性分析 (Bootstrap 评估稳健性)

# R 脚本示例：评估 2 重复结果的稳定性
library(boot)

# 对每个显著位点，用 bootstrap 重采样估计差异的置信区间
boot_diff <- function(data, indices) {
  d <- data[indices, ]
  mean(d$groupA) - mean(d$groupB)  # 简化示例
}

# 对 top 100 显著位点运行 1000 次 bootstrap
# 如果 95% CI 不包含 0，则结果稳健

✅ 策略 4：报告时透明说明局限性

在论文/报告方法部分添加：

“差异甲基化分析使用 DSS (vX.X) 进行，每组 2 个生物学重复。由于样本量限制，我们采用更严格的显著性阈值 (FDR 30%) 并优先关注已知功能基因的甲基化变化，以降低假阳性风险。”

📊 如果您决定扩展至 3 重复：最小成本方案

方案	额外成本	收益
技术重复 (同一 DNA 多次建库)	低 (仅测序成本)	评估技术噪声，但不能替代生物学重复
生物学重复 (独立培养 + 提取)	中 (时间 + 试剂)	✅ 真正提升统计效力，推荐
混合策略 (2 生物重复 + 1 技术重复)	中低	平衡成本与可靠性，适合资源有限场景

💡 一句话总结

🔑 技术上：methylong + DSS 完全支持 2 重复/组的 DMR 分析，管道可正常运行并输出结果。
🔑 统计上：2 重复的统计效力有限，建议：(1) 使用更严格的显著性阈值，(2) 结合生物学先验知识验证，(3) 在报告中透明说明局限性。
🔑 未来规划：如有关键发现需深入验证，建议扩展至≥3 个生物学重复以获得更可靠的 FDR 控制。

如需我帮您编写更严格的过滤脚本或敏感性分析代码，请随时告知！🚀

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with `modkit`

Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:

Column	Name	Description	Example
1	`chrom`	Contig/chromosome identifier	`contig_1`
2	`start`	Modification position (0-based, BED format)	`12345`
3	`end`	End position (`start + 1` for single-base mods)	`12346`
4	`name`	Modification code + sequence context	`a,CG,ACGT` (6mA), `m,CG,ACGT` (5mC), `21839,C,ACGT` (4mC)
5	`score`	Total read coverage at the site	`45`
6	`strand`	Strand orientation (`+`, `-`, or `.` for unstranded)	`+`
7-9	`thickStart/End`, `itemRgb`	Visualization fields (unused in analysis)	–
10	`blockCount`	Reads supporting the modification call	`32`
11	`blockSizes`	Modification probability (0–100%) ← core filtering metric	`85.3`
12-18	Extended QC metrics	`mod_count`, `unmod_count`, `del_count`, `no_call_count`, strand-specific coverage	–

📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g., 85.3 = 85.3% of reads show 6mA at this adenine).

2. High-Confidence Site Filtering

Sites were filtered using two stringent thresholds to minimize false positives:

Minimum coverage: ≥10 reads ($5 >= 10)
Minimum modification rate:
- 5mC: ≥70% ($11 >= 70)
- 4mC: ≥50% ($11 >= 50)
- 6mA: ≥70% ($11 >= 70)

Filtering was performed using awk:

# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed

Sites were further separated by:

Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
Sequence context: CG vs. non-CG (for cytosine modifications)

3. Flanking Sequence Extraction for Motif Analysis

For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:

Modification site:          [position X]
Extracted window:  [X-50] ................ [X] ................ [X+50]
                          ↑
                  Total length = 101 bp

Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.

Extraction was performed using bedtools getfasta:

# Create extended regions file
awk -F'\t' -v flank=50 '{
    c=$1; gsub(/^>/,"",c);
    s=($2-flank>0)?$2-flank:0; e=$3+flank;
    print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed

# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s

4. Motif Enrichment Analysis with HOMER

De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:

Parameter	Value	Rationale
`-len`	`4,6,8,10`	Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs
`-size`	`50`	Matches the ±50 bp flanking window used for extraction
`-mask`	`true`	Masks low-complexity/repetitive regions to reduce spurious motifs
`-S`	`10`	Optimizes 10 putative motifs per dataset to balance sensitivity and specificity
`-noknown`	`true`	Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE
`-useNewBg`	`true`	Uses HOMER’s native GC-matched background generation (avoids `bedtools shuffle` instability)
`-p`	`8–100`	Parallel threads (adjusted per compute resource)

Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.

5. REBASE Annotation & Methylase Classification

Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.

Matching Logic:

IUPAC-aware comparison: Motifs containing degenerate bases (e.g., R=A/G, Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences.
Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
Methylation notation parsing: REBASE entries with methylation annotations (e.g., 2(6) = N6-methyladenine at position 2) were parsed to distinguish:
- REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
- Methylase Hits: Entries where the enzyme name starts with M. (indicating a methyltransferase) AND the methylation notation matches the modification type:
  - (4) = N4-methylcytosine → 4mC
  - (5) = 5-methylcytosine → 5mC
  - (6) = N6-methyladenine → 6mA

Output Classification:

Category	Definition	Example
No REBASE match	Motif not found in REBASE	Likely a transcription factor binding site or novel methylase
REBASE match (restriction enzyme only)	Matches a restriction endonuclease but not its cognate methylase	Coincidental sequence similarity; methylation likely from orphan methylase
Methylase hit	Matches a methyltransferase with correct modification type notation	High-confidence candidate for the enzyme catalyzing the observed modification

6. Output File Structure

Final results were compiled into Excel workbooks with the following sheets per sample:

Sheet Name	Content	Source File
`4mC_CG`	4mC sites in CG context	`4mC_CG_filtered.bed` + REBASE annotation
`4mC_nonCG`	4mC sites in non-CG context	`4mC_nonCG_filtered.bed` + REBASE annotation
`5mC_CG`	5mC sites in CG context	`5mC_CG_filtered.bed` + REBASE annotation
`5mC_nonCG`	5mC sites in non-CG context	`5mC_nonCG_filtered.bed` + REBASE annotation
`6mA`	All 6mA sites	`6mA_raw_filtered.bed` + REBASE annotation

Each sheet includes:

Genomic coordinates and modification metrics (from modkit)
HOMER motif enrichment statistics (p-value, motif sequence)
REBASE annotation (matched enzymes, methylase status, methylation notation)

7. Software & Database Versions

Tool/Database	Version	Purpose
`modkit`	0.3.1	Base modification calling from nanopore signals
`bedtools`	2.31.1	Sequence extraction and genomic interval operations
`HOMER`	4.11	De novo motif discovery and enrichment analysis
`REBASE`	605 (Apr 2026)	Curated database of restriction enzymes and methyltransferases
`Python`	3.10+	Custom annotation and data integration scripts
`pandas`/`openpyxl`	Latest	Excel workbook generation

🔑 Key Clarifications

Flanking window: -f 50 = ±50 bp (101 bp total), not ±25 bp.
Modification rate: Column 11 (blockSizes) = percent modified (0–100), not a probability score.
Methylase identification: Requires BOTH M. prefix in enzyme name AND matching methylation notation (4)/(5)/(6).
Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.

This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.

dorado + nf-core/methylong + modkit pileup + interpretation using own scripts processing Data_Tam_DNAseq_2026_An6_BG5 methylome data

Leave a reply

dna_methylation_analysis_4mc_5mc.sh dna_methylation_analysis_6mA.sh annotate_motifs_rebase.py generate_methylation_excel.py

Run nextflow methylong after running nextflow bacass

 ✅ ---- 终极解决方案：使用本地 Dorado 预生成 modBAM + --skip_basecalling （绕过容器 CUDA 限制）----

 鉴于容器镜像的复杂性和 CUDA 版本限制，最可靠、最快速的路径是：
 使用您已安装的本地 Dorado v2.0.0 生成已比对的 modBAM
 运行 methylong 时添加 --skip_basecalling，跳过内部容器化 Dorado
 这样完全绕过容器、CUDA、GPU 传递等所有问题！

 步骤 1:如果您已安装本地 Dorado（非容器版），可预先生成 modBAM：

 # 安装本地 Dorado (CPU 或匹配您 CUDA 的版本)
 #    下载: https://github.com/nanoporetech/dorado/releases; https://software-docs.nanoporetech.com/dorado/latest/
 curl "https://cdn.oxfordnanoportal.com/software/analysis/dorado-2.0.0-linux-x64.tar.gz" -o dorado-2.0.0-linux-x64.tar.gz
 tar -xzf dorado-2.0.0-linux-x64.tar.gz

 # 添加到 PATH (永久生效)
 echo 'export PATH="/home/jhuang/Tools/dorado-2.0.0-linux-x64/bin:$PATH"' >> ~/.bashrc
 source ~/.bashrc

 # 验证安装: 2.0.0+20e87c8b
 dorado --version

 # 查看可用的碱基调用模型
 Positional arguments:
   model                             Model selection {fast,hac,sup}@v{version} for automatic model selection including modbases, or path to existing model directory.
   data                              The data directory or POD5 file path.

 # 查看可用的修饰检测模型
 dorado basecaller --help | grep -A5 "modified-bases"
 # ✅ 应看到: 6mA, 4mC_5mC, 5mC_5hmC 等

 步骤 2: 生成 modBAM
 ./generate_mapped_modbam.sh

     #!/bin/bash
     #===============================================================================
     # generate_mapped_modbam.sh - 本地 Dorado 生成已比对 modBAM
     #===============================================================================
     set -euo pipefail

     # === 配置 ===
     DORADO="/home/jhuang/Tools/dorado-2.0.0-linux-x64/bin/dorado"  # 您的本地 Dorado 路径
     REF_AN6="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa"
     REF_BG5="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa"
     POD5_AN6="./X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/An6/2157_4C_PBK79106_7ec05c46/An6_pod5_pass"
     POD5_BG5="./X101SC26036392-Z01-J003/Release-X101SC26036392-Z01-J003-20260513_01/Data-X101SC26036392-Z01-J003/BG5/2157_4C_PBK79106_7ec05c46/BG5_pod5_pass"
     OUTDIR="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5"

     # 确保参考基因组已索引
     samtools faidx "${REF_AN6}"
     samtools faidx "${REF_BG5}"

     echo "🚀 生成 An6 6mA modBAM (已比对)..."
     "${DORADO}" basecaller \
       --modified-bases 6mA \
       --emit-moves \
       --device cuda:0 \
       --reference "${REF_AN6}" \
       dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
       "${POD5_AN6}" | samtools view -b - > "${OUTDIR}/An6_6mA_mapped.mod.bam"

     echo "🚀 生成 BG5 6mA modBAM (已比对)..."
     "${DORADO}" basecaller \
       --modified-bases 6mA \
       --emit-moves \
       --device cuda:0 \
       --reference "${REF_BG5}" \
       dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
       "${POD5_BG5}" | samtools view -b - > "${OUTDIR}/BG5_6mA_mapped.mod.bam"

     echo "🚀 生成 An6 6mA modBAM (已比对)..."
     "${DORADO}" basecaller \
       --modified-bases 4mC_5mC \
       --emit-moves \
       --device cuda:0 \
       --reference "${REF_AN6}" \
       dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
       "${POD5_AN6}" | samtools view -b - > "${OUTDIR}/An6_4mC_5mC_mapped.mod.bam"

     echo "🚀 生成 BG5 6mA modBAM (已比对)..."
     "${DORADO}" basecaller \
       --modified-bases 4mC_5mC \
       --emit-moves \
       --device cuda:0 \
       --reference "${REF_BG5}" \
       dna_r10.4.1_e8.2_400bps_sup@v5.0.0 \
       "${POD5_BG5}" | samtools view -b - > "${OUTDIR}/BG5_4mC_5mC_mapped.mod.bam"

     # 验证输出
     echo "🔍 验证 BAM 文件..."
     for BAM in "${OUTDIR}"/*_6mA_mapped.mod.bam; do
       if samtools quickcheck "${BAM}" 2>/dev/null; then
         READS=$(samtools view -c "${BAM}")
         TAGS=$(samtools view "${BAM}" | head -50 | grep -o "MM:Z:[^ ]*" | sort -u | tr '\n' ' ')
         echo "✅ $(basename "${BAM}"): ${READS} reads, tags: ${TAGS}"
       else
         echo "❌ $(basename "${BAM}"): 文件损坏或格式错误"
       fi
     done

     echo "🎉 modBAM 生成完成!"

 步骤 3: 创建 methylong samplesheet（使用预生成 modBAM）
 See samplesheet_6mA.csv and samplesheet_4mC_5mC.csv

 步骤 4: 运行 methylong with --skip_basecalling

     ## Check if BAMs are read-name sorted:
     #samtools view -H An6_6mA_mapped.mod.bam | grep "^@HD"
     ## Should show: @HD VN:1.6 SO:queryname

     ## If not sorted correctly:
     #samtools sort -n -o An6.sorted_by_name.bam An6_6mA_mapped.mod.bam
     #samtools sort -n -o BG5.sorted_by_name.bam BG5_6mA_mapped.mod.bam

     ## Check BAM integrity:
     #samtools quickcheck -v An6_6mA_mapped.mod.bam
     #samtools quickcheck -v An6_after_trim.bam

     ## Use modkit to check tags:
     #modkit modbam check-tags An6_6mA_mapped.mod.bam
     #modkit modbam check-tags An6_after_trim.bam

     #BUG: '--skip_basecalling' does not exist (see the complete option list below.)
     nextflow run nf-core/methylong \
         -r 2.0.0 \
         -profile docker \
         --input samplesheet_6mA.csv \
         --outdir methylome_out_6mA \
         --skip_basecalling \          # ⭐ 关键: 跳过内部 Dorado，使用预生成 modBAM
         -resume \
         -work-dir methylome_out_6mA/work

     # Ist die BAM-Datei unaligniert? (keine @SQ Header mit Alignment)
     samtools view -H An6_6mA_mapped.mod.bam | head -10

     # Sind MM/ML-Tags für 6mA vorhanden?
     samtools view An6_6mA_mapped.mod.bam | head -5 | grep -o "MM:Z:[^ ]*"

     #NEXT_WEEK FOLLOWING the command under the website
     #https://bioinformatics.cc/debug-methylomic-analysis-of-data_tam_dnaseq_2026_an6_bg5/

     # FIRSTLY RUNNING NEXTFLOW
     # 1️⃣ 处理 6mA BAM（仅含 6mA 信号）
     nextflow run nf-core/methylong -r 2.0.0 -profile docker \
         --input samplesheet_6mA.csv \
         --outdir methylome_out_6mA \
         --ont_aligner minimap2 \
         --m6a \
         --skip_snvs \
         -resume \
         -work-dir methylome_out_6mA/work
     # 2️⃣ 处理 4mC_5mC BAM（含 4mC + 5mC 信号）
     nextflow run nf-core/methylong -r 2.0.0 -profile docker \
         --input samplesheet_4mC_5mC.csv \
         --outdir methylome_out_4mC_5mC \
         --ont_aligner minimap2 \
         --all_contexts \
         --skip_snvs \
         -resume \
         -work-dir methylome_out_4mC_5mC/work

 步骤 5: Check output
 [ ] ✅ 本地 Dorado 生成已比对 modBAM: samtools quickcheck An6_6mA_mapped.mod.bam 通过
 [ ] ✅ BAM 含 @SQ 头: samtools view -H An6_6mA_mapped.mod.bam | grep "^@SQ"
 [ ] ✅ BAM 含修饰标签: samtools view An6_6mA_mapped.mod.bam | grep "MM:Z:6mA" | head -3 returns nothing, but samtools view An6_6mA_mapped.mod.bam | grep "MM:Z:" | head -3
 [ ] ✅ samplesheet 使用绝对路径 + 精确 sample 名称
 [ ] ✅ 运行命令添加 --skip_basecalling
 [ ] ✅ 输出目录存在: methylome_out_6mA/methylation_calls/*.bedMethyl.gz

SECONDLY RERUN modkit pileup to correct the BUG from nf-core/methylong

 # ---- 6mA 对应过滤命令 ----
 # 方案 A：降低阈值测试 (NOT_USED)
 # docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam An6_6mA_test.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif A 0 --filter-threshold A:0.7 --threads 8

 方案 B：6mA: 无过滤 + 手动后处理 (CHOSEN)
 # ✅ 第 1 步：生成无过滤的原始结果（6mA）
 docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam An6_6mA_raw.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif A 0 --no-filtering --threads 8
 docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_6mA/ont/BG5/alignment/BG5_aligned.bam BG5_6mA_raw.bed --ref bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa --motif A 0 --no-filtering --threads 8

 # # ✅ 第 2 步：检查原始概率最大值
 # awk '{print $11}' An6_6mA_raw.bed | sort -rn | head -5
 # # 如果最大值 < 70 → 降低分析阈值

 # # ✅ 第 3 步：手动过滤生成最终结果
 # # ✅ 过滤高置信度 6mA 位点：概率≥70% + 覆盖度≥10
 # awk '$4 ~ /^A,/ && $11 >= 70 && $5 >= 10' An6_6mA_raw.bed > An6_6mA_final.bed

 # # ✅ 第 4 步：验证并统计
 # echo "=== 6mA 最终结果 ==="
 # # 📊 统计结果
 # echo "=== 6mA 高置信度位点统计 ==="
 # echo "总位点数: $(wc -l < An6_6mA_final.bed)"
 # echo ""
 # echo "概率分布 (前10名):"
 # awk '{print $11}' An6_6mA_final.bed | sort -rn | uniq -c | head -10
 # echo ""
 # echo "按染色体分布:"
 # cut -f1 An6_6mA_final.bed | sort | uniq -c | sort -rn

 # 方案 C：检查 BAM 信号原始分布 (For DEBUG)
 # # 提取 ML 标签中的概率值 (0-255 缩放 → 转换为 0-1)
 # samtools view ./methylome_out_6mA/ont/An6/alignment/An6_aligned.bam | head -100 | \
 #   grep -oP 'ML:B:C,\K[0-9,]+' | tr ',' '\n' | \
 #   awk '{if($1>0) print $1/255}' | sort -n | uniq -c | head -10

 # ---- 4mC_5mC 对应过滤命令 ----
 docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_4mC_5mC/ont/An6/alignment/An6_aligned.bam An6_4mC_5mC_raw.bed --ref bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa --motif CG 0 --motif C 0 --no-filtering --threads 8
 docker run --rm -v /mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5:/data -w /data quay.io/biocontainers/ont-modkit:0.5.0--hcdda2d0_2 modkit pileup ./methylome_out_4mC_5mC/ont/BG5/alignment/BG5_aligned.bam BG5_4mC_5mC_raw.bed --ref bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa --motif CG 0 --motif C 0 --no-filtering --threads 8

 #修饰类型   列4匹配正则  说明
 #5mC @ CG   ^m,CG,  5-甲基胞嘧啶，在CG motif上
 #5mC @ C    ^m,C,   5-甲基胞嘧啶，在单碱基C上
 #4mC @ CG   ^21839,CG,  4-甲基胞嘧啶，在CG motif上
 #4mC @ C    ^21839,C,   4-甲基胞嘧啶，在单碱基C上

 ┌─────────────────────────────────────────────────────┐
 │  Column 4 格式：[修饰代码],[motif],[offset]         │
 ├─────────────────────────────────────────────────────┤
 │  m,CG,0      → 5mC @ CG 位点                        │
 │  m,C,0       → 5mC @ 任意 C 位点（包含 CG+非CG）    │
 │  21839,CG,0  → 4mC @ CG 位点                        │
 │  21839,C,0   → 4mC @ 任意 C 位点（包含 CG+非CG）    │
 └─────────────────────────────────────────────────────┘
 ⚠️ 重要提示：m,C,0 包含 所有 C 位点（CG + CHG + CHH），与 m,CG,0 有重叠！
 优先级 1: 4mC @ CG (21839,CG,0)  ← 细菌最主要的修饰
 优先级 2: 5mC @ CG (m,CG,0)      ← 次要修饰
 优先级 3: 非 CG 修饰             ← 探索性分析
 重点关注：
 ✅ 4mC @ CG - 细菌限制修饰系统的核心
 ✅ 5mC @ CG - 部分细菌也有 5mC
 ⚠️ 非 CG 修饰 - 细菌中较少，可先忽略

 修饰类型    位点数量    占比  生物学解释
 5mC @ CG    437 33% 经典 CpG 甲基化
 5mC @ non-CG    870 67% ⭐ 非 CG 位点更多，正常！
 4mC @ CG    7   17% 4mC 很少在 CG 上
 4mC @ non-CG    34  83% ⭐ 4mC 偏好非 CG 位点

DNA methylation analysis (Sample An6: Acinetobacter harbinensis and Sample BG5: Pedobacter sp. strain BG5, both are bacteria, not plants.)

 mamba create -n methylation -c conda-forge -c bioconda \
   r-base=4.3 \
   r-dss \
   r-methylkit \
   r-chipseeker \
   r-qvalue \
   metilene \
   bedtools \
   samtools \
   ucsc-bedgraphtobigwig \
   -y
 mamba activate methylation

 #chmod +x dna_methylation_analysis_4mC_5mC.sh
 (methylation) bash dna_methylation_analysis_4mC_5mC.sh \
     -i An6_4mC_5mC_raw.bed \
     -r bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa \
     -o An6_4mC_5mC_methylation_results \
     -c 10 -m 70 -n 50 -f 50 -t 100

 #Using homerMotifs.all.motifs rather than nonRedundant.motifs
 (methylation) python annotate_motifs_rebase.py \
     --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/5mC_CG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output An6_5mC_CG_results.tsv
 #  Total motifs: 35
 #  REBASE matches: 8 (22.9%)
 #  ⚠️ Methylase matches: 2

 python annotate_motifs_rebase.py \
     --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/4mC_CG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output An6_4mC_CG_results.tsv
 #  Total motifs: Total motifs: 35
 # REBASE matches: 6 (17.1%)
 #⚠️ Methylase matches: 0

 python annotate_motifs_rebase.py \
     --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/5mC_nonCG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output An6_5mC_nonCG_results.tsv
 #📊 Summary:
 #Total motifs: 37
 #REBASE matches: 4 (10.8%)
 #⚠️ Methylase matches: 0

 python annotate_motifs_rebase.py \
     --homer An6_4mC_5mC_methylation_results/motif_analysis/homer/4mC_nonCG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output An6_4mC_nonCG_results.tsv
 #    Total motifs: 35
 #REBASE matches: 10 (28.6%)
 #⚠️ Methylase matches: 2

 bash dna_methylation_analysis_4mC_5mC.sh \
     -i BG5_4mC_5mC_raw.bed \
     -r bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa \
     -o BG5_4mC_5mC_methylation_results \
     -c 10 -m 70 -n 50 -f 50 -t 100

 python annotate_motifs_rebase.py \
     --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/5mC_CG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output BG5_5mC_CG_results.tsv
 #  📊 Summary:
 #Total motifs: 36
 #REBASE matches: 7 (19.4%)
 #⚠️ Methylase matches: 1

 python annotate_motifs_rebase.py \
     --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/4mC_CG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output BG5_4mC_CG_results.tsv
 # Total motifs: 35
 #REBASE matches: 8 (22.9%)
 #⚠️ Methylase matches: 3

 python annotate_motifs_rebase.py \
     --homer BG5_4mC_5mC_methylation_results/motif_analysis/homer/5mC_nonCG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output BG5_5mC_nonCG_results.tsv
 #  Total motifs: 39
 #REBASE matches: 7 (17.9%)
 #⚠️ Methylase matches: 0

 python annotate_motifs_rebase.py \
     --homer  BG5_4mC_5mC_methylation_results/motif_analysis/homer/4mC_nonCG/homerMotifs.all.motifs \
     --rebase withrefm.txt \
     --output  BG5_4mC_nonCG_results.tsv
 #Total motifs: 35
 #REBASE matches: 9 (25.7%)
 #⚠️ Methylase matches: 3

 # -- For 6mA --
 ./dna_methylation_analysis_6mA.sh \
     -i An6_6mA_raw.bed \
     -r bacass_out/Medaka/An6-unicycler-medaka_polished_genome.fa \
     -o An6_6mA_methylation_results \
     -c 10 -m 70 -n 50 -f 50 -t 100

 python annotate_motifs_rebase.py \
     --homer An6_6mA_methylation_results/nonRedundant.motifs.txt \
     --rebase withrefm.txt \
     --output An6_6mA_results.tsv
 #📊 Summary:
 #Total motifs: 25
 #REBASE matches: 4 (16.0%)
 #⚠️ Methylase matches: 1

 ./dna_methylation_analysis_6mA.sh \
     -i BG5_6mA_raw.bed \
     -r bacass_out/Medaka/BG5-unicycler-medaka_polished_genome.fa \
     -o BG5_6mA_methylation_results \
     -c 10 -m 70 -n 50 -f 50 -t 100

 python annotate_motifs_rebase.py \
     --homer BG5_6mA_methylation_results/nonRedundant.motifs.txt \
     --rebase withrefm.txt \
     --output BG5_6mA_results.tsv
 #Total motifs: 25
 #REBASE matches: 7 (28.0%)
 #⚠️ Methylase matches: 2

 (table1_env) python3 generate_methylation_excel.py
 #Delete the column 'bestguess_tf'; Mark rebase- and methylase-annotated cells yellow; Move 6mA as the first sheet!

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Daily Archives: 2026年5月29日

Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)

✅ 简短回答：是的，nf-core/methylong 接受 2 个生物学重复进行 DMR 分析

🔧 技术层面：methylong 如何支持 2 重复设计

1️⃣ 参数配置示例

2️⃣ samplesheet 格式要求

⚠️ 统计层面：2 重复的局限性与应对策略

🔹 DSS 在 2 重复下的工作原理

🔹 2 重复 vs 3+ 重复的统计效力对比

🎯 实用建议：如何在 2 重复下最大化分析价值

✅ 策略 1：提高过滤阈值，聚焦高置信结果

✅ 策略 2：结合生物学先验知识验证

✅ 策略 3：敏感性分析 (Bootstrap 评估稳健性)

✅ 策略 4：报告时透明说明局限性

📊 如果您决定扩展至 3 重复：最小成本方案

💡 一句话总结

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with `modkit`

2. High-Confidence Site Filtering

3. Flanking Sequence Extraction for Motif Analysis

4. Motif Enrichment Analysis with HOMER

5. REBASE Annotation & Methylase Classification

Matching Logic:

Output Classification:

6. Output File Structure

7. Software & Database Versions

🔑 Key Clarifications

dorado + nf-core/methylong + modkit pileup + interpretation using own scripts processing Data_Tam_DNAseq_2026_An6_BG5 methylome data

✅ 简短回答：是的，nf-core/methylong 接受 2 个生物学重复进行 DMR 分析

🔧 技术层面：methylong 如何支持 2 重复设计

1️⃣ 参数配置示例

2️⃣ samplesheet 格式要求

⚠️ 统计层面：2 重复的局限性与应对策略

🔹 DSS 在 2 重复下的工作原理

🔹 2 重复 vs 3+ 重复的统计效力对比

🎯 实用建议：如何在 2 重复下最大化分析价值

✅ 策略 1：提高过滤阈值，聚焦高置信结果

✅ 策略 2：结合生物学先验知识验证

✅ 策略 3：敏感性分析 (Bootstrap 评估稳健性)

✅ 策略 4：报告时透明说明局限性

📊 如果您决定扩展至 3 重复：最小成本方案

💡 一句话总结

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with modkit

2. High-Confidence Site Filtering

3. Flanking Sequence Extraction for Motif Analysis

4. Motif Enrichment Analysis with HOMER

5. REBASE Annotation & Methylase Classification

Matching Logic:

Output Classification:

6. Output File Structure

7. Software & Database Versions

🔑 Key Clarifications

1. Base Modification Calling with `modkit`