merge_methylation_results (Data_Tam_DNAseq_2026_An6_BG5)

🤔 合并 vs. 分开:细菌甲基化结果管理策略

简短回答:建议采用 “原始结果分开 + 分析视图合并” 的混合策略,兼顾可重复性与分析便利性。


📊 两种策略对比

维度 🔹 分开保存 🔹 合并结果 🔹 混合策略 (推荐)
可重复性 ✅ 优秀 (参数/版本独立追踪) ❌ 较差 (合并后难追溯) ✅ 优秀 (保留原始 + 生成合并视图)
质量控制 ✅ 独立评估每种修饰质量 ❌ 混合后难定位问题源 ✅ 先独立QC,再合并高质量结果
可视化 (IGV) ❌ 需加载多个轨道 ✅ 单轨道显示所有修饰 ✅ 生成合并 bigWig + 保留原始
统计分析 ❌ 需手动关联同位置不同修饰 ✅ 直接比较 6mA/4mC 密度 ✅ 提供统一 TSV + 保留原始细节
存储效率 ❌ 部分文件重复 (如 BAM) ✅ 仅存储最终结果 ⚠️ 略多存储,但可清理中间文件
协作共享 ❌ 需说明多个目录结构 ✅ 单文件易分享 ✅ 提供”发布包” + 原始数据可选
重新分析 ✅ 仅重跑需更新的修饰类型 ❌ 需重新合并 ✅ 独立重跑 + 增量更新合并视图

🎯 推荐:混合策略工作流

graph LR
    A[POD5 输入] --> B[Run 1: 6mA];
    A --> C[Run 2: 4mC_5mC];

    B --> D[methylome_out_6mA/];
    C --> E[methylome_out_4mC/];

    D --> F[✅ 保留: 原始 bedMethyl/BAM/日志];
    E --> F;

    F --> G[🔧 合并脚本];
    G --> H[📦 发布包/];

    H --> I[merged_An6.all_mods.bedMethyl.gz];
    H --> J[merged_An6.all_mods.bigWig];
    H --> K[merged_An6.analysis_ready.tsv];
    H --> L[QC_summary.md];

    style F fill:#E8F5E9,stroke:#4CAF50
    style H fill:#FFF3E0,stroke:#FF9800

🛠️ 实用脚本:merge_methylation_results.py

#!/usr/bin/env python3
"""
🔗 merge_methylation_results.py
合并多轮 methylong 运行结果 (6mA + 4mC)

功能:
- 保留原始输出目录 (可重复性)
- 生成合并的 bedMethyl/bigWig (可视化)
- 创建分析就绪的 TSV (统计)
- 生成合并质量报告

用法:
    python .py \
      --sample An6 \
      --mods 6mA 4mC_5mC \
      --input-dirs methylome_out_6mA methylome_out_4mC \
      --output-dir merged_results/An6 \
      --reference /path/to/assembly.fasta
"""

import argparse
import gzip
import logging
import os
import sys
from pathlib import Path
from collections import defaultdict

import pandas as pd
import pyfaidx
import pyBigWig

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    handlers=[
        logging.FileHandler('merge_results.log'),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

def parse_bedmethyl(filepath: str, mod_label: str) -> pd.DataFrame:
    """解析 bedMethyl 文件,添加修饰类型标签"""
    logger.info(f"📥 解析 {filepath} ({mod_label})")

    records = []
    open_func = gzip.open if filepath.endswith('.gz') else open

    with open_func(filepath, 'rt') as f:
        for line in f:
            if line.startswith('#') or not line.strip():
                continue
            fields = line.strip().split('\t')
            if len(fields) < 12:
                continue

            chrom, start, end, name, score, strand = fields[0:6]
            coverage = int(fields[10])
            mod_level = float(fields[11]) if len(fields) > 11 else 0

            # 提取具体修饰类型 (从 name 或根据输入标签)
            if '6mA' in name or mod_label == '6mA':
                mod_type = '6mA'
            elif '4mC' in name or (mod_label == '4mC_5mC' and '4mC' in name):
                mod_type = '4mC'
            elif '5mC' in name or (mod_label == '4mC_5mC' and '5mC' in name):
                mod_type = '5mC'
            else:
                mod_type = mod_label.split('_')[0]  # 回退策略

            records.append({
                'chrom': chrom,
                'start': int(start),
                'end': int(end),
                'strand': strand,
                'mod_type': mod_type,
                'mod_level': mod_level,
                'coverage': coverage,
                'score': int(score),
                'source_run': mod_label,  # 追踪来源
                'original_name': name
            })

    df = pd.DataFrame(records)
    logger.info(f"✅ 提取 {len(df)} 个 {mod_label} 位点")
    return df

def merge_bedmethyl(dfs: list[pd.DataFrame], output_path: str):
    """合并多个 DataFrame 为统一 bedMethyl 格式"""
    logger.info(f"🔗 合并 {len(dfs)} 个修饰类型...")

    # 合并并排序
    merged = pd.concat(dfs, ignore_index=True)
    merged = merged.sort_values(['chrom', 'start', 'mod_type'])

    # 重建 bedMethyl 格式 (12 列)
    with gzip.open(output_path, 'wt') as f:
        for _, row in merged.iterrows():
            # name 字段: 包含修饰类型 + 概率 (便于过滤)
            name_field = f"mod_{row['mod_type']}_{row['mod_level']:.2f}"

            fields = [
                row['chrom'], str(row['start']), str(row['end']),
                name_field, str(row['score']), row['strand'],
                str(row['start']), str(row['end']), "0,0,0",
                "1", str(row['coverage']), str(row['mod_level'])
            ]
            f.write('\t'.join(fields) + '\n')

    logger.info(f"💾 保存合并 bedMethyl: {output_path} ({len(merged)} 位点)")

def create_bigwig(merged_df: pd.DataFrame, reference_fasta: str, output_path: str):
    """生成合并的 bigWig 文件 (用于 IGV 可视化)"""
    logger.info(f"🎨 生成 bigWig: {output_path}")

    # 读取染色体长度
    fasta = pyfaidx.Fasta(reference_fasta, as_raw=True)
    chrom_sizes = [(name, len(seq)) for name, seq in fasta.items()]
    fasta.close()

    # 创建 bigWig
    bw = pyBigWig.open(output_path, "w")
    bw.addHeader(chrom_sizes)

    # 按染色体和修饰类型分组写入
    for chrom, chrom_df in merged_df.groupby('chrom'):
        if chrom not in dict(chrom_sizes):
            continue

        # 为每种修饰类型创建独立轨道 (可选: 或合并为单轨道)
        for mod_type in chrom_df['mod_type'].unique():
            mod_df = chrom_df[chrom_df['mod_type'] == mod_type]

            # 创建 intervals: (start, end, value)
            intervals = list(zip(
                mod_df['start'].tolist(),
                mod_df['end'].tolist(),
                mod_df['mod_level'].tolist()
            ))

            if intervals:
                track_name = f"{chrom}_{mod_type}"
                bw.addEntries(
                    [chrom] * len(intervals),
                    [s for s, e, v in intervals],
                    ends=[e for s, e, v in intervals],
                    values=[v for s, e, v in intervals],
                    span=None, strand=None
                )

    bw.close()
    logger.info(f"✅ bigWig 创建完成")

def create_analysis_tsv(merged_df: pd.DataFrame, output_path: str):
    """生成分析就绪的 TSV (便于 R/Python 统计)"""
    logger.info(f"📊 创建分析 TSV: {output_path}")

    # 添加衍生特征
    analysis_df = merged_df.copy()
    analysis_df['position'] = analysis_df['start']  # 简化: 用 start 代表位置
    analysis_df['log_mod_level'] = analysis_df['mod_level'].apply(
        lambda x: -1 if x == 0 else x  # 避免 log(0)
    )

    # 保存
    analysis_df.to_csv(output_path, sep='\t', index=False, compression='gzip')
    logger.info(f"✅ 保存 {len(analysis_df)} 行分析数据")

def generate_qc_summary(merged_df: pd.DataFrame, output_path: str):
    """生成合并质量报告"""
    logger.info(f"📋 生成质量报告: {output_path}")

    summary = {
        'total_sites': len(merged_df),
        'by_mod_type': merged_df['mod_type'].value_counts().to_dict(),
        'by_chromosome': merged_df.groupby('chrom').size().to_dict(),
        'coverage_stats': merged_df['coverage'].describe().to_dict(),
        'mod_level_stats': merged_df['mod_level'].describe().to_dict(),
        'high_confidence_sites': len(merged_df[
            (merged_df['coverage'] >= 10) & 
            (merged_df['mod_level'] >= 0.8)
        ])
    }

    with open(output_path, 'w') as f:
        f.write("# 甲基化结果合并质量报告\n")
        f.write(f"总位点数: {summary['total_sites']:,}\n\n")

        f.write("## 按修饰类型分布\n")
        for mod, count in summary['by_mod_type'].items():
            pct = count / summary['total_sites'] * 100
            f.write(f"- {mod}: {count:,} ({pct:.1f}%)\n")

        f.write(f"\n## 高质量位点 (覆盖度≥10×, 修饰水平≥0.8)\n")
        f.write(f"{summary['high_confidence_sites']:,} ({summary['high_confidence_sites']/summary['total_sites']*100:.1f}%)\n")

    logger.info(f"✅ 质量报告保存")

def main():
    parser = argparse.ArgumentParser(description='合并 methylong 多修饰结果')
    parser.add_argument('--sample', required=True, help='样本名称 (如: An6)')
    parser.add_argument('--mods', nargs='+', required=True, 
                       help='修饰类型标签 (如: 6mA 4mC_5mC)')
    parser.add_argument('--input-dirs', nargs='+', required=True,
                       help='对应修饰的 methylong 输出目录')
    parser.add_argument('--output-dir', required=True, help='合并结果输出目录')
    parser.add_argument('--reference', required=True, help='参考基因组 FASTA')
    parser.add_argument('--min-coverage', type=int, default=5, 
                       help='最小覆盖度过滤 (默认: 5)')
    parser.add_argument('--min-mod-level', type=float, default=0.0,
                       help='最小修饰水平过滤 (默认: 0.0)')
    args = parser.parse_args()

    # 创建输出目录
    outdir = Path(args.output_dir)
    outdir.mkdir(parents=True, exist_ok=True)

    # 1. 解析各修饰类型结果
    dfs = []
    for mod_label, input_dir in zip(args.mods, args.input_dirs):
        bedmethyl_path = Path(input_dir) / 'methylation_calls' / f"{args.sample}.{mod_label.split('_')[0]}.bedMethyl.gz"

        if not bedmethyl_path.exists():
            # 尝试其他命名变体
            alt_path = Path(input_dir) / 'methylation_calls' / f"{args.sample}.bedMethyl.gz"
            if alt_path.exists():
                bedmethyl_path = alt_path
            else:
                logger.warning(f"⚠️ 未找到 bedMethyl: {bedmethyl_path}")
                continue

        df = parse_bedmethyl(str(bedmethyl_path), mod_label)

        # 应用质控过滤
        before = len(df)
        df = df[
            (df['coverage'] >= args.min_coverage) &
            (df['mod_level'] >= args.min_mod_level)
        ]
        if before > len(df):
            logger.info(f"   🔍 过滤后: {before} → {len(df)} 位点")

        dfs.append(df)

    if not dfs:
        logger.error("❌ 无有效数据可合并")
        return 1

    # 2. 合并 bedMethyl
    merged_bed = outdir / f"{args.sample}.all_mods.bedMethyl.gz"
    merge_bedmethyl(dfs, str(merged_bed))

    # 3. 生成 bigWig (可选,需要 pyBigWig)
    try:
        merged_df = pd.concat(dfs, ignore_index=True)
        merged_bw = outdir / f"{args.sample}.all_mods.bigWig"
        create_bigwig(merged_df, args.reference, str(merged_bw))
    except Exception as e:
        logger.warning(f"⚠️ bigWig 生成失败: {e}")

    # 4. 创建分析 TSV
    analysis_tsv = outdir / f"{args.sample}.analysis_ready.tsv.gz"
    create_analysis_tsv(pd.concat(dfs, ignore_index=True), str(analysis_tsv))

    # 5. 生成质量报告
    qc_report = outdir / f"{args.sample}.merge_QC.md"
    generate_qc_summary(pd.concat(dfs, ignore_index=True), str(qc_report))

    logger.info(f"\n🎉 合并完成!")
    logger.info(f"📁 输出目录: {outdir}")
    logger.info(f"📄 快速查看: cat {qc_report}")

    return 0

if __name__ == '__main__':
    sys.exit(main())

🚀 一键运行脚本:run_merge.sh

#!/bin/bash
#===============================================================================
# 🔗 run_merge.sh - 合并 6mA + 4mC 甲基化结果
#===============================================================================
set -euo pipefail

SAMPLE="An6"  # 或 BG5
REFERENCE="/mnt/md1/DATA/Data_Tam_DNAseq_2026_An6_BG5/bacass_out/${SAMPLE}/unicycler/assembly.fasta"
OUTPUT_DIR="merged_results/${SAMPLE}"

echo "🔗 合并 ${SAMPLE} 甲基化结果..."

python3 merge_methylation_results.py \
  --sample "${SAMPLE}" \
  --mods 6mA 4mC_5mC \
  --input-dirs methylome_out_6mA methylome_out_4mC \
  --output-dir "${OUTPUT_DIR}" \
  --reference "${REFERENCE}" \
  --min-coverage 10 \
  --min-mod-level 0.7

echo ""
echo "✅ 合并完成! 输出文件:"
ls -lh "${OUTPUT_DIR}/"

echo ""
echo "🎨 IGV 可视化:"
echo "   1. 加载参考基因组: ${REFERENCE}"
echo "   2. 加载轨道: ${OUTPUT_DIR}/${SAMPLE}.all_mods.bigWig"
echo "   3. 按修饰类型过滤: 右键轨道 → Filter by name → mod_6mA / mod_4mC"

echo ""
echo "📊 快速统计:"
zcat "${OUTPUT_DIR}/${SAMPLE}.analysis_ready.tsv.gz" | \
  awk -F'\t' 'NR>1 {count[$7]++} END {for (m in count) print m": "count[m]}' | sort -k2 -nr

📋 文件组织建议

project/
├── methylome_out_6mA/          # 🔹 原始 6mA 结果 (保留!)
│   ├── methylation_calls/
│   ├── motifs/
│   └── pipeline_info/
├── methylome_out_4mC/          # 🔹 原始 4mC 结果 (保留!)
│   └── ...
├── merged_results/             # 🔹 合并分析视图 (用于发表/共享)
│   ├── An6/
│   │   ├── An6.all_mods.bedMethyl.gz    # 🧬 合并 bedMethyl
│   │   ├── An6.all_mods.bigWig          # 🎨 IGV 轨道
│   │   ├── An6.analysis_ready.tsv.gz    # 📊 统计输入
│   │   └── An6.merge_QC.md              # 📋 质量报告
│   └── BG5/ (同上)
└── scripts/
    ├── merge_methylation_results.py
    └── run_merge.sh

💡 最佳实践:在论文 Methods 中注明: “6mA and 4mC were called separately using nf-core/methylong v2.0.0 with Dorado models 6mA and 4mC_5mC, respectively. Results were merged post-hoc for visualization and integrated analysis while preserving original outputs for reproducibility.”


🇩🇪 德语总结(Deutsch)

Empfohlene Strategie:

  • 🔹 Original-Outputs trennen (für Reproduzierbarkeit & QC)
  • 🔹 Merge-Views generieren (für Visualisierung & Statistik)

📁 Verzeichnis-Struktur:

methylome_out_6mA/     ← Original 6mA (behalten!)
methylome_out_4mC/     ← Original 4mC (behalten!)
merged_results/An6/    ← Analyse-ready Dateien (teilen/publishen)

🔧 Merge-Skript:

  • Erstellt: merged.bedMethyl.gz, merged.bigWig, analysis.tsv
  • Behält: mod_type, coverage, source_run Metadaten
  • Generiert: QC-Bericht für Methods-Sektion

⚠️ 注意事项

问题 解决方案
bedMethyl name 字段冲突 合并脚本自动重命名为 mod_{type}_{prob} 格式
bigWig 轨道过多 按修饰类型分组写入,IGV 中可折叠/筛选
统计时重复计数 analysis TSV 含 source_run 列,便于追溯/去重
磁盘空间 合并后可归档原始目录: tar -czf methylome_out_6mA.tar.gz methylome_out_6mA/

🎯 针对您项目的具体建议

# 1. 先完成 6mA 单轮运行 (验证流程)
nextflow run nf-core/methylong ... --dorado_modification 6mA ...

# 2. 验证 6mA 结果质量
zcat methylome_out_6mA/motifs/An6_6mA_motifs.tsv | head -5
# 应看到 GATC 等细菌基序 → 说明流程正常

# 3. 如需 4mC,运行第二轮
nextflow run nf-core/methylong ... --dorado_modification 4mC_5mC ...

# 4. 合并结果 (仅当两轮都成功)
./run_merge.sh  # 使用上方脚本

# 5. 发表准备
# - 主图: merged.bigWig in IGV (截图)
# - 附表: analysis_ready.tsv.gz (完整位点)
# - Methods: 注明分开调用 + 合并策略

需要我帮您:

  • 🔄 适配合并脚本到您的具体文件命名规范?
  • 📊 添加 6mA/4mC 共定位分析 (同一基因组位置同时存在两种修饰)?
  • 🎨 生成发表级 Figure 代码 (ggplot2 + ggbio 轨道图)?

祝您分析顺利,期待您的重要发现!🦠🧬✨

Leave a Reply

Your email address will not be published. Required fields are marked *