Detailed Methods: Bacterial Methylome Analysis Pipeline (Data_Tam_DNAseq_2026_An6_BG5)

🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline

1. Base Modification Calling with modkit

Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:

Column Name Description Example
1 chrom Contig/chromosome identifier contig_1
2 start Modification position (0-based, BED format) 12345
3 end End position (start + 1 for single-base mods) 12346
4 name Modification code + sequence context a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC)
5 score Total read coverage at the site 45
6 strand Strand orientation (+, -, or . for unstranded) +
7-9 thickStart/End, itemRgb Visualization fields (unused in analysis)
10 blockCount Reads supporting the modification call 32
11 blockSizes Modification probability (0–100%) ← core filtering metric 85.3
12-18 Extended QC metrics mod_count, unmod_count, del_count, no_call_count, strand-specific coverage

📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g., 85.3 = 85.3% of reads show 6mA at this adenine).


2. High-Confidence Site Filtering

Sites were filtered using two stringent thresholds to minimize false positives:

  • Minimum coverage: ≥10 reads ($5 >= 10)
  • Minimum modification rate:
    • 5mC: ≥70% ($11 >= 70)
    • 4mC: ≥50% ($11 >= 50)
    • 6mA: ≥70% ($11 >= 70)

Filtering was performed using awk:

# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed

Sites were further separated by:

  • Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
  • Sequence context: CG vs. non-CG (for cytosine modifications)

3. Flanking Sequence Extraction for Motif Analysis

For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:

Modification site:          [position X]
Extracted window:  [X-50] ................ [X] ................ [X+50]
                          ↑
                  Total length = 101 bp

Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.

Extraction was performed using bedtools getfasta:

# Create extended regions file
awk -F'\t' -v flank=50 '{
    c=$1; gsub(/^>/,"",c);
    s=($2-flank>0)?$2-flank:0; e=$3+flank;
    print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed

# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s

4. Motif Enrichment Analysis with HOMER

De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:

Parameter Value Rationale
-len 4,6,8,10 Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs
-size 50 Matches the ±50 bp flanking window used for extraction
-mask true Masks low-complexity/repetitive regions to reduce spurious motifs
-S 10 Optimizes 10 putative motifs per dataset to balance sensitivity and specificity
-noknown true Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE
-useNewBg true Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability)
-p 8–100 Parallel threads (adjusted per compute resource)

Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.


5. REBASE Annotation & Methylase Classification

Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.

Matching Logic:

  1. IUPAC-aware comparison: Motifs containing degenerate bases (e.g., R=A/G, Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences.
  2. Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
  3. Methylation notation parsing: REBASE entries with methylation annotations (e.g., 2(6) = N6-methyladenine at position 2) were parsed to distinguish:
    • REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
    • Methylase Hits: Entries where the enzyme name starts with M. (indicating a methyltransferase) AND the methylation notation matches the modification type:
      • (4) = N4-methylcytosine → 4mC
      • (5) = 5-methylcytosine → 5mC
      • (6) = N6-methyladenine → 6mA

Output Classification:

Category Definition Example
No REBASE match Motif not found in REBASE Likely a transcription factor binding site or novel methylase
REBASE match (restriction enzyme only) Matches a restriction endonuclease but not its cognate methylase Coincidental sequence similarity; methylation likely from orphan methylase
Methylase hit Matches a methyltransferase with correct modification type notation High-confidence candidate for the enzyme catalyzing the observed modification

6. Output File Structure

Final results were compiled into Excel workbooks with the following sheets per sample:

Sheet Name Content Source File
4mC_CG 4mC sites in CG context 4mC_CG_filtered.bed + REBASE annotation
4mC_nonCG 4mC sites in non-CG context 4mC_nonCG_filtered.bed + REBASE annotation
5mC_CG 5mC sites in CG context 5mC_CG_filtered.bed + REBASE annotation
5mC_nonCG 5mC sites in non-CG context 5mC_nonCG_filtered.bed + REBASE annotation
6mA All 6mA sites 6mA_raw_filtered.bed + REBASE annotation

Each sheet includes:

  • Genomic coordinates and modification metrics (from modkit)
  • HOMER motif enrichment statistics (p-value, motif sequence)
  • REBASE annotation (matched enzymes, methylase status, methylation notation)

7. Software & Database Versions

Tool/Database Version Purpose
modkit 0.3.1 Base modification calling from nanopore signals
bedtools 2.31.1 Sequence extraction and genomic interval operations
HOMER 4.11 De novo motif discovery and enrichment analysis
REBASE 605 (Apr 2026) Curated database of restriction enzymes and methyltransferases
Python 3.10+ Custom annotation and data integration scripts
pandas/openpyxl Latest Excel workbook generation

🔑 Key Clarifications

  1. Flanking window: -f 50 = ±50 bp (101 bp total), not ±25 bp.
  2. Modification rate: Column 11 (blockSizes) = percent modified (0–100), not a probability score.
  3. Methylase identification: Requires BOTH M. prefix in enzyme name AND matching methylation notation (4)/(5)/(6).
  4. Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.

This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.

Leave a Reply

Your email address will not be published. Required fields are marked *