🔬 Detailed Methods: Bacterial Methylome Analysis Pipeline
1. Base Modification Calling with modkit
Raw nanopore sequencing signals were processed using modkit pileup (v0.3.1) to detect base modifications at single-nucleotide resolution. The pipeline accepts BAM files aligned to the isolate-specific reference genome and outputs a BED-format file with the following columns:
| Column | Name | Description | Example |
|---|---|---|---|
| 1 | chrom |
Contig/chromosome identifier | contig_1 |
| 2 | start |
Modification position (0-based, BED format) | 12345 |
| 3 | end |
End position (start + 1 for single-base mods) |
12346 |
| 4 | name |
Modification code + sequence context | a,CG,ACGT (6mA), m,CG,ACGT (5mC), 21839,C,ACGT (4mC) |
| 5 | score |
Total read coverage at the site | 45 |
| 6 | strand |
Strand orientation (+, -, or . for unstranded) |
+ |
| 7-9 | thickStart/End, itemRgb |
Visualization fields (unused in analysis) | – |
| 10 | blockCount |
Reads supporting the modification call | 32 |
| 11 | blockSizes |
Modification probability (0–100%) ← core filtering metric | 85.3 |
| 12-18 | Extended QC metrics | mod_count, unmod_count, del_count, no_call_count, strand-specific coverage |
– |
📌 Note on column 11: This value represents the percent of reads at that position called as modified (e.g.,
85.3= 85.3% of reads show 6mA at this adenine).
2. High-Confidence Site Filtering
Sites were filtered using two stringent thresholds to minimize false positives:
- Minimum coverage: ≥10 reads (
$5 >= 10) - Minimum modification rate:
- 5mC: ≥70% (
$11 >= 70) - 4mC: ≥50% (
$11 >= 50) - 6mA: ≥70% (
$11 >= 70)
- 5mC: ≥70% (
Filtering was performed using awk:
# Example: 5mC @ CG context
awk -F'\t' '$4 ~ /^m,CG,/ && $11 >= 70 && $5 >= 10 {print}' input.bed > 5mC_CG_filtered.bed
Sites were further separated by:
- Modification type: 4mC, 5mC, or 6mA (based on column 4 prefix)
- Sequence context: CG vs. non-CG (for cytosine modifications)
3. Flanking Sequence Extraction for Motif Analysis
For motif discovery, we extracted ±50 bp flanking sequences centered on each high-confidence modification site:
Modification site: [position X]
Extracted window: [X-50] ................ [X] ................ [X+50]
↑
Total length = 101 bp
Clarification: The -f 50 parameter in our pipeline specifies 50 bp upstream AND 50 bp downstream of the modification coordinate, yielding a 101-bp window (not ±25 bp). This ensures sufficient context for bacterial motif discovery, where recognition sites are typically 4–10 bp but may be embedded in larger regulatory elements.
Extraction was performed using bedtools getfasta:
# Create extended regions file
awk -F'\t' -v flank=50 '{
c=$1; gsub(/^>/,"",c);
s=($2-flank>0)?$2-flank:0; e=$3+flank;
print c"\t"s"\t"e"\t"$4"_"NR
}' filtered.bed > regions.bed
# Extract sequences (strand-aware)
bedtools getfasta -fi genome.fa -bed regions.bed -fo sequences.fa -nameOnly -s
4. Motif Enrichment Analysis with HOMER
De novo motif discovery was performed using HOMER (findMotifsGenome.pl, v4.11) with parameters optimized for bacterial restriction-modification systems:
| Parameter | Value | Rationale |
|---|---|---|
-len |
4,6,8,10 |
Bacterial R-M recognition sites are typically 4–8 bp palindromes; 10 bp captures complex motifs |
-size |
50 |
Matches the ±50 bp flanking window used for extraction |
-mask |
true |
Masks low-complexity/repetitive regions to reduce spurious motifs |
-S |
10 |
Optimizes 10 putative motifs per dataset to balance sensitivity and specificity |
-noknown |
true |
Focuses on de novo discovery; known motif scanning performed post-hoc via REBASE |
-useNewBg |
true |
Uses HOMER’s native GC-matched background generation (avoids bedtools shuffle instability) |
-p |
8–100 |
Parallel threads (adjusted per compute resource) |
Background sequences: HOMER automatically generated GC-matched background sequences from the reference genome, stratified by GC content to control for compositional bias.
5. REBASE Annotation & Methylase Classification
Enriched motifs were cross-referenced against the REBASE database (v605) using a custom Python script (annotate_motifs_rebase_fixed.py) to identify matches with known bacterial restriction-modification systems.
Matching Logic:
- IUPAC-aware comparison: Motifs containing degenerate bases (e.g.,
R=A/G,Y=C/T) were converted to regex patterns for flexible matching against REBASE recognition sequences. - Length-tolerant matching: Motifs were compared against REBASE entries of similar length (±1 bp) to accommodate minor variations.
- Methylation notation parsing: REBASE entries with methylation annotations (e.g.,
2(6)= N6-methyladenine at position 2) were parsed to distinguish:- REBASE Matches: Any enzyme (restriction endonuclease or methyltransferase) with a matching recognition sequence.
- Methylase Hits: Entries where the enzyme name starts with
M.(indicating a methyltransferase) AND the methylation notation matches the modification type:(4)= N4-methylcytosine → 4mC(5)= 5-methylcytosine → 5mC(6)= N6-methyladenine → 6mA
Output Classification:
| Category | Definition | Example |
|---|---|---|
| No REBASE match | Motif not found in REBASE | Likely a transcription factor binding site or novel methylase |
| REBASE match (restriction enzyme only) | Matches a restriction endonuclease but not its cognate methylase | Coincidental sequence similarity; methylation likely from orphan methylase |
| Methylase hit | Matches a methyltransferase with correct modification type notation | High-confidence candidate for the enzyme catalyzing the observed modification |
6. Output File Structure
Final results were compiled into Excel workbooks with the following sheets per sample:
| Sheet Name | Content | Source File |
|---|---|---|
4mC_CG |
4mC sites in CG context | 4mC_CG_filtered.bed + REBASE annotation |
4mC_nonCG |
4mC sites in non-CG context | 4mC_nonCG_filtered.bed + REBASE annotation |
5mC_CG |
5mC sites in CG context | 5mC_CG_filtered.bed + REBASE annotation |
5mC_nonCG |
5mC sites in non-CG context | 5mC_nonCG_filtered.bed + REBASE annotation |
6mA |
All 6mA sites | 6mA_raw_filtered.bed + REBASE annotation |
Each sheet includes:
- Genomic coordinates and modification metrics (from
modkit) - HOMER motif enrichment statistics (p-value, motif sequence)
- REBASE annotation (matched enzymes, methylase status, methylation notation)
7. Software & Database Versions
| Tool/Database | Version | Purpose |
|---|---|---|
modkit |
0.3.1 | Base modification calling from nanopore signals |
bedtools |
2.31.1 | Sequence extraction and genomic interval operations |
HOMER |
4.11 | De novo motif discovery and enrichment analysis |
REBASE |
605 (Apr 2026) | Curated database of restriction enzymes and methyltransferases |
Python |
3.10+ | Custom annotation and data integration scripts |
pandas/openpyxl |
Latest | Excel workbook generation |
🔑 Key Clarifications
- Flanking window:
-f 50= ±50 bp (101 bp total), not ±25 bp. - Modification rate: Column 11 (
blockSizes) = percent modified (0–100), not a probability score. - Methylase identification: Requires BOTH
M.prefix in enzyme name AND matching methylation notation(4)/(5)/(6). - Context separation: CG vs. non-CG filtering is applied ONLY to cytosine modifications (4mC/5mC); 6mA analysis includes all contexts.
This pipeline enables systematic identification of sequence motifs associated with bacterial DNA methylation and links them to known enzymatic machinery via REBASE annotation.