Daily Archives: October 27, 2025

Comprehensive RNA-seq Time-Course Analysis Pipeline for Bacterial Stress-Related Genes (Data_Michelle_RNAseq_2025/README_oxidoreductases)

1. Prepare `samples.tsv` from the samplesheet

vim samples.tsv

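Example contents (tab-separated; the batch column appears to be left empty here, which is why each row shows one value fewer than the header):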

sample  condition   time_h  batch   genotype    medium  replicate
WT_MH_2h_1  WT_MH   2       WT  MH  1
WT_MH_2h_2  WT_MH   2       WT  MH  2
WT_MH_2h_3  WT_MH   2       WT  MH  3
WT_MH_4h_1  WT_MH   4       WT  MH  1
WT_MH_4h_2  WT_MH   4       WT  MH  2
WT_MH_4h_3  WT_MH   4       WT  MH  3
WT_MH_18h_1 WT_MH   18      WT  MH  1
WT_MH_18h_2 WT_MH   18      WT  MH  2
WT_MH_18h_3 WT_MH   18      WT  MH  3
deltasbp_MH_2h_1    deltasbp_MH 2       deltasbp    MH  1
deltasbp_MH_2h_2    deltasbp_MH 2       deltasbp    MH  2
deltasbp_MH_2h_3    deltasbp_MH 2       deltasbp    MH  3
...
deltasbp_TSB_18h_3  deltasbp_TSB    18      deltasbp    TSB 3
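
Since the sample names follow a regular pattern, the table can also be generated rather than typed by hand. A minimal sketch in Python, assuming every name has the form genotype_medium_<time>h_replicate and that the batch column stays empty; the helper names `parse_sample` and `samples_tsv` are made up for illustration:

```python
import csv
import io

COLUMNS = ["sample", "condition", "time_h", "batch",
           "genotype", "medium", "replicate"]

def parse_sample(name):
    """Split e.g. 'deltasbp_TSB_18h_3' into its metadata fields."""
    genotype, medium, time_label, replicate = name.split("_")
    return {
        "sample": name,
        "condition": f"{genotype}_{medium}",
        "time_h": int(time_label.rstrip("h")),
        "batch": "",  # assumption: no batch information is recorded
        "genotype": genotype,
        "medium": medium,
        "replicate": int(replicate),
    }

def samples_tsv(names):
    """Render the metadata table as tab-separated text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for name in names:
        writer.writerow(parse_sample(name))
    return out.getvalue()

print(samples_tsv(["WT_MH_2h_1", "deltasbp_TSB_18h_3"]))
```

A name that does not split into exactly four underscore-separated parts raises a ValueError, which is a cheap way to catch typos in the samplesheet.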

2. Reformat `counts.tsv` from STAR/Salmon

cp ./results/star_salmon/gene_raw_counts.csv counts.tsv

Clean file manually:

Remove all double quotes ("), strip the gene- prefix from the first column, and replace the delimiters with tabs.
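
The same manual cleanup can be scripted. A minimal sketch in Python, assuming the input is comma-separated with optional double quotes; the function name `clean_counts`, the gene ID and the column name in the example are hypothetical:

```python
import csv
import io

def clean_counts(csv_text):
    """Convert comma-separated counts to TSV and strip the 'gene-' prefix."""
    reader = csv.reader(io.StringIO(csv_text))  # csv.reader drops the double quotes
    lines = []
    for i, row in enumerate(reader):
        if not row:
            continue  # skip blank lines
        if i > 0 and row[0].startswith("gene-"):
            row[0] = row[0][len("gene-"):]
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"

# Hypothetical two-line input, only to show the transformation
example = '"gene_id","WT_MH_2h_r1"\n"gene-ABC_00001","12"\n'
print(clean_counts(example))
```

For a real file, read the text with `open("counts.tsv").read()` and write the result back to a new file rather than overwriting the original.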

Clean file in R:

cts <- read.delim("counts.tsv", check.names = FALSE)
names(cts)[1] <- "gene_id"
if ("gene_name" %in% names(cts)) cts$gene_name <- NULL
names(cts) <- sub("_r([0-9]+)$", "_\\1", names(cts))
write.table(cts, file="counts_fixed.tsv", sep="\t", quote=FALSE, row.names=FALSE)
smp <- read.delim("samples.tsv", check.names = FALSE)
setdiff(colnames(cts)[-1], smp$sample)
setdiff(smp$sample, colnames(cts)[-1])

3. Run the R time-course analysis

Rscript rna_timecourse_bacteria.R \
  --counts counts_fixed.tsv \
  --samples samples.tsv \
  --condition_col condition \
  --time_col time_h \
  --emapper ~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt \
  --volcano_csvs contrasts/ctrl_vs_treat.csv \
  --outdir results_bacteria

4. Summarize and convert results

~/Tools/csv2xls-0.4/csv_to_xls.py oxidoreductases_time_trends.tsv stress_genes_time_trends.tsv -d$'\t' -o oxidoreductases_and_stress_genes_time_trends.xls

5. Key summary and reporting

PCA plots, time trends, and top decreasing/increasing genes by condition are summarized. For further filtering, decreasing genes can be extracted by filtering direction == "decreasing" in the results tables.
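
The filtering step can be sketched in a few lines of Python. The column names (gene, condition, direction) are the ones the R script writes; the rows in `EXAMPLE` are made-up values for illustration, and `decreasing_genes` is a hypothetical helper name:

```python
import csv
import io

def decreasing_genes(tsv_text, condition=None):
    """Return genes flagged 'decreasing', optionally for one condition only."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["gene"]
        for row in rows
        if row["direction"] == "decreasing"
        and (condition is None or row["condition"] == condition)
    ]

# Made-up rows in the layout of oxidoreductases_time_trends.tsv
EXAMPLE = "\n".join("\t".join(row) for row in [
    ["gene", "condition", "beta", "p", "padj", "direction"],
    ["geneA", "WT_MH", "-0.21", "0.0004", "0.003", "decreasing"],
    ["geneB", "WT_MH", "0.15", "0.0100", "0.040", "increasing"],
    ["geneA", "deltasbp_MH", "-0.02", "0.6000", "0.800", "ns"],
    ["geneC", "deltasbp_MH", "-0.30", "0.0001", "0.001", "decreasing"],
])

print(decreasing_genes(EXAMPLE))
print(decreasing_genes(EXAMPLE, condition="deltasbp_MH"))
```

Extra columns in the real tables (GOs, EC, Description and so on) are ignored by this filter.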

6. Full main R script: `rna_timecourse_bacteria.R`

#!/usr/bin/env Rscript

# ===============================
# RNA-seq time-course helper (Bacteria) — DESeq2
# Uses eggNOG emapper annotations (GOs & EC) for oxidoreductases + stress genes
# ===============================
# Example:
#   Rscript rna_timecourse_bacteria.R \
#     --counts counts.tsv \
#     --samples samples.tsv \
#     --condition_col condition \
#     --time_col time_h \
#     --batch_col batch \
#     --emapper ~/DATA/Data_Michelle_RNAseq_2025/eggnog_out.emapper.annotations.txt \
#     --volcano_csvs contrasts/ctrl_vs_treat.csv \
#     --outdir results_bacteria
#
# Assumptions:
#   - counts.tsv: first column gene_id matching 'query' in emapper file
#   - samples.tsv: columns 'sample', condition/time (numeric), optional batch
#
suppressPackageStartupMessages({
  library(optparse)
  library(DESeq2)
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(ComplexHeatmap)
  library(circlize)
  library(ggplot2)
  library(purrr)
})

opt_list <- list(
  make_option("--counts", type="character", help="Counts matrix TSV (genes x samples). First col = gene_id"),
  make_option("--samples", type="character", help="Sample metadata TSV with 'sample' column."),
  make_option("--gene_id_col", type="character", default="gene_id", help="Counts gene id column name."),
  make_option("--condition_col", type="character", default="condition", help="Condition column in samples."),
  make_option("--time_col", type="character", default="time", help="Numeric time column in samples."),
  make_option("--batch_col", type="character", default=NULL, help="Optional batch column."),
  make_option("--emapper", type="character", help="eggNOG emapper annotations file (tab-delimited)."),
  make_option("--volcano_csvs", type="character", default=NULL, help="Comma-separated volcano CSV/TSV files (must have a 'gene' column)."),
  make_option("--outdir", type="character", default="results_bacteria", help="Output directory.")
)

opt <- parse_args(OptionParser(option_list = opt_list))
dir.create(opt$outdir, showWarnings = FALSE, recursive = TRUE)

message("[1/7] Load data")
counts <- read_tsv(opt$counts, col_types = cols())
stopifnot(opt$gene_id_col %in% colnames(counts))
counts <- as.data.frame(counts)
rownames(counts) <- counts[[opt$gene_id_col]]
counts[[opt$gene_id_col]] <- NULL

samples <- read_tsv(opt$samples, col_types = cols()) %>%
  filter(sample %in% colnames(counts))
samples <- as.data.frame(samples)
rownames(samples) <- samples$sample
samples$sample <- NULL
samples[[opt$time_col]] <- as.numeric(samples[[opt$time_col]])

# --- Coerce counts to numeric and validate ---
# Remove any commas and coerce to numeric
counts[] <- lapply(counts, function(x) {
  if (is.character(x)) x <- gsub(",", "", x, fixed = TRUE)
  suppressWarnings(as.numeric(x))
})
# Report any NA introduced by coercion
na_cols <- vapply(counts, function(x) any(is.na(x)), logical(1))
if (any(na_cols)) {
  bad <- names(which(na_cols))
  message("WARNING: Non-numeric values detected in count columns; introduced NAs in: ", paste(bad, collapse=", "))
  # Replace NA with 0 (safe fallback) and continue
  counts[bad] <- lapply(counts[bad], function(x) { x[is.na(x)] <- 0; x })
}

# Ensure samples and counts columns align 1:1 and reorder counts accordingly
missing_in_samples <- setdiff(colnames(counts), rownames(samples))
missing_in_counts  <- setdiff(rownames(samples), colnames(counts))
if (length(missing_in_samples) > 0) {
  stop("These count columns have no matching row in samples.tsv: ", paste(missing_in_samples, collapse=", "))
}
if (length(missing_in_counts) > 0) {
  stop("These samples.tsv rows have no matching column in counts.tsv: ", paste(missing_in_counts, collapse=", "))
}
counts <- counts[, rownames(samples), drop=FALSE]
# Finally, round to integers as required by DESeq2
counts <- round(as.matrix(counts))

message("[2/7] DESeq2 model (time-course)")
design_terms <- c()
if (!is.null(opt$batch_col) && opt$batch_col %in% colnames(samples)) {
  design_terms <- c(design_terms, opt$batch_col)
}
design_terms <- c(design_terms, opt$condition_col, opt$time_col, paste0(opt$condition_col, ":", opt$time_col))
design_formula <- as.formula(paste("~", paste(design_terms, collapse=" + ")))

dds <- DESeqDataSetFromMatrix(countData = round(as.matrix(counts)),
                              colData = samples,
                              design = design_formula)
dds <- dds[rowSums(counts(dds)) > 1, ]
dds <- DESeq(dds, test="LRT",
             full = design_formula,
             reduced = as.formula(paste("~", paste(setdiff(design_terms, paste0(opt$condition_col, ":", opt$time_col)), collapse=" + "))))

vsd <- vst(dds, blind=FALSE)
vsd_mat <- assay(vsd)

message("[3/7] Parse emapper for GO/EC (oxidoreductases & stress genes)")
stopifnot(!is.null(opt$emapper))
emap <- read_tsv(opt$emapper, comment = "#", col_types = cols(.default = "c"))
# Expecting columns: query, GOs, EC, Description, Preferred_name, etc.
emap <- emap %>%
  transmute(gene = query,
            GOs = ifelse(is.na(GOs), "", GOs),
            EC = ifelse(is.na(EC), "", EC),
            Description = ifelse(is.na(Description), "", Description),
            Preferred_name = ifelse(is.na(Preferred_name), "", Preferred_name))
emap <- emap %>% distinct(gene, .keep_all = TRUE)

# Flags:
# 1) oxidoreductase: EC starts with "1." OR GO includes GO:0016491
is_ox_by_ec <- grepl("^1\\.", emap$EC)
is_ox_by_go <- grepl("\\bGO:0016491\\b", emap$GOs)
emap$is_oxidoreductase <- is_ox_by_ec | is_ox_by_go

# 2) stress-related: search for stress GO ids in GOs
stress_gos <- c("GO:0006950","GO:0033554","GO:0006979","GO:0006974","GO:0009408","GO:0009266") # response to stress; cellular response to stress; response to oxidative stress; DNA damage response; response to heat; response to temperature stimulus
re_pat <- paste(stress_gos, collapse="|")
emap$is_stress <- grepl(re_pat, emap$GOs)

write_tsv(emap, file.path(opt$outdir, "emapper_flags.tsv"))

message("[4/7] Per-gene time slopes within each condition")
cond_levels <- unique(samples[[opt$condition_col]])
slope_summaries <- list()

for (cond in cond_levels) {
  sel <- samples[[opt$condition_col]] == cond
  mat <- vsd_mat[, sel, drop=FALSE]
  tvec <- samples[[opt$time_col]][sel]

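  slopes <- apply(mat, 1, function(y) {
    # Return NA for both values if the fit fails or the slope cannot be
    # estimated (e.g. only one time point, so there is no "tvec" row)
    tryCatch({
      co <- summary(lm(y ~ tvec))$coefficients
      c(beta = unname(co["tvec", "Estimate"]), p = unname(co["tvec", "Pr(>|t|)"]))
    }, error = function(e) c(beta = NA_real_, p = NA_real_))
  })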
  slopes <- t(slopes)
  df <- as.data.frame(slopes)
  df$gene <- rownames(mat)
  df$condition <- cond
  slope_summaries[[cond]] <- df
}

slope_df <- bind_rows(slope_summaries) %>%
  mutate(padj = p.adjust(p, method="BH")) %>%
  relocate(gene, condition, beta, p, padj)

# ---- Robust join to emapper ----
# clean IDs: trim; strip version suffixes like ".1"
emap$gene <- trimws(emap$gene)
emap$gene_clean <- sub("\\.\\d+$", "", emap$gene)

slope_df$gene <- trimws(slope_df$gene)
slope_df$gene_clean <- sub("\\.\\d+$", "", slope_df$gene)

# join on cleaned key
slope_df <- slope_df %>%
  dplyr::left_join(
    emap %>% dplyr::select(gene_clean, GOs, EC, Description, Preferred_name,
                           is_oxidoreductase, is_stress),
    by = "gene_clean"
  )

# recompute flags from EC/GOs when missing
slope_df <- slope_df %>%
  mutate(
    is_oxidoreductase = ifelse(
      is.na(is_oxidoreductase),
      (!is.na(EC) & grepl("^1\\.", EC)) | (!is.na(GOs) & grepl("\\bGO:0016491\\b", GOs)),
      is_oxidoreductase
    ),
    is_stress = ifelse(
      is.na(is_stress),
      (!is.na(GOs) & grepl("GO:0006950|GO:0033554|GO:0006979|GO:0006974|GO:0009408|GO:0009266", GOs)),
      is_stress
    )
  ) %>%
  dplyr::select(-gene_clean)

# write full slopes
readr::write_tsv(slope_df, file.path(opt$outdir, "time_slopes_by_condition.tsv"))

# summaries
ox_summary <- slope_df %>%
  dplyr::filter(!is.na(is_oxidoreductase) & is_oxidoreductase) %>%
  dplyr::mutate(direction = dplyr::case_when(
    beta < 0 & padj < 0.05 ~ "decreasing",
    beta > 0 & padj < 0.05 ~ "increasing",
    TRUE ~ "ns"
  )) %>%
  dplyr::arrange(padj, beta)
readr::write_tsv(ox_summary, file.path(opt$outdir, "oxidoreductases_time_trends.tsv"))

stress_summary <- slope_df %>%
  dplyr::filter(!is.na(is_stress) & is_stress) %>%
  dplyr::mutate(direction = dplyr::case_when(
    beta < 0 & padj < 0.05 ~ "decreasing",
    beta > 0 & padj < 0.05 ~ "increasing",
    TRUE ~ "ns"
  )) %>%
  dplyr::arrange(padj, beta)
readr::write_tsv(stress_summary, file.path(opt$outdir, "stress_genes_time_trends.tsv"))

message("[5/7] Heatmaps from volcano gene lists (plus per-gene)")
make_heatmap <- function(glist, tag, kmeans_rows=1) {  # 1 = no k-means row split (the ComplexHeatmap default)
  sub <- vsd_mat[rownames(vsd_mat) %in% glist, , drop=FALSE]
  if (nrow(sub) == 0) {
    message("No overlap for ", tag)
    return(invisible(NULL))
  }
  z <- t(scale(t(sub)))
  ha_col <- HeatmapAnnotation(
    df = data.frame(
      condition = samples[[opt$condition_col]],
      time = samples[[opt$time_col]]
    )
  )
  png(file.path(opt$outdir, paste0("heatmap_", tag, ".png")), width=1400, height=1000, res=140)
  print(Heatmap(z, name="z", top_annotation = ha_col,
                clustering_distance_rows = "euclidean",
                clustering_method_rows = "ward.D2",
                show_row_names = FALSE, show_column_names = TRUE,
                row_km = kmeans_rows))
  dev.off()
}

make_single_gene_heatmaps <- function(glist, tag) {
  for (g in glist) {
    if (!(g %in% rownames(vsd_mat))) next
    z <- t(scale(t(vsd_mat[g,,drop=FALSE])))
    png(file.path(opt$outdir, paste0("heatmap_", tag, "_", g, ".png")), width=1200, height=400, res=150)
    print(Heatmap(z, name="z", cluster_rows=FALSE, cluster_columns=FALSE,
                  show_row_names=TRUE, show_column_names=TRUE))
    dev.off()
  }
}

if (!is.null(opt$volcano_csvs) && nzchar(opt$volcano_csvs)) {
  files <- str_split(opt$volcano_csvs, ",")[[1]] %>% trimws()
  for (f in files) {
    df <- tryCatch({
      if (grepl("\\.tsv$", f, ignore.case = TRUE)) read_tsv(f, col_types=cols())
      else read_csv(f, col_types=cols())
    }, error=function(e) NULL)
    if (is.null(df) || !("gene" %in% names(df))) next
    genes <- df$gene %>% unique()
    tag <- tools::file_path_sans_ext(basename(f))
    make_heatmap(genes, tag, kmeans_rows = 4)
    make_single_gene_heatmaps(genes, paste0(tag, "_single"))
  }
}

message("[6/7] PCA")
pca <- plotPCA(vsd, intgroup=c(opt$condition_col, opt$time_col), returnData = TRUE)
percentVar <- round(100 * attr(pca, "percentVar"))

# change 'deltasbp_*' to 'Δsbp_*' for legend labels
pca[[opt$condition_col]] <- gsub("^deltasbp", "Δsbp", pca[[opt$condition_col]])

p <- ggplot(pca, aes(PC1, PC2,
                     color = .data[[opt$condition_col]],
                     shape = factor(.data[[opt$time_col]]))) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  labs(color = "Condition", shape = "Factor(time_h)") +  # legend titles
  theme_bw()
ggsave(file.path(opt$outdir, "PCA_condition_time.png"), p, width=10, height=7, dpi=150)

message("[7/7] Summary")
summary_txt <- file.path(opt$outdir, "SUMMARY.txt")
sink(summary_txt)
cat("Bacterial time-course summary\n")
cat("Date:", as.character(Sys.time()), "\n\n")
cat("Conditions:", paste(unique(samples[[opt$condition_col]]), collapse=", "), "\n\n")
cat("Top oxidoreductases decreasing over time:\n")
print(ox_summary %>% filter(direction=="decreasing") %>% head(20))
cat("\nTop stress genes decreasing over time:\n")
print(stress_summary %>% filter(direction=="decreasing") %>% head(20))
sink()
message("Done -> ", opt$outdir)
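
The flagging logic of step [3/7] can be tried out on its own. A minimal sketch in Python, assuming the emapper EC and GOs fields are comma-separated lists with "-" for missing values. Note one deliberate difference: the script's `grepl("^1\\.", EC)` only tests the start of the EC field, while this sketch checks every EC number in the list, so it is slightly broader.

```python
STRESS_GOS = frozenset([
    "GO:0006950", "GO:0033554", "GO:0006979",
    "GO:0006974", "GO:0009408", "GO:0009266",
])
OXIDOREDUCTASE_GO = "GO:0016491"

def split_field(field):
    """Split a comma-separated annotation field; '-' or '' means no annotation."""
    return [item.strip() for item in field.split(",") if item.strip() not in ("", "-")]

def is_oxidoreductase(ec_field, gos_field):
    """EC class 1 (any EC number in the list) or the oxidoreductase GO term."""
    by_ec = any(ec.startswith("1.") for ec in split_field(ec_field))
    by_go = OXIDOREDUCTASE_GO in split_field(gos_field)
    return by_ec or by_go

def is_stress(gos_field):
    """True if any of the stress-related GO ids is annotated."""
    return not STRESS_GOS.isdisjoint(split_field(gos_field))

print(is_oxidoreductase("2.7.1.1,1.8.1.4", "-"))   # second EC number is class 1
print(is_stress("GO:0005515,GO:0006979"))
```

Matching whole GO ids from the split list also avoids partial matches that an unanchored regular expression could produce.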

This code-rich summary provides a replicable basis for advanced bacterial RNA-seq time-course analysis and reporting. All main code steps and script logic are retained for transparency and practical reuse.
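
For readers who want to check the direction calls by hand, here is a small Python sketch of the two steps behind them: Benjamini-Hochberg adjustment (the same idea as `p.adjust(p, method="BH")` in the script) and the padj < 0.05 rule. It assumes there are no missing p-values, which R handles separately; the function names are illustrative.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def direction(beta, padj, alpha=0.05):
    """Classify a slope as in the script's case_when()."""
    if padj < alpha and beta < 0:
        return "decreasing"
    if padj < alpha and beta > 0:
        return "increasing"
    return "ns"

pvals = [0.01, 0.04, 0.03, 0.5]
betas = [-0.2, 0.3, -0.1, 0.05]
for beta, padj in zip(betas, bh_adjust(pvals)):
    print(round(padj, 4), direction(beta, padj))
```

Because the script pools the p-values of all conditions before adjusting, the adjusted values depend on how many conditions are analysed together.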

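Tenergy / Dignics and Butterfly Timo Boll ALC: The Ultimate Practical Guide

1) Quick conclusions (for those in a hurry)

Steady FH looping: Dignics 09C (black) 2.1 or Tenergy 05 (red) 2.1
Aggressive BH fast blocking/counter-looping: Dignics 64 (1.9) or Tenergy 64 (1.9)
One rubber for everything, low effort: Tenergy 80 (FH 2.1 / BH 1.9)
BH that needs forgiveness and easy lifting: Tenergy 64 FX (1.9 or 2.1)
Why T64 is rarely recommended for the forehand: low, flat trajectory; lively in the short game; lower margin when lifting backspin. The forehand usually needs more grip plus arc (09C/05/80 fit better).
Red/black colour: the rules only require one black side and one non-black side. Common experience: black sheets feel tackier and more solid (suits FH brushing), red sheets feel crisper and more direct (common on the BH or for fast, direct styles).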

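2) TB ALC blade overview

Type: OFF / OFF- (offensive), ALC fibre (Arylate-Carbon)
Plies (classic ALC layup): Koto – ALC – Limba – Kiri (core) – Limba – ALC – Koto
Typical specs: thickness ~5.7–5.9 mm; weight ~86–90 g; head size ~157×150 mm
Feel: large sweet spot, low vibration, clean and slightly "muted" contact, fast but controllable; medium arc on loops; stable in blocking and counter-hitting, with a linear response.

Compared with similar blades

Viscaria: slightly softer overall, with a slightly higher arc; TB ALC is more direct, with a touch less arc.
TB ZLC: faster and crisper, with less dwell and forgiveness.
TB ZLF: softer with more dwell, but lower top speed.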

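3) Rubber quick-reference cards

Dignics 09C (D09C)

Characteristics: slightly tacky topsheet; strongest grip and deepest dwell; medium-high arc; most stable when lifting backspin; not bouncy close to the table, with plenty of power in reserve from mid and long distance.
Role: the core of a modern FH looping game; on the TB ALC it tempers the blade's crispness and adds arc and control.

Dignics 64 (D64)

Characteristics: flat trajectory, fast rebound, uses incoming pace well; medium-low arc, high exit speed; slightly lively in the short game.
Role: dominant BH fast blocking/counter-hitting; very comfortable on the BH of a TB ALC.

Tenergy 05 (T05)

Characteristics: high-friction pimple geometry; strong grip, medium-high arc, steadier in the short game, linear response.
Role: all-round FH looping/loop-driving; a mature, durable classic pairing on the TB ALC.

Tenergy 80 (T80)

Characteristics: the balance point between 05 and 64; medium arc, balanced speed and control.
Role: works on either side; a simplified choice for players who want one rubber for the whole game.

Tenergy 64 (T64)

Characteristics: flat exit, medium-low arc, strong top speed and penetration; livelier in the short game.
Role: geared toward BH fast blocking/counter-looping; an option on the FH for the few players who prefer flat penetration.

Tenergy 64 FX (T64 FX)

Characteristics: softer-sponge version; easy to lift, forgiving, easier at low to medium power; livelier in the short game, slightly lower top speed.
Role: the forgiving, easy-to-handle BH route; friendly to beginner-to-intermediate players and those with less power.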

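4) Key differences at a glance

Model	Grip/dwell	Arc height	Speed/penetration	Short-game control	Backspin-lifting margin	Best side
D09C	Highest	Medium-high	Strong in mid/late phase	Steady	Best	FH
T05	High	Medium-high	Medium-high	Steady	Very good	FH / BH (control)
T80	Medium-high	Medium	Medium-high	Stable	Good	FH / BH
D64	Medium	Medium-low	Highest	Moderate (lively)	Moderate	BH
T64	Medium	Low-medium	Highest	Moderate (lively)	Lower	BH / FH for a few
T64 FX	Medium-high (soft)	Medium	High (easier at low-medium power)	On the lively side	Higher	BH, beginner-intermediate

Note: in the table, "lively" in the short game means a high rebound and more sensitivity to small strokes; it calls for finer touch.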

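5) Forehand/backhand pairings (by playing style)

A. Modern FH looping + BH fast blocking/counter-looping (mainstream)

FH: D09C 2.1 (black) / T05 2.1 (red) !!!!
BH: D64 1.9 (red) / T64 1.9 / T80 1.9 (black) !!!!

B. Close-to-table counter-hitting that borrows pace + quick transition to attack

FH: T05 2.1 / T80 2.1
BH: T64 1.9 / T64 FX 1.9 (for forgiveness)

C. Mid/long-distance power looping

FH: D09C 2.1
BH: D64 2.1 or T80 2.1 (depending on how much stability you need)

D. One low-effort set for everything

FH: T80 2.1
BH: T80 1.9

E. Sticking with a flat, penetrating FH

FH: T64 1.9 (to keep the short game under control)
BH: T80 1.9 / T64 FX 1.9 (forgiveness)

Common selection logic

Tacky/Chinese-style rubber (e.g. Hurricane, 09C) on the forehand → most players choose black; black sheets are usually slightly tackier, more solid and firmer, which suits powerful brushing and forward-driving loops.
Japanese/German rubber (e.g. Tenergy/Dignics/ESN) on the forehand → many players choose red; red sheets generally feel a little crisper and more direct, with a slightly higher arc and a lighter, quicker pace.

For your TB ALC specifically:

For grip plus arc: forehand Dignics 09C black 2.1, backhand Dignics 64 red 1.9.
For a direct, fast game: forehand Tenergy 05 red 2.1, backhand Tenergy 80 black 1.9.
For more stability and control: use 1.9 on both sides.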

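6) Choosing thickness and colour (red/black)

Thickness (1.9 vs 2.1)

1.9 mm: more stable, better short-game control, higher success rate when opening; suits the backhand or a control-first game.
2.1 mm: maximum power and top-end speed; suits the forehand or players chasing power loops.

Colour (rules and habits)

Rule: one side black, the other non-black (usually red); there is no rule that the forehand must be red or black.
Experience: black sheets tend to be tackier and more solid (FH brushing); red sheets are crisper and more direct (BH/fast attack).
Recommendation: FH black / BH red (if you choose grippy rubbers such as 09C/T05); if you use 64/64 FX on the BH, either colour works.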

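7) Chemistry with the TB ALC: feel and caveats

The TB ALC gives a clean, fast rebound; paired with D09C/T05 it gains a more stable arc and more margin when lifting backspin;
stacked with D64/T64 it becomes even crisper: the BH feels great, but the short game is livelier;
if you also put 64 on the FH: 1.9 thickness is advised and the short game needs extra finesse; alternatively, use a softer blade to calm things down.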

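8) Short game, serve/receive and rally pointers

Short game: D09C/T05/T80 make short control and short pushes easier; 64/64 FX need a softer touch and smaller changes in bat angle, with more emphasis on brushing.
Lifting backspin: D09C is the most stable, T05 next; with 64/64 FX watch the brushing angle and how deeply the ball sinks into the rubber.
Counter-looping: 64/D64 give high exit speed and strong flat penetration; T05/T80 give a safer arc window.
Borrowing pace: the 64 family has the edge; D09C needs active power and depends more on the quality of your stroke.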

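9) Maintenance and replacement intervals

Cleaning: after each session wipe gently with a slightly damp sponge or a dedicated cleaner and apply a protective film; avoid hard rubbing on the slightly tacky (09C) surface.
Replacement: with frequent training (>3 sessions/week) replace one side about every 2–3 months; at normal intensity, every 3–6 months.
Gluing: use water-based, VOC-free glue; avoid illegal modifications (such as prohibited tackifiers or boosters).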

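10) Frequently asked questions (FAQ)

Q1: Why do many people advise against T64 on the forehand?
A: Flat, low arc; lively in the short game; low margin when lifting backspin; stacked on a TB ALC it gets even more jumpy. Most forehands need grip and arc (09C/05/80).

Q2: D64 vs T64?
A: D64 has slightly better grip and forgiveness while keeping the flat trajectory and high exit speed; T64 has the more classic "64" character: sharper, but more demanding in the short game.

Q3: T64 vs T64 FX?
A: FX is softer, easier at low to medium power, and more forgiving; its top speed and short-game stability fall short of T64. For a beginner-to-intermediate BH, lean toward FX.

Q4: Can T80 be used on both sides?
A: Yes. It sits between 05 and 64, with balanced speed, arc and control, and is the low-effort choice.

Q5: Does red vs black affect performance?
A: Aside from formulation and batch differences, the common experience is that black is slightly tackier and red crisper; choose by feel and need, as there is no hard rule.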

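11) Mini glossary

Grip/dwell: the sense of the ball staying on and rubbing against the rubber. The stronger it is, the more it helps backspin lifting and loop stability.
Lively in the short game: high rebound sensitivity at low power; the ball pops up easily, making control harder.
Arc height: how curved the trajectory is; a higher arc gives a larger safety window.
Flat penetration: a flat, fast ball that kicks forward after the bounce.
Forgiveness: how tolerant the rubber is, i.e. whether the ball still lands and stays effective when the angle or power is slightly off.

Summary

For the FH choose grip and arc (D09C/T05); for the BH choose flat speed and borrowed pace (D64/T64/T80).
T80 is the all-purpose answer that is extreme on neither side; T64 FX gives the BH ease and forgiveness.
The TB ALC is "clean and linear" in character; with sensible rubber pairing it covers both the short game and rallies.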

ONT Methylation Analysis — Comprehensive Summary

1) What are 5‑mC, 6‑mA, 4‑mC?

5‑mC (5‑methylcytosine): methyl group on cytosine C5 carbon. In eukaryotes strongly linked to gene regulation (CpG), chromatin state, imprinting. Also present in some bacteria (e.g., Dcm at CCWGG).
6‑mA (N6‑methyladenine): methyl on adenine N6. Very common in bacteria/archaea (e.g., Dam at GATC), functions in restriction–modification (R–M), mismatch repair, replication control, and gene regulation.
4‑mC (N4‑methylcytosine): methyl on cytosine N4, mostly in bacteria/archaea (R–M and regulation).

Coverage guidance (ONT direct detection):

≥ 10× for 5‑mC calling/quantification.
≥ 50× for 6‑mA and 4‑mC (signals are weaker; models need depth).
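
A quick way to sanity-check these thresholds before ordering or analysing a run is to estimate mean depth as total sequenced bases divided by genome size. A rough Python sketch; the thresholds are the ones quoted above, while the genome size and yield in the example are hypothetical numbers:

```python
# Minimum mean depth per modification, taken from the guidance above
MIN_DEPTH = {"5mC": 10, "6mA": 50, "4mC": 50}

def mean_coverage(total_bases, genome_size):
    """Average depth if reads were spread evenly over the genome."""
    return total_bases / genome_size

def callable_mods(total_bases, genome_size):
    """Which modifications reach their minimum mean depth?"""
    depth = mean_coverage(total_bases, genome_size)
    return {mod: depth >= needed for mod, needed in MIN_DEPTH.items()}

# Hypothetical example: 100 Mb of reads on a 2.8 Mb bacterial genome (~36x)
print(round(mean_coverage(100_000_000, 2_800_000), 1))
print(callable_mods(100_000_000, 2_800_000))
```

Mean depth is only a first estimate; per-site depth varies along the genome, so the per-site coverage in the final tables should still be checked.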

2) How ONT detects methylation (no chemical conversion)

ONT does not convert bases (unlike bisulfite sequencing which converts un‑methylated C → U → read as T). ONT reads remain A/C/G/T.
ONT measures ionic current while DNA k‑mers pass the pore. Modified bases (5‑mC/6‑mA/4‑mC) slightly shift current distributions.
A modified‑base basecaller (now Dorado; historically Guppy+Remora) decodes those shifts and writes methylation annotations into aligned BAM/CRAM as MM/ML tags:
- MM: the modified base types and their per‑read positions (stored as skip counts along the read).
- ML: the matching per‑position modification probabilities, scaled to 0–255.
Downstream tools (e.g., modkit, methylartist, nf‑core/methylong) summarize per‑site/per‑region methylation and export BED/bedGraph/bigWig for visualization/statistics.
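
To make the tag layout concrete, here is a minimal Python sketch that decodes an MM string and ML values by hand. It is a simplification: it handles one modification code per MM entry and forward-strand reads only, and the read sequence and tag values in the example are invented. In practice modkit or a BAM library does this for you.

```python
import re

# base, strand, modification code, optional '.' or '?' flag
MM_HEAD = re.compile(r"([ACGTUN])([+-])([a-z]+|\d+)([.?]?)")

def parse_mm(mm_value):
    """Parse e.g. 'C+m?,1,0;A+a,0;' into (base, strand, code, skips) tuples."""
    entries = []
    for part in mm_value.strip().split(";"):
        if not part:
            continue
        head, *skips = part.split(",")
        match = MM_HEAD.fullmatch(head)
        if match is None:
            raise ValueError(f"unrecognised MM entry: {part!r}")
        base, strand, code, _flag = match.groups()
        entries.append((base, strand, code, [int(s) for s in skips]))
    return entries

def modified_positions(seq, base, skips):
    """Turn skip counts into 0-based read positions of the modified bases."""
    occurrences = [i for i, b in enumerate(seq) if b == base]
    positions, ordinal = [], -1
    for skip in skips:
        ordinal += skip + 1  # skip this many occurrences of `base`, take the next
        positions.append(occurrences[ordinal])
    return positions

def ml_to_probability(value):
    """Map an ML byte (0-255) to the midpoint of its probability bin."""
    return (value + 0.5) / 256

seq = "ACGCCGTACGC"          # invented read
base, strand, code, skips = parse_mm("C+m,1,0;")[0]
for pos, ml in zip(modified_positions(seq, base, skips), [230, 12]):
    print(pos, code, round(ml_to_probability(ml), 3))
```

Each ML value covers the probability range N/256 to (N+1)/256, so the midpoint is used here as a point estimate.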

Key contrast with bisulfite (BS‑seq):

BS‑seq chemically converts un‑methylated C to U (sequenced as T) → uses base changes to infer methylation.
ONT uses signal differences; no base letters change. Methylation is metadata in BAM tags, not edits in the sequence.
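
A toy Python illustration of this contrast, using an invented 8-base sequence: bisulfite treatment changes the letters that are read, whereas ONT leaves the sequence alone and carries the modification as separate metadata. The function names are illustrative only.

```python
def bisulfite_read(seq, methylated):
    """Unmethylated C is read as T; methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )

def ont_read(seq, methylated):
    """Sequence is unchanged; modified positions travel alongside it."""
    return seq, sorted(i for i in methylated if seq[i] == "C")

seq = "ACGTCCGA"      # invented sequence
methylated = {1}      # 0-based position of the methylated C

print(bisulfite_read(seq, methylated))
print(ont_read(seq, methylated))
```

With bisulfite data, methylation has to be inferred by comparing the converted read against the reference; with ONT the base letters can be compared to the reference as usual.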

3) Data types & what you need (modBAM vs “assembly” reads)

Previous ONT reads used for genome assembly are typically standard basecalls (A/C/G/T only) and lack MM/ML tags, so not suitable for methylation quantification.
For methylation analysis you need either:
1. Provider delivers aligned modified‑base BAM/CRAM (modBAM/CRAM) with MM/ML tags and indices (.bai/.crai).
2. Or you re‑basecall the raw signal (POD5/FAST5; FASTQ does not retain the signal) with a modified‑base Dorado model and then align to your reference (producing modBAM).

Reference genome requirement:

For aligned BAM, you (or the provider) must map to a reference FASTA. Keep the exact FASTA (and .fai) used for reproducibility and downstream summarization.

4) Practical workflow (bacteria)

A. Planning & sequencing

Decide targets: in bacteria prioritize 6‑mA/4‑mC; optionally 5‑mC (if Dcm‑like enzymes are present; Dam itself produces 6‑mA).
Coverage targets: ≥50× (6‑mA/4‑mC), ≥10× (5‑mC).
Ask provider to run Dorado (modified‑base model) and deliver aligned modBAM/CRAM with MM/ML tags.

B. Inputs/outputs to request from provider (e.g., Novogene)

Deliverables:
- modBAM/CRAM (aligned to our provided reference), with MM/ML tags + .bai/.crai.
- Optional per‑site tracks: BED/bedGraph/bigWig and a QC report.
Reference:
- Can we provide bacterial reference FASTA? Will they return the exact FASTA (.fai) used?
Models & modifications:
- Which Dorado model version and which mods (5‑mC, 6‑mA, 4‑mC) are called by default?
Unaligned data:
- If delivering unmapped uBAM/FASTQ, request that modified‑base calls (tags) are still included, or obtain the raw signal (POD5/FAST5) if re‑calling in‑house.

C. In‑house analysis (outline)

Align mod‑called reads to reference (if not already) → modBAM.
Run modkit to summarize per‑site methylation frequencies and export bedGraph/bigWig.
Use methylartist for regional plots, motif‑centric views, metaplots over features (promoters, operons, RND genes, etc.).
Integrate with other omics (RNA‑seq) by averaging methylation in promoter/operon windows and correlating with expression changes.
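
The window-averaging step of the integration can be sketched in Python as follows. The input layout is an assumption (a list of (position, fraction modified) pairs for one contig and half-open windows per feature), and the feature names and values are invented:

```python
from statistics import mean

def window_methylation(sites, windows):
    """Average per-site methylation within each named window.

    sites:   list of (position, fraction_modified) on one contig
    windows: dict of name -> (start, end), half-open interval [start, end)
    """
    result = {}
    for name, (start, end) in windows.items():
        values = [frac for pos, frac in sites if start <= pos < end]
        result[name] = mean(values) if values else None  # None = no covered site
    return result

# Invented per-site values and promoter windows
sites = [(100, 0.9), (150, 0.7), (400, 0.1)]
windows = {"geneA_promoter": (50, 200), "geneB_promoter": (300, 350)}
print(window_methylation(sites, windows))
```

The resulting per-window means can then be paired with the RNA-seq slopes or log fold changes of the same genes; weighting sites by their read depth is a reasonable refinement.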

5) nf‑core/methylong (pipeline overview)

Community Nextflow pipeline for ONT methylation. Typical features:
- Supports Dorado modified‑base calling (or consumes modBAM/CRAM).
- Performs alignment (e.g., minimap2) to your reference, keeps MM/ML tags.
- Generates per‑site/ per‑region summaries, tracks (bedGraph/bigWig), and QC.
Inputs: reads (FASTQ/FAST5) or modBAM + reference FASTA; sample sheet with metadata.
Outputs: modBAM/CRAM + indices, per‑site methylation tables, genome tracks, multiQC‑style reports.

(Exact CLI flags vary by version; coordinate with the provider or your compute environment.)

6) QC & caveats

Depth matters: 6‑mA/4‑mC need higher coverage than 5‑mC.
Model choice: Use the correct Dorado modified‑base model for your chemistry/flow cell and target modifications.
Reference fidelity: Use the same reference throughout (and document version).
BAM integrity: Verify MM/ML tags exist; confirm alignment header matches the provided FASTA.
Context effects: Methylation calling is k‑mer context‑dependent; some motifs are easier/harder.
Biological interpretation: In bacteria, methylation is often tied to R–M systems, replication, and gene regulation; interpret rates in motif/operon context, not only at single CpG‑style sites.

7) What to ask a provider (email checklist)

Will you deliver aligned modBAM/CRAM with MM/ML tags (+ index)?
Which modified bases are called (5‑mC, 6‑mA, 4‑mC)? Which Dorado model/version?
Do you require us to provide a bacterial reference FASTA for alignment? Will you return the exact reference used?
Can you also provide per‑site methylation tracks (bedGraph/bigWig) and a QC report?
What coverage will be achieved per sample (target ≥10× for 5‑mC; ≥50× for 6‑mA/4‑mC)?

8) Suggested minimal deliverables

modBAM/CRAM aligned to our provided reference (+ .bai/.crai).
Reference FASTA and .fai used in alignment/calling.
Per‑site tables (tsv) and tracks (bedGraph/bigWig).
Brief QC (coverage, fraction modified by motif, per‑site confidence).

9) Bacterial project recommendation (one‑liner)

For bacteria, profile 6‑mA (and 4‑mC) as primary targets (≥50×), optionally 5‑mC (≥10× if Dcm‑like activity expected), using Dorado modified‑base calling and aligned modBAM/CRAM with MM/ML tags; summarize with modkit/methylartist and integrate with RNA‑seq.

10) Handy pointers & checks (quick ref)

Check BAM has mods: samtools view mod.bam | head → look for MM:Z: and ML:B:C tags (without -h, so that head shows alignment records rather than header lines).
Confirm reference: samtools view -H mod.bam | grep '^@SQ' and keep the FASTA.
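Summarize (example modkit): modkit pileup mod.bam out.bed --ref ref.fa (the second positional argument is the output path, not the reference; see modkit pileup --help for the bedGraph and filtering options in your version).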
Visualize: Load bigWig/bedGraph in IGV/JBrowse; overlay RNA‑seq coverage/DE results.
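
The tag check can also be automated. A minimal Python sketch that inspects one SAM text line (as printed by samtools view) for MM and ML tags; the example record is invented and the function names are illustrative:

```python
def mod_tags(sam_line):
    """Return the MM/ML optional fields of one SAM alignment line."""
    if sam_line.startswith("@"):
        return {}  # header line, carries no per-read tags
    fields = sam_line.rstrip("\n").split("\t")
    found = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        tag, _type, value = field.split(":", 2)
        if tag in ("MM", "ML"):
            found[tag] = value
    return found

def has_mod_calls(sam_line):
    """True if the record carries both MM and ML tags."""
    tags = mod_tags(sam_line)
    return "MM" in tags and "ML" in tags

# Invented record for illustration
record = "\t".join([
    "read1", "0", "chr1", "1", "60", "11M", "*", "0", "0",
    "ACGCCGTACGC", "*", "MM:Z:C+m,1,0;", "ML:B:C,230,12",
])
print(has_mod_calls(record))
print(mod_tags(record))
```

To scan a real file, pipe the output of samtools view into a small loop over sys.stdin and count how many records pass `has_mod_calls`.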

Prepared from the morning discussion to serve as a self‑contained guide and hand‑off document.
