Daily Archives: January 30, 2026

Bacterial WGS workflow overview (nf-core/bacass → taxonomic ID → phylogeny)

1) Primary WGS analysis with nf-core/bacass

We processed short-read bacterial WGS with nf-core/bacass v2.5.0 using Docker. The pipeline performs QC, assembly, and common taxonomic/typing steps (e.g., Kraken2). We ran it successfully with a local Kraken2 database.

A practical issue is the KmerFinder database download: the previously documented FTP “latest” path may be unavailable or slow, so it’s best to pre-download the database and pass a local path, or skip KmerFinder and run it manually via the CGE service if needed.

2) KmerFinder database: reliable download options

Two practical database sources:

  • Option A (recommended, easiest & fast): download the bundled DB from the CGE KmerFinder service site and extract it locally. (In our case the download finished in ~40 min.)
  • Option B (fallback, stable but older): use the Zenodo stable archive (20190108).

Once the database is available locally, it can be passed to bacass with --kmerfinderdb, pointing to the extracted bacteria/ directory or to the tarball, depending on what the pipeline version expects.

3) Species placement & phylogeny

For broad placement, an ANI-based approach (a fastANI distance tree) is quick and robust, but for a publishable phylogeny within a close clade, a standard route is:

Prokka → Roary (pangenome; core-gene alignment) → (trim) → RAxML-NG.

Key point: Roary core-gene trees are best for closely related genomes (same species/close genus). Distant outgroups can collapse the core (few/no shared genes). If outgroups are too far, use close outgroups (e.g., Moraxella, Psychrobacter) or switch to conserved marker gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) for deep phylogeny.
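For the marker-gene route, a minimal GTDB-Tk sketch (a placement example, not part of the runs above; it assumes GTDB-Tk and its reference data are installed, and that genomes/ holds the FASTA files collected in section D below):

gtdbtk classify_wf \
  --genome_dir genomes \
  --extension fna \
  --out_dir gtdbtk_out \
  --cpus 16 \
  --skip_ani_screen   # needed in GTDB-Tk >= 2.1; alternatively pass --mash_db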

4) Comparative genome visualization

For pairwise genome comparison/plots, nf-core/pairgenomealign can generate whole-genome alignments and visual representations against a target genome.


Reproducible code snippets (complete commands)

A) nf-core/bacass runs

Run 1 (works):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/minikraken_20171019_8GB.tgz
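samplesheet.tsv is tab-separated; a minimal short-read sketch, assuming the column layout from the bacass docs (ID, R1, R2, LongFastQ, Fast5, GenomeSize) with placeholder FASTQ paths and the unused long-read columns set to NA:

printf 'ID\tR1\tR2\tLongFastQ\tFast5\tGenomeSize\n' > samplesheet.tsv
printf 'An6\t/path/to/An6_R1.fastq.gz\t/path/to/An6_R2.fastq.gz\tNA\tNA\tNA\n' >> samplesheet.tsv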

Run 2 (skip KmerFinder, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/k2_standard_08_GB_20251015.tar.gz \
  --skip_kmerfinder \
  -resume

Then run KmerFinder manually via the CGE web service: https://cge.food.dtu.dk/services/KmerFinder/

Run 3 (use local KmerFinder DB, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
  --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
  -resume

B) KmerFinder DB download

Option A (CGE bundle; fast):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz'
tar -xvf kmerfinder_db.tar.gz
# resulting path typically contains: bacteria/

Option B (Zenodo stable archive):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz'
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz
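Whichever option is used, a quick sanity check that the bacteria/ subdirectory (the path later passed to --kmerfinderdb) was actually extracted:

ls -lh kmerfinder_db/bacteria/ | head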

C) Download close references (Acinetobacter + close outgroups) + fastANI matrix (script config)

(Implemented in full as 1_ani_tree_prep.sh below; the key settings are repeated here as a reproducible config.)

# Example config snippet used in the downloader:
MAX_ACINETOBACTER=20
MAX_PER_OUTGROUP=4
ASSEMBLY_LEVELS="complete,chromosome"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
MINFRACTION="0.05"

D) Prokka → Roary core MSA → RAxML-NG (phylogenomic tree)

0) Collect genomes (isolate + downloaded refs):

mkdir -p genomes
cp bacass_out/Prokka/An6/An6.fna genomes/An6.fna
bash 1_ani_tree_prep.sh
cp ani_tree_run/ref_fasta/*.genomic.fna genomes/

1) Prokka annotations (produces GFFs for Roary):

mkdir -p prokka gffs

# note: genomes/*.fna already matches *.genomic.fna, so a single glob avoids processing files twice
for f in genomes/*.fna; do
  [ -f "$f" ] || continue
  bn=$(basename "$f")
  sample="${bn%.*}"
  sample="${sample%.genomic}"
  sample="${sample//./_}"

  prokka --outdir "prokka/$sample" \
         --prefix "$sample" \
         --cpus 8 --force \
         "$f"
done

cp prokka/*/*.gff gffs/
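A quick check that one GFF per genome reached gffs/ before launching Roary (the count should equal the number of input genomes):

ls gffs/*.gff | wc -l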

2) Roary core-genome alignment (MAFFT):

mkdir -p roary_out

roary -f roary_out \
      -p 16 \
      -e -n \
      -i 95 \
      -cd 95 \
      gffs/*.gff

# output:
ls -lh roary_out/core_gene_alignment.aln
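Before investing in a tree, it is worth checking how large the core actually is; Roary writes per-category gene counts to summary_statistics.txt, and a core of only a handful of genes usually means the outgroups are too distant (see the key point in section 3):

cat roary_out/summary_statistics.txt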

3) Trim the core alignment (optional but recommended):

trimal -in roary_out/core_gene_alignment.aln \
       -out roary_out/core_gene_alignment.trim.aln \
       -automated1

4) RAxML-NG ML tree + bootstrap supports:

mkdir -p tree

raxml-ng --all \
  --msa roary_out/core_gene_alignment.trim.aln \
  --model GTR+G \
  --bs-trees autoMRE{500} \
  --threads 16 \
  --seed 12345 \
  --prefix tree/roary_core

# main result:
ls -lh tree/roary_core.raxml.support
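If the tree was built without outgroups (or needs re-rooting on one), it can be rooted afterwards; a minimal sketch using nw_reroot from newick_utils, assuming it is installed and that the placeholder label is replaced with a real outgroup tip from your tree:

# the label below is a placeholder - substitute an actual tip name, e.g. an outgroup accession
nw_reroot tree/roary_core.raxml.support OUTGROUP_TIP_LABEL > tree/roary_core.rooted.nwk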

E) Optional visualization in R (ape)

library(ape)

tr <- read.tree("tree/roary_core.raxml.support")

pdf("tree/roary_core_raxmlng.pdf", width=12, height=10)
plot(tr, cex=0.5)  # keep default margins; no.margin=TRUE would clip the title below
title("Roary core genome + RAxML-NG (GTR+G)")
dev.off()

F) Pairwise genome alignment figure (nf-core/pairgenomealign)

Example pattern (edit to your inputs/refs):

nextflow run nf-core/pairgenomealign -r 2.2.1 -profile docker \
  --input samplesheet_pairgenomealign.tsv \
  --outdir pairgenomealign_out

Practical notes

  • If the KmerFinder FTP “latest” path is unavailable, use the CGE bundle or the Zenodo stable DB and point bacass to the locally extracted bacteria/ directory.
  • For phylogeny:

    • Roary/RAxML-NG is best for close genomes; use close outgroups or none and root afterwards (e.g., with the nw_reroot sketch in section D).
    • For distant outgroups, use marker-gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) instead of pangenome core.

Used scripts (complete code)

A) 1_ani_tree_prep.sh

#!/usr/bin/env bash
set -euo pipefail

# ===================== USER CONFIG =====================
QUERY_FASTA="bacass_out/Prokka/An6/An6.fna"
SEED_ACCESSIONS_FILE="accessions.txt"

OUTDIR="ani_tree_run"
THREADS=8
MINFRACTION="0.05"
FRAGLEN="3000"

MAX_ACINETOBACTER=20
SUMMARY_LIMIT_ACI=400

MAX_PER_OUTGROUP=4
SUMMARY_LIMIT_OUT=120

# append ",scaffold" to also accept scaffold-level assemblies
ASSEMBLY_LEVELS="complete,chromosome"
# dropped as outgroups (too distant for the Roary core): "Pseudomonas" "Escherichia"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")

DO_MATRIX="yes"
SUMMARY_TIMEOUT=600
# =======================================================

ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
warn() { echo "[$(ts)] WARN: $*" >&2; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
sanitize() { echo "$1" | tr ' /;:,()[]{}' '_' | tr -cd 'A-Za-z0-9._-'; }

need datasets
need fastANI
need awk
need sort
need unzip
need find
need grep
need wc
need head
need readlink
need sed
need timeout
need python3

[[ -f "$QUERY_FASTA" ]] || die "QUERY_FASTA not found: $QUERY_FASTA"

mkdir -p "$OUTDIR"/{ref_fasta,zips,tmp,logs,accessions}
DL_LOG="$OUTDIR/logs/download.log"
SUM_LOG="$OUTDIR/logs/summary.log"
ANI_LOG="$OUTDIR/logs/fastani.log"
: > "$DL_LOG"; : > "$SUM_LOG"; : > "$ANI_LOG"

QUERY_LIST="$OUTDIR/query_list.txt"
echo "$(readlink -f "$QUERY_FASTA")" > "$QUERY_LIST"
log "QUERY: $(cat "$QUERY_LIST")"

ACC_DIR="$OUTDIR/accessions"
ACC_SEED="$ACC_DIR/seed.acc.txt"
ACC_ACI="$ACC_DIR/acinetobacter.acc.txt"
ACC_OUT="$ACC_DIR/outgroups.acc.txt"
ACC_ALL="$ACC_DIR/all_refs.acc.txt"

REF_DIR="$OUTDIR/ref_fasta"
REF_LIST="$OUTDIR/ref_list.txt"

# ---- seed accessions ----
if [[ -f "$SEED_ACCESSIONS_FILE" ]]; then
  grep -v '^\s*$' "$SEED_ACCESSIONS_FILE" | grep -v '^\s*#' | sort -u > "$ACC_SEED" || true
else
  : > "$ACC_SEED"
fi
log "Seed accessions: $(wc -l < "$ACC_SEED")"

# ---- extract accessions from a jsonl FILE (FIXED) ----
extract_accessions_from_jsonl_file() {
  local jsonl="$1"
  python3 - "$jsonl" <<'PY'
import sys, json
path = sys.argv[1]
out = []
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except Exception:
            continue
        # NCBI datasets summary often returns one report per line with top-level "accession"
        acc = obj.get("accession")
        if acc:
            out.append(acc)
        # some outputs can also contain "reports" lists
        reps = obj.get("reports") or []
        for r in reps:
            acc = r.get("accession")
            if acc:
                out.append(acc)
for a in out:
    print(a)
PY
}

fetch_accessions_by_taxon() {
  local taxon="$1"
  local want_n="$2"
  local limit_n="$3"
  local outfile="$4"

  local tag
  tag="$(sanitize "$taxon")"
  local jsonl="$OUTDIR/tmp/${tag}.jsonl"

  : > "$outfile"
  log "STEP: Fetch accessions | taxon='$taxon' want=$want_n limit=$limit_n assembly=$ASSEMBLY_LEVELS timeout=${SUMMARY_TIMEOUT}s"

  if ! timeout "$SUMMARY_TIMEOUT" datasets summary genome taxon "$taxon" \
      --assembly-level "$ASSEMBLY_LEVELS" \
      --annotated \
      --exclude-atypical \
      --exclude-multi-isolate \
      --limit "$limit_n" \
      --as-json-lines > "$jsonl" 2>>"$SUM_LOG"; then
    warn "datasets summary failed for taxon=$taxon"
    tail -n 80 "$SUM_LOG" >&2 || true
    return 1
  fi

  [[ -s "$jsonl" ]] || { warn "Empty jsonl for taxon=$taxon"; return 0; }
  log "  OK: summary jsonl lines: $(wc -l < "$jsonl")"

  # FIXED extraction
  extract_accessions_from_jsonl_file "$jsonl" | sort -u | head -n "$want_n" > "$outfile"

  log "  OK: got $(wc -l < "$outfile") accessions for $taxon"
  head -n 5 "$outfile" | sed 's/^/     sample: /' || true
  return 0
}

# ---- fetch Acinetobacter + outgroups ----
fetch_accessions_by_taxon "Acinetobacter" "$MAX_ACINETOBACTER" "$SUMMARY_LIMIT_ACI" "$ACC_ACI" \
  || die "Failed to fetch Acinetobacter accessions (see $SUM_LOG)"

: > "$ACC_OUT"
for tax in "${OUTGROUP_TAXA[@]}"; do
  tmp="$OUTDIR/tmp/$(sanitize "$tax").acc.txt"
  fetch_accessions_by_taxon "$tax" "$MAX_PER_OUTGROUP" "$SUMMARY_LIMIT_OUT" "$tmp" \
    || die "Failed to fetch outgroup taxon=$tax (see $SUM_LOG)"
  cat "$tmp" >> "$ACC_OUT"
done
sort -u "$ACC_OUT" -o "$ACC_OUT"
log "Outgroup accessions total: $(wc -l < "$ACC_OUT")"

# ---- merge ----
cat "$ACC_SEED" "$ACC_ACI" "$ACC_OUT" 2>/dev/null \
  | grep -v '^\s*$' | grep -v '^\s*#' | sort -u > "$ACC_ALL"
log "Total unique reference accessions: $(wc -l < "$ACC_ALL")"
[[ -s "$ACC_ALL" ]] || die "No reference accessions collected."

# ---- download genomes (batch) ----
ZIP="$OUTDIR/zips/ncbi_refs.zip"
UNZ="$OUTDIR/tmp/ncbi_refs_unzipped"
rm -f "$ZIP"; rm -rf "$UNZ"; mkdir -p "$UNZ"

log "STEP: Download genomes via NCBI datasets (batch) -> $ZIP"
datasets download genome accession --inputfile "$ACC_ALL" --include genome --filename "$ZIP" >>"$DL_LOG" 2>&1 \
  || { tail -n 120 "$DL_LOG" >&2; die "datasets download failed (see $DL_LOG)"; }

log "STEP: Unzip genomes"
unzip -q "$ZIP" -d "$UNZ" >>"$DL_LOG" 2>&1 || { tail -n 120 "$DL_LOG" >&2; die "unzip failed (see $DL_LOG)"; }

# ---- collect FASTA ----
log "STEP: Collect *_genomic.fna -> $REF_DIR"
: > "$REF_LIST"
n=0
while IFS= read -r f; do
  [[ -f "$f" ]] || continue
  acc="$(echo "$f" | grep -oE 'G[CF]A_[0-9]+\.[0-9]+' | head -n 1 || true)"
  [[ -z "${acc:-}" ]] && acc="REF$(printf '%04d' $((n+1)))"
  out="$REF_DIR/${acc}.genomic.fna"
  cp -f "$f" "$out"
  [[ -s "$out" ]] || continue
  echo "$(readlink -f "$out")" >> "$REF_LIST"
  n=$((n+1))
done < <(find "$UNZ" -type f -name "*_genomic.fna")

log "  OK: Reference FASTA collected: $n"
[[ "$n" -gt 0 ]] || die "No genomic.fna found after download."

# ---- fastANI query vs refs ----
RAW_QVSR="$OUTDIR/fastani_query_vs_refs.raw.tsv"
TSV_QVSR="$OUTDIR/fastani_query_vs_refs.tsv"

log "STEP: fastANI query vs refs"
log "  fastANI --ql $QUERY_LIST --rl $REF_LIST -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_QVSR"
fastANI \
  --ql "$QUERY_LIST" \
  --rl "$REF_LIST" \
  -t "$THREADS" \
  --minFraction "$MINFRACTION" \
  --fragLen "$FRAGLEN" \
  -o "$RAW_QVSR" >>"$ANI_LOG" 2>&1 \
  || { tail -n 160 "$ANI_LOG" >&2; die "fastANI failed (see $ANI_LOG)"; }

[[ -s "$RAW_QVSR" ]] || die "fastANI output empty: $RAW_QVSR"

echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$TSV_QVSR"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_QVSR" >> "$TSV_QVSR"

log "Top hits (ANI desc):"
tail -n +2 "$TSV_QVSR" | sort -k3,3nr | head -n 15 | sed 's/^/  /'

# ---- fastANI matrix for all genomes ----
if [[ "$DO_MATRIX" == "yes" ]]; then
  ALL_LIST="$OUTDIR/all_genomes_list.txt"
  cat "$QUERY_LIST" "$REF_LIST" > "$ALL_LIST"

  RAW_ALL="$OUTDIR/fastani_all.raw.tsv"
  log "STEP: fastANI all-vs-all matrix (for tree)"
  log "  fastANI --ql $ALL_LIST --rl $ALL_LIST --matrix -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_ALL"
  fastANI \
    --ql "$ALL_LIST" \
    --rl "$ALL_LIST" \
    --matrix \
    -t "$THREADS" \
    --minFraction "$MINFRACTION" \
    --fragLen "$FRAGLEN" \
    -o "$RAW_ALL" >>"$ANI_LOG" 2>&1 \
    || { tail -n 160 "$ANI_LOG" >&2; die "fastANI matrix failed (see $ANI_LOG)"; }

  [[ -f "${RAW_ALL}.matrix" ]] || die "Matrix not produced: ${RAW_ALL}.matrix"
  log "  OK: Matrix file: ${RAW_ALL}.matrix"
fi

log "DONE."
log "Accessions counts:"
log "  - seed:         $(wc -l < "$ACC_SEED")"
log "  - acinetobacter:$(wc -l < "$ACC_ACI")"
log "  - outgroups:    $(wc -l < "$ACC_OUT")"
log "  - all refs:     $(wc -l < "$ACC_ALL")"