Bacterial WGS workflow overview (nf-core/bacass → taxonomic ID → phylogeny)
1) Primary WGS analysis with nf-core/bacass
We processed short-read bacterial WGS with nf-core/bacass v2.5.0 using Docker. The pipeline performs QC, assembly, and common taxonomic/typing steps (e.g., Kraken2). We ran it successfully with a local Kraken2 database.
A practical issue is the KmerFinder database download: the previously documented FTP “latest” path may be unavailable or slow, so it’s best to pre-download the database and pass a local path, or skip KmerFinder and run it manually via the CGE service if needed.
2) KmerFinder database: reliable download options
Two practical database sources:
- Option A (recommended, easiest & fast): download the bundled DB from the CGE KmerFinder service site and extract locally. (In your case it finished in ~40 min.)
- Option B (fallback, stable but older): use the Zenodo stable archive (20190108).
Once the database is available locally, it can be provided to bacass with --kmerfinderdb (pointing to the extracted bacteria/ directory, or the tarball depending on the pipeline expectation).
3) Species placement & phylogeny
For broad placement, an ANI-based approach (fastANI distance tree) is quick and robust, but for publishable phylogeny within a close clade, a standard route is:
Prokka → Roary (pangenome; core-gene alignment) → (trim) → RAxML-NG.
Key point: Roary core-gene trees are best for closely related genomes (same species/close genus). Distant outgroups can collapse the core (few/no shared genes). If outgroups are too far, use close outgroups (e.g., Moraxella, Psychrobacter) or switch to conserved marker gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) for deep phylogeny.
4) Comparative genome visualization
For pairwise genome comparison/plots, nf-core/pairgenomealign can generate whole-genome alignments and visual representations against a target genome.
Reproducible code snippets (complete commands)
A) nf-core/bacass runs
Run 1 (works):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /home/jhuang/REFs/minikraken_20171019_8GB.tgz
Run 2 (skip KmerFinder, resume):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /home/jhuang/REFs/k2_standard_08_GB_20251015.tar.gz \
--skip_kmerfinder \
-resume
Then manually run KmerFinder via web service: https://cge.food.dtu.dk/services/KmerFinder/
Run 3 (use local KmerFinder DB, resume):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
-resume
B) KmerFinder DB download
Option A (CGE bundle; fast):
mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz'
tar -xvf kmerfinder_db.tar.gz
# resulting path typically contains: bacteria/
Option B (Zenodo stable archive):
mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz'
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz
C) Download close references (Acinetobacter + close outgroups) + fastANI matrix (your script config)
(You already implemented this; keep here as “reproducible config”)
# Example config snippet used in the downloader:
MAX_ACINETOBACTER=20
MAX_PER_OUTGROUP=4
ASSEMBLY_LEVELS="complete,chromosome"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
MINFRACTION="0.05"
D) Prokka → Roary core MSA → RAxML-NG (phylogenomic tree)
0) Collect genomes (isolate + downloaded refs):
mkdir -p genomes
cp bacass_out/Prokka/An6/An6.fna genomes/An6.fna
bash 1_ani_tree_prep.sh
cp ani_tree_run/ref_fasta/*.genomic.fna genomes/
1) Prokka annotations (produces GFFs for Roary):
mkdir -p prokka gffs
for f in genomes/*.fna genomes/*.genomic.fna; do
[ -f "$f" ] || continue
bn=$(basename "$f")
sample="${bn%.*}"
sample="${sample%.genomic}"
sample="${sample//./_}"
prokka --outdir "prokka/$sample" \
--prefix "$sample" \
--cpus 8 --force \
"$f"
done
cp prokka/*/*.gff gffs/
2) Roary core-genome alignment (MAFFT):
mkdir -p roary_out
roary -f roary_out \
-p 16 \
-e -n \
-i 95 \
-cd 95 \
gffs/*.gff
# output:
ls -lh roary_out/core_gene_alignment.aln
3) Trim the core alignment (optional but recommended):
trimal -in roary_out/core_gene_alignment.aln \
-out roary_out/core_gene_alignment.trim.aln \
-automated1
4) RAxML-NG ML tree + bootstrap supports:
mkdir -p tree
raxml-ng --all \
--msa roary_out/core_gene_alignment.trim.aln \
--model GTR+G \
--bs-trees autoMRE{500} \
--threads 16 \
--seed 12345 \
--prefix tree/roary_core
# main result:
ls -lh tree/roary_core.raxml.support
E) Optional visualization in R (ape)
library(ape)
tr <- read.tree("tree/roary_core.raxml.support")
pdf("tree/roary_core_raxmlng.pdf", width=12, height=10)
plot(tr, cex=0.5, no.margin=TRUE)
title("Roary core genome + RAxML-NG (GTR+G)")
dev.off()
F) Pairwise genome alignment figure (nf-core/pairgenomealign)
Example pattern (edit to your inputs/refs):
nextflow run nf-core/pairgenomealign -r 2.2.1 -profile docker \
--input samplesheet_pairgenomealign.tsv \
--outdir pairgenomealign_out
Practical notes
- If KmerFinder FTP “latest” is unavailable, use the CGE bundle or Zenodo stable DB and point bacass to the local extracted
bacteria/directory. -
For phylogeny:
- Roary/RAxML-NG is best for close genomes; use close outgroups or none and root later.
- For distant outgroups, use marker-gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) instead of pangenome core.
Used scripts (complete code)
A) 1_ani_tree_prep.sh
#!/usr/bin/env bash
set -euo pipefail
# ===================== USER CONFIG =====================
QUERY_FASTA="bacass_out/Prokka/An6/An6.fna"
SEED_ACCESSIONS_FILE="accessions.txt"
OUTDIR="ani_tree_run"
THREADS=8
MINFRACTION="0.05"
FRAGLEN="3000"
MAX_ACINETOBACTER=20
SUMMARY_LIMIT_ACI=400
MAX_PER_OUTGROUP=4
SUMMARY_LIMIT_OUT=120
#,scaffold
ASSEMBLY_LEVELS="complete,chromosome"
#Drop from Roary: "Pseudomonas" "Escherichia")
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
DO_MATRIX="yes"
SUMMARY_TIMEOUT=600
# =======================================================
ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
warn() { echo "[$(ts)] WARN: $*" >&2; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
sanitize() { echo "$1" | tr ' /;:,()[]{}' '_' | tr -cd 'A-Za-z0-9._-'; }
need datasets
need fastANI
need awk
need sort
need unzip
need find
need grep
need wc
need head
need readlink
need sed
need timeout
need python3
[[ -f "$QUERY_FASTA" ]] || die "QUERY_FASTA not found: $QUERY_FASTA"
mkdir -p "$OUTDIR"/{ref_fasta,zips,tmp,logs,accessions}
DL_LOG="$OUTDIR/logs/download.log"
SUM_LOG="$OUTDIR/logs/summary.log"
ANI_LOG="$OUTDIR/logs/fastani.log"
: > "$DL_LOG"; : > "$SUM_LOG"; : > "$ANI_LOG"
QUERY_LIST="$OUTDIR/query_list.txt"
echo "$(readlink -f "$QUERY_FASTA")" > "$QUERY_LIST"
log "QUERY: $(cat "$QUERY_LIST")"
ACC_DIR="$OUTDIR/accessions"
ACC_SEED="$ACC_DIR/seed.acc.txt"
ACC_ACI="$ACC_DIR/acinetobacter.acc.txt"
ACC_OUT="$ACC_DIR/outgroups.acc.txt"
ACC_ALL="$ACC_DIR/all_refs.acc.txt"
REF_DIR="$OUTDIR/ref_fasta"
REF_LIST="$OUTDIR/ref_list.txt"
# ---- seed accessions ----
if [[ -f "$SEED_ACCESSIONS_FILE" ]]; then
grep -v '^\s*$' "$SEED_ACCESSIONS_FILE" | grep -v '^\s*#' | sort -u > "$ACC_SEED" || true
else
: > "$ACC_SEED"
fi
log "Seed accessions: $(wc -l < "$ACC_SEED")"
# ---- extract accessions from a jsonl FILE (FIXED) ----
extract_accessions_from_jsonl_file() {
local jsonl="$1"
python3 - "$jsonl" <<'PY'
import sys, json
path = sys.argv[1]
out = []
with open(path, "r", encoding="utf-8", errors="ignore") as f:
for line in f:
line = line.strip()
if not line:
continue
try:
obj = json.loads(line)
except Exception:
continue
# NCBI datasets summary often returns one report per line with top-level "accession"
acc = obj.get("accession")
if acc:
out.append(acc)
# some outputs can also contain "reports" lists
reps = obj.get("reports") or []
for r in reps:
acc = r.get("accession")
if acc:
out.append(acc)
for a in out:
print(a)
PY
}
fetch_accessions_by_taxon() {
local taxon="$1"
local want_n="$2"
local limit_n="$3"
local outfile="$4"
local tag
tag="$(sanitize "$taxon")"
local jsonl="$OUTDIR/tmp/${tag}.jsonl"
: > "$outfile"
log "STEP: Fetch accessions | taxon='$taxon' want=$want_n limit=$limit_n assembly=$ASSEMBLY_LEVELS timeout=${SUMMARY_TIMEOUT}s"
if ! timeout "$SUMMARY_TIMEOUT" datasets summary genome taxon "$taxon" \
--assembly-level "$ASSEMBLY_LEVELS" \
--annotated \
--exclude-atypical \
--exclude-multi-isolate \
--limit "$limit_n" \
--as-json-lines > "$jsonl" 2>>"$SUM_LOG"; then
warn "datasets summary failed for taxon=$taxon"
tail -n 80 "$SUM_LOG" >&2 || true
return 1
fi
[[ -s "$jsonl" ]] || { warn "Empty jsonl for taxon=$taxon"; return 0; }
log " OK: summary jsonl lines: $(wc -l < "$jsonl")"
# FIXED extraction
extract_accessions_from_jsonl_file "$jsonl" | sort -u | head -n "$want_n" > "$outfile"
log " OK: got $(wc -l < "$outfile") accessions for $taxon"
head -n 5 "$outfile" | sed 's/^/ sample: /' || true
return 0
}
# ---- fetch Acinetobacter + outgroups ----
fetch_accessions_by_taxon "Acinetobacter" "$MAX_ACINETOBACTER" "$SUMMARY_LIMIT_ACI" "$ACC_ACI" \
|| die "Failed to fetch Acinetobacter accessions (see $SUM_LOG)"
: > "$ACC_OUT"
for tax in "${OUTGROUP_TAXA[@]}"; do
tmp="$OUTDIR/tmp/$(sanitize "$tax").acc.txt"
fetch_accessions_by_taxon "$tax" "$MAX_PER_OUTGROUP" "$SUMMARY_LIMIT_OUT" "$tmp" \
|| die "Failed to fetch outgroup taxon=$tax (see $SUM_LOG)"
cat "$tmp" >> "$ACC_OUT"
done
sort -u "$ACC_OUT" -o "$ACC_OUT"
log "Outgroup accessions total: $(wc -l < "$ACC_OUT")"
# ---- merge ----
cat "$ACC_SEED" "$ACC_ACI" "$ACC_OUT" 2>/dev/null \
| grep -v '^\s*$' | grep -v '^\s*#' | sort -u > "$ACC_ALL"
log "Total unique reference accessions: $(wc -l < "$ACC_ALL")"
[[ -s "$ACC_ALL" ]] || die "No reference accessions collected."
# ---- download genomes (batch) ----
ZIP="$OUTDIR/zips/ncbi_refs.zip"
UNZ="$OUTDIR/tmp/ncbi_refs_unzipped"
rm -f "$ZIP"; rm -rf "$UNZ"; mkdir -p "$UNZ"
log "STEP: Download genomes via NCBI datasets (batch) -> $ZIP"
datasets download genome accession --inputfile "$ACC_ALL" --include genome --filename "$ZIP" >>"$DL_LOG" 2>&1 \
|| { tail -n 120 "$DL_LOG" >&2; die "datasets download failed (see $DL_LOG)"; }
log "STEP: Unzip genomes"
unzip -q "$ZIP" -d "$UNZ" >>"$DL_LOG" 2>&1 || { tail -n 120 "$DL_LOG" >&2; die "unzip failed (see $DL_LOG)"; }
# ---- collect FASTA ----
log "STEP: Collect *_genomic.fna -> $REF_DIR"
: > "$REF_LIST"
n=0
while IFS= read -r f; do
[[ -f "$f" ]] || continue
acc="$(echo "$f" | grep -oE 'G[CF]A_[0-9]+\.[0-9]+' | head -n 1 || true)"
[[ -z "${acc:-}" ]] && acc="REF$(printf '%04d' $((n+1)))"
out="$REF_DIR/${acc}.genomic.fna"
cp -f "$f" "$out"
[[ -s "$out" ]] || continue
echo "$(readlink -f "$out")" >> "$REF_LIST"
n=$((n+1))
done < <(find "$UNZ" -type f -name "*_genomic.fna")
log " OK: Reference FASTA collected: $n"
[[ "$n" -gt 0 ]] || die "No genomic.fna found after download."
# ---- fastANI query vs refs ----
RAW_QVSR="$OUTDIR/fastani_query_vs_refs.raw.tsv"
TSV_QVSR="$OUTDIR/fastani_query_vs_refs.tsv"
log "STEP: fastANI query vs refs"
log " fastANI --ql $QUERY_LIST --rl $REF_LIST -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_QVSR"
fastANI \
--ql "$QUERY_LIST" \
--rl "$REF_LIST" \
-t "$THREADS" \
--minFraction "$MINFRACTION" \
--fragLen "$FRAGLEN" \
-o "$RAW_QVSR" >>"$ANI_LOG" 2>&1 \
|| { tail -n 160 "$ANI_LOG" >&2; die "fastANI failed (see $ANI_LOG)"; }
[[ -s "$RAW_QVSR" ]] || die "fastANI output empty: $RAW_QVSR"
echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$TSV_QVSR"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_QVSR" >> "$TSV_QVSR"
log "Top hits (ANI desc):"
tail -n +2 "$TSV_QVSR" | sort -k3,3nr | head -n 15 | sed 's/^/ /'
# ---- fastANI matrix for all genomes ----
if [[ "$DO_MATRIX" == "yes" ]]; then
ALL_LIST="$OUTDIR/all_genomes_list.txt"
cat "$QUERY_LIST" "$REF_LIST" > "$ALL_LIST"
RAW_ALL="$OUTDIR/fastani_all.raw.tsv"
log "STEP: fastANI all-vs-all matrix (for tree)"
log " fastANI --ql $ALL_LIST --rl $ALL_LIST --matrix -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_ALL"
fastANI \
--ql "$ALL_LIST" \
--rl "$ALL_LIST" \
--matrix \
-t "$THREADS" \
--minFraction "$MINFRACTION" \
--fragLen "$FRAGLEN" \
-o "$RAW_ALL" >>"$ANI_LOG" 2>&1 \
|| { tail -n 160 "$ANI_LOG" >&2; die "fastANI matrix failed (see $ANI_LOG)"; }
[[ -f "${RAW_ALL}.matrix" ]] || die "Matrix not produced: ${RAW_ALL}.matrix"
log " OK: Matrix file: ${RAW_ALL}.matrix"
fi
log "DONE."
log "Accessions counts:"
log " - seed: $(wc -l < "$ACC_SEED")"
log " - acinetobacter:$(wc -l < "$ACC_ACI")"
log " - outgroups: $(wc -l < "$ACC_OUT")"
log " - all refs: $(wc -l < "$ACC_ALL")"