Step‑by‑Step Workflow for Phylogenomic and Functional Analysis of 100 Clinical Isolates Using WGS Data

Below is a sorted, step‑by‑step protocol integrating all commands and materials from your notes — organized into logical analysis phases. It outlines how to process whole‑genome data, annotate with Prokka, perform pan‑genome and SNP‑based phylogenetic analysis, and conduct resistome profiling and visualization.

1. Initial Setup and Environment Configuration

Create and activate working environments:

conda create -n bengal3_ac3 python=3.9
conda activate /home/jhuang/miniconda3/envs/bengal3_ac3

Copy necessary config files for bacto:

cp /home/jhuang/Tools/bacto/bacto-0.1.json .
cp /home/jhuang/Tools/bacto/Snakefile .

Prepare R environment for later visualization:

mamba create -n r_414_bioc314 -c conda-forge -c bioconda \
  r-base=4.1.3 bioconductor-ggtree bioconductor-treeio \
  r-ggplot2 r-dplyr r-ape pandoc

2. Species Assignment and MLST Screening

Use WGS data to determine phylogenetic relationships based on MLST. Confirm species identity of isolates: E. coli, K. pneumoniae, A. baumannii complex, P. aeruginosa.
Example log output:

[10:19:59] Excluding ecoli.icdA.262 due to --exclude option
[10:19:59] If you like MLST, you're going to absolutely love cgMLST!
[10:19:59] Done.

Observation: No strain was close enough to be a suspected clonal strain (to be confirmed later with SNP‑based analysis).

3. Genome Annotation with Prokka

Cluster isolates into four species folders and annotate each genome separately.

Example shell script:

for cluster in abaumannii_2 ecoli_achtman_4 klebsiella paeruginosa; do
  GENUS_MAP=( [abaumannii_2]="Acinetobacter" [ecoli_achtman_4]="Escherichia" [klebsiella]="Klebsiella" [paeruginosa]="Pseudomonas" )
  SPECIES_MAP=( [abaumannii_2]="baumannii" [ecoli_achtman_4]="coli" [klebsiella]="pneumoniae" [paeruginosa]="aeruginosa" )
  prokka --force --outdir prokka/${cluster} --cpus 8 \
         --genus ${GENUS_MAP[$cluster]} --species ${SPECIES_MAP[$cluster]} \
         --addgenes --addmrna --prefix ${cluster} \
         -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
 done

4. Pan‑Genome Construction with Roary

Run Roary for each species cluster (GFF outputs from Prokka):

roary -f roary/ecoli_cluster -e --mafft -p 100 prokka/ecoli_cluster/*.gff
roary -f roary/kpneumoniae_cluster1 -e --mafft -p 100 prokka/kpneumoniae_cluster1/*.gff
roary -f roary/abaumannii_cluster -e --mafft -p 100 prokka/abaumannii_cluster/*.gff
roary -f roary/paeruginosa_cluster -e --mafft -p 100 prokka/paeruginosa_cluster/*.gff

Output: core gene alignments for each species.

5. Core‑Genome Phylogenetic Tree Building (RAxML‑NG)

Use maximum likelihood reconstruction per species:

raxml-ng --all --msa roary/ecoli_achtman_4/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix ecoli_core_gene_tree_1000
raxml-ng --all --msa roary/klebsiella/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix klebsiella_core_gene_tree_1000
raxml-ng --all --msa roary/abaumannii/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix abaumannii_core_gene_tree_1000
raxml-ng --all --msa roary/paeruginosa/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix paeruginosa_core_gene_tree_1000

6. SNP Detection and Distance Calculation

Extract SNP sites and generate distance matrices:

snp-sites -v -o core_snps.vcf roary/ecoli_achtman_4/core_gene_alignment.aln
snp-dists roary/ecoli_achtman_4/core_gene_alignment.aln > ecoli_snp_dist.tsv

Repeat for other species and convert to Excel summary:

~/Tools/csv2xls-0.4/csv_to_xls.py ecoli_snp_dist.tsv klebsiella_snp_dist.tsv abaumannii_snp_dist.tsv paeruginosa_snp_dist.tsv -d$'\t' -o snp_dist.xls

7. Resistome and Virulence Profiling with Abricate

Install and set up databases:

conda install -c bioconda -c conda-forge abricate=1.0.1
abricate --setupdb
abricate --list
DATABASE        SEQUENCES       DBTYPE  DATE
vfdb    2597    nucl    2025-Oct-22
resfinder       3077    nucl    2025-Oct-22
argannot        2223    nucl    2025-Oct-22
ecoh    597     nucl    2025-Oct-22
megares 6635    nucl    2025-Oct-22
card    2631    nucl    2025-Oct-22
ecoli_vf        2701    nucl    2025-Oct-22
plasmidfinder   460     nucl    2025-Oct-22
ncbi    5386    nucl    2025-Oct-22

Run VFDB for virulence genes:

# # Run VFDB example
# abricate --db vfdb genome.fasta > vfdb.tsv
#
# # strict filter (≥90% ID, ≥80% cov) using header-safe awk
# awk -F'\t' 'NR==1{
#   for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
#   print; next
# } ($id+0)>=90 && ($cov+0)>=80' vfdb.tsv > vfdb.strict.tsv

# 0) Define your cluster directories (exactly as in your prompt)
CLUSTERS="fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3"

# 1) Run Abricate (VFDB) per isolate, strictly filtered (≥90% ID, ≥80% COV)
for D in $CLUSTERS; do
  mkdir -p "$D/abricate_vfdb"
  for F in "$D"/*.fasta; do
    ISO=$(basename "$F" .fasta)
    # raw
    abricate --db vfdb "$F" > "$D/abricate_vfdb/${ISO}.vfdb.tsv"
    # strict filter while keeping header (header-safe awk)
    awk -F'\t' 'NR==1{
      for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
      print; next
    } ($id+0)>=90 && ($cov+0)>=80' \
      "$D/abricate_vfdb/${ISO}.vfdb.tsv" > "$D/abricate_vfdb/${ISO}.vfdb.strict.tsv"
  done
done

# 2) Build per-cluster "isolate
<TAB>gene" lists (Abricate column 5 = GENE)
for D in $CLUSTERS; do
  OUT="$D/virulence_isolate_gene.tsv"
  : > "$OUT"
  for T in "$D"/abricate_vfdb/*.vfdb.strict.tsv; do
    ISO=$(basename "$T" .vfdb.strict.tsv)
    awk -F'\t' -v I="$ISO" 'NR>1{print I"\t"$5}' "$T" >> "$OUT"
  done
  sort -u "$OUT" -o "$OUT"
done

# 3) Create a per-isolate "signature" (sorted, comma-joined gene list)
for D in $CLUSTERS; do
  IN="$D/virulence_isolate_gene.tsv"
  SIG="$D/virulence_signatures.tsv"
  awk -F'\t' '
  {m[$1][$2]=1}
  END{
    for(i in m){
      n=0
      for(g in m[i]){n++; a[n]=g}
      asort(a)
      printf("%s\t", i)
      for(k=1;k<=n;k++){printf("%s%s", (k>1?",":""), a[k])}
      printf("\n")
      delete a
    }
  }' "$IN" | sort > "$SIG"
done

# 4) Report whether each cluster is internally identical
for D in $CLUSTERS; do
  SIG="$D/virulence_signatures.tsv"
  echo "== $D =="
  # how many unique signatures?
  CUT=$(cut -f2 "$SIG" | sort -u | wc -l)
  if [ "$CUT" -eq 1 ]; then
    echo "  Virulence profiles: IDENTICAL within cluster"
  else
    echo "  Virulence profiles: NOT IDENTICAL (unique signatures: $CUT)"
    echo "  --- differing isolates & their signatures ---"
    cat "$SIG"
  fi
done

Summarize VFDB results:

#----
# Make gene lists (VFDB "GENE" = column 5) for each isolate
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | tail -n +2 | sort -u > c2_070.genes.txt

# Show symmetric differences
echo "Genes present in QRC018 only:"
comm -23 c2_018.genes.txt c2_070.genes.txt

echo "Genes present in QRC070 only:"
comm -13 c2_018.genes.txt c2_070.genes.txt

awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | sort -u > c2_018.gene_product.txt
awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | sort -u > c2_070.gene_product.txt

# Genes unique to QRC018 with product
join -t $'\t' <(cut -f1 c2_018.gene_product.txt) <(cut -f1 c2_070.gene_product.txt) -v1 > /dev/null # warms caches
comm -23 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_018.gene_product.txt

# Genes unique to QRC070 with product
comm -13 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_070.gene_product.txt

# Make a cluster summary table from raw (or strict) TSVs
for D in fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3; do
  echo "== $D =="
  abricate --summary "$D"/abricate_vfdb/*.vfdb.strict.tsv | column -t
done

Make gene presence lists and find symmetric differences between isolates:

cut -f5 fasta_kpneumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
comm -23 c2_018.genes.txt c2_070.genes.txt

5 (Optional). For A. baumannii with zero hits, run DIAMOND BLASTp against VFDB proteins.

#If you want to be extra thorough for A. baumannii. Because your VFDB nucleotide set returned zero for A. baumannii, you can cross-check with VFDB protein via DIAMOND:

# Build a DIAMOND db (once)
diamond makedb -in VFDB_proteins.faa -d vfdb_prot

# Query predicted proteins (Prokka .faa) per isolate
for F in fasta_abaumannii_cluster/*.fasta; do
  ISO=$(basename "$F" .fasta)
  prokka --cpus 8 --outdir prokka/$ISO --prefix $ISO "$F"
  diamond blastp -q prokka/$ISO/$ISO.faa -d vfdb_prot \
    -o abricate_vfdb/$ISO.vfdb_prot.tsv \
    -f 6 qseqid sseqid pident length evalue bitscore qcovhsp \
    --id 90 --query-cover 80 --max-target-seqs 1 --threads 8
done

8. Visualization with ggtree and Heatmaps

Render circular core genome trees and overlay selected ARGs or gene presence/absence matrices.

Key R snippet:

library(ggtree)
library(ggplot2)
library(dplyr)
info <- read.csv("typing_until_ecpA.csv", sep="\t")
tree <- read.tree("raxml-ng/snippy.core.aln.raxml.tree")
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info +
     geom_tippoint(aes(color=ST), size=4) + geom_tiplab2(aes(label=name), offset=1)
gheatmap(p, heatmapData2, width=4, colnames_angle=45, offset=4.2)

Output: multi‑layer circular tree (e.g., ggtree_and_gheatmap_selected_genes.png).

9. Final Analyses and Reporting

Compute pairwise SNP distances and identify potential clonal clusters using species‑specific thresholds (e.g., ≤17 SNPs for E. coli).
If clones exist, retain one representative per cluster for subsequent Boruta analysis or CA/EA metrics.
Integrate ARG heatmaps to the trees for intuitive visualization of resistance determinants relative to phylogeny.

Protocol Description Summary

This workflow systematically processes WGS data from multiple bacterial species:

Annotation (Prokka) ensures consistent gene predictions.
Pan‑genome alignment (Roary) captures core and accessory variation.
Phylogenetic reconstruction (RAxML‑NG) provides species‑level evolutionary context.
SNP‑distance computation (snp‑sites, snp‑dists) allows clonal determination and diversity quantification.
Resistome analysis (Abricate, Diamond) characterizes antimicrobial resistance and virulence determinants.
Visualization (ggtree + gheatmap) integrates tree topology with gene presence–absence or ARG annotations.

Properly documented, this end‑to‑end pipeline can serve as the Methods section for publication or as an online reproducible workflow guide.

Methods (for the manuscript). Genomes were annotated with PROKKA v1.14.5 [1]. A pangenome was built with Roary v3.13.0 [2] using default settings (95% protein identity) to derive a gene presence/absence matrix. Core genes—defined as present in 99–100% of isolates by Roary—were individually aligned with MAFFT v7 [3] and concatenated; the resulting multiple alignment was used to infer a maximum-likelihood phylogeny in RAxML-NG [4] under GTR+G, with 1,000 bootstrap replicates [5] and a random starting tree. Trees were visualized with the R package ggtree [6] in circular layout; tip colors denote sequence type (ST), with MLST assigned via PubMLST/BIGSdb [7]; concentric heatmap rings indicate the presence (+) or absence (-) of resistance genes.

Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691–3693.
Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772–780.
Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A, Wren J. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. 2017. ggtree: an R package for visualization and annotation of phylogenetic trees. Methods in Ecology and Evolution 8(1):28–36.
Jolley KA, Bray JE, Maiden MCJ. 2018. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3:124.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.