scripts_tree_construction_v2.zip
conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
1. Download genomes from Genbank to local disk
LOCAL_ONLY_ACCESSIONS=(“CP002773.1” “CP093323.1” “CP012097.1”, “CP014137.1”, “JBUDMV010000001.1”) saving in the directory local_Sply_fastas, GE11174.fasta should be ./
2. Replace EXTRA_ASSEMBLIES (manual edit or use sed/python as shown above)
EXTRA_ASSEMBLIES=(
=== Gibbsiella (core comparison) ===
"GCF_002291425.1" # G. quercinecans FRB97ᵀ
"GCF_004342245.1" # G. quercinecans DSM 25889
#"GCF_039539505.1" # G. papilionis PWX6
# "GCF_039539505.1": "Gibbsiella papilionis PWX6 (GCF_039539505.1)",
"JBUDMV010000001.1"
"GCA_039540155.1" # G. greigii USA56
#"GCA_004342265.1" # G. dentisursi DSM 23818ᵀ
# === Serratia (address biochemical mis-ID) ===
#mkdir -p local_Sply_fastas
#mv ~/Downloads/CP002773.1.fasta local_Sply_fastas/
#mv ~/Downloads/CP093323.1.fasta local_Sply_fastas/
#mv ~/Downloads/CP012097.1.fasta local_Sply_fastas/
#"CP002773.1" # S. plymuthica AS9
#"CP093323.1" # S. plymuthica B37/06
#"CP012097.1" # S. plymuthica 3Re4-18
"GCA_001663135.1" # S. odorifera 4Rx13 (already validated)
#"GCA_002951025.1" # S. fonticola CDC 470-73ᵀ
# === Close outgroup (same family Yersiniaceae) ===
#"GCF_005484965.1" # Brenneria nigrifluens LMG 5956ᵀ
# "GCF_005484965.1": "Brenneria nigrifluens LMG 5956 (GCF_005484965.1)"
#"GCF_000239205.1" # Brenneria goodwinii BC-1
"CP014137.1"
)
3. EXACT desired display labels (as requested)
EXACT display names
desired = { “JBVKFS000000000”: “Gibbsiella quercinecans HH131225 (JBVKFS000000000) ⭐”, “GCF_002291425.1”: “Gibbsiella quercinecans FRB97 (GCF_002291425.1)”, “GCF_004342245.1”: “Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)”, “JBUDMV010000001.1”: “Gibbsiella dentisursi 2e (JBUDMV010000001)”, “GCA_039540155.1”: “Gibbsiella greigii USA56 (GCA_039540155.1)”, “CP002773.1”: “Serratia plymuthica AS9 (CP002773.1)”, “CP093323.1”: “Serratia plymuthica B37/06 (CP093323.1)”, “CP012097.1”: “Serratia plymuthica 3Re4-18 (CP012097.1)”, “GCA_001663135.1”: “Serratia odorifera 4Rx13 (GCA_001663135.1)”, “CP014137.1”: “Brenneria goodwinii FRB141 (CP014137.1)” }
4. Re-run the pipeline
./build_wgs_tree_fig3B_v3.sh
5. Verify output labels
cat work_wgs_tree/plot/labels.tsv
6. Delete the unnecessary isolates from labels.tsv, fastas and gffs, for example delete two isolates starting with ‘-‘
sample display
GCF_002291425.1 Gibbsiella quercinecans FRB97 (GCF_002291425.1)
GCA_039540155.1 Gibbsiella greigii USA56 (GCA_039540155.1)
- GCF_039539505.1 G_papilionis_PWX6 (GCF_039539505.1)
- GCF_005484965.1 B_nigrifluens_LMG5956 (GCF_005484965.1)
JBVKFS000000000 Gibbsiella quercinecans HH131225 (JBVKFS000000000) ⭐
GCF_004342245.1 Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
JBUDMV010000001.1 Gibbsiella dentisursi 2e (JBUDMV010000001)
CP002773.1 Serratia plymuthica AS9 (CP002773.1)
CP093323.1 Serratia plymuthica B37/06 (CP093323.1)
CP012097.1 Serratia plymuthica 3Re4-18 (CP012097.1)
GCA_001663135.1 Serratia odorifera 4Rx13 (GCA_001663135.1)
CP014137.1 Brenneria goodwinii FRB141 (CP014137.1)
🔧 If you want to skip re-running everything If Prokka/GFFs are already done and you just need to re-run Roary→RAxML→Plot with correct inputs:
A. Ensure collect_fastas copied all 10 FASTAs
ls -1 work_wgs_tree_expanded/fastas/*.fna | wc -l # Should be 10
B. Re-run Roary with relaxed core threshold (for mixed genera)
roary -e –mafft -p 64 -cd 95 -i 95 \ -f work_wgs_tree_expanded/roary_final \ work_wgs_tree_expanded/gffs/*.gff #–> 60736 SNPs
C. Run RAxML-NG with explicit FASTA format
CORE_ALN=”work_wgs_tree_expanded/roary_final/core_gene_alignment.aln” raxml-ng –all \ –msa “$CORE_ALN” \ –msa-format FASTA \ –model GTR+G \ –bs-trees 1000 \ –threads 80 \ –prefix work_wgs_tree_expanded/raxmlng/core \ –redo
D. Run plotting with correct labels
Adapt the plot_tree.R by entering the outgroup-id, e.g. CP014137.1
Version 1: WITHOUT bootstrap values
Rscript plot_tree.R \ work_wgs_tree_expanded/raxmlng/core.raxml.support \ work_wgs_tree_expanded/plot/labels.tsv \ work_wgs_tree_expanded/plot/core_tree_fig3B_no_bootstrap.pdf \ work_wgs_tree_expanded/plot/core_tree_fig3B_no_bootstrap.png \ FALSE
Version 2: WITH bootstrap values
Rscript plot_tree.R \ work_wgs_tree_expanded/raxmlng/core.raxml.support \ work_wgs_tree_expanded/plot/labels.tsv \ work_wgs_tree_expanded/plot/core_tree_fig3B_with_bootstrap.pdf \ work_wgs_tree_expanded/plot/core_tree_fig3B_with_bootstrap.png \ TRUE
REPORT
I agree that showing where Gibbsiella sits among related Gram-negative bacteria would be interesting. However, for the generation of the phylogenetic tree, we are using a core-genome alignment approach (Roary + RAxML-NG), which has an important technical constraint: Core-genome methods require sufficient gene content overlap. When taxa are too evolutionarily distant, the number of shared (“core”) genes drops sharply, resulting in sparse alignments and reduced branch support. With too few core genes, the tree may reflect alignment artifacts rather than true evolutionary history.
I have currently selected the following species for tree construction: TARGET ISOLATE • G. quercinecans HH131225 – Clinical case (highlighted in red) GIBBSIELLA (intra-genus diversity) • G. quercinecans (2 strains) • G. papilionis • G. greigii SERRATIA (address biochemical misidentification) • S. plymuthica (3 strains) • S. odorifera OUTGROUP (rooting) • Brenneria nigrifluens This ensures a robust core alignment (51,495 SNPs) and high-confidence branching for the clinically relevant comparison.
Figure Legend: Figure 1. Core-genome maximum-likelihood phylogeny of Gibbsiella quercinecans HH131225 (red) with reference strains from the genus Gibbsiella, biochemically similar Serratia species, and Brenneria outgroup. Tip labels show species, strain designation, and GenBank assembly accession or chromosome accession. Scale bar: substitutions per site. Tree inferred from 60,736 core-genome SNPs using Roary (Page et al., 2015) and RAxML-NG (Kozlov et al., 2019) with the GTR+G model and 1,000 bootstrap replicates.
References: Page et al., 2015: Page, A. J., Cummins, C. A., Hunt, M., Wong, V. K., Reuter, S., Holden, M. T., … & Parkhill, J. (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics, 31(22), 3691–3693. https://doi.org/10.1093/bioinformatics/btv421 Kozlov et al., 2019: Kozlov, A. M., Darriba, D., Flouri, T., Morel, B., & Stamatakis, A. (2019). RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 35(21), 4453–4455. https://doi.org/10.1093/bioinformatics/btz305

