Processing Data_Benjamin_DNAseq_2026_GE11174 — Bacterial WGS pipeline (standalone post)

This post documents the full end-to-end workflow used to process GE11174 from Data_Benjamin_DNAseq_2026_GE11174, covering: database setup, assembly + QC, species identification (k-mers + ANI), genome annotation, generation of summary tables, AMR/resistome/virulence profiling, phylogenetic tree construction (with reproducible plotting), optional debugging/reruns, optional closest-isolate comparison, and a robust approach to find nearest genomes from GenBank.


Overview of the workflow

High-level stages

  1. Prepare databases (KmerFinder DB; Kraken2 DB used by bacass).
  2. Assemble + QC + taxonomic context using nf-core/bacass (Nextflow + Docker).
  3. Interpret KmerFinder results (species hint; confirm with ANI for publication).
  4. Annotate the genome using BV-BRC ComprehensiveGenomeAnalysis.
  5. Generate Table 1 (sequence + assembly + genome features) under gunc_env and export to Excel.
  6. AMR / resistome / virulence profiling using ABRicate (+ MEGARes/CARD/ResFinder/VFDB) and RGI (CARD models), export to Excel.
  7. Build phylogenetic tree (NCBI retrieval + Roary + RAxML-NG + R plotting).
  8. Debug/re-run guidance (drop one genome, rerun Roary→RAxML, regenerate plot).
  9. ANI + BUSCO interpretation (species boundary explanation and QC interpretation).
  10. fastANI interpretation text (tree + ANI combined narrative).
  11. Optional: closest isolate alignment using nf-core/pairgenomealign.
  12. Optional: NCBI submission (batch submission plan).
  13. Robust closest-genome search from GenBank using NCBI datasets + Mash, with duplicate handling (GCA vs GCF).

0) Inputs / assumptions

  • Sample: GE11174
  • Inputs referenced in commands:

    • samplesheet.tsv for bacass
    • targets.tsv for reference selection (tree step)
    • samplesheet.csv for pairgenomealign (closest isolate comparison)
    • raw reads: GE11174.rawreads.fastq.gz
    • assembly FASTA used downstream: GE11174.fasta (and in some places scaffold outputs like scaffolds.fasta / GE11174.scaffolds.fa)
  • Local reference paths (examples used):

    • Kraken2 DB tarball: /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz
    • KmerFinder DB: /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ (or tarball variants shown below)

1) Database setup — KmerFinder DB

Option A (CGE service):

Option B (Zenodo snapshot):

  • Download 20190108_kmerfinder_stable_dirs.tar.gz from: https://zenodo.org/records/13447056
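A minimal sketch for fetching and unpacking the Zenodo snapshot (the direct file URL assumes Zenodo's usual records/<id>/files/<name> pattern; verify it on the record page):

wget -O 20190108_kmerfinder_stable_dirs.tar.gz \
  "https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz"
mkdir -p /mnt/nvme1n1p1/REFs/kmerfinder
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz -C /mnt/nvme1n1p1/REFs/kmerfinder/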

2) Assembly + QC + taxonomic screening — nf-core/bacass

Run bacass with Docker and resume support:

# Example --kmerfinderdb values tried/recorded:
# --kmerfinderdb /path/to/kmerfinder/bacteria.tar.gz
# --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
# --kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type long \
  --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
  --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
  -resume
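For reference, a long-read-only samplesheet.tsv follows the bacass TSV layout sketched below (column names per the nf-core/bacass docs; NA marks unused fields; verify against your pipeline version):

ID	R1	R2	LongFastQ	Fast5	GenomeSize
GE11174	NA	NA	./GE11174.rawreads.fastq.gz	NA	NA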

Outputs used later

  • Scaffolded/assembled FASTA from bacass (e.g., for annotation, AMR screening, Mash sketching, tree building).

3) KmerFinder summary — species hint (with publication note)

Interpretation recorded:

From the KmerFinder summary, the top hit is Gibbsiella quercinecans (strain FRB97; NZ_CP014136.1) with much higher score and coverage than the second hit (which is low coverage). So it’s fair to write: “KmerFinder indicates the isolate is most consistent with Gibbsiella quercinecans.” …but for a species call (especially for publication), you should confirm with ANI (or a genome taxonomy tool), because k-mer hits alone aren’t always definitive.


4) Genome annotation — BV-BRC ComprehensiveGenomeAnalysis

Annotate the genome using BV-BRC:


5) Table 1 — Summary of sequence data and genome features (env: gunc_env)

Prepare environment and run the Table 1 pipeline:

# activate the env that has openpyxl
mamba activate gunc_env
mamba install -n gunc_env -c conda-forge openpyxl -y
mamba deactivate

# STEP_1
ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh

# STEP_2
python export_table1_stats_to_excel_py36_compat.py \
  --workdir table1_GE11174_work \
  --out Comprehensive_GE11174.xlsx \
  --max-rows 200000 \
  --sample GE11174

# STEP_1+2 (combined)
ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

For the items “Total number of reads sequenced” and “Mean read length (bp)”:

pigz -dc GE11174.rawreads.fastq.gz | awk 'END{print NR/4}'
seqkit stats GE11174.rawreads.fastq.gz
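If seqkit is unavailable, the mean read length can also be computed directly from the FASTQ with awk (a minimal sketch):

pigz -dc GE11174.rawreads.fastq.gz | awk 'NR%4==2 {n++; s+=length($0)} END {printf "%.1f\n", s/n}'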

6) AMR gene profiling + Resistome + Virulence profiling (ABRicate + RGI)

This stage produces resistome/virulence tables and an Excel export.

6.1 Databases / context notes

“Table 4. Specialty Genes” note recorded:

  • NDARO: 1
  • Antibiotic Resistance — CARD: 15
  • Antibiotic Resistance — PATRIC: 55
  • Drug Target — TTD: 38
  • Metal Resistance — BacMet: 29
  • Transporter — TCDB: 250
  • Virulence factor — VFDB: 33

Useful sites:

6.2 ABRicate environment + DB listing

conda activate /home/jhuang/miniconda3/envs/bengal3_ac3

abricate --list
#DATABASE        SEQUENCES       DBTYPE  DATE
#vfdb    2597    nucl    2025-Oct-22
#resfinder       3077    nucl    2025-Oct-22
#argannot        2223    nucl    2025-Oct-22
#ecoh    597     nucl    2025-Oct-22
#megares 6635    nucl    2025-Oct-22
#card    2631    nucl    2025-Oct-22
#ecoli_vf        2701    nucl    2025-Oct-22
#plasmidfinder   460     nucl    2025-Oct-22
#ncbi    5386    nucl    2025-Oct-22

abricate-get_db  --list
#Choices: argannot bacmet2 card ecoh ecoli_vf megares ncbi plasmidfinder resfinder vfdb victors (default '').

6.3 Install/update DBs (CARD, MEGARes)

# CARD
abricate-get_db --db card

# MEGARes (automatic install; if it errors, use the manual install below)
abricate-get_db --db megares

6.4 Manual MEGARes v3.0 install (if needed)

wget -O megares_database_v3.00.fasta \
  "https://www.meglab.org/downloads/megares_v3.00/megares_database_v3.00.fasta"

# 1) Define dbdir (adjust to your env; from logs it's inside the conda env)
DBDIR=/home/jhuang/miniconda3/envs/bengal3_ac3/db

# 2) Create a custom db folder for MEGARes v3.0
mkdir -p ${DBDIR}/megares_v3.0

# 3) Copy the downloaded MEGARes v3.0 nucleotide FASTA to 'sequences'
cp megares_database_v3.00.fasta ${DBDIR}/megares_v3.0/sequences

# 4) Build ABRicate indices
abricate --setupdb

# Confirm presence
abricate --list | egrep 'card|megares'
abricate --list | grep -i megares

6.5 Run resistome/virulome pipeline scripts

chmod +x run_resistome_virulome.sh
ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome.sh

chmod +x run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=./vrap_HF/spades/scaffolds.fasta SAMPLE=HF THREADS=32 ~/Scripts/run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 MINID=80 MINCOV=60 ./run_resistome_virulome_dedup.sh

6.6 Sanity checks on ABRicate outputs

grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.card.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab

grep -v '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.card.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab | grep -v '^[[:space:]]*$' | head -n 3

6.7 Dedup tables / “one per gene” mode

chmod +x make_dedup_tables_from_abricate.sh
OUTDIR=resistome_virulence_GE11174 SAMPLE=GE11174 ./make_dedup_tables_from_abricate.sh

chmod +x run_abricate_resistome_virulome_one_per_gene.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
ASM=GE11174.fasta \
SAMPLE=GE11174 \
OUTDIR=resistome_virulence_GE11174 \
MINID=80 MINCOV=60 \
THREADS=32 \
./run_abricate_resistome_virulome_one_per_gene.sh

Threshold summary recorded (an example ABRicate call follows this list):

  • ABRicate thresholds: MINID=70 MINCOV=50

    • MEGARes: 35 → resistome_virulence_GE11174/raw/GE11174.megares.tab
    • CARD: 28 → resistome_virulence_GE11174/raw/GE11174.card.tab
    • ResFinder: 2 → resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    • VFDB: 18 → resistome_virulence_GE11174/raw/GE11174.vfdb.tab
  • ABRicate thresholds: MINID=80 MINCOV=60

    • MEGARes: 3
    • CARD: 1
    • ResFinder: 0
    • VFDB: 0
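For reference, the wrapper scripts pass MINID/MINCOV through to ABRicate roughly as follows (a sketch of a single-database call; the wrappers loop over all four databases):

abricate --db megares --minid 80 --mincov 60 --threads 32 GE11174.fasta \
  > resistome_virulence_GE11174/raw/GE11174.megares.tab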

6.8 Merge sources + export to Excel

python merge_amr_sources_by_gene.py

python export_resistome_virulence_to_excel_py36.py \
  --workdir resistome_virulence_GE11174 \
  --sample GE11174 \
  --out Resistome_Virulence_GE11174.xlsx

6.9 Methods sentence + table captions (recorded text)

Methods sentence (AMR + virulence)

AMR genes were identified by screening the genome assembly with ABRicate against the MEGARes and ResFinder databases, using minimum identity and coverage thresholds of X% and Y%, respectively. CARD-based AMR determinants were additionally predicted using RGI (Resistance Gene Identifier) to leverage curated resistance models. Virulence factors were screened using ABRicate against VFDB under the same thresholds. Replace X/Y with your actual values (e.g., 90/60) or state “default parameters” if you truly used defaults.

Table 2 caption (AMR)

Table 2. AMR gene profiling of the genome assembly. Hits were detected using ABRicate (MEGARes and ResFinder) and RGI (CARD). The presence of AMR-associated genes does not necessarily imply phenotypic resistance, which may depend on allele type, genomic context/expression, and/or SNP-mediated mechanisms; accordingly, phenotype predictions (e.g., ResFinder) should be interpreted cautiously.

Table 3 caption (virulence)

Table 3. Virulence factor profiling of the genome assembly based on ABRicate screening against VFDB, reporting loci with sequence identity and coverage above the specified thresholds.


7) Phylogenetic tree generation (Nextflow/NCBI + Roary + RAxML-NG + R plotting)

7.1 Resolve/choose assemblies via Entrez

export NCBI_EMAIL="x.yyy@zzz.de"
./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv
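targets.tsv is a two-column TSV (label <tab> query, where query is an organism name, taxid, or assembly accession; see the script docstring in the appendix). An illustrative example using accessions from this post:

label	query
Gibbsiella_quercinecans_FRB97	GCF_002291425.1
Gibbsiella_quercinecans_DSM_25889	GCF_004342245.1
Gibbsiella_greigii_USA56	GCA_039540155.1
Brenneria_nigrifluens_LMG_5956	GCF_005484965.1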

7.2 Build tree (main pipeline) + note about R env

Recorded note:

NOTE: the bengal3_ac3 env does not have the following R packages; use r_env for the plot step → run twice: first the full pipeline under bengal3_ac3, then build_wgs_tree_fig3B.sh plot-only under r_env.

Suggested package install (if needed):

#mamba install -y -c conda-forge -c bioconda r-aplot bioconductor-ggtree r-ape r-ggplot2 r-dplyr r-readr

Run:

export ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3
export NCBI_EMAIL="x.yyy@zzz.de"
./build_wgs_tree_fig3B.sh

# Regenerate the plot
conda activate r_env
./build_wgs_tree_fig3B.sh plot-only

7.3 Manual label corrections

Edit:

  • vim work_wgs_tree/plot/labels.tsv

Recorded edits:

  • REMOVE:

    • GCA_032062225.1 EXTRA_GCA_032062225.1 (GCA_032062225.1)
    • GCF_047901425.1 EXTRA_GCF_047901425.1 (GCF_047901425.1)
  • ADAPT:

    • Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
    • Gibbsiella greigii USA56
    • Gibbsiella papilionis PWX6
    • Gibbsiella quercinecans strain FRB97
    • Brenneria nigrifluens LMG 5956

7.4 Plot with plot_tree_v4.R

Rscript work_wgs_tree/plot/plot_tree_v4.R \
  work_wgs_tree/raxmlng/core.raxml.support \
  work_wgs_tree/plot/labels.tsv \
  6 \
  work_wgs_tree/plot/core_tree.pdf \
  work_wgs_tree/plot/core_tree.png

8) DEBUG rerun recipe (drop one genome; rerun Roary → RAxML-NG → plot)

Example: drop GCF_047901425.1 (or the other listed one).

8.1 Remove from inputs

# 1.1) remove from inputs so Roary cannot include it
rm -f work_wgs_tree/gffs/GCF_047901425.1.gff
rm -f work_wgs_tree/fastas/GCF_047901425.1.fna
rm -rf work_wgs_tree/prokka/GCF_047901425.1
rm -rf work_wgs_tree/genomes_ncbi/GCF_047901425.1  #optional

# 1.2) remove from accession list so it won't come back
awk -F'\t' 'NR==1 || $2!="GCF_047901425.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
  && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

Alternative removal target:

# 2.1) remove from inputs so Roary cannot include it
rm -f work_wgs_tree/gffs/GCA_032062225.1.gff
rm -f work_wgs_tree/fastas/GCA_032062225.1.fna
rm -rf work_wgs_tree/prokka/GCA_032062225.1
rm -rf work_wgs_tree/genomes_ncbi/GCA_032062225.1  #optional

# 2.2) remove from accession list so it won't come back
awk -F'\t' 'NR==1 || $2!="GCA_032062225.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
  && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

8.2 Clean old runs + rerun Roary

# 3) delete old roary runs (so you don't accidentally reuse old alignment)
rm -rf work_wgs_tree/roary_*

# 4) rerun Roary (fresh output dir)
mkdir -p work_wgs_tree/logs
ROARY_OUT="work_wgs_tree/roary_$(date +%s)"
roary -e --mafft -p 8 -cd 95 -i 95 \
  -f "$ROARY_OUT" \
  work_wgs_tree/gffs/*.gff \
  > work_wgs_tree/logs/roary_rerun.stdout.txt \
  2> work_wgs_tree/logs/roary_rerun.stderr.txt

8.3 Point to the new core alignment and rerun RAxML-NG

# 5) point meta file to new core alignment (absolute path)
echo "$(readlink -f "$ROARY_OUT/core_gene_alignment.aln")" > work_wgs_tree/meta/core_alignment_path.txt

# 6) rerun RAxML-NG
rm -rf work_wgs_tree/raxmlng
mkdir work_wgs_tree/raxmlng/
raxml-ng --all \
  --msa "$(cat work_wgs_tree/meta/core_alignment_path.txt)" \
  --model GTR+G \
  --bs-trees 1000 \
  --threads 8 \
  --prefix work_wgs_tree/raxmlng/core

8.4 Regenerate labels + replot

# 7) Run this to regenerate labels.tsv
bash regenerate_labels.sh

# 8) Manually correct the display names in work_wgs_tree/plot/labels.tsv (e.g., with vim):
#Gibbsiella greigii USA56
#Gibbsiella papilionis PWX6
#Gibbsiella quercinecans strain FRB97
#Brenneria nigrifluens LMG 5956

# 9) Rerun only the plot step:
Rscript work_wgs_tree/plot/plot_tree.R \
  work_wgs_tree/raxmlng/core.raxml.support \
  work_wgs_tree/plot/labels.tsv \
  6 \
  work_wgs_tree/plot/core_tree.pdf \
  work_wgs_tree/plot/core_tree.png

9) fastANI + BUSCO explanations (recorded)

9.1 Reference FASTA inventory example

find . -name "*.fna"
#./work_wgs_tree/fastas/GCF_004342245.1.fna  GCF_004342245.1 Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
#./work_wgs_tree/fastas/GCF_039539505.1.fna  GCF_039539505.1 Gibbsiella papilionis PWX6 (GCF_039539505.1)
#./work_wgs_tree/fastas/GCF_005484965.1.fna  GCF_005484965.1 Brenneria nigrifluens LMG5956 (GCF_005484965.1)
#./work_wgs_tree/fastas/GCA_039540155.1.fna  GCA_039540155.1 Gibbsiella greigii USA56 (GCA_039540155.1)
#./work_wgs_tree/fastas/GE11174.fna
#./work_wgs_tree/fastas/GCF_002291425.1.fna  GCF_002291425.1 Gibbsiella quercinecans FRB97 (GCF_002291425.1)

9.2 fastANI runs

mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_004342245.1.fna \
  -o fastANI_out_Gibbsiella_quercinecans_DSM_25889.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_039539505.1.fna \
  -o fastANI_out_Gibbsiella_papilionis_PWX6.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_005484965.1.fna \
  -o fastANI_out_Brenneria_nigrifluens_LMG5956.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCA_039540155.1.fna \
  -o fastANI_out_Gibbsiella_greigii_USA56.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_002291425.1.fna \
  -o fastANI_out_Gibbsiella_quercinecans_FRB97.txt

cat fastANI_out_*.txt > fastANI_out.txt
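The per-reference runs above can be expressed equivalently as a loop (a minimal sketch; output names are derived from reference file names rather than species names):

for r in ./work_wgs_tree/fastas/GC*.fna; do
  fastANI -q GE11174.fasta -r "$r" -o "fastANI_out_$(basename "$r" .fna).txt"
done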

9.3 fastANI output table (recorded; columns: query, reference, ANI %, matched fragments, total query fragments)

GE11174.fasta   ./work_wgs_tree/fastas/GCF_005484965.1.fna      79.1194 597     1890
GE11174.fasta   ./work_wgs_tree/fastas/GCA_039540155.1.fna      95.9589 1547    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_039539505.1.fna      97.2172 1588    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_004342245.1.fna      98.0889 1599    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_002291425.1.fna      98.1285 1622    1890

9.4 Species boundary note (recorded; translated from Chinese)

In bacterial genome comparisons, a commonly used empirical threshold is:

  • ANI ≥ 95–96%: generally regarded as the same species
  • The 97.09% here → very likely indicates that An6 and HITLi7 belong to the same species, but possibly not the same strain, since some divergence remains.

Whether two isolates are the "same strain" usually also depends on:

  • core-genome SNP distance, cgMLST
  • assembly quality / contamination
  • whether coverage is sufficiently high

9.5 BUSCO results interpretation (recorded; translated from Chinese)

Quick read of the BUSCO results (as an aside; these metrics are already included in Table 1):

  • Complete 99.2%, Missing 0.0%: the assembly is highly complete (excellent for a bacterial genome)
  • Duplicated 0.0%: few duplicated copies, so lower risk of contamination or sample mixing
  • Scaffolds 80, N50 ~169 kb: somewhat fragmented, but overall quality is more than sufficient for ANI / species identification

10) fastANI explanation (recorded narrative)

From your tree and the fastANI table, GE11174 is clearly inside the Gibbsiella quercinecans clade, and far from the outgroup (Brenneria nigrifluens). The ANI values quantify that same pattern.

1) Outgroup check (sanity)

  • GE11174 vs Brenneria nigrifluens (GCF_005484965.1): ANI 79.12% (597/1890 fragments)

    • 79% ANI is way below any species boundary → not the same genus/species.
    • On the tree, Brenneria sits on a long branch as the outgroup, consistent with this deep divergence.
    • The relatively low matched fragments (597/1890) also fits “distant genomes” (fewer orthologous regions pass the ANI mapping filters).

2) Species-level placement of GE11174

The common rule of thumb noted in 9.4 is correct: ANI ≥ 95–96% ⇒ same species.

Compare GE11174 to the Gibbsiella references:

  • vs GCA_039540155.1 (Gibbsiella greigii USA56): 95.96% (1547/1890)

    • Right at the boundary. This suggests “close but could be different species” or “taxonomy/labels may not reflect true species boundaries” depending on how those genomes are annotated.
    • On the tree, G. greigii is outside the quercinecans group but not hugely far, which matches “borderline ANI”.
  • vs GCF_039539505.1 (Gibbsiella papilionis PWX6): 97.22% (1588/1890)

  • vs GCF_004342245.1 (G. quercinecans DSM 25889): 98.09% (1599/1890)

  • vs GCF_002291425.1 (G. quercinecans FRB97): 98.13% (1622/1890)

These are all comfortably above 96%, especially the two quercinecans genomes (~98.1%). That strongly supports:

GE11174 belongs to the same species as Gibbsiella quercinecans (and is closer to quercinecans references than to greigii).

3) Closest reference and “same strain?” question

GE11174’s closest by ANI in your list is:

  • FRB97 (GCF_002291425.1): 98.1285%
  • DSM 25889 (GCF_004342245.1): 98.0889%
  • Next: PWX6 97.2172%

These differences are small, but 98.1% ANI is not “same strain” evidence by itself. Within a species, different strains commonly sit anywhere from ~96–99.9% ANI depending on diversity. To claim “same strain / very recent transmission,” people usually look for much tighter genome-wide similarity:

  • core-genome SNP distance (often single digits to tens, depending on organism and context)
  • cgMLST allele differences
  • recombination filtering (if relevant)
  • assembly QC/contamination checks
  • and confirming that alignment/ANI coverage is high and not biased by missing regions

Your fragment matches (e.g., 1622/1890 for FRB97) are reasonably high, supporting that the ANI is meaningful, but it still doesn’t equate to “same strain.”

4) How to phrase the combined interpretation (tree + ANI)

A clear summary you can use:

  • The phylogenetic tree places GE11174 within the Gibbsiella quercinecans lineage, with Brenneria nigrifluens as a distant outgroup.
  • fastANI supports this:

    • ~98.1% ANI to G. quercinecans FRB97 and DSM 25889 → strong same-species support.
    • 97.2% to G. papilionis → still same-species range, but more distant than the quercinecans references.
    • 95.96% to G. greigii → borderline; consistent with being a close neighboring lineage but not the best species match for GE11174.
    • 79.1% to Brenneria → confirms it is an appropriate outgroup and far outside the species/genus boundary.
  • Therefore, GE11174 is very likely Gibbsiella quercinecans (species-level), and appears most similar to FRB97/DSM 25889, but additional high-resolution analyses are required to assess “same strain.”

For a tighter strain-level conclusion, run fastANI against a broader reference set and compute core-genome SNP distances from the Roary core alignment (e.g., with snp-dists).


11) Compare to the next closest isolate (pairwise alignment) — nf-core/pairgenomealign

conda deactivate

nextflow run nf-core/pairgenomealign -r 2.2.2 -profile docker \
  --target GE11174.fasta \
  --input samplesheet.csv \
  --outdir pairgenomealign_out

Recorded TODO:

#TODO_NEXT_MONDAY: * phylogenetic tree + fastANI + nf-core/pairgenomealign (compare to the closest isolate; https://nf-co.re/pairgenomealign/2.2.1/)

  • summarize all results in a mail and send them back, mentioning that we can submit the genome to NCBI to obtain a high-quality annotation. What strain name would you like to assign to this isolate?
  • If they agree, I can submit the two new isolates to the NCBI database!

12) Submit both sequences in a batch to the NCBI server (planned step)

Recorded as:

  1. submit both sequences in a batch to the NCBI server

13) Find the closest isolate from GenBank (robust approach) for STEP_7

13.1 Download all available Gibbsiella genomes

# download all available genomes for the genus Gibbsiella (includes assemblies + metadata)
# --assembly-level: must be 'chromosome', 'complete', 'contig', 'scaffold'
datasets download genome taxon Gibbsiella --include genome,gff3,gbff --assembly-level complete,chromosome,scaffold --filename gibbsiella.zip
unzip -q gibbsiella.zip -d gibbsiella_ncbi

mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3

13.2 Mash sketching + nearest neighbors

# make a Mash sketch of your isolate
mash sketch -o isolate bacass_out/Unicycler/GE11174.scaffolds.fa

# sketch all reference genomes (example path—adjust)
find gibbsiella_ncbi -name "*.fna" -o -name "*.fasta" > refs.txt
mash sketch -o refs -l refs.txt

# get closest genomes
mash dist isolate.msh refs.msh | sort -gk3 | head -n 20 > top20_mash.txt

Recorded interpretation:

  • Best hits to GE11174.scaffolds.fa are:

    • GCA/GCF_002291425.1 (GenBank + RefSeq copies of the same assembly)
    • GCA/GCF_004342245.1 (same duplication pattern)
    • GCA/GCF_047901425.1 (FRB97; also duplicated)
  • Mash distances of ~0.018–0.020 (third column of the mash dist output) are very close; Mash distance roughly approximates 1 - ANI, so ~0.02 corresponds to roughly 98% ANI, comfortably within-species.
  • The reported p-values of 0 are floating-point underflow, i.e., extremely significant.

13.3 Remove duplicates (GCA vs GCF)

Goal: keep one of each duplicated assembly (prefer GCF if available).

Example snippet recorded:

# Take your top hits, prefer GCF over GCA
cat top20_mash.txt \
  | awk '{print $2}' \
  | sed 's|/GCA_.*||; s|/GCF_.*||' \
  | sort -u

Manual suggestion recorded (an automated alternative follows below):

  • keep GCF_002291425.1 (drop GCA_002291425.1)
  • keep GCF_004342245.1
  • keep GCF_047901425.1
  • optionally keep GCA_032062225.1 if it’s truly different and you want a more distant ingroup point
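To automate the GCF-over-GCA preference, a minimal sketch (assuming the accessions appear verbatim in the reference paths within top20_mash.txt):

grep -oE 'GC[AF]_[0-9]+\.[0-9]+' top20_mash.txt | sort -u \
  | awk '{key=substr($0,5)}
         /^GCF_/ {gcf[key]=$0}
         /^GCA_/ {gca[key]=$0}
         END {for (k in gcf) print gcf[k];
              for (k in gca) if (!(k in gcf)) print gca[k]}' \
  > top_hits_dedup.txt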

Appendix — Complete attached code (standalone)

Below are the full contents of the attached scripts exactly as provided, so this post can be used standalone in the future.

Note: this appendix preserves the attached scripts verbatim. Helper scripts referenced above but not attached (e.g., run_resistome_virulome*.sh, make_dedup_tables_from_abricate.sh, regenerate_labels.sh, plot_tree_v4.R) are not reproduced here.


File: build_wgs_tree_fig3B.sh

#!/usr/bin/env bash
set -euo pipefail

# build_wgs_tree_fig3B.sh
#
# Purpose:
#   Build a core-genome phylogenetic tree and a publication-style plot similar to Fig 3B.
#
# Usage:
#   ./build_wgs_tree_fig3B.sh            # full run
#   ./build_wgs_tree_fig3B.sh plot-only  # only regenerate the plot from existing outputs
#
# Requirements:
#   - Conda env with required tools. Set ENV_NAME to conda env path.
#   - NCBI datasets and/or Entrez usage requires NCBI_EMAIL.
#   - Roary, Prokka, RAxML-NG, MAFFT, R packages for plotting.
#
# Environment variables:
#   ENV_NAME      : path to conda env (e.g., /home/jhuang/miniconda3/envs/bengal3_ac3)
#   NCBI_EMAIL    : email for Entrez calls
#   THREADS       : default threads
#
# Inputs:
#   - targets.tsv: list of target accessions (if used in resolve step)
#   - local isolate genome fasta
#
# Outputs:
#   work_wgs_tree/...
#
# NOTE:
#   If plotting packages are missing in ENV_NAME, run plot-only under an R-capable env (e.g., r_env).

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MODE="${1:-full}"

THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-work_wgs_tree}"

# Activate conda env if provided
if [[ -n "${ENV_NAME:-}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"
mkdir -p "${WORKDIR}/logs"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# ------------------------------------------------------------------------------
# Helper: check command exists
need_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command '$1' not found in PATH" >&2
    exit 1
  }
}

# ------------------------------------------------------------------------------
# Tool checks (plot-only skips some)
if [[ "${MODE}" != "plot-only" ]]; then
  need_cmd python
  need_cmd roary
  need_cmd raxml-ng
  need_cmd prokka
  need_cmd mafft
  need_cmd awk
  need_cmd sed
  need_cmd grep
fi

need_cmd Rscript

# ------------------------------------------------------------------------------
# Paths
META_DIR="${WORKDIR}/meta"
GENOMES_DIR="${WORKDIR}/genomes_ncbi"
FASTAS_DIR="${WORKDIR}/fastas"
GFFS_DIR="${WORKDIR}/gffs"
PROKKA_DIR="${WORKDIR}/prokka"
ROARY_DIR="${WORKDIR}/roary"
RAXML_DIR="${WORKDIR}/raxmlng"
PLOT_DIR="${WORKDIR}/plot"

mkdir -p "${META_DIR}" "${GENOMES_DIR}" "${FASTAS_DIR}" "${GFFS_DIR}" "${PROKKA_DIR}" "${ROARY_DIR}" "${RAXML_DIR}" "${PLOT_DIR}"

ACCESSIONS_TSV="${META_DIR}/accessions.tsv"
LABELS_TSV="${PLOT_DIR}/labels.tsv"
CORE_ALIGN_PATH_FILE="${META_DIR}/core_alignment_path.txt"

# ------------------------------------------------------------------------------
# Step: plot only
if [[ "${MODE}" == "plot-only" ]]; then
  log "Running in plot-only mode..."

  # If labels file isn't present, try generating a minimal one
  if [[ ! -s "${LABELS_TSV}" ]]; then
    log "labels.tsv not found. Creating a placeholder labels.tsv (edit as needed)."
    {
      echo -e "accession\tdisplay"
      if [[ -d "${FASTAS_DIR}" ]]; then
        # unmatched globs stay literal and are skipped by the -e guard below
        for f in "${FASTAS_DIR}"/*.fna "${FASTAS_DIR}"/*.fa "${FASTAS_DIR}"/*.fasta; do
          [[ -e "$f" ]] || continue
          bn="$(basename "$f")"
          acc="${bn%.*}"   # strip only the extension, keeping assembly versions like GCF_002291425.1
          echo -e "${acc}\t${acc}"
        done
      fi
    } > "${LABELS_TSV}"
  fi

  # Plot using plot_tree_v4.R if present; otherwise fall back to plot_tree.R
  PLOT_SCRIPT="${PLOT_DIR}/plot_tree_v4.R"
  if [[ ! -f "${PLOT_SCRIPT}" ]]; then
    PLOT_SCRIPT="${SCRIPT_DIR}/plot_tree_v4.R"
  fi

  if [[ ! -f "${PLOT_SCRIPT}" ]]; then
    echo "ERROR: plot_tree_v4.R not found" >&2
    exit 1
  fi

  SUPPORT_FILE="${RAXML_DIR}/core.raxml.support"
  if [[ ! -f "${SUPPORT_FILE}" ]]; then
    echo "ERROR: Support file not found: ${SUPPORT_FILE}" >&2
    exit 1
  fi

  OUTPDF="${PLOT_DIR}/core_tree.pdf"
  OUTPNG="${PLOT_DIR}/core_tree.png"
  ROOT_N=6

  log "Plotting tree..."
  Rscript "${PLOT_SCRIPT}" \
    "${SUPPORT_FILE}" \
    "${LABELS_TSV}" \
    "${ROOT_N}" \
    "${OUTPDF}" \
    "${OUTPNG}"

  log "Done (plot-only). Outputs: ${OUTPDF} ${OUTPNG}"
  exit 0
fi

# ------------------------------------------------------------------------------
# Full pipeline
log "Running full pipeline..."

# ------------------------------------------------------------------------------
# Config / expected inputs
TARGETS_TSV="${SCRIPT_DIR}/targets.tsv"
RESOLVED_TSV="${SCRIPT_DIR}/resolved_accessions.tsv"
ISOLATE_FASTA="${SCRIPT_DIR}/GE11174.fasta"

# If caller has different locations, let them override
TARGETS_TSV="${TARGETS_TSV_OVERRIDE:-${TARGETS_TSV}}"
RESOLVED_TSV="${RESOLVED_TSV_OVERRIDE:-${RESOLVED_TSV}}"
ISOLATE_FASTA="${ISOLATE_FASTA_OVERRIDE:-${ISOLATE_FASTA}}"

# ------------------------------------------------------------------------------
# Step 1: Resolve best assemblies (if targets.tsv exists)
if [[ -f "${TARGETS_TSV}" ]]; then
  log "Resolving best assemblies from targets.tsv..."
  if [[ -z "${NCBI_EMAIL:-}" ]]; then
    echo "ERROR: NCBI_EMAIL is required for Entrez calls" >&2
    exit 1
  fi
  python "${SCRIPT_DIR}/resolve_best_assemblies_entrez.py" "${TARGETS_TSV}" "${RESOLVED_TSV}"
else
  log "targets.tsv not found; assuming resolved_accessions.tsv already exists"
fi

if [[ ! -s "${RESOLVED_TSV}" ]]; then
  echo "ERROR: resolved_accessions.tsv not found or empty: ${RESOLVED_TSV}" >&2
  exit 1
fi

# ------------------------------------------------------------------------------
# Step 2: Prepare accessions.tsv for downstream steps
log "Preparing accessions.tsv..."
{
  echo -e "label\taccession"
  awk -F'\t' 'NR>1 {print $1"\t"$2}' "${RESOLVED_TSV}"
} > "${ACCESSIONS_TSV}"

# ------------------------------------------------------------------------------
# Step 3: Download genomes (NCBI datasets if available)
log "Downloading genomes (if needed)..."
need_cmd datasets
need_cmd unzip

while IFS=$'\t' read -r label acc; do
  [[ "${label}" == "label" ]] && continue
  [[ -z "${acc}" ]] && continue

  OUTZIP="${GENOMES_DIR}/${acc}.zip"
  OUTDIR="${GENOMES_DIR}/${acc}"

  if [[ -d "${OUTDIR}" ]]; then
    log "  ${acc}: already downloaded"
    continue
  fi

  log "  ${acc}: downloading..."
  datasets download genome accession "${acc}" --include genome --filename "${OUTZIP}" \
    > "${WORKDIR}/logs/datasets_${acc}.stdout.txt" \
    2> "${WORKDIR}/logs/datasets_${acc}.stderr.txt" || {
      echo "ERROR: datasets download failed for ${acc}. See logs." >&2
      exit 1
    }

  mkdir -p "${OUTDIR}"
  unzip -q "${OUTZIP}" -d "${OUTDIR}"
done < "${ACCESSIONS_TSV}"

# ------------------------------------------------------------------------------
# Step 4: Collect FASTA files
log "Collecting FASTA files..."
rm -f "${FASTAS_DIR}"/* 2>/dev/null || true

while IFS=$'\t' read -r label acc; do
  [[ "${label}" == "label" ]] && continue
  OUTDIR="${GENOMES_DIR}/${acc}"
  fna="$(find "${OUTDIR}" -name "*.fna" | head -n 1 || true)"
  if [[ -z "${fna}" ]]; then
    echo "ERROR: no .fna found for ${acc} in ${OUTDIR}" >&2
    exit 1
  fi
  cp -f "${fna}" "${FASTAS_DIR}/${acc}.fna"
done < "${ACCESSIONS_TSV}"

# Add isolate
if [[ -f "${ISOLATE_FASTA}" ]]; then
  cp -f "${ISOLATE_FASTA}" "${FASTAS_DIR}/GE11174.fna"
else
  log "WARNING: isolate fasta not found at ${ISOLATE_FASTA}; skipping"
fi

# ------------------------------------------------------------------------------
# Step 5: Run Prokka on each genome
log "Running Prokka..."
for f in "${FASTAS_DIR}"/*.fna; do
  bn="$(basename "${f}")"
  acc="${bn%.fna}"
  out="${PROKKA_DIR}/${acc}"

  if [[ -d "${out}" && -s "${out}/${acc}.gff" ]]; then
    log "  ${acc}: prokka output exists"
    continue
  fi

  mkdir -p "${out}"
  log "  ${acc}: prokka..."
  prokka --outdir "${out}" --prefix "${acc}" --cpus "${THREADS}" "${f}" \
    > "${WORKDIR}/logs/prokka_${acc}.stdout.txt" \
    2> "${WORKDIR}/logs/prokka_${acc}.stderr.txt"
done

# ------------------------------------------------------------------------------
# Step 6: Collect GFFs for Roary
log "Collecting GFFs..."
rm -f "${GFFS_DIR}"/*.gff 2>/dev/null || true
for d in "${PROKKA_DIR}"/*; do
  [[ -d "${d}" ]] || continue
  acc="$(basename "${d}")"
  gff="${d}/${acc}.gff"
  if [[ -f "${gff}" ]]; then
    cp -f "${gff}" "${GFFS_DIR}/${acc}.gff"
  else
    log "WARNING: missing GFF for ${acc}"
  fi
done

# ------------------------------------------------------------------------------
# Step 7: Roary
log "Running Roary..."
ROARY_OUT="${WORKDIR}/roary_$(date +%s)"
# roary creates the -f output directory itself; pre-creating it can make roary rename its output
roary -e --mafft -p "${THREADS}" -cd 95 -i 95 \
  -f "${ROARY_OUT}" \
  "${GFFS_DIR}"/*.gff \
  > "${WORKDIR}/logs/roary.stdout.txt" \
  2> "${WORKDIR}/logs/roary.stderr.txt"

CORE_ALN="${ROARY_OUT}/core_gene_alignment.aln"
if [[ ! -f "${CORE_ALN}" ]]; then
  echo "ERROR: core alignment not found: ${CORE_ALN}" >&2
  exit 1
fi
readlink -f "${CORE_ALN}" > "${CORE_ALIGN_PATH_FILE}"

# ------------------------------------------------------------------------------
# Step 8: RAxML-NG
log "Running RAxML-NG..."
rm -rf "${RAXML_DIR}"
mkdir -p "${RAXML_DIR}"
raxml-ng --all \
  --msa "$(cat "${CORE_ALIGN_PATH_FILE}")" \
  --model GTR+G \
  --bs-trees 1000 \
  --threads "${THREADS}" \
  --prefix "${RAXML_DIR}/core" \
  > "${WORKDIR}/logs/raxml.stdout.txt" \
  2> "${WORKDIR}/logs/raxml.stderr.txt"

SUPPORT_FILE="${RAXML_DIR}/core.raxml.support"
if [[ ! -f "${SUPPORT_FILE}" ]]; then
  echo "ERROR: RAxML support file not found: ${SUPPORT_FILE}" >&2
  exit 1
fi

# ------------------------------------------------------------------------------
# Step 9: Generate labels.tsv (basic)
log "Generating labels.tsv..."
{
  echo -e "accession\tdisplay"
  echo -e "GE11174\tGE11174"
  while IFS=$'\t' read -r label acc; do
    [[ "${label}" == "label" ]] && continue
    echo -e "${acc}\t${label} (${acc})"
  done < "${ACCESSIONS_TSV}"
} > "${LABELS_TSV}"

log "NOTE: You may want to manually edit ${LABELS_TSV} for publication display names."

# ------------------------------------------------------------------------------
# Step 10: Plot
log "Plotting..."
PLOT_SCRIPT="${SCRIPT_DIR}/plot_tree_v4.R"
OUTPDF="${PLOT_DIR}/core_tree.pdf"
OUTPNG="${PLOT_DIR}/core_tree.png"
ROOT_N=6

Rscript "${PLOT_SCRIPT}" \
  "${SUPPORT_FILE}" \
  "${LABELS_TSV}" \
  "${ROOT_N}" \
  "${OUTPDF}" \
  "${OUTPNG}" \
  > "${WORKDIR}/logs/plot.stdout.txt" \
  2> "${WORKDIR}/logs/plot.stderr.txt"

log "Done. Outputs:"
log "  Tree support: ${SUPPORT_FILE}"
log "  Labels:       ${LABELS_TSV}"
log "  Plot PDF:     ${OUTPDF}"
log "  Plot PNG:     ${OUTPNG}"

File: resolve_best_assemblies_entrez.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
resolve_best_assemblies_entrez.py

Resolve a "best" assembly accession for a list of target taxa / accessions using NCBI Entrez.

Usage:
  ./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv

Input (targets.tsv):
  TSV with at least columns:
    label <tab> query
  Where "query" can be an organism name, taxid, or an assembly/accession hint.

Output (resolved_accessions.tsv):
  TSV with columns:
    label <tab> accession <tab> organism <tab> assembly_name <tab> assembly_level <tab> refseq_category

Requires:
  - BioPython (Entrez)
  - NCBI_EMAIL environment variable (or set in script)
"""

import os
import sys
import time
import csv
from typing import Dict, List, Optional, Tuple

try:
    from Bio import Entrez
except ImportError:
    sys.stderr.write("ERROR: Biopython is required (Bio.Entrez)\n")
    sys.exit(1)

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_targets(path: str) -> List[Tuple[str, str]]:
    rows: List[Tuple[str, str]] = []
    with open(path, "r", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        for i, r in enumerate(reader, start=1):
            if not r:
                continue
            if i == 1 and r[0].lower() in ("label", "name"):
                # header
                continue
            if len(r) < 2:
                continue
            label = r[0].strip()
            query = r[1].strip()
            if label and query:
                rows.append((label, query))
    return rows

def entrez_search(db: str, term: str, retmax: int = 20) -> List[str]:
    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    res = Entrez.read(handle)
    handle.close()
    return res.get("IdList", [])

def entrez_summary(db: str, ids: List[str]):
    if not ids:
        return []
    handle = Entrez.esummary(db=db, id=",".join(ids), retmode="xml")
    res = Entrez.read(handle)
    handle.close()
    return res

def pick_best_assembly(summaries) -> Optional[Dict]:
    """
    Heuristics:
      Prefer RefSeq (refseq_category != 'na'), prefer higher assembly level:
        complete genome > chromosome > scaffold > contig
      Then prefer latest / highest quality where possible.
    """
    if not summaries:
        return None

    level_rank = {
        "Complete Genome": 4,
        "Chromosome": 3,
        "Scaffold": 2,
        "Contig": 1
    }

    def score(s: Dict) -> Tuple[int, int, int]:
        refcat = s.get("RefSeq_category", "na")
        is_refseq = 1 if (refcat and refcat.lower() != "na") else 0
        level = s.get("AssemblyStatus", "")
        lvl = level_rank.get(level, 0)
        # Prefer latest submit date (YYYY/MM/DD)
        submit = s.get("SubmissionDate", "0000/00/00")
        try:
            y, m, d = submit.split("/")
            date_int = int(y) * 10000 + int(m) * 100 + int(d)
        except Exception:
            date_int = 0
        return (is_refseq, lvl, date_int)

    best = max(summaries, key=score)
    return best

def resolve_query(label: str, query: str) -> Optional[Dict]:
    # If query looks like an assembly accession, search directly.
    term = query
    if query.startswith("GCA_") or query.startswith("GCF_"):
        term = f"{query}[Assembly Accession]"

    ids = entrez_search(db="assembly", term=term, retmax=50)
    if not ids:
        # Try organism name search
        term2 = f"{query}[Organism]"
        ids = entrez_search(db="assembly", term=term2, retmax=50)

    if not ids:
        eprint(f"WARNING: no assembly hits for {label} / {query}")
        return None

    summaries = entrez_summary(db="assembly", ids=ids)
    best = pick_best_assembly(summaries)
    if not best:
        eprint(f"WARNING: could not pick best assembly for {label} / {query}")
        return None

    # Extract useful fields
    acc = best.get("AssemblyAccession", "")
    org = best.get("Organism", "")
    name = best.get("AssemblyName", "")
    level = best.get("AssemblyStatus", "")
    refcat = best.get("RefSeq_category", "")

    return {
        "label": label,
        "accession": acc,
        "organism": org,
        "assembly_name": name,
        "assembly_level": level,
        "refseq_category": refcat
    }

def main():
    if len(sys.argv) != 3:
        eprint("Usage: resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv")
        sys.exit(1)

    targets_path = sys.argv[1]
    out_path = sys.argv[2]

    email = os.environ.get("NCBI_EMAIL") or os.environ.get("ENTREZ_EMAIL")
    if not email:
        eprint("ERROR: please set NCBI_EMAIL environment variable (e.g., export NCBI_EMAIL='you@domain')")
        sys.exit(1)
    Entrez.email = email

    targets = read_targets(targets_path)
    if not targets:
        eprint("ERROR: no targets found in input TSV")
        sys.exit(1)

    out_rows: List[Dict] = []
    for label, query in targets:
        eprint(f"Resolving: {label}\t{query}")
        res = resolve_query(label, query)
        if res:
            out_rows.append(res)
        time.sleep(0.34)  # be nice to NCBI

    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["label", "accession", "organism", "assembly_name", "assembly_level", "refseq_category"])
        for r in out_rows:
            w.writerow([
                r.get("label", ""),
                r.get("accession", ""),
                r.get("organism", ""),
                r.get("assembly_name", ""),
                r.get("assembly_level", ""),
                r.get("refseq_category", "")
            ])

    eprint(f"Wrote: {out_path} ({len(out_rows)} rows)")

if __name__ == "__main__":
    main()

File: make_table1_GE11174.sh

#!/usr/bin/env bash
set -euo pipefail

# make_table1_GE11174.sh
#
# Generate a "Table 1" summary for sample GE11174:
# - sequencing summary (reads, mean length, etc.)
# - assembly stats
# - BUSCO, N50, etc.
#
# Expects to be run with:
#   ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh
#
# This script writes work products to:
#   table1_GE11174_work/

SAMPLE="${SAMPLE:-GE11174}"
THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-table1_${SAMPLE}_work}"

AUTO_INSTALL="${AUTO_INSTALL:-0}"
ENV_NAME="${ENV_NAME:-}"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# Activate conda env if requested
if [[ -n "${ENV_NAME}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"
mkdir -p "${WORKDIR}/logs"

# ------------------------------------------------------------------------------
# Basic tool checks
need_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command '$1' not found in PATH" >&2
    exit 1
  }
}

need_cmd awk
need_cmd grep
need_cmd sed
need_cmd wc
need_cmd python

# Optional tools
if command -v seqkit >/dev/null 2>&1; then
  HAVE_SEQKIT=1
else
  HAVE_SEQKIT=0
fi

if command -v pigz >/dev/null 2>&1; then
  HAVE_PIGZ=1
else
  HAVE_PIGZ=0
fi

# ------------------------------------------------------------------------------
# Inputs
RAWREADS="${RAWREADS:-${SAMPLE}.rawreads.fastq.gz}"
ASM_FASTA="${ASM_FASTA:-${SAMPLE}.fasta}"

if [[ ! -f "${RAWREADS}" ]]; then
  log "WARNING: raw reads file not found: ${RAWREADS}"
fi

if [[ ! -f "${ASM_FASTA}" ]]; then
  log "WARNING: assembly fasta not found: ${ASM_FASTA}"
fi

# ------------------------------------------------------------------------------
# Sequencing summary
log "Computing sequencing summary..."
READS_N="NA"
MEAN_LEN="NA"

if [[ -f "${RAWREADS}" ]]; then
  if [[ "${HAVE_PIGZ}" -eq 1 ]]; then
    READS_N="$(pigz -dc "${RAWREADS}" | awk 'END{print NR/4}')"
  else
    READS_N="$(gzip -dc "${RAWREADS}" | awk 'END{print NR/4}')"
  fi

  if [[ "${HAVE_SEQKIT}" -eq 1 ]]; then
    # parse seqkit stats output (columns: file format type num_seqs sum_len min_len avg_len max_len)
    MEAN_LEN="$(seqkit stats "${RAWREADS}" | awk 'NR==2{print $7}')"
  fi
fi

# ------------------------------------------------------------------------------
# Assembly stats (simple)
log "Computing assembly stats..."
ASM_SIZE="NA"
ASM_CONTIGS="NA"

if [[ -f "${ASM_FASTA}" ]]; then
  # Count contigs and sum length
  ASM_CONTIGS="$(grep -c '^>' "${ASM_FASTA}" || true)"
  ASM_SIZE="$(grep -v '^>' "${ASM_FASTA}" | tr -d '\n' | wc -c | awk '{print $1}')"
fi

# ------------------------------------------------------------------------------
# Output a basic TSV summary (can be expanded)
OUT_TSV="${WORKDIR}/table1_${SAMPLE}.tsv"
{
  echo -e "sample\treads_total\tmean_read_length_bp\tassembly_contigs\tassembly_size_bp"
  echo -e "${SAMPLE}\t${READS_N}\t${MEAN_LEN}\t${ASM_CONTIGS}\t${ASM_SIZE}"
} > "${OUT_TSV}"

log "Wrote: ${OUT_TSV}"

File: export_table1_stats_to_excel_py36_compat.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Export a comprehensive Excel workbook from a Table1 pipeline workdir.
Python 3.6 compatible (no PEP604 unions, no builtin generics).
Requires: openpyxl

Sheets (as available):
- Summary
- Table1 (if Table1_*.tsv exists)
- QUAST_report (report.tsv)
- QUAST_metrics (metric/value)
- Mosdepth_summary (*.mosdepth.summary.txt)
- CheckM (checkm_summary.tsv)
- GUNC_* (all .tsv under gunc/out)
- File_Inventory (relative path, size, mtime; optional md5 for small files)
- Run_log_preview (head/tail of latest log under workdir/logs or workdir/*/logs)
"""

from __future__ import print_function

import argparse
import csv
import hashlib
import os
import re
import sys
import time
from pathlib import Path

try:
    from openpyxl import Workbook
    from openpyxl.utils import get_column_letter
except ImportError:
    sys.stderr.write("ERROR: openpyxl is required. Install with:\n"
                     "  conda install -c conda-forge openpyxl\n")
    raise

MAX_XLSX_ROWS = 1048576

def safe_sheet_name(name, used):
    # Excel: <=31 chars, cannot contain: : \ / ? * [ ]
    bad = r'[:\\/?*\[\]]'
    base = name.strip() or "Sheet"
    base = re.sub(bad, "_", base)
    base = base[:31]
    if base not in used:
        used.add(base)
        return base
    # make unique with suffix
    for i in range(2, 1000):
        suffix = "_%d" % i
        cut = 31 - len(suffix)
        candidate = (base[:cut] + suffix)
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("Too many duplicate sheet names for base=%s" % base)

def autosize(ws, max_width=60):
    for col in ws.columns:
        max_len = 0
        col_letter = get_column_letter(col[0].column)
        for cell in col:
            v = cell.value
            if v is None:
                continue
            s = str(v)
            if len(s) > max_len:
                max_len = len(s)
        ws.column_dimensions[col_letter].width = min(max_width, max(10, max_len + 2))

def write_table(ws, header, rows, max_rows=None):
    if header:
        ws.append(header)
    count = 0
    for r in rows:
        ws.append(r)
        count += 1
        if max_rows is not None and count >= max_rows:
            break

def read_tsv(path, max_rows=None):
    header = []
    rows = []
    with path.open("r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, r in enumerate(reader):
            if i == 0:
                header = r
                continue
            rows.append(r)
            if max_rows is not None and len(rows) >= max_rows:
                break
    return header, rows

def read_text_table(path, max_rows=None):
    # for mosdepth summary (tsv with header)
    return read_tsv(path, max_rows=max_rows)

def md5_file(path, chunk=1024*1024):
    h = hashlib.md5()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

def find_latest_log(workdir):
    candidates = []
    # common locations
    for p in [workdir / "logs", workdir / "log", workdir / "Logs"]:
        if p.exists():
            candidates.extend(p.glob("*.log"))
    # nested logs
    candidates.extend(workdir.glob("**/logs/*.log"))
    if not candidates:
        return None
    candidates.sort(key=lambda x: x.stat().st_mtime, reverse=True)
    return candidates[0]

def add_summary_sheet(wb, used, info_items):
    ws = wb.create_sheet(title=safe_sheet_name("Summary", used))
    ws.append(["Key", "Value"])
    for k, v in info_items:
        ws.append([k, v])
    autosize(ws)

def add_log_preview(wb, used, log_path, head_n=80, tail_n=120):
    if log_path is None or not log_path.exists():
        return
    ws = wb.create_sheet(title=safe_sheet_name("Run_log_preview", used))
    ws.append(["Log path", str(log_path)])
    ws.append([])
    lines = log_path.read_text(errors="replace").splitlines()
    ws.append(["--- HEAD (%d) ---" % head_n])
    for line in lines[:head_n]:
        ws.append([line])
    ws.append([])
    ws.append(["--- TAIL (%d) ---" % tail_n])
    for line in lines[-tail_n:]:
        ws.append([line])
    ws.column_dimensions["A"].width = 120

def add_file_inventory(wb, used, workdir, do_md5=True, md5_max_bytes=200*1024*1024, max_rows=None):
    ws = wb.create_sheet(title=safe_sheet_name("File_Inventory", used))
    ws.append(["relative_path", "size_bytes", "mtime_iso", "md5(optional)"])
    count = 0
    for p in sorted(workdir.rglob("*")):
        if p.is_dir():
            continue
        rel = str(p.relative_to(workdir))
        st = p.stat()
        mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
        md5 = ""
        if do_md5 and st.st_size <= md5_max_bytes:
            try:
                md5 = md5_file(p)
            except Exception:
                md5 = "ERROR"
        ws.append([rel, st.st_size, mtime, md5])
        count += 1
        if max_rows is not None and count >= max_rows:
            break
    autosize(ws, max_width=80)

def add_tsv_sheet(wb, used, name, path, max_rows=None):
    header, rows = read_tsv(path, max_rows=max_rows)
    ws = wb.create_sheet(title=safe_sheet_name(name, used))
    write_table(ws, header, rows, max_rows=max_rows)
    autosize(ws, max_width=80)

def add_quast_metrics_sheet(wb, used, quast_report_tsv):
    header, rows = read_tsv(quast_report_tsv, max_rows=None)
    if not header or len(header) < 2:
        return
    asm_name = header[1]
    ws = wb.create_sheet(title=safe_sheet_name("QUAST_metrics", used))
    ws.append(["Metric", asm_name])
    for r in rows:
        if not r:
            continue
        metric = r[0]
        val = r[1] if len(r) > 1 else ""
        ws.append([metric, val])
    autosize(ws, max_width=80)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workdir", required=True, help="workdir produced by pipeline (e.g., table1_GE11174_work)")
    ap.add_argument("--out", required=True, help="output .xlsx")
    ap.add_argument("--sample", default="", help="sample name for summary")
    ap.add_argument("--max-rows", type=int, default=200000, help="max rows per large sheet")
    ap.add_argument("--no-md5", action="store_true", help="skip md5 calculation in File_Inventory")
    args = ap.parse_args()

    workdir = Path(args.workdir).resolve()
    out = Path(args.out).resolve()

    if not workdir.exists():
        sys.stderr.write("ERROR: workdir not found: %s\n" % workdir)
        sys.exit(2)

    wb = Workbook()
    # remove default sheet
    wb.remove(wb.active)
    used = set()

    # Summary info
    info = [
        ("sample", args.sample or ""),
        ("workdir", str(workdir)),
        ("generated_at", time.strftime("%Y-%m-%d %H:%M:%S")),
        ("python", sys.version.replace("\n", " ")),
        ("openpyxl", __import__("openpyxl").__version__),
    ]
    add_summary_sheet(wb, used, info)

    # Table1 TSV (try common names; make_table1_*.sh writes lowercase table1_<sample>.tsv)
    table1_candidates = list(workdir.glob("[Tt]able1_*.tsv")) + list(workdir.glob("*.tsv"))
    # Prefer [Tt]able1_*.tsv in workdir root
    table1_path = None
    for p in table1_candidates:
        if p.name.lower().startswith("table1_") and p.suffix == ".tsv":
            table1_path = p
            break
    if table1_path is None:
        # maybe created in cwd, not inside workdir; try alongside workdir
        parent = workdir.parent
        for p in parent.glob("[Tt]able1_*.tsv"):
            if args.sample and args.sample in p.name:
                table1_path = p
                break
        if table1_path is None and list(parent.glob("[Tt]able1_*.tsv")):
            table1_path = sorted(parent.glob("[Tt]able1_*.tsv"))[0]

    if table1_path is not None and table1_path.exists():
        add_tsv_sheet(wb, used, "Table1", table1_path, max_rows=args.max_rows)

    # QUAST
    quast_report = workdir / "quast" / "report.tsv"
    if quast_report.exists():
        add_tsv_sheet(wb, used, "QUAST_report", quast_report, max_rows=args.max_rows)
        add_quast_metrics_sheet(wb, used, quast_report)

    # Mosdepth summary
    for p in sorted((workdir / "map").glob("*.mosdepth.summary.txt")):
        # mosdepth summary is TSV-like
        name = "Mosdepth_" + p.stem.replace(".mosdepth.summary", "")
        add_tsv_sheet(wb, used, name[:31], p, max_rows=args.max_rows)

    # CheckM
    checkm_sum = workdir / "checkm" / "checkm_summary.tsv"
    if checkm_sum.exists():
        add_tsv_sheet(wb, used, "CheckM", checkm_sum, max_rows=args.max_rows)

    # GUNC outputs (all TSV under gunc/out)
    gunc_out = workdir / "gunc" / "out"
    if gunc_out.exists():
        for p in sorted(gunc_out.rglob("*.tsv")):
            rel = str(p.relative_to(gunc_out))
            sheet = "GUNC_" + rel.replace("/", "_").replace("\\", "_").replace(".tsv", "")
            add_tsv_sheet(wb, used, sheet[:31], p, max_rows=args.max_rows)

    # Log preview
    latest_log = find_latest_log(workdir)
    add_log_preview(wb, used, latest_log)

    # File inventory
    add_file_inventory(
        wb, used, workdir,
        do_md5=(not args.no_md5),
        md5_max_bytes=200*1024*1024,
        max_rows=args.max_rows
    )

    # Save
    out.parent.mkdir(parents=True, exist_ok=True)
    wb.save(str(out))
    print("OK: wrote %s" % out)

if __name__ == "__main__":
    main()

File: make_table1_with_excel.sh

#!/usr/bin/env bash
set -euo pipefail

# make_table1_with_excel.sh
#
# Wrapper to run:
#   1) make_table1_* (stats extraction)
#   2) export_table1_stats_to_excel_py36_compat.py
#
# Example:
#   ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

SAMPLE="${SAMPLE:-GE11174}"
THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-table1_${SAMPLE}_work}"
OUT_XLSX="${OUT_XLSX:-Comprehensive_${SAMPLE}.xlsx}"

ENV_NAME="${ENV_NAME:-}"
AUTO_INSTALL="${AUTO_INSTALL:-0}"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# Activate conda env if requested
if [[ -n "${ENV_NAME}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"

# ------------------------------------------------------------------------------
# Locate scripts
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MAKE_TABLE1="${MAKE_TABLE1_SCRIPT:-${SCRIPT_DIR}/make_table1_${SAMPLE}.sh}"
EXPORT_PY="${EXPORT_PY_SCRIPT:-${SCRIPT_DIR}/export_table1_stats_to_excel_py36_compat.py}"

# Fallback for naming mismatch (e.g., make_table1_GE11174.sh)
if [[ ! -f "${MAKE_TABLE1}" ]]; then
  MAKE_TABLE1="${SCRIPT_DIR}/make_table1_GE11174.sh"
fi

if [[ ! -f "${MAKE_TABLE1}" ]]; then
  echo "ERROR: make_table1 script not found" >&2
  exit 1
fi

if [[ ! -f "${EXPORT_PY}" ]]; then
  log "WARNING: export_table1_stats_to_excel_py36_compat.py not found next to this script."
  log "         You can set EXPORT_PY_SCRIPT=/path/to/export_table1_stats_to_excel_py36_compat.py"
fi

# ------------------------------------------------------------------------------
# Step 1
log "STEP 1: generating workdir stats..."
ENV_NAME="${ENV_NAME}" AUTO_INSTALL="${AUTO_INSTALL}" THREADS="${THREADS}" SAMPLE="${SAMPLE}" WORKDIR="${WORKDIR}" \
  bash "${MAKE_TABLE1}"

# ------------------------------------------------------------------------------
# Step 2
if [[ -f "${EXPORT_PY}" ]]; then
  log "STEP 2: exporting to Excel..."
  python "${EXPORT_PY}" \
    --workdir "${WORKDIR}" \
    --out "${OUT_XLSX}" \
    --max-rows 200000 \
    --sample "${SAMPLE}"
  log "Wrote: ${OUT_XLSX}"
else
  log "Skipped Excel export (missing export script). Workdir still produced: ${WORKDIR}"
fi

File: merge_amr_sources_by_gene.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
merge_amr_sources_by_gene.py

Merge AMR calls from multiple sources (e.g., ABRicate outputs from MEGARes/ResFinder
and RGI/CARD) by gene name, producing a combined table suitable for reporting/export.

This script is intentionally lightweight and focuses on:
- reading tabular ABRicate outputs
- normalizing gene names
- merging into a per-gene summary

Expected inputs/paths are typically set in your working directory structure.
"""

import os
import sys
import csv
from collections import defaultdict
from typing import Dict, List, Tuple

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_abricate_tab(path: str) -> List[Dict[str, str]]:
    rows: List[Dict[str, str]] = []
    with open(path, "r", newline="") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            # ABRicate default is tab-delimited with columns:
            # FILE, SEQUENCE, START, END, STRAND, GENE, COVERAGE, COVERAGE_MAP, GAPS,
            # %COVERAGE, %IDENTITY, DATABASE, ACCESSION, PRODUCT, RESISTANCE
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 12:
                continue
            gene = parts[5].strip()
            rows.append({
                "gene": gene,
                "identity": parts[10].strip(),
                "coverage": parts[9].strip(),
                "db": parts[11].strip(),
                "product": parts[13].strip() if len(parts) > 13 else "",
                "raw": line.rstrip("\n")
            })
    return rows

def normalize_gene(gene: str) -> str:
    g = gene.strip()
    # Add any project-specific normalization rules here
    return g

def merge_sources(sources: List[Tuple[str, str]]) -> Dict[str, Dict[str, List[Dict[str, str]]]]:
    merged: Dict[str, Dict[str, List[Dict[str, str]]]] = defaultdict(lambda: defaultdict(list))
    for src_name, path in sources:
        if not os.path.exists(path):
            eprint(f"WARNING: missing source file: {path}")
            continue
        rows = read_abricate_tab(path)
        for r in rows:
            g = normalize_gene(r["gene"])
            merged[g][src_name].append(r)
    return merged

def write_merged_tsv(out_path: str, merged: Dict[str, Dict[str, List[Dict[str, str]]]]):
    # Flatten into a simple TSV
    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["gene", "sources", "best_identity", "best_coverage", "notes"])
        for gene, src_map in sorted(merged.items()):
            srcs = sorted(src_map.keys())
            best_id = ""
            best_cov = ""
            notes = []
            # pick best identity/coverage across all hits
            for s in srcs:
                for r in src_map[s]:
                    if not best_id or float(r["identity"]) > float(best_id):
                        best_id = r["identity"]
                    if not best_cov or float(r["coverage"]) > float(best_cov):
                        best_cov = r["coverage"]
                    if r.get("product"):
                        notes.append(f"{s}:{r['product']}")
            w.writerow([gene, ",".join(srcs), best_id, best_cov, "; ".join(notes)])

def main():
    # Default expected layout (customize as needed)
    workdir = os.environ.get("WORKDIR", "resistome_virulence_GE11174")
    sample = os.environ.get("SAMPLE", "GE11174")

    rawdir = os.path.join(workdir, "raw")
    sources = [
        ("MEGARes", os.path.join(rawdir, f"{sample}.megares.tab")),
        ("CARD", os.path.join(rawdir, f"{sample}.card.tab")),
        ("ResFinder", os.path.join(rawdir, f"{sample}.resfinder.tab")),
        ("VFDB", os.path.join(rawdir, f"{sample}.vfdb.tab")),
    ]

    merged = merge_sources(sources)
    out_path = os.path.join(workdir, f"merged_by_gene_{sample}.tsv")
    write_merged_tsv(out_path, merged)
    eprint(f"Wrote merged table: {out_path}")

if __name__ == "__main__":
    main()
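
Usage note: the script takes its inputs from environment variables (with the defaults shown in main()), so a typical invocation against the layout produced by the ABRicate runs later in this post is:

WORKDIR=resistome_virulence_GE11174 SAMPLE=GE11174 python merge_amr_sources_by_gene.py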

File: export_resistome_virulence_to_excel_py36.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
export_resistome_virulence_to_excel_py36.py

Export resistome + virulence profiling outputs to an Excel workbook, compatible with
older Python (3.6) style environments.

Typical usage:
  python export_resistome_virulence_to_excel_py36.py \
    --workdir resistome_virulence_GE11174 \
    --sample GE11174 \
    --out Resistome_Virulence_GE11174.xlsx

Requires:
  - openpyxl
"""

import os
import sys
import csv
import argparse
from typing import List, Dict

try:
    from openpyxl import Workbook
except ImportError:
    sys.stderr.write("ERROR: openpyxl is required\n")
    sys.exit(1)

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_tab_file(path: str) -> List[List[str]]:
    rows: List[List[str]] = []
    with open(path, "r", newline="") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            rows.append(line.rstrip("\n").split("\t"))
    return rows

def autosize(ws):
    # basic autosize columns
    for col_cells in ws.columns:
        max_len = 0
        col_letter = col_cells[0].column_letter
        for c in col_cells:
            if c.value is None:
                continue
            max_len = max(max_len, len(str(c.value)))
        ws.column_dimensions[col_letter].width = min(max_len + 2, 60)

def add_sheet_from_tab(wb: Workbook, title: str, path: str):
    ws = wb.create_sheet(title=title)
    if not os.path.exists(path):
        ws.append([f"Missing file: {path}"])
        return
    rows = read_tab_file(path)
    if not rows:
        ws.append(["No rows"])
        return
    for r in rows:
        ws.append(r)
    autosize(ws)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workdir", required=True)
    ap.add_argument("--sample", required=True)
    ap.add_argument("--out", required=True)
    args = ap.parse_args()

    workdir = args.workdir
    sample = args.sample
    out_xlsx = args.out

    rawdir = os.path.join(workdir, "raw")

    files = {
        "MEGARes": os.path.join(rawdir, f"{sample}.megares.tab"),
        "CARD": os.path.join(rawdir, f"{sample}.card.tab"),
        "ResFinder": os.path.join(rawdir, f"{sample}.resfinder.tab"),
        "VFDB": os.path.join(rawdir, f"{sample}.vfdb.tab"),
        "Merged_by_gene": os.path.join(workdir, f"merged_by_gene_{sample}.tsv"),
    }

    wb = Workbook()
    # Remove default sheet
    default = wb.active
    wb.remove(default)

    for title, path in files.items():
        eprint(f"Adding sheet: {title} <- {path}")
        add_sheet_from_tab(wb, title, path)

    wb.save(out_xlsx)
    eprint(f"Wrote Excel: {out_xlsx}")

if __name__ == "__main__":
    main()

File: plot_tree_v4.R

#!/usr/bin/env Rscript

# plot_tree_v4.R
#
# Plot a RAxML-NG support tree with custom labels.
#
# Args:
#   1) support tree file (e.g., core.raxml.support)
#   2) labels.tsv (columns: accession<TAB>display)
#   3) root N (numeric, e.g., 6)
#   4) output PDF
#   5) output PNG

suppressPackageStartupMessages({
  library(ape)
  library(ggplot2)
  library(ggtree)
  library(dplyr)
  library(readr)
  library(aplot)
})

args <- commandArgs(trailingOnly=TRUE)
if (length(args) < 5) {
  cat("Usage: plot_tree_v4.R 
<support_tree> <labels.tsv> <root_n> <out.pdf> <out.png>\n")
  quit(status=1)
}

support_tree <- args[1]
labels_tsv <- args[2]
root_n <- as.numeric(args[3])
out_pdf <- args[4]
out_png <- args[5]

# Read tree
tr <- read.tree(support_tree)

# Read labels
lab <- read_tsv(labels_tsv, col_types=cols(.default="c"))
colnames(lab) <- c("accession","display")

# Map labels
# Current tip labels may include accession-like tokens.
# We'll try exact match first; otherwise keep original.
tip_map <- setNames(lab$display, lab$accession)
new_labels <- sapply(tr$tip.label, function(x) {
  if (x %in% names(tip_map)) {
    tip_map[[x]]
  } else {
    x
  }
})
tr$tip.label <- new_labels

# Root by nth tip if requested
if (!is.na(root_n) && root_n > 0 && root_n <= length(tr$tip.label)) {
  tr <- root(tr, outgroup=tr$tip.label[root_n], resolve.root=TRUE)
}

# Plot
p <- ggtree(tr) +
  geom_tiplab(size=3) +
  theme_tree2()

# Save
ggsave(out_pdf, plot=p, width=8, height=8)
ggsave(out_png, plot=p, width=8, height=8, dpi=300)

cat(sprintf("Wrote: %s\nWrote: %s\n", out_pdf, out_png))
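
Usage note: the five positional arguments correspond to the header comment above; a typical call (paths illustrative) is:

Rscript plot_tree_v4.R work_wgs_tree/raxmlng/core.raxml.support work_wgs_tree/plot/labels.tsv 6 core_tree.pdf core_tree.png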


Processing Data_Benjamin_DNAseq_2026_GE11174

[Figure: core_tree_like_fig3B (core-genome phylogenetic tree; generated in step 7 below)]

  1. Download the KmerFinder database: https://www.genomicepidemiology.org/services/ -> https://cge.food.dtu.dk/services/KmerFinder/ -> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz

    # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056
  2. Run nextflow bacass

    # --kmerfinderdb /path/to/kmerfinder/bacteria.tar.gz
    # --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
    # --kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
    nextflow run nf-core/bacass -r 2.5.0 -profile docker \
      --input samplesheet.tsv \
      --outdir bacass_out \
      --assembly_type long \
      --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
      --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
      -resume

  3. KmerFinder summary

    From the KmerFinder summary, the top hit is Gibbsiella quercinecans (strain FRB97; NZ_CP014136.1) with much higher score and coverage than the second hit (which is low coverage). So it’s fair to write:

    “KmerFinder indicates the isolate is most consistent with Gibbsiella quercinecans.”

    …but for a species call (especially for publication), you should confirm with ANI (or a genome taxonomy tool), because k-mer hits alone aren’t always definitive.

  4. Annotate the genome with https://www.bv-brc.org/app/ComprehensiveGenomeAnalysis, using the scaffolded assembly from bacass. ComprehensiveGenomeAnalysis provides a comprehensive overview of the data.

  5. Generate the Table 1 Summary of sequence data and genome features under the env gunc_env

    activate the env that has openpyxl

    mamba activate gunc_env
    mamba install -n gunc_env -c conda-forge openpyxl -y
    mamba deactivate

    STEP_1

    ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh

    STEP_2

    python export_table1_stats_to_excel_py36_compat.py \
      --workdir table1_GE11174_work \
      --out Comprehensive_GE11174.xlsx \
      --max-rows 200000 \
      --sample GE11174

    STEP_1+2

    ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

    # For "Total number of reads sequenced" and "Mean read length (bp)":
    pigz -dc GE11174.rawreads.fastq.gz | awk 'END{print NR/4}'
    seqkit stats GE11174.rawreads.fastq.gz

  6. Antimicrobial resistance gene profiling and resistome/virulence profiling with ABRicate and RGI (Resistance Gene Identifier)

    Table 4. Specialty Genes

    Source                          Genes
    NDARO                           1
    Antibiotic Resistance (CARD)    15
    Antibiotic Resistance (PATRIC)  55
    Drug Target (TTD)               38
    Metal Resistance (BacMet)       29
    Transporter (TCDB)              250
    Virulence Factor (VFDB)         33

    https://www.genomicepidemiology.org/services/

    https://genepi.dk/

    conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
    abricate --list
    #DATABASE       SEQUENCES  DBTYPE  DATE
    #vfdb           2597       nucl    2025-Oct-22
    #resfinder      3077       nucl    2025-Oct-22
    #argannot       2223       nucl    2025-Oct-22
    #ecoh           597        nucl    2025-Oct-22
    #megares        6635       nucl    2025-Oct-22
    #card           2631       nucl    2025-Oct-22
    #ecoli_vf       2701       nucl    2025-Oct-22
    #plasmidfinder  460        nucl    2025-Oct-22
    #ncbi           5386       nucl    2025-Oct-22
    abricate-get_db --list
    #Choices: argannot bacmet2 card ecoh ecoli_vf megares ncbi plasmidfinder resfinder vfdb victors (default '').

    CARD

    abricate-get_db --db card

    MEGARes (installs automatically; if it errors, try the MANUAL install below)

    abricate-get_db --db megares

    MANUAL install

    wget -O megares_database_v3.00.fasta \
      "https://www.meglab.org/downloads/megares_v3.00/megares_database_v3.00.fasta"
    #wget -O megares_drugs_database_v3.00.fasta \
    #  "https://www.meglab.org/downloads/megares_v3.00/megares_drugs_database_v3.00.fasta"

    1) Define dbdir (adjust to your env; from your logs it’s inside the conda env)

    DBDIR=/home/jhuang/miniconda3/envs/bengal3_ac3/db

    2) Create a custom db folder for MEGARes v3.0

    mkdir -p ${DBDIR}/megares_v3.0

    3) Copy the downloaded MEGARes v3.0 nucleotide FASTA to ‘sequences’

    cp megares_database_v3.00.fasta ${DBDIR}/megares_v3.0/sequences

    4) Build ABRicate indices

    abricate –setupdb

    #abricate-get_db --setupdb
    abricate --list | egrep 'card|megares'
    abricate --list | grep -i megares

    chmod +x run_resistome_virulome.sh
    ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome.sh

    chmod +x run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=./vrap_HF/spades/scaffolds.fasta SAMPLE=HF THREADS=32 ~/Scripts/run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 MINID=80 MINCOV=60 ./run_resistome_virulome_dedup.sh

    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.card.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.card.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab | grep -v '^[[:space:]]*$' | head -n 3

    chmod +x make_dedup_tables_from_abricate.sh
    OUTDIR=resistome_virulence_GE11174 SAMPLE=GE11174 ./make_dedup_tables_from_abricate.sh

    chmod +x run_abricate_resistome_virulome_one_per_gene.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
    ASM=GE11174.fasta \
    SAMPLE=GE11174 \
    OUTDIR=resistome_virulence_GE11174 \
    MINID=80 MINCOV=60 \
    THREADS=32 \
    ./run_abricate_resistome_virulome_one_per_gene.sh
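
    The run_*_one_per_gene.sh helpers themselves are not reproduced in this post; as a minimal sketch of the same idea (assuming ABRicate's default tab layout with GENE in column 6 and %IDENTITY in column 11, as parsed in merge_amr_sources_by_gene.py above), the best-identity hit per gene can be kept like this:

    # sort by gene, then by %identity descending; keep the first (best) row per gene
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.card.tab \
      | sort -t$'\t' -k6,6 -k11,11gr \
      | awk -F'\t' '!seen[$6]++' > GE11174.card.one_per_gene.tab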

    #ABRicate thresholds: MINID=70 MINCOV=50
    Database   Hit_lines  File
    MEGARes    35         resistome_virulence_GE11174/raw/GE11174.megares.tab
    CARD       28         resistome_virulence_GE11174/raw/GE11174.card.tab
    ResFinder  2          resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    VFDB       18         resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    #ABRicate thresholds: MINID=80 MINCOV=60
    Database   Hit_lines  File
    MEGARes    3          resistome_virulence_GE11174/raw/GE11174.megares.tab
    CARD       1          resistome_virulence_GE11174/raw/GE11174.card.tab
    ResFinder  0          resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    VFDB       0          resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    python merge_amr_sources_by_gene.py
    python export_resistome_virulence_to_excel_py36.py \
      --workdir resistome_virulence_GE11174 \
      --sample GE11174 \
      --out Resistome_Virulence_GE11174.xlsx

    Methods sentence (AMR + virulence)

    AMR genes were identified by screening the genome assembly with ABRicate against the MEGARes and ResFinder databases, using minimum identity and coverage thresholds of X% and Y%, respectively. CARD-based AMR determinants were additionally predicted using RGI (Resistance Gene Identifier) to leverage curated resistance models. Virulence factors were screened using ABRicate against VFDB under the same thresholds.

    Replace X/Y with your actual values (e.g., 90/60) or state “default parameters” if you truly used defaults.

    Table 2 caption (AMR)

    Table 2. AMR gene profiling of the genome assembly. Hits were detected using ABRicate (MEGARes and ResFinder) and RGI (CARD). The presence of AMR-associated genes does not necessarily imply phenotypic resistance, which may depend on allele type, genomic context/expression, and/or SNP-mediated mechanisms; accordingly, phenotype predictions (e.g., ResFinder) should be interpreted cautiously.

    Table 3 caption (virulence)

    Table 3. Virulence factor profiling of the genome assembly based on ABRicate screening against VFDB, reporting loci with sequence identity and coverage above the specified thresholds.

  7. Generate phylogenetic tree

    export NCBI_EMAIL="j.huang@uke.de"
    ./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv

    Note: the env bengal3_ac3 doesn't have the following R packages; use r_env for the plot step!

    #mamba install -y -c conda-forge -c bioconda r-aplot bioconductor-ggtree r-ape r-ggplot2 r-dplyr r-readr

    chmod +x build_wgs_tree_fig3B.sh
    export ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3
    export NCBI_EMAIL="j.huang@uke.de"
    ./build_wgs_tree_fig3B.sh

  8. DEBUG (recommended): remove one genome and rerun Roary → RAxML; Example: drop GCF_047901425.1 (change to the other one if you prefer).

    1.1) remove from inputs so Roary cannot include it

    rm -f work_wgs_tree/gffs/GCF_047901425.1.gff
    rm -f work_wgs_tree/fastas/GCF_047901425.1.fna
    rm -rf work_wgs_tree/prokka/GCF_047901425.1
    rm -rf work_wgs_tree/genomes_ncbi/GCF_047901425.1   #optional

    1.2) remove from accession list so it won’t come back

    awk -F'\t' 'NR==1 || $2!="GCF_047901425.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
      && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

    2.1) remove from inputs so Roary cannot include it

    rm -f work_wgs_tree/gffs/GCA_032062225.1.gff
    rm -f work_wgs_tree/fastas/GCA_032062225.1.fna
    rm -rf work_wgs_tree/prokka/GCA_032062225.1
    rm -rf work_wgs_tree/genomes_ncbi/GCA_032062225.1   #optional

    2.2) remove from accession list so it won’t come back

    awk -F'\t' 'NR==1 || $2!="GCA_032062225.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
      && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

    3) delete old roary runs (so you don’t accidentally reuse old alignment)

    rm -rf work_wgs_tree/roary_*

    4) rerun Roary (fresh output dir)

    mkdir -p work_wgs_tree/logs
    ROARY_OUT="work_wgs_tree/roary_$(date +%s)"
    roary -e --mafft -p 8 -cd 95 -i 95 \
      -f "$ROARY_OUT" \
      work_wgs_tree/gffs/*.gff \
      > work_wgs_tree/logs/roary_rerun.stdout.txt \
      2> work_wgs_tree/logs/roary_rerun.stderr.txt

    5) point meta file to new core alignment (absolute path)

    echo "$(readlink -f "$ROARY_OUT/core_gene_alignment.aln")" > work_wgs_tree/meta/core_alignment_path.txt

    6) rerun RAxML-NG

    rm -rf work_wgs_tree/raxmlng
    mkdir work_wgs_tree/raxmlng/
    raxml-ng --all \
      --msa "$(cat work_wgs_tree/meta/core_alignment_path.txt)" \
      --model GTR+G \
      --bs-trees 1000 \
      --threads 8 \
      --prefix work_wgs_tree/raxmlng/core

    7) Run this to regenerate labels.tsv

    bash regenerate_labels.sh

    8) Manually correct the display names in vim work_wgs_tree/plot/labels.tsv

    #Gibbsiella greigii USA56
    #Gibbsiella papilionis PWX6
    #Gibbsiella quercinecans strain FRB97
    #Brenneria nigrifluens LMG 5956

    9) Rerun only the plot step:

    Rscript work_wgs_tree/plot/plot_tree.R \
      work_wgs_tree/raxmlng/core.raxml.support \
      work_wgs_tree/plot/labels.tsv \
      6 \
      work_wgs_tree/plot/core_tree_like_fig3B.pdf \
      work_wgs_tree/plot/core_tree_like_fig3B.png

  9. fastANI and BUSCO explanations

    find . -name "*.fna"
    #./work_wgs_tree/fastas/GCF_004342245.1.fna   Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
    #./work_wgs_tree/fastas/GCF_039539505.1.fna   Gibbsiella papilionis PWX6 (GCF_039539505.1)
    #./work_wgs_tree/fastas/GCF_005484965.1.fna   Brenneria nigrifluens LMG5956 (GCF_005484965.1)
    #./work_wgs_tree/fastas/GCA_039540155.1.fna   Gibbsiella greigii USA56 (GCA_039540155.1)
    #./work_wgs_tree/fastas/GE11174.fna
    #./work_wgs_tree/fastas/GCF_002291425.1.fna   Gibbsiella quercinecans FRB97 (GCF_002291425.1)

    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_004342245.1.fna -o fastANI_out_Gibbsiella_quercinecans_DSM_25889.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_039539505.1.fna -o fastANI_out_Gibbsiella_papilionis_PWX6.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_005484965.1.fna -o fastANI_out_Brenneria_nigrifluens_LMG5956.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCA_039540155.1.fna -o fastANI_out_Gibbsiella_greigii_USA56.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_002291425.1.fna -o fastANI_out_Gibbsiella_quercinecans_FRB97.txt
    cat fastANI_out_*.txt > fastANI_out.txt

    GE11174.fasta  ./work_wgs_tree/fastas/GCF_005484965.1.fna  79.1194  597   1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCA_039540155.1.fna  95.9589  1547  1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCF_039539505.1.fna  97.2172  1588  1890
    GE11174.fasta  work_wgs_tree/fastas/GCF_004342245.1.fna    98.0889  1599  1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCF_002291425.1.fna  98.1285  1622  1890

    #In bacterial genome comparisons, a common empirical threshold is:

    • ANI ≥ 95–96%: usually considered the range of the same species
    • A value of 97.09% (as seen earlier for An6 vs HITLi7) → very likely the same species, but possibly not the same strain, since some divergence remains. Whether two isolates are "the same strain" usually also requires:
    • core-genome SNP distance, cgMLST
    • assembly quality / contamination checks
    • sufficiently high coverage

    #Quick interpretation of the BUSCO results (as an aside). These numbers are already included in Table 1.

    • Complete 99.2%, Missing 0.0%: the assembly is very complete (excellent for a bacterium)
    • Duplicated 0.0%: few duplicated copies, so lower risk of contamination or sample mixing
    • Scaffolds 80, N50 ~169 kb: somewhat fragmented, but the overall quality is more than sufficient for ANI/species identification
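
    As a small helper (a sketch, not part of the pipeline; it assumes the five-column fastANI output shown above), the ≥95–96% rule of thumb can be applied directly to fastANI_out.txt:

    # label each fastANI line with the ~95-96% species rule of thumb
    # columns: query, reference, ANI, matched_fragments, total_fragments
    awk '{ verdict = ($3 >= 96 ? "same species" : ($3 >= 95 ? "borderline (95-96%)" : "different species"));
           printf "%s\tANI=%.2f\taligned=%.0f%%\t%s\n", $2, $3, 100*$4/$5, verdict }' fastANI_out.txt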
  10. fastANI explanation

From your tree and the fastANI table, GE11174 is clearly inside the Gibbsiella quercinecans clade, and far from the outgroup (Brenneria nigrifluens). The ANI values quantify that same pattern.

1) Outgroup check (sanity)

  • GE11174 vs Brenneria nigrifluens (GCF_005484965.1): ANI 79.12% (597/1890 fragments)

    • 79% ANI is way below any species boundary → not the same genus/species.
    • On the tree, Brenneria sits on a long branch as the outgroup, consistent with this deep divergence.
    • The relatively low matched fragments (597/1890) also fits “distant genomes” (fewer orthologous regions pass the ANI mapping filters).

2) Species-level placement of GE11174

A common rule of thumb you quoted is correct: ANI ≥ 95–96% ⇒ same species.

Compare GE11174 to the Gibbsiella references:

  • vs GCA_039540155.1 (Gibbsiella greigii USA56): 95.96% (1547/1890)

    • Right at the boundary. This suggests “close but could be different species” or “taxonomy/labels may not reflect true species boundaries” depending on how those genomes are annotated.
    • On the tree, G. greigii is outside the quercinecans group but not hugely far, which matches “borderline ANI”.
  • vs GCF_039539505.1 (Gibbsiella papilionis PWX6): 97.22% (1588/1890)

  • vs GCF_004342245.1 (G. quercinecans DSM 25889): 98.09% (1599/1890)

  • vs GCF_002291425.1 (G. quercinecans FRB97): 98.13% (1622/1890)

These are all comfortably above 96%, especially the two quercinecans genomes (~98.1%). That strongly supports:

GE11174 belongs to the same species as Gibbsiella quercinecans (and is closer to quercinecans references than to greigii).

This is exactly what your tree shows: GE11174 clusters in the quercinecans group, not with the outgroup.

3) Closest reference and “same strain?” question

GE11174’s closest by ANI in your list is:

  • FRB97 (GCF_002291425.1): 98.1285%
  • DSM 25889 (GCF_004342245.1): 98.0889%
  • Next: PWX6 97.2172%

These differences are small, but 98.1% ANI is not “same strain” evidence by itself. Within a species, different strains commonly sit anywhere from ~96–99.9% ANI depending on diversity. To claim “same strain / very recent transmission,” people usually look for much tighter genome-wide similarity:

  • core-genome SNP distance (often single digits to tens, depending on organism and context)
  • cgMLST allele differences
  • recombination filtering (if relevant)
  • assembly QC/contamination checks
  • and confirming that alignment/ANI coverage is high and not biased by missing regions

Your fragment matches (e.g., 1622/1890 for FRB97) are reasonably high, supporting that the ANI is meaningful, but it still doesn’t equate to “same strain.”
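
If you do want that strain-level follow-up, a quick starting point (a sketch; the alignment path assumes the Roary rerun naming from the debug section above) is a pairwise core SNP distance matrix with snp-dists:

# pairwise SNP distances from the Roary core-gene alignment (pick the rerun directory you kept)
snp-dists work_wgs_tree/roary_*/core_gene_alignment.aln > core_snp_distances.tsv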

4) How to phrase the combined interpretation (tree + ANI)

A clear summary you can use:

  • The phylogenetic tree places GE11174 within the Gibbsiella quercinecans lineage, with Brenneria nigrifluens as a distant outgroup.
  • fastANI supports this:

    • ~98.1% ANI to G. quercinecans FRB97 and DSM 25889 → strong same-species support.
    • 97.2% to G. papilionis → still same-species range, but more distant than the quercinecans references.
    • 95.96% to G. greigii → borderline; consistent with being a close neighboring lineage but not the best species match for GE11174.
    • 79.1% to Brenneria → confirms it is an appropriate outgroup and far outside the species/genus boundary.
  • Therefore, GE11174 is very likely Gibbsiella quercinecans (species-level), and appears most similar to FRB97/DSM 25889, but additional high-resolution analyses are required to assess “same strain.”

If you want, paste the fastANI line for GE11174 vs GE11174? (or run fastANI against a broader set), and/or your core SNP distances from the Roary alignment, and I can help you write a tighter “strain-level” conclusion.

#TODO_NEXT_MONDAY:

  * phylogenetic tree + fastANI + nf-core/pairgenomealign (compare to the closest isolate; https://nf-co.re/pairgenomealign/2.2.1/)
  * summarize all results in a mail and send them back; mention that we can submit the genome to NCBI to obtain a high-quality annotation. What strain name would you like to assign to this isolate?
  * If they agree, I can submit the two new isolates to the NCBI database!
  1. Submit both sequences in a batch to the NCBI server!

  2. Find the closest isolate from GenBank (robust approach) for STEP_7


    # download all available genomes for the genus Gibbsiella (includes assemblies + metadata)
    #--assembly-level flag: must be 'chromosome', 'complete', 'contig', 'scaffold'
    datasets download genome taxon Gibbsiella --include genome,gff3,gbff --assembly-level complete,chromosome,scaffold --filename gibbsiella.zip
    unzip -q gibbsiella.zip -d gibbsiella_ncbi
    
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    
    # make a Mash sketch of your isolate
    mash sketch -o isolate bacass_out/Unicycler/GE11174.scaffolds.fa
    
    # sketch all reference genomes (example path—adjust)
    find gibbsiella_ncbi -name "*.fna" -o -name "*.fasta" > refs.txt
    mash sketch -o refs -l refs.txt
    
    # get closest genomes
    mash dist isolate.msh refs.msh | sort -gk3 | head -n 20 > top20_mash.txt
    
    ## What your Mash results mean
    
    * The **best hits** to your assembly (`GE11174.scaffolds.fa`) are:
    
      * **GCA/GCF_002291425.1** (shows up twice: GenBank **GCA** and RefSeq **GCF** copies of the *same assembly*)
      * **GCA/GCF_004342245.1** (same duplication pattern)
      * **GCA/GCF_047901425.1** (FRB97; also duplicated)
    * Mash distances around **0.018–0.020** are **very close** (comfortably within the range typically seen between genomes of the same species).
    * The `0` in your output is just Mash’s p-value being printed as 0 due to underflow (i.e., extremely significant).
    
    So yes: your isolate looks **very close to those Gibbsiella genomes**, and FRB97 being in that set is consistent with your earlier KmerFinder result.

    5. — Remove duplicates (GCA vs GCF)

    Right now you’re seeing the same genome twice (GenBank + RefSeq). For downstream work, keep one.
    
    Example: keep only **GCF** if available, else GCA:
    
    ```bash
    # List the unique parent directories of your top hits (note: this does not by itself prefer GCF over GCA)
    cat top20_mash.txt \
      | awk '{print $2}' \
      | sed 's|/GCA_.*||; s|/GCF_.*||' \
      | sort -u
    ```
    
    But easiest: just manually keep one of each pair:
    
    * keep `GCF_002291425.1` (drop `GCA_002291425.1`)
    * keep `GCF_004342245.1`
    * keep `GCF_047901425.1`
      (and maybe keep `GCA_032062225.1` if it’s truly different and you want a more distant ingroup point)
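
    If you'd rather automate the "prefer GCF over GCA" rule on the Mash hits, a sketch (keyed on the numeric part of the accession embedded in each reference path) could look like:

    ```bash
    # Keep one Mash hit per assembly, preferring the RefSeq (GCF_) copy over GenBank (GCA_)
    awk '{
      if (match($2, /GC[AF]_[0-9]+\.[0-9]+/)) {
        acc = substr($2, RSTART, RLENGTH)   # e.g., GCF_002291425.1
        key = substr(acc, 4)                # accession without the GCA_/GCF_ prefix
        if (!(key in best) || acc ~ /^GCF_/) best[key] = $0
      }
    }
    END { for (k in best) print best[k] }' top20_mash.txt | sort -gk3
    ```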

单车杀王 (rook-and-king checkmate)

"单车杀王" usually refers to the basic checkmate in chess endgames with one rook plus king against a lone king (in English: the rook checkmate, or rook and king vs king). (Chess.com)

Below is a reliable "recipe-style" framework; once you learn it, you basically can't get lost:

Core idea of the rook checkmate: box the king in → bring your king up → tighten the net and mate

  1. The rook "draws the box" first. Use your rook to confine the opposing king to one region of the board (e.g., trap it in the upper or left half first). Keep the rook at least one square away from the enemy king so it cannot simply be captured.

  2. Bring your own king up to protect the rook. The rook alone cannot keep shrinking the box: every time you tighten it, the enemy king can chase and attack the rook. Your king must therefore approach, forming a "rook in front, king behind as protection" setup.

  3. Shrink the box: move the rook one line closer. Once your king is close enough to protect the rook, use the rook to cut the enemy king's territory down by another rank or file (e.g., from "confined to rank 3" to "confined to rank 2").

  4. Repeat 2) + 3). "King catches up → rook tightens again"; after a few cycles the enemy king is forced to the edge or corner.

  5. The typical final mating picture:

  • the enemy king is on the edge (e.g., the a-file), and your rook gives check while sealing off that entire edge line;
  • your king stands next to the rook (or on a square that protects it), so the enemy king can neither capture the rook nor escape: checkmate. (Chess.com)

Two common pitfalls (where beginners most often fail)

  • Rook too close and it gets captured: never place the rook "shoulder to shoulder" with the enemy king; keep at least one square between them.
  • Endless sideways checks without progress: checking with the rook over and over can drift into a draw (50-move rule, etc.); the key is to bring your own king into the net.

HYLO CARE vs. HYLO COMOD eye drops: ingredient comparison and long-term use assessment

1. Key takeaways from the official product page (conclusion first)

👉 HYLO CARE can be used long-term, and it is the more "care-oriented" artificial tear.
👉 It contains no preservatives, which is very important.


2. What exactly is in HYLO CARE, and is it safe?

✅ 1️⃣ Hyaluronic acid (Hyaluronsäure 0.1%)

  • Function: strong moisturizing and lubrication
  • Properties:

    • occurs naturally in the human body (eyes, joints, skin)
    • not a drug, not a hormone
  • ✔️ suitable for long-term use
  • ✔️ the "gold standard" ingredient in dry-eye treatment

📌 0.1% is a medium concentration → more moisturizing than the basic version, but not too viscous.


✅ 2️⃣ Dexpanthenol (provitamin B5)

This is the "extra ingredient" you noticed, and the one most people ask about.

What does it do?

  • promotes repair of the cornea and conjunctiva
  • relieves irritation, burning, and foreign-body sensation
  • has a mild anti-inflammatory, regenerative effect

📌 Commonly used for:

  • a damaged ocular surface
  • prolonged screen work
  • contact-lens discomfort

👉 Dexpanthenol is neither a drug nor a hormone
👉 It is very common and safe in ophthalmology

✔️ fine for long-term use


3. What does "preservative-free" mean? (very important)

The official page states explicitly:

Frei von Konservierungsmitteln und Phosphaten (free of preservatives and phosphates)

This means:

  • ❌ does not irritate the cornea
  • ❌ will not make the eyes "drier the more you use it"
  • ❌ does not damage the tear film
  • ✔️ suitable for many applications per day, long-term
  • ✔️ suitable for sensitive eyes, children, and heavy screen users

👉 This is also the biggest advantage of the COMOD® system: it stays sterile without any added preservative.


4. HYLO CARE vs. HYLO COMOD (now we can distinguish them precisely)

Comparison        HYLO CARE                 HYLO COMOD
Hyaluronic acid   0.1%                      0.1%
Extra ingredient  ✅ Dexpanthenol (repair)  ❌ none
Preservatives     ❌ none                   ❌ none
Moisturizing      ⭐⭐⭐⭐                  ⭐⭐⭐
Care/repair       ✅ stronger               basic
Long-term use     ✅ yes                    ✅ yes

👉 In short:

  • HYLO COMOD: basic moisturizing
  • HYLO CARE: moisturizing + repair (more "care")

5. The question you care most about: can HYLO CARE be used long-term?

✅ Answer: yes, and it is well suited to it

Reasons in brief:

  • ✔️ no preservatives
  • ✔️ safe, near-physiological ingredients
  • ✔️ designed specifically for "daily care"
  • ✔️ the official page explicitly says "ideal für die tägliche Pflege" (ideal for daily care)

📌 The 6-month usage period means:

  • use the bottle up within 6 months after opening
  • it is a hygiene shelf life, not "you may only use it for 6 months"

6. Who is long-term HYLO CARE suitable for?

Especially suitable if any of the following applies to you:

  • long hours at a computer / phone
  • heated or air-conditioned rooms
  • contact lenses
  • eyes that easily become red, dry, or itchy, or feel gritty
  • you want something more "reparative" than a basic artificial tear

7. One-sentence summary (to put your mind at ease)

👉 The HYLO CARE product page you posted:

  • ✔️ preservative-free
  • ✔️ safe ingredients
  • ✔️ contains the repair ingredient Dexpanthenol
  • ✔️ entirely suitable for long-term use

Tournament plan 2026 for X. (updated)

Basis: the last 4 Kids-Cup dates (official)

No.  Date        Time         Venue
42   20.12.2024  9:30–17:00   HSK Schachzentrum (Schellingstr. 41) (hsk1830.de)
43   29.03.2025  (not given)  HSK (Kids-Cup report) (hsk1830.de)
44   19.07.2025  9:00–17:00   HSK Schachzentrum (hsk1830.de)
45   11.10.2025  9:30–17:00   HSK Schachzentrum (hsk1830.de)

A recurring pattern is visible: spring (end of March) → summer (mid-July) → autumn (mid-October) → shortly before Christmas (December).


Forecast: HSK Kids-Cup 2026 (4 separate entries)

Forecasting principle: same seasonal windows as above, preferably on a weekend day (as in the Kids-Cup description). (hsk1830.de)

Please treat these as placeholders until HSK publishes the actual announcements.

2026 (forecast)  Weekday   Entry
28.03.2026       Saturday  HSK Kids-Cup #46 (expected)
18.07.2026       Saturday  HSK Kids-Cup #47 (expected)
10.10.2026       Saturday  HSK Kids-Cup #48 (expected)
18.12.2026       Friday    HSK Kids-Cup #49 (expected) ("shortly before Christmas"; in 2024 it was likewise a weekday close to the start of the holidays) (hsk1830.de)

Kids-Cup standing facts (from the HSK info page, so you don't have to look them up every time)

  • Several times a year, for beginners, with an additional school ranking (hsk1830.de)
  • 7 rounds Swiss system, two tournaments: A) open up to school grade 4, DWZ-rated; B) kindergarten up to grade 2 (hsk1830.de)
  • 20 minutes per game; moves are recorded during the first 15 minutes so coaches can analyze between rounds. (hsk1830.de)
  • Entry fee: €10 (HSK members), €15 (guests) (hsk1830.de)

Update: "X. should play all 2026 Kids-Cups + the next HJET"

Updated master table (2026 + "next HJET")

Date                            Event                             Venue              Q?              Status / notes (with space to fill in)
14.02.2026                      Blankeneser Jugend Pokal (BJP)    Hamburg            ✅ yes          youth tournament, good U10 format
15.02.2026                      DWZ Challenge                     Hamburg            ✅ yes (fixed)  registered
22.03.2026                      DWZ/ELO-Cup (groups of 4)         Itzehoe            ✅ yes          DWZ practice without overload
28.03.2026 (forecast)           HSK Kids-Cup #46                  HSK Schachzentrum  ✅ yes          announcement link: ____ time: ____ registration deadline: ____
06.06.2026 (optional)           DWZ/ELO-Cup (groups of 4)         Itzehoe            🟡 optional     only if June isn't already "too full"
20.–21.06.2026                  Hamburger Talente-Cup (U12)       Hamburg            ✅ yes          highlight weekend
28.06.2026                      Elmshorn youth city championship  Elmshorn           ✅ yes          one day, child-friendly
18.07.2026 (forecast)           HSK Kids-Cup #47                  HSK Schachzentrum  ✅ yes          announcement link: ____ time: ____ registration deadline: ____
05.09.2026                      DWZ/ELO-Cup (groups of 4)         Itzehoe            ✅ yes          good re-entry after the summer break
10.10.2026 (forecast)           HSK Kids-Cup #48                  HSK Schachzentrum  ✅ yes          announcement link: ____ time: ____ registration deadline: ____
07.11.2026 (optional)           DWZ/ELO-Cup (groups of 4)         Itzehoe            🟡 optional     only if motivation/capacity allows
05.12.2026                      DWZ/ELO-Cup (groups of 4)         Itzehoe            ✅ yes          year-end finale
18.12.2026 (forecast)           HSK Kids-Cup #49                  HSK Schachzentrum  ✅ yes          announcement link: ____ time: ____ registration deadline: ____
December 2026                   HJET – registration, next round   Hamburg            ✅ yes          deadline: ____ link/form: ____ group (U10-1/U10-2): ____
Jan–early Feb 2027 (Saturdays)  HJET – match days                 Hamburg            ✅ yes          dates: (fill in once published)

PDF version (updated)

Here is the updated PDF (with the Kids-Cup forecasts as separate entries + the HJET "next round"): Download: Turnierplan_2026_aktualisiert.pdf

Once the real 2026 Kids-Cup announcement links and the HJET registration link (Dec 2026) are available, we can fill in the blanks and replace "forecast" with "official".

Bacterial WGS workflow overview (nf-core/bacass → taxonomic ID → phylogeny)

1) Primary WGS analysis with nf-core/bacass

We processed short-read bacterial WGS with nf-core/bacass v2.5.0 using Docker. The pipeline performs QC, assembly, and common taxonomic/typing steps (e.g., Kraken2). We ran it successfully with a local Kraken2 database.

A practical issue is the KmerFinder database download: the previously documented FTP “latest” path may be unavailable or slow, so it’s best to pre-download the database and pass a local path, or skip KmerFinder and run it manually via the CGE service if needed.

2) KmerFinder database: reliable download options

Two practical database sources:

  • Option A (recommended, easiest & fast): download the bundled DB from the CGE KmerFinder service site and extract locally. (In your case it finished in ~40 min.)
  • Option B (fallback, stable but older): use the Zenodo stable archive (20190108).

Once the database is available locally, it can be provided to bacass with --kmerfinderdb (pointing to the extracted bacteria/ directory, or the tarball depending on the pipeline expectation).

3) Species placement & phylogeny

For broad placement, an ANI-based approach (fastANI distance tree) is quick and robust, but for publishable phylogeny within a close clade, a standard route is:

Prokka → Roary (pangenome; core-gene alignment) → (trim) → RAxML-NG.

Key point: Roary core-gene trees are best for closely related genomes (same species/close genus). Distant outgroups can collapse the core (few/no shared genes). If outgroups are too far, use close outgroups (e.g., Moraxella, Psychrobacter) or switch to conserved marker gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) for deep phylogeny.

4) Comparative genome visualization

For pairwise genome comparison/plots, nf-core/pairgenomealign can generate whole-genome alignments and visual representations against a target genome.


Reproducible code snippets (complete commands)

A) nf-core/bacass runs

Run 1 (works):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/minikraken_20171019_8GB.tgz

Run 2 (skip KmerFinder, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/k2_standard_08_GB_20251015.tar.gz \
  --skip_kmerfinder \
  -resume

Then manually run KmerFinder via web service: https://cge.food.dtu.dk/services/KmerFinder/

Run 3 (use local KmerFinder DB, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
  --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
  -resume

B) KmerFinder DB download

Option A (CGE bundle; fast):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz'
tar -xvf kmerfinder_db.tar.gz
# resulting path typically contains: bacteria/

Option B (Zenodo stable archive):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz'
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz

C) Download close references (Acinetobacter + close outgroups) + fastANI matrix (your script config)

(You already implemented this; keep here as “reproducible config”)

# Example config snippet used in the downloader:
MAX_ACINETOBACTER=20
MAX_PER_OUTGROUP=4
ASSEMBLY_LEVELS="complete,chromosome"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
MINFRACTION="0.05"

D) Prokka → Roary core MSA → RAxML-NG (phylogenomic tree)

0) Collect genomes (isolate + downloaded refs):

mkdir -p genomes
cp bacass_out/Prokka/An6/An6.fna genomes/An6.fna
bash 1_ani_tree_prep.sh
cp ani_tree_run/ref_fasta/*.genomic.fna genomes/

1) Prokka annotations (produces GFFs for Roary):

mkdir -p prokka gffs

for f in genomes/*.fna genomes/*.genomic.fna; do
  [ -f "$f" ] || continue
  bn=$(basename "$f")
  sample="${bn%.*}"
  sample="${sample%.genomic}"
  sample="${sample//./_}"

  prokka --outdir "prokka/$sample" \
         --prefix "$sample" \
         --cpus 8 --force \
         "$f"
done

cp prokka/*/*.gff gffs/

2) Roary core-genome alignment (MAFFT):

mkdir -p roary_out

roary -f roary_out \
      -p 16 \
      -e -n \
      -i 95 \
      -cd 95 \
      gffs/*.gff

# output:
ls -lh roary_out/core_gene_alignment.aln

3) Trim the core alignment (optional but recommended):

trimal -in roary_out/core_gene_alignment.aln \
       -out roary_out/core_gene_alignment.trim.aln \
       -automated1

4) RAxML-NG ML tree + bootstrap supports:

mkdir -p tree

raxml-ng --all \
  --msa roary_out/core_gene_alignment.trim.aln \
  --model GTR+G \
  --bs-trees autoMRE{500} \
  --threads 16 \
  --seed 12345 \
  --prefix tree/roary_core

# main result:
ls -lh tree/roary_core.raxml.support

E) Optional visualization in R (ape)

library(ape)

tr <- read.tree("tree/roary_core.raxml.support")

pdf("tree/roary_core_raxmlng.pdf", width=12, height=10)
plot(tr, cex=0.5, no.margin=TRUE)
title("Roary core genome + RAxML-NG (GTR+G)")
dev.off()

F) Pairwise genome alignment figure (nf-core/pairgenomealign)

Example pattern (edit to your inputs/refs):

nextflow run nf-core/pairgenomealign -r 2.2.1 -profile docker \
  --input samplesheet_pairgenomealign.tsv \
  --outdir pairgenomealign_out

Practical notes

  • If KmerFinder FTP “latest” is unavailable, use the CGE bundle or Zenodo stable DB and point bacass to the local extracted bacteria/ directory.
  • For phylogeny:

    • Roary/RAxML-NG is best for close genomes; use close outgroups or none and root later.
    • For distant outgroups, use marker-gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) instead of pangenome core.
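
For the marker-gene route, a minimal GTDB-Tk sketch (an illustration, assuming GTDB-Tk and its reference data are installed; release ≥2.1 syntax) would be:

# classify all .fna genomes in genomes/ against the GTDB taxonomy
gtdbtk classify_wf \
  --genome_dir genomes \
  --extension fna \
  --out_dir gtdbtk_out \
  --cpus 16 \
  --skip_ani_screen   # required in GTDB-Tk >= 2.1 unless a Mash DB is supplied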

Used scripts (complete code)

A) 1_ani_tree_prep.sh

#!/usr/bin/env bash
set -euo pipefail

# ===================== USER CONFIG =====================
QUERY_FASTA="bacass_out/Prokka/An6/An6.fna"
SEED_ACCESSIONS_FILE="accessions.txt"

OUTDIR="ani_tree_run"
THREADS=8
MINFRACTION="0.05"
FRAGLEN="3000"

MAX_ACINETOBACTER=20
SUMMARY_LIMIT_ACI=400

MAX_PER_OUTGROUP=4
SUMMARY_LIMIT_OUT=120

#,scaffold
ASSEMBLY_LEVELS="complete,chromosome"
#Drop from Roary: "Pseudomonas" "Escherichia"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")

DO_MATRIX="yes"
SUMMARY_TIMEOUT=600
# =======================================================

ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
warn() { echo "[$(ts)] WARN: $*" >&2; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
sanitize() { echo "$1" | tr ' /;:,()[]{}' '_' | tr -cd 'A-Za-z0-9._-'; }

need datasets
need fastANI
need awk
need sort
need unzip
need find
need grep
need wc
need head
need readlink
need sed
need timeout
need python3

[[ -f "$QUERY_FASTA" ]] || die "QUERY_FASTA not found: $QUERY_FASTA"

mkdir -p "$OUTDIR"/{ref_fasta,zips,tmp,logs,accessions}
DL_LOG="$OUTDIR/logs/download.log"
SUM_LOG="$OUTDIR/logs/summary.log"
ANI_LOG="$OUTDIR/logs/fastani.log"
: > "$DL_LOG"; : > "$SUM_LOG"; : > "$ANI_LOG"

QUERY_LIST="$OUTDIR/query_list.txt"
echo "$(readlink -f "$QUERY_FASTA")" > "$QUERY_LIST"
log "QUERY: $(cat "$QUERY_LIST")"

ACC_DIR="$OUTDIR/accessions"
ACC_SEED="$ACC_DIR/seed.acc.txt"
ACC_ACI="$ACC_DIR/acinetobacter.acc.txt"
ACC_OUT="$ACC_DIR/outgroups.acc.txt"
ACC_ALL="$ACC_DIR/all_refs.acc.txt"

REF_DIR="$OUTDIR/ref_fasta"
REF_LIST="$OUTDIR/ref_list.txt"

# ---- seed accessions ----
if [[ -f "$SEED_ACCESSIONS_FILE" ]]; then
  grep -v '^\s*$' "$SEED_ACCESSIONS_FILE" | grep -v '^\s*#' | sort -u > "$ACC_SEED" || true
else
  : > "$ACC_SEED"
fi
log "Seed accessions: $(wc -l < "$ACC_SEED")"

# ---- extract accessions from a jsonl FILE (FIXED) ----
extract_accessions_from_jsonl_file() {
  local jsonl="$1"
  python3 - "$jsonl" <<'PY'
import sys, json
path = sys.argv[1]
out = []
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except Exception:
            continue
        # NCBI datasets summary often returns one report per line with top-level "accession"
        acc = obj.get("accession")
        if acc:
            out.append(acc)
        # some outputs can also contain "reports" lists
        reps = obj.get("reports") or []
        for r in reps:
            acc = r.get("accession")
            if acc:
                out.append(acc)
for a in out:
    print(a)
PY
}

fetch_accessions_by_taxon() {
  local taxon="$1"
  local want_n="$2"
  local limit_n="$3"
  local outfile="$4"

  local tag
  tag="$(sanitize "$taxon")"
  local jsonl="$OUTDIR/tmp/${tag}.jsonl"

  : > "$outfile"
  log "STEP: Fetch accessions | taxon='$taxon' want=$want_n limit=$limit_n assembly=$ASSEMBLY_LEVELS timeout=${SUMMARY_TIMEOUT}s"

  if ! timeout "$SUMMARY_TIMEOUT" datasets summary genome taxon "$taxon" \
      --assembly-level "$ASSEMBLY_LEVELS" \
      --annotated \
      --exclude-atypical \
      --exclude-multi-isolate \
      --limit "$limit_n" \
      --as-json-lines > "$jsonl" 2>>"$SUM_LOG"; then
    warn "datasets summary failed for taxon=$taxon"
    tail -n 80 "$SUM_LOG" >&2 || true
    return 1
  fi

  [[ -s "$jsonl" ]] || { warn "Empty jsonl for taxon=$taxon"; return 0; }
  log "  OK: summary jsonl lines: $(wc -l < "$jsonl")"

  # FIXED extraction
  extract_accessions_from_jsonl_file "$jsonl" | sort -u | head -n "$want_n" > "$outfile"

  log "  OK: got $(wc -l < "$outfile") accessions for $taxon"
  head -n 5 "$outfile" | sed 's/^/     sample: /' || true
  return 0
}

# ---- fetch Acinetobacter + outgroups ----
fetch_accessions_by_taxon "Acinetobacter" "$MAX_ACINETOBACTER" "$SUMMARY_LIMIT_ACI" "$ACC_ACI" \
  || die "Failed to fetch Acinetobacter accessions (see $SUM_LOG)"

: > "$ACC_OUT"
for tax in "${OUTGROUP_TAXA[@]}"; do
  tmp="$OUTDIR/tmp/$(sanitize "$tax").acc.txt"
  fetch_accessions_by_taxon "$tax" "$MAX_PER_OUTGROUP" "$SUMMARY_LIMIT_OUT" "$tmp" \
    || die "Failed to fetch outgroup taxon=$tax (see $SUM_LOG)"
  cat "$tmp" >> "$ACC_OUT"
done
sort -u "$ACC_OUT" -o "$ACC_OUT"
log "Outgroup accessions total: $(wc -l < "$ACC_OUT")"

# ---- merge ----
cat "$ACC_SEED" "$ACC_ACI" "$ACC_OUT" 2>/dev/null \
  | grep -v '^\s*$' | grep -v '^\s*#' | sort -u > "$ACC_ALL"
log "Total unique reference accessions: $(wc -l < "$ACC_ALL")"
[[ -s "$ACC_ALL" ]] || die "No reference accessions collected."

# ---- download genomes (batch) ----
ZIP="$OUTDIR/zips/ncbi_refs.zip"
UNZ="$OUTDIR/tmp/ncbi_refs_unzipped"
rm -f "$ZIP"; rm -rf "$UNZ"; mkdir -p "$UNZ"

log "STEP: Download genomes via NCBI datasets (batch) -> $ZIP"
datasets download genome accession --inputfile "$ACC_ALL" --include genome --filename "$ZIP" >>"$DL_LOG" 2>&1 \
  || { tail -n 120 "$DL_LOG" >&2; die "datasets download failed (see $DL_LOG)"; }

log "STEP: Unzip genomes"
unzip -q "$ZIP" -d "$UNZ" >>"$DL_LOG" 2>&1 || { tail -n 120 "$DL_LOG" >&2; die "unzip failed (see $DL_LOG)"; }

# ---- collect FASTA ----
log "STEP: Collect *_genomic.fna -> $REF_DIR"
: > "$REF_LIST"
n=0
while IFS= read -r f; do
  [[ -f "$f" ]] || continue
  acc="$(echo "$f" | grep -oE 'G[CF]A_[0-9]+\.[0-9]+' | head -n 1 || true)"
  [[ -z "${acc:-}" ]] && acc="REF$(printf '%04d' $((n+1)))"
  out="$REF_DIR/${acc}.genomic.fna"
  cp -f "$f" "$out"
  [[ -s "$out" ]] || continue
  echo "$(readlink -f "$out")" >> "$REF_LIST"
  n=$((n+1))
done < <(find "$UNZ" -type f -name "*_genomic.fna")

log "  OK: Reference FASTA collected: $n"
[[ "$n" -gt 0 ]] || die "No genomic.fna found after download."

# ---- fastANI query vs refs ----
RAW_QVSR="$OUTDIR/fastani_query_vs_refs.raw.tsv"
TSV_QVSR="$OUTDIR/fastani_query_vs_refs.tsv"

log "STEP: fastANI query vs refs"
log "  fastANI --ql $QUERY_LIST --rl $REF_LIST -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_QVSR"
fastANI \
  --ql "$QUERY_LIST" \
  --rl "$REF_LIST" \
  -t "$THREADS" \
  --minFraction "$MINFRACTION" \
  --fragLen "$FRAGLEN" \
  -o "$RAW_QVSR" >>"$ANI_LOG" 2>&1 \
  || { tail -n 160 "$ANI_LOG" >&2; die "fastANI failed (see $ANI_LOG)"; }

[[ -s "$RAW_QVSR" ]] || die "fastANI output empty: $RAW_QVSR"

echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$TSV_QVSR"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_QVSR" >> "$TSV_QVSR"

log "Top hits (ANI desc):"
tail -n +2 "$TSV_QVSR" | sort -k3,3nr | head -n 15 | sed 's/^/  /'

# ---- fastANI matrix for all genomes ----
if [[ "$DO_MATRIX" == "yes" ]]; then
  ALL_LIST="$OUTDIR/all_genomes_list.txt"
  cat "$QUERY_LIST" "$REF_LIST" > "$ALL_LIST"

  RAW_ALL="$OUTDIR/fastani_all.raw.tsv"
  log "STEP: fastANI all-vs-all matrix (for tree)"
  log "  fastANI --ql $ALL_LIST --rl $ALL_LIST --matrix -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_ALL"
  fastANI \
    --ql "$ALL_LIST" \
    --rl "$ALL_LIST" \
    --matrix \
    -t "$THREADS" \
    --minFraction "$MINFRACTION" \
    --fragLen "$FRAGLEN" \
    -o "$RAW_ALL" >>"$ANI_LOG" 2>&1 \
    || { tail -n 160 "$ANI_LOG" >&2; die "fastANI matrix failed (see $ANI_LOG)"; }

  [[ -f "${RAW_ALL}.matrix" ]] || die "Matrix not produced: ${RAW_ALL}.matrix"
  log "  OK: Matrix file: ${RAW_ALL}.matrix"
fi

log "DONE."
log "Accessions counts:"
log "  - seed:         $(wc -l < "$ACC_SEED")"
log "  - acinetobacter:$(wc -l < "$ACC_ACI")"
log "  - outgroups:    $(wc -l < "$ACC_OUT")"
log "  - all refs:     $(wc -l < "$ACC_ALL")"