Category Archives: Articles

Toxin–Antitoxin (TA) Systems & Pulldown Experiments — Practical Guide

TA_operon_shared_promoter_v3_en

A consolidated reference covering TA gene organization and regulation, promoter vs RBS roles, co-transcription criteria, start-codon troubleshooting, RNA-seq analysis strategy, pulldown experiment design/controls/statistics, and multi-omics integration.


1) TA System Overview

  • Composition: Paired genes encoding toxin (protein) and antitoxin (protein).
  • Typical organization: Same strand, antitoxin upstream, toxin downstream, forming a bicistronic operon.
  • Transcriptional control: Frequently transcribed from a shared σ⁷⁰-like promoter (−35/−10) upstream of the antitoxin.
  • Autoregulation: Antitoxin or TA complex often binds operator sites near the promoter to repress transcription. Under stress (e.g., antitoxin proteolysis), repression is relieved → toxin increases.
  • Functions: Stress response, persistence, plasmid maintenance, virulence modulation (family-specific).

2) Promoter vs RBS — Who Does What?

  • Promoter → transcription start.
    • Recognized by RNA polymerase holoenzyme (core RNAP + σ factor; often σ⁷⁰).
    • −35/−10 boxes typically spaced 16–19 bp; TSS sits downstream of −10.
  • RBS (Shine–Dalgarno) → translation start.
    • 16S rRNA (30S) anti-SD tail base-pairs with the RBS to position the start codon (usually ATG, also GTG/TTG).
    • RBS–start codon spacing commonly 5–10 nt.
  • Cheats: Promoter decides where transcription begins; RBS decides where translation begins.

3) Evidence Framework for a Shared Promoter / Co-transcription

Goal: Decide whether antitoxin & toxin belong to the same transcript and quantify co-expression.

3.1 Structural / Sequence Evidence

  1. Genomic context: Same strand; short intergenic (<50–100 bp) or slight overlap.
  2. Promoter prediction: Clear −35/−10 upstream of antitoxin; no strong independent promoter upstream of toxin.
  3. RBS: SD-like motifs upstream of both ORFs.
  4. Terminator: No strong Rho-independent terminator between the pair; terminator at operon end.

3.2 RNA-seq Evidence (Strand-specific libraries preferred)

  1. Coverage continuity: Same-strand coverage crosses the intergenic region.
  2. Spanning fragments: Paired-end insert spans the antitoxin↔toxin boundary.
  3. Expression correlation: From all samples (e.g., 27), compute TPM/CPM correlations; Pearson/Spearman r ≥ 0.8, p<0.01; remains high within each timepoint subset.
  4. DE consistency: For each timepoint’s treated vs control, log2FC for both genes are same direction with FDR<0.05.
  5. (Optional) TSS evidence: 5′-enriched or TSS-seq reveals shared TSS cluster upstream of antitoxin.

Note: Non-strand-specific libraries weaken strand-continuity evidence; interpret cautiously.


4) Why Your Provided Sequences Start with “TTA,” Not “ATG”

  • Observed “TTA” starts suggest:
    1. Sequences include 5′ UTR/promoter (CDS not cut at true start).
    2. Sequences could be reverse-complement relative to coding strand.
    3. Bacteria can use GTG/TTG as starts, but TTA is not a typical start codon.
  • Standard resolution steps:
    • BLAST the fragments to the genome to get strand & coordinates.
    • Six-frame translate; on the correct strand, locate the longest ORF starting with ATG/GTG/TTG and ending at a stop.
    • Verify RBS distance (5–10 nt) and domain homology (BLASTX/HMM against TA families).
    • Use RNA-seq coverage shape/TSS to refine the start site.

5) RNA-seq Analysis Plan for 27 Samples (Example Design)

Design: Same strain × 3 conditions (untreated / Mitomycin C / Moxifloxacin) × 3 timepoints × 3 biological replicates = 27 samples.

5.1 Pipeline Outline

  1. QC & Alignment: FastQC/MultiQC → trimming → align to reference (confirm strand-specificity).
  2. Quantification: featureCounts/Salmon → DESeq2/edgeR normalization.
  3. Differential Expression:
    • For each timepoint, contrast treated vs untreated (include batch if needed).
    • Output per contrast: log2FC, SE, p, FDR.
  4. TA Co-transcription Checks:
    • IGV views: same-strand continuity across intergenic; spanning fragments.
    • Correlation between antitoxin & toxin across 27 samples (r, p).
    • DE direction consistency for both genes.
  5. Pulldown Targets in RNA-seq:
    • For candidate target list, extract log2FC/FDR; produce volcano/heatmaps.
    • Perform functional enrichment (GO/KEGG/COG) with overlap to pulldown hits.
  6. Deliverables:
    • IGV screenshots with annotated −35/−10, TSS, RBS, terminator.
    • MA/volcano plots, sample PCA, correlation plots.
    • Tables summarizing DEGs per timepoint and pulldown×RNA-seq overlaps.

6) Pulldown Experiments — Types, Controls, Statistics

6.1 Types

  • Protein–protein pulldown / affinity purification: Bait = toxin/antitoxin protein (His/FLAG/biotin); ID by LC–MS/MS.
  • Nucleic-acid pulldown:
    • DNA pulldown: bait = promoter/operator DNA; identify bound proteins (MS).
    • RNA pulldown: bait = specific RNA; identify bound proteins (MS) or enriched RNAs.

6.2 Critical Controls

  • Empty vector/beads, irrelevant protein, mutant bait (disrupt binding), competition elution.
  • 3 biological replicates recommended.

6.3 Hit Calling (Proteomics Example)

  • Use SAINT / MSstats / DEP or log2FC + FDR thresholds, e.g. log2FC ≥ 1 & FDR ≤ 0.05, consistently detected in ≥2 replicates.
  • Remove sticky background proteins (CRAPome) and ubiquitous ribosomal/chaperones where appropriate.
  • Deliver a high-confidence candidate list.

6.4 Integration with RNA-seq

  • Cross-table: pulldown hits vs RNA-seq log2FC/FDR across conditions/timepoints.
  • Enrichment/pathways: overlap enrichment for hits and DEGs.
  • Evidence ladder:
    1. Pulldown enrichment (binding);
    2. RNA-seq co-expression / DE (regulatory consistency);
    3. Biophysical/functional assays (EMSA/SPR/ChIP-qPCR, reporter assays) for validation.

7) Validation Roadmap (Low→High Effort)

  1. RT-qPCR: Junction-spanning primers across antitoxin→toxin.
  2. EMSA/SPR: Direct binding & affinity to operator by antitoxin/TA complex.
  3. Reporter / Mutagenesis: Disrupt operator/−35/−10 or RBS, assess transcription/translation impact.
  4. ChIP-qPCR/ChIP-seq: In vivo occupancy (if antitoxin has DNA-binding domain).
  5. RACE/TSS-seq: Precise TSS mapping to confirm shared promoter.

8) Practical Criteria & Verdict Grades

  • Structure: Same strand; intergenic <100 bp; no strong terminator in-between.
  • Promoter: Clear −35/−10 upstream of antitoxin; toxin lacks strong independent promoter.
  • RNA-seq: Same-strand continuity across intergenic; boundary-spanning fragments; r ≥ 0.8 (p<0.01) across all samples; per-timepoint log2FC same direction (FDR<0.05).
  • Conclusion grades: Strong support / Support / Insufficient evidence.

9) Schematic Figures (Generated)

  • Chinese-labeled (non-overlapping):
    TA_operon_shared_promoter_v3_cn.pngopen/download
  • English-labeled (non-overlapping):
    TA_operon_shared_promoter_v3_en.pngopen/download

Each figure depicts: shared σ⁷⁰-like promoter (−35/−10, TSS)antitoxin (upstream)toxin (downstream) with each RBS, a terminal Rho-independent terminator, and stylized same-strand RNA-seq coverage that spans the intergenic region.


10) “Ready-to-Ask” Template for Collaborators

Objective: Determine if the TA pair shares a promoter and is co-transcribed; call DEGs per timepoint across conditions; test RNA-seq changes for pulldown targets.

Please deliver:

  1. IGV tracks with −35/−10, TSS, RBS, terminator, and boundary-spanning reads.
  2. DE tables (per timepoint per contrast), with log2FC/FDR.
  3. Correlation stats (antitoxin↔toxin r, p) across 27 samples and within timepoints.
  4. Pulldown×RNA-seq cross table (+ enrichment analyses).
  5. One-page verdict: shared promoter? co-transcription? evidence grade & key screenshots.

Inputs we’ll provide: Reference genome/annotation (FASTA/GFF/GTF), BAM/BAI, sample sheet, pulldown target list.


One-liner Summary

Promoter = transcription start; RBS = translation start.
For TA pairs, antitoxin→toxin often sits in a single operon driven by a shared promoter; RNA-seq continuity, spanning fragments, correlation, and concordant DE together provide strong evidence for co-transcription.

克隆≠表型完全相同 —— 详细阐述与具体例子

“克隆性”是指细菌在基因型上几乎无差异,属于高度同源的一组,但这并不意味着它们在耐药性等表型上一定一模一样。克隆株的表型可以有差异,原因包括调控机制、基因表达水平、外排泵活性、膜蛋白变化、插入序列等导致的基因功能或表达的不同。

具体例子:

在临床实践中,研究发现铜绿假单胞菌(Pseudomonas aeruginosa)的同一克隆菌株,虽然基因组完全一致,但对美罗培南(Meropenem)的最小抑菌浓度(MIC)可能不同。进一步检测发现:某些菌株因oprD基因突变、插入等,导致外膜通道蛋白表达下调,从而表现为高MIC(耐药),而同簇内oprD完整的菌株则敏感。1

这说明:即使是同一克隆簇的菌株,耐药表型可以因调控突变或基因表达等后天因素存在差异。


科学文章(中文发布版)

克隆性≠表型一致:基因同源不等于耐药全同 —— Holger邮件讨论解读

在近期的菌群全基因组分析中,我们通过cgMLST/WGS技术识别出了若干克隆相关的菌株簇。以Acinetobacter baumannii和Klebsiella pneumoniae为例,每个克隆簇内的菌株,其耐药基因和毒力因子分布高度一致,且AST(药敏)表型大多数时间点表现相近。

但邮件交流中,Gradientech团队指出:“仅凭MLST等判断克隆性,不能保证所有基因组无差异。即使核心基因型相同,也可能存有表型差异,例如耐药性或毒力不同。” 这很有道理。

举例说明:临床细菌如铜绿假单胞菌,同簇内部分菌株,因为外排泵或通道蛋白基因(如oprD)表达下降、插入序列影响或失活突变,导致美罗培南敏感性变弱(MIC升高),表型变为耐药。而同克隆簇的另一株也许表达正常,表现为敏感。这种“基因型同源但表型不完全一致”的现象,正是精准医学面临的挑战之一。1

针对克隆株是否要在分析中剔除,Holger建议保持信息透明,在方法和讨论部分如实披露、科学解释,不做删减。讨论段推荐补充一句:“克隆性并不必然对应表型耐药表达的一致,实际还受调控机制等多种因素影响,因此本研究保留所有分离株评估表型检测方法的性能。如果后续审核需要,可以按簇去重再行敏感性分析。”


总结

  • 克隆唯一区分于基因型层面,表型如耐药往往还会受基因调控、表达水平、特殊突变等多种影响,同一克隆内“表型一致”不是绝对规律。
  • 这一认识对临床耐药菌株流行病学追踪和新型方法学评估极其关键,有助于提升科学结论的严谨性。1 2345678910

Step‑by‑Step Workflow for Phylogenomic and Functional Analysis of 100 Clinical Isolates Using WGS Data

  • ggtree_and_gheatmap_kpneumoniae_new
  • ggtree_and_gheatmap_paeruginosa_new
  • ggtree_and_gheatmap_paeruginosa
  • ggtree_and_gheatmap_kpneumoniae
  • ggtree_and_gheatmap_abaumannii
  • ggtree_and_gheatmap_ecoli

Below is a sorted, step‑by‑step protocol integrating all commands and materials from your notes — organized into logical analysis phases. It outlines how to process whole‑genome data, annotate with Prokka, perform pan‑genome and SNP‑based phylogenetic analysis, and conduct resistome profiling and visualization.


1. Initial Setup and Environment Configuration

  1. Create and activate working environments:
conda create -n bengal3_ac3 python=3.9
conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
  1. Copy necessary config files for bacto:
cp /home/jhuang/Tools/bacto/bacto-0.1.json .
cp /home/jhuang/Tools/bacto/Snakefile .
  1. Prepare R environment for later visualization:
mamba create -n r_414_bioc314 -c conda-forge -c bioconda \
  r-base=4.1.3 bioconductor-ggtree bioconductor-treeio \
  r-ggplot2 r-dplyr r-ape pandoc

2. Species Assignment and MLST Screening

  • Use WGS data to determine phylogenetic relationships based on MLST. Confirm species identity of isolates: E. coli, K. pneumoniae, A. baumannii complex, P. aeruginosa.
  • Example log output:
[10:19:59] Excluding ecoli.icdA.262 due to --exclude option
[10:19:59] If you like MLST, you're going to absolutely love cgMLST!
[10:19:59] Done.
  • Observation: No strain was close enough to be a suspected clonal strain (to be confirmed later with SNP‑based analysis).

3. Genome Annotation with Prokka

Cluster isolates into four species folders and annotate each genome separately.

Example shell script:

for cluster in abaumannii_2 ecoli_achtman_4 klebsiella paeruginosa; do
  GENUS_MAP=( [abaumannii_2]="Acinetobacter" [ecoli_achtman_4]="Escherichia" [klebsiella]="Klebsiella" [paeruginosa]="Pseudomonas" )
  SPECIES_MAP=( [abaumannii_2]="baumannii" [ecoli_achtman_4]="coli" [klebsiella]="pneumoniae" [paeruginosa]="aeruginosa" )
  prokka --force --outdir prokka/${cluster} --cpus 8 \
         --genus ${GENUS_MAP[$cluster]} --species ${SPECIES_MAP[$cluster]} \
         --addgenes --addmrna --prefix ${cluster} \
         -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
 done

4. Pan‑Genome Construction with Roary

Run Roary for each species cluster (GFF outputs from Prokka):

roary -f roary/ecoli_cluster -e --mafft -p 100 prokka/ecoli_cluster/*.gff
roary -f roary/kpneumoniae_cluster1 -e --mafft -p 100 prokka/kpneumoniae_cluster1/*.gff
roary -f roary/abaumannii_cluster -e --mafft -p 100 prokka/abaumannii_cluster/*.gff
roary -f roary/paeruginosa_cluster -e --mafft -p 100 prokka/paeruginosa_cluster/*.gff

Output: core gene alignments for each species.


5. Core‑Genome Phylogenetic Tree Building (RAxML‑NG)

Use maximum likelihood reconstruction per species:

raxml-ng --all --msa roary/ecoli_achtman_4/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix ecoli_core_gene_tree_1000
raxml-ng --all --msa roary/klebsiella/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix klebsiella_core_gene_tree_1000
raxml-ng --all --msa roary/abaumannii/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix abaumannii_core_gene_tree_1000
raxml-ng --all --msa roary/paeruginosa/core_gene_alignment.aln --model GTR+G --bs-trees 1000 --threads 40 --prefix paeruginosa_core_gene_tree_1000

6. SNP Detection and Distance Calculation

  1. Extract SNP sites and generate distance matrices:
snp-sites -v -o core_snps.vcf roary/ecoli_achtman_4/core_gene_alignment.aln
snp-dists roary/ecoli_achtman_4/core_gene_alignment.aln > ecoli_snp_dist.tsv
  1. Repeat for other species and convert to Excel summary:
~/Tools/csv2xls-0.4/csv_to_xls.py ecoli_snp_dist.tsv klebsiella_snp_dist.tsv abaumannii_snp_dist.tsv paeruginosa_snp_dist.tsv -d$'\t' -o snp_dist.xls

7. Resistome and Virulence Profiling with Abricate

  1. Install and set up databases:
conda install -c bioconda -c conda-forge abricate=1.0.1
abricate --setupdb
abricate --list
DATABASE        SEQUENCES       DBTYPE  DATE
vfdb    2597    nucl    2025-Oct-22
resfinder       3077    nucl    2025-Oct-22
argannot        2223    nucl    2025-Oct-22
ecoh    597     nucl    2025-Oct-22
megares 6635    nucl    2025-Oct-22
card    2631    nucl    2025-Oct-22
ecoli_vf        2701    nucl    2025-Oct-22
plasmidfinder   460     nucl    2025-Oct-22
ncbi    5386    nucl    2025-Oct-22
  1. Run VFDB for virulence genes:
# # Run VFDB example
# abricate --db vfdb genome.fasta > vfdb.tsv
#
# # strict filter (≥90% ID, ≥80% cov) using header-safe awk
# awk -F'\t' 'NR==1{
#   for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
#   print; next
# } ($id+0)>=90 && ($cov+0)>=80' vfdb.tsv > vfdb.strict.tsv

# 0) Define your cluster directories (exactly as in your prompt)
CLUSTERS="fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3"

# 1) Run Abricate (VFDB) per isolate, strictly filtered (≥90% ID, ≥80% COV)
for D in $CLUSTERS; do
  mkdir -p "$D/abricate_vfdb"
  for F in "$D"/*.fasta; do
    ISO=$(basename "$F" .fasta)
    # raw
    abricate --db vfdb "$F" > "$D/abricate_vfdb/${ISO}.vfdb.tsv"
    # strict filter while keeping header (header-safe awk)
    awk -F'\t' 'NR==1{
      for(i=1;i<=NF;i++){if($i=="%IDENTITY") id=i; if($i=="%COVERAGE") cov=i}
      print; next
    } ($id+0)>=90 && ($cov+0)>=80' \
      "$D/abricate_vfdb/${ISO}.vfdb.tsv" > "$D/abricate_vfdb/${ISO}.vfdb.strict.tsv"
  done
done

# 2) Build per-cluster "isolate
<TAB>gene" lists (Abricate column 5 = GENE)
for D in $CLUSTERS; do
  OUT="$D/virulence_isolate_gene.tsv"
  : > "$OUT"
  for T in "$D"/abricate_vfdb/*.vfdb.strict.tsv; do
    ISO=$(basename "$T" .vfdb.strict.tsv)
    awk -F'\t' -v I="$ISO" 'NR>1{print I"\t"$5}' "$T" >> "$OUT"
  done
  sort -u "$OUT" -o "$OUT"
done

# 3) Create a per-isolate "signature" (sorted, comma-joined gene list)
for D in $CLUSTERS; do
  IN="$D/virulence_isolate_gene.tsv"
  SIG="$D/virulence_signatures.tsv"
  awk -F'\t' '
  {m[$1][$2]=1}
  END{
    for(i in m){
      n=0
      for(g in m[i]){n++; a[n]=g}
      asort(a)
      printf("%s\t", i)
      for(k=1;k<=n;k++){printf("%s%s", (k>1?",":""), a[k])}
      printf("\n")
      delete a
    }
  }' "$IN" | sort > "$SIG"
done

# 4) Report whether each cluster is internally identical
for D in $CLUSTERS; do
  SIG="$D/virulence_signatures.tsv"
  echo "== $D =="
  # how many unique signatures?
  CUT=$(cut -f2 "$SIG" | sort -u | wc -l)
  if [ "$CUT" -eq 1 ]; then
    echo "  Virulence profiles: IDENTICAL within cluster"
  else
    echo "  Virulence profiles: NOT IDENTICAL (unique signatures: $CUT)"
    echo "  --- differing isolates & their signatures ---"
    cat "$SIG"
  fi
done
  1. Summarize VFDB results:
#----
# Make gene lists (VFDB "GENE" = column 5) for each isolate
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
cut -f5 fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | tail -n +2 | sort -u > c2_070.genes.txt

# Show symmetric differences
echo "Genes present in QRC018 only:"
comm -23 c2_018.genes.txt c2_070.genes.txt

echo "Genes present in QRC070 only:"
comm -13 c2_018.genes.txt c2_070.genes.txt

awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | sort -u > c2_018.gene_product.txt
awk -F'\t' 'NR>1{print $5"\t"$6}' fasta_kpnaumoniae_cluster2/abricate_vfdb/QRC070.vfdb.strict.tsv | sort -u > c2_070.gene_product.txt

# Genes unique to QRC018 with product
join -t $'\t' <(cut -f1 c2_018.gene_product.txt) <(cut -f1 c2_070.gene_product.txt) -v1 > /dev/null # warms caches
comm -23 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_018.gene_product.txt

# Genes unique to QRC070 with product
comm -13 <(cut -f1 c2_018.gene_product.txt | sort) <(cut -f1 c2_070.gene_product.txt | sort) \
  | xargs -I{} grep -m1 -P "^{}\t" c2_070.gene_product.txt

# Make a cluster summary table from raw (or strict) TSVs
for D in fasta_abaumannii_cluster fasta_kpnaumoniae_cluster1 fasta_kpnaumoniae_cluster2 fasta_kpnaumoniae_cluster3; do
  echo "== $D =="
  abricate --summary "$D"/abricate_vfdb/*.vfdb.strict.tsv | column -t
done
  1. Make gene presence lists and find symmetric differences between isolates:
cut -f5 fasta_kpneumoniae_cluster2/abricate_vfdb/QRC018.vfdb.strict.tsv | tail -n +2 | sort -u > c2_018.genes.txt
comm -23 c2_018.genes.txt c2_070.genes.txt

5 (Optional). For A. baumannii with zero hits, run DIAMOND BLASTp against VFDB proteins.

#If you want to be extra thorough for A. baumannii. Because your VFDB nucleotide set returned zero for A. baumannii, you can cross-check with VFDB protein via DIAMOND:

# Build a DIAMOND db (once)
diamond makedb -in VFDB_proteins.faa -d vfdb_prot

# Query predicted proteins (Prokka .faa) per isolate
for F in fasta_abaumannii_cluster/*.fasta; do
  ISO=$(basename "$F" .fasta)
  prokka --cpus 8 --outdir prokka/$ISO --prefix $ISO "$F"
  diamond blastp -q prokka/$ISO/$ISO.faa -d vfdb_prot \
    -o abricate_vfdb/$ISO.vfdb_prot.tsv \
    -f 6 qseqid sseqid pident length evalue bitscore qcovhsp \
    --id 90 --query-cover 80 --max-target-seqs 1 --threads 8
done

8. Visualization with ggtree and Heatmaps

Render circular core genome trees and overlay selected ARGs or gene presence/absence matrices.

Key R snippet:

library(ggtree)
library(ggplot2)
library(dplyr)
info <- read.csv("typing_until_ecpA.csv", sep="\t")
tree <- read.tree("raxml-ng/snippy.core.aln.raxml.tree")
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info +
     geom_tippoint(aes(color=ST), size=4) + geom_tiplab2(aes(label=name), offset=1)
gheatmap(p, heatmapData2, width=4, colnames_angle=45, offset=4.2)

Output: multi‑layer circular tree (e.g., ggtree_and_gheatmap_selected_genes.png).


9. Final Analyses and Reporting

  • Compute pairwise SNP distances and identify potential clonal clusters using species‑specific thresholds (e.g., ≤17 SNPs for E. coli).
  • If clones exist, retain one representative per cluster for subsequent Boruta analysis or CA/EA metrics.
  • Integrate ARG heatmaps to the trees for intuitive visualization of resistance determinants relative to phylogeny.

Protocol Description Summary

This workflow systematically processes WGS data from multiple bacterial species:

  1. Annotation (Prokka) ensures consistent gene predictions.
  2. Pan‑genome alignment (Roary) captures core and accessory variation.
  3. Phylogenetic reconstruction (RAxML‑NG) provides species‑level evolutionary context.
  4. SNP‑distance computation (snp‑sites, snp‑dists) allows clonal determination and diversity quantification.
  5. Resistome analysis (Abricate, Diamond) characterizes antimicrobial resistance and virulence determinants.
  6. Visualization (ggtree + gheatmap) integrates tree topology with gene presence–absence or ARG annotations.

Properly documented, this end‑to‑end pipeline can serve as the Methods section for publication or as an online reproducible workflow guide.

Methods (for the manuscript). Genomes were annotated with PROKKA v1.14.5 [1]. A pangenome was built with Roary v3.13.0 [2] using default settings (95% protein identity) to derive a gene presence/absence matrix. Core genes—defined as present in 99–100% of isolates by Roary—were individually aligned with MAFFT v7 [3] and concatenated; the resulting multiple alignment was used to infer a maximum-likelihood phylogeny in RAxML-NG [4] under GTR+G, with 1,000 bootstrap replicates [5] and a random starting tree. Trees were visualized with the R package ggtree [6] in circular layout; tip colors denote sequence type (ST), with MLST assigned via PubMLST/BIGSdb [7]; concentric heatmap rings indicate the presence (+) or absence (-) of resistance genes.

  1. Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.
  2. Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. 2015. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31:3691–3693.
  3. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution 30:772–780.
  4. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A, Wren J. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
  5. Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
  6. Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. 2017. ggtree: an R package for visualization and annotation of phylogenetic trees. Methods in Ecology and Evolution 8(1):28–36.
  7. Jolley KA, Bray JE, Maiden MCJ. 2018. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3:124.

Phylogenetic tree of E. coli isolates using the ggtree and ggplot2 packages (plotTreeHeatmap)

ggtree_and_gheatmap_ecoli
library(ggtree)
library(ggplot2)
library(dplyr)
#setwd("/home/jhuang/DATA/Data_Ben_Boruta_Analysis/plotTreeHeatmap_ecoli/")

info <- read.csv("isolate_ecoli_.csv", sep="\t", check.names = FALSE)
info$name <- info$Isolate
# make ST discrete
info$ST <- factor(info$ST)

tree <- read.tree("../ecoli_core_gene_tree_1000.raxml.bestTree")
cols <- c("10"="cornflowerblue","38"="darkgreen","46"="seagreen3","69"="tan","88"="red",  "131"="navyblue", "156"="gold",     "167"="green","216"="orange","405"="pink","410"="purple","1882"="magenta","2450"="brown", "2851"="darksalmon","3570"="chocolate4","4774"="darkkhaki")
# "290"="azure3", "297"="maroon","325"="lightgreen",     "454"="blue","487"="cyan", "558"="skyblue2", "766"="blueviolet"

#heatmapData2 <- info %>% select(Isolate, blaCTX.M, blaIMP, blaKPC, blaNDM.1, blaNDM.5, blaOXA.23.like, blaOXA.24.like, blaOXA.48.like, blaOXA.58.like, blaPER.1, blaSHV, blaVEB.1, blaVIM)  #ST,
heatmapData2 <- info %>% select(
    Isolate,
    `blaCTX-M`, blaIMP, blaKPC, `blaNDM-1`, `blaNDM-5`,
    `blaOXA-23-like`, `blaOXA-24-like`, `blaOXA-48-like`, `blaOXA-58-like`,
    `blaPER-1`, blaSHV, `blaVEB-1`, blaVIM
)
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
#names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")

#heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
#names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")

heatmap.colours <- c("darkgreen", "grey")
names(heatmap.colours) <- c("+","-")

#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
#circular

p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))

png("ggtree_and_gheatmap_mibi_selected_genes.png", width=1590, height=1300)
#svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=2, colnames_position="top", colnames_angle=90, colnames_offset_y = 2.0, hjust=0.7, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
dev.off()

# ---------

# 1) Optional: shrink tree to create more outer space (keeps relative scale)
tree_small <- tree
tree_small$edge.length <- tree_small$edge.length * 0.4    # try 0.3–0.5

# 2) Height after shrinking
ht <- max(ape::node.depth.edgelength(tree_small))

# 3) Base tree
info$ST <- factor(info$ST)
p <- ggtree(tree_small, layout = "circular", open.angle = 20) %<+% info +
    geom_tippoint(aes(color = ST), size = 1.4) +
    geom_tiplab2(aes(label = name), size = 1.4, offset = 0.02 * ht) +
    scale_color_manual(values = cols)

# 4) Reserve room outside the tips
off <- 0.30 * ht   # gap
wid <- 2.80 * ht   # ring thickness
mar <- 0.25 * ht
p_wide <- p + xlim(0, ht + off + wid + mar)

# 5) Ensure discrete values are factors and keep both levels
heatmapData2[] <- lapply(heatmapData2, function(x) factor(x, levels = c("+","-")))

# 6) Draw the heatmap
png("ggtree_and_gheatmap_readable.png", width = 3600, height = 3200, res = 350)
gheatmap(
    p_wide, heatmapData2,
    offset = off,
    width  = wid,
    color  = "white",            # borders between tiles
    colnames = TRUE,
    colnames_position = "top",
    colnames_angle = 0,
    colnames_offset_y = 0.04 * ht,
    hjust = 0.5,
    font.size = 7
) +
    scale_fill_manual(values = c("+"="darkgreen", "-"="grey"), drop = FALSE) +
    guides(
        fill  = guide_legend(title = "", override.aes = list(size = 4)),
        color = guide_legend(override.aes = list(size = 3))
    )
dev.off()

#-------------- plot with true scale -------------

# tree height (root → farthest tip)
ht <- max(ape::node.depth.edgelength(tree))

# circular tree with a small opening and compact labels
p <- ggtree(tree, layout = "circular", open.angle = 20) %<+% info +
    geom_tippoint(aes(color = ST), size = 1.8, alpha = 0.9) +
    geom_tiplab2(aes(label = name), size = 2.2, offset = 0.06 * ht) +
    scale_color_manual(values = cols) +
    theme(legend.title = element_text(size = 12),
                legend.text  = element_text(size = 10))

# higher-DPI export so small cells look crisp
png("ggtree_true_scale.png", width = 2400, height = 2400, res = 300)
p
dev.off()

# --- Heatmap: push it further out and make it wider ---
#svg("ggtree_and_gheatmap_mibi_selected_genes_true_scale.svg",
#    width = 32, height = 28)

png("ggtree_and_gheatmap_mibi_selected_genes_true_scale.png",
        width = 2000, height = 1750,  res = 200)

gheatmap(
    p, heatmapData2,
    offset = 0.24 * ht,         # farther from tips (was 0.08*ht)
    width  = 20.0 * ht,         # much wider ring (was 0.35*ht)
    color  = "white",           # thin tile borders help readability
    colnames_position  = "top",
    colnames_angle     = 90,     # keep column labels horizontal to save space
    colnames_offset_y  = 0.08 * ht,
    hjust = 0.1,
    font.size = 2.6               # bigger column label font
) +
    scale_fill_manual(values = heatmap.colours) +
    guides(fill  = guide_legend(title = "", override.aes = list(size = 4)),
                 color = guide_legend(override.aes = list(size = 3))) +
    theme(legend.position = "right",
                legend.title    = element_text(size = 10),
                legend.text     = element_text(size = 8))
dev.off()

This R script visualizes a phylogenetic tree of E. coli isolates using the ggtree and ggplot2 packages, annotated with sequence type (ST) colors and a heatmap showing the presence or absence of antimicrobial resistance genes. It creates several versions of the circular tree figure to adjust scaling and readability.

Key Steps

  1. Load libraries and data The code imports ggtree, ggplot2, and dplyr, reads isolate metadata (isolate_ecoli_.csv), and loads a phylogenetic tree file (ecoli_core_gene_tree_1000.raxml.bestTree).1112
  2. Prepare metadata It defines each isolate’s name and converts the sequence type (ST) into a categorical factor. A subset of columns (various bla resistance gene markers) is selected to create a heatmap data matrix, where each gene’s presence (“+”) or absence (“–”) will be visualized.
  3. Define color schemes Different STs are assigned distinct colors, and binary gene presence/absence states (“+”, “–”) mapped to “darkgreen” and “grey,” respectively.13
  4. Plot tree with annotation The main tree is plotted circularly using:
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + 
     geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols)

Tip labels are added for isolates, and STs are colored as per the legend.

  1. Add heatmap (gene presence/absence) The gheatmap() function appends the gene matrix to the tree, aligning rows with the tree tips and providing a heatmap of resistance gene patterns.1213
  2. Export figures Multiple PNGs are generated:
    • ggtree.png: Basic circular tree
    • ggtree_and_gheatmap_mibi_selected_genes.png: Tree + resistance gene heatmap
    • Scaled versions with adjusted offsets and spacing for improved readability and “true-scale” representations

Final Output

The resulting figures show a circular phylogenetic tree annotated by ST colors, accompanied by a ring-style heatmap indicating the distribution of key resistance genes across isolates. The layout adjustments (tree shrinking, offset tuning, width scaling) ensure legibility when genes or isolates are numerous.1213 1415161718192021222324252627282930

雙側足底筋膜炎與體外衝擊波治療(ESWT)全指南

概述

雙側足底筋膜炎是成人足跟痛最常見的原因之一,典型表現為清晨或久坐後首次下地時足跟內側刺痛,走動少許可緩解,但久站久走或跑步後又加重。313233 病因多與足底筋膜反覆牽拉導致的微撕裂與退行性改變相關,而非單純的急性發炎,雙側受累時步態代償更明顯、承重耐受度下降。343536

定義與病理

雖名為“炎”,但顯微與臨床更多呈現著骨點病變與退行性改變(微撕裂、膠原退變),而非典型急性炎細胞浸潤。34 病灶多起於跟骨內側結節的足底筋膜近端附著點,可見筋膜增厚與壓痛;影像可能出現跟骨骨刺,但骨刺本身並非疼痛直接原因。353734 足底筋膜自跟骨延伸至足趾底部,負責支撐足弓並吸收衝擊,因此在跑跳或久站等反覆負荷下容易受損。3331

常見症狀

足跟內側或足弓處的銳痛/刺痛,清晨首步痛最明顯,稍活動後可暫時減輕,但長時間負重後再度加重。3631 雙側病例中,兩足承重對稱受限,代償步態更突出,影響長時間站立與行走的耐受度。3634

危險因素

  • 足弓異常(扁平足或高弓足)與力線偏差會增加筋膜應力。373534
  • 腓腸肌/比目魚肌緊張與跟腱緊繃導致踝關節背屈受限。3534
  • 久站、跳躍、長距離或下坡跑,以及突然提高訓練量與硬地面負荷。343536
  • 體重增加/肥胖與足跟脂肪墊退變、不合適鞋履(支撐或緩衝不足)。353634
  • 雙側受累常提示雙足共同的生物力學與負荷模式問題,需要對稱化處理。3634

診斷要點

臨床診斷以病史與查體為主:跟骨前內側足底壓痛、清晨首步痛、踝背屈受限並結合危險因素評估。3135 必要時超音波可見筋膜增厚與水腫,X光可見跟骨骨刺並協助排除應力性骨折等其他病變。35

保守治療

多數患者在非手術治療下於數月內改善,治療目標為鎮痛、恢復筋膜與小腿後群柔韌性、優化足弓力線與支撐,並逐步回歸活動(雙足同步管理)。3335

  • 負荷管理:減少高衝擊與久站,分段活動,循序漸進增加訓練量(雙足均衡)。3436
  • 冰敷與短期口服NSAIDs依醫囑鎮痛。3335
  • 拉伸訓練:足底筋膜特異性拉伸與小腿後群(腓腸肌/比目魚肌)拉伸,每日多次、逐步加量。3335
  • 夜間足托維持中立至輕度背屈位,以減輕晨起首步痛。3533
  • 矯形支撐:足弓鞋墊、足跟杯/墊,必要時貼紮短期矯正力線。3335
  • 物理治療:本體感覺與肌力訓練、軟組織鬆解,必要時超音波等,依個體化評估執行。3835

體外衝擊波治療(ESWT)

ESWT (Extrakorporale Stoßwellentherapie) 為非侵入性療法,透過短時高壓聲學衝擊刺激組織修復並達到鎮痛,常於保守治療無效的慢性病例採用,臨床上常規劃約三次為一組的門診療程。3435 模式與定位:分為聚焦型(fESWT)與徑向型(rESWT),於最痛點經皮定位施打,單次治療僅需數分鐘,可在無麻醉下完成。3435 療效與節奏:目標為中長期減痛與功能提升,常在數週至數月逐步改善;並行拉伸、鞋墊與負荷管理可提高療效與持久性。3533 不良反應:多為短暫壓痛、紅腫或小淤斑,嚴重併發症罕見,整體安全性良好。3435

其他可選方案

超音波導引下糖皮質激素注射可提供短期鎮痛,但存在復發與筋膜損傷風險,須審慎權衡;對於頑固疼痛,衝擊波治療可作為進一步的非侵入選擇。3534 針灸在4–8週內的止痛有一定證據,但長期優勢未確立,宜與標準療法綜合評估。39 手術(如部分筋膜鬆解)僅限於長期頑固且保守療法失敗者,比例相當低。34

預後與復發

大多數患者在規範保守治療下於數月內明顯改善,部分病例一年內可自限;然而疼痛可能反覆,需長期維持拉伸與力線管理。3534 雙側病變因整體負荷更高、代償更多,康復節奏宜更循序,並強化小腿後群柔韌性與足弓支撐。3334

日常建議

  • 選擇具良好足弓支撐與後跟緩衝的鞋款,及時更換磨損鞋底,避免硬地赤足久走。3335
  • 循序遞增運動量,交替低衝擊訓練(如騎行/游泳),並做好體重管理以減輕足跟負荷。3635
  • 每日2–3次進行足底筋膜與小腿後群拉伸,每次保持20–30秒,重複3–5組。3335
  • 晨起下地前先做踝泵與毛巾拉伸,減少首步痛;需久站工作者可採用軟墊與分段休息策略。3533
  • 若持續數週無改善或出現夜間靜息痛、神經症狀,應就醫評估以排除其他病因。35 4041424344454647

RNA-seq + Proteomics Integration Workflow (Δsbp; MH & TSB)

cd results/star_salmon/degenes

The article uses the following Python scripts (please see Data_Michelle_RNAseq_2025_scripts.zip).

* map_orf_1to1_progress_safe.py
* map_orf_1to1_progress.py (deprecated)
* check_missing_uniprot_fastas.py
* rescue_missing_uniprot.py
* compare_transcript_protein.py
* make_rna_only_present_excels.py (needs debugging)
* make_rna_only_not_in_protDE.py (needs debugging)
* map_symbols.py (deprecated)

0) Goal

  • Match IDs: convert UniProt protein IDs to gene symbols to allow direct comparison with RNA-seq.
  • Merge datasets by gene_symbol and retain: protein_log2FC, protein_padj, rna_log2FC, rna_padj.
  • Classify each gene/protein into:
    • both_significant_same_direction
    • both_significant_opposite_direction
    • protein_only_significant
    • rna_only_significant
  • Subset analysis: examine proteins significant in proteomics but not in RNA-seq (and vice versa) to identify post-transcriptional regulation, stability, secretion/export, or translational effects.

Comparisons (process first: 1–6)

MH medium

Index Comparison (normalized) RNA-seq Proteomics
1 Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)
2 Δsbp_4h_vs_WT_4h
3 Δsbp_18h_vs_WT_18h

TSB medium

Index Comparison (normalized) RNA-seq Proteomics
4 Δsbp_2h_vs_WT_2h (alt. Δsbp_1h_vs_WT_1h)
5 Δsbp_4h_vs_WT_4h
6 Δsbp_18h_vs_WT_18h
7 Δsbp_1h_vs_Δsbp_4h
8 Δsbp_4h_vs_Δsbp_18h
9 WT_1h_vs_WT_4h
10 WT_4h_vs_WT_18h

1) Environment

mamba activate rnaseq_prot_comparison
mamba install -c conda-forge biopython pandas openpyxl requests requests-cache tqdm -y
mamba activate rnaseq_prot_comparison

2) Populate cache once (per‑ID fetch via aligner)

# MH (example outcome: mapped 589)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh.csv

# Expected console example:
# ORFs translated: 2319; proteins: 616; mapped: 589
# Output: map_mh.csv

3) Check which UniProt IDs are missing (cache and stream)

python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report.txt

# Example:
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1296
# Missing in cache: 485
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

4) Rescue missing accessions (UniSave → UniParc → UniProtKB remap)

  • After each run, check rescue_summary.txt and rerun the checker before redoing alignments.
# 4.1) Rescue cache-missing
python rescue_missing_uniprot.py --miss_file missing_cache.txt --cache cache/uniprot_fastas

# 4.2) Re-check
python check_missing_uniprot_fastas.py \
  --mh_xlsx "Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx" \
  --tsb_xlsx "250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx" \
  --cache cache/uniprot_fastas \
  --out check_report_after_rescue.txt

# Example (OK to proceed):
# Total unique accessions: 1781
# In cache (non-empty FASTA): 1781
# Missing in cache: 0
# Present via UniProt stream now: 0
# Missing via UniProt stream now: 1781

# 4.3) Optional: rescue stream-missing
python rescue_missing_uniprot.py --miss_file missing_stream.txt --cache cache/uniprot_fastas

Notes:

  • If “Present via UniProt stream now” is 0, it usually indicates a streaming query/parse/limit issue; cached per‑accession FASTAs are valid and sufficient for alignment.

5) Rerun the aligner with rescued FASTAs

# MH final
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh \
  --min_aa 10 \
  --requests_cache \
  --out map_mh_final.csv

# TSB (cache already populated; re-run if desired)
python map_orf_1to1_progress_safe.py \
  --rna_fa ../../genome/genome.transcripts.fa \
  --prot_xlsx 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb \
  --min_aa 10 \
  --requests_cache \
  --out map_tsb.csv

6) Final comparison (merge, classify, export)

# Build merged comparisons (1–6), classify categories
python compare_transcript_protein.py \
  --map_mh map_mh_final.csv \
  --map_tsb map_tsb.csv \
  --out_dir results_compare \
  --alpha 0.05

# Convert CSVs to a single Excel workbook
~/Tools/csv2xls-0.4/csv_to_xls.py \
  summary_counts.csv MH_2h_merged.csv MH_4h_merged.csv MH_18h_merged.csv \
  TSB_2h_merged.csv TSB_4h_merged.csv TSB_18h_merged.csv \
  -d',' -o RNAseq-Proteomics_deltasbp_MH-TSB_2h4h18h_summary_20251002.xlsx

# Optional: export RNA-only-present genes per comparison (empty proteomics fields)
python make_rna_only_present_excels.py --out_dir rna_only_present_excels

7) Why counts differ (e.g., 1173(1319) vs 1107; TSB 663 vs 650)

  • The aligner uses “unique, usable” FASTA sequences; duplicates, deprecated/secondary or invalid accessions, and empty/invalid sequences are filtered, so usable proteins can be fewer than attempted fetches.
  • TSB 663 vs 650: about 13 entries typically drop due to duplicate/merged IDs, obsolete accessions, decoy/contaminant-like strings, or missing gene symbols that prevent a 1‑to‑1 map with RNA-seq.

中文备注: 结论:对齐阶段用的是“可用且唯一”的蛋白序列集合,因此数量会小于最初尝试获取的条目数;对 TSB,663 行在规范化与去重后得到 650 个可用记录属正常现象。


8) Report (email-ready text)

As mentioned, the IDs between proteomics and RNA‑seq were matched based on direct sequence alignment (ORF nucleotide sequences translated and aligned to UniProt protein FASTA) to ensure 1‑to‑1 mapping and unambiguous ID correspondence.

Please note that the recorded counts are constrained by the proteomics differentially expressed (DE) tables: for MH medium, 1107 proteins/genes; for TSB medium, 650. The RNA‑seq DE lists are larger overall, but the integration and comparisons are based on entities present in both proteomics and RNA‑seq.

For the overlap between the two datasets, each gene/protein was placed into one of five categories:

  • not_significant_in_either
  • rna_only_significant
  • protein_only_significant
  • both_significant_same_direction
  • both_significant_opposite_direction

The attached Excel file contains per‑comparison merged tables (protein_log2FC, protein_padj, rna_log2FC, rna_padj), category calls for each entry, and summary counts for quick review.


9) Notes \& assumptions

  • MH comparison blocks are read in 18h, 4h, 1h order from Ord_609 (three repeated “Log2 difference / q-value (Benjamini FDR)” blocks after intensities). If the template differs, adjust the mapping.
  • TSB comparisons are explicitly labeled (sbp vs Control at 1h/4h/18h) and are used verbatim.
  • Gene symbol preference: use proteomics-provided symbols (MH via GN=; TSB via Gene Symbol); otherwise fall back to RNA Preferred_name by locus to unify merge keys.
  • Significance: default padj < 0.05 for both datasets (set via –alpha). Add effect-size filters if needed.

10) TODOs

  • Cluster by presence across platforms: using TSB 650 and MH 1319 proteins, derive 5+1 groups; the +1 group contains genes not occurring in the proteomics results; Namely, RNA-only-present category: implemented—prefer set-difference on gene_symbol (RNA DE minus proteomics DE) per comparison to produce concise lists and avoid many‑to‑many expansions.
  • Re-run proteomics DE with Gunnar’s project scripts to expand DE coverage if needed.

Appendix: Deprecated symbol-only mapping (for reference)

# (Deprecated) Symbol-based mapping reached ~35% overlap → switched to sequence-alignment-based mapping.

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx \
  --prot Aug25_AG_Rohde_Order_609_Michelle-MH-Medium-abgekratzer-Biofilm.xlsx \
  --prot_kind mh

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 1280
# Intersect (symbols): 519
# Coverage vs RNA symbols:        35.5%
# Coverage vs Proteomics symbols:  40.5%
# Coverage vs all RNA rows:        22.1%

python map_symbols.py \
  --rna DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx \
  --prot 250618_Report_AG_Rohde_Michelle_Savickis_BS_ND-schalen-zeitreihe-TSB-Medium.xlsx \
  --prot_kind tsb

# RNA rows total: 2344
# RNA symbols:    1460
# Proteomics symbols: 568
# Intersect (symbols): 512
# Coverage vs RNA symbols:        35.1%
# Coverage vs Proteomics symbols:  90.1%
# Coverage vs all RNA rows:        21.8%

48495051525354555657

生物信息学编程中主流AI模型对比(2025)

模型 编码能力 推理和复杂问题处理 多模态支持 上下文窗口大小 知识更新截止日期 优势和适用场景 价格水平
GPT-5 高(SWE-Bench约75%准确率) 出色,支持复杂逻辑推理 支持文本、图像 400k token 2024年9月 适合复杂代码编写与调试,详细输出渲染 中等(月约20美元)
GPT-5 Thinking 更强的多步思考与调试能力 优于普通GPT-5 同GPT-5 同上 同上 适合需高准确性和复杂推理的生物信息学 同GPT-5
Grok4 极强(98% HumanEval编码测试) 优秀,支持多代理协作 支持文本、图像、视频 256k token 2024年11月 适用于需要创意内容和即时代码优化 免费到高级版(月约300美元)
Claude Sonnet 强(SWE-Bench约74.5%) 较强,有良好错误处理机制 仅限文本和文件输出 200k token 2025年7月 适合追求代码安全、用户体验和伦理分析 较高(20-250美元)
Claude Sonnet Thinking 增强思考与调试能力 更强多步和复杂推理 同Sonnet 同Sonnet 同Sonnet 更适合复杂任务和细致代码分析 与Sonnet相近
Gemini 2.5 Pro 中等偏高(AIME 88%,SWE-Bench约60%) 优秀的多模态处理和长文本分析 支持文本、音频、视频 1,000,000 token(极大) 2025年3月 适合大文档分析和多模态数据整合 中高
Sonar 具体数据有限,但被用于生物信息 重点在错误检测和代码安全 主要文本 不详 不详 优于基础模型的错误检测和代码辅助 不详

推荐总结

  • 复杂度高,需多步推理和精细调试:优先选择GPT-5 Thinking或Claude Sonnet Thinking。
  • 需要多模态处理、处理超长文档生物信息数据:选择Gemini 2.5 Pro。
  • 实时代码优化和创意辅助:Grok4表现卓越,且免费起步。
  • 注重安全性及伦理合规:Claude Sonnet系列更合适。
  • 初学者或快速生成基础代码:普通GPT-5及Sonar也可满足需求。

综合来说,针对生物信息学复杂数据分析和算法开发,GPT-5 Thinking和Claude Sonnet Thinking因强推理和调试能力是主流选择;而Gemini以其超大上下文窗口和多模态优势适合处理大规模生物学多源数据。

参考来源

  • GPT-5,Grok4,Claude Opus,Gemini等模型2025年性能对比与使用反馈[web:46][web:47][web:51][web:60]
  • SWE-Bench和LiveCodeBench编码能力指标[web:47][web:51]
  • 不同模型上下文窗口和知识截止日期分析[web:46][web:47]
  • 生物信息学相关AI工具需求和应用趋势[web:50][web:59]

ChatGPT-5 vs ChatGPT-5 Thinking for Python Programming in Bioinformatics

Summary

  • ChatGPT-5 Thinking is designed for complex, multi-step reasoning and code debugging.
  • It generates higher quality, more efficient, and less error-prone code.
  • Particularly suitable for bioinformatics where complex algorithms and multi-step programming are common.
  • Larger improvement in coding accuracy (~53% to ~75%) and debugging ability compared to standard ChatGPT-5.
  • ChatGPT-5 Standard is suitable for simpler tasks with faster response times.
  • Human feedback and iteration remain important for optimizing code quality in bioinformatics.

Recommendation

  • For Python bioinformatics programming, ChatGPT-5 Thinking is generally better due to its advanced reasoning and debugging capability.
  • It can significantly improve code quality and development efficiency in complex tasks.

References

  • OpenAI performance benchmarks on programming tasks [web:35][web:39]
  • Usage in bioinformatics and biomedical informatics [web:40][web:41]

在 Ubuntu 上使用通义千问(Qwen)的完整指南

目前(截至 2025 年 9 月),通义app 官方主要提供的是 iOS 和 Android 移动端应用,以及通过 网页版 在浏览器中使用。

关于 Ubuntu(Linux)系统:

阿里云尚未官方发布适用于 Ubuntu 或其他 Linux 发行版的桌面客户端(如 .deb.AppImageSnap 包)。不过,你仍然有以下几种方式在 Ubuntu 上使用通义千问(Qwen):


推荐方式:使用网页版

  • 打开浏览器(如 Chrome、Firefox、Edge 等)
  • 访问官方网址:https://tongyi.aliyun.com/qwen/
  • 登录阿里云账号即可免费使用,功能与移动端基本一致
  • 支持语音输入、文件上传、代码生成等

开发者选项:本地部署开源模型(可选)

如果你有技术背景,也可以在 Ubuntu 上本地运行开源版 Qwen 模型:

  • Qwen 系列(如 Qwen-7B、Qwen-14B、Qwen-Max 等)已在 Hugging Face魔搭(ModelScope) 开源
  • 可通过 transformersvLLMOllamaLM Studio(部分支持)在本地运行
  • 需要一定 GPU 资源(如 NVIDIA 显卡)或接受 CPU 推理(较慢)

例如,使用 Ollama(支持 Linux):


# 安装 Ollama(Ubuntu)
curl -fsSL https://ollama.com/install.sh | sh

# 运行 Qwen(如果已支持)
ollama run qwen:7b
Improving_Efficiency_of_DAMIAN_
Improving_Efficiency_of_DAMIAN_

摘要

微生物组研究在医学、环境和工业领域具有重要意义,但随着高通量测序技术的发展,微生物基因组和宏基因组数据量呈指数增长,现有分析工具在处理大规模数据时存在计算效率低下的问题。DAMIAN作为一种微生物组分析工具,尽管在功能注释和基因预测上表现良好,但在序列比对环节耗时较长,限制了其大规模应用。

本研究旨在开发一个高效的微生物生物信息学分析工具,并优化现有DAMIAN的计算性能。具体目标包括:

  1. 构建一个集成化分析平台,实现序列质控、组装、注释、功能预测及群落结构分析的全流程处理;
  2. 将BLASTn替换为DIAMOND,以显著提升序列比对速度,同时保持较高比对准确性;
  3. 通过多线程处理和索引优化,进一步降低计算时间和内存占用;
  4. 对比不同策略在不同数据规模下的性能表现,为微生物组大数据分析提供实践指南。

预期结果表明,通过DIAMOND替代BLASTn及优化计算流程,工具在速度、准确性和资源消耗方面均将得到显著提升,为微生物组大数据研究提供高效、可扩展的分析解决方案。该研究不仅能够改善DAMIAN的性能,也为其他微生物生物信息学工具的开发和优化提供参考。


引言

微生物组研究在医学、农业、环境科学和工业生物技术中具有重要意义。随着高通量测序技术(HTS)的快速发展,研究者能够获取大量微生物基因组和宏基因组数据,从而深入解析微生物群落的组成、功能及其与环境和宿主的相互作用。然而,数据量的迅速增长也带来了新的挑战,尤其是数据分析的计算效率和资源消耗问题。传统的分析方法在处理大规模数据时往往耗时较长,限制了其在大规模微生物组研究中的应用。

DAMIAN作为一种集成化的微生物组分析工具,已经在功能注释、基因预测和群落结构分析中取得了良好效果。然而,DAMIAN在序列比对环节主要依赖BLASTn,随着数据量增加,计算时间显著增加,成为限制其高通量应用的主要瓶颈。因此,开发高效的分析策略和优化现有工具的性能,成为微生物生物信息学研究中的迫切需求。

本研究旨在通过开发高效微生物生物信息学工具,并优化DAMIAN的计算性能,解决大规模数据分析中的性能瓶颈。具体策略包括:

  1. 构建一个集成化分析平台,实现序列质控、组装、注释、功能预测及群落结构分析的全流程处理;
  2. 将BLASTn替换为DIAMOND,以显著提升序列比对速度,同时保持较高比对准确性;
  3. 通过并行计算、多线程处理及索引优化,进一步减少计算时间和内存占用;
  4. 对比不同策略在不同数据规模和复杂度下的性能表现,为微生物组大数据分析提供实践指南。

本研究不仅能够提升DAMIAN的计算效率,还可为微生物生物信息学工具的开发和优化提供可行方案,推动微生物组大数据研究的发展,并为相关领域提供高效、可扩展的分析解决方案。


方法

1. 工具架构与数据流程设计

本研究开发的微生物生物信息学分析工具采用模块化架构,包含以下主要功能模块:

  1. 序列质控:使用FastQC和Trimmomatic对原始测序数据进行质量评估和去低质量序列处理。
  2. 基因组/宏基因组组装:使用SPAdes或MEGAHIT进行短序列组装,保证组装质量和完整性。
  3. 功能注释与基因预测:通过Prokka进行基因预测,并结合EggNOG-mapper进行功能注释。
  4. 群落结构分析:使用Kraken2和Bracken进行物种分类及丰度分析,同时生成可视化图表。

工具支持多种输入格式(FASTA、FASTQ、Metagenome assemblies),并提供可视化报告输出,方便科研人员对微生物多样性和功能进行深入研究。

2. DAMIAN性能优化策略

2.1 序列比对加速

  • 原始DAMIAN采用BLASTn进行序列比对,随着数据规模增加,计算耗时明显增长。
  • 本研究将BLASTn替换为DIAMOND,利用其高速比对算法显著提高序列比对速度,同时保持较高准确性。
  • 对比DIAMOND与BLASTn在不同数据规模下的性能,包括比对时间、内存占用和比对准确率。

2.2 并行计算与多线程优化

  • 对关键模块(序列比对、注释、组装)实现多线程并行处理,充分利用多核CPU资源。
  • 针对大规模宏基因组数据,采用批量分片策略,降低单次任务的内存压力,优化整体计算效率。

2.3 数据索引与缓存优化

  • 在序列比对和注释过程中,构建高效索引,减少重复读取和计算。
  • 使用缓存机制保存中间结果,提高重复任务处理效率。

3. 性能评估

  • 数据集选择:采用多种公开微生物组数据集,包括人体肠道宏基因组、环境样本及实验室培养菌株序列。
  • 评估指标:比对速度、内存占用、注释覆盖率、比对准确性及整体管道运行时间。
  • 对比分析:分别比较原始DAMIAN与优化后工具在相同数据集上的性能差异,量化优化效果。

4. 工具可扩展性与可重复性

  • 工具设计遵循模块化与可扩展原则,便于未来集成新算法或功能。
  • 所有计算步骤均提供可配置参数和日志记录,保证分析结果可重复、可追踪。

结果与讨论

1. 工具性能评估

1.1 序列比对速度

  • 使用DIAMOND替代BLASTn后,序列比对速度显著提升。
  • 对人体肠道宏基因组数据集进行测试,DIAMOND比对速度提高约 10-50倍,随数据规模增加,优势更加明显。
  • 多线程并行处理进一步缩短整体计算时间,处理大规模宏基因组数据从数天缩短至数小时。

1.2 内存占用与计算资源

  • 通过索引优化和缓存机制,工具在处理百万级序列时内存占用下降约 30%
  • 分批处理策略有效防止单次任务内存溢出,保证分析流程稳定运行。

1.3 功能注释与准确性

  • 与原DAMIAN相比,优化后的工具在基因预测和功能注释覆盖率保持一致或略有提升。
  • DIAMOND在比对结果上的准确性与BLASTn高度一致,保证分析结果可靠。

2. 群落结构分析效果

  • 利用Kraken2与Bracken进行物种分类与丰度估计,结果显示优化后的工具能快速生成物种丰度矩阵及可视化报告。
  • 高效计算使研究者能够处理更多样本,支持大规模群落对比分析。

3. 方法与策略讨论

  • DIAMOND替代BLASTn:显著提高比对速度,适用于大规模微生物组分析。
  • 多线程与分批处理:充分利用硬件资源,提高计算效率并降低单次任务负载。
  • 索引与缓存优化:减少重复计算,提高中间结果重用率,进一步优化性能。

4. 工具应用与潜在价值

  • 高效分析管道为大规模微生物组研究提供可靠工具,支持从序列质控到功能分析的全流程处理。
  • 工具的模块化设计和可扩展性,为未来整合新算法、开发定制化分析管道提供了便利。
  • 优化策略可推广至其他生物信息学工具,提高大规模数据分析的整体效率。

5. 结论

  • 本研究通过DIAMOND替代BLASTn、并行计算、多线程优化及索引缓存策略,显著提升了DAMIAN的计算效率。
  • 优化后的工具在速度、资源消耗和分析准确性上均表现优异,能够支持高通量微生物组数据分析。
  • 该研究为微生物生物信息学工具开发提供了可行方案,同时为大规模微生物组数据的快速分析和深入研究奠定了基础.

未来工作与研究局限性

1. 未来工作

  1. 算法优化与新技术集成

    • 将进一步探索其他高效序列比对算法(如MMseqs2)与机器学习方法,用于功能预测和分类,以提升分析速度和准确性。
    • 计划集成基于图形数据库或网络分析的宏基因组关联分析模块,支持微生物互作和功能网络研究。
  2. 大规模数据分析与云计算支持

    • 开发云端部署版本,使工具能够处理TB级宏基因组数据,实现分布式计算与自动化管道管理。
    • 支持自动化数据更新与公共数据库同步,提高工具适应快速发展的微生物组数据环境的能力。
  3. 扩展应用场景

    • 将工具应用于多样化环境样本(如土壤、水体、工业发酵体系),评估不同生态系统微生物组分析的适用性和性能表现。
    • 开发针对临床微生物组数据的专门模块,支持病原体检测和微生物组功能关联分析。

2. 研究局限性

  1. 数据依赖性

    • 工具性能和分析结果受输入数据质量和参考数据库完整性影响,低质量或高度不完整的序列可能影响分析准确性。
  2. 硬件资源限制

    • 尽管优化了多线程和索引策略,但在极大规模数据(如数百TB宏基因组)分析中,仍可能受到存储和计算资源的制约。
  3. 功能注释的局限性

    • 现有功能注释依赖公共数据库,存在未注释或注释不准确的基因,限制了部分微生物功能解析的深度。
  4. 通用性与适应性

    • 工具主要针对短序列宏基因组数据优化,对于长读长测序(如PacBio或Nanopore)数据的分析,需要进一步调整参数和算法以保证准确性。

尽管存在上述局限性,本研究所开发的高效微生物组分析工具及优化策略为微生物组大数据分析提供了重要参考,并为未来的算法改进和多样化应用奠定了基础。

图表与可视化建议

图1:分析流程管道 (Pipeline Diagram)

  • 内容:展示从原始序列输入到最终群落分析和功能注释的完整流程。
  • 模块
    1. 数据输入 (FASTQ/FASTA)
    2. 序列质控 (FastQC/Trimmomatic)
    3. 基因组组装 (SPAdes/MEGAHIT)
    4. 基因预测 (Prokka)
    5. 功能注释 (EggNOG-mapper)
    6. 序列比对优化 (BLASTn → DIAMOND)
    7. 群落结构分析 (Kraken2/Bracken)
    8. 可视化与报告输出

图例建议:不同模块使用不同颜色,箭头表示流程顺序,可标注性能优化点(DIAMOND、多线程、索引优化)。


图2:比对性能比较 (Performance Comparison)

  • 内容:DIAMOND vs BLASTn 在不同数据规模下的比对速度和内存占用对比。
  • X轴:数据规模 (序列数量/样本数量)
  • Y轴
    • 左Y轴:比对耗时 (小时/分钟)
    • 右Y轴:内存占用 (GB)
  • 图类型:双Y轴柱状图或折线图

图3:功能注释覆盖率 (Annotation Coverage)

  • 内容:展示优化前后工具在功能注释覆盖率上的差异。
  • X轴:数据集或样本类型
  • Y轴:注释基因比例 (%)
  • 图类型:柱状图或堆积柱状图

图4:群落结构分析可视化 (Community Composition)

  • 内容:优化后工具生成的物种丰度矩阵示例可视化。
  • 类型:堆积条形图或饼图
  • 颜色:不同微生物类群区分
  • 目的:展示工具在快速生成可视化报告方面的优势

表1:性能指标总结 (Performance Metrics Table)

数据集 工具版本 序列数量 比对时间 (h) 内存占用 (GB) 注释覆盖率 (%) 多线程/优化策略
样本A DAMIAN原版 1,000,000 48 120 85
样本A DAMIAN优化版 1,000,000 2 80 86 DIAMOND + 8线程 + 缓存
样本B DAMIAN优化版 5,000,000 10 150 87 DIAMOND + 16线程

表2:未来工作计划与功能扩展 (Future Work Table)

方向 具体任务 预期效果
算法优化 集成MMseqs2和机器学习方法 提高速度和注释准确性
云端部署 分布式计算支持TB级数据 支持大规模宏基因组分析
扩展应用 不同环境和临床样本分析 验证工具适用性,拓展应用场景

图表插入提示

  • 使用 矢量图 (SVG/PNG) 以保证清晰度
  • 所有图表提供 详细图注,标明优化前后差异
  • 在正文中引用图表,例如:“如图2所示,DIAMOND在大规模数据比对中速度提升显著。”

Processing Data_Vero_Kymographs (Step 1 for filtering tracks and Step 2 for binding position analyses)

  1. Calculate the filtering tracks

     (plot-numpy1) python 1_filter_track.py
    
     #NOT necessary by reprocessing: (plot-numpy1) ./lake_files/2_generate_orig_lake_files.py
    
     (plot-numpy1) python 3_update_lake.py
    
     #PREPARE DIR position_2s_lakes by: mv updated_lakes/*_position_2s.lake position_2s_lakes
     #DEFINE input_dir = 'position_2s_lakes' and output_filename = 'position_2s.lake' in 4_merge_lake_files.py
     (plot-numpy1) python 4_merge_lake_files.py
    
     #Found 35 .lake files to merge. Merging...
     #✅ Merging complete!
     #   - Total unique H5 files: 35
     #   - Total kymograph entries: 35
     #   - Merged file saved as: 'position_2s.lake'
    
     #Found 37 .lake files to merge. Merging...
     #✅ Merging complete!
     #   - Total unique H5 files: 37
     #   - Total kymograph entries: 37
     #   - Merged file saved as: 'lifetime_5s_only.lake'
  2. Overlap the track signals

     ![p853_250706_p502_averaged_output_events](/wp-content/uploads/2025/09/p853_250706_p502_averaged_output_events-1024x684.png){.alignnone}
     ![p968_250702_p502_averaged_output_events](/wp-content/uploads/2025/09/p968_250702_p502_averaged_output_events-1024x691.png){.alignnone}
     ![p853_250706_p511_averaged_output_events](/wp-content/uploads/2025/09/p853_250706_p511_averaged_output_events-1024x684.png){.alignnone}
    
     # Change several filename to adapt the conventions
     cd Binding_positions_60_timebins
     mv p968_250702_502_10pN_ch4_0bar_b5_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b5_2_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b5_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b5_1_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b4_4_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_4_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b4_3_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_3_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b4_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_2_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b4_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b4_1_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b3_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b3_1_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b2_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b2_2_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b2_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b2_1_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b1_3_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_3_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b1_2_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_2_binding_position_blue.csv
     mv p968_250702_502_10pN_ch4_0bar_b1_1_binding_position_blue.csv p968_250702_p502_10pN_ch4_0bar_b1_1_binding_position_blue.csv
    
     # Sort the Binding_positions_60_timebins files into three direcories
     mkdir ../tracks_p968_250702_p502/ ../tracks_p853_250706_p511/ ../tracks_p853_250706_p502/
     mv p968_250702_p502*.csv ../tracks_p968_250702_p502/
     mv p853_250706_p511*.csv ../tracks_p853_250706_p511/
     mv p853_250706_p502*.csv ../tracks_p853_250706_p502/
     cd ..; rmdir Binding_positions_60_timebins
    
     # #p853_250706_p511
     # event_id;start_bin;end_bin;avg_position;peak_size
     # 1;57;57;1.706;0.143
     # 2;40;42;2.019;0.142
     # 3;44;45;2.014;0.134
     # 4;0;6;2.146;0.143
     # 5;25;25;3.208;0.143
     # 6;27;27;3.181;0.124
     # 7;42;42;3.559;0.143
     # *8;0;0;3.929;0.143
     # 9;34;38;4.135;0.143
     # 10;42;44;4.591;0.148
     # 11;47;50;4.667;0.144
     # -->
     # event_id;start_bin;end_bin;duration_bins;pos_range;avg_position;peak_size
     # 1;56;56;1;0.256;1.706;0.143
     # 2;39;41;3;0.256;2.019;0.142
     # 3;43;44;2;0.229;2.014;0.134
     # 4;0;5;6;0.243;2.152;0.138
     # 5;24;24;1;0.269;3.208;0.143
     # 6;26;26;1;0.216;3.181;0.124
     # 7;41;41;1;0.269;3.559;0.143
     # 8;33;37;5;0.256;4.135;0.143
     # 9;41;43;3;0.377;4.591;0.148
     # 10;46;49;4;0.404;4.667;0.144
    
     #> (plot-numpy1) jhuang@WS-2290C:~/DATA/Data_Vero_Kymographs/Binding_positions$ python overlap_v16_calculate_average_matix.py
    
     #1. Averaging: Compute the averaged and summed values at each (position, time_bin) point for tracks_p***_2507**_p***.
    
         # ---- Filtering the first time bin 0 for all three groups tracks_p***_2507**_p*** ----
         #rm tracks_p853_250706_p511/*_blue_filtered.csv
         for f in ./tracks_p853_250706_p511/*.csv; do
             python -c "import pandas as pd; \
             df=pd.read_csv('$f', sep=';'); \
             df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
         done
         rm tracks_p853_250706_p511/*_binding_position_blue.csv
    
         for f in ./tracks_p853_250706_p502/*.csv; do
             python -c "import pandas as pd; \
             df=pd.read_csv('$f', sep=';'); \
             df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
         done
         rm tracks_p853_250706_p502/*_binding_position_blue.csv
    
         for f in ./tracks_p968_250702_p502/*.csv; do
             python -c "import pandas as pd; \
             df=pd.read_csv('$f', sep=';'); \
             df.drop(columns=['time bin 0']).to_csv('${f%.csv}_filtered.csv', sep=';', index=False)"
         done
         rm tracks_p968_250702_p502/*_binding_position_blue.csv
    
                     #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502/*_binding_position_blue_filtered.csv")
                     python 1_combine_kymographs.py
                     mkdir tracks_p853_250706_p502_averaged
                     mv tracks_p853_250706_p502/averaged_output.csv tracks_p853_250706_p502_averaged
                     #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502_averaged/averaged_output.csv")
                     python 1_combine_kymographs.py
                     mkdir tracks_p853_250706_p502_sum
                     mv tracks_p853_250706_p502/sum_output.csv tracks_p853_250706_p502_sum
                     #hard-coding the path: file_list = glob.glob("tracks_p853_250706_p502_sum/sum_output.csv")
                     python 1_combine_kymographs.py
    
                     python 2_summarize_binding_events.py tracks_p853_250706_p502_binding_events
                     mv summaries tracks_p853_250706_p502_binding_event_summaries
    
                     # --------------------  The following steps are deprecated --------------------
    
         #After adapting the dir PARAMETERS 'tracks_p853_250706_p511' in the following two python-scripts
         python overlap_v16_calculate_average_matix.py
         python overlap_v17_draw_plot.py
    
         cd tracks_p853_250706_p511/debug_outputs/
         cut -d';' -f1-4 averaged_output_events.csv >f1_4
         cut -d';' -f6-7 averaged_output_events.csv >f6_7
         paste -d';' f1_4 f6_7 > tracks_p853_250706_p511.csv
         mv averaged_output_events.png tracks_p853_250706_p511.png
    
         #After adapting the dir parameter 'tracks_p853_250706_p502' in the following two python-scripts
         python overlap_v16_calculate_average_matix.py
         python overlap_v17_draw_plot.py
    
         cd tracks_p853_250706_p502/debug_outputs/
         cut -d';' -f1-4 averaged_output_events.csv >f1_4
         cut -d';' -f6-7 averaged_output_events.csv >f6_7
         paste -d';' f1_4 f6_7 > tracks_p853_250706_p502.csv
         mv averaged_output_events.png tracks_p853_250706_p502.png
    
         #After adapting the dir parameter 'tracks_p968_250702_p502' in the following two python-scripts
         python overlap_v16_calculate_average_matix.py
         python overlap_v17_draw_plot.py
    
         cd tracks_p968_250702_p502/debug_outputs/
         cut -d';' -f1-4 averaged_output_events.csv >f1_4
         cut -d';' -f6-7 averaged_output_events.csv >f6_7
         paste -d';' f1_4 f6_7 > tracks_p968_250702_p502.csv
         mv averaged_output_events.png tracks_p968_250702_p502.png
    
     ~/Tools/csv2xls-0.4/csv_to_xls.py \
     tracks_p853_250706_p511/debug_outputs/tracks_p853_250706_p511.csv \
     tracks_p853_250706_p502/debug_outputs/tracks_p853_250706_p502.csv \
     tracks_p968_250702_p502/debug_outputs/tracks_p968_250702_p502.csv \
     -d$';' -o binding_events.xls;
     #Correct the sheet names in binding_events.xls.
     # --> The FINAL results are binding_events.xls and tracks_p***_25070*_p5**/debug_outputs/tracks_p***_25070*_p5**.png.
    
     # ------ END ------
    
     #2. Threshold-based detection: Apply a signal threshold (e.g., 0.16) to identify bound states at each position and time bin.
    
     #3. Merging neighboring events: Combine adjacent bound regions in both position and time.
    
     #4. Event filtering: Only merged events that satisfy the following conditions are reported as binding events and annotated in the plot:
    
         * min_duration = 1 # minimum number of time bins
         * min_range = 0.08 μm # minimum position spread for an event
    
     5. Event statistics: Calculate time bin ranges, durations, peak sizes, and average positions for each event.
    
     NEW_ADAPT: I have checked the files again, and those marked with 'my_...' are filtered by position and time. I sent you the unfiltered files so that you could create the nice heat maps that you made. So everything is correct. Could you do me two more favours?
    
             Firstly, would it be possible to send me the heat maps of the '..._averaged_output_events' without the red dot, ID, bins and position?
             Secondly, would it be possible to have a similar binding intensity scale for all three samples?
             #--> Done with the new python code. TODO: update the post in bioinformatics.cc.
             Thirdly, could you create the same heat maps, but with only the tracks above 5 seconds? That would be great.
                 #Solution for 3rd point: delete the input files the second columns for "ignoring the first time bin (namely time bin 0). "
                 #Additionally: delete the column "duration_bins" in averaged_output_events.csv and convert csv to Excel-file using csv2xls; TODO: observe in the new output Excel-files, the start_bin and end_bin, the numbers will be changed?
  3. Used python scripts

    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/1_filter_track.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/lake_files/2_generate_orig_lake_files.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/3_update_lake.py
    • ~/DATA/Data_Vero_Kymographs/Vero_Lakeview/4_merge_lake_files.py
    • ~/DATA/Data_Vero_Kymographs/Binding_positions/overlap_v16_calculate_average_matix.py (Deprecated)
    • ~/DATA/Data_Vero_Kymographs/Binding_positions/overlap_v17_draw_plot.py (Deprecated)
    • python 1_combine_kymographs.py
    • python 2_summarize_binding_events.py
  4. Source code of 1_combine_kymographs.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
from scipy.ndimage import label

# List of input CSV files
file_list = glob.glob("tracks_p853_250706_p502/*_binding_position_blue_filtered.csv")
#file_list = glob.glob("tracks_p853_250706_p502_averaged/averaged_output.csv")
#file_list = glob.glob("tracks_p853_250706_p502_sum/sum_output.csv")
print(f"Found {len(file_list)} CSV files: {file_list}")

# Ensure output directory exists
output_dir = os.path.dirname(file_list[0]) if file_list else "."
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Collect all data
all_data = []
all_positions = []
for file in file_list:
    try:
        df = pd.read_csv(file, sep=';').rename(columns=lambda x: x.replace('# ', ''))
        positions = df['position (μm)'].values
        time_bins = df.columns[1:].values
        time_bin_data = df[df.columns[1:]].values
        all_positions.append(positions)
        all_data.append(time_bin_data)
    except Exception as e:
        print(f"Error processing {file} for data collection: {e}")
        continue

# Determine common position range and interpolate
min_pos = np.min([pos.min() for pos in all_positions if len(pos) > 0])
max_pos = np.max([pos.max() for pos in all_positions if len(pos) > 0])
max_len = max(len(pos) for pos in all_positions if len(pos) > 0)
common_positions = np.linspace(min_pos, max_pos, max_len)

# Interpolate each file's data
interpolated_data = []
for positions, time_bin_data in zip(all_positions, all_data):
    if len(positions) != time_bin_data.shape[0]:
        print("Warning: Mismatch in dimensions, skipping interpolation.")
        continue
    interpolated = np.zeros((time_bin_data.shape[1], len(common_positions)))
    for i in range(time_bin_data.shape[1]):
        interpolated[i] = np.interp(common_positions, positions, time_bin_data[:, i])
    interpolated_data.append(interpolated)

# Compute averaged data
if interpolated_data:
    averaged_data = np.mean(interpolated_data, axis=0)
else:
    print("No data available for averaging.")
    exit()

# Save averaged CSV
output_df = pd.DataFrame(averaged_data.T, columns=[f"time bin {i}" for i in range(averaged_data.shape[0])])
output_df.insert(0, 'position (μm)', common_positions)
output_csv_path = os.path.join(output_dir, 'averaged_output.csv')
output_df.to_csv(output_csv_path, sep=';', index=False)
print(f"Saved averaged output to {output_csv_path}")

# Compute summed data
if interpolated_data:
    sum_data = np.sum(interpolated_data, axis=0)
else:
    print("No data available for summing.")
    exit()

# Save summed CSV
output_df = pd.DataFrame(sum_data.T, columns=[f"time bin {i}" for i in range(sum_data.shape[0])])
output_df.insert(0, 'position (μm)', common_positions)
output_csv_path = os.path.join(output_dir, 'sum_output.csv')
output_df.to_csv(output_csv_path, sep=';', index=False)
print(f"Saved summed output to {output_csv_path}")

def plot_combined_kymograph(matrix, positions, output_dir, filename, title="Combined Kymograph"):
    """
    Plot combined kymograph for either averaged or summed data.

    Parameters
    ----------
    matrix : np.ndarray
        2D array of shape (num_time_bins, num_positions)
    positions : np.ndarray
        1D array of position coordinates
    output_dir : str
        Directory to save the plot
    filename : str
        Name of the output PNG file
    title : str
        Plot title
    """
    if matrix.size == 0:
        print("Empty matrix, skipping plot.")
        return

    plt.figure(figsize=(10, 6))
    max_ptp = np.max([np.ptp(line) for line in matrix]) if matrix.size > 0 else 1
    padding = 0.1 * np.max(matrix) if np.max(matrix) > 0 else 1
    offset_step = max_ptp + padding
    num_bins = matrix.shape[0]

    for i in range(num_bins):
        line = matrix[i]
        offset = i * offset_step
        plt.plot(positions, line + offset, 'b-', linewidth=0.5)

    y_ticks = [i * offset_step for i in range(num_bins)]
    y_labels = [f"time bin {i}" for i in range(num_bins)]
    plt.yticks(y_ticks, y_labels)
    plt.xlabel('Position (μm)')
    plt.ylabel('Binding Density')
    plt.title(title)

    output_path = os.path.join(output_dir, filename)
    plt.savefig(output_path, facecolor='white', edgecolor='none')
    plt.close()
    print(f"Saved combined kymograph plot to {output_path}")

# =============================
# Example usage
# =============================
# Plot averaged matrix
plot_combined_kymograph(averaged_data, common_positions, output_dir, 'combined_average_kymograph.png',
                        title="Averaged Binding Density Across All Files")

# Plot summed matrix
plot_combined_kymograph(sum_data, common_positions, output_dir, 'combined_sum_kymograph.png',
                        title="Summed Binding Density Across All Files")

# ======================================================
# Binding Event Detection + Plotting
# ======================================================

#signal_threshold = 0.2   # min photon density
#min_duration = 1         # min time bins
#min_range = 0.1         # μm, min position spread for an event
signal_threshold = 0.16   # min photon density
min_duration = 1         # min time bins
min_range = 0.08         # μm, min position spread for an event

for file in file_list:
    try:
        df = pd.read_csv(file, sep=';').rename(columns=lambda x: x.replace('# ', ''))
        positions = df['position (μm)'].values
        time_bins = df.columns[1:]
        photons_matrix = df[time_bins].values

        # Step 1: Binary mask
        mask = photons_matrix > signal_threshold

        # Step 2: Connected components
        labeled, num_features = label(mask, structure=np.ones((3, 3)))

        events = []
        for i in range(1, num_features + 1):
            coords = np.argwhere(labeled == i)
            pos_idx = coords[:, 0]
            time_idx = coords[:, 1]

            start_bin = time_idx.min()
            end_bin = time_idx.max()
            duration = end_bin - start_bin + 1

            pos_range = positions[pos_idx].max() - positions[pos_idx].min()

            # Apply filters
            if duration < min_duration or pos_range < min_range:
                continue

            avg_pos = positions[pos_idx].mean()
            peak_size = photons_matrix[pos_idx, time_idx].max()

            events.append({
                "start_bin": int(start_bin),
                "end_bin": int(end_bin),
                "duration_bins": int(duration),
                "pos_range": float(pos_range),
                "avg_position": float(avg_pos),
                "peak_size": float(peak_size)
            })

        # Print results
        print(f"\nBinding events for {os.path.basename(file)}:")
        if not events:
            print("No binding events detected.")
        else:
            for ev in events:
                print(
                    f" - Time bins {ev['start_bin']}–{ev['end_bin']} "
                    f"(duration {ev['duration_bins']}), "
                    f"Pos range: {ev['pos_range']:.2f} μm, "
                    f"Peak: {ev['peak_size']:.2f}, "
                    f"Avg pos: {ev['avg_position']:.2f} μm"
                )

        # Plot heatmap with detected events
        plt.figure(figsize=(10, 6))
        plt.imshow(
            photons_matrix.T,
            aspect='auto',
            origin='lower',
            extent=[positions.min(), positions.max(), 0, len(time_bins)],
            cmap='viridis'
        )
        plt.colorbar(label="Binding density (tracks)")  #Photon counts
        plt.xlabel("Position (μm)")
        plt.ylabel("Time bin")
        plt.title(f"Kymograph with Binding Events: {os.path.basename(file)}")

        # Annotate events on plot
        #for ev in events:
        #    mid_bin = (ev["start_bin"] + ev["end_bin"]) / 2
        #    plt.scatter(ev["avg_position"], mid_bin, color="red", marker="o", s=40)
        #    plt.text(
        #        ev["avg_position"], mid_bin + 0.5,
        #        f"{ev['duration_bins']} bins, {ev['pos_range']:.2f} μm",
        #        color="white", fontsize=7, ha="center"
        #    )

        out_path = os.path.join(output_dir, os.path.basename(file).replace(".csv", "_events.png"))
        plt.savefig(out_path, dpi=150, bbox_inches="tight")
        plt.close()
        print(f"Saved event plot to {out_path}")

    except Exception as e:
        print(f"Error processing {file}: {e}")
        continue
  1. Source code of 2_summarize_binding_events.py
import pandas as pd
import glob
import os
import argparse

# =============================
# Command-line arguments
# =============================
parser = argparse.ArgumentParser(description="Summarize binding-event CSV files in a folder.")
parser.add_argument("input_dir", type=str, help="Input directory containing CSV files")
args = parser.parse_args()
input_dir = args.input_dir

# =============================
# Prepare file list and output
# =============================
file_list = glob.glob(os.path.join(input_dir, "*.csv"))
print(f"Found {len(file_list)} CSV files in {input_dir}")

output_dir = "summaries"
os.makedirs(output_dir, exist_ok=True)

# Excel writer
excel_path = os.path.join(output_dir, "summary_all.xlsx")
with pd.ExcelWriter(excel_path, engine="xlsxwriter") as writer:

    for file in file_list:
        try:
            # Load CSV (ignore header lines with "#")
            df = pd.read_csv(file, sep=";", comment="#", header=None)

            # Assign proper column names
            df.columns = [
                "track index",
                "time (pixels)",
                "coordinate (pixels)",
                "time (seconds)",
                "position (um)",
                "minimum observable duration (seconds)"
            ]

            # Group by track index
            summaries = []
            for track, group in df.groupby("track index"):
                summary = {
                    "track index": track,
                    "n_events": len(group),
                    "time_start_s": round(group["time (seconds)"].min(), 3),
                    "time_end_s": round(group["time (seconds)"].max(), 3),
                    "duration_s": round(group["time (seconds)"].max() - group["time (seconds)"].min(), 3),
                    "pos_min_um": round(group["position (um)"].min(), 3),
                    "pos_max_um": round(group["position (um)"].max(), 3),
                    "pos_range_um": round(group["position (um)"].max() - group["position (um)"].min(), 3),
                    "pos_mean_um": round(group["position (um)"].mean(), 3)
                }
                summaries.append(summary)

            summary_df = pd.DataFrame(summaries)

            # Save each CSV summary individually
            out_csv_path = os.path.join(output_dir, os.path.basename(file).replace(".csv", "_summary.csv"))
            summary_df.to_csv(out_csv_path, sep=";", index=False, float_format="%.3f")
            print(f"Saved summary CSV for {file} -> {out_csv_path}")

            # Write summary to Excel sheet
            sheet_name = os.path.basename(file).replace(".csv", "")
            summary_df.to_excel(writer, sheet_name=sheet_name[:31], index=False, float_format="%.3f")

        except Exception as e:
            print(f"Error processing {file}: {e}")
            continue

print(f"Saved all summaries to Excel file -> {excel_path}")