HYLO CARE vs. HYLO COMOD Eye Drops: Ingredient Comparison and Long-Term Use Assessment

1. Key Takeaways from the Official Site (Conclusion First)

👉 HYLO CARE is suitable for long-term use, and it is a "care-oriented" artificial tear. 👉 It is preservative-free, which is a crucial point.


2. What Exactly Is in HYLO CARE, and Is It Safe?

✅ 1️⃣ Hyaluronic acid (Hyaluronsäure, 0.1%)

  • Function: strong hydration and lubrication
  • Properties:

    • Occurs naturally in the human body (eyes, joints, skin)
    • Not a drug, not a steroid
  • ✔️ Suitable for long-term use
  • ✔️ A "gold-standard" ingredient in dry-eye treatment

📌 0.1% is a medium concentration → more moisturizing than the basic version, but not overly viscous


✅ 2️⃣ Dexpanthenol (provitamin B5)

This is the "additional ingredient" you noticed, and the one many people ask about.

What does it do?

  • Promotes repair of the cornea and conjunctiva
  • Relieves irritation, burning, and foreign-body sensation
  • Has mild anti-inflammatory and regenerative effects

📌 Commonly used for:

  • Ocular surface damage
  • Prolonged screen or near work
  • Contact-lens discomfort

👉 Dexpanthenol is not a drug and not a steroid 👉 It is very common in ophthalmology and considered safe

✔️ Fine for long-term use


3. What Does "Preservative-Free" Mean? (Very Important)

The official site states it explicitly:

Frei von Konservierungsmitteln und Phosphaten (free of preservatives and phosphates)

This means:

  • ❌ No corneal irritation
  • ❌ No "the more you use it, the drier it gets" effect
  • ❌ No damage to the tear film
  • ✔️ Suitable for multiple daily applications over the long term
  • ✔️ Suitable for sensitive eyes, children, and heavy computer users

👉 This is also the biggest advantage of the COMOD® system: it stays sterile without any added preservative


4. HYLO CARE vs. HYLO COMOD (Now We Can Distinguish Them Precisely)

Feature           HYLO CARE                 HYLO COMOD
Hyaluronic acid   0.1%                      0.1%
Extra ingredient  ✅ Dexpanthenol (repair)   ❌ none
Preservatives     ❌ none                    ❌ none
Moisturizing      ⭐⭐⭐⭐                      ⭐⭐⭐
Care/repair       ✅ stronger                basic
Long-term use     ✅ yes                     ✅ yes

👉 In short:

  • HYLO COMOD: basic moisturizing
  • HYLO CARE: moisturizing + repair (more "care")

5. Your Main Question: Can HYLO CARE Be Used Long-Term?

✅ Answer: yes, and it is very well suited to it

Reasons in brief:

  • ✔️ Preservative-free
  • ✔️ Safe, near-physiological ingredients
  • ✔️ Designed specifically for "daily care"
  • ✔️ The official site explicitly says "ideal für die tägliche Pflege" (ideal for daily care)

📌 The 6-month use period means:

  • Use it up within 6 months of opening
  • It is a post-opening hygiene limit, not a cap on how long you may keep using the product

6. Who Should Consider Long-Term HYLO CARE?

It is a particularly good fit if any of the following applies to you:

  • Long hours at a computer or phone
  • Heated or air-conditioned rooms
  • Contact-lens wear
  • Eyes prone to redness, dryness, itching, or a foreign-body sensation
  • Wanting something more "reparative" than a basic artificial tear

7. One-Sentence Summary (To Put Your Mind at Ease)

👉 The HYLO CARE official-site version you posted:

  • ✔️ Preservative-free
  • ✔️ Safe ingredients
  • ✔️ Contains the repair ingredient Dexpanthenol
  • ✔️ Entirely suitable for long-term use

Tournament Plan 2026 for X. (Updated)

Basis: the last 4 official Kids-Cup dates

No.  Date        Time                       Venue
42   20.12.2024  9:30–17:00                 HSK Schachzentrum (Schellingstr. 41) (hsk1830.de)
43   29.03.2025  (not stated in the post)   HSK (Kids-Cup report) (hsk1830.de)
44   19.07.2025  9:00–17:00                 HSK Schachzentrum (hsk1830.de)
45   11.10.2025  9:30–17:00                 HSK Schachzentrum (hsk1830.de)

A recurring pattern is visible: spring (late March), summer (mid-July), autumn (mid-October), and shortly before Christmas (December).


Forecast: HSK Kids-Cup 2026 (4 separate entries)

Forecast principle: the same seasonal windows as above, preferably on a weekend day (as in the Kids-Cup description). (hsk1830.de)

Please treat these as placeholders until HSK publishes the actual announcements.

2026 (forecast)  Weekday   Entry
28.03.2026       Saturday  HSK Kids-Cup #46 (prospective)
18.07.2026       Saturday  HSK Kids-Cup #47 (prospective)
10.10.2026       Saturday  HSK Kids-Cup #48 (prospective)
18.12.2026       Friday    HSK Kids-Cup #49 (prospective) ("shortly before Christmas" – in 2024 this was also a weekday close to the start of the school holidays) (hsk1830.de)

Fixed Kids-Cup facts (from the HSK info page, so you don't have to look them up each time)

  • Held several times a year, aimed at beginners, with an additional school team ranking (hsk1830.de)
  • 7 rounds Swiss system, two tournaments: A) open up to grade 4, DWZ-rated; B) kindergarten up to grade 2 (hsk1830.de)
  • 20 minutes per game; moves are recorded during the first 15 minutes so coaches can analyze between rounds. (hsk1830.de)
  • Entry fee: €10 (HSK members), €15 (guests) (hsk1830.de)

Update: "X. should play all 2026 Kids-Cups + the next HJET"

Updated master table (2026 + "next HJET")

Date  Event  Venue  Q?  Status / notes (with space to fill in)
14.02.2026  Blankeneser Jugend Pokal (BJP)  Hamburg  ✅ Yes  Youth tournament, good U10 format
15.02.2026  DWZ Challenge  Hamburg  ✅ Yes (fixed)  Registered
22.03.2026  DWZ/ELO-Cup (groups of 4)  Itzehoe  ✅ Yes  DWZ practice without overload
28.03.2026 (forecast)  HSK Kids-Cup #46  HSK Schachzentrum  ✅ Yes  Announcement link: Time: Registration deadline: ____
06.06.2026 (optional)  DWZ/ELO-Cup (groups of 4)  Itzehoe  🟡 Optional  only if June isn't already too full
20.–21.06.2026  Hamburger Talente-Cup (U12)  Hamburg  ✅ Yes  Highlight weekend
28.06.2026  Elmshorn youth city championship  Elmshorn  ✅ Yes  One day, child-friendly
18.07.2026 (forecast)  HSK Kids-Cup #47  HSK Schachzentrum  ✅ Yes  Announcement link: Time: Registration deadline: ____
05.09.2026  DWZ/ELO-Cup (groups of 4)  Itzehoe  ✅ Yes  good re-entry after the summer break
10.10.2026 (forecast)  HSK Kids-Cup #48  HSK Schachzentrum  ✅ Yes  Announcement link: Time: Registration deadline: ____
07.11.2026 (optional)  DWZ/ELO-Cup (groups of 4)  Itzehoe  🟡 Optional  only if motivation and capacity allow
05.12.2026  DWZ/ELO-Cup (groups of 4)  Itzehoe  ✅ Yes  Year-end finale
18.12.2026 (forecast)  HSK Kids-Cup #49  HSK Schachzentrum  ✅ Yes  Announcement link: Time: Registration deadline: ____
December 2026  HJET – registration for the next round  Hamburg  ✅ Yes  Deadline: Link/form: Group (U10-1/U10-2): ____
Jan–early Feb 2027 (Saturdays)  HJET – match days  Hamburg  ✅ Yes  Dates: (fill in once published)

PDF version (updated)

Here is the updated PDF (with the Kids-Cup forecast as separate entries + the HJET "next round"): Download: Turnierplan_2026_aktualisiert.pdf

Once the actual 2026 Kids-Cup announcement links and the HJET registration link (Dec 2026) become available, we can fill in the blank fields cleanly and replace "forecast" with "official".

Bacterial WGS workflow overview (nf-core/bacass → taxonomic ID → phylogeny)

1) Primary WGS analysis with nf-core/bacass

We processed short-read bacterial WGS with nf-core/bacass v2.5.0 using Docker. The pipeline performs QC, assembly, and common taxonomic/typing steps (e.g., Kraken2). We ran it successfully with a local Kraken2 database.

A practical issue is the KmerFinder database download: the previously documented FTP “latest” path may be unavailable or slow, so it’s best to pre-download the database and pass a local path, or skip KmerFinder and run it manually via the CGE service if needed.

2) KmerFinder database: reliable download options

Two practical database sources:

  • Option A (recommended, easiest & fast): download the bundled DB from the CGE KmerFinder service site and extract locally. (In your case it finished in ~40 min.)
  • Option B (fallback, stable but older): use the Zenodo stable archive (20190108).

Once the database is available locally, it can be provided to bacass with --kmerfinderdb (pointing to the extracted bacteria/ directory or to the tarball, depending on what the pipeline expects).

3) Species placement & phylogeny

For broad placement, an ANI-based approach (a fastANI distance tree; see the NJ sketch after section C below) is quick and robust, but for a publishable phylogeny within a close clade, a standard route is:

Prokka → Roary (pangenome; core-gene alignment) → (trim) → RAxML-NG.

Key point: Roary core-gene trees are best for closely related genomes (same species/close genus). Distant outgroups can collapse the core (few/no shared genes). If outgroups are too far, use close outgroups (e.g., Moraxella, Psychrobacter) or switch to conserved marker gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) for deep phylogeny.

4) Comparative genome visualization

For pairwise genome comparison/plots, nf-core/pairgenomealign can generate whole-genome alignments and visual representations against a target genome.


Reproducible code snippets (complete commands)

A) nf-core/bacass runs

Run 1 (works):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/minikraken_20171019_8GB.tgz

Run 2 (skip KmerFinder, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /home/jhuang/REFs/k2_standard_08_GB_20251015.tar.gz \
  --skip_kmerfinder \
  -resume

Then manually run KmerFinder via web service: https://cge.food.dtu.dk/services/KmerFinder/

Run 3 (use local KmerFinder DB, resume):

nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type short \
  --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
  --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
  -resume

B) KmerFinder DB download

Option A (CGE bundle; fast):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz'
tar -xvf kmerfinder_db.tar.gz
# resulting path typically contains: bacteria/

Option B (Zenodo stable archive):

mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz'
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz

C) Download close references (Acinetobacter + close outgroups) + fastANI matrix (your script config)

(You already implemented this; keep here as “reproducible config”)

# Example config snippet used in the downloader:
MAX_ACINETOBACTER=20
MAX_PER_OUTGROUP=4
ASSEMBLY_LEVELS="complete,chromosome"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
MINFRACTION="0.05"
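
The full script (1_ani_tree_prep.sh, below) also writes an all-vs-all matrix to ani_tree_run/fastani_all.raw.tsv.matrix. A minimal R/ape sketch for turning that into a quick NJ placement tree, assuming fastANI's usual PHYLIP-style lower-triangular --matrix layout (first line = genome count, then one tab-separated row per genome: name plus ANI values):

library(ape)

# read fastANI's lower-triangular --matrix output
lines  <- readLines("ani_tree_run/fastani_all.raw.tsv.matrix")
n      <- as.integer(lines[1])
rows   <- strsplit(lines[-1], "\t")
labels <- basename(sapply(rows, `[`, 1))

D <- matrix(0, n, n, dimnames = list(labels, labels))
for (i in seq_len(n)) {
  if (i == 1) next
  vals <- suppressWarnings(as.numeric(rows[[i]][2:i]))
  d <- 100 - vals          # ANI %  ->  distance
  d[is.na(d)] <- 100       # pairs skipped by --minFraction: max distance (assumption)
  D[i, 1:(i - 1)] <- d
  D[1:(i - 1), i] <- d
}

tr <- nj(as.dist(D))       # neighbor-joining tree from ANI distances
write.tree(tr, "ani_tree_run/fastani_nj.nwk")

This is for broad placement only; use the Roary/RAxML-NG route in section D for the publishable tree.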

D) Prokka → Roary core MSA → RAxML-NG (phylogenomic tree)

0) Collect genomes (isolate + downloaded refs):

mkdir -p genomes
cp bacass_out/Prokka/An6/An6.fna genomes/An6.fna
bash 1_ani_tree_prep.sh
cp ani_tree_run/ref_fasta/*.genomic.fna genomes/

1) Prokka annotations (produces GFFs for Roary):

mkdir -p prokka gffs

for f in genomes/*.fna; do   # *.fna already matches *.genomic.fna; listing both would annotate each reference twice
  [ -f "$f" ] || continue
  bn=$(basename "$f")
  sample="${bn%.*}"
  sample="${sample%.genomic}"
  sample="${sample//./_}"

  prokka --outdir "prokka/$sample" \
         --prefix "$sample" \
         --cpus 8 --force \
         "$f"
done

cp prokka/*/*.gff gffs/

2) Roary core-genome alignment (MAFFT):

mkdir -p roary_out

roary -f roary_out \
      -p 16 \
      -e -n \
      -i 95 \
      -cd 95 \
      gffs/*.gff

# output:
ls -lh roary_out/core_gene_alignment.aln

3) Trim the core alignment (optional but recommended):

trimal -in roary_out/core_gene_alignment.aln \
       -out roary_out/core_gene_alignment.trim.aln \
       -automated1

4) RAxML-NG ML tree + bootstrap supports:

mkdir -p tree

raxml-ng --all \
  --msa roary_out/core_gene_alignment.trim.aln \
  --model GTR+G \
  --bs-trees autoMRE{500} \
  --threads 16 \
  --seed 12345 \
  --prefix tree/roary_core

# main result:
ls -lh tree/roary_core.raxml.support

E) Optional visualization in R (ape)

library(ape)

tr <- read.tree("tree/roary_core.raxml.support")

pdf("tree/roary_core_raxmlng.pdf", width=12, height=10)
plot(tr, cex=0.5, no.margin=TRUE)
title("Roary core genome + RAxML-NG (GTR+G)")
dev.off()

F) Pairwise genome alignment figure (nf-core/pairgenomealign)

Example pattern (edit to your inputs/refs):

nextflow run nf-core/pairgenomealign -r 2.2.1 -profile docker \
  --input samplesheet_pairgenomealign.tsv \
  --outdir pairgenomealign_out

Practical notes

  • If KmerFinder FTP “latest” is unavailable, use the CGE bundle or Zenodo stable DB and point bacass to the local extracted bacteria/ directory.
  • For phylogeny:

    • Roary/RAxML-NG is best for close genomes; use close outgroups or none and root later (see the rooting sketch after this list).
    • For distant outgroups, use marker-gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) instead of pangenome core.
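
For the "root later" step, a minimal ape sketch (the grep pattern assumes the outgroup tip labels start with "Moraxella"; adjust it to your actual tip names):

library(ape)

tr <- read.tree("tree/roary_core.raxml.support")
# pick the outgroup tips by label and root on them (they must form a clade)
og <- grep("^Moraxella", tr$tip.label, value = TRUE)
tr_rooted <- root(tr, outgroup = og, resolve.root = TRUE)
write.tree(tr_rooted, "tree/roary_core.rooted.nwk")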

Used scripts (complete code)

A) 1_ani_tree_prep.sh

#!/usr/bin/env bash
set -euo pipefail

# ===================== USER CONFIG =====================
QUERY_FASTA="bacass_out/Prokka/An6/An6.fna"
SEED_ACCESSIONS_FILE="accessions.txt"

OUTDIR="ani_tree_run"
THREADS=8
MINFRACTION="0.05"
FRAGLEN="3000"

MAX_ACINETOBACTER=20
SUMMARY_LIMIT_ACI=400

MAX_PER_OUTGROUP=4
SUMMARY_LIMIT_OUT=120

#,scaffold
ASSEMBLY_LEVELS="complete,chromosome"
#Drop from Roary: "Pseudomonas" "Escherichia")
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")

DO_MATRIX="yes"
SUMMARY_TIMEOUT=600
# =======================================================

ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
warn() { echo "[$(ts)] WARN: $*" >&2; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
sanitize() { echo "$1" | tr ' /;:,()[]{}' '_' | tr -cd 'A-Za-z0-9._-'; }

need datasets
need fastANI
need awk
need sort
need unzip
need find
need grep
need wc
need head
need readlink
need sed
need timeout
need python3

[[ -f "$QUERY_FASTA" ]] || die "QUERY_FASTA not found: $QUERY_FASTA"

mkdir -p "$OUTDIR"/{ref_fasta,zips,tmp,logs,accessions}
DL_LOG="$OUTDIR/logs/download.log"
SUM_LOG="$OUTDIR/logs/summary.log"
ANI_LOG="$OUTDIR/logs/fastani.log"
: > "$DL_LOG"; : > "$SUM_LOG"; : > "$ANI_LOG"

QUERY_LIST="$OUTDIR/query_list.txt"
echo "$(readlink -f "$QUERY_FASTA")" > "$QUERY_LIST"
log "QUERY: $(cat "$QUERY_LIST")"

ACC_DIR="$OUTDIR/accessions"
ACC_SEED="$ACC_DIR/seed.acc.txt"
ACC_ACI="$ACC_DIR/acinetobacter.acc.txt"
ACC_OUT="$ACC_DIR/outgroups.acc.txt"
ACC_ALL="$ACC_DIR/all_refs.acc.txt"

REF_DIR="$OUTDIR/ref_fasta"
REF_LIST="$OUTDIR/ref_list.txt"

# ---- seed accessions ----
if [[ -f "$SEED_ACCESSIONS_FILE" ]]; then
  grep -v '^\s*$' "$SEED_ACCESSIONS_FILE" | grep -v '^\s*#' | sort -u > "$ACC_SEED" || true
else
  : > "$ACC_SEED"
fi
log "Seed accessions: $(wc -l < "$ACC_SEED")"

# ---- extract accessions from a jsonl FILE (FIXED) ----
extract_accessions_from_jsonl_file() {
  local jsonl="$1"
  python3 - "$jsonl" <<'PY'
import sys, json
path = sys.argv[1]
out = []
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except Exception:
            continue
        # NCBI datasets summary often returns one report per line with top-level "accession"
        acc = obj.get("accession")
        if acc:
            out.append(acc)
        # some outputs can also contain "reports" lists
        reps = obj.get("reports") or []
        for r in reps:
            acc = r.get("accession")
            if acc:
                out.append(acc)
for a in out:
    print(a)
PY
}

fetch_accessions_by_taxon() {
  local taxon="$1"
  local want_n="$2"
  local limit_n="$3"
  local outfile="$4"

  local tag
  tag="$(sanitize "$taxon")"
  local jsonl="$OUTDIR/tmp/${tag}.jsonl"

  : > "$outfile"
  log "STEP: Fetch accessions | taxon='$taxon' want=$want_n limit=$limit_n assembly=$ASSEMBLY_LEVELS timeout=${SUMMARY_TIMEOUT}s"

  if ! timeout "$SUMMARY_TIMEOUT" datasets summary genome taxon "$taxon" \
      --assembly-level "$ASSEMBLY_LEVELS" \
      --annotated \
      --exclude-atypical \
      --exclude-multi-isolate \
      --limit "$limit_n" \
      --as-json-lines > "$jsonl" 2>>"$SUM_LOG"; then
    warn "datasets summary failed for taxon=$taxon"
    tail -n 80 "$SUM_LOG" >&2 || true
    return 1
  fi

  [[ -s "$jsonl" ]] || { warn "Empty jsonl for taxon=$taxon"; return 0; }
  log "  OK: summary jsonl lines: $(wc -l < "$jsonl")"

  # FIXED extraction
  extract_accessions_from_jsonl_file "$jsonl" | sort -u | head -n "$want_n" > "$outfile"

  log "  OK: got $(wc -l < "$outfile") accessions for $taxon"
  head -n 5 "$outfile" | sed 's/^/     sample: /' || true
  return 0
}

# ---- fetch Acinetobacter + outgroups ----
fetch_accessions_by_taxon "Acinetobacter" "$MAX_ACINETOBACTER" "$SUMMARY_LIMIT_ACI" "$ACC_ACI" \
  || die "Failed to fetch Acinetobacter accessions (see $SUM_LOG)"

: > "$ACC_OUT"
for tax in "${OUTGROUP_TAXA[@]}"; do
  tmp="$OUTDIR/tmp/$(sanitize "$tax").acc.txt"
  fetch_accessions_by_taxon "$tax" "$MAX_PER_OUTGROUP" "$SUMMARY_LIMIT_OUT" "$tmp" \
    || die "Failed to fetch outgroup taxon=$tax (see $SUM_LOG)"
  cat "$tmp" >> "$ACC_OUT"
done
sort -u "$ACC_OUT" -o "$ACC_OUT"
log "Outgroup accessions total: $(wc -l < "$ACC_OUT")"

# ---- merge ----
cat "$ACC_SEED" "$ACC_ACI" "$ACC_OUT" 2>/dev/null \
  | grep -v '^\s*$' | grep -v '^\s*#' | sort -u > "$ACC_ALL"
log "Total unique reference accessions: $(wc -l < "$ACC_ALL")"
[[ -s "$ACC_ALL" ]] || die "No reference accessions collected."

# ---- download genomes (batch) ----
ZIP="$OUTDIR/zips/ncbi_refs.zip"
UNZ="$OUTDIR/tmp/ncbi_refs_unzipped"
rm -f "$ZIP"; rm -rf "$UNZ"; mkdir -p "$UNZ"

log "STEP: Download genomes via NCBI datasets (batch) -> $ZIP"
datasets download genome accession --inputfile "$ACC_ALL" --include genome --filename "$ZIP" >>"$DL_LOG" 2>&1 \
  || { tail -n 120 "$DL_LOG" >&2; die "datasets download failed (see $DL_LOG)"; }

log "STEP: Unzip genomes"
unzip -q "$ZIP" -d "$UNZ" >>"$DL_LOG" 2>&1 || { tail -n 120 "$DL_LOG" >&2; die "unzip failed (see $DL_LOG)"; }

# ---- collect FASTA ----
log "STEP: Collect *_genomic.fna -> $REF_DIR"
: > "$REF_LIST"
n=0
while IFS= read -r f; do
  [[ -f "$f" ]] || continue
  acc="$(echo "$f" | grep -oE 'GC[AF]_[0-9]+\.[0-9]+' | head -n 1 || true)"   # matches GCA_ and GCF_ accessions
  [[ -z "${acc:-}" ]] && acc="REF$(printf '%04d' $((n+1)))"
  out="$REF_DIR/${acc}.genomic.fna"
  cp -f "$f" "$out"
  [[ -s "$out" ]] || continue
  echo "$(readlink -f "$out")" >> "$REF_LIST"
  n=$((n+1))
done < <(find "$UNZ" -type f -name "*_genomic.fna")

log "  OK: Reference FASTA collected: $n"
[[ "$n" -gt 0 ]] || die "No genomic.fna found after download."

# ---- fastANI query vs refs ----
RAW_QVSR="$OUTDIR/fastani_query_vs_refs.raw.tsv"
TSV_QVSR="$OUTDIR/fastani_query_vs_refs.tsv"

log "STEP: fastANI query vs refs"
log "  fastANI --ql $QUERY_LIST --rl $REF_LIST -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_QVSR"
fastANI \
  --ql "$QUERY_LIST" \
  --rl "$REF_LIST" \
  -t "$THREADS" \
  --minFraction "$MINFRACTION" \
  --fragLen "$FRAGLEN" \
  -o "$RAW_QVSR" >>"$ANI_LOG" 2>&1 \
  || { tail -n 160 "$ANI_LOG" >&2; die "fastANI failed (see $ANI_LOG)"; }

[[ -s "$RAW_QVSR" ]] || die "fastANI output empty: $RAW_QVSR"

echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$TSV_QVSR"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_QVSR" >> "$TSV_QVSR"

log "Top hits (ANI desc):"
tail -n +2 "$TSV_QVSR" | sort -k3,3nr | head -n 15 | sed 's/^/  /'

# ---- fastANI matrix for all genomes ----
if [[ "$DO_MATRIX" == "yes" ]]; then
  ALL_LIST="$OUTDIR/all_genomes_list.txt"
  cat "$QUERY_LIST" "$REF_LIST" > "$ALL_LIST"

  RAW_ALL="$OUTDIR/fastani_all.raw.tsv"
  log "STEP: fastANI all-vs-all matrix (for tree)"
  log "  fastANI --ql $ALL_LIST --rl $ALL_LIST --matrix -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_ALL"
  fastANI \
    --ql "$ALL_LIST" \
    --rl "$ALL_LIST" \
    --matrix \
    -t "$THREADS" \
    --minFraction "$MINFRACTION" \
    --fragLen "$FRAGLEN" \
    -o "$RAW_ALL" >>"$ANI_LOG" 2>&1 \
    || { tail -n 160 "$ANI_LOG" >&2; die "fastANI matrix failed (see $ANI_LOG)"; }

  [[ -f "${RAW_ALL}.matrix" ]] || die "Matrix not produced: ${RAW_ALL}.matrix"
  log "  OK: Matrix file: ${RAW_ALL}.matrix"
fi

log "DONE."
log "Accessions counts:"
log "  - seed:         $(wc -l < "$ACC_SEED")"
log "  - acinetobacter:$(wc -l < "$ACC_ACI")"
log "  - outgroups:    $(wc -l < "$ACC_OUT")"
log "  - all refs:     $(wc -l < "$ACC_ALL")"

♟️ Child-Friendly Explanation of Chess Notation

(suitable for U8 / U10 / U12 – tournament standard)

1️⃣ The Basic Idea of Notation

In a tournament, every move played must be written down so that:

  • the game can be analyzed later
  • arbiters can resolve disputes
  • the game can be rated for DWZ / Elo

👉 Notation simply means:

Which piece moves to which square.


2️⃣ Letters for the Pieces

Piece    Letter       Explanation
King     K            King
Queen    Q            Queen
Rook     R            Rook
Bishop   B            Bishop
Knight   N            Knight (K is already taken by the king)
Pawn     no letter    very important

⚠️ Pawns have no letter!


3️⃣ The Squares of the Chessboard

Every square has a name:

  • a–h (files, from left to right)
  • 1–8 (ranks, from bottom to top from White's point of view)

Examples:

  • e4
  • f3
  • d5

4️⃣ Writing Normal Moves

Pawn moves

  • Pawn moves to e4 → e4
  • Pawn moves to d5 → d5

👉 Write only the destination square.


Piece moves

Piece + destination square

Examples:

  • Knight to f3 → Nf3
  • Bishop to c4 → Bc4
  • Queen to d1 → Qd1

5️⃣ Captures (Very Important)

Piece captures

Piece + x + destination square

Examples:

  • Knight captures on e5 → Nxe5
  • Bishop captures on f7 → Bxf7

Never write down which piece was captured!


Pawn captures (special rule)

For a pawn, you must give the file it comes from:

  • Pawn from d4 captures on e5 → dxe5
  • Pawn from f3 captures on e4 → fxe4

6️⃣ Check, Mate, Castling

  • + = check
  • # = checkmate
  • O-O = kingside (short) castling
  • O-O-O = queenside (long) castling

Examples:

  • Qh5+
  • Qh7#

7️⃣ Two Identical Pieces Can Move to the Same Square

(This is very important in tournaments!)

Rule:

If two identical pieces could move to the same square, you must indicate which one is meant.


Example with knights

Both knights can go to d2:

  • Knight from b1 → Nbd2
  • Knight from f1 → Nfd2

Example with rooks

Both rooks can go to e1:

  • Rook from rank 2 → R2e1
  • Rook from rank 8 → R8e1

👉 Add a file letter or rank number until it is clear which piece is moving.


8️⃣ Example Game (about 10 moves)

1. e4     e5
2. Nf3    Nc6
3. Bc4    Bc5
4. d4     exd4
5. O-O    Nf6
6. e5     Ne4
7. Re1    d5
8. Bxd5   Qxd5
9. Nc3    Nxc3
10. bxc3

This game includes:

  • Pawn moves
  • Captures
  • Pawn captures
  • Castling
  • Moves by several different pieces

9️⃣ The Correct Order in a Tournament

Always like this:

  1. Make the move
  2. Press the clock
  3. Write the move down

❌ Don't write a move in advance ❌ Don't forget to write it down


🔟 The 5 Most Common Mistakes Kids Make

  1. ❌ Giving the pawn a letter (Pe4) ✅ e4

  2. ❌ Writing down which piece was captured ✅ only the destination square

  3. ❌ Forgetting the "x" on a capture ✅ Nxe5

  4. ❌ Writing first, then moving ✅ move first

  5. ❌ Panicking in time trouble ✅ better to write slowly and neatly


✅ Memory Aid for Kids

**Which piece (pawn without a letter), where it moves, x when capturing, + when giving check.**

Possible next steps:

  • 📄 an even more detailed A4 tournament version in German
  • 🧒 a simplified U8 version
  • 🧑‍⚖️ an explanation of what arbiters actually check

Data Processing Pipeline: RNA-seq × Exoproteome × MS (Proximity/ALFA)

Path: ~/DATA/Data_Michelle/MS_RNAseq_Venn_2026

This post documents the complete, reproducible pipeline used to generate a 3-circle Venn diagram integrating RNA-seq, exoproteome, and MS SIP (Proximity + ALFApulldown) datasets, harmonized at the locus-tag level.

The pipeline is intentionally verbose: all transformations, mappings, and decisions are logged so the analysis can be reconstructed later.


Overview of the Three Sets

Circle 1 — RNA-seq
Union of all significantly regulated genes across six comparisons.
Criteria: padj < 0.05 and |log2FC| > 2 (a condensed sketch of this filter follows this overview)

Circle 2 — Exoproteome
Union of six exoproteome conditions (MH/TSB × 1h, 4h, 18h).
Protein IDs (UniProt / UniParc) are mapped to locus tags via BLAST.
Direction is defined without a cutoff:

  • log2FC > 0 → UP
  • log2FC < 0 → DOWN

Circle 3 — MS SIP (ProALFA)
Union of four datasets:

  • Proximity 4h / 18h
  • ALFApulldown 4h / 18h

Protein IDs are mapped to locus tags via BLAST.
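
For orientation, the Circle 1 filter corresponds to this dplyr logic (a condensed sketch of what the Step D wrapper does; complete_data is a placeholder name for one 'Complete_Data' sheet read into a data frame):

library(dplyr)

# Circle 1 sketch: significant genes in one comparison,
# same cutoffs as the wrapper (padj < 0.05, |log2FoldChange| > 2)
sig <- complete_data %>%
  filter(!is.na(padj), !is.na(log2FoldChange),
         padj < 0.05, abs(log2FoldChange) > 2) %>%
  mutate(direction = if_else(log2FoldChange > 0, "UP", "DOWN"))

Circle 1 is then the union of sig$GeneID over the six comparisons.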

Outputs

The pipeline produces:

  • 🟢 3-circle Venn diagram (PNG + PDF)
  • 📊 Detailed Excel workbook

    • all Venn regions
    • RNA / Exo / MS membership per locus tag
    • regulation directions and source datasets
  • 🧾 Pipeline log file

    • counts per dataset
    • mapping success
    • duplicated IDs
    • intersection sizes

Step A — Fetch exoproteome FASTA from UniProt

Exoproteome Excel sheets contain UniProt accessions, while RNA-seq uses locus tags. To enable mapping, protein sequences are downloaded from UniProt and UniParc.

Script

  • 01_fetch_uniprot_fasta_from_exoproteome_excel.py

What it does

  • Reads exoproteome sheets from Excel
  • Extracts UniProt / UniParc IDs
  • Downloads FASTA via UniProt REST (with caching + fallback to UniParc)
  • Writes:

    • one FASTA per sheet
    • mapping tables
    • UP / DOWN ID lists
    • run log

Bash: fetch exoproteome FASTAs

python3 01_fetch_uniprot_fasta_from_exoproteome_excel.py \
  --excel "Zusammenfassung SIP und Exoproteome.xlsx" \
  --outdir ./exoproteome_fastas \
  --cache  ./uniprot_cache

Step B — Normalize FASTA headers (Exoproteome + MS)

FASTA headers from UniProt and MS are heterogeneous and may contain duplicates. All FASTA files are normalized to canonical, unique protein IDs.

Script

  • 02_normalize_fasta_headers.py

What it does

  • Extracts canonical IDs (sp|…|, tr|…|, raw headers)
  • Enforces uniqueness (__dupN suffix if needed)
  • Writes an ID mapping table for traceability

Bash: normalize exoproteome FASTAs

mkdir -p normalized_fastas

for f in \
  Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
  Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    exoproteome_fastas/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Bash: normalize MS (Proximity / ALFA) FASTAs

for f in \
  Proximity_4h.fasta Proximity_18h.fasta \
  ALFApulldown_4h.fasta ALFApulldown_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    ../MS_GO_enrichments/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Step C — BLAST mapping to reference proteome

All exoproteome and MS proteins are mapped to locus tags using BLASTP against the reference proteome.

Bash: merge exoproteome FASTAs

cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
    Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
    > Exoproteome_ALL.fasta

Bash: build BLAST database (once)

makeblastdb -in CP020463_protein.fasta \
  -dbtype prot \
  -out CP020463_ref_db

Bash: BLAST exoproteome

blastp -query Exoproteome_ALL.fasta \
  -db CP020463_ref_db \
  -out Exoproteome_ALL_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 \
  -max_target_seqs 5 \
  -num_threads 16

Bash: select best hit per query (exoproteome)

awk -F'\t' 'BEGIN{OFS="\t"}{
  key=$1; bits=$12+0; e=$11+0
  if(!(key in best) || bits>best_bits[key] || (bits==best_bits[key] && e<best_e[key])){
    best[key]=$0; best_bits[key]=bits; best_e[key]=e
  }
} END{for(k in best) print best[k]}' \
Exoproteome_ALL_vs_ref.blast.tsv \
| sort -t$'\t' -k1,1 \
> Exoproteome_ALL_vs_ref.besthit.tsv

Bash: BLAST MS datasets

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  blastp -query "${base}.fasta" \
    -db CP020463_ref_db \
    -out "${base}_vs_ref.blast.tsv" \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
    -evalue 1e-10 \
    -max_target_seqs 5 \
    -num_threads 16
done

Bash: best-hit selection (MS)

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  awk -F'\t' 'BEGIN{OFS="\t"}{
    q=$1; b=$12+0; e=$11+0
    if(!(q in best) || b>bb[q] || (b==bb[q] && e<be[q])){
      best[q]=$0; bb[q]=b; be[q]=e
    }
  } END{for(q in best) print best[q]}' \
  "${base}_vs_ref.blast.tsv" \
  | sort -t$'\t' -k1,1 \
  > "${base}_vs_ref.besthit.tsv"
done

Step D — R wrapper: integration, Venn, Excel, LOG

Script

  • 03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

What it does

  • Reads RNA-seq Excel files (6 comparisons)
  • Reads exoproteome and MS sheets
  • Applies BLAST best-hit mappings
  • Builds union sets and intersections
  • Generates:

    • Venn plot
    • detailed Excel workbook
    • comprehensive log file

Bash: run the final integration

Rscript 03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

Final Note

Keeping Python, Bash, and R steps together is intentional: this pipeline is meant to be auditable, reproducible, and explainable, even months later when details are fuzzy.

If you want, I can next:

  • condense this into a README.md
  • split it into Methods vs Reproducibility
  • or add a flowchart of the pipeline steps

Scripts Used

01_fetch_uniprot_fasta_from_exoproteome_excel.py

#!/usr/bin/env python3
"""
01_fetch_uniprot_fasta_from_exoproteome_excel.py (v3)

MS log2FC direction rule (no cutoff):
  Up    : log2FC > 0
  Down  : log2FC < 0
  Zero  : log2FC == 0
  NA    : missing/non-numeric log2FC

Reads per sheet:
  - protein IDs from column A (index 0)
  - log2FC from column D (index 3)

Outputs per sheet:
  - {sheet}.fasta
  - {sheet}.mapping.tsv
  - {sheet}.up.ids.txt / {sheet}.down.ids.txt
  - run.log

UniParc fallback method:
  1) Try UniProtKB FASTA: /uniprotkb/{ACC}.fasta
  2) If missing: UniParc search -> UPI
  3) Fetch UniParc FASTA: /uniparc/{UPI}.fasta
"""

import argparse
import os
import re
import time
from typing import Optional, Tuple

import pandas as pd
import requests

DEFAULT_SHEETS = [
    "Exoproteome MH 1h",
    "Exoproteome MH 4h",
    "Exoproteome MH 18h",
    "Exoproteome TSB 1h",
    "Exoproteome TSB 4h",
    "Exoproteome TSB 18h",
]

UNIPROTKB_FASTA = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"
UNIPARC_FASTA   = "https://rest.uniprot.org/uniparc/{upi}.fasta"
UNIPARC_SEARCH  = "https://rest.uniprot.org/uniparc/search"

def sanitize_name(s: str) -> str:
    s = s.strip()
    s = re.sub(r"[^\w\-\.]+", "_", s)
    s = re.sub(r"_+", "_", s)
    return s

def parse_id_cell(raw: str) -> Optional[str]:
    if raw is None:
        return None
    s = str(raw).strip()
    if not s or s.lower() in {"nan", "na"}:
        return None

    s = s.split()[0].strip()

    # header like tr|ACC|NAME
    if "|" in s:
        parts = s.split("|")
        if len(parts) >= 2 and parts[0] in {"tr", "sp"}:
            s = parts[1]
        else:
            s = parts[0]

    if re.fullmatch(r"UPI[0-9A-F]{10,}", s):
        return s
    if re.fullmatch(r"[A-Z0-9]{6,10}", s):
        return s
    return None

def to_float(x) -> Optional[float]:
    if x is None:
        return None
    try:
        s = str(x).strip()
        if s == "" or s.lower() in {"nan", "na"}:
            return None
        s = s.replace(",", ".")
        return float(s)
    except Exception:
        return None

def classify_direction_no_cutoff(log2fc: Optional[float]) -> str:
    if log2fc is None:
        return "NA"
    if log2fc > 0:
        return "Up"
    if log2fc < 0:
        return "Down"
    return "Zero"

def extract_ids_and_fc_from_sheet(excel_path: str, sheet: str) -> pd.DataFrame:
    df = pd.read_excel(excel_path, sheet_name=sheet, header=None)
    if df.shape[1] < 4:
        raise ValueError(f"Sheet '{sheet}' has <4 columns, cannot read column D (log2FC).")

    out_rows = []
    for _, row in df.iterrows():
        pid = parse_id_cell(row.iloc[0])      # col A
        if not pid:
            continue
        log2fc = to_float(row.iloc[3])        # col D
        out_rows.append({"protein_id": pid, "log2fc": log2fc})

    out = pd.DataFrame(out_rows)
    if out.empty:
        return out

    # de-duplicate by protein_id (keep first non-NA log2fc if possible)
    out["log2fc_isna"] = out["log2fc"].isna()
    out = out.sort_values(["protein_id", "log2fc_isna"]).drop_duplicates("protein_id", keep="first")
    out = out.drop(columns=["log2fc_isna"]).reset_index(drop=True)
    return out

def http_get_text(session: requests.Session, url: str, params: dict = None,
                  retries: int = 4, backoff: float = 1.5) -> Optional[str]:
    for attempt in range(retries):
        r = session.get(url, params=params, timeout=60)
        if r.status_code == 200:
            return r.text
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(backoff * (attempt + 1))
            continue
        return None
    return None

def fetch_uniprotkb_fasta(session: requests.Session, acc: str) -> Optional[str]:
    return http_get_text(session, UNIPROTKB_FASTA.format(acc=acc))

def resolve_upi_via_uniparc_search(session: requests.Session, acc: str) -> Optional[str]:
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(session, UNIPARC_SEARCH, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    if re.fullmatch(r"UPI[0-9A-F]{10,}", upi):
        return upi
    return None

def fetch_uniparc_fasta(session: requests.Session, upi: str) -> Optional[str]:
    return http_get_text(session, UNIPARC_FASTA.format(upi=upi))

def cache_get(cache_dir: str, key: str) -> Optional[str]:
    if not cache_dir:
        return None
    path = os.path.join(cache_dir, f"{key}.fasta")
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    return None

def cache_put(cache_dir: str, key: str, fasta: str) -> None:
    if not cache_dir:
        return
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.fasta")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fasta)

def rewrite_fasta_header(fasta: str, new_header: str) -> str:
    lines = fasta.splitlines()
    if not lines or not lines[0].startswith(">"):
        return fasta
    lines[0] = ">" + new_header
    return "\n".join(lines) + ("\n" if not fasta.endswith("\n") else "")

def fetch_best_fasta(session: requests.Session, pid: str, cache_dir: str = "") -> Tuple[Optional[str], str, Optional[str]]:
    """
    Returns (fasta_text, status, resolved_upi)
      status: uniprotkb | uniparc | uniparc_direct | not_found | *_cache
    """
    if pid.startswith("UPI"):
        cached = cache_get(cache_dir, pid)
        if cached:
            return cached, "uniparc_direct_cache", pid
        fasta = fetch_uniparc_fasta(session, pid)
        if fasta:
            cache_put(cache_dir, pid, fasta)
            return fasta, "uniparc_direct", pid
        return None, "not_found", pid

    cached = cache_get(cache_dir, pid)
    if cached:
        return cached, "uniprotkb_cache", None

    fasta = fetch_uniprotkb_fasta(session, pid)
    if fasta:
        cache_put(cache_dir, pid, fasta)
        return fasta, "uniprotkb", None

    upi = resolve_upi_via_uniparc_search(session, pid)
    if not upi:
        return None, "not_found", None

    cached2 = cache_get(cache_dir, upi)
    if cached2:
        return cached2, "uniparc_cache", upi

    fasta2 = fetch_uniparc_fasta(session, upi)
    if fasta2:
        cache_put(cache_dir, upi, fasta2)
        return fasta2, "uniparc", upi

    return None, "not_found", upi

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--excel", required=True, help="Input Excel file")
    ap.add_argument("--outdir", default="./exoproteome_fastas", help="Output folder")
    ap.add_argument("--cache", default="./uniprot_cache", help="Cache folder for downloaded FASTAs")
    ap.add_argument("--sleep", type=float, default=0.1, help="Sleep seconds between requests")
    ap.add_argument("--sheets", nargs="*", default=DEFAULT_SHEETS, help="Sheet names to process")
    args = ap.parse_args()

    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(args.cache, exist_ok=True)

    log_path = os.path.join(args.outdir, "run.log")
    def log(msg: str):
        print(msg)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")

    with open(log_path, "w", encoding="utf-8") as f:
        f.write("")

    xls = pd.ExcelFile(args.excel)
    missing = [s for s in args.sheets if s not in xls.sheet_names]
    if missing:
        log("WARNING: Missing sheets in workbook:")
        for s in missing:
            log(f"  - {s}")
        log("Continuing with available sheets only.\n")

    session = requests.Session()
    session.headers.update({"User-Agent": "exoproteome-fasta-fetch/3.0"})

    for sheet in args.sheets:
        if sheet not in xls.sheet_names:
            continue

        sheet_tag = sanitize_name(sheet)
        out_fasta = os.path.join(args.outdir, f"{sheet_tag}.fasta")
        out_map   = os.path.join(args.outdir, f"{sheet_tag}.mapping.tsv")
        out_up    = os.path.join(args.outdir, f"{sheet_tag}.up.ids.txt")
        out_down  = os.path.join(args.outdir, f"{sheet_tag}.down.ids.txt")

        tbl = extract_ids_and_fc_from_sheet(args.excel, sheet)
        log(f"\n=== Sheet: {sheet} ===")
        log(f"Extracted IDs (unique): {len(tbl)}")

        if tbl.empty:
            log("No IDs found. Skipping.")
            continue

        tbl["direction"] = [classify_direction_no_cutoff(x) for x in tbl["log2fc"].tolist()]

        n_fc = int(tbl["log2fc"].notna().sum())
        n_up = int((tbl["direction"] == "Up").sum())
        n_dn = int((tbl["direction"] == "Down").sum())
        n_zero = int((tbl["direction"] == "Zero").sum())
        n_na = int((tbl["direction"] == "NA").sum())

        log(f"log2FC present: {n_fc}/{len(tbl)}")
        log(f"Direction (no cutoff): Up={n_up}, Down={n_dn}, Zero={n_zero}, NA={n_na}")

        mapping_rows = []
        fasta_chunks = []

        n_ok = 0
        n_nf = 0

        for i, row in enumerate(tbl.itertuples(index=False), start=1):
            pid = row.protein_id
            log2fc = row.log2fc
            direction = row.direction

            fasta, status, upi = fetch_best_fasta(session, pid, cache_dir=args.cache)

            if fasta:
                header = pid
                if upi and upi != pid:
                    header = f"{pid}|UniParc:{upi}"
                fasta = rewrite_fasta_header(fasta, header)
                fasta_chunks.append(fasta)
                n_ok += 1
            else:
                n_nf += 1

            mapping_rows.append({
                "sheet": sheet,
                "protein_id_input": pid,
                "resolved_upi": upi if upi else "",
                "status": status,
                "log2fc": "" if log2fc is None else log2fc,
                "direction": direction,
                "fasta_written": "yes" if fasta else "no",
            })

            if args.sleep > 0:
                time.sleep(args.sleep)

            if i % 50 == 0:
                log(f"  progress: {i}/{len(tbl)} (ok={n_ok}, not_found={n_nf})")

        with open(out_fasta, "w", encoding="utf-8") as f:
            for chunk in fasta_chunks:
                f.write(chunk.strip() + "\n")

        map_df = pd.DataFrame(mapping_rows)
        map_df.to_csv(out_map, sep="\t", index=False)

        up_ids = map_df.loc[map_df["direction"] == "Up", "protein_id_input"].astype(str).tolist()
        dn_ids = map_df.loc[map_df["direction"] == "Down", "protein_id_input"].astype(str).tolist()
        with open(out_up, "w", encoding="utf-8") as f:
            f.write("\n".join(up_ids) + ("\n" if up_ids else ""))
        with open(out_down, "w", encoding="utf-8") as f:
            f.write("\n".join(dn_ids) + ("\n" if dn_ids else ""))

        log(f"Fetched FASTA: {n_ok}/{len(tbl)} (not_found={n_nf})")
        log(f"Saved FASTA: {out_fasta}")
        log(f"Saved mapping: {out_map}")
        log(f"Saved Up IDs:  {out_up} ({len(up_ids)})")
        log(f"Saved Down IDs:{out_down} ({len(dn_ids)})")

    log("\nDONE.")

if __name__ == "__main__":
    main()

02_normalize_fasta_headers.py

#!/usr/bin/env python3
import re
import sys
from collections import defaultdict

def canon_id(header: str) -> str:
    h = header.strip()
    h = h[1:] if h.startswith(">") else h

    # tr|ACC|NAME or sp|ACC|NAME
    m = re.match(r'^(sp|tr)\|([^|]+)\|', h)
    if m:
        return m.group(2)

    # otherwise: first token, then before first '|'
    first = h.split()[0]
    return first.split("|")[0]

def read_fasta(path):
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        header = None
        seq = []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header = line
                seq = []
            else:
                seq.append(line.strip())
        if header is not None:
            yield header, "".join(seq)

def main():
    if len(sys.argv) < 3:
        print("Usage: normalize_fasta_headers.py in.fasta out.fasta [map.tsv]", file=sys.stderr)
        sys.exit(1)

    in_fa  = sys.argv[1]
    out_fa = sys.argv[2]
    map_tsv = sys.argv[3] if len(sys.argv) >= 4 else out_fa + ".idmap.tsv"

    seen = defaultdict(int)

    with open(out_fa, "w", encoding="utf-8") as fo, open(map_tsv, "w", encoding="utf-8") as fm:
        fm.write("old_header\tnew_id\n")
        for old_h, seq in read_fasta(in_fa):
            base = canon_id(old_h)
            seen[base] += 1
            new_id = base if seen[base] == 1 else f"{base}__dup{seen[base]}"
            fm.write(f"{old_h}\t{new_id}\n")

            fo.write(f">{new_id}\n")
            # wrap sequence at 60
            for i in range(0, len(seq), 60):
                fo.write(seq[i:i+60] + "\n")

    print(f"Wrote: {out_fa}")
    print(f"Wrote: {map_tsv}")

if __name__ == "__main__":
    main()

03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

#!/usr/bin/env Rscript
# ============================================================
# 3-circle Venn: RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown
#
# Key points:
# - RNA-seq inputs are Excel files with sheet 'Complete_Data'
#   * GeneID is the ID used for comparison (locus tag style)
#   * Keep Preferred_name / Description for annotation in ALL Venn regions
# - Exoproteome uses MS protein IDs + log2FC sign (no cutoff), mapped to locus tags via BLAST best-hit TSV
# - Proximity/ALFApulldown (ProALFA) uses SIP/MS protein IDs mapped to locus tags via BLAST best-hit TSV
# - Debug stats + parsing/mapping summaries are written into a LOG file
#
# NOTE: (remember!) first generate Exoproteome_ALL.fasta + Exoproteome_ALL_vs_ref.blast.tsv:
#   cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \\
#       Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \\
#       > Exoproteome_ALL.fasta
#   blastp -query Exoproteome_ALL.fasta -db CP020463_protein.dmnd \\
#       -out Exoproteome_ALL_vs_ref.blast.tsv -outfmt 6 -evalue 1e-5 -max_target_seqs 50 -num_threads 40
# ============================================================

suppressPackageStartupMessages({
  library(tidyverse)
  library(readxl)
  library(openxlsx)
  library(VennDiagram)
  library(grid)
})

# -------------------------
# CONFIG
# -------------------------
RNA_PADJ_CUT <- 0.05
RNA_LFC_CUT  <- 2

OUTDIR <- "./venn_outputs"
dir.create(OUTDIR, showWarnings = FALSE, recursive = TRUE)

LOGFILE <- file.path(OUTDIR, "pipeline_log.txt")
sink(LOGFILE, split = TRUE)

cat("=== 3-circle Venn pipeline ===\n")
cat(sprintf("RNA cutoff: padj < %.3g and |log2FoldChange| > %.3g\n", RNA_PADJ_CUT, RNA_LFC_CUT))
cat("Output dir:", normalizePath(OUTDIR), "\n\n")

# RNA inputs (Excel, sheet = Complete_Data)
RNA_FILES <- list(
  deltasbp_MH_2h  = "./DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx",
  deltasbp_MH_4h  = "./DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx",
  deltasbp_MH_18h = "./DEG_Annotations_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx",
  deltasbp_TSB_2h  = "./DEG_Annotations_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx",
  deltasbp_TSB_4h  = "./DEG_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx",
  deltasbp_TSB_18h = "./DEG_Annotations_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx"
)

# Exoproteome sheets (from summary Excel)
EXO_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
EXO_SHEETS <- list(
  Exoproteome_MH_1h  = "Exoproteome MH 1h",
  Exoproteome_MH_4h  = "Exoproteome MH 4h",
  Exoproteome_MH_18h = "Exoproteome MH 18h",
  Exoproteome_TSB_1h  = "Exoproteome TSB 1h",
  Exoproteome_TSB_4h  = "Exoproteome TSB 4h",
  Exoproteome_TSB_18h = "Exoproteome TSB 18h"
)

# Exoproteome BLAST best-hit mapping (query protein -> locus tag)
EXO_BLAST_TSV <- "./Exoproteome_ALL_vs_ref.besthit.tsv"

# Proximity/ALFApulldown sheets (from same summary Excel)
PROALFA_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
PROALFA_SHEETS <- list(
  Proximity_4h = "Proximity 4h",
  Proximity_18h = "Proximity 18h",
  ALFApulldown_4h = "ALFApulldown 4h",
  ALFApulldown_18h = "ALFApulldown 18h"
)

# ProALFA BLAST best-hit mapping (each dataset protein -> locus tag)
PROALFA_BLAST_TSV <- list(
  Proximity_4h      = "./Proximity_4h_vs_ref.besthit.tsv",
  Proximity_18h     = "./Proximity_18h_vs_ref.besthit.tsv",
  ALFApulldown_4h   = "./ALFApulldown_4h_vs_ref.besthit.tsv",
  ALFApulldown_18h  = "./ALFApulldown_18h_vs_ref.besthit.tsv"
)

# FASTA inputs (QC only)
FASTA_FILES <- list(
  Proximity_4h      = "./Proximity_4h.fasta",
  Proximity_18h     = "./Proximity_18h.fasta",
  ALFApulldown_4h   = "./ALFApulldown_4h.fasta",
  ALFApulldown_18h  = "./ALFApulldown_18h.fasta",
  Exo_MH_1h         = "./Exoproteome_MH_1h.fasta",
  Exo_MH_4h         = "./Exoproteome_MH_4h.fasta",
  Exo_MH_18h        = "./Exoproteome_MH_18h.fasta",
  Exo_TSB_1h        = "./Exoproteome_TSB_1h.fasta",
  Exo_TSB_4h        = "./Exoproteome_TSB_4h.fasta",
  Exo_TSB_18h       = "./Exoproteome_TSB_18h.fasta"
)

# -------------------------
# HELPERS
# -------------------------
safe_file_check <- function(fp, label) {
  if (!file.exists(fp)) stop(label, " missing: ", fp)
}

canonical_id_one <- function(x) {
  x <- as.character(x)
  x <- stringr::str_trim(x)

  if (is.na(x) || x == "") return(NA_character_)

  # remove leading ">"
  x <- stringr::str_remove(x, "^>")

  # keep first token before whitespace
  x <- stringr::str_split(x, "\\s+", simplify = TRUE)[1]

  # handle UniProt-style: tr|ID|... or sp|ID|...
  if (stringr::str_detect(x, "\\|")) {
    parts <- stringr::str_split(x, "\\|")[[1]]
    if (length(parts) >= 2 && parts[1] %in% c("tr", "sp")) {
      x <- parts[2]
    } else {
      x <- parts[1]
    }
  }

  x
}

canonical_id <- function(x) {
  vapply(x, canonical_id_one, character(1))
}

read_fasta_headers <- function(fp) {
  safe_file_check(fp, "FASTA")
  hdr <- readLines(fp, warn = FALSE) %>% keep(~str_starts(.x, ">"))
  tibble(raw = hdr, id = canonical_id(hdr)) %>%
    filter(!is.na(id), id != "")
}

norm_name <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[^a-z0-9]+", "", x)
  x
}

# Robust column picker (case-insensitive + punctuation-insensitive)
find_col <- function(nms, target) {
  nms <- as.character(nms)
  target_n <- norm_name(target)
  n_n <- norm_name(nms)

  # exact normalized match
  idx <- which(n_n == target_n)
  if (length(idx) > 0) return(nms[idx[1]])

  # contains (normalized)
  idx <- which(str_detect(n_n, fixed(target_n)))
  if (length(idx) > 0) return(nms[idx[1]])

  NA_character_
}

# Best-hit parser (BLAST outfmt6):
# qid sid pident length mismatch gapopen qstart qend sstart send evalue bitscore
read_besthit_tsv <- function(fp) {
  safe_file_check(fp, "BLAST TSV")
  suppressPackageStartupMessages({
    df <- read_tsv(fp, col_names = FALSE, show_col_types = FALSE, progress = FALSE)
  })
  if (ncol(df) < 12) stop("BLAST TSV has <12 columns: ", fp)
  colnames(df)[1:12] <- c("qid","sid","pident","length","mismatch","gapopen",
                         "qstart","qend","sstart","send","evalue","bitscore")
  df %>%
    mutate(
      qid = canonical_id(qid),
      sid = as.character(sid),
      bitscore = suppressWarnings(as.numeric(bitscore)),
      evalue = suppressWarnings(as.numeric(evalue))
    ) %>%
    filter(!is.na(qid), qid != "")
}

besthit_reduce <- function(df) {
  # one best hit per query: highest bitscore, then lowest evalue
  df %>%
    arrange(qid, desc(bitscore), evalue) %>%
    group_by(qid) %>%
    slice(1) %>%
    ungroup()
}

# Try to extract locus tag from subject id
extract_locus_tag <- function(sid) {
  sid <- as.character(sid)
  # if sid already looks like B4U56_00090
  if (str_detect(sid, "^B4U56_\\d{5}$")) return(sid)
  # otherwise try to find locus tag pattern anywhere
  hit <- str_extract(sid, "B4U56_\\d{5}")
  hit
}

read_exo_sheet <- function(xlsx, sheet) {
  df <- read_excel(xlsx, sheet = sheet, col_names = TRUE)

  # Protein column: assume first column contains protein IDs
  prot_col <- names(df)[1]

  # log2FC column: per user it's column D, name looks like "log 2 FC)"
  log2_col <- find_col(names(df), "log 2 FC)")
  if (is.na(log2_col)) {
    # fallback fuzzy: any col containing both log2 and fc
    hits <- names(df)[str_detect(norm_name(names(df)), "log2") & str_detect(norm_name(names(df)), "fc")]
    if (length(hits) > 0) log2_col <- hits[1]
  }
  if (is.na(log2_col)) stop("Exoproteome sheet missing log2FC column ('log 2 FC)'): ", sheet)

  out <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      log2FC = suppressWarnings(as.numeric(.data[[log2_col]]))
    ) %>%
    filter(!is.na(Protein_raw), Protein_raw != "")

  out <- out %>%
    mutate(
      Protein_ID = canonical_id(Protein_raw),
      EXO_direction = case_when(
        is.na(log2FC) ~ NA_character_,
        log2FC > 0 ~ "UP",
        log2FC < 0 ~ "DOWN",
        TRUE ~ "ZERO"
      )
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "")

  out
}

read_rna_excel <- function(fp) {
  safe_file_check(fp, "RNA Excel")
  df <- read_excel(fp, sheet = "Complete_Data", col_names = TRUE)

  # required columns
  req <- c("GeneID","padj","log2FoldChange")
  miss <- setdiff(req, names(df))
  if (length(miss) > 0) stop("RNA Excel missing columns: ", paste(miss, collapse=", "), " in ", fp)

  # optional annotation columns
  pref_col <- if ("Preferred_name" %in% names(df)) "Preferred_name" else NA_character_
  desc_col <- if ("Description" %in% names(df)) "Description" else NA_character_

  df <- df %>%
    mutate(
      GeneID_plain = str_remove(as.character(GeneID), "^gene-"),
      padj = suppressWarnings(as.numeric(padj)),
      log2FoldChange = suppressWarnings(as.numeric(log2FoldChange)),
      Preferred_name = if (!is.na(pref_col)) as.character(.data[[pref_col]]) else NA_character_,
      Description    = if (!is.na(desc_col)) as.character(.data[[desc_col]]) else NA_character_
    )

  df
}

# -------------------------
# FASTA QC
# -------------------------
cat("[FASTA] Reading FASTA files for QC (10)\n")
fasta_qc_stats <- list()

for (nm in names(FASTA_FILES)) {
  fp <- FASTA_FILES[[nm]]
  safe_file_check(fp, "FASTA file")
  h <- read_fasta_headers(fp)
  n_hdr <- nrow(h)
  n_uniq <- n_distinct(h$id)
  n_dup <- n_hdr - n_uniq
  cat(sprintf("  - %s: headers=%d, unique_IDs=%d, duplicates=%d\n", nm, n_hdr, n_uniq, n_dup))

  dup_ids <- h %>% count(id, sort = TRUE) %>% filter(n > 1)
  fasta_qc_stats[[nm]] <- tibble(
    Dataset = nm,
    headers = n_hdr,
    unique_IDs = n_uniq,
    duplicates = n_dup,
    top_duplicate_id = ifelse(nrow(dup_ids) > 0, dup_ids$id[1], NA_character_),
    top_duplicate_n  = ifelse(nrow(dup_ids) > 0, dup_ids$n[1], NA_integer_)
  )
}
fasta_qc_stats_df <- bind_rows(fasta_qc_stats)

# -------------------------
# RNA: read 6 comparisons (Excel)
# -------------------------
cat("\n[RNA] Reading 6 RNA-seq comparisons (Excel: Complete_Data)\n")
rna_sig_long <- list()
rna_all_long <- list()

for (cmp in names(RNA_FILES)) {
  fp <- RNA_FILES[[cmp]]
  df <- read_rna_excel(fp)

  # Keep a full-gene annotation table (used later to annotate EXO/ProALFA-only members)
  rna_all_long[[cmp]] <- df %>%
    mutate(Comparison = cmp) %>%
    select(GeneID_plain, Preferred_name, Description, padj, log2FoldChange, Comparison)

  # Debug: how many bad entries
  bad_gene <- sum(is.na(df$GeneID_plain) | df$GeneID_plain == "")
  bad_padj <- sum(is.na(df$padj))
  bad_lfc  <- sum(is.na(df$log2FoldChange))

  sig <- df %>%
    filter(!is.na(GeneID_plain), GeneID_plain != "",
           !is.na(padj), !is.na(log2FoldChange),
           padj < RNA_PADJ_CUT,
           abs(log2FoldChange) > RNA_LFC_CUT) %>%
    mutate(
      Comparison = cmp,
      RNA_direction = ifelse(log2FoldChange > 0, "UP", "DOWN")
    ) %>%
    select(GeneID_plain, Comparison, padj, log2FoldChange, RNA_direction,
           Preferred_name, Description)

  cat(sprintf("  - %s: rows=%d (bad GeneID=%d, bad padj=%d, bad log2FC=%d); significant=%d (UP=%d, DOWN=%d)\n",
              cmp, nrow(df), bad_gene, bad_padj, bad_lfc,
              nrow(sig), sum(sig$RNA_direction == "UP"), sum(sig$RNA_direction == "DOWN")))

  rna_sig_long[[cmp]] <- sig
}

rna_sig_long <- bind_rows(rna_sig_long)
rna_all_long <- bind_rows(rna_all_long)

RNA <- sort(unique(rna_sig_long$GeneID_plain))
cat(sprintf("[RNA] Union significant genes (6 comps): %d\n", length(RNA)))

# Build an annotation lookup from *ALL* genes (so EXO-only/ProALFA-only also get annotation)
rna_ann <- rna_all_long %>%
  filter(!is.na(GeneID_plain), GeneID_plain != "") %>%
  group_by(GeneID_plain) %>%
  summarise(
    # take the first non-empty value per gene, falling back to the first value seen
    Preferred_name = { v <- Preferred_name[!is.na(Preferred_name) & Preferred_name != ""]; if (length(v) > 0) v[1] else Preferred_name[1] },
    Description    = { v <- Description[!is.na(Description) & Description != ""]; if (length(v) > 0) v[1] else Description[1] },
    .groups = "drop"
  )

# -------------------------
# EXO: read exoproteome sheets + mapping (BLAST)
# -------------------------
cat("\n[EXO] Reading Exoproteome sheets + mapping via BLAST\n")
safe_file_check(EXO_SUMMARY_XLSX, "Exoproteome summary Excel")
safe_file_check(EXO_BLAST_TSV, "Exoproteome BLAST TSV")

exo_long <- list()
for (sh in names(EXO_SHEETS)) {
  df <- read_exo_sheet(EXO_SUMMARY_XLSX, EXO_SHEETS[[sh]]) %>%
    mutate(Sheet = EXO_SHEETS[[sh]])
  cat(sprintf("  - %s: proteins=%d (with log2FC=%d; UP=%d, DOWN=%d)\n",
              EXO_SHEETS[[sh]],
              n_distinct(df$Protein_ID),
              sum(!is.na(df$log2FC)),
              sum(df$EXO_direction == "UP", na.rm = TRUE),
              sum(df$EXO_direction == "DOWN", na.rm = TRUE)))
  exo_long[[sh]] <- df
}
exo_long <- bind_rows(exo_long) %>%
  distinct(Protein_ID, Sheet, .keep_all = TRUE)

# BLAST mapping
exo_blast <- read_besthit_tsv(EXO_BLAST_TSV)
exo_best  <- besthit_reduce(exo_blast) %>%
  mutate(LocusTag = map_chr(sid, extract_locus_tag))

cat(sprintf("[BLAST] Exoproteome_ALL rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
            nrow(exo_blast), n_distinct(exo_blast$qid), nrow(exo_best), n_distinct(exo_best$LocusTag)))

dup_targets <- exo_best %>%
  filter(!is.na(LocusTag)) %>%
  count(LocusTag, sort = TRUE) %>%
  filter(n > 1)

cat(sprintf("[BLAST] Exoproteome_ALL duplicated targets in besthits: %d (top 5)\n", nrow(dup_targets)))
if (nrow(dup_targets) > 0) print(head(dup_targets, 5))

exo_mapped <- exo_long %>%
  left_join(exo_best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid")) %>%
  mutate(LocusTag = as.character(LocusTag))

cat(sprintf("[EXO] Proteins before mapping: %d unique\n", n_distinct(exo_long$Protein_ID)))
cat(sprintf("[EXO] Mapped proteins (have LocusTag): %d\n", sum(!is.na(exo_mapped$LocusTag))))
cat(sprintf("[EXO] Unique mapped proteins: %d\n", n_distinct(exo_mapped$Protein_ID[!is.na(exo_mapped$LocusTag)])))
cat(sprintf("[EXO] Unique mapped locus tags: %d\n", n_distinct(exo_mapped$LocusTag[!is.na(exo_mapped$LocusTag)])))

EXO <- sort(unique(na.omit(exo_mapped$LocusTag)))

# -------------------------
# ProALFA: read proximity + ALFA sheets + mapping (BLAST)
# -------------------------
cat("\n[ProALFA] Reading Proximity/ALFA sheets + mapping via BLAST\n")
safe_file_check(PROALFA_SUMMARY_XLSX, "ProALFA summary Excel")

ms_long <- list()
for (ds in names(PROALFA_SHEETS)) {
  sh <- PROALFA_SHEETS[[ds]]
  if (!(sh %in% readxl::excel_sheets(PROALFA_SUMMARY_XLSX))) {
    stop("Sheet '", sh, "' not found")
  }
  df <- read_excel(PROALFA_SUMMARY_XLSX, sheet = sh, col_names = TRUE)

  # assume first column has protein IDs
  prot_col <- names(df)[1]
  tmp <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      Protein_ID = canonical_id(Protein_raw)
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "") %>%
    distinct(Protein_ID)

  cat(sprintf("  - %s (%s): proteins=%d unique\n", ds, sh, nrow(tmp)))
  ms_long[[ds]] <- tmp %>% mutate(Dataset = ds)
}
ms_long <- bind_rows(ms_long)

# mapping per dataset
ms_mapped_list <- list()
PROALFA_sets <- list()

for (ds in names(PROALFA_BLAST_TSV)) {
  bt <- PROALFA_BLAST_TSV[[ds]]
  safe_file_check(bt, paste0("ProALFA BLAST TSV: ", ds))
  b <- read_besthit_tsv(bt)
  best <- besthit_reduce(b) %>%
    mutate(LocusTag = map_chr(sid, extract_locus_tag))

  cat(sprintf("[BLAST] %s rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
              ds, nrow(b), n_distinct(b$qid), nrow(best), n_distinct(best$LocusTag)))

  dup_t <- best %>%
    filter(!is.na(LocusTag)) %>%
    count(LocusTag, sort = TRUE) %>%
    filter(n > 1)
  if (nrow(dup_t) > 0) {
    cat(sprintf("[BLAST] %s duplicated targets in besthits: %d (top 5)\n", ds, nrow(dup_t)))
    print(head(dup_t, 5))
  }

  tmp_ds <- ms_long %>%
    filter(Dataset == ds) %>%
    left_join(best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid"))

  cat(sprintf("[ProALFA] %s: proteins=%d, mapped=%d, unique_targets=%d\n",
              ds,
              n_distinct(tmp_ds$Protein_ID),
              sum(!is.na(tmp_ds$LocusTag)),
              n_distinct(tmp_ds$LocusTag[!is.na(tmp_ds$LocusTag)])))

  ms_mapped_list[[ds]] <- tmp_ds
  PROALFA_sets[[ds]] <- sort(unique(na.omit(tmp_ds$LocusTag)))
}
ms_mapped_all <- bind_rows(ms_mapped_list)

ProALFA <- sort(unique(unlist(PROALFA_sets)))
cat(sprintf("[ProALFA] Union locus tags (Proximity+ALFA): %d\n", length(ProALFA)))

# -------------------------
# VENN intersections
# -------------------------
cat("\n[VENN] Computing intersections\n")
cat(sprintf("Set sizes: RNA=%d, EXO=%d, ProALFA=%d\n", length(RNA), length(EXO), length(ProALFA)))

i_RNA_EXO     <- intersect(RNA, EXO)
i_RNA_ProALFA <- intersect(RNA, ProALFA)
i_EXO_ProALFA <- intersect(EXO, ProALFA)
i_all3        <- Reduce(intersect, list(RNA, EXO, ProALFA))

cat(sprintf("Intersections: RNA∩EXO=%d, RNA∩ProALFA=%d, EXO∩ProALFA=%d, ALL3=%d\n",
            length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)))

# -------------------------
# Venn plot
# -------------------------
venn_png <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.png")
venn_pdf <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.pdf")

cat("[VENN] Writing plot:", venn_png, "\n")
venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = venn_png,
  imagetype = "png",
  height = 2400, width = 2400, resolution = 300,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  # Increase 3rd value to push the long bottom label further out
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)

cat("[VENN] Writing plot:", venn_pdf, "\n")
vd <- venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = NULL,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)
pdf(venn_pdf, width=7.5, height=7.5)
grid.draw(vd)
dev.off()

# -------------------------
# Region table (ALL members + direction annotations)
# -------------------------
all_members <- tibble(LocusTag = sort(unique(c(RNA, EXO, ProALFA)))) %>%
  mutate(
    in_RNA = LocusTag %in% RNA,
    in_EXO = LocusTag %in% EXO,
    in_ProALFA = LocusTag %in% ProALFA
  )

# RNA direction per locus tag (from significant set only; union across comparisons)
rna_dir_tbl <- rna_sig_long %>%
  group_by(GeneID_plain) %>%
  summarise(
    RNA_comparisons = paste(sort(unique(Comparison)), collapse = ";"),
    RNA_dirs = paste(sort(unique(RNA_direction)), collapse = ";"),
    min_padj = suppressWarnings(min(padj, na.rm = TRUE)),
    max_abs_log2FC = suppressWarnings(max(abs(log2FoldChange), na.rm = TRUE)),
    .groups = "drop"
  ) %>%
  rename(LocusTag = GeneID_plain)

# EXO direction aggregated per locus tag
exo_dir_tbl <- exo_mapped %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    EXO_sheets = paste(sort(unique(Sheet)), collapse = ";"),
    EXO_dirs = paste(sort(unique(EXO_direction)), collapse = ";"),
    EXO_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

# ProALFA membership aggregated per locus tag
proalfa_tbl <- ms_mapped_all %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    ProALFA_datasets = paste(sort(unique(Dataset)), collapse = ";"),
    ProALFA_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

region_tbl <- all_members %>%
  left_join(rna_dir_tbl, by = "LocusTag") %>%
  left_join(exo_dir_tbl, by = "LocusTag") %>%
  left_join(proalfa_tbl, by = "LocusTag") %>%
  left_join(rna_ann, by = c("LocusTag" = "GeneID_plain")) %>%
  arrange(desc(in_RNA), desc(in_EXO), desc(in_ProALFA), LocusTag)

region_only <- function(inRNA, inEXO, inProALFA) {
  region_tbl %>%
    filter(in_RNA == inRNA, in_EXO == inEXO, in_ProALFA == inProALFA)
}

# -------------------------
# Excel output
# -------------------------
out_xlsx <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA_detailed.xlsx")
cat("[EXCEL] Writing:", out_xlsx, "\n")

wb <- createWorkbook()

addWorksheet(wb, "Summary")
writeData(wb, "Summary", data.frame(
  Metric = c("RNA cutoff", "RNA union", "EXO union", "ProALFA union",
             "RNA∩EXO", "RNA∩ProALFA", "EXO∩ProALFA", "ALL3"),
  Value  = c(
    sprintf("padj < %.3g & |log2FC| > %.3g", RNA_PADJ_CUT, RNA_LFC_CUT),
    length(RNA), length(EXO), length(ProALFA),
    length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)
  )
))

addWorksheet(wb, "FASTA_QC")
writeData(wb, "FASTA_QC", fasta_qc_stats_df)

addWorksheet(wb, "RNA_sig_long")
writeData(wb, "RNA_sig_long", rna_sig_long)

addWorksheet(wb, "RNA_all_annotations")
writeData(wb, "RNA_all_annotations", rna_all_long)

addWorksheet(wb, "EXO_proteins")
writeData(wb, "EXO_proteins", exo_long)

addWorksheet(wb, "EXO_mapped")
writeData(wb, "EXO_mapped", exo_mapped)

addWorksheet(wb, "ProALFA_proteins")
writeData(wb, "ProALFA_proteins", ms_long)

addWorksheet(wb, "ProALFA_mapped")
writeData(wb, "ProALFA_mapped", ms_mapped_all)

addWorksheet(wb, "All_members_with_dirs")
writeData(wb, "All_members_with_dirs", region_tbl)

addWorksheet(wb, "Only_RNA")
writeData(wb, "Only_RNA", region_only(TRUE, FALSE, FALSE))

addWorksheet(wb, "Only_EXO")
writeData(wb, "Only_EXO", region_only(FALSE, TRUE, FALSE))

addWorksheet(wb, "Only_ProALFA")
writeData(wb, "Only_ProALFA", region_only(FALSE, FALSE, TRUE))

addWorksheet(wb, "RNA_EXO_only")
writeData(wb, "RNA_EXO_only", region_only(TRUE, TRUE, FALSE))

addWorksheet(wb, "RNA_ProALFA_only")
writeData(wb, "RNA_ProALFA_only", region_only(TRUE, FALSE, TRUE))

addWorksheet(wb, "EXO_ProALFA_only")
writeData(wb, "EXO_ProALFA_only", region_only(FALSE, TRUE, TRUE))

addWorksheet(wb, "ALL3")
writeData(wb, "ALL3", region_only(TRUE, TRUE, TRUE))

saveWorkbook(wb, out_xlsx, overwrite = TRUE)

cat("\nDONE.\nOutputs:\n")
cat(" -", venn_png, "\n")
cat(" -", venn_pdf, "\n")
cat(" -", out_xlsx, "\n")
cat(" - LOG:", normalizePath(LOGFILE), "\n")

# dump warnings (so they go into log too)
if (!is.null(warnings())) {
  cat("\n[WARNINGS]\n")
  print(warnings())
}

sink()

Final Notes

  • All comparisons are locus-tag consistent, enabling clean cross-omics integration.
  • Exoproteome regulation is treated directionally (UP/DOWN only), by design.
  • Logs and Excel outputs are intentionally verbose to support downstream inspection and reuse.

This pipeline is designed to be transparent, auditable, and extensible, which makes it well suited to iterative multi-omics analysis and figure generation.

From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)

Path: ~/DATA/Data_Michelle/Data_Michelle_ProteinClustering_2026

Goal

For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:

  1. Export protein sequences as FASTA
  2. Annotate them with EggNOG-mapper (including the COG_category column)
  3. Summarize COG composition at two levels:

    • COG letters (J/A/K/…/S), including multi-letter cases like IQ
    • 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
  4. Export both summary statistics and the underlying protein IDs for each category/group.

Step 0 — Why this manual annotation approach is needed (non-model organism)

Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.


Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

Export sequences for each MS dataset:

python3 getProteinSequences_Proximity_4h.py      # writes Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # writes Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # writes ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # writes ALFApulldown_18h.fasta

Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py).
Output: one FASTA per dataset (*.fasta), used as EggNOG input.

1B) (USED FOR RNA-SEQ, NOT USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta)
Output: cleaned FASTA headers (CP020463_protein.fasta)


Step 2 — Generate EggNOG annotation files (*.emapper.annotations)

2A) Install EggNOG-mapper

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda   # eggnog-mapper 2.1.12
mamba activate eggnog_env

2B) Download / prepare EggNOG database

mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

2C) Run emapper.py on FASTA inputs

(USED FOR RNA-SEQ, NOT USED HERE) For the RNA-seq reference proteome (optional but useful for integration):

emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60    # --resume if needed

For MS protein sets (the ones used for COG clustering):

emapper.py -i Proximity_4h.fasta      -o eggnog_out_Proximity_4h      --cpu 60   # --resume
emapper.py -i Proximity_18h.fasta     -o eggnog_out_Proximity_18h     --cpu 60
emapper.py -i ALFApulldown_4h.fasta   -o eggnog_out_ALFApulldown_4h   --cpu 60
emapper.py -i ALFApulldown_18h.fasta  -o eggnog_out_ALFApulldown_18h  --cpu 60

Key outputs used downstream (one per dataset):

  • eggnog_out_Proximity_4h.emapper.annotations
  • eggnog_out_Proximity_18h.emapper.annotations
  • eggnog_out_ALFApulldown_4h.emapper.annotations
  • eggnog_out_ALFApulldown_18h.emapper.annotations

These files include a column named COG_category, which is the basis for the clustering below.
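
As a quick illustration of what Step 3 consumes: COG_category holds zero or more single-letter classes per protein (multi-letter values like IQ, and "-" for unassigned). A minimal sketch of the splitting logic used below (the two-protein table is invented; the real input is an *.emapper.annotations file):

import pandas as pd

# invented example standing in for an *.emapper.annotations table
df = pd.DataFrame({"query": ["P1", "P2"], "COG_category": ["IQ", "-"]})

VALID = set("JAKLBDYVTMNZWUOCGEFHIPQRS")  # the 25 COG letters

def split_letters(cog) -> list:
    # "-" (or empty) means unassigned; "IQ" contributes to both I and Q
    return [c for c in str(cog) if c in VALID]

for q, cog in zip(df["query"], df["COG_category"]):
    print(q, split_letters(cog))  # P1 ['I', 'Q'] ; P2 []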


Step 3 — COG clustering + reporting (this post’s main script)

Inputs

The four *.emapper.annotations files (must contain COG_category).

Outputs

All outputs are written to ./COG_outputs/:

Per dataset (Excel): COG_[Dataset].xlsx

  • COG_letters: per-letter count + percent
  • Debug: unassigned COG rows, R/S proportion, etc.
  • Protein_assignments: per-protein COG letters + functional groups
  • Protein_lists_by_COG: protein IDs per COG letter
  • Protein_lists_by_group: protein IDs per major functional class
  • Long_format_COG / Long_format_group: one row per protein per category/group (best for filtering)

Combined (Excel): COG_combined_summary.xlsx

  • combined counts/percents across datasets
  • combined long-format tables

Plots (PNG + PDF):

  • COG_grouped_barplot_percent_letters.* (COG letters across datasets)
  • COG_functional_groups.* (4 functional classes across datasets)
Run the clustering/reporting script:

python3 cog_cluster.py
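
For downstream filtering, the long-format sheets are usually the most convenient entry point. A minimal read-back sketch (file and sheet names as written by cog_cluster.py; assumes pandas with openpyxl installed):

import pandas as pd

# one row per protein per COG letter; e.g. pull all category-S proteins of one dataset
long_cog = pd.read_excel("./COG_outputs/COG_Proximity_4h.xlsx", sheet_name="Long_format_COG")
s_proteins = long_cog.loc[long_cog["COG"] == "S", "Protein_ID"].unique().tolist()
print(len(s_proteins), "proteins in category S (Proximity_4h)")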

Interpretation notes

  • A high “POORLY CHARACTERIZED” (R/S) fraction is common when EggNOG cannot assign confident functions, e.g., when many proteins are effectively “hypothetical protein”-like, strain-specific, or lack strong ortholog support.

  • A large “POORLY CHARACTERIZED” share is not a plotting artifact: the datasets themselves contain many proteins that EggNOG annotates as unknown. Why does “S” show up so often in the Proximity-labeling / pulldown sets? Common biological/annotation reasons:

    • Small / strain-specific proteins (often hypothetical)
    • Membrane or secreted proteins with poor functional characterization
    • Mobile genetic elements / phage-related proteins
    • Proteins with only weak orthology support → EggNOG assigns S rather than a confident functional class
  • Group totals that do not sum to 100% can happen because (see the worked sketch after this list):

    • some proteins have multi-letter COGs (e.g., IQ) → count in multiple groups
    • some proteins have no COG assignment (-) → don’t contribute to any group
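
A tiny worked sketch of both effects, using the same letter-to-group logic as the script below (the protein annotations here are invented):

# invented annotations: "KT" spans two functional groups, so it is counted twice
GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": set("JAKLB"),
    "CELLULAR PROCESSES AND SIGNALING":   set("DYVTMNZWUO"),
    "METABOLISM":                         set("CGEFHIPQ"),
    "POORLY CHARACTERIZED":               set("RS"),
}
proteins = {"P1": "KT", "P2": "C"}

n = len(proteins)
pcts = {}
for g, letters in GROUPS.items():
    members = [p for p, cog in proteins.items() if any(c in letters for c in cog)]
    pcts[g] = 100 * len(members) / n

for g, pct in pcts.items():
    print(f"{g}: {pct:.0f}%")
print("sum:", sum(pcts.values()))  # 150%: P1 ("KT") is counted in two groups
# a protein with COG_category "-" would join no group and pull the sum below 100%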

Code snippet (generate Proximity_4h FASTA, used in Step 2)

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)

Code snippet (COG clustering script, used in Step 3)

#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)

Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
    * Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
    * Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
  (note: method still allows multi-group membership; we just don't print it in labels)
"""

import os
import re
import pandas as pd
import numpy as np

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
    "Proximity_4h":     "./eggnog_out_Proximity_4h.emapper.annotations",
    "Proximity_18h":    "./eggnog_out_Proximity_18h.emapper.annotations",
    "ALFApulldown_4h":  "./eggnog_out_ALFApulldown_4h.emapper.annotations",
    "ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}

OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)

ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]

COG_DESCRIPTIONS = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}

FUNCTIONAL_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
    "CELLULAR PROCESSES AND SIGNALING":   ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
    "METABOLISM":                        ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
    "POORLY CHARACTERIZED":              ['R', 'S'],
}

LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
    for c in letters:
        LETTER_TO_GROUP[c] = grp

# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
    rows = []
    header = None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue
            if line.startswith("#query"):
                header = line.lstrip("#").split("\t")
                continue
            if header is None and re.match(r"^query\t", line):
                header = line.split("\t")
                continue
            if header is None:
                continue
            rows.append(line.split("\t"))

    if header is None:
        raise ValueError(f"Could not find header line (#query/query) in: {path}")

    df = pd.DataFrame(rows, columns=header)
    if "query" not in df.columns and "#query" in df.columns:
        df = df.rename(columns={"#query": "query"})
    return df

def split_cog_letters(cog_str):
    if cog_str is None:
        return []
    cog_str = str(cog_str).strip()
    if cog_str == "" or cog_str == "-" or cog_str.lower() == "nan":
        return []
    letters = list(cog_str)
    return [c for c in letters if c in ALL_COG]

def count_cog_letters(df: pd.DataFrame) -> dict:
    counts = {c: 0 for c in ALL_COG}
    for x in df["COG_category"].astype(str).tolist():
        for c in split_cog_letters(x):
            counts[c] += 1
    return counts

def build_category_to_proteins(df: pd.DataFrame) -> dict:
    cat2prot = {c: set() for c in ALL_COG}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        for c in letters:
            cat2prot[c].add(q)
    return {c: sorted(list(v)) for c, v in cat2prot.items()}

def build_group_to_proteins(df: pd.DataFrame) -> dict:
    group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        for g in hit_groups:
            group2prot[g].add(q)
    return {g: sorted(list(v)) for g, v in group2prot.items()}

def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
        rows.append({
            "query": q,
            "COG_category_raw": cog,
            "COG_letters": "".join(sorted(set(letters))) if letters else "",
            "Functional_groups": "; ".join(groups) if groups else "",
            "Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA"])
        })
    out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
    return out.sort_values("query")

def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for c in ALL_COG:
        for pid in cat2prot.get(c, []):
            rows.append({
                "Dataset": dataset,
                "COG": c,
                "Description": COG_DESCRIPTIONS.get(c, ""),
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for g in FUNCTIONAL_GROUPS.keys():
        for pid in grp2prot.get(g, []):
            rows.append({
                "Dataset": dataset,
                "Functional_group": g,
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
    groups = list(FUNCTIONAL_GROUPS.keys())
    recs = []
    for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        rec = {"query": q}
        for g in groups:
            rec[g] = (g in hit_groups)
        recs.append(rec)

    out = pd.DataFrame(recs)
    out = out.groupby("query", as_index=False).max()
    return out

# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []

for ds_name, path in INPUT_FILES.items():
    print(f"\n--- {ds_name} ---\nReading: {path}")
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing input file: {path}")

    df = read_emapper_annotations(path)
    if "COG_category" not in df.columns:
        raise ValueError(f"{path} has no 'COG_category' column.")

    n_rows = df.shape[0]
    unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()

    # Letter-based counts and percents
    counts = count_cog_letters(df)
    total_letters = sum(counts.values())
    pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}

    counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
    pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)

    all_counts[ds_name] = counts_s
    all_pct[ds_name] = pct_s

    out_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "Count": counts_s.values,
        "Percent": pct_s.values,
    })

    dbg = pd.DataFrame([{
        "Dataset": ds_name,
        "Proteins_in_table": int(n_rows),
        "COG_unassigned_rows": int(unassigned),
        "Total_assigned_COG_letters": int(total_letters),
        "R_count": int(counts.get("R", 0)),
        "S_count": int(counts.get("S", 0)),
        "R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
    }])

    debug_summary.append(dbg.iloc[0].to_dict())

    assignment_tbl = build_assignment_table(df)
    cat2prot = build_category_to_proteins(df)
    grp2prot = build_group_to_proteins(df)

    # category (COG letter) protein lists
    cat_list_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
        "Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
    })

    # group protein lists
    grp_list_df = pd.DataFrame({
        "Functional_group": list(FUNCTIONAL_GROUPS.keys()),
        "N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
        "Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
    })

    df_long_cog = long_format_by_cog(cat2prot, ds_name)
    df_long_group = long_format_by_group(grp2prot, ds_name)
    all_long_cog.append(df_long_cog)
    all_long_group.append(df_long_group)

    out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
    with pd.ExcelWriter(out_xlsx) as writer:
        out_df.to_excel(writer, sheet_name="COG_letters", index=False)
        dbg.to_excel(writer, sheet_name="Debug", index=False)
        assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
        cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
        grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
        df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
        df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)

    print(f"Saved: {out_xlsx}")

# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())

df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)

combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
    df_counts.to_excel(writer, sheet_name="Counts")
    df_pct.to_excel(writer, sheet_name="Percent")
    pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
    # Combined long formats (includes protein IDs per letter/group across datasets)
    pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
    pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)

print(f"\nSaved combined summary: {combined_xlsx}")

# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()

y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
    ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)

ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()

plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")

# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []

# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []

for ds_name, path in INPUT_FILES.items():
    df = read_emapper_annotations(path)

    # (A) protein-based group membership and stats
    memb = protein_based_group_membership(df)
    n_proteins = memb.shape[0]
    counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
    pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
    group_counts[ds_name] = pd.Series(counts)
    group_pct[ds_name] = pd.Series(pct)

    for g in FUNCTIONAL_GROUPS.keys():
        ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
        for pid in ids:
            group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})

    # (B) add category (letter) protein IDs too
    cat2prot = build_category_to_proteins(df)
    all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))

df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)

# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
    sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
    for c in ALL_COG:
        ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
        cat_list_rows.append({
            "Dataset": ds,
            "COG": c,
            "Description": COG_DESCRIPTIONS.get(c, ""),
            "N_proteins": len(ids),
            "Protein_IDs": "; ".join(sorted(ids))
        })
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)

group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
    df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
    df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
    df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)

    # NEW: category protein IDs included here as well
    df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
    df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)

print(f"Saved functional-group workbook: {group_xlsx}")

# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()

x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
    ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)

ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()

plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)

print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")

print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))

End-to-end GO enrichment for non-model bacteria: MS → reference ID mapping with BLAST + Blast2GO (GUI) + R enrichment

Path: ~/DATA/Data_Michelle/MS_GO_enrichments_2026

This post summarizes a complete workflow to run GO enrichment for protein sets from mass spectrometry (MS) in a non-model organism, where:

  • MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
  • Standard R organism annotation packages (org.*.eg.db) are not available.
  • We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE.

The workflow produces:

  • 4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
  • 1 combined summary workbook across all datasets

1) Generate protein FASTA files

1.1 MS protein sequences (Proximity / ALFApulldown)

From MS results (protein ID lists), retrieve sequences and write FASTA:

python3 getProteinSequences_Proximity_4h.py      # writes Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # writes Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # writes ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # writes ALFApulldown_18h.fasta

1.2 Reference proteome sequences (for mapping + background)

Download the reference proteome FASTA from GenBank and standardize headers:

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

2) Optional functional annotation for merging MS + RNA-seq (EggNOG)

EggNOG gives KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and quick merging. (For enrichment, we will use Blast2GO GO terms later, which are typically more comprehensive for GO/EC.)

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env

download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60

# MS sets
emapper.py -i Proximity_4h.fasta     -o eggnog_out_Proximity_4h     --cpu 60
emapper.py -i Proximity_18h.fasta    -o eggnog_out_Proximity_18h    --cpu 60
emapper.py -i ALFApulldown_4h.fasta  -o eggnog_out_ALFApulldown_4h  --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60

3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)

Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.

3.1 Setup workspace

mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/

Launch Blast2GO:

~/Tools/Blast2GO/Blast2GO_Launcher

3.2 Step-by-step in Blast2GO GUI

STEP 1 — Load sequences (reference proteome)

  1. FileLoadLoad Sequences
  2. Choose: Load Fasta File (.fasta)
  3. Select: CP020463_protein.fasta
  4. Tags: (leave default / none)
  5. Check that the table is filled with columns like Nr, SeqName

✅ Output: sequences are loaded into the project


STEP 2 — BLAST (QBlast)

  1. Go to the BLAST panel
  2. Choose blastp (protein vs protein)
  3. Database: nr (NCBI)
  4. Set other parameters as needed (defaults typically OK)
  5. Run QBlast

⚠️ This step is typically the most time-consuming (hours to days depending on dataset size and NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit if desired.

✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)


STEP 3 — Mapping

  1. Click Mapping

✅ Output:

  • Tags updated to MAPPED
  • Columns appear such as #GO, GO IDs, GO Names

STEP 4 — Annotation

  1. Click Annotation
  2. Key parameters you may set/keep:

    • Annotation CutOff (controls reliability threshold)
    • GO Weight (boosts more general terms when supported by multiple specific hits)

✅ Output:

  • Tags updated to ANNOTATED
  • Enzyme-related columns may appear (e.g., Enzyme Codes)

STEP 5 — Export Annotations (before merging InterPro)

  1. FileExportExport Annotations
  2. Choose Export Annotations (.annot, custom, etc.)
  3. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot

✅ Output: blast2go_annot.annot


STEP 6 — InterProScan (optional but recommended for more GO terms)

  1. Click InterPro / InterProScan
  2. Run InterProScan (can be long: hours to >1 day depending on dataset & setup)

✅ Output:

  • Tags updated to INTERPRO
  • Additional columns: InterPro IDs, InterPro GO IDs/Names

STEP 7 — Merge InterProScan GOs into existing annotation

  1. In InterPro panel, choose: Merge InterProScan GOs to Annotation
  2. Confirm merge

✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)


STEP 8 — Export final annotations (after merging InterPro)

  1. FileExportExport Annotations
  2. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2

✅ This is the final file used for enrichment:

  • blast2go_annot.annot2

Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.
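
Before wiring blast2go_annot.annot2 into the enrichment, a quick sanity check of the export is worthwhile. A minimal sketch, assuming the file is tab-separated with the sequence ID in column 1 and a GO ID in column 2 (the same layout the R wrapper below reads):

# count reference proteins carrying >=1 GO term; this set becomes the
# enrichment universe in the R wrapper (section 5 / snippet B)
go_genes = set()
with open("blast2go_annot.annot2", encoding="utf-8") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].startswith("GO:"):
            go_genes.add(fields[0])
print("Reference proteins with >=1 GO term:", len(go_genes))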


4) Generate BLAST mapping tables: *_vs_ref.blast.tsv

We map each MS protein set to the reference proteome locus tags using BLASTP.

4.1 Create BLAST database from reference proteome

makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db

4.2 BLASTP each MS set against the reference DB

blastp -query Proximity_4h.fasta -db CP020463_ref_db \
  -out Proximity_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query Proximity_18h.fasta -db CP020463_ref_db \
  -out Proximity_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
  -out ALFApulldown_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
  -out ALFApulldown_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16
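
The wrapper in section 5 keeps a hit only if pident >= 70, query coverage >= 0.70, and e-value <= 1e-10; coverage is not an outfmt-6 column, so it is derived as length/qlen. A small stand-alone sketch of that filter on one of the tables above (thresholds mirror the R script):

import csv

COLS = ["qseqid","sseqid","pident","length","mismatch","gapopen",
        "qstart","qend","sstart","send","evalue","bitscore","qlen","slen"]

kept = 0
with open("Proximity_4h_vs_ref.blast.tsv", newline="") as fh:
    for row in csv.DictReader(fh, fieldnames=COLS, delimiter="\t"):
        qcov = float(row["length"]) / float(row["qlen"])  # alignment length / query length
        if float(row["pident"]) >= 70 and qcov >= 0.70 and float(row["evalue"]) <= 1e-10:
            kept += 1
print("rows passing filters:", kept)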

5) Run GO enrichment (4 sets + combined summary)

We run a single wrapper script that:

  • normalizes FASTA headers (handles UniParc/UniProt tr|... formats)
  • filters BLAST hits (pident/qcov/evalue thresholds)
  • picks one best hit per query
  • performs GO enrichment via clusterProfiler::enricher()
  • background (universe) = all reference proteins with ≥1 GO term in blast2go_annot.annot2
  • cutoff = BH/FDR adjusted p-value < 0.05
  • writes one Excel per dataset + one combined summary workbook
Run it with:

Rscript wrapper_besthit_GO_enrichment_4sets.R

Code snippets (used scripts)

A) Example script for Step 1: generate Proximity_4h FASTA

(Identical to the “Code snippet (generate Proximity_4h FASTA, used in Step 2)” shown in the previous post; not repeated here.)

B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(tibble)
  library(openxlsx)
  library(clusterProfiler)
  library(AnnotationDbi)
  library(GO.db)
})

# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2"  # or full path

datasets <- tibble::tribble(
  ~name,              ~blast_tsv,                          ~fasta,
  "Proximity_4h",     "Proximity_4h_vs_ref.blast.tsv",     "Proximity_4h.fasta",
  "Proximity_18h",    "Proximity_18h_vs_ref.blast.tsv",    "Proximity_18h.fasta",
  "ALFApulldown_4h",  "ALFApulldown_4h_vs_ref.blast.tsv",  "ALFApulldown_4h.fasta",
  "ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)

out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV   <- 0.70
MAX_EVALUE <- 1e-10

# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"

# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  parts <- stringr::str_split(x, "\\|", simplify = TRUE)
  # plain IDs (no "tr|ACC|NAME" pattern) -> take the first field;
  # this also avoids an out-of-bounds parts[, 2] when no ID contains "|"
  if (ncol(parts) < 3) return(parts[, 1])
  is_uniprot <- parts[, 1] %in% c("tr", "sp")
  ifelse(is_uniprot, parts[, 2], parts[, 1])
}

norm_subject_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  stringr::str_split_fixed(x, "\\|", 2)[, 1]
}

add_go_desc <- function(df) {
  if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
  df$GO_Term <- vapply(df$ID, function(go_id) {
    term <- tryCatch(
      AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
      error = function(e) NULL
    )
    if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
  }, FUN.VALUE = character(1))
  df
}

safe_read_blast <- function(path) {
  cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
            "qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
  readr::read_tsv(
    path,
    col_names = cols,
    col_types = readr::cols(
      qseqid   = readr::col_character(),
      sseqid   = readr::col_character(),
      pident   = readr::col_double(),
      length   = readr::col_double(),
      mismatch = readr::col_double(),
      gapopen  = readr::col_double(),
      qstart   = readr::col_double(),
      qend     = readr::col_double(),
      sstart   = readr::col_double(),
      send     = readr::col_double(),
      evalue   = readr::col_double(),
      bitscore = readr::col_double(),
      qlen     = readr::col_double(),
      slen     = readr::col_double()
    ),
    progress = FALSE
  )
}

# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
                              stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term   <- as.character(annot_df$Term)

term2gene_go_ref <- annot_df %>%
  dplyr::filter(grepl("^GO:", Term)) %>%
  dplyr::transmute(term = Term, gene = GeneID) %>%
  dplyr::distinct()

universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")

# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms  <- list()
all_besthits  <- list()

# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {

  ds <- datasets[i, ]
  name <- ds$name
  blast_path <- ds$blast_tsv
  fasta_path <- ds$fasta

  cat("\n=============================\n")
  cat("Dataset:", name, "\n")

  if (!file.exists(blast_path)) {
    warning("Missing BLAST TSV: ", blast_path, " (skipping)")
    next
  }
  if (!file.exists(fasta_path)) {
    warning("Missing FASTA: ", fasta_path, " (skipping)")
    next
  }

  # ---- FASTA query IDs ----
  fa <- readLines(fasta_path, warn = FALSE)
  q_all <- fa[startsWith(fa, ">")]
  q_all <- gsub("^>", "", q_all)
  q_all <- unique(norm_mixed_id(q_all))
  n_fasta <- length(q_all)

  # ---- BLAST ----
  bl <- safe_read_blast(blast_path) %>%
    dplyr::mutate(
      qid  = norm_mixed_id(qseqid),
      sid  = norm_subject_id(sseqid),
      qcov = length / qlen
    )

  n_queries_with_hits <- dplyr::n_distinct(bl$qid)
  max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0

  cat("FASTA queries:", n_fasta, "\n")
  cat("BLAST rows:", nrow(bl), "\n")
  cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
  cat("Max hits/query:", max_hits_per_query, "\n")

  bl_f <- bl %>%
    dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)

  cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")

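  # one best hit per query: highest bitscore, then lowest e-value, then highest qcov/pident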
  best <- bl_f %>%
    dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
    dplyr::group_by(qid) %>%
    dplyr::slice_head(n = 1) %>%
    dplyr::ungroup()

  best_out <- best %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::arrange(dplyr::desc(bitscore))

  best_all <- tibble::tibble(qid = q_all) %>%
    dplyr::left_join(best_out, by = "qid")

  unmapped <- best_all %>%
    dplyr::filter(is.na(sid) | sid == "") %>%
    dplyr::distinct(qid) %>%
    dplyr::pull(qid)

  mapped <- best_out %>%
    dplyr::filter(!is.na(sid), sid != "")

  cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
  cat("Unmapped queries:", length(unmapped), "\n")
  cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
  cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")

  gene_list_ref <- unique(mapped$sid)
  n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
  cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")

  # Save best-hit TSVs
  fn_best     <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
  fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
  readr::write_tsv(best_out, fn_best)
  readr::write_tsv(best_all, fn_best_all)

  # GO enrichment
  go_res <- tryCatch(
    clusterProfiler::enricher(
      gene = gene_list_ref,
      TERM2GENE = term2gene_go_ref,
      universe = universe_ref,
      pvalueCutoff = P_CUT,
      pAdjustMethod = PADJ_METHOD
    ),
    error = function(e) NULL
  )

  go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
  go_df <- add_go_desc(go_df)

  cat("Enriched GO terms:", nrow(go_df), "\n")

  per_protein_go <- mapped %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::distinct() %>%
    dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
    dplyr::rename(GO = term)

  # Per-dataset Excel
  out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
  wb <- openxlsx::createWorkbook()

  openxlsx::addWorksheet(wb, "BestHit_Mapped")
  openxlsx::writeData(wb, "BestHit_Mapped", mapped)

  openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
  openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))

  openxlsx::addWorksheet(wb, "PerProtein_GO")
  openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)

  openxlsx::addWorksheet(wb, "GO_Enrichment")
  openxlsx::writeData(wb, "GO_Enrichment", go_df)

  openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
  cat("Saved:", out_xlsx, "\n")

  # Collect combined summary
  summary_runs[[name]] <- tibble::tibble(
    dataset = name,
    fasta_queries = n_fasta,
    blast_rows = nrow(bl),
    queries_with_hits = n_queries_with_hits,
    max_hits_per_query = max_hits_per_query,
    filtered_rows = nrow(bl_f),
    filtered_queries = dplyr::n_distinct(bl_f$qid),
    mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
    unmapped_queries = length(unmapped),
    unique_targets_besthit = dplyr::n_distinct(mapped$sid),
    duplicated_targets_besthit = sum(duplicated(mapped$sid)),
    mapped_targets_with_GO = n_targets_with_go,
    enriched_GO_terms = nrow(go_df)
  )

  all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
  all_go_terms[[name]]  <- dplyr::mutate(per_protein_go, dataset = name)
  all_besthits[[name]]  <- dplyr::mutate(mapped, dataset = name)
}

# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)

go_counts <- combined_per_protein_go %>%
  dplyr::filter(!is.na(GO), GO != "") %>%
  dplyr::distinct(dataset, qid, GO) %>%
  dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
  dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))

combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()

openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)

openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)

openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)

openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)

openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)

openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)

cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")