Bacterial WGS workflow overview (nf-core/bacass → taxonomic ID → phylogeny)
1) Primary WGS analysis with nf-core/bacass
We processed short-read bacterial WGS data with nf-core/bacass v2.5.0 using Docker. The pipeline performs QC, assembly, and common taxonomic/typing steps (e.g., Kraken2); we ran it successfully with a local Kraken2 database.
A practical issue is the KmerFinder database download: the previously documented FTP “latest” path may be unavailable or slow, so it is best to pre-download the database and pass a local path, or to skip KmerFinder and run it manually via the CGE web service if needed.
2) KmerFinder database: reliable download options
Two practical database sources:
- Option A (recommended, easiest and fast): download the bundled DB from the CGE KmerFinder service site and extract it locally (in our case the download finished in ~40 min).
- Option B (fallback, stable but older): use the Zenodo stable archive (20190108).
Once the database is available locally, it can be provided to bacass with --kmerfinderdb (pointing to the extracted bacteria/ directory, or the tarball depending on the pipeline expectation).
3) Species placement & phylogeny
For broad placement, an ANI-based approach (fastANI distance tree) is quick and robust, but for publishable phylogeny within a close clade, a standard route is:
Prokka → Roary (pangenome; core-gene alignment) → (trim) → RAxML-NG.
Key point: Roary core-gene trees are best for closely related genomes (same species/close genus). Distant outgroups can collapse the core (few/no shared genes). If outgroups are too far, use close outgroups (e.g., Moraxella, Psychrobacter) or switch to conserved marker gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) for deep phylogeny.
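A quick way to catch a collapsing core before investing in a RAxML-NG run is to parse Roary's summary_statistics.txt. The sketch below assumes Roary's usual tab-separated category/definition/count layout; verify against your Roary version:

```python
import os, tempfile

def core_gene_count(path):
    """Return the 'Core genes' count from Roary's summary_statistics.txt, or None."""
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields and fields[0] == "Core genes":
                return int(fields[-1])
    return None

# demo on a synthetic file mimicking Roary's layout
demo = "Core genes\t(99% <= strains <= 100%)\t3887\nTotal genes\t(0% <= strains <= 100%)\t9000\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as t:
    t.write(demo)
n = core_gene_count(t.name)
os.remove(t.name)
if n is not None and n < 100:
    print(f"WARNING: only {n} core genes - taxon sampling may be too broad")
```

Run it on roary_out/summary_statistics.txt; a core of only a handful of genes is the signature of outgroups that are too distant.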
4) Comparative genome visualization
For pairwise genome comparison/plots, nf-core/pairgenomealign can generate whole-genome alignments and visual representations against a target genome.
Reproducible code snippets (complete commands)
A) nf-core/bacass runs
Run 1 (works):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /home/jhuang/REFs/minikraken_20171019_8GB.tgz
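The samplesheet referenced above is tab-separated. A minimal sketch of writing one for a short-read-only run follows; the ID/R1/R2/LongFastQ/Fast5/GenomeSize header is taken from the bacass 2.x documentation (double-check against your pipeline version), and the paths are placeholders:

```python
# Minimal bacass short-read samplesheet writer (sketch).
# Header columns assumed from the bacass 2.x docs; the long-read-only
# columns are filled with NA for a short-read run.
rows = [
    ["ID", "R1", "R2", "LongFastQ", "Fast5", "GenomeSize"],
    ["An6", "/path/to/An6_R1.fastq.gz", "/path/to/An6_R2.fastq.gz", "NA", "NA", "NA"],
]
with open("samplesheet.tsv", "w") as fh:
    for r in rows:
        fh.write("\t".join(r) + "\n")
```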
Run 2 (skip KmerFinder, resume):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /home/jhuang/REFs/k2_standard_08_GB_20251015.tar.gz \
--skip_kmerfinder \
-resume
Then manually run KmerFinder via web service: https://cge.food.dtu.dk/services/KmerFinder/
Run 3 (use local KmerFinder DB, resume):
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
--input samplesheet.tsv \
--outdir bacass_out \
--assembly_type short \
--kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
-resume
B) KmerFinder DB download
Option A (CGE bundle; fast):
mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz'
tar -xvf kmerfinder_db.tar.gz
# resulting path typically contains: bacteria/
Option B (Zenodo stable archive):
mkdir -p kmerfinder_db && cd kmerfinder_db
wget -c 'https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz'
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz
C) Download close references (Acinetobacter + close outgroups) + fastANI matrix (downloader config)
(Already implemented in 1_ani_tree_prep.sh, listed in full below; kept here as a reproducible config reference.)
# Example config snippet used in the downloader:
MAX_ACINETOBACTER=20
MAX_PER_OUTGROUP=4
ASSEMBLY_LEVELS="complete,chromosome"
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
MINFRACTION="0.05"
D) Prokka → Roary core MSA → RAxML-NG (phylogenomic tree)
0) Collect genomes (isolate + downloaded refs):
mkdir -p genomes
cp bacass_out/Prokka/An6/An6.fna genomes/An6.fna
bash 1_ani_tree_prep.sh
cp ani_tree_run/ref_fasta/*.genomic.fna genomes/
1) Prokka annotations (produces GFFs for Roary):
mkdir -p prokka gffs
for f in genomes/*.fna; do  # *.fna already matches *.genomic.fna, so a single glob avoids processing files twice
[ -f "$f" ] || continue
bn=$(basename "$f")
sample="${bn%.*}"
sample="${sample%.genomic}"
sample="${sample//./_}"
prokka --outdir "prokka/$sample" \
--prefix "$sample" \
--cpus 8 --force \
"$f"
done
cp prokka/*/*.gff gffs/
2) Roary core-genome alignment (MAFFT):
mkdir -p roary_out
roary -f roary_out \
-p 16 \
-e -n \
-i 95 \
-cd 95 \
gffs/*.gff
# output:
ls -lh roary_out/core_gene_alignment.aln
3) Trim the core alignment (optional but recommended):
trimal -in roary_out/core_gene_alignment.aln \
-out roary_out/core_gene_alignment.trim.aln \
-automated1
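To see how much trimal removed, compare alignment widths before and after trimming. A sketch that reads only the first record of each FASTA alignment:

```python
import os, tempfile

def aln_width(path):
    """Length of the first sequence in a FASTA alignment (alignment width)."""
    seq, in_first = [], False
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if in_first:
                    break
                in_first = True
            elif in_first:
                seq.append(line.strip())
    return len("".join(seq))

# demo on a synthetic 2-taxon alignment (9 columns)
with tempfile.NamedTemporaryFile("w", suffix=".aln", delete=False) as t:
    t.write(">A\nACGT-ACGT\n>B\nACGTTACGT\n")
width = aln_width(t.name)
os.remove(t.name)
print(width)
```

Call it on core_gene_alignment.aln and core_gene_alignment.trim.aln; a large drop usually means many gappy or poorly aligned columns were culled.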
4) RAxML-NG ML tree + bootstrap supports:
mkdir -p tree
raxml-ng --all \
--msa roary_out/core_gene_alignment.trim.aln \
--model GTR+G \
--bs-trees autoMRE{500} \
--threads 16 \
--seed 12345 \
--prefix tree/roary_core
# main result:
ls -lh tree/roary_core.raxml.support
E) Optional visualization in R (ape)
library(ape)
tr <- read.tree("tree/roary_core.raxml.support")
pdf("tree/roary_core_raxmlng.pdf", width=12, height=10)
plot(tr, cex=0.5, no.margin=TRUE)
title("Roary core genome + RAxML-NG (GTR+G)")
dev.off()
F) Pairwise genome alignment figure (nf-core/pairgenomealign)
Example pattern (edit to your inputs/refs):
nextflow run nf-core/pairgenomealign -r 2.2.1 -profile docker \
--input samplesheet_pairgenomealign.tsv \
--outdir pairgenomealign_out
Practical notes
- If the KmerFinder FTP “latest” path is unavailable, use the CGE bundle or the Zenodo stable DB and point bacass to the locally extracted bacteria/ directory.
- For phylogeny:
- Roary/RAxML-NG is best for close genomes; use close outgroups or none and root later.
- For distant outgroups, use marker-gene pipelines (GTDB-Tk/PhyloPhlAn/UBCG) instead of pangenome core.
Used scripts (complete code)
A) 1_ani_tree_prep.sh
#!/usr/bin/env bash
set -euo pipefail
# ===================== USER CONFIG =====================
QUERY_FASTA="bacass_out/Prokka/An6/An6.fna"
SEED_ACCESSIONS_FILE="accessions.txt"
OUTDIR="ani_tree_run"
THREADS=8
MINFRACTION="0.05"
FRAGLEN="3000"
MAX_ACINETOBACTER=20
SUMMARY_LIMIT_ACI=400
MAX_PER_OUTGROUP=4
SUMMARY_LIMIT_OUT=120
#,scaffold
ASSEMBLY_LEVELS="complete,chromosome"
#Drop from Roary: "Pseudomonas" "Escherichia")
OUTGROUP_TAXA=("Moraxella" "Psychrobacter")
DO_MATRIX="yes"
SUMMARY_TIMEOUT=600
# =======================================================
ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
warn() { echo "[$(ts)] WARN: $*" >&2; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }
need() { command -v "$1" >/dev/null 2>&1 || die "Missing command: $1"; }
sanitize() { echo "$1" | tr ' /;:,()[]{}' '_' | tr -cd 'A-Za-z0-9._-'; }
need datasets
need fastANI
need awk
need sort
need unzip
need find
need grep
need wc
need head
need readlink
need sed
need timeout
need python3
[[ -f "$QUERY_FASTA" ]] || die "QUERY_FASTA not found: $QUERY_FASTA"
mkdir -p "$OUTDIR"/{ref_fasta,zips,tmp,logs,accessions}
DL_LOG="$OUTDIR/logs/download.log"
SUM_LOG="$OUTDIR/logs/summary.log"
ANI_LOG="$OUTDIR/logs/fastani.log"
: > "$DL_LOG"; : > "$SUM_LOG"; : > "$ANI_LOG"
QUERY_LIST="$OUTDIR/query_list.txt"
echo "$(readlink -f "$QUERY_FASTA")" > "$QUERY_LIST"
log "QUERY: $(cat "$QUERY_LIST")"
ACC_DIR="$OUTDIR/accessions"
ACC_SEED="$ACC_DIR/seed.acc.txt"
ACC_ACI="$ACC_DIR/acinetobacter.acc.txt"
ACC_OUT="$ACC_DIR/outgroups.acc.txt"
ACC_ALL="$ACC_DIR/all_refs.acc.txt"
REF_DIR="$OUTDIR/ref_fasta"
REF_LIST="$OUTDIR/ref_list.txt"
# ---- seed accessions ----
if [[ -f "$SEED_ACCESSIONS_FILE" ]]; then
grep -v '^\s*$' "$SEED_ACCESSIONS_FILE" | grep -v '^\s*#' | sort -u > "$ACC_SEED" || true
else
: > "$ACC_SEED"
fi
log "Seed accessions: $(wc -l < "$ACC_SEED")"
# ---- extract accessions from a jsonl FILE (FIXED) ----
extract_accessions_from_jsonl_file() {
  local jsonl="$1"
  python3 - "$jsonl" <<'PY'
import sys, json
path = sys.argv[1]
out = []
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except Exception:
            continue
        # NCBI datasets summary often returns one report per line with top-level "accession"
        acc = obj.get("accession")
        if acc:
            out.append(acc)
        # some outputs can also contain "reports" lists
        reps = obj.get("reports") or []
        for r in reps:
            acc = r.get("accession")
            if acc:
                out.append(acc)
for a in out:
    print(a)
PY
}
fetch_accessions_by_taxon() {
local taxon="$1"
local want_n="$2"
local limit_n="$3"
local outfile="$4"
local tag
tag="$(sanitize "$taxon")"
local jsonl="$OUTDIR/tmp/${tag}.jsonl"
: > "$outfile"
log "STEP: Fetch accessions | taxon='$taxon' want=$want_n limit=$limit_n assembly=$ASSEMBLY_LEVELS timeout=${SUMMARY_TIMEOUT}s"
if ! timeout "$SUMMARY_TIMEOUT" datasets summary genome taxon "$taxon" \
--assembly-level "$ASSEMBLY_LEVELS" \
--annotated \
--exclude-atypical \
--exclude-multi-isolate \
--limit "$limit_n" \
--as-json-lines > "$jsonl" 2>>"$SUM_LOG"; then
warn "datasets summary failed for taxon=$taxon"
tail -n 80 "$SUM_LOG" >&2 || true
return 1
fi
[[ -s "$jsonl" ]] || { warn "Empty jsonl for taxon=$taxon"; return 0; }
log " OK: summary jsonl lines: $(wc -l < "$jsonl")"
# FIXED extraction
extract_accessions_from_jsonl_file "$jsonl" | sort -u | head -n "$want_n" > "$outfile"
log " OK: got $(wc -l < "$outfile") accessions for $taxon"
head -n 5 "$outfile" | sed 's/^/ sample: /' || true
return 0
}
# ---- fetch Acinetobacter + outgroups ----
fetch_accessions_by_taxon "Acinetobacter" "$MAX_ACINETOBACTER" "$SUMMARY_LIMIT_ACI" "$ACC_ACI" \
|| die "Failed to fetch Acinetobacter accessions (see $SUM_LOG)"
: > "$ACC_OUT"
for tax in "${OUTGROUP_TAXA[@]}"; do
tmp="$OUTDIR/tmp/$(sanitize "$tax").acc.txt"
fetch_accessions_by_taxon "$tax" "$MAX_PER_OUTGROUP" "$SUMMARY_LIMIT_OUT" "$tmp" \
|| die "Failed to fetch outgroup taxon=$tax (see $SUM_LOG)"
cat "$tmp" >> "$ACC_OUT"
done
sort -u "$ACC_OUT" -o "$ACC_OUT"
log "Outgroup accessions total: $(wc -l < "$ACC_OUT")"
# ---- merge ----
cat "$ACC_SEED" "$ACC_ACI" "$ACC_OUT" 2>/dev/null \
| grep -v '^\s*$' | grep -v '^\s*#' | sort -u > "$ACC_ALL"
log "Total unique reference accessions: $(wc -l < "$ACC_ALL")"
[[ -s "$ACC_ALL" ]] || die "No reference accessions collected."
# ---- download genomes (batch) ----
ZIP="$OUTDIR/zips/ncbi_refs.zip"
UNZ="$OUTDIR/tmp/ncbi_refs_unzipped"
rm -f "$ZIP"; rm -rf "$UNZ"; mkdir -p "$UNZ"
log "STEP: Download genomes via NCBI datasets (batch) -> $ZIP"
datasets download genome accession --inputfile "$ACC_ALL" --include genome --filename "$ZIP" >>"$DL_LOG" 2>&1 \
|| { tail -n 120 "$DL_LOG" >&2; die "datasets download failed (see $DL_LOG)"; }
log "STEP: Unzip genomes"
unzip -q "$ZIP" -d "$UNZ" >>"$DL_LOG" 2>&1 || { tail -n 120 "$DL_LOG" >&2; die "unzip failed (see $DL_LOG)"; }
# ---- collect FASTA ----
log "STEP: Collect *_genomic.fna -> $REF_DIR"
: > "$REF_LIST"
n=0
while IFS= read -r f; do
[[ -f "$f" ]] || continue
acc="$(echo "$f" | grep -oE 'GC[AF]_[0-9]+\.[0-9]+' | head -n 1 || true)"   # GCA_/GCF_ accessions
[[ -z "${acc:-}" ]] && acc="REF$(printf '%04d' $((n+1)))"
out="$REF_DIR/${acc}.genomic.fna"
cp -f "$f" "$out"
[[ -s "$out" ]] || continue
echo "$(readlink -f "$out")" >> "$REF_LIST"
n=$((n+1))
done < <(find "$UNZ" -type f -name "*_genomic.fna")
log " OK: Reference FASTA collected: $n"
[[ "$n" -gt 0 ]] || die "No genomic.fna found after download."
# ---- fastANI query vs refs ----
RAW_QVSR="$OUTDIR/fastani_query_vs_refs.raw.tsv"
TSV_QVSR="$OUTDIR/fastani_query_vs_refs.tsv"
log "STEP: fastANI query vs refs"
log " fastANI --ql $QUERY_LIST --rl $REF_LIST -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_QVSR"
fastANI \
--ql "$QUERY_LIST" \
--rl "$REF_LIST" \
-t "$THREADS" \
--minFraction "$MINFRACTION" \
--fragLen "$FRAGLEN" \
-o "$RAW_QVSR" >>"$ANI_LOG" 2>&1 \
|| { tail -n 160 "$ANI_LOG" >&2; die "fastANI failed (see $ANI_LOG)"; }
[[ -s "$RAW_QVSR" ]] || die "fastANI output empty: $RAW_QVSR"
echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$TSV_QVSR"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_QVSR" >> "$TSV_QVSR"
log "Top hits (ANI desc):"
tail -n +2 "$TSV_QVSR" | sort -k3,3nr | head -n 15 | sed 's/^/ /'
# ---- fastANI matrix for all genomes ----
if [[ "$DO_MATRIX" == "yes" ]]; then
ALL_LIST="$OUTDIR/all_genomes_list.txt"
cat "$QUERY_LIST" "$REF_LIST" > "$ALL_LIST"
RAW_ALL="$OUTDIR/fastani_all.raw.tsv"
log "STEP: fastANI all-vs-all matrix (for tree)"
log " fastANI --ql $ALL_LIST --rl $ALL_LIST --matrix -t $THREADS --minFraction $MINFRACTION --fragLen $FRAGLEN -o $RAW_ALL"
fastANI \
--ql "$ALL_LIST" \
--rl "$ALL_LIST" \
--matrix \
-t "$THREADS" \
--minFraction "$MINFRACTION" \
--fragLen "$FRAGLEN" \
-o "$RAW_ALL" >>"$ANI_LOG" 2>&1 \
|| { tail -n 160 "$ANI_LOG" >&2; die "fastANI matrix failed (see $ANI_LOG)"; }
[[ -f "${RAW_ALL}.matrix" ]] || die "Matrix not produced: ${RAW_ALL}.matrix"
log " OK: Matrix file: ${RAW_ALL}.matrix"
fi
log "DONE."
log "Accessions counts:"
log " - seed: $(wc -l < "$ACC_SEED")"
log " - acinetobacter: $(wc -l < "$ACC_ACI")"
log " - outgroups: $(wc -l < "$ACC_OUT")"
log " - all refs: $(wc -l < "$ACC_ALL")"
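The .matrix file written by fastANI --matrix is a PHYLIP-style lower triangle: first line is the genome count, then one row per genome with its name followed by ANI values to all previous genomes ("NA" where no estimate passed --minFraction). A small helper (a sketch, not part of the script above; the format assumption should be checked against your fastANI version) turns it into a symmetric distance table (distance = 100 - ANI) for tree building:

```python
# Sketch: parse a fastANI --matrix lower-triangle file into a symmetric
# distance matrix (distance = 100 - ANI; "NA" treated as maximal distance).
def read_fastani_matrix(lines):
    n = int(lines[0].split()[0])
    names, dist = [], [[0.0] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.rstrip("\n").split("\t")
        names.append(fields[0])
        for j, v in enumerate(fields[1:i + 1]):
            d = 100.0 if v == "NA" else 100.0 - float(v)
            dist[i][j] = dist[j][i] = d
    return names, dist

# demo on a synthetic 3-genome matrix
demo = ["3", "g1", "g2\t97.5", "g3\t80.0\tNA"]
names, dist = read_fastani_matrix(demo)
print(names, dist[0][1])
```

The resulting matrix can be fed to any distance-based tree method (e.g., neighbor joining) for the quick ANI tree mentioned earlier.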
A Child-Friendly Explanation of Chess Notation
(suitable for U8 / U10 / U12 – tournament standard)
1️⃣ The basic idea of notation
In a tournament, every move played must be written down so that:
- the game can be analyzed later
- arbiters can settle disputes
- the game can be rated for DWZ / Elo
👉 Notation simply means:
which piece moves to which square.
2️⃣ Letters for the pieces
| Piece | Letter | Explanation |
|---|---|---|
| King | K | King |
| Queen | Q | Queen |
| Rook | R | Rook |
| Bishop | B | Bishop |
| Knight | N | Knight (K is already taken by the king) |
| Pawn | no letter | very important |
⚠️ Pawns have no letter!
3️⃣ The squares of the board
Every square has a name:
- a–h (files, from left to right)
- 1–8 (ranks, from bottom to top from White's point of view)
Examples:
- e4
- f3
- d5
4️⃣ Writing normal moves
Pawn moves
- Pawn moves to e4 → e4
- Pawn moves to d5 → d5
👉 Write only the destination square.
Piece moves
Piece + destination square
Examples:
- Knight to f3 → Nf3
- Bishop to c4 → Bc4
- Queen to d1 → Qd1
5️⃣ Captures (very important)
Pieces capture
Piece + x + destination square
Examples:
- Knight captures on e5 → Nxe5
- Bishop captures on f7 → Bxf7
❌ Never write down which piece was captured!
Pawns capture (special rule)
For a pawn you must give the file it comes from:
- Pawn from d4 captures on e5 → dxe5
- Pawn from f3 captures on e4 → fxe4
6️⃣ Check, checkmate, castling
- + = check
- # = checkmate
- O-O = kingside (short) castling
- O-O-O = queenside (long) castling
Examples:
- Qh5+
- Qh7#
7️⃣ Two identical pieces can move to the same square
(This matters a lot in tournaments!)
Rule:
If two identical pieces could move to the same square, you must indicate which one is meant.
Example with knights
Both knights can go to d2:
- knight from b1 → Nbd2
- knight from f1 → Nfd2
Example with rooks
Both rooks can go to e1:
- rook from rank 2 → R2e1
- rook from rank 8 → R8e1
👉 Add the file letter or rank number until it is clear which piece moves.
8️⃣ Example game (about 10 moves)
1. e4 e5
2. Nf3 Nc6
3. Bc4 Bc5
4. d4 exd4
5. O-O Nf6
6. e5 Ne4
7. Re1 d5
8. Bxd5 Qxd5
9. Nc3 Nxc3
10. bxc3
This game contains:
- pawn moves
- captures
- pawn captures
- castling
- moves by several pieces
9️⃣ The correct order in a tournament
Always like this:
- make the move
- press the clock
- write the move down
❌ Don't write the move down in advance ❌ Don't forget to write it down
🔟 The 5 most common mistakes children make
- ❌ giving the pawn a letter (Pe4) ✅ e4
- ❌ writing down which piece was captured ✅ only the destination square
- ❌ forgetting the “x” for captures ✅ Nxe5
- ❌ writing first, then moving ✅ move first
- ❌ panicking in time trouble ✅ better to write slowly and neatly
✅ A mnemonic for children
**Which piece (pawn without a letter), where it moves, x for captures, + for check.**
Data Processing Pipeline: RNA-seq × Exoproteome × MS (Proximity/ALFA)
Path: ~/DATA/Data_Michelle/MS_RNAseq_Venn_2026
This post documents the complete, reproducible pipeline used to generate a 3-circle Venn diagram integrating RNA-seq, exoproteome, and MS SIP (Proximity + ALFApulldown) datasets, harmonized at the locus-tag level.
The pipeline is intentionally verbose: all transformations, mappings, and decisions are logged so the analysis can be reconstructed later.
Overview of the Three Sets
Circle 1 — RNA-seq
Union of all significantly regulated genes across six comparisons
Criteria: padj < 0.05 and |log2FC| > 2
Circle 2 — Exoproteome
Union of six exoproteome conditions (MH/TSB × 1h, 4h, 18h). Protein IDs (UniProt / UniParc) are mapped to locus tags via BLAST. Direction is defined without a cutoff:
- log2FC > 0 → UP
- log2FC < 0 → DOWN
Circle 3 — MS SIP (ProALFA)
Union of four datasets:
- Proximity 4h / 18h
- ALFApulldown 4h / 18h
Protein IDs are mapped to locus tags via BLAST.
Outputs
The pipeline produces:
- 🟢 3-circle Venn diagram (PNG + PDF)
- 📊 Detailed Excel workbook
  - all Venn regions
  - RNA / Exo / MS membership per locus tag
  - regulation directions and source datasets
- 🧾 Pipeline log file
  - counts per dataset
  - mapping success
  - duplicated IDs
  - intersection sizes
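The seven Venn regions reported in the workbook reduce to plain set algebra on locus-tag sets. A toy sketch with hypothetical locus tags (the real work is done by the R wrapper in Step D):

```python
# Seven Venn regions from three locus-tag sets (toy example; the locus
# tags here are hypothetical placeholders).
rna = {"L001", "L002", "L003"}
exo = {"L002", "L003", "L004"}
ms  = {"L003", "L004", "L005"}

regions = {
    "RNA_only": rna - exo - ms,
    "Exo_only": exo - rna - ms,
    "MS_only":  ms - rna - exo,
    "RNA_Exo":  (rna & exo) - ms,
    "RNA_MS":   (rna & ms) - exo,
    "Exo_MS":   (exo & ms) - rna,
    "All":      rna & exo & ms,
}
for name, members in regions.items():
    print(name, sorted(members))
```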
Step A — Fetch exoproteome FASTA from UniProt
Exoproteome Excel sheets contain UniProt accessions, while RNA-seq uses locus tags. To enable mapping, protein sequences are downloaded from UniProt and UniParc.
Script
01_fetch_uniprot_fasta_from_exoproteome_excel.py
What it does
- Reads exoproteome sheets from Excel
- Extracts UniProt / UniParc IDs
- Downloads FASTA via UniProt REST (with caching + fallback to UniParc)
- Writes:
  - one FASTA per sheet
  - mapping tables
  - UP / DOWN ID lists
  - run log
Bash: fetch exoproteome FASTAs
python3 01_fetch_uniprot_fasta_from_exoproteome_excel.py \
--excel "Zusammenfassung SIP und Exoproteome.xlsx" \
--outdir ./exoproteome_fastas \
--cache ./uniprot_cache
Step B — Normalize FASTA headers (Exoproteome + MS)
FASTA headers from UniProt and MS are heterogeneous and may contain duplicates. All FASTA files are normalized to canonical, unique protein IDs.
Script
02_normalize_fasta_headers.py
What it does
- Extracts canonical IDs (sp|…|, tr|…|, raw headers)
- Enforces uniqueness (__dupN suffix if needed)
- Writes an ID mapping table for traceability
Bash: normalize exoproteome FASTAs
mkdir -p normalized_fastas
for f in \
Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta
do
python3 02_normalize_fasta_headers.py \
exoproteome_fastas/"$f" \
"$f" \
"$f.idmap.tsv"
done
Bash: normalize MS (Proximity / ALFA) FASTAs
for f in \
Proximity_4h.fasta Proximity_18h.fasta \
ALFApulldown_4h.fasta ALFApulldown_18h.fasta
do
python3 02_normalize_fasta_headers.py \
../MS_GO_enrichments/"$f" \
"$f" \
"$f.idmap.tsv"
done
Step C — BLAST mapping to reference proteome
All exoproteome and MS proteins are mapped to locus tags using BLASTP against the reference proteome.
Bash: merge exoproteome FASTAs
cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
> Exoproteome_ALL.fasta
Bash: build BLAST database (once)
makeblastdb -in CP020463_protein.fasta \
-dbtype prot \
-out CP020463_ref_db
Bash: BLAST exoproteome
blastp -query Exoproteome_ALL.fasta \
-db CP020463_ref_db \
-out Exoproteome_ALL_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 \
-max_target_seqs 5 \
-num_threads 16
Bash: select best hit per query (exoproteome)
awk -F'\t' 'BEGIN{OFS="\t"}{
key=$1; bits=$12+0; e=$11+0
if(!(key in best) || bits>best_bits[key] || (bits==best_bits[key] && e<best_e[key])){
best[key]=$0; best_bits[key]=bits; best_e[key]=e
}
} END{for(k in best) print best[k]}' \
Exoproteome_ALL_vs_ref.blast.tsv \
| sort -t$'\t' -k1,1 \
> Exoproteome_ALL_vs_ref.besthit.tsv
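Bitscore-best hits can still be short local matches. Since the outfmt string above includes qlen (column 13), a query-coverage filter is easy to bolt on; the sketch below uses example thresholds (tune min_pident / min_qcov to your data, and the row shown is a hypothetical best hit):

```python
# Filter a BLAST outfmt-6 line (with the extra qlen/slen columns requested
# above) by percent identity and query coverage. Thresholds are examples.
def passes(line, min_pident=80.0, min_qcov=0.8):
    f = line.rstrip("\n").split("\t")
    pident, length, qlen = float(f[2]), int(f[3]), int(f[12])
    return pident >= min_pident and (length / qlen) >= min_qcov

# hypothetical best-hit row: 250-aa alignment on a 260-aa query at 99.2% id
row = "q1\tlocus_001\t99.2\t250\t2\t0\t1\t250\t1\t250\t1e-120\t480\t260\t255"
print(passes(row))
```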
Bash: BLAST MS datasets
for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
blastp -query "${base}.fasta" \
-db CP020463_ref_db \
-out "${base}_vs_ref.blast.tsv" \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 \
-max_target_seqs 5 \
-num_threads 16
done
Bash: best-hit selection (MS)
for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
awk -F'\t' 'BEGIN{OFS="\t"}{
q=$1; b=$12+0; e=$11+0
if(!(q in best) || b>bb[q] || (b==bb[q] && e<be[q])){
best[q]=$0; bb[q]=b; be[q]=e
}
} END{for(q in best) print best[q]}' \
"${base}_vs_ref.blast.tsv" \
| sort -t$'\t' -k1,1 \
> "${base}_vs_ref.besthit.tsv"
done
Step D — R wrapper: integration, Venn, Excel, LOG
Script
03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R
What it does
- Reads RNA-seq Excel files (6 comparisons)
- Reads exoproteome and MS sheets
- Applies BLAST best-hit mappings
- Builds union sets and intersections
- Generates:
  - Venn plot
  - detailed Excel workbook
  - comprehensive log file
Bash: run the final integration
Rscript 03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R
Final Note
Keeping Python, Bash, and R steps together is intentional: this pipeline is meant to be auditable, reproducible, and explainable, even months later when details are fuzzy.
Scripts Used
01_fetch_uniprot_fasta_from_exoproteome_excel.py
#!/usr/bin/env python3
"""
01_fetch_uniprot_fasta_from_exoproteome_excel.py (v3)
MS log2FC direction rule (no cutoff):
Up : log2FC > 0
Down : log2FC < 0
Zero : log2FC == 0
NA : missing/non-numeric log2FC
Reads per sheet:
- protein IDs from column A (index 0)
- log2FC from column D (index 3)
Outputs per sheet:
- {sheet}.fasta
- {sheet}.mapping.tsv
- {sheet}.up.ids.txt / {sheet}.down.ids.txt
- run.log
UniParc fallback method:
1) Try UniProtKB FASTA: /uniprotkb/{ACC}.fasta
2) If missing: UniParc search -> UPI
3) Fetch UniParc FASTA: /uniparc/{UPI}.fasta
"""
import argparse
import os
import re
import time
from typing import Optional, Tuple
import pandas as pd
import requests
DEFAULT_SHEETS = [
"Exoproteome MH 1h",
"Exoproteome MH 4h",
"Exoproteome MH 18h",
"Exoproteome TSB 1h",
"Exoproteome TSB 4h",
"Exoproteome TSB 18h",
]
UNIPROTKB_FASTA = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"
UNIPARC_FASTA = "https://rest.uniprot.org/uniparc/{upi}.fasta"
UNIPARC_SEARCH = "https://rest.uniprot.org/uniparc/search"
def sanitize_name(s: str) -> str:
    s = s.strip()
    s = re.sub(r"[^\w\-\.]+", "_", s)
    s = re.sub(r"_+", "_", s)
    return s

def parse_id_cell(raw: str) -> Optional[str]:
    if raw is None:
        return None
    s = str(raw).strip()
    if not s or s.lower() in {"nan", "na"}:
        return None
    s = s.split()[0].strip()
    # header like tr|ACC|NAME
    if "|" in s:
        parts = s.split("|")
        if len(parts) >= 2 and parts[0] in {"tr", "sp"}:
            s = parts[1]
        else:
            s = parts[0]
    if re.fullmatch(r"UPI[0-9A-F]{10,}", s):
        return s
    if re.fullmatch(r"[A-Z0-9]{6,10}", s):
        return s
    return None

def to_float(x) -> Optional[float]:
    if x is None:
        return None
    try:
        s = str(x).strip()
        if s == "" or s.lower() in {"nan", "na"}:
            return None
        s = s.replace(",", ".")
        return float(s)
    except Exception:
        return None

def classify_direction_no_cutoff(log2fc: Optional[float]) -> str:
    if log2fc is None:
        return "NA"
    if log2fc > 0:
        return "Up"
    if log2fc < 0:
        return "Down"
    return "Zero"

def extract_ids_and_fc_from_sheet(excel_path: str, sheet: str) -> pd.DataFrame:
    df = pd.read_excel(excel_path, sheet_name=sheet, header=None)
    if df.shape[1] < 4:
        raise ValueError(f"Sheet '{sheet}' has <4 columns, cannot read column D (log2FC).")
    out_rows = []
    for _, row in df.iterrows():
        pid = parse_id_cell(row.iloc[0])  # col A
        if not pid:
            continue
        log2fc = to_float(row.iloc[3])  # col D
        out_rows.append({"protein_id": pid, "log2fc": log2fc})
    out = pd.DataFrame(out_rows)
    if out.empty:
        return out
    # de-duplicate by protein_id (keep first non-NA log2fc if possible)
    out["log2fc_isna"] = out["log2fc"].isna()
    out = out.sort_values(["protein_id", "log2fc_isna"]).drop_duplicates("protein_id", keep="first")
    out = out.drop(columns=["log2fc_isna"]).reset_index(drop=True)
    return out
def http_get_text(session: requests.Session, url: str, params: Optional[dict] = None,
                  retries: int = 4, backoff: float = 1.5) -> Optional[str]:
    for attempt in range(retries):
        try:
            r = session.get(url, params=params, timeout=60)
        except requests.RequestException:
            # network hiccup: back off and retry
            time.sleep(backoff * (attempt + 1))
            continue
        if r.status_code == 200:
            return r.text
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(backoff * (attempt + 1))
            continue
        return None
    return None
def fetch_uniprotkb_fasta(session: requests.Session, acc: str) -> Optional[str]:
    return http_get_text(session, UNIPROTKB_FASTA.format(acc=acc))

def resolve_upi_via_uniparc_search(session: requests.Session, acc: str) -> Optional[str]:
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(session, UNIPARC_SEARCH, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    if re.fullmatch(r"UPI[0-9A-F]{10,}", upi):
        return upi
    return None

def fetch_uniparc_fasta(session: requests.Session, upi: str) -> Optional[str]:
    return http_get_text(session, UNIPARC_FASTA.format(upi=upi))

def cache_get(cache_dir: str, key: str) -> Optional[str]:
    if not cache_dir:
        return None
    path = os.path.join(cache_dir, f"{key}.fasta")
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    return None

def cache_put(cache_dir: str, key: str, fasta: str) -> None:
    if not cache_dir:
        return
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.fasta")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fasta)

def rewrite_fasta_header(fasta: str, new_header: str) -> str:
    lines = fasta.splitlines()
    if not lines or not lines[0].startswith(">"):
        return fasta
    lines[0] = ">" + new_header
    return "\n".join(lines) + ("\n" if not fasta.endswith("\n") else "")

def fetch_best_fasta(session: requests.Session, pid: str, cache_dir: str = "") -> Tuple[Optional[str], str, Optional[str]]:
    """
    Returns (fasta_text, status, resolved_upi)
    status: uniprotkb | uniparc | uniparc_direct | not_found | *_cache
    """
    if pid.startswith("UPI"):
        cached = cache_get(cache_dir, pid)
        if cached:
            return cached, "uniparc_direct_cache", pid
        fasta = fetch_uniparc_fasta(session, pid)
        if fasta:
            cache_put(cache_dir, pid, fasta)
            return fasta, "uniparc_direct", pid
        return None, "not_found", pid
    cached = cache_get(cache_dir, pid)
    if cached:
        return cached, "uniprotkb_cache", None
    fasta = fetch_uniprotkb_fasta(session, pid)
    if fasta:
        cache_put(cache_dir, pid, fasta)
        return fasta, "uniprotkb", None
    upi = resolve_upi_via_uniparc_search(session, pid)
    if not upi:
        return None, "not_found", None
    cached2 = cache_get(cache_dir, upi)
    if cached2:
        return cached2, "uniparc_cache", upi
    fasta2 = fetch_uniparc_fasta(session, upi)
    if fasta2:
        cache_put(cache_dir, upi, fasta2)
        return fasta2, "uniparc", upi
    return None, "not_found", upi
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--excel", required=True, help="Input Excel file")
    ap.add_argument("--outdir", default="./exoproteome_fastas", help="Output folder")
    ap.add_argument("--cache", default="./uniprot_cache", help="Cache folder for downloaded FASTAs")
    ap.add_argument("--sleep", type=float, default=0.1, help="Sleep seconds between requests")
    ap.add_argument("--sheets", nargs="*", default=DEFAULT_SHEETS, help="Sheet names to process")
    args = ap.parse_args()
    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(args.cache, exist_ok=True)
    log_path = os.path.join(args.outdir, "run.log")

    def log(msg: str):
        print(msg)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")

    with open(log_path, "w", encoding="utf-8") as f:
        f.write("")
    xls = pd.ExcelFile(args.excel)
    missing = [s for s in args.sheets if s not in xls.sheet_names]
    if missing:
        log("WARNING: Missing sheets in workbook:")
        for s in missing:
            log(f" - {s}")
        log("Continuing with available sheets only.\n")
    session = requests.Session()
    session.headers.update({"User-Agent": "exoproteome-fasta-fetch/3.0"})
    for sheet in args.sheets:
        if sheet not in xls.sheet_names:
            continue
        sheet_tag = sanitize_name(sheet)
        out_fasta = os.path.join(args.outdir, f"{sheet_tag}.fasta")
        out_map = os.path.join(args.outdir, f"{sheet_tag}.mapping.tsv")
        out_up = os.path.join(args.outdir, f"{sheet_tag}.up.ids.txt")
        out_down = os.path.join(args.outdir, f"{sheet_tag}.down.ids.txt")
        tbl = extract_ids_and_fc_from_sheet(args.excel, sheet)
        log(f"\n=== Sheet: {sheet} ===")
        log(f"Extracted IDs (unique): {len(tbl)}")
        if tbl.empty:
            log("No IDs found. Skipping.")
            continue
        tbl["direction"] = [classify_direction_no_cutoff(x) for x in tbl["log2fc"].tolist()]
        n_fc = int(tbl["log2fc"].notna().sum())
        n_up = int((tbl["direction"] == "Up").sum())
        n_dn = int((tbl["direction"] == "Down").sum())
        n_zero = int((tbl["direction"] == "Zero").sum())
        n_na = int((tbl["direction"] == "NA").sum())
        log(f"log2FC present: {n_fc}/{len(tbl)}")
        log(f"Direction (no cutoff): Up={n_up}, Down={n_dn}, Zero={n_zero}, NA={n_na}")
        mapping_rows = []
        fasta_chunks = []
        n_ok = 0
        n_nf = 0
        for i, row in enumerate(tbl.itertuples(index=False), start=1):
            pid = row.protein_id
            log2fc = row.log2fc
            direction = row.direction
            fasta, status, upi = fetch_best_fasta(session, pid, cache_dir=args.cache)
            if fasta:
                header = pid
                if upi and upi != pid:
                    header = f"{pid}|UniParc:{upi}"
                fasta = rewrite_fasta_header(fasta, header)
                fasta_chunks.append(fasta)
                n_ok += 1
            else:
                n_nf += 1
            mapping_rows.append({
                "sheet": sheet,
                "protein_id_input": pid,
                "resolved_upi": upi if upi else "",
                "status": status,
                "log2fc": "" if log2fc is None else log2fc,
                "direction": direction,
                "fasta_written": "yes" if fasta else "no",
            })
            if args.sleep > 0:
                time.sleep(args.sleep)
            if i % 50 == 0:
                log(f" progress: {i}/{len(tbl)} (ok={n_ok}, not_found={n_nf})")
        with open(out_fasta, "w", encoding="utf-8") as f:
            for chunk in fasta_chunks:
                f.write(chunk.strip() + "\n")
        map_df = pd.DataFrame(mapping_rows)
        map_df.to_csv(out_map, sep="\t", index=False)
        up_ids = map_df.loc[map_df["direction"] == "Up", "protein_id_input"].astype(str).tolist()
        dn_ids = map_df.loc[map_df["direction"] == "Down", "protein_id_input"].astype(str).tolist()
        with open(out_up, "w", encoding="utf-8") as f:
            f.write("\n".join(up_ids) + ("\n" if up_ids else ""))
        with open(out_down, "w", encoding="utf-8") as f:
            f.write("\n".join(dn_ids) + ("\n" if dn_ids else ""))
        log(f"Fetched FASTA: {n_ok}/{len(tbl)} (not_found={n_nf})")
        log(f"Saved FASTA: {out_fasta}")
        log(f"Saved mapping: {out_map}")
        log(f"Saved Up IDs: {out_up} ({len(up_ids)})")
        log(f"Saved Down IDs: {out_down} ({len(dn_ids)})")
    log("\nDONE.")

if __name__ == "__main__":
    main()
02_normalize_fasta_headers.py
#!/usr/bin/env python3
import re
import sys
from collections import defaultdict
def canon_id(header: str) -> str:
h = header.strip()
h = h[1:] if h.startswith(">") else h
# tr|ACC|NAME or sp|ACC|NAME
m = re.match(r'^(sp|tr)\|([^|]+)\|', h)
if m:
return m.group(2)
# otherwise: first token, then before first '|'
first = h.split()[0]
return first.split("|")[0]
def read_fasta(path):
with open(path, "r", encoding="utf-8", errors="replace") as f:
header = None
seq = []
for line in f:
line = line.rstrip("\n")
if line.startswith(">"):
if header is not None:
yield header, "".join(seq)
header = line
seq = []
else:
seq.append(line.strip())
if header is not None:
yield header, "".join(seq)
def main():
if len(sys.argv) < 3:
print("Usage: normalize_fasta_headers.py in.fasta out.fasta [map.tsv]", file=sys.stderr)
sys.exit(1)
in_fa = sys.argv[1]
out_fa = sys.argv[2]
map_tsv = sys.argv[3] if len(sys.argv) >= 4 else out_fa + ".idmap.tsv"
seen = defaultdict(int)
with open(out_fa, "w", encoding="utf-8") as fo, open(map_tsv, "w", encoding="utf-8") as fm:
fm.write("old_header\tnew_id\n")
for old_h, seq in read_fasta(in_fa):
base = canon_id(old_h)
seen[base] += 1
new_id = base if seen[base] == 1 else f"{base}__dup{seen[base]}"
fm.write(f"{old_h}\t{new_id}\n")
fo.write(f">{new_id}\n")
# wrap sequence at 60
for i in range(0, len(seq), 60):
fo.write(seq[i:i+60] + "\n")
print(f"Wrote: {out_fa}")
print(f"Wrote: {map_tsv}")
if __name__ == "__main__":
main()
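The canonicalization rule used above (UniProt-style `sp|ACC|NAME` / `tr|ACC|NAME` headers reduce to the accession; anything else keeps the first whitespace token, then the part before the first `|`) can be sanity-checked with a minimal re-implementation. The example headers are illustrative, not taken from the real datasets:

```python
import re

def canon_id(header: str) -> str:
    """Reduce a FASTA header to a canonical ID (same rule as 02_normalize_fasta_headers.py)."""
    h = header.lstrip(">").strip()
    m = re.match(r'^(sp|tr)\|([^|]+)\|', h)  # UniProt-style sp|ACC|NAME / tr|ACC|NAME
    if m:
        return m.group(2)
    return h.split()[0].split("|")[0]        # first token, then before first '|'

# expected behaviour on typical header shapes
assert canon_id(">tr|A0A0E1VEW0|A0A0E1VEW0_SOME some description") == "A0A0E1VEW0"
assert canon_id(">sp|P12345|NAME_ORG desc") == "P12345"
assert canon_id(">B4U56_00090 hypothetical protein") == "B4U56_00090"
assert canon_id(">A0A0N1EUK4|UniParc:UPI0000000001 extra") == "A0A0N1EUK4"
```

This is the same rule the R helper `canonical_id_one()` in the Venn wrapper applies, so IDs stay consistent across the Python and R sides of the pipeline.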
03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R
#!/usr/bin/env Rscript
# ============================================================
# 3-circle Venn: RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown
#
# Key points:
# - RNA-seq inputs are Excel files with sheet 'Complete_Data'
# * GeneID is the ID used for comparison (locus tag style)
# * Keep Preferred_name / Description for annotation in ALL Venn regions
# - Exoproteome uses MS protein IDs + log2FC sign (no cutoff), mapped to locus tags via BLAST best-hit TSV
# - Proximity/ALFApulldown (ProALFA) uses SIP/MS protein IDs mapped to locus tags via BLAST best-hit TSV
# - Debug stats + parsing/mapping summaries are written into a LOG file
#
# NOTE: (Remember!) Generate Exoproteome_ALL.fasta + Exoproteome_ALL_vs_ref.blast.tsv first:
#   cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
#       Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
#       > Exoproteome_ALL.fasta
#   diamond blastp -q Exoproteome_ALL.fasta -d CP020463_protein.dmnd \
#       -o Exoproteome_ALL_vs_ref.blast.tsv --outfmt 6 --evalue 1e-5 --max-target-seqs 50 --threads 40
# ============================================================
suppressPackageStartupMessages({
library(tidyverse)
library(readxl)
library(openxlsx)
library(VennDiagram)
library(grid)
})
# -------------------------
# CONFIG
# -------------------------
RNA_PADJ_CUT <- 0.05
RNA_LFC_CUT <- 2
OUTDIR <- "./venn_outputs"
dir.create(OUTDIR, showWarnings = FALSE, recursive = TRUE)
LOGFILE <- file.path(OUTDIR, "pipeline_log.txt")
sink(LOGFILE, split = TRUE)
cat("=== 3-circle Venn pipeline ===\n")
cat(sprintf("RNA cutoff: padj < %.3g and |log2FoldChange| > %.3g\n", RNA_PADJ_CUT, RNA_LFC_CUT))
cat("Output dir:", normalizePath(OUTDIR), "\n\n")
# RNA inputs (Excel, sheet = Complete_Data)
RNA_FILES <- list(
deltasbp_MH_2h = "./DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx",
deltasbp_MH_4h = "./DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx",
deltasbp_MH_18h = "./DEG_Annotations_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx",
deltasbp_TSB_2h = "./DEG_Annotations_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx",
deltasbp_TSB_4h = "./DEG_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx",
deltasbp_TSB_18h = "./DEG_Annotations_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx"
)
# Exoproteome sheets (from summary Excel)
EXO_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
EXO_SHEETS <- list(
Exoproteome_MH_1h = "Exoproteome MH 1h",
Exoproteome_MH_4h = "Exoproteome MH 4h",
Exoproteome_MH_18h = "Exoproteome MH 18h",
Exoproteome_TSB_1h = "Exoproteome TSB 1h",
Exoproteome_TSB_4h = "Exoproteome TSB 4h",
Exoproteome_TSB_18h = "Exoproteome TSB 18h"
)
# Exoproteome BLAST best-hit mapping (query protein -> locus tag)
EXO_BLAST_TSV <- "./Exoproteome_ALL_vs_ref.besthit.tsv"
# Proximity/ALFApulldown sheets (from same summary Excel)
PROALFA_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
PROALFA_SHEETS <- list(
Proximity_4h = "Proximity 4h",
Proximity_18h = "Proximity 18h",
ALFApulldown_4h = "ALFApulldown 4h",
ALFApulldown_18h = "ALFApulldown 18h"
)
# ProALFA BLAST best-hit mapping (each dataset protein -> locus tag)
PROALFA_BLAST_TSV <- list(
Proximity_4h = "./Proximity_4h_vs_ref.besthit.tsv",
Proximity_18h = "./Proximity_18h_vs_ref.besthit.tsv",
ALFApulldown_4h = "./ALFApulldown_4h_vs_ref.besthit.tsv",
ALFApulldown_18h = "./ALFApulldown_18h_vs_ref.besthit.tsv"
)
# FASTA inputs (QC only)
FASTA_FILES <- list(
Proximity_4h = "./Proximity_4h.fasta",
Proximity_18h = "./Proximity_18h.fasta",
ALFApulldown_4h = "./ALFApulldown_4h.fasta",
ALFApulldown_18h = "./ALFApulldown_18h.fasta",
Exo_MH_1h = "./Exoproteome_MH_1h.fasta",
Exo_MH_4h = "./Exoproteome_MH_4h.fasta",
Exo_MH_18h = "./Exoproteome_MH_18h.fasta",
Exo_TSB_1h = "./Exoproteome_TSB_1h.fasta",
Exo_TSB_4h = "./Exoproteome_TSB_4h.fasta",
Exo_TSB_18h = "./Exoproteome_TSB_18h.fasta"
)
# -------------------------
# HELPERS
# -------------------------
safe_file_check <- function(fp, label) {
if (!file.exists(fp)) stop(label, " missing: ", fp)
}
canonical_id_one <- function(x) {
x <- as.character(x)
x <- stringr::str_trim(x)
if (is.na(x) || x == "") return(NA_character_)
# remove leading ">"
x <- stringr::str_remove(x, "^>")
# keep first token before whitespace
x <- stringr::str_split(x, "\\s+", simplify = TRUE)[1]
# handle UniProt-style: tr|ID|... or sp|ID|...
if (stringr::str_detect(x, "\\|")) {
parts <- stringr::str_split(x, "\\|")[[1]]
if (length(parts) >= 2 && parts[1] %in% c("tr", "sp")) {
x <- parts[2]
} else {
x <- parts[1]
}
}
x
}
canonical_id <- function(x) {
vapply(x, canonical_id_one, character(1))
}
read_fasta_headers <- function(fp) {
safe_file_check(fp, "FASTA")
hdr <- readLines(fp, warn = FALSE) %>% keep(~str_starts(.x, ">"))
tibble(raw = hdr, id = canonical_id(hdr)) %>%
filter(!is.na(id), id != "")
}
norm_name <- function(x) {
x <- tolower(trimws(as.character(x)))
x <- gsub("[^a-z0-9]+", "", x)
x
}
# Robust column picker (case-insensitive + punctuation-insensitive)
find_col <- function(nms, target) {
nms <- as.character(nms)
target_n <- norm_name(target)
n_n <- norm_name(nms)
# exact normalized match
idx <- which(n_n == target_n)
if (length(idx) > 0) return(nms[idx[1]])
# contains (normalized)
idx <- which(str_detect(n_n, fixed(target_n)))
if (length(idx) > 0) return(nms[idx[1]])
NA_character_
}
# Best-hit parser (BLAST outfmt6):
# qid sid pident length mismatch gapopen qstart qend sstart send evalue bitscore
read_besthit_tsv <- function(fp) {
safe_file_check(fp, "BLAST TSV")
suppressPackageStartupMessages({
df <- read_tsv(fp, col_names = FALSE, show_col_types = FALSE, progress = FALSE)
})
if (ncol(df) < 12) stop("BLAST TSV has <12 columns: ", fp)
colnames(df)[1:12] <- c("qid","sid","pident","length","mismatch","gapopen",
"qstart","qend","sstart","send","evalue","bitscore")
df %>%
mutate(
qid = canonical_id(qid),
sid = as.character(sid),
bitscore = suppressWarnings(as.numeric(bitscore)),
evalue = suppressWarnings(as.numeric(evalue))
) %>%
filter(!is.na(qid), qid != "")
}
besthit_reduce <- function(df) {
# one best hit per query: highest bitscore, then lowest evalue
df %>%
arrange(qid, desc(bitscore), evalue) %>%
group_by(qid) %>%
slice(1) %>%
ungroup()
}
# Try to extract locus tag from subject id
extract_locus_tag <- function(sid) {
sid <- as.character(sid)
# if sid already looks like B4U56_00090
if (str_detect(sid, "^B4U56_\\d{5}$")) return(sid)
# otherwise try to find locus tag pattern anywhere
hit <- str_extract(sid, "B4U56_\\d{5}")
hit
}
read_exo_sheet <- function(xlsx, sheet) {
df <- read_excel(xlsx, sheet = sheet, col_names = TRUE)
# Protein column: assume first column contains protein IDs
prot_col <- names(df)[1]
# log2FC column: per user it's column D, name looks like "log 2 FC)"
log2_col <- find_col(names(df), "log 2 FC)")
if (is.na(log2_col)) {
# fallback fuzzy: any col containing both log2 and fc
hits <- names(df)[str_detect(norm_name(names(df)), "log2") & str_detect(norm_name(names(df)), "fc")]
if (length(hits) > 0) log2_col <- hits[1]
}
if (is.na(log2_col)) stop("Exoproteome sheet missing log2FC column ('log 2 FC)'): ", sheet)
out <- df %>%
transmute(
Protein_raw = as.character(.data[[prot_col]]),
log2FC = suppressWarnings(as.numeric(.data[[log2_col]]))
) %>%
filter(!is.na(Protein_raw), Protein_raw != "")
out <- out %>%
mutate(
Protein_ID = canonical_id(Protein_raw),
EXO_direction = case_when(
is.na(log2FC) ~ NA_character_,
log2FC > 0 ~ "UP",
log2FC < 0 ~ "DOWN",
TRUE ~ "ZERO"
)
) %>%
filter(!is.na(Protein_ID), Protein_ID != "")
out
}
read_rna_excel <- function(fp) {
safe_file_check(fp, "RNA Excel")
df <- read_excel(fp, sheet = "Complete_Data", col_names = TRUE)
# required columns
req <- c("GeneID","padj","log2FoldChange")
miss <- setdiff(req, names(df))
if (length(miss) > 0) stop("RNA Excel missing columns: ", paste(miss, collapse=", "), " in ", fp)
# optional annotation columns
pref_col <- if ("Preferred_name" %in% names(df)) "Preferred_name" else NA_character_
desc_col <- if ("Description" %in% names(df)) "Description" else NA_character_
df <- df %>%
mutate(
GeneID_plain = str_remove(as.character(GeneID), "^gene-"),
padj = suppressWarnings(as.numeric(padj)),
log2FoldChange = suppressWarnings(as.numeric(log2FoldChange)),
Preferred_name = if (!is.na(pref_col)) as.character(.data[[pref_col]]) else NA_character_,
Description = if (!is.na(desc_col)) as.character(.data[[desc_col]]) else NA_character_
)
df
}
# -------------------------
# FASTA QC
# -------------------------
cat("[FASTA] Reading FASTA files for QC (10)\n")
fasta_qc_stats <- list()
for (nm in names(FASTA_FILES)) {
fp <- FASTA_FILES[[nm]]
safe_file_check(fp, "FASTA file")
h <- read_fasta_headers(fp)
n_hdr <- nrow(h)
n_uniq <- n_distinct(h$id)
n_dup <- n_hdr - n_uniq
cat(sprintf(" - %s: headers=%d, unique_IDs=%d, duplicates=%d\n", nm, n_hdr, n_uniq, n_dup))
dup_ids <- h %>% count(id, sort = TRUE) %>% filter(n > 1)
fasta_qc_stats[[nm]] <- tibble(
Dataset = nm,
headers = n_hdr,
unique_IDs = n_uniq,
duplicates = n_dup,
top_duplicate_id = ifelse(nrow(dup_ids) > 0, dup_ids$id[1], NA_character_),
top_duplicate_n = ifelse(nrow(dup_ids) > 0, dup_ids$n[1], NA_integer_)
)
}
fasta_qc_stats_df <- bind_rows(fasta_qc_stats)
# -------------------------
# RNA: read 6 comparisons (Excel)
# -------------------------
cat("\n[RNA] Reading 6 RNA-seq comparisons (Excel: Complete_Data)\n")
rna_sig_long <- list()
rna_all_long <- list()
for (cmp in names(RNA_FILES)) {
fp <- RNA_FILES[[cmp]]
df <- read_rna_excel(fp)
# Keep a full-gene annotation table (used later to annotate EXO/ProALFA-only members)
rna_all_long[[cmp]] <- df %>%
mutate(Comparison = cmp) %>%
select(GeneID_plain, Preferred_name, Description, padj, log2FoldChange, Comparison)
# Debug: how many bad entries
bad_gene <- sum(is.na(df$GeneID_plain) | df$GeneID_plain == "")
bad_padj <- sum(is.na(df$padj))
bad_lfc <- sum(is.na(df$log2FoldChange))
sig <- df %>%
filter(!is.na(GeneID_plain), GeneID_plain != "",
!is.na(padj), !is.na(log2FoldChange),
padj < RNA_PADJ_CUT,
abs(log2FoldChange) > RNA_LFC_CUT) %>%
mutate(
Comparison = cmp,
RNA_direction = ifelse(log2FoldChange > 0, "UP", "DOWN")
) %>%
select(GeneID_plain, Comparison, padj, log2FoldChange, RNA_direction,
Preferred_name, Description)
cat(sprintf(" - %s: rows=%d (bad GeneID=%d, bad padj=%d, bad log2FC=%d); significant=%d (UP=%d, DOWN=%d)\n",
cmp, nrow(df), bad_gene, bad_padj, bad_lfc,
nrow(sig), sum(sig$RNA_direction == "UP"), sum(sig$RNA_direction == "DOWN")))
rna_sig_long[[cmp]] <- sig
}
rna_sig_long <- bind_rows(rna_sig_long)
rna_all_long <- bind_rows(rna_all_long)
RNA <- sort(unique(rna_sig_long$GeneID_plain))
cat(sprintf("[RNA] Union significant genes (6 comps): %d\n", length(RNA)))
# Build an annotation lookup from *ALL* genes (so EXO-only/ProALFA-only also get annotation)
rna_ann <- rna_all_long %>%
filter(!is.na(GeneID_plain), GeneID_plain != "") %>%
group_by(GeneID_plain) %>%
summarise(
Preferred_name = coalesce(Preferred_name[match(TRUE, !is.na(Preferred_name) & Preferred_name != "")], Preferred_name[1]),
Description = coalesce(Description[match(TRUE, !is.na(Description) & Description != "")], Description[1]),
.groups = "drop"
)
# -------------------------
# EXO: read exoproteome sheets + mapping (BLAST)
# -------------------------
cat("\n[EXO] Reading Exoproteome sheets + mapping via BLAST\n")
safe_file_check(EXO_SUMMARY_XLSX, "Exoproteome summary Excel")
safe_file_check(EXO_BLAST_TSV, "Exoproteome BLAST TSV")
exo_long <- list()
for (sh in names(EXO_SHEETS)) {
df <- read_exo_sheet(EXO_SUMMARY_XLSX, EXO_SHEETS[[sh]]) %>%
mutate(Sheet = EXO_SHEETS[[sh]])
cat(sprintf(" - %s: proteins=%d (with log2FC=%d; UP=%d, DOWN=%d)\n",
EXO_SHEETS[[sh]],
n_distinct(df$Protein_ID),
sum(!is.na(df$log2FC)),
sum(df$EXO_direction == "UP", na.rm = TRUE),
sum(df$EXO_direction == "DOWN", na.rm = TRUE)))
exo_long[[sh]] <- df
}
exo_long <- bind_rows(exo_long) %>%
distinct(Protein_ID, Sheet, .keep_all = TRUE)
# BLAST mapping
exo_blast <- read_besthit_tsv(EXO_BLAST_TSV)
exo_best <- besthit_reduce(exo_blast) %>%
mutate(LocusTag = map_chr(sid, extract_locus_tag))
cat(sprintf("[BLAST] Exoproteome_ALL rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
nrow(exo_blast), n_distinct(exo_blast$qid), nrow(exo_best), n_distinct(exo_best$LocusTag)))
dup_targets <- exo_best %>%
filter(!is.na(LocusTag)) %>%
count(LocusTag, sort = TRUE) %>%
filter(n > 1)
cat(sprintf("[BLAST] Exoproteome_ALL duplicated targets in besthits: %d (top 5)\n", nrow(dup_targets)))
if (nrow(dup_targets) > 0) print(head(dup_targets, 5))
exo_mapped <- exo_long %>%
left_join(exo_best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid")) %>%
mutate(LocusTag = as.character(LocusTag))
cat(sprintf("[EXO] Proteins before mapping: %d unique\n", n_distinct(exo_long$Protein_ID)))
cat(sprintf("[EXO] Mapped proteins (have LocusTag): %d\n", sum(!is.na(exo_mapped$LocusTag))))
cat(sprintf("[EXO] Unique mapped proteins: %d\n", n_distinct(exo_mapped$Protein_ID[!is.na(exo_mapped$LocusTag)])))
cat(sprintf("[EXO] Unique mapped locus tags: %d\n", n_distinct(exo_mapped$LocusTag[!is.na(exo_mapped$LocusTag)])))
EXO <- sort(unique(na.omit(exo_mapped$LocusTag)))
# -------------------------
# ProALFA: read proximity + ALFA sheets + mapping (BLAST)
# -------------------------
cat("\n[ProALFA] Reading Proximity/ALFA sheets + mapping via BLAST\n")
safe_file_check(PROALFA_SUMMARY_XLSX, "ProALFA summary Excel")
ms_long <- list()
for (ds in names(PROALFA_SHEETS)) {
sh <- PROALFA_SHEETS[[ds]]
if (!(sh %in% readxl::excel_sheets(PROALFA_SUMMARY_XLSX))) {
stop("Sheet '", sh, "' not found")
}
df <- read_excel(PROALFA_SUMMARY_XLSX, sheet = sh, col_names = TRUE)
# assume first column has protein IDs
prot_col <- names(df)[1]
tmp <- df %>%
transmute(
Protein_raw = as.character(.data[[prot_col]]),
Protein_ID = canonical_id(Protein_raw)
) %>%
filter(!is.na(Protein_ID), Protein_ID != "") %>%
distinct(Protein_ID)
cat(sprintf(" - %s (%s): proteins=%d unique\n", ds, sh, nrow(tmp)))
ms_long[[ds]] <- tmp %>% mutate(Dataset = ds)
}
ms_long <- bind_rows(ms_long)
# mapping per dataset
ms_mapped_list <- list()
PROALFA_sets <- list()
for (ds in names(PROALFA_BLAST_TSV)) {
bt <- PROALFA_BLAST_TSV[[ds]]
safe_file_check(bt, paste0("ProALFA BLAST TSV: ", ds))
b <- read_besthit_tsv(bt)
best <- besthit_reduce(b) %>%
mutate(LocusTag = map_chr(sid, extract_locus_tag))
cat(sprintf("[BLAST] %s rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
ds, nrow(b), n_distinct(b$qid), nrow(best), n_distinct(best$LocusTag)))
dup_t <- best %>%
filter(!is.na(LocusTag)) %>%
count(LocusTag, sort = TRUE) %>%
filter(n > 1)
if (nrow(dup_t) > 0) {
cat(sprintf("[BLAST] %s duplicated targets in besthits: %d (top 5)\n", ds, nrow(dup_t)))
print(head(dup_t, 5))
}
tmp_ds <- ms_long %>%
filter(Dataset == ds) %>%
left_join(best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid"))
cat(sprintf("[ProALFA] %s: proteins=%d, mapped=%d, unique_targets=%d\n",
ds,
n_distinct(tmp_ds$Protein_ID),
sum(!is.na(tmp_ds$LocusTag)),
n_distinct(tmp_ds$LocusTag[!is.na(tmp_ds$LocusTag)])))
ms_mapped_list[[ds]] <- tmp_ds
PROALFA_sets[[ds]] <- sort(unique(na.omit(tmp_ds$LocusTag)))
}
ms_mapped_all <- bind_rows(ms_mapped_list)
ProALFA <- sort(unique(unlist(PROALFA_sets)))
cat(sprintf("[ProALFA] Union locus tags (Proximity+ALFA): %d\n", length(ProALFA)))
# -------------------------
# VENN intersections
# -------------------------
cat("\n[VENN] Computing intersections\n")
cat(sprintf("Set sizes: RNA=%d, EXO=%d, ProALFA=%d\n", length(RNA), length(EXO), length(ProALFA)))
i_RNA_EXO <- intersect(RNA, EXO)
i_RNA_ProALFA <- intersect(RNA, ProALFA)
i_EXO_ProALFA <- intersect(EXO, ProALFA)
i_all3 <- Reduce(intersect, list(RNA, EXO, ProALFA))
cat(sprintf("Intersections: RNA∩EXO=%d, RNA∩ProALFA=%d, EXO∩ProALFA=%d, ALL3=%d\n",
length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)))
# -------------------------
# Venn plot
# -------------------------
venn_png <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.png")
venn_pdf <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.pdf")
cat("[VENN] Writing plot:", venn_png, "\n")
venn.diagram(
x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
filename = venn_png,
imagetype = "png",
height = 2400, width = 2400, resolution = 300,
fill = c("#4C72B0","#55A868","#C44E52"),
alpha = 0.4,
cex = 0.9,
cat.cex = 0.85,
cat.pos = c(-20, 20, 270),
# Increase 3rd value to push the long bottom label further out
cat.dist = c(0.06, 0.06, 0.16),
margin = 0.22,
main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)
cat("[VENN] Writing plot:", venn_pdf, "\n")
vd <- venn.diagram(
x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
filename = NULL,
fill = c("#4C72B0","#55A868","#C44E52"),
alpha = 0.4,
cex = 0.9,
cat.cex = 0.85,
cat.pos = c(-20, 20, 270),
cat.dist = c(0.06, 0.06, 0.16),
margin = 0.22,
main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)
pdf(venn_pdf, width=7.5, height=7.5)
grid.draw(vd)
dev.off()
# -------------------------
# Region table (ALL members + direction annotations)
# -------------------------
all_members <- tibble(LocusTag = sort(unique(c(RNA, EXO, ProALFA)))) %>%
mutate(
in_RNA = LocusTag %in% RNA,
in_EXO = LocusTag %in% EXO,
in_ProALFA = LocusTag %in% ProALFA
)
# RNA direction per locus tag (from significant set only; union across comparisons)
rna_dir_tbl <- rna_sig_long %>%
group_by(GeneID_plain) %>%
summarise(
RNA_comparisons = paste(sort(unique(Comparison)), collapse = ";"),
RNA_dirs = paste(sort(unique(RNA_direction)), collapse = ";"),
min_padj = suppressWarnings(min(padj, na.rm = TRUE)),
max_abs_log2FC = suppressWarnings(max(abs(log2FoldChange), na.rm = TRUE)),
.groups = "drop"
) %>%
rename(LocusTag = GeneID_plain)
# EXO direction aggregated per locus tag
exo_dir_tbl <- exo_mapped %>%
filter(!is.na(LocusTag)) %>%
group_by(LocusTag) %>%
summarise(
EXO_sheets = paste(sort(unique(Sheet)), collapse = ";"),
EXO_dirs = paste(sort(unique(EXO_direction)), collapse = ";"),
EXO_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
.groups = "drop"
)
# ProALFA membership aggregated per locus tag
proalfa_tbl <- ms_mapped_all %>%
filter(!is.na(LocusTag)) %>%
group_by(LocusTag) %>%
summarise(
ProALFA_datasets = paste(sort(unique(Dataset)), collapse = ";"),
ProALFA_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
.groups = "drop"
)
region_tbl <- all_members %>%
left_join(rna_dir_tbl, by = "LocusTag") %>%
left_join(exo_dir_tbl, by = "LocusTag") %>%
left_join(proalfa_tbl, by = "LocusTag") %>%
left_join(rna_ann, by = c("LocusTag" = "GeneID_plain")) %>%
arrange(desc(in_RNA), desc(in_EXO), desc(in_ProALFA), LocusTag)
region_only <- function(inRNA, inEXO, inProALFA) {
region_tbl %>%
filter(in_RNA == inRNA, in_EXO == inEXO, in_ProALFA == inProALFA)
}
# -------------------------
# Excel output
# -------------------------
out_xlsx <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA_detailed.xlsx")
cat("[EXCEL] Writing:", out_xlsx, "\n")
wb <- createWorkbook()
addWorksheet(wb, "Summary")
writeData(wb, "Summary", data.frame(
Metric = c("RNA cutoff", "RNA union", "EXO union", "ProALFA union",
"RNA∩EXO", "RNA∩ProALFA", "EXO∩ProALFA", "ALL3"),
Value = c(
sprintf("padj < %.3g & |log2FC| > %.3g", RNA_PADJ_CUT, RNA_LFC_CUT),
length(RNA), length(EXO), length(ProALFA),
length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)
)
))
addWorksheet(wb, "FASTA_QC")
writeData(wb, "FASTA_QC", fasta_qc_stats_df)
addWorksheet(wb, "RNA_sig_long")
writeData(wb, "RNA_sig_long", rna_sig_long)
addWorksheet(wb, "RNA_all_annotations")
writeData(wb, "RNA_all_annotations", rna_all_long)
addWorksheet(wb, "EXO_proteins")
writeData(wb, "EXO_proteins", exo_long)
addWorksheet(wb, "EXO_mapped")
writeData(wb, "EXO_mapped", exo_mapped)
addWorksheet(wb, "ProALFA_proteins")
writeData(wb, "ProALFA_proteins", ms_long)
addWorksheet(wb, "ProALFA_mapped")
writeData(wb, "ProALFA_mapped", ms_mapped_all)
addWorksheet(wb, "All_members_with_dirs")
writeData(wb, "All_members_with_dirs", region_tbl)
addWorksheet(wb, "Only_RNA")
writeData(wb, "Only_RNA", region_only(TRUE, FALSE, FALSE))
addWorksheet(wb, "Only_EXO")
writeData(wb, "Only_EXO", region_only(FALSE, TRUE, FALSE))
addWorksheet(wb, "Only_ProALFA")
writeData(wb, "Only_ProALFA", region_only(FALSE, FALSE, TRUE))
addWorksheet(wb, "RNA_EXO_only")
writeData(wb, "RNA_EXO_only", region_only(TRUE, TRUE, FALSE))
addWorksheet(wb, "RNA_ProALFA_only")
writeData(wb, "RNA_ProALFA_only", region_only(TRUE, FALSE, TRUE))
addWorksheet(wb, "EXO_ProALFA_only")
writeData(wb, "EXO_ProALFA_only", region_only(FALSE, TRUE, TRUE))
addWorksheet(wb, "ALL3")
writeData(wb, "ALL3", region_only(TRUE, TRUE, TRUE))
saveWorkbook(wb, out_xlsx, overwrite = TRUE)
cat("\nDONE.\nOutputs:\n")
cat(" -", venn_png, "\n")
cat(" -", venn_pdf, "\n")
cat(" -", out_xlsx, "\n")
cat(" - LOG:", normalizePath(LOGFILE), "\n")
# dump warnings (so they go into log too)
if (!is.null(warnings())) {
cat("\n[WARNINGS]\n")
print(warnings())
}
sink()
Final Notes
- All comparisons are locus-tag consistent, enabling clean cross-omics integration.
- Exoproteome regulation is treated directionally only, by design.
- Logs and Excel outputs are intentionally verbose to support downstream inspection and reuse.
This pipeline is designed to be transparent, auditable, and extensible—ideal for iterative multi-omics analysis and figure generation.
From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)
Path: ~/DATA/Data_Michelle/Data_Michelle_ProteinClustering_2026
Goal
For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:
- Export protein sequences as FASTA
- Annotate them with EggNOG-mapper (including the COG_category column)
- Summarize COG composition at two levels:
  - COG letters (J/A/K/…/S), including multi-letter cases like IQ
  - 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
- Export both summary statistics and the underlying protein IDs for each category/group.
Step 0 — Why this manual annotation approach is needed (non-model organism)
Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.
Step 1 — Generate protein FASTA files
1A) FASTA from MS protein lists
Export sequences for each MS dataset:
python3 getProteinSequences_Proximity_4h.py # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py # > ALFApulldown_18h.fasta
Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py).
Output: One FASTA per dataset (*.fasta), used as EggNOG input.
1B) (USED FOR RNA-SEQ, NOT_USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)
mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta)
Output: cleaned FASTA headers (CP020463_protein.fasta)
Step 2 — Generate EggNOG annotation files (*.emapper.annotations)
2A) Install EggNOG-mapper
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda # eggnog-mapper 2.1.12
mamba activate eggnog_env
2B) Download / prepare EggNOG database
mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
--data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
2C) Run emapper.py on FASTA inputs
(USED FOR RNA-SEQ, NOT_USED HERE) For RNA-seq reference proteome (optional but useful for integration):
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60 # --resume if needed
For MS protein sets (the ones used for COG clustering):
emapper.py -i Proximity_4h.fasta -o eggnog_out_Proximity_4h --cpu 60 # --resume
emapper.py -i Proximity_18h.fasta -o eggnog_out_Proximity_18h --cpu 60
emapper.py -i ALFApulldown_4h.fasta -o eggnog_out_ALFApulldown_4h --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60
Key outputs used downstream (one per dataset):
- eggnog_out_Proximity_4h.emapper.annotations
- eggnog_out_Proximity_18h.emapper.annotations
- eggnog_out_ALFApulldown_4h.emapper.annotations
- eggnog_out_ALFApulldown_18h.emapper.annotations
These files include a column named COG_category, which is the basis for the clustering below.
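As a quick illustration of how that column is consumed downstream (this is a standalone sketch, not part of cog_cluster.py, and the table values are made up): a COG_category value may hold one letter, several letters, or `-` for unassigned proteins, and a long-format table expands each letter into its own row:

```python
import pandas as pd

def split_cog_letters(value) -> list[str]:
    """Expand an emapper COG_category value ('K', 'IQ', '-', NaN) into single letters."""
    if pd.isna(value) or value in ("-", ""):
        return []
    return [c for c in str(value).strip() if c.isalpha()]

# toy table mimicking the query / COG_category columns of *.emapper.annotations
df = pd.DataFrame({
    "query": ["prot1", "prot2", "prot3", "prot4"],
    "COG_category": ["K", "IQ", "-", "S"],
})
long = (df.assign(COG=df["COG_category"].map(split_cog_letters))
          .explode("COG")
          .dropna(subset=["COG"]))   # drops prot3, which has no COG assignment
print(long["COG"].tolist())          # ['K', 'I', 'Q', 'S'] — prot2 yields two rows
```

The explode-then-drop pattern is what makes "one row per protein per category" tables convenient for filtering in Excel later.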
Step 3 — COG clustering + reporting (this post’s main script)
Inputs
The four *.emapper.annotations files (must contain COG_category).
Outputs
All outputs are written to ./COG_outputs/:
Per dataset (Excel): COG_[Dataset].xlsx
- COG_letters: per-letter count + percent
- Debug: unassigned COG rows, R/S proportion, etc.
- Protein_assignments: per-protein COG letters + functional groups
- Protein_lists_by_COG: protein IDs per COG letter
- Protein_lists_by_group: protein IDs per major functional class
- Long_format_COG / Long_format_group: one row per protein per category/group (best for filtering)
Combined (Excel): COG_combined_summary.xlsx
- combined counts/percents across datasets
- combined long-format tables
Plots (PNG + PDF):
- COG_grouped_barplot_percent_letters.* (COG letters across datasets)
- COG_functional_groups.* (4 functional classes across datasets)
python3 cog_cluster.py
Interpretation notes
- High "POORLY CHARACTERIZED" (R/S) is common when EggNOG cannot assign confident functions, e.g., when many proteins are effectively "hypothetical protein"-like, strain-specific, or lack strong ortholog support.
- A large "POORLY CHARACTERIZED" fraction is not a plotting artifact: the datasets themselves contain many proteins that EggNOG annotates as unknown. Common biological/annotation reasons why "S" shows up so often in proximity-labeling / pulldown sets:
  - Small / strain-specific proteins (often hypothetical)
  - Membrane or secreted proteins with poor functional characterization
  - Mobile genetic elements / phage-related proteins
  - Proteins with only weak orthology support → EggNOG assigns S rather than a confident functional class
- Group totals need not sum to 100% because:
  - some proteins have multi-letter COGs (e.g., IQ) → counted in multiple groups
  - some proteins have no COG assignment (-) → contribute to no group
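To make the not-100% point concrete, here is a small self-contained sketch using the same group definitions as cog_cluster.py but hypothetical proteins; "KT" is chosen as an example of a multi-letter COG whose letters fall in two different groups:

```python
# Functional group definitions, as in cog_cluster.py
FUNCTIONAL_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": ["J", "A", "K", "L", "B"],
    "CELLULAR PROCESSES AND SIGNALING": ["D", "Y", "V", "T", "M", "N", "Z", "W", "U", "O"],
    "METABOLISM": ["C", "G", "E", "F", "H", "I", "P", "Q"],
    "POORLY CHARACTERIZED": ["R", "S"],
}
LETTER_TO_GROUP = {c: g for g, letters in FUNCTIONAL_GROUPS.items() for c in letters}

# hypothetical proteins: p2 spans two groups (K -> Info, T -> Cellular), p3 is unassigned
proteins = {"p1": "K", "p2": "KT", "p3": "-", "p4": "S"}

group_members = {g: set() for g in FUNCTIONAL_GROUPS}
for pid, cogs in proteins.items():
    for letter in cogs:
        if letter in LETTER_TO_GROUP:            # '-' maps to no group
            group_members[LETTER_TO_GROUP[letter]].add(pid)

n = len(proteins)
for g, members in group_members.items():
    print(f"{g}: {len(members)}/{n} proteins ({100 * len(members) / n:.0f}%)")
# 4 group memberships vs only 3 proteins with any COG: p2 is counted twice
# (once per group), while p3 contributes to nothing, so group percentages
# are per-group protein fractions, not shares of a single partition.
```

This mirrors why the script's grouped barplot can show class percentages that together exceed (or fall short of) 100%.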
Code snippet (generate Proximity_4h FASTA, used in Step 2)
import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# --------- robust HTTP session (retries + backoff) ----------
def make_session():
s = requests.Session()
retries = Retry(
total=6,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
s.headers.update({
"User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
})
return s
S = make_session()
def http_get_text(url, params=None):
r = S.get(url, params=params, timeout=30)
if r.status_code == 200:
return r.text
return None
# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
return http_get_text(url)
# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
url = "https://rest.uniprot.org/uniparc/search"
params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
txt = http_get_text(url, params=params)
if not txt:
return None
lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
if len(lines) < 2:
return None
upi = lines[1].split("\t")[0].strip()
return upi if upi.startswith("UPI") else None
# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
txt = http_get_text(url1)
if txt:
return txt
url2 = f"https://rest.uniprot.org/uniparc/{upi}"
return http_get_text(url2, params={"format": "fasta"})
def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
identifier = identifier.strip()
if not identifier:
return None
if identifier.startswith("UPI"):
fasta = fetch_uniparc_fasta(identifier)
return (identifier, fasta) if fasta else None
fasta = fetch_uniprotkb_fasta(identifier)
if fasta:
return (identifier, fasta)
upi = resolve_upi_via_uniparc_search(identifier)
if upi:
fasta2 = fetch_uniparc_fasta(upi)
if fasta2:
fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
return (identifier, fasta2)
return None
def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
missing = []
with open(out_fasta, "w", encoding="utf-8") as f:
for pid in ids:
res = fetch_fasta_for_id(pid)
if res is None:
missing.append(pid)
else:
_, fasta_txt = res
if not fasta_txt.endswith("\n"):
fasta_txt += "\n"
f.write(fasta_txt)
time.sleep(delay_s)
return missing
ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"] # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)
Code snippet (COG clustering script, used in Step 3)
#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)
Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
* Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
* Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
(note: method still allows multi-group membership; we just don't print it in labels)
"""
import os
import re
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
"Proximity_4h": "./eggnog_out_Proximity_4h.emapper.annotations",
"Proximity_18h": "./eggnog_out_Proximity_18h.emapper.annotations",
"ALFApulldown_4h": "./eggnog_out_ALFApulldown_4h.emapper.annotations",
"ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}
OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)
ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]
COG_DESCRIPTIONS = {
"J": "Translation, ribosomal structure and biogenesis",
"A": "RNA processing and modification",
"K": "Transcription",
"L": "Replication, recombination and repair",
"B": "Chromatin structure and dynamics",
"D": "Cell cycle control, cell division, chromosome partitioning",
"Y": "Nuclear structure",
"V": "Defense mechanisms",
"T": "Signal transduction mechanisms",
"M": "Cell wall/membrane/envelope biogenesis",
"N": "Cell motility",
"Z": "Cytoskeleton",
"W": "Extracellular structures",
"U": "Intracellular trafficking, secretion, and vesicular transport",
"O": "Posttranslational modification, protein turnover, chaperones",
"C": "Energy production and conversion",
"G": "Carbohydrate transport and metabolism",
"E": "Amino acid transport and metabolism",
"F": "Nucleotide transport and metabolism",
"H": "Coenzyme transport and metabolism",
"I": "Lipid transport and metabolism",
"P": "Inorganic ion transport and metabolism",
"Q": "Secondary metabolites biosynthesis, transport and catabolism",
"R": "General function prediction only",
"S": "Function unknown",
}
FUNCTIONAL_GROUPS = {
"INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
"CELLULAR PROCESSES AND SIGNALING": ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
"METABOLISM": ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
"POORLY CHARACTERIZED": ['R', 'S'],
}
LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
for c in letters:
LETTER_TO_GROUP[c] = grp
# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
rows = []
header = None
with open(path, "r", encoding="utf-8", errors="replace") as f:
for line in f:
line = line.rstrip("\n")
if line.startswith("##"):
continue
if line.startswith("#query"):
header = line.lstrip("#").split("\t")
continue
if header is None and re.match(r"^query\t", line):
header = line.split("\t")
continue
if header is None:
continue
rows.append(line.split("\t"))
if header is None:
raise ValueError(f"Could not find header line (#query/query) in: {path}")
df = pd.DataFrame(rows, columns=header)
if "query" not in df.columns and "#query" in df.columns:
df = df.rename(columns={"#query": "query"})
return df
def split_cog_letters(cog_str):
if cog_str is None:
return []
cog_str = str(cog_str).strip()
if cog_str == "" or cog_str == "-" or cog_str.lower() == "nan":
return []
letters = list(cog_str)
return [c for c in letters if c in ALL_COG]
def count_cog_letters(df: pd.DataFrame) -> dict:
counts = {c: 0 for c in ALL_COG}
for x in df["COG_category"].astype(str).tolist():
for c in split_cog_letters(x):
counts[c] += 1
return counts
def build_category_to_proteins(df: pd.DataFrame) -> dict:
cat2prot = {c: set() for c in ALL_COG}
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
for c in letters:
cat2prot[c].add(q)
return {c: sorted(list(v)) for c, v in cat2prot.items()}
def build_group_to_proteins(df: pd.DataFrame) -> dict:
group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
hit_groups = set()
for c in letters:
g = LETTER_TO_GROUP.get(c)
if g:
hit_groups.add(g)
for g in hit_groups:
group2prot[g].add(q)
return {g: sorted(list(v)) for g, v in group2prot.items()}
def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
rows = []
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
rows.append({
"query": q,
"COG_category_raw": cog,
"COG_letters": "".join(sorted(set(letters))) if letters else "",
"Functional_groups": "; ".join(groups) if groups else "",
"Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA"])
})
out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
return out.sort_values("query")
def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
rows = []
for c in ALL_COG:
for pid in cat2prot.get(c, []):
rows.append({
"Dataset": dataset,
"COG": c,
"Description": COG_DESCRIPTIONS.get(c, ""),
"Protein_ID": pid
})
return pd.DataFrame(rows)
def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
rows = []
for g in FUNCTIONAL_GROUPS.keys():
for pid in grp2prot.get(g, []):
rows.append({
"Dataset": dataset,
"Functional_group": g,
"Protein_ID": pid
})
return pd.DataFrame(rows)
def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
groups = list(FUNCTIONAL_GROUPS.keys())
recs = []
for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
letters = split_cog_letters(cog)
hit_groups = set()
for c in letters:
g = LETTER_TO_GROUP.get(c)
if g:
hit_groups.add(g)
rec = {"query": q}
for g in groups:
rec[g] = (g in hit_groups)
recs.append(rec)
out = pd.DataFrame(recs)
out = out.groupby("query", as_index=False).max()
return out
# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []
for ds_name, path in INPUT_FILES.items():
print(f"\n--- {ds_name} ---\nReading: {path}")
if not os.path.exists(path):
raise FileNotFoundError(f"Missing input file: {path}")
df = read_emapper_annotations(path)
if "COG_category" not in df.columns:
raise ValueError(f"{path} has no 'COG_category' column.")
n_rows = df.shape[0]
unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()
# Letter-based counts and percents
counts = count_cog_letters(df)
total_letters = sum(counts.values())
pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}
counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)
all_counts[ds_name] = counts_s
all_pct[ds_name] = pct_s
out_df = pd.DataFrame({
"COG": ALL_COG_DISPLAY,
"Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
"Count": counts_s.values,
"Percent": pct_s.values,
})
dbg = pd.DataFrame([{
"Dataset": ds_name,
"Proteins_in_table": int(n_rows),
"COG_unassigned_rows": int(unassigned),
"Total_assigned_COG_letters": int(total_letters),
"R_count": int(counts.get("R", 0)),
"S_count": int(counts.get("S", 0)),
"R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
}])
debug_summary.append(dbg.iloc[0].to_dict())
assignment_tbl = build_assignment_table(df)
cat2prot = build_category_to_proteins(df)
grp2prot = build_group_to_proteins(df)
# category (COG letter) protein lists
cat_list_df = pd.DataFrame({
"COG": ALL_COG_DISPLAY,
"Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
"N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
"Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
})
# group protein lists
grp_list_df = pd.DataFrame({
"Functional_group": list(FUNCTIONAL_GROUPS.keys()),
"N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
"Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
})
df_long_cog = long_format_by_cog(cat2prot, ds_name)
df_long_group = long_format_by_group(grp2prot, ds_name)
all_long_cog.append(df_long_cog)
all_long_group.append(df_long_group)
out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
with pd.ExcelWriter(out_xlsx) as writer:
out_df.to_excel(writer, sheet_name="COG_letters", index=False)
dbg.to_excel(writer, sheet_name="Debug", index=False)
assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)
print(f"Saved: {out_xlsx}")
# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())
df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)
combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
df_counts.to_excel(writer, sheet_name="Counts")
df_pct.to_excel(writer, sheet_name="Percent")
pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
# Combined long formats (includes protein IDs per letter/group across datasets)
pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)
print(f"\nSaved combined summary: {combined_xlsx}")
# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()
y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))
fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)
ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()
plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")
# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []
# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []
for ds_name, path in INPUT_FILES.items():
df = read_emapper_annotations(path)
# (A) protein-based group membership and stats
memb = protein_based_group_membership(df)
n_proteins = memb.shape[0]
counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
group_counts[ds_name] = pd.Series(counts)
group_pct[ds_name] = pd.Series(pct)
for g in FUNCTIONAL_GROUPS.keys():
ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
for pid in ids:
group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})
# (B) add category (letter) protein IDs too
cat2prot = build_category_to_proteins(df)
all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))
df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)
# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
for c in ALL_COG:
ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
cat_list_rows.append({
"Dataset": ds,
"COG": c,
"Description": COG_DESCRIPTIONS.get(c, ""),
"N_proteins": len(ids),
"Protein_IDs": "; ".join(sorted(ids))
})
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)
group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)
# NEW: category protein IDs included here as well
df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)
print(f"Saved functional-group workbook: {group_xlsx}")
# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()
x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))
fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)
ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()
plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)
print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")
print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))
End-to-end GO enrichment for non-model bacteria: MS → reference ID mapping with BLAST + Blast2GO (GUI) + R enrichment
Path: ~/DATA/Data_Michelle/MS_GO_enrichments_2026
This post summarizes a complete workflow to run GO enrichment for protein sets from mass spectrometry (MS) in a non-model organism, where:
- MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
- Standard R organism annotation packages (org.*.eg.db) are not available.
- We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE.
The workflow produces:
- 4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
- 1 combined summary workbook across all datasets
1) Generate protein FASTA files
1.1 MS protein sequences (Proximity / ALFApulldown)
From MS results (protein ID lists), retrieve sequences and write FASTA:
# Each script writes its own FASTA via out_fasta=... and prints missing IDs to stdout,
# so do NOT redirect stdout onto the same filename the script writes.
python3 getProteinSequences_Proximity_4h.py      # writes Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # writes Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # writes ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # writes ALFApulldown_18h.fasta
1.2 Reference proteome sequences (for mapping + background)
Download the reference proteome FASTA from GenBank and standardize headers:
mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
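The exact behavior of update_fasta_header.py is not shown here; a minimal sketch of the kind of header standardization it performs (assuming GenBank-style headers that carry a [locus_tag=...] field, and that bare locus tags are the desired IDs) might look like:

```python
import re

def standardize_headers(in_path: str, out_path: str) -> None:
    """Rewrite FASTA headers to bare locus tags (e.g. >B4U56_00090).
    Assumes GenBank-style headers with a [locus_tag=...] field; headers
    without one are passed through unchanged."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                m = re.search(r"\[locus_tag=([^\]]+)\]", line)
                fout.write(f">{m.group(1)}\n" if m else line)
            else:
                fout.write(line)
```

Having the locus tag as the sole FASTA ID is what later lets the BLAST subject IDs join directly onto the Blast2GO annotation table.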
2) Optional functional annotation for merging MS + RNA-seq (EggNOG)
EggNOG gives KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and quick merging. (For enrichment, we will use Blast2GO GO terms later, which are typically more comprehensive for GO/EC.)
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env
download_eggnog_data.py --dbname eggnog.db -y \
--data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60
# MS sets
emapper.py -i Proximity_4h.fasta -o eggnog_out_Proximity_4h --cpu 60
emapper.py -i Proximity_18h.fasta -o eggnog_out_Proximity_18h --cpu 60
emapper.py -i ALFApulldown_4h.fasta -o eggnog_out_ALFApulldown_4h --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60
3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)
Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.
3.1 Setup workspace
mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/
Launch Blast2GO:
~/Tools/Blast2GO/Blast2GO_Launcher
3.2 Step-by-step in Blast2GO GUI
STEP 1 — Load sequences (reference proteome)
- File → Load → Load Sequences
- Choose: Load Fasta File (.fasta)
- Select: CP020463_protein.fasta
- Tags: (leave default / none)
- Check that the table is filled with columns like Nr, SeqName
✅ Output: sequences are loaded into the project
STEP 2 — BLAST (QBlast)
- Go to the BLAST panel
- Choose blastp (protein vs protein)
- Database: nr (NCBI)
- Set other parameters as needed (defaults typically OK)
- Run QBlast
⚠️ This step is typically the most time-consuming (hours to days depending on dataset size and NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit if desired.
✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)
STEP 3 — Mapping
- Click Mapping
✅ Output:
- Tags updated to MAPPED
- Columns appear such as #GO, GO IDs, GO Names
STEP 4 — Annotation
- Click Annotation
- Key parameters you may set/keep:
- Annotation CutOff (controls reliability threshold)
- GO Weight (boosts more general terms when supported by multiple specific hits)
✅ Output:
- Tags updated to ANNOTATED
- Enzyme-related columns may appear (e.g., Enzyme Codes)
STEP 5 — Export Annotations (before merging InterPro)
- File → Export → Export Annotations
- Choose Export Annotations (.annot, custom, etc.)
- Save as:
~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot
✅ Output: blast2go_annot.annot
STEP 6 — InterProScan (optional but recommended for more GO terms)
- Click InterPro / InterProScan
- Run InterProScan (can be long: hours to >1 day depending on dataset & setup)
✅ Output:
- Tags updated to INTERPRO
- Additional columns: InterPro IDs, InterPro GO IDs/Names
STEP 7 — Merge InterProScan GOs into existing annotation
- In InterPro panel, choose: Merge InterProScan GOs to Annotation
- Confirm merge
✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)
STEP 8 — Export final annotations (after merging InterPro)
- File → Export → Export Annotations
- Save as:
~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2
✅ This is the final file used for enrichment:
blast2go_annot.annot2
Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.
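Before enrichment it is worth sanity-checking blast2go_annot.annot2. Assuming the usual tab-separated .annot layout (sequence name in column 1, a GO ID or EC code in column 2), a quick count of how many reference proteins carry at least one GO term:

```python
def count_go_annotations(path: str) -> tuple[int, int]:
    """Return (proteins with >=1 GO term, total GO annotation rows).
    Rows whose second column is not a GO ID (e.g. EC codes) are skipped."""
    genes, go_rows = set(), 0
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2 and parts[1].startswith("GO:"):
                genes.add(parts[0])
                go_rows += 1
    return len(genes), go_rows
```

The first number should match the "Reference universe (proteins with GO)" count printed by the R wrapper below.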
4) Generate BLAST mapping tables: *_vs_ref.blast.tsv
We map each MS protein set to the reference proteome locus tags using BLASTP.
4.1 Create BLAST database from reference proteome
makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db
4.2 BLASTP each MS set against the reference DB
blastp -query Proximity_4h.fasta -db CP020463_ref_db \
-out Proximity_4h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query Proximity_18h.fasta -db CP020463_ref_db \
-out Proximity_18h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
-out ALFApulldown_4h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
-out ALFApulldown_18h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
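The four BLASTP invocations differ only in the dataset name, so they can be collapsed into a loop with identical parameters (the command -v guard just makes the snippet a no-op on machines without BLAST+):

```shell
FMT="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen"
if command -v blastp >/dev/null; then
  for set in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
    blastp -query "${set}.fasta" -db CP020463_ref_db \
      -out "${set}_vs_ref.blast.tsv" \
      -outfmt "$FMT" \
      -evalue 1e-10 -max_target_seqs 5 -num_threads 16
  done
fi
```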
5) Run GO enrichment (4 sets + combined summary)
We run a single wrapper script that:
- normalizes FASTA headers (handles UniParc/UniProt tr|... formats)
- filters BLAST hits (pident/qcov/evalue thresholds)
- picks one best hit per query
- performs GO enrichment via clusterProfiler::enricher()
- background (universe) = all reference proteins with ≥1 GO term in blast2go_annot.annot2
- cutoff = BH/FDR-adjusted p-value < 0.05
- writes one Excel per dataset + one combined summary workbook
Rscript wrapper_besthit_GO_enrichment_4sets.R
Code snippets (used scripts)
A) Example script for Step 1: generate Proximity_4h FASTA
import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# --------- robust HTTP session (retries + backoff) ----------
def make_session():
s = requests.Session()
retries = Retry(
total=6,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
s.headers.update({
"User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
})
return s
S = make_session()
def http_get_text(url, params=None):
r = S.get(url, params=params, timeout=30)
if r.status_code == 200:
return r.text
return None
# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
return http_get_text(url)
# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
url = "https://rest.uniprot.org/uniparc/search"
params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
txt = http_get_text(url, params=params)
if not txt:
return None
lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
if len(lines) < 2:
return None
upi = lines[1].split("\t")[0].strip()
return upi if upi.startswith("UPI") else None
# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
txt = http_get_text(url1)
if txt:
return txt
url2 = f"https://rest.uniprot.org/uniparc/{upi}"
return http_get_text(url2, params={"format": "fasta"})
def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
identifier = identifier.strip()
if not identifier:
return None
if identifier.startswith("UPI"):
fasta = fetch_uniparc_fasta(identifier)
return (identifier, fasta) if fasta else None
fasta = fetch_uniprotkb_fasta(identifier)
if fasta:
return (identifier, fasta)
upi = resolve_upi_via_uniparc_search(identifier)
if upi:
fasta2 = fetch_uniparc_fasta(upi)
if fasta2:
fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
return (identifier, fasta2)
return None
def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
missing = []
with open(out_fasta, "w", encoding="utf-8") as f:
for pid in ids:
res = fetch_fasta_for_id(pid)
if res is None:
missing.append(pid)
else:
_, fasta_txt = res
if not fasta_txt.endswith("\n"):
fasta_txt += "\n"
f.write(fasta_txt)
time.sleep(delay_s)
return missing
ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"] # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)
B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets
#!/usr/bin/env Rscript
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tibble)
library(openxlsx)
library(clusterProfiler)
library(AnnotationDbi)
library(GO.db)
})
# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2" # or full path
datasets <- tibble::tribble(
~name, ~blast_tsv, ~fasta,
"Proximity_4h", "Proximity_4h_vs_ref.blast.tsv", "Proximity_4h.fasta",
"Proximity_18h", "Proximity_18h_vs_ref.blast.tsv", "Proximity_18h.fasta",
"ALFApulldown_4h", "ALFApulldown_4h_vs_ref.blast.tsv", "ALFApulldown_4h.fasta",
"ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)
out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV <- 0.70
MAX_EVALUE <- 1e-10
# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"
# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
x <- as.character(x)
x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
parts <- stringr::str_split(x, "\\|", simplify = TRUE)
is_uniprot <- ncol(parts) >= 3 & (parts[, 1] == "tr" | parts[, 1] == "sp")
ifelse(is_uniprot, parts[, 2], parts[, 1])
}
norm_subject_id <- function(x) {
x <- as.character(x)
x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
stringr::str_split_fixed(x, "\\|", 2)[, 1]
}
add_go_desc <- function(df) {
if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
df$GO_Term <- vapply(df$ID, function(go_id) {
term <- tryCatch(
AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
error = function(e) NULL
)
if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
}, FUN.VALUE = character(1))
df
}
safe_read_blast <- function(path) {
cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
"qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
readr::read_tsv(
path,
col_names = cols,
col_types = readr::cols(
qseqid = readr::col_character(),
sseqid = readr::col_character(),
pident = readr::col_double(),
length = readr::col_double(),
mismatch = readr::col_double(),
gapopen = readr::col_double(),
qstart = readr::col_double(),
qend = readr::col_double(),
sstart = readr::col_double(),
send = readr::col_double(),
evalue = readr::col_double(),
bitscore = readr::col_double(),
qlen = readr::col_double(),
slen = readr::col_double()
),
progress = FALSE
)
}
# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term <- as.character(annot_df$Term)
term2gene_go_ref <- annot_df %>%
dplyr::filter(grepl("^GO:", Term)) %>%
dplyr::transmute(term = Term, gene = GeneID) %>%
dplyr::distinct()
universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")
# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms <- list()
all_besthits <- list()
# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {
ds <- datasets[i, ]
name <- ds$name
blast_path <- ds$blast_tsv
fasta_path <- ds$fasta
cat("\n=============================\n")
cat("Dataset:", name, "\n")
if (!file.exists(blast_path)) {
warning("Missing BLAST TSV: ", blast_path, " (skipping)")
next
}
if (!file.exists(fasta_path)) {
warning("Missing FASTA: ", fasta_path, " (skipping)")
next
}
# ---- FASTA query IDs ----
fa <- readLines(fasta_path, warn = FALSE)
q_all <- fa[startsWith(fa, ">")]
q_all <- gsub("^>", "", q_all)
q_all <- unique(norm_mixed_id(q_all))
n_fasta <- length(q_all)
# ---- BLAST ----
bl <- safe_read_blast(blast_path) %>%
dplyr::mutate(
qid = norm_mixed_id(qseqid),
sid = norm_subject_id(sseqid),
qcov = length / qlen
)
n_queries_with_hits <- dplyr::n_distinct(bl$qid)
max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0
cat("FASTA queries:", n_fasta, "\n")
cat("BLAST rows:", nrow(bl), "\n")
cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
cat("Max hits/query:", max_hits_per_query, "\n")
bl_f <- bl %>%
dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)
cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")
best <- bl_f %>%
dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
dplyr::group_by(qid) %>%
dplyr::slice_head(n = 1) %>%
dplyr::ungroup()
best_out <- best %>%
dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
dplyr::arrange(dplyr::desc(bitscore))
best_all <- tibble::tibble(qid = q_all) %>%
dplyr::left_join(best_out, by = "qid")
unmapped <- best_all %>%
dplyr::filter(is.na(sid) | sid == "") %>%
dplyr::distinct(qid) %>%
dplyr::pull(qid)
mapped <- best_out %>%
dplyr::filter(!is.na(sid), sid != "")
cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
cat("Unmapped queries:", length(unmapped), "\n")
cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")
gene_list_ref <- unique(mapped$sid)
n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")
# Save best-hit TSVs
fn_best <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
readr::write_tsv(best_out, fn_best)
readr::write_tsv(best_all, fn_best_all)
# GO enrichment
go_res <- tryCatch(
clusterProfiler::enricher(
gene = gene_list_ref,
TERM2GENE = term2gene_go_ref,
universe = universe_ref,
pvalueCutoff = P_CUT,
pAdjustMethod = PADJ_METHOD
),
error = function(e) NULL
)
go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
go_df <- add_go_desc(go_df)
cat("Enriched GO terms:", nrow(go_df), "\n")
per_protein_go <- mapped %>%
dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
dplyr::distinct() %>%
dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
dplyr::rename(GO = term)
# Per-dataset Excel
out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "BestHit_Mapped")
openxlsx::writeData(wb, "BestHit_Mapped", mapped)
openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))
openxlsx::addWorksheet(wb, "PerProtein_GO")
openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)
openxlsx::addWorksheet(wb, "GO_Enrichment")
openxlsx::writeData(wb, "GO_Enrichment", go_df)
openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
cat("Saved:", out_xlsx, "\n")
# Collect combined summary
summary_runs[[name]] <- tibble::tibble(
dataset = name,
fasta_queries = n_fasta,
blast_rows = nrow(bl),
queries_with_hits = n_queries_with_hits,
max_hits_per_query = max_hits_per_query,
filtered_rows = nrow(bl_f),
filtered_queries = dplyr::n_distinct(bl_f$qid),
mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
unmapped_queries = length(unmapped),
unique_targets_besthit = dplyr::n_distinct(mapped$sid),
duplicated_targets_besthit = sum(duplicated(mapped$sid)),
mapped_targets_with_GO = n_targets_with_go,
enriched_GO_terms = nrow(go_df)
)
all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
all_go_terms[[name]] <- dplyr::mutate(per_protein_go, dataset = name)
all_besthits[[name]] <- dplyr::mutate(mapped, dataset = name)
}
# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)
go_counts <- combined_per_protein_go %>%
dplyr::filter(!is.na(GO), GO != "") %>%
dplyr::distinct(dataset, qid, GO) %>%
dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))
combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)
openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)
openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)
openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)
openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)
openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)
cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")
Lakeview/Lake file refresh pipeline: track filtering, filename normalization, and automated .lake updates
This pipeline updates existing Lakeview merged .lake files by replacing their blue track content with filtered track CSVs, while keeping original .h5 filenames/paths unchanged so Lakeview can still load the raw kymographs locally.
Step 1 — Fix naming mismatches before updating lakes
To ensure each kymograph (.h5) finds its corresponding filtered CSV, filename inconsistencies are corrected first:
- _p940_ vs _940_: some filtered CSVs contain _p940_ while the corresponding .h5 uses _940_ → rename CSVs accordingly.
- ch4 vs ch5: some filtered CSVs were labeled _ch4_ while the .h5 filenames use _ch5_ (or vice versa) → rename CSVs to match.
- Extra CSVs without any matching .h5 are removed to avoid confusion later.
This step prevents Lakeview kymos from being dropped simply because no matching CSV is found.
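No dedicated script for Step 1 is shown in this write-up; as a sketch of the matching logic (the p940/940 and ch4/ch5 substitution rules are the ones described above, the function name and everything else is hypothetical), candidate spellings for a kymo name could be generated like this:

```python
import re

def name_variants(kymo_name: str) -> set:
    """Return plausible alternative spellings of a kymo/CSV base name,
    applying the p940 <-> 940 and ch4 <-> ch5 substitutions from Step 1."""
    variants = {kymo_name}
    # 'p' + 3 digits -> bare 3 digits (e.g. p940 -> 940)
    variants.add(re.sub(r"(^|_)p(\d{3})(_|$)", r"\g<1>\g<2>\g<3>", kymo_name))
    # bare 3 digits -> 'p' + 3 digits (e.g. 940 -> p940)
    variants.add(re.sub(r"(^|_)(\d{3})(_|$)", r"\g<1>p\g<2>\g<3>", kymo_name))
    # ch4 <-> ch5, applied to every variant so combinations are covered
    for v in list(variants):
        variants.add(v.replace("_ch4_", "_ch5_"))
        variants.add(v.replace("_ch5_", "_ch4_"))
    return variants
```

A renaming pass would then look up each filtered CSV's base name among the `.h5` names via these variants and `os.rename` it to the matching spelling.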
Step 2 — Filter tracks and generate debug reports
Run the filtering scripts to create multiple filtered outputs from each raw *_blue.csv:
- Binding position filter: 2.2–3.8 µm
- Lifetime thresholds: ≥1 s, ≥2 s, ≥5 s
- A lifetime-only filter: ≥5 s without a position constraint
The debug version additionally writes per-track reports (binding position, lifetime, pass/fail reason), which makes it much easier to spot issues caused by parsing, NaNs, or unexpected track structure.
Step 3 — Organize filtered outputs into separate folders
Move filtered files into dedicated directories so each downstream lake update corresponds to a single filtering rule:
- filtered_blue_position (2.2–3.8 µm)
- filtered_blue_position_1s (2.2–3.8 µm + ≥1 s)
- filtered_blue_position_5s (2.2–3.8 µm + ≥5 s)
- filtered_blue_lifetime_5s_only (≥5 s, no position filter)
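This sorting step can be scripted by routing on the label suffix that 1_filter_track.py appends to each output. A minimal sketch (the folder names are the ones listed above; routing the script's additional position_2s output to a similarly named folder is my own assumption):

```python
import glob
import os
import shutil

# Map label suffix (as written by 1_filter_track.py) -> destination folder.
# Longer suffixes first so '_position_1s.csv' is not captured by '_position.csv'.
LABEL_DIRS = [
    ("_lifetime_5s_only.csv", "filtered_blue_lifetime_5s_only"),
    ("_position_1s.csv", "filtered_blue_position_1s"),
    ("_position_2s.csv", "filtered_blue_position_2s"),  # assumption: not in the folder list above
    ("_position_5s.csv", "filtered_blue_position_5s"),
    ("_position.csv", "filtered_blue_position"),
]

def organize(filtered_folder="./filtered", dest_base="."):
    """Move each filtered CSV into the per-rule folder matching its suffix."""
    for path in glob.glob(os.path.join(filtered_folder, "*.csv")):
        for suffix, dest in LABEL_DIRS:
            if path.endswith(suffix):
                dest_dir = os.path.join(dest_base, dest)
                os.makedirs(dest_dir, exist_ok=True)
                shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
                break
```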
Step 4 — Update .lake files using the filtered tracks
Run 2_update_lakes.py once per filtered folder to create updated .lake outputs (and logs):
- For each kymo in each .lake, the script tries to find a matching *_blue*.csv.
- Outcomes are classified:
  - case1: CSV found and contains ≥1 track → replace blue track text and keep the kymo.
  - case2: CSV found but no tracks remain after filtering (header-only / parse error) → remove the kymo.
  - case3: no matching CSV → remove the kymo.
  - extra: kymo missing a data/tracks/blue field → remove the kymo.
- After filtering/removing kymos, the script also rebuilds:
  - file_viewer (keeps only .h5 files referenced by retained kymos)
  - experiments[*].dataset (keeps only dataset entries matching retained kymos)
This keeps the updated .lake files internally consistent and avoids dangling references.
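For orientation, the parts of a .lake file that the update script reads or rewrites reduce to the skeleton below (field names are the ones the script accesses; all concrete values here are made up):

```python
import json

# Hypothetical minimal .lake skeleton: only the fields the update script
# touches are shown; real files carry many more keys and full track text.
lake = {
    "kymos": [
        {
            "address": {
                # last path segment is the kymo name used for CSV matching
                "path": "kymos/p967_250704_502_10pN_ch4_0bar_b4_1",
                "file": "p967_250704_502_10pN_ch4_0bar_b4_1.h5",
            },
            "data": {"tracks": {"blue": "# exported tracks\n# track index; ...\n"}},
        }
    ],
    "file_viewer": ["p967_250704_502_10pN_ch4_0bar_b4_1.h5"],
    "experiments": {
        "exp_0": {
            "dataset": [
                {"address": {"path": "kymos/p967_250704_502_10pN_ch4_0bar_b4_1"}}
            ]
        }
    },
}

# The structure must survive a JSON round-trip, since the script rewrites it.
assert json.loads(json.dumps(lake)) == lake
```

Keeping file_viewer and experiments[*].dataset restricted to the same names as the retained kymos is exactly what prevents dangling references.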
Scripts used (code snippets)
1) 1_filter_track.py
import pandas as pd
import glob
import os
# === Parameters ===
input_folder = "./data"
output_folder = "./filtered"
separated_folder = "./separated"
# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8
# Column names (based on CSV header)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"
# Filter configurations
filter_configs = [
{
"label": "position",
"min_lifetime": 0.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm",
},
{
"label": "position_1s",
"min_lifetime": 1.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
},
{
"label": "position_5s",
"min_lifetime": 5.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
},
{
"label": "position_2s",
"min_lifetime": 2.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
},
{
"label": "lifetime_5s_only",
"min_lifetime": 5.0,
"min_binding_pos": None,
"max_binding_pos": None,
"desc": "Tracks with lifetime ≥ 5 s, no position filter",
},
]
def load_csv(filepath):
"""
Load a blue track CSV:
- find header line starting with '# track index'
- read data rows (semicolon-separated, skipping header lines)
- set lowercase column names based on the header line
"""
try:
with open(filepath, "r") as f:
lines = f.readlines()
if not lines:
raise ValueError(f"File {filepath} is empty")
header_line = None
for line in lines:
if line.startswith("# track index"):
header_line = line.lstrip("# ").strip()
break
if header_line is None:
raise ValueError(
f"No header line starting with '# track index' found in {filepath}"
)
df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
df.columns = [c.strip().lower() for c in header_line.split(";")]
required_cols = [track_col, time_col, position_col]
missing_cols = [col for col in required_cols if col not in df.columns]
if missing_cols:
raise ValueError(
f"Missing required columns in {filepath}: {missing_cols}"
)
return df, lines[0].strip(), header_line
except Exception as e:
print(f"Error loading {filepath}: {e}")
return None, None, None
def compute_lifetime(track_df):
"""
Compute the lifetime of one track as (max time - min time) in seconds.
"""
if track_df[time_col].empty:
return 0.0
return track_df[time_col].max() - track_df[time_col].min()
# === Main Processing ===
os.makedirs(output_folder, exist_ok=True)
os.makedirs(separated_folder, exist_ok=True)
for filepath in glob.glob(os.path.join(input_folder, "*.csv")):
print(f"\n=== Processing input file: {filepath} ===")
df, header1, header2 = load_csv(filepath)
if df is None:
continue
base = os.path.splitext(os.path.basename(filepath))[0]
total_tracks_in_file = df[track_col].nunique()
print(f" Total tracks in file: {total_tracks_in_file}")
for config in filter_configs:
label = config["label"]
min_lifetime = config["min_lifetime"]
min_binding_pos = config.get("min_binding_pos")
max_binding_pos = config.get("max_binding_pos")
kept_tracks = []
removed_tracks = []
fail_pos_only = 0
fail_life_only = 0
fail_both = 0
for track_id, track_df in df.groupby(track_col):
if track_df.empty:
continue
binding_pos = track_df[position_col].iloc[0]
lifetime = compute_lifetime(track_df)
# Position check only if min/max are defined
position_ok = True
if min_binding_pos is not None and binding_pos < min_binding_pos:
position_ok = False
if max_binding_pos is not None and binding_pos > max_binding_pos:
position_ok = False
lifetime_ok = lifetime >= min_lifetime
if position_ok and lifetime_ok:
kept_tracks.append(track_df)
else:
removed_tracks.append(track_df)
if not position_ok and not lifetime_ok:
fail_both += 1
elif not position_ok:
fail_pos_only += 1
elif not lifetime_ok:
fail_life_only += 1
n_kept = len(kept_tracks)
n_removed = len(removed_tracks)
total_tracks = n_kept + n_removed
print(
f" [{label}] tracks kept: {n_kept}/{total_tracks} "
f"(removed: {n_removed}; "
f"fail_pos_only={fail_pos_only}, "
f"fail_life_only={fail_life_only}, "
f"fail_both={fail_both})"
)
# --- Write filtered (kept) file or placeholder file ---
outpath = os.path.join(output_folder, f"{base}_{label}.csv")
if n_kept > 0:
# Normal case: some tracks passed the filter
kept_df = pd.concat(kept_tracks, ignore_index=True)
with open(outpath, "w") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
print(f" -> Saved filtered tracks ({config['desc']}): {outpath}")
else:
# NEW: no track passed the filter → write header-only placeholder file
with open(outpath, "w") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
f.write(
f"# no tracks passed the '{label}' filter for {base}; "
f"placeholder file for downstream processing\n"
)
print(
f" -> No tracks passed the '{label}' filter for {base}. "
f"Created header-only placeholder: {outpath}"
)
# Save removed tracks (optional, still useful for debugging)
if removed_tracks:
removed_df = pd.concat(removed_tracks, ignore_index=True)
sep_outpath = os.path.join(
separated_folder, f"{base}_removed_{label}.csv"
)
with open(sep_outpath, "w") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
removed_df.to_csv(
sep_outpath, mode="a", sep=";", index=False, header=False
)
# print(f" -> Saved removed tracks: {sep_outpath}")
print("\nProcessing complete.")
2) 1_filter_track_debug.py
import pandas as pd
import glob
import os
import argparse
from typing import Optional, Tuple
# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8
# Column names (based on CSV header, after lowercasing)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"
# Filter configurations
filter_configs = [
{
"label": "position",
"min_lifetime": 0.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm",
},
{
"label": "position_1s",
"min_lifetime": 1.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
},
{
"label": "position_5s",
"min_lifetime": 5.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
},
{
"label": "position_2s",
"min_lifetime": 2.0,
"min_binding_pos": default_min_binding_pos,
"max_binding_pos": default_max_binding_pos,
"desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
},
{
"label": "lifetime_5s_only",
"min_lifetime": 5.0,
"min_binding_pos": None,
"max_binding_pos": None,
"desc": "Tracks with lifetime ≥ 5 s, no position filter",
},
]
def parse_args():
p = argparse.ArgumentParser(description="Filter blue track CSVs and emit debug reports per track.")
p.add_argument("--input_folder", "-i", default="./data", help="Folder containing input *_blue.csv files.")
p.add_argument("--output_folder", "-o", default="./filtered", help="Folder to write filtered CSVs.")
p.add_argument("--separated_folder", "-s", default="./separated", help="Folder to write removed-tracks CSVs.")
p.add_argument("--debug_folder", "-d", default="./debug_reports", help="Folder to write per-track debug reports.")
p.add_argument(
"--only",
default=None,
help="Optional: only process files whose basename contains this substring (e.g. 'p967_250704_502_10pN_ch4_0bar_b4_1_blue').",
)
p.add_argument(
"--binding_pos_method",
choices=["first_non_nan", "median", "mean"],
default="first_non_nan",
help="How to compute 'binding position' per track for the position filter.",
)
p.add_argument("--verbose", action="store_true", help="Print per-track debug for removed tracks (can be noisy).")
return p.parse_args()
def _coerce_numeric(series: pd.Series) -> pd.Series:
"""
Coerce numeric robustly:
- supports comma decimal separators: '1,23' -> '1.23'
- invalid parses become NaN
"""
s = series.astype(str).str.strip()
s = s.str.replace(",", ".", regex=False)
return pd.to_numeric(s, errors="coerce")
def load_csv(filepath: str) -> Tuple[Optional[pd.DataFrame], Optional[str], Optional[str]]:
"""
Load a blue track CSV:
- find header line starting with '# track index'
- read data rows (semicolon-separated, skipping 2 header lines)
- set lowercase column names based on the header line
- coerce time/position to numeric robustly (comma decimals supported)
"""
try:
with open(filepath, "r", encoding="utf-8") as f:
lines = f.readlines()
if not lines:
raise ValueError(f"File {filepath} is empty")
header_line = None
for line in lines:
if line.startswith("# track index"):
header_line = line.lstrip("# ").strip()
break
if header_line is None:
raise ValueError(f"No header line starting with '# track index' found in {filepath}")
df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
df.columns = [c.strip().lower() for c in header_line.split(";")]
required_cols = [track_col, time_col, position_col]
missing_cols = [col for col in required_cols if col not in df.columns]
if missing_cols:
raise ValueError(f"Missing required columns in {filepath}: {missing_cols}")
# Robust numeric conversion
df[time_col] = _coerce_numeric(df[time_col])
df[position_col] = _coerce_numeric(df[position_col])
# If conversion introduced NaNs, keep them but warn (important for debugging)
n_time_nan = int(df[time_col].isna().sum())
n_pos_nan = int(df[position_col].isna().sum())
if n_time_nan > 0 or n_pos_nan > 0:
print(
f" [WARN] {os.path.basename(filepath)}: NaNs after numeric parsing "
f"(time NaN={n_time_nan}, position NaN={n_pos_nan}). "
f"This can cause lifetime=0 or position filters to behave unexpectedly."
)
return df, lines[0].strip(), header_line
except Exception as e:
print(f"Error loading {filepath}: {e}")
return None, None, None
def compute_lifetime(track_df: pd.DataFrame) -> float:
"""
Lifetime = max(time) - min(time), using only non-NaN times.
"""
t = track_df[time_col].dropna()
if t.empty:
return 0.0
return float(t.max() - t.min())
def compute_binding_pos(track_df: pd.DataFrame, method: str) -> float:
"""
Binding position metric used for filtering.
"""
p = track_df[position_col].dropna()
if p.empty:
return float("nan")
if method == "first_non_nan":
return float(p.iloc[0])
if method == "median":
return float(p.median())
if method == "mean":
return float(p.mean())
return float(p.iloc[0])
def main():
args = parse_args()
os.makedirs(args.output_folder, exist_ok=True)
os.makedirs(args.separated_folder, exist_ok=True)
os.makedirs(args.debug_folder, exist_ok=True)
for filepath in glob.glob(os.path.join(args.input_folder, "*.csv")):
basefile = os.path.basename(filepath)
base = os.path.splitext(basefile)[0]
if args.only and args.only not in base:
continue
print(f"\n=== Processing input file: {filepath} ===")
df, header1, header2 = load_csv(filepath)
if df is None:
continue
total_tracks_in_file = df[track_col].nunique()
print(f" Total tracks in file: {total_tracks_in_file}")
for config in filter_configs:
label = config["label"]
min_lifetime = float(config["min_lifetime"])
min_binding_pos = config.get("min_binding_pos")
max_binding_pos = config.get("max_binding_pos")
kept_tracks = []
removed_tracks = []
# For debug report
track_rows = []
fail_pos_only = 0
fail_life_only = 0
fail_both = 0
for track_id, track_df in df.groupby(track_col):
if track_df.empty:
continue
binding_pos = compute_binding_pos(track_df, args.binding_pos_method)
lifetime = compute_lifetime(track_df)
# Position check only if min/max are defined, and treat NaN binding_pos as "fails position"
position_ok = True
if min_binding_pos is not None or max_binding_pos is not None:
if pd.isna(binding_pos):
position_ok = False
else:
if min_binding_pos is not None and binding_pos < float(min_binding_pos):
position_ok = False
if max_binding_pos is not None and binding_pos > float(max_binding_pos):
position_ok = False
lifetime_ok = lifetime >= min_lifetime
reason_parts = []
if not position_ok:
reason_parts.append("position_out_of_range_or_nan")
if not lifetime_ok:
reason_parts.append("lifetime_too_short_or_time_nan")
reason = "PASS" if (position_ok and lifetime_ok) else "+".join(reason_parts)
# Debug record per track
track_rows.append(
{
"track_id": track_id,
"n_points": int(len(track_df)),
"binding_pos_um": binding_pos,
"binding_pos_method": args.binding_pos_method,
"lifetime_s": lifetime,
"position_ok": bool(position_ok),
"lifetime_ok": bool(lifetime_ok),
"reason": reason,
"min_binding_pos_um": min_binding_pos,
"max_binding_pos_um": max_binding_pos,
"min_lifetime_s": min_lifetime,
}
)
if position_ok and lifetime_ok:
kept_tracks.append(track_df)
else:
removed_tracks.append(track_df)
if not position_ok and not lifetime_ok:
fail_both += 1
elif not position_ok:
fail_pos_only += 1
elif not lifetime_ok:
fail_life_only += 1
if args.verbose:
print(
f" [REMOVED {label}] track={track_id} "
f"binding_pos={binding_pos} lifetime={lifetime} reason={reason}"
)
n_kept = len(kept_tracks)
n_removed = len(removed_tracks)
total_tracks = n_kept + n_removed
print(
f" [{label}] tracks kept: {n_kept}/{total_tracks} "
f"(removed: {n_removed}; "
f"fail_pos_only={fail_pos_only}, "
f"fail_life_only={fail_life_only}, "
f"fail_both={fail_both})"
)
# --- Write filtered (kept) file or placeholder file ---
outpath = os.path.join(args.output_folder, f"{base}_{label}.csv")
if n_kept > 0:
kept_df = pd.concat(kept_tracks, ignore_index=True)
with open(outpath, "w", encoding="utf-8") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
print(f" -> Saved filtered tracks ({config['desc']}): {outpath}")
else:
with open(outpath, "w", encoding="utf-8") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
f.write(
f"# no tracks passed the '{label}' filter for {base}; "
f"placeholder file for downstream processing\n"
)
print(
f" -> No tracks passed the '{label}' filter for {base}. "
f"Created header-only placeholder: {outpath}"
)
# Save removed tracks (still useful)
if removed_tracks:
removed_df = pd.concat(removed_tracks, ignore_index=True)
sep_outpath = os.path.join(args.separated_folder, f"{base}_removed_{label}.csv")
with open(sep_outpath, "w", encoding="utf-8") as f:
f.write(f"{header1}\n")
f.write(f"# {header2}\n")
removed_df.to_csv(sep_outpath, mode="a", sep=";", index=False, header=False)
# --- NEW: per-track debug report ---
report_df = pd.DataFrame(track_rows).sort_values(["reason", "track_id"])
report_path = os.path.join(args.debug_folder, f"{base}_{label}_track_report.csv")
report_df.to_csv(report_path, index=False)
print(f" -> Wrote per-track debug report: {report_path}")
print("\nProcessing complete.")
if __name__ == "__main__":
main()
3) 2_update_lakes.py
import pandas as pd
import glob
import os
import json
import argparse
def parse_args():
parser = argparse.ArgumentParser(
description="Update merged lake files with filtered blue track CSVs."
)
parser.add_argument(
"--merged_lake_folder", "-m",
default="./",
help="Folder containing merged .lake files (default: ./)"
)
parser.add_argument(
"--filtered_folder", "-f",
default="./filtered",
help="Folder containing filtered blue track CSVs (default: ./filtered)"
)
parser.add_argument(
"--output_folder", "-o",
default="./updated_lakes",
help="Folder to write updated .lake files to (default: ./updated_lakes)"
)
return parser.parse_args()
def build_blue_text_from_csv(csv_path):
"""
Rebuild the 'blue' track text block for Lakeview from a filtered CSV file.
Returns
-------
blue_text : str
Header + (optional) data rows in Lakeview format.
n_rows : int
Number of data rows (tracks). If 0, the CSV is considered "header only"
/ no tracks after filtering.
"""
with open(csv_path, "r", encoding="utf-8") as f:
lines = f.readlines()
if len(lines) < 2:
raise ValueError(f"{csv_path} has fewer than 2 header lines. Please check the file.")
header1 = lines[0].strip() # first header line
header2 = lines[1].strip() # second header line with column names
data_lines = lines[2:]
# Check if there is any non-comment, non-empty data line
has_data = any(
(not ln.lstrip().startswith("#")) and ln.strip() != ""
for ln in data_lines
)
# Column names are taken from the second header line (strip leading '# ')
colnames = [c.strip() for c in header2.lstrip("# ").split(";")]
base_text = header1 + "\n" + header2 + "\n"
if not has_data:
# Header-only CSV -> no tracks after filtering
return base_text, 0
# Read data rows with pandas
df = pd.read_csv(csv_path, sep=";", comment="#", header=None, skiprows=2)
if df.shape[0] == 0:
# Safety net: no rows
return base_text, 0
df.columns = colnames
n_rows = len(df)
txt = base_text
for _, row in df.iterrows():
row_str = ";".join(str(row[c]) for c in colnames)
txt += row_str + "\n"
return txt, n_rows
def find_matching_csv(filtered_folder, kymo_name, i):
"""
Try to find the filtered CSV corresponding to a given kymo_name.
1) First, try exact match:
<kymo_name>_blue*.csv
2) If not found, try a 'p'-patched version of the numeric chunk (e.g. 940 -> p940 or p940 -> 940)
"""
# 1) Exact match
pattern = os.path.join(filtered_folder, f"{kymo_name}_blue*.csv")
candidates = glob.glob(pattern)
if len(candidates) == 1:
return candidates[0]
elif len(candidates) > 1:
print(f" [kymo {i}] Multiple CSV matches for {kymo_name} (exact), skipping:")
for c in candidates:
print(f" - {c}")
return None # ambiguous
# 2) Fallback: patch the 3-digit numeric part by adding or removing 'p'
parts = kymo_name.split("_")
alt_candidates = []
for idx, part in enumerate(parts):
# Case A: pure 3-digit number (e.g. "940") -> try "p940"
if part.isdigit() and len(part) == 3:
alt_parts = parts.copy()
alt_parts[idx] = "p" + part
alt_name = "_".join(alt_parts)
alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
alt_candidates = glob.glob(alt_pattern)
if alt_candidates:
print(
f" [kymo {i}] No exact CSV for '{kymo_name}', "
f"but found match using '{alt_name}'."
)
break
# Case B: starts with 'p' and then 3 digits (e.g. "p940") -> try without 'p'
if part.startswith("p") and part[1:].isdigit() and len(part) == 4:
alt_parts = parts.copy()
alt_parts[idx] = part[1:] # drop the leading 'p'
alt_name = "_".join(alt_parts)
alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
alt_candidates = glob.glob(alt_pattern)
if alt_candidates:
print(
f" [kymo {i}] No exact CSV for '{kymo_name}', "
f"but found match using '{alt_name}'."
)
break
if len(alt_candidates) == 1:
return alt_candidates[0]
elif len(alt_candidates) > 1:
print(f" [kymo {i}] Multiple CSV matches for patched name, skipping:")
for c in alt_candidates:
print(f" - {c}")
return None
# Nothing found
return None
def main():
args = parse_args()
merged_lake_folder = args.merged_lake_folder
filtered_folder = args.filtered_folder
output_folder = args.output_folder
os.makedirs(output_folder, exist_ok=True)
# Global counters across all lakes
total_case1 = 0 # case1: CSV found & n_rows>0 → tracks updated (kymo kept)
total_case2 = 0 # case2: CSV exists, but no tracks remain after filtering (empty or error) → kymo removed
total_case3 = 0 # case3: no matching CSV → kymo removed
total_extra = 0 # extra: kymo without data/tracks/blue → removed
# Detailed lists of sample names (lake, kymo, ...)
case1_kymos = [] # (lake_file, kymo_name, csv_path)
case2_kymos = [] # (lake_file, kymo_name, csv_path, reason)
case3_kymos = [] # (lake_file, kymo_name)
extra_kymos = [] # (lake_file, kymo_name)
used_csv_paths = set() # CSVs that were actually matched to some kymo
# Loop over all merged .lake files
for lake_path in glob.glob(os.path.join(merged_lake_folder, "*.lake")):
base = os.path.basename(lake_path)
print(f"\n=== Processing lake file: {base} ===")
# per-lake list of removed kymograph names
removed_kymo_names = set()
# Load JSON from .lake file
with open(lake_path, "r", encoding="utf-8") as f:
lake = json.load(f)
old_kymos = lake.get("kymos", [])
new_kymos = [] # we will build a filtered list here
# Iterate over all kymos in this lake
for i, kymo in enumerate(old_kymos):
# Extract kymograph name from address.path (last segment of the path)
addr = kymo.get("address", {})
path = addr.get("path", "")
kymo_name = path.split("/")[-1] if path else None
if not kymo_name:
print(f" [kymo {i}] No valid name/path found, skipping.")
# keep it as-is (very unusual case)
new_kymos.append(kymo)
continue
# Find the corresponding filtered CSV
csv_path = find_matching_csv(filtered_folder, kymo_name, i)
if csv_path is None:
# case3: no CSV → remove kymo
print(
f" [kymo {i}] No suitable CSV found for '{kymo_name}' "
f"in {filtered_folder} → REMOVING kymograph from output lake."
)
total_case3 += 1
case3_kymos.append((base, kymo_name))
removed_kymo_names.add(kymo_name)
continue
csv_name = os.path.basename(csv_path)
used_csv_paths.add(os.path.abspath(csv_path))
# Build the new blue track text from the filtered CSV
try:
blue_text, n_rows = build_blue_text_from_csv(csv_path)
except Exception as e:
# case2: CSV present but not parseable
msg = f"read error: {e}"
print(f" [kymo {i}] Error reading {csv_name}: {msg} → REMOVING kymograph.")
total_case2 += 1
case2_kymos.append((base, kymo_name, csv_path, msg))
removed_kymo_names.add(kymo_name)
continue
if n_rows == 0:
# case2: CSV present but no tracks after filtering
msg = "0 tracks after filtering (header-only CSV)"
print(
f" [kymo {i}] CSV {csv_name} contains no tracks after filtering "
f"→ REMOVING kymograph."
)
total_case2 += 1
case2_kymos.append((base, kymo_name, csv_path, msg))
removed_kymo_names.add(kymo_name)
continue
# If we reach here, we have a non-empty CSV, so this is case1
try:
if "data" in kymo and "tracks" in kymo["data"] and "blue" in kymo["data"]["tracks"]:
kymo["data"]["tracks"]["blue"] = blue_text
new_kymos.append(kymo)
total_case1 += 1
case1_kymos.append((base, kymo_name, csv_path))
print(f" [kymo {i}] Updated blue tracks from {csv_name} (kept).")
else:
# extra: kymo structure has no blue field at all → remove
print(
f" [kymo {i}] Kymo '{kymo_name}' has no data/tracks/blue field "
f"→ REMOVING from output lake."
)
total_extra += 1
extra_kymos.append((base, kymo_name))
removed_kymo_names.add(kymo_name)
except Exception as e:
# treat write problems also as case2
msg = f"write error: {e}"
print(
f" [kymo {i}] Error writing tracks for {kymo_name}: {msg} "
f"→ REMOVING kymograph."
)
total_case2 += 1
case2_kymos.append((base, kymo_name, csv_path, msg))
removed_kymo_names.add(kymo_name)
# Replace kymos list with filtered one (case2/case3/extra removed)
lake["kymos"] = new_kymos
# ------------------------------------------------------
# NEW PART: rebuild file_viewer and experiments[*].dataset
# so that H5 links are consistent with the kept kymos.
# ------------------------------------------------------
kept_kymo_names = set()
file_viewer_files = []
for kymo in new_kymos:
addr = kymo.get("address", {})
path = addr.get("path", "")
file = addr.get("file", "")
if path:
name = path.split("/")[-1]
kept_kymo_names.add(name)
if file and file not in file_viewer_files:
file_viewer_files.append(file)
# 1) Root-level file_viewer: only files from kept kymos
if "file_viewer" in lake:
lake["file_viewer"] = file_viewer_files
# 2) Experiments datasets: keep only entries whose path matches kept kymo
if "experiments" in lake and isinstance(lake["experiments"], dict):
for exp_key, exp in lake["experiments"].items():
if not isinstance(exp, dict):
continue
dataset = exp.get("dataset")
if isinstance(dataset, list):
new_dataset = []
for item in dataset:
if not isinstance(item, dict):
continue
addr = item.get("address", {})
path = addr.get("path", "")
name = path.split("/")[-1] if path else None
if name in kept_kymo_names:
new_dataset.append(item)
exp["dataset"] = new_dataset
# Save updated lake JSON to output folder
out_path = os.path.join(output_folder, base)
with open(out_path, "w", encoding="utf-8") as f:
json.dump(lake, f, indent=4)
print(f"==> {base}: kept {len(new_kymos)} kymos after filtering, written to {out_path}")
# --- Global summary over all lakes ---
print("\n=== Summary over all processed lakes ===")
print(f" case1: updated kymos (CSV found & ≥1 track, kept) = {total_case1}")
print(f" case2: removed kymos (CSV exists, but no tracks remain after filtering) = {total_case2}")
print(f" case3: removed kymos (no matching CSV found) = {total_case3}")
print(f" extra: removed kymos (no data/tracks/blue field) = {total_extra}")
total_kymos = total_case1 + total_case2 + total_case3 + total_extra
print(f" total kymos classified (sum of the above) = {total_kymos}")
# CSV usage check
all_csv_paths = sorted(os.path.abspath(p) for p in glob.glob(os.path.join(filtered_folder, "*_blue*.csv")))
print(f"\nTotal CSV files in filtered_folder: {len(all_csv_paths)}")
print(f"CSV files actually used (matched to some kymo): {len(used_csv_paths)}")
unused_csv = [p for p in all_csv_paths if p not in used_csv_paths]
if unused_csv:
print("\nCSV files NOT used by any kymo (name mismatch / other replicates):")
for p in unused_csv:
print(f" {p}")
else:
print("\nAll CSV files in filtered_folder were used by at least one kymo.")
if __name__ == "__main__":
main()
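To drive a full refresh, 2_update_lakes.py can be launched once per filtered folder from Step 3. A sketch of such a driver (the per-rule output folder naming is my own convention, not taken from the scripts above):

```python
import subprocess
import sys

# Folders from Step 3; one update run per filtering rule.
FOLDERS = [
    "filtered_blue_position",
    "filtered_blue_position_1s",
    "filtered_blue_position_5s",
    "filtered_blue_lifetime_5s_only",
]

def build_cmd(folder):
    """Command line for one update run; each rule gets its own output folder."""
    return [
        sys.executable, "2_update_lakes.py",
        "--filtered_folder", folder,
        "--output_folder", f"updated_lakes_{folder}",
    ]

def run_all(folders=FOLDERS):
    # check=True aborts the batch if any single update run fails.
    for folder in folders:
        subprocess.run(build_cmd(folder), check=True)
```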