Daily Archives: January 22, 2026

From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)

Goal

For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:

  1. Export protein sequences as FASTA
  2. Annotate them with EggNOG-mapper (including the COG_category column)
  3. Summarize COG composition at two levels:

    • COG letters (J/A/K/…/S), including multi-letter cases like IQ
    • 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
  4. Export both summary statistics and the underlying protein IDs for each category/group.

Step 0 — Why this manual annotation approach is needed (non-model organism)

Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.


Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

Export sequences for each MS dataset:

python3 getProteinSequences_Proximity_4h.py      # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # > ALFApulldown_18h.fasta

Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py). Output: One FASTA per dataset (*.fasta), used as EggNOG input.

1B) (USED FOR RNA-SEQ, NOT USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta) Output: cleaned FASTA headers (CP020463_protein.fasta)
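
The header-cleanup helper (~/Scripts/update_fasta_header.py) is not reproduced in this post. A rough, hypothetical sketch of what such a cleanup could look like, assuming the downloaded GenBank protein FASTA carries a [locus_tag=...] tag in each header and we want each header reduced to that locus tag so IDs match the reference annotation:

# update_fasta_header.py (hypothetical sketch; the real script may differ)
import re
import sys

def clean_header(line: str) -> str:
    # Prefer the locus_tag if present, otherwise fall back to the first token.
    m = re.search(r"\[locus_tag=([^\]]+)\]", line)
    if m:
        return f">{m.group(1)}\n"
    return line.split()[0] + "\n"

def main(in_fasta: str, out_fasta: str) -> None:
    with open(in_fasta) as fin, open(out_fasta, "w") as fout:
        for line in fin:
            fout.write(clean_header(line) if line.startswith(">") else line)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])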


Step 2 — Generate EggNOG annotation files (*.emapper.annotations)

2A) Install EggNOG-mapper

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda   # eggnog-mapper 2.1.12
mamba activate eggnog_env

2B) Download / prepare EggNOG database

mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

2C) Run emapper.py on FASTA inputs

(USED FOR RNA-SEQ, NOT USED HERE) For the RNA-seq reference proteome (optional but useful for integration):

emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60    # --resume if needed

For MS protein sets (the ones used for COG clustering):

emapper.py -i Proximity_4h.fasta      -o eggnog_out_Proximity_4h      --cpu 60   # --resume
emapper.py -i Proximity_18h.fasta     -o eggnog_out_Proximity_18h     --cpu 60
emapper.py -i ALFApulldown_4h.fasta   -o eggnog_out_ALFApulldown_4h   --cpu 60
emapper.py -i ALFApulldown_18h.fasta  -o eggnog_out_ALFApulldown_18h  --cpu 60

Key outputs used downstream (one per dataset):

  • eggnog_out_Proximity_4h.emapper.annotations
  • eggnog_out_Proximity_18h.emapper.annotations
  • eggnog_out_ALFApulldown_4h.emapper.annotations
  • eggnog_out_ALFApulldown_18h.emapper.annotations

These files include a column named COG_category, which is the basis for the clustering below.
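
To sanity-check an annotation file before running the clustering script, you can load it with pandas and inspect the COG_category column. A quick check, assuming the usual emapper layout ('##' metadata lines plus one '#query' header line), which is the same layout parsed by read_emapper_annotations() in the Step 3 script:

import pandas as pd

path = "eggnog_out_Proximity_4h.emapper.annotations"
with open(path) as fh:
    lines = [ln.rstrip("\n") for ln in fh if ln.strip() and not ln.startswith("##")]

header = lines[0].lstrip("#").split("\t")          # '#query' becomes 'query'
rows = [ln.split("\t") for ln in lines[1:]]
df = pd.DataFrame(rows, columns=header)

print(df.shape[0], "annotated proteins")
print(df["COG_category"].value_counts().head(10))  # most frequent COG strings, incl. '-' and multi-letter ones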


Step 3 — COG clustering + reporting (this post’s main script)

Inputs

The four *.emapper.annotations files (must contain COG_category).

Outputs

All outputs are written to ./COG_outputs/:

Per dataset (Excel): COG_[Dataset].xlsx

  • COG_letters: per-letter count + percent
  • Debug: unassigned COG rows, R/S proportion, etc.
  • Protein_assignments: per-protein COG letters + functional groups
  • Protein_lists_by_COG: protein IDs per COG letter
  • Protein_lists_by_group: protein IDs per major functional class
  • Long_format_COG / Long_format_group: one row per protein per category/group (best for filtering)

Combined (Excel): COG_combined_summary.xlsx

  • combined counts/percents across datasets
  • combined long-format tables

Plots (PNG + PDF):

  • COG_grouped_barplot_percent_letters.* (COG letters across datasets)
  • COG_functional_groups.* (4 functional classes across datasets)

Interpretation notes

  • High “POORLY CHARACTERIZED” (R/S) is common when EggNOG cannot assign confident functions—e.g., many proteins are effectively “hypothetical protein”–like, strain-specific, or lack strong ortholog support.
  • Group totals may not sum to 100% because (see the small example after this list):

    • some proteins carry multi-letter COGs (e.g., IQ) → they are counted once per letter and can fall into more than one group
    • some proteins have no COG assignment (-) → they don’t contribute to any group
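
As a minimal illustration of both effects (the letter combinations are just examples; the splitting mirrors split_cog_letters() in the script below):

# "IQ" contributes two letters that map to one group, "KT" contributes two letters
# in two different groups, and "-" contributes nothing at all.
LETTER_TO_GROUP = {
    "I": "METABOLISM", "Q": "METABOLISM",
    "K": "INFORMATION STORAGE AND PROCESSING",
    "T": "CELLULAR PROCESSES AND SIGNALING",
}

for prot, cog in {"protA": "IQ", "protB": "KT", "protC": "-"}.items():
    letters = [] if cog in ("", "-") else list(cog)
    groups = sorted({LETTER_TO_GROUP[c] for c in letters})
    print(prot, letters, groups)
# protA ['I', 'Q'] ['METABOLISM']
# protB ['K', 'T'] ['CELLULAR PROCESSES AND SIGNALING', 'INFORMATION STORAGE AND PROCESSING']
# protC [] []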

Code snippet (generate Proximity_4h FASTA, used in Step 2)

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)

Code snippet (COG clustering script, used in Step 3)

#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)

Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
    * Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
    * Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
  (note: method still allows multi-group membership; we just don't print it in labels)
"""

import os
import re
import pandas as pd
import numpy as np

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
    "Proximity_4h":     "./eggnog_out_Proximity_4h.emapper.annotations",
    "Proximity_18h":    "./eggnog_out_Proximity_18h.emapper.annotations",
    "ALFApulldown_4h":  "./eggnog_out_ALFApulldown_4h.emapper.annotations",
    "ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}

OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)

ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]

COG_DESCRIPTIONS = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}

FUNCTIONAL_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
    "CELLULAR PROCESSES AND SIGNALING":   ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
    "METABOLISM":                        ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
    "POORLY CHARACTERIZED":              ['R', 'S'],
}

LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
    for c in letters:
        LETTER_TO_GROUP[c] = grp

# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
    rows = []
    header = None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue
            if line.startswith("#query"):
                header = line.lstrip("#").split("\t")
                continue
            if header is None and re.match(r"^query\t", line):
                header = line.split("\t")
                continue
            if header is None:
                continue
            rows.append(line.split("\t"))

    if header is None:
        raise ValueError(f"Could not find header line (#query/query) in: {path}")

    df = pd.DataFrame(rows, columns=header)
    if "query" not in df.columns and "#query" in df.columns:
        df = df.rename(columns={"#query": "query"})
    return df

def split_cog_letters(cog_str):
    if cog_str is None:
        return []
    cog_str = str(cog_str).strip()
    if cog_str == "" or cog_str == "-" or cog_str.lower() == "nan":
        return []
    letters = list(cog_str)
    return [c for c in letters if c in ALL_COG]

def count_cog_letters(df: pd.DataFrame) -> dict:
    counts = {c: 0 for c in ALL_COG}
    for x in df["COG_category"].astype(str).tolist():
        for c in split_cog_letters(x):
            counts[c] += 1
    return counts

def build_category_to_proteins(df: pd.DataFrame) -> dict:
    cat2prot = {c: set() for c in ALL_COG}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        for c in letters:
            cat2prot[c].add(q)
    return {c: sorted(list(v)) for c, v in cat2prot.items()}

def build_group_to_proteins(df: pd.DataFrame) -> dict:
    group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        for g in hit_groups:
            group2prot[g].add(q)
    return {g: sorted(list(v)) for g, v in group2prot.items()}

def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
        rows.append({
            "query": q,
            "COG_category_raw": cog,
            "COG_letters": "".join(sorted(set(letters))) if letters else "",
            "Functional_groups": "; ".join(groups) if groups else "",
            "Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA"])
        })
    out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
    return out.sort_values("query")

def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for c in ALL_COG:
        for pid in cat2prot.get(c, []):
            rows.append({
                "Dataset": dataset,
                "COG": c,
                "Description": COG_DESCRIPTIONS.get(c, ""),
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for g in FUNCTIONAL_GROUPS.keys():
        for pid in grp2prot.get(g, []):
            rows.append({
                "Dataset": dataset,
                "Functional_group": g,
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
    groups = list(FUNCTIONAL_GROUPS.keys())
    recs = []
    for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        rec = {"query": q}
        for g in groups:
            rec[g] = (g in hit_groups)
        recs.append(rec)

    out = pd.DataFrame(recs)
    out = out.groupby("query", as_index=False).max()
    return out

# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []

for ds_name, path in INPUT_FILES.items():
    print(f"\n--- {ds_name} ---\nReading: {path}")
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing input file: {path}")

    df = read_emapper_annotations(path)
    if "COG_category" not in df.columns:
        raise ValueError(f"{path} has no 'COG_category' column.")

    n_rows = df.shape[0]
    unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()

    # Letter-based counts and percents
    counts = count_cog_letters(df)
    total_letters = sum(counts.values())
    pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}

    counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
    pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)

    all_counts[ds_name] = counts_s
    all_pct[ds_name] = pct_s

    out_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "Count": counts_s.values,
        "Percent": pct_s.values,
    })

    dbg = pd.DataFrame([{
        "Dataset": ds_name,
        "Proteins_in_table": int(n_rows),
        "COG_unassigned_rows": int(unassigned),
        "Total_assigned_COG_letters": int(total_letters),
        "R_count": int(counts.get("R", 0)),
        "S_count": int(counts.get("S", 0)),
        "R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
    }])

    debug_summary.append(dbg.iloc[0].to_dict())

    assignment_tbl = build_assignment_table(df)
    cat2prot = build_category_to_proteins(df)
    grp2prot = build_group_to_proteins(df)

    # category (COG letter) protein lists
    cat_list_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
        "Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
    })

    # group protein lists
    grp_list_df = pd.DataFrame({
        "Functional_group": list(FUNCTIONAL_GROUPS.keys()),
        "N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
        "Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
    })

    df_long_cog = long_format_by_cog(cat2prot, ds_name)
    df_long_group = long_format_by_group(grp2prot, ds_name)
    all_long_cog.append(df_long_cog)
    all_long_group.append(df_long_group)

    out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
    with pd.ExcelWriter(out_xlsx) as writer:
        out_df.to_excel(writer, sheet_name="COG_letters", index=False)
        dbg.to_excel(writer, sheet_name="Debug", index=False)
        assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
        cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
        grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
        df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
        df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)

    print(f"Saved: {out_xlsx}")

# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())

df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)

combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
    df_counts.to_excel(writer, sheet_name="Counts")
    df_pct.to_excel(writer, sheet_name="Percent")
    pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
    # Combined long formats (includes protein IDs per letter/group across datasets)
    pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
    pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)

print(f"\nSaved combined summary: {combined_xlsx}")

# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()

y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
    ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)

ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()

plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")

# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []

# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []

for ds_name, path in INPUT_FILES.items():
    df = read_emapper_annotations(path)

    # (A) protein-based group membership and stats
    memb = protein_based_group_membership(df)
    n_proteins = memb.shape[0]
    counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
    pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
    group_counts[ds_name] = pd.Series(counts)
    group_pct[ds_name] = pd.Series(pct)

    for g in FUNCTIONAL_GROUPS.keys():
        ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
        for pid in ids:
            group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})

    # (B) add category (letter) protein IDs too
    cat2prot = build_category_to_proteins(df)
    all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))

df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)

# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
    sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
    for c in ALL_COG:
        ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
        cat_list_rows.append({
            "Dataset": ds,
            "COG": c,
            "Description": COG_DESCRIPTIONS.get(c, ""),
            "N_proteins": len(ids),
            "Protein_IDs": "; ".join(sorted(ids))
        })
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)

group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
    df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
    df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
    df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)

    # NEW: category protein IDs included here as well
    df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
    df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)

print(f"Saved functional-group workbook: {group_xlsx}")

# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()

x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
    ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)

ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()

plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)

print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")

print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))

End-to-end GO enrichment for non-model bacteria: MS → reference ID mapping with BLAST + Blast2GO (GUI) + R enrichment

This post summarizes a complete workflow to run GO enrichment for protein sets from mass spectrometry (MS) in a non-model organism, where:

  • MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
  • Standard R organism annotation packages (org.*.eg.db) are not available.
  • We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE.

The workflow produces:

  • 4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
  • 1 combined summary workbook across all datasets

1) Generate protein FASTA files

1.1 MS protein sequences (Proximity / ALFApulldown)

From MS results (protein ID lists), retrieve sequences and write FASTA:

python3 getProteinSequences_Proximity_4h.py      # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # > ALFApulldown_18h.fasta

1.2 Reference proteome sequences (for mapping + background)

Download the reference proteome FASTA from GenBank and standardize headers:

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

2) Optional functional annotation for merging MS + RNA-seq (EggNOG)

EggNOG gives KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and quick merging. (For enrichment, we will use Blast2GO GO terms later, which are typically more comprehensive for GO/EC.)

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env

download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60

# MS sets
emapper.py -i Proximity_4h.fasta     -o eggnog_out_Proximity_4h     --cpu 60
emapper.py -i Proximity_18h.fasta    -o eggnog_out_Proximity_18h    --cpu 60
emapper.py -i ALFApulldown_4h.fasta  -o eggnog_out_ALFApulldown_4h  --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60

3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)

Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.

3.1 Setup workspace

mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/

Launch Blast2GO:

~/Tools/Blast2GO/Blast2GO_Launcher

3.2 Step-by-step in Blast2GO GUI

STEP 1 — Load sequences (reference proteome)

  1. File → Load → Load Sequences
  2. Choose: Load Fasta File (.fasta)
  3. Select: CP020463_protein.fasta
  4. Tags: (leave default / none)
  5. Check that the table is filled with columns like Nr, SeqName

✅ Output: sequences are loaded into the project


STEP 2 — BLAST (QBlast)

  1. Go to the BLAST panel
  2. Choose blastp (protein vs protein)
  3. Database: nr (NCBI)
  4. Set other parameters as needed (defaults typically OK)
  5. Run QBlast

⚠️ This step is typically the most time-consuming (hours to days depending on dataset size and NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit if desired.

✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)


STEP 3 — Mapping

  1. Click Mapping

✅ Output:

  • Tags updated to MAPPED
  • Columns appear such as #GO, GO IDs, GO Names

STEP 4 — Annotation

  1. Click Annotation
  2. Key parameters you may set/keep:

    • Annotation CutOff (controls reliability threshold)
    • GO Weight (boosts more general terms when supported by multiple specific hits)

✅ Output:

  • Tags updated to ANNOTATED
  • Enzyme-related columns may appear (e.g., Enzyme Codes)

STEP 5 — Export Annotations (before merging InterPro)

  1. File → Export → Export Annotations
  2. Choose Export Annotations (.annot, custom, etc.)
  3. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot

✅ Output: blast2go_annot.annot


STEP 6 — InterProScan (optional but recommended for more GO terms)

  1. Click InterPro / InterProScan
  2. Run InterProScan (can be long: hours to >1 day depending on dataset & setup)

✅ Output:

  • Tags updated to INTERPRO
  • Additional columns: InterPro IDs, InterPro GO IDs/Names

STEP 7 — Merge InterProScan GOs into existing annotation

  1. In InterPro panel, choose: Merge InterProScan GOs to Annotation
  2. Confirm merge

✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)


STEP 8 — Export final annotations (after merging InterPro)

  1. File → Export → Export Annotations
  2. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2

✅ This is the final file used for enrichment:

  • blast2go_annot.annot2

Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.
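
The exported .annot/.annot2 file is a plain tab-separated table. As parsed by the R wrapper in Step 5, column 1 is the reference protein/locus ID and column 2 is a GO ID (or another annotation such as an EC number, which gets filtered out). A quick way to check how many GO terms each reference protein received (a sketch; adjust the filename/path as needed):

import csv
from collections import Counter

go_per_gene = Counter()
with open("blast2go_annot.annot2", newline="") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        # keep only rows whose second column is a GO ID
        if len(row) >= 2 and row[1].startswith("GO:"):
            go_per_gene[row[0]] += 1

print("Reference proteins with >=1 GO term:", len(go_per_gene))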


4) Generate BLAST mapping tables: *_vs_ref.blast.tsv

We map each MS protein set to the reference proteome locus tags using BLASTP.

4.1 Create BLAST database from reference proteome

makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db

4.2 BLASTP each MS set against the reference DB

blastp -query Proximity_4h.fasta -db CP020463_ref_db \
  -out Proximity_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query Proximity_18h.fasta -db CP020463_ref_db \
  -out Proximity_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
  -out ALFApulldown_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
  -out ALFApulldown_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16
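
Note that query coverage (qcov) is not one of the outfmt 6 columns; the R wrapper in Step 5 derives it as alignment length / qlen. A quick standalone check of the same filters on one table (a sketch using the thresholds from the wrapper):

import pandas as pd

cols = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
        "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen", "slen"]
bl = pd.read_csv("Proximity_4h_vs_ref.blast.tsv", sep="\t", names=cols)

bl["qcov"] = bl["length"] / bl["qlen"]   # query coverage as defined in the R wrapper
kept = bl[(bl["evalue"] <= 1e-10) & (bl["pident"] >= 70) & (bl["qcov"] >= 0.70)]
print(kept["qseqid"].nunique(), "of", bl["qseqid"].nunique(), "queries survive the filters")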

5) Run GO enrichment (4 sets + combined summary)

We run a single wrapper script that:

  • normalizes FASTA headers (handles UniParc/UniProt tr|... formats)
  • filters BLAST hits (pident/qcov/evalue thresholds)
  • picks one best hit per query
  • performs GO enrichment via clusterProfiler::enricher()
  • background (universe) = all reference proteins with ≥1 GO term in blast2go_annot.annot2
  • cutoff = BH/FDR adjusted p-value < 0.05
  • writes one Excel per dataset + one combined summary workbook

Rscript wrapper_besthit_GO_enrichment_4sets.R

Code snippets (used scripts)

A) Example script for Step 1: generate Proximity_4h FASTA

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)

B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(tibble)
  library(openxlsx)
  library(clusterProfiler)
  library(AnnotationDbi)
  library(GO.db)
})

# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2"  # or full path

datasets <- tibble::tribble(
  ~name,              ~blast_tsv,                          ~fasta,
  "Proximity_4h",     "Proximity_4h_vs_ref.blast.tsv",     "Proximity_4h.fasta",
  "Proximity_18h",    "Proximity_18h_vs_ref.blast.tsv",    "Proximity_18h.fasta",
  "ALFApulldown_4h",  "ALFApulldown_4h_vs_ref.blast.tsv",  "ALFApulldown_4h.fasta",
  "ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)

out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV   <- 0.70
MAX_EVALUE <- 1e-10

# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"

# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  parts <- stringr::str_split(x, "\\|", simplify = TRUE)
  is_uniprot <- ncol(parts) >= 3 & (parts[, 1] == "tr" | parts[, 1] == "sp")
  ifelse(is_uniprot, parts[, 2], parts[, 1])
}

norm_subject_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  stringr::str_split_fixed(x, "\\|", 2)[, 1]
}

add_go_desc <- function(df) {
  if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
  df$GO_Term <- vapply(df$ID, function(go_id) {
    term <- tryCatch(
      AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
      error = function(e) NULL
    )
    if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
  }, FUN.VALUE = character(1))
  df
}

safe_read_blast <- function(path) {
  cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
            "qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
  readr::read_tsv(
    path,
    col_names = cols,
    col_types = readr::cols(
      qseqid   = readr::col_character(),
      sseqid   = readr::col_character(),
      pident   = readr::col_double(),
      length   = readr::col_double(),
      mismatch = readr::col_double(),
      gapopen  = readr::col_double(),
      qstart   = readr::col_double(),
      qend     = readr::col_double(),
      sstart   = readr::col_double(),
      send     = readr::col_double(),
      evalue   = readr::col_double(),
      bitscore = readr::col_double(),
      qlen     = readr::col_double(),
      slen     = readr::col_double()
    ),
    progress = FALSE
  )
}

# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
                              stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term   <- as.character(annot_df$Term)

term2gene_go_ref <- annot_df %>%
  dplyr::filter(grepl("^GO:", Term)) %>%
  dplyr::transmute(term = Term, gene = GeneID) %>%
  dplyr::distinct()

universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")

# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms  <- list()
all_besthits  <- list()

# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {

  ds <- datasets[i, ]
  name <- ds$name
  blast_path <- ds$blast_tsv
  fasta_path <- ds$fasta

  cat("\n=============================\n")
  cat("Dataset:", name, "\n")

  if (!file.exists(blast_path)) {
    warning("Missing BLAST TSV: ", blast_path, " (skipping)")
    next
  }
  if (!file.exists(fasta_path)) {
    warning("Missing FASTA: ", fasta_path, " (skipping)")
    next
  }

  # ---- FASTA query IDs ----
  fa <- readLines(fasta_path, warn = FALSE)
  q_all <- fa[startsWith(fa, ">")]
  q_all <- gsub("^>", "", q_all)
  q_all <- unique(norm_mixed_id(q_all))
  n_fasta <- length(q_all)

  # ---- BLAST ----
  bl <- safe_read_blast(blast_path) %>%
    dplyr::mutate(
      qid  = norm_mixed_id(qseqid),
      sid  = norm_subject_id(sseqid),
      qcov = length / qlen
    )

  n_queries_with_hits <- dplyr::n_distinct(bl$qid)
  max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0

  cat("FASTA queries:", n_fasta, "\n")
  cat("BLAST rows:", nrow(bl), "\n")
  cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
  cat("Max hits/query:", max_hits_per_query, "\n")

  bl_f <- bl %>%
    dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)

  cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")

  best <- bl_f %>%
    dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
    dplyr::group_by(qid) %>%
    dplyr::slice_head(n = 1) %>%
    dplyr::ungroup()

  best_out <- best %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::arrange(dplyr::desc(bitscore))

  best_all <- tibble::tibble(qid = q_all) %>%
    dplyr::left_join(best_out, by = "qid")

  unmapped <- best_all %>%
    dplyr::filter(is.na(sid) | sid == "") %>%
    dplyr::distinct(qid) %>%
    dplyr::pull(qid)

  mapped <- best_out %>%
    dplyr::filter(!is.na(sid), sid != "")

  cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
  cat("Unmapped queries:", length(unmapped), "\n")
  cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
  cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")

  gene_list_ref <- unique(mapped$sid)
  n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
  cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")

  # Save best-hit TSVs
  fn_best     <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
  fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
  readr::write_tsv(best_out, fn_best)
  readr::write_tsv(best_all, fn_best_all)

  # GO enrichment
  go_res <- tryCatch(
    clusterProfiler::enricher(
      gene = gene_list_ref,
      TERM2GENE = term2gene_go_ref,
      universe = universe_ref,
      pvalueCutoff = P_CUT,
      pAdjustMethod = PADJ_METHOD
    ),
    error = function(e) NULL
  )

  go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
  go_df <- add_go_desc(go_df)

  cat("Enriched GO terms:", nrow(go_df), "\n")

  per_protein_go <- mapped %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::distinct() %>%
    dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
    dplyr::rename(GO = term)

  # Per-dataset Excel
  out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
  wb <- openxlsx::createWorkbook()

  openxlsx::addWorksheet(wb, "BestHit_Mapped")
  openxlsx::writeData(wb, "BestHit_Mapped", mapped)

  openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
  openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))

  openxlsx::addWorksheet(wb, "PerProtein_GO")
  openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)

  openxlsx::addWorksheet(wb, "GO_Enrichment")
  openxlsx::writeData(wb, "GO_Enrichment", go_df)

  openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
  cat("Saved:", out_xlsx, "\n")

  # Collect combined summary
  summary_runs[[name]] <- tibble::tibble(
    dataset = name,
    fasta_queries = n_fasta,
    blast_rows = nrow(bl),
    queries_with_hits = n_queries_with_hits,
    max_hits_per_query = max_hits_per_query,
    filtered_rows = nrow(bl_f),
    filtered_queries = dplyr::n_distinct(bl_f$qid),
    mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
    unmapped_queries = length(unmapped),
    unique_targets_besthit = dplyr::n_distinct(mapped$sid),
    duplicated_targets_besthit = sum(duplicated(mapped$sid)),
    mapped_targets_with_GO = n_targets_with_go,
    enriched_GO_terms = nrow(go_df)
  )

  all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
  all_go_terms[[name]]  <- dplyr::mutate(per_protein_go, dataset = name)
  all_besthits[[name]]  <- dplyr::mutate(mapped, dataset = name)
}

# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)

go_counts <- combined_per_protein_go %>%
  dplyr::filter(!is.na(GO), GO != "") %>%
  dplyr::distinct(dataset, qid, GO) %>%
  dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
  dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))

combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()

openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)

openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)

openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)

openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)

openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)

openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)

cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")

Lakeview/Lake file refresh pipeline: track filtering, filename normalization, and automated .lake updates

This pipeline updates existing Lakeview merged .lake files by replacing their blue track content with filtered track CSVs, while keeping original .h5 filenames/paths unchanged so Lakeview can still load the raw kymographs locally.

Step 1 — Fix naming mismatches before updating lakes

To ensure each kymograph (.h5) finds its corresponding filtered CSV, filename inconsistencies are corrected first:

  • _p940_ vs _940_: some filtered CSVs contain _p940_ while the corresponding .h5 uses _940_ → rename CSVs accordingly.
  • ch4 vs ch5: some filtered CSVs were labeled _ch4_ while the .h5 filenames use _ch5_ (or vice versa) → rename CSVs to match.
  • Extra CSVs without any matching .h5 are removed to avoid confusion later.

This step prevents Lakeview kymos from being dropped simply because no matching CSV is found.
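
A minimal sketch of this renaming step, assuming the CSVs sit in a single folder and only the two substring mismatches above need fixing (folder name, substrings, and replacement direction are assumptions; check them against your actual .h5 names before running):

import glob
import os

REPLACEMENTS = [("_p940_", "_940_"), ("_ch4_", "_ch5_")]  # adjust or flip direction as needed
csv_dir = "./filtered"                                    # assumed location of the track CSVs

for path in glob.glob(os.path.join(csv_dir, "*_blue*.csv")):
    new_name = os.path.basename(path)
    for old, new in REPLACEMENTS:
        new_name = new_name.replace(old, new)
    new_path = os.path.join(csv_dir, new_name)
    if new_path != path:
        print(f"Renaming {os.path.basename(path)} -> {new_name}")
        os.rename(path, new_path)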

Step 2 — Filter tracks and generate debug reports

Run the filtering scripts to create multiple filtered outputs from each raw *_blue.csv:

  • Binding position filter: 2.2–3.8 µm
  • Lifetime thresholds: ≥1s, ≥2s, ≥5s
  • A lifetime-only filter: ≥5s without a position constraint

The debug version additionally writes per-track reports (binding position, lifetime, pass/fail reason), which makes it much easier to spot issues caused by parsing, NaNs, or unexpected track structure.

Step 3 — Organize filtered outputs into separate folders

Move filtered files into dedicated directories so each downstream lake update corresponds to a single filtering rule:

  • filtered_blue_position (2.2–3.8 µm)
  • filtered_blue_position_1s (2.2–3.8 µm + ≥1s)
  • filtered_blue_position_5s (2.2–3.8 µm + ≥5s)
  • filtered_blue_lifetime_5s_only (≥5s, no position filter)
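
A minimal sketch of this organization step, assuming the filtered files follow the {base}_{label}.csv naming produced by 1_filter_track.py and that the labels map onto the folders above (the position_2s outputs have no dedicated folder here and are simply left in place):

import glob
import os
import shutil

LABEL_TO_FOLDER = {
    "position":         "filtered_blue_position",
    "position_1s":      "filtered_blue_position_1s",
    "position_5s":      "filtered_blue_position_5s",
    "lifetime_5s_only": "filtered_blue_lifetime_5s_only",
}

for label, folder in LABEL_TO_FOLDER.items():
    os.makedirs(folder, exist_ok=True)
    for path in glob.glob(f"./filtered/*_{label}.csv"):
        shutil.move(path, os.path.join(folder, os.path.basename(path)))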

Step 4 — Update .lake files using the filtered tracks

Run 2_update_lakes.py once per filtered folder to create updated .lake outputs (and logs):

  • For each kymo in each .lake, the script tries to find a matching *_blue*.csv.
  • Outcomes are classified:

    • case1: CSV found and contains ≥1 track → replace blue track text and keep the kymo.
    • case2: CSV found but no tracks remain after filtering (header-only / parse error) → remove the kymo.
    • case3: no matching CSV → remove the kymo.
    • extra: kymo missing a data/tracks/blue field → remove the kymo.
  • After filtering/removing kymos, the script also rebuilds:

    • file_viewer (keeps only .h5 files referenced by retained kymos)
    • experiments[*].dataset (keeps only dataset entries matching retained kymos)

This keeps the updated .lake files internally consistent and avoids dangling references.
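
The per-kymo decision logic can be summarized as a small pure function (a sketch of the classification only; the actual .lake editing in 2_update_lakes.py is not reproduced here, and the argument names are illustrative):

from typing import Optional

def classify_kymo(csv_path: Optional[str], n_tracks: Optional[int], has_blue_field: bool) -> str:
    """Classify one kymo according to the four outcomes described above."""
    if not has_blue_field:
        return "extra"    # kymo lacks a data/tracks/blue field -> remove
    if csv_path is None:
        return "case3"    # no matching CSV -> remove
    if not n_tracks:
        return "case2"    # header-only placeholder or parse error -> remove
    return "case1"        # CSV has >=1 track -> replace blue track text, keep kymo

print(classify_kymo("some_kymo_blue_position.csv", 12, True))  # case1
print(classify_kymo(None, None, True))                         # case3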


Scripts used (code snippets)

1) 1_filter_track.py

import pandas as pd
import glob
import os

# === Parameters ===
input_folder = "./data"
output_folder = "./filtered"
separated_folder = "./separated"

# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8

# Column names (based on CSV header)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"

# Filter configurations
filter_configs = [
    {
        "label": "position",
        "min_lifetime": 0.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm",
    },
    {
        "label": "position_1s",
        "min_lifetime": 1.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
    },
    {
        "label": "position_5s",
        "min_lifetime": 5.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
    },
    {
        "label": "position_2s",
        "min_lifetime": 2.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
    },
    {
        "label": "lifetime_5s_only",
        "min_lifetime": 5.0,
        "min_binding_pos": None,
        "max_binding_pos": None,
        "desc": "Tracks with lifetime ≥ 5 s, no position filter",
    },
]

def load_csv(filepath):
    """
    Load a blue track CSV:
    - find header line starting with '# track index'
    - read data rows (semicolon-separated, skipping header lines)
    - set lowercase column names based on the header line
    """
    try:
        with open(filepath, "r") as f:
            lines = f.readlines()
        if not lines:
            raise ValueError(f"File {filepath} is empty")

        header_line = None
        for line in lines:
            if line.startswith("# track index"):
                header_line = line.lstrip("# ").strip()
                break
        if header_line is None:
            raise ValueError(
                f"No header line starting with '# track index' found in {filepath}"
            )

        df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
        df.columns = [c.strip().lower() for c in header_line.split(";")]

        required_cols = [track_col, time_col, position_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(
                f"Missing required columns in {filepath}: {missing_cols}"
            )

        return df, lines[0].strip(), header_line
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None, None, None

def compute_lifetime(track_df):
    """
    Compute the lifetime of one track as (max time - min time) in seconds.
    """
    if track_df[time_col].empty:
        return 0.0
    return track_df[time_col].max() - track_df[time_col].min()

# === Main Processing ===
os.makedirs(output_folder, exist_ok=True)
os.makedirs(separated_folder, exist_ok=True)

for filepath in glob.glob(os.path.join(input_folder, "*.csv")):
    print(f"\n=== Processing input file: {filepath} ===")
    df, header1, header2 = load_csv(filepath)
    if df is None:
        continue

    base = os.path.splitext(os.path.basename(filepath))[0]
    total_tracks_in_file = df[track_col].nunique()
    print(f"  Total tracks in file: {total_tracks_in_file}")

    for config in filter_configs:
        label = config["label"]
        min_lifetime = config["min_lifetime"]
        min_binding_pos = config.get("min_binding_pos")
        max_binding_pos = config.get("max_binding_pos")

        kept_tracks = []
        removed_tracks = []

        fail_pos_only = 0
        fail_life_only = 0
        fail_both = 0

        for track_id, track_df in df.groupby(track_col):
            if track_df.empty:
                continue

            binding_pos = track_df[position_col].iloc[0]
            lifetime = compute_lifetime(track_df)

            # Position check only if min/max are defined
            position_ok = True
            if min_binding_pos is not None and binding_pos < min_binding_pos:
                position_ok = False
            if max_binding_pos is not None and binding_pos > max_binding_pos:
                position_ok = False

            lifetime_ok = lifetime >= min_lifetime

            if position_ok and lifetime_ok:
                kept_tracks.append(track_df)
            else:
                removed_tracks.append(track_df)
                if not position_ok and not lifetime_ok:
                    fail_both += 1
                elif not position_ok:
                    fail_pos_only += 1
                elif not lifetime_ok:
                    fail_life_only += 1

        n_kept = len(kept_tracks)
        n_removed = len(removed_tracks)
        total_tracks = n_kept + n_removed

        print(
            f"  [{label}] tracks kept: {n_kept}/{total_tracks} "
            f"(removed: {n_removed}; "
            f"fail_pos_only={fail_pos_only}, "
            f"fail_life_only={fail_life_only}, "
            f"fail_both={fail_both})"
        )

        # --- Write filtered (kept) file or placeholder file ---
        outpath = os.path.join(output_folder, f"{base}_{label}.csv")
        if n_kept > 0:
            # Normal case: some tracks passed the filter
            kept_df = pd.concat(kept_tracks, ignore_index=True)
            with open(outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
            kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
            print(f"    -> Saved filtered tracks ({config['desc']}): {outpath}")
        else:
            # NEW: no track passed the filter → write header-only placeholder file
            with open(outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
                f.write(
                    f"# no tracks passed the '{label}' filter for {base}; "
                    f"placeholder file for downstream processing\n"
                )
            print(
                f"    -> No tracks passed the '{label}' filter for {base}. "
                f"Created header-only placeholder: {outpath}"
            )

        # Save removed tracks (optional, still useful for debugging)
        if removed_tracks:
            removed_df = pd.concat(removed_tracks, ignore_index=True)
            sep_outpath = os.path.join(
                separated_folder, f"{base}_removed_{label}.csv"
            )
            with open(sep_outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
            removed_df.to_csv(
                sep_outpath, mode="a", sep=";", index=False, header=False
            )
            # print(f"    -> Saved removed tracks: {sep_outpath}")

print("\nProcessing complete.")

2) 1_filter_track_debug.py

import pandas as pd
import glob
import os
import argparse
from typing import Optional, Tuple

# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8

# Column names (based on CSV header, after lowercasing)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"

# Filter configurations
filter_configs = [
    {
        "label": "position",
        "min_lifetime": 0.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm",
    },
    {
        "label": "position_1s",
        "min_lifetime": 1.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
    },
    {
        "label": "position_5s",
        "min_lifetime": 5.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
    },
    {
        "label": "position_2s",
        "min_lifetime": 2.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
    },
    {
        "label": "lifetime_5s_only",
        "min_lifetime": 5.0,
        "min_binding_pos": None,
        "max_binding_pos": None,
        "desc": "Tracks with lifetime ≥ 5 s, no position filter",
    },
]

def parse_args():
    p = argparse.ArgumentParser(description="Filter blue track CSVs and emit debug reports per track.")
    p.add_argument("--input_folder", "-i", default="./data", help="Folder containing input *_blue.csv files.")
    p.add_argument("--output_folder", "-o", default="./filtered", help="Folder to write filtered CSVs.")
    p.add_argument("--separated_folder", "-s", default="./separated", help="Folder to write removed-tracks CSVs.")
    p.add_argument("--debug_folder", "-d", default="./debug_reports", help="Folder to write per-track debug reports.")
    p.add_argument(
        "--only",
        default=None,
        help="Optional: only process files whose basename contains this substring (e.g. 'p967_250704_502_10pN_ch4_0bar_b4_1_blue').",
    )
    p.add_argument(
        "--binding_pos_method",
        choices=["first_non_nan", "median", "mean"],
        default="first_non_nan",
        help="How to compute 'binding position' per track for the position filter.",
    )
    p.add_argument("--verbose", action="store_true", help="Print per-track debug for removed tracks (can be noisy).")
    return p.parse_args()

def _coerce_numeric(series: pd.Series) -> pd.Series:
    """
    Coerce numeric robustly:
    - supports comma decimal separators: '1,23' -> '1.23'
    - invalid parses become NaN
    """
    s = series.astype(str).str.strip()
    s = s.str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")

def load_csv(filepath: str) -> Tuple[Optional[pd.DataFrame], Optional[str], Optional[str]]:
    """
    Load a blue track CSV:
    - find header line starting with '# track index'
    - read data rows (semicolon-separated, skipping 2 header lines)
    - set lowercase column names based on the header line
    - coerce time/position to numeric robustly (comma decimals supported)
    """
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            lines = f.readlines()
        if not lines:
            raise ValueError(f"File {filepath} is empty")

        header_line = None
        for line in lines:
            if line.startswith("# track index"):
                header_line = line.lstrip("# ").strip()
                break
        if header_line is None:
            raise ValueError(f"No header line starting with '# track index' found in {filepath}")

        df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
        df.columns = [c.strip().lower() for c in header_line.split(";")]

        required_cols = [track_col, time_col, position_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns in {filepath}: {missing_cols}")

        # Robust numeric conversion
        df[time_col] = _coerce_numeric(df[time_col])
        df[position_col] = _coerce_numeric(df[position_col])

        # If conversion introduced NaNs, keep them but warn (important for debugging)
        n_time_nan = int(df[time_col].isna().sum())
        n_pos_nan = int(df[position_col].isna().sum())
        if n_time_nan > 0 or n_pos_nan > 0:
            print(
                f"  [WARN] {os.path.basename(filepath)}: NaNs after numeric parsing "
                f"(time NaN={n_time_nan}, position NaN={n_pos_nan}). "
                f"This can cause lifetime=0 or position filters to behave unexpectedly."
            )

        return df, lines[0].strip(), header_line
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None, None, None
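
# For reference, load_csv expects the exported blue-track CSVs to look roughly
# like this (a sketch with made-up values; real exports may carry additional
# columns and may use comma decimal separators):
#   <first header line of the export>
#   # track index;time (seconds);position (um);...
#   0;0.000;2.51
#   0;0.033;2.53
#   1;0.000;3.10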

def compute_lifetime(track_df: pd.DataFrame) -> float:
    """
    Lifetime = max(time) - min(time), using only non-NaN times.
    """
    t = track_df[time_col].dropna()
    if t.empty:
        return 0.0
    return float(t.max() - t.min())

def compute_binding_pos(track_df: pd.DataFrame, method: str) -> float:
    """
    Binding position metric used for filtering.
    """
    p = track_df[position_col].dropna()
    if p.empty:
        return float("nan")
    if method == "first_non_nan":
        return float(p.iloc[0])
    if method == "median":
        return float(p.median())
    if method == "mean":
        return float(p.mean())
    return float(p.iloc[0])

def main():
    args = parse_args()

    os.makedirs(args.output_folder, exist_ok=True)
    os.makedirs(args.separated_folder, exist_ok=True)
    os.makedirs(args.debug_folder, exist_ok=True)

    for filepath in glob.glob(os.path.join(args.input_folder, "*.csv")):
        basefile = os.path.basename(filepath)
        base = os.path.splitext(basefile)[0]

        if args.only and args.only not in base:
            continue

        print(f"\n=== Processing input file: {filepath} ===")
        df, header1, header2 = load_csv(filepath)
        if df is None:
            continue

        total_tracks_in_file = df[track_col].nunique()
        print(f"  Total tracks in file: {total_tracks_in_file}")

        for config in filter_configs:
            label = config["label"]
            min_lifetime = float(config["min_lifetime"])
            min_binding_pos = config.get("min_binding_pos")
            max_binding_pos = config.get("max_binding_pos")

            kept_tracks = []
            removed_tracks = []

            # For debug report
            track_rows = []

            fail_pos_only = 0
            fail_life_only = 0
            fail_both = 0

            for track_id, track_df in df.groupby(track_col):
                if track_df.empty:
                    continue

                binding_pos = compute_binding_pos(track_df, args.binding_pos_method)
                lifetime = compute_lifetime(track_df)

                # Position check only if min/max are defined, and treat NaN binding_pos as "fails position"
                position_ok = True
                if min_binding_pos is not None or max_binding_pos is not None:
                    if pd.isna(binding_pos):
                        position_ok = False
                    else:
                        if min_binding_pos is not None and binding_pos < float(min_binding_pos):
                            position_ok = False
                        if max_binding_pos is not None and binding_pos > float(max_binding_pos):
                            position_ok = False

                lifetime_ok = lifetime >= min_lifetime

                reason_parts = []
                if not position_ok:
                    reason_parts.append("position_out_of_range_or_nan")
                if not lifetime_ok:
                    reason_parts.append("lifetime_too_short_or_time_nan")
                reason = "PASS" if (position_ok and lifetime_ok) else "+".join(reason_parts)

                # Debug record per track
                track_rows.append(
                    {
                        "track_id": track_id,
                        "n_points": int(len(track_df)),
                        "binding_pos_um": binding_pos,
                        "binding_pos_method": args.binding_pos_method,
                        "lifetime_s": lifetime,
                        "position_ok": bool(position_ok),
                        "lifetime_ok": bool(lifetime_ok),
                        "reason": reason,
                        "min_binding_pos_um": min_binding_pos,
                        "max_binding_pos_um": max_binding_pos,
                        "min_lifetime_s": min_lifetime,
                    }
                )

                if position_ok and lifetime_ok:
                    kept_tracks.append(track_df)
                else:
                    removed_tracks.append(track_df)
                    if not position_ok and not lifetime_ok:
                        fail_both += 1
                    elif not position_ok:
                        fail_pos_only += 1
                    elif not lifetime_ok:
                        fail_life_only += 1

                    if args.verbose:
                        print(
                            f"    [REMOVED {label}] track={track_id} "
                            f"binding_pos={binding_pos} lifetime={lifetime} reason={reason}"
                        )

            n_kept = len(kept_tracks)
            n_removed = len(removed_tracks)
            total_tracks = n_kept + n_removed

            print(
                f"  [{label}] tracks kept: {n_kept}/{total_tracks} "
                f"(removed: {n_removed}; "
                f"fail_pos_only={fail_pos_only}, "
                f"fail_life_only={fail_life_only}, "
                f"fail_both={fail_both})"
            )

            # --- Write filtered (kept) file or placeholder file ---
            outpath = os.path.join(args.output_folder, f"{base}_{label}.csv")
            if n_kept > 0:
                kept_df = pd.concat(kept_tracks, ignore_index=True)
                with open(outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
                print(f"    -> Saved filtered tracks ({config['desc']}): {outpath}")
            else:
                with open(outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                    f.write(
                        f"# no tracks passed the '{label}' filter for {base}; "
                        f"placeholder file for downstream processing\n"
                    )
                print(
                    f"    -> No tracks passed the '{label}' filter for {base}. "
                    f"Created header-only placeholder: {outpath}"
                )

            # Save removed tracks (still useful)
            if removed_tracks:
                removed_df = pd.concat(removed_tracks, ignore_index=True)
                sep_outpath = os.path.join(args.separated_folder, f"{base}_removed_{label}.csv")
                with open(sep_outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                removed_df.to_csv(sep_outpath, mode="a", sep=";", index=False, header=False)

            # --- NEW: per-track debug report ---
            report_df = pd.DataFrame(track_rows)
            if not report_df.empty:  # guard: sort only if any track was seen
                report_df = report_df.sort_values(["reason", "track_id"])
            report_path = os.path.join(args.debug_folder, f"{base}_{label}_track_report.csv")
            report_df.to_csv(report_path, index=False)
            print(f"    -> Wrote per-track debug report: {report_path}")

    print("\nProcessing complete.")

if __name__ == "__main__":
    main()
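
A typical invocation of 1_filter_track_debug.py, spelled out with the argparse defaults shown above (the folder layout is just the script's default, not a requirement):

python3 1_filter_track_debug.py \
  -i ./data -o ./filtered -s ./separated -d ./debug_reports \
  --binding_pos_method first_non_nan \
  --verbose

Add --only <substring> to restrict the run to a single input file while debugging one specific kymograph.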

3) 2_update_lakes.py

import pandas as pd
import glob
import os
import json
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Update merged lake files with filtered blue track CSVs."
    )
    parser.add_argument(
        "--merged_lake_folder", "-m",
        default="./",
        help="Folder containing merged .lake files (default: ./)"
    )
    parser.add_argument(
        "--filtered_folder", "-f",
        default="./filtered",
        help="Folder containing filtered blue track CSVs (default: ./filtered)"
    )
    parser.add_argument(
        "--output_folder", "-o",
        default="./updated_lakes",
        help="Folder to write updated .lake files to (default: ./updated_lakes)"
    )
    return parser.parse_args()

def build_blue_text_from_csv(csv_path):
    """
    Rebuild the 'blue' track text block for Lakeview from a filtered CSV file.

    Returns
    -------
    blue_text : str
        Header + (optional) data rows in Lakeview format.
    n_rows : int
        Number of data rows (track points). If 0, the CSV is considered
        "header only", i.e. no tracks remain after filtering.
    """
    with open(csv_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if len(lines) < 2:
        raise ValueError(f"{csv_path} has fewer than 2 header lines. Please check the file.")

    header1 = lines[0].strip()  # first header line
    header2 = lines[1].strip()  # second header line with column names

    data_lines = lines[2:]

    # Check if there is any non-comment, non-empty data line
    has_data = any(
        (not ln.lstrip().startswith("#")) and ln.strip() != ""
        for ln in data_lines
    )

    # Column names are taken from the second header line (strip leading '# ')
    colnames = [c.strip() for c in header2.lstrip("# ").split(";")]

    base_text = header1 + "\n" + header2 + "\n"

    if not has_data:
        # Header-only CSV -> no tracks after filtering
        return base_text, 0

    # Read data rows with pandas
    df = pd.read_csv(csv_path, sep=";", comment="#", header=None, skiprows=2)
    if df.shape[0] == 0:
        # Safety net: no rows
        return base_text, 0

    df.columns = colnames
    n_rows = len(df)

    txt = base_text
    for _, row in df.iterrows():
        row_str = ";".join(str(row[c]) for c in colnames)
        txt += row_str + "\n"

    return txt, n_rows

def find_matching_csv(filtered_folder, kymo_name, i):
    """
    Try to find the filtered CSV corresponding to a given kymo_name.

    1) First, try exact match: <kymo_name>_blue*.csv
    2) If not found, try a 'p'-patched version of the numeric chunk (e.g. 940 -> p940 or p940 -> 940)
    """

    # 1) Exact match
    pattern = os.path.join(filtered_folder, f"{kymo_name}_blue*.csv")
    candidates = glob.glob(pattern)
    if len(candidates) == 1:
        return candidates[0]
    elif len(candidates) > 1:
        print(f"  [kymo {i}] Multiple CSV matches for {kymo_name} (exact), skipping:")
        for c in candidates:
            print(f"    - {c}")
        return None  # ambiguous

    # 2) Fallback: patch the 3-digit numeric part by adding or removing 'p'
    parts = kymo_name.split("_")
    alt_candidates = []

    for idx, part in enumerate(parts):
        # Case A: pure 3-digit number (e.g. "940") -> try "p940"
        if part.isdigit() and len(part) == 3:
            alt_parts = parts.copy()
            alt_parts[idx] = "p" + part
            alt_name = "_".join(alt_parts)
            alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
            alt_candidates = glob.glob(alt_pattern)
            if alt_candidates:
                print(
                    f"  [kymo {i}] No exact CSV for '{kymo_name}', "
                    f"but found match using '{alt_name}'."
                )
                break

        # Case B: starts with 'p' and then 3 digits (e.g. "p940") -> try without 'p'
        if part.startswith("p") and part[1:].isdigit() and len(part) == 4:
            alt_parts = parts.copy()
            alt_parts[idx] = part[1:]  # drop the leading 'p'
            alt_name = "_".join(alt_parts)
            alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
            alt_candidates = glob.glob(alt_pattern)
            if alt_candidates:
                print(
                    f"  [kymo {i}] No exact CSV for '{kymo_name}', "
                    f"but found match using '{alt_name}'."
                )
                break

    if len(alt_candidates) == 1:
        return alt_candidates[0]
    elif len(alt_candidates) > 1:
        print(f"  [kymo {i}] Multiple CSV matches for patched name, skipping:")
        for c in alt_candidates:
            print(f"    - {c}")
        return None

    # Nothing found
    return None

def main():
    args = parse_args()

    merged_lake_folder = args.merged_lake_folder
    filtered_folder    = args.filtered_folder
    output_folder      = args.output_folder

    os.makedirs(output_folder, exist_ok=True)

    # Global counters across all lakes
    total_case1 = 0          # case1: CSV found & n_rows>0 → tracks updated (kymo kept)
    total_case2 = 0          # case2: CSV exists, but no tracks remain after filtering (empty or error) → kymo removed
    total_case3 = 0          # case3: no matching CSV → kymo removed
    total_extra = 0          # extra: kymo without data/tracks/blue → removed

    # Detailed lists of sample names (lake, kymo, ...)
    case1_kymos = []         # (lake_file, kymo_name, csv_path)
    case2_kymos = []         # (lake_file, kymo_name, csv_path, reason)
    case3_kymos = []         # (lake_file, kymo_name)
    extra_kymos = []         # (lake_file, kymo_name)

    used_csv_paths = set()   # CSVs that were actually matched to some kymo

    # Loop over all merged .lake files
    for lake_path in glob.glob(os.path.join(merged_lake_folder, "*.lake")):
        base = os.path.basename(lake_path)
        print(f"\n=== Processing lake file: {base} ===")

        # per-lake list of removed kymograph names
        removed_kymo_names = set()

        # Load JSON from .lake file
        with open(lake_path, "r", encoding="utf-8") as f:
            lake = json.load(f)

        old_kymos = lake.get("kymos", [])
        new_kymos = []   # we will build a filtered list here

        # Iterate over all kymos in this lake
        for i, kymo in enumerate(old_kymos):
            # Extract kymograph name from address.path (last segment of the path)
            addr = kymo.get("address", {})
            path = addr.get("path", "")
            kymo_name = path.split("/")[-1] if path else None

            if not kymo_name:
                print(f"  [kymo {i}] No valid name/path found, skipping.")
                # keep it as-is (very unusual case)
                new_kymos.append(kymo)
                continue

            # Find the corresponding filtered CSV
            csv_path = find_matching_csv(filtered_folder, kymo_name, i)
            if csv_path is None:
                # case3: no CSV → remove kymo
                print(
                    f"  [kymo {i}] No suitable CSV found for '{kymo_name}' "
                    f"in {filtered_folder} → REMOVING kymograph from output lake."
                )
                total_case3 += 1
                case3_kymos.append((base, kymo_name))
                removed_kymo_names.add(kymo_name)
                continue

            csv_name = os.path.basename(csv_path)
            used_csv_paths.add(os.path.abspath(csv_path))

            # Build the new blue track text from the filtered CSV
            try:
                blue_text, n_rows = build_blue_text_from_csv(csv_path)
            except Exception as e:
                # case2: CSV present but not parseable
                msg = f"read error: {e}"
                print(f"  [kymo {i}] Error reading {csv_name}: {msg} → REMOVING kymograph.")
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)
                continue

            if n_rows == 0:
                # case2: CSV present but no tracks after filtering
                msg = "0 tracks after filtering (header-only CSV)"
                print(
                    f"  [kymo {i}] CSV {csv_name} contains no tracks after filtering "
                    f"→ REMOVING kymograph."
                )
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)
                continue

            # If we reach here, we have a non-empty CSV, so this is case1
            try:
                if "data" in kymo and "tracks" in kymo["data"] and "blue" in kymo["data"]["tracks"]:
                    kymo["data"]["tracks"]["blue"] = blue_text
                    new_kymos.append(kymo)
                    total_case1 += 1
                    case1_kymos.append((base, kymo_name, csv_path))
                    print(f"  [kymo {i}] Updated blue tracks from {csv_name} (kept).")
                else:
                    # extra: kymo structure has no blue field at all → remove
                    print(
                        f"  [kymo {i}] Kymo '{kymo_name}' has no data/tracks/blue field "
                        f"→ REMOVING from output lake."
                    )
                    total_extra += 1
                    extra_kymos.append((base, kymo_name))
                    removed_kymo_names.add(kymo_name)
            except Exception as e:
                # treat write problems also as case2
                msg = f"write error: {e}"
                print(
                    f"  [kymo {i}] Error writing tracks for {kymo_name}: {msg} "
                    f"→ REMOVING kymograph."
                )
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)

        # Replace kymos list with filtered one (case2/case3/extra removed)
        lake["kymos"] = new_kymos

        # ------------------------------------------------------
        #  NEW PART: rebuild file_viewer and experiments[*].dataset
        #  so that H5 links are consistent with the kept kymos.
        # ------------------------------------------------------
        kept_kymo_names = set()
        file_viewer_files = []

        for kymo in new_kymos:
            addr = kymo.get("address", {})
            path = addr.get("path", "")
            file = addr.get("file", "")
            if path:
                name = path.split("/")[-1]
                kept_kymo_names.add(name)
            if file and file not in file_viewer_files:
                file_viewer_files.append(file)

        # 1) Root-level file_viewer: only files from kept kymos
        if "file_viewer" in lake:
            lake["file_viewer"] = file_viewer_files

        # 2) Experiments datasets: keep only entries whose path matches kept kymo
        if "experiments" in lake and isinstance(lake["experiments"], dict):
            for exp_key, exp in lake["experiments"].items():
                if not isinstance(exp, dict):
                    continue
                dataset = exp.get("dataset")
                if isinstance(dataset, list):
                    new_dataset = []
                    for item in dataset:
                        if not isinstance(item, dict):
                            continue
                        addr = item.get("address", {})
                        path = addr.get("path", "")
                        name = path.split("/")[-1] if path else None
                        if name in kept_kymo_names:
                            new_dataset.append(item)
                    exp["dataset"] = new_dataset

        # Save updated lake JSON to output folder
        out_path = os.path.join(output_folder, base)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(lake, f, indent=4)
        print(f"==> {base}: kept {len(new_kymos)} kymos after filtering, written to {out_path}")

    # --- Global summary over all lakes ---
    print("\n=== Summary over all processed lakes ===")
    print(f"  case1: updated kymos (CSV found & ≥1 track, kept)  = {total_case1}")
    print(f"  case2: removed kymos (CSV exists, but no tracks remain after filtering)    = {total_case2}")
    print(f"  case3: removed kymos (no matching CSV found)       = {total_case3}")
    print(f"  extra: removed kymos (no data/tracks/blue field)   = {total_extra}")
    total_kymos = total_case1 + total_case2 + total_case3 + total_extra
    print(f"  total kymos classified (sum of the above)          = {total_kymos}")

    # CSV usage check
    all_csv_paths = sorted(os.path.abspath(p) for p in glob.glob(os.path.join(filtered_folder, "*_blue*.csv")))
    print(f"\nTotal CSV files in filtered_folder: {len(all_csv_paths)}")
    print(f"CSV files actually used (matched to some kymo): {len(used_csv_paths)}")

    unused_csv = [p for p in all_csv_paths if p not in used_csv_paths]
    if unused_csv:
        print("\nCSV files NOT used by any kymo (name mismatch / other replicates):")
        for p in unused_csv:
            print(f"  {p}")
    else:
        print("\nAll CSV files in filtered_folder were used by at least one kymo.")

if __name__ == "__main__":
    main()
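
A matching invocation sketch for 2_update_lakes.py, again using its argparse defaults (merged .lake files in the current folder, filtered CSVs in ./filtered, updated lakes written to ./updated_lakes):

python3 2_update_lakes.py \
  --merged_lake_folder ./ \
  --filtered_folder ./filtered \
  --output_folder ./updated_lakes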