A Child-Friendly Explanation of Chess Notation

♟️ A Child-Friendly Explanation of Chess Notation

(suitable for U8 / U10 / U12, tournament standard)

1️⃣ The basic idea of notation

In a tournament, every move played must be written down so that:

the game can be analyzed later
arbiters can settle disputes
the game can be rated for DWZ / Elo

👉 Notation simply means:

Which piece moves to which square.

2️⃣ Letters for the pieces

Piece	Letter	Explanation
King	K	K for King
Queen	Q	Q for Queen
Rook	R	R for Rook
Bishop	B	B for Bishop
Knight	N	N for Knight (K is already taken by the king)
Pawn	no letter	very important

⚠️ Pawns have no letter!

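3️⃣ The squares of the chessboard

Every square has a name:

a–h (files, i.e. columns, from left to right from White's point of view)
1–8 (ranks, i.e. rows, from bottom to top from White's point of view)

Examples: a1 is the bottom-left corner from White's point of view, h8 is the top-right corner, and e4 is the square in column e, row 4.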
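The naming scheme can also be checked with a few lines of Python. This is only a small sketch: the function name `square_to_coords` is made up for this illustration, and it simply turns a square name into zero-based (column, row) numbers counted from White's bottom-left corner.

```python
def square_to_coords(square: str) -> tuple[int, int]:
    """Convert a square name such as "e4" into zero-based (column, row) indices."""
    if len(square) != 2:
        raise ValueError(f"not a square name: {square!r}")
    column, row = square[0], square[1]
    if column not in "abcdefgh" or row not in "12345678":
        raise ValueError(f"not a square name: {square!r}")
    # Column a is index 0, row 1 is index 0 (White's bottom-left corner).
    return "abcdefgh".index(column), int(row) - 1


print(square_to_coords("a1"))  # (0, 0)
print(square_to_coords("e4"))  # (4, 3)
print(square_to_coords("h8"))  # (7, 7)
```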

4️⃣ Writing normal moves

Pawn moves

Pawn moves to e4 → e4
Pawn moves to d5 → d5

👉 Write only the target square.

Piece moves

Piece letter + target square

Examples:

Knight to f3 → Nf3
Bishop to c4 → Bc4
Queen to d1 → Qd1

5️⃣ Capturing (very important)

Pieces capture

Piece letter + x + target square

Examples:

Knight captures on e5 → Nxe5
Bishop captures on f7 → Bxf7

❌ Never write down which piece was captured!

Pawns capture (special rule)

For a pawn capture, you must write the file the pawn comes from:

Pawn from d4 captures on e5 → dxe5
Pawn from f3 captures on e4 → fxe4

6️⃣ Check, checkmate, castling

+ = check
# = checkmate
O-O = kingside (short) castling
O-O-O = queenside (long) castling

Examples:

Qh5+
Qh7#
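The rules so far (piece letter, "x" for captures, the starting file for pawn captures, "+" and "#") can be put together in one small function. This is a minimal sketch for illustration only: the name `write_move` and its parameters are invented here, and castling and the case of two identical pieces are deliberately left out.

```python
PIECE_LETTERS = {
    "king": "K", "queen": "Q", "rook": "R",
    "bishop": "B", "knight": "N", "pawn": "",  # pawns have no letter
}


def write_move(piece, target, capture=False, from_file="", check=False, mate=False):
    """Build the notation for one move from the rules described above."""
    if piece == "pawn" and capture:
        if not from_file:
            raise ValueError("a pawn capture needs the file the pawn comes from")
        prefix = from_file            # e.g. "d" in dxe5
    else:
        prefix = PIECE_LETTERS[piece]
    text = prefix + ("x" if capture else "") + target
    if mate:
        text += "#"
    elif check:
        text += "+"
    return text


print(write_move("pawn", "e4"))                               # e4
print(write_move("knight", "f3"))                             # Nf3
print(write_move("bishop", "f7", capture=True))               # Bxf7
print(write_move("pawn", "e5", capture=True, from_file="d"))  # dxe5
print(write_move("queen", "h7", mate=True))                   # Qh7#
```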

7️⃣ Two identical pieces can move to the same square

(This is very important for tournaments!)

Rule:

If two identical pieces could move to the same square, you must show which one is meant.

Example with knights

Both knights can go to d2:

Knight from b1 → Nbd2
Knight from f1 → Nfd2

Example with rooks

Both rooks stand on the e-file (on e1 and e8) and can go to e4:

Rook from rank 1 → R1e4
Rook from rank 8 → R8e4

👉 Add the file letter first; if both pieces stand on the same file, add the rank number instead, so that it is clear which piece moves.
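The choice between letter and number can be written as a tiny rule: use the file letter if the two pieces stand on different files, otherwise use the rank number. The sketch below assumes exactly two identical pieces; the function name `disambiguate` is made up for this example, and the rare case with three or more identical pieces is not handled.

```python
def disambiguate(piece_letter, from_square, other_from_square, target, capture=False):
    """Notation for the piece on from_square when a second identical piece
    on other_from_square could reach the same target square."""
    if from_square[0] != other_from_square[0]:
        hint = from_square[0]   # different files: the file letter is enough
    else:
        hint = from_square[1]   # same file: use the rank number
    return piece_letter + hint + ("x" if capture else "") + target


# Knights on b1 and f1, both can reach d2
print(disambiguate("N", "b1", "f1", "d2"))  # Nbd2
print(disambiguate("N", "f1", "b1", "d2"))  # Nfd2

# Rooks on e1 and e8, both can reach e4
print(disambiguate("R", "e1", "e8", "e4"))  # R1e4
print(disambiguate("R", "e8", "e1", "e4"))  # R8e4
```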

8️⃣ Sample game (about 10 moves)

1. e4     e5
2. Nf3    Nc6
3. Bc4    Bc5
4. d4     exd4
5. O-O    Nf6
6. e5     Ne4
7. Re1    d5
8. Bxd5   Qxd5
9. Nc3    Nxc3
10. bxc3

This game contains:

Pawn moves
Captures
Pawn captures
Castling
Several different pieces
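As a quick check, the sample game can be read by a short script that verifies each move looks like valid notation and counts the different kinds of moves. This is a rough sketch: the pattern `SAN` is a simplified check written for this example (it ignores pawn promotion, for instance), not a full chess parser.

```python
import re

# Simplified pattern: castling, or [piece letter][optional file/rank hint][x]target, then optional + or #
SAN = re.compile(r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8])[+#]?$")

GAME = """
1. e4 e5
2. Nf3 Nc6
3. Bc4 Bc5
4. d4 exd4
5. O-O Nf6
6. e5 Ne4
7. Re1 d5
8. Bxd5 Qxd5
9. Nc3 Nxc3
10. bxc3
"""


def extract_moves(game_text):
    """Return all moves in order, skipping move numbers such as '1.'."""
    return [tok for tok in game_text.split() if not tok.endswith(".")]


moves = extract_moves(GAME)
bad = [m for m in moves if not SAN.match(m)]
captures = [m for m in moves if "x" in m]
pawn_captures = [m for m in captures if m[0].islower()]  # starts with a file letter
castling = [m for m in moves if m.startswith("O-O")]

print(len(moves), "moves,", len(bad), "unreadable")  # 19 moves, 0 unreadable
print("captures:", captures)            # ['exd4', 'Bxd5', 'Qxd5', 'Nxc3', 'bxc3']
print("pawn captures:", pawn_captures)  # ['exd4', 'bxc3']
print("castling:", castling)            # ['O-O']
```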

9️⃣ The correct order in a tournament

Always like this:

Make the move
Press the clock
Write down the move

❌ Do not write the move before playing it
❌ Do not forget to write it down

🔟 The 5 most common mistakes children make

❌ Giving the pawn a letter (Pe4) ✅ e4
❌ Writing down which piece was captured ✅ only the target square
❌ Forgetting the "x" when capturing ✅ Nxe5
❌ Writing first, then moving ✅ move first
❌ Panicking in time trouble ✅ better to write slowly and neatly

✅ A rule to remember for children

**Which piece (pawn without a letter), where it moves, x when capturing, + when giving check.**

Data Processing Pipeline: RNA-seq × Exoproteome × MS (Proximity/ALFA)

Path: ~/DATA/Data_Michelle/MS_RNAseq_Venn_2026

This post documents the complete, reproducible pipeline used to generate a 3-circle Venn diagram integrating RNA-seq, exoproteome, and MS SIP (Proximity + ALFApulldown) datasets, harmonized at the locus-tag level.

The pipeline is intentionally verbose: all transformations, mappings, and decisions are logged so the analysis can be reconstructed later.

Overview of the Three Sets

Circle 1: RNA-seq
Union of all significantly regulated genes across six comparisons
Criteria: padj < 0.05 and |log2FC| > 2

Circle 2: Exoproteome
Union of six exoproteome conditions (MH/TSB × 1h, 4h, 18h)
Protein IDs (UniProt / UniParc) are mapped to locus tags via BLAST
Direction is defined without a cutoff:

log2FC > 0 → UP
log2FC < 0 → DOWN

Circle 3: MS SIP (ProALFA)
Union of four datasets:

Proximity 4h / 18h
ALFApulldown 4h / 18h

Protein IDs are mapped to locus tags via BLAST
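To make the set logic concrete, here is a minimal Python sketch of the RNA-seq significance filter and of how three locus-tag sets split into the seven Venn regions. It is an illustration only, not the R wrapper used in Step D: the function names, region labels, and all locus tags and values below are made up.

```python
def is_significant(padj, log2fc, padj_cut=0.05, lfc_cut=2.0):
    """RNA-seq criterion: padj < 0.05 and |log2FC| > 2 (missing values never pass)."""
    if padj is None or log2fc is None:
        return False
    return padj < padj_cut and abs(log2fc) > lfc_cut


def venn_regions(rna, exo, ms):
    """Split three sets of locus tags into the seven disjoint Venn regions."""
    return {
        "RNA_only":   rna - exo - ms,
        "Exo_only":   exo - rna - ms,
        "MS_only":    ms - rna - exo,
        "RNA_Exo":    (rna & exo) - ms,
        "RNA_MS":     (rna & ms) - exo,
        "Exo_MS":     (exo & ms) - rna,
        "RNA_Exo_MS": rna & exo & ms,
    }


# Made-up toy data: (locus_tag, padj, log2FC)
rna_rows = [
    ("B4U56_00001", 0.001, 3.2),   # passes
    ("B4U56_00002", 0.2, 4.0),     # fails padj
    ("B4U56_00003", 0.01, -2.5),   # passes (strongly down)
    ("B4U56_00004", 0.01, 1.9),    # fails |log2FC|
]
rna = {tag for tag, padj, lfc in rna_rows if is_significant(padj, lfc)}
exo = {"B4U56_00001", "B4U56_00005"}
ms = {"B4U56_00001", "B4U56_00003", "B4U56_00006"}

regions = venn_regions(rna, exo, ms)
for name, members in regions.items():
    print(name, sorted(members))

# The seven regions are disjoint and together cover the union of all three sets.
assert sum(len(m) for m in regions.values()) == len(rna | exo | ms)
```

A check like the final assertion is a cheap way to confirm that no locus tag was counted twice or lost when the regions are written to the workbook.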

Outputs

The pipeline produces:

🟢 3-circle Venn diagram (PNG + PDF)
📊 Detailed Excel workbook
- all Venn regions
- RNA / Exo / MS membership per locus tag
- regulation directions and source datasets
🧾 Pipeline log file
- counts per dataset
- mapping success
- duplicated IDs
- intersection sizes

Step A — Fetch exoproteome FASTA from UniProt

Exoproteome Excel sheets contain UniProt accessions, while RNA-seq uses locus tags. To enable mapping, protein sequences are downloaded from UniProt and UniParc.

Script

01_fetch_uniprot_fasta_from_exoproteome_excel.py

What it does

Reads exoproteome sheets from Excel
Extracts UniProt / UniParc IDs
Downloads FASTA via UniProt REST (with caching + fallback to UniParc)
Writes:
- one FASTA per sheet
- mapping tables
- UP / DOWN ID lists
- run log

Bash: fetch exoproteome FASTAs

python3 01_fetch_uniprot_fasta_from_exoproteome_excel.py \
  --excel "Zusammenfassung SIP und Exoproteome.xlsx" \
  --outdir ./exoproteome_fastas \
  --cache  ./uniprot_cache

Step B — Normalize FASTA headers (Exoproteome + MS)

FASTA headers from UniProt and MS are heterogeneous and may contain duplicates. All FASTA files are normalized to canonical, unique protein IDs.

Script

02_normalize_fasta_headers.py

What it does

Extracts canonical IDs (sp|…|, tr|…|, raw headers)
Enforces uniqueness (__dupN suffix if needed)
Writes an ID mapping table for traceability
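A short worked example may help to see what the normalization does to typical headers. The sketch below mirrors the logic described above in simplified form; the helper names `canonical_id` and `unique_ids` and all example headers are invented for illustration.

```python
import re
from collections import defaultdict


def canonical_id(header):
    """Reduce a FASTA header to its canonical protein ID."""
    h = header.strip()
    h = h[1:] if h.startswith(">") else h
    m = re.match(r"^(sp|tr)\|([^|]+)\|", h)   # UniProt style: sp|ACC|NAME or tr|ACC|NAME
    if m:
        return m.group(2)
    return h.split()[0].split("|")[0]          # otherwise: first token, before the first '|'


def unique_ids(headers):
    """Canonical IDs in input order; repeats get a __dupN suffix."""
    seen = defaultdict(int)
    out = []
    for h in headers:
        base = canonical_id(h)
        seen[base] += 1
        out.append(base if seen[base] == 1 else f"{base}__dup{seen[base]}")
    return out


# Made-up example headers
headers = [
    ">sp|P12345|EXMP_HYPO Example protein",
    ">tr|A0A0A0AAA1|A0A0A0AAA1_HYPO Another example",
    ">P12345|UniParc:UPI0000000001",
    ">my_protein some free-text description",
]
print(unique_ids(headers))
# ['P12345', 'A0A0A0AAA1', 'P12345__dup2', 'my_protein']
```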

Bash: normalize exoproteome FASTAs

# Normalized FASTAs are written to the current directory,
# which is where Steps C and D expect to find them.

for f in \
  Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
  Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    exoproteome_fastas/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Bash: normalize MS (Proximity / ALFA) FASTAs

for f in \
  Proximity_4h.fasta Proximity_18h.fasta \
  ALFApulldown_4h.fasta ALFApulldown_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    ../MS_GO_enrichments/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Step C — BLAST mapping to reference proteome

All exoproteome and MS proteins are mapped to locus tags using BLASTP against the reference proteome.

Bash: merge exoproteome FASTAs

cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
    Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
    > Exoproteome_ALL.fasta

Bash: build BLAST database (once)

makeblastdb -in CP020463_protein.fasta \
  -dbtype prot \
  -out CP020463_ref_db

Bash: BLAST exoproteome

blastp -query Exoproteome_ALL.fasta \
  -db CP020463_ref_db \
  -out Exoproteome_ALL_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 \
  -max_target_seqs 5 \
  -num_threads 16

Bash: select best hit per query (exoproteome)

awk -F'\t' 'BEGIN{OFS="\t"}{
  key=$1; bits=$12+0; e=$11+0
  if(!(key in best) || bits>best_bits[key] || (bits==best_bits[key] && e<best_e[key])){
    best[key]=$0; best_bits[key]=bits; best_e[key]=e
  }
} END{for(k in best) print best[k]}' \
Exoproteome_ALL_vs_ref.blast.tsv \
| sort -t$'\t' -k1,1 \
> Exoproteome_ALL_vs_ref.besthit.tsv
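For readers less familiar with awk, the same best-hit rule (highest bitscore, ties broken by lowest e-value) can be sketched in Python. This is an equivalent illustration, not part of the pipeline; the function name `best_hits`, the helper `hit`, and the toy rows are made up, while the column positions follow the `-outfmt` string used above (evalue is column 11, bitscore is column 12).

```python
def best_hits(rows):
    """Keep one row per query: highest bitscore, then lowest e-value.

    rows: iterable of lists of strings in the tabular BLAST column order
    used above (qseqid first, evalue at index 10, bitscore at index 11).
    """
    best = {}
    for row in rows:
        query, evalue, bits = row[0], float(row[10]), float(row[11])
        current = best.get(query)
        if (current is None or bits > current[0]
                or (bits == current[0] and evalue < current[1])):
            best[query] = (bits, evalue, row)
    # Sort by query ID, like the `sort -k1,1` step
    return {q: best[q][2] for q in sorted(best)}


def hit(query, subject, evalue, bits):
    """Build a toy 14-column row; only the fields used here carry meaning."""
    return [query, subject, "90.0", "100", "10", "0", "1", "100", "1", "100",
            evalue, bits, "100", "100"]


rows = [
    hit("P2", "B4U56_00200", "1e-30", "100"),
    hit("P1", "B4U56_00100", "1e-12", "55.0"),
    hit("P1", "B4U56_00090", "1e-120", "410"),   # higher bitscore wins
    hit("P2", "B4U56_00210", "1e-40", "100"),    # tie on bitscore: lower e-value wins
]
for query, row in best_hits(rows).items():
    print(query, row[1])
# P1 B4U56_00090
# P2 B4U56_00210
```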

Bash: BLAST MS datasets

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  blastp -query "${base}.fasta" \
    -db CP020463_ref_db \
    -out "${base}_vs_ref.blast.tsv" \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
    -evalue 1e-10 \
    -max_target_seqs 5 \
    -num_threads 16
done

Bash: best-hit selection (MS)

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  awk -F'\t' 'BEGIN{OFS="\t"}{
    q=$1; b=$12+0; e=$11+0
    if(!(q in best) || b>bb[q] || (b==bb[q] && e<be[q])){
      best[q]=$0; bb[q]=b; be[q]=e
    }
  } END{for(q in best) print best[q]}' \
  "${base}_vs_ref.blast.tsv" \
  | sort -t$'\t' -k1,1 \
  > "${base}_vs_ref.besthit.tsv"
done

Step D — R wrapper: integration, Venn, Excel, LOG

Script

03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

What it does

Reads RNA-seq Excel files (6 comparisons)
Reads exoproteome and MS sheets
Applies BLAST best-hit mappings
Builds union sets and intersections
Generates:
- Venn plot
- detailed Excel workbook
- comprehensive log file

Bash: run the final integration

Rscript 03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R
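The step "applies BLAST best-hit mappings" boils down to a lookup from protein ID to subject ID, followed by extracting the locus tag. The Python sketch below illustrates that idea under stated assumptions: the `B4U56_` plus five digits pattern is the one used in the R script, whereas the function name `map_to_locus_tags`, the protein IDs, and the subject ID formats are invented for this example.

```python
import re

LOCUS_TAG = re.compile(r"B4U56_\d{5}")


def map_to_locus_tags(protein_ids, besthit):
    """Map protein IDs to locus tags via a best-hit table (query -> subject id).

    Returns (mapped, unmapped) so that mapping success can be logged.
    """
    mapped, unmapped = {}, []
    for pid in protein_ids:
        subject = besthit.get(pid)
        match = LOCUS_TAG.search(subject) if subject else None
        if match:
            mapped[pid] = match.group(0)
        else:
            unmapped.append(pid)
    return mapped, unmapped


# Made-up best-hit table and protein list
besthit = {
    "P12345": "B4U56_00090",             # subject id is already a locus tag
    "A0A0A0AAA1": "cds_B4U56_00100_1",   # locus tag embedded in a longer id
    "Q99999": "unrelated_subject",       # hit without a recognizable locus tag
}
proteins = ["P12345", "A0A0A0AAA1", "Q99999", "X00000"]  # X00000 has no BLAST hit

mapped, unmapped = map_to_locus_tags(proteins, besthit)
print(mapped)    # {'P12345': 'B4U56_00090', 'A0A0A0AAA1': 'B4U56_00100'}
print(unmapped)  # ['Q99999', 'X00000']
print(f"mapping success: {len(mapped)}/{len(proteins)}")  # mapping success: 2/4
```

Keeping the unmapped IDs alongside the mapped ones is what allows the log to report mapping success per dataset.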

Final Note

Keeping Python, Bash, and R steps together is intentional: this pipeline is meant to be auditable, reproducible, and explainable, even months later when details are fuzzy.

Scripts Used

`01_fetch_uniprot_fasta_from_exoproteome_excel.py`

#!/usr/bin/env python3
"""
01_fetch_uniprot_fasta_from_exoproteome_excel.py (v3)

MS log2FC direction rule (no cutoff):
  Up    : log2FC > 0
  Down  : log2FC < 0
  Zero  : log2FC == 0
  NA    : missing/non-numeric log2FC

Reads per sheet:
  - protein IDs from column A (index 0)
  - log2FC from column D (index 3)

Outputs per sheet:
  - {sheet}.fasta
  - {sheet}.mapping.tsv
  - {sheet}.up.ids.txt / {sheet}.down.ids.txt
  - run.log

UniParc fallback method:
  1) Try UniProtKB FASTA: /uniprotkb/{ACC}.fasta
  2) If missing: UniParc search -> UPI
  3) Fetch UniParc FASTA: /uniparc/{UPI}.fasta
"""

import argparse
import os
import re
import time
from typing import Optional, Tuple

import pandas as pd
import requests

DEFAULT_SHEETS = [
    "Exoproteome MH 1h",
    "Exoproteome MH 4h",
    "Exoproteome MH 18h",
    "Exoproteome TSB 1h",
    "Exoproteome TSB 4h",
    "Exoproteome TSB 18h",
]

UNIPROTKB_FASTA = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"
UNIPARC_FASTA   = "https://rest.uniprot.org/uniparc/{upi}.fasta"
UNIPARC_SEARCH  = "https://rest.uniprot.org/uniparc/search"

def sanitize_name(s: str) -> str:
    s = s.strip()
    s = re.sub(r"[^\w\-\.]+", "_", s)
    s = re.sub(r"_+", "_", s)
    return s

def parse_id_cell(raw: str) -> Optional[str]:
    if raw is None:
        return None
    s = str(raw).strip()
    if not s or s.lower() in {"nan", "na"}:
        return None

    s = s.split()[0].strip()

    # header like tr|ACC|NAME
    if "|" in s:
        parts = s.split("|")
        if len(parts) >= 2 and parts[0] in {"tr", "sp"}:
            s = parts[1]
        else:
            s = parts[0]

    if re.fullmatch(r"UPI[0-9A-F]{10,}", s):
        return s
    if re.fullmatch(r"[A-Z0-9]{6,10}", s):
        return s
    return None

def to_float(x) -> Optional[float]:
    if x is None:
        return None
    try:
        s = str(x).strip()
        if s == "" or s.lower() in {"nan", "na"}:
            return None
        s = s.replace(",", ".")
        return float(s)
    except Exception:
        return None

def classify_direction_no_cutoff(log2fc: Optional[float]) -> str:
    if log2fc is None:
        return "NA"
    if log2fc > 0:
        return "Up"
    if log2fc < 0:
        return "Down"
    return "Zero"

def extract_ids_and_fc_from_sheet(excel_path: str, sheet: str) -> pd.DataFrame:
    df = pd.read_excel(excel_path, sheet_name=sheet, header=None)
    if df.shape[1] < 4:
        raise ValueError(f"Sheet '{sheet}' has <4 columns, cannot read column D (log2FC).")

    out_rows = []
    for _, row in df.iterrows():
        pid = parse_id_cell(row.iloc[0])      # col A
        if not pid:
            continue
        log2fc = to_float(row.iloc[3])        # col D
        out_rows.append({"protein_id": pid, "log2fc": log2fc})

    out = pd.DataFrame(out_rows)
    if out.empty:
        return out

    # de-duplicate by protein_id (keep first non-NA log2fc if possible)
    out["log2fc_isna"] = out["log2fc"].isna()
    out = out.sort_values(["protein_id", "log2fc_isna"]).drop_duplicates("protein_id", keep="first")
    out = out.drop(columns=["log2fc_isna"]).reset_index(drop=True)
    return out

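def http_get_text(session: requests.Session, url: str, params: Optional[dict] = None,
                  retries: int = 4, backoff: float = 1.5) -> Optional[str]:
    """GET url and return the body text, or None if it cannot be fetched.

    Retries on rate limiting (429), transient server errors (5xx), and
    network-level failures, waiting a little longer after each attempt.
    """
    for attempt in range(retries):
        try:
            r = session.get(url, params=params, timeout=60)
        except requests.RequestException:
            # timeout / connection error: treat like a transient failure
            time.sleep(backoff * (attempt + 1))
            continue
        if r.status_code == 200:
            return r.text
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(backoff * (attempt + 1))
            continue
        return None
    return None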

def fetch_uniprotkb_fasta(session: requests.Session, acc: str) -> Optional[str]:
    return http_get_text(session, UNIPROTKB_FASTA.format(acc=acc))

def resolve_upi_via_uniparc_search(session: requests.Session, acc: str) -> Optional[str]:
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(session, UNIPARC_SEARCH, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    if re.fullmatch(r"UPI[0-9A-F]{10,}", upi):
        return upi
    return None

def fetch_uniparc_fasta(session: requests.Session, upi: str) -> Optional[str]:
    return http_get_text(session, UNIPARC_FASTA.format(upi=upi))

def cache_get(cache_dir: str, key: str) -> Optional[str]:
    if not cache_dir:
        return None
    path = os.path.join(cache_dir, f"{key}.fasta")
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    return None

def cache_put(cache_dir: str, key: str, fasta: str) -> None:
    if not cache_dir:
        return
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.fasta")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fasta)

def rewrite_fasta_header(fasta: str, new_header: str) -> str:
    lines = fasta.splitlines()
    if not lines or not lines[0].startswith(">"):
        return fasta
    lines[0] = ">" + new_header
    # splitlines() drops the trailing newline, so always add exactly one back
    return "\n".join(lines) + "\n"

def fetch_best_fasta(session: requests.Session, pid: str, cache_dir: str = "") -> Tuple[Optional[str], str, Optional[str]]:
    """
    Returns (fasta_text, status, resolved_upi)
      status: uniprotkb | uniparc | uniparc_direct | not_found | *_cache
    """
    if pid.startswith("UPI"):
        cached = cache_get(cache_dir, pid)
        if cached:
            return cached, "uniparc_direct_cache", pid
        fasta = fetch_uniparc_fasta(session, pid)
        if fasta:
            cache_put(cache_dir, pid, fasta)
            return fasta, "uniparc_direct", pid
        return None, "not_found", pid

    cached = cache_get(cache_dir, pid)
    if cached:
        return cached, "uniprotkb_cache", None

    fasta = fetch_uniprotkb_fasta(session, pid)
    if fasta:
        cache_put(cache_dir, pid, fasta)
        return fasta, "uniprotkb", None

    upi = resolve_upi_via_uniparc_search(session, pid)
    if not upi:
        return None, "not_found", None

    cached2 = cache_get(cache_dir, upi)
    if cached2:
        return cached2, "uniparc_cache", upi

    fasta2 = fetch_uniparc_fasta(session, upi)
    if fasta2:
        cache_put(cache_dir, upi, fasta2)
        return fasta2, "uniparc", upi

    return None, "not_found", upi

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--excel", required=True, help="Input Excel file")
    ap.add_argument("--outdir", default="./exoproteome_fastas", help="Output folder")
    ap.add_argument("--cache", default="./uniprot_cache", help="Cache folder for downloaded FASTAs")
    ap.add_argument("--sleep", type=float, default=0.1, help="Sleep seconds between requests")
    ap.add_argument("--sheets", nargs="*", default=DEFAULT_SHEETS, help="Sheet names to process")
    args = ap.parse_args()

    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(args.cache, exist_ok=True)

    log_path = os.path.join(args.outdir, "run.log")
    def log(msg: str):
        print(msg)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")

    with open(log_path, "w", encoding="utf-8") as f:
        f.write("")

    xls = pd.ExcelFile(args.excel)
    missing = [s for s in args.sheets if s not in xls.sheet_names]
    if missing:
        log("WARNING: Missing sheets in workbook:")
        for s in missing:
            log(f"  - {s}")
        log("Continuing with available sheets only.\n")

    session = requests.Session()
    session.headers.update({"User-Agent": "exoproteome-fasta-fetch/3.0"})

    for sheet in args.sheets:
        if sheet not in xls.sheet_names:
            continue

        sheet_tag = sanitize_name(sheet)
        out_fasta = os.path.join(args.outdir, f"{sheet_tag}.fasta")
        out_map   = os.path.join(args.outdir, f"{sheet_tag}.mapping.tsv")
        out_up    = os.path.join(args.outdir, f"{sheet_tag}.up.ids.txt")
        out_down  = os.path.join(args.outdir, f"{sheet_tag}.down.ids.txt")

        tbl = extract_ids_and_fc_from_sheet(args.excel, sheet)
        log(f"\n=== Sheet: {sheet} ===")
        log(f"Extracted IDs (unique): {len(tbl)}")

        if tbl.empty:
            log("No IDs found. Skipping.")
            continue

        tbl["direction"] = [classify_direction_no_cutoff(x) for x in tbl["log2fc"].tolist()]

        n_fc = int(tbl["log2fc"].notna().sum())
        n_up = int((tbl["direction"] == "Up").sum())
        n_dn = int((tbl["direction"] == "Down").sum())
        n_zero = int((tbl["direction"] == "Zero").sum())
        n_na = int((tbl["direction"] == "NA").sum())

        log(f"log2FC present: {n_fc}/{len(tbl)}")
        log(f"Direction (no cutoff): Up={n_up}, Down={n_dn}, Zero={n_zero}, NA={n_na}")

        mapping_rows = []
        fasta_chunks = []

        n_ok = 0
        n_nf = 0

        for i, row in enumerate(tbl.itertuples(index=False), start=1):
            pid = row.protein_id
            log2fc = row.log2fc
            direction = row.direction

            fasta, status, upi = fetch_best_fasta(session, pid, cache_dir=args.cache)

            if fasta:
                header = pid
                if upi and upi != pid:
                    header = f"{pid}|UniParc:{upi}"
                fasta = rewrite_fasta_header(fasta, header)
                fasta_chunks.append(fasta)
                n_ok += 1
            else:
                n_nf += 1

            mapping_rows.append({
                "sheet": sheet,
                "protein_id_input": pid,
                "resolved_upi": upi if upi else "",
                "status": status,
                "log2fc": "" if log2fc is None else log2fc,
                "direction": direction,
                "fasta_written": "yes" if fasta else "no",
            })

            if args.sleep > 0:
                time.sleep(args.sleep)

            if i % 50 == 0:
                log(f"  progress: {i}/{len(tbl)} (ok={n_ok}, not_found={n_nf})")

        with open(out_fasta, "w", encoding="utf-8") as f:
            for chunk in fasta_chunks:
                f.write(chunk.strip() + "\n")

        map_df = pd.DataFrame(mapping_rows)
        map_df.to_csv(out_map, sep="\t", index=False)

        up_ids = map_df.loc[map_df["direction"] == "Up", "protein_id_input"].astype(str).tolist()
        dn_ids = map_df.loc[map_df["direction"] == "Down", "protein_id_input"].astype(str).tolist()
        with open(out_up, "w", encoding="utf-8") as f:
            f.write("\n".join(up_ids) + ("\n" if up_ids else ""))
        with open(out_down, "w", encoding="utf-8") as f:
            f.write("\n".join(dn_ids) + ("\n" if dn_ids else ""))

        log(f"Fetched FASTA: {n_ok}/{len(tbl)} (not_found={n_nf})")
        log(f"Saved FASTA: {out_fasta}")
        log(f"Saved mapping: {out_map}")
        log(f"Saved Up IDs:  {out_up} ({len(up_ids)})")
        log(f"Saved Down IDs:{out_down} ({len(dn_ids)})")

    log("\nDONE.")

if __name__ == "__main__":
    main()

`02_normalize_fasta_headers.py`

#!/usr/bin/env python3
import re
import sys
from collections import defaultdict

def canon_id(header: str) -> str:
    h = header.strip()
    h = h[1:] if h.startswith(">") else h

    # tr|ACC|NAME or sp|ACC|NAME
    m = re.match(r'^(sp|tr)\|([^|]+)\|', h)
    if m:
        return m.group(2)

    # otherwise: first token, then before first '|'
    first = h.split()[0]
    return first.split("|")[0]

def read_fasta(path):
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        header = None
        seq = []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header = line
                seq = []
            else:
                seq.append(line.strip())
        if header is not None:
            yield header, "".join(seq)

def main():
    if len(sys.argv) < 3:
        print("Usage: 02_normalize_fasta_headers.py in.fasta out.fasta [map.tsv]", file=sys.stderr)
        sys.exit(1)

    in_fa  = sys.argv[1]
    out_fa = sys.argv[2]
    map_tsv = sys.argv[3] if len(sys.argv) >= 4 else out_fa + ".idmap.tsv"

    seen = defaultdict(int)

    with open(out_fa, "w", encoding="utf-8") as fo, open(map_tsv, "w", encoding="utf-8") as fm:
        fm.write("old_header\tnew_id\n")
        for old_h, seq in read_fasta(in_fa):
            base = canon_id(old_h)
            seen[base] += 1
            new_id = base if seen[base] == 1 else f"{base}__dup{seen[base]}"
            fm.write(f"{old_h}\t{new_id}\n")

            fo.write(f">{new_id}\n")
            # wrap sequence at 60
            for i in range(0, len(seq), 60):
                fo.write(seq[i:i+60] + "\n")

    print(f"Wrote: {out_fa}")
    print(f"Wrote: {map_tsv}")

if __name__ == "__main__":
    main()

`03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R`

#!/usr/bin/env Rscript
# ============================================================
# 3-circle Venn: RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown
#
# Key points:
# - RNA-seq inputs are Excel files with sheet 'Complete_Data'
#   * GeneID is the ID used for comparison (locus tag style)
#   * Keep Preferred_name / Description for annotation in ALL Venn regions
# - Exoproteome uses MS protein IDs + log2FC sign (no cutoff), mapped to locus tags via BLAST best-hit TSV
# - Proximity/ALFApulldown (ProALFA) uses SIP/MS protein IDs mapped to locus tags via BLAST best-hit TSV
# - Debug stats + parsing/mapping summaries are written into a LOG file
#
# NOTE (remember!): generate Exoproteome_ALL.fasta and the BLAST best-hit TSV first (see Step C):
#   cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
#       Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
#       > Exoproteome_ALL.fasta
#   blastp -query Exoproteome_ALL.fasta -db CP020463_ref_db \
#       -out Exoproteome_ALL_vs_ref.blast.tsv -outfmt 6 -evalue 1e-10 -max_target_seqs 5 -num_threads 16
#   (then reduce to one best hit per query -> Exoproteome_ALL_vs_ref.besthit.tsv)
# ============================================================

suppressPackageStartupMessages({
  library(tidyverse)
  library(readxl)
  library(openxlsx)
  library(VennDiagram)
  library(grid)
})

# -------------------------
# CONFIG
# -------------------------
RNA_PADJ_CUT <- 0.05
RNA_LFC_CUT  <- 2

OUTDIR <- "./venn_outputs"
dir.create(OUTDIR, showWarnings = FALSE, recursive = TRUE)

LOGFILE <- file.path(OUTDIR, "pipeline_log.txt")
sink(LOGFILE, split = TRUE)

cat("=== 3-circle Venn pipeline ===\n")
cat(sprintf("RNA cutoff: padj < %.3g and |log2FoldChange| > %.3g\n", RNA_PADJ_CUT, RNA_LFC_CUT))
cat("Output dir:", normalizePath(OUTDIR), "\n\n")

# RNA inputs (Excel, sheet = Complete_Data)
RNA_FILES <- list(
  deltasbp_MH_2h  = "./DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx",
  deltasbp_MH_4h  = "./DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx",
  deltasbp_MH_18h = "./DEG_Annotations_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx",
  deltasbp_TSB_2h  = "./DEG_Annotations_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx",
  deltasbp_TSB_4h  = "./DEG_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx",
  deltasbp_TSB_18h = "./DEG_Annotations_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx"
)

# Exoproteome sheets (from summary Excel)
EXO_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
EXO_SHEETS <- list(
  Exoproteome_MH_1h  = "Exoproteome MH 1h",
  Exoproteome_MH_4h  = "Exoproteome MH 4h",
  Exoproteome_MH_18h = "Exoproteome MH 18h",
  Exoproteome_TSB_1h  = "Exoproteome TSB 1h",
  Exoproteome_TSB_4h  = "Exoproteome TSB 4h",
  Exoproteome_TSB_18h = "Exoproteome TSB 18h"
)

# Exoproteome BLAST best-hit mapping (query protein -> locus tag)
EXO_BLAST_TSV <- "./Exoproteome_ALL_vs_ref.besthit.tsv"

# Proximity/ALFApulldown sheets (from same summary Excel)
PROALFA_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
PROALFA_SHEETS <- list(
  Proximity_4h = "Proximity 4h",
  Proximity_18h = "Proximity 18h",
  ALFApulldown_4h = "ALFApulldown 4h",
  ALFApulldown_18h = "ALFApulldown 18h"
)

# ProALFA BLAST best-hit mapping (each dataset protein -> locus tag)
PROALFA_BLAST_TSV <- list(
  Proximity_4h      = "./Proximity_4h_vs_ref.besthit.tsv",
  Proximity_18h     = "./Proximity_18h_vs_ref.besthit.tsv",
  ALFApulldown_4h   = "./ALFApulldown_4h_vs_ref.besthit.tsv",
  ALFApulldown_18h  = "./ALFApulldown_18h_vs_ref.besthit.tsv"
)

# FASTA inputs (QC only)
FASTA_FILES <- list(
  Proximity_4h      = "./Proximity_4h.fasta",
  Proximity_18h     = "./Proximity_18h.fasta",
  ALFApulldown_4h   = "./ALFApulldown_4h.fasta",
  ALFApulldown_18h  = "./ALFApulldown_18h.fasta",
  Exo_MH_1h         = "./Exoproteome_MH_1h.fasta",
  Exo_MH_4h         = "./Exoproteome_MH_4h.fasta",
  Exo_MH_18h        = "./Exoproteome_MH_18h.fasta",
  Exo_TSB_1h        = "./Exoproteome_TSB_1h.fasta",
  Exo_TSB_4h        = "./Exoproteome_TSB_4h.fasta",
  Exo_TSB_18h       = "./Exoproteome_TSB_18h.fasta"
)

# -------------------------
# HELPERS
# -------------------------
safe_file_check <- function(fp, label) {
  if (!file.exists(fp)) stop(label, " missing: ", fp)
}

canonical_id_one <- function(x) {
  x <- as.character(x)
  x <- stringr::str_trim(x)

  if (is.na(x) || x == "") return(NA_character_)

  # remove leading ">"
  x <- stringr::str_remove(x, "^>")

  # keep first token before whitespace
  x <- stringr::str_split(x, "\\s+", simplify = TRUE)[1]

  # handle UniProt-style: tr|ID|... or sp|ID|...
  if (stringr::str_detect(x, "\\|")) {
    parts <- stringr::str_split(x, "\\|")[[1]]
    if (length(parts) >= 2 && parts[1] %in% c("tr", "sp")) {
      x <- parts[2]
    } else {
      x <- parts[1]
    }
  }

  x
}

canonical_id <- function(x) {
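  # USE.NAMES = FALSE: return a plain character vector (no input strings as names)
  vapply(x, canonical_id_one, character(1), USE.NAMES = FALSE)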
}

read_fasta_headers <- function(fp) {
  safe_file_check(fp, "FASTA")
  hdr <- readLines(fp, warn = FALSE) %>% keep(~str_starts(.x, ">"))
  tibble(raw = hdr, id = canonical_id(hdr)) %>%
    filter(!is.na(id), id != "")
}

norm_name <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[^a-z0-9]+", "", x)
  x
}

# Robust column picker (case-insensitive + punctuation-insensitive)
find_col <- function(nms, target) {
  nms <- as.character(nms)
  target_n <- norm_name(target)
  n_n <- norm_name(nms)

  # exact normalized match
  idx <- which(n_n == target_n)
  if (length(idx) > 0) return(nms[idx[1]])

  # contains (normalized)
  idx <- which(str_detect(n_n, fixed(target_n)))
  if (length(idx) > 0) return(nms[idx[1]])

  NA_character_
}

# Best-hit parser (BLAST outfmt6):
# qid sid pident length mismatch gapopen qstart qend sstart send evalue bitscore
read_besthit_tsv <- function(fp) {
  safe_file_check(fp, "BLAST TSV")
  suppressPackageStartupMessages({
    df <- read_tsv(fp, col_names = FALSE, show_col_types = FALSE, progress = FALSE)
  })
  if (ncol(df) < 12) stop("BLAST TSV has <12 columns: ", fp)
  colnames(df)[1:12] <- c("qid","sid","pident","length","mismatch","gapopen",
                         "qstart","qend","sstart","send","evalue","bitscore")
  df %>%
    mutate(
      qid = canonical_id(qid),
      sid = as.character(sid),
      bitscore = suppressWarnings(as.numeric(bitscore)),
      evalue = suppressWarnings(as.numeric(evalue))
    ) %>%
    filter(!is.na(qid), qid != "")
}

besthit_reduce <- function(df) {
  # one best hit per query: highest bitscore, then lowest evalue
  df %>%
    arrange(qid, desc(bitscore), evalue) %>%
    group_by(qid) %>%
    slice(1) %>%
    ungroup()
}

# Extract the locus tag (e.g. B4U56_00090) from a subject id.
# Vectorized: works on a whole column and returns NA where no tag is found.
extract_locus_tag <- function(sid) {
  str_extract(as.character(sid), "B4U56_\\d{5}")
}

read_exo_sheet <- function(xlsx, sheet) {
  df <- read_excel(xlsx, sheet = sheet, col_names = TRUE)

  # Protein column: assume first column contains protein IDs
  prot_col <- names(df)[1]

  # log2FC column: per user it's column D, name looks like "log 2 FC)"
  log2_col <- find_col(names(df), "log 2 FC)")
  if (is.na(log2_col)) {
    # fallback fuzzy: any col containing both log2 and fc
    hits <- names(df)[str_detect(norm_name(names(df)), "log2") & str_detect(norm_name(names(df)), "fc")]
    if (length(hits) > 0) log2_col <- hits[1]
  }
  if (is.na(log2_col)) stop("Exoproteome sheet missing log2FC column ('log 2 FC)'): ", sheet)

  out <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      log2FC = suppressWarnings(as.numeric(.data[[log2_col]]))
    ) %>%
    filter(!is.na(Protein_raw), Protein_raw != "")

  out <- out %>%
    mutate(
      Protein_ID = canonical_id(Protein_raw),
      EXO_direction = case_when(
        is.na(log2FC) ~ NA_character_,
        log2FC > 0 ~ "UP",
        log2FC < 0 ~ "DOWN",
        TRUE ~ "ZERO"
      )
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "")

  out
}

read_rna_excel <- function(fp) {
  safe_file_check(fp, "RNA Excel")
  df <- read_excel(fp, sheet = "Complete_Data", col_names = TRUE)

  # required columns
  req <- c("GeneID","padj","log2FoldChange")
  miss <- setdiff(req, names(df))
  if (length(miss) > 0) stop("RNA Excel missing columns: ", paste(miss, collapse=", "), " in ", fp)

  # optional annotation columns
  pref_col <- if ("Preferred_name" %in% names(df)) "Preferred_name" else NA_character_
  desc_col <- if ("Description" %in% names(df)) "Description" else NA_character_

  df <- df %>%
    mutate(
      GeneID_plain = str_remove(as.character(GeneID), "^gene-"),
      padj = suppressWarnings(as.numeric(padj)),
      log2FoldChange = suppressWarnings(as.numeric(log2FoldChange)),
      Preferred_name = if (!is.na(pref_col)) as.character(.data[[pref_col]]) else NA_character_,
      Description    = if (!is.na(desc_col)) as.character(.data[[desc_col]]) else NA_character_
    )

  df
}

# -------------------------
# FASTA QC
# -------------------------
cat("[FASTA] Reading FASTA files for QC (10)\n")
fasta_qc_stats <- list()

for (nm in names(FASTA_FILES)) {
  fp <- FASTA_FILES[[nm]]
  safe_file_check(fp, "FASTA file")
  h <- read_fasta_headers(fp)
  n_hdr <- nrow(h)
  n_uniq <- n_distinct(h$id)
  n_dup <- n_hdr - n_uniq
  cat(sprintf("  - %s: headers=%d, unique_IDs=%d, duplicates=%d\n", nm, n_hdr, n_uniq, n_dup))

  dup_ids <- h %>% count(id, sort = TRUE) %>% filter(n > 1)
  fasta_qc_stats[[nm]] <- tibble(
    Dataset = nm,
    headers = n_hdr,
    unique_IDs = n_uniq,
    duplicates = n_dup,
    top_duplicate_id = ifelse(nrow(dup_ids) > 0, dup_ids$id[1], NA_character_),
    top_duplicate_n  = ifelse(nrow(dup_ids) > 0, dup_ids$n[1], NA_integer_)
  )
}
fasta_qc_stats_df <- bind_rows(fasta_qc_stats)

# -------------------------
# RNA: read 6 comparisons (Excel)
# -------------------------
cat("\n[RNA] Reading 6 RNA-seq comparisons (Excel: Complete_Data)\n")
rna_sig_long <- list()
rna_all_long <- list()

for (cmp in names(RNA_FILES)) {
  fp <- RNA_FILES[[cmp]]
  df <- read_rna_excel(fp)

  # Keep a full-gene annotation table (used later to annotate EXO/ProALFA-only members)
  rna_all_long[[cmp]] <- df %>%
    mutate(Comparison = cmp) %>%
    select(GeneID_plain, Preferred_name, Description, padj, log2FoldChange, Comparison)

  # Debug: how many bad entries
  bad_gene <- sum(is.na(df$GeneID_plain) | df$GeneID_plain == "")
  bad_padj <- sum(is.na(df$padj))
  bad_lfc  <- sum(is.na(df$log2FoldChange))

  sig <- df %>%
    filter(!is.na(GeneID_plain), GeneID_plain != "",
           !is.na(padj), !is.na(log2FoldChange),
           padj < RNA_PADJ_CUT,
           abs(log2FoldChange) > RNA_LFC_CUT) %>%
    mutate(
      Comparison = cmp,
      RNA_direction = ifelse(log2FoldChange > 0, "UP", "DOWN")
    ) %>%
    select(GeneID_plain, Comparison, padj, log2FoldChange, RNA_direction,
           Preferred_name, Description)

  cat(sprintf("  - %s: rows=%d (bad GeneID=%d, bad padj=%d, bad log2FC=%d); significant=%d (UP=%d, DOWN=%d)\n",
              cmp, nrow(df), bad_gene, bad_padj, bad_lfc,
              nrow(sig), sum(sig$RNA_direction == "UP"), sum(sig$RNA_direction == "DOWN")))

  rna_sig_long[[cmp]] <- sig
}

rna_sig_long <- bind_rows(rna_sig_long)
rna_all_long <- bind_rows(rna_all_long)

RNA <- sort(unique(rna_sig_long$GeneID_plain))
cat(sprintf("[RNA] Union significant genes (6 comps): %d\n", length(RNA)))

# Build an annotation lookup from *ALL* genes (so EXO-only/ProALFA-only also get annotation)
rna_ann <- rna_all_long %>%
  filter(!is.na(GeneID_plain), GeneID_plain != "") %>%
  group_by(GeneID_plain) %>%
  summarise(
    # first non-empty annotation per gene; NA if the gene has none in any comparison
    Preferred_name = first(Preferred_name[!is.na(Preferred_name) & Preferred_name != ""], default = NA_character_),
    Description    = first(Description[!is.na(Description) & Description != ""], default = NA_character_),
    .groups = "drop"
  )

# -------------------------
# EXO: read exoproteome sheets + mapping (BLAST)
# -------------------------
cat("\n[EXO] Reading Exoproteome sheets + mapping via BLAST\n")
safe_file_check(EXO_SUMMARY_XLSX, "Exoproteome summary Excel")
safe_file_check(EXO_BLAST_TSV, "Exoproteome BLAST TSV")

exo_long <- list()
for (sh in names(EXO_SHEETS)) {
  df <- read_exo_sheet(EXO_SUMMARY_XLSX, EXO_SHEETS[[sh]]) %>%
    mutate(Sheet = EXO_SHEETS[[sh]])
  cat(sprintf("  - %s: proteins=%d (with log2FC=%d; UP=%d, DOWN=%d)\n",
              EXO_SHEETS[[sh]],
              n_distinct(df$Protein_ID),
              sum(!is.na(df$log2FC)),
              sum(df$EXO_direction == "UP", na.rm = TRUE),
              sum(df$EXO_direction == "DOWN", na.rm = TRUE)))
  exo_long[[sh]] <- df
}
exo_long <- bind_rows(exo_long) %>%
  distinct(Protein_ID, Sheet, .keep_all = TRUE)

# BLAST mapping
exo_blast <- read_besthit_tsv(EXO_BLAST_TSV)
exo_best  <- besthit_reduce(exo_blast) %>%
  mutate(LocusTag = map_chr(sid, extract_locus_tag))

cat(sprintf("[BLAST] Exoproteome_ALL rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
            nrow(exo_blast), n_distinct(exo_blast$qid), nrow(exo_best), n_distinct(exo_best$LocusTag)))

dup_targets <- exo_best %>%
  filter(!is.na(LocusTag)) %>%
  count(LocusTag, sort = TRUE) %>%
  filter(n > 1)

cat(sprintf("[BLAST] Exoproteome_ALL duplicated targets in besthits: %d (top 5)\n", nrow(dup_targets)))
if (nrow(dup_targets) > 0) print(head(dup_targets, 5))

exo_mapped <- exo_long %>%
  left_join(exo_best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid")) %>%
  mutate(LocusTag = as.character(LocusTag))

cat(sprintf("[EXO] Proteins before mapping: %d unique\n", n_distinct(exo_long$Protein_ID)))
cat(sprintf("[EXO] Mapped rows (protein x sheet, have LocusTag): %d\n", sum(!is.na(exo_mapped$LocusTag))))
cat(sprintf("[EXO] Unique mapped proteins: %d\n", n_distinct(exo_mapped$Protein_ID[!is.na(exo_mapped$LocusTag)])))
cat(sprintf("[EXO] Unique mapped locus tags: %d\n", n_distinct(exo_mapped$LocusTag[!is.na(exo_mapped$LocusTag)])))

EXO <- sort(unique(na.omit(exo_mapped$LocusTag)))

# -------------------------
# ProALFA: read proximity + ALFA sheets + mapping (BLAST)
# -------------------------
cat("\n[ProALFA] Reading Proximity/ALFA sheets + mapping via BLAST\n")
safe_file_check(PROALFA_SUMMARY_XLSX, "ProALFA summary Excel")

ms_long <- list()
for (ds in names(PROALFA_SHEETS)) {
  sh <- PROALFA_SHEETS[[ds]]
  if (!(sh %in% readxl::excel_sheets(PROALFA_SUMMARY_XLSX))) {
    stop("Sheet '", sh, "' not found")
  }
  df <- read_excel(PROALFA_SUMMARY_XLSX, sheet = sh, col_names = TRUE)

  # assume first column has protein IDs
  prot_col <- names(df)[1]
  tmp <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      Protein_ID = canonical_id(Protein_raw)
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "") %>%
    distinct(Protein_ID)

  cat(sprintf("  - %s (%s): proteins=%d unique\n", ds, sh, nrow(tmp)))
  ms_long[[ds]] <- tmp %>% mutate(Dataset = ds)
}
ms_long <- bind_rows(ms_long)

# mapping per dataset
ms_mapped_list <- list()
PROALFA_sets <- list()

for (ds in names(PROALFA_BLAST_TSV)) {
  bt <- PROALFA_BLAST_TSV[[ds]]
  safe_file_check(bt, paste0("ProALFA BLAST TSV: ", ds))
  b <- read_besthit_tsv(bt)
  best <- besthit_reduce(b) %>%
    mutate(LocusTag = map_chr(sid, extract_locus_tag))

  cat(sprintf("[BLAST] %s rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
              ds, nrow(b), n_distinct(b$qid), nrow(best), n_distinct(best$LocusTag)))

  dup_t <- best %>%
    filter(!is.na(LocusTag)) %>%
    count(LocusTag, sort = TRUE) %>%
    filter(n > 1)
  if (nrow(dup_t) > 0) {
    cat(sprintf("[BLAST] %s duplicated targets in besthits: %d (top 5)\n", ds, nrow(dup_t)))
    print(head(dup_t, 5))
  }

  tmp_ds <- ms_long %>%
    filter(Dataset == ds) %>%
    left_join(best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid"))

  cat(sprintf("[ProALFA] %s: proteins=%d, mapped=%d, unique_targets=%d\n",
              ds,
              n_distinct(tmp_ds$Protein_ID),
              sum(!is.na(tmp_ds$LocusTag)),
              n_distinct(tmp_ds$LocusTag[!is.na(tmp_ds$LocusTag)])))

  ms_mapped_list[[ds]] <- tmp_ds
  PROALFA_sets[[ds]] <- sort(unique(na.omit(tmp_ds$LocusTag)))
}
ms_mapped_all <- bind_rows(ms_mapped_list)

ProALFA <- sort(unique(unlist(PROALFA_sets)))
cat(sprintf("[ProALFA] Union locus tags (Proximity+ALFA): %d\n", length(ProALFA)))

# -------------------------
# VENN intersections
# -------------------------
cat("\n[VENN] Computing intersections\n")
cat(sprintf("Set sizes: RNA=%d, EXO=%d, ProALFA=%d\n", length(RNA), length(EXO), length(ProALFA)))

i_RNA_EXO     <- intersect(RNA, EXO)
i_RNA_ProALFA <- intersect(RNA, ProALFA)
i_EXO_ProALFA <- intersect(EXO, ProALFA)
i_all3        <- Reduce(intersect, list(RNA, EXO, ProALFA))

cat(sprintf("Intersections: RNA∩EXO=%d, RNA∩ProALFA=%d, EXO∩ProALFA=%d, ALL3=%d\n",
            length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)))

# -------------------------
# Venn plot
# -------------------------
venn_png <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.png")
venn_pdf <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.pdf")

cat("[VENN] Writing plot:", venn_png, "\n")
venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = venn_png,
  imagetype = "png",
  height = 2400, width = 2400, resolution = 300,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  # Increase 3rd value to push the long bottom label further out
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)

cat("[VENN] Writing plot:", venn_pdf, "\n")
vd <- venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = NULL,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)
pdf(venn_pdf, width=7.5, height=7.5)
grid.draw(vd)
dev.off()

# -------------------------
# Region table (ALL members + direction annotations)
# -------------------------
all_members <- tibble(LocusTag = sort(unique(c(RNA, EXO, ProALFA)))) %>%
  mutate(
    in_RNA = LocusTag %in% RNA,
    in_EXO = LocusTag %in% EXO,
    in_ProALFA = LocusTag %in% ProALFA
  )

# RNA direction per locus tag (from significant set only; union across comparisons)
rna_dir_tbl <- rna_sig_long %>%
  group_by(GeneID_plain) %>%
  summarise(
    RNA_comparisons = paste(sort(unique(Comparison)), collapse = ";"),
    RNA_dirs = paste(sort(unique(RNA_direction)), collapse = ";"),
    min_padj = suppressWarnings(min(padj, na.rm = TRUE)),
    max_abs_log2FC = suppressWarnings(max(abs(log2FoldChange), na.rm = TRUE)),
    .groups = "drop"
  ) %>%
  rename(LocusTag = GeneID_plain)

# EXO direction aggregated per locus tag
exo_dir_tbl <- exo_mapped %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    EXO_sheets = paste(sort(unique(Sheet)), collapse = ";"),
    EXO_dirs = paste(sort(unique(EXO_direction)), collapse = ";"),
    EXO_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

# ProALFA membership aggregated per locus tag
proalfa_tbl <- ms_mapped_all %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    ProALFA_datasets = paste(sort(unique(Dataset)), collapse = ";"),
    ProALFA_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

region_tbl <- all_members %>%
  left_join(rna_dir_tbl, by = "LocusTag") %>%
  left_join(exo_dir_tbl, by = "LocusTag") %>%
  left_join(proalfa_tbl, by = "LocusTag") %>%
  left_join(rna_ann, by = c("LocusTag" = "GeneID_plain")) %>%
  arrange(desc(in_RNA), desc(in_EXO), desc(in_ProALFA), LocusTag)

region_only <- function(inRNA, inEXO, inProALFA) {
  region_tbl %>%
    filter(in_RNA == inRNA, in_EXO == inEXO, in_ProALFA == inProALFA)
}

# -------------------------
# Excel output
# -------------------------
out_xlsx <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA_detailed.xlsx")
cat("[EXCEL] Writing:", out_xlsx, "\n")

wb <- createWorkbook()

addWorksheet(wb, "Summary")
writeData(wb, "Summary", data.frame(
  Metric = c("RNA cutoff", "RNA union", "EXO union", "ProALFA union",
             "RNA∩EXO", "RNA∩ProALFA", "EXO∩ProALFA", "ALL3"),
  Value  = c(
    sprintf("padj < %.3g & |log2FC| > %.3g", RNA_PADJ_CUT, RNA_LFC_CUT),
    length(RNA), length(EXO), length(ProALFA),
    length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)
  )
))

addWorksheet(wb, "FASTA_QC")
writeData(wb, "FASTA_QC", fasta_qc_stats_df)

addWorksheet(wb, "RNA_sig_long")
writeData(wb, "RNA_sig_long", rna_sig_long)

addWorksheet(wb, "RNA_all_annotations")
writeData(wb, "RNA_all_annotations", rna_all_long)

addWorksheet(wb, "EXO_proteins")
writeData(wb, "EXO_proteins", exo_long)

addWorksheet(wb, "EXO_mapped")
writeData(wb, "EXO_mapped", exo_mapped)

addWorksheet(wb, "ProALFA_proteins")
writeData(wb, "ProALFA_proteins", ms_long)

addWorksheet(wb, "ProALFA_mapped")
writeData(wb, "ProALFA_mapped", ms_mapped_all)

addWorksheet(wb, "All_members_with_dirs")
writeData(wb, "All_members_with_dirs", region_tbl)

addWorksheet(wb, "Only_RNA")
writeData(wb, "Only_RNA", region_only(TRUE, FALSE, FALSE))

addWorksheet(wb, "Only_EXO")
writeData(wb, "Only_EXO", region_only(FALSE, TRUE, FALSE))

addWorksheet(wb, "Only_ProALFA")
writeData(wb, "Only_ProALFA", region_only(FALSE, FALSE, TRUE))

addWorksheet(wb, "RNA_EXO_only")
writeData(wb, "RNA_EXO_only", region_only(TRUE, TRUE, FALSE))

addWorksheet(wb, "RNA_ProALFA_only")
writeData(wb, "RNA_ProALFA_only", region_only(TRUE, FALSE, TRUE))

addWorksheet(wb, "EXO_ProALFA_only")
writeData(wb, "EXO_ProALFA_only", region_only(FALSE, TRUE, TRUE))

addWorksheet(wb, "ALL3")
writeData(wb, "ALL3", region_only(TRUE, TRUE, TRUE))

saveWorkbook(wb, out_xlsx, overwrite = TRUE)

cat("\nDONE.\nOutputs:\n")
cat(" -", venn_png, "\n")
cat(" -", venn_pdf, "\n")
cat(" -", out_xlsx, "\n")
cat(" - LOG:", normalizePath(LOGFILE), "\n")

# dump warnings (so they go into log too)
if (length(warnings()) > 0) {
  cat("\n[WARNINGS]\n")
  print(warnings())
}

sink()

Final Notes

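All datasets are keyed by locus tag (RNA-seq GeneIDs with the "gene-" prefix removed; protein IDs mapped through BLAST best hits), which is what makes the cross-omics set comparisons possible.
Exoproteome regulation is treated directionally only, by design: each protein is labelled UP, DOWN or ZERO from the sign of its log2FC.
Logs and Excel outputs are intentionally verbose to support downstream inspection and reuse.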
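
The seven exclusive Venn regions written to the Excel sheets (Only_RNA, RNA_EXO_only, ..., ALL3) follow directly from set membership. A minimal sketch of that logic, written in Python rather than R and using made-up locus tags; the function name `venn_regions` is hypothetical and not part of the pipeline:

```python
def venn_regions(rna, exo, proalfa):
    """Split three ID sets into mutually exclusive Venn regions.

    Keys are (in_RNA, in_EXO, in_ProALFA) tuples, mirroring the
    region_only(inRNA, inEXO, inProALFA) filter of the R script.
    """
    rna, exo, proalfa = set(rna), set(exo), set(proalfa)
    regions = {}
    for tag in sorted(rna | exo | proalfa):
        key = (tag in rna, tag in exo, tag in proalfa)
        regions.setdefault(key, []).append(tag)
    return regions


# Toy locus tags, invented for illustration only
RNA = {"locus_1", "locus_2", "locus_3"}
EXO = {"locus_2", "locus_3", "locus_4"}
PROALFA = {"locus_3", "locus_5"}

regions = venn_regions(RNA, EXO, PROALFA)
for key, tags in regions.items():
    print(key, tags)
```

Every member of the union lands in exactly one region, so the region sizes add up to the size of the union.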

This pipeline is designed to be transparent, auditable, and extensible—ideal for iterative multi-omics analysis and figure generation.

From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)

Path: ~/DATA/Data_Michelle/Data_Michelle_ProteinClustering_2026

Goal

For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:

Export protein sequences as FASTA
Annotate them with EggNOG-mapper (including the COG_category column)
Summarize COG composition at two levels:
- COG letters (J/A/K/…/S), including multi-letter cases like IQ
- 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
Export both summary statistics and the underlying protein IDs for each category/group.

Step 0 — Why this manual annotation approach is needed (non-model organism)

Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.

Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

Export sequences for each MS dataset:

python3 getProteinSequences_Proximity_4h.py      # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # > ALFApulldown_18h.fasta

Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py).
Output: one FASTA per dataset (*.fasta), used as EggNOG input.
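
Before passing a FASTA to EggNOG it can help to confirm how many records were exported and whether any IDs repeat. A small sketch, assuming the record ID is the first whitespace-delimited token of each header line; `fasta_id_stats` is a hypothetical helper, not one of the getProteinSequences_*.py scripts:

```python
from collections import Counter


def fasta_id_stats(lines):
    """Count headers, unique IDs and repeated IDs in FASTA text lines."""
    ids = []
    for ln in lines:
        if ln.startswith(">"):
            tokens = ln[1:].split()
            ids.append(tokens[0] if tokens else "")  # empty header -> empty ID
    counts = Counter(ids)
    duplicates = {i: n for i, n in counts.items() if n > 1}
    return {"headers": len(ids), "unique_ids": len(counts), "duplicates": duplicates}


# In-memory demo with invented records; for a real file one could pass the
# open file handle instead, e.g. fasta_id_stats(open("Proximity_4h.fasta")).
demo = [">protA first copy", "MKT", ">protB", "MMS", ">protA second copy", "MKT"]
stats = fasta_id_stats(demo)
print(stats)
```

A non-empty `duplicates` entry would be worth resolving before annotation, since repeated query IDs make the per-protein tables ambiguous.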

1B) Reference FASTA from GenBank (used for RNA-seq integration / ID baseline; not used in this post)

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta).
Output: FASTA with cleaned headers (CP020463_protein.fasta).

Step 2 — Generate EggNOG annotation files (`*.emapper.annotations`)

2A) Install EggNOG-mapper

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda   # eggnog-mapper 2.1.12
mamba activate eggnog_env

2B) Download / prepare EggNOG database

mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

2C) Run `emapper.py` on FASTA inputs

For the RNA-seq reference proteome (used for RNA-seq integration, not used in this post; optional):

emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60    # --resume if needed

For MS protein sets (the ones used for COG clustering):

emapper.py -i Proximity_4h.fasta      -o eggnog_out_Proximity_4h      --cpu 60   # --resume
emapper.py -i Proximity_18h.fasta     -o eggnog_out_Proximity_18h     --cpu 60
emapper.py -i ALFApulldown_4h.fasta   -o eggnog_out_ALFApulldown_4h   --cpu 60
emapper.py -i ALFApulldown_18h.fasta  -o eggnog_out_ALFApulldown_18h  --cpu 60
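
The four MS runs differ only in the dataset name, so the commands can be generated from one list. The sketch below only builds the argument lists, using the flags shown above (`-i`, `-o`, `--cpu`) and the `<prefix>.emapper.annotations` naming of the output files used downstream. The helper name `emapper_job` is hypothetical, and actually launching the runs is left commented out:

```python
DATASETS = ["Proximity_4h", "Proximity_18h", "ALFApulldown_4h", "ALFApulldown_18h"]


def emapper_job(dataset, cpu=60):
    """Return (command as argument list, expected annotations file) for one dataset."""
    prefix = f"eggnog_out_{dataset}"
    cmd = ["emapper.py", "-i", f"{dataset}.fasta", "-o", prefix, "--cpu", str(cpu)]
    return cmd, f"{prefix}.emapper.annotations"


jobs = {ds: emapper_job(ds) for ds in DATASETS}
for ds, (cmd, expected) in jobs.items():
    print(" ".join(cmd), "->", expected)
    # To run for real (assumes emapper.py is on PATH in the active environment):
    # import subprocess; subprocess.run(cmd, check=True)
```

Keeping the expected file name next to each command makes it easy to check afterwards that all four annotation files exist.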

Key outputs used downstream (one per dataset):

eggnog_out_Proximity_4h.emapper.annotations
eggnog_out_Proximity_18h.emapper.annotations
eggnog_out_ALFApulldown_4h.emapper.annotations
eggnog_out_ALFApulldown_18h.emapper.annotations

These files include a column named COG_category, which is the basis for the clustering below.
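
A quick way to confirm this before running the clustering is to locate the header line and check for the column. The sketch assumes the layout the clustering script also relies on: `##` comment lines, then a tab-separated header beginning with `#query`. The helper `annotation_columns` is hypothetical, and the demo header is abbreviated and invented, so real files carry more columns:

```python
def annotation_columns(lines):
    """Return the column names from the '#query' header line, or [] if none is found."""
    for ln in lines:
        ln = ln.rstrip("\n")
        if ln.startswith("##"):
            continue  # emapper comment line
        if ln.startswith("#query") or ln.startswith("query\t"):
            return ln.lstrip("#").split("\t")
    return []


# Abbreviated, made-up example of an annotations file
demo = [
    "## emapper comment line\n",
    "#query\tseed_ortholog\tCOG_category\tDescription\n",
    "protA\t1234.XYZ\tK\tTranscription factor\n",
]
cols = annotation_columns(demo)
print("COG_category" in cols)
```

For a file on disk, the same function can be given the open file handle, since it only iterates over lines.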

Step 3 — COG clustering + reporting (this post’s main script)

Inputs

The four *.emapper.annotations files (must contain COG_category).

Outputs

All outputs are written to ./COG_outputs/:

Per dataset (Excel): COG_[Dataset].xlsx

COG_letters: per-letter count + percent
Debug: unassigned COG rows, R/S proportion, etc.
Protein_assignments: per-protein COG letters + functional groups
Protein_lists_by_COG: protein IDs per COG letter
Protein_lists_by_group: protein IDs per major functional class
Long_format_COG / Long_format_group: one row per protein per category/group (best for filtering)

Combined (Excel): COG_combined_summary.xlsx

combined counts/percents across datasets
combined long-format tables

Plots (PNG + PDF):

COG_grouped_barplot_percent_letters.* (COG letters across datasets)
COG_functional_groups.* (4 functional classes across datasets)

Run the script:

python3 cog_cluster.py
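
Because the long-format sheets hold one row per protein per category, pulling out a protein list is a plain row filter. A sketch on an in-memory table with the same columns as the Long_format_COG sheet (Dataset, COG, Description, Protein_ID); the rows are invented, and in practice the table would be loaded from the Excel output (for example with `pd.read_excel` and `sheet_name="Long_format_COG"`):

```python
import pandas as pd

# Invented rows with the columns of the Long_format_COG sheet
long_cog = pd.DataFrame([
    {"Dataset": "Proximity_4h", "COG": "M",
     "Description": "Cell wall/membrane/envelope biogenesis", "Protein_ID": "protB"},
    {"Dataset": "Proximity_4h", "COG": "U",
     "Description": "Intracellular trafficking, secretion, and vesicular transport",
     "Protein_ID": "protA"},
    {"Dataset": "Proximity_4h", "COG": "S",
     "Description": "Function unknown", "Protein_ID": "protC"},
    {"Dataset": "Proximity_18h", "COG": "M",
     "Description": "Cell wall/membrane/envelope biogenesis", "Protein_ID": "protD"},
])


def proteins_in(long_df, dataset, cog_letters):
    """Sorted unique protein IDs of one dataset that carry any of the given COG letters."""
    mask = (long_df["Dataset"] == dataset) & long_df["COG"].isin(cog_letters)
    return sorted(long_df.loc[mask, "Protein_ID"].unique())


print(proteins_in(long_cog, "Proximity_4h", ["M", "U"]))
```

The same pattern works on the combined All_Long_COG sheet, where the Dataset column separates the four protein sets.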

Interpretation notes

A large “POORLY CHARACTERIZED” (R/S) share is common and is not a plotting artifact: it reflects proteins for which EggNOG cannot assign a confident function, for example hypothetical-like or strain-specific proteins, or proteins without strong ortholog support.
Why does “S” show up so often in proximity-labeling / pulldown sets? Common biological and annotation reasons:
- Small / strain-specific proteins (often hypothetical)
- Membrane or secreted proteins with poor functional characterization
- Mobile genetic elements / phage-related proteins
- Proteins with only weak orthology support → EggNOG assigns S rather than a confident functional class
Group totals can differ from 100% because:
- some proteins have multi-letter COGs whose letters fall into different functional classes → they count in more than one group (IQ counts under both I and Q at the letter level, but both letters belong to Metabolism)
- some proteins have no COG assignment (-) → they don’t contribute to any group
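
A toy calculation makes both effects concrete. It assumes group percentages are computed per protein with all proteins in the table as the denominator, which is an assumption about the group plot rather than something shown in this post; the proteins and their COG strings are invented:

```python
GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": set("JAKLB"),
    "CELLULAR PROCESSES AND SIGNALING": set("DYVTMNZWUO"),
    "METABOLISM": set("CGEFHIPQ"),
    "POORLY CHARACTERIZED": set("RS"),
}


def group_percentages(cog_by_protein):
    """Percent of proteins that hit each functional class (multi-group allowed)."""
    n = len(cog_by_protein)
    pct = {}
    for name, letters in GROUPS.items():
        hits = sum(1 for cog in cog_by_protein.values() if letters & set(cog))
        pct[name] = 100.0 * hits / n if n else 0.0
    return pct


# "KT" spans two classes, so protA is counted twice and the total exceeds 100%
over = group_percentages({"protA": "KT", "protB": "IQ", "protC": "S"})
print(round(sum(over.values()), 1))   # 133.3

# "-" hits no class, so the total falls below 100%
under = group_percentages({"protA": "K", "protB": "-"})
print(sum(under.values()))            # 50.0
```

Note that "IQ" adds only one count to Metabolism here, because I and Q belong to the same class.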

Code snippet (generate Proximity_4h FASTA in Step 1A; the FASTA is the EggNOG input in Step 2)

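from __future__ import annotations  # lets the `str | None` / `list[str]` hints run on Python < 3.10

import time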
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

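def http_get_text(url, params=None):
    """Return the response body for HTTP 200, else None (also on network errors)."""
    try:
        r = S.get(url, params=params, timeout=30)
    except requests.RequestException:
        # retries exhausted, timeout or connection error: treat the ID as not found
        return None
    if r.status_code == 200:
        return r.text
    return None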

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)
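
For IDs that only resolve through UniParc, the script rewrites the first header line so that the originally requested accession leads the header. A standalone sketch of that one step, using the same `re.sub` call on an invented UniParc record (the UPI, header text and sequence are placeholders, and `tag_header` is a hypothetical wrapper):

```python
import re


def tag_header(fasta_text, identifier, upi):
    """Prefix the first FASTA header with '<identifier>|UniParc:<upi> '."""
    return re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta_text, count=1, flags=re.M)


record = ">UPI0000000001 status=active\nMKTAYIAK\n"
tagged = tag_header(record, "A0A0N1EUK4", "UPI0000000001")
print(tagged.splitlines()[0])
# >A0A0N1EUK4|UniParc:UPI0000000001 UPI0000000001 status=active
```

Only the first header is rewritten (`count=1`) and the sequence lines are left untouched, so the requested accession becomes the first token of the record header.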

Code snippet (COG clustering script, used in Step 3)

#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)

Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
    * Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
    * Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
  (note: method still allows multi-group membership; we just don't print it in labels)
"""

import os
import re
import pandas as pd
import numpy as np

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
    "Proximity_4h":     "./eggnog_out_Proximity_4h.emapper.annotations",
    "Proximity_18h":    "./eggnog_out_Proximity_18h.emapper.annotations",
    "ALFApulldown_4h":  "./eggnog_out_ALFApulldown_4h.emapper.annotations",
    "ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}

OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)

ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]

COG_DESCRIPTIONS = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}

FUNCTIONAL_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
    "CELLULAR PROCESSES AND SIGNALING":   ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
    "METABOLISM":                        ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
    "POORLY CHARACTERIZED":              ['R', 'S'],
}

LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
    for c in letters:
        LETTER_TO_GROUP[c] = grp

# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
    rows = []
    header = None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("##"):
                # skip blank lines and emapper's "##" comment lines
                continue
            if line.startswith("#query"):
                header = line.lstrip("#").split("\t")
                continue
            if header is None and re.match(r"^query\t", line):
                header = line.split("\t")
                continue
            if header is None:
                continue
            rows.append(line.split("\t"))

    if header is None:
        raise ValueError(f"Could not find header line (#query/query) in: {path}")

    df = pd.DataFrame(rows, columns=header)
    if "query" not in df.columns and "#query" in df.columns:
        df = df.rename(columns={"#query": "query"})
    return df

def split_cog_letters(cog_str):
    if cog_str is None:
        return []
    cog_str = str(cog_str).strip()
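    # callers cast with astype(str), so missing values can arrive as "nan" or "None";
    # without this check "None" would be miscounted as COG letter N
    if cog_str in ("", "-") or cog_str.lower() in ("nan", "none"):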
        return []
    letters = list(cog_str)
    return [c for c in letters if c in ALL_COG]

def count_cog_letters(df: pd.DataFrame) -> dict:
    counts = {c: 0 for c in ALL_COG}
    for x in df["COG_category"].astype(str).tolist():
        for c in split_cog_letters(x):
            counts[c] += 1
    return counts

def build_category_to_proteins(df: pd.DataFrame) -> dict:
    cat2prot = {c: set() for c in ALL_COG}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        for c in letters:
            cat2prot[c].add(q)
    return {c: sorted(list(v)) for c, v in cat2prot.items()}

def build_group_to_proteins(df: pd.DataFrame) -> dict:
    group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        for g in hit_groups:
            group2prot[g].add(q)
    return {g: sorted(list(v)) for g, v in group2prot.items()}

def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
        rows.append({
            "query": q,
            "COG_category_raw": cog,
            "COG_letters": "".join(sorted(set(letters))) if letters else "",
            "Functional_groups": "; ".join(groups) if groups else "",
            # cog was cast to str upstream, so missing values arrive as "nan" / "None"
            "Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA", "None"])
        })
    out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
    return out.sort_values("query")

def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for c in ALL_COG:
        for pid in cat2prot.get(c, []):
            rows.append({
                "Dataset": dataset,
                "COG": c,
                "Description": COG_DESCRIPTIONS.get(c, ""),
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for g in FUNCTIONAL_GROUPS.keys():
        for pid in grp2prot.get(g, []):
            rows.append({
                "Dataset": dataset,
                "Functional_group": g,
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
    groups = list(FUNCTIONAL_GROUPS.keys())
    recs = []
    for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        rec = {"query": q}
        for g in groups:
            rec[g] = (g in hit_groups)
        recs.append(rec)

    out = pd.DataFrame(recs)
    out = out.groupby("query", as_index=False).max()
    return out

# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []

for ds_name, path in INPUT_FILES.items():
    print(f"\n--- {ds_name} ---\nReading: {path}")
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing input file: {path}")

    df = read_emapper_annotations(path)
    if "COG_category" not in df.columns:
        raise ValueError(f"{path} has no 'COG_category' column.")

    n_rows = df.shape[0]
    unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()

    # Letter-based counts and percents
    counts = count_cog_letters(df)
    total_letters = sum(counts.values())
    pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}

    counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
    pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)

    all_counts[ds_name] = counts_s
    all_pct[ds_name] = pct_s

    out_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "Count": counts_s.values,
        "Percent": pct_s.values,
    })

    dbg = pd.DataFrame([{
        "Dataset": ds_name,
        "Proteins_in_table": int(n_rows),
        "COG_unassigned_rows": int(unassigned),
        "Total_assigned_COG_letters": int(total_letters),
        "R_count": int(counts.get("R", 0)),
        "S_count": int(counts.get("S", 0)),
        "R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
    }])

    debug_summary.append(dbg.iloc[0].to_dict())

    assignment_tbl = build_assignment_table(df)
    cat2prot = build_category_to_proteins(df)
    grp2prot = build_group_to_proteins(df)

    # category (COG letter) protein lists
    cat_list_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
        "Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
    })

    # group protein lists
    grp_list_df = pd.DataFrame({
        "Functional_group": list(FUNCTIONAL_GROUPS.keys()),
        "N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
        "Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
    })

    df_long_cog = long_format_by_cog(cat2prot, ds_name)
    df_long_group = long_format_by_group(grp2prot, ds_name)
    all_long_cog.append(df_long_cog)
    all_long_group.append(df_long_group)

    out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
    with pd.ExcelWriter(out_xlsx) as writer:
        out_df.to_excel(writer, sheet_name="COG_letters", index=False)
        dbg.to_excel(writer, sheet_name="Debug", index=False)
        assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
        cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
        grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
        df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
        df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)

    print(f"Saved: {out_xlsx}")

# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())

df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)

combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
    df_counts.to_excel(writer, sheet_name="Counts")
    df_pct.to_excel(writer, sheet_name="Percent")
    pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
    # Combined long formats (includes protein IDs per letter/group across datasets)
    pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
    pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)

print(f"\nSaved combined summary: {combined_xlsx}")

# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()

y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
    ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)

ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()

plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")

# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []

# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []

for ds_name, path in INPUT_FILES.items():
    df = read_emapper_annotations(path)

    # (A) protein-based group membership and stats
    memb = protein_based_group_membership(df)
    n_proteins = memb.shape[0]
    counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
    pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
    group_counts[ds_name] = pd.Series(counts)
    group_pct[ds_name] = pd.Series(pct)

    for g in FUNCTIONAL_GROUPS.keys():
        ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
        for pid in ids:
            group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})

    # (B) add category (letter) protein IDs too
    cat2prot = build_category_to_proteins(df)
    all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))

df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)

# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
    sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
    for c in ALL_COG:
        ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
        cat_list_rows.append({
            "Dataset": ds,
            "COG": c,
            "Description": COG_DESCRIPTIONS.get(c, ""),
            "N_proteins": len(ids),
            "Protein_IDs": "; ".join(sorted(ids))
        })
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)

group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
    df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
    df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
    df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)

    # NEW: category protein IDs included here as well
    df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
    df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)

print(f"Saved functional-group workbook: {group_xlsx}")

# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()

x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
    ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)

ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()

plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)

print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")

print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))

End-to-end GO enrichment for non-model bacteria: MS → reference ID mapping with BLAST + Blast2GO (GUI) + R enrichment

Path: ~/DATA/Data_Michelle/MS_GO_enrichments_2026

This post summarizes a complete workflow for GO enrichment of protein sets from mass spectrometry (MS) in a non-model organism, where:

MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
Standard R organism annotation packages (org.*.eg.db) are not available.

We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE table.

The workflow produces:

4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
1 combined summary workbook across all datasets

1) Generate protein FASTA files

1.1 MS protein sequences (Proximity / ALFApulldown)

From MS results (protein ID lists), retrieve sequences and write FASTA:

python3 getProteinSequences_Proximity_4h.py      > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  > ALFApulldown_18h.fasta

1.2 Reference proteome sequences (for mapping + background)

Download the reference proteome FASTA from GenBank and standardize headers:

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

2) Optional functional annotation for merging MS + RNA-seq (EggNOG)

EggNOG gives KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and quick merging. (For enrichment, we will use Blast2GO GO terms later, which are typically more comprehensive for GO/EC.)

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env

download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60

# MS sets
emapper.py -i Proximity_4h.fasta     -o eggnog_out_Proximity_4h     --cpu 60
emapper.py -i Proximity_18h.fasta    -o eggnog_out_Proximity_18h    --cpu 60
emapper.py -i ALFApulldown_4h.fasta  -o eggnog_out_ALFApulldown_4h  --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60

3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)

Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.

3.1 Setup workspace

mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/

Launch Blast2GO:

~/Tools/Blast2GO/Blast2GO_Launcher

3.2 Step-by-step in Blast2GO GUI

STEP 1 — Load sequences (reference proteome)

File → Load → Load Sequences
Choose: Load Fasta File (.fasta)
Select: CP020463_protein.fasta
Tags: (leave default / none)
Check that the table is filled with columns like Nr, SeqName

✅ Output: sequences are loaded into the project

STEP 2 — BLAST (QBlast)

Go to the BLAST panel
Choose blastp (protein vs protein)
Database: nr (NCBI)
Set other parameters as needed (defaults typically OK)
Run QBlast

⚠️ This step is typically the most time-consuming (hours to days depending on dataset size and NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit if desired.

✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)

STEP 3 — Mapping

Click Mapping

✅ Output:

Tags updated to MAPPED
Columns appear such as #GO, GO IDs, GO Names

STEP 4 — Annotation

Click Annotation
Key parameters you may set/keep:
- Annotation CutOff (controls reliability threshold)
- GO Weight (boosts more general terms when supported by multiple specific hits)

✅ Output:

Tags updated to ANNOTATED
Enzyme-related columns may appear (e.g., Enzyme Codes)

STEP 5 — Export Annotations (before merging InterPro)

File → Export → Export Annotations
Choose Export Annotations (.annot, custom, etc.)
Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot

✅ Output: blast2go_annot.annot

STEP 6 — InterProScan (optional but recommended for more GO terms)

Click InterPro / InterProScan
Run InterProScan (can be long: hours to >1 day depending on dataset & setup)

✅ Output:

Tags updated to INTERPRO
Additional columns: InterPro IDs, InterPro GO IDs/Names

STEP 7 — Merge InterProScan GOs into existing annotation

In InterPro panel, choose: Merge InterProScan GOs to Annotation
Confirm merge

✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)

STEP 8 — Export final annotations (after merging InterPro)

File → Export → Export Annotations
Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2

✅ This is the final file used for enrichment:

blast2go_annot.annot2

Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.
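
To make the structure of the exported file more concrete, here is a minimal Python sketch that reads `.annot`-style lines and keeps only the GO rows. It assumes, as the R wrapper further down does, that the file is tab-separated with the sequence ID in column 1 and a `GO:` or `EC:` term in column 2. The function name `read_term2gene` and the sample rows are hypothetical and only serve as illustration.

```python
import io

def read_term2gene(handle):
    """Return sorted unique (GO term, protein ID) pairs from tab-separated .annot lines."""
    pairs = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene, term = fields[0].strip(), fields[1].strip()
        if term.startswith("GO:"):  # drop EC numbers and any other non-GO rows
            pairs.add((term, gene))
    return sorted(pairs)

# Hypothetical sample rows; a real file comes from the Blast2GO export.
sample = io.StringIO(
    "B4U56_00090\tGO:0005524\tATP binding\n"
    "B4U56_00090\tEC:3.6.1.3\n"
    "B4U56_00095\tGO:0005524\n"
    "B4U56_00095\tGO:0005524\n"
)
term2gene = read_term2gene(sample)
# Background (universe): every protein that carries at least one GO term
universe = sorted({gene for _, gene in term2gene})
print(term2gene)
print(universe)
```

The same two objects (term-to-gene pairs and the universe) are what the enrichment step consumes later on.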

4) Generate BLAST mapping tables: `*_vs_ref.blast.tsv`

We map each MS protein set to the reference proteome locus tags using BLASTP.

4.1 Create BLAST database from reference proteome

makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db

4.2 BLASTP each MS set against the reference DB

blastp -query Proximity_4h.fasta -db CP020463_ref_db \
  -out Proximity_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query Proximity_18h.fasta -db CP020463_ref_db \
  -out Proximity_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
  -out ALFApulldown_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
  -out ALFApulldown_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16
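
Each query can return up to 5 hits, so the tables still have to be reduced to one reference protein per MS protein. The sketch below shows one way to do this in Python, mirroring the thresholds and tie-breaking order used by the R wrapper further down (identity ≥ 70 %, query coverage ≥ 0.70 computed as alignment length / query length, e-value ≤ 1e-10, then highest bitscore). The function name `best_hits` and the sample rows are hypothetical.

```python
def best_hits(rows, min_pident=70.0, min_qcov=0.70, max_evalue=1e-10):
    """Pick one best subject per query from parsed BLAST outfmt-6 rows.

    rows: iterable of dicts with the keys qseqid, sseqid, pident, length,
    qlen, evalue and bitscore (numeric fields already converted).
    """
    best = {}
    for r in rows:
        qcov = r["length"] / r["qlen"]
        if r["evalue"] > max_evalue or r["pident"] < min_pident or qcov < min_qcov:
            continue  # hit does not pass the filters
        # Smaller key = better hit: high bitscore, low e-value, high coverage, high identity
        key = (-r["bitscore"], r["evalue"], -qcov, -r["pident"])
        if r["qseqid"] not in best or key < best[r["qseqid"]][0]:
            best[r["qseqid"]] = (key, r["sseqid"])
    return {q: s for q, (_, s) in best.items()}

# Hypothetical rows for illustration only.
rows = [
    {"qseqid": "q1", "sseqid": "refA", "pident": 95.0, "length": 180, "qlen": 200,
     "evalue": 1e-50, "bitscore": 300.0},
    {"qseqid": "q1", "sseqid": "refB", "pident": 99.0, "length": 100, "qlen": 200,
     "evalue": 1e-60, "bitscore": 400.0},  # fails the coverage filter (0.5)
    {"qseqid": "q2", "sseqid": "refC", "pident": 60.0, "length": 190, "qlen": 200,
     "evalue": 1e-40, "bitscore": 250.0},  # fails the identity filter
]
mapping = best_hits(rows)
print(mapping)
```

Queries without any surviving hit (here `q2`) simply do not appear in the mapping and end up in the "unmapped" list.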

5) Run GO enrichment (4 sets + combined summary)

We run a single wrapper script that:

normalizes FASTA headers (handles UniParc/UniProt tr|... formats)
filters BLAST hits (pident/qcov/evalue thresholds)
picks one best hit per query
performs GO enrichment via clusterProfiler::enricher()
background (universe) = all reference proteins with ≥1 GO term in blast2go_annot.annot2
cutoff = BH/FDR adjusted p-value < 0.05
writes one Excel per dataset + one combined summary workbook
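
The header normalization in the first bullet is easy to get wrong, so here is a small Python sketch of the same idea as the `norm_mixed_id()` helper in the R wrapper: drop the description after the first whitespace, then take the accession from `tr|ACC|NAME` / `sp|ACC|NAME` headers and the first `|`-separated field otherwise. The function name `normalize_id` and the example headers are hypothetical, and the sketch assumes non-empty headers.

```python
def normalize_id(header: str) -> str:
    """Reduce a FASTA header or BLAST qseqid to a bare accession."""
    token = header.lstrip(">").split(maxsplit=1)[0]  # drop the free-text description
    parts = token.split("|")
    if len(parts) >= 3 and parts[0] in ("tr", "sp"):
        return parts[1]  # UniProtKB style: db|ACCESSION|ENTRY_NAME
    return parts[0]      # plain IDs, or "ACC|UniParc:UPI..." written by the fetch script

# Hypothetical headers for illustration only.
examples = [
    ">tr|A0A0E1VEW0|ENTRY_NAME some protein description",
    ">A0A0N1EUK4|UniParc:UPI0000000001 status=active",
    ">UPI0000000002 status=active",
]
normalized = [normalize_id(h) for h in examples]
print(normalized)
```

Applying the same function to both the FASTA headers and the BLAST `qseqid` column keeps the two ID sets comparable.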

Rscript wrapper_besthit_GO_enrichment_4sets.R
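
For readers who want to see what the enrichment test does numerically, the following Python sketch computes a one-sided hypergeometric p-value per GO term and a Benjamini-Hochberg adjustment, which is the kind of over-representation test `clusterProfiler::enricher()` performs. It is an illustration under simplifying assumptions, not a re-implementation: it ignores additional filters that clusterProfiler applies (for example gene-set size limits), and the function names are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, M, N):
    """P(X >= k) when drawing n proteins from a universe of N, of which M carry the term."""
    upper = min(n, M)
    hits = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending p-value order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy numbers: universe of 10 proteins, 3 carry the term, 2 proteins in the MS set, 1 hit.
p = hypergeom_pvalue(k=1, n=2, M=3, N=10)
padj = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(p)
print(padj)
```

Here `N` corresponds to the size of the reference universe (proteins with at least one GO term) and `n` to the number of mapped MS proteins that fall inside that universe.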

Code snippets (used scripts)

A) Example script for Step 1: generate Proximity_4h FASTA

from __future__ import annotations  # lets the `str | None` hints below run on Python < 3.10

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

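ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")

# Report failures on stderr: Step 1.1 redirects stdout into the same FASTA file,
# so printing to stdout would overwrite the start of the sequences just written.
import sys
print("Missing:", missing, file=sys.stderr)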

B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(tibble)
  library(openxlsx)
  library(clusterProfiler)
  library(AnnotationDbi)
  library(GO.db)
})

# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2"  # or full path

datasets <- tibble::tribble(
  ~name,              ~blast_tsv,                          ~fasta,
  "Proximity_4h",     "Proximity_4h_vs_ref.blast.tsv",     "Proximity_4h.fasta",
  "Proximity_18h",    "Proximity_18h_vs_ref.blast.tsv",    "Proximity_18h.fasta",
  "ALFApulldown_4h",  "ALFApulldown_4h_vs_ref.blast.tsv",  "ALFApulldown_4h.fasta",
  "ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)

out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV   <- 0.70
MAX_EVALUE <- 1e-10

# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"

# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  parts <- stringr::str_split(x, "\\|", simplify = TRUE)
  is_uniprot <- ncol(parts) >= 3 & (parts[, 1] == "tr" | parts[, 1] == "sp")
  ifelse(is_uniprot, parts[, 2], parts[, 1])
}

norm_subject_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  stringr::str_split_fixed(x, "\\|", 2)[, 1]
}

add_go_desc <- function(df) {
  if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
  df$GO_Term <- vapply(df$ID, function(go_id) {
    term <- tryCatch(
      AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
      error = function(e) NULL
    )
    if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
  }, FUN.VALUE = character(1))
  df
}

safe_read_blast <- function(path) {
  cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
            "qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
  readr::read_tsv(
    path,
    col_names = cols,
    col_types = readr::cols(
      qseqid   = readr::col_character(),
      sseqid   = readr::col_character(),
      pident   = readr::col_double(),
      length   = readr::col_double(),
      mismatch = readr::col_double(),
      gapopen  = readr::col_double(),
      qstart   = readr::col_double(),
      qend     = readr::col_double(),
      sstart   = readr::col_double(),
      send     = readr::col_double(),
      evalue   = readr::col_double(),
      bitscore = readr::col_double(),
      qlen     = readr::col_double(),
      slen     = readr::col_double()
    ),
    progress = FALSE
  )
}

# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
                              stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term   <- as.character(annot_df$Term)

term2gene_go_ref <- annot_df %>%
  dplyr::filter(grepl("^GO:", Term)) %>%
  dplyr::transmute(term = Term, gene = GeneID) %>%
  dplyr::distinct()

universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")

# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms  <- list()
all_besthits  <- list()

# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {

  ds <- datasets[i, ]
  name <- ds$name
  blast_path <- ds$blast_tsv
  fasta_path <- ds$fasta

  cat("\n=============================\n")
  cat("Dataset:", name, "\n")

  if (!file.exists(blast_path)) {
    warning("Missing BLAST TSV: ", blast_path, " (skipping)")
    next
  }
  if (!file.exists(fasta_path)) {
    warning("Missing FASTA: ", fasta_path, " (skipping)")
    next
  }

  # ---- FASTA query IDs ----
  fa <- readLines(fasta_path, warn = FALSE)
  q_all <- fa[startsWith(fa, ">")]
  q_all <- gsub("^>", "", q_all)
  q_all <- unique(norm_mixed_id(q_all))
  n_fasta <- length(q_all)

  # ---- BLAST ----
  bl <- safe_read_blast(blast_path) %>%
    dplyr::mutate(
      qid  = norm_mixed_id(qseqid),
      sid  = norm_subject_id(sseqid),
      qcov = length / qlen
    )

  n_queries_with_hits <- dplyr::n_distinct(bl$qid)
  max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0

  cat("FASTA queries:", n_fasta, "\n")
  cat("BLAST rows:", nrow(bl), "\n")
  cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
  cat("Max hits/query:", max_hits_per_query, "\n")

  bl_f <- bl %>%
    dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)

  cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")

  best <- bl_f %>%
    dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
    dplyr::group_by(qid) %>%
    dplyr::slice_head(n = 1) %>%
    dplyr::ungroup()

  best_out <- best %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::arrange(dplyr::desc(bitscore))

  best_all <- tibble::tibble(qid = q_all) %>%
    dplyr::left_join(best_out, by = "qid")

  unmapped <- best_all %>%
    dplyr::filter(is.na(sid) | sid == "") %>%
    dplyr::distinct(qid) %>%
    dplyr::pull(qid)

  mapped <- best_out %>%
    dplyr::filter(!is.na(sid), sid != "")

  cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
  cat("Unmapped queries:", length(unmapped), "\n")
  cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
  cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")

  gene_list_ref <- unique(mapped$sid)
  n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
  cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")

  # Save best-hit TSVs
  fn_best     <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
  fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
  readr::write_tsv(best_out, fn_best)
  readr::write_tsv(best_all, fn_best_all)

  # GO enrichment
  go_res <- tryCatch(
    clusterProfiler::enricher(
      gene = gene_list_ref,
      TERM2GENE = term2gene_go_ref,
      universe = universe_ref,
      pvalueCutoff = P_CUT,
      pAdjustMethod = PADJ_METHOD
    ),
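    error = function(e) {
      # Do not fail silently: report why enrichment could not be computed
      warning("enricher() failed for ", name, ": ", conditionMessage(e))
      NULL
    }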
  )

  go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
  go_df <- add_go_desc(go_df)

  cat("Enriched GO terms:", nrow(go_df), "\n")

  per_protein_go <- mapped %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::distinct() %>%
    dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
    dplyr::rename(GO = term)

  # Per-dataset Excel
  out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
  wb <- openxlsx::createWorkbook()

  openxlsx::addWorksheet(wb, "BestHit_Mapped")
  openxlsx::writeData(wb, "BestHit_Mapped", mapped)

  openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
  openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))

  openxlsx::addWorksheet(wb, "PerProtein_GO")
  openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)

  openxlsx::addWorksheet(wb, "GO_Enrichment")
  openxlsx::writeData(wb, "GO_Enrichment", go_df)

  openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
  cat("Saved:", out_xlsx, "\n")

  # Collect combined summary
  summary_runs[[name]] <- tibble::tibble(
    dataset = name,
    fasta_queries = n_fasta,
    blast_rows = nrow(bl),
    queries_with_hits = n_queries_with_hits,
    max_hits_per_query = max_hits_per_query,
    filtered_rows = nrow(bl_f),
    filtered_queries = dplyr::n_distinct(bl_f$qid),
    mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
    unmapped_queries = length(unmapped),
    unique_targets_besthit = dplyr::n_distinct(mapped$sid),
    duplicated_targets_besthit = sum(duplicated(mapped$sid)),
    mapped_targets_with_GO = n_targets_with_go,
    enriched_GO_terms = nrow(go_df)
  )

  all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
  all_go_terms[[name]]  <- dplyr::mutate(per_protein_go, dataset = name)
  all_besthits[[name]]  <- dplyr::mutate(mapped, dataset = name)
}

# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)

go_counts <- combined_per_protein_go %>%
  dplyr::filter(!is.na(GO), GO != "") %>%
  dplyr::distinct(dataset, qid, GO) %>%
  dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
  dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))

combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()

openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)

openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)

openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)

openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)

openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)

openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)

cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")

Lakeview/Lake file refresh pipeline: track filtering, filename normalization, and automated .lake updates

This pipeline updates existing Lakeview merged .lake files by replacing their blue track content with filtered track CSVs, while keeping original .h5 filenames/paths unchanged so Lakeview can still load the raw kymographs locally.

Step 1 — Fix naming mismatches before updating lakes

To ensure each kymograph (.h5) finds its corresponding filtered CSV, filename inconsistencies are corrected first:

_p940_ vs _940_: some filtered CSVs contain _p940_ while the corresponding .h5 uses _940_ → rename CSVs accordingly.
ch4 vs ch5: some filtered CSVs were labeled _ch4_ while the .h5 filenames use _ch5_ (or vice versa) → rename CSVs to match.
Extra CSVs without any matching .h5 are removed to avoid confusion later.

This step prevents Lakeview kymos from being dropped simply because no matching CSV is found.
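
A dry-run helper makes this renaming step safer. The sketch below only handles the `_p940_` → `_940_` case described above and returns a rename plan instead of touching files. The `ch4`/`ch5` case is left out on purpose because its direction depends on the dataset. The function names and file names are hypothetical.

```python
def normalized_csv_name(csv_name: str) -> str:
    """Rewrite the '_p940_' token to '_940_'; other names are returned unchanged."""
    return csv_name.replace("_p940_", "_940_")

def plan_renames(csv_names):
    """Return {old_name: new_name} for the CSVs whose name would change."""
    plan = {}
    for name in csv_names:
        new_name = normalized_csv_name(name)
        if new_name != name:
            plan[name] = new_name
    return plan

# Hypothetical file names, for illustration only.
csvs = ["exp1_p940_ch5_blue.csv", "exp2_940_ch5_blue.csv"]
plan = plan_renames(csvs)
print(plan)
```

After reviewing the plan, the renames can be applied with `os.rename(old, new)` inside the CSV folder.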

Step 2 — Filter tracks and generate debug reports

Run the filtering scripts to create multiple filtered outputs from each raw *_blue.csv:

Binding position filter: 2.2–3.8 µm
Lifetime thresholds: ≥1s, ≥2s, ≥5s
A lifetime-only filter: ≥5s without a position constraint

The debug version additionally writes per-track reports (binding position, lifetime, pass/fail reason), which makes it much easier to spot issues caused by parsing, NaNs, or unexpected track structure.
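
The per-track decision can be sketched in a few lines of plain Python. As in `1_filter_track.py` below, the binding position is taken from the first time point of a track and the lifetime is the difference between its last and first time point. The function name `filter_tracks`, the reason labels, and the sample rows are hypothetical and only illustrate the kind of report the debug version writes.

```python
from collections import defaultdict

def filter_tracks(rows, min_pos=2.2, max_pos=3.8, min_lifetime=0.0):
    """rows: (track_index, time_s, position_um) tuples.

    Returns {track_index: (binding_pos, lifetime, reason)}.
    Pass min_pos=None and max_pos=None for a lifetime-only filter.
    """
    tracks = defaultdict(list)
    for track, t, pos in rows:
        tracks[track].append((t, pos))

    report = {}
    for track, points in tracks.items():
        binding_pos = min(points)[1]  # position at the earliest time point
        times = [t for t, _ in points]
        lifetime = max(times) - min(times)
        pos_ok = (min_pos is None or binding_pos >= min_pos) and \
                 (max_pos is None or binding_pos <= max_pos)
        life_ok = lifetime >= min_lifetime
        if pos_ok and life_ok:
            reason = "pass"
        elif not pos_ok and not life_ok:
            reason = "fail_both"
        elif not pos_ok:
            reason = "fail_position"
        else:
            reason = "fail_lifetime"
        report[track] = (binding_pos, lifetime, reason)
    return report

# Hypothetical rows for illustration only.
rows = [
    (0, 0.0, 2.5), (0, 6.0, 2.6),   # inside the window, lives 6 s
    (1, 0.0, 1.0), (1, 1.0, 1.1),   # outside the window, lives 1 s
    (2, 0.0, 3.0), (2, 2.0, 3.0),   # inside the window, lives 2 s
]
report = filter_tracks(rows, min_lifetime=5.0)
print(report)
```

Writing the reason next to each track is what makes parsing problems and NaNs visible: a track that fails for an unexpected reason stands out immediately.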

Step 3 — Organize filtered outputs into separate folders

Move filtered files into dedicated directories so each downstream lake update corresponds to a single filtering rule:

filtered_blue_position (2.2–3.8 µm)
filtered_blue_position_1s (2.2–3.8 µm + ≥1s)
filtered_blue_position_5s (2.2–3.8 µm + ≥5s)
filtered_blue_lifetime_5s_only (≥5s, no position filter)

Step 4 — Update `.lake` files using the filtered tracks

Run 2_update_lakes.py once per filtered folder to create updated .lake outputs (and logs):

For each kymo in each .lake, the script tries to find a matching *_blue*.csv.
Outcomes are classified:
- case1: CSV found and contains ≥1 track → replace blue track text and keep the kymo.
- case2: CSV found but no tracks remain after filtering (header-only / parse error) → remove the kymo.
- case3: no matching CSV → remove the kymo.
- extra: kymo missing a data/tracks/blue field → remove the kymo.
After filtering/removing kymos, the script also rebuilds:
- file_viewer (keeps only .h5 files referenced by retained kymos)
- experiments[*].dataset (keeps only dataset entries matching retained kymos)

This keeps the updated .lake files internally consistent and avoids dangling references.
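
The outcome classes above boil down to a small decision function. The sketch below is a simplified illustration, not the code of `2_update_lakes.py`: the order of the checks, the function names, and the flat list used for `file_viewer` are assumptions, and the real `.lake` structure is richer than shown here.

```python
def classify_kymo(has_blue_field: bool, csv_found: bool, n_tracks: int) -> str:
    """Classify one kymo entry into case1 / case2 / case3 / extra."""
    if not has_blue_field:
        return "extra"   # kymo has no data/tracks/blue field
    if not csv_found:
        return "case3"   # no matching filtered CSV
    if n_tracks < 1:
        return "case2"   # CSV exists but holds no tracks after filtering
    return "case1"       # CSV with at least one track: replace text, keep kymo

def prune_file_viewer(file_viewer, retained_h5):
    """Keep only the .h5 entries that are still referenced by retained kymos."""
    retained = set(retained_h5)
    return [path for path in file_viewer if path in retained]

# Hypothetical inputs for illustration only.
outcomes = {
    "kymo_a.h5": classify_kymo(True, True, 3),
    "kymo_b.h5": classify_kymo(True, True, 0),
    "kymo_c.h5": classify_kymo(True, False, 0),
    "kymo_d.h5": classify_kymo(False, True, 2),
}
kept = [name for name, outcome in outcomes.items() if outcome == "case1"]
viewer = prune_file_viewer(list(outcomes), kept)
print(outcomes)
print(viewer)
```

Only `case1` kymos survive, and everything that refers to a removed kymo is pruned in the same pass.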

Scripts used (code snippets)

1) `1_filter_track.py`

import pandas as pd
import glob
import os

# === Parameters ===
input_folder = "./data"
output_folder = "./filtered"
separated_folder = "./separated"

# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8

# Column names (based on CSV header)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"

# Filter configurations
filter_configs = [
    {
        "label": "position",
        "min_lifetime": 0.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm",
    },
    {
        "label": "position_1s",
        "min_lifetime": 1.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
    },
    {
        "label": "position_5s",
        "min_lifetime": 5.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
    },
    {
        "label": "position_2s",
        "min_lifetime": 2.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
    },
    {
        "label": "lifetime_5s_only",
        "min_lifetime": 5.0,
        "min_binding_pos": None,
        "max_binding_pos": None,
        "desc": "Tracks with lifetime ≥ 5 s, no position filter",
    },
]

def load_csv(filepath):
    """
    Load a blue track CSV:
    - find header line starting with '# track index'
    - read data rows (semicolon-separated, skipping header lines)
    - set lowercase column names based on the header line
    """
    try:
        with open(filepath, "r") as f:
            lines = f.readlines()
        if not lines:
            raise ValueError(f"File {filepath} is empty")

        header_line = None
        for line in lines:
            if line.startswith("# track index"):
                header_line = line.lstrip("# ").strip()
                break
        if header_line is None:
            raise ValueError(
                f"No header line starting with '# track index' found in {filepath}"
            )

        df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
        df.columns = [c.strip().lower() for c in header_line.split(";")]

        required_cols = [track_col, time_col, position_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(
                f"Missing required columns in {filepath}: {missing_cols}"
            )

        return df, lines[0].strip(), header_line
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None, None, None

def compute_lifetime(track_df):
    """
    Compute the lifetime of one track as (max time - min time) in seconds.
    """
    if track_df[time_col].empty:
        return 0.0
    return track_df[time_col].max() - track_df[time_col].min()

# === Main Processing ===
os.makedirs(output_folder, exist_ok=True)
os.makedirs(separated_folder, exist_ok=True)

for filepath in glob.glob(os.path.join(input_folder, "*.csv")):
    print(f"\n=== Processing input file: {filepath} ===")
    df, header1, header2 = load_csv(filepath)
    if df is None:
        continue

    base = os.path.splitext(os.path.basename(filepath))[0]
    total_tracks_in_file = df[track_col].nunique()
    print(f"  Total tracks in file: {total_tracks_in_file}")

    for config in filter_configs:
        label = config["label"]
        min_lifetime = config["min_lifetime"]
        min_binding_pos = config.get("min_binding_pos")
        max_binding_pos = config.get("max_binding_pos")

        kept_tracks = []
        removed_tracks = []

        fail_pos_only = 0
        fail_life_only = 0
        fail_both = 0

        for track_id, track_df in df.groupby(track_col):
            if track_df.empty:
                continue

            binding_pos = track_df[position_col].iloc[0]
            lifetime = compute_lifetime(track_df)

            # Position check only if min/max are defined
            position_ok = True
            if min_binding_pos is not None and binding_pos < min_binding_pos:
                position_ok = False
            if max_binding_pos is not None and binding_pos > max_binding_pos:
                position_ok = False

            lifetime_ok = lifetime >= min_lifetime

            if position_ok and lifetime_ok:
                kept_tracks.append(track_df)
            else:
                removed_tracks.append(track_df)
                if not position_ok and not lifetime_ok:
                    fail_both += 1
                elif not position_ok:
                    fail_pos_only += 1
                elif not lifetime_ok:
                    fail_life_only += 1

        n_kept = len(kept_tracks)
        n_removed = len(removed_tracks)
        total_tracks = n_kept + n_removed

        print(
            f"  [{label}] tracks kept: {n_kept}/{total_tracks} "
            f"(removed: {n_removed}; "
            f"fail_pos_only={fail_pos_only}, "
            f"fail_life_only={fail_life_only}, "
            f"fail_both={fail_both})"
        )

        # --- Write filtered (kept) file or placeholder file ---
        outpath = os.path.join(output_folder, f"{base}_{label}.csv")
        if n_kept > 0:
            # Normal case: some tracks passed the filter
            kept_df = pd.concat(kept_tracks, ignore_index=True)
            with open(outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
            kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
            print(f"    -> Saved filtered tracks ({config['desc']}): {outpath}")
        else:
            # NEW: no track passed the filter → write header-only placeholder file
            with open(outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
                f.write(
                    f"# no tracks passed the '{label}' filter for {base}; "
                    f"placeholder file for downstream processing\n"
                )
            print(
                f"    -> No tracks passed the '{label}' filter for {base}. "
                f"Created header-only placeholder: {outpath}"
            )

        # Save removed tracks (optional, still useful for debugging)
        if removed_tracks:
            removed_df = pd.concat(removed_tracks, ignore_index=True)
            sep_outpath = os.path.join(
                separated_folder, f"{base}_removed_{label}.csv"
            )
            with open(sep_outpath, "w") as f:
                f.write(f"{header1}\n")
                f.write(f"# {header2}\n")
            removed_df.to_csv(
                sep_outpath, mode="a", sep=";", index=False, header=False
            )
            # print(f"    -> Saved removed tracks: {sep_outpath}")

print("\nProcessing complete.")
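
As a quick sanity check of the filter logic in the script above, the following minimal sketch applies the same position and lifetime rules to a small made-up table. The toy values and the `classify_track` helper are hypothetical and only serve as illustration; the column names and the 2.2–3.8 µm / 2 s thresholds are taken from the script.

```python
import pandas as pd

# Column names as used by the script above (after lowercasing the CSV header).
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"


def classify_track(track_df, min_lifetime, min_pos=None, max_pos=None):
    """Return (position_ok, lifetime_ok) for one track (hypothetical helper)."""
    binding_pos = track_df[position_col].iloc[0]  # first recorded position
    lifetime = track_df[time_col].max() - track_df[time_col].min()
    position_ok = True
    if min_pos is not None and binding_pos < min_pos:
        position_ok = False
    if max_pos is not None and binding_pos > max_pos:
        position_ok = False
    return position_ok, bool(lifetime >= min_lifetime)


# Made-up data: three tracks with two points each.
toy = pd.DataFrame(
    {
        track_col: [0, 0, 1, 1, 2, 2],
        time_col: [0.0, 6.0, 1.0, 1.5, 0.0, 3.0],
        position_col: [2.5, 2.6, 3.0, 3.1, 4.5, 4.4],
    }
)

results = {
    int(track_id): classify_track(g, min_lifetime=2.0, min_pos=2.2, max_pos=3.8)
    for track_id, g in toy.groupby(track_col)
}
for track_id, (pos_ok, life_ok) in results.items():
    print(track_id, "position_ok =", pos_ok, "lifetime_ok =", life_ok)
```

In this toy table, track 0 passes both checks, track 1 is too short-lived (0.5 s), and track 2 starts outside the position window (4.5 µm), which corresponds to the `fail_life_only` and `fail_pos_only` counters of the script.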

2) `1_filter_track_debug.py`

import pandas as pd
import glob
import os
import argparse
from typing import Optional, Tuple

# Default position filter parameters (in µm)
default_min_binding_pos = 2.2
default_max_binding_pos = 3.8

# Column names (based on CSV header, after lowercasing)
track_col = "track index"
time_col = "time (seconds)"
position_col = "position (um)"

# Filter configurations
filter_configs = [
    {
        "label": "position",
        "min_lifetime": 0.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm",
    },
    {
        "label": "position_1s",
        "min_lifetime": 1.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 1 s",
    },
    {
        "label": "position_5s",
        "min_lifetime": 5.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 5 s",
    },
    {
        "label": "position_2s",
        "min_lifetime": 2.0,
        "min_binding_pos": default_min_binding_pos,
        "max_binding_pos": default_max_binding_pos,
        "desc": "Tracks with binding position 2.2–3.8 µm and lifetime ≥ 2 s",
    },
    {
        "label": "lifetime_5s_only",
        "min_lifetime": 5.0,
        "min_binding_pos": None,
        "max_binding_pos": None,
        "desc": "Tracks with lifetime ≥ 5 s, no position filter",
    },
]

def parse_args():
    p = argparse.ArgumentParser(description="Filter blue track CSVs and emit debug reports per track.")
    p.add_argument("--input_folder", "-i", default="./data", help="Folder containing input *_blue.csv files.")
    p.add_argument("--output_folder", "-o", default="./filtered", help="Folder to write filtered CSVs.")
    p.add_argument("--separated_folder", "-s", default="./separated", help="Folder to write removed-tracks CSVs.")
    p.add_argument("--debug_folder", "-d", default="./debug_reports", help="Folder to write per-track debug reports.")
    p.add_argument(
        "--only",
        default=None,
        help="Optional: only process files whose basename contains this substring (e.g. 'p967_250704_502_10pN_ch4_0bar_b4_1_blue').",
    )
    p.add_argument(
        "--binding_pos_method",
        choices=["first_non_nan", "median", "mean"],
        default="first_non_nan",
        help="How to compute 'binding position' per track for the position filter.",
    )
    p.add_argument("--verbose", action="store_true", help="Print per-track debug for removed tracks (can be noisy).")
    return p.parse_args()

def _coerce_numeric(series: pd.Series) -> pd.Series:
    """
    Coerce numeric robustly:
    - supports comma decimal separators: '1,23' -> '1.23'
    - invalid parses become NaN
    """
    s = series.astype(str).str.strip()
    s = s.str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")

def load_csv(filepath: str) -> Tuple[Optional[pd.DataFrame], Optional[str], Optional[str]]:
    """
    Load a blue track CSV:
    - find header line starting with '# track index'
    - read data rows (semicolon-separated, skipping 2 header lines)
    - set lowercase column names based on the header line
    - coerce time/position to numeric robustly (comma decimals supported)
    """
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            lines = f.readlines()
        if not lines:
            raise ValueError(f"File {filepath} is empty")

        header_line = None
        for line in lines:
            if line.startswith("# track index"):
                header_line = line.lstrip("# ").strip()
                break
        if header_line is None:
            raise ValueError(f"No header line starting with '# track index' found in {filepath}")

        df = pd.read_csv(filepath, sep=";", comment="#", header=None, skiprows=2)
        df.columns = [c.strip().lower() for c in header_line.split(";")]

        required_cols = [track_col, time_col, position_col]
        missing_cols = [col for col in required_cols if col not in df.columns]
        if missing_cols:
            raise ValueError(f"Missing required columns in {filepath}: {missing_cols}")

        # Robust numeric conversion
        df[time_col] = _coerce_numeric(df[time_col])
        df[position_col] = _coerce_numeric(df[position_col])

        # If conversion introduced NaNs, keep them but warn (important for debugging)
        n_time_nan = int(df[time_col].isna().sum())
        n_pos_nan = int(df[position_col].isna().sum())
        if n_time_nan > 0 or n_pos_nan > 0:
            print(
                f"  [WARN] {os.path.basename(filepath)}: NaNs after numeric parsing "
                f"(time NaN={n_time_nan}, position NaN={n_pos_nan}). "
                f"This can cause lifetime=0 or position filters to behave unexpectedly."
            )

        return df, lines[0].strip(), header_line
    except Exception as e:
        print(f"Error loading {filepath}: {e}")
        return None, None, None

def compute_lifetime(track_df: pd.DataFrame) -> float:
    """
    Lifetime = max(time) - min(time), using only non-NaN times.
    """
    t = track_df[time_col].dropna()
    if t.empty:
        return 0.0
    return float(t.max() - t.min())

def compute_binding_pos(track_df: pd.DataFrame, method: str) -> float:
    """
    Binding position metric used for filtering.
    """
    p = track_df[position_col].dropna()
    if p.empty:
        return float("nan")
    if method == "first_non_nan":
        return float(p.iloc[0])
    if method == "median":
        return float(p.median())
    if method == "mean":
        return float(p.mean())
    return float(p.iloc[0])

def main():
    args = parse_args()

    os.makedirs(args.output_folder, exist_ok=True)
    os.makedirs(args.separated_folder, exist_ok=True)
    os.makedirs(args.debug_folder, exist_ok=True)

    for filepath in glob.glob(os.path.join(args.input_folder, "*.csv")):
        basefile = os.path.basename(filepath)
        base = os.path.splitext(basefile)[0]

        if args.only and args.only not in base:
            continue

        print(f"\n=== Processing input file: {filepath} ===")
        df, header1, header2 = load_csv(filepath)
        if df is None:
            continue

        total_tracks_in_file = df[track_col].nunique()
        print(f"  Total tracks in file: {total_tracks_in_file}")

        for config in filter_configs:
            label = config["label"]
            min_lifetime = float(config["min_lifetime"])
            min_binding_pos = config.get("min_binding_pos")
            max_binding_pos = config.get("max_binding_pos")

            kept_tracks = []
            removed_tracks = []

            # For debug report
            track_rows = []

            fail_pos_only = 0
            fail_life_only = 0
            fail_both = 0

            for track_id, track_df in df.groupby(track_col):
                if track_df.empty:
                    continue

                binding_pos = compute_binding_pos(track_df, args.binding_pos_method)
                lifetime = compute_lifetime(track_df)

                # Position check only if min/max are defined, and treat NaN binding_pos as "fails position"
                position_ok = True
                if min_binding_pos is not None or max_binding_pos is not None:
                    if pd.isna(binding_pos):
                        position_ok = False
                    else:
                        if min_binding_pos is not None and binding_pos < float(min_binding_pos):
                            position_ok = False
                        if max_binding_pos is not None and binding_pos > float(max_binding_pos):
                            position_ok = False

                lifetime_ok = lifetime >= min_lifetime

                reason_parts = []
                if not position_ok:
                    reason_parts.append("position_out_of_range_or_nan")
                if not lifetime_ok:
                    reason_parts.append("lifetime_too_short_or_time_nan")
                reason = "PASS" if (position_ok and lifetime_ok) else "+".join(reason_parts)

                # Debug record per track
                track_rows.append(
                    {
                        "track_id": track_id,
                        "n_points": int(len(track_df)),
                        "binding_pos_um": binding_pos,
                        "binding_pos_method": args.binding_pos_method,
                        "lifetime_s": lifetime,
                        "position_ok": bool(position_ok),
                        "lifetime_ok": bool(lifetime_ok),
                        "reason": reason,
                        "min_binding_pos_um": min_binding_pos,
                        "max_binding_pos_um": max_binding_pos,
                        "min_lifetime_s": min_lifetime,
                    }
                )

                if position_ok and lifetime_ok:
                    kept_tracks.append(track_df)
                else:
                    removed_tracks.append(track_df)
                    if not position_ok and not lifetime_ok:
                        fail_both += 1
                    elif not position_ok:
                        fail_pos_only += 1
                    elif not lifetime_ok:
                        fail_life_only += 1

                    if args.verbose:
                        print(
                            f"    [REMOVED {label}] track={track_id} "
                            f"binding_pos={binding_pos} lifetime={lifetime} reason={reason}"
                        )

            n_kept = len(kept_tracks)
            n_removed = len(removed_tracks)
            total_tracks = n_kept + n_removed

            print(
                f"  [{label}] tracks kept: {n_kept}/{total_tracks} "
                f"(removed: {n_removed}; "
                f"fail_pos_only={fail_pos_only}, "
                f"fail_life_only={fail_life_only}, "
                f"fail_both={fail_both})"
            )

            # --- Write filtered (kept) file or placeholder file ---
            outpath = os.path.join(args.output_folder, f"{base}_{label}.csv")
            if n_kept > 0:
                kept_df = pd.concat(kept_tracks, ignore_index=True)
                with open(outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                kept_df.to_csv(outpath, mode="a", sep=";", index=False, header=False)
                print(f"    -> Saved filtered tracks ({config['desc']}): {outpath}")
            else:
                with open(outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                    f.write(
                        f"# no tracks passed the '{label}' filter for {base}; "
                        f"placeholder file for downstream processing\n"
                    )
                print(
                    f"    -> No tracks passed the '{label}' filter for {base}. "
                    f"Created header-only placeholder: {outpath}"
                )

            # Save removed tracks (still useful)
            if removed_tracks:
                removed_df = pd.concat(removed_tracks, ignore_index=True)
                sep_outpath = os.path.join(args.separated_folder, f"{base}_removed_{label}.csv")
                with open(sep_outpath, "w", encoding="utf-8") as f:
                    f.write(f"{header1}\n")
                    f.write(f"# {header2}\n")
                removed_df.to_csv(sep_outpath, mode="a", sep=";", index=False, header=False)

            # --- NEW: per-track debug report ---
            report_df = pd.DataFrame(track_rows).sort_values(["reason", "track_id"])
            report_path = os.path.join(args.debug_folder, f"{base}_{label}_track_report.csv")
            report_df.to_csv(report_path, index=False)
            print(f"    -> Wrote per-track debug report: {report_path}")

    print("\nProcessing complete.")

if __name__ == "__main__":
    main()
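
To see why the debug variant offers `--binding_pos_method`, here is a minimal sketch of how the three choices can disagree for one and the same track, including the comma-decimal handling. The raw values are made up, and `coerce_numeric` is only a stand-in for `_coerce_numeric` above.

```python
import pandas as pd


def coerce_numeric(series):
    """Same idea as _coerce_numeric above: '1,23' -> 1.23, invalid -> NaN."""
    s = series.astype(str).str.strip().str.replace(",", ".", regex=False)
    return pd.to_numeric(s, errors="coerce")


# Hypothetical raw position column of a single track, as read from a CSV.
raw = pd.Series(["n/a", "2,0", "3.0", " 7,0 "])
pos = coerce_numeric(raw)

valid = pos.dropna()
binding_pos = {
    "first_non_nan": float(valid.iloc[0]),
    "median": float(valid.median()),
    "mean": float(valid.mean()),
}

# Apply the default 2.2-3.8 um window to each variant.
in_window = {name: 2.2 <= value <= 3.8 for name, value in binding_pos.items()}
print(int(pos.isna().sum()), binding_pos, in_window)
```

With these made-up values only the median falls inside the 2.2–3.8 µm window, so the same track would be kept or removed depending on the chosen method. The per-track report is meant to make exactly this kind of difference visible.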

3) `2_update_lakes.py`

import pandas as pd
import glob
import os
import json
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Update merged lake files with filtered blue track CSVs."
    )
    parser.add_argument(
        "--merged_lake_folder", "-m",
        default="./",
        help="Folder containing merged .lake files (default: ./)"
    )
    parser.add_argument(
        "--filtered_folder", "-f",
        default="./filtered",
        help="Folder containing filtered blue track CSVs (default: ./filtered)"
    )
    parser.add_argument(
        "--output_folder", "-o",
        default="./updated_lakes",
        help="Folder to write updated .lake files to (default: ./updated_lakes)"
    )
    return parser.parse_args()

def build_blue_text_from_csv(csv_path):
    """
    Rebuild the 'blue' track text block for Lakeview from a filtered CSV file.

    Returns
    -------
    blue_text : str
        Header + (optional) data rows in Lakeview format.
    n_rows : int
        Number of data rows (track points, not distinct tracks). If 0, the CSV
        is considered "header only" / no tracks after filtering.
    """
    with open(csv_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if len(lines) < 2:
        raise ValueError(f"{csv_path} has fewer than 2 header lines. Please check the file.")

    header1 = lines[0].strip()  # first header line
    header2 = lines[1].strip()  # second header line with column names

    data_lines = lines[2:]

    # Check if there is any non-comment, non-empty data line
    has_data = any(
        (not ln.lstrip().startswith("#")) and ln.strip() != ""
        for ln in data_lines
    )

    # Column names are taken from the second header line (strip leading '# ')
    colnames = [c.strip() for c in header2.lstrip("# ").split(";")]

    base_text = header1 + "\n" + header2 + "\n"

    if not has_data:
        # Header-only CSV -> no tracks after filtering
        return base_text, 0

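    # Read data rows with pandas. dtype=str keeps every value exactly as written
    # in the CSV; otherwise iterrows() below would upcast integer columns
    # (e.g. the track index) to floats such as "3.0".
    df = pd.read_csv(csv_path, sep=";", comment="#", header=None, skiprows=2, dtype=str)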
    if df.shape[0] == 0:
        # Safety net: no rows
        return base_text, 0

    df.columns = colnames
    n_rows = len(df)

    txt = base_text
    for _, row in df.iterrows():
        row_str = ";".join(str(row[c]) for c in colnames)
        txt += row_str + "\n"

    return txt, n_rows

def find_matching_csv(filtered_folder, kymo_name, i):
    """
    Try to find the filtered CSV corresponding to a given kymo_name.

    1) First, try exact match: <kymo_name>_blue*.csv
    2) If not found, try a 'p'-patched version of the numeric chunk (e.g. 940 -> p940 or p940 -> 940)
    """

    # 1) Exact match
    pattern = os.path.join(filtered_folder, f"{kymo_name}_blue*.csv")
    candidates = glob.glob(pattern)
    if len(candidates) == 1:
        return candidates[0]
    elif len(candidates) > 1:
        print(f"  [kymo {i}] Multiple CSV matches for {kymo_name} (exact), skipping:")
        for c in candidates:
            print(f"    - {c}")
        return None  # ambiguous

    # 2) Fallback: patch the 3-digit numeric part by adding or removing 'p'
    parts = kymo_name.split("_")
    alt_candidates = []

    for idx, part in enumerate(parts):
        # Case A: pure 3-digit number (e.g. "940") -> try "p940"
        if part.isdigit() and len(part) == 3:
            alt_parts = parts.copy()
            alt_parts[idx] = "p" + part
            alt_name = "_".join(alt_parts)
            alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
            alt_candidates = glob.glob(alt_pattern)
            if alt_candidates:
                print(
                    f"  [kymo {i}] No exact CSV for '{kymo_name}', "
                    f"but found match using '{alt_name}'."
                )
                break

        # Case B: starts with 'p' and then 3 digits (e.g. "p940") -> try without 'p'
        if part.startswith("p") and part[1:].isdigit() and len(part) == 4:
            alt_parts = parts.copy()
            alt_parts[idx] = part[1:]  # drop the leading 'p'
            alt_name = "_".join(alt_parts)
            alt_pattern = os.path.join(filtered_folder, f"{alt_name}_blue*.csv")
            alt_candidates = glob.glob(alt_pattern)
            if alt_candidates:
                print(
                    f"  [kymo {i}] No exact CSV for '{kymo_name}', "
                    f"but found match using '{alt_name}'."
                )
                break

    if len(alt_candidates) == 1:
        return alt_candidates[0]
    elif len(alt_candidates) > 1:
        print(f"  [kymo {i}] Multiple CSV matches for patched name, skipping:")
        for c in alt_candidates:
            print(f"    - {c}")
        return None

    # Nothing found
    return None

def main():
    args = parse_args()

    merged_lake_folder = args.merged_lake_folder
    filtered_folder    = args.filtered_folder
    output_folder      = args.output_folder

    os.makedirs(output_folder, exist_ok=True)

    # Global counters across all lakes
    total_case1 = 0          # case1: CSV found & n_rows>0 → tracks updated (kymo kept)
    total_case2 = 0          # case2: CSV exists, but no tracks remain after filtering (empty or error) → kymo removed
    total_case3 = 0          # case3: no matching CSV → kymo removed
    total_extra = 0          # extra: kymo without data/tracks/blue → removed

    # Detailed lists of sample names (lake, kymo, ...)
    case1_kymos = []         # (lake_file, kymo_name, csv_path)
    case2_kymos = []         # (lake_file, kymo_name, csv_path, reason)
    case3_kymos = []         # (lake_file, kymo_name)
    extra_kymos = []         # (lake_file, kymo_name)

    used_csv_paths = set()   # CSVs that were actually matched to some kymo

    # Loop over all merged .lake files
    for lake_path in glob.glob(os.path.join(merged_lake_folder, "*.lake")):
        base = os.path.basename(lake_path)
        print(f"\n=== Processing lake file: {base} ===")

        # per-lake list of removed kymograph names
        removed_kymo_names = set()

        # Load JSON from .lake file
        with open(lake_path, "r", encoding="utf-8") as f:
            lake = json.load(f)

        old_kymos = lake.get("kymos", [])
        new_kymos = []   # we will build a filtered list here

        # Iterate over all kymos in this lake
        for i, kymo in enumerate(old_kymos):
            # Extract kymograph name from address.path (last segment of the path)
            addr = kymo.get("address", {})
            path = addr.get("path", "")
            kymo_name = path.split("/")[-1] if path else None

            if not kymo_name:
                print(f"  [kymo {i}] No valid name/path found, keeping kymograph unchanged.")
                # keep it as-is (very unusual case)
                new_kymos.append(kymo)
                continue

            # Find the corresponding filtered CSV
            csv_path = find_matching_csv(filtered_folder, kymo_name, i)
            if csv_path is None:
                # case3: no CSV → remove kymo
                print(
                    f"  [kymo {i}] No suitable CSV found for '{kymo_name}' "
                    f"in {filtered_folder} → REMOVING kymograph from output lake."
                )
                total_case3 += 1
                case3_kymos.append((base, kymo_name))
                removed_kymo_names.add(kymo_name)
                continue

            csv_name = os.path.basename(csv_path)
            used_csv_paths.add(os.path.abspath(csv_path))

            # Build the new blue track text from the filtered CSV
            try:
                blue_text, n_rows = build_blue_text_from_csv(csv_path)
            except Exception as e:
                # case2: CSV present but not parseable
                msg = f"read error: {e}"
                print(f"  [kymo {i}] Error reading {csv_name}: {msg} → REMOVING kymograph.")
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)
                continue

            if n_rows == 0:
                # case2: CSV present but no tracks after filtering
                msg = "0 tracks after filtering (header-only CSV)"
                print(
                    f"  [kymo {i}] CSV {csv_name} contains no tracks after filtering "
                    f"→ REMOVING kymograph."
                )
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)
                continue

            # If we reach here, we have a non-empty CSV, so this is case1
            try:
                if "data" in kymo and "tracks" in kymo["data"] and "blue" in kymo["data"]["tracks"]:
                    kymo["data"]["tracks"]["blue"] = blue_text
                    new_kymos.append(kymo)
                    total_case1 += 1
                    case1_kymos.append((base, kymo_name, csv_path))
                    print(f"  [kymo {i}] Updated blue tracks from {csv_name} (kept).")
                else:
                    # extra: kymo structure has no blue field at all → remove
                    print(
                        f"  [kymo {i}] Kymo '{kymo_name}' has no data/tracks/blue field "
                        f"→ REMOVING from output lake."
                    )
                    total_extra += 1
                    extra_kymos.append((base, kymo_name))
                    removed_kymo_names.add(kymo_name)
            except Exception as e:
                # treat write problems also as case2
                msg = f"write error: {e}"
                print(
                    f"  [kymo {i}] Error writing tracks for {kymo_name}: {msg} "
                    f"→ REMOVING kymograph."
                )
                total_case2 += 1
                case2_kymos.append((base, kymo_name, csv_path, msg))
                removed_kymo_names.add(kymo_name)

        # Replace kymos list with filtered one (case2/case3/extra removed)
        lake["kymos"] = new_kymos

        # ------------------------------------------------------
        #  NEW PART: rebuild file_viewer and experiments[*].dataset
        #  so that H5 links are consistent with the kept kymos.
        # ------------------------------------------------------
        kept_kymo_names = set()
        file_viewer_files = []

        for kymo in new_kymos:
            addr = kymo.get("address", {})
            path = addr.get("path", "")
            file = addr.get("file", "")
            if path:
                name = path.split("/")[-1]
                kept_kymo_names.add(name)
            if file and file not in file_viewer_files:
                file_viewer_files.append(file)

        # 1) Root-level file_viewer: only files from kept kymos
        if "file_viewer" in lake:
            lake["file_viewer"] = file_viewer_files

        # 2) Experiments datasets: keep only entries whose path matches kept kymo
        if "experiments" in lake and isinstance(lake["experiments"], dict):
            for exp_key, exp in lake["experiments"].items():
                if not isinstance(exp, dict):
                    continue
                dataset = exp.get("dataset")
                if isinstance(dataset, list):
                    new_dataset = []
                    for item in dataset:
                        if not isinstance(item, dict):
                            continue
                        addr = item.get("address", {})
                        path = addr.get("path", "")
                        name = path.split("/")[-1] if path else None
                        if name in kept_kymo_names:
                            new_dataset.append(item)
                    exp["dataset"] = new_dataset

        # Save updated lake JSON to output folder
        out_path = os.path.join(output_folder, base)
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(lake, f, indent=4)
        print(f"==> {base}: kept {len(new_kymos)} kymos after filtering, written to {out_path}")

    # --- Global summary over all lakes ---
    print("\n=== Summary over all processed lakes ===")
    print(f"  case1: updated kymos (CSV found & ≥1 track, kept)  = {total_case1}")
    print(f"  case2: removed kymos (CSV exists, but no tracks remain after filtering)    = {total_case2}")
    print(f"  case3: removed kymos (no matching CSV found)       = {total_case3}")
    print(f"  extra: removed kymos (no data/tracks/blue field)   = {total_extra}")
    total_kymos = total_case1 + total_case2 + total_case3 + total_extra
    print(f"  total kymos classified (sum of the above)          = {total_kymos}")

    # CSV usage check
    all_csv_paths = sorted(os.path.abspath(p) for p in glob.glob(os.path.join(filtered_folder, "*_blue*.csv")))
    print(f"\nTotal CSV files in filtered_folder: {len(all_csv_paths)}")
    print(f"CSV files actually used (matched to some kymo): {len(used_csv_paths)}")

    unused_csv = [p for p in all_csv_paths if p not in used_csv_paths]
    if unused_csv:
        print("\nCSV files NOT used by any kymo (name mismatch / other replicates):")
        for p in unused_csv:
            print(f"  {p}")
    else:
        print("\nAll CSV files in filtered_folder were used by at least one kymo.")

if __name__ == "__main__":
    main()
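
The 'p'-patching fallback in `find_matching_csv` is easier to follow without the file system. The sketch below only lists the alternative names that the fallback would try; `patched_names` is a hypothetical helper, and the example names are modelled on the one in the `--only` help text of the debug script.

```python
def patched_names(kymo_name):
    """List alternative names with 'p' added to or removed from 3-digit chunks."""
    parts = kymo_name.split("_")
    alternatives = []
    for idx, part in enumerate(parts):
        alt = None
        if part.isdigit() and len(part) == 3:
            alt = "p" + part  # e.g. "940" -> "p940"
        elif len(part) == 4 and part.startswith("p") and part[1:].isdigit():
            alt = part[1:]  # e.g. "p940" -> "940"
        if alt is not None:
            alternatives.append("_".join(parts[:idx] + [alt] + parts[idx + 1:]))
    return alternatives


for name in (
    "940_250704_502_10pN_ch4_0bar_b4_1",
    "p967_250704_502_10pN_ch4_0bar_b4_1",
):
    print(name)
    for alt in patched_names(name):
        print("   ->", alt)
```

Note that every pure 3-digit chunk is a candidate, so a name like the ones above also yields a variant with `p502`. The script itself walks through the chunks in order and stops at the first patched name for which a CSV exists on disk.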

TODOs

https://laborbuch.uke.de/
https://www.linkedin.com/in/jiabin-huang-257a5960/
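https://www.germanysky-shop.com/product/detail1031 Highly recommended, works really well: Tetesept warming balm for muscles, shoulders and neck, 100 ml (Muskel Vital Wärme-Balsam). The original cardboard box is thin, so slight damage to the box during international air shipping is hard to avoid; order only if you do not mind. The natural essential oils of ginger and rosemary as well as eucalyptus and lemon oil give a pleasant scent on application. The formula with warming ingredients gives the skin a quick and pleasant feeling of warmth after the balm is massaged in. The texture is not greasy and conducts heat quickly into the muscle tissue, similar to a hot compress. It effectively relieves discomfort from muscle tension, soreness or strains. Shop number 000903 | Item number 000903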
https://schach.in/zahlen/40023/
https://schach.in/zahlen/40007/
https://schach.in/zahlen/40042/
https://www.hsjb.de/hjet-2026/hjet-2026-u10-1/
https://www.hsjb.de/hjet-2025/hjet-2025-u14/
Key points of the bishop-and-knight checkmate: https://sports.sina.cn/others/2020-05-18/detail-iircuyvi3719224.d.html?vt=4
https://www.genomicepidemiology.org/services/
https://genepi.dk/
ResFinder
#ResFinderFG
KmerResistance
#PathogenFinder
VirulenceFinder

Workflow using PICRUSt2 for Data_Karoline_16S_2025 (v2)

Environment setup: create a Conda environment named picrust2 with mamba create, then activate it with mamba activate.

 #https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.2.0-beta)#minimum-requirements-to-run-full-tutorial
 mamba create -n picrust2 -c bioconda -c conda-forge picrust2    #2.5.3  #=2.2.0_b
 mamba activate /home/jhuang/miniconda3/envs/picrust2

Under docker-env (qiime2-amplicon-2023.9)

Export QIIME2 feature table and representative sequences

 #docker pull quay.io/qiime2/core:2023.9
 #docker run -it --rm \
 #-v /mnt/md1/DATA/Data_Karoline_16S_2025:/data \
 #-v /home/jhuang/REFs:/home/jhuang/REFs \
 #quay.io/qiime2/core:2023.9 bash
 #cd /data
 # === SETTINGS ===
 FEATURE_TABLE_QZA="dada2_tests2/test_7_f240_r240/table.qza"
 REP_SEQS_QZA="dada2_tests2/test_7_f240_r240/rep-seqs.qza"

 # === STEP 1: EXPORT QIIME2 ARTIFACTS ===
 mkdir -p qiime2_export
 qiime tools export --input-path $FEATURE_TABLE_QZA --output-path qiime2_export
 qiime tools export --input-path $REP_SEQS_QZA --output-path qiime2_export

Convert BIOM to TSV for PICRUSt2 input

 biom convert \
 -i qiime2_export/feature-table.biom \
 -o qiime2_export/feature-table.tsv \
 --to-tsv
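
The TSV written by `biom convert --to-tsv` starts with a comment line (`# Constructed from biom file`) in front of the real header, which is why the workflow strips the first line with `tail -n +2` before running PICRUSt2. A minimal Python sketch of the same step, using a made-up two-ASV table (the ASV and sample names are hypothetical):

```python
import io

import pandas as pd

# Hypothetical miniature of the TSV produced by `biom convert --to-tsv`.
tsv_text = (
    "# Constructed from biom file\n"
    "#OTU ID\tsample1\tsample2\n"
    "ASV_1\t10.0\t0.0\n"
    "ASV_2\t3.0\t7.0\n"
)

# Equivalent of `tail -n +2`: keep everything from the second line onward.
fixed_text = "".join(tsv_text.splitlines(keepends=True)[1:])

# The remaining first line ("#OTU ID ...") is now a regular header row.
table = pd.read_csv(io.StringIO(fixed_text), sep="\t", index_col=0)
print(table)
```

After dropping the first line, the table has one row per ASV and one column per sample, with `#OTU ID` as the name of the index column.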

Under env (picrust2): mamba activate /home/jhuang/miniconda3/envs/picrust2

Run PICRUSt2 pipeline

 tail -n +2 qiime2_export/feature-table.tsv > qiime2_export/feature-table-fixed.tsv
 picrust2_pipeline.py \
 -s qiime2_export/dna-sequences.fasta \
 -i qiime2_export/feature-table-fixed.tsv \
 -o picrust2_out \
 -p 100

 #This will:
 #* Place sequences in the reference tree (using EPA-NG),
 #* Predict gene family abundances (e.g., EC, KO, PFAM, TIGRFAM),
 #* Predict pathway abundances.

 #In current PICRUSt2 (with picrust2_pipeline.py), you do not run hsp.py separately.
 #Instead, picrust2_pipeline.py internally runs the HSP step for all functional categories automatically. It outputs all the prediction files (16S_predicted_and_nsti.tsv.gz, COG_predicted.tsv.gz, PFAM_predicted.tsv.gz, KO_predicted.tsv.gz, EC_predicted.tsv.gz, TIGRFAM_predicted.tsv.gz, PHENO_predicted.tsv.gz) in the output directory.
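The `tail -n +2` step in the PICRUSt2 run above removes only the first line of the converted table. The following is a minimal sketch on a toy file, assuming that `biom convert --to-tsv` writes a `# Constructed from biom file` comment line before the `#OTU ID` header; the temporary directory, ASV names and counts are made up for illustration.

```shell
# Toy stand-in for qiime2_export/feature-table.tsv (hypothetical contents).
tmpdir="$(mktemp -d)"
printf '# Constructed from biom file\n#OTU ID\tsample-A1\tsample-A2\nASV1\t10\t0\nASV2\t3\t7\n' \
  > "$tmpdir/feature-table.tsv"

# Keep everything from line 2 onward, i.e. drop only the leading comment line.
tail -n +2 "$tmpdir/feature-table.tsv" > "$tmpdir/feature-table-fixed.tsv"

# The first line of the fixed table is now the column header.
head -n 1 "$tmpdir/feature-table-fixed.tsv"
```

Checking the first line this way before starting a long run is cheap, and it makes clear whether the header survived the conversion.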

 mkdir picrust2_out_advanced; cd picrust2_out_advanced
 #If you still want to run hsp.py manually (advanced use / debugging), the commands correspond directly:
 hsp.py -i 16S -t ../picrust2_out/out.tre -o 16S_predicted_and_nsti.tsv.gz -p 100 -n
 hsp.py -i COG -t ../picrust2_out/out.tre -o COG_predicted.tsv.gz -p 100
 hsp.py -i PFAM -t ../picrust2_out/out.tre -o PFAM_predicted.tsv.gz -p 100
 hsp.py -i KO -t ../picrust2_out/out.tre -o KO_predicted.tsv.gz -p 100
 hsp.py -i EC -t ../picrust2_out/out.tre -o EC_predicted.tsv.gz -p 100
 hsp.py -i TIGRFAM -t ../picrust2_out/out.tre -o TIGRFAM_predicted.tsv.gz -p 100
 hsp.py -i PHENO -t ../picrust2_out/out.tre -o PHENO_predicted.tsv.gz -p 100

Metagenome prediction per functional category (if needed separately)

 #cd picrust2_out_advanced
 metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f COG_predicted.tsv.gz -o COG_metagenome_out --strat_out
 metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz -o EC_metagenome_out --strat_out
 metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f KO_predicted.tsv.gz -o KO_metagenome_out --strat_out
 metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f PFAM_predicted.tsv.gz -o PFAM_metagenome_out --strat_out
 metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f TIGRFAM_predicted.tsv.gz -o TIGRFAM_metagenome_out --strat_out

 # Add descriptions in gene family tables
 add_descriptions.py -i COG_metagenome_out/pred_metagenome_unstrat.tsv.gz -m COG -o COG_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
 add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
 # Note: EC and METACYC form a pair; EC is used for gene-family annotation and METACYC for pathway annotation.
 add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
 add_descriptions.py -i PFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m PFAM -o PFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
 add_descriptions.py -i TIGRFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m TIGRFAM -o TIGRFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

Pathway inference (MetaCyc pathways from EC numbers)

 #cd picrust2_out_advanced
 pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o EC_pathways_out -p 100
 pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o EC_pathways_out_per_seq -p 100 --per_sequence_contrib --per_sequence_abun EC_metagenome_out/seqtab_norm.tsv.gz --per_sequence_function EC_predicted.tsv.gz
 #ERROR due to missing .../pathway_mapfiles/KEGG_pathways_to_KO.tsv
 pathway_pipeline.py -i COG_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv
 pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_strat.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

 add_descriptions.py -i EC_pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o EC_pathways_out/path_abun_unstrat_descrip.tsv.gz
 gunzip EC_pathways_out/path_abun_unstrat_descrip.tsv.gz

 #Error - no rows remain after regrouping input table. The default pathway and regroup mapfiles are meant for EC numbers. Note that KEGG pathways are not supported since KEGG is a closed-source database, but you can input custom pathway mapfiles if you have access. If you are using a custom function database did you mean to set the --no-regroup flag and/or change the default pathways mapfile used?
 #If this error occurs, use the MetaCyc pathways (EC_pathways_out) for downstream analyses.

 #ERROR due to missing .../description_mapfiles/KEGG_pathways_info.tsv.gz
 #add_descriptions.py -i KO_pathways_out/path_abun_unstrat.tsv.gz -o KEGG_pathways_out/path_abun_unstrat_descrip.tsv.gz --custom_map_table /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/description_mapfiles/KEGG_pathways_info.tsv.gz

 #NOTE: Target-analysis for the pathway "mixed acid fermentation"

Visualization

 #7.1 STAMP
 #https://github.com/picrust/picrust2/wiki/STAMP-example
 #Note: STAMP is run under Windows here (the input files are copied to the Win10 share below)

 # It needs two files: path_abun_unstrat_descrip.tsv (gunzipped above) as "Profile file" and metadata.tsv as "Group metadata file".
 cp ~/DATA/Data_Karoline_16S_2025/picrust2_out_advanced/EC_pathways_out/path_abun_unstrat_descrip.tsv ~/DATA/Access_to_Win10/

 cut -d$'\t' -f1 qiime2_metadata.tsv > 1
 cut -d$'\t' -f3 qiime2_metadata.tsv > 3
 cut -d$'\t' -f5-6 qiime2_metadata.tsv > 5_6
 paste -d$'\t' 1 3 > 1_3
 paste -d$'\t' 1_3 5_6 > metadata.tsv
 #SampleID --> SampleID
 SampleID        Group   pre_post        Sex_age
 sample-A1       Group1  3d.post.stroke  male.aged
 sample-A2       Group1  3d.post.stroke  male.aged
 sample-A3       Group1  3d.post.stroke  male.aged
 cp ~/DATA/Data_Karoline_16S_2025/metadata.tsv ~/DATA/Access_to_Win10/
 # MANUAL EDITING: keep only the needed records in metadata.tsv: Group 9 (J1–J4, J10, J11) and Group 10 (K1–K6).
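The cut/paste sequence above builds metadata.tsv through intermediate files named 1, 3, 5_6 and 1_3. As a sketch, the same column selection can be done in one pass, because `cut` accepts a field list and uses tab as its default delimiter. The toy metadata below is hypothetical: columns 2 and 4 (`Barcode`, `Batch`) are placeholders and not taken from the real qiime2_metadata.tsv.

```shell
# Toy stand-in for qiime2_metadata.tsv; columns 2 and 4 are made-up placeholders.
workdir="$(mktemp -d)"
printf 'SampleID\tBarcode\tGroup\tBatch\tpre_post\tSex_age\n' > "$workdir/qiime2_metadata.tsv"
printf 'sample-A1\tACGT\tGroup1\tb1\t3d.post.stroke\tmale.aged\n' >> "$workdir/qiime2_metadata.tsv"

# Select columns 1, 3, 5 and 6 in a single step (tab is cut's default delimiter).
cut -f1,3,5-6 "$workdir/qiime2_metadata.tsv" > "$workdir/metadata.tsv"
cat "$workdir/metadata.tsv"
```

This avoids leaving numbered scratch files in the working directory; the resulting columns are the same as with the cut/paste approach.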

 #7.2. ALDEx2
 https://bioconductor.org/packages/release/bioc/html/ALDEx2.html

Under docker-env (qiime2-amplicon-2023.9)

(NOT_NEEDED) Convert pathway output to BIOM and re-import to QIIME2

 gunzip picrust2_out/pathways_out/path_abun_unstrat.tsv.gz
 biom convert \
 -i picrust2_out/pathways_out/path_abun_unstrat.tsv \
 -o picrust2_out/path_abun_unstrat.biom \
 --table-type="Pathway table" \
 --to-hdf5

 qiime tools import \
 --input-path picrust2_out/path_abun_unstrat.biom \
 --type 'FeatureTable[Frequency]' \
 --input-format BIOMV210Format \
 --output-path path_abun.qza

 #qiime tools export --input-path path_abun.qza --output-path exported_path_abun
 #qiime tools peek path_abun.qza
 echo "✅ PICRUSt2 pipeline complete. Output in: picrust2_out"

Short answer: unless you had a very clear, pre-specified directional hypothesis, you should use a two-sided test.

 A bit more detail:

 * Two-sided t-test

         * Tests: “Are the means different?” (could be higher or lower).
         * Standard default in most biological and clinical studies and usually what reviewers expect.
         * More conservative than a one-sided test.

 * One-sided t-test

         * Tests: “Is Group A greater than Group B?” (or strictly less than).
         * You should only use it if before looking at the data you had a strong reason to expect a specific direction and you would ignore/consider uninterpretable a difference in the opposite direction.
         * Using one-sided just to gain significance is considered bad practice.

 For your pathway analysis (exploratory, many pathways, q-value correction), the safest and most defensible choice is to:

 * Use a two-sided t-test (equal variance or Welch’s, depending on variance assumptions).

 So I’d recommend rerunning STAMP with Type: Two-sided and reporting those results.

 #--> Use a two-sided Welch's t-test in STAMP. This is the unequal-variance version: it does not assume equal variances and is generally more conservative than "t-test (equal variance)", which refers to the classical unpaired Student's t-test.
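For reference, the quantities behind this choice can be written out. This is the textbook form of Welch's test, not something taken from the STAMP source; the symbols are introduced here for illustration, with $\bar{x}_k$, $s_k^2$ and $n_k$ denoting the mean, sample variance and size of group $k$, and $T_\nu$ a t-distributed variable with $\nu$ degrees of freedom.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}

p_{\text{two-sided}} = 2\,P\!\left(T_\nu \ge |t|\right),
\qquad
p_{\text{one-sided}} = P\!\left(T_\nu \ge t\right) \quad (\text{for } H_1\colon \mu_1 > \mu_2)
```

When the observed difference lies in the hypothesised direction ($t > 0$), the one-sided p-value is exactly half the two-sided one, which is why switching to a one-sided test after seeing the data inflates significance.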

Statistics in STAMP

 * For multiple groups:
     * Statistical test: ANOVA, Kruskal-Wallis H-test
     * Post-hoc test: Games-Howell, Scheffe, Tukey-Kramer, Welch's (uncorrected) (by default 0.95)
     * Effect size: Eta-squared
     * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction
 * For two groups
     * Statistical test: t-test (equal variance), Welch's t-test, White's non-parametric t-test
     * Type: One-sided, Two-sided
     * CI method: "DP: Welch's inverted" (by default 0.95)
     * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction, Sidak, Storey FDR
 * For two samples
     * Statistical test: Bootstrap, Chi-square test, Chi-square test (w/Yates'), Difference between proportions, Fisher's exact test, G-test, G-test (w/Yates'), G-test (w/Yates') + Fisher's, Hypergeometric, Permutation
     * Type: One-sided, Two-sided
     * CI method: "DP: Asymptotic", "DP: Asymptotic-CC", "DP: Newcombe-Wilson", "DR: Haldane adjustment", "RP: Asymptotic" (by default 0.95)
     * Multiple test correction: Benjamini-Hochberg FDR, Bonferroni, No correction, Sidak, Storey FDR

Since MetaCyc does not have a single pathway explicitly named “short-chain fatty acid biosynthesis”, I defined a small SCFA-related set (acetate-, propionate- and butyrate-producing pathways) and tested these between Group 9 and Group 10 (Welch’s t-test, with BH correction within this subset). These pathways can also be found in the file Welchs_t-test.xlsx attached to my email from 26.11.2025 (for Group9 (J1-4, J6-7, J10-11) vs Group10 (K1-6)).

Pathway ID	Description	Group 9 mean (%)	Group 10 mean (%)	p-value	p-adj (BH, SCFA set)
P108-PWY	pyruvate fermentation to propanoate I	0.5070	0.3817	0.001178	0.0071
PWY-5100	pyruvate fermentation to acetate and lactate II	0.8354	0.9687	0.007596	0.0228
CENTFERM-PWY	pyruvate fermentation to butanoate	0.0766	0.0410	0.026608	0.0532
PWY-5677	succinate fermentation to butanoate	0.0065	0.0088	0.365051	0.5476
P163-PWY	L-lysine fermentation to acetate and butanoate	0.0324	0.0271	0.484704	0.5816
PWY-5676	acetyl-CoA fermentation to butanoate II	0.1397	0.1441	0.927588	0.9276

In this SCFA-focused set, the propionate (P108-PWY) and acetate (PWY-5100) pathways remain significantly different between Group 9 and Group 10 after adjustment, whereas the butyrate-related pathways do not show clear significant differences (CENTFERM-PWY is borderline).
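The BH-adjusted column in the table above can be reproduced from the raw p-values with the Benjamini-Hochberg step-up rule: sort the n p-values, multiply the i-th smallest by n/i, then enforce monotonicity from the largest p-value downward. The sketch below is a minimal shell/awk version under stated assumptions: the function name `bh_adjust` is hypothetical, it relies on `sort -g` (available in GNU and BSD sort), and it is not how STAMP computes the correction internally.

```shell
# bh_adjust: read one raw p-value per line on stdin and print the
# Benjamini-Hochberg adjusted p-values in the original input order.
bh_adjust() {
  awk '{ print NR "\t" $1 }' |
    sort -t "$(printf '\t')" -k2,2g |
    awk -F '\t' '
      { idx[NR] = $1; p[NR] = $2 }        # idx: original position, p: sorted p-value
      END {
        n = NR; running = 1
        for (i = n; i >= 1; i--) {        # walk from the largest p-value down
          v = p[i] * n / i
          if (v < running) running = v    # cumulative minimum keeps values monotone
          adj[idx[i]] = running
        }
        for (k = 1; k <= n; k++) printf "%.4f\n", adj[k]
      }'
}

# Raw p-values of the six SCFA-related pathways, in table order.
printf '%s\n' 0.001178 0.007596 0.026608 0.365051 0.484704 0.927588 | bh_adjust
```

For these six values the output rounds to 0.0071, 0.0228, 0.0532, 0.5476, 0.5816 and 0.9276, matching the adjusted column. Note that the correction depends on the size of the tested set: the same raw p-values adjusted across all pathways would give larger adjusted values.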

Updated results from 14.01.2026 for Group 9 (J1-4, J10-11) vs Group 10 (K1-6); these pathways are marked green in the Excel files.

Pathway ID	Description	Group 9 mean (%)	Group 10 mean (%)	p-value	p-adj (BH, 6-pathway set)
P108-PWY	pyruvate fermentation to propanoate I	0.5142	0.3817	0.001354	0.008127
PWY-5100	pyruvate fermentation to acetate and lactate II	0.8401	0.9687	0.008763	0.026290
CENTFERM-PWY	pyruvate fermentation to butanoate	0.0729	0.0410	0.069958	0.139916
PWY-5677	succinate fermentation to butanoate	0.0063	0.0088	0.367586	0.551379
P163-PWY	L-lysine fermentation to acetate and butanoate	0.0308	0.0271	0.693841	0.832609
PWY-5676	acetyl-CoA fermentation to butanoate II	0.1421	0.1441	0.971290	0.971290

Reporting

Please find attached the results of the pathway analysis. The Excel file contains the full statistics for all pathways; those with adjusted p-values (Benjamini–Hochberg) ≤ 0.05 are highlighted in yellow and are the ones illustrated in the figure.

The analysis was performed using Welch’s t-test (two-sided) with Benjamini–Hochberg correction for multiple testing.

Browse the 141 pipelines currently available as part of nf-core (as of 2026-01-14)

Count check

Input pipeline count: 141
Output pipeline count (below): 141 ✅
Categories: 16
Sum of category counts: 141 ✅
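The count check above can be repeated mechanically. The sketch below simply sums the per-category counts given in the headings of the categorized list; the variable names are made up for illustration.

```shell
# Per-category counts, in the order of the 16 category headings.
counts="19 1 6 6 10 3 22 7 10 14 4 11 7 12 2 7"

total=0
ncat=0
for c in $counts; do
  total=$((total + c))   # running sum of pipelines
  ncat=$((ncat + 1))     # number of categories seen
done

echo "categories=$ncat total=$total"
```

If a pipeline is moved between categories, only the two affected counts change and the total should stay at 141.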

Categorized pipelines (with counts)

1) Bulk RNA-seq & transcriptomics (19)

alleleexpression, cageseq, circrna, denovotranscript, differentialabundance, drop, dualrnaseq, evexplorer, isoseq, lncpipe, nanostring, nascent, rnafusion, rnaseq, rnasplice, rnavar, riboseq, slamseq, stableexpression

2) Small RNA-seq (1)

smrnaseq

3) Single-cell transcriptomics (6)

marsseq, scdownstream, scflow, scnanoseq, scrnaseq, smartseq2

4) Spatial omics (6)

molkart, panoramaseq, pixelator, sopa, spatialvi, spatialxe

5) Chromatin & regulation (10)

atacseq, callingcards, chipseq, clipseq, cutandrun, hic, hicar, mnaseseq, sammyseq, tfactivity

6) DNA methylation (3)

methylarray, methylong, methylseq

7) Human genomics, variants & disease (22)

abotyper, circdna, deepvariant, eager, exoseq, gwas, longraredisease, mitodetect, oncoanalyser, pacvar, phaseimpute, radseq, raredisease, rarevariantburden, rnadnavar, sarek, ssds, tumourevo, variantbenchmarking, variantcatalogue, variantprioritization, createpanelrefs

8) Viruses & pathogen surveillance (7)

pathogensurveillance, phageannotator, tbanalyzer, viralmetagenome, viralintegration, viralrecon, vipr

9) Metagenomics & microbiome (10)

ampliseq, coproid, createtaxdb, detaxizer, funcscan, mag, magmap, metapep, metatdenovo, taxprofiler

10) Genome assembly, annotation & comparative genomics (14)

bacass, bactmap, denovohybrid, genomeannotator, genomeassembler, genomeqc, genomeskim, hgtseq, multiplesequencealign, neutronstar, pangenome, pairgenomealign, phyloplace, reportho

11) Immunology & antigen presentation (4)

airrflow, epitopeprediction, hlatyping, mhcquant

12) Proteomics, metabolomics & protein informatics (11)

ddamsproteomics, diaproteomics, kmermaid, metaboigniter, proteinannotator, proteinfamilies, proteinfold, proteogenomicsdb, proteomicslfq, quantms, ribomsqc

13) Imaging & other experimental modalities (7)

cellpainting, imcyto, liverctanalysis, lsmquant, mcmicro, rangeland, troughgraph

14) Data acquisition, QC & ops / utilities (12)

bamtofastq, datasync, demo, demultiplex, fastqrepair, fastquorum, fetchngs, nanoseq, readsimulator, references, seqinspector, seqsubmit

15) Genome editing & screens (2)

crisprseq, crisprvar

16) Other methods / modelling / non-bioinformatics (7)

deepmodeloptim, deepmutscan, diseasemodulediscovery, drugresponseeval, meerpipe, omicsgenetraitassociation, spinningjenny

Category	Name	Short description	中文描述	Stars	Last release
Bulk RNA-seq & transcriptomics	alleleexpression	Allele-specific expression (ASE) analysis using STAR-WASP, UMI-tools, phaser	等位基因特异性表达（ASE）分析：STAR-WASP 比对，UMI-tools 去重，phaser 单倍型分相与 ASE 检测	2	–
Bulk RNA-seq & transcriptomics	cageseq	CAGE-sequencing analysis pipeline with trimming, alignment and counting of CAGE tags.	CAGE-seq 分析：剪切、比对并统计 CAGE 标签（转录起始相关）。	11	1.0.2
Bulk RNA-seq & transcriptomics	circrna	circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data	环状 RNA（circRNA）定量、差异表达分析及 miRNA 靶标预测。	59	–
Bulk RNA-seq & transcriptomics	denovotranscript	de novo transcriptome assembly of paired-end short reads from bulk RNA-seq	基于 bulk RNA-seq 双端短读长的从头转录组组装。	19	1.2.1
Bulk RNA-seq & transcriptomics	differentialabundance	Differential abundance analysis for feature/observation matrices (e.g., RNA-seq)	对特征/观测矩阵做差异丰度分析（可用于表达矩阵等）。	87	1.5.0
Bulk RNA-seq & transcriptomics	drop	Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders	RNA-seq 异常事件检测流程（用于罕见病诊断等）。	7	–
Bulk RNA-seq & transcriptomics	dualrnaseq	Analysis of Dual RNA-seq data (host-pathogen interactions)	宿主-病原双 RNA-seq 分析流程，用于研究宿主-病原相互作用。	25	1.0.0
Bulk RNA-seq & transcriptomics	evexplorer	Analyze RNA data from extracellular vesicles; QC, region detection, normalization, DRE	胞外囊泡（EV）RNA 数据分析：质控、表达区域检测、归一化与差异 RNA 表达（DRE）。	1	–
Bulk RNA-seq & transcriptomics	isoseq	Genome annotation with PacBio Iso-Seq from raw subreads to FLNC and bed annotation	PacBio Iso-Seq 基因组注释：从 subreads 生成 FLNC 并产出 bed 注释。	50	2.0.0
Bulk RNA-seq & transcriptomics	lncpipe	Analysis of long non-coding RNAs from RNA-seq datasets (under development)	lncRNA（长链非编码 RNA）分析流程（开发中）。	34	–
Bulk RNA-seq & transcriptomics	nanostring	Analysis pipeline for Nanostring nCounter expression data.	Nanostring nCounter 表达数据分析流程。	16	1.3.1
Bulk RNA-seq & transcriptomics	nascent	Nascent Transcription Processing Pipeline	新生转录（nascent RNA）处理与分析流程。	22	2.3.0
Bulk RNA-seq & transcriptomics	rnafusion	RNA-seq analysis pipeline for detection of gene-fusions	RNA-seq 融合基因检测流程。	170	4.0.0
Bulk RNA-seq & transcriptomics	rnaseq	RNA sequencing pipeline (STAR/RSEM/HISAT2/Salmon) with QC and counts	常规 bulk RNA-seq 分析：比对/定量/计数与全面质控（多比对/定量器可选）。	1179	3.22.2
Bulk RNA-seq & transcriptomics	rnasplice	RNA-seq alternative splicing analysis	RNA-seq 可变剪接分析流程。	63	1.0.4
Bulk RNA-seq & transcriptomics	rnavar	gatk4 RNA variant calling pipeline	基于 GATK4 的 RNA 变异检测（RNA variant calling）。	58	1.2.2
Bulk RNA-seq & transcriptomics	riboseq	Analysis of ribosome profiling (Ribo-seq) data	Ribo-seq（核糖体测序/核糖体 footprinting）分析流程。	21	1.2.0
Bulk RNA-seq & transcriptomics	slamseq	SLAMSeq processing and analysis pipeline	SLAM-seq（新生 RNA 标记）处理与分析流程。	10	1.0.0
Bulk RNA-seq & transcriptomics	stableexpression	Identify stable genes across datasets; useful for RT-qPCR reference genes	寻找最稳定基因（适合作为 RT-qPCR 参考内参基因）。	5	–
Small RNA-seq	smrnaseq	A small-RNA sequencing analysis pipeline	小 RNA 测序（如 miRNA 等）分析流程。	98	2.4.1
Single-cell transcriptomics	marsseq	MARS-seq v2 pre-processing pipeline with velocity	MARS-seq v2 预处理流程，支持 RNA velocity。	8	1.0.3
Single-cell transcriptomics	scdownstream	Single cell transcriptomics pipeline for QC, integration, presentation	单细胞转录组下游：质控、整合与结果展示。	81	–
Single-cell transcriptomics	scflow	Please consider using/contributing to nf-core/scdownstream	单细胞流程（建议转向/贡献 scdownstream）。	25	–
Single-cell transcriptomics	scnanoseq	Single-cell/nuclei pipeline for Oxford Nanopore + 10x Genomics	单细胞/细胞核测序流程：结合 ONT 与 10x 数据。	52	1.2.1
Single-cell transcriptomics	scrnaseq	Single-cell RNA-Seq pipeline (10x/DropSeq/SmartSeq etc.)	单细胞 RNA-seq 主流程：支持 10x、DropSeq、SmartSeq 等。	310	4.1.0
Single-cell transcriptomics	smartseq2	Process single cell RNA-seq generated with SmartSeq2	SmartSeq2 单细胞 RNA-seq 处理流程。	15	–
Spatial omics	molkart	Processing Molecular Cartography data (Resolve Bioscience combinatorial FISH)	Resolve Molecular Cartography（组合 FISH）数据处理流程。	14	1.2.0
Spatial omics	panoramaseq	Pipeline to process sequencing-based spatial transcriptomics data (in-situ arrays)	测序型空间转录组（in-situ arrays）数据处理流程。	0	–
Spatial omics	pixelator	Pipeline to generate Molecular Pixelation data (Pixelgen)	Pixelgen 分子像素化（Molecular Pixelation）数据处理流程。	13	2.3.0
Spatial omics	sopa	Nextflow version of Sopa – spatial omics pipeline and analysis	Sopa 的 Nextflow 实现：空间组学流程与分析。	11	–
Spatial omics	spatialvi	Process spatial gene counts + spatial coordinates + image data (10x Visium)	10x Visium 空间转录组处理：基因计数+空间坐标+图像数据。	70	–
Spatial omics	spatialxe	(no description shown)	空间组学相关流程（原表未给出描述）。	24	–
Chromatin & regulation	atacseq	ATAC-seq peak-calling and QC analysis pipeline	ATAC-seq 峰识别与质控分析流程。	221	2.1.2
Chromatin & regulation	callingcards	A pipeline for processing calling cards data	Calling cards 实验数据处理流程。	6	1.0.0
Chromatin & regulation	chipseq	ChIP-seq peak-calling, QC and differential analysis	ChIP-seq 峰识别、质控与差异分析流程。	229	2.1.0
Chromatin & regulation	clipseq	CLIP-seq QC, mapping, UMI deduplication, peak-calling options	CLIP-seq 分析：质控、比对、UMI 去重与多种 peak calling。	24	1.0.0
Chromatin & regulation	cutandrun	CUT&RUN / CUT&TAG pipeline with QC, spike-ins, IgG controls, peak calling	CUT&RUN/CUT&TAG 分析：质控、spike-in、IgG 对照、峰识别与下游。	106	3.2.2
Chromatin & regulation	hic	Analysis of Chromosome Conformation Capture (Hi-C) data	Hi-C 染色体构象捕获数据分析流程。	105	2.1.0
Chromatin & regulation	hicar	HiCAR multi-omic co-assay pipeline	HiCAR 多组学共测（转录+染色质可及性+接触）分析流程。	12	1.0.0
Chromatin & regulation	mnaseseq	MNase-seq analysis pipeline using BWA and DANPOS2	MNase-seq 分析流程（BWA + DANPOS2）。	12	1.0.0
Chromatin & regulation	sammyseq	SAMMY-seq pipeline to analyze chromatin state	SAMMY-seq 染色质状态分析流程。	5	–
Chromatin & regulation	tfactivity	Identify differentially active TFs using expression + open chromatin	整合表达与开放染色质数据，识别差异活跃转录因子（TF）。	12	–
DNA methylation	methylarray	Illumina methylation array processing; QC, confounders, DMP/DMR, cell comp optional	Illumina 甲基化芯片分析：预处理、质控、混杂因素检查、DMP/DMR；可选细胞组成估计与校正。	6	–
DNA methylation	methylong	Extract methylation calls from long reads (ONT/PacBio)	从长读长（ONT/PacBio）提取甲基化识别结果。	19	2.0.0
DNA methylation	methylseq	Bisulfite-seq methylation pipeline (Bismark/bwa-meth + MethylDackel/rastair)	亚硫酸氢盐测序甲基化分析流程（Bismark/bwa-meth 等）。	185	4.2.0
Human genomics, variants & disease	abotyper	Characterise human blood group and red cell antigens using ONT	基于 ONT 的人类血型与红细胞抗原分型/鉴定流程。	1	–
Human genomics, variants & disease	circdna	Identify extrachromosomal circular DNA (ecDNA) from Circle-seq/WGS/ATAC-seq	从 Circle-seq/WGS/ATAC-seq 识别染色体外环状 DNA（ecDNA）。	31	1.1.0
Human genomics, variants & disease	createpanelrefs	Generate Panel of Normals / models / references from many samples	从大量样本生成 PoN（Panel of Normals）/模型/参考资源。	11	–
Human genomics, variants & disease	deepvariant	Consider using/contributing to nf-core/sarek	DeepVariant 相关（建议使用/贡献至 sarek）。	40	1.0
Human genomics, variants & disease	eager	Ancient DNA analysis pipeline	古 DNA（aDNA）分析流程（可重复、标准化）。	195	2.5.3
Human genomics, variants & disease	exoseq	Please consider using/contributing to nf-core/sarek	Exo-seq 相关（建议使用/贡献至 sarek）。	16	–
Human genomics, variants & disease	gwas	UNDER CONSTRUCTION: Genome Wide Association Studies	GWAS（全基因组关联分析）流程（建设中）。	27	–
Human genomics, variants & disease	longraredisease	Long-read sequencing pipeline for rare disease variant discovery	长读长测序罕见病变异识别流程（神经发育障碍等）。	5	v1.0.0-alpha
Human genomics, variants & disease	mitodetect	A-Z analysis of mitochondrial NGS data	线粒体 NGS 数据全流程分析。	7	–
Human genomics, variants & disease	oncoanalyser	Comprehensive cancer DNA/RNA analysis and reporting pipeline	肿瘤 DNA/RNA 综合分析与报告生成流程。	97	2.3.0
Human genomics, variants & disease	pacvar	Long-read PacBio sequencing processing for WGS and PureTarget	PacBio 长读长 WGS/PureTarget 测序数据处理流程。	13	1.0.1
Human genomics, variants & disease	phaseimpute	Phase and impute genetic data	遗传数据分相与基因型填补流程。	27	1.1.0
Human genomics, variants & disease	radseq	Variant-calling pipeline for RADseq	RADseq 变异检测流程。	7	–
Human genomics, variants & disease	raredisease	Call and score variants from WGS/WES of rare disease patients	罕见病 WGS/WES 变异检测与打分流程。	112	2.6.0
Human genomics, variants & disease	rarevariantburden	Summary count based rare variant burden test (e.g., vs gnomAD)	基于汇总计数的稀有变异负担检验（可与 gnomAD 等对照）。	0	–
Human genomics, variants & disease	rnadnavar	Integrated RNA+DNA somatic mutation detection	RNA+DNA 联合分析的体细胞突变检测流程。	14	–
Human genomics, variants & disease	sarek	Germline/somatic variant calling + annotation from WGS/targeted	WGS/靶向测序的生殖系/体细胞变异检测与注释（含预处理、calling、annotation）。	532	3.7.1
Human genomics, variants & disease	ssds	Single-stranded DNA Sequencing (SSDS) pipeline	SSDS（单链 DNA 测序）分析流程。	1	–
Human genomics, variants & disease	tumourevo	Model tumour clonal evolution from WGS (CN, subclones, signatures)	基于 WGS 的肿瘤克隆进化建模（CN、亚克隆、突变签名等）。	20	–
Human genomics, variants & disease	variantbenchmarking	Evaluate/validate variant calling accuracy	变异检测方法准确性评估与验证流程（benchmark）。	37	1.4.0
Human genomics, variants & disease	variantcatalogue	Generate population variant catalogues from WGS	从 WGS 构建人群变异目录（变异列表及频率）。	13	–
Human genomics, variants & disease	variantprioritization	(no description shown)	变异优先级筛选流程（原表未给出描述）。	12	–
Viruses & pathogen surveillance	pathogensurveillance	Surveillance of pathogens using population genomics and sequencing	基于群体基因组与测序的病原体监测流程。	52	1.0.0
Viruses & pathogen surveillance	phageannotator	Identify, annotate, quantify phage sequences in (meta)genomes	在（宏）基因组中识别、注释并定量噬菌体序列。	17	–
Viruses & pathogen surveillance	tbanalyzer	Pipeline for Mycobacterium tuberculosis complex analysis	结核分枝杆菌复合群（MTBC）分析流程。	13	–
Viruses & pathogen surveillance	viralmetagenome	Untargeted viral genome reconstruction with iSNV detection from metagenomes	宏基因组中无靶向病毒全基因组重建，并检测 iSNV。	28	1.0.1
Viruses & pathogen surveillance	viralintegration	Identify viral integration events using chimeric reads	基于嵌合 reads 的病毒整合事件检测流程。	17	0.1.1
Viruses & pathogen surveillance	viralrecon	Viral assembly and intrahost/low-frequency variant calling	病毒组装与宿主体内/低频变异检测流程。	151	3.0.0
Viruses & pathogen surveillance	vipr	Viral assembly and intrahost/low-frequency variant calling	病毒组装与体内/低频变异检测流程（类似 viralrecon）。	14	–
Metagenomics & microbiome	ampliseq	Amplicon sequencing workflow using DADA2 and QIIME2	扩增子测序（如 16S/ITS）分析：DADA2 + QIIME2。	231	2.15.0
Metagenomics & microbiome	coproid	Coprolite host identification pipeline	粪化石（coprolite）宿主鉴定流程。	13	2.0.0
Metagenomics & microbiome	createtaxdb	Automated construction of classifier databases for multiple tools	自动化并行构建多种宏基因组分类工具的数据库。	20	2.0.0
Metagenomics & microbiome	detaxizer	Identify (and optionally remove) sequences; default remove human	识别并（可选）去除特定序列（默认去除人源污染）。	22	1.3.0
Metagenomics & microbiome	funcscan	(Meta-)genome screening for functional and natural product genes	（宏）基因组功能基因与天然产物基因簇筛查。	99	3.0.0
Metagenomics & microbiome	mag	Assembly and binning of metagenomes	宏基因组组装与分箱（MAG 构建）。	264	5.3.0
Metagenomics & microbiome	magmap	Mapping reads to large collections of genomes	将 reads 比对到大型基因组集合的最佳实践流程。	10	1.0.0
Metagenomics & microbiome	metapep	From metagenomes to epitopes and beyond	从宏基因组到表位（epitope）等免疫相关下游分析。	12	1.0.0
Metagenomics & microbiome	metatdenovo	De novo assembly/annotation of metatranscriptomic or metagenomic data	宏转录组/宏基因组的从头组装与注释（支持原核/真核/病毒）。	34	1.3.0
Metagenomics & microbiome	taxprofiler	Multi-taxonomic profiling of shotgun short/long read metagenomics	shotgun 宏基因组多类群（多生物界）分类谱分析（短读长/长读长）。	175	1.2.5
Genome assembly, annotation & comparative genomics	bacass	Simple bacterial assembly and annotation pipeline	简单的细菌组装与注释流程。	80	2.5.0
Genome assembly, annotation & comparative genomics	bactmap	Mapping-based pipeline for bacterial phylogeny from WGS	基于比对的细菌 WGS 系统发育/建树流程。	61	1.0.0
Genome assembly, annotation & comparative genomics	denovohybrid	Hybrid genome assembly pipeline (under construction)	混合组装流程（长+短读长）（建设中）。	8	–
Genome assembly, annotation & comparative genomics	genomeannotator	Identify (coding) gene structures in draft genomes	草图基因组（draft genome）基因结构（编码基因）注释流程。	34	–
Genome assembly, annotation & comparative genomics	genomeassembler	Assembly and scaffolding from long ONT/PacBio HiFi reads	长读长（ONT/PacBio HiFi）基因组组装与脚手架构建。	31	1.1.0
Genome assembly, annotation & comparative genomics	genomeqc	Compare quality of multiple genomes and annotations	比较多个基因组及其注释质量。	19	–
Genome assembly, annotation & comparative genomics	genomeskim	QC/filter genome skims; organelle assembly and/or analysis	genome skim 数据质控/过滤，并进行细胞器组装或相关分析。	3	–
Genome assembly, annotation & comparative genomics	hgtseq	Investigate horizontal gene transfer from NGS data	从 NGS 数据研究水平基因转移（HGT）。	26	1.1.0
Genome assembly, annotation & comparative genomics	multiplesequencealign	Systematically evaluate MSA methods	多序列比对（MSA）方法系统评估流程。	40	1.1.1
Genome assembly, annotation & comparative genomics	neutronstar	De novo assembly for 10x linked-reads using Supernova	10x linked-reads 从头组装流程（Supernova）。	3	1.0.0
Genome assembly, annotation & comparative genomics	pangenome	Render sequences into a pangenome graph	将序列集合渲染为泛基因组图（pangenome graph）。	102	1.1.3
Genome assembly, annotation & comparative genomics	pairgenomealign	Pairwise genome comparison with LAST + plots	基于 LAST 的两两基因组比对与可视化绘图。	10	2.2.1
Genome assembly, annotation & comparative genomics	phyloplace	Phylogenetic placement with EPA-NG	使用 EPA-NG 的系统发育定位（placement）流程。	13	2.0.0
Genome assembly, annotation & comparative genomics	reportho	Comparative analysis of ortholog predictions	直系同源（ortholog）预测结果的比较分析流程。	11	1.1.0
Immunology & antigen presentation	airrflow	AIRR-seq repertoire analysis using Immcantation	免疫受体库（BCR/TCR，AIRR-seq）分析：基于 Immcantation。	73	4.3.1
Immunology & antigen presentation	epitopeprediction	Epitope prediction and annotation pipeline	表位（epitope）预测与注释流程。	50	3.1.0
Immunology & antigen presentation	hlatyping	Precision HLA typing from NGS data	基于 NGS 的高精度 HLA 分型流程。	76	2.1.0
Immunology & antigen presentation	mhcquant	Identify and quantify MHC eluted peptides from MS raw data	从质谱原始数据识别并定量 MHC 洗脱肽段。	42	3.1.0
Proteomics, metabolomics & protein informatics	ddamsproteomics	Quantitative shotgun MS proteomics	定量 shotgun 质谱蛋白组流程。	4	–
Proteomics, metabolomics & protein informatics	diaproteomics	Automated quantitative analysis of DIA proteomics MS measurements	DIA 蛋白组质谱数据自动化定量分析流程。	21	1.2.4
Proteomics, metabolomics & protein informatics	kmermaid	k-mer similarity analysis pipeline	k-mer 相似性分析流程。	23	0.1.0-alpha
Proteomics, metabolomics & protein informatics	metaboigniter	Metabolomics MS pre-processing with identification/quantification (MS1/MS2)	代谢组质谱预处理：基于 MS1/MS2 的鉴定与定量。	24	2.0.1
Proteomics, metabolomics & protein informatics	proteinannotator	Protein fasta → annotations	蛋白序列（FASTA）到注释的自动化流程。	8	–
Proteomics, metabolomics & protein informatics	proteinfamilies	Generation and updating of protein families	蛋白家族的生成与更新流程。	21	2.2.0
Proteomics, metabolomics & protein informatics	proteinfold	Protein 3D structure prediction pipeline	蛋白三维结构预测流程。	94	1.1.1
Proteomics, metabolomics & protein informatics	proteogenomicsdb	Generate protein databases for proteogenomics analysis	构建蛋白基因组学分析所需的蛋白数据库。	7	1.0.0
Proteomics, metabolomics & protein informatics	proteomicslfq	Proteomics label-free quantification (LFQ) analysis pipeline	蛋白组无标记定量（LFQ）分析流程。	37	1.0.0
Proteomics, metabolomics & protein informatics	quantms	Quantitative MS workflow (DDA-LFQ, DDA-Isobaric, DIA-LFQ)	定量蛋白组流程：支持 DDA-LFQ、等标记 DDA、DIA-LFQ 等。	34	1.2.0
Proteomics, metabolomics & protein informatics	ribomsqc	QC pipeline monitoring MS performance in ribonucleoside analysis	核苷相关质谱分析的性能监控与质控流程。	0	–
Imaging & other modalities	cellpainting	(no description shown)	Cell Painting 相关流程（原表未给出描述）。	8	–
Imaging & other modalities	imcyto	Image Mass Cytometry analysis pipeline	成像质谱细胞术（IMC）图像/数据分析流程。	26	1.0.0
Imaging & other modalities	liverctanalysis	UNDER CONSTRUCTION: pipeline for liver CT analysis	Liver CT image analysis pipeline (under construction).	0	–
Imaging & other modalities	lsmquant	Process and analyze light-sheet microscopy images	Pipeline for processing and analyzing light-sheet microscopy images.	5	–
Imaging & other modalities	mcmicro	Whole-slide multi-channel image processing to single-cell data	End-to-end pipeline from multi-channel whole-slide images to single-cell data.	29	–
Imaging & other modalities	rangeland	Remotely sensed imagery pipeline for land-cover trend files	Remote-sensing imagery pipeline: combines auxiliary data to generate land-cover change trend files.	9	1.0.0
Imaging & other modalities	troughgraph	Quantitative assessment of permafrost landscapes and thaw level	Pipeline for quantitative assessment of permafrost landscapes and degree of thaw.	2	–
Data acquisition, QC & utilities	bamtofastq	Convert BAM/CRAM to FASTQ and perform QC	Converts BAM/CRAM to FASTQ and performs quality control.	31	2.2.0
Data acquisition, QC & utilities	datasync	System operation / automation workflows	System operations/automation workflows (data synchronization and operational tasks).	10	–
Data acquisition, QC & utilities	demo	Simple nf-core style pipeline for workshops and demos	nf-core-style example pipeline for teaching and demonstrations.	10	1.0.2
Data acquisition, QC & utilities	demultiplex	Demultiplexing pipeline for sequencing data	Pipeline for demultiplexing sequencing data (splitting reads by sample).	52	1.7.0
Data acquisition, QC & utilities	fastqrepair	Recover corrupted FASTQ.gz, fix reads, remove unpaired, reorder	Repairs corrupted FASTQ.gz files: fixes malformed reads, removes unpaired reads, reorders reads, and so on.	6	1.0.0
Data acquisition, QC & utilities	fastquorum	Produce consensus reads using UMIs/barcodes	Pipeline for generating consensus reads based on UMIs/barcodes.	27	1.2.0
Data acquisition, QC & utilities	fetchngs	Fetch metadata and raw FastQ files from public databases	Fetches metadata and raw FASTQ files from public databases.	185	1.12.0
Data acquisition, QC & utilities	nanoseq	Nanopore demultiplexing, QC and alignment pipeline	Demultiplexing, QC and alignment pipeline for Nanopore data.	218	3.1.0
Data acquisition, QC & utilities	readsimulator	Simulate sequencing reads (amplicon, metagenome, WGS, etc.)	Pipeline for simulating sequencing reads (amplicon, target capture, metagenome, whole genome, etc.).	33	1.0.1
Data acquisition, QC & utilities	references	Build references for multiple use cases	Pipeline for building reference resources for multiple uses.	19	0.1
Data acquisition, QC & utilities	seqinspector	QC-only pipeline producing global/group-specific MultiQC reports	QC-only pipeline: runs multiple QC tools and outputs global/per-group MultiQC reports.	16	–
Data acquisition, QC & utilities	seqsubmit	Submit data to ENA	Pipeline for submitting data to ENA.	3	–
Genome editing & screens	crisprseq	CRISPR edited data analysis (targeted + screens)	CRISPR editing data analysis: quality assessment of targeted edits and discovery of key genes in pooled screens.	53	2.3.0
Genome editing & screens	crisprvar	Evaluate outcomes from genome editing experiments (WIP)	Pipeline for evaluating the outcomes of genome editing experiments (WIP).	5	–
Other methods / modelling / non-bio	deepmodeloptim	Stochastic Testing and Input Manipulation for Unbiased Learning Systems	Stochastic testing and input manipulation for unbiased learning systems (machine learning related).	28	–
Other methods / modelling / non-bio	deepmutscan	Deep mutational scanning (DMS) analysis pipeline	Pipeline for analyzing deep mutational scanning (DMS) data.	3	–
Other methods / modelling / non-bio	diseasemodulediscovery	Network-based disease module identification	Pipeline for network-based disease module identification.	5	–
Other methods / modelling / non-bio	drugresponseeval	Evaluate drug response prediction models	Pipeline for evaluating drug response prediction models (with greater statistical and biological rigor).	24	1.1.0
Other methods / modelling / non-bio	meerpipe	Astronomy pipeline for MeerKAT pulsar data	Astronomy pipeline for processing MeerKAT pulsar data (imaging and timing analysis).	10	–
Other methods / modelling / non-bio	omicsgenetraitassociation	Multi-omics integration and trait association analysis pipeline	Pipeline for multi-omics integration and trait/phenotype association analysis.	11	–
Other methods / modelling / non-bio	spinningjenny	Simulating the first industrial revolution using agent-based models	Pipeline for simulating the first industrial revolution using agent-based models.	4	–
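
The pipeline name and latest-release columns above are enough to assemble a launch command. The following is a minimal sketch, not an official nf-core helper: the function name `nextflow_command` and the default output directory `results` are made up for illustration, and it assumes the pipeline ships the usual nf-core `test` profile and that Docker is available. An en dash in the release column is treated as "no tagged release", in which case no `-r` flag is passed and Nextflow falls back to the repository's default branch.

```python
from typing import List, Optional


def nextflow_command(name: str, release: Optional[str], outdir: str = "results") -> List[str]:
    """Build an argument list for launching an nf-core pipeline with its test profile.

    `name` and `release` correspond to the pipeline and latest-release columns
    of the table; the function itself is a hypothetical helper for illustration.
    """
    cmd = ["nextflow", "run", f"nf-core/{name}"]
    # The table uses an en dash ("–") for pipelines without a tagged release;
    # only pin a revision with -r when a real version string is given.
    if release and release != "–":
        cmd += ["-r", release]
    # Assumption: the pipeline provides the standard nf-core "test" profile
    # and is run with Docker; --outdir is the usual nf-core output parameter.
    cmd += ["-profile", "test,docker", "--outdir", outdir]
    return cmd


print(" ".join(nextflow_command("fetchngs", "1.12.0")))
print(" ".join(nextflow_command("seqinspector", "–")))
```

Pinning the release with `-r` keeps a run reproducible; for rows without a release, it is probably safer to check the pipeline's repository before running anything.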

nf-core pipelines (selected)

nf-core/viralmetagenome is a bioinformatics best-practice analysis pipeline for reconstructing consensus genomes and identifying intra-host variants from metagenomic sequencing data or enrichment-based sequencing data such as hybrid capture.
nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples.
nf-core/vipr is a bioinformatics best-practice analysis pipeline for assembly and intra-host/low-frequency variant calling for viral samples.
nf-core/ampliseq is a bioinformatics analysis pipeline for amplicon sequencing. It supports denoising of any amplicon and a variety of taxonomic databases for taxonomic assignment, including 16S, ITS, CO1 and 18S.
nf-core/mag is a bioinformatics best-practice analysis pipeline for assembly, binning and annotation of metagenomes.
nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun short- and long-read metagenomic data.
nf-core/funcscan is a bioinformatics best-practice analysis pipeline for the screening of nucleotide sequences such as assembled contigs for functional genes.
nf-core/createtaxdb is a bioinformatics pipeline that constructs custom metagenomic classifier databases for multiple classifiers and profilers from the same input reference genome set in a highly automated and parallelised manner.
nf-core/detaxizer is a bioinformatics pipeline that checks for the presence of a specific taxon in (meta)genomic FASTQ files and filters out this taxon or taxonomic subtree.