Data Processing Pipeline: RNA-seq × Exoproteome × MS (Proximity/ALFA)

Path: ~/DATA/Data_Michelle/MS_RNAseq_Venn_2026

This post documents the complete, reproducible pipeline used to generate a 3-circle Venn diagram integrating RNA-seq, exoproteome, and MS SIP (Proximity + ALFApulldown) datasets, harmonized at the locus-tag level.

The pipeline is intentionally verbose: all transformations, mappings, and decisions are logged so the analysis can be reconstructed later.

Overview of the Three Sets

Circle 1 — RNA-seq Union of all significantly regulated genes across six comparisons Criteria: padj < 0.05 and |log2FC| > 2

Circle 2 — Exoproteome Union of six exoproteome conditions (MH/TSB × 1h, 4h, 18h) Protein IDs (UniProt / UniParc) are mapped to locus tags via BLAST Direction is defined without a cutoff:

log2FC > 0 → UP
log2FC < 0 → DOWN

Circle 3 — MS SIP (ProALFA) Union of four datasets:

Proximity 4h / 18h
ALFApulldown 4h / 18h Protein IDs are mapped to locus tags via BLAST

Outputs

The pipeline produces:

🟢 3-circle Venn diagram (PNG + PDF)
📊 Detailed Excel workbook
- all Venn regions
- RNA / Exo / MS membership per locus tag
- regulation directions and source datasets
🧾 Pipeline log file
- counts per dataset
- mapping success
- duplicated IDs
- intersection sizes

Step A — Fetch exoproteome FASTA from UniProt

Exoproteome Excel sheets contain UniProt accessions, while RNA-seq uses locus tags. To enable mapping, protein sequences are downloaded from UniProt and UniParc.

Script

01_fetch_uniprot_fasta_from_exoproteome_excel.py

What it does

Reads exoproteome sheets from Excel
Extracts UniProt / UniParc IDs
Downloads FASTA via UniProt REST (with caching + fallback to UniParc)
Writes:
- one FASTA per sheet
- mapping tables
- UP / DOWN ID lists
- run log

Bash: fetch exoproteome FASTAs

python3 01_fetch_uniprot_fasta_from_exoproteome_excel.py \
  --excel "Zusammenfassung SIP und Exoproteome.xlsx" \
  --outdir ./exoproteome_fastas \
  --cache  ./uniprot_cache

Step B — Normalize FASTA headers (Exoproteome + MS)

FASTA headers from UniProt and MS are heterogeneous and may contain duplicates. All FASTA files are normalized to canonical, unique protein IDs.

Script

02_normalize_fasta_headers.py

What it does

Extracts canonical IDs (sp|…|, tr|…|, raw headers)
Enforces uniqueness (__dupN suffix if needed)
Writes an ID mapping table for traceability

Bash: normalize exoproteome FASTAs

mkdir -p normalized_fastas

for f in \
  Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
  Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    exoproteome_fastas/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Bash: normalize MS (Proximity / ALFA) FASTAs

for f in \
  Proximity_4h.fasta Proximity_18h.fasta \
  ALFApulldown_4h.fasta ALFApulldown_18h.fasta
do
  python3 02_normalize_fasta_headers.py \
    ../MS_GO_enrichments/"$f" \
    "$f" \
    "$f.idmap.tsv"
done

Step C — BLAST mapping to reference proteome

All exoproteome and MS proteins are mapped to locus tags using BLASTP against the reference proteome.

Bash: merge exoproteome FASTAs

cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \
    Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \
    > Exoproteome_ALL.fasta

Bash: build BLAST database (once)

makeblastdb -in CP020463_protein.fasta \
  -dbtype prot \
  -out CP020463_ref_db

Bash: BLAST exoproteome

blastp -query Exoproteome_ALL.fasta \
  -db CP020463_ref_db \
  -out Exoproteome_ALL_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 \
  -max_target_seqs 5 \
  -num_threads 16

Bash: select best hit per query (exoproteome)

awk -F'\t' 'BEGIN{OFS="\t"}{
  key=$1; bits=$12+0; e=$11+0
  if(!(key in best) || bits>best_bits[key] || (bits==best_bits[key] && e<best_e[key])){
    best[key]=$0; best_bits[key]=bits; best_e[key]=e
  }
} END{for(k in best) print best[k]}' \
Exoproteome_ALL_vs_ref.blast.tsv \
| sort -t$'\t' -k1,1 \
> Exoproteome_ALL_vs_ref.besthit.tsv

Bash: BLAST MS datasets

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  blastp -query "${base}.fasta" \
    -db CP020463_ref_db \
    -out "${base}_vs_ref.blast.tsv" \
    -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
    -evalue 1e-10 \
    -max_target_seqs 5 \
    -num_threads 16
done

Bash: best-hit selection (MS)

for base in Proximity_4h Proximity_18h ALFApulldown_4h ALFApulldown_18h; do
  awk -F'\t' 'BEGIN{OFS="\t"}{
    q=$1; b=$12+0; e=$11+0
    if(!(q in best) || b>bb[q] || (b==bb[q] && e<be[q])){
      best[q]=$0; bb[q]=b; be[q]=e
    }
  } END{for(q in best) print best[q]}' \
  "${base}_vs_ref.blast.tsv" \
  | sort -t$'\t' -k1,1 \
  > "${base}_vs_ref.besthit.tsv"
done

Step D — R wrapper: integration, Venn, Excel, LOG

Script

03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

What it does

Reads RNA-seq Excel files (6 comparisons)
Reads exoproteome and MS sheets
Applies BLAST best-hit mappings
Builds union sets and intersections
Generates:
- Venn plot
- detailed Excel workbook
- comprehensive log file

Bash: run the final integration

Rscript 03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R

Scripts Used (for traceability)

`01_fetch_uniprot_fasta_from_exoproteome_excel.py`

# see script file for full implementation

`02_normalize_fasta_headers.py`

# see script file for full implementation

`03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R`

# see script file for full implementation

Final Note

Keeping Python, Bash, and R steps together is intentional: this pipeline is meant to be auditable, reproducible, and explainable, even months later when details are fuzzy.

If you want, I can next:

condense this into a README.md
split it into Methods vs Reproducibility
or add a flowchart of the pipeline steps

Scripts Used

`01_fetch_uniprot_fasta_from_exoproteome_excel.py`

#!/usr/bin/env python3
"""
01_fetch_uniprot_fasta_from_exoproteome_excel.py (v3)

MS log2FC direction rule (no cutoff):
  Up    : log2FC > 0
  Down  : log2FC < 0
  Zero  : log2FC == 0
  NA    : missing/non-numeric log2FC

Reads per sheet:
  - protein IDs from column A (index 0)
  - log2FC from column D (index 3)

Outputs per sheet:
  - {sheet}.fasta
  - {sheet}.mapping.tsv
  - {sheet}.up.ids.txt / {sheet}.down.ids.txt
  - run.log

UniParc catching method:
  1) Try UniProtKB FASTA: /uniprotkb/{ACC}.fasta
  2) If missing: UniParc search -> UPI
  3) Fetch UniParc FASTA: /uniparc/{UPI}.fasta
"""

import argparse
import os
import re
import time
from typing import Optional, Tuple

import pandas as pd
import requests

DEFAULT_SHEETS = [
    "Exoproteome MH 1h",
    "Exoproteome MH 4h",
    "Exoproteome MH 18h",
    "Exoproteome TSB 1h",
    "Exoproteome TSB 4h",
    "Exoproteome TSB 18h",
]

UNIPROTKB_FASTA = "https://rest.uniprot.org/uniprotkb/{acc}.fasta"
UNIPARC_FASTA   = "https://rest.uniprot.org/uniparc/{upi}.fasta"
UNIPARC_SEARCH  = "https://rest.uniprot.org/uniparc/search"

def sanitize_name(s: str) -> str:
    s = s.strip()
    s = re.sub(r"[^\w\-\.]+", "_", s)
    s = re.sub(r"_+", "_", s)
    return s

def parse_id_cell(raw: str) -> Optional[str]:
    if raw is None:
        return None
    s = str(raw).strip()
    if not s or s.lower() in {"nan", "na"}:
        return None

    s = s.split()[0].strip()

    # header like tr|ACC|NAME
    if "|" in s:
        parts = s.split("|")
        if len(parts) >= 2 and parts[0] in {"tr", "sp"}:
            s = parts[1]
        else:
            s = parts[0]

    if re.fullmatch(r"UPI[0-9A-F]{10,}", s):
        return s
    if re.fullmatch(r"[A-Z0-9]{6,10}", s):
        return s
    return None

def to_float(x) -> Optional[float]:
    if x is None:
        return None
    try:
        s = str(x).strip()
        if s == "" or s.lower() in {"nan", "na"}:
            return None
        s = s.replace(",", ".")
        return float(s)
    except Exception:
        return None

def classify_direction_no_cutoff(log2fc: Optional[float]) -> str:
    if log2fc is None:
        return "NA"
    if log2fc > 0:
        return "Up"
    if log2fc < 0:
        return "Down"
    return "Zero"

def extract_ids_and_fc_from_sheet(excel_path: str, sheet: str) -> pd.DataFrame:
    df = pd.read_excel(excel_path, sheet_name=sheet, header=None)
    if df.shape[1] < 4:
        raise ValueError(f"Sheet '{sheet}' has <4 columns, cannot read column D (log2FC).")

    out_rows = []
    for _, row in df.iterrows():
        pid = parse_id_cell(row.iloc[0])      # col A
        if not pid:
            continue
        log2fc = to_float(row.iloc[3])        # col D
        out_rows.append({"protein_id": pid, "log2fc": log2fc})

    out = pd.DataFrame(out_rows)
    if out.empty:
        return out

    # de-duplicate by protein_id (keep first non-NA log2fc if possible)
    out["log2fc_isna"] = out["log2fc"].isna()
    out = out.sort_values(["protein_id", "log2fc_isna"]).drop_duplicates("protein_id", keep="first")
    out = out.drop(columns=["log2fc_isna"]).reset_index(drop=True)
    return out

def http_get_text(session: requests.Session, url: str, params: dict = None,
                  retries: int = 4, backoff: float = 1.5) -> Optional[str]:
    for attempt in range(retries):
        r = session.get(url, params=params, timeout=60)
        if r.status_code == 200:
            return r.text
        if r.status_code in (429, 500, 502, 503, 504):
            time.sleep(backoff * (attempt + 1))
            continue
        return None
    return None

def fetch_uniprotkb_fasta(session: requests.Session, acc: str) -> Optional[str]:
    return http_get_text(session, UNIPROTKB_FASTA.format(acc=acc))

def resolve_upi_via_uniparc_search(session: requests.Session, acc: str) -> Optional[str]:
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(session, UNIPARC_SEARCH, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    if re.fullmatch(r"UPI[0-9A-F]{10,}", upi):
        return upi
    return None

def fetch_uniparc_fasta(session: requests.Session, upi: str) -> Optional[str]:
    return http_get_text(session, UNIPARC_FASTA.format(upi=upi))

def cache_get(cache_dir: str, key: str) -> Optional[str]:
    if not cache_dir:
        return None
    path = os.path.join(cache_dir, f"{key}.fasta")
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            return f.read()
    return None

def cache_put(cache_dir: str, key: str, fasta: str) -> None:
    if not cache_dir:
        return
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{key}.fasta")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fasta)

def rewrite_fasta_header(fasta: str, new_header: str) -> str:
    lines = fasta.splitlines()
    if not lines or not lines[0].startswith(">"):
        return fasta
    lines[0] = ">" + new_header
    return "\n".join(lines) + ("\n" if not fasta.endswith("\n") else "")

def fetch_best_fasta(session: requests.Session, pid: str, cache_dir: str = "") -> Tuple[Optional[str], str, Optional[str]]:
    """
    Returns (fasta_text, status, resolved_upi)
      status: uniprotkb | uniparc | uniparc_direct | not_found | *_cache
    """
    if pid.startswith("UPI"):
        cached = cache_get(cache_dir, pid)
        if cached:
            return cached, "uniparc_direct_cache", pid
        fasta = fetch_uniparc_fasta(session, pid)
        if fasta:
            cache_put(cache_dir, pid, fasta)
            return fasta, "uniparc_direct", pid
        return None, "not_found", pid

    cached = cache_get(cache_dir, pid)
    if cached:
        return cached, "uniprotkb_cache", None

    fasta = fetch_uniprotkb_fasta(session, pid)
    if fasta:
        cache_put(cache_dir, pid, fasta)
        return fasta, "uniprotkb", None

    upi = resolve_upi_via_uniparc_search(session, pid)
    if not upi:
        return None, "not_found", None

    cached2 = cache_get(cache_dir, upi)
    if cached2:
        return cached2, "uniparc_cache", upi

    fasta2 = fetch_uniparc_fasta(session, upi)
    if fasta2:
        cache_put(cache_dir, upi, fasta2)
        return fasta2, "uniparc", upi

    return None, "not_found", upi

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--excel", required=True, help="Input Excel file")
    ap.add_argument("--outdir", default="./exoproteome_fastas", help="Output folder")
    ap.add_argument("--cache", default="./uniprot_cache", help="Cache folder for downloaded FASTAs")
    ap.add_argument("--sleep", type=float, default=0.1, help="Sleep seconds between requests")
    ap.add_argument("--sheets", nargs="*", default=DEFAULT_SHEETS, help="Sheet names to process")
    args = ap.parse_args()

    os.makedirs(args.outdir, exist_ok=True)
    os.makedirs(args.cache, exist_ok=True)

    log_path = os.path.join(args.outdir, "run.log")
    def log(msg: str):
        print(msg)
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(msg + "\n")

    with open(log_path, "w", encoding="utf-8") as f:
        f.write("")

    xls = pd.ExcelFile(args.excel)
    missing = [s for s in args.sheets if s not in xls.sheet_names]
    if missing:
        log("WARNING: Missing sheets in workbook:")
        for s in missing:
            log(f"  - {s}")
        log("Continuing with available sheets only.\n")

    session = requests.Session()
    session.headers.update({"User-Agent": "exoproteome-fasta-fetch/3.0"})

    for sheet in args.sheets:
        if sheet not in xls.sheet_names:
            continue

        sheet_tag = sanitize_name(sheet)
        out_fasta = os.path.join(args.outdir, f"{sheet_tag}.fasta")
        out_map   = os.path.join(args.outdir, f"{sheet_tag}.mapping.tsv")
        out_up    = os.path.join(args.outdir, f"{sheet_tag}.up.ids.txt")
        out_down  = os.path.join(args.outdir, f"{sheet_tag}.down.ids.txt")

        tbl = extract_ids_and_fc_from_sheet(args.excel, sheet)
        log(f"\n=== Sheet: {sheet} ===")
        log(f"Extracted IDs (unique): {len(tbl)}")

        if tbl.empty:
            log("No IDs found. Skipping.")
            continue

        tbl["direction"] = [classify_direction_no_cutoff(x) for x in tbl["log2fc"].tolist()]

        n_fc = int(tbl["log2fc"].notna().sum())
        n_up = int((tbl["direction"] == "Up").sum())
        n_dn = int((tbl["direction"] == "Down").sum())
        n_zero = int((tbl["direction"] == "Zero").sum())
        n_na = int((tbl["direction"] == "NA").sum())

        log(f"log2FC present: {n_fc}/{len(tbl)}")
        log(f"Direction (no cutoff): Up={n_up}, Down={n_dn}, Zero={n_zero}, NA={n_na}")

        mapping_rows = []
        fasta_chunks = []

        n_ok = 0
        n_nf = 0

        for i, row in enumerate(tbl.itertuples(index=False), start=1):
            pid = row.protein_id
            log2fc = row.log2fc
            direction = row.direction

            fasta, status, upi = fetch_best_fasta(session, pid, cache_dir=args.cache)

            if fasta:
                header = pid
                if upi and upi != pid:
                    header = f"{pid}|UniParc:{upi}"
                fasta = rewrite_fasta_header(fasta, header)
                fasta_chunks.append(fasta)
                n_ok += 1
            else:
                n_nf += 1

            mapping_rows.append({
                "sheet": sheet,
                "protein_id_input": pid,
                "resolved_upi": upi if upi else "",
                "status": status,
                "log2fc": "" if log2fc is None else log2fc,
                "direction": direction,
                "fasta_written": "yes" if fasta else "no",
            })

            if args.sleep > 0:
                time.sleep(args.sleep)

            if i % 50 == 0:
                log(f"  progress: {i}/{len(tbl)} (ok={n_ok}, not_found={n_nf})")

        with open(out_fasta, "w", encoding="utf-8") as f:
            for chunk in fasta_chunks:
                f.write(chunk.strip() + "\n")

        map_df = pd.DataFrame(mapping_rows)
        map_df.to_csv(out_map, sep="\t", index=False)

        up_ids = map_df.loc[map_df["direction"] == "Up", "protein_id_input"].astype(str).tolist()
        dn_ids = map_df.loc[map_df["direction"] == "Down", "protein_id_input"].astype(str).tolist()
        with open(out_up, "w", encoding="utf-8") as f:
            f.write("\n".join(up_ids) + ("\n" if up_ids else ""))
        with open(out_down, "w", encoding="utf-8") as f:
            f.write("\n".join(dn_ids) + ("\n" if dn_ids else ""))

        log(f"Fetched FASTA: {n_ok}/{len(tbl)} (not_found={n_nf})")
        log(f"Saved FASTA: {out_fasta}")
        log(f"Saved mapping: {out_map}")
        log(f"Saved Up IDs:  {out_up} ({len(up_ids)})")
        log(f"Saved Down IDs:{out_down} ({len(dn_ids)})")

    log("\nDONE.")

if __name__ == "__main__":
    main()

`02_normalize_fasta_headers.py`

#!/usr/bin/env python3
import re
import sys
from collections import defaultdict

def canon_id(header: str) -> str:
    h = header.strip()
    h = h[1:] if h.startswith(">") else h

    # tr|ACC|NAME or sp|ACC|NAME
    m = re.match(r'^(sp|tr)\|([^|]+)\|', h)
    if m:
        return m.group(2)

    # otherwise: first token, then before first '|'
    first = h.split()[0]
    return first.split("|")[0]

def read_fasta(path):
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        header = None
        seq = []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header = line
                seq = []
            else:
                seq.append(line.strip())
        if header is not None:
            yield header, "".join(seq)

def main():
    if len(sys.argv) < 3:
        print("Usage: normalize_fasta_headers.py in.fasta out.fasta [map.tsv]", file=sys.stderr)
        sys.exit(1)

    in_fa  = sys.argv[1]
    out_fa = sys.argv[2]
    map_tsv = sys.argv[3] if len(sys.argv) >= 4 else out_fa + ".idmap.tsv"

    seen = defaultdict(int)

    with open(out_fa, "w", encoding="utf-8") as fo, open(map_tsv, "w", encoding="utf-8") as fm:
        fm.write("old_header\tnew_id\n")
        for old_h, seq in read_fasta(in_fa):
            base = canon_id(old_h)
            seen[base] += 1
            new_id = base if seen[base] == 1 else f"{base}__dup{seen[base]}"
            fm.write(f"{old_h}\t{new_id}\n")

            fo.write(f">{new_id}\n")
            # wrap sequence at 60
            for i in range(0, len(seq), 60):
                fo.write(seq[i:i+60] + "\n")

    print(f"Wrote: {out_fa}")
    print(f"Wrote: {map_tsv}")

if __name__ == "__main__":
    main()

`03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R`

#!/usr/bin/env Rscript
# ============================================================
# 3-circle Venn: RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown
#
# Key points:
# - RNA-seq inputs are Excel files with sheet 'Complete_Data'
#   * GeneID is the ID used for comparison (locus tag style)
#   * Keep Preferred_name / Description for annotation in ALL Venn regions
# - Exoproteome uses MS protein IDs + log2FC sign (no cutoff), mapped to locus tags via BLAST best-hit TSV
# - Proximity/ALFApulldown (ProALFA) uses SIP/MS protein IDs mapped to locus tags via BLAST best-hit TSV
# - Debug stats + parsing/mapping summaries are written into a LOG file
#
# NOTE: (记得！) 生成 Exoproteome_ALL.fasta + Exoproteome_ALL_vs_ref.blast.tsv：
#   cat Exoproteome_MH_1h.fasta Exoproteome_MH_4h.fasta Exoproteome_MH_18h.fasta \\
#       Exoproteome_TSB_1h.fasta Exoproteome_TSB_4h.fasta Exoproteome_TSB_18h.fasta \\
#       > Exoproteome_ALL.fasta
#   blastp -query Exoproteome_ALL.fasta -db CP020463_protein.dmnd \\
#       -out Exoproteome_ALL_vs_ref.blast.tsv -outfmt 6 -evalue 1e-5 -max_target_seqs 50 -num_threads 40
# ============================================================

suppressPackageStartupMessages({
  library(tidyverse)
  library(readxl)
  library(openxlsx)
  library(VennDiagram)
  library(grid)
})

# -------------------------
# CONFIG
# -------------------------
RNA_PADJ_CUT <- 0.05
RNA_LFC_CUT  <- 2

OUTDIR <- "./venn_outputs"
dir.create(OUTDIR, showWarnings = FALSE, recursive = TRUE)

LOGFILE <- file.path(OUTDIR, "pipeline_log.txt")
sink(LOGFILE, split = TRUE)

cat("=== 3-circle Venn pipeline ===\n")
cat(sprintf("RNA cutoff: padj < %.3g and |log2FoldChange| > %.3g\n", RNA_PADJ_CUT, RNA_LFC_CUT))
cat("Output dir:", normalizePath(OUTDIR), "\n\n")

# RNA inputs (Excel, sheet = Complete_Data)
RNA_FILES <- list(
  deltasbp_MH_2h  = "./DEG_Annotations_deltasbp_MH_2h_vs_WT_MH_2h-all.xlsx",
  deltasbp_MH_4h  = "./DEG_Annotations_deltasbp_MH_4h_vs_WT_MH_4h-all.xlsx",
  deltasbp_MH_18h = "./DEG_Annotations_deltasbp_MH_18h_vs_WT_MH_18h-all.xlsx",
  deltasbp_TSB_2h  = "./DEG_Annotations_deltasbp_TSB_2h_vs_WT_TSB_2h-all.xlsx",
  deltasbp_TSB_4h  = "./DEG_Annotations_deltasbp_TSB_4h_vs_WT_TSB_4h-all.xlsx",
  deltasbp_TSB_18h = "./DEG_Annotations_deltasbp_TSB_18h_vs_WT_TSB_18h-all.xlsx"
)

# Exoproteome sheets (from summary Excel)
EXO_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
EXO_SHEETS <- list(
  Exoproteome_MH_1h  = "Exoproteome MH 1h",
  Exoproteome_MH_4h  = "Exoproteome MH 4h",
  Exoproteome_MH_18h = "Exoproteome MH 18h",
  Exoproteome_TSB_1h  = "Exoproteome TSB 1h",
  Exoproteome_TSB_4h  = "Exoproteome TSB 4h",
  Exoproteome_TSB_18h = "Exoproteome TSB 18h"
)

# Exoproteome BLAST best-hit mapping (query protein -> locus tag)
EXO_BLAST_TSV <- "./Exoproteome_ALL_vs_ref.besthit.tsv"

# Proximity/ALFApulldown sheets (from same summary Excel)
PROALFA_SUMMARY_XLSX <- "./Zusammenfassung SIP und Exoproteome.xlsx"
PROALFA_SHEETS <- list(
  Proximity_4h = "Proximity 4h",
  Proximity_18h = "Proximity 18h",
  ALFApulldown_4h = "ALFApulldown 4h",
  ALFApulldown_18h = "ALFApulldown 18h"
)

# ProALFA BLAST best-hit mapping (each dataset protein -> locus tag)
PROALFA_BLAST_TSV <- list(
  Proximity_4h      = "./Proximity_4h_vs_ref.besthit.tsv",
  Proximity_18h     = "./Proximity_18h_vs_ref.besthit.tsv",
  ALFApulldown_4h   = "./ALFApulldown_4h_vs_ref.besthit.tsv",
  ALFApulldown_18h  = "./ALFApulldown_18h_vs_ref.besthit.tsv"
)

# FASTA inputs (QC only)
FASTA_FILES <- list(
  Proximity_4h      = "./Proximity_4h.fasta",
  Proximity_18h     = "./Proximity_18h.fasta",
  ALFApulldown_4h   = "./ALFApulldown_4h.fasta",
  ALFApulldown_18h  = "./ALFApulldown_18h.fasta",
  Exo_MH_1h         = "./Exoproteome_MH_1h.fasta",
  Exo_MH_4h         = "./Exoproteome_MH_4h.fasta",
  Exo_MH_18h        = "./Exoproteome_MH_18h.fasta",
  Exo_TSB_1h        = "./Exoproteome_TSB_1h.fasta",
  Exo_TSB_4h        = "./Exoproteome_TSB_4h.fasta",
  Exo_TSB_18h       = "./Exoproteome_TSB_18h.fasta"
)

# -------------------------
# HELPERS
# -------------------------
safe_file_check <- function(fp, label) {
  if (!file.exists(fp)) stop(label, " missing: ", fp)
}

canonical_id_one <- function(x) {
  x <- as.character(x)
  x <- stringr::str_trim(x)

  if (is.na(x) || x == "") return(NA_character_)

  # remove leading ">"
  x <- stringr::str_remove(x, "^>")

  # keep first token before whitespace
  x <- stringr::str_split(x, "\\s+", simplify = TRUE)[1]

  # handle UniProt-style: tr|ID|... or sp|ID|...
  if (stringr::str_detect(x, "\\|")) {
    parts <- stringr::str_split(x, "\\|")[[1]]
    if (length(parts) >= 2 && parts[1] %in% c("tr", "sp")) {
      x <- parts[2]
    } else {
      x <- parts[1]
    }
  }

  x
}

canonical_id <- function(x) {
  vapply(x, canonical_id_one, character(1))
}

read_fasta_headers <- function(fp) {
  safe_file_check(fp, "FASTA")
  hdr <- readLines(fp, warn = FALSE) %>% keep(~str_starts(.x, ">"))
  tibble(raw = hdr, id = canonical_id(hdr)) %>%
    filter(!is.na(id), id != "")
}

norm_name <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[^a-z0-9]+", "", x)
  x
}

# Robust column picker (case-insensitive + punctuation-insensitive)
find_col <- function(nms, target) {
  nms <- as.character(nms)
  target_n <- norm_name(target)
  n_n <- norm_name(nms)

  # exact normalized match
  idx <- which(n_n == target_n)
  if (length(idx) > 0) return(nms[idx[1]])

  # contains (normalized)
  idx <- which(str_detect(n_n, fixed(target_n)))
  if (length(idx) > 0) return(nms[idx[1]])

  NA_character_
}

# Best-hit parser (BLAST outfmt6):
# qid sid pident length mismatch gapopen qstart qend sstart send evalue bitscore
read_besthit_tsv <- function(fp) {
  safe_file_check(fp, "BLAST TSV")
  suppressPackageStartupMessages({
    df <- read_tsv(fp, col_names = FALSE, show_col_types = FALSE, progress = FALSE)
  })
  if (ncol(df) < 12) stop("BLAST TSV has <12 columns: ", fp)
  colnames(df)[1:12] <- c("qid","sid","pident","length","mismatch","gapopen",
                         "qstart","qend","sstart","send","evalue","bitscore")
  df %>%
    mutate(
      qid = canonical_id(qid),
      sid = as.character(sid),
      bitscore = suppressWarnings(as.numeric(bitscore)),
      evalue = suppressWarnings(as.numeric(evalue))
    ) %>%
    filter(!is.na(qid), qid != "")
}

besthit_reduce <- function(df) {
  # one best hit per query: highest bitscore, then lowest evalue
  df %>%
    arrange(qid, desc(bitscore), evalue) %>%
    group_by(qid) %>%
    slice(1) %>%
    ungroup()
}

# Try to extract locus tag from subject id
extract_locus_tag <- function(sid) {
  sid <- as.character(sid)
  # if sid already looks like B4U56_00090
  if (str_detect(sid, "^B4U56_\\d{5}$")) return(sid)
  # otherwise try to find locus tag pattern anywhere
  hit <- str_extract(sid, "B4U56_\\d{5}")
  hit
}

read_exo_sheet <- function(xlsx, sheet) {
  df <- read_excel(xlsx, sheet = sheet, col_names = TRUE)

  # Protein column: assume first column contains protein IDs
  prot_col <- names(df)[1]

  # log2FC column: per user it's column D, name looks like "log 2 FC)"
  log2_col <- find_col(names(df), "log 2 FC)")
  if (is.na(log2_col)) {
    # fallback fuzzy: any col containing both log2 and fc
    hits <- names(df)[str_detect(norm_name(names(df)), "log2") & str_detect(norm_name(names(df)), "fc")]
    if (length(hits) > 0) log2_col <- hits[1]
  }
  if (is.na(log2_col)) stop("Exoproteome sheet missing log2FC column ('log 2 FC)'): ", sheet)

  out <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      log2FC = suppressWarnings(as.numeric(.data[[log2_col]]))
    ) %>%
    filter(!is.na(Protein_raw), Protein_raw != "")

  out <- out %>%
    mutate(
      Protein_ID = canonical_id(Protein_raw),
      EXO_direction = case_when(
        is.na(log2FC) ~ NA_character_,
        log2FC > 0 ~ "UP",
        log2FC < 0 ~ "DOWN",
        TRUE ~ "ZERO"
      )
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "")

  out
}

read_rna_excel <- function(fp) {
  safe_file_check(fp, "RNA Excel")
  df <- read_excel(fp, sheet = "Complete_Data", col_names = TRUE)

  # required columns
  req <- c("GeneID","padj","log2FoldChange")
  miss <- setdiff(req, names(df))
  if (length(miss) > 0) stop("RNA Excel missing columns: ", paste(miss, collapse=", "), " in ", fp)

  # optional annotation columns
  pref_col <- if ("Preferred_name" %in% names(df)) "Preferred_name" else NA_character_
  desc_col <- if ("Description" %in% names(df)) "Description" else NA_character_

  df <- df %>%
    mutate(
      GeneID_plain = str_remove(as.character(GeneID), "^gene-"),
      padj = suppressWarnings(as.numeric(padj)),
      log2FoldChange = suppressWarnings(as.numeric(log2FoldChange)),
      Preferred_name = if (!is.na(pref_col)) as.character(.data[[pref_col]]) else NA_character_,
      Description    = if (!is.na(desc_col)) as.character(.data[[desc_col]]) else NA_character_
    )

  df
}

# -------------------------
# FASTA QC
# -------------------------
cat("[FASTA] Reading FASTA files for QC (10)\n")
fasta_qc_stats <- list()

for (nm in names(FASTA_FILES)) {
  fp <- FASTA_FILES[[nm]]
  safe_file_check(fp, "FASTA file")
  h <- read_fasta_headers(fp)
  n_hdr <- nrow(h)
  n_uniq <- n_distinct(h$id)
  n_dup <- n_hdr - n_uniq
  cat(sprintf("  - %s: headers=%d, unique_IDs=%d, duplicates=%d\n", nm, n_hdr, n_uniq, n_dup))

  dup_ids <- h %>% count(id, sort = TRUE) %>% filter(n > 1)
  fasta_qc_stats[[nm]] <- tibble(
    Dataset = nm,
    headers = n_hdr,
    unique_IDs = n_uniq,
    duplicates = n_dup,
    top_duplicate_id = ifelse(nrow(dup_ids) > 0, dup_ids$id[1], NA_character_),
    top_duplicate_n  = ifelse(nrow(dup_ids) > 0, dup_ids$n[1], NA_integer_)
  )
}
fasta_qc_stats_df <- bind_rows(fasta_qc_stats)

# -------------------------
# RNA: read 6 comparisons (Excel)
# -------------------------
cat("\n[RNA] Reading 6 RNA-seq comparisons (Excel: Complete_Data)\n")
rna_sig_long <- list()
rna_all_long <- list()

for (cmp in names(RNA_FILES)) {
  fp <- RNA_FILES[[cmp]]
  df <- read_rna_excel(fp)

  # Keep a full-gene annotation table (used later to annotate EXO/ProALFA-only members)
  rna_all_long[[cmp]] <- df %>%
    mutate(Comparison = cmp) %>%
    select(GeneID_plain, Preferred_name, Description, padj, log2FoldChange, Comparison)

  # Debug: how many bad entries
  bad_gene <- sum(is.na(df$GeneID_plain) | df$GeneID_plain == "")
  bad_padj <- sum(is.na(df$padj))
  bad_lfc  <- sum(is.na(df$log2FoldChange))

  sig <- df %>%
    filter(!is.na(GeneID_plain), GeneID_plain != "",
           !is.na(padj), !is.na(log2FoldChange),
           padj < RNA_PADJ_CUT,
           abs(log2FoldChange) > RNA_LFC_CUT) %>%
    mutate(
      Comparison = cmp,
      RNA_direction = ifelse(log2FoldChange > 0, "UP", "DOWN")
    ) %>%
    select(GeneID_plain, Comparison, padj, log2FoldChange, RNA_direction,
           Preferred_name, Description)

  cat(sprintf("  - %s: rows=%d (bad GeneID=%d, bad padj=%d, bad log2FC=%d); significant=%d (UP=%d, DOWN=%d)\n",
              cmp, nrow(df), bad_gene, bad_padj, bad_lfc,
              nrow(sig), sum(sig$RNA_direction == "UP"), sum(sig$RNA_direction == "DOWN")))

  rna_sig_long[[cmp]] <- sig
}

rna_sig_long <- bind_rows(rna_sig_long)
rna_all_long <- bind_rows(rna_all_long)

RNA <- sort(unique(rna_sig_long$GeneID_plain))
cat(sprintf("[RNA] Union significant genes (6 comps): %d\n", length(RNA)))

# Build an annotation lookup from *ALL* genes (so EXO-only/ProALFA-only also get annotation)
rna_ann <- rna_all_long %>%
  filter(!is.na(GeneID_plain), GeneID_plain != "") %>%
  group_by(GeneID_plain) %>%
  summarise(
    Preferred_name = coalesce(Preferred_name[match(TRUE, !is.na(Preferred_name) & Preferred_name != "", !is.na(Preferred_name) & Preferred_name != "")], Preferred_name[1]),
    Description    = coalesce(Description[match(TRUE, !is.na(Description) & Description != "", !is.na(Description) & Description != "")], Description[1]),
    .groups = "drop"
  )

# -------------------------
# EXO: read exoproteome sheets + mapping (BLAST)
# -------------------------
cat("\n[EXO] Reading Exoproteome sheets + mapping via BLAST\n")
safe_file_check(EXO_SUMMARY_XLSX, "Exoproteome summary Excel")
safe_file_check(EXO_BLAST_TSV, "Exoproteome BLAST TSV")

exo_long <- list()
for (sh in names(EXO_SHEETS)) {
  df <- read_exo_sheet(EXO_SUMMARY_XLSX, EXO_SHEETS[[sh]]) %>%
    mutate(Sheet = EXO_SHEETS[[sh]])
  cat(sprintf("  - %s: proteins=%d (with log2FC=%d; UP=%d, DOWN=%d)\n",
              EXO_SHEETS[[sh]],
              n_distinct(df$Protein_ID),
              sum(!is.na(df$log2FC)),
              sum(df$EXO_direction == "UP", na.rm = TRUE),
              sum(df$EXO_direction == "DOWN", na.rm = TRUE)))
  exo_long[[sh]] <- df
}
exo_long <- bind_rows(exo_long) %>%
  distinct(Protein_ID, Sheet, .keep_all = TRUE)

# BLAST mapping
exo_blast <- read_besthit_tsv(EXO_BLAST_TSV)
exo_best  <- besthit_reduce(exo_blast) %>%
  mutate(LocusTag = map_chr(sid, extract_locus_tag))

cat(sprintf("[BLAST] Exoproteome_ALL rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
            nrow(exo_blast), n_distinct(exo_blast$qid), nrow(exo_best), n_distinct(exo_best$LocusTag)))

dup_targets <- exo_best %>%
  filter(!is.na(LocusTag)) %>%
  count(LocusTag, sort = TRUE) %>%
  filter(n > 1)

cat(sprintf("[BLAST] Exoproteome_ALL duplicated targets in besthits: %d (top 5)\n", nrow(dup_targets)))
if (nrow(dup_targets) > 0) print(head(dup_targets, 5))

exo_mapped <- exo_long %>%
  left_join(exo_best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid")) %>%
  mutate(LocusTag = as.character(LocusTag))

cat(sprintf("[EXO] Proteins before mapping: %d unique\n", n_distinct(exo_long$Protein_ID)))
cat(sprintf("[EXO] Mapped proteins (have LocusTag): %d\n", sum(!is.na(exo_mapped$LocusTag))))
cat(sprintf("[EXO] Unique mapped proteins: %d\n", n_distinct(exo_mapped$Protein_ID[!is.na(exo_mapped$LocusTag)])))
cat(sprintf("[EXO] Unique mapped locus tags: %d\n", n_distinct(exo_mapped$LocusTag[!is.na(exo_mapped$LocusTag)])))

EXO <- sort(unique(na.omit(exo_mapped$LocusTag)))

# -------------------------
# ProALFA: read proximity + ALFA sheets + mapping (BLAST)
# -------------------------
cat("\n[ProALFA] Reading Proximity/ALFA sheets + mapping via BLAST\n")
safe_file_check(PROALFA_SUMMARY_XLSX, "ProALFA summary Excel")

ms_long <- list()
for (ds in names(PROALFA_SHEETS)) {
  sh <- PROALFA_SHEETS[[ds]]
  if (!(sh %in% readxl::excel_sheets(PROALFA_SUMMARY_XLSX))) {
    stop("Sheet '", sh, "' not found")
  }
  df <- read_excel(PROALFA_SUMMARY_XLSX, sheet = sh, col_names = TRUE)

  # assume first column has protein IDs
  prot_col <- names(df)[1]
  tmp <- df %>%
    transmute(
      Protein_raw = as.character(.data[[prot_col]]),
      Protein_ID = canonical_id(Protein_raw)
    ) %>%
    filter(!is.na(Protein_ID), Protein_ID != "") %>%
    distinct(Protein_ID)

  cat(sprintf("  - %s (%s): proteins=%d unique\n", ds, sh, nrow(tmp)))
  ms_long[[ds]] <- tmp %>% mutate(Dataset = ds)
}
ms_long <- bind_rows(ms_long)

# mapping per dataset
ms_mapped_list <- list()
PROALFA_sets <- list()

for (ds in names(PROALFA_BLAST_TSV)) {
  bt <- PROALFA_BLAST_TSV[[ds]]
  safe_file_check(bt, paste0("ProALFA BLAST TSV: ", ds))
  b <- read_besthit_tsv(bt)
  best <- besthit_reduce(b) %>%
    mutate(LocusTag = map_chr(sid, extract_locus_tag))

  cat(sprintf("[BLAST] %s rows=%d, queries=%d, besthits=%d, unique_targets=%d\n",
              ds, nrow(b), n_distinct(b$qid), nrow(best), n_distinct(best$LocusTag)))

  dup_t <- best %>%
    filter(!is.na(LocusTag)) %>%
    count(LocusTag, sort = TRUE) %>%
    filter(n > 1)
  if (nrow(dup_t) > 0) {
    cat(sprintf("[BLAST] %s duplicated targets in besthits: %d (top 5)\n", ds, nrow(dup_t)))
    print(head(dup_t, 5))
  }

  tmp_ds <- ms_long %>%
    filter(Dataset == ds) %>%
    left_join(best %>% select(qid, LocusTag), by = c("Protein_ID" = "qid"))

  cat(sprintf("[ProALFA] %s: proteins=%d, mapped=%d, unique_targets=%d\n",
              ds,
              n_distinct(tmp_ds$Protein_ID),
              sum(!is.na(tmp_ds$LocusTag)),
              n_distinct(tmp_ds$LocusTag[!is.na(tmp_ds$LocusTag)])))

  ms_mapped_list[[ds]] <- tmp_ds
  PROALFA_sets[[ds]] <- sort(unique(na.omit(tmp_ds$LocusTag)))
}
ms_mapped_all <- bind_rows(ms_mapped_list)

ProALFA <- sort(unique(unlist(PROALFA_sets)))
cat(sprintf("[ProALFA] Union locus tags (Proximity+ALFA): %d\n", length(ProALFA)))

# -------------------------
# VENN intersections
# -------------------------
cat("\n[VENN] Computing intersections\n")
cat(sprintf("Set sizes: RNA=%d, EXO=%d, ProALFA=%d\n", length(RNA), length(EXO), length(ProALFA)))

i_RNA_EXO     <- intersect(RNA, EXO)
i_RNA_ProALFA <- intersect(RNA, ProALFA)
i_EXO_ProALFA <- intersect(EXO, ProALFA)
i_all3        <- Reduce(intersect, list(RNA, EXO, ProALFA))

cat(sprintf("Intersections: RNA∩EXO=%d, RNA∩ProALFA=%d, EXO∩ProALFA=%d, ALL3=%d\n",
            length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)))

# -------------------------
# Venn plot
# -------------------------
venn_png <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.png")
venn_pdf <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA.pdf")

cat("[VENN] Writing plot:", venn_png, "\n")
venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = venn_png,
  imagetype = "png",
  height = 2400, width = 2400, resolution = 300,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  # Increase 3rd value to push the long bottom label further out
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)

cat("[VENN] Writing plot:", venn_pdf, "\n")
vd <- venn.diagram(
  x = list(RNA=RNA, Exoproteome=EXO, ProALFA=ProALFA),
  category.names = c("RNA", "Exoproteome", "Proximity/ALFApulldown"),
  filename = NULL,
  fill = c("#4C72B0","#55A868","#C44E52"),
  alpha = 0.4,
  cex = 0.9,
  cat.cex = 0.85,
  cat.pos = c(-20, 20, 270),
  cat.dist = c(0.06, 0.06, 0.16),
  margin = 0.22,
  main = "RNA-seq (sig) vs Exoproteome vs Proximity/ALFApulldown"
)
pdf(venn_pdf, width=7.5, height=7.5)
grid.draw(vd)
dev.off()

# -------------------------
# Region table (ALL members + direction annotations)
# -------------------------
all_members <- tibble(LocusTag = sort(unique(c(RNA, EXO, ProALFA)))) %>%
  mutate(
    in_RNA = LocusTag %in% RNA,
    in_EXO = LocusTag %in% EXO,
    in_ProALFA = LocusTag %in% ProALFA
  )

# RNA direction per locus tag (from significant set only; union across comparisons)
rna_dir_tbl <- rna_sig_long %>%
  group_by(GeneID_plain) %>%
  summarise(
    RNA_comparisons = paste(sort(unique(Comparison)), collapse = ";"),
    RNA_dirs = paste(sort(unique(RNA_direction)), collapse = ";"),
    min_padj = suppressWarnings(min(padj, na.rm = TRUE)),
    max_abs_log2FC = suppressWarnings(max(abs(log2FoldChange), na.rm = TRUE)),
    .groups = "drop"
  ) %>%
  rename(LocusTag = GeneID_plain)

# EXO direction aggregated per locus tag
exo_dir_tbl <- exo_mapped %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    EXO_sheets = paste(sort(unique(Sheet)), collapse = ";"),
    EXO_dirs = paste(sort(unique(EXO_direction)), collapse = ";"),
    EXO_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

# ProALFA membership aggregated per locus tag
proalfa_tbl <- ms_mapped_all %>%
  filter(!is.na(LocusTag)) %>%
  group_by(LocusTag) %>%
  summarise(
    ProALFA_datasets = paste(sort(unique(Dataset)), collapse = ";"),
    ProALFA_proteinIDs = paste(sort(unique(Protein_ID)), collapse = ";"),
    .groups = "drop"
  )

region_tbl <- all_members %>%
  left_join(rna_dir_tbl, by = "LocusTag") %>%
  left_join(exo_dir_tbl, by = "LocusTag") %>%
  left_join(proalfa_tbl, by = "LocusTag") %>%
  left_join(rna_ann, by = c("LocusTag" = "GeneID_plain")) %>%
  arrange(desc(in_RNA), desc(in_EXO), desc(in_ProALFA), LocusTag)

region_only <- function(inRNA, inEXO, inProALFA) {
  region_tbl %>%
    filter(in_RNA == inRNA, in_EXO == inEXO, in_ProALFA == inProALFA)
}

# -------------------------
# Excel output
# -------------------------
out_xlsx <- file.path(OUTDIR, "Venn_RNA_EXO_ProALFA_detailed.xlsx")
cat("[EXCEL] Writing:", out_xlsx, "\n")

wb <- createWorkbook()

addWorksheet(wb, "Summary")
writeData(wb, "Summary", data.frame(
  Metric = c("RNA cutoff", "RNA union", "EXO union", "ProALFA union",
             "RNA∩EXO", "RNA∩ProALFA", "EXO∩ProALFA", "ALL3"),
  Value  = c(
    sprintf("padj < %.3g & |log2FC| > %.3g", RNA_PADJ_CUT, RNA_LFC_CUT),
    length(RNA), length(EXO), length(ProALFA),
    length(i_RNA_EXO), length(i_RNA_ProALFA), length(i_EXO_ProALFA), length(i_all3)
  )
))

addWorksheet(wb, "FASTA_QC")
writeData(wb, "FASTA_QC", fasta_qc_stats_df)

addWorksheet(wb, "RNA_sig_long")
writeData(wb, "RNA_sig_long", rna_sig_long)

addWorksheet(wb, "RNA_all_annotations")
writeData(wb, "RNA_all_annotations", rna_all_long)

addWorksheet(wb, "EXO_proteins")
writeData(wb, "EXO_proteins", exo_long)

addWorksheet(wb, "EXO_mapped")
writeData(wb, "EXO_mapped", exo_mapped)

addWorksheet(wb, "ProALFA_proteins")
writeData(wb, "ProALFA_proteins", ms_long)

addWorksheet(wb, "ProALFA_mapped")
writeData(wb, "ProALFA_mapped", ms_mapped_all)

addWorksheet(wb, "All_members_with_dirs")
writeData(wb, "All_members_with_dirs", region_tbl)

addWorksheet(wb, "Only_RNA")
writeData(wb, "Only_RNA", region_only(TRUE, FALSE, FALSE))

addWorksheet(wb, "Only_EXO")
writeData(wb, "Only_EXO", region_only(FALSE, TRUE, FALSE))

addWorksheet(wb, "Only_ProALFA")
writeData(wb, "Only_ProALFA", region_only(FALSE, FALSE, TRUE))

addWorksheet(wb, "RNA_EXO_only")
writeData(wb, "RNA_EXO_only", region_only(TRUE, TRUE, FALSE))

addWorksheet(wb, "RNA_ProALFA_only")
writeData(wb, "RNA_ProALFA_only", region_only(TRUE, FALSE, TRUE))

addWorksheet(wb, "EXO_ProALFA_only")
writeData(wb, "EXO_ProALFA_only", region_only(FALSE, TRUE, TRUE))

addWorksheet(wb, "ALL3")
writeData(wb, "ALL3", region_only(TRUE, TRUE, TRUE))

saveWorkbook(wb, out_xlsx, overwrite = TRUE)

cat("\nDONE.\nOutputs:\n")
cat(" -", venn_png, "\n")
cat(" -", venn_pdf, "\n")
cat(" -", out_xlsx, "\n")
cat(" - LOG:", normalizePath(LOGFILE), "\n")

# dump warnings (so they go into log too)
if (!is.null(warnings())) {
  cat("\n[WARNINGS]\n")
  print(warnings())
}

sink()

Final Notes

All comparisons are locus-tag consistent, enabling clean cross-omics integration.
Exoproteome regulation is treated directionally only, by design.
Logs and Excel outputs are intentionally verbose to support downstream inspection and reuse.

This pipeline is designed to be transparent, auditable, and extensible—ideal for iterative multi-omics analysis and figure generation.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Data Processing Pipeline: RNA-seq × Exoproteome × MS (Proximity/ALFA)

Overview of the Three Sets

Outputs

Step A — Fetch exoproteome FASTA from UniProt

Bash: fetch exoproteome FASTAs

Step B — Normalize FASTA headers (Exoproteome + MS)

Bash: normalize exoproteome FASTAs

Bash: normalize MS (Proximity / ALFA) FASTAs

Step C — BLAST mapping to reference proteome

Bash: merge exoproteome FASTAs

Bash: build BLAST database (once)

Bash: BLAST exoproteome

Bash: select best hit per query (exoproteome)

Bash: BLAST MS datasets

Bash: best-hit selection (MS)

Step D — R wrapper: integration, Venn, Excel, LOG

Bash: run the final integration

Scripts Used (for traceability)

`01_fetch_uniprot_fasta_from_exoproteome_excel.py`

`02_normalize_fasta_headers.py`

`03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R`

Final Note

Scripts Used

`01_fetch_uniprot_fasta_from_exoproteome_excel.py`

`02_normalize_fasta_headers.py`

`03_wrapper_3circles_venn_RNAseq_Exo_ProALFA.R`

Final Notes

Leave a Reply Cancel reply