From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)

Path: ~/DATA/Data_Michelle/Data_Michelle_ProteinClustering_2026

Goal

For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:

Export protein sequences as FASTA
Annotate them with EggNOG-mapper (including the COG_category column)
Summarize COG composition at two levels:
- COG letters (J/A/K/…/S), including multi-letter cases like IQ
- 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
Export both summary statistics and the underlying protein IDs for each category/group.

Step 0 — Why this manual annotation approach is needed (non-model organism)

Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.

Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

Export sequences for each MS dataset:

python3 getProteinSequences_Proximity_4h.py      # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  # > ALFApulldown_18h.fasta

Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py). Output: One FASTA per dataset (*.fasta), used as EggNOG input.

1B) (USED FOR RNA-SEQ, NOT_USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta) Output: cleaned FASTA headers (CP020463_protein.fasta)

Step 2 — Generate EggNOG annotation files (`*.emapper.annotations`)

2A) Install EggNOG-mapper

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda   # eggnog-mapper 2.1.12
mamba activate eggnog_env

2B) Download / prepare EggNOG database

mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

2C) Run `emapper.py` on FASTA inputs

(USED FOR RNA-SEQ, NOT_USED HERE) For RNA-seq reference proteome (optional but useful for integration):

emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60    # --resume if needed

For MS protein sets (the ones used for COG clustering):

emapper.py -i Proximity_4h.fasta      -o eggnog_out_Proximity_4h      --cpu 60   # --resume
emapper.py -i Proximity_18h.fasta     -o eggnog_out_Proximity_18h     --cpu 60
emapper.py -i ALFApulldown_4h.fasta   -o eggnog_out_ALFApulldown_4h   --cpu 60
emapper.py -i ALFApulldown_18h.fasta  -o eggnog_out_ALFApulldown_18h  --cpu 60

Key outputs used downstream (one per dataset):

eggnog_out_Proximity_4h.emapper.annotations
eggnog_out_Proximity_18h.emapper.annotations
eggnog_out_ALFApulldown_4h.emapper.annotations
eggnog_out_ALFApulldown_18h.emapper.annotations

These files include a column named COG_category, which is the basis for the clustering below.

Step 3 — COG clustering + reporting (this post’s main script)

Inputs

The four *.emapper.annotations files (must contain COG_category).

Outputs

All outputs are written to ./COG_outputs/:

Per dataset (Excel): COG_[Dataset].xlsx

COG_letters: per-letter count + percent
Debug: unassigned COG rows, R/S proportion, etc.
Protein_assignments: per-protein COG letters + functional groups
Protein_lists_by_COG: protein IDs per COG letter
Protein_lists_by_group: protein IDs per major functional class
Long_format_COG / Long_format_group: one row per protein per category/group (best for filtering)

Combined (Excel): COG_combined_summary.xlsx

combined counts/percents across datasets
combined long-format tables

Plots (PNG + PDF):

COG_grouped_barplot_percent_letters.* (COG letters across datasets)
COG_functional_groups.* (4 functional classes across datasets)

python3 cog_cluster.py

Interpretation notes

High “POORLY CHARACTERIZED” (R/S) is common when EggNOG cannot assign confident functions—e.g., many proteins are effectively “hypothetical protein”–like, strain-specific, or lack strong ortholog support.
“POORLY CHARACTERIZED” being large is not an artifact of the plotting — it’s because the dataset itself contains many proteins annotated as unknown by EggNOG. Why “S” shows up so much in Proximity labeling / pulldown sets? Common biological/annotation reasons:
- Small / strain-specific proteins (often hypothetical)
- Membrane or secreted proteins with poor functional characterization
- Mobile genetic elements / phage-related proteins
- Proteins with only weak orthology support → EggNOG assigns S rather than a confident functional class
Group totals not equal to 100% can happen because:
- some proteins have multi-letter COGs (e.g., IQ) → count in multiple groups
- some proteins have no COG assignment (-) → don’t contribute to any group

Code snippet (generate Proximity_4h FASTA, used in Step 2)

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)

Code snippet (COG clustering script, used in Step 3)

#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)

Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
    * Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
    * Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
  (note: method still allows multi-group membership; we just don't print it in labels)
"""

import os
import re
import pandas as pd
import numpy as np

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
    "Proximity_4h":     "./eggnog_out_Proximity_4h.emapper.annotations",
    "Proximity_18h":    "./eggnog_out_Proximity_18h.emapper.annotations",
    "ALFApulldown_4h":  "./eggnog_out_ALFApulldown_4h.emapper.annotations",
    "ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}

OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)

ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]

COG_DESCRIPTIONS = {
    "J": "Translation, ribosomal structure and biogenesis",
    "A": "RNA processing and modification",
    "K": "Transcription",
    "L": "Replication, recombination and repair",
    "B": "Chromatin structure and dynamics",
    "D": "Cell cycle control, cell division, chromosome partitioning",
    "Y": "Nuclear structure",
    "V": "Defense mechanisms",
    "T": "Signal transduction mechanisms",
    "M": "Cell wall/membrane/envelope biogenesis",
    "N": "Cell motility",
    "Z": "Cytoskeleton",
    "W": "Extracellular structures",
    "U": "Intracellular trafficking, secretion, and vesicular transport",
    "O": "Posttranslational modification, protein turnover, chaperones",
    "C": "Energy production and conversion",
    "G": "Carbohydrate transport and metabolism",
    "E": "Amino acid transport and metabolism",
    "F": "Nucleotide transport and metabolism",
    "H": "Coenzyme transport and metabolism",
    "I": "Lipid transport and metabolism",
    "P": "Inorganic ion transport and metabolism",
    "Q": "Secondary metabolites biosynthesis, transport and catabolism",
    "R": "General function prediction only",
    "S": "Function unknown",
}

FUNCTIONAL_GROUPS = {
    "INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
    "CELLULAR PROCESSES AND SIGNALING":   ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
    "METABOLISM":                        ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
    "POORLY CHARACTERIZED":              ['R', 'S'],
}

LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
    for c in letters:
        LETTER_TO_GROUP[c] = grp

# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
    rows = []
    header = None
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue
            if line.startswith("#query"):
                header = line.lstrip("#").split("\t")
                continue
            if header is None and re.match(r"^query\t", line):
                header = line.split("\t")
                continue
            if header is None:
                continue
            rows.append(line.split("\t"))

    if header is None:
        raise ValueError(f"Could not find header line (#query/query) in: {path}")

    df = pd.DataFrame(rows, columns=header)
    if "query" not in df.columns and "#query" in df.columns:
        df = df.rename(columns={"#query": "query"})
    return df

def split_cog_letters(cog_str):
    if cog_str is None:
        return []
    cog_str = str(cog_str).strip()
    if cog_str == "" or cog_str == "-" or cog_str.lower() == "nan":
        return []
    letters = list(cog_str)
    return [c for c in letters if c in ALL_COG]

def count_cog_letters(df: pd.DataFrame) -> dict:
    counts = {c: 0 for c in ALL_COG}
    for x in df["COG_category"].astype(str).tolist():
        for c in split_cog_letters(x):
            counts[c] += 1
    return counts

def build_category_to_proteins(df: pd.DataFrame) -> dict:
    cat2prot = {c: set() for c in ALL_COG}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        for c in letters:
            cat2prot[c].add(q)
    return {c: sorted(list(v)) for c, v in cat2prot.items()}

def build_group_to_proteins(df: pd.DataFrame) -> dict:
    group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        for g in hit_groups:
            group2prot[g].add(q)
    return {g: sorted(list(v)) for g, v in group2prot.items()}

def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
        letters = split_cog_letters(cog)
        groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
        rows.append({
            "query": q,
            "COG_category_raw": cog,
            "COG_letters": "".join(sorted(set(letters))) if letters else "",
            "Functional_groups": "; ".join(groups) if groups else "",
            "Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA"])
        })
    out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
    return out.sort_values("query")

def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for c in ALL_COG:
        for pid in cat2prot.get(c, []):
            rows.append({
                "Dataset": dataset,
                "COG": c,
                "Description": COG_DESCRIPTIONS.get(c, ""),
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
    rows = []
    for g in FUNCTIONAL_GROUPS.keys():
        for pid in grp2prot.get(g, []):
            rows.append({
                "Dataset": dataset,
                "Functional_group": g,
                "Protein_ID": pid
            })
    return pd.DataFrame(rows)

def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
    groups = list(FUNCTIONAL_GROUPS.keys())
    recs = []
    for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
        letters = split_cog_letters(cog)
        hit_groups = set()
        for c in letters:
            g = LETTER_TO_GROUP.get(c)
            if g:
                hit_groups.add(g)
        rec = {"query": q}
        for g in groups:
            rec[g] = (g in hit_groups)
        recs.append(rec)

    out = pd.DataFrame(recs)
    out = out.groupby("query", as_index=False).max()
    return out

# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []

for ds_name, path in INPUT_FILES.items():
    print(f"\n--- {ds_name} ---\nReading: {path}")
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing input file: {path}")

    df = read_emapper_annotations(path)
    if "COG_category" not in df.columns:
        raise ValueError(f"{path} has no 'COG_category' column.")

    n_rows = df.shape[0]
    unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()

    # Letter-based counts and percents
    counts = count_cog_letters(df)
    total_letters = sum(counts.values())
    pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}

    counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
    pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)

    all_counts[ds_name] = counts_s
    all_pct[ds_name] = pct_s

    out_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "Count": counts_s.values,
        "Percent": pct_s.values,
    })

    dbg = pd.DataFrame([{
        "Dataset": ds_name,
        "Proteins_in_table": int(n_rows),
        "COG_unassigned_rows": int(unassigned),
        "Total_assigned_COG_letters": int(total_letters),
        "R_count": int(counts.get("R", 0)),
        "S_count": int(counts.get("S", 0)),
        "R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
    }])

    debug_summary.append(dbg.iloc[0].to_dict())

    assignment_tbl = build_assignment_table(df)
    cat2prot = build_category_to_proteins(df)
    grp2prot = build_group_to_proteins(df)

    # category (COG letter) protein lists
    cat_list_df = pd.DataFrame({
        "COG": ALL_COG_DISPLAY,
        "Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
        "N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
        "Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
    })

    # group protein lists
    grp_list_df = pd.DataFrame({
        "Functional_group": list(FUNCTIONAL_GROUPS.keys()),
        "N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
        "Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
    })

    df_long_cog = long_format_by_cog(cat2prot, ds_name)
    df_long_group = long_format_by_group(grp2prot, ds_name)
    all_long_cog.append(df_long_cog)
    all_long_group.append(df_long_group)

    out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
    with pd.ExcelWriter(out_xlsx) as writer:
        out_df.to_excel(writer, sheet_name="COG_letters", index=False)
        dbg.to_excel(writer, sheet_name="Debug", index=False)
        assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
        cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
        grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
        df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
        df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)

    print(f"Saved: {out_xlsx}")

# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())

df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)

combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
    df_counts.to_excel(writer, sheet_name="Counts")
    df_pct.to_excel(writer, sheet_name="Percent")
    pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
    # Combined long formats (includes protein IDs per letter/group across datasets)
    pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
    pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)

print(f"\nSaved combined summary: {combined_xlsx}")

# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()

y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
    ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)

ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()

plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")

# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []

# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []

for ds_name, path in INPUT_FILES.items():
    df = read_emapper_annotations(path)

    # (A) protein-based group membership and stats
    memb = protein_based_group_membership(df)
    n_proteins = memb.shape[0]
    counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
    pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
    group_counts[ds_name] = pd.Series(counts)
    group_pct[ds_name] = pd.Series(pct)

    for g in FUNCTIONAL_GROUPS.keys():
        ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
        for pid in ids:
            group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})

    # (B) add category (letter) protein IDs too
    cat2prot = build_category_to_proteins(df)
    all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))

df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)

# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
    sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
    for c in ALL_COG:
        ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
        cat_list_rows.append({
            "Dataset": ds,
            "COG": c,
            "Description": COG_DESCRIPTIONS.get(c, ""),
            "N_proteins": len(ids),
            "Protein_IDs": "; ".join(sorted(ids))
        })
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)

group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
    df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
    df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
    df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)

    # NEW: category protein IDs included here as well
    df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
    df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)

print(f"Saved functional-group workbook: {group_xlsx}")

# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()

x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))

fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
    ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)

ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()

plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)

print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")

print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

From MS protein lists to COG functional profiles: FASTA export → EggNOG annotation → COG clustering (with per-protein membership tables)

Goal

Step 0 — Why this manual annotation approach is needed (non-model organism)

Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

1B) (USED FOR RNA-SEQ, NOT_USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)

Step 2 — Generate EggNOG annotation files (`*.emapper.annotations`)

2A) Install EggNOG-mapper

2B) Download / prepare EggNOG database

2C) Run `emapper.py` on FASTA inputs

Step 3 — COG clustering + reporting (this post’s main script)

Inputs

Outputs

Interpretation notes

Code snippet (generate Proximity_4h FASTA, used in Step 2)

Code snippet (COG clustering script, used in Step 3)

Leave a Reply Cancel reply

Goal

Step 0 — Why this manual annotation approach is needed (non-model organism)

Step 1 — Generate protein FASTA files

1A) FASTA from MS protein lists

1B) (USED FOR RNA-SEQ, NOT_USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)

Step 2 — Generate EggNOG annotation files (*.emapper.annotations)

2A) Install EggNOG-mapper

2B) Download / prepare EggNOG database

2C) Run emapper.py on FASTA inputs

Step 3 — COG clustering + reporting (this post’s main script)

Inputs

Outputs

Interpretation notes

Code snippet (generate Proximity_4h FASTA, used in Step 2)

Code snippet (COG clustering script, used in Step 3)

Leave a Reply Cancel reply

Step 2 — Generate EggNOG annotation files (`*.emapper.annotations`)

2C) Run `emapper.py` on FASTA inputs