Goal
For four MS-derived protein sets (Proximity 4h/18h and ALFApulldown 4h/18h), we want to:
- Export protein sequences as FASTA
- Annotate them with EggNOG-mapper (including the
COG_categorycolumn) -
Summarize COG composition at two levels:
- COG letters (J/A/K/…/S), including multi-letter cases like
IQ - 4 major functional classes (Info / Cellular / Metabolism / Poorly characterized)
- COG letters (J/A/K/…/S), including multi-letter cases like
- Export both summary statistics and the underlying protein IDs for each category/group.
Step 0 — Why this manual annotation approach is needed (non-model organism)
Because the organism is non-model, standard organism-specific R annotation packages (e.g., org.Hs.eg.db) don’t apply. Instead, we generate functional annotations directly from protein sequences (EggNOG / Blast2GO), and then do downstream clustering/enrichment from those outputs.
Step 1 — Generate protein FASTA files
1A) FASTA from MS protein lists
Export sequences for each MS dataset:
python3 getProteinSequences_Proximity_4h.py # > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py # > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py # > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py # > ALFApulldown_18h.fasta
Input: MS protein list (dataset-specific; handled inside each getProteinSequences_*.py).
Output: One FASTA per dataset (*.fasta), used as EggNOG input.
1B) (USED FOR RNA-SEQ, NOT_USED HERE) Reference FASTA from GenBank (for RNA-seq integration / ID baseline)
mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
Input: downloaded GenBank protein FASTA (CP020463_protein_.fasta)
Output: cleaned FASTA headers (CP020463_protein.fasta)
Step 2 — Generate EggNOG annotation files (*.emapper.annotations)
2A) Install EggNOG-mapper
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda # eggnog-mapper 2.1.12
mamba activate eggnog_env
2B) Download / prepare EggNOG database
mkdir -p /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y \
--data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
2C) Run emapper.py on FASTA inputs
(USED FOR RNA-SEQ, NOT_USED HERE) For RNA-seq reference proteome (optional but useful for integration):
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60 # --resume if needed
For MS protein sets (the ones used for COG clustering):
emapper.py -i Proximity_4h.fasta -o eggnog_out_Proximity_4h --cpu 60 # --resume
emapper.py -i Proximity_18h.fasta -o eggnog_out_Proximity_18h --cpu 60
emapper.py -i ALFApulldown_4h.fasta -o eggnog_out_ALFApulldown_4h --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60
Key outputs used downstream (one per dataset):
eggnog_out_Proximity_4h.emapper.annotationseggnog_out_Proximity_18h.emapper.annotationseggnog_out_ALFApulldown_4h.emapper.annotationseggnog_out_ALFApulldown_18h.emapper.annotations
These files include a column named COG_category, which is the basis for the clustering below.
Step 3 — COG clustering + reporting (this post’s main script)
Inputs
The four *.emapper.annotations files (must contain COG_category).
Outputs
All outputs are written to ./COG_outputs/:
Per dataset (Excel): COG_[Dataset].xlsx
COG_letters: per-letter count + percentDebug: unassigned COG rows, R/S proportion, etc.Protein_assignments: per-protein COG letters + functional groupsProtein_lists_by_COG: protein IDs per COG letterProtein_lists_by_group: protein IDs per major functional classLong_format_COG/Long_format_group: one row per protein per category/group (best for filtering)
Combined (Excel): COG_combined_summary.xlsx
- combined counts/percents across datasets
- combined long-format tables
Plots (PNG + PDF):
COG_grouped_barplot_percent_letters.*(COG letters across datasets)COG_functional_groups.*(4 functional classes across datasets)
Interpretation notes
- High “POORLY CHARACTERIZED” (R/S) is common when EggNOG cannot assign confident functions—e.g., many proteins are effectively “hypothetical protein”–like, strain-specific, or lack strong ortholog support.
-
Group totals not equal to 100% can happen because:
- some proteins have multi-letter COGs (e.g.,
IQ) → count in multiple groups - some proteins have no COG assignment (
-) → don’t contribute to any group
- some proteins have multi-letter COGs (e.g.,
Code snippet (generate Proximity_4h FASTA, used in Step 2)
import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# --------- robust HTTP session (retries + backoff) ----------
def make_session():
s = requests.Session()
retries = Retry(
total=6,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
s.headers.update({
"User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
})
return s
S = make_session()
def http_get_text(url, params=None):
r = S.get(url, params=params, timeout=30)
if r.status_code == 200:
return r.text
return None
# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
return http_get_text(url)
# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
url = "https://rest.uniprot.org/uniparc/search"
params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
txt = http_get_text(url, params=params)
if not txt:
return None
lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
if len(lines) < 2:
return None
upi = lines[1].split("\t")[0].strip()
return upi if upi.startswith("UPI") else None
# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
txt = http_get_text(url1)
if txt:
return txt
url2 = f"https://rest.uniprot.org/uniparc/{upi}"
return http_get_text(url2, params={"format": "fasta"})
def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
identifier = identifier.strip()
if not identifier:
return None
if identifier.startswith("UPI"):
fasta = fetch_uniparc_fasta(identifier)
return (identifier, fasta) if fasta else None
fasta = fetch_uniprotkb_fasta(identifier)
if fasta:
return (identifier, fasta)
upi = resolve_upi_via_uniparc_search(identifier)
if upi:
fasta2 = fetch_uniparc_fasta(upi)
if fasta2:
fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
return (identifier, fasta2)
return None
def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
missing = []
with open(out_fasta, "w", encoding="utf-8") as f:
for pid in ids:
res = fetch_fasta_for_id(pid)
if res is None:
missing.append(pid)
else:
_, fasta_txt = res
if not fasta_txt.endswith("\n"):
fasta_txt += "\n"
f.write(fasta_txt)
time.sleep(delay_s)
return missing
ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"] # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)
Code snippet (COG clustering script, used in Step 3)
#!/usr/bin/env python3
"""
COG clustering for 4 MS protein sets (Proximity 4h/18h, ALFApulldown 4h/18h)
Updates vs previous version:
- The *category* Excel (functional groups) now also includes protein IDs per group AND per COG letter:
* Sheet: Protein_lists_by_COG (COG letter -> protein IDs)
* Sheet: All_Long_COG (one row per protein per COG letter; best for filtering)
- Renamed group output filename to remove "_optionB_protein_based"
- Removed "(multi-group allowed)" and "(protein-based; ...)" wording from plot 2 axis/title
(note: method still allows multi-group membership; we just don't print it in labels)
"""
import os
import re
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
# -------------------------
# CONFIG
# -------------------------
INPUT_FILES = {
"Proximity_4h": "./eggnog_out_Proximity_4h.emapper.annotations",
"Proximity_18h": "./eggnog_out_Proximity_18h.emapper.annotations",
"ALFApulldown_4h": "./eggnog_out_ALFApulldown_4h.emapper.annotations",
"ALFApulldown_18h": "./eggnog_out_ALFApulldown_18h.emapper.annotations",
}
OUTDIR = "./COG_outputs"
os.makedirs(OUTDIR, exist_ok=True)
ALL_COG = ['J','A','K','L','B','D','Y','V','T','M','N','Z','W','U','O','C','G','E','F','H','I','P','Q','R','S']
ALL_COG_DISPLAY = ALL_COG[::-1]
COG_DESCRIPTIONS = {
"J": "Translation, ribosomal structure and biogenesis",
"A": "RNA processing and modification",
"K": "Transcription",
"L": "Replication, recombination and repair",
"B": "Chromatin structure and dynamics",
"D": "Cell cycle control, cell division, chromosome partitioning",
"Y": "Nuclear structure",
"V": "Defense mechanisms",
"T": "Signal transduction mechanisms",
"M": "Cell wall/membrane/envelope biogenesis",
"N": "Cell motility",
"Z": "Cytoskeleton",
"W": "Extracellular structures",
"U": "Intracellular trafficking, secretion, and vesicular transport",
"O": "Posttranslational modification, protein turnover, chaperones",
"C": "Energy production and conversion",
"G": "Carbohydrate transport and metabolism",
"E": "Amino acid transport and metabolism",
"F": "Nucleotide transport and metabolism",
"H": "Coenzyme transport and metabolism",
"I": "Lipid transport and metabolism",
"P": "Inorganic ion transport and metabolism",
"Q": "Secondary metabolites biosynthesis, transport and catabolism",
"R": "General function prediction only",
"S": "Function unknown",
}
FUNCTIONAL_GROUPS = {
"INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
"CELLULAR PROCESSES AND SIGNALING": ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
"METABOLISM": ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
"POORLY CHARACTERIZED": ['R', 'S'],
}
LETTER_TO_GROUP = {}
for grp, letters in FUNCTIONAL_GROUPS.items():
for c in letters:
LETTER_TO_GROUP[c] = grp
# -------------------------
# Helpers
# -------------------------
def read_emapper_annotations(path: str) -> pd.DataFrame:
rows = []
header = None
with open(path, "r", encoding="utf-8", errors="replace") as f:
for line in f:
line = line.rstrip("\n")
if line.startswith("##"):
continue
if line.startswith("#query"):
header = line.lstrip("#").split("\t")
continue
if header is None and re.match(r"^query\t", line):
header = line.split("\t")
continue
if header is None:
continue
rows.append(line.split("\t"))
if header is None:
raise ValueError(f"Could not find header line (#query/query) in: {path}")
df = pd.DataFrame(rows, columns=header)
if "query" not in df.columns and "#query" in df.columns:
df = df.rename(columns={"#query": "query"})
return df
def split_cog_letters(cog_str):
if cog_str is None:
return []
cog_str = str(cog_str).strip()
if cog_str == "" or cog_str == "-" or cog_str.lower() == "nan":
return []
letters = list(cog_str)
return [c for c in letters if c in ALL_COG]
def count_cog_letters(df: pd.DataFrame) -> dict:
counts = {c: 0 for c in ALL_COG}
for x in df["COG_category"].astype(str).tolist():
for c in split_cog_letters(x):
counts[c] += 1
return counts
def build_category_to_proteins(df: pd.DataFrame) -> dict:
cat2prot = {c: set() for c in ALL_COG}
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
for c in letters:
cat2prot[c].add(q)
return {c: sorted(list(v)) for c, v in cat2prot.items()}
def build_group_to_proteins(df: pd.DataFrame) -> dict:
group2prot = {g: set() for g in FUNCTIONAL_GROUPS.keys()}
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
hit_groups = set()
for c in letters:
g = LETTER_TO_GROUP.get(c)
if g:
hit_groups.add(g)
for g in hit_groups:
group2prot[g].add(q)
return {g: sorted(list(v)) for g, v in group2prot.items()}
def build_assignment_table(df: pd.DataFrame) -> pd.DataFrame:
rows = []
for q, cog in zip(df["query"].astype(str), df["COG_category"].astype(str)):
letters = split_cog_letters(cog)
groups = sorted(set(LETTER_TO_GROUP[c] for c in letters if c in LETTER_TO_GROUP))
rows.append({
"query": q,
"COG_category_raw": cog,
"COG_letters": "".join(sorted(set(letters))) if letters else "",
"Functional_groups": "; ".join(groups) if groups else "",
"Unassigned_COG": (str(cog).strip() in ["", "-", "nan", "NA"])
})
out = pd.DataFrame(rows).drop_duplicates(subset=["query"])
return out.sort_values("query")
def long_format_by_cog(cat2prot: dict, dataset: str) -> pd.DataFrame:
rows = []
for c in ALL_COG:
for pid in cat2prot.get(c, []):
rows.append({
"Dataset": dataset,
"COG": c,
"Description": COG_DESCRIPTIONS.get(c, ""),
"Protein_ID": pid
})
return pd.DataFrame(rows)
def long_format_by_group(grp2prot: dict, dataset: str) -> pd.DataFrame:
rows = []
for g in FUNCTIONAL_GROUPS.keys():
for pid in grp2prot.get(g, []):
rows.append({
"Dataset": dataset,
"Functional_group": g,
"Protein_ID": pid
})
return pd.DataFrame(rows)
def protein_based_group_membership(df: pd.DataFrame) -> pd.DataFrame:
groups = list(FUNCTIONAL_GROUPS.keys())
recs = []
for q, cog in zip(df["query"].astype(str).tolist(), df["COG_category"].astype(str).tolist()):
letters = split_cog_letters(cog)
hit_groups = set()
for c in letters:
g = LETTER_TO_GROUP.get(c)
if g:
hit_groups.add(g)
rec = {"query": q}
for g in groups:
rec[g] = (g in hit_groups)
recs.append(rec)
out = pd.DataFrame(recs)
out = out.groupby("query", as_index=False).max()
return out
# -------------------------
# Main processing
# -------------------------
all_counts = {}
all_pct = {}
debug_summary = []
all_long_cog = []
all_long_group = []
for ds_name, path in INPUT_FILES.items():
print(f"\n--- {ds_name} ---\nReading: {path}")
if not os.path.exists(path):
raise FileNotFoundError(f"Missing input file: {path}")
df = read_emapper_annotations(path)
if "COG_category" not in df.columns:
raise ValueError(f"{path} has no 'COG_category' column.")
n_rows = df.shape[0]
unassigned = (df["COG_category"].isna() | df["COG_category"].astype(str).str.strip().isin(["", "-", "nan"])).sum()
# Letter-based counts and percents
counts = count_cog_letters(df)
total_letters = sum(counts.values())
pct = {k: (v / total_letters * 100 if total_letters else 0.0) for k, v in counts.items()}
counts_s = pd.Series(counts, name="Count").reindex(ALL_COG_DISPLAY)
pct_s = pd.Series(pct, name="Percent").reindex(ALL_COG_DISPLAY).round(2)
all_counts[ds_name] = counts_s
all_pct[ds_name] = pct_s
out_df = pd.DataFrame({
"COG": ALL_COG_DISPLAY,
"Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
"Count": counts_s.values,
"Percent": pct_s.values,
})
dbg = pd.DataFrame([{
"Dataset": ds_name,
"Proteins_in_table": int(n_rows),
"COG_unassigned_rows": int(unassigned),
"Total_assigned_COG_letters": int(total_letters),
"R_count": int(counts.get("R", 0)),
"S_count": int(counts.get("S", 0)),
"R_plus_S_percent_of_letters": float((counts.get("R", 0) + counts.get("S", 0)) / total_letters * 100) if total_letters else 0.0,
}])
debug_summary.append(dbg.iloc[0].to_dict())
assignment_tbl = build_assignment_table(df)
cat2prot = build_category_to_proteins(df)
grp2prot = build_group_to_proteins(df)
# category (COG letter) protein lists
cat_list_df = pd.DataFrame({
"COG": ALL_COG_DISPLAY,
"Description": [COG_DESCRIPTIONS.get(c, "") for c in ALL_COG_DISPLAY],
"N_proteins": [len(cat2prot[c]) for c in ALL_COG_DISPLAY],
"Protein_IDs": ["; ".join(cat2prot[c]) for c in ALL_COG_DISPLAY],
})
# group protein lists
grp_list_df = pd.DataFrame({
"Functional_group": list(FUNCTIONAL_GROUPS.keys()),
"N_proteins": [len(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
"Protein_IDs": ["; ".join(grp2prot[g]) for g in FUNCTIONAL_GROUPS.keys()],
})
df_long_cog = long_format_by_cog(cat2prot, ds_name)
df_long_group = long_format_by_group(grp2prot, ds_name)
all_long_cog.append(df_long_cog)
all_long_group.append(df_long_group)
out_xlsx = os.path.join(OUTDIR, f"COG_{ds_name}.xlsx")
with pd.ExcelWriter(out_xlsx) as writer:
out_df.to_excel(writer, sheet_name="COG_letters", index=False)
dbg.to_excel(writer, sheet_name="Debug", index=False)
assignment_tbl.to_excel(writer, sheet_name="Protein_assignments", index=False)
cat_list_df.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
grp_list_df.to_excel(writer, sheet_name="Protein_lists_by_group", index=False)
df_long_cog.to_excel(writer, sheet_name="Long_format_COG", index=False)
df_long_group.to_excel(writer, sheet_name="Long_format_group", index=False)
print(f"Saved: {out_xlsx}")
# -------------------------
# Combined summaries (letters)
# -------------------------
df_counts = pd.concat(all_counts.values(), axis=1)
df_counts.columns = list(all_counts.keys())
df_pct = pd.concat(all_pct.values(), axis=1)
df_pct.columns = list(all_pct.keys())
df_pct = df_pct.round(2)
combined_xlsx = os.path.join(OUTDIR, "COG_combined_summary.xlsx")
with pd.ExcelWriter(combined_xlsx) as writer:
df_counts.to_excel(writer, sheet_name="Counts")
df_pct.to_excel(writer, sheet_name="Percent")
pd.DataFrame(debug_summary).to_excel(writer, sheet_name="Debug", index=False)
# Combined long formats (includes protein IDs per letter/group across datasets)
pd.concat(all_long_cog, ignore_index=True).to_excel(writer, sheet_name="All_Long_COG", index=False)
pd.concat(all_long_group, ignore_index=True).to_excel(writer, sheet_name="All_Long_group", index=False)
print(f"\nSaved combined summary: {combined_xlsx}")
# -------------------------
# Plot 1: per-letter % (assigned letters)
# -------------------------
categories = df_pct.index.tolist()
datasets = df_pct.columns.tolist()
y = np.arange(len(categories))
bar_height = 0.18
offsets = np.linspace(-bar_height*1.5, bar_height*1.5, num=len(datasets))
fig, ax = plt.subplots(figsize=(12, 10))
for i, ds in enumerate(datasets):
ax.barh(y + offsets[i], df_pct[ds].values, height=bar_height, label=ds)
ax.set_yticks(y)
ax.set_yticklabels(categories)
ax.invert_yaxis()
ax.set_xlabel("Relative occurrence (%) of assigned COG letters")
ax.set_title("COG category distribution (EggNOG COG_category; multi-letter split)")
ax.legend(loc="best")
plt.tight_layout()
plot1_png = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.png")
plot1_pdf = os.path.join(OUTDIR, "COG_grouped_barplot_percent_letters.pdf")
plt.savefig(plot1_png, dpi=300)
plt.savefig(plot1_pdf)
plt.close(fig)
print(f"Saved plot 1: {plot1_png}")
print(f"Saved plot 1: {plot1_pdf}")
# -------------------------
# Plot 2 + Excel: functional groups (protein-based)
# Also includes protein IDs per COG letter in the SAME workbook.
# -------------------------
group_counts = {}
group_pct = {}
group_long = []
# We also want category (letter) protein IDs in this workbook:
all_long_cog_for_groupwb = []
for ds_name, path in INPUT_FILES.items():
df = read_emapper_annotations(path)
# (A) protein-based group membership and stats
memb = protein_based_group_membership(df)
n_proteins = memb.shape[0]
counts = {g: int(memb[g].sum()) for g in FUNCTIONAL_GROUPS.keys()}
pct = {g: (counts[g] / n_proteins * 100 if n_proteins else 0.0) for g in FUNCTIONAL_GROUPS.keys()}
group_counts[ds_name] = pd.Series(counts)
group_pct[ds_name] = pd.Series(pct)
for g in FUNCTIONAL_GROUPS.keys():
ids = memb.loc[memb[g] == True, "query"].astype(str).tolist()
for pid in ids:
group_long.append({"Dataset": ds_name, "Functional_group": g, "Protein_ID": pid})
# (B) add category (letter) protein IDs too
cat2prot = build_category_to_proteins(df)
all_long_cog_for_groupwb.append(long_format_by_cog(cat2prot, ds_name))
df_group_counts = pd.DataFrame(group_counts).reindex(FUNCTIONAL_GROUPS.keys())
df_group_pct = pd.DataFrame(group_pct).reindex(FUNCTIONAL_GROUPS.keys()).round(2)
df_group_long = pd.DataFrame(group_long)
df_all_long_cog = pd.concat(all_long_cog_for_groupwb, ignore_index=True)
# Also create a compact category list with protein IDs (per dataset + letter)
cat_list_rows = []
for ds in df_all_long_cog["Dataset"].unique():
sub = df_all_long_cog[df_all_long_cog["Dataset"] == ds]
for c in ALL_COG:
ids = sub.loc[sub["COG"] == c, "Protein_ID"].tolist()
cat_list_rows.append({
"Dataset": ds,
"COG": c,
"Description": COG_DESCRIPTIONS.get(c, ""),
"N_proteins": len(ids),
"Protein_IDs": "; ".join(sorted(ids))
})
df_cat_list_for_groupwb = pd.DataFrame(cat_list_rows)
group_xlsx = os.path.join(OUTDIR, "COG_functional_groups.xlsx")
with pd.ExcelWriter(group_xlsx) as writer:
df_group_counts.to_excel(writer, sheet_name="Counts_Proteins")
df_group_pct.to_excel(writer, sheet_name="Percent_of_Proteins")
df_group_long.to_excel(writer, sheet_name="Long_format_group", index=False)
# NEW: category protein IDs included here as well
df_cat_list_for_groupwb.to_excel(writer, sheet_name="Protein_lists_by_COG", index=False)
df_all_long_cog.to_excel(writer, sheet_name="All_Long_COG", index=False)
print(f"Saved functional-group workbook: {group_xlsx}")
# Plot 2 (labels updated as requested)
groups = df_group_pct.index.tolist()
datasets = df_group_pct.columns.tolist()
x = np.arange(len(groups))
bar_width = 0.18
offsets = np.linspace(-bar_width*1.5, bar_width*1.5, num=len(datasets))
fig, ax = plt.subplots(figsize=(12, 6))
for i, ds in enumerate(datasets):
ax.bar(x + offsets[i], df_group_pct[ds].values, width=bar_width, label=ds)
ax.set_xticks(x)
ax.set_xticklabels(groups, rotation=15, ha="right")
ax.set_ylabel("% of proteins")
ax.set_title("COG functional groups")
ax.legend(loc="best")
plt.tight_layout()
plot2_png = os.path.join(OUTDIR, "COG_functional_groups.png")
plot2_pdf = os.path.join(OUTDIR, "COG_functional_groups.pdf")
plt.savefig(plot2_png, dpi=300)
plt.savefig(plot2_pdf)
plt.close(fig)
print(f"Saved plot 2: {plot2_png}")
print(f"Saved plot 2: {plot2_pdf}")
print("\nDONE. All outputs are in:", os.path.abspath(OUTDIR))