This post summarizes a complete workflow to run GO enrichment for protein sets from mass spectrometry (MS) in a non-model organism, where:
- MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
- Standard R organism annotation packages (org.*.eg.db) are not available.
- We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE.
The workflow produces:
- 4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
- 1 combined summary workbook across all datasets
1) Generate protein FASTA files
1.1 MS protein sequences (Proximity / ALFApulldown)
From the MS results (protein ID lists), retrieve the sequences and write FASTA files (see script A at the end of this post):
python3 getProteinSequences_Proximity_4h.py > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py > ALFApulldown_18h.fasta
1.2 Reference proteome sequences (for mapping + background)
Download the reference proteome FASTA from GenBank and standardize headers:
mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta
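The update_fasta_header.py script is not reproduced here; as a minimal sketch, assuming GenBank-style protein headers that carry a [locus_tag=...] field (hypothetical — adapt the pattern to whatever your download actually contains), it could look like:
import re
import sys

# Minimal sketch (hypothetical): rewrite GenBank-style FASTA headers such as
#   >lcl|CP020463.1_prot_XXXXX_1 [locus_tag=B4U56_00090] [protein=...]
# to plain locus-tag headers (>B4U56_00090) so IDs match the genome annotation.
in_fasta, out_fasta = sys.argv[1], sys.argv[2]

with open(in_fasta) as fin, open(out_fasta, "w") as fout:
    for line in fin:
        if line.startswith(">"):
            m = re.search(r"\[locus_tag=([^\]]+)\]", line)
            # Keep the original header if no locus_tag is present.
            fout.write(f">{m.group(1)}\n" if m else line)
        else:
            fout.write(line)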
2) Optional functional annotation for merging MS + RNA-seq (EggNOG)
EggNOG-mapper assigns KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and for quickly merging MS and RNA-seq results. (For enrichment we will use the Blast2GO GO terms built later, which are typically more comprehensive for GO/EC.)
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env
download_eggnog_data.py --dbname eggnog.db -y \
--data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60
# MS sets
emapper.py -i Proximity_4h.fasta -o eggnog_out_Proximity_4h --cpu 60
emapper.py -i Proximity_18h.fasta -o eggnog_out_Proximity_18h --cpu 60
emapper.py -i ALFApulldown_4h.fasta -o eggnog_out_ALFApulldown_4h --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60
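Each emapper run writes a tab-separated <prefix>.emapper.annotations table (metadata lines start with ##, the header line with #query). A small sketch for pulling the GO column into a long per-protein table for merging; column names (e.g., GOs) are from eggnog-mapper v2 and may differ slightly in other versions:
import pandas as pd

# Drop '##' metadata lines, keep the '#query ...' header line.
path = "eggnog_out.emapper.annotations"
with open(path) as fh:
    lines = [ln for ln in fh if not ln.startswith("##")]
header = lines[0].lstrip("#").rstrip("\n").split("\t")
df = pd.DataFrame([ln.rstrip("\n").split("\t") for ln in lines[1:]], columns=header)

# 'GOs' holds comma-separated GO IDs ('-' when none); explode to one row per GO.
go_long = (
    df[["query", "GOs"]]
    .query("GOs != '-'")
    .assign(GOs=lambda d: d["GOs"].str.split(","))
    .explode("GOs")
    .rename(columns={"query": "protein", "GOs": "GO"})
)
print(go_long.head())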
3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)
Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.
3.1 Setup workspace
mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/
Launch Blast2GO:
~/Tools/Blast2GO/Blast2GO_Launcher
3.2 Step-by-step in Blast2GO GUI
STEP 1 — Load sequences (reference proteome)
- File → Load → Load Sequences
- Choose: Load Fasta File (.fasta)
- Select: CP020463_protein.fasta
- Tags: (leave default / none)
- Check that the table fills with columns such as Nr and SeqName
✅ Output: sequences are loaded into the project
STEP 2 — BLAST (QBlast)
- Go to the BLAST panel
- Choose blastp (protein vs protein)
- Database: nr (NCBI)
- Set other parameters as needed (defaults typically OK)
- Run QBlast
⚠️ This step is typically the most time-consuming (hours to days, depending on dataset size and the NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit the affected sequences if desired.
✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)
STEP 3 — Mapping
- Click Mapping
✅ Output:
- Tags updated to MAPPED
- Columns appear such as #GO, GO IDs, GO Names
STEP 4 — Annotation
- Click Annotation
- Key parameters you may set/keep:
  - Annotation CutOff (controls the annotation reliability threshold)
  - GO Weight (boosts more general terms when supported by multiple specific hits)
✅ Output:
- Tags updated to ANNOTATED
- Enzyme-related columns may appear (e.g., Enzyme Codes)
STEP 5 — Export Annotations (before merging InterPro)
- File → Export → Export Annotations
- Choose Export Annotations (.annot, custom, etc.)
- Save as:
~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot
✅ Output: blast2go_annot.annot
STEP 6 — InterProScan (optional but recommended for more GO terms)
- Click InterPro / InterProScan
- Run InterProScan (can be long: hours to >1 day depending on dataset & setup)
✅ Output:
- Tags updated to INTERPRO
- Additional columns: InterPro IDs, InterPro GO IDs/Names
STEP 7 — Merge InterProScan GOs into existing annotation
- In InterPro panel, choose: Merge InterProScan GOs to Annotation
- Confirm merge
✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)
STEP 8 — Export final annotations (after merging InterPro)
- File → Export → Export Annotations
- Save as:
~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2
✅ This is the final file used for enrichment:
blast2go_annot.annot2
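For orientation, the .annot/.annot2 file is a plain tab-separated table with one row per (sequence, annotation) pair: the locus tag, then a GO ID or enzyme code, optionally followed by a description. The wrapper below keeps only the first two columns and the GO: rows. Illustrative lines (locus tags and terms invented for the example):
B4U56_00090	GO:0003677	DNA binding
B4U56_00090	GO:0006355	regulation of transcription, DNA-templated
B4U56_00095	EC:2.7.7.7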
Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.
4) Generate BLAST mapping tables: *_vs_ref.blast.tsv
We map each MS protein set to the reference proteome locus tags using BLASTP.
4.1 Create BLAST database from reference proteome
makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db
4.2 BLASTP each MS set against the reference DB
blastp -query Proximity_4h.fasta -db CP020463_ref_db \
-out Proximity_4h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query Proximity_18h.fasta -db CP020463_ref_db \
-out Proximity_18h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
-out ALFApulldown_4h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
-out ALFApulldown_18h_vs_ref.blast.tsv \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
-evalue 1e-10 -max_target_seqs 5 -num_threads 16
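The extra qlen/slen columns in the outfmt let us compute query coverage downstream. Before running the wrapper, a quick sanity check that its filters (pident ≥ 70, qcov ≥ 0.70, e-value ≤ 1e-10; see the R script below) leave a reasonable number of mapped queries can save a debugging round; a minimal pandas sketch:
import pandas as pd

cols = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
        "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen", "slen"]
bl = pd.read_csv("Proximity_4h_vs_ref.blast.tsv", sep="\t", names=cols)

# Query coverage = alignment length / query length (same definition as the R wrapper).
bl["qcov"] = bl["length"] / bl["qlen"]
kept = bl[(bl["pident"] >= 70) & (bl["qcov"] >= 0.70) & (bl["evalue"] <= 1e-10)]
print(bl["qseqid"].nunique(), "queries with hits;",
      kept["qseqid"].nunique(), "pass the filters")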
5) Run GO enrichment (4 sets + combined summary)
We run a single wrapper script that:
- normalizes FASTA headers (handles UniParc and UniProt tr|... formats)
- filters BLAST hits (pident / qcov / e-value thresholds)
- picks one best hit per query
- performs GO enrichment via clusterProfiler::enricher()
- uses as background (universe) all reference proteins with ≥1 GO term in blast2go_annot.annot2
- applies a BH/FDR-adjusted p-value cutoff of 0.05
- writes one Excel report per dataset plus one combined summary workbook
Rscript wrapper_besthit_GO_enrichment_4sets.R
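For context: enricher() runs a hypergeometric test per GO term, which is why the choice of universe matters — restricting it to GO-annotated reference proteins avoids diluting the test with proteins that could never contribute a term. For a term annotated to $K$ of the $N$ universe proteins, with $k$ of the $n$ mapped proteins carrying it, the one-sided p-value is

$$p = \sum_{j=k}^{\min(n,K)} \frac{\binom{K}{j}\binom{N-K}{n-j}}{\binom{N}{n}},$$

and these p-values are then BH-adjusted across all tested terms.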
Code snippets (scripts used in this workflow)
A) Example script for Step 1: generate Proximity_4h FASTA
import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
# --------- robust HTTP session (retries + backoff) ----------
def make_session():
s = requests.Session()
retries = Retry(
total=6,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
s.headers.update({
"User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
})
return s
S = make_session()
def http_get_text(url, params=None):
r = S.get(url, params=params, timeout=30)
if r.status_code == 200:
return r.text
return None
# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
return http_get_text(url)
# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
url = "https://rest.uniprot.org/uniparc/search"
params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
txt = http_get_text(url, params=params)
if not txt:
return None
lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
if len(lines) < 2:
return None
upi = lines[1].split("\t")[0].strip()
return upi if upi.startswith("UPI") else None
# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
txt = http_get_text(url1)
if txt:
return txt
url2 = f"https://rest.uniprot.org/uniparc/{upi}"
return http_get_text(url2, params={"format": "fasta"})
def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
identifier = identifier.strip()
if not identifier:
return None
if identifier.startswith("UPI"):
fasta = fetch_uniparc_fasta(identifier)
return (identifier, fasta) if fasta else None
fasta = fetch_uniprotkb_fasta(identifier)
if fasta:
return (identifier, fasta)
upi = resolve_upi_via_uniparc_search(identifier)
if upi:
fasta2 = fetch_uniparc_fasta(upi)
if fasta2:
fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
return (identifier, fasta2)
return None
def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
missing = []
with open(out_fasta, "w", encoding="utf-8") as f:
for pid in ids:
res = fetch_fasta_for_id(pid)
if res is None:
missing.append(pid)
else:
_, fasta_txt = res
if not fasta_txt.endswith("\n"):
fasta_txt += "\n"
f.write(fasta_txt)
time.sleep(delay_s)
return missing
ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"] # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)
B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets
#!/usr/bin/env Rscript
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tibble)
library(openxlsx)
library(clusterProfiler)
library(AnnotationDbi)
library(GO.db)
})
# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2" # or full path
datasets <- tibble::tribble(
~name, ~blast_tsv, ~fasta,
"Proximity_4h", "Proximity_4h_vs_ref.blast.tsv", "Proximity_4h.fasta",
"Proximity_18h", "Proximity_18h_vs_ref.blast.tsv", "Proximity_18h.fasta",
"ALFApulldown_4h", "ALFApulldown_4h_vs_ref.blast.tsv", "ALFApulldown_4h.fasta",
"ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)
out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV <- 0.70
MAX_EVALUE <- 1e-10
# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"
# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
  # Take the first whitespace-separated token, then extract the accession:
  # "tr|ACC|NAME" / "sp|ACC|NAME" -> ACC; anything else -> first "|" field.
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  parts <- stringr::str_split(x, "\\|", simplify = TRUE)
  if (ncol(parts) < 2) return(parts[, 1])  # no "|" anywhere: IDs are already bare
  is_uniprot <- ncol(parts) >= 3 & (parts[, 1] == "tr" | parts[, 1] == "sp")
  ifelse(is_uniprot, parts[, 2], parts[, 1])
}
norm_subject_id <- function(x) {
x <- as.character(x)
x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
stringr::str_split_fixed(x, "\\|", 2)[, 1]
}
add_go_desc <- function(df) {
if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
df$GO_Term <- vapply(df$ID, function(go_id) {
term <- tryCatch(
AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
error = function(e) NULL
)
if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
}, FUN.VALUE = character(1))
df
}
safe_read_blast <- function(path) {
cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
"qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
readr::read_tsv(
path,
col_names = cols,
col_types = readr::cols(
qseqid = readr::col_character(),
sseqid = readr::col_character(),
pident = readr::col_double(),
length = readr::col_double(),
mismatch = readr::col_double(),
gapopen = readr::col_double(),
qstart = readr::col_double(),
qend = readr::col_double(),
sstart = readr::col_double(),
send = readr::col_double(),
evalue = readr::col_double(),
bitscore = readr::col_double(),
qlen = readr::col_double(),
slen = readr::col_double()
),
progress = FALSE
)
}
# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term <- as.character(annot_df$Term)
term2gene_go_ref <- annot_df %>%
dplyr::filter(grepl("^GO:", Term)) %>%
dplyr::transmute(term = Term, gene = GeneID) %>%
dplyr::distinct()
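# term2gene_go_ref is the two-column table (term = GO ID, gene = locus tag)
# that clusterProfiler::enricher() expects as TERM2GENE.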
universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")
# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms <- list()
all_besthits <- list()
# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {
ds <- datasets[i, ]
name <- ds$name
blast_path <- ds$blast_tsv
fasta_path <- ds$fasta
cat("\n=============================\n")
cat("Dataset:", name, "\n")
if (!file.exists(blast_path)) {
warning("Missing BLAST TSV: ", blast_path, " (skipping)")
next
}
if (!file.exists(fasta_path)) {
warning("Missing FASTA: ", fasta_path, " (skipping)")
next
}
# ---- FASTA query IDs ----
fa <- readLines(fasta_path, warn = FALSE)
q_all <- fa[startsWith(fa, ">")]
q_all <- gsub("^>", "", q_all)
q_all <- unique(norm_mixed_id(q_all))
n_fasta <- length(q_all)
# ---- BLAST ----
bl <- safe_read_blast(blast_path) %>%
dplyr::mutate(
qid = norm_mixed_id(qseqid),
sid = norm_subject_id(sseqid),
qcov = length / qlen
)
n_queries_with_hits <- dplyr::n_distinct(bl$qid)
max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0
cat("FASTA queries:", n_fasta, "\n")
cat("BLAST rows:", nrow(bl), "\n")
cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
cat("Max hits/query:", max_hits_per_query, "\n")
bl_f <- bl %>%
dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)
cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")
best <- bl_f %>%
dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
dplyr::group_by(qid) %>%
dplyr::slice_head(n = 1) %>%
dplyr::ungroup()
best_out <- best %>%
dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
dplyr::arrange(dplyr::desc(bitscore))
best_all <- tibble::tibble(qid = q_all) %>%
dplyr::left_join(best_out, by = "qid")
unmapped <- best_all %>%
dplyr::filter(is.na(sid) | sid == "") %>%
dplyr::distinct(qid) %>%
dplyr::pull(qid)
mapped <- best_out %>%
dplyr::filter(!is.na(sid), sid != "")
cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
cat("Unmapped queries:", length(unmapped), "\n")
cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")
gene_list_ref <- unique(mapped$sid)
n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")
# Save best-hit TSVs
fn_best <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
readr::write_tsv(best_out, fn_best)
readr::write_tsv(best_all, fn_best_all)
# GO enrichment
go_res <- tryCatch(
clusterProfiler::enricher(
gene = gene_list_ref,
TERM2GENE = term2gene_go_ref,
universe = universe_ref,
pvalueCutoff = P_CUT,
pAdjustMethod = PADJ_METHOD
),
error = function(e) NULL
)
go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
go_df <- add_go_desc(go_df)
cat("Enriched GO terms:", nrow(go_df), "\n")
per_protein_go <- mapped %>%
dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
dplyr::distinct() %>%
dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
dplyr::rename(GO = term)
# Per-dataset Excel
out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "BestHit_Mapped")
openxlsx::writeData(wb, "BestHit_Mapped", mapped)
openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))
openxlsx::addWorksheet(wb, "PerProtein_GO")
openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)
openxlsx::addWorksheet(wb, "GO_Enrichment")
openxlsx::writeData(wb, "GO_Enrichment", go_df)
openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
cat("Saved:", out_xlsx, "\n")
# Collect combined summary
summary_runs[[name]] <- tibble::tibble(
dataset = name,
fasta_queries = n_fasta,
blast_rows = nrow(bl),
queries_with_hits = n_queries_with_hits,
max_hits_per_query = max_hits_per_query,
filtered_rows = nrow(bl_f),
filtered_queries = dplyr::n_distinct(bl_f$qid),
mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
unmapped_queries = length(unmapped),
unique_targets_besthit = dplyr::n_distinct(mapped$sid),
duplicated_targets_besthit = sum(duplicated(mapped$sid)),
mapped_targets_with_GO = n_targets_with_go,
enriched_GO_terms = nrow(go_df)
)
all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
all_go_terms[[name]] <- dplyr::mutate(per_protein_go, dataset = name)
all_besthits[[name]] <- dplyr::mutate(mapped, dataset = name)
}
# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)
go_counts <- combined_per_protein_go %>%
dplyr::filter(!is.na(GO), GO != "") %>%
dplyr::distinct(dataset, qid, GO) %>%
dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))
combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)
openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)
openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)
openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)
openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)
openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)
cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")