End-to-end GO enrichment for non-model bacteria: MS → reference ID mapping with BLAST + Blast2GO (GUI) + R enrichment

This post summarizes a complete workflow to run GO enrichment for protein sets from mass spectrometry (MS) in a non-model organism, where:

  • MS protein IDs (e.g., UniProt/UniParc) do not match the locus tags used in the reference genome (e.g., B4U56_00090).
  • Standard R organism annotation packages (org.*.eg.db) are not available.
  • We therefore build our own GO mapping using Blast2GO and perform enrichment with a custom TERM2GENE.

The workflow produces:

  • 4 per-dataset Excel reports (Proximity 4h/18h, ALFApulldown 4h/18h)
  • 1 combined summary workbook across all datasets

1) Generate protein FASTA files

1.1 MS protein sequences (Proximity / ALFApulldown)

From MS results (protein ID lists), retrieve sequences and write FASTA:

python3 getProteinSequences_Proximity_4h.py      > Proximity_4h.fasta
python3 getProteinSequences_Proximity_18h.py     > Proximity_18h.fasta
python3 getProteinSequences_ALFApulldown_4h.py   > ALFApulldown_4h.fasta
python3 getProteinSequences_ALFApulldown_18h.py  > ALFApulldown_18h.fasta

1.2 Reference proteome sequences (for mapping + background)

Download the reference proteome FASTA from GenBank and standardize headers:

mv ~/Downloads/sequence\ \(3\).txt CP020463_protein_.fasta
python ~/Scripts/update_fasta_header.py CP020463_protein_.fasta CP020463_protein.fasta

2) Optional functional annotation for merging MS + RNA-seq (EggNOG)

EggNOG gives KO/GO predictions via orthology/phylogeny, which is useful for annotation tables and quick merging. (For enrichment, we will use Blast2GO GO terms later, which are typically more comprehensive for GO/EC.)

mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda
mamba activate eggnog_env

download_eggnog_data.py --dbname eggnog.db -y \
  --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/

# reference proteome (for RNA-seq / background annotation)
emapper.py -i CP020463_protein.fasta -o eggnog_out --cpu 60

# MS sets
emapper.py -i Proximity_4h.fasta     -o eggnog_out_Proximity_4h     --cpu 60
emapper.py -i Proximity_18h.fasta    -o eggnog_out_Proximity_18h    --cpu 60
emapper.py -i ALFApulldown_4h.fasta  -o eggnog_out_ALFApulldown_4h  --cpu 60
emapper.py -i ALFApulldown_18h.fasta -o eggnog_out_ALFApulldown_18h --cpu 60

3) Build comprehensive GO annotations using Blast2GO GUI (FULL steps)

Because this is a non-model organism, we create a custom GO annotation file from the reference proteome (CP020463_protein.fasta) using Blast2GO.

3.1 Setup workspace

mkdir -p ~/b2gWorkspace_Michelle_RNAseq_2025
cp /path/to/CP020463_protein.fasta ~/b2gWorkspace_Michelle_RNAseq_2025/

Launch Blast2GO:

~/Tools/Blast2GO/Blast2GO_Launcher

3.2 Step-by-step in Blast2GO GUI

STEP 1 — Load sequences (reference proteome)

  1. FileLoadLoad Sequences
  2. Choose: Load Fasta File (.fasta)
  3. Select: CP020463_protein.fasta
  4. Tags: (leave default / none)
  5. Check that the table is filled with columns like Nr, SeqName

✅ Output: sequences are loaded into the project


STEP 2 — BLAST (QBlast)

  1. Go to the BLAST panel
  2. Choose blastp (protein vs protein)
  3. Database: nr (NCBI)
  4. Set other parameters as needed (defaults typically OK)
  5. Run QBlast

⚠️ This step is typically the most time-consuming (hours to days depending on dataset size and NCBI queue). If Blast2GO reports warnings like “Sequences without results”, you can re-submit if desired.

✅ Output: sequences get BLAST hit information; the table gets BLAST-related columns (hits, e-values, descriptions)


STEP 3 — Mapping

  1. Click Mapping

✅ Output:

  • Tags updated to MAPPED
  • Columns appear such as #GO, GO IDs, GO Names

STEP 4 — Annotation

  1. Click Annotation
  2. Key parameters you may set/keep:

    • Annotation CutOff (controls reliability threshold)
    • GO Weight (boosts more general terms when supported by multiple specific hits)

✅ Output:

  • Tags updated to ANNOTATED
  • Enzyme-related columns may appear (e.g., Enzyme Codes)

STEP 5 — Export Annotations (before merging InterPro)

  1. FileExportExport Annotations
  2. Choose Export Annotations (.annot, custom, etc.)
  3. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot

✅ Output: blast2go_annot.annot


STEP 6 — InterProScan (optional but recommended for more GO terms)

  1. Click InterPro / InterProScan
  2. Run InterProScan (can be long: hours to >1 day depending on dataset & setup)

✅ Output:

  • Tags updated to INTERPRO
  • Additional columns: InterPro IDs, InterPro GO IDs/Names

STEP 7 — Merge InterProScan GOs into existing annotation

  1. In InterPro panel, choose: Merge InterProScan GOs to Annotation
  2. Confirm merge

✅ Output: GO annotation becomes more complete (adds/validates InterPro GO terms)


STEP 8 — Export final annotations (after merging InterPro)

  1. FileExportExport Annotations
  2. Save as: ~/b2gWorkspace_Michelle_RNAseq_2025/blast2go_annot.annot2

✅ This is the final file used for enrichment:

  • blast2go_annot.annot2

Practical note: For enrichment we only need one Blast2GO annotation built on the reference proteome. Blast2GO runs on each MS set are not needed.


4) Generate BLAST mapping tables: *_vs_ref.blast.tsv

We map each MS protein set to the reference proteome locus tags using BLASTP.

4.1 Create BLAST database from reference proteome

makeblastdb -in CP020463_protein.fasta -dbtype prot -out CP020463_ref_db

4.2 BLASTP each MS set against the reference DB

blastp -query Proximity_4h.fasta -db CP020463_ref_db \
  -out Proximity_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query Proximity_18h.fasta -db CP020463_ref_db \
  -out Proximity_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_4h.fasta -db CP020463_ref_db \
  -out ALFApulldown_4h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

blastp -query ALFApulldown_18h.fasta -db CP020463_ref_db \
  -out ALFApulldown_18h_vs_ref.blast.tsv \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen" \
  -evalue 1e-10 -max_target_seqs 5 -num_threads 16

5) Run GO enrichment (4 sets + combined summary)

We run a single wrapper script that:

  • normalizes FASTA headers (handles UniParc/UniProt tr|... formats)
  • filters BLAST hits (pident/qcov/evalue thresholds)
  • picks one best hit per query
  • performs GO enrichment via clusterProfiler::enricher()
  • background (universe) = all reference proteins with ≥1 GO term in blast2go_annot.annot2
  • cutoff = BH/FDR adjusted p-value < 0.05
  • writes one Excel per dataset + one combined summary workbook
Rscript wrapper_besthit_GO_enrichment_4sets.R

Code snippets (used scripts)

A) Example script for Step 1: generate Proximity_4h FASTA

import time
import re
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --------- robust HTTP session (retries + backoff) ----------
def make_session():
    s = requests.Session()
    retries = Retry(
        total=6,
        backoff_factor=0.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": "sequence-fetcher/1.0 (contact: your_email@example.com)"
    })
    return s

S = make_session()

def http_get_text(url, params=None):
    r = S.get(url, params=params, timeout=30)
    if r.status_code == 200:
        return r.text
    return None

# --------- UniProtKB FASTA ----------
def fetch_uniprotkb_fasta(acc: str) -> str | None:
    url = f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"
    return http_get_text(url)

# --------- Resolve accession -> UniParc UPI (via UniParc search) ----------
def resolve_upi_via_uniparc_search(acc: str) -> str | None:
    url = "https://rest.uniprot.org/uniparc/search"
    params = {"query": acc, "format": "tsv", "fields": "upi", "size": 1}
    txt = http_get_text(url, params=params)
    if not txt:
        return None
    lines = [ln.strip() for ln in txt.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    upi = lines[1].split("\t")[0].strip()
    return upi if upi.startswith("UPI") else None

# --------- UniParc FASTA ----------
def fetch_uniparc_fasta(upi: str) -> str | None:
    url1 = f"https://rest.uniprot.org/uniparc/{upi}.fasta"
    txt = http_get_text(url1)
    if txt:
        return txt
    url2 = f"https://rest.uniprot.org/uniparc/{upi}"
    return http_get_text(url2, params={"format": "fasta"})

def fetch_fasta_for_id(identifier: str) -> tuple[str, str] | None:
    identifier = identifier.strip()
    if not identifier:
        return None
    if identifier.startswith("UPI"):
        fasta = fetch_uniparc_fasta(identifier)
        return (identifier, fasta) if fasta else None

    fasta = fetch_uniprotkb_fasta(identifier)
    if fasta:
        return (identifier, fasta)

    upi = resolve_upi_via_uniparc_search(identifier)
    if upi:
        fasta2 = fetch_uniparc_fasta(upi)
        if fasta2:
            fasta2 = re.sub(r"^>", f">{identifier}|UniParc:{upi} ", fasta2, count=1, flags=re.M)
            return (identifier, fasta2)
    return None

def fetch_all(ids: list[str], out_fasta: str = "all_sequences.fasta", delay_s: float = 0.2):
    missing = []
    with open(out_fasta, "w", encoding="utf-8") as f:
        for pid in ids:
            res = fetch_fasta_for_id(pid)
            if res is None:
                missing.append(pid)
            else:
                _, fasta_txt = res
                if not fasta_txt.endswith("\n"):
                    fasta_txt += "\n"
                f.write(fasta_txt)
            time.sleep(delay_s)
    return missing

ids = ["A0A0E1VEW0", "A0A0E1VHW4", "A0A0N1EUK4"]  # etc...
missing = fetch_all(ids, out_fasta="Proximity_4h.fasta")
print("Missing:", missing)

B) Wrapper R script: best-hit selection + GO enrichment for 4 datasets

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(readr)
  library(stringr)
  library(tibble)
  library(openxlsx)
  library(clusterProfiler)
  library(AnnotationDbi)
  library(GO.db)
})

# ---------------------------
# CONFIG: EDIT THESE PATHS
# ---------------------------
blast2go_path <- "blast2go_annot.annot2"  # or full path

datasets <- tibble::tribble(
  ~name,              ~blast_tsv,                          ~fasta,
  "Proximity_4h",     "Proximity_4h_vs_ref.blast.tsv",     "Proximity_4h.fasta",
  "Proximity_18h",    "Proximity_18h_vs_ref.blast.tsv",    "Proximity_18h.fasta",
  "ALFApulldown_4h",  "ALFApulldown_4h_vs_ref.blast.tsv",  "ALFApulldown_4h.fasta",
  "ALFApulldown_18h", "ALFApulldown_18h_vs_ref.blast.tsv", "ALFApulldown_18h.fasta"
)

out_dir <- "./GO_wrapper_outputs"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Best-hit filtering thresholds (tune)
MIN_PIDENT <- 70
MIN_QCOV   <- 0.70
MAX_EVALUE <- 1e-10

# Enrichment thresholds
P_CUT <- 0.05
PADJ_METHOD <- "BH"

# ---------------------------
# Helpers
# ---------------------------
norm_mixed_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  parts <- stringr::str_split(x, "\\|", simplify = TRUE)
  is_uniprot <- ncol(parts) >= 3 & (parts[, 1] == "tr" | parts[, 1] == "sp")
  ifelse(is_uniprot, parts[, 2], parts[, 1])
}

norm_subject_id <- function(x) {
  x <- as.character(x)
  x <- stringr::str_split_fixed(x, "\\s+", 2)[, 1]
  stringr::str_split_fixed(x, "\\|", 2)[, 1]
}

add_go_desc <- function(df) {
  if (is.null(df) || nrow(df) == 0 || !"ID" %in% names(df)) return(df)
  df$GO_Term <- vapply(df$ID, function(go_id) {
    term <- tryCatch(
      AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID"),
      error = function(e) NULL
    )
    if (!is.null(term) && nrow(term)) term$TERM[1] else NA_character_
  }, FUN.VALUE = character(1))
  df
}

safe_read_blast <- function(path) {
  cols <- c("qseqid","sseqid","pident","length","mismatch","gapopen",
            "qstart","qend","sstart","send","evalue","bitscore","qlen","slen")
  readr::read_tsv(
    path,
    col_names = cols,
    col_types = readr::cols(
      qseqid   = readr::col_character(),
      sseqid   = readr::col_character(),
      pident   = readr::col_double(),
      length   = readr::col_double(),
      mismatch = readr::col_double(),
      gapopen  = readr::col_double(),
      qstart   = readr::col_double(),
      qend     = readr::col_double(),
      sstart   = readr::col_double(),
      send     = readr::col_double(),
      evalue   = readr::col_double(),
      bitscore = readr::col_double(),
      qlen     = readr::col_double(),
      slen     = readr::col_double()
    ),
    progress = FALSE
  )
}

# ---------------------------
# Load Blast2GO TERM2GENE + universe
# ---------------------------
annot_df <- utils::read.table(blast2go_path, header = FALSE, sep = "\t",
                              stringsAsFactors = FALSE, fill = TRUE, quote = "")
annot_df <- annot_df[, 1:2]
colnames(annot_df) <- c("GeneID", "Term")
annot_df$GeneID <- as.character(annot_df$GeneID)
annot_df$Term   <- as.character(annot_df$Term)

term2gene_go_ref <- annot_df %>%
  dplyr::filter(grepl("^GO:", Term)) %>%
  dplyr::transmute(term = Term, gene = GeneID) %>%
  dplyr::distinct()

universe_ref <- unique(term2gene_go_ref$gene)
cat("Reference universe (proteins with GO):", length(universe_ref), "\n")

# ---------------------------
# Combined summary collectors
# ---------------------------
summary_runs <- list()
all_go_enrich <- list()
all_go_terms  <- list()
all_besthits  <- list()

# ---------------------------
# Main loop
# ---------------------------
for (i in seq_len(nrow(datasets))) {

  ds <- datasets[i, ]
  name <- ds$name
  blast_path <- ds$blast_tsv
  fasta_path <- ds$fasta

  cat("\n=============================\n")
  cat("Dataset:", name, "\n")

  if (!file.exists(blast_path)) {
    warning("Missing BLAST TSV: ", blast_path, " (skipping)")
    next
  }
  if (!file.exists(fasta_path)) {
    warning("Missing FASTA: ", fasta_path, " (skipping)")
    next
  }

  # ---- FASTA query IDs ----
  fa <- readLines(fasta_path, warn = FALSE)
  q_all <- fa[startsWith(fa, ">")]
  q_all <- gsub("^>", "", q_all)
  q_all <- unique(norm_mixed_id(q_all))
  n_fasta <- length(q_all)

  # ---- BLAST ----
  bl <- safe_read_blast(blast_path) %>%
    dplyr::mutate(
      qid  = norm_mixed_id(qseqid),
      sid  = norm_subject_id(sseqid),
      qcov = length / qlen
    )

  n_queries_with_hits <- dplyr::n_distinct(bl$qid)
  max_hits_per_query <- if (nrow(bl)) max(table(bl$qid)) else 0

  cat("FASTA queries:", n_fasta, "\n")
  cat("BLAST rows:", nrow(bl), "\n")
  cat("Queries with >=1 BLAST hit:", n_queries_with_hits, "\n")
  cat("Max hits/query:", max_hits_per_query, "\n")

  bl_f <- bl %>%
    dplyr::filter(evalue <= MAX_EVALUE, pident >= MIN_PIDENT, qcov >= MIN_QCOV)

  cat("After filters: rows:", nrow(bl_f), " unique queries:", dplyr::n_distinct(bl_f$qid), "\n")

  best <- bl_f %>%
    dplyr::arrange(qid, dplyr::desc(bitscore), evalue, dplyr::desc(qcov), dplyr::desc(pident)) %>%
    dplyr::group_by(qid) %>%
    dplyr::slice_head(n = 1) %>%
    dplyr::ungroup()

  best_out <- best %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::arrange(dplyr::desc(bitscore))

  best_all <- tibble::tibble(qid = q_all) %>%
    dplyr::left_join(best_out, by = "qid")

  unmapped <- best_all %>%
    dplyr::filter(is.na(sid) | sid == "") %>%
    dplyr::distinct(qid) %>%
    dplyr::pull(qid)

  mapped <- best_out %>%
    dplyr::filter(!is.na(sid), sid != "")

  cat("Best-hit mapped queries:", dplyr::n_distinct(mapped$qid), "\n")
  cat("Unmapped queries:", length(unmapped), "\n")
  cat("Unique mapped targets:", dplyr::n_distinct(mapped$sid), "\n")
  cat("Duplicated targets in best hits:", sum(duplicated(mapped$sid)), "\n")

  gene_list_ref <- unique(mapped$sid)
  n_targets_with_go <- sum(gene_list_ref %in% universe_ref)
  cat("Mapped targets with GO terms:", n_targets_with_go, "/", length(gene_list_ref), "\n")

  # Save best-hit TSVs
  fn_best     <- file.path(out_dir, paste0(name, "_blast_besthit.tsv"))
  fn_best_all <- file.path(out_dir, paste0(name, "_blast_besthit_with_unmapped.tsv"))
  readr::write_tsv(best_out, fn_best)
  readr::write_tsv(best_all, fn_best_all)

  # GO enrichment
  go_res <- tryCatch(
    clusterProfiler::enricher(
      gene = gene_list_ref,
      TERM2GENE = term2gene_go_ref,
      universe = universe_ref,
      pvalueCutoff = P_CUT,
      pAdjustMethod = PADJ_METHOD
    ),
    error = function(e) NULL
  )

  go_df <- if (!is.null(go_res)) as.data.frame(go_res) else data.frame()
  go_df <- add_go_desc(go_df)

  cat("Enriched GO terms:", nrow(go_df), "\n")

  per_protein_go <- mapped %>%
    dplyr::select(qid, sid, pident, qcov, evalue, bitscore) %>%
    dplyr::distinct() %>%
    dplyr::left_join(term2gene_go_ref, by = c("sid" = "gene")) %>%
    dplyr::rename(GO = term)

  # Per-dataset Excel
  out_xlsx <- file.path(out_dir, paste0("GO_enrichment_", name, ".xlsx"))
  wb <- openxlsx::createWorkbook()

  openxlsx::addWorksheet(wb, "BestHit_Mapped")
  openxlsx::writeData(wb, "BestHit_Mapped", mapped)

  openxlsx::addWorksheet(wb, "Unmapped_QueryIDs")
  openxlsx::writeData(wb, "Unmapped_QueryIDs", data.frame(qid = unmapped))

  openxlsx::addWorksheet(wb, "PerProtein_GO")
  openxlsx::writeData(wb, "PerProtein_GO", per_protein_go)

  openxlsx::addWorksheet(wb, "GO_Enrichment")
  openxlsx::writeData(wb, "GO_Enrichment", go_df)

  openxlsx::saveWorkbook(wb, out_xlsx, overwrite = TRUE)
  cat("Saved:", out_xlsx, "\n")

  # Collect combined summary
  summary_runs[[name]] <- tibble::tibble(
    dataset = name,
    fasta_queries = n_fasta,
    blast_rows = nrow(bl),
    queries_with_hits = n_queries_with_hits,
    max_hits_per_query = max_hits_per_query,
    filtered_rows = nrow(bl_f),
    filtered_queries = dplyr::n_distinct(bl_f$qid),
    mapped_queries_besthit = dplyr::n_distinct(mapped$qid),
    unmapped_queries = length(unmapped),
    unique_targets_besthit = dplyr::n_distinct(mapped$sid),
    duplicated_targets_besthit = sum(duplicated(mapped$sid)),
    mapped_targets_with_GO = n_targets_with_go,
    enriched_GO_terms = nrow(go_df)
  )

  all_go_enrich[[name]] <- if (nrow(go_df) > 0) dplyr::mutate(go_df, dataset = name) else tibble::tibble(dataset = name)
  all_go_terms[[name]]  <- dplyr::mutate(per_protein_go, dataset = name)
  all_besthits[[name]]  <- dplyr::mutate(mapped, dataset = name)
}

# Combined summary workbook
combined_summary <- dplyr::bind_rows(summary_runs)
combined_go_enrich <- dplyr::bind_rows(all_go_enrich)
combined_per_protein_go <- dplyr::bind_rows(all_go_terms)
combined_besthits <- dplyr::bind_rows(all_besthits)

go_counts <- combined_per_protein_go %>%
  dplyr::filter(!is.na(GO), GO != "") %>%
  dplyr::distinct(dataset, qid, GO) %>%
  dplyr::count(dataset, GO, name = "n_queries_with_GO") %>%
  dplyr::arrange(dataset, dplyr::desc(n_queries_with_GO))

combined_xlsx <- file.path(out_dir, "Combined_GO_summary.xlsx")
wb2 <- openxlsx::createWorkbook()

openxlsx::addWorksheet(wb2, "Run_Summary")
openxlsx::writeData(wb2, "Run_Summary", combined_summary)

openxlsx::addWorksheet(wb2, "All_BestHits_Mapped")
openxlsx::writeData(wb2, "All_BestHits_Mapped", combined_besthits)

openxlsx::addWorksheet(wb2, "All_PerProtein_GO")
openxlsx::writeData(wb2, "All_PerProtein_GO", combined_per_protein_go)

openxlsx::addWorksheet(wb2, "All_GO_Enrichment")
openxlsx::writeData(wb2, "All_GO_Enrichment", combined_go_enrich)

openxlsx::addWorksheet(wb2, "GO_Counts_By_Dataset")
openxlsx::writeData(wb2, "GO_Counts_By_Dataset", go_counts)

openxlsx::saveWorkbook(wb2, combined_xlsx, overwrite = TRUE)

cat("\n=============================\n")
cat("DONE. Outputs in: ", normalizePath(out_dir), "\n", sep = "")
cat("Combined summary workbook: ", combined_xlsx, "\n", sep = "")

Leave a Reply

Your email address will not be published. Required fields are marked *