gene_x 0 like s 43 view s
Tags: pipeline
https://www.biobam.com/functional-analysis/
Assign KEGG and GO Terms (see diagram above)
Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.
Option 1 (GO Terms): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping
or blast2go_cli_v1.5.1 (NOT_USED)
#https://help.biobam.com/space/BCD/2250407989/Installation
# blast2go_pipeline.sh
#!/bin/bash
# -------------------------------
# BLAST2GO Automation Pipeline
# -------------------------------
# Steps:
# 1. Run BLAST (NCBI BLAST+)
# 2. Perform GO Mapping (Blast2GO CLI)
# 3. Perform GO Annotation
# 4. Export results to TSV format
# -------------------------------
# CONFIGURATION
BLAST_DB="/path/to/blast/db" # Path to BLAST database
OUTPUT_DIR="/path/to/output" # Directory for storing results
THREADS=8 # Number of CPU threads
# Ensure output directory exists
mkdir -p "$OUTPUT_DIR"
# Iterate over all FASTA files in the input directory
for FASTA in /path/to/input/*.fasta; do
BASENAME=$(basename "$FASTA" .fasta)
echo "Processing: $BASENAME"
# Step 1: Run BLAST (Change 'blastp' to 'blastx' for nucleotide queries)
echo "Running BLAST..."
blastp -query "$FASTA" -db "$BLAST_DB" -out "$OUTPUT_DIR/${BASENAME}_blast.xml" \
-evalue 1e-5 -outfmt 5 -num_threads "$THREADS"
# Step 2: Perform GO Mapping
echo "Running GO Mapping..."
./blast2go_cli -mapping -in "$OUTPUT_DIR/${BASENAME}_blast.xml" \
-out "$OUTPUT_DIR/${BASENAME}_mapping.b2g"
# Step 3: Perform GO Annotation
echo "Running GO Annotation..."
./blast2go_cli -annotation -in "$OUTPUT_DIR/${BASENAME}_mapping.b2g" \
-out "$OUTPUT_DIR/${BASENAME}_annotation.b2g" \
-evalue 1e-6 -annex -goslim
# Step 4: Export to TSV
echo "Exporting results to TSV..."
./blast2go_cli -export -in "$OUTPUT_DIR/${BASENAME}_annotation.b2g" \
-out "$OUTPUT_DIR/${BASENAME}_annotation.tsv" -format TSV
echo "Processing of $BASENAME completed!"
done
echo "✅ All jobs completed successfully!"
Option 2 (GO Terms): Interpro based protein families / domains --> Button interpro * Button 'interpro' (Tags: Interproscaned?, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names Enzyme Codes)
#TODO: merge InterPro GO IDs to GO IDs and generate final GO IDs, then perform GO enrichment!
Option 3 (KEGG Terms): EggNog based on orthology and phylogenies (NOT_USED)
EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
Install EggNOG-mapper:
mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda #eggnog-mapper_2.1.12
mamba activate eggnog_env
Run annotation:
#diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
#NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
#----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
#----> 470.IX87_14445:
* 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
* IX87_14445 would refer to a specific gene or protein within that genome.
Extract KEGG KO IDs from annotations.emapper.annotations.
Option 4 (NOT_USED): RFAM for non-colding RNA
Option 5 (NOT_USED): PSORTb for subcellular localizations
Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)
Find the Closest KEGG Organism Code (TODO: find the closely related organism)
Since your species isn't directly in KEGG, use a closely related organism.
library(clusterProfiler)
library(KEGGREST)
kegg_organisms <- keggList("organism")
Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
# Search for Acinetobacter in the list
grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
# Gammaproteobacteria
#Extract KO IDs from the eggnog results for "Acinetobacter baumannii strain ATCC 19606"
Find the Closest KEGG Organism for a Non-Model Species
If your organism is not in KEGG, search for the closest relative:
grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE) # Example search
For KEGG pathway enrichment in non-model species, use "ko" instead of a species code:
kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko") # "ko" = KEGG Orthology
Perform KEGG and GO Enrichment in R (TODO: ids in gene_list_kegg_up are not recognized!)
# Load required libraries
library(openxlsx) # For Excel file handling
library(dplyr) # For data manipulation
library(clusterProfiler) # For KEGG and GO enrichment analysis
#library(org.Hs.eg.db) # Replace with appropriate organism database
# Load the results from the AUM_vs_MHB-all.csv
res <- read.csv("AUM_vs_MHB-all.csv")
# Replace empty GeneName with modified GeneID
res$GeneName <- ifelse(
res$GeneName == "" | is.na(res$GeneName),
gsub("gene-", "", res$GeneID),
res$GeneName
)
# Remove duplicated genes by selecting the gene with the smallest padj
duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
res <- res %>%
group_by(GeneName) %>%
slice_min(padj, with_ties = FALSE) %>%
ungroup()
res <- as.data.frame(res)
# Sort res first by padj (ascending) and then by log2FoldChange (descending)
res <- res[order(res$padj, -res$log2FoldChange), ]
# Read eggnog annotations
eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
# Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
res$GeneID <- gsub("gene-", "", res$GeneID)
# Merge eggnog data with res based on GeneID (or GeneName)
res <- res %>%
left_join(eggnog_data, by = c("GeneID" = "query"))
# Filter up-regulated genes: log2FoldChange > 2 and padj < 1e-2
up_regulated <- res[res$log2FoldChange > 2 & res$padj < 1e-2, ]
# Filter down-regulated genes: log2FoldChange < -2 and padj < 1e-2
down_regulated <- res[res$log2FoldChange < -2 & res$padj < 1e-2, ]
# Create a new workbook
wb <- createWorkbook()
# Add the complete dataset as the first sheet (with annotations)
addWorksheet(wb, "Complete_Data")
writeData(wb, "Complete_Data", res)
# Add the up-regulated genes as the second sheet (with annotations)
addWorksheet(wb, "Up_Regulated")
writeData(wb, "Up_Regulated", up_regulated)
# Add the down-regulated genes as the third sheet (with annotations)
addWorksheet(wb, "Down_Regulated")
writeData(wb, "Down_Regulated", down_regulated)
# Save the workbook to a file
saveWorkbook(wb, "Gene_Expression_with_Annotations_AUM_vs_MHB.xlsx", overwrite = TRUE)
# Set the 'GeneName' column as row.names
rownames(res) <- res$GeneName
# Drop the 'GeneName' column since it's now the row names
res$GeneName <- NULL
# ---- Perform KEGG enrichment analysis ----
gene_list_kegg_up <- up_regulated$KEGG_ko
gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
# Perform KEGG enrichment analysis
gene_list_kegg_down <- down_regulated$KEGG_ko
gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
# Create a new workbook
wb <- createWorkbook()
# Save enrichment results to the workbook
addWorksheet(wb, "KEGG_Enrichment_Up")
writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up))
# Save enrichment results to the workbook
addWorksheet(wb, "KEGG_Enrichment_Down")
writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down))
# ---- Perform GO enrichment analysis (TODO: extract the merged GO IDs from 'Blast2GO 5 Basic' and adapt the code below!)----
#http://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html
#If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.
# Perform GO enrichment analysis (Using enricher)
gene_list_go_up <- up_regulated$GOs
go_enrichment_up <- enrichGO(gene = gene_list_go_up, OrgDb, ont = "BP") #keyType = "ENSEMBL", organism = 'abn',
# Perform GO enrichment analysis (Using clusterProfiler)
#gene_list_go_down <- down_regulated$GOs
#go_enrichment_down <- enrichGO(gene = gene_list_go_down, organism = 'abn', ont = "BP")
#addWorksheet(wb, "GO_Enrichment_Up")
#writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
#addWorksheet(wb, "GO_Enrichment_Down")
#writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
# Save the workbook with enrichment results
saveWorkbook(wb, "KEGG_and_GO_Enrichments_AUM_vs_MHB.xlsx", overwrite = TRUE)
#To adapt your code for GO enrichment analysis using the up-regulated genes (149 genes) as the test set and all genes in res (3646 genes) as the background, we need to:
* Extract the GeneIDs from up_regulated and use them as gene_list (the test set).
* Extract the GeneIDs from res and use them as background_genes (the universe).
* Create a TERM2GENE dataframe by extracting GO terms from res.
* Run the enricher() function using these updated inputs.
# Define gene list (up-regulated genes)
gene_list_go_up <- up_regulated$GeneID # Extract the 149 up-regulated genes
# Define background gene set (all genes in res)
background_genes <- res$GeneID # Extract the 3646 background genes
# Prepare GO annotation data from res
go_annotation <- res[, c("GeneID", "GOs")] # Extract relevant columns
go_annotation <- go_annotation %>%
tidyr::separate_rows(GOs, sep = ",") # Split multiple GO terms into separate rows
# Perform GO enrichment analysis
go_enrichment_up <- enricher(
gene = gene_list_go_up, # Up-regulated genes
TERM2GENE = go_annotation, # Custom GO annotation
pvalueCutoff = 0.05, # Significance threshold
universe = background_genes # Define the background gene set
)
# View results
print(go_enrichment_up)
GO enrichment using TERM2GENE for enricher in clusterProfiler (Example code, the running code for current perform see above)
To perform GO enrichment analysis using the enricher() function with a custom GO annotation data (in the form of a data.frame), follow these steps. The user can supply a gene list and a custom GO annotation in the form of a data.frame, where the first column is the gene ID and the second column is the GO ID.
Here’s how to structure the R code for both over-representation testing (using enricher()) and gene set enrichment analysis (using GSEA()):
# Install and load the necessary libraries
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db") # For human gene IDs (optional)
library(clusterProfiler)
library(org.Hs.eg.db) # Only if using human gene IDs (e.g., Entrez)
#Prepare Custom GO Annotation Data
#Assume that you have a custom GO annotation dataset in the form of a data.frame, where:
#The first column is gene IDs (could be Entrez, gene symbols, etc.).
#The second column is GO terms (GO IDs).
Example data.frame:
# Example custom GO annotation data (replace with your own data)
go_annotation <- data.frame(
geneID = c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"),
GOID = c("GO:0001234", "GO:0005678", "GO:0002345", "GO:0009876", "GO:0005432")
)
# View the custom GO annotation
head(go_annotation)
3. Perform GO Enrichment Using enricher()
With your gene list (e.g., from differential expression analysis) and the custom GO annotation, you can now run the over-representation test using the enricher() function.
# Example gene list (e.g., gene symbols or Entrez IDs)
gene_list <- c("GeneA", "GeneB", "GeneC", "GeneX", "GeneY") # Replace with your own gene IDs
# Perform enrichment analysis with custom GO annotation
go_enrichment_result <- enricher(
gene = gene_list, # The gene list
TERM2GENE = go_annotation, # The custom GO annotation data
pvalueCutoff = 0.05 # Adjust p-value threshold
universe = background_genes # Define a custom background gene set
)
# View the summary of the results
summary(go_enrichment_result)
# Visualize the results using a dotplot
dotplot(go_enrichment_result)
4. Perform Gene Set Enrichment Analysis (GSEA) Using GSEA()
In addition to the over-representation test, you can also perform GSEA to determine whether a predefined set of genes (e.g., in the context of a ranked list of genes) is significantly enriched in certain GO categories.
# Example ranked gene list (e.g., from differential expression with fold change)
ranked_gene_list <- c("GeneA" = 2.5, "GeneB" = 1.8, "GeneC" = 0.5, "GeneD" = -1.2, "GeneE" = -2.0)
# Perform GSEA with custom GO annotation
gsea_result <- GSEA(
geneList = ranked_gene_list, # Ranked gene list (e.g., logFC or other metric)
TERM2GENE = go_annotation, # The custom GO annotation data
pvalueCutoff = 0.05 # Adjust p-value threshold
)
# View the summary of the GSEA results
summary(gsea_result)
# Visualize the GSEA results using a dotplot
dotplot(gsea_result)
5. Save Results to CSV (Optional)
If you want to save the results to a file (e.g., CSV), you can use the write.csv() function.
# Save the GO enrichment results to a CSV file
write.csv(go_enrichment_result, "GO_enrichment_results.csv")
# Save the GSEA results to a CSV file
write.csv(gsea_result, "GSEA_results.csv")
Explanation of Key Steps:
Gene List: You need a list of genes that you want to test for enrichment. This can be a list of genes from differential expression analysis or a custom set of genes.
GO Annotation (TERM2GENE): This is a data.frame containing the mapping of gene IDs to GO terms (or categories). You provide this as an argument to both enricher() and GSEA().
enricher() Function: This function performs the over-representation analysis (ORA), testing whether specific GO terms are significantly enriched in your gene list compared to the background.
GSEA() Function: This function performs Gene Set Enrichment Analysis, testing whether the predefined gene sets (in this case, GO terms) are enriched in a ranked list of genes (e.g., based on log-fold change).
Additional Parameters:
pvalueCutoff: The threshold for statistical significance in both enricher() and GSEA(). Genes with a p-value below this cutoff are considered significantly enriched.
minGSSize and maxGSSize: Define the minimum and maximum size of the gene sets (GO terms) to be included in the analysis.
qvalueCutoff: The threshold for adjusted p-values (FDR).
Conclusion:
This guide demonstrates how to perform GO enrichment analysis using a custom GO annotation data (data.frame) with the enricher() and GSEA() functions from the clusterProfiler package. Whether you're performing over-representation testing or gene set enrichment analysis, these functions allow you to explore which biological processes or pathways are enriched in your gene list.
点赞本文的读者
还没有人对此文章表态
没有评论
RNA-seq Tam on Acinetobacter baumannii strain ATCC 19606 CP059040.1 (Data_Tam_RNAseq_2024)
Enhanced Visualization of Gene Presence for the Selected Genes in Bongarts_S.epidermidis_HDRNA
© 2023 XGenes.com Impressum