bioBakery Made Simple: A Docker-Centric Guide for Unbiased Metagenomic Profiling (Data_Tam_DNAseq_2026_wastewater_metagenomics)

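📝 Summary (translated from Chinese): This post summarizes the three deployment options for bioBakery (Docker, VM image, cloud) and records the complete commands and caveats for installing the databases and running the metagenomic analysis workflow with Docker. Because of compatibility problems between VirtualBox 7.x and the bioBakery VM image, Docker is the recommended first choice: it provides environment isolation, data persistence, and cross-platform reproducibility. The next step is to test an unbiased analysis workflow on wastewater metagenomic data in this environment.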

🔍 Quick Summary

bioBakery is a suite of tools developed by the Huttenhower Lab and Segata Lab for metagenomic community analysis. Its workflows integrate MetaPhlAn4 (taxonomic profiling) and HUMAnN3 (functional profiling), which makes it a good fit for unbiased metagenomics research.

There are three deployment options:

🐳 Docker (recommended, flexible, reproducible)
💿 Pre-built VM image (Vagrant + VirtualBox): ran into compatibility issues with VirtualBox 7.x
☁️ Cloud (AWS/Google Cloud via bioBakery images)

✅ Today’s focus: Docker setup — skip the VM headaches and get straight to analysis.

🐳 Part 1: Install & Run bioBakery with Docker (Step-by-Step)

✅ Prerequisites

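Docker installed and running (docker --version shows the client; docker info confirms the daemon responds)
~7 GB free disk space for the image, plus roughly 40–70 GB for the full wmgx databases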
Outbound HTTPS access (for database downloads)
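
The prerequisites above can be checked before pulling anything. The following is a minimal sketch, not part of bioBakery: the helper names has_free_gb and docker_ready are hypothetical, and the 80 GiB threshold in the usage comment is an assumption derived from the image and database sizes quoted in this post.

```shell
# has_free_gb DIR REQUIRED_GB
# Succeeds if the filesystem holding DIR has at least REQUIRED_GB GiB available.
has_free_gb() {
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# docker_ready
# Succeeds if the docker CLI is on PATH and the daemon answers.
docker_ready() {
  command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

# Example usage (threshold is an assumption: ~7 GB image + up to ~70 GB databases):
#   docker_ready || echo "Docker is not installed or not running"
#   has_free_gb /mnt/nvme1n1p1 80 || echo "Less than 80 GiB free"
```

Checking the mount point that will hold the databases, rather than the root filesystem, matters here because the databases live on a separate drive.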

🔽 Step 1: Pull the bioBakery Docker Image

docker pull biobakery/workflows:latest
# Verify
docker images | grep biobakery
# Expected: ~6.68 GB image

🗄️ Step 2: Prepare Local Database Directory

# Create persistent host directory for databases
mkdir -p /mnt/nvme1n1p1/biobakery_db

📦 Step 3: Install Databases Inside Container

docker run -it \
  -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
  biobakery/workflows:latest \
  /bin/bash

# Inside container:
biobakery_workflows_databases --install wmgx --location /biobakery_databases

biobakery_workflows_databases --available
#There are five available database sets each corresponding to a data processing workflow.
#wmgx: The full databases for the whole metagenome workflow
#wmgx_demo: The demo databases for the whole metagenome workflow
#wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
#16s_usearch: The full databases for the 16s workflow
#16s_dada2: The full databases for the dada2 workflow
#16s_its: The unite database for the its workflow
#isolate_assembly: The eggnog-mapper databases for the assembly workflow

# Optional: install the remaining database sets only if you need those workflows
biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
biobakery_workflows_databases --install 16s_its --location /biobakery_databases
biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases

⏱️ Note: The wmgx database set alone downloads ~40–70 GB (ChocoPhlAn, UniRef, utility mappings). Ensure a stable internet connection and sufficient disk space on the host directory.
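
The interactive session in Step 3 can also be collapsed into one non-interactive command per database set, which is easier to script or rerun after a failed download. This is a sketch only: install_db and the DRY_RUN switch are hypothetical names introduced here, while the image name, the biobakery_workflows_databases command, and its flags are the ones used above.

```shell
# install_db SET HOST_DIR
# Installs one database set in a throwaway container; files land in HOST_DIR.
# Set DRY_RUN=1 to print the docker command instead of running it.
install_db() {
  # Prefix the command with "echo" when DRY_RUN=1, otherwise run it directly.
  if [ "${DRY_RUN:-0}" = "1" ]; then runner="echo"; else runner=""; fi
  $runner docker run --rm \
    -v "$2:/biobakery_databases" \
    biobakery/workflows:latest \
    biobakery_workflows_databases --install "$1" \
    --location /biobakery_databases
}

# Example usage:
#   DRY_RUN=1 install_db wmgx_demo /mnt/nvme1n1p1/biobakery_db   # preview only
#   install_db wmgx /mnt/nvme1n1p1/biobakery_db                  # real download
```

Starting with wmgx_demo is a cheap way to confirm that the mount and network access work before committing to the full download.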

🧪 Step 4: Run Your First Metagenomics Workflow

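# The wmgx workflow is launched as "biobakery_workflows wmgx" and takes an input
# directory of FASTQ files, not a single file. Databases installed to a custom
# location are located through the BIOBAKERY_WORKFLOWS_DATABASES variable.
docker run -it \
  -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
  -v /home/jhuang/DATA/your_raw_data:/data \
  -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
  biobakery/workflows:latest \
  biobakery_workflows wmgx \
  --input /data \
  --output /data/output \
  --input-extension fastq   # match your files; gzipped FASTQ is the usual default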
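
Before launching a long run, it helps to confirm that the host directory mounted at /data actually contains sequencing files. The sketch below is an illustration only: count_fastq is a hypothetical helper, and it looks only at the top level of the directory for .fastq and .fastq.gz files.

```shell
# count_fastq DIR
# Prints the number of FASTQ files (plain or gzipped) directly inside DIR.
count_fastq() {
  find "$1" -maxdepth 1 -type f \
    \( -name '*.fastq' -o -name '*.fastq.gz' \) \
    | wc -l | tr -d ' '
}

# Example usage:
#   n=$(count_fastq /home/jhuang/DATA/your_raw_data)
#   [ "$n" -gt 0 ] || echo "No FASTQ files found; check the -v mount path"
```

An empty count usually points to a wrong host path in the -v option rather than a problem inside the container.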

🔑 Optional: Install USEARCH (for 16S workflows)

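# 1. Get a license from https://www.drive5.com/usearch/
# 2. Set USEARCH_URL to the download link for your licensed binary, then run
#    inside the container or on the host (drop sudo if you are already root):
sudo wget -O /usr/local/bin/usearch "$USEARCH_URL"
sudo chmod +x /usr/local/bin/usearch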

⚠️ Troubleshooting Notes (From Today’s Log)

| Issue | Solution |
| --- | --- |
| VirtualBox Guest Additions mismatch (v6.1.8 vs host v7.1) | Prefer Docker to avoid VM dependency conflicts |
| Vagrant box version conflicts | Use `vagrant box list` / `--force` to manage versions, but Docker is cleaner |
| Large database downloads failing | Ensure container has HTTPS access; use `-v` to persist downloads across sessions |
| Shared folder not mounting | Docker `-v` mounts are more reliable than Vagrant shared folders |
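
To see whether a database download really persisted on the host between sessions, a quick check of the mounted directory is enough. This is a rough sketch: db_dir_populated is a hypothetical helper, and a non-empty directory only shows that something was written, not that the download completed.

```shell
# db_dir_populated DIR
# Succeeds if DIR exists and contains at least one entry (including dotfiles).
db_dir_populated() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Example usage:
#   if db_dir_populated /mnt/nvme1n1p1/biobakery_db; then
#     echo "Databases found on host; no re-download needed"
#   else
#     echo "Database directory is empty or missing; run the install step"
#   fi
```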

📚 What’s Inside bioBakery? (Quick Reference)

| Tool | Purpose | Module |
| --- | --- | --- |
| MetaPhlAn4 | Taxonomic profiling | `wmgx` workflow |
| HUMAnN3 | Functional profiling (pathways, genes) | `wmgx` workflow |
| StrainPhlAn | Strain-level analysis | Optional module |
| PanPhlAn | Pangenome analysis | Optional module |
| q2-biobakery | QIIME2 plugin for 16S | Separate workflow |

🔗 Official Docs:

https://github.com/biobakery/biobakery/wiki

http://huttenhower.sph.harvard.edu/biobakery

🎯 Why Docker First?

✅ Reproducible: Same environment across machines
✅ Lightweight: No full VM overhead
✅ Flexible: Easy to mount local data & databases
✅ Future-proof: Avoid VirtualBox/Vagrant version lock-in
✅ Cloud-ready: Same container runs on local HPC or AWS Batch

📌 Next Steps (TODO)

Test the full wmgx pipeline on the wastewater metagenomics dataset
Benchmark runtime & resource usage
Document output interpretation (MetaPhlAn4 + HUMAnN3 results)
Explore cloud deployment option (AWS Batch + ECR)
Shelved: VM image option — revisit if Docker resource constraints arise

💡 Pro Tip: Always mount your database directory with -v to avoid re-downloading 70 GB every time!
