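🇨🇳 Summary: This post summarizes the three ways to deploy bioBakery (Docker / VM image / cloud) and records the full commands and caveats for installing the databases and running the metagenomic analysis workflow with Docker. Because of compatibility problems between VirtualBox 7.x and the bioBakery VM image, Docker is the recommended first choice: it gives environment isolation, data persistence, and cross-platform reproducibility. The next step is to test an unbiased analysis workflow on wastewater metagenomic data in this environment.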
🔍 Quick Summary
bioBakery is a suite of tools developed by the Huttenhower Lab and Segata Lab for metagenomic community analysis. It bundles tools such as MetaPhlAn4 (taxonomic profiling) and HUMAnN3 (functional profiling), which makes it a good fit for unbiased metagenomics research.
There are three deployment options:
- 🐳 Docker (recommended, flexible, reproducible)
- 💿 Pre-built VM Image (Vagrant + VirtualBox) (encountered compatibility issues with VirtualBox 7.x)
- ☁️ Cloud (AWS/Google Cloud via bioBakery images)
✅ Today’s focus: Docker setup. Skip the VM headaches and get straight to analysis.
🐳 Part 1: Install & Run bioBakery with Docker (Step-by-Step)
✅ Prerequisites
- Docker installed & running (docker --version)
- ~7 GB free disk space for the image, plus roughly 40–70 GB more for the full wmgx databases
- Outbound HTTPS access (for database downloads)
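The prerequisites above can be checked up front with a few small helpers. This is only a sketch: the function names are made up for illustration, /var/lib/docker is assumed to be Docker's data directory (the Linux default), and the thresholds simply mirror the sizes quoted in these notes.

```shell
# Preflight helpers (a sketch; function names and paths are illustrative).

# free_gb DIR: print the free space, in whole GB, on the filesystem holding DIR.
free_gb() {
  # df -Pk prints one POSIX-format line per filesystem in 1 KiB blocks;
  # field 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_enough_space DIR GB: succeed if DIR has at least GB gigabytes free.
has_enough_space() {
  local avail
  avail=$(free_gb "$1")
  [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# docker_ready: succeed if the docker CLI is on PATH and the daemon responds.
docker_ready() {
  command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

# Example usage (uncomment to run):
# docker_ready || echo "Docker is not installed or not running" >&2
# has_enough_space /var/lib/docker 7 || echo "Need ~7 GB for the image" >&2
# has_enough_space /mnt/nvme1n1p1 70 || echo "Need up to ~70 GB for the wmgx databases" >&2
```

Checking the image location and the database location separately matters when they sit on different disks, as they do in this setup.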
🔽 Step 1: Pull the bioBakery Docker Image
docker pull biobakery/workflows:latest
# Verify
docker images | grep biobakery
# Expected: ~6.68 GB image
🗄️ Step 2: Prepare Local Database Directory
# Create persistent host directory for databases
mkdir -p /mnt/nvme1n1p1/biobakery_db
📦 Step 3: Install Databases Inside Container
docker run -it \
-v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
biobakery/workflows:latest \
/bin/bash
# Inside container: install the full whole-metagenome (wmgx) databases
biobakery_workflows_databases --install wmgx --location /biobakery_databases
# List the database sets that can be installed
biobakery_workflows_databases --available
# Logged output (the header says five, but seven sets are listed):
# There are five available database sets each corresponding to a data processing workflow.
# wmgx: The full databases for the whole metagenome workflow
# wmgx_demo: The demo databases for the whole metagenome workflow
# wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
# 16s_usearch: The full databases for the 16s workflow
# 16s_dada2: The full databases for the dada2 workflow
# 16s_its: The unite database for the its workflow
# isolate_assembly: The eggnog-mapper databases for the assembly workflow
# Optional: install the other sets only if you plan to run those workflows
biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
biobakery_workflows_databases --install 16s_its --location /biobakery_databases
biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases
⏱️ Note: Downloads ~40–70 GB (ChocoPhlAn, UniRef, utility mappings) for the wmgx database. Ensure stable internet & sufficient space.
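The interactive session above can also be collapsed into a one-shot command per database set, which is handy for scripting. A minimal sketch, assuming the installer runs fine without a TTY; DB_DIR, IMAGE, install_db, and the DRY_RUN switch are names introduced here for illustration, not part of bioBakery.

```shell
# Sketch: install one database set without opening an interactive shell.
# DB_DIR, IMAGE and DRY_RUN are illustrative assumptions.
DB_DIR="${DB_DIR:-/mnt/nvme1n1p1/biobakery_db}"
IMAGE="${IMAGE:-biobakery/workflows:latest}"

# install_db SET: run the installer for SET (e.g. wmgx_demo) in a throwaway
# container. With DRY_RUN=1 the docker command is printed instead of executed.
install_db() {
  # Rebuild the positional parameters as the full docker command line;
  # "$1" (the set name) is expanded before the parameters are replaced.
  set -- docker run --rm \
    -v "$DB_DIR:/biobakery_databases" \
    "$IMAGE" \
    biobakery_workflows_databases --install "$1" --location /biobakery_databases
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example usage (uncomment to run):
# DRY_RUN=1 install_db wmgx_demo   # print the command first
# install_db wmgx_demo             # then run it for real
```

Trying the small wmgx_demo set first is a cheap way to confirm the mount and network access before committing to the full download.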
🧪 Step 4: Run Your First Metagenomics Workflow
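# The upstream docs invoke this workflow as "biobakery_workflows wmgx".
# --input is the directory holding the reads; --input-extension fastq covers uncompressed files.
# A custom database location is passed via the BIOBAKERY_WORKFLOWS_DATABASES environment variable.
# Check the available options with: biobakery_workflows wmgx --help
docker run -it \
  -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
  -v /home/jhuang/DATA/your_raw_data:/data \
  -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
  biobakery/workflows:latest \
  biobakery_workflows wmgx \
    --input /data \
    --output /data/output \
    --input-extension fastq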
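Before launching a long run, it can help to confirm that the host folder mounted at /data actually contains reads, since a typo in the -v path just produces an empty directory inside the container. A small sketch; count_reads and require_reads are hypothetical helper names, and the fastq extension is an assumption about how the files are named.

```shell
# count_reads DIR EXT: print how many regular files directly under DIR end in .EXT
count_reads() {
  find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# require_reads DIR EXT: report the count, or fail with a message if it is zero.
require_reads() {
  local n
  n=$(count_reads "$1" "$2")
  if [ "$n" -eq 0 ]; then
    echo "No *.$2 files found in $1" >&2
    return 1
  fi
  echo "Found $n *.$2 file(s) in $1"
}

# Example usage (uncomment to run):
# require_reads /home/jhuang/DATA/your_raw_data fastq || exit 1
```

Running the check on the host path, not the container path, catches the mistake before any container starts.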
🔑 Optional: Install USEARCH (for 16S workflows)
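# 1. Get license from https://www.drive5.com/usearch/
# 2. Set USEARCH_URL to the download link for your licensed binary
# 3. Inside container or on host (drop sudo if you are already root, e.g. inside the container):
sudo wget -O /usr/local/bin/usearch "$USEARCH_URL"
sudo chmod +x /usr/local/bin/usearch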
⚠️ Troubleshooting Notes (From Today’s Log)
| Issue | Solution |
|---|---|
| VirtualBox Guest Additions mismatch (v6.1.8 vs host v7.1) | Prefer Docker to avoid VM dependency conflicts |
| Vagrant box version conflicts | Use vagrant box list / --force to manage versions, but Docker is cleaner |
| Large database downloads failing | Ensure container has HTTPS access; use -v to persist downloads across sessions |
| Shared folder not mounting | Docker -v mounts are more reliable than Vagrant shared folders |
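For the failing-download case in the table, a generic retry wrapper is one option. This is a sketch rather than anything bioBakery provides: retry is a hypothetical helper, and whether the installer resumes a partial download or starts over is not something these notes verify.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns CMD's last exit status.
retry() {
  local max=$1 delay=$2 attempt=1 status
  shift 2
  while :; do
    "$@"
    status=$?
    [ "$status" -eq 0 ] && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempt(s)" >&2
      return "$status"
    fi
    echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example usage (uncomment to run): up to 3 tries, 60 s apart.
# retry 3 60 docker run --rm \
#   -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
#   biobakery/workflows:latest \
#   biobakery_workflows_databases --install wmgx --location /biobakery_databases
```

Because the database directory is mounted with -v, anything a failed attempt did manage to write stays on the host between tries.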
📚 What’s Inside bioBakery? (Quick Reference)
| Tool | Purpose | Module |
|---|---|---|
| MetaPhlAn4 | Taxonomic profiling | wmgx workflow |
| HUMAnN3 | Functional profiling (pathways, genes) | wmgx workflow |
| StrainPhlAn | Strain-level analysis | Optional module |
| PanPhlAn | Pangenome analysis | Optional module |
| q2-biobakery | QIIME2 plugin for 16S | Separate workflow |
🔗 Official Docs:
🎯 Why Docker First?
- ✅ Reproducible: Same environment across machines
- ✅ Lightweight: No full VM overhead
- ✅ Flexible: Easy to mount local data & databases
- ✅ Future-proof: Avoid VirtualBox/Vagrant version lock-in
- ✅ Cloud-ready: Same container runs on local HPC or AWS Batch
📌 Next Steps (TODO)
- Test the full wmgx pipeline on a wastewater metagenomics dataset
- Benchmark runtime & resource usage
- Document output interpretation (MetaPhlAn4 + HUMAnN3 results)
- Explore cloud deployment option (AWS Batch + ECR)
- Shelved: VM image option — revisit if Docker resource constraints arise
💡 Pro Tip: Always mount your database directory with -v to avoid re-downloading 70 GB every time!