bioBakery Made Simple: A Docker-Centric Guide for Unbiased Metagenomic Profiling (Data_Tam_DNAseq_2026_wastewater_metagenomics)

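🇨🇳 Abstract (translated from Chinese): This post summarizes the three ways to deploy bioBakery (Docker, VM image, cloud) and records the full commands and caveats for installing the databases and running the metagenomic analysis workflow with Docker. Because of compatibility problems between VirtualBox 7.x and the bioBakery VM image, Docker is the recommended first choice: it provides environment isolation, data persistence, and cross-platform reproducibility. The next step is to test an unbiased analysis workflow on wastewater metagenomic data in this environment.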

🔍 Quick Summary

bioBakery is a suite of tools developed by the Huttenhower Lab and Segata Lab for metagenomic community analysis. Its workflows combine tools such as MetaPhlAn4 (taxonomic profiling) and HUMAnN3 (functional profiling), which makes it a good fit for unbiased (untargeted) metagenomics research.

There are three deployment options:

  1. 🐳 Docker (recommended, flexible, reproducible)
  2. 💿 Pre-built VM Image (Vagrant + VirtualBox) (encountered compatibility issues with VirtualBox 7.x)
  3. ☁️ Cloud (AWS/Google Cloud via bioBakery images)

Today’s focus: Docker setup — skip the VM headaches and get straight to analysis.


🐳 Part 1: Install & Run bioBakery with Docker (Step-by-Step)

✅ Prerequisites

  • Docker installed & running (docker --version)
  • ~7 GB free disk space for the image, plus ~40–70 GB for the full wmgx databases
  • Outbound HTTPS access (for database downloads)
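
The prerequisites above can be checked with a short preflight script. This is only a sketch: the helper names (free_gb, has_space), the DB_PARENT variable, and the 80 GB threshold (about 7 GB for the image plus the upper 70 GB database estimate, rounded up) are assumptions made for illustration, not part of bioBakery.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; names and threshold are illustrative assumptions.

# Print free space (whole GB) on the filesystem that holds directory $1.
free_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Return 0 if directory $1 has at least $2 GB free.
has_space() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: set DB_PARENT to the parent of your database directory,
# e.g. /mnt/nvme1n1p1. Defaults to the current directory.
DB_PARENT="${DB_PARENT:-.}"

if command -v docker >/dev/null 2>&1; then
  echo "docker found: $(docker --version)"
else
  echo "docker not found on PATH"
fi

if has_space "$DB_PARENT" 80; then
  echo "at least 80 GB free under $DB_PARENT"
else
  echo "less than 80 GB free under $DB_PARENT ($(free_gb "$DB_PARENT") GB)"
fi
```

The script only reports; it does not abort, so it is safe to run before any of the steps below.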

🔽 Step 1: Pull the bioBakery Docker Image

docker pull biobakery/workflows:latest
# Verify
docker images | grep biobakery
# Expected: ~6.68 GB image
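
Because the latest tag can move over time, it may be worth recording which image was actually pulled. The sketch below reads the repository digest with docker image inspect; the helper name image_digest and the IMAGE variable are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the repo digest of a locally pulled image.
IMAGE="${IMAGE:-biobakery/workflows:latest}"

image_digest() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not available" >&2
    return 1
  fi
  # RepoDigests holds name@sha256:... entries for images pulled from a registry.
  docker image inspect --format '{{index .RepoDigests 0}}' "$1"
}

if digest="$(image_digest "$IMAGE")"; then
  echo "pinned reference: $digest"
else
  echo "could not resolve a digest for $IMAGE"
fi
```

The printed name@sha256 reference can be used in place of the tag in later docker run commands to get the same image on another machine.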

🗄️ Step 2: Prepare Local Database Directory

# Create persistent host directory for databases
mkdir -p /mnt/nvme1n1p1/biobakery_db

📦 Step 3: Install Databases Inside Container

docker run -it \
  -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
  biobakery/workflows:latest \
  /bin/bash

# Inside container:
biobakery_workflows_databases --install wmgx --location /biobakery_databases

biobakery_workflows_databases --available
#There are five available database sets each corresponding to a data processing workflow.
#wmgx: The full databases for the whole metagenome workflow
#wmgx_demo: The demo databases for the whole metagenome workflow
#wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
#16s_usearch: The full databases for the 16s workflow
#16s_dada2: The full databases for the dada2 workflow
#16s_its: The unite database for the its workflow
#isolate_assembly: The eggnog-mapper databases for the assembly workflow

# Optional: install further database sets only if you need the corresponding workflow
biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
biobakery_workflows_databases --install 16s_its --location /biobakery_databases
biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases

⏱️ Note: Downloads ~40–70 GB (ChocoPhlAn, UniRef, utility mappings) for the wmgx database. Ensure stable internet & sufficient space.
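
The interactive session in Step 3 can also be run as a one-shot command per database set. The following is a minimal sketch that reuses the image, mount, and biobakery_workflows_databases arguments from above; the install_db wrapper and the DB_DIR, IMAGE, and RUN variables are assumptions introduced here. By default it only prints the command so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the Step 3 install command.
DB_DIR="${DB_DIR:-/mnt/nvme1n1p1/biobakery_db}"
IMAGE="${IMAGE:-biobakery/workflows:latest}"

# Build the docker command for database set $1.
# Prints it by default; executes it only when RUN=1.
install_db() {
  set_name="$1"
  set -- docker run --rm \
    -v "$DB_DIR:/biobakery_databases" \
    "$IMAGE" \
    biobakery_workflows_databases --install "$set_name" --location /biobakery_databases
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

install_db wmgx
```

Running the script with RUN=1 starts the actual download; --rm removes the container afterwards, while the databases stay in the mounted host directory.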

🧪 Step 4: Run Your First Metagenomics Workflow

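# Note: upstream documents this workflow as "biobakery_workflows wmgx" (called biobakery_wmgx elsewhere in this post).
# --input expects a directory of FASTQ files, so the mounted /data directory is passed rather than a single file.
# Databases installed to a custom location are located via the BIOBAKERY_WORKFLOWS_DATABASES environment variable.
docker run -it \
  -v /mnt/nvme1n1p1/biobakery_db:/biobakery_databases \
  -v /home/jhuang/DATA/your_raw_data:/data \
  -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
  biobakery/workflows:latest \
  biobakery_workflows wmgx \
  --input /data \
  --output /data/output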
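
Before launching the container, it can help to confirm that the host directory mounted at /data actually contains read files. The sketch below counts FASTQ files at the top level of a directory; the helper name count_fastq, the RAW_DIR variable, and the list of extensions are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: count FASTQ files directly inside directory $1.
count_fastq() {
  find "$1" -maxdepth 1 -type f \
    \( -name '*.fastq' -o -name '*.fastq.gz' -o -name '*.fq' -o -name '*.fq.gz' \) \
    | wc -l | tr -d ' '
}

# Assumption: RAW_DIR is the host directory you mount at /data.
RAW_DIR="${RAW_DIR:-.}"
n="$(count_fastq "$RAW_DIR")"
if [ "$n" -gt 0 ]; then
  echo "$n FASTQ file(s) found in $RAW_DIR"
else
  echo "no FASTQ files found in $RAW_DIR"
fi
```

An empty result usually means the wrong host path was given to -v, which is cheaper to catch here than after the workflow has started.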

🔑 Optional: Install USEARCH (for 16S workflows)

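# 1. Get license from https://www.drive5.com/usearch/
# 2. Set USEARCH_URL to the download link you receive with the license
# 3. Inside container or on host (drop sudo if you are already root):
sudo wget -O /usr/local/bin/usearch "$USEARCH_URL"
sudo chmod +x /usr/local/bin/usearch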

⚠️ Troubleshooting Notes (From Today’s Log)

Issue | Solution
VirtualBox Guest Additions mismatch (v6.1.8 vs host v7.1) | Prefer Docker to avoid VM dependency conflicts
Vagrant box version conflicts | Use vagrant box list / --force to manage versions, but Docker is cleaner
Large database downloads failing | Ensure container has HTTPS access; use -v to persist downloads across sessions
Shared folder not mounting | Docker -v mounts are more reliable than Vagrant shared folders
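
Whether a -v mount really persisted the downloads can be checked from the host before reinstalling anything. This is a minimal sketch; the helper name db_populated and the DB_DIR variable are assumptions, and the check only tests that the directory is non-empty, not that the databases are complete.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed if directory $1 exists and has at least one entry.
db_populated() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

DB_DIR="${DB_DIR:-/mnt/nvme1n1p1/biobakery_db}"
if db_populated "$DB_DIR"; then
  echo "$DB_DIR already contains files; skipping install"
else
  echo "$DB_DIR is missing or empty; databases need to be installed"
fi
```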

📚 What’s Inside bioBakery? (Quick Reference)

Tool | Purpose | Module
MetaPhlAn4 | Taxonomic profiling | biobakery_wmgx
HUMAnN3 | Functional profiling (pathways, genes) | biobakery_wmgx
StrainPhlAn | Strain-level analysis | Optional module
PanPhlAn | Pangenome analysis | Optional module
q2-biobakery | QIIME2 plugin for 16S | Separate workflow

🔗 Official Docs:


🎯 Why Docker First?

  ✅ Reproducible: Same environment across machines
  ✅ Lightweight: No full VM overhead
  ✅ Flexible: Easy to mount local data & databases
  ✅ Future-proof: Avoid VirtualBox/Vagrant version lock-in
  ✅ Cloud-ready: Same container runs on local HPC or AWS Batch


📌 Next Steps (TODO)

  • Test full biobakery_wmgx pipeline on wastewater metagenomics dataset
  • Benchmark runtime & resource usage
  • Document output interpretation (MetaPhlAn4 + HUMAnN3 results)
  • Explore cloud deployment option (AWS Batch + ECR)
  • Shelved: VM image option — revisit if Docker resource constraints arise

💡 Pro Tip: Always mount your database directory with -v to avoid re-downloading 40–70 GB of databases every time!
