Metagenomics analysis using Docker (Data_Tam_Metagenomics_2026*)

https://huttenhower.sph.harvard.edu/biobakery_workflows/

Whole metagenome shotgun sequencing data can be processed with two workflow commands. The first runs read-level quality control (KneadData), taxonomic profiling (MetaPhlAn), functional profiling (HUMAnN), and strain profiling (StrainPhlAn); the second generates a report with publication-ready figures.

Prepare the toy datasets

 # Install if needed: conda install -c bioconda seqtk
 cd ~/DATA/Data_Tam_Metagenomics_2026_pre_vs_post_treatment/X101SC25123808-Z01-J002/01.RawData/A
 seqtk sample -s100 A_1.fq.gz 0.01 | gzip > ../J002_A_1.fastq.gz
 seqtk sample -s100 A_2.fq.gz 0.01 | gzip > ../J002_A_2.fastq.gz
 cd ../B
 seqtk sample -s100 B_1.fq.gz 0.01 | gzip > ../J002_B_1.fastq.gz
 seqtk sample -s100 B_2.fq.gz 0.01 | gzip > ../J002_B_2.fastq.gz
 # The subsampled files were written one level up (01.RawData), so move them from there
 mv ../J002*.fastq.gz /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test/

 seqtk sample -s100 B_1.fastq.gz 0.01 | gzip > ../A_test/B_1.fastq.gz
 seqtk sample -s100 B_2.fastq.gz 0.01 | gzip > ../A_test/B_2.fastq.gz
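
Because both mates are sampled with the same seed (-s100), the subsampled pairs should stay in sync. A quick sanity check, sketched below, compares the read counts of the two files. The helper names count_reads and pairs_in_sync are made up for this note and are not part of seqtk or bioBakery.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of seqtk or bioBakery).

# Count reads in a gzipped FASTQ file (4 lines per record).
count_reads() {
    local n
    n=$(gzip -dc "$1" | wc -l)
    echo $(( n / 4 ))
}

# Return 0 if both mates hold the same number of reads, 1 otherwise.
pairs_in_sync() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

For example, `pairs_in_sync ../J002_A_1.fastq.gz ../J002_A_2.fastq.gz && echo OK`. Equal counts are a necessary check only; they do not prove that the read names match.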

Pull the image (note: latest is actually an old build from 2019-2021)
```
 docker pull biobakery/workflows:latest
```

Verify the version inside the container

 docker run --rm biobakery/workflows:latest biobakery_workflows --version

Install Databases Inside Container

 # Create persistent host directory for databases
 mkdir -p /mnt/nvme4n1p1/biobakery_db

 docker run -it \
 -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
 biobakery/workflows:latest \
 /bin/bash

 # Inside container:
 biobakery_workflows_databases --available
 #There are five available database sets each corresponding to a data processing workflow.
 #wmgx: The full databases for the whole metagenome workflow
 #wmgx_demo: The demo databases for the whole metagenome workflow
 #wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
 #16s_usearch: The full databases for the 16s workflow
 #16s_dada2: The full databases for the dada2 workflow
 #16s_its: The unite database for the its workflow
 #isolate_assembly: The eggnog-mapper databases for the assembly workflow

 biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
 biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
 biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
 biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
 biobakery_workflows_databases --install 16s_its --location /biobakery_databases
 biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases

 biobakery_workflows_databases --install wmgx --location /biobakery_databases
 # ---- DOWNLOAD_LOG ----
 1. INSTALLING humann utility mapping database
     Creating directory to install database: /biobakery_databases/humann

     Creating subdirectory to INSTALL database: /biobakery_databases/humann/utility_mapping
     Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
     Downloading file of size: 2.55 GB
     2.55 GB 100.00 %   4.70 MB/sec  0 min -0 sec
     Extracting: /biobakery_databases/humann/full_mapping_v201901.tar.gz
     Database installed: /biobakery_databases/humann/utility_mapping
     HUMAnN configuration file updated: database_folders : utility_mapping = /biobakery_databases/humann/utility_mapping

     Generating strainphlan fasta database
     # NOTE (open question): does this step generate the directories strainphlan_db_reference and strainphlan_db_markers from humann/utility_mapping? That would contradict the assumption that the markers are produced by:
     #   bowtie2-inspect ${DB_DIR}/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503 > ${DB_DIR}/strainphlan_db_markers/all_markers.fasta

 2. INSTALLING humann nucleotide and protein databases

     Creating subdirectory to INSTALL database: /biobakery_databases/humann/chocophlan
     Download URL: http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
     Downloading file of size: 15.30 GB
     15.30 GB 100.00 %   6.75 MB/sec  0 min -0 sec
     Extracting: /biobakery_databases/humann/full_chocophlan.v296_201901.tar.gz
     Database installed: /biobakery_databases/humann/chocophlan
     HUMAnN configuration file updated: database_folders : nucleotide = /biobakery_databases/humann/chocophlan

     Creating subdirectory to INSTALL database: /biobakery_databases/humann/uniref
     Download URL: http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
     Downloading file of size: 19.31 GB
     19.31 GB 100.00 %   7.22 MB/sec  0 min -0 sec
     Extracting: /biobakery_databases/humann/uniref90_annotated_v201901.tar.gz
     Database installed: /biobakery_databases/humann/uniref
     HUMAnN configuration file updated: database_folders : protein = /biobakery_databases/humann/uniref

 3. INSTALLING hg kneaddata database

     Creating directory to install database: /biobakery_databases/kneaddata_db_human_genome
     Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
     Downloading file of size: 3.48 GB
     3.48 GB 100.00 %   7.17 MB/sec  0 min -0 sec
     Extracting: /biobakery_databases/kneaddata_db_human_genome/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
     Database installed: /biobakery_databases/kneaddata_db_human_genome

 A custom install location was selected. Please set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the install location.
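
The four archives in the log above add up to about 40.6 GB of downloads (2.55 + 15.30 + 19.31 + 3.48 GB), and the extracted databases need additional room on top of that. A minimal sketch for checking free space on the host before starting the install follows. The helper has_free_gb is a hypothetical name, and the 100 GB threshold in the usage line is only a guess at a comfortable margin, not a documented requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# NEED_GB gigabytes available (df -Pk reports available space in 1 KB blocks).
has_free_gb() {
    local dir=$1 need_gb=$2 avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}
```

Usage on the host: `has_free_gb /mnt/nvme4n1p1/biobakery_db 100 || echo "not enough free space"`.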

Debugging

5.1. Unable to find fastqc

 # Install the missing fastqc software

     apt-get update
     apt-get install -y fastqc

5.2. Install Java 11 to correctly run fastqc

 # Install Java 11
     apt-get install -y openjdk-11-jre-headless

5.3. Wrong version of kneaddata

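 # BUG: the log shows two things. The installed kneaddata version is v0.7.10, and biobakery_workflows forces the --run-trf argument when it calls kneaddata.
 # In kneaddata v0.7.4 and later, however, TRF (tandem repeat filtering) is enabled by default, so the developers removed the --run-trf command-line argument (only --bypass-trf remains, to skip it).
 # Because the biobakery_workflows script still hard-codes --run-trf and kneaddata v0.7.10 does not recognize it, the run exits with: unrecognized arguments: --run-trf. Once this first step fails, every downstream task that depends on it (MetaPhlAn, HUMAnN, etc.) fails in cascade.
 # Option 1: downgrade kneaddata to a compatible version (recommended)
 # Downgrade kneaddata to 0.7.3, the last stable version that natively supports --run-trf and works with the current biobakery_workflows.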

 pip install --upgrade kneaddata
 #Successfully installed kneaddata-0.12.4 (NOT_COMPATIBLE)

 pip install kneaddata==0.7.10
 #Successfully installed kneaddata-0.7.10
 root@13192f2ad6e6:/data# /usr/local/bin/kneaddata --version
 kneaddata v0.7.10 (NOT_COMPATIBLE)

 pip uninstall -y kneaddata
 pip install kneaddata==0.7.3
 /usr/local/bin/kneaddata --version
 # Expected output: kneaddata v0.7.3
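
The compatibility rule described above (versions below 0.7.4 accept --run-trf) can be turned into a small guard that runs before the workflow. This is only a sketch: parse_kneaddata_version and version_lt are hypothetical helpers, and the 0.7.4 threshold is taken from the notes above, not verified against the kneaddata changelog.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the installed kneaddata version.

# Extract "0.7.3" from a string such as "kneaddata v0.7.3".
parse_kneaddata_version() {
    printf '%s\n' "${1##* v}"
}

# Return 0 if dotted version $1 is strictly lower than $2 (three fields).
version_lt() {
    local IFS=.
    local -a a=($1) b=($2)
    local i x y
    for i in 0 1 2; do
        x=${a[i]:-0}
        y=${b[i]:-0}
        if (( 10#$x < 10#$y )); then return 0; fi
        if (( 10#$x > 10#$y )); then return 1; fi
    done
    return 1   # versions are equal
}
```

Inside the container this could be used as `version_lt "$(parse_kneaddata_version "$(kneaddata --version 2>&1)")" 0.7.4 || echo "kneaddata too new for --run-trf"`.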

5.4. ⚠️ IMPORTANT: Save Your Container (persist this change):

 1. Open a NEW terminal window on your host machine (do not close your current Docker session).
 2. Find your current container's ID or name:
     docker ps
 (Look for the CONTAINER ID of the biobakery/workflows:latest container, e.g., 13192f2ad6e6)
 3. Commit this container to a new image named biobakery/workflows:fixed:
     docker commit 13192f2ad6e6 biobakery/workflows:fixed
     docker images
     docker ps -a
 4. From now on, whenever you want to run the workflow, use this new image name instead of :latest:
     docker run -it \
       -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
       -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test_sampled:/data \
       biobakery/workflows:fixed \
       /bin/bash
 The fixes from 5.1 to 5.3 (fastqc, Java 11, kneaddata 0.7.3) are already in this image, so the workflow should run without repeating them.
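
A committed container is hard to reproduce later. As an alternative sketch, the same three fixes (5.1 to 5.3) can be written into a Dockerfile and built into the same tag. The function name write_fixed_dockerfile and the build directory are hypothetical, and this build has not been tried in these notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Dockerfile that repeats the manual fixes
# (fastqc, Java 11, kneaddata 0.7.3) on top of biobakery/workflows:latest.
write_fixed_dockerfile() {
    local dir=$1
    mkdir -p "$dir"
    cat > "$dir/Dockerfile" <<'EOF'
FROM biobakery/workflows:latest
RUN apt-get update && apt-get install -y fastqc openjdk-11-jre-headless
RUN pip uninstall -y kneaddata && pip install kneaddata==0.7.3
EOF
}
```

Usage on the host: `write_fixed_dockerfile ./biobakery_fixed && docker build -t biobakery/workflows:fixed ./biobakery_fixed`.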

Rerun

6.1. Inside the environment (SUCCESSFUL!)

 docker run -it \
     -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
     -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
     biobakery/workflows:fixed \
     /bin/bash

 export BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases
 # Or, to make this permanent, add that exact line to ~/.bashrc and run: source ~/.bashrc

 # ---- Configure databases (read-level quality control (1_KneadData), taxonomic profiling (2_MetaPhlAn), functional profiling (3_HUMAnN), and strain profiling (4_StrainPhlAn)) ----

 # Locate the 2_MetaPhlAn_databases path (to be updated, see the TODO below)
 python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))"
 # Output: /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases
 ls -lh $(python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))")

 #TODO: try replacing the container's MetaPhlAn database with the complete one from the host environment, i.e. copy ~/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases (v202503, 34G) over /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases (v201901, 2.8G)

 # Update the 3_HUMAnN configuration to point to the installed database paths
 humann_config --update database_folders nucleotide /biobakery_databases/humann/chocophlan
 humann_config --update database_folders protein /biobakery_databases/humann/uniref
 humann_config --update database_folders utility_mapping /biobakery_databases/humann/utility_mapping
 humann_config --print

 # 1_KneadData_databases path: /biobakery_databases/kneaddata_db_human_genome

 # 4_StrainPhlAn_databases paths: strainphlan_db_reference (empty) and strainphlan_db_markers (1.4G)
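
Before launching the workflow it can help to confirm that the database directories reported in the download log are visible inside the container. The sketch below checks the four directories named in that log. check_db_layout is a hypothetical helper, and the list may not cover everything the workflow needs (the StrainPhlAn directories, for instance, are left out).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report database directories missing under ROOT.
# The directory names are copied from the install log in these notes.
check_db_layout() {
    local root=$1 missing=0 sub
    for sub in humann/chocophlan humann/uniref humann/utility_mapping \
               kneaddata_db_human_genome; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage inside the container: `check_db_layout "$BIOBAKERY_WORKFLOWS_DATABASES" && echo "database layout looks complete"`.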

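 # ---- Before a fresh run, optionally clean up the partial results from the failed run ----
 # Either remove the whole results directory ...
 rm -rf /data/results/
 # ... or remove only the fastqc-related files and the humann output.
 # IMPORTANT: no fastqc-related files may remain under /data/results/
 rm -rf /data/results/*fastqc.zip /data/results/*_fastqc
 rm -rf /data/results/humann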

 # General usage:
 $ biobakery_workflows wmgx --input $INPUT --output $OUTPUT_DATA
 $ biobakery_workflows wmgx_vis --input $OUTPUT_DATA --output $OUTPUT_VIS --project-name $PROJECT    #for visualizations
     * $INPUT : A directory containing shotgun sequencing data (i.e. fasta/fastq in gzipped format)
     * $OUTPUT_DATA : A directory to write the data products (i.e. abundance tables). This folder is the output folder of the first command and the input folder of the second command
     * $OUTPUT_VIS : A directory to write the visualization products (i.e. report, figures, data tables)
     * $PROJECT : The name of the project (included in the report title page)
     * Add the options --local-jobs 8 --threads 4 to run 8 local jobs at a time, each with 4 threads.
     * Add the option --grid-jobs 100 to run 100 grid jobs at a time.
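
With --local-jobs 8 --threads 4, up to 8 x 4 = 32 threads can be busy at once. A rough way to pick --local-jobs for a given machine is to divide the core count by the threads per job, as sketched below. local_jobs_for is a hypothetical helper, and the one-thread-per-core assumption is a simplification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of local jobs that fit on CORES cores when
# each job uses THREADS threads (integer division, never less than 1).
local_jobs_for() {
    local cores=$1 threads=$2 jobs
    jobs=$(( cores / threads ))
    [ "$jobs" -ge 1 ] || jobs=1
    echo "$jobs"
}
```

For example, `--local-jobs "$(local_jobs_for "$(nproc)" 4)" --threads 4` could be appended to the wmgx command on a Linux host.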

 # Optional flags, not used in this run:
 #   --qc-options="--bypass-trf"
 #   --bypass-strain-profiling
 # Run the workflow; the full databases are found through $BIOBAKERY_WORKFLOWS_DATABASES set above
 biobakery_workflows wmgx \
   --input /data \
   --output /data/results \
   --threads 64 \
   --pair-identifier "_1"
 #For A_R1.fastq.gz and A_R2.fastq.gz, the identifier is "_R1".
 #For A1a_1.fq.gz and A1a_2.fq.gz, the identifier is just "_1" (because the files end in _1 and _2 directly).
 #For A_1.fq.gz and A_2.fq.gz, the identifier is also "_1".
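
To make the naming rule in the comments above concrete, the sketch below derives the expected mate-2 file name from a mate-1 file name and a pair identifier. mate2_name is a hypothetical helper written for this note; it illustrates the naming convention only and is not how biobakery_workflows matches pairs internally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the mate-2 file name for FILE given the
# mate-1 pair identifier IDENT. Only the last occurrence is replaced.
mate2_name() {
    local file=$1 ident=$2
    local ident2=${ident/1/2}            # "_1" -> "_2", "_R1" -> "_R2"
    local head=${file%"$ident"*}         # text before the last occurrence
    local tail=${file##*"$ident"}        # text after the last occurrence
    [ "$head" != "$file" ] || return 1   # identifier not found in FILE
    printf '%s%s%s\n' "$head" "$ident2" "$tail"
}
```

A loop such as `for f in /data/*_1.fastq.gz; do [ -e "$(mate2_name "$f" _1)" ] || echo "no mate for $f"; done` would flag unpaired files before the run.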

 biobakery_workflows wmgx_vis \
   --input /data/results \
   --output /data/results_vis \
   --project-name wastewater_2026_A_test_sampled

 #TODO_TOMORROW_2: rerun the two commands above with the complete metaphlan_databases. First back up the original reduced database as metaphlan_databases_simplified, then copy the complete database (now under /data) to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases.

6.2. (NOT_TRIED): run directly from the host environment, without an interactive shell

 docker run --rm \
   -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
   -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
   -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
   biobakery/workflows:fixed \
   biobakery_workflows wmgx \
   -i /data \
   -o /data/results \
   --threads 32
