Whole-metagenome shotgun sequencing data can be processed through read-level quality control (KneadData), taxonomic profiling (MetaPhlAn), functional profiling (HUMAnN), and strain profiling (StrainPhlAn), and a report with publication-ready figures can be generated from the results, all with two workflow commands.
1. Prepare the toy datasets
# Install if needed: conda install -c bioconda seqtk
# Subsample 1% of the reads; the same seed (-s100) is used for both mates so that the pairs stay in sync.
cd ~/DATA/Data_Tam_Metagenomics_2026_pre_vs_post_treatment/X101SC25123808-Z01-J002/01.RawData/A
seqtk sample -s100 A_1.fq.gz 0.01 | gzip > ../J002_A_1.fastq.gz
seqtk sample -s100 A_2.fq.gz 0.01 | gzip > ../J002_A_2.fastq.gz
cd ../B
seqtk sample -s100 B_1.fq.gz 0.01 | gzip > ../J002_B_1.fastq.gz
seqtk sample -s100 B_2.fq.gz 0.01 | gzip > ../J002_B_2.fastq.gz
# The subsampled files were written one level up (../), so move them from there.
mv ../J002*.fastq.gz /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test/
# Subsample B_1/B_2 the same way; run this from the sample directory next to A_test.
seqtk sample -s100 B_1.fastq.gz 0.01 | gzip > ../A_test/B_1.fastq.gz
seqtk sample -s100 B_2.fastq.gz 0.01 | gzip > ../A_test/B_2.fastq.gz
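Because both mates are subsampled with the same seed, the two output files should contain the same number of reads. A minimal sketch for checking this is shown here; the function names count_fastq_reads and pairs_in_sync are hypothetical helpers, and the code assumes standard four-line FASTQ records.

```shell
# Count the reads in a gzipped FASTQ file.
# Assumption: every record spans exactly 4 lines.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Succeed only when both mate files hold the same number of reads.
pairs_in_sync() {
    [ "$(count_fastq_reads "$1")" = "$(count_fastq_reads "$2")" ]
}
```

For example, `pairs_in_sync ../J002_A_1.fastq.gz ../J002_A_2.fastq.gz && echo "A: pairs in sync"` could be run right after the subsampling step. Equal counts do not prove that the read order matches, so this is only a quick sanity check.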
2. Pull the image (note: latest is actually an old build from 2019-2021)
docker pull biobakery/workflows:latest
3. Verify the version inside the container
docker run --rm biobakery/workflows:latest biobakery_workflows --version
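Since the latest tag points to an old build, it can help to read the build date from the image metadata as well. The sketch below uses `docker image inspect` with a Go template; the helper names image_build_year and warn_if_old_image, and the idea of a minimum acceptable year, are assumptions for illustration only.

```shell
# Print the build year of a local image, taken from its "Created" timestamp
# (an RFC 3339 string such as 2021-03-15T10:20:30Z).
image_build_year() {
    docker image inspect --format '{{.Created}}' "$1" | cut -c1-4
}

# Return 1 (with a warning) if the image was built before the given year,
# 2 if the build year could not be determined, 0 otherwise.
warn_if_old_image() {
    image=$1
    min_year=$2
    year=$(image_build_year "$image")
    [ -n "$year" ] || return 2
    if [ "$year" -lt "$min_year" ]; then
        echo "WARNING: $image was built in $year" >&2
        return 1
    fi
    return 0
}
```

A call such as `warn_if_old_image biobakery/workflows:latest 2023` would then flag the image before any time is spent on database downloads.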
4. Install Databases Inside Container
# Create a persistent host directory for the databases
mkdir -p /mnt/nvme4n1p1/biobakery_db
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  biobakery/workflows:latest \
  /bin/bash

# Inside the container:
biobakery_workflows_databases --available
# There are five available database sets each corresponding to a data processing workflow.
# wmgx: The full databases for the whole metagenome workflow
# wmgx_demo: The demo databases for the whole metagenome workflow
# wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
# 16s_usearch: The full databases for the 16s workflow
# 16s_dada2: The full databases for the dada2 workflow
# 16s_its: The unite database for the its workflow
# isolate_assembly: The eggnog-mapper databases for the assembly workflow

biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
biobakery_workflows_databases --install 16s_its --location /biobakery_databases
biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases
biobakery_workflows_databases --install wmgx --location /biobakery_databases

# ---- DOWNLOAD_LOG ----
1. INSTALLING humann utility mapping database
Creating directory to install database: /biobakery_databases/humann
Creating subdirectory to INSTALL database: /biobakery_databases/humann/utility_mapping
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
Downloading file of size: 2.55 GB
2.55 GB 100.00 % 4.70 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/full_mapping_v201901.tar.gz
Database installed: /biobakery_databases/humann/utility_mapping
HUMAnN configuration file updated: database_folders : utility_mapping = /biobakery_databases/humann/utility_mapping
Generating strainphlan fasta database
# OPEN QUESTION: does this step generate the directories strainphlan_db_reference and strainphlan_db_markers from humann/utility_mapping? That would contradict the assumption that the markers are extracted from the MetaPhlAn index:
# bowtie2-inspect ${DB_DIR}/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503 > ${DB_DIR}/strainphlan_db_markers/all_markers.fasta

2. INSTALLING humann nucleotide and protein databases
Creating subdirectory to INSTALL database: /biobakery_databases/humann/chocophlan
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
Downloading file of size: 15.30 GB
15.30 GB 100.00 % 6.75 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/full_chocophlan.v296_201901.tar.gz
Database installed: /biobakery_databases/humann/chocophlan
HUMAnN configuration file updated: database_folders : nucleotide = /biobakery_databases/humann/chocophlan
Creating subdirectory to INSTALL database: /biobakery_databases/humann/uniref
Download URL: http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
Downloading file of size: 19.31 GB
19.31 GB 100.00 % 7.22 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/humann/uniref90_annotated_v201901.tar.gz
Database installed: /biobakery_databases/humann/uniref
HUMAnN configuration file updated: database_folders : protein = /biobakery_databases/humann/uniref

3. INSTALLING hg kneaddata database
Creating directory to install database: /biobakery_databases/kneaddata_db_human_genome
Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
Downloading file of size: 3.48 GB
3.48 GB 100.00 % 7.17 MB/sec 0 min -0 sec
Extracting: /biobakery_databases/kneaddata_db_human_genome/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
Database installed: /biobakery_databases/kneaddata_db_human_genome
A custom install location was selected. Please set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the install location.
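The download log names four directories that the wmgx install should leave behind. A small check along the following lines could confirm that they exist and are not empty before a long run is started. The function name check_biobakery_dbs is hypothetical, and the directory list is taken only from the log above, so it may not be complete for other database sets.

```shell
# Report whether each database directory named in the download log exists
# and is non-empty. Returns 1 if any of them is missing or empty.
check_biobakery_dbs() {
    db_root=${1:-${BIOBAKERY_WORKFLOWS_DATABASES:-}}
    status=0
    for sub in humann/utility_mapping humann/chocophlan humann/uniref kneaddata_db_human_genome; do
        if [ -d "$db_root/$sub" ] && [ -n "$(ls -A "$db_root/$sub" 2>/dev/null)" ]; then
            echo "ok       $db_root/$sub"
        else
            echo "MISSING  $db_root/$sub"
            status=1
        fi
    done
    return $status
}
```

Inside the container this would be called as `check_biobakery_dbs /biobakery_databases`, or without arguments once $BIOBAKERY_WORKFLOWS_DATABASES is exported.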
5. Debugging
5.1. Unable to find fastqc
# Install the missing fastqc software
apt-get update
apt-get install -y fastqc
5.2. Install Java 11 to correctly run fastqc
# Install Java 11
apt-get install -y openjdk-11-jre-headless
5.3. Wrong version of kneaddata
# BUG: The log shows two things: the installed kneaddata version is v0.7.10, and biobakery_workflows always passes the --run-trf argument when it calls kneaddata.
# However, from kneaddata v0.7.4 onward, TRF (tandem repeat filtering) is enabled by default, so the developers removed the --run-trf command-line argument (only --bypass-trf remains, to skip it).
# Because the biobakery_workflows script still hard-codes --run-trf and kneaddata v0.7.10 does not recognize it, the run exits with: unrecognized arguments: --run-trf. Once this first step has failed, all downstream tasks that depend on it (MetaPhlAn, HUMAnN, etc.) fail in cascade.
# Option 1: downgrade kneaddata to a compatible version (recommended)
# Downgrade kneaddata to 0.7.3, the last stable version that natively accepts --run-trf and works with the current biobakery_workflows.

# Attempts that did NOT work:
pip install --upgrade kneaddata
# Successfully installed kneaddata-0.12.4 (NOT_COMPATIBLE)
pip install kneaddata==0.7.10
# Successfully installed kneaddata-0.7.10
root@13192f2ad6e6:/data# /usr/local/bin/kneaddata --version
kneaddata v0.7.10 (NOT_COMPATIBLE)

# Working fix:
pip uninstall -y kneaddata
pip install kneaddata==0.7.3
/usr/local/bin/kneaddata --version
# Expected output: kneaddata v0.7.3
5.4. ⚠️ IMPORTANT: Save Your Container (make this change permanent):
1. Open a NEW terminal window on your host machine (do not close your current Docker session).
2. Find your current container's ID or name (look for the CONTAINER ID of the biobakery/workflows:latest container, e.g., 13192f2ad6e6):
docker ps
3. Commit this container to a new image named biobakery/workflows:fixed, then confirm that the image was created:
docker commit 13192f2ad6e6 biobakery/workflows:fixed
docker images
docker ps -a
4. From now on, whenever you want to run the workflow, use this new image name instead of :latest:
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test_sampled:/data \
  biobakery/workflows:fixed \
  /bin/bash
The downgraded kneaddata, along with the fastqc and Java installs, will already be there, and the workflow should run without the errors above.
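Before committing the container, the kneaddata version problem from 5.3 can be checked mechanically. The sketch below compares the installed version against the 0.7.4 cutoff; that cutoff is taken from these notes and has not been verified against the kneaddata changelog, and the helper names version_lt, kneaddata_version, and accepts_run_trf are hypothetical.

```shell
# Succeed if version $1 is strictly lower than version $2.
# Assumption: both are dot-separated integers with up to three fields.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)" = "$1" ]
}

# Extract "0.7.3" from a line such as "kneaddata v0.7.3".
kneaddata_version() {
    kneaddata --version 2>&1 | sed -n 's/^kneaddata v\([0-9][0-9.]*\).*/\1/p'
}

# Succeed if the given version is older than 0.7.4, i.e. (according to
# these notes) it still accepts the --run-trf argument.
accepts_run_trf() {
    version_lt "$1" 0.7.4
}
```

Inside the container, `accepts_run_trf "$(kneaddata_version)" || echo "kneaddata is too new for this biobakery_workflows"` would give an early warning instead of a cascade of failed tasks.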
6. Rerun
6.1. Inside the environment (SUCCESSFUL!)
docker run -it \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
  biobakery/workflows:fixed \
  /bin/bash

export BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases
# OR, to make this permanent, add that exact line to the ~/.bashrc file and run: source ~/.bashrc

# ---- Configure databases: read-level quality control (1_KneadData), taxonomic profiling (2_MetaPhlAn), functional profiling (3_HUMAnN), and strain profiling (4_StrainPhlAn) ----
# Update the 2_MetaPhlAn_databases path
python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))"
# Output: /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases
ls -lh $(python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))")
# TODO: try copying the complete metaphlan_databases from the host environment into the container, namely from ~/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases (v202503, 34G) to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases (v201901, 2.8G)

# Update the 3_HUMAnN configuration to point to the installed databases
humann_config --update database_folders nucleotide /biobakery_databases/humann/chocophlan
humann_config --update database_folders protein /biobakery_databases/humann/uniref
humann_config --update database_folders utility_mapping /biobakery_databases/humann/utility_mapping
humann_config --print

# 1_KneadData_databases path: /biobakery_databases/kneaddata_db_human_genome
# 4_StrainPhlAn_databases path: strainphlan_db_reference (empty) and strainphlan_db_markers (1.4G)

# ---- For a new run, optionally clean up the partial results from the failed run ----
# Either remove everything, or remove only the fastqc and humann outputs.
rm -rf /data/results/
# IMPORTANT: no fastqc-related files may remain under /data/results/
rm -rf /data/results/*fastqc.zip /data/results/*_fastqc
rm -rf /data/results/humann

# ---- General form of the two workflow commands ----
$ biobakery_workflows wmgx --input /data --output /data/results
$ biobakery_workflows wmgx_vis --input $OUTPUT_DATA --output $OUTPUT_VIS --project-name $PROJECT   # for visualizations
* $INPUT : A directory containing shotgun sequencing data (i.e. fasta/fastq in gzipped format); this is /data in the first command above
* $OUTPUT_DATA : A directory to write the data products (i.e. abundance tables). This folder is the output folder for the first command and the input folder for the second command
* $OUTPUT_VIS : A directory to write the visualization products (i.e. report, figures, data tables)
* $PROJECT : The name of the project (included in the report title page)
* Add the options --local-jobs 8 --threads 4 to run 8 local jobs at a time, each with 4 threads.
* Add the option --grid-jobs 100 to run 100 grid jobs at a time.

# Optional flags, left commented out:
#   --qc-options="--bypass-trf" \
#   --bypass-strain-profiling

# Run the workflow, explicitly pointing to the full databases you downloaded
biobakery_workflows wmgx \
  --input /data \
  --output /data/results \
  --threads 64 \
  --pair-identifier "_1"
# For A_R1.fastq.gz and A_R2.fastq.gz, the identifier is "_R1".
# For A1a_1.fq.gz and A1a_2.fq.gz, the identifier is just "_1" (because the files end in _1 and _2 directly).
# For A_1.fq.gz and A_2.fq.gz, the identifier is also "_1".

biobakery_workflows wmgx_vis \
  --input /data/results \
  --output /data/results_vis \
  --project-name wastewater_2026_A_test_sampled

# TODO_TOMORROW_2: rerun the two commands above with the complete metaphlan_databases. Copy the database that is now under /data to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases, and first back up the original simplified database as metaphlan_databases_simplified!
6.2. (NOT_TRIED): Run directly from the host environment
docker run --rm \
  -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
  -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
  -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
  biobakery/workflows:fixed \
  biobakery_workflows wmgx \
  -i /data \
  -o /data/results \
  --threads 32
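The --pair-identifier rules noted in 6.1 can be written down as a small helper, which is handy when a batch mixes naming schemes. This is only a sketch: guess_pair_identifier is a hypothetical function that covers just the three file-name patterns mentioned in these notes.

```shell
# Print the forward-read pair identifier for a FASTQ file name,
# or return 1 if the name matches none of the known patterns.
guess_pair_identifier() {
    name=$(basename "$1")
    # Strip the compression and FASTQ extensions.
    name=${name%.gz}
    name=${name%.fastq}
    name=${name%.fq}
    case $name in
        *_R1) echo "_R1" ;;
        *_1)  echo "_1" ;;
        *)    return 1 ;;
    esac
}
```

For the toy dataset, `guess_pair_identifier /data/B_1.fastq.gz` prints `_1`, which is the value passed to --pair-identifier above.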