Daily Archives: June 5, 2026

Metagenomics analysis using Docker (Data_Tam_Metagenomics_2026*)

https://huttenhower.sph.harvard.edu/biobakery_workflows/

Whole metagenome shotgun sequencing data can be processed through read-level quality control (KneadData), taxonomic profiling (MetaPhlAn), functional profiling (HUMAnN), and strain profiling (StrainPhlAn). The result is a report with publication-ready figures, produced with two workflow commands.

  1. Prepare the toy datasets

     # Install if needed: conda install -c bioconda seqtk
     cd ~/DATA/Data_Tam_Metagenomics_2026_pre_vs_post_treatment/X101SC25123808-Z01-J002/01.RawData/A
     seqtk sample -s100 A_1.fq.gz 0.01 | gzip > ../J002_A_1.fastq.gz
     seqtk sample -s100 A_2.fq.gz 0.01 | gzip > ../J002_A_2.fastq.gz
     cd ../B
     seqtk sample -s100 B_1.fq.gz 0.01 | gzip > ../J002_B_1.fastq.gz
     seqtk sample -s100 B_2.fq.gz 0.01 | gzip > ../J002_B_2.fastq.gz
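     # The subsampled files were written one level up, so move them from there
     mv ../J002*.fastq.gz /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test/
    
     # Same subsampling, run from the directory that holds B_1.fastq.gz and B_2.fastq.gz (a sibling of A_test)
     seqtk sample -s100 B_1.fastq.gz 0.01 | gzip > ../A_test/B_1.fastq.gz
     seqtk sample -s100 B_2.fastq.gz 0.01 | gzip > ../A_test/B_2.fastq.gz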
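
Using the same seed (-s100) for both mates is what keeps the read pairs in sync after subsampling. A quick sanity check is to confirm that both files of a pair contain the same number of reads. The sketch below is a minimal helper, assuming standard four-line FASTQ records; the function names count_reads and check_pair are made up for this note.

```shell
# Count reads in a gzipped FASTQ file (assumes 4 lines per read).
count_reads() {
    lines=$(gzip -cd "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Succeed only if both mates of a pair contain the same number of reads.
check_pair() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    echo "$1: $r1 reads; $2: $r2 reads"
    [ "$r1" -eq "$r2" ]
}

# Usage (run in the directory that holds the subsampled files):
#   check_pair J002_A_1.fastq.gz J002_A_2.fastq.gz || echo "pair out of sync"
```

Equal counts do not prove that the read names match, but a mismatch is a clear sign that the two files were sampled differently.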
  2. Pull the image (note: latest is actually an old build from 2019-2021)

     docker pull biobakery/workflows:latest
  3. Verify the version inside the container

     docker run --rm biobakery/workflows:latest biobakery_workflows --version
  4. Install Databases Inside Container

     # Create persistent host directory for databases
     mkdir -p /mnt/nvme4n1p1/biobakery_db
    
     docker run -it \
     -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
     biobakery/workflows:latest \
     /bin/bash
    
     # Inside container:
     biobakery_workflows_databases --available
     #There are five available database sets each corresponding to a data processing workflow.
     #wmgx: The full databases for the whole metagenome workflow
     #wmgx_demo: The demo databases for the whole metagenome workflow
     #wmgx_wmtx: The full databases for the whole metagenome and metatranscriptome workflow
     #16s_usearch: The full databases for the 16s workflow
     #16s_dada2: The full databases for the dada2 workflow
     #16s_its: The unite database for the its workflow
     #isolate_assembly: The eggnog-mapper databases for the assembly workflow
    
     biobakery_workflows_databases --install wmgx_demo --location /biobakery_databases
     biobakery_workflows_databases --install wmgx_wmtx --location /biobakery_databases
     biobakery_workflows_databases --install 16s_usearch --location /biobakery_databases
     biobakery_workflows_databases --install 16s_dada2 --location /biobakery_databases
     biobakery_workflows_databases --install 16s_its --location /biobakery_databases
     biobakery_workflows_databases --install isolate_assembly --location /biobakery_databases
    
     biobakery_workflows_databases --install wmgx --location /biobakery_databases
     # ---- DOWNLOAD_LOG ----
     1. INSTALLING humann utility mapping database
         Creating directory to install database: /biobakery_databases/humann
    
         Creating subdirectory to INSTALL database: /biobakery_databases/humann/utility_mapping
         Download URL: http://huttenhower.sph.harvard.edu/humann2_data/full_mapping_v201901.tar.gz
         Downloading file of size: 2.55 GB
         2.55 GB 100.00 %   4.70 MB/sec  0 min -0 sec
         Extracting: /biobakery_databases/humann/full_mapping_v201901.tar.gz
         Database installed: /biobakery_databases/humann/utility_mapping
         HUMAnN configuration file updated: database_folders : utility_mapping = /biobakery_databases/humann/utility_mapping
    
         Generating strainphlan fasta database
         # NOTE (unresolved): does this step generate the directories strainphlan_db_reference and strainphlan_db_markers from humann/utility_mapping? That would contradict the earlier assumption that the markers come from: bowtie2-inspect ${DB_DIR}/metaphlan_databases/mpa_vJan25_CHOCOPhlAnSGB_202503 > ${DB_DIR}/strainphlan_db_markers/all_markers.fasta
    
     2. INSTALLING humann nucleotide and protein databases
    
         Creating subdirectory to INSTALL database: /biobakery_databases/humann/chocophlan
         Download URL: http://huttenhower.sph.harvard.edu/humann2_data/chocophlan/full_chocophlan.v296_201901.tar.gz
         Downloading file of size: 15.30 GB
         15.30 GB 100.00 %   6.75 MB/sec  0 min -0 sec
         Extracting: /biobakery_databases/humann/full_chocophlan.v296_201901.tar.gz
         Database installed: /biobakery_databases/humann/chocophlan
         HUMAnN configuration file updated: database_folders : nucleotide = /biobakery_databases/humann/chocophlan
    
         Creating subdirectory to INSTALL database: /biobakery_databases/humann/uniref
         Download URL: http://huttenhower.sph.harvard.edu/humann2_data/uniprot/uniref_annotated/uniref90_annotated_v201901.tar.gz
         Downloading file of size: 19.31 GB
         19.31 GB 100.00 %   7.22 MB/sec  0 min -0 sec
         Extracting: /biobakery_databases/humann/uniref90_annotated_v201901.tar.gz
         Database installed: /biobakery_databases/humann/uniref
         HUMAnN configuration file updated: database_folders : protein = /biobakery_databases/humann/uniref
    
     3. INSTALLING hg kneaddata database
    
         Creating directory to install database: /biobakery_databases/kneaddata_db_human_genome
         Download URL: http://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
         Downloading file of size: 3.48 GB
         3.48 GB 100.00 %   7.17 MB/sec  0 min -0 sec
         Extracting: /biobakery_databases/kneaddata_db_human_genome/Homo_sapiens_hg37_and_human_contamination_Bowtie2_v0.1.tar.gz
         Database installed: /biobakery_databases/kneaddata_db_human_genome
    
     A custom install location was selected. Please set the environment variable $BIOBAKERY_WORKFLOWS_DATABASES to the install location.
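
Before setting $BIOBAKERY_WORKFLOWS_DATABASES it can help to confirm that the directories named in the log above really exist. The following is a minimal sketch; the function name check_biobakery_dbs is made up for this note, and the list of subdirectories is taken only from the "Database installed:" lines of the log, so it may not cover everything the workflow needs.

```shell
# Verify that the directories reported in the install log exist under a root.
# With no argument, the root defaults to $BIOBAKERY_WORKFLOWS_DATABASES.
check_biobakery_dbs() {
    root=${1:-${BIOBAKERY_WORKFLOWS_DATABASES:-}}
    missing=0
    for sub in humann/utility_mapping humann/chocophlan humann/uniref kneaddata_db_human_genome; do
        if [ -d "$root/$sub" ]; then
            echo "ok       $root/$sub"
        else
            echo "MISSING  $root/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Usage inside the container:
#   check_biobakery_dbs /biobakery_databases && export BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases
```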
  5. Debugging

    5.1. Unable to find fastqc

     # Install the missing fastqc software
    
         apt-get update
         apt-get install -y fastqc

    5.2. Install Java 11 to correctly run fastqc

     # Install Java 11
         apt-get install -y openjdk-11-jre-headless

    5.3. Wrong version of kneaddata

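     # BUG: The log shows two things: the installed kneaddata version is v0.7.10, and biobakery_workflows forces the --run-trf argument when it calls kneaddata.
     # However, in kneaddata v0.7.4 and later, TRF (tandem repeat filtering) is enabled by default, so the developers removed the --run-trf command-line argument (keeping only --bypass-trf to skip it).
     # Because the biobakery_workflows script still hard-codes --run-trf, and kneaddata v0.7.10 does not recognize it, the run exits with: unrecognized arguments: --run-trf. Since this first kneaddata step fails, every downstream task that depends on it (MetaPhlAn, HUMAnN, etc.) fails in cascade.
     # Option 1: downgrade kneaddata to a compatible version (recommended)
     # kneaddata must be downgraded to 0.7.3, the last stable version that natively supports --run-trf and that works with the current biobakery_workflows.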
    
     pip install --upgrade kneaddata
     #Successfully installed kneaddata-0.12.4 (NOT_COMPATIBLE)
    
     pip install kneaddata==0.7.10
     #Successfully installed kneaddata-0.7.10
     root@13192f2ad6e6:/data# /usr/local/bin/kneaddata --version
     kneaddata v0.7.10 (NOT_COMPATIBLE)
    
     pip uninstall -y kneaddata
     pip install kneaddata==0.7.3
     /usr/local/bin/kneaddata --version
     # Expected output: kneaddata v0.7.3
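
Since three different kneaddata versions were installed in a row, it is easy to lose track of which one is active. The sketch below is one way to make the check explicit, assuming the version line has the form "kneaddata v0.7.3" as shown above; the function names are made up for this note.

```shell
# Extract the version number from a line such as "kneaddata v0.7.3".
parse_kneaddata_version() {
    printf '%s\n' "$1" | sed -n 's/^kneaddata v\([0-9][0-9.]*\).*$/\1/p'
}

# Succeed only if the reported version equals the expected one.
kneaddata_version_is() {
    reported=$(parse_kneaddata_version "$1")
    [ "$reported" = "$2" ]
}

# Usage inside the container:
#   kneaddata_version_is "$(/usr/local/bin/kneaddata --version 2>&1)" 0.7.3 \
#       || echo "kneaddata is not 0.7.3, the workflow will fail on --run-trf"
```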

    5.4. ⚠️ IMPORTANT: Save your container (persist these changes):

     1. Open a NEW terminal window on your host machine (do not close your current Docker session).
     2. Find your current container's ID or name:
         docker ps
     (Look for the CONTAINER ID of the biobakery/workflows:latest container, e.g., 13192f2ad6e6)
     3. Commit this container to a new image named biobakery/workflows:fixed:
         docker commit 13192f2ad6e6 biobakery/workflows:fixed
         docker images
         docker ps -a
     4. From now on, whenever you want to run the workflow, use this new image name instead of :latest:
         docker run -it \
           -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
           -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test_sampled:/data \
           biobakery/workflows:fixed \
           /bin/bash
     The fixes made above (fastqc, Java 11, kneaddata 0.7.3) are already part of this image, so the workflow should run without repeating them.
  6. Rerun

    6.1. Inside the environment (SUCCESSFUL!)

     docker run -it \
         -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
         -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
         biobakery/workflows:fixed \
         /bin/bash
    
     export BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases
     # Or, to make this permanent, add that exact line to ~/.bashrc and run: source ~/.bashrc
    
     # ---- Configure databases (read-level quality control (1_KneadData), taxonomic profiling (2_MetaPhlAn), functional profiling (3_HUMAnN), and strain profiling (4_StrainPhlAn)) ----
    
     # Check the 2_MetaPhlAn_databases path
     python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))"
     # Output: /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases
     ls -lh $(python3 -c "import metaphlan, os; print(os.path.join(os.path.dirname(metaphlan.__file__), 'metaphlan_databases'))")
    
     #TODO: Try copying the complete metaphlan_databases from the host environment into the container, i.e. from ~/mambaforge/envs/biobakery_run/lib/python3.10/site-packages/metaphlan/metaphlan_databases (v202503, 34G) to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases (v201901, 2.8G)
    
     # Update the 3_HUMAnN configuration to point to the installed databases
     humann_config --update database_folders nucleotide /biobakery_databases/humann/chocophlan
     humann_config --update database_folders protein /biobakery_databases/humann/uniref
     humann_config --update database_folders utility_mapping /biobakery_databases/humann/utility_mapping
     humann_config --print
    
     # 1_KneadData_databases path: /biobakery_databases/kneaddata_db_human_genome
    
     # 4_StrainPhlAn_databases paths: strainphlan_db_reference (empty) and strainphlan_db_markers (1.4G)
    
     # ---- Before a fresh run, optionally clean up the partial results of the failed run ----
     rm -rf /data/results/    # removes everything, including the two targets below
     # Or remove only the stale pieces:
     rm -rf /data/results/*fastqc.zip /data/results/*_fastqc    # IMPORTANT: no fastqc-related files may remain under /data/results/
     rm -rf /data/results/humann
    
     # General usage:
     $ biobakery_workflows wmgx --input $INPUT --output $OUTPUT_DATA
     $ biobakery_workflows wmgx_vis --input $OUTPUT_DATA --output $OUTPUT_VIS --project-name $PROJECT    # for visualizations
         * $INPUT : A directory containing shotgun sequencing data (i.e. fasta/fastq in gzipped format); here /data
         * $OUTPUT_DATA : A directory to write the data products (i.e. abundance tables); here /data/results. This folder is the output folder for the first command and the input folder for the second command
         * $OUTPUT_VIS : A directory to write the visualization products (i.e. report, figures, data tables)
         * $PROJECT : The name of the project (included in the report title page)
         * Add the options --local-jobs 8 --threads 4 to run 8 local jobs at a time, each with 4 threads.
         * Add the option --grid-jobs 100 to run 100 grid jobs at a time.
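
With --local-jobs 8 --threads 4 the workflow can occupy up to 8 x 4 = 32 threads at once. The sketch below turns that arithmetic around and derives a job count from the number of cores. It is only a rule of thumb, assuming that total load is roughly jobs times threads; the function name max_local_jobs is made up for this note.

```shell
# Print how many local jobs fit on a machine, given the total number of
# cores and the threads used per job (integer division, never below 1).
max_local_jobs() {
    cores=$1
    threads=$2
    n_jobs=$((cores / threads))
    [ "$n_jobs" -ge 1 ] || n_jobs=1
    echo "$n_jobs"
}

# Usage inside the container (nproc reports the available cores):
#   biobakery_workflows wmgx --input /data --output /data/results \
#       --threads 4 --local-jobs "$(max_local_jobs "$(nproc)" 4)"
```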
    
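     # Optional flags that can be appended to the command below:
     #   --qc-options="--bypass-trf"    (passes --bypass-trf to KneadData to skip TRF)
     #   --bypass-strain-profiling      (skips the strain profiling step)
     # Run the workflow; the full databases are found through $BIOBAKERY_WORKFLOWS_DATABASES set above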
     biobakery_workflows wmgx \
       --input /data \
       --output /data/results \
       --threads 64 \
       --pair-identifier "_1"
     #For A_R1.fastq.gz and A_R2.fastq.gz, the identifier is "_R1".
     #For A1a_1.fq.gz and A1a_2.fq.gz, the identifier is just "_1" (because the files end in _1 and _2 directly).
     #For A_1.fq.gz and A_2.fq.gz, the identifier is also "_1".
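
The naming rules in the comments above can be written as a small lookup. This is a minimal sketch covering only the two patterns mentioned here (_R1 and _1) with the usual FASTQ extensions; the function name guess_pair_identifier is made up for this note and is not part of biobakery_workflows.

```shell
# Print the --pair-identifier value for a forward-read file name.
# Only the "_R1" and "_1" conventions described above are recognized.
guess_pair_identifier() {
    name=$(basename "$1")
    case "$name" in
        *_R1.fastq.gz|*_R1.fq.gz|*_R1.fastq|*_R1.fq) echo "_R1" ;;
        *_1.fastq.gz|*_1.fq.gz|*_1.fastq|*_1.fq)     echo "_1" ;;
        *) echo "no known pair identifier in $name" >&2; return 1 ;;
    esac
}

# Usage:
#   biobakery_workflows wmgx --input /data --output /data/results \
#       --pair-identifier "$(guess_pair_identifier /data/B_1.fastq.gz)"
```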
    
     biobakery_workflows wmgx_vis \
       --input /data/results \
       --output /data/results_vis \
       --project-name wastewater_2026_A_test_sampled
    
     #TODO_TOMORROW_2: Rerun the two commands above with the complete metaphlan_databases. Copy the database that now sits under /data to /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases, after backing up the original simplified database as metaphlan_databases_simplified.
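
The backup-then-copy step from the TODO could look like the sketch below. It has not been run against the real container: whether the MetaPhlAn version in the image can read the newer database is still open, the function name swap_metaphlan_db is made up for this note, and the source path in the usage line is a placeholder.

```shell
# Move the bundled database directory aside as <dir>_simplified, then copy
# the full database into its place. Refuses to run if a backup already exists.
swap_metaphlan_db() {
    current=$1    # the bundled metaphlan_databases directory
    full=$2       # the directory holding the complete database
    backup="${current}_simplified"
    [ -d "$full" ] || { echo "no such directory: $full" >&2; return 1; }
    [ ! -e "$backup" ] || { echo "backup already exists: $backup" >&2; return 1; }
    mv "$current" "$backup" && cp -a "$full" "$current"
}

# Usage inside the container (the /data path is a placeholder):
#   swap_metaphlan_db /usr/local/lib/python3.6/dist-packages/metaphlan/metaphlan_databases \
#       /data/metaphlan_databases
```

To undo the swap, delete the copied directory and rename metaphlan_databases_simplified back to metaphlan_databases. Remember that the change only persists if the container is committed again, as in step 5.4.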

    6.2. (NOT_TRIED) Run directly from the host environment

     docker run --rm \
       -v /mnt/nvme4n1p1/biobakery_db:/biobakery_databases \
       -v /mnt/md1/DATA/Data_Tam_Metagenomics_2026_wastewater/X101SC25123808-Z01-J003/01.RawData/A_test:/data \
       -e BIOBAKERY_WORKFLOWS_DATABASES=/biobakery_databases \
       biobakery/workflows:fixed \
       biobakery_workflows wmgx \
       -i /data \
       -o /data/results \
       --threads 32