Workflow using PICRUSt2 for Data_Karoline_16S_2025

gene_x 0 like s 126 view s

Tags: pipeline

Environment Setup: It sets up a Conda environment named picrust2, using the conda create command and then activates this environment using conda activate picrust2.

#https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.2.0-beta)#minimum-requirements-to-run-full-tutorial
mamba create -n picrust2 -c bioconda -c conda-forge picrust2    #2.5.3  #=2.2.0_b
mamba activate /home/jhuang/miniconda3/envs/picrust2

Under env (qiime2-amplicon-2023.9)

Export QIIME2 feature table and representative sequences

#docker pull quay.io/qiime2/core:2023.9
#docker run -it --rm \
#-v /mnt/md1/DATA/Data_Karoline_16S_2025:/data \
#-v /home/jhuang/REFs:/home/jhuang/REFs \
#quay.io/qiime2/core:2023.9 bash
#cd /data
# === SETTINGS ===
FEATURE_TABLE_QZA="dada2_tests2/test_7_f240_r240/table.qza"
REP_SEQS_QZA="dada2_tests2/test_7_f240_r240/rep-seqs.qza"

# === STEP 1: EXPORT QIIME2 ARTIFACTS ===
mkdir -p qiime2_export
qiime tools export --input-path $FEATURE_TABLE_QZA --output-path qiime2_export
qiime tools export --input-path $REP_SEQS_QZA --output-path qiime2_export

Convert BIOM to TSV for Picrust2 input

biom convert \
-i qiime2_export/feature-table.biom \
-o qiime2_export/feature-table.tsv \
--to-tsv

Under env (picrust2): mamba activate /home/jhuang/miniconda3/envs/picrust2

Run PICRUSt2 pipeline

tail -n +2 qiime2_export/feature-table.tsv > qiime2_export/feature-table-fixed.tsv
picrust2_pipeline.py \
-s qiime2_export/dna-sequences.fasta \
-i qiime2_export/feature-table-fixed.tsv \
-o picrust2_out \
-p 100

#This will:
#* Place sequences in the reference tree (using EPA-NG),
#* Predict gene family abundances (e.g., EC, KO, PFAM, TIGRFAM),
#* Predict pathway abundances.

#In current PICRUSt2 (with picrust2_pipeline.py), you do not run hsp.py separately.
#Instead, picrust2_pipeline.py internally runs the HSP step for all functional categories automatically. It outputs all the prediction files (16S_predicted_and_nsti.tsv.gz, COG_predicted.tsv.gz, PFAM_predicted.tsv.gz, KO_predicted.tsv.gz, EC_predicted.tsv.gz, TIGRFAM_predicted.tsv.gz, PHENO_predicted.tsv.gz) in the output directory.

mkdir picrust2_out_advanced; cd picrust2_out_advanced
#If you still want to run hsp.py manually (advanced use / debugging), the commands correspond directly:
hsp.py -i 16S -t ../picrust2_out/out.tre -o 16S_predicted_and_nsti.tsv.gz -p 100 -n
hsp.py -i COG -t ../picrust2_out/out.tre -o COG_predicted.tsv.gz -p 100
hsp.py -i PFAM -t ../picrust2_out/out.tre -o PFAM_predicted.tsv.gz -p 100
hsp.py -i KO -t ../picrust2_out/out.tre -o KO_predicted.tsv.gz -p 100
hsp.py -i EC -t ../picrust2_out/out.tre -o EC_predicted.tsv.gz -p 100
hsp.py -i TIGRFAM -t ../picrust2_out/out.tre -o TIGRFAM_predicted.tsv.gz -p 100
hsp.py -i PHENO -t ../picrust2_out/out.tre -o PHENO_predicted.tsv.gz -p 100

Metagenome prediction per functional category (if needed separately)

#cd picrust2_out_advanced
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f COG_predicted.tsv.gz -o COG_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz -o EC_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f KO_predicted.tsv.gz -o KO_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f PFAM_predicted.tsv.gz -o PFAM_metagenome_out --strat_out
metagenome_pipeline.py -i ../qiime2_export/feature-table.biom -m 16S_predicted_and_nsti.tsv.gz -f TIGRFAM_predicted.tsv.gz -o TIGRFAM_metagenome_out --strat_out

# Add descriptions in gene family tables
add_descriptions.py -i COG_metagenome_out/pred_metagenome_unstrat.tsv.gz -m COG -o COG_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz   # EC and METACYC is a pair, EC for gene_annotation and METACYC for pathway_annotation
add_descriptions.py -i PFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m PFAM -o PFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i TIGRFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m TIGRFAM -o TIGRFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

Pathway inference (MetaCyc pathways from EC numbers)

#cd picrust2_out_advanced
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o EC_pathways_out -p 100
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o EC_pathways_out_per_seq -p 100 --per_sequence_contrib --per_sequence_abun EC_metagenome_out/seqtab_norm.tsv.gz --per_sequence_function EC_predicted.tsv.gz
#ERROR due to missing .../pathway_mapfiles/KEGG_pathways_to_KO.tsv
pathway_pipeline.py -i COG_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv
pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_strat.tsv.gz -o KEGG_pathways_out -p 100 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

add_descriptions.py -i EC_pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o EC_pathways_out/path_abun_unstrat_descrip.tsv.gz
gunzip EC_pathways_out/path_abun_unstrat_descrip.tsv.gz

#Error - no rows remain after regrouping input table. The default pathway and regroup mapfiles are meant for EC numbers. Note that KEGG pathways are not supported since KEGG is a closed-source database, but you can input custom pathway mapfiles if you have access. If you are using a custom function database did you mean to set the --no-regroup flag and/or change the default pathways mapfile used?
#If ERROR --> USE the METACYC for downstream analyses!!!

#ERROR due to missing .../description_mapfiles/KEGG_pathways_info.tsv.gz
#add_descriptions.py -i KO_pathways_out/path_abun_unstrat.tsv.gz -o KEGG_pathways_out/path_abun_unstrat_descrip.tsv.gz --custom_map_table /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/description_mapfiles/KEGG_pathways_info.tsv.gz

#NOTE: Target-analysis for the pathway "mixed acid fermentation"

Visualization

#7.1 STAMP
#https://github.com/picrust/picrust2/wiki/STAMP-example
#Note that STAMP can only be opened under Windows

# It needs two files: path_abun_unstrat_descrip.tsv.gz as "Profile file" and metadata.tsv as "Group metadata file".
cp ~/DATA/Data_Karoline_16S_2025/picrust2_out_advanced/EC_pathways_out/path_abun_unstrat_descrip.tsv ~/DATA/Access_to_Win7/

cut -d$'\t' -f1 qiime2_metadata.tsv > 1
cut -d$'\t' -f3 qiime2_metadata.tsv > 3
cut -d$'\t' -f5-6 qiime2_metadata.tsv > 5_6
paste -d$'\t' 1 3 > 1_3
paste -d$'\t' 1_3 5_6 > metadata.tsv
#SampleID --> SampleID
SampleID        Group   pre_post        Sex_age
sample-A1       Group1  3d.post.stroke  male.aged
sample-A2       Group1  3d.post.stroke  male.aged
sample-A3       Group1  3d.post.stroke  male.aged
cp ~/DATA/Data_Karoline_16S_2025/metadata.tsv ~/DATA/Access_to_Win7/

#7.2. ALDEx2
https://bioconductor.org/packages/release/bioc/html/ALDEx2.html

Under env (qiime2-amplicon-2023.9)

(NOT_NEEDED) Convert pathway output to BIOM and re-import to QIIME2 gunzip picrust2_out/pathways_out/path_abun_unstrat.tsv.gz biom convert \ -i picrust2_out/pathways_out/path_abun_unstrat.tsv \ -o picrust2_out/path_abun_unstrat.biom \ --table-type="Pathway table" \ --to-hdf5

qiime tools import \
--input-path picrust2_out/path_abun_unstrat.biom \
--type 'FeatureTable[Frequency]' \
--input-format BIOMV210Format \
--output-path path_abun.qza

#qiime tools export --input-path path_abun.qza --output-path exported_path_abun
#qiime tools peek path_abun.qza
echo "✅ PICRUSt2 pipeline complete. Output in: picrust2_out"

For QIIME1

Environment Setup: It sets up a Conda environment named picrust2, using the conda create command and then activates this environment using conda activate picrust2.

#https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.2.0-beta)#minimum-requirements-to-run-full-tutorial
mamba create -n picrust2 -c bioconda -c conda-forge picrust2    #2.5.3  #=2.2.0_b
mamba activate /home/jhuang/miniconda3/envs/picrust2

Data Preparation: The script creates a new directory called picrust2_out, then enters it using mkdir and cd commands. It then identifies input files that are needed for the analysis: metadata.tsv, seqs.fna, table.biom. The biom commands are used to inspect and convert the BIOM format files.

mkdir picrust2_out_2024_2
cd picrust2_out_2024_2

# Identifying input data
# Note: Replace the paths and filenames with your actual data if different
# metadata.tsv == ../map_corrected.txt
# seqs.fna     == ../clustering/seqs.fna
# table.biom   == ../core_diversity_e42369/table_even42369.biom

# Inspect and convert the BIOM format files
biom head -i ../core_diversity_e42369/table_even42369.biom
biom summarize-table -i ../core_diversity_e42369/table_even42369.biom
biom convert -i ../core_diversity_e42369/table_even42369.biom -o table_even42369.tsv --to-tsv
#For QIIME2: exported_rarefied_table/feature-table.tsv

Running PiCRUST2: The place_seqs.py command aligns the input sequences to a reference tree. The hsp.py commands generate hidden state prediction for multiple functional categories.

#insert reads into reference tree using EPA-NG
cp ../clustering/rep_set.fna ./
grep ">" rep_set.fna | wc -l  #40990
vim table_even42369.tsv       #40596-2

samtools faidx rep_set.fna
cut -f1-1 table_even42369.tsv > table_even42369.id
#manually modify table_even42369.id by replacing "\n" with " >> seqs.fna\nsamtools faidx rep_set.fna "
run table_even42369.id

rm -rf intermediate/
place_seqs.py -s seqs.fna -o out.tre -p 4 --intermediate intermediate/place_seqs

#castor: Efficient Phylogenetics on Large Trees
#https://github.com/picrust/picrust2/wiki/Hidden-state-prediction

hsp.py -i 16S -t out.tre -o 16S_predicted_and_nsti.tsv.gz -p 100 -n
hsp.py -i COG -t out.tre -o COG_predicted.tsv.gz -p 100
hsp.py -i PFAM -t out.tre -o PFAM_predicted.tsv.gz -p 15
hsp.py -i KO -t out.tre -o KO_predicted.tsv.gz -p 15
hsp.py -i EC -t out.tre -o EC_predicted.tsv.gz -p 15
hsp.py -i TIGRFAM -t out.tre -o TIGRFAM_predicted.tsv.gz -p 15
hsp.py -i PHENO -t out.tre -o PHENO_predicted.tsv.gz -p 15

#>In this table the predicted copy number of all Enzyme Classification (EC) numbers is shown for each ASV. The NSTI values per ASV are not in this table since we did not specify the -n option. EC numbers are a type of gene family defined based on the chemical reactions they catalyze. For instance, EC:1.1.1.1 corresponds to alcohol dehydrogenase. In this tutorial we are focusing on EC numbers since they can be used to infer MetaCyc pathway levels (see below).

zless -S EC_predicted.tsv.gz
sequence        EC:1.1.1.1      EC:1.1.1.10     EC:1.1.1.100    ...
20e568023c10eaac834f1c110aacea18        2       0       3    ...
23fe12a325dfefcdb23447f43b6b896e        0       0       1    ...
288c8176059111c4c7fdfb0cd5afce64        1       0       1    ...
...

##Why import the tsv file to MyData?
#MyData <- read.csv(file="./COG_predicted.tsv", header=TRUE, sep="\t", row.names=1)   #6806 4598  e.g. COG5665
#MyData <- read.csv(file="./PFAM_predicted.tsv", header=TRUE, sep="\t", row.names=1)  #6806 11089 e.g. PF17225
#MyData <- read.csv(file="./KO_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806 10543 e.g. K19791
#MyData <- read.csv(file="./EC_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806 2913  e.g. EC.6.6.1.2
#MyData <- read.csv(file="./16S_predicted.tsv", header=TRUE, sep="\t", row.names=1)   #6806    1     e.g. X16S_rRNA_Count
#MyData <- read.csv(file="./TIGRFAM_predicted.tsv", header=TRUE, sep="\t", row.names=1)  #6806 4287  e.g. TIGR04571
#MyData <- read.csv(file="./PHENO_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806   41  e.g. Use_of_nitrate_as_electron_acceptor, Xylose_utilizing

The metagenome_pipeline.py commands perform metagenomic prediction for several functional categories. Predicted gene families weighted by the relative abundance of ASVs in their community. In other words, we are interested in inferring the metagenomes of the communities.

#Generate metagenome predictions using EC numbers https://en.wikipedia.org/wiki/List_of_enzymes#Category:EC_1.1_(act_on_the_CH-OH_group_of_donors)
metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f COG_predicted.tsv.gz -o COG_metagenome_out --strat_out
metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz -o EC_metagenome_out --strat_out
metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f KO_predicted.tsv.gz -o KO_metagenome_out --strat_out
metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f PFAM_predicted.tsv.gz -o PFAM_metagenome_out --strat_out
metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f TIGRFAM_predicted.tsv.gz -o TIGRFAM_metagenome_out --strat_out

Pathway-level inference: By default this script infers MetaCyc pathway abundances based on EC number abundances, although different gene families and pathways can also be optionally specified. This script performs a number of steps by default, which are based on the approach implemented in HUMAnN2:

#- Regroups EC numbers to MetaCyc reactions.
#- Infers which MetaCyc pathways are present based on these reactions with MinPath.
#- Calculates and returns the abundance of pathways identified as present.

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o pathways_out -p 15

#Note that the path of map files is under /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles
pathway_pipeline.py -i COG_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out -p 15 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

#Mapping predicted KO abundances to legacy KEGG pathways (with stratified output that represents contributions to community-wide abundances):
pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_strat.tsv.gz -o KEGG_pathways_out --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

#Map EC numbers to MetaCyc pathways and get stratified output corresponding to contribution of predicted gene family abundances within each predicted genome:
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o pathways_out_per_seq --per_sequence_contrib --per_sequence_abun EC_metagenome_out/seqtab_norm.tsv.gz --per_sequence_function EC_predicted.tsv.gz

Add functional descriptions: Finally, it can be useful to have a description of each functional id in the output abundance tables. The below commands will add these descriptions as new column in gene family and pathway abundance tables

#--6.1. Add descriptions in gene family tables
add_descriptions.py -i COG_metagenome_out/pred_metagenome_unstrat.tsv.gz -m COG -o COG_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz   # EC and METACYC is a pair, EC for gene_annotation and METACYC for pathway_annotation
add_descriptions.py -i PFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m PFAM -o PFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i TIGRFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m TIGRFAM -o TIGRFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

#--6.2. Add descriptions in pathway abundance tables
add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv.gz
gunzip path_abun_unstrat_descrip.tsv.gz

#Error - no rows remain after regrouping input table. The default pathway and regroup mapfiles are meant for EC numbers. Note that KEGG pathways are not supported since KEGG is a closed-source database, but you can input custom pathway mapfiles if you have access. If you are using a custom function database did you mean to set the --no-regroup flag and/or change the default pathways mapfile used?
#If ERROR --> USE the METACYC for downstream analyses!!!

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -o KEGG_pathways_out/path_abun_unstrat_descrip.tsv.gz --custom_map_table /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/description_mapfiles/KEGG_pathways_info.tsv.gz

Visualization

#7.1 STAMP
#https://github.com/picrust/picrust2/wiki/STAMP-example
conda deactivate
conda install -c bioconda stamp

#conda install -c bioconda stamp
#sudo pip install pyqi
#sudo apt-get install libblas-dev liblapack-dev gfortran
#sudo apt-get install freetype* python-pip python-dev python-numpy python-scipy python-matplotlib
#sudo pip install STAMP
#conda install -c bioconda stamp

conda create -n stamp -c bioconda/label/cf201901 stamp
brew install pyqt

#DEBUG the environment
conda install pyqt=4
#conda install icu=56

e.g. path_abun_unstrat_descrip.tsv.gz and metadata.tsv from the tutorial)
cut -d$'\t' -f1 map_corrected.txt > 1
cut -d$'\t' -f5 map_corrected.txt > 5
cut -d$'\t' -f6 map_corrected.txt > 6
paste -d$'\t' 1 5 > 1_5
paste -d$'\t' 1_5 6 > metadata.tsv
#SampleID --> SampleID
SampleID    Facility    Genotype
100CHE6KO   PaloAlto    KO
101CHE6WT   PaloAlto    WT

#7.2. ALDEx2
https://bioconductor.org/packages/release/bioc/html/ALDEx2.html

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Workflow using PICRUSt2 for Data_Karoline_16S_2025

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章