To analyze your metagenomics data with greater control, you can use the R packages phyloseq, ggplot2, vegan, and MaAsLin2. The biobakery workflow has already generated the merged tables you need in the results/ directory.
Below is the complete R code to:
- Load the data and reproduce the figures from
wmgx_report.pdf(PCoA, Heatmap, Stacked Barplot). - Perform differential abundance analysis between Pre-treatment and Post-treatment.
- Statistically test if the differences between time points are significant.
Prerequisites
Make sure you have the following R packages installed:
install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr", "tidyr", "pheatmap"))
# MaAsLin2 is installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("MaAsLin2")
Step 1: Load Data and Create Metadata
The biobakery workflow merged your samples into single tables. We will load the MetaPhlAn species table and HUMAnN pathway table, and map your experimental design.
library(phyloseq)
library(ggplot2)
library(dplyr)
library(tidyr)
# 1. Load MetaPhlAn taxonomic profiles
mpa_file <- "results/metaphlan/merged/metaphlan_taxonomic_profiles.tsv"
mpa_raw <- read.table(mpa_file, header=TRUE, sep="\t", comment.char="", check.names=FALSE)
colnames(mpa_raw)[1] <- "clade_name"
rownames(mpa_raw) <- mpa_raw$clade_name
mpa_otu <- mpa_raw[, -1]
# Filter for species level (s__) and clean names
species_idx <- grep("s__", rownames(mpa_otu))
species_otu <- mpa_otu[species_idx, ]
species_names <- gsub(".*s__", "", rownames(species_otu))
species_names <- gsub("\\|.*", "", species_names)
species_names <- gsub("_", " ", species_names) # Replace underscores with spaces
rownames(species_otu) <- species_names
# 2. Create Metadata based on your file names
sample_names <- colnames(species_otu)
metadata <- data.frame(
row.names = sample_names,
TimePoint = case_when(
grepl("Nov", sample_names) ~ "Nov",
grepl("Jan", sample_names) ~ "Jan",
grepl("Mar", sample_names) ~ "Mar",
grepl("May", sample_names) ~ "May"
),
Treatment = ifelse(grepl("Pre", sample_names), "Pre", "Post")
)
# 3. Create phyloseq object for Species
tax_mat <- matrix(rownames(species_otu), ncol=1, dimnames=list(rownames(species_otu), "Species"))
species_ps <- phyloseq(
otu_table(species_otu, taxa_are_rows=TRUE),
sample_data(metadata),
tax_table(tax_mat)
)
# (Optional) Load HUMAnN pathways using the same logic
# path_file <- "results/humann/merged/pathabundance_relab.tsv"
# ... (similar loading steps as above) ...
Step 2: Reproduce Figures from wmgx_report.pdf
1. Ordination (PCoA)
Reproduces the Bray-Curtis Principal Coordinate Analysis.
library(vegan)
# Calculate Bray-Curtis distance and perform PCoA
ord <- ordinate(species_ps, method="PCoA", distance="bray")
# Plot
plot_ordination(species_ps, ord, color="Treatment", shape="TimePoint") +
geom_point(size=4) +
theme_minimal() +
labs(title="PCoA of Species Profiles (Bray-Curtis)",
subtitle="Colored by Treatment, Shaped by Time Point") +
theme(legend.position = "right")
2. Heatmap (Top 25 Species)
Reproduces the hierarchical clustering heatmap using Euclidean distance.
# Get top 25 most abundant species
top25 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:25]
ps_top25 <- prune_taxa(top25, species_ps)
# Plot heatmap (method=NULL defaults to hierarchical clustering)
plot_heatmap(ps_top25, method=NULL, distance="euclidean",
taxa.labels=TRUE, label.size=3, na.value=NA) +
scale_fill_viridis_c() +
labs(title="Heatmap of Top 25 Most Abundant Species")
3. Stacked Barplot (Top 15 Species)
Reproduces the stacked barplot, sorted by total abundance.
# Get top 15 species
top15 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:15]
ps_top15 <- prune_taxa(top15, species_ps)
# Melt data for ggplot2
df <- psmelt(ps_top15)
# Sort samples by the total abundance of these top 15 species (decreasing order)
sample_order <- df %>%
group_by(Sample) %>%
summarise(Total_Ab = sum(Abundance)) %>%
arrange(desc(Total_Ab)) %>%
pull(Sample)
df$Sample <- factor(df$Sample, levels=sample_order)
# Plot
ggplot(df, aes(x=Sample, y=Abundance, fill=Species)) +
geom_bar(stat="identity", position="stack") +
theme_minimal() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
labs(title="Relative Abundance of Top 15 Most Abundant Species",
y="Relative Abundance") +
scale_fill_discrete(name = "Species")
Step 3: Differential Abundance Analysis (Pre vs. Post)
To find which specific species or pathways are significantly different between Pre- and Post-treatment, we use MaAsLin2 (the official biobakery recommendation). It allows us to test the effect of Treatment while adjusting for TimePoint as a covariate.
library(MaAsLin2)
# MaAsLin2 requires samples as rows and features as columns
species_t <- t(otu_table(species_ps))
species_df <- as.data.frame(species_t)
species_df$sample_id <- rownames(species_df)
metadata_df <- data.frame(
sample_id = rownames(metadata),
metadata
)
# Run MaAsLin2
# We set "Pre" and "Nov" as reference groups.
# The output coefficients will represent the difference of Post vs Pre.
fit_results <- MaAsLin2(
input_data = species_df,
input_metadata = metadata_df,
output = "maaslin2_species_output",
fixed_effects = c("Treatment", "TimePoint"),
reference = c("Treatment,Pre", "TimePoint,Nov"),
random_effects = c(),
correction = "FDR",
standardize = FALSE,
min_abundance = 0.001,
min_prevalence = 0.1 # Feature must be present in at least 10% of samples
)
# View significant results (q-value < 0.05)
sig_results <- subset(fit_results$results, qval < 0.05 & metadata == "Treatment")
if(nrow(sig_results) > 0) {
print(sig_results[, c("feature", "coef", "qval")] %>% arrange(qval))
} else {
print("No significantly different species found at FDR < 0.05.")
}
Note: You can run the exact same code for pathways by replacing species_df with your HUMAnN pathway data frame.
Step 4: Is the difference between time points significant?
To determine if the microbial community composition changes significantly across the different time points (Nov, Jan, Mar, May), we use PERMANOVA (via the adonis2 function in the vegan package). This tests the overall community structure rather than individual features.
library(vegan)
# Calculate Bray-Curtis distance matrix
dist_bc <- vegdist(t(otu_table(species_ps)), method="bray")
# Run PERMANOVA
# This tests the main effects of Treatment and TimePoint, and their interaction
adonis_res <- adonis2(dist_bc ~ Treatment * TimePoint, data=metadata, permutations=999)
print(adonis_res)
How to interpret the adonis2 output:
Treatmentrow: Look at thePr(>F)(p-value). If it is< 0.05, it means there is a statistically significant difference in the overall microbiome between Pre-treatment and Post-treatment.TimePointrow: Look at thePr(>F). If it is< 0.05, it means the microbial community composition differs significantly across the months (Nov, Jan, Mar, May).Treatment:TimePointrow: If this interaction is significant (< 0.05), it means the effect of the treatment depends on the time of year (e.g., the treatment works differently in November compared to May).
Important Note on Statistical Power: Your dataset has 11 samples total. In the PERMANOVA model, this leaves only 3 residual degrees of freedom. While this mathematically allows for testing, the statistical power is low. Significant results are highly trustworthy, but non-significant results might simply be due to the small sample size.