Generating report using R (Data_Tam_Metagenomics_2026_Wastewater)

To analyze your metagenomics data with greater control, you can use the R packages phyloseq, ggplot2, vegan, and MaAsLin2. The biobakery workflow has already generated the merged tables you need in the results/ directory.

Below is the complete R code to:

Load the data and reproduce the figures from wmgx_report.pdf (PCoA, Heatmap, Stacked Barplot).
Perform differential abundance analysis between Pre-treatment and Post-treatment.
Statistically test if the differences between time points are significant.

Prerequisites

Make sure you have the following R packages installed:

install.packages(c("phyloseq", "ggplot2", "vegan", "dplyr", "tidyr", "pheatmap"))
# MaAsLin2 is installed via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("MaAsLin2")

Step 1: Load Data and Create Metadata

The biobakery workflow merged your samples into single tables. We will load the MetaPhlAn species table and HUMAnN pathway table, and map your experimental design.

library(phyloseq)
library(ggplot2)
library(dplyr)
library(tidyr)

# 1. Load MetaPhlAn taxonomic profiles
mpa_file <- "results/metaphlan/merged/metaphlan_taxonomic_profiles.tsv"
mpa_raw <- read.table(mpa_file, header=TRUE, sep="\t", comment.char="", check.names=FALSE)
colnames(mpa_raw)[1] <- "clade_name"
rownames(mpa_raw) <- mpa_raw$clade_name
mpa_otu <- mpa_raw[, -1]

# Filter for species level (s__) and clean names
species_idx <- grep("s__", rownames(mpa_otu))
species_otu <- mpa_otu[species_idx, ]
species_names <- gsub(".*s__", "", rownames(species_otu))
species_names <- gsub("\\|.*", "", species_names)
species_names <- gsub("_", " ", species_names) # Replace underscores with spaces
rownames(species_otu) <- species_names

# 2. Create Metadata based on your file names
sample_names <- colnames(species_otu)
metadata <- data.frame(
  row.names = sample_names,
  TimePoint = case_when(
    grepl("Nov", sample_names) ~ "Nov",
    grepl("Jan", sample_names) ~ "Jan",
    grepl("Mar", sample_names) ~ "Mar",
    grepl("May", sample_names) ~ "May"
  ),
  Treatment = ifelse(grepl("Pre", sample_names), "Pre", "Post")
)

# 3. Create phyloseq object for Species
tax_mat <- matrix(rownames(species_otu), ncol=1, dimnames=list(rownames(species_otu), "Species"))
species_ps <- phyloseq(
  otu_table(species_otu, taxa_are_rows=TRUE),
  sample_data(metadata),
  tax_table(tax_mat)
)

# (Optional) Load HUMAnN pathways using the same logic
# path_file <- "results/humann/merged/pathabundance_relab.tsv"
# ... (similar loading steps as above) ...

Step 2: Reproduce Figures from `wmgx_report.pdf`

1. Ordination (PCoA)

Reproduces the Bray-Curtis Principal Coordinate Analysis.

library(vegan)

# Calculate Bray-Curtis distance and perform PCoA
ord <- ordinate(species_ps, method="PCoA", distance="bray")

# Plot
plot_ordination(species_ps, ord, color="Treatment", shape="TimePoint") +
  geom_point(size=4) +
  theme_minimal() +
  labs(title="PCoA of Species Profiles (Bray-Curtis)", 
       subtitle="Colored by Treatment, Shaped by Time Point") +
  theme(legend.position = "right")

2. Heatmap (Top 25 Species)

Reproduces the hierarchical clustering heatmap using Euclidean distance.

# Get top 25 most abundant species
top25 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:25]
ps_top25 <- prune_taxa(top25, species_ps)

# Plot heatmap (method=NULL defaults to hierarchical clustering)
plot_heatmap(ps_top25, method=NULL, distance="euclidean", 
             taxa.labels=TRUE, label.size=3, na.value=NA) +
  scale_fill_viridis_c() +
  labs(title="Heatmap of Top 25 Most Abundant Species")

3. Stacked Barplot (Top 15 Species)

Reproduces the stacked barplot, sorted by total abundance.

# Get top 15 species
top15 <- names(sort(taxa_sums(species_ps), decreasing=TRUE))[1:15]
ps_top15 <- prune_taxa(top15, species_ps)

# Melt data for ggplot2
df <- psmelt(ps_top15)

# Sort samples by the total abundance of these top 15 species (decreasing order)
sample_order <- df %>%
  group_by(Sample) %>%
  summarise(Total_Ab = sum(Abundance)) %>%
  arrange(desc(Total_Ab)) %>%
  pull(Sample)

df$Sample <- factor(df$Sample, levels=sample_order)

# Plot
ggplot(df, aes(x=Sample, y=Abundance, fill=Species)) +
  geom_bar(stat="identity", position="stack") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) +
  labs(title="Relative Abundance of Top 15 Most Abundant Species", 
       y="Relative Abundance") +
  scale_fill_discrete(name = "Species")

Step 3: Differential Abundance Analysis (Pre vs. Post)

To find which specific species or pathways are significantly different between Pre- and Post-treatment, we use MaAsLin2 (the official biobakery recommendation). It allows us to test the effect of Treatment while adjusting for TimePoint as a covariate.

library(MaAsLin2)

# MaAsLin2 requires samples as rows and features as columns
species_t <- t(otu_table(species_ps))
species_df <- as.data.frame(species_t)
species_df$sample_id <- rownames(species_df)

metadata_df <- data.frame(
  sample_id = rownames(metadata),
  metadata
)

# Run MaAsLin2
# We set "Pre" and "Nov" as reference groups. 
# The output coefficients will represent the difference of Post vs Pre.
fit_results <- MaAsLin2(
  input_data = species_df,
  input_metadata = metadata_df,
  output = "maaslin2_species_output",
  fixed_effects = c("Treatment", "TimePoint"),
  reference = c("Treatment,Pre", "TimePoint,Nov"), 
  random_effects = c(),
  correction = "FDR",
  standardize = FALSE,
  min_abundance = 0.001, 
  min_prevalence = 0.1   # Feature must be present in at least 10% of samples
)

# View significant results (q-value < 0.05)
sig_results <- subset(fit_results$results, qval < 0.05 & metadata == "Treatment")
if(nrow(sig_results) > 0) {
  print(sig_results[, c("feature", "coef", "qval")] %>% arrange(qval))
} else {
  print("No significantly different species found at FDR < 0.05.")
}

Note: You can run the exact same code for pathways by replacing species_df with your HUMAnN pathway data frame.

Step 4: Is the difference between time points significant?

To determine if the microbial community composition changes significantly across the different time points (Nov, Jan, Mar, May), we use PERMANOVA (via the adonis2 function in the vegan package). This tests the overall community structure rather than individual features.

library(vegan)

# Calculate Bray-Curtis distance matrix
dist_bc <- vegdist(t(otu_table(species_ps)), method="bray")

# Run PERMANOVA
# This tests the main effects of Treatment and TimePoint, and their interaction
adonis_res <- adonis2(dist_bc ~ Treatment * TimePoint, data=metadata, permutations=999)
print(adonis_res)

How to interpret the `adonis2` output:

Treatment row: Look at the Pr(>F) (p-value). If it is < 0.05, it means there is a statistically significant difference in the overall microbiome between Pre-treatment and Post-treatment.
TimePoint row: Look at the Pr(>F). If it is < 0.05, it means the microbial community composition differs significantly across the months (Nov, Jan, Mar, May).
Treatment:TimePoint row: If this interaction is significant (< 0.05), it means the effect of the treatment depends on the time of year (e.g., the treatment works differently in November compared to May).

Important Note on Statistical Power: Your dataset has 11 samples total. In the PERMANOVA model, this leaves only 3 residual degrees of freedom. While this mathematically allows for testing, the statistical power is low. Significant results are highly trustworthy, but non-significant results might simply be due to the small sample size.

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Generating report using R (Data_Tam_Metagenomics_2026_Wastewater)

Prerequisites

Step 1: Load Data and Create Metadata

Step 2: Reproduce Figures from `wmgx_report.pdf`

1. Ordination (PCoA)

2. Heatmap (Top 25 Species)

3. Stacked Barplot (Top 15 Species)

Step 3: Differential Abundance Analysis (Pre vs. Post)

Step 4: Is the difference between time points significant?

How to interpret the `adonis2` output:

Leave a Reply Cancel reply

Prerequisites

Step 1: Load Data and Create Metadata

Step 2: Reproduce Figures from wmgx_report.pdf

1. Ordination (PCoA)

2. Heatmap (Top 25 Species)

3. Stacked Barplot (Top 15 Species)

Step 3: Differential Abundance Analysis (Pre vs. Post)

Step 4: Is the difference between time points significant?

How to interpret the adonis2 output:

Leave a Reply Cancel reply

Step 2: Reproduce Figures from `wmgx_report.pdf`

How to interpret the `adonis2` output: