Daily Archives: 27 February 2026

Data_Marius_16S_2026

Hamburg:

  • Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
  • Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
  • Group 3 (post-stroke RD): #18-20, #23-25, #27-29
  • Group 4 (post-stroke HFD): #21-22, #26, #30-34
  • Group 5 (Baseline RD): #40-43, #49-53
  • Group 6 (Baseline HFD): #35-39, #44-48

Odense:

  • Group 7 (Baseline Female): #54-69
  • Group 8 (Baseline Male): #96-97, #99-102, #104-109
  • Group 9 (pre-Stroke RD female): #76-78, #80-81
  • Group 10 (pre-Stroke RD male): #114-121
  • Group 11 (pre-Stroke HFD female): #70-75, #79
  • Group 12 (pre-Stroke HFD male): #110-113
  • Group 13 (post-stroke RD female): #88-95
  • Group 14 (post-stroke RD male): #129-136
  • Group 15 (post-stroke HFD female): #82-87
  • Group 16 (post-stroke HFD male): #122-128
  1. Could you please create two independent HTML files with the respective analyses of the samples from Hamburg and Odense?

  2. We are interested in the following comparisons:

a. Hamburg:

  • i. Group 1 vs 2 (PCA and DEGs)
  • ii. Group 3 vs 4 (PCA and DEGs)
  • iii. Group 5 vs 6 (PCA and DEGs)

b. Odense:

  • i. Group 7 vs 8 (PCA and DEGs)
  • ii. Group 9 vs 10 (PCA and DEGs)
  • iii. Group 11 vs 12 (PCA and DEGs)
  • iv. Group 13 vs 14 (PCA and DEGs)
  • v. Group 15 vs 16 (PCA and DEGs)

---
author: ""
date: '`r format(Sys.time(), "%d %m %Y")`'
header-includes:
  - \usepackage{color, fancyvrb}
output:
  rmdformats::readthedown:
    highlight: kate
    number_sections: yes
  pdf_document:
    toc: yes
    toc_depth: 2
    number_sections: yes
---


#install.packages(c("picante", "rmdformats"))
#mamba install -c conda-forge freetype libpng harfbuzz fribidi
#mamba install -c conda-forge r-systemfonts r-svglite r-kableExtra freetype fontconfig harfbuzz fribidi libpng
library(knitr)
library(rmdformats)
library(readxl)
library(dplyr)
library(kableExtra)
library(openxlsx)
library(DESeq2)
library(writexl)

options(max.print="75")
knitr::opts_chunk$set(fig.width=8,
                      fig.height=6,
                      eval=TRUE,
                      cache=TRUE,
                      echo=TRUE,
                      prompt=FALSE,
                      tidy=FALSE,
                      comment=NA,
                      message=FALSE,
                      warning=FALSE)
opts_knit$set(width=85)
# Phyloseq R library
#* Phyloseq web site : https://joey711.github.io/phyloseq/index.html
#* See in particular tutorials for
#    - importing data: https://joey711.github.io/phyloseq/import-data.html
#    - heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html
#STEP_1: RUN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#STEP_2: RUN MicrobiotaProcess_Group_1_2_3_4_5_6.R
#STEP_3: adjust Global PERMANOVA text, Pairwise PERMANOVA table, and PCoA figure.
#STEP_4: RUN_AGAIN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#options(max.print = 1e6)

Data

Import raw data and assign sample key:

#extend qiime2_metadata_for_qza_to_phyloseq.tsv with Diet and Flora
#setwd("~/DATA/Data_Laura_16S_2/core_diversity_e4753")
#map_corrected <- read.csv("qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1)
#knitr::kable(map_corrected) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Prerequisites to be installed

install.packages("dplyr")     # To manipulate dataframes
install.packages("readxl")    # To read Excel files into R
install.packages("ggplot2")   # for high quality graphics
install.packages("heatmaply")
# biocLite() is deprecated since Bioconductor 3.8; use BiocManager instead
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")
#mamba install -c conda-forge r-ggplot2 r-vegan r-data.table
#BiocManager::install("microbiome")
#install.packages("ggpubr")
#install.packages("heatmaply")
library("readxl") # necessary to import the data from Excel file
library("ggplot2") # graphics
library("picante")
library("microbiome") # data analysis and visualisation
library("phyloseq") # also the basis of data object. Data analysis and visualisation
library("ggpubr") # publication quality figures, based on ggplot2
library("dplyr") # data handling, filter and reformat data frames
library("RColorBrewer") # nice color options
library("heatmaply")
library(vegan)
library(gplots)
#install.packages("openxlsx")
library(openxlsx)
library(rstatix)
library(knitr)
library(kableExtra)

Read the data and create phyloseq objects

Three tables are needed

  • OTU
  • Taxonomy
  • Samples

Create analysis-specific phyloseq objects

We maintain one filtered “base” phyloseq object and then derive analysis-specific objects from it. This avoids accidental overwriting and ensures each analysis uses the appropriate data scale (counts vs. relative abundance vs. rarefied counts).

  • ps_raw → Raw imported phyloseq object (integer counts; import stage only)
  • ps_base → ps_raw with taxonomy + sample metadata properly aligned (the clean master object before any filtering)
  • ps_pruned → Optional sample subset of ps_base (e.g., drop unwanted samples by ID/pattern); still integer counts
  • ps_filt → The shared filtered backbone: low-depth samples removed + taxa with zero total counts dropped; remains integer counts

    library(tidyr)

    # For QIIME1
    #ps.ng.tax <- import_biom("./exported_table/feature-table.biom", "./exported-tree/tree.nwk")

    # For QIIME2
    #install.packages("remotes")
    #remotes::install_github("jbisanz/qiime2R")
    #"core_metrics_results/rarefied_table.qza", rarefying performed in the code, therefore import the raw table.
    library(qiime2R)
    ps_raw <- qza_to_phyloseq(
      features =  "dada2_tests/test_49_f240_r240/table.qza",
      tree = "rooted-tree.qza",
      metadata = "qiime2_metadata_for_qza_to_phyloseq.tsv"
    )

    # Refresh/ensure sample_data (optional but keeps things explicit)
    sample_df <- read.csv("./qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1, check.names=FALSE)
    SAM <- sample_data(sample_df, errorIfNULL = TRUE)

    # Add taxonomy table (exported from QIIME2)
    taxonomy <- read.delim("./exported-taxonomy/taxonomy.tsv", sep="\t", header=TRUE)

    # Separate taxonomy string into separate ranks
    taxonomy_df <- taxonomy %>% separate(Taxon, into = c("Domain","Phylum","Class","Order","Family","Genus","Species"), sep = ";", fill = "right", extra = "drop")
    # Use Feature.ID as rownames
    rownames(taxonomy_df) <- taxonomy_df$Feature.ID
    taxonomy_df <- taxonomy_df[, -c(1, ncol(taxonomy_df))]  # Drop Feature.ID and Confidence
    # Create tax_table
    tax_table_final <- phyloseq::tax_table(as.matrix(taxonomy_df))

    # Base object: raw integer counts + metadata + taxonomy
    ps_base <- merge_phyloseq(ps_raw, SAM, tax_table_final)
    print(ps_base)

    #colnames(phyloseq::tax_table(ps_base)) <- c("Domain","Phylum","Class","Order","Family","Genus","Species")
    saveRDS(ps_base, "./ps_base.rds")

Visualize data

# Inspect the base object (raw integer counts)
sample_names(ps_base)
rank_names(ps_base)
sample_variables(ps_base)

# Optional: prune to a naming pattern (avoids hard-coding long sample lists)
#samples_keep <- sample_names(ps_base)
#samples_keep <- samples_keep[grepl("^sample-[A-H]", samples_keep)]
#Deleted samples: sample-NTC-9  sample-PC-1 sample-PC-2 sample-PC-3 sample-PC-4  sample-PC-5
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
samples_keep <- c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

#    Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
#    Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
#    Group 3 (post-stroke RD): #18-20, #23-25, #27-29
#    Group 4 (post-stroke HFD): #21-22, #26, #30-34
#    Group 5 (Baseline RD): #40-43, #49-53
#    Group 6 (Baseline HFD): #35-39, #44-48

ps_pruned <- prune_samples(samples_keep, ps_base)

# Drop taxa absent after pruning
ps_pruned <- prune_taxa(taxa_sums(ps_pruned) > 0, ps_pruned)

# Quick sanity checks
nsamples(ps_base); ntaxa(ps_base)
nsamples(ps_pruned); ntaxa(ps_pruned)
#  #Preprocessing statistics for each sample
#   denoising_stats <- read.csv("dada2_tests/test_59_f235_r245/data_stats.tsv", sep="\t")
#   # Display the table
#   kable(denoising_stats, caption = "Preprocessing statistics for each sample") %>%
#     kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

In the QC filtering step, we removed 14 samples that either fell below the minimum sequencing depth (library size < 12,201 reads) or were not part of the samples_keep selection defined in the previous step. The filtered object (ps_filt) therefore contains only samples meeting the depth cutoff, and taxa were re-pruned to retain only those with nonzero total abundance across the retained samples.

# ------------------------------------------------------------
#   Filter low-depth samples (recommended for all analyses)
# ------------------------------------------------------------
#IMPORTANT_ADAPTION: min_depth needs to be adapted to the rarefaction threshold.
min_depth <- 5610  # <-- adjust to your data / study design, keeps all!
ps_filt <- prune_samples(sample_sums(ps_pruned) >= min_depth, ps_pruned)
ps_filt <- prune_taxa(taxa_sums(ps_filt) > 0, ps_filt)
ps_filt

# Keep a depth summary for reporting / QC
depth_summary <- summary(sample_sums(ps_filt))
depth_summary
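
As an additional QC view, rarefaction curves show whether the chosen depth cutoff sits on the plateau of each sample's accumulation curve. A minimal sketch (vegan is loaded above; min_depth and ps_filt come from the preceding chunk, and taxa are assumed to be stored as rows, as in the QIIME2 import):

```r
# Rarefaction curves as a depth QC check
counts <- t(as(otu_table(ps_filt), "matrix"))   # samples x taxa, integer counts
vegan::rarecurve(counts, step = 100, sample = min_depth,
                 label = FALSE, xlab = "Reads", ylab = "Observed taxa")
abline(v = min_depth, lty = 2)                  # the chosen depth cutoff
```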

Differential abundance (DESeq2) → ps_deseq: non-rarefied integer counts derived from ps_filt, with an optional count-based taxon prefilter (default: taxa with total counts ≥ 10 across all samples)
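
No chunk below actually constructs ps_deseq, so here is a hedged sketch of how it could be derived from ps_filt; the ≥ 10 total-count prefilter and the `~ Group` design are assumptions based on the description above:

```r
# Sketch: DESeq2 input from non-rarefied integer counts
library(DESeq2)

# Optional count-based prefilter: keep taxa with >= 10 total counts
ps_deseq <- prune_taxa(taxa_sums(ps_filt) >= 10, ps_filt)

# Convert to a DESeqDataSet (the design is an assumption; adapt per comparison)
dds <- phyloseq_to_deseq2(ps_deseq, ~ Group)
dds <- DESeq(dds, sfType = "poscounts")  # "poscounts" copes with sparse 16S counts
res <- results(dds, contrast = c("Group", "1", "2"))  # e.g. Group 1 vs 2
```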

From ps_filt (e.g. 5669 taxa and 239 samples), we branch into analysis-ready objects in two directions:

  • Direction 1 for diversity analyses

    • Alpha diversity: ps_rarefied ✅ (common)
    • Beta diversity:
      • Unweighted UniFrac / Jaccard: ps_rarefied ✅ (often recommended)
      • Bray–Curtis / ordination on abundances: ps_rel or Hellinger ✅ (rarefaction optional)
      • Aitchison (CLR): CLR-transformed (non-rarefied) ✅ (no rarefaction)
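
The Aitchison (CLR) branch is only listed above; a minimal sketch, assuming the microbiome package is attached and sample_data contains a Group column:

```r
# Aitchison (CLR) beta diversity on non-rarefied counts
ps_clr  <- microbiome::transform(ps_filt, "clr")          # CLR with internal pseudocount
clr_mat <- t(as(otu_table(ps_clr), "matrix"))             # samples x taxa (taxa stored as rows)
aitch   <- vegan::vegdist(clr_mat, method = "euclidean")  # Euclidean on CLR = Aitchison

# PCoA on the Aitchison distance, plus a quick global PERMANOVA
ord <- ordinate(ps_clr, method = "PCoA", distance = aitch)
plot_ordination(ps_clr, ord, color = "Group")
vegan::adonis2(aitch ~ Group, data = data.frame(sample_data(ps_clr)))
```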

Rarefaction


# RAREFACTION
set.seed(9242)  # Makes the rarefaction (filtering and normalisation) reproducible.
ps_rarefied <- rarefy_even_depth(ps_filt, sample.size = min_depth)

# # NORMALIZE number of reads in each sample using median sequencing depth.
# total = median(sample_sums(ps.ng.tax))
# #> total
# #[1] 42369
# standf = function(x, t=total) round(t * (x / sum(x)))
# ps.ng.tax = transform_sample_counts(ps.ng.tax, standf)
# ps_rel <- microbiome::transform(ps.ng.tax, "compositional")
#
# saveRDS(ps.ng.tax, "./ps.ng.tax.rds")
  • Direction 2 for taxonomic composition plots

    • Taxonomic composition → ps_rel: relative abundance (compositional), computed after sample filtering (e.g. 5669 taxa and 239 samples)
    • Optional cleaner composition plots → ps_abund / ps_abund_rel: taxa filtered for plotting (e.g., keep taxa with mean relative abundance > 0.1%; e.g. 95 taxa and 239 samples); ps_abund = counts, ps_abund_rel = relative abundance (use for visualization, not DESeq2)

For the heatmaps, we focus on the most abundant OTUs by first converting counts to relative abundances within each sample. We then retain only OTUs whose mean relative abundance across all samples exceeds 0.1% (0.001). This leaves 199 OTUs, which makes the heatmap much easier to read.

# 1) Convert to relative abundances
ps_rel <- transform_sample_counts(ps_filt, function(x) x / sum(x))

# 2) Get the logical vector of which OTUs to keep (based on relative abundance)
keep_vector <- phyloseq::filter_taxa(
  ps_rel,
  function(x) mean(x) > 0.001,
  prune = FALSE
)

# 3) Use the TRUE/FALSE vector to subset absolute abundance data
ps_abund <- prune_taxa(names(keep_vector)[keep_vector], ps_filt)

# 4) Normalize the final subset to relative abundances per sample
ps_abund_rel <- transform_sample_counts(
  ps_abund,
  function(x) x / sum(x)
)

cat("ps_pruned:", nsamples(ps_pruned), "\n")
cat("ps_filt:", nsamples(ps_filt), "\n")
cat("ps_abund:", nsamples(ps_abund), "\n")
cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")

Heatmaps

datamat_ = as.data.frame(otu_table(ps_abund))
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
datamat <- datamat_[c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")]
# Remove rows with zero variance
datamat <- datamat[apply(datamat, 1, var) > 0, ]
# Remove cols with zero variance
#datamat <- datamat[, apply(datamat, 2, var) > 0]

# (optional) replace with your actual column names
#colnames(datamat) <- c(
#    "A24040201",...
#  )

# ---------- 0) Sample names from datamat ----------
samplenames <- colnames(datamat)  # optionally strip the prefix: sub("^sample-", "", colnames(datamat))

# ---------- 1) Read metadata ----------
meta_path <- "qiime2_metadata.tsv"
meta <- read.delim(
  meta_path,
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE,
  comment.char = ""
)

# ---------- 2) Identify SampleID + Group columns ----------
sample_id_col <- c("#SampleID","SampleID","sample-id","sampleid")
sample_id_col <- sample_id_col[sample_id_col %in% colnames(meta)][1]
if (is.na(sample_id_col)) sample_id_col <- colnames(meta)[1]

group_col <- c("Group","group","GROUP")
group_col <- group_col[group_col %in% colnames(meta)][1]
if (is.na(group_col)) stop("No 'Group' column found in metadata.")

# ---------- 3) Build lookup: sample -> group ----------
meta_ids <- as.character(meta[[sample_id_col]])  # optionally strip the prefix: sub("^sample-", "", meta[[sample_id_col]])
meta_groups <- trimws(as.character(meta[[group_col]]))
group_map <- setNames(meta_groups, meta_ids)

# Map datamat columns to group labels
groups <- unname(group_map[samplenames])
groups[is.na(groups)] <- "unknown"

# ---------- 4) Color mapping for YOUR labels ----------
#  "positive control" = "#ff7f00",
## (Adjust colors if you prefer different ones)
#color_map <- c(
#  "1" = "#a6cee3",
#  "2" = "#1f78b4",
#  "3" = "#b2df8a",
#  "4" = "#33a02c",
#  "5" = "#fb9a99",
#  "negative control" = "#6a3d9a",
#  "unknown" = "GREY"
#)
# Color map 1..16 + PC
color_map <- c(
  "1"="#a6cee3","2"="#1f78b4","3"="#b2df8a","4"="#33a02c","5"="#fb9a99","6"="#e31a1c"
)
#"7"="#fdbf6f","8"="#ff7f00","9"="#cab2d6","10"="#6a3d9a",
#  "11"="#ffff99","12"="#b15928","13"="#8dd3c7","14"="#bebada",
#  "15"="#80b1d3","16"="#fccde5"
#,"PC"="#000000"

# Assign colors safely
sampleCols <- unname(color_map[groups])
sampleCols[is.na(sampleCols)] <- "GREY"
names(sampleCols) <- samplenames

# ---------- 5) Checks ----------
cat("Unique groups found in datamat:\n")
print(sort(unique(groups)))
cat("\nCounts per group:\n")
print(table(groups, useNA = "ifany"))
cat("\nFirst 10 sample colors:\n")
print(head(sampleCols, 10))

# Optional: list any samples that didn't match metadata
unmatched <- samplenames[groups == "unknown"]
if (length(unmatched) > 0) {
  cat("\nWARNING: Unmatched samples (showing up to 20):\n")
  print(head(unmatched, 20))
}
## --- 1) order columns by group (and then by sample name) ---
#"positive control",
group_order <- c("1","2","3","4","5", "6")
#,"7","8","9","10", "11","12","13","14","15", "16", "PC"
groups_fac  <- factor(groups, levels = group_order, ordered = TRUE)

col_order <- order(groups_fac, samplenames)

datamat_ord     <- datamat[, col_order, drop = FALSE]
groups_ord      <- groups[col_order]
samplenames_ord <- samplenames[col_order]
sampleCols_ord  <- sampleCols[col_order]

stopifnot(identical(colnames(datamat_ord), samplenames_ord))

## group separators
grp_counts <- table(factor(groups_ord, levels = group_order))
grp_breaks <- cumsum(as.integer(grp_counts[grp_counts > 0]))

## --- 2) cluster ROWS using the *ordered* matrix (columns don't matter, but be consistent) ---
hr   <- hclust(as.dist(1 - cor(t(datamat_ord), method = "pearson")), method = "complete")
mycl <- cutree(hr, h = max(hr$height) / 1.08)

mycol_palette <- c("YELLOW","DARKBLUE","DARKORANGE","DARKMAGENTA","DARKCYAN","DARKRED",
                   "MAROON","DARKGREEN","LIGHTBLUE","PINK","MAGENTA","LIGHTCYAN",
                   "LIGHTGREEN","BLUE","ORANGE","CYAN","RED","GREEN")
mycol <- mycol_palette[as.vector(mycl)]

## --- 3) plot using datamat_ord and sampleCols_ord; keep column order fixed ---
library(RColorBrewer)
heatmap_colors <- colorRampPalette(brewer.pal(9, "Blues"))(100)

png("figures/heatmap.png", width = 1800, height = 2400)
heatmap.2(
  as.matrix(datamat_ord),
  Rowv = as.dendrogram(hr),
  Colv = NA,                        # IMPORTANT: do NOT cluster columns
  dendrogram = "row",
  scale = "row",
  trace = "none",
  col = heatmap_colors,
  cexRow = 1.2,
  cexCol = 2.2,                     # <<< bigger x (column) labels
  RowSideColors = mycol,
  ColSideColors = sampleCols_ord,   # IMPORTANT: use ordered colors
  srtCol = 85,
  labRow = row.names(datamat_ord),
  labCol = samplenames_ord,         # optional but explicit
  key = TRUE,
  margins = c(10, 18),              # <<< more space for rotated x labels
  lhei = c(0.7, 15),
  lwid = c(1, 8),
  colsep = grp_breaks,              # optional separators
  sepcolor = "black",
  sepwidth = c(0.02, 0.02)
)
dev.off()

knitr::include_graphics("./figures/heatmap.png")

\pagebreak

# Clean taxonomy prefixes (e.g., "d__", "p__") safely without hard-coded row indices
tt_mat <- as(phyloseq::tax_table(ps_abund_rel), "matrix")
if (is.null(rownames(tt_mat))) rownames(tt_mat) <- phyloseq::taxa_names(ps_abund_rel)

strip_prefix <- function(x) {
  x <- as.character(x)
  sub("^[^_]*__", "", x)
}

for (j in seq_len(ncol(tt_mat))) {
  tt_mat[, j] <- strip_prefix(tt_mat[, j])
}
tt_mat[tt_mat == ""] <- NA_character_

# Ensure taxa names still match
stopifnot(identical(rownames(tt_mat), phyloseq::taxa_names(ps_abund_rel)))

phyloseq::tax_table(ps_abund_rel) <- phyloseq::tax_table(tt_mat)

Taxonomic summary

Bar plots in phylum level

  sample_order <-
  c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

  # create a sample ID column in sample_data (or overwrite an existing one)
  sample_data(ps_abund_rel)$SampleID <- sample_names(ps_abund_rel)

  # set the order as a factor with the desired levels
  sample_data(ps_abund_rel)$SampleID <- factor(
    sample_data(ps_abund_rel)$SampleID,
    levels = sample_order
  )

  #aes(color="Phylum", fill="Phylum") --> aes()
  #ggplot(data=data, aes(x=Sample, y=Abundance, fill=Phylum))
  my_colors <- c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue","royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,           # 85° rotation
        hjust = 1,            # right-justified so labels are neat
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 2))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  ## merge + normalize
  # ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group")
  # ps_abund_rel_group_ <- transform_sample_counts(
  #   ps_abund_rel_group,
  #   function(x) x / sum(x)
  # )

  # make a safe grouping variable (character, non-numeric-looking)
  sample_data(ps_abund_rel)$Group_chr <- paste0("G", as.character(sample_data(ps_abund_rel)$Group))
  ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group_chr")
  ps_abund_rel_group_ <- transform_sample_counts(ps_abund_rel_group, function(x) x / sum(x))
  # (optional) if you want x-axis labels 1..6 again:
  sample_names(ps_abund_rel_group_) <- sub("^G", "", sample_names(ps_abund_rel_group_))

  # desired order on x-axis
  #, "positive control"
  group_order <- c("1", "2", "3", "4", "5", "6")
  #, "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "PC"

  plot_bar(ps_abund_rel_group_, fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +          # <-- set order here
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +   # <- change x-axis label from "Sample" to "Group"
    theme(axis.text = element_text(angle = 0, size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))

  cat("ps_pruned:", nsamples(ps_pruned), "\n")
  cat("ps_filt:", nsamples(ps_filt), "\n")
  cat("ps_abund:", nsamples(ps_abund), "\n")
  cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")
  cat("ps_abund_rel_group:", nsamples(ps_abund_rel_group), "\n")

Bar plots in class level

  my_colors <- c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue","royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 3))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(angle = 0, size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))

\pagebreak

Bar plots in order level

  my_colors <- c(
    "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid",
    "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink",
    "khaki2", "firebrick", "brown1", "darkorange1", "cyan1",
    "royalblue4", "darksalmon", "darkblue","royalblue4",
    "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen",
    "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "brown1", "darkorange1", "cyan1", "darkgrey"
  )

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 4))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(angle = 0, size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))

\pagebreak

Bar plots in family level

  my_colors <- c(
          "#FF0000", "#000000", "#0000FF", "#C0C0C0", "#FFFFFF", "#FFFF00", "#00FFFF", "#FFA500", "#00FF00", "#808080", "#FF00FF", "#800080", "#FDD017", "#0000A0", "#3BB9FF", "#008000", "#800000", "#ADD8E6", "#F778A1", "#800517", "#736F6E", "#F52887", "#C11B17", "#5CB3FF", "#A52A2A", "#FF8040", "#2B60DE", "#736AFF", "#1589FF", "#98AFC7", "#8D38C9", "#307D7E", "#F6358A", "#151B54", "#6D7B8D", "#FDEEF4", "#FF0080", "#F88017", "#2554C7", "#FFF8C6", "#D4A017", "#306EFF", "#151B8D", "#9E7BFF", "#EAC117", "#E0FFFF", "#15317E", "#6C2DC7", "#FBB917", "#FCDFFF", "#15317E", "#254117", "#FAAFBE", "#357EC7"
        )
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 8))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(angle = 0, size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))

\pagebreak

# !!!!NOT_USED!!!!: #Export Relative abundances of Phylum, Class, Order, and Family levels across all samples in Excel files!
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Check for NA or empty values in the taxonomy table
  tax_table_df <- as.data.frame(tax_table(ps))
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Replace NA or empty strings with "Unclassified"
  taxa_names <- ifelse(is.na(taxa_names) | taxa_names == "", paste0("Unclassified_", rank), taxa_names)

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family_old.xlsx")
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Clean the taxonomy table by removing D_[level]__ prefixes
  tax_table_df <- as.data.frame(tax_table(ps))
  tax_table_df[[rank]] <- ifelse(is.na(tax_table_df[[rank]]) | tax_table_df[[rank]] == "",
                                 paste0("Unclassified_", rank),
                                 sub("^D_[0-9]+__(.+)", "\\1", tax_table_df[[rank]]))
  tax_table(ps) <- as.matrix(tax_table_df)  # Update taxonomy table with cleaned names

  # Check for NA or empty values in the taxonomy table
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family.xlsx")

#Sum up the last two columns with the same row.names into a new column, export the file as CSV, delete the two rows before last, then merge the CSVs with csv2xls into an Excel file and adapt the sheet names.
#~/Tools/csv2xls-0.4/csv_to_xls.py relative_abundance_phylum.csv relative_abundance_order.csv relative_abundance_family.csv -d$'\t' -o relative_abundance_phylum_order_family.xls;

\pagebreak

Alpha diversity

Plot the Chao1 richness estimator, observed OTUs, Shannon index, and phylogenetic diversity, grouping samples from the same group together.
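As a cross-check of the QIIME2 metrics loaded below, most of these indices can also be estimated directly from the rarefied phyloseq object. A minimal sketch (assuming `ps_rarefied` is the rarefied phyloseq object used elsewhere in this document; Faith's PD is not provided by phyloseq and would need the tree, e.g. via `picante::pd()`):

```r
library(phyloseq)

# estimate_richness() returns one row per sample with the requested indices
alpha_tab <- estimate_richness(ps_rarefied,
                               measures = c("Observed", "Chao1", "Shannon"))
alpha_tab$sam_name <- rownames(alpha_tab)
head(alpha_tab)
```

These values should agree closely with the QIIME2 exports read in below, up to differences in rarefaction seed and estimator implementation.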

# using rarefied data
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even42369.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_stool/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_swab/rep_set.tre
hmp.meta <- meta(ps_rarefied)   # meta() from the microbiome package
hmp.meta$sam_name <- rownames(hmp.meta)

# ---- enforce Group order (edit if you have different labels) ----
hmp.meta$Group <- factor(as.character(hmp.meta$Group), levels = group_order)

# for QIIME2: read the metrics
shannon <- read.table("exported_alpha/shannon/alpha-diversity.tsv", header=TRUE, sep="\t")  #cp -r ../Data_Karoline_16S_2025/exported_alpha/ .
faith_pd <- read.table("exported_alpha/faith_pd/alpha-diversity.tsv", header=TRUE, sep="\t")
observed <- read.table("exported_alpha/observed_features/alpha-diversity.tsv", header=TRUE, sep="\t")
#chao1 <- read.table("exported_alpha/chao1/alpha-diversity.tsv", header=TRUE, sep="\t")    #TODO: Check the correctness of chao1-calculation.

# Rename for clarity
colnames(shannon) <- c("sam_name", "shannon")
colnames(faith_pd) <- c("sam_name", "PD_whole_tree")
colnames(observed) <- c("sam_name", "observed_otus")
#colnames(chao1) <- c("sam_name", "chao1")

# Merge everything into one data frame
div.df <- Reduce(function(x, y) merge(x, y, by="sam_name"),
                  list(shannon, faith_pd, observed))

# Add the metadata
div.df <- merge(div.df, hmp.meta, by="sam_name")

# Reformat
div.df2 <- div.df[, c("sam_name", "Group", "shannon", "observed_otus", "PD_whole_tree")]
colnames(div.df2) <- c("Sample name", "Group", "Shannon", "OTU", "Phylogenetic Diversity")
write.csv(div.df2, file="alpha_diversities.txt")
knitr::kable(div.df2) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

## ---- Summary stats (report in text/table) -----------------------------------
sum_stats <- div.df2 %>%
  group_by(Group) %>%
  summarise(n = n(),
            mean = mean(Shannon, na.rm = TRUE),
            sd = sd(Shannon, na.rm = TRUE),
            median = median(Shannon, na.rm = TRUE),
            IQR = IQR(Shannon, na.rm = TRUE),
            .groups = "drop")
# knitr::kable(sum_stats, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(sum_stats, "Shannon_Groups_summary.csv", row.names = FALSE)

Overall alpha-diversity: the Kruskal–Wallis test checks whether a given metric (e.g., Shannon) differs among the groups.
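The same overall test can be run with base R's `kruskal.test()`; a minimal sketch on the `div.df2` table built above:

```r
# Base-R equivalent of compare_means(..., method = "kruskal.test")
kw <- kruskal.test(Shannon ~ Group, data = div.df2)
kw$p.value   # p-value for the overall group effect on Shannon diversity
```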

## ---- Overall test -------------------------------------------------
#https://uc-r.github.io/t_test
#We can perform the test with t.test and transform our data and we can also perform the nonparametric test with the wilcox.test function.
#stat.test.Shannon <- compare_means(
# Shannon ~ Group, data = div.df2,
# method = "t.test"
#)
#knitr::kable(stat.test.Shannon) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Nonparametric overall test (robust default)
library(ggpubr)   # compare_means(), ggboxplot(), stat_compare_means()
library(rstatix)  # shapiro_test(), levene_test(), pairwise tests below
overall_kw <- compare_means(Shannon ~ Group, data = div.df2, method = "kruskal.test")
# knitr::kable(overall_kw, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(overall_kw, "Shannon_Groups_overall_Kruskal.csv", row.names = FALSE)
knitr::kable(overall_kw) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
## Optional parametric overall (use if assumptions OK)
overall_anova <- compare_means(Shannon ~ Group, data = div.df2, method = "anova")
write.csv(overall_anova, "Shannon_Groups_overall_ANOVA.csv", row.names = FALSE)

## Assumption checks (optional; helps decide ANOVA vs KW)
shapiro_res <- div.df2 %>% group_by(Group) %>% shapiro_test(Shannon)
levene_res  <- div.df2 %>% levene_test(Shannon ~ Group)
write.csv(shapiro_res, "Shannon_Groups_shapiro.csv", row.names = FALSE)
write.csv(levene_res,  "Shannon_Groups_levene.csv",  row.names = FALSE)

## ---- Pairwise tests with BH correction --------------------------------------
# Wilcoxon (nonparametric) — robust to empty/singleton groups
div_shannon <- div.df2 %>%
  dplyr::filter(!is.na(Shannon), !is.na(Group)) %>%
  dplyr::group_by(Group) %>%
  dplyr::filter(dplyr::n() >= 2) %>%
  dplyr::ungroup()
div_shannon$Group <- droplevels(factor(div_shannon$Group))

pw_wilcox <- tryCatch(
  {
    if (dplyr::n_distinct(div_shannon$Group) < 2) {
      tibble::tibble()
    } else {
      rstatix::pairwise_wilcox_test(div_shannon, Shannon ~ Group,
                                   p.adjust.method = "BH", exact = FALSE)
    }
  },
  error = function(e) tibble::tibble()
)

write.csv(pw_wilcox, "Shannon_Groups_pairwise_wilcox_BH.csv", row.names = FALSE)

# t-tests (parametric, optional)
pw_t <- pairwise_t_test(div.df2, Shannon ~ Group, p.adjust.method = "BH")
write.csv(pw_t, "Shannon_Groups_pairwise_t_BH.csv", row.names = FALSE)
div_df_melt <- reshape2::melt(div.df2)
#head(div_df_melt)

#https://plot.ly/r/box-plots/#horizontal-boxplot
#http://www.sthda.com/english/wiki/print.php?id=177
#https://rpkgs.datanovia.com/ggpubr/reference/as_ggplot.html
#http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/82-ggplot2-easy-way-to-change-graphical-parameters/
#https://plot.ly/r/box-plots/#horizontal-boxplot
#library("gridExtra")
#par(mfrow=c(4,1))
p <- ggboxplot(div_df_melt, x = "Group", y = "value",
              facet.by = "variable",
              scales = "free",
              width = 0.5,
              fill = "gray", legend= "right")
#ggpar(p, xlab = FALSE, ylab = FALSE)
lev <- levels(factor(div_df_melt$Group)) # get the variables
#FITTING4: delete H47(1) in lev
#lev <- lev[-c(3)]
# make a pairwise list that we want to compare.
#my_stat_compare_means
#https://stackoverflow.com/questions/47839988/indicating-significance-with-ggplot2-in-a-boxplot-with-multiple-groups
L.pairs <- combn(seq_along(lev), 2, simplify = FALSE, FUN = function(i) lev[i]) #%>% filter(p.signif != "ns")
my_stat_compare_means  <- function (mapping = NULL, data = NULL, method = NULL, paired = FALSE,
    method.args = list(), ref.group = NULL, comparisons = NULL,
    hide.ns = FALSE, label.sep = ", ", label = NULL, label.x.npc = "left",
    label.y.npc = "top", label.x = NULL, label.y = NULL, tip.length = 0.03,
    symnum.args = list(), geom = "text", position = "identity",
    na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...)
{
    if (!is.null(comparisons)) {
        method.info <- ggpubr:::.method_info(method)
        method <- method.info$method
        method.args <- ggpubr:::.add_item(method.args, paired = paired)
        if (method == "wilcox.test")
            method.args$exact <- FALSE
        pms <- list(...)
        size <- ifelse(is.null(pms$size), 0.3, pms$size)
        color <- ifelse(is.null(pms$color), "black", pms$color)
        map_signif_level <- FALSE
        if (is.null(label))
            label <- "p.format"
        if (ggpubr:::.is_p.signif_in_mapping(mapping) | (label %in% "p.signif")) {
            if (ggpubr:::.is_empty(symnum.args)) {
                map_signif_level <- c(`****` = 1e-04, `***` = 0.001,
                  `**` = 0.01, `*` = 0.05, ns = 1)
            } else {
               map_signif_level <- symnum.args
            }
            if (hide.ns)
                names(map_signif_level)[5] <- " "
        }
        step_increase <- ifelse(is.null(label.y), 0.12, 0)
        ggsignif::geom_signif(comparisons = comparisons, y_position = label.y,
            test = method, test.args = method.args, step_increase = step_increase,
            size = size, color = color, map_signif_level = map_signif_level,
            tip_length = tip.length, data = data)
    } else {
        mapping <- ggpubr:::.update_mapping(mapping, label)
        layer(stat = StatCompareMeans, data = data, mapping = mapping,
            geom = geom, position = position, show.legend = show.legend,
            inherit.aes = inherit.aes, params = list(label.x.npc = label.x.npc,
                label.y.npc = label.y.npc, label.x = label.x,
                label.y = label.y, label.sep = label.sep, method = method,
                method.args = method.args, paired = paired, ref.group = ref.group,
                symnum.args = symnum.args, hide.ns = hide.ns,
                na.rm = na.rm, ...))
    }
}

# Rotate the x-axis labels to 45 degrees and adjust their position
#comparisons = list(c("1", "2"), c("1", "3"), c("1", "4"), c("1", "5"), c("1", "6"), c("2", "3"), c("2", "4"), c("2", "5"), c("2", "6"), c("3", "4"), c("3", "5"), c("3", "6"), c("4", "5"), c("4", "6"), c("5", "6")),
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size=8))
p2 <- p +
stat_compare_means(
  method="wilcox.test",  #"t.test"",
  comparisons = list(c("1", "2"), c("3", "4"), c("5", "6")),
  label = "p.signif",
  p.adjust.method = "BH",
  method.args = list(exact = FALSE),
  symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))
)
##comparisons = L.pairs,
##stat_pvalue_manual
##https://stackoverflow.com/questions/20500706/saving-multiple-ggplots-from-ls-into-one-and-separate-files-in-r
#print(p2)
#ggsave("./figures/alpha_diversity_Group.png", device="png", height = 10, width = 15)
#ggsave("./figures/alpha_diversity_Group.svg", device="svg", height = 10, width = 15)

Pairwise tests: Within each metric (Shannon/OTU/PD), Wilcoxon rank-sum tests compare groups pairwise, with BH correction for multiple testing.
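The pairwise procedure described above has a direct base-R equivalent; a minimal sketch for the Shannon metric (the rstatix version used below returns a tidy table instead of a matrix):

```r
# Base-R equivalent of rstatix::pairwise_wilcox_test with BH correction;
# exact = FALSE is forwarded to wilcox.test to avoid tie warnings
pw <- pairwise.wilcox.test(div.df2$Shannon, div.df2$Group,
                           p.adjust.method = "BH", exact = FALSE)
pw$p.value  # matrix of BH-adjusted p-values for all group pairs
```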

library(dplyr)
library(tidyr)
library(ggpubr)
library(rstatix)
library(writexl)

div.df2 <- div.df2 %>% mutate(Group = factor(Group))

# 1) Long format for all 3 metrics
div_long <- div.df2 %>%
  select(Group, Shannon, OTU, `Phylogenetic Diversity`) %>%
  pivot_longer(
    cols = c(Shannon, OTU, `Phylogenetic Diversity`),
    names_to = "variable",
    values_to = "value"
  ) %>%
  filter(!is.na(Group), !is.na(value))

# 2) Drop groups with <2 samples within each metric
div_long2 <- div_long %>%
  group_by(variable, Group) %>%
  filter(n() >= 2) %>%
  ungroup() %>%
  mutate(Group = droplevels(Group))

# 3) ONE pairwise table for all metrics (BH within each metric)
pw_all <- div_long2 %>%
  group_by(variable) %>%
  pairwise_wilcox_test(value ~ Group, p.adjust.method = "BH", exact = FALSE) %>%
  ungroup()

write.csv(pw_all, "Pairwise_Wilcox_BH_all_metrics.csv", row.names = FALSE)
write_xlsx(pw_all, "Pairwise_Wilcox_BH_all_metrics.xlsx")
knitr::kable(pw_all) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# ---- desired order ----
comp_list <- list(
  c("1","2"),
  c("3","4"),
  c("5","6")
)

comp_order <- tibble(
  group1 = sapply(comp_list, `[`, 1),
  group2 = sapply(comp_list, `[`, 2),
  comp_id = seq_along(comp_list)
)

# 4) base y per facet
y_base <- div_long2 %>%
  group_by(variable) %>%
  summarise(
    y_max = max(value, na.rm = TRUE),
    y_min = min(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(y_span = y_max - y_min)

# 5) match order regardless of direction (A-B same as B-A)
pw_all_ordered <- pw_all %>%
  mutate(g1 = as.character(group1), g2 = as.character(group2)) %>%
  left_join(comp_order, by = c("g1"="group1","g2"="group2")) %>%
  mutate(
    comp_id = ifelse(
      is.na(comp_id),
      comp_order$comp_id[match(
        paste(g2, g1),
        paste(comp_order$group1, comp_order$group2)
      )],
      comp_id
    )
  ) %>%
  select(-g1, -g2) %>%
  filter(!is.na(comp_id))

# 6) assign y.position in your specified order within each facet
pw_all_pos <- pw_all_ordered %>%
  left_join(y_base, by = "variable") %>%
  group_by(variable) %>%
  arrange(comp_id, .by_group = TRUE) %>%
  mutate(
    step = ifelse(y_span > 0, 0.12 * y_span, 0.12),
    y.position = y_max + row_number() * step
  ) %>%
  ungroup()

# 7) plot + annotate
p <- ggboxplot(
  div_long2, x = "Group", y = "value",
  facet.by = "variable", scales = "free",
  width = 0.5, fill = "gray"
) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8)) +
  stat_pvalue_manual(
    pw_all_pos,
    label = "p.adj.signif",
    y.position = "y.position",
    tip.length = 0.01,
    hide.ns = FALSE
  )

print(p)
## MANUALLY selected alpha diversities under host-env after 'cp alpha_diversities.txt selected_alpha_diversities.txt'
#knitr::include_graphics("./figures/alpha_diversity_Group.png")
#selected_alpha_diversities<-read.csv("selected_alpha_diversities.txt",sep="\t")
#knitr::kable(selected_alpha_diversities) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
#!!# Beta diversity (Bray-Curtis distance)
#!!## Group1 vs Group2

#fig.cap="Beta diversity",

#for QIIME1: file:///home/jhuang/DATA/Data_Marius_16S/core_diversity_e42369/bdiv_even42369_Group/weighted_unifrac_boxplots/Group_Stats.txt

# -- for QIIME2: MANUALLY filter permanova-pairwise.csv and save as permanova-pairwise_.csv
# #grep "Permutations" exported_beta_group/permanova-pairwise.csv > permanova-pairwise_.csv
# #grep "Group1,Group2" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# #grep "Group3,Group4" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# beta_diversity_group_stats<-read.csv("permanova-pairwise_.csv",sep=",")
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group1" & beta_diversity_group_stats$Group.2 == "Group2", ]
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group3" & beta_diversity_group_stats$Group.2 == "Group4", ]
# knitr::kable(beta_diversity_group_stats) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

#NOTE: Run this Phyloseq.Rmd first, then run the code in MicrobiotaProcess.R to manually generate Comparison_of_Bray_Distances_Group1_vs_Group2.png and Comparison_of_Bray_Distances_Group3_vs_Group4.png, then re-run this Phyloseq.Rmd!

#knitr::include_graphics("./figures/Comparison_of_Bray_Distances_Group1_vs_Group2.png")

Principal coordinates analysis (PCoA) of beta-diversity based on Bray–Curtis dissimilarity

Beta-diversity (between-sample differences in bacterial community composition) was quantified using Bray–Curtis dissimilarity computed on Hellinger-transformed abundance data. Beta-diversity patterns were visualized using principal coordinates analysis (PCoA) based on the Bray–Curtis dissimilarity matrix.
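The transformation and ordination pipeline described above can be sketched with vegan and base R (a minimal sketch, assuming `ps_filt` is the filtered phyloseq object used in the chunks below):

```r
library(phyloseq)
library(vegan)

# Put samples in rows for vegan (transpose if taxa are stored as rows)
otu <- as(otu_table(ps_filt), "matrix")
if (taxa_are_rows(ps_filt)) otu <- t(otu)

# Hellinger transform (square root of relative abundances), then Bray-Curtis
otu_hel <- decostand(otu, method = "hellinger")
bray <- vegdist(otu_hel, method = "bray")

# PCoA = classical multidimensional scaling on the dissimilarity matrix
pcoa <- cmdscale(bray, k = 2, eig = TRUE)
head(pcoa$points)  # first two PCoA axes per sample
```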

Global PERMANOVA on the Bray–Curtis dissimilarity matrix indicated a significant effect of Group on overall community composition (adonis2: p = 1×10⁻⁴; 9,999 permutations).

# --- Global beta-diversity (PERMANOVA) ---
cat("```text\n")
cat(
"[PERMANOVA result]\n",
"The object contained internal attribute: PCoA ADONIS\n",
"Permutation test for adonis under reduced model\n",
"Permutation: free\n",
"Number of permutations: 9999\n\n",
"vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)\n",
"          Df SumOfSqs      R2      F Pr(>F)\n",
"Model     5   6.1752 0.51654 9.8296  1e-04 ***\n",
"Residual 46   5.7797 0.48346\n",
"Total    51  11.9549 1.00000  \n",
"---\n",
"Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n",
sep = ""
)
cat("```\n")

Pairwise PERMANOVA tests were performed on Bray–Curtis dissimilarity matrices to compare beta-diversity between all pairs of sample groups (metadata column Group). For each pairwise comparison, the Bray–Curtis matrix was subset to samples from the two groups only, and significance was assessed using vegan::adonis2 with 9,999 permutations. Resulting p-values were adjusted for multiple testing using both Benjamini–Hochberg (BH/FDR) and Bonferroni corrections.
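A minimal sketch of this pairwise procedure (the helper name `pairwise_permanova` is illustrative, not from the original pipeline):

```r
library(vegan)

# For each pair of groups: subset the distance matrix, run adonis2,
# then adjust the collected p-values with BH and Bonferroni
pairwise_permanova <- function(dist_mat, groups, nperm = 9999) {
  groups <- factor(groups)
  pairs <- combn(levels(groups), 2, simplify = FALSE)
  res <- lapply(pairs, function(pr) {
    keep <- groups %in% pr
    d_sub <- as.dist(as.matrix(dist_mat)[keep, keep])
    g_sub <- droplevels(groups[keep])
    fit <- adonis2(d_sub ~ g_sub, permutations = nperm)
    data.frame(Group1 = pr[1], Group2 = pr[2],
               R2 = fit$R2[1], p = fit$`Pr(>F)`[1])
  })
  out <- do.call(rbind, res)
  out$p.BH <- p.adjust(out$p, method = "BH")
  out$p.Bonferroni <- p.adjust(out$p, method = "bonferroni")
  out
}
```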

#Manually give the annotation of
Bray_pairwise_PERMANOVA <- read.csv("figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", sep = ",")
knitr::kable(Bray_pairwise_PERMANOVA, caption = "Pairwise PERMANOVA results (distance-based community differences among Group levels).") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# --- Ordination figures ---
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA.png")
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA3.png")

Differential abundance analysis

Differential abundance analysis aims to find differences in the abundance of each taxon between two groups of samples, assigning a significance value to each comparison.

# ------------------------------------------------------------
#  DESeq2: non-rarefied integer counts + optional taxon prefilter
# ------------------------------------------------------------
ps_deseq <- ps_filt

Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

Group 1 vs Group 2

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group1,Group2)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "2")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group1_vs_Group2"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4))+
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 3 vs Group 4

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group3,Group4)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "4")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group3_vs_Group4"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4))+
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 5 vs Group 6

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group5,Group6)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "6")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group5_vs_Group6"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4))+
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

## =========================================================
## MicrobiotaProcess_UPDATED.R
## Uses your objects:
##   - ps_filt        : for alpha + beta diversity (NOT taxa-filtered)
##   - ps_abund_rel   : for cleaner composition plots (taxa filtered for plotting)
##
## Output:
##   - Rarefaction curves + alpha diversity plots
##   - Bray (Hellinger) distance + PCoA + PERMANOVA
##   - Composition plots (Class) + heatmaps (from ps_abund_rel)
## =========================================================

#DEBUG_Update: remotes::install_github("erocoar/gghalves")
#DEBUG: update.packages(ask = FALSE)
suppressPackageStartupMessages({
  library(phyloseq)
  library(MicrobiotaProcess)
  library(microeco)
  library(ggplot2)
  library(aplot)
  library(ggrepel)
  library(vegan)
  library(ggalluvial)
  library(ggh4x)
  library(gghalves)
})

dir.create("figures_MP_Group_1_2_3_4_5_6", showWarnings = FALSE)

## -----------------------------
## 0) Define groups (sample IDs)  ---- EXACTLY as provided
Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

## Ensure unique IDs within each vector
Group1 <- unique(Group1)
Group2 <- unique(Group2)
Group3 <- unique(Group3)
Group4 <- unique(Group4)
Group5 <- unique(Group5)
Group6 <- unique(Group6)
#PC <- unique(PC)

## The group order you want everywhere (plots + stats)
#,"PC"
group_levels <- c("Group1","Group2","Group3","Group4","Group5","Group6")

## -----------------------------
## 0.1) Sample subset (union of all groups)
#, PC
keep_samples <- unique(c(Group1, Group2, Group3, Group4, Group5, Group6))

## -----------------------------
## 0.2) Helper: assign Group as an ordered factor (membership-based)
add_group <- function(mpse_obj) {
  sn <- rownames(colData(mpse_obj))
  grp <- rep(NA_character_, length(sn))

  grp[sn %in% Group1] <- "Group1"
  grp[sn %in% Group2] <- "Group2"
  grp[sn %in% Group3] <- "Group3"
  grp[sn %in% Group4] <- "Group4"
  grp[sn %in% Group5] <- "Group5"
  grp[sn %in% Group6] <- "Group6"
  #grp[sn %in% PC]     <- "PC"

  colData(mpse_obj)$Group <- factor(grp, levels = group_levels)

  # warn if any samples weren't assigned (helps catch typos / missing IDs)
  if (any(is.na(colData(mpse_obj)$Group))) {
    warning(
      "Unassigned samples (not found in any group list): ",
      paste(sn[is.na(colData(mpse_obj)$Group)], collapse = ", ")
    )
  }
  mpse_obj
}

## -----------------------------
## 0.3) Colors (edit to taste)
group_colors <- c(
  "Group1" = "#1f77b4",  # DarkBlue
  "Group2" = "#add8e6",  # LightBlue
  "Group3" = "#006400",  # DarkGreen
  "Group4" = "#90ee90",  # LightGreen
  "Group5" = "#6a0dad",  # DarkPurple
  "Group6" = "#dda0dd"   # LightPurple
)
#  "PC"     = "#7f7f7f"

## =========================================================
## 1) Diversity analysis object (alpha + beta)
##    IMPORTANT: start from ps_filt (all taxa retained)
## =========================================================
ps_div <- prune_samples(keep_samples, ps_filt)
ps_div <- prune_taxa(taxa_sums(ps_div) > 0, ps_div)

mpse_div <- as.MPSE(ps_div)
mpse_div <- add_group(mpse_div)

cat("\n[mpse_div] Group counts:\n")
print(table(colData(mpse_div)$Group, useNA = "ifany"))

## =========================================================
## 2) Alpha diversity (rarefaction-based)
## =========================================================
set.seed(9242)
mpse_div %<>% mp_rrarefy()  # creates RareAbundance
mpse_div %<>% mp_cal_rarecurve(.abundance = RareAbundance, chunks = 400)

# Rarefaction curves: sample + grouped
p_rare_1 <- mpse_div %>% mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe)

p_rare_2 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe, .group = Group) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

p_rare_3 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = "Observe",
                    .group = Group, plot.group = TRUE) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

png("figures_MP_Group_1_2_3_4_5_6/rarefaction_of_samples_or_groups.png", width = 1080, height = 600)
print(p_rare_1 + p_rare_2 + p_rare_3)
dev.off()

# Alpha indices from RareAbundance
mpse_div %<>% mp_cal_alpha(.abundance = RareAbundance)

f_alpha_1 <- mpse_div %>%
  mp_plot_alpha(.group = Group, .alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou)) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

f_alpha_2 <- mpse_div %>%
  mp_plot_alpha(.alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou))

png("figures_MP_Group_1_2_3_4_5_6/alpha_diversity_comparison.png", width = 1400, height = 600)
print(f_alpha_1 / f_alpha_2)
dev.off()
#BUG: Error in `gghalves::geom_half_point()`:
#! Problem while converting geom to grob.
#ℹ Error occurred in the 2nd layer.
#Caused by error in `fun()`:
#! argument "layout" is missing, with no default
#Run `rlang::last_trace()` to see where the error occurred.
#install.packages("remotes")
#remotes::install_github("erocoar/gghalves")


## =========================================================
## 3) Beta diversity (Bray–Curtis on Hellinger)
##    IMPORTANT: use non-rarefied Abundance (not taxa-filtered)
## =========================================================
mpse_div %<>% mp_decostand(.abundance = Abundance)             # creates 'hellinger'
mpse_div %<>% mp_cal_dist(.abundance = hellinger, distmethod = "bray")
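As a reminder of what these two steps compute, here is a base-R toy (hypothetical counts, not part of the pipeline): the Hellinger transform takes square roots of per-sample relative abundances, and Bray–Curtis between two samples is sum(|x − y|) / sum(x + y):

```r
# Hypothetical 2-sample x 3-taxon count matrix
counts <- matrix(c(10,  0, 90,
                   25, 25, 50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("t1", "t2", "t3")))

# Hellinger: sqrt of relative abundances; squared rows sum to 1
hell <- sqrt(counts / rowSums(counts))

# Bray-Curtis dissimilarity between the two transformed samples
x <- hell[1, ]; y <- hell[2, ]
bray_toy <- sum(abs(x - y)) / sum(x + y)
```

Running Bray–Curtis on Hellinger-transformed (rather than raw) abundances downweights dominant taxa and makes the distance less sensitive to sequencing depth.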

# Distance between samples
p_dist_1 <- mpse_div %>% mp_plot_dist(.distmethod = bray)
png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples.png", width = 1000, height = 1000)
print(p_dist_1)
dev.off()

# Distance with group info
p_dist_2 <- mpse_div %>% mp_plot_dist(.distmethod = bray, .group = Group) +
  scale_fill_manual(values = group_colors) +
  scale_color_gradient()

png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples_with_group_info.png", width = 1000, height = 1000)
print(p_dist_2)
dev.off()

# Compare distances between groups (Bray)
p_dist_3 <- mpse_div %>%
  mp_plot_dist(.distmethod = bray, .group = Group, group.test = TRUE, textsize = 6) +
  theme(
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x  = element_text(size = 14),
    axis.text.y  = element_text(size = 14)
  )

png("figures_MP_Group_1_2_3_4_5_6/Comparison_of_Bray_Distances.png", width = 1000, height = 1000)
print(p_dist_3)
dev.off()

## PCoA + PERMANOVA (adonis2)
mpse_div %<>% mp_cal_pcoa(.abundance = hellinger, distmethod = "bray")

mpse_div %<>% mp_adonis(
  .abundance   = hellinger,
  .formula     = ~ Group,
  distmethod   = "bray",
  permutations = 9999,
  action       = "add"
)

cat("\n[PERMANOVA result]\n")
print(mpse_div %>% mp_extract_internal_attr(name = "adonis"))
#[PERMANOVA result]
#The object contained internal attribute: PCoA ADONIS
#Permutation test for adonis under reduced model
#Permutation: free
#Number of permutations: 9999
#
#vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)
#         Df SumOfSqs      R2      F Pr(>F)
#Model     5   6.1752 0.51654 9.8296  1e-04 ***
#Residual 46   5.7797 0.48346
#Total    51  11.9549 1.00000
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
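Reading the table above: R2 is simply the model sum of squares over the total, and F is the ratio of mean squares. Recomputing from the printed values in base R:

```r
# Sums of squares and degrees of freedom from the adonis2 table above
ss_model <- 6.1752; ss_resid <- 5.7797
df_model <- 5;      df_resid <- 46

r2 <- ss_model / (ss_model + ss_resid)               # matches the printed 0.51654
f  <- (ss_model / df_model) / (ss_resid / df_resid)  # matches the printed 9.8296
```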

## PCoA plot
p_pcoa_1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
print(p_pcoa_1)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf", width = 10, height = 8)
print(p_pcoa_1)
dev.off()

## Optional: label points
library(ggrepel)

df <- p_pcoa_1$data
df$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))  # or df$Sample if present

p_pcoa_2 <- p_pcoa_1 +
  geom_text_repel(
    data = df,
    aes(label = ShortLabel),
    size = 4, max.overlaps = 100
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_labeled.png", width = 1200, height = 1000)
print(p_pcoa_2)
dev.off()

## =========================================================
## 4) Composition plots object (clean taxa set for plotting)
##    IMPORTANT: start from ps_abund_rel (plotting-filtered taxa)
## =========================================================
ps_plot <- prune_samples(keep_samples, ps_abund_rel)
ps_plot <- prune_taxa(taxa_sums(ps_plot) > 0, ps_plot)

mpse_plot <- as.MPSE(ps_plot)
mpse_plot <- add_group(mpse_plot)

## Summaries for plotting (Class)
mpse_plot %<>%
  mp_cal_abundance(.abundance = Abundance) %>%               # per sample
  mp_cal_abundance(.abundance = Abundance, .group = Group)   # per group

## Class abundance barplots (top 20)
p_class_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = TRUE
  )

p_class_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = FALSE
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_samples.png", width = 1200, height = 600)
print(p_class_rel / p_class_abs)
dev.off()

## Heatmaps (grouped)
h_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = TRUE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

h_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = FALSE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_heatmap.png", width = 1200, height = 600)
print(aplot::plot_list(gglist = list(h_rel, h_abs), tag_levels = "A"))
dev.off()

## Group-level barplots
p_group_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = TRUE
  ) +
  scale_fill_manual(values = group_colors)

p_group_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = FALSE
  ) +
  scale_fill_manual(values = group_colors)

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_groups.png", width = 1000, height = 1000)
print(p_group_rel / p_group_abs)
dev.off()

cat("\nDONE. Outputs written to ./figures_MP_Group_1_2_3_4_5_6/\n")

## =========================================================
## CONTINUE: Export Bray distances + pairwise PERMANOVA
## (Use mpse_div from the updated script above)
## =========================================================

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(openxlsx)
  library(vegan)
})

## -----------------------------
## Helper: get assay matrix with rows = samples, cols = features
get_sample_by_feature <- function(mpse_obj, assay_name) {
  mat <- assay(mpse_obj, assay_name)

  # sample IDs in MPSE
  samp_ids <- rownames(colData(mpse_obj))

  # If samples are columns, transpose
  if (!is.null(colnames(mat)) && all(samp_ids %in% colnames(mat))) {
    mat <- t(mat)
  }

  # Now enforce row order to match colData
  mat <- mat[samp_ids, , drop = FALSE]
  mat
}
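The orientation check in the helper above, in miniature: if the sample IDs match the matrix columns, transpose so that rows = samples (a base-R toy with hypothetical names):

```r
# Features in rows, samples in columns -- the "wrong" orientation
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("f1", "f2"), c("sA", "sB", "sC")))
samp_ids <- c("sA", "sB", "sC")

# Same test as in get_sample_by_feature(): transpose if samples are columns
if (!is.null(colnames(m)) && all(samp_ids %in% colnames(m))) m <- t(m)
m <- m[samp_ids, , drop = FALSE]   # enforce row order
```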

## -----------------------------
## 1) Recompute Bray–Curtis distance (robust extraction)
hell_mat_sf <- get_sample_by_feature(mpse_div, "hellinger")  # rows=samples, cols=features
bray_dist   <- vegdist(hell_mat_sf, method = "bray")

## sanity checks
stopifnot(!any(is.na(as.matrix(bray_dist))))
stopifnot(!any(as.matrix(bray_dist) < 0, na.rm = TRUE))

## -----------------------------
## 2) Export all pairwise Bray distances to Excel
bray_mat <- as.matrix(bray_dist)
samples  <- rownames(bray_mat)

bray_df <- as.data.frame(as.table(bray_mat)) %>%
  rename(Sample1 = Var1, Sample2 = Var2, BrayDistance = Freq) %>%
  mutate(Sample1 = as.character(Sample1),    # as.table() yields factors;
         Sample2 = as.character(Sample2)) %>%  # '<' is not defined for them
  filter(Sample1 < Sample2) %>%              # keep each unordered pair once
  arrange(Sample1, Sample2)

write.xlsx(bray_df, file = "figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx")

## -----------------------------
## 3) Pairwise PERMANOVA (post-hoc) using vegan::adonis2
meta2 <- as.data.frame(colData(mpse_div))
meta2 <- meta2[rownames(hell_mat_sf), , drop = FALSE]
meta2$Group <- droplevels(factor(meta2$Group))  # ensure factor; droplevels() errors on character

groups <- levels(meta2$Group)

res_list <- list()
k <- 1

for (i in 1:(length(groups) - 1)) {
  for (j in (i + 1):length(groups)) {

    g1 <- groups[i]
    g2 <- groups[j]

    idx <- meta2$Group %in% c(g1, g2)
    sub_meta <- meta2[idx, , drop = FALSE]

    sub_dist <- as.dist(as.matrix(bray_dist)[idx, idx])

    ad <- adonis2(sub_dist ~ Group, data = sub_meta, permutations = 9999)

    res_list[[k]] <- data.frame(
      group1 = g1,
      group2 = g2,
      F      = ad$F[1],
      R2     = ad$R2[1],
      p      = ad$`Pr(>F)`[1]
    )
    k <- k + 1
  }
}

pair_res <- do.call(rbind, res_list)
pair_res$p_adj_BH         <- p.adjust(pair_res$p, method = "BH")
#pair_res$p_adj_Bonferroni <- p.adjust(pair_res$p, method = "bonferroni")
pair_res$p.adj.format <- format.pval(pair_res$p_adj_BH, digits = 3, eps = 1e-300)
pair_res$p.adj.signif <- cut(
  pair_res$p_adj_BH,
  breaks = c(-Inf, 1e-4, 1e-3, 1e-2, 0.05, Inf),
  labels = c("****", "***", "**", "*", "ns"),
  right = TRUE
)
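The cut() call maps adjusted p-values to significance stars; with right = TRUE each upper break is inclusive, so p = 1e-4 still earns "****". A quick base-R boundary check:

```r
# Hypothetical adjusted p-values straddling each break point
p_demo <- c(5e-5, 1e-4, 5e-4, 0.02, 0.05, 0.2)
stars <- cut(p_demo,
             breaks = c(-Inf, 1e-4, 1e-3, 1e-2, 0.05, Inf),
             labels = c("****", "***", "**", "*", "ns"),
             right  = TRUE)
```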

write.csv(pair_res, "figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", row.names = FALSE)

cat("\nPairwise PERMANOVA written to: figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv\n")
cat("Bray distance table written to: figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx\n")
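The BH correction applied to the pairwise p-values above can be reproduced in a few lines of base R (sort p decreasing, scale by n/rank, take the running minimum), which is what p.adjust(method = "BH") does internally:

```r
# Toy p-values (hypothetical); manual Benjamini-Hochberg adjustment
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.5)
n  <- length(p_raw)
o  <- order(p_raw, decreasing = TRUE)
bh <- pmin(1, cummin(n / seq(n, 1) * p_raw[o]))[order(o)]
```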

## =========================================================
## OPTIONAL: PCoA plot where point size = Shannon and alpha = Observe
## (requires mpse_div already has Shannon/Observe from mp_cal_alpha)
## =========================================================

p_pcoa_sizealpha <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  scale_size_continuous(range = c(2, 6)) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.png", width = 1200, height = 1000)
print(p_pcoa_sizealpha)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.pdf", width = 10, height = 8)
print(p_pcoa_sizealpha)
dev.off()

## =========================================================
## Ensure all three ordination outputs exist:
##   - PCoA  : basic (color/group)
##   - PCoA2 : size = Shannon, alpha = Observe
##   - PCoA3 : same as PCoA2 + sample labels
##
## Assumes you already have:
##   - mpse_div with: pcoa, Group, Shannon, Observe
##   - group_colors defined
## =========================================================

p1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
print(p1)   # explicit print() so the plot is written even when source()d
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf")
print(p1)
dev.off()

p2 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA2.png", width = 1200, height = 1000)
print(p2)
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA2.pdf")
print(p2)
dev.off()

library(ggrepel)
colData(mpse_div)$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))  # accessor instead of direct @ slot access

p3 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  geom_text_repel(aes(label = ShortLabel), size = 5, max.overlaps = 100) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA3.png", width = 1200, height = 1000)
print(p3)
dev.off()
svg("figures_MP_Group_1_2_3_4_5_6/PCoA3.svg", width = 12, height = 10)
print(p3)
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA3.pdf")
print(p3)
dev.off()