Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and *S. epidermidis* Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)

1. Study Design & Metadata

This pipeline analyzes longitudinal nasal swab samples from patients undergoing neurosurgery.

Cohort A (Aneurysm): 15 patients (45 samples)
Cohort H (Hypophysis): 20 patients (60 samples)
Total: 108 patient samples + 3 extraction controls.
Timepoints:
- .1 = Admission (Baseline)
- .2 = Surgery (Intraoperative/Immediate post-op)
- .3 = Discharge (Recovery)

Targeted Sequencing Approaches:

16S rRNA Gene Amplicon Sequencing: For overall nasal microbiome profiling (processed via QIIME1 open-reference picking against SILVA 132).
Epidome Method (S. epidermidis): Targeted amplicon sequencing of the g216 and yycH genes for high-resolution Sequence Type (ST) tracking (processed via DADA2 and custom Epidome Python/R scripts).

2. R Environment & Dependencies

To reproduce this analysis, ensure you are using R (v4.1+) and have the following packages installed.

# Core Microbiome & Phyloseq
install.packages(c("phyloseq", "microbiome", "picante", "vegan", "ape"))
# Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "microbiome", "phyloseq"))

# Advanced Processing & Visualization
install.packages(c("MicrobiotaProcess", "microeco", "ggplot2", "ggpubr", "ggrepel", "aplot"))
install.packages(c("ggh4x", "gghalves", "ggalluvial", "RColorBrewer", "heatmaply", "gplots"))

# Statistical Testing & Data Wrangling
install.packages(c("dplyr", "tidyr", "rstatix", "RVAideMemoire", "openxlsx", "knitr", "kableExtra", "tibble", "reshape2", "coin", "BSDA"))

3. Workflow Part I: Overall Microbiome (16S) Analysis

Scripts involved: MicrobiotaProcess.R (Primary engine for diversity/ordination) & Phyloseq.Rmd (Reporting, DESeq2, and specific statistical tests).

3.1 Data Import & Normalization Strategy

The pipeline splits the raw OTU table (table_even9893.biom) into distinct objects based on the downstream statistical requirements:

ps_filt: Raw counts, low-depth samples removed (min 1000 reads). Used for DESeq2 and Hellinger transformation.
ps_rarefied: Rarefied to an even depth of 9,893 reads (set.seed(9242)). Used for Alpha diversity and UniFrac/Jaccard metrics.
ps_abund_rel: Relative abundance, filtered to keep only taxa with mean relative abundance > 0.1%. Used for clean taxonomic composition plots.
hellinger (via MicrobiotaProcess): Hellinger-transformed ps_filt. Used for Bray-Curtis distances and PCoA.

3.2 Alpha Diversity (Longitudinal & Cross-Sectional)

Metrics: Observed OTUs, Chao1, ACE, Shannon, Simpson, Pielou.
Longitudinal Analysis (Within Cohorts A and H):
- Overall temporal shift: Friedman test (non-parametric repeated measures) accounting for PatientID pairing.
- Post-hoc pairwise: Paired t-tests and Paired Wilcoxon signed-rank tests with Benjamini-Hochberg (BH) correction.
Cross-Sectional Analysis (Cohort A vs. Cohort H):
- All timepoints pooled: Independent t-test / Wilcoxon rank-sum test.
- Surgery timepoint ONLY: To prevent dilution of the surgical effect by baseline/discharge samples, a specific filter (SampleType == "surgery") is applied before running the independent t-test/Wilcoxon.

3.3 Beta Diversity & Ordination

Distance Metric: Bray-Curtis dissimilarity calculated on Hellinger-transformed non-rarefied counts.
Ordination: Principal Coordinates Analysis (PCoA).
Statistical Testing:
- Overall: PERMANOVA (vegan::adonis2) with 9,999 permutations.
- Post-hoc Pairwise: All 15 possible pairwise group comparisons (e.g., A1 vs A2, A1 vs H1, H2 vs H3) are calculated using adonis2. P-values are adjusted using BH (FDR) and Bonferroni corrections.
Outputs: PCoA plots (colored by group, sized by Shannon, alpha by Observed), exported as PNG/PDF/SVG. Pairwise PERMANOVA results exported to Bray_pairwise_PERMANOVA.xlsx.

3.4 Taxonomic Composition

Plots: Stacked barplots and Heatmaps at the Class level (Top 20 taxa).
Grouping: Plots generated for individual samples, grouped by Cohort (A vs H), and grouped by Timepoint (Admission, Surgery, Discharge).

3.5 Differential Abundance Analysis (DESeq2)

Method: Negative Binomial Generalized Linear Model (Wald test) using raw, non-rarefied integer counts (ps_filt).
Comparisons: Six independent pairwise comparisons per cohort:
1. Admission vs. Surgery
2. Admission vs. Discharge
3. Surgery vs. Discharge (Repeated for both Cohort A and Cohort H)
Threshold: Adjusted P-value (BH) < 0.05.
Outputs: Volcano-style plots (log2FoldChange by Family/Class) and Excel tables of significant DEGs for each comparison.

4. Workflow Part II: S. epidermidis Epidome Analysis

Script involved: Phyloseq.Rmd (Sections: ST summary, Binary Prevalence Analysis). Input Data: count_table_seq31_seq37_ST.txt (ASVs classified into STs via g216 and yycH BLAST against the Epidome DB). Normalized to median depth (56,191).

4.1 ST Composition & Alpha Diversity

Composition: Stacked barplots of relative ST abundance across the 108 samples.
Alpha Diversity: Shannon index and Observed STs compared between Cohorts (t-test) and across Timepoints (t-test).
goeBURST: Clonal complex (CC) visualization based on PubMLST allele profiles.

4.2 Continuous Abundance Testing (Specific STs)

Method: Wilcoxon signed-rank test (paired, Admission .1 vs. Discharge .3).
Targets: Analyzed individually for clinically relevant STs (ST2, ST5, ST8, ST23, ST59, ST60, ST73, ST100, ST130, ST215) and in combined clinical groups (e.g., ST5+ST87+ST130+ST210+ST384).
Handling Ties: Evaluated using exact calculations via the coin package and the Sign Test (BSDA) to ensure robustness against zero-inflation.

4.3 🌟 Novel Binary Prevalence Analysis (Cochran’s Q & McNemar)

Because continuous abundance of specific high-risk STs (ST2, ST5, ST23) showed no significant shifts, we converted the data to a binary presence/absence matrix to track strain acquisition and clearance.

Data Transformation: Abundance > 0 becomes 1 (Present); becomes (Absent).
Overall Temporal Shift: Cochran’s Q test (RVAideMemoire::cochran.qtest) to evaluate if the prevalence proportion changes across the 3 timepoints (Admission, Surgery, Discharge) for a given ST.
Pairwise Transitions: Robust McNemar tests for paired binary data (Admission vs Surgery, Surgery vs Discharge, Admission vs Discharge).
- Robustness: Custom R function handles zero-variance (no discordant pairs) by assigning $p = 1.0$ instead of crashing.
Correction: Benjamini-Hochberg (BH) FDR correction applied to the 3 pairwise p-values.
Visualization: Line charts with scatter points showing the Prevalence (%) trend across the 3 timepoints, annotated with $n/N$ (number of positive patients / total patients).
Export: The binary matrix is exported to ST_binary_matrix.xlsx.

5. Execution Instructions (How to Reproduce)

Step 1: Prepare the Directory

Ensure your working directory contains the following raw files:

table_even9893.biom and ../clustering/rep_set.tre (16S data)
../map3_corrected.txt (Metadata)
../count_table_seq31_seq37_ST.txt (Epidome ST counts)
adiv_even.txt, adiv_even_A.txt, adiv_even_H.txt (Pre-calculated 16S alpha metrics from QIIME)
alpha_diversity_metrics_samples_ST.csv (Pre-calculated ST alpha metrics)

Step 2: Run the MicrobiotaProcess Pipeline (Beta/Alpha/Composition)

Open your terminal or RStudio and run the MicrobiotaProcess.R script. This will generate the figures_MP/ and figures_All_Combined/ directories.

Rscript MicrobiotaProcess.R

Outputs: PCoA plots, rarefaction curves, Bray-Curtis distance matrices, pairwise PERMANOVA Excel files, and Class-level heatmaps.

Step 3: Render the Comprehensive HTML Report

Open RStudio, load Phyloseq.Rmd, and render it. This integrates the outputs from Step 2, runs DESeq2, performs the Epidome statistical tests, and generates the final interactive report.

# In R Console:
rmarkdown::render('Phyloseq.Rmd', output_file = 'Phyloseq.html')

Outputs: Phyloseq.html (the master report), figures/ directory (DESeq2 plots, ST prevalence plots, goeBURST), and various .xlsx files for ST binary matrices and DEG tables.

6. Summary of Key Output Files

File / Directory	Description
`Phyloseq.html`	Master Report. Contains all 16S and Epidome statistics, tables, and embedded figures.
`figures_MP/`	High-res PCoA, rarefaction, alpha diversity, and composition heatmaps from `MicrobiotaProcess`.
`figures_All_Combined/`	Pairwise PERMANOVA results (`Bray_pairwise_PERMANOVA.xlsx`) and combined PCoA plots.
`figures/`	DESeq2 volcano plots, ST prevalence trend lines, and Alpha diversity boxplots.
`DEGs_*.xlsx`	Excel tables of differentially abundant OTUs for all 6 pairwise comparisons per cohort.
`ST_binary_matrix.xlsx`	The 0/1 presence/absence matrix for ST2, ST5, and ST23 used for the McNemar analysis.

Pipeline maintained by Jiabin Huang. For questions regarding the Epidome Python scripts or upstream QIIME1/DADA2 processing, refer to the respective bash/R scripts in the epidome/scripts/ directory.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Comprehensive Reproducible Pipeline for Longitudinal Nasal Microbiome and S. epidermidis Epidome Analysis (DATA_B/Data_Luise_Epidome_longitudinal_nose)