How to search for motifs in Acinetobacter?

To search for those short peptide motifs (like “DIKDY” or “DLSDY”) across all available Acinetobacter genomes, you can use the online BLAST tool on the NCBI website. Since these are short sequences (5-7 amino acids), you’ll need to adjust the search settings to find meaningful matches.

Here is a step-by-step guide on how to perform this search effectively:

1. Access NCBI BLAST and Choose the Correct Program

First, go to the NCBI BLAST home page and select the appropriate search tool. For protein motifs, you should use blastp.

2. Enter Your Query Sequence

In the “Enter Query Sequence” box, you can directly type or paste your short peptide sequence, such as DIKDY.

  • Important: For each search, use only one of the short peptide motifs you mentioned (e.g., first search for “DIKDY”, then later for “DNYQFDSK”, etc.).

3. Select the Database and Restrict by Organism

This is the most important step to ensure you search only Acinetobacter genomes.

  • In the “Choose Search Set” section, select the Non-redundant protein sequences (nr) database. This is the comprehensive database of all protein sequences at NCBI.
  • In the “Organism” box, type “Acinetobacter” (or “Acinetobacter baumannii”).
  • Select the desired option from the dropdown menu that appears (e.g., Acinetobacter (taxid:469)). This will restrict your search to sequences from this genus only.

4. Optimize the Algorithm for Short Sequences

Standard BLAST settings are optimized for long sequences and may miss short motifs. You must change the algorithm to one designed for short inputs.

  • On the NCBI web interface, expand “Algorithm parameters” at the bottom of the page and make sure “Automatically adjust parameters for short input sequences” is checked.
  • If you run BLAST+ locally instead, use the blastp-short task (blastp -task blastp-short), which is specifically optimized for query sequences shorter than about 30 residues.

5. Adjust Advanced Parameters (Optional but Recommended)

You may also want to adjust the Expect threshold (E-value) to get more relevant results. A higher E-value is more lenient and can find more distant matches, which is useful for short, conserved motifs.

  • Click on “Algorithm parameters” to expand the advanced settings.
  • Increase the Expect threshold to 20000 or 200000. The default value of 10 is too stringent for a 5-amino-acid query.
  • Consider turning off the low-complexity region filter, as short motifs might be filtered out by default.
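As a rough sanity check on why short queries need such a lenient threshold, you can estimate how often a 5-residue motif occurs by pure chance. This is only a back-of-envelope sketch assuming uniform amino-acid frequencies (real proteomes deviate from this) and a hypothetical database size:

```r
# Probability that one random 5-residue window exactly matches a fixed motif,
# assuming all 20 amino acids are equally likely (a simplification)
p_match <- (1 / 20)^5                  # ~3.1e-07

# Expected number of chance exact hits in a database of n_residues total residues
n_residues <- 1e8                      # hypothetical database size
expected_hits <- n_residues * p_match  # ~31 exact matches expected by chance
expected_hits
```

Because dozens of exact matches are expected by chance alone in a large database, statistical significance (and hence a low E-value) is unattainable for a 5-mer; raising the threshold is what lets BLAST report the hits at all.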

6. Run the Search and Interpret the Results

Click the “BLAST” button at the bottom of the page.

  • The results page will show you all protein sequences in Acinetobacter that contain your exact peptide motif.
  • Pay attention to the “Query Cover” column. For a perfect 5-amino acid match, the query cover will be 100%. If you allowed for mismatches, it might be lower.
  • Look at the “Scientific Name” column to see which Acinetobacter species and strains contain the motif.

Summary of Settings for Your Search

  • Program: blastp (protein query against a protein database)
  • Database: Non-redundant protein sequences (nr), the most comprehensive public database
  • Organism: Acinetobacter (taxid:469), to limit results to your genus of interest
  • Task: blastp-short, optimized for short, peptide-like queries
  • Expect threshold: 20,000, to increase sensitivity so short, exact matches are reported

By following these steps for each of the eight peptide motifs (four from AdeJ and four from AdeB), you will be able to see how conserved they are across different Acinetobacter strains and identify which specific genomes contain these regions of interest.
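If you prefer to script the eight searches rather than repeat them on the website, the same settings translate directly to BLAST+ on the command line. The sketch below is an assumption-laden illustration: it presumes BLAST+ is installed and on your PATH, and that you have already built a local protein database (the name acinetobacter_prot is hypothetical) with makeblastdb from downloaded Acinetobacter proteomes:

```r
# Sketch: loop the short-motif searches through local BLAST+ (assumes the
# "blastp" binary is installed and "acinetobacter_prot" is a database you built)
motifs <- c("DIKDY", "DLSDY")  # extend with the remaining motifs

for (m in motifs) {
  query <- tempfile(fileext = ".fasta")
  writeLines(c(paste0(">", m), m), query)   # one-sequence FASTA for this motif
  system2("blastp", c(
    "-task",   "blastp-short",              # task tuned for <30-residue queries
    "-db",     "acinetobacter_prot",        # hypothetical local database name
    "-query",  query,
    "-evalue", "200000",                    # lenient threshold for 5-mers
    "-seg",    "no",                        # disable low-complexity filtering
    "-outfmt", "6",                         # tabular output
    "-out",    paste0(m, "_hits.tsv")
  ))
}
```

Each motif then gets its own tab-separated hit table, which is easy to compare across the eight searches.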

I hope this step-by-step guide helps you perform your analysis successfully. If the results are too broad or too narrow, feel free to adjust the E-value or experiment with allowing a small number of mismatches in the algorithm parameters.

Data_Marius_16S_2026

Hamburg:

  • Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
  • Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
  • Group 3 (post-stroke RD): #18-20, #23-25, #27-29
  • Group 4 (post-stroke HFD): #21-22, #26, #30-34
  • Group 5 (Baseline RD): #40-43, #49-53
  • Group 6 (Baseline HFD): #35-39, #44-48

Odense:

  • Group 7 (Baseline Female): #54-69
  • Group 8 (Baseline Male): #96-97, #99-102, #104-109
  • Group 9 (pre-Stroke RD female): #76-78, #80-81
  • Group 10 (pre-Stroke RD male): #114-121
  • Group 11 (pre-Stroke HFD female): #70-75, #79
  • Group 12 (pre-Stroke HFD male): #110-113
  • Group 13 (post-stroke RD female): #88-95
  • Group 14 (post-stroke RD male): #129-136
  • Group 15 (post-stroke HFD female): #82-87
  • Group 16 (post-stroke HFD male): #122-128
  1. Could you please create two independent HTML files with the respective analyses of the samples from Odense and Hamburg?

  2. We are interested in the following comparisons:

a. Hamburg:

  • i. Group 1 vs 2 (PCA and DEGs)
  • ii. Group 3 vs 4 (PCA and DEGs)
  • iii. Group 5 vs 6 (PCA and DEGs)

b. Odense:

  • i. Group 7 vs 8 (PCA and DEGs)
  • ii. Group 9 vs 10 (PCA and DEGs)
  • iii. Group 11 vs 12 (PCA and DEGs)
  • iv. Group 13 vs 14 (PCA and DEGs)
  • v. Group 15 vs 16 (PCA and DEGs)

author: ""
date: '`r format(Sys.time(), "%d %m %Y")`'
header-includes:
  - \usepackage{color, fancyvrb}
output:
  rmdformats::readthedown:
    highlight: kate
    number_sections: yes
  pdf_document:
    toc: yes
    toc_depth: 2
    number_sections: yes


#install.packages(c("picante", "rmdformats"))
#mamba install -c conda-forge freetype libpng harfbuzz fribidi
#mamba install -c conda-forge r-systemfonts r-svglite r-kableExtra freetype fontconfig harfbuzz fribidi libpng
library(knitr)
library(rmdformats)
library(readxl)
library(dplyr)
library(kableExtra)
library(openxlsx)
library(DESeq2)
library(writexl)

options(max.print="75")
knitr::opts_chunk$set(fig.width=8,
                      fig.height=6,
                      eval=TRUE,
                      cache=TRUE,
                      echo=TRUE,
                      prompt=FALSE,
                      tidy=FALSE,
                      comment=NA,
                      message=FALSE,
                      warning=FALSE)
opts_knit$set(width=85)
# Phyloseq R library
#* Phyloseq web site : https://joey711.github.io/phyloseq/index.html
#* See in particular tutorials for
#    - importing data: https://joey711.github.io/phyloseq/import-data.html
#    - heat maps: https://joey711.github.io/phyloseq/plot_heatmap-examples.html
#STEP_1: RUN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#STEP_2: RUN MicrobiotaProcess_Group_1_2_3_4_5_6.R
#STEP_3: adjust Global PERMANOVA text, Pairwise PERMANOVA table, and PCoA figure.
#STEP_4: RUN_AGAIN rmarkdown::render('Phyloseq_Hamburg.Rmd',output_file='Phyloseq_Hamburg.html')
#options(max.print = 1e6)

Data

Import raw data and assign sample key:

#extend qiime2_metadata_for_qza_to_phyloseq.tsv with Diet and Flora
#setwd("~/DATA/Data_Laura_16S_2/core_diversity_e4753")
#map_corrected <- read.csv("qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1)
#knitr::kable(map_corrected) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Prerequisites to be installed

install.packages("dplyr")     # To manipulate dataframes
install.packages("readxl")    # To read Excel files into R
install.packages("ggplot2")   # for high quality graphics
install.packages("heatmaply")
# biocLite() is deprecated; install phyloseq via BiocManager instead
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("phyloseq")
#mamba install -c conda-forge r-ggplot2 r-vegan r-data.table
#BiocManager::install("microbiome")
#install.packages("ggpubr")
#install.packages("heatmaply")
library("readxl") # necessary to import the data from Excel file
library("ggplot2") # graphics
library("picante")
library("microbiome") # data analysis and visualisation
library("phyloseq") # also the basis of data object. Data analysis and visualisation
library("ggpubr") # publication quality figures, based on ggplot2
library("dplyr") # data handling, filter and reformat data frames
library("RColorBrewer") # nice color options
library("heatmaply")
library(vegan)
library(gplots)
#install.packages("openxlsx")
library(openxlsx)
library(rstatix)
library(knitr)
library(kableExtra)

Read the data and create phyloseq objects

Three tables are needed

  • OTU
  • Taxonomy
  • Samples

Create analysis-specific phyloseq objects

We maintain one filtered “base” phyloseq object and then derive analysis-specific objects from it. This avoids accidental overwriting and ensures each analysis uses the appropriate data scale (counts vs. relative abundance vs. rarefied counts).

  • ps_raw → Raw imported phyloseq object (integer counts; import stage only)
  • ps_base → ps_raw with taxonomy + sample metadata properly aligned (the clean master object before any filtering)
  • ps_pruned → Optional sample subset of ps_base (e.g., drop unwanted samples by ID/pattern); still integer counts
  • ps_filt → The shared filtered backbone: low-depth samples removed + taxa with zero total counts dropped; remains integer counts

    library(tidyr)

    # For QIIME1
    #ps.ng.tax <- import_biom("./exported_table/feature-table.biom", "./exported-tree/tree.nwk")

    # For QIIME2
    #install.packages("remotes")
    #remotes::install_github("jbisanz/qiime2R")
    #"core_metrics_results/rarefied_table.qza", rarefying performed in the code, therefore import the raw table.
    library(qiime2R)
    ps_raw <- qza_to_phyloseq(
      features =  "dada2_tests/test_49_f240_r240/table.qza",
      tree = "rooted-tree.qza",
      metadata = "qiime2_metadata_for_qza_to_phyloseq.tsv"
    )

    # Refresh/ensure sample_data (optional but keeps things explicit)
    sample_df <- read.csv("./qiime2_metadata_for_qza_to_phyloseq.tsv", sep="\t", row.names=1, check.names=FALSE)
    SAM <- sample_data(sample_df, errorIfNULL = TRUE)

    # Add taxonomy table (exported from QIIME2)
    taxonomy <- read.delim("./exported-taxonomy/taxonomy.tsv", sep="\t", header=TRUE)

    # Separate taxonomy string into separate ranks
    taxonomy_df <- taxonomy %>% separate(Taxon, into = c("Domain","Phylum","Class","Order","Family","Genus","Species"), sep = ";", fill = "right", extra = "drop")
    # Use Feature.ID as rownames
    rownames(taxonomy_df) <- taxonomy_df$Feature.ID
    taxonomy_df <- taxonomy_df[, -c(1, ncol(taxonomy_df))]  # Drop Feature.ID and Confidence
    # Create tax_table
    tax_table_final <- phyloseq::tax_table(as.matrix(taxonomy_df))

    # Base object: raw integer counts + metadata + taxonomy
    ps_base <- merge_phyloseq(ps_raw, SAM, tax_table_final)
    print(ps_base)

    #colnames(phyloseq::tax_table(ps_base)) <- c("Domain","Phylum","Class","Order","Family","Genus","Species")
    saveRDS(ps_base, "./ps_base.rds")

Visualize data

# Inspect the base object (raw integer counts)
sample_names(ps_base)
rank_names(ps_base)
sample_variables(ps_base)

# Optional: prune to a naming pattern (avoids hard-coding long sample lists)
#samples_keep <- sample_names(ps_base)
#samples_keep <- samples_keep[grepl("^sample-[A-H]", samples_keep)]
#Deleted samples: sample-NTC-9  sample-PC-1 sample-PC-2 sample-PC-3 sample-PC-4  sample-PC-5
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
samples_keep <- c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

#    Group 1 (pre-Stroke RD): #1-3, #6-8, #10-12
#    Group 2 (pre-Stroke HFD): #4-5, #9, #13-17
#    Group 3 (post-stroke RD): #18-20, #23-25, #27-29
#    Group 4 (post-stroke HFD): #21-22, #26, #30-34
#    Group 5 (Baseline RD): #40-43, #49-53
#    Group 6 (Baseline HFD): #35-39, #44-48

ps_pruned <- prune_samples(samples_keep, ps_base)

# Drop taxa absent after pruning
ps_pruned <- prune_taxa(taxa_sums(ps_pruned) > 0, ps_pruned)

# Quick sanity checks
nsamples(ps_base); ntaxa(ps_base)
nsamples(ps_pruned); ntaxa(ps_pruned)
#  #Preprocessing statistics for each sample
#   denoising_stats <- read.csv("dada2_tests/test_59_f235_r245/data_stats.tsv", sep="\t")
#   # Display the table
#   kable(denoising_stats, caption = "Preprocessing statistics for each sample") %>%
#     kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

In the QC filtering step, we removed 14 samples that either fell below the minimum sequencing depth (library size < 12,201 reads) or were pre-specified for exclusion via sample_ids (as defined in the previous step). The filtered object (ps_filt) therefore contains only samples meeting the depth cutoff, and taxa were re-pruned to retain only those with nonzero total abundance across the retained samples.

# ------------------------------------------------------------
#   Filter low-depth samples (recommended for all analyses)
# ------------------------------------------------------------
#IMPORTANT_ADAPTION: min_depth needs to be adapted to the rarefaction threshold.
min_depth <- 5610  # <-- adjust to your data / study design, keeps all!
ps_filt <- prune_samples(sample_sums(ps_pruned) >= min_depth, ps_pruned)
ps_filt <- prune_taxa(taxa_sums(ps_filt) > 0, ps_filt)
ps_filt

# Keep a depth summary for reporting / QC
depth_summary <- summary(sample_sums(ps_filt))
depth_summary

  • Differential abundance (DESeq2) → ps_deseq: non-rarefied integer counts derived from ps_filt, with an optional count-based taxon prefilter (default: taxa with total counts ≥ 10 across all samples)
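The ps_deseq branch described above can be sketched as follows. This is an illustration, not the finalized analysis code: it assumes a "Group" column exists in the sample metadata, and the design formula should be adjusted to your actual comparisons:

```r
# Sketch: build the DESeq2 input from the filtered integer counts
ps_deseq <- prune_taxa(taxa_sums(ps_filt) >= 10, ps_filt)    # count-based prefilter

dds <- phyloseq::phyloseq_to_deseq2(ps_deseq, ~ Group)       # assumes a "Group" column
dds <- DESeq2::DESeq(dds, sfType = "poscounts")              # poscounts suits zero-heavy data
res <- DESeq2::results(dds)
```

DESeq2 performs its own internal normalization, which is why this branch deliberately starts from non-rarefied counts.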

From ps_filt (e.g. 5669 taxa and 239 samples), we branch into analysis-ready objects in two directions:

  • Direction 1 for diversity analyses

    • Alpha diversity: ps_rarefied ✅ (common)
    • Beta diversity:
      • Unweighted UniFrac / Jaccard: ps_rarefied ✅ (often recommended)
      • Bray–Curtis / ordination on abundances: ps_rel or Hellinger ✅ (rarefaction optional)
      • Aitchison (CLR): CLR-transformed (non-rarefied) ✅ (no rarefaction)
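The beta-diversity inputs listed above can be derived from ps_filt along these lines (a sketch; the CLR transform assumes the microbiome package is available, and ps_rel is also created later in the composition section):

```r
# Bray-Curtis / abundance-based ordination: relative abundances per sample
ps_rel <- transform_sample_counts(ps_filt, function(x) x / sum(x))

# Hellinger transform: square root of relative abundances
ps_hel <- transform_sample_counts(ps_filt, function(x) sqrt(x / sum(x)))

# Aitchison distance: CLR-transformed counts (no rarefaction)
ps_clr <- microbiome::transform(ps_filt, "clr")
```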

Rarefaction


# RAREFACTION
set.seed(9242)  # make rarefaction (random subsampling) reproducible
ps_rarefied <- rarefy_even_depth(ps_filt, sample.size = min_depth)

# # NORMALIZE number of reads in each sample using median sequencing depth.
# total = median(sample_sums(ps.ng.tax))
# #> total
# #[1] 42369
# standf = function(x, t=total) round(t * (x / sum(x)))
# ps.ng.tax = transform_sample_counts(ps.ng.tax, standf)
# ps_rel <- microbiome::transform(ps.ng.tax, "compositional")
#
# saveRDS(ps.ng.tax, "./ps.ng.tax.rds")
  • Direction 2 for taxonomic composition plots

    • Taxonomic composition → ps_rel: relative abundance (compositional) computed after sample filtering (e.g. 5669 taxa and 239 samples)
    • Optional cleaner composition plots → ps_abund / ps_abund_rel: taxa filtered for plotting (e.g., keep taxa with mean relative abundance > 0.1%; e.g. 95 taxa and 239 samples); ps_abund = counts, ps_abund_rel = relative abundance (use for visualization, not DESeq2)

For the heatmaps, we focus on the most abundant OTUs by first converting counts to relative abundances within each sample. We then retain only OTUs whose mean relative abundance across all samples exceeds 0.1% (0.001). This leaves 199 OTUs, which makes the heatmap much easier to read.

# 1) Convert to relative abundances
ps_rel <- transform_sample_counts(ps_filt, function(x) x / sum(x))

# 2) Get the logical vector of which OTUs to keep (based on relative abundance)
keep_vector <- phyloseq::filter_taxa(
  ps_rel,
  function(x) mean(x) > 0.001,
  prune = FALSE
)

# 3) Use the TRUE/FALSE vector to subset absolute abundance data
ps_abund <- prune_taxa(names(keep_vector)[keep_vector], ps_filt)

# 4) Normalize the final subset to relative abundances per sample
ps_abund_rel <- transform_sample_counts(
  ps_abund,
  function(x) x / sum(x)
)

cat("ps_pruned:", nsamples(ps_pruned), "\n")
cat("ps_filt:", nsamples(ps_filt), "\n")
cat("ps_abund:", nsamples(ps_abund), "\n")
cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")

Heatmaps

datamat_ = as.data.frame(otu_table(ps_abund))
#"sample-PC-1","sample-PC-2","sample-PC-3","sample-PC-4","sample-PC-5"
datamat <- datamat_[c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")]
# Remove rows with zero variance
datamat <- datamat[apply(datamat, 1, var) > 0, ]
# Remove cols with zero variance
#datamat <- datamat[, apply(datamat, 2, var) > 0]

# (optional) replace with your actual column names
#colnames(datamat) <- c(
#    "A24040201",...
#  )

# ---------- 0) Sample names from datamat ----------
samplenames <- colnames(datamat)  # kept as-is; use sub("^sample-", "", colnames(datamat)) to strip the prefix

# ---------- 1) Read metadata ----------
meta_path <- "qiime2_metadata.tsv"
meta <- read.delim(
  meta_path,
  sep = "\t",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE,
  comment.char = ""
)

# ---------- 2) Identify SampleID + Group columns ----------
sample_id_col <- c("#SampleID","SampleID","sample-id","sampleid")
sample_id_col <- sample_id_col[sample_id_col %in% colnames(meta)][1]
if (is.na(sample_id_col)) sample_id_col <- colnames(meta)[1]

group_col <- c("Group","group","GROUP")
group_col <- group_col[group_col %in% colnames(meta)][1]
if (is.na(group_col)) stop("No 'Group' column found in metadata.")

# ---------- 3) Build lookup: sample -> group ----------
meta_ids <- as.character(meta[[sample_id_col]])  # kept as-is; use sub("^sample-", "", ...) to strip the prefix
meta_groups <- trimws(as.character(meta[[group_col]]))
group_map <- setNames(meta_groups, meta_ids)

# Map datamat columns to group labels
groups <- unname(group_map[samplenames])
groups[is.na(groups)] <- "unknown"

# ---------- 4) Color mapping for YOUR labels ----------
#  "positive control" = "#ff7f00",
## (Adjust colors if you prefer different ones)
#color_map <- c(
#  "1" = "#a6cee3",
#  "2" = "#1f78b4",
#  "3" = "#b2df8a",
#  "4" = "#33a02c",
#  "5" = "#fb9a99",
#  "negative control" = "#6a3d9a",
#  "unknown" = "GREY"
#)
# Color map 1..16 + PC
color_map <- c(
  "1"="#a6cee3","2"="#1f78b4","3"="#b2df8a","4"="#33a02c","5"="#fb9a99","6"="#e31a1c"
)
#"7"="#fdbf6f","8"="#ff7f00","9"="#cab2d6","10"="#6a3d9a",
#  "11"="#ffff99","12"="#b15928","13"="#8dd3c7","14"="#bebada",
#  "15"="#80b1d3","16"="#fccde5"
#,"PC"="#000000"

# Assign colors safely
sampleCols <- unname(color_map[groups])
sampleCols[is.na(sampleCols)] <- "GREY"
names(sampleCols) <- samplenames

# ---------- 5) Checks ----------
cat("Unique groups found in datamat:\n")
print(sort(unique(groups)))
cat("\nCounts per group:\n")
print(table(groups, useNA = "ifany"))
cat("\nFirst 10 sample colors:\n")
print(head(sampleCols, 10))

# Optional: list any samples that didn't match metadata
unmatched <- samplenames[groups == "unknown"]
if (length(unmatched) > 0) {
  cat("\nWARNING: Unmatched samples (showing up to 20):\n")
  print(head(unmatched, 20))
}
## --- 1) order columns by group (and then by sample name) ---
#"positive control",
group_order <- c("1","2","3","4","5", "6")
#,"7","8","9","10", "11","12","13","14","15", "16", "PC"
groups_fac  <- factor(groups, levels = group_order, ordered = TRUE)

col_order <- order(groups_fac, samplenames)

datamat_ord     <- datamat[, col_order, drop = FALSE]
groups_ord      <- groups[col_order]
samplenames_ord <- samplenames[col_order]
sampleCols_ord  <- sampleCols[col_order]

stopifnot(identical(colnames(datamat_ord), samplenames_ord))

## group separators
grp_counts <- table(factor(groups_ord, levels = group_order))
grp_breaks <- cumsum(as.integer(grp_counts[grp_counts > 0]))

## --- 2) cluster ROWS using the *ordered* matrix (columns don't matter, but be consistent) ---
hr   <- hclust(as.dist(1 - cor(t(datamat_ord), method = "pearson")), method = "complete")
mycl <- cutree(hr, h = max(hr$height) / 1.08)

mycol_palette <- c("YELLOW","DARKBLUE","DARKORANGE","DARKMAGENTA","DARKCYAN","DARKRED",
                   "MAROON","DARKGREEN","LIGHTBLUE","PINK","MAGENTA","LIGHTCYAN",
                   "LIGHTGREEN","BLUE","ORANGE","CYAN","RED","GREEN")
mycol <- mycol_palette[as.vector(mycl)]

## --- 3) plot using datamat_ord and sampleCols_ord; keep column order fixed ---
library(RColorBrewer)
heatmap_colors <- colorRampPalette(brewer.pal(9, "Blues"))(100)

png("figures/heatmap.png", width = 1800, height = 2400)
heatmap.2(
  as.matrix(datamat_ord),
  Rowv = as.dendrogram(hr),
  Colv = NA,                        # IMPORTANT: do NOT cluster columns
  dendrogram = "row",
  scale = "row",
  trace = "none",
  col = heatmap_colors,
  cexRow = 1.2,
  cexCol = 2.2,                     # <<< bigger x (column) labels
  RowSideColors = mycol,
  ColSideColors = sampleCols_ord,   # IMPORTANT: use ordered colors
  srtCol = 85,
  labRow = row.names(datamat_ord),
  labCol = samplenames_ord,         # optional but explicit
  key = TRUE,
  margins = c(10, 18),              # <<< more space for rotated x labels
  lhei = c(0.7, 15),
  lwid = c(1, 8),
  colsep = grp_breaks,              # optional separators
  sepcolor = "black",
  sepwidth = c(0.02, 0.02)
)
dev.off()

knitr::include_graphics("./figures/heatmap.png")

\pagebreak

# Clean taxonomy prefixes (e.g., "d__", "p__") safely without hard-coded row indices
tt_mat <- as(phyloseq::tax_table(ps_abund_rel), "matrix")
if (is.null(rownames(tt_mat))) rownames(tt_mat) <- phyloseq::taxa_names(ps_abund_rel)

strip_prefix <- function(x) {
  x <- as.character(x)
  sub("^[^_]*__", "", x)
}

for (j in seq_len(ncol(tt_mat))) {
  tt_mat[, j] <- strip_prefix(tt_mat[, j])
}
tt_mat[tt_mat == ""] <- NA_character_

# Ensure taxa names still match
stopifnot(identical(rownames(tt_mat), phyloseq::taxa_names(ps_abund_rel)))

phyloseq::tax_table(ps_abund_rel) <- phyloseq::tax_table(tt_mat)

Taxonomic summary

Bar plots in phylum level

  sample_order <-
  c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12", "sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17",  "sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29", "sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34", "sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53", "sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

  # create a sample ID column in sample_data (or overwrite an existing one)
  sample_data(ps_abund_rel)$SampleID <- sample_names(ps_abund_rel)

  # set the order as a factor with the desired levels
  sample_data(ps_abund_rel)$SampleID <- factor(
    sample_data(ps_abund_rel)$SampleID,
    levels = sample_order
  )

  #aes(color="Phylum", fill="Phylum") --> aes()
  #ggplot(data=data, aes(x=Sample, y=Abundance, fill=Phylum))
  my_colors <- c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue","royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,           # 85° rotation
        hjust = 1,            # right-justified so labels are neat
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 2))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  ## merge + normalize
  # ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group")
  # ps_abund_rel_group_ <- transform_sample_counts(
  #   ps_abund_rel_group,
  #   function(x) x / sum(x)
  # )

  # make a safe grouping variable (character, non-numeric-looking)
  sample_data(ps_abund_rel)$Group_chr <- paste0("G", as.character(sample_data(ps_abund_rel)$Group))
  ps_abund_rel_group  <- merge_samples(ps_abund_rel, "Group_chr")
  ps_abund_rel_group_ <- transform_sample_counts(ps_abund_rel_group, function(x) x / sum(x))
  # (optional) if you want x-axis labels 1..6 again:
  sample_names(ps_abund_rel_group_) <- sub("^G", "", sample_names(ps_abund_rel_group_))

  # desired order on x-axis
  #, "positive control"
  group_order <- c("1", "2", "3", "4", "5", "6")
  #, "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "PC"

  plot_bar(ps_abund_rel_group_, fill = "Phylum") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +          # <-- set order here
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +   # <- change x-axis label from "Sample" to "Group"
    theme(axis.text = element_text(size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))  # hjust/vjust must sit inside element_text()

  cat("ps_pruned:", nsamples(ps_pruned), "\n")
  cat("ps_filt:", nsamples(ps_filt), "\n")
  cat("ps_abund:", nsamples(ps_abund), "\n")
  cat("ps_abund_rel:", nsamples(ps_abund_rel), "\n")
  cat("ps_abund_rel_group:", nsamples(ps_abund_rel_group), "\n")

Bar plots in class level

  my_colors <- c("darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink", "khaki2", "firebrick", "brown1", "darkorange1", "cyan1", "royalblue4", "darksalmon", "darkblue","royalblue4", "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen", "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1", "brown1", "darkorange1", "cyan1", "darkgrey")
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 3))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Class") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))  # hjust/vjust must sit inside element_text()

\pagebreak

Bar plots in order level

  my_colors <- c(
    "darkblue", "darkgoldenrod1", "darkseagreen", "darkorchid",
    "darkolivegreen1", "lightskyblue", "darkgreen", "deeppink",
    "khaki2", "firebrick", "brown1", "darkorange1", "cyan1",
    "royalblue4", "darksalmon", "darkblue","royalblue4",
    "dodgerblue3", "steelblue1", "lightskyblue", "darkseagreen",
    "darkgoldenrod1", "darkseagreen", "darkorchid", "darkolivegreen1",
    "brown1", "darkorange1", "cyan1", "darkgrey"
  )

  plot_bar(ps_abund_rel, x = "SampleID", fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 4))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Order") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))  # hjust/vjust must sit inside element_text()

\pagebreak

Bar plots in family level

  my_colors <- c(
          "#FF0000", "#000000", "#0000FF", "#C0C0C0", "#FFFFFF", "#FFFF00", "#00FFFF", "#FFA500", "#00FF00", "#808080", "#FF00FF", "#800080", "#FDD017", "#0000A0", "#3BB9FF", "#008000", "#800000", "#ADD8E6", "#F778A1", "#800517", "#736F6E", "#F52887", "#C11B17", "#5CB3FF", "#A52A2A", "#FF8040", "#2B60DE", "#736AFF", "#1589FF", "#98AFC7", "#8D38C9", "#307D7E", "#F6358A", "#151B54", "#6D7B8D", "#FDEEF4", "#FF0080", "#F88017", "#2554C7", "#FFF8C6", "#D4A017", "#306EFF", "#151B8D", "#9E7BFF", "#EAC117", "#E0FFFF", "#15317E", "#6C2DC7", "#FBB917", "#FCDFFF", "#15317E", "#254117", "#FAAFBE", "#357EC7"
        )
  plot_bar(ps_abund_rel, x = "SampleID", fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_fill_manual(values = my_colors) +
    theme(
      axis.text.x = element_text(
        angle = 85,
        hjust = 1,
        vjust = 1,
        size  = 16,
        colour = "black"
      ),
      axis.text.y = element_text(size = 12, colour = "black"),
      legend.position = "bottom"
    ) +
    guides(fill = guide_legend(nrow = 8))

Aggregate samples by group and normalize read counts within each group to correct for differences in sequencing depth.

  plot_bar(ps_abund_rel_group_, fill = "Family") +
    geom_bar(stat = "identity", position = "stack") +
    scale_x_discrete(limits = group_order) +
    scale_fill_manual(values = my_colors) +
    labs(x = "Group") +
    theme(axis.text = element_text(size = 16, colour = "black"),
          axis.text.x = element_text(angle = -15, hjust = 0, vjust = 1))  # hjust/vjust must sit inside element_text()

\pagebreak

# !!!!NOT_USED!!!!: #Export Relative abundances of Phylum, Class, Order, and Family levels across all samples in Excel files!
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Check for NA or empty values in the taxonomy table
  tax_table_df <- as.data.frame(tax_table(ps))
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Replace NA or empty strings with "Unclassified"
  taxa_names <- ifelse(is.na(taxa_names) | taxa_names == "", paste0("Unclassified_", rank), taxa_names)

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order_old.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family_old.xlsx")
library(phyloseq)
library(writexl)
library(dplyr)

# Function to check for NA or empty values in a taxonomic rank
check_taxa_names <- function(tax_table, rank) {
  tax_values <- tax_table[[rank]]
  na_count <- sum(is.na(tax_values) | tax_values == "")
  cat("Number of NA or empty values in", rank, ":", na_count, "\n")
  if (na_count > 0) {
    cat("Taxa with NA or empty", rank, ":\n")
    print(tax_values[is.na(tax_values) | tax_values == ""])
  }
}

# Function to create and save relative abundance table for a given taxonomic rank with normalization
save_taxa_abundance <- function(ps, rank, output_file) {
  # Clean the taxonomy table by removing D_[level]__ prefixes
  tax_table_df <- as.data.frame(tax_table(ps))
  tax_table_df[[rank]] <- ifelse(is.na(tax_table_df[[rank]]) | tax_table_df[[rank]] == "",
                                 paste0("Unclassified_", rank),
                                 sub("^D_[0-9]+__(.+)", "\\1", tax_table_df[[rank]]))
  tax_table(ps) <- as.matrix(tax_table_df)  # Update taxonomy table with cleaned names

  # Check for NA or empty values in the taxonomy table
  check_taxa_names(tax_table_df, rank)

  # Aggregate OTUs by taxonomic rank, removing taxa with NA at the specified rank
  ps_glom <- tax_glom(ps, taxrank = rank, NArm = TRUE)

  # Extract OTU table (relative abundances)
  otu_table <- as.data.frame(otu_table(ps_glom))

  # Normalize each column to sum to 1
  otu_table_normalized <- apply(otu_table, 2, function(x) x / sum(x))

  # Convert matrix to data frame
  otu_table_normalized <- as.data.frame(otu_table_normalized)

  # Verify column sums are approximately 1.0
  col_sums <- colSums(otu_table_normalized)
  if (any(abs(col_sums - 1) > 1e-6)) {
    warning("Column sums in ", rank, " table do not equal 1.0: ", paste(col_sums, collapse = ", "))
  } else {
    cat("Column sums for ", rank, " table are all approximately 1.0\n")
  }

  # Extract taxonomy table and get the specified rank for taxa names
  tax_table_glom <- as.data.frame(tax_table(ps_glom))
  taxa_names <- tax_table_glom[[rank]]

  # Ensure unique row names
  taxa_names <- make.unique(taxa_names)

  # Set row names to taxa names (for internal reference)
  rownames(otu_table_normalized) <- taxa_names

  # Add taxa names as a column
  otu_table_normalized[[rank]] <- taxa_names

  # Reorder to move rank column to the first position
  otu_table_normalized <- otu_table_normalized[, c(rank, setdiff(names(otu_table_normalized), rank))]

  # Rename sample columns by removing "sample-" prefix
  names(otu_table_normalized)[-1] <- sub("sample-", "", names(otu_table_normalized)[-1])

  # Write the data frame to Excel, including the rank column
  write_xlsx(otu_table_normalized, path = output_file)
  cat("Saved", output_file, "\n")
}

# # Verify column sums of ps_abund_rel
# col_sums <- colSums(otu_table(ps_abund_rel))
# cat("Column sums of ps_abund_rel:\n")
# summary(col_sums)

## Generate Excel files for Phylum, Class, Order, and Family levels with normalization and renamed sample names
#save_taxa_abundance(ps_abund_rel, "Phylum", "relative_abundance_phylum.xlsx")
#save_taxa_abundance(ps_abund_rel, "Class", "relative_abundance_class.xlsx")
#save_taxa_abundance(ps_abund_rel, "Order", "relative_abundance_order.xlsx")
#save_taxa_abundance(ps_abund_rel, "Family", "relative_abundance_family.xlsx")

#Sum up the last two columns with the same row.names into a new column, export the file as CSV, delete the two rows before last, then merge the CSVs with csv2xls into one Excel file and adapt the sheet names.
#~/Tools/csv2xls-0.4/csv_to_xls.py relative_abundance_phylum.csv relative_abundance_order.csv relative_abundance_family.csv -d$'\t' -o relative_abundance_phylum_order_family.xls;

\pagebreak

Alpha diversity

Plot Chao1 richness estimator, Observed OTUs, Shannon index, and Phylogenetic diversity. Regroup together samples from the same group.
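Two of these metrics are easy to sketch in base R for a single sample (toy counts; the real per-sample values are read from the QIIME2 exports below):

```r
counts <- c(30, 20, 50)            # hypothetical taxon counts in one sample
p <- counts / sum(counts)          # relative abundances
shannon <- -sum(p * log(p))        # Shannon index (natural log)
observed <- sum(counts > 0)        # observed richness: number of non-zero taxa
```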

# using rarefied data
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even42369.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_stool/rep_set.tre
#gunzip table_even4753.biom.gz
#alpha_diversity.py -i table_even4753.biom --metrics chao1,observed_otus,shannon,PD_whole_tree -o adiv_even.txt -t ../clustering_swab/rep_set.tre
hmp.meta <- meta(ps_rarefied)
hmp.meta$sam_name <- rownames(hmp.meta)

# ---- enforce Group order (edit if you have different labels) ----
hmp.meta$Group <- factor(as.character(hmp.meta$Group), levels = group_order)

# for QIIME2: read the alpha-diversity metrics
shannon <- read.table("exported_alpha/shannon/alpha-diversity.tsv", header=TRUE, sep="\t")  #cp -r ../Data_Karoline_16S_2025/exported_alpha/ .
faith_pd <- read.table("exported_alpha/faith_pd/alpha-diversity.tsv", header=TRUE, sep="\t")
observed <- read.table("exported_alpha/observed_features/alpha-diversity.tsv", header=TRUE, sep="\t")
#chao1 <- read.table("exported_alpha/chao1/alpha-diversity.tsv", header=TRUE, sep="\t")    #TODO: Check the correctness of chao1-calculation.

# Rename columns for clarity
colnames(shannon) <- c("sam_name", "shannon")
colnames(faith_pd) <- c("sam_name", "PD_whole_tree")
colnames(observed) <- c("sam_name", "observed_otus")
#colnames(chao1) <- c("sam_name", "chao1")

# Merge everything into one data frame
div.df <- Reduce(function(x, y) merge(x, y, by="sam_name"),
                  list(shannon, faith_pd, observed))

# Add the metadata
div.df <- merge(div.df, hmp.meta, by="sam_name")

# Reformat
div.df2 <- div.df[, c("sam_name", "Group", "shannon", "observed_otus", "PD_whole_tree")]
colnames(div.df2) <- c("Sample name", "Group", "Shannon", "OTU", "Phylogenetic Diversity")
write.csv(div.df2, file="alpha_diversities.txt")
knitr::kable(div.df2) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

## ---- Summary stats (report in text/table) -----------------------------------
sum_stats <- div.df2 %>%
  group_by(Group) %>%
  summarise(n = n(),
            mean = mean(Shannon, na.rm = TRUE),
            sd = sd(Shannon, na.rm = TRUE),
            median = median(Shannon, na.rm = TRUE),
            IQR = IQR(Shannon, na.rm = TRUE),
            .groups = "drop")
# knitr::kable(sum_stats, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(sum_stats, "Shannon_Groups_summary.csv", row.names = FALSE)

Overall alpha-diversity: Kruskal–Wallis test checks whether the metric (e.g., Shannon) differs among groups overall.
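A base-R equivalent of this overall test looks like the following (toy data; the real call below uses `compare_means(Shannon ~ Group, ...)` on `div.df2`):

```r
# Three toy groups of Shannon values
shannon <- c(2.1, 2.3, 2.0, 3.1, 3.4, 3.0, 2.6, 2.7, 2.5)
group <- factor(rep(c("G1", "G2", "G3"), each = 3))

kw <- kruskal.test(shannon ~ group)   # rank-based overall test
kw$p.value                            # < 0.05 here: at least one group differs
```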

## ---- Overall test -------------------------------------------------
#https://uc-r.github.io/t_test
#We can perform the test with t.test and transform our data and we can also perform the nonparametric test with the wilcox.test function.
#stat.test.Shannon <- compare_means(
# Shannon ~ Group, data = div.df2,
# method = "t.test"
#)
#knitr::kable(stat.test.Shannon) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# Nonparametric overall test (robust default)
overall_kw <- compare_means(Shannon ~ Group, data = div.df2, method = "kruskal.test")
# knitr::kable(overall_kw, digits = 3) %>%
# kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"))
write.csv(overall_kw, "Shannon_Groups_overall_Kruskal.csv", row.names = FALSE)
knitr::kable(overall_kw) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
## Optional parametric overall (use if assumptions OK)
overall_anova <- compare_means(Shannon ~ Group, data = div.df2, method = "anova")
write.csv(overall_anova, "Shannon_Groups_overall_ANOVA.csv", row.names = FALSE)

## Assumption checks (optional; helps decide ANOVA vs KW)
shapiro_res <- div.df2 %>% group_by(Group) %>% shapiro_test(Shannon)
levene_res  <- div.df2 %>% levene_test(Shannon ~ Group)
write.csv(shapiro_res, "Shannon_Groups_shapiro.csv", row.names = FALSE)
write.csv(levene_res,  "Shannon_Groups_levene.csv",  row.names = FALSE)

## ---- Pairwise tests with BH correction --------------------------------------
# Wilcoxon (nonparametric)
# Wilcoxon (nonparametric) — robust to empty/singleton groups
div_shannon <- div.df2 %>%
  dplyr::filter(!is.na(Shannon), !is.na(Group)) %>%
  dplyr::group_by(Group) %>%
  dplyr::filter(dplyr::n() >= 2) %>%
  dplyr::ungroup()
div_shannon$Group <- droplevels(factor(div_shannon$Group))

pw_wilcox <- tryCatch(
  {
    if (dplyr::n_distinct(div_shannon$Group) < 2) {
      tibble::tibble()
    } else {
      rstatix::pairwise_wilcox_test(div_shannon, Shannon ~ Group,
                                   p.adjust.method = "BH", exact = FALSE)
    }
  },
  error = function(e) tibble::tibble()
)

write.csv(pw_wilcox, "Shannon_Groups_pairwise_wilcox_BH.csv", row.names = FALSE)

# t-tests (parametric, optional)
pw_t <- pairwise_t_test(div.df2, Shannon ~ Group, p.adjust.method = "BH")
write.csv(pw_t, "Shannon_Groups_pairwise_t_BH.csv", row.names = FALSE)
div_df_melt <- reshape2::melt(div.df2)
#head(div_df_melt)

#https://plot.ly/r/box-plots/#horizontal-boxplot
#http://www.sthda.com/english/wiki/print.php?id=177
#https://rpkgs.datanovia.com/ggpubr/reference/as_ggplot.html
#http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/82-ggplot2-easy-way-to-change-graphical-parameters/
#https://plot.ly/r/box-plots/#horizontal-boxplot
#library("gridExtra")
#par(mfrow=c(4,1))
p <- ggboxplot(div_df_melt, x = "Group", y = "value",
              facet.by = "variable",
              scales = "free",
              width = 0.5,
              fill = "gray", legend= "right")
#ggpar(p, xlab = FALSE, ylab = FALSE)
lev <- levels(factor(div_df_melt$Group)) # get the variables
#FITTING4: delete H47(1) in lev
#lev <- lev[-c(3)]
# make a pairwise list that we want to compare.
#my_stat_compare_means
#https://stackoverflow.com/questions/47839988/indicating-significance-with-ggplot2-in-a-boxplot-with-multiple-groups
L.pairs <- combn(seq_along(lev), 2, simplify = FALSE, FUN = function(i) lev[i]) #%>% filter(p.signif != "ns")
my_stat_compare_means  <- function (mapping = NULL, data = NULL, method = NULL, paired = FALSE,
    method.args = list(), ref.group = NULL, comparisons = NULL,
    hide.ns = FALSE, label.sep = ", ", label = NULL, label.x.npc = "left",
    label.y.npc = "top", label.x = NULL, label.y = NULL, tip.length = 0.03,
    symnum.args = list(), geom = "text", position = "identity",
    na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...)
{
    if (!is.null(comparisons)) {
        method.info <- ggpubr:::.method_info(method)
        method <- method.info$method
        method.args <- ggpubr:::.add_item(method.args, paired = paired)
        if (method == "wilcox.test")
            method.args$exact <- FALSE
        pms <- list(...)
        size <- ifelse(is.null(pms$size), 0.3, pms$size)
        color <- ifelse(is.null(pms$color), "black", pms$color)
        map_signif_level <- FALSE
        if (is.null(label))
            label <- "p.format"
        if (ggpubr:::.is_p.signif_in_mapping(mapping) | (label %in% "p.signif")) {
            if (ggpubr:::.is_empty(symnum.args)) {
                map_signif_level <- c(`****` = 1e-04, `***` = 0.001,
                  `**` = 0.01, `*` = 0.05, ns = 1)
            } else {
               map_signif_level <- symnum.args
            }
            if (hide.ns)
                names(map_signif_level)[5] <- " "
        }
        step_increase <- ifelse(is.null(label.y), 0.12, 0)
        ggsignif::geom_signif(comparisons = comparisons, y_position = label.y,
            test = method, test.args = method.args, step_increase = step_increase,
            size = size, color = color, map_signif_level = map_signif_level,
            tip_length = tip.length, data = data)
    } else {
        mapping <- ggpubr:::.update_mapping(mapping, label)
        layer(stat = StatCompareMeans, data = data, mapping = mapping,
            geom = geom, position = position, show.legend = show.legend,
            inherit.aes = inherit.aes, params = list(label.x.npc = label.x.npc,
                label.y.npc = label.y.npc, label.x = label.x,
                label.y = label.y, label.sep = label.sep, method = method,
                method.args = method.args, paired = paired, ref.group = ref.group,
                symnum.args = symnum.args, hide.ns = hide.ns,
                na.rm = na.rm, ...))
    }
}

# Rotate the x-axis labels to 45 degrees and adjust their position
#comparisons = list(c("1", "2"), c("1", "3"), c("1", "4"), c("1", "5"), c("1", "6"), c("2", "3"), c("2", "4"), c("2", "5"), c("2", "6"), c("3", "4"), c("3", "5"), c("3", "6"), c("4", "5"), c("4", "6"), c("5", "6")),
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust=1, size=8))
p2 <- p +
  stat_compare_means(
    method = "wilcox.test",  # or "t.test"
    comparisons = list(c("1", "2"), c("3", "4"), c("5", "6")),
    label = "p.signif",
    p.adjust.method = "BH",
    method.args = list(exact = FALSE),
    symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))
  )
##comparisons = L.pairs,
##stat_pvalue_manual
##https://stackoverflow.com/questions/20500706/saving-multiple-ggplots-from-ls-into-one-and-separate-files-in-r
#print(p2)
#ggsave("./figures/alpha_diversity_Group.png", device="png", height = 10, width = 15)
#ggsave("./figures/alpha_diversity_Group.svg", device="svg", height = 10, width = 15)

Pairwise tests: Within each metric (Shannon/OTU/PD), Wilcoxon rank-sum tests compare groups pairwise, with BH correction for multiple testing.
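The same pairwise scheme can be sketched with base R (toy data; the pipeline below uses `rstatix::pairwise_wilcox_test` on a long-format table instead):

```r
value <- c(2.1, 2.3, 2.0, 3.1, 3.4, 3.0, 2.6, 2.7, 2.5)
group <- factor(rep(c("G1", "G2", "G3"), each = 3))

# All pairwise Wilcoxon rank-sum tests, BH-adjusted; exact = FALSE avoids
# warnings with ties or small samples
pw <- pairwise.wilcox.test(value, group, p.adjust.method = "BH", exact = FALSE)
pw$p.value   # lower-triangular matrix of BH-adjusted p-values
```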

library(dplyr)
library(tidyr)
library(ggpubr)
library(rstatix)
library(writexl)

div.df2 <- div.df2 %>% mutate(Group = factor(Group))

# 1) Long format for all 3 metrics
div_long <- div.df2 %>%
  select(Group, Shannon, OTU, `Phylogenetic Diversity`) %>%
  pivot_longer(
    cols = c(Shannon, OTU, `Phylogenetic Diversity`),
    names_to = "variable",
    values_to = "value"
  ) %>%
  filter(!is.na(Group), !is.na(value))

# 2) Drop groups with <2 samples within each metric
div_long2 <- div_long %>%
  group_by(variable, Group) %>%
  filter(n() >= 2) %>%
  ungroup() %>%
  mutate(Group = droplevels(Group))

# 3) ONE pairwise table for all metrics (BH within each metric)
pw_all <- div_long2 %>%
  group_by(variable) %>%
  pairwise_wilcox_test(value ~ Group, p.adjust.method = "BH", exact = FALSE) %>%
  ungroup()

write.csv(pw_all, "Pairwise_Wilcox_BH_all_metrics.csv", row.names = FALSE)
write_xlsx(pw_all, "Pairwise_Wilcox_BH_all_metrics.xlsx")
knitr::kable(pw_all) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# ---- desired order ----
comp_list <- list(
  c("1","2"),
  c("3","4"),
  c("5","6")
)

comp_order <- tibble(
  group1 = sapply(comp_list, `[`, 1),
  group2 = sapply(comp_list, `[`, 2),
  comp_id = seq_along(comp_list)
)

# 4) base y per facet
y_base <- div_long2 %>%
  group_by(variable) %>%
  summarise(
    y_max = max(value, na.rm = TRUE),
    y_min = min(value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(y_span = y_max - y_min)

# 5) match order regardless of direction (A-B same as B-A)
pw_all_ordered <- pw_all %>%
  mutate(g1 = as.character(group1), g2 = as.character(group2)) %>%
  left_join(comp_order, by = c("g1"="group1","g2"="group2")) %>%
  mutate(
    comp_id = ifelse(
      is.na(comp_id),
      comp_order$comp_id[match(
        paste(g2, g1),
        paste(comp_order$group1, comp_order$group2)
      )],
      comp_id
    )
  ) %>%
  select(-g1, -g2) %>%
  filter(!is.na(comp_id))

# 6) assign y.position in your specified order within each facet
pw_all_pos <- pw_all_ordered %>%
  left_join(y_base, by = "variable") %>%
  group_by(variable) %>%
  arrange(comp_id, .by_group = TRUE) %>%
  mutate(
    step = ifelse(y_span > 0, 0.12 * y_span, 0.12),
    y.position = y_max + row_number() * step
  ) %>%
  ungroup()

# 7) plot + annotate
p <- ggboxplot(
  div_long2, x = "Group", y = "value",
  facet.by = "variable", scales = "free",
  width = 0.5, fill = "gray"
) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8)) +
  stat_pvalue_manual(
    pw_all_pos,
    label = "p.adj.signif",
    y.position = "y.position",
    tip.length = 0.01,
    hide.ns = FALSE
  )

print(p)
## MANUALLY selected alpha diversities under host-env after 'cp alpha_diversities.txt selected_alpha_diversities.txt'
#knitr::include_graphics("./figures/alpha_diversity_Group.png")
#selected_alpha_diversities<-read.csv("selected_alpha_diversities.txt",sep="\t")
#knitr::kable(selected_alpha_diversities) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
#!!# Beta diversity (Bray-Curtis distance)
#!!## Group1 vs Group2

#fig.cap="Beta diversity",

#for QIIME1: file:///home/jhuang/DATA/Data_Marius_16S/core_diversity_e42369/bdiv_even42369_Group/weighted_unifrac_boxplots/Group_Stats.txt

# -- for QIIME2: MANUALLY filter permanova-pairwise.csv and save as permanova-pairwise_.csv
# #grep "Permutations" exported_beta_group/permanova-pairwise.csv > permanova-pairwise_.csv
# #grep "Group1,Group2" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# #grep "Group3,Group4" exported_beta_group/permanova-pairwise.csv >> permanova-pairwise_.csv
# beta_diversity_group_stats<-read.csv("permanova-pairwise_.csv",sep=",")
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group1" & beta_diversity_group_stats$Group.2 == "Group2", ]
# #beta_diversity_group_stats <- beta_diversity_group_stats[beta_diversity_group_stats$Group.1 == "Group3" & beta_diversity_group_stats$Group.2 == "Group4", ]
# knitr::kable(beta_diversity_group_stats) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

#NOTE: Run this Phyloseq.Rmd once, then run the code in MicrobiotaProcess.R to manually generate Comparison_of_Bray_Distances_Group1_vs_Group2.png and Comparison_of_Bray_Distances_Group3_vs_Group4.png, then re-run this Phyloseq.Rmd!

#knitr::include_graphics("./figures/Comparison_of_Bray_Distances_Group1_vs_Group2.png")

Principal coordinates analysis (PCoA) of beta-diversity based on Bray–Curtis dissimilarity

Beta-diversity (between-sample differences in bacterial community composition) was quantified using Bray–Curtis dissimilarity computed on Hellinger-transformed abundance data. Beta-diversity patterns were visualized using principal coordinates analysis (PCoA) based on the Bray–Curtis dissimilarity matrix.
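Both steps can be sketched in base R for two toy samples (in the pipeline this is done with `vegan::decostand(..., "hellinger")` and `vegan::vegdist(..., "bray")`):

```r
x <- rbind(s1 = c(10, 0, 5), s2 = c(2, 8, 5))    # samples x taxa counts (toy)
hel <- sqrt(sweep(x, 1, rowSums(x), "/"))        # Hellinger: sqrt of relative abundances
bray <- sum(abs(hel["s1", ] - hel["s2", ])) /    # Bray-Curtis on the transformed data
        sum(hel["s1", ] + hel["s2", ])
bray                                             # 0 = identical, 1 = no shared taxa
```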

Global PERMANOVA on the Bray–Curtis dissimilarity matrix indicated a significant effect of Group on overall community composition (adonis2: p = 1×10⁻⁴; 9,999 permutations).

# --- Global beta-diversity (PERMANOVA) ---
cat("```text\n")
cat(
"[PERMANOVA result]\n",
"The object contained internal attribute: PCoA ADONIS\n",
"Permutation test for adonis under reduced model\n",
"Permutation: free\n",
"Number of permutations: 9999\n\n",
"vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)\n",
"          Df SumOfSqs      R2      F Pr(>F)\n",
"Model     5   6.1752 0.51654 9.8296  1e-04 ***\n",
"Residual 46   5.7797 0.48346\n",
"Total    51  11.9549 1.00000  \n",
"---\n",
"Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1\n",
sep = ""
)
cat("```\n")

Pairwise PERMANOVA tests were performed on Bray–Curtis dissimilarity matrices to compare beta-diversity between all pairs of sample groups (metadata column Group). For each pairwise comparison, the Bray–Curtis matrix was subset to samples from the two groups only, and significance was assessed using vegan::adonis2 with 9,999 permutations. Resulting p-values were adjusted for multiple testing using both Benjamini–Hochberg (BH/FDR) and Bonferroni corrections.
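The two corrections are plain `p.adjust` calls; a sketch with hypothetical raw pairwise p-values:

```r
p_raw  <- c(G1vsG2 = 0.001, G3vsG4 = 0.02, G5vsG6 = 0.20)  # hypothetical values
p_bh   <- p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg (FDR)
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # Bonferroni (family-wise)
```

BH scales each p-value by the number of tests over its rank (with monotonicity enforced), so it is less conservative than Bonferroni, which simply multiplies every p-value by the number of tests.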

#Manually give the annotation of
Bray_pairwise_PERMANOVA <- read.csv("figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", sep = ",")
knitr::kable(Bray_pairwise_PERMANOVA, caption = "Pairwise PERMANOVA results (distance-based community differences among Group levels).") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

# --- Ordination figures ---
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA.png")
knitr::include_graphics("./figures_MP_Group_1_2_3_4_5_6/PCoA3.png")

Differential abundance analysis

Differential abundance analysis aims to find differences in the abundance of each taxon between two groups of samples, assigning a significance value to each comparison.
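The significance filter used in each comparison below reduces to a simple subset on the adjusted p-value; a toy illustration (hypothetical values; `res` normally comes from `DESeq2::results()`):

```r
res <- data.frame(taxon = c("A", "B", "C"),
                  log2FoldChange = c(2.1, -0.3, -1.8),
                  padj = c(0.001, 0.40, 0.03))
alpha <- 0.05
sig <- res[!is.na(res$padj) & res$padj < alpha, ]  # keep taxa significant after adjustment
nrow(sig)   # 2 taxa pass the threshold
```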

# ------------------------------------------------------------
#  DESeq2: non-rarefied integer counts + optional taxon prefilter
# ------------------------------------------------------------
ps_deseq <- ps_filt

Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

Group 1 vs Group 2

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group1,Group2)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "2")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group1_vs_Group2"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 3 vs Group 4

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group3,Group4)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "4")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group3_vs_Group4"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

Group 5 vs Group 6

ps_deseq_sel <- data.table::copy(ps_deseq)
otu_table(ps_deseq_sel) <- otu_table(ps_deseq)[,c(Group5,Group6)]
diagdds = phyloseq_to_deseq2(ps_deseq_sel, ~Group)
diagdds$Group <- relevel(diagdds$Group, "6")
diagdds <- DESeq(
  diagdds,
  test   = "Wald",
  fitType = "parametric",
  sfType  = "poscounts"  # <- important
)
resultsNames(diagdds)

res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(phyloseq::tax_table(ps_deseq_sel)[rownames(sigtab), ], "matrix"))
# file base name
fname <- "DEGs_Group5_vs_Group6"
write.xlsx(sigtab, file = paste0(fname, ".xlsx"), rowNames = TRUE)

kable(sigtab) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
    scale_fill_brewer(palette = palname, ...)
}
x = tapply(sigtab$log2FoldChange, sigtab$Order, function(x) max(x))
x = sort(x)
sigtab$Order = factor(as.character(sigtab$Order), levels=names(x))
x = tapply(sigtab$log2FoldChange, sigtab$Family, function(x) max(x))
x = sort(x)
sigtab$Family = factor(as.character(sigtab$Family), levels=names(x))

#ggplot(sigtab, aes(x=log2FoldChange, y=Family, color=Order)) + geom_point(aes(size=padj)) + scale_size_continuous(name="padj",range=c(8,4)) +
#  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust=0.5))
# build the plot
p <- ggplot(sigtab, aes(x = log2FoldChange, y = Family, color = Order)) +
  geom_point(aes(size = padj)) +
  scale_size_continuous(name = "padj", range = c(8, 4)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -25, hjust = 0, vjust = 0.5))
# SVG (svglite gives crisp text)
if (!requireNamespace("svglite", quietly = TRUE)) install.packages("svglite")
ggplot2::ggsave(paste0(fname, ".svg"), plot = p, device = svglite::svglite,
                width = 8, height = 6, units = "in", dpi = 300)
# PNG
ggplot2::ggsave(paste0(fname, ".png"), plot = p, device = "png",
                width = 8, height = 6, units = "in", dpi = 300)
knitr::include_graphics(paste0(fname, ".png"))

## =========================================================
## MicrobiotaProcess_UPDATED.R
## Uses your objects:
##   - ps_filt        : for alpha + beta diversity (NOT taxa-filtered)
##   - ps_abund_rel   : for cleaner composition plots (taxa filtered for plotting)
##
## Output:
##   - Rarefaction curves + alpha diversity plots
##   - Bray (Hellinger) distance + PCoA + PERMANOVA
##   - Composition plots (Class) + heatmaps (from ps_abund_rel)
## =========================================================

#DEBUG_Update: remotes::install_github("erocoar/gghalves")
#DEBUG: update.packages(ask = FALSE)
suppressPackageStartupMessages({
  library(phyloseq)
  library(MicrobiotaProcess)
  library(microeco)
  library(ggplot2)
  library(aplot)
  library(ggrepel)
  library(vegan)
  library(ggalluvial)
  library(ggh4x)
  library(gghalves)
})

dir.create("figures_MP_Group_1_2_3_4_5_6", showWarnings = FALSE)

## -----------------------------
## 0) Define groups (sample IDs)  ---- EXACTLY as provided
Group1<-c("sample-1","sample-2","sample-3","sample-6","sample-7","sample-8","sample-10","sample-11","sample-12")
Group2<-c("sample-4","sample-5","sample-9","sample-13","sample-14","sample-15","sample-16","sample-17")
Group3<-c("sample-18","sample-19","sample-20","sample-23","sample-24","sample-25","sample-28","sample-29")
Group4<-c("sample-21","sample-22","sample-26","sample-30","sample-31","sample-32","sample-33","sample-34")
Group5<-c("sample-40","sample-41","sample-42","sample-43","sample-49","sample-50","sample-51","sample-52","sample-53")
Group6<-c("sample-35","sample-36","sample-37","sample-38","sample-39","sample-44","sample-45","sample-46","sample-47","sample-48")

## Ensure unique IDs within each vector
Group1 <- unique(Group1)
Group2 <- unique(Group2)
Group3 <- unique(Group3)
Group4 <- unique(Group4)
Group5 <- unique(Group5)
Group6 <- unique(Group6)
#PC <- unique(PC)

## The group order you want everywhere (plots + stats)
#,"PC"
group_levels <- c("Group1","Group2","Group3","Group4","Group5","Group6")

## -----------------------------
## 0.1) Sample subset (union of all groups)
#, PC
keep_samples <- unique(c(Group1, Group2, Group3, Group4, Group5, Group6))

## -----------------------------
## 0.2) Helper: assign Group as an ordered factor (membership-based)
add_group <- function(mpse_obj) {
  sn <- rownames(colData(mpse_obj))
  grp <- rep(NA_character_, length(sn))

  grp[sn %in% Group1] <- "Group1"
  grp[sn %in% Group2] <- "Group2"
  grp[sn %in% Group3] <- "Group3"
  grp[sn %in% Group4] <- "Group4"
  grp[sn %in% Group5] <- "Group5"
  grp[sn %in% Group6] <- "Group6"
  #grp[sn %in% PC]     <- "PC"

  colData(mpse_obj)$Group <- factor(grp, levels = group_levels)

  # warn if any samples weren't assigned (helps catch typos / missing IDs)
  if (any(is.na(colData(mpse_obj)$Group))) {
    warning(
      "Unassigned samples (not found in any group list): ",
      paste(sn[is.na(colData(mpse_obj)$Group)], collapse = ", ")
    )
  }
  mpse_obj
}

## -----------------------------
## 0.3) Colors (edit to taste)
group_colors <- c(
  "Group1" = "#1f77b4",  # DarkBlue
  "Group2" = "#add8e6",  # LightBlue
  "Group3" = "#006400",  # DarkGreen
  "Group4" = "#90ee90",  # LightGreen
  "Group5" = "#6a0dad",  # DarkPurple
  "Group6" = "#dda0dd"   # LightPurple
)
#  "PC"     = """#7f7f7f"

## =========================================================
## 1) Diversity analysis object (alpha + beta)
##    IMPORTANT: start from ps_filt (all taxa retained)
## =========================================================
ps_div <- prune_samples(keep_samples, ps_filt)
ps_div <- prune_taxa(taxa_sums(ps_div) > 0, ps_div)

mpse_div <- as.MPSE(ps_div)
mpse_div <- add_group(mpse_div)

cat("\n[mpse_div] Group counts:\n")
print(table(colData(mpse_div)$Group, useNA = "ifany"))

## =========================================================
## 2) Alpha diversity (rarefaction-based)
## =========================================================
set.seed(9242)
mpse_div %<>% mp_rrarefy()  # creates RareAbundance
mpse_div %<>% mp_cal_rarecurve(.abundance = RareAbundance, chunks = 400)

# Rarefaction curves: sample + grouped
p_rare_1 <- mpse_div %>% mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe)

p_rare_2 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = Observe, .group = Group) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

p_rare_3 <- mpse_div %>%
  mp_plot_rarecurve(.rare = RareAbundanceRarecurve, .alpha = "Observe",
                    .group = Group, plot.group = TRUE) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

png("figures_MP_Group_1_2_3_4_5_6/rarefaction_of_samples_or_groups.png", width = 1080, height = 600)
print(p_rare_1 + p_rare_2 + p_rare_3)
dev.off()

# Alpha indices from RareAbundance
mpse_div %<>% mp_cal_alpha(.abundance = RareAbundance)

f_alpha_1 <- mpse_div %>%
  mp_plot_alpha(.group = Group, .alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou)) +
  scale_color_manual(values = group_colors, guide = "none") +
  scale_fill_manual(values = group_colors, guide = "none")

f_alpha_2 <- mpse_div %>%
  mp_plot_alpha(.alpha = c(Observe, Chao1, ACE, Shannon, Simpson, Pielou))

png("figures_MP_Group_1_2_3_4_5_6/alpha_diversity_comparison.png", width = 1400, height = 600)
print(f_alpha_1 / f_alpha_2)
dev.off()
## Known issue: mp_plot_alpha() can fail inside gghalves::geom_half_point() with
##   'argument "layout" is missing, with no default'
## Fix: install the development version of gghalves:
##   install.packages("remotes")
##   remotes::install_github("erocoar/gghalves")


## =========================================================
## 3) Beta diversity (Bray–Curtis on Hellinger)
##    IMPORTANT: use non-rarefied Abundance (not taxa-filtered)
## =========================================================
mpse_div %<>% mp_decostand(.abundance = Abundance)             # creates 'hellinger'
mpse_div %<>% mp_cal_dist(.abundance = hellinger, distmethod = "bray")

# Distance between samples
p_dist_1 <- mpse_div %>% mp_plot_dist(.distmethod = bray)
png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples.png", width = 1000, height = 1000)
print(p_dist_1)
dev.off()

# Distance with group info
p_dist_2 <- mpse_div %>% mp_plot_dist(.distmethod = bray, .group = Group) +
  scale_fill_manual(values = group_colors) +
  scale_color_gradient()

png("figures_MP_Group_1_2_3_4_5_6/distance_between_samples_with_group_info.png", width = 1000, height = 1000)
print(p_dist_2)
dev.off()

# Compare distances between groups (Bray)
p_dist_3 <- mpse_div %>%
  mp_plot_dist(.distmethod = bray, .group = Group, group.test = TRUE, textsize = 6) +
  theme(
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x  = element_text(size = 14),
    axis.text.y  = element_text(size = 14)
  )

png("figures_MP_Group_1_2_3_4_5_6/Comparison_of_Bray_Distances.png", width = 1000, height = 1000)
print(p_dist_3)
dev.off()

## PCoA + PERMANOVA (adonis2)
mpse_div %<>% mp_cal_pcoa(.abundance = hellinger, distmethod = "bray")

mpse_div %<>% mp_adonis(
  .abundance   = hellinger,
  .formula     = ~ Group,
  distmethod   = "bray",
  permutations = 9999,
  action       = "add"
)

cat("\n[PERMANOVA result]\n")
print(mpse_div %>% mp_extract_internal_attr(name = "adonis"))
#[PERMANOVA result]
#The object contained internal attribute: PCoA ADONIS
#Permutation test for adonis under reduced model
#Permutation: free
#Number of permutations: 9999
#
#vegan::adonis2(formula = .formula, data = sampleda, permutations = permutations, method = distmethod)
#         Df SumOfSqs      R2      F Pr(>F)
#Model     5   6.1752 0.51654 9.8296  1e-04 ***
#Residual 46   5.7797 0.48346
#Total    51  11.9549 1.00000
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## PCoA plot
p_pcoa_1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
print(p_pcoa_1)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf", width = 10, height = 8)
print(p_pcoa_1)
dev.off()

## Optional: label points
library(ggrepel)

df <- p_pcoa_1$data
## assumes plot-data rows follow the colData order; use df$Sample instead if present
df$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))

p_pcoa_2 <- p_pcoa_1 +
  geom_text_repel(
    data = df,
    aes(label = ShortLabel),
    size = 4, max.overlaps = 100
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_labeled.png", width = 1200, height = 1000)
print(p_pcoa_2)
dev.off()

## =========================================================
## 4) Composition plots object (clean taxa set for plotting)
##    IMPORTANT: start from ps_abund_rel (plotting-filtered taxa)
## =========================================================
ps_plot <- prune_samples(keep_samples, ps_abund_rel)
ps_plot <- prune_taxa(taxa_sums(ps_plot) > 0, ps_plot)

mpse_plot <- as.MPSE(ps_plot)
mpse_plot <- add_group(mpse_plot)

## Summaries for plotting (Class)
mpse_plot %<>%
  mp_cal_abundance(.abundance = Abundance) %>%               # per sample
  mp_cal_abundance(.abundance = Abundance, .group = Group)   # per group

## Class abundance barplots (top 20)
p_class_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = TRUE
  )

p_class_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    taxa.class = Class,
    topn = 20,
    relative = FALSE
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_samples.png", width = 1200, height = 600)
print(p_class_rel / p_class_abs)
dev.off()

## Heatmaps (grouped)
h_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = TRUE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

h_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    relative = FALSE,
    topn = 20,
    geom = "heatmap",
    features.dist = "euclidean",
    features.hclust = "average",
    sample.dist = "bray",
    sample.hclust = "average"
  )

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_heatmap.png", width = 1200, height = 600)
print(aplot::plot_list(gglist = list(h_rel, h_abs), tag_levels = "A"))
dev.off()

## Group-level barplots
p_group_rel <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = TRUE
  ) +
  scale_fill_manual(values = group_colors)

p_group_abs <- mpse_plot %>%
  mp_plot_abundance(
    .abundance = Abundance,
    .group = Group,
    taxa.class = Class,
    topn = 20,
    plot.group = TRUE,
    relative = FALSE
  ) +
  scale_fill_manual(values = group_colors)

png("figures_MP_Group_1_2_3_4_5_6/relative_abundance_and_abundance_groups.png", width = 1000, height = 1000)
print(p_group_rel / p_group_abs)
dev.off()

cat("\nDONE. Outputs written to ./figures_MP_Group_1_2_3_4_5_6/\n")

## =========================================================
## CONTINUE: Export Bray distances + pairwise PERMANOVA
## (Use mpse_div from the updated script above)
## =========================================================

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(openxlsx)
  library(vegan)
})

## -----------------------------
## Helper: get assay matrix with rows = samples, cols = features
get_sample_by_feature <- function(mpse_obj, assay_name) {
  mat <- assay(mpse_obj, assay_name)

  # sample IDs in MPSE
  samp_ids <- rownames(colData(mpse_obj))

  # If samples are columns, transpose
  if (!is.null(colnames(mat)) && all(samp_ids %in% colnames(mat))) {
    mat <- t(mat)
  }

  # Now enforce row order to match colData
  mat <- mat[samp_ids, , drop = FALSE]
  mat
}

## -----------------------------
## 1) Recompute Bray–Curtis distance (robust extraction)
hell_mat_sf <- get_sample_by_feature(mpse_div, "hellinger")  # rows=samples, cols=features
bray_dist   <- vegdist(hell_mat_sf, method = "bray")

## sanity checks
stopifnot(!any(is.na(as.matrix(bray_dist))))
stopifnot(!any(as.matrix(bray_dist) < 0, na.rm = TRUE))

## -----------------------------
## 2) Export all pairwise Bray distances to Excel
bray_mat <- as.matrix(bray_dist)
samples  <- rownames(bray_mat)

bray_df <- as.data.frame(as.table(bray_mat)) %>%
  rename(Sample1 = Var1, Sample2 = Var2, BrayDistance = Freq) %>%
  # as.table() yields factor columns; convert to character before comparing,
  # since '<' is not defined for unordered factors
  mutate(Sample1 = as.character(Sample1), Sample2 = as.character(Sample2)) %>%
  filter(Sample1 < Sample2) %>%   # keep each unordered pair once
  arrange(Sample1, Sample2)

write.xlsx(bray_df, file = "figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx")

## -----------------------------
## 3) Pairwise PERMANOVA (post-hoc) using vegan::adonis2
meta2 <- as.data.frame(colData(mpse_div))
meta2 <- meta2[rownames(hell_mat_sf), , drop = FALSE]
meta2$Group <- droplevels(meta2$Group)

groups <- levels(meta2$Group)

res_list <- list()
k <- 1

for (i in 1:(length(groups) - 1)) {
  for (j in (i + 1):length(groups)) {

    g1 <- groups[i]
    g2 <- groups[j]

    idx <- meta2$Group %in% c(g1, g2)
    sub_meta <- meta2[idx, , drop = FALSE]

    sub_dist <- as.dist(as.matrix(bray_dist)[idx, idx])

    ad <- adonis2(sub_dist ~ Group, data = sub_meta, permutations = 9999)

    res_list[[k]] <- data.frame(
      group1 = g1,
      group2 = g2,
      F      = ad$F[1],
      R2     = ad$R2[1],
      p      = ad$`Pr(>F)`[1]
    )
    k <- k + 1
  }
}

pair_res <- do.call(rbind, res_list)
pair_res$p_adj_BH         <- p.adjust(pair_res$p, method = "BH")
#pair_res$p_adj_Bonferroni <- p.adjust(pair_res$p, method = "bonferroni")
pair_res$p.adj.format <- format.pval(pair_res$p_adj_BH, digits = 3, eps = 1e-300)
pair_res$p.adj.signif <- cut(
  pair_res$p_adj_BH,
  breaks = c(-Inf, 1e-4, 1e-3, 1e-2, 0.05, Inf),
  labels = c("****", "***", "**", "*", "ns"),
  right = TRUE
)

write.csv(pair_res, "figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv", row.names = FALSE)

cat("\nPairwise PERMANOVA written to: figures_MP_Group_1_2_3_4_5_6/Bray_pairwise_PERMANOVA.csv\n")
cat("Bray distance table written to: figures_MP_Group_1_2_3_4_5_6/Bray_Curtis_Distances.xlsx\n")

## =========================================================
## OPTIONAL: PCoA plot where point size = Shannon and alpha = Observe
## (requires mpse_div already has Shannon/Observe from mp_cal_alpha)
## =========================================================

p_pcoa_sizealpha <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = TRUE
  ) +
  scale_fill_manual(values = group_colors) +
  scale_color_manual(values = group_colors) +
  scale_size_continuous(range = c(2, 6)) +
  theme(
    axis.text   = element_text(size = 16),
    axis.title  = element_text(size = 18),
    legend.text = element_text(size = 14),
    legend.title= element_text(size = 16)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.png", width = 1200, height = 1000)
print(p_pcoa_sizealpha)
dev.off()

pdf("figures_MP_Group_1_2_3_4_5_6/PCoA_sizealpha.pdf", width = 10, height = 8)
print(p_pcoa_sizealpha)
dev.off()

## =========================================================
## Ensure all three ordination outputs exist:
##   - PCoA  : basic (color/group)
##   - PCoA2 : size = Shannon, alpha = Observe
##   - PCoA3 : same as PCoA2 + sample labels
##
## Assumes you already have:
##   - mpse_div with: pcoa, Group, Shannon, Observe
##   - group_colors defined
## =========================================================

p1 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = 4,
    .alpha = 1,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 1.6,
      keyheight = 1.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA.png", width = 1200, height = 1000)
p1
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA.pdf")
p1
dev.off()

p2 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA2.png", width = 1200, height = 1000)
p2
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA2.pdf")
p2
dev.off()

library(ggrepel)
colData(mpse_div)$ShortLabel <- gsub("sample-", "", rownames(colData(mpse_div)))

p3 <- mpse_div %>%
  mp_plot_ord(
    .ord   = pcoa,
    .group = Group,
    .color = Group,
    .size  = Shannon,
    .alpha = Observe,
    ellipse = TRUE,
    show.legend = FALSE
  ) +
  geom_text_repel(aes(label = ShortLabel), size = 5, max.overlaps = 100) +
  scale_fill_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_color_manual(
    values = group_colors,
    guide  = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  scale_size_continuous(
    range = c(2, 6),
    guide = guide_legend(
      keywidth = 0.6,
      keyheight = 0.6,
      label.theme = element_text(size = 16)
    )
  ) +
  theme(
    axis.text   = element_text(size = 20),
    axis.title  = element_text(size = 22),
    legend.text = element_text(size = 20),
    legend.title= element_text(size = 22),
    plot.title  = element_text(size = 24, face = "bold"),
    plot.subtitle = element_text(size = 20)
  )

png("figures_MP_Group_1_2_3_4_5_6/PCoA3.png", width = 1200, height = 1000)
p3
dev.off()
svg("figures_MP_Group_1_2_3_4_5_6/PCoA3.svg", width = 12, height = 10)
p3
dev.off()
pdf("figures_MP_Group_1_2_3_4_5_6/PCoA3.pdf")
p3
dev.off()

Guide RNA Design for a Toxin–Antitoxin Locus in CP052959 Using the GuideFinder Workflow (Data_JuliaFuchs_RNAseq_2025)

This post documents how I generated CRISPRi gRNA candidates for a toxin–antitoxin locus in Staphylococcus epidermidis HD46 using a GuideFinder-inspired pipeline (Spoto et al., 2020; mSphere). The workflow is designed to be reproducible and works on complete microbial genomes.

Target locus

Two adjacent genes form a co-transcribed toxin–antitoxin (TA) unit:

  • Toxin: HJI06_09100 (1889026–1889421; strand -)
  • Antitoxin: HJI06_09105 (1889421–1889651; strand -)

Because RNA-seq coverage across the TA boundary is continuous, the locus behaves like an operon; therefore, targeting the antitoxin-side entry region (higher coordinates on the minus strand) is a good strategy to knock down the TA unit as a whole.


Overview of the pipeline

The workflow consists of four stages:

  1. Extract CDS sequences from the GenBank file → CP052959_CDS.fasta
  2. BLAST CDS to genome to generate a coordinate table → CP052959_gene_coordinates.tsv
  3. Scan promoter + gene body for NGG PAMs and generate candidate 20-mers with basic filters → CP052959_guides_preoff.tsv + BLAST hit table
  4. Off-target filtering using several scenarios (off_n1/off_n3/off_n5/off_n10/off_refined) → scan_results/.../guides_all.tsv + guides_top5.tsv

A key debugging lesson: if a target gene has no candidates, it is often because the gene scan window is too short (especially for short genes) or the promoter window is too small. Setting scan_gene_frac = 1.0 (scan the full gene) is often the fix.
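The short-gene failure mode can be made concrete with a quick calculation. This is a sketch; `gene_scan_window` and `pam_site_len` are illustrative names — only scan_gene_frac is an actual pipeline parameter:

```python
def gene_scan_window(gene_len, scan_gene_frac, pam_site_len=23):
    """Return (scanned window length, upper bound on forward-strand sites).

    Only the first scan_gene_frac of the gene body is scanned; each candidate
    needs a 20-nt protospacer plus a 3-nt NGG PAM, i.e. 23 nt of window.
    """
    window = int(gene_len * scan_gene_frac)
    return window, max(0, window - pam_site_len + 1)

# The 231-nt antitoxin HJI06_09105: a full scan leaves plenty of room,
# but a small fraction can leave no room for even one 23-nt site.
print(gene_scan_window(231, 1.0))   # (231, 209)
print(gene_scan_window(231, 0.05))  # (11, 0)
```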


Requirements

Software

  • Python 3 (Biopython)
  • R (with Biostrings, readr, dplyr, stringr, tidyr)
  • BLAST+ (makeblastdb, blastn)
  • (Optional) samtools for quick sequence validation

Input files

Download from NCBI:

  • Genome FASTA: CP052959.1.fna (or equivalent)
  • GenBank annotation: CP052959.1.gbk (or equivalent)

For convenience, I standardize filenames locally as:

  • CP052959.fna (genome)
  • CP052959.gbk (annotation)

Step-by-step commands

0) Prepare working directory

mkdir -p HD46_GuideFinder
cd HD46_GuideFinder

Place the following files in this directory:

  • CP052959.gbk
  • CP052959.fna
  • the scripts from the bottom of this post

1) Extract CDS sequences from GenBank (Python)

This creates a CDS FASTA where each header encodes metadata. Important note from troubleshooting: headers can contain spaces; later, it’s safest to avoid spaces or replace them with underscores.

✅ Run:

python 1_extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Expected output:

  • CP052959_CDS.fasta

Example log:

  • Wrote 2573 CDS sequences to CP052959_CDS.fasta

2) Pre-processing: build BLAST DB + map CDS to genome (R)

This step wraps:

  • makeblastdb
  • blastn (CDS vs genome)
  • parsing BLAST output into a coordinate table

✅ Run:

#NOTE that we need to manually replace spaces (' ') with underscores ('_') in all headers of CP052959_CDS.fasta.
Rscript 2_PreProcessing_CP052959.R
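The manual space-to-underscore fix can be scripted. A minimal sketch, shown on a tiny demo FASTA — apply the same sed call to CP052959_CDS.fasta. The /^>/ address restricts the substitution to header lines, so sequence lines are untouched:

```shell
# Replace spaces with underscores on FASTA header lines only (lines starting with '>').
printf '>HJI06_00001|parB|partition protein ParB\nATGAAA\n' > demo.fasta
sed -i.bak '/^>/ s/ /_/g' demo.fasta
head -n 1 demo.fasta   # ">HJI06_00001|parB|partition_protein_ParB"
```

The `-i.bak` form works with both GNU and BSD sed and keeps a backup of the original file.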

Expected output:

  • GenomeDB_CP052959.* (BLAST database files)
  • CDS_vs_genome_CP052959.blast
  • CP052959_gene_coordinates.tsv

Sanity check (recommended):

grep -E "HJI06_09100|HJI06_09105" CP052959_gene_coordinates.tsv

You should see the correct coordinates and minus strand (consistent with complement(...) in GenBank).


3) Filter TA coordinates + create TA_targets.txt (R)

The coordinate table can contain multiple BLAST hits per locus_tag (including short, spurious alignments). So I keep only the longest hit per locus_tag and write:

  • TA_gene_coordinates.tsv
  • TA_targets.txt

✅ Run:

Rscript 3_Filter_TA_coords.R

Expected outputs:

  • TA_gene_coordinates.tsv
  • TA_targets.txt
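The longest-hit-per-locus_tag rule (the `slice_max(order_by = aln_len, n = 1)` step in 3_Filter_TA_coords.R) amounts to the following, sketched in Python with assumed field names:

```python
def longest_hit_per_locus(hits):
    """Keep only the single longest BLAST alignment per locus_tag,
    discarding short, spurious secondary hits.
    hits: dicts with at least 'locus_tag' and 'aln_len' keys."""
    best = {}
    for h in hits:
        cur = best.get(h["locus_tag"])
        if cur is None or h["aln_len"] > cur["aln_len"]:
            best[h["locus_tag"]] = h
    return best
```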

4) Generate candidate guides (promoter + gene body) and BLAST hit table (R)

This script scans:

  • promoter window + gene body window
  • identifies NGG PAM (and CCN for reverse-strand equivalent)
  • extracts adjacent 20-nt protospacers
  • filters by GC and homopolymer
  • BLASTs guides to genome and writes a hit table

✅ Run:

Rscript 4_GuideFinder_CP052959_base.R

Expected outputs:

  • CP052959_guides_preoff.tsv
  • BLASTguides.fasta
  • MatchGuides.blast
  • CP052959_blast_hits.tsv

Troubleshooting note (from my run)

Initially, HJI06_09105 produced no candidates. The fix was to adjust scanning parameters:

  • scan_gene_frac <- 1.0 (scan full gene; crucial for short antitoxin genes)
  • increase promoter length (e.g., 600 bp)
  • optionally relax GC/homopolymer thresholds slightly

5) Off-target filtering (R)

This script reads:

  • CP052959_guides_preoff.tsv
  • CP052959_blast_hits.tsv

Then produces multiple filtering scenarios:

  • off_n1 / off_n3 / off_n5 / off_n10: keep guides where the number of strict genomic hits ≤ N
  • off_refined: remove any guide with a second perfect 20/20 match elsewhere; allow only “weak” secondary hits

✅ Run:

Rscript 5_ScanOfftarget_CP052959.R

Expected output structure:

scan_results/
  off_n1/
  off_n3/
  off_n5/
  off_n10/
  off_refined/

Each contains:

  • guides_all.tsv
  • guides_top5.tsv (ranked by dist_to_tss and GC)
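The off_nN scenarios boil down to counting strict hits per guide. A minimal sketch — the definition of a "strict" hit here (full-length, 100% identity) is an assumption mirroring the description above, not the script's exact thresholds:

```python
from collections import Counter

def off_n_filter(guide_ids, blast_rows, max_hits=1):
    """Keep guides whose number of strict genomic hits is <= max_hits.
    blast_rows: (guide_id, pident, aln_len) tuples. Every guide has at
    least one strict hit (its own target site), so max_hits=1 (off_n1)
    allows no second perfect site anywhere in the genome."""
    strict = Counter(g for g, pident, aln_len in blast_rows
                     if aln_len == 20 and pident == 100.0)
    return [g for g in guide_ids if strict[g] <= max_hits]
```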

Key concepts (explained from the debugging/communication notes)

What is PAM?

PAM = protospacer adjacent motif. For SpCas9/dCas9, PAM is typically NGG. Cas9/dCas9 binding normally requires a correct PAM adjacent to the DNA target site.

What is guide_seq vs protospacer?

In this pipeline:

  • guide_seq is the 20-nt protospacer sequence extracted from the genome adjacent to PAM.
  • The gRNA spacer you clone is usually this same 20-nt string (PAM is not included in gRNA).
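Forward-strand NGG scanning plus protospacer extraction can be sketched as below (minimal Python; the real pipeline additionally scans CCN for reverse-strand sites):

```python
import re

def forward_ngg_guides(seq, spacer_len=20):
    """For every forward-strand NGG PAM, the guide_seq (protospacer) is the
    spacer_len nucleotides immediately 5' of the PAM. The PAM itself is
    never part of the cloned gRNA spacer."""
    guides = []
    # zero-width lookahead so overlapping PAMs (e.g. in AGGG) are all found
    for m in re.finditer(r"(?=[ACGT]GG)", seq):
        i = m.start()
        if i >= spacer_len:
            guides.append((seq[i - spacer_len:i], seq[i:i + 3]))
    return guides

# The recommended guide falls out directly when its site is scanned:
site = "AGTGCTGCGATCACTTCTGT" + "CGG"
print(forward_ngg_guides(site))  # [('AGTGCTGCGATCACTTCTGT', 'CGG')]
```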

What does guide_strand mean?

  • guide_strand = +: the 20-nt guide_seq is exactly the sequence on the genome forward strand (CP052959.fna) at guide_start..guide_end.
  • guide_strand = -: the guide_seq is the reverse complement of the forward strand sequence at that coordinate interval.

It is not a problem if a paired design uses one “+” and one “-” guide. They simply refer to target sites on different strands.
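The strand convention, restated as code (sketch):

```python
_COMP = str.maketrans("ACGT", "TGCA")

def guide_seq_from_genome(forward_slice, guide_strand):
    """On '+', guide_seq equals the forward-strand slice at
    guide_start..guide_end; on '-', it is the reverse complement
    of that same slice."""
    if guide_strand == "+":
        return forward_slice
    return forward_slice.translate(_COMP)[::-1]
```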

Why do some target sites appear multiple times?

Two reasons:

  1. The same coordinate-defined target may appear as a forward/reverse complement pair (because both PAM contexts were scanned).
  2. In tight loci like TA operons, promoter/gene scan windows can overlap between toxin and antitoxin, so the same genomic site may be reported under both locus_tags. For practical ordering, treat each coordinate-defined site as one physical target.
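For ordering purposes, collapsing duplicate reports to one physical site per coordinate interval might look like this (sketch; the column names are assumptions based on the output tables described above):

```python
def unique_physical_sites(rows):
    """Collapse guide rows that share the same (contig, start, end, strand),
    even when reported under different locus_tags; the first occurrence
    of each coordinate-defined site is kept."""
    seen = {}
    for r in rows:
        key = (r["contig"], r["guide_start"], r["guide_end"], r["guide_strand"])
        seen.setdefault(key, r)
    return list(seen.values())
```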

Final candidate selection (from my off_refined output)

After off_refined filtering, I had 10 candidates covering both genes.

Recommended single guide (entry-proximal for TA-unit knockdown):

  • CP052959|HJI06_09105|gene|1889575|+
  • guide_seq: AGTGCTGCGATCACTTCTGT
  • PAM: CGG
  • This is entry-proximal (antitoxin-side) and passed the strict off_refined rule.

Recommended “enhanced” 2-guide set:

  • Guide A (entry-proximal): CP052959|HJI06_09105|gene|1889575|+
  • Guide B (upstream roadblock): CP052959|HJI06_09105|promoter|1889908|- (guide_seq: CATTCTGACTTTATGGAAAC, PAM AGG)
  • These are ~333 bp apart and both pass off_refined.

Reproducibility note: verifying a candidate sequence in the genome

I used samtools faidx to confirm that the guide_seq matches the genome coordinates:

samtools faidx CP052959.fna
samtools faidx CP052959.fna CP052959.1:1889575-1889594

Expected sequence: AGTGCTGCGATCACTTCTGT
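The same check can be done without samtools, using only the Python standard library (1-based inclusive coordinates, matching faidx region syntax):

```python
def fetch_region(fasta_path, contig, start, end):
    """Return the forward-strand sequence of contig at start..end
    (1-based, inclusive), like `samtools faidx fasta contig:start-end`.
    Handles line-wrapped FASTA records."""
    seqs, name = {}, None
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # contig ID = first header token
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line)
    return "".join(seqs[contig])[start - 1:end]
```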


Raw code (scripts used)

Below are the exact scripts used so you can reproduce this pipeline on another computer.


1) 1_extract_cds_from_gbk.py

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

gbk_file = "CP052959.gbk"
out_fasta = "CP052959_CDS.fasta"

records = []

for rec in SeqIO.parse(gbk_file, "genbank"):
    for feature in rec.features:
        if feature.type != "CDS":
            continue

        seq = feature.extract(rec.seq)

        locus_tag = feature.qualifiers.get("locus_tag", ["unknown"])[0]
        gene      = feature.qualifiers.get("gene", [""])[0]
        product   = feature.qualifiers.get("product", [""])[0]

        header_parts = [locus_tag]
        if gene:
            header_parts.append(gene)
        if product:
            header_parts.append(product)

        header = "|".join(header_parts)

        rec_out = SeqRecord(
            seq,
            id=header,
            description=""
        )
        records.append(rec_out)

SeqIO.write(records, out_fasta, "fasta")

print(f"Wrote {len(records)} CDS sequences to {out_fasta}")

2) 2_PreProcessing_CP052959.R

# PreProcessing_CP052959.R
# Build BLAST DB from CP052959.fna and align CP052959_CDS.fasta to get gene coordinates

# Packages
if (!requireNamespace("Biostrings", quietly = TRUE)) {
  install.packages("BiocManager")
  BiocManager::install("Biostrings")
}
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")

library(Biostrings)
library(readr)
library(dplyr)
library(stringr)

genome_fasta <- "CP052959.fna"
cds_fasta    <- "CP052959_CDS.fasta"
db_name      <- "GenomeDB_CP052959"
blast_out    <- "CDS_vs_genome_CP052959.blast"
coord_table  <- "CP052959_gene_coordinates.tsv"

message("Using genome FASTA: ", genome_fasta)
message("Using CDS FASTA:    ", cds_fasta)

# 1. Make BLAST database
cmd_db <- sprintf("makeblastdb -in %s -dbtype nucl -out %s",
                  shQuote(genome_fasta), shQuote(db_name))
message("Running: ", cmd_db)
system(cmd_db)

# 2. BLAST CDS vs genome to get coordinates
# outfmt: qseqid sseqid sstart send sstrand length pident bitscore
cmd_blast <- paste(
  "blastn",
  "-task blastn",
  "-query", shQuote(cds_fasta),
  "-db", shQuote(db_name),
  "-outfmt '6 qseqid sseqid sstart send sstrand length pident bitscore'",
  "-max_target_seqs 1",
  "-perc_identity 95",
  "-out", shQuote(blast_out)
)
message("Running: ", cmd_blast)
system(cmd_blast)

# 3. Load BLAST output
cols <- c("qseqid", "sseqid", "sstart", "send", "sstrand",
          "length", "pident", "bitscore")

blast <- read_tsv(blast_out,
                  col_names = cols,
                  show_col_types = FALSE,
                  comment = "#")

if (nrow(blast) == 0) {
  stop("No BLAST hits found. Check that CP052959_CDS.fasta matches CP052959.fasta.")
}

# 4. Normalise coordinates (start < end) and define strand
blast2 <- blast %>%
  mutate(
    start  = pmin(sstart, send),
    end    = pmax(sstart, send),
    strand = if_else(sstart <= send, "+", "-")
  ) %>%
  select(qseqid, sseqid, start, end, strand, length, pident, bitscore)

# 5. Parse gene ID / name from qseqid (header from CP052959_CDS.fasta)
# Format: locus_tag|gene|product (if available)
blast2 <- blast2 %>%
  mutate(
    locus_tag = str_split_fixed(qseqid, "\\|", 3)[, 1],
    gene      = str_split_fixed(qseqid, "\\|", 3)[, 2]
  )

# 6. Save coordinate table
coord <- blast2 %>%
  transmute(
    gene_id    = if_else(gene == "", locus_tag, gene),
    locus_tag  = locus_tag,
    contig_id  = sseqid,
    start      = start,
    end        = end,
    strand     = strand,
    length_nt  = length,
    pident     = pident,
    bitscore   = bitscore
  )

write_tsv(coord, coord_table)
message("Wrote gene coordinate table to: ", coord_table)

3) 3_Filter_TA_coords.R

#!/usr/bin/env Rscript

suppressPackageStartupMessages({
  library(readr)
  library(dplyr)
})

# ---------- Input / output files ----------
coord_in  <- "CP052959_gene_coordinates.tsv"
coord_out <- "TA_gene_coordinates.tsv"
targets_out <- "TA_targets.txt"

# The TA target locus_tags
targets <- c("HJI06_09100", "HJI06_09105")

# ---------- Read the coordinate table ----------
coords <- read_tsv(coord_in, col_names = FALSE, show_col_types = FALSE)

# Expect 9 columns: product, locus_tag, seqid, start, end, strand, aln_len, pident, bitscore
if (ncol(coords) < 9) {
  stop("Input coord table has < 9 columns: ", coord_in)
}

colnames(coords)[1:9] <- c("product","locus_tag","seqid","start","end","strand","aln_len","pident","bitscore")

# Coerce column types (columns read as character would make slice_max misbehave)
coords <- coords %>%
  mutate(
    locus_tag = as.character(locus_tag),
    aln_len = as.integer(aln_len),
    pident = as.numeric(pident),
    bitscore = as.numeric(bitscore)
  )

# ---------- Filter: keep TA targets only + the longest hit per gene ----------
coords_ta <- coords %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# ---------- Sanity checks ----------
missing <- setdiff(targets, unique(coords_ta$locus_tag))
if (length(missing) > 0) {
  stop("Missing targets in filtered results: ", paste(missing, collapse = ", "),
       "\nCheck that locus_tags exist in ", coord_in)
}

if (nrow(coords_ta) != length(targets)) {
  warning("Filtered rows (", nrow(coords_ta), ") != number of targets (", length(targets), ").",
          "\nThis can happen if duplicate rows/ties were removed; please inspect output.")
}

# ---------- Write outputs ----------
write_tsv(coords_ta, coord_out)

# targets.txt: one locus_tag per line (written in the given order)
writeLines(targets, targets_out)

message("Wrote: ", coord_out)
message("Wrote: ", targets_out)
message("\nFiltered TA coordinates:")
print(coords_ta)

4) 4_GuideFinder_CP052959_base.R

#!/usr/bin/env Rscript

# GuideFinder_CP052959_base.R
# Purpose:
#   1) Generate all candidate guides (NGG PAM, 20 bp) for the TA targets only
#   2) Compute dist_to_tss (negative in the promoter, positive in the gene body, along the transcription direction)
#   3) Write guides_preoff.tsv + a guides FASTA
#   4) Run BLAST for genome-wide hit counting and write blast_hits.tsv
#
# Inputs:
#   CP052959.fna                  (whole-genome FASTA, single sequence; may be a symlink to CP052959.1.fna)
#   CP052959_gene_coordinates.tsv (produced by PreProcessing_CP052959.R)
#   TA_targets.txt                (two lines: HJI06_09100 / HJI06_09105)
#
# Outputs:
#   CP052959_guides_preoff.tsv
#   BLASTguides.fasta
#   MatchGuides.blast
#   CP052959_blast_hits.tsv

suppressPackageStartupMessages({
  if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  if (!requireNamespace("Biostrings", quietly = TRUE)) BiocManager::install("Biostrings")
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  if (!requireNamespace("stringr", quietly = TRUE)) install.packages("stringr")
  if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")

  library(Biostrings)
  library(readr)
  library(dplyr)
  library(stringr)
  library(tidyr)
})

# ---------- Parameters ----------
genome_fasta        <- "CP052959.fna"
coord_table         <- "CP052959_gene_coordinates.tsv"   # produced by PreProcessing_CP052959.R
targets_file        <- "TA_targets.txt"                  # run only the two TA genes
blast_db_name       <- "GenomeDB_CP052959"               # BLAST DB prefix built during pre-processing
guides_fasta        <- "BLASTguides.fasta"
blast_matches_file  <- "MatchGuides.blast"
guides_preoff_tsv   <- "CP052959_guides_preoff.tsv"
blast_hits_tsv      <- "CP052959_blast_hits.tsv"

guide_len      <- 20L
pam            <- "NGG"              # SpCas9
#promoter_len   <- 200L              # upstream scan length (bp) from the TSS; adjust to 100/300 as needed
#scan_gene_frac <- 0.30              # scan the first 30% of the gene body (better suited to CRISPRi)
#gc_min         <- 0.30              # S. epidermidis GC is low; 0.30-0.35 recommended
#gc_max         <- 0.80              # rarely triggered, but kept
#homopolymer_n  <- 4L                # filter runs of >= 4 consecutive A or T (5 is more lenient)
promoter_len   <- 600L     # was 200L; enlarged to increase PAM opportunities upstream / at the gene entry
scan_gene_frac <- 1.00     # was 0.30; scan the full gene body (essential for short genes)
gc_min         <- 0.25     # was 0.30; slightly relaxed (S. epidermidis GC is low; avoid filtering everything on GC)
gc_max         <- 0.80
homopolymer_n  <- 5L       # was 4L; slightly relaxed (avoid killing off the few remaining candidates)

# BLAST parameters: used for hit counting (not the final off_refined rule)
blast_task     <- "blastn-short"
blast_wordsize <- 7L
blast_evalue   <- 1000

# ---------- Helper functions ----------
stopif_file_missing <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
}

calc_gc <- function(seq) {
  s <- as.character(seq)
  if (nchar(s) == 0) return(NA_real_)
  g <- str_count(s, "G")
  c <- str_count(s, "C")
  (g + c) / nchar(s)
}

has_homopolymer_AT <- function(seq, n = 4L) {
  s <- as.character(seq)
  patA <- paste0("A{", n, ",}")
  patT <- paste0("T{", n, ",}")
  str_detect(s, patA) | str_detect(s, patT)
}

revcomp_str <- function(s) {
  as.character(reverseComplement(DNAString(s)))
}

# Scan the forward strand for NGG: returns a tibble (protospacer + PAM)
scan_forward_NGG <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  # dna: DNAString of the region (5'->3' on the genomic plus strand)
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # i: protospacer start (0-based in R string positions)
  # protospacer = s[i+1 .. i+20]
  # PAM         = s[i+21 .. i+23] must match NGG
  max_i <- L - (guide_len + 3L)
  if (max_i < 0) return(tibble())

  idx <- 0L
  for (i in 0:max_i) {
    pam_seq <- substr(s, i + guide_len + 1L, i + guide_len + 3L)
    if (nchar(pam_seq) != 3) next
    if (!str_detect(pam_seq, "^[ACGT]GG$")) next

    prot <- substr(s, i + 1L, i + guide_len)
    if (nchar(prot) != guide_len) next

    guide_start <- offset_genome_1based + i
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- guide_end + 1L
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = prot,        # conventionally reported as the protospacer (PAM excluded)
      pam_seq     = pam_seq,
      guide_strand = "+",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# Scan the reverse strand: equivalent to scanning the forward strand for CCN (the reverse-strand PAM = NGG)
scan_reverse_CCN <- function(dna, offset_genome_1based = 1L, region_label = "region") {
  s <- as.character(dna)
  L <- nchar(s)
  out <- list()

  # Find CCN on the forward strand: pam_fwd = CCN (positions j..j+2)
  # The reverse-strand PAM is NGG; the corresponding protospacer on the forward strand
  # is the 20 nt immediately after pam_fwd (j+3..j+22)
  # The reported guide_seq is the reverse complement of those 20 nt (5'->3')
  max_j <- L - (3L + guide_len)
  if (max_j < 0) return(tibble())

  idx <- 0L
  for (j in 0:max_j) {
    pam_fwd <- substr(s, j + 1L, j + 3L)
    if (nchar(pam_fwd) != 3) next
    if (!str_detect(pam_fwd, "^CC[ACGT]$")) next

    prot_fwd <- substr(s, j + 4L, j + 3L + guide_len)  # downstream 20nt
    if (nchar(prot_fwd) != guide_len) next

    guide_seq <- revcomp_str(prot_fwd)
    # Genomic coordinates (1-based): prot_fwd spans [j+4 .. j+23] (length 20)
    guide_start <- offset_genome_1based + j + 3L
    guide_end   <- guide_start + guide_len - 1L
    pam_start   <- offset_genome_1based + j
    pam_end     <- pam_start + 2L

    idx <- idx + 1L
    out[[idx]] <- tibble(
      guide_seq   = guide_seq,
      pam_seq     = revcomp_str(pam_fwd),   # reverse complement gives NGG
      guide_strand = "-",
      guide_start = guide_start,
      guide_end   = guide_end,
      pam_start   = pam_start,
      pam_end     = pam_end,
      region      = region_label
    )
  }
  bind_rows(out)
}

# dist_to_tss: distance (bp) from the guide to the TSS along the transcription direction
# Convention:
#   + strand: TSS ~ gene_start; promoter spans start-promoter_len .. start-1 (dist negative)
#   - strand: TSS ~ gene_end;   promoter spans end+1 .. end+promoter_len (dist negative)
calc_dist_to_tss <- function(gene_start, gene_end, gene_strand, guide_start, guide_end) {
  if (gene_strand == "+") {
    # Approximate using the guide's 5' end (the leading end in the transcription direction)
    guide_5p <- guide_start
    as.integer(guide_5p - gene_start)
  } else {
    # On the - strand transcription runs from larger to smaller coordinates; the 5' end is at the larger coordinate
    guide_5p <- guide_end
    as.integer(gene_end - guide_5p)
  }
}

# ---------- Input checks ----------
stopif_file_missing(genome_fasta)
stopif_file_missing(coord_table)
stopif_file_missing(targets_file)

message("Using genome FASTA: ", genome_fasta)
message("Using coord table:  ", coord_table)
message("Using targets:      ", targets_file)

targets <- readLines(targets_file)
targets <- targets[targets != ""]
if (length(targets) == 0) stop("targets_file is empty: ", targets_file)
message("Targets: ", paste(targets, collapse = ", "))

# ---------- Read the genome ----------
genome_set <- readDNAStringSet(genome_fasta)
if (length(genome_set) < 1) stop("No sequences in genome FASTA: ", genome_fasta)
if (length(genome_set) > 1) {
  message("Warning: genome FASTA has multiple sequences. Using the first one: ", names(genome_set)[1])
}
genome <- genome_set[[1]]
genome_len <- length(genome)
genome_id <- names(genome_set)[1]
message("Genome length: ", genome_len, " bp; genome_id: ", genome_id)

# ---------- Read the coordinate table and filter (key: keep only targets + the longest hit per gene) ----------
# The table has 9 columns (inferred from grep output):
# V1=product, V2=locus_tag, V3=seqid, V4=start, V5=end, V6=strand, V7=aln_len, V8=pident, V9=bitscore
coords_raw <- read_tsv(coord_table, col_names = FALSE, show_col_types = FALSE)
if (ncol(coords_raw) < 9) stop("coord_table seems not to have 9 columns: ", coord_table)

coords <- coords_raw %>%
  transmute(
    product   = as.character(X1),
    locus_tag = as.character(X2),
    seqid     = as.character(X3),
    start     = as.integer(X4),
    end       = as.integer(X5),
    strand    = as.character(X6),
    aln_len   = as.integer(X7),
    pident    = as.numeric(X8),
    bitscore  = as.numeric(X9)
  ) %>%
  filter(locus_tag %in% targets) %>%
  group_by(locus_tag) %>%
  slice_max(order_by = aln_len, n = 1, with_ties = FALSE) %>%
  ungroup()

# Sanity check: every target must be present
missing_targets <- setdiff(targets, coords$locus_tag)
if (length(missing_targets) > 0) {
  stop("Some targets not found in coord_table after filtering: ", paste(missing_targets, collapse = ", "))
}

message("Filtered coord rows: ", nrow(coords))
print(coords)

# ---------- Generate candidate guides per gene ----------
all_guides <- list()
gid <- 0L

for (i in seq_len(nrow(coords))) {
  row <- coords[i, ]

  gene_start <- min(row$start, row$end)
  gene_end   <- max(row$start, row$end)
  gene_strand <- row$strand
  locus <- row$locus_tag
  prod  <- row$product

  gene_len <- gene_end - gene_start + 1L
  scan_gene_len <- max(1L, floor(gene_len * scan_gene_frac))

  # Promoter interval ("upstream" defined by the strand orientation)
  if (gene_strand == "+") {
    tss <- gene_start
    prom_start <- max(1L, gene_start - promoter_len)
    prom_end   <- gene_start - 1L
    gene_scan_start <- gene_start
    gene_scan_end   <- min(gene_end, gene_start + scan_gene_len - 1L)
  } else {
    tss <- gene_end
    prom_start <- gene_end + 1L
    prom_end   <- min(genome_len, gene_end + promoter_len)
    gene_scan_end   <- gene_end
    gene_scan_start <- max(gene_start, gene_end - scan_gene_len + 1L)
  }

  message("\n--- ", locus, " (", gene_strand, ") ---")
  message("Gene: ", gene_start, "..", gene_end, " (len=", gene_len, "), scan gene part=", scan_gene_len,
          ", TSS~", tss)
  message("Promoter region: ", prom_start, "..", prom_end)
  message("Gene-scan region:", gene_scan_start, "..", gene_scan_end)

  # Extract the promoter / gene-scan sequences (sliced by plus-strand genomic coordinates)
  guides_gene <- tibble()
  guides_prom <- tibble()

  if (prom_start <= prom_end) {
    dna_prom <- subseq(genome, start = prom_start, end = prom_end)
    guides_prom <- bind_rows(
      scan_forward_NGG(dna_prom, offset_genome_1based = prom_start, region_label = "promoter"),
      scan_reverse_CCN(dna_prom, offset_genome_1based = prom_start, region_label = "promoter")
    )
  }

  if (gene_scan_start <= gene_scan_end) {
    dna_gene <- subseq(genome, start = gene_scan_start, end = gene_scan_end)
    guides_gene <- bind_rows(
      scan_forward_NGG(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene"),
      scan_reverse_CCN(dna_gene, offset_genome_1based = gene_scan_start, region_label = "gene")
    )
  }

  guides <- bind_rows(guides_prom, guides_gene)
  if (nrow(guides) == 0) {
    message("No NGG candidates found in promoter+gene-scan region for ", locus)
    next
  }

  # Filters: GC + homopolymer
  guides <- guides %>%
    mutate(
      locus_tag = locus,
      product   = prod,
      gene_start = gene_start,
      gene_end   = gene_end,
      gene_strand = gene_strand,
      tss_pos   = tss,
      gc        = vapply(guide_seq, function(x) calc_gc(DNAString(x)), numeric(1)),
      has_hpoly = vapply(guide_seq, function(x) has_homopolymer_AT(x, homopolymer_n), logical(1)),
      dist_to_tss = mapply(
        FUN = calc_dist_to_tss,
        gene_start = gene_start,
        gene_end   = gene_end,
        gene_strand = gene_strand,
        guide_start = guide_start,
        guide_end   = guide_end
      )
    ) %>%
    filter(!is.na(gc)) %>%
    filter(gc >= gc_min, gc <= gc_max) %>%
    filter(!has_hpoly)

  if (nrow(guides) == 0) {
    message("All candidates filtered out by GC/homopolymer for ", locus)
    next
  }

  # Assign each guide a unique ID
  guides <- guides %>%
    arrange(region, abs(dist_to_tss), desc(gc)) %>%
    mutate(
      guide_id = {
        # stable and readable: CP052959|locus|region|start|strand
        paste0("CP052959|", locus_tag, "|", region, "|", guide_start, "|", guide_strand)
      }
    )

  all_guides[[length(all_guides) + 1L]] <- guides
  message("Kept guides after filters: ", nrow(guides))
}

guides_df <- bind_rows(all_guides)

if (nrow(guides_df) == 0) {
  stop("No guides generated after filtering. Consider lowering gc_min or relaxing homopolymer_n/promoter_len.")
}

# Write the pre-off-target table
write_tsv(guides_df %>%
            select(
              guide_id, locus_tag, product, region,
              guide_seq, pam_seq, guide_strand,
              guide_start, guide_end, pam_start, pam_end,
              gene_start, gene_end, gene_strand, tss_pos, dist_to_tss,
              gc, has_hpoly
            ),
          guides_preoff_tsv)

message("\nWrote guides_preoff: ", guides_preoff_tsv)

# ---------- Write BLASTguides.fasta ----------
# Note: the sequences written out are guide_seq (20 nt)
guides_set <- DNAStringSet(guides_df$guide_seq)
names(guides_set) <- guides_df$guide_id
writeXStringSet(guides_set, guides_fasta, format = "fasta")
message("Wrote guides FASTA for BLAST: ", guides_fasta)

# ---------- Run BLAST ----------
# Check whether the BLAST DB exists; otherwise try to build it
db_nsq <- paste0(blast_db_name, ".nsq")
db_nin <- paste0(blast_db_name, ".nin")
db_nhr <- paste0(blast_db_name, ".nhr")
if (!file.exists(db_nsq) && !file.exists(db_nin) && !file.exists(db_nhr)) {
  message("BLAST DB not found (", blast_db_name, "). Trying to build it from ", genome_fasta)
  cmd_mk <- sprintf("makeblastdb -in '%s' -dbtype nucl -out '%s'", genome_fasta, blast_db_name)
  message("Running: ", cmd_mk)
  system(cmd_mk, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)
}

cmd_blast <- sprintf(
  "blastn -task %s -word_size %d -evalue %g -query '%s' -db '%s' -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 20 -out '%s'",
  blast_task, blast_wordsize, blast_evalue, guides_fasta, blast_db_name, blast_matches_file
)
message("Running: ", cmd_blast)
system(cmd_blast, intern = FALSE, ignore.stdout = TRUE, ignore.stderr = FALSE)

stopif_file_missing(blast_matches_file)

# ---------- Parse the BLAST output ----------
blast_raw <- read_tsv(
  blast_matches_file,
  col_names = c("qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore"),
  show_col_types = FALSE
)

if (nrow(blast_raw) == 0) {
  stop("No BLAST hits found. Check that guides_fasta and genome DB match.")
}

# Normalize sstart/send to min/max (simplifies downstream checks)
blast_hits <- blast_raw %>%
  mutate(
    smin = pmin(sstart, send),
    smax = pmax(sstart, send),
    hit_strand = ifelse(sstart <= send, "+", "-")
  )

# Hit counting (crude: all hits; the refined off_refined pass is done later by the ScanOfftarget script)
hit_counts <- blast_hits %>%
  count(qseqid, name = "n_hits") %>%
  rename(guide_id = qseqid)

# Merge in the guide annotations
blast_merged <- blast_hits %>%
  rename(guide_id = qseqid) %>%
  left_join(guides_df %>% select(guide_id, locus_tag, region, guide_seq, guide_strand, guide_start, guide_end), by = "guide_id") %>%
  left_join(hit_counts, by = "guide_id") %>%
  arrange(locus_tag, region, guide_id, desc(pident), desc(length), desc(bitscore))

write_tsv(blast_merged, blast_hits_tsv)
message("Wrote blast hits table: ", blast_hits_tsv)

message("\nDONE.")
message("Next step: run ScanOfftarget_CP052959.R to generate off_n3/off_n5/off_refined outputs.")

5) 5_ScanOfftarget_CP052959.R (updated with strict hit counting)

#!/usr/bin/env Rscript

# ScanOfftarget_CP052959.R
# Adapted to the current pipeline outputs:
#   - guides_preoff: CP052959_guides_preoff.tsv (locus_tag, dist_to_tss, gc, guide_id...)
#   - blast_hits:    CP052959_blast_hits.tsv (guide_id, pident, length, mismatch, gapopen... + n_hits)
#
# Outputs:
#   scan_results/off_n1|off_n3|off_n5|off_n10|off_refined/{guides_all.tsv,guides_top5.tsv}

suppressPackageStartupMessages({
  if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
  if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
  library(readr)
  library(dplyr)
})

guides_preoff_tsv <- "CP052959_guides_preoff.tsv"
blast_hits_tsv    <- "CP052959_blast_hits.tsv"
guide_len         <- 20L
base_outdir       <- "scan_results"

message("Loading guides: ", guides_preoff_tsv)
guides <- read_tsv(guides_preoff_tsv, show_col_types = FALSE)

message("Loading BLAST hits: ", blast_hits_tsv)
blast <- read_tsv(blast_hits_tsv, show_col_types = FALSE)

if (!dir.exists(base_outdir)) dir.create(base_outdir)

# ---------- Column-name compatibility: gene_id / locus_tag; GC / gc ----------
# The guides table must contain guide_id
if (!"guide_id" %in% names(guides)) stop("guides_preoff is missing the guide_id column: ", guides_preoff_tsv)

# gene_id: prefer gene_id; fall back to locus_tag
if (!"gene_id" %in% names(guides)) {
  if (!"locus_tag" %in% names(guides)) stop("guides_preoff has neither gene_id nor locus_tag")
  guides <- guides %>% mutate(gene_id = locus_tag)
}

# GC: prefer GC; fall back to gc
if (!"GC" %in% names(guides)) {
  if (!"gc" %in% names(guides)) stop("guides_preoff has neither GC nor gc")
  guides <- guides %>% mutate(GC = gc)
}

# dist_to_tss must be present
if (!"dist_to_tss" %in% names(guides)) stop("guides_preoff is missing the dist_to_tss column")

# ---------- Check the BLAST hits columns ----------
# We need: guide_id, pident, length, mismatch, gapopen
needed_blast_cols <- c("guide_id", "pident", "length", "mismatch", "gapopen")
missing_cols <- setdiff(needed_blast_cols, names(blast))
if (length(missing_cols) > 0) {
  stop("blast_hits is missing required columns: ", paste(missing_cols, collapse = ", "),
       "\nMake sure blast_hits_tsv was generated by GuideFinder_CP052959_base.R.")
}

# ---------- Filtering rules ----------
# Generic: filter by an upper bound on n_hits
# If still too strict (still 0 guides), relax a little, e.g. mismatch 2 --> 3:
#   filter_by_nhits(10, require_full_length = TRUE, max_mismatch = 3, require_no_gaps = TRUE)
filter_by_nhits <- function(max_hits,
                            require_full_length = TRUE,
                            max_mismatch = 2,
                            require_no_gaps = TRUE,
                            min_pident = 0) {

  # 1) First keep the BLAST hits that look like genuine off-target risks
  b <- blast

  # Coerce to numeric types (character columns would make all comparisons NA)
  b <- b %>%
    mutate(
      pident  = as.numeric(pident),
      length  = as.integer(length),
      mismatch = as.integer(mismatch),
      gapopen = as.integer(gapopen)
    )

  if (require_full_length) {
    b <- b %>% filter(length == guide_len)
  }
  if (require_no_gaps) {
    b <- b %>% filter(gapopen == 0)
  }
  if (!is.null(max_mismatch)) {
    b <- b %>% filter(mismatch <= max_mismatch)
  }
  if (!is.null(min_pident) && min_pident > 0) {
    # Note: pident is usually a percentage in the 0-100 range
    b <- b %>% filter(pident >= min_pident)
  }

  # 2) Recount n_hits per guide using only the strict hits (more sensible)
  valid_ids <- b %>%
    count(guide_id, name = "n_hits_strict") %>%
    filter(n_hits_strict <= max_hits) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Refined rule off_refined:
#  - exactly 1 perfect hit: length==guide_len & pident==100 & mismatch==0 & gapopen==0
#  - all other hits (if any): length < 15 or mismatch >= 3
filter_refined <- function() {
  blast_grouped <- blast %>%
    group_by(guide_id) %>%
    summarise(
      n_hits = n(),
      n_perfect = sum(length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0),
      other_ok = all(
        ifelse(
          length == guide_len & pident == 100 & mismatch == 0 & gapopen == 0,
          TRUE,
          (length < 15 | mismatch >= 3)
        )
      ),
      .groups = "drop"
    )

  valid_ids <- blast_grouped %>%
    filter(n_perfect == 1, other_ok) %>%
    pull(guide_id)

  guides %>% filter(guide_id %in% valid_ids)
}

# Write results to a subdirectory
run_scenario <- function(name, guides_selected) {
  outdir <- file.path(base_outdir, name)
  if (!dir.exists(outdir)) dir.create(outdir, recursive = TRUE)

  all_path <- file.path(outdir, "guides_all.tsv")
  top_path <- file.path(outdir, "guides_top5.tsv")

  write_tsv(guides_selected, all_path)

  guides_top <- guides_selected %>%
    group_by(gene_id) %>%
    arrange(dist_to_tss, desc(GC)) %>%
    slice_head(n = 5) %>%
    ungroup()

  write_tsv(guides_top, top_path)

  message(sprintf("[%s] total guides: %d, genes covered: %d",
                  name, nrow(guides_selected), length(unique(guides_selected$gene_id))))
}

# ---------- Run the scenarios ----------
message("Running scenario: n_hits <= 1")
run_scenario("off_n1", filter_by_nhits(1))

message("Running scenario: n_hits <= 3")
run_scenario("off_n3", filter_by_nhits(3))

message("Running scenario: n_hits <= 5")
run_scenario("off_n5", filter_by_nhits(5))

message("Running scenario: n_hits <= 10")
run_scenario("off_n10", filter_by_nhits(10))

message("Running scenario: refined rule")
run_scenario("off_refined", filter_refined())

message("All scenarios finished. Check 'scan_results/' directory.")

Final notes

  • If you move to another machine, the only strict requirements are BLAST+, the R packages, and the same input FASTA/GBK files.
  • If a target gene yields no candidates again, check:

    • scan_gene_frac (set to 1.0 for short genes)
    • promoter length (increase if needed)
    • GC/homopolymer thresholds (slightly relax)

If the gRNA backbone is available (e.g., a BsaI/BsmBI Golden Gate format), an appendix section can be added that auto-generates the exact top/bottom oligos for ordering from the selected guides.
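As a sketch of what such an appendix could generate: for a 20-nt spacer, the ordering oligos are the spacer and its reverse complement, each prefixed with a vector-specific 4-nt overhang. The ACCG/AAAC overhangs below are placeholders from a common BsaI sgRNA backbone convention, not taken from any specific vector here; confirm them against the actual backbone map before ordering.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string (uppercase ACGT only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def golden_gate_oligos(spacer: str, top_overhang: str = "ACCG", bottom_overhang: str = "AAAC"):
    """Return (top, bottom) ordering oligos for a 20-nt spacer.

    The overhangs are vector-specific placeholders -- check the BsaI/BsmBI
    cut sites of the actual backbone before ordering.
    """
    if len(spacer) != 20:
        raise ValueError("expected a 20-nt spacer")
    top = top_overhang + spacer.upper()            # overhang + spacer
    bottom = bottom_overhang + revcomp(spacer)     # overhang + reverse complement
    return top, bottom

top, bottom = golden_gate_oligos("ATGCATGCATGCATGCATGC")
print(top)     # top strand: ACCG + spacer
print(bottom)  # bottom strand: AAAC + reverse complement of spacer
```

The annealed duplex then drops into the BsaI/BsmBI-cut backbone with matching sticky ends.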

Guide RNA Design with GuideFinder for the Toxin–Antitoxin Unit in Staphylococcus epidermidis HD46 (CP052959.1)

Part A | Step-by-step: migrating the "isolate 125" pipeline to CP052959.1

0️⃣ Suggested directory layout (least hassle)

Create a new working directory, for example:

HD46_GuideFinder/
  CP052959.gbk
  CP052959.fna
  extract_cds_from_gbk.py
  PreProcessing_CP052959.R
  GuideFinder_CP052959_base.R
  ScanOfftarget_CP052959.R
  (optional) *.Rmd

You can copy your original PreProcessing_125.R / GuideFinder_125_base.R / ScanOfftarget_125.R, rename them, and change every "125" filename prefix inside to "HD46" or "CP052959.1".


1️⃣ Extract CDS from GenBank → generate CP052959_CDS.fasta (Python + Biopython)

Input: CP052959.gbk. Output: CP052959_CDS.fasta (one sequence per CDS, with the locus_tag as the header).

Run (example):

python extract_cds_from_gbk.py \
  --gbk CP052959.gbk \
  --out CP052959_CDS.fasta

Check that the output contains headers like:

  • >HJI06_09100 ...
  • >HJI06_09105 ...

Purpose of this step: each CDS is later mapped back onto the whole genome with BLAST to produce a unified coordinate table (particularly suited to draft genomes / multi-contig assemblies; a complete genome runs the same way, which keeps the workflow uniform).


2️⃣ Pre-processing: BLAST the CDS back to the genome → generate the gene coordinate table

2.1 Build the BLAST database (GenomeDB)

Build the database from the genome FASTA (input CP052959.1.fna):

makeblastdb -in CP052959.1.fna -dbtype nucl -out GenomeDB_HD46

2.2 BLAST:CDS vs Genome

(Your notes used -task blastn, kept here; if your CDSs are very short, blastn-short also works.)

blastn -task blastn \
  -query HD46_CDS.fasta \
  -db GenomeDB_HD46 \
  -outfmt "6 qseqid sseqid sstart send sstrand length pident bitscore" \
  -max_target_seqs 1 \
  -perc_identity 95 \
  -out CDS_vs_genome_HD46.blast

2.3 Generate the coordinate table

Run your adapted pre-processing script (corresponding to PreProcessing_125.R in your notes):

Rscript PreProcessing_HD46.R
# or
Rscript -e "rmarkdown::render('PreProcessing_HD46.Rmd')"

Output: HD46_gene_coordinates.tsv

This table typically contains, for each gene (locus_tag):

  • contig/seqid (here it should be CP052959.1)
  • start/end
  • strand
  • (if the script computes them) the promoter region, approximate TSS position, gene length, etc.

3️⃣ Guide design: run GuideFinder (or your simplified GuideFinder-like scripts)

Your notes split this into:

  • GuideFinder_125_base.R: PAM scanning, GC/homopolymer filtering, dist_to_tss calculation, and writing the guides FASTA for BLAST
  • then ScanOfftarget_125.R: off-target tiering based on the BLAST hits (off_n3/off_n5/off_refined)

I recommend keeping the same split (clearer, and easier to debug).

3.1 (Optional but highly recommended) Restrict to the two TA genes: build a targets list

Create TA_targets.txt:

HJI06_09100
HJI06_09105

Then add a "keep only these two genes" filter in the GuideFinder script (if your 125 run was genome-wide, this speeds things up considerably).

3.2 Run the base design script

Rscript GuideFinder_HD46_base.R

It is expected to:

  1. Scan for NGG PAMs and generate 20 bp (or 20/21/22 bp) candidates
  2. Apply GC / homopolymer / bad-seed / restriction-site filters
  3. Correctly compute dist_to_tss (fixed this time, per your notes)
  4. Write BLASTguides.fasta and run BLAST (or at least prepare the BLAST input)
  5. Produce a "preoff" guides table (e.g., HD46_guides_preoff.tsv)

3.3 Off-target scan and tiered outputs

Rscript ScanOfftarget_HD46.R

It places the results in a directory structure like this (following the convention from your notes):

scan_results/
  off_n3/
    guides_all.tsv
    guides_top5.tsv
  off_n5/
    guides_all.tsv
    guides_top5.tsv
  off_refined/
    guides_all.tsv
    guides_top5.tsv

Part B | How to understand the off_refined selection logic

This part of your notes is crucial; here is a plainer summary (preserving your original intent):

1) What does the current n_hits statistic in the script count?

  • Currently n_hits is the number of BLAST hits per guide, with no distinction between promoter / CDS / intergenic
  • It does not check whether the hit site is flanked by an NGG PAM
  • It is therefore on the conservative side (matches that actually lack a PAM are still counted as potential risks)

2) What is the core of the off_refined strategy?

Your notes put it well:

  • off_refined already removes, at the BLAST level, the truly dangerous "second perfect 20/20 hit"
  • The remaining guides are typically "1 on-target hit + a pile of degenerate off-target hits" (many mismatches, incomplete seed, etc.), much closer to the ideal of acting only at the intended site

3) Your proposal to count only PAM-adjacent hits as off-targets is correct

From the dCas9/Cas9 biology:

  • Without a suitable PAM (NGG here), a match usually poses no real off-target risk
  • A later layer can be added: for each BLAST hit, fetch the flanking genomic sequence, check for an NGG, and count only PAM-adjacent hits in n_hits

But for your stated use case (practical CRISPRi design), off_refined is already a very usable default tier.
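That extra PAM-checking layer can be sketched as follows, assuming the genome is held as a plain string and each BLAST hit provides its 1-based smin/smax and hit_strand as in CP052959_blast_hits.tsv (the function name and toy genome below are illustrative, not part of the pipeline):

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))

def hit_has_ngg_pam(genome: str, smin: int, smax: int, hit_strand: str) -> bool:
    """Check whether a BLAST hit (1-based, inclusive smin..smax) is flanked
    by an NGG PAM in the orientation relevant for (d)Cas9.

    On '+' the PAM is the 3 nt immediately 3' of the match (smax+1..smax+3);
    on '-' it is the reverse complement of the 3 nt at smin-3..smin-1.
    """
    if hit_strand == "+":
        pam = genome[smax : smax + 3]               # 0-based slice of the 3 nt after smax
    else:
        if smin < 4:                                # no room for a PAM before the hit
            return False
        pam = revcomp(genome[smin - 4 : smin - 1])  # 3 nt before smin, reverse-complemented
    return len(pam) == 3 and pam[1:] == "GG"

# Toy genome: 4 nt padding + a 20-nt protospacer + TGG (an NGG PAM) + padding
genome = "AAAA" + "ATGCATGCATGCATGCATGC" + "TGG" + "AAAA"
print(hit_has_ngg_pam(genome, smin=5, smax=24, hit_strand="+"))  # True
```

Counting only the hits for which this returns True would give the PAM-aware n_hits described above.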


Part C | Draft email to Xxxxxx (ready to copy-paste and send)

You wanted to "write an email to Xxxxxx saying it is feasible in principle and that a probe-test was run; best match Enterobacter intestinihominis…; using off_refined; explain the results; attach the algorithm details; attach guides_all.tsv / guides_top5.tsv". Below is the English email (worded as a work email). Replace the bracketed placeholders with your actual numbers/filenames.

Hi Xxxxxx,

In principle the guide design pipeline is working end-to-end now. I ran a small probe-test on Patricia’s previously assembled complete isolate “125” genome input.

Quick taxonomy sanity check (PATRIC/closest match): Best match: Enterobacter intestinihominis (taxid = 3133180; species; lineage: Bacteria → Pseudomonadati/Pseudomonadota → Gammaproteobacteria → Enterobacterales → Enterobacteriaceae).

What the pipeline produced (strategy: off_refined):

  • It completes the sequence-level filters (GC + homopolymers) and computes dist_to_tss correctly.
  • It then runs a BLAST-based off-target scan and generates the final guide tables.
  • Attached are:

    • guides_all.tsv (all guides passing filters)
    • guides_top5.tsv (up to 5 best guides per gene)

Summary of the results (off_refined):

  • Total genes evaluated: [N_genes]
  • Genes with ≥1 guide: [N_with_guides] ([pct]% coverage)
  • Genes without guides: [N_without_guides]
  • Total guides retained in guides_all.tsv: [N_guides_all]
  • Median guides per gene (genes with guides): [median_guides_per_gene] (These numbers are from the attached tables.)

Algorithm details (off_refined, in plain terms):

  1. Candidate generation: scan for NGG PAM sites and create guide candidates (20 bp; optionally 21/22 bp depending on settings).
  2. Primary filters: remove candidates failing GC threshold and basic sequence-quality rules (e.g., long homopolymer runs; optional “bad-seed” motifs / restriction sites).
  3. TSS proximity scoring: compute dist_to_tss and prioritize guides closer to the inferred transcription start region for CRISPRi.
  4. Off-target screening via BLAST: align guide (or seed region, depending on mode) against the whole genome.
  5. off_refined rule: discard any guide that shows a second perfect full-length match elsewhere in the genome (i.e., another 20/20 hit besides the intended on-target). The remaining guides typically have 1 on-target hit plus only weak/partial off-target hits.

Note: in the current implementation, n_hits is counted genomewide from BLAST hits (no promoter/CDS distinction and no explicit PAM check at hit sites), which is conservative. If needed, we can further refine by counting only BLAST hits that also have an appropriate NGG PAM in the correct context.

Best, Yyyy


If you next apply this pipeline to the HD46 (CP052959.1) TA unit you actually care about, I suggest running through Part A above and then pasting the HJI06_09100 and HJI06_09105 rows from scan_results/off_refined/guides_top5.tsv; a quick second-pass selection can then be made of "which single guide / which pair of guides is best for knocking down this TA unit together."

GuideFinder-Based gRNA Design for CRISPRi Knockdown of a Toxin–Antitoxin Operon in S. epidermidis HD46 (CP052959.1)

The locus_tags of the two genes are:

  • Toxin: HJI06_09100, coordinates complement(1889026..1889421)
  • Antitoxin: HJI06_09105, coordinates complement(1889421..1889651)

One very important correction: both genes are annotated as complement, i.e. on the minus strand. So "upstream of the operon / promoter at the front" corresponds to:

  • For a minus-strand gene, "upstream (the 5' end)" lies on the side with the larger coordinate
  • Judging from the coordinates, your TA unit is transcribed from 1889651 (larger) → 1889026 (smaller)

If you want to knock down toxin+antitoxin as one unit, the most effective strategy is usually to target near the 5' end of the antitoxin (the side closer to 1889651) and its upstream promoter region; this is the easiest way to suppress the entire co-transcript.
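The coordinate arithmetic behind this can be made concrete with a small sketch that mirrors the convention in GuideFinder_CP052959_base.R (promoter_len = 600 matches the script's setting; the genome length below is a placeholder, not the real CP052959.1 length):

```python
def promoter_window(gene_start: int, gene_end: int, strand: str,
                    promoter_len: int, genome_len: int):
    """Return (start, end) of the promoter window in plus-strand coordinates.

    Mirrors the convention in GuideFinder_CP052959_base.R: on '-' the TSS
    sits at gene_end, so the promoter lies at LARGER plus-strand coordinates.
    """
    if strand == "+":
        return max(1, gene_start - promoter_len), gene_start - 1
    return gene_end + 1, min(genome_len, gene_end + promoter_len)

# Antitoxin HJI06_09105: complement(1889421..1889651) -> minus strand
start, end = promoter_window(1889421, 1889651, "-", 600, 2_500_000)
print(start, end)  # 1889652 1890251 -- upstream lies at the larger coordinates
```

This is why guides near 1889651 and above are the "promoter/TSS-proximal" candidates for the TA unit.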


1) "Paired guides" explained clearly

Paired guides = two guides at different positions against the same target (same gene / same transcription unit), expressed simultaneously so that dCas9 blocks transcription at two points, strengthening or stabilizing the repression.

  • It does not mean "targeting two different genes separately"
  • The more suitable usage for you:

    • 1 guide parked near the promoter/TSS
    • 1 more guide parked in the front part of the operon (here: the antitoxin 5' end or the immediately adjacent region)
  • GuideFinder outputs usable guide pairs according to your configured minimum spacing (commonly 100 bp)
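The minimum-spacing rule can be sketched as follows, assuming a list of candidate guide start coordinates already sorted in priority order (this helper is illustrative; GuideFinder implements its own pairing):

```python
def pick_pair(starts, min_spacing=100):
    """Return the first pair of guide start positions at least
    min_spacing bp apart, scanning candidates in priority order.
    Returns None if no such pair exists."""
    for i, a in enumerate(starts):
        for b in starts[i + 1:]:
            if abs(a - b) >= min_spacing:
                return a, b
    return None

# Candidates ordered by priority (e.g., by distance to the TSS)
print(pick_pair([1889640, 1889610, 1889500]))  # (1889640, 1889500)
```

Keeping the best-ranked guide and pairing it with the next candidate far enough away preserves the priority ordering while enforcing the spacing constraint.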

2) Designing guides for these two HD46 (CP052959.1) locus_tags with GuideFinder: practical steps

The steps below follow the most common "complete genome" workflow (you do not need to find NGG sites by hand; GuideFinder does that).

Step A: Prepare the files

Download these two files from NCBI into the same directory (recommended):

  1. Genome sequence FASTA: CP052959.1.fna
  2. Genome annotation GenBank: CP052959.1.gbff (or .gbk)

The FASTA provides the sequence; the GenBank file provides the gene/CDS coordinates and locus_tags. GuideFinder relies on this information to locate the target genes and extract promoter regions.

Step B: Install the dependencies

You need:

  • R / RStudio
  • BLAST+ (command-line blastn): GuideFinder uses it for off-target screening
  • the GuideFinder Rmarkdown/R scripts (GitHub: ohlab/Guide-Finder)

Step C: Prepare the target gene list

Create a text file (e.g., targets.txt) with just two lines:

HJI06_09100
HJI06_09105

Step D: Set the key GuideFinder parameters (combination recommended for S. epidermidis)

Because S. epidermidis has low GC content, use a "strict first, then iteratively relax" strategy (a GuideFinder feature):

First round (main parameters)

  • PAM: NGG (the SpCas9 default)
  • GC minimum: 35% (or slightly lower)
  • Prefer the TSS/promoter vicinity: bias toward guides closer to the TSS
  • Off-target screening: on (strongly recommended; do not disable it)

Iteration rounds (automatically relaxed only for genes with no guides from the first round)

  • GC minimum: lower to 30%
  • Maximum distance to the TSS: relax
  • If necessary, relax some further sequence filters (homopolymers, etc.), but do not loosen the off-target screen too much

Step E: Run and inspect the outputs

GuideFinder typically gives you two kinds of output tables (names may vary slightly by version):

  1. Top hits (single-guide recommendations)
  2. Paired guides (guide-pair recommendations)

Since your goal is to suppress the TA unit as a whole, prioritize from the outputs:

  • Guides on the side closer to 1889651 (the 5' end / upstream on the minus strand)

    • Focus on the guides for HJI06_09105 (antitoxin) that rank highest and sit closest to its 5' end / putative promoter region
    • HJI06_09100 (toxin) will also yield guides, but for a whole-unit knockdown the positions at the "entry" are usually stronger

3) For this TA unit, how to pick the guide most likely to suppress both genes at once

Because the two genes are adjacent and co-transcribed, the most recommended options are:

Option 1: Single guide (simplest, most often successful)

  • Pick the 1 guide closest to the operon promoter/TSS
  • Experimentally, first use it to verify that both the toxin and antitoxin mRNAs drop (measure both by RT-qPCR)

Option 2: Paired guides (a more robust, enhanced version)

  • Pick two guides both located in the front part of the operon (generally centered on the antitoxin 5' end / upstream region)
  • The two guides should satisfy your configured minimum spacing (e.g., 100 bp)
  • Express both guides from the same vector (or in tandem)

4) A key reminder: on the minus strand, "upstream" is at the larger-coordinate end

Your original email said "(+)", but the GenBank annotation clearly says complement. So when you look at coordinates in the GuideFinder output, remember:

  • The closer to 1889651 (the larger coordinate), the closer to the transcription start / upstream entry
  • Such guides are usually the easiest way to suppress the whole TA transcription unit

Paste the top few candidates for these two locus_tags from the GuideFinder output tables (including coordinates, strand, sequence, and off-target status), and a quick second-pass selection can be made: which single guide is best for a one-shot knockdown, and which pair is best for the enhanced paired-guide setup.

GEO HTS Submission

The steps are basically: (1) prepare the GEO HTS metadata spreadsheet, (2) upload files to GEO FTP, (3) submit the metadata file via the GEO web form, and (4) respond to any curator questions (esp. raw FASTQ).

1) Make a single dataset folder name for GEO (recommended)

GEO wants one folder per submission/dataset. Create one top-level folder name that will be uploaded as-is:

cd ~/Downloads
mv GEO_submissions GeoMx_DSP_M666_UKE_Hamburg

Inside it should remain:

  • dcc/
  • pkc/
  • annotation/
  • README.txt
  • counts_matrix.tsv

2) Prepare the GEO HTS metadata spreadsheet (REQUIRED)

On the GEO “Submit HTS” page:

  • Download the HTS metadata spreadsheet template.
  • Fill at least these tabs:

STUDY tab

  • Title, summary, overall design.
  • Add the “raw FASTQ missing” sentence if you still don’t have FASTQs:

    Raw sequencing FASTQ files are currently not available to the authors; processed DCC outputs and PKC probe metadata are provided. FASTQs will be deposited to SRA and linked once obtained.

SAMPLES tab (most important)

One row per ROI/AOI (so ~45 rows for your DCCs).

  • Use Sample_ID (from your Sheet8/clean TSV) as the sample/library name.
  • For each row, set processed data file to the matching DCC filename (e.g., DSP-...-A02.dcc).
  • Include key sample characteristics (Group/Location/PCR/Case/etc.) from your annotation.

PROTOCOLS tab

  • Sample prep + GeoMx DSP workflow description.
  • Mention panels: Hs_R_NGS_WTA_v1.0, GeoMx_Hs_CTA_v1.0, GeoMx_COVID19_v1.0.
  • State: DCC exported from NanoString GeoMx NGS pipeline; PKC required to interpret RTS_IDs.

FILES / supplementary (if the template has it)

List:

  • all DCCs
  • all PKCs
  • the annotation file(s) (at least the clean TSV or the Excel)
  • README.txt
  • counts_matrix.tsv (optional, but you have it already)

Tip: In the metadata, reference counts_matrix.tsv as “gene-level raw counts matrix (targets x ROIs)” and keep DCCs as the authoritative per-ROI outputs.
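For context, the relationship between the DCCs and counts_matrix.tsv can be sketched as below, assuming the standard NanoString DCC layout in which each file carries a <Code_Summary> section of RTS_ID,count lines; collapsing RTS_IDs to gene-level targets additionally requires the PKC mapping, which this illustrative parser deliberately skips:

```python
import csv
import io

def parse_dcc_code_summary(text: str) -> dict:
    """Extract RTS_ID -> count from a DCC file's <Code_Summary> section
    (assumed layout: one 'RTS_ID,count' line per probe)."""
    counts, inside = {}, False
    for line in text.splitlines():
        line = line.strip()
        if line == "<Code_Summary>":
            inside = True
        elif line == "</Code_Summary>":
            inside = False
        elif inside and line:
            rts_id, n = line.split(",")
            counts[rts_id] = int(n)
    return counts

def counts_matrix(dccs: dict) -> str:
    """Assemble a probe x ROI TSV from {roi_name: dcc_file_text}."""
    rois = sorted(dccs)
    parsed = {r: parse_dcc_code_summary(dccs[r]) for r in rois}
    probes = sorted({p for c in parsed.values() for p in c})
    out = io.StringIO()
    w = csv.writer(out, delimiter="\t", lineterminator="\n")
    w.writerow(["RTS_ID"] + rois)
    for p in probes:
        w.writerow([p] + [parsed[r].get(p, 0) for r in rois])
    return out.getvalue()

demo = "<Code_Summary>\nRTS001,5\nRTS002,12\n</Code_Summary>"
print(counts_matrix({"ROI_A02": demo}))
```

This is why the DCCs stay authoritative: the matrix is a derived convenience view, regenerable from the per-ROI files at any time.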

3) FTP upload to GEO (data files only)

  1. Connect to GEO’s FTP using the credentials/path GEO gives you (your personal upload directory).
  2. In that upload directory, create a dataset folder, e.g.:

    • GeoMx_DSP_M666_UKE_Hamburg/
  3. Upload only the data files by FTP:

    • the whole folder with subfolders: dcc/, pkc/, annotation/, plus README.txt, counts_matrix.tsv
  4. Do NOT upload the metadata spreadsheet by FTP (GEO explicitly says upload metadata via the web form).

4) Submit the metadata spreadsheet on the GEO web page

After the FTP transfer finishes:

  • Go to “Submit a new high-throughput sequencing submission”.
  • Upload the metadata spreadsheet file for this dataset.
  • Submit → this places the dataset in the GEO processing queue.

5) If GEO asks for raw FASTQs

This is the most common friction point for GeoMx when FASTQs aren’t available.

  • If they allow: they’ll accept processed now and ask you to link SRA later.
  • If they require: you’ll need to obtain the FASTQs from the sequencing provider, or email GEO explaining that you only received DCC/PKC files and are still requesting the FASTQs.

TODO: upload/paste the GEO HTS metadata spreadsheet template (blank) or a screenshot of the SAMPLES tab columns — to have a look which columns to fill with which fields from your annotation/M666_Sheet8_annotation_clean.tsv.

Comprehensive summary: GEO vs NCBI Submission Portal (SRA/BioProject/BioSample) for bulk RNA-seq + GeoMx DSP

https://submit.ncbi.nlm.nih.gov/subs/

https://www.ncbi.nlm.nih.gov/geo/subs/

1) What each platform is best for

  • NCBI Submission Portal: used primarily to submit raw sequencing reads to the Sequence Read Archive (SRA) and to create/link the organizing records:

    • BioProject (the overall study / project container)
    • BioSample (the per-sample metadata records required for SRA)
  • NCBI GEO (Gene Expression Omnibus): best for gene expression–style datasets and processed outputs, including:

    • count matrices / processed tables
    • sample/ROI metadata tables
    • GeoMx DSP outputs (DCC/PKC)
    • supplementary analysis outputs and documentation (README, workflow description)

  GEO is commonly used for “processed + metadata”, while raw FASTQs (if available) go to SRA.

2) Where to submit your two dataset types

  • Bulk RNA-seq (Illumina) with raw FASTQ files

    • Submit to SRA via the NCBI Submission Portal.
    • Ensure you have/define: BioProject + BioSamples, then upload FASTQs and metadata.
  • GeoMx DSP spatial transcriptomics with only DCC/PKC + annotation Excel (+ R workflow), no FASTQs

    • Submit to GEO (not SRA), because DCC/PKC are processed outputs, not raw reads.
    • In GEO, upload:

      • DCC files (processed counts + QC/metrics per ROI/AOI)
      • PKC files (panel/probe definitions)
      • annotation table (Excel is okay; CSV/TSV is even better for machine readability)
      • README describing file structure + column meanings + mapping between ROI IDs and metadata
      • analysis workflow (e.g., R scripts; optionally hosted on GitHub/Zenodo and linked)
  • If GeoMx FASTQs are obtained later

    • Submit GeoMx FASTQs to SRA, and in the GEO record state: raw reads in SRA, processed data in GEO, cross-linking accessions.
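Since the DCCs are the authoritative processed outputs, a quick parser helps sanity-check them (and cross-check against counts_matrix.tsv) before submission. A sketch under one assumption to verify against your own files: DCC files are plain text with tagged sections, and the `<Code_Summary>` section holds `RTS_ID,count` lines.

```python
def read_dcc_counts(path):
    """Parse per-probe counts from one GeoMx DCC file.

    Assumption (check against your files): counts sit between
    <Code_Summary> and </Code_Summary> as 'RTS_ID,count' lines.
    RTS_IDs only become gene/target names once mapped through the
    matching PKC files, as noted above.
    """
    counts, in_summary = {}, False
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if line == "<Code_Summary>":
                in_summary = True
            elif line == "</Code_Summary>":
                in_summary = False
            elif in_summary and "," in line:
                rts_id, n = line.split(",", 1)
                counts[rts_id] = int(n)
    return counts
```

Running this over all ~45 DCCs and comparing column sums against the gene-level matrix is a cheap way to catch a mismatched or truncated file before GEO curators do.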

3) What to choose in the NCBI “Start a new submission” list

For your use case:

  • Choose Sequence Read Archive (SRA) ✅ for FASTQ (raw reads).
  • You will also use/associate BioProject and BioSample (often created during the SRA workflow).
  • Do not use GenBank/TSA/Genome for these transcriptomics read submissions (those are for assembled sequences/genomes, not raw RNA-seq reads).
  • GEO is not started from that list; it has its own submission entry point.
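When the FASTQs do go through the SRA route, it is useful to record a checksum per file before upload so transfers can be verified; the Submission Portal wizard states its exact requirements, so treat this as a generic helper rather than an SRA-specified format. The `fastq/` directory name is an assumption:

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk=1 << 20):
    """Stream a file in 1 MiB chunks and return its MD5 hex digest,
    so large FASTQ.gz files never load fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while block := fh.read(chunk):
            h.update(block)
    return h.hexdigest()

if __name__ == "__main__":
    # assumed layout: raw reads collected under fastq/
    for fq in sorted(Path("fastq").glob("*.fastq.gz")):
        print(f"{md5sum(fq)}  {fq.name}")
```

The printed `digest  filename` lines follow the conventional `md5sum` output style, which makes them easy to diff against checksums computed on the server side.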

4) Accounts / login clarification

  • Same login principle: In general, you use the same My NCBI account across NCBI services.
  • Why GEO can feel like “another account”:

    • GEO requires a GEO Submitter Profile (contact identity/profile) attached to a My NCBI account.
    • In many labs, a PI or colleague already has a GEO profile under their account—so the lab might say “GEO is on another account.”
  • Your options:

    • Create/use your own My NCBI + GEO Submitter Profile and submit under your name.
    • Or submit using the lab/PI account that already has the GEO profile (for consistent lab identity in GEO).

5) Practical “Data availability” logic for a manuscript

A journal-friendly setup is:

  • Raw sequencing reads (bulk RNA-seq FASTQs; later GeoMx FASTQs if available) → SRA accession(s)
  • Processed expression outputs and metadata (GeoMx DCC/PKC + annotation + processed tables) → GEO accession(s)
  • Workflow/code → GitHub + DOI archive (e.g., Zenodo), linked from GEO and/or the paper.

6) ENA note (from the email context)

  • If a manuscript currently lists an ENA project accession that can’t be found (e.g., it looks like a placeholder), you should either:

    • confirm the correct existing ENA project accession, or
    • create a new submission/accession (ENA or NCBI—both are widely accepted, but your processed GeoMx outputs still fit best in GEO).

7) Key takeaway (one-liner)

  • FASTQ = SRA (via NCBI Submission Portal + BioProject/BioSample)
  • GeoMx DCC/PKC + annotation + processed outputs = GEO
  • One My NCBI login, but GEO needs a GEO Submitter Profile and uses a separate submission entry point.