Category Archives: Articles

微生物与病毒生物信息学 vs. 细菌与病毒生物信息学

核心区别

  • Microbial & Viral Bioinformatics(微生物与病毒生物信息学)
    研究对象更广,涵盖所有类型的微生物(细菌、古菌、真菌、原生生物等)以及病毒。
  • Bacterial and Viral Bioinformatics(细菌与病毒生物信息学)
    研究对象较窄,专注于细菌和病毒,不包括其他非细菌类微生物。
Bacterial_and_Viral_Bioinformatics

对比表

维度 微生物与病毒生物信息学 细菌与病毒生物信息学
研究对象范围 微生物(细菌、古菌、真菌、原生生物等)+ 病毒 仅限细菌 + 病毒
广度 广,涵盖多类生物 窄,专注于细菌相关
典型研究方向 宏基因组学、群落生态学、宿主-微生物互作、病毒-微生物关系 细菌基因组学、耐药基因预测、病原体分析、噬菌体研究
应用场景 微生物群落研究、环境样本分析、人体微生物组研究 医学细菌学、传染病、抗生素耐药机制
学科定位 偏向综合性、生态学与系统层面 偏向临床、病原学与应用层面

应用实例对比

  • 微生物与病毒生物信息学

    • 实例:分析人类肠道微生物组(细菌、真菌、古菌等)与病毒群落的互作,探索其与肥胖、糖尿病或免疫系统疾病的关系。
    • 特点:关注整体微生物生态系统,强调多类生物之间的协同与平衡。
  • 细菌与病毒生物信息学

    • 实例:研究医院获得性耐药细菌(如耐甲氧西林金黄色葡萄球菌,MRSA)及其与噬菌体的互作,寻找新的治疗策略。
    • 特点:聚焦病原体与临床应用,强调对疾病防控和治疗的直接价值。

总结

  • Microbial & Viral Bioinformatics:范围大,适合研究多样微生物群落及其与病毒的关系。
  • Bacterial and Viral Bioinformatics:范围小,专注于细菌和病毒,更聚焦临床和病原学研究。

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Create side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(12,6))

# --- Left: Microbial & Viral ---
venn_left = venn2(
    subsets=(1, 1, 1),
    set_labels=("Microbes", "Viruses"),
    ax=axes[0],
    set_colors=("skyblue", "lightcoral"),  # 微生物用蓝色,病毒用红色
    alpha=0.6
)
venn_left.get_label_by_id('10').set_text('Bacteria, Archaea,\nFungi, Protists')
venn_left.get_label_by_id('01').set_text('Viruses')
venn_left.get_label_by_id('11').set_text('Microbe-Virus\ninteractions')
axes[0].set_title("Microbial & Viral Bioinformatics")

# --- Right: Bacterial & Viral ---
venn_right = venn2(
    subsets=(1, 1, 1),
    set_labels=("Bacteria", "Viruses"),
    ax=axes[1],
    set_colors=("lightgreen", "lightcoral"),  # 细菌用绿色,病毒用红色
    alpha=0.6
)
venn_right.get_label_by_id('10').set_text('Bacteria\n(pathogens, commensals)')
venn_right.get_label_by_id('01').set_text('Viruses\n(phages, human viruses)')
venn_right.get_label_by_id('11').set_text('Bacteria-Virus\ninteractions')
axes[1].set_title("Bacterial & Viral Bioinformatics")

# Add a caption below the plots
fig.text(
    0.5, -0.05,
    "Comparison: Microbial & Viral Bioinformatics covers all microbes (bacteria, archaea, fungi, protists) plus viruses,\n"
    "while Bacterial & Viral Bioinformatics is narrower, focusing only on bacteria and viruses.",
    ha='center', fontsize=10
)

plt.tight_layout()
plt.show()
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6,6))

# 大圆 - Microbial
microbe_circle = plt.Circle((0,0), 1.0, color="skyblue", alpha=0.4, label="Microbes")
ax.add_artist(microbe_circle)

# 小圆 - Bacteria (在Microbial里)
bacteria_circle = plt.Circle((0.3,0.3), 0.4, color="lightgreen", alpha=0.6, label="Bacteria")
ax.add_artist(bacteria_circle)

# 另一个圆 - Viruses
virus_circle = plt.Circle((-0.2,-0.2), 0.6, color="lightcoral", alpha=0.6, label="Viruses")
ax.add_artist(virus_circle)

# 设置比例 & 美化
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect("equal")
ax.axis("off")

# 添加标签
ax.text(-0.9, 1.0, "Microbes\n(bacteria, archaea, fungi, protists)", fontsize=10, color="blue")
ax.text(0.4, 0.6, "Bacteria", fontsize=10, color="green")
ax.text(-0.9, -0.8, "Viruses", fontsize=10, color="red")

plt.title("Nested Scope: Microbes vs Bacteria with Viruses", fontsize=12)
plt.show()

The top 10 genes

  • TP53: The TP53 gene, also known as p53, is a tumor suppressor gene that plays a critical role in preventing cancer. Mutations in TP53 are present in up to 50% of all cancers and are associated with a worse prognosis. TP53 acts as a transcription factor, regulating the expression of genes involved in cell cycle control, DNA repair, and apoptosis. When DNA is damaged, p53 is activated, leading to cell cycle arrest, DNA repair, or apoptosis if the damage is too severe.

  • TNF: Tumor necrosis factor (TNF) is a cytokine that plays a critical role in the immune response to cancer and infectious diseases. TNF is produced by immune cells, such as macrophages and T cells, and can induce apoptosis in cancer cells. In addition, TNF has been targeted by drugs to treat autoimmune diseases, such as rheumatoid arthritis and psoriasis.

  • EGFR: Epidermal growth factor receptor (EGFR) is a transmembrane receptor that binds to epidermal growth factor (EGF) and other ligands to activate downstream signaling pathways. EGFR is frequently mutated in cancers, particularly in lung cancer, and mutations in EGFR can lead to resistance to chemotherapy and targeted therapies. EGFR-targeting drugs, such as gefitinib and erlotinib, have been developed to treat lung cancer and other cancers that overexpress EGFR.

  • VEGFA: Vascular endothelial growth factor A (VEGFA) is a protein that promotes angiogenesis, the formation of new blood vessels. VEGFA is overexpressed in many cancers, including breast, lung, and colon cancer, and plays a critical role in tumor growth and metastasis. Drugs targeting VEGFA, such as bevacizumab and ranibizumab, have been developed to inhibit angiogenesis and treat cancer.

  • APOE: Apolipoprotein E (APOE) is a protein involved in lipid metabolism and is important for the transportation of cholesterol and other lipids in the bloodstream. APOE has also been implicated in Alzheimer’s disease, as individuals with a certain APOE allele have an increased risk of developing the disease.

  • IL6: Interleukin 6 (IL6) is a cytokine that plays a critical role in the immune response to infection and inflammation. IL6 is produced by immune cells, such as T cells and macrophages, and can activate downstream signaling pathways that lead to inflammation and fever. IL6 has also been implicated in cancer, as high levels of IL6 in the bloodstream have been associated with a worse prognosis.

  • TGFB1: Transforming growth factor beta 1 (TGFB1) is a cytokine that plays a critical role in cell proliferation, differentiation, and immune regulation. TGFB1 is produced by immune cells, such as T cells and macrophages, and can activate downstream signaling pathways that lead to cell cycle arrest and apoptosis. In addition, TGFB1 has been implicated in cancer, as it can promote tumor growth and metastasis.

  • MTHFR (Methylenetetrahydrofolate reductase) is an enzyme that plays a critical role in the metabolism of amino acids. Specifically, MTHFR is involved in the conversion of homocysteine to methionine, which is essential for the production of S-adenosylmethionine (SAMe). SAMe is a methyl donor that is important for the methylation of DNA, RNA, and proteins, and is involved in many cellular processes including gene expression, protein synthesis, and cell signaling. Mutations in the MTHFR gene can lead to reduced activity of the enzyme, which can result in elevated levels of homocysteine in the blood. Elevated homocysteine levels have been linked to a number of health problems, including cardiovascular disease, stroke, and neural tube defects in newborns. Some studies have also suggested that MTHFR mutations may be associated with an increased risk of certain types of cancer, although the evidence is not conclusive.

  • ESR1 (Oestrogen receptor 1) is a protein that plays a critical role in the response to estrogen in many tissues, including the breast, uterus, and bone. The ESR1 gene codes for the estrogen receptor alpha, which is a nuclear receptor that binds to estrogen and regulates the expression of many genes. In breast cancer, ESR1 is often overexpressed, and is a major driver of the disease. Targeting ESR1 with drugs like tamoxifen or aromatase inhibitors has been a key strategy in the treatment of estrogen receptor-positive breast cancer.

  • AKT1 (also known as protein kinase B) is a serine/threonine kinase that is involved in many cellular processes, including cell proliferation, apoptosis, and metabolism. AKT1 is activated by a variety of stimuli, including growth factors, cytokines, and extracellular matrix proteins. Once activated, AKT1 phosphorylates a number of downstream targets, including transcription factors, enzymes, and cytoskeletal proteins, leading to changes in gene expression, metabolism, and cell morphology. Mutations in AKT1 have been identified in a number of different cancers, including breast, colorectal, and ovarian cancer. In many cases, these mutations lead to increased AKT1 activity, which can promote cell survival and proliferation, and can also confer resistance to chemotherapy and other cancer treatments. As a result, AKT1 inhibitors are being developed as a potential therapeutic strategy for cancer treatment.

Creating a 3D scatterplot using ggplot2 in R

library(ggplot2)
library(plotly)

# Create some random data
set.seed(123)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)

# Perform PCA
data <- data.frame(x, y, z)
pca <- prcomp(data, scale = TRUE)
scores <- as.data.frame(pca$x)

# Create 3D scatterplot using ggplot2 and plotly
ggplot(scores, aes(x = PC1, y = PC2, z = PC3)) +
  geom_point(size = 3, color = "blue") +
  labs(x = "PC1", y = "PC2", z = "PC3") +
  theme_bw() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black", size = 0.5),
        panel.border = element_blank()) +
  scale_x_continuous(limits = c(-4, 4), expand = c(0, 0)) +
  scale_y_continuous(limits = c(-4, 4), expand = c(0, 0)) +
  scale_z_continuous(limits = c(-4, 4), expand = c(0, 0)) +
  coord_cartesian(xlim = c(-4, 4), ylim = c(-4, 4), zlim = c(-4, 4)) +
  ggtitle("3D Scatterplot using PCA") +
  theme(plot.title = element_text(size = 12, face = "bold"))

This code creates a 3D scatterplot using ggplot2 and the plotly package. It first generates some random data and performs principal component analysis (PCA) on it. The resulting PCA scores are then plotted in a 3D scatterplot. The labs() function is used to set the axis labels, and the ggtitle() function is used to set the plot title. The scale_x_continuous(), scale_y_continuous(), and scale_z_continuous() functions are used to set the axis limits, and the coord_cartesian() function is used to ensure that the plot is displayed with the specified limits. The theme() function is used to adjust the appearance of the plot.

draw 2D PCA from rld

library(ggplot2)
data <- plotPCA(rld, intgroup=c("condition", "donor"), returnData=TRUE)
ggplot(data, aes(x=PC1, y=PC2, color=condition, shape=donor)) +
      geom_point(size=8) + labs(x = "PC1", y = "PC2") +
      scale_color_manual(values = c("untreated" = "grey",
                                    "mCh d3"="#a6cee3",
                                    "mCh d8"="#1f78b4",
                                    "GFP+mCh d9/12"="cyan",
                                    "GFP d3"="#b2df8a",
                                    "GFP d8"="#33a02c",
                                    "sT d3"="#fb9a99",
                                    "sT d8"="#e31a1c",
                                    "LT d3"="#fdbf6f",
                                    "LT d8"="#ff7f00",
                                    "LTtr d3"="#cab2d6",
                                    "LTtr d8"="#6a3d9a",
                                    "sT+LT d3"="#ffff99",                        
                                    "sT+LTtr d9/12"="#a14a1a")) 
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) + theme(axis.text = element_text(face="bold",size = 21), axis.title = element_text(face="bold",size = 21)) + theme(legend.text = element_text(size = 20)) + theme(legend.title = element_text(size = 22)) + guides(color = guide_legend(override.aes = list(size = 10)), shape = guide_legend(override.aes = list(size = 10)), alpha = guide_legend(override.aes = list(size = 10)))

bubble plots in R using ggplot2

library(ggplot2)
png("bubble_plot.png", 3000, 2000)
ggplot(mydat, aes(y = Term, x = Comparison)) + geom_point(aes(color = Regulation, size = Count, alpha = abs(log10(FDR)))) + scale_color_manual(values = c("up" = "red", "down" = "blue")) + scale_size_continuous(range = c(1, 34)) + labs(x = "", y = "", color="Regulation", size="Count", alpha="-log10(FDR)") + theme(axis.text.y = element_text(face = bold.labels))+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 40)) + theme(legend.text = element_text(size = 40)) + theme(legend.title = element_text(size = 40))+
  guides(color = guide_legend(override.aes = list(size = 20)), alpha = guide_legend(override.aes = list(size = 20)))
dev.off()

library(ggplot2)
library(dplyr)
library(magrittr)
library(tidyr)
library(forcats)

mydat <- read.csv2("GO_DAVID_Summary_R.csv", sep=",", header=TRUE)
#mydat$GeneRatio <- sapply(mydat$GeneRatio_frac, function(x) eval(parse(text=x)))
mydat$FoldEnrichment <- as.numeric(mydat$FoldEnrichment)
mydat$Comparison <- factor(mydat$Comparison, levels=c("sT 3 dpi","sT 8 dpi","LT 3 dpi","LT 8 dpi","LTtr 3 dpi","LTtr 8 dpi","sT+LT 3 dpi","sT+LTtr 9/12 dpi"))    # solution for point 2
mydat$Term <- factor(mydat$Term, levels=rev(c("Cell proliferation","Negative regulation of cell proliferation","Cell division","Mitotic nuclear division","G1/S transition of mitotic cell cycle","DNA replication initiation","DNA replication","Sister chromatid cohesion","DNA repair","Cellular response to DNA damage stimulus","Transcription, DNA-templated","Regulation of transcription, DNA-templated","Positive regulation of transcription, DNA-templated","Positive regulation of transcription from RNA polymerase II promoter","Positive regulation of gene expression","Negative regulation of transcription from RNA polymerase II promoter","rRNA processing","Protein folding","Inflammatory response","Immune response","Innate immune response","Positive regulation of ERK1 and ERK2 cascade","Chemokine-mediated signaling pathway","Chemotaxis","Cell chemotaxis","Neutrophil chemotaxis","Viral process","Response to virus","Defense response to virus","Cellular response to lipopolysaccharide","Type I interferon signaling pathway","Extracellular matrix organization","Cell adhesion","Nervous system development","Angiogenesis","Apoptotic process")))    # solution for point 2
mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
png("bubble.png", width=1166, height=1067)
ggplot(mydat, aes(y = Term, x = Comparison, size = FoldEnrichment)) + geom_point(aes(color = Regulation), alpha = 1.0) + labs(x = "", y = "") + theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20)) + theme(legend.text = element_text(size = 20)) + theme(legend.title = element_text(size = 20)) + scale_size(range = c(1, 20)) + guides(color = guide_legend(override.aes = list(size = 10)))    # solution for point 1
dev.off()

libraries with a syntax similar to ggplot2 for creating 3D plots in R

If you are looking for a library with a syntax similar to ggplot2 for creating 3D plots in R, you might want to check out the ggplot2 extension packages ggplot2_3d, ggplot2rayshader, or rayshader. Here is a brief description of each package:

ggplot2_3d: This package provides an extension to ggplot2 that allows you to create 3D plots using the same syntax as ggplot2. The package adds a geom_3d() function and several 3D coordinate systems that can be used to create different types of 3D plots.

ggplot2rayshader: This package extends ggplot2 with the rayshader package to create 3D plots with realistic shading and lighting effects. It provides functions for creating elevation maps, hillshades, and combining them with ggplot2 layers.

rayshader: This package allows you to create 2D and 3D maps and visualizations using a combination of ggplot2 and raytracing techniques. It provides functions for creating elevation maps, hillshades, and adding water features and labels to the plots.

Here is an example code snippet using the ggplot2_3d package to create a 3D scatter plot:

library(ggplot2)
library(ggplot2_3d)

# create a 3D scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, z = Petal.Length, color = Species)) +
      geom_3d(point_size = 3) +
      theme_3d()
library(ggplot2)
library(ggplot2_3d)

# create a 3D scatter plot
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, z = Petal.Length, color = Species)) +
  geom_3d(point_size = 3) +
  theme_3d()

This will create a 3D scatter plot of the iris dataset with Sepal.Length, Sepal.Width, and Petal.Length as the x, y, and z coordinates, respectively, and Species as the color variable. The geom_3d() function is used to specify the type of plot, and the theme_3d() function is used to add a 3D theme to the plot.

How to use the TxDb.Hsapiens.UCSC.hg38.knownGene package in R?

To use the TxDb.Hsapiens.UCSC.hg38.knownGene package in R, you will need to follow these steps: :joy: https://www.markdownguide.org/basic-syntax/ https://www.markdownguide.org/extended-syntax/

Tux, Logo

Tux, yopK

  • Install the TxDb.Hsapiens.UCSC.hg38.knownGene package from Bioconductor using the following code:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
if (!requireNamespace("BiocManager", quietly = TRUE))

  install.packages("BiocManager")

BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")
  • Load the package using the library() function:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
  • Load the GenomicFeatures package, which is required by TxDb packages:
library(GenomicFeatures)
  • Use the TxDb.Hsapiens.UCSC.hg38.knownGene package to create a transcript database object using the TxDb() function:
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
  • This will create a TxDb object that contains information about gene models and transcripts from the hg38 version of the human genome, based on the knownGene track in the UCSC Genome Browser.
  • You can then use the GenomicFeatures package to retrieve information about genes and transcripts from the TxDb object. For example, you can use the exonsBy() function to retrieve exon coordinates for a given gene:
    gene <- "ENSG00000139618"
    exons <- exonsBy(txdb, gene)

This will retrieve the exon coordinates for the gene with the Ensembl ID “ENSG00000139618”.

Note that the TxDb objects can be memory-intensive, especially for larger genomes or datasets, so you may need to be careful about the amount of data you load into memory at once. You can also use the saveDb() and loadDb() functions to save and load TxDb objects to/from disk, respectively.

update all packages in R

# Load the utils package
library(utils)

# Get information about all installed packages
pkg_info <- packageStatus()

# Filter the information to show only packages that need to be updated
outdated_packages <- pkg_info[,"needsUpdate"]

# Display the number of outdated packages
cat("Number of outdated packages:", length(outdated_packages))

pkg_info <- packageStatus()
pkg_info

Number of installed packages:

                            ok upgrade unavailable

/usr/local/lib/R/site-library 467 27 161 /usr/lib/R/site-library 0 0 0 /usr/lib/R/library 16 13 0

Number of available packages (each package counted only once):

                                           installed not installed

http://ftp.gwdg.de/pub/misc/cran/src/contrib 506 18771

/usr/local/lib/R/site-library /usr/lib/R/library/

sudo chown -R jhuang:jhuang /usr/lib/R/library/ 
sudo chown -R jhuang:jhuang /usr/share/R/doc/html
#update.packages(ask = FALSE)
update.packages(repos='http://cran.rstudio.com/', ask=FALSE, checkBuilt=TRUE)

draw pca plots with ggplot2 (2D) and plotly (3D)

#TODOs: next week
#- try install ggplot2_3d
#- try install kaleido

TODO: using python to generate the 3D plot! https://pypi.org/project/plotly/

# -- before pca --
#png("pca.png", 1200, 800)
svg("pca.svg")
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

#TODO:adding label in the figure, change Donor I in blue and donor II in orange
#https://loading.io/color/feature/Paired-12/
svg("pca2.svg")
#plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
plotPCA(rld, intgroup = c("donor"))
#plotPCA(rld, "batch")
dev.off()
#TODO: adding label in the figure
svg("pca3.svg")
plotPCA(rld, intgroup=c("replicates2"))
dev.off()

#https://loading.io/color/feature/Paired-12/
#https://support.bioconductor.org/p/66404/
# -- calculate PC3 from rld --
library(genefilter)
ntop <- 500
rv <- rowVars(assay(rld))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
mat <- t( assay(rld)[select, ] )
pc <- prcomp(mat)
pc$x[,1:3]

#To my case:
mat <- t( assay(rld) )
pc <- prcomp(mat)
#pca <- yourFavoritePCA( mat )

pc$x[,1:3]
#                          PC1         PC2        PC3
#untreated DI      -27.5379705   1.4478299  -6.389731
#untreated DII     -28.3320463   0.6794066   2.073768
#mCh d3 DII          2.8988953  -6.4372647  10.252829
#sT d3 DII           5.1869876   2.6116282  13.816117
#mCh d8 DII        -20.8047275   1.0708861   3.394721
#sT d8 DII          -4.5144119  19.6230473   8.357902
#mCh d3 DI          -4.5690693  -8.8938297  -7.391567
#sT d3 DI           -7.6326832   5.3781061   2.214181
#mCh d8 DI           0.8536828  -5.0593045 -13.325567
#sT d8 DI            1.9232111  24.8795741  -4.162946
#GFP d3 DII        -12.5042914  -3.3424106  15.207755
#LTtr d3 DII         5.2309178  -9.6124712   8.328132
#GFP d8 DII        -13.0652347  -8.2058086  15.078469
#LTtr d8 DII        13.0678654  -2.0677676   9.188943
#GFP d3 DI         -13.9999251  -1.4988226  -3.335085
#LTtr d3 DI          2.6090782  -9.5753559 -10.022324
#GFP d8 DI         -12.4430571  -6.0670545 -14.725450
#LTtr d8 DI          8.9794396   3.4918629 -14.410118
#LT d8 DII          18.8388058   0.2459081   2.334700
#LT d8 DI           15.2986278  -0.6055500 -11.034778
#GFP+mCh d9/12 DI  -17.3162152   3.2939931  -6.917358
#sT+LTtr d9/12 DI    6.8517730  17.9282911  -6.209778
#GFP+mCh d9/12 DII   2.0874834  -6.7379107   8.810602
#sT+LTtr d9/12 DII  19.3883422  19.6033774   4.314808
#LT d3 DI            6.5376031  -8.5766236  -6.500155
#LT d3 DII          17.8400725 -11.7362896   1.117396
#sT+LT d3 DI        16.6029944  -7.7951798  -5.593658
#sT+LT d3 DII       18.5238521  -4.0422674   5.528193

# vs.
#data
#                     PC1        PC2         group condition donor          name
#untreated DI  -27.537970  1.4478299  untreated:DI untreated    DI  untreated DI
#untreated DII -28.332046  0.6794066 untreated:DII untreated   DII untreated DII
#mCh d3 DII      2.898895 -6.4372647    mCh d3:DII    mCh d3   DII    mCh d3 DII

# -- construct a data structure (merged_df) as above with data and pc --
library(ggplot2)
data <- plotPCA(rld, intgroup=c("condition", "donor"), returnData=TRUE)
#calculate all PCs including PC3 with the following codes
library(genefilter)
ntop <- 500
rv <- rowVars(assay(rld))
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
mat <- t( assay(rld)[select, ] )
pc <- prcomp(mat)
pc$x[,1:3]
df_pc <- data.frame(pc$x[,1:3])

identical(rownames(data), rownames(df_pc)) #-->TRUE
## define the desired order of row names
#desired_order <- rownames(data)
## sort the data frame by the desired order of row names
#df <- df[match(desired_order, rownames(df_pc)), ]

data$PC1 <- NULL
data$PC2 <- NULL
merged_df <- merge(data, df_pc, by = "row.names")
#merged_df <- merged_df[, -1]
row.names(merged_df) <- merged_df$Row.names
merged_df$Row.names <- NULL  # remove the "name" column
merged_df$name <- NULL 
merged_df <- merged_df[, c("PC1","PC2","PC3","group","condition","donor")]

# -- draw 3D with merged_df using plot3D --
#https://stackoverflow.com/questions/45052188/how-to-plot-3d-scatter-diagram-using-ggplot
devtools::install_github("AckerDWM/gg3D")
library("gg3D")
png("pca10.png",800,800)
#svg("pca10.svg",10,10)
#methods(class = "prcomp")
#summary(pc) #--> Proportion of Variance  0.3647 0.1731 0.1515
#percentVar <- round(100 * attr(data, "percentVar"))
percentVar <- c(36,17,15)
#scatterplot3d

#Unfortunately, ggplot does not support 3D plotting. It is designed for creating 2D plots and visualizations in R. However, there are other packages available in R for creating 3D plots, such as plot3D, scatterplot3d, and rgl. These packages can be used to create 3D scatter plots, surface plots, and more complex 3D visualizations. You can install and load these packages in R using the following commands:
install.packages("plot3D")
library(plot3D)
install.packages("scatterplot3d")
library(scatterplot3d)
install.packages("rgl")
library(rgl)
#Once you have loaded these packages, you can create 3D plots using their respective functions. For example, you can create a 3D scatter plot using the plot3D package with the following code:

#https://plotly.com/r/3d-scatter-plots/
library(plotly)
data(mtcars)
png("xxx.png", 1200, 800)
plot_ly(mtcars, x = ~mpg, y = ~wt, z = ~qsec, type = "scatter3d", mode = "markers")
dev.off()

#  zlab(paste0("PC3: ",percentVar[3],"% variance")) + 
#scatterplot3d(merged_df[,c("PC1","PC2","PC3")], pch=16, color="blue", main="3D Scatter Plot")
# labs(x = "PC1", y = "PC2", z = "PC3") +
#axes_3D() + stat_3D() +
colors = "Set1",
#marker = list(symbol = ~shapes[group])
#using the corresponding keywords ("square", "triangle-up", "diamond", etc.). 

labs <- list(x = paste0("PC1: ",percentVar[1],"% variance"), y = paste0("PC2: ",percentVar[2],"% variance"), z = paste0("PC3: ",percentVar[3],"% variance"))
#ggplot(merged_df, aes(x=PC1, y=PC2, z=PC3, color=condition, shape=donor)) +

#https://stackoverflow.com/questions/75452609/update-color-in-different-marker-in-plotly-r
dt <- iris
dt$shape_1 <- c("Yes","No")
dt$color_1 <- c("Medium","Large","Small")

library(plotly)
library(kaleido)
fig <- plot_ly(dt,
        x=1:nrow(iris),
        y=~Sepal.Length,
        type="scatter",
        mode='markers',
        color=~Species,
        colors = c("#4477AA","#DDCC77","#CC6677"),
        symbol = ~shape_1,
        symbols = c("triangle-up", "circle"),
        size = 20) %>% add_surface()
# save the chart as an SVG file using Kaleido
kaleido(fig, file = "chart.svg", format = "svg", width = 800, height = 600, 
        scale = 2, output_options = list(bg = "white"))

#        inherit = F,
#        size = ~Sepal.Width, 
#        sizes = c(10, 100) * 10)

plot_ly(dt,
+         x=1:nrow(iris),
+         y=~Sepal.Length,
+         type="scatter",
+         mode='markers',
+         color=~Species,
+         colors = c("#4477AA","#DDCC77","#CC6677"),
+         size = ~Sepal.Width, 
+         symbol = ~shape_1,
+         symbols = c("triangle-up", "circle"),
+         inherit = F,
+         sizes = c(10, 100) * 10))
%>%
+ add_trace(type="scatter",
+           mode = "text",
+           text=~Sepal.Width,
+           textposition = "top right",
+           color = ~color_1,
+           colors = c("black","green","blue"),
+           textfont = list(size = 10)
+ )

factors(merged_df$condition)

"GFP d3"        "GFP d8"        "GFP+mCh d9/12" "LT d3"        
 [5] "LT d8"         "LTtr d3"       "LTtr d8"       "mCh d3"       
 [9] "mCh d8"        "sT d3"         "sT d8"         "sT+LT d3"     
[13] "sT+LTtr d9/12" "untreated"

merged_df$condition <- factor(merged_df$condition, levels=c("untreated","mCh d3","mCh d8","GFP+mCh d9/12","GFP d3","GFP d8","sT d3","sT d8","LT d3","LT d8","LTtr d3","LTtr d8","sT+LT d3","sT+LTtr d9/12"))
merged_df$donor <- as.character(merged_df$donor)
# Define a list of shapes for each group
shapes <- list("circle", "triangle-up")
plot_ly(merged_df, x=~PC1, y=~PC2, z=~PC3, type = "scatter3d", mode = "markers",   color=~condition, colors = c("grey","#a6cee3","#1f78b4","cyan","#b2df8a","#33a02c","#fb9a99","#e31a1c","#fdbf6f","#ff7f00","#cab2d6","#6a3d9a","#ffff99","#a14a1a"), symbol=~donor, symbols = c("triangle-up", "circle"))
  geom_point_3d() +
  scale_x_continuous(name = labs$x) +
  scale_y_continuous(name = labs$y) +
  scale_z_continuous(name = labs$z) +
  geom_point(size=8) + 
  scale_color_manual(values = c("untreated" = "grey",
                                "mCh d3"="#a6cee3",
                                "mCh d8"="#1f78b4",
                                "GFP+mCh d9/12"="cyan",
                                "GFP d3"="#b2df8a",
                                "GFP d8"="#33a02c",
                                "sT d3"="#fb9a99",
                                "sT d8"="#e31a1c",
                                "LT d3"="#fdbf6f",
                                "LT d8"="#ff7f00",
                                "LTtr d3"="#cab2d6",
                                "LTtr d8"="#6a3d9a",
                                "sT+LT d3"="#ffff99",                        
                                "sT+LTtr d9/12"="#a14a1a")) +
                                theme(axis.text = element_text(face="bold",size = 21), axis.title = element_text(face="bold",size = 21)) + theme(legend.text = element_text(size = 20)) + theme(legend.title = element_text(size = 22)) + guides(color = guide_legend(override.aes = list(size = 10)), shape = guide_legend(override.aes = list(size = 10)), alpha = guide_legend(override.aes = list(size = 10)))

#axis.title = element_text(face="bold",size = 20)
#p + theme(axis.text.x = element_text(face="bold",size=14), axis.text.y = element_text(face="bold",size=14))
#+ coord_fixed()
#+ theme(
#    # Set the width to 6 inches
#    fig.width = 6,
#    # Set the height to 4 inches
#    fig.height = 4
#  )

svg("pca4.svg")
plotPCA(rld, intgroup=c("days"))
dev.off()

scatter plot with categorical data using ggplot2

load packages

library(tidyverse)
library(palmerpenguins)
library(ggbeeswarm)
library(ggforce)

#remotes::install_github("allisonhorst/palmerpenguins")
# peek at penguins data
#glimpse(penguins)

# create some example data
x <- c(rep("A", 100), rep("B", 100), rep("C", 100))
y <- rnorm(300)

# create a data frame with x, y, and color columns
df <- data.frame(x = x, y = y, color = ifelse(c(1:300) %in% c(5, 10, 15), "Highlighted", "Normal"))

#  geom_point(size = 3) +
# plot the data with points colored by category and highlight
ggplot(df, aes(x = x, y = y, color = color)) +
  scale_color_manual(values = c("Normal" = "black", "Highlighted" = "red")) +
  geom_beeswarm(cex = 1.5) +
  theme_classic()

 #In this script, we use ggplot2 to create a scatter plot with categorical data and highlight some points. We start by creating an example dataset with a categorical variable x and a continuous variable y. We then create a data frame df with x, y, and color columns. The color column is set to "Highlighted" for the points we want to highlight, and "Normal" for the rest.

 #We then use ggplot2 to plot the data. We set x, y, and color to the corresponding columns in df using the aes() function. We use geom_point() to plot the points, and set the size argument to control the size of the points. We use scale_color_manual() to set the colors for the "Normal" and "Highlighted" categories. Finally, we use theme_classic() to set the theme of the plot to a classic theme.