Author Archives: gene_x

General statistics and important genes in humans, bacteria, and viruses

Here are some general statistics about the number of genes in humans, bacteria, and viruses:

Human genes:

Humans are estimated to have between 20,000 and 25,000 protein-coding genes. However, the number of functional genes may be even higher due to alternative splicing, where a single gene can produce multiple protein variants.

Bacterial genes:

Bacteria are unicellular organisms that have much smaller genomes than humans. The number of genes in bacteria varies depending on the species, but the average bacterial genome contains around 4,000 genes.

Virus genes:

Viruses are much simpler organisms than bacteria or humans, and they typically have much smaller genomes. The number of genes in viruses varies widely depending on the type of virus. Some viruses have only a few genes, while others have hundreds or even thousands of genes.

It’s important to note that these numbers are approximate and can vary depending on the specific organism or virus being studied. Additionally, the number of genes does not necessarily reflect the complexity or importance of an organism or virus.

There are an estimated 10 million species of bacteria, and each species has a unique genome with a varying number of genes. However, we can still provide some general statistics about the number of genes in bacteria based on available data:

The average number of genes in bacterial genomes is around 4,000, but this can range from a few hundred to over 15,000 depending on the species.
The bacterium with the largest number of genes currently known is Sorangium cellulosum, with approximately 15,000 genes.
Some bacteria have much smaller genomes than the average, such as the bacterium Carsonella ruddii, which has only 182 genes.
Bacterial genes can be classified into core genes, which are shared by all members of a species, and accessory genes, which are unique to certain strains or species. The number of core genes in bacterial genomes is typically a few thousand, while the number of accessory genes can range from a few dozen to several thousand.

It’s worth noting that the field of microbiology is constantly evolving, and new bacteria and their genomes are being discovered all the time. As a result, these statistics are subject to change as more data becomes available.

It’s difficult to provide a comprehensive list of all core genes in bacteria, as the specific genes that are considered “core” can vary depending on the species and the methods used to identify them. However, there are some general categories of genes that are considered to be part of the bacterial core genome:

Housekeeping genes: These are genes that are involved in basic cellular processes that are required for all cells to function, such as DNA replication, transcription, translation, and metabolism. Examples of housekeeping genes include the genes that encode ribosomal RNA (rRNA), ATP synthase, and DNA polymerase.
Information processing genes: These are genes that are involved in the processing and transmission of genetic information, such as DNA repair genes, DNA topoisomerases, and RNA polymerase.
Structural genes: These are genes that are involved in the formation and maintenance of cell structure, such as genes that encode cell wall components or flagella.
Transport genes: These are genes that are involved in the transport of molecules across the cell membrane, such as ABC transporters and permeases.
Regulatory genes: These are genes that are involved in the regulation of gene expression, such as transcription factors and sigma factors.

While these categories are a good starting point for identifying core genes in bacteria, the specific genes that are considered to be part of the core genome can vary depending on the species and the criteria used to define them. Additionally, some researchers may define “core genes” more narrowly or broadly than others, further complicating efforts to create a comprehensive list.

Housekeeping genes are essential genes that are required for basic cellular functions in all bacteria. They are involved in key processes such as DNA replication, transcription, translation, and metabolism. Here is a list of some important housekeeping genes in bacteria:

dnaA: encodes the protein responsible for initiating DNA replication.
gyrA: encodes DNA gyrase, an essential enzyme involved in DNA replication and repair.
rpoB: encodes the beta subunit of RNA polymerase, which is responsible for catalyzing transcription.
infB: encodes initiation factor 2, which is involved in translation initiation.
recA: encodes a protein involved in DNA repair and recombination.
groEL: encodes a chaperonin protein that assists in the folding of other proteins.
ftsZ: encodes a protein involved in cell division.
nrdA and nrdB: encode the subunits of ribonucleotide reductase, which is involved in nucleotide metabolism.
atpA, atpB, atpE, and atpF: encode subunits of ATP synthase, which generates ATP during cellular respiration.
fabF: encodes a fatty acid biosynthesis enzyme that is essential for cell membrane synthesis.

Transport genes are responsible for the movement of various molecules across the cell membrane of bacteria. These molecules can include nutrients, ions, and waste products. Here is a list of some important transport genes in bacteria:

ABC transporters: These are a class of transporters that use ATP to move a wide range of molecules across the cell membrane, including sugars, amino acids, and vitamins. Examples of ABC transporters in bacteria include the maltose transporter (malEFG) and the histidine transporter (hisJQMP).
PTS transporters: These transporters use a phosphate group from phosphoenolpyruvate to move sugars across the cell membrane. Examples of PTS transporters in bacteria include the glucose transporter (ptsG) and the lactose transporter (lacY).
TonB-dependent transporters: These transporters use the energy from the proton motive force to move molecules across the outer membrane of Gram-negative bacteria. Examples of TonB-dependent transporters in bacteria include the iron transporter (fhuA) and the vitamin B12 transporter (btuB).
Permeases: These are a class of transporters that move small molecules such as sugars, amino acids, and ions across the cell membrane. Examples of permeases in bacteria include the lactose permease (lacY) and the phosphate permease (pstS).
MFS transporters: These transporters belong to the major facilitator superfamily and move a wide range of molecules across the cell membrane, including sugars, drugs, and organic acids. Examples of MFS transporters in bacteria include the glucose transporter (gluP) and the tetracycline transporter (tetA).
Outer membrane porins: These are proteins that form channels in the outer membrane of Gram-negative bacteria, allowing small molecules such as nutrients and waste products to move across the membrane. Examples of outer membrane porins in bacteria include OmpF and OmpC.

These are just a few examples of important transport genes in bacteria. It’s worth noting that the specific transport genes required for cellular function can vary between different bacterial species, and some bacteria may have additional transport systems beyond those listed here.

细菌中最神奇的10个基因 (10 Most Amazing Genes in Bacteria)

Leave a reply

在细菌中，有许多令人神奇的基因，但以下是其中一些最为令人惊叹的基因：

CRISPR-Cas9基因：CRISPR-Cas9是一种具有基因编辑能力的DNA序列，这种基因编辑技术可以用于修改细菌、植物和动物的基因，对生物医学和生物工程领域具有重大意义。
Green Fluorescent Protein基因：该基因来自于海葵，它的基因产物是一种能够发出绿色荧光的蛋白质，已被广泛用于生物学研究中，尤其是用于标记细胞、病原体和其他生物分子。
荧光素酶基因：该基因来自于发光菌，产生一种能够发出强烈荧光的酶，已被广泛应用于生物学、生物医学和生物技术领域，例如生物传感器、荧光显微镜等。
Nitrogen fixation genes：固氮基因来自于一些细菌，它们能够将空气中的氮气转化为植物所需的氨基酸，对植物生长和生态系统的平衡具有重要作用。
Resistance genes：耐药基因是一些细菌中的基因，它们使细菌能够抵抗抗生素的杀菌作用，这些基因的存在导致了严重的公共卫生问题，需要开发新的抗生素来对抗它们。
Bioluminescence genes：发光基因来自于一些生物，例如发光细菌、蛤蜊和火虫等。这些基因可以使细菌产生荧光蛋白质，发出绿色、黄色、蓝色等颜色的荧光。这些细菌常常生存在深海或地下洞穴中，发光有助于它们在黑暗中进行交流或引诱猎物。
Denitrification genes：反硝化基因使细菌能够从硝酸盐和亚硝酸盐中释放出氮气，这些基因在土壤中具有重要作用，可帮助减少土壤中的氮含量。
Pili genes：纤毛基因编码一种可以使细菌产生纤毛的蛋白质，这些纤毛可以用来连接不同的细胞或附着到表面上。这些纤毛在感染性细菌中非常重要，它们帮助细菌黏附在宿主细胞上，使它们更容易入侵宿主。
Virulence genes：毒力基因使一些细菌具有攻击宿主细胞的能力，从而导致疾病发生。这些基因可以编码产生毒素的蛋白质，也可以编码使细菌能够避免宿主免疫系统的攻击的蛋白质。
Metabolism genes：代谢基因编码细菌用于生存的关键代谢途径，例如分解食物、合成细胞壁、生成能量等。这些基因在不同的细菌中有很大的差异，因此可以用于区分不同种类的细菌。

Merkel Cell Polyomavirus: Decoding Reference Sequences, miRNA, and the Oncogenic Roles of LT, LTtr, and sT in Merkel Cell Carcinoma

Leave a reply

Merkel cell polyomavirus (MCPyV) is a virus that has been associated with Merkel cell carcinoma, an aggressive form of skin cancer. The MCPyV genome is small, circular, and double-stranded DNA with a length of approximately 5.4 kilobases. Here, we will provide a brief summary of the reference sequences of MCPyV isolates and the microRNA (miRNA) found in the virus.

Reference sequence of MCPyV isolates: The reference sequence of MCPyV isolates can be found in the National Center for Biotechnology Information (NCBI) database. The NCBI’s Genome database contains information on the complete MCPyV genome, which can be accessed using the following link: https://www.ncbi.nlm.nih.gov/datasets/genomes/?taxon=493803&utm_source=nuccore&utm_medium=referral
MCPyV isolate WaGa: The MCPyV isolate WaGa (accession number: KJ128379.1) is a strain of the virus discovered in 2014. The complete genome of this isolate can be found in the NCBI’s Nucleotide database at the following link: https://www.ncbi.nlm.nih.gov/nuccore/KJ128379.1
MCPyV isolate MKL-1: The MCPyV isolate MKL-1 (accession number: FJ173815.1) is another strain of the virus, which was identified in 2008. The complete genome of this isolate is also available in the NCBI’s Nucleotide database and can be accessed at the following link: https://www.ncbi.nlm.nih.gov/nuccore/FJ173815.1
miRNA: MCPyV encodes a microRNA (miRNA) known as miR-M1 (accession number: JN707599.1). miRNAs are small, non-coding RNA molecules that play important roles in regulating gene expression. The sequence of miR-M1 can be found in the NCBI’s Nucleotide database using the following link: https://www.ncbi.nlm.nih.gov/nuccore/JN707599.1

LT, LTtr, and sT are viral proteins that play critical roles in MCPyV-associated oncogenesis, specifically in the development of Merkel cell carcinoma, an aggressive form of skin cancer.

LT (Large Tumor antigen): The Large Tumor antigen is a multifunctional protein encoded by MCPyV. It is involved in various cellular processes, including cell cycle regulation, DNA replication, and cell transformation. LT binds to the tumor suppressor protein retinoblastoma (Rb) and disrupts its function, leading to uncontrolled cell proliferation. In the context of MCPyV-associated Merkel cell carcinoma, LT expression is often observed in tumor cells, suggesting its importance in the development of this aggressive skin cancer.
LTtr (Large Tumor antigen truncated): LTtr is a truncated version of the Large Tumor antigen, which is shorter than the full-length LT protein. This truncation results from a mutation or deletion in the viral genome, leading to a premature stop codon and an incomplete protein product. While LTtr retains some of the functions of the full-length LT, it may have altered activity or reduced functionality. The exact role of LTtr in MCPyV-associated Merkel cell carcinoma is still being investigated.
sT, or Small Tumor antigen, is another protein encoded by Merkel Cell Polyomavirus (MCPyV) and is involved in MCPyV-related oncogenesis. sT is a multifunctional protein that plays various roles in cellular processes, including the activation of signaling pathways, modulation of protein stability, and the disruption of cellular functions that can contribute to cancer development.
- One of the key roles of sT is its ability to interact with and inhibit protein phosphatase 2A (PP2A), a cellular tumor suppressor that negatively regulates cell growth and division. By inhibiting PP2A, sT promotes cell proliferation and survival, contributing to oncogenesis.
- In addition to its interaction with PP2A, sT can also activate the phosphatidylinositol 3-kinase (PI3K)/Akt/mTOR signaling pathway, which further supports cell survival and growth. Furthermore, sT has been found to induce the expression of heat shock proteins, which are associated with cellular stress responses and contribute to the stabilization of oncogenic proteins.

In MCPyV-associated Merkel cell carcinoma, both Large Tumor antigen (LT) and Small Tumor antigen (sT) are expressed in tumor cells, suggesting that they cooperatively contribute to the development of this aggressive skin cancer. While LT primarily targets the retinoblastoma (Rb) tumor suppressor protein, sT interacts with multiple cellular targets to promote cell growth and survival, ultimately leading to oncogenesis.

Visualizing Phylogenetic Relationships and Metadata with Circular ggtree and gheatmap in R

Leave a reply

Download the file typing_189.csv

Download the file 471.tree

library(ggtree)
library(ggplot2)

setwd("/media/jhuang/Elements2/Data_Anna_C.acnes/plotTreeHeatmap/")

# -- edit tree --
#https://icytree.org/
#0.000780
info <- read.csv("typing_189.csv")
info$name <- info$Isolate
#tree <- read.tree("core_gene_alignment_fasttree_directly_from_186isoaltes.tree")  --> NOT GOOD!
tree <- read.tree("471.tree")
cols <- c(infection='purple2', commensalism='skyblue2')     

library(dplyr)
heatmapData2 <- info %>% select(Isolate, ST, Clonal.Complex, Phylotype)
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "lightcyan3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "azure3","blueviolet","darkgoldenrod",  "tomato","mediumpurple4","indianred", 
                      "cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta",     "cornflowerblue","darkgreen","red","tan","brown",      "darkgrey")
names(heatmap.colours) <- c("1","2","3","4","5", "6","7",   "20","21","22", "28","30","33","42","43","52","53", "66","68",    "100","105","124","133","134","135","137",     "159","161",    "CC1","CC2","CC3","CC4","CC5","CC6","CC30","CC72","CC77","Singleton",    "IA1","IA2","IB","II","III",    "NA")
#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))

#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + 
  geom_tippoint(aes(color=Type)) + 
  scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
#, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
#difference between geom_tiplab and geom_tiplab2?
#+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
#font.size=10, 
png("ggtree.png", width=1260, height=1260)
svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

#png("ggtree_and_gheatmap.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=1290, height=1000)
svg("ggtree_and_gheatmap.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=0.1,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
dev.off()

Bubble Plot Visualization of Gene Expression Changes

Leave a reply

Download the file Bubble_R_allTantigens.csv

library(ggplot2)
library(dplyr)
library(magrittr)
library(tidyr)
library(forcats)

mydat <- read.csv("Bubble_R_allTantigens.csv", sep=";", header=TRUE)
mydat$GeneRatio <- sapply(mydat$GeneRatio_frac, function(x) eval(parse(text=x)))
#mydat$FoldEnrichment <- as.numeric(mydat$FoldEnrichment)
mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
mydat$Comparison <- factor(mydat$Comparison, levels=c("sT vs ctrl d3","sT vs ctrl d8","LT vs ctrl d3","LT vs ctrl d8","LTtr vs ctrl d3","LTtr vs ctrl d8","sT+LT vs ctrl d3","sT+LTtr vs ctrl d912")) 
mydat$Term <- factor(mydat$Term, levels=rev(c("Cell division", "Cell cycle", "Negative regulation of cell proliferation", "Mitotic cell cycle", "Mitotic sister chromatid segregation", "Mitotic spindle organization", "Chromosome segregation", "DNA replication", "DNA repair", "Cellular response to DNA damage stimulus", "Regulation of transcription, DNA-templated", "Positive regulation of transcription, DNA-templated", "Regulation of transcription from RNA polymerase II promoter", "Negative regulation of transcription from RNA polymerase II promoter", "rRNA processing", "Protein folding", "Immune response", "Inflammatory response", "Positive regulation of I-kappaB kinase/NF-kappaB signaling", "Cellular response to tumor necrosis factor", "Chemotaxis", "Neutrophil chemotaxis", "Innate immune response", "Response to virus", "Defense response to virus", "Cellular response to lipopolysaccharide", "Signal transduction", "Response to drug", "Apoptotic process", "Cell adhesion", "Collagen fibril organization", "Nervous system development", "Axon guidance", "Extracellular matrix organization", "Angiogenesis"))) 
tiff("Bubble_all-Tantigens_big.tiff", units = "in", width = 35, height = 50, res=500) 
#png("bubble.png", width=1400, height=600)

xl <- factor(rev(c("Cell division", "Cell cycle", "Negative regulation of cell proliferation", "Mitotic cell cycle", "Mitotic sister chromatid segregation", "Mitotic spindle organization", "Chromosome segregation", "DNA replication", "DNA repair", "Cellular response to DNA damage stimulus", "Regulation of transcription, DNA-templated", "Positive regulation of transcription, DNA-templated", "Regulation of transcription from RNA polymerase II promoter", "Negative regulation of transcription from RNA polymerase II promoter", "rRNA processing", "Protein folding", "Immune response", "Inflammatory response", "Positive regulation of I-kappaB kinase/NF-kappaB signaling", "Cellular response to tumor necrosis factor", "Chemotaxis", "Neutrophil chemotaxis", "Innate immune response", "Response to virus", "Defense response to virus", "Cellular response to lipopolysaccharide", "Signal transduction", "Response to drug", "Apoptotic process", "Cell adhesion", "Collagen fibril organization", "Nervous system development", "Axon guidance", "Extracellular matrix organization", "Angiogenesis")))
bold.terms <- c("Innate immune response", "Response to virus", "Defense response to virus")
bold.labels <- ifelse((xl) %in% bold.terms, yes = "bold", no = "plain")

#-log10(FDR) can be renamed as 'Significance'   
png("bubble_plot.png", 3000, 2000)
ggplot(mydat, aes(y = Term, x = Comparison)) + geom_point(aes(color = Regulation, size = Count, alpha = abs(log10(FDR)))) + scale_color_manual(values = c("up" = "red", "down" = "blue")) + scale_size_continuous(range = c(1, 34)) + labs(x = "", y = "", color="Regulation", size="Count", alpha="-log10(FDR)") + theme(axis.text.y = element_text(face = bold.labels))+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 40)) + theme(legend.text = element_text(size = 40)) + theme(legend.title = element_text(size = 40))+
  guides(color = guide_legend(override.aes = list(size = 20)), alpha = guide_legend(override.aes = list(size = 20)))
dev.off()

智力相关基因：揭示认知能力的遗传因素

Leave a reply

智力是由多个基因、环境因素及其相互作用共同影响的复杂性状。尽管没有单一的“智力基因”，但研究人员已经发现了一些与认知能力相关的基因。这些基因包括：

CHRM2：这个基因编码乙酰胆碱受体 2（cholinergic receptor, muscarinic 2），与认知能力（包括智力和记忆力）相关。
BDNF：脑源性神经营养因子（Brain-derived neurotrophic factor，BDNF）是一种与神经元生长和维持有关的蛋白质。BDNF 基因的变异与智力、记忆和学习差异有关。
COMT：儿茶酚氧位甲基转移酶（catechol-O-methyltransferase，COMT）基因参与多巴胺和去甲肾上腺素等神经递质的分解。COMT 基因的某些变异与认知能力和工作记忆差异有关。
SNAP25：突触小泡相关蛋白 25（Synaptosomal-associated protein 25，SNAP25）参与突触处神经递质的释放。SNAP25 基因的变异与智力和认知能力相关。
ASPM：异常纺锤体样小头畸形相关（abnormal spindle-like microcephaly-associated，ASPM）基因与大脑大小有关，并与智力相关。该基因的变种与认知能力差异有关。
NRG1：神经胶质瘤 1（Neuregulin 1，NRG1）参与神经元发育，其基因变异与认知功能和智力有关。

需要注意的是，智力是一个多面的特质，许多基因以复杂的方式共同影响它。此外，环境因素和基因-环境相互作用在决定个体智力方面也起着重要作用。随着我们对智力基因基础的理解不断深入，可能会发现更多基因及其在认知功能中的作用。

Intelligence is a complex trait influenced by multiple genes, environmental factors, and their interactions. Although there is no single “intelligence gene,” researchers have identified several genes that appear to be associated with cognitive abilities. Some of these genes include:

CHRM2: This gene encodes the cholinergic receptor, muscarinic 2, and has been associated with cognitive abilities, including general intelligence and memory.
BDNF: Brain-derived neurotrophic factor (BDNF) is a protein involved in the growth and maintenance of neurons. Variations in the BDNF gene have been linked to differences in intelligence, memory, and learning.
COMT: The catechol-O-methyltransferase (COMT) gene is involved in the breakdown of neurotransmitters such as dopamine and norepinephrine. Certain variations in the COMT gene have been associated with differences in cognitive performance and working memory.
SNAP25: Synaptosomal-associated protein 25 (SNAP25) is involved in the release of neurotransmitters at synapses. Genetic variations in SNAP25 have been linked to intelligence and cognitive performance.
ASPM: The abnormal spindle-like microcephaly-associated (ASPM) gene plays a role in brain size and has been associated with intelligence. Variants of this gene have been linked to differences in cognitive abilities.
NRG1: Neuregulin 1 (NRG1) is involved in neuron development, and genetic variations in this gene have been associated with cognitive function and intelligence.

It is essential to note that intelligence is a multifaceted trait, and many genes contribute to it in complex ways. Additionally, environmental factors and gene-environment interactions also play a significant role in determining an individual’s intelligence. As our understanding of the genetic basis of intelligence continues to grow, more genes and their roles in cognitive function will likely be identified.

性取向相关基因：探索与同性恋相关的遗传因素

Leave a reply

虽然没有一个单一的“同性恋基因”决定性取向，但多项研究已经发现了与同性恋相关的一些基因标记。需要注意的是，与性取向相关的基因因素非常复杂，可能涉及许多基因。以下是一些在一些研究中与同性恋相关的基因标记示例：

Xq28：X染色体上的一个区域，一些研究中已经暗示与男性同性恋有关联。
8q12：染色体8上的一个区域，一些研究中已经与男性同性恋有关联。
10q26：染色体10上的另一个区域，在一些研究中被认为与男性同性恋有关。

这些发现并不是确定性的，我们需要更多的研究来充分了解影响性取向的基因因素。同时，重要的是要认识到基因只是影响人类性行为的遗传、环境和社会因素相互作用的复杂因素中的一个方面。

While there isn’t a single “gay gene” responsible for determining sexual orientation, multiple studies have identified several genetic markers associated with homosexuality. It is important to note that the genetic factors involved in sexual orientation are complex and likely involve numerous genes. Here are a few examples of genetic markers that have been implicated in some studies:

Xq28: A region on the X chromosome that has been suggested to be linked to male homosexuality in some studies.
8q12: A region on chromosome 8 that has been associated with male homosexuality in some research.
10q26: Another region on chromosome 10 that has been implicated in male homosexuality in some studies.

Please keep in mind that these findings are not definitive, and more research is needed to fully understand the genetic factors that contribute to sexual orientation. It is also important to recognize that genes are only one aspect of the complex interplay between genetic, environmental, and social factors that influence human sexuality.

癌症相关基因：多种基因与癌症风险的关联

Leave a reply

Cancer is a complex disease caused by multiple genetic and environmental factors. Some of the genes associated with increased cancer risk include:

TP53 (肿瘤蛋白P53基因) – This gene is responsible for producing the p53 protein, which is involved in suppressing tumor formation. Mutations in this gene are linked to several types of cancer, including breast, ovarian, and colon cancers.
BRCA1 (乳腺癌1基因) and BRCA2 (乳腺癌2基因) – These genes are associated with an increased risk of breast and ovarian cancers.
APC (腺瘤性息肉症基因) – Mutations in this gene are associated with familial adenomatous polyposis, an inherited disorder that can lead to colorectal cancer.
MLH1 (错配修复基因1) and MSH2 (错配修复基因2) – These genes are involved in DNA mismatch repair and are associated with Lynch syndrome, which increases the risk of colorectal, endometrial, and other cancers.
RET (转化生长因子受体基因) – Mutations in this gene can lead to multiple endocrine neoplasia type 2, which is associated with thyroid, parathyroid, and adrenal gland cancers.
VHL (Von Hippel-Lindau基因) – This gene is linked to Von Hippel-Lindau syndrome, which increases the risk of kidney, pancreatic, and brain cancers.
RB1 (视网膜母细胞瘤基因) – Mutations in this gene can lead to retinoblastoma, a rare type of eye cancer that affects children.

These are just a few examples of the many genes associated with different types of cancer. It is important to note that the presence of mutations in these genes does not guarantee that an individual will develop cancer, but it can increase their risk.

Display viral transcripts found in mRNA-seq MKL-1, WaGa EVs compared to cells

Leave a reply

Dowload the file fpkm_values_WaGa.csv

Download the file fpkm_values_MKL-1.csv

STEP1: using following code (R_scripts.txt) to generate merged_gene_counts_WaGa.csv and merged_gene_counts_MKL-1.csv

# ---- GENERATE BED-file for WaGa ----
cp WaGa.gff3 WaGa.gff3_backup
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" WaGa.gff3
grep "Genbank_gene" WaGa.gff3 > WaGa_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" WaGa_gene.gff3
grep "gene_biotype=pseudogene" WaGa.gff3_backup >> WaGa_gene.gff3    #-->4

#The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing. 
# from gff3 to gtf
cp WaGa_gene.gff3 WaGa_gene.gtf
sed -i -e "s/\tID=gene-/\tgene_id \"/g" WaGa_gene.gtf
sed -i -e "s/;/\"; /g" WaGa_gene.gtf
sed -i -e "s/=/=\"/g" WaGa_gene.gtf

#sed -i -e $'s/\n/\"\n/g' WaGa_gene.gtf
gffread -E -F --bed WaGa.gff3 -o WaGa.bed    #-->4

# ---- GENERATE BED-file for MKL-1 ----
cp MKL-1.gff3 MKL-1.gff3_backup
sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" MKL-1.gff3
grep "Genbank_gene" MKL-1.gff3 > MKL-1_gene.gff3
sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" MKL-1_gene.gff3
grep "gene_biotype=pseudogene" MKL-1.gff3_backup >> MKL-1_gene.gff3    #-->4

# from gff3 to gtf
cp MKL-1_gene.gff3 MKL-1_gene.gtf
sed -i -e "s/\tID=gene-/\tgene_id \"/g" MKL-1_gene.gtf
sed -i -e "s/;/\"; /g" MKL-1_gene.gtf
sed -i -e "s/=/=\"/g" MKL-1_gene.gtf

#sed -i -e $'s/\n/\"\n/g' MKL-1_gene.gtf
gffread -E -F --bed MKL-1.gff3 -o MKL-1.bed    #-->4

# ---- GENERATE GFF3-file for WaGa ----
cp WaGa.gff3_backup WaGa.gff3
grep "^##" WaGa.gff3 > WaGa_gene.gff3
grep "ID=gene" WaGa.gff3 >> WaGa_gene.gff3
#!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'! 
sed -i -e "s/\tgene\t/\texon\t/g" WaGa_gene.gff3
#INPUT are 3796 genes (pseudogene, rRNA, tRNA, gene), why output are only 3726 (-pseudogene-rRNA)

# ---- GENERATE GFF3-file for MKL-1 ----
cp MKL-1.gff3_backup MKL-1.gff3
grep "^##" MKL-1.gff3 > MKL-1_gene.gff3
grep "ID=gene" MKL-1.gff3 >> MKL-1_gene.gff3
#!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'! 
sed -i -e "s/\tgene\t/\texon\t/g" MKL-1_gene.gff3
#INPUT are 3796 genes (pseudogene, rRNA, tRNA, gene), why output are only 3726 (-pseudogene-rRNA)

#MODIFY LTtr1 to LT1, LTtr2 to LT2, since the gff3 files contains only the records of LT1 and LT2, not LTtr1 and LTtr2.
#KJ128379.1      Genbank exon    196     429     .       +       .       ID=gene-LTtr1;Name=LTtr1;gbkey=Gene;gene=LTtr1;gene_biotype=protein_coding
#KJ128379.1      Genbank exon    861     1463    .       +       .       ID=gene-LTtr2;Name=LTtr2;gbkey=Gene;gene=LTtr2;gene_biotype=protein_coding

# generate ./results_virus/featureCounts/merged_gene_counts_MKL-1.txt
nextflow run rnaseq/main_Fraction_O_for_virus.nf --reads '/home/jhuang/DATA/Data_Ute_RNA-Seq_MKL-1/Raw_Data/*.fastq.gz' --fasta /home/jhuang/DATA/MCPyV_refs/MKL-1.fasta --gtf /home/jhuang/DATA/MCPyV_refs/MKL-1_gene.gff3 --bed12 /home/jhuang/DATA/MCPyV_refs/MKL-1.bed --singleEnd -profile standard --aligner hisat2 --fcGroupFeatures Name --fcGroupFeaturesType gene_biotype --saveReference -resume --saveAlignedIntermediates --skip_rseqc --skip_dupradar --skip_genebody_coverage --skip_preseq --skip_edger  #-work-dir 

# generate ./results_virus/featureCounts/merged_gene_counts_WaGa.txt
nextflow run rnaseq/main_Fraction_O_for_virus.nf --reads '/home/jhuang/DATA/Data_Ute_RNA-Seq_WaGa/Raw_Data/*.fastq.gz' --fasta /home/jhuang/DATA/MCPyV_refs/WaGa.fasta --gtf /home/jhuang/DATA/MCPyV_refs/WaGa_gene.gff3 --bed12 /home/jhuang/DATA/MCPyV_refs/WaGa.bed --singleEnd -profile standard --aligner hisat2 --fcGroupFeatures Name --fcGroupFeaturesType gene_biotype --saveReference -resume --saveAlignedIntermediates --skip_rseqc --skip_dupradar --skip_genebody_coverage --skip_preseq --skip_edger 

rsync -a -P jhuang@hamm:~/DATA/Data_Ute_RNA-Seq_WaGa/results_human/featureCounts/merged_gene_counts.txt ./results_human/featureCounts/merged_gene_counts_WaGa.txt
rsync -a -P jhuang@hamm:~/DATA/Data_Ute_RNA-Seq_WaGa/results_virus/featureCounts/merged_gene_counts.txt ./results_virus/featureCounts/merged_gene_counts_WaGa.txt
rsync -a -P jhuang@hamm:~/DATA/Data_Ute_RNA-Seq_MKL-1/results_human/featureCounts/merged_gene_counts.txt ./results_human/featureCounts/merged_gene_counts_MKL-1.txt
rsync -a -P jhuang@hamm:~/DATA/Data_Ute_RNA-Seq_MKL-1/results_virus/featureCounts/merged_gene_counts.txt ./results_virus/featureCounts/merged_gene_counts_MKL-1.txt

sed 's/Aligned.sortedByCoord.out.bam//g' results_human/featureCounts/merged_gene_counts_WaGa.txt > merged_gene_counts_WaGa_human.txt
sed 's/.sorted.bam//g' results_virus/featureCounts/merged_gene_counts_WaGa.txt > merged_gene_counts_WaGa_virus.txt
sed 's/Aligned.sortedByCoord.out.bam//g' results_human/featureCounts/merged_gene_counts_MKL-1.txt > merged_gene_counts_MKL-1_human.txt
sed 's/.sorted.bam//g' results_virus/featureCounts/merged_gene_counts_MKL-1.txt > merged_gene_counts_MKL-1_virus.txt

python3 concat_files_WaGa.py merged_gene_counts_WaGa_human.txt merged_gene_counts_WaGa_virus.txt merged_gene_counts_WaGa.csv     #(23 samples)
python3 concat_files_MKL-1.py merged_gene_counts_MKL-1_human.txt merged_gene_counts_MKL-1_virus.txt merged_gene_counts_MKL-1.csv #(18 samples)

The code snippet of the file concat_files_MKL-1.py

    #!/usr/bin/env python3

    #"MKL-1 RNA","MKL-1 RNA 118","MKL-1 RNA 147",    
    #"MKL-1 EV","MKL-1 EV 2","MKL-1 EV 118","MKL-1 EV 87","MKL-1 EV 27","MKL-1 EV 042","MKL-1 EV 0505",     
    #"MKL-1 EV sT DMSO 042","MKL-1 EV sT DMSO 0505",   "MKL-1 EV scr DMSO 042","MKL-1 EV scr DMSO 0505",   "MKL-1 EV sT Dox 042","MKL-1 EV sT Dox 0505",   "MKL-1 EV scr Dox 042","MKL-1 EV scr Dox 0505",   

    #"WaGa RNA","WaGa RNA 118","WaGa RNA 147",    
    #"WaGa EV","WaGa EV 2","WaGa EV 118","WaGa EV 147","WaGa EV 226","WaGa EV 1107","WaGa EV 1605","WaGa EV 2706",    
    #"WaGa EV sT DMSO 1107","WaGa EV sT DMSO 1605","WaGa EV sT DMSO 2706",     "WaGa EV scr DMSO 1107","WaGa EV scr DMSO 1605","WaGa EV scr DMSO 2706",     "WaGa EV sT Dox 1107","WaGa EV sT Dox 1605","WaGa EV sT Dox 2706",     "WaGa EV scr Dox 1107","WaGa EV scr Dox 1605","WaGa EV scr Dox 2706"

    #Geneid  gene_name       WaGa_RNA        1605_WaGa_wt_EV 2706_WaGa_scr_DMSO_EV   WaGa_EV-RNA_226 1107_WaGa_scr_Dox_EV    1107_WaGa_sT_Dox_EV     1605_WaGa_sT_Dox_EV     1107_WaGa_scr_DMSO_EV   1107_WaGa_wt_EV 1605_WaGa_sT_DMSO_EV    WaGa_EV-RNA_2   2706_WaGa_sT_Dox_EV     WaGa_EV-RNA_147 2706_WaGa_sT_DMSO_EV    1107_WaGa_sT_DMSO_EV    2706_WaGa_scr_Dox_EV    WaGa_RNA_147    1605_WaGa_scr_DMSO_EV   2706_WaGa_wt_EV WaGa_RNA_118    1605_WaGa_scr_Dox_EV    WaGa_EV-RNA_118 WaGa_EV-RNA

    #Geneid  gene_name

    import sys
    import pandas as pd

    # Check if the correct number of arguments is provided
    if len(sys.argv) != 4:
        print("Usage: python merge_files.py <file1.csv> <file2.csv> <output.csv>")
        sys.exit()

    # Get the file names from command line arguments
    file1 = sys.argv[1]
    file2 = sys.argv[2]
    output_file = sys.argv[3]

    # Read the files
    df1 = pd.read_csv(file1, sep='\t')
    df2 = pd.read_csv(file2, sep='\t')

    # Rename columns in both DataFrames
    #df1.rename(columns=column_mapping, inplace=True)
    #df2.rename(columns=column_mapping, inplace=True)

    # Define the desired column order
    #"MKL-1 RNA","MKL-1 RNA 118","MKL-1 RNA 147",    
    #"MKL-1 EV","MKL-1 EV 2","MKL-1 EV 118","MKL-1 EV 87","MKL-1 EV 27","MKL-1 EV 042","MKL-1 EV 0505",     
    #"MKL-1 EV sT DMSO 042","MKL-1 EV sT DMSO 0505",   "MKL-1 EV scr DMSO 042","MKL-1 EV scr DMSO 0505",   "MKL-1 EV sT Dox 042","MKL-1 EV sT Dox 0505",   "MKL-1 EV scr Dox 042","MKL-1 EV scr Dox 0505",   

    #Geneid  gene_name       0505_MKL-1_scr_Dox_EV   MKL-1_EV-RNA    0505_MKL-1_scr_DMSO_EV  0505_MKL-1_wt_EV        MKL-1_RNA_147   MKL-1_EV-RNA_2  MKL-1_RNA       MKL-1_EV-RNA_87 042_MKL-1_wt_EV 042_MKL-1_scr_Dox_EV    MKL-1_EV-RNA_27 042_MKL-1_sT_Dox        042_MKL-1_sT_DMSO       MKL-1_EV-RNA_118        0505_MKL-1_sT_Dox_EV    0505_MKL-1_sT_DMSO_EV   MKL-1_RNA_118   042_MKL-1_scr_DMSO_EV
    column_order = ['Geneid','gene_name',  'MKL-1_RNA','MKL-1_RNA_118','MKL-1_RNA_147',  'MKL-1_EV-RNA','MKL-1_EV-RNA_2','MKL-1_EV-RNA_118','MKL-1_EV-RNA_87','MKL-1_EV-RNA_27','042_MKL-1_wt_EV','0505_MKL-1_wt_EV',  '042_MKL-1_sT_DMSO','0505_MKL-1_sT_DMSO_EV', '042_MKL-1_scr_DMSO_EV','0505_MKL-1_scr_DMSO_EV',  '042_MKL-1_sT_Dox','0505_MKL-1_sT_Dox_EV', '042_MKL-1_scr_Dox_EV','0505_MKL-1_scr_Dox_EV']

    # Reorder columns in both DataFrames
    df1 = df1[column_order]
    df2 = df2[column_order]

    # Concatenate the dataframes
    result = pd.concat([df1, df2], axis=0, ignore_index=True)

    # Save the result to a new file
    result.to_csv(output_file, index=False)

The code snippet of the file concat_files_WaGa.py

    #!/usr/bin/env python3

    #"WaGa RNA","WaGa RNA 118","WaGa RNA 147",    
    #"WaGa EV","WaGa EV 2","WaGa EV 118","WaGa EV 147","WaGa EV 226","WaGa EV 1107","WaGa EV 1605","WaGa EV 2706",    
    #"WaGa EV sT DMSO 1107","WaGa EV sT DMSO 1605","WaGa EV sT DMSO 2706",     "WaGa EV scr DMSO 1107","WaGa EV scr DMSO 1605","WaGa EV scr DMSO 2706",     "WaGa EV sT Dox 1107","WaGa EV sT Dox 1605","WaGa EV sT Dox 2706",     "WaGa EV scr Dox 1107","WaGa EV scr Dox 1605","WaGa EV scr Dox 2706"

    #Geneid  gene_name       WaGa_RNA        1605_WaGa_wt_EV 2706_WaGa_scr_DMSO_EV   WaGa_EV-RNA_226 1107_WaGa_scr_Dox_EV    1107_WaGa_sT_Dox_EV     1605_WaGa_sT_Dox_EV     1107_WaGa_scr_DMSO_EV   1107_WaGa_wt_EV 1605_WaGa_sT_DMSO_EV    WaGa_EV-RNA_2   2706_WaGa_sT_Dox_EV     WaGa_EV-RNA_147 2706_WaGa_sT_DMSO_EV    1107_WaGa_sT_DMSO_EV    2706_WaGa_scr_Dox_EV    WaGa_RNA_147    1605_WaGa_scr_DMSO_EV   2706_WaGa_wt_EV WaGa_RNA_118    1605_WaGa_scr_Dox_EV    WaGa_EV-RNA_118 WaGa_EV-RNA

    #Geneid  gene_name

    import sys
    import pandas as pd

    # Check if the correct number of arguments is provided
    if len(sys.argv) != 4:
        print("Usage: python merge_files.py <file1.csv> <file2.csv> <output.csv>")
        sys.exit()

    # Get the file names from command line arguments
    file1 = sys.argv[1]
    file2 = sys.argv[2]
    output_file = sys.argv[3]

    # Read the files
    df1 = pd.read_csv(file1, sep='\t')
    df2 = pd.read_csv(file2, sep='\t')

    # Define the column mapping for renaming columns
    column_mapping = {
        'WaGa_RNA': 'WaGa RNA',
        'WaGa_RNA_118': 'WaGa RNA 118',
        'WaGa_RNA_147': 'WaGa RNA 147',
        'WaGa_EV-RNA': 'WaGa EV',
        'WaGa_EV-RNA_2': 'WaGa EV 2',
        'WaGa_EV-RNA_118': 'WaGa EV 118',
        'WaGa_EV-RNA_147': 'WaGa EV 147',
        'WaGa_EV-RNA_226': 'WaGa EV 226',
        '1107_WaGa_wt_EV': 'WaGa EV 1107',
        '1605_WaGa_wt_EV': 'WaGa EV 1605',
        '2706_WaGa_wt_EV': 'WaGa EV 2706',
        '1107_WaGa_sT_DMSO_EV': 'WaGa EV sT DMSO 1107',
        '1605_WaGa_sT_DMSO_EV': 'WaGa EV sT DMSO 1605',
        '2706_WaGa_sT_DMSO_EV': 'WaGa EV sT DMSO 2706',
        '1107_WaGa_scr_DMSO_EV': 'WaGa EV scr DMSO 1107',
        '1605_WaGa_scr_DMSO_EV': 'WaGa EV scr DMSO 1605',
        '2706_WaGa_scr_DMSO_EV': 'WaGa EV scr DMSO 2706',
        '1107_WaGa_sT_Dox_EV': 'WaGa EV sT Dox 1107',
        '1605_WaGa_sT_Dox_EV': 'WaGa EV sT Dox 1605',
        '2706_WaGa_sT_Dox_EV': 'WaGa EV sT Dox 2706',
        '1107_WaGa_scr_Dox_EV': 'WaGa EV scr Dox 1107',
        '1605_WaGa_scr_Dox_EV': 'WaGa EV scr Dox 1605',
        '2706_WaGa_scr_Dox_EV': 'WaGa EV scr Dox 2706',
    }  # Replace with your desired column names

    # Rename columns in both DataFrames
    #df1.rename(columns=column_mapping, inplace=True)
    #df2.rename(columns=column_mapping, inplace=True)

    # Define the desired column order
    #column_order = ['Geneid', 'gene_name', 'WaGa RNA','WaGa RNA 118','WaGa RNA 147','WaGa EV','WaGa EV 2','WaGa EV 118','WaGa EV 147','WaGa EV 226','WaGa EV 1107','WaGa EV 1605','WaGa EV 2706','WaGa EV sT DMSO 1107','WaGa EV sT DMSO 1605','WaGa EV sT DMSO 2706','WaGa EV scr DMSO 1107','WaGa EV scr DMSO 1605','WaGa EV scr DMSO 2706','WaGa EV sT Dox 1107','WaGa EV sT Dox 1605','WaGa EV sT Dox 2706','WaGa EV scr Dox 1107','WaGa EV scr Dox 1605','WaGa EV scr Dox 2706']  # Replace with your desired column order
    column_order = ['Geneid','gene_name','WaGa_RNA','WaGa_RNA_118','WaGa_RNA_147','WaGa_EV-RNA','WaGa_EV-RNA_2','WaGa_EV-RNA_118','WaGa_EV-RNA_147','WaGa_EV-RNA_226','1107_WaGa_wt_EV','1605_WaGa_wt_EV','2706_WaGa_wt_EV','1107_WaGa_sT_DMSO_EV','1605_WaGa_sT_DMSO_EV','2706_WaGa_sT_DMSO_EV','1107_WaGa_scr_DMSO_EV','1605_WaGa_scr_DMSO_EV','2706_WaGa_scr_DMSO_EV','1107_WaGa_sT_Dox_EV','1605_WaGa_sT_Dox_EV','2706_WaGa_sT_Dox_EV','1107_WaGa_scr_Dox_EV','1605_WaGa_scr_Dox_EV','2706_WaGa_scr_Dox_EV']

    # Reorder columns in both DataFrames
    df1 = df1[column_order]
    df2 = df2[column_order]

    # Concatenate the dataframes
    result = pd.concat([df1, df2], axis=0, ignore_index=True)

    # Save the result to a new file
    result.to_csv(output_file, index=False)

STEP2: using following code (R_scripts.txt) to generate fpkm_values_WaGa.csv and fpkm_values_MKL-1.csv

    #if (!requireNamespace("BiocManager", quietly = TRUE))
    #    install.packages("BiocManager")
    #BiocManager::install("biomaRt")
    library(biomaRt)
    library(dplyr)

    # Replace 'input_file.csv' with the name of your CSV file containing Ensembl gene IDs and counts
    input_data <- read.csv("merged_gene_counts_WaGa.csv")
    #input_data <- read.csv("merged_gene_counts_MKL-1.csv")
    input_data <- input_data %>% rename(ensembl_gene_id = Geneid)

    # Specify the rows you want to sum
    row1 <- 60665
    row2 <- 60666

    # Create a new row with the sums of the two specified rows, except for the first two columns
    new_row <- input_data[c(row1, row2), -c(1, 2)] %>%
      summarise(across(everything(), sum))

    # Name the first two columns
    new_row$ensembl_gene_id <- "LT"
    new_row$gene_name <- ""

    # Add the new row to the original data frame
    input_data <- bind_rows(input_data, new_row)

    # Delete row1 and row2
    input_data <- input_data[-c(row1, row2),]

    # Assuming your CSV file has columns named 'ensembl_gene_id' and 'counts'
    gene_ids <- input_data$ensembl_gene_id

    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")

    #geness <- getBM(
    #    attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
    #    filters = 'ensembl_gene_id',
    #    values = rownames(res), 
    #    mart = ensembl)
    #geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)

    # Get transcript lengths for each Ensembl gene ID
    transcript_lengths <- getBM(
      attributes = c("ensembl_gene_id", "transcript_length"),
      filters = "ensembl_gene_id",
      values = gene_ids,
      mart = ensembl
    )

    # Calculate the median transcript length for each gene
    median_transcript_lengths <- transcript_lengths %>%
      group_by(ensembl_gene_id) %>%
      summarize(median_length = median(transcript_length, na.rm = TRUE))

    #add to median_transcript_lengths WaGa
    #KJ128379.1      Genbank exon    196     429     .       +       .       ID=gene-LT1;Name=LT1;gbkey=Gene;gene=LT1;gene_biotype=protein_coding --> 234
    #KJ128379.1      Genbank exon    861     1463    .       +       .       ID=gene-LT2;Name=LT2;gbkey=Gene;gene=LT2;gene_biotype=protein_coding --> 603
    #KJ128379.1      Genbank exon    196     756     .       +       .       ID=gene-ST;Name=ST;gbkey=Gene;gene=ST;gene_biotype=protein_coding  --> 561
    #KJ128379.1      Genbank exon    3156    4427    .       -       .       ID=gene-VP1;Name=VP1;gbkey=Gene;gene=VP1;gene_biotype=protein_coding --> 1272
    #KJ128379.1      Genbank exon    4393    5118    .       -       .       ID=gene-VP2;Name=VP2;gbkey=Gene;gene=VP2;gene_biotype=protein_coding --> 726
    #MKL-1
    #FJ173815.1      Genbank exon    196     429     .       +       .       ID=gene-LT1;Name=LT1;gbkey=Gene;gene=LT1;gene_biotype=protein_coding --> 234
    #FJ173815.1      Genbank exon    861     1619    .       +       .       ID=gene-LT2;Name=LT2;gbkey=Gene;gene=LT2;gene_biotype=protein_coding --> 759
    #FJ173815.1      Genbank exon    196     756     .       +       .       ID=gene-ST;Name=ST;gbkey=Gene;gene=ST;gene_biotype=protein_coding --> 561
    #FJ173815.1      Genbank exon    3110    4381    .       -       .       ID=gene-VP1;Name=VP1;gbkey=Gene;gene=VP1;gene_biotype=protein_coding --> 1272
    #FJ173815.1      Genbank exon    4347    5072    .       -       .       ID=gene-VP2;Name=VP2;gbkey=Gene;gene=VP2;gene_biotype=protein_coding --> 726

    # Add four lengths for LT, ST, VP1, and VP2, WaGa
    additional_lengths <- data.frame(
      ensembl_gene_id = c("LT", "ST", "VP1", "VP2"),
      median_length = c(837, 561, 1272, 726)
    )
    #additional_lengths <- data.frame(
    #  ensembl_gene_id = c("LT", "ST", "VP1", "VP2"),
    #  median_length = c(993, 561, 1272, 726)
    #)

    # Combine the original median_transcript_lengths data frame with the additional lengths
    median_transcript_lengths <- rbind(median_transcript_lengths, additional_lengths)

    # Merge input_data with median_transcript_lengths
    merged_data <- merge(input_data, median_transcript_lengths, by = "ensembl_gene_id")

    # [1] ensembl_gene_id        gene_name              WaGa_RNA              
    # [4] WaGa_RNA_118           WaGa_RNA_147           WaGa_EV.RNA           
    # [7] WaGa_EV.RNA_2          WaGa_EV.RNA_118        WaGa_EV.RNA_147       
    #[10] WaGa_EV.RNA_226        X1107_WaGa_wt_EV       X1605_WaGa_wt_EV      
    #[13] X2706_WaGa_wt_EV       X1107_WaGa_sT_DMSO_EV  X1605_WaGa_sT_DMSO_EV 
    #[16] X2706_WaGa_sT_DMSO_EV  X1107_WaGa_scr_DMSO_EV X1605_WaGa_scr_DMSO_EV
    #[19] X2706_WaGa_scr_DMSO_EV X1107_WaGa_sT_Dox_EV   X1605_WaGa_sT_Dox_EV  
    #[22] X2706_WaGa_sT_Dox_EV   X1107_WaGa_scr_Dox_EV  X1605_WaGa_scr_Dox_EV 
    #[25] X2706_WaGa_scr_Dox_EV  median_length         
    #<0 rows> (or 0-length row.names)
    # List of column names containing count data
    count_columns <- c("WaGa_RNA", "WaGa_RNA_118", "WaGa_RNA_147", "WaGa_EV.RNA", "WaGa_EV.RNA_2", "WaGa_EV.RNA_118", "WaGa_EV.RNA_147", "WaGa_EV.RNA_226", "X1107_WaGa_wt_EV", "X1605_WaGa_wt_EV", "X2706_WaGa_wt_EV", "X1107_WaGa_sT_DMSO_EV", "X1605_WaGa_sT_DMSO_EV", "X2706_WaGa_sT_DMSO_EV", "X1107_WaGa_scr_DMSO_EV", "X1605_WaGa_scr_DMSO_EV", "X2706_WaGa_scr_DMSO_EV", "X1107_WaGa_sT_Dox_EV", "X1605_WaGa_sT_Dox_EV", "X2706_WaGa_sT_Dox_EV", "X1107_WaGa_scr_Dox_EV", "X1605_WaGa_scr_Dox_EV", "X2706_WaGa_scr_Dox_EV")

    # [1] ensembl_gene_id         gene_name               MKL.1_RNA              
    # [4] MKL.1_RNA_118           MKL.1_RNA_147           MKL.1_EV.RNA           
    # [7] MKL.1_EV.RNA_2          MKL.1_EV.RNA_118        MKL.1_EV.RNA_87        
    #[10] MKL.1_EV.RNA_27         X042_MKL.1_wt_EV        X0505_MKL.1_wt_EV      
    #[13] X042_MKL.1_sT_DMSO      X0505_MKL.1_sT_DMSO_EV  X042_MKL.1_scr_DMSO_EV 
    #[16] X0505_MKL.1_scr_DMSO_EV X042_MKL.1_sT_Dox       X0505_MKL.1_sT_Dox_EV  
    #[19] X042_MKL.1_scr_Dox_EV   X0505_MKL.1_scr_Dox_EV  median_length 
    #count_columns <- c("MKL.1_RNA","MKL.1_RNA_118","MKL.1_RNA_147","MKL.1_EV.RNA","MKL.1_EV.RNA_2","MKL.1_EV.RNA_118","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","X042_MKL.1_wt_EV","X0505_MKL.1_wt_EV","X042_MKL.1_sT_DMSO","X0505_MKL.1_sT_DMSO_EV","X042_MKL.1_scr_DMSO_EV","X0505_MKL.1_scr_DMSO_EV","X042_MKL.1_sT_Dox","X0505_MKL.1_sT_Dox_EV","X042_MKL.1_scr_Dox_EV","X0505_MKL.1_scr_Dox_EV")

    # Calculate FPKM values for each sample
    fpkm_data <- merged_data %>%
      mutate(across(all_of(count_columns), ~ .x / median_length * 1e9 / sum(.x), .names = "FPKM_{.col}"))
    # # Calculate FPKM values
    # total_counts <- sum(merged_data$counts)
    # merged_data$FPKM <- (merged_data$counts / (merged_data$median_length / 1000)) / (total_counts / 1e6)
    # merged_data$FPKM <- (merged_data$counts * 1e9) / (sum(merged_data$counts) * merged_data$median_length)

    # Save the FPKM values to a CSV file
    write.csv(fpkm_data, "fpkm_values_WaGa.csv", row.names = FALSE)

STEP3: using following code to draw plots

    #For plotting a category against values with randomly assigned x-axis positions for the points within each category, you can use the geom_jitter() function in ggplot2. This function adds a small amount of random noise to the x-axis position of each point, allowing them to be distinguished from each other and preventing overplotting.
    #This will produce a scatter plot with the category on the x-axis and the values on the y-axis, with the x-axis positions of the points randomly assigned within each category. You can adjust the amount of jitter by changing the width parameter.

    library(ggplot2)
    fpkm_data_waga = read.csv("fpkm_values_WaGa.csv")
    fpkm_data_waga_selected = fpkm_data_waga[, c('ensembl_gene_id', 'FPKM_WaGa_RNA', 'FPKM_X1107_WaGa_wt_EV')]
    fpkm_data_mkl1= read.csv("fpkm_values_MKL-1.csv")
    fpkm_data_mkl1_selected = fpkm_data_mkl1[, c('ensembl_gene_id', 'FPKM_MKL.1_RNA', 'FPKM_X042_MKL.1_wt_EV')]

    # merge the two data frames by the 'id' column
    fpkm_data_selected <- merge(fpkm_data_waga_selected, fpkm_data_mkl1_selected, by = "ensembl_gene_id")

    ## reorder the columns
    #fpkm_data_selected <- select(fpkm_data_selected, ensembl_gene_id, FPKM_WaGa_RNA, FPKM_MKL.1_RNA, FPKM_WaGa_EV.RNA, FPKM_MKL.1_EV.RNA)

    # select columns with FPKM values
    fpkm_cols <- c('FPKM_WaGa_RNA', 'FPKM_X1107_WaGa_wt_EV', 'FPKM_MKL.1_RNA', 'FPKM_X042_MKL.1_wt_EV')

    ## take the log2 of selected columns
    #fpkm_data_selected_log2 <- fpkm_data_selected %>% 
    #  mutate(across(fpkm_cols, log2, .names = "log2_{.col}"))
    #fpkm_data <- fpkm_data %>%
    #  mutate(across(c(count_columns), ~log2(. + 1), .names = "{.col}_log2")) %>%
    #  select(-all_of(count_columns))

    # subset the data frame to exclude rows with missing values in column "ensembl_gene_id"
    #fpkm_data_filtered <- fpkm_data_selected %>% filter(ensembl_gene_id != "")

    fpkm_data_selected <- fpkm_data_selected %>%
      mutate(across(all_of(fpkm_cols), ~log2(. + 1), .names = "log2_{.col}")) %>% select(-all_of(fpkm_cols)) %>% rename("WaGa RNA" = "log2_FPKM_WaGa_RNA", "MKL-1 RNA" = "log2_FPKM_MKL.1_RNA", "WaGa EV" = "log2_FPKM_X1107_WaGa_wt_EV", "MKL-1 EV" = "log2_FPKM_X042_MKL.1_wt_EV")

    df_long <- pivot_longer(fpkm_data_selected, cols = -ensembl_gene_id, names_to = "x", values_to = "y")
    # use mutate() and ifelse() to add a color column
    df_long <- mutate(df_long, color = ifelse(ensembl_gene_id %in% c("ST", "LT", "VP1", "VP2"),
                                         "viral",
                                         "human"))
    df_long$x <- factor(df_long$x, levels=c("WaGa RNA","MKL-1 RNA","WaGa EV","MKL-1 EV")) 
    # Sample 1000 rows from your data frame
    #df_sampled <- df_long %>% sample_n(1000)

    # plot using geom_jitter
    png("plot_samples_against_FPKM.png", width = 800, height = 1000)
    #svg("plot_samples_against_FPKM.svg")
    ggplot(df_long, aes(x = x, y = y, color=color)) +
      scale_color_manual(values = c("human" = "darkblue", "viral" = "red")) +
      geom_jitter(width = 0.4) + 
      labs(x = "", y = "log2(Fragments Per Kilobase of transcript per Million mapped reads)")
    dev.off()

Designing Specific and Sensitive Nucleic Acid Probes for Bioinformatics Applications

Leave a reply

Designing a probe for bioinformatics applications typically involves the creation of DNA or RNA sequences that can specifically hybridize to target nucleic acids in a sample. This process is commonly used for applications such as DNA microarrays, fluorescence in situ hybridization (FISH), and qPCR. Here are the general steps involved in designing a probe for bioinformatics:

Define the target sequence: Identify the specific nucleic acid sequence (DNA or RNA) that you want to detect or analyze.
Obtain reference sequences: Collect reference sequences related to the target sequence from databases like GenBank, RefSeq, or Ensemble.
Select target regions: Choose specific regions within the target sequence that will allow for specific and sensitive detection. These regions should have unique characteristics, such as a conserved sequence or a single nucleotide polymorphism (SNP).
Design probe sequence: Generate the complementary probe sequence based on the selected target region. The length of the probe sequence should be appropriate for the intended application (e.g., 18-30 nucleotides for qPCR, 50-70 nucleotides for FISH).

Evaluate probe properties: Analyze the probe sequence for properties like melting temperature (Tm), GC content, and potential secondary structures. Ensure that these properties are within acceptable ranges for the intended application.

./BaitsTools/baitstools.rb checkbaits -i BLAST_2_NCCRmiRNA.csv -L 25 -c -N -C  -n0.0 -x100.0 -q67.0 -z76.0  -T DNA-DNA > BLAST_2_NCCRmiRNA_filtered_by_Tm.txt
../BaitsTools/baitstools.rb checkbaits -i out-baits.fa -L 25 -c -N -C  -n0.0 -x100.0 -q67.0 -z76.0  -T DNA-DNA > probes_filtered_by_Tm.txt
#BLAST_2_NCCRmiRNA_filtered_by_Tm.txt contains 200 records.
#The output file is out-filtered-baits.fa also contains 200 records.

Cluster similar probes: Group probes with high sequence similarity together to reduce redundancy and save costs. This can be achieved using clustering algorithms such as hierarchical clustering or k-means clustering. By identifying and combining highly similar probes, you can streamline the probe set without sacrificing coverage of the target region.
```
usearch -cluster_smallmem out-filtered-baits.fa -id 0.90 -centroids out-filtered-baits_clustered.fa
```

Check for specificity: Perform in silico analyses (e.g., BLAST search) to ensure that the designed probe does not have significant homology to other sequences in the genome or transcriptome, which could lead to false-positive results.

# -- simple version, it works well! --
#------------------------------------------------------------------------------
#Standard Nucleotide BLAST              11         On (DUST)           10
#Search for short/near exact matches     7         Off               1000
blastn -task blastn-short -db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -query 2_clustered_with_0.90.fasta -out 2_clustered_with_0.90_on-human.blastn -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
blastn -task blastn-short -db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -query 1_filtered_by_Tm.fasta -out 1_filtered_by_Tm_on-human.blastn -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
blastn -task blastn-short -db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -query BLAST_2_NCCRmiRNA.csv -out BLAST_2_NCCRmiRNA_on-human_evalue0.1.blastn -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
blastn -task blastn-short -db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -query BLAST_2_NCCRmiRNA.csv -out BLAST_2_NCCRmiRNA_on-human_evalue1.blastn -evalue 1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
#qseqid chr pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore

# -- complicated version, it does not work well! --
blastn -db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -query out-filtered-baits_clustered.fa -out blast_result_human.txt -evalue 1 -num_threads 15 -outfmt 6 -strand both
paste f1 f2 f3_4 f5 f6_7 f8 > input_baitfilter.txt
BaitFilter -m blast-a --ref-blast-db /ref/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa -i input_baitfilter.txt --blast-first-hit-evalue 0.00000001 --blast-second-hit-evalue 0.00001 -o baits_excluding_human.out
mv blast_result.txt blast_result_human.txt
cut -f1 blast_result_human.txt > blast_result_human_f1.txt
# using commands '| sort | uniq -c' exclude the baits.
cat blast_result_human_f1.txt | sort | uniq -c | wc -l
cat blast_result_human_f1.txt | sort | uniq -c > blast_result_human_counts.txt

Check for self-matching: Assess the designed probe for self-complementarity, which could lead to the formation of hairpins or dimers that may decrease the probe’s specificity and sensitivity. Use tools like Primer3, OligoAnalyzer, or UNAFold to identify and minimize self-complementarity in the probe sequence.

makeblastdb -in 2_clustered_with_0.90.fasta -dbtype nucl   
makeblastdb -in 1_filtered_by_Tm.fasta -dbtype nucl  
# NOTE that '-strand minus' may be wrong, should be set as 'both': -strand <String, `both', `minus', `plus'>
#qseqid sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore
blastn -task blastn-short -db 2_clustered_with_0.90.fasta -query 2_clustered_with_0.90.fasta -out 2_clustered_with_0.90_self-comp.blastn -evalue 0.1 -num_threads 15 -outfmt 6 -strand minus -max_target_seqs 1  
blastn -task blastn-short -db 1_filtered_by_Tm.fasta -query 1_filtered_by_Tm.fasta -out 1_filtered_by_Tm_self-comp.blastn -evalue 0.1 -num_threads 15 -outfmt 6 -strand minus -max_target_seqs 1  
makeblastdb -in 2_clustered_with_0.90.fasta -dbtype nucl         
blastn -db 2_clustered_with_0.90.fasta -query 2_clustered_with_0.90.fasta -out 2_clustered_with_0.90_self-comp.blastn -evalue 1e-10 -num_threads 15 -outfmt 6  -strand minus  # -max_target_seqs 1
blastn -db ligated_probes_clustered.fasta -query ligated_probes_clustered.fasta -out probes_clustered_on_human_self-comp.blastn -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both

Validate probe performance: Test the designed probe experimentally to confirm its specificity, sensitivity, and overall performance in the intended application.

Analyze and interpret data: Use the designed probe in the bioinformatics application and analyze the resulting data to draw conclusions about the target sequence or biological system under investigation.

~/Tools/csv2xls-0.4/csv_to_xls.py 1_filtered_by_Tm_on-human_re.blastn 1_filtered_by_Tm_self-comp.blastn -d$'\t' -o 1_filtered_qualitycontrol.xls
~/Tools/csv2xls-0.4/csv_to_xls.py 2_clustered_with_0.90_on-human.fasta.blastn 2_clustered_with_0.90_self-comp.blastn -d$'\t' -o 2_clustered_with_0.90_qualitycontrol.xls
~/Tools/csv2xls-0.4/csv_to_xls.py 1_filtered_by_Tm.txt 2_clustered_with_0.90.txt -d$'\t' -o probes.xls
~/Tools/csv2xls-0.4/csv_to_xls.py BLAST_2_NCCRmiRNA_on-human_evalue0.1.blastn -d$'\t' -o BLAST_2_NCCRmiRNA_homology_to_human.xls
#For most applications, an e-value cutoff of 1e-5 or 1e-6 is considered appropriate. This threshold balances the need for specificity while still allowing for some degree of similarity between the query sequence and the target genome. However, if you require higher stringency, you can use a lower e-value cutoff, such as 1e-10 or 1e-20.

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

General statistics and important genes in humans, bacteria, and viruses

细菌中最神奇的10个基因 (10 Most Amazing Genes in Bacteria)

Merkel Cell Polyomavirus: Decoding Reference Sequences, miRNA, and the Oncogenic Roles of LT, LTtr, and sT in Merkel Cell Carcinoma

Visualizing Phylogenetic Relationships and Metadata with Circular ggtree and gheatmap in R

Bubble Plot Visualization of Gene Expression Changes

智力相关基因：揭示认知能力的遗传因素

性取向相关基因：探索与同性恋相关的遗传因素

癌症相关基因：多种基因与癌症风险的关联

Display viral transcripts found in mRNA-seq MKL-1, WaGa EVs compared to cells

Designing Specific and Sensitive Nucleic Acid Probes for Bioinformatics Applications