Author Archives: gene_x

RNA-seq 2024 Ute from raw counts

1, input files (use R 4.3.3 (/home/jhuang/miniconda3/bin/R))

  merged_gene_counts_40samples.txt
  merged_gene_counts_WaGa_virus_rounded.txt
  merged_gene_counts_MKL-1_virus_rounded.txt

  #!! Problem: merge the two files and check if the read number in Dox samples are really lower than that of DMSO samples --> Not really!

2, background knowledge

  #Figure 4. MCPyV sT alters the expression of cell surface proteins in WaGa cells. WaGa shRNA scr and shRNA sT cell lines were induced with 1 m g/ml Dox or DMSO 3 d before the experiment. #Surface proteins upregulated during WaGa sT KD, indicating downregulation by sT in WaGa cells, are depicted in a.
  #Surface markers downregulated during sT KD, indicating an upregulation by sT in WaGa cells, are displayed in b. The threshold of sT-dependent differential regulation was set to 0.1 log 2diff.
  (c) The independent, confirmatory FACS experiments for the surface markers ADAM10, CD44, CD47, and CD95 in WaGa cells described in a and b Dox/DMSO addition.
  (d) CD47 surface marker expression in HEK293, WaGa, or nHDF cells overexpressing sT. All experiments were performed in triplicates,

  grown as described (Czech-Sioli et al., 2020a).
  For knockdown experiments, cells were induced with 1 mg/ml Dox or DMSO.
  Plasmids, shRNAs, and lentiviral transduction
  - MCPyV sT cDNA was cloned into lentiviral expression plasmid LeGo-iC2 through EcoRI and NotI restriction sites.
  - Small interfering RNA transfection
  - Analysis was performed using www.webgestalt.org.
  - Complete data of transcriptome analysis can be found in Supplementary Tables S1 and S2.
  d, day;
  Dox, doxycycline;
  FDR, false discovery rate;
  GO, gene ontology;
  GSEA, gene set enrichment analysis;
  KD, knockdown;
  RNA-Seq, RNA sequencing;
  scr, scrambled;
  #For knockdown experiments, cells were induced with 1 mg/ml Dox or DMSO.
  #shRNA scr, control cell line expressing a scrambled short hairpin RNA;
  #shRNA sT, small T antigen‒specific short hairpin RNA;
  #sT, small T antigen.

  #Figure 1. (b) Immunoblot analysis of total cell lysates from WaGa cells transduced with shRNA scr, shRNA sT, or shRNA sT/LT 3 and 5 d after Dox induction.
  #sT and LTtrunc protein expression was detected using the sT/LT-recognizing antibody 2T2 (first panel) and the LT-specific antibody Cm2B4 antibody (third panel); antibody-recognizing actin was used as a loading control.
  !Only "shRNA sT with Dox induction" can low the expression of sT; the DMSO control or shRNA scrambled both cannot low the expression.
  #Only "shRNA sT with Dox induction" specifically targets and reduces the expression of small T antigen (sT). The DMSO control and the scrambled shRNA do not have this effect because they do not specifically target the sT gene. The Dox induction is necessary to activate the shRNA sT, leading to the knockdown of sT expression.

  #In the figure, "shRNA scr" not knocked down, "shRNA sT and DMSO" not knocked down!

  #Yes, typically, in systems where doxycycline (Dox) is used to control shRNA expression, all shRNAs under the control of a Dox-inducible promoter would require doxycycline to be induced. Doxycycline is often used in inducible systems like the Tet-On or Tet-Off systems to regulate gene expression, including the expression of short hairpin RNAs (shRNAs). In these systems, without doxycycline, there should be minimal to no expression of the shRNA. When doxycycline is added, it induces the expression of the shRNA, leading to the targeted knockdown of gene expression.

  #shRNA (小干扰RNA)敲减的原理是通过介导RNA干扰(RNA interference, RNAi)过程来降低特定基因的表达。shRNA是一种双链RNA分子，其中一条链与目标基因的mRNA序列互补。当shRNA进入细胞后，它被DICER酶切割成小分子干扰RNA (siRNA)。然后siRNA被纳入到RNA诱导沉默复合体(RISC)中。在RISC中，siRNA的一条链被降解，而另一条与目标mRNA互补的链则被用来寻找相对应的mRNA分子。
  当siRNA找到与其互补的目标mRNA后，RISC会切割这个mRNA，从而阻止它被翻译成蛋白质。这种降解过程减少了目标基因在细胞内的表达量，从而实现了基因表达的敲减。通过选择特定的目标基因进行敲减，研究人员可以研究这些基因的功能，或者在疾病治疗中靶向这些基因。
  #shRNA scrambled（杂乱或无特定靶向的小干扰RNA）是一种设计用于不特定地靶向任何基因的短链RNA。它通常用作对照，以确保观察到的效果是特定于靶向的shRNA（如针对特定基因的shRNA）的结果，而不是由于RNA干扰技术本身或细胞对处理的非特异性反应。shRNA scrambled不会靶向或降低任何特定基因的表达，因此在实验中，它用于与特定靶向的shRNA处理的细胞进行比较，以证明任何观察到的变化是由于特定基因表达的降低。
  #DMSO 是二甲基亚砜（Dimethyl sulfoxide）的缩写，这是一种有机溶剂，常用于生物学实验，可以增加细胞膜的渗透性，帮助药物或化合物进入细胞。DMSO 也常作为对照物质使用，在实验中用于对比特定处理的效果。
  #Doxycycline（Dox）是一种抗生素，属于四环素类，通常用于治疗各种感染症，如呼吸道感染、尿路感染、眼睛感染等。在分子生物学研究中，Doxycycline 常用于诱导表达系统，如在doxycycline-inducible shRNA表达系统中，Doxycycline 的添加可以诱导特定基因的沉默或表达。

  #Difference between design=~shRNA+treatment+shRNA:treatment and design=~shRNA+treatment
  #在这两个设计公式中，design=~shRNA+treatment 表示一个没有交互作用项的模型，其中 shRNA 和 treatment 作为独立的因素被考虑。这意味着每个因素的效果是单独评估的，不考虑这两个因素之间可能的相互作用。
  #而 design=~shRNA+treatment+shRNA:treatment 这个设计公式包含了一个交互作用项（shRNA:treatment），这表示除了考虑 shRNA 和 treatment 作为独立因素的效果之外，还要考虑它们之间的相互作用。换句话说，该模型会评估 treatment（处理方式）如何依赖于不同 shRNA 的情况下有所不同。
  #总之，第一个设计只考虑了独立效果，而第二个设计还考虑了这两个因素的相互作用。在进行实验设计和统计分析时，选择哪一个取决于你的研究假设和数据的特点。

3, common processing for the data MKL+1 + WaGa

  #The pipeline finished successfully, but the following samples were skipped,
  #  - 0505_MKL-1_wt_EV,EV.RNA

  #### -------
  setwd("/home/jhuang/DATA/Data_Ute/Data_RNA-Seq_MKL-1_WaGa/results_2024_2/featureCounts")

  # ---- when BiocManager::install("ggtree") doesn't work, uisng devtools install it (Successful!) ----
  #update.packages(ask = FALSE, checkBuilt = TRUE)
  #.libPaths()
  #install.packages(c("rlang", "cli"))
  #rm(list = ls())
  #options(repos = BiocManager::repositories())
  #if (!requireNamespace("devtools", quietly = TRUE))
  #    install.packages("devtools")
  #devtools::install_github("YuLab-SMU/ggtree")

  #install.packages("gplots")
  #if (!requireNamespace("BiocManager", quietly = TRUE))
  #    install.packages("BiocManager")
  #BiocManager::install(c("clusterProfiler", "ReactomePA", "DESeq2", "AnnotationDbi", "GenomeInfoDb", "Biostrings"), force=TRUE)

  # R(4.3) works well of the saga server!
  library("AnnotationDbi")
  library("clusterProfiler")
  library("ReactomePA")
  #library("org.Mm.eg.db")
  library(DESeq2)
  library(gplots)

  [1] gene_name
  [2] X042_MKL.1_wt_EV
  [3] MKL.1_RNA_147
  [4] X042_MKL.1_sT_DMSO     #EIGENTLICH _EV
  [5] MKL.1_RNA
  [6] X0505_MKL.1_scr_DMSO_EV
  [7] X0505_MKL.1_sT_DMSO_EV
  [8] MKL.1_EV.RNA_87
  [9] MKL.1_EV.RNA_27
  [10] X042_MKL.1_scr_Dox_EV
  [11] X042_MKL.1_scr_DMSO_EV
  [12] MKL.1_EV.RNA_118
  [13] X0505_MKL.1_scr_Dox_EV
  [14] MKL.1_EV.RNA
  [15] X042_MKL.1_sT_Dox      #EIGENTLICH _EV
  [16] X0505_MKL.1_sT_Dox_EV
  [17] MKL.1_RNA_118
  [18] MKL.1_EV.RNA_2

  [19] Geneid.1
  [20] gene_name.1
  [21] X1605_WaGa_sT_DMSO_EV
  [22] WaGa_EV.RNA_118
  [23] WaGa_EV.RNA
  [24] X2706_WaGa_scr_Dox_EV
  [25] X2706_WaGa_sT_DMSO_EV
  [26] X1107_WaGa_wt_EV
  [27] X1107_WaGa_sT_Dox_EV
  [28] WaGa_RNA
  [29] X1605_WaGa_scr_DMSO_EV
  [30] X2706_WaGa_scr_DMSO_EV
  [31] WaGa_EV.RNA_226
  [32] X2706_WaGa_sT_Dox_EV
  [33] X1605_WaGa_scr_Dox_EV
  [34] X1605_WaGa_wt_EV
  [35] X1605_WaGa_sT_Dox_EV
  [36] X1107_WaGa_scr_DMSO_EV
  [37] WaGa_EV.RNA_2
  [38] WaGa_EV.RNA_147
  [39] WaGa_RNA_118
  [40] X1107_WaGa_scr_Dox_EV
  [41] X1107_WaGa_sT_DMSO_EV
  [42] WaGa_RNA_147
  [43] X2706_WaGa_wt_EV

  #Geneid  gene_name       1605_WaGa_sT_DMSO_EV    WaGa_EV-RNA_118 WaGa_EV-RNA     2706_WaGa_scr_Dox_EV    2706_WaGa_sT_DMSO_EV    1107_WaGa_wt_EV 1107_WaGa_sT_Dox_EVWaGa_RNA        1605_WaGa_scr_DMSO_EV   2706_WaGa_scr_DMSO_EV   WaGa_EV-RNA_226 2706_WaGa_sT_Dox_EV     1605_WaGa_scr_Dox_EV    1605_WaGa_wt_EV 1605_WaGa_sT_Dox_EV1107_WaGa_scr_DMSO_EV   WaGa_EV-RNA_2   WaGa_EV-RNA_147 WaGa_RNA_118    1107_WaGa_scr_Dox_EV    1107_WaGa_sT_DMSO_EV    WaGa_RNA_147    2706_WaGa_wt_EV

  d.raw_human<- read.delim2("merged_gene_counts_40samples.txt",sep="\t", header=TRUE, row.names=1)
  colnames(d.raw_human)<- c("gene_name","X042_MKL.1_wt_EV","MKL.1_RNA_147","X042_MKL.1_sT_DMSO","MKL.1_RNA","X0505_MKL.1_scr_DMSO_EV","X0505_MKL.1_sT_DMSO_EV","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","X042_MKL.1_scr_Dox_EV","X042_MKL.1_scr_DMSO_EV","MKL.1_EV.RNA_118","X0505_MKL.1_scr_Dox_EV","MKL.1_EV.RNA","X042_MKL.1_sT_Dox","X0505_MKL.1_sT_Dox_EV","MKL.1_RNA_118","MKL.1_EV.RNA_2",      "Geneid.1","gene_name.1","X1605_WaGa_sT_DMSO_EV","WaGa_EV.RNA_118","WaGa_EV.RNA","X2706_WaGa_scr_Dox_EV","X2706_WaGa_sT_DMSO_EV","X1107_WaGa_wt_EV","X1107_WaGa_sT_Dox_EV","WaGa_RNA","X1605_WaGa_scr_DMSO_EV","X2706_WaGa_scr_DMSO_EV","WaGa_EV.RNA_226","X2706_WaGa_sT_Dox_EV","X1605_WaGa_scr_Dox_EV","X1605_WaGa_wt_EV","X1605_WaGa_sT_Dox_EV","X1107_WaGa_scr_DMSO_EV","WaGa_EV.RNA_2","WaGa_EV.RNA_147","WaGa_RNA_118","X1107_WaGa_scr_Dox_EV","X1107_WaGa_sT_DMSO_EV","WaGa_RNA_147","X2706_WaGa_wt_EV")

  col_order <- c("gene_name",  "MKL.1_RNA","MKL.1_RNA_118","MKL.1_RNA_147","MKL.1_EV.RNA","MKL.1_EV.RNA_2","MKL.1_EV.RNA_118","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","X042_MKL.1_wt_EV","X042_MKL.1_sT_DMSO","X0505_MKL.1_sT_DMSO_EV","X042_MKL.1_scr_DMSO_EV","X0505_MKL.1_scr_DMSO_EV","X042_MKL.1_sT_Dox","X0505_MKL.1_sT_Dox_EV","X042_MKL.1_scr_Dox_EV","X0505_MKL.1_scr_Dox_EV",     "Geneid.1","gene_name.1",    "WaGa_RNA","WaGa_RNA_118","WaGa_RNA_147",    "WaGa_EV.RNA","WaGa_EV.RNA_2","WaGa_EV.RNA_118","WaGa_EV.RNA_147","WaGa_EV.RNA_226","X1107_WaGa_wt_EV","X1605_WaGa_wt_EV","X2706_WaGa_wt_EV",    "X1107_WaGa_sT_DMSO_EV","X1605_WaGa_sT_DMSO_EV","X2706_WaGa_sT_DMSO_EV",    "X1107_WaGa_scr_DMSO_EV","X1605_WaGa_scr_DMSO_EV","X2706_WaGa_scr_DMSO_EV",     "X1107_WaGa_sT_Dox_EV","X1605_WaGa_sT_Dox_EV","X2706_WaGa_sT_Dox_EV",     "X1107_WaGa_scr_Dox_EV","X1605_WaGa_scr_Dox_EV","X2706_WaGa_scr_Dox_EV")
  reordered.raw_human <- d.raw_human[,col_order]

  d.raw_virus <- read.delim2("merged_gene_counts_virus_rounded.txt",sep="\t", header=TRUE, row.names=1)
  reordered.raw_virus <- d.raw_virus[,col_order]

  identical(colnames(reordered.raw_human), colnames(reordered.raw_virus))

  reordered.raw <- rbind(reordered.raw_human, reordered.raw_virus)

  #rename
  #colnames(reordered.raw) <- c("gene_name", "MKL-1 RNA","MKL-1 RNA 118","MKL-1 RNA 147",    "MKL-1 EV","MKL-1 EV 2","MKL-1 EV 118","MKL-1 EV 87","MKL-1 EV 27","MKL-1 EV 042",    "MKL-1 EV sT DMSO 042","MKL-1 EV sT DMSO 0505","MKL-1 EV scr DMSO 042","MKL-1 EV scr DMSO 0505","MKL-1 EV sT Dox 042","MKL-1 EV sT Dox 0505","MKL-1 EV scr Dox 042","MKL-1 EV scr Dox 0505",   "Geneid.1","gene_name.1",    "WaGa RNA","WaGa RNA 118","WaGa RNA 147",    "WaGa EV","WaGa EV 2","WaGa EV 118","WaGa EV 147","WaGa EV 226","WaGa EV 1107","WaGa EV 1605","WaGa EV 2706",    "WaGa EV sT DMSO 1107","WaGa EV sT DMSO 1605","WaGa EV sT DMSO 2706",     "WaGa EV scr DMSO 1107","WaGa EV scr DMSO 1605","WaGa EV scr DMSO 2706",     "WaGa EV sT Dox 1107","WaGa EV sT Dox 1605","WaGa EV sT Dox 2706",     "WaGa EV scr Dox 1107","WaGa EV scr Dox 1605","WaGa EV scr Dox 2706")

  colnames(reordered.raw) <- c("gene_name", "MKL-1 parental cell RNA","MKL-1 parental cell RNA 118","MKL-1 parental cell RNA 147",    "MKL-1 wt EV RNA","MKL-1 wt EV RNA 2","MKL-1 wt EV RNA 118","MKL-1 wt EV RNA 87","MKL-1 wt EV RNA 27","MKL-1 wt EV RNA 042",    "MKL-1 sT DMSO EV RNA 042","MKL-1 sT DMSO EV RNA 0505",  "MKL-1 scr DMSO EV RNA 042","MKL-1 scr DMSO EV RNA 0505",  "MKL-1 sT Dox EV RNA 042","MKL-1 sT Dox EV RNA 0505",  "MKL-1 scr Dox EV RNA 042","MKL-1 scr Dox EV RNA 0505",   "Geneid.1","gene_name.1",    "WaGa parental cell RNA","WaGa parental cell RNA 118","WaGa parental cell RNA 147",    "WaGa wt EV RNA","WaGa wt EV RNA 2","WaGa wt EV RNA 118","WaGa wt EV RNA 147","WaGa wt EV RNA 226","WaGa wt EV RNA 1107","WaGa wt EV RNA 1605","WaGa wt EV RNA 2706",    "WaGa sT DMSO EV RNA 1107","WaGa sT DMSO EV RNA 1605","WaGa sT DMSO EV RNA 2706",     "WaGa scr DMSO EV RNA 1107","WaGa scr DMSO EV RNA 1605","WaGa scr DMSO EV RNA 2706",     "WaGa sT Dox EV RNA 1107","WaGa sT Dox EV RNA 1605","WaGa sT Dox EV RNA 2706",     "WaGa scr Dox EV RNA 1107","WaGa scr Dox EV RNA 1605","WaGa scr Dox EV RNA 2706")

  reordered.raw$gene_name <- NULL
  reordered.raw$Geneid.1 <- NULL
  reordered.raw$gene_name.1 <- NULL
  write.csv(reordered.raw, file="counts.txt")
  #IMPORTANT that we should filter the data with the counts in the STEP!
  d <- reordered.raw[rowSums(reordered.raw>3)>2,]

  condition_for_pca = as.factor(c("RNA","RNA","RNA","EV","EV","EV","EV","EV","EV","sT.DMSO","sT.DMSO","scr.DMSO","scr.DMSO","sT.Dox","sT.Dox","scr.Dox","scr.Dox",    "RNA","RNA","RNA","EV","EV","EV","EV","EV","EV","EV","EV","sT.DMSO","sT.DMSO","sT.DMSO","scr.DMSO","scr.DMSO","scr.DMSO","sT.Dox","sT.Dox","sT.Dox","scr.Dox","scr.Dox","scr.Dox"))

  condition = as.factor(c("MKL1.RNA","MKL1.RNA","MKL1.RNA","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.sT.DMSO","MKL1.sT.DMSO","MKL1.scr.DMSO","MKL1.scr.DMSO","MKL1.sT.Dox","MKL1.sT.Dox","MKL1.scr.Dox","MKL1.scr.Dox",    "WaGa.RNA","WaGa.RNA","WaGa.RNA","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.scr.Dox","WaGa.scr.Dox","WaGa.scr.Dox"))

  #not sure if they rep1 in the first time is the same to the second time.
  donor = as.factor(c("1","118","147",  "1","2","118","87","27","042",  "042","0505","042","0505","042","0505","042","0505",    "1","118","147",  "1","2","118","147","226","1107","1605","2706",   "1107","1605","2706","1107","1605","2706","1107","1605","2706","1107","1605","2706"))

  batch = as.factor(c("2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08",    "2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11"))

  cell.line = as.factor(c("MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1",    "WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))

  ids = as.factor(c("MKL1.RNA","MKL1.RNA.118","MKL1.RNA.147",  "MKL1.EV","MKL1.EV.2","MKL1.EV.118","MKL1.EV.87","MKL1.EV.27","MKL1.EV.042",  "MKL1.EV.sT.DMSO.042","MKL1.EV.sT.DMSO.0505",  "MKL1.EV.scr.DMSO.042","MKL1.EV.scr.DMSO.0505",  "MKL1.EV.sT.Dox.042","MKL1.EV.sT.Dox.0505",  "MKL1.EV.scr.Dox.042","MKL1.EV.scr.Dox.0505",      "WaGa.RNA","WaGa.RNA.118","WaGa.RNA.147",  "WaGa.EV","WaGa.EV.2","WaGa.EV.118","WaGa.EV.147","WaGa.EV.226","WaGa.EV.1107","WaGa.EV.1605","WaGa.EV.2706",  "WaGa.EV.sT.DMSO.1107","WaGa.EV.sT.DMSO.1605","WaGa.EV.sT.DMSO.2706",  "WaGa.EV.scr.DMSO.1107","WaGa.EV.scr.DMSO.1605","WaGa.EV.scr.DMSO.2706",  "WaGa.EV.sT.Dox.1107","WaGa.EV.sT.Dox.1605","WaGa.EV.sT.Dox.2706",  "WaGa.EV.scr.Dox.1107","WaGa.EV.scr.Dox.1605","WaGa.EV.scr.Dox.2706"))

  #DEL ids = as.factor(c("MKL.1_RNA","MKL.1_RNA_118","MKL.1_RNA_147","MKL.1_EV.RNA","MKL.1_EV.RNA_2","MKL.1_EV.RNA_118","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","042_MKL.1_wt_EV","042_MKL.1_sT_DMSO","0505_MKL.1_sT_DMSO_EV","042_MKL.1_scr_DMSO_EV","0505_MKL.1_scr_DMSO_EV","042_MKL.1_sT_Dox","0505_MKL.1_sT_Dox_EV","042_MKL.1_scr_Dox_EV","0505_MKL.1_scr_Dox_EV"))

  #IMPORTANT: using d instead of reordered.raw.
  #cData = data.frame(row.names=colnames(d), condition=condition,  batch=batch, ids=ids)
  #dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~batch+condition)
  #cData = data.frame(row.names=colnames(d), condition=condition, ids=ids)
  #dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~condition)
  cData = data.frame(row.names=colnames(d), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
  #dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~batch+condition_for_pca)
  dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~batch+condition)

  #rld <- rlogTransformation(dds)
  rld <- vst(dds)
  #--> [OPTION] goto DEG_Heatmap_Drawing!

4, preparing the data for PCA_MKL1 and PCA_WaGa drawing

  d_WaGa <- d[, !grepl("parental|MKL-1", names(d))]
  condition = as.factor(c("WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.scr.Dox","WaGa.scr.Dox","WaGa.scr.Dox"))
  donor = as.factor(c("1","2","118","147","226","1107","1605","2706",   "1107","1605","2706","1107","1605","2706","1107","1605","2706","1107","1605","2706"))
  batch = as.factor(c("2021.08","2021.08","2021.09","2021.09","2021.09","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11"))

  cell.line = as.factor(c("WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
  ids = as.factor(c("WaGa.EV","WaGa.EV.2","WaGa.EV.118","WaGa.EV.147","WaGa.EV.226","WaGa.EV.1107","WaGa.EV.1605","WaGa.EV.2706",  "WaGa.EV.sT.DMSO.1107","WaGa.EV.sT.DMSO.1605","WaGa.EV.sT.DMSO.2706",  "WaGa.EV.scr.DMSO.1107","WaGa.EV.scr.DMSO.1605","WaGa.EV.scr.DMSO.2706",  "WaGa.EV.sT.Dox.1107","WaGa.EV.sT.Dox.1605","WaGa.EV.sT.Dox.2706",  "WaGa.EV.scr.Dox.1107","WaGa.EV.scr.Dox.1605","WaGa.EV.scr.Dox.2706"))
  cData = data.frame(row.names=colnames(d_WaGa), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
  dds_WaGa<-DESeqDataSetFromMatrix(countData=d_WaGa, colData=cData, design=~batch+condition)
  rld_WaGa <- vst(dds_WaGa)

  d_MKL1 <- d[, !grepl("parental|WaGa", names(d))]
  condition = as.factor(c("MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.sT.DMSO","MKL1.sT.DMSO","MKL1.scr.DMSO","MKL1.scr.DMSO","MKL1.sT.Dox","MKL1.sT.Dox","MKL1.scr.Dox","MKL1.scr.Dox"))
  donor = as.factor(c("1","2","118","87","27","042",  "042","0505","042","0505","042","0505","042","0505"))
  batch = as.factor(c("2021.08","2021.08","2021.09","2021.09","2021.09","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08"))
  cell.line = as.factor(c("MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1"))
  ids = as.factor(c("MKL1.EV","MKL1.EV.2","MKL1.EV.118","MKL1.EV.87","MKL1.EV.27","MKL1.EV.042",  "MKL1.EV.sT.DMSO.042","MKL1.EV.sT.DMSO.0505",  "MKL1.EV.scr.DMSO.042","MKL1.EV.scr.DMSO.0505",  "MKL1.EV.sT.Dox.042","MKL1.EV.sT.Dox.0505",  "MKL1.EV.scr.Dox.042","MKL1.EV.scr.Dox.0505"))
  cData = data.frame(row.names=colnames(d_MKL1), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
  dds_MKL1<-DESeqDataSetFromMatrix(countData=d_MKL1, colData=cData, design=~batch+condition)
  rld_MKL1 <- vst(dds_MKL1)

  # -- before pca --
  png("pca_before_batch_correction2.png", 1200, 800)
  #plotPCA(rld, intgroup=c("condition"))
  plotPCA(rld, intgroup = c("condition", "batch"))
  #plotPCA(rld, intgroup = c("condition", "ids"))
  #plotPCA(rld, "batch")
  dev.off()
  #73% (PC1), 11% (PC2)   8% (PC3)

5, drawing PCA MKL1+WaGa

  # -- construct a data structure (merged_df) as above with data and pc --
  library(ggplot2)
  data <- plotPCA(rld, intgroup=c("condition", "batch", "cell.line"), returnData=TRUE)
  write.csv(data, file="plotPCA_data.csv")
  #calculate all PCs including PC3 with the following codes
  library(genefilter)
  ntop <- 500
  rv <- rowVars(assay(rld))
  select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  mat <- t( assay(rld)[select, ] )
  pc <- prcomp(mat)
  summary(pc)
  #Cumulative Proportion   0.7651  0.07746 0.05608   0.01405 0.00984 0.00857 0.00689 (with rld <- rlogTransformation(dds))
  #Proportion of Variance  0.7286  0.1060  0.07657   0.01445 0.01042 0.00705 0.00578 (with rld <- vst(dds))
  pc$x[,1:3]
  #df_pc <- data.frame(pc$x[,1:3])
  df_pc <- data.frame(pc$x)

  #> head(data)
  #                     PC1        PC2            group condition   batch
  #MKL-1 RNA     -71.108137  -2.054660 MKL1.RNA:2021.08  MKL1.RNA 2021.08
  #MKL-1 RNA 118 -62.978513  -6.906138 MKL1.RNA:2021.09  MKL1.RNA 2021.09
  #MKL-1 RNA 147 -77.698768  -3.355581 MKL1.RNA:2021.09  MKL1.RNA 2021.09
  #MKL-1 EV      -49.482607 -25.469602  MKL1.EV:2021.08   MKL1.EV 2021.08
  #MKL-1 EV 2    -19.805802 -23.850122  MKL1.EV:2021.08   MKL1.EV 2021.08
  #MKL-1 EV 118    2.264943 -22.114427  MKL1.EV:2021.09   MKL1.EV 2021.09
  #> head(df_pc)
  #                     PC1        PC2       PC3        PC4       PC5        PC6
  #MKL-1 RNA     -71.108137  -2.054660 16.617315 -1.7281595  6.275895 -2.3597341
  #MKL-1 RNA 118 -62.978513  -6.906138  9.973149 -2.4119718  7.778461 -2.3797116
  #MKL-1 RNA 147 -77.698768  -3.355581 17.062958 -2.0865184  6.100746 -0.6303835
  #MKL-1 EV      -49.482607 -25.469602 -2.410902  2.2679166  5.300198 10.5329350
  #MKL-1 EV 2    -19.805802 -23.850122 -1.704857 -2.3895896 -1.742060 -3.7750239
  #MKL-1 EV 118    2.264943 -22.114427 -4.991651 -0.3336808 -2.973397 -4.5959970

  #Note it is a little different since we added the virus-gene-expression in the table!
  identical(rownames(data), rownames(df_pc)) #-->TRUE
  data$PC1 <- NULL
  data$PC2 <- NULL
  merged_df <- merge(data, df_pc, by = "row.names")
  #merged_df <- merged_df[, -1]
  row.names(merged_df) <- merged_df$Row.names
  merged_df$Row.names <- NULL  # remove the "name" column
  merged_df$name <- NULL
  merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","PC15","PC16","PC17","PC18","PC19","PC20","PC21","PC22","PC23","PC24","PC25","PC26","PC27","PC28","PC29","PC30","PC31","PC32","PC33","PC34","PC35","PC36","PC37","PC38","PC39","PC40","group","condition","batch","cell.line")]
  write.csv(merged_df, file="merged_df_40PCs.csv")

  #using the python script to draw the 3D PCA-plot.
  # need to install plotly with pip ~/.local/bin/pip3 in .local/bin/python3.10/site-packages
  #python3 ~/Scripts/PCA_3D_drawing.py
  python3 ~/Scripts/PCA_3D_drawing.py merged_df_40PCs.csv

6, drawing PCA WaGa

  library(ggplot2)
  data <- plotPCA(rld_WaGa, intgroup=c("condition", "batch", "cell.line"), returnData=TRUE)
  write.csv(data, file="plotPCA_data_WaGa.csv")
  #calculate all PCs including PC3 with the following codes
  library(genefilter)
  ntop <- 500
  rv <- rowVars(assay(rld_WaGa))
  select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  mat <- t( assay(rld_WaGa)[select, ] )
  pc <- prcomp(mat)
  summary(pc)
  #Proportion of Variance  0.6342 0.06491 0.06044 --> 0.63, 0.06, 0.06
  pc$x[,1:3]
  #df_pc <- data.frame(pc$x[,1:3])
  df_pc <- data.frame(pc$x)

  #Note it is a little different since we added the virus-gene-expression in the table!
  identical(rownames(data), rownames(df_pc)) #-->TRUE
  data$PC1 <- NULL
  data$PC2 <- NULL
  merged_df <- merge(data, df_pc, by = "row.names")
  #merged_df <- merged_df[, -1]
  row.names(merged_df) <- merged_df$Row.names
  merged_df$Row.names <- NULL  # remove the "name" column
  merged_df$name <- NULL
  merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","PC15","PC16","PC17","PC18","PC19","PC20","group","condition","batch","cell.line")]
  write.csv(merged_df, file="merged_df_WaGa.csv")

  python3 ~/Scripts/PCA_3D_drawing_WaGa.py

7, drawing PCA MKL1

  library(ggplot2)
  data <- plotPCA(rld_MKL1, intgroup=c("condition", "batch", "cell.line"), returnData=TRUE)
  write.csv(data, file="plotPCA_data_MKL1.csv")
  #calculate all PCs including PC3 with the following codes
  library(genefilter)
  ntop <- 500
  rv <- rowVars(assay(rld_MKL1))
  select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
  mat <- t( assay(rld_MKL1)[select, ] )
  pc <- prcomp(mat)
  summary(pc)
  #Proportion of Variance  0.7549  0.05717  0.04503 --> 0.75, 0.06, 0.05
  pc$x[,1:3]
  #df_pc <- data.frame(pc$x[,1:3])
  df_pc <- data.frame(pc$x)

  #Note it is a little different since we added the virus-gene-expression in the table!
  identical(rownames(data), rownames(df_pc)) #-->TRUE
  data$PC1 <- NULL
  data$PC2 <- NULL
  merged_df <- merge(data, df_pc, by = "row.names")
  #merged_df <- merged_df[, -1]
  row.names(merged_df) <- merged_df$Row.names
  merged_df$Row.names <- NULL  # remove the "name" column
  merged_df$name <- NULL
  merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","PC11","PC12","PC13","PC14","group","condition","batch","cell.line")]
  write.csv(merged_df, file="merged_df_MKL1.csv")

  python3 ~/Scripts/PCA_3D_drawing_MKL1.py

  # -- before heatmap --
  ## generate the pairwise comparison between samples
  library(gplots)
  library("RColorBrewer")
  png("heatmap_before_donor_correction.png", 1200, 800)
  distsRL <- dist(t(assay(rld)))
  mat <- as.matrix(distsRL)
  #paste( rld$dex, rld$cell, sep="-" )
  #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,batch, sep=":"))
  #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,ids, sep=":"))
  hc <- hclust(distsRL)
  hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
  heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
  dev.off()

  # -- remove batch effect --
  mat <- assay(rld)
  mm <- model.matrix(~condition, colData(rld))
  mat <- limma::removeBatchEffect(mat, batch=rld$batch, design=mm)
  assay(rld) <- mat

  # -- after pca --
  png("pca_after_batch_correction.png", 1200, 800)
  svg("pca_after_batch_correction.svg")
  plotPCA(rld, intgroup=c("condition"))
  dev.off()

  library(ggplot2)
  data <- plotPCA(rld, intgroup=c("condition", "cell.line"), returnData=TRUE)
  colnames(data) <- c("PC1","PC2","group2","Group","Cell.line","name")
  percentVar <- round(100 * attr(data, "percentVar"))
  #, shape=donor
  #png("pca6.png", 1000, 1000)
  svg("pca6.svg", 10, 8)
  ggplot(data, aes(PC1, PC2, color=Group, shape=Cell.line)) +
    geom_point(size=3) +
    scale_color_manual(values = c("RNA" = "grey",
                                  "EV"="cyan",
                                  "scr.DMSO"="#b2df8a",
                                  "sT.DMSO"="#33a02c",
                                  "scr.Dox"="#fb9a99",
                                  "sT.Dox"="#e31a1c")) +
    xlab(paste0("PC1: ",percentVar[1],"% variance")) +
    ylab(paste0("PC2: ",percentVar[2],"% variance")) +
    coord_fixed()
  dev.off()

  # -- after heatmap --
  ## generate the pairwise comparison between samples
  png("heatmap_after_donor_correction.png", 1200, 800)
  distsRL <- dist(t(assay(rld)))
  mat <- as.matrix(distsRL)
  rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,donor, sep=":"))
  #rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition,ids, sep=":"))
  hc <- hclust(distsRL)
  hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
  heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
  dev.off()

  #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
  sizeFactors(dds)
  #NULL
  dds <- estimateSizeFactors(dds)
  > sizeFactors(dds)
      MKL-1 parental cell RNA MKL-1 parental cell RNA 118
                    2.5689118                   1.6270659
  MKL-1 parental cell RNA 147             MKL-1 wt EV RNA
                    1.9241464                   1.5296877
            MKL-1 wt EV RNA 2         MKL-1 wt EV RNA 118
                    1.2829153                   0.6786713
          MKL-1 wt EV RNA 87          MKL-1 wt EV RNA 27
                    0.6741590                   0.4176156
          MKL-1 wt EV RNA 042    MKL-1 sT DMSO EV RNA 042
                    1.3360156                   0.9539241
    MKL-1 sT DMSO EV RNA 0505   MKL-1 scr DMSO EV RNA 042
                    1.7357438                   1.2957555
  MKL-1 scr DMSO EV RNA 0505     MKL-1 sT Dox EV RNA 042
                    1.0908450                   1.3657925
    MKL-1 sT Dox EV RNA 0505    MKL-1 scr Dox EV RNA 042
                    1.1221456                   1.4708191
    MKL-1 scr Dox EV RNA 0505      WaGa parental cell RNA
                    1.1242096                   2.1097851
  WaGa parental cell RNA 118  WaGa parental cell RNA 147
                    1.6925780                   1.6712182
              WaGa wt EV RNA            WaGa wt EV RNA 2
                    0.6021352                   1.2486966
          WaGa wt EV RNA 118          WaGa wt EV RNA 147
                    0.4518733                   0.5695057
          WaGa wt EV RNA 226         WaGa wt EV RNA 1107
                    0.5449802                   1.4885064
          WaGa wt EV RNA 1605         WaGa wt EV RNA 2706
                    1.2631181                   0.7678146
    WaGa sT DMSO EV RNA 1107    WaGa sT DMSO EV RNA 1605
                    0.7981256                   0.9054709
    WaGa sT DMSO EV RNA 2706   WaGa scr DMSO EV RNA 1107
                    1.0603277                   0.7927572
    WaGa scr DMSO EV RNA 1605   WaGa scr DMSO EV RNA 2706
                    0.8115785                   1.0659662
      WaGa sT Dox EV RNA 1107     WaGa sT Dox EV RNA 1605
                    0.8553457                   0.9547341
      WaGa sT Dox EV RNA 2706    WaGa scr Dox EV RNA 1107
                    0.9081678                   0.7440897
    WaGa scr Dox EV RNA 1605    WaGa scr Dox EV RNA 2706
                    0.8735490                   1.0146989

  raw_counts <- counts(dds)
  normalized_counts <- counts(dds, normalized=TRUE)
  write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
  write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)

  #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_RNAAligned.sortedByCoord.out.markDups.bam -o WaGa_RNA.bw --binSize 10 --scaleFactor 0.4958732    --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_RNA_118Aligned.sortedByCoord.out.markDups.bam -o WaGa_RNA_118.bw --binSize 10 --scaleFactor 0.6013898        --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_RNA_147Aligned.sortedByCoord.out.markDups.bam -o WaGa_RNA_147.bw --binSize 10 --scaleFactor 0.6154516      --effectiveGenomeSize 2864785220

  bamCoverage --bam ../markDuplicates/MKL1_RNAAligned.sortedByCoord.out.markDups.bam -o MKL1_RNA.bw --binSize 10 --scaleFactor  0.4078775      --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_RNA_118Aligned.sortedByCoord.out.markDups.bam -o MKL1_RNA_118.bw --binSize 10 --scaleFactor 0.6525297       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_RNA_147Aligned.sortedByCoord.out.markDups.bam -o MKL1_RNA_147.bw --binSize 10 --scaleFactor 0.5331748       --effectiveGenomeSize 2864785220

  bamCoverage --bam ../markDuplicates/WaGa_EV_RNAAligned.sortedByCoord.out.markDups.bam -o WaGa_EV_RNA.bw --binSize 10 --scaleFactor 1.733964       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_EV_RNA_2Aligned.sortedByCoord.out.markDups.bam -o WaGa_EV_RNA_2.bw --binSize 10 --scaleFactor  0.7761222      --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_EV_RNA_118Aligned.sortedByCoord.out.markDups.bam -o WaGa_EV_RNA_118.bw --binSize 10 --scaleFactor  2.222971      --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_EV_RNA_147Aligned.sortedByCoord.out.markDups.bam -o WaGa_EV_RNA_147.bw --binSize 10 --scaleFactor  1.679485     --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/WaGa_EV_RNA_226Aligned.sortedByCoord.out.markDups.bam -o WaGa_EV_RNA_226.bw --binSize 10 --scaleFactor  1.901282     --effectiveGenomeSize 2864785220

  bamCoverage --bam ../markDuplicates/MKL1_EV_RNAAligned.sortedByCoord.out.markDups.bam -o MKL1_EV_RNA.bw --binSize 10 --scaleFactor 0.6469583       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_EV_RNA_2Aligned.sortedByCoord.out.markDups.bam -o MKL1_EV_RNA_2.bw --binSize 10 --scaleFactor 0.6877478       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_EV_RNA_27Aligned.sortedByCoord.out.markDups.bam -o MKL1_EV_RNA_27.bw --binSize 10 --scaleFactor 2.462424       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_EV_RNA_87Aligned.sortedByCoord.out.markDups.bam -o MKL1_EV_RNA_87.bw --binSize 10 --scaleFactor 1.411154       --effectiveGenomeSize 2864785220
  bamCoverage --bam ../markDuplicates/MKL1_EV_RNA_118Aligned.sortedByCoord.out.markDups.bam -o MKL1_EV_RNA_118.bw --binSize 10 --scaleFactor 1.544239       --effectiveGenomeSize 2864785220

  setwd("degenes")
  #---- * to untreated and wildtype ----
  dds$condition <- relevel(dds$condition, "MKL1.RNA")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("MKL1.EV_vs_MKL1.RNA")

  dds$condition <- relevel(dds$condition, "MKL1.EV")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("MKL1.sT.DMSO_vs_MKL1.EV", "MKL1.sT.Dox_vs_MKL1.EV", "MKL1.scr.DMSO_vs_MKL1.EV", "MKL1.scr.Dox_vs_MKL1.EV")

  #For internal comparisons between sT.DMSO, sT.Dox, scr.DMSO and scr.Dox see the chapter "Do separate shRNA and treatment analysis"

  dds$condition <- relevel(dds$condition, "WaGa.RNA")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("WaGa.EV_vs_WaGa.RNA")

  dds$condition <- relevel(dds$condition, "WaGa.EV")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("WaGa.sT.DMSO_vs_WaGa.EV", "WaGa.sT.Dox_vs_WaGa.EV", "WaGa.scr.DMSO_vs_WaGa.EV", "WaGa.scr.Dox_vs_WaGa.EV")

  #For internal comparisons between sT.DMSO, sT.Dox, scr.DMSO and scr.Dox see the chapter "Do separate shRNA and treatment analysis"

  dds$condition <- relevel(dds$condition, "MKL1.RNA")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("WaGa.RNA_vs_MKL1.RNA")

  dds$condition <- relevel(dds$condition, "MKL1.EV")
  dds = DESeq(dds, betaPrior=FALSE)
  resultsNames(dds)
  clist <- c("WaGa.EV_vs_MKL1.EV")

  ##https://bioconductor.statistik.tu-dortmund.de/packages/3.7/data/annotation/
  #BiocManager::install("EnsDb.Mmusculus.v79")
  #library(EnsDb.Mmusculus.v79)
  #edb <- EnsDb.Mmusculus.v79
  #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
  #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
  library(biomaRt)
  listEnsembl()
  listMarts()
  #ensembl <- useEnsembl(biomart = "genes", mirror="asia")  # default is Mouse strains 104
  #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", mirror = "www")
  #ensembl = useMart("ensembl_mart_44", dataset="hsapiens_gene_ensembl",archive=TRUE, mysql=TRUE)
  #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="104")
  #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="86")
  #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="GRCh37")
  #--> total 69, 27  GRCh38.p7 and 39  GRCm38.p4
  #DEBUG: use R version 4.3.3, the version 104 is not loadable, using version 112 instead!
  #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
  #Error in bmRequest(request = request, httr_config = httr_config, verbose = verbose) :
  #  Internal Server Error (HTTP 500).
  #--> total 202   80                         GRCh38.p13         107                            GRCm39
  #80           hsapiens_gene_ensembl                                      Human genes (GRCh38.p13)                         GRCh38.p13
  #107         mmusculus_gene_ensembl                                        Mouse genes (GRCm39)                            GRCm39
  ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="112")
  #80                         GRCh38.p14
  #107                            GRCm39
  datasets <- listDatasets(ensembl)

  > listEnsemblArchives()
              name     date                                url version
  1  Ensembl GRCh37 Feb 2014          http://grch37.ensembl.org  GRCh37  *
  2     Ensembl 104 May 2021 http://may2021.archive.ensembl.org     104  *
  3     Ensembl 103 Feb 2021 http://feb2021.archive.ensembl.org     103
  4     Ensembl 102 Nov 2020 http://nov2020.archive.ensembl.org     102
  attributes = listAttributes(ensembl)
  attributes[1:25,]

  #library("dplyr")
  for (i in clist) {
  #i<-clist[1]
    contrast = paste("condition", i, sep="_")
    res = results(dds, name=contrast)
    res <- res[!is.na(res$log2FoldChange),]
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
    # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
    #geness <- geness[!duplicated(geness$GENEID), ]

    #using getBM replacing AnnotationDbi::select
    #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
    geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = rownames(res),
        mart = ensembl)
    geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)

    #merge by column by common colunmn name, in the case "GENEID"
    res$ENSEMBL = rownames(res)
    identical(rownames(res), rownames(geness_uniq))
    res_df <- as.data.frame(res)
    geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
    dim(geness_res)
    rownames(geness_res) <- geness_res$ensembl_gene_id
    geness_res$ensembl_gene_id <- NULL
    write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
    up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
    down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
    write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
    write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
  }

8, do separate shRNA and treatment analysis

  #In DESeq2, resultsNames(dds) gives you the names of the coefficients in your linear model that DESeq2 has fit to the count data. Each name corresponds to a comparison between levels of #your factors (or the interaction between them) that DESeq2 can test for differential expression. Here's what each of these results names likely means based on typical DESeq2 output:
  #  * "Intercept": This represents the base level of expression estimated by the model. If your factors are shRNA and treatment, and if the reference levels for these factors are scr and DMSO, respectively, then the "Intercept" represents the expression level for the reference group, which in this case is cells treated with scr and DMSO.
  #  * "shRNA_sT_vs_scr": This represents the comparison between the sT and scr levels of the shRNA factor. This means it represents the log2 fold change in expression from the scr condition to the sT condition, while keeping the treatment factor constant (at its reference level, likely DMSO).
  #  * "treatment_Dox_vs_DMSO": This represents the comparison between the Dox and DMSO levels of the treatment factor. This means it represents the log2 fold change in expression from the DMSO condition to the Dox condition, while keeping the shRNA factor constant (at its reference level, likely scr).
  #  * "shRNAsT.treatmentDox": This is the interaction term between the sT level of shRNA and the Dox level of treatment. This term tests whether the effect of the treatment (Dox vs. DMSO) is different in the sT condition compared to the scr condition. In other words, it tests whether the difference in expression due to the treatment (Dox) is different when the cells are treated with sT compared to when they are treated with scr.
  #If you have an interaction term in your design, you should interpret main effects (shRNA_sT_vs_scr and treatment_Dox_vs_DMSO) cautiously because the main effects are evaluated at the reference level of the other factor, and their simple interpretation is complicated by the presence of interaction. Interaction term (shRNAsT.treatmentDox) can indicate if there's a specific combination of shRNA and treatment that has a differential expression different from what would be expected based on the individual effects of shRNA and treatment.
  #当你运行 DESeq2 并使用 resultsNames(dds) 函数时，你将得到一系列的结果名称，这些名称代表了模型中的比较。在你给出的例子中：
  #  * "Intercept"：截距，代表模型的基线水平。在生物统计学中，这通常指没有进行任何处理的条件。
  #  * "shRNA_sT_vs_scr"：这代表 shRNA 的 sT 条件与 scr 条件的比较。如果你的实验设计中有两种不同的 shRNA 处理（sT 和 scr），这个比较会告诉你，与 scr 相比，sT 处理导致的基因表达量变化。
  #  * "treatment_Dox_vs_DMSO"：这代表了处理条件 Dox 与 DMSO 的比较。这会显示在 DMSO（常作为对照组）和 Dox（可能是某种药物或处理）之间的差异。
  #  * "shRNAsT.treatmentDox"：这是一个交互项，表示 shRNA 的 sT 处理和 Dox 处理的结合效应。这意味着它比较的是在 sT shRNA 影响下，Dox 对基因表达的影响与在 scr shRNA 影响下，Dox 对基因表达的影响之间的差异。
  #简而言之，这些结果名称代表你的实验中不同条件或处理组合的比较，这有助于你了解不同实验条件下的基因表达如何变化。

  d_MKL1 <- d[, !(grepl("parental cell|wt|WaGa", names(d)))]
  cell.line = as.factor(c("MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1"))
  shRNA = as.factor(c("sT","sT","scr","scr","sT","sT","scr","scr"))
  treatment = as.factor(c("DMSO","DMSO","DMSO","DMSO","Dox","Dox","Dox","Dox"))
  cData = data.frame(row.names=colnames(d_MKL1), shRNA=shRNA, treatment=treatment, cell.line=cell.line)
  dds_MKL1<-DESeqDataSetFromMatrix(countData=d_MKL1, colData=cData, design=~shRNA+treatment+shRNA:treatment)

  #png("pca_shRNA_treatment.png", 1200, 800)
  #plotPCA(rld, intgroup = c("shRNA", "treatment"))
  #dev.off()
  #shRNA type (scr or sT) and the treatment (DMSO or Dox),
  #design = ~ shRNA + treatment + shRNA:treatment
  #DESeq2Res <- results(dds_MKL1,  contrast = c("treatment","Dox","DMSO")) # scr_Dox vs scr_DMSO, effect of Dox in scr. "
  #DESeq2Res = results(dds_MKL1, contrast = list( c("treatment_Dox_vs_DMSO","shRNAsT.treatmentDMSO") )) # sT_Dox vs sT_DMSO, effect of Dox in sT"
  #DESeq2Res = results(dds_MKL1, name="shRNAsT.treatmentDMSO") # difference between sT and src, effect of Dox different in sT vs scr.

  #dds_MKL1$treatment <- factor(dds_MKL1$treatment)
  #dds_MKL1$treatment <- relevel(dds_MKL1$treatment, ref = "DMSO")

  dds_MKL1 = DESeq(dds_MKL1, betaPrior=FALSE)
  resultsNames(dds_MKL1)
  contrasts <- c("shRNA_sT_vs_scr", "treatment_Dox_vs_DMSO", "shRNAsT.treatmentDox")

  #library("dplyr")
  for (contrast in contrasts) {
    res = results(dds_MKL1, name=contrast)
    res <- res[!is.na(res$log2FoldChange),]
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
    # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
    #geness <- geness[!duplicated(geness$GENEID), ]

    #using getBM replacing AnnotationDbi::select
    #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
    geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = rownames(res),
        mart = ensembl)
    geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)

    #merge by column by common colunmn name, in the case "GENEID"
    res$ENSEMBL = rownames(res)
    identical(rownames(res), rownames(geness_uniq))
    res_df <- as.data.frame(res)
    geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
    dim(geness_res)
    rownames(geness_res) <- geness_res$ensembl_gene_id
    geness_res$ensembl_gene_id <- NULL
    write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
    up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
    down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
    write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
    write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))
  }

  mv shRNA_sT_vs_scr-all.txt MKL1_shRNA_sT_vs_scr-all.txt
  mv shRNA_sT_vs_scr-up.txt MKL1_shRNA_sT_vs_scr-up.txt
  mv shRNA_sT_vs_scr-down.txt MKL1_shRNA_sT_vs_scr-down.txt
  mv treatment_Dox_vs_DMSO-all.txt MKL1_treatment_Dox_vs_DMSO-all.txt
  mv treatment_Dox_vs_DMSO-up.txt MKL1_treatment_Dox_vs_DMSO-up.txt
  mv treatment_Dox_vs_DMSO-down.txt MKL1_treatment_Dox_vs_DMSO-down.txt
  mv shRNAsT.treatmentDox-all.txt MKL1_shRNAsT.treatmentDox-all.txt
  mv shRNAsT.treatmentDox-up.txt MKL1_shRNAsT.treatmentDox-up.txt
  mv shRNAsT.treatmentDox-down.txt MKL1_shRNAsT.treatmentDox-down.txt

  d_WaGa <- d[, !(grepl("parental cell|wt|MKL", names(d)))]
  cell.line = as.factor(c("WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
  shRNA = as.factor(c("sT","sT","sT","scr","scr","scr","sT","sT","sT","scr","scr","scr"))
  treatment = as.factor(c("DMSO","DMSO","DMSO","DMSO","DMSO","DMSO","Dox","Dox","Dox","Dox","Dox","Dox"))
  cData = data.frame(row.names=colnames(d_WaGa), shRNA=shRNA, treatment=treatment, cell.line=cell.line)
  dds_WaGa<-DESeqDataSetFromMatrix(countData=d_WaGa, colData=cData, design=~shRNA+treatment+shRNA:treatment)

  dds_WaGa = DESeq(dds_WaGa, betaPrior=FALSE)
  resultsNames(dds_WaGa)
  contrasts <- c("shRNA_sT_vs_scr", "treatment_Dox_vs_DMSO", "shRNAsT.treatmentDox")

  #library("dplyr")
  for (contrast in contrasts) {
    res = results(dds_WaGa, name=contrast)
    res <- res[!is.na(res$log2FoldChange),]
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
    # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
    #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
    #geness <- geness[!duplicated(geness$GENEID), ]

    #using getBM replacing AnnotationDbi::select
    #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
    geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = rownames(res),
        mart = ensembl)
    geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)

    #merge by column by common colunmn name, in the case "GENEID"
    res$ENSEMBL = rownames(res)
    identical(rownames(res), rownames(geness_uniq))
    res_df <- as.data.frame(res)
    geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
    dim(geness_res)
    rownames(geness_res) <- geness_res$ensembl_gene_id
    geness_res$ensembl_gene_id <- NULL
    write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(contrast, "all.txt", sep="-"))
    up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
    down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
    write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(contrast, "up.txt", sep="-"))
    write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(contrast, "down.txt", sep="-"))
  }

  mv shRNA_sT_vs_scr-all.txt WaGa_shRNA_sT_vs_scr-all.txt
  mv shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-up.txt
  mv shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-down.txt
  mv treatment_Dox_vs_DMSO-all.txt WaGa_treatment_Dox_vs_DMSO-all.txt
  mv treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-up.txt
  mv treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-down.txt
  mv shRNAsT.treatmentDox-all.txt WaGa_shRNAsT.treatmentDox-all.txt
  mv shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-up.txt
  mv shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-down.txt

  ~/Tools/csv2xls-0.4/csv_to_xls.py MKL1_shRNA_sT_vs_scr-up.txt MKL1_shRNA_sT_vs_scr-down.txt MKL1_shRNA_sT_vs_scr-all.txt -d$',' -o MKL1_shRNA_sT_vs_scr.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py MKL1_treatment_Dox_vs_DMSO-up.txt MKL1_treatment_Dox_vs_DMSO-down.txt MKL1_treatment_Dox_vs_DMSO-all.txt -d$',' -o MKL1_treatment_Dox_vs_DMSO.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py MKL1_shRNAsT.treatmentDox-up.txt MKL1_shRNAsT.treatmentDox-down.txt MKL1_shRNAsT.treatmentDox-all.txt -d$',' -o MKL1_shRNAsT.treatmentDox.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-all.txt -d$',' -o WaGa_shRNA_sT_vs_scr.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-all.txt -d$',' -o WaGa_treatment_Dox_vs_DMSO.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-all.txt -d$',' -o WaGa_shRNAsT.treatmentDox.xls

9, volcano plots with automatically finding top_g

  #A canonical visualization for interpreting differential gene expression results is the volcano plot.
  library(ggrepel)

  #"EV.RNA_vs_RNA"  "scr.DMSO_vs_EV.RNA" "sT.DMSO_vs_EV.RNA" "scr.Dox_vs_EV.RNA" "sT.Dox_vs_EV.RNA"  "sT.Dox_vs_scr.Dox"  "sT.DMSO_vs_scr.DMSO"
  geness_res <- read.csv(file = paste("WaGa.EV_vs_WaGa.RNA", "all.txt", sep="-"), row.names=1)
  geness_res$Color <- "NS or log2FC < 2.0"
  geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
  geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
  geness_res$Color[geness_res$padj < 0.001] <- "P-adj < 0.001"
  geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
  geness_res$Color <- factor(geness_res$Color,
                          levels = c("NS or log2FC < 2.0", "P < 0.05",
                                    "P-adj < 0.05", "P-adj < 0.001"))

  geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
  top_g <- c()
  top_g <- c(top_g, geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:200]], geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:200]])
  top_g <- unique(top_g)
  geness_res <- geness_res[, -1*ncol(geness_res)]  #remove invert_P from matrix

  png("WaGa.EV_vs_WaGa.RNA.png",width=1400, height=1000)
  ggplot(geness_res,
        aes(x = log2FoldChange, y = -log10(pvalue),
            color = Color, label = external_gene_name)) +
    geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +
    geom_hline(yintercept = -log10(0.05), lty = "dashed") +
    geom_point() +
    labs(x = "log2(FC)",
        y = "Significance, -log10(P)",
        color = "Significance") +
    scale_color_manual(values = c(`P-adj < 0.001` = "dodgerblue",
                                  `P-adj < 0.05` = "lightblue",
                                  `P < 0.05` = "orange2",
                                  `NS or log2FC < 2.0` = "gray"),
                      guide = guide_legend(override.aes = list(size = 4))) +
    scale_y_continuous(expand = expansion(mult = c(0,0.05))) +
    geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)),
                    size = 4, point.padding = 0.15, color = "black",
                    min.segment.length = .1, box.padding = .2, lwd = 2) +
    theme_bw(base_size = 16) +
    theme(legend.position = "bottom")
  dev.off()

  #sed -i -e 's/Color/Category/g' *_Category.csv

  #x <- data.frame(k1 = c(NA,3,4,5,2), k2 = c(1,NA,4,5,2), data = 6:10)
  #merge(x, y, by = "k2")

  for cmp in "EV.RNA_vs_RNA"  "scr.DMSO_vs_EV.RNA" "sT.DMSO_vs_EV.RNA" "scr.Dox_vs_EV.RNA" "sT.Dox_vs_EV.RNA"  "sT.Dox_vs_scr.Dox"  "sT.DMSO_vs_scr.DMSO"; do
    echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${cmp}-all.txt ${cmp}-up.txt ${cmp}-down.txt -d$',' -o ${cmp}.xls"
  done
  ~/Tools/csv2xls-0.4/csv_to_xls.py EV.RNA_vs_RNA-all.txt EV.RNA_vs_RNA-up.txt EV.RNA_vs_RNA-down.txt -d$',' -o EV.RNA_vs_RNA.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py scr.DMSO_vs_EV.RNA-all.txt scr.DMSO_vs_EV.RNA-up.txt scr.DMSO_vs_EV.RNA-down.txt -d$',' -o scr.DMSO_vs_EV.RNA.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py sT.DMSO_vs_EV.RNA-all.txt sT.DMSO_vs_EV.RNA-up.txt sT.DMSO_vs_EV.RNA-down.txt -d$',' -o sT.DMSO_vs_EV.RNA.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py scr.Dox_vs_EV.RNA-all.txt scr.Dox_vs_EV.RNA-up.txt scr.Dox_vs_EV.RNA-down.txt -d$',' -o scr.Dox_vs_EV.RNA.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py sT.Dox_vs_EV.RNA-all.txt sT.Dox_vs_EV.RNA-up.txt sT.Dox_vs_EV.RNA-down.txt -d$',' -o sT.Dox_vs_EV.RNA.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py sT.Dox_vs_scr.Dox-all.txt sT.Dox_vs_scr.Dox-up.txt sT.Dox_vs_scr.Dox-down.txt -d$',' -o sT.Dox_vs_scr.Dox.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py sT.DMSO_vs_scr.DMSO-all.txt sT.DMSO_vs_scr.DMSO-up.txt sT.DMSO_vs_scr.DMSO-down.txt -d$',' -o sT.DMSO_vs_scr.DMSO.xls
  #WaGa.EV_vs_WaGa.RNA-all.txt
  ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa.EV_vs_WaGa.RNA-up.txt WaGa.EV_vs_WaGa.RNA-down.txt -d$',' -o WaGa.EV_vs_WaGa.RNA.xls

10, clustering the genes and draw heatmap

  install.packages("gplots")
  library("gplots")

  #Option3: as paper described, A heatmap showing expression values of all DEGs which are significant between any pair conditions.
  all_genes <- c(rownames(mock_sT_d8_vs_mock_sT_d3_sig),rownames(sT_d3_vs_mock_sT_d3_sig),rownames(sT_d8_vs_mock_sT_d8_sig),rownames(sT_d8_vs_sT_d3_sig))     #873
  all_genes <- unique(all_genes)   #663
  #all_genes2 <- c(rownames(WAC_vs_mock_sig),rownames(WAP_vs_mock_sig),rownames(WAC_vs_WAP_sig))   #3917
  #all_genes2 <- unique(all_genes2)   #2608
  #intersected_genes <- intersect(all_genes, all_genes2)  # 2608
  #RNASeq.NoCellLine <- read.csv(file ="gene_expression_keeping_condition.txt", row.names=1)
  RNASeq.NoCellLine_  <- RNASeq.NoCellLine[all_genes,]
  write.csv(as.data.frame(RNASeq.NoCellLine_), file ="gene_expression_keeping_condition.txt")

  RNASeq.NoCellLine_ <- cbind(RNASeq.NoCellLine_, mock_sT_d3 = rowMeans(RNASeq.NoCellLine_[, 1:2]))
  RNASeq.NoCellLine_ <- cbind(RNASeq.NoCellLine_, mock_sT_d8 = rowMeans(RNASeq.NoCellLine_[, 3:4]))
  RNASeq.NoCellLine_ <- cbind(RNASeq.NoCellLine_, sT_d3 = rowMeans(RNASeq.NoCellLine_[, 5:6]))
  RNASeq.NoCellLine_ <- cbind(RNASeq.NoCellLine_, sT_d8 = rowMeans(RNASeq.NoCellLine_[, 7:8]))
  RNASeq.NoCellLine_ <- RNASeq.NoCellLine_[,c(-1:-8)]        #663x4
  #RNASeq.NoCellLine__ <- read.csv(file ="gene_expression_keeping_condition.txt", row.names=1)
  write.csv(as.data.frame(RNASeq.NoCellLine_), file ="gene_expression_merging_condition.txt")

  cut -d',' -f1-1 ./WaGa.EV_vs_WaGa.RNA-up.txt > WaGa.EV_vs_WaGa.RNA-up.id
  cut -d',' -f1-1 ./WaGa.sT.DMSO_vs_WaGa.EV-up.txt > WaGa.sT.DMSO_vs_WaGa.EV-up.id
  cut -d',' -f1-1 ./WaGa.sT.Dox_vs_WaGa.EV-up.txt > WaGa.sT.Dox_vs_WaGa.EV-up.id
  cut -d',' -f1-1 ./WaGa.scr.DMSO_vs_WaGa.EV-up.txt > WaGa.scr.DMSO_vs_WaGa.EV-up.id
  cut -d',' -f1-1 ./WaGa.scr.Dox_vs_WaGa.EV-up.txt > WaGa.scr.Dox_vs_WaGa.EV-up.id
  cut -d',' -f1-1 ./WaGa_shRNA_sT_vs_scr-up.txt > WaGa_shRNA_sT_vs_scr-up.id
  cut -d',' -f1-1 ./WaGa_treatment_Dox_vs_DMSO-up.txt > WaGa_treatment_Dox_vs_DMSO-up.id
  cut -d',' -f1-1 ./WaGa_shRNAsT.treatmentDox-up.txt > WaGa_shRNAsT.treatmentDox-up.id

  cut -d',' -f1-1 ./WaGa.EV_vs_WaGa.RNA-down.txt > WaGa.EV_vs_WaGa.RNA-down.id
  cut -d',' -f1-1 ./WaGa.sT.DMSO_vs_WaGa.EV-down.txt > WaGa.sT.DMSO_vs_WaGa.EV-down.id
  cut -d',' -f1-1 ./WaGa.sT.Dox_vs_WaGa.EV-down.txt > WaGa.sT.Dox_vs_WaGa.EV-down.id
  cut -d',' -f1-1 ./WaGa.scr.DMSO_vs_WaGa.EV-down.txt > WaGa.scr.DMSO_vs_WaGa.EV-down.id
  cut -d',' -f1-1 ./WaGa.scr.Dox_vs_WaGa.EV-down.txt > WaGa.scr.Dox_vs_WaGa.EV-down.id
  cut -d',' -f1-1 ./WaGa_shRNA_sT_vs_scr-down.txt > WaGa_shRNA_sT_vs_scr-down.id
  cut -d',' -f1-1 ./WaGa_treatment_Dox_vs_DMSO-down.txt > WaGa_treatment_Dox_vs_DMSO-down.id
  cut -d',' -f1-1 ./WaGa_shRNAsT.treatmentDox-down.txt > WaGa_shRNAsT.treatmentDox-down.id

  cut -d',' -f1-1 ./MKL1.EV_vs_MKL1.RNA-up.txt > MKL1.EV_vs_MKL1.RNA-up.id
  cut -d',' -f1-1 ./MKL1.sT.DMSO_vs_MKL1.EV-up.txt > MKL1.sT.DMSO_vs_MKL1.EV-up.id
  cut -d',' -f1-1 ./MKL1.sT.Dox_vs_MKL1.EV-up.txt > MKL1.sT.Dox_vs_MKL1.EV-up.id
  cut -d',' -f1-1 ./MKL1.scr.DMSO_vs_MKL1.EV-up.txt > MKL1.scr.DMSO_vs_MKL1.EV-up.id
  cut -d',' -f1-1 ./MKL1.scr.Dox_vs_MKL1.EV-up.txt > MKL1.scr.Dox_vs_MKL1.EV-up.id
  cut -d',' -f1-1 ./MKL1_shRNA_sT_vs_scr-up.txt > MKL1_shRNA_sT_vs_scr-up.id
  cut -d',' -f1-1 ./MKL1_treatment_Dox_vs_DMSO-up.txt > MKL1_treatment_Dox_vs_DMSO-up.id
  cut -d',' -f1-1 ./MKL1_shRNAsT.treatmentDox-up.txt > MKL1_shRNAsT.treatmentDox-up.id

  cut -d',' -f1-1 ./MKL1.EV_vs_MKL1.RNA-down.txt > MKL1.EV_vs_MKL1.RNA-down.id
  cut -d',' -f1-1 ./MKL1.sT.DMSO_vs_MKL1.EV-down.txt > MKL1.sT.DMSO_vs_MKL1.EV-down.id
  cut -d',' -f1-1 ./MKL1.sT.Dox_vs_MKL1.EV-down.txt > MKL1.sT.Dox_vs_MKL1.EV-down.id
  cut -d',' -f1-1 ./MKL1.scr.DMSO_vs_MKL1.EV-down.txt > MKL1.scr.DMSO_vs_MKL1.EV-down.id
  cut -d',' -f1-1 ./MKL1.scr.Dox_vs_MKL1.EV-down.txt > MKL1.scr.Dox_vs_MKL1.EV-down.id
  cut -d',' -f1-1 ./MKL1_shRNA_sT_vs_scr-down.txt > MKL1_shRNA_sT_vs_scr-down.id
  cut -d',' -f1-1 ./MKL1_treatment_Dox_vs_DMSO-down.txt > MKL1_treatment_Dox_vs_DMSO-down.id
  cut -d',' -f1-1 ./MKL1_shRNAsT.treatmentDox-down.txt > MKL1_shRNAsT.treatmentDox-down.id

  cut -d',' -f1-1 WaGa.RNA_vs_MKL1.RNA-up.txt > WaGa.RNA_vs_MKL1.RNA-up.id
  cut -d',' -f1-1 WaGa.RNA_vs_MKL1.RNA-down.txt > WaGa.RNA_vs_MKL1.RNA-down.id
  cut -d',' -f1-1 WaGa.EV_vs_MKL1.EV-up.txt > WaGa.EV_vs_MKL1.EV-up.id
  cut -d',' -f1-1 WaGa.EV_vs_MKL1.EV-down.txt > WaGa.EV_vs_MKL1.EV-down.id

  cat MKL1.EV_vs_MKL1.RNA-down.id MKL1.EV_vs_MKL1.RNA-up.id MKL1.scr.DMSO_vs_MKL1.EV-down.id MKL1.scr.DMSO_vs_MKL1.EV-up.id MKL1.scr.Dox_vs_MKL1.EV-down.id MKL1.scr.Dox_vs_MKL1.EV-up.id MKL1_shRNAsT.treatmentDox-down.id MKL1_shRNAsT.treatmentDox-up.id MKL1_shRNA_sT_vs_scr-down.id MKL1_shRNA_sT_vs_scr-up.id MKL1.sT.DMSO_vs_MKL1.EV-down.id MKL1.sT.DMSO_vs_MKL1.EV-up.id MKL1.sT.Dox_vs_MKL1.EV-down.id MKL1.sT.Dox_vs_MKL1.EV-up.id MKL1_treatment_Dox_vs_DMSO-down.id MKL1_treatment_Dox_vs_DMSO-up.id | sort -u > MKL1.ids
  cat WaGa.EV_vs_WaGa.RNA-down.id WaGa.EV_vs_WaGa.RNA-up.id WaGa.scr.DMSO_vs_WaGa.EV-down.id WaGa.scr.DMSO_vs_WaGa.EV-up.id WaGa.scr.Dox_vs_WaGa.EV-down.id WaGa.scr.Dox_vs_WaGa.EV-up.id WaGa_shRNAsT.treatmentDox-down.id WaGa_shRNAsT.treatmentDox-up.id WaGa_shRNA_sT_vs_scr-down.id WaGa_shRNA_sT_vs_scr-up.id WaGa.sT.DMSO_vs_WaGa.EV-down.id WaGa.sT.DMSO_vs_WaGa.EV-up.id WaGa.sT.Dox_vs_WaGa.EV-down.id WaGa.sT.Dox_vs_WaGa.EV-up.id WaGa_treatment_Dox_vs_DMSO-down.id WaGa_treatment_Dox_vs_DMSO-up.id | sort -u > WaGa.ids
  #WaGa.RNA_vs_MKL1.RNA-up.id WaGa.RNA_vs_MKL1.RNA-down.id WaGa.EV_vs_MKL1.EV-up.id WaGa.EV_vs_MKL1.EV-down.id | sort -u > ids

  # 28186 (new) vs 28193
  cat MKL1.EV_vs_MKL1.RNA-down.id MKL1.EV_vs_MKL1.RNA-up.id MKL1.scr.DMSO_vs_MKL1.EV-down.id MKL1.scr.DMSO_vs_MKL1.EV-up.id MKL1.scr.Dox_vs_MKL1.EV-down.id MKL1.scr.Dox_vs_MKL1.EV-up.id MKL1_shRNAsT.treatmentDox-down.id MKL1_shRNAsT.treatmentDox-up.id MKL1_shRNA_sT_vs_scr-down.id MKL1_shRNA_sT_vs_scr-up.id MKL1.sT.DMSO_vs_MKL1.EV-down.id MKL1.sT.DMSO_vs_MKL1.EV-up.id MKL1.sT.Dox_vs_MKL1.EV-down.id MKL1.sT.Dox_vs_MKL1.EV-up.id MKL1_treatment_Dox_vs_DMSO-down.id MKL1_treatment_Dox_vs_DMSO-up.id WaGa.EV_vs_WaGa.RNA-down.id WaGa.EV_vs_WaGa.RNA-up.id WaGa.scr.DMSO_vs_WaGa.EV-down.id WaGa.scr.DMSO_vs_WaGa.EV-up.id WaGa.scr.Dox_vs_WaGa.EV-down.id WaGa.scr.Dox_vs_WaGa.EV-up.id WaGa_shRNAsT.treatmentDox-down.id WaGa_shRNAsT.treatmentDox-up.id WaGa_shRNA_sT_vs_scr-down.id WaGa_shRNA_sT_vs_scr-up.id WaGa.sT.DMSO_vs_WaGa.EV-down.id WaGa.sT.DMSO_vs_WaGa.EV-up.id WaGa.sT.Dox_vs_WaGa.EV-down.id WaGa.sT.Dox_vs_WaGa.EV-up.id WaGa_treatment_Dox_vs_DMSO-down.id WaGa_treatment_Dox_vs_DMSO-up.id WaGa.RNA_vs_MKL1.RNA-up.id WaGa.RNA_vs_MKL1.RNA-down.id WaGa.EV_vs_MKL1.EV-up.id WaGa.EV_vs_MKL1.EV-down.id | sort -u > ids

  # ---- Draw DEGs_heatmap for parental_cell_RNA and wt_EV_RNA samples ----
  #-- MKL-1.RNA and MKL-1.EV vs WaGa.RNA and WaGa.EV --
  #add Gene_Id in the first line.
  GOI <- read.csv("ids")$Gene_Id
  RNASeq.NoCellLine <- assay(rld)

  #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
  datamat = RNASeq.NoCellLine[GOI, ]
  hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
  hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
  mycl = cutree(hr, h=max(hr$height)/1.05)
  mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");

  mycol = mycol[as.vector(mycl)]
  sampleCols <- rep('GREY',ncol(datamat))
  names(sampleCols) <- c("WaGa_RNA","WaGa_RNA_118","WaGa_RNA_147",  "MKL1_RNA","MKL1_RNA_118","MKL1_RNA_147",  "WaGa_EV_RNA","WaGa_EV_RNA_2","WaGa_EV_RNA_118","WaGa_EV_RNA_147","WaGa_EV_RNA_226",  "MKL1_EV_RNA","MKL1_EV_RNA_2","MKL1_EV_RNA_27","MKL1_EV_RNA_87","MKL1_EV_RNA_118")
  #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'

  sampleCols["WaGa_RNA"] <- 'DARKBLUE'
  sampleCols["WaGa_RNA_118"] <- 'DARKBLUE'
  sampleCols["WaGa_RNA_147"] <- 'DARKBLUE'

  sampleCols["MKL1_RNA"] <- 'DARKRED'
  sampleCols["MKL1_RNA_118"] <- 'DARKRED'
  sampleCols["MKL1_RNA_147"] <- 'DARKRED'

  sampleCols["WaGa_EV_RNA"] <- 'DARKORANGE'
  sampleCols["WaGa_EV_RNA_2"] <- 'DARKORANGE'
  sampleCols["WaGa_EV_RNA_118"] <- 'DARKORANGE'
  sampleCols["WaGa_EV_RNA_147"] <- 'DARKORANGE'
  sampleCols["WaGa_EV_RNA_226"] <- 'DARKORANGE'

  sampleCols["MKL1_EV_RNA"] <- 'DARKGREEN'
  sampleCols["MKL1_EV_RNA_2"] <- 'DARKGREEN'
  sampleCols["MKL1_EV_RNA_27"] <- 'DARKGREEN'
  sampleCols["MKL1_EV_RNA_87"] <- 'DARKGREEN'
  sampleCols["MKL1_EV_RNA_118"] <- 'DARKGREEN'

  png("DEGs_heatmap.png", width=1000, height=1200)
  heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
              scale='row',trace='none',col=bluered(75),
              RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=45, lwid=c(1,7), lhei = c(1, 8))
  legend("top", title = "",legend=c("WaGa_RNA","MKL1_RNA","WaGa_EV_RNA","MKL1_EV_RNA"), fill=c("DARKBLUE","DARKRED","DARKORANGE","DARKGREEN"), cex=0.8, box.lty=0)
  dev.off()

  # ---- Draw DEGs_heatmap for the MKL-1 samples ----
  #add Gene_Id in the first line.
  setwd("degenes")
  #BiocManager::install("biomaRt")
  #install.packages("gplots")
  #install.packages("writexl")  # Install the writexl package
  library(writexl)             # Load the writexl package
  library(gplots)
  library(biomaRt)
  library(dplyr)
  GOI <- read.csv("MKL1.ids")$Gene_Id
  RNASeq.NoCellLine <- assay(rld)

  #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
  datamat_ = RNASeq.NoCellLine[GOI, c("MKL-1 parental cell RNA", "MKL-1 parental cell RNA 118", "MKL-1 parental cell RNA 147", "MKL-1 wt EV RNA", "MKL-1 wt EV RNA 2", "MKL-1 wt EV RNA 118", "MKL-1 wt EV RNA 87", "MKL-1 wt EV RNA 27", "MKL-1 wt EV RNA 042", "MKL-1 sT DMSO EV RNA 042", "MKL-1 sT DMSO EV RNA 0505", "MKL-1 scr DMSO EV RNA 042", "MKL-1 scr DMSO EV RNA 0505", "MKL-1 sT Dox EV RNA 042", "MKL-1 sT Dox EV RNA 0505", "MKL-1 scr Dox EV RNA 042", "MKL-1 scr Dox EV RNA 0505")]
  new_order <- c("MKL-1 parental cell RNA", "MKL-1 parental cell RNA 118", "MKL-1 parental cell RNA 147", "MKL-1 wt EV RNA", "MKL-1 wt EV RNA 2", "MKL-1 wt EV RNA 118", "MKL-1 wt EV RNA 87", "MKL-1 wt EV RNA 27", "MKL-1 wt EV RNA 042", "MKL-1 sT DMSO EV RNA 042", "MKL-1 sT DMSO EV RNA 0505", "MKL-1 sT Dox EV RNA 042", "MKL-1 sT Dox EV RNA 0505", "MKL-1 scr DMSO EV RNA 042", "MKL-1 scr DMSO EV RNA 0505", "MKL-1 scr Dox EV RNA 042", "MKL-1 scr Dox EV RNA 0505")
  datamat <- datamat_[, new_order]
  hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
  hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
  mycl = cutree(hr, h=max(hr$height)/1.05)
  mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
  mycol = mycol[as.vector(mycl)]
  sampleCols <- rep('GREY',ncol(datamat))
  names(sampleCols) <- c("MKL-1 parental cell RNA", "MKL-1 parental cell RNA 118", "MKL-1 parental cell RNA 147", "MKL-1 wt EV RNA", "MKL-1 wt EV RNA 2", "MKL-1 wt EV RNA 118", "MKL-1 wt EV RNA 87", "MKL-1 wt EV RNA 27", "MKL-1 wt EV RNA 042", "MKL-1 sT DMSO EV RNA 042", "MKL-1 sT DMSO EV RNA 0505", "MKL-1 sT Dox EV RNA 042", "MKL-1 sT Dox EV RNA 0505", "MKL-1 scr DMSO EV RNA 042", "MKL-1 scr DMSO EV RNA 0505", "MKL-1 scr Dox EV RNA 042", "MKL-1 scr Dox EV RNA 0505")
  #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'

  sampleCols["MKL-1 parental cell RNA"] <- 'WHITE'
  sampleCols["MKL-1 parental cell RNA 118"] <- 'WHITE'
  sampleCols["MKL-1 parental cell RNA 147"] <- 'WHITE'

  sampleCols["MKL-1 wt EV RNA"] <- 'GREY'
  sampleCols["MKL-1 wt EV RNA 2"] <- 'GREY'
  sampleCols["MKL-1 wt EV RNA 118"] <- 'GREY'
  sampleCols["MKL-1 wt EV RNA 87"] <- 'GREY'
  sampleCols["MKL-1 wt EV RNA 27"] <- 'GREY'
  sampleCols["MKL-1 wt EV RNA 042"] <- 'GREY'

  #"salmon", "lightcoral", or "mistyrose"
  sampleCols["MKL-1 sT DMSO EV RNA 042"] <- 'mistyrose'
  sampleCols["MKL-1 sT DMSO EV RNA 0505"] <- 'mistyrose'
  sampleCols["MKL-1 sT Dox EV RNA 042"] <- 'RED'
  sampleCols["MKL-1 sT Dox EV RNA 0505"] <- 'RED'

  sampleCols["MKL-1 scr DMSO EV RNA 042"] <- 'LIGHTGREEN'
  sampleCols["MKL-1 scr DMSO EV RNA 0505"] <- 'LIGHTGREEN'
  sampleCols["MKL-1 scr Dox EV RNA 042"] <- 'GREEN'
  sampleCols["MKL-1 scr Dox EV RNA 0505"] <- 'GREEN'

  #legend("right", title = "",legend=c("MKL-1 parental cell RNA","MKL-1 wt EV RNA","MKL-1 sT DMSO EV RNA","MKL-1 sT Dox EV RNA","MKL-1 scr DMSO EV RNA","MKL-1 scr Dox EV RNA"), fill=c("WHITE", "GREY", "salmon","RED","LICHTGREEN","GREEN"), cex=0.8, box.lty=0)
  png("DEGs_heatmap_MKL-1.png", width=1000, height=1200)
  heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
              scale='row',trace='none',col=bluered(75),
              RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=30, lwid=c(1,7), lhei = c(1, 8))
  dev.off()
  #rsync -a -P /home/jhuang/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results/featureCounts/degenes/DEGs_heatmap_MKL-1.png ./

  # ---- cluster members ----
  subset_1<-names(subset(mycl, mycl == '1'))
  subset_1_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = subset_1,
        mart = ensembl)
  subset_1_uniq <- distinct(subset_1_, ensembl_gene_id, .keep_all= TRUE)
  subset_1_expr  <- datamat[subset_1,]
  subset_1_expr <- as.data.frame(subset_1_expr)
  subset_1_expr$ENSEMBL = rownames(subset_1_expr)
  cluster1_YELLOW <- merge(subset_1_uniq, subset_1_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
  #write.csv(cluster1_YELLOW,file='cluster1_YELLOW.txt')

  subset_2<-names(subset(mycl, mycl == '2'))
  subset_2_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = subset_2,
        mart = ensembl)
  subset_2_uniq <- distinct(subset_2_, ensembl_gene_id, .keep_all= TRUE)
  subset_2_expr  <- datamat[subset_2,]
  subset_2_expr <- as.data.frame(subset_2_expr)
  subset_2_expr$ENSEMBL = rownames(subset_2_expr)
  cluster2_DARKBLUE <- merge(subset_2_uniq, subset_2_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
  #write.csv(cluster2_DARKBLUE,file='cluster2_DARKBLUE.txt')

  write_xlsx(list(
    "Cluster 1 YELLOW" = cluster1_YELLOW,
    "Cluster 2 DARKBLUE" = cluster2_DARKBLUE
  ), "DEGs_heatmap_data_MKL-1.xlsx")

  #write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
  #write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
  #~/Tools/csv2xls-0.4/csv_to_xls.py \
  #cluster1_YELLOW.txt \
  #cluster2_DARKBLUE.txt \
  #-d$',' -o gene_culsters.xls;

  #TODO: save datamat as DEGs_heatmap_data.txt
  #datamat = read.csv(file="DEGs_heatmap_data.txt", sep="\t", row.names=1)
  #~/Tools/csv2xls-0.4/csv_to_xls.py \
  #DEGs_heatmap_data.txt \
  #-d',' -o DEGs_heatmap_data.xls;

  #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o genelist_clusters_MKL-1.xls

  # ---- Draw DEGs_heatmap for the WaGa samples ----
  #add Gene_Id in the first line.
  GOI <- read.csv("WaGa.ids")$Gene_Id
  RNASeq.NoCellLine <- assay(rld)

  #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
  datamat_ = RNASeq.NoCellLine[GOI, c("WaGa parental cell RNA", "WaGa parental cell RNA 118", "WaGa parental cell RNA 147", "WaGa wt EV RNA", "WaGa wt EV RNA 2", "WaGa wt EV RNA 118", "WaGa wt EV RNA 147", "WaGa wt EV RNA 226", "WaGa wt EV RNA 1107", "WaGa wt EV RNA 1605", "WaGa wt EV RNA 2706", "WaGa sT DMSO EV RNA 1107", "WaGa sT DMSO EV RNA 1605", "WaGa sT DMSO EV RNA 2706", "WaGa scr DMSO EV RNA 1107", "WaGa scr DMSO EV RNA 1605", "WaGa scr DMSO EV RNA 2706", "WaGa sT Dox EV RNA 1107", "WaGa sT Dox EV RNA 1605", "WaGa sT Dox EV RNA 2706", "WaGa scr Dox EV RNA 1107", "WaGa scr Dox EV RNA 1605", "WaGa scr Dox EV RNA 2706")]
  new_order <- c("WaGa parental cell RNA", "WaGa parental cell RNA 118", "WaGa parental cell RNA 147", "WaGa wt EV RNA", "WaGa wt EV RNA 2", "WaGa wt EV RNA 118", "WaGa wt EV RNA 147", "WaGa wt EV RNA 226", "WaGa wt EV RNA 1107", "WaGa wt EV RNA 1605", "WaGa wt EV RNA 2706", "WaGa sT DMSO EV RNA 1107", "WaGa sT DMSO EV RNA 1605", "WaGa sT DMSO EV RNA 2706", "WaGa sT Dox EV RNA 1107", "WaGa sT Dox EV RNA 1605", "WaGa sT Dox EV RNA 2706", "WaGa scr DMSO EV RNA 1107", "WaGa scr DMSO EV RNA 1605", "WaGa scr DMSO EV RNA 2706", "WaGa scr Dox EV RNA 1107", "WaGa scr Dox EV RNA 1605", "WaGa scr Dox EV RNA 2706")
  datamat <- datamat_[, new_order]
  hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
  hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
  mycl = cutree(hr, h=max(hr$height)/1.05)
  mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");

  mycol = mycol[as.vector(mycl)]
  sampleCols <- rep('GREY',ncol(datamat))
  names(sampleCols) <- c("WaGa parental cell RNA", "WaGa parental cell RNA 118", "WaGa parental cell RNA 147", "WaGa wt EV RNA", "WaGa wt EV RNA 2", "WaGa wt EV RNA 118", "WaGa wt EV RNA 147", "WaGa wt EV RNA 226", "WaGa wt EV RNA 1107", "WaGa wt EV RNA 1605", "WaGa wt EV RNA 2706", "WaGa sT DMSO EV RNA 1107", "WaGa sT DMSO EV RNA 1605", "WaGa sT DMSO EV RNA 2706", "WaGa sT Dox EV RNA 1107", "WaGa sT Dox EV RNA 1605", "WaGa sT Dox EV RNA 2706", "WaGa scr DMSO EV RNA 1107", "WaGa scr DMSO EV RNA 1605", "WaGa scr DMSO EV RNA 2706", "WaGa scr Dox EV RNA 1107", "WaGa scr Dox EV RNA 1605", "WaGa scr Dox EV RNA 2706")
  #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'

  sampleCols["WaGa parental cell RNA"] <- 'WHITE'
  sampleCols["WaGa parental cell RNA 118"] <- 'WHITE'
  sampleCols["WaGa parental cell RNA 147"] <- 'WHITE'

  sampleCols["WaGa wt EV RNA"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 2"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 118"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 147"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 226"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 1107"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 1605"] <- '#A9A9A9'
  sampleCols["WaGa wt EV RNA 2706"] <- '#A9A9A9'
  #fb9a99
  sampleCols["WaGa sT DMSO EV RNA 1107"] <- 'mistyrose'
  sampleCols["WaGa sT DMSO EV RNA 1605"] <- 'mistyrose'
  sampleCols["WaGa sT DMSO EV RNA 2706"] <- 'mistyrose'
  sampleCols["WaGa sT Dox EV RNA 1107"] <- '#e31a1c'
  sampleCols["WaGa sT Dox EV RNA 1605"] <- '#e31a1c'
  sampleCols["WaGa sT Dox EV RNA 2706"] <- '#e31a1c'

  sampleCols["WaGa scr DMSO EV RNA 1107"] <- '#b2df8a'
  sampleCols["WaGa scr DMSO EV RNA 1605"] <- '#b2df8a'
  sampleCols["WaGa scr DMSO EV RNA 2706"] <- '#b2df8a'
  sampleCols["WaGa scr Dox EV RNA 1107"] <- '#33a02c'
  sampleCols["WaGa scr Dox EV RNA 1605"] <- '#33a02c'
  sampleCols["WaGa scr Dox EV RNA 2706"] <- '#33a02c'

  #legend("right", title = "",legend=c("WaGa parental cell RNA","WaGa wt EV RNA","WaGa sT DMSO EV RNA","WaGa sT Dox EV RNA","WaGa scr DMSO EV RNA","WaGa scr Dox EV RNA"), fill=c("WHITE", "GREY", "salmon","RED","LICHTGREEN","GREEN"), cex=0.8, box.lty=0)
  png("DEGs_heatmap_WaGa.png", width=1000, height=1200)
  heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
              scale='row',trace='none',col=bluered(75),
              RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=45, lwid=c(1,7), lhei = c(1, 8))
  dev.off()

  # ---- cluster members ----
  subset_1<-names(subset(mycl, mycl == '1'))
  subset_1_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = subset_1,
        mart = ensembl)
  subset_1_uniq <- distinct(subset_1_, ensembl_gene_id, .keep_all= TRUE)
  subset_1_expr  <- datamat[subset_1,]
  subset_1_expr <- as.data.frame(subset_1_expr)
  subset_1_expr$ENSEMBL = rownames(subset_1_expr)
  cluster1_YELLOW <- merge(subset_1_uniq, subset_1_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
  #write.csv(cluster1_YELLOW,file='cluster1_YELLOW.txt')

  subset_2<-names(subset(mycl, mycl == '2'))
  subset_2_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
        filters = 'ensembl_gene_id',
        values = subset_2,
        mart = ensembl)
  subset_2_uniq <- distinct(subset_2_, ensembl_gene_id, .keep_all= TRUE)
  subset_2_expr  <- datamat[subset_2,]
  subset_2_expr <- as.data.frame(subset_2_expr)
  subset_2_expr$ENSEMBL = rownames(subset_2_expr)
  cluster2_DARKBLUE <- merge(subset_2_uniq, subset_2_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
  #write.csv(cluster2_DARKBLUE,file='cluster2_DARKBLUE.txt')

  write_xlsx(list(
    "Cluster 1 YELLOW" = cluster1_YELLOW,
    "Cluster 2 DARKBLUE" = cluster2_DARKBLUE
  ), "DEGs_heatmap_data_WaGa.xlsx")

  #write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
  #write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
  #write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
  #write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
  #write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
  #~/Tools/csv2xls-0.4/csv_to_xls.py \
  #cluster1_YELLOW.txt \
  #cluster2_DARKBLUE.txt \
  #-d$',' -o gene_culsters.xls;
  #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o genelist_clusters_WaGa.xls

  # ---- Draw DEGs_heatmap for all samples ----
  #add Gene_Id in the first line.
  GOI <- read.csv("all.ids")$Gene_Id
  RNASeq.NoCellLine <- assay(rld)

  #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
  datamat = RNASeq.NoCellLine[GOI, c("MKL-1 RNA", "MKL-1 RNA 118", "MKL-1 RNA 147", "MKL-1 EV", "MKL-1 EV 2", "MKL-1 EV 118", "MKL-1 EV 87", "MKL-1 EV 27", "MKL-1 EV 042", "MKL-1 EV sT DMSO 042", "MKL-1 EV sT DMSO 0505", "MKL-1 EV scr DMSO 042", "MKL-1 EV scr DMSO 0505", "MKL-1 EV sT Dox 042", "MKL-1 EV sT Dox 0505", "MKL-1 EV scr Dox 042", "MKL-1 EV scr Dox 0505",    "WaGa RNA", "WaGa RNA 118", "WaGa RNA 147", "WaGa EV", "WaGa EV 2", "WaGa EV 118", "WaGa EV 147", "WaGa EV 226", "WaGa EV 1107", "WaGa EV 1605", "WaGa EV 2706", "WaGa EV sT DMSO 1107", "WaGa EV sT DMSO 1605", "WaGa EV sT DMSO 2706", "WaGa EV scr DMSO 1107", "WaGa EV scr DMSO 1605", "WaGa EV scr DMSO 2706", "WaGa EV sT Dox 1107", "WaGa EV sT Dox 1605", "WaGa EV sT Dox 2706", "WaGa EV scr Dox 1107", "WaGa EV scr Dox 1605", "WaGa EV scr Dox 2706")]
  hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
  hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
  mycl = cutree(hr, h=max(hr$height)/1.05)
  mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");

  mycol = mycol[as.vector(mycl)]
  sampleCols <- rep('GREY',ncol(datamat))
  names(sampleCols) <- c("MKL-1 RNA", "MKL-1 RNA 118", "MKL-1 RNA 147", "MKL-1 EV", "MKL-1 EV 2", "MKL-1 EV 118", "MKL-1 EV 87", "MKL-1 EV 27", "MKL-1 EV 042", "MKL-1 EV sT DMSO 042", "MKL-1 EV sT DMSO 0505", "MKL-1 EV scr DMSO 042", "MKL-1 EV scr DMSO 0505", "MKL-1 EV sT Dox 042", "MKL-1 EV sT Dox 0505", "MKL-1 EV scr Dox 042", "MKL-1 EV scr Dox 0505",    "WaGa RNA", "WaGa RNA 118", "WaGa RNA 147", "WaGa EV", "WaGa EV 2", "WaGa EV 118", "WaGa EV 147", "WaGa EV 226", "WaGa EV 1107", "WaGa EV 1605", "WaGa EV 2706", "WaGa EV sT DMSO 1107", "WaGa EV sT DMSO 1605", "WaGa EV sT DMSO 2706", "WaGa EV scr DMSO 1107", "WaGa EV scr DMSO 1605", "WaGa EV scr DMSO 2706", "WaGa EV sT Dox 1107", "WaGa EV sT Dox 1605", "WaGa EV sT Dox 2706", "WaGa EV scr Dox 1107", "WaGa EV scr Dox 1605", "WaGa EV scr Dox 2706")
  #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'

  sampleCols["MKL-1 RNA"] <- 'DARKBLUE'
  sampleCols["MKL-1 RNA 118"] <- 'DARKBLUE'
  sampleCols["MKL-1 RNA 147"] <- 'DARKBLUE'

  sampleCols["MKL-1 EV"] <- 'DARKRED'
  sampleCols["MKL-1 EV 2"] <- 'DARKRED'
  sampleCols["MKL-1 EV 118"] <- 'DARKRED'
  sampleCols["MKL-1 EV 87"] <- 'DARKRED'
  sampleCols["MKL-1 EV 27"] <- 'DARKRED'
  sampleCols["MKL-1 EV 042"] <- 'DARKRED'

  sampleCols["MKL-1 EV sT DMSO 042"] <- 'DARKORANGE'
  sampleCols["MKL-1 EV sT DMSO 0505"] <- 'DARKORANGE'
  sampleCols["MKL-1 EV scr DMSO 042"] <- 'DARKORANGE'
  sampleCols["MKL-1 EV scr DMSO 0505"] <- 'DARKORANGE'

  sampleCols["MKL-1 EV sT Dox 042"] <- 'DARKGREEN'
  sampleCols["MKL-1 EV sT Dox 0505"] <- 'DARKGREEN'
  sampleCols["MKL-1 EV scr Dox 042"] <- 'DARKGREEN'
  sampleCols["MKL-1 EV scr Dox 0505"] <- 'DARKGREEN'

  sampleCols["WaGa RNA"] <- 'BLUE'
  sampleCols["WaGa RNA 118"] <- 'BLUE'
  sampleCols["WaGa RNA 147"] <- 'BLUE'

  sampleCols["WaGa EV"] <- 'RED'
  sampleCols["WaGa EV 2"] <- 'RED'
  sampleCols["WaGa EV 118"] <- 'RED'
  sampleCols["WaGa EV 147"] <- 'RED'
  sampleCols["WaGa EV 226"] <- 'RED'
  sampleCols["WaGa EV 1107"] <- 'RED'
  sampleCols["WaGa EV 1605"] <- 'RED'
  sampleCols["WaGa EV 2706"] <- 'RED'

  sampleCols["WaGa EV sT DMSO 1107"] <- 'ORANGE'
  sampleCols["WaGa EV sT DMSO 1605"] <- 'ORANGE'
  sampleCols["WaGa EV sT DMSO 2706"] <- 'ORANGE'
  sampleCols["WaGa EV scr DMSO 1107"] <- 'ORANGE'
  sampleCols["WaGa EV scr DMSO 1605"] <- 'ORANGE'
  sampleCols["WaGa EV scr DMSO 2706"] <- 'ORANGE'

  sampleCols["WaGa EV sT Dox 1107"] <- 'GREEN'
  sampleCols["WaGa EV sT Dox 1605"] <- 'GREEN'
  sampleCols["WaGa EV sT Dox 2706"] <- 'GREEN'
  sampleCols["WaGa EV scr Dox 1107"] <- 'GREEN'
  sampleCols["WaGa EV scr Dox 1605"] <- 'GREEN'
  sampleCols["WaGa EV scr Dox 2706"] <- 'GREEN'

  #legend("right", title = "",legend=c("WaGa parental cell RNA","WaGa wt EV RNA","WaGa sT DMSO EV RNA","WaGa sT Dox EV RNA","WaGa scr DMSO EV RNA","WaGa scr Dox EV RNA"), fill=c("WHITE", "GREY", "salmon","RED","LICHTGREEN","GREEN"), cex=0.8, box.lty=0)
  png("DEGs_heatmap_all.png", width=1000, height=1200)
  heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
              scale='row',trace='none',col=bluered(75),
              RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=45, lwid=c(1,7), lhei = c(1, 8))
  dev.off()

  # ---- cluster members ----
  #c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
  write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
  write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt')
  write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')
  write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt')
  write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')
  #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o genelist_clusters_all.xls

11, pathways

  mkdir pathways

  #--continue from BREAK POINT--
  ##
  #source("https://bioconductor.org/biocLite.R")
  #biocLite("AnnotationDbi")
  library("clusterProfiler")
  library("ReactomePA")
  setwd("~/DATA/Data_Anastasia_RNASeq/results/featureCounts/pathways")

  for sample in WaGa_RNA_vs_MKL1_RNA MKL1_EV_RNA_vs_MKL1_RNA WaGa_EV_RNA_vs_MKL1_EV_RNA WaGa_EV_RNA_vs_WaGa_RNA; do \
  echo "${sample}_up <- read.csv('../degenes/${sample}-up.txt', row.names=1)"
  echo "${sample}_up_KEGG <- enrichKEGG(${sample}_up\$entrezgene_id)"
  echo "write.table(as.data.frame(${sample}_up_KEGG), file = '${sample}_up_KEGG.txt', sep = '\t', row.names = FALSE)"
  echo "${sample}_down <- read.csv('../degenes/${sample}-down.txt', row.names=1)"
  echo "${sample}_down_KEGG <- enrichKEGG(${sample}_down\$entrezgene_id)"
  echo "write.table(as.data.frame(${sample}_down_KEGG), file = '${sample}_down_KEGG.txt', sep = '\t', row.names = FALSE)"
  echo "${sample}_sig <- rbind(${sample}_up, ${sample}_down)"
  echo "${sample}_sig_KEGG <- enrichKEGG(${sample}_sig\$entrezgene_id)"
  echo "write.table(as.data.frame(${sample}_sig_KEGG), file = '${sample}_sig_KEGG.txt', sep = '\t', row.names = FALSE)"
  done

  WaGa_RNA_vs_MKL1_RNA_up <- read.csv('../degenes/WaGa_RNA_vs_MKL1_RNA-up.txt', row.names=1)
  WaGa_RNA_vs_MKL1_RNA_up_KEGG <- enrichKEGG(WaGa_RNA_vs_MKL1_RNA_up$entrezgene_id)
  write.table(as.data.frame(WaGa_RNA_vs_MKL1_RNA_up_KEGG), file = 'WaGa_RNA_vs_MKL1_RNA_up_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_RNA_vs_MKL1_RNA_down <- read.csv('../degenes/WaGa_RNA_vs_MKL1_RNA-down.txt', row.names=1)
  WaGa_RNA_vs_MKL1_RNA_down_KEGG <- enrichKEGG(WaGa_RNA_vs_MKL1_RNA_down$entrezgene_id)
  write.table(as.data.frame(WaGa_RNA_vs_MKL1_RNA_down_KEGG), file = 'WaGa_RNA_vs_MKL1_RNA_down_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_RNA_vs_MKL1_RNA_sig <- rbind(WaGa_RNA_vs_MKL1_RNA_up, WaGa_RNA_vs_MKL1_RNA_down)
  WaGa_RNA_vs_MKL1_RNA_sig_KEGG <- enrichKEGG(WaGa_RNA_vs_MKL1_RNA_sig$entrezgene_id)
  write.table(as.data.frame(WaGa_RNA_vs_MKL1_RNA_sig_KEGG), file = 'WaGa_RNA_vs_MKL1_RNA_sig_KEGG.txt', sep = '\t', row.names = FALSE)
  MKL1_EV_RNA_vs_MKL1_RNA_up <- read.csv('../degenes/MKL1_EV_RNA_vs_MKL1_RNA-up.txt', row.names=1)
  MKL1_EV_RNA_vs_MKL1_RNA_up_KEGG <- enrichKEGG(MKL1_EV_RNA_vs_MKL1_RNA_up$entrezgene_id)
  write.table(as.data.frame(MKL1_EV_RNA_vs_MKL1_RNA_up_KEGG), file = 'MKL1_EV_RNA_vs_MKL1_RNA_up_KEGG.txt', sep = '\t', row.names = FALSE)
  MKL1_EV_RNA_vs_MKL1_RNA_down <- read.csv('../degenes/MKL1_EV_RNA_vs_MKL1_RNA-down.txt', row.names=1)
  MKL1_EV_RNA_vs_MKL1_RNA_down_KEGG <- enrichKEGG(MKL1_EV_RNA_vs_MKL1_RNA_down$entrezgene_id)
  write.table(as.data.frame(MKL1_EV_RNA_vs_MKL1_RNA_down_KEGG), file = 'MKL1_EV_RNA_vs_MKL1_RNA_down_KEGG.txt', sep = '\t', row.names = FALSE)
  MKL1_EV_RNA_vs_MKL1_RNA_sig <- rbind(MKL1_EV_RNA_vs_MKL1_RNA_up, MKL1_EV_RNA_vs_MKL1_RNA_down)
  MKL1_EV_RNA_vs_MKL1_RNA_sig_KEGG <- enrichKEGG(MKL1_EV_RNA_vs_MKL1_RNA_sig$entrezgene_id)
  write.table(as.data.frame(MKL1_EV_RNA_vs_MKL1_RNA_sig_KEGG), file = 'MKL1_EV_RNA_vs_MKL1_RNA_sig_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_up <- read.csv('../degenes/WaGa_EV_RNA_vs_MKL1_EV_RNA-up.txt', row.names=1)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_up_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_MKL1_EV_RNA_up$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_MKL1_EV_RNA_up_KEGG), file = 'WaGa_EV_RNA_vs_MKL1_EV_RNA_up_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_down <- read.csv('../degenes/WaGa_EV_RNA_vs_MKL1_EV_RNA-down.txt', row.names=1)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_down_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_MKL1_EV_RNA_down$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_MKL1_EV_RNA_down_KEGG), file = 'WaGa_EV_RNA_vs_MKL1_EV_RNA_down_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_sig <- rbind(WaGa_EV_RNA_vs_MKL1_EV_RNA_up, WaGa_EV_RNA_vs_MKL1_EV_RNA_down)
  WaGa_EV_RNA_vs_MKL1_EV_RNA_sig_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_MKL1_EV_RNA_sig$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_MKL1_EV_RNA_sig_KEGG), file = 'WaGa_EV_RNA_vs_MKL1_EV_RNA_sig_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_WaGa_RNA_up <- read.csv('../degenes/WaGa_EV_RNA_vs_WaGa_RNA-up.txt', row.names=1)
  WaGa_EV_RNA_vs_WaGa_RNA_up_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_WaGa_RNA_up$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_WaGa_RNA_up_KEGG), file = 'WaGa_EV_RNA_vs_WaGa_RNA_up_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_WaGa_RNA_down <- read.csv('../degenes/WaGa_EV_RNA_vs_WaGa_RNA-down.txt', row.names=1)
  WaGa_EV_RNA_vs_WaGa_RNA_down_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_WaGa_RNA_down$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_WaGa_RNA_down_KEGG), file = 'WaGa_EV_RNA_vs_WaGa_RNA_down_KEGG.txt', sep = '\t', row.names = FALSE)
  WaGa_EV_RNA_vs_WaGa_RNA_sig <- rbind(WaGa_EV_RNA_vs_WaGa_RNA_up, WaGa_EV_RNA_vs_WaGa_RNA_down)
  WaGa_EV_RNA_vs_WaGa_RNA_sig_KEGG <- enrichKEGG(WaGa_EV_RNA_vs_WaGa_RNA_sig$entrezgene_id)
  write.table(as.data.frame(WaGa_EV_RNA_vs_WaGa_RNA_sig_KEGG), file = 'WaGa_EV_RNA_vs_WaGa_RNA_sig_KEGG.txt', sep = '\t', row.names = FALSE)

  png("pathways_KEGG.png",width=1260, height=1000)
  merged_list <- merge_result(list('WaGa_RNA vs MKL1_RNA'=WaGa_RNA_vs_MKL1_RNA_sig_KEGG, 'MKL1_EV_RNA vs MKL1_RNA'=MKL1_EV_RNA_vs_MKL1_RNA_sig_KEGG, 'WaGa_EV_RNA vs MKL1_EV_RNA'=WaGa_EV_RNA_vs_MKL1_EV_RNA_sig_KEGG, 'WaGa_EV_RNA vs WaGa_RNA'=WaGa_EV_RNA_vs_WaGa_RNA_sig_KEGG))
  dotplot(merged_list, showCategory=1000)    #, font.size=6, srt = 35
  dev.off()

  # under CONSOLE
  cd pathways_KEGG
  ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_RNA_vs_MKL1_RNA_sig_KEGG.txt MKL1_EV_RNA_vs_MKL1_RNA_sig_KEGG.txt WaGa_EV_RNA_vs_MKL1_EV_RNA_sig_KEGG.txt WaGa_EV_RNA_vs_WaGa_RNA_sig_KEGG.txt -d$'\t' -o pathways_KEGG.xls

12, GOs

  setwd("~/DATA/Data_Anastasia_RNASeq/results/featureCounts/GOs")
  for sample in IKK1_vs_Cre IKK2_vs_Cre Nemo_vs_Cre NIK_vs_Cre; do \
  echo "${sample}_up <- read.csv('../degenes/${sample}-up.txt', row.names=1)"
  echo "${sample}_up_GO_BP <- enrichGO(${sample}_up\$entrezgene_id, 'org.Mm.eg.db', ont='BP')"
  echo "${sample}_up_GO_MF <- enrichGO(${sample}_up\$entrezgene_id, 'org.Mm.eg.db', ont='MF')"
  echo "${sample}_up_GO_CC <- enrichGO(${sample}_up\$entrezgene_id, 'org.Mm.eg.db', ont='CC')"
  #echo "${sample}_up_GO_ALL <- enrichGO(${sample}_up\$entrezgene_id, 'org.Mm.eg.db', ont='ALL')"
  echo "write.table(as.data.frame(${sample}_up_GO_BP), file = '${sample}_up_GO_BP.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_up_GO_MF), file = '${sample}_up_GO_MF.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_up_GO_CC), file = '${sample}_up_GO_CC.txt', sep = '\t', row.names = FALSE)"
  #echo "write.table(as.data.frame(${sample}_up_GO_ALL), file = '${sample}_up_GO_ALL.txt', sep = '\t', row.names = FALSE)"
  echo "${sample}_down <- read.csv('../degenes/${sample}-down.txt', row.names=1)"
  echo "${sample}_down_GO_BP <- enrichGO(${sample}_down\$entrezgene_id, 'org.Mm.eg.db', ont='BP')"
  echo "${sample}_down_GO_MF <- enrichGO(${sample}_down\$entrezgene_id, 'org.Mm.eg.db', ont='MF')"
  echo "${sample}_down_GO_CC <- enrichGO(${sample}_down\$entrezgene_id, 'org.Mm.eg.db', ont='CC')"
  #echo "${sample}_down_GO_ALL <- enrichGO(${sample}_down\$entrezgene_id, 'org.Mm.eg.db', ont='ALL')"
  echo "write.table(as.data.frame(${sample}_down_GO_BP), file = '${sample}_down_GO_BP.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_down_GO_MF), file = '${sample}_down_GO_MF.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_down_GO_CC), file = '${sample}_down_GO_CC.txt', sep = '\t', row.names = FALSE)"
  #echo "write.table(as.data.frame(${sample}_down_GO_ALL), file = '${sample}_down_GO_ALL.txt', sep = '\t', row.names = FALSE)"
  echo "${sample}_sig <- rbind(${sample}_up, ${sample}_down)"
  echo "${sample}_sig_GO_BP <- enrichGO(${sample}_sig\$entrezgene_id, 'org.Mm.eg.db', ont='BP')"
  echo "${sample}_sig_GO_MF <- enrichGO(${sample}_sig\$entrezgene_id, 'org.Mm.eg.db', ont='MF')"
  echo "${sample}_sig_GO_CC <- enrichGO(${sample}_sig\$entrezgene_id, 'org.Mm.eg.db', ont='CC')"
  #echo "${sample}_sig_GO_ALL <- enrichGO(${sample}_sig\$entrezgene_id, 'org.Mm.eg.db', ont='ALL')"
  echo "write.table(as.data.frame(${sample}_sig_GO_BP), file = '${sample}_sig_GO_BP.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_sig_GO_MF), file = '${sample}_sig_GO_MF.txt', sep = '\t', row.names = FALSE)"
  echo "write.table(as.data.frame(${sample}_sig_GO_CC), file = '${sample}_sig_GO_CC.txt', sep = '\t', row.names = FALSE)"
  #echo "write.table(as.data.frame(${sample}_sig_GO_ALL), file = '${sample}_sig_GO_ALL.txt', sep = '\t', row.names = FALSE)"
  done

  png("GOs_BP.png",width=3000, height=24000)
  merged_list <- merge_result(list('IKK1_vs_Cre_up'=IKK1_vs_Cre_up_GO_BP, 'IKK1_vs_Cre_down'=IKK1_vs_Cre_down_GO_BP, 'IKK1_vs_Cre_sig'=IKK1_vs_Cre_sig_GO_BP, 'IKK2_vs_Cre_up'=IKK2_vs_Cre_up_GO_BP, 'IKK2_vs_Cre_down'=IKK2_vs_Cre_down_GO_BP, 'IKK2_vs_Cre_sig'=IKK2_vs_Cre_sig_GO_BP, 'Nemo_vs_Cre_up'= Nemo_vs_Cre_up_GO_BP, 'Nemo_vs_Cre_down'=Nemo_vs_Cre_down_GO_BP, 'Nemo_vs_Cre_sig'=Nemo_vs_Cre_sig_GO_BP, 'NIK_vs_Cre_up'=NIK_vs_Cre_up_GO_BP, 'NIK_vs_Cre_down'=NIK_vs_Cre_down_GO_BP, 'NIK_vs_Cre_sig'=NIK_vs_Cre_sig_GO_BP))
  dotplot(merged_list, showCategory=10000)
  dev.off()

  # under CONSOLE
  #
  cd GOs
  ~/Tools/csv2xls-0.4/csv_to_xls.py IKK1_vs_Cre_up_GO_BP.txt IKK1_vs_Cre_down_GO_BP.txt IKK1_vs_Cre_sig_GO_BP.txt  IKK2_vs_Cre_up_GO_BP.txt IKK2_vs_Cre_down_GO_BP.txt IKK2_vs_Cre_sig_GO_BP.txt  Nemo_vs_Cre_up_GO_BP.txt Nemo_vs_Cre_down_GO_BP.txt Nemo_vs_Cre_sig_GO_BP.txt  NIK_vs_Cre_up_GO_BP.txt NIK_vs_Cre_down_GO_BP.txt NIK_vs_Cre_sig_GO_BP.txt  -d$'\t' -o GOs_BP.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py IKK1_vs_Cre_up_GO_MF.txt IKK1_vs_Cre_down_GO_MF.txt IKK1_vs_Cre_sig_GO_MF.txt  IKK2_vs_Cre_up_GO_MF.txt IKK2_vs_Cre_down_GO_MF.txt IKK2_vs_Cre_sig_GO_MF.txt  Nemo_vs_Cre_up_GO_MF.txt Nemo_vs_Cre_down_GO_MF.txt Nemo_vs_Cre_sig_GO_MF.txt  NIK_vs_Cre_up_GO_MF.txt NIK_vs_Cre_down_GO_MF.txt NIK_vs_Cre_sig_GO_MF.txt  -d$'\t' -o GOs_MF.xls
  ~/Tools/csv2xls-0.4/csv_to_xls.py IKK1_vs_Cre_up_GO_CC.txt IKK1_vs_Cre_down_GO_CC.txt IKK1_vs_Cre_sig_GO_CC.txt  IKK2_vs_Cre_up_GO_CC.txt IKK2_vs_Cre_down_GO_CC.txt IKK2_vs_Cre_sig_GO_CC.txt  Nemo_vs_Cre_up_GO_CC.txt Nemo_vs_Cre_down_GO_CC.txt Nemo_vs_Cre_sig_GO_CC.txt  NIK_vs_Cre_up_GO_CC.txt NIK_vs_Cre_down_GO_CC.txt NIK_vs_Cre_sig_GO_CC.txt  -d$'\t' -o GOs_CC.xls

miRNA 2024 Ute from raw counts

Leave a reply

Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.
```
COMPSRA_MERGE_0_miRNA.txt *
COMPSRA_MERGE_0_piRNA.txt *
COMPSRA_MERGE_0_snRNA.txt *
COMPSRA_MERGE_0_tRNA.txt
COMPSRA_MERGE_0_snoRNA.txt
COMPSRA_MERGE_0_circRNA.txt
```
Input files for miRNA are two files: COMPSRA_MERGE_0_miRNA.txt and ids under “/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_7/miRNA_WaGa_re”
- COMPSRA_MERGE_0_miRNA.txt
  
  #./our_out_without_cutadapt/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt #./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt diff ./our_out_on_hg38/COMPSRA_MERGE_0_miRNA.txt ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt #–> different! #BUG: why using the same code resulting in different miRNA results! A little different!! #–> using ./our_out_on_hg38+JN707599_miRNA_merging_replicates/COMPSRA_MERGE_0_miRNA.txt
  
  cp miRNA_WaGa_on_hg38+JN707599_merging_replicates/COMPSRA_MERGE_0_miRNA.txt miRNA_WaGa_2024 cd miRNA_WaGa_2024
- prepare the file ids
  
  #cp ./miRNA_WaGa_on_hg38+JN707599_merging_replicates/ids miRNA_WaGa_2024 for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo “cut -d’,’ -f1-1 ${i}-up.txt > ${i}-up.id”; echo “cut -d’,’ -f1-1 ${i}-down.txt > ${i}-down.id”; done cat *.id | sort -u > ids #add Gene_Id in the first line, delete the “”

Draw plots with R using DESeq2 (under the ‘base’ env)

#BiocManager::install("AnnotationDbi")
#BiocManager::install("clusterProfiler")
#BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
#BiocManager::install("limma")
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)
library(limma)
# Check the current library paths
.libPaths()
#setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/our_out_on_hg38+JN707599_2024_corrected/")
setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/miRNA_WaGa_2024")

d.raw<- read.delim2("COMPSRA_MERGE_0_miRNA.txt",sep="\t", header=TRUE, row.names=1)
d.raw$X <- NULL
#colnames(d.raw)<- c("WaGa_EV", "MKL1_EV", "WaGa_wt", "MKL1_wt")
d.raw[] <- lapply(d.raw, as.numeric)

#MCPyV-M1   0   0   29  0   23  0   30  8   0   0   202 196
# New row to add
new_row <- data.frame(
  X0505_WaGa_sT_DMSO = 0,
  X1905_WaGa_sT_DMSO = 0,
  X0505_WaGa_sT_Dox = 29,
  X1905_WaGa_sT_Dox = 0,
  X0505_WaGa_scr_DMSO = 23,
  X1905_WaGa_scr_DMSO = 0,
  X0505_WaGa_scr_Dox = 30,
  X1905_WaGa_scr_Dox = 8,
  X0505_WaGa_wt = 0,
  X1905_WaGa_wt = 0,
  control_MKL1 = 202,
  control_WaGa = 196,
  row.names = "MCPyV-M1"
)

# Add the new row to the data frame
d.raw <- rbind(d.raw, new_row)

#cell_line = as.factor(c("WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "WaGa","WaGa", "MKL1","WaGa"))
#condition = as.factor(c("sT","sT", "sT","sT", "scr","scr", "scr","scr", "wt","wt", "control","control"))
#treatment = as.factor(c("DMSO","DMSO", "Dox","Dox", "DMSO","DMSO", "Dox","Dox", "wt","wt", "control","control"))

EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))

cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)

rld <- rlogTransformation(dds)

# -- before pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("donor"))
dev.off()

##### STEP3: prepare all_genes #####
rld <- rlogTransformation(dds)
mat <- assay(rld)
mm <- model.matrix(~replicates, colData(rld))
mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
assay(rld) <- mat
RNASeq.NoCellLine <- assay(rld)
# reorder the columns
colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]

#Option4: manully defining
#for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
#cat *.id | sort -u > ids
##add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id
datamat = RNASeq.NoCellLine[GOI, ]

##### STEP4: clustering the genes and draw heatmap #####
datamat <- datamat[,-1]  #delete the sample "control MKL1"
colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
colnames(datamat)[2] <- "WaGa wildtype 0505"
colnames(datamat)[3] <- "WaGa wildtype 1905"
colnames(datamat)[4] <- "WaGa sT DMSO 0505"
colnames(datamat)[5] <- "WaGa sT DMSO 1905"
colnames(datamat)[6] <- "WaGa sT Dox 0505"
colnames(datamat)[7] <- "WaGa sT Dox 1905"
colnames(datamat)[8] <- "WaGa scr DMSO 0505"
colnames(datamat)[9] <- "WaGa scr DMSO 1905"
colnames(datamat)[10] <- "WaGa scr Dox 0505"
colnames(datamat)[11] <- "WaGa scr Dox 1905"
write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
# In order to 100% reproduce the plot, load the data from the txt-file as datamat1.
datamat1 <- read.csv("/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_7/miRNA_WaGa_on_hg38+JN707599_keeping_replicates/gene_expression_keeping_replicates.txt")
colnames(datamat1) <- c("Gene", "WaGa control", "WaGa wildtype 0505", "WaGa wildtype 1905", "WaGa sT DMSO 0505", "WaGa sT DMSO 1905", "WaGa sT Dox 0505", "WaGa sT Dox 1905", "WaGa scr DMSO 0505", "WaGa scr DMSO 1905", "WaGa scr Dox 0505", "WaGa scr Dox 1905")
rownames(datamat1) <- datamat1$Gene
datamat1 <- datamat1[ , -1]

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat1), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat1, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/2.0)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

mycol = mycol[as.vector(mycl)]
png("miRNA_heatmap_keeping_replicates.png", width=800, height=1000)
#svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
heatmap.2(as.matrix(datamat1),
  Rowv=as.dendrogram(hr),
  Colv=NA,
  dendrogram='row',
  labRow="",
  scale='row',
  trace='none',
  col=bluered(75),
  RowSideColors=mycol,
  srtCol=20,
  lhei=c(1,8),
  #cexRow=1.2,   # Increase row label font size
  cexCol=1.7,    # Increase column label font size
  margin=c(7,1)
 )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt GREEN.txt CYAN.txt MAGENTA.txt LIGHTCYAN.txt GREY.txt RED.txt -d',' -o miRNA_heatmap_keeping_replicates.xls

# merging replicates
datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
datamat <- datamat[,c(-2:-11)]
write.csv(datamat, file ="gene_expression_merging_replicates.txt")
# In order to 100% reproduce the plot, load the data from the txt-file as datamat2.
datamat2 <- read.csv("/media/jhuang/Elements/Data_Ute/Data_Ute_smallRNA_7/miRNA_WaGa_on_hg38+JN707599_merging_replicates/gene_expression_merging_replicates.txt")
colnames(datamat2) <- c("Gene", "WaGa control", "WaGa wildtype", "WaGa sT DMSO", "WaGa sT Dox", "WaGa scr DMSO", "WaGa scr Dox")
rownames(datamat2) <- datamat2$Gene
datamat2 <- datamat2[ , -1]

# Now map your clusters to colors, making sure that there's one color for each row:
#ERROR: actualColors <- mycol[mycl]  # Assign colors based on cluster assignment

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat2), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat2, method="spearman")), method="complete")
mycl <- cutree(hr, h=max(hr$height)/2.2)
#mycl <- cutree(hr, h=max(hr$height)/5.0)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");
mycol = mycol[as.vector(mycl)]
# ERROR: Then use these 'actualColors' in your heatmap
#png("miRNA_heatmap_merging_replicates.png", width=800, height=1000)
#heatmap.2(as.matrix(datamat2),
#          Rowv=as.dendrogram(hr),
#          Colv=NA,
#          dendrogram='row',
#          labRow=rownames(datamat2),
#          scale='row',
#          trace='none',
#          col=bluered(75),
#          RowSideColors=mycol, #ERROR: RowSideColors=actualColors,
#          srtCol=20,
#          lhei=c(1,8),
#          cexCol=1.7,    # Increase column label font size
#          margin=c(7,1)
#        )
# Save the heatmap as a PNG file
png("miRNA_heatmap_merging_replicates.png", width=1400, height=4000)
# Generate the heatmap
heatmap.2(as.matrix(datamat2),
          Rowv=as.dendrogram(hr),
          Colv=NA,
          dendrogram='row',
          labRow=rownames(datamat2), # Add row names (gene names) to the heatmap
          scale='row',
          trace='none',
          col=bluered(75),
          RowSideColors=mycol, # Assign row colors based on clusters
          srtCol=20, # Adjust column label orientation for better visibility
          lhei=c(1, 16),
          cexCol=2.0, # Increase column label font size
          cexRow=1.6, # Adjust row label font size if needed
          margin=c(10, 15), # Adjust margins to prevent the error
          labCol=colnames(datamat2) # Add column names to the heatmap
        )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
#write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o miRNA_heatmap_merging_replicates.xls
#100+11+4+7+2+5+14+63=206

4, do separate shRNA and treatment analysis (input d.raw)

    d_WaGa <- d.raw[, !(grepl("control|wt|MKL1", names(d.raw)))]
    cell.line = as.factor(c("WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
    shRNA = as.factor(c("sT","sT","sT","sT","scr","scr","scr","scr"))
    treatment = as.factor(c("DMSO","DMSO","Dox","Dox","DMSO","DMSO","Dox","Dox"))
    cData = data.frame(row.names=colnames(d_WaGa), shRNA=shRNA, treatment=treatment, cell.line=cell.line)
    dds_WaGa<-DESeqDataSetFromMatrix(countData=d_WaGa, colData=cData, design=~shRNA+treatment+shRNA:treatment)
    dds_WaGa = DESeq(dds_WaGa, betaPrior=FALSE)
    resultsNames(dds_WaGa)
    contrasts <- c("shRNA_sT_vs_scr", "treatment_Dox_vs_DMSO", "shRNAsT.treatmentDox")

    for (i in contrasts) {
      res = results(dds_WaGa, name=i)
      res <- res[!is.na(res$log2FoldChange),]
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)

      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))

      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }

    mv shRNA_sT_vs_scr-all.txt WaGa_shRNA_sT_vs_scr-all.txt
    mv shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-up.txt
    mv shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-down.txt
    mv treatment_Dox_vs_DMSO-all.txt WaGa_treatment_Dox_vs_DMSO-all.txt
    mv treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-up.txt
    mv treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-down.txt
    mv shRNAsT.treatmentDox-all.txt WaGa_shRNAsT.treatmentDox-all.txt
    mv shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-up.txt
    mv shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-down.txt

    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-all.txt -d$',' -o WaGa_shRNA_sT_vs_scr.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-all.txt -d$',' -o WaGa_treatment_Dox_vs_DMSO.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-all.txt -d$',' -o WaGa_shRNAsT.treatmentDox.xls

5, display viral transcripts found in mRNA-seq (or small RNA) MKL-1, WaGa EVs compared to parental cells

    http://xgenes.com/article/article-content/87/display-viral-transcripts-found-in-mrna-seq-mkl-1-waga-evs-compared-to-cells/

piRNA 2024 Ute from raw counts

Leave a reply

Draw plots for piRNA generated by COMPSRA

Generate the following files according to STEPS 1-4 from http://xgenes.com/article/article-content/239/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, http://xgenes.com/article/article-content/232/small-rna-sequencing-processing-in-the-example-of-smallrna-7/, and http://xgenes.com/article/article-content/156/small-rna-processing/. For COMPSRA_MERGE_0_miRNA.txt, we also need STEP 5 to add the read numbers of MCPyV-M1.
```
COMPSRA_MERGE_0_miRNA.txt *
COMPSRA_MERGE_0_piRNA.txt *
COMPSRA_MERGE_0_snRNA.txt *
COMPSRA_MERGE_0_tRNA.txt
COMPSRA_MERGE_0_snoRNA.txt
COMPSRA_MERGE_0_circRNA.txt
```
Input files for piRNA are two files: COMPSRA_MERGE_0_piRNA.txt and ids
- COMPSRA_MERGE_0_piRNA.txt
  
  #The former are more precise due to the reads from virus will be mapped on the virus-genome diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_piRNA.txt diff ./our_out_on_hg38+JN707599/COMPSRA_MERGE_0_snRNA.txt ./our_out_on_hg38/COMPSRA_MERGE_0_snRNA.txt cp ../our_out_on_hg38+JN707599/COMPSRA_MERGE_0_piRNA.txt .
- prepare the file ids
  
  #see Option4: manully defining

Draw plots with R using DESeq2

#BiocManager::install("AnnotationDbi")
#BiocManager::install("clusterProfiler")
#BiocManager::install(c("ReactomePA","org.Hs.eg.db"))
#BiocManager::install("limma")
library("AnnotationDbi")
library("clusterProfiler")
library("ReactomePA")
library("org.Hs.eg.db")
library(DESeq2)
library(gplots)
library(limma)
# Check the current library paths
.libPaths()
setwd("/home/jhuang/DATA/Data_Ute/Data_Ute_smallRNA_7/piRNA_WaGa_2024")

d.raw<- read.delim2("COMPSRA_MERGE_0_piRNA.txt",sep="\t", header=TRUE, row.names=1)
d.raw$X <- NULL
d.raw[] <- lapply(d.raw, as.numeric)

EV_or_parental = as.factor(c("EV","EV", "EV","EV", "EV","EV", "EV","EV", "EV","EV", "parental","parental"))
donor = as.factor(c("0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905", "0505","1905"))
replicates = as.factor(c("sT_DMSO","sT_DMSO", "sT_Dox","sT_Dox", "scr_DMSO","scr_DMSO", "scr_Dox","scr_Dox", "wt","wt", "control","control"))
ids = as.factor(c("0505_WaGa_sT_DMSO","1905_WaGa_sT_DMSO","0505_WaGa_sT_Dox","1905_WaGa_sT_Dox","0505_WaGa_scr_DMSO","1905_WaGa_scr_DMSO","0505_WaGa_scr_Dox","1905_WaGa_scr_Dox","0505_WaGa_wt","1905_WaGa_wt","control_MKL1","control_WaGa"))

cData = data.frame(row.names=colnames(d.raw), replicates=replicates, ids=ids, donor=donor, EV_or_parental=EV_or_parental)
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)

rld <- rlogTransformation(dds)

# -- before pca --
png("pca.png", 1200, 800)
plotPCA(rld, intgroup=c("replicates"))
#plotPCA(rld, intgroup = c("replicates", "batch"))
#plotPCA(rld, intgroup = c("replicates", "ids"))
#plotPCA(rld, "batch")
dev.off()

png("pca2.png", 1200, 800)
plotPCA(rld, intgroup=c("donor"))
dev.off()

#### STEP2: DEGs ####
#convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
sizeFactors(dds)
#NULL
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
normalized_counts <- counts(dds, normalized=TRUE)
write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)

#---- * to untreated ----
dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~EV_or_parental+donor)
dds$EV_or_parental <- relevel(dds$EV_or_parental, "parental")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("EV_vs_parental")
for (i in clist) {
  contrast = paste("EV_or_parental", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}
#~/Tools/csv2xls-0.4/csv_to_xls.py EV_vs_parental-all.txt EV_vs_parental-up.txt EV_vs_parental-down.txt -d$',' -o EV_vs_parental.xls;

dds<-DESeqDataSetFromMatrix(countData=d.raw, colData=cData, design=~replicates+donor)
dds$replicates <- relevel(dds$replicates, "sT_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_sT_DMSO")

dds$replicates <- relevel(dds$replicates, "scr_Dox")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("sT_Dox_vs_scr_Dox")

dds$replicates <- relevel(dds$replicates, "scr_DMSO")
dds = DESeq(dds, betaPrior=FALSE)
resultsNames(dds)
clist <- c("scr_Dox_vs_scr_DMSO", "sT_DMSO_vs_scr_DMSO")

for (i in clist) {
  contrast = paste("replicates", i, sep="_")
  res = results(dds, name=contrast)
  res <- res[!is.na(res$log2FoldChange),]
  #https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-are-some-p-values-set-to-na
  res$padj <- ifelse(is.na(res$padj), 1, res$padj)
  res_df <- as.data.frame(res)
  write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))
  up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
  down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
  write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
  write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
}

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_sT_DMSO-all.txt \
sT_Dox_vs_sT_DMSO-up.txt \
sT_Dox_vs_sT_DMSO-down.txt \
-d$',' -o sT_Dox_vs_sT_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_Dox_vs_scr_Dox-all.txt \
sT_Dox_vs_scr_Dox-up.txt \
sT_Dox_vs_scr_Dox-down.txt \
-d$',' -o sT_Dox_vs_scr_Dox.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
scr_Dox_vs_scr_DMSO-all.txt \
scr_Dox_vs_scr_DMSO-up.txt \
scr_Dox_vs_scr_DMSO-down.txt \
-d$',' -o scr_Dox_vs_scr_DMSO.xls;

~/Tools/csv2xls-0.4/csv_to_xls.py \
sT_DMSO_vs_scr_DMSO-all.txt \
sT_DMSO_vs_scr_DMSO-up.txt \
sT_DMSO_vs_scr_DMSO-down.txt \
-d$',' -o sT_DMSO_vs_scr_DMSO.xls;

##### STEP3: prepare all_genes #####
rld <- rlogTransformation(dds)
mat <- assay(rld)
mm <- model.matrix(~replicates, colData(rld))
mat <- limma::removeBatchEffect(mat, batch=rld$donor, design=mm)
assay(rld) <- mat
RNASeq.NoCellLine <- assay(rld)
# reorder the columns
colnames(RNASeq.NoCellLine) = c("0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox","0505 WaGa wt","1905 WaGa wt","control MKL1","control WaGa")
col.order <-c("control MKL1",  "control WaGa","0505 WaGa wt","1905 WaGa wt","0505 WaGa sT DMSO","1905 WaGa sT DMSO","0505 WaGa sT Dox","1905 WaGa sT Dox","0505 WaGa scr DMSO","1905 WaGa scr DMSO","0505 WaGa scr Dox","1905 WaGa scr Dox")
RNASeq.NoCellLine <- RNASeq.NoCellLine[,col.order]

#Option4: manully defining
#for i in EV_vs_parental sT_Dox_vs_sT_DMSO sT_Dox_vs_scr_Dox scr_Dox_vs_scr_DMSO sT_DMSO_vs_scr_DMSO; do echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"; echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"; done
#cat *.id | sort -u > ids
##add Gene_Id in the first line, delete the ""
GOI <- read.csv("ids")$Gene_Id
datamat = RNASeq.NoCellLine[GOI, ]

##### STEP4: clustering the genes and draw heatmap #####
datamat <- datamat[,-1]  #delete the sample "control MKL1"
colnames(datamat)[1] <- "WaGa control"  #rename the isolate names according to the style of RNA-seq as follows?
colnames(datamat)[2] <- "WaGa wildtype 0505"
colnames(datamat)[3] <- "WaGa wildtype 1905"
colnames(datamat)[4] <- "WaGa sT DMSO 0505"
colnames(datamat)[5] <- "WaGa sT DMSO 1905"
colnames(datamat)[6] <- "WaGa sT Dox 0505"
colnames(datamat)[7] <- "WaGa sT Dox 1905"
colnames(datamat)[8] <- "WaGa scr DMSO 0505"
colnames(datamat)[9] <- "WaGa scr DMSO 1905"
colnames(datamat)[10] <- "WaGa scr Dox 0505"
colnames(datamat)[11] <- "WaGa scr Dox 1905"
write.csv(datamat, file ="gene_expression_keeping_replicates.txt")
# In order to 100% reproduce the plot, load the data from the txt-file as datamat1.

#"ward.D"’, ‘"ward.D2"’,‘"single"’, ‘"complete"’, ‘"average"’ (= UPGMA), ‘"mcquitty"’(= WPGMA), ‘"median"’ (= WPGMC) or ‘"centroid"’ (= UPGMC)
hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
mycl = cutree(hr, h=max(hr$height)/1.5)
mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

mycol = mycol[as.vector(mycl)]
png("piRNA_heatmap_keeping_replicates.png", width=800, height=1000)
#svg("DEGs_heatmap_keeping_replicates.svg", width=6, height=8)
heatmap.2(as.matrix(datamat),
  Rowv=as.dendrogram(hr),
  Colv=NA,
  dendrogram='row',
  labRow="",
  scale='row',
  trace='none',
  col=bluered(75),
  RowSideColors=mycol,
  srtCol=20,
  lhei=c(1,8),
  #cexRow=1.2,   # Increase row label font size
  cexCol=1.7,    # Increase column label font size
  margin=c(7,1)
 )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_keeping_replicates.txt YELLOW.txt ORANGE.txt BLUE.txt -d',' -o piRNA_heatmap_keeping_replicates.xls

mv piRNA_heatmap_keeping_replicates.png piRNA_heatmap.png
mv piRNA_heatmap_keeping_replicates.xls piRNA_heatmap.xls
mv pca.png piRNA_pca.png
mv EV_vs_parental.xls piRNA_EV_vs_parental.xls
mv sT_DMSO_vs_scr_DMSO.xls piRNA_sT_DMSO_vs_scr_DMSO.xls
mv sT_Dox_vs_scr_Dox.xls piRNA_sT_Dox_vs_scr_Dox.xls
mv sT_Dox_vs_sT_DMSO.xls piRNA_sT_Dox_vs_sT_DMSO.xls
mv scr_Dox_vs_scr_DMSO.xls piRNA_scr_Dox_vs_scr_DMSO.xls
# --> SENDING piRNA_*.png, piRNA_EV_vs_parental.xls, and piRNA_heatmap.xls

# ---- NOT WORKING WELL due to actualColors --> DEBUG see the README_miRNA_WaGa ERROR-commented-lines in the heatmap.2-block ----
# merging replicates
datamat <- cbind(datamat, "WaGa wildtype" = rowMeans(datamat[, 2:3]))
datamat <- cbind(datamat, "WaGa sT DMSO" = rowMeans(datamat[, 4:5]))
datamat <- cbind(datamat, "WaGa sT Dox" = rowMeans(datamat[, 6:7]))
datamat <- cbind(datamat, "WaGa scr DMSO" = rowMeans(datamat[, 8:9]))
datamat <- cbind(datamat, "WaGa scr Dox" = rowMeans(datamat[, 10:11]))
datamat <- datamat[,c(-2:-11)]
write.csv(datamat, file ="gene_expression_merging_replicates.txt")

# Ensure 'mycl' is calculated properly.
mycl <- cutree(hr, h=max(hr$height)/2.9)
# mycol = c("YELLOW", "BLUE", "ORANGE", "CYAN", "GREEN", "MAGENTA", "GREY", "LIGHTCYAN", "RED",     "PINK", "DARKORANGE", "MAROON",  "LIGHTGREEN", "DARKBLUE",  "DARKRED",   "LIGHTBLUE", "DARKCYAN",  "DARKGREEN", "DARKMAGENTA");

# Now map your clusters to colors, making sure that there's one color for each row:
 <- mycol[mycl]  # Assign colors based on cluster assignment

# Then use these 'actualColors' in your heatmap:
png("piRNA_heatmap_merging_replicates.png", width=800, height=1000)
heatmap.2(as.matrix(datamat),
          Rowv=as.dendrogram(hr),
          Colv=NA,
          dendrogram='row',
          labRow="",
          scale='row',
          trace='none',
          col=bluered(75),
          RowSideColors=actualColors, # Update this part
          srtCol=20,
          lhei=c(1,8),
          cexCol=1.7,    # Increase column label font size
          margin=c(7,1)
        )
dev.off()

#### cluster members #####
write.csv(names(subset(mycl, mycl == '1')),file='YELLOW.txt')
write.csv(names(subset(mycl, mycl == '2')),file='BLUE.txt')
write.csv(names(subset(mycl, mycl == '3')),file='ORANGE.txt')
write.csv(names(subset(mycl, mycl == '4')),file='CYAN.txt')
write.csv(names(subset(mycl, mycl == '5')),file='GREEN.txt')
write.csv(names(subset(mycl, mycl == '6')),file='MAGENTA.txt')
write.csv(names(subset(mycl, mycl == '7')),file='GREY.txt')
write.csv(names(subset(mycl, mycl == '8')),file='LIGHTCYAN.txt')
#write.csv(names(subset(mycl, mycl == '9')),file='RED.txt')
#~/Tools/csv2xls-0.4/csv_to_xls.py gene_expression_merging_replicates.txt YELLOW.txt BLUE.txt ORANGE.txt CYAN.txt MAGENTA.txt GREEN.txt LIGHTCYAN.txt GREY.txt -d',' -o piRNA_heatmap_merging_replicates.xls

4, do separate shRNA and treatment analysis (input d.raw)

    d_WaGa <- d.raw[, !(grepl("control|wt|MKL1", names(d.raw)))]
    cell.line = as.factor(c("WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
    shRNA = as.factor(c("sT","sT","sT","sT","scr","scr","scr","scr"))
    treatment = as.factor(c("DMSO","DMSO","Dox","Dox","DMSO","DMSO","Dox","Dox"))
    cData = data.frame(row.names=colnames(d_WaGa), shRNA=shRNA, treatment=treatment, cell.line=cell.line)
    dds_WaGa<-DESeqDataSetFromMatrix(countData=d_WaGa, colData=cData, design=~shRNA+treatment+shRNA:treatment)
    dds_WaGa = DESeq(dds_WaGa, betaPrior=FALSE)
    resultsNames(dds_WaGa)
    contrasts <- c("shRNA_sT_vs_scr", "treatment_Dox_vs_DMSO", "shRNAsT.treatmentDox")

    for (i in contrasts) {
      res = results(dds_WaGa, name=i)
      res <- res[!is.na(res$log2FoldChange),]
      res$padj <- ifelse(is.na(res$padj), 1, res$padj)

      res_df <- as.data.frame(res)
      write.csv(as.data.frame(res_df[order(res_df$pvalue),]), file = paste(i, "all.txt", sep="-"))

      up <- subset(res_df, padj<=0.1 & log2FoldChange>=2)
      down <- subset(res_df, padj<=0.1 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }

    mv shRNA_sT_vs_scr-all.txt WaGa_shRNA_sT_vs_scr-all.txt
    mv shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-up.txt
    mv shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-down.txt
    mv treatment_Dox_vs_DMSO-all.txt WaGa_treatment_Dox_vs_DMSO-all.txt
    mv treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-up.txt
    mv treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-down.txt
    mv shRNAsT.treatmentDox-all.txt WaGa_shRNAsT.treatmentDox-all.txt
    mv shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-up.txt
    mv shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-down.txt

    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNA_sT_vs_scr-up.txt WaGa_shRNA_sT_vs_scr-down.txt WaGa_shRNA_sT_vs_scr-all.txt -d$',' -o WaGa_shRNA_sT_vs_scr.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_treatment_Dox_vs_DMSO-up.txt WaGa_treatment_Dox_vs_DMSO-down.txt WaGa_treatment_Dox_vs_DMSO-all.txt -d$',' -o WaGa_treatment_Dox_vs_DMSO.xls
    ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_shRNAsT.treatmentDox-up.txt WaGa_shRNAsT.treatmentDox-down.txt WaGa_shRNAsT.treatmentDox-all.txt -d$',' -o WaGa_shRNAsT.treatmentDox.xls

Call and merge SNP and Indel results from using two variant calling methods

Leave a reply

call variant calling using snippy

mv snippy_HDRNA_01 snippy_CP133676

mv snippy_HDRNA_03 snippy_CP133677

cp -r snippy_HDRNA_06 snippy_CP133678
mv snippy_HDRNA_06 snippy_CP133679

cp -r snippy_HDRNA_07 snippy_CP133680
mv snippy_HDRNA_07 snippy_CP133681

cp -r snippy_HDRNA_08 snippy_CP133682
mv snippy_HDRNA_08 snippy_CP133683

cp -r snippy_HDRNA_12 snippy_CP133684
cp -r snippy_HDRNA_12 snippy_CP133685
cp -r snippy_HDRNA_12 snippy_CP133686
mv snippy_HDRNA_12 snippy_CP133687

cp -r snippy_HDRNA_16 snippy_CP133688
cp -r snippy_HDRNA_16 snippy_CP133689
cp -r snippy_HDRNA_16 snippy_CP133690
cp -r snippy_HDRNA_16 snippy_CP133691
mv snippy_HDRNA_16 snippy_CP133692

cp -r snippy_HDRNA_17 snippy_CP133693
cp -r snippy_HDRNA_17 snippy_CP133694
mv snippy_HDRNA_17 snippy_CP133695

cp -r snippy_HDRNA_19 snippy_CP133696
cp -r snippy_HDRNA_19 snippy_CP133697
cp -r snippy_HDRNA_19 snippy_CP133698
mv snippy_HDRNA_19 snippy_CP133699

cp -r snippy_HDRNA_20 snippy_CP133700
mv snippy_HDRNA_20 snippy_CP133701

    #git clone https://github.com/huang/bacto
    #mv bacto/* ./
    #rm -rf bacto

for sample in snippy_CP133678 snippy_CP133679 snippy_CP133680 snippy_CP133681 snippy_CP133682 snippy_CP133683 snippy_CP133684 snippy_CP133685 snippy_CP133686 snippy_CP133687 snippy_CP133688 snippy_CP133689 snippy_CP133690 snippy_CP133691 snippy_CP133692 snippy_CP133693 snippy_CP133694 snippy_CP133695 snippy_CP133696 snippy_CP133697 snippy_CP133698 snippy_CP133699 snippy_CP133700 snippy_CP133701; do
    cd ${sample};
    ln -s ~/Tools/bacto/db/ .;
    ln -s ~/Tools/bacto/envs/ .;
    ln -s ~/Tools/bacto/local/ .;
    cp ~/Tools/bacto/Snakefile .;
    cp ~/Tools/bacto/bacto-0.1.json .;
    cp ~/Tools/bacto/cluster.json .;
    cd ..;
done
#prepare raw_data and bacto-0.1.json
#set "reference": "db/CP133***.gb" in bacto-0.1.json

conda activate bengal3_ac3
#cd snippy_CP133676
#/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
#cd snippy_CP133677
#/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd snippy_CP133678
(bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133679
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133680
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133681
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133682
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

cd snippy_CP133683
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133684
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133685
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133686
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133687
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133688
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133689
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133690
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133691
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133692
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133693
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133694
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133695
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133696
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133697
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133698
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133699
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133700
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ../snippy_CP133701
/home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
cd ..

2, using spandx calling variants (almost the same results to the one from viral-ngs!)

    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610
    cp PP810610.1.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/PP810610/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build  PP810610     -d
    gzip hCoV229E_Rluc_trimmed_P_1.fastq hCoV229E_Rluc_trimmed_P_2.fastq p10_DMSO_trimmed_P_1.fastq p10_DMSO_trimmed_P_2.fastq p10_K22_trimmed_P_1.fastq p10_K22_trimmed_P_2.fastq p10_K7523_trimmed_P_1.fastq p10_K7523_trimmed_P_2.fastq
    ln -s /home/jhuang/Tools/spandx/ spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq.gz" --ref PP810610.fasta --annotation --database PP810610 -resume

3, summarize all SNPs and Indels from the snippy result directory.

    #Output: snippy_CP133676/snippy/summary_snps_indels.csv
    python3 summarize_snippy_res.py snippy_CP133676/snippy

    #-- post-processing of Outputs_CP133676 produced by SPANDx (temporary not necessary) --
    #cd Outputs_CP133676/Phylogeny_and_annotation
    #awk '{if($3!=$7) print}' < All_SNPs_indels_annotated.txt > All_SNPs_indels_annotated_.txt
    #cut -d$'\t' -f1-7 All_SNPs_indels_annotated_.txt > f1_7
    #grep -v "/" f1_7 > f1_7_
    #grep -v "\." f1_7_ > f1_7__
    #grep -v "*" f1_7__ > f1_7___
    #grep -v "INDEL" f1_7___ > f1_7_SNPs
    #cut -d$'\t' -f2-2 f1_7_SNPs > f1_7_SNPs_f2
    #grep -wFf <(awk 'NR>0 {print $1}' f1_7_SNPs_f2) All_SNPs_indels_annotated_.txt > All_SNPs_annotated.txt
    #grep -v "SNP" f1_7___ > f1_7_indels
    #cut -d$'\t' -f2-2 f1_7_indels > f1_7_indels_f2
    #grep -wFf <(awk 'NR>0 {print $1}' f1_7_indels_f2) All_SNPs_indels_annotated_.txt > All_indels_annotated.txt

4, merge the following two files summary_snps_indels.csv (192) and All_SNPs_indels_annotated.txt (248) –> merged_variants_CP133676.csv (94)

    python3 merge_snps_indels.py snippy_CP133676/snippy/summary_snps_indels.csv Outputs_CP133676/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt merged_variants_CP133676.csv
    #check if the number of the output file is correct?
    comm -12 <(cut -d, -f2 snippy_CP133676/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs_CP133676/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq) | wc -l
    comm -12 <(cut -d, -f2 snippy_CP133676/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs_CP133676/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq)

5, (optional, since it should not happed) filter rows without change in snippy_CP133676/snippy/summary_snps_indels.csv (94)

    awk -F, '
    NR == 1 {print; next}  # Print the header line
    {
        ref = $3;  # Reference base (assuming REF is in the 3rd column)
        same = 1;  # Flag to check if all bases are the same as reference
        samples = "";  # Initialize variable to hold sample data
        for (i = 6; i <= NF - 8; i++) {  # Loop through all sample columns, adjusting for the number of fixed columns
            samples = samples $i " ";  # Collect sample data
            if ($i != ref) {
                same = 0;
            }
        }
        if (!same) {
            print $0;  # Print the entire line if not all bases are the same as the reference
            #print "Samples: " samples;  # Print all sample data for checking
        }
    }
    ' merged_variants_CP133676.csv > merged_variants_CP133676_.csv

    #Explanation:
    #    -F, sets the field separator to a comma.
    #    NR == 1 {print; next} prints the header line and skips to the next line.
    #    ref = $3 sets the reference base (assumed to be in the 3rd column).
    #    same = 1 initializes a flag to check if all sample bases are the same as the reference.
    #    samples = ""; initializes a string to collect sample data.
    #    for (i = 6; i <= NF - 8; i++) loops through the sample columns. This assumes the first 5 columns are fixed (CHROM, POS, REF, ALT, TYPE), and the last 8 columns are fixed (Effect, Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length, Gene_name, Biotype).
    #    samples = samples $i " "; collects sample data.
    #    if ($i != ref) { same = 0; } checks if any sample base is different from the reference base. If found, it sets same to 0.
    #    if (!same) { print $0; print "Samples: " samples; } prints the entire line and the sample data if not all sample bases are the same as the reference.

6, improve the header

    sed -i '1s/_trimmed_P//g' merged_variants_CP133676.csv

7, check the REF and K1 have the same base and delete those records with difference.

    cut -f3 -d',' merged_variants_CP133676.csv > f3
    cut -f6 -d',' merged_variants_CP133676.csv > f6
    diff f3 f6
    awk -F, '$3 == $6 || NR==1' merged_variants_CP133676.csv > filtered_merged_variants_CP133676.csv #(93)
    cut -f3 -d',' filtered_merged_variants_CP133676.csv > f3
    cut -f6 -d',' filtered_merged_variants_CP133676.csv > f6
    diff f3 f6

8, MANUALLY REMOVE the column f6 in filtered_merged_variants_CP133676.csv, and rename CHROM to HDRNA_01_K01 in the header, summarize chr and plasmids SNPs of a sample together to a single list, save as an Excel-file.

9, code of summarize_snippy_res.py

    import pandas as pd
    import glob
    import argparse
    import os

    #python3 summarize_snps_indels.py snippy_HDRNA_01/snippy

    #The following record for 2365295 is wrong, since I am sure in the HDRNA_01_K010, it should be a 'G', since in HDRNA_01_K010.csv
    #CP133676,2365295,snp,A,G,G:178 A:0
    #
    #The current output is as follows:
    #CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,None,,,,,,None,None
    #CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
    #grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
    #BUG: CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan

    import pandas as pd
    import glob
    import argparse
    import os

    def main(base_directory):
        # List of isolate identifiers
        isolates = [f"HDRNA_01_K0{i}" for i in range(1, 11)]
        expected_columns = ["CHROM", "POS", "REF", "ALT", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"]

        # Find all CSV files in the directory and its subdirectories
        csv_files = glob.glob(os.path.join(base_directory, '**', '*.csv'), recursive=True)

        # Create an empty DataFrame to store the summary
        summary_df = pd.DataFrame()

        # Iterate over each CSV file
        for file_path in csv_files:
            # Extract isolate identifier from the file name
            isolate = os.path.basename(file_path).replace('.csv', '')
            df = pd.read_csv(file_path)

            # Ensure all expected columns are present, adding missing ones as empty columns
            for col in expected_columns:
                if col not in df.columns:
                    df[col] = None

            # Extract relevant columns
            df = df[expected_columns]

            # Ensure consistent data types
            df = df.astype({"CHROM": str, "POS": int, "REF": str, "ALT": str, "TYPE": str, "EFFECT": str, "LOCUS_TAG": str, "GENE": str, "PRODUCT": str})

            # Add the isolate column with the ALT allele
            df[isolate] = df["ALT"]

            # Drop columns that are not needed for the summary
            df = df.drop(["ALT"], axis=1)

            if summary_df.empty:
                summary_df = df
            else:
                summary_df = pd.merge(summary_df, df, on=["CHROM", "POS", "REF", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"], how="outer")

        # Fill missing values with the REF allele for each isolate column
        for isolate in isolates:
            if isolate in summary_df.columns:
                summary_df[isolate] = summary_df[isolate].fillna(summary_df["REF"])
            else:
                summary_df[isolate] = summary_df["REF"]

        # Rename columns to match the required format
        summary_df = summary_df.rename(columns={
            "CHROM": "CHROM",
            "POS": "POS",
            "REF": "REF",
            "TYPE": "TYPE",
            "EFFECT": "Effect",
            "LOCUS_TAG": "Gene_name",
            "GENE": "Biotype",
            "PRODUCT": "Product"
        })

        # Replace any remaining None or NaN values in the non-isolate columns with empty strings
        summary_df = summary_df.fillna("")

        # Add empty columns for Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length
        summary_df["Impact"] = ""
        summary_df["Functional_Class"] = ""
        summary_df["Codon_change"] = ""
        summary_df["Protein_and_nucleotide_change"] = ""
        summary_df["Amino_Acid_Length"] = ""

        # Reorder columns
        cols = ["CHROM", "POS", "REF", "TYPE"] + isolates + ["Effect", "Impact", "Functional_Class", "Codon_change", "Protein_and_nucleotide_change", "Amino_Acid_Length", "Gene_name", "Biotype"]
        summary_df = summary_df.reindex(columns=cols)

        # Remove duplicate rows
        summary_df = summary_df.drop_duplicates()

        # Save the summary DataFrame to a CSV file
        output_file = os.path.join(base_directory, "summary_snps_indels.csv")
        summary_df.to_csv(output_file, index=False)

        print("Summary CSV file created successfully at:", output_file)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Summarize SNPs and Indels from CSV files.")
        parser.add_argument("directory", type=str, help="Base directory containing the CSV files in subdirectories.")
        args = parser.parse_args()

        main(args.directory)

10, code of merge_snps_indels.py

    import pandas as pd
    import argparse
    import os

    def merge_files(input_file1, input_file2, output_file):
        # Read the input files
        df1 = pd.read_csv(input_file1)
        df2 = pd.read_csv(input_file2, sep='\t')

        # Merge the dataframes on the 'POS' column, keeping only the rows that have common 'POS' values
        merged_df = pd.merge(df2, df1[['POS']], on='POS', how='inner')

        # Remove duplicate rows
        merged_df.drop_duplicates(inplace=True)

        # Save the merged dataframe to the output file
        merged_df.to_csv(output_file, index=False)

        print("Merged file created successfully at:", output_file)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Merge two SNP and Indel files based on the 'POS' column.")
        parser.add_argument("input_file1", type=str, help="Path to the first input file (summary_snps_indels.csv).")
        parser.add_argument("input_file2", type=str, help="Path to the second input file (All_SNPs_indels_annotated.txt).")
        parser.add_argument("output_file", type=str, help="Path to the output file.")
        args = parser.parse_args()

        merge_files(args.input_file1, args.input_file2, args.output_file)

Tn-seq analysis pipeline (improved3)

Leave a reply

only one mutant, PacBio long-sequencing find the positions of transposons

Häufigkeit in mixed mutants using short-sequencing

DEGs regarding Häufigkeit using short-sequencing

Overview of Data Processing Procedure

1. Convert .fastq files to .fasta format (.reads).

"AGCTTCAGGGTTGAGATGTGTATAAGAGACAG", allowed a mismatch of 1 nt
2. Identify reads with the transposon prefix in R1. The sequence searched for is "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG", allowed a mismatch of 1 nt, which must start between cycles 5 and 10 (inclusive). (Note that this ends in the canonical terminus of the Himar1 transposon, TGTTA.) The “staggered” position of this sequence is due to insertion a few nucleotides of variable length in the primers used in the Tn-Seq sample prep protocol (e.g. 4 variants of Sol_AP1_57, etc.). The number of mimatches allowed in searching reads for the transposon sequence pattern can be adjusted as an option in the interface; the default is 1.

#What are TACCACGACCA?
3. Extract genomic part of read 1. This is the suffix following the transposon sequence pattern above. However, for reads coming from fragments shorter than the read length, the adapter might appear at the other end of R1, TACCACGACCA. If so, the adapter suffix is stripped off. (These are referred to as “truncated” reads, but they can still be mapped into the genome just fine by BWA.) The length of the genomic part must be at least 20 bp.

3. Extract barcodes from read 2. Read 2 is searched for GATGGCCGGTGGATTTGTGnnnnnnnnnnTGGTCGTGGTAT”. The length of the barcode is typically 10 bp, but can be varaible, and must be between 5-15 bp.

4. Extract genomic portions of read 2. This is the part following TGGTCGTGGTAT…. It is often the whole suffix of the read. However, if the read comes from a short DNA fragment that is shorter than the read length, the adapter on the other end might appear, in which case it is stripped off and the nucleotides in the middle representing the genomic insert, TGGTCGTGGTATxxxxxxxTAACAGGTTGGCTGATAAG. The insert must be at least 20 bp long (inserts shorter than this are discarded, as they might map to spurious locations in the genome).

5. Map genomic parts of R1 and R2 into the genome using BWA. Mismatches are allowed, but indels are ignored. No trimming is performed. BWA is run in ‘sampe’ mode (treating reads as pairs). Both reads of a pair must map (on opposite strands) to be counted.

6. Count the reads mapping to each TA site in the reference genome (or all sites for Tn5).

7. Reduce raw read counts to unique template counts. Group reads by barcode AND mapping location of read 2 (aka fragment “endpoints”).

8. Output template counts at each TA site in a .wig file.

9. Calculate statistics like insertion_density and NZ_mean. Look for the site with the max template count. Look for reads matching the primer or vector sequences.

quality control

./240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R1_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R2_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R1_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R2_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R1_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R2_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq.gz
./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq.gz

#from fastqc results of initial_mutants
49821406
35-161
49821406
35-161

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=1194086
https://www.ncbi.nlm.nih.gov/Traces/wgs/AKKR01?
https://www.ncbi.nlm.nih.gov/Traces/wgs/AKKR01?display=download
#Found 4,131 proteins

modify the tpp scripts

vim ~/.local/lib/python3.10/site-packages/pytpp/tpp_tools.py
#search for "DEBUG"
#-maxreads 10000 or not_given for take all!
#-primer AGATGTGTATAAGAGACAG     the default primer of Tn5 is TAAGAGACAG!
#-primer-start-window 0,159  set 0,159 as default!
#delete to import barcode-file, since we already demultipled the file!
#  pattern for read 2...
#    TAGTGGATGATGGCCGGTGGATTTGTG GTAATTACCA TGGTCGTGGTAT CCCAGCGCGACTTCTTCGGCGCACACACC TAACAGGTTGGCTGATAAGTCCCCG?AGAT AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGT
#    -----const1---------------- --barcode- ---const2--- ------genomic---------------- ------const3--------------------------------------------------------------

run Transposon Position Profiling (TPP) on multiple contigs

#https://transit.readthedocs.io/en/latest/transit_running.html;
#https://orca2.tamu.edu/tom/iLab.html

# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623):

#primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG
#  primer_matches: 0 reads (0.0%) contain CTAGAGGGCCCAATTCGCCCTATAGTGAGT (Himar1)
#  vector_matches: 0 reads (0.0%) contain CTAGACCGTCCAGTCTGGCAGGCCGGAAAC (phiMycoMarT7)
#  adapter_matches: 0 reads (0.0%) contain GATCGGAAGAGCACACGTCTGAACTCCAGTCAC (Illumina/TruSeq index)
#  misprimed_reads: 0 reads (0.0%) contain Himar1 prefix but don't end in TGTTA

#kr11.trimmed1_failed_trim    22072406
#-rw-rw-r-- 1 jhuang jhuang  2,7G Jun 11 15:44 kr11.trimmed1    20339623
#-rw-rw-r-- 1 jhuang jhuang  2,9G Jun 11 15:46 kr11.trimmed2    20339623
#cat initial_mutants_a_2_S6_R1_001.fastq | echo $((`wc -l` / 4))  49821406=29481783 reads (59.2%) + 20339623 = 49821406

#vs.

# Break-down of total reads (49821406):
#  29356859 reads (58.9%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20464547):
#  primer_matches: 0 reads (0.0%) contain CTAGAGGGCCCAATTCGCCCTATAGTGAGT (Himar1)
#  vector_matches: 0 reads (0.0%) contain CTAGACCGTCCAGTCTGGCAGGCCGGAAAC (phiMycoMarT7)
#  adapter_matches: 0 reads (0.0%) contain GATCGGAAGAGCACACGTCTGAACTCCAGTCAC (Illumina/TruSeq index)
#  misprimed_reads: 0 reads (0.0%) contain Himar1 prefix but don't end in TGTTA
# read_length: 100 bp
# mean_R1_genomic_length: 73.4 bp
# mean_R2_genomic_length: 86.4 bp

#conda deactivate
##Test AGCTTCAGGGTTGAGATGTGTATAAGAGACAG --> TAAGAGACAG, the results are similar!
# Note that "-primer-start-window 0,161" is invalid and cannot replace the default value!
#python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R1_001.fastq.gz -reads2 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R2_001.fastq.gz -output kr11_10nt_primer -primer TAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
#mv tpp.cfg kr11_10nt_primer_tpp.cfg

#for initial_mutants
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R1_001.fastq.gz -reads2 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R2_001.fastq.gz -output initial_mutants -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
mv tpp.cfg initial_mutants_tpp.cfg

#primer_start_window 0,159
# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623):
#  primer_matches: 0 reads (0.0%) contain CTAGAGGGCCCAATTCGCCCTATAGTGAGT (Himar1)
#  vector_matches: 0 reads (0.0%) contain CTAGACCGTCCAGTCTGGCAGGCCGGAAAC (phiMycoMarT7)
#  adapter_matches: 0 reads (0.0%) contain GATCGGAAGAGCACACGTCTGAACTCCAGTCAC (Illumina/TruSeq index)
#  misprimed_reads: 0 reads (0.0%) contain Himar1 prefix but don't end in TGTTA

#for LB_culture
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R2_001.fastq.gz -output LB_culture -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
mv tpp.cfg LB_culture_tpp.cfg

#for growthout_control_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R2_001.fastq.gz -output growthout_control_24h -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
mv tpp.cfg growthout_control_24h_tpp.cfg

#for extracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz -output extracellular_mutants_24h -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
mv tpp.cfg extracellular_mutants_24h_tpp.cfg

#for intracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref WA-314_m_.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq.gz -output intracellular_mutants_24h -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem -replicon-ids contig_2_10,contig_2_9,contig_2_8,contig_2_7,contig_2_6,contig_2_5,contig_2_3,contig_2_2,contig_5_10,contig_5_11,contig_5_12,contig_5_13,contig_5_15,contig_5_16,contig_5_17,contig_5_18,contig_5_9,contig_5_8,contig_5_7,contig_5_6,contig_5_5,contig_5_4,contig_5_3,contig_5_2,contig_5_1,contig_4_2,contig_4_1,contig_3_59,contig_3_58,contig_3_57,contig_3_56,contig_3_55,contig_3_54,contig_3_53,contig_3_52,contig_3_51,contig_3_50,contig_3_49,contig_3_48,contig_3_47,contig_3_46,contig_3_44,contig_3_43,contig_3_42,contig_3_41,contig_3_40,contig_3_39,contig_3_38,contig_3_37,contig_3_36,contig_3_35,contig_3_34,contig_3_33,contig_3_32,contig_3_31,contig_3_30,contig_3_29,contig_3_28,contig_3_27,contig_3_26,contig_3_25,contig_3_24,contig_3_23,contig_3_22,contig_3_21,contig_3_20,contig_3_17,contig_3_16,contig_3_15,contig_3_14,contig_3_13,contig_3_12,contig_3_11,contig_3_9,contig_3_8,contig_3_7,contig_3_6,contig_3_5,contig_3_3,contig_3_2,contig_3_1,contig_1_48,contig_1_47,contig_1_46,contig_1_45,contig_1_44,contig_1_43,contig_1_42,contig_1_41,contig_1_40,contig_1_39,contig_1_38,contig_1_37,contig_1_34,contig_1_33,contig_1_32,contig_1_31,contig_1_28,contig_1_27,contig_1_26,contig_1_25,contig_1_24,contig_1_22,contig_1_20,contig_1_19,contig_1_18,contig_1_17,contig_1_16,contig_1_15,contig_1_14,contig_1_13,contig_1_12,contig_1_10,contig_1_8,contig_1_7,contig_1_6,contig_1_5,contig_1_4,contig_1_3,contig_1_2,contig_1_1,contig_C8715,contig_C8943,contig_C9371,contig_C8939,contig_C9357,contig_C8991,contig_C9445,contig_C8689
mv tpp.cfg intracellular_mutants_24h_tpp.cfg

generate statistics tables in Excel-format from the multiple contig running.

for sample in initial_mutants LB_culture growthout_control_24h extracellular_mutants_24h intracellular_mutants_24h; do
    echo "cd ${sample}"
    echo "cp ${sample}.tn_stats ${sample}.tn_stats_"
    echo "#Delete all general statistics before the table data in ${sample}.tn_stats_; delete the content after \"# FR_corr (Fwd templates vs. Rev templates):\""
    echo "sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_"
    echo "sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_"
    echo "python3 ~/Scripts/parse_tn_stats.py ${sample}.tn_stats_ ${sample}.tn_stats.xlsx"
    echo "#calculate the sum of the first and second columns by \"=SUM(B2:B130)\" and \"=SUM(C2:C130)\""
    echo "mkdir ${sample}_wig"
    echo "mv *.wig ${sample}_wig/"
    echo "zip -r ${sample}_wig.zip ${sample}_wig/"
done

cd initial_mutants
cp initial_mutants.tn_stats initial_mutants.tn_stats_
#Delete all general statistics before the table data in initial_mutants.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ~/Scripts/parse_tn_stats.py initial_mutants.tn_stats_ initial_mutants.tn_stats.xlsx
#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)"
#16,228,513 and 2,454,346

cd LB_culture
cp LB_culture.tn_stats LB_culture.tn_stats_
#Delete all general statistics before the table data in LB_culture.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ~/Scripts/parse_tn_stats.py LB_culture.tn_stats_ LB_culture.tn_stats.xlsx
#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)"
#19735541, 3266320

cd growthout_control_24h
cp growthout_control_24h.tn_stats growthout_control_24h.tn_stats_
#Delete all general statistics before the table data in growthout_control_24h.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ~/Scripts/parse_tn_stats.py growthout_control_24h.tn_stats_ growthout_control_24h.tn_stats.xlsx
#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)"
#23812866, 3487969

cd extracellular_mutants_24h
cp extracellular_mutants_24h.tn_stats extracellular_mutants_24h.tn_stats_
#Delete all general statistics before the table data in extracellular_mutants_24h.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ~/Scripts/parse_tn_stats.py extracellular_mutants_24h.tn_stats_ extracellular_mutants_24h.tn_stats.xlsx
#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)"
#6491071, 236041

cd intracellular_mutants_24h
cp intracellular_mutants_24h.tn_stats intracellular_mutants_24h.tn_stats_
#Delete all general statistics before the table data in intracellular_mutants_24h.tn_stats_; delete the content after "# FR_corr (Fwd templates vs. Rev templates):"
sed -i 's/read_count (TA sites only, for Himar1)/read_counts/g' *.tn_stats_
sed -i 's/NZ_mean (among templates)/NZ_mean (mean template count over non-zero TA sites)/g' *.tn_stats_
python3 ~/Scripts/parse_tn_stats.py intracellular_mutants_24h.tn_stats_ intracellular_mutants_24h.tn_stats.xlsx
#calculate the sum of the first and second columns by "=SUM(B2:B130)" and "=SUM(C2:C130)"
#20619934, 1402849

mkdir initial_mutants_wig
mv *.wig initial_mutants_wig/
cd initial_mutants_wig/
python3 ~/Scripts/update_wig_initial_mutants.py
cd ..
zip -r initial_mutants_wig.zip initial_mutants_wig/

mkdir LB_culture_wig
mv *.wig LB_culture_wig/
cd LB_culture_wig/
python3 ~/Scripts/update_wig_LB_culture.py
cd ..
zip -r LB_culture_wig.zip LB_culture_wig/

mkdir growthout_control_24h_wig
mv *.wig growthout_control_24h_wig/
cd growthout_control_24h_wig/
python3 ~/Scripts/update_wig_growthout_control_24h.py
cd ..
zip -r growthout_control_24h_wig.zip growthout_control_24h_wig/

mkdir extracellular_mutants_24h_wig
mv *.wig extracellular_mutants_24h_wig/
cd extracellular_mutants_24h_wig/
python3 ~/Scripts/update_wig_extracellular_mutants_24h.py
cd ..
zip -r extracellular_mutants_24h_wig.zip extracellular_mutants_24h_wig/

mkdir intracellular_mutants_24h_wig
mv *.wig intracellular_mutants_24h_wig/
cd intracellular_mutants_24h_wig/
python3 ~/Scripts/update_wig_intracellular_mutants_24h.py
cd ..
zip -r intracellular_mutants_24h_wig.zip intracellular_mutants_24h_wig/

zip -r genbank_files.zip genbank_files

run Transposon Position Profiling (TPP) on merged_genome

#for initial_mutants
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R1_001.fastq.gz -reads2 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R2_001.fastq.gz -output initial_mutants_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg initial_mutants_tpp_run2.cfg

#for LB_culture
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R2_001.fastq.gz -output LB_culture_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg LB_culture_tpp_run2.cfg

#for growthout_control_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R2_001.fastq.gz -output growthout_control_24h_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg growthout_control_24h_tpp_run2.cfg

#for extracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz -output extracellular_mutants_24h_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg extracellular_mutants_24h_tpp_run2.cfg

#for intracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq.gz -output intracellular_mutants_24h_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg intracellular_mutants_24h_tpp_run2.cfg

# "=SUM(B2:B130)" 20619934; "=SUM(C2:C130)" 1402849
# command: python /home/jhuang/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref merged_genome.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq.gz -output intracellular_mutants_24h_run2 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
# transposon type: Tn5
# protocol type: Tn5
# bwa flags:
# read1: ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq
# read2: ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq
# ref_genome: merged_genome.fasta
# replicon_ids:
# total_reads (or read pairs): 51244639
# truncated_reads 0 (genomic inserts shorter than the read length; ADAP2 appears in read1)
# trimmed_reads (reads with valid Tn prefix, and insert size>20bp): 23204461 *
# reads1_mapped: 20987078
# reads2_mapped: 20915836
# mapped_reads (both R1 and R2 map into genome, and R2 has a proper barcode): 20627374 * (since barcode is deleted, that means only this filters only the records in which the R2 not containing bacterial genome!)
# read_count (TA sites only, for Himar1): 20627374
# template_count: 1405508
# template_ratio (reads per template): 14.68
# TA_sites: 4537463
# TAs_hit: 93859
# density: 0.021
# max_count (among templates): 285
# max_site (coordinate): 31693
# NZ_mean (among templates): 15.0
# FR_corr (Fwd templates vs. Rev templates): 0.003
# BC_corr (reads vs. templates, summed over both strands): 0.919
# Break-down of total reads (51244639):
#  28040178 reads (54.7%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (23204461):
#  primer_matches: 0 reads (0.0%) contain CTAGAGGGCCCAATTCGCCCTATAGTGAGT (Himar1)
#  vector_matches: 0 reads (0.0%) contain CTAGACCGTCCAGTCTGGCAGGCCGGAAAC (phiMycoMarT7)
#  adapter_matches: 0 reads (0.0%) contain GATCGGAAGAGCACACGTCTGAACTCCAGTCAC (Illumina/TruSeq index)
#  misprimed_reads: 0 reads (0.0%) contain Himar1 prefix but don't end in TGTTA
# read_length: 130 bp
# mean_R1_genomic_length: 75.6 bp
# mean_R2_genomic_length: 88.5 bp

./initial_mutants.tn_stats
# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623): --> #16,228,513 and 2,454,346

./LB_culture.tn_stats
# Break-down of total reads (43486192):
#  20855173 reads (48.0%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (22631019): --> #19,735,541 and 3,266,320

./growthout_control_24h.tn_stats
# Break-down of total reads (70663823):
#  43886543 reads (62.1%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (26777280):

./extracellular_mutants_24h.tn_stats
# Break-down of total reads (47473664):
#  38115004 reads (80.3%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (9358660):

./intracellular_mutants_24h.tn_stats
# Break-down of total reads (51244639):
#  28040178 reads (54.7%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (23204461):

#grep "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG" intracellular_mutants_24h_a_2_S10_R1_001.fastq | wc -l
#28404198
#grep "AGCTTCAGGGTTGAGATGTGTATAAGAGACAG" intracellular_mutants_24h_a_2_S10_R2_001.fastq | wc -l
#29

#AGCTTCAGGGTTGAGATGTGTATAAGAGACAG
#NOTE_IMPORTANT: explain that some multiple mapped reads have to been deleted for the down-stream analysis!

run Transposon Position Profiling (TPP) on the complete genome CP009367

#for initial_mutants
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta -reads1 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R1_001.fastq.gz -reads2 240606_VH00358_96_AAFCFJGM5/kr11/initial_mutants_a_2_S6_R2_001.fastq.gz -output initial_mutants_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg initial_mutants_tpp_run3.cfg

#for LB_culture
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr13/LB_culture_a_2_S7_R2_001.fastq.gz -output LB_culture_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg LB_culture_tpp_run3.cfg

#for growthout_control_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr15/growthout_control_24h_a_2_S8_R2_001.fastq.gz -output growthout_control_24h_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg growthout_control_24h_tpp_run3.cfg

#for extracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr17/extracellular_mutants_24h_a_2_S9_R2_001.fastq.gz -output extracellular_mutants_24h_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg extracellular_mutants_24h_tpp_run3.cfg

#for intracellular_mutants_24h
python3 ~/.local/bin/tpp -bwa /usr/bin/bwa -protocol Tn5 -ref CP009367.fasta -reads1 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R1_001.fastq.gz -reads2 ./240606_VH00358_96_AAFCFJGM5/kr19/intracellular_mutants_24h_a_2_S10_R2_001.fastq.gz -output intracellular_mutants_24h_run3 -primer AGCTTCAGGGTTGAGATGTGTATAAGAGACAG -mismatches 1 -bwa-alg mem
mv tpp.cfg intracellular_mutants_24h_tpp_run3.cfg
#NOTE that for some unknown reasons, intracellular_mutants_24h_run3.wig is always empty. Copy it for backup.
cp intracellular_mutants_24h_run3.wig intracellular_mutants_24h_run3_backup.wig

# "=SUM(B2:B130)" 20619934; "=SUM(C2:C130)" 1402849

./initial_mutants_run3.tn_stats
# Break-down of total reads (49821406):
#  29481783 reads (59.2%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (20339623):

./LB_culture_run3.tn_stats
# Break-down of total reads (43486192):
#  20855173 reads (48.0%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (22631019):

./growthout_control_24h_run3.tn_stats
# Break-down of total reads (70663823):
#  43886543 reads (62.1%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (26777280):

./extracellular_mutants_24h_run3.tn_stats
# Break-down of total reads (47473664):
#  38115004 reads (80.3%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (9358660):

./intracellular_mutants_24h_run3.tn_stats
# Break-down of total reads (51244639):
#  28040178 reads (54.7%) lack the expected Tn prefix
# Break-down of trimmed reads with valid Tn prefix (23204461):

Prepare the wig files on merged_genome for transit running (it is not necessary for run3)

#change all wigs title to WA314
#./initial_mutants_run2.wig
#./LB_culture_run2.wig
#./growthout_control_24h_run2.wig
#./intracellular_mutants_24h_run2.wig
#./extracellular_mutants_24h_run2.wig
sed -i 's/chrom=merged_genome/chrom=WA314/g' *_run2.wig

Run Transit on merged_genome
```
#-- convert gbk or gff3 to prot_table --
#python3 ~/Scripts/gbk_to_prottable.py merged_genome.gbk merged_genome.prot_table
python3 /home/jhuang/.local/bin/transit convert gff_to_prot_table CP009367.gff3 CP009367.prot_table

#transit export
#https://transit.readthedocs.io/en/latest/transit_normalization_tutorial.html
#-1- transit export combined_wig <comma-separated .wig files> <annotation .prot_table> <output file> -n TTR
    transit export combined_wig  initial_mutants_run3.wig,LB_culture_run3.wig,growthout_control_24h_run3.wig,intracellular_mutants_24h_run3.wig,extracellular_mutants_24h_run3.wig CP009367.prot_table combined.wig -n nonorm
    transit export combined_wig initial_mutants_run3.wig,LB_culture_run3.wig,growthout_control_24h_run3.wig,intracellular_mutants_24h_run3.wig,extracellular_mutants_24h_run3.wig CP009367.prot_table combined_normalized.wig -n TTR
#-2-
    transit export igv initial_mutants_run3.wig,LB_culture_run3.wig,growthout_control_24h_run3.wig,intracellular_mutants_24h_run3.wig,extracellular_mutants_24h_run3.wig CP009367.prot_table combined.igv -n nonorm  #[TO_SEND!]
    transit export igv initial_mutants_run3.wig,LB_culture_run3.wig,growthout_control_24h_run3.wig,intracellular_mutants_24h_run3.wig,extracellular_mutants_24h_run3.wig CP009367.prot_table combined_normalized.igv -n TTR  #[TO_SEND!]
    #Open the two files in IGV, firstly import CP009367.gb, then import combined.igv or combined_normalized.igv.
#-3- transit export mean_counts -c combined.wig combined.mean_counts
        #ARGS=['combined_on_CP009367.mean_counts']
        #KWARGS={'c': 'combined_on_CP009367.wig'}
        #Error: list index out of range
        #python /home/jhuang/.local/bin/transit export mean_counts <comma-separated .wig files>|<combined_wig> <annotation .prot_table> <output file> [-c]
        #note: append -c if inputing a combined_wig file

#https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/tnseq/tutorial.html#compare-the-essential-genes-between-two-conditions
# -- detect essential genes (DnaA is essential, which is why there are no insertion counts in the first few TA sites)
#https://transit.readthedocs.io/en/v3.2.5/method_ttnfitness.html
#对于 transit gumbel 分析，通常期望输入的是已经规范化（normalized）的 .wig 文件。规范化的数据可以减少由于不同实验条件或测序深度导致的偏差，使得后续的统计分析更加可靠
#-- Normalization --
#https://transit.readthedocs.io/en/latest/method_normalization.html
for sample in initial_mutants LB_culture growthout_control_24h intracellular_mutants_24h extracellular_mutants_24h; do
    transit normalize ${sample}_run3.wig ${sample}_run3_normalized.wig -n TTR    #betageom
done

#https://transit.readthedocs.io/en/latest/method_tnseq_stats.html
#pre-normalization
for sample in initial_mutants LB_culture growthout_control_24h intracellular_mutants_24h extracellular_mutants_24h; do
    transit tnseq_stats ${sample}_run3.wig
done

#post-normalization
for sample in initial_mutants LB_culture growthout_control_24h intracellular_mutants_24h extracellular_mutants_24h; do
    transit tnseq_stats ${sample}_run3_normalized.wig
done

#File --> Export --> combined_wig, or IGV, or Mean_Gene_Counts.
#     --> Convert --> ...... [1]
#View --> Scatter_Plot (only two datasets are allowed)
#     --> Track_View
#     --> Quality_Control
#Analysis
--> Himar1_Methods

    * gumbel
    * resampling
    * hmm
    * example
    * binomial
    * griffin
    * randproduct
    * utest
    * gi
    * normalize
    * tnseq_stats
    * corrplot
    * heatmap
    * ttnfitness

--> Tn5_Methods
    * resampling: Resampling test of conditional essentiality between two conditions
    transit resampling -c combined.wig samples_run3.metadata LB_culture intracellular_mutants_24h CP009367.prot_table resampling_results_test.txt -s 10000 -n TTR -h -a -l -winz

    * utest: Mann-Whitney U-test of conditional essentiality between two conditions. This is a method for comparing datasets from a TnSeq library evaluated in two different conditions, analogous to resampling.
    transit utest LB_culture_run3.wig intracellular_mutants_24h_run3.wig CP009367.prot_table utest_out -n TTR -l
        #-l :=  Perform LOESS Correction; Helps remove possible genomic position bias. Default: Turned Off.

    * ZINB (command line only, If you want to compare more than two conditions, see ZINB.): The ZINB (Zero-Inflated Negative Binomial) method is used to determine which genes exhibit statistically significant variability across multiple conditions, in either the magnitude of insertion counts or local saturation, agnostically (in any one condition compared to the others). Like ANOVA, the ZINB method takes a combined_wig file (which combines multiple datasets in one file) and a samples_metadata file (which describes which samples/replicates belong to which experimental conditions).
        transit zinb combined.wig samples_run3.metadata CP009367.prot_table zinb_out -n TTR --condition Condition --include-conditions LB_culture,intracellular_mutants_24h
        #grep "not analyzed" zinb_out | wc -l  #WARNING: Could run successful, but 4097 records are not analyzed!
        #TODO: R is called by Transit for certain commands, such as ZINB, corrplot, and heatmap.
        #install R (tested on v3.5.2)
        #R packages: MASS, pscl, corrplot, gplots (run
            install.packages(c("MASS", "pscl", "corrplot", "gplots"))
            install.packages("remotes")
            remotes::install_version("MASS", version = "7.3-60")
        #Python packages (for python3): rpy2 (v>=3.0) (run “pip3 install rpy2” on command line)

    ** ANOVA (command line only): The Anova (Analysis of variance) method is used to determine which genes exhibit statistically significant variability of insertion counts across multiple conditions. Unlike other methods which take a comma-separated list of wig files as input, the method takes a combined_wig file (which combined multiple datasets in one file) and a samples_metadata file (which describes which samples/replicates belong to which experimental conditions).
        transit anova combined.wig samples_run3.metadata CP009367.prot_table anova_out -n TTR --include-conditions LB_culture,intracellular_mutants_24h --ref LB_culture -PC 5 -alpha 1000 -winz

        #--ref 
```
:= which condition(s) to use as a reference for calculating LFCs (comma-separated if multiple conditions) transit anova combined.wig samples_run3.metadata CP009367.prot_table anova_5samples_ref_LB_culture_out -n TTR –include-conditions initial_mutants,LB_culture,growthout_control_24h,intracellular_mutants_24h,extracellular_mutants_24h –ref LB_culture -PC 5 -alpha 1000 -winz #TODO_MERGE: merge combined.wig and combined_normalized.wig to Excel-file as the read_counts based on the gene! #Rv Gene TAs Mean_initial_mutants Mean_LB_culture Mean_growthout_control_24h Mean_intracellular_mutants_24h Mean_extracellular_mutants_24h LFC_initial_mutants LFC_LB_culture LFC_growthout_control_24h LFC_intracellular_mutants_24h LFC_extracellular_mutants_24h Fstat Pval #Padj status # Orf Gene ID. #Name Name of the gene. #TAs Number of TA sites in Gene #Means… Mean readcounts for each condition #LFCs… Log-fold-changes of counts in each condition vs mean across all conditions #MSR Mean-squared residual #MSE+alpha Mean-squared error, plus moderation value #p-value P-value calculated by the Anova test. #p-adj Adjusted p-value controlling for the FDR (Benjamini-Hochberg) #status Debug information (If any) transit example initial_mutants_run3.wig CP009367.prot_table initial_mutants_mean_read-counts_per_gene.txt #TODO: MERGE anovo+example together, delete all headers of the results, save as the Excel-file! # —- commands for one sample —- ** normalize: Normalization method: python transit.py norm glycerol_H37Rv_rep1.wig,glycerol_H37Rv_rep2.wig H37Rv.prot_table glycerol_TTR.txt -n TTR – TTR: Trimmed Total Reads (TTR), normalized by the total read-counts (like totreads), but trims top and bottom 5% of read-counts. This is the recommended normalization method for most cases, as it has the benefit of compensating for differences in saturation (which is especially important for resampling). – nzmean: Normalizes datasets to have the same mean over the non-zero sites. – totreads: Normalizes datasets by total read-counts, and scales them to have the same mean over all counts. – zinfnb: Fits a zero-inflated negative binomial model, and then divides read-counts by the mean. The zero-inflated negative binomial model will treat some empty sites as belonging to the “true” negative binomial distribution responsible for read-counts while treating the others as “essential” (and thus not influencing its parameters). – quantile: Normalizes datasets using the quantile normalization method described by Bolstad et al. (2003). In this normalization procedure, datasets are sorted, an empirical distribution is estimated as the mean across the sorted datasets at each site, and then the original (unsorted) datasets are assigned values from the empirical distribution based on their quantiles. This actually doesn’t work well on TnSeq data if a large fraction of TA sites have counts of 0 (ties). – betageom: Normalizes the datasets to fit an “ideal” Geometric distribution with a variable probability parameter p. Specially useful for datasets that contain a large skew. – nonorm: No normalization is performed. ** tnseq_stats: Statistical Metrics for TnSeq datasets #Typically a skew < 50 is desired #total_cts Sum of total read-counts in the sample. Indicates how much sequencing material was obtained. Typically >1M reads is desired for Himar1 datasets. transit tnseq_stats -c combined.wig -o tnseq_stats * example: Example method that calculates mean read-counts per gene. * transit example initial_mutants_run3.wig,LB_culture_run3.wig,growthout_control_24h_run3.wig,intracellular_mutants_24h_run3.wig,extracellular_mutants_24h_run3.wig CP009367.prot_table mean_read-counts_per_gene.txt * rankproduct: Rank product test for determining conditional essentiality. transit rankproduct LB_culture_run2.wig intracellular_mutants_24h_run2.wig merged_genome.prot_table rankproduct_out #-s 100 -n TTR -h -a -l transit rankproduct LB_culture_run3.wig intracellular_mutants_24h_run3.wig CP009367.prot_table rankproduct_out #ERROR! #Tn5Gaps method: https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/tnseq/tutorial.html#predict-the-essentiality-of-genes ** tn5gaps: It is based on a Gumbel analysis method Griffin et al. 2011 and adapted to Tn5 transposon specificity. The main difference comes from the fact that Tn5 transposon can insert everywhere, thus creating libraries with lower insertion rates. #transit tn5gaps initial_mutants_run2.wig merged_genome.prot_table initial_mutants_tn5gaps_out #-m 2 -r Sum -iN 5 -iC 5; #transit tn5gaps initial_mutants_run3.wig CP009367.prot_table initial_mutants_tn5gaps_out for sample in initial_mutants LB_culture growthout_control_24h intracellular_mutants_24h extracellular_mutants_24h; do transit tn5gaps ${sample}_run3_normalized.wig CP009367.prot_table ${sample}_tn5gaps_trimmed.dat -m 2 -r Sum -iN 5 -iC 5; done #grep “Essential” initial_mutants_tn5gaps_trimmed.dat | wc -l #218 #grep “Non-essential” initial_mutants_tn5gaps_trimmed.dat | wc -l #3817 ~/Tools/csv2xls-0.4/csv_to_xls.py initial_mutants_tn5gaps_trimmed.dat LB_culture_tn5gaps_trimmed.dat \ growthout_control_24h_tn5gaps_trimmed.dat intracellular_mutants_24h_tn5gaps_trimmed.dat extracellular_mutants_24h_tn5gaps_trimmed.dat -d$’\t’ -o Tn5Gaps.xls; # — draw graphics to explain the r, ovr and lenovr based on the information below — #Orf Name Desc k n r ovr lenovr pval padj call CH47_1 – 4Fe-4S dicluster domain protein 12 925 150 150 150 1.00000 1.00000 Non-essential CH47_10 hypE hydrogenase expression/formation protein HypE 8 871 349 349 349 0.99997 1.00000 Non-essential CH47_100 – hypothetical protein 21 150 26 26 26 1.00000 1.00000 Non-essential CH47_1000 – hypothetical protein 6 690 415 418 486 0.23409 1.00000 Non-essential k: Number of Transposon Insertions Observed within the ORF. n: Total Number of TA dinucleotides within the ORF. r: Length of the Maximum Run of Non-Insertions observed. #TODO1_DEL: ovr: The number of nucleotides in the overlap with the longest run partially covering the gene. lenovr: The length of the above run with the largest overlap with the gene. pval: P-value calculated by the permutation test. padj: Adjusted p-value controlling for the FDR (Benjamini-Hochberg). call: Essentiality call for the gene. Depends on FDR corrected thresholds. Essential or Non-Essential. r (Run of Non-Insertions): This value represents the length of the longest continuous region within an ORF (open reading frame) where no transposon insertions are observed. Graphically, this could be shown as a long unbroken line or bar within a longer gene representation, highlighting the absence of marks (insertions). ovrovr (Overlap with Run): This is the number of nucleotides that overlap with the longest run, which might partially cover the gene. It is not the total length of the run, but how much of it overlaps with the gene. In a graphic, this could be illustrated by overlapping two segments: one for the gene and another for the run, with the overlapping part distinctly colored or shaded. lenovrlenovr (Length of Overlap with Run): This measures the full length of the run that has the largest overlap with the gene. Visually, this could be depicted as a separate longer line or bar that extends beyond the gene boundaries but is highlighted where it overlaps with the gene. * #WARNING: Since gumbel_out cannot be generated, the ttnfitness can be generated! * TTN-Fitness (TTNFitness method that calculates mean read-counts per gene): Typically with individual TnSeq datasets, Gumbel and HMM are the methods used for evaluating essentiality. – Gumbel distinguishes between ES (essential) from NE (non-essential). – HMM adds the GD (growth-defect; suppressed counts; mutant has reduced fitness) and GA (growth advantage; inflated counts; mutant has selective advantage) categories. – Quantifying the magnitude of the fitness defect is risky because the counts at individual TA sites can be noisy. Sometimes the counts at a TA site in a gene can span a wide range of very low to very high counts. The TTN-Fitness gives a more fine-grained analysis of the degree of fitness effect by taking into account the insertion preferences of the Himar1 transposon. – These insertion preferences are influenced by the nucleotide context of each TA site. The TTN-Fitness method uses a statistical model based on surrounding nucleotides to estimate the insertion bias of each site. Then, it corrects for this to compute an overall fitness level as a Fitness Ratio, where the ratio is 0 for ES genes, 1 for typical NE genes, between 0 and 1 for GD genes and above 1 for GA genes. #transit ttnfitness transit ttnfitness initial_mutants_run3_normalized.wig,LB_culture_run3_normalized.wig,growthout_control_24h_run3_normalized.wig,intracellular_mutants_24h_run3_normalized.wig,extracellular_mutants_24h_run3_normalized.wig CP009367.prot_table CP009367.fa gumbel_out ttnfitness_out1 ttnfitness_out2 > ttnfitness_log #ERROR – gumbel output file:* The Gumbel method must be run first on the dataset.The output of the Gumbel method is provided as an input to this method. ES (essential by Gumbel) and EB (essential by Binomial) is calculated in the TTN-Fitness method via this files * Genetic Interactions: The genetic interactions (GI) method is a comparative analysis used used to determine genetic interactions. It is a Bayesian method that estimates the distribution of log fold-changes (logFC) in two strain backgrounds under different conditions, and identifies significantly large changes in enrichment (delta_logFC) to identify those genes that imply a genetic interaction. * Pathway Enrichment Analysis: Pathway Enrichment Analysis provides a method to identify enrichment of functionally-related genes among those that are conditionally essential (i.e. significantly more or less essential between two conditions). transit pathway_enrichment [-M ] [- … ** corrplot (Correlation among TnSeq datasets, command line only): A useful tool when evaluating the quality of a collection of TnSeq datasets is to make a correlation plot of the mean insertion counts (averaged at the gene-level) among samples. #INCOMPLETE cut -f1-6 combined.wig > combined_.wig transit corrplot combined_.wig corrplot.png #INCOMPLETE cut -f1-6 combined_normalized.wig > combined_normalized_.wig #transit corrplot combined_normalized_.wig corrplot_normalized.png transit corrplot anova_5samples_ref_LB_culture_out corrplot_anova.png -anova ** heatmap (Heatmap among Conditions, command line only): The output of ANOVA or ZINB can be used to generate a heatmap that simultaneously clusters the significant genes and clusters the conditions, which is especially useful for shedding light on the relationships among the conditions apparent in the data. transit heatmap anova_5samples_ref_LB_culture_out heatmap.png -anova -qval 0.05 -low_mean_filter 3 #heatmap based on 74 genes transit heatmap anova_5samples_ref_LB_culture_out heatmap.png -anova -qval 0.1 -low_mean_filter 3

Reports

#1. For gene_based reports: example (mean_read-counts_per_gene.txt) + ANOVA (anova_5samples_ref_LB_culture_out)
        #TAs: Number of TA sites in Gene

        #mean: Average read-count, including empty sites.
        #nzmean: Average read-count, excluding empty sites.

        Orf: Gene ID
        Name: Name of the gene.
        Desc    Gene description.
        k: Number of Transposon Insertions Observed within the ORF.
        n: Total Number of TA sites within the ORF
        Normalized_initial_mutants: Mean read counts for the condition initial_mutants (normalized with TTR)
        Normalized_LB_culture: Mean read counts for the condition LB_culture (normalized with TTR)
        Normalized_growthout_control_24h: Mean read counts for the condition growthout_control_24h (normalized with TTR)
        Normalized_intracellular_mutants_24h: Mean read counts for the condition intracellular_mutants_24h (normalized with TTR)
        Normalized_extracellular_mutants_24h: Mean read counts for the condition extracellular_mutants_24h (normalized with TTR)

#check both f1 are the same;
sort -t'_' -k2,2n mean_read-counts_per_gene.txt -o mean_read-counts_per_gene_sorted.txt
sort -t'_' -k2,2n anova_5samples_ref_LB_culture_out -o anova_5samples_ref_LB_culture_out_sorted
#DELETE all header lines except for keeping one in sorted files.
cut -f1 mean_read-counts_per_gene_sorted.txt > f1_1
cut -f1 anova_5samples_ref_LB_culture_out_sorted > f1_2
diff f1_1 f1_2

#Orf    Name    Desc    k   n   mean    nzmean
cut -f1-5 mean_read-counts_per_gene_sorted.txt > f1_5;  #delete the columns mean and nzmean.
#Rv     Gene    TAs     Normalized_initial_mutants    Normalized_LB_culture Normalized_growthout_control_24h      Normalized_intracellular_mutants_24h  Normalized_extracellular_mutants_24h
cut -f4-8 anova_5samples_ref_LB_culture_out_sorted > f4_8;

paste f1_5 f4_8 > overview_gene_based.txt
#~/Tools/csv2xls-0.4/csv_to_xls.py overview_gene_based.txt -d$'\t' -o overview_gene_based.xls;

#2. For essentiall gene report: transit tn5gaps

#Tn5Gaps method: https://training.galaxyproject.org/training-material/topics/genome-annotation/tutorials/tnseq/tutorial.html#predict-the-essentiality-of-genes
#dinucleotides
ORF     Gene ID.
Name    Name of the gene.
Desc    Gene description.
k   Number of Transposon Insertions Observed within the ORF.
n   Total Number of TA sites within the ORF.
r   Length of the Maximum Run of Non-Insertions observed.
pval    P-value calculated by the permutation test.
padj    Adjusted p-value controlling for the FDR (Benjamini-Hochberg).
call    Essentiality call for the gene. Depends on FDR corrected thresholds. Essential or Non-Essential.
#DELETE the headers; DEL the columns ovr and lenovr; merge tn5gaps also the gene-based tables in the Tn5Gaps.xls.

#3. Only choose ANOVA for DEG reports, since it can be drawn + are consistent with the gene-based view!

#Orf: Gene ID
#Name: Name of the gene
#TAs: Number of TA sites in the gene
#Means…: Mean read counts for each condition
#LFCs…: Log-fold changes of counts in each condition vs. the mean across all conditions
#p-value: P-value calculated by the ANOVA test
#p-adj: Adjusted p-value controlling for the FDR (False Discovery Rate, Benjamini-Hochberg method)
#status: Debug information

#~/Tools/csv2xls-0.4/csv_to_xls.py anova_out -d$'\t' -o anova_out.xls;
# DELETE the headers and the columns MSR and MSE+alpha; SORT according to Padj.

#4. 2 plots
corrplot: A useful tool when evaluating the quality of a collection of TnSeq datasets is to make a correlation plot of the mean insertion counts (averaged at the gene-level) among samples.
heatmap: The output of ANOVA can be used to generate a heatmap that simultaneously clusters the significant genes and clusters the conditions, which is especially useful for shedding light on the relationships among the conditions apparent in the data.

#5. igv and genbank files for IGV opening
#Usage: Open the two files in IGV, firstly import CP009367.gb, then import combined.igv or combined_normalized.igv.

plot Figure 1. Overview of the Yersinia enterocolitica subsp. enterocolitica WA-314 Transposon Mutant Library This figure illustrates the distribution of transposon insertion sites across the genome. The outermost black circle represents the WA-314 genome in kilobase pairs (Kbp). The middle circles, which feature scatter points, indicate the normalized number of sequencing reads at each unique transposon insertion site, with each axis line representing an increment of 50,000 in data values. The circles are color-coded to represent different conditions: green for extracellular mutants, blue for intracellular mutants, red for growthout control, purple for LB culture, and yellow for initial mutants. The innermost orange circle highlights the locations of genes identified as essential.

#create karyotype.microbe.txt contains the following line
    chr - chr CP009367 0 4548749 black
#Input file: combined_normalized.wig
#Output file: initial_mutants.txt
              LB_culture.txt
              growthout_control.txt
              intracellular_mutants.txt
              extracellular_mutants.txt
python3 generate_circos_input_files.py

#Prepare CP009367.prot_table_essential from CP009367.prot_table and initial_mutants_tn5gaps_trimmed.dat
grep "Essential" initial_mutants_tn5gaps_trimmed.dat > essential.ids
cut -f1 essential.ids > essential.f1
#DELETE the header
#replace \t with | in CP009367.prot_table_divided_with_
#replace \n with |" CP009367.prot_table_divided_with_ >> CP009367.prot_table_essential\ngrep "| in essential.f1
#replace | with \t in CP009367.prot_table_essential
#format circos_data/merged_genome.txt to the following format
#chr    98090   98972   1
#chr    99072   100149  1
#...

circos -conf circos.conf

generate_circos_input_files.py

import pandas as pd
import os

# Function to read the combined .wig file
def read_combined_wig(file):
    with open(file, 'r') as f:
        lines = f.readlines()
    header_line = next(i for i, line in enumerate(lines) if line.startswith("#TA_coord")) + 1
    data = pd.read_csv(file, sep="\t", skiprows=header_line, header=None)
    data.columns = ["TA_coord", "initial_mutants", "LB_culture", "growthout_control", "intracellular_mutants", "extracellular_mutants", "gene_id"]
    data["TA_coord"] = pd.to_numeric(data["TA_coord"])
    return data

# Read the .wig file
wig_file = "combined_normalized.wig"
plot_data = read_combined_wig(wig_file)

# Melt the DataFrame for easier manipulation
plot_data_long = plot_data.melt(id_vars=["TA_coord", "gene_id"], var_name="condition", value_name="value")

# Filter out zeros for better visualization (optional)
plot_data_long = plot_data_long[plot_data_long["value"] > 0]

# Ensure the condition column is treated as a category
plot_data_long["condition"] = plot_data_long["condition"].astype("category")

# Function to save data for a specific condition
def save_condition_data(data, condition, filename):
    condition_data = data[data["condition"] == condition].copy()
    condition_data["chr"] = "chr"
    condition_data["start"] = condition_data["TA_coord"]
    condition_data["end"] = condition_data["TA_coord"] + 1
    condition_data = condition_data[["chr", "start", "end", "value"]]
    condition_data.to_csv(filename, sep="\t", header=False, index=False)

# Create data directory if it doesn't exist
if not os.path.exists("circos_data"):
    os.makedirs("circos_data")

# Save data for each condition
save_condition_data(plot_data_long, "initial_mutants", "circos_data/initial_mutants.txt")
save_condition_data(plot_data_long, "LB_culture", "circos_data/LB_culture.txt")
save_condition_data(plot_data_long, "growthout_control", "circos_data/growthout_control.txt")
save_condition_data(plot_data_long, "intracellular_mutants", "circos_data/intracellular_mutants.txt")
save_condition_data(plot_data_long, "extracellular_mutants", "circos_data/extracellular_mutants.txt")

generate_gene_locations.py

import pandas as pd

# Read the merged genome table
prot_table = pd.read_csv("CP009367.prot_table_essential", sep="\t", header=None)

# Print the column names to check for correctness
print(prot_table.columns)

# Create a new DataFrame with required columns for Circos
circos_genes = pd.DataFrame()
circos_genes["chr"] = "chr"  # Chromosome name (constant for all entries)

# Use correct column indices based on inspection
circos_genes["start"] = prot_table.iloc[:, 1]  # Second column for start coordinate
circos_genes["end"] = prot_table.iloc[:, 2]    # Third column for end coordinate

circos_genes["value"] = 1  # Set a constant value for heatmap

# Save the Circos gene location file
circos_genes.to_csv("circos_data/merged_genome.txt", sep="\t", header=False, index=False)

circos.conf

#circos -conf circos.conf

<<include /etc/circos/colors_fonts_patterns.conf>>

karyotype = circos_data/karyotype.microbe.txt

chromosomes_units           = 10000
chromosomes_display_default = yes

show = yes default = 0.01r break = 0.5r #8r # Adds a gap between the first and last position of the single chromosome radius = 0.9r thickness = 25p fill = no stroke_thickness = 2 stroke_color = black show_bands = yes fill_bands = yes band_transparency = 0 show_label = yes label_font = default label_radius = 1.05r label_size = 75 label_parallel = yes orientation = 100 # Rotate the plot by 90 degrees

    #<<include ticks.conf>>
    show_ticks          = yes
    show_tick_labels    = yes

    show_grid          = no
    grid_start         = dims(ideogram,radius_inner)-0.5r
    grid_end           = dims(ideogram,radius_inner)

skip_first_label = yes skip_last_label = no radius = dims(ideogram,radius_outer) tick_separation = 2p min_label_distance_to_edge = 0p label_separation = 5p label_offset = 5p label_size = 8p multiplier = 0.001 color = black thickness = 3p size = 20p size = 10p spacing = 1u color = black show_label = no label_size = 12p format = %.2f grid = no grid_color = lblue grid_thickness = 1p size = 15p spacing = 5u color = black show_label = yes label_size = 16p format = %s grid = yes grid_color = lgrey grid_thickness = 1p size = 18p spacing = 10u color = black show_label = yes label_size = 16p format = %s grid = yes grid_color = grey grid_thickness = 1p spacing = 100u color = black show_label = yes suffix = ” kb” label_size = 36p format = %s grid = yes grid_color = dgrey grid_thickness = 1p <> <> <>

    # -- Scatter plot 1 --

type = scatter file = circos_data/extracellular_mutants.txt r1 = 0.99r r0 = 0.79r #python3 identify_min_max.py circos_data/initial_mutants.txt min = 0 max = 400000 glyph = circle glyph_size = 5 color = dgreen spacing = 50000 color = lgrey # # #spacing = 0.1r #color = lgrey # # #spacing = 0.1 #size = 10p #thickness = 2p #color = lgrey #show_label = yes #label_size = 20p #label_offset = 5p #format = %0.1f # # # # # # #condition = var(value) > 10000 #stroke_color = dred #fill_color = red #glyph = rectangle #glyph_size = 2 # # # — Scatter plot 2 — type = scatter file = circos_data/intracellular_mutants.txt r1 = 0.79r r0 = 0.715r min = 0 max = 150000 glyph = circle glyph_size = 5 color = dblue spacing = 50000 color = lgrey # — Scatter plot 3 — type = scatter file = circos_data/growthout_control.txt r1 = 0.715r r0 = 0.64r min = 0 max = 150000 glyph = circle glyph_size = 5 color = dred spacing = 50000 color = lgrey # — Scatter plot 4 — type = scatter file = circos_data/LB_culture.txt r1 = 0.64r r0 = 0.565r min = 0 max = 150000 glyph = circle glyph_size = 5 color = dpurple spacing = 50000 color = lgrey # — Scatter plot 5 — type = scatter file = circos_data/initial_mutants.txt r1 = 0.565r r0 = 0.49r min = 0 max = 150000 glyph = circle glyph_size = 5 color = dyellow spacing = 50000 color = lgrey # Gene Locations type = heatmap file = circos_data/merged_genome.txt r1 = 0.45r r0 = 0.42r color = orange #grep “Essential” tn5_gap_inituial_mutant.csv > essential_genes.txt #cut -f1 -d$’\t’ essential_genes.txt > f1 #vim merged_genome.prot_table #” merged_genome.prot_table >> merged_genome.prot_table_essential \ngrep ” #python3 generate_gene_locations.py #replace “\n\t” with “\nchr\t”

    <<include /etc/circos/housekeeping.conf>>

Small RNA-seq analysis pipeline XICRA

Leave a reply

https://github.com/cougarlj/COMPSRA/issues/18

https://github.com/HCGB-IGTP/XICRA

This pipeline is designed to take paired end reads in fastq format, trim adapters and low-quality base pairs positions, and merge read pairs (R1 & R2) that overlap.
A mapping step to the reference genome (user defined) assigns joined reads to all major RNA biotypes including miRNA and isomiRs, tRNA fragments (tRFs) and piwi associated RNAs (piRNAs).
Then, XICRA produces a miRNA analysis at the isomiR level using joined reads, with several choices of software that can be selected by the user with standardized output.
Results are generated for each sample, analyzed and summarized for all samples in a single expression matrix.
This information can be processed at the miRNA or isomiR level (single sequence) but also summarizing for each isomiR variant type.
Statistical summaries can be easily accessed using the accompanied R package XICRA.stats (https://github.com/HCGB-IGTP/XICRA.stats).
Although the pipeline is designed to take paired-end reads, it also accepts single-end reads.
XICRA uses cutadapt [30] for the adapter trimming analysis.
Default trimming preset parameter settings are: to keep all reads regardless of whether the adapter is found or not, a 10% maximum adapter matching error rate (mismatches, insertions and deletions), and a 3 bp minimum overlap length.
User must provide specific adapter sequences for the trimming analysis.
An optional previous quality checking step can be performed for each sample using FastQC [31] before the trimming analysis.
Results are summarized for all samples using MultiQC software [32].
Once all reads are adapter trimmed, the tool uses fastq-join from ea-utils [33] to join the two PE reads, if provided, on the overlapping ends.
Apart from the joined reads, this tool also generates two files with the R1 and R2 reads that cannot be joined.
As a default the minimum overlap is set to 6 bp and the maximum allowed difference for the reads to be joined is set to 0% to retain 100% matching read pairs ensuring high quality sequencing information.
Parameters can be modified using the different options provided.
The XICRA pipeline can continue to process either joined PE reads or SR reads.
Two levels of mapping are implemented. The first level profiles RNA biotypes using STAR [34] to map reads against the reference genome and featureCounts [35] to extract and quantify numbers of reads by class. The second level focuses specifically on small RNA subclasses.
Here we describe the miRNA analysis implemented within XICRA but the modularity and versatility of the pipeline would make it quite straightforward to include other RNA biotypes analyses in detail.
For miRNAs analysis at the isomiR resolution level, XICRA allows the user to use either miraligner [26], sRNAbench from sRNAtoolbox [27] or OPTIMIR [28].
Each software uses different strategies and might produce different results [36].
We have included them as they allow following standardization procedures performed by miRTOP software and adopt the miR.gff3 file format [37].
Again, the pipeline modular implementation would allow adding additional softwares converging and adapting to miRTOP and miR.gff3 format.
For each of the softwares mentioned above and included within the miRNA module in XICRA default parameters are used.
Some of these parameters can be modified using the different options provided.
As a result of this miRNA module, annotation is generated that categorizes isomiRs into classes based on their sequence modifications (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl) following miRTOP suggested classification scheme.
A final conversion step from individual per sample miR.gff3 files into a single expression matrix is performed.
This file serves as input for differential expression (DE) analysis.
Information is provided for each unique sequence and indexed names contain the miRNA, the variant type and license plate (unique identifier, UID) provided by miRTOP.
Duplicated entries at the sequence level, produced by different modifications from the same or different miRNA are discarded.
An additional matrix is provided containing the sequence information for each encrypted UID.
Per sample read count matrices at the isomiR level are summarized into a single expression matrix that it serves as input for DE analysis between the comparison groups of interest.
We have generated an additional R package (XICRA.stats) that facilitates the retrieval of these matrices and parses the information included within each unique index name provided.
The DE analysis can be done aggregating data at the mature miRNA level (i.e. hsa-miR-501-3p), by isomiR class (i.e. hsa-miR-501-3p_iso_5p), by specific length variant cluster (i.e. hsa-miR-501-3p_iso_3p:-2) or with the sequence of the read itself as the counting data.
This is useful since different types of modification may coexist in a single sequence, and non-templated additions and internally edited sequences can differ leading to isomiRs that can fall into different categories or be derived from different mature miRNAs.
DE analysis is performed outside of the tool with DESeq2 package in R [38].

Types of small RNAs

miRNA（微小RNA）：微小RNA是长度约为20-24个核苷酸的短RNA分子，它们在基因表达的后转录水平上发挥调控作用。miRNA通过结合到mRNA分子的互补序列上，可以抑制翻译过程或导致mRNA的降解。miRNA参与许多细胞过程，包括发育、分化和细胞周期控制。
tRNA（转运RNA）：转运RNA是长约70-90个核苷酸的适配器分子，在蛋白质生物合成的翻译过程中起着关键作用。它们负责将相应的氨基酸运送到核糖体，并通过其反密码子环识别mRNA上的密码子，确保在形成的蛋白质中正确排序氨基酸。
piRNA（Piwi相互作用RNA）：Piwi相互作用RNA是一类长度通常在24至31个核苷酸之间的小RNA分子，主要存在于动物的生殖细胞中，参与转座子沉默和基因组稳定性。它们与Piwi蛋白互作，帮助抑制转座元件的转录，保护基因组完整性。
snRNA（小核RNA）：小核RNA是一组在真核生物细胞核中发现的小RNA分子，长度约为100-200个核苷酸。它们是剪接体的重要组成部分，剪接体是负责从前mRNA分子中移除内含子的复合体。
snoRNA（小核仁RNA）：小核仁RNA是发现在细胞的核仁中的特化RNA分子，长度约为60-300个核苷酸。它们参与rRNA的修饰和成熟过程，特别是rRNA的化学修饰过程，如甲基化和伪尿苷化。
circRNA（环形RNA）：环形RNA是一种具有闭合圆形结构的RNA，没有常见的5’和3’端。它们可以由外显子或内含子产生，并具有多种功能，包括作为miRNA的分子海绵、影响转录和调节基因表达。CircRNAs参与许多生物学过程，并与各种疾病有关。

在生物学研究中，进行差异表达分析通常用于比较不同样本或条件下RNA分子的表达水平变化。对于上面RNA类型：

miRNA（微小RNA）：进行差异表达分析是有意义的。miRNAs在调控基因表达和参与多种生物过程中扮演关键角色，因此，分析其在不同条件下的表达差异有助于了解其在疾病发生、发展或其他生物学过程中的功能。
tRNA（转运RNA）：虽然tRNAs是蛋白质合成中必不可少的，但它们在不同状态下的表达量差异通常不是研究的主要焦点。尽管如此，tRNA的改变有时可以反映细胞的代谢状态或应对压力的能力，但这并不是差异表达分析的常见应用。
piRNA（Piwi-interacting RNA）：进行差异表达分析同样有意义，尤其是在研究生殖细胞、干细胞和癌症等领域。piRNAs与转座子的沉默和基因组稳定性维护有关，因此分析其差异表达有助于揭示它们在这些过程中的作用。
snRNA（小核RNA）和snoRNA（小核仁RNA）：这两种RNA主要参与RNA加工和修饰，如剪接和rRNA的修饰。通常情况下，对它们进行差异表达分析不是很常见，因为它们更多地涉及基本的细胞内过程。然而，如果研究的目的是特定RNA加工途径的变化，那么它们的表达分析可能是有意义的。
circRNA（环状RNA）：进行差异表达分析是有意义的。circRNAs与许多生理过程有关，包括作为miRNA的海绵、影响基因的转录和参与疾病的发展。因此，分析circRNAs的表达差异可以帮助理解它们在不同生物学背景下的功能和作用。

总的来说，miRNA、piRNA和circRNA的差异表达分析在许多研究领域是有意义的，因为它们在疾病和生物学过程中的角色。而tRNA、snRNA和snoRNA的差异表达分析可能不那么常见，除非研究特定的生物过程或条件下它们的特定变化。

通过将 cDNA 测序读段映射到人类基因组上，我们能否确定这些读段是否来源于 circRNA？

可以通过将 cDNA 的 reads 映射到人类基因组上来帮助确定它们是否来源于 circRNA。在 circRNA 的研究中，这种映射过程中寻找的关键特征是反向剪接事件，也就是说，寻找那些正常线性剪接顺序被反向连接的 reads。

在标准的基因表达研究中，mRNA 被逆转录成 cDNA，然后生成的测序 reads 通常会映射到基因组的连续区域以表示线性剪接事件。然而，对于 circRNA，由于它们是由反向剪接形成的，因此具有独特的“头尾相连”的结构。在测序数据中，这些反向剪接或“头尾相连”的事件会导致 reads 映射到基因组的非连续区域，即一个 exon 的末尾连接到另一个 exon 的开始，这与正常的线性剪接顺序不同。

通过检测这种非典型的、非连续的映射模式，可以推断出 reads 来自于 circRNA。需要专门的生物信息学工具和算法来识别这些特殊的映射模式，从而鉴定出 circRNAs。这些工具能够识别跨越独特的反向剪接点的 reads，帮助研究者确定哪些 reads 来自 circRNAs。

干眼症

Leave a reply

https://www.mayoclinic.org/zh-hans/diseases-conditions/dry-eyes/symptoms-causes/syc-20371863

概述

干眼症是一种常见状况，发生在眼泪不能给眼睛提供足够润滑时。多种原因都可能导致眼泪不充足或不稳定。例如，如果您的眼泪分泌不足或质量不高，就会出现干眼症。这种泪液的不稳定性会导致眼睛表面发生炎症和损伤。

眼睛干涩会令人感觉不适。如果患有干眼症，您的眼睛可能感到刺痛或灼痛。在某些情况下，比如在飞机上、空调房里、骑自行车时或看几个小时电脑屏幕后，您可能会感到眼睛干涩。

治疗干眼症可能让您感觉更舒服。治疗方法包括改变生活方式和使用滴眼液。您可能需要采取这些措施来控制干眼症的症状。产品与服务

书籍：《妙佑医疗国际改善视力指南》

症状

干眼症的体征和症状通常累及双眼，可能包括：

眼睛刺痛、有灼热感或发痒
眼内或眼周出现黏稠的黏液
对光敏感
眼睛发红
眼睛有异物感
难以佩戴隐形眼镜
难以在夜间开车
泪溢，这是机体对干眼症刺激的反应
视力模糊或眼睛疲劳

何时就诊

如果您的眼睛长期存在发红、发炎、疲倦或疼痛等干眼症体征和症状，请就诊。医务人员可以采取措施来确定您的眼睛有什么问题或将您转诊给专科医生。预约门诊病因

干眼症由破坏健康泪膜的多种原因引起。泪膜有三层：脂质层、水液层和黏液层。这样的组合通常能保持眼睛表面润滑、平滑和清澈。任何一层出现问题都会导致干眼症。

泪膜功能障碍的原因很多，包括激素变化、自身免疫性疾病、眼睑腺发炎或过敏性眼病。对一些人来说，导致干眼症的原因是泪液分泌减少或泪液蒸发增加。泪腺和泪腺管泪腺和泪腺管

泪腺位于每个眼球上方，用于不断地提供泪液，每次眨眼时，泪液就会充满眼睛表面。多余的液体会通过泪腺管流进鼻子。泪液分泌减少

无法分泌足够的泪液（房水）时，可能出现眼睛干涩。这种状况的医学术语是干燥性角膜结膜炎。泪液分泌减少的常见原因包括：

老龄化
某些医疗状况，包括干燥综合征、过敏性眼病、类风湿关节炎、狼疮、硬皮病、移植物抗宿主病、结节病、甲状腺疾病或维生素 A 缺乏症
某些药物，包括抗组胺药、减充血剂、激素替代疗法、抗抑郁药以及高血压、痤疮、避孕和帕金森病药物
因使用隐形眼镜、神经损伤或激光眼科手术引起的角膜神经脱敏（但与此手术相关的眼睛干涩症状通常是暂时的）

泪液蒸发增加

眼睑边缘的小腺体（睑板腺）产生的油膜可能被堵塞。睑板腺阻塞在玫瑰痤疮或其他皮肤疾病患者中更常见。

导致泪液蒸发增加的常见原因包括：

后睑缘炎（睑板腺功能障碍）
眨眼次数较少，通常发生在某些状况中，例如帕金森病；或在某些活动中集中注意力时，例如阅读、驾驶或使用电脑工作时
眼睑问题，例如眼睑向外翻（睑外翻）和眼睑向内翻（睑内翻）
眼睛过敏
外用滴眼液中的防腐剂
风、烟或空气干燥
缺乏维生素 A

风险因素

导致更易出现眼睛干涩的因素包括：

年龄超过 50 岁。随着年龄增长，泪液分泌会逐渐减少。眼睛干涩多见于 50 岁以上的人群。
女性。眼泪不足在女性中更为常见，尤其是经受由于怀孕、服用避孕药或绝经而导致的激素变化的女性。
日常饮食中维生素 A（肝脏、胡萝卜和西兰花中富含）摄入不足，或者 ω-3 脂肪酸（鱼类、核桃和植物油中富含）摄入不足。
戴隐形眼镜或有屈光手术史。

并发症

干眼症患者可能会有以下并发症：

眼睛感染。眼泪可以保护眼睛受到感染。如果没有足够的眼泪，眼睛感染的风险可能会增加。
眼睛表面损伤。如果不治疗，严重干眼症会导致眼部炎症、角膜表面磨损、角膜溃疡和视力下降。
生活质量下降。干眼症会使人难以进行日常活动，如阅读。

预防

如果您患有干眼症，请注意最有可能引发症状的情形。然后找到避免这些情形的方法，以预防干眼症症状。例如：

避免对着眼睛吹风。请勿将吹风机、汽车加热器、空调或风扇直接对着眼睛。
空气加湿。冬季，加湿器可以加湿干燥的室内空气。
考虑佩戴面罩式太阳镜或其他防护眼镜。可以在眼镜的顶部和侧面加装安全防护罩，阻挡风和干燥空气进入。请向您购买眼镜的商家询问是否有防护罩。
长时间工作时，适时休息眼睛。如果您正在阅读或在做其他需要长时间用眼的工作，请定期休息眼睛。闭目养神几分钟。或反复眨眼几秒钟，有助于将眼泪均匀地分布在眼睛上。
注意您所处的环境。高海拔地区、沙漠地区和飞机上的空气可能非常干燥。身处这种环境时，经常闭眼几分钟减少眼泪蒸发，可能会有所帮助。
让计算机屏幕低于眼睛高度。如果计算机屏幕高于眼睛高度，您在观看屏幕时眼睛需要睁得更宽。将计算机屏幕布置在眼睛高度以下，那么您就不用睁大眼睛。这可能有助于减缓不眨眼时眼泪蒸发。
戒烟并避免抽烟。如果您抽烟，请医务人员帮您制定最适合您的戒烟策略。如果您不抽烟，请远离抽烟者。抽烟可能会加剧干眼症的症状。
定期使用人工泪液。如果您患有慢性干眼症，即使眼睛感觉良好也要使用滴眼液，以保持眼睛水润。

一、热敷

广口杯内倒上热水，水温高于体温40℃到45℃就可以了。将杯子凑近眼睛，利用热蒸汽来热敷眼部。需要注意的是，热敷过程中请闭上眼睛，同时掌握距离，眼部感觉温温的即可。

二、清洁

对于经常化妆、油脂分泌较多、经常使用手机的人尤其需要注意清洁睑缘。在早晚洗脸后，使用专门清洁睑缘的湿巾，闭上眼睛，沿着睫毛根部横向慢慢擦拭五至六次。

三、按摩

按摩，这里指的是睑板腺的挤压。需要用医学方法将睑板腺的分泌物挤压出来，过程会有疼痛感，需要在医院里进行。

四、人工泪液

人工泪液是模拟眼泪的成分，起到润滑眼表、滋润眼睛的作用，可以长期滴用。根据每个人情况不同，一天点三至六次即可。

https://hylo.de/produkte

HYLO DUAL® Befeuchtung und Linderung allergischer Symptome

Jetzt bestellen und heute erhalten:
HYLO DUAL® kaufen in Shop-Apotheke

Augentropfen- PZN 14169820 (10ml)
Anwendungsgebiete
    Leichte bis mittelschwere Formen des trockenen Auges mit instabilem Tränenfilm
Inhaltsstoffe
    Hyaluronsäure (0,05%), Ectoin (2,0%)

Information Intensive Augenbefeuchtung bei chronischen Beschwerden

HYLO DUAL INTENSE® Augentropfen sind der nachhaltige Feuchtigkeitsspender für chronisch trockene und gereizte Augen. Die besondere Kombination aus hochkonzentrierter Hyaluronsäure und Ectoin lindert entzündliche Symptome durch intensive Befeuchtung und gibt den Augen langanhaltenden Schutz.

Auf einen Blick HYLO DUAL INTENSE®

    Bei chronisch trockenen und gereizten Augen mit entzündlicher Symptomatik
    Duales Wirkprinzip aus 0,2 % Hyaluronsäure und 2,0 % Ectoin
    Intensive Befeuchtung und langanhaltende Linderung entzündlicher Symptome
    Stabilisiert den Tränenfilm und schützt die Augenoberfläche vor erneuter Austrocknung
    Unterstützt die körpereigene Barrierefunktion gegen entzündliche Reize
    Frei von Konservierungsmitteln und Phosphaten
    6 Monate nach Anbruch verwendbar
    Hochergiebig mit mind. 300 Tropfen/Flasche
    Im praktischen COMOD®-System einfach zu dosieren
    Vegan / frei von tierischen Bestandteilen

依可汀（Ectoin）的作用
依可汀（Ectoin）是一种天然的小分子物质，最早在极端环境下生存的微生物中发现。它具有显著的保护细胞和生物分子的作用，主要通过以下机制发挥其效用：
细胞保护：依可汀能够稳定细胞膜和蛋白质结构，保护它们免受环境压力（如紫外线、温度变化和干燥）的损害。这种保护作用有助于维持细胞的正常功能和存活。
抗炎作用：依可汀具有抗炎特性，可以减轻由外部刺激（如过敏原、污染物）引起的炎症反应。它能减轻皮肤和黏膜的发红、肿胀和疼痛。
保湿作用：依可汀能够增强皮肤和黏膜的保湿能力，防止水分流失，从而保持组织的湿润和柔软。这使得依可汀在护肤品和眼部护理产品中广泛应用，特别适用于干燥和敏感肌肤。
修复作用：依可汀促进细胞再生和修复，有助于加速伤口愈合和受损组织的恢复。
依可汀常用于护肤品、眼药水、鼻喷剂和口腔护理产品中，以帮助减轻干燥、过敏和炎症症状，并提供全面的细胞保护和修复作用。其天然温和的特性使其适用于各种皮肤类型和敏感人群。
总之，依可汀作为一种多功能的天然保护剂，在维护细胞健康和改善皮肤、黏膜状态方面具有重要的应用价值。

Ectoin（依克多因）是一种天然的小分子有机化合物，属于四氢嘧啶类化合物。它由一些能够在极端环境中生存的细菌产生，具有出色的保湿和细胞保护功能。以下是关于Ectoin的详细介绍：

主要功能
保湿：

Ectoin具有强大的保湿效果，能够吸引并保持水分，使皮肤保持湿润和光滑。它在护肤品中广泛应用，帮助防止皮肤干燥和脱水。
细胞保护：

Ectoin能够稳定细胞膜和蛋白质，保护细胞免受极端温度、紫外线辐射和盐度变化等环境压力的损伤。这使它在各种护肤和抗老化产品中得到广泛使用。
抗炎：

Ectoin具有抗炎特性，能够减少皮肤的炎症反应。它常用于治疗和预防炎症性皮肤疾病，如湿疹和皮炎。
应用领域
护肤品：

由于其卓越的保湿和细胞保护功能，Ectoin常被添加到各种护肤品中，如保湿霜、防晒霜和抗老化产品。
医药：

Ectoin的抗炎和细胞保护特性使其在治疗皮肤病、过敏和其他炎症性疾病方面有潜在应用。它也被用于眼药水和鼻喷雾等医疗产品中，以缓解眼部和鼻腔的干燥和炎症。
生物技术：

Ectoin用于保护生物分子和细胞在极端条件下的稳定性，广泛应用于生物制剂的储存和运输中。
总结
Ectoin是一种多功能的化合物，因其在保湿、细胞保护和抗炎方面的独特优势，在护肤品和医药领域具有广泛的应用前景。它能够有效改善皮肤状态，保护细胞免受环境压力的损伤，并减少炎症反应，使其成为许多护肤和医疗产品中的重要成分。

EvoTears

> https://kampagne.doc.green/hylo?0000#/
> https://kampagne.doc.green/parinPOS?07756623#/kontaktdaten

Die Innovation bei trockenen Augen

Die einzigartigen* EvoTears® Augentropfen sind wasserfrei und verwenden den Inhaltsstoff Perfluorhexyloctan. Dadurch bieten die Augentropfen langanhaltenden Schutz vor Verdunstung und tränenden Augen, da sie sich wie ein Schutzmantel über die Träne legen.
*aufgrund von Perfluorhexyloctan und Wasserfreiheit

干眼症的创新

独特的 EvoTears® 眼药水不含水分，使用了全氟己基辛烷成分。因此，这些眼药水能够提供长时间的防蒸发保护，并防止流泪，因为它们像一层保护膜覆盖在泪液上。
*由于全氟己基辛烷和不含水分的特点。

Der einzigartige* Schutz: EvoTears®

Die verstärkte Nutzung von Smartphone und PC sowie Zugluft und ungünstige Klimabedingungen bedeuten Dauerstress für die Augen. Diese Faktoren können dazu beitragen, dass die Lipidschicht beeinträchtigt wird (z.B. aufgrund deutlich verringerter Lidschlagfrequenz am Bildschirm) und zu einer übermäßigen Verdunstung des Tränenfilms führen. Trockene und sogar tränende Augen lassen dann nicht lange auf sich warten. In solchen Fällen können EvoTears® Augentropfen die Augen unterstützen. Die ersten wasserfreien Augentropfen sind sehr gut zur Linderung der Symptome des Trockenen Auges. Mit dem Inhaltsstoff Perfluorhexyloctan ersetzen sie nicht den wässrigen Teil des Tränenfilms, sondern legen sich wie ein Schutzmantel über ihn und können so einer vorzeitigen Verdunstung vorbeugen. Durch ihre besonderen Fließeigenschaften gibt es EvoTears® in einer praktischen Tropferflasche, in der auch ohne Konservierungsmittel keine Keime entstehen können. Dadurch sind sie gut verträglich und nach Anbruch 6 Monate haltbar.

*aufgrund von Perfluorhexyloctan und Wasserfreiheit

独特的*保护：EvoTears®

频繁使用智能手机和电脑、空气流动以及不利的气候条件都对眼睛造成了持续的压力。这些因素会影响到脂质层（例如，由于在屏幕前眨眼频率显著降低），导致泪膜过度蒸发。干燥甚至流泪的眼睛就随之而来。在这种情况下，EvoTears® 眼药水可以帮助缓解这些症状。作为首款不含水分的眼药水，它们非常适合缓解干眼症的症状。其成分全氟己基辛烷不替代泪膜的水性部分，而是像一层保护膜覆盖在泪膜上，从而防止泪膜过早蒸发。由于其特殊的流动特性，EvoTears® 被装在一种实用的滴眼瓶中，即使不含防腐剂也不会滋生细菌。因此，它们使用起来非常温和，开封后可保存6个月。

*由于全氟己基辛烷和不含水分的特点

HYLO PARIN

> https://kampagne.doc.green/hylo?0000#/
> https://kampagne.doc.green/parinPOS?07756623#/kontaktdaten

Information Pflege für gereizte Horn- und Bindehaut

HYLO PARIN® Augentropfen wurden speziell auf die Bedürfnisse trockener und wiederkehrend gereizter Augen abgestimmt. Die Kombination von Hyaluronsäure und Heparin unterstützt die Regeneration der Hornhaut und bietet den Schutz, den der natürliche Tränenfilm nicht mehr ausreichend gewähren kann.

Auf einen Blick HYLO PARIN®

    Bei trockenen und wiederkehrend gereizten Augen
    Effiziente Befeuchtung bei leichten bis mittelschweren Beschwerden trockener Augen, auch mit chronischen Beschwerden der Augenoberfläche
    Mit 0,1 % Hyaluronsäure und Heparin
    Frei von Konservierungsmitteln und Phosphaten
    6 Monate nach Anbruch verwendbar
    Mit Kontaktlinsen verträglich
    Hochergiebig mit mind. 300 Tropfen/Flasche
    Im praktischen COMOD®-System einfach zu dosieren

肝素的作用
肝素是一种广泛用于临床的抗凝血药物。它的主要作用机制是通过激活抗凝血酶III（AT III），从而抑制凝血酶（凝血因子IIa）和凝血因子Xa的活性。这些凝血因子的抑制作用有助于防止血栓的形成和扩展。此外，肝素还能影响其他凝血因子和血小板功能，进一步增强其抗凝效果。
肝素常用于预防和治疗各种类型的血栓性疾病，如深静脉血栓、肺栓塞、心肌梗死和不稳定型心绞痛。它也被用于手术期间和体外循环（如透析）过程中，以防止血液凝固。
肝素的给药途径通常为静脉注射或皮下注射。其剂量和治疗方案需根据具体的临床情况和患者的凝血指标进行调整。使用肝素时需要监测凝血功能，以防止出血等副作用的发生。
总之，肝素作为一种强效的抗凝血药物，在防治血栓性疾病中发挥了重要作用。

PARIN POS® Augensalbe, 5g PZN: 07756623

HYLO CARE

Dexpanthenol，也称为D-泛醇或维生素B5原，是一种在化妆品和制药行业广泛使用的化学物质。它是泛酸（维生素B5）的一种醇类形式。Dexpanthenol以其保湿、促进愈合和抗炎特性而闻名。

Dexpanthenol的主要功能
保湿：Dexpanthenol能够吸引水分并将其保持在皮肤中，使皮肤柔软光滑。这是许多保湿霜、乳液和其他护肤品的常见成分。

促进伤口愈合：Dexpanthenol促进皮肤再生，支持伤口、小割伤和擦伤的愈合过程。它常用于伤口膏和乳霜中。

抗炎：Dexpanthenol可以帮助减少炎症和皮肤刺激。它常用于治疗皮肤过敏、晒伤和其他皮肤炎症。

护发：Dexpanthenol改善头发的弹性，增加水分含量，使头发光亮。在洗发水和护发素中经常使用。

应用领域
化妆品：Dexpanthenol是许多护肤产品的重要成分，包括保湿霜、面膜和眼霜。

药品：Dexpanthenol用于治疗皮肤损伤、烧伤和术后疤痕的药膏和乳霜中。

护发：洗发水、护发素和发膜中通常含有Dexpanthenol，以护理和强化头发。

示例产品
Bepanthen软膏：一种著名的伤口和愈合软膏，含有Dexpanthenol，用于支持皮肤愈合。
总之，Dexpanthenol因其卓越的保湿、愈合和抗炎特性，在护肤和药品领域具有广泛的应用前景。

HYLO FRESH

Hyaluronsäure (0,03%), Euphrasia Urtinktur

Euphrasia，也称为小米草，是一种属于玄参科的小草本植物，常用于草药治疗和护眼产品中。以下是关于Euphrasia的一些详细信息：

"Euphrasia Urtinktur" 是指用「欧亚种草 (Euphrasia officinalis)」制成的酊剂。在西方草药疗法中，欧亚种草常用于治疗眼部问题，如结膜炎和眼疲劳。"Urtinktur" 是德文，意为「酊剂」，即用酒精提取植物活性成分制成的药剂。

简单来说，"Euphrasia Urtinktur" 是一种用于治疗眼睛问题的草药酊剂。

在睡眠中几乎不产生泪液是一个正常的生理现象，主要原因如下：

1. 副交感神经系统活动减少

    神经调控：睡眠期间，副交感神经系统的活动减少，导致泪腺的分泌功能减弱。副交感神经系统在白天更活跃，帮助调节泪液的分泌以保持眼睛湿润。

2. 眼睑闭合

    眼睑保护：睡眠时眼睑闭合，减少了眼球暴露在空气中的时间，从而降低了泪液蒸发的需求。眼睛在闭合状态下不需要像白天那样频繁地进行润滑。

3. 泪液生产的昼夜节律

    生物节律：人体的许多生理功能都有昼夜节律，包括泪液的生产。白天活动期间，眼睛需要更多的泪液来保护和润滑，而夜间休息时则不需要如此多的泪液。

4. 减少外部刺激

    刺激减少：在睡眠中，眼睛不接触外界环境中的灰尘、风和其他可能刺激眼睛的因素，因此泪液分泌需求减少。

结论

总的来说，睡眠中泪液分泌减少是人体的一种适应机制，旨在节省能量和资源，同时也与减少外部刺激和眼睛自我保护有关。这是正常的生理现象，不必担心。

HYLO NICHT and GEL

HYLO NIGHT：夜间保护

> https://www.google.com/search?q=Bindehautsack&client=ubuntu-sn&hs=WRS&sca_esv=5a58adcc29de473b&channel=fs&ei=2w1sZpylCouoxc8Pz9qwuAw&ved=0ahUKEwjc_rWz5NqGAxULVPEDHU8tDMcQ4dUDCBE&uact=5&oq=Bindehautsack&gs_lp=Egxnd3Mtd2l6LXNlcnAiDUJpbmRlaGF1dHNhY2syDRAAGIAEGLEDGEMYigUyChAAGIAEGEMYigUyBRAAGIAEMgUQABiABDIFEAAYgAQyBRAAGIAEMgUQABiABDIFEAAYgAQyBRAAGIAEMgUQABiABEiZBlDEBFjEBHABeACQAQCYAVugAbIBqgEBMrgBA8gBAPgBAZgCAqACfsICChAAGLADGNYEGEeYAwCIBgGQBgiSBwMxLjGgB-gJ&sclient=gws-wiz-serp#imgrc=iAi8BNhm_SEQSM&imgdii=O8urUWD1tj1HUM

用于夜间眼部保湿的眼膏
缓解眼部灼热、干燥和疲劳感
改善泪膜并在夜间保护眼睛表面
含有维生素A
不含防腐剂和磷酸盐

夜间保护眼睛表面
睡眠中几乎不产生泪液，无法通过眼药水进行定期保湿。HYLO NIGHT® 专为夜间保护眼睛表面而设计。柔软光滑的眼膏能够很好地在受刺激的眼表面分布，并通过含有的维生素A缓解眼部灼热、干燥和疲劳感。HYLO NIGHT® 不含防腐剂，因此非常耐受。经济实惠的眼膏在开封后可使用6个月。HYLO NIGHT® 适合作为所有干眼症阶段中HYLO®眼药水的夜间补充。

含维生素A的柔软眼膏
维生素A是人体必需的活性成分，例如用于调节生长或保护粘膜。同时，它也被称为“眼维生素”，因为它在视力和泪液方面具有重要作用。维生素A是泪膜的天然成分之一，这支持了HYLO NIGHT®的良好耐受性。它还确保了眼膏能与天然泪液很好地混合，使眼睑在角膜和结膜上滑动更加顺畅。

使用方法：

1. 略微仰头，用一只手的手指轻轻拉下下眼睑
2. 用另一只手将管子直立在眼睛上方。通过轻压管子将少量眼膏挤入结膜囊。
3. 缓慢闭眼。用压缩物或布移除多余的眼膏。

使用眼膏时应避免管口直接接触眼睛和面部皮肤。同一管眼膏应仅供一个人使用，以防止可能的病菌和病原体传播。

不同阶段的干眼症
干眼症，即干眼综合症，会导致各种不适症状，如灼热、眼红或眼痒，眼睛感到疲劳或受刺激。对于轻度至中度症状，保湿眼药水可迅速缓解，而对于症状较重和慢性的患者，建议使用特别粘稠的凝胶滴剂，这些滴剂能长时间附着在眼表面。此外，夜间使用眼膏可为所有类型的干眼提供密集的保湿，并支持受损眼表面的再生。

眼中沙粒感
干眼症不仅在白天会引起不适。在睡觉时泪液分泌不足，许多患者早晨醒来后会感到眼中有沙粒感，这是由于眼睑内侧粘附在干燥的眼表面上引起的。使用HYLO NIGHT®眼膏可以防止早晨的沙粒感。

常见问题解答：

HYLO NIGHT®眼膏应该多频繁使用？
HYLO NIGHT® 改善泪膜并保护眼表面。根据您的需要或眼科医生的建议，个性化地使用眼膏。HYLO NIGHT® 适用于所有干眼症阶段，作为HYLO COMOD® 或 HYLO GEL® 等保湿眼药水的补充，每晚睡前在每只眼睛的结膜囊内挤入约0.5厘米长的膏条。在眼部干燥感和严重不适时，白天也可以使用该眼膏。不要与其他眼部药物同时使用HYLO NIGHT®。如有必要，应在使用其他眼药后约30分钟再使用HYLO NIGHT®。

使用HYLO NIGHT® 时视力会受到影响吗？
HYLO NIGHT® 眼膏在眼睛上形成持久的保护膜。由于其含有脂肪的质地，可能会稍微影响视力，因此建议主要在夜间使用。如果您在白天也要使用HYLO NIGHT®，请不要驾驶或参与道路交通活动，不要操作机器或进行需要稳定支撑的工作。

HYLO NIGHT® 能与隐形眼镜一起使用吗？
不行，在佩戴隐形眼镜时不应使用HYLO NIGHT®眼膏。对于隐形眼镜佩戴者，您可以选择HYLO®系列中的不同眼药水产品进行强效保湿，这取决于干眼症的严重程度。

使用HYLO NIGHT® 后需要覆盖眼睛吗？
不需要。唯一的例外情况是眼睑闭合不全（兔眼）。在这种情况下，可以与眼科医生协商，夜间使用称为钟表玻璃敷料的透明胶布，形成湿润环境，防止眼睛干燥。

HYLO NIGHT: Der Schutz für die Nacht

    Augensalbe zur nächtlichen Augenbefeuchtung 
    Lindert das Gefühl brennender, trockener und müder Augen
    Verbessert den Tränenfilm und  bietet der Augenoberfläche Schutz über Nacht
    Mit Vitamin A
    Frei von Konservierungsmitteln und Phosphaten 

Schützt die Augenoberfläche über Nacht
Im Schlaf wird kaum Tränenflüssigkeit produziert, eine regelmäßige Befeuch- tung mit Augentropfen ist über Nacht nicht möglich. Speziell für den nächtli- chen Schutz der Augenoberfläche wurde HYLO NIGHT® entwickelt. Die weiche und geschmeidige Augensalbe verteilt sich gut auf der gereizten Augenober- fläche und lindert mit dem Inhaltsstoff Vitamin A das Gefühl brennender, trockener und müder Augen. HYLO NIGHT® ist frei von Konservierungsmitteln und daher sehr gut verträglich. Die ergiebige Augensalbe ist 6 Monate nach Anbruch der Tube haltbar. HYLO NIGHT® eignet sich als nächtliche Ergänzung von HYLO® Augentropfen bei allen Stadien trockener Augen.

Weiche Augensalbe mit Vitamin A
Vitamin A ist ein essenzieller Wirkstoff für unseren Körper, der z. B. für die Regulation von Wachstum oder den Schutz der Schleim- häute bedeutend ist. Er gilt aber auch als so- genanntes „Augenvitamin“. Das ist auf seine bedeutende Rolle hinsichtlich der Sehfähigkeit und Tränenflüssigkeit der Augen zurückzufüh- ren. Das Vitamin stellt einen natürlichen Be- standteil des Tränenfilms dar. Das unterstützt die gute Verträglichkeit von HYLO NIGHT®. Es sorgt auch dafür, dass sich die Augensalbe gut mit der natürlichen Tränenflüssigkeit ver- mischt und die Augenlider leichter über Horn- haut und Bindehaut gleiten.

Anwendung:

    Legen Sie den Kopf leicht zurück,  ziehen  das Unterlid mit dem Finger  einer Hand etwas  vom Auge zurück
    Halten Sie mit der anderen  Hand die Tube in aufrechter  Position über das Auge. Bringen Sie durch leichten Druck auf die Tube einen kleinen Salbenstrang in den Bindehautsack ein.
    Schließen Sie die Augenlider langsam. Überschüssig austretende Salbe können mit einer  Kompresse oder einem  Tuch entfernt werden. 

Bei der Anwendung von Augensalben sollte ein direkter Kontakt der Tubenspitze mit Auge und Gesichtshaut vermieden werden. Mit dem Inhalt einer Tube sollte immer nur eine Person behandelt werden, um einer mögliche Übertragung von Keimen und Erregern vorzubeugen.

VERSCHIEDENE STADIEN TROCKENER AUGEN
Trockene Augen, das sog. Sicca-Syndrom, machen sich mit unterschiedlichen Be- schwerden wie Brennen, Rötung oder Juckreiz bemerkbar, die Augen fühlen sich müde oder gereizt an. Bei leichten bis mittelschweren Symptomen ermöglichen befeuchtende Augentropfen eine rasche Linderung, für Patienten mit stärkeren und chronischen Be- schwerden empfehlen sich spezielle hochviskose Geltropfen, die besonders lange an der Augenoberfläche haften. Zusätzlich kann die Anwendung einer Augensalbe über Nacht bei allen Formen des trockenen Auges für eine intensive Augenbefeuchtung sorgen und die Regeneration der strapazierten Augenoberfläche unterstützen.

GEFÜHL VON SANDKÖRNERN IM AUGE
Trockene Augen können nicht nur tagsüber zu unangenehmen Beschwerden führen. Durch die mangelnde Produktion von Tränenflüssigkeit beim Schlafen klagen viele Betroffene morgens nach dem Aufwachen über ein sogenanntes Sandkorngefühl. Dieses entsteht durch das Anhaften der Lidinnenseite auf der Augenoberfläche, was durch trockene Augen ausgelöst wird. Beim Öffnen reibt das Lid auf der trockenen Au- genoberfläche, was dieses unangenehme Gefühl auslöst. Die Anwendung der HYLO NIGHT® Augensalbe über Nacht bewahrt so vor dem unangenehmen Sandkorngefühl am Morgen.

HÄUFIGE FRAGEN & ANTWORTEN:

Wie oft sollte HYLO NIGHT® Augensalbe angewendet werden?
HYLO NIGHT® verbessert den Tränenfilm und schützt die Augenoberfläche. Wenden Sie die Augensalbe individuell nach Ihren Bedürfnissen bzw. den Empfehlungen Ihres Au- genarztes an. HYLO NIGHT® eignet sich bei allen Stadien des trockenen Auges zur Ergänzung von befeuchtenden Augentropfen wie beispielsweise HYLO COMOD® oder HYLO GEL®, indem Sie abends vor dem Schlafengehen einen ca. 0,5 cm langen Salbenstrang in den Bindehautsack jeden Auges geben. Bei einem ausgeprägten Trockenheitsgefühl der Augen und starken Beschwerden kann die Salbe auch tagsüber angewendet werden. Wenden Sie HYLO NIGHT® nicht gleichzeitig mit anderen Arzneimitteln am Auge an. Sollte das notwendig sein, ist HYLO NIGHT® erst etwa 30 Minuten nach der Anwendung anderer Augenarzneimittel zu verabreichen.

Ist bei der Anwendung von HYLO NIGHT® mit einer Sehbeeinträchtigung zu rechnen?
HYLO NIGHT® Augensalbe bildet einen langanhaltenden Schutzfilm auf dem Auge. Aufgrund der fetthaltigen Konsistenz kann es zu einer leichten Beeinträchtigung des Sehver- mögens kommen, weshalb sich die Anwendung des Medizinprodukts vor allem über Nacht empfiehlt. Falls Sie HYLO NIGHT® auch tagsüber anwenden möchten, sollten Sie nicht Auto fahren oder in sonstiger Weise aktiv am Straßenverkehr teilnehmen, Maschinen bedienen oder Arbeiten ohne sicheren Halt ausführen.

Kann HYLO NIGHT® mit Kontaktlinsen verwendet werden?
Nein, während des Tragens von Kontaktlinsen darf HYLO NIGHT® Augensalbe nicht angewendet werden. Zur intensiven Befeuchtung des Auges stehen Ihnen als Kontaktlinsen- träger alternativ verschiedene Augentropfen-Präparate des HYLO® Sortiments – abhängig von der Schwere der Symptome eines trockenen Auges – zur Auswahl.

Muss das Auge nach der Anwendung von HYLO NIGHT® abgedeckt werden?
Nein, das ist nicht notwendig. Die einzige Ausnahme besteht bei einem unvollständigen Lidschluss (Lagophthalmus). In diesem Fall kann in Absprache mit dem behandelnden Augenarzt über Nacht ein sogenannter Uhrglasverband angeraten sein. Das transparente Pflaster bildet eine feuchte Kammer, die die Austrocknung des Auges verhindert.

HYLO GEL®：用于长期缓解慢性干眼症

用于异物感、眼痒或眼红

高效缓解重度和慢性干眼症

支持术后眼睛的愈合过程

含有0.2%透明质酸

不含防腐剂和磷酸盐

不影响视力

有效缓解严重症状
HYLO GEL® 针对慢性干眼症和重度干眼症的特殊需求。由于含有高浓度透明质酸，这些高粘度的滴剂在眼表面特别具有粘附性。持续且严重的症状，如眼痒、异物感、眼红和灼烧感将得到缓解，并且眼睛将长时间受到保护免受刺激。此外，HYLO GEL® 眼药水也是支持眼科手术后愈合过程的良好选择。实用的COMOD®系统保证了精准的剂量控制，高效利用，每瓶至少可滴300滴，开封后可使用6个月。HYLO GEL® 眼药水不含防腐剂和磷酸盐，因此具有极佳的耐受性。

HYLO GEL® 保湿眼药水无需处方，但可以开具处方。对于某些适应症（根据药品指南附录V），医生可以开具处方。

高浓度透明质酸带来持久的眼部湿润
HYLO GEL® 眼药水含有透明质酸，这是一种自然存在于人体某些部位及眼睛中的物质。由于含有0.2%的高浓度透明质酸，HYLO GEL® 具有高粘度，在受刺激的眼表面形成稳定的保护膜。结膜和角膜得到深层和持久的湿润。尽管具有凝胶状的质地，但不影响视力。HYLO GEL® 眼药水也可与硬性和软性隐形眼镜一起使用。

剂量：
步骤1：
使用前取下保护盖。

步骤2：
将瓶子倒置，使滴头向下。将拇指放在瓶肩，其余手指放在瓶底。

步骤3：
用另一只手轻轻支撑持瓶的手。

步骤4：
稍微向后仰头，用另一只手轻轻拉下下眼睑。用力按压瓶底，将药水滴入结膜囊。感谢特殊的COMOD®泵系统，每次仅出一滴。缓慢闭眼，使液体在眼表面分布均匀。使用后立即用盖子盖上滴头。

干眼症
干眼症，即干眼综合症，是最常见的眼科疾病之一。根据眼科医生协会的信息，在德国，每五位成人中就有一位受其影响，并且这一趋势在上升。主要原因之一是越来越多地使用智能手机和平板电脑等数字媒体，以及工作中密集的屏幕使用。此外，环境因素如穿堂风、空调和细颗粒物污染以及某些疾病、激素变化和年龄相关过程也会引发各种干眼症状。

眼科手术
手术可能导致严重的刺激并影响眼睛泪液的供应。在眼科手术后，稳定的泪膜尤为重要，因为伤口只有在充分湿润的环境中才能良好愈合。由于高浓度和高粘度的透明质酸，HYLO GEL® 在术后护理中因其强效和持久的保湿而被证明有效。由于不含防腐剂和磷酸盐，可以在术后安全使用。眼药水也推荐给术前已经有干眼症状的患者。

健康保险的报销
HYLO GEL® 是德国第一个可开具处方的眼部保湿医疗产品。在治疗必要时，某些情况下费用可以由法定健康保险报销。医生可根据药品指南附录V开具处方，用于泪腺缺失或损伤、各种自身免疫性疾病、面神经麻痹（面瘫）和眼睑闭合不全（兔眼）。

常见问题解答：
我应该每天使用HYLO GEL® 眼药水多少次？
HYLO GEL® 眼药水在严重和慢性干眼症中提供强效和持久的保湿。根据您的具体情况或眼科医生的建议，个性化地调整剂量。通常，每天3次，每只眼睛滴1滴HYLO GEL®。在症状较严重时，可以增加使用频率。如果您同时使用其他药物性眼药水，请至少间隔30分钟。此情况下，应最后使用HYLO GEL®。如果同时使用眼膏，请在使用HYLO GEL® 后将眼膏涂入结膜囊。

使用HYLO GEL® 与隐形眼镜时需要注意什么？
HYLO GEL® 也适用于隐形眼镜佩戴者。佩戴隐形眼镜后，建议等待约30分钟再使用保湿眼药水。这样可以避免隐形眼镜护理产品与HYLO GEL® 之间可能的相互作用引起的不适反应。由于眼药水具有凝胶状质地，滴入眼中后可能会出现短暂的模糊。这通常会在几次眨眼后消失。

如何确保我的眼睛在夜间得到足够的湿润？
HYLO GEL® 含有高粘度透明质酸，能在眼表面具有特别好的粘附性，并提供持久的保湿效果，持续到夜间。如果您在夜间特别感到干眼症状严重，可以通过在夜间使用保湿眼膏如HYLO NIGHT® 来补充眼药水。含有维生素A的眼膏能在眼表面很好地分布并改善泪膜。

不含防腐剂的眼药水有什么优点？
防腐剂被广泛用于许多制剂中，以防止病原菌进入眼睛。如果眼药水不能保持无菌，细菌和真菌容易滋生，使用时可能进入眼睛并引起感染。然而，防腐剂通常无法完全保护眼睛，并可能引起不耐受反应。HYLO GEL® 和HYLO® 系列的所有其他眼药水均不含防腐剂。这得益于独特的COMOD® 系统，确保液体无菌且密封。即使没有防腐剂，开封后的眼药水仍可使用6个月。防腐剂的缺失有双重好处：HYLO® 眼药水具有极好的耐受性。此外，术后护理仅适合不含防腐剂的医疗产品，否则会延缓伤口的良好愈合。

Doppelherz® system AUGEN PLUS SEHKRAFT + SCHUTZ
Packungsgröße: 120 St | Kapseln

高眼压

1、眼药水。滴眼药水能够及时的降低眼压。如果眼压持续升高，需要尽快采取科学的治疗，最好去医院检查一下，然后对症用药。

2、眼部热敷。通过眼部热敷的方式可以促进局部的血液循环，对缓解眼压的升高有一定的帮助。建议平时可用热水或蒸汽等方式来熏浴双眼，也可使用热毛巾对眼部进行热敷，都能够起到改善的作用。

3、眼球运动。长时间用眼会导致眼部疲劳，从而导致眼压的升高，可以通过眼球的自主运动，使眼部肌肉得到缓解和休息。可以向不同的方向转动眼球，能够锻炼眼球的活力，从而起到舒筋活络和改善视力功能的作用，同时也能缓解眼压高。

4、饮食调理。蜂蜜是高渗剂，口服之后能升高血液中的渗透压，将眼内多余的水分吸收到血液中，从而达到降低眼压的作用。建议多吃利水的食物，排出身体多余的水分，可以降低眼压，例如西瓜、冬瓜和赤小豆。

总之，在被确认病症之后，必须积极配合医生的治疗，这样病症才能完全的治愈。平时要避免饥饿，以免影响的眼部周围血液的循环，从而导致眼压的升高，一定要控制饮水量，对预防眼压高会有很大的好处。做到戒烟戒酒，不吃辛辣或刺激性食物，以免导致血管神经功能的失调，影响眼压平衡。

骨质疏松

硫酸氨基葡萄糖胶囊（伊索佳）：

   * 作用：用于改善骨关节炎症状，具有保护软骨、减轻疼痛、改善关节功能等作用。
   * 用法：口服，一日3次，每次2粒。

藻来苯苄布胶囊（西乐葆）：

   * 作用：为一种非甾体抗炎药（NSAID），用于缓解疼痛和炎症，常用于关节炎、肌肉疼痛等。
   * 用法：口服，一日1次，每次1粒。

碳酸钙D3片（钙尔奇）：

   * 作用：用于补充钙和维生素D，预防和治疗骨质疏松症。
   * 用法：口服，一日1次，每次1片。

阿仑膦酸钠片（基）：

   * 作用：用于治疗和预防骨质疏松症，尤其是绝经后妇女和老年男性。
   * 用法：口服，每周1次，每次1片（餐前）。

玻璃酸钠注射液：

   * 作用：用于缓解骨关节炎的疼痛和改善关节功能，常用于膝关节内注射。
   * 用法：关节腔注射，立即，每次1支。

盐酸利多卡因注射液：

   * 作用：局部麻醉剂，用于局部麻醉、缓解疼痛。
   * 用法：辅助用药，立即，每次1支。

醋酸曲安奈德注射液：

   * 作用：为一种糖皮质激素，用于减轻炎症反应，常用于关节内注射以缓解疼痛和炎症。
   * 用法：辅助用药，立即，每次10毫克。

Insertion Sequence (IS) Element Detection

Leave a reply

Bioinformatics Methods

Sequence Extraction and Comparative Analysis

Genomic regions surrounding the multidrug efflux MFS transporter were extracted from five isolates of Acinetobacter baumannii: ATCC19606, ACICU, AYE, ATCC17978, and SDF. These regions were aligned to identify any unmapped segments. This alignment revealed a 45 kb sequence (positions 2,944,541 to 2,989,666 in ATCC17978) that did not align with any part of the other four genomes. This unmapped sequence was subsequently extracted for further analysis.

Insertion Sequence (IS) Element Detection

The tool ISEScan (Xie & Tang, 2017) was employed to detect insertion sequences (IS) within the extracted 45 kb fragment. The analysis identified four distinct IS elements, providing insights into potential genomic variations and mobile genetic elements within this segment (see IS_of_45kb.xlsx).

Visualization and Annotation

The alignment was visualized using clinker v0.0.21 (Gilchrist and Chooi, 2021). Genes are represented by arrows colored according to similarity groups, with grey arrows indicating genes not part of any similarity group. The sequence identity between genes in the same similarity group is indicated by shading, according to the identity scale bar. Detailed gene annotations are shown and color-coded, with the multidrug efflux MFS transporter loci indicated by a green dashed box (see clinker1.png).

References
- Gilchrist, C.L.M., & Chooi, Y.-H. (2021). clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics. doi: https://doi.org/10.1093/bioinformatics/btab007
- Xie, Z., & Tang, H. (2017). ISEScan: automated identification of insertion sequence elements in prokaryotic genomes. Bioinformatics, 33(21), 3340-3347. doi: https://doi.org/10.1093/bioinformatics/btx433
Tools of Insertion Sequence (IS) Element Detection
- ISEScan: Description: Although not a database, ISEScan is a software tool used to identify IS elements in bacterial genome sequences. It can be helpful for researchers looking to analyze newly sequenced genomes for the presence of IS elements. Website: Available on platforms like GitHub for download and integration into bioinformatics workflows.
- TnCentral including ISFinder: Description: TnCentral is a more comprehensive resource that includes information about transposons, which are larger and more complex than simple IS elements but often contain IS sequences as part of their structure. This database provides detailed information about transposon structures, including associated genes and regulatory features.
- ISsaga: ISsaga is a web-based tool for the identification and annotation of insertion sequences in prokaryotic genomes. It provides various features for IS element analysis, including detection, classification, and visualization. You can access ISsaga here: ISsaga #http://issaga.biotoul.fr/ISsaga/issaga_index.php
- ISFinder: ISFinder is a curated database and analysis platform for insertion sequences in prokaryotic genomes. It provides a comprehensive collection of IS sequences and tools for sequence analysis, classification, and annotation. You can access ISFinder here: ISFinder For ISfinder please cite: Siguier P. et al. (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34: D32-D36 (link pubmed) and the database URL (http://www-is.biotoul.fr).
- For ISbrowser please cite: Kichenaradja P. et al. (2010) ISbrowser: an extension of ISfinder for visualizing insertion sequences in prokaryotic genomes. Nucleic Acids Res. 38: D62-D68 (link pubmed). For ISsaga please cite: Varani A. et al. (2011) ISsaga: an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes, Genome Biology 2011, 12:R30 (link pubmed).
- ISMapper: ISMapper is a tool for mapping insertion sequences in bacterial genomes. It uses paired-end sequence data to identify IS element insertion sites and provides information about their genomic context. You can access ISMapper here: ISMapper ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data https://pubmed.ncbi.nlm.nih.gov/26336060/
- ISseeker: ISseeker is a software package for the identification and annotation of insertion sequences in bacterial genomes. It provides a user-friendly interface for IS element detection and characterization. You can access ISseeker here: ISseeker

Identificatio of IS (Insertion Sequence) elements using ISEScan

#extracted sequence segments from the two isolates, specifically:
#    ATCC19606: 930469 to 951674 — segment1
#    ATCC17978: 2,934,384 to 3,000,721 — segment2
#Then, I compared the two segments and found that positions 1-11055 of segment1 mapped to 66338-55284 of segment2, and positions 11049-21206 of segment1 mapped to 10158-23 of segment2. This means the sequence from 10159-55283 of segment2 (about 45 kb nt) is not mapped. I then extracted the 45 kb sequence (see the attached fasta file). I attempted to detect IS elements using the tool ISEScan (https://academic.oup.com/bioinformatics/article/33/21/3340/3930124). Four ISs were detected (see 45kb.fasta.xlsx; for more detailed results, see 45kb.fasta.zip).

#samtools faidx Acinetobacter_baumannii_ATCC19606.gbk_converted.fna CP059040.1:930469-951674 > ../ATCC19606_segment.fasta
vim ./gbks/A.baumannii_ATCC17978.gbk
#LOCUS       CP000521             3976747 bp    DNA     circular BCT 31-JAN-2014
#DEFINITION  Acinetobacter baumannii ATCC 17978, complete genome.

I used the following commands extracted a 45kb fasta. Then using a tools get IS elements.
samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta

please update the following tables in which all positons referred to the 45kb sequence to the complete genome in ATCC17978.

    #seqID: sequence identifier
    #family: family name of IS element
    #cluster: Tpase cluster
    #isBegin and isEnd: genome coordinates of the predicted IS element
    #isLen: length of the predicted IS element
    #ncopy4is: number of predicted IS copies including full-length and partial IS copies
    #start1, end1, start2, end2: genome coordinates of the IRs
    #score: score of the IRs
    #irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
    #irLen, length of inverted repeats
    #nGaps: number of gaps in IRs
    #orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
    #strand: strand where the Tpase is
    #orfLen: length of predicted Tpase ORF
    #E-value: the best E-value among all IS copies for the same IS element, the smaller the better
    #E-value4copy: the E-value of the reported IS copy, the smaller the better
    #type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
    #ov: ov number returned by hmmer search
    #tir: terminal inverted repeat sequences
    seqID   family  cluster isBegin isEnd   isLen   ncopy4is    start1  end1    start2  end2    score   irId    irLen   nGaps   orfBegin    orfEnd  strand  orfLen  E-value E-value4copy    type    ov  tir
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_222 5818    8737    2920    1   5818    5842    8713    8737    18  17  25  0   5931    8822    +   2892    3.3E-74 3.3E-74 c   1   TGATTAAACTTTGCGGATTAAATTG:TGATTAAATCTAATGTGTTGAATTG
    CP000521.1_2934384_3000721:10159-55283  IS3 IS3_176 8745    9849    1105    1   8745    8761    9833    9849    26  15  17  0   8916    9775    -   860 9E-38   9E-38   p   1   ATTGATGATAGCCAAAA:ATTGATCCTAGCCAAAA
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_226 9983    10411   429 1   9983    9996    10398   10411   20  12  14  0   9850    10364   +   515 7.2E-28 7.2E-28 p   1   TATCATTCATTATA:TATCATTCAGCATA
    CP000521.1_2934384_3000721:10159-55283  IS5 IS5_302 23918   24796   879 1   23918   23953   24762   24796   54  33  36  1   23947   24699   -   753 3E-82   3E-82   c   1   AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG:AAAATCAAAATGATGC-TAGGGCGTGTCTTCATTTG

Postprocessing of the relative postions in the 45kb.fasta in the results IS_of_45kb.xlsx to the absolute positions of the genome

#https://github.com/xiezhq/ISEScan?tab=readme-ov-file#Usage

Columns in NC_012624.fna.csv (NC_012624.fna.tsv, NC_012624.fna.raw):

#seqID: sequence identifier
#family: family name of IS element
#cluster: Tpase cluster
#isBegin and isEnd: genome coordinates of the predicted IS element
#isLen: length of the predicted IS element
#ncopy4is: number of predicted IS copies including full-length and partial IS copies
#start1, end1, start2, end2: genome coordinates of the IRs
#score: score of the IRs
#irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
#irLen, length of inverted repeats
#nGaps: number of gaps in IRs
#orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
#strand: strand where the Tpase is
#orfLen: length of predicted Tpase ORF
#E-value: the best E-value among all IS copies for the same IS element, the smaller the better
#E-value4copy: the E-value of the reported IS copy, the smaller the better
#type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
#ov: ov number returned by hmmer search
#tir: terminal inverted repeat sequences

#Note: the E-value is the E-value returned by hmmer when searching profile HMMs against proteome translated from a genome sequence

samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-8737 > 1_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-5842 > 1_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8713-8737 > 1_3.fasta
seqtk seq -r 1_3.fasta > 1_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5931-8822 > 1_4.fasta

samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-9849 > 2_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-8761 > 2_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9833-9849 > 2_3.fasta
seqtk seq -r 2_3.fasta > 2_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8916-9775 > 2_4.fasta
seqtk seq -r 2_4.fasta > 2_4_rc.fasta

samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-10411 > 3_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-9996 > 3_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:10398-10411 > 3_3.fasta
seqtk seq -r 3_3.fasta > 3_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9850-10364 > 3_4.fasta

samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-24796 > 4_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-23953 > 4_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:24762-24796 > 4_3.fasta
seqtk seq -r 4_3.fasta > 4_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23947-24699 > 4_4.fasta
seqtk seq -r 4_4.fasta > 4_4_rc.fasta

ATG (77 %), GTG (14 %), TTG

Stopcodons in der DNA:

    TAG (Thymin - Adenin - Guanin)
    TAA (Thymin - Adenin - Adenin)
    TGA (Thymin - Guanin - Adenin)

CP000521.1_2934384_3000721:10159-55283  IS5 IS5_222 5818    8737    2920    1   5818    5842    8713    8737    18  17  25  0   5931    8822
CP000521.1_2934384_3000721:10159-55283  IS3 IS3_176 8745    9849    1105    1   8745    8761    9833    9849    26  15  17  0   8916    9775
CP000521.1_2934384_3000721:10159-55283  IS5 IS5_226 9983    10411   429 1   9983    9996    10398   10411   20  12  14  0   9850    10364
CP000521.1_2934384_3000721:10159-55283  IS5 IS5_302 23918   24796   879 1   23918   23953   24762   24796   54  33  36  1   23947   24699

seqID   family  cluster isBegin isEnd   isLen   ncopy4is    start1  end1    start2  end2    score   irId    irLen   nGaps   orfBegin    orfEnd  strand  orfLen  E-value E-value4copy    type    ov  tir
CP000521.1  IS5 IS5_222 2,950,360   2,953,279   2,920   1   2,950,360   2,950,388   2,953,239   2,953,267   18  17  25  0   2,950,473   2,953,364
+   2,892   3.3E-74 3.3E-74 c   1   TGATTAAACTTTGCGGATTAAATTG
CP000521.1  IS3 IS3_176 2,953,287   2,954,391   1,105   1   2,953,287   2,953,303   2,954,359   2,954,375   26  15  17  0   2,953,458   2,954,317
-   860 9E-38   9E-38   p   1   ATTGATGATAGCCAAAA
CP000521.1  IS5 IS5_226 2,954,509   2,954,937   429 1   2,954,509   2,954,522   2,954,924   2,954,937   20  12  14  0   2,954,376   2,954,890
+   515 7.2E-28 7.2E-28 p   1   TATCATTCATTATA
CP000521.1  IS5 IS5_302 2,969,444   2,970,322   879 1   2,969,444   2,969,479   2,970,248   2,970,282   54  33  36  1   2,969,473   2,970,225
-   753 3E-82   3E-82   c   1   AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG

I checked the results of the following command, they should be TGATTAAACTTTGCGGATTAAATTG, However it is a little different.
samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950360-2950388
GATTAAACTTTGCGGATTAAATTGACAGA

2950359-5818=+2944541

1..45125
2944541..2989666

#confirm the position-changed sheet!
CP000521.1  IS5 IS5_222 2950359 2953278 2920    1   2950359 2950383 2953254 2953278 18  17  25  0   2950472 2953363
CP000521.1  IS3 IS3_176 2953286 2954390 1105    1   2953286 2953302 2954374 2954390 26  15  17  0   2953457 2954316
CP000521.1  IS5 IS5_226 2954524 2954952 429 1   2954524 2954537 2954939 2954952 20  12  14  0   2954391 2954905
CP000521.1  IS5 IS5_302 2968459 2969337 879 1   2968459 2968494 2969303 2969337 54  33  36  1   2968488 2969240

samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2953278 > 1_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2950383 > 1_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953254-2953278 > 1_3_changed.fasta
seqtk seq -r 1_3_changed.fasta > 1_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950472-2953363 > 1_4_changed.fasta

samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2954390 > 2_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2953302 > 2_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954374-2954390 > 2_3_changed.fasta
seqtk seq -r 2_3_changed.fasta > 2_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953457-2954316 > 2_4_changed.fasta
seqtk seq -r 2_4_changed.fasta > 2_4_rc_changed.fasta

samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954952 > 3_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954537 > 3_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954939-2954952 > 3_3_changed.fasta
seqtk seq -r 3_3_changed.fasta > 3_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954391-2954905 > 3_4_changed.fasta

samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2969337 > 4_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2968494 > 4_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2969303-2969337 > 4_3_changed.fasta
seqtk seq -r 4_3_changed.fasta > 4_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968488-2969240 > 4_4_changed.fasta
seqtk seq -r 4_4_changed.fasta > 4_4_rc_changed.fasta

diff 1_1.fasta 1_1_changed.fasta
diff 1_2.fasta 1_2_changed.fasta
diff 1_3.fasta 1_3_changed.fasta
diff 1_4.fasta 1_4_changed.fasta

diff 2_1.fasta 2_1_changed.fasta
diff 2_2.fasta 2_2_changed.fasta
diff 2_3.fasta 2_3_changed.fasta
diff 2_4.fasta 2_4_changed.fasta

diff 3_1.fasta 3_1_changed.fasta
diff 3_2.fasta 3_2_changed.fasta
diff 3_3.fasta 3_3_changed.fasta
diff 3_4.fasta 3_4_changed.fasta

diff 4_1.fasta 4_1_changed.fasta
diff 4_2.fasta 4_2_changed.fasta
diff 4_3.fasta 4_3_changed.fasta
diff 4_4.fasta 4_4_changed.fasta

It should be ATG.... (gene length)

The positions are not correct. Note that I changed the coordinate twice in the following commands:
        samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
        makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
        blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
        samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta

The postions in the original tables are relative postions in the 45kb.fasta. Please correct it!

我想比对的是这27,694 Acinetobacter calcoaceticus/baumannii complex Genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=909768) 中，目的基因 (https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530) 是否是conserved的 (先不管SNP, InDel或其他突变现象)。就如这两篇paper的结果显示”AdeIJK is highly conserved across the Acinetobacter genus” (Paper1: Evolution of RND efflux pumps in the development of a successful pathogen; Paper2: RND pumps across the genus Acinetobacter: AdeIJK is the universal efflux pump). Analysis results: In our analysis of 38,638 records, we discovered that 65 of them do not contain the specific genes based on the submitted genomes from GenBank (refer to the details below). Attached, you will find a document that provides detailed information on the positions of the genes present in the remaining genomes. However, it’s important to note that, based on my experience, GenBank does contain a number of erroneously assembled genomes. Therefore, it’s conceivable that the absence of the gene in some isolates could be attributed not to its actual absence but rather to errors in genome assembly.

~/Tools/datasets download genome taxon 909768
#ncbi_dataset_taxon909768

~/Tools/datasets download genome accession --inputfile sampled_accession_numbers.txt --include gff3,gbff,genome

#./blastn_1st_batch.sh generating ../blastn_1st_res
for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
  echo "makeblastdb -in ${sample} -dbtype nucl"
  echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_merged_res_1_1000"
done

for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
  echo "makeblastdb -in ${sample} -dbtype nucl"
  echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res_4001_8000"
done

...

TODO NEXT WEEK: senden a EXCEL storing the merged blastn results! look if all 38638 records contain gene, if not, mark yellow (explain it is also possible an error of assembly of the genome regarding the specific isolate!)
#GenBank GCA_000810025.3

38638

makeblastdb -in GCF_000757665.1_ASM75766v1_genomic.fna -dbtype nucl
blastn -db GCF_000757665.1_ASM75766v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
makeblastdb -in GCF_000746645.1_ASM74664v1_genomic.fna -dbtype nucl
blastn -db GCF_000746645.1_ASM74664v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res

#find ./ncbi_dataset/sampled_100/ -name "*.fna" -print0 | xargs -0 mv -t /destination/directory/
find . -mindepth 2 -maxdepth 2 -type f -name "*.fna" -print0 | xargs -0 mv -t .
find . -mindepth 1 -maxdepth 1 -type f -name "*.fna" >../fna_list2

# The genome was downloaded at Mär 25, 2024
I have over 38638 genome with the taxon-id 909768, want to assembly 100 from them,

the data strcture as the following:

./data/GCF_000757665.1/GCF_000757665.1_ASM75766v1_genomic.fna
./data/GCF_000746645.1/GCF_000746645.1_ASM74664v1_genomic.fna
./data/GCF_000746605.1/GCF_000746605.1_ASM74660v1_genomic.fna
./data/GCF_000738845.1/GCF_000738845.1_ASM73884v1_genomic.fna
./data/GCF_000734775.1/GCF_000734775.1_ASM73477v1_genomic.fna
./data/GCF_000731965.1/GCF_000731965.1_ASM73196v1_genomic.fna
./data/GCF_000708795.1/GCF_000708795.1_ABU310_genomic.fna
./data/GCF_000708775.1/GCF_000708775.1_ASM70877v1_genomic.fna
./data/GCF_000695855.3/GCF_000695855.3_ASM69585v3_genomic.fna
./data/GCF_000248195.1/GCF_000248195.1_ASM24819v2_genomic.fna
....

I have a total of 38638 genomes in fasta format in local computer. I want to quick check if they are contain a gene ABAYE_RS05070 (/product="multidrug effflux MFS transporter") https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530)

# What are the default values of -perc_identity and  -qcov_hsp_perc in blastn?
#-perc_identity 90 -qcov_hsp_perc 90
By default:
-perc_identity is set to 0, meaning that all alignments found will be reported regardless of percent identity.
-qcov_hsp_perc is set to 0 as well, indicating that all alignments will be reported regardless of query coverage per HSP.

makeblastdb -in ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -dbtype nucl
blastn -db ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -query ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1

makeblastdb -in 100.fasta -dbtype nucl
# -max_target_seqs 1
blastn -db 100.fasta -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1

with the following commands:

for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna _genomic.fna
GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
  makeblastdb -in ${sample} -dbtype nucl
  blastn -db ${sample} -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_1st_res
done

get the following list, I want to add the two parts in the filenames also in the results as the second and third columns.

For example for  GCF_001907155.1_ASM190715v1_genomic.fna,

the current results is
"gi|169794206|ref|NC_010410.1|:974325-975530     NZ_JWTW03000099.1       97.595  1206    29      0       1       1206    211093  209888  0.0     2067"
It should be
"gi|169794206|ref|NC_010410.1|:974325-975530     GCF_001907155.1  ASM190715v1  NZ_JWTW03000099.1       97.595  1206    29      0       1       1206    211093  209888  0.0     2067"

gi|169794206|ref|NC_010410.1|:974325-975530     NZ_JWYU03000064.1       97.595  1206    29      0       1       1206    73028   74233   0.0     2067
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_CP010397.1   97.181  1206    34      0       1       1206    983267  984472  0.0     2039
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_HG977522.1   97.595  1206    29      0       1       1206    935195  936400  0.0     2067
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_HG977526.1   97.595  1206    29      0       1       1206    934286  935491  0.0     2067
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_AP014649.1   97.595  1206    29      0       1       1206    960072  961277  0.0     2067
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_CP010781.1   100.000 1206    0       0       1       1206    2943042 2941837 0.0     2228
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_BBSU01000014.1       98.093  1206    23      0       1       1206    30546   29341   0.0     2100
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_BBSP01000001.1       97.430  1206    31      0       1       1206    511041  509836  0.0     2056

#!/bin/bash

> ../blastn_res
for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
  # Extract the second and third parts from the filename
  filename_parts=$(echo "${sample}" | cut -d'_' -f1-3)

  # Create the blast database
  makeblastdb -in "${sample}" -dbtype nucl

  # Run blastn and append the results with filename parts
  blastn -db "${sample}" -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -strand both -max_target_seqs 1 | awk -v filename_parts="${filename_parts}" '{print $0 "\t" filename_parts}' >> ../blastn_res
done

#a specific assembly version of a genome in the NCBI (National Center for Biotechnology Information) database

#GCF_000248195.1_ASM24819v2    Acinetobacter baumannii strain NCTC 7422 contig_0002, whole genome shotgun sequence

DELETE, since they are all the same:   1.  qseqid      [query or source (gene) sequence id]
  2.  sseqid      GenBank ID   [subject or target (reference genome) sequence id]
  3.  pident      Percentage of identical positions
  4.  length      Alignment length (sequence overlap)
  5.  mismatch    Number of mismatches
  6.  gapopen     Number of gap openings
  7.  qstart      Start of alignment in query
  8.  qend        End of alignment in query
  9.  sstart      Start of alignment in subject
10.  send        End of alignment in subject
11.  evalue      Expect value
12.  bitscore    Bit score
13.              Assembly accession number

GenBank ID  Percentage of identical positions   Alignment length    Number of mismatches    Number of gap openings  Start of alignment in query End of alignment in query   Start of alignment in subject   End of alignment in subject Expect value    Bit score   Assembly accession number
query_seq_id sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
gi|169794206|ref|NC_010410.1|:974325-975530     NZ_AIED01000002.1       82.047  1192    210     4       1       1190    161177  159988  0.0     1013    GCF_000248195.1_ASM24819v2

#batch 1: until ...802.2_genomic.fna GCA_032600445.2_PDT001925801.2_genomic.fna GCA_032600465.2_PDT001925800.2_genomic.fna GCA_032600485.2_PDT001925799.2_genomic.fna GCA_032600525.2_PDT001925797.2_genomic.fna GCA_032600545.2_PDT001925796.2_genomic.fna GCA_032600565.2_PDT001925795.2_genomic.fna GCA_032600585.2_PDT001925793.2_genomic.fna
#batch 2: from GCA_032600605.2_PDT001925792.2_genomic.fna GCA_032600625.2_PDT001925794.2_genomic.fna GCA_032600645.2_PDT001925791.2_genomic.fna ...

#to compare with:
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCE020000004.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600385.2_PDT001925803.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCB020000004.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600395.2_PDT001925804.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCD020000003.1   97.595  1206    29  0   1   1206    96245   95040   0.0 2067    GCA_032600425.2_PDT001925802.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCF020000001.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600445.2_PDT001925801.2 <--
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCG020000001.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600465.2_PDT001925800.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCH020000002.1   97.595  1206    29  0   1   1206    203882  202677  0.0 2067    GCA_032600485.2_PDT001925799.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCJ020000003.1   97.595  1206    29  0   1   1206    113990  112785  0.0 2067    GCA_032600525.2_PDT001925797.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCK020000002.1   97.595  1206    29  0   1   1206    210992  209787  0.0 2067    GCA_032600545.2_PDT001925796.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600565.2_PDT001925795.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1   97.595  1206    29  0   1   1206    210993  209788  0.0 2067    GCA_032600585.2_PDT001925793.2 <--

Assembly accession number                       GenBank accession number
Can I say NZ_AIED01000002.1 Genbank_ID, and GCF_000248195.1_ASM24819v2 Assembly_ID?

Yes, you can interpret these identifiers within the given line as follows:

* NZ_AIED01000002.1 can be referred to as a GenBank accession number. This format typically represents a sequence identifier within the GenBank database, part of the NCBI's collection of databases related to nucleotide sequences and their annotations.

* GCF_000248195.1_ASM24819v2 refers to an assembly accession number from the GenBank assembly database. The "GCF" prefix indicates a RefSeq genome assembly accession, which is part of NCBI's Reference Sequence Database. The number following "GCF" is a unique identifier for the specific assembly of the genome, and the information after the underscore provides a version identifier and possibly an internal project code or name.

In a broader context, the line you've provided looks like it's from a BLAST output or a similar comparative genomics tool output. Such lines typically contain information about sequence alignments, including the query and subject sequences (with their respective accession numbers), alignment scores, and statistical significance metrics. The presence of both a sequence accession number (GenBank ID) and an assembly accession number helps link individual sequences to the larger genomic context from which they were derived.

30662 + 7989 = 38651 > 38638

gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCM020000002.1       97.595  1206    29      0       1       1206    210993  209788  0.0     2067    GCA_032600565.2_PDT001925795.2
gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCN020000002.1       97.595  1206    29      0       1       1206    210993  209788  0.0     2067    GCA_032600585.2_PDT001925793.2

gi|169794206|ref|NC_010410.1|:974325-975530     ABNRCO020000003.1       97.595  1206    29      0       1       1206    54521   53316   0.0     2067    GCA_032600605.2_PDT001925792.2

# In the file containing the following lines, I want to extract the first tokens separated by '_'.

./GCF_009013175.1_ASM901317v1_genomic.fna
./GCF_013344775.1_ASM1334477v1_genomic.fna
./GCF_013344845.1_ASM1334484v1_v1_genomic.fna
./GCF_013346345.1_ASM1334634v1_genomic.fna
./GCF_013346385.1_ASM1334638v1_genomic.fna

38651-38638=13

diff fna_list_tokens_1_3 blastn_res_f13 > diff2
diff fna_list_tokens_1_3 blastn_res_f13_uniq > diff3

# Open the file named 'fna_list' and read its content
with open('fna_list', 'r') as file:
    lines = file.readlines()

# Extract the first token after the first '_' in each line
first_tokens = [line.split('_')[1].strip() for line in lines]

# Example output
print(first_tokens)

Easyfig usage

Leave a reply

How to use the program?

D: Zoom in on a ~15kb subregion at one end of the sequences

    Access the subregion window from the Image dropdown menu; click on file 01 and then enter 32599 and 50099 in the Min and Max "Range" boxes located directly under the list of annotation files

    Click (change cutoffs)

    Click on file 02 and then enter 1 and 15000 in the Min and Max "Range" boxes and click (change cutoffs)

IMPORTANT: remember that the second Annotation file (LT2_Gifsy.gbk) has been inverted, This has been taken into account when entering the subregion range

    Close the subregion window with (close)

    Click on ( Generate blastn Files ) to generate the BLAST comparison files for these subregions (a pop-up box will ask you where you want to save the BLAST files – default is usually in the Easyfig_example1_files folder)

    CREATE FIGURE as described in part 1A.

In the new image, the scale is still 5000 bp, but the image is zoomed in on the variable region. The minimum BLAST identity value is shown in the yellow dialog box after each figure is drawn – this can be used to calibrate the BLAST identity scale shown on the right (i.e. in this case the matches range from 100% (darkest) to 85% (lightest)).

As described in the manual there are many ways to customise the image further e.g. the following list shows a few of the options available:

    * Feature rendering type from arrows to boxes or pointers
    * Colours of any of the features
    * Thickness of any lines
    * Height of BLAST matches
    * Height of features
    * Width of figure
    * Type of image file (bmp is default, but svg [scalable vector graphics] files can be produced by changing the file type)
    * Add a gene legend or label the genes**

#https://github.com/gamcil/clinker
- clinker to finish the last part of Data_PaulBongarts_S.epidermidis_HDRNA: namely compare gene order in the a part of genbank (        * genomic rearrangements (e.g. SCCmec deletions, ACME deletions, agr insertions)
        #Staphylococcal Cassette Chromosome mec
        #arginine catabolic mobile element (ACME)

        #Prevalence and genetic diversity of arginine catabolic mobile element (ACME) in clinical isolates of coagulase-negative staphylococci: identification of ACME type I variants in Staphylococcus epidermidis.
        Fig. 1. A schematic drawing of genetic structures of ACME (a region from the arc to opp3 cluster, or corresponding genetic components) among the three DI subtypes (DI.1, DI.2, and DI.3: strains CNS266, CNS115, and CNS149, respectively), type I (strain USA300-FPR3757, accession number CP000255), type II (strain ATCC12228, accession number AE015929) and type DII (strain M08/0126, accession number FR753166). Putative ORFs of genes are represented by arrows colored with green (arc cluster), red (opp3 cluster), blue (a region between the arc and opp3 clusters in ACME I), or dark blue (genes in ACME II). The regions in light pink including the arc cluster indicate genetically identical areas to both ATCC12228 and USA300-FPR3757. The regions with light blue are identical to only ATCC12228, while those with light orange to USA300-FPR3757. White space regions between argR and SAUSA300_0072 show no sequence homology either to ATCC12228 or to USA300_FPR3757; however, these regions show 91–99% nucleotide identity among the three ACME subtypes. Regions colored with dark orange in the three ACME DI subtypes show=98% nucleotide sequence identity to each other. Regions colored with grey (type DI.1), purple or cyan (type DI.3) do not show high nucleotide identity (<98%) to cognate genes in other ACME types (Table S2.2). Positions of primers used for PCR profile (Tables 1 and 4) are shown with arrowheads under ACME I sequence. Collapse

#https://mjsull.github.io/Easyfig/files.html
#https://github.com/mjsull/Easyfig/wiki/
Easyfig
# - Fig. 4. AdeIJK has fewer SNPs within it and lower levels of recombination surrounding it than AdeABC or AdeFGH.
# - In total, 100 A. baumannii genomes were aligned against reference A. baumannii AYE (NC_010410.1) and the presence of polymorphisms and recombination was determined using Gubbins.
# - (a,c,e) Magnified parts of the genome at each ade operon, showing the levels of SNPs (red and blue squares; red are ancestral SNPs) and recombination levels (the black line on the bottom; the higher the peak the more recombination).
# - The right-hand panels (b, d and f) show the entire genome and the position of each different ade operon, which is highlighted in red beneath the label.
#- All panels have an associated mid-point rooted phylogenetic tree created by Snippy to show the relatedness of the A. baumannii sequences.
#- AdeIJK (a) has fewer SNPs and recombination than AdeABC (e) and AdeFGH (c), indicating it is highly conserved.

Input Data

https://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max_hit=1000&locale=en&serv=kegg&dbkey=genome&keywords=Acinetobacter+baumannii&page=1

T00667
aby; Acinetobacter baumannii AYE --> GCA_000069245.1_ASM6924v1
T00660
abm; Acinetobacter baumannii SDF --> GCA_000069205.1_ASM6920v1
T00710
abc; Acinetobacter baumannii ACICU --> GCA_000018445.1_ASM1844v1
T00793
abn; Acinetobacter baumannii AB0057 --> GCA_000021245.2_ASM2124v2
T00795
abb; Acinetobacter baumannii AB307-0294
T01819
abx; Acinetobacter baumannii 1656-2
T01820
abz; Acinetobacter baumannii MDR-ZJ06
T01908
abd; Acinetobacter baumannii TCDC-AB0715

T02045
abr; Acinetobacter baumannii MDR-TJ
T02261
abh; Acinetobacter baumannii TYTH-1
T02491
abad; Acinetobacter baumannii D1279779
T02726
abj; Acinetobacter baumannii BJAB07104

T02727
abab; Acinetobacter baumannii BJAB0715
T02728
abaj; Acinetobacter baumannii BJAB0868
T02954
abaz; Acinetobacter baumannii ZW85-1
T03374
abk; Acinetobacter baumannii AbH12O-A2

T03375
abau; Acinetobacter baumannii AB030
T03376
abaa; Acinetobacter baumannii AB031
T03377
abw; Acinetobacter baumannii AC29
T03519
abal; Acinetobacter baumannii LAC-4

T00486
acb; Acinetobacter baumannii ATCC 17978 --> GCA_000015425.1_ASM1542v1

After downloading the Genbank, perform the following commands to calculate the Subregions positions.

~/Scripts/genbank2fasta.py A.baumannii_AYE.gbk
~/Scripts/genbank2fasta.py A.baumannii_SDF.gbk
~/Scripts/genbank2fasta.py A.baumannii_ACICU.gbk

~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AB0057.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AB307-0294.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_1656-2.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_MDR-ZJ06.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_TCDC-AB0715.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_MDR-TJ.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_TYTH-1.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_D1279779.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_BJAB07104.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_BJAB0715.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_BJAB0868.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_ZW85-1.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AbH12O-A2.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AB030.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AB031.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_AC29.gbk
~/Scripts/genbank2fasta.py Acinetobacter_baumannii_LAC-4.gbk

~/Scripts/genbank2fasta.py A.baumannii_ATCC17978.gbk

cat [the first 4 gbk files] Acinetobacter_baumannii_AB0057.gbk_converted.fna Acinetobacter_baumannii_AB307-0294.gbk_converted.fna Acinetobacter_baumannii_1656-2.gbk_converted.fna Acinetobacter_baumannii_MDR-ZJ06.gbk_converted.fna Acinetobacter_baumannii_TCDC-AB0715.gbk_converted.fna Acinetobacter_baumannii_MDR-TJ.gbk_converted.fna Acinetobacter_baumannii_TYTH-1.gbk_converted.fna Acinetobacter_baumannii_D1279779.gbk_converted.fna Acinetobacter_baumannii_BJAB07104.gbk_converted.fna Acinetobacter_baumannii_BJAB0715.gbk_converted.fna Acinetobacter_baumannii_BJAB0868.gbk_converted.fna Acinetobacter_baumannii_ZW85-1.gbk_converted.fna Acinetobacter_baumannii_AbH12O-A2.gbk_converted.fna Acinetobacter_baumannii_AB030.gbk_converted.fna Acinetobacter_baumannii_AB031.gbk_converted.fna Acinetobacter_baumannii_AC29.gbk_converted.fna Acinetobacter_baumannii_LAC-4.gbk_converted.fna > all_converted.fasta
#for file in 1.easyfig.fa 2.easyfig.fa 3.easyfig.fa 4.easyfig.fa; do (cat "${file}"; echo) >> all.easyfig.fa; done
makeblastdb -in all_converted_.fasta -dbtype nucl
#-max_target_seqs 1
blastn -db all_converted_.fasta -query ABAYE_RS05070.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ABAYE_RS05070_positions2.easyfig.out

CU459141.1_Acinetobacter_baumannii_AYE  974325  975530 --> 964325  985530
CU468230.2_Acinetobacter_baumannii_SDF  854002  855207 --> 844002  865207
CP000863.1_Acinetobacter_baumannii_ACICU    2980675 2979470  --> 2969470 2990675
CP001921.1_Acinetobacter_baumannii_1656-2   3018236 3017031 --> 3007031 to 3028236
CP009257.1_Acinetobacter_baumannii_strain_AB030 1791300 1790095 --> 1780095 to 1801300
CP009256.1_Acinetobacter_baumannii_strain_AB031 3303170 3301965 --> 3291965 to 3313170
CP001182.2_Acinetobacter_baumannii_AB0057   3087319 3086114 --> 3076114 to 3097319
CP001172.2_Acinetobacter_baumannii_AB307-0294   974367  975572 --> 964367 to 985572
CP009534.1_Acinetobacter_baumannii_strain_AbH12O-A2 2890551 2889346 --> 2879346 to 2900551
CP007535.2_Acinetobacter_baumannii_strain_AC29  1283600 1284805 --> 1273600 to 1294805
CP003847.1_Acinetobacter_baumannii_BJAB0715 3060955 3059750 --> 3049750 to 3070955
CP003849.1_Acinetobacter_baumannii_BJAB0868 2950975 2949770 --> 2939770 to 2960975
CP003846.1_Acinetobacter_baumannii_BJAB07104    3043603 3042398 --> 3032398 to 3053603
CP003967.2_Acinetobacter_baumannii_D1279779 2755134 2753929 --> 2743929 to 2765134
CP007712.1_Acinetobacter_baumannii_LAC-4    976323  977528 --> 966323 to 987528
CP003500.1_Acinetobacter_baumannii_MDR-TJ   934298  935503 --> 924298 to 945503
CP001937.2_Acinetobacter_baumannii_MDR-ZJ06 306267  305062 --> 295062 to 316267
CP002522.2_Acinetobacter_baumannii_TCDC-AB0715  3194910 3193705 --> 3183705 to 3204910
CP003856.1_Acinetobacter_baumannii_TYTH-1   3248375 3247170 --> 3237170 to 3258375
CP006768.1_Acinetobacter_baumannii_ZW85-1   921662  922867 --> 911662 to 932867
CP000521.1_Acinetobacter_baumannii_ATCC17978    2990721 2989667 and 2944541 2944384 --> 2934384 3000721

After click the button “Generate blastn Files”, manully perform the command “blastn”

makeblastdb -in 2.easyfig.fa -dbtype nucl
blastn -db 2.easyfig.fa -query 1.easyfig.fa -num_threads 15 -outfmt 6 -strand both -evalue 0.1 -max_target_seqs 1 > 12.easyfig.out
makeblastdb -in 3.easyfig.fa -dbtype nucl
blastn -db 3.easyfig.fa -query 2.easyfig.fa -num_threads 15 -outfmt 6 -strand both -evalue 0.1 -max_target_seqs 1 > 23.easyfig.out
makeblastdb -in 4.easyfig.fa -dbtype nucl
blastn -db 4.easyfig.fa -query 3.easyfig.fa -num_threads 15 -outfmt 6 -strand both -evalue 0.1 -max_target_seqs 1 > 34.easyfig.out
makeblastdb -in 5.easyfig.fa -dbtype nucl
blastn -db 5.easyfig.fa -query 4.easyfig.fa -num_threads 15 -outfmt 6 -strand both -evalue 0.1 -max_target_seqs 1 > 45.easyfig.out
...

Generate as an svg file, add the name of genome manually. Note that the following colors were used.

#https://github.com/mjsull/Easyfig/wiki/Example-2.-whole-genome-comparison
normal-minimum: aqua
normal-maximum: blue
inverted-minimum: orange
inverted-maximum: red

Attachment example3_settings.easycfg

02. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/A.baumannii_AYE.gbk    03. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/A.baumannii_SDF.gbk    04. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/A.baumannii_ACICU.gbk  05. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_1656-2.gbk 06. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AB030.gbk  07. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AB031.gbk  08. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AB0057.gbk 09. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AB307-0294.gbk 10. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AbH12O-A2.gbk  11. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_AC29.gbk   12. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_BJAB0715.gbk   13. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_BJAB0868.gbk   14. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_BJAB07104.gbk  15. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_D1279779.gbk   16. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_LAC-4.gbk  17. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_MDR-TJ.gbk 18. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_MDR-ZJ06.gbk   19. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_TCDC-AB0715.gbk    20. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_TYTH-1.gbk 21. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/Acinetobacter_baumannii_ZW85-1.gbk 01. /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/A.baumannii_ATCC17978.gbk
02  964325  985530  False
03  844002  865207  False
04  2969470 2990675 False
05  3007031 3028236 False
06  1780095 1801300 False
07  3291965
    3313170 False
08  3076114 3097319 False
09  964367  985572  False
10  2879346 2900551 False
11  1273600 1294805 False
12  3049750 3070955 False
13  2939770 2960975 False
14  3032398 3053603 False
15  2743929 2765134 False
16  966323  987528  False
17  924298  945503  False
18  295062  316267  False
19  3183705 3204910 False
20  3237170 3258375 False
21  911662  932867  False
01  2934384 3000721 False
/mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/12.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/23.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/34.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/45.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/56.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/67.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/78.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/89.easyfig.out /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/910.easyfig.out    /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1011.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1112.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1213.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1314.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1415.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1516.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1617.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1718.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1819.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/1920.easyfig.out   /mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/2021.easyfig.out
/mnt/md1/Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex/comparative_genome_plots/Easyfig_files3/zzzz.bmp
Bitmap (bmp)
5000
150
500
centre
1
1
5000
0
0.001
0
0
255
255
#00ffff
255
165
0
#ffa500
0
0
255
#0000ff
255
0
0
#ff0000
1
Top & Bottom
None
locus_tag
20
2
1
1
64
224
208
#40e0d0
arrow
0
255
140
0
#ff8c00
arrow
0
165
42
42
#a52a2a
rect
0
0
191
255
#00bfff
rect

0
72
61
139
#483d8b
arrow
None
0

1000
1000
200
Auto
0
Histogram
1
255
0
0
#FF0000
0
0
255
#0000FF
10

identificatio of IS (Insertion Sequence) elements
```
  #extracted sequence segments from the two isolates, specifically:
  #    ATCC19606: 930469 to 951674 — segment1
  #    ATCC17978: 2,934,384 to 3,000,721 — segment2
  #Then, I compared the two segments and found that positions 1-11055 of segment1 mapped to 66338-55284 of segment2, and positions 11049-21206 of segment1 mapped to 10158-23 of segment2. This means the sequence from 10159-55283 of segment2 (about 45 kb nt) is not mapped. I then extracted the 45 kb sequence (see the attached fasta file). I attempted to detect IS elements using the tool ISEScan (https://academic.oup.com/bioinformatics/article/33/21/3340/3930124). Four ISs were detected (see 45kb.fasta.xlsx; for more detailed results, see 45kb.fasta.zip).

  samtools faidx Acinetobacter_baumannii_ATCC19606.gbk_converted.fna CP059040.1:930469-951674 > ../ATCC19606_segment.fasta
  samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
  makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
  blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
  samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
```
- ISEScan: Description: Although not a database, ISEScan is a software tool used to identify IS elements in bacterial genome sequences. It can be helpful for researchers looking to analyze newly sequenced genomes for the presence of IS elements. Website: Available on platforms like GitHub for download and integration into bioinformatics workflows.
- TnCentral including ISFinder: Description: TnCentral is a more comprehensive resource that includes information about transposons, which are larger and more complex than simple IS elements but often contain IS sequences as part of their structure. This database provides detailed information about transposon structures, including associated genes and regulatory features.
- ISsaga: ISsaga is a web-based tool for the identification and annotation of insertion sequences in prokaryotic genomes. It provides various features for IS element analysis, including detection, classification, and visualization. You can access ISsaga here: ISsaga (http://issaga.biotoul.fr/ISsaga/issaga_index.php)
- ISFinder: ISFinder is a curated database and analysis platform for insertion sequences in prokaryotic genomes. It provides a comprehensive collection of IS sequences and tools for sequence analysis, classification, and annotation. You can access ISFinder here: ISFinder For ISfinder please cite: Siguier P. et al. (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34: D32-D36 (link pubmed) and the database URL (http://www-is.biotoul.fr).
- For ISbrowser please cite: Kichenaradja P. et al. (2010) ISbrowser: an extension of ISfinder for visualizing insertion sequences in prokaryotic genomes. Nucleic Acids Res. 38: D62-D68 (link pubmed). For ISsaga please cite: Varani A. et al. (2011) ISsaga: an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes, Genome Biology 2011, 12:R30 (link pubmed).
- ISMapper: ISMapper is a tool for mapping insertion sequences in bacterial genomes. It uses paired-end sequence data to identify IS element insertion sites and provides information about their genomic context. You can access ISMapper here: ISMapper ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data https://pubmed.ncbi.nlm.nih.gov/26336060/
- ISseeker: ISseeker is a software package for the identification and annotation of insertion sequences in bacterial genomes. It provides a user-friendly interface for IS element detection and characterization. You can access ISseeker here: ISseeker

clinker usage

Leave a reply

https://github.com/gamcil/clinker/

https://cagecat.bioinformatics.nl/

https://pegaso.microbiologia.ull.es/clinker-server.php

code of extract_subregion.py

import sys
from Bio import SeqIO

def extract_subregion(genbank_path, start, end, output_file):
    for record in SeqIO.parse(genbank_path, "genbank"):
        subregion = record[start:end]
        SeqIO.write(subregion, output_file, "genbank")

if __name__ == "__main__":
    if len(sys.argv) < 5:
        print("Usage: python extract_subregion.py <GenBank file> <start> <end> <output file>")
        sys.exit(1)

    genbank_file = sys.argv[1]
    start = int(sys.argv[2])
    end = int(sys.argv[3])
    output_file = sys.argv[4]

    extract_subregion(genbank_file, start, end, output_file)

commands

python3 extract_subregion.py HDRNA_01_K01_conservative_23197.current.gb 40434 55820 HDRNA_01_K01_sub.gb
python3 extract_subregion.py HDRNA_03_K01_bold_bandage_26831.current.gb 2112517 2127744 HDRNA_03_K01_sub.gb
python3 extract_subregion.py HDRNA_06_K01_conservative_27645.current_chr.gb 42053 58568 HDRNA_06_K01_sub.gb
python3 extract_subregion.py HDRNA_07_K01_conservative_27169.current_chr.gb 37601 52993 HDRNA_07_K01_sub.gb
python3 extract_subregion.py HDRNA_08_K01_conservative_32455.current_chr.gb 47084 68870 HDRNA_08_K01_sub.gb
python3 extract_subregion.py HDRNA_12_K01_bold_37467.current_chr.gb 37586 52978 HDRNA_12_K01_sub.gb
python3 extract_subregion.py HDRNA_16_K01_conservative_37834.current_chr.gb 37706 52877 HDRNA_16_K01_sub.gb
python3 extract_subregion.py HDRNA_17_K01_conservative_37288.current_chr.gb 37118 52509 HDRNA_17_K01_sub.gb
python3 extract_subregion.py HDRNA_19_K01_bold_37377.current_chr.gb 36267 51286 HDRNA_19_K01_sub.gb
python3 extract_subregion.py HDRNA_20_K01_conservative_43457.current_chr.gb 37703 53095 HDRNA_20_K01_sub.gb

clinker gbk_sub/*.gb -p plot.html --dont_set_origin -s session.json -o alignments.csv -dl "," -dc 4

Microbial bioinformatics

Microbial bioinformatics uses computational tools to analyze genomes, track evolution, and study functions in microorganisms, including bacteria and viruses.

Author Archives: gene_x

RNA-seq 2024 Ute from raw counts

miRNA 2024 Ute from raw counts

piRNA 2024 Ute from raw counts

Call and merge SNP and Indel results from using two variant calling methods

Tn-seq analysis pipeline (improved3)

Small RNA-seq analysis pipeline XICRA

干眼症

Insertion Sequence (IS) Element Detection

Easyfig usage

clinker usage