Author Archives: gene_x

Prenatal stress impairs the formation of hematopoietic stem cells (HSCs) in female offspring

  1. Introduction

    A high susceptibility to infection, or immune overreactions leading to asthma and allergies, affects large numbers of children and constitutes a major global health concern. Immune cells, red blood cells, and platelets develop through a process termed hematopoiesis, which begins prenatally and equips the offspring with immunity for the transition from the sterile uterus to a microbe-rich world. After birth, immune development continues to gain in complexity and in the refinement of its responses. Its defects, in children with impaired or overshooting immune responses, have been linked to several factors such as mode of delivery and gestational age, breastfeeding, and, more recently, prenatal stress exposure. Indeed, our group and others have shown that prenatal stress challenges are associated with an increased risk of asthma and allergies during childhood in humans. Strikingly, evidence also supports an association between prenatal stress and the risk of autoimmune disease in humans, as well as with early-life infections. Despite this growing recognition, the causal pathways underlying the association between prenatal stress and postnatal immunity remain largely unidentified. Based on this evidence, one can propose that the fetal immune organs are a vulnerable target of prenatal stress challenges, which may result in impaired fetal immune development, manifesting during childhood as a higher risk of infection and the onset of chronic immune diseases.

    Importantly, hematopoietic stem cells (HSCs) self-renew throughout an individual's life and, owing to their multipotency, continuously supply all immune cell compartments. Most of these cells are formed during the fetal period in humans and perinatally in mice. Once HSCs have homed to the fetal bone marrow, they establish definitive hematopoiesis to further generate immune cells through lineage-specific differentiation processes. It has recently been recognized that individual HSCs are highly heterogeneous and may give rise to lineage-biased progeny, rather than following the classical stepwise, bifurcating model of cell fate commitment. Nonetheless, the progressive differentiation of HSCs into lymphoid and myeloid subsets remains tightly regulated by numerous growth factors and involves extensive epigenetic remodeling. Indeed, changes in the epigenome are described in the literature as key drivers of lineage decisions. Interestingly, the epigenome is particularly vulnerable to challenges during development owing to the extensive DNA synthesis taking place. To date, the effect of maternal stress on the DNA methylation profile of the offspring has been studied in several tissues, including cord blood cells. However, whether the HSPC epigenome is specifically affected, with consequences for the anticipated cellular progeny, remains unknown. Importantly, recent literature has linked defects in HSC function (e.g., related to self-renewal and/or enhanced cellular exhaustion) to pathological processes including immune responses and cancer.

    Against this background, we hypothesized that the long-term changes in offspring immunity induced by prenatal stress exposure are mediated by early-life alterations of the HSC compartment. We therefore aimed to investigate whether the hematopoietic cascade, including lineage commitment, is modulated by prenatal stress, and whether these effects persist over time, for example through an impaired epigenome of hematopoietic stem and progenitor cells (HSPCs). The translational relevance of our observations was tested in cord blood cells from a population cohort.

  2. Materials and Methods

    • Human study design, sample collection, and analysis

      The data and samples analyzed here were collected in a study conducted at the University Medical Center starting in 2011. Briefly, inclusion criteria were maternal age of 18 years or above and a viable singleton pregnancy at gestational weeks 12-14. Women with chronic infections (HIV, hepatitis B/C), known substance abuse, smoking, multiple pregnancies, or pregnancies conceived by assisted reproductive technology were excluded. After providing written informed consent, eligible pregnant women were invited to three prenatal visits, one per trimester (gestational weeks 12-14, 24-26, and 34-36). At each visit, individual stress levels were assessed using the Perceived Stress Scale (PSS), which yields scores between 0 (low) and 40 (high).

      For the present study, only participants from whom cord blood was collected at birth were included (n = 134). Cord blood samples were processed within 24 hours of delivery. Briefly, 50 µl of whole blood were stained with fluorochrome-conjugated antibodies and then analyzed by flow cytometry. Samples without documented PSS scores during pregnancy were excluded, leaving 93 cord blood samples. Based on the stress level during pregnancy, samples were classified as mild-to-moderate stress (PSS score < 20; n = 30) or high stress (PSS score > 23; n = 45). Samples from women whose PSS scores fell between these two thresholds were excluded from the analysis owing to their greater outcome variability.
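The grouping rule above reduces to a simple threshold filter. The sketch below is purely illustrative (the function name, labels, and example scores are invented); only the thresholds (< 20 mild-to-moderate, > 23 high, in-between excluded) come from the text.

```python
def pss_group(score):
    """Assign a pregnancy stress group from a PSS score (0-40).
    Thresholds follow the study text; scores between them are excluded."""
    if not 0 <= score <= 40:
        raise ValueError("PSS scores range from 0 to 40")
    if score < 20:
        return "mild-to-moderate"
    if score > 23:
        return "high"
    return None  # scores of 20-23 are excluded from the analysis

# hypothetical PSS scores; scores in the excluded band yield None
samples = [12, 25, 21, 19, 30]
groups = [g for g in (pss_group(s) for s in samples) if g is not None]
```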

    • Mice

      Adult male and female C57BL/6 mice and male Balb/c mice were housed in the animal facility of the medical center under a 12-hour light/dark cycle with food and water available ad libitum.

    • Timed pregnancies and maternal stress exposure

      Eight-week-old C57BL/6 females were allogenically mated with Balb/c males. For the bone marrow transplantation experiments, C57BL/6 females were mated with C57BL/6 males. Detection of a vaginal plug in the morning was considered gestational day (gd) 0.5. Pregnant females were then allocated to two groups. The first group remained undisturbed throughout pregnancy, except for body weight recordings on gds 11.5, 13.5, and 15.5, and thus served as controls. The second group underwent our published prenatal stress protocol. Briefly, females were exposed to sound stress for 24 hours on gds 10.5, 12.5, and 14.5. The sound stress was generated by rodent repellent devices placed inside the stress cage, which emitted 88-dB tones at a frequency of 1200 Hz at 15-second intervals, or continuous ultrasound.

    • Bone marrow transplantation

      To generate bone marrow chimeric mice, 10 million bone marrow cells isolated from 8-week-old prenatally stress-challenged or control C57BL/6 CD45.1+ donor mice were injected via the retro-orbital vein into 10-week-old C57BL/6 CD45.2+ recipients. Injections were performed 12 hours after whole-body irradiation (10 Gy). Five weeks after transplantation, successful reconstitution of bone marrow hematopoiesis in the CD45.2+ recipients was confirmed by detecting CD45.1+ donor-derived immune cells in peripheral blood samples by flow cytometry. Nine weeks after transplantation, the animals were sacrificed to study bone marrow hematopoiesis.
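As a back-of-the-envelope illustration of the reconstitution check, donor chimerism in peripheral blood can be expressed as the CD45.1+ fraction of all CD45+ events. The helper function and the event counts below are hypothetical, not taken from the study.

```python
def donor_chimerism(cd45_1_events, cd45_2_events):
    """Percentage of donor-derived (CD45.1+) cells among all CD45+ events."""
    total = cd45_1_events + cd45_2_events
    if total == 0:
        raise ValueError("no CD45+ events recorded")
    return 100.0 * cd45_1_events / total

# hypothetical flow cytometry counts 5 weeks after transplantation
chimerism = donor_chimerism(9500, 500)  # donor vs. residual recipient cells
```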

    • Tissue collection and processing

      On embryonic day (E)18.5, control or prenatally stress-challenged pregnant mice and the corresponding fetuses were sacrificed by decapitation following CO2/O2 inhalation. Tail biopsies were collected for PCR detection of the SRY sequence to determine fetal sex. Fetal femurs and tibiae were mechanically crushed and filtered through a cell strainer to obtain single-cell bone marrow suspensions. A second cohort of pregnant mice was allowed to deliver. Three weeks later, the pups were weaned from their dams and left undisturbed until they reached 8-9 weeks of age.

      Adult offspring and bone marrow chimeric mice were sacrificed immediately after CO2/O2 inhalation. Bone marrow was flushed from the femurs and tibiae, followed by red blood cell lysis to generate single-cell suspensions.

    • Flow cytometry and cell sorting

      Bone marrow cells (total cells from individual fetuses; flushed marrow from adult bone marrow chimeric mice) were stained following our standard protocol. However, for the identification of hematopoietic cells, no Fc block was used prior to incubation with the antibody cocktails, in order to allow binding of antibodies directed against the CD16/32 epitope. Samples were acquired on an LSR Fortessa flow cytometer (BD).

      For cell sorting, samples from three sex-matched offspring were pooled and stained as usual. Bone marrow HSPCs for methylome analysis were isolated using a FACSAria cell sorter (BD Bioscience). The purity of the cell suspensions was confirmed by subsequent flow cytometry.

    • Methylated DNA immunoprecipitation sequencing

      Methylated DNA immunoprecipitation (MeDIP) followed by short-read sequencing to identify methylation-enriched DNA regions was performed as described in recently published literature. Briefly, HSCs were sorted from the bone marrow of 10-11-week-old MMc+ and MMclow offspring. Genomic DNA was extracted from 1.5×10^5 HSCs by resuspending the cells in 300 µl of lysis buffer (7.5% NaHCO3), followed by phenol-chloroform extraction and ethanol precipitation. The isolated genomic DNA was sheared by sonication using a Bioruptor (Diagenode) to an average fragment size of 50-100 bp. As a quality control, 0.025 ng of in vitro methylated pCR2.1 plasmid DNA was added to each sample as a spike-in control. The DNA was denatured at 98°C for 10 minutes and chilled on ice for 10 minutes. One tenth of each sample was kept as input control. For the MeDIP precipitation, 5 µg of a 5-mC-specific antibody (Diagenode) was added to the remainder of each sample and rotated for 2 hours at 4°C. Subsequently, 50 µl of M-280 anti-mouse IgG Dynabeads were added and incubated on a rotating wheel for 2 hours at 4°C. The following steps were performed in parallel with the input samples. RNA was digested with 1.3 µl of RNase A (20-40 mg/ml) for 30 minutes at 37°C, followed by protein digestion with 200 µl of elution buffer (50 mM Tris-HCl, 10 mM EDTA, and 1% SDS), 4 µl of proteinase K (40 mg/ml), and 7 µl of CaCl2 (300 mM) for 30 minutes at 55°C. DNA was extracted by phenol-chloroform extraction and ethanol precipitation as described above.

      MeDIP-compatible next-generation sequencing libraries were generated from 5 ng of DNA using the DNA SMART ChIP-Seq Kit (TaKaRa, Clontech, Cat-No.: 634865), which is suitable for single-stranded DNA. MeDIP libraries were sequenced on a NextSeq 500 system (Illumina, Cat-No.: 20024906) using single-read (1×75) flow cells, with 30 million reads per MeDIP sample and 10 million reads per input control.

      After quality control of the MeDIP-seq data with FastQC, reads were aligned to the mm10 genome using BWA, generating BAM files. Alignment files were inspected with the IGV software. For the detection of differentially methylated regions (DMRs) and gene ontology analysis, the diffReps tool was used with default parameters, except for a window size of 800 nt. Genes with significant DMRs (P value < 0.0001, G-test) were analyzed in IGV. Significantly up-regulated (hypermethylated, log2FC > 1, adj. p < 0.05) or down-regulated (hypomethylated, log2FC < -1, adj. p < 0.05) differentially methylated genes were categorized by their biological function in WebGestalt (see Supplementary Fig. 6c). Further analyses were supported by sam2bedgff.pl, deepTools bamCoverage, and Circos (see code availability).
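The hyper-/hypomethylation calls that feed WebGestalt reduce to a filter on log2 fold change and adjusted p value. The sketch below assumes DMRs are available as (gene, log2FC, p-adj) tuples; this tuple format and the example entries are illustrative, not the actual diffReps output.

```python
def classify_dmrs(dmrs, lfc=1.0, alpha=0.05):
    """Split DMRs into hyper- and hypomethylated gene lists.
    dmrs: iterable of (gene, log2fc, padj) tuples; cutoffs follow the text."""
    hyper = [g for g, fc, p in dmrs if fc > lfc and p < alpha]
    hypo = [g for g, fc, p in dmrs if fc < -lfc and p < alpha]
    return hyper, hypo

# invented example records; Hoxa5 and C1qc are gene names mentioned in the text
dmrs = [("Hoxa5", 1.8, 0.01), ("C1qc", 1.2, 0.03),
        ("GeneX", -1.5, 0.02), ("GeneY", 0.4, 0.30)]
hyper, hypo = classify_dmrs(dmrs)
```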

    • Statistical analysis

      Statistical analyses were performed with Prism 9 software. Results were compared by two-way ANOVA (factors: stress exposure and sex). Whenever the ANOVA was significant (p < 0.05 for the stress-exposure factor or the stress-by-sex interaction), Sidak's multiple comparisons test was run to compare the effect of stress within sex-matched offspring. Significant values are indicated in the figures.
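For readers without Prism, the sums of squares behind a balanced two-way ANOVA (factors: stress exposure and sex) can be computed by hand. The data below are invented solely to exercise the arithmetic; converting the F statistics into p values (and running Sidak's post hoc test) would additionally require an F distribution, e.g. from scipy.stats.

```python
def two_way_anova(cells):
    """Balanced two-way ANOVA. cells maps (level_of_A, level_of_B) -> list of
    replicate values; here A = stress exposure and B = sex."""
    A = sorted({a for a, _ in cells})
    B = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))              # replicates per cell
    vals = [v for vs in cells.values() for v in vs]
    grand = sum(vals) / len(vals)
    mA = {a: sum(sum(cells[(a, b)]) for b in B) / (n * len(B)) for a in A}
    mB = {b: sum(sum(cells[(a, b)]) for a in A) / (n * len(A)) for b in B}
    mAB = {k: sum(v) / n for k, v in cells.items()}
    ssA = n * len(B) * sum((mA[a] - grand) ** 2 for a in A)
    ssB = n * len(A) * sum((mB[b] - grand) ** 2 for b in B)
    ssAB = n * sum((mAB[(a, b)] - mA[a] - mB[b] + grand) ** 2
                   for a in A for b in B)
    ssE = sum((v - mAB[k]) ** 2 for k, vs in cells.items() for v in vs)
    dfA, dfB = len(A) - 1, len(B) - 1
    msE = ssE / (len(A) * len(B) * (n - 1))
    return {"ssA": ssA, "ssB": ssB, "ssAB": ssAB, "ssE": ssE,
            "F_stress": (ssA / dfA) / msE,
            "F_sex": (ssB / dfB) / msE,
            "F_interaction": (ssAB / (dfA * dfB)) / msE}

# invented example measurements, 3 replicates per group
cells = {("control", "female"): [10, 11, 12],
         ("control", "male"):   [10, 12, 14],
         ("stress",  "female"): [6, 7, 8],
         ("stress",  "male"):   [9, 10, 11]}
res = two_way_anova(cells)
```

The partition is easy to check: ssA + ssB + ssAB + ssE must equal the total sum of squares around the grand mean.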

  3. Results

    • Prenatal stress impairs the formation of fetal hematopoietic stem and progenitor cells (HSPCs) during in utero life

    • Maternal stress also affects HSPC abundance in early offspring life

    • HSPCs from prenatally stress-challenged offspring retain an impaired regenerative capacity

      Next, we set out to investigate whether the short-term effects of prenatal stress on the fetal bone marrow HSPC compartment have consequences in adulthood. To this end, bone marrow cells from adult control or prenatally stress-challenged offspring were transplanted into lethally irradiated, unperturbed C57BL/6 mice (Fig. 3.A). Upon reconstitution of bone marrow hematopoiesis after transplantation, HSPCs were identified as before. The frequency of LTHSCs was reduced in transplanted mice carrying HSPCs from prenatally stress-challenged offspring, compared with the respective sex-matched controls (ANOVA factor stress p < 0.05) (Fig. 3.B). The post hoc comparisons did not reach statistical significance, possibly owing to the moderate n of the experiment (n = 4-5). Bone marrow chimeric mice carrying stressed HSPCs tended toward reduced STHSCs, without reaching statistical significance. The other HSPC subsets investigated, i.e., MPP2, 3 and 4, CMPs, GMPs, MEPs, and CLPs, did not differ between mice transplanted with HSPCs derived from control or prenatally stress-challenged offspring (Fig. 3.C-F) (ANOVA factor stress / stress-by-sex interaction p > 0.05).

    • Altered methylation patterns as a long-term mechanism in HSPCs of prenatally stress-challenged mice

      DNA methylation not only serves as an important epigenetic mechanism for stably passing information from a cell to its progeny, but is also involved in the differentiation of HSPCs toward the various blood cell lineages. We therefore set out to investigate whether the impaired regenerative capacity of HSPCs from prenatally stressed offspring is linked to altered epigenetic signatures maintained into adulthood, by subjecting purified HSPCs to methylated DNA immunoprecipitation sequencing (MeDIP-Seq) (Supplementary Fig. 1). Remarkably, differentially methylated regions (DMRs) could be identified in the HSPC genome between prenatally stress-challenged offspring and controls (Fig. 4.A,B). These changes were more pronounced in female offspring. Indeed, prenatally stress-challenged females not only presented more DMRs than males, but these were also more frequently located in promoter regions. DMRs in HSPCs from prenatally stressed females generally showed increased methylation (hypermethylation), whereas in prenatally stressed males the DMRs were more evenly distributed between hyper- and hypomethylation. This pattern was also observed in gene promoters, where hypermethylation is associated with a loss of transcriptional activity and gene inactivation. In particular, the promoters of C1qc, which is involved in suppressing myeloid cell differentiation, and of Hoxa5, which is expressed in proliferating HSCs, were hypermethylated in females. Of note, hypermethylation of gene body regions was also observed in HSPC samples from prenatally stressed females; this is not associated with transcriptional repression but rather with differential mRNA splicing and/or transcriptional elongation.

  4. Discussion

    In our present study, we observed that prenatal stress impairs the formation of the bone marrow HSC compartment, particularly affecting female mice. This effect is highly conserved, as HSPCs were also reduced in cord blood of female infants born to women who had experienced high stress during the first to second trimester of pregnancy. In adulthood, bone marrow cells from prenatally stress-challenged mouse offspring failed to fully reconstitute the LTHSC compartment, irrespective of sex. This may be related to the fact that the prenatal stress challenge markedly affected the HSPC methylome. The changes were sex-specific: in prenatally stressed females, prenatal stress induced hypermethylation of gene promoters associated with transcriptional down-regulation, whereas in prenatally stressed males hypo- and hypermethylated DMRs were balanced. Taken together, our results underscore that prenatal stress exposure can markedly affect the early-life HSPC compartment, with substantial long-term consequences in adulthood.

    The fetal time window we investigated coincides with the early steps of definitive hematopoiesis in the murine bone marrow. Although HSPCs in the fetal liver begin migrating to the bone marrow at E15.5-17.5, bone marrow HSCs acquire multipotency only close to birth (around E18.5-19), supported by specific transcriptional signatures. Thereafter, HSCs continue to proliferate until they enter quiescence during the first weeks after birth. Our results, showing reduced numbers of E18.5 bone marrow cells, especially LTHSCs, particularly in prenatally stressed female offspring, provide a snapshot of this continuous process. Indeed, that these short-term changes were already observable at these initial points of HSC development highlights their involvement in the fetal programming of nascent immunity, independent of the effects prenatal stress may have on microbial colonization at delivery and on early maternal caregiving behavior. This is important because, although it is now well established that prenatal stress exposure is associated with immune alterations at birth, whether these originate from changes in the HSC compartment is less well understood. Links between prenatal events and altered HSPC formation and/or function have so far been found only in the context of other prenatal insults. Among these, prenatal acetaminophen consumption reduced HSPCs in the murine fetal liver and tended to reduce HSPC counts in human cord blood compared with controls. Conversely, maternal immune activation, used to model prenatal infection in mice, showed that inflammation triggers the proliferation of fetal liver HSPCs, enhancing the generation of MPP4 cells and myeloid cells.

    To date, the effect of prenatal stress on HSPC methylation and its consequences has been studied very little in humans or animal models. Prenatal stress did not cause a dramatic disruption of the DNA methylation pattern of HSPCs in male or female offspring; this pattern remained remarkably stable. However, comparing methylated regions at a resolution of 200-1000 bp revealed that stress permanently affected the methylation profile of adult offspring HSPCs, which may have consequences for their self-renewal, multipotency, or other properties. In humans, various proxies of maternal stress, such as anxiety, affective disorders, or exposure to domestic violence, have been associated with altered DNA methylation in the offspring. This is also evident in immune compartments, for example in human adolescent T cells and cord blood leukocytes, where the glucocorticoid receptor gene has frequently been found to be hypermethylated. Our results provide a basis for these observations, as the imprint of stress on the HSPC methylome may be passed on to their progeny, comprising all cellular components of the blood. In contrast to the observations in humans, our dataset did not show any significant change in the methylation profile of the glucocorticoid receptor gene Nr3c1. However, we did observe changes in the promoters of genes related to hematopoiesis and immunity. Among these, HSPCs from prenatally stressed females showed promoter hypermethylation of the Hoxa5 (homeobox A5) and C1qc genes. Hoxa5 is preferentially expressed in HSPCs and down-regulated upon differentiation into MPPs, which may indicate a reduced HSC signature in adult HSPC bone marrow cells. Moreover, activation of genes of the Hoxa cluster is a prerequisite for HSPC proliferation and regeneration, with Hoxa5 being up-regulated during their expansion. We are therefore tempted to speculate that hypermethylation of the Hoxa5 promoter may be linked to an epigenetically constrained expansion, as observed in our bone marrow transplantation experiments, or possibly during fetal life. C1qc is involved in suppressing myeloid cell differentiation; hence, hypermethylation of its promoter and the anticipated inactivation of C1qc may lead to a differentiation bias toward increased CMPs, as observed in fetal mice but not in the adult bone marrow chimeras.

Differential Gene Analysis and Visualization for HSC Methylome Data

A Gene Ontology (GO) enrichment analysis was conducted on differential genes derived from both the male and female conditions. The datasets were imported and preprocessed, followed by application of the clusterProfiler package for enrichment analysis. Significant GO terms were then identified and visualized in a tailored bubble plot, with colors denoting the different ontologies. The final results were saved in both Excel and PNG formats.

install.packages("BiocManager")
BiocManager::install("clusterProfiler")
BiocManager::install("enrichplot")
if(!requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Mm.eg.db")
}

# Load necessary libraries
library(clusterProfiler)
library(enrichplot)
library(org.Mm.eg.db)
library(AnnotationDbi)

# Read differential genes and GO terms
male_diff_genes <- read.csv("male_n_reworked.csv", header = TRUE, stringsAsFactors = FALSE)
#go_terms_list <- read.csv("male_n_reworked.csv", header = TRUE, stringsAsFactors = FALSE)
# Gene symbols are expected in a column named 'GeneSymbol'
male_gene_vector_clean <- na.omit(male_diff_genes$GeneSymbol)

# # Convert to EntrezIDs
# male_entrez_ids <- select(org.Mm.eg.db, keys = male_gene_vector_clean, keytype = "SYMBOL", columns = "ENTREZID")
# # Filter out any NAs or genes not found
# male_entrez_ids_clean <- na.omit(male_entrez_ids$ENTREZID)

# Perform GO enrichment analysis
# org.Mm.eg.db (mouse) is used here; replace with the appropriate organism db if different.
male_ego <- enrichGO(gene = male_gene_vector_clean, 
                    OrgDb = org.Mm.eg.db, 
                    keyType = "SYMBOL",
                    ont = "ALL", 
                    pAdjustMethod = "none",
                    pvalueCutoff = 1.0,
                    qvalueCutoff = 1.0, 
                    readable = TRUE)                    
print(male_ego$ID)
View(male_ego)
write.csv(male_ego, "male_n_reworked_GO.csv", row.names = FALSE)

#NOTE that we should manually delete the second line of csv-file since the line is empty in the original Excel-file.
# Read differential genes and GO terms
female_diff_genes <- read.csv("female_n_reworked.csv", header = TRUE, stringsAsFactors = FALSE)
#go_terms_list <- read.csv("female_n_reworked.csv", header = TRUE, stringsAsFactors = FALSE)
# Gene symbols are expected in a column named 'GeneSymbol'
female_gene_vector_clean <- na.omit(female_diff_genes$GeneSymbol)

# Perform GO enrichment analysis
# org.Mm.eg.db (mouse) is used here; replace with the appropriate organism db if different.
female_ego <- enrichGO(gene = female_gene_vector_clean, 
                    OrgDb = org.Mm.eg.db, 
                    keyType = "SYMBOL",
                    ont = "ALL", 
                    pAdjustMethod = "none",
                    pvalueCutoff = 1.0,
                    qvalueCutoff = 1.0, 
                    readable = TRUE)
print(female_ego$ID)
View(female_ego)
write.csv(female_ego, "female_n_reworked_GO.csv", row.names = FALSE)

#~/Tools/csv2xls-0.4/csv_to_xls.py male_n_reworked_GO.csv female_n_reworked_GO.csv -d',' -o GO_enrichments.xls

# ---- calculate the significant gene-list ----
male_ego_ <- enrichGO(gene = male_gene_vector_clean, 
                    OrgDb = org.Mm.eg.db, 
                    keyType = "SYMBOL",
                    ont = "ALL", 
                    pAdjustMethod = "BH",
                    readable = TRUE)
female_ego_ <- enrichGO(gene = female_gene_vector_clean, 
                    OrgDb = org.Mm.eg.db, 
                    keyType = "SYMBOL",
                    ont = "ALL", 
                    pAdjustMethod = "BH",
                    readable = TRUE)
# Merge and deduplicate
merged_description <- unique(c(male_ego_$Description, female_ego_$Description))
merged_ID <- unique(c(male_ego_$ID, female_ego_$ID))

#go_terms_list <- read.csv("go_terms.csv", header = TRUE, stringsAsFactors = FALSE)
#go_terms_vector <- go_terms_list$go_terms
#go_terms_vector <- c("GO:0090140", "GO:0006664", "GO:0046488", "GO:1903509")
# Filter the results by the provided GO terms
male_filtered_ego <- subset(as.data.frame(male_ego), ID %in% merged_ID)
female_filtered_ego <- subset(as.data.frame(female_ego), ID %in% merged_ID)
female_filtered_ego$Sex <- "Female"
male_filtered_ego$Sex <- "Male"
mydat <- rbind(male_filtered_ego, female_filtered_ego)

library(ggplot2)
library(dplyr)
library(magrittr)
library(tidyr)
library(forcats)

#mydat <- read.csv2("GO_BP_enrichment_genebody.txt", sep="\t", header=TRUE)
#mydat$FoldEnrichment <- as.numeric(mydat$FoldEnrichment)
#mydat$GeneRatio <- sapply(as.character(mydat$GeneRatio), function(x) eval(parse(text=x)))
#mydat$p.adjust <- as.numeric(as.character(mydat$FDR))
mydat$Sex <- factor(mydat$Sex, levels=c("Male","Female")) 
mydat$Description <- factor(mydat$Description, levels=rev(merged_description))
mydat$ONTOLOGY <- factor(mydat$ONTOLOGY, levels=c("BP","CC", "MF"))
mydat$log_p.adjust <- -log10(mydat$p.adjust)

# Define a custom scale function and its inverse
rescale_fun <- function(x) {
  return(-log10(x))
}

inverse_fun <- function(x) {
  return(10^(-x))
}

# Use the custom scale function in the ggplot
png("bubble.png", width=800, height=860)
ggplot(mydat, aes(y = Description, x = Sex, size = p.adjust)) + 
geom_point(aes(color = ONTOLOGY), alpha = 1.0) + 
labs(x = "", y = "") + 
theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + 
theme(axis.text = element_text(size = 20)) + 
theme(legend.text = element_text(size = 20)) + 
theme(legend.title = element_text(size = 20)) + 
scale_size_continuous(trans = scales::trans_new(name = "custom", transform = rescale_fun, inverse = inverse_fun),
                      breaks = c(1, 0.5, 0.1, 0.05, 0.01, 0.001), 
                      labels = c("1", "0.5", "0.1", "0.05", "0.01", "0.001"),
                      range = c(1, 13)) + 
guides(color = guide_legend(override.aes = list(size = 10))) +
scale_color_manual(values = c("BP" = "blue", "CC" = "red", "MF" = "green"))
dev.off()

RNAseq Analysis Overview of Osteoblasts Infected with S. epidermidis Strains

The analysis provides insights into the osteoblasts’ response when infected with two distinct strains of S. epidermidis: the biofilm-positive 1457 and its biofilm-negative counterpart 1457-M10. The PCA plot reveals nuanced differences between these infections at both the 2-hour and 3-day time points.

For a more detailed understanding, users can explore the DEGs_heatmap.png, showcasing a heatmap representation of all differentially expressed genes (DEGs). Associated expression data can be found in the gene_clusters.xls file.

The depth of this analysis is further evident in the detailed comparison sets, which include:

  • Immediate response (2h) of osteoblasts to 1457 vs. baseline (uninfected)
  • Immediate response (2h) to 1457-M10 vs. baseline
  • Extended infection duration (3 days) with 1457 vs. baseline
  • Extended duration with 1457-M10 vs. baseline
  • Head-to-head comparison of osteoblast responses: 1457 vs. 1457-M10 at 2h and 3 days
  • Temporal response shifts within each strain, comparing 2h to 3 days.

For visual clarity, each comparison set is complemented by a volcano plot and relevant data is organized in Excel tables. All necessary files and resources are provided for comprehensive exploration.

  1. running nextflow and multiqc

    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (rnaseq) [jhuang@sage Data_Samira_RNAseq]$ nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 -profile test_full  -resume   --max_memory 256.GB --max_time 2400.h    --save_align_intermeds --save_unaligned    --aligner 'star_salmon' --skip_multiqc
    
    #-- rerun multiqc --
    ln -s ~/Tools/rnaseq/assets/multiqc_config.yaml multiqc_config.yaml
    conda activate rnaseq
    multiqc -f --config multiqc_config.yaml . 2>&1
    rm multiqc_config.yaml
    conda deactivate
  2. R-code for evaluation

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    
    library(tximport)
    library(DESeq2)
    
    setwd("~/DATA/Data_Samira_RNAseq/results2_GRCh38/star_salmon")
    
    # Define paths to the Salmon quantification files; quant.sf holds transcript-level counts that are later summarized to gene-level counts.
    files <- c("1457.2h_r1" = "./1457.2h_r1/quant.sf",
              "1457.2h_r2" = "./1457.2h_r2/quant.sf",
              "1457.2h_r3" = "./1457.2h_r3/quant.sf",
              "1457.3d_r1" = "./1457.3d_r1/quant.sf",
              "1457.3d_r2" = "./1457.3d_r2/quant.sf",
              "uninfected.2h_r1" = "./uninfected.2h_r1/quant.sf",
              "uninfected.2h_r2" = "./uninfected.2h_r2/quant.sf",
              "uninfected.3h_r3" = "./uninfected.3h_r3/quant.sf",
              "uninfected.3d_r1" = "./uninfected.3d_r1/quant.sf",
              "uninfected.3d_r2" = "./uninfected.3d_r2/quant.sf",
              "1457-M10.2h_r1" = "./1457-M10.2h_r1/quant.sf",
              "1457-M10.2h_r2" = "./1457-M10.2h_r2/quant.sf",
              "1457-M10.2h_r3" = "./1457-M10.2h_r3/quant.sf",
              "1457-M10.3d_r1" = "./1457-M10.3d_r1/quant.sf",
              "1457-M10.3d_r2" = "./1457-M10.3d_r2/quant.sf",
              "1457-M10.3d_r3" = "./1457-M10.3d_r3/quant.sf")
    
    # ---- tx-level count data ----
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
    # Define the replicates and condition of the samples
    #replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"))
    condition <- factor(c("1457.2h", "1457.2h", "1457.2h", "1457.3d", "1457.3d", "uninfected.2h", "uninfected.2h", "uninfected.2h", "uninfected.3d", "uninfected.3d", "1457-M10.2h", "1457-M10.2h", "1457-M10.2h", "1457-M10.3d", "1457-M10.3d", "1457-M10.3d"))
    
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, row.names=names(files))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    
    # In the context of your new code which is using tximport and DESeq2, you don't necessarily need this step. The reason is that DESeq2 performs its own filtering of low-count genes during the normalization and differential expression steps.
    # Filter data to retain only genes with more than 2 counts > 3 across all samples
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    
    # ---- gene-level count data ----
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
    #TODO: why were so many reads removed as too_short?
    #STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    
    dim(counts(dds))
    head(counts(dds), 10)  
  3. (optional) draw 3D PCA plots.

    library(gplots) 
    library("RColorBrewer")
    
    library(ggplot2)
    data <- plotPCA(rld, intgroup=c("condition"), returnData=TRUE)  # colData only contains "condition"
    write.csv(data, file="plotPCA_data.csv")
    #calculate all PCs including PC3 with the following codes
    library(genefilter)
    ntop <- 500
    rv <- rowVars(assay(rld))
    select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    mat <- t( assay(rld)[select, ] )
    pc <- prcomp(mat)
    pc$x[,1:3]
    #df_pc <- data.frame(pc$x[,1:3])
    df_pc <- data.frame(pc$x)
    identical(rownames(data), rownames(df_pc)) #-->TRUE
    
    data$PC1 <- NULL
    data$PC2 <- NULL
    merged_df <- merge(data, df_pc, by = "row.names")
    #merged_df <- merged_df[, -1]
    row.names(merged_df) <- merged_df$Row.names
    merged_df$Row.names <- NULL  # remove the "name" column
    merged_df$name <- NULL
    merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","group","condition")]
    write.csv(merged_df, file="merged_df_10PCs.csv")
    summary(pc)  
    #0.5333  0.2125 0.06852
    
    draw_3D.py
  4. draw 2D PCA plots.

    # -- pca --
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    # -- heatmap --
    png("heatmap.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
  5. (optional) estimate size factors

    > head(dds)
    class: DESeqDataSet 
    dim: 6 10 
    metadata(1): version
    assays(6): counts avgTxLength ... H cooks
    rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460
      ENSG00000000938
    rowData names(34): baseMean baseVar ... deviance maxCooks
    colnames(10): control_r1 control_r2 ... HSV.d8_r1 HSV.d8_r2
    colData names(2): condition replicate
    
    #convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor
    sizeFactors(dds)
    #NULL
    dds <- estimateSizeFactors(dds)
    > sizeFactors(dds)
    
    normalized_counts <- counts(dds, normalized=TRUE)
    #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
    # ---- DEBUG sizeFactors(dds) always NULL, see https://support.bioconductor.org/p/97676/ ----
    nm <- assays(dds)[["avgTxLength"]]
    sf <- estimateSizeFactorsForMatrix(counts(dds), normMatrix=nm)
    
    assays(dds)$counts  # for count data
    assays(dds)$avgTxLength  # for average transcript length, etc.
    assays(dds)$normalizationFactors
    
    In normal circumstances, the size factors should be stored in the DESeqDataSet object itself and not in the assays, so they are typically not retrievable via the assays() function. However, due to the issues you're experiencing, you might be able to manually compute the size factors and assign them back to the DESeqDataSet.
    
    To calculate size factors manually, DESeq2 uses the median ratio method. Here's a very simplified version of how you could compute this manually:
    > assays(dds)
    List of length 6
    names(6): counts avgTxLength normalizationFactors mu H cooks
    
    geoMeans <- apply(assays(dds)$counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
    # one size factor per sample: the median of that sample's ratios to the pseudo-reference
    sizeFactors(dds) <- apply(assays(dds)$counts / geoMeans, 2, function(col) median(col[is.finite(col)], na.rm = TRUE))
    
    # ---- DEBUG END ----
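The median-of-ratios method that the DEBUG block reproduces can be stated compactly: per-gene geometric means across samples form a pseudo-reference, and each sample's size factor is the median of its ratios to that reference. A minimal Python transcription (gene-by-sample count lists; zeros skipped in the geometric mean, as in the R snippet above):

```python
import math

def size_factors(counts):
    """counts: list of per-gene lists, one count per sample (genes x samples).
    Returns one size factor per sample (median-of-ratios method)."""
    # geometric mean per gene over its nonzero counts; 0 if the gene is all-zero
    geo = [math.exp(sum(math.log(c) for c in row if c) /
                    len([c for c in row if c])) if any(row) else 0.0
           for row in counts]
    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        ratios = sorted(counts[i][j] / geo[i]
                        for i in range(len(counts)) if geo[i] > 0)
        mid = len(ratios) // 2
        med = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(med)
    return factors

# toy example: the second sample is sequenced at exactly twice the depth
factors = size_factors([[10, 20], [100, 200], [50, 100]])
```

Dividing counts by these factors (or feeding their inverses to bamCoverage's --scaleFactor, as done below) puts samples of different depth on a common scale.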
    
    # in the console:
    #  control_r1  ...
    # 1/0.9978755  ... 
    
    > sizeFactors(dds)
                        HeLa_TO_r1                      HeLa_TO_r2 
                          0.9978755                       1.1092227 
    
    1/0.9978755=1.002129023
    1/1.1092227=0.901532217
    
    #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
  6. differential expressions

    #Guideline for Comparisons: 
    #   1457.2h vs uninfected.2h
    #   1457-M10.2h vs uninfected.2h
    #   
    #   1457.3d vs uninfected.3d
    #   1457-M10.3d vs uninfected.3d
    #   
    #   1457.2h infection vs 1457-M10 2h infection
    #   
    #   1457.3d infection vs 1457-M10 3 day infection
    #   
    #   1457 3 days vs 1457 2h 
    #   
    #   1457-M10 3 days vs 1457-M10 2h
    
    #"1457.2h", "1457.2h", "1457.2h", "1457.3d", "1457.3d", "uninfected.2h", "uninfected.2h", "uninfected.2h", "uninfected.3d", "uninfected.3d", "1457-M10.2h", "1457-M10.2h", "1457-M10.2h", "1457-M10.3d", "1457-M10.3d", "1457-M10.3d"
    
    #A central method for exploring differences between groups of segments or samples is to perform differential gene expression analysis. 
    
    dds$condition <- relevel(dds$condition, "uninfected.2h")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.M10.2h_vs_uninfected.2h","1457.2h_vs_uninfected.2h")
    
    dds$condition <- relevel(dds$condition, "uninfected.3d")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.M10.3d_vs_uninfected.3d","1457.3d_vs_uninfected.3d")
    
    dds$condition <- relevel(dds$condition, "1457-M10.2h")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.2h_vs_1457.M10.2h")
    
    dds$condition <- relevel(dds$condition, "1457-M10.3d")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.3d_vs_1457.M10.3d")
    
    dds$condition <- relevel(dds$condition, "1457.2h")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.3d_vs_1457.2h")
    
    dds$condition <- relevel(dds$condition, "1457-M10.2h")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("1457.M10.3d_vs_1457.M10.2h")
    
    ##https://bioconductor.statistik.tu-dortmund.de/packages/3.7/data/annotation/
    #BiocManager::install("EnsDb.Mmusculus.v79")
    #library(EnsDb.Mmusculus.v79)
    #edb <- EnsDb.Mmusculus.v79
    
    #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
    #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
    library(biomaRt)
    listEnsembl()
    listMarts()
    #ensembl <- useEnsembl(biomart = "genes", mirror="asia")  # default is Mouse strains 104
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", mirror = "www")
    #ensembl = useMart("ensembl_mart_44", dataset="hsapiens_gene_ensembl",archive=TRUE, mysql=TRUE)
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="104")
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="86")
    #--> total 69, 27  GRCh38.p7 and 39  GRCm38.p4; we should take 104, since rnaseq-pipeline is also using annotation of 104!
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
    datasets <- listDatasets(ensembl)
    #--> total 202   80                         GRCh38.p13         107                            GRCm39
    #80           hsapiens_gene_ensembl                                      Human genes (GRCh38.p13)                         GRCh38.p13
    #107         mmusculus_gene_ensembl                                        Mouse genes (GRCm39)                            GRCm39
    
    > listEnsemblArchives()
                name     date                                 url version
    1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
    2     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
    3     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
    4     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
    5     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
    6     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
    7     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
    
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    #https://www.ncbi.nlm.nih.gov/grc/human
    #BiocManager::install("org.Mmu.eg.db")
    #library("org.Mmu.eg.db")
    #edb <- org.Mmu.eg.db
    #
    #https://bioconductor.statistik.tu-dortmund.de/packages/3.6/data/annotation/
    #EnsDb.Mmusculus.v79
    #> query(hub, c("EnsDb", "apiens", "98"))
    #columns(edb)
    
    #searchAttributes(mart = ensembl, pattern = "symbol")
    
    ##https://www.geeksforgeeks.org/remove-duplicate-rows-based-on-multiple-columns-using-dplyr-in-r/
    library(dplyr)
    library(tidyverse)
    #df <- data.frame(lang = c('Java','C','Python','GO','RUST','Javascript',
    #                          'Cpp','Java','Julia','Typescript','Python','GO'),
    #                 value = c(21,21,3,5,180,9,12,20,6,0,3,6),
    #                 usage = c(21,21,0,99,44,48,53,16,6,8,0,6))
    #distinct(df, lang, .keep_all= TRUE)
    
    for (i in clist) {
    #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
    
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = rownames(res), 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
    
      #merge by the common column, in this case the Ensembl gene ID
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    #-- show methods of class DESeq2 --
    #x=capture.output(showMethods(class="DESeq2"))
    #unlist(lapply(strsplit(x[grep("Function: ",x,)]," "),function(x) x[2]))
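The up/down split written out by the loop above uses fixed cutoffs (padj <= 0.05, |log2FC| >= 2). The same filter in Python, on a hypothetical list of result rows (the tuple layout is illustrative, not the DESeq2 output format):

```python
def split_degs(rows, alpha=0.05, lfc=2.0):
    """rows: (gene, log2fc, padj) tuples. Returns (up, down) gene lists,
    each ordered by decreasing absolute fold change."""
    up = sorted((r for r in rows
                 if r[2] is not None and r[2] <= alpha and r[1] >= lfc),
                key=lambda r: -r[1])
    down = sorted((r for r in rows
                   if r[2] is not None and r[2] <= alpha and r[1] <= -lfc),
                  key=lambda r: r[1])
    return [r[0] for r in up], [r[0] for r in down]

# a padj of None (NA in R) is treated as non-significant, mirroring subset()
rows = [("A", 3.1, 0.01), ("B", -2.5, 0.04), ("C", 1.0, 0.001), ("D", 4.0, None)]
up, down = split_degs(rows)
```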
  7. volcano plots with automatic top_g selection

    #A canonical visualization for interpreting differential gene expression results is the volcano plot.
    library(ggrepel) 
    
    for i in 1457.M10.2h_vs_uninfected.2h 1457.2h_vs_uninfected.2h 1457.M10.3d_vs_uninfected.3d 1457.3d_vs_uninfected.3d 1457.2h_vs_1457.M10.2h 1457.3d_vs_1457.M10.3d 1457.3d_vs_1457.2h 1457.M10.3d_vs_1457.M10.2h; do
    #HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control
    #for i in K3R_24hdox_vs_K3R_3hdox21hchase WT_3hdox21hchase_vs_K3R_3hdox21hchase; do
    #for i in WT_24hdox_vs_K3R_24hdox; do
    #for i in WT_24hdox_vs_WT_3hdox21hchase; do
      # read files to geness_res
      echo "geness_res <- read.csv(file = paste(\"${i}\", \"all.txt\", sep=\"-\"), row.names=1)"
    
      echo "geness_res\$Color <- \"NS or log2FC < 2.0\""
      echo "geness_res\$Color[geness_res\$pvalue < 0.05] <- \"P < 0.05\""
      echo "geness_res\$Color[geness_res\$padj < 0.05] <- \"P-adj < 0.05\""
      echo "geness_res\$Color[abs(geness_res\$log2FoldChange) < 2.0] <- \"NS or log2FC < 2.0\""
      echo "geness_res\$Color <- factor(geness_res\$Color, levels = c(\"NS or log2FC < 2.0\", \"P < 0.05\", \"P-adj < 0.05\"))"
      echo "write.csv(geness_res, \"${i}_with_Category.csv\")"
    
      # pick top genes for either side of volcano to label
      # order genes for convenience:
      echo "geness_res\$invert_P <- (-log10(geness_res\$pvalue)) * sign(geness_res\$log2FoldChange)"
      echo "top_g <- c()"
      echo "top_g <- c(top_g, \
                geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:100]], \
                geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:100]])"
      echo "top_g <- unique(top_g)"
      echo "geness_res <- geness_res[, -1*ncol(geness_res)]"  # remove invert_P from matrix
    
      # Graph results
      echo "png(\"${i}.png\",width=1200, height=2000)"
      echo "ggplot(geness_res, \
          aes(x = log2FoldChange, y = -log10(pvalue), \
              color = Color, label = external_gene_name)) + \
          geom_vline(xintercept = c(2.0, -2.0), lty = \"dashed\") + \
          geom_hline(yintercept = -log10(0.05), lty = \"dashed\") + \
          geom_point() + \
          labs(x = \"log2(FC)\", y = \"Significance, -log10(P)\", color = \"Significance\") + \
          scale_color_manual(values = c(\"P-adj < 0.05\"=\"darkblue\",\"P < 0.05\"=\"lightblue\",\"NS or log2FC < 2.0\"=\"darkgray\"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) + \
          geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res\$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = \"black\", min.segment.length = .1, box.padding = .2, lwd = 2) + \
          theme_bw(base_size = 16) + \
          theme(legend.position = \"bottom\")"
      echo "dev.off()"
    done
    
    sed -i -e 's/Color/Category/g' *_Category.csv
    
    for i in 1457.M10.2h_vs_uninfected.2h 1457.2h_vs_uninfected.2h 1457.M10.3d_vs_uninfected.3d 1457.3d_vs_uninfected.3d 1457.2h_vs_1457.M10.2h 1457.3d_vs_1457.M10.3d 1457.3d_vs_1457.2h 1457.M10.3d_vs_1457.M10.2h; do
      echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${i}-all.txt ${i}-up.txt ${i}-down.txt -d$',' -o ${i}.xls;"
    done
  8. cluster the genes and draw a heatmap

    install.packages("gplots")
    library("gplots")
    
    for i in 1457.M10.2h_vs_uninfected.2h 1457.2h_vs_uninfected.2h 1457.M10.3d_vs_uninfected.3d 1457.3d_vs_uninfected.3d 1457.2h_vs_1457.M10.2h 1457.3d_vs_1457.M10.3d 1457.3d_vs_1457.2h 1457.M10.3d_vs_1457.M10.2h; do
      echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"
      echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"
    done
    
    #    1 1457.2h_vs_1457.M10.2h-down.id
    #    1 1457.2h_vs_1457.M10.2h-up.id
    #   23 1457.2h_vs_uninfected.2h-down.id
    #   74 1457.2h_vs_uninfected.2h-up.id
    #  126 1457.3d_vs_1457.2h-down.id
    #   61 1457.3d_vs_1457.2h-up.id
    #    2 1457.3d_vs_1457.M10.3d-down.id
    #    2 1457.3d_vs_1457.M10.3d-up.id
    #   97 1457.3d_vs_uninfected.3d-down.id
    #   79 1457.3d_vs_uninfected.3d-up.id
    #   17 1457.M10.2h_vs_uninfected.2h-down.id
    #   82 1457.M10.2h_vs_uninfected.2h-up.id
    #  162 1457.M10.3d_vs_1457.M10.2h-down.id
    #   69 1457.M10.3d_vs_1457.M10.2h-up.id
    #  171 1457.M10.3d_vs_uninfected.3d-down.id
    #  124 1457.M10.3d_vs_uninfected.3d-up.id
    
    cat *.id | sort -u > ids
    # manually add the header "Gene_Id" as the first line and delete the surrounding double quotes
    GOI <- read.csv("ids")$Gene_Id  #570 genes
    RNASeq.NoCellLine <- assay(rld)
    
    # Defining the custom order
    column_order <- c(
      "uninfected.2h_r1", "uninfected.2h_r2", "uninfected.3h_r3", 
      "uninfected.3d_r1", "uninfected.3d_r2", 
      "1457-M10.2h_r1", "1457-M10.2h_r2", "1457-M10.2h_r3","1457.2h_r1", "1457.2h_r2", "1457.2h_r3",  
      "1457-M10.3d_r1", "1457-M10.3d_r2", "1457-M10.3d_r3","1457.3d_r1", "1457.3d_r2"
    )
    RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    head(RNASeq.NoCellLine_reordered)
    
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine_reordered[GOI, ]
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.08)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    png("DEGs_heatmap.png", width=900, height=1010)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75),
                RowSideColors = mycol, labRow="", srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4)
    dev.off()
    
    # Extract rows from datamat where the row names match the identifiers in subset_1
    
    #### cluster members #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        colnames(expression_values) <- colnames(data)
        # Combine gene information and expression data
        combined_result <- cbind(result, expression_values)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a CSV file (run only the write.csv line matching the cluster processed above)
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o gene_clusters.xls

Unveiling Difference in Promoter regions of MeDIP data 2

  1. extract all genes by giving GO-terms

    # # Using Bioconductor in R
    # library(org.Mm.eg.db)
    # genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    # > genes
  2. extract annotation.gtf for mouse.

    • (Optional) UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: “Genes and Gene Predictions” track: “RefSeq Genes” (or another gene prediction track if you have a preference) output format: “GTF – gene transfer format” Click “get output”, and you can then save the result as a .gtf file.

    • (Optional) Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading.

    • (Used) Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      # Note: type this in a plain editor so the quotes stay straight ASCII quotes (curly quotes break awk).
      # For '-'-strand genes the TSS is the gene end ($5), not the gene start ($4).
      awk 'OFS="\t" {if ($3 == "gene") print $1, ($7 == "+" ? $4 : $5-1), ($7 == "+" ? $4+1 : $5), $10, $7, $12, $14}' gencode.vM25.annotation.gtf > tss_.bed  #55401 records

      grep "protein_coding" tss_.bed > tss.bed  #21859 records

  3. In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the “promoter region” can vary between studies and depends on the specific context. For the mouse genome (mm10), a commonly used range for the promoter region is:

    • (Used) -2000 to +500 bp relative to the TSS: This defines a 2.5 kb region centered around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene. Some studies might define a narrower region, such as:

    • (Optional) -1000 to +200 bp relative to the TSS Or even a broader one:

    • (Optional) -5000 to +1000 bp relative to the TSS The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

  4. Convert TSS positions to promoter regions (Note that $7 is SYMBOL, $4 ENSEMBL-ID).

    # For genes on the '+' strand:
    awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    # For genes on the '-' strand:
    awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    # Combine the two promoter files.
    cat promoters_plus.bed promoters_minus.bed > promoters.bed
    
    # #(Optional) Extract sequences for these promoter regions.
    # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
  5. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    #Control females:
    bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    #Control males:
    bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    #Stress females:
    bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    #Stress males:
    bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
  6. Parse and consolidate coverage files:

    # drop the chromosome position columns, keeping the gene ID and coverage fields
    for file in ./promoters_coverage_*.txt; do
        awk '{$1=$2=$3=""; print substr($0, 4)}' "$file" > tmp.txt && mv tmp.txt "$file"
    done
    # strip any remaining leading whitespace
    for file in ./promoters_coverage_*.txt; do
        sed 's/^ *//' "$file" > tmp.txt && mv tmp.txt "$file"
    done
    # test whether all promoters_coverage_*.txt files contain exactly the same set of genes (in the same order)
    cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    for file in ./promoters_coverage_*.txt; do
        if ! cmp -s <(cut -d " " -f1 "$file") reference_genes.txt; then
            echo "$file does not match reference!"
        fi
    done
    #merge the counts
    cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    for file in ./promoters_coverage_*.txt; do
        paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    done
    #add the sample names
    for file in ./promoters_coverage_*.txt; do
        basename $file | sed 's/promoters_coverage_//;s/.txt//'
    done > sample_names.txt
    echo -n "Gene " > header.txt
    paste -s -d ' ' sample_names.txt >> header.txt
    {
        cat header.txt
        cat read_counts_table.txt
    } > read_counts_table_with_header.txt
    
    #MANUALLY fix the first line: join the wrapped header onto a single line (remove the "\n") and delete the " and ; characters
    #DELETE the version suffix of each Ensembl ID with the following command: 21860-1 = 21859 records remaining.
    cp read_counts_table_with_header.txt temp
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt
    
    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_3")
    # Load the data
    count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    row.names(sampleInfo) <- sampleInfo$sampleID
    dds <- DESeqDataSetFromMatrix(
      countData = count_data,
      colData = sampleInfo,
      design = ~ condition
    )
    
    # Plot PCA
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    # Differential analysis
    dds$condition <- relevel(dds$condition, "control_female")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_female_vs_control_female")
    
    dds$condition <- relevel(dds$condition, "control_male")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_male_vs_control_male")
    
    library(biomaRt)
    library(dplyr)  # for distinct() below
    listEnsembl()
    listMarts()
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    datasets <- listDatasets(ensembl)
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    for (i in clist) {
      #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = gene_ids_no_version, 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
      #merge by the common column, in this case the Ensembl gene ID
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    #~/Tools/csv2xls-0.4/csv_to_xls.py stress_female_vs_control_female-all.txt stress_female_vs_control_female-up.txt stress_female_vs_control_female-down.txt -d',' -o stress_female_vs_control_female.xls
    #~/Tools/csv2xls-0.4/csv_to_xls.py stress_male_vs_control_male-all.txt stress_male_vs_control_male-up.txt stress_male_vs_control_male-down.txt -d',' -o stress_male_vs_control_male.xls
    
    # Clustering the genes and drawing the heatmap for DEGs (restrict to genes of interest via a GO term, or use all genes via the file 'ids')
    #install.packages("gplots")
    library("gplots")
    
    #(Optional) cat *.id | sort -u > ids  #add Gene_Id in the first line, delete the ""
    #(Used) cut -d$'\t' -f1 read_counts_table_with_header.txt > ids  #Rename the first line to Gene_Id
    
    GOI <- read.csv("ids")$Gene_Id  #21859 genes
    RNASeq.NoCellLine <- assay(rld)
    # Defining the custom order
    #column_order <- c(
    #  "uninfected.2h_r1", "uninfected.2h_r2", "uninfected.3h_r3", 
    #  "uninfected.3d_r1", "uninfected.3d_r2", 
    #  "1457-M10.2h_r1", "1457-M10.2h_r2", "1457-M10.2h_r3","1457.2h_r1", "1457.2h_r2", "1457.2h_r3",  
    #  "1457-M10.3d_r1", "1457-M10.3d_r2", "1457-M10.3d_r3","1457.3d_r1", "1457.3d_r2"
    #)
    #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    #head(RNASeq.NoCellLine_reordered)
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.01)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    png("DEGs_heatmap.png", width=1000, height=1010)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), margins=c(4,22),
                RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4, labRow="")
    dev.off()
    
    # Generate the table for cluster members
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    subset_2<-names(subset(mycl, mycl == '2'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    subset_3<-names(subset(mycl, mycl == '3'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    # Take all
    data <- as.data.frame(datamat)  #21859
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        #colnames(expression_values) <- colnames(data)
        colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
        # # Combine gene information and expression data
        # combined_result <- cbind(result, expression_values)
        # Fetch raw counts for the gene from count_data
        raw_counts <- count_data[gene, , drop = FALSE]
        colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
        # Combine gene information, expression data, and raw counts
        combined_result <- cbind(result, expression_values, raw_counts)
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a CSV file (run the line matching the subset selected above; 'annotated.csv' for all genes)
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    write.csv(annotated_data, "annotated.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py annotated.csv -d',' -o raw_and_normalized_counts.xls

plot treeHeatmap using ggtree

ggtree_and_gheatmap

library(ggtree)
library(ggplot2)

setwd("/home/jhuang/DATA/Data_Luise_Epidome_batch3/plotTreeHeatmap/")

# -- edit tree --
#https://icytree.org/
info <- read.csv("typing_189.csv")
info$name <- info$Isolate
#tree <- read.tree("core_gene_alignment_fasttree_directly_from_186isoaltes.tree")  --> NOT GOOD!
tree <- read.tree("471.tree")
cols <- c(infection='purple2', commensalism='skyblue2')     

library(dplyr)
heatmapData2 <- info %>% select(Isolate, ST, Clonal.Complex, Phylotype)
rn <- heatmapData2$Isolate
heatmapData2$Isolate <- NULL
heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
rownames(heatmapData2) <- rn

#https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "lightcyan3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "azure3","blueviolet","darkgoldenrod",  "tomato","mediumpurple4","indianred", 
                      "cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta",     "cornflowerblue","darkgreen","red","tan","brown",      "darkgrey")
names(heatmap.colours) <- c("1","2","3","4","5", "6","7",   "20","21","22", "28","30","33","42","43","52","53", "66","68",    "100","105","124","133","134","135","137",     "159","161",    "CC1","CC2","CC3","CC4","CC5","CC6","CC30","CC72","CC77","Singleton",    "IA1","IA2","IB","II","III",    "NA")
#mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))

#circular
p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + 
  geom_tippoint(aes(color=Type)) + 
  scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
#, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
#difference between geom_tiplab and geom_tiplab2?
#+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
#font.size=10, 
png("ggtree.png", width=1260, height=1260)
#svg("ggtree.svg", width=1260, height=1260)
p
dev.off()

png("ggtree_and_gheatmap.png", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=1290, height=1000)
#svg("ggtree_and_gheatmap.svg", width=17, height=15)
gheatmap(p, heatmapData2, width=0.1,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
dev.off()

Intragenic Methylation Patterns and CpG Island Dynamics: Insights into the Methylome

MeDIP (Methylated DNA Immunoprecipitation) is a technique used to enrich for methylated DNA sequences. If a large fraction of reads in your MeDIP data maps to gene bodies, this may indicate intragenic DNA methylation. Here are a few reasons and implications for observing such a pattern:

  1. Intragenic Methylation is Common: DNA methylation within the gene body (especially in exonic regions) is quite common in many organisms, including humans. Intragenic methylation is not as well-understood as promoter methylation, but it does occur frequently.

  2. Positive Correlation with Gene Expression: Contrary to promoter methylation, which is often associated with gene repression, gene body methylation in many cases has been correlated with active gene expression. The exact reason is still under investigation, but some studies suggest that methylation in the gene body might help in the prevention of spurious transcription initiation.

  3. Alternative Promoters: In some cases, genes have alternative promoters within introns. Methylation around these intragenic promoters can regulate the alternative splicing or transcription from these internal sites.

  4. Role in Splicing: Some studies suggest a role for intragenic DNA methylation in alternative splicing, where methylation might influence the choice of exons included in the mRNA.

  5. Technical Bias: It’s essential to ensure that the observed pattern is not due to some technical biases in the MeDIP protocol, library preparation, or sequencing. For instance, biases could arise from the inefficiency of the immunoprecipitation step or the fragmentation of DNA.

  6. Comparison with Other Data: It would be beneficial to compare MeDIP data with other datasets like bisulfite sequencing (which gives single-nucleotide resolution of methylation) or with RNA-seq data (to correlate gene body methylation with gene expression levels).

If you’re consistently observing heavy methylation in the gene bodies across your data, it might be interesting to further investigate the implications and potential regulatory roles of this methylation in your system of interest.

CpG islands are regions of DNA that have a high frequency of CpG sites (cytosine-phosphate-guanine sequences). Specifically, a CpG island is typically defined by the following criteria:

  • Length > 200 bp
  • GC content > 50%
  • Ratio of observed to expected CpG > 0.6
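
These criteria are easy to check programmatically. The sketch below (plain awk, operating on a made-up test sequence, so the numbers are only illustrative) prints the length, GC fraction, and the observed/expected CpG ratio, computed as ObsExp = (N_CpG × L) / (N_C × N_G):

```shell
# Score a sequence against the three CpG-island criteria.
# The sequence below is a made-up example, not real genomic data.
seq="CGCGCGATCGCGGCTACGCG"
out=$(printf '%s\n' "$seq" | awk '{
  L = length($0)
  nC = gsub(/C/, "C")                 # gsub returns the number of matches
  nG = gsub(/G/, "G")
  nCpG = 0
  for (i = 1; i < L; i++) if (substr($0, i, 2) == "CG") nCpG++
  gc = (nC + nG) / L                                 # GC content
  oe = (nC * nG > 0) ? (nCpG * L) / (nC * nG) : 0    # observed/expected CpG
  printf "len=%d GC=%.2f ObsExp_CpG=%.2f", L, gc, oe
}')
echo "$out"   # len=20 GC=0.80 ObsExp_CpG=2.19
```

Genome-wide CpG-island callers apply the same arithmetic over sliding windows and then merge and length-filter the resulting hits.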

Here’s where CpG islands are typically located:

  1. Promoter Regions: Most CpG islands (around 70% in the human genome) are found in the promoter regions of genes. These regions are upstream of the transcription start site (TSS) and play a crucial role in the regulation of gene expression.

  2. Exons: Some CpG islands can be found within the exons of genes, both in coding sequences and untranslated regions (UTRs).

  3. Intergenic Regions: CpG islands can also be found in regions between genes, though this is less common than promoter-associated or exon-associated islands.

  4. Introns: Less frequently, CpG islands can be located in intronic regions.

The methylation status of CpG islands plays a significant role in gene regulation. When a CpG island in a gene’s promoter region is methylated, that gene is typically repressed or silenced. Conversely, unmethylated CpG islands in promoter regions are usually associated with actively transcribed genes. This pattern of methylation is often involved in cellular differentiation, genomic imprinting, X-chromosome inactivation, and is also implicated in various diseases, including cancer, when it becomes dysregulated.

Both “intergenic” and “intragenic” refer to locations in the genome, but they denote different regions. In essence, while “intragenic” pertains to regions within genes, “intergenic” pertains to regions between genes.

  • Intragenic:

    • Refers to the regions within a gene.
    • This encompasses all sequences that lie within the boundaries of a gene, including both exons (coding regions) and introns (non-coding regions).
    • Modifications or mutations in these regions can potentially affect the function, expression, or regulation of the gene.
  • Intergenic:

    • Refers to the regions between genes.
    • These are sequences that lie outside the boundaries of genes.
    • While these regions were once thought to be “junk” DNA with no function, it’s now understood that many intergenic regions play critical roles in regulating gene expression, among other functions. They may contain regulatory elements, non-coding RNA genes, or sequences that have yet to be characterized for function.

Unveiling Difference in Promoter regions of MeDIP data

  1. Select Genes of Interest: Once Emilia provides the GO term, use a database like Gene Ontology or Bioconductor’s “org.Mm.eg.db” package (for mouse genes, as used below) to fetch the list of genes associated with that GO term.

    # Using Bioconductor in R
    library(org.Mm.eg.db)
    genes <- select(org.Mm.eg.db, keys="GO:0048813", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    > genes
              GO EVIDENCE ONTOLOGY    SYMBOL
    1  GO:0048813      IDA       BP      Abi1
    2  GO:0048813      IGI       BP      Abl2
    3  GO:0048813      IMP       BP      Abl2
    4  GO:0048813      IMP       BP     Atp7a
    5  GO:0048813      IMP       BP   Cacna1a
    6  GO:0048813      IMP       BP    Camk2a
    7  GO:0048813      IMP       BP    Ctnna2
    8  GO:0048813      IMP       BP      Cdk5
    9  GO:0048813      IGI       BP     Dclk1
    10 GO:0048813      IGI       BP       Dcx
    11 GO:0048813      IMP       BP     Dscam
    12 GO:0048813      IMP       BP      Dvl1
    13 GO:0048813      IGI       BP      Fmn1
    14 GO:0048813      IMP       BP      Fmn1
    15 GO:0048813      IMP       BP       Fyn
    16 GO:0048813      IMP       BP      Hprt
    17 GO:0048813      IBA       BP      Sdc2
    18 GO:0048813      IMP       BP    Elavl4
    19 GO:0048813      IGI       BP     Itgb1
    20 GO:0048813      IMP       BP     Itgb1
    21 GO:0048813      IMP       BP      Lrp8
    22 GO:0048813      ISO       BP     Mef2a
    23 GO:0048813      IBA       BP      Map6
    24 GO:0048813      ISO       BP      Map6
    25 GO:0048813      IMP       BP   Slc11a2
    26 GO:0048813      IGI       BP      Rac1
    27 GO:0048813      IGI       BP     Rock2
    28 GO:0048813      IMP       BP    Sema3a
    29 GO:0048813      IMP       BP     Vldlr
    30 GO:0048813      IMP       BP     Mapk8
    31 GO:0048813      ISO       BP     Mink1
    32 GO:0048813      IMP       BP      Tet1
    33 GO:0048813      ISO       BP    Celsr2
    34 GO:0048813      IMP       BP   Cacna1f
    35 GO:0048813      IGI       BP     Dact1
    36 GO:0048813      IMP       BP     Dact1
    37 GO:0048813      IMP       BP  Mapk8ip2
    38 GO:0048813      ISO       BP     Trak1
    39 GO:0048813      IMP       BP      Rere
    40 GO:0048813      ISO       BP     Trak2
    41 GO:0048813      IBA       BP  Tmem106b
    42 GO:0048813      ISO       BP  Tmem106b
    43 GO:0048813      IMP       BP   Slitrk5
    44 GO:0048813      IMP       BP Kidins220
    45 GO:0048813      IMP       BP    Rbfox2
    46 GO:0048813      IMP       BP      Klf7
    47 GO:0048813      IMP       BP    Dtnbp1
    48 GO:0048813      IMP       BP     Prex2
    49 GO:0048813      IBA       BP    Dcdc2a
    50 GO:0048813      IGI       BP    Dcdc2a
    51 GO:0048813      ISO       BP     Farp1
    52 GO:0048813      ISO       BP      Lrp4
    53 GO:0048813      IBA       BP     Btbd3
    54 GO:0048813      IDA       BP     Btbd3
    55 GO:0048813      IBA       BP   Abitram
    56 GO:0048813      IMP       BP   Abitram
    57 GO:0048813      IMP       BP    Picalm
    58 GO:0048813      ISO       BP    Picalm
  2. Extract Promoter Regions: Typically, the promoter region can be defined as a certain number of base pairs (e.g., 1000-2000 bp) upstream of the transcription start site (TSS) of a gene. Tools like BEDTools can help you extract these regions from a genome reference.

    2.1 Extract the annotation GTF for mouse.

    • UCSC Genome Browser: Navigate to the UCSC Table Browser. Choose the following settings: group: "Genes and Gene Predictions"; track: "RefSeq Genes" (or another gene prediction track if you prefer); output format: "GTF - gene transfer format". Click "get output" and save the result as a .gtf file.

    • Ensembl: Go to the Ensembl FTP site. Navigate through: release-XYZ (choose the latest release) > gtf > mus_musculus (choose the appropriate assembly if there are multiple). Download the GTF file, which might be named something like Mus_musculus.GRCm38.XYZ.gtf.gz. Make sure to unzip the file after downloading.

    • Visit the GENCODE Mouse website: https://www.gencodegenes.org/mouse/release_M25.html Download the comprehensive gene annotation GTF file for the mm10 assembly.

    • Convert the GTF file to a BED file with only the TSS positions.

      # Use $4 as the TSS for '+' genes and $5 for '-' genes (GTF start/end coordinates are strand-agnostic)
      awk 'OFS="\t" {if ($3 == "gene") {tss = ($7 == "+") ? $4 : $5; print $1, tss-1, tss, $10, $7, $12, $14}}' gencode.vM25.annotation.gtf > tss.bed

      grep “protein_coding” tss.bed | wc -l #21859 records

    In many genomics studies, the promoter region of a gene is typically considered to be the sequence immediately upstream of the transcription start site (TSS). However, the exact definition of the “promoter region” can vary between studies and depends on the specific context.

    For the mouse genome (mm10), a commonly used range for the promoter region is:

    -2000 to +500 bp relative to the TSS: this defines a 2.5 kb window around the TSS. The 2 kb upstream often captures many of the cis-regulatory elements, while the 500 bp downstream includes the TSS and possibly the beginning of the gene.

    Some studies define a narrower region, such as -1000 to +200 bp relative to the TSS, or an even broader one, such as -5000 to +1000 bp relative to the TSS. The best definition often depends on the specific research question. If the goal is to capture as many potential regulatory elements as possible, a broader range might be chosen. If the goal is to focus specifically on elements directly at the TSS, a narrower range might be more appropriate.

    For many standard analyses, the -2000 to +500 bp window is a reasonable compromise that captures a significant portion of the regulatory elements without being overly broad. However, always consider the biological question and the specifics of your dataset when choosing the appropriate range for promoter regions.
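The strand-aware window arithmetic described above can be sketched as a small helper (a minimal illustration using the -2000/+500 defaults discussed here; `promoter_window` is an illustrative function, not part of the original pipeline):

```python
def promoter_window(tss, strand, upstream=2000, downstream=500):
    """Return a 0-based half-open (start, end) promoter interval for a TSS.

    For '+' genes the window runs `upstream` bp before the TSS to
    `downstream` bp after it; for '-' genes the directions are mirrored.
    The start is clamped at 0 so intervals near a chromosome edge never
    go negative.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    else:
        start, end = tss - downstream, tss + upstream
    return max(start, 0), end

print(promoter_window(10_000, "+"))  # (8000, 10500)
print(promoter_window(300, "-"))     # clamped at zero: (0, 2300)
```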

    #Note: the two extraction methods below gave different promoter coordinates for the same gene
    #(negative start values indicate the start was not clamped to 0):
    #promoters.bed:chrM      -498    2002    "ENSMUSG00000064336.1";
    #promoters_.bed:chrM     -999    1       "ENSMUSG00000064336.1";
    
    # Convert TSS positions to promoter regions, here -2000/+500 bp around the TSS ($7 is SYMBOL, $4 ENSEMBL-ID).
    # For genes on the '+' strand:
    awk 'OFS="\t" {start=$2-2000>0?$2-2000:0; end=$2+500; if ($5 == "+") print $1, start, end, $4}' tss.bed > promoters_plus.bed
    # For genes on the '-' strand:
    awk 'OFS="\t" {start=$3-500>0?$3-500:0; end=$3+2000; if ($5 == "-") print $1, start, end, $4}' tss.bed > promoters_minus.bed
    # Combine the two promoter files.
    cat promoters_plus.bed promoters_minus.bed > promoters.bed
    
    # Fixed pure-bash equivalent of the awk commands above (kept commented out):
    # # Initialize an empty promoters.bed file
    # > promoters.bed
    # # Loop through each line of tss.bed; `rest` swallows any extra columns after the strand
    # while IFS=$'\t' read -r chr start end name strand rest; do
    #     if [[ $strand == "+" ]]; then
    #         # '+' strand: promoter is TSS-2000 .. TSS+500, with the start clamped at 0
    #         adjusted_start=$(( start - 2000 > 0 ? start - 2000 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$((start + 500))\t$name" >> promoters.bed
    #     else
    #         # '-' strand: the TSS is in the `end` column; promoter is TSS-500 .. TSS+2000
    #         adjusted_start=$(( end - 500 > 0 ? end - 500 : 0 ))
    #         echo -e "$chr\t$adjusted_start\t$((end + 2000))\t$name" >> promoters.bed
    #     fi
    # done < tss.bed
    
    # # Extract sequences for these promoter regions.
    # bedtools getfasta -fi mm10_genome.fasta -bed promoters.bed -fo promoters.fasta
    
    # # Using bedtools. Assuming a BED file with TSS positions (tss.bed)
    # bedtools flank -i tss.bed -g mouse.genome -l 1000 -r 0 > promoters.bed
  3. Quantify Methylation Signal in Promoters: With the promoter regions defined, use your bam files to calculate coverage or read depth over these regions. The depth can be interpreted as a signal of methylation. You can use tools such as bedtools coverage to get the depth of coverage over these promoter regions.

    #Control females:
    bedtools coverage -a promoters.bed -b ../alg/1_0_1_sorted.bam > promoters_coverage_1.0.1.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_2_sorted.bam > promoters_coverage_1.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/1_0_3_sorted.bam > promoters_coverage_1.0.3.txt
    #Control males:
    bedtools coverage -a promoters.bed -b ../alg/1_1_2b_sorted.bam > promoters_coverage_1.1.2b.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_3_sorted.bam > promoters_coverage_1.1.3.txt
    bedtools coverage -a promoters.bed -b ../alg/1_1_4_sorted.bam > promoters_coverage_1.1.4.txt
    #Stress females:
    bedtools coverage -a promoters.bed -b ../alg/2_0_1b_sorted.bam > promoters_coverage_2.0.1b.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_2_sorted.bam > promoters_coverage_2.0.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_0_3_sorted.bam > promoters_coverage_2.0.3.txt
    #Stress males:
    bedtools coverage -a promoters.bed -b ../alg/2_1_1_sorted.bam > promoters_coverage_2.1.1.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_2_sorted.bam > promoters_coverage_2.1.2.txt
    bedtools coverage -a promoters.bed -b ../alg/2_1_3_sorted.bam > promoters_coverage_2.1.3.txt
  4. Parse and consolidate coverage files:

    # drop the chrom, start and end columns, keeping the gene ID and the coverage values
    for file in ./promoters_coverage_*.txt; do
        awk '{$1=$2=$3=""; print substr($0, 4)}' $file > tmp.txt && mv tmp.txt $file
    done
    for file in ./promoters_coverage_*.txt; do
        sed 's/^ *//' $file > tmp.txt && mv tmp.txt $file
    done
    
    # test whether the promoters_coverage_*.txt files contain exactly the same set of genes (and in the same order)
    cut -d " " -f1 promoters_coverage_1.0.1.txt > reference_genes.txt
    for file in ./promoters_coverage_*.txt; do
        if ! cmp -s <(cut -d " " -f1 $file) reference_genes.txt; then
            echo "$file does not match reference!"
        fi
    done
    
    #merge the counts
    cut -d " " -f1 promoters_coverage_1.0.1.txt > read_counts_table.txt
    for file in ./promoters_coverage_*.txt; do
        paste read_counts_table.txt <(cut -d " " -f2 $file) > tmp && mv tmp read_counts_table.txt
    done
    
    #add the sample names
    for file in ./promoters_coverage_*.txt; do
        basename $file | sed 's/promoters_coverage_//;s/.txt//'
    done > sample_names.txt
    echo -n "Gene " > header.txt
    paste -s -d ' ' sample_names.txt >> header.txt
    {
        cat header.txt
        cat read_counts_table.txt
    } > read_counts_table_with_header.txt
    #Manually fix the header line if needed: remove stray newlines, and delete the quote characters and trailing ";"
    #Strip the version suffix from the Ensembl IDs
    cp read_counts_table_with_header.txt temp
    awk 'BEGIN {FS=OFS="\t"} NR==1 {print; next} {gsub(/\.+[0-9]+$/, "", $1); print}' temp > read_counts_table_with_header.txt 
    
    #grep "ENSMUSG00000104217.1" *.txt
    #Gene    1.0.1   1.0.2   1.0.3   1.1.2b  1.1.3   1.1.4   2.0.1b  2.0.2   2.0.3   2.1.1   2.1.2   2.1.3
    #promoters_coverage_1.0.1.txt:"ENSMUSG00000104217.1"; 9 347 2500 0.1388000
    #promoters_coverage_1.0.2.txt:"ENSMUSG00000104217.1"; 22 491 2500 0.1964000
    #promoters_coverage_1.0.3.txt:"ENSMUSG00000104217.1"; 18 525 2500 0.2100000
    #promoters_coverage_1.1.2b.txt:"ENSMUSG00000104217.1"; 5 255 2500 0.1020000
    #promoters_coverage_1.1.3.txt:"ENSMUSG00000104217.1"; 11 315 2500 0.1260000
    #promoters_coverage_1.1.4.txt:"ENSMUSG00000104217.1"; 8 300 2500 0.1200000
    #promoters_coverage_2.0.1b.txt:"ENSMUSG00000104217.1"; 18 375 2500 0.1500000
    #promoters_coverage_2.0.2.txt:"ENSMUSG00000104217.1"; 9 458 2500 0.1832000
    #promoters_coverage_2.0.3.txt:"ENSMUSG00000104217.1"; 8 430 2500 0.1720000
    #promoters_coverage_2.1.1.txt:"ENSMUSG00000104217.1"; 17 389 2500 0.1556000
    #promoters_coverage_2.1.2.txt:"ENSMUSG00000104217.1"; 17 428 2500 0.1712000
    #promoters_coverage_2.1.3.txt:"ENSMUSG00000104217.1"; 15 456 2500 0.1824000
    #read_counts_table.txt:"ENSMUSG00000104217.1";   9       22      18      5       11      8       18      9       8       17      17      15
    #read_counts_table_with_header.txt:"ENSMUSG00000104217.1"        9       22      18      5       11      8       18      9       8       17      17   15
    #reference_genes.txt:"ENSMUSG00000104217.1";
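The shell consolidation above can also be sketched in pandas, operating directly on the unmodified bedtools coverage outputs. The eight-column layout and the promoters_coverage_*.txt naming are assumed from step 3; `consolidate` is an illustrative helper, not part of the original pipeline:

```python
import csv
import os
import re

import pandas as pd

# Column layout of `bedtools coverage -a promoters.bed -b sample.bam` when
# promoters.bed has 4 columns (chrom, start, end, gene_id).
COVERAGE_COLS = ["chrom", "start", "end", "gene", "count",
                 "covered", "length", "fraction"]

def consolidate(paths):
    """Build a gene x sample read-count table from bedtools coverage files."""
    counts = {}
    for path in sorted(paths):
        sample = (os.path.basename(path)
                  .replace("promoters_coverage_", "")
                  .replace(".txt", ""))
        df = pd.read_csv(path, sep="\t", names=COVERAGE_COLS,
                         quoting=csv.QUOTE_NONE)
        counts[sample] = df.set_index("gene")["count"]
    table = pd.DataFrame(counts)  # aligns on gene ID; mismatches show up as NaN
    # '"ENSMUSG00000104217.1";' -> 'ENSMUSG00000104217'
    table.index = [re.sub(r"\.\d+$", "", g.strip('";')) for g in table.index]
    table.index.name = "Gene"
    return table

# Usage sketch:
# table = consolidate(glob.glob("promoters_coverage_*.txt"))
# table.to_csv("read_counts_table_with_header.txt", sep="\t")
```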
  5. Load into R and format for DESeq2:

    library(DESeq2)
    setwd("/home/jhuang/DATA/Data_Emilia_MeDIP/analysis_2023_2")
    
    # Load the data
    count_data <- read.table("read_counts_table_with_header.txt", header = TRUE, sep = "\t", quote = "", row.names = 1)
    
    ## Define conditions (you can replace this with your experimental design)
    #conditions <- c(rep("control_female", 3), rep("control_male", 3), rep("stress_female", 3), rep("stress_male", 3))
    
    ## Create DESeq2 object
    #dds <- DESeqDataSetFromMatrix(countData = count_data, colData = data.frame(condition = conditions), design = ~condition)
    
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #55401-->41415
    #dds <- DESeq(dds)
    #rld <- rlogTransformation(dds)
    
    #dim(counts(dds))
    #head(counts(dds), 10)
    
    ## This is an example using DESeq2, and it assumes count data rather than coverage.
    ## Proper normalization might need more manual adjustments based on the nature of MeDIP-seq data.
    #library(DESeq2)
    #dds <- DESeqDataSetFromMatrix(countData = countMatrix, colData = sampleInfo, design = ~ condition)
    #dds <- DESeq(dds)
    #normalized_counts <- counts(dds, normalized=TRUE)
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
    
    #-------
    
    sampleInfo <- read.csv("sampleInfo.txt", header = TRUE, stringsAsFactors = FALSE)
    row.names(sampleInfo) <- sampleInfo$sampleID
    
    library(DESeq2)
    dds <- DESeqDataSetFromMatrix(
      countData = count_data,
      colData = sampleInfo,
      design = ~ condition
    )
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    png("pca.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    ## Again using DESeq2 as an example
    #res <- results(dds)
    #significant_genes <- subset(res, padj < 0.05)
  6. Differential analysis

    # Run one contrast at a time: relevel to the desired reference, re-run DESeq,
    # then set clist accordingly before the annotation loop below.
    dds$condition <- relevel(dds$condition, "control_female")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("stress_female_vs_control_female")
    
    # Alternatively, for the male comparison:
    #dds$condition <- relevel(dds$condition, "control_male")
    #dds = DESeq(dds, betaPrior=FALSE)
    #resultsNames(dds)
    #clist <- c("stress_male_vs_control_male")
    
    library(biomaRt)
    listEnsembl()
    listMarts()
    
    # Note: create the `ensembl` mart object before calling listDatasets() on it.
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="100")
    datasets <- listDatasets(ensembl)
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    for (i in clist) {
      #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      gene_ids_no_version <- gsub("\\.\\d+$", "", rownames(res))
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = gene_ids_no_version, 
          mart = ensembl)
      library(dplyr)   # provides distinct()
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
      #merge by the common column, in this case the ENSEMBL gene ID
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
  7. Cluster the genes and draw a heatmap (restricting to genes with an ENSEMBL ID annotated to a given GO term)

    install.packages("gplots")
    library("gplots")
    
    # As in STEP 1, fetch the gene lists for each GO term of interest
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090399", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090399.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0090398", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0090398.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:2000035", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:2000035.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0017145", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0017145.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002761", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002761.csv", genes)
    
    genes <- AnnotationDbi::select(org.Mm.eg.db, keys="GO:0002244", columns=c("SYMBOL", "ENSEMBL"), keytype="GO")
    write.csv(file="genes_GO:0002244.csv", genes)
    
    #write.csv(file="ids_", unique(genes$ENSEMBL))
    ##cut -d',' -f2 ids_ > ids
    ##add Gene_Id in the first line, delete the ""
    #GOI <- read.csv("ids")$Gene_Id  
    GOI <- unique(genes$ENSEMBL)
    RNASeq.NoCellLine <- assay(rld)
    # Defining the custom order
    #column_order <- c(
    #  "uninfected.2h_r1"
    #)
    #RNASeq.NoCellLine_reordered <- RNASeq.NoCellLine[, column_order]
    #head(RNASeq.NoCellLine_reordered)
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    GOI_clean <- GOI[GOI %in% rownames(RNASeq.NoCellLine) & !is.na(GOI)]
    #GOI_clean <- GOI_clean[GOI_clean != "ENSMUSG00000094658"]
    datamat <- RNASeq.NoCellLine[GOI_clean, ]
    #Remove Problematic Rows whose var is 0
    datamat <- datamat[!apply(datamat, 1, var) == 0, ]
    #datamat = RNASeq.NoCellLine[GOI, ]
    #rownames(datamat) <- genes$external_gene_name #DEBUG since genes has 58 records, datamat has only 48 records.
    write.csv(as.data.frame(datamat), file ="DEGs_heatmap_data.csv")
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.00)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    mycol = mycol[as.vector(mycl)]
    
    png("DEGs_heatmap.png", width=1000, height=1010)
    #labRow="",
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), margins=c(4,22),
                RowSideColors = mycol, srtCol=20, keysize=0.72, cexRow = 2, cexCol = 1.4)
    dev.off()
    
    # Extract rows from datamat where the row names match the identifiers in subset_1
    #### cluster members #####
    subset_1<-names(subset(mycl, mycl == '1'))
    data <- as.data.frame(datamat[rownames(datamat) %in% subset_1, ])  #212
    #subset_2<-names(subset(mycl, mycl == '2'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_2, ])  #185
    #subset_3<-names(subset(mycl, mycl == '3'))
    #data <- as.data.frame(datamat[rownames(datamat) %in% subset_3, ])  #173
    
    # Initialize an empty data frame for the annotated data
    annotated_data <- data.frame()
    # Determine total number of genes
    total_genes <- length(rownames(data))
    # Loop through each gene to annotate
    for (i in 1:total_genes) {
        gene <- rownames(data)[i]
        result <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
                        filters = 'ensembl_gene_id',
                        values = gene,
                        mart = ensembl)
        # If multiple rows are returned, take the first one
        if (nrow(result) > 1) {
            result <- result[1, ]
        }
        # Check if the result is empty
        if (nrow(result) == 0) {
            result <- data.frame(ensembl_gene_id = gene,
                                external_gene_name = NA,
                                gene_biotype = NA,
                                entrezgene_id = NA,
                                chromosome_name = NA,
                                start_position = NA,
                                end_position = NA,
                                strand = NA,
                                description = NA)
        }
        # Transpose expression values
        expression_values <- t(data.frame(t(data[gene, ])))
        #colnames(expression_values) <- colnames(data)
        colnames(expression_values) <- paste(colnames(data), "normalized", sep = "_")
    
        # # Combine gene information and expression data
        # combined_result <- cbind(result, expression_values)
    
        # Fetch raw counts for the gene from count_data
        raw_counts <- count_data[gene, , drop = FALSE]
        colnames(raw_counts) <- paste(colnames(raw_counts), "raw", sep = "_")
        # Combine gene information, expression data, and raw counts
        combined_result <- cbind(result, expression_values, raw_counts)
    
        # Append to the final dataframe
        annotated_data <- rbind(annotated_data, combined_result)
        # Print progress every 100 genes
        if (i %% 100 == 0) {
            cat(sprintf("Processed gene %d out of %d\n", i, total_genes))
        }
    }
    # Save the annotated data to a new CSV file
    write.csv(annotated_data, "cluster1_YELLOW.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster2_DARKBLUE.csv", row.names=FALSE)
    #write.csv(annotated_data, "cluster3_DARKORANGE.csv", row.names=FALSE)
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.csv -d',' -o gene_clusters.xls
  8. Append the p-values and adjusted p-values to the end of the output using stress_female_vs_control_female-all.txt and stress_male_vs_control_male-all.txt

    #python3 run.py
    import pandas as pd
    
    # Load the male and female stress data
    female_df = pd.read_csv('stress_female_vs_control_female-all.txt')
    male_df = pd.read_csv('stress_male_vs_control_male-all.txt')
    
    # List of files to process
    files_to_process = [
        './raw_and_normalized_counts_GO:0090399.csv',
        './raw_and_normalized_counts_GO:0017145.csv',
        './raw_and_normalized_counts_GO:0090398.csv',
        './raw_and_normalized_counts_GO:0002761.csv',
        './raw_and_normalized_counts_GO:0002244.csv',
        './raw_and_normalized_counts_GO:2000035.csv'
    ]
    
    for filepath in files_to_process:
        # Load the raw data
        raw_df = pd.read_csv(filepath)
    
        # Extracting the pvalue and padj for female dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(female_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_female_stress_vs_control', 
                                        'padj': 'padj_female_stress_vs_control'})
    
        # Extracting the pvalue and padj for male dataframe and adding them to the raw dataframe
        raw_df = raw_df.merge(male_df[['Unnamed: 0', 'pvalue', 'padj']], 
                              left_on='ensembl_gene_id', 
                              right_on='Unnamed: 0', 
                              how='left')
        raw_df = raw_df.rename(columns={'pvalue': 'pvalue_male_stress_vs_control', 
                                        'padj': 'padj_male_stress_vs_control'})
    
        # Drop the merged columns 'Unnamed: 0' from the final dataframe
        raw_df = raw_df.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y'])
    
        # Save the raw_df with the added pvalues and padj to a new CSV file
        output_filepath = filepath.replace('.csv', '_with_pvalues_and_padj.csv')
        raw_df.to_csv(output_filepath, index=False)
    
    #~/Tools/csv2xls-0.4/csv_to_xls.py raw_and_normalized*with_pvalues_and_padj.csv -d',' -o raw_and_normalized_with_pvalues_and_padj.xls

Effects of Merkel cell polyomavirus large T antigen (LT), truncated large T antigen (LTtr; MCC350), and small T antigen (sT) on mRNA levels in primary neonatal human dermal fibroblasts (nHDFs)

1: Summary: This section discusses the study's strengths/weaknesses, novelty/significance, overall execution, and scholarship.

1. This study examined the effects of Merkel cell polyomavirus large T antigen (LT), truncated large T antigen (LTtr; MCC350), and small T antigen (sT) on mRNA levels in primary neonatal human dermal fibroblasts (nHDFs). Viral cDNAs were transduced via lentivirus (sT-IRES-mCherry) and LeGO-iG2 (LT/LTtr - eGFP), cells were sorted for fluorescent protein on day 2, and analyses were then performed on d3 and d8. While LT induced DNA-replication genes, sT expression appeared to reduce interferon-related genes. mRNA levels of both ISGF3 components and downstream targets were reduced, implying that sT exerts its effect upstream. Notably, sT reduced IRF9 mRNA and protein levels, as demonstrated by RT-qPCR and immunofluorescence staining. To validate the effect of sT on type I IFN signaling genes, they performed promoter reporter assays in H1299 and HEK293 cells. They observed that sT reduced IFN-B1 reporter activity when co-transfected with TBK1, IRF3, RIG-I, cGAS, and STING. Based on these results, the authors propose that sT interferes with TBK1 signaling to the IFN-B1 reporter. However, apart from the promoter reporter assays, there is no direct evidence that sT interferes with TBK1. The cartoon in Figure 9 implies that sT directly affects TBK1, but at most there is a downstream effect whereby TBK1 appears less able to induce a transcriptional response in the presence of sT.

2. In this paper, the authors report a potentially new function of Merkel cell polyomavirus small tumor antigen (sT): suppression of interferon-stimulated gene (ISG) expression upstream of IFNB1, possibly at the level of TBK1. They developed a new MCPyV infection model by transducing neonatal human dermal fibroblasts (nHDFs), a likely host cell, with combinations of lentiviral particles encoding sT, large tumor antigen (LT), and a truncated LT variant (LTtr) derived from a virus-positive Merkel cell carcinoma tumor. They performed RNA-sequencing analysis at two different time points after expression of sT with LT versus sT with LTtr, modeling early and late infection/tumorigenesis. They identified hundreds of differentially expressed genes associated with sT/LT/LTtr expression, spanning diverse gene ontologies and biological processes. Interestingly, they observed that sT caused a marked downregulation of type I IFN response genes and innate immune genes. The sT-downregulated genes overlapped strongly with ISGF3-regulated genes in macrophages, suggesting that sT suppresses the function of the ISGF3 complex. They next showed that sT reduced the active transcription histone marks (H3K4me3 and H3K27Ac) at sT-downregulated innate immune genes but had little effect on repressive histone marks (H3K27me3 or H3K9me3), suggesting that sT blocks transcription of these genes rather than inducing epigenetic silencing. Follow-up experiments showed that sT downregulated levels of IRF9, a subunit of the ISGF3 complex, in nHDFs. Using luciferase reporter assays in H1299 and 293 cells, the authors showed that sT did not appear to inhibit ISG transcription in response to IFNB1 stimulation, suggesting that sT acts upstream of IFNB1 and ISGF3, in conflict with their earlier results in nHDFs. They showed that sT blocked cGAS-STING-, RIG-I-, and TBK1-mediated induction of an IFNB1-luciferase reporter but not induction by constitutively active IRF3. They concluded that sT acts at the level of TBK1 to downregulate IFNB1 expression, thereby leading to ISG downregulation. Returning to the nHDF infection model, the authors observed that LT transiently upregulated IFNB1 and ISGs at early time points, which was not sustained at later time points. sT alone downregulated ISGs and attenuated LT-mediated ISG induction at early time points. The authors conclude that they have identified a fundamental new function of sT - blocking LT-induced type I IFN-regulated genes during MCPyV infection.
The paper begins with a detailed and well-executed review of current knowledge in the field. The paper makes an interesting point that sT counteracts the innate immune response elicited by LT, and this reviewer appreciates the authors' effort to study sT and LT alone and in combination. However, there are several major issues that call the paper's conclusions into question (see section 2). The paper's mechanistic understanding of how sT downregulates ISGs is very limited and confusing. The luciferase results in H1299/HEK293 cells conflict with the results in nHDFs, and the authors do not resolve or address this. sT regulates NF-kB signaling through binding to NEMO, which could contribute to ISG downregulation, but the authors do not address this. The authors also do not address the contribution of ALTO, which their LT constructs may express. Because multiple major revisions are currently needed, this reviewer recommends rejection of the paper in its current form.

3. This study explores the effects of MCPyV (Merkel cell polyomavirus) T antigens, specifically sT, LT, and LTtr, on primary neonatal human dermal fibroblasts (nHDFs) and, subsequently, epithelial-derived cell lines. Existing in vitro models for studying MCPyV have major limitations in capturing all aspects of the viral life cycle. This study used lentiviral transduction to overexpress T antigens in nHDFs, modeling early and long-term viral gene expression during initial infection and after integration in cancer, and examined the resulting changes in gene expression. While sT expression enhanced genes related to cell proliferation and chemotaxis, it repressed many genes, including some that were upregulated by LT and LTtr. Notably, it also repressed type I interferon (IFN) response genes and MHC class I genes. This effect correlated with altered histone marks associated with gene activation. The study suggests that sT's effect on gene expression is independent of the interferon-stimulated gene factor 3 (ISGF3) complex and disrupts the type I IFN signaling pathway mainly at the level of the kinase TBK1. The study contributes to understanding MCPyV's effects on host-cell gene expression and its immune-evasion strategies. Overall, this is a well-written manuscript with a thorough analysis of the data generated. However, some detailed concerns follow below.
2: Major concerns: Key experiments required for acceptance: Please use this section to detail key new experiments, or modifications of existing experiments, that should be absolutely required to validate the study's conclusions.

      • The heatmap in Figure 1C shows only one replicate per time point. It is unclear how many RNA-seq replicates were performed. Was only one replicate performed for each viral construct and control? Were all constructs tested in both donors? The PCA plot in Figure S2G shows a large difference between the transcriptional profiles of the two donors' cells. Do the observed gene-expression differences (Table 1) reflect both donors? Do the values reflect two or more replicates?

      • Figure 5: The differences for H3K27me3 and H3K9me3 do not appear very large (5C), perhaps because levels of these marks are low, so the correlation plots in 5B with their low r values are somewhat misleading. It might be more informative to show specific sT-downregulated genes for each histone mark tested. The color coding in Figure 5C is missing. How many ChIP-seq replicates were performed? Were both donors 1 and 2 profiled?

      • Figure 7: In the IFNB1 promoter reporter assays, sT or V5-sT appeared to reduce the effects of co-transfected TBK1, IRF3-5D (constitutively active), cGAS-STING, and RIG-I-N. Does sT affect the levels of these co-transfected proteins? Western blots for TBK1, IRF3-5D, cGAS-STING, and RIG-I-N would be useful. Although the effect of sT in the presence of IRF3-5D was not significant (Figure 7D), activity trended downward. Do the results of this assay indicate that sT's effect on TBK1 is not significantly different from its effect on IRF3?

      • In Figure 1C, RNA-seq data for sT, LT, LTtr, sT+LT, and sT+LTtr are presented side by side, allowing easy comparison of genes and clusters and whether they are up- or downregulated. Surprisingly, the blots in Figure 1B were run on separate blots, which only confirms whether a given viral protein was expressed. sT expression in sT alone versus sT+LT or sT+LTtr cannot be compared, limiting the reader's ability to interpret each protein's relative contribution to the gene-expression changes. The reviewer suggests the authors rerun these blots on the same gel, with all conditions and time points for a given donor, so that relative sT/LT expression across conditions can be compared.

      • The authors mention that IFNB1 is regulated by NF-kB signaling and cite two papers showing that sT modulates NF-kB activity (Griffiths et al., 2013 & Abdul-Sada et al., 2017). However, the authors did not assess the contribution of sT's modulation of the NF-kB pathway to sT's regulation of IFNB1 and ISGs. It is unclear whether the authors have discovered an entirely new function of sT or whether their observations are a consequence of sT inhibiting NF-kB signaling. Additional experiments using sT mutants that do not bind PP4C/NEMO should be performed to assess the contribution of NF-kB signaling to ISG regulation by sT.

      • The authors present conflicting data from nHDFs, H1299, and HEK293 cells that are not resolved in the paper. In Figure 6A the authors show significant downregulation of STAT1 and IRF9 mRNA levels in nHDFs at 3 and 8 dpt, and in Figure 6C downregulation of IRF9 protein at 8 dpt. The reviewer is confused as to why unsorted nHDFs at 2 dpt were blotted in Figure 6C, as these cells do not correspond to any other time point analyzed in Figure 6A or Figure 8. The reviewer would like to see blots for STAT1, p-STAT1, and IRF9 in nHDFs at 3 and 8 dpt. As they stand, the data in Figure 6 do not support the conclusion that "MCPyV sT suppresses ISG transcription without targeting the ISGF3 complex." The data shown suggest that sT does suppress ISGF3 complex activity by downregulating IRF9 mRNA and protein, and possibly STAT1. STAT2 levels should also be addressed.

      • In Figure 7A/B/C, the authors performed luciferase assays in H1299 cells and observed that sT did not repress the hIFIT1-/hIFIT2-/mMX1-luciferase reporters in response to IFNB1 treatment. Given that sT was previously shown to downregulate ISGF3 activity in nHDFs, these results are surprising. In Figure 8 the authors present data showing that sT downregulates IFIT1, IFIT2, and MX1 in nHDFs at 3 and 8 dpt. Does sT downregulate STAT1, p-STAT1, or IRF9 in H1299 cells as it does in nHDFs? Does sT have any effect on the hIFIT1-/hIFIT2-/mMX1-luciferase reporters in the absence of IFNB1? The reviewer does not consider H1299 cells a good model system for dissecting how sT regulates ISGs.

      • In Figure 7D, the authors assessed sT's ability to regulate IFNB1 promoter activity using an IFNB1-luciferase reporter stimulated by co-transfection of a constitutively active phosphomimetic IRF3 (IRF3-5D) or wt TBK1. They observed that sT did not prevent IRF3-5D from activating the IFNB1-luciferase reporter but did prevent TBK1-mediated IFNB1-luciferase induction. TBK1 is known to activate both IRF3 and canonical (p65/RelA) NF-kB signaling, and both IRF3 and NF-kB can drive IFNB1 expression. The authors do acknowledge that the IFNB1 promoter contains NF-kB binding sites. TBK1 has been shown to activate the IKK complex, and sT has been shown to bind and inhibit IKK through binding to NEMO. The reviewer would like to see the effect of sT mutants relevant to NF-kB signaling on the IFNB1 promoter, to determine the novelty and significance of the paper's findings.

      • In Fig 7F/G/H, the authors switched from H1299 to 293 cells because "overexpression of cGAS-STING or constitutively active RIG-I-N did not lead to sufficient reporter induction in H1299 cells." These findings again reinforce the notion that H1299 cells are not a good model for studying sT regulation of ISGs. The reviewer appreciates that the authors repeated the experiments testing sT's effect on TBK1-mediated activation of IFNB1-luciferase in 293 cells, but wonders why they did not repeat the IFIT1/IFIT2/MX1 luciferase experiments in 293 cells, and suggests they examine whether sT downregulates IRF9, STAT1, and p-STAT1.

      • Finally, the authors focus on sT and LT, two of the MCPyV early proteins. The authors do not cite ALTO (Carter et al., 2013). ALTO has been shown to be overprinted on the second exon of LT and to be expressed during the normal viral life cycle. Yang et al. (2021) showed that ALTO can stimulate various luciferase reporter constructs, indicating that ALTO affects gene transcription and expression. Unless they assess whether their LT and LTtr lentiviral constructs encode ALTO, the conclusions about LT's effects on transcription are in question.

      • In summary, the three major issues the authors need to address before resubmission are:

        • Does sT target the ISGF3 complex, and if so, how? The data in nHDFs suggest that it does, but do not address how IRF9 and STAT1 mRNA are downregulated.

        • Does sT downregulate ISGs through a mechanism independent of its ability to regulate NF-kB signaling via NEMO binding?

        • Do the LT and LTtr constructs express ALTO? If so, the conclusion that LT stimulates ISGs may be inaccurate.

      • The study appropriately notes that different cell types may respond differently to viral proteins, particularly for the innate immune responses under study, as seen for BKPyV and JCPyV in various cell types. However, the study uses the epithelial-derived H1299 and HEK293 cells to follow up the RNA-sequencing findings without any justification. The latter cell line also harbors adenovirus E1A, which may likewise affect signaling in these pathways. From the representative IF images of IRF9 localization in nHDFs, localization in the presence of sT may be altered compared with control and untransduced conditions, or this may be due to inconsistent exposure settings during imaging. Given the study's statement that "overexpression of cGAS-STING or constitutive activity of RIG-I-N did not lead to sufficient reporter induction in H1299 cells," the luciferase and IRF9 experiments should be repeated in a more appropriate cell line.

      • Although fibroblasts are clearly susceptible to MCPyV infection, whether they are the cell type of origin of MCC remains unclear. For UV-mediated, virus-negative MCC, epithelial cells are all but confirmed as the precursor, based on combined MCC/SCC morphology and shared truncal mutations. With this in mind, and given that the follow-up experiments above were performed in epithelial-derived cells, lentiviral transduction of primary keratinocytes with sT, LT, and LTtr would better address how generalizable the nHDF observations are.

3: Minor concerns: Editorial and data-presentation modifications: Please use this section for editorial suggestions and relatively minor modifications to existing data to enhance clarity.

1.
    * The growth curves shown for donor 1 (Figure S2F) indicate a slower growth rate for LT +/- sT, although no obvious differences were observed in the transcriptome profiles. Unexpectedly, combined sT and LTtr grew slowest. Were similar kinetics observed for donor 2?
    * Figure 3A: the volcano plot could indicate small T antigen expression.

2.
    * The paper would benefit from additional proofreading, as some phrasing and punctuation are non-standard English.
    * Why were both sT-V5 and sT used in the luciferase experiments of Figure 7? This is not explained in the text.
    * A single trLT variant was chosen - is it representative of all other MCC trLTs?
    * In all figures, labeling should be improved so that readers can see which cell line and which T antigen is being expressed, and at which time point cells were harvested, without having to read the legend. In particular Figure 3A (which construct, nHDF sT?) and Figure 7 (which cells? This changes between subpanels, so it needs to be indicated where each cell line was used).
    * The blots in Figure 7F/G/H are cropped such that readers cannot compare actin levels between blots. The blots should be rerun with the samples on the same blot.

3.
    * Lines 138-140: the sentence is awkward and unclear.
    * The study should list how many reads were generated for each sample.
    * Why were MACS2 and SICER used for different histone ChIP-seq marks?
    * Which versions of all software were used?
    * Were replicates generated for the ChIP-seq experiments?

From epidome to treeHeatmap

  1. Classification using 98% similarity (namely epidome-processing step 2.6).

    #Input Files: yycH_seqtab_nochim.csv and ../epidome/DB/yycH_ref_aln.fasta
    #Output Files: yycH_seqtab_ASV_seqs.fasta, yycH_seqtab_ASV_blast.txt, and yycH_seqtab.csv.classified.csv
    
    python3 ../epidome/scripts/ASV_blast_classification.py yycH_seqtab_nochim.csv yycH_seqtab_ASV_seqs.fasta ../epidome/DB/yycH_ref_aln.fasta yycH_seqtab_ASV_blast.txt yycH_seqtab.csv.classified.csv 98
    python3 ../epidome/scripts/ASV_blast_classification.py g216_seqtab_nochim.csv g216_seqtab_ASV_seqs.fasta ../epidome/DB/g216_ref_aln.fasta g216_seqtab_ASV_blast.txt g216_seqtab.csv.classified.csv 98
  2. Manually check the results, ensuring no ambiguous entries such as seq34;seq35 occur.

    #Delete all records in *_seqtab.csv.classified.csv with alignment length < 448 in g216 (delete records 110 and 111), or < 440 in yycH (delete record 90).
    
    #Confirm the results
    jhuang@hamburg:/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98$ cut -d$'\t' -f4 g216_seqtab_ASV_blast.txt | sort -u
    446
    447
    448
    jhuang@hamburg:/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98$ cut -d$'\t' -f4 yycH_seqtab_ASV_blast.txt | sort -u
    475
    476
    
    #It is not necessary to delete the corresponding record 90 in yycH_* and g216_seqtab_ASV_blast.txt, since the two files are not used. 
    #sed -i -e s/seq//g yycH_seqtab_ASV_blast.txt   #length=476
    #sed -i -e s/seq//g g216_seqtab_ASV_blast.txt   #length=448
    ##;-->""
    #sed -i -e s/';'//g yycH_seqtab_ASV_blast.txt
    #sed -i -e s/';'//g g216_seqtab_ASV_blast.txt
    
    cp g216_seqtab.csv.classified.csv g216_seqtab.csv.classified.backup
    cp yycH_seqtab.csv.classified.csv yycH_seqtab.csv.classified.backup
    
    # (Option 1): Process *.classified.csv, ensuring the format is seq24,21 or seq24.
    #https://github.com/ssi-dk/epidome/blob/master/example_data/190920_run1_G216_seqtab_from_dada2.csv.classified.csv
    ##DEBUG using LibreOffice, e.g. libreoffice --calc yycH_seqtab.csv.classified_noNA.csv after adding "ID"; at the corner.
    #Final goal: "seqseq31;,seq30;" --> "seq31,30"
    #;,seq --> ,
    sed -i -e s/";,seq"/","/g g216_seqtab.csv.classified.csv
    sed -i -e s/";,seq"/","/g yycH_seqtab.csv.classified.csv
    #;"; --> ";
    sed -i -e s/";\";"/"\";"/g g216_seqtab.csv.classified.csv
    sed -i -e s/";\";"/"\";"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/seqseq/seq/g g216_seqtab.csv.classified.csv
    sed -i -e s/seqseq/seq/g yycH_seqtab.csv.classified.csv
    #confirm!
    cut -d';' -f3 g216_seqtab.csv.classified.csv
    cut -d';' -f3 yycH_seqtab.csv.classified.csv
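The three sed substitutions above can also be expressed as one small Python helper applied per line of the classified CSV (illustration only; the sed commands remain the source of truth):

```python
# Hedged sketch: reproduce the three sed substitutions used above,
# turning entries like '"seqseq31;,seq30;"' into '"seq31,30"'.
def normalize_classified_line(line: str) -> str:
    line = line.replace(';,seq', ',')     # ;,seq -> ,
    line = line.replace(';";', '";')      # ;"; -> ";
    line = line.replace('seqseq', 'seq')  # seqseq -> seq
    return line

# Toy input in the pre-cleanup format:
print(normalize_classified_line('"seqseq31;,seq30;";12;34'))  # "seq31,30";12;34
```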
    
    # (Option 2): To reduce the number of unclassified entries, rename seq31,seq30 --> seq31 in g216 and seq37,seq36 --> seq36 in yycH.
    sed -i -e s/"seq40,seq31,seq30"/"seq40"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq31,seq30"/"seq31"/g g216_seqtab.csv.classified.csv
    sed -i -e s/"seq5,seq2"/"seq2"/g g216_seqtab.csv.classified.csv
    grep ",seq" g216_seqtab.csv.classified.csv  #should return no record.
    #sed -i -e s/"seq22,20"/"seq20"/g g216_seqtab.csv.classified.csv
    #sed -i -e s/"seq24,21"/"seq21"/g g216_seqtab.csv.classified.csv
    #sed -i -e s/"seq21,17"/"seq21"/g g216_seqtab.csv.classified.csv
    > epidome_object$p1_seqs    #vim g216_seqtab.csv.classified_noNA.csv
      [1] "seq28"       "seq5"        "seq40"       "seq31,30"(seq31)     "seq14"      
      [6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    [11] "seq29"       "seq24"       "seq1"        "seq26"       "seq40"      
    [16] "seq21"       "seq18"       "seq40"       "seq16"       "seq37"      
    [21] "seq5,2"      "seq18"       "seq11"       "seq14"       "seq26"      
    [26] "seq31,30"(seq31)     "seq37"       "seq26"       "seq1"        "seq21"      
    [31] "seq3"        "seq40"       "seq26"       "seq28"       "seq40"      
    [36] "seq10"       "seq5"        "seq40,31,30"(seq40)  "seq37"       "seq37"      
    [41] "seq37"       "seq40"       "seq28"       "seq26"       "seq37"      
    [46] "seq22"       "seq28"       "seq19"       "seq5,2"      "seq26"      
    [51] "seq40"       "seq40"       "seq26"       "seq40"       "seq28"      
    [56] "seq28"       "seq28"       "seq37"       "seq26"       "seq40"      
    [61] "seq26"       "seq37"       "seq28"       "seq28"       "seq1"       
    [66] "seq28"       "seq5,2"(seq5)      "seq40"       "seq21"       "seq28"      
    [71] "seq29"       
    #"seq37"       "seq22"       "seq22,20"(seq20)    "seq31,30"(seq31)    
    [76] "seq22"       "seq20"       "seq28"       "seq37"       "seq14"      
    [81] "seq28"       "seq20"       "seq40"       "seq28"       "seq24,21" 
    [86] "seq40"       "seq14"       "seq31,30"(seq31)      "seq1"        "seq1"       
    [91] "seq31,30"(seq31)     "seq1"        "seq40"       "seq20"       "seq1"       
    [96] "seq21"       "seq26"       "seq20"       "seq20"       "seq21,17"   
    [101] "seq37"       "seq26"       "seq31,30"(seq31)      "seq31,30"(seq31)    "seq40"      
    [106] "seq5"        "seq12"             
    
    sed -i -e s/"seq37,seq36"/"seq36"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq58,seq25,seq21,seq10"/"seq21"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq23,seq22"/"seq23"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq65,seq63,seq61,seq60"/"seq63"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq65,seq63,seq53"/"seq63"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq61,seq55"/"seq55"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq34,seq28,seq25,seq10,seq9"/"seq34"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq25,seq10,seq9"/"seq9"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/"seq25,seq10"/"seq10"/g yycH_seqtab.csv.classified.csv
    grep ",seq" yycH_seqtab.csv.classified.csv  #should return no record.
    > epidome_object$p2_seqs
    [1] "34"            "43"            "seq37,seq36"-->36         "42"           
    [5] "26"            "63"            "3"             "2"            
    [9] "11"            "seq65,seq63,seq53"-->63      "14"            "38"           
    [13] "55"            "3"             "65"            "seq58,seq25,seq21,seq10"-->21  
    [17] "seq23,seq22"-->23         "53"            "56"            "34"           
    [21] "62"            "14"            "35"            "seq61,seq55"-->55        
    [25] "52"            "seq25,seq10"-->10         "seq65,seq63,seq61,seq60"-->63   "21"           
    [29] "55"            "15"            "34"            "49"           
    [33] "19"            "63"            "63"            "34"           
    [37] "43"            "12"            "34"            "seq34,seq28,seq25,seq10,seq9"-->34
    [41] "43"            "61"            "25"            "34"           
    [45] "34"            "5"             "18"            "43"           
    [49] "62"            "34"            "19"            "63"           
    [53] "7"             
    --------------
    "seq37,seq36"-->36         "seq25,seq10,seq9"-->9       "56"           
    [57] "11"            "49"            "10"            "63"           
    [61] "seq37,seq36"-->36         "14"            "55"  
    #-------------
    "55"           
    [65] "14"            "37,36"         "43"            "45"           
    [69] "34"            "34"            "43"            "43"           
    [73] "25"            "14"            "65,64,61,56"   "15,9"         
    [77] "43"            "38"            "43"            "53"           
    [81] "11"            "9"             "34"            "15"           
    [85] "9"             "23,22"         "62"            "43"           
    [89] "37,36"
    
    # Common step after option 1 or option 2.
    grep -v ";NA;" g216_seqtab.csv.classified.csv > g216_seqtab.csv.classified_noNA.csv
    grep -v ";NA;" yycH_seqtab.csv.classified.csv > yycH_seqtab.csv.classified_noNA.csv
    
    #Filter out ASVs whose total count is < 500. INPUT: *_noNA.csv; OUTPUT: *_seqtab.csv.classified_filtered.csv
    python3 g216_filtering.py  
        import pandas as pd
        # Load the file
        data = pd.read_csv('g216_seqtab.csv.classified_noNA.csv', delimiter=';')
        # Compute row sums excluding the first 2 columns
        data['row_sum'] = data.iloc[:, 2:].sum(axis=1)
        print(data.iloc[:, 2:])
        print(data.iloc[:, 2:].sum(axis=1))
        # Filter out rows where the sum is less than 500
        filtered_data = data[data['row_sum'] >= 500].drop(columns=['row_sum'])
        # Save the resulting dataframe to a new file
        filtered_data.to_csv('g216_seqtab.csv.classified_filtered.csv', sep=';', index=False)
    python3 yycH_filtering.py 
  3. draw plots from the two amplicons: cutadapted_g216, cutadapted_yycH

    #Taxonomic database setup and classification
    #- Custom databases of all unique g216 and yycH target sequences can be found at https://github.com/ssi-dk/epidome/tree/master/DB. 
    #- We formatted our g216 and yycH gene databases to be compatible with DADA2's assignTaxonomy function and used it to classify the S. epidermidis ASVs with the RDP naive Bayesian classifier method (https://github.com/ssi-dk/epidome/tree/master/scripts).
    #- ST classification of samples was performed using the g216 target sequence as the primary identifier. 
    #- All g216 sequences unique to a single clonal cluster in the database were immediately classified as the matching clone, and in cases where the g216 sequence matched multiple clones, the secondary yycH target sequences were parsed to determine which clone was present. When this classification failed to resolve due to multiple potential combinations of sequences, ASVs were categorized as “Unclassified”. Similarly, g216 sequences not found in the database were labelled as “Novel”. 
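The decision rules just described can be sketched as a small Python function (hypothetical data structures; this is not the pipeline's actual classify_epidome implementation, which works on count tables):

```python
# Hedged sketch of the two-marker classification logic: g216 is the
# primary identifier, yycH breaks ties, unresolvable combinations become
# "Unclassified", and g216 ASVs absent from the database become "Novel".
def classify_st(g216_asv, yych_asvs, db):
    """db maps g216 ASV -> {clone: expected yycH ASV} (toy structure)."""
    if g216_asv not in db:
        return "Novel"
    candidates = db[g216_asv]
    if len(candidates) == 1:
        return next(iter(candidates))  # g216 alone resolves the clone
    hits = [clone for clone, y in candidates.items() if y in yych_asvs]
    return hits[0] if len(hits) == 1 else "Unclassified"

# Toy database (hypothetical ASV/ST names):
db = {"g1": {"ST5": "y2"}, "g2": {"ST8": "y3", "ST215": "y4"}}
print(classify_st("g1", {"y9"}, db))        # ST5
print(classify_st("g2", {"y4"}, db))        # ST215
print(classify_st("g2", {"y3", "y4"}, db))  # Unclassified
print(classify_st("g9", {"y1"}, db))        # Novel
```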
    
    #install.packages("pls")
    #library(pls)
    #install.packages("reshape")
    #library(reshape)
    #install.packages("vegan")
    library('vegan') 
    library(scales)
    library(RColorBrewer)
    
    #~/Tools/csv2xls-0.4/csv_to_xls.py g216_seqtab.csv.classified_noNA.csv yycH_seqtab.csv.classified_noNA.csv -d$'\t' -o counts.xls
    
    setwd("/media/jhuang/Elements/Data_Luise_Epidome_batch3/run_2023_98_2")
    #under r4-base
    source("../epidome/scripts/epidome_functions.R")
    
    ST_amplicon_table = read.table("../epidome/DB/epidome_ST_amplicon_frequencies.txt",sep = "\t")
    
    epi01_table = read.table("g216_seqtab.csv.classified_filtered.csv",sep = ";",header=TRUE,row.names=1)
    epi02_table = read.table("yycH_seqtab.csv.classified_filtered.csv",sep = ";",header=TRUE,row.names=1)
    #> sum(epi01_table$AH.LH)
    #[1] 78872
    #> sum(epi02_table$AH.LH)
    #[1] 86949
    
    metadata_table = read.table("../metadata.txt",header=TRUE,row.names=1)
    metadata_table$patient.ID <- factor(metadata_table$patient.ID, levels=c("P7","P8","P9","P10","P11","P12","P13","P14","P15","P16","P17","P18","P19","P20", "AH","AL","CB","HR","KK",  "MC","MR","PT2","RP","SA","XN"))
    
    #epidome_object = setup_epidome_object(epi01_table,epi02_table, metadata_table=metadata_table)
    primer1_table <- epi01_table
    primer2_table <- epi02_table
    #setup_epidome_object <- function(primer1_table,primer2_table,metadata_table) {
      #DEBUG: 3:ncol --> 2:ncol
      primer1_counts = primer1_table[,2:ncol(primer1_table)]
      primer2_counts = primer2_table[,2:ncol(primer2_table)]
      primer1_all_sample_names = colnames(primer1_counts)
      primer2_all_sample_names = colnames(primer2_counts)
      samples_with_both_primers = primer1_all_sample_names[which(primer1_all_sample_names %in% primer2_all_sample_names)]
      samples_missing_primer1_data = primer2_all_sample_names[which(!primer2_all_sample_names %in% primer1_all_sample_names)]
      samples_missing_primer2_data = primer1_all_sample_names[which(!primer1_all_sample_names %in% primer2_all_sample_names)]
      primer1_seqs = as.vector(primer1_table$Seq_number)
      primer2_seqs = as.vector(primer2_table$Seq_number)
      primer1_seqs[which(is.na(primer1_seqs))] = "seqUnclassified"
      primer2_seqs[which(is.na(primer2_seqs))] = "seqUnclassified"
      primer1_counts = primer1_table[,which(colnames(primer1_table) %in% samples_with_both_primers)]
      primer2_counts = primer2_table[,which(colnames(primer2_table) %in% samples_with_both_primers)]
      if (!missing(metadata_table)) {
        metadata_names = rownames(metadata_table)
        samples_missing_metadata = samples_with_both_primers[which(!samples_with_both_primers %in% metadata_names)]
        samples_missing_primer_data = metadata_names[which(!metadata_names %in% samples_with_both_primers)]
        include_names = metadata_names[which(metadata_names %in% samples_with_both_primers)]
        metadata_include = match(include_names,metadata_names)
        metadata_include = metadata_include[which(!is.na(metadata_include))]
        metadata_table = metadata_table[metadata_include,]
        #primer1_include = match(colnames(primer1_counts),include_names)
        primer1_include = match(include_names,colnames(primer1_counts))
        primer1_include = primer1_include[which(!is.na(primer1_include))]
        #primer2_include = match(colnames(primer2_counts),include_names)
        primer2_include = match(include_names,colnames(primer2_counts))
        primer2_include = primer2_include[which(!is.na(primer2_include))]
        epi1_table = primer1_counts[,primer1_include]
        epi2_table = primer2_counts[,primer2_include]
        epidome_object = list('p1_seqs'=primer1_seqs,'p1_table'=epi1_table,'p2_seqs'=primer2_seqs,'p2_table'=epi2_table,'metadata'=metadata_table,'sample_names'=include_names,'meta_variables'=colnames(metadata_table))
        print(paste0("Metadata loaded with ",length(metadata_names)," samples and ",ncol(metadata_table)," variables"))
        print(paste0(length(samples_missing_metadata)," samples are found in both primer sets but not in metadata: ",paste0(samples_missing_metadata,collapse = " ")))
        print(paste0(length(samples_missing_primer_data)," samples are found in metadata but is missing from one or both primer sets: ",paste0(samples_missing_primer_data,collapse = " ")))
        print(paste0(length(include_names)," samples are found in metadata and both tables and are included in epidome object"))
    
      } else {
        epi1_table = primer1_counts[,match(colnames(primer1_counts),samples_with_both_primers)]
        epi2_table = primer2_counts[,match(colnames(primer2_counts),samples_with_both_primers)]
        epidome_object = list('p1_seqs'=primer1_seqs,'p1_table'=epi1_table,'p2_seqs'=primer2_seqs,'p2_table'=epi2_table,'sample_names'=samples_with_both_primers)
        print(paste0("No metadata loaded"))
        print(paste0(length(samples_missing_primer2_data)," samples are found in p1 table but not in p2 table: ",paste0(samples_missing_primer2_data,collapse = " ")))
        print(paste0(length(samples_missing_primer1_data)," samples are found in p2 table but not in p1 table: ",paste0(samples_missing_primer1_data,collapse = " ")))
        print(paste0(length(samples_with_both_primers)," samples are found in both tables and are included in epidome object"))
      }
      #return(epidome_object)
    #}
    
    str(epidome_object)
    #List of 7
    # $ p1_seqs       : chr [1:57] "seq28" "seq5" "seq40" "seq31,30" ...
    # $ p1_table      :'data.frame':    57 obs. of  51 variables:
    # $ p2_seqs       : chr [1:53] "seq34" "seq43" "seq37,36" "seq42" ...
    # $ p2_table      :'data.frame':    53 obs. of  51 variables:
    # $ metadata      :'data.frame':    51 obs. of  4 variables:
    # $ sample_names  : chr [1:51] "P7.Nose" "P8.Nose" "P9.Nose" "P10.Nose" ...
    # $ meta_variables: chr [1:4] "patient.ID" "sample.site" "sample.type" "patient.sample.site"
    unique(sort(epidome_object$p1_seqs))  # 57
    #[1] "seq28"       "seq5"        "seq40"       "seq31,30"    "seq14"      
    #[6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    #[11] "seq29"       "seq24"       "seq1"        "seq26"       "seq40"      
    #[16] "seq21"       "seq18"       "seq40"       "seq16"       "seq37"      
    #[21] "seq5,2"      "seq18"       "seq11"       "seq14"       "seq26"      
    #[26] "seq31,30"    "seq37"       "seq26"       "seq1"        "seq21"      
    #[31] "seq3"        "seq40"       "seq26"       "seq28"       "seq40"      
    #[36] "seq10"       "seq5"        "seq40,31,30" "seq37"       "seq37"      
    #[41] "seq37"       "seq40"       "seq28"       "seq26"       "seq37"      
    #[46] "seq22"       "seq28"       "seq19"       "seq5,2"      "seq26"      
    #[51] "seq40"       "seq40"       "seq26"       "seq40"       "seq28"      
    #[56] "seq28"       "seq28"
    unique(sort(epidome_object$p2_seqs))
    
    #Image1
    primer_compare = compare_primer_output(epidome_object,color_variable = "sample.type")
    png("image1.png")
    primer_compare$plot
    dev.off()
    
    eo_ASV_combined = combine_ASVs_epidome(epidome_object)
    #> str(eo_ASV_combined)
    #List of 7
    #$ p1_seqs       : chr [1:21] "seq28" "seq5" "seq40" "seq31,30" ...
    #$ p1_table      :'data.frame': 21 obs. of  51 variables:
    #> eo_ASV_combined$p1_seqs  #21
    #[1] "seq28"       "seq5"        "seq40"       "seq31,30"    "seq14"      
    #[6] "seq20"       "seq22"       "seq26"       "seq37"       "seq21"      
    #[11] "seq29"       "seq24"       "seq1"        "seq18"       "seq16"      
    #[16] "seq5,2"      "seq11"       "seq3"        "seq10"       "seq40,31,30"
    #[21] "seq19"
    #eo_filtered = filter_lowcount_samples_epidome(eo_ASV_combined,500,500)
    
    #Note that we run the following code instead of calling the method from the script, because the code has been modified: 
    count_table_checked = classify_epidome(eo_ASV_combined,ST_amplicon_table)
    write.csv(count_table_checked, file="count_table_checked.csv")
    strict_classifier=FALSE
    #classify_epidome = function(epidome_object,ST_amplicon_table,strict_classifier=FALSE) {
    #The code below is adapted to report the true epi01 and epi02 values:
    epidome_object_norm = normalize_epidome_object(eo_ASV_combined)
    p1_seqs = unlist(lapply(as.vector(eo_ASV_combined$p1_seqs),function(x) substr(x,4,nchar(x))))
    #> p1_seqs
    #[1] "28"       "5"        "40"       "31,30"    "14"       "20"      
    #[7] "22"       "26"       "37"       "21"       "29"       "24"      
    #[13] "1"        "26"       "40"       "21"       "18"       "40"      
    #[19] "16"       "37"       "5,2"      "18"       "11"       "14"      
    #[25] "26"       "31,30"    "37"       "26"       "1"        "21"      
    #[31] "3"        "40"       "26"       "28"       "40"       "10"      
    #[37] "5"        "40,31,30" "37"       "37"       "37"       "40"      
    #[43] "28"       "26"       "37"       "22"       "28"       "19"      
    #[49] "5,2"      "26"       "40"       "40"       "26"       "40"      
    #[55] "28"       "28"       "28" 
    p2_seqs = unlist(lapply(as.vector(eo_ASV_combined$p2_seqs),function(x) substr(x,4,nchar(x))))
    n_samples = length(eo_ASV_combined$sample_names)
    n_p1_seqs = length(p1_seqs)
    count_table = matrix(nrow = 0, ncol = n_samples,dimnames = list('ST'=c(),'Samples'=eo_ASV_combined$sample_names))
    match_type_table = matrix(nrow = n_p1_seqs, ncol = n_samples)
    count_table_names = c()
    unclassified_count_vec = rep(0,n_samples)
    g1_unclassified_count_vec = rep(0,n_samples)
    for (i in 1:n_p1_seqs) {
      #i=4: "31,30" --> length(p1_seq_split)>1 --> possible_STs==("-","-","8","215") --> Unclassified!
      p1_seq = p1_seqs[i]
      p1_seq_split = strsplit(p1_seq,',')[[1]]
      if (length(p1_seq_split)>1) {
        possible_STs = as.vector(ST_amplicon_table$ST)[which(ST_amplicon_table$epi01_ASV %in% p1_seq_split)]
        if (length(possible_STs)==1) {
          p1_seq = p1_seq_split[1]
        } else {
          p1_seq = "Unclassified"
        }
      }
      if (p1_seq != "Unclassified") {
        p1_seq_ST_table = ST_amplicon_table[which(ST_amplicon_table$epi01_ASV==p1_seq),]
        unique_p1_ASVs = unique(as.vector(p1_seq_ST_table$ST))
        p2_ASVs_matching_p1 = p1_seq_ST_table$epi02_ASV
        count_vec = rep(0,n_samples)
        for (j in 1:n_samples) {
          p1_percent = epidome_object_norm$p1_table[i,j]
          p1_count = eo_ASV_combined$p1_table[i,j]
          if (p1_percent > 0.01) {
            p2_seqs_present_within_difference_threshold_idx = which(epidome_object_norm$p2_table[,j]>(p1_percent-10) & epidome_object_norm$p2_table[,j]>1)
            p2_seqs_present_ASVs = p2_seqs[p2_seqs_present_within_difference_threshold_idx]
            p2_seqs_present_ASVs_matching_p1 = p2_seqs_present_ASVs[which(p2_seqs_present_ASVs %in% p2_ASVs_matching_p1)]
            p2_percent = epidome_object_norm$p2_table[which(p2_seqs %in% p2_seqs_present_ASVs_matching_p1),j]
            p2_count = eo_ASV_combined$p2_table[which(p2_seqs %in% p2_seqs_present_ASVs_matching_p1),j]
            if (length(p2_seqs_present_ASVs_matching_p1) == 0) {
              ST_order = order(p1_seq_ST_table$freq,decreasing = T)
              classification_group = as.vector(p1_seq_ST_table$ST)[ST_order[1]]
              if (classification_group %in% count_table_names) {
                classification_idx = which(count_table_names == classification_group)
                count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
              } else {
                count_vec = rep(0,n_samples)
                count_vec[j] = p1_count
                count_table = rbind(count_table,count_vec)
                count_table_names = c(count_table_names,classification_group)
              }
              #match_type_table[i,j] = "epi01 match without epi02 match"
              # Replace the match type with actual epi01 and epi02 values
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:NA", " ", classification_group)
            }
            else if (length(p2_seqs_present_ASVs_matching_p1) == 1) {
              p1_p2_seq_ST_table = p1_seq_ST_table[which(p1_seq_ST_table$epi02_ASV==p2_seqs_present_ASVs_matching_p1[1]),]
              ST_order = order(p1_p2_seq_ST_table$freq,decreasing = T)
              classification_group = as.vector(p1_p2_seq_ST_table$ST)[ST_order[1]]
              if (classification_group %in% count_table_names) {
                classification_idx = which(count_table_names == classification_group)
                count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
              } else {
                count_vec = rep(0,n_samples)
                count_vec[j] = p1_count
                count_table = rbind(count_table,count_vec)
                count_table_names = c(count_table_names,classification_group)
              }
              #match_type_table[i,j] = "Unique epi01 epi02 combination"
              # Replace the match type with actual epi01 and epi02 values
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:", p2_seqs_present_ASVs_matching_p1[1], " ", classification_group)
            } else {
              if (strict_classifier) {
                unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
              } else {
                p1_p2_seq_ST_table = p1_seq_ST_table[which(p1_seq_ST_table$epi02_ASV %in% p2_seqs_present_ASVs_matching_p1),]
                ST_order = order(p1_p2_seq_ST_table$freq,decreasing = T)
                classification_group = as.vector(p1_p2_seq_ST_table$ST)[ST_order[1]]
                if (classification_group %in% count_table_names) {
                  classification_idx = which(count_table_names == classification_group)
                  count_table[classification_idx,j] = count_table[classification_idx,j]+p1_count
                } else {
                  count_vec = rep(0,n_samples)
                  count_vec[j] = p1_count
                  count_table = rbind(count_table,count_vec)
                  count_table_names = c(count_table_names,classification_group)
                }
              }
    
              ##unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
              #match_type_table[i,j] = c("Non unique epi01 epi02 combination")
              # Concatenate all found epi02 values and replace the match type
              epi02_values = paste(p2_seqs_present_ASVs_matching_p1, collapse = ",")
              match_type_table[i, j] = paste0("epi01:", p1_seq, " epi02:", epi02_values, " ", classification_group)
            }
    
          } else {
            match_type_table[i,j] = "Low counts"
            unclassified_count_vec[j] = unclassified_count_vec[j]+p1_count
          }
    
        }
      } else {
        count_vec = as.numeric(as.vector(eo_ASV_combined$p1_table[i,]))
        unclassified_count_vec = unclassified_count_vec+count_vec
        g1_unclassified_count_vec = g1_unclassified_count_vec+count_vec
        match_type_vec = rep("Unclassified epi01 match",n_samples)
        match_type_table[i,] = match_type_vec
      }
    
    }
    count_table = rbind(count_table,unclassified_count_vec)
    rownames(count_table) = c(count_table_names,"Unclassified")
    count_table_df <- as.data.frame(count_table)
    write.csv(count_table_df, file="count_table_df.csv")
    #TODO: delete the last record "Unclassified" due to the few reads!
    #count_table_df = count_table_df[-32,]
    
    # --> FOR STEP4: prepare the file match_type_table.csv for treeHeatmap drawing.
    write.csv(match_type_table, file = "match_type_table.csv", row.names = TRUE)
    
    count_table_df_ordered = count_table_df[order(rowSums(count_table_df),decreasing = T),]
    row_sums <- apply(count_table_df_ordered, MARGIN=1, FUN=sum)
    #TODO: STs whose counts are <= 500 are not shown!
    #NOTE: ST225 is "epi01:3 epi02:3 225"!
    # Since records like "epi01:3 epi02:NA 225" occur, ST225 disappeared from match_type_table.csv_____
    #Note that epi01 is g216, epi02 is yycH.
    #        215            -            8           59          130          297 
    #      611637       401179       341187       337409       316389       316273 
    #         331            2           73          384          200            5 
    #      178616       125734        99899        97885        93613        92385 
    #          14          218           23          278          487          290 
    #       71417        62467        60192        57213        53630        51454 
    #          87           89          100          329          153           83 
    #       49000        23390        21880        21635        19453        10539 
    #          19          170          225*         570          184           88 
    #        9465         1839         1487          863          437          318 
    #         114 Unclassified 
    #         108           11 
    #> row_sums
    #         215            -           59 Unclassified          297          130 
    #      611915       442192       359708       357868       335880       315734 
    #           5           73          331          384          200           14 
    #      145578       134301       127734       126465        94917        85440 
    #           2          218           23          278          290           87 
    #       85438        62492        60192        57213        51363        49000 
    #          89          100          329          210           83           19 
    #       23390        21880        21635        12663        10642         9465 
    #         170          225          570           88          114           10 
    #        1839         1487          863          318          108           91 
    # > unique(sort(eo_ASV_combined$p1_seqs))
    #  [1] "seq1"  "seq10" "seq11" "seq14" "seq16" "seq18" "seq19" "seq2"  "seq20"
    # [10] "seq21" "seq22" "seq24" "seq26" "seq28" "seq29" "seq3"  "seq31" "seq37"
    # [19] "seq40" "seq5" 
    # > unique(sort(eo_ASV_combined$p2_seqs))
    #  [1] "seq10" "seq11" "seq12" "seq14" "seq15" "seq18" "seq19" "seq2"  "seq21"
    # [10] "seq23" "seq25" "seq26" "seq3"  "seq34" "seq35" "seq36" "seq38" "seq42"
    # [19] "seq43" "seq49" "seq5"  "seq52" "seq53" "seq55" "seq56" "seq61" "seq62"
    # [28] "seq63" "seq65" "seq7"
    #row.names(count_table_df) <- c("ST215","ST130","ST278","ST200","ST184","ST5","ST59","ST83","ST487","ST114","ST8","ST297","ST153","ST384","ST2","ST14","ST89","ST570","-","ST290","ST331","ST73","ST88","ST100","ST87","ST23","ST218","ST329","ST19","ST225","ST170")
    #row.names(count_table_df) <- c("215","130","278","200","184","5","59","83","487","114","8","297","153","384","2","14","89","570","-","290","331","73","88","100","87","23","218","329","19","225","170")
    #row.names(count_table_df) <- c("ST215","ST130","ST278","ST200","ST5","ST59","ST83","ST114","ST297","ST384","ST14","ST89","ST210","-","ST328","ST331","ST73","ST2","ST88","ST100","ST10","ST290","ST87","ST23","ST218","ST329","ST19","ST225","ST170","Unclassified")
    
    col_order <- c("P7.Nose","P8.Nose","P9.Nose","P10.Nose","P11.Nose","P12.Nose","P13.Nose","P14.Nose","P15.Nose",     "P16.Foot","P17.Foot","P18.Foot","P19.Foot","P20.Foot","P16.Nose","P17.Nose","P18.Nose","P19.Nose","P20.Nose",    "AH.LH","AH.NLH","AH.Nose", "AL.LH","AL.NLH","AL.Nose", "CB.LH","CB.NLH","CB.Nose", "HR.LH","HR.NLH","HR.Nose", "KK.LH","KK.NLH","KK.Nose", "MC.LH","MC.NLH","MC.Nose", "MR.LH","MR.NLH","MR.Nose",  "PT2.LH","PT2.NLH","PT2.Nose", "RP.LH","RP.NLH","RP.Nose", "SA.LH","SA.NLH","SA.Nose", "XN.LH","XN.NLH","XN.Nose") 
    count_table_df_reordered <- count_table_df_ordered[,col_order]
    p = make_barplot_epidome(count_table_df_reordered,reorder=FALSE,normalize=TRUE)
    #p = make_barplot_epidome(count_table_df_reordered,reorder=TRUE,normalize=TRUE)
    png("Barplot_All.png", width=1600, height=900)
    p
    dev.off()
    
    #Change rowname from '-' to 'Novel'
    rownames(count_table_df_reordered)[rownames(count_table_df_reordered) == "-"] <- "Novel"
    write.csv(file="count_table_df_reordered.txt", count_table_df_reordered)
  4. prepare data for plotTreeHeatmap: table+tree.

      #replace "," with \n in match_type_table.csv
      sed 's/\",\"/\n/g' match_type_table.csv > match_type_table.csv_
      grep "epi02" match_type_table.csv_ > match_type_table.csv__
      grep -v "NA" match_type_table.csv__ > match_type_table.csv___
      #(Deprecated) awk -F' ' '{ split($2, a, ":"); split(a[2], b, ","); print $1 " epi02:" b[1] " " $3 }' match_type_table.csv___ > match_type_table.csv____
      #(Deprecated) sed -i -e 's/\"//g' match_type_table.csv____
      #(Deprecated) sort match_type_table.csv____ -u > match_type_table.csv_____
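A hedged Python equivalent of the sed/grep chain above, assuming the same line format (split on the `","` cell boundaries, keep only epi02 lines, drop NA lines); the shell commands remain the authoritative version:

```python
# Sketch: split CSV cells onto separate lines, keep only entries that
# contain "epi02", and drop entries that contain "NA", mirroring the
# sed/grep steps that produce match_type_table.csv___ above.
def extract_match_types(lines):
    out = []
    for line in lines:
        for part in line.replace('","', '\n').splitlines():
            if "epi02" in part and "NA" not in part:
                out.append(part)
    return out

# Toy row in the match_type_table.csv format:
sample = ['"epi01:3 epi02:3 225","epi01:3 epi02:NA 225","Low counts"']
print(extract_match_types(sample))  # the NA and "Low counts" entries are dropped
```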
    
      #Manually select and merge items from match_type_table.csv___ and save them to match_type_table.csv____ (content shown below)!
          epi01:1 epi02:36 2
          epi01:3 epi02:3 225
          epi01:5 epi02:34 130
          epi01:5 epi02:3 278
          epi01:5 epi02:43 200
          epi01:5 epi02:36 184
          epi01:11 epi02:19 19
          epi01:14 epi02:34 297
          epi01:14 epi02:36 153
          epi01:16 epi02:56 329
          epi01:16 epi02:55 329
          epi01:18 epi02:14 218
          epi01:19 epi02:43 170
          epi01:2 epi02:15 -
          epi01:20 epi02:36 2
          epi01:20 epi02:43 14
          epi01:20 epi02:42 89
          epi01:20 epi02:34 384
          epi01:20 epi02:11 570
          epi01:21 epi02:34 100
          epi01:21 epi02:63 290
          epi01:24 epi02:38 23
          epi01:24 epi02:2 87
          epi01:26 epi02:34 -
          epi01:26 epi02:53 290
          epi01:26 epi02:63 331
          epi01:28 epi02:43 215
          epi01:29 epi02:43 215
          epi01:29 epi02:55 -
          epi01:31 epi02:43 8
          epi01:37 epi02:36 2
          epi01:37 epi02:26 88
          epi01:37 epi02:43 73
          epi01:40 epi02:2 5
          epi01:40 epi02:34 5
          epi01:40 epi02:34 59
          epi01:40 epi02:26 114
          epi01:40 epi02:43 83
          epi01:40 epi02:36 487
          #---- v2 ('-' three times, ST215, ST290, ST329, ST59 two times) --> delete 'epi01:28 epi02:43 215' ----
          epi01:1 epi02:34 -
          epi01:3 epi02:3 225
          epi01:5 epi02:3 278
          epi01:5 epi02:34 130
          epi01:5 epi02:43 200
          epi01:11 epi02:19 19
          epi01:14 epi02:34 297
          epi01:16 epi02:55 329
          epi01:16 epi02:56 329
          epi01:18 epi02:14 218
          epi01:19 epi02:43 170
          epi01:20 epi02:42 89
          epi01:20 epi02:43 14
          epi01:20 epi02:34 384
          epi01:20 epi02:11 570
          epi01:20 epi02:3 210
          epi01:21 epi02:38 10
          epi01:21 epi02:34 100
          epi01:21 epi02:63 290
          epi01:24 epi02:38 23
          epi01:24 epi02:2 87
          epi01:26 epi02:34 -
          epi01:26 epi02:53 290
          epi01:26 epi02:63 331
          epi01:29 epi02:43 215
          epi01:29 epi02:55 -
          epi01:37 epi02:62 2
          epi01:37 epi02:26 88
          epi01:37 epi02:43 73
          epi01:37 epi02:34 59
          epi01:40 epi02:2 5
          epi01:40 epi02:26 114
          epi01:40 epi02:43 83
          epi01:40 epi02:34 59
    
    python3 concat_epi01_epi02.py
        # Define the function to extract fasta sequences
        def fasta_to_dict(filename):
            sequences = {}
            header = None
            sequence = []
            with open(filename, 'r') as file:
                for line in file:
                    line = line.strip()
                    if line.startswith('>'):
                        if header:
                            sequences[header] = ''.join(sequence)
                        header = line[1:]
                        sequence = []
                    else:
                        sequence.append(line)
                if header:
                    sequences[header] = ''.join(sequence)
            return sequences
        # Read combinations from match_type_table.csv____
        combinations = []
        with open("match_type_table.csv____", 'r') as csv_file:
            for line in csv_file:
                parts = line.strip().split()
                epi01 = parts[0].split(":")[1]
                epi02 = parts[1].split(":")[1]
                st = parts[2]
                combinations.append((epi01, epi02, st))
    
        # Read both fasta files into dictionaries
        g216_sequences = fasta_to_dict("../epidome/DB/epi01_ref_aln.fasta")
        yycH_sequences = fasta_to_dict("../epidome/DB/epi02_ref_aln.fasta")
    
        output_filename = "concatenated_sequences.fasta"
        with open(output_filename, 'w') as output_file:
            for epi01, epi02, st in combinations:
                if epi01 in g216_sequences and epi02 in yycH_sequences:
                    concatenated_seq = g216_sequences[epi01] + "NNN" + yycH_sequences[epi02]
                    header = f">g216-{epi01}_yycH-{epi02}_ST{st}"
                    output_file.write(f"{header}\n{concatenated_seq}\n")
    
        print(f"Concatenated sequences saved to {output_filename}")
    
    sed -i -e 's/ST-/-/g' concatenated_sequences.fasta
    muscle -in concatenated_sequences.fasta -out aligned.fasta
    muscle -clw -in concatenated_sequences.fasta -out aligned.clw
    FastTree -nt < aligned.fasta > concatenated_sequences.tree
    # Run plotTree.R to draw ggtree_and_gheatmap.png.
    
    library(ggtree)
    library(ggplot2)
    
    setwd("/home/jhuang/DATA/Data_Luise_Epidome_batch3/run_2023_98_2/")
    
    # -- edit tree --
    #https://icytree.org/
    info <- read.csv("typing_34.csv")
    info$name <- info$Isolate
    tree <- read.tree("concatenated_sequences.tree")
    #cols <- c(infection='purple2', commensalism='skyblue2')     
    
    library(dplyr)
    heatmapData2 <- info %>% select(Isolate,  g216, ST)
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    #https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html
    #"g216:16","g216:18","g216:19","g216:20"(4),"g216:21"(2),"g216:24"(1),"g216:26"(2)
    heatmap.colours <- c("darkgrey",   "palegreen","seagreen","seagreen","seagreen4","seagreen4","yellowgreen",    "orange4","orange4","orange4","orange4","orange4",    "dodgerblue3","blue4","dodgerblue3","dodgerblue3",  "azure4",    "magenta2","maroon2","mediumorchid1","mediumorchid1","mediumorchid1","mediumorchid1","mediumorchid1", "mediumorchid3","mediumorchid3","mediumorchid3","mediumpurple1","mediumpurple4","mediumpurple4",         "red4",                                            "olivedrab3","olivedrab4","seagreen","seagreen4","yellowgreen",    "orange4",    "blue4","dodgerblue3",           "magenta2","maroon2","mediumorchid1", "mediumorchid3","mediumpurple1","mediumpurple4",         "red4")
    names(heatmap.colours) <- c("-",    "ST225","ST130","ST278","ST19","ST200", "ST297",                                             "ST5","ST59","ST114","ST83","ST487",   "ST2","ST215","ST73","ST88",   "ST8",                                                               "ST329","ST170", "ST89","ST14","ST570","ST384","ST210",                  "ST10","ST100","ST290","ST23","ST87", "ST331",     "ST218",      "g216:1","g216:3","g216:5","g216:11","g216:14",   "g216:40",   "g216:29","g216:37",        "g216:16","g216:19","g216:20","g216:21","g216:24","g216:26",    "g216:18")
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    
    #circular
    #geom_tippoint(aes(color=Type)) + scale_color_manual(values=cols) + 
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tiplab2(aes(label=name), offset=1)
    #, geom='text', align=TRUE,  linetype=NA, hjust=1.8,check.overlap=TRUE, size=3.3
    #difference between geom_tiplab and geom_tiplab2?
    #+ theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) + theme(axis.text = element_text(size = 20))  + scale_size(range = c(1, 20))
    #font.size=10, 
    png("ggtree.png", width=1260, height=1260)
    #svg("ggtree.svg", width=1260, height=1260)
    p
    dev.off()
    
    desired_order <- c("-",    "ST225","ST130","ST278","ST19","ST200", "ST297",                                             "ST5","ST59","ST114","ST83","ST487",   "ST2","ST215","ST73","ST88",   "ST8",                                                               "ST329","ST170", "ST89","ST14","ST570","ST384","ST210",                  "ST10","ST100","ST290","ST23","ST87", "ST331",     "ST218",      "g216:1","g216:3","g216:5","g216:11","g216:14",   "g216:40",   "g216:29","g216:37",        "g216:16","g216:19","g216:20","g216:21","g216:24","g216:26",    "g216:18")
    
    png("ggtree_and_gheatmap.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap.svg", width=17, height=15)
    gheatmap(p, heatmapData2, width=0.15,colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 9) + scale_fill_manual(values=heatmap.colours, breaks=desired_order) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))  
    dev.off()
  5. prepare the table for the alpha diversity calculation

    To calculate the alpha diversity values for each sample based on the provided metrics (chao1, observed_otus, shannon, PD_whole_tree), you would typically use specialized software such as QIIME or R packages like vegan. Below is a high-level approach with calculations for some of the metrics; metrics like PD_whole_tree additionally require a phylogenetic tree, which cannot be derived from the count table alone.

    import pandas as pd
    import numpy as np
    
    # Read the count table (rows = types, columns = samples)
    df = pd.read_csv('count_table_df_reordered.csv', index_col=0)
    
    def chao1(s):
        s = np.array(s)
        s_obs = np.count_nonzero(s)
        s1 = np.sum(s == 1)  # singletons
        s2 = np.sum(s == 2)  # doubletons
        return s_obs + ((s1 ** 2) / (2 * s2)) if s2 != 0 else s_obs
    
    def observed_sts(s):
        return np.count_nonzero(s)
    
    def shannon2(s):
        # Equivalent earlier formulation, kept for reference (unused below)
        s = np.array(s)
        p = s / np.sum(s)
        return -np.sum(p * np.log(p + 1e-10))  # small value added to avoid log(0)
    
    def shannon(s):
        s = np.array(s)
        total = np.sum(s)
        p = s / total  # convert absolute counts to proportions
        return -np.sum(p * np.log(p + 1e-10))  # small value added to avoid log(0)
    
    # Calculate the metrics for each sample (column)
    results = {
        'sample': [],
        'chao1': [],
        'observed_sts': [],
        'shannon': []
    }
    
    for column in df.columns:
        results['sample'].append(column)
        results['chao1'].append(chao1(df[column]))
        results['observed_sts'].append(observed_sts(df[column]))
        results['shannon'].append(shannon(df[column]))
    
    # Convert results to DataFrame and save to a new CSV
    result_df = pd.DataFrame(results)
    result_df.to_csv('alpha_diversity_metrics_samples.csv', index=False)

    For PD_whole_tree, you would require a phylogenetic tree of the OTUs and then compute the sum of branch lengths. This is more complex and requires specialized software and the phylogenetic tree information, which isn’t present in the table you provided.
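    As a sketch of what PD_whole_tree measures: Faith's phylogenetic diversity is the sum of the branch lengths of the subtree connecting the observed taxa to the root. A minimal pure-Python illustration on a hand-coded toy tree (the topology and branch lengths are invented for demonstration, not taken from this dataset):

```python
# Toy rooted tree, encoded as child -> (parent, branch_length).
# Topology and branch lengths are invented for illustration only.
tree = {
    "A":  ("n1", 1.0),
    "B":  ("n1", 2.0),
    "C":  ("n2", 3.0),
    "n1": ("n2", 1.5),
    "n2": ("root", 0.5),
}

def faith_pd(observed, tree):
    """Faith's PD: sum the branch lengths on the union of the
    root-to-leaf paths of the taxa observed in a sample."""
    used = set()
    total = 0.0
    for taxon in observed:
        node = taxon
        while node in tree:            # walk up until the root
            if node not in used:
                used.add(node)
                total += tree[node][1]
            node = tree[node][0]
    return total

print(faith_pd(["A", "B"], tree))  # 1.0 + 2.0 + 1.5 + 0.5 = 5.0
print(faith_pd(["A", "C"], tree))  # 1.0 + 1.5 + 3.0 + 0.5 = 6.0
```

    A real analysis would read the tree from a Newick file (e.g. the FastTree output above) with a library such as scikit-bio, which ships this metric as `faith_pd`.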

    The input data being absolute counts versus relative abundance does influence certain alpha diversity metrics. Let’s re-evaluate:

    • Observed OTUs: Whether the data is in absolute counts or relative abundance doesn’t impact the “Observed OTUs” metric. It’s simply counting how many OTUs have a non-zero count.

    • Chao1: Chao1 is intended for absolute counts. The estimator is based on the numbers of singletons and doubletons, which only exist in raw counts; once counts are converted to relative abundances that information is lost, so Chao1 should be computed on the unnormalized count table.

    • Shannon Diversity Index: The Shannon Diversity Index is based on the relative abundances of OTUs, not their absolute counts. The formula requires the proportion of each species (or OTU) in the sample. Thus, you’d need to convert absolute counts to proportions (or relative abundance) before calculating this metric.
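    The same point as a runnable check, re-implementing the two metrics on toy counts (numbers invented for illustration): scaling all counts by 100 leaves Shannon unchanged but wipes out the singletons and doubletons that Chao1 depends on.

```python
import numpy as np

def chao1(s):
    # Needs raw counts: estimates unseen richness from singletons/doubletons.
    s = np.array(s)
    s_obs = np.count_nonzero(s)
    s1 = np.sum(s == 1)
    s2 = np.sum(s == 2)
    return s_obs + ((s1 ** 2) / (2 * s2)) if s2 != 0 else s_obs

def shannon(s):
    # Uses proportions, so a uniform scaling of the counts cancels out.
    s = np.array(s, dtype=float)
    p = s / s.sum()
    return -np.sum(p * np.log(p + 1e-10))

counts = [10, 1, 1, 2, 0]              # raw counts for 5 types
scaled = [c * 100 for c in counts]     # same composition, 100x sequencing depth

print(shannon(counts) == shannon(scaled))  # True: proportions are identical
print(chao1(counts))   # 6.0: 4 observed + 2^2 / (2*1)
print(chao1(scaled))   # 4: no singletons/doubletons left, estimate collapses
```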

  6. calculate the alpha diversity in Phyloseq.Rmd. The code for the calculation is as follows.

    \pagebreak
    
    # Alpha diversity for ST
    Plot Shannon index and observed STs. 
    Regroup together samples from the same group.
    
    ```{r, echo=TRUE, warning=FALSE}
    hmp.div_st <- read.csv("alpha_diversity_metrics_samples.csv", sep=",") 
    colnames(hmp.div_st) <- c("sample", "chao1", "observed_sts", "shannon")
    row.names(hmp.div_st) <- hmp.div_st$sample
    div.df <- merge(hmp.div_st, hmp.meta, by.x="sample", by.y = "sam_name")
    div.df2 <- div.df[, c("Group", "shannon", "observed_sts")]
    colnames(div.df2) <- c("Group", "Shannon", "ST")
    #colnames(div.df2)
    options(max.print=999999)
    
    stat.test.Shannon <- compare_means(
      Shannon ~ Group, data = div.df2,
      method = "t.test"
    )
    knitr::kable(stat.test.Shannon) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
    
    stat.test.ST <- compare_means(
      ST ~ Group, data = div.df2,
      method = "t.test"
    )
    knitr::kable(stat.test.ST) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
    
    div_df_melt <- reshape2::melt(div.df2)
    
    p <- ggboxplot(div_df_melt, x = "Group", y = "value",
                  facet.by = "variable", 
                  scales = "free",
                  width = 0.5,
                  fill = "gray", legend= "right")
    #ggpar(p, xlab = FALSE, ylab = FALSE)
    lev <- levels(factor(div_df_melt$Group)) # get the variables
    L.pairs <- combn(seq_along(lev), 2, simplify = FALSE, FUN = function(i) lev[i]) #%>% filter(p.signif != "ns")
    my_stat_compare_means  <- function (mapping = NULL, data = NULL, method = NULL, paired = FALSE, 
        method.args = list(), ref.group = NULL, comparisons = NULL, 
        hide.ns = FALSE, label.sep = ", ", label = NULL, label.x.npc = "left", 
        label.y.npc = "top", label.x = NULL, label.y = NULL, tip.length = 0.03, 
        symnum.args = list(), geom = "text", position = "identity", 
        na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...) 
    {
        if (!is.null(comparisons)) {
            method.info <- ggpubr:::.method_info(method)
            method <- method.info$method
            method.args <- ggpubr:::.add_item(method.args, paired = paired)
            if (method == "wilcox.test") 
                method.args$exact <- FALSE
            pms <- list(...)
            size <- ifelse(is.null(pms$size), 0.3, pms$size)
            color <- ifelse(is.null(pms$color), "black", pms$color)
            map_signif_level <- FALSE
            if (is.null(label)) 
                label <- "p.format"
            if (ggpubr:::.is_p.signif_in_mapping(mapping) | (label %in% "p.signif")) {
                if (ggpubr:::.is_empty(symnum.args)) {
                    map_signif_level <- c(`****` = 1e-04, `***` = 0.001, 
                      `**` = 0.01, `*` = 0.05, ns = 1)
                } else {
                  map_signif_level <- symnum.args
                } 
                if (hide.ns) 
                    names(map_signif_level)[5] <- " "
            }
            step_increase <- ifelse(is.null(label.y), 0.12, 0)
            ggsignif::geom_signif(comparisons = comparisons, y_position = label.y, 
                test = method, test.args = method.args, step_increase = step_increase, 
                size = size, color = color, map_signif_level = map_signif_level, 
                tip_length = tip.length, data = data)
        } else {
            mapping <- ggpubr:::.update_mapping(mapping, label)
            layer(stat = StatCompareMeans, data = data, mapping = mapping, 
                geom = geom, position = position, show.legend = show.legend, 
                inherit.aes = inherit.aes, params = list(label.x.npc = label.x.npc, 
                    label.y.npc = label.y.npc, label.x = label.x, 
                    label.y = label.y, label.sep = label.sep, method = method, 
                    method.args = method.args, paired = paired, ref.group = ref.group, 
                    symnum.args = symnum.args, hide.ns = hide.ns, 
                    na.rm = na.rm, ...))
        }
    }
    
    p3 <- p + 
      stat_compare_means(
        method = "t.test",
        comparisons = list(c("P16-P20.Foot", "P16-P20.Nose"), c("AH-XN.LH", "AH-XN.Nose"), c("AH-XN.NLH", "AH-XN.Nose")), 
        label = "p.signif",
        symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c("****", "***", "**", "*", "ns"))
      ) +
      theme(axis.text.x = element_text(angle = 30, hjust = 1))  # add this line to rotate x-axis text
    print(p3)
    
    ggsave("./figures/alpha_diversity_ST_Group2.png", device="png", height = 10, width = 12)
    ggsave("./figures/alpha_diversity_ST_Group2.svg", device="svg", height = 10, width = 12)
    ```

Degradation mechanisms of cellular molecules, and the preparation and uses of cDNA

  1. mRNA, DNA, and proteins are crucial molecules within cells. They undergo degradation to maintain cellular homeostasis. Here’s how they are degraded in the human body:

    • mRNA degradation:

      • 5′->3′ decay: This is the most common mRNA degradation pathway in eukaryotes. When the cap structure at the 5′ end of an mRNA molecule is removed, an enzyme called Xrn1 begins to degrade the mRNA progressively from the 5′ end.
      • 3′->5′ decay: Multi-protein complexes, like the exosome complex, can degrade mRNA progressively from the 3′ end.
      • Endonucleolytic cleavage: Some internal RNases can specifically recognize and cleave mRNAs, leading to their degradation.
    • DNA degradation:

      • DNA degradation is primarily for damage repair or programmed cell death.
      • Many DNA repair enzymes exist in cells. When DNA is damaged, certain enzymes can identify and remove the damaged parts, which are then filled and ligated by other enzymes.
      • In programmed cell death (such as apoptosis), DNA is cleaved by endogenous endonucleases into nucleosome-sized fragments as part of the controlled, orderly dismantling of the cell.
    • Protein degradation:

      • Proteasome pathway: The proteasome is a large protein complex that identifies and degrades proteins marked with ubiquitin. Ubiquitinated proteins are recognized and degraded into short peptides by the proteasome.
      • Lysosomal pathway: Lysosomes are acidic vesicles containing various enzymes. Proteins can be delivered to lysosomes and degraded there.
      • Autophagy: Cells can degrade their components through a process called autophagy. Here, the cell forms a membranous sac around proteins or organelles, which then fuses with a lysosome and degrades its contents.
  2. cDNA (complementary DNA) is double-stranded DNA synthesized from an mRNA template through the action of reverse transcriptase. It is complementary to the mRNA, so its sequence corresponds to the mRNA’s coding region but excludes introns.

    • The preparation of cDNA is as follows:

      • Reverse transcription: First, purified mRNA is mixed with a short oligo(dT) primer (a stretch of thymidine residues, also called a poly-T primer). This primer binds to the poly-A tail of the mRNA.
      • Using reverse transcriptase, this primer starts synthesizing a single-stranded cDNA.
      • This single-stranded cDNA can then serve as a template, with DNA polymerase synthesizing the second strand to produce double-stranded cDNA.
    • cDNA has several uses in molecular biology research:

      • cDNA libraries: Since cDNA is synthesized from mRNA, it represents only transcribed genes. By preparing a cDNA library from specific tissues or cell types, researchers can determine which genes are expressed under specific conditions.
      • Cloning and expression: cDNA can be cloned into expression vectors and then produce proteins in host cells. Since cDNA only contains coding regions and excludes introns, it allows for the correct protein expression in prokaryotic cells (like E. coli).
      • Gene chips and RNA sequencing: cDNA is also frequently used in gene chip technology and RNA sequencing as a reference or comparator.