Author Archives: gene_x

How to extract promoters positions?

#https://charlesjb.github.io/How_to_extract_promoters_positions/  
setwd("/media/jhuang/Elements1/Data_Denise_LT_DNA_Bindung")

# Load required libraries
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(GenomicFeatures)
library(rtracklayer)
library(GenomicRanges)
library(org.Hs.eg.db)

# Load TxDb object
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

# Get promoters and genes
promoters_txdb <- promoters(genes(txdb))
genes_txdb <- genes(txdb)

# Find overlaps between promoters and genes
overlaps <- findOverlaps(promoters_txdb, genes_txdb)

# Add gene names, scores, and strands to promoters
promoters_with_gene_names <- promoters_txdb[queryHits(overlaps)]
gene_ids <- mcols(genes_txdb[subjectHits(overlaps)])$gene_id
gene_symbols <- mapIds(org.Hs.eg.db, keys = gene_ids, column = "SYMBOL", keytype = "ENTREZID", multiVals = "first")
mcols(promoters_with_gene_names)$gene_name <- gene_symbols
mcols(promoters_with_gene_names)$score <- 0

# Remove duplicate entries
promoters_with_gene_names <- unique(promoters_with_gene_names)

# Custom function to export the data in the desired BED format
export_bed_with_extra_columns <- function(gr, file) {
  bed_data <- data.frame(chrom = as.character(seqnames(gr)),
                         start = pmax(start(gr) - 1, 0), # Ensure start is never negative
                         end = end(gr),
                         gene_name = mcols(gr)$gene_name,
                         score = mcols(gr)$score,
                         strand = as.character(strand(gr)),
                         stringsAsFactors = FALSE)
  write.table(bed_data, file, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# Save promoters with gene names, scores, and strands as a BED file
export_bed_with_extra_columns(promoters_with_gene_names, "promoters_with_gene_names.bed")

Why Do Significant Gene Lists Change After Adding Additional Conditions in Differential Gene Expression Analysis?

DESeq2 is a popular method for differential expression analysis of count data from RNA-seq experiments. It estimates fold changes and tests for differential expression using a negative binomial generalized linear model. When you perform a differential expression analysis using DESeq2, the results may vary depending on the experimental conditions included in the analysis. In your case, the significant gene list differs between the two analyses: one with condition A and B, and the other with condition A, B, and C.

There are several reasons for this discrepancy:

  1. Normalization factors: DESeq2 estimates size factors for normalization of read counts across samples. When you add an additional condition (condition C in your case), the normalization factors may change, which can affect the fold change estimates and ultimately the list of differentially expressed genes.

  2. Dispersion estimation: DESeq2 uses a shrinkage estimator for dispersions, which are gene-specific measures of biological variability across replicates. Including an additional condition may affect the dispersion estimates, and therefore influence the list of differentially expressed genes.

  3. Multiple testing correction: DESeq2 uses the Benjamini-Hochberg procedure to control the false discovery rate (FDR) when testing for differentially expressed genes. With the addition of another condition, the number of tests may change, which can affect the FDR threshold and consequently the list of significant genes.

  4. Model fitting: DESeq2 fits a negative binomial generalized linear model to the count data. Including an additional condition may affect the model fitting, leading to different estimates of the coefficients and p-values, which in turn can affect the list of differentially expressed genes.

These factors can contribute to the differences in the significant gene list between the two analyses. To minimize discrepancies, ensure that you have a well-designed experiment with biological replicates for each condition and carefully consider the comparisons of interest when setting up the design matrix for DESeq2 analysis.

The choice between inputting A and B or inputting A, B, and C depends on your research goals and the specific comparisons you want to make. Both methods can be appropriate, but they address different questions.

  • Input A and B: If you are interested in comparing the gene expression profiles between conditions A and B, and condition C is not relevant to this specific comparison, then inputting only A and B would be a better choice. By analyzing only the conditions relevant to your research question, you can focus on the specific contrasts of interest, making the interpretation of the results more straightforward.

  • Input A, B, and C: If your research goals involve understanding the gene expression profiles in a broader context, such as comparing all three conditions or investigating how the expression profiles change across a time course or gradient, then inputting A, B, and C would be more appropriate. In this case, including all conditions in the analysis will provide a more comprehensive view of the gene expression changes, and the comparisons between the different conditions can help identify genes that show unique expression patterns or that are specific to certain conditions.

In summary, the choice between inputting A and B or inputting A, B, and C depends on your research question and objectives. It is crucial to clearly define the comparisons you are interested in and to consider the biological context of your study before deciding which method to use.

When adding an additional condition to a differential gene expression analysis, it’s natural for the results to change due to factors such as normalization, dispersion estimation, model fitting, and multiple testing correction, as previously discussed. However, there are some strategies you can apply to minimize the impact of adding an additional condition and make your results more robust:

  1. Independent analysis: Analyze the pairwise comparisons separately (A vs B, A vs C, and B vs C), and then compare the lists of significant genes obtained from each analysis. This approach keeps the comparisons of interest independent of the other conditions.

  2. Batch effect correction: If the additional condition introduces batch effects, you can use methods like ComBat from the sva package or limma’s removeBatchEffect function to correct for these batch effects before running differential expression analysis. This can help reduce the impact of adding an additional condition on the results.

  3. Consistent normalization methods: Use consistent normalization methods across all samples, even if you are adding a new condition. This will ensure that the read counts are comparable across all samples, reducing the impact of the additional condition on the results.

  4. Incorporate the additional condition in the model: If you want to include the additional condition in your analysis but keep the results between A and B consistent, you can include the additional condition as a covariate in your model. This will account for the effect of the additional condition while still allowing you to make the comparisons of interest.

  5. Intersection of significant genes: Perform the differential expression analysis with and without the additional condition, and then take the intersection of the significant genes from both analyses. The intersecting set of genes is likely to be more robust against the addition of the extra condition.

However, it’s important to note that adding an additional condition will inherently influence the results to some extent, as the analysis is dependent on the data provided. The strategies mentioned above can help to minimize the impact of adding an additional condition, but they cannot completely eliminate it.

H3K4me3, H3K27me3 and TF

H3K4me3(trimethylated histone H3 lysine 4)和H3K27me3(trimethylated histone H3 lysine 27)是组蛋白修饰的一种,与基因表达调控密切相关。转录因子(transcription factor)则是一类能够调控基因转录的蛋白质。

  1. H3K4me3:通常与基因启动子区域相关联,标记着活性染色质,有利于基因的转录。H3K4me3作为一个表观遗传标记,可以招募并结合转录因子和其他转录辅助因子,从而启动基因转录。

  2. H3K27me3:与H3K4me3相反,H3K27me3通常与基因沉默相关联,它通过招募和结合PRC2复合物(Polycomb Repressive Complex 2)来抑制基因转录。当PRC2将H3K27位点的甲基化水平提高到三甲基化时,基因表达受到抑制。

  3. 转录因子:是一类特殊的蛋白质,能够结合到特定的DNA序列,进而调控相应基因的转录。转录因子可以分为激活因子(activators)和抑制因子(repressors),分别负责增强和抑制基因转录。转录因子通过识别并结合到启动子、增强子等调控元件上,招募或阻止RNA聚合酶的结合,进而调控基因转录。

H3K4me3和H3K27me3作为表观遗传修饰,与转录因子一起共同参与基因表达调控。在某些情况下,这两种修饰可以在同一基因上共存,形成所谓的“bivalent domains”,这种状态使得基因处于一种“待机”状态,可以在适当的信号刺激下迅速被激活或进一步被抑制。这种表观遗传调控机制在发育过程中和疾病发生过程中起到了重要作用。

利用H3K4me3和H3K27me3数据不能直接计算出转录因子。H3K4me3和H3K27me3是一种组蛋白修饰,它们可以影响基因表达,但并非直接参与转录因子结合的过程。要识别转录因子的结合位点,可以使用ChIP-seq(染色质免疫沉淀测序)等实验方法。

然而,H3K4me3和H3K27me3数据可以帮助我们预测可能的转录起始位点(TSS)和活性/非活性基因。接下来,我们可以利用这些信息来分析转录因子的结合模式和功能。

以下是一种可能的策略:

  • 分析H3K4me3和H3K27me3的ChIP-seq数据,找到基因组上这两种修饰的富集区域。
  • 利用这些修饰的富集信息,预测可能的转录起始位点(TSS)和活性/非活性基因。
  • 对于已知的或预测的活性基因,可以进一步分析转录因子的结合模式。此时,需要转录因子的ChIP-seq数据或其他相关实验数据,如DNase-seq或ATAC-seq,它们可以揭示开放染色质区域。
  • 对转录因子的结合位点进行分析,可以使用生物信息学工具如MEME、HOMER等来识别转录因子结合位点的共有序列特征(motif)。
  • 结合基因表达数据,可以进一步研究转录因子对特定基因的调控作用。

总之,H3K4me3和H3K27me3数据可以为我们提供基因表达调控的信息,但需要结合其他实验数据和分析方法,才能对转录因子的结合和功能进行研究。

结合H3K4me3、H3K27me3的ChIP-seq数据和RNA-seq数据来推测转录因子活动的详细策略如下:

  1. 数据准备:

    • 收集H3K4me3、H3K27me3的ChIP-seq数据,以及RNA-seq数据。
    • 对ChIP-seq和RNA-seq数据进行质量控制和预处理。
  2. ChIP-seq数据分析:

    • 使用MACS、SICER等软件对ChIP-seq数据进行比对和峰值检测,找到H3K4me3和H3K27me3在基因组上的富集区域。
    • 预测潜在的转录起始位点(TSS)及活性/非活性基因。H3K4me3富集区域通常位于活性基因的启动子附近,而H3K27me3富集区域则与非活性基因相关联。
  3. RNA-seq数据分析:

    • 对RNA-seq数据进行比对,使用HISAT2、STAR等软件将测序读取比对到参考基因组。
    • 估算基因表达水平,使用featureCounts、HTSeq等软件计算每个基因的读取计数,然后使用DESeq2、edgeR等软件对计数数据进行标准化,得到基因的表达水平。
  4. 差异表达基因分析:

    • 结合基因表达水平和H3K4me3、H3K27me3富集区域信息,确定哪些基因在特定条件下是活性的。
    • 使用DESeq2、edgeR等软件进行差异表达基因分析,找到在不同条件下显著差异表达的基因。
  5. 预测转录因子结合位点:

    • 对于差异表达的基因,检查它们的启动子和调控元件区域,以找到潜在的转录因子结合位点。
    • 使用生物信息学工具如MEME、HOMER等来识别转录因子结合位点的共有序列特征(motif)。
  6. 构建转录因子调控网络:

    • 根据转录因子结合位点和基因表达变化之间的关联,推测哪些转录因子可能在特定条件下起到了调控作用。
    • 使用Gene Set Enrichment Analysis(GSEA)等方法评估转录因子的潜在活动。

将预测出的转录因子与它们可能调控的差异表达基因关联起来,构建转录因子调控网络。 请注意,虽然这种策略可以在一定程度上推测转录因子活动,但它并不能直接计算出转录因子本身.

H3K4me3的peak region通常与转录起始位点(TSS)附近的启动子区域相关联。这些区域通常包含转录因子的结合位点。然而,H3K4me3富集区域并不等同于转录因子的结合区域。它们只是一个与活性基因相关的表观遗传标记。

转录因子结合位点通常位于开放染色质区域,这些区域可以通过DNase-seq或ATAC-seq等实验方法来检测。为了更直接地研究特定转录因子的结合和活动,可以使用ChIP-seq方法来检测转录因子在基因组上的结合位置。

综上所述,虽然H3K4me3的peak region可能包含转录因子的结合位点,但它们并不等同于转录因子的结合区域。要更准确地找到转录因子的结合位点,需要使用更专门针对转录因子结合的实验方法。

H3K27me3(histone H3 trimethylated at lysine 27)是一种组蛋白修饰,通常与基因沉默和抑制相关联。H3K27me3主要由PRC2复合物(Polycomb Repressive Complex 2)催化生成。H3K27me3的结合位置主要出现在基因组的以下区域:

  • 启动子区域:H3K27me3可以结合到基因的启动子区域,从而抑制RNA聚合酶的结合和基因的转录。

  • 基因体内:H3K27me3也可以在基因体内的染色质区域中富集,与抑制基因表达相关联。

  • 间隔区域:在某些情况下,H3K27me3可以在基因间区域形成大片的富集区,这些区域被称为“Polycomb组合域”。这些区域通常与异染色质结构、基因组稳定性和长期基因沉默有关。

H3K27me3结合位置的识别可以通过ChIP-seq(染色质免疫沉淀测序)实验方法实现。通过分析H3K27me3 ChIP-seq数据,可以找到H3K27me3在基因组上的富集区域。这些区域往往与基因表达受到抑制的区域相对应。

利用RNA-seq数据预测转录因子(transcription factor,TF)活性是可能的,但需要采用一些间接方法。RNA-seq数据为我们提供了基因在给定条件下的表达水平,但无法直接显示TF在基因组上的结合位点。然而,我们可以通过分析差异表达基因和TF的调控网络来推测TF的活性。

以下是使用RNA-seq数据预测TF活性的一种策略:

分析RNA-seq数据:

对RNA-seq数据进行比对,使用HISAT2、STAR等软件将测序读取比对到参考基因组。
估算基因表达水平,使用featureCounts、HTSeq等软件计算每个基因的读取计数,然后使用DESeq2、edgeR等软件对计数数据进行标准化,得到基因的表达水平。
差异表达基因分析:

使用DESeq2、edgeR等软件进行差异表达基因分析,找到在不同条件下显著差异表达的基因。
基因调控网络推断:

利用已知的转录因子靶点关系(如来自TRANSFAC、JASPAR等数据库),或者使用基因调控网络推断工具(如GENIE3、ARACNe等),根据差异表达基因的表达模式推断可能的TF-靶基因关系。
转录因子活性评估:

结合差异表达基因和推断出的TF-靶基因关系,使用Gene Set Enrichment Analysis(GSEA)或其他类似方法评估转录因子的潜在活动。
请注意,这种策略依赖于预测的TF-靶基因关系和基因表达模式,可能无法准确反映TF的真实活动。为了更直接地研究特定TF的结合和活动,可以使用ChIP-seq、DNase-seq或ATAC-seq等方法来检测TF在基因组上的结合位置。

可以通过分析差异表达基因的启动子区域来推导出潜在的motif(转录因子结合位点的共有序列特征)。以下是一种实现这一目标的策略:

提取差异表达基因的启动子序列:对于每个差异表达基因,从参考基因组提取其转录起始位点(TSS)附近的一段序列,作为启动子区域。通常可以选择TSS上游1000bp到下游200bp的区域,但这个范围可以根据具体需求进行调整。

连接启动子序列:将所有差异表达基因的启动子序列连接起来,形成一个较长的序列。这将用于后续的motif发现分析。

寻找motif:使用生物信息学工具,如MEME、HOMER、DREME等,分析启动子序列,以寻找在多个启动子中重复出现的共有序列特征。这些共有序列特征可能表示潜在的转录因子结合位点。

比较已知motif:将发现的motif与已知的转录因子结合位点进行比较,以确定可能的转录因子。可以使用转录因子结合位点数据库,如JASPAR、TRANSFAC、HOCOMOCO等,来比较motif的相似性。

分析和可视化motif:可以使用软件工具(如Seq2Logo、WebLogo等)生成motif的序列logo,以直观地展示核苷酸在转录因子结合位点的保守性。此外,还可以分析motif在不同差异表达基因的启动子中的分布和共现模式。

请注意,这种方法假设差异表达基因的调控主要通过转录因子在启动子区域的结合来实现。实际上,转录因子也可以通过远离TSS的增强子区域或其他调控元件来调控基因表达。因此,在分析启动子motif时,可能会遗漏一些重要的调控信息。

使用特定蛋白的抗体进行ChIP-seq实验可以帮助确定该蛋白在基因组上的结合位点。如果这个蛋白是一个已知的转录协同因子或一个与增强子活性相关的因子,可以通过分析其结合位点来识别潜在的增强子区域。

以下是使用特定蛋白抗体的ChIP-seq数据来寻找增强子的策略:

数据准备:

收集针对目标蛋白的ChIP-seq数据。
对ChIP-seq数据进行质量控制和预处理。
ChIP-seq数据分析:

使用MACS、SICER等软件对ChIP-seq数据进行比对和峰值检测,找到目标蛋白在基因组上的结合位点。
筛选潜在增强子区域:

筛选位于基因上游、内含子或基因间区域的结合位点,因为这些区域更有可能包含增强子。
结合其他相关的组蛋白修饰数据(如H3K4me1和H3K27ac)或开放染色质数据(如DNase-seq或ATAC-seq),进一步筛选具有这些特征的结合位点。
验证和功能分析:

使用Reporter Assay、CRISPR/Cas9或其他实验方法验证筛选出的潜在增强子的功能。
通过基因表达数据或其他功能分析方法,了解目标蛋白结合位点与增强子活性的关系。
请注意,这种策略依赖于所研究的特定蛋白与增强子活性有关。对于更通用的增强子预测,可以考虑使用组蛋白修饰数据(如H3K4me1和H3K27ac)或开放染色质数据(如DNase-seq或ATAC-seq)。

增强子(enhancer)是一类非编码的调控序列,它可以在与启动子相距较远的位置通过调控转录因子的结合来影响基因的表达。以下是一些与增强子结合并参与基因调控的主要分子:

转录因子(Transcription factors,TFs):转录因子是一类可以结合特定DNA序列的蛋白质。它们可以通过与增强子结合来调控基因表达。一些典型的转录因子家族包括bZIP、bHLH、Zinc finger、Homeodomain等。

转录协同因子(Transcription co-factors):转录协同因子可以与转录因子共同作用,帮助它们结合到增强子并调控基因表达。这些协同因子可以是组蛋白修饰酶、染色质重塑因子或其他调控蛋白质。

染色质修饰酶(Chromatin modifiers):这些酶可以通过添加或去除组蛋白修饰(如乙酰化、甲基化等)来调整染色质的可访问性和活性。例如,增强子活性通常与H3K4me1(histone H3 monomethylated at lysine 4)和H3K27ac(histone H3 acetylated at lysine 27)修饰相关联。

染色质重塑因子(Chromatin remodelers):这些蛋白质可以通过改变染色质结构来调整DNA的可访问性,从而影响转录因子和其他调控因子与增强子的结合。

Mediator复合物:Mediator是一个多蛋白复合物,它在转录调控中起到了桥梁的作用。它可以与增强子上的转录因子结合,并与启动子区域的RNA聚合酶II形成一个环状结构,从而促进基因的转录。

长非编码RNA(Long non-coding RNAs,lncRNAs):某些lncRNAs可以通过与增强子结合并调控转录因子和其他调控因子的活性来影响基因表达。

这些分子共同作用,与增强子结合并参与基因调控。在研究增强子的功能时,可以通过ChIP-seq、DNase-seq、ATAC-seq等技术检测这些分子在基因组上的结合位置,从而揭示增强子的调控网络。

Creating an hg38 Database File for DiffReps Analysis

To generate a database file for the hg38 human genome assembly for use with the DiffReps tool, you will first need to download the hg38 annotation file in GTF or GFF format, and then convert it to the BED format that DiffReps accepts.

Here are the steps to create the database file:

  1. Download the hg38 annotation file in GTF or GFF format from a reputable source like GENCODE or UCSC Genome Browser. For example, you can download the GTF file from GENCODE: https://www.gencodegenes.org/human/.

  2. Install the necessary tools. In this case, we will use bedtools to convert GTF to BED format. You can install it using conda or any other package manager:

    conda install -c bioconda bedtools
  3. Convert the GTF file to the BED format:

    bedtools gtf2bed -i gencode.v38.annotation.gtf > gencode.v38.annotation.bed

    Replace “gencode.v38.annotation.gtf” with the name of the GTF file you downloaded in step 1.

  4. The resulting “gencode.v38.annotation.bed” file can be used as a database file for the DiffReps analysis. Provide this file as input to the –db option when running DiffReps:

    diffreps -tr treatment.bam -co control.bam -db gencode.v38.annotation.bed -o output_dir

Please note that these instructions are for generating a database file for hg38, but you can follow similar steps for other genome assemblies as well.

Defining and Categorizing Promoter Types Based on the ‘GRGGC’ Motif Frequency, Distribution, and Distance to the Transcription Start Site (TSS)

To provide a more detailed explanation of how to define promoter types based on the frequency and distribution of the ‘GRGGC’ motif on both + and – strands within the promoter region, I will outline the steps using Python and the Biopython library.

  1. Load your genome and annotation file (e.g., in FASTA and GFF3 formats, respectively):

    import re
    from Bio import SeqIO
    from Bio.Seq import Seq
    genome_file = "your_genome.fasta"
    annotation_file = "your_annotation.gff3"
    
    genome_records = SeqIO.to_dict(SeqIO.parse(genome_file, "fasta"))
  2. Extract promoter sequences: Define a function to extract promoter sequences based on the annotation file.

    def extract_promoter_sequences(annotation_file, genome_records, promoter_length=1000):
        promoters = []
        with open(annotation_file, "r") as gff:
            for line in gff:
                if line.startswith("#"):
                    continue
                fields = line.strip().split("\t")
                if fields[2] == "gene":
                    start, end, strand = int(fields[3]), int(fields[4]), fields[6]
                    seq_id = fields[0]
                    if strand == "+":
                        promoter_start = max(start - promoter_length, 1)
                        promoter_end = start - 1
                    elif strand == "-":
                        promoter_start = end + 1
                        promoter_end = min(end + promoter_length, len(genome_records[seq_id]))
                    promoter_seq = genome_records[seq_id].seq[promoter_start-1:promoter_end]
                    if strand == "-":
                        promoter_seq = promoter_seq.reverse_complement()
                    promoters.append(promoter_seq)
        return promoters
    
    promoter_sequences = extract_promoter_sequences(annotation_file, genome_records)
  3. Search for the motif and calculate motif frequency and distribution:

    def find_motif_frequency_and_distribution(promoter_sequences, motif_1="GRGGC", motif_2="GCCYR"):
        motif_1 = motif_1.replace("R", "[AG]").replace("Y", "[CT]")
        motif_2 = motif_2.replace("R", "[AG]").replace("Y", "[CT]")
        motif_data = []
    
        for promoter in promoter_sequences:
            motif_positions = []
            for match in re.finditer(motif_1, str(promoter)):
                motif_positions.append(match.start())
            for match in re.finditer(motif_2, str(promoter)):
                motif_positions.append(match.start())
            motif_positions.sort()
            motif_data.append({"count": len(motif_positions), "positions": motif_positions})
    
        return motif_data
    
    motif_data = find_motif_frequency_and_distribution(promoter_sequences)
  4. Define promoter types: Based on the frequency and distribution of the motif within the promoter regions, you can categorize promoters into different types. For example:

    def classify_promoter_types(motif_data, low_count=0, high_count=3):
        promoter_types = []
        for data in motif_data:
            if data["count"] <= low_count:
                promoter_types.append("low")
            elif data["count"] >= high_count:
                promoter_types.append("high")
            else:
                promoter_types.append("medium")
        return promoter_types

    promoter_types = classify_promoter_types(motif_data)

5.1. Perform statistical analyses and visualizations: With the promoter types defined, you can now perform various statistical analyses and create visualizations to explore the relationships between the types and other genomic features or expression levels. Here’s an example of how to create a bar plot of promoter types using the matplotlib library:

#pip install matplotlib  #Install matplotlib
import matplotlib.pyplot as plt

def plot_promoter_types(promoter_types):
    type_counts = {}
    for promoter_type in promoter_types:
        if promoter_type not in type_counts:
            type_counts[promoter_type] = 1
        else:
            type_counts[promoter_type] += 1

    types = list(type_counts.keys())
    counts = list(type_counts.values())

    plt.bar(types, counts)
    plt.xlabel("Promoter Types")
    plt.ylabel("Frequency")
    plt.title("Frequency of Promoter Types Based on 'GRGGC' Motif")
    plt.show()

plot_promoter_types(promoter_types)
#This code will produce a bar plot that shows the frequency of the different promoter types based on the 'GRGGC' motif in the promoter regions. You can further analyze the relationship between the promoter types and gene expression levels or other genomic features, depending on your research question.

5.2. In order to define promoter types based on the distance of the ‘GRGGC’ motif to the transcription start site (TSS), we can modify the previous code to include the distance information.

  • Define a function to find the distance of the motif to the TSS for each promoter:

    def find_motif_distances_to_tss(promoters, motif):
        distances = []
        for promoter in promoters:
            for strand, sequence in promoter.items():
                motif_positions = [i for i in range(len(sequence)) if sequence.startswith(motif, i)]
                if strand == '+':
                    tss_distance = [abs(pos - len(sequence)) for pos in motif_positions]
                else:
                    tss_distance = [abs(pos) for pos in motif_positions]
                distances.extend(tss_distance)
        return distances
    
    motif_distances_to_tss = find_motif_distances_to_tss(promoters, 'GRGGC')
  • Define promoter types based on the distance to the TSS: We can define the promoter types by categorizing the distances into different groups.

    Very close: < 50 bp; Close: 50 - 200 bp; Moderate: 200 - 500 bp; Far: > 500 bp

    def categorize_distances(distances):
        promoter_types = []
        for distance in distances:
            if distance < 50:
                promoter_types.append("Very close")
            elif 50 <= distance < 200:
                promoter_types.append("Close")
            elif 200 <= distance < 500:
                promoter_types.append("Moderate")
            else:
                promoter_types.append("Far")
        return promoter_types
    
    promoter_types_distance = categorize_distances(motif_distances_to_tss)
  • Visualize the promoter types based on distance: Use the plot_promoter_types function we defined earlier to create a bar plot of promoter types based on the distance to the TSS:

    plot_promoter_types(promoter_types_distance)

    This plot will show the frequency of promoter types based on the distance of the ‘GRGGC’ motif to the TSS. You can further analyze the relationship between promoter types and gene expression levels or other genomic features, depending on your research question.

Exploring DNA Motifs with Custom Bash Script

#!/bin/bash

#./search_motif4.sh test1.fasta GRG 5

if [ $# -ne 3 ]; then
  echo "Usage: $0 
” exit 1 fi fasta_file=$1 motif=$2 context=$3 motif_regex=$(echo $motif | sed ‘s/R/[AG]/g’ | sed ‘s/Y/[CT]/g’ | sed ‘s/S/[GC]/g’ | sed ‘s/W/[AT]/g’ | sed ‘s/K/[GT]/g’ | sed ‘s/M/[AC]/g’ | sed ‘s/B/[CGT]/g’ | sed ‘s/D/[AGT]/g’ | sed ‘s/H/[ACT]/g’ | sed ‘s/V/[ACG]/g’) grep -B1 -A1 -i -E -o “.{0,$context}${motif_regex}.{0,$context}” $fasta_file

单细胞RNA测序数据分析步骤

单细胞RNA测序数据分析的具体步骤包括以下几个阶段:

  1. 数据预处理:这一步涉及到对原始测序数据进行质量控制,包括移除低质量的测序读段,对读段进行修剪,以及对可能的污染序列进行识别和移除。这一步骤是为了确保后续的分析基于的是高质量的数据。

  2. 比对和定量:接下来的步骤是将预处理后的读段比对到参考基因组上,并且对每个细胞中每个基因的表达量进行定量。比对可以使用如STAR, HISAT2等工具,而定量则可以使用如HTSeq, featureCounts等工具。

  3. 质控和过滤:在进行了比对和定量后,需要进行进一步的质量控制,这包括移除低质量的细胞(比如表达基因数量太少或者有大量的线粒体基因表达),以及过滤掉表达不足的基因。

  4. 标准化和批次效应校正:由于技术和实验设计的原因,数据中可能存在一些非生物学的技术性变异,比如批次效应。这一步可以使用如Seurat, Scanpy等工具中的方法进行标准化和批次效应校正。

  5. 特征选择和降维:为了能够在低维空间中展示细胞的结构,以及找出最能够区分不同细胞的基因,需要进行特征选择和降维。常用的降维方法包括PCA, t-SNE, UMAP等。

  6. 聚类:基于降维后的数据,可以对细胞进行聚类,以识别不同的细胞类型或状态。聚类方法有很多种,包括基于密度的方法,基于图的方法等。

  7. 差异表达分析:在识别了不同的细胞群体后,可以进行差异表达分析,以找出在不同细胞群体中表达不同的基因。

  8. 轨迹推断:对于时序数据或者细胞发育数据,可以进行轨迹推断,以研究细胞的发育过程或者状态转换过程。这一步可以使用如Monocle, Slingshot等工具进行。

以上就是单细胞RNA测序数据分析的一些基本步骤,但是需要注意的是,具体的分析流程和方法可能会根据研究问题的不同而有所不同。

The steps in the analysis of single-cell RNA sequencing data include the following stages:

  1. Data Preprocessing: This step involves quality control of the raw sequencing data, including removing low-quality sequencing reads, trimming reads, and identifying and removing potential contamination sequences. This step is to ensure that subsequent analyses are based on high-quality data.

  2. Alignment and Quantification: The next step is to align the preprocessed reads to the reference genome and quantify the expression of each gene in each cell. Alignment can be done using tools such as STAR, HISAT2, while quantification can be done with tools like HTSeq, featureCounts.

  3. Quality Control and Filtering: After alignment and quantification, further quality control is needed, including removing low-quality cells (e.g., those with too few expressed genes or a lot of mitochondrial gene expression) and filtering out underexpressed genes.

  4. Normalization and Batch Effect Correction: Due to technical and experimental design reasons, there may be some non-biological technical variations in the data, such as batch effects. This step can be done using methods in tools like Seurat, Scanpy for normalization and batch effect correction.

  5. Feature Selection and Dimensionality Reduction: To be able to display the structure of cells in a low-dimensional space and find the genes that best distinguish different cells, feature selection and dimensionality reduction are required. Common dimensionality reduction methods include PCA, t-SNE, UMAP.

  6. Clustering: Based on the dimensionality-reduced data, cells can be clustered to identify different cell types or states. There are many clustering methods, including density-based methods, graph-based methods, etc.

  7. Differential Expression Analysis: After identifying different cell populations, differential expression analysis can be performed to find genes that are differentially expressed in different cell populations.

  8. Trajectory Inference: For time series data or cell development data, trajectory inference can be performed to study the cell development process or state transition process. This step can be done using tools like Monocle, Slingshot.

These are some of the basic steps in single-cell RNA sequencing data analysis. However, it should be noted that the specific analysis workflow and methods may vary depending on the research question.

Snakefile

import os

#######################################################
############### Snakefile Configuration ###############
#######################################################

configfile: "bacto-0.1.json"

SAMPLES, PAIRS, = glob_wildcards("raw_data/{sample}_{pair}.fastq.gz")

SAMPLES = list(set(SAMPLES))

#https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref
#reference_name is one of: argannot, card, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core, vfdb_full, virulencefinder
#VFDB: a reference database for bacterial virulence factors.
#VFDB 2016: hierarchical and refined dataset for big data analysis-10 years on"
#ariba getref vfdb_core vfdb_core
#mv vfdb_core.tsv db/vfdb_core/vfdb_core.tsv
#mv vfdb_core.fa db/vfdb_core/vfdb_core.fa
#MODIFIED: trimmomatic threads=6 doesn't work, add hard-coding {threads} as 12 in line 136
#DB = ["argannot", "card", "megares", "plasmidfinder", "resfinder", "srst2_argannot", "vfdb_core", "vfdb_full", "virulencefinder"]
DB = ["megares", "plasmidfinder", "resfinder", "srst2_argannot", "vfdb_core", "virulencefinder"]

#########################################################################
### Helper functions to access and construct commands from parameters ###
### in the configuration file for this Snakemake Pipeline             ###
#########################################################################

def get_params_trailing():

    return "TRAILING:" + str(config["trimmomatic"]["trailing"])

def get_params_leading():

    return "LEADING:" + str(config["trimmomatic"]["leading"])

def get_params_minlen():

    return "MINLEN:" + str(config["trimmomatic"]["min_len"])

def get_params_illuminaclip():

    params = config["trimmomatic"]

    return "ILLUMINACLIP:" + params["adapter_path"] + ":" \
           + str(params["seed_mismatch"]) + ":" + str(params["palindrome_threshold"]) \
           + ":" + str(params["clip_threshold"]) + ":" + str(params["min_adapter_length"]) \
           + ":" + str(params["keep_both_reads"])

def get_params_slidingwindow():

    params = config["trimmomatic"]

    return "SLIDINGWINDOW" + ":" + str(params["window_size"]) + ":" + str(params["window_quality"])

def get_genus_options(wildcards):

    if config["prokka"]["genus"]:
        cmd = "--usegenus --genus " + config["prokka"]["genus"]
    else:
        cmd = ""

    return cmd

def get_reference_options(wildcards):

    if config["quast"]["reference"]:
        cmd = "-R " + config["quast"]["reference"]
    else:
        cmd = ""

    return cmd

def get_forward_files(wildcards):

    return "raw_data/{id}_R1.fastq.gz".format(id=wildcards.sample)

def get_reverse_files(wildcards):

    return "raw_data/{id}_R2.fastq.gz".format(id=wildcards.sample)

##############################################
############### Consuming Rule ###############
##############################################

rule all:
    input:
        expand("kraken/{sample}_report.txt", sample=SAMPLES) if config["taxonomic_classifier"] else [],
        #expand("trimmed/{sample}_trimmed_P_1.fastq.gz", sample=SAMPLES) if config["trim"] else [],
        expand("fastqc/{sample}_fastqc.html", sample=SAMPLES) if config["fastqc"] else [],
        expand("shovill/{sample}/contigs.fa", sample=SAMPLES) if config["typing_ariba"] else [],
        expand("prokka/{sample}/{sample}.gbk", sample=SAMPLES) if config["typing_ariba"] else [],

        #expand("mykrobe/{sample}/{sample}.json", sample=SAMPLES) if config["typing"] else [],
        expand("ariba/{db}/{sample}.report.txt", db=DB, sample=SAMPLES) if config["typing_ariba"] else [],
        expand("mlst/{sample}.mlst.txt", sample=SAMPLES) if config["assembly"] and config["typing_mlst"] else [],

    "roary/gene_presence_absence.csv" if config["pangenome"] else [],
    #"core_alignment/core_alignment_roary.fasta" if config["pangenome"] else [],

        "variants/snippy.core.full.aln" if config["variants_calling"] else [],
        "variants/snippy.core.aln" if config["variants_calling"] else [],

        "fasttree/snippy.core.tree" if config["phylogeny_fasttree"] else [],
        "raxml-ng/snippy.core.aln.raxml.bestTree" if config["phylogeny_raxml"] else [],
        "gubbins/recomb.final_tree.tre"  if config["recombination"] else [],

        #conda install -c anaconda seaborn
        #conda install libgcc
        #conda install matplotlib biopython numpy pandas
        #"fasttree_matrix.png"

########################################
############### QC Rules ###############
########################################

#if config["trim"]:
rule trimmomatic:
    input:
        fwd = get_forward_files,
        rev = get_reverse_files
    params:
        illumina_clip=get_params_illuminaclip(),
        sliding_window=get_params_slidingwindow(),
        minlen=get_params_minlen(),
        trailing=get_params_trailing(),
        leading=get_params_leading()
    output:
        fwd_p="trimmed/{sample}_trimmed_P_1.fastq",
        fwd_u="trimmed/{sample}_trimmed_U_1.fastq",
        rev_p="trimmed/{sample}_trimmed_P_2.fastq",
        rev_u="trimmed/{sample}_trimmed_U_2.fastq",
        fwd_fq="fastq/{sample}_1.fastq",
        rev_fq="fastq/{sample}_2.fastq"
    threads:
        config["trimmomatic"]["cpu"]
    shell:
        "trimmomatic PE -threads 12 {input.fwd} {input.rev} {output.fwd_p} {output.fwd_u}"
        " {output.rev_p} {output.rev_u} {params.illumina_clip} {params.sliding_window} {params.leading}"
        " {params.trailing} {params.minlen} && "
        "ln -s $(pwd)/{output.fwd_p} $(pwd)/{output.fwd_fq} && ln -s $(pwd)/{output.rev_p} $(pwd)/{output.rev_fq} && "
        "touch -h {output.fwd_fq} && touch -h {output.rev_fq}"

if config["fastqc"]:
    rule fastqc:
        input:
            fwd_after="fastq/{sample}_1.fastq",
            rev_after="fastq/{sample}_2.fastq"
        output:
            "fastqc/{sample}_1_fastqc.html",
            "fastqc/{sample}_2_fastqc.html",
        shell:
            "fastqc --outdir fastqc {input.fwd_after}"
            " && fastqc --outdir fastqc {input.rev_after}"

if config["taxonomic_classifier"]:
    rule kraken:
        input:
            fwd_kra="fastq/{sample}_1.fastq",
            rev_kra="fastq/{sample}_2.fastq"
        params:
            db=config["kraken"]["db_path"]
        output:
            tax="kraken/{sample}_tax.out",
            report="kraken/{sample}_report.txt"
        threads:
            config["kraken"]["cpu"]
        shell:
            "cat {params.db}/database.* > /dev/null"
            " && kraken --db {params.db} --threads {threads} --output {output.tax}"
            " --fastq-input --paired {input.fwd_kra} {input.rev_kra}"
            " && kraken-report --db {params.db} {output.tax} > {output.report}"

##############################################
########### Assembly and Annotation ##########
##############################################

if config["assembly"]:

    rule shovill:
        input:
            forward = "fastq/{sample}_1.fastq",
            reverse = "fastq/{sample}_2.fastq"
        params:
            spades = config["shovill"]["spades"],
            cpu = config["shovill"]["cpu"],
            depth = config["shovill"]["depth"],
            other = config["shovill"]["other"]
        output:
            "shovill/{sample}/contigs.fa"
        shell:
            "shovill --outdir shovill/{wildcards.sample} --depth {params.depth} --force --R1 {input.forward} "
            "--R2 {input.reverse} --mincov 1 --cpus {params.cpu} {params.other}"

    #DEBUG: replace the tbl2asn with the newest version: https://github.com/tseemann/prokka/issues/139 --> SOLVED!
    rule prokka:
        input:
            "shovill/{sample}/contigs.fa"
        params:
            cpu = config["prokka"]["cpu"],
            evalue = config["prokka"]["evalue"],
            genus_options = get_genus_options,
            kingdom = config["prokka"]["kingdom"],
            species = config["prokka"]["species"],
            other = config["prokka"]["other"]
        output:
            "prokka/{sample}/{sample}.gbk",
            "prokka/{sample}/{sample}.gff"
        shell:
             """ 
             prokka --force --outdir prokka/{wildcards.sample} --cpus {params.cpu} {params.genus_options} --kingdom {params.kingdom} --species {params.species} --addgenes --addmrna --prefix {wildcards.sample} --locustag {wildcards.sample} {params.other} {input} -hmm /mnt/h1/jhuang/REFs/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
             """
#tbl2asn -V b -a r10k -l paired-ends -M n -N 1 -y 'Annotated using prokka 1.12 from https://github.com/tseemann/prokka' -Z prokka\/HD31_2\/HD31_2\.err -i prokka\/HD31_2\/HD31_2\.fsa
#wget -O tbl2asn.gz ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz
#ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/DOCUMENTATION/VERSIONS
#ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/
#gunzip tbl2asn.gz
#chmod +x tbl2asn
#mv linux64.tbl2asn /home/jhuang/anaconda3/envs/bengal/bin/tbl2asn 
#prokka 1.12 has problem!
#conda install prokka

    if config["typing_mlst"]:
        rule mlst:
            input:
                "shovill/{sample}/contigs.fa"
            output:
                "mlst/{sample}.mlst.txt"
            shell:
                "mlst {input} > {output}"

########################################
############### Typing #################
########################################

if config["typing_ariba"]:
    # Warning. This is some black magic with the LD_LIBRARY_PATH
    # to make MykrobePredictor work in Conda. MCCORTEX31 fails for
    # ZLIB library dependency, which is present in CONDA_PREFIX/lib
    # mykrobe predict --skeleton_dir ./mykrobe/HD31N3_S93 HD31N3_S93 staph --mccortex31_path /home/jhuang/anaconda3/envs/bengal/bin/mccortex31 -1 fastq/HD31N3_S93_1.fastq fastq/HD31N3_S93_2.fastq > mykrobe/HD31N3_S93/HD31N3_S93.json
    #rule mykrobe_predictor:
    #    input:
    #        forward = "fastq/{sample}_1.fastq",
    #        reverse = "fastq/{sample}_2.fastq"
    #    params:
    #        species=config["mykrobe"]["species"]
    #    output:
    #        "mykrobe/{sample}/{sample}.json"
    #    conda:
    #        "envs/mykrobe.yaml"
    #    shell:
    #        #considering removing LIBRARY_PATH management from the code
    #        "LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH && "
    #        "echo $LD_LIBRARY_PATH && "
    #        "mykrobe predict --skeleton_dir ./mykrobe/{wildcards.sample} {wildcards.sample} {params.species} "
    #        "-1 {input.forward} {input.reverse} > {output}"

    #rule ariba_prepareref:
#        input:
#                gene_dbs="db/{db}/{db}.fa",
#                gene_tsv="db/{db}/{db}.tsv"
#        output:
#                gene_preps="db/{db}/prepareref.{db}"
#        shell:
#                "ariba prepareref -f {input.gene_dbs} -m {input.gene_tsv} {output.gene_preps}"

#argannot, card, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core, vfdb_full, virulencefinder
##ariba getref argannot argannot  
##mkdir db/argannot
##mv argannot* db/argannot
##ariba prepareref -f db/argannot/argannot.fa -m db/argannot/argannot.tsv db/argannot/prepareref.argannot
##-------
##ariba getref card card  
##mkdir db/card
##mv card* db/card
##ariba prepareref -f db/card/card.fa -m db/card/card.tsv db/card/prepareref.card
##-------
##ariba getref megares megares  
##mkdir db/megares
##mv megares* db/megares
##ariba prepareref -f db/megares/megares.fa -m db/megares/megares.tsv db/megares/prepareref.megares
##-------
##ariba getref plasmidfinder plasmidfinder  
##mkdir db/plasmidfinder
##mv plasmidfinder* db/plasmidfinder
##ariba prepareref -f db/plasmidfinder/plasmidfinder.fa -m db/plasmidfinder/plasmidfinder.tsv db/plasmidfinder/prepareref.plasmidfinder
##-------
##ariba prepareref -f db/resfinder/resfinder.fa -m db/resfinder/resfinder.tsv db/resfinder/prepareref.resfinder
##-------
##ariba getref srst2_argannot srst2_argannot  
##mkdir db/srst2_argannot
##mv srst2_argannot* db/srst2_argannot
##ariba prepareref -f db/srst2_argannot/srst2_argannot.fa -m db/srst2_argannot/srst2_argannot.tsv db/srst2_argannot/prepareref.srst2_argannot
##-------
##ariba prepareref -f db/vfdb_core/vfdb_core.fa -m db/vfdb_core/vfdb_core.tsv db/vfdb_core/prepareref.vfdb_core
##-------
##ariba getref vfdb_full vfdb_full  
##mkdir db/vfdb_full
##mv vfdb_full* db/vfdb_full
##ariba prepareref -f db/vfdb_full/vfdb_full.fa -m db/vfdb_full/vfdb_full.tsv db/vfdb_full/prepareref.vfdb_full
##-------
##ariba getref virulencefinder virulencefinder  
##mkdir db/virulencefinder
##mv virulencefinder* db/virulencefinder
##ariba prepareref -f db/virulencefinder/virulencefinder.fa -m db/virulencefinder/virulencefinder.tsv db/virulencefinder/prepareref.virulencefinder
#

    rule ariba:
        input:
                prepref = "db/{db}/prepareref.{db}",
                forward = "fastq/{sample}_1.fastq",
                reverse = "fastq/{sample}_2.fastq"
        params:
                outdir="ariba/{db}/{sample}",
                report="ariba/{db}/{sample}/report.tsv"
        output:
                "ariba/{db}/{sample}.report.txt"
        shell:
                "ariba run --force {input.prepref} {input.forward} {input.reverse} {params.outdir} && "
                "mv {params.report} {output}"

#####################################################
############### Pangenome using roary ###############
#####################################################

#http://sepsis-omics.github.io/tutorials/modules/roary/
if config["pangenome"]:
    rule roary:
        input:
            expand("prokka/{sample}/{sample}.gff", sample=SAMPLES)
        params:
            core=config["roary"]["core"],
            identity=config["roary"]["identity"],
            other=config["roary"]["other"]
        output:
            "roary/gene_presence_absence.csv"
            #"core_alignment/core_alignment_roary.fasta"
        threads:
            config["roary"]["cpu"]
        shell:
            #{threads} doesn't work, using hard-coding 15
            "roary -p 15 -f ./roary -i {params.identity} -cd {params.core} -s -e -n -v {params.other} {input} && "
            "mv roary_*/* ./roary && rm -rf ./roary_*"
            #"cp roary/core_gene_alignment.aln core_alignment/core_alignment_roary.fasta"
            #roary -e --mafft -p 8 *.gff
            #roary -p 4 -f {params.outdir} -s -e -n -v {input} && touch {output}    

########################################
############### Variants ###############
########################################

# Snippy in the environment requires the following:
# Install latest Python 3.5 freebayes=1.1.0=3 from BioConda
# Install bzip2 from Conda-Forge (add to channels in YAML)
# Then install standard Snippy from BioConda - for some reason the
# normal install does not work unless using this sequence of installations.

#TODO: using own refs from this STEP, using panphlan
if config["variants_calling"]:
    rule snippy:
        input:
            forward="fastq/{sample}_1.fastq",
            reverse="fastq/{sample}_2.fastq"
        params:
            outdir = "snippy/{sample}",
            reference = config["snippy"]["reference"],
            mincov = config["snippy"]["mincov"],
            minfrac = config["snippy"]["minfrac"],
            mapqual = config["snippy"]["mapqual"],
            other = config["snippy"]["other"],
            cpu = config["snippy"]["cpu"]
        output:
            "snippy/{sample}/{sample}.txt",
            #"snippy/{sample}/{sample}.depth"
        shell:
        #removing LIBRARY_PATH management from the code
            #"LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH && "
            "snippy --force --outdir {params.outdir} --ref {params.reference} "
            "--R1 {input.forward} --R2 {input.reverse} --cpus {params.cpu} "
            "--mincov {params.mincov} --minfrac {params.minfrac} "
            "--mapqual {params.mapqual} --prefix {wildcards.sample} {params.other} "
            #"&& gzip -d snippy/{wildcards.sample}/{wildcards.sample}.depth.gz"

    rule snippy_core:
        input:
            expand("snippy/{sample}/{sample}.txt", sample=SAMPLES)
        params:
            reference = config["snippy"]["reference"]
        output:
            "variants/snippy.core.full.aln",
            "variants/snippy.core.aln"
        shell:
            "snippy-core --ref {params.reference} --prefix snippy.core snippy/* && "
            "mv snippy.core.* variants/"

##########################################################################
########### Set Analysis based on variants/snippy.core.aln ###############
##########################################################################
if config["phylogeny_fasttree"]:
    rule fasttree_snp:
        input:
            "variants/snippy.core.aln"
        output:
            "fasttree/snippy.core.tree"
        shell:
            "FastTree -gtr -nt {input} > {output}"

#DEBUG: threads of raxml_ng doesn't work!
if config["phylogeny_raxml"]:
    rule raxml_ng:
        input:
            "variants/snippy.core.aln"
        params:
            model = config["raxml_ng"]["model"],
            correction = config["raxml_ng"]["correction"],
            bootstrap = config["raxml_ng"]["bootstrap"],
            other = config["raxml_ng"]["other"]
        output:
            "raxml-ng/snippy.core.aln.raxml.bestTree"
        threads:
            config["raxml_ng"]["cpu"]
        shell:
        #{threads} doesn't work, using hard-coding 8
            "raxml-ng --all --model {params.model}{params.correction} --prefix raxml-ng/snippy.core.aln --threads 8 "
            "--msa {input} --bs-trees {params.bootstrap} {params.other}"

if config["recombination"]:
    rule gubbins:
        input:
            "variants/snippy.core.aln"
        params:
            model = config["gubbins"]["model"],
            prefix = "recomb",
            tree_builder = config["gubbins"]["tree_builder"],
            iterations = config["gubbins"]["iterations"],
            min_snps = config["gubbins"]["min_snps"],
            min_window_size = config["gubbins"]["min_window_size"],
            max_window_size = config["gubbins"]["max_window_size"],
            filter_percentage = config["gubbins"]["filter_percentage"],
            other = config["gubbins"]["other"]
        output:
            "gubbins/recomb.final_tree.tre"
        shell:
            "run_gubbins.py --tree_builder {params.tree_builder} --iterations {params.iterations} --raxml_model "
            "{params.model} --min_snps {params.min_snps} --min_window_size {params.min_window_size} "
            "--max_window_size {params.max_window_size} --filter_percentage {params.filter_percentage} "
            "--prefix {params.prefix} {params.other} {input} && mv {params.prefix}* gubbins"

#cp variants/snippy.core.full.aln variants/snippy.core.full.with_ref.aln
#cp variants/snippy.core.aln variants/snippy.core.with_ref.aln
#rule visualize_roary:
#    input:
#        tree = "fasttree/snippy.core.tree",
#        csv = "roary/gene_presence_absence.csv",
#        plot_script_fp = "local/roary_plots"
#    output:
#        "fasttree_matrix.png"
#    shell:
#        """
#        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv}
#        mv pangenome_matrix.png matrix_fasttree.png
#        """
rule visualize_roary_fasttree:
    input:
        tree = "fasttree/snippy.core.tree",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.fasttree.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_fasttree.png
        """

rule visualize_roary_raxml:
    input:
        tree = "raxml-ng/snippy.core.aln.raxml.bestTree",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.raxml.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_raxml.png
        """

rule visualize_roary_gubbins:
    input:
        tree = "gubbins/recomb.final_tree.tre",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.gubbins.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_gubbins.png
        """

rule run_piggy:
    input:
        gff="prokka_gffs/",
        roa="roary/"
    output:
        "piggy.done"
    params:
        outdir = "piggy"
    shell:
        """
        piggy -i {input.gff} -r {input.roa} -o {params.outdir} && touch {output}
        """

Clustering of Promoter Types Based on Motif Frequency and Distribution

To implement the clustering of promoter types based on motif frequency and distribution using Python, you can follow these steps:

  1. Import the required libraries:

    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans
  2. Prepare your data:

    • Read the dataset containing motif frequency and distribution information for each promoter region into a Pandas DataFrame.
    • Make sure your dataset has columns for promoter regions, motif frequencies, and motif distributions on the + and – strands.
  3. Perform clustering:

    • Select the features (motif frequencies and distributions) that you want to use for clustering.
    • Normalize the selected features using Min-Max scaling or another appropriate method.
    • Choose the number of clusters (k) you want to create.
    • Apply the K-means clustering algorithm to cluster the data based on the selected features.

      # Select features for clustering
      features = ['motif_frequency', 'positive_strand_distribution', 'negative_strand_distribution']
      
      # Normalize the features
      normalized_data = (data[features] - data[features].min()) / (data[features].max() - data[features].min())
      
      # Apply K-means clustering
      kmeans = KMeans(n_clusters=k)
      clusters = kmeans.fit_predict(normalized_data)
  4. Analyze the clustering results:

    • Assign the cluster labels to the original dataset.

      data['cluster'] = clusters
    • Analyze the characteristics of each cluster, such as the average motif frequency and distribution, by grouping the data by cluster labels and calculating the mean values.

      cluster_means = data.groupby('cluster')[features].mean()
  5. Visualize the clustering results:

    • Create visualizations, such as scatter plots or bar plots, to show the distribution of motifs in different clusters.
    • Plot the average motif frequency and distribution for each cluster.

      cluster_means.plot(kind='bar')

Remember to adjust the implementation based on your specific dataset and requirements. You may need to preprocess the data or use different clustering algorithms depending on your needs.

Analysis of Peak Distribution in Promoters

import pprint
import argparse
import matplotlib.pyplot as plt
import pandas as pd
import gffutils
import numpy as np

#db = gffutils.create_db('gencode.v43.annotation.gtf', dbfn='gencode.v43.annotation.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
#python3 plot_peaks4.py peaks.bed gencode.v43.annotation.db

def plot_peak_distribution(peaks_file, gencode_db):
    db = gffutils.FeatureDB(gencode_db, keep_order=True)

    peaks = pd.read_table(peaks_file, header=None)
    peaks.columns = ['chrom', 'start', 'end', 'name', 'score']

    tss_positions = []
    tss_genes = []
    for gene in db.features_of_type('gene'):
        if gene.strand == '+':
            tss_positions.append(gene.start)
        else:
            tss_positions.append(gene.end)
        tss_genes.append(gene)

    peak_tss_distances = []
    peak_names = []
    closest_genes = []
    closest_gene_names = []
    closest_strands = []
    for _, peak in peaks.iterrows():
        #the double forward slash // is used to perform integer division instead of floating-point division
        peak_center = (peak['start'] + peak['end']) // 2
        closest_tss_index = np.argmin([abs(tss - peak_center) for tss in tss_positions])
        distance_to_tss = peak_center - tss_positions[closest_tss_index]
        peak_tss_distances.append(distance_to_tss)
        peak_names.append(peak['name'])
        closest_genes.append(tss_genes[closest_tss_index].id)
        #closest_genes.append(tss_genes[closest_tss_index].attributes['gene_name'][0])
        if 'gene_name' in tss_genes[closest_tss_index].attributes:
            closest_gene_name = tss_genes[closest_tss_index].attributes['gene_name'][0]
            #print(closest_genesymbol)
        else:
            closest_gene_name = 'N/A'  # Set a default value if 'gene_name' attribute is not present
            #print(closest_genesymbol)
        closest_gene_names.append(closest_gene_name)
        #attributes_obj = tss_genes[closest_tss_index].attributes
        #attribute_names = attributes_obj.keys()
        #print(attribute_names)
        #dict_keys(['gene_id', 'gene_type', 'gene_name', 'level', 'hgnc_id', 'tag', 'havana_gene'])
        #print(tss_genes[closest_tss_index].attributes['havana_gene'])
        #['ENSG00000235519.3'], ['lncRNA'], ['TLCD5'], ['2'], NONE, NONE, ['OTTHUMG00000195005.1']
        closest_strands.append(tss_genes[closest_tss_index].strand)

    # Save distances and gene info to an Excel file
    df = pd.DataFrame({
        'Peak Name': peak_names,
        'Distance to TSS': peak_tss_distances,
        'Closest Gene (Ensemble ID)': closest_genes,
        'Closest Gene (Gene name)': closest_gene_names,
        'Gene Strand': closest_strands
    })
    df.to_excel('peak_tss_distances.xlsx', index=False)

    # Calculate histogram and save bins and counts to a file
    counts, bin_edges = np.histogram(peak_tss_distances, bins=1000)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    hist_df = pd.DataFrame({'Bin Center': bin_centers, 'Count': counts})
    hist_df.to_csv('peak_distribution.csv', index=False)
    total_peaks = hist_df['Count'].sum()
    with open('peak_distribution.csv', 'a') as f:
        f.write(f'\nTotal number of peaks: {total_peaks}') 

    plt.hist(peak_tss_distances, bins=1000)
    plt.xlabel('Distance to TSS')
    plt.ylabel('Number of peaks')
    plt.title('Distribution of peaks around TSS')
    plt.savefig('peak_distribution.png')
    plt.show()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Plot peak distribution around TSS.')
    parser.add_argument('peaks_file', type=str, help='Path to the peaks.bed file')
    parser.add_argument('gencode_db', type=str, help='Path to the gencode.v43.annotation.db file')

    args = parser.parse_args()
    plot_peak_distribution(args.peaks_file, args.gencode_db)