Defining and Categorizing Promoter Types Based on the ‘GRGGC’ Motif Frequency, Distribution, and Distance to the Transcription Start Site (TSS)

To explain in more detail how to define promoter types based on the frequency and distribution of the ‘GRGGC’ motif on both the + and – strands within the promoter region, I will outline the steps below using Python and the Biopython library. Matches on the – strand are captured by searching the forward sequence for the reverse-complement pattern ‘GCCYR’.

  1. Load your genome and annotation file (e.g., in FASTA and GFF3 formats, respectively):

    import re
    from Bio import SeqIO
    from Bio.Seq import Seq
    genome_file = "your_genome.fasta"
    annotation_file = "your_annotation.gff3"
    
    genome_records = SeqIO.to_dict(SeqIO.parse(genome_file, "fasta"))
  2. Extract promoter sequences: Define a function to extract promoter sequences based on the annotation file.

    def extract_promoter_sequences(annotation_file, genome_records, promoter_length=1000):
        promoters = []
        with open(annotation_file, "r") as gff:
            for line in gff:
                if line.startswith("#"):
                    continue
                fields = line.strip().split("\t")
                if fields[2] == "gene":
                    start, end, strand = int(fields[3]), int(fields[4]), fields[6]
                    seq_id = fields[0]
                    if strand == "+":
                        promoter_start = max(start - promoter_length, 1)
                        promoter_end = start - 1
                    elif strand == "-":
                        promoter_start = end + 1
                        promoter_end = min(end + promoter_length, len(genome_records[seq_id]))
                    else:
                        continue  # skip features without a defined strand
                    # GFF3 coordinates are 1-based, so shift the start by one for slicing
                    promoter_seq = genome_records[seq_id].seq[promoter_start-1:promoter_end]
                    if strand == "-":
                        promoter_seq = promoter_seq.reverse_complement()
                    promoters.append(promoter_seq)
        return promoters
    
    promoter_sequences = extract_promoter_sequences(annotation_file, genome_records)
  3. Search for the motif and calculate motif frequency and distribution:

    def find_motif_frequency_and_distribution(promoter_sequences, motif_1="GRGGC", motif_2="GCCYR"):
        # GCCYR is the reverse complement of GRGGC, so scanning the forward
        # sequence for both patterns covers matches on both strands.
        motif_1 = motif_1.replace("R", "[AG]").replace("Y", "[CT]")
        motif_2 = motif_2.replace("R", "[AG]").replace("Y", "[CT]")
        motif_data = []
    
        for promoter in promoter_sequences:
            motif_positions = []
            for match in re.finditer(motif_1, str(promoter)):
                motif_positions.append(match.start())
            for match in re.finditer(motif_2, str(promoter)):
                motif_positions.append(match.start())
            motif_positions.sort()
            motif_data.append({"count": len(motif_positions), "positions": motif_positions})
    
        return motif_data
    
    motif_data = find_motif_frequency_and_distribution(promoter_sequences)
  4. Define promoter types: Based on the frequency and distribution of the motif within the promoter regions, you can categorize promoters into different types. For example:

    def classify_promoter_types(motif_data, low_count=0, high_count=3):
        promoter_types = []
        for data in motif_data:
            if data["count"] <= low_count:
                promoter_types.append("low")
            elif data["count"] >= high_count:
                promoter_types.append("high")
            else:
                promoter_types.append("medium")
        return promoter_types

    promoter_types = classify_promoter_types(motif_data)

5.1. Perform statistical analyses and visualizations: With the promoter types defined, you can now perform various statistical analyses and create visualizations to explore the relationships between the types and other genomic features or expression levels. Here’s an example of how to create a bar plot of promoter types using the matplotlib library:

#pip install matplotlib  #Install matplotlib
import matplotlib.pyplot as plt

def plot_promoter_types(promoter_types):
    type_counts = {}
    for promoter_type in promoter_types:
        if promoter_type not in type_counts:
            type_counts[promoter_type] = 1
        else:
            type_counts[promoter_type] += 1

    types = list(type_counts.keys())
    counts = list(type_counts.values())

    plt.bar(types, counts)
    plt.xlabel("Promoter Types")
    plt.ylabel("Frequency")
    plt.title("Frequency of Promoter Types Based on 'GRGGC' Motif")
    plt.show()

plot_promoter_types(promoter_types)
#This code will produce a bar plot that shows the frequency of the different promoter types based on the 'GRGGC' motif in the promoter regions. You can further analyze the relationship between the promoter types and gene expression levels or other genomic features, depending on your research question.

5.2. To define promoter types based on the distance of the ‘GRGGC’ motif to the transcription start site (TSS), we can modify the previous code to include the distance information.

  • Define a function to find the distance of the motif to the TSS for each promoter:

    def find_motif_distances_to_tss(promoter_sequences, motif="GRGGC"):
        # Expand the IUPAC ambiguity codes (R = A/G, Y = C/T) into a regex,
        # since a literal 'GRGGC' would never match the sequence.
        motif_regex = motif.replace("R", "[AG]").replace("Y", "[CT]")
        distances = []
        for promoter in promoter_sequences:
            sequence = str(promoter)
            # The promoters extracted above are oriented so that the TSS lies
            # immediately downstream of the last base; the distance to the TSS
            # is therefore the gap between a match and the 3' end.
            for match in re.finditer(motif_regex, sequence):
                distances.append(len(sequence) - match.start())
        return distances
    
    motif_distances_to_tss = find_motif_distances_to_tss(promoter_sequences, "GRGGC")
  • Define promoter types based on the distance to the TSS: We can define the promoter types by categorizing the distances into different groups.

    Very close: < 50 bp; Close: 50 - 200 bp; Moderate: 200 - 500 bp; Far: > 500 bp

    def categorize_distances(distances):
        promoter_types = []
        for distance in distances:
            if distance < 50:
                promoter_types.append("Very close")
            elif 50 <= distance < 200:
                promoter_types.append("Close")
            elif 200 <= distance < 500:
                promoter_types.append("Moderate")
            else:
                promoter_types.append("Far")
        return promoter_types
    
    promoter_types_distance = categorize_distances(motif_distances_to_tss)
  • Visualize the promoter types based on distance: Use the plot_promoter_types function we defined earlier to create a bar plot of promoter types based on the distance to the TSS:

    plot_promoter_types(promoter_types_distance)

    This plot will show the frequency of promoter types based on the distance of the ‘GRGGC’ motif to the TSS. You can further analyze the relationship between promoter types and gene expression levels or other genomic features, depending on your research question.

Exploring DNA Motifs with Custom Bash Script

#!/bin/bash

#./search_motif4.sh test1.fasta GRG 5

if [ $# -ne 3 ]; then
  echo "Usage: $0 <fasta_file> <motif> <context>"
  exit 1
fi

fasta_file=$1
motif=$2
context=$3

motif_regex=$(echo $motif | sed 's/R/[AG]/g' | sed 's/Y/[CT]/g' | sed 's/S/[GC]/g' | sed 's/W/[AT]/g' | sed 's/K/[GT]/g' | sed 's/M/[AC]/g' | sed 's/B/[CGT]/g' | sed 's/D/[AGT]/g' | sed 's/H/[ACT]/g' | sed 's/V/[ACG]/g')

grep -B1 -A1 -i -E -o ".{0,$context}${motif_regex}.{0,$context}" $fasta_file

Steps in Single-Cell RNA Sequencing Data Analysis

The steps in the analysis of single-cell RNA sequencing data include the following stages:

  1. Data Preprocessing: This step involves quality control of the raw sequencing data, including removing low-quality sequencing reads, trimming reads, and identifying and removing potential contamination sequences. This step is to ensure that subsequent analyses are based on high-quality data.

  2. Alignment and Quantification: The next step is to align the preprocessed reads to the reference genome and quantify the expression of each gene in each cell. Alignment can be done using tools such as STAR, HISAT2, while quantification can be done with tools like HTSeq, featureCounts.

  3. Quality Control and Filtering: After alignment and quantification, further quality control is needed, including removing low-quality cells (e.g., those with too few expressed genes or a lot of mitochondrial gene expression) and filtering out underexpressed genes.

  4. Normalization and Batch Effect Correction: Due to technical and experimental design reasons, there may be some non-biological technical variations in the data, such as batch effects. This step can be done using methods in tools like Seurat, Scanpy for normalization and batch effect correction.

  5. Feature Selection and Dimensionality Reduction: To be able to display the structure of cells in a low-dimensional space and find the genes that best distinguish different cells, feature selection and dimensionality reduction are required. Common dimensionality reduction methods include PCA, t-SNE, UMAP.

  6. Clustering: Based on the dimensionality-reduced data, cells can be clustered to identify different cell types or states. There are many clustering methods, including density-based methods, graph-based methods, etc.

  7. Differential Expression Analysis: After identifying different cell populations, differential expression analysis can be performed to find genes that are differentially expressed in different cell populations.

  8. Trajectory Inference: For time series data or cell development data, trajectory inference can be performed to study the cell development process or state transition process. This step can be done using tools like Monocle, Slingshot.

These are some of the basic steps in single-cell RNA sequencing data analysis. However, it should be noted that the specific analysis workflow and methods may vary depending on the research question.
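
As a rough illustration of steps 3-7, here is a minimal Scanpy sketch (the input file name and all thresholds are placeholders rather than recommendations):

import scanpy as sc

# Load a cell-by-gene count matrix (placeholder file name)
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")

# Step 3: drop low-quality cells and rarely expressed genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Step 4: normalize library sizes and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Step 5: feature selection and dimensionality reduction
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]
sc.pp.pca(adata, n_comps=50)

# Step 6: graph-based clustering, visualized with UMAP
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Step 7: differential expression between the clusters
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")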

Snakefile

import os

#######################################################
############### Snakefile Configuration ###############
#######################################################

configfile: "bacto-0.1.json"

SAMPLES, PAIRS = glob_wildcards("raw_data/{sample}_{pair}.fastq.gz")

SAMPLES = list(set(SAMPLES))

#https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref
#reference_name is one of: argannot, card, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core, vfdb_full, virulencefinder
#VFDB: a reference database for bacterial virulence factors.
#"VFDB 2016: hierarchical and refined dataset for big data analysis - 10 years on"
#ariba getref vfdb_core vfdb_core
#mv vfdb_core.tsv db/vfdb_core/vfdb_core.tsv
#mv vfdb_core.fa db/vfdb_core/vfdb_core.fa
#MODIFIED: trimmomatic threads=6 doesn't work, add hard-coding {threads} as 12 in line 136
#DB = ["argannot", "card", "megares", "plasmidfinder", "resfinder", "srst2_argannot", "vfdb_core", "vfdb_full", "virulencefinder"]
DB = ["megares", "plasmidfinder", "resfinder", "srst2_argannot", "vfdb_core", "virulencefinder"]

#########################################################################
### Helper functions to access and construct commands from parameters ###
### in the configuration file for this Snakemake Pipeline             ###
#########################################################################

def get_params_trailing():

    return "TRAILING:" + str(config["trimmomatic"]["trailing"])

def get_params_leading():

    return "LEADING:" + str(config["trimmomatic"]["leading"])

def get_params_minlen():

    return "MINLEN:" + str(config["trimmomatic"]["min_len"])

def get_params_illuminaclip():

    params = config["trimmomatic"]

    return "ILLUMINACLIP:" + params["adapter_path"] + ":" \
           + str(params["seed_mismatch"]) + ":" + str(params["palindrome_threshold"]) \
           + ":" + str(params["clip_threshold"]) + ":" + str(params["min_adapter_length"]) \
           + ":" + str(params["keep_both_reads"])

def get_params_slidingwindow():

    params = config["trimmomatic"]

    return "SLIDINGWINDOW" + ":" + str(params["window_size"]) + ":" + str(params["window_quality"])

def get_genus_options(wildcards):

    if config["prokka"]["genus"]:
        cmd = "--usegenus --genus " + config["prokka"]["genus"]
    else:
        cmd = ""

    return cmd

def get_reference_options(wildcards):

    if config["quast"]["reference"]:
        cmd = "-R " + config["quast"]["reference"]
    else:
        cmd = ""

    return cmd

def get_forward_files(wildcards):

    return "raw_data/{id}_R1.fastq.gz".format(id=wildcards.sample)

def get_reverse_files(wildcards):

    return "raw_data/{id}_R2.fastq.gz".format(id=wildcards.sample)

##############################################
############### Consuming Rule ###############
##############################################

rule all:
    input:
        expand("kraken/{sample}_report.txt", sample=SAMPLES) if config["taxonomic_classifier"] else [],
        #expand("trimmed/{sample}_trimmed_P_1.fastq.gz", sample=SAMPLES) if config["trim"] else [],
        expand("fastqc/{sample}_fastqc.html", sample=SAMPLES) if config["fastqc"] else [],
        expand("shovill/{sample}/contigs.fa", sample=SAMPLES) if config["typing_ariba"] else [],
        expand("prokka/{sample}/{sample}.gbk", sample=SAMPLES) if config["typing_ariba"] else [],

        #expand("mykrobe/{sample}/{sample}.json", sample=SAMPLES) if config["typing"] else [],
        expand("ariba/{db}/{sample}.report.txt", db=DB, sample=SAMPLES) if config["typing_ariba"] else [],
        expand("mlst/{sample}.mlst.txt", sample=SAMPLES) if config["assembly"] and config["typing_mlst"] else [],

    "roary/gene_presence_absence.csv" if config["pangenome"] else [],
    #"core_alignment/core_alignment_roary.fasta" if config["pangenome"] else [],

        "variants/snippy.core.full.aln" if config["variants_calling"] else [],
        "variants/snippy.core.aln" if config["variants_calling"] else [],

        "fasttree/snippy.core.tree" if config["phylogeny_fasttree"] else [],
        "raxml-ng/snippy.core.aln.raxml.bestTree" if config["phylogeny_raxml"] else [],
        "gubbins/recomb.final_tree.tre"  if config["recombination"] else [],

        #conda install -c anaconda seaborn
        #conda install libgcc
        #conda install matplotlib biopython numpy pandas
        #"fasttree_matrix.png"

########################################
############### QC Rules ###############
########################################

#if config["trim"]:
rule trimmomatic:
    input:
        fwd = get_forward_files,
        rev = get_reverse_files
    params:
        illumina_clip=get_params_illuminaclip(),
        sliding_window=get_params_slidingwindow(),
        minlen=get_params_minlen(),
        trailing=get_params_trailing(),
        leading=get_params_leading()
    output:
        fwd_p="trimmed/{sample}_trimmed_P_1.fastq",
        fwd_u="trimmed/{sample}_trimmed_U_1.fastq",
        rev_p="trimmed/{sample}_trimmed_P_2.fastq",
        rev_u="trimmed/{sample}_trimmed_U_2.fastq",
        fwd_fq="fastq/{sample}_1.fastq",
        rev_fq="fastq/{sample}_2.fastq"
    threads:
        config["trimmomatic"]["cpu"]
    shell:
        "trimmomatic PE -threads 12 {input.fwd} {input.rev} {output.fwd_p} {output.fwd_u}"
        " {output.rev_p} {output.rev_u} {params.illumina_clip} {params.sliding_window} {params.leading}"
        " {params.trailing} {params.minlen} && "
        "ln -s $(pwd)/{output.fwd_p} $(pwd)/{output.fwd_fq} && ln -s $(pwd)/{output.rev_p} $(pwd)/{output.rev_fq} && "
        "touch -h {output.fwd_fq} && touch -h {output.rev_fq}"

if config["fastqc"]:
    rule fastqc:
        input:
            fwd_after="fastq/{sample}_1.fastq",
            rev_after="fastq/{sample}_2.fastq"
        output:
            "fastqc/{sample}_1_fastqc.html",
            "fastqc/{sample}_2_fastqc.html",
        shell:
            "fastqc --outdir fastqc {input.fwd_after}"
            " && fastqc --outdir fastqc {input.rev_after}"

if config["taxonomic_classifier"]:
    rule kraken:
        input:
            fwd_kra="fastq/{sample}_1.fastq",
            rev_kra="fastq/{sample}_2.fastq"
        params:
            db=config["kraken"]["db_path"]
        output:
            tax="kraken/{sample}_tax.out",
            report="kraken/{sample}_report.txt"
        threads:
            config["kraken"]["cpu"]
        shell:
            "cat {params.db}/database.* > /dev/null"
            " && kraken --db {params.db} --threads {threads} --output {output.tax}"
            " --fastq-input --paired {input.fwd_kra} {input.rev_kra}"
            " && kraken-report --db {params.db} {output.tax} > {output.report}"

##############################################
########### Assembly and Annotation ##########
##############################################

if config["assembly"]:

    rule shovill:
        input:
            forward = "fastq/{sample}_1.fastq",
            reverse = "fastq/{sample}_2.fastq"
        params:
            spades = config["shovill"]["spades"],
            cpu = config["shovill"]["cpu"],
            depth = config["shovill"]["depth"],
            other = config["shovill"]["other"]
        output:
            "shovill/{sample}/contigs.fa"
        shell:
            "shovill --outdir shovill/{wildcards.sample} --depth {params.depth} --force --R1 {input.forward} "
            "--R2 {input.reverse} --mincov 1 --cpus {params.cpu} {params.other}"

    #DEBUG: replace the tbl2asn with the newest version: https://github.com/tseemann/prokka/issues/139 --> SOLVED!
    rule prokka:
        input:
            "shovill/{sample}/contigs.fa"
        params:
            cpu = config["prokka"]["cpu"],
            evalue = config["prokka"]["evalue"],
            genus_options = get_genus_options,
            kingdom = config["prokka"]["kingdom"],
            species = config["prokka"]["species"],
            other = config["prokka"]["other"]
        output:
            "prokka/{sample}/{sample}.gbk",
            "prokka/{sample}/{sample}.gff"
        shell:
             """ 
             prokka --force --outdir prokka/{wildcards.sample} --cpus {params.cpu} {params.genus_options} --kingdom {params.kingdom} --species {params.species} --addgenes --addmrna --prefix {wildcards.sample} --locustag {wildcards.sample} {params.other} {input} -hmm /mnt/h1/jhuang/REFs/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
             """
#tbl2asn -V b -a r10k -l paired-ends -M n -N 1 -y 'Annotated using prokka 1.12 from https://github.com/tseemann/prokka' -Z prokka\/HD31_2\/HD31_2\.err -i prokka\/HD31_2\/HD31_2\.fsa
#wget -O tbl2asn.gz ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz
#ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/DOCUMENTATION/VERSIONS
#ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/
#gunzip tbl2asn.gz
#chmod +x tbl2asn
#mv linux64.tbl2asn /home/jhuang/anaconda3/envs/bengal/bin/tbl2asn 
#prokka 1.12 has problem!
#conda install prokka

    if config["typing_mlst"]:
        rule mlst:
            input:
                "shovill/{sample}/contigs.fa"
            output:
                "mlst/{sample}.mlst.txt"
            shell:
                "mlst {input} > {output}"

########################################
############### Typing #################
########################################

if config["typing_ariba"]:
    # Warning. This is some black magic with the LD_LIBRARY_PATH
    # to make MykrobePredictor work in Conda. MCCORTEX31 fails for
    # ZLIB library dependency, which is present in CONDA_PREFIX/lib
    # mykrobe predict --skeleton_dir ./mykrobe/HD31N3_S93 HD31N3_S93 staph --mccortex31_path /home/jhuang/anaconda3/envs/bengal/bin/mccortex31 -1 fastq/HD31N3_S93_1.fastq fastq/HD31N3_S93_2.fastq > mykrobe/HD31N3_S93/HD31N3_S93.json
    #rule mykrobe_predictor:
    #    input:
    #        forward = "fastq/{sample}_1.fastq",
    #        reverse = "fastq/{sample}_2.fastq"
    #    params:
    #        species=config["mykrobe"]["species"]
    #    output:
    #        "mykrobe/{sample}/{sample}.json"
    #    conda:
    #        "envs/mykrobe.yaml"
    #    shell:
    #        #considering removing LIBRARY_PATH management from the code
    #        "LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH && "
    #        "echo $LD_LIBRARY_PATH && "
    #        "mykrobe predict --skeleton_dir ./mykrobe/{wildcards.sample} {wildcards.sample} {params.species} "
    #        "-1 {input.forward} {input.reverse} > {output}"

    #rule ariba_prepareref:
#        input:
#                gene_dbs="db/{db}/{db}.fa",
#                gene_tsv="db/{db}/{db}.tsv"
#        output:
#                gene_preps="db/{db}/prepareref.{db}"
#        shell:
#                "ariba prepareref -f {input.gene_dbs} -m {input.gene_tsv} {output.gene_preps}"

#argannot, card, megares, plasmidfinder, resfinder, srst2_argannot, vfdb_core, vfdb_full, virulencefinder
##ariba getref argannot argannot  
##mkdir db/argannot
##mv argannot* db/argannot
##ariba prepareref -f db/argannot/argannot.fa -m db/argannot/argannot.tsv db/argannot/prepareref.argannot
##-------
##ariba getref card card  
##mkdir db/card
##mv card* db/card
##ariba prepareref -f db/card/card.fa -m db/card/card.tsv db/card/prepareref.card
##-------
##ariba getref megares megares  
##mkdir db/megares
##mv megares* db/megares
##ariba prepareref -f db/megares/megares.fa -m db/megares/megares.tsv db/megares/prepareref.megares
##-------
##ariba getref plasmidfinder plasmidfinder  
##mkdir db/plasmidfinder
##mv plasmidfinder* db/plasmidfinder
##ariba prepareref -f db/plasmidfinder/plasmidfinder.fa -m db/plasmidfinder/plasmidfinder.tsv db/plasmidfinder/prepareref.plasmidfinder
##-------
##ariba prepareref -f db/resfinder/resfinder.fa -m db/resfinder/resfinder.tsv db/resfinder/prepareref.resfinder
##-------
##ariba getref srst2_argannot srst2_argannot  
##mkdir db/srst2_argannot
##mv srst2_argannot* db/srst2_argannot
##ariba prepareref -f db/srst2_argannot/srst2_argannot.fa -m db/srst2_argannot/srst2_argannot.tsv db/srst2_argannot/prepareref.srst2_argannot
##-------
##ariba prepareref -f db/vfdb_core/vfdb_core.fa -m db/vfdb_core/vfdb_core.tsv db/vfdb_core/prepareref.vfdb_core
##-------
##ariba getref vfdb_full vfdb_full  
##mkdir db/vfdb_full
##mv vfdb_full* db/vfdb_full
##ariba prepareref -f db/vfdb_full/vfdb_full.fa -m db/vfdb_full/vfdb_full.tsv db/vfdb_full/prepareref.vfdb_full
##-------
##ariba getref virulencefinder virulencefinder  
##mkdir db/virulencefinder
##mv virulencefinder* db/virulencefinder
##ariba prepareref -f db/virulencefinder/virulencefinder.fa -m db/virulencefinder/virulencefinder.tsv db/virulencefinder/prepareref.virulencefinder
#

    rule ariba:
        input:
                prepref = "db/{db}/prepareref.{db}",
                forward = "fastq/{sample}_1.fastq",
                reverse = "fastq/{sample}_2.fastq"
        params:
                outdir="ariba/{db}/{sample}",
                report="ariba/{db}/{sample}/report.tsv"
        output:
                "ariba/{db}/{sample}.report.txt"
        shell:
                "ariba run --force {input.prepref} {input.forward} {input.reverse} {params.outdir} && "
                "mv {params.report} {output}"

#####################################################
############### Pangenome using roary ###############
#####################################################

#http://sepsis-omics.github.io/tutorials/modules/roary/
if config["pangenome"]:
    rule roary:
        input:
            expand("prokka/{sample}/{sample}.gff", sample=SAMPLES)
        params:
            core=config["roary"]["core"],
            identity=config["roary"]["identity"],
            other=config["roary"]["other"]
        output:
            "roary/gene_presence_absence.csv"
            #"core_alignment/core_alignment_roary.fasta"
        threads:
            config["roary"]["cpu"]
        shell:
            #{threads} doesn't work, using hard-coding 15
            "roary -p 15 -f ./roary -i {params.identity} -cd {params.core} -s -e -n -v {params.other} {input} && "
            "mv roary_*/* ./roary && rm -rf ./roary_*"
            #"cp roary/core_gene_alignment.aln core_alignment/core_alignment_roary.fasta"
            #roary -e --mafft -p 8 *.gff
            #roary -p 4 -f {params.outdir} -s -e -n -v {input} && touch {output}    

########################################
############### Variants ###############
########################################

# Snippy in the environment requires the following:
# Install latest Python 3.5 freebayes=1.1.0=3 from BioConda
# Install bzip2 from Conda-Forge (add to channels in YAML)
# Then install standard Snippy from BioConda - for some reason the
# normal install does not work unless using this sequence of installations.

#TODO: using own refs from this STEP, using panphlan
if config["variants_calling"]:
    rule snippy:
        input:
            forward="fastq/{sample}_1.fastq",
            reverse="fastq/{sample}_2.fastq"
        params:
            outdir = "snippy/{sample}",
            reference = config["snippy"]["reference"],
            mincov = config["snippy"]["mincov"],
            minfrac = config["snippy"]["minfrac"],
            mapqual = config["snippy"]["mapqual"],
            other = config["snippy"]["other"],
            cpu = config["snippy"]["cpu"]
        output:
            "snippy/{sample}/{sample}.txt",
            #"snippy/{sample}/{sample}.depth"
        shell:
        #removing LIBRARY_PATH management from the code
            #"LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH && "
            "snippy --force --outdir {params.outdir} --ref {params.reference} "
            "--R1 {input.forward} --R2 {input.reverse} --cpus {params.cpu} "
            "--mincov {params.mincov} --minfrac {params.minfrac} "
            "--mapqual {params.mapqual} --prefix {wildcards.sample} {params.other} "
            #"&& gzip -d snippy/{wildcards.sample}/{wildcards.sample}.depth.gz"

    rule snippy_core:
        input:
            expand("snippy/{sample}/{sample}.txt", sample=SAMPLES)
        params:
            reference = config["snippy"]["reference"]
        output:
            "variants/snippy.core.full.aln",
            "variants/snippy.core.aln"
        shell:
            "snippy-core --ref {params.reference} --prefix snippy.core snippy/* && "
            "mv snippy.core.* variants/"

##########################################################################
########### Set Analysis based on variants/snippy.core.aln ###############
##########################################################################
if config["phylogeny_fasttree"]:
    rule fasttree_snp:
        input:
            "variants/snippy.core.aln"
        output:
            "fasttree/snippy.core.tree"
        shell:
            "FastTree -gtr -nt {input} > {output}"

#DEBUG: threads of raxml_ng doesn't work!
if config["phylogeny_raxml"]:
    rule raxml_ng:
        input:
            "variants/snippy.core.aln"
        params:
            model = config["raxml_ng"]["model"],
            correction = config["raxml_ng"]["correction"],
            bootstrap = config["raxml_ng"]["bootstrap"],
            other = config["raxml_ng"]["other"]
        output:
            "raxml-ng/snippy.core.aln.raxml.bestTree"
        threads:
            config["raxml_ng"]["cpu"]
        shell:
        #{threads} doesn't work, using hard-coding 8
            "raxml-ng --all --model {params.model}{params.correction} --prefix raxml-ng/snippy.core.aln --threads 8 "
            "--msa {input} --bs-trees {params.bootstrap} {params.other}"

if config["recombination"]:
    rule gubbins:
        input:
            "variants/snippy.core.aln"
        params:
            model = config["gubbins"]["model"],
            prefix = "recomb",
            tree_builder = config["gubbins"]["tree_builder"],
            iterations = config["gubbins"]["iterations"],
            min_snps = config["gubbins"]["min_snps"],
            min_window_size = config["gubbins"]["min_window_size"],
            max_window_size = config["gubbins"]["max_window_size"],
            filter_percentage = config["gubbins"]["filter_percentage"],
            other = config["gubbins"]["other"]
        output:
            "gubbins/recomb.final_tree.tre"
        shell:
            "run_gubbins.py --tree_builder {params.tree_builder} --iterations {params.iterations} --raxml_model "
            "{params.model} --min_snps {params.min_snps} --min_window_size {params.min_window_size} "
            "--max_window_size {params.max_window_size} --filter_percentage {params.filter_percentage} "
            "--prefix {params.prefix} {params.other} {input} && mv {params.prefix}* gubbins"

#cp variants/snippy.core.full.aln variants/snippy.core.full.with_ref.aln
#cp variants/snippy.core.aln variants/snippy.core.with_ref.aln
#rule visualize_roary:
#    input:
#        tree = "fasttree/snippy.core.tree",
#        csv = "roary/gene_presence_absence.csv",
#        plot_script_fp = "local/roary_plots"
#    output:
#        "fasttree_matrix.png"
#    shell:
#        """
#        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv}
#        mv pangenome_matrix.png matrix_fasttree.png
#        """
rule visualize_roary_fasttree:
    input:
        tree = "fasttree/snippy.core.tree",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.fasttree.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_fasttree.png
        """

rule visualize_roary_raxml:
    input:
        tree = "raxml-ng/snippy.core.aln.raxml.bestTree",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.raxml.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_raxml.png
        """

rule visualize_roary_gubbins:
    input:
        tree = "gubbins/recomb.final_tree.tre",
        csv = "roary/gene_presence_absence.csv",
        plot_script_fp = "local/roary_plots"
    output:
        "visualize.gubbins.done"
    shell:
        """
        python {input.plot_script_fp}/roary_plots.py {input.tree} {input.csv} && touch {output}
        mv pangenome_matrix.png pangenome_matrix_gubbins.png
        """

rule run_piggy:
    input:
        gff="prokka_gffs/",
        roa="roary/"
    output:
        "piggy.done"
    params:
        outdir = "piggy"
    shell:
        """
        piggy -i {input.gff} -r {input.roa} -o {params.outdir} && touch {output}
        """

Clustering of Promoter Types Based on Motif Frequency and Distribution

To implement the clustering of promoter types based on motif frequency and distribution using Python, you can follow these steps:

  1. Import the required libraries:

    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans
  2. Prepare your data:

    • Read the dataset containing motif frequency and distribution information for each promoter region into a Pandas DataFrame.
    • Make sure your dataset has columns for promoter regions, motif frequencies, and motif distributions on the + and – strands.
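    For instance, assuming a tab-separated input file (the file name promoter_motifs.tsv and the column names below are hypothetical):

      import pandas as pd
      
      # Hypothetical input: one row per promoter region
      data = pd.read_csv('promoter_motifs.tsv', sep='\t')
      # Assumed columns: 'promoter_id', 'motif_frequency',
      # 'positive_strand_distribution', 'negative_strand_distribution'
      print(data.head())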
  3. Perform clustering:

    • Select the features (motif frequencies and distributions) that you want to use for clustering.
    • Normalize the selected features using Min-Max scaling or another appropriate method.
    • Choose the number of clusters (k) you want to create.
    • Apply the K-means clustering algorithm to cluster the data based on the selected features.

      # Select features for clustering
      features = ['motif_frequency', 'positive_strand_distribution', 'negative_strand_distribution']
      
      # Normalize the features with Min-Max scaling
      normalized_data = (data[features] - data[features].min()) / (data[features].max() - data[features].min())
      
      # Apply K-means clustering with a chosen number of clusters, e.g. k = 3
      k = 3
      kmeans = KMeans(n_clusters=k)
      clusters = kmeans.fit_predict(normalized_data)
  4. Analyze the clustering results:

    • Assign the cluster labels to the original dataset.

      data['cluster'] = clusters
    • Analyze the characteristics of each cluster, such as the average motif frequency and distribution, by grouping the data by cluster labels and calculating the mean values.

      cluster_means = data.groupby('cluster')[features].mean()
  5. Visualize the clustering results:

    • Create visualizations, such as scatter plots or bar plots, to show the distribution of motifs in different clusters.
    • Plot the average motif frequency and distribution for each cluster.

      cluster_means.plot(kind='bar')

Remember to adjust the implementation based on your specific dataset and requirements. You may need to preprocess the data or use different clustering algorithms depending on your needs.

Analysis of Peak Distribution in Promoters

import pprint
import argparse
import matplotlib.pyplot as plt
import pandas as pd
import gffutils
import numpy as np

#db = gffutils.create_db('gencode.v43.annotation.gtf', dbfn='gencode.v43.annotation.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
#python3 plot_peaks4.py peaks.bed gencode.v43.annotation.db

def plot_peak_distribution(peaks_file, gencode_db):
    db = gffutils.FeatureDB(gencode_db, keep_order=True)

    peaks = pd.read_table(peaks_file, header=None)
    peaks.columns = ['chrom', 'start', 'end', 'name', 'score']

    # Group TSS positions and genes by chromosome so that each peak is only
    # compared against genes on its own chromosome.
    tss_by_chrom = {}
    genes_by_chrom = {}
    for gene in db.features_of_type('gene'):
        # For + strand genes the TSS is the start; for - strand genes, the end.
        tss = gene.start if gene.strand == '+' else gene.end
        tss_by_chrom.setdefault(gene.seqid, []).append(tss)
        genes_by_chrom.setdefault(gene.seqid, []).append(gene)

    peak_tss_distances = []
    peak_names = []
    closest_genes = []
    closest_gene_names = []
    closest_strands = []
    for _, peak in peaks.iterrows():
        if peak['chrom'] not in tss_by_chrom:
            continue  # skip peaks on contigs absent from the annotation
        tss_positions = tss_by_chrom[peak['chrom']]
        tss_genes = genes_by_chrom[peak['chrom']]
        # // performs integer division, giving the midpoint of the peak
        peak_center = (peak['start'] + peak['end']) // 2
        closest_tss_index = np.argmin([abs(tss - peak_center) for tss in tss_positions])
        distance_to_tss = peak_center - tss_positions[closest_tss_index]
        peak_tss_distances.append(distance_to_tss)
        peak_names.append(peak['name'])
        closest_genes.append(tss_genes[closest_tss_index].id)
        #closest_genes.append(tss_genes[closest_tss_index].attributes['gene_name'][0])
        if 'gene_name' in tss_genes[closest_tss_index].attributes:
            closest_gene_name = tss_genes[closest_tss_index].attributes['gene_name'][0]
            #print(closest_genesymbol)
        else:
            closest_gene_name = 'N/A'  # Set a default value if 'gene_name' attribute is not present
            #print(closest_genesymbol)
        closest_gene_names.append(closest_gene_name)
        #attributes_obj = tss_genes[closest_tss_index].attributes
        #attribute_names = attributes_obj.keys()
        #print(attribute_names)
        #dict_keys(['gene_id', 'gene_type', 'gene_name', 'level', 'hgnc_id', 'tag', 'havana_gene'])
        #print(tss_genes[closest_tss_index].attributes['havana_gene'])
        #['ENSG00000235519.3'], ['lncRNA'], ['TLCD5'], ['2'], NONE, NONE, ['OTTHUMG00000195005.1']
        closest_strands.append(tss_genes[closest_tss_index].strand)

    # Save distances and gene info to an Excel file
    df = pd.DataFrame({
        'Peak Name': peak_names,
        'Distance to TSS': peak_tss_distances,
        'Closest Gene (Ensemble ID)': closest_genes,
        'Closest Gene (Gene name)': closest_gene_names,
        'Gene Strand': closest_strands
    })
    df.to_excel('peak_tss_distances.xlsx', index=False)

    # Calculate histogram and save bins and counts to a file
    counts, bin_edges = np.histogram(peak_tss_distances, bins=1000)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    hist_df = pd.DataFrame({'Bin Center': bin_centers, 'Count': counts})
    hist_df.to_csv('peak_distribution.csv', index=False)
    total_peaks = hist_df['Count'].sum()
    with open('peak_distribution.csv', 'a') as f:
        f.write(f'\nTotal number of peaks: {total_peaks}') 

    plt.hist(peak_tss_distances, bins=1000)
    plt.xlabel('Distance to TSS')
    plt.ylabel('Number of peaks')
    plt.title('Distribution of peaks around TSS')
    plt.savefig('peak_distribution.png')
    plt.show()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Plot peak distribution around TSS.')
    parser.add_argument('peaks_file', type=str, help='Path to the peaks.bed file')
    parser.add_argument('gencode_db', type=str, help='Path to the gencode.v43.annotation.db file')

    args = parser.parse_args()
    plot_peak_distribution(args.peaks_file, args.gencode_db)

LiftOver: An Essential Utility for the Conversion of Genomic Coordinates

If you have genomic coordinates (gene positions, SNP positions, etc.) in hg19 and want to convert them to hg38, you’d use what’s known as a “liftover”. The UCSC Genome Browser provides a tool specifically for this purpose, the UCSC LiftOver tool.

Here is a brief step-by-step process to use the UCSC LiftOver tool:

Format your data: Make sure your data is in BED format (a text file with at least the first three columns being chromosome, start position, and end position).

Go to the LiftOver tool: Visit the UCSC LiftOver tool at https://genome.ucsc.edu/cgi-bin/hgLiftOver.

Upload your file and select the appropriate parameters: Upload your BED file, choose “hg19” as the original genome, and “hg38” as the genome to convert to.

Submit and download the result: Click “submit” and download the resulting BED file, which will contain your original genomic coordinates converted from hg19 to hg38.

It’s important to note that not all genomic regions will have a direct equivalent between versions, so some data may be lost in the conversion.

If you need to do this conversion frequently or with large datasets, there are also command-line tools and scripts available to perform the liftover locally, such as the “CrossMap” tool or the “liftOver” utility provided by UCSC, which can be run from a Unix-like (Linux/Mac) command line.
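
For example, the UCSC liftOver utility takes the input BED file, a chain file, and names for the output and unmapped files (the file names here are placeholders):

liftOver input_hg19.bed hg19ToHg38.over.chain.gz output_hg38.bed unmapped.bed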

If you wish to perform a liftover in Python, you can use the package pyliftover. This package provides a Python interface to use the precomputed chain files available from the UCSC Genome Browser. Here is a simple example of how to use this package:

First, you need to install the pyliftover package. You can do this with pip:

pip install pyliftover

Then, download the chain file for the conversion. For hg19 to hg38, you can download it from the UCSC Genome Browser:

wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz

Now, you can use the package in Python (when initialized with genome names, as below, pyliftover downloads the chain file automatically if it is not found locally):

from pyliftover import LiftOver

# Initialize the LiftOver object
lo = LiftOver('hg19', 'hg38')

# Convert a position (e.g., chr1:1000000)
new_pos = lo.convert_coordinate('chr1', 1000000)

print(new_pos)

In this code:

  1. We import the LiftOver class from the pyliftover package.
  2. We initialize a LiftOver object with the source and target genomes (‘hg19’ and ‘hg38’).
  3. We call the convert_coordinate method of the LiftOver object to convert the position. The convert_coordinate method takes two arguments: the name of the chromosome (as a string) and the 0-based position (as an integer).
  4. Finally, we print the converted position.

Please note that the convert_coordinate method returns a list of conversions because a position could potentially map to multiple positions in the new genome. Each conversion is a tuple, where the first element is the name of the new chromosome, the second element is the new position, the third element is the strand (‘+’ or ‘-’), and the fourth element is the score of the alignment chain used for the conversion.

Also, keep in mind that not all positions can be converted. If a position cannot be converted, convert_coordinate will return an empty list.
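
As a minimal sketch of guarding against unmapped positions (the lifted coordinate shown in the comment is illustrative):

from pyliftover import LiftOver

lo = LiftOver('hg19', 'hg38')

def lift_position(chrom, pos):
    """Return the lifted (chrom, pos), or None if the position is unmapped."""
    results = lo.convert_coordinate(chrom, pos)
    if not results:
        return None  # deleted, split, or otherwise unmapped in hg38
    new_chrom, new_pos, strand, score = results[0]
    return new_chrom, new_pos

print(lift_position('chr1', 1000000))  # e.g. ('chr1', 1064620)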

In the process of lifting over, some regions might not get mapped from the old assembly to the new one. There are several reasons why this could happen:

  1. Assembly differences: The two assemblies (source and target) might have different representations of certain genomic regions. For example, some regions might have been rearranged, added, or removed in the newer assembly.

  2. Unmapped regions: Some regions in the older assembly might not have a clear equivalent in the newer assembly, and vice versa. This could happen if, for example, a region is represented differently in the two assemblies due to updates in the understanding of the genome structure.

  3. Quality of mapping: LiftOver uses chain files that represent alignments of large blocks of the genome. If a region in the old assembly does not align well with any region in the new assembly, it might not get mapped.

  4. LiftOver parameters: LiftOver has some parameters (like the minimum ratio of bases that must remap) that can affect which regions get mapped. If you set these parameters to be more stringent, some regions might not get mapped.

When using LiftOver, it’s important to check the unmapped regions (the tool usually provides a file with these regions) to see if they are important for your analysis. If they are, you might need to consider other strategies to include them, such as manual inspection or alternative mapping tools.

#Deleted in new
chr14   483848  483848

#Explain failure messages
Deleted in new:
    Sequence intersects no chains
Partially deleted in new:
    Sequence insufficiently intersects one chain
Split in new:
    Sequence insufficiently intersects multiple chains
Duplicated in new:
    Sequence sufficiently intersects multiple chains
Boundary problem:
    Missing start or end base in an exon

https://epd.expasy.org/epd/human/human_database.php?db=human

Identifying the Nearest Genomic Peaks within Defined Regions

To find the closest peaks in the genome regions defined by a bed file, you can use a tool like BEDTools. BEDTools provides a function closest which allows you to find the closest feature in a second file to each feature in a first file.

Here is an example of how you might use BEDTools to achieve this:

bedtools closest -a peaks.bed -b genome_regions.bed > closest_peaks.bed

In this command:

  • -a peaks.bed specifies the input file with the peaks.
  • -b genome_regions.bed specifies the input file with the genome regions.
  • > closest_peaks.bed redirects the output to a file named closest_peaks.bed. The resulting closest_peaks.bed file will contain, for each peak in peaks.bed, the closest region from genome_regions.bed.

Please note that you’ll need to have BEDTools installed to use this command.

If you’re looking to do this operation in Python, you might want to look into libraries like pybedtools which provide a Python interface to BEDTools. Using pybedtools, the command would look something like this:

import pybedtools

peaks = pybedtools.BedTool('peaks.bed')
genome_regions = pybedtools.BedTool('genome_regions.bed')

closest_peaks = peaks.closest(genome_regions)

closest_peaks.saveas('closest_peaks.bed')

This script does the same thing as the previous bash command but in Python.

Remember that BEDTools requires your BED files to be sorted, and the bedtools closest command requires both input files to be sorted in the same order. So if you sort your peaks file, you should also sort your genome regions file in the same way, and vice versa.

  1. Sort your BED file: You can use the sort-bed command from the BEDOPS suite, or sort -k1,1 -k2,2n in Unix to sort your BED file. Here is an example:

    sort -k1,1 -k2,2n genome_regions.bed > genome_regions_sorted.bed

    Then you can rerun your bedtools closest command with the sorted BED files (note that the -a file must be sorted the same way):

    bedtools closest -a peaks_NHDF_sorted.bed -b genome_regions_sorted.bed
  2. Specify a genome file: If you have a genome file (a text file listing all of the chromosome names and sizes in your genome), you can use the -g option to tell BEDTools the correct order of the chromosomes. Here is an example:

    bedtools closest -a peaks_NHDF_.bed -b genome_regions.bed -g genome_file.txt

    In this example, genome_file.txt should be a two-column, tab-separated file with chromosome names in the first column and chromosome sizes in the second column. The chromosomes should be listed in the desired order.

Commands

sort -k1,1 -k2,2n peaks_NHDF_.bed > peaks_NHDF_sorted.bed
sort -k1,1 -k2,2n Integrationsites_converted_.bed > Integrationsites_converted_sorted.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d > peaks_on_integrationsites.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique.bed
sort -k9,9 -k10,10n peaks_on_integrationsites.bed > peaks_on_integrationsites_sorted.bed

bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_1.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_1.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique_1.bed
sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '($10 != -1) && !seen[$9]++' > peaks_on_integrationsites_unique_1.bed
#chr6    16523081        16524081        Merged-chr6-16523581-1  7.680000        chr6    20569311        20635162        WaGa_Czech-Sioli_et_al._Plos_Pathog_2020        4045231 -4045231
#cut -f10 peaks_on_integrationsites_1.bed | sort -n | uniq > doublecheck.txt
#-1
#4045231
#4085125
#4223056

> peaks_on_integrationsites_unique.bed
#for number in 1 2 3 4 5  6 7 8 9 10  11 12 13 14 15  16 17 18 19 20  21 22 23 24 25  26 27 28 29 30  31 32 33 34 35  36 37 38 39 40  41 42 43 44 45  46 47 48 49 50; do
for number in 51 52 53 54 55  56 57 58 59 60  61 62 63 64 65  66 67 68 69 70  71 72 73; do
  sed -n "${number}p" Integrationsites_converted_.bed > Integrationsites_converted_${number}.bed
  bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_${number}.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_${number}.bed
  sort -k9,9 -k10,10n peaks_on_integrationsites_${number}.bed | awk '($10 != -1) && !seen[$9]++' >> peaks_on_integrationsites_unique.bed
done

#check the bed file format using the following commands.
cat -t Integrationsites_converted_4.bed
sed 's/ \+/\t/g' Integrationsites_converted_4.bed > Integrationsites_converted_tab.bed
cut -f2,3 Integrationsites_converted_4.bed | grep -Pv '^\d+\t\d+$'

~/Tools/csv2xls-0.4/csv_to_xls.py peaks_on_integrationsites_unique_.bed -d$'\t' -o peaks_on_integrationsites.xls

Antibodies and cell lines that are commonly used in research

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeBroadHistone

Antibody:

  • CBP (SC-369)
  • H3K4me1
  • H3K4me2
  • H3K4me3
  • H3K9ac
  • H3K9me1
  • H3K9me3
  • H3K27ac
  • H3K27me3
  • H3K36me3
  • H3K79me2
  • H4K20me1

Cell Line:

  • GM12878 (Tier 1)
  • H1-hESC (Tier 1)
  • K562 (Tier 1)
  • A549 (Tier 2)
  • HeLa-S3 (Tier 2)
  • HepG2 (Tier 2)
  • HUVEC (Tier 2)
  • Monocytes CD14+ RO01746
  • Dnd41
  • HSMMtube
  • NH-A: Normal Human Astrocytes. Astrocytes are a type of cell in the brain and spinal cord (i.e., the central nervous system). They provide physical and nutritional support for neurons and modulate their functions, playing key roles in maintaining the blood-brain barrier, supplying nutrients to nervous tissue, and in the repair and scarring of the brain and spinal cord after traumatic injury. They are also involved in many neurological conditions, making them an important focus for research.
  • NHDF-Ad: Normal Human Dermal Fibroblasts (Adult). These cells are found in the dermis layer of the skin and are responsible for generating connective tissue and allowing the skin to recover from injury.
  • NHEK: Normal Human Epidermal Keratinocytes. These cells are from the outermost layer of the skin and are involved in protecting the body from environmental damage.
  • NHLF: Normal Human Lung Fibroblasts. Fibroblasts are the most common cells of connective tissue in the body and play a crucial role in wound healing.
  • HSMM: Human Skeletal Muscle Myoblasts. These cells are precursors to muscle cells (myocytes) and play a crucial role in muscle repair and regeneration.
  • HMEC: Human Mammary Epithelial Cells. These cells line the ducts and lobules of the mammary gland.
  • Osteoblasts: These are the cells that are responsible for bone formation.

Defining and Categorizing Promoter Types Based on the ‘GRGGC’ Motif Frequency and Distance to TSS

  1. generate_promoter_sequences

    #!/usr/bin/env python3
    #./1_generate_promoter_sequences.py gencode.v43.annotation.gtf.db
    import gffutils
    from pyfaidx import Fasta
    import argparse
    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord
    from Bio.Seq import Seq
    
    def extract_promoter_sequences(gencode_db, genome_records, upstream_length=3000, downstream_length=3000):
        db = gffutils.FeatureDB(gencode_db, keep_order=True)
    
        promoters = []
        for gene in db.features_of_type('gene'): #CDS
            if gene.strand == '+':
                promoter_start = max(gene.start - upstream_length, 1)
                promoter_end = gene.start + downstream_length - 1
            else: # gene.strand == '-'
                promoter_start = gene.end - downstream_length + 1
                promoter_end = min(gene.end + upstream_length, len(genome_records[gene.chrom]))
    
            promoter_seq = str(genome_records[gene.chrom][promoter_start-1:promoter_end])
    
            # Commenting out the following two lines doesn't affect the output size; both give 62703 records
            #if not set(promoter_seq.upper()) <= {"A", "C", "G", "T"}:
            #    print(f"Invalid sequence for id {gene.id}: {promoter_seq}")
    
            if gene.strand == '-':
                promoter_seq = str(Seq(promoter_seq).reverse_complement())
    
            record = SeqRecord(Seq(promoter_seq.upper()), id=gene.id, description="")
            promoters.append(record)
        return promoters
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Plot peak distribution around TSS.')
        parser.add_argument('gencode_db', type=str, help='Path to the gencode.v43.annotation.gtf.db file')
        args = parser.parse_args()
    
        genome_records = Fasta('/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa')
    
        promoter_sequences = extract_promoter_sequences(args.gencode_db, genome_records)
        SeqIO.write(promoter_sequences, "promoter_sequences_3000_3000.fasta", "fasta")
        #62703 containing ENSG00000198804.2
  2. find motifs in promoters

    #!/usr/bin/env python3
    #./find_motifs_in_promoters.py
    import re
    from Bio import SeqIO
    
    def find_motif_frequency_and_distribution(fasta_file, output_file, motif_1="GRGGC", motif_2="GCCYR"):
        motif_1 = motif_1.replace("R", "[AG]").replace("Y", "[CT]")
        motif_2 = motif_2.replace("R", "[AG]").replace("Y", "[CT]")
    
        promoter_sequences = SeqIO.parse(fasta_file, "fasta")
        #with open(output_file, 'w') as f:
        #    f.write("PromoterID\tMotif1_Count\tMotif2_Count\tMotif1_Positions\tMotif2_Positions\n")
        #    for promoter in promoter_sequences:
        #        motif1_positions = [m.start() for m in re.finditer(motif_1, str(promoter.seq))]
        #        motif2_positions = [m.start() for m in re.finditer(motif_2, str(promoter.seq))]
        #        line = f"{promoter.id}\t{len(motif1_positions)}\t{len(motif2_positions)}\t{motif1_positions}\t{motif2_positions}\n"
        #        f.write(line)
        with open(output_file, 'w') as f:
            f.write("PromoterID\tMotif1_Count\tMotif2_Count\tMotif1_Center_Positions\tMotif2_Center_Positions\n")
            for promoter in promoter_sequences:
                # Use the length of the matched string (len(m.group())), not the
                # expanded regex, for the center offset; subtracting 3000 makes
                # positions relative to the TSS at index 3000.
                motif1_positions = [(m.start() + len(m.group()) // 2 - 3000) for m in re.finditer(motif_1, str(promoter.seq))]
                motif2_positions = [(m.start() + len(m.group()) // 2 - 3000) for m in re.finditer(motif_2, str(promoter.seq))]
                line = f"{promoter.id}\t{len(motif1_positions)}\t{len(motif2_positions)}\t{motif1_positions}\t{motif2_positions}\n"
                f.write(line)
    
    if __name__ == "__main__":
        find_motif_frequency_and_distribution("promoter_sequences_3000_3000.fasta", "motif_counts_positions.txt")
  3. R codes

    #--3.1. draw plots for all motifs
    #Delete the '[' and ']' characters in motif_counts_positions.txt (e.g. in the Kate editor) before loading
    library(dplyr)  #provides rename()
    data <- read.table("motif_counts_positions.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data <- rename(data, Motif_GRGGC_Count=Motif1_Count, Motif_GCCYR_Count=Motif2_Count, Motif_GRGGC_Center_Positions = Motif1_Center_Positions, Motif_GCCYR_Center_Positions = Motif2_Center_Positions)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif1 counts
    png("Histogram_of_Motif_GRGGC_counts.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif2 counts
    png("Histogram_of_Motif_GCCYR_counts.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    # # Convert list-like string to actual list of numbers
    # data$Motif_GRGGC_Center_Positions <- sapply(strsplit(gsub("[\\[\\]\\s+]", "", data$Motif_GRGGC_Center_Positions), ","), function(x) as.numeric(x[nzchar(x)]))
    # data$Motif_GCCYR_Center_Positions <- sapply(strsplit(gsub("[\\[\\]\\s+]", "", data$Motif_GCCYR_Center_Positions), ","), function(x) as.numeric(x[nzchar(x)]))
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif_GRGGC <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif_GCCYR <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif_GRGGC, data_motif_GCCYR)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif1 positions
    png("Density_of_Motif_GRGGC_positions.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif2 positions
    png("Density_of_Motif_GCCYR_positions.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
    
    #--3.2. split data into two data frames: one where the PromoterID occurs in filtered_data$Closest.Gene..Ensemble.ID., and one with the remaining rows
    # 'filtered_data' is assumed to come from the earlier peak-to-TSS analysis
    # (e.g., the peak_tss_distances table filtered for promoter-proximal peaks).
    # Create a new data frame for rows in 'data' where the PromoterID is in filtered_data's Closest Gene (Ensemble ID)
    data_in_filtered <- data[data$PromoterID %in% filtered_data$Closest.Gene..Ensemble.ID., ]
    data_in_filtered$Motif_GRGGC_Center_Positions <- sapply(data_in_filtered$Motif_GRGGC_Center_Positions, toString)
    data_in_filtered$Motif_GCCYR_Center_Positions <- sapply(data_in_filtered$Motif_GCCYR_Center_Positions, toString)
    write.table(data_in_filtered, file="motifs_in_promoters_with_LT_peak.txt", sep="\t", row.names = FALSE)
    
    # Create another data frame for the remaining rows
    data_not_in_filtered <- data[!data$PromoterID %in% filtered_data$Closest.Gene..Ensemble.ID., ]
    data_not_in_filtered$Motif_GRGGC_Center_Positions <- sapply(data_not_in_filtered$Motif_GRGGC_Center_Positions, toString)
    data_not_in_filtered$Motif_GCCYR_Center_Positions <- sapply(data_not_in_filtered$Motif_GCCYR_Center_Positions, toString)
    write.table(data_not_in_filtered, file="motifs_in_promoters_without_LT_peak.txt", sep="\t", row.names = FALSE)
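    
    # (Quick consistency check, a minimal sketch: the two subsets above should
    # partition the original table.)
    stopifnot(nrow(data_in_filtered) + nrow(data_not_in_filtered) == nrow(data))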
    
    #--3.3. draw plots for motifs_in_promoters_with_LT_peak.txt
    # Before loading, delete the '"' characters in motifs_in_promoters_with_LT_peak.txt
    # (e.g., with Kate); alternatively, pass quote=FALSE to write.table above to avoid this step
    data <- read.table("motifs_in_promoters_with_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif_GRGGC counts
    png("Histogram_of_Motif_GRGGC_in_promoters_with_LT_peak.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif_GCCYR counts
    png("Histogram_of_Motif_GCCYR_in_promoters_with_LT_peak.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif1 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif2 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif1, data_motif2)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif_GRGGC positions
    png("Density_of_Motif_GRGGC_in_promoters_with_LT_peak.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif_GCCYR positions
    png("Density_of_Motif_GCCYR_in_promoters_with_LT_peak.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
    
    #--3.4. draw plots for motifs_in_promoters_without_LT_peak.txt
    # Before loading, delete the '"' characters in motifs_in_promoters_without_LT_peak.txt
    # (e.g., with Kate); alternatively, pass quote=FALSE to write.table above
    data <- read.table("motifs_in_promoters_without_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif_GRGGC counts
    png("Histogram_of_Motif_GRGGC_in_promoters_without_LT_peak.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif_GCCYR counts
    png("Histogram_of_Motif_GCCYR_in_promoters_without_LT_peak.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif1 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif2 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif1, data_motif2)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif_GRGGC positions
    png("Density_of_Motif_GRGGC_in_promoters_without_LT_peak.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif_GCCYR positions
    png("Density_of_Motif_GCCYR_in_promoters_without_LT_peak.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
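    
    # (Sections 3.3 and 3.4 run the same steps on different input files. As a
    # minimal sketch, the hypothetical helper plot_motif_summaries() below could
    # replace both blocks; it assumes the quote characters have already been
    # stripped from the input files as described above.)
    plot_motif_summaries <- function(infile, suffix) {
      d <- read.table(infile, sep="\t", header=TRUE, stringsAsFactors=FALSE)
      d$Motif_GRGGC_Count <- as.numeric(d$Motif_GRGGC_Count)
      d$Motif_GCCYR_Count <- as.numeric(d$Motif_GCCYR_Count)
      # Histograms of motif counts
      png(paste0("Histogram_of_Motif_GRGGC_", suffix, ".png"), width=800, height=800)
      hist(d$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
      dev.off()
      png(paste0("Histogram_of_Motif_GCCYR_", suffix, ".png"), width=800, height=800)
      hist(d$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
      dev.off()
      # Density plots of motif center positions
      grggc <- as.numeric(unlist(strsplit(as.character(d$Motif_GRGGC_Center_Positions), ",")))
      gccyr <- as.numeric(unlist(strsplit(as.character(d$Motif_GCCYR_Center_Positions), ",")))
      png(paste0("Density_of_Motif_GRGGC_", suffix, ".png"), width=800, height=800)
      plot(density(grggc, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
      dev.off()
      png(paste0("Density_of_Motif_GCCYR_", suffix, ".png"), width=800, height=800)
      plot(density(gccyr, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
      dev.off()
    }
    # Example usage, reproducing the two sections above:
    # plot_motif_summaries("motifs_in_promoters_with_LT_peak.txt", "in_promoters_with_LT_peak")
    # plot_motif_summaries("motifs_in_promoters_without_LT_peak.txt", "in_promoters_without_LT_peak")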
    
    #--3.5. assign each promoter to a predefined class according to Total_Motifs and Avg_Motif_Distance_To_TSS
    data <- read.table("motifs_in_promoters_with_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Add a column for the sum of Motif_GRGGC_Count and Motif_GCCYR_Count
    data <- data %>%
      mutate(Total_Motifs = Motif_GRGGC_Count + Motif_GCCYR_Count)
    
    # Average (signed) Motif_GRGGC center position relative to the TSS
    data$Avg_Motif_GRGGC_Distance_To_TSS <- sapply(strsplit(data$Motif_GRGGC_Center_Positions, ", "), 
                                             function(x) round(mean(as.numeric(x), na.rm = TRUE)))
    # Average (signed) Motif_GCCYR center position relative to the TSS
    data$Avg_Motif_GCCYR_Distance_To_TSS <- sapply(strsplit(data$Motif_GCCYR_Center_Positions, ", "), 
                                             function(x) round(mean(as.numeric(x), na.rm = TRUE)))
    
    # Mean of two comma-separated strings of numbers. For an empty string,
    # strsplit() yields character(0) and as.numeric() a zero-length vector, so
    # empty inputs simply contribute nothing (two empty strings give NaN).
    mean_of_strings <- function(str1, str2) {
      vec1 <- as.numeric(strsplit(str1, split = ",")[[1]])
      vec2 <- as.numeric(strsplit(str2, split = ",")[[1]])
      round(mean(c(vec1, vec2), na.rm = TRUE))
    }
    # Apply the function to each row of the dataframe
    data$Avg_Motif_Distance_To_TSS <- mapply(mean_of_strings, data$Motif_GRGGC_Center_Positions, data$Motif_GCCYR_Center_Positions)
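    # Example, for illustration: mean_of_strings("-100, 200", "50") returns
    # round(mean(c(-100, 200, 50))) = 50.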
    
    # Assign classes based on the given rules; if preferred, the verbose labels
    # below can be replaced with short names such as "Class 1", "Class 2", ...
    data <- data %>%
      mutate(
        Class = case_when(
          Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) > 1000",
          Total_Motifs >= 50 & Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "50 <= Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs >= 50 & Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "50 <= Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) > 1000",
          Total_Motifs >= 100 & Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "100 <= Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs >= 100 & Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "100 <= Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) > 1000"
          # promoters with Total_Motifs >= 150 match none of the rules and are left as NA
        )
      )
    
    write.table(data, file="motifs_in_promoters_with_LT_peak_Category.txt", sep="\t", row.names = FALSE)  
    #~/Tools/csv2xls-0.4/csv_to_xls.py motifs_in_promoters_with_LT_peak_Category.txt -d$'\t' -o  motifs_in_promoters_with_LT_peak_Category.xls;
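    
    # (Quick check, a minimal sketch: tabulate promoters per class; the NA bucket
    # collects promoters with Total_Motifs >= 150, which the rules above do not cover.)
    print(table(data$Class, useNA = "ifany"))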