Evaluating the Proximity of Genomic Features (=integration sites) to Peaks Using a Permutation Test genomic features

gene_x 0 like s 580 view s

Tags: python, Biopython, ChIP-seq

The study revolves around the evaluation of the proximity of integration sites to peaks in the human genome, using a permutation-based approach. It involves three primary steps:

  1. Observed Data: We begin by considering our observed data, which are the distances from each integration site to the nearest peak. We compute the mean of these observed distances.

  2. Null Distribution: To generate a null distribution, we perform a permutation test by randomizing the integration sites. For each iteration (in this case, 1,000 iterations to create a robust distribution), we randomly select a number of integration sites equivalent to the count of unique observed distances from the human genome and calculate the distance to the nearest peak. The random integration sites are represented in a BED format (chromosome, start, end). These integration sites, termed as 'random_integration_sites', are chosen from defined chromosomal regions that provide the lengths of human chromosomes.

    The calculation of distances proceeds as follows:

    • For each feature (randomly generated integration site), we find the closest peak using the 'closest' function from the 'pybedtools' library, resulting in a 'random_closest' BEDTool object.
    • We extract the distances from each random integration site to its closest peak. These distances are stored in 'random_distances'.
    • We then calculate the mean of these random distances and store it in 'random_mean_distances'.
  3. Comparison: We then compare our observed mean distance to the null distribution of mean distances. The output p-value serves as an indicator of statistical significance. If the p-value is small (conventionally, less than 0.05 is considered significant), it suggests that the observed mean distance is significantly different from what would be expected by random chance, indicating that the features are not randomly distributed but are likely to be located closer to peaks in the genome.

    Observed mean distance: 1218081
    
    The statistical results from the 1000 iterations of the permutation test, in which we randomly generated potential integration sites:
    Mean:  1445755
    Standard deviation:  400673
    Minimum:  509078
    Maximum:  2978797
    P-value:  0.313
    

(Optional) If your observed mean distance is smaller than the smallest mean distance from your null distribution (i.e., the 1000 permutations), this indeed suggests that your observed integration sites are significantly closer to the peaks than would be expected under the null hypothesis.

(Optional) In other words, your observed integration sites are more closely located to the peaks than random integration sites, which supports the conclusion that there is a non-random association between your integration sites and the peaks.

(Optional) A small p-value suggests that the observed mean distance is significantly different from what is expected by random chance. This could mean that the integration sites under study are preferentially located near peaks in the genome. The null distribution is the collection of mean distances we calculated for each permutation. The proportion of mean distances in the null distribution that is greater than or equal to our observed mean distance serves as our p-value.

So the p-value calculation should look something like this:

#pip install statsmodels

import numpy as np
from pybedtools import BedTool
import pprint
from statsmodels.stats.multitest import multipletests
pp = pprint.PrettyPrinter(indent=4)

#sort -k1,1 -k2,2n peaks_on_integrationsites.csv > peaks_on_integrationsites_sorted.bed
#=898046
#1406936,133333333

# Observed distances
#observed_distances = [-4045231,563541,1118767,-1779287,0,-5347653,3935720,1146367,1507718,0,-1826,-7456,81323,68056,1386933,0,-545651,-84468,-652642,351958,218160,5644455,320101,2050624,-418508,-1061416,-351892,-33175,-296551,-138858,2221723,-658351,3419047,-2701162,1295321,4712290,0,1434626,-5479512,1918341,465313,-986431,190096,-566869,-736100,3579169,1087322,-2696342,-1866390,-14123,1250899,-1424025,-929436,232285,232338,3962087,1042645,728148,-163988,-188515,-1445728,-198270,-116532,267672,924015,735666,-1705528,147724,-122133,261167]
observed_distances = [4045231,563541,1118767,1779287,0,5347653,3935720,1146367,1507718,0,1826,7456,81323,68056,1386933,0,545651,84468,652642,351958,218160,5644455,320101,2050624,418508,1061416,351892,33175,296551,138858,2221723,658351,3419047,2701162,1295321,4712290,0,1434626,5479512,1918341,465313,986431,190096,566869,736100,3579169,1087322,2696342,1866390,14123,1250899,1424025,929436,232285,232338,3962087,1042645,728148,163988,188515,1445728,198270,116532,267672,924015,735666,1705528,147724,122133,261167]
#unique_observed_distances = list(set(observed_distances))
observed_mean = np.mean(observed_distances)
print('observed_mean:', observed_mean)

# Load peak ranges from the BED file
peaks = BedTool('peaks_NHDF_.bed').sort()

# Define chrom regions
# 'chrM': 16569, 
#175187.58208955225
chrom_regions = {
    'chr1': 248956422, 'chr2': 242193529, 'chr3': 198295559, 'chr4': 190214555, 'chr5': 181538259,
    'chr6': 170805979, 'chr7': 159345973, 'chr8': 145138636, 'chr9': 138394717, 'chr10': 133797422,
    'chr11': 135086622, 'chr12': 133275309, 'chr13': 114364328, 'chr15': 101991189,
    'chr16': 90338345, 'chr17': 83257441, 'chr18': 80373285, 'chr20': 64444167,
    'chr21': 46709983, 'chr22': 50818468, 'chr14': 107043718, 'chr19': 58617616, 'chrX': 156040895, 'chrY': 57227415
}

# Permutation test parameters  16.4
#620708 --> 42208198/47=898046 --> 1406936
num_permutations = 1000
num_features = len(observed_distances)

random_mean_distances = []
for _ in range(num_permutations):
    # Generate random integration sites
    random_integration_sites = []
    for _ in range(num_features):
        chrom = np.random.choice(list(chrom_regions.keys()))
        position = np.random.randint(0, chrom_regions[chrom])
        # where the range from end to start is always 898046, meaning these represent an average length of the integration sites in the genome
        random_integration_sites.append((chrom, position, position+898046))
    random_integration_sites = BedTool(random_integration_sites)
    random_integration_sites = random_integration_sites.sort()
    #print(random_integration_sites)

    # Find closest peaks for each feature
    random_closest = random_integration_sites.closest(peaks, d=True)

    # Extract distances
    random_distances = [int(i[-1]) for i in random_closest]

    # Calculate distances and their mean
    random_mean_distances.append(np.mean(random_distances))

    # Calculate p-value
    p_values = [np.mean([mean_dist <= observed_mean for mean_dist in random_mean_distances])]
    p_values_corrected = multipletests(p_values, method='fdr_bh')[1]  # Apply BH correction

#pp.pprint(random_mean_distances)
print("Mean: ", np.mean(random_mean_distances))
print("Standard deviation: ", np.std(random_mean_distances))
print("Minimum: ", np.min(random_mean_distances))
print("Maximum: ", np.max(random_mean_distances))
print("Uncorrected p-value: ", p_values[0])
print("Corrected p-value: ", p_values_corrected[0])  # After BH correction

      #[mean_dist >= observed_mean for mean_dist in random_mean_distances]
      #This is a list comprehension that returns a boolean list. For each mean distance in the list random_mean_distances (i.e., mean distances calculated from randomly selected genomic positions), it checks if the mean distance is greater than or equal to observed_mean (i.e., the mean of observed distances from the nearest peak for each integration site). If the condition is met, it returns True (which is equivalent to 1), else it returns False (equivalent to 0).
      #np.mean([...])
      #This is calculating the mean of the boolean list. In this context, the mean of the boolean list is equivalent to the proportion of random mean distances that are greater than or equal to the observed mean distance. This is because True is considered as 1 and False as 0 when calculating the mean. This proportion is used as the p-value.
      #The p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance because it tells the investigator that the hypothesis under consideration may not adequately explain the observation. In most cases, a threshold (alpha) is set at 0.05, meaning that there is a 5% chance that the difference is due to chance alone. If the p-value is less than 0.05, the null hypothesis is rejected.

I performed a statistical test aimed at evaluating whether the observed distances from integration sites to the nearest peaks are significantly different from what might be expected by random chance.

Here are the results:

  • Observed mean distance (calculated from a total of 70 integration sites to their nearest peaks): 1,218,081 nt.

The statistical results from the 1000 iterations of the permutation test are as follows:

  • Mean: 1,445,755
  • Standard deviation: 400,673
  • Minimum: 509,078
  • Maximum: 2,978,797

The P-value is 0.313. It suggests that our observed integration sites are not significantly closer to these peaks than would be expected if the sites were distributed randomly.

The statistical test involves three steps:

  1. Observed Data: We start by considering our observed data, which are the distances from each integration site to the nearest peak. We compute the mean of these observed distances.
  2. Null Distribution: To generate a null distribution, we perform a permutation test with 1000 iterations. For each iteration, we randomly generate 70 positions (as potential integration sites) and then calculate the distance from these positions to the nearest peak.
  3. Comparison: We then compare our observed mean distance to the 1000 mean distances generated in the last step.

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum