gene_x 0 like s 730 view s
Tags: python, Biopython, ChIP-seq
The study revolves around the evaluation of the proximity of integration sites to peaks in the human genome, using a permutation-based approach. It involves three primary steps:
Observed Data: We begin by considering our observed data, which are the distances from each integration site to the nearest peak. We compute the mean of these observed distances.
Null Distribution: To generate a null distribution, we perform a permutation test by randomizing the integration sites. For each iteration (in this case, 1,000 iterations to create a robust distribution), we randomly select a number of integration sites equivalent to the count of unique observed distances from the human genome and calculate the distance to the nearest peak. The random integration sites are represented in a BED format (chromosome, start, end). These integration sites, termed as 'random_integration_sites', are chosen from defined chromosomal regions that provide the lengths of human chromosomes.
The calculation of distances proceeds as follows:
Comparison: We then compare our observed mean distance to the null distribution of mean distances. The output p-value serves as an indicator of statistical significance. If the p-value is small (conventionally, less than 0.05 is considered significant), it suggests that the observed mean distance is significantly different from what would be expected by random chance, indicating that the features are not randomly distributed but are likely to be located closer to peaks in the genome.
Observed mean distance: 1218081
The statistical results from the 1000 iterations of the permutation test, in which we randomly generated potential integration sites:
Mean: 1445755
Standard deviation: 400673
Minimum: 509078
Maximum: 2978797
P-value: 0.313
(Optional) If your observed mean distance is smaller than the smallest mean distance from your null distribution (i.e., the 1000 permutations), this indeed suggests that your observed integration sites are significantly closer to the peaks than would be expected under the null hypothesis.
(Optional) In other words, your observed integration sites are more closely located to the peaks than random integration sites, which supports the conclusion that there is a non-random association between your integration sites and the peaks.
(Optional) A small p-value suggests that the observed mean distance is significantly different from what is expected by random chance. This could mean that the integration sites under study are preferentially located near peaks in the genome. The null distribution is the collection of mean distances we calculated for each permutation. The proportion of mean distances in the null distribution that is greater than or equal to our observed mean distance serves as our p-value.
So the p-value calculation should look something like this:
#pip install statsmodels
import numpy as np
from pybedtools import BedTool
import pprint
from statsmodels.stats.multitest import multipletests
pp = pprint.PrettyPrinter(indent=4)
#sort -k1,1 -k2,2n peaks_on_integrationsites.csv > peaks_on_integrationsites_sorted.bed
#=898046
#1406936,133333333
# Observed distances
#observed_distances = [-4045231,563541,1118767,-1779287,0,-5347653,3935720,1146367,1507718,0,-1826,-7456,81323,68056,1386933,0,-545651,-84468,-652642,351958,218160,5644455,320101,2050624,-418508,-1061416,-351892,-33175,-296551,-138858,2221723,-658351,3419047,-2701162,1295321,4712290,0,1434626,-5479512,1918341,465313,-986431,190096,-566869,-736100,3579169,1087322,-2696342,-1866390,-14123,1250899,-1424025,-929436,232285,232338,3962087,1042645,728148,-163988,-188515,-1445728,-198270,-116532,267672,924015,735666,-1705528,147724,-122133,261167]
observed_distances = [4045231,563541,1118767,1779287,0,5347653,3935720,1146367,1507718,0,1826,7456,81323,68056,1386933,0,545651,84468,652642,351958,218160,5644455,320101,2050624,418508,1061416,351892,33175,296551,138858,2221723,658351,3419047,2701162,1295321,4712290,0,1434626,5479512,1918341,465313,986431,190096,566869,736100,3579169,1087322,2696342,1866390,14123,1250899,1424025,929436,232285,232338,3962087,1042645,728148,163988,188515,1445728,198270,116532,267672,924015,735666,1705528,147724,122133,261167]
#unique_observed_distances = list(set(observed_distances))
observed_mean = np.mean(observed_distances)
print('observed_mean:', observed_mean)
# Load peak ranges from the BED file
peaks = BedTool('peaks_NHDF_.bed').sort()
# Define chrom regions
# 'chrM': 16569,
#175187.58208955225
chrom_regions = {
'chr1': 248956422, 'chr2': 242193529, 'chr3': 198295559, 'chr4': 190214555, 'chr5': 181538259,
'chr6': 170805979, 'chr7': 159345973, 'chr8': 145138636, 'chr9': 138394717, 'chr10': 133797422,
'chr11': 135086622, 'chr12': 133275309, 'chr13': 114364328, 'chr15': 101991189,
'chr16': 90338345, 'chr17': 83257441, 'chr18': 80373285, 'chr20': 64444167,
'chr21': 46709983, 'chr22': 50818468, 'chr14': 107043718, 'chr19': 58617616, 'chrX': 156040895, 'chrY': 57227415
}
# Permutation test parameters 16.4
#620708 --> 42208198/47=898046 --> 1406936
num_permutations = 1000
num_features = len(observed_distances)
random_mean_distances = []
for _ in range(num_permutations):
# Generate random integration sites
random_integration_sites = []
for _ in range(num_features):
chrom = np.random.choice(list(chrom_regions.keys()))
position = np.random.randint(0, chrom_regions[chrom])
# where the range from end to start is always 898046, meaning these represent an average length of the integration sites in the genome
random_integration_sites.append((chrom, position, position+898046))
random_integration_sites = BedTool(random_integration_sites)
random_integration_sites = random_integration_sites.sort()
#print(random_integration_sites)
# Find closest peaks for each feature
random_closest = random_integration_sites.closest(peaks, d=True)
# Extract distances
random_distances = [int(i[-1]) for i in random_closest]
# Calculate distances and their mean
random_mean_distances.append(np.mean(random_distances))
# Calculate p-value
p_values = [np.mean([mean_dist <= observed_mean for mean_dist in random_mean_distances])]
p_values_corrected = multipletests(p_values, method='fdr_bh')[1] # Apply BH correction
#pp.pprint(random_mean_distances)
print("Mean: ", np.mean(random_mean_distances))
print("Standard deviation: ", np.std(random_mean_distances))
print("Minimum: ", np.min(random_mean_distances))
print("Maximum: ", np.max(random_mean_distances))
print("Uncorrected p-value: ", p_values[0])
print("Corrected p-value: ", p_values_corrected[0]) # After BH correction
#[mean_dist >= observed_mean for mean_dist in random_mean_distances]
#This is a list comprehension that returns a boolean list. For each mean distance in the list random_mean_distances (i.e., mean distances calculated from randomly selected genomic positions), it checks if the mean distance is greater than or equal to observed_mean (i.e., the mean of observed distances from the nearest peak for each integration site). If the condition is met, it returns True (which is equivalent to 1), else it returns False (equivalent to 0).
#np.mean([...])
#This is calculating the mean of the boolean list. In this context, the mean of the boolean list is equivalent to the proportion of random mean distances that are greater than or equal to the observed mean distance. This is because True is considered as 1 and False as 0 when calculating the mean. This proportion is used as the p-value.
#The p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance because it tells the investigator that the hypothesis under consideration may not adequately explain the observation. In most cases, a threshold (alpha) is set at 0.05, meaning that there is a 5% chance that the difference is due to chance alone. If the p-value is less than 0.05, the null hypothesis is rejected.
I performed a statistical test aimed at evaluating whether the observed distances from integration sites to the nearest peaks are significantly different from what might be expected by random chance.
Here are the results:
The statistical results from the 1000 iterations of the permutation test are as follows:
The P-value is 0.313. It suggests that our observed integration sites are not significantly closer to these peaks than would be expected if the sites were distributed randomly.
The statistical test involves three steps:
点赞本文的读者
还没有人对此文章表态
没有评论
Density of motif plots and its statistical tests
Calculate AT content of the flanking regions of peaks
© 2023 XGenes.com Impressum