Identifying the Nearest Genomic Peaks within Defined Regions

gene_x 0 like s 1374 view s

Tags: python, genomics, pipeline

To find the closest peaks in the genome regions defined by a bed file, you can use a tool like BEDTools. BEDTools provides a function closest which allows you to find the closest feature in a second file to each feature in a first file.

Here is an example of how you might use BEDTools to achieve this:

bedtools closest -a peaks.bed -b genome_regions.bed > closest_peaks.bed

In this command:

-a peaks.bed specifies the input file with the peaks.
-b genome_regions.bed specifies the input file with the genome regions.
> closest_peaks.bed redirects the output to a file named closest_peaks.bed. The resulting closest_peaks.bed file will contain the closest peak from peaks.bed for each region in genome_regions.bed.

Please note that you'll need to have BEDTools installed to use this command.

If you're looking to do this operation in Python, you might want to look into libraries like pybedtools which provide a Python interface to BEDTools. Using pybedtools, the command would look something like this:

import pybedtools

peaks = pybedtools.BedTool('peaks.bed')
genome_regions = pybedtools.BedTool('genome_regions.bed')

closest_peaks = peaks.closest(genome_regions)

closest_peaks.saveas('closest_peaks.bed')

This script does the same thing as the previous bash command but in Python.

Remember that BEDTools requires your BED files to be sorted, and the bedtools closest command requires both input files to be sorted in the same order. So if you sort your peaks file, you should also sort your genome regions file in the same way, and vice versa.

Sort your BED file: You can use the sort-bed command from the BEDOPS suite, or sort -k1,1 -k2,2n in Unix to sort your BED file. Here is an example:
```
sort -k1,1 -k2,2n genome_regions.bed > genome_regions_sorted.bed
```
Then you can rerun your bedtools closest command with the sorted BED file:
```
bedtools closest -a peaks_NHDF_.bed -b genome_regions_sorted.bed
```
Specify a genome file: If you have a genome file (a text file listing all of the chromosome names and sizes in your genome), you can use the -g option to tell BEDTools the correct order of the chromosomes. Here is an example:
```
bedtools closest -a peaks_NHDF_.bed -b genome_regions.bed -g genome_file.txt
```
In this example, genome_file.txt should be a two-column, tab-separated file with chromosome names in the first column and chromosome sizes in the second column. The chromosomes should be listed in the desired order.

Commands

sort -k1,1 -k2,2n peaks_NHDF_.bed > peaks_NHDF_sorted.bed
sort -k1,1 -k2,2n Integrationsites_converted_.bed > Integrationsites_converted_sorted.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d > peaks_on_integrationsites.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique.bed
sort -k9,9 -k10,10n peaks_on_integrationsites.bed > peaks_on_integrationsites_sorted.bed

bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_1.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_1.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique_1.bed
sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '($10 != -1) && !seen[$9]++' > peaks_on_integrationsites_unique_1.bed
#chr6    16523081        16524081        Merged-chr6-16523581-1  7.680000        chr6    20569311        20635162        WaGa_Czech-Sioli_et_al._Plos_Pathog_2020        4045231 -4045231
#cut -f10 peaks_on_integrationsites_1.bed | sort -n | uniq > doublecheck.txt
#-1
#4045231
#4085125
#4223056

> peaks_on_integrationsites_unique.bed
#for number in 1 2 3 4 5  6 7 8 9 10  11 12 13 14 15  16 17 18 19 20  21 22 23 24 25  26 27 28 29 30  31 32 33 34 35  36 37 38 39 40  41 42 43 44 45  46 47 48 49 50; do
for number in 51 52 53 54 55  56 57 58 59 60  61 62 63 64 65  66 67 68 69 70  71 72 73; do
  sed -n "${number}p" Integrationsites_converted_.bed > Integrationsites_converted_${number}.bed
  bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_${number}.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_${number}.bed
  sort -k9,9 -k10,10n peaks_on_integrationsites_${number}.bed | awk '($10 != -1) && !seen[$9]++' >> peaks_on_integrationsites_unique.bed
done

#check the bed file format using the following commands.
cat -t Integrationsites_converted_4.bed
sed 's/ \+/\t/g' Integrationsites_converted_4.bed > Integrationsites_converted_tab.bed
cut -f2,3 Integrationsites_converted_4.bed | grep -Pv '^\d+\t\d+$'

~/Tools/csv2xls-0.4/csv_to_xls.py peaks_on_integrationsites_unique_.bed -d$'\t' -o peaks_on_integrationsites.xls

like unlike

点赞本文的读者

还没有人对此文章表态

本文有评论

没有评论

Identifying the Nearest Genomic Peaks within Defined Regions

本文有评论

看文章，发评论，不要沉默

最受欢迎文章

最新文章

最多评论文章

推荐相似文章