gene_x 0 like s 772 view s
Tags: python, genomics, pipeline
To find the closest peaks in the genome regions defined by a bed file, you can use a tool like BEDTools. BEDTools provides a function closest
which allows you to find the closest feature in a second file to each feature in a first file.
Here is an example of how you might use BEDTools to achieve this:
bedtools closest -a peaks.bed -b genome_regions.bed > closest_peaks.bed
In this command:
-a peaks.bed
specifies the input file with the peaks.-b genome_regions.bed
specifies the input file with the genome regions.> closest_peaks.bed
redirects the output to a file named closest_peaks.bed
.
The resulting closest_peaks.bed
file will contain the closest peak from peaks.bed
for each region in genome_regions.bed
.Please note that you'll need to have BEDTools installed to use this command.
If you're looking to do this operation in Python, you might want to look into libraries like pybedtools which provide a Python interface to BEDTools. Using pybedtools, the command would look something like this:
import pybedtools
peaks = pybedtools.BedTool('peaks.bed')
genome_regions = pybedtools.BedTool('genome_regions.bed')
closest_peaks = peaks.closest(genome_regions)
closest_peaks.saveas('closest_peaks.bed')
This script does the same thing as the previous bash command but in Python.
Remember that BEDTools requires your BED files to be sorted, and the bedtools closest command requires both input files to be sorted in the same order. So if you sort your peaks file, you should also sort your genome regions file in the same way, and vice versa.
Sort your BED file: You can use the sort-bed command from the BEDOPS suite, or sort -k1,1 -k2,2n in Unix to sort your BED file. Here is an example:
sort -k1,1 -k2,2n genome_regions.bed > genome_regions_sorted.bed
Then you can rerun your bedtools closest command with the sorted BED file:
bedtools closest -a peaks_NHDF_.bed -b genome_regions_sorted.bed
Specify a genome file: If you have a genome file (a text file listing all of the chromosome names and sizes in your genome), you can use the -g option to tell BEDTools the correct order of the chromosomes. Here is an example:
bedtools closest -a peaks_NHDF_.bed -b genome_regions.bed -g genome_file.txt
In this example, genome_file.txt should be a two-column, tab-separated file with chromosome names in the first column and chromosome sizes in the second column. The chromosomes should be listed in the desired order.
Commands
sort -k1,1 -k2,2n peaks_NHDF_.bed > peaks_NHDF_sorted.bed
sort -k1,1 -k2,2n Integrationsites_converted_.bed > Integrationsites_converted_sorted.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d > peaks_on_integrationsites.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique.bed
sort -k9,9 -k10,10n peaks_on_integrationsites.bed > peaks_on_integrationsites_sorted.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_1.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_1.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique_1.bed
sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '($10 != -1) && !seen[$9]++' > peaks_on_integrationsites_unique_1.bed
#chr6 16523081 16524081 Merged-chr6-16523581-1 7.680000 chr6 20569311 20635162 WaGa_Czech-Sioli_et_al._Plos_Pathog_2020 4045231 -4045231
#cut -f10 peaks_on_integrationsites_1.bed | sort -n | uniq > doublecheck.txt
#-1
#4045231
#4085125
#4223056
> peaks_on_integrationsites_unique.bed
#for number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50; do
for number in 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73; do
sed -n "${number}p" Integrationsites_converted_.bed > Integrationsites_converted_${number}.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_${number}.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_${number}.bed
sort -k9,9 -k10,10n peaks_on_integrationsites_${number}.bed | awk '($10 != -1) && !seen[$9]++' >> peaks_on_integrationsites_unique.bed
done
#check the bed file format using the following commands.
cat -t Integrationsites_converted_4.bed
sed 's/ \+/\t/g' Integrationsites_converted_4.bed > Integrationsites_converted_tab.bed
cut -f2,3 Integrationsites_converted_4.bed | grep -Pv '^\d+\t\d+$'
~/Tools/csv2xls-0.4/csv_to_xls.py peaks_on_integrationsites_unique_.bed -d$'\t' -o peaks_on_integrationsites.xls
点赞本文的读者
还没有人对此文章表态
没有评论
Small RNA sequencing processing in the example of smallRNA_7
© 2023 XGenes.com Impressum