BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input.

BART: a transcription factor prediction tool with query gene sets or epigenomic profiles
Zhenjia Wang, Mete Civelek, Clint Miller, Nathan Sheffield, Michael J. Guertin, Chongzhi Zang
Bioinformatics 34, 2867–2869 (2018)
https://zanglab.github.io/bart/
https://github.com/zanglab/bart2
https://github.com/Boyle-Lab/Blacklist/tree/master/lists
http://lisa.cistrome.org/doc
plotCorrelation -in results.npz –corMethod spearman –whatToPlot heatmap –plotFile correlation_heatmap.png
plotPCA -in results.npz –plotFile pca.png
#under python3.8 it is not necessary!
#cd /usr/lib/x86_64-linux-gnu
#sudo ln -s libhdf5_serial_hl.so.10.0.2 libhdf5_hl.so
#sudo ln -s libhdf5_serial.so.10.1.0 libhdf5.so









README for BART(1.0.1)

============
Introduction
============

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input. BART leverages 3,485 human TF binding profiles and 3,055 mouse TF binding profiles from the public domain (collected in Cistrome Data Browser) to make the prediction.

BART is implemented in Python and distributed as an open-source package along with necessary data libraries.

BART is developed and maintained by the Chongzhi Zang Lab at the University of Virginia.

============

========
Tutorial
========

Positional arguments

{geneset,profile}

bart geneset

Given a query gene set (at least 100 genes recommended), predict functional transcription factors that regulate these genes.

*Usage: bart geneset [-h] -i -s [-t ] [-p ]
[–nonorm] [–outdir ] [-o ]

*Example: bart geneset -i name_enhancer_prediction.txt -s hg38 -t target.txt -p 4
–outdir bart_output

*Input arguments:

-i , –infile

Input file, the name_enhancer_prediction.txt profile generated from MARGE.

-s , –species

Species, please choose from “hg38” or “mm10”.

-t , –target

Target transcription factors of interests, please put each TF in one line. BART will generate extra plots showing prediction results for each TF.

-p , –processes

Number of CPUs BART can use.

–nonorm

Whether or not do the standardization for each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

–outdir

If specified, all output files will be written to that directory. Default: the current working directory

-o , –ofilename

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for , i.e., the enhancer_prediction.txt file generated by MARGE, might have two different formats below (depending on python versions py2 or py3):

a. Python2 version:

1 98.19
2 99.76
3 99.76
4 9.49
5 44.37
6 18.14

b. Python3 version:

chrom start end UDHSID Score
chr3 175483637 175483761 643494 3086.50
chr3 175485120 175485170 643497 2999.18
chr3 175484862 175485092 643496 2998.28
chr3 175484804 175484854 643495 2976.27
chr3 175491775 175491825 643507 2879.01
chr3 175478670 175478836 643491 2836.90

bart profile

Given a ChIP-seq data file (bed or bam format mapped reads), predict transcription factors whose binding pattern associates with the input ChIP-seq profile.

*Usage: bart profile [-h] -i -f [-n ] -s
[-t ] [-p ] [–nonorm]
[–outdir ] [-o ]

*Example: bart profile -i ChIP.bed -f bed -s hg38 -t target.txt -p 4
–outdir bart_output

*Input files arguments:

-i , –infile

Input ChIP-seq bed or bam file.

-f , –format

Specify “bed” or “bam” format.

-n , –fragmentsize

Fragment size of ChIP-seq reads, in bps. Default: 150.

-s , –species

Species, please choose from “hg38” or “mm10”.

-t , –target

Target transcription factors of interests, please put each TF in one line. BART will generate extra plots showing prediction results for each TF.

-p , –processes

Number of CPUs BART can use.

–nonorm

Whether or not do the standardization for each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

–outdir

If specified, all output files will be written to that directory. Default: the current working directory

-o , –ofilename

Name string of output files. Default: the base name of input file.

*Notes:

The input file for should be BED (https://genome.ucsc.edu/FAQ/FAQformat#format1) or BAM (http://samtools.github.io/hts-specs/SAMv1.pdf) format in either hg38 or mm10.

Bed is a tab-delimited text file that defines the data lines, and the BED file format is described on UCSC genome browser website (https://genome.ucsc.edu/FAQ/FAQformat). For BED format input, the first three columns should be chrom, chromStart, chromEnd, and the 6th column of strand information is required by BART.

BAM is a binary version of Sequence Alignment/Map(SAM) (http://samtools.sourceforge.net) format, and for more information about BAM custom tracks, please click here (https://genome.ucsc.edu/goldenPath/help/bam.html).

Output files

1. name_auc.txt contains the ROC-AUC scores for all TF datasets in human/mouse, we use this score to measure the similarity of TF dataset to cis-regulatory profile, and all TFs are ranked decreasingly by scores. The file should be like this:

AR_56254     AUC = 0.954
AR_44331     AUC = 0.950
AR_44338     AUC = 0.949
AR_50273     AUC = 0.947
AR_44314     AUC = 0.945
AR_44330     AUC = 0.943
AR_50100     AUC = 0.942
AR_44315     AUC = 0.942
AR_50044     AUC = 0.926
AR_50041     AUC = 0.925
FOXA1_50274     AUC = 0.924
AR_50042     AUC = 0.921

2. name_bart_results.txt is a ranking list of all TFs, which includes the Wilcoxon statistic score, Wilcoxon p value, standard Wilcoxon statistic score (zscore), maximum ROC-AUC score and rank score (relative rank of z score, p value and max auc) for each TF. The most functional TFs of input data are ranked first. The file should be like this:

TF statistic pvalue zscore max_auc rela_rank
AR 18.654 1.172e-77 3.024 0.954 0.004
FOXA1 13.272 3.346e-40 2.847 0.924 0.008
SUMO2 5.213 1.854e-07 3.494 0.749 0.021
PIAS1 3.987 6.679e-05 2.802 0.872 0.025
HOXB13 3.800 1.446e-04 2.632 0.909 0.027
GATA3 5.800 6.633e-09 2.549 0.769 0.028
NR3C1 4.500 6.789e-06 2.042 0.871 0.040
GATA6 4.240 2.237e-05 2.602 0.632 0.048
ESR1 12.178 4.057e-34 1.956 0.700 0.049
CEBPB 5.265 1.404e-07 2.287 0.602 0.057
ATF4 3.216 1.302e-03 2.348 0.658 0.065
TOP1 2.254 2.421e-02 3.057 0.779 0.065

3. name_plot is a folder which contains all the extra plots for the TFs listed in target files (target.txt file in test data). For each TF, we have boxplot, which shows the rank position of this TF in all TFs (derived from the rank score in name_bart_results.txt), and the cumulative distribution plot, which compares the distribution of ROC-AUC scores from datasets of this TF and the scores of all datasets (derived from the AUC scores in name_auc.txt).

Leave a Reply

Your email address will not be published. Required fields are marked *