ChIPseqSpikeInFree: A Spike-in Free ChIP-Seq Normalization Approach for Detecting Global Changes in Histone Modifications

https://github.com/stjude/ChIPseqSpikeInFree

Example:
> library("ChIPseqSpikeInFree")
> metaFile <- "/DATA/Data_Denise_ChIPSeq_Protocol2/Data_H3K27me3/sample_meta__part1.txt"
> ChIPseqSpikeInFree(bamFiles = bams, chromFile = "hg19", metaFile = metaFile, prefix = "k27")
#--> ave.SF = 2.46
cat ${sample_id}.dedup.sorted.bed | wc -l   #--> 19887819
15000000/(19887819*2.46) = 0.306597771
genomeCoverageBed -bg -scale 0.306597771 -i V_8_0_untreated_D1_H3K27me3.dedup.sorted.bed -g hg19.chromSizes > V_8_0_untreated_D1_H3K27me3.bedGraph
bedGraphToBigWig V_8_0_untreated_D1_H3K27me3.bedGraph hg19.chromSizes V_8_0_untreated_D1_H3K27me3.bw
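The arithmetic above generalizes to any sample: the genomeCoverageBed scale factor is target_depth / (read_count * SF), with 15,000,000 as the target depth used here. A small shell sketch of the same computation, reusing the ${sample_id} convention from above:

SF=2.46                                      # scaling factor reported by ChIPseqSpikeInFree (ave.SF above)
N=$(wc -l < ${sample_id}.dedup.sorted.bed)   # deduplicated read count (19887819 above)
SCALE=$(awk -v n="$N" -v sf="$SF" 'BEGIN { printf "%.9f", 15000000 / (n * sf) }')
genomeCoverageBed -bg -scale "$SCALE" -i ${sample_id}.dedup.sorted.bed -g hg19.chromSizes > ${sample_id}.bedGraph
bedGraphToBigWig ${sample_id}.bedGraph hg19.chromSizes ${sample_id}.bw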

A Spike-in Free ChIP-Seq Normalization Approach for Detecting Global Changes in Histone Modifications
Background

The traditional reads per million (RPM) normalization method is inappropriate for evaluating ChIP-seq data when a treatment or mutation has a global effect. Changes in global levels of histone modifications can be detected by using exogenous reference spike-in controls. However, most ChIP-seq studies have overlooked this normalization problem, which has to be corrected with spike-ins, and a method that retrospectively renormalizes data sets without spike-in has been lacking.

We observed that some highly enriched regions were retained despite global changes caused by oncogenic mutations or drug treatment, and that the proportion of reads within these regions was inversely associated with total histone mark levels. We therefore developed ChIPseqSpikeInFree, a novel ChIP-seq normalization method that effectively determines scaling factors for samples across various conditions and treatments, without relying on exogenous spike-in chromatin or peak detection, to reveal global changes in histone modification occupancy. This method reveals global changes of a magnitude similar to those measured by the spike-in method.

In summary, ChIPseqSpikeInFree can estimate scaling factors for ChIP-seq samples without exogenous spike-in and without input. When ChIP-seq is done with a spike-in protocol but high variation of spike-in reads between samples is observed, ChIPseqSpikeInFree can help you determine a more reliable scaling factor than the ChIP-Rx method. It is not recommended to run ChIPseqSpikeInFree blindly, without biological evidence (such as Western blotting) demonstrating a global change at the protein level between your control and treatment samples.

PiGx is a collection of genomics pipelines

http://bioinformatics.mdc-berlin.de/pigx/

PiGx: Pipelines in Genomics
What is PiGx?

PiGx is a collection of genomics pipelines. All pipelines are easily configured with a simple sample sheet and a descriptive settings file. The result is a set of comprehensive, interactive HTML reports with interesting findings about your samples.
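To give a flavor of "sample sheet plus settings file", a hypothetical minimal pair for the RNAseq pipeline might look like the sketch below. The column names, keys, and invocation are illustrative assumptions, not the authoritative PiGx schema; see the documentation for the exact format.

# sample_sheet.csv (hypothetical columns)
name,reads,reads2,sample_type
ctrl_1,ctrl_1_R1.fastq.gz,ctrl_1_R2.fastq.gz,control
treat_1,treat_1_R1.fastq.gz,treat_1_R2.fastq.gz,treatment

# settings.yaml (hypothetical keys)
locations:
  reads-dir: ./reads
  output-dir: ./output

# invocation (check pigx-rnaseq --help for the actual interface)
pigx-rnaseq -s settings.yaml sample_sheet.csv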
Publication

Wurmus R, Uyar Bora, Osberg B, Franke V, Gosdschan A, Wreczycka K, Ronen J, Akalin A. PiGx: Reproducible genomics analysis pipelines with GNU Guix. Gigascience. 2018 Oct 2. doi: 10.1093/gigascience/giy123. PubMed PMID: 30277498.

PiGx includes the following pipelines:
PiGx BSseq for raw fastq read data of bisulfite experiments
PiGx RNAseq for RNAseq samples
PiGx scRNAseq for single cell dropseq analysis
PiGx ChIPseq for reads from ChIPseq experiments
PiGx CRISPR (work in progress) for the analysis of sequence mutations in CRISPR-Cas9 targeted amplicon sequencing data

RNANR: a new set of algorithms for the exploration of RNA kinetics landscapes at the secondary structure level

Motivation: Kinetics is key to understanding many phenomena involving RNAs, such as co-transcriptional folding and riboswitches. Exact out-of-equilibrium studies induce extreme computational demands, leading state-of-the-art methods to rely on approximated kinetics landscapes, obtained using sampling strategies that strive to generate the key landmarks of the landscape topology. However, such methods are impeded by a large level of redundancy within sampled sets. Such redundancy is uninformative, and obfuscates important intermediate states, leading to an incomplete vision of RNA dynamics.

Results: We introduce RNANR, a new set of algorithms for the exploration of RNA kinetics landscapes at the secondary structure level. RNANR considers locally optimal structures, a reduced set of RNA conformations, in order to focus its sampling on basins in the kinetic landscape. Along with an exhaustive enumeration, RNANR implements a novel non-redundant stochastic sampling, and offers a rich array of structural parameters. Our tests on both real and random RNAs reveal that RNANR generates more unique structures in a given time than its competitors, and allows a deeper exploration of kinetics landscapes.

Availability and implementation: RNANR is freely available at https://project.inria.fr/rnalands/rnanr.

Contact: yann.ponty@lix.polytechnique.fr

RNAlishapes: a tool for structural analysis of classes of RNAs

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1636479/

Knowledge about classes of non-coding RNAs (ncRNAs) is growing very fast, and it is mainly structure that is the common characteristic property shared by members of the same class. For correct characterization of such classes it is therefore of great importance to analyse the structural features in detail. In this manuscript I present RNAlishapes, which combines various secondary structure analysis methods, such as suboptimal folding and shape abstraction, with a comparative approach known as RNA alignment folding. RNAlishapes makes use of an extended thermodynamic model and covariance scoring, which allows rewarding covariation of paired bases. Applying the algorithm to a set of bacterial trp-operon leaders using shape abstraction, it was able to identify the two alternating conformations of this attenuator. Besides providing in-depth analysis methods for aligned RNAs, the tool also shows fairly good prediction accuracy. Therefore, RNAlishapes provides the community with a powerful tool for structural analysis of classes of RNAs, and is also a reasonable method for consensus structure prediction based on sequence alignments. RNAlishapes is available for online use and download at http://rna.cyanolab.de.

Binning

Binning
=======

Scripts required to calculate tetramer frequencies and create input files for ESOM.
See: Dick, G.J., A. Andersson, B.J. Baker, S.S. Simmons, B.C. Thomas, A.P. Yelton, and J.F. Banfield (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10: R85
Open Access: http://genomebiology.com/2009/10/8/R85

How to ESOM?
============

These instructions are for ESOM-based binning; see http://databionic-esom.sourceforge.net/ for software download and manual.

1. Generate input files.
————————-
* Although not necessary, we recommend adding some reference genomes based on your 16S/OTU analysis as 'controls'. The idea is that, if the ESOM worked, your reference genome should form a bin by itself. You may do this by downloading genomes in fasta format from any public database, preferably a complete single-sequence genome.
* Use the `esomWrapper.pl` script to create the relevant input files for ESOM. In order to run this script, you'll need to have all your sequence files (in fasta format) with the same extension in the same folder. For example:
`perl esomWrapper.pl -path fasta_folder -ext fa`
For more help and examples, type:
`perl esomWrapper.pl -h`

* The script will use the fasta file to produce three tab-delimited files that ESOM requires:
* Learn file = a table of tetranucleotide frequencies (.lrn)
* Names file = a list of the names of each contig (.names)
* Class file = a list of the class of each contig, which can be used to color data points, etc. (.cls)

**NOTE:** `class number`: The ESOM mapping requires that you define your sequences as classes. We generally define all the sequences that belong to your query (metagenome, for example) as 0 and all the others as 1, 2 and so on. Think of these as your predefined bins; each sequence that has the same class number will be assigned the same color in the map.
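For concreteness, the class assignment in a .cls file is essentially a mapping from data point index to class number. A toy illustration with a metagenome (class 0) and two reference genomes (classes 1 and 2) could look like the lines below; the exact header conventions are described in the Databionic ESOM manual, so treat this as schematic:

% 6
1	0
2	0
3	0
4	1
5	1
6	2

Here the first column is the data point (window) index and the second is its class number.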

* These files are generated using Anders Andersson's perl script `tetramer_freqs_esom.pl`, which needs to be in the same folder as `esomWrapper.pl`. To see how to use `tetramer_freqs_esom.pl` independently of the wrapper, type:
`perl tetramer_freqs_esom.pl -h`

2. Run ESOM:
————-
* On your terminal, run the following command from anywhere (X11 must be enabled):
`./esomana`
* Load .lrn, .names, and .cls files (File > load .lrn etc.)
* Normalize the data (optional, but recommended): under the Data tab, see Z-transform, RobustZT, or To\[0,1\] as described in the user's manual. I find that RobustZT makes the map look cleaner.

3. Train the data:
——————-
### Using the GUI
* Tools > Training:
* Parameters: use default parameters with the following exceptions. Note this is what seems to work best for AMD datasets, but the complete parameter space has not been fully optimized. David Soergel (Brenner Lab) is working on this:
* Training algorithm: K-batch
* Number of rows/columns in map: I use ~5-6 times more neurons than there are data points. E.g., for 12,000 data points (windows, NOT contigs) I use 200 rows x 328 columns (~65,600 neurons); see the sketch after this list for scaling this rule of thumb.
* Start value for radius = 50 (increase/decrease for smaller/larger maps).
* I’ve never seen a benefit to training for more than 20 epochs for the AMD data.
* Hit ‘START’ — training will take 10 minutes to many hours depending on the size of the data set and parameters used.
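The following awk sketch (my own illustration, not part of the ESOM distribution) scales the map-size rule of thumb above to other dataset sizes, keeping ~5.5 neurons per data point and the ~1.64 columns:rows aspect ratio of the 200 x 328 example:

N=12000   # number of data points (windows)
awk -v n="$N" 'BEGIN {
  neurons = 5.5 * n;                  # ~5-6 neurons per data point
  rows    = int(sqrt(neurons / 1.64)); # 328/200 ~ 1.64 aspect ratio
  cols    = int(neurons / rows);
  print rows " rows x " cols " columns (" rows*cols " neurons)"
}'
# --> 200 rows x 330 columns (66000 neurons)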

### From the terminal
* At this point, you may also choose to add additional data (like coverage) to your contigs. You may do so using the `addInfo2lrn.pl` script **OR** by simply using the flag `-info` in `esomTrain.pl`.
* The `esomTrain.pl` script may be used to train the data without launching the GUI. This script will add the additional information to the .lrn file (using `-info`), normalize it, and train the ESOM. Type `perl esomTrain.pl -h` in your terminal to see the help document for this script.
* To view the results of the training, simply launch ESOM by following the instructions in *Step 5: Loading a previous project* to load the relevant files.
* Resume analysis from *Step 4: Analyzing the output*

4. Analyzing the output:
————————
* Best viewed (see VIEW tab) with UMatrix background, tiled display. Use Zoom, Color, Bestmatch size to get desired view. Also viewing without data points drawn (uncheck “Draw bestmatches”) helps to see the underlying data structure.
* Use CLASSES tab to rename and recolor classes.
* To select a region of the map, go to DATA tab then draw a shape with mouse (holding left click), close it with right click. Data points will be selected and displayed in DATA tab.
* To assign data points to bins, use the CLASS tab and using your pointer draw a boundary around the region of interest (e.g. using the data structure as a guide — see also “contours” box in VIEW tab which might help to delineate bins). This will assign each data point to a class (bin). The new .cls file can be saved (`File > Save .cls`) for further analysis.

5. Loading a previous project:
——————————
* On your terminal, run the following command from anywhere (X11 must be enabled): `./esomana`
* `File > load .wts`

Questions?
———-
* [Gregory J. Dick](http://www.earth.lsa.umich.edu/geomicrobiology/Index.html “Geomicro Homepage”),
gdick \[AT\] umich \[DOT\] edu,
Assistant Professor, Michigan Geomicrobiology Lab,
University of Michigan
* [Sunit Jain](http://www.sunitjain.com “Sunit’s Homepage”),
sunitj \[AT\] umich \[DOT\] edu,
Bioinformatics Specialist, Michigan Geomicrobiology Lab,
University of Michigan.

BigWig tools

bedGraphToBigWig
bigWigAverageOverBed
bigWigCorrelate
bigWigInfo
bigWigMerge
bigWigSummary
bigWigToBedGraph
bigWigToWig
qacToWig
wigCorrelate
wigEncode
wigToBigWig

BamM is a C library, wrapped in Python, that parses BAM files.

BamM is a C library, wrapped in Python, that parses BAM files. The code is intended to provide a faster, more stable interface for parsing BAM files than PySam, but doesn't implement all (or any) of PySam's features.

Do you want all the links that join two contigs in a BAM?
Do you need to get coverage?
Would you like to just work out the insert size and orientation of some mapped reads?

Then BamM is for you!
$ bamm make -d -c read1.R1.fq.gz read1.R2.fq.gz …
$ bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -b mapping.bam
$ bamm extract -g BIN_1.fna -b mapping.bam

BMGE (Block Mapping and Gathering with Entropy) is a program that selects regions in a multiple sequence alignment that are suited for phylogenetic inference

ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/
http://mobyle.pasteur.fr/cgi-bin/portal.py

Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology 10:210.

BMGE (Block Mapping and Gathering with Entropy) is a program that selects regions in a multiple sequence alignment that are suited for phylogenetic inference. BMGE selects characters that are biologically relevant, thanks to the use of standard similarity matrices such as PAM or BLOSUM. Moreover, BMGE provides other character- or sequence-removal operations, such as stationary-based character trimming (which provides a subset of compositionally homogeneous characters) or removal of sequences containing too large a proportion of gaps. Finally, BMGE can simply be used to perform standard conversion operations among DNA-, codon-, RY- and amino acid-coding sequences.
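A typical invocation might look like the line below; the flag names (-i input alignment, -t coding type, -m similarity matrix, -of fasta output) are quoted from memory of the BMGE manual, so verify them with java -jar BMGE.jar -? before use:

java -jar BMGE.jar -i alignment.fasta -t AA -m BLOSUM62 -of alignment.trimmed.fasta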

BEAM: uses Markov Chain Monte Carlo (MCMC) to search for both single-marker and interaction effects from case-control SNP data

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Yu Zhang, Jun S Liu

This is a brief guide for using BEAM (Bayesian Epistasis Association Mapping).
The program uses Markov Chain Monte Carlo (MCMC) to search for both single-marker
and interaction effects from case-control SNP data.

Reference:

Zhang Y and Liu JS (2007). Bayesian Inference of Epistatic Interactions in Case-Control Studies.
Nature Genetics, in press.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This package contains the following files:
BEAM
libgsl.dll       <-- for dos version only
libgslcblas.dll  <-- for dos version only
parameters.txt
README.txt

[1. Installation]-----------------------------------------

Unzip all files into one folder. BEAM uses the GNU Scientific Library (GSL), for which two necessary files "libgsl.dll" and "libgslcblas.dll" are included in the DOS package. Make sure you put the two dll files in the same folder with BEAM, or put them in your system folder (e.g., "c:\windows\system32"), otherwise the program will not run. For the Linux version, the GSL is included in the executables, so you can simply put all files into one folder and just run the program by ./BEAM.

[2. Input Format]-----------------------------------------

The user needs to create a file (by default, "data.txt") that contains the case-control genotype data to be used as an input to the program. Although our program is designed for biallelic SNPs, the user may use markers that take more than two alleles. However, BEAM assumes that all markers in the input file are of the same type (e.g., they are all k-allelic markers). The allele at a k-allelic marker should be indexed from 0 to (k-1). For example, without assuming Hardy-Weinberg Equilibrium (HWE), a genotype at a SNP locus takes 3 different values, which should be indexed as 0, 1, 2. It doesn't matter which allele is coded by which integer, i.e., the user may code the wild-type homozygote by 2, the mutant-type homozygote by 1, and the heterozygote by 0. In addition, use negative numbers for missing alleles.

The first line of the input file should contain the disease status of each individual. You should use 1 to denote patients and 0 to denote controls. Alternatively, you may use a one-dimensional continuous trait value for each individual. BEAM will automatically assign those individuals with traits larger than the mean to cases, and the remaining individuals to controls.

Starting from the second line of the input file, each line contains the genotype data at one marker for all individuals, separated by ' ' (space). For diploid species, if you want to assume HWE in association mapping, then you should input two alleles at the locus for each individual (the order of alleles doesn't matter). For example, for 2 patients and 2 controls genotyped at 5 markers, if assuming HWE, the input file may look like:

1 1 1 1 0 0 0 0    <-- disease status, two identical status per individual, 1: case, 0: control
1 0 0 0 1 1 0 0    <-- marker 1, alleles are denoted by 0 and 1
1 1 0 0 1 0 1 1    <-- marker 2
0 0 0 1 0 0 0 1    <-- marker 3
0 1 1 1 0 -1 -1    <-- marker 4, "-1" denotes missing allele
1 1 0 1 1 0 1 0    <-- marker 5

Since each column in the input file denotes one individual (or a haploid), the disease status for each individual (haploid) must match the corresponding column in the file. In the above example, when specifying two alleles per individual, the user should input two identical disease statuses per individual in the first line.

The user may also provide the SNP ID and their chromosomal locations (in bps) at the beginning of each line. The above example file then looks like:

ID       Chr  Pos      1 1 1 1 0 0 0 0
rs100102 chr1 4924223  1 0 0 0 1 1 0 0
rs291093 chr2 35981121 1 1 0 0 1 0 1 1
rs490232 chr9 6920101  0 0 0 1 0 0 0 1
rs093202 chrX 319101   0 1 1 1 0 -1 -1
rs43229  chrY 103919   1 1 0 1 1 0 1 0

IMPORTANT: if SNP ID and locations are provided in the data, make sure you include in the first line "ID Chr Pos" to label each column. This is to ensure that the disease status of each individual matches the individual in the corresponding column. In addition, please use chr1, ..., chr22, chrX, chrY to denote chromosomes. "Chr" is ok too. Since BEAM is designed for detecting interactions between markers that are far apart, providing the marker location information helps BEAM avoid being trapped by local marker dependencies (due to LD rather than interaction effects).

In addition, if the SNP ID and their locations are provided in the input file, you should turn on the corresponding parameters in the file "parameter.txt". This can be done by setting both INC_SNP_ID and INC_SNP_POS to 1 (default). Please see the "data.txt" provided online for an example of the input file.

[3. Program Parameters]-----------------------------------------

The running parameters of BEAM are specified in the "parameter.txt" file. These parameters include: input filename, output filename, priors, burnin-length, mcmc-length, and thin, etc. The parameters "burnin", "mcmc", "thin" should be chosen according to the number of markers genotyped in the data. Denoting the total number of markers by L, we suggest the following choice of parameters:

burnin = 10~100 L
mcmc   = L^2
thin   = L

In addition, the user may let the program first search for some local modes of the joint posterior distribution, and then run MCMC from those modes. This strategy can be more advantageous than starting from random points, especially for detecting high-order interactions. To do this, set INITIALTRYS to a positive number (e.g., 10~100), so that BEAM will first search for local modes in 10~100 trials. Set TRY_LENGTH to a number such as 10L for the number of iterations in each trial (L is the number of markers).

If during the MCMC run the algorithm finds a better local mode (with a larger likelihood) than the mode BEAM started from, you can let BEAM restart the chain from this better mode. To do this, set AUTORESTART to a number between 5~10, such that if the new local mode measured in log-likelihood is 5~10 larger than the initial mode, the chain will restart. To disable this function, set AUTORESTART to a large value, e.g., 1000000. Never set it to a negative number.

The user can let BEAM automatically determine a set of parameters for a dataset, such as burnin-length, mcmc-length, thin, etc. By default, we set burnin = 100L if the MCMC chain starts from random configurations (i.e., when INITIALTRYS is 0). If the user lets BEAM search for some local modes first (by setting INITIALTRYS to a positive integer), we then set burnin = 10L. The length of each initial trial (TRY_LENGTH) is 20L by default. We further set mcmc = L * L, and thin = L. To use the automatic setting, the user must set the corresponding parameters in "parameter.txt" to 0 or a negative number. For example, let

BURNIN 0
MCMC 0
THIN 0
TRY_LENGTH 0

NOTE: Any positive integer will replace the default setting of BEAM.

BEAM will output markers/interactions with a Bonferroni-adjusted p-value smaller than a user-specified value, which can be specified by "P_THRESHOLD" in "parameter.txt". You can use BEAM to search for marginal associations only by setting SINGLE_ONLY = 1. You may specify the input file and the output file in "parameter.txt". We encourage you to modify "parameter.txt" to run BEAM under different parameter settings. Please see "parameter.txt" for more details. (An illustrative consolidated parameter file is sketched at the end of this README.)

[4. Command Line]-----------------------------------------

To run BEAM, type in the command line:

BEAM [input output]

BEAM will refer to "parameter.txt" for running parameters, so you can run the program by simply typing "BEAM". The only option you can use in the command line is to specify the input file name and the output file name. Note that if you want to specify either of them, you must specify both of them.

[5. Output]-----------------------------------------

The main output file contains the estimated posterior probability of association for each marker, including both marginal and interactive association probabilities. The file also contains posterior estimates of the number of marginal associations and the size of interactions. In addition, we evaluate the p-value of each detected association using the B-statistic introduced in the paper. Only significant p-values (<0.1 after the Bonferroni correction) and associated markers are reported. More than one marker reported within a single parenthesis indicates interaction effects.

To check the performance of the Markov chains, we output two more files: "lnp.txt" and "posterior.txt". The former contains the trace of log-posterior probabilities (up to a normalizing constant) of the sampled parameters. The latter contains the summary of the Markov chains. "posterior.txt" also contains the B-statistics and the estimated p-values for detected candidate markers and interactions.

[6. Credit]-----------------------------------------

This program is developed based on algorithms proposed in Yu Zhang, Jun Liu (2007) "Bayesian Inference of Epistatic Interactions in Case-Control Studies", Nature Genetics, in press. Please cite the paper if you use this program in your research for publication. The research was supported by an NIH R01 grant.

[7. Support]-----------------------------------------

All questions and comments should be directed to Yu Zhang at the Department of Statistics, The Pennsylvania State University, University Park, PA 16802. Email: yuzhang@stat.psu.edu
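For orientation, pulling together the parameter names mentioned in the README above, an illustrative parameter.txt using the automatic settings might look like the sketch below. The values are examples only; the real parameters.txt shipped with BEAM is the authoritative template.

INC_SNP_ID 1
INC_SNP_POS 1
INITIALTRYS 10
TRY_LENGTH 0
AUTORESTART 1000000
BURNIN 0
MCMC 0
THIN 0
P_THRESHOLD 0.1
SINGLE_ONLY 0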

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input.

BART: a transcription factor prediction tool with query gene sets or epigenomic profiles
Zhenjia Wang, Mete Civelek, Clint Miller, Nathan Sheffield, Michael J. Guertin, Chongzhi Zang
Bioinformatics 34, 2867–2869 (2018)
https://zanglab.github.io/bart/
https://github.com/zanglab/bart2
https://github.com/Boyle-Lab/Blacklist/tree/master/lists
http://lisa.cistrome.org/doc
plotCorrelation -in results.npz --corMethod spearman --whatToPlot heatmap --plotFile correlation_heatmap.png
plotPCA -in results.npz --plotFile pca.png
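Both plotting commands consume a results.npz coverage matrix, which deepTools builds beforehand with multiBamSummary or multiBigwigSummary; a minimal sketch (the .bw file names are placeholders):

multiBigwigSummary bins -b sample1.bw sample2.bw -o results.npz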
#under python3.8 it is not necessary!
#cd /usr/lib/x86_64-linux-gnu
#sudo ln -s libhdf5_serial_hl.so.10.0.2 libhdf5_hl.so
#sudo ln -s libhdf5_serial.so.10.1.0 libhdf5.so

README for BART (1.0.1)

============
Introduction
============

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input. BART leverages 3,485 human TF binding profiles and 3,055 mouse TF binding profiles from the public domain (collected in Cistrome Data Browser) to make the prediction.

BART is implemented in Python and distributed as an open-source package along with necessary data libraries.

BART is developed and maintained by the Chongzhi Zang Lab at the University of Virginia.

========
Tutorial
========

Positional arguments

{geneset,profile}

bart geneset

Given a query gene set (at least 100 genes recommended), predict functional transcription factors that regulate these genes.

*Usage: bart geneset [-h] -i <infile> -s <species> [-t <target>] [-p <processes>]
[--nonorm] [--outdir <outdir>] [-o <ofilename>]

*Example: bart geneset -i name_enhancer_prediction.txt -s hg38 -t target.txt -p 4
--outdir bart_output

*Input arguments:

-i <infile>, --infile <infile>

Input file, the name_enhancer_prediction.txt profile generated from MARGE.

-s <species>, --species <species>

Species, please choose from "hg38" or "mm10".

-t <target>, --target <target>

Target transcription factors of interest; please put each TF on one line. BART will generate extra plots showing prediction results for each TF.

-p <processes>, --processes <processes>

Number of CPUs BART can use.

--nonorm

Whether or not to standardize each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

--outdir <outdir>

If specified, all output files will be written to that directory. Default: the current working directory.

-o <ofilename>, --ofilename <ofilename>

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for -i, i.e., the enhancer_prediction.txt file generated by MARGE, might have either of the two formats below (depending on the Python version, py2 or py3):

a. Python2 version:

1 98.19
2 99.76
3 99.76
4 9.49
5 44.37
6 18.14

b. Python3 version:

chrom start end UDHSID Score
chr3 175483637 175483761 643494 3086.50
chr3 175485120 175485170 643497 2999.18
chr3 175484862 175485092 643496 2998.28
chr3 175484804 175484854 643495 2976.27
chr3 175491775 175491825 643507 2879.01
chr3 175478670 175478836 643491 2836.90

bart profile

Given a ChIP-seq data file (bed or bam format mapped reads), predict transcription factors whose binding pattern associates with the input ChIP-seq profile.

*Usage: bart profile [-h] -i <infile> -f <format> [-n <fragmentsize>] -s <species>
[-t <target>] [-p <processes>] [--nonorm]
[--outdir <outdir>] [-o <ofilename>]

*Example: bart profile -i ChIP.bed -f bed -s hg38 -t target.txt -p 4
--outdir bart_output

*Input file arguments:

-i <infile>, --infile <infile>

Input ChIP-seq bed or bam file.

-f <format>, --format <format>

Specify "bed" or "bam" format.

-n <fragmentsize>, --fragmentsize <fragmentsize>

Fragment size of ChIP-seq reads, in bps. Default: 150.

-s <species>, --species <species>

Species, please choose from "hg38" or "mm10".

-t <target>, --target <target>

Target transcription factors of interest; please put each TF on one line. BART will generate extra plots showing prediction results for each TF.

-p <processes>, --processes <processes>

Number of CPUs BART can use.

--nonorm

Whether or not to standardize each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

--outdir <outdir>

If specified, all output files will be written to that directory. Default: the current working directory.

-o <ofilename>, --ofilename <ofilename>

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for -i should be in BED (https://genome.ucsc.edu/FAQ/FAQformat#format1) or BAM (http://samtools.github.io/hts-specs/SAMv1.pdf) format, in either hg38 or mm10.

BED is a tab-delimited text file that defines the data lines, and the BED file format is described on the UCSC Genome Browser website (https://genome.ucsc.edu/FAQ/FAQformat). For BED format input, the first three columns should be chrom, chromStart, chromEnd; the 6th column (strand information) is required by BART.

BAM is a binary version of the Sequence Alignment/Map (SAM) format (http://samtools.sourceforge.net); for more information about BAM custom tracks, see https://genome.ucsc.edu/goldenPath/help/bam.html.
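If your mapped reads are in BAM but you prefer the BED input described above (including the required strand column), a standard route is bedtools bamtobed, which emits 6-column BED with the strand in column 6; the file names here are placeholders:

bedtools bamtobed -i mapping.bam > reads.bed
bart profile -i reads.bed -f bed -s hg38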

Output files

1. name_auc.txt contains the ROC-AUC scores for all TF datasets in human/mouse. We use this score to measure the similarity of each TF dataset to the cis-regulatory profile, and all TFs are ranked in decreasing order of score. The file should look like this:

AR_56254     AUC = 0.954
AR_44331     AUC = 0.950
AR_44338     AUC = 0.949
AR_50273     AUC = 0.947
AR_44314     AUC = 0.945
AR_44330     AUC = 0.943
AR_50100     AUC = 0.942
AR_44315     AUC = 0.942
AR_50044     AUC = 0.926
AR_50041     AUC = 0.925
FOXA1_50274     AUC = 0.924
AR_50042     AUC = 0.921

2. name_bart_results.txt is a ranking list of all TFs, which includes the Wilcoxon statistic score, Wilcoxon p-value, standardized Wilcoxon statistic score (z-score), maximum ROC-AUC score, and rank score (relative rank of z-score, p-value and max AUC) for each TF. The most functional TFs for the input data are ranked first. The file should look like this:

TF statistic pvalue zscore max_auc rela_rank
AR 18.654 1.172e-77 3.024 0.954 0.004
FOXA1 13.272 3.346e-40 2.847 0.924 0.008
SUMO2 5.213 1.854e-07 3.494 0.749 0.021
PIAS1 3.987 6.679e-05 2.802 0.872 0.025
HOXB13 3.800 1.446e-04 2.632 0.909 0.027
GATA3 5.800 6.633e-09 2.549 0.769 0.028
NR3C1 4.500 6.789e-06 2.042 0.871 0.040
GATA6 4.240 2.237e-05 2.602 0.632 0.048
ESR1 12.178 4.057e-34 1.956 0.700 0.049
CEBPB 5.265 1.404e-07 2.287 0.602 0.057
ATF4 3.216 1.302e-03 2.348 0.658 0.065
TOP1 2.254 2.421e-02 3.057 0.779 0.065

3. name_plot is a folder which contains all the extra plots for the TFs listed in the target file (target.txt in the test data). For each TF, we have a boxplot, which shows the rank position of this TF among all TFs (derived from the rank score in name_bart_results.txt), and a cumulative distribution plot, which compares the distribution of ROC-AUC scores from datasets of this TF with the scores of all datasets (derived from the AUC scores in name_auc.txt).