BEAM: uses Markov Chain Monte Carlo (MCMC) to search for both single-marker and interaction effects from case-control SNP data

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<>
Yu Zhang, Jun S Liu

This is a brief guide for using BEAM (Bayesian Epistatis Association Mapping).
The program uses Markov Chain Monte Carlo (MCMC) to search for both single-marker
and interaction effects from case-control SNP data.

Reference:

Zhang Y and Liu JS (2007). Bayesian Inference of Epistatic Interations in Case-Control Studies.
Nature Genetics, in press.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This package contains the following files:
BEAM
libgsl.dll <-- for dos version only libgslcblas.dll <-- for dos version only parameters.txt README.txt [1. Installation]----------------------------------------- Unzip all files into one folder. BEAM uses the GNU Scientific Library (GSL), for which two necessary files "libgsl.dll" and "libslcblas.dll" are included in the DOS package. Make sure you put the two dll files in the same folder with BEAM, or put them in your system folder (e.g., "c:\\windows\\system32"), otherwise the program will not run. For linux version, the GSL are included in the executables, so you can simply put all files into one folder and just run the program by ./BEAM. [2. Input Format]----------------------------------------- The user needs to create a file (by default, "data.txt") that contains the case-control genotype data to be used as an input to the program. Although our program is designed for biallelic SNPs, the user may use markers that take more than two alleles. However, BEAM assumes that all markers in the input file are of the same type (e.g., they are all k-allelic markers). The allele at a k-allelic marker should be indexed from 0 to (k-1). For example, without assuming Hardy-Wenberg Equilibrium (HWE), a genotype at a SNP locus takes 3 different values, which should be indexed as 0,1,2. It doesn't matter which allle is coded by which integer, i.e., the user may code the wild type homozygote by 2, the mutant type homozygote by 1, and the heterozygote by 0. In addition, use negative numbers for missing alleles The first line of the input file should contain the disease status of each individual. You should use 1 to denote patients and 0 to denote controls. Alternatively, you may use a one-dimensional continuous trait value for each individual. BEAM will automatically assign those individuals with traits larger than the mean to cases, and the remaining individuals to controls. Starting from the second line of the input file, each line contains the genotype data at one marker for all individuals, separated by ' ' (space). For diploid species, if you want to assume HWE in association mapping, then you should input two alleles at the locus for each individual (the order of alleles doesn't matter). For example, for 2 patients and 2 controls genotyped from 5 markers, if assuming HWE, the input file may look like: 1 1 1 1 0 0 0 0 <--disease status, two identical status per individual, 1: case, 0: control 1 0 0 0 1 1 0 0 <--marker 1, alleles are denoted by 0 and 1. 1 1 0 0 1 0 1 1 <--marker 2 0 0 0 1 0 0 0 1 <--marker 3 0 1 1 1 0 -1 -1 <--marker 4, "-1" denotes missing allele 1 1 0 1 1 0 1 0 <--marker 5 Since each column in the input file denote one individual (or a haploid), the disease status for each individual (haploid) must match with the correponding column in the file. In the above example, when specifying two allels per individual, the user should input two identical disease status per individual in the first line. The user may also provide the SNP ID and their chromosomal locations (in bps) at the begining of each line. The above example file then looks like: ID Chr Pos 1 1 1 1 0 0 0 0 rs100102 chr1 4924223 1 0 0 0 1 1 0 0 rs291093 chr2 35981121 1 1 0 0 1 0 1 1 rs490232 chr9 6920101 0 0 0 1 0 0 0 1 rs093202 chrX 319101 0 1 1 1 0 -1 -1 rs43229 chrY 103919 1 1 0 1 1 0 1 0 IMPORTANT: if SNP ID and locations are provided in the data, make sure you include in the first line "ID Chr Pos " to label each column. This is to ensure that the disease status of each individual matches to the individual in the corresponding column. In addition, please use chr1,...,chr22,chrX, chrY to denote chromosomes. "Chr" is ok too. Since BEAM is designed for detecting interaction between markers that are far apart, by providing the marker location information, it helps BEAM to avoid being trapped by local marker dependencies (due to LD rather than interaction effects). In addition, if the SNP ID and their locations are provided in the input file, you should turn on the correponding parameters in the file "parameter.txt". This can be done by setting both INC_SNP_ID and INC_SNP_POS to 1 (default). Please see the "data.txt" provided online for an example of the input file. [3. Program Parameters]----------------------------------------- The running parameters of BEAM are specified in the "parameter.txt" file. These para- meters include: input filename, outputfilename, priors, burnin-length, mcmc-length, and thin, etc. The parameters "burnin", "mcmc", "thin" should be chosen according to the number of markers genotyped in the data. Denote the totoal number of markers by L, we suggest the following choice of parameters: burin = 10~100 L mcmc = L^2 thin = L In addition, the user may let the program first search for some local modes of the joint posterior distribution, and then run MCMC from those modes. This strategy can be advantagous than starting from random points, espetically for detecting high-order interactions. To do this, set INITIALTRYS to a positive number (e.g. 10 ~ 100), so that BEAM will first search for local modes in 10~100 trials. Set TRY_LENGTH to a number such as 10L for the number of iterations in each trial (L is the number of markers). If during the MCMC run the algorithm finds a better local mode (with a larger likelihood) compared to the mode where BEAM starts with, you can let BEAM restart the chain from this better mode. To do this, set AUTORESTART to a number between 5~10, such that if the new local mode measured in log-likelihood is 5~10 larger than the initial mode, the chain will restart. To disable this function, set AUTORESTART to a large value, e.g., 1000000. Never set it to a negative number. The user can let BEAM to automatically determine a set of parameters for a dataset, such as burnin-length, mcmc-length, thin, etc. By default, we set burnin = 100L if the MCMC chain starts from random configurations (i.e., when INITIALTRYS is 0). If the user let BEAM to search for some local modes first (by setting INITIALTRYS to a positive integer), we then set burnin = 10L. The length of each initial trial (TRY_LENGTH) is 20L by default. We further set mcmc = L * L, and thin = L. To use the automatic setting, the user must set the corresponding parameters in "parameter.txt" to 0 or a negative number. For example, let BURNIN 0 MCMC 0 THIN 0 TRY_LENGTH 0 NOTE: Any positive integer will replace the default setting of BEAM. BEAM will output markers/interactions with Bonferroni adjusted p-value smaller than a user- specified value, which can be specified by "P_THRESHOLD" in "parameter.txt". You can use BEAM to search for marginal associations only if let SINGLE_ONLY = 1. You may specify the input file and the output file in "parameter.txt". We encourage you to modify "parameter.txt" to run BEAM under different parameter settings. Please see "parameter.txt" for more details. [4. Command Line]----------------------------------------- To run BEAM, type in the command line: "BEAM [input output]" BEAM will refer to "parameter.txt" for running parameters, so you can run the program by simply typing "BEAM". The only option you can use in the command line is to specify the input file name and the output file name. Note that if you want to specify either of them, you must specify both of them. [4. Output]----------------------------------------- The main output file contains the estimated posterior probability of association for each marker, including both marginal and interactive association probabilities. The file also contains posterior estimates of the number of marginal associations and the size of interactions. In addition, we evaluate the p-value of each detected association using the B-statistic introduced in the paper. Only significant p-values (<0.1 after the Bonferroni correction) and associated markers are reported. More than one markers reported within a single parenthsis indicate interactions effects. To check the performance of Markov chains, we output two more files: "lnp.txt" and "posterior.txt". The former contains the trace of log-posterior probabilities (up to a normalizing constant) of the sampled parameters. The latter contains the summary of Markov chains. "posterior.txt" also contains the B-statistics and the estimated p-values for detected candidate markers and interactions. [5. Credit]----------------------------------------- This program is developed based on algorithms proposed in Yu Zhang, Jun Liu (2007) "Bayesian Inference of Epistatic Interactions in Case-Control Studies", Nature Genetics, in press. Please cite the paper if you used this program in your research for publication. The research was supported by an NIH R01 grant. [6. Support]----------------------------------------- All questions and comments should be directed to Yu Zhang at the Department of Statistics, The Pennsylvania State University, University Park, PA 16802. Email: yuzhang@stat.psu.edu

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input.

BART: a transcription factor prediction tool with query gene sets or epigenomic profiles
Zhenjia Wang, Mete Civelek, Clint Miller, Nathan Sheffield, Michael J. Guertin, Chongzhi Zang
Bioinformatics 34, 2867–2869 (2018)
https://zanglab.github.io/bart/
https://github.com/zanglab/bart2
https://github.com/Boyle-Lab/Blacklist/tree/master/lists
http://lisa.cistrome.org/doc
plotCorrelation -in results.npz –corMethod spearman –whatToPlot heatmap –plotFile correlation_heatmap.png
plotPCA -in results.npz –plotFile pca.png
#under python3.8 it is not necessary!
#cd /usr/lib/x86_64-linux-gnu
#sudo ln -s libhdf5_serial_hl.so.10.0.2 libhdf5_hl.so
#sudo ln -s libhdf5_serial.so.10.1.0 libhdf5.so









README for BART(1.0.1)

============
Introduction
============

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input. BART leverages 3,485 human TF binding profiles and 3,055 mouse TF binding profiles from the public domain (collected in Cistrome Data Browser) to make the prediction.

BART is implemented in Python and distributed as an open-source package along with necessary data libraries.

BART is developed and maintained by the Chongzhi Zang Lab at the University of Virginia.

============

========
Tutorial
========

Positional arguments

{geneset,profile}

bart geneset

Given a query gene set (at least 100 genes recommended), predict functional transcription factors that regulate these genes.

*Usage: bart geneset [-h] -i -s [-t ] [-p ]
[–nonorm] [–outdir ] [-o ]

*Example: bart geneset -i name_enhancer_prediction.txt -s hg38 -t target.txt -p 4
–outdir bart_output

*Input arguments:

-i , –infile

Input file, the name_enhancer_prediction.txt profile generated from MARGE.

-s , –species

Species, please choose from “hg38” or “mm10”.

-t , –target

Target transcription factors of interests, please put each TF in one line. BART will generate extra plots showing prediction results for each TF.

-p , –processes

Number of CPUs BART can use.

–nonorm

Whether or not do the standardization for each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

–outdir

If specified, all output files will be written to that directory. Default: the current working directory

-o , –ofilename

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for , i.e., the enhancer_prediction.txt file generated by MARGE, might have two different formats below (depending on python versions py2 or py3):

a. Python2 version:

1 98.19
2 99.76
3 99.76
4 9.49
5 44.37
6 18.14

b. Python3 version:

chrom start end UDHSID Score
chr3 175483637 175483761 643494 3086.50
chr3 175485120 175485170 643497 2999.18
chr3 175484862 175485092 643496 2998.28
chr3 175484804 175484854 643495 2976.27
chr3 175491775 175491825 643507 2879.01
chr3 175478670 175478836 643491 2836.90

bart profile

Given a ChIP-seq data file (bed or bam format mapped reads), predict transcription factors whose binding pattern associates with the input ChIP-seq profile.

*Usage: bart profile [-h] -i -f [-n ] -s
[-t ] [-p ] [–nonorm]
[–outdir ] [-o ]

*Example: bart profile -i ChIP.bed -f bed -s hg38 -t target.txt -p 4
–outdir bart_output

*Input files arguments:

-i , –infile

Input ChIP-seq bed or bam file.

-f , –format

Specify “bed” or “bam” format.

-n , –fragmentsize

Fragment size of ChIP-seq reads, in bps. Default: 150.

-s , –species

Species, please choose from “hg38” or “mm10”.

-t , –target

Target transcription factors of interests, please put each TF in one line. BART will generate extra plots showing prediction results for each TF.

-p , –processes

Number of CPUs BART can use.

–nonorm

Whether or not do the standardization for each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

–outdir

If specified, all output files will be written to that directory. Default: the current working directory

-o , –ofilename

Name string of output files. Default: the base name of input file.

*Notes:

The input file for should be BED (https://genome.ucsc.edu/FAQ/FAQformat#format1) or BAM (http://samtools.github.io/hts-specs/SAMv1.pdf) format in either hg38 or mm10.

Bed is a tab-delimited text file that defines the data lines, and the BED file format is described on UCSC genome browser website (https://genome.ucsc.edu/FAQ/FAQformat). For BED format input, the first three columns should be chrom, chromStart, chromEnd, and the 6th column of strand information is required by BART.

BAM is a binary version of Sequence Alignment/Map(SAM) (http://samtools.sourceforge.net) format, and for more information about BAM custom tracks, please click here (https://genome.ucsc.edu/goldenPath/help/bam.html).

Output files

1. name_auc.txt contains the ROC-AUC scores for all TF datasets in human/mouse, we use this score to measure the similarity of TF dataset to cis-regulatory profile, and all TFs are ranked decreasingly by scores. The file should be like this:

AR_56254     AUC = 0.954
AR_44331     AUC = 0.950
AR_44338     AUC = 0.949
AR_50273     AUC = 0.947
AR_44314     AUC = 0.945
AR_44330     AUC = 0.943
AR_50100     AUC = 0.942
AR_44315     AUC = 0.942
AR_50044     AUC = 0.926
AR_50041     AUC = 0.925
FOXA1_50274     AUC = 0.924
AR_50042     AUC = 0.921

2. name_bart_results.txt is a ranking list of all TFs, which includes the Wilcoxon statistic score, Wilcoxon p value, standard Wilcoxon statistic score (zscore), maximum ROC-AUC score and rank score (relative rank of z score, p value and max auc) for each TF. The most functional TFs of input data are ranked first. The file should be like this:

TF statistic pvalue zscore max_auc rela_rank
AR 18.654 1.172e-77 3.024 0.954 0.004
FOXA1 13.272 3.346e-40 2.847 0.924 0.008
SUMO2 5.213 1.854e-07 3.494 0.749 0.021
PIAS1 3.987 6.679e-05 2.802 0.872 0.025
HOXB13 3.800 1.446e-04 2.632 0.909 0.027
GATA3 5.800 6.633e-09 2.549 0.769 0.028
NR3C1 4.500 6.789e-06 2.042 0.871 0.040
GATA6 4.240 2.237e-05 2.602 0.632 0.048
ESR1 12.178 4.057e-34 1.956 0.700 0.049
CEBPB 5.265 1.404e-07 2.287 0.602 0.057
ATF4 3.216 1.302e-03 2.348 0.658 0.065
TOP1 2.254 2.421e-02 3.057 0.779 0.065

3. name_plot is a folder which contains all the extra plots for the TFs listed in target files (target.txt file in test data). For each TF, we have boxplot, which shows the rank position of this TF in all TFs (derived from the rank score in name_bart_results.txt), and the cumulative distribution plot, which compares the distribution of ROC-AUC scores from datasets of this TF and the scores of all datasets (derived from the AUC scores in name_auc.txt).

AmpliconNoise: a collection of programs for the removal of noise from 454 sequenced PCR amplicons

AmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. Typical usage involves three steps:
1)the removal of noise from the sequencing itself,
2)the removal of PCR point errors, and
3)removal of chimeric sequences using the program Perseus.
While steps 2-­‐3 can be used on amplicon reads generated using most sequencing technologies, step 1 is currently only supported for sequences generated using 454 pyrosequencing.

When presenting work using AmpliconNoise, please refer to the following citation:
Quinceet al (2011), ‘Removing noise from pyrosequenced amplicons.’, BMC Bioinformatics12, 38

ALIENTRIMMER: a tool to quickly and accurately trim off multipleshort contaminant sequences from high-throughput sequencing reads

Criscuolo A, Brisse S (2013) ALIENTRIMMER: a tool to quickly and accurately trim off multipleshort contaminant sequences from high-throughput sequencing reads. Genomics. 102:500-506.doi:10.1016/j.ygeno.2013.07.011ftp://ftp.pasteur.fr/pub/GenSoft/projects/AlienTrimmer/

AlienTrimmer is a program that allows detecting and removing sequencing errors andcontaminant sequences (e.g. adaptors, primers) in both ends of high-throughput sequencing(HTS) read sequences. Based on the decomposition of the specified alien sequences intonucleotide k-mers of fixed length k, AlienTrimmer is able to determine whether such alienk-mers are occurring in both read ends by using a simple polynomial algorithm.AlienTrimmer can process typical HTS single- or paired-ends read files containingmillions entries in few minutes with very low computer resources.

CHANGESINVERSION 0.4.0•by default, quality-based trimming is performed along with alien trimming (Phred qualityscore cut-off = 20)•Phred quality score cut-off is specified with numerical value (Phred score encoding isautomatically assessed from input FASTQ files

AdmixTools – ROLLOFF

https://github.com/DReichLab/AdmixTools

Given genotype data from an admixed population and two ancestral populations, ROLLOFF computes the time since mixture using the rate of exponential decay of admixture LD. Specifically, it computes the correlation between a signed LD statistic for a pair of markers and a weight that reflects their allele frequency differentiation in the ancestral populations. The method assumes that samples are unrelated.

For details of the method, refer to Moorjani et al. (2011) and Patterson et al. (2012).

ROLLOFF requires that the input data is available in EIGENSTRAT format. To convert to the appropriate format, one can use CONVERTF program. See README.CONVERTF for documentation of programs for converting file formats.

ART: a next-generation sequencing read simulator

ART: a next-generation sequencing read simulator
https://doi.org/10.1093/bioinformatics/btr708

ART_ILLUMINA README (updated on 06/6/2016) Weichun Huang
ART_Illumina (2008-2016), Q Version 2.5.8 (Jun 6, 2016)

DESCRIPTION

ART (art_Illumina Q version) is a simulation program to generate sequence read data of Illumina
sequencers. ART generates reads according to the empirical read quality profile summarized from
large real read data. ART has been using for testing or benchmarking a variety of methods or tools
for next-generation sequencing data analysis, including read alignment, de novo assembly, detection
of SNP, CNV, or other structure variation.

art_Illumina can generate single-end, paired-end, mate-pair reads, and amplicon sequencing simulation
of Illumina sequencing platform. Its outputs include FASTQ read, ALN and/or SAM alignment files. ALN
files can also be converted to UCSC BED files by using aln2bed.pl program included.

art_Illumina comes with the tool art_profiler_illumina that can generate quality profiles from Illumina
sequencing data in the fastq format. The tool is in the folder ART_profiler_illumina. Please see README
in the folder for the details and usage.

EXAMPLES

In the “examples” subdirectory, the shell script “run_test_examples_illumina.sh” gives four examples of using
ART for read simulation. To test these four examples, just run the script “run_test_examples_illumina.sh”

USAGE
RECOMMENDED USAGES (specifying a sequencing system)
art_illumina [options] -ss -sam -i -l -f -o
art_illumina [options] -ss -sam -i -l -c -o
art_illumina [options] -ss -sam -i -l -f -m -s -o
art_illumina [options] -ss -sam -i -l -c -m -s -o

OTHER USAGES
art_illumina [options] -sam -i -l -f -o
art_illumina [options] -sam -i -l -c -o
art_illumina [options] -sam -i -l -f -m -s -o
art_illumina [options] -sam -i -l -c -m -s -o

===== PARAMETERS =====
-1 –qprof1 the first-read quality profile
-2 –qprof2 the second-read quality profile
-amp –amplicon amplicon sequencing simulation
-c –rcount number of reads/read pairs to be generated per sequence(not be used together with -f/–fcov)
-d –id the prefix identification tag for read ID
-ef –errfree indicate to generate the zero sequencing errors SAM file as well the regular one
NOTE: the reads in the zero-error SAM file have the same alignment positions
as those in the regular SAM file, but have no sequencing errors
-f –fcov the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
-h –help print out usage information
-i –in the filename of input DNA/RNA reference
-ir –insRate the first-read insertion rate (default: 0.00009)
-ir2 –insRate2 the second-read insertion rate (default: 0.00015)
-dr –delRate the first-read deletion rate (default: 0.00011)
-dr2 –delRate2 the second-read deletion rate (default: 0.00023)
-k –maxIndel the maximum total number of insertion and deletion per read (default: up to read length)
-l –len the length of reads to be simulated
-m –mflen the mean size of DNA/RNA fragments for paired-end simulations
-mp –matepair indicate a mate-pair read simulation
-M –cigarM indicate to use CIGAR ‘M’ instead of ‘=/X’ for alignment match/mismatch
-nf –maskN the cutoff frequency of ‘N’ in a window size of the read length for masking genomic regions
NOTE: default: ‘-nf 1’ to mask all regions with ‘N’. Use ‘-nf 0′ to turn off masking
-na –noALN do not output ALN alignment file
-o –out the prefix of output filename
-p –paired indicate a paired-end read simulation or to generate reads from both ends of amplicons
NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
-q –quiet turn off end of run summary
-qL –minQ the minimum base quality score
-qU –maxQ the maxiumum base quality score
-qs –qShift the amount to shift every first-read quality score by
-qs2 –qShift2 the amount to shift every second-read quality score by
NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93)
that reduce substitution sequencing errors and a negative number will shift down
quality scores that increase sequencing errors. If shifting scores by x, the error
rate will be 1/(10^(x/10)) of the default profile.
-rs –rndSeed the seed for random number generator (default: system time in second)
NOTE: using a fixed seed to generate two identical datasets from different runs
-s –sdev the standard deviation of DNA/RNA fragment size for paired-end simulations.
-sam –samout indicate to generate SAM alignment file
-sp –sepProf indicate to use separate quality profiles for different bases (ATGC)
-ss –seqSys The name of Illumina sequencing system of the built-in profile used for simulation

NOTE: all built-in sequencing system ID names are:
GA1 – GenomeAnalyzer I (36bp,44bp)
GA2 – GenomeAnalyzer II (50bp, 75bp)
HS10 – HiSeq 1000 (100bp)
HS20 – HiSeq 2000 (100bp)
HS25 – HiSeq 2500 (125bp, 150bp)
HSXn – HiSeqX PCR free (150bp)
HSXt – HiSeqX TruSeq (150bp)
MinS – MiniSeq TruSeq (50bp)
MSv1 – MiSeq v1 (250bp)
MSv3 – MiSeq v3 (250bp)
NS50 – NextSeq500 v2 (75bp)

===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.

* For single-end simulation, ART requires input sequence file, outputfile prefix, read length, and read count/fold coverage.

* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of
the mean and standard deviation of DNA/RNA fragment lengths

===== EXAMPLES =====

1) single-end read simulation
art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat

2) paired-end read simulation
art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat

3) mate-pair read simulation
art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

4) amplicon sequencing simulation with 5’ end single-end reads
art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat

5) amplicon sequencing simulation with paired-end reads
art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat

6) amplicon sequencing simulation with matepair reads
art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat

7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat

8) reduce the substitution error rate to one 10th of the default profile
art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error

9) turn off the masking of genomic regions with unknown nucleotides ‘N’
art_illumina -ss HS20 -nf 0 -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask

10) masking genomic regions with >=5 ‘N’s within the read length 50
art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5

READ QUALITY PROFILE

TOOL FOR CREATING A NEW QUALITY PROFILE

art_profiler_illumina in the folder ART_profiler_illumina can generate quality profiles from Illumina
fastq data. Please see README in the folder for the usage.

FORMAT
A valid quality score profile is tab-delimited and has no specific header line. Headers can be included
if each line of extraneous information begins with a number sign(#). Each line of actual quality profile
information must begin with an identifier that indicates where the data comes from for the remainder of
the line. Identifiers include:

. The identifier for the combination of all base information.

A The identifier for quality scores associated with A calls only.

T The identifier for quality scores associated with T calls only.

G The identifier for quality scores associated with G calls only.

C The identifier for quality scores associated with C calls only.

Following the identifier on a given line must be the position number indicating where the rest of the
information on that line applies within a given fragment. The data must be arrayed in pairs such that each
line in the pair has the same identifier and position number. The first line in a pair is a list of quality
scores in ascending order and the second line are the corresponding cumulative frequencies of the quality scores.

EXAMPLE:

. 0 3 6 7 8 9 10 11 12 13 14 15
. 0 39375 355755 395136 415685 1227131 1338634 1522001 1851208 2165909 2436839 2608538

. 35 4 5 6 7 8 9 10 11 12 13 14
. 35 434262 1341151 1725690 2979293 3478620 3592624 3672807 3873754 4096922 4983957 6111261
A 0 3 6 7 8 9 10 11 12 13 14 15
A 0 560 99637 111485 119727 389899 412150 458066 572958 665153 745916 793532

BUILT-IN PROFILES

Under subfolder “Illumina_Profiles” are the read profiles of Illumina GAII sequencers including all
ART’s built-in profiles as indicated below.

1) Raw quality profiles
GA1 36bp reads
Emp36R1.txt
Emp36R2.txt

GA1 44bp reads
Emp44R1.txt
Emp44R2.txt

GA2 50bp reads
Emp50R1.txt
Emp50R2.txt

GA2 75bp reads
Emp75R1.txt
Emp75R2.txt

MiSeq 250bp reads (250bp reads)
EmpMiSeq250R1.txt
EmpMiSeq250R2.txt

HiSeq1000 100bp reads
Emp100R1.txt
Emp100R2.txt

HiSeq2000 100bp reads (the new default profile for 100bp reads)
HiSeq2kL100R1.txt
HiSeq2kL100R2.txt

HiSeq2500 125bp reads
HiSeq2500L125R1.txt
HiSeq2500L125R2.txt

HiSeq2500 150bp reads
HiSeq2500L150R1.txt
HiSeq2500L150R2.txt

HiSeqX PCR free (150bp)
HiSeqXPCRfreeL150R1.txt
HiSeqXPCRfreeL150R2.txt

HiSeqX TruSeq (150bp)
HiSeqXtruSeqL150R1.txt
HiSeqXtruSeqL150R2.txt

MiniSeq TruSeq (50bp)
MiniSeqTruSeqL50.txt

MiSeq v3 (250bp)
MiSeqv3L250R1.txt
MiSeqv3L250R2.txt

NextSeq500 v2 (75bp)
NextSeq500v2L75R1.txt
NextSeq500v2L75R2.txt

2) Recalibrated quality profiles (all these are ART’s built-in profiles)

36bp reads
1st: EmpR36R1
2nd: EmpR36R2

44bp reads
1st: EmpR44R1
2nd: EmpR44R2

50bp reads
1st: EmpR50R1
2nd: EmpR50R2

75 reads
1st: EmpR75R1
2nd: EmpR75R2

OUTPUT FILES

*.fq – FASTQ read data files. For paired‐read simulation, *1.fq contains data of the first reads, and *2.fq for the second reads.
*.aln – read alignment files. For paired‐read simulation, *1.aln has read alignments for the first reads and *2.aln for the second reads.

FASTQ FILE format
A FASTQ file contains both sequence bases and quality scores of sequencing reads and is in the following format:
@read_id
sequence_read
+
base_quality_scores

A base quality score is coded by the ASCII code of a single character, where the quality score is equal to ASCII code of the
character minus 33.

Example:
@refid-4028550-1
caacgccactcagcaatgatcggtttattcacgat…
+
????????????7???????=&?<ref_seq_id read_id aln_start_pos ref_seq_strand
ref_seq_aligned
read_seq_aligned

aln_start_pos is the alignment start position of reference sequence. aln_start_pos is always relative to the strand of reference
sequence. That is, aln_start_pos 10 in the plus (+) strand is different from aln_start_pos 10 in the minus (‐) stand.

ref_seq_aligned is the aligned region of reference sequence, which can be from plus strand or minus strand of the reference sequence.
read_seq_aligned is the aligned sequence read, which always in the same orientation of the same read in the corresponding fastq file.

SAM format

SAM is a standard format for next-gen sequencing read alignments. The details of the format and examples are available at the links below:
1) http://samtools.sourceforge.net/SAM1.pdf
2) http://genome.sph.umich.edu/wiki/SAM

BED format

See the format at UCSC http://genome.ucsc.edu/FAQ/FAQformat.html#format1

NOTE: both ALN and BED format files use 0-based coordinate system while SAM format uses 1-based coordinate system.

ACKNOWLEDGEMENTS
I would like to thanks all ART users for their feedback and contributions, especially the users listed below.
Richard Nielson, DNASTAR
Bruno Nevado, CRAG in UAB
Lee Katz, US CDC

ChIPMunk

4. ChIPMunk quick-start >>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Important! ChIPMunk won't produce any output files
by default. If you want to save the output do not forget to add the 
   > output.log
to the end of the command line.

The input TEST_footprint.mfa assumed to be a simple multi-fasta file 
containing unaligned sequences.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[NEW] ChIPMunk "default" mode:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

A simple DEFAULT mode to find a motif in a given set of sequences
(the 'TEST_footprint.mfa' file):

  ChIPMunk s:TEST_footprint.mfa
or
  ChIPMunk s:TEST_footprint.mfa > output.log
to save the output to the 'output.log' file.

This will run ChIPMunk using a predefined lengths range from 7 to 22 
using default precision, zero-or-one occurence per sequence mode with the
flexible (1.0) segmentation coefficient and two concurrent computational threads.

See examples below for more information on setting specific parameters.


(a) To produce a motif (as a positional count matrix)
using a gapless multiple local alignment of a fixed width (length)
of 15bp from a given set of sequences:

  ChIPAct 15 s:TEST_footprint.mfa

(b) To search for the best flexible motif (take good motif hits).
The sequence set can contain some noise.
(so some sequences can have no motif hits).

  ChIPMunk 6 15 yes 1.0 s:TEST_footprint.mfa
  
(c) To search for the best strong motif starting  
from the 6bp to 15bp (each sequence must have at least one motif occurrence, 
i. e. one-occurrence-per-sequence mode, OOPS):

  ChIPMunk 6 15 yes oops s:TEST_footprint.mfa

(d) To search for the best strict motif (take only very strong hits 
very close to consensus). The sequence set can contain some noise 
(so some sequences can have no motif hits).
  
  ChIPMunk 6 15 yes 0.0 s:TEST_footprint.mfa


*****************************************************
*** <<<<   ChIPMunk detailed manual           >>> ***
*****************************************************

[NOTE] By default ChIPMunk prints the messages to the 
STDERR stream and the results to the STDOUT stream.
In practice this means that if you run it as the

  ChIPMunk {parameters} > chipmunk.log

by redirecting (using '>') the output into the chipmunk.log
then this file will contain all resulting data
and the console output will be used for 'work-in-progress' messages.


The full ChIPMunk command-line is:

ChIPMunk 
  <start_motif_length> <stop_motif_length> 
  <verbose>=(y)es|(n)o <mode>=oops|zoops_factor=0.0..1.0 
  <x:input_set1>..<x:input_setN> 
  <try_limit> <step_limit> <iter_limit> <thread_count> 
  <seeds>=random|filename.mfa <gc%>=0.5|auto
  <motif_shape>=flat|single|double
  <disable_log_weighting>

[NOTE] The letter case is important, 
so you need to type strictly 'ChIPMunk', not 'Chipmunk' or 'chIPmunk'.

The command-line parameters are given in a <..> brackets.
The parameters after the input_sets can be omitted.

ChIPMunk searches for the longest strong motif starting
from the [start_motif_length].

If [start_motif_length] > [stop_motif_length]
then the ChIPMunk decreases length by 1 at each step 
and takes the first strong motif.

If [stop_motif_length] > [start_motif_length]
then the ChIPMunk increases length by 1 at each step
and stops when first weak motif is found (taking
previously found strong motif).

If [stop_motif_length] = [start_motif_length]
then ChIPMunk searches for the best motif of a given length.

For the formal definition of the strong motif 
please check the ChIPMunk-on-the-web
details page at autosome.ru/ChIPMunk/

**************************
>>>> Parameters list: <<<<
**************************

[verbose]
  set y or yes for the additional program output
  
  For the ChIPMunk it would be the list of words 
    used for motifs construction
  For the ChIPHorde extension (see below) if would
    print out motif occurrences for each motif
  
[mode]
  oops (or OOPS) corresponds to the one-occurrence-per-sequence mode.
  If the number (starting from 0.0) is specified then the 
  zero-or-one-occurrence-per-sequence mode is used. 
  Larger numbers correspond to the even more flexible segmentation. 
  
  Recommended value: 1.0

[input_sets] x: can be 
  s: for simple multi-fasta
  
  r: for simple multi-fasta that should be considered in a single-strand mode
     (e.g. for discovery of RNA motifs)
  
  w: weighted data set where the number specifies the sequence weight 
  (i.e. the impact it has on the motif)
    
    > 1.0
    ACGGGAAA
    > 2.0
    GTGAAAAA
  
  p: peak data where each number in the space-separated list
  specifies the weight of the corresponding position
    > 1.0 2.0 1.0 1.0
    ACCG
    > 2.0 10.0 1.0 1.0 2.0
    GTACA
    
  The peak data is useful for any sequence-specific positional prior.
  The easiest example is the ChIP-Seq peak base coverage data,
  see http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq488
  
  Examples of tests on real data and comparison vs existing tools
  can be found here: http://autosome.ru/ChIPMunk/supplement.html
  
  ChIPMunk can integrate data from different sequence sets of 
  different types (so you can use peak and weighted and simple data
  sets together). More information on this topic is presented 
  in the corresponding paper: 
  http://www.springerlink.com/content/q86n48151u25278w/
  
[try_limit] (default 100) 
  This is a number of general optimization runs.
  For a random seeding this would be equal to the number of seeds.
  It can be as high as your computational power (100-1000-10000 
  seems to be enough depending on the size of your set 
  and the best motif strength). 
  In practice for huge sequence sets and strong motifs 
  the value of 100 is acceptable.
  Should be increased in case of the unstable ChIPMunk 
  behavior on weak motifs or noisy datasets.

[step_limit] (default 10) 
  Corresponds to the number of bootstrapping runs.
  As for the try_limit it can be (10-100-1000-..)
  Can be increased for small datasets and weak motifs.

Both [step_limit] and [try_limit] increase computational time 
in a linear fashion.

[iter_limit] (default 1) 
  is useful only if 1 or (rare) 2; higher values can reduce
  bootstrapping-achieved speedup effect and will have no real affect on motif quality

[thread_count] (default 1) 
  is extremely useful for modern multi-core processors. 
  Parallel nature of the ChIPMunk algorithm allows to
  linearly increase its speed with the increasing number 
  of available cores. Set it to 2 or 4 to use all available
  computational power of dual and quad core processors.
  If you have an Intel CPU with Hyper-Threading 
  for some datasets if may be useful to further increase 
  the number of computational threads up to 8
  (with the additional performance gain).
  
  For extreme 6/12-core processors you can use 6 or 12 threads.

[seeds] (default random)
  can be either 'random' (completely random seeds for PWM optimization) 
  or can contain multi-fasta filename with sequences 
  to extract starting seeding words
  
[gc%] (default 0.5)
  specifies prior background bias (for Kullback DIC). When "auto" or "local"
  is specified ChIPMunk will compute the background nucleotide composition
  of supplied sequence sets.

[motif_shape] (default flat)
  can be 'flat' (default, if not specified) to
  omit the motif shape prior.
  The 'single' and 'double' values correspond to the
  single-box and double-box motifs.
  The positions within motif are weighted so the
  positions with higher weight have more impact on the PWM score
  during optimization.
  
  The single box prior uses cos^2(PI*n/T) weighting,
  the double box prior uses sin^2(PI*n/T) weighting.
  Here PI = 3.1415926, T is the DNA helix period (10.5bp),
  and n is the position within the motif.

[disable_log_weighting]
  can be set to any value to disable the new data set weighting
  strategy. The original ChIPMunk assigned equal weights
  to all datasets not dependent of their sizes 
  (i.e. the number of sequences in the each dataset was not taken into account).
  
  The default v3 strategy assign the dataset weight proportional to the
  natural logarithm of the number of sequences in the set.
  Setting this parameter to any value will allow to use previous
  'equal dataset weighting' strategy.
  
  The default v4.2 strategy also uses the natural logarithm 
  of the peak height to assigns weights to the individual peaks.

One can use the ChIPAct with the a very similar syntax and parameters. 
It searches for the fixed-width motif always in the OOPS mode and 
therefore wants uses only the motif length parameter instead 
of [start_motif_length] and [stop_motif_length] for ChIPMunk.


******************************************************
*** <<<<          ChIPMunk examples           >>>> ***
******************************************************

1. ChIPMunk autosome package is placed into current 
working directory; default values for limits; 
motif lengths from 10 to 7; one standard sequence set data.mfa; 
one-occurrence-per-sequence mode; verbose on

  ChIPMunk 7 10 yes oops s:data.mfa
or simply
  ChIPMunk 7 10 yes oops s:data.mfa

2. ChIPMunk autosome package is located somewhere 
in a some_dir; user-specified values for limits; 
motif lengths from 10 to 7; two sequence sets with
different weighting models; flexible segmentation
zero-or-one-occurrence-per-sequence mode; verbose off

  ChIPMunk 7 20 no 1.0 p:chipseq.fasta 
    w:selex.mfa 1000 100 1

3. ChIPMunk autosome package is placed into the current working
directory; peak data used; user-specified values
for limits; motif lengths from 20 to 10; 
verbose on; strict ZOOPS mode; use of 4 computational threads; 
random seeds; gc% 0.6

  ChIPMunk 10 20 yes 0.0 p:data.fasta 
    10000 100 1 4 random 0.6

[NOTE] For either ChIPMunk or ChIPHorde in most cases it is wise to use
1.0 ZOOPS segmentation instead of the oops mode (shown in examples below).


=====================================================
*** <<<<        ChIPHorde extension           >>> ***
-----------------------------------------------------

----- ChIPHorde quick-start -------------------------

  ChIPHorde 12:7,12:7,12:7
    mask yes 0.0 s:BCD_footprint.mfa 100 10 1 2 random 0.5 single > test_result.out
    
This will produce upto 3 motifs of no more than 12bp length
using strict ZOOPS segmentation, default precision parameters, 
2 computational threads, uniform nucleotide composition 
and single-box motif shape prior.

----- ChIPHorde details -----------------------------

The ChIPHorde extension is intended for the
sequential search of a more than one motif.
It uses a very simple command-line format (for example):

  ChIPHorde 8:10,7:6
    mask yes 1.0 s:BCD_footprint.mfa > test_result.out

This states to sequentially run ChIPMunk twice using 8:10
and 7:6 possible length intervals searching for 
the first motif starting from the shortest possible
and for the second motif from the longest possible one.

"mask" means to mask (polyN) the motif occurrences
in a sequential search. There is another option 
"filter" which will make ChIPHorde to exclude 
sequences with the good motifs hits before
the sequential ChIPMunk run.

[NEW] If you want to check different length ranges 
using ChIPHorde but DO NOT wish to filter/mask sequences/hits
than you can use new 'dummy' mode.
The new 'dummy' mode will not perform any filtering after checking 
each of supplied motif length ranges.
So you can get many 'versions' of the same motif.

ChIPHorde 10:10,15:15,20:20
    mask yes 1.0 s:sequences.mfa > test_result.out

Both classic modes ('mask' and 'filter') 
are suitable for different conditions.
Typically the "mask" mode will try to search
for co-factor motifs while the "filter" mode 
is intended for searching for the different motifs 
of a single TF.

The motif occurrence threshold used for masking/filtering 
is the score of the worst word inculded into motif
by ChIPMunk zoops segmentation procedure.

NOTE! The ChIPHorde wouldn't work in case the 'oops' mode
was specified for ChIPMunk.

If you want more motifs the ZOOPS segmentation is better be set to 0.0
(strict mode) to ensure that at each step the worst selected word
has good PWM score (and so the masking threshold wouldn't be too low).
If you want to get only 2 top motifs it's enough to use 1.0 zoops coefficient.

The ChIPHorde command-line is very similar
to that of ChIPMunk like you can specify the 
iteration counts, specific GC-content to account for
(extremely important for heavily GC-shifted genomes), and so on.

- OCCS line - (in case you specified the verbose parameter)
The OCCS lines in the ChIPHorde output
refers to the motif occurrences in the input dataset.

The line itself looks like
  OCCS|0;44; CGCCTAATCT:6:DIRECT
where 0 is the number of dataset in the input data
(always 0 if you supply only one fasta-file),
44 is the number of sequence in the set,
CGCCTAATCT is the word at that position, 
6 is the index of the word in the sequence,
DIRECT is the strand of the occurrence.

[NOTE] All indices are ZERO-based.


=====================================================
*** <<<<      Notes on 'verbose' mode         >>> ***
-----------------------------------------------------

Note, that in verbose mode ChIPMunk will print out 
words used to construct the motif.

The ChIPHorde will print out BOTH words used to construct
the motifs AND the motif occurences in sequences
(note, that typically only one best word from each sequence
is used during motif construction while there can be
many motif hits reported by ChIPHorde OCCS).


------------------- Version history -----------------------

Version  4.2 - To reduce noise from extremely high peaks (caused by wrongly mapped reads or PCR artifacts)
               a logWeghting scheme is now applied to the peak heights
Version  4.1 - "r:" RNA motif discovery (simple mode for single strand search)
               fixed weird sequence indexes for "peak" mode if the input
               peaks were not sorted according to the peak height
Version    4 - ChIPMunk now lives in the 'autosome.ru' package
Version  3.3 - polyN stacking for ZOOPS mode fixed;
               ChIPHorde output beautified;
               ChIPMunk now reports positions of aligned words
Version  3.2 - small bugfixes, tweaked output, default mode added
Version  3.1 - fixed rare multithreading bug
Version  3.0 - huge source refactoring, motif shape support
Version  2.0 - ChIPHorde extension for a sequential motif discovery
Version 1.17 - support for positional weight profiles over sequences; 
               ChIP-Seq peak data analysis
Version  1.o - bugfixes and tweaks for experimental
               support of positional information 
               weight profiles over sequences
Version  1.n - length estimation procedure simplified and formalized
Version  1.m - multi-core CPU support
Version  1.0 - first public release