Binning
=======

Scripts required to calculate tetramer frequencies and create input files for ESOM.
See: Dick, G.J., A. Andersson, B.J. Baker, S.L. Simmons, B.C. Thomas, A.P. Yelton, and J.F. Banfield (2009). Community-wide analysis of microbial genome sequence signatures. Genome Biology, 10: R85
Open Access: http://genomebiology.com/2009/10/8/R85

How to ESOM?
============

These instructions are for ESOM-based binning; see http://databionic-esom.sourceforge.net/ for the software download and manual.

1. Generate input files.
————————-
* Although not necessary, we recommend adding some reference genomes based on your 16S/OTU analysis as 'controls'. The idea is that, if the ESOM worked, each reference genome should form a bin by itself. You may do this by downloading genomes in fasta format from any public database, preferably complete single-sequence genomes.
* Use the `esomWrapper.pl` script to create the relevant input files for ESOM. In order to run this script, you'll need all your sequence files (in fasta format) with the same extension in the same folder. For example:
`perl esomWrapper.pl -path fasta_folder -ext fa`
For more help and examples, type:
`perl esomWrapper.pl -h`

* The script will use the fasta file to produce three tab-delimited files that ESOM requires:
* Learn file = a table of tetranucleotide frequencies (.lrn)
* Names file = a list of the names of each contig (.names)
* Class file = a list of the class of each contig, which can be used to color data points, etc. ( .cls)

**NOTE:** `class number`: The ESOM mapping requires that you define your sequences as classes. We generally define all the sequences that belong to your query (metagenome, for example) as 0 and all the others as 1, 2 and so on. Think of these as your predefined bins; each sequence with the same class number will be assigned the same color in the map.

* These files are generated using Anders Andersson's perl script `tetramer_freqs_esom.pl`, which needs to be in the same folder as `esomWrapper.pl`. To see how to use `tetramer_freqs_esom.pl` independently of the wrapper, type:
`perl tetramer_freqs_esom.pl -h`
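The learn file is, at heart, a table of 4-mer frequencies per sequence. As an illustration of the kind of calculation the tetramer-frequency script performs (a simplified Python sketch, not the actual perl script):

```python
from itertools import product

def tetramer_freqs(seq):
    """Relative frequency of each of the 256 possible 4-mers (illustrative)."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    total = 0
    # slide a 4 bp window along the sequence
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
            total += 1
    return {t: c / total for t, c in counts.items()} if total else counts

freqs = tetramer_freqs("ACGTACGTACGT")
```

Each row of the .lrn file is one such 256-element frequency vector; ESOM then clusters sequences whose vectors are similar.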

2. Run ESOM:
————-
* On your terminal, run the following command from anywhere (X11 must be enabled):
`./esomana`
* Load .lrn, .names, and .cls files (File > load .lrn etc.)
* Normalize the data (optional, but recommended): under the Data tab, see Z-transform, RobustZT, or To\[0,1\] as described in the user's manual. I find that RobustZT makes the map look cleaner.

3. Train the data:
——————-
### Using the GUI
* Tools > Training:
* Parameters: use the default parameters with the following exceptions. Note this is what seems to work best for AMD datasets, but the complete parameter space has not been fully optimized. David Soergel (Brenner Lab) is working on this:
* Training algorithm: K-batch
* Number of rows/columns in map: I use ~5-6 times more neurons than there are data points. E.g., for 12000 data points (windows, NOT contigs) I use 200 rows x 328 columns (~65600 neurons).
* Start value for radius = 50 (increase/decrease for smaller/larger maps).
* I’ve never seen a benefit to training for more than 20 epochs for the AMD data.
* Hit 'START'; training will take 10 minutes to many hours depending on the size of the data set and the parameters used.

### From the terminal
* At this point, you may also choose to add additional data (like coverage) to your contigs. You may do so using the `addInfo2lrn.pl` script **OR** simply by using the flag `-info` in `esomTrain.pl`.
* The `esomTrain.pl` script may be used to train the data without launching the GUI. This script will add the additional information to the lrn file (using `-info`), normalize it, and train the ESOM. Type `perl esomTrain.pl -h` in your terminal to see the help document for this script.
* To view the results of the training, simply launch ESOM by following the instructions in *Step 5: Loading a previous project* to load the relevant files.
* Resume analysis from *Step 4: Analyzing the output*

4. Analyzing the output:
————————
* Best viewed (see VIEW tab) with UMatrix background, tiled display. Use Zoom, Color, Bestmatch size to get desired view. Also viewing without data points drawn (uncheck “Draw bestmatches”) helps to see the underlying data structure.
* Use CLASSES tab to rename and recolor classes.
* To select a region of the map, go to DATA tab then draw a shape with mouse (holding left click), close it with right click. Data points will be selected and displayed in DATA tab.
* To assign data points to bins, use the CLASS tab and using your pointer draw a boundary around the region of interest (e.g. using the data structure as a guide — see also “contours” box in VIEW tab which might help to delineate bins). This will assign each data point to a class (bin). The new .cls file can be saved (`File > Save .cls`) for further analysis.

5. Loading a previous project:
——————————
* On your terminal, run the following command from anywhere (X11 must be enabled): `./esomana`
* `File > load .wts`

Questions?
———-
* [Gregory J. Dick](http://www.earth.lsa.umich.edu/geomicrobiology/Index.html “Geomicro Homepage”),
gdick \[AT\] umich \[DOT\] edu,
Assistant Professor, Michigan Geomicrobiology Lab,
University of Michigan
* [Sunit Jain](http://www.sunitjain.com “Sunit’s Homepage”),
sunitj \[AT\] umich \[DOT\] edu,
Bioinformatics Specialist, Michigan Geomicrobiology Lab,
University of Michigan.

BamM is a C library, wrapped in Python, that parses BAM files. The code is intended to provide a faster, more stable interface for parsing BAM files than PySam, but doesn't implement all / any of PySam's features.

Do you want all the links that join two contigs in a BAM?
Do you need to get coverage?
Would you like to just work out the insert size and orientation of some mapped reads?

Then BamM is for you!
$ bamm make -d -c read1.R1.fq.gz read1.R2.fq.gz …
$ bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -b mapping.bam
$ bamm extract -g BIN_1.fna -b mapping.bam

BMGE (Block Mapping and Gathering with Entropy) is a program that selects regions in a multiple sequence alignment that are suited for phylogenetic inference

ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/
http://mobyle.pasteur.fr/cgi-bin/portal.py

Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology 10:210.

BMGE (Block Mapping and Gathering with Entropy) is a program that selects regions in a multiple sequence alignment that are suited for phylogenetic inference. BMGE selects characters that are biologically relevant, thanks to the use of standard similarity matrices such as PAM or BLOSUM. Moreover, BMGE provides other character- or sequence-removal operations, such as stationarity-based character trimming (which provides a subset of compositionally homogeneous characters) or removal of sequences containing too large a proportion of gaps. Finally, BMGE can simply be used to perform standard conversion operations among DNA-, codon-, RY- and amino acid-coding sequences.
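BMGE's character selection is driven by column-wise information scores. A simplified sketch of entropy-based column trimming (plain Shannon entropy over alignment columns; BMGE itself uses similarity-matrix-weighted scores such as PAM or BLOSUM, so this is only an illustration of the idea):

```python
import math
from collections import Counter

def column_entropies(alignment):
    """Shannon entropy of each alignment column, ignoring gaps (simplified;
    BMGE's actual scores are weighted by a similarity matrix)."""
    entropies = []
    for j in range(len(alignment[0])):
        column = [row[j] for row in alignment if row[j] != "-"]
        n = len(column)
        h = -sum((c / n) * math.log2(c / n) for c in Counter(column).values()) if n else 0.0
        entropies.append(h)
    return entropies

def keep_columns(alignment, cutoff=0.5):
    """Indices of columns conserved enough (entropy <= cutoff) to keep."""
    return [j for j, h in enumerate(column_entropies(alignment)) if h <= cutoff]

# three aligned sequences; the first two columns are perfectly conserved
aln = ["ACGT", "ACGA", "ACTT"]
```

Columns with low entropy (strong conservation) are retained; noisy, saturated columns are dropped before tree inference.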

BEAM: uses Markov Chain Monte Carlo (MCMC) to search for both single-marker and interaction effects from case-control SNP data

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Yu Zhang, Jun S Liu

This is a brief guide for using BEAM (Bayesian Epistasis Association Mapping).
The program uses Markov Chain Monte Carlo (MCMC) to search for both single-marker
and interaction effects from case-control SNP data.

Reference:

Zhang Y and Liu JS (2007). Bayesian Inference of Epistatic Interactions in Case-Control Studies.
Nature Genetics, in press.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This package contains the following files:
BEAM
libgsl.dll <-- for dos version only
libgslcblas.dll <-- for dos version only
parameters.txt
README.txt

[1. Installation]-----------------------------------------

Unzip all files into one folder. BEAM uses the GNU Scientific Library (GSL), for which two necessary files "libgsl.dll" and "libgslcblas.dll" are included in the DOS package. Make sure you put the two dll files in the same folder as BEAM, or put them in your system folder (e.g., "c:\windows\system32"); otherwise the program will not run. For the Linux version, the GSL is included in the executables, so you can simply put all files into one folder and run the program by ./BEAM.

[2. Input Format]-----------------------------------------

The user needs to create a file (by default, "data.txt") that contains the case-control genotype data to be used as input to the program. Although our program is designed for biallelic SNPs, the user may use markers that take more than two alleles. However, BEAM assumes that all markers in the input file are of the same type (e.g., they are all k-allelic markers). The alleles at a k-allelic marker should be indexed from 0 to (k-1). For example, without assuming Hardy-Weinberg Equilibrium (HWE), a genotype at a SNP locus takes 3 different values, which should be indexed as 0, 1, 2. It doesn't matter which allele is coded by which integer, i.e., the user may code the wild-type homozygote by 2, the mutant-type homozygote by 1, and the heterozygote by 0. In addition, use negative numbers for missing alleles.

The first line of the input file should contain the disease status of each individual. Use 1 to denote patients and 0 to denote controls. Alternatively, you may use a one-dimensional continuous trait value for each individual. BEAM will automatically assign individuals with traits larger than the mean to cases, and the remaining individuals to controls.

Starting from the second line of the input file, each line contains the genotype data at one marker for all individuals, separated by ' ' (space). For diploid species, if you want to assume HWE in association mapping, you should input two alleles at the locus for each individual (the order of alleles doesn't matter). For example, for 2 patients and 2 controls genotyped at 5 markers, if assuming HWE, the input file may look like:

1 1 1 1 0 0 0 0 <-- disease status, two identical status values per individual; 1: case, 0: control
1 0 0 0 1 1 0 0 <-- marker 1, alleles are denoted by 0 and 1
1 1 0 0 1 0 1 1 <-- marker 2
0 0 0 1 0 0 0 1 <-- marker 3
0 1 1 1 0 -1 -1 <-- marker 4, "-1" denotes a missing allele
1 1 0 1 1 0 1 0 <-- marker 5

Since each column in the input file denotes one individual (or a haploid), the disease status for each individual (haploid) must match the corresponding column in the file. In the above example, when specifying two alleles per individual, the user should input two identical disease status values per individual in the first line.

The user may also provide the SNP IDs and their chromosomal locations (in bps) at the beginning of each line. The above example file then looks like:

ID Chr Pos 1 1 1 1 0 0 0 0
rs100102 chr1 4924223 1 0 0 0 1 1 0 0
rs291093 chr2 35981121 1 1 0 0 1 0 1 1
rs490232 chr9 6920101 0 0 0 1 0 0 0 1
rs093202 chrX 319101 0 1 1 1 0 -1 -1
rs43229 chrY 103919 1 1 0 1 1 0 1 0

IMPORTANT: if SNP IDs and locations are provided in the data, make sure you include "ID Chr Pos" in the first line to label each column. This is to ensure that the disease status of each individual matches the individual in the corresponding column. In addition, please use chr1, ..., chr22, chrX, chrY to denote chromosomes. "Chr" is ok too.

Since BEAM is designed for detecting interactions between markers that are far apart, providing the marker location information helps BEAM avoid being trapped by local marker dependencies (due to LD rather than interaction effects). In addition, if the SNP IDs and their locations are provided in the input file, you should turn on the corresponding parameters in the file "parameter.txt". This can be done by setting both INC_SNP_ID and INC_SNP_POS to 1 (default). Please see the "data.txt" provided online for an example of the input file.

[3. Program Parameters]-----------------------------------------

The running parameters of BEAM are specified in the "parameter.txt" file. These parameters include: input filename, output filename, priors, burnin-length, mcmc-length, and thin, etc. The parameters "burnin", "mcmc", "thin" should be chosen according to the number of markers genotyped in the data. Denoting the total number of markers by L, we suggest the following choice of parameters:

burnin = 10~100 L
mcmc = L^2
thin = L

In addition, the user may let the program first search for some local modes of the joint posterior distribution, and then run MCMC from those modes. This strategy can be more advantageous than starting from random points, especially for detecting high-order interactions. To do this, set INITIALTRYS to a positive number (e.g., 10 ~ 100), so that BEAM will first search for local modes in 10~100 trials. Set TRY_LENGTH to a number such as 10L for the number of iterations in each trial (L is the number of markers). If during the MCMC run the algorithm finds a better local mode (with a larger likelihood) compared to the mode where BEAM started, you can let BEAM restart the chain from this better mode. To do this, set AUTORESTART to a number between 5~10, such that if the new local mode measured in log-likelihood is 5~10 larger than the initial mode, the chain will restart. To disable this function, set AUTORESTART to a large value, e.g., 1000000. Never set it to a negative number.

The user can let BEAM automatically determine a set of parameters for a dataset, such as burnin-length, mcmc-length, thin, etc. By default, we set burnin = 100L if the MCMC chain starts from random configurations (i.e., when INITIALTRYS is 0). If the user lets BEAM search for some local modes first (by setting INITIALTRYS to a positive integer), we then set burnin = 10L. The length of each initial trial (TRY_LENGTH) is 20L by default. We further set mcmc = L * L, and thin = L. To use the automatic setting, the user must set the corresponding parameters in "parameter.txt" to 0 or a negative number. For example, let

BURNIN 0
MCMC 0
THIN 0
TRY_LENGTH 0

NOTE: Any positive integer will replace the default setting of BEAM.

BEAM will output markers/interactions with Bonferroni-adjusted p-values smaller than a user-specified value, which can be specified by "P_THRESHOLD" in "parameter.txt". You can use BEAM to search for marginal associations only by letting SINGLE_ONLY = 1. You may specify the input file and the output file in "parameter.txt". We encourage you to modify "parameter.txt" to run BEAM under different parameter settings. Please see "parameter.txt" for more details.

[4. Command Line]-----------------------------------------

To run BEAM, type in the command line: "BEAM [input output]". BEAM will refer to "parameter.txt" for running parameters, so you can run the program by simply typing "BEAM". The only option you can use on the command line is to specify the input file name and the output file name. Note that if you want to specify either of them, you must specify both of them.

[5. Output]-----------------------------------------

The main output file contains the estimated posterior probability of association for each marker, including both marginal and interactive association probabilities. The file also contains posterior estimates of the number of marginal associations and the size of interactions.

In addition, we evaluate the p-value of each detected association using the B-statistic introduced in the paper. Only significant p-values (<0.1 after the Bonferroni correction) and associated markers are reported. More than one marker reported within a single parenthesis indicates interaction effects.

To check the performance of the Markov chains, we output two more files: "lnp.txt" and "posterior.txt". The former contains the trace of log-posterior probabilities (up to a normalizing constant) of the sampled parameters. The latter contains the summary of the Markov chains. "posterior.txt" also contains the B-statistics and the estimated p-values for detected candidate markers and interactions.

[6. Credit]-----------------------------------------

This program was developed based on algorithms proposed in Yu Zhang, Jun Liu (2007) "Bayesian Inference of Epistatic Interactions in Case-Control Studies", Nature Genetics, in press. Please cite the paper if you use this program in your research for publication. The research was supported by an NIH R01 grant.

[7. Support]-----------------------------------------

All questions and comments should be directed to Yu Zhang at the Department of Statistics, The Pennsylvania State University, University Park, PA 16802. Email: yuzhang@stat.psu.edu
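The input layout described in [2. Input Format] is easy to generate programmatically. A minimal sketch (the helper name `write_beam_input` is hypothetical, not part of the BEAM package; it writes the status line followed by one line per marker, with optional "ID Chr Pos" prefixes):

```python
def write_beam_input(path, status, markers, snp_info=None):
    """Write a BEAM-style genotype file: first line = disease status (one value
    per column), then one line per marker. Optional snp_info is a list of
    (ID, chrom, pos) tuples prepended to each marker line, with an
    'ID Chr Pos ' label on the status line, as the BEAM README requires."""
    with open(path, "w") as fh:
        header = "ID Chr Pos " if snp_info else ""
        fh.write(header + " ".join(map(str, status)) + "\n")
        for i, alleles in enumerate(markers):
            prefix = " ".join(map(str, snp_info[i])) + " " if snp_info else ""
            fh.write(prefix + " ".join(map(str, alleles)) + "\n")

# 2 cases and 2 controls, two alleles each (HWE assumed), at 2 markers
status = [1, 1, 1, 1, 0, 0, 0, 0]
markers = [[1, 0, 0, 0, 1, 1, 0, 0],
           [1, 1, 0, 0, 1, 0, 1, 1]]
write_beam_input("data.txt", status, markers)
```

Remember that with two alleles per diploid individual, the status value must be repeated twice per individual so every column lines up.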

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input.

BART: a transcription factor prediction tool with query gene sets or epigenomic profiles
Zhenjia Wang, Mete Civelek, Clint Miller, Nathan Sheffield, Michael J. Guertin, Chongzhi Zang
Bioinformatics 34, 2867–2869 (2018)
https://zanglab.github.io/bart/
https://github.com/zanglab/bart2
https://github.com/Boyle-Lab/Blacklist/tree/master/lists
http://lisa.cistrome.org/doc
plotCorrelation -in results.npz --corMethod spearman --whatToPlot heatmap --plotFile correlation_heatmap.png
plotPCA -in results.npz --plotFile pca.png
#under python3.8 it is not necessary!
#cd /usr/lib/x86_64-linux-gnu
#sudo ln -s libhdf5_serial_hl.so.10.0.2 libhdf5_hl.so
#sudo ln -s libhdf5_serial.so.10.1.0 libhdf5.so

README for BART (1.0.1)

============
Introduction
============

BART (Binding Analysis for Regulation of Transcription) is a bioinformatics tool for predicting functional transcription factors (TFs) that bind at genomic cis-regulatory regions to regulate gene expression in the human or mouse genomes, given a query gene set or a ChIP-seq dataset as input. BART leverages 3,485 human TF binding profiles and 3,055 mouse TF binding profiles from the public domain (collected in Cistrome Data Browser) to make the prediction.

BART is implemented in Python and distributed as an open-source package along with necessary data libraries.

BART is developed and maintained by the Chongzhi Zang Lab at the University of Virginia.


========
Tutorial
========

Positional arguments

{geneset,profile}

bart geneset

Given a query gene set (at least 100 genes recommended), predict functional transcription factors that regulate these genes.

*Usage: bart geneset [-h] -i <infile> -s <species> [-t <target>] [-p <processes>]
[--nonorm] [--outdir <outdir>] [-o <ofilename>]

*Example: bart geneset -i name_enhancer_prediction.txt -s hg38 -t target.txt -p 4
--outdir bart_output

*Input arguments:

-i <infile>, --infile <infile>

Input file, the name_enhancer_prediction.txt profile generated from MARGE.

-s <species>, --species <species>

Species, please choose from "hg38" or "mm10".

-t <target>, --target <target>

Target transcription factors of interest, please put each TF on one line. BART will generate extra plots showing prediction results for each TF.

-p <processes>, --processes <processes>

Number of CPUs BART can use.

--nonorm

Whether or not to standardize each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

--outdir <outdir>

If specified, all output files will be written to that directory. Default: the current working directory.

-o <ofilename>, --ofilename <ofilename>

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for `bart geneset`, i.e., the enhancer_prediction.txt file generated by MARGE, may have one of the two formats below (depending on the python version, py2 or py3):

a. Python2 version:

1 98.19
2 99.76
3 99.76
4 9.49
5 44.37
6 18.14

b. Python3 version:

chrom start end UDHSID Score
chr3 175483637 175483761 643494 3086.50
chr3 175485120 175485170 643497 2999.18
chr3 175484862 175485092 643496 2998.28
chr3 175484804 175484854 643495 2976.27
chr3 175491775 175491825 643507 2879.01
chr3 175478670 175478836 643491 2836.90

bart profile

Given a ChIP-seq data file (bed or bam format mapped reads), predict transcription factors whose binding pattern associates with the input ChIP-seq profile.

*Usage: bart profile [-h] -i <infile> -f <format> [-n <fragmentsize>] -s <species>
[-t <target>] [-p <processes>] [--nonorm]
[--outdir <outdir>] [-o <ofilename>]

*Example: bart profile -i ChIP.bed -f bed -s hg38 -t target.txt -p 4
--outdir bart_output

*Input files arguments:

-i <infile>, --infile <infile>

Input ChIP-seq bed or bam file.

-f <format>, --format <format>

Specify "bed" or "bam" format.

-n <fragmentsize>, --fragmentsize <fragmentsize>

Fragment size of ChIP-seq reads, in bps. Default: 150.

-s <species>, --species <species>

Species, please choose from "hg38" or "mm10".

-t <target>, --target <target>

Target transcription factors of interest, please put each TF on one line. BART will generate extra plots showing prediction results for each TF.

-p <processes>, --processes <processes>

Number of CPUs BART can use.

--nonorm

Whether or not to standardize each TF by all of its Wilcoxon statistic scores in our compendium. If set, BART will not do the normalization. Default: FALSE.

*Output arguments:

--outdir <outdir>

If specified, all output files will be written to that directory. Default: the current working directory.

-o <ofilename>, --ofilename <ofilename>

Name string of output files. Default: the base name of the input file.

*Notes:

The input file for `bart profile` should be in BED (https://genome.ucsc.edu/FAQ/FAQformat#format1) or BAM (http://samtools.github.io/hts-specs/SAMv1.pdf) format, in either hg38 or mm10.

BED is a tab-delimited text file that defines the data lines; the BED file format is described on the UCSC genome browser website (https://genome.ucsc.edu/FAQ/FAQformat). For BED format input, the first three columns should be chrom, chromStart, and chromEnd; the 6th column (strand information) is required by BART.

BAM is the binary version of the Sequence Alignment/Map (SAM) format (http://samtools.sourceforge.net); for more information about BAM custom tracks, see https://genome.ucsc.edu/goldenPath/help/bam.html.

Output files

1. name_auc.txt contains the ROC-AUC scores for all TF datasets in human/mouse. We use this score to measure the similarity of each TF dataset to the cis-regulatory profile; all TFs are ranked in decreasing order of score. The file should look like this:

AR_56254     AUC = 0.954
AR_44331     AUC = 0.950
AR_44338     AUC = 0.949
AR_50273     AUC = 0.947
AR_44314     AUC = 0.945
AR_44330     AUC = 0.943
AR_50100     AUC = 0.942
AR_44315     AUC = 0.942
AR_50044     AUC = 0.926
AR_50041     AUC = 0.925
FOXA1_50274     AUC = 0.924
AR_50042     AUC = 0.921

2. name_bart_results.txt is a ranked list of all TFs, which includes the Wilcoxon statistic score, Wilcoxon p-value, standardized Wilcoxon statistic score (z-score), maximum ROC-AUC score and rank score (relative rank of z-score, p-value and max AUC) for each TF. The most functional TFs for the input data are ranked first. The file should look like this:

TF statistic pvalue zscore max_auc rela_rank
AR 18.654 1.172e-77 3.024 0.954 0.004
FOXA1 13.272 3.346e-40 2.847 0.924 0.008
SUMO2 5.213 1.854e-07 3.494 0.749 0.021
PIAS1 3.987 6.679e-05 2.802 0.872 0.025
HOXB13 3.800 1.446e-04 2.632 0.909 0.027
GATA3 5.800 6.633e-09 2.549 0.769 0.028
NR3C1 4.500 6.789e-06 2.042 0.871 0.040
GATA6 4.240 2.237e-05 2.602 0.632 0.048
ESR1 12.178 4.057e-34 1.956 0.700 0.049
CEBPB 5.265 1.404e-07 2.287 0.602 0.057
ATF4 3.216 1.302e-03 2.348 0.658 0.065
TOP1 2.254 2.421e-02 3.057 0.779 0.065

3. name_plot is a folder containing all the extra plots for the TFs listed in the target file (target.txt in the test data). For each TF, we provide a boxplot, which shows the rank position of this TF among all TFs (derived from the rank score in name_bart_results.txt), and a cumulative distribution plot, which compares the distribution of ROC-AUC scores from this TF's datasets with the scores of all datasets (derived from the AUC scores in name_auc.txt).
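The output tables above are plain whitespace-delimited text, so they can be post-processed without BART itself. A sketch for extracting the top-ranked TFs from a name_bart_results.txt-style table (assumes the column layout shown above; the helper name `top_tfs` is illustrative):

```python
def top_tfs(results_text, n=3):
    """Parse a BART name_bart_results.txt-style table (whitespace-delimited,
    header 'TF statistic pvalue zscore max_auc rela_rank') and return the n
    TFs with the smallest (best) rank score."""
    rows = []
    for line in results_text.strip().splitlines()[1:]:  # skip the header line
        tf, stat, pval, zscore, max_auc, rela_rank = line.split()
        rows.append((tf, float(rela_rank)))
    rows.sort(key=lambda r: r[1])  # smaller rank score = more functional
    return [tf for tf, _ in rows[:n]]

sample = """TF statistic pvalue zscore max_auc rela_rank
AR 18.654 1.172e-77 3.024 0.954 0.004
FOXA1 13.272 3.346e-40 2.847 0.924 0.008
SUMO2 5.213 1.854e-07 3.494 0.749 0.021
"""
```

With a real results file, `top_tfs(open("name_bart_results.txt").read())` would return the leading candidates.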

AmpliconNoise: a collection of programs for the removal of noise from 454 sequenced PCR amplicons

AmpliconNoise is a collection of programs for the removal of noise from 454-sequenced PCR amplicons. Typical usage involves three steps:
1) removal of noise from the sequencing itself,
2) removal of PCR point errors, and
3) removal of chimeric sequences using the program Perseus.
While steps 2-3 can be used on amplicon reads generated using most sequencing technologies, step 1 is currently only supported for sequences generated using 454 pyrosequencing.

When presenting work using AmpliconNoise, please refer to the following citation:
Quince et al. (2011), 'Removing noise from pyrosequenced amplicons', BMC Bioinformatics 12:38.

ALIENTRIMMER: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads

Criscuolo A, Brisse S (2013) AlienTrimmer: a tool to quickly and accurately trim off multiple short contaminant sequences from high-throughput sequencing reads. Genomics 102:500-506. doi:10.1016/j.ygeno.2013.07.011
ftp://ftp.pasteur.fr/pub/GenSoft/projects/AlienTrimmer/

AlienTrimmer is a program that detects and removes sequencing errors and contaminant sequences (e.g., adaptors, primers) at both ends of high-throughput sequencing (HTS) reads. Based on the decomposition of the specified alien sequences into nucleotide k-mers of fixed length k, AlienTrimmer is able to determine whether such alien k-mers occur in either read end using a simple polynomial algorithm. AlienTrimmer can process typical HTS single- or paired-end read files containing millions of entries in a few minutes with very low computer resources.

CHANGES IN VERSION 0.4.0
• by default, quality-based trimming is performed along with alien trimming (Phred quality score cut-off = 20)
• the Phred quality score cut-off is specified with a numerical value (Phred score encoding is automatically assessed from input FASTQ files)
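The k-mer decomposition idea can be sketched in a few lines (a toy illustration only; AlienTrimmer's actual algorithm scans both read ends and tolerates mismatches via its k and p parameters):

```python
def alien_kmers(alien_seqs, k=10):
    """Decompose contaminant (alien) sequences into the set of all k-mers."""
    kmers = set()
    for s in alien_seqs:
        for i in range(len(s) - k + 1):
            kmers.add(s[i:i + k])
    return kmers

def trim_3prime(read, kmers, k=10):
    """Trim the read at the first position where an alien k-mer occurs
    (simplified 3' trimming with exact matches only)."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            return read[:i]
    return read

adapter = "AGATCGGAAGAGC"                      # example adapter sequence
read = "ACGTACGTACGTACGT" + adapter + "TTTT"   # insert with adapter read-through
```

A read shorter than k, or one containing no alien k-mer, is returned unchanged.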

AdmixTools – ROLLOFF

https://github.com/DReichLab/AdmixTools

Given genotype data from an admixed population and two ancestral populations, ROLLOFF computes the time since mixture using the rate of exponential decay of admixture LD. Specifically, it computes the correlation between a signed LD statistic for a pair of markers and a weight that reflects their allele frequency differentiation in the ancestral populations. The method assumes that samples are unrelated.

For details of the method, refer to Moorjani et al. (2011) and Patterson et al. (2012).

ROLLOFF requires that the input data be available in EIGENSTRAT format. To convert to the appropriate format, one can use the CONVERTF program. See README.CONVERTF for documentation of programs for converting file formats.
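The dating idea behind ROLLOFF can be illustrated numerically: admixture LD decays roughly as A(d) = A0 * exp(-n*d), where d is genetic distance in Morgans and n is the number of generations since mixture, so n is recoverable as the (negative) slope of log LD against distance. A toy sketch of that fit on noiseless synthetic data (not ROLLOFF code; the real weighted-correlation statistic is computed from genotype data):

```python
import math

def fit_decay(distances_morgans, ld_scores):
    """Least-squares fit of log(ld) = log(A0) - n*d; returns n, the number of
    generations since admixture (illustrative; assumes positive LD scores)."""
    xs = distances_morgans
    ys = [math.log(s) for s in ld_scores]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # decay rate = generations since admixture

# synthetic decay curve for an admixture event 100 generations ago
gens = 100.0
d = [0.001 * i for i in range(1, 21)]          # genetic distances in Morgans
a = [0.5 * math.exp(-gens * x) for x in d]     # A(d) = A0 * exp(-n*d)
```

On noiseless data the fit recovers n exactly; on real data ROLLOFF additionally bins marker pairs by distance and jackknifes over chromosomes to get standard errors.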

ART: a next-generation sequencing read simulator

ART: a next-generation sequencing read simulator
https://doi.org/10.1093/bioinformatics/btr708

ART_ILLUMINA README (updated on 06/6/2016) Weichun Huang
ART_Illumina (2008-2016), Q Version 2.5.8 (Jun 6, 2016)

DESCRIPTION

ART (art_illumina Q version) is a simulation program that generates sequence read data of Illumina sequencers. ART generates reads according to the empirical read quality profile summarized from large real read datasets. ART has been used for testing and benchmarking a variety of methods and tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and detection of SNPs, CNVs, or other structural variation.

art_illumina can generate single-end, paired-end, mate-pair, and amplicon sequencing simulations of the Illumina sequencing platform. Its outputs include FASTQ read files and ALN and/or SAM alignment files. ALN files can also be converted to UCSC BED files using the included aln2bed.pl program.

art_Illumina comes with the tool art_profiler_illumina that can generate quality profiles from Illumina
sequencing data in the fastq format. The tool is in the folder ART_profiler_illumina. Please see README
in the folder for the details and usage.

EXAMPLES

In the “examples” subdirectory, the shell script “run_test_examples_illumina.sh” gives four examples of using
ART for read simulation. To test these four examples, just run the script “run_test_examples_illumina.sh”

USAGE
RECOMMENDED USAGES (specifying a sequencing system)
art_illumina [options] -ss <seq_sys> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -ss <seq_sys> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -ss <seq_sys> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -ss <seq_sys> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

OTHER USAGES
art_illumina [options] -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

===== PARAMETERS =====
-1 --qprof1 the first-read quality profile
-2 --qprof2 the second-read quality profile
-amp --amplicon amplicon sequencing simulation
-c --rcount number of reads/read pairs to be generated per sequence (not to be used together with -f/--fcov)
-d --id the prefix identification tag for read ID
-ef --errfree indicate to generate a zero-sequencing-error SAM file as well as the regular one
NOTE: the reads in the zero-error SAM file have the same alignment positions
as those in the regular SAM file, but have no sequencing errors
-f --fcov the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
-h --help print out usage information
-i --in the filename of the input DNA/RNA reference
-ir --insRate the first-read insertion rate (default: 0.00009)
-ir2 --insRate2 the second-read insertion rate (default: 0.00015)
-dr --delRate the first-read deletion rate (default: 0.00011)
-dr2 --delRate2 the second-read deletion rate (default: 0.00023)
-k --maxIndel the maximum total number of insertions and deletions per read (default: up to read length)
-l --len the length of reads to be simulated
-m --mflen the mean size of DNA/RNA fragments for paired-end simulations
-mp --matepair indicate a mate-pair read simulation
-M --cigarM indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
-nf --maskN the cutoff frequency of 'N' in a window size of the read length for masking genomic regions
NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking
-na --noALN do not output the ALN alignment file
-o --out the prefix of the output filename
-p --paired indicate a paired-end read simulation or to generate reads from both ends of amplicons
NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
-q --quiet turn off the end-of-run summary
-qL --minQ the minimum base quality score
-qU --maxQ the maximum base quality score
-qs --qShift the amount to shift every first-read quality score by
-qs2 --qShift2 the amount to shift every second-read quality score by
NOTE: For the -qs/-qs2 options, a positive number will shift up quality scores (the max is 93),
which reduces substitution sequencing errors, and a negative number will shift down
quality scores, which increases sequencing errors. If shifting scores by x, the error
rate will be 1/(10^(x/10)) of the default profile.
-rs --rndSeed the seed for the random number generator (default: system time in seconds)
NOTE: use a fixed seed to generate two identical datasets from different runs
-s --sdev the standard deviation of DNA/RNA fragment sizes for paired-end simulations
-sam --samout indicate to generate a SAM alignment file
-sp --sepProf indicate to use separate quality profiles for different bases (ATGC)
-ss --seqSys the name of the Illumina sequencing system of the built-in profile used for simulation

NOTE: all built-in sequencing system ID names are:
GA1 – GenomeAnalyzer I (36bp,44bp)
GA2 – GenomeAnalyzer II (50bp, 75bp)
HS10 – HiSeq 1000 (100bp)
HS20 – HiSeq 2000 (100bp)
HS25 – HiSeq 2500 (125bp, 150bp)
HSXn – HiSeqX PCR free (150bp)
HSXt – HiSeqX TruSeq (150bp)
MinS – MiniSeq TruSeq (50bp)
MSv1 – MiSeq v1 (250bp)
MSv3 – MiSeq v3 (250bp)
NS50 – NextSeq500 v2 (75bp)

===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.

* For single-end simulation, ART requires an input sequence file, an output file prefix, the read length, and the read count or fold coverage.

* For paired-end simulation (except for amplicon sequencing), ART also requires the mean and
standard deviation of the DNA/RNA fragment length.

===== EXAMPLES =====

1) single-end read simulation
art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat

2) paired-end read simulation
art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat

3) mate-pair read simulation
art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

4) amplicon sequencing simulation with 5' end single-end reads
art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat

5) amplicon sequencing simulation with paired-end reads
art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat

6) amplicon sequencing simulation with mate-pair reads
art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat

7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat

8) reduce the substitution error rate to one 10th of the default profile
art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error

9) turn off the masking of genomic regions with unknown nucleotides 'N'
art_illumina -ss HS20 -nf 0 -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask

10) mask genomic regions with >=5 'N's within the read length
art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5

READ QUALITY PROFILE

TOOL FOR CREATING A NEW QUALITY PROFILE

art_profiler_illumina, in the folder ART_profiler_illumina, can generate quality profiles from
Illumina FASTQ data. Please see the README in that folder for usage.

FORMAT
A valid quality score profile is tab-delimited, with no specific header line. Comment lines of
extraneous information may be included if each one begins with a number sign (#). Each line of
actual quality profile information must begin with an identifier indicating the source of the data
on the remainder of the line. Identifiers include:

. The identifier for the combination of all base information.

A The identifier for quality scores associated with A calls only.

T The identifier for quality scores associated with T calls only.

G The identifier for quality scores associated with G calls only.

C The identifier for quality scores associated with C calls only.

Following the identifier on a given line must be the position number indicating where, within a
given fragment, the rest of the information on that line applies. The data must be arranged in pairs
such that both lines in a pair have the same identifier and position number. The first line in a pair
is a list of quality scores in ascending order, and the second line contains the corresponding
cumulative frequencies of those quality scores.

EXAMPLE:

. 0 3 6 7 8 9 10 11 12 13 14 15
. 0 39375 355755 395136 415685 1227131 1338634 1522001 1851208 2165909 2436839 2608538

. 35 4 5 6 7 8 9 10 11 12 13 14
. 35 434262 1341151 1725690 2979293 3478620 3592624 3672807 3873754 4096922 4983957 6111261
A 0 3 6 7 8 9 10 11 12 13 14 15
A 0 560 99637 111485 119727 389899 412150 458066 572958 665153 745916 793532
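Based on the format described above, a profile file can be read with a short script. The sketch below is illustrative only (the function name and pairing logic are my own); it assumes whitespace-delimited fields and skips `#` comment lines:

```python
# Sketch of a reader for ART's tab-delimited quality-profile format:
# data lines start with an identifier ('.', 'A', 'T', 'G', 'C') and a
# position number, and come in pairs -- first the quality scores in
# ascending order, then the corresponding cumulative frequencies.

def parse_profile(lines):
    records = {}   # (identifier, position) -> (scores, cumulative_freqs)
    pending = {}   # first line of each pair, waiting for its partner
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comment lines
        fields = line.split()
        key = (fields[0], int(fields[1]))
        values = [int(v) for v in fields[2:]]
        if key in pending:
            records[key] = (pending.pop(key), values)
        else:
            pending[key] = values
    return records

# Using the first pair from the example above:
profile = parse_profile([
    ". 0 3 6 7 8",
    ". 0 39375 355755 395136 415685",
])
assert profile[('.', 0)] == ([3, 6, 7, 8], [39375, 355755, 395136, 415685])
```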

BUILT-IN PROFILES

Under the subfolder "Illumina_Profiles" are read profiles of Illumina sequencers, including all
of ART's built-in profiles, as indicated below.

1) Raw quality profiles
GA1 36bp reads
Emp36R1.txt
Emp36R2.txt

GA1 44bp reads
Emp44R1.txt
Emp44R2.txt

GA2 50bp reads
Emp50R1.txt
Emp50R2.txt

GA2 75bp reads
Emp75R1.txt
Emp75R2.txt

MiSeq 250bp reads
EmpMiSeq250R1.txt
EmpMiSeq250R2.txt

HiSeq1000 100bp reads
Emp100R1.txt
Emp100R2.txt

HiSeq2000 100bp reads (the new default profile for 100bp reads)
HiSeq2kL100R1.txt
HiSeq2kL100R2.txt

HiSeq2500 125bp reads
HiSeq2500L125R1.txt
HiSeq2500L125R2.txt

HiSeq2500 150bp reads
HiSeq2500L150R1.txt
HiSeq2500L150R2.txt

HiSeqX PCR free (150bp)
HiSeqXPCRfreeL150R1.txt
HiSeqXPCRfreeL150R2.txt

HiSeqX TruSeq (150bp)
HiSeqXtruSeqL150R1.txt
HiSeqXtruSeqL150R2.txt

MiniSeq TruSeq (50bp)
MiniSeqTruSeqL50.txt

MiSeq v3 (250bp)
MiSeqv3L250R1.txt
MiSeqv3L250R2.txt

NextSeq500 v2 (75bp)
NextSeq500v2L75R1.txt
NextSeq500v2L75R2.txt

2) Recalibrated quality profiles (all these are ART’s built-in profiles)

36bp reads
1st: EmpR36R1
2nd: EmpR36R2

44bp reads
1st: EmpR44R1
2nd: EmpR44R2

50bp reads
1st: EmpR50R1
2nd: EmpR50R2

75bp reads
1st: EmpR75R1
2nd: EmpR75R2

OUTPUT FILES

*.fq - FASTQ read data files. For paired-read simulation, *1.fq contains data for the first reads and *2.fq for the second reads.
*.aln - read alignment files. For paired-read simulation, *1.aln has read alignments for the first reads and *2.aln for the second reads.

FASTQ FILE format
A FASTQ file contains both sequence bases and quality scores of sequencing reads and is in the following format:
@read_id
sequence_read
+
base_quality_scores

A base quality score is coded by the ASCII code of a single character; the quality score equals the
ASCII code of the character minus 33.
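This Phred+33 encoding can be decoded with a one-line helper (the function name is my own, for illustration):

```python
# Decode a FASTQ (Phred+33) quality string: each character's ASCII code
# minus 33 is that base's quality score.

def decode_qualities(qual_string):
    return [ord(c) - 33 for c in qual_string]

# '!' (ASCII 33) -> 0, 'I' (ASCII 73) -> 40, '?' (ASCII 63) -> 30
assert decode_qualities("!I?") == [0, 40, 30]
```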

Example:
@refid-4028550-1
caacgccactcagcaatgatcggtttattcacgat…
+
????????????7???????=&?<…

ALN format

An ALN file records each read alignment as three lines:
>ref_seq_id read_id aln_start_pos ref_seq_strand
ref_seq_aligned
read_seq_aligned

aln_start_pos is the alignment start position on the reference sequence, and is always relative to
the strand of the reference. That is, aln_start_pos 10 on the plus (+) strand is different from
aln_start_pos 10 on the minus (-) strand.

ref_seq_aligned is the aligned region of the reference sequence, which can come from the plus or minus strand.
read_seq_aligned is the aligned sequence read, which is always in the same orientation as the read in the corresponding FASTQ file.
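An ALN alignment record thus consists of a header line (`>ref_seq_id read_id aln_start_pos ref_seq_strand`) followed by the aligned reference and read sequences. A minimal parsing sketch, assuming whitespace-separated header fields (the helper name is mine):

```python
# Parse one three-line ALN alignment record into a dict, per the field
# descriptions above. Assumes header fields are whitespace-separated.

def parse_aln_record(lines):
    header, ref_aln, read_aln = lines
    ref_id, read_id, start, strand = header.lstrip('>').split()
    return {
        'ref_seq_id': ref_id,
        'read_id': read_id,
        'aln_start_pos': int(start),   # 0-based, relative to the given strand
        'ref_seq_strand': strand,
        'ref_seq_aligned': ref_aln,
        'read_seq_aligned': read_aln,
    }

rec = parse_aln_record([
    ">chr1 read-001 10 +",   # hypothetical record for illustration
    "ACGTACGT",
    "ACGAACGT",
])
assert rec['aln_start_pos'] == 10 and rec['ref_seq_strand'] == '+'
```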

SAM format

SAM is a standard format for next-gen sequencing read alignments. The details of the format and examples are available at the links below:
1) http://samtools.sourceforge.net/SAM1.pdf
2) http://genome.sph.umich.edu/wiki/SAM

BED format

See the format at UCSC http://genome.ucsc.edu/FAQ/FAQformat.html#format1

NOTE: both ALN and BED format files use a 0-based coordinate system, while SAM format uses a 1-based coordinate system.
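When comparing positions across these outputs, the conversion is a simple off-by-one adjustment (helper names are mine, for illustration):

```python
# ALN/BED report 0-based start positions; SAM reports 1-based POS.
# Converting between the two conventions:

def zero_based_to_sam_pos(start):
    """0-based ALN/BED start -> 1-based SAM POS."""
    return start + 1

def sam_pos_to_zero_based(pos):
    """1-based SAM POS -> 0-based ALN/BED start."""
    return pos - 1

# The first base of a reference is 0 in ALN/BED but 1 in SAM:
assert zero_based_to_sam_pos(0) == 1
assert sam_pos_to_zero_based(1) == 0
```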

ACKNOWLEDGEMENTS
I would like to thank all ART users for their feedback and contributions, especially the users listed below.
Richard Nielson, DNASTAR
Bruno Nevado, CRAG in UAB
Lee Katz, US CDC