ChIPMunk

4. ChIPMunk quick-start >>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Important! ChIPMunk does not write any output files
by default. To save the results, append
   > output.log
to the end of the command line.

The input file TEST_footprint.mfa is assumed to be a simple multi-fasta file 
containing unaligned sequences.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[NEW] ChIPMunk "default" mode:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

A simple DEFAULT mode to find a motif in a given set of sequences
(the 'TEST_footprint.mfa' file):

  ChIPMunk s:TEST_footprint.mfa
or
  ChIPMunk s:TEST_footprint.mfa > output.log
to save the output to the 'output.log' file.

This runs ChIPMunk with a predefined motif length range from 7 to 22,
default precision, zero-or-one-occurrence-per-sequence mode with the
flexible (1.0) segmentation coefficient, and two concurrent computational threads.

See examples below for more information on setting specific parameters.


(a) To produce a motif (as a positional count matrix)
using a gapless multiple local alignment of a fixed width (length)
of 15bp from a given set of sequences:

  ChIPAct 15 s:TEST_footprint.mfa

(b) To search for the best flexible motif (take good motif hits).
The sequence set can contain some noise
(so some sequences can have no motif hits):

  ChIPMunk 6 15 yes 1.0 s:TEST_footprint.mfa
  
(c) To search for the best strong motif with a length
from 6bp to 15bp (each sequence must have at least one motif occurrence,
i.e. one-occurrence-per-sequence mode, OOPS):

  ChIPMunk 6 15 yes oops s:TEST_footprint.mfa

(d) To search for the best strict motif (take only very strong hits 
very close to consensus). The sequence set can contain some noise 
(so some sequences can have no motif hits):
  
  ChIPMunk 6 15 yes 0.0 s:TEST_footprint.mfa


*****************************************************
*** <<<<   ChIPMunk detailed manual           >>> ***
*****************************************************

[NOTE] By default ChIPMunk prints messages to the 
STDERR stream and results to the STDOUT stream.
In practice this means that if you run it as

  ChIPMunk {parameters} > chipmunk.log

redirecting the output (using '>') into chipmunk.log,
then this file will contain all resulting data
and the console will show only the 'work-in-progress' messages.


The full ChIPMunk command-line is:

ChIPMunk 
  <start_motif_length> <stop_motif_length> 
  <verbose>=(y)es|(n)o <mode>=oops|zoops_factor=0.0..1.0 
  <x:input_set1>..<x:input_setN> 
  <try_limit> <step_limit> <iter_limit> <thread_count> 
  <seeds>=random|filename.mfa <gc%>=0.5|auto
  <motif_shape>=flat|single|double
  <disable_log_weighting>

[NOTE] The letter case is important, 
so you need to type strictly 'ChIPMunk', not 'Chipmunk' or 'chIPmunk'.

The command-line parameters are given in <..> brackets.
The parameters after the input sets can be omitted.
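
A minimal Python sketch of how a wrapper script could assemble the
positional command line shown above. The helper name build_chipmunk_args
and its parameters are hypothetical and not part of ChIPMunk. Because
the parameters are positional, the sketch assumes that optional ones can
only be dropped from the end of the list.

```python
from typing import List, Optional, Sequence


def build_chipmunk_args(
    start_length: int,
    stop_length: int,
    verbose: bool,
    mode: str,
    input_sets: Sequence[str],
    optional_tail: Optional[Sequence[object]] = None,
) -> List[str]:
    """Assemble the positional ChIPMunk arguments in the documented order.

    optional_tail holds the values that follow the input sets
    (try_limit, step_limit, iter_limit, thread_count, seeds, gc%, ...).
    Assumption: only a trailing run of these values may be left out.
    """
    if not input_sets:
        raise ValueError("at least one input set such as 's:data.mfa' is required")
    for item in input_sets:
        # Every input set carries a one-letter type prefix before the colon.
        prefix = item.split(":", 1)[0]
        if prefix not in {"s", "r", "w", "p"}:
            raise ValueError(f"unknown input set type in {item!r}")
    args = ["ChIPMunk", str(start_length), str(stop_length),
            "yes" if verbose else "no", mode]
    args.extend(input_sets)
    args.extend(str(value) for value in (optional_tail or []))
    return args


print(" ".join(build_chipmunk_args(10, 20, True, "0.0", ["p:data.fasta"],
                                   [10000, 100, 1, 4, "random", 0.6])))
```

The returned list can be passed to a process launcher as is. Redirecting
STDOUT to a file is left to the caller.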

ChIPMunk searches for the longest strong motif starting
from [start_motif_length].

If [start_motif_length] > [stop_motif_length]
then ChIPMunk decreases the length by 1 at each step 
and takes the first strong motif.

If [stop_motif_length] > [start_motif_length]
then ChIPMunk increases the length by 1 at each step
and stops when the first weak motif is found (taking
the previously found strong motif).

If [stop_motif_length] = [start_motif_length]
then ChIPMunk searches for the best motif of a given length.
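
The three length rules above can be sketched in Python as follows.
This is only an illustration of the control flow, not ChIPMunk code:
is_strong is a hypothetical stand-in for the real strong-motif test,
and returning None when no strong motif is found is an assumption,
since the text does not say what happens in that case.

```python
from typing import Callable, Optional


def pick_motif_length(start: int, stop: int,
                      is_strong: Callable[[int], bool]) -> Optional[int]:
    """Return the motif length selected by the rules described above."""
    if start == stop:
        # Fixed length: the best motif of this length is reported.
        return start
    if start > stop:
        # Shrink by 1 per step and take the first strong motif.
        for length in range(start, stop - 1, -1):
            if is_strong(length):
                return length
        return None
    # Grow by 1 per step; stop at the first weak motif and keep the
    # last strong one found before it.
    last_strong = None
    for length in range(start, stop + 1):
        if not is_strong(length):
            break
        last_strong = length
    return last_strong


# Toy predicate: pretend that motifs of up to 9bp are strong.
print(pick_motif_length(12, 7, lambda length: length <= 9))
print(pick_motif_length(7, 12, lambda length: length <= 9))
```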

For the formal definition of the strong motif 
please check the ChIPMunk-on-the-web
details page at autosome.ru/ChIPMunk/

**************************
>>>> Parameters list: <<<<
**************************

[verbose]
  set y or yes for additional program output
  
  For ChIPMunk this is the list of words 
    used for motif construction
  For the ChIPHorde extension (see below) it would
    print out motif occurrences for each motif
  
[mode]
  oops (or OOPS) corresponds to the one-occurrence-per-sequence mode.
  If a number (starting from 0.0) is specified then the 
  zero-or-one-occurrence-per-sequence mode is used. 
  Larger numbers correspond to more flexible segmentation. 
  
  Recommended value: 1.0

[input_sets] x: can be 
  s: for simple multi-fasta
  
  r: for simple multi-fasta that should be considered in a single-strand mode
     (e.g. for discovery of RNA motifs)
  
  w: weighted data set where the number specifies the sequence weight 
  (i.e. the impact it has on the motif)
    
    > 1.0
    ACGGGAAA
    > 2.0
    GTGAAAAA
  
  p: peak data where each number in the space-separated list
  specifies the weight of the corresponding position
    > 1.0 2.0 1.0 1.0
    ACCG
    > 2.0 10.0 1.0 1.0 2.0
    GTACA
    
  The peak data is useful for any sequence-specific positional prior.
  The easiest example is the ChIP-Seq peak base coverage data,
  see http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btq488
  
  Examples of tests on real data and comparison vs existing tools
  can be found here: http://autosome.ru/ChIPMunk/supplement.html
  
  ChIPMunk can integrate data from different sequence sets of 
  different types (so you can use peak and weighted and simple data
  sets together). More information on this topic is presented 
  in the corresponding paper: 
  http://www.springerlink.com/content/q86n48151u25278w/
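
A small Python sketch for writing files in the 'w:' and 'p:' layouts
shown above. The function names are hypothetical. The check that a
'p:' record has exactly one weight per base is an assumption drawn
from the examples, not a documented requirement.

```python
from typing import Iterable, List, Sequence, Tuple


def format_weighted(records: Iterable[Tuple[float, str]]) -> str:
    """Render (weight, sequence) pairs in the 'w:' layout shown above."""
    lines: List[str] = []
    for weight, sequence in records:
        lines.append(f"> {weight}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


def format_peaks(records: Iterable[Tuple[Sequence[float], str]]) -> str:
    """Render (per-position weights, sequence) pairs in the 'p:' layout."""
    lines: List[str] = []
    for weights, sequence in records:
        # Assumption: one weight per base, as in the examples above.
        if len(weights) != len(sequence):
            raise ValueError(
                f"{len(weights)} weights given for {len(sequence)} bases")
        lines.append("> " + " ".join(str(w) for w in weights))
        lines.append(sequence)
    return "\n".join(lines) + "\n"


print(format_weighted([(1.0, "ACGGGAAA"), (2.0, "GTGAAAAA")]), end="")
print(format_peaks([([1.0, 2.0, 1.0, 1.0], "ACCG")]), end="")
```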
  
[try_limit] (default 100) 
  This is the number of general optimization runs.
  For random seeding it equals the number of seeds.
  It can be as high as your computational power allows (100, 1000 or 10000 
  seems to be enough, depending on the size of your set 
  and the strength of the best motif). 
  In practice, for huge sequence sets and strong motifs 
  the value of 100 is acceptable.
  It should be increased if ChIPMunk behaves unstably 
  on weak motifs or noisy datasets.

[step_limit] (default 10) 
  Corresponds to the number of bootstrapping runs.
  Like try_limit, it can be raised (10, 100, 1000, ..).
  It can be increased for small datasets and weak motifs.

Both [step_limit] and [try_limit] increase computational time 
in a linear fashion.

[iter_limit] (default 1) 
  is useful only if set to 1 or (rarely) 2; higher values can reduce
  the speedup achieved by bootstrapping and will have no real effect on motif quality

[thread_count] (default 1) 
  is extremely useful on modern multi-core processors. 
  The parallel nature of the ChIPMunk algorithm allows its speed
  to increase linearly with the number 
  of available cores. Set it to 2 or 4 to use all available
  computational power of dual- and quad-core processors.
  If you have an Intel CPU with Hyper-Threading, 
  for some datasets it may be useful to further increase 
  the number of computational threads up to 8
  (with an additional performance gain).
  
  For extreme 6/12-core processors you can use 6 or 12 threads.

[seeds] (default random)
  can be either 'random' (completely random seeds for PWM optimization) 
  or the name of a multi-fasta file with sequences 
  from which the starting seed words are extracted
  
[gc%] (default 0.5)
  specifies prior background bias (for Kullback DIC). When "auto" or "local"
  is specified ChIPMunk will compute the background nucleotide composition
  of supplied sequence sets.

[motif_shape] (default flat)
  can be 'flat' (default, if not specified) to
  omit the motif shape prior.
  The 'single' and 'double' values correspond to the
  single-box and double-box motifs.
  The positions within motif are weighted so the
  positions with higher weight have more impact on the PWM score
  during optimization.
  
  The single box prior uses cos^2(PI*n/T) weighting,
  the double box prior uses sin^2(PI*n/T) weighting.
  Here PI = 3.1415926, T is the DNA helix period (10.5bp),
  and n is the position within the motif.
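
The two weighting formulas above can be evaluated with a few lines of
Python. This sketch only computes the formulas as written. How ChIPMunk
counts n (for example from the motif start or from its centre) is not
stated here, so the function takes n as given. The function names are
hypothetical.

```python
import math

HELIX_PERIOD_BP = 10.5  # T in the formulas above


def single_box_weight(n: float, period: float = HELIX_PERIOD_BP) -> float:
    """cos^2(PI*n/T) weight for position n."""
    return math.cos(math.pi * n / period) ** 2


def double_box_weight(n: float, period: float = HELIX_PERIOD_BP) -> float:
    """sin^2(PI*n/T) weight for position n."""
    return math.sin(math.pi * n / period) ** 2


# Print both weights for the first positions to compare the two shapes.
for n in range(0, 11):
    print(n, round(single_box_weight(n), 3), round(double_box_weight(n), 3))
```

At every position the two weights sum to 1, so the double-box prior
emphasizes exactly the positions that the single-box prior suppresses.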

[disable_log_weighting]
  can be set to any value to disable the new dataset weighting
  strategy. The original ChIPMunk assigned equal weights
  to all datasets regardless of their sizes 
  (i.e. the number of sequences in each dataset was not taken into account).
  
  The default v3 strategy assigns each dataset a weight proportional to the
  natural logarithm of the number of sequences in the set.
  Setting this parameter to any value restores the previous
  'equal dataset weighting' strategy.
  
  The default v4.2 strategy also uses the natural logarithm 
  of the peak height to assign weights to the individual peaks.
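
To illustrate the difference between the two dataset weighting
strategies, here is a hedged Python sketch. It returns relative weights
only: the proportionality constant used by ChIPMunk is not given in
this manual, and the function name and dictionary layout are
hypothetical.

```python
import math
from typing import Dict


def dataset_weights(sequence_counts: Dict[str, int],
                    log_weighting: bool = True) -> Dict[str, float]:
    """Relative dataset weights, up to an unstated constant factor."""
    if not log_weighting:
        # 'Equal dataset weighting': the dataset size is ignored.
        return {name: 1.0 for name in sequence_counts}
    # Log weighting: weight grows with ln(number of sequences).
    return {name: math.log(count) for name, count in sequence_counts.items()}


weights = dataset_weights({"selex": 100, "chipseq": 10000})
print(weights["chipseq"] / weights["selex"])
```

With log weighting a dataset that is 100 times larger (10000 versus 100
sequences) gets only about twice the weight, since ln(10000) = 2 ln(100).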

ChIPAct can be used with very similar syntax and parameters. 
It always searches for a fixed-width motif in the OOPS mode and 
therefore takes a single motif length parameter instead 
of the [start_motif_length] and [stop_motif_length] used by ChIPMunk.


******************************************************
*** <<<<          ChIPMunk examples           >>>> ***
******************************************************

1. ChIPMunk autosome package is placed into the current 
working directory; default values for limits; 
motif lengths from 7 to 10; one standard sequence set data.mfa; 
one-occurrence-per-sequence mode; verbose on

  ChIPMunk 7 10 yes oops s:data.mfa

2. ChIPMunk autosome package is located somewhere 
in some_dir; user-specified values for limits; 
motif lengths from 7 to 20; two sequence sets with
different weighting models; flexible segmentation
zero-or-one-occurrence-per-sequence mode; verbose off

  ChIPMunk 7 20 no 1.0 p:chipseq.fasta 
    w:selex.mfa 1000 100 1

3. ChIPMunk autosome package is placed into the current working
directory; peak data used; user-specified values
for limits; motif lengths from 10 to 20; 
verbose on; strict ZOOPS mode; use of 4 computational threads; 
random seeds; gc% 0.6

  ChIPMunk 10 20 yes 0.0 p:data.fasta 
    10000 100 1 4 random 0.6

[NOTE] For either ChIPMunk or ChIPHorde in most cases it is wise to use
1.0 ZOOPS segmentation instead of the oops mode (shown in examples below).


=====================================================
*** <<<<        ChIPHorde extension           >>> ***
-----------------------------------------------------

----- ChIPHorde quick-start -------------------------

  ChIPHorde 12:7,12:7,12:7
    mask yes 0.0 s:BCD_footprint.mfa 100 10 1 2 random 0.5 single > test_result.out
    
This will produce up to 3 motifs of no more than 12bp in length
using strict ZOOPS segmentation, default precision parameters, 
2 computational threads, uniform nucleotide composition 
and single-box motif shape prior.

----- ChIPHorde details -----------------------------

The ChIPHorde extension is intended for the
sequential search for more than one motif.
It uses a very simple command-line format (for example):

  ChIPHorde 8:10,7:6
    mask yes 1.0 s:BCD_footprint.mfa > test_result.out

This tells ChIPHorde to run ChIPMunk twice in sequence, using the 8:10
and 7:6 length intervals: the search for 
the first motif starts from the shortest possible length
and the search for the second motif starts from the longest possible one.
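
A short Python sketch of how such a length specification could be read
by a helper script. It assumes that each comma-separated item is a
start:stop pair in the same order as the ChIPMunk
[start_motif_length] and [stop_motif_length] parameters. The function
name parse_length_ranges is hypothetical.

```python
from typing import List, Tuple


def parse_length_ranges(spec: str) -> List[Tuple[int, int]]:
    """Split a spec such as '8:10,7:6' into (start, stop) pairs."""
    ranges = []
    for part in spec.split(","):
        start_text, stop_text = part.split(":")
        ranges.append((int(start_text), int(stop_text)))
    return ranges


# Report the search direction implied by each pair.
for start, stop in parse_length_ranges("8:10,7:6"):
    if start < stop:
        direction = "growing"
    elif start > stop:
        direction = "shrinking"
    else:
        direction = "fixed"
    print(f"{start}:{stop} -> {direction} search")
```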

"mask" means to mask (with polyN) the motif occurrences
in a sequential search. There is another option, 
"filter", which makes ChIPHorde exclude 
sequences with good motif hits before
the next sequential ChIPMunk run.

[NEW] If you want to check different length ranges 
using ChIPHorde but DO NOT wish to filter/mask sequences/hits
then you can use the new 'dummy' mode.
The 'dummy' mode will not perform any filtering after checking 
each of the supplied motif length ranges,
so you can get many 'versions' of the same motif.

  ChIPHorde 10:10,15:15,20:20
    dummy yes 1.0 s:sequences.mfa > test_result.out

Both classic modes ('mask' and 'filter') 
are suitable for different conditions.
Typically the "mask" mode will try to search
for co-factor motifs while the "filter" mode 
is intended for searching for the different motifs 
of a single TF.

The motif occurrence threshold used for masking/filtering 
is the score of the worst word included in the motif
by the ChIPMunk ZOOPS segmentation procedure.

NOTE! ChIPHorde will not work if the 'oops' mode
is specified for ChIPMunk.

If you want more motifs, the ZOOPS segmentation is better set to 0.0
(strict mode) to ensure that at each step the worst selected word
has a good PWM score (so that the masking threshold is not too low).
If you only want the top 2 motifs, a ZOOPS coefficient of 1.0 is enough.

The ChIPHorde command line is very similar
to that of ChIPMunk: you can specify the 
iteration counts, a specific GC-content to account for
(extremely important for heavily GC-shifted genomes), and so on.

- OCCS line - (if you specified the verbose parameter)
The OCCS lines in the ChIPHorde output
refer to the motif occurrences in the input dataset.

The line itself looks like
  OCCS|0;44; CGCCTAATCT:6:DIRECT
where 0 is the index of the dataset in the input data
(always 0 if you supply only one fasta-file),
44 is the index of the sequence in the set,
CGCCTAATCT is the word at that position, 
6 is the index of the word in the sequence,
DIRECT is the strand of the occurrence.

[NOTE] All indices are ZERO-based.
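
For post-processing, an OCCS line can be split into its fields with a
few lines of Python. This sketch is based only on the single example
line above: the names Occurrence and parse_occs are hypothetical, and
strand labels other than DIRECT are passed through unchanged because
they are not listed here.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    dataset: int    # zero-based index of the dataset
    sequence: int   # zero-based index of the sequence in the set
    word: str       # the word found at that position
    position: int   # zero-based index of the word in the sequence
    strand: str     # strand label, e.g. 'DIRECT'


def parse_occs(line: str) -> Occurrence:
    """Parse one line of the form 'OCCS|0;44; CGCCTAATCT:6:DIRECT'."""
    tag, _, payload = line.strip().partition("|")
    if tag != "OCCS":
        raise ValueError(f"not an OCCS line: {line!r}")
    dataset_text, sequence_text, hit_text = payload.split(";", 2)
    word, position_text, strand = hit_text.strip().split(":")
    return Occurrence(int(dataset_text), int(sequence_text), word,
                      int(position_text), strand)


print(parse_occs("OCCS|0;44; CGCCTAATCT:6:DIRECT"))
```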


=====================================================
*** <<<<      Notes on 'verbose' mode         >>> ***
-----------------------------------------------------

Note that in verbose mode ChIPMunk will print out 
the words used to construct the motif.

ChIPHorde will print out BOTH the words used to construct
the motifs AND the motif occurrences in the sequences
(note that typically only the single best word from each sequence
is used during motif construction, while there can be
many motif hits reported by ChIPHorde OCCS).


------------------- Version history -----------------------

Version  4.2 - To reduce noise from extremely high peaks (caused by wrongly mapped reads or PCR artifacts)
               a log-weighting scheme is now applied to the peak heights
Version  4.1 - "r:" RNA motif discovery (simple mode for single strand search)
               fixed weird sequence indexes for "peak" mode if the input
               peaks were not sorted according to the peak height
Version    4 - ChIPMunk now lives in the 'autosome.ru' package
Version  3.3 - polyN stacking for ZOOPS mode fixed;
               ChIPHorde output beautified;
               ChIPMunk now reports positions of aligned words
Version  3.2 - small bugfixes, tweaked output, default mode added
Version  3.1 - fixed rare multithreading bug
Version  3.0 - huge source refactoring, motif shape support
Version  2.0 - ChIPHorde extension for a sequential motif discovery
Version 1.17 - support for positional weight profiles over sequences; 
               ChIP-Seq peak data analysis
Version  1.o - bugfixes and tweaks for experimental
               support of positional information 
               weight profiles over sequences
Version  1.n - length estimation procedure simplified and formalized
Version  1.m - multi-core CPU support
Version  1.0 - first public release