
Copyright 1998 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
Dept. of Molecular, Cellular, and Developmental Biology
University of Colorado
Campus Box 347
Boulder, CO 80309-0347
hertz@colorado.edu
P-VALUE (version 3a)
This program estimates the p-value of negative log-likelihood scores
for an alignment matrix (i.e. the probability of observing a
particular negative log-likelihood score or greater in an arbitrary
alignment of random sequences). A negative log-likelihood score is
the negative of the standard log-likelihood ratio statistic and equals
the information content multiplied by the number of sequences in the
alignment. The probability calculation assumes that the letters at
each position of a sequence are independent and identically
distributed.
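The relation between the score and the information content can be
sketched as follows. This is an illustration only, not code from the
program: the function name, the use of natural logarithms, and the
representation of the alignment as per-position letter counts are all
assumptions made for the example.

```python
import math

def neg_log_likelihood_score(counts, priors):
    """Information content times the number of sequences (hypothetical helper).

    counts: one dict per alignment position, mapping letter -> count.
    priors: dict mapping letter -> prior probability (assumed to sum to 1).
    Natural logarithms are an assumption; the source does not state the base.
    """
    n = sum(counts[0].values())  # number of sequences in the alignment
    total = 0.0
    for column in counts:
        if sum(column.values()) != n:
            raise ValueError("every position must contain the same number of sequences")
        for letter, c in column.items():
            if c > 0:  # letters that do not occur contribute nothing
                total += c * math.log((c / n) / priors[letter])
    return total

# Four sequences, width 2, uniform priors (example values only).
priors = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
counts = [
    {"a": 4, "c": 0, "g": 0, "t": 0},
    {"a": 2, "c": 2, "g": 0, "t": 0},
]
score = neg_log_likelihood_score(counts, priors)
```

Under these assumptions the first position contributes 4 ln 4 and the
second 4 ln 2, so the score is 12 ln 2.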
P-values are estimated using a technique from large-deviation
statistics. The estimate is least accurate for scores very close to
the maximum possible score, where it may be off by a factor of 2.
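For small alignments, an estimate can be cross-checked by brute-force
simulation. The sketch below is NOT the large-deviation technique used
by the program; it is a plain Monte Carlo estimate under the same
independence assumption, it uses natural logarithms (an assumption),
and every name in it is hypothetical. It is only practical for
p-values that are not too small.

```python
import math
import random

def monte_carlo_p_value(score, width, n_seqs, priors, trials=20000, seed=0):
    """Fraction of random alignments whose score is >= the given score.

    Hypothetical cross-check, not the program's method. Each letter of each
    random sequence is drawn independently from the prior distribution.
    """
    rng = random.Random(seed)
    letters = list(priors)
    weights = [priors[letter] for letter in letters]
    hits = 0
    for _ in range(trials):
        total = 0.0
        for _ in range(width):
            # One alignment column: one random letter per sequence.
            column = rng.choices(letters, weights=weights, k=n_seqs)
            for letter in set(column):
                c = column.count(letter)
                total += c * math.log((c / n_seqs) / priors[letter])
        if total >= score:
            hits += 1
    return hits / trials

uniform = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
estimate = monte_carlo_p_value(8.0, width=2, n_seqs=4, priors=uniform, trials=2000)
```

The resolution of such an estimate is limited to 1/trials, which is
why an analytical approximation is needed for highly significant
scores.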
Negative log-likelihood scores are read from the standard input,
unless the -b or the -s command line option is used (see section 2).
The output is printed to the standard output.
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) Description of the alignment whose p-value is being determined.
-L integer: the width of the alignment being analyzed (required).
-n integer: the number of sequences in the alignment (required).
-c3: the alignment is constrained to be a symmetrical nucleotide
alignment; requires that a symmetrical alphabet be defined in
section 3.
2) The score---i.e. the negative log-likelihood ratio---whose p-value
is being determined. If neither option is used, the scores are read
from the standard input.
-s score: determine the p-value of the indicated score.
-b integer: divide the range of possible scores into the indicated
number of equal-sized bins. Determine and print the p-value
for the minimum score in each bin and for the maximum possible
score. (The number of scores printed will be one more than the
indicated integer.)
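The scores reported by the -b option can be pictured with a short
sketch. The function name is hypothetical, and the minimum and maximum
possible scores are passed in as arguments because the text does not
say how the program derives them.

```python
def bin_scores(min_score, max_score, bins):
    """Return the minimum score of each equal-sized bin plus the maximum score.

    Hypothetical helper: min_score and max_score stand in for the range of
    possible scores, which the program determines itself.
    """
    if bins < 1:
        raise ValueError("the number of bins must be at least 1")
    width = (max_score - min_score) / bins
    scores = [min_score + k * width for k in range(bins)]
    scores.append(max_score)  # one extra entry for the maximum possible score
    return scores

# Example values only: 4 bins over the range 0 to 10 give 5 scores.
example = bin_scores(0.0, 10.0, 4)
```

With 4 bins the list has 5 entries, matching the "one more than the
indicated integer" rule.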
3) Alphabet options
The following 3 options are mutually exclusive (default: "-a
alphabet").
-a filename: file containing the alphabet and normalization
information.
Each line contains a letter (a symbol in the alphabet) followed by
an optional normalization number (default: 1.0). The normalization
is based on the relative prior probabilities of the letters. The
prior probability of a letter might be its overall frequency in all
the sequences of a particular organism or of a particular subset of
sequences. In nucleic acid alphabets, a letter and its complement
appear on the same line, separated by a colon (a letter can be its
own complement). Complementary letters may use the same
normalization number. Only the standard 26 letters are permissible;
however, when the "-CS" option is used, the alphabet is case
sensitive so that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
-A alphabet_and_normalization_information: same as the "-a" option,
except that the information appears on the command line
(e.g., -A a:t 3 c:g 2).
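The letter-based line formats listed above, and the "-A" example, can
be read with a small parser such as the sketch below. It is not taken
from the program: all function names are hypothetical, the grouping of
"-A" tokens into lines is a guess based on the single example given,
and turning normalization numbers into prior probabilities by dividing
by their sum is an assumption.

```python
def parse_alphabet_line(line, case_sensitive=False):
    """Return [(letter, normalization), ...] for one alphabet line."""
    fields = line.split()
    if not 1 <= len(fields) <= 2:
        raise ValueError(f"bad alphabet line: {line!r}")
    letters = fields[0].split(":")
    norms = [float(x) for x in fields[1].split(":")] if len(fields) == 2 else [1.0]
    if len(letters) > 2 or len(norms) > len(letters):
        raise ValueError(f"bad alphabet line: {line!r}")
    if len(norms) < len(letters):
        norms = norms * len(letters)  # the complement shares the one number given
    entries = []
    for letter, norm in zip(letters, norms):
        if not (len(letter) == 1 and letter.isascii() and letter.isalpha()):
            raise ValueError(f"not one of the 26 standard letters: {letter!r}")
        entries.append((letter if case_sensitive else letter.lower(), norm))
    return entries

def parse_alphabet(lines, case_sensitive=False):
    """Map each letter to its normalization number."""
    alphabet = {}
    for line in lines:
        if line.strip():
            alphabet.update(parse_alphabet_line(line, case_sensitive))
    return alphabet

def group_tokens(tokens):
    """Regroup "-A" command-line tokens into alphabet lines (assumed layout)."""
    lines = []
    for token in tokens:
        if token[0].isalpha():
            lines.append(token)        # a letter starts a new entry
        elif lines:
            lines[-1] += " " + token   # a number belongs to the previous entry
        else:
            raise ValueError("normalization number without a letter")
    return lines

alphabet = parse_alphabet(group_tokens("a:t 3 c:g 2".split()))
total = sum(alphabet.values())
priors = {letter: norm / total for letter, norm in alphabet.items()}  # assumption
```

Under these assumptions, "-A a:t 3 c:g 2" gives a and t a prior of 0.3
each and c and g a prior of 0.2 each.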
4) Alphabet modifier indicating whether ascii alphabets are case
sensitive---the following option is mutually exclusive with
the "-i" option (default: ascii alphabets are case insensitive).
-CS: ascii alphabets are case sensitive.