Copyright 1998 Gerald Z. Hertz
May be copied for noncommercial purposes.

Author:
  Gerald Z. Hertz
  Dept. of Molecular, Cellular, and Developmental Biology
  University of Colorado
  Campus Box 347
  Boulder, CO  80309-0347

  hertz@colorado.edu


P-VALUE (version 3a)

This program estimates the p-value of negative log-likelihood scores for
an alignment matrix (i.e., the probability of observing a particular
negative log-likelihood score or greater in an arbitrary alignment of
random sequences).  A negative log-likelihood score is the negative of
the standard log-likelihood ratio statistic and equals the information
content multiplied by the number of sequences in the alignment.  The
probability calculation assumes that the letters at the positions of a
sequence are independent and identically distributed.

P-values are estimated using a technique from large-deviation statistics.
The estimate is least accurate for scores very close to the maximum
possible score, where it may be off by a factor of 2.

Negative log-likelihood scores are read from the standard input, unless
the -b or the -s command line option is used (see section 2).  The output
is printed to the standard output.


COMMAND LINE OPTIONS:

0) -h: print these directions.

1) Description of the alignment whose p-value is being determined.

   -L integer: the width of the alignment being analyzed (required).

   -n integer: the number of sequences in the alignment (required).

   -c3: the alignment is constrained to be a symmetrical nucleotide
        alignment; requires that a symmetrical alphabet be defined
        in section 3.

2) The score---i.e., the negative log-likelihood ratio---whose p-value is
   being determined.  If neither option is used, the scores are read from
   the standard input.

   -s score: determine the p-value of the indicated score.

   -b integer: divide the range of possible scores into the indicated
               number of equal-sized bins.  Determine and print the
               p-value for the minimum score in each bin and for the
               maximum possible score.
               (The number of scores printed will be one more than the
               indicated integer.)

3) Alphabet options
   The following 3 options are mutually exclusive (default: "-a alphabet").

   -a filename: file containing the alphabet and normalization information.

      Each line contains a letter (a symbol in the alphabet) followed by an
      optional normalization number (default: 1.0).  The normalization is
      based on the relative prior probabilities of the letters.  The prior
      probability of a letter might be its overall frequency in all the
      sequences of a particular organism or of a particular subset of
      sequences.  In nucleic acid alphabets, a letter and its complement
      appear on the same line, separated by a colon (a letter can be its
      own complement).  Complementary letters may use the same
      normalization number.  Only the standard 26 letters are permissible;
      however, when the "-CS" option is used, the alphabet is case
      sensitive so that a total of 52 different characters are possible.

      POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
         letter
         letter normalization

      POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
         letter:complement
         letter:complement normalization
         letter:complement normalization:complement's_normalization

   -i filename: same as the "-a" option, except that the symbols of the
      alphabet are represented by integers rather than by letters.  Any
      integer permitted by the machine is a permissible symbol.

   -A alphabet_and_normalization_information: same as "-a" option, except
      information appears on the command line (e.g., -A a:t 3 c:g 2).

4) Alphabet modifier indicating whether ascii alphabets are case
   sensitive---the following option is mutually exclusive with the "-i"
   option (default: ascii alphabets are case insensitive).

   -CS: ascii alphabets are case sensitive.
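
ILLUSTRATION OF THE "-a" LINE FORMATS (not part of the program):

The five line formats listed under the "-a" option can be read as in the
following Python sketch.  It is only an illustration under stated
assumptions, not the program's own parser: the names AlphabetEntry and
parse_alphabet_line are hypothetical, fields are assumed to be separated
by whitespace, a single normalization on a letter:complement line is
assumed to apply to both letters, and case-insensitive alphabets are
assumed to be folded to lower case.

```python
from typing import NamedTuple, Optional


class AlphabetEntry(NamedTuple):
    """One parsed alphabet line (hypothetical structure for illustration)."""
    letter: str
    norm: float
    complement: Optional[str]
    complement_norm: Optional[float]


def parse_alphabet_line(line: str, case_sensitive: bool = False) -> AlphabetEntry:
    """Parse one line in any of the five formats listed under "-a".

    Assumptions (not confirmed by the directions): fields are separated
    by whitespace, and a single normalization on a letter:complement
    line is shared by both letters.
    """
    fields = line.split()
    if not 1 <= len(fields) <= 2:
        raise ValueError(f"expected 1 or 2 fields: {line!r}")

    symbols = fields[0].split(":")
    norms = fields[1].split(":") if len(fields) == 2 else []
    if len(symbols) > 2 or len(norms) > len(symbols):
        raise ValueError(f"malformed alphabet line: {line!r}")

    for symbol in symbols:
        # Only the standard 26 letters (upper or lower case) are allowed.
        if len(symbol) != 1 or not (symbol.isascii() and symbol.isalpha()):
            raise ValueError(f"not a single letter: {symbol!r}")

    if not case_sensitive:
        # Assumed folding; mirrors the default case-insensitive behavior.
        symbols = [symbol.lower() for symbol in symbols]

    # The normalization defaults to 1.0 when it is omitted.
    letter_norm = float(norms[0]) if norms else 1.0
    if len(symbols) == 1:
        return AlphabetEntry(symbols[0], letter_norm, None, None)

    complement_norm = float(norms[1]) if len(norms) == 2 else letter_norm
    return AlphabetEntry(symbols[0], letter_norm, symbols[1], complement_norm)
```

Under these assumptions, the line "a:t 3" yields the letter a and its
complement t, each with normalization 3.0, and the line "c:g 2:4" yields
c with normalization 2.0 and g with normalization 4.0.  With
case_sensitive=True, which corresponds to the "-CS" option, "A" and "a"
are kept as distinct symbols.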