Copyright 1991, 1992, 1993, 1994, 1998, 2003 Gerald Z. Hertz
May be copied for noncommercial purposes.

Author:
  Gerald Z. Hertz
  gzhertz AT alum.mit.edu


GMAT-INF-GC (version 2c)

This program determines the information content of an alignment matrix.
It can also optionally determine and graph the information content for
each individual position of the matrix.

Matrices can be either horizontal or vertical.

In a horizontal matrix, the columns correspond to the positions within
the alignment, and the rows correspond to the letters. Each row begins
with the corresponding letter (or integer, if the "-i" option is used).
An optional row corresponding to the number of gaps begins with a
dash (-). If the matrix contains a gap row, then it can also contain
the 4 optional correlation rows:
   LL corresponding to the number of letters preceded by a letter,
   -L corresponding to the number of letters preceded by a gap,
   L- corresponding to the number of gaps preceded by a letter, and
   -- corresponding to the number of gaps preceded by a gap.

In a vertical matrix, the rows correspond to the positions within the
alignment, and the columns correspond to the letters. The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column. The first row can also contain the gap
and correlation labels as described in the previous paragraph
(i.e. -, LL, -L, L-, --) to indicate the presence of the corresponding
optional columns.

In both types of matrices, spaces, tabs, and vertical bars (|) are
ignored.

The input files can contain comments according to the following
convention. The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored. Comments can begin anywhere in a
line and always end at the end of the line.

The output of this program is sent to the standard output.

The following options can be determined on the command line.

0) -h: print these directions.

1) Information options.
   -n integer: The number of sequences in the alignment. This option
      is necessary only if no position of the alignment contains a
      representative from every sequence. This situation can only
      occur in alignments that ignore terminal gaps. (default: the
      maximum number of sequences at each position of the alignment)

   -sa: Adjust the information for sample size by subtracting the
      average background expected from a random alignment.

   -st number: Adjust the information by subtracting the indicated
      number of standard deviations expected from a random alignment
      from each position of the alignment. This option can only be
      used when the "-sa" option is also used.

2) Graphing options (default: do NOT graph the information content)

   -g1: determine and graph the information content for each
      individual position of the matrix and print the matrix.

   -g2: determine and graph the information content for each
      individual position of the matrix, but do NOT print the matrix.

3) Matrix options.

   -m filename: file containing the matrix (default is the standard
      input).

   -v: the matrix is a vertical matrix (default: horizontal matrix).

4) Alphabet options---the three options in this section are mutually
   exclusive (default: "-a alphabet").

   -a filename: file containing the alphabet and normalization
      information. [Use "-af" when using the VMS operating system]

      Each line contains a letter (a symbol in the alphabet) followed
      by an optional normalization number (default: 1.0). The
      normalization is based on the letter's relative prior
      probability when generating the alignment. For nucleic acids,
      this would typically be the genomic frequency of the bases or
      the frequency observed in the dataset used to generate the
      alignment. In nucleic acid alphabets, a letter and its
      complement appear on the same line, separated by a colon (a
      letter can be its own complement, e.g. when using a dimer
      alphabet). Complementary letters may use the same normalization
      number.
      Only the standard 26 letters are permissible; however, when the
      "-CS" option is used, the alphabet is case sensitive so that a
      total of 52 different characters are possible.

      POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
         letter
         letter normalization

      POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
         letter:complement
         letter:complement normalization
         letter:complement normalization:complement's_normalization

   -i filename: same as the "-a" option, except that the symbols of
      the alphabet are represented by integers rather than by letters.
      Any integer permitted by the machine is a permissible symbol.
      [Use "-if" when using the VMS operating system]

   -A alphabet_and_normalization_information: same as "-a" option,
      except information appears on the command line
      (e.g., -A a:t 3 c:g 2).
      [Use "-ac" when using the VMS operating system]

5) Alphabet modifier indicating whether ascii alphabets are case
   sensitive---the following option is mutually exclusive with the
   "-i" option (default: ascii alphabets are case insensitive).

   -CS: ascii alphabets are case sensitive.
      [Use "-as" when using the VMS operating system]
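

EXAMPLE: READING ONE ALPHABET LINE (illustrative sketch)

The alphabet line formats listed under the "-a" option can be read
along the following lines. This is only a sketch in Python and is not
the program's own parser. The names strip_comment and
parse_alphabet_line are hypothetical. It assumes that the ';', '%',
'#' comment convention also applies to alphabet files, and that a
single normalization number on a "letter:complement" line applies to
both letters. Integer alphabets ("-i") are not handled.

```python
from typing import List, Tuple

COMMENT_CHARS = ";%#"  # comment starters named in the directions above


def strip_comment(line: str) -> str:
    """Drop everything from the first ';', '%', or '#' to the end of the line."""
    cut = len(line)
    for ch in COMMENT_CHARS:
        pos = line.find(ch)
        if pos != -1:
            cut = min(cut, pos)
    return line[:cut]


def parse_alphabet_line(line: str, case_sensitive: bool = False) -> List[Tuple[str, float]]:
    """Return [(letter, normalization), ...] for one line; [] if the line is empty.

    Hypothetical helper: accepts the five line formats listed for "-a".
    """
    tokens = strip_comment(line).split()
    if not tokens:
        return []
    if len(tokens) > 2:
        raise ValueError(f"too many fields: {line!r}")

    symbols = tokens[0].split(":")
    if len(symbols) > 2:
        raise ValueError(f"at most one complement is allowed: {line!r}")
    for sym in symbols:
        # Only the standard 26 letters (52 when case sensitive) are allowed.
        if len(sym) != 1 or not sym.isascii() or not sym.isalpha():
            raise ValueError(f"not a letter: {sym!r}")
    if not case_sensitive:
        symbols = [sym.lower() for sym in symbols]

    if len(tokens) == 1:
        norms = [1.0] * len(symbols)          # documented default
    else:
        norms = [float(x) for x in tokens[1].split(":")]
        if len(norms) == 1:
            norms = norms * len(symbols)      # assumption: shared by the complement
        elif len(norms) != len(symbols):
            raise ValueError(f"normalization does not match letters: {line!r}")

    pairs = list(zip(symbols, norms))
    # A letter may be its own complement; report it once.
    if len(pairs) == 2 and pairs[0][0] == pairs[1][0]:
        if pairs[0][1] != pairs[1][1]:
            raise ValueError(f"conflicting normalizations: {line!r}")
        pairs = pairs[:1]
    return pairs


if __name__ == "__main__":
    for text in ["a:t 3", "c:g 2", "a:t 3:2.5 ; comment", "n"]:
        print(text, "->", parse_alphabet_line(text))
```

Under these assumptions, the command-line example "-A a:t 3 c:g 2"
corresponds to the two lines "a:t 3" and "c:g 2", each read by one
call to parse_alphabet_line.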