Copyright 1991, 1992, 1993, 1994, 1998, 2003 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
gzhertz AT alum.mit.edu
GMAT-INF-GC (version 2c)
This program determines the information content of an alignment
matrix. It can also optionally determine and graph the information
content for each individual position of the matrix.
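These directions do not state the formula the program uses. As a rough sketch only, the following assumes the common relative-entropy definition of information content, measured in nats, with gaps and the sample-size adjustments left out. The function names and the dictionary layout are hypothetical and are not taken from the program.

```python
import math


def position_information(counts, priors):
    """Relative entropy (in nats) of one alignment position.

    counts: dict mapping letter -> observed count at this position
    priors: dict mapping letter -> prior probability of that letter
    Assumed formula: sum over letters of f * ln(f / p), where f is the
    observed frequency and p the prior probability.
    """
    total = sum(counts.values())
    info = 0.0
    for letter, n in counts.items():
        if n > 0:  # a letter that never occurs contributes nothing
            f = n / total
            info += f * math.log(f / priors[letter])
    return info


def matrix_information(columns, priors):
    """Return (total, per-position list) for a list of count dicts."""
    per_position = [position_information(c, priors) for c in columns]
    return sum(per_position), per_position
```

Under this assumption, a position occupied by a single letter out of four equally likely letters contributes ln 4, and a position with all four letters equally represented contributes 0.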
Matrices can be either horizontal or vertical. In a horizontal
matrix, the columns correspond to the positions within the alignment,
and the rows correspond to the letters. Each row begins with the
corresponding letter (or integer, if the "-i" option is used). An
optional row corresponding to the number of gaps begins with a dash (-).
If the matrix contains a gap row, then it can also contain the 4
optional correlation rows: LL corresponding to the number of letters
preceded by a letter, -L corresponding to the number of letters
preceded by a gap, L- corresponding to the number of gaps preceded by
a letter, and -- corresponding to the number of gaps preceded by a gap.
In a vertical matrix, the rows correspond to the positions within the
alignment, and the columns correspond to the letters. The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column. The first row can also contain the gap
and correlation labels as described in the previous paragraph
(i.e. -, LL, -L, L-, --) to indicate the presence of the corresponding
optional columns. In both types of matrices, spaces, tabs, and
vertical bars (|) are ignored.
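The two layouts described above can be read with a short routine such as the following. This is a minimal sketch, not the program's own parser: the function name parse_matrix and the returned dictionary are hypothetical, entries are read as floating-point numbers, and comments are assumed to have been removed already.

```python
def parse_matrix(text, vertical=False):
    """Return {label: [one value per alignment position]}.

    Labels are letters (or integers), "-" for the gap row or column,
    and optionally the correlation labels LL, -L, L-, --.
    """
    rows = []
    for line in text.splitlines():
        # Vertical bars are ignored, just like spaces and tabs.
        fields = line.replace("|", " ").split()
        if fields:
            rows.append(fields)

    if not vertical:
        # Horizontal: each row is a label followed by one value per position.
        return {fields[0]: [float(v) for v in fields[1:]] for fields in rows}

    # Vertical: the first row holds the labels, each later row is a position.
    labels = rows[0]
    matrix = {label: [] for label in labels}
    for fields in rows[1:]:
        if len(fields) != len(labels):
            raise ValueError("row length does not match the label row")
        for label, value in zip(labels, fields):
            matrix[label].append(float(value))
    return matrix
```

Both layouts yield the same dictionary, so later steps do not need to know which layout the file used.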
The input files can contain comments according to the following
convention. The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored. Comments can begin anywhere in a
line and always end at the end of the line. The output of this
program is sent to the standard output.
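The comment convention can be illustrated with a few lines of Python. This is a sketch with a hypothetical function name; it also drops lines that become empty, which is a choice made here for convenience and not something the directions specify.

```python
import re

# A comment starts at the first ';', '%', or '#' and runs to the end of the line.
_COMMENT = re.compile(r"[;%#].*")


def strip_comments(text):
    """Remove comments from every line and drop lines left empty."""
    kept = []
    for line in text.splitlines():
        line = _COMMENT.sub("", line).rstrip()
        if line:
            kept.append(line)
    return "\n".join(kept)
```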
The following options can be set on the command line.
0) -h: print these directions.
1) Information options.
-n integer: The number of sequences in the alignment. This option
is necessary only if no position of the alignment contains a
representative from every sequence. This situation can only
occur in alignments that ignore terminal gaps. (default: the
maximum number of sequences at each position of the alignment)
-sa: Adjust the information for sample size by subtracting the
average background expected from a random alignment.
-st number: Adjust the information by subtracting the indicated
number of standard deviations expected from a random alignment
from each position of the alignment. This option can only be
used when the "-sa" option is also used.
2) Graphing options (default: do NOT graph the information content)
-g1: determine and graph the information content for each individual
position of the matrix and print the matrix.
-g2: determine and graph the information content for each individual
position of the matrix, but do NOT print the matrix.
3) Matrix options.
-m filename: file containing the matrix (default is the standard
input).
-v: the matrix is a vertical matrix (default: horizontal matrix).
4) Alphabet options---the three options in this section are
mutually exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and normalization
information.
[Use "-af" when using the VMS operating system]
Each line contains a letter (a symbol in the alphabet) followed by
an optional normalization number (default: 1.0). The normalization
is based on the letter's relative prior probability when generating
the alignment. For nucleic acids, this would typically be the
genomic frequency of the bases or the frequency observed in the
dataset used to generate the alignment. In nucleic acid alphabets,
a letter and its complement appear on the same line, separated by a
colon (a letter can be its own complement, e.g. when using a dimer
alphabet). Complementary letters may use the same normalization
number. Only the standard 26 letters are permissible; however,
when the "-CS" option is used, the alphabet is case sensitive so
that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
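The five line formats can be parsed along the following lines. This is a sketch under stated assumptions: the function name and return layout are hypothetical, and when a complementary pair is given a single normalization number, that number is assumed to apply to both letters.

```python
def parse_alphabet_line(line):
    """Return {letter: (complement or None, normalization)} for one line."""
    fields = line.split()
    if len(fields) not in (1, 2):
        raise ValueError(f"expected 1 or 2 fields: {line!r}")

    letters = fields[0].split(":")
    # The normalization defaults to 1.0 when the second field is absent.
    norms = [float(x) for x in fields[1].split(":")] if len(fields) == 2 else [1.0]
    if len(letters) > 2 or not all(letters) or len(norms) > len(letters):
        raise ValueError(f"unrecognized alphabet line: {line!r}")

    if len(letters) == 1:
        return {letters[0]: (None, norms[0])}

    letter, complement = letters
    # Assumption: one number for a pair is shared by both letters.
    comp_norm = norms[1] if len(norms) == 2 else norms[0]
    return {letter: (complement, norms[0]), complement: (letter, comp_norm)}
```

For example, the line "a:t 3" yields a normalization of 3.0 for both a and t, while "c:g 2:5" yields 2.0 for c and 5.0 for g.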
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
[Use "-if" when using the VMS operating system]
-A alphabet_and_normalization_information: same as the "-a" option,
except that the information appears on the command line
(e.g., -A a:t 3 c:g 2).
[Use "-ac" when using the VMS operating system]
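With "-A", the alphabet arrives as a flat list of command-line words, so each normalization number has to be matched to the symbol it follows. The sketch below shows one way to do that grouping. The function names are hypothetical, and the sketch assumes the words belonging to "-A" have already been separated from the other options, since the directions do not say how the program finds the end of the list.

```python
def _is_normalization(token):
    """True if every colon-separated part of the token is a number."""
    try:
        for part in token.split(":"):
            float(part)
    except ValueError:
        return False
    return True


def group_alphabet_args(tokens):
    """Group words such as ['a:t', '3', 'c:g', '2'] into
    (symbols, normalization or None) pairs."""
    entries = []
    for token in tokens:
        if token[0].isalpha():
            # A word starting with a letter begins a new alphabet entry.
            entries.append([token, None])
        elif _is_normalization(token) and entries and entries[-1][1] is None:
            # A number belongs to the entry immediately before it.
            entries[-1][1] = token
        else:
            raise ValueError(f"unexpected word in alphabet list: {token!r}")
    return [tuple(entry) for entry in entries]
```

Each resulting pair corresponds to one line of an alphabet file, so "-A a:t 3 c:g 2" is equivalent to a file with the lines "a:t 3" and "c:g 2".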
5) Alphabet modifier indicating whether ascii alphabets are case
sensitive---the following option is mutually exclusive with
the "-i" option (default: ascii alphabets are case insensitive).
-CS: ascii alphabets are case sensitive.
[Use "-as" when using the VMS operating system]