/*****************************************************************************
# Copyright (C) 1993-1996 by Phil Green.
# All rights reserved.
#
# This software is part of a beta-test version of the swat/cross_match/phrap
# package.  It should not be redistributed or used for any commercial
# purpose, including commercially funded sequencing, without written
# permission from the author and the University of Washington.
#
# This software is provided ``AS IS'' and any express or implied
# warranties, including, but not limited to, the implied warranties of
# merchantability and fitness for a particular purpose, are disclaimed.
# In no event shall the authors or the University of Washington be
# liable for any direct, indirect, incidental, special, exemplary, or
# consequential damages (including, but not limited to, procurement of
# substitute goods or services; loss of use, data, or profits; or
# business interruption) however caused and on any theory of liability,
# whether in contract, strict liability, or tort (including negligence
# or otherwise) arising in any way out of the use of this software, even
# if advised of the possibility of such damage.
#
#****************************************************************************/

/*
SWAT: a program for searching one or more DNA or protein query sequences,
or a query profile, against a sequence database, using (an efficient
implementation of) the Smith-Waterman or Needleman-Wunsch algorithms,
with linear (affine) gap penalties.  For each match, an empirical measure
of statistical significance derived from the observed score distribution
is computed.

5/11/94  N.B. The NW option will not work correctly until the function call
has been made similar to that for s-w.

Usage:

   swat query_file_name database_file_name [-option value] [-option value] ...

Available options are:

 option name      value to be provided                         default value

 -matrix (or -M)  name of file containing score matrix         BLOSUM50
                  or profile
 -gap_init        gap initiation penalty (= penalty for        -12
                  first residue in a gap)
 -gap_ext         gap extension penalty (= penalty for         -2
                  each subsequent residue)
 -ins_gap_ext     insertion gap extension penalty (= penalty   gap_ext
                  for each subsequent residue, for insertions
                  in subject relative to query)
 -del_gap_ext     deletion gap extension penalty (= penalty    gap_ext
                  for each subsequent residue, for deletions
                  in subject relative to query)
 -end_gap         gap extension penalty at ends of             -1
                  sequences (N-W algorithm only)
 -E               E-value cutoff; alignments with an E-value   1.0
                  higher than this will not be displayed
 -z               z-score cutoff; alignments with a z-score    6.0
                  lower than this will not be displayed
                  (see below for description of z-scores)
 -N               maximum no. of displayed alignments          20
 -truncatedb      no. of entries at start of db that will be
                  searched; useful for testing purposes only.
                  The default is to search the entire database.
 -nw              [flag; no value should be provided.  Indicates that the
                  Needleman-Wunsch algorithm should be used instead of
                  Smith-Waterman.]
 -raw             [flag indicating that z- and E-values should not be
                  computed, and that matches should be ranked by their raw
                  alignment scores.]
 -file            [flag to create an output file for each query, giving the
                  "complete" results of the search (but without alignments).
                  This option is useful primarily for testing the relative
                  power of different score matrices and gap penalties and
                  will not be needed by most users.  The file name is
                  constructed automatically for each query and should NOT be
                  given on the command line; it consists of the query
                  sequence ID, followed by the suffix ".allscores".  On
                  completion of the run the file contains one line for each
                  database entry giving the ID, length, score, and z-score.
                  Entries are sorted by decreasing z-score.  This file is
                  produced independently of the usual program output and is
                  unaffected by the -z, -N, or -E settings.]
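For example (the file names here are placeholders rather than files
supplied with the package, and a score matrix file named BLOSUM50 is
assumed to be present in the working directory), a protein search that
reports at most 50 alignments with E-value below 0.01 could be run as:

   swat my_queries.fasta my_database.fasta -M BLOSUM50 -E 0.01 -N 50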
The query file (which may contain more than one query sequence, unless a
profile search is to be performed, in which case it may contain only one
query sequence) and the database file must each be in FASTA format: each
entry must have a single header line whose leading character '>' is
followed immediately by the ID of the sequence; the remainder of this
line may contain an optional description.  The header line is followed by
lines containing the sequence.  Any non-alphabetic characters are
ignored, lower case letters are automatically converted to upper case,
and any letter not in the score matrix alphabet is converted to the last
character in this alphabet (typically an 'X' or '*').

N.B. NO LONGER CASE SENSITIVE -- all row and column labels should be
capital letters.

An appropriate matrix must be provided in order to score alignments
between the query (or profile) and each "subject" (database sequence).
The matrix may be either a conventional score matrix or a profile, but
must contain only integer score values.  The format of the matrix file is
as follows:

  Any blank lines, and lines beginning with #, are ignored.

  A line beginning with the word "GAP" is taken to contain the gap
  initiation and gap extension penalties (a pair of integers separated by
  spaces); this line must precede the line containing the column labels.
  The values given here will be overridden by command line parameters.

  The first non-blank line not beginning with # or GAP is assumed to
  contain the column labels, each of which should be a letter or '*';
  there should be one column for each distinct type of residue that may
  appear in a subject sequence.  The same label should not be used for
  two different columns.

  Succeeding lines should contain the rows of the matrix.  Each row may
  optionally include a label (i.e. a single letter or '*' at the
  beginning of the line); if these are omitted, the row labels are taken
  to be the same as the column labels.

If the matrix is a profile, row labels must be included, a single query
sequence must be provided in the query file whose positions correspond
1-1 to the rows of the profile, and the sequence of this query must match
the list of row labels (the only role of the query sequence in this case
is to identify positions in printed alignments -- it plays no role in the
analysis).  If the matrix is a conventional score matrix, there should be
one row for each type of residue that may appear in the query.

Given an alignment, the score for an aligned pair of residues is
determined by looking up the matrix entry whose column label matches the
subject residue and whose row corresponds to the query residue, either by
position (if the matrix is a profile) or by label (if it is a
conventional score matrix).  Case is significant (so 'A' does not match
'a').  Residues for which there is no matching label are assigned to the
last column (or row).
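As an illustration of these formats, a query file entry and a toy
nucleotide score matrix file might look as follows; the labels and score
values are made up for illustration only and are not recommended for real
searches:

   >query1 an example query sequence
   ACGTACGTTGCAacgtnnACGT

   # example score matrix file; blank lines and '#' lines are ignored
   GAP -12 -2
      A  C  G  T  N
   A  5 -4 -4 -4 -2
   C -4  5 -4 -4 -2
   G -4 -4  5 -4 -2
   T -4 -4 -4  5 -2
   N -2 -2 -2 -2 -1

Here any residue other than A, C, G, or T is scored using the last column
(or row), labelled N, as described above.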
Note that three different options, -E, -z, and -N, influence how many
alignments are displayed.  If a single one of these is set on the command
line it becomes the decisive criterion; if more than one is set, the most
stringent is used.  If none is set, the most stringent of the three
default values is used.

By default, matches are sorted and displayed by decreasing z-score (see
below), or equivalently E-value, rather than by raw alignment score,
unless the flags -raw or -nw are set.  N.B. For Needleman-Wunsch
searches, or if the database is small, z- and E-scores are not computed
and the matches are instead ranked by the raw alignment score.

Current restrictions:

 (i)   In Smith-Waterman searches, if the alignment score for a database
       entry is less than -1 * gap_init it will be reported as 0.  This
       restriction can be relaxed (at some cost in execution time) by
       moving the position of the corresponding block of statements in
       the source code.  It is irrelevant for most purposes unless a very
       large gap_init value is used.

 (ii)  ints must be at least 32 bits (so this program may not work with
       some compilers on DOS machines).

 (iii) The statistical analyses currently assume that "chance" scores are
       very unlikely to exceed 200.  This assumption is reasonable when
       protein searches are done using the usual score matrices; it may
       not be reasonable for searches of nucleotide databases.  It can
       easily be relaxed by changing appropriate program parameters.

 (iv)  z- and E-values are only computed for Smith-Waterman searches.
       They may be unreliable for some combinations of score matrix and
       gap penalties (see below), or if the database is small.

 (v)   A single set of gap penalties is allowed; i.e. they cannot be made
       position-specific.

Output: Mostly self-explanatory; includes a summary of the statistical
analysis of the score distribution, and the alignments, scores, and
E-values of the best matches.  Matches are ranked by decreasing z-score.

Statistical procedures / ranking of matches:

(N.B. The following applies only to Smith-Waterman searches.)

When the database sequences vary greatly in length (as is usually the
case), theoretical arguments and empirical studies suggest that the raw
alignment score is less sensitive, for discriminating true homologies
from chance matches, than an adjusted score which has a null distribution
independent of the length of the database sequence.  SWAT attempts to
compute such an adjusted score in the form of a "z-score", defined for a
given database sequence by

      z = (S - f(n)) / sqrt(g(n))

where S and n are the raw alignment score and length of the database
sequence, and f(n) and g(n) are the predicted mean score and variance for
sequences of length n.  z-scores thus should have mean 0 and standard
deviation 1, independent of n.  Like the raw score S, but unlike the
E-value, z should be relatively independent of database size (although it
may depend somewhat on the characteristics of database sequences);
moreover it has a more easily interpretable scale than the raw score.

At present, f and g are estimated from the observed score distribution
for the database search, by regressing the observed scores and squared
residuals, respectively, against log(n).  This functional form has some
theoretical justification (see below) and appears to work well for the
score matrices and gap penalties in common use, although we are still
exploring whether other functions may work better in some situations.
(The regressions are actually performed twice, the results of the first
being used to identify high-scoring outliers (e.g. true homologies),
which are then eliminated prior to performing the second regression.)
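To make the z-score computation concrete, here is a minimal C sketch of
the calculation just described, assuming f(n) and g(n) have already been
fitted as linear functions of log(n) (coefficients a, b for the mean and
c, d for the variance).  The function and parameter names are
illustrative only and do not correspond to SWAT's internal code.

   #include <math.h>

   // z-score for raw alignment score S against a subject of length n,
   // given regression coefficients for the predicted mean (a + b*log n)
   // and variance (c + d*log n) of chance scores.  Illustrative sketch.
   double z_score(double S, int n, double a, double b, double c, double d)
   {
       double logn = log((double) n);
       double mean = a + b * logn;      // f(n)
       double var  = c + d * logn;      // g(n)

       if (var <= 0.0)
           return 0.0;                  // guard against a degenerate fit
       return (S - mean) / sqrt(var);
   }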
When the search has "local" behavior (i.e. the score distribution
conforms to Karlin-Altschul type formulae), known theoretical results
imply that f(n) is in fact a linear function of log(n), g(n) is constant,
and z is distributed as the (unique) Gumbel distribution with mean 0 and
variance 1, independently of n.  Consequently it is easy to convert
z-values to E-values in this case, and the two are monotonically related
to each other.

For searches that do not have local behavior, it is not known what the
appropriate theoretical forms for f and g are, nor what the distribution
of z is, nor even whether this distribution is independent of n.
Nonetheless z-scores should still be preferable to raw scores, and
consequently the same procedures for computing z and E are used by SWAT
in this case.  They appear to work reasonably well at least for PAM250
and gap penalties -12 and -2, a particular combination having non-local
behavior that Pearson has found to be optimal for Smith-Waterman
searches.  Our tests using random queries indicate that the E-values
obtained in this case tend to be somewhat conservative.

A second procedure for estimating statistical significance is also
included in the code but no longer used.  It explicitly assumes the
search has local behavior and is based on fitting an extreme-value
(Gumbel) distribution to the observed score distribution using maximum
likelihood estimation (inspired by, but simpler than, Mott's procedure).
(The idea of using the observed score distribution to estimate
significance appears to be due to Collins and Coulson.)  The top part of
the distribution (by default, the top 5% of the scores) is ignored, and a
censored Gumbel distribution is fitted to the remainder to get estimates
of lambda and K (in the notation of Karlin and Altschul).  (This is
different from Mott's procedure, which uses an outlier detection
criterion that may miss many true positives.  Also, we assume a simpler
form for the distribution that does not allow lambda to vary for
different database sequences; this makes it possible to use an iterative
algorithm for finding the maximum likelihood parameter estimates that
should converge much more rapidly.)

For a brief description of the implementation of the s-w algorithm, see
the header to smith_wat.c.
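As a concrete illustration of the z-to-E conversion under the Gumbel
assumption discussed above, the following C sketch computes the tail
probability of the mean-0, variance-1 Gumbel distribution and scales it
by the number of database entries.  The scaling by the number of entries
is an assumption about how an E-value would be defined here, and the
function is illustrative only, not SWAT's actual code.

   #include <math.h>

   #define EULER_GAMMA 0.5772156649015329   // Euler-Mascheroni constant

   // E-value for z-score z, assuming z follows the Gumbel distribution
   // with mean 0 and variance 1: scale parameter beta = sqrt(6)/pi,
   // location mu = -EULER_GAMMA * beta, so
   // P(Z > z) = 1 - exp(-exp(-(z - mu) / beta)).
   double z_to_E(double z, int n_db_entries)
   {
       double pi = 3.14159265358979323846;
       double beta = sqrt(6.0) / pi;
       double mu = -EULER_GAMMA * beta;
       double p_tail = 1.0 - exp(-exp(-(z - mu) / beta));
       return n_db_entries * p_tail;
   }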