Searching for Similarity in Sequences Gary Benson Departments of Computer Science and Biology Boston University ISSCB 2003 Benson Topic 1 Outline Similarity and Alignment • Define homology, similarity by descent and similarity by convergence • Common mutations and their mathematical models • Alignments • Scoring Alignments • Gap penalty functions • Computing the best scoring alignment – the Longest Common Subsequence (LCS) problem ISSCB 2003 Benson Similarity and Biomolecules Similarity is expected among biomolecules that are descended from a common ancestor. Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function while important catalytic, regulatory or structural regions remain similar. ISSCB 2003 Benson Similarity and Evolution Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways. Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules. ISSCB 2003 Benson Three common similarity problems 1. Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. 2. Start with a small set of sequences and identify similarities and differences among them. 3. In many sequences or very long sequences, detect commonly occurring patterns. ISSCB 2003 Benson What is Similarity? How can we measure it? ISSCB 2003 Benson Morphology Morphology is the form and structure of an organism. Should shared morphology mean similarity? ISSCB 2003 Benson Hands ISSCB 2003 Benson Aquatic Shape ISSCB 2003 Benson Shared morphology Shared morphology does not necessarily imply common ancestry. The animals with hands have all evolved from a common ancester with a hand. The ichthyosaur, shark and porpoise each evolved sea life adaptations independently. ISSCB 2003 Benson Homology When similarity is due to common ancestry, we call it homology. ISSCB 2003 Benson Modern molecular biology seeks to understand cellular processes through the action of DNA, RNA, and protein molecules. This will ultimately lead to a biochemical understanding of: • The pathogenesis of infectious diseases like AIDS, hepatitus and SARS. • The mutagenic properties of environmental toxins and how they lead to diseases like cancer. • The etiology of human genetic disease. • Strategies to prevent and treat diseases through drug and vaccine design, gene therapy, risk reduction, etc. ISSCB 2003 Benson How homology helps Given molecular sequences X and Y: X ~ Y AND INFO(Y) ==> INFO(X) (“ ~ ” means similar) ISSCB 2003 Benson Are the Sequences Similar? ISSCB 2003 Benson Are the Sequences Similar • How similar? • What parts are the most similar? Remember, the common ancestor of the two sequences may have existed millions of years ago. ISSCB 2003 Benson How can we tell if the two sequences are similar? Similarity judgements should be based on: • The types of changes or mutations that occur within sequences. • Characteristics of those different types of mutations. • The frequency of those mutations. ISSCB 2003 Benson Common mutations in DNA Substitution: A C G T T G A C A C G A T G A C Deletion: A C G T T G A C A C G A C Insertion: A C G T T G A C A C G C A A G T T G A C ISSCB 2003 Benson Common mutations Duplication: A C G T T G A C A C G T T G A T T G A C Inversion (double stranded DNA shown): A C G T T G A C T G C A A C T G A C T C A A C C A C A G T T G G ISSCB 2003 Benson Frequency of mutations Substitution > Insertion, Deletion >> Duplication > Inversion ISSCB 2003 Benson Evolutionary history of sequences ISSCB 2003 Benson Alignments There are many ways to align two sequences. We just saw one way: T T A C G T A C A G A T T A T - - G G A A C A - - - T A Here is another: T T A C G T – A C A G A T T A T - - - G G A A C - - A T - A Which is better? Remember, we can not choose based on the evolutionary history, because that is unknown. ISSCB 2003 Benson Alignments and Paths through the Alignment Array a c g t g a a t t ISSCB 2003 Benson t a c g c a a Alignments and Paths through the Alignment Array - t a c g c a a c g t g a a t t ISSCB 2003 Benson t a c g - c a a - - a c g t g a a t t a Alignments and Paths: An Alternate Alignment - t a c g c a a c g t g a a t t ISSCB 2003 Benson t - - a c g c a - - a a c g t g - - a a t t a Finding the Best Alignment: Ranking Alignments by Score Score an alignment by • Partitioning it into columns • Assign a weight to each column • Sum the column weights ISSCB 2003 Benson Distance Scoring Distance scoring: • Alignment gets a non-negative score. • Alignment of identical sequences scores zero, all others > zero. • Best alignment has smallest score. Typical scoring functions are: • d(a,a) = 0; identity • d(a,b) = d(b,a) > 0; a ≠ b; substitution • g = d(a, – ) > 0; indel (gap) ISSCB 2003 Benson Similarity Scoring Similarity scoring: • Alignment scores may be positive, zero, or negative. • More similar means larger positive score. • The best alignment has largest score. Typical scoring functions are: • s(a,b) is { > 0 if a and b are similar in one or more characteristics or are observed to substitute frequently for each other; ≤ 0 otherwise }; substitution • g = s(a, – ) < 0; indel (gap) ISSCB 2003 Benson Gap penalty functions • Single character gap penalty g(a, – ) = c (c a constant or a value dependent on a) • Affine (linear) gap penalty g(k) = α + βk (α is a gap opening penalty, β is a gap extension penalty) • Concave gap penalty g(k) = α + β(m(k)) m(k) is a function like log(k) which grows more slowly as k increases. ISSCB 2003 Benson Distance Scoring Alignment parameters: d(a, a) = 0; d(a, b) = + 2, g=+4 A – G C C G T A T A C G A - - T - T 0 4 0 2 4 4 0 4 0 ISSCB 2003 Benson = 18 Similarity Scoring Scoring parameters: s(a, a) = + 5, s(a, b) = - 3, g= -8 A – G C C G T A T A C G A - - T - T 5 5 5 + 5 8 3 8 8 8 - ISSCB 2003 Benson = - 15 Similarity scoring with affine gap Alignment parameters: s(a, a) = + 5, s(a, b) = - 3, g(k) = α + βk, α = - 5, β = - 4 A – G C C G T A T A C G A - - T - T + 5 5 5 5 8 3 8 4 8 ISSCB 2003 Benson = - 11 Computing the Optimal Alignment: The LCS Problem as Prototype The Longest Common Subsequence (LCS) problem is a method for comparing sequences. Although the solution does not produce an alignment, it illustrates a method of dynamic programming that is very similar to that used by alignment algorithms. ISSCB 2003 Benson Longest Common Subsequence Problem Let X be a string of characters. A subsequence X’ of X is formed by discarding zero or more letters of X. Note that the letters in X’ maintain their same order as in X. Let X and Y be two strings. A common subsequence Z is a subsequence of both. A longest common subsequence (LCS) is the longest such Z. Examples: X=abcdeba X’ = a b d b ISSCB 2003 Benson X=abcdeba Y=bebdceacd Z=bdea LCS Problem Given: Two sequences X and Y. Find: An LCS for X and Y. A divide and conquer solution can be developed by looking at what happens to the last letters in each sequence. That is, are they part of the LCS solution or not? ISSCB 2003 Benson Possible ways to split the problem ISSCB 2003 Benson LCS recursion ISSCB 2003 Benson Filling the dynamic programming array b e b d e c a c d ISSCB 2003 Benson 0 0 0 0 0 0 0 0 0 0 a 0 b 0 c 0 d 0 e 0 b 0 a 0 Filling the dynamic programming array b e b d e c a c d ISSCB 2003 Benson 0 0 0 0 0 0 0 0 0 0 a 0 0 0 b 0 1 1 c 0 1 1 d 0 1 1 e 0 1 ? b 0 1 a 0 1 Necessary values in adjacent cells ISSCB 2003 Benson Completed LCS array b e b d e c a c d ISSCB 2003 Benson 0 0 0 0 0 0 0 0 0 0 a 0 0 0 0 0 0 0 1 1 1 b 0 1 1 1 1 1 1 1 1 1 c 0 1 1 1 1 1 2 2 2 2 d 0 1 1 1 2 2 2 2 2 3 e 0 1 2 2 2 3 3 3 3 3 b 0 1 2 3 3 3 3 3 3 3 a 0 1 2 3 3 3 3 4 4 4 Tracing back for a solution 0 1 2 3 4 5 6 7 8 9 b E B d e C a c d 0 0 0 0 0 0 0 0 0 0 a 0 0 0 0 0 0 0 1 1 1 b 0 1 1 1 1 1 1 1 1 1 c 0 1 1 1 1 1 2 2 2 2 d 0 1 1 1 2 2 2 2 2 3 LCS = bdea ISSCB 2003 Benson e 0 1 2 2 2 3 3 3 3 3 b 0 1 2 3 3 3 3 3 3 3 a 0 1 2 3 3 3 3 4 4 4 LCS time complexity There are (n + 1)(m + 1) cells in the LCS score array. Each cell is filled by examining 3 other cells in constant time. The time complexity to fill the array is O(nm). Tracing back for an LCS solution takes at most n + m steps. The total time complexity is therefore O(nm). ISSCB 2003 Benson Topic 2 Outline Types of Alignment Substitution Matrices • Global vs Local Alignment • Recursions for Global, time complexity • Global alignment with affine gap penalty, time complexity • Similarity scoring and local alignment • Recursion for local, time complexity • Finding suboptimal local alignments: declumping • Substitution Matrices ISSCB 2003 Benson Global vs Local Alignment Given two strings, X and Y: • global alignment produces an alignment that contains all of X and all of Y. X Y • local alignment produces an alignment that contains only the best matching substrings, one from X and one from Y. X Y ISSCB 2003 Benson Global vs Local Alignment Global alignment is useful when • The sequences are known to be related throughout their length, for example, similar protein sequences from close species. Local alignment is useful when • The sequences are believed to contain parts that are closely related. ISSCB 2003 Benson Global Alignment Problem Given: two sequences X and Y and alignment scoring functions, Find: the best scoring alignment that includes all of X and all of Y. Solution: Dynamic Programming ISSCB 2003 Benson Global Alignment Analysis of global alignment is similar to the LCS. Alignments can end in one of three ways. In terms of the prefix strings x1…xi and y1…yj, we have: 1. xi and yj are aligned with each other. (Here it makes no difference whether xi and yj are the same.) G[i,j] = G[i – 1, j – 1] + s(xi, yj) X: C G T Y: C G C ISSCB 2003 Benson Global Alignment • xi is deleted (aligned against a dash). G[i, j] = G[i – 1, j] + g X: C A T Y: C A • yj is deleted (aligned against a dash). G[i, j] = G[i, j – 1] + g X: C A – Y: C A A ISSCB 2003 Benson Global alignment recursion (similarity scoring) ISSCB 2003 Benson Global alignment example match = +2, C G T A G C ISSCB 2003 Benson 0 -4 -8 -12 mismatch = - 3, C -4 2 -2 -6 T -8 -2 -1 ? A -12 -6 -5 gap = - 4 G -16 -10 -4 A -20 -14 -8 Global Alignment Example match = +2, C G T A G C ISSCB 2003 Benson 0 -4 -8 -12 -16 -20 -24 mismatch = - 3, C -4 2 -2 -6 -10 -14 -18 T -8 -2 -1 0 -4 -8 -12 A -12 -6 -5 -4 2 -2 -6 gap = - 4 G -16 -10 -4 -8 -2 4 0 A -20 -14 -8 -7 -6 0 1 Global Alignment Example Tracing back for an alignment C G T A G C 0 -4 -8 -12 -16 -20 -24 C -4 2 -2 -6 -10 -14 -18 T -8 -2 -1 0 -4 -8 -12 A -12 -6 -5 -4 2 -2 -6 C G T A G C C – T A G A ISSCB 2003 Benson G -16 -10 -4 -8 -2 4 0 A -20 -14 -8 -7 -6 0 1 Global alignment time complexity As with the LCS problem, there are (n + 1) (m + 1) cells in the dynamic programming array. Each is filled by examining 3 other cells in constant time. The time complexity to fill the array is O(nm). Tracing back for a global alignment takes at most n + m steps. The total time complexity is therefore O(nm). ISSCB 2003 Benson Global alignment and affine gap penalty Recall the affine gap penalty function g(k) = α + βk When xi or yj is deleted, we have to consider that it could be the last of a string of characters that is deleted as one unit. And the size of that unit will affect the deletion cost. ISSCB 2003 Benson Time Complexity (naïve) with Affine Gap Cost For each (i, j) in the alignment matrix, there are O(n + m) posible deletion costs that must be considered in order to choose the optimal cost. Without any improvements, the time complexity grows to O(nm(n + m)) or cubic O(n3) time. ISSCB 2003 Benson Refining the Affine Gap Computation The regularity of deletion costs helps reduce the time complexity. Observe the two tables. ISSCB 2003 Benson Auxillary Functions for Affine Gap E[i,j] is max of all possibilities ISSCB 2003 Benson Auxillary Functions for Affine Gap F[i,j] is the maximum of all possibilities ISSCB 2003 Benson Global Alignment with Affine Gap recursion (similarity scoring) ISSCB 2003 Benson Time Complexity with affine gap cost A total of (n +1)(m + 1) cells must be computed. For each cell, E, F, and G values must be computed. E and F both require looking up 2 values. G requires looking up 3 values. Time to compute scores is O(nm). Tracing back can be done in O(n + m) if the E and F values are retained. This triples the memory space required for scoring arrays (E, F, and G). Total time O(nm). ISSCB 2003 Benson Local Alignment Problem Given: two sequences X and Y and alignment scoring functions, Find: the best scoring alignment over all substring pairs, one from X and one from Y. Solution: Dynamic Programming ISSCB 2003 Benson Local Alignment Looks Harder than Global Alignment Where global alignment asks for the solution to one problem, the best alignment of X[1…m] versus Y[1…n], local alignment asks for the best alignment out of O(n4) subproblems, any substring in X versus any substring in Y: X[h…i] versus Y[k…j] for 1 ≤ h ≤ i ≤ m, 1 ≤ k ≤ j ≤ n Instead, we solve O(n2) subproblems, the best alignment of any substring ending at xi versus any substring ending at yj. ISSCB 2003 Benson Local Alignment and Similarity Scoring Local alignment uses similarity scoring for the following reason. When an alignment score is negative, the alignment is “worse” than no alignment at all. For an (i, j) pair, it often happens that the best alignment of every substring ending at xi with every substring ending at yj has a negative score. Similarity scoring detects these “bad” alignments and local alignment discards them. If every alignment score for an (i, j) cell is negative, then the score is reset to zero. ISSCB 2003 Benson Local Alignment Recursion ISSCB 2003 Benson Local Alignment Time Complexity Proportional to the product of the sequence lengths: O(nm). ISSCB 2003 Benson Finding Suboptimal Alignments When computing local alignment, we may want to know optimal and suboptimal alignments. This can be important in the case where the sequences contain several parts that are similar. X Y ISSCB 2003 Benson High Alignment Scores may not be Independent ISSCB 2003 Benson Declump scores by prohibiting match and substitution pairings from realignment ISSCB 2003 Benson ISSCB 2003 Benson Source: Michael S. Waterman, 1994 ISSCB 2003 Benson Source: Michael S. Waterman, 1994 Substitution Matrices • Used for protein alignments • Substitution rates for amino acid pairs are determined from known similar sequences • Matrices contain log-odds scores • First matrices designed by Margaret Dayhof are called PAM matrices (Point Accepted Mutation) • Current matrices designed by Henikoff and Henikoff are called BLOSUM matrices (Blocks Substitution Matrices) ISSCB 2003 Benson ISSCB 2003 Benson ISSCB 2003 Benson Odds Ratios Odds is a ratio of probabilities for two events which are mutually exclusive. In horse racing, for example, odds is the ratio of the betting that a horse will lose to the betting that the horse will win. So, for a horse with odds of 20 to 1, the betting is 20 times higher that the horse will lose than it will win, while for a horse with odds of 3 to 2, the betting is only 1.5 times higher that the horse will lose than win. ISSCB 2003 Benson Alignment and Odds A substitution score can be interpreted as an odds ratio. For an individual pair of aligned amino acids, the events are • The pair are aligned because they are evolutionarily related • The pair are aligned merely by chance. For each pair, the relevant question is: “What are the odds that amino acid i would be substituted for amino acid j if they were evolutionarily related?” If the odds are good, then the pair supports the alignment. If the odds are bad, then the pair reduces confidence in the alignment. ISSCB 2003 Benson Log Odds The odds for the entire alignment: “Aligned because the sequences are evolutionarily related or aligned by chance alone” can be obtained by multiplying the odds for each aligned pair of amino acids. Multiplication is expensive computationally, so logarithms are used because they can be added. ISSCB 2003 Benson Log-Odds Substitution Matrices The BLOSUM and PAM substitution matrices contain logodds values. The ratios have the basic form Observed probability of pairing amino acids i and j in related sequences Oij = Eij Expected probability of pairing at random ISSCB 2003 Benson BLOSUM Data DATA: Ungapped multiple alignments (blocks) taken from 504 families of known related protein sequences in the PROSITE database. In the original paper (1992), this produced 2100+ blocks. ISSCB 2003 Benson BLOSUM Observed Frequencies A typical block: R K K K R E T T W … … … S S D … … … C A C … … … L N L … … … H P P … … … First column pairs: RK: 6 RR: 1 RE: 2 KK: 3 KE: 3 Repeat and accumulate for every column. Fij = Pair (i, j) counts; Oi,j = Fij / total number of pairs = observed pair frequencies ISSCB 2003 Benson BLOSUM Background Frequencies Pi = probability of amino acid i occurring in an amino acid pair Pi = Oii + Σj ≠ i Oij / 2 Eij = expected probability of random pairs Eij = ISSCB 2003 Benson { Pi Pj Pi Pj + Pj Pi if i = j if i ≠ j BLOSUM Log-Odds Values Sij = Log-odds ratio for aligning amino acids i and j. Sij = 2 log 2 (Oij / Eij) If observed frequencies are • as expected, Sij ~ 0 • greater than expected, Sij > 0 (positive) • less than expected, Sij < 0 (negative) ISSCB 2003 Benson Related Sequences Must Be Clustered Overrepresentation of closely related sequences can bias the matrices. K K K K R E … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … 4 sequences more than 80% identical Other related sequences The overrepresentation of the closely related sequences increases the observed K to K pairing, falsely increasing the log-odds score for that pair. ISSCB 2003 Benson Clustering Sequences Closely related sequences are clustered and the letters in any cluster are given fractional values. In the cluster below, amino acids K in the first column counts as ¼ each rather than 1. R and T in the fourth column count as ½ each. K K K K R E … … … … … … … … … … … … ISSCB 2003 Benson R R T T L K … … … … … … … … … … … … 4 sequences more than 80% identical clustered The BLOSUM Matrix Family Blosum 80 clusters sequences if they are ≥ 80% identical (clustering is not transitive). Blosum 62 clusters sequences if they are ≥ 62% identical. Lower numbers yield matrices that give higher scores to alignments of distantly related sequences. ISSCB 2003 Benson ISSCB 2003 Benson Topic 3 Outline Database Searching Algorithms • Sensitivity vs Selectivity vs Running Time • Dot Plots • Blast • High Scoring Segment Pairs (HSP) • Target Words • Extending a Hit • Statistics of HSP scores • Gapped Blast modifications ISSCB 2003 Benson Alternatives to Alignment Alignment is fine when the sequences are relatively short, but is unusable for longer sequences such as are encountered in • database searches • comparison of genomes • large repetition identification because it takes too long. For these, we need alternate methods. ISSCB 2003 Benson Dot Plots Two dimensional array like an alignment array. Put a dot in each cell where the sequence characters “match”. Long diagonal runs of dots indicate sequence similarity. Match rule for proteins might be positive substitution value in BLOSUM matrix. ISSCB 2003 Benson Extended Dot Plots (Nature, 19 June 2003, Vol 423, p 831) ISSCB 2003 Benson ISSCB 2003 Benson Inverted repeat Tandem repeat Direct repeat ISSCB 2003 Benson Database Search Algorithms: Sensitivity, Selectivity, Running Time • Sensitivity – the ability to detect weak similarities between sequences (often due to long evolutionary separation). Increasing sensitivity reduces false negatives, those database sequences similar to the query, but rejected. • Selectivity – the ability to screen out similarities due to chance. Increasing selectivity reduces false positives, those sequences recognized as similar when they are not. Sensitivity Selectivity ISSCB 2003 Benson BLAST The BLAST program is designed for fast sequence to database search. The basic idea is that a high scoring alignment between the query and a database sequence will almost always contain a core part that is well conserved. This core is called a High-scoring Segment Pair (HSP). The statistical theory behind BLAST deals with the probability of finding HSPs. ISSCB 2003 Benson High-scoring Segment Pair (HSP) An HSP consists of two equal length substrings, one from the query and one from the database sequence. When aligned without gaps, the score, S, is above a specified threshold and is locally maximal, meaning that extending the substrings on either end to lengthen the alignment produces a smaller score. Suppose we have the following two sequences: X = LKFSFALCCTIG Y = ADQHLSRPTWAFYC One of many segment pairs is (BLOSUM scoring): ISSCB 2003 Benson S T 1 F W 1 A A 4 L C F Y 0 -2 C C 9 = 13 High Scoring Segment Pairs X = LKFSFALCCTIG Y = ADQHLSRPTWAFYC S T 1 F W 1 A A 4 L C F Y 0 -2 C C 9 = 13 This pair is locally maximal because shortening or lengthening the alignment reduces the score. L S -2 ISSCB 2003 Benson K F R P 2 -4 S T 1 F W 1 A A 4 L C F Y 0 -2 C C 9 Assumptions for the Statistical Theory We assume • a simple protein model – the probabilities for the amino acids appearing at each position in a protein are independent and identically distributed (iid) and amino acid i occurs randomly with probability Pi. • the substitution matrix (such as BLOSUM 62) has at least one positive score. • the expected score for amino acid pairs Σ pipj sij < 0 is negative. ISSCB 2003 Benson Normalized Scores of HSPs Given Pi and Sij, the statistical theory yields two parameters, λ and K which can be used to normalize an HSP score S with the formula: S' = λS – ln K ln 2 The units of S' are bits. When two random proteins sequences are compared, the expected number of HSPs with a normalized score of S' is: E = N / 2S' where N is the product of the sequence lengths. ISSCB 2003 Benson Score for a Given Statistical Significance Solving for S' yields: S' = log2 (N / E) For example, if • the query protein has length 250. • the database has length 50 000 000 • the E-value is .05 then the score required to reach that level of statistical signficance is ~ 38 bits. ISSCB 2003 Benson Deficiencies in the BLAST theory • The result is asymptotic, meaning there is some error for finite sequences. • Local variations in residue composition in real sequences often do not match the model. • The model assumes that all mutations at all sites in a protein sequence can be described by a single substitution matrix. • It does not apply to gapped alignments. ISSCB 2003 Benson Finding HSPs Blast finds HSPs by first looking for well conserved core alignments, i.e., ungapped alignments between short query “words” of length 3 and database “words”. Each core alignment must score greater than a threshold score, T, which is typically 13 using the BLOSUM matrix. Core alignments or “hits” are typically found using pattern matching finite automata, i.e. linear search through the database. ISSCB 2003 Benson Query and Target Words Suppose the query sequence is LVNRKPVVP. • Chop the query into all possible words of length 3: LVN VNR NRK RKP KPV PVV VVP • Collect “target words” associated with each query word. For example, the word RKP has six target words that score at least 13 (using BLOSUM 62) when aligned with RKP: QKP KKP RQP REP RRP RKP In general, there will be a large set of target words derived from the query sequence. Each target word and its associated query word could form the core of an HSP. ISSCB 2003 Benson Extending a Hit A hit is extended in either direction to find its locally maximal segment pair. Extension terminates when the score drops too far below the highest score found. For example, Suppose the target RRP is found and it occurs in the sequence EPGVCRRPLKCTAS When trying to extend the core against LVNRKPVVP the HSP has 6 letters and a score of 16: L G -3 ISSCB 2003 Benson V N V C 4 -3 R R 5 K R 2 P P 7 V V P L K C 1 -2 -3 Gapped Blast Gapped BLAST (1997) has two improvements over the original BLAST (1990). • Two hits – Only extends core alignments when two occur nearby on the same “diagonal”. Involves lowering the threshold T to retain sensitivity, but reduces extension which is the most costly. • Gapped alignments are computed. Allows raising the threshold T while retaining selectivity. Speeds initial database scan. ISSCB 2003 Benson Optimal Substitution Matrices for Distant Homologies Among HSPs (ungapped) from the comparison of random sequences, amino acids ai and aj are aligned with frequency approaching: qij = pipjeλsij The qij are called “target frequencies” for the given substitution matrix sij. Among alignments of distantly related proteins, amino acids tend to be paired with certain characteristic frequencies. “Only if these correspond to a matrix’s target frequencies ... can the matrix be optimal for distinguishing the distant local homologies.” (Altschul – 1991) ISSCB 2003 Benson Log-Odds Matrices Again Rearranging qij = pipjeλsij yields: sij = ln(qij /pipj)/λ where λ acts as a scaling factor for the logarithm. This shows • that the substitution scores are inherently log-odds scores for the target and background frequencies, and • scores may be chosen that correspond to any desired set of target frequencies. ISSCB 2003 Benson Topic 4 Outline Specialized Sequence Alignment Algorithms • Sim4 • Blat ISSCB 2003 Benson sim4 Sim4 is a program for aligning a cDNA sequence to a genomic sequence. cDNA is complementary to a messenger RNA which is the RNA molecule after introns have been cut out and the exons spliced together. The difference between cDNA and genomic DNA is the absence of the intron sequences in cDNA. Sim4 assumes that the differences between the two sequences are limited to: • introns in the genomic sequence • sequencing errors in either sequence ISSCB 2003 Benson sim4 – Find HSPs In the first step, sim4 finds HSPs which must have an exact matching core of 12 nucleotides. The core is extended on both ends with a score of 1 for match and -5 for mismatch. cDNA Genomic ISSCB 2003 Benson sim4 – Select chains of HSPs cDNA HSps are chained with the constraints that • starting positions in the cDNA are increasing • HSPs are in nearby diagonals or are in diagonals separated by plausible intron distances Genomic ISSCB 2003 Benson starting positions increase gap typical for intron sim4 – Trim overlaps in cDNA Exon cores that overlap in the cDNA are trimmed to find GT ... AG, a common intron signal, in the genomic DNA. GT overlap ISSCB 2003 Benson AG sim4 – Trim overlaps in cDNA Exon cores that overlap in the cDNA are trimmed to find GT ... AG, a common intron signal, in the genomic DNA. GT ISSCB 2003 Benson AG sim4 – Filling Gaps in cDNA Gaps are filled by recursively finding HSPs with smaller cores (starting with exact matches of length 8) smaller cores are chained as before ISSCB 2003 Benson BLAT – The BLAST-Like Alignment Tool BLAT is a specialized program for mRNA to DNA alignments and cross-species protein alignments. It is faster than BLAST and sim4 for these tasks. It is not appropriate for distant homology searching. BLAT is available at the UC Santa Cruz human genome website for database searches against the human genome. ISSCB 2003 Benson BLAT vs BLAST Both BLAT and BLAST first look for high-scoring pairs: • BLAST collects short words in the query and does a linear scan through the database for those words. • BLAT builds an index of the database (a data structure suited for rapid detection of word matches) and then scans linearly through the query. Since the query is much smaller than the database, the scan phase is very rapid. The data structure is persistent, meaning that it is built once and then reused for every query search. ISSCB 2003 Benson BLAT – Hits To form the database index, BLAT cuts the database sequences into non-overlapping words of length k. Database: Query: Query k-words ISSCB 2003 Benson A hit is an exact match or a near perfect match (single character mismatch) with any query subword. BLAT – Criteria for Extending Hits BLAT allows different criteria for extending hits: • Single exact match • Single near-exact match (one difference) C A G T G C G A T G A C A G T A C G A T G A • Two exact matches on the same diagonal separated by a small distance ISSCB 2003 Benson BLAT – Statistics The size, k, of matching words is selected to balance sensitivity and selectivity with running time. BLAT makes flexible assumptions about the query and database sequences to select k, the word match size: • M – the percent matching between homologous regions (98% for cDNA/genomic alignments, 89% for crossspecies protein alignments) • H – the size of the homologous regions • G – the size of the database (3 000 000 000 bases) • Q – the size of a query • A – the alphabet size (4 for DNA, 20 for protein) ISSCB 2003 Benson BLAT – Statistics for k Selection (Exact Match) The number of non-overlapping k-words in a homologous region is: T = floor (H / k) Sequence letters are assumed to be iid. The probability that: • a match occurs between a homologous region k-word and the query is p = Mk • no match occurs is: q = (1 – p) • at least one homologous region word matches the query is: Phit = 1 minus no matches = 1 – q T ISSCB 2003 Benson BLAT – Statistics for k Selection (Exact Match) The probability that two k words match by chance is: r = (1/A)k The number of: • query words is qw = Q – k + 1 • database words is dw = G/k The number of k words that are expected to match by chance is: F = qw d w r ISSCB 2003 Benson BLAT – Statistics for k Selection With increasing k, • the probability of a valid hit, Phit, goes down because it becomes harder to find two words that match due to errors or mutations. • the number of false hits, F, goes down because probability of random word matching decreases as words grow. For example, in DNA, with M = 0.95, H = 100, and Q = 500, if • k = 11, then Phit = 0.999 and F = 32 512 • k = 14, then Phit = 0.991 and F = 399 By decreasing sensitivity slightly, the number of false hits can be reduced by almost a factor of 100. ISSCB 2003 Benson BLAT indexing Each k-word (nonoverlapping) in the database is converted to a number in base four: A list, I, of all possible numbers is T C A G T T A maintained. For DNA, when k = 7, 3 1 0 2 3 3 04 this list has 47 or 16,384 entries. I: 3102330 3102331 L: 100 350 600 730 870 930 List I points to a list L, the sorted locations of the k-word in the database sequences. ISSCB 2003 Benson Deficiencies in the BLAT Program • Near exact matching with proteins requires an index which is too large to fit in memory, so a less efficient hashing scheme is used instead • The alignment method is not standard optimal alignment and – for DNA does not work well below 90% sequence identity – for proteins does not work well with indels in one of the sequences ISSCB 2003 Benson Topic 5 Outline Searching for Repetitive Motifs and Patterns • Pattern Detection vs Alignment • Short word methods • TRF ISSCB 2003 Benson Pattern Detection In pattern detection problems: • there is no query sequence • we look for a repetitive pattern or motif in one long sequence or several sequences • broad characteristics of the target pattern are specified in advance ISSCB 2003 Benson Short Word Methods In short word methods, small matching words are detected. A cluster of short words indicates a potential pattern. A typical applications is the search for tandem repeats in DNA sequences. Word spacings may differ. ISSCB 2003 Benson Short Word Methods Short word methods are well suited to DNA sequences because • the alphabet is small so exact matching words are common even in homologous regions that have experienced significant mutation. • they can handle insertions and deletions that are common in DNA They are less well suited to protein sequences because • large amino acid alphabet and frequent substitution make short matching words uncommon. ISSCB 2003 Benson Tandem Repeats A tandem repeat is any pattern of nucleotides that has been duplicated so that it appears several times in succession. For example, the sequence fragment below contains a tandem repeat of the trinucleotide CGT: tcgctggtcatacgtcgtcgtcgtcgttacaaacgtcttccgt ISSCB 2003 Benson Approximate Tandem Repeats More typically, the tandem copies are only approximate due to mutations. Here is an alignment of copies from a tandem repeat in C. elegans. Shown are the copies and a consensus pattern ISSCB 2003 Benson Tandem Repeats Associated with Human Disease •Trinucleotide diseases caused by expansion of a trinucleotide repeat: Fragile-X mental retardation Myotonic dystrophy Huntington’s disease Friedreich’s ataxia •Multilocus diseases linked in some cases to unstable or uncommon minisatellites: Epilepsy Diabetes Ovarian cancer ISSCB 2003 Benson Tandem Repeats Function and Usefulness Tandem repeats: • are involved in gene regulation and often contain putative transcription factor binding sites. • exhibit copy number polymorphism, making them valuable genomic markers. ISSCB 2003 Benson Tandem Repeats Finder (TRF) TRF finds tandem repeats in genomic DNA sequences using short word matches (k-tuple matches). It assumes: • repeats have on average >80% sequence identity • insertions and deletions occur on average in < 10% of the pattern positions These are average values for program parameter settings. Repeats with lower sequence identity and higher indel frequency are also found. ISSCB 2003 Benson LOCUS HUMFMR1 3765 bp mRNA PRI 08-NOV-1994 DEFINITION Human Fragile X mental retardation 1 FMR-1 gene, 3' end, clones 1 61 121 181 241 301 361 421 481 541 601 661 721 781 841 901 961 1021 1081 1141 1201 gacggaggcg cggcggaggc cggcggctgg agggctgaag tacaaggcat tggcagcctg aataaagata ccttgctgtt tatgcagcat aatcccaaca gacttacggc gccttttctg gtcacctcaa ttgtctctga gcctcgagat actcatggtg gatgaagata agaagctttc ggaaaaaatg attgaggctg cttccttcca ISSCB 2003 Benson cccgtgccag ggcggcggcg gcctcgagcg agaagatgga ttgtaaagga ataggcagat taaatgaaag ggtggttagc gtgatgcaac aacctgccac aaatgtgtgc taacttatga agcgagcaca taatgagaaa ttcatgaaca ctaatattca cctgcacatt tcgaatttgc gaaagctgat aaaatgagaa ataattcaag ggggcgtgcg gcggcggcgg cccgcagccc ggagctggtg tgttcatgaa tccatttcat tgatgaagtt taaagtgagg ttacaatgaa aaaagatact caaagaggcg tccagaaaat tatgctgatt tgaagaagct gtttatcgta gcaagctaga tcatatttat tgaagatgta tcaggagatt aaatgttcca ggttggacct gcagcgcggc cggcggaggc acctctcggg gtggaagtgc gattcaataa gatgtcagat gaggtgtatt atgataaagg attgtcacaa ttccataaga gcacataagg tatcagcttg gacatgcact agtaagcagc agagaagatc aaagtacctg ggagaggatc atacaagttc gtggacaagt caagaagagg aatgccccag ggcggcggcg ggcggcggcg ggcgggctcc ggggctccaa cagttgcatt tcccacctcc ccagagcaaa gtgagtttta ttgaacgtct tcaagctgga attttaaaaa tcattttgtc ttcggagtct tggagagttc tgatgggtct gggtcactgc aggatgcagt caaggaactt caggagttgt aaattatgcc aagaaaaaaa gcggcggcgg gcggcggcgg cggcgctagc tggcgctttc tgaaaacaac tgtaggttat tgaaaaagag tgtgatagaa aagatctgtt tgtgccagaa ggcagttggt catcaatgaa gcgcactaag aaggcagctt agctattggt tattgatcta gaaaaaagct agtagtaata gagggtgagg accaaattcc acatttagat LOCUS HUMFMR1 3765 bp mRNA PRI 08-NOV-1994 DEFINITION Human Fragile X mental retardation 1 FMR-1 gene, 3' end, clones 1 61 121 181 241 301 361 421 481 541 601 661 721 781 841 901 961 1021 1081 1141 1201 gacggaggcg cggcggaggc cggcggctgg agggctgaag tacaaggcat tggcagcctg aataaagata ccttgctgtt tatgcagcat aatcccaaca gacttacggc gccttttctg gtcacctcaa ttgtctctga gcctcgagat actcatggtg gatgaagata agaagctttc ggaaaaaatg attgaggctg cttccttcca ISSCB 2003 Benson cccgtgccag ggcggcggcg gcctcgagcg agaagatgga ttgtaaagga ataggcagat taaatgaaag ggtggttagc gtgatgcaac aacctgccac aaatgtgtgc taacttatga agcgagcaca taatgagaaa ttcatgaaca ctaatattca cctgcacatt tcgaatttgc gaaagctgat aaaatgagaa ataattcaag ggggcgtgcg gcggcggcgg cccgcagccc ggagctggtg tgttcatgaa tccatttcat tgatgaagtt taaagtgagg ttacaatgaa aaaagatact caaagaggcg tccagaaaat tatgctgatt tgaagaagct gtttatcgta gcaagctaga tcatatttat tgaagatgta tcaggagatt aaatgttcca ggttggacct gcagcgcggc cggcggaggc acctctcggg gtggaagtgc gattcaataa gatgtcagat gaggtgtatt atgataaagg attgtcacaa ttccataaga gcacataagg tatcagcttg gacatgcact agtaagcagc agagaagatc aaagtacctg ggagaggatc atacaagttc gtggacaagt caagaagagg aatgccccag ggcggcggcg ggcggcggcg ggcgggctcc ggggctccaa cagttgcatt tcccacctcc ccagagcaaa gtgagtttta ttgaacgtct tcaagctgga attttaaaaa tcattttgtc ttcggagtct tggagagttc tgatgggtct gggtcactgc aggatgcagt caaggaactt caggagttgt aaattatgcc aagaaaaaaa gcggcggcgg gcggcggcgg cggcgctagc tggcgctttc tgaaaacaac tgtaggttat tgaaaaagag tgtgatagaa aagatctgtt tgtgccagaa ggcagttggt catcaatgaa gcgcactaag aaggcagctt agctattggt tattgatcta gaaaaaagct agtagtaata gagggtgagg accaaattcc acatttagat LOCUS RATIGCA 4461 bp DNA ROD 18-APR-1994 DEFINITION Rat Ig germline epsilon H-chain gene C-region, 3' end. 2881 2941 3001 3061 3121 3181 3241 3301 3361 3421 3481 3541 3601 3661 3721 3781 3841 3901 3961 4021 cgccccaagt tccatctcag cgcccaacca acacacacac ccaccatatc agtcggccag agagatggag ctccaggcca gcctgagctg gattataggg tcctataagt agattcctgg tcctggaggg ctgtcagata tgcccacaca catgcccaca cacacacaca gggtgggaga gtcaggggaa aagtgggatg ISSCB 2003 Benson aggcttcatc gcccagaggg ccaaccacca acacacacac cagagacaag cacctcagcc gaggtggagg atccttatac tggaaaacca agactgaggc ctgggctggg agccagagtg ccctgggcac cacacacaca catgcataca cacatgcata cacaccccgc tactgggtca aaggacatct gggagctctg atgctctttg atgaggagac gcacatcagg acacacacac tgtctgagtc tccaggccaa cctgagctgt tttggcccac gagacaggaa aggagtagag agtccatgtg tgcatgcagg tctgaacaaa tgcacacaca cacatgcaca cacacatgca aggtagcctt tggtgggcac gcctccaggg ccactccagt gtttagcaat cagaatcaag ttcacacacc acacacacac tgagatacct tccttatact ggaaaaccag tgcaggccat gatggtctgt ctcctacaag tcctgacttg ccctagaaga aggcaattct tacacacaca cacatacaca tgcacacaca catcatgctg cggagtagaa ctgaacagag ttcaccagga agcccaaagc acatacccac tgagaccagt acacacaagc ctgaggatca ttggcccact agacaggaag gagagatgga atggagagag gccagtagtc ctcctcagat aatgtggagc gtaggctgta gagacacaga cacagagaca cacacacaca tctagcgata agagggaatg acttggagca ctgcctgaga aagctatgca gcccatccca ggctcccatc ccgtacacat ccaatggcag gcaggccatg atggtctgta ggaggtggag tagtaaacca taccttagag atcacaacca ttagagccct tagaggcatc cacacacaca cagacacaca tacacataca gccctgctga agcagtcagg gtcccagagc ccagtgaggg LOCUS RATIGCA 4461 bp DNA ROD 18-APR-1994 DEFINITION Rat Ig germline epsilon H-chain gene C-region, 3' end. 2881 2941 3001 3061 3121 3181 3241 3301 3361 3421 3481 3541 3601 3661 3721 3781 3841 3901 3961 4021 cgccccaagt tccatctcag cgcccaacca acacacacac ccaccatatc agtcggccag agagatggag ctccaggcca gcctgagctg gattataggg tcctataagt agattcctgg tcctggaggg ctgtcagata tgcccacaca catgcccaca cacacacaca gggtgggaga gtcaggggaa aagtgggatg ISSCB 2003 Benson aggcttcatc gcccagaggg ccaaccacca acacacacac cagagacaag cacctcagcc gaggtggagg atccttatac tggaaaacca agactgaggc ctgggctggg agccagagtg ccctgggcac cacacacaca catgcataca cacatgcata cacaccccgc tactgggtca aaggacatct gggagctctg atgctctttg atgaggagac gcacatcagg acacacacac tgtctgagtc tccaggccaa cctgagctgt tttggcccac gagacaggaa aggagtagag agtccatgtg tgcatgcagg tctgaacaaa tgcacacaca cacatgcaca cacacatgca aggtagcctt tggtgggcac gcctccaggg ccactccagt gtttagcaat cagaatcaag ttcacacacc acacacacac tgagatacct tccttatact ggaaaaccag tgcaggccat gatggtctgt ctcctacaag tcctgacttg ccctagaaga aggcaattct tacacacaca cacatacaca tgcacacaca catcatgctg cggagtagaa ctgaacagag ttcaccagga agcccaaagc acatacccac tgagaccagt acacacaagc ctgaggatca ttggcccact agacaggaag gagagatgga atggagagag gccagtagtc ctcctcagat aatgtggagc gtaggctgta gagacacaga cacagagaca cacacacaca tctagcgata agagggaatg acttggagca ctgcctgaga aagctatgca gcccatccca ggctcccatc ccgtacacat ccaatggcag gcaggccatg atggtctgta ggaggtggag tagtaaacca taccttagag atcacaacca ttagagccct tagaggcatc cacacacaca cagacacaca tacacataca gccctgctga agcagtcagg gtcccagagc ccagtgaggg Basic Assumption Mutated, adjacent copies of a pattern will contain runs of exact matches. d TATAC G T C GAGAC TTA ISSCB 2003 Benson d T C CAC G GAGATATTTA Basic Assumption Mutated, adjacent copies of a pattern will contain runs of exact matches. d TATAC G T C GAGAC TTA d T C CAC G GAGATATTTA Runs of matches are identified using k-tuple matches. ISSCB 2003 Benson Using k-tuple matches For purposes of program efficiency, fixed size k-tuples are used. d ISSCB 2003 Benson d k-tuple matches In pattern matching, a k-tuple is a window of length k which contains text characters. Two windows which contain the same text, form a k-tuple match: GAACGTTAGGTAACTGCAT CCTAGTTATACGTTAAC ISSCB 2003 Benson ISSCB 2003 Benson Modeling Tandem Repeats Multiple k-tuple matches suggest the occurrence of a tandem repeat. The appropriate number of matches depends on how a tandem repeat is defined. For example, are the following two aligned sequence fragments two copies of the same underlying pattern? TCGGCATCAGTCTATGG TCAA–-TG-GTGT-TGG ISSCB 2003 Benson A Stochastic Model TRF’s stochastic model is based on the probability of character matching and the frequency of insertion and deletion (indels) between aligned adjacent copies. CCACAACC-CGTCAGGCAAGT CTGCACCATCGTCTGGGAAGT HTTHHTHTTHHHHTHHTHHHH Note that the alignment has been converted into a Bernoulli (coin-toss) sequence. ISSCB 2003 Benson Model Parameters PM = the expected frequency of a match PI = the expected frequency of an indel The parameters are applied to Bernoulli sequences to establish criteria for detecting the repeats. ISSCB 2003 Benson Number of matches to indicate a repeat Sum of heads is the minimum required number of matches • for a repeat with period n • match probability p • tuple size k Unless k = 1, not all matches will be detected. For example: HHTHHHHTHTHHHTTHHHT ISSCB 2003 Benson k H seen 1 13 2 12 3 10 4 4 Sum of heads Suppose a random Bernoulli sequence has length 100 and the expected number of heads is 75 (PM = 0.75). If we count the number of heads, then 95% of the time we expect to count at least 68 heads. If we count only heads that occur in runs of length 5 or more, then 95% of the time we expect to count at least 26 heads. This is the sum of heads criteria. ISSCB 2003 Benson Other Criteria • Apparent Size – Used to distinguish tandem from nontandem repeats. • Waiting time – Used to pick a suitable tuple size. • Random walk – Used to accommodate insertions and deletions. ISSCB 2003 Benson ISSCB 2003 Benson The Tandem Repeats Database The Tandem Repeats Database (TRDB) is: 1. A public database of information on tandem repeats. 2. A private workspace for extended research on tandem repeats. ISSCB 2003 Benson C. elegans Distribution of Repeat Pattern Size Chr 1 ISSCB 2003 Benson Human Distribution of Pattern Size Chr 1 ISSCB 2003 Benson C. elegans Distribution of Repeat Location Chr 1 ISSCB 2003 Benson Human Distribution of Repeat Location Chr 1 ISSCB 2003 Benson Human Distribution of Repeat Location Chr 1 for Patternsize >= 5 ISSCB 2003 Benson Clusters in 41 bp Repeats ISSCB 2003 Benson Two identical repeats ISSCB 2003 Benson Four repeats from a tandem repeat cluster ISSCB 2003 Benson Topic 6 Outline Composition Alignment • Sequence composition and composition match • Composition alignment algorithm • Composition match scoring functions • Limiting the length of a composition match • Growth of local composition alignment scores • Biological examples ISSCB 2003 Benson Sequence Composition Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then, C(S)=(fσ1 , fσ2 , fσ3 , … , fσ|Σ|) is the composition of S, where fσ is the fraction of the i characters in S that are σi. Note that the order of letters is irrelevant as it has no effect on the composition. ISSCB 2003 Benson Composition Example S = ACTGTACCTGGCGCTATT C(S) = ( 0.17, 0.28, 0.22, 0.33 ) A C G T ISSCB 2003 Benson Composition Match Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T). For example, S and T below have a composition match: S = ACTGTACCTGGCGCTATT T = AAACCCCCGGGGTTTTTT ISSCB 2003 Benson Composition and Sequence Features • Isochores – Multi-megabase, specifically GC-rich or GCpoor. GC-rich isochores have greater gene density. • CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG which is underrepresented in eukaryotic genomes. Methylation of the cystine in these dinucleotides affects gene expression. • Protein binding regions – Tens of nucleotides, dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding. ISSCB 2003 Benson Composition Alignment Problem Given: Two sequences, S of length m, and T of length n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t. Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of single character match, single character mismatch, insertion and deletion. ISSCB 2003 Benson Example of composition alignment S = AACGTCTTTGAGCTC T = AGCCTGACTGCCTA Alignment AACGTCTTTGAGCTC | |<-> | <---> AGCCTGACT-GCCTA ISSCB 2003 Benson Algorithm Analysis Given two sequences, S and T, the best alignment of the prefix strings S[1, i] = s1 … si T[1, j] = t1 … tj ends in one of four ways, mismatch, insertion, deletion, or composition match between suffixes of length l 1 ≤ l ≤ min(i, j, limit) i.e., between substrings S[i – l + 1, i] and T[j – l + 1, j] ISSCB 2003 Benson Time Complexity Computing the optimal composition alignment is done with dynamic programming and is similar to standard alignment, except for the composition match scoring option. The overall time complexity is O(nmZ) where Z is the time required per (i, j) pair to find the best length l for the composition match. ISSCB 2003 Benson Computing length of the shortest composition match Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match. For example, let S = AACGTCTTTGAGCT T = AGCCTGACTGCCTA Then for k = 6, the shortest suffixes which have a composition match have length = 3: S = AACGTC T = AGCCTG ISSCB 2003 Benson Composition difference Composition difference is a vector quantity for two strings x and y: CD(x, y) = (cσ1 , … , cσ|Σ|) where cσ is the difference between the number of times σi i occurs in x and in y. ISSCB 2003 Benson Using composition difference Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k – g. ISSCB 2003 Benson Sorting to find shortest composition matches Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches. ISSCB 2003 Benson Time complexity for composition matches O(nmΣ) to find n·m shortest composition match lengths for two strings of length n and m. In our work, Σ, is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used. ISSCB 2003 Benson Composition match scoring functions Functions based on match length, k: • Function 1: cm(k) = ck • Function 2: cm(k) = c√ k where c is a constant. Functions based on substring composition: • Function 4: cm(C, B, k) = ck · H(C,B) where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition. ISSCB 2003 Benson Additive and subadditive scoring functions The functions based on length are additive or subadditive: cm(i + j) ≤ cm(i) + cm(j) Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches. Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm). ISSCB 2003 Benson The limit parameter Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful. The limit parameter is an upper bound on the length l of the longest single composition match. ISSCB 2003 Benson Limit and fraction of matching characters random, ungapped alignments Sequence length = limit binary 1 50 2 62.5 5 75.6 10 82.4 DNA Dinucleotide 25 6.2 30 6.5 37.5 7.1 44.2 7.5 ISSCB 2003 Benson Limit and fraction of matching characters random, ungapped alignments Sequence length = 100, all letters equal probability p = 0.25 limit DNA 1 25 2 33.7 5 44.4 10 51 Sequence length = 400, all letters equal probability p = 0.25 limit dinucleotide ISSCB 2003 Benson 1 6.25 2 6.81 10 7.76 50 7.78 Growth of local alignment score Function 1 Average Local Composition Alignment Scores: DNA Sequences Function 1 120 100 Limit = 4 Score 80 60 Limit = 3 40 Limit = 2 20 0 100 200 400 Sequence Length (log scale) ISSCB 2003 Benson 800 1000 Global score as a predictor of local parameter suitability: Function 1 Average Global Composition Alignment Scores: DNA Sequences Function 1 100 50 Limit = 5 0 -50 Limit = 4 Score -100 -150 Limit = 3 -200 -250 -300 -350 -400 100 Limit = 2 200 300 400 500 Sequence Length ISSCB 2003 Benson 600 700 800 900 Growth of local alignment score Function 2 Average Local Composition Alignment Scores: DNA Sequences Function 2 100 50 90 30 80 20 70 Score 60 10 50 6 40 30 20 10 0 100 200 400 Sequence Length ISSCB 2003 Benson 800 1000 Global score as a predictor of local parameter suitability: Function 2 Global Composition Alignment Scores: DNA Sequences Function 2 0 -20 -40 50 -60 Score -80 30 -100 20 -120 -140 -160 -180 10 -200 0 100 200 300 400 500 Sequence Length ISSCB 2003 Benson 600 700 800 900 Limit values for DNA • Function 1: cm(k) = ck: Limit ≤ 3. • Function 2: cm(k) = c√k: Limit ≤ 10. • Function 4: cm(C, B, k) = ck ·H(C, B): Limit ≤ 50. ISSCB 2003 Benson Biological examples Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site. Two local alignment scores were produced using function 1, W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments. ISSCB 2003 Benson Example 1 Composition alignment and standard alignment of two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands. Composition Alignment: GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC <->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->|| CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC Standard Alignment: CGCCGCCGCCG CGCCGCCGCCG ISSCB 2003 Benson Example 2 Composition alignment of two promoter sequences. Composition changes at vertical line. A C G T Left: (0.01, 0.61, 0.30, 0.08) Right: (0.19, 0.16, 0.56, 0.09) GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG <->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<-> | |<><>|<-> | |<>|<>|<>||||<-><->| CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG ISSCB 2003 Benson Final Slide Recognize that similarity among biological sequences most likely exists in ways which we can not today perceive and for which we have no detection tools. You are encouraged, therefore, to think broadly, beyond the embellishment or refinement of current methods, to new definitions of similarity and new problems of comparison. ISSCB 2003 Benson