Voting, Meta-search, and Bioconsensus
Fred S. Roberts
Department of Mathematics and DIMACS (Center for Discrete Mathematics and Theoretical Computer Science)
Rutgers University, Piscataway, New Jersey

From Social Science Methods to Information Technology Applications

Over the years, social scientists have developed a variety of methods for dealing with problems of voting, decisionmaking, conflict and cooperation, measurement, etc. These methods, often heavily mathematical, are beginning to find novel uses in a variety of information technology applications. Such methods will need to be substantially improved to deal with such issues as:
•Computational intractability
•Limitations on computational power/information
•The sheer size of some of the new applications
•Learning through repetition
•Security, privacy, and cryptography

This talk will concentrate on social science methods for dealing with voting and decisionmaking. We will look briefly at various applications of these methods to a variety of information technology problems and then concentrate on a particular application to biological databases.

Voting/Group Decisionmaking

In a standard model for voting, each member of a group gives an opinion, and we seek a consensus among these opinions. Sometimes the opinion is just a vote for a first choice among a set of alternative choices or candidates. In other contexts, the opinion might be a ranking of all the alternatives. Obtaining opinions as rankings of the alternatives or candidates can give much more information about a voter's true preferences than simply obtaining a first choice, but then we face the challenge of defining what we mean by a consensus. In many applications, we seek a ranking that is in some sense a consensus of the rankings provided by all of the voters.

Medians and Means

Among the most important directions of research in the theory of group consensus is the idea that we can obtain a group consensus by first finding a way to measure the distance between any two alternatives or any two rankings. Let M be the set of alternatives (candidates) or the set of rankings of alternatives, and let d(a,b) denote the distance between a and b in M. A profile (of opinions) is a vector (a_1, a_2, ..., a_n) of points from M. The median of a profile is the set of all points x of M that minimize $\sum_{i=1}^{n} d(a_i,x)$, and the mean is the set of all points x of M that minimize $\sum_{i=1}^{n} d(a_i,x)^2$.

One very commonly used way of measuring the distance between two rankings of candidates is the Kemeny-Snell distance: twice the number of pairs of candidates i and j for which i is ranked above j in one ranking and below j in the other, plus the number of pairs for which i is ranked above or below j in one ranking but tied with j in the other.

Consider the following profile:
Voter 1 (a_1): Bush, Gore, Nader
Voter 2 (a_2): Bush, Gore, Nader
Voter 3 (a_3): Gore, Bush, Nader
For this profile, the Kemeny-Snell median is the ranking x = Bush, Gore, Nader: we have d(a_1,x) + d(a_2,x) + d(a_3,x) = 0 + 0 + 2 = 2. However, the Kemeny-Snell mean is the ranking y = Bush-Gore, Nader, in which Bush and Gore are tied for first place: d(a_1,y)^2 + d(a_2,y)^2 + d(a_3,y)^2 = 1 + 1 + 1 = 3, while d(a_1,x)^2 + d(a_2,x)^2 + d(a_3,x)^2 = 0 + 0 + 4 = 4.

Note that medians and means need not be unique.
Voter 1: Bush, Gore, Nader
Voter 2: Gore, Nader, Bush
Voter 3: Nader, Bush, Gore
This is the "voter's paradox" situation. In this case, there are three Kemeny-Snell medians: these three rankings themselves.
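To make the definitions concrete, here is a minimal Python sketch (my illustration, not part of the talk) that computes the Kemeny-Snell distance between two strict rankings and finds the medians and means of a profile by brute-force search over all strict rankings. Because it ignores rankings with ties, it cannot discover tied consensus rankings such as the Bush-Gore mean above; all function names are my own.

```python
from itertools import combinations, permutations

def kemeny_snell(r1, r2):
    """Kemeny-Snell distance between two strict rankings (lists of
    candidates, best first): 2 * (number of oppositely ordered pairs).
    Rankings with ties are not handled in this simplified sketch."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    d = 0
    for a, b in combinations(r1, 2):
        # a pair contributes 2 when the two rankings order it oppositely
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            d += 2
    return d

def medians_and_means(profile):
    """Brute force over all strict rankings of the candidates."""
    candidates = profile[0]
    scores = {}
    for x in permutations(candidates):
        dists = [kemeny_snell(list(x), r) for r in profile]
        scores[x] = (sum(dists), sum(t * t for t in dists))
    best_med = min(s[0] for s in scores.values())
    best_mean = min(s[1] for s in scores.values())
    return ([x for x, s in scores.items() if s[0] == best_med],
            [x for x, s in scores.items() if s[1] == best_mean])

profile = [["Bush", "Gore", "Nader"],
           ["Bush", "Gore", "Nader"],
           ["Gore", "Bush", "Nader"]]
print(medians_and_means(profile)[0])  # [('Bush', 'Gore', 'Nader')]
```

For the voter's paradox profile, this search returns all three voters' rankings as medians, in agreement with the observation above.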
However, there is a unique Kemeny-Snell mean: the ranking in which all three candidates are tied. Because of non-uniqueness, we think of consensus as defining a function F from the set of profiles to the set of sets of rankings. Kenneth Arrow called this a group consensus function or social welfare function.

When the elements of M are rankings, the calculation of medians and means can be quite difficult.

Theorem (Bartholdi, Tovey, and Trick, 1989; Wakabayashi, 1986): Under the Kemeny-Snell distance, the calculation of the median of a profile of rankings is an NP-complete problem.

Meta-Search and Other Information Technology Applications of Consensus Methods

Meta-Search
Meta-search is the process of combining the results of several search engines. We seek the consensus of the search engines, whether it is a consensus first-choice website or a consensus ranking of websites. Meta-search has been studied using consensus methods by Cohen, Schapire, and Singer (1999) and Dwork, Kumar, Naor, and Sivakumar (2000). One important point: in voting, there are usually few candidates and many voters; in meta-search, there are usually few voters and many candidates. In this setting, Dwork et al. developed an approximation to the Kemeny-Snell median that preserves most of its desirable properties and is computationally tractable (see the sketch at the end of this overview of applications).

Information Retrieval
We rank documents according to their probability of relevance to a query. Given a number of queries, we seek a consensus ranking.

Collaborative Filtering
We use knowledge of the behavior of multiple users to make recommendations to an active user, for example combining movie ratings by others to prepare an ordered list of movies for a given user. Consensus methods have been applied to collaborative filtering by Freund, Iyer, Schapire, and Singer (1998), who designed an efficient "boosting system" for combining preferences. Pennock, Horvitz, and Giles (2000) applied consensus methods to develop "recommender systems."

Software Measurement
Combining several different measures or ratings through appropriate consensus methods is an important topic in the measurement of the understandability, quality, functionality, reliability, efficiency, usability, maintainability, or portability of software (Fenton and Pfleeger, 1997).

Ordinal Filtering in Digital Image Processing
In one method of noise removal, to check whether a pixel is noise, one compares it with neighboring pixels. If the values differ beyond a certain threshold, one replaces the value of the given pixel with a mean or median of the values of its neighbors (see Janowitz (1986)).

Related methods are used in models of "distributed consensus": a number of processors each hold an initial (binary) value, and some of them may be faulty and ignore any protocol, yet it is required that the non-faulty processors eventually agree (reach consensus) on a value. Berman and Garay (1993) developed a protocol for distributed consensus based on the parliamentary procedure known as "cloture" and showed that it performs very well on a number of important criteria, including polynomial computation and communication.
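Since computing a Kemeny-Snell median is NP-complete, the rank-aggregation literature around Dwork et al. also considers tractable surrogates. One such surrogate is Spearman footrule aggregation, which reduces to a minimum-cost bipartite matching of candidates to positions and, by a classical inequality of Diaconis and Graham, approximates the Kemeny optimum within a factor of two on full rankings. The sketch below is my own illustration of the matching formulation, not Dwork et al.'s exact procedure (they also study Markov-chain methods); it requires NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def footrule_aggregate(rankings):
    """Spearman footrule aggregation of full rankings: assign each
    candidate to a position so as to minimize the total displacement
    sum_i |rank_i(c) - position(c)|, via min-cost bipartite matching."""
    cands = rankings[0]
    n = len(cands)
    pos = [{c: i for i, c in enumerate(r)} for r in rankings]
    # cost[c][j] = total footrule cost of placing candidate c at position j
    cost = np.array([[sum(abs(p[c] - j) for p in pos) for j in range(n)]
                     for c in cands])
    rows, cols = linear_sum_assignment(cost)
    consensus = [None] * n
    for r, j in zip(rows, cols):
        consensus[j] = cands[r]
    return consensus

engines = [["a", "b", "c", "d"],
           ["a", "c", "b", "d"],
           ["b", "a", "c", "d"]]
print(footrule_aggregate(engines))  # ['a', 'b', 'c', 'd']
```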
Bioconsensus

In recent years, methods of consensus developed for applications in the social sciences have become widely used in biology. In molecular biology alone, Bill Day has compiled a bibliography of hundreds of papers that use such consensus methods. The following are some of the ways that bioconsensus problems arise:
•Alternative phylogenies (evolutionary trees) are produced using different methods, and we need to choose a consensus tree.
•Alternative taxonomies (classifications) are produced using different models, and we need to choose a consensus taxonomy.
•Alternative molecular sequences are produced using different criteria or different algorithms, and we need to choose a consensus sequence.
•Alternative sequence alignments are produced, and we need to choose a consensus alignment.

Finding a Pattern or Feature Appearing in a Set of Molecular Sequences

In many problems of the social and biological sciences, data is presented as a sequence or "word" from some alphabet Σ. Given a set of sequences, we seek a pattern or feature that appears widely, and we think of this as a consensus sequence or set of sequences. A pattern is often thought of as a consecutive subsequence of short, fixed length. In biology, such sequences arise from DNA, RNA, proteins, etc.

Why Look for Such Patterns?
Similarities between sequences or parts of sequences lead to the discovery of shared phenomena. For example, it was discovered that the sequence for platelet-derived growth factor, which causes growth in the body, is 87% identical to the sequence for v-sis, a cancer-causing gene. This led to the discovery that v-sis works by stimulating growth.

In recent years, we have developed huge databases of molecular sequences. For example, GenBank has over 7 million sequences comprising 8.6 billion bases. The search for similarity or patterns has extended from pairs of sequences to finding patterns that appear in common in a large number of sequences or throughout a database. To find patterns in a database of sequences, it is useful to measure the distance between sequences. If a and b are sequences of the same length, a common way to define the distance d(a,b) is as the number of mismatches between the two sequences.

To measure how closely a pattern fits into a sequence, we have to measure the distance between sequences of different lengths. If b is longer than a, then d(a,b) can be taken to be the smallest number of mismatches over all possible alignments of a as a consecutive subsequence of b. We call this the best-mismatch distance.

Example: a = 0011, b = 111010. The three possible alignments place a against the windows 1110, 1101, and 1010 of b, giving 3, 3, and 2 mismatches, respectively. The best-mismatch distance is therefore 2, achieved in the third alignment.

An alternative way to measure d(a,b) is to count the smallest number of mismatches between sequences obtained from a and b by inserting gaps in appropriate places, where a mismatch between a letter of Σ and a gap is counted as an ordinary mismatch. We won't use this alternative measure of distance.

Waterman (1989), Waterman, Galas, and Arratia (1984), Galas, Eggert, and Waterman (1985), and others study the following situation:
•Σ is a finite alphabet.
•k is a fixed finite number (the pattern length).
•A profile Π = (a_1, a_2, ..., a_n) consists of a set of words (sequences) of length L from Σ, with L ≥ k.
We seek a set F(Π) = F(a_1, a_2, ..., a_n) of consensus words of length k from Σ.

Here is a small piece of data from Waterman (1989), in which he looks at 59 bacterial promoter sequences:
RRNABP1: ACTCCCTATAATGCGCCA
TNAA: GAGTGTAATAATGTAGCC
UVRBP2: TTATCCAGTATAATTTGT
SFC: AAGCGGTGTTATAATGCC
Notice that if we are looking for patterns of length 4, each sequence contains the pattern TAAT. However, suppose that we add another sequence:
M1 RNA: AACCCTCTATACTGCGCG
The pattern TAAT does not appear here. However, it almost appears: the word TACT appears, and TACT has only one mismatch from TAAT. So, in some sense, TAAT is still a good consensus pattern. We now make this idea precise.
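As a quick check of these observations, here is a short Python sketch of the best-mismatch distance (my own illustration, not code from the talk). It reproduces the distance 2 for a = 0011, b = 111010, finds TAAT exactly in each of the four promoter fragments, and finds it at distance 1 (the TACT near-match) in the M1 RNA fragment.

```python
def best_mismatch(w, b):
    """Best-mismatch distance d(w,b): the fewest mismatches over all
    alignments of w as a consecutive subword of b (len(b) >= len(w))."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, b[j:j + k]))
               for j in range(len(b) - k + 1))

print(best_mismatch("0011", "111010"))  # 2

promoters = ["ACTCCCTATAATGCGCCA", "GAGTGTAATAATGTAGCC",
             "TTATCCAGTATAATTTGT", "AAGCGGTGTTATAATGCC"]
print([best_mismatch("TAAT", s) for s in promoters])  # [0, 0, 0, 0]
print(best_mismatch("TAAT", "AACCCTCTATACTGCGCG"))    # 1 (via TACT)
```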
In practice, the problem is a bit more complicated than we have described it. We have long sequences, and we consider "windows" of length L beginning at a fixed position, say the jth. Thus, we consider words of length L in a long sequence, beginning at the jth position. For each possible pattern of length k, we ask how closely it can be matched in each of the sequences in a window of length L starting at the jth position.

Formalization
Let Σ be a finite alphabet of size at least 2 and Π be a finite collection of words of length L on Σ. Let F(Π) be the set of words of length k ≥ 2 that are our consensus patterns. (We drop the distinction between profile as vector and profile as set.) Let Π = {a_1, a_2, ..., a_n}. One way to define F(Π) is as follows. Let d(a,b) be the best-mismatch distance. Consider nonnegative parameters λ_d that are monotone decreasing in d, and let F(a_1, a_2, ..., a_n) be all those words w of length k that maximize

$s(w) = \sum_{i=1}^{n} \lambda_{d(w,a_i)}.$

We call such an F a Waterman consensus. In particular, Waterman and others use the parameters λ_d = (k−d)/k.

Example: An alphabet used frequently is the purine/pyrimidine alphabet {R,Y}, where R = A (adenine) or G (guanine) and Y = C (cytosine) or T (thymine). For simplicity, we use the digits 0, 1 rather than the letters R, Y. Thus, let Σ = {0,1} and k = 2, so the possible pattern words are 00, 01, 10, 11. Suppose a_1 = 111010 and a_2 = 111111. How do we find F(a_1,a_2)? We have:
d(00,a_1) = 1, d(00,a_2) = 2
d(01,a_1) = 0, d(01,a_2) = 1
d(10,a_1) = 0, d(10,a_2) = 1
d(11,a_1) = 0, d(11,a_2) = 0
Hence s(00) = λ_1 + λ_2, s(01) = λ_0 + λ_1, s(10) = λ_0 + λ_1, and s(11) = 2λ_0. As long as λ_0 > λ_1 > λ_2, it follows that 11 is the consensus pattern, according to Waterman's consensus.

Example: Let Σ = {0,1}, k = 3, and consider F(a_1,a_2,a_3) where a_1 = 000000, a_2 = 100000, a_3 = 111110. The possible pattern words are 000, 001, 010, 011, 100, 101, 110, 111. We have:
d(000,a_1) = 0, d(000,a_2) = 0, d(000,a_3) = 2,
d(001,a_1) = 1, d(001,a_2) = 1, d(001,a_3) = 2,
d(100,a_1) = 1, d(100,a_2) = 0, d(100,a_3) = 1, etc.
Thus s(000) = 2λ_0 + λ_2, s(001) = 2λ_1 + λ_2, s(100) = λ_0 + 2λ_1, etc. Now, λ_0 > λ_1 > λ_2 implies that s(000) > s(001). Similarly, one shows that the score is maximized by s(000) or s(100). Monotonicity alone does not say which of these is larger.

Other Consensus Functions
The median is the collection of words w of length k that minimize

$\mu(w) = \sum_{i=1}^{n} d(w,a_i),$

and the mean is the collection of words w of length k that minimize

$\xi(w) = \sum_{i=1}^{n} d(w,a_i)^2.$

Another measure that it might be of interest to minimize is a convex combination of these two:

$\rho_\alpha(w) = \alpha\mu(w) + (1-\alpha)\xi(w), \quad \alpha \in [0,1].$

Words that minimize ρ_α will be called the mixed median-mean. This might be of interest if we are not ready to choose either medians or means, or want some combination of the two. We might also choose to minimize $\sum_i d(w,a_i)^m$ or $\sum_i \log d(w,a_i)^m$ for fixed m.

Example: Let Σ = {0,1}, k = 2, and Π = {a_1,a_2,a_3,a_4} with a_1 = 1111, a_2 = 0000, a_3 = 1000, a_4 = 0001. The possible pattern words are 00, 01, 10, 11, and μ(00) = 2, μ(01) = 3, μ(10) = 3, μ(11) = 4. Thus, 00 is the median.
On the other hand, ξ(00) = 4, ξ(01) = 3, ξ(10) = 3, ξ(11) = 6, so the mean consists of the two words 01 and 10, neither of which is a median.

Summary of Notation
$s(w) = \sum_{i=1}^{n} \lambda_{d(w,a_i)}$
$\mu(w) = \sum_{i=1}^{n} d(w,a_i)$
$\xi(w) = \sum_{i=1}^{n} d(w,a_i)^2$
$\rho_\alpha(w) = \alpha\mu(w) + (1-\alpha)\xi(w), \quad \alpha \in [0,1]$

The Special Case λ_d = (k−d)/k
Suppose that λ_d = (k−d)/k. Then

$s(w) = \sum_{i=1}^{n} \frac{k - d(w,a_i)}{k} = n - \frac{1}{k}\sum_{i=1}^{n} d(w,a_i) = n - \frac{1}{k}\,\mu(w).$

Thus, for fixed k ≥ 2, Σ of size at least 2, and a set Π of any size, for all words w, w′ of length k:

μ(w) ≤ μ(w′) ⟺ s(w) ≥ s(w′).

It follows that for fixed k ≥ 2, Σ of size at least 2, and a set Π of any size, there is a choice of the parameters λ_d for which the Waterman consensus is the same as the median. (This also holds for k = 1 or Σ of size 1, but these are uninteresting cases.) Similarly, one can show that for any fixed k ≥ 2, Σ of size at least 2, and a set Π of any size, there is a choice of parameters λ_d so that for all words w, w′ of length k:

ξ(w) ≤ ξ(w′) ⟺ s(w) ≥ s(w′).

For this choice of λ_d, a word is a Waterman consensus iff it is a mean. More generally, for every rational number α ∈ [0,1], for fixed k ≥ 2, Σ of size at least 2, and a set Π of any size, there is a choice of parameters λ_d so that for all words w, w′ of length k:

ρ_α(w) ≤ ρ_α(w′) ⟺ s(w) ≥ s(w′).

For this choice of λ_d, a word is a Waterman consensus iff it is a mixed median-mean with convex combination depending upon α.

What Parameters λ_d Give Rise to the Median, Mean, or Mixed Median-Mean?
Let us first decide whether Π can have repeated words. From the point of view of the application, this is a reasonable assumption (repeats are allowed in the database, or some words have more significance than others). We shall investigate both the repetitive case, where Π is allowed to have repeated words, and the nonrepetitive case. The following results are joint with Boris Mirkin.

When Do We Get the Median in Waterman Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and (λ_d) is a sequence with λ_1 < λ_0. Then the following are equivalent:
(a) The equivalence μ(w) ≤ μ(w′) ⟺ s(w) ≥ s(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants B, C with B < 0 such that λ_j = Bj + C for all 0 ≤ j ≤ k.
In other words, under the hypotheses of the theorem, the median procedure corresponds exactly to the choice of parameters λ_j = Bj + C, B < 0. Note that λ_j = (k−j)/k is a special case of this.
Remark: This theorem (and all subsequent theorems) also holds if we replace "words of length L ≥ k" by "words of fixed length L, L ≥ k."

When Do We Get the Mean in Waterman Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and (λ_d) is a sequence with λ_1 < λ_0. Then the following are equivalent:
(a) The equivalence ξ(w) ≤ ξ(w′) ⟺ s(w) ≥ s(w′) holds for all words w, w′ of length k from Σ and all finite sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants A, C with A < 0 such that λ_j = Aj² + C for all 0 ≤ j ≤ k.
In other words, under the hypotheses of the theorem, the mean procedure corresponds exactly to the choice of parameters λ_j = Aj² + C, A < 0. Note that here we require (a) to hold for all finite sets Π, even those with repetitions. Removing this hypothesis is a technical matter: to do so, we have found it necessary to allow a larger alphabet, or to take k sufficiently large, and also to consider only L larger than k. The first result uses an alphabet of size at least four.

Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least four letters, and (λ_d) is a sequence with λ_1 < λ_0.
Then the following are equivalent:
(a) The equivalence ξ(w) ≤ ξ(w′) ⟺ s(w) ≥ s(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L > k from Σ.
(b) There are constants A, C with A < 0 such that λ_j = Aj² + C for all 0 ≤ j ≤ k.

The next result removes the hypothesis that the alphabet has size at least four, but adds the hypothesis that k ≥ 3:
Theorem: Suppose k is fixed, k ≥ 3, Σ is an alphabet of at least two letters, and (λ_d) is a sequence with λ_1 < λ_0. Then the following are equivalent:
(a) The equivalence ξ(w) ≤ ξ(w′) ⟺ s(w) ≥ s(w′) holds for all words w, w′ of length k from Σ and all finite nonrepetitive sets Π (of any size) of words of length L > k from Σ.
(b) There are constants A, C with A < 0 such that λ_j = Aj² + C for all 0 ≤ j ≤ k.

When Do We Get the Mixed Median-Mean in Waterman Consensus?
Recall that ρ_α(w) = αμ(w) + (1−α)ξ(w), α ∈ [0,1]. In what follows, we shall assume that α is rational. It might be of purely technical interest to figure out what happens when α is irrational, but we have not been able to obtain analogous results for that case.

Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at least two letters, and (λ_d) is a sequence with λ_1 < λ_0. Then the following are equivalent:
(a) The equivalence ρ_α(w) ≤ ρ_α(w′) ⟺ s(w) ≥ s(w′) holds for all words w, w′ of length k from Σ and all finite sets Π (of any size) of words of length L ≥ k from Σ.
(b) There are constants D, E with D < 0 such that λ_j = D(1−α)j² + Dαj + E for all 0 ≤ j ≤ k.
In other words, under the hypotheses of the theorem, the mixed median-mean procedure corresponds exactly to the choice of parameters λ_j = D(1−α)j² + Dαj + E, D < 0. If we only want to require that the equivalence in part (a) hold for finite nonrepetitive sets Π, one way to do so, as with the results about the mean ξ, is to assume that Σ is sufficiently large. We have been able to prove this result under the added hypotheses that L > k and that Σ has at least r(α) + 1 elements, where 2(2−α) = s/t for positive integers s, t and r(α) = max{s−t, t+1}.

Other Consensus Functions as Special Cases of Waterman's Consensus
It would be interesting to study conditions under which other consensus methods are special cases of Waterman's. Of particular interest might be such consensus methods as minimizing $\sum_i d(w,a_i)^m$ or minimizing $\sum_i \log d(w,a_i)^m$.

Algorithms
In practical applications in molecular biology, good algorithms for obtaining a consensus pattern are essential. Waterman and his co-authors provide a method for computing their consensus patterns in the case λ_d = (k−d)/k.

The Brute Force Algorithm
Suppose that D(w,w′) is the number of mismatches between two words w, w′ of length k. The most naive algorithm for finding the Waterman consensus would proceed by brute force and calculate every best-mismatch distance d(w,a), for w a potential pattern word of length k and a in Π, by computing D(w,w′) for all words w′ of length k in a. Suppose that c is the cost of computing any D(w,w′), let a be any word of length L, and let w be a word of length k. If the best-mismatch distance d(w,a) is calculated by computing D(w,w′) for all w′ of length k in a, then since there are L−k+1 such words w′ in a, we calculate d(w,a) with cost (L−k+1)c. If p = |Σ|, there are p^k potential pattern words w. If there are n words in Π, then the brute force algorithm can compute the scores s(w) for all potential pattern words w, and hence obtain the optimal pattern word, at a total cost of p^k · n(L−k+1)c. Note that p is relatively small, k is small, and n is typically large.
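The brute-force computation just described is easy to express in code. The following Python sketch (my own illustration, not Waterman's implementation) enumerates all p^k candidate patterns, computes best-mismatch distances by sliding a length-k window, and scores each pattern by s(w) with λ_d = (k−d)/k; helpers for the median score μ and mean score ξ allow the earlier examples to be checked.

```python
from itertools import product

def best_mismatch(w, a):
    # d(w,a): fewest mismatches over all length-k windows of a
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[j:j + k]))
               for j in range(len(a) - k + 1))

def waterman_consensus(words, k, alphabet="01"):
    """All length-k patterns maximizing s(w) = sum_i lam[d(w,a_i)]
    with lam[d] = (k-d)/k.  This makes p^k * n * (L-k+1) window
    comparisons, matching the cost estimate in the text."""
    lam = [(k - d) / k for d in range(k + 1)]
    patterns = ["".join(t) for t in product(alphabet, repeat=k)]
    s = {w: sum(lam[best_mismatch(w, a)] for a in words) for w in patterns}
    top = max(s.values())
    return [w for w in patterns if s[w] == top]

def median_and_mean(words, k, alphabet="01"):
    # patterns minimizing mu(w) = sum_i d(w,a_i), resp. xi(w) = sum_i d^2
    patterns = ["".join(t) for t in product(alphabet, repeat=k)]
    mu = {w: sum(best_mismatch(w, a) for a in words) for w in patterns}
    xi = {w: sum(best_mismatch(w, a) ** 2 for a in words) for w in patterns}
    return ([w for w in patterns if mu[w] == min(mu.values())],
            [w for w in patterns if xi[w] == min(xi.values())])

print(waterman_consensus(["111010", "111111"], k=2))         # ['11']
print(median_and_mean(["1111", "0000", "1000", "0001"], k=2))
# (['00'], ['01', '10'])  -- the median/mean example above
```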
An Improved Algorithm
As Waterman et al. observe, we can improve the performance of the algorithm considerably by looking at neighborhoods of the k-length words that occur in the data. Let the neighborhood N(w′) consist of all words w of length k such that D(w,w′) ≤ T, for some threshold T; usually T is taken to be small, for example 3. The idea is to eliminate every potential pattern word w that does not fall within N(w′) for some k-length word w′ in each sequence word a in the database Π, i.e., to eliminate all words w such that d(w,a) > T for some a in Π.

Initial Calculations
As an initial step, calculate D(w,w′) for all words w, w′ of length k from Σ. This requires p^{2k} calculations of cost c each.

Going Through the Database
Go through the database one word at a time. Given a word a, consider each k-length word w′ in a, find N(w′), and look up D(w,w′) for all k-length pattern words w in N(w′). If N is an upper bound on the size of the neighborhoods N(w′), then we have to look up at most (L−k+1)N values D(w,w′). Assume that C is the look-up cost. If a word w is not in N(w′) for any such w′, dismiss it as a potential pattern word. Calculate s(w) for all pattern words w that have not been dismissed. Since we can update s(w) as we go through each word of Π, and since there are n words in Π, we can calculate s(w) for all non-dismissed w, and therefore obtain the highest-scoring non-dismissed w, at a cost of at most n(L−k+1)NC + p^{2k}c, where the second term comes from the initial calculation. This cost is considerably smaller than the cost n(L−k+1)p^k c of the brute force algorithm: since C is much less than c and N is presumably much less than p^k, NC is much less than p^k c, and since n is large, this saving is significant compared to the added cost p^{2k}c. Of course, the improved algorithm assumes that no dismissed word can be a consensus word, which could be false.

Algorithms for the Median
A considerable amount of work in the literature has been devoted to finding algorithms for obtaining the median, although this is often a difficult computational problem and is NP-complete in some contexts. In a typical application, we have a large database, and so a very efficient algorithm will be needed. One of the reasons for our interest in the median procedure was that some of the algorithms for computing medians might be usable to improve upon the computational methods given for the Waterman consensus. We have, however, not worked on this idea.

Axiomatic Approach
In the group consensus literature in the social sciences and elsewhere, there has been considerable emphasis on finding axioms that characterize different consensus procedures. However, consensus methods used in molecular biology tend to be chosen because they seem interesting or useful, rather than on the basis of some theory. Such a theory could be based on an axiomatic approach. Axioms have been given for the median procedure in some contexts:
•Young and Levenglick (1978) axiomatized the Kemeny-Snell median where the a_i are rankings rather than words.
•The median procedure has been axiomatically characterized when the a_i are vertices of various kinds of graphs: n-trees (Barthélemy and McMorris, 1986), covering graphs of semilattices (Leclerc, 1994), and median graphs (McMorris, Mulder, and Roberts, 1998).

Not as much is known about means:
•The Kemeny-Snell mean procedure for rankings has not been characterized.
•The mean procedure was characterized for trees by Holzman (1990).
•However, Hansen and Roberts (1996) showed that a natural generalization of Holzman's axioms to arbitrary connected graphs with cycles leads to an impossibility result.

It would be interesting, and potentially useful, to characterize axiomatically the median and the mean in the bioconsensus context we have described, i.e., to give axioms under which F(Π) is obtained by minimizing μ or ξ. Our results do not give such a characterization, since they depend upon the Waterman consensus method. No results are known in the sequence context that characterize axiomatically those consensus functions that are the median or the mean, either when L = k or when L > k. It would also be interesting, and potentially of practical significance, to try to axiomatize the Waterman consensus. No results are known about this problem.