Voting, Meta-search, and
Bioconsensus
Fred S. Roberts
Department of Mathematics and
DIMACS (Center for Discrete Mathematics
and Theoretical Computer Science)
Rutgers University
Piscataway, New Jersey
From Social Science Methods
to Information Technology
Applications
Over the years, social scientists have developed a
variety of methods for dealing with problems of
voting, decisionmaking, conflict and cooperation,
measurement, etc. These methods, often heavily
mathematical, are beginning to find novel uses in a
variety of information technology applications.
Such methods will need to be substantially improved to
deal with such issues as:
•Computational Intractability
•Limitations on Computational Power/Information
•The Sheer Size of Some of the New Applications
•Learning Through Repetition
•Security, Privacy, and Cryptography
This talk will concentrate on social science methods for
dealing with voting and decisionmaking. We will look
briefly at various applications of these methods to a
variety of information technology problems and then
concentrate on a particular application to biological
databases.
Voting/Group Decisionmaking
In a standard model for voting, each member of a group
gives an opinion. We seek a consensus among these
opinions.
Sometimes the opinion is just a vote for a first choice
among a set of alternative choices or candidates.
In other contexts, the opinion might be a ranking of all
the alternatives.
Obtaining opinions as rankings among alternatives or
candidates can sometimes give a lot more information
about a voter’s true preferences than simply obtaining
their first choice.
But then we have the challenge of defining what we
mean by a consensus.
In many applications, we seek a ranking that is in some
sense a consensus of the rankings provided by all of
the voters.
Medians and Means
Among the most important directions of research in the
theory of group consensus is the idea that we can obtain a
group consensus by first finding a way to measure the
distance between any two alternatives or any two
rankings.
Let M be the set of alternatives (candidates) or the set of
rankings of alternatives and d(a,b) = distance between a
and b in M.
A profile (of opinions) is a vector (a1,a2, …, an) of
points from M.
The median of a profile is the set of all points x of M that
minimize
Σi d(ai,x),
and the mean is the set of all points x of M that minimize
Σi d(ai,x)²,
where Σi denotes the sum over i = 1, 2, …, n.
One very commonly used method for measuring the
distance between two rankings of candidates is called the
Kemeny-Snell distance: twice the number of pairs of
candidates i and j for which i is ranked above j in one
ranking and below j in the other, plus the number of pairs
that are ranked in one ranking and tied in the other.
Consider the following profile:
Voter 1 (a1): Bush, Gore, Nader
Voter 2 (a2): Bush, Gore, Nader
Voter 3 (a3): Gore, Bush, Nader
In the case of this profile, the Kemeny-Snell median is
the ranking x = Bush, Gore, Nader. We have
d(a1,x) + d(a2,x) + d(a3,x) = 0 + 0 + 2 = 2.
However, the Kemeny-Snell mean is the ranking y =
Bush-Gore, Nader, in which Bush and Gore are tied for
first place. For d(a1,y)² + d(a2,y)² + d(a3,y)² = 1 + 1 + 1
= 3, while d(a1,x)² + d(a2,x)² + d(a3,x)² = 4.
Note that medians or means need not be unique.
Voter 1: Bush, Gore, Nader
Voter 2: Gore, Nader, Bush
Voter 3: Nader, Bush, Gore
This is the “voter’s paradox” situation. In this case, there
are three Kemeny-Snell medians: the three rankings themselves.
However, there is a unique Kemeny-Snell mean, the
ranking in which all three candidates are tied.
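To make the definitions concrete, here is a small brute-force
sketch in Python (the helper names ks_distance, all_weak_orders,
and consensus are ours, not from the talk; enumerating all
rankings with ties is only feasible for a handful of candidates).
It reproduces both examples above.

from itertools import combinations, product

def ks_distance(r1, r2):
    # Kemeny-Snell distance between two (possibly tied) rankings,
    # given as dicts candidate -> rank level (smaller = better).
    # A pair ordered oppositely counts 2; strict vs. tie counts 1.
    d = 0
    for a, b in combinations(sorted(r1), 2):
        s1 = (r1[a] > r1[b]) - (r1[a] < r1[b])   # -1, 0, or +1
        s2 = (r2[a] > r2[b]) - (r2[a] < r2[b])
        d += abs(s1 - s2)
    return d

def all_weak_orders(cands):
    # All rankings with ties (weak orders) of a small candidate set:
    # level assignments whose set of used levels is 0, 1, ..., m.
    for levels in product(range(len(cands)), repeat=len(cands)):
        if sorted(set(levels)) == list(range(len(set(levels)))):
            yield dict(zip(cands, levels))

def consensus(profile, power):
    # Rankings x minimizing the sum of d(a_i, x)**power
    # (power=1 gives the median, power=2 the mean).
    best, best_score = [], None
    for x in all_weak_orders(sorted(profile[0])):
        score = sum(ks_distance(a, x) ** power for a in profile)
        if best_score is None or score < best_score:
            best, best_score = [x], score
        elif score == best_score:
            best.append(x)
    return best

profile = [{"Bush": 0, "Gore": 1, "Nader": 2},
           {"Bush": 0, "Gore": 1, "Nader": 2},
           {"Gore": 0, "Bush": 1, "Nader": 2}]
print(consensus(profile, 1))  # median: Bush > Gore > Nader
print(consensus(profile, 2))  # mean: Bush-Gore tied, then Nader

Run on the voter's-paradox profile instead, the same code returns
all three voters' rankings as medians and the all-tied ranking as
the unique mean.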
Because of non-uniqueness, we think of consensus as
defining a function F from the set of profiles to the set of
sets of rankings. Kenneth Arrow called this a group
consensus function or social welfare function.
When the elements of M are rankings, calculation of
medians and means can be quite difficult.
Theorem (Bartholdi, Tovey and Trick, 1989;
Wakabayashi, 1986): Under the Kemeny-Snell distance,
the calculation of the median of a profile of rankings is
an NP-complete problem.
Meta-Search and Other
Information Technology
Applications of Consensus
Methods
Meta-Search
Meta-Search is the process of combining the results of
several search engines. We seek the consensus of the
search engines, whether it is a consensus first choice
website or a consensus ranking of websites.
Meta-search has been studied using consensus methods
by Cohen, Schapire, and Singer (1999) and Dwork,
Kumar, Naor, and Sivakumar (2000).
One important point: In voting, there are usually few
candidates and many voters. In meta-search, there are
usually few voters and many candidates.
In this setting, Dwork et al. developed an approximation
to the Kemeny-Snell median that preserves most of its
desirable properties and is computationally tractable.
Information Retrieval
We rank documents according to their probability of
relevance to a query. Given a number of queries, we seek
a consensus ranking.
Collaborative Filtering
We use knowledge of the behavior of multiple users to
make recommendations to an active user, for example
combining movie ratings by others to prepare an ordered
list of movies for a given user.
Consensus methods have been applied to collaborative
filtering by Freund, Iyer, Schapire, and Singer (1998),
who designed an efficient “boosting” system for
combining preferences. Pennock, Horvitz, and Giles
(2000) applied consensus methods to develop
“recommender systems.”
Software Measurement
Combining several different measures or ratings through
appropriate consensus methods is an important topic in
the measurement of the understandability, quality,
functionality, reliability, efficiency, usability,
maintainability, or portability of software. (Fenton and
Pfleeger, 1997)
Ordinal Filtering in Digital Image Processing
In one method of noise removal, to check if a pixel is
noise, one compares it with neighboring pixels. If the
difference exceeds a certain threshold, one replaces the
value of the given pixel with a mean or median of the
values of its neighbors. (See Janowitz (1986).)
Related methods are used in models of “distributed
consensus.” A number of processors each holds an initial
(binary) value; some of them may be faulty and ignore
any protocol, yet it is required that the non-faulty
processors eventually agree (reach consensus) on a value.
Berman and Garay (1993) developed a protocol for
distributed consensus based on the parliamentary
procedure known as “cloture” and showed it was very
good in terms of a number of important criteria, including
polynomial computation and communication costs.
Bioconsensus
In recent years, methods of consensus developed for
applications in the social sciences have become widely
used in biology. In molecular biology alone, Bill Day has
compiled a bibliography of hundreds of papers that use
such consensus methods.
The following are some of the ways that
bioconsensus problems arise:
•Alternative phylogenies (evolutionary trees) are
produced using different methods and we need to
choose a consensus tree.
•Alternative taxonomies (classifications) are produced
using different models and we need to choose a
consensus taxonomy.
•Alternative molecular sequences are produced using
different criteria or different algorithms and we need
to choose a consensus sequence.
•Alternative sequence alignments are produced and we
need to choose a consensus alignment.
Finding A Pattern or Feature Appearing in a
Set of Molecular Sequences
In many problems of the social and biological
sciences, data is presented as a sequence or “word”
from some alphabet . Given a set of sequences, we
seek a pattern or feature that appears widely, and we
think of this as a consensus sequence or set of
sequences. A pattern is often thought of as a
consecutive subsequence of short, fixed length. In
biology, such sequences arise from DNA, RNA,
proteins, etc.
Why Look for Such Patterns?
Similarities between sequences or parts of sequences
lead to the discovery of shared phenomena.
For example, it was discovered that the sequence for
platelet-derived growth factor, which causes growth in the
body, is 87% identical to the sequence for v-sis, a
cancer-causing gene. This led to the discovery that v-sis
works by stimulating growth.
In recent years, we have developed huge databases of
molecular sequences. For example, GenBank has over 7
million sequences comprising 8.6 billion bases. The
search for similarity or patterns has extended from
pairs of sequences to finding patterns that appear in
common in a large number of sequences or throughout
the database.
To find patterns in a database of sequences, it is useful to
measure the distance between sequences. If a and b are
sequences of the same length, a common way to define
the distance d(a,b) is to take it to be the number of
mismatches between the sequences.
To measure how closely a pattern fits into a sequence,
we have to measure the distance between sequences of
different lengths.
If b is longer than a, then d(a,b) could be the
smallest number of mismatches in all possible
alignments of a as a consecutive subsequence of b.
We call this the best-mismatch distance.
Example:
a = 0011, b = 111010
Possible Alignments:

111010
0011

111010
 0011

111010
  0011
The best-mismatch distance is 2, which is achieved in the
third alignment.
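As a sanity check, the best-mismatch distance is a few lines of
Python (the function name best_mismatch is ours); it slides the
shorter word along the longer one and keeps the fewest mismatches:

def best_mismatch(a, b):
    # Fewest mismatches over all alignments of the shorter word
    # as a consecutive block of the longer one.
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    k = len(short)
    return min(sum(x != y for x, y in zip(short, long_[i:i + k]))
               for i in range(len(long_) - k + 1))

print(best_mismatch("0011", "111010"))  # 2, as in the example above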
An alternative way to measure d(a,b) is to count the
smallest number of mismatches between sequences
obtained from a and b by inserting gaps in appropriate
places, where a mismatch between a letter of Σ and a gap is
counted as an ordinary mismatch. We won’t use this
alternative measure of distance.
Waterman (1989), Waterman, Galas, and Arratia (1984),
Galas, Eggert, and Waterman (1985) and others study the
following situation:
• Σ is a finite alphabet
• k is a fixed finite number (the pattern length)
• A profile Π = (a1,a2, …, an) consists of a set of words
(sequences) of length L from Σ, with L ≥ k
We seek a set F(Π) = F(a1,a2, …, an) of consensus words
of length k from Σ.
Here is a small piece of data from Waterman (1989), in
which he looks at 59 bacterial promoter sequences:
RRNABP1:  ACTCCCTATAATGCGCCA
TNAA:     GAGTGTAATAATGTAGCC
UVRBP2:   TTATCCAGTATAATTTGT
SFC:      AAGCGGTGTTATAATGCC
Notice that if we are looking for patterns of length 4, each
sequence has the pattern TAAT.
However, suppose that we add another sequence:
M1 RNA:   AACCCTCTATACTGCGCG
The pattern TAAT does not appear here.
However, it almost appears, since the word TACT
appears, and this has only one mismatch from the pattern
TAAT.
So, in some sense, TAAT is a good consensus pattern
for all five sequences. We now make this idea precise.
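A quick check of this example in Python, reusing the
best-mismatch function from above on the five sequences just shown:

def best_mismatch(w, a):
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))

seqs = ["ACTCCCTATAATGCGCCA",  # RRNABP1
        "GAGTGTAATAATGTAGCC",  # TNAA
        "TTATCCAGTATAATTTGT",  # UVRBP2
        "AAGCGGTGTTATAATGCC",  # SFC
        "AACCCTCTATACTGCGCG"]  # M1 RNA
print([best_mismatch("TAAT", s) for s in seqs])  # [0, 0, 0, 0, 1]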
In practice, the problem is a bit more complicated than
we have described it. We have long sequences and we
consider “windows” of length L beginning at a fixed
position, say the jth. Thus, we consider words of
length L in a long sequence, beginning at the jth
position. For each possible pattern of length k, we
ask how closely it can be matched in each of the
sequences in a window of length L starting at the jth
position.
Formalization
Let  be a finite alphabet of size at least 2 and  be a finite
collection of words of length L on . Let F() be the set
of words of length k  2 that are our consensus patterns.
(We drop the distinction between profile as vector and
profile as set.)
Let  = {a1, a2, …, an}. One way to define F() is as
follows. Let d(a,b) be the best-mismatch distance.
Consider nonnegative parameters d that are monotone
decreasing with d and let F(a1,a2, …, an) be all those
words w of length k that maximize
n
s(w) = d(w,ai)
i 1
28
We call such an F a Waterman consensus.
In particular, Waterman and others use the parameters
λ_d = (k-d)/k.
Example:
An alphabet used frequently is the purine/pyrimidine
alphabet {R,Y}, where R = A (adenine) or G (guanine)
and Y = C (cytosine) or T (thymine). For simplicity, it is
easier to use the digits 0,1 rather than the letters R,Y.
Thus, let  = {0,1}, let k = 2. Then the possible pattern
words are 00, 01, 10, 11.
29
Suppose a1 = 111010, a2 = 111111. How do we find
F(a1,a2)?
We have:
d(00,a1) = 1, d(00,a2) = 2
d(01,a1) = 0, d(01,a2) = 1
d(10,a1) = 0, d(10,a2) = 1
d(11,a1) = 0, d(11,a2) = 0
S(00) =  d(00,ai) = 1 + 2,
S(01) =  d(01,ai) = 0 + 1
S(10) =  d(10,ai) = 0 + 1
S(11) =  d(11,ai) = 0 + 0
As long as 0 > 1 > 2, it follows that 11 is the
30
consensus pattern, according to Waterman’s consensus.
Example:
Let  ={0,1}, k = 3, and consider F(a1,a2,a3) where
a1 = 000000, a2 = 100000, a3 = 111110. The possible
pattern words are: 000, 001, 010, 011, 100, 101, 110, 111.
d(000,a1) = 0, d(000,a2) = 0, d(000,a3) = 2,
d(001,a1) = 1, d(001,a2) = 1, d(001,a3) = 2,
d(100,a1) = 1, d(100,a2) = 0, d(100,a3) = 1, etc.
S(000) = 2 + 20, S(001) = 2 + 21, S(100) = 21
+ 0, etc.
Now, 0 > 1 > 2 implies that S(000) > S(001).
Similarly, one shows that the score is maximized by
S(000) or S(100). Monotonicity doesn’t say which of
these is highest.
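For small alphabets and pattern lengths, the Waterman consensus
can be computed directly. Here is a brute-force sketch (function
names are ours) using Waterman's parameters λ_d = (k-d)/k; with
these particular λ's, the first example above gives 11, and in the
second example 000 and 100 tie exactly:

from itertools import product

def best_mismatch(w, a):
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))

def waterman_consensus(profile, alphabet, k, lam=None):
    # Words w of length k maximizing s(w) = sum_i lam[d(w, a_i)].
    # lam must be nonnegative and decreasing in d; the default is
    # Waterman's choice lam[d] = (k - d) / k.
    if lam is None:
        lam = [(k - d) / k for d in range(k + 1)]
    best, best_s = [], None
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        s = sum(lam[best_mismatch(w, a)] for a in profile)
        if best_s is None or s > best_s:
            best, best_s = [w], s
        elif s == best_s:
            best.append(w)
    return best

print(waterman_consensus(["111010", "111111"], "01", 2))
# ['11']
print(waterman_consensus(["000000", "100000", "111110"], "01", 3))
# ['000', '100']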
Other Consensus Functions
The median is the collection of words w of length k
which minimize
μ(w) = Σi d(w,ai),
and the mean is the collection of words w of length k
which minimize
σ(w) = Σi d(w,ai)².
Another measure which it might be of interest to minimize
is a convex combination of these two:
τ_α(w) = αμ(w) + (1-α)σ(w),  α ∈ [0,1].
Words which minimize τ_α will be called the mixed
median-mean. This might be of interest if we are not
ready to choose either medians or means, or want some
combination of the two.
We might also choose to minimize Σi d(w,ai)^m or
Σi log d(w,ai)^m for fixed m.
Example:
Let  = {0,1}, k = 2,  = {a1,a2,a3,a4},
a1 = 1111, a2 = 0000, a3 = 1000, a4 = 0001.
Possible pattern words: 00, 01, 10, 11.
d(00,ai)
= 2, d(01,ai) = 3, d(10,ai) = 3, d(11,ai) = 4.
Thus, 00 is the median.
d(00,ai)2
= 4, d(01,ai)2 = 3, d(10,ai)2 = 3, d(11,ai)2 = 6,
so the mean consists of the two words 01 and 10, neither of
which is a median.
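The same brute-force pattern works for the median and the mean.
Here is a sketch (names are ours) that reproduces this example:

from itertools import product

def best_mismatch(w, a):
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))

def word_consensus(profile, alphabet, k, power=1):
    # Length-k words minimizing sum_i d(w, a_i)**power
    # (power=1: median, power=2: mean).
    best, best_score = [], None
    for letters in product(alphabet, repeat=k):
        w = "".join(letters)
        score = sum(best_mismatch(w, a) ** power for a in profile)
        if best_score is None or score < best_score:
            best, best_score = [w], score
        elif score == best_score:
            best.append(w)
    return best

prof = ["1111", "0000", "1000", "0001"]
print(word_consensus(prof, "01", 2, power=1))  # ['00']       (median)
print(word_consensus(prof, "01", 2, power=2))  # ['01', '10'] (mean)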
Summary of Notation
s(w) = Σi λ_{d(w,ai)}
μ(w) = Σi d(w,ai)
σ(w) = Σi d(w,ai)²
τ_α(w) = αμ(w) + (1-α)σ(w),  α ∈ [0,1]
(all sums running over i = 1, 2, …, n)
The Special Case d = (k-d)/k
Suppose that d = (k-d)/k. We have
(w) =
n
d(w,ai),
i 1
n
n
i 1
i 1
s(w) = d(w,ai) = n - (1/k)  d(w,ai).
Thus, for fixed k  2,  of size at least 2, and any size
set , for all words w, w of length L:
(w)  (w)  s(w)  s(w).
It follows that for fixed k ≥ 2, Σ of size at least 2, and any
size set Π, there is a choice of the parameter λ_d so that
the Waterman consensus is the same as the median.
(This also holds for k = 1 or Σ of size 1, but these are
uninteresting cases.)
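A quick numeric check (ours) of the identity s(w) = n - μ(w)/k
behind this claim, on the k = 3 example from earlier:

from itertools import product

def best_mismatch(w, a):
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[i:i + k]))
               for i in range(len(a) - k + 1))

profile, k = ["000000", "100000", "111110"], 3
n = len(profile)
for letters in product("01", repeat=k):
    w = "".join(letters)
    mu = sum(best_mismatch(w, a) for a in profile)
    s = sum((k - best_mismatch(w, a)) / k for a in profile)
    assert abs(s - (n - mu / k)) < 1e-9  # identity holds for every w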
Similarly, one can show that for any fixed k ≥ 2, Σ of
size at least 2, and any size set Π, there is a choice of
parameter λ_d so that for all words w, w′ of length k:
σ(w) ≤ σ(w′) ⇔ s(w) ≥ s(w′).
For this choice of λ_d, a word is a Waterman consensus
iff it is a mean.
More generally, for all rational numbers α ∈ [0,1], for
fixed k ≥ 2, Σ of size at least 2, and any size set Π, there is
a choice of parameter λ_d so that for all words w, w′ of
length k:
τ_α(w) ≤ τ_α(w′) ⇔ s(w) ≥ s(w′).
For this choice of λ_d, a word is a Waterman consensus
iff it is a mixed median-mean with convex combination
depending upon α.
What Parameters d Give Rise to Median,
Mean, or Mixed Median-Mean?
Let us first decide if  can have repeated words. From
the point of view of the application, this is a reasonable
assumption. (Repeats are allowed in the database; or
some words have more significance than others.)
We shall investigate both the repetitive case, where Π
is allowed to have repeated words, and the
nonrepetitive case.
The following results are joint with Boris Mirkin.
When do we Get the Median in Waterman
Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at
least two letters, and λ_d is a sequence with λ1 < λ0. Then
the following are equivalent:
(a). The equivalence
μ(w) ≤ μ(w′) ⇔ s(w) ≥ s(w′)
holds for all words w, w′ of length k from Σ and all finite
nonrepetitive sets Π (of any size) of words of length
L ≥ k from Σ.
(b). There are constants B, C, B < 0, s.t. for all 0 ≤ j ≤ k,
λ_j = Bj + C.
In other words, under the hypotheses of the theorem, the
median procedure corresponds exactly to the choice of
parameters j = Bj + C, B < 0. Note that j = (k-j)/k is
a special case of this.
Remark: This theorem (and all subsequent theorems) also
hold if we replace “words of length L  k” by “words of
fixed length L, L  k.”
41
When do we Get the Mean in Waterman
Consensus?
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at
least two letters, and λ_d is a sequence with λ1 < λ0. Then
the following are equivalent:
(a). The equivalence
σ(w) ≤ σ(w′) ⇔ s(w) ≥ s(w′)
holds for all words w, w′ of length k from Σ and all finite
sets Π (of any size) of words of length L ≥ k from Σ.
(b). There are constants A, C, A < 0, so that for all
0 ≤ j ≤ k, λ_j = Aj² + C.
In other words, under the hypotheses of the theorem, the
mean procedure corresponds exactly to the choice of
parameters j = Aj2 + C, A < 0.
Note that we require (a) to hold for all finite sets , even
those with repetitions.
Removing this hypothesis involves some technicalities. To do
so, we have found it necessary to allow a larger alphabet or
to take k sufficiently large and also to consider only L
larger than k. The first result uses an alphabet of size at
least four.
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at
least four letters, and λ_d is a sequence with λ1 < λ0. Then
the following are equivalent:
(a). The equivalence
σ(w) ≤ σ(w′) ⇔ s(w) ≥ s(w′)
holds for all words w, w′ of length k from Σ and all finite
nonrepetitive sets Π (of any size) of words of length L > k
from Σ.
(b). There are constants A, C, A < 0, s.t. for all 0 ≤ j ≤ k,
λ_j = Aj² + C.
The next result removes the hypothesis of the alphabet
being of size at least 4, but adds a hypothesis that k ≥ 3:
Theorem: Suppose k is fixed, k  3,  is an alphabet
of at least two letters, and d is a sequence with 1 < 0.
Then the following are equivalent:
(a). The equivalence
(w)  (w)  s(w)  s(w)
holds for all words w, w of length k from  and all
finite nonrepetitive sets  (of any size) of words of
length L > k from .
(b). There are constants A, C, A < 0, s.t. for all 0  j  k,
45
j = Aj2 + C.
When do we Get the Mixed Median-Mean in
Waterman Consensus?
Recall that
τ_α(w) = αμ(w) + (1-α)σ(w),  α ∈ [0,1].
In the following, we shall assume that α is a rational
number. It might be of purely technical interest to
figure out what happens when α is irrational, but we
have not been able to obtain analogous results for this
case.
Theorem: Suppose k is fixed, k ≥ 2, Σ is an alphabet of at
least two letters, and λ_d is a sequence with λ1 < λ0. Then
the following are equivalent:
(a). The equivalence
τ_α(w) ≤ τ_α(w′) ⇔ s(w) ≥ s(w′)
holds for all words w, w′ of length k from Σ and all finite
sets Π (of any size) of words of length L ≥ k from Σ.
(b). There are constants D, E, D < 0, s.t. for all 0 ≤ j ≤ k,
λ_j = D(1-α)j² + Dαj + E.
In other words, under the hypotheses of the theorem, the
mixed median-mean procedure corresponds exactly to the
choice of parameters λ_j = D(1-α)j² + Dαj + E, D < 0.
If we only want to require that the equivalence in part (a)
holds for finite nonrepetitive sets Π, one way to do so, as
with the results about the mean or σ, is to assume that Σ
is sufficiently large. We have been able to prove this result
under the added hypothesis that L > k and the added
hypothesis that Σ has at least r(α) + 1 elements, where
2(2-α) = s/t for s, t positive integers and
r(α) = max{s-t, t+1}.
Other Consensus Functions as Special Cases
of Waterman’s Consensus
It would be interesting to study conditions under which
other consensus methods are special cases of
Waterman’s. Of particular interest might be such
consensus methods as minimizing Σi d(w,ai)^m or
minimizing Σi log d(w,ai)^m.
Algorithms
In practical applications in molecular biology, good
algorithms for obtaining a consensus pattern are essential.
Waterman and his co-authors provide a method for
computing their consensus patterns in the case λ_d = (k-d)/k.
The Brute Force Algorithm
Suppose that D(w,w) is the number of mismatches between
two words w, w of length k. The most naïve algorithm for
finding the Waterman consensus would proceed by brute
force and calculate all best-mismatch distances d(w,a) for
w a potential pattern word of length k and a   by
calculating D(w,w) for all words w of length k in a.
50
Suppose that c is the cost of computing any D(w,w′). Let
a be any word of length L and w be a word of length k.
If the best-mismatch distance d(w,a) is calculated by
computing D(w,w′) for all w′ of length k in a, then since
there are L-k+1 such words w′ in a, we calculate d(w,a)
with cost (L-k+1)c.
If p = |Σ|, there are p^k potential pattern words w. If
there are n words in Π, then the brute force algorithm can
compute the scores s(w) for all potential pattern words w
and hence obtain the optimal pattern word by making a
number of calculations of total cost
p^k n(L-k+1)c.
Note that p is relatively small, k is small, and n is
typically large.
An Improved Algorithm
As Waterman et al. observe, we can improve the
performance of the algorithm considerably by looking at
neighborhoods of a word w′ of length k. Let the
neighborhood N(w′) consist of all words w of length k
so that D(w,w′) ≤ T for some threshold T. Usually, T is
taken to be small, for example ≤ 3. The idea is to
eliminate all potential pattern words w which don’t fall
within N(w′) for some k-length word w′ in each sequence
word a in the database Π, i.e., to eliminate all words w
so that d(w,a) > T for some a in Π.
Initial Calculations
As an initial step, calculate D(w,w′) for all words w, w′ of
length k from Σ. This requires p^{2k} calculations of cost c
each.
Going Through the Database
Go through the database Π one word at a time. Given a
word a, consider each k-length word w′ in a, find
N(w′), and look up D(w,w′) for all k-length pattern words
w in N(w′). If N is an upper bound on the size of the
neighborhoods N(w′), then we have to look up at most
(L-k+1)N values D(w,w′). Assume that C is the look-up
cost. If a word w is not in N(w′) for any such w′, dismiss
it as a potential pattern word.
Calculate s(w) for all pattern words w that have not been
dismissed. Since we can update s(w) as we go through
each word of , and since there are n words in , we
can calculate s(w) for all non-dismissed w and therefore
obtain the highest scoring non-dismissed w with cost at
most
n(L-k+1)NC + p2kc,
where the second term comes from the initial calculation.
This cost is considerably smaller than the cost of the
brute force algorithm,
n(L-k+1)pkc.
54
This is because, typically, n is large in comparison to
p^{2k}c. Since C is much less than c and N is presumably
much less than p^k, NC is much less than p^k c. Since n is
large, this saving is significant compared to the added cost
p^{2k}c.
Of course, the improved algorithm assumes that no
dismissed word can be a consensus word, which could be
false.
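A sketch of the filtering idea in Python (ours, under the stated
assumption that dismissing is safe; D is tabulated once for all
pairs of k-words, and then each database word dismisses every
pattern that lies outside all of its windows' neighborhoods):

from itertools import product

def mismatches(w, v):
    # D(w, v): mismatch count between two equal-length words.
    return sum(x != y for x, y in zip(w, v))

def waterman_filtered(profile, alphabet, k, T=3, lam=None):
    if lam is None:
        lam = [(k - d) / k for d in range(k + 1)]  # Waterman's lambdas
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    # Initial calculation: D(w, w') for all pairs of k-words.
    D = {(w, v): mismatches(w, v) for w in patterns for v in patterns}
    scores = {w: 0.0 for w in patterns}
    alive = set(patterns)
    for a in profile:                          # one pass through the database
        windows = {a[i:i + k] for i in range(len(a) - k + 1)}
        for w in list(alive):
            d = min(D[w, v] for v in windows)  # best-mismatch via look-ups
            if d > T:
                alive.discard(w)  # w is in no window's neighborhood: dismiss
            else:
                scores[w] += lam[d]
    if not alive:
        return []  # threshold too strict: everything was dismissed
    best = max(scores[w] for w in alive)
    return [w for w in alive if scores[w] == best]

print(waterman_filtered(["111010", "111111"], "01", 2, T=1))  # ['11']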
Algorithms for the Median
A considerable amount of work has been devoted in the
literature to finding algorithms for obtaining the median,
although this is often a difficult computational problem
and is NP-complete in some contexts. In a typical
application, we have a large database Π and so a very
efficient algorithm will be needed.
One of the reasons for our interest in the median
procedure was that some of the algorithms for computing
medians might be usable to improve upon the
computational methods given for the Waterman
consensus. We have, however, not worked on this idea.
Axiomatic Approach
In the group consensus literature in the social sciences and
elsewhere, there has been considerable emphasis on
finding axioms characterizing different consensus
procedures. However, consensus methods used in
molecular biology tend to be chosen because they seem
interesting or useful, rather than on the basis of some
theory. Such a theory could be based on an axiomatic
approach.
Axioms have been given for the median procedure in
some contexts:
•Young and Levenglick (1978) axiomatized the Kemeny-Snell
median where the ai are rankings rather than words.
•The median procedure has been axiomatically
characterized when the ai are vertices of various kinds of
graphs:
•n-trees: Barthelemy and McMorris, 1986
•Covering graphs of semilattices: Leclerc, 1994
•Median graphs: McMorris, Mulder, and Roberts, 1998
Not as much is known about means:
•The Kemeny-Snell mean procedure for rankings has
not been characterized.
•The mean procedure was characterized for trees by
Holzman (1990).
•However, Hansen and Roberts (1996) showed that a
natural generalization of Holzman’s axioms for
arbitrary connected graphs with cycles leads to an
impossibility result.
It would be interesting and potentially useful to
characterize axiomatically the median and the mean in
the bioconsensus context we have described, i.e., to give
axioms for F() to be obtained by minimizing  or
. Our results do not give such a characterization since
these results depend upon the Waterman consensus
method. No results are known in the sequence context
which characterize axiomatically those consensus
functions that are the median or the mean, either when
L = k or when L  k.
It would also be interesting and potentially of
practical significance to try to axiomatize the
Waterman consensus. No results are known about this
problem.