Lecture 12

Rank Aggregation Methods II
Experiments
CS728
Recall the Rank Aggregation Problem
• m candidates (a.k.a. “alternatives”)
– M = {1,…,m}: set of candidates
• n voters (a.k.a. “agents” or “judges”)
– N = {1,…,n}: set of voters
• Each voter i has a ranking σ_i on M
– σ_i(a) < σ_i(b) means the i-th voter prefers a to b
– A ranking may be a total or partial order
• The rank aggregation problem:
Combine σ_1,…,σ_n into a single ranking σ on M, which
represents the “social choice” of the voters.
– Rank aggregation function: f(σ_1,…,σ_n) = σ
– σ may be a total or partial order
Experiments: Distance Measures
Goal: Quantitatively compare different rank aggregation
methods.
Performance Measures:
(1) The Spearman footrule distance is the sum, over all elements, of the
absolute difference between an element's positions in the two lists. It
is normalized by dividing by the maximum value (1/2)|S|^2, giving a
value between 0 and 1.
(2) The Kendall tau distance counts the number of pairwise
disagreements between the two lists. Dividing by the maximum possible
value (1/2)|S|(|S| - 1) gives a normalized version, with a value between 0
and 1.
(3) The induced footrule distance between a full list s and a partial
list t is the footrule distance between t and the projection of s onto
the elements of t. The induced Kendall tau distance is defined in the
same manner.
(4) The scaled footrule distance weights the contribution of each
element by the length of the list it appears in. If
s is a full list and t is a partial list, then:
SF(s, t) = Σ_{i ∈ t} | s(i)/|s| - t(i)/|t| |.
Normalize SF by dividing by |t|/2.
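A minimal sketch of the four measures above, assuming lists are Python lists of distinct items ranked best first and that every element of a partial list also appears in the full list. The function names are illustrative only and do not come from the lecture.

```python
from itertools import combinations

def positions(lst):
    """Map each element to its 1-based rank in the list."""
    return {x: r for r, x in enumerate(lst, start=1)}

def footrule(s, t):
    """Normalized Spearman footrule between two lists over the same set S."""
    ps, pt = positions(s), positions(t)
    total = sum(abs(ps[x] - pt[x]) for x in ps)
    return total / (len(s) ** 2 / 2)          # divide by (1/2)|S|^2

def kendall(s, t):
    """Normalized Kendall tau between two lists over the same set S (|S| >= 2)."""
    ps, pt = positions(s), positions(t)
    disagreements = sum(
        1 for a, b in combinations(s, 2)
        if (ps[a] - ps[b]) * (pt[a] - pt[b]) < 0   # pair ordered differently
    )
    n = len(s)
    return disagreements / (n * (n - 1) / 2)  # divide by (1/2)|S|(|S|-1)

def induced_footrule(s, t):
    """Footrule between partial list t and the projection of full list s onto t."""
    members = set(t)
    projected = [x for x in s if x in members]
    return footrule(projected, t)

def scaled_footrule(s, t):
    """Scaled footrule SF(s, t), normalized by |t|/2."""
    ps, pt = positions(s), positions(t)
    total = sum(abs(ps[x] / len(s) - pt[x] / len(t)) for x in t)
    return total / (len(t) / 2)

full = ["a", "b", "c", "d"]
reverse = ["d", "c", "b", "a"]
partial = ["c", "a"]
print(footrule(full, reverse), kendall(full, reverse))
print(induced_footrule(full, partial), scaled_footrule(full, partial))
```

A list and its reversal are at the maximum distance under both footrule and Kendall tau, which makes this a convenient sanity check for the normalization.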
Experiments: Distance Measures
• For each aggregation method and each
distance measure we get a vector of values,
one component per voter list, giving the distance
from the aggregation to that list
• The simplest summary is the average (the 1-norm, scaled by the number of lists)
• Other norms are also interesting
– Mean square distance (2-norm)
– Max distance (∞-norm)
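The three summaries above can be sketched as follows, assuming the per-voter distances are already computed. The function name `summarize` and the sample values are hypothetical, and the 2-norm is reported here as a root mean square.

```python
import math

def summarize(distances):
    """Summarize a vector of distances from one aggregation to each voter list."""
    n = len(distances)
    return {
        "average": sum(distances) / n,                        # 1-norm / n
        "rms": math.sqrt(sum(d * d for d in distances) / n),  # 2-norm / sqrt(n)
        "max": max(distances),                                # infinity-norm
    }

# Hypothetical normalized distances to three voter lists.
print(summarize([0.0, 0.5, 1.0]))
```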
Experiments: Minimizing Average
Search engines: Altavista (AV), Alltheweb (AW), Excite (EX), Google (GG), Hotbot (HB), Lycos (LY), and Northernlight (NL)
K = Kendall distance
IF = induced footrule distance
SF = scaled footrule distance
LK = Local Kemenization
Experiments in Spam Filtering
• Define spam to be web pages that are ranked low by
majority opinion (machine and human; a simplifying
assumption), although they may be ranked highly by
some search engines
• Intuition: if a page spams most search engines for a
particular query, then no combination of these search
engines can filter the spam: garbage in, garbage out.
• Spam pages are the Condorcet losers, and will
occupy the bottom of any ranking that satisfies the
extended Condorcet criterion
• Similarly, good pages will be among the Condorcet
winners, and will rank above the losers.
Condorcet Criteria
• Condorcet Criterion
– A candidate of M that beats every other candidate in
pairwise simple-majority voting should be ranked first.
• Extended Condorcet Criterion (XCC):
– Version 1: If most voters prefer candidate a to
candidate b (i.e., # of i s.t. σ_i(a) < σ_i(b) is at least n/2),
then σ should also prefer a to b (i.e., σ(a) < σ(b)).
– Version 2: If there is a partition (W, L) of M such that
for any x in W and y in L the majority prefers x to y,
then every x in W must be ranked above every y in L. W is called the
set of Condorcet winners and L the set of Condorcet losers.
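A minimal sketch of the two checks, assuming that "majority prefers x to y" means more rankings place x above y than y above x (ties count for neither side). The helper names and the toy rankings are hypothetical.

```python
def prefers_count(rankings, a, b):
    """Number of rankings that contain both a and b and place a above b."""
    return sum(1 for r in rankings
               if a in r and b in r and r.index(a) < r.index(b))

def majority_prefers(rankings, a, b):
    """Assumption: a strict majority of the rankings that compare a and b."""
    return prefers_count(rankings, a, b) > prefers_count(rankings, b, a)

def condorcet_winner(rankings, candidates):
    """Return the candidate that beats every other one pairwise, or None."""
    for c in candidates:
        if all(majority_prefers(rankings, c, d) for d in candidates if d != c):
            return c
    return None

def satisfies_xcc2(aggregate, rankings, winners, losers):
    """Check XCC version 2 for one given partition (winners, losers)."""
    is_condorcet_partition = all(
        majority_prefers(rankings, x, y) for x in winners for y in losers)
    if not is_condorcet_partition:
        return True  # the criterion imposes nothing on this partition
    pos = {x: i for i, x in enumerate(aggregate)}
    return all(pos[x] < pos[y] for x in winners for y in losers)

# Toy data: one engine is fooled into ranking "spam" first.
rankings = [["a", "b", "c", "spam"],
            ["b", "a", "c", "spam"],
            ["spam", "a", "b", "c"]]
print(condorcet_winner(rankings, ["a", "b", "c", "spam"]))
print(satisfies_xcc2(["spam", "a", "b", "c"], rankings, ["a", "b", "c"], ["spam"]))
print(satisfies_xcc2(["a", "b", "c", "spam"], rankings, ["a", "b", "c"], ["spam"]))
```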
XCC(2) and SPAM Filtering
• Note that XCC(1) => XCC(2), so Version 1 is
stronger
• But XCC(1) is not always realizable: majority
preferences can form a cycle (the Condorcet paradox)
• As we will see, XCC(2) is always realizable via
Local Kemenization
• Hence rank aggregation with XCC(2)
should assist in spam filtering, since
Condorcet losers will be ranked lowest
• Let us look at where spam pages (as judged by
humans) are ranked by good aggregation
methods.
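Why XCC(1) can fail to be realizable is easiest to see on a majority cycle. The sketch below uses a hypothetical three-voter profile and a strict-majority rule (an assumption; the slides say "at least n/2"), and checks every possible total order by brute force.

```python
from itertools import permutations

def majority_prefers(rankings, a, b):
    """True if more than half of the (full) rankings place a above b."""
    above = sum(1 for r in rankings if r.index(a) < r.index(b))
    return above > len(rankings) - above

def xcc1_violations(aggregate, rankings):
    """Pairs (a, b) where a majority prefers a to b but the aggregate ranks b above a."""
    pos = {x: i for i, x in enumerate(aggregate)}
    return [(a, b) for a in aggregate for b in aggregate
            if a != b and majority_prefers(rankings, a, b) and pos[a] > pos[b]]

# Hypothetical cyclic profile: a beats b, b beats c, c beats a.
voters = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]

print(xcc1_violations(["a", "b", "c"], voters))
every_order_violates = all(
    xcc1_violations(list(p), voters) for p in permutations("abc"))
print(every_order_violates)
```

No total order on {a, b, c} can agree with all three majority preferences, so every candidate aggregate reports at least one violation.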
Experiments: Filtering SPAM
Table 3: Ranks of "spam" pages for the queries
Feng Shui, organic vegetables, and gardening.
Columns: url, AV, AW, GG, HB, LY, NL, SFO, MC4
Rows (url): www.lucky-bamboo.com, www.cambriumcrystals.com,
www.luckycat.com, www.davesorganics.com, www.frozen.ch,
www.eonseed.com, www.taunton.com, www.egroups.com,
www.augusthome.com
[Table 3 body: per-engine ranks for each url; cell values not recoverable from this copy]
Experiment: Word association
• Different search engines and portals have different (default)
semantics for handling a multi-word query.
• Some use OR semantics (documents contain at least one of the given
query terms), while Google uses AND semantics (all the
query words must appear). Both are inconvenient in many
situations.
• Consider searching for a software engineering job in an
on-line job database. The user lists a number of skills and a
number of potential keywords in the job description, for
example, "Silicon Valley C++ Java CORBA TCP-IP
algorithms start-up pre-IPO stock options". It is clear that the
"AND" rule might produce no documents or only spam, and the
"OR" rule is equally disastrous.
• Experiment with rank aggregation over multiple queries,
each built from a small subset of the terms.
• Results for query: madras madurai coimbatore vellore
(cities in the state of Tamil Nadu, India)
• Google:
– www.mssrf.org/Fris9809/location-tamilnadu.html
– www.indiaplus.com/Info/schools.html
– www.focustamilnadu.com/tamilnadu/Policy%20Note
  ...Forests.html
– www.tn.gov.in/policy/environ.htm
– www.indiacolleges.com/Tamil_Nadu.htm
• SFO with LK:
– www.madurai.com
– www.ozemail.com.au/clday/locations.htm
– www.utoledo.edu/homepages/speelam/coimbatore.html
– www.ozemail.com.au/clday/madras.htm
– www.madurai.com/around.htm
– www.indiatraveltimes.com/tamilnadu/tamil1.html
• MC4 with LK:
– www.madurai.com
– www.surfindia.com/omsakthi/tourism.htm
– www.indiatraveltimes.com/tamilnadu/tamil1.html
– www.indiatraveltimes.com/tamilnadu/tamil2.html
– www.indiatravels.com/forts/vellore_fort.htm
– www.india-tourism.de/english/south/tamil_nadu.html
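The sub-query step of this experiment might be sketched as below. The subset size of 2 is an assumption (the slides only say "small subsets"), and `subqueries` is a hypothetical helper. Each sub-query would be sent to the search engine, and the returned result lists would be the partial lists passed to the rank aggregation method.

```python
from itertools import combinations

def subqueries(terms, size):
    """Build one query string per subset of `size` terms."""
    return [" ".join(subset) for subset in combinations(terms, size)]

terms = ["madras", "madurai", "coimbatore", "vellore"]
queries = subqueries(terms, 2)
for q in queries:
    print(q)
```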
Locally Kemeny optimal
aggregation and XCC(2)
• Many existing aggregation methods do not
satisfy XCC(1) or XCC(2).
• It is possible to use your favorite aggregation
method to obtain a full list, and then apply local
Kemenization to realize XCC(2), which filters
Condorcet losers.
Locally Kemeny optimal
• Recall that computing a Kemeny optimal aggregation is NP-hard
• Definition of locally optimal
A permutation p is a locally Kemeny optimal
aggregation of partial lists t1, t2, ..., tk if there is no
permutation p' that can be obtained from p by
a single transposition of an adjacent pair
of elements and for which the Kendall distance satisfies
K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk).
In other words, it is impossible to reduce the total
distance to the t's by flipping an adjacent pair.
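The definition translates directly into a check that tries every adjacent transposition. This is a minimal sketch, assuming K(p, t1, ..., tk) is the sum over the lists of the number of pairs in each list that p orders the other way, and that every element of every list appears in p. The data is hypothetical.

```python
def kendall_to_partial(p, t):
    """Number of pairs in list t that permutation p orders the other way."""
    pos = {x: i for i, x in enumerate(p)}
    return sum(1 for i in range(len(t)) for j in range(i + 1, len(t))
               if pos[t[i]] > pos[t[j]])

def total_kendall(p, lists):
    """K(p, t1, ..., tk): total Kendall distance from p to all lists."""
    return sum(kendall_to_partial(p, t) for t in lists)

def is_locally_kemeny_optimal(p, lists):
    """True if no single adjacent transposition lowers the total distance."""
    base = total_kendall(p, lists)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]   # flip one adjacent pair
        if total_kendall(q, lists) < base:
            return False
    return True

lists = [["a", "b", "c"], ["a", "b", "c"], ["b", "a", "c"]]
print(is_locally_kemeny_optimal(["b", "a", "c"], lists))  # flipping (b, a) helps
print(is_locally_kemeny_optimal(["a", "b", "c"], lists))
```

Only the adjacent flips are examined, so the check is cheap even though finding a Kemeny optimal aggregation is hard.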
Example of LKO but not KO
• Example 1
• t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1).
• p = (1,2,3),
p satisfies the definition of LKO with
K(p, t1, t2, ..., t5) = 3, but p is not Kemeny optimal: transposing 1 and 3
(a non-adjacent pair) decreases the sum to 2.
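Example 1 is small enough to verify by brute force. The sketch below scores all six permutations of {1, 2, 3} against the five lists. Because every list here has exactly two elements, the distance to a list is 1 if p reverses that pair and 0 otherwise.

```python
from itertools import permutations

# Lists from Example 1; (a, b) means a is preferred to b.
lists = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]

def total_kendall(p, lists):
    """Total Kendall distance from p to the two-element lists."""
    pos = {x: i for i, x in enumerate(p)}
    return sum(1 for a, b in lists if pos[a] > pos[b])

scores = {p: total_kendall(p, lists) for p in permutations((1, 2, 3))}
for p, k in sorted(scores.items(), key=lambda item: item[1]):
    print(p, k)

# (1, 2, 3) scores 3 and both adjacent flips score 4, so it is locally optimal;
# (3, 2, 1) scores 2, so (1, 2, 3) is not Kemeny optimal.
```

The enumeration also shows that the minimum total distance for these lists is 1, reached by (2, 3, 1) and (3, 1, 2).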
LKO satisfies XCC(2)
• Proof by contradiction
If the result is false, then there exist partial lists t1, t2, ..., tk, an
LKO aggregation p, and a partition (W,L) for which p violates
XCC(2); that is, there is some pair c in W and d in L such that p(d) <
p(c). Let (d,c) be the closest such pair in p.
• Consider the immediate successor of d in p, call it e. If e=c
then c is adjacent to d in p and, since a majority prefers c to d,
transposing this adjacent pair of
alternatives produces a p' such that K(p', t1, t2, ..., tk) < K(p,
t1, t2, ..., tk), contradicting the assumption on p.
• If e does not equal c, then either e is in W, in which case the
pair (d,e) is a closer pair in p than (d,c) and also violates
XCC(2), or e is in L, in which case (e,c) is a closer pair than
(d,c) that violates XCC(2). Both cases contradict the choice of
(d,c).
Local Kemenization procedure
• A local Kemenization of a full list μ with respect to preference
lists t1, ..., tk computes a locally Kemeny optimal aggregation
of the t's that is maximally consistent with μ.
This approach:
(1) preserves the strengths of the initial aggregation
(2) ranks non-spam above spam
(3) gives a result that disagrees with μ on a pair
(i, j) only if a majority endorses this disagreement
(4) for every d, 1 ≤ d ≤ |μ|, the restriction of the output to the
top d elements of μ is a local Kemenization of those d elements
Local Kemenization procedure
• A simple inductive construction.
• Assume inductively that we have constructed p, a local
Kemenization of the projection of the t's onto the elements 1,
..., l-1.
• Insert the next element x into the lowest-ranked "permissible"
position in p: just below the lowest-ranked element y in p such
that
– (a) no majority among the (original) t's prefers x to y and
– (b) for all successors z of y in p there is a majority that prefers x to z.
• In other words, we try to insert x at the end (bottom) of the list
p; we bubble it up toward the top of the list as long as a
majority of the t's insists that we do.
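The bubble-up description above can be sketched as an insertion procedure. This is a minimal sketch, not necessarily the exact procedure from the lecture: it assumes "a majority prefers x to y" means more lists place x above y than y above x, counting only lists that contain both. The function names and data are hypothetical.

```python
def prefers_count(lists, x, y):
    """Number of lists that contain both x and y and place x above y."""
    return sum(1 for t in lists
               if x in t and y in t and t.index(x) < t.index(y))

def local_kemenization(mu, lists):
    """Locally Kemenize the full list mu with respect to the preference lists."""
    p = []
    for x in mu:                      # insert elements in the order given by mu
        p.append(x)                   # start at the bottom of p
        i = len(p) - 1
        # Bubble x up while a majority prefers it to the element just above.
        while i > 0 and (prefers_count(lists, x, p[i - 1])
                         > prefers_count(lists, p[i - 1], x)):
            p[i - 1], p[i] = p[i], p[i - 1]
            i -= 1
    return p

voters = [list("ABCD"), list("BADC"), list("ACBD")]
print(local_kemenization(list("DCBA"), voters))
print(local_kemenization(list("ABCD"), voters))
```

An initial list that already agrees with the majority on every adjacent pair is returned unchanged, which matches the goal of staying as consistent as possible with the original aggregation.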
Example local kemenization procedure
• Local Kemenization example (figure). Rankings shown, each listed top to bottom:
A B F E C D
B C A E F D
A C F D E B
B F D C A E
C A B F E D
B A D C E F
Pairwise tallies shown: A>B: 3, A<B: 2; B>D: 4, B<D: 1
RA and Searching Workplace Web
• Axiom 1: Intranet documents are not spam
• Axiom 2: Queries usually have unique answers
(not broad topic based)
• Axiom 3: Intranet docs are not search engine
friendly (docs are accessed through portals and
database queries)
• Rank aggregation allows us to combine
a number of heuristic alternatives: static and
dynamic, query dependent and independent