Rank Aggregation Methods II: Experiments
CS728 Lecture 12

Recall the Rank Aggregation Problem
• m candidates (a.k.a. "alternatives")
– M = {1,…,m}: set of candidates
• n voters (a.k.a. "agents" or "judges")
– N = {1,…,n}: set of voters
• Each voter i has a ranking σi on M
– σi(a) < σi(b) means voter i prefers a to b
– A ranking may be a total or partial order
• The rank aggregation problem: combine σ1,…,σn into a single ranking σ on M, which represents the "social choice" of the voters.
– Rank aggregation function: f(σ1,…,σn) = σ, which may be a total or partial order

Experiments: Distance Measures
Goal: quantitatively compare different rank aggregation methods.
Performance measures:
(1) The Spearman footrule distance between two lists over a set S is the sum of the pointwise differences in position, F(s, t) = Σ_i |s(i) − t(i)|. It is normalized by dividing by its maximum possible value, (1/2)|S|², giving a value between 0 and 1.
(2) The Kendall tau distance counts the number of pairwise disagreements between two lists. Dividing by the maximum possible value, (1/2)|S|(|S| − 1), gives a normalized version with value between 0 and 1.
(3) The induced footrule distance is obtained by projecting the full list s onto the elements of each partial list and computing the footrule distance between the projections. The induced Kendall tau distance is defined analogously.
(4) The scaled footrule distance weights the contribution of each element by the lengths of the lists it appears in. If s is a full list and t is a partial list, then SF(s, t) = Σ_{i in t} | s(i)/|s| − t(i)/|t| |, normalized by dividing by |t|/2.

Experiments: Distance Measures
• For each aggregation method and each distance measure we get a vector of values, each component being the distance from the aggregated list to one voter's list
• Simplest is to take the average (or 1-norm)
• Other norms are interesting
– Mean square distance (2-norm)
– Max distance (∞-norm)

Experiments: Minimizing Average
AltaVista (AV), AllTheWeb (AW), Excite (EX), Google (GG), HotBot (HB), Lycos (LY), and NorthernLight (NL)
K = Kendall distance
IF = induced footrule distance
SF = scaled footrule distance
LK = local Kemenization

Experiments in Spam Filtering
• Define spam to be web pages that are ranked low by majority opinion (machine and human – a simplifying assumption)
– although they may be highly ranked by some search engines
• Intuition: if a page spams most search engines for a particular query, then no combination of these search engines can filter out the spam: garbage in, garbage out.
• Spam pages are Condorcet losers, and will occupy the bottom of any ranking that satisfies the extended Condorcet criterion
• Similarly, good pages will be among the Condorcet winners and will rank above the losers.

Condorcet Criteria
• Condorcet criterion
– A candidate in M that beats every other candidate in pairwise simple majority voting should be ranked first.
• Extended Condorcet criterion (XCC):
– Version 1: if most voters prefer candidate a to candidate b (i.e., the number of i such that σi(a) < σi(b) is at least n/2), then σ should also prefer a to b (i.e., σ(a) < σ(b)).
– Version 2: if there is a partition (W, L) of M such that for every x in W and y in L the majority prefers x to y, then every x must be ranked above every y. W is the set of Condorcet winners and L the set of Condorcet losers.

XCC(2) and Spam Filtering
• Note that XCC(1) => XCC(2), so Version 1 is stronger
• But XCC(1) is not always realizable
• As we will see, XCC(2) is always realizable via local Kemenization
• Hence rank aggregation satisfying XCC(2) should assist in spam filtering, since Condorcet losers will be ranked lowest
• Let us look at where (human-determined) spam pages are ranked by good aggregation methods.
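The distance measures defined above are easy to compute directly. The sketch below is a minimal illustration, not part of the lecture; the function names and the list-of-items representation are my own choices, and rankings are assumed to be given as lists of items in preference order:

```python
from itertools import combinations

def positions(ranked):
    """Map item -> 1-based position for a list given in preference order."""
    return {item: i + 1 for i, item in enumerate(ranked)}

def footrule(s, t):
    """Spearman footrule between two full lists over the same item set,
    normalized by (1/2)|S|^2 so the value lies in [0, 1]."""
    ps, pt = positions(s), positions(t)
    return sum(abs(ps[i] - pt[i]) for i in ps) / (0.5 * len(s) ** 2)

def kendall(s, t):
    """Kendall tau between two full lists over the same item set:
    fraction of item pairs ordered differently, in [0, 1]."""
    ps, pt = positions(s), positions(t)
    disagree = sum(1 for a, b in combinations(s, 2)
                   if (ps[a] - ps[b]) * (pt[a] - pt[b]) < 0)
    return disagree / (0.5 * len(s) * (len(s) - 1))

def induced_footrule(full, partial):
    """Induced footrule: project the full list onto the partial list's items,
    then compute the normalized footrule between the two small lists."""
    projected = [i for i in full if i in set(partial)]
    return footrule(projected, partial)

def scaled_footrule(full, partial):
    """Scaled footrule SF(s, t) = sum over i in t of |s(i)/|s| - t(i)/|t||,
    normalized by |t|/2."""
    ps, pt = positions(full), positions(partial)
    d = sum(abs(ps[i] / len(full) - pt[i] / len(partial)) for i in partial)
    return d / (len(partial) / 2)

# Tiny usage example: one full list and one partial list.
full = ["a", "b", "c", "d"]
partial = ["c", "a"]
print(induced_footrule(full, partial), scaled_footrule(full, partial))
```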
Experiments: Filtering Spam
Table 3: Ranks of "spam" pages for the queries: Feng Shui, organic vegetables, and gardening.
– Columns (one rank per engine/method): AV, AW, GG, HB, LY, NL, SFO, MC4
– Spam pages: www.lucky-bamboo.com, www.cambriumcrystals.com, www.luckycat.com, www.davesorganics.com, www.frozen.ch, www.eonseed.com, www.taunton.com, www.egroups.com, www.augusthome.com

Experiment: Word Association
• Different search engines and portals have different (default) semantics for handling a multi-word query.
• Some use OR semantics (a document must contain at least one of the query terms), while Google uses AND semantics (all the query words must appear). Both are inconvenient in many situations.
• Consider searching for a software engineering job in an on-line job database. The user lists a number of skills and potential keywords from the job description, for example, "Silicon Valley C++ Java CORBA TCP-IP algorithms start-up pre-IPO stock options". The AND rule might produce no documents (or spam), and the OR rule is equally disastrous.
• Experiment: apply rank aggregation to the results of multiple queries, each built from a small subset of the terms.
• Results for the query: madras madurai coimbatore vellore (cities in the state of Tamil Nadu, India)
• Google
www.mssrf.org/Fris9809/location-tamilnadu.html
www.indiaplus.com/Info/schools.html
www.focustamilnadu.com/tamilnadu/Policy%20Note ...Forests.html
www.tn.gov.in/policy/environ.htm
www.indiacolleges.com/Tamil_Nadu.htm
• SFO with LK
www.madurai.com
www.ozemail.com.au/clday/locations.htm
www.utoledo.edu/homepages/speelam/coimbatore.html
www.ozemail.com.au/clday/madras.htm
www.madurai.com/around.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
• MC4 with LK
www.madurai.com
www.surfindia.com/omsakthi/tourism.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
www.indiatraveltimes.com/tamilnadu/tamil2.html
www.indiatravels.com/forts/vellore_fort.htm
www.india-tourism.de/english/south/tamil_nadu.html

Locally Kemeny Optimal Aggregation and XCC(2)
• Many existing aggregation methods satisfy neither XCC(1) nor XCC(2).
• It is possible to use your favorite aggregation method to obtain a full list, and then apply local Kemenization to realize XCC(2), which pushes Condorcet losers to the bottom.

Locally Kemeny Optimal
• Recall that computing a Kemeny optimal aggregation is NP-hard.
• Definition of locally optimal: a permutation p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk if there is no permutation p' that can be obtained from p by a single transposition of an adjacent pair of elements and for which K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk). In other words, it is impossible to reduce the total Kendall distance to the t's by flipping an adjacent pair.

Example of LKO but not KO
• Example 1
• t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1).
• p = (1,2,3) satisfies the definition of LKO, with K(p, t1, t2, ..., t5) = 3, but transposing the non-adjacent pair 1 and 3 decreases the sum to 2, so p is not Kemeny optimal.
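A quick way to verify claims like the one in Example 1 is to brute-force every adjacent transposition. The sketch below is an illustrative helper, not from the lecture; it counts disagreements only on pairs that a partial list actually orders:

```python
def total_kendall(p, partials):
    """Sum over the partial lists of the number of pairs (a, b) that the partial
    list orders one way and the permutation p orders the other way."""
    pos = {x: i for i, x in enumerate(p)}
    total = 0
    for t in partials:
        tpos = {x: i for i, x in enumerate(t)}
        items = [x for x in t if x in pos]
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                a, b = items[i], items[j]
                if (pos[a] - pos[b]) * (tpos[a] - tpos[b]) < 0:
                    total += 1
    return total

def is_locally_kemeny_optimal(p, partials):
    """True if no single adjacent transposition of p lowers the total distance."""
    base = total_kendall(p, partials)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        if total_kendall(q, partials) < base:
            return False
    return True

# Example 1: (1, 2, 3) is locally optimal with total distance 3, but the
# non-adjacent swap of 1 and 3 gives total distance 2, so it is not optimal.
partials = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
print(total_kendall((1, 2, 3), partials))              # 3
print(is_locally_kemeny_optimal((1, 2, 3), partials))  # True
print(total_kendall((3, 2, 1), partials))              # 2
```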
LKO Satisfies XCC(2)
• Proof by contradiction. Suppose the claim is false: then there exist partial lists t1, t2, ..., tk, a locally Kemeny optimal aggregation p, and a partition (W, L) that violates XCC(2); that is, there is some pair c in W and d in L such that p(d) < p(c). Let (c, d) be the closest such pair in p.
• Consider the immediate successor of d in p, call it e.
• If e = c, then c is adjacent to d in p, and transposing this adjacent pair produces a p' with K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk), since a majority of the t's prefers c to d. This contradicts the local Kemeny optimality of p.
• If e ≠ c, then either e is in W, in which case (e, d) is a closer violating pair in p than (c, d), or e is in L, in which case (c, e) is a closer violating pair than (c, d). Both cases contradict the choice of (c, d).

Local Kemenization Procedure
• A local Kemenization of a full list μ with respect to the preference lists computes a locally Kemeny optimal aggregation that is maximally consistent with μ. This approach:
(1) preserves the strengths of the initial aggregation μ;
(2) ranks non-spam above spam;
(3) produces a result that disagrees with μ on the order of a pair (i, j) only if a majority of the preference lists endorses this disagreement;
(4) for every d, 1 ≤ d ≤ |μ|, the restriction of the output to its top d elements is a local Kemenization of the top d elements of μ.

Local Kemenization Procedure
• A simple inductive construction (a code sketch appears at the end of these notes).
• Assume inductively that we have constructed p, a local Kemenization of the projection of the t's onto the first l − 1 elements of μ.
• Insert the next element x (the l-th element of μ) into the lowest-ranked "permissible" position in p: just below the lowest-ranked element y in p such that
– (a) no majority among the (original) t's prefers x to y, and
– (b) for every successor z of y in p, a majority prefers x to z.
• In other words, we try to insert x at the end (bottom) of the list p and bubble it up toward the top of the list as long as a majority of the t's insists that we do.

Example of the Local Kemenization Procedure
• Initial full list μ: A B F E C D
• Preference lists: (B C A E F D), (A C F D E B), (B F D C A E), (C A B F E D), (B A D C E F)
• Pairwise majority tallies drive the insertions, e.g. A>B: 3, A<B: 2 and B>D: 4, B<D: 1.

RA and Searching the Workplace Web
• Axiom 1: Intranet documents are not spam
• Axiom 2: Queries usually have unique answers (not broad, topic-based)
• Axiom 3: Intranet documents are not search-engine friendly (documents are accessed through portals and database queries)
• Rank aggregation allows us to combine a number of heuristic alternatives: static and dynamic, query-dependent and query-independent
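The insertion step described above is straightforward to prototype. The sketch below uses illustrative names (not the authors' code) and assumes, as one reasonable convention for partial lists, that majorities are counted only over the lists that rank both elements:

```python
def prefers_majority(x, y, partials):
    """True if a strict majority of the partial lists that rank both x and y
    place x above y (a convention chosen here for handling partial lists)."""
    above = below = 0
    for t in partials:
        if x in t and y in t:
            if t.index(x) < t.index(y):
                above += 1
            else:
                below += 1
    return above > below

def local_kemenize(mu, partials):
    """Local Kemenization of the full list mu with respect to the partial lists:
    insert each element of mu at the bottom of the partial result, then bubble it
    up while a majority of the lists prefers it to the element directly above."""
    p = []
    for x in mu:
        p.append(x)                              # tentative bottom position
        i = len(p) - 1
        while i > 0 and prefers_majority(x, p[i - 1], partials):
            p[i - 1], p[i] = p[i], p[i - 1]      # bubble x up one position
            i -= 1
    return p

# Example 1 from earlier: (1, 2, 3) is already locally Kemeny optimal and comes
# back unchanged; starting from (3, 2, 1), the single list ranking both 2 and 3
# bubbles 2 above 3, giving (2, 3, 1).
partials = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
print(local_kemenize([1, 2, 3], partials))   # [1, 2, 3]
print(local_kemenize([3, 2, 1], partials))   # [2, 3, 1]
```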