Rank Aggregation Methods II: Experiments
CS728 Lecture 12

Recall the Rank Aggregation Problem
• m candidates (a.k.a. "alternatives")
– M = {1,…,m}: set of candidates
• n voters (a.k.a. "agents" or "judges")
– N = {1,…,n}: set of voters
• Each voter i has a ranking σi on M
– σi(a) < σi(b) means voter i prefers a to b
– A ranking may be a total or partial order
• The rank aggregation problem: combine σ1,…,σn into a single ranking σ on M, which represents the "social choice" of the voters.
– Rank aggregation function: f(σ1,…,σn) = σ, which may be a total or partial order

Experiments: Distance Measures
Goal: quantitatively compare different rank aggregation methods.
Performance measures:
(1) The Spearman footrule distance between two lists over a set S is the sum of the pointwise differences in position, F(s, t) = Σ_i |s(i) − t(i)|. It is normalized by dividing by its maximum possible value, (1/2)|S|², giving a value between 0 and 1.
(2) The Kendall tau distance counts the number of pairwise disagreements between two lists. Dividing by the maximum possible value, (1/2)|S|(|S| − 1), gives a normalized version with value between 0 and 1.
(3) The induced footrule distance is obtained by projecting the full list s onto the elements of each partial list and computing the footrule distance between the projections. The induced Kendall tau distance is defined analogously.
(4) The scaled footrule distance weights the contribution of each element by the lengths of the lists it appears in. If s is a full list and t is a partial list, then SF(s, t) = Σ_{i in t} | s(i)/|s| − t(i)/|t| |, normalized by dividing by |t|/2.

Experiments: Distance Measures
• For each aggregation method and each distance measure we get a vector of values, each component being the distance from the aggregated list to one voter's list
• Simplest is to take the average (or 1-norm)
• Other norms are interesting
– Mean square distance (2-norm)
– Max distance (∞-norm)

Experiments: Minimizing Average
AltaVista (AV), AllTheWeb (AW), Excite (EX), Google (GG), HotBot (HB), Lycos (LY), and NorthernLight (NL)
K = Kendall distance
IF = induced footrule distance
SF = scaled footrule distance
LK = local Kemenization

Experiments in Spam Filtering
• Define spam to be web pages that are ranked low by majority opinion (machine and human – a simplifying assumption)
– although they may be highly ranked by some search engines
• Intuition: if a page spams most search engines for a particular query, then no combination of these search engines can filter out the spam: garbage in, garbage out.
• Spam pages are Condorcet losers, and will occupy the bottom of any ranking that satisfies the extended Condorcet criterion
• Similarly, good pages will be among the Condorcet winners and will rank above the losers.

Condorcet Criteria
• Condorcet criterion
– A candidate in M that beats every other candidate in pairwise simple majority voting should be ranked first.
• Extended Condorcet criterion (XCC):
– Version 1: if most voters prefer candidate a to candidate b (i.e., the number of i such that σi(a) < σi(b) is at least n/2), then σ should also prefer a to b (i.e., σ(a) < σ(b)).
– Version 2: if there is a partition (W, L) of M such that for every x in W and y in L the majority prefers x to y, then every x must be ranked above every y. W is the set of Condorcet winners and L the set of Condorcet losers.

XCC(2) and Spam Filtering
• Note that XCC(1) => XCC(2), so Version 1 is stronger
• But XCC(1) is not always realizable
• As we will see, XCC(2) is always realizable via local Kemenization
• Hence rank aggregation satisfying XCC(2) should assist in spam filtering, since Condorcet losers will be ranked lowest
• Let us look at where (human-determined) spam pages are ranked by good aggregation methods.
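The distance measures defined above are easy to compute directly. The sketch below is a minimal illustration, not part of the lecture; the function names and the list-of-items representation are my own choices, and rankings are assumed to be given as lists of items in preference order:

```python
from itertools import combinations

def positions(ranked):
    """Map item -> 1-based position for a list given in preference order."""
    return {item: i + 1 for i, item in enumerate(ranked)}

def footrule(s, t):
    """Spearman footrule between two full lists over the same item set,
    normalized by (1/2)|S|^2 so the value lies in [0, 1]."""
    ps, pt = positions(s), positions(t)
    return sum(abs(ps[i] - pt[i]) for i in ps) / (0.5 * len(s) ** 2)

def kendall(s, t):
    """Kendall tau between two full lists over the same item set:
    fraction of item pairs ordered differently, in [0, 1]."""
    ps, pt = positions(s), positions(t)
    disagree = sum(1 for a, b in combinations(s, 2)
                   if (ps[a] - ps[b]) * (pt[a] - pt[b]) < 0)
    return disagree / (0.5 * len(s) * (len(s) - 1))

def induced_footrule(full, partial):
    """Induced footrule: project the full list onto the partial list's items,
    then compute the normalized footrule between the two small lists."""
    projected = [i for i in full if i in set(partial)]
    return footrule(projected, partial)

def scaled_footrule(full, partial):
    """Scaled footrule SF(s, t) = sum over i in t of |s(i)/|s| - t(i)/|t||,
    normalized by |t|/2."""
    ps, pt = positions(full), positions(partial)
    d = sum(abs(ps[i] / len(full) - pt[i] / len(partial)) for i in partial)
    return d / (len(partial) / 2)

# Tiny usage example: one full list and one partial list.
full = ["a", "b", "c", "d"]
partial = ["c", "a"]
print(induced_footrule(full, partial), scaled_footrule(full, partial))
```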
Experiments: Filtering Spam
Table 3: Ranks of "spam" pages for the queries: Feng Shui, organic vegetables, and gardening.
– Columns (one rank per engine/method): AV, AW, GG, HB, LY, NL, SFO, MC4
– Spam pages: www.lucky-bamboo.com, www.cambriumcrystals.com, www.luckycat.com, www.davesorganics.com, www.frozen.ch, www.eonseed.com, www.taunton.com, www.egroups.com, www.augusthome.com

Experiment: Word Association
• Different search engines and portals have different (default) semantics for handling a multi-word query.
• Some use OR semantics (a document must contain at least one of the query terms), while Google uses AND semantics (all the query words must appear). Both are inconvenient in many situations.
• Consider searching for a software engineering job in an on-line job database. The user lists a number of skills and potential keywords from the job description, for example, "Silicon Valley C++ Java CORBA TCP-IP algorithms start-up pre-IPO stock options". The AND rule might produce no documents (or spam), and the OR rule is equally disastrous.
• Experiment: apply rank aggregation to the results of multiple queries, each built from a small subset of the terms.
• Results for the query: madras madurai coimbatore vellore (cities in the state of Tamil Nadu, India)
• Google
www.mssrf.org/Fris9809/location-tamilnadu.html
www.indiaplus.com/Info/schools.html
www.focustamilnadu.com/tamilnadu/Policy%20Note ...Forests.html
www.tn.gov.in/policy/environ.htm
www.indiacolleges.com/Tamil_Nadu.htm
• SFO with LK
www.madurai.com
www.ozemail.com.au/clday/locations.htm
www.utoledo.edu/homepages/speelam/coimbatore.html
www.ozemail.com.au/clday/madras.htm
www.madurai.com/around.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
• MC4 with LK
www.madurai.com
www.surfindia.com/omsakthi/tourism.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
www.indiatraveltimes.com/tamilnadu/tamil2.html
www.indiatravels.com/forts/vellore_fort.htm
www.india-tourism.de/english/south/tamil_nadu.html

Locally Kemeny Optimal Aggregation and XCC(2)
• Many existing aggregation methods satisfy neither XCC(1) nor XCC(2).
• It is possible to use your favorite aggregation method to obtain a full list, and then apply local Kemenization to realize XCC(2), which pushes Condorcet losers to the bottom.

Locally Kemeny Optimal
• Recall that computing a Kemeny optimal aggregation is NP-hard.
• Definition of locally optimal: a permutation p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk if there is no permutation p' that can be obtained from p by a single transposition of an adjacent pair of elements and for which K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk). In other words, it is impossible to reduce the total Kendall distance to the t's by flipping an adjacent pair.

Example of LKO but not KO
• Example 1
• t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1).
• p = (1,2,3) satisfies the definition of LKO, with K(p, t1, t2, ..., t5) = 3, but transposing the non-adjacent pair 1 and 3 decreases the sum to 2, so p is not Kemeny optimal.
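A quick way to verify claims like the one in Example 1 is to brute-force every adjacent transposition. The sketch below is an illustrative helper, not from the lecture; it counts disagreements only on pairs that a partial list actually orders:

```python
def total_kendall(p, partials):
    """Sum over the partial lists of the number of pairs (a, b) that the partial
    list orders one way and the permutation p orders the other way."""
    pos = {x: i for i, x in enumerate(p)}
    total = 0
    for t in partials:
        tpos = {x: i for i, x in enumerate(t)}
        items = [x for x in t if x in pos]
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                a, b = items[i], items[j]
                if (pos[a] - pos[b]) * (tpos[a] - tpos[b]) < 0:
                    total += 1
    return total

def is_locally_kemeny_optimal(p, partials):
    """True if no single adjacent transposition of p lowers the total distance."""
    base = total_kendall(p, partials)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        if total_kendall(q, partials) < base:
            return False
    return True

# Example 1: (1, 2, 3) is locally optimal with total distance 3, but the
# non-adjacent swap of 1 and 3 gives total distance 2, so it is not optimal.
partials = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
print(total_kendall((1, 2, 3), partials))              # 3
print(is_locally_kemeny_optimal((1, 2, 3), partials))  # True
print(total_kendall((3, 2, 1), partials))              # 2
```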
LKO Satisfies XCC(2)
• Proof by contradiction. Suppose the claim is false: then there exist partial lists t1, t2, ..., tk, a locally Kemeny optimal aggregation p, and a partition (W, L) that violates XCC(2); that is, there is some pair c in W and d in L such that p(d) < p(c). Let (c, d) be the closest such pair in p.
• Consider the immediate successor of d in p, call it e.
• If e = c, then c is adjacent to d in p, and transposing this adjacent pair produces a p' with K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk), since a majority of the t's prefers c to d. This contradicts the local Kemeny optimality of p.
• If e ≠ c, then either e is in W, in which case (e, d) is a closer violating pair in p than (c, d), or e is in L, in which case (c, e) is a closer violating pair than (c, d). Both cases contradict the choice of (c, d).

Local Kemenization Procedure
• A local Kemenization of a full list μ with respect to the preference lists computes a locally Kemeny optimal aggregation that is maximally consistent with μ. This approach:
(1) preserves the strengths of the initial aggregation μ;
(2) ranks non-spam above spam;
(3) produces a result that disagrees with μ on the order of a pair (i, j) only if a majority of the preference lists endorses this disagreement;
(4) for every d, 1 ≤ d ≤ |μ|, the restriction of the output to its top d elements is a local Kemenization of the top d elements of μ.

Local Kemenization Procedure
• A simple inductive construction (a code sketch appears at the end of these notes).
• Assume inductively that we have constructed p, a local Kemenization of the projection of the t's onto the first l − 1 elements of μ.
• Insert the next element x (the l-th element of μ) into the lowest-ranked "permissible" position in p: just below the lowest-ranked element y in p such that
– (a) no majority among the (original) t's prefers x to y, and
– (b) for every successor z of y in p, a majority prefers x to z.
• In other words, we try to insert x at the end (bottom) of the list p and bubble it up toward the top of the list as long as a majority of the t's insists that we do.

Example of the Local Kemenization Procedure
• Initial full list μ: A B F E C D
• Preference lists: (B C A E F D), (A C F D E B), (B F D C A E), (C A B F E D), (B A D C E F)
• Pairwise majority tallies drive the insertions, e.g. A>B: 3, A<B: 2 and B>D: 4, B<D: 1.

RA and Searching the Workplace Web
• Axiom 1: Intranet documents are not spam
• Axiom 2: Queries usually have unique answers (not broad, topic-based)
• Axiom 3: Intranet documents are not search-engine friendly (documents are accessed through portals and database queries)
• Rank aggregation allows us to combine a number of heuristic alternatives: static and dynamic, query-dependent and query-independent
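The insertion step described above is straightforward to prototype. The sketch below uses illustrative names (not the authors' code) and assumes, as one reasonable convention for partial lists, that majorities are counted only over the lists that rank both elements:

```python
def prefers_majority(x, y, partials):
    """True if a strict majority of the partial lists that rank both x and y
    place x above y (a convention chosen here for handling partial lists)."""
    above = below = 0
    for t in partials:
        if x in t and y in t:
            if t.index(x) < t.index(y):
                above += 1
            else:
                below += 1
    return above > below

def local_kemenize(mu, partials):
    """Local Kemenization of the full list mu with respect to the partial lists:
    insert each element of mu at the bottom of the partial result, then bubble it
    up while a majority of the lists prefers it to the element directly above."""
    p = []
    for x in mu:
        p.append(x)                              # tentative bottom position
        i = len(p) - 1
        while i > 0 and prefers_majority(x, p[i - 1], partials):
            p[i - 1], p[i] = p[i], p[i - 1]      # bubble x up one position
            i -= 1
    return p

# Example 1 from earlier: (1, 2, 3) is already locally Kemeny optimal and comes
# back unchanged; starting from (3, 2, 1), the single list ranking both 2 and 3
# bubbles 2 above 3, giving (2, 3, 1).
partials = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
print(local_kemenize([1, 2, 3], partials))   # [1, 2, 3]
print(local_kemenize([3, 2, 1], partials))   # [2, 3, 1]
```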