Using Trees to Depict a Forest
Bin Liu, H. V. Jagadish
EECS, University of Michigan, Ann Arbor
Presented by Sergey Shepshelvich

Motivation
In interactive database querying, we often get more results than we can comprehend immediately. When do you actually click through 2-3 pages of results? 85% of users never go to the second page! So what should we display on the first page?

Standard solutions
- Sorting by attributes: computationally expensive, and similar results can be distributed many pages apart.
- Ranking: hard to estimate the user's preference, and in database queries all tuples are equally relevant.
What to do when there are millions of results?

Make the First Page Count
Human beings are very capable of learning from examples. Show the most "representative" results: they best help users learn what is in the result set, and users can decide further actions based on the representatives.

The Proposal: MusiqLens Experience
(Model-driven Usable Systems for Information Querying)

Suppose a user wants a 2005 Civic, but there are too many of them…

MusiqLens on the Car Data

Id   Model  Price    Year  Mileage  Condition  More like this
872  Civic  $12,000  2005  50,000   Good       122
901  Civic  $16,000  2005  40,000   Excellent  345
725  Civic  $18,500  2005  30,000   Excellent  86
423  Civic  $17,000  2005  42,000   Good       201
132  Civic  $9,500   2005  86,000   Fair       185
322  Civic  $14,000  2005  73,000   Good       55

After Zooming In: 2005 Honda Civics Similar to ID 132

Id   Model  Price    Year  Mileage  Condition  More like this
342  Civic  $9,800   2005  72,000   Good       25
768  Civic  $10,000  2005  60,000   Good       10
132  Civic  $9,500   2005  86,000   Fair       63
122  Civic  $9,500   2005  76,000   Good       5
123  Civic  $9,100   2005  81,000   Fair       40
898  Civic  $9,000   2005  69,000   Fair       42

After Filtering by "Price < 9,500"

Id   Model  Price   Year  Mileage  Condition  More like this
123  Civic  $9,100  2005  81,000   Fair       40
898  Civic  $9,000  2005  69,000   Fair       42
133  Civic  $9,300  2005  87,000   Fair       33
126  Civic  $9,200  2005  89,000   Good       3
129  Civic  $8,900  2005  81,000   Fair       20
999  Civic  $9,000  2005  87,000   Fair       12

Challenges
- Representation modeling: what is the best set of representatives?
- Representative finding: how to find them efficiently?
- Query refinement: how to efficiently adapt to the user's query operations?

Finding a Suitable Metric
Users should be the ultimate judge: which metric generates the representatives that I can learn the most from? A user study evaluates the different representation models.

Metric Candidates
- Sort by attributes
- Uniform random sampling (small clusters are missed)
- Density-biased sampling (sample more from sparse regions, less from dense regions)
- Sorting by typicality (based on probabilistic modeling)
- K-medoids

Metric Candidates - K-medoids
A medoid of a cluster is the object whose dissimilarity to the other objects in the cluster is smallest. There are two variants: average medoid and max medoid (minimizing the average vs. the maximum distance). The k-medoids are k objects, one from each cluster, each being the medoid of its cluster.
Why not k-means? K-means cluster centers generally do not exist in the database, and we must present real objects to users.
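To make the average-medoid objective concrete, here is a minimal, self-contained Python sketch of plain in-memory k-medoids (Voronoi-style iteration). It is an illustration only, not the paper's cover-tree algorithm; the names dist, avg_cost, and naive_k_medoids are mine.

```python
# Illustrative sketch of the average k-medoids objective: medoids are always
# real data objects, unlike k-means centers, which may not exist in the data.
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_cost(points, medoids):
    """Average distance from each point to its nearest medoid,
    the quantity that average k-medoids tries to minimize."""
    return sum(min(dist(p, m) for m in medoids) for p in points) / len(points)

def naive_k_medoids(points, k, iters=20, seed=0):
    """Plain Voronoi-style k-medoids on in-memory points (no index)."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        # The new medoid of a cluster is the member with the smallest
        # total distance to the other members (a real object, not a mean).
        new_medoids = [min(c, key=lambda cand: sum(dist(cand, q) for q in c))
                       for c in clusters.values() if c]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

# Example: pick 2 representatives for a small 2-D point set.
pts = [(0.1, 0.2), (0.15, 0.25), (0.8, 0.9), (0.82, 0.88), (0.5, 0.1)]
reps = naive_k_medoids(pts, k=2)
print(reps, avg_cost(pts, reps))
```

On large result sets this naive iteration is far too expensive, which is what motivates the cover-tree based method discussed below.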
Plotting the Candidates
Data: Yahoo! Autos, 3,922 data points; price and mileage are normalized to [0, 1].
[Figure: representatives chosen by random sampling and by density-biased sampling, plotted over the price/mileage space.]

Plotting the Candidates - Typicality
[Figure: representatives chosen by typicality.]

Plotting the Candidates - k-medoids
[Figure: representatives chosen by max-medoids and by avg-medoids.]

User Study Procedure
Users are given 7 sets of data, generated using the 7 candidate methods; each set consists of 8 representative points. For each set, users predict 4 more data points that are most likely to be in the data set, without picking points already given. The prediction error is measured.

Verdict
K-medoids is the winner. In this paper the authors choose average k-medoids; the proposed algorithm can be extended to max-medoids with small changes.

Challenges (recap)
- Representation modeling: what is the best set of representatives?
- Representative finding: how to find them efficiently?
- Query refinement: how to efficiently adapt to the user's query operations?

Cover Tree Based Algorithm
The cover tree was proposed by Beygelzimer, Kakade, and Langford in 2006. We briefly discuss its properties, then cover-tree based algorithms for computing k-medoids.

Cover Tree Properties (1)
Nesting: for all i, C_i ⊂ C_{i+1}.
[Figure: points in the data (one dimension), shown at levels C_i and C_{i+1}.]

Cover Tree Properties (2)
Covering: a node in C_i is within distance 1/2^i of its children in C_{i+1}. Consequently, the distance from a node to any of its descendants is less than 1/2^(i-1); this value is called the "span" of the node.

Cover Tree Properties (3)
Separation: nodes in C_i are separated by at least 1/2^i. Note: i is allowed to be negative so that the above conditions can be satisfied.

Additional Stats for the Cover Tree
- Density (DS): number of points in the subtree.
- Centroid (CT): geometric center of the points in the subtree.
[Figure: 2-D example over points s1-s10, with subtrees annotated with their densities, e.g., DS = 10 and DS = 3.]

k-medoid Algorithm Outline
- Descend the cover tree to a level with more than k nodes.
- Choose an initial k points as the first set of medoids (seeds); bad seeds can lead to local minima with a high distance cost.
- Assign nodes and repeatedly update until the medoids converge.

Cover Tree Based Seeding
- Descend the cover tree to a level with more than k nodes (call it level m).
- Use the parent level m - 1 as the starting point for seeds.
- Each node has a weight, calculated as the product of its span and density (the contribution of the subtree to the distance cost).
- Expand nodes using a priority queue, and fetch the first k nodes from the queue as seeds.

A Simple Example: k = 4
[Figure: cover tree over points s1-s10, with node spans 2, 1, 1/2, and 1/4 at successive levels.]
Priority queue on node weight (density * span): initially S3 (5), S8 (3), S5 (2); after expansion, S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2). The first k nodes fetched from the queue form the final set of seeds.

Update Process
1. Initially, assign all nodes to the closest seed to form k clusters.
2. For each cluster, calculate the geometric center, using the centroid and density information to approximate each subtree.
3. Find the node that is closest to the geometric center and designate it as the new medoid.
4. Repeat from step 1 until the medoids converge.
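Putting the seeding and update slides together, here is a hedged Python sketch. It assumes the cover tree is already built and that each node carries the per-subtree statistics above (span, density DS, centroid CT); CoverNode, seed_from_cover_tree, and update_medoids are illustrative names, the points are assumed 2-D, and the authors' exact expansion bookkeeping is simplified.

```python
# Sketch of cover-tree based seeding (priority queue on weight = span * density)
# and the centroid-based assign/update loop. Illustrative only; assumes the
# cover tree and its per-node statistics already exist.
import heapq
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class CoverNode:
    point: Point                      # the real database object at this node
    span: float                       # max distance to any descendant
    density: int                      # DS: number of points in the subtree
    centroid: Point                   # CT: geometric center of the subtree
    children: List["CoverNode"] = field(default_factory=list)

def dist(a: Point, b: Point) -> float:
    return math.dist(a, b)

def seed_from_cover_tree(level_nodes: List[CoverNode], k: int) -> List[CoverNode]:
    """Expand the heaviest nodes (weight = span * density, their worst-case
    contribution to the distance cost) until the queue holds at least k nodes,
    then take the k heaviest as seeds."""
    heap, counter = [], 0
    def push(n: CoverNode) -> None:
        nonlocal counter
        heapq.heappush(heap, (-n.span * n.density, counter, n))  # max-heap via negation
        counter += 1
    for n in level_nodes:
        push(n)
    while len(heap) < k and heap:
        weight, idx, node = heapq.heappop(heap)
        if not node.children:                 # cannot expand a leaf any further
            heapq.heappush(heap, (weight, idx, node))
            break
        for child in node.children:           # nesting: node reappears as its own child
            push(child)
    return [entry[2] for entry in heapq.nsmallest(k, heap)]

def update_medoids(nodes: List[CoverNode], seeds: List[CoverNode],
                   max_iters: int = 50) -> List[CoverNode]:
    """Assign/update until convergence. Each subtree is approximated by its
    centroid weighted by its density; the new medoid of a cluster is the node
    whose object lies closest to the cluster's geometric center."""
    medoids = list(seeds)
    for _ in range(max_iters):
        clusters: List[List[CoverNode]] = [[] for _ in medoids]
        for n in nodes:                       # 1. assign to the closest medoid
            j = min(range(len(medoids)),
                    key=lambda i: dist(n.centroid, medoids[i].point))
            clusters[j].append(n)
        new_medoids = []
        for members, old in zip(clusters, medoids):
            if not members:
                new_medoids.append(old)
                continue
            total = sum(m.density for m in members)
            center = tuple(sum(m.centroid[d] * m.density for m in members) / total
                           for d in range(2))  # 2. density-weighted geometric center
            new_medoids.append(min(members,    # 3. node closest to that center
                                   key=lambda m: dist(m.point, center)))
        if [m.point for m in new_medoids] == [m.point for m in medoids]:
            break                              # 4. stop once the medoids converge
        medoids = new_medoids
    return medoids
```

In the slides' terms, level_nodes plays the role of the parent level m - 1 used for seeding, and nodes is the set of cover-tree nodes the algorithm currently works on.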
Challenges (recap)
- Representation modeling: what is the best set of representatives?
- Representative finding: how to find them efficiently?
- Query refinement: how to efficiently adapt to the user's query operations?

Query Adaptation
Handle user actions: zooming and selection (filtering).

Zooming
- Expand all nodes assigned to the chosen medoid.
- Run the k-medoid algorithm on the new set of nodes.

Effect of Selection on a Node
With respect to a selection, a node can be completely invalid, fully valid, or partially valid. Estimate the validity percentage (VG) of each node and multiply the VG by the node's weight (a small sketch of this weighting appears after the Conclusion).
[Figure: price/mileage plot with a selection region and nodes S1-S7 annotated with their densities.]

Experiments - Initial Medoid Quality
- Compared with the R-tree based method by M. Ester, H. Kriegel, and X. Xu.
- Data sets: a synthetic data set of 2-D points with a Zipf distribution, and a real data set (the LA data set from the R-tree Portal, 130k points).
- Measurements: time to compute the medoids, and average distance from a data point to its medoid.

Results on Synthetic Data
[Figure: time (seconds) and average distance vs. cardinality, from 256K to 4096K, for the R-tree and cover-tree methods.]
For various sizes of data, the cover-tree based method outperforms the R-tree based method.

Results on Real Data
[Figure: time (seconds) and average distance vs. k, from 2 to 512, for the R-tree and cover-tree methods.]
For various values of k, the cover-tree based method outperforms the R-tree based method on real data.

Query Adaptation
Compared with re-building the cover tree and running the k-medoid algorithm from scratch.
[Figure: distance vs. selectivity, from 0.8 down to 0.2, on synthetic and real data, for re-computation and incremental computation.]
The time cost of re-building is orders of magnitude higher than that of incremental computation.

Conclusion
- The authors proposed the MusiqLens framework for addressing the many-answer problem.
- They conducted a user study to select a metric for choosing representatives.
- They proposed an efficient method for computing and maintaining the representatives under user actions.
This work is part of the database usability project at the University of Michigan, led by Prof. H. V. Jagadish: http://www.eecs.umich.edu/db/usable/
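As a closing illustration of the selection-handling step (the validity percentage VG from the "Effect of Selection on a Node" slide), here is a small hedged sketch. The 1-D interval-overlap estimate and the names validity and adjusted_weight are assumptions for illustration; the paper's exact estimator may differ.

```python
# Sketch of weighting a cover-tree node under a range selection: estimate the
# fraction of the subtree that survives the predicate (the validity
# percentage VG) and scale the node's weight accordingly.

def validity(center: float, span: float, lo: float, hi: float) -> float:
    """Fraction of the node's extent [center - span, center + span] that lies
    inside the selection range [lo, hi]: 1.0 = fully valid,
    0.0 = completely invalid, anything in between = partially valid."""
    left, right = center - span, center + span
    if right <= left:                      # degenerate (leaf) node
        return 1.0 if lo <= center <= hi else 0.0
    overlap = max(0.0, min(right, hi) - max(left, lo))
    return overlap / (right - left)

def adjusted_weight(density: int, span: float, vg: float) -> float:
    """Weight used when re-seeding after a selection: the usual
    density * span contribution, scaled by the validity percentage."""
    return vg * density * span

# Example: a node centered at price 9,400 with span 600 under "price < 9,500".
vg = validity(center=9400.0, span=600.0, lo=0.0, hi=9500.0)   # partially valid
print(vg, adjusted_weight(density=42, span=600.0, vg=vg))
```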