Mirrors and Crystal Balls: A Personal Perspective on Data Mining
Raghu Ramakrishnan
ACM SIGKDD Innovation Award

Outline
• This award recognizes the work of many people, and I represent the many
  – A warp-speed tour of some earlier work
• What’s a data mining talk without predictions?
  – Some exciting directions for data mining that we’re working on at Yahoo!

A Look in the Mirror …
(and the faces I found there; unfortunately, I couldn’t find photos for some people)
(and apologies in advance for not discussing the related work that provided context and, often, tools and motivation)

1987 … 2007

Sequences, Streams
• SEQ
  – Sequence Data Processing. P. Seshadri, M. Livny, and R. Ramakrishnan. SIGMOD 1994
  – SEQ: A Model for Sequence Databases. P. Seshadri, M. Livny, and R. Ramakrishnan. ICDE 1995
  – The Design and Implementation of a Sequence Database System. P. Seshadri, M. Livny, and R. Ramakrishnan. VLDB 1996
• SRQL
  – SRQL: Sorted Relational Query Language. R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SSDBM 1998

Scalable Clustering
• BIRCH
  – BIRCH: A Clustering Algorithm for Large Multidimensional Datasets. T. Zhang, R. Ramakrishnan, and M. Livny. SIGMOD 1996
  – Fast Density Estimation Using CF-Kernels. T. Zhang, R. Ramakrishnan, and M. Livny. KDD 1999
  – Clustering Large Databases in Arbitrary Metric Spaces. V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. ICDE 1999
• Clustering Categorical Data
  – CACTUS: A Scalable Clustering Algorithm for Categorical Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. KDD 1999

Scalable Decision Trees
• RainForest
  – RainForest: A Framework for Fast Decision Tree Construction of Large Datasets. J. Gehrke, R. Ramakrishnan, and V. Ganti. VLDB 1998
• BOAT
  – BOAT: Optimistic Decision Tree Construction. J. Gehrke, V. Ganti, R. Ramakrishnan, and W-Y. Loh. SIGMOD 1999

Streaming and Evolving Data, Incremental Mining
• FOCUS
  – FOCUS: A Framework for Measuring Changes in Data Characteristics. V. Ganti, J. Gehrke, R. Ramakrishnan, and W-Y. Loh. PODS 1999
• DEMON
  – DEMON: Mining and Monitoring Evolving Data. V. Ganti, J. Gehrke, and R. Ramakrishnan. ICDE 1999

Mass Collaboration
• The QUIQ Engine: A Hybrid IR-DB System. N. Kabra, R. Ramakrishnan, and V. Ercegovac. ICDE 2003
• Mass Collaboration: A Case Study. R. Ramakrishnan, A. Baptist, V. Ercegovac, M. Hanselman, N. Kabra, A. Marathe, U. Shaft. IDEAS 2004
[Diagram: a customer’s QUESTION is routed to support agents, partner experts, customer champions, and employees; each ANSWER is added to the KNOWLEDGE BASE to power SELF SERVICE]

OLAP, Hierarchies, and Exploratory Mining
• Prediction Cubes. B-C. Chen, L. Chen, Y. Lin, R. Ramakrishnan. VLDB 2005
• Bellwether Analysis: Predicting Global Aggregates from Local Regions. B-C. Chen, R. Ramakrishnan, J.W. Shavlik, P. Tamma. VLDB 2006

Hierarchies Redux
• OLAP Over Uncertain and Imprecise Data. D. Burdick, P. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan. VLDB 2005
• Efficient Allocation Algorithms for OLAP Over Imprecise Data. D. Burdick, P.M. Deshpande, T.S. Jayram, R. Ramakrishnan, S. Vaithyanathan.
• Learning from Aggregate Views. B-C. Chen, L. Chen, D. Musicant, and R. Ramakrishnan. ICDE 2006
• Mondrian: Multidimensional K-Anonymity. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. ICDE 2006
• Workload-Aware Anonymization. K. LeFevre, D.J. DeWitt, R. Ramakrishnan. KDD 2006
• Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge. B-C. Chen, R. Ramakrishnan, K. LeFevre. VLDB 2007
• Composite Subset Measures. L. Chen, R. Ramakrishnan, P. Barford, B-C. Chen, V. Yegneswaran. VLDB 2006

Many Other Connections …
• Scalable Inference
  – Optimizing MPF Queries: Decision Support and Probabilistic Inference. H. Corrada Bravo, R. Ramakrishnan. SIGMOD 2007
• Relational Learning
  – View Learning for Statistical Relational Learning, with an Application to Mammography. J. Davis, E.S. Burnside, I. Dutra, D. Page, R. Ramakrishnan, V. Santos Costa, J.W. Shavlik.

Community Information Management
• Efficient Information Extraction over Evolving Text Data. F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE 2008
• Toward Best-Effort Information Extraction. W. Shen, P. DeRose, R. McCann, A. Doan, R. Ramakrishnan. SIGMOD 2008
• Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan. VLDB 2007
• Source-aware Entity Matching: A Compositional Approach. W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. ICDE 2007

… Through the Looking Glass
“Prediction is very hard, especially about the future.” – Yogi Berra

Information Extraction
… and the challenge of managing it

DBLife
• Integrated information about a (focused) real-world community
• Collaboratively built and maintained by the community
• CIMple software package

Search Results of the Future
[Screenshot: structured search results drawing on yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, WebMD]
(Slide courtesy Andrew Tomkins)

Opening Up Yahoo! Search
• Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
• Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)
Custom Search Experiences
• Social Search
• Vertical Search
• Visual Search
(Slide courtesy Prabhakar Raghavan)

Economics of IE
• Data is cheap; supervision is expensive
  – The cost of supervision, especially large, high-quality training sets, is high
  – By comparison, the cost of data is low
• Therefore
  – Rapid training-set construction/active-learning techniques
  – Tolerance for little (or low-quality) supervision
  – Take feedback and iterate rapidly

Example: Accepted Papers
• Every conference comes with a slightly different format for accepted papers
  – We want to extract accepted papers directly (before they make their way into DBLP etc.)
• Assume
  – Lots of background knowledge (e.g., DBLP from last year)
  – No supervision on the target page
• What can you do?

[Screenshot of an accepted-papers page]

Down the Page a Bit

Record Identification
• To get started, we need to identify records
  – Hey, we could write an XPath, no?
  – So, what if no supervision is allowed?
• Given a crude classifier for paper records, can we recursively split up this page?

First-Level Splits

After More Splits …

Now Get the Records
• Goal: extract the fields of individual records
• We need training examples, right?
  – But these papers are new
• The best we can do without supervision is noisy labels
  – From having seen other such pages

Partial, Noisy Labels

Extracted Records

Refining Results via Feedback
• Now let’s shift slightly to consider extraction of publications from academic home pages
  – Must identify publication sections of faculty home pages, and extract paper citations from them
• Underlying data model for extracted data:
  – A flexible graph-based model (similar to RDF or an ER conceptual model)
  – “Confidence” scores per attribute or relationship

Extracted Publication Titles

A Dubious Extracted Publication …
• PSOX provides declarative lineage tracking over operator executions

Where’s the Problem?
• Use lineage to find the source of the problem …

Source Page
• Hmm, not a publication page … (but it may have looked like one to a classifier)

Feedback
• User corrects the classification of that section …

Faculty or Student?
• NLP
• Build a classifier
• Or …

… Another Clue …

… Stepping Back …
• Leads to large-scale, partially-labeled relational learning
• Involving different types of entities and links
[Diagram: Prof-List and Student-List pages linked to Prof and Student entities via AdvisorOf edges, with papers p1, p2, p3]

Maximizing the Value of What You Select to Show Users

Content Optimization
• PROBLEM: Match-making between content, user, and context
  – Content: programmed (e.g., by editors); acquired (e.g., RSS feeds, UGC)
  – User: individual (e.g., b-cookie) or user segment
  – Context: e.g., Y! or non-Y! property; device; time period
• APPROACH: Scalable algorithms that select content to show, using editorially determined content mixes, and respecting editorially set constraints and policies

Team from Y! Research
• Bee-Chung Chen, Pradheep Elango, Deepak Agarwal, Raghu Ramakrishnan, Wei Chu, Seung-Taek Park

Team from Y! Engineering
• Nitin Motgi, Joe Zachariah, Scott Roy, Todd Beaupre, Kenneth Fox

Yahoo! Home Page Featured Box
• The top-center part of the Y! Front Page
• It has four tabs: Featured, Entertainment, Sports, and Video

Traditional Role of Editors
• Strict quality control
  – Preserve the “Yahoo! Voice” (e.g., the typical mix of content)
  – Community standards
  – Quality guidelines (e.g., topical articles shown for a limited time)
• Program articles periodically
  – New ones pushed, old ones taken out
• A few tens of unique articles per day
  – 16 articles at any given time; editors keep up with novel articles and remove fading ones
  – Choose which articles appear in which tabs

Content Optimization Approach
• Editors continue to determine content sources, program some content, determine policies to ensure quality, and specify business constraints
  – But we use a statistically based machine learning algorithm to determine which articles to show where when a user visits the Front Page

Modeling Approach
• Purely feature-based (did not work well):
  – Article: URL, keywords, categories
  – Build offline models to predict CTR when an article is shown to users
  – Models considered: logistic regression with feature selection; decision trees; feature segments through clustering
• Track CTR per article in user segments through online models
  – This worked well; the approach we eventually took

Challenges
• Non-stationary CTR
• To ensure webpage stability, we show the same article until we find a better one
  – CTR decays over time, sharply at F1
  – Time-of-day and day-of-week effects in CTR

Modeling Approach
• Track item scores through dynamic linear models (fast Kalman filter algorithms)
• We model decay explicitly in our models
• We have a global time-of-day curve explicitly in our online models

Explore/Exploit
• What is the best strategy for new articles?
  – If we show it and it’s bad: lose clicks
  – If we delay and it’s good: lose clicks
• Solution: show it while we don’t have much data, if it looks promising
  – A classical multi-armed-bandit-type problem
  – Our setup is different from the ones studied in the literature; a new ML problem

Novel Aspects
• Classical: arms assumed fixed over time
  – We gain and lose arms over time
  – Some theoretical work by Whittle in the ’80s (operations research)
• Classical: serving rule updated after each pull
  – We compute an optimal design in batch mode
• Classical: generally, CTR assumed stationary
  – We have highly dynamic, non-stationary CTRs

Some Other Complications
• We run multiple (possibly correlated) experiments simultaneously; effective sample-size calculation is a challenge
• Serving bias: it is incorrect to learn from data gathered under serving scheme A and apply it to serving scheme B
  – Need an unbiased quality score
  – Bias sources: positional effects, time effects, set of articles shown together
• Incorporating feature-based techniques
  – Regression-style, e.g., logistic regression
  – Tree-based (hierarchical bandits)

System Challenges
• Highly dynamic system characteristics:
  – Short article lifetimes, constantly changing pool, dynamic user population, non-stationary CTRs
  – Quick adaptation is key to success
• Scalability:
  – Thousands of page views/sec; data collection, model training, and article scoring done under tight latency constraints

Results
• We built an experimental infrastructure to test new content-serving schemes
  – Ran side-by-side experiments on live traffic
  – Experiments ran for several months; we consistently outperformed the old system
  – Results showed we get more clicks by engaging more users
  – Editorial overrides did not reduce lift numbers substantially

Comparing Buckets

Experiments
• Daily CTR lift relative to editorial serving

Lift Is Due to Increased Reach
• Lift in the fraction of clicking users

Related Work
• Amazon, Netflix, Y! Music, etc.:
  – Collaborative filtering with a large content pool
  – Achieve lift by eliminating bad articles
  – We have a small number of high-quality articles
• Search, advertising:
  – A matching problem with a large content pool
  – Match through feature-based models

Summary of Approach
• Offline models to initialize online models
• Online models to track performance
• Explore/exploit to converge fast
• Study user visit patterns and behavior; program content accordingly

Summary
• There are some exciting “grand challenge” problems that will require us to bring to bear ideas from data management, statistics, learning, and optimization
  – i.e., data mining problems!
• Our field is too young to think about growing old, but the best is yet to be …
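A coda on the extraction example: the record-identification step described earlier, recursively splitting a page using only a crude classifier for what a paper record looks like, can be illustrated in toy form. Everything below (the function names, the stand-in classifier, the nested-list stand-in for a DOM tree) is hypothetical and far simpler than the actual system:

```python
# Toy sketch of unsupervised record identification by recursive page
# splitting: given only a crude classifier that judges whether a page
# region "looks like" a paper record, split the page top-down until
# each fragment is a single record candidate.

def looks_like_record(text):
    # Crude stand-in classifier (hypothetical): a paper record tends
    # to contain a comma-separated author list and fits on one line.
    return ("," in text) and (text.count("\n") == 0)

def split_into_records(node):
    """Recursively split a page region (a nested list of strings,
    standing in for a DOM subtree) into individual record candidates."""
    if isinstance(node, str):
        return [node] if looks_like_record(node) else []
    records = []
    for child in node:
        records.extend(split_into_records(child))
    return records

# Toy "page": nesting mimics first-level and deeper splits.
page = [
    "Accepted Papers",  # heading, not a record
    [
        "BIRCH: ..., T. Zhang, R. Ramakrishnan, M. Livny",
        "RainForest: ..., J. Gehrke, R. Ramakrishnan, V. Ganti",
    ],
]
print(split_into_records(page))
```

In the real setting the classifier would itself be bootstrapped from background knowledge such as last year's DBLP, and the splits would operate on the page's actual DOM tree, with only partial, noisy labels available for training the field extractors.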
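Similarly, the content-optimization serving loop, online CTR tracking that adapts to non-stationary CTRs plus an explore/exploit step over a changing article pool, might look roughly like the toy epsilon-greedy sketch below, using exponentially decayed counts. It is illustrative only, not Yahoo!'s production algorithm (which uses dynamic linear models and batch-computed exploration designs); all names and constants are assumptions:

```python
import random

class ArticleStats:
    """Online tracker for one article's CTR, using exponentially
    decayed click/view counts so the estimate follows a drifting CTR."""

    def __init__(self, prior_clicks=1.0, prior_views=10.0):
        # Mildly optimistic prior so new articles get a chance.
        self.clicks = prior_clicks
        self.views = prior_views

    def ctr(self):
        return self.clicks / self.views

    def update(self, clicked, decay=0.99):
        # Decay old evidence before adding the new observation.
        self.clicks = decay * self.clicks + (1.0 if clicked else 0.0)
        self.views = decay * self.views + 1.0

def serve(pool, stats, epsilon=0.1, rng=random):
    """Pick an article id: explore uniformly with probability epsilon,
    otherwise exploit the highest tracked CTR."""
    if rng.random() < epsilon:
        return rng.choice(list(pool))
    return max(pool, key=lambda a: stats[a].ctr())

# Simulate a pool with one clearly better article.
random.seed(0)
true_ctr = {"a": 0.02, "b": 0.10}
stats = {a: ArticleStats() for a in true_ctr}
for _ in range(5000):
    a = serve(list(true_ctr), stats)
    stats[a].update(random.random() < true_ctr[a])
print(round(stats["a"].ctr(), 3), round(stats["b"].ctr(), 3))
```

Replacing the decayed-count tracker with a Kalman-filter-style dynamic linear model, and the per-impression epsilon-greedy rule with a batch-computed exploration design over arms that appear and disappear, would bring the sketch closer to the setup the slides describe.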