Analysing and Modelling Large-Scale Enterprise Data
Thore Graepel
Online Services and Advertising Group
Microsoft Research Cambridge

Overview
• Complex large-scale data in the enterprise
  – What kind of data is available?
  – What technologies are used?
  – Tasks and enterprise-specific challenges?
• Methodology:
  – Bayesian Inference in Factor Graph Models
  – PQL: Using SQL to describe probability models
• Applications:
  – Gamer Rating and Matchmaking: TrueSkill
  – Click-Through Rate Prediction: AdPredictor
  – Large-Scale Recommendations: Matchbox

Complex Data
Joint work with Tom Minka & Phillip Trelford

Data Sources at Microsoft (External)
• Online Services Division
  – Web index
  – Search and Ad click logs (12-15 TB / day)
  – Hotmail, Instant messaging, Internet Explorer (100s of millions of users)
  – MSN portal and Bing maps
• Xbox Live Gaming Service
  – User transaction log data
  – Ranking and matchmaking data
  – Game instrumentation for user testing

Data Sources at Microsoft (Internal)
• Development and Software Instrumentation
  – Watson (customer feedback data)
  – Source depot (MS source code, e.g., Office, Windows)
  – Multilingual technical documentation
• Business
  – Customer databases
  – Sales and Marketing

Data-Intensive Tasks at Microsoft
• Prediction of user behaviour and preferences
  – Improve web search
  – Improve targeting for advertising
  – Spam filtering and content prioritisation
• Improve user experience
  – Matchmaking for games
  – Multi-modal user interfaces (Natal, speech)
• Improve software development process
  – Improve productivity of developers
  – Analyse software for defects

Technical Infrastructure
• Relational Databases/SQL
  – Great agility for analysis and reliability for business
  – Limited scalability
  – Need to import data into SQL
• Windows HPC
  – Complex computations / fine-grained parallelism
  – Need to move data to HPC cluster
• Cosmos
  – Take the computation to the data
  – Super efficient stream-based computations

Cosmos Architecture
[Figure: Cosmos architecture diagram. Labels: SCOPE, DryadLINQ, Sputnik, Dryad, Cosmos]
[Figure labels, continued: four cluster machines, each with a stream]

Enterprise/Online specific challenges
• Privacy
  – Privacy limits the ways in which data can be used
  – Interesting trade-offs (differential privacy)
• Incentives
  – Data produced by self-interested agents
  – Need to design incentive-compatible mechanisms
• Exploration/Exploitation
  – Results of inference feed back into the business process and determine future observations
  – Need to aim at long-term benefits

Factor Graphs

Factor Graphs / Trees
• Definition: Graphical representation of product structure of a function (Wiberg, 1996)
  – Nodes: Factors and Variables
  – Edges: Dependencies of factors on variables
• Question:
  – What are the marginals of the function (all but one variable are summed out)?
  – What is the mode of the function?

Factor Graphs and Bayesian Inference
• Bayes’ law
• Factorising prior
• Factorising likelihood
• Sum out latent variables
[Figure: factor graph with node labels s1, s, s2, t1, t2, d, y]

Factor Trees: Separation
[Figure: factor tree over variables v, w, x, y, z with factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable!

Messages: From Factors To Variables
[Figure: variables w, x, y, z with factors f2(w,x), f3(x,y), f4(x,z)]
Observation: Factors only need to sum out all their local variables!

Messages: From Variables To Factors
[Figure: variables x, y, z with factors f2(w,x), f3(x,y), f4(x,z)]
Observation: Variables pass on the product of all incoming messages!

The Sum-Product Algorithm
• Three update equations (Aji & McEliece, 1997)
• Update equations can be directly derived from the distributive law.
• Efficient for messages in the exponential family.
• Calculate all marginals at the same time.

Approximate Message Passing
• Problem: The exact messages from factors to variables may not be closed under products.
• Solution: Approximate the marginal as well as possible in the sense of minimal KL divergence.
• Expectation Propagation (Minka, 2001): Approximate the marginal by moment matching, resulting in [equation missing in source]

Distributed Message Passing
• Map-Reduce for IID data
  – Map: Nodes compute messages $m_{f_i \to s}$ from data $y_i$ and $m_{s \to f_i}$
  – Reduce: Combine messages $m_{f_i \to s}$ into $p_s$ by multiplication
• Caveats:
  – All approximate data factors need the incoming message $m_{s \to f_i}$!
  – All messages $m_{f_i \to s}$ need to be stored if the same data point is considered multiple times
[Figure: shared variable s connected to data factors for y1, y2, y3]

PQL
Joint work with Ralf Herbrich & Jurgen Van Gael

PQL as a Platform
[Figure: PQL Platform diagram. Labels: Infer.NET, Machine Learning, PQL, DryadLINQ, Declarative Language, Distributed Computation]

PQL I – Augmenting Schemas
    People = AUGMENT DB.People ADD weight FLOAT
[Figure: table DB.People and the augmented table People with the added column weight]

PQL II – Factor Types
• Single Relation
• Cross Relation
• Cross Entity
[Figure: one diagram per factor type. Table labels: Table 1, Table 2]

PQL III – Single Relation Factors
    FACTOR Normal(p.weight, 75.0, 25.0)
    FROM People p
[Figure: People table]

PQL IV – Cross Relation Factors
    FACTOR Normal(g.weight, p.weight, 1.0)
    FROM People p, DrVisit g
    WHERE p.PersonID = g.PersonID
[Figure: People and DrVisit tables]

PQL as a Unifying Platform
[Figure: PQL underlying three applications]
• TrueSkill (Skill Estimation)
• AdPredictor (CTR Prediction)
• Matchbox (Recommendation)

TrueSkill™
Joint work with Tom Minka & Phillip Trelford

TrueSkill™
• Given:
  – Match outcomes: Orderings among $k$ teams consisting of $n_1, n_2, \ldots, n_k$ players, respectively
• Questions:
  – Skill $s_i$ for each player such that [equation missing in source]
  – Global ranking among all players
  – Fair matches between teams of players

Efficient Approximate Inference
• Gaussian Prior Factors
• Ranking Likelihood Factors
• Fast and efficient approximate message passing using Expectation Propagation
[Figure: factor graph with node labels s1, s2, s3, s4, t1, t2, t3, y12, y23]

TrueSkill: Superfast convergence to True Skills
[Figure: Level (0 to 40) against Games played (0 to 400) for players char and SQLwildman, each shown under TrueSkill™ and under the Halo 2 Beta ranking]

Applications to Online Gaming
• Leaderboard
  – Global ranking of all players
• Matchmaking
  – For gamers: Most uncertain outcome
  – For inference: Most informative
  – Both are equivalent!

TrueSkill in Xbox 360 and Halo 3
Xbox 360 Live
• Launched September 2005
• Every Xbox 360 game uses TrueSkill™
• > 6 million players
• > 1 million matches per day
• > 2 billion hours of accumulated game-play
Halo 3
• Launched on 25th September 2007
• Largest entertainment launch in history
• > 500,000 players concurrently playing
• Reference implementation of TrueSkill

AdPredictor
Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford

Why Predict Probability-of-Click?
• Display (according to expected revenue)
  – $1.00 * 10% = $0.10
  – $2.00 * 4% = $0.08
  – $0.10 * 50% = $0.05
• Charge (per click)
  – $0.80
  – $1.25
  – $0.05
• Advantages of improved probability estimates:
  – Increase user satisfaction by better targeting
  – Fairer charges to advertisers
  – Increase revenue by showing ads with high click-through rate

AdPredictor Details
[Figure: feature values combined by a sum node (+) into P(pClick)]
• Client IP: 102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86
• Match Type: Exact Match, Broad Match
• Position: ML-1, SB-1, SB-2

Training Algorithm in Action
[Figure: factor graph with weights w1 and w2, a sum node (+), score s and outcome c (Click / No Click)]

Client IP: Mean & Variance
[Figure: plot not recoverable]

UserAgent: Mean Posterior Effects
[Figure: plot not recoverable]

AdPredictor in Bing Search Engine
• AdPredictor now runs on 100% of Paid Search traffic in Microsoft’s Bing search engine
• Relevance and Click-Through Rate of ads improved
• Calibrated CTR prediction provides a solid foundation for further improvements
• AdPredictor is being explored for other tasks such as contextual and display advertising

Matchbox
Joint work with David Stern and Ralf Herbrich

Collaborative Filtering
[Figure: ratings matrix with Users A to D as rows and Items 1 to 6 as columns; unknown entries are marked "?", with the question "Metadata?" attached to both users and items]
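The collaborative filtering setting in the figure above can be sketched as a sparse data structure. This is only an illustration: the user and item labels mirror the figure's axes, but the rating values and the helper names (`observed`, `unobserved_cells`, `density`) are hypothetical and do not come from the talk.

```python
from itertools import product

# Users A-D and items 1-6 mirror the axes of the figure.
users = ["A", "B", "C", "D"]
items = [1, 2, 3, 4, 5, 6]

# Hypothetical observed ratings, stored sparsely as {(user, item): rating}.
# The values are invented for illustration only.
observed = {
    ("A", 1): 5, ("A", 3): 2,
    ("B", 2): 4, ("B", 3): 1, ("B", 6): 3,
    ("C", 1): 4, ("C", 5): 5,
    ("D", 4): 2,
}


def unobserved_cells(users, items, observed):
    """Return the (user, item) pairs without a rating: the '?' cells to predict."""
    return [(u, i) for u, i in product(users, items) if (u, i) not in observed]


def density(users, items, observed):
    """Fraction of the full user-by-item matrix that has been observed."""
    return len(observed) / (len(users) * len(items))


missing = unobserved_cells(users, items, observed)
print(len(missing), "of", len(users) * len(items), "cells are unknown")
```

Storing only the observed cells matters at scale, since a real ratings matrix is overwhelmingly empty; the task is to predict the missing cells, possibly with the help of user and item metadata.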
Map Sparse Features To ‘Trait’ Space
[Figure: sparse user and item features mapped into a shared trait space]
• User ID: 234566, 456457, 13456, 654777
• Gender: Male, Female
• Country: UK, USA
• Height: 1.2m
• Item ID: 34, 345, 64, 5474
• Movie Genre: Horror, Drama, Comedy, Documentary

Message Passing For Matchbox
[Figure: factor graph with weight nodes u01, u11, u21, u02, u12, u22 and v11, v21, v12, v22, sum nodes (+), trait nodes s1, s2 and t1, t2, product nodes (*) and output r]
Message update functions powered by Infer.NET

User/Item Taste Space
[Figure: two-dimensional taste space with both axes running from -1.5 to 1.5, showing Users and Movies and the ‘Preference Cone’ for user 145035. Labels: Adaptation, 24: Season 3, 24: Season 2, A Clockwork Orange, A Knight's Tale, AI: Artificial Intelligence, A Cinderella Story]

Applications
• Ranking of content on web portals
• Online advertising (Display and Paid Search)
• Personalised web search
• Algorithm portfolio management
• Tweet/News recommendation
• Friends recommendation on social platforms

Conclusions
• Great variety of data sources and tasks
• Challenges: privacy, incentives, exploration
• Tools: SQL, NoSQL, HPC
• Modelling platform (Factor Graphs & PQL):
  – Represent uncertainty
  – Composable models
  – Distributed, data-centric computation
• Applications: TrueSkill, AdPredictor, Matchbox
• Thanks!
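As a backup illustration of the "Message Passing For Matchbox" graph, the forward computation it suggests can be sketched as follows: the sum nodes (+) add up the weights of the active sparse features to give user traits s and item traits t, and the product nodes (*) feed a sum that yields r. This is a minimal sketch under stated assumptions, not the talk's implementation: it uses fixed point-estimate weights rather than the Gaussian beliefs that message passing would maintain, it omits noise and any bias terms, and all weight values and names (`user_weights`, `item_weights`, `traits`, `predicted_affinity`) are hypothetical. Only the feature values (User ID 234566, Male, UK, Item ID 345, Horror) are taken from the feature slide.

```python
# Number of trait dimensions (the graph shows two: s1, s2 and t1, t2).
K = 2

# Hypothetical point-estimate weights: one K-dimensional vector per feature value.
user_weights = {
    "user_id=234566": [0.5, -0.2],
    "gender=Male": [0.1, 0.3],
    "country=UK": [-0.1, 0.2],
}
item_weights = {
    "item_id=345": [0.4, 0.1],
    "genre=Horror": [0.6, -0.5],
}


def traits(active_features, weights, k=K):
    """Sum the weight vectors of the active features (the '+' nodes)."""
    out = [0.0] * k
    for name in active_features:
        for d, w in enumerate(weights[name]):
            out[d] += w
    return out


def predicted_affinity(user_features, item_features):
    """Inner product of user and item traits (the '*' nodes followed by '+')."""
    s = traits(user_features, user_weights)
    t = traits(item_features, item_weights)
    return sum(sk * tk for sk, tk in zip(s, t))


r = predicted_affinity(
    ["user_id=234566", "gender=Male", "country=UK"],
    ["item_id=345", "genre=Horror"],
)
print(round(r, 3))
```

Because every user and item is described by a handful of active features, a new user with known metadata but no ID-specific history still receives a trait vector from the shared gender and country weights.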