Analysing and Modelling Large-Scale Enterprise Data
Thore Graepel
Online Services and Advertising Group
Microsoft Research Cambridge
Overview
• Complex large-scale data in the enterprise
– What kind of data is available?
– What technologies are used?
– Tasks and enterprise-specific challenges?
• Methodology:
– Bayesian Inference in Factor Graph Models
– PQL: Using SQL to describe probability models
• Applications:
– Gamer Rating and Matchmaking: TrueSkill
– Click-Through Rate Prediction: AdPredictor
– Large-Scale Recommendations: Matchbox
Complex Data
Joint work with Tom Minka & Phillip Trelford
Data Sources at Microsoft (External)
• Online Services Division
– Web index
– Search and Ad click logs (12-15 TB / day)
– Hotmail, Instant messaging, Internet Explorer (hundreds of millions of users)
– MSN portal and Bing maps
• Xbox Live Gaming Service
– User transaction log data
– Ranking and matchmaking data
– Game instrumentation for user testing
Data Sources at Microsoft (Internal)
• Development and Software Instrumentation
– Watson (customer feedback data)
– Source depot (MS source code, e.g., Office,
Windows)
– Multilingual technical documentation
• Business
– Customer databases
– Sales and Marketing
Data-Intensive Tasks at Microsoft
• Prediction of user behaviour and preferences
– Improve web search
– Improve targeting for advertising
– Spam filtering and content prioritisation
• Improve user experience
– Matchmaking for games
– Multi-modal user interfaces (Natal, speech)
• Improve software development process
– Improve productivity of developers
– Analyse software for defects
Technical Infrastructure
• Relational Databases/SQL
– Great agility for analysis and reliability for business
– Limited scalability
– Need to import data into SQL
• Windows HPC
– Complex computations / fine grained parallelism
– Need to move data to HPC cluster
• Cosmos
– Take the computation to the data
– Super efficient stream based computations
Cosmos Architecture
[Figure: Cosmos architecture diagram. Labels: SCOPE, DryadLINQ, Sputnik, Dryad, Cosmos; Cluster, Machine, Stream (repeated four times)]
Enterprise/Online specific challenges
• Privacy
– Privacy limits the ways in which data can be used
– Interesting trade-offs (differential privacy)
• Incentives
– Data produced by self-interested agents
– Need to design incentive compatible mechanisms
• Exploration/Exploitation
– Results of inference feed back into business process
and determine future observations.
– Need to aim at long-term benefits
Factor Graphs
Factor Graphs / Trees
• Definition: Graphical representation of product
structure of a function (Wiberg, 1996)
– Nodes: Factors and Variables
– Edges: Dependencies of factors on variables.
• Question:
– What are the marginals of the function (all but one
variable are summed out)?
– What is the mode of the function?
Factor Graphs and Bayesian Inference
• Bayes’ law
• Factorising prior
• Factorising likelihood
• Sum out latent variables
[Figure: factor graph with node labels s1, s, s2, t1, t2, d, y]
Factor Trees: Separation
[Figure: factor tree over variables v, w, x, y, z with factors f1(v,w), f2(w,x), f3(x,y), f4(x,z)]
Observation: Sum of products becomes product of sums of all messages from neighbouring factors to variable!
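The separation observation can be checked numerically on the factor tree from this slide. The sketch below is illustrative only: the factor tables are random, and the number of states per variable (K = 3) is an assumption, not something given in the talk. It compares the marginal of x computed by summing the full product over all variables with the product of the three messages arriving at x.

```python
import itertools
import random

random.seed(0)
K = 3  # number of states per variable (assumed for illustration)


def random_table():
    """A K x K table of positive factor values."""
    return [[random.random() for _ in range(K)] for _ in range(K)]


# Index order follows the slide: f1[v][w], f2[w][x], f3[x][y], f4[x][z].
f1, f2, f3, f4 = (random_table() for _ in range(4))


def brute_force_marginal_x():
    """Sum the full product over v, w, y, z for every value of x."""
    p = [0.0] * K
    for v, w, x, y, z in itertools.product(range(K), repeat=5):
        p[x] += f1[v][w] * f2[w][x] * f3[x][y] * f4[x][z]
    return p


def message_passing_marginal_x():
    """Product of the messages sent to x by its neighbouring factors."""
    m_f1_w = [sum(f1[v][w] for v in range(K)) for w in range(K)]
    m_f2_x = [sum(f2[w][x] * m_f1_w[w] for w in range(K)) for x in range(K)]
    m_f3_x = [sum(f3[x][y] for y in range(K)) for x in range(K)]
    m_f4_x = [sum(f4[x][z] for z in range(K)) for x in range(K)]
    return [m_f2_x[x] * m_f3_x[x] * m_f4_x[x] for x in range(K)]
```

The brute-force version touches K^5 terms, while the message-passing version only ever sums over one neighbouring variable at a time.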
Messages: From Factors To Variables
[Figure: factors f2(w,x), f3(x,y), f4(x,z) around variable x, with neighbouring variables w, y, z]
Observation: Factors only need to sum out all their local variables!
Messages: From Variables To Factors
[Figure: variable x with neighbouring factors f2(w,x), f3(x,y), f4(x,z) and variables y, z]
Observation: Variables pass on the product of all incoming messages!
The Sum-Product Algorithm
• Three update equations (Aji & McEliece, 1997)
• Update equations can be directly derived from the
distributive law.
• Efficient for messages in the exponential family.
• Calculate all marginals at the same time.
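The three update equations did not survive in this text. In their standard textbook form they can be written as below; the notation is an assumption rather than the slide's own: F_v denotes the factors neighbouring variable v, V_f the variables neighbouring factor f, and x_u the value of variable u.

```latex
\begin{align}
  p(x_v) &= \prod_{f \in F_v} m_{f \to v}(x_v) \\
  m_{f \to v}(x_v) &= \sum_{\{x_u \,:\, u \in V_f \setminus \{v\}\}}
      f(\mathbf{x}_{V_f}) \prod_{u \in V_f \setminus \{v\}} m_{u \to f}(x_u) \\
  m_{v \to f}(x_v) &= \prod_{g \in F_v \setminus \{f\}} m_{g \to v}(x_v)
\end{align}
```

The first line is the marginal as a product of incoming messages, the second is the factor-to-variable message (sum out the factor's other variables), and the third is the variable-to-factor message (product of all other incoming messages).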
Approximate Message Passing
• Problem: The exact messages from factors to
variables may not be closed under products.
• Solution: Approximate the marginal as well as
possible in the sense of minimal KL divergence.
• Expectation Propagation (Minka, 2001): Approximate the marginal by moment-matching
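For a Gaussian approximating family, minimising the KL divergence from the true marginal to the approximation amounts to matching mean and variance. A minimal sketch of that projection step follows; the target here is a Gaussian mixture, chosen purely for illustration because it is not closed under the family, and the function name is hypothetical.

```python
def match_moments(weights, means, variances):
    """Project a 1-D Gaussian mixture onto a single Gaussian.

    Returns the (mean, variance) of the Gaussian with the same first and
    second moments as the mixture, which is the KL(p || q) minimiser
    among Gaussians q.
    """
    total = sum(weights)
    mean = sum(w * m for w, m in zip(weights, means)) / total
    second_moment = sum(
        w * (v + m * m) for w, m, v in zip(weights, means, variances)
    ) / total
    return mean, second_moment - mean * mean


# Two equally weighted unit-variance components at -1 and +1.
approx_mean, approx_var = match_moments([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

In Expectation Propagation this projection is applied to the marginal rather than to the message itself, which is why the approximate factor needs its incoming message.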
Distributed Message Passing
• Map-Reduce for IID data
– Map: Nodes compute messages m_{f_i→s} from data y_i and m_{s→f_i}
– Reduce: Combine messages m_{f_i→s} into p(s) by multiplication
• Caveats:
– All approximate data factors need the incoming message m_{s→f_i}!
– All messages m_{f_i→s} need to be stored if the same data point is considered multiple times
[Figure: variable s connected to observations y1, y2, y3]
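The map and reduce steps can be sketched for the simplest case, where every data factor is an exact Gaussian likelihood so that no incoming message is needed. The noise variance and all function names below are assumptions made for illustration. Messages are held in natural parameters (precision, precision times mean), so multiplying Gaussians reduces to adding pairs.

```python
from functools import reduce

NOISE_VAR = 1.0  # assumed observation noise; not specified in the slides


def map_message(y):
    """Map: message from the data factor for y to s, in natural parameters."""
    return (1.0 / NOISE_VAR, y / NOISE_VAR)


def multiply(a, b):
    """Reduce: a product of Gaussians adds their natural parameters."""
    return (a[0] + b[0], a[1] + b[1])


def posterior(prior_mean, prior_var, data):
    """Combine the prior with all data messages; return (mean, variance)."""
    prior = (1.0 / prior_var, prior_mean / prior_var)
    precision, precision_mean = reduce(multiply, map(map_message, data), prior)
    return precision_mean / precision, 1.0 / precision


mean, var = posterior(0.0, 1.0, [1.0, 2.0, 3.0])
```

Because the reduce step is associative and commutative, partial products can be computed on separate machines and merged in any order. With approximate factors the map step would additionally take the current m_{s→f_i} as input, as the caveats note.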
PQL
Joint work with Ralf Herbrich & Jurgen Van Gael
PQL as a Platform
[Figure: the PQL platform, combining Infer.NET (machine learning), PQL (declarative language) and DryadLINQ (distributed computation)]
PQL I – Augmenting Schemas
People = AUGMENT DB.People ADD weight FLOAT
[Figure: table DB.People and the augmented table People with the added column weight]
PQL II – Factor Types
• Single Relation
• Cross Relation
• Cross Entity
[Figure: the three factor types illustrated on Table 1 and Table 2]
PQL III – Single Relation Factors
FACTOR Normal(p.weight,75.0,25.0) FROM People p
[Figure: one factor attached to each row of People]
PQL IV – Cross Relation Factors
FACTOR Normal(g.weight, p.weight, 1.0)
FROM People p, DrVisit g
WHERE p.PersonID = g.PersonID
[Figure: factors connecting rows of People with the matching rows of DrVisit]
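One way to read the two FACTOR statements is as templates that are grounded once per row (or per joined pair of rows) into a factor graph. The Python sketch below mimics that grounding with plain lists; it is not PQL or Infer.NET. The toy rows are made up, DrVisit.weight is assumed to be an observed column, and the third argument of Normal is assumed to be a variance, since the slides do not say.

```python
import math

# Hypothetical toy tables.
people = [{"PersonID": 1}, {"PersonID": 2}]
dr_visit = [
    {"PersonID": 1, "weight": 80.0},
    {"PersonID": 1, "weight": 82.0},
    {"PersonID": 2, "weight": 60.0},
]


def log_normal(x, mean, variance):
    """Log density of N(x | mean, variance)."""
    return -0.5 * math.log(2 * math.pi * variance) - 0.5 * (x - mean) ** 2 / variance


def ground_factors():
    """Expand both FACTOR templates into one entry per matching row."""
    factors = []
    for p in people:  # FACTOR ... FROM People p
        factors.append(("prior", p["PersonID"], None))
    for p in people:  # FACTOR ... FROM People p, DrVisit g WHERE ids match
        for g in dr_visit:
            if p["PersonID"] == g["PersonID"]:
                factors.append(("visit", p["PersonID"], g["weight"]))
    return factors


def log_joint(latent_weight):
    """Log of the product of all grounded factors.

    latent_weight maps PersonID to a value of the augmented People.weight.
    """
    total = 0.0
    for kind, pid, observed in ground_factors():
        if kind == "prior":
            total += log_normal(latent_weight[pid], 75.0, 25.0)
        else:
            total += log_normal(observed, latent_weight[pid], 1.0)
    return total
```

With these toy rows the grounding yields two prior factors and three visit factors; the join condition decides which latent weight each measurement is attached to.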
PQL as a Unifying Platform
[Figure: PQL as the common platform for TrueSkill (Skill Estimation), AdPredictor (CTR Prediction) and Matchbox (Recommendation)]
TrueSkill™
Joint work with Tom Minka & Phillip Trelford
TrueSkill™
• Given:
– Match outcomes: Orderings among k teams consisting of n_1, n_2, ..., n_k players, respectively
• Questions:
– Skill s_i for each player such that
– Global ranking among all players
– Fair matches between teams of players
Efficient Approximate Inference
Fast and efficient approximate message passing using Expectation Propagation
[Figure: TrueSkill factor graph with Gaussian prior factors on skills s1 to s4, team performances t1 to t3, and ranking likelihood factors on the observed outcomes y]
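For the special case of two single-player teams and no draws, message passing on this graph collapses to a closed-form update. The sketch below follows the update equations as published for TrueSkill; the default values (prior mean 25, prior standard deviation 25/3, performance noise beta = 25/6) are the commonly quoted defaults and are assumptions here, since they do not appear in this text. Skill dynamics and draw margins are left out.

```python
import math

BETA = 25.0 / 6.0  # assumed performance noise (commonly quoted default)
PRIOR = (25.0, (25.0 / 3.0) ** 2)  # assumed prior (mean, variance)


def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_1v1(winner, loser, beta=BETA):
    """Return updated (mean, variance) beliefs for winner and loser."""
    (mu_w, var_w), (mu_l, var_l) = winner, loser
    c2 = 2.0 * beta ** 2 + var_w + var_l  # variance of the performance gap
    c = math.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = pdf(t) / cdf(t)  # mean correction from the truncated Gaussian
    w = v * (v + t)      # variance correction, between 0 and 1
    new_winner = (mu_w + var_w / c * v, var_w * (1.0 - var_w / c2 * w))
    new_loser = (mu_l - var_l / c * v, var_l * (1.0 - var_l / c2 * w))
    return new_winner, new_loser


new_winner, new_loser = update_1v1(PRIOR, PRIOR)
```

An unexpected result (small or negative t) gives a large v and therefore a large shift in the means, while an expected result changes the beliefs very little.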
TrueSkill: Superfast convergence to True Skills
[Figure: Level (0 to 40) against Games played (0 to 400) for the players char and SQLwildman, comparing TrueSkill™ with the Halo 2 Beta ranking]
Applications to Online Gaming
• Leaderboard
– Global ranking of all players
• Matchmaking
– For gamers: Most uncertain outcome
– For inference: Most informative
– Both are equivalent!
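The "most uncertain outcome" criterion can be sketched by computing a predicted win probability for each candidate opponent and picking the one closest to 50%. This is a simplified illustration under assumed values, not the matchmaking rule used in production: the function names, the beta value, and the candidate beliefs are all hypothetical.

```python
import math

BETA = 25.0 / 6.0  # assumed performance noise


def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def win_probability(a, b, beta=BETA):
    """P(a beats b) for Gaussian skill beliefs given as (mean, variance)."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    return cdf((mu_a - mu_b) / math.sqrt(2.0 * beta ** 2 + var_a + var_b))


def best_opponent(player, candidates):
    """Name of the candidate whose predicted outcome is closest to 50/50."""
    return min(
        candidates,
        key=lambda name: abs(win_probability(player, candidates[name]) - 0.5),
    )


candidates = {"novice": (10.0, 4.0), "peer": (26.0, 4.0), "expert": (40.0, 4.0)}
choice = best_opponent((25.0, 4.0), candidates)
```

A match with a near 50% predicted outcome is also the one whose result carries the most information about the players' skills, which is the equivalence the slide points out.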
TrueSkill in Xbox 360 and Halo 3
Xbox 360 Live
• Launched September 2005
• Every Xbox 360 game uses TrueSkill™
• > 6 million players
• > 1 million matches per day
• > 2 billion hours of accumulated game-play
Halo 3
• Launched on 25th September 2007
• Largest entertainment launch in history
• > 500,000 players concurrently playing
• Reference implementation of TrueSkill
AdPredictor
Joint work with Joaquin Quiñonero Candela, Onno Zoeter, Tom Borchert, Phillip Trelford
Why Predict Probability-of-Click?
• Display (according to expected revenue): bid * P(click) = expected revenue
• Charge (per click)

Bid      P(click)   Expected revenue   Charge per click
$1.00    10%        $0.10              $0.80
$2.00    4%         $0.08              $1.25
$0.10    50%        $0.05              $0.05

• Advantages of improved probability estimates:
– Increase user satisfaction by better targeting
– Fairer charges to advertisers
– Increase revenue by showing ads with high click-thru rate
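The numbers for the three example ads on this slide are consistent with ranking by bid times click probability and charging each ad the smallest amount per click that keeps it ahead of the next one. The sketch below reproduces them under that assumption. The pricing rule is inferred from the figures rather than stated in the talk, and the `min_price` applied to the last ad is a hypothetical parameter chosen to match the slide.

```python
def rank_and_price(ads, min_price=0.05):
    """Rank ads by expected revenue and compute a per-click charge.

    Each ad is a dict with 'name', 'bid' and 'p_click'. The charge is the
    next ad's expected revenue divided by this ad's click probability
    (assumed rule); the last ad pays the assumed minimum price.
    Returns a list of (name, expected_revenue, charge) tuples.
    """
    ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["p_click"], reverse=True)
    results = []
    for i, ad in enumerate(ranked):
        expected = ad["bid"] * ad["p_click"]
        if i + 1 < len(ranked):
            runner_up = ranked[i + 1]
            charge = runner_up["bid"] * runner_up["p_click"] / ad["p_click"]
        else:
            charge = min_price
        results.append((ad["name"], expected, charge))
    return results


ads = [
    {"name": "A", "bid": 1.00, "p_click": 0.10},
    {"name": "B", "bid": 2.00, "p_click": 0.04},
    {"name": "C", "bid": 0.10, "p_click": 0.50},
]
results = rank_and_price(ads)
```

Because both the ranking and the charge depend on the predicted click probability, a miscalibrated predictor directly distorts what advertisers pay, which is why fairness to advertisers is listed as an advantage of better estimates.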
adPredictor Details
[Figure: sparse input features Client IP (102.34.12.201, 15.70.165.9, 221.98.2.187, 92.154.3.86), Match Type (Exact Match, Broad Match) and Position (ML-1, SB-1, SB-2), combined by a sum (+) into P(pClick)]
Training Algorithm in Action
[Figure: weights w1 and w2 summed (+) into a score s, which determines the outcome c (Click / No Click)]
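A minimal sketch of this training loop, following the published adPredictor update for a Bayesian probit model with one Gaussian weight per active feature value. The noise parameter `BETA`, the unit prior on every weight, and the feature strings are assumptions for illustration. Each weight belief is stored as (mean, variance).

```python
import math

BETA = 1.0  # assumed noise standard deviation of the probit link


def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def predict(weights, active):
    """Predicted click probability for the list of active feature values."""
    total_mean = sum(weights[f][0] for f in active)
    total_var = BETA ** 2 + sum(weights[f][1] for f in active)
    return cdf(total_mean / math.sqrt(total_var))


def update(weights, active, clicked):
    """Update the beliefs of the active weights in place after one impression."""
    y = 1.0 if clicked else -1.0
    total_mean = sum(weights[f][0] for f in active)
    total_var = BETA ** 2 + sum(weights[f][1] for f in active)
    s = math.sqrt(total_var)
    t = y * total_mean / s
    v = pdf(t) / cdf(t)  # mean correction
    w = v * (v + t)      # variance correction
    for f in active:
        mean, var = weights[f]
        weights[f] = (mean + y * var / s * v, var * (1.0 - var / total_var * w))


impression = ["ClientIP=15.70.165.9", "MatchType=Exact Match", "Position=ML-1"]
weights = {f: (0.0, 1.0) for f in impression}
before = predict(weights, impression)
update(weights, impression, clicked=True)
after = predict(weights, impression)
```

Weights with large variance move the most, so rarely seen feature values adapt quickly while well-established ones stay stable.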
Client IP: Mean & Variance
UserAgent: Mean Posterior Effects
AdPredictor in Bing Search Engine
• AdPredictor now serves 100% of Paid Search traffic in Microsoft’s Bing Search Engine
• Relevance and Click-Through Rate of Ads
improved
• Calibrated CTR prediction provides solid
foundation for further improvements
• AdPredictor explored for other tasks such as
contextual and display advertising
Matchbox
Joint work with David Stern and Ralf Herbrich
Collaborative Filtering
[Figure: rating matrix of Users (A to D) against Items (1 to 6) with unknown entries (?) to be predicted; caption: Metadata?]
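Map Sparse Features To ‘Trait’ Space
[Figure: sparse user features (User ID: 234566, 456457, 13456, 654777; Gender: Male, Female; Country: UK, USA; Height: 1.2m) and sparse item features (Item ID: 34, 345, 64, 5474; Movie Genre: Horror, Drama, Comedy, Documentary) mapped into a common trait space]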
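The mapping can be sketched as a lookup: every active feature value selects a row of a weight matrix, the selected rows are summed into a trait vector, and the affinity between a user and an item is the inner product of their trait vectors. The sketch uses point estimates and made-up numbers for readability, whereas Matchbox maintains Gaussian beliefs over these weights. All names and values are hypothetical, and every feature is treated as a binary indicator.

```python
def to_traits(active_features, trait_weights, k):
    """Sum the k-dimensional weight rows selected by the active features."""
    traits = [0.0] * k
    for feature in active_features:
        for j in range(k):
            traits[j] += trait_weights[feature][j]
    return traits


def affinity(user_features, item_features, user_weights, item_weights, k=2):
    """Inner product of the user and item trait vectors."""
    s = to_traits(user_features, user_weights, k)
    t = to_traits(item_features, item_weights, k)
    return sum(a * b for a, b in zip(s, t))


# Hypothetical weights for a two-dimensional trait space.
user_weights = {"UserID=234566": [1.0, 0.0], "Gender=Male": [0.5, 0.5]}
item_weights = {"ItemID=34": [1.0, -1.0], "Genre=Horror": [0.0, 2.0]}

score = affinity(
    ["UserID=234566", "Gender=Male"],
    ["ItemID=34", "Genre=Horror"],
    user_weights,
    item_weights,
)
```

Because metadata features such as gender or genre share weights across users and items, a new user or item with no rating history still gets a usable trait vector.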
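Message Passing For Matchbox
[Figure: factor graph with user weights u01 to u22 and item weights v11 to v22 summed (+) into user traits s1, s2 and item traits t1, t2, multiplied (*) pairwise and summed (+) into the rating r]
Message update functions powered by Infer.NET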
User/Item Taste Space
[Figure: two-dimensional taste space, both axes from -1.5 to 1.5, showing Users and Movies; labelled movies: Adaptation, 24: Season 3, 24: Season 2, A Clockwork Orange, A Knights Tale, AI: Artificial Intelligence, A Cinderella Story; the ‘Preference Cone’ for user 145035 is marked]
Applications
• Ranking of content on web portals
• Online advertising (Display and Paid Search)
• Personalised web search
• Algorithm portfolio management
• Tweet/News recommendation
• Friends recommendation on social platforms
Conclusions
Conclusions
• Great variety of data sources and tasks
• Challenges: privacy, incentives, exploration
• Tools: SQL, No-SQL, HPC
• Modelling platform (Factor Graphs & PQL):
– Represent uncertainty
– Composable models
– Distributed, data-centric computation
• Applications: TrueSkill, AdPredictor, Matchbox
• Thanks!