Data Science: Statistics in the Wild
Jin Kim
Hi, I’m Jin!
My professional mission is to help people and organizations
understand and improve themselves using the power of data.
Ph.D. in Computer Science (Information Retrieval), 2012
Applied Scientist at Microsoft Bing
Agenda
• Talk about myths & truths about data science
• Introduce a few projects representative of Data Science in industry
• Compare experience in academia (CS Ph.D.) vs. industry
Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after analysis is done
Can you guess which one is true?
Truth: You won’t need big data most of the time
Myth: You need ‘big’ data to do anything interesting
Burdens of Big Data
• Big data is costly to collect and store
• Big data slows down iteration
• Big data is useful only if:
  • You’re trying to build a data product (e.g., a search engine)
  • You’re dealing with very noisy measurements (e.g., A/B testing)
  • You’re interested in identifying the exceptions (outliers)
Even then, start with small data!
Determining how much data you need
• Exploratory analysis
  • Do we have enough coverage for all edge cases (e.g., outliers)?
• Statistical inference
  • Is our confidence interval narrow enough?
  • Do we have enough statistical power to validate our hypotheses?
• Predictive analysis
  • Do we have enough data to train/evaluate our model?
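As a rough illustration of the "is our confidence interval narrow enough?" question, the sketch below estimates how many observations are needed for a confidence interval on a mean to reach a target half-width. It assumes a normal approximation and a known standard deviation; the function name and the example numbers are hypothetical, not from the talk.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma: float, half_width: float, confidence: float = 0.95) -> int:
    """Smallest n whose normal-approximation CI for a mean has the given half-width."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Half-width = z * sigma / sqrt(n), solved for n and rounded up.
    return math.ceil((z * sigma / half_width) ** 2)


# Hypothetical metric with standard deviation 1.0 and a target half-width of 0.1.
n_needed = sample_size_for_mean(sigma=1.0, half_width=0.1)
print(n_needed)
```

Halving the target half-width roughly quadruples the required sample, which is one reason to start with small data and scale up only when the interval is too wide.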
Truth: Basic skills (e.g., SQL) get you pretty far
Myth: You need to be a hard-core programmer
Data Science Tool Usage Survey (2014/O’Reilly)
• Still dominated by simple tools…
Choosing Tools for Data Science
[Figure: tools placed along two axes, End-user to Developer and Small Data to Big Data: Excel, RDBMS / SQL, R, Python, Hadoop]
Chaining Tools for Data Science
• Use the right toolset in different stages
Data Preparation -> Exploratory Analysis -> Inference / Prediction -> Solution Implementation -> Results Communication
[Figure: tools placed under the stages; in extraction order: Hadoop, Excel, Python, Python, Excel, RDBMS / SQL, R, R, Custom Code]
Modern R is no more difficult than SQL
• Enter Hadleyverse: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
What matters more is the ability to choose and learn the right tools and methods…
Truth: You spend most of your time preparing data
Myth: You spend most of your time analyzing data
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
Things can go wrong at many different levels…
• Inherent noise / bias in data
• The process of collecting the data (instrumentation)
• The process of processing the data
• Interpretation of processed data
•…
How to share data with a statistician:
https://github.com/jtleek/datasharing
Make sure you check for quality issues!
Completeness
• Is the data representative of the problem space?
• Any missing observations / attributes?
Fidelity
• Do the measurements capture the reality?
• Any issues of bias or variance?
Consistency
• Do values follow the data types specified?
• Do different attributes agree with each other?
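Some of these checks can be automated. The sketch below counts rows that fail simple completeness and consistency checks; the schema, the field names, and the "clicks cannot exceed impressions" rule are hypothetical examples. Fidelity questions (does the measurement capture reality?) generally need domain knowledge and are not covered here.

```python
from typing import Any

# Hypothetical schema: attribute name -> expected Python type.
SCHEMA = {"user_id": str, "clicks": int, "impressions": int}


def quality_report(rows: list[dict[str, Any]]) -> dict[str, int]:
    """Count rows that fail simple completeness and consistency checks."""
    report = {"missing_value": 0, "wrong_type": 0, "clicks_exceed_impressions": 0}
    for row in rows:
        # Completeness: every attribute in the schema should be present.
        if any(row.get(name) is None for name in SCHEMA):
            report["missing_value"] += 1
            continue
        # Consistency: values should follow the specified data types.
        if any(not isinstance(row[name], kind) for name, kind in SCHEMA.items()):
            report["wrong_type"] += 1
            continue
        # Consistency: attributes should agree with each other.
        if row["clicks"] > row["impressions"]:
            report["clicks_exceed_impressions"] += 1
    return report


rows = [
    {"user_id": "u1", "clicks": 2, "impressions": 10},
    {"user_id": "u2", "clicks": None, "impressions": 5},  # missing value
    {"user_id": "u3", "clicks": "3", "impressions": 8},   # wrong type
    {"user_id": "u4", "clicks": 9, "impressions": 4},     # attributes disagree
]
report = quality_report(rows)
print(report)
```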
Truth: You need to communicate throughout the process
Myth: You can communicate results after analysis is done
Imagine you’re in a jungle with complete strangers
Why is communication so critical for solving a data problem?
• You are seldom given a clear-cut problem (hence the data problem)
• The team is composed of people with different expertise / style
• No one has complete information about the problem / solution space
• You often need to change course multiple times along the way
Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after analysis is done
All of these are myths!
Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Alex Deng, Ya Xu, Ron Kohavi, Toby Walker
Background: Online (A/B) Experiment
• Randomly split the traffic into two groups
• Make sure your split is actually random (pre-A/A test)
• Ideally, no difference in the metric values
• Apply treatment to one group (A/B test)
• Now, any difference beyond random noise is due to the treatment
• Use the two-sample t-test for comparison
Group1     Group2
U1: 0.5    U4: 0.2
U2: 0.4    U5: 0.7
U3: 0.1    U6: 0.4
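To make the comparison concrete, here is a minimal sketch that computes a two-sample t-statistic for the six example users shown on this slide. The unequal-variance (Welch) form is an assumption; the slide does not say which variant is used.

```python
from math import sqrt
from statistics import mean, variance

group1 = [0.5, 0.4, 0.1]  # U1, U2, U3
group2 = [0.2, 0.7, 0.4]  # U4, U5, U6


def t_statistic(a: list[float], b: list[float]) -> float:
    """Unequal-variance two-sample t-statistic for mean(b) - mean(a)."""
    delta = mean(b) - mean(a)
    # Variance of the difference of two independent sample means.
    var_delta = variance(a) / len(a) + variance(b) / len(b)
    return delta / sqrt(var_delta)


t_stat = t_statistic(group1, group2)
print(round(t_stat, 3))
```

With only three users per group the statistic is small relative to the usual threshold of about 2, so a difference of this size would not be distinguishable from noise.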
Goal: Higher Sensitivity in A/B Experiments!
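• T-stat = $\Delta / \sqrt{\mathrm{Var}(\Delta)}$, where $\Delta$ is the difference in the metric between the two groups
• Reduce $\mathrm{Var}(\Delta)$ => improve sensitivity
• For a given $\Delta$, $\mathrm{Var}(\Delta)$ can only be reduced by increasing traffic
• Goal:
Find another $\Delta^*$ such that $E(\Delta^*) = E(\Delta)$, but $\mathrm{Var}(\Delta^*) < \mathrm{Var}(\Delta)$
• Breakdown of $\mathrm{Var}(\Delta)$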
Main idea
• Analysis of Variance:
total variance = between-group variance + within-group variance
• $\mathrm{Var}(Y) = E(\mathrm{Var}(Y \mid X)) + \mathrm{Var}(E(Y \mid X))$
where the second term, $\mathrm{Var}(E(Y \mid X))$, is the variance explained by X
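The decomposition can be checked numerically. The sketch below uses a small, made-up dataset with a discrete grouping variable X and confirms that the within-group and between-group terms add up to the total (population) variance.

```python
from collections import defaultdict
from statistics import fmean, pvariance

# Hypothetical (x, y) observations; x is a discrete grouping variable.
data = [("a", 1.0), ("a", 3.0), ("b", 4.0), ("b", 8.0), ("b", 6.0)]

groups: dict[str, list[float]] = defaultdict(list)
for x, y in data:
    groups[x].append(y)

n = len(data)
ys = [y for _, y in data]
total = pvariance(ys)

# E(Var(Y | X)): group variances averaged with weights P(X = x).
within = sum(len(g) / n * pvariance(g) for g in groups.values())

# Var(E(Y | X)): spread of the group means around the overall mean.
overall_mean = fmean(ys)
between = sum(len(g) / n * (fmean(g) - overall_mean) ** 2 for g in groups.values())

print(total, within + between)
```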
Stratification for Variance Reduction
• Assuming we can find reasonable strata of size K:
Beyond Discrete Strata: Control Covariate
• Stratification is a special case when X is discrete
• We can use a control covariate for continuous X:
$Y^* := Y - \theta X + \theta E(X) = Y - \theta (X - E(X))$
• Optimal choice of $\theta$ is $\mathrm{Cov}(Y, X) / \mathrm{Var}(X)$
• With this $\theta$, $\mathrm{Var}(Y^*) = \mathrm{Var}(Y) \times (1 - \mathrm{cor}(Y, X)^2)$
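A minimal sketch of the control-covariate adjustment on simulated data follows. The simulation parameters and helper names are assumptions made for illustration; X plays the role of a covariate that is correlated with the metric Y.

```python
import random
from statistics import fmean


def cov(a: list[float], b: list[float]) -> float:
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def control_covariate_adjust(y: list[float], x: list[float]) -> list[float]:
    """Return Y* = Y - theta * (X - E(X)) with theta = Cov(Y, X) / Var(X)."""
    theta = cov(y, x) / cov(x, x)
    x_mean = fmean(x)
    return [yi - theta * (xi - x_mean) for yi, xi in zip(y, x)]


# Simulated users (an assumption): y depends on x plus independent noise.
rng = random.Random(0)
x = [rng.gauss(0.4, 0.2) for _ in range(1000)]
y = [0.8 * xi + rng.gauss(0.1, 0.1) for xi in x]

y_star = control_covariate_adjust(y, x)
corr_sq = cov(y, x) ** 2 / (cov(x, x) * cov(y, y))

print("mean:", fmean(y), fmean(y_star))
print("variance:", cov(y, y), cov(y_star, y_star))
print("predicted variance:", cov(y, y) * (1 - corr_sq))
```

The adjusted values keep the same mean as Y while their variance shrinks by the factor (1 - cor(Y, X)^2). In an actual experiment one would typically estimate θ once and apply the same adjustment to both groups before comparing them.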
Practical Issues
How to choose covariates?
• In practice, we found that using pre-experiment data of the same metric usually works out quite well
• This is because, for the same subject/user, the same metric usually has strong temporal correlation
[Figure: per-user metric values for the same two groups in two periods, labeled "Pre-AA Test" ("No Difference!") and "A/B Test" ("Treatment Effect")]
Group1     Group2
U1: 0.5    U4: 0.2
U2: 0.4    U5: 0.7
U3: 0.1    U6: 0.4

Group1     Group2
U1: 0.7    U4: 0.2
U2: 0.5    U5: 0.6
U3: 0.2    U6: 0.5
Empirical Results with CUPED (Controlled-experiment Using Pre-Experiment Data)
Delay-page-load experiment in Bing
Faster with CUPED
• Statistically significant from day 1!
Fewer Users with CUPED
• Better results with only half the users!
(Metric: CTR)
Impact of pre-experiment period length
Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale
David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert
Problem Setting
Controls in Natural Experiment
• Controls: those who could have been exposed, but weren’t
• visited the publisher site, saw other display ads, met targeting conditions
Solution: Match Control & Treatment Users
• But exact matching wouldn’t work because there are so many dimensions…
Removing the bias in the controls by reweighting
• Each exposed user gets weight one
• Each control user gets weight p(x) / (1 - p(x)), where p(x) is the probability of exposure for a user with features x
Now estimate the campaign effect:
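The slide's own formula is not reproduced in this text. As a hedged illustration, one common estimate that uses these weights is the mean outcome of exposed users minus the weighted mean outcome of controls. In the sketch below the user segments, exposure probabilities, and outcomes are all hypothetical, and p(x) is taken as known rather than estimated.

```python
from dataclasses import dataclass


@dataclass
class User:
    segment: str    # hypothetical user feature x
    exposed: bool   # True if the user saw the campaign
    outcome: float  # e.g. number of conversions


# Hypothetical exposure probabilities p(x); in practice these would be
# estimated from data, for example with a logistic regression.
PROPENSITY = {"light": 0.2, "heavy": 0.8}


def campaign_effect(users: list[User]) -> float:
    """Mean outcome of the exposed minus the reweighted mean of controls."""
    exposed = [u.outcome for u in users if u.exposed]
    controls = [u for u in users if not u.exposed]
    # Each control gets weight p(x) / (1 - p(x)); each exposed user gets weight one.
    weights = [PROPENSITY[u.segment] / (1 - PROPENSITY[u.segment]) for u in controls]
    weighted_control_mean = (
        sum(w * u.outcome for w, u in zip(weights, controls)) / sum(weights)
    )
    return sum(exposed) / len(exposed) - weighted_control_mean


# Outcomes are built as a baseline (1.0 light, 3.0 heavy) plus 0.5 if exposed.
users = (
    [User("light", True, 1.5)] * 2 + [User("light", False, 1.0)] * 8
    + [User("heavy", True, 3.5)] * 8 + [User("heavy", False, 3.0)] * 2
)
naive = (
    sum(u.outcome for u in users if u.exposed) / 10
    - sum(u.outcome for u in users if not u.exposed) / 10
)
print(naive, campaign_effect(users))
```

In this toy data heavy users are both more likely to be exposed and have higher baseline outcomes, so the unweighted difference overstates the effect, while the reweighted controls recover the 0.5 that was built into the outcomes.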
Summary: Data Science in Industry
• Need to be aware of inherent variability / bias in data
• Statistical techniques are useful to mitigate these issues
• There is nothing more practical than a good theory!
Working in Academia vs. Industry
Other lessons I learned in industry…
• You need to learn new things all the time
• Multitasking is not a choice, but a must
• Need to ‘sell’ your ideas and results
• Hard science + Soft skill = Rockstar
Optional
Doubly Robust Estimate
Results