Data Science: Statistics in the Wild
Jin Kim

Hi, I'm Jin!
My professional mission is to help people and organizations understand and improve themselves using the power of data.
Ph.D. in Computer Science (Information Retrieval), 2012
Applied Scientist at Microsoft Bing

Agenda
• Talk about myths & truths about data science
• Introduce a few projects representative of data science in industry
• Compare experience in academia (CS Ph.D.) vs. industry

Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after analysis is done
Can you guess which one is true?

Truth: You won't need big data most of the time
Myth: You need 'big' data to do anything interesting

Burdens of Big Data
• Big data is costly to collect and store
• Big data slows down the iteration
• Big data is useful only if:
  • You're trying to build a data product (e.g., a search engine)
  • You're dealing with very noisy measurements (e.g., A/B testing)
  • You're interested in identifying the exceptions (outliers)
Even then, start with small data!

Determining how much data you need
• Exploratory analysis
  • Do we have enough coverage for all edge cases (e.g., outliers)?
• Statistical inference
  • Is our confidence interval narrow enough?
  • Do we have enough statistical power to validate our hypotheses?
• Predictive analysis
  • Do we have enough data to train/evaluate our model?
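The two statistical-inference questions above can be turned into quick back-of-the-envelope numbers. The sketch below is only an illustration, not part of the original talk: it assumes a two-sided, two-sample comparison of means with equal group sizes, a known common standard deviation, and the normal approximation. The function names `sample_size_per_group` and `ci_half_width` are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate observations needed per group to detect a mean
    difference `delta` (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


def ci_half_width(sigma, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sigma / math.sqrt(n)


# Halving the effect size you want to detect roughly quadruples the data needed.
n_large_effect = sample_size_per_group(delta=0.5, sigma=1.0)
n_small_effect = sample_size_per_group(delta=0.1, sigma=1.0)
```

Under these assumptions the required sample grows with the square of sigma / delta, which is why noisy metrics with small expected effects (as in A/B testing) are among the cases where more data is needed.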
Truth: Basic skills (e.g., SQL) get you pretty far
Myth: You need to be a hard-core programmer

Data Science Tool Usage Survey (2014 / O'Reilly)
• Still dominated by simple tools…

Choosing Tools for Data Science
[Figure: Excel, RDBMS / SQL, R, Python, and Hadoop placed along two axes, End-user to Developer and Small Data to Big Data]

Chaining Tools for Data Science
• Use the right toolset in different stages
Stages: Data Preparation | Exploratory Analysis | Inference / Prediction | Solution Implementation | Results Communication
Tools:  Hadoop | Excel | Python | Python | Excel
        RDBMS / SQL | R | R | Custom Code

Modern R is no more difficult than SQL
• Enter Hadleyverse: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
What matters more is the ability to choose and learn the right tools and methods…

Truth: You spend most of your time preparing data
Myth: You spend most of your time analyzing data
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

Things can go wrong at many different levels…
• Inherent noise / bias in data
• The process of collecting the data (instrumentation)
• The process of processing the data
• Interpretation of processed data
• …
How to share data with a statistician: https://github.com/jtleek/datasharing

Make sure you check for quality issues!
Completeness
• Is the data representative of the problem space?
• Any missing observations / attributes?
Fidelity
• Do the measurements capture the reality?
• Any issues of bias or variance?
Consistency
• Do values follow the specified data types?
• Do different attributes agree with each other?

Truth: You need to communicate throughout the process
Myth: You can communicate results after analysis is done
Imagine you're in a jungle with complete strangers

Why is communication so critical for solving a data problem?
• You are seldom given a clear-cut problem (hence the data problem)
• The team is composed of people with different expertise / styles
• No one has complete information about the problem / solution space
• You often need to change course multiple times along the way

Myths & Truths about Data Science in Industry
1. You need big data to do anything interesting
2. You spend most of your time analyzing & building models
3. You need to be a hard-core programmer to be successful
4. You can communicate results after analysis is done
All these are myths!

Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data
Alex Deng, Ya Xu, Ron Kohavi, Toby Walker

Background: Online (A/B) Experiment
• Randomly split the traffic into two groups
• Make sure your split is actually random (pre-A/A test)
• Ideally, no difference in the metric values
• Apply treatment to one group (A/B test)
• Now, the difference is purely due to the treatment
• Use the two-sample t-test for comparison

Group1     Group2
U1: 0.5    U4: 0.2
U2: 0.4    U5: 0.7
U3: 0.1    U6: 0.4

Goal: Higher Sensitivity in A/B Experiments!
• T-stat = $\Delta / \sqrt{\mathrm{Var}(\Delta)}$, where $\Delta$ is the observed difference in the metric between the two groups
• Reduce $\mathrm{Var}(\Delta)$ => improve sensitivity
• $\mathrm{Var}(\Delta)$ can only be reduced by increasing traffic
• Goal: Find another $\Delta^*$ such that $E(\Delta^*) = E(\Delta)$, but $\mathrm{Var}(\Delta^*) < \mathrm{Var}(\Delta)$
• Breakdown of $\mathrm{Var}(\Delta)$

Main idea
• Analysis of variance: total variance = between-group variance + within-group variance
• $\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(E[Y \mid X])$, where $\mathrm{Var}(E[Y \mid X])$ is the variance explained by $X$

Stratification for Variance Reduction
• Assuming we can find reasonable strata of size K:

Beyond Discrete Strata: Control Covariate
• Stratification is a special case when $X$ is discrete
• We can use a control covariate for continuous $X$: $Y^* := Y - \theta X + \theta E[X] = Y - \theta (X - E[X])$
• The optimal choice of $\theta$ is $\mathrm{Cov}(Y, X) / \mathrm{Var}(X)$
• $\mathrm{Var}(Y^*) = \mathrm{Var}(Y) \times (1 - \mathrm{Cor}(Y, X)^2)$

Practical Issues
How to choose covariates?
• In practice, we found that using pre-experiment data of the same metric usually works out quite well
• This is because, for the same subject/user, the same metric usually has strong temporal correlation

Group1     Group2
U1: 0.5    U4: 0.2
U2: 0.4    U5: 0.7
U3: 0.1    U6: 0.4
A/B Test (Treatment Effect)

Group1     Group2
U1: 0.7    U4: 0.2
U2: 0.5    U5: 0.6
U3: 0.2    U6: 0.5
Pre-AA Test (No Difference!)

Empirical Results with CUPED
Delay-page-load experiment in Bing

Faster with CUPED
• Statistically significant from day 1!

Fewer Users with CUPED
• Better results with only half the users! (Metric: CTR)

Impact of pre-experiment period length

Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale
David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert

Problem Setting

Controls in Natural Experiment
• Controls: those who could have been exposed, but weren't
  • visited the publisher site, saw other display ads, met targeting conditions

Solution: Match Control & Treatment Users
• But exact matching wouldn't work because there are so many dimensions…

Removing the bias in the controls by reweighting
• Each exposed user gets weight one
• Each control user gets weight p(x) / (1 - p(x)), where p(x) is the probability that a user with attributes x is exposed
Now estimate the campaign effect:

Summary: Data Science in Industry
• Need to be aware of inherent variability / bias in data
• Statistical techniques are useful to mitigate these issues
• There is nothing more practical than a good theory!

Working in Academia vs. Industry

Other lessons I learned in industry…
• You need to learn new things all the time
• Multitasking is not a choice, but a must
• Need to 'sell' your ideas and results
• Hard science + soft skills = Rockstar

Optional
Doubly Robust Estimate
Results
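
The control-covariate adjustment from the CUPED slides (subtract θ times the centered pre-experiment value, with θ = Cov(Y, X) / Var(X)) can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' implementation: the function name `cuped_adjust`, the simulated metric values, and the choice to adjust one pooled set of users are all assumptions made for the example. In an actual A/B test one would presumably estimate θ on pooled data and apply the same adjustment to both groups before comparing means.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return (adjusted values, theta), where each adjusted value is
    y_i - theta * (x_i - mean(x)) and theta = Cov(y, x) / Var(x)."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_yx = fmean([(yi - mean_y) * (xi - mean_x) for yi, xi in zip(y, x)])
    theta = cov_yx / pvariance(x, mean_x)
    adjusted = [yi - theta * (xi - mean_x) for yi, xi in zip(y, x)]
    return adjusted, theta


# Synthetic data (assumption): each user's metric during the experiment is
# correlated with the same user's metric in the pre-experiment period.
rng = random.Random(0)
pre = [rng.gauss(0.4, 0.2) for _ in range(1000)]
post = [0.8 * xi + rng.gauss(0.1, 0.1) for xi in pre]

adjusted, theta = cuped_adjust(post, pre)
var_before = pvariance(post)     # variance of the raw metric
var_after = pvariance(adjusted)  # variance after the covariate adjustment
```

The adjustment leaves the mean of the metric unchanged while removing the part of its variance that the pre-experiment covariate explains, so the stronger the correlation between the two periods, the larger the gain in sensitivity.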