Managing Technical Talent: How to Find the Right Analyst for Your Problem
Presentation to the Wolfram Data Summit
Washington DC, Friday, Sept 09, 2011
nicholas.gruen@kaggle.com
@nicholasgruen

Different users - different techniques.
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modelling

“A discovery is ... an accident meeting a prepared mind.”
Albert Szent-Gyorgyi, 1937 Nobel Prize for Medicine

‣ Is the crown pure gold?
‣ We know its weight.
‣ How to measure its volume?
Eureka!

Finding the world’s most perfectly prepared mind

Our User Base

Competition Mechanics
Competitions are judged on objective criteria

How Kaggle Works
Users create predictive models, submit these to Kaggle, and are scored on their accuracy.

Competitions are judged based on predictive accuracy
Which HIV patients will be sicker next week?
Genetic marker 1 + Genetic marker 2 + Genetic marker 3 + Genetic marker 4

Scouring the world for the best analysts for a problem.
Competitions: Grant Forecasting, Traffic flow, Chess Ratings, Stock Prices, HIV Load
Map labels:
• Edmund & Adrian, London & USA
• Dr. Derek Gatherer, UK
• Tim Salimans, Erasmus U R’dam
• Philipp Emanuel Widmann, Heidelberg, DE
• Jure Zbontar, Ljubljana
• Gzegorz Swiszcz
• Giuseppe Ragusa, Rome
• Ivan, Russian Federation
• Robert, Warsaw
• Chih-Li Sung & Roy Tseng, Penghu & Taipei
• Uri Blass, Tel-Aviv
• Jeremy Howard
• Emir Delic, Australia
• Thomas Mahony, Canberra
• Glen Maher, Canberra
• Dr. Christopher Hefele, New York
• Chris Raimondi, Baltimore
• Cole Harris, Texas
• Chris DuBois
• Claudio Perlich
• John Blatz
• Jason Trigg
• Rajstennaj Barrabas
• Lee Baker, Las Cruces, NM
• Nan Zhou, Pittsburgh
Other location labels on the map: Gera, Portland, USA, Pennsylvania

Global competitions

Predicting HIV progression
• State of the art: 70%
• 1½ weeks: 70.8%
• Competition closes: 77%

Where’s Wally?
Scouring the world for the best analysts for a problem.

Martin O’Leary
“In less than a week … a PhD student in glaciology outperformed the state-of-the-art algorithms”

We could not be happier with the result.
The Kaggle approach has set a new benchmark in Government for the development of successful predictive models, delivered quickly and very cost effectively. In particular, the flexibility of the winning predictive model will enable its application to other major transport routes to the CBD and allow for the addition of other factors such as weather and incident.
Susan Calvert
Director, Strategy and Project Delivery Unit
Department of Premier and Cabinet

A Few Kaggle Projects
• Take historical medical claims and predict who will go to hospital. This competition has a $3 million prize.
• Predict which editors will stop contributing
• Detect driver drowsiness
• New algorithm for chess ratings. Has wide gaming and ranking significance
• Predict the likelihood of claims given different vehicle models
• Predict successful grant applications
• Predict shoppers’ next visit to supermarket

User base: 14,107 registered data scientists

Kaggle Competition Results
This competition (to forecast tourism demand) used one of the most heavily studied sets of time series data. It had previously been modeled using the leading commercial software and academic algorithms. Competitors quickly surpassed the world’s best practice and found the frontier of what’s possible.

Kaggle Competition Results
[Chart: Forecast Error (MASE) over time. Time axis: Aug 9, 2 weeks later, 1 month later, Competition End. Annotations: "Combination of world’s best models"; "Frontier reached after all information is extracted from the dataset".]

Where’s Wally?
Scouring the world for the best analysts for a problem.

Jeremy Howard

From generating value => Making money
1. Open Comps: Unleashing the power of Crowdsourcing
   $ Commission, consulting and performance fees
2. Consulting partnerships
   $ revenue share
3. The platform as marketplace for technical talent
   $ revenue share

Our market
Business analytics = $107 bil market
Outsourced business analytics = $38b [IDC]

Private Sector
• Sales forecasts
• Credit scoring
• Stock picking
• Risk modelling and pricing
• Identifying fraud
• Identifying best practice
• Production management
• Inventory management
• Logistic optimisation

Public and third sector
• Revenue forecasts
• Traffic forecasting
• Energy demand
• Predicting crime
• Tax/social security fraud
• Hospital casualty demand
• Identifying great
  • Teachers
  • Hospitals

First mover advantages of internet platforms
Clients
Analysts

Kaggle not for profit
Kaggle public good competitions

“I keep saying the sexy job in the next ten years will be statisticians.”
Hal Varian, Google Chief Economist, 2009

“No matter who you are, most of the smartest people work for someone else.”
Bill Joy, Founder, Sun Microsystems, 2009

Wally Photos by William Murphy (Flickr: infomatique)

Transforming the inefficient market for technical talent into the world’s largest meritocracy.
nicholas.gruen@kaggle.com
@nicholasgruen

Who We Are

Anthony Goldbloom, CEO / Founder
• Econometrician @ the Australian Treasury & Reserve Bank of Australia
• Journalism Intern @ The Economist

Jeremy Howard, Chief Scientist
• McKinsey and A.T. Kearney alumnus
• Founder of 2 successful startups: FastMail (exit to Opera) and Optimal Decisions Group (exit to Choicepoint)

Nicholas Gruen, Chairman
• Chairman of the Australian Gov. 2.0 taskforce

Jeff Moser, CTO
• Developer @ Raytheon and widely read blogger