Data Prediction Competitions

Managing Technical Talent: How to Find the
Right Analyst for Your Problem
Presentation to the Wolfram Data Summit
Washington DC, Friday, Sept 09, 2011
nicholas.gruen@kaggle.com @nicholasgruen
Different users - different techniques.
• neural networks
• logistic regression
• support vector machine
• decision trees
• ensemble methods
• AdaBoost
• Bayesian networks
• genetic algorithms
• random forest
• Monte Carlo methods
• principal component analysis
• Kalman filter
• evolutionary fuzzy modelling
“A discovery is ... an accident meeting a prepared mind.”
Albert Szent-Györgyi, 1937 Nobel Prize in Physiology or Medicine
‣ Is the crown pure gold?
‣ We know its weight.
‣ How to measure its volume?
Eureka!
Finding the world’s most perfectly prepared mind
Our User Base
Competition Mechanics
Competitions are judged on objective criteria
How Kaggle Works
1. Users create predictive models,
2. submit these to Kaggle,
3. and are scored on their accuracy.
Competitions are judged based on predictive accuracy
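The scoring step above can be sketched in a few lines. This is a minimal illustration only, assuming a classification task scored by plain accuracy against a held-out answer key; the function names `accuracy` and `leaderboard` and the sample data are hypothetical and are not Kaggle's actual scoring code.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the fraction of predictions that match the held-out answers."""
    if len(predicted) != len(actual):
        raise ValueError("submission and answer key must have the same length")
    if len(actual) == 0:
        raise ValueError("answer key is empty")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def leaderboard(
    submissions: Dict[str, Sequence[int]], actual: Sequence[int]
) -> List[Tuple[str, float]]:
    """Score every submission and rank competitors, best score first."""
    scores = [(name, accuracy(pred, actual)) for name, pred in submissions.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical answer key and two hypothetical submissions.
    answer_key = [1, 0, 1, 1]
    entries = {"team_a": [1, 0, 1, 1], "team_b": [1, 1, 1, 0]}
    for name, score in leaderboard(entries, answer_key):
        print(name, score)
```

Because the answer key is hidden from competitors and the metric is fixed in advance, the ranking depends only on what the model predicts, not on who built it.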
Which HIV patients will be sicker next week?
Genetic marker 1 + Genetic marker 2 + Genetic marker 3 + Genetic marker 4
Scouring the world for the best analysts for a problem.
[Map of leading competitors by location. Competitions: Grant Forecasting, Traffic flow, Chess Ratings, Stock Prices, HIV Load]
• Edmund & Adrian, London & USA
• Dr. Derek Gatherer, UK
• Tim Salimans, Erasmus U
• Uri Blass, Tel-Aviv
• Ivan, Russian Federation
• Chih-Li Sung & Roy Tseng, Penghu & Taipei
• Emir Delic, Australia
• Thomas Mahony, Canberra
• Glen Maher, Canberra
• Dr. Christopher Hefele, New York
• Chris Raimondi, Baltimore
• Also named on the map: Philipp Emanuel Widmann, Jure Zbontar, Gzegorz Swiszcz, Giuseppe Ragusa, Robert, Jeremy Howard, Cole Harris, Chris DuBois, Claudio Perlich, John Blatz, Jason Trigg, Rajstennaj Barrabas, Lee Baker, Nan Zhou
• Also marked on the map: Heidelberg (DE), Ljubljana, Gera, Rome, Warsaw, Texas, Portland, Baltimore, Pennsylvania, USA, Las Cruces (NM), Pittsburgh
Global competitions
Predicting HIV progression
• State of the art: 70%
• After 1½ weeks: 70.8%
• Competition closes: 77%
Where’s Wally? Scouring the world for the best analysts for a problem.
Martin O’Leary
“In less than a week … a PhD student in glaciology outperformed the state-of-the-art algorithms”
We could not be happier with the result. The Kaggle approach has
set a new benchmark in Government for the development
of successful predictive models, delivered quickly and very cost
effectively.
In particular, the flexibility of the winning predictive model will enable
its application to other major transport routes to the CBD and allow
for the addition of other factors such as weather and incident.
Susan Calvert
Director, Strategy and Project Delivery Unit
Department of Premier and Cabinet
A Few Kaggle Projects
• Take historical medical claims and predict who will go to hospital. This competition has a $3 million prize.
• Predict which editors will stop contributing
• Detect driver drowsiness
• New algorithm for chess ratings. Has wide gaming and ranking significance
• Predict the likelihood of claims given different vehicle models
• Predict successful grant applications
• Predict shoppers’ next visit to supermarket
User base: 14,107 registered data scientists
Kaggle Competition Results
This competition (to forecast tourism demand)
used one of the most heavily studied sets of
time series data. It had previously been
modeled using the leading commercial software
and academic algorithms. Competitors quickly
surpassed world’s best practice and found the
frontier of what’s possible.
Kaggle Competition Results
[Chart: forecast error (MASE) over time, marked at Aug 9, 2 weeks later, 1 month later and competition end. Annotations: “Combination of world’s best models”; “Frontier reached after all information is extracted from the dataset”]
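For readers unfamiliar with the error measure on this chart, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast that repeats the previous observation. The sketch below is an illustration under that standard definition; the function name `mase`, the `season` parameter and the sample numbers are assumptions, not the competition's own scoring code.

```python
from typing import Sequence


def mase(
    train: Sequence[float],
    actual: Sequence[float],
    forecast: Sequence[float],
    season: int = 1,
) -> float:
    """Mean absolute scaled error of `forecast` against `actual`.

    The scale is the mean absolute error, on the training series, of a naive
    forecast that repeats the value observed `season` steps earlier.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal length")
    if len(train) <= season:
        raise ValueError("training series is too short for the chosen season")

    # In-sample errors of the naive (seasonal) forecast.
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant; MASE is undefined")

    # Out-of-sample mean absolute error of the forecast being scored.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / scale


if __name__ == "__main__":
    # Hypothetical series: history, then two held-out values and a forecast.
    history = [10.0, 12.0, 14.0, 16.0]
    print(mase(history, actual=[18.0, 20.0], forecast=[17.0, 21.0]))
```

A value below 1 means the forecast beats the naive benchmark on average, so lower is better, which is why the chart's curve falls as the competition progresses.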
From generating value => Making money
1. Open Comps: Unleashing the power of crowdsourcing
   $ Commission, consulting and performance fees
2. Consulting partnerships
   $ Revenue share
3. The platform as marketplace for technical talent
   $ Revenue share
Our market
Business analytics = $107 bil market
Outsourced business analytics = $38b [IDC]
Private Sector
• Sales forecasts
• Credit scoring
• Stock picking
• Risk modelling and pricing
• Identifying fraud
• Identifying best practice
• Production management
• Inventory management
• Logistics optimisation
Public and third sector
• Revenue forecasts
• Traffic forecasting
• Energy demand
• Predicting crime
• Tax/social security fraud
• Hospital casualty demand
• Identifying great teachers and hospitals
First mover advantages of internet platforms
Clients
Analysts
Kaggle not for profit
Kaggle public good competitions
“I keep saying the sexy job in the
next ten years will be statisticians.”
Hal Varian
Google Chief Economist
2009
No matter who you are, most of the
smartest people work for someone
else.
Bill Joy
Co-founder, Sun Microsystems
2009
Wally Photos by William Murphy (Flickr: infomatique)
Transforming the inefficient market for
technical talent into the world’s largest meritocracy.
nicholas.gruen@kaggle.com @nicholasgruen
Who We Are
Anthony Goldbloom, CEO / Founder
• Econometrician @ the Australian Treasury & Reserve Bank of Australia
• Journalism Intern @ The Economist
Jeremy Howard, Chief Scientist
• McKinsey and A.T. Kearney alumnus
• Founder of 2 successful startups: FastMail (exit to Opera) and Optimal Decisions Group (exit to Choicepoint)
Nicholas Gruen, Chairman
• Chairman of the Australian Gov. 2.0 taskforce
Jeff Moser, CTO
• Developer @ Raytheon and widely read blogger