The Emergence of Data Science: Why Now?

Ike Nassi
(With contributions from Andrew McAfee, MIT Sloan)
17-Oct 2013
BSOE Research Day
What this talk is all about
• Convince you that
  • There is a need
  • We have some tools
  • We need new approaches
  • We can’t do it all ourselves
• Evidence-based decision making is important
  • And it needs more attention
  • It will happen anyway
Outline
• Societal
• Economic
• Technological
A Short Story – Point of View
1984
Configuration = 0
Configuration ≠ 0
The Future: Hard to Predict Accurately
iWatch?
Skynet?
Changes happen faster than we think!
How well can experts predict?
2012 Political Campaign
slide by Andrew McAfee (MIT)
“Bottom line: Romney 315, Obama 223. That sounds high for Romney. But he could drop Pennsylvania and Wisconsin and still win the election. Fundamentals.”
Barone: Going out on a limb: Romney beats Obama, handily (315 to 223)
The Washington Examiner | 11/2/12 | Michael Barone
What about the experts?
slide by Andrew McAfee (MIT)
A Meta-Study Scorecard
slide by Andrew McAfee (MIT)
136 studies of expert vs. algorithmic prediction
Experts Clearly Better: 8 (6%)
Tossup: 65 (48%)
Algorithm Clearly Better: 63 (46%)
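The percentages on the scorecard follow from the counts. A quick check, assuming the counts map to the categories in the order the labels appear on the slide (that mapping is inferred from the layout, not stated explicitly):

```python
# Counts from the scorecard. Assigning 65 to "Tossup" and 63 to
# "Algorithm Clearly Better" is an assumption based on label order.
counts = {
    "Experts Clearly Better": 8,
    "Tossup": 65,
    "Algorithm Clearly Better": 63,
}

total = sum(counts.values())  # should match the 136 studies cited

# Share of studies in each category, rounded to a whole percent.
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}: {n} of {total} ({shares[label]}%)")
```

Under that mapping, the experts come out clearly ahead in only about one study in seventeen.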
The Digital Frontier Keeps Expanding
(slide contributed by Andy McAfee, MIT)
Source: “Building Watson: It’s not so elementary, my dear” – W. Shih. HBS case #9-612-017
Ken Jennings
(slide contributed by Andrew McAfee, MIT)
Why is Data Science happening now?
We can collect “Big Data”
slide by Andrew McAfee (MIT)
Big Data
slide by Andrew McAfee (MIT)
What can Economics tell us?
• We are collecting a lot more data, but…
• We are facing a rapidly changing economic landscape
• And we are not very good at controlling the economy
• Who is going to analyze it?
Capital vs. Labor
slide by Andrew McAfee (MIT)
Figure: Corporate Profits After Tax & Non-Farm Labor Share, 1947-2012. Axes: Corporate Profits ($Billions); Labor Share (2005 = 100). Source: Federal Reserve Bank of St. Louis, Economic Research
Recent Trends
slide by Andrew McAfee (MIT)
Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Series: GDP; Corporate Investments; All Profits After Tax; Non-Financial Profits After Tax. Axis: Level of GDP, Profits, and Investment (Jan-95 = 100). Shaded areas indicate recessions.
Recent Trends
slide by Andrew McAfee (MIT)
Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Series: GDP; Corporate Investments; All Profits After Tax; Non-Financial Profits After Tax; Employment to Population Ratio. Axes: Level of GDP, Profits, and Investment (Jan-95 = 100); Employment/Population Ratio. Shaded areas indicate recessions.
Skill Disparities
slide by Andrew McAfee (MIT)
Figure: Changes in Wages for Full-Time, Full-Year Male U.S. Workers, 1963-2008. Axis: Composition-Adjusted Real Log Weekly Wages. Series: Graduate School; College Graduate; Some College; High School Graduate; High School Dropout. Source: http://econ-www.mit.edu/~dautor/hole-vol4/figs/fig-04.zip
Superstars
Figure: U.S. Top 0.01% Income Share, 1913-2010. Axis: Income Share. Source: http://emlab.berkeley.edu/users/saez/piketty-saezOUP04US.pdf
How to effect change
Make the experts more effective
Proactive and Reactive Approaches
• Collect data, predict, act (proactive)
  • E.g. Evidence-based medicine
• Build systems that collect data, create feedback loops (reactive)
  • E.g. Human body
• Both are needed
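A minimal sketch of the two approaches, using made-up readings of a single measured quantity. The proactive function acts on a forecast before a threshold is crossed; the reactive function is a simple feedback loop that corrects part of the measured error each step. The function names, the naive linear forecast, and all numbers are illustrative assumptions, not taken from the talk.

```python
def proactive_action(history, threshold):
    """Collect data, predict the next value, and act before it happens.

    The forecast is a naive linear extrapolation from the last two
    observations (an assumption made for illustration only).
    """
    predicted = history[-1] + (history[-1] - history[-2])
    return "intervene" if predicted > threshold else "wait"


def reactive_loop(start, target, gain, steps):
    """Feedback loop: measure the error each step and correct part of it."""
    value = start
    trace = [value]
    for _ in range(steps):
        error = target - value   # feedback: compare measurement to the goal
        value += gain * error    # act in proportion to the error
        trace.append(value)
    return trace


print(proactive_action([98.0, 99.0, 100.5], threshold=101.0))
print(reactive_loop(start=39.0, target=37.0, gain=0.5, steps=5))
```

The proactive path depends on the quality of the forecast; the reactive path needs no forecast but only responds after the deviation has occurred, which is one way to read "both are needed".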
Analysis
Proactive
Reactive
Technology Requirements
• The volume of data under management is monotonically increasing
  • Who wants less data?
• Our appetite for analysis is monotonically increasing
  • Do you think, or do you know?
  • Trend toward evidence-based management
• Our appetite for speed is monotonically increasing
  • Who wants questions answered more slowly?
  • Hence the industry interest in in-memory data management systems
• Our overall ability to manage complexity is not increasing
Technology To Support Data Science
• Processor speeds are limited
• Processor core density has been increasing at a healthy rate
• Memory density is increasing (but at a lower rate than core density)!
• Therefore, the memory/core ratio is going in the wrong direction!
• We haven’t significantly changed the memory/storage hierarchies for decades
• Interconnects are getting faster – as fast as memory access?
  • Memory access is slow
  • Caches are fast!
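The arithmetic behind the memory/core claim can be sketched as follows: if core density compounds faster than memory density, memory per core shrinks every year. The growth rates below are hypothetical placeholders chosen only to show the shape of the trend; they are not measurements from the talk.

```python
def memory_per_core(years, core_growth, mem_growth):
    """Return memory-per-core for each year, relative to year 0.

    core_growth and mem_growth are annual multipliers (1.4 means +40%/year).
    """
    ratios = []
    for year in range(years + 1):
        cores = core_growth ** year    # relative core density
        memory = mem_growth ** year    # relative memory density
        ratios.append(memory / cores)
    return ratios


# Hypothetical rates: cores +40%/year, memory +25%/year.
ratios = memory_per_core(years=5, core_growth=1.4, mem_growth=1.25)
for year, ratio in enumerate(ratios):
    print(f"year {year}: memory/core = {ratio:.2f} (relative to year 0)")
```

Any pair of rates with memory growing more slowly than cores gives the same qualitative result: a ratio that declines geometrically.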
Memory-Density/Core-Density Declining…
Technological Solutions
• It’s in our nature to tackle more ambitious problems
• Need faster answers
  • SAP, Oracle, Neo4j, Objectivity, etc.
  • More in-memory solutions (e.g. NYSE/Euronext – Steve Rubinow)
• Cannot get faster processors, but we can get more of them
  • But: parallelism is difficult
  • Legacy software is a huge problem
• Need more machine learning and, therefore, feedback
What about memory?
Scaling out
• When all you have is a hammer, every problem looks like a nail
  • Or, in my case, a thumb!
• Today we rely almost exclusively on “scale-out” systems
  • Because that’s the main way we add processors and memory
  • Shard the data, intelligently target the queries – time-consuming
• It’s not easy to query partitioned databases
  • What is the best way to do it?
• Moving data is time-consuming
  • And you might have to change it
• What if you could build systems that “scale up”?
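The scale-out pattern described above can be sketched with a toy in-memory store. Keys are hash-partitioned across a hypothetical four shards; a lookup by key can be targeted at one shard, an aggregate has to visit all of them, and changing the shard count forces keys to move. The shard count, key names, and helper functions are illustrative assumptions, not any particular product's design.

```python
import zlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard using a stable hash (crc32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


shards = [dict() for _ in range(NUM_SHARDS)]


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    """Targeted query: only the shard that owns the key is consulted."""
    return shards[shard_for(key)].get(key)


def total_balance():
    """Untargeted query: every shard must be visited and the results merged."""
    return sum(sum(shard.values()) for shard in shards)


def keys_moved(keys, old_n, new_n):
    """Count keys whose owning shard changes when the shard count changes."""
    return sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))


keys = [f"account-{i}" for i in range(10)]
for i, key in enumerate(keys):
    put(key, i * 100)

print(get("account-7"))
print(total_balance())
print(keys_moved(keys, old_n=4, new_n=5), "of", len(keys), "keys would move")
```

Even in this toy, the application has to know the partitioning scheme to route queries, and adding a shard relocates data; both are costs a scale-up system would avoid.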
What I’m doing about this
• Enabling systems that scale up (TidalScale Inc. mission)
• Software that sits below an operating system but above the hardware, aggregates a set of servers, and runs that collection as a single virtual server running a single conventional operating system
  • dynamic scaling at linear cost
  • supporting unmodified legacy software and legacy operating systems
  • automatically, dynamically and hierarchically optimizing processors, memory, networks, and storage systems through machine learning
  • automatically evolving as hardware evolves
• The computer begins to learn what it needs to do to manage itself!
Why Data Science Now?
• NEED: the future is increasingly complex and difficult to predict
• NEED: we don’t have enough qualified experts, and experts often get it wrong
• RAW MATERIALS: we are collecting huge amounts of data at an increasing rate
• ENABLER: new hardware and software tools are emerging
• THEREFORE: Data science is inevitable! We don’t have a choice
What are the implications?
• Danny Hillis, inventor of the Connection Machine:
  • “I want to build a computer that will be proud of me”
• What about Skynet?
  • Let’s leave that discussion for another day…
The Second Machine Age
Andrew McAfee, MIT
amcafee@mit.edu
@amcafee
Thank you
Ike Nassi
UCSC Computer Science
inassi@ucsc.edu
and
TidalScale, Inc.
ike.nassi@tidalscale.com