The Emergence of Data Science: Why Now?
Ike Nassi (with contributions from Andrew McAfee, MIT Sloan)
17-Oct 2013, BSOE Research Day

What this talk is all about
  Convince you that
    There is a need
    We have some tools
    We need new approaches
    We can’t do it all ourselves
    Evidence-based decision making is important
    And it needs more attention
    It will happen anyway

Outline
  Societal
  Economic
  Technological

A Short Story – Point of View
  1984
  1984
  Configuration = 0
  Configuration ≠ 0

The Future: Hard to Predict Accurately
  iWatch?
  Skynet?
  Changes happen faster than we think!

How well can experts predict?

2012 Political Campaign (slide by Andrew McAfee, MIT)
  “Bottom line: Romney 315, Obama 223. That sounds high for Romney. But he could drop Pennsylvania and Wisconsin and still win the election. Fundamentals.”
  Michael Barone, “Going out on a limb: Romney beats Obama, handily (315 to 223),” The Washington Examiner, 11/2/12

What about the experts? (slide by Andrew McAfee, MIT)

A Meta-Study Scorecard (slide by Andrew McAfee, MIT)
  136 studies of expert vs. algorithmic prediction
    Experts clearly better:     8 (6%)
    Tossup:                    65 (48%)
    Algorithm clearly better:  63 (46%)

The Digital Frontier Keeps Expanding (slide contributed by Andy McAfee, MIT)
  Source: “Building Watson: It’s not so elementary, my dear” – W. Shih. HBS case #9-612-017

Ken Jennings (slide contributed by Andrew McAfee, MIT)

Why is Data Science happening now?

We can collect “Big Data” (slide by Andrew McAfee, MIT)

Big Data (slide by Andrew McAfee, MIT)

What can Economics tell us?
  We are collecting a lot more data, but…
  We are facing a rapidly changing economic landscape
  And we are not very good at controlling the economy
  Who is going to analyze it?

Capital vs.
Labor (slide by Andrew McAfee, MIT)
  [Figure: Corporate Profits After Tax & Non-Farm Labor Share, 1947-2012. Axes: corporate profits ($ billions); labor share (2005 = 100). Source: Federal Reserve Bank of St. Louis, Economic Research]

Recent Trends (slide by Andrew McAfee, MIT)
  [Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Series: GDP; corporate investments; all profits after tax; non-financial profits after tax. Axis: level of GDP, profits, and investment (Jan-95 = 100). Shaded areas indicate recessions]

Recent Trends (slide by Andrew McAfee, MIT)
  [Figure: Trends in US GDP, Profits, Investment, and Employment, 1995-2011. Series: GDP; corporate investments; all profits after tax; non-financial profits after tax; employment to population ratio. Axes: level of GDP, profits, and investment (Jan-95 = 100); employment/population ratio. Shaded areas indicate recessions]

Skill Disparities (slide by Andrew McAfee, MIT)
  [Figure: Changes in Wages for Full-Time, Full-Year Male U.S. Workers, 1963-2008. Axis: composition-adjusted real log weekly wages. Series: graduate school; college graduate; some college; high school graduate; high school dropout. Source: http://econ-www.mit.edu/~dautor/hole-vol4/figs/fig-04.zip]

Superstars
  [Figure: U.S. Top 0.01% Income Share, 1913-2010. Axis: income share. Source: http://emlab.berkeley.edu/users/saez/piketty-saezOUP04US.pdf]

How to effect change
  Make the experts more effective

Proactive and Reactive Approaches
  Collect data, predict, act (proactive). E.g.
    Evidence-based medicine
  Build systems that collect data, create feedback loops (reactive). E.g. the human body
  Both are needed

Analysis
  Proactive
  Reactive

Technology Requirements
  Data sizes for data under management are monotonically increasing
    Who wants less data?
  Our appetite for analysis is monotonically increasing
    Do you think, or do you know?
    Trend toward evidence-based management
  Our appetite for speed is monotonically increasing
    Who wants questions answered more slowly?
    Hence the industry interest in in-memory data management systems
  Our overall ability to manage complexity is not increasing

Technology To Support Data Science
  Processor speeds are limited
  Processor core density has been increasing at a healthy rate
  Memory density is increasing (but at a lower rate than core density)!
    Therefore, the memory/core ratio is going in the wrong direction!
  We haven’t significantly changed the memory/storage hierarchies for decades
  Interconnects are getting faster – as fast as memory access?
    Memory access is slow; caches are fast!

Memory-Density/Core-Density Declining…

Technological Solutions
  It’s in our nature to tackle more ambitious problems
  Need faster answers
    SAP, Oracle, Neo4j, Objectivity, etc.
    More in-memory solutions (e.g. NYSE/Euronext – Steve Rubinow)
  Cannot get faster processors, but we can get more of them
    But: parallelism is difficult
    Legacy software is a huge problem
  Need more machine learning, therefore, feedback
  What about memory?

Scaling out
  When all you have is a hammer, every problem looks like a nail
    Or, in my case, a thumb!
  Today we rely almost exclusively on “scale-out” systems
    Because that’s the main way we add processors and memory
  Shard the data, intelligently target the queries – time consuming
  It’s not easy to query partitioned databases
    What is the best way to do it?
  Moving data is time-consuming
    And you might have to change it
  What if you could build systems that “scale-up”?
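
Illustration (not from the original slides): a minimal sketch of why querying partitioned data is awkward. It assumes a toy in-process key-value store hash-partitioned across four shards; `ShardedStore`, `shard_for`, and the record layout are hypothetical names chosen for this example, not any vendor's API. A lookup by the shard key can be targeted at one shard, while a query on any other field has to visit every shard and merge the results.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so sha256 is used
    to keep the key-to-shard mapping reproducible across runs.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy hash-partitioned store; each dict stands in for one server."""

    def __init__(self, num_shards: int = 4):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        self.shards_touched = 0  # counts shard visits, as a rough cost measure

    def put(self, key: str, record: dict) -> None:
        self.shards[shard_for(key, self.num_shards)][key] = record

    def get(self, key: str):
        """Targeted query: the shard key tells us exactly where to look."""
        self.shards_touched += 1
        return self.shards[shard_for(key, self.num_shards)].get(key)

    def find(self, predicate) -> list:
        """Scatter-gather query: no shard key, so every shard is scanned."""
        results = []
        for shard in self.shards:
            self.shards_touched += 1
            results.extend(r for r in shard.values() if predicate(r))
        return results


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    for i in range(100):
        store.put(f"user{i}", {"id": f"user{i}", "age": 20 + i % 50})

    store.shards_touched = 0
    store.get("user7")
    print("shards visited for key lookup:", store.shards_touched)

    store.shards_touched = 0
    store.find(lambda r: r["age"] == 27)
    print("shards visited for age query:", store.shards_touched)
```

In this sketch the key lookup visits one shard and the age query visits all four. Changing `num_shards` would also remap most keys, which is one way "moving data" shows up when a scale-out system grows.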
What I’m doing about this
  Enabling systems that scale-up (TidalScale Inc. mission)
  Software that sits below an operating system but above the hardware, that aggregates a set of servers together and runs that collection as a single virtual server
    running a single conventional operating system
    dynamic scaling at linear cost
    supporting unmodified legacy software and legacy operating systems
    automatically, dynamically and hierarchically optimizing processors, memory, networks, and storage systems through machine learning
    automatically evolving as hardware evolves
  The computer begins to learn what it needs to do to manage itself!

Why Data Science Now?
  NEED: the future is increasingly complex and difficult to predict
  NEED: we don’t have enough qualified experts, and experts often get it wrong
  RAW MATERIALS: we are collecting huge amounts of data at an increasing rate
  ENABLER: new hardware and software tools are emerging
  THEREFORE: Data science is inevitable! We don’t have a choice

What are the implications?
  Danny Hillis, inventor of the Connection Machine: “I want to build a computer that will be proud of me”
  What about Skynet?
  Let’s leave that discussion for another day…

The Second Machine Age
  Andrew McAfee, MIT
  amcafee@mit.edu
  @amcafee

Thank you
  Ike Nassi
  UCSC Computer Science, inassi@ucsc.edu
  and TidalScale, Inc., ike.nassi@tidalscale.com