Program Prospectus
Ph.D. in Analytics and Data Science
Last Updated: November 13, 2012

Executive Summary

The McKinsey Global Institute has identified that the demand for deep analytical talent will outpace the supply in the United States by almost 200,000 people by 2018. The White House has launched a “Big Data Research and Development Initiative” to “expand the workforce needed to develop and use Big Data technologies”. This theme is echoed by Thomas Davenport’s recent article in Harvard Business Review titled “Data Scientist: The Sexiest Job of the 21st Century”. These studies – and many others – point to the need for universities to train “Data Scientists”. However, no university in the country currently has a degree program in Data Science – defined as the intersection of Statistics, Mathematics and Computer Science.

Kennesaw State University is proposing a Ph.D. in Data Science – the first of its kind in the country. The degree will train individuals to translate large, unstructured, complex data into information to improve decision making. The curriculum will include programming, mathematics, data mining, statistical modeling, and the mathematical foundations to support these concepts. Importantly, it will also emphasize communication skills – both oral and written – as well as application, tying results to business and research problems.

Because this degree is a Ph.D. (rather than a Doctorate in Data Science), it creates flexibility for the student. Graduates can pursue either a position in the private or public sector as a “practicing” Data Scientist – where the demand is expected to greatly outpace the supply – or a position within academia, where they would be uniquely qualified to teach these skills to the next generation.

Kennesaw State University is well positioned to launch this degree.
This is evidenced by the unparalleled success of the MS in Applied Statistics – where graduates are in great demand and continue to have 100% placement – and by the Minor in Applied Statistics, which draws undergraduate demand from every college across the university. The Minor supports approximately 200 undergraduates every semester – making it the most successful and sought-after Minor in the history of KSU. The Ph.D. in Data Science will not only help to close the talent gap in the area of Data Science, but will also continue KSU’s trajectory of regional and national recognition in the area of applied analytics.

SECTION 1: Justification of Need

“I skate to where the puck is going to be, not where it has been.” – Wayne Gretzky

The United States Federal Government recently issued a press release addressing what it sees as a growing critical shortage of data analysts and, on March 29, 2012, launched the “Big Data Research and Development Initiative”. One of the main purposes of the initiative is to “expand the workforce needed to develop and use Big Data technologies”.

The term “Big Data” is beginning to dominate descriptions of required skill sets across a wide variety of disciplines and sectors of the economy. While the accepted definition of Big Data continues to evolve, there is no question about the prevalence of related concepts and their expanding role in the future. According to The Economist magazine, unmanned American military aircraft (i.e., drone aircraft) flying over Iraq and Afghanistan in a single year (2009) produced approximately 24 years’ worth of video surveillance footage. This astonishing fact highlights at least four major points about the new direction of how data is collected, analyzed, and used:

1. Extraordinary, previously unimaginable amounts of data are being collected and stored for subsequent analysis, which contain potentially significant and meaningful information to society at large.

2. It is not feasible to manually review and/or analyze such massive data in a timely manner even with a team of human analysts using traditional methods. Computer-assisted semi- or fully-automated processes using new computational and data mining methods are needed in order to extract useful information from massive data sources in a timely manner.

3. In addition to massive amounts of traditional structured data (i.e., tabular data), extraordinary amounts of unstructured, non-traditional data such as video footage, audio recordings, and unstructured text are being collected and stored. Increasingly, these two very different types of data must be merged together in systematic ways in order to obtain useful information.

4. Unlike in the past, data collection and analysis is no longer a purely academic endeavor. Data are now gathered and analyzed to inform decision making in almost every field and sector imaginable, including the sciences, public health, the healthcare industry, all aspects of business and finance (including retail, insurance, marketing, the service industry, the credit industry, fraud detection, the communications industry, etc.), psychology, education, public policy agencies, government elections, and even national security and defense.

From these four points it follows that the next generation of statisticians will face very different challenges and issues than previous generations of statisticians. As a result, the next generation of statisticians needs a new set of knowledge and skills in order to effectively serve the data analysis needs of the 21st century.
These skills will incorporate more emphasis on applied mathematics and on computer programming than has historically been the case – even for applied statisticians.

The U.S. business community is also aware of this need: Hal Varian, Ph.D., the Chief Economist at Google, Inc., states simply, “Data are available; what is scarce is the ability to extract wisdom from them”.

Further to the recognition of the talent shortage evidenced through the White House Big Data Research and Development Initiative, the “Big Data Report” from the McKinsey Global Institute (MGI) estimates that the demand for data analysts could exceed the current supply by 140,000 to 190,000 positions by the year 2018 (see Figure 1). Figure 1 shows that there are 440,000 to 490,000 total data analyst job positions projected for 2018 with only 300,000 trained analysts to fill those positions. In other words, the demand for big data analysts could be 50 to 60% greater than its projected supply by 2018.

FIGURE 1: The Talent Gap for Big Data Analysts

The Big Data MGI report also predicts differential gains as a result of the impact of big data and its use across different sectors. According to MGI, finance and government (Cluster B in Figure 2) are expected to benefit strongly from big data use in the future, whereas the computer and electronic products and information sectors (Cluster A in Figure 2) have already experienced, and will continue to experience, substantial benefits from the impact and use of big data.

FIGURE 2: Differential Potential Gains of Big Data by Sector

A brief survey of the diverse disciplines which have recognized the role of Big Data and the changing role of analytics includes:

Business

Customer relations management (CRM) is one of the most innovative and profitable ways in which businesses use big data.
CRM is essentially the business practice of analyzing customer-centric big data to discover trends and use that information to customize or personalize offers and communications with customers to optimize business. CRM was once used only by Fortune 500 companies. Now, with the proliferation of big data and reduced costs in collecting and storing it, all types of companies are using it to optimize their business. In one example of a typical CRM application, a U.S. bank used big data analytics to predict which product offer was most likely to be accepted by a particular customer and thereby customize the next on-line product offered to that customer in an effort to cross-sell to existing customers (Berry & Linoff, 2000). This CRM initiative resulted in substantial gains in cross-selling, and therefore in profits to the bank, well above the cost of implementation. This is just one of many examples of big data analytics in business. Others include fraud identification, service rate estimation, predicting product failure, and optimizing direct mailing campaigns. By all accounts, the main hindrance in CRM is a lack of qualified data analysts (The Economist, 2010; Significance, 2012).

Healthcare & Public Health

The proper use of digitized medical records has the potential to revolutionize the healthcare industry. Proper analysis of these records may be used to detect unwanted drug interactions and/or side-effects, identify best practices in care (e.g., identify the most effective drug therapies), and even predict the onset of certain diseases before patients themselves are aware of symptoms (The Economist, 2012).
In one example, medical doctors and data analysts in Alabama developed automated infection surveillance software that assists hospitals in identifying changes in nosocomial infection (i.e., hospital-acquired infection) rates using massive data from Blue Cross/Blue Shield of Alabama and statistical and data mining methods (Putman, 2003). It has been estimated that nosocomial infections add as much as nine days to a patient’s hospital stay, leading to more than $4 million per year in additional expense. This infection surveillance software provides early warning to hospitals and allows them to intervene in a timely manner. This is only one of many possible examples where nontraditional statistical work involving big data has made a substantial improvement in healthcare quality and substantial savings to society.

Government

According to the National Science Foundation (NSF, 2012) in the document entitled “Core Techniques and Technologies for Advancing Big Data Science & Engineering” (NSF 12-499), the impact of big data is causing a paradigm shift in scientific and biomedical investigation that is transforming the missions of a number of U.S. Federal Government agencies:

Today, US government agencies recognize that the scientific, biomedical and engineering research communities are undergoing a profound transformation with the use of large-scale, diverse, and high-resolution data sets that allow for data-intensive decision-making, including clinical decision making, at a level never before imagined. New statistical and mathematical algorithms, prediction techniques, and modeling methods, as well as multidisciplinary approaches to data collection, data analysis and new technologies for sharing data and information are enabling a paradigm shift in scientific and biomedical investigation.
Advances in machine learning, data mining, and visualization are enabling new ways of extracting useful information in a timely fashion from massive data sets, which complement and extend existing methods of hypothesis testing and statistical inference. As a result, a number of agencies are developing big data strategies to align with their missions.

These examples and countless others highlight three common emerging themes:

1. Data is ubiquitous. All disciplines. All sectors of the economy.

2. Data is no longer considered a necessary cost to be managed down, but rather an asset to be “mined” and leveraged.

3. All sectors are increasingly finding a dearth of analytical talent to support their nascent but explosive analytical needs, particularly as they relate to Big Data.

In response to this, Kennesaw State University is proposing the development of a Ph.D. in Analytics and Data Science. It is our position that the Data Scientist will be uniquely positioned to fill the talent shortage as outlined above.

It is critical to note that we are proposing a Ph.D. program in Analytics and Data Science rather than in Statistics. A great deal of attention is emerging in the field of analytics towards the role of the Data Scientist –

From IBM – “A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong application acumen, coupled with the ability to communicate findings…in a way that can influence how an organization approaches a business challenge.”

From Thomas Davenport, Senior Managing Partner at Accenture and author of Competing on Analytics – “(Data Scientists) are not typical scientists…but rather hybrids of science and computation. Somewhere along their career journeys they became interested in, and good at, the manipulation of data. In fact, many of them really have ‘computational’ in front of their scientific specialties: computational biology, computational ecology, etc. If you want some evidence of this hybrid specialization, look at your favorite data scientist’s profile on LinkedIn -- the home, by the way, of some of the best data scientists around -- and check out the skills they say they have. You’ll see ‘analytics’ (quantitative analysis, statistical modeling, predictive analytics, social network analysis, data mining, etc.) listed, of course. But you are also likely to see SQL, Java, C, Python, R, distributed databases, and so forth. All of these skills actually are found in one individual, and he seems typical of the breed…to my knowledge, no universities have programs yet in big data analytics (though some are talking about them -- universities typically don’t move too hastily).”

From Daniel Tunkelang, Chief Data Scientist at LinkedIn – “Strong analytical skills are a given: above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions. Communication is essential, because data scientists work in horizontal roles and partner with groups across the entire organization. At LinkedIn, data scientists collaborate with every other product group, as well as with sales and finance.
Strong communication skills are a must-have.”

From Steve Hillion, VP of Analytics at GreenPlum, as quoted in Forbes – “I’m sure in 30 years’ time, there will be lots and lots of degrees in data science and that’s where [data scientists will] come from, but right now it’s coming from all these different buckets (math, computer science, economics)…And, just as the early days of computing were born in the garages of Silicon Valley do-it-yourself-ers, data science is likely to develop first in an ad-hoc, hands-on way.”

It is our position that the intersection of skills outlined in multiple ways above is brought together under the description of “Data Scientist” in a way that does not occur in a traditional Statistics curriculum. This term has emerged as the moniker of an individual with strong computational and programming skills who also possesses business/content acumen, enabling clear and meaningful communication. As can be seen in Figure 3, the term “data scientist” is emerging as a dominant search term in job search engines.

FIGURE 3: Job Trends in Data Science

From Michael Rappa, Director of the Institute for Advanced Analytics at NC State – “The future of data science in the enterprise will be extremely bright, if a few key things happen: First, the right kinds of partnerships must be formed between data-rich companies and forward-thinking academic institutions. Second, institutions and employers need to encourage and reward the right set of data-science skills.”

Are statisticians going away? No. There will always be a need for traditional statistics. Disciplines such as psychology, nursing, marketing research, medical research, etc., will always have a need for the traditional skills associated with hypothesis testing and model development. Data Scientists are different. They embody skills which traditional statisticians don’t have.
While data scientists must have strong skills in statistical testing and modeling, they are also strong in computational mathematics, data architecture, the process of ETL (extract, transform, load), programming (e.g., SAS, Java, C++, Hadoop), and typically have some content knowledge (e.g., Chemistry, Biology, Finance).

The proposed Ph.D. in Analytics and Data Science at Kennesaw State University directly meets the national and local talent shortage in this space, as evidenced by movements such as the Big Data Research and Development Initiative of 2012, by effectively and thoroughly training and thereby expanding the workforce available to develop and use Big Data technologies. Furthermore, this degree program would transform teaching and learning in the field of Big Data technology, another major objective of the White House Initiative. Consequently, we believe the degree will directly and/or indirectly accelerate the pace of discovery in science and engineering used to further understanding and knowledge, strengthen U.S. national security, and increase the quality of life for the average American citizen.

With respect to the shortage of big data analysts and their training, the Big Data MGI report states, “…we believe that the constraint on this type of talent will be global, with the caveat that some regions may be able to produce the supply that can fill talent gaps in other regions”. It is our strongly held position that, with this proposed Ph.D. degree, we can make Georgia one of the key regions producing Big Data analytical leadership for the world. A Ph.D. program in Analytics and Data Science at Kennesaw State University has the potential to define Kennesaw State University, the University System of Georgia, and the State of Georgia as cutting-edge, state-of-the-art innovators in the methods and technologies that will shape and see us through the 21st century.
While there are no unique statistics on positions for “Data Scientists” from the Georgia Department of Labor, there are unique statistics on the constituent disciplines of Statistics, Mathematics and Computer Science and their projected employability.

TABLE 1: GA Department of Labor projections for Mathematics, Statistics and Computer Science
Occupational Employment Projections in Georgia for Multiple Occupations for a base year of 2010 and a projected year of 2020

Occupation  Occupational Title (SOC)                      2010 Estimated  2020 Projected  Total 2010-2020    Total Percent
Code                                                      Employment      Employment      Employment Change  Change
151111      Computer and Information Research Scientists  630             687             57                 9.00%
150000      Computer and Mathematical Occupations         105,907         122,049         16,142             15.20%
152000      Mathematical Science Occupations              4,883           5,438           555                11.40%
152021      Mathematicians                                ***             ***             ***                ***
152041      Statisticians                                 888             987             99                 11.10%
TOTAL                                                     112,308         129,161         16,853             15%

Source: Georgia Department of Labor

In addition, the economies of the State of Georgia and the City of Atlanta, which are heavily dominated by Finance and Insurance, Government Services and Healthcare, are the very industries the McKinsey Global Institute identified in Figure 2 above as expecting the greatest benefit from (and, by association, the greatest demand for) big data expertise.

SECTION 2: Demand for the Program

Local demand for the program is evidenced, in part, through the successes of both the Minor in Applied Statistics and Data Analysis and the Master of Science in Applied Statistics.

The Minor in Applied Statistics, more than any other Minor field of study in the history of KSU, is a flagship of interdisciplinary success. Students are required to complete 15 hours (five courses) in Statistics at the 3000 level or above to qualify for a Minor in Applied Statistics and Data Analysis.
In any given semester, the Minor serves the needs of over 200 students from almost every college across the university. Of any course of study, Statistics draws the most diverse cross section of majors in its 3000- and 4000-level courses. While most upper division courses are populated by students from a single major, the statistics courses (all STAT courses are at the 3000 level or above) are consistently populated with students from Biology and Chemistry, Finance and Economics, Psychology, Mathematics, Sociology…and even Theater (see Figure 4).

FIGURE 4: Distribution of Minors in Applied Statistics and Data Analysis by Declared Major (Fall 2012)
[Bar chart: number of minors by declared major, horizontal axis from 0 to 90]

Why do almost 1% of the undergraduates at KSU seek out a series of five upper division electives in Statistics? We believe that three primary reasons have created this demand:

1. We have an inherently interdisciplinary faculty – the same faculty which will power the Ph.D. in Analytics and Data Science. Most of the Statistics faculty has had experience in the private sector, including Ford Motor Company, The Children’s Hospital of Cincinnati, The Cancer Center at MD Anderson in Houston, TX, MasterCard International, VISA EU (London), AT&T/BellSouth (Brazil), Thomson Reuters, The Southern Company and ChoicePoint. Most students can find a faculty member whose application of statistics outside the classroom is aligned with their career aspirations. We bring our experiences into the classroom and students respond.

2. Statistics is the process through which data is converted into meaningful information to support decision making. But, as outlined above, while data is increasingly ubiquitous and cheap and easy to capture and store, it is difficult to translate. Students recognize that whether they are studying Finance or Psychology, Biology or Political Science, they will have to understand how to translate data into information. Since all disciplines work with data in some form, all disciplines of study need some integration of statistics for their graduates to be marketable.

3. Jobs. Jobs. Jobs. Students are increasingly turning to Statistics as a great way to differentiate themselves in the marketplace. Undergraduates with Minors in statistics are having great success with job placement after graduation. Statistics students from KSU are recruited for positions across a wide variety of companies and organizations including The Home Depot, The Southern Company, Link Analytics, Aspen Marketing Services, Epsilon, Ultimate Software, IBM, Assurant, Compucredit, the CDC and Equifax.

The Master of Science in Applied Statistics has a similar story. Since the launch of the degree in 2006, very few of the applicants have had undergraduate degrees in Statistics. MSAS applicants come from Engineering, Business, Medicine, and Education. A defining characteristic of the MSAS program is its fluid alignment with the needs of the market. As a result, the MSAS is proud of its effective 0% unemployment rate amongst students without work restrictions.

Statistics emerged as a unique discipline at KSU in Fall of 2006 – all of this success has occurred in less than 6 years.
In an effort to ensure limited duplication with other successful initiatives in Statistics within the University System, such as the programs at the University of Georgia and at Georgia Tech, KSU pursued a strongly applied orientation, meaning that our course materials were focused on leveraging our faculty experience outside the classroom and applying statistics the way practitioners apply statistics. From the beginning we elected to place less emphasis on theoretical statistics.

This applied orientation meant that we needed strong integration of statistical software into our curriculum. So, we looked to the dominant software/language in the marketplace. This was, without question, SAS, which is used by 95% of the Fortune 500 – including all of the top companies in our regional footprint. As a result, all of our students, at both the undergraduate and graduate levels, learn strong SAS programming skills as a complement to their statistics skills.

It is this dimension of programming skills, combined with a strong mathematical foundation and deep and broad instruction in statistical modeling, which has already well positioned the program to offer a Ph.D. in Analytics and Data Science.

Additional evidence of local demand for these skills comes from analytical job sites such as icrunchdata.com – a job posting site uniquely designed for analytical professionals. A recent keyword search for open positions in Georgia generated the following results:

“Big Data” – 79 positions, including postings with Intuit, The Home Depot, Hitachi, United Health Group, and IBM.
“Statistics” – 759 positions, including postings with Coca-Cola, The Home Depot, CDC, Assurant, Fiserv and Lockheed Martin.
“Data Scientist” – 8 positions – all with salaries over $100,000.
“Advanced Analytics” – 1490 positions – including positions with every Fortune 500 company in the state.

Screen shots from these searches can be found in Appendix 1.
We have also received strong demand and support for this program from the Statistical Advisory Board.

TABLE 2: KSU Statistical Advisory Board

NAME               COMPANY                          TITLE
Chuck Clemens      Maxum Specialty Insurance Group  CIO
Steven Einbender   The Home Depot                   Senior Manager – Pricing
Bill Franks        Teradata                         Chief Analytics Officer
Ron Garmon         Vuelogic                         President
Will Hakes         LinkAnalytics                    Founder and CEO
Don Hayes          Hayes Consulting                 CEO
Jim Head           BBDO                             Senior Vice President, Analytics
Darrell Maret      INPO                             Senior Analyst
Billy Nix          Southern Company                 Vice President, Load Research
Jerry Oglesby      SAS                              Division Head, Higher Education
Carol Pierannunzi  CDC                              Senior Survey Methodologist
Brian Stone        CompuCredit                      Chief Risk Officer

As our Advisory Board guided the development of the proposal, they emphasized the importance of practical experience to this degree. To that end, the Board has agreed, in principle, to “hire” Ph.D. students on a contract basis for a minimum of one year, after they have completed their coursework – but prior to completing their dissertation. This would accomplish three objectives:

1. The hiring firm would cover one year (minimum) of doctoral student stipend ($25 - $30K).
2. The Ph.D. student would apply concepts and skills learned in the classroom in a “real” environment.
3. The experience has the potential to become a source of dissertation research.

This integration with the companies represented by the Advisory Board also represents an important endorsement of our proposed program, as well as an extension of the engagement with the business community which has been the trademark of the Statistics programs to date. Letters of Intent and Support from the Statistical Advisory Board can be found in Appendix 2.

SECTION 3: Non-Duplication of Similar Programs at USG Institutions

Not only does the proposed program not duplicate any program currently in existence in the State of Georgia; it would be the first of its kind in the country. A brief outline of the most closely related Ph.D. programs in the State of Georgia is provided in Table 3 below.

TABLE 3: Comparison of related Ph.D. programs in the University System of Georgia

Institution: Georgia Institute of Technology
Name of Program: Ph.D. in Industrial Engineering with a Specialization in Statistics
Stated Objectives: “The Ph.D. in (Industrial and Systems Engineering) is a research degree...students have the opportunity to pursue work at virtually any of the points across the applied/theoretical spectrum...”
Notes on Curriculum: Courses incorporate strong mathematics, with methods courses aligned with manufacturing and engineering. Requirements include five core courses, two theory courses, three methods courses, one elective course (11 courses total). No requirement for internships or co-op.
Program Housed: College of Engineering, H. Milton Stewart School of Industrial and Systems Engineering.

Institution: Georgia Institute of Technology
Name of Program: Ph.D. in Industrial Engineering with a Specialization in Computational Science and Engineering
Stated Objectives: “Georgia Tech's CSE Ph.D. degree will prepare students for a variety of positions in industry, government and academia that emphasize research and development. Students will be well prepared for positions in industry…and in government. Graduates may pursue work in software and systems for modeling and simulation, systems integration, data mining and visualization, high performance computing, and computational modeling. Academic career possibilities include research and education in departments concerned with advancing the state-of-the-art in the development and application of computational models in engineering, the sciences and computing.” The program emphasizes the integration and application of principles from mathematics, science, engineering and computing to create computational models.
Notes on Curriculum: Courses include 6 core courses in computational mathematics and in high performance computing, three elective courses which “must go beyond ‘using computers’ to deepen understanding of computational methods, preferably in the context of some application domain” and three elective courses in an application domain (12 courses total). No requirement for internships or co-op.
Program Housed: College of Engineering, H. Milton Stewart School of Industrial and Systems Engineering.

Institution: University of Georgia
Name of Program: Ph.D. in Statistics
Stated Objectives: (none listed)
Notes on Curriculum: Heavy theoretical emphasis – placement is exclusively in academic positions. Program includes a minimum of 10 courses including four statistical theory courses (core), two subcore electives – one of which is a statistical computing course – and four unspecified STAT electives.
Program Housed: College of Arts and Science, Statistics Department.

Institution: Georgia State University
Name of Program: Ph.D. in Mathematics and Statistics
Stated Objectives: “The Ph.D degree program in Mathematics and Statistics includes concentrations in bioinformatics, biostatistics, and mathematics. These concentrations address the critical need for mathematics faculty and the need for highly trained specialists in the areas of bioinformatics and biostatistics…(the program) will graduate individuals with a broad background in applied areas for direct placement in business, industry, governmental institutions and research universities.”
Notes on Curriculum: Heavy emphasis on mathematics. The four core courses include Real Analysis, Matrix Analysis, Theory of Probability and Linear Statistical Analysis. Remaining courses vary based upon selected concentration. The Concentration in Bioinformatics incorporates three computer science courses. Eighteen courses required.
Program Housed: College of Arts and Sciences, Department of Mathematics and Statistics.

These are all excellent programs which have achieved recognition in varying contexts.
However, none of these programs are aligned with the skills defining the “Data Scientist”.

APPENDIX 1: Screen Shots from icrunchdata.com for Georgia job postings

APPENDIX 2: Letters of Intent from the Statistical Advisory Board