Musings on Data Science and Students Experiencing Data Analytics
New England SENCER Center for Innovation
Prof. Randy Paffenroth
Data Science Program
Department of Mathematical Sciences
Worcester Polytechnic Institute
rcpaffenroth@wpi.edu
2014

My Research
"Internet Connectivity Access layer" by User:Ludovic.ferre, Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons.
http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg

This is a panel, so I want to be provocative!
Provocative (adjective): 1. tending or serving to provoke; inciting, stimulating, irritating, or vexing.
So, I will be a little sad if I don’t end up irritating anyone.

The first war: Terminology
• Analyzing data has a long history!
• Many terms have been used to describe such endeavors:
  • Statistics
  • Artificial Intelligence
  • Machine learning
  • Data analytics
• Since I happen to work in a “Data Science” program, perhaps I may be allowed the indulgence of using that terminology…

Whatever we call it, what makes things different now?
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies

Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories—all of which, some of which, or none of which may be right.
- Paul Arnold Srere

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers

My personal goal: Getting students to be able to think critically about data.

What is Big Data?
There are many examples of "data", but what makes some of it “big”? The classic definition revolves around the three Vs: volume, velocity, and variety.
Volume: There is just a lot of it being generated all the time. Things get interesting, and “big”, when you can’t fit it all on one computer anymore. Why? There are many ideas here, such as MapReduce, Hadoop, etc., that all revolve around being able to process data that goes from terabytes, to petabytes, to exabytes.
Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep?
Variety: The data types you mention all take different shapes. What does it mean to store them so that you can play with or compare them?
http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg

Is Big Data the same as Data Science?
Are Big Data and Data Science the same thing? I wouldn't say so... Data Science can be done on small data sets. And not everything done using Big Data would necessarily be called Data Science.
[Figure labels: Big Data, Data Science]
But there certainly is a substantial overlap!
[Figure labels: Big Data, Data Science]

Can you even be certain?
For real-world problems, I claim that you will never be certain of any inferences from data. I mean, what happens to your carefully thought-out marketing plan for some rocking slacks when the Martians land? What is unacceptable is when the data you actually have does not support the conclusion you report.
(Public domain image)

It can be easy to fool yourself!
Human beings are really good at pattern detection... Perhaps a bit too good!
http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)

Skills for Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Which is most important?
http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

WPI Data Science Program: A Collaboration
• Mathematical Sciences Department
• Computer Science Department
• Business School

M.S. in Data Science Program
• Graduate Qualifying Project or MS Thesis (3 to 9 credits)
• Concentration and Electives (9 to 15 credits)
• Mathematical Analytics (3 credits)
• Data Access & Management (3 credits)
• Data Analytics & Mining (3 credits)
• Business Intelligence & Case Studies (3 credits)
• Integrative Data Science (3 credits)

Data Science Core
Integrative Data Science:
• DS 501 Introduction to Data Science (new course)
Mathematical Analytics (select one):
• MA 543/DS 502 Statistical Methods for Data Science (new course)
• MA 542 Regression Analysis
• MA 554 Applied Multivariate Analysis
Data Access and Management (select one):
• CS 542 Database Management Systems
• MIS 571 Database Applications Development
• CS 561 Advanced Topics in Database Systems
• CS 585/DS 503 Big Data Management (new course)
Data Analytics and Mining (select one):
• CS 548 Knowledge Discovery and Data Mining
• CS 539 Machine Learning
• CS 586/DS 504 Big Data Analytics (new course)
Business Intelligence and Case Studies (select one):
• MIS 584 Business Intelligence
• MKT 568 Data Mining Business Applications

Data Science Certificate Program (18 credits):
• 15 credit Data Science Core, plus
• 3 credit elective

2014 Data Science Cohort
Educational foundation: quantitative/computational backgrounds
• Computational skills: programming with data structures and algorithms
• Quantitative skills: calculus, linear algebra and statistics
Employment histories:
• Senior Research Analyst
• Senior Business Analyst
• Patient Financial Services
• Database Analyst-Architect
• Decision Scientist
• Ministry of Finance
• Lahey Health
• Technical Program Management
• U.S. Department of State
Nationality: Cambodia, India, China, Pakistan, Taiwan, Iran, U.S.A., Brazil, Nepal, Afghanistan, Indonesia
Fulbright Scholars: 10%
Gender: 66.7% male, 33.3% female

2014 Data Science Cohort: Fall 2014
Total applicants: 126
Total acceptances: 33
Fulbright Scholars: 3
Brazil Science Mobility Student: 1
Countries represented: 9
Domestic students: 5
International students: 28
• Many hold more than one earned Bachelor’s degree.
• US universities include Columbia, UNH and WPI.
• Dean Oates gave two awards of $5K to outstanding students. These awards help attract top students.

Skills Acquired by Our Students
Fundamental/Technical:
• SQL / Data Modeling / Cleaning
• Data Integration / Warehousing
• Statistical Learning / Machine Learning
• Distributed Computing
• Big Data Management
• Classification / Regression / Decision Trees
• Business Intelligence
• Distributed Mining Algorithms
Tools:
• Oracle / MySQL / DB2 / SQL Server
• R / SAS / SciKit
• Weka / RapidMiner / MatLab
• IBM Cognos / SPSS Modeler
• Hadoop / Mahout / Cassandra
• Python / Java / Cloud Computing
• Storm / Spark / InfoSphere Streams
• Spotfire / Tableau
Professional Skills:
• Business Use Cases / Entrepreneurship
• Interdisciplinary Teams / Leadership
• Story Telling / Visualization
• Presentations / Reports

Data Science Tools for Students: Free!
Software:
• Python: http://www.python.org/
  • iPython: http://ipython.org/
  • Numpy: http://www.numpy.org/
  • Pandas: http://pandas.pydata.org/
  • Matplotlib: http://matplotlib.org/
  • Mayavi: http://mayavi.sourceforge.net/
  • Scikit-learn: http://scikit-learn.org/stable/
Data:
• UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
• Kaggle: https://www.kaggle.com/
• U.S. Government: https://www.data.gov/
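
The free tools are enough to try the "it can be easy to fool yourself" point for yourself. What follows is a minimal sketch using only the Python standard library; it is not from the talk. The function names `pearson` and `best_spurious_match`, the 20 data points, the 1000 candidates, and the seed are all illustrative assumptions. The sketch draws one target series of pure noise, then searches many unrelated noise series for the one that correlates best with it. Judicious selection alone turns up a seemingly strong relationship where none exists.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def best_spurious_match(n_points=20, n_candidates=1000, seed=0):
    """Return the strongest correlation found between a noise target
    and many independent noise candidates (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    r = best_spurious_match()
    print(f"Best correlation among unrelated noise series: {r:.2f}")
```

Raising `n_candidates` or lowering `n_points` tends to make the best match look even more convincing, which is exactly the trap: the data you searched through does not support the conclusion you would be tempted to report.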