integrative data science - Worcester Polytechnic Institute

Musings on Data Science and
Students Experiencing Data Analytics
New England SENCER Center for Innovation
Prof. Randy Paffenroth
Data Science Program
Department of Mathematical Sciences
Worcester Polytechnic Institute
rcpaffenroth@wpi.edu
2014
My Research
"Internet Connectivity Access layer" by User:Ludovic.ferre Internet_Connectivity_Overview2_Access.svg. Licensed under Creative
Commons Attribution-Share Alike 3.0 via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg
This is a panel, so I want to be
provocative!
Provocative
Adjective
1. tending or serving to provoke; inciting,
stimulating, irritating, or vexing.
So, I will be a little sad if I don’t end up irritating
anyone.
The first war: Terminology
• Analyzing data has a long history!
• There have been many terms that have been
used to describe such endeavors:
• Statistics
• Artificial Intelligence
• Machine learning
• Data analytics
• Since I happen to work in a “Data Science”
program perhaps I may be allowed the
indulgence of using that terminology…
Whatever we call it, what makes
things different now?
Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes
of data, and in some cases are on the verge of generating
petabytes and beyond. Analyses of the information contained in
these data sets have already led to major breakthroughs in fields
ranging from genomics to astronomy and high-energy physics and
to the development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies
Given a large mass of data, we can by judicious selection
construct perfectly plausible unassailable theories—all of
which, some of which, or none of which may be right.
- Paul Arnold Srere
The ability to take data—to be able to understand it, to process it, to
extract value from it, to visualize it, to communicate it—that’s going to
be a hugely important skill in the next decades, not only at the
professional level but even at the educational level for elementary
school kids, for high school kids, for college kids. Because now we
really do have essentially free and ubiquitous data. So the
complementary scarce factor is the ability to understand that data and
extract value from it.
- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers
My personal goal: Getting students to be able to
think critically about data.
What is Big Data?


There are many examples of "data", but what makes some of
it “big”? The classic definition revolves around the three
Vs: volume, velocity, and variety.



Volume: There is just a lot of it being generated all
the time. Things get interesting and “big”, when you
can’t fit it all on one computer anymore. Why? There
are many ideas here such as MapReduce, Hadoop, etc.
that all revolve around being able to process data that
goes from Terabytes, to Petabytes, to Exabytes.
Velocity: Data is being generated very quickly. Can
you even store it all? If not, then what do you get rid of
and what do you keep?
Variety: The data types you mention all take different
shapes. What does it mean to store them so that you
can play with or compare them?
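To make the MapReduce idea under "Volume" concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the example. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are invented here, and a real system such as Hadoop would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a block of data held on a different machine.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data, split into two "machines' worth" of text.
chunks = ["big data is big", "data science is not big data"]
counts = word_count(chunks)
print(counts)
```

The point of the design is that each call to `map_phase` only needs its own chunk, so no single computer ever has to hold all of the data.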
http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg
Is Big Data the same as Data
Science?

Are Big Data and Data Science the same thing?

I wouldn't say so...

Data Science can be done on small data sets.

And not everything done using Big Data would
necessarily be called Data Science.
[Figure: Venn diagram of Big Data and Data Science]
But there certainly is a substantial overlap!
[Figure: Venn diagram, Big Data overlapping Data Science]
Can you even be certain?



For real world problems, I
claim that you will never be
certain of any inferences from
data.
I mean, what happens to your
carefully thought out marketing
plan for some rocking slacks
when the Martians land?
What is unacceptable is when
the data you actually have
does not support the
conclusion you report.
Public domain image
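One way to check whether the data you actually have supports a reported difference is a permutation test. The sketch below is a minimal version under stated assumptions: the function name `permutation_p_value` and all of the numbers are made up for illustration, and the test statistic (absolute difference in means) is just one reasonable choice.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling of the data produces a
    difference in means at least as large as the one observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented weekly sales figures under two marketing plans.
old_plan = [12, 15, 11, 14, 13]
new_plan = [13, 12, 16, 14, 12]
p_similar = permutation_p_value(old_plan, new_plan)

# Invented figures where the two groups clearly differ.
p_different = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])

print(p_similar, p_different)
```

A large value for `p_similar` says that random relabeling produces a gap this big most of the time, so reporting "the new plan is better" would not be supported by these data.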
It can be easy to fool yourself!
Human beings are really
good at pattern
detection...
Perhaps a bit too good!
http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
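Computers can fool themselves in the same way when we let them search long enough. The sketch below, an illustration with invented names and parameters rather than anything from the talk, generates series of pure random noise and reports the strongest correlation found between any two of them.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_correlation(n_series=50, n_points=10, seed=0):
    """Search every pair of unrelated noise series for the largest |r|."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))


best = strongest_spurious_correlation()
print(f"Strongest correlation among unrelated noise series: {best:.2f}")
```

With 50 series there are 1,225 pairs to compare, so some pair will look strongly related by chance alone, even though every series is independent noise.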
Skills for Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important?
http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
WPI Data Science Program: A Collaboration
• Mathematical Sciences Department
• Computer Science Department
• Business School
M.S. in Data Science Program
• Graduate Qualifying Project or MS Thesis (3 to 9 credits)
• Concentration and Electives (9 to 15 credits)
• Mathematical Analytics (3 credits)
• Data Access & Management (3 credits)
• Data Analytics & Mining (3 credits)
• Business Intelligence & Case Studies (3 credits)
• Integrative Data Science (3 credits)
Data Science Core
Integrative Data Science:
• DS 501 Introduction to Data Science (new course)
Mathematical Analytics (select one):
• MA 543/DS 502 Statistical Methods for Data Science (new course)
• MA 542 Regression Analysis
• MA 554 Applied Multivariate Analysis
Data Access and Management (select one):
• CS 542 Database Management Systems
• MIS 571 Database Applications Development
• CS 561 Advanced Topics in Database Systems
• CS 585/DS 503 Big Data Management (new course)
Data Analytics and Mining (select one):
• CS 548 Knowledge Discovery and Data Mining
• CS 539 Machine Learning
• CS 586/DS 504 Big Data Analytics (new course)
Business Intelligence and Case Studies (select one):
• MIS 584 Business Intelligence
• MKT 568 Data Mining Business Applications
Data Science Certificate Program (18 credits):
• 15-credit Data Science Core
plus
• 3-credit elective
2014 Data Science Cohort
Educational foundation:
• Quantitative/computational backgrounds
• Computational skills: programming with data structures and algorithms
• Quantitative skills: calculus, linear algebra, and statistics
Employment histories:
• Senior Research Analyst
• Senior Business Analyst
• Patient Financial Services
• Database Analyst-Architect
• Decision Scientist
• Ministry of Finance
• Lahey Health
• Technical Program Management
• U.S. Department of State
Nationality:
• Cambodia, India, China, Pakistan, Taiwan, Iran, U.S.A., Brazil, Nepal, Afghanistan, Indonesia
Fulbright Scholars: 10%
Gender:
• 66.7% male
• 33.3% female
2014 Data Science Cohort
Fall 2014:
• Total applicants: 126
• Total acceptances: 33
• Fulbright Scholars: 3
• Brazil Science Mobility Student: 1
• Countries represented: 9
• Domestic students: 5
• International students: 28
Many hold more than one earned Bachelor’s Degree
US Universities include Columbia, UNH and WPI
Dean Oates gave two Awards of $5K to outstanding
students.
These awards help attract top students.
Skills Acquired by Our Students
Fundamental/Technical:
• SQL / Data Modeling / Cleaning
• Data Integration / Warehousing
• Statistical Learning / Machine Learning
• Distributed Computing
• Big Data Management
• Classification / Regression / Decision Trees
• Business Intelligence
• Distributed Mining Algorithms
Tools:
• Oracle / MySQL / DB2 / SQL Server
• R / SAS / SciKit
• Weka / RapidMiner / MATLAB
• IBM Cognos / SPSS Modeler
• Hadoop / Mahout / Cassandra
• Python / Java / Cloud Computing
• Storm / Spark / InfoSphere Streams
• Spotfire / Tableau
Professional Skills:
• Business Use Cases / Entrepreneurship
• Interdisciplinary Teams / Leadership
• Story Telling / Visualization
• Presentations / Reports
Data Science Tools for Students: Free!
Software:
• Python: http://www.python.org/
• IPython: http://ipython.org/
• NumPy: http://www.numpy.org/
• Pandas: http://pandas.pydata.org/
• Matplotlib: http://matplotlib.org/
• Mayavi: http://mayavi.sourceforge.net/
• Scikit-learn: http://scikit-learn.org/stable/
Data:
• UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
• Kaggle: https://www.kaggle.com/
• U.S. Government: https://www.data.gov/