Slides for Section 1

Introduction to Data Science
Section 1
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and
NCDS
Thomas M. Carsey
carsey@unc.edu
1
Course Materials
• I used many sources in preparing for this course:
– Practical Data Science using R by Zumel and Mount
– http://www.manning.com/zumel/
– Data Mining with R: Learning with Case Studies, by Torgo
– http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/
– An Introduction to Data Science, Version 3, by Stanton
– http://jsresearch.net/
– Monte Carlo Simulation and Resampling Methods for
Social Science, by Carsey and Harden
– http://www.sagepub.com/books/Book241131/reviews?course=Course14&subject=J00&sortBy=defaultPubDate%20desc&fs=1#tabview=title
– Machine Learning with R by Lantz
– http://www.packtpub.com/machine-learning-with-r/book
2
Additional Materials
• A Simple Introduction to Data Science, by
Burlingame and Nielsen
• http://newstreetcommunications.com/businesstechnical/a_simple_introduction_to_data_science
• Ethics of Big Data, by Davis
• http://shop.oreilly.com/product/0636920021872.do
• Privacy and Big Data, by Craig and Ludloff
• http://shop.oreilly.com/product/0636920020103.do
• Doing Data Science: Straight Talk from the
Frontline, by O’Neil and Schutt
• http://shop.oreilly.com/product/0636920028529.do
3
Learning R
• Lots of places to learn more about R
– All of the sources on the first slide have R code available
– Comprehensive R Archive Network (CRAN)
– http://cran.r-project.org/manuals.html
– Springer Textbooks Use R! Series
– http://www.springer.com/series/6991
– Online search tool Rseek
– http://www.rseek.org/
– The RStudio site
– http://www.rstudio.com/
– The Odum Institute’s online course
– http://www.odum.unc.edu/odum/contentSubpage.jsp?nodeid=670
4
What is Data Science?
5
What is Data Science?
• What words come to mind when you think of
Data Science?
• What experience do you have with Data
Science?
• Why are you taking an Introduction to Data
Science Class?
6
The Data Science Revolution
• Data science is exploding in importance and
the attention it receives.
• It’s hard to sort through the substance and the
hype.
• There is real value in data science, but you
should have a purpose or goal in mind first.
7
8
The Roots of Data Science
• Simple observation, and the recording of those
observations, dates back to the most ancient
civilizations
– The Greeks were the first western civilization to adopt
observation and measurement
• Some call Aristotle the first empirical scientist
– Muslim scholars between the 10th and 14th centuries
developed experimentation (e.g., Ibn al-Haytham)
– Roger Bacon (c. 1220-1292) promoted inductive
reasoning (inference)
– Descartes (1596-1650) shifted focus to deductive
reasoning.
9
What is Data Science?
• “How Companies Learn Your Secrets” NYT, by
Charles Duhigg, February 16, 2012
• http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=2&hp&
10
What did Target Do?
• Mining of data on shopping patterns
– Specific products purchased
– Combination of products purchased
– Combined with demographic and other data
• Psychology and neuroscience
– Habits:
• Cue-routine-reward
• When are habits open to change?
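To make "combination of products purchased" concrete, here is a minimal sketch that counts how often pairs of products show up in the same basket. The baskets and product names are invented for illustration, and this is not a description of Target's actual method; it only shows the general idea of mining co-purchase patterns.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (invented data, one set per shopping trip).
baskets = [
    {"lotion", "vitamins", "cotton balls"},
    {"lotion", "vitamins"},
    {"bread", "milk"},
    {"lotion", "cotton balls", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(baskets)

# Report pairs that were bought together in at least two baskets.
frequent = {pair: n for pair, n in counts.items() if n >= 2}
for pair, n in frequent.items():
    print(pair, n)
```

A real analysis would combine counts like these with demographic and other data, as the slide notes, and would need far more data before any pattern could be trusted.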
11
Lessons from Target
• Yes, Data Science is about mining data
• There are deeper theoretical issues involved in
understanding what you find
• Left out of that long article are most of the
critical steps that precede the analysis
• In short, Data Science > data mining
12
Definition of Data Science
• There are many, but most say data science is:
– Broad – broader than any one existing discipline
– Interdisciplinary: Computer Science, Statistics,
Information Science, databases, mathematics
• Also substantive domains (environmental science, sociology,
public health, etc.)
– Applied focus on extracting knowledge from data to
inform decision making.
– Focuses on the skills needed to collect, manage, store,
distribute, analyze, visualize, and reuse data.
• There are many visual representations of Data
Science
13
Some definitions link computational, statistical, and substantive expertise
14
Other definitions focus more on technical skills alone
15
Still other definitions are so broad as to include nearly everything
16
There are many “Word Cloud” representations of Data Science as well
17
18
19
Definition of Data Science
• The field is immature, unfocused, and cluttered
by hype.
• But, key features should include:
– Data across its lifecycle
– Interdisciplinary skills
– Substantive knowledge
20
Defining Some Terms
21
MapReduce and Hadoop
– Designed to process large operations quickly
– Distributes the problem across multiple servers
– The Map part filters and sorts data into bins or
queues based on some shared characteristic
– The Reduce part then executes some operation on
each bin of data.
– Results are then reassembled
– It is like parallel processing, but distributed across
servers rather than just processors
– Scalable and fault tolerant
– Hadoop is an open-source implementation
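The Map, bin, and Reduce steps listed above can be sketched in a few lines of plain Python. This is a single-machine toy, not Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the word-count task are assumptions chosen for illustration, and a real cluster would run the map and reduce calls on different servers.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Sort the mapped pairs into bins, one bin per shared key."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return bins

def reduce_phase(key, values):
    """Reduce: run one operation (here, a sum) on a single bin."""
    return key, sum(values)

def word_count(records):
    # In a real system each record could be mapped on a different server.
    mapped = [pair for record in records for pair in map_phase(record)]
    bins = shuffle(mapped)
    # Each bin could likewise be reduced independently; results are
    # then reassembled into one dictionary.
    return dict(reduce_phase(key, values) for key, values in bins.items())

records = ["big data is big", "data science uses data"]
counts = word_count(records)
print(counts)
```

Because each record is mapped independently and each bin is reduced independently, the work can be split across machines, which is where the scalability comes from.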
22
More on MapReduce
• Pig
– Software platform used for creating MapReduce
programs that run on Hadoop.
• Hive
– A data warehouse infrastructure built on top of
Hadoop. Used to query, summarize, or analyze
data.
23
Database Management
• SQL – Structured Query Language
– A programming language designed for management of
relational databases
• MySQL – Open-source relational database management
system that uses SQL (used by Wikipedia, Google,
Facebook, Twitter, Flickr, YouTube)
• NoSQL – (Not Only SQL)
– Used for databases where the data is in some form other than
tabular relations like those used in relational databases
– Cassandra (Apache)
• Distributed database management with no single point of failure
• Scalable with no down time
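As a small illustration of the SQL bullet above, the sketch below creates a relational table and summarizes it with a query. It uses SQLite through Python's standard sqlite3 module rather than MySQL, purely so that it runs without a database server; the table name, column names, and rows are invented for the example.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of purchases (names and values are made up).
conn.execute(
    "CREATE TABLE purchases (customer TEXT, product TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("ann", "lotion", 4.5),
        ("ann", "vitamins", 12.0),
        ("bob", "milk", 3.0),
    ],
)

# One SQL statement groups the rows by customer and summarizes each group.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(price) "
    "FROM purchases GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)

conn.close()
```

The same SELECT statement would work largely unchanged against MySQL or another relational system, since the tabular data model is shared.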
24
25
Cloud Computing
• Standard client-server model where computing
operations don’t happen on the local (desktop)
machine.
• What’s new? Virtualization. You are not connecting to
a specific server.
– Servers are virtual
– One server can run multiple virtual machines
– One virtual machine can use multiple servers
• This makes the “machine” scalable, moveable,
configurable.
• Allows selling software, platforms, and even computing
infrastructure as a “service”
26
27
Data Mining/Machine Learning
• Machine learning uses computer algorithms to
get a machine to learn and adapt to new
information.
• Data Mining more explicitly focuses on
discovering patterns or structure in a given set
of data.
• Often used as synonyms by non-experts
without much loss of information.
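One way to picture "learn and adapt to new information" is a toy classifier whose prediction changes as more labeled examples arrive. The class below is a deliberately simple sketch (a running mean per label, with prediction by nearest mean); the class name, labels, and numbers are all assumptions made up for illustration, not a standard library API.

```python
class RunningCentroidClassifier:
    """Toy learner: keeps a running mean per label, predicts the nearest one."""

    def __init__(self):
        self.sums = {}    # label -> sum of values seen so far
        self.counts = {}  # label -> number of values seen so far

    def learn(self, x, label):
        """Update the model with one new labeled observation."""
        self.sums[label] = self.sums.get(label, 0.0) + x
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        """Return the label whose running mean is closest to x."""
        means = {lab: self.sums[lab] / self.counts[lab] for lab in self.sums}
        return min(means, key=lambda lab: abs(x - means[lab]))

model = RunningCentroidClassifier()
for x, label in [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]:
    model.learn(x, label)
before = model.predict(6.0)

# New information arrives: values near 6 turn out to be labeled "low".
for x, label in [(6.0, "low"), (7.0, "low"), (6.5, "low")]:
    model.learn(x, label)
after = model.predict(6.0)

print(before, after)
```

The model is never reprogrammed between the two predictions; only the data it has seen changes, which is the sense of "learning" on the slide.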
28
29
Web Scraping
• This is a process of collecting information from
websites and then organizing it for some sort
of analysis.
• Scraping is just about getting the data; the
analysis comes later.
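The split between getting the data and analyzing it can be sketched with Python's standard html.parser module. To stay self-contained, the example parses a small hard-coded HTML string instead of fetching a live page; the LinkCollector class and the sample page are assumptions for illustration, and the two URLs are taken from the "Learning R" slide.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the text and target of every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []    # organized output: one dict per link
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"text": text, "url": self._href})
            self._href = None

# Stand-in for a downloaded page (hard-coded so the sketch runs offline).
PAGE = """
<html><body>
  <a href="http://cran.r-project.org/manuals.html">CRAN manuals</a>
  <a href="http://www.rseek.org/">Rseek</a>
</body></html>
"""

parser = LinkCollector()
parser.feed(PAGE)
links = parser.links
for link in links:
    print(link["text"], link["url"])
```

In practice the page text would come from a request to the website (for example with urllib.request.urlopen), and the resulting list of records would then be handed to a separate analysis step.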
30
Programming Tools
– R – Statistical programming (object oriented, scripting
language)
– Python – a scripting programming language that
supports object-oriented programming, structured
programming, functional programming
– SQL – Querying and managing relational databases
– SAS – General purpose data analysis software
– Julia – High-performance language for technical computing, designed to run faster than R or Python
– Kafka and Storm – Used for real-time streaming
analysis
31