
Statistics 301-1
Fall 2015
Introduction to Data Science
Instructors:

Larry V. Hedges
Office: 2046 N. Sheridan Road
Office hours: By appointment
Email: l-hedges@northwestern.edu
Administrative Assistant: Valerie Lyne
Telephone: (847) 467-4001
Email: v-lyne@northwestern.edu

Nathan M. VanHoudnos
Office: 2046 N. Sheridan Road (#304)
Office hours: By appointment
Email: nathanvan@northwestern.edu

Teaching Assistant: Paki Reid-Brossard
Email: paki@northwestern.edu
STAT 301-1 will meet every Monday and Wednesday from 5:00-6:20 in the
statistics classroom at 2006 Sheridan Road. Additional (and optional) tutorial/review
sessions may be scheduled as needed.
Course Description
We are in an era in which the world is awash in data. More data is being created all the
time, and the rate of data creation is accelerating. Science News (May 22, 2013) reported that
90% of the data that existed then had been created in the previous two years. At the same time,
the cost of storing and manipulating data has decreased dramatically. This has created enormous
opportunities for those trained to manipulate that data and use it for practical purposes.
The skills required draw heavily from statistics and computer science and have become a new
area at the intersection of these fields, variously called data science, predictive analytics, or just
applied statistics.
This course is an introduction to fundamental topics in data science. It will give a broad
survey of methods for supervised learning and some experience with each of them, but it will not
give in-depth coverage of any of them. More advanced topics and topics in unsupervised
learning will be covered in subsequent courses (we anticipate that these will be Statistics 301-2
and Statistics 301-3), including substantial projects that will provide project-based learning, as
those projects require in-depth experience with the methods.
Learning Objectives
So that both students and the instructors are clear about what you are expected to learn from
this course, I have developed the set of learning objectives listed below:
1. Without the aid of their notes, students will be able to explain the general methods by which
data science uses statistics and computer science to develop predictive models.
2. Students will be able to define linear regression and classification methods as they are used for
prediction and demonstrate how they are used with a real dataset.
3. Students will be able to describe how subset selection, shrinkage methods, the lasso, and least
angle regression can improve prediction in regression models and apply each of these
methods to real data.
4. Students will be able to write an R or Python program to carry out a simple data analysis and
make simple modifications of existing programs to change the details of the analyses they
perform.
5. Students will be able to apply generalized additive models and regression and classification
trees to real data.
6. Students will be able to describe the idea of boosting and use the AdaBoost algorithm to
improve prediction models.
7. Students will be able to describe the idea of model averaging and describe how it is applied to
the analysis of random forests.
8. Students will be able to explain the concepts of the bias-variance tradeoff and effective
degrees of freedom for predictive models.
9. Students will be able to evaluate competing predictive models by correctly using a cross-
validation strategy.
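To give a concrete sense of the scale of objectives 3 and 9, here is a hedged illustration in Python. The course does not prescribe a particular library; scikit-learn and the synthetic dataset below are assumptions made for this sketch only.

```python
# A hedged sketch, not course material: fitting a lasso regression whose
# shrinkage penalty (alpha) is chosen by cross-validation. scikit-learn is
# an assumed tool choice; the data are synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data: 100 observations, 20 predictors, only 5 truly informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Five-fold cross-validation over a grid of alpha values (objective 9),
# yielding a sparse coefficient vector (objective 3).
model = LassoCV(cv=5).fit(X_train, y_train)

print("chosen alpha:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))
print("held-out R^2:", model.score(X_test, y_test))
```

The point of the sketch is the workflow, not the particular functions: split the data, tune the penalty by cross-validation, and evaluate on held-out observations.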
Evaluation
There will be four assignments that involve conceptual work, computation using
R or Python, or both. There will also be a final project that involves carrying out an
analysis of a dataset, developing a predictive model and describing what you
concluded from it. This class is cumulative and therefore it is essential that the first four
assignments be completed on time. Your final grade will depend on your cumulative
work on these assignments plus the final project.
Approaching This Class
You may find two things useful in approaching this course. First, the course is
cumulative. Although the individual concepts are relatively straightforward, they build
upon one another so that getting behind seriously impedes learning new concepts.
Keep up with the reading and assignments. Do the reading assigned before the date of
the class in which it is discussed. Second, take responsibility for your own learning. I
strongly suggest that you form study groups to work together on the material and
discuss readings and assignments.
Prerequisites
This course assumes a basic knowledge of distributions, estimation, and
hypothesis testing. It also assumes some experience with programming languages
like R and Python and with software for data analysis. While we will offer an introduction to
these programming languages, we will not offer a comprehensive course in either.
Students will be expected to build their skill in programming
by supplementing the instruction in this class with other resources such as practice,
reading, and online coursework like that provided by Codecademy
(https://www.codecademy.com/).
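As a rough calibration of the programming level assumed, a "simple data analysis" program (objective 4) can be as small as the following Python sketch. The dataset here is invented purely for illustration.

```python
# A minimal sketch of the kind of "simple data analysis" program this course
# expects students to write and modify. The data are invented for
# illustration; a real assignment would read a dataset from a file.
import statistics

# Hypothetical measurements: hours studied and exam scores for six students.
hours = [2, 4, 5, 7, 9, 10]
scores = [55, 60, 68, 75, 85, 88]


def pearson(x, y):
    """Pearson correlation, written out by hand to stay dependency-free."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


print("mean score:", round(statistics.mean(scores), 2))
print("score std dev:", round(statistics.stdev(scores), 2))
print("correlation(hours, scores):", round(pearson(hours, scores), 3))
```

If reading and modifying a program like this feels unfamiliar, that is a sign to start the supplementary resources mentioned above early in the quarter.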
Textbooks
The required textbook is one of the standard textbooks on data science and
machine learning, written by key innovators of these methods (denoted HTF in the
course outline):
Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning (2nd
edition). New York: Springer. Available at:
http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html
Additional Reference
Kuhn, M. & Johnson, K. (2013). Applied predictive modeling. New York: Springer.
In addition, students will use R and Python, both of which are available
for free download.
Course Outline
Week 1. (9/21 & 9/23) Introduction to data science and this course
A. Seminar and organization
B. Review of basic probability and conditional probability
Reading HTF: pp. 1 – 17
Week 2. (9/28 & 9/30) Review of probability and statistics
A. Distributions and Bayes theorem
B. Statistical decision theory
Reading HTF: pp. 18 - 22
Week 3. (10/5 & 10/7) Introduction to useful programming languages and learning tools
A. R and reproducible research
B. Introduction to the Social Sciences Compute Cluster (git, bash, ssh, etc.)
Reading: Online resources to be distributed
Week 4. (10/12 & 10/14) Introduction to supervised learning
A. Linear regression and nearest neighbor methods
B. The curse of dimensionality
Reading HTF: 9 – 39
Week 5. (10/19 & 10/21) Linear regression
A. Subset selection and shrinkage methods
B. The lasso and least angle regression
Reading HTF: 43 – 79
Week 6. (10/26 & 10/28) Linear classification
A. Linear discriminant analysis
B. Logistic regression
Reading HTF: 101 – 127
Regression Project 1 due
Week 7. (11/2 & 11/4) Generalized additive models and tree based methods
A. Generalized additive models
B. Regression and classification trees
Reading HTF: 295 – 317
Regression/classification project due
Note: No class 11/9 or 11/11.
Week 8. (11/16 & 11/18) Boosting
A. The AdaBoost algorithm
B. Gradient boosting
Reading HTF: 337 – 364
Generalized additive models project due
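To preview Week 8's theme in code: AdaBoost fits a sequence of weak learners (depth-1 decision "stumps" by default), reweighting the training points so that later learners focus on earlier mistakes. This sketch uses scikit-learn and synthetic data as assumed tool choices, not course materials.

```python
# A hedged sketch of boosting: compare a single decision stump with an
# AdaBoost ensemble of stumps. scikit-learn is an assumed tool choice.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Baseline: a single decision stump (a tree with one split).
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# Boosted ensemble of 100 stumps; each round upweights misclassified points.
boosted = AdaBoostClassifier(n_estimators=100, random_state=2).fit(
    X_train, y_train)

print("single stump accuracy:", stump.score(X_test, y_test))
print("boosted accuracy:", boosted.score(X_test, y_test))
```

The reading for this week derives why this reweighting scheme works; the sketch only shows the algorithm's interface.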
Week 9. (11/23 & 11/25) Random forests
A. Model averaging
B. Random forests
Reading HTF: 587 – 602
Boosting project due
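Week 9's theme, model averaging, can also be previewed in a few lines: a random forest averages many decorrelated trees, which typically stabilizes the predictions of any single tree. As above, scikit-learn and the synthetic data are assumptions for this sketch only.

```python
# A hedged sketch of model averaging via random forests: compare one deep
# decision tree with an average of 200 randomized trees. scikit-learn is an
# assumed tool choice; the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# A single fully grown tree: low bias, high variance.
tree = DecisionTreeClassifier(random_state=3).fit(X_train, y_train)

# A forest averages 200 trees, each grown on a bootstrap sample with a
# random subset of features considered at each split.
forest = RandomForestClassifier(n_estimators=200, random_state=3).fit(
    X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```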
Week 10. (11/30 & 12/2) Model assessment
A. Bias, variance, model complexity, and effective degrees of freedom
B. Cross validation and the bootstrap
Reading HTF: 101 – 135
Week 11 (Final Exam Week)
Final Project due December 9