Statistics 301-1
Fall 2015
Introduction to Data Science

Instructors: Larry V. Hedges and Nathan M. VanHoudnos
Office: 2046 N. Sheridan Road (Hedges); 2046 N. Sheridan Road, #304 (VanHoudnos)
Office hours: By appointment
Email: l-hedges@northwestern.edu (Hedges); nathanvan@northwestern.edu (VanHoudnos)

Administrative Assistant: Valerie Lyne
Telephone: (847) 467-4001
Email: v-lyne@northwestern.edu

Teaching Assistant: Paki Reid-Brossard
Email: paki@northwestern.edu

STAT 301-1 will meet every Monday and Wednesday from 5:00-6:20 in the statistics classroom, 2006 Sheridan Road. Additional (and optional) tutorial/review sessions may be scheduled as needed.

Course Description

We are in an era in which the world is awash in data. More data is being created all the time, and the rate of data creation is accelerating. Science News (May 22, 2013) reported that 90% of the data that existed then had been created in the previous two years. At the same time, the cost of storing and manipulating data has decreased dramatically. This presents enormous opportunities for those trained to manipulate data and use it for practical purposes. The skills required draw heavily from statistics and computer science and have become a new area at the intersection of these fields, variously called data science, predictive analytics, or simply applied statistics.

This course is an introduction to fundamental topics in data science. It will give a broad survey of methods for supervised learning and some experience with each of them, but will not give in-depth coverage of any of them. More advanced topics and topics in unsupervised learning will be covered in subsequent courses (we anticipate that these will be Statistics 301-II and Statistics 301-III), including substantial projects that will provide project-based learning, as those projects require in-depth experience with the methods.
Learning Objectives

To help students and the instructor be clear on what you are supposed to learn from this course, I have developed the set of learning objectives listed below:

1. Without the aid of their notes, students will be able to explain the general methods by which data science uses statistics and computer science to develop predictive models.
2. Students will be able to define linear regression and classification methods as they are used for prediction and demonstrate how they are used with a real dataset.
3. Students will be able to describe how subset selection, shrinkage methods, the lasso, and least angle regression can improve prediction in regression models, and apply each of these methods to real data.
4. Students will be able to write an R or Python program to carry out a simple data analysis and make simple modifications of existing programs to change the details of the analyses they perform.
5. Students will be able to apply generalized additive models and develop regression and classification trees for real data.
6. Students will be able to describe the idea of boosting and use the AdaBoost algorithm to improve prediction models.
7. Students will be able to describe the idea of model averaging and describe how it is applied in the analysis of random forests.
8. Students will be able to explain the concepts of the bias-variance tradeoff and effective degrees of freedom for predictive models.
9. Students will be able to evaluate competing predictive models by correctly using a cross-validation strategy.

Evaluation

There will be four assignments that involve conceptual work, computation using R or Python, or both. There will also be a final project that involves carrying out an analysis of a dataset, developing a predictive model, and describing what you concluded from it. This class is cumulative, so it is essential that the first four assignments be completed on time.
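Objectives 2, 4, and 9 together describe a small end-to-end exercise: fit a simple predictive model in code and judge it by cross validation. As an illustrative sketch only (not course code; the function names and data below are invented for the example), a standard-library-only Python version might look like this:

```python
# Illustrative sketch: least-squares simple linear regression scored by
# k-fold cross validation. All names and data are made up for this example.
import random

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing squared error."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def k_fold_mse(xs, ys, k=5):
    """Mean squared prediction error estimated by k-fold cross validation."""
    idx = list(range(len(xs)))
    random.Random(0).shuffle(idx)            # fixed seed: reproducible folds
    folds = [idx[i::k] for i in range(k)]
    errs = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        a, b = fit_simple_ols([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the held-out fold, never on the training points.
        errs += [(ys[i] - (a + b * xs[i])) ** 2 for i in fold]
    return sum(errs) / len(errs)

# Illustrative data: y is approximately 2 + 3x plus a little noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 + 3 * x + rng.gauss(0, 0.1) for x in xs]
a, b = fit_simple_ols(xs, ys)
mse = k_fold_mse(xs, ys)
```

The key design point, which the course returns to in Week 10, is that the error is always computed on data the model did not see during fitting.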
Your final grade will depend on your cumulative work on these assignments plus the final project.

Approaching This Class

You may find two things useful in approaching this course. First, the course is cumulative. Although the individual concepts are relatively straightforward, they build upon one another, so getting behind seriously impedes learning new concepts. Keep up with the reading and assignments, and do the assigned reading before the class in which it is discussed. Second, take responsibility for your own learning. I strongly suggest that you form study groups to work together on the material and discuss readings and assignments.

Prerequisites

This course assumes a basic knowledge of distributions, estimation, and hypothesis testing. It also assumes some experience with programming languages like R and Python and with software for data analysis. While we will offer an introduction to these programming languages, we will not offer a comprehensive course in either one. Students will be expected to build their programming skill by supplementing the instruction in this class with other resources such as experience, reading, and online coursework like that provided by Codecademy (https://www.codecademy.com/).

Textbooks

The required textbook is one of the standard textbooks on data science and machine learning, written by key innovators of these methods (denoted HTF in the course outline):

Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning (2nd ed.). New York: Springer. Available at: http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html

Additional reference:

Kuhn, M. & Johnson, K. (2013). Applied predictive modeling. New York: Springer.

In addition, students will use R and Python, both of which are available for free download.

Course Outline

Week 1 (9/21 & 9/23): Introduction to data science and this course
A. Seminar and organization
B.
Review of basic probability and conditional probability
Reading: HTF pp. 1-17

Week 2 (9/28 & 9/30): Review of probability and statistics
A. Distributions and Bayes theorem
B. Statistical decision theory
Reading: HTF pp. 18-22

Week 3 (10/5 & 10/7): Introduction to useful programming languages and learning tools
A. R and reproducible research
B. Introduction to the Social Sciences Compute Cluster (git, bash, ssh, etc.)
Reading: Online resources to be distributed

Week 4 (10/12 & 10/14): Introduction to supervised learning
A. Linear regression and nearest neighbor methods
B. The curse of dimensionality
Reading: HTF pp. 9-39

Week 5 (10/19 & 10/21): Linear regression
A. Subset selection and shrinkage methods
B. The lasso and least angle regression
Reading: HTF pp. 43-79

Week 6 (10/26 & 10/28): Linear classification
A. Linear discriminant analysis
B. Logistic regression
Reading: HTF pp. 101-127
Regression Project 1 due

Week 7 (11/2 & 11/4): Generalized additive models and tree-based methods
A. Generalized additive models
B. Regression and classification trees
Reading: HTF pp. 295-317
Regression/classification project due
(Note: No class 11/9 or 11/11)

Week 8 (11/16 & 11/18): Boosting
A. The AdaBoost algorithm
B. Gradient boosting
Reading: HTF pp. 337-364
Generalized additive models project due

Week 9 (11/23 & 11/25): Random forests
A. Model averaging
B. Random forests
Reading: HTF pp. 587-602
Boosting project due

Week 10 (11/30 & 12/2): Model assessment
A. Bias, variance, model complexity, and effective degrees of freedom
B. Cross validation and the bootstrap
Reading: HTF pp. 101-135

Week 11 (Final Exam Week): Final Project due December 9
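As a preview of the Week 8 topic (and learning objective 6), the AdaBoost idea fits in a few dozen lines: repeatedly fit a weak learner to reweighted data, upweighting the points the previous learner got wrong, and combine the learners by a weighted vote. The sketch below is an illustrative toy, not course code; the decision-stump representation and the data are invented for the example:

```python
# Illustrative toy: AdaBoost with one-dimensional decision stumps.
# A stump predicts s if x > t, else -s; labels are in {-1, +1}.
import math

def best_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the best stump."""
    best = None
    for t in sorted(set(xs)):
        for s in (1, -1):
            pred = [s if x > t else -s for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds=20):
    n = len(xs)
    w = [1 / n] * n                      # start with uniform weights
    ensemble = []                        # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard the log below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Upweight the points this stump misclassified, then renormalize.
        w = [wi * math.exp(-alpha * y * (s if x > t else -s))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x > t else -s) for a, t, s in ensemble)
    return 1 if score > 0 else -1

# Toy labels forming an interval pattern no single stump can fit.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [-1, -1, 1, 1, 1, 1, -1, -1]
fitted = adaboost(xs, ys, rounds=20)
preds = [predict(fitted, x) for x in xs]
```

A single stump can get at most six of these eight points right, but the boosted committee fits the interval pattern exactly, which is the point of the algorithm: weak learners combine into a strong one.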