BUS 212f (2) ANALYZING BIG DATA II
Spring 2016, Tuesdays 6:30–9:20 pm
Sachar 116 (International Hall)

Prof. Robert Carver
781-775-5493 (mobile)
rcarver@brandeis.edu
Office: Sachar 1B (far end of computer cluster)
Hours: Tuesdays, 4:00–5:30 and by appointment
TAs: Darsei Canhasi, Shantanu Livania, Tamsa Sabat

Overview
This is a two-credit module that is a continuation of BUS 211f. This module provides theoretical and hands-on instruction in three major elements of Big Data analytics: management-oriented visualizations, data mining, and predictive modeling. Through the use of widely adopted software tools, students will build models and execute analyses to address current needs of selected Brandeis administrative offices as well as solve problems presented in cases. Assignments and classroom time will be devoted both to analysis of current developments in business analytics and to gaining experience with current tools.

Required Readings
Provost, Foster & Fawcett, Tom. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. (2013, Sebastopol, CA: O’Reilly Media) ISBN 978-1449361327. Purchase at Bookstore or on-line.
There is a required on-line course pack available for purchase at the Harvard Business Publishing website. A direct link is available on LATTE. See last page of Syllabus for course pack contents.
Other readings as posted on LATTE site.

Recommended Readings
Berry, M. and Linoff, G. Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. 3rd ed. (2011, Wiley). Available on-line through LTS. Ebook ISBN 9781118087459.
Hastie, T., Tibshirani, R. and Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (2001, Springer). Available in library main stacks; pdf of new edition available for download at http://www-stat.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Prerequisites
BUS 211f or permission of instructor.
Learning Goals and Objectives
Upon successful completion of this module, students will:
- Understand the challenges of performing a business needs assessment to determine how analytics and visual displays can provide business value
- Be able to use training, validation, and test datasets to carry out data mining analyses
- Use common techniques such as multiple regression, partition trees, and k-means clustering to develop predictive models
- Apply best practices of predictive modeling to real and realistic business problems
- Design informational graphics and displays grounded in concepts of business needs and principles of human cognitive processes

Course Approach
Analysis of massive, real-time data is rapidly gaining prominence in numerous industries, with applications ranging from fraud detection to consumer behavior. As in the predecessor course (BUS 211f), BUS 212f uses theory, cases, and hands-on analysis to approach course topics. In six short weeks, we can only dive so deep; we aim for depth in a carefully selected list of topics rather than breadth. Students should expect to grapple with complex software-based analyses that do not lend themselves to quick, easy solutions.

Communications
We’ll make regular use of LATTE. All lecture notes, handouts, assignments, and supporting materials will be available via LATTE, and any late-breaking news will reach you via email. Please check your Brandeis email and the LATTE site regularly to keep apprised of important course-related announcements.

Other Course Technology
All of the software we will use in this course can be accessed on the public computer clusters at IBS and/or on your personal laptops. If you do use a laptop, the class schedule below indicates dates when it will be useful to have it with you. As in BUS 211f, we will make use of proprietary and public-use databases accessible through the World Wide Web.
We’ll continue to use some of the tools we adopted in that course as well as R for most of our analysis. You should bring a laptop to each class.

R: R is a free software environment for statistical computing and graphics, and is widely used in both academia and industry. R runs on both Windows and Mac OS. It ranked No. 1 in the KDnuggets 2013 poll on top languages for analytics, data mining, and data science. RStudio is a user-friendly environment for R that has become popular.
R Software: http://www.r-project.org/index.html
RStudio: http://www.rstudio.com/products/RStudio/#Desk

GitHub is also a free environment that facilitates (a) collaborative work and (b) version control for software projects that are under development. It is very widely used by data scientists to manage and share their work.

Student Classroom Contributions
Class participation is important in this course both as a means of developing understanding and as an indicator of student progress. Participation can take many forms, and each student is expected to contribute actively, freely, and effectively to the classroom experience by raising questions, demonstrating preparedness and proficiency in the analysis of problems and cases, and explaining the implications of particular analyses in context. Homework-based discussion and presentations are an important part of participation. To this end, regular class attendance is required, and students should use name cards. We meet only six times, so absence can become a serious problem. Even if you must arrive late or leave early, be here.

With assistance from the TAs, I will evaluate the quality of your contributions in class each evening, as well as the quality of your contributions via email, LATTE discussion, etc. These will all be factored together in determining your ultimate Contributions grade (see below). In general, absence from class reduces your contribution grade.
Written Assignments and Projects
Students will complete five analytic assignments during the course. Three of these will be brief analyses, requiring both computer modeling and writing. These may be completed with one or two partners, and each student should expect to briefly discuss one of their work products in class. Two other written assignments will be two phases of a single project requiring more significant time and analysis. The project assignments will be prepared in teams of four students, and will include written and computer-based elements. Owing to the size of the class, students will have only limited opportunities to present parts of their projects orally in the course.

All assignments should be submitted via LATTE upload prior to the start of class. Papers should be professional in appearance and use clear, grammatically correct business English. Analytical work (graphs, tables, and other output) should be incorporated seamlessly into the written document, showing readers exactly and only what you want them to see.

Evaluation
Your final grade in the course will be computed using these weights:

Contributions to Class Discussions    15%
Brief analyses (3)                    35%
Projects (2 parts)                    50% (please note!)
TOTAL                                100%

Academic Integrity
You are expected to be honest in all of your academic work. Please consult Brandeis University Rights and Responsibilities for all policies and procedures related to academic integrity. Students may be required to submit work to TurnItIn.com software to verify originality. Allegations of academic dishonesty will be forwarded to the Director of Academic Integrity. Sanctions for academic dishonesty can include failing grades and/or suspension from the university. Citation and research assistance can be found at LTS - Library guides.

Disabilities
If you are a student with a documented disability on record at Brandeis and wish to have a reasonable accommodation made for you in this class, please see me immediately.
Study Groups
Working with one or two partners is an excellent way to gain understanding of this subject. I encourage small groups to work on assignments, with a few caveats:
- Be sure that you are neither carrying nor being carried by the group; each member of the group is entitled to learn and expected to contribute.
- Except for the group project, each student is responsible for turning in original memos and problem sets.
- Each group member retains the right to “go it alone.” Joining a group is not a marriage. Similarly, teams are encouraged to dismiss underperforming members.

Course Outline
Note: for each session, you should complete the assigned reading before coming to class. See the list of deliverables below; detailed assignments will be distributed in class each week, and all assignments and handouts will also be available on our LATTE site. The abbreviation “P&F” refers to the Provost and Fawcett book. Deliverables are due by class time.

Session 1, March 15: Starting at the End: Visualizations to Support Business Intelligence
READINGS: Russom, Big Data Analytics (2013, on LATTE); P&F, Chapters 1 & 2; Watson, “All about Analytics”; Leek & Peng, “What is the Question?”
Topics:
a. Course introduction and objectives
b. Relationship of Business knowledge and Big Data Analytics
c. Data Mining Process (overview)
d. Introduce/Review R & RStudio
Deliverable: (none)

Session 2, March 22: Decision Trees & Logistic Regression
READINGS: P&F, Chaps 3 & 4; Loh, “Classification and Regression Trees” (LATTE)
CASE READING: A Game of Two Halves: In-Play Betting in Football
Topics:
a. Supervised Segmentation
b. Theory: Decision trees and concepts of Logistic Regression (simple/multinomial logistic)
c. Application: Game of Two Halves
Deliverable: Analysis 1 (R data analysis)

Session 3, March 29: Classification Models and Model Performance
READINGS: P&F, Chap 5
CASE READING: Heterogeneity of Movement (posted on LATTE)
Topics:
a. Classification models with regression
b. Training & Validation
c. Confusion Matrix to assess model performance
Deliverable: Analysis 2 (Game of Two Halves)

Session 4, April 5: Association Rules
READINGS: P&F, Chaps 6–8; “Cluster Analysis for Segmentation”
Recommended: Hastie & Tibshirani (parts of 13 & 14, on LATTE)
Topics:
a. Project 1 Debriefing
b. Clustering methods
c. Unsupervised Data Mining: Association Rules/Market Basket Analysis
Deliverable: Project 1 (Heterogeneity of Movement)

Session 5, April 12: Basics of Text Mining
READINGS: P&F, Chap 10
CASE READING: Job Salary Prediction (LATTE)
Topics:
a. Text Mining basics
b. Word clouds in R
c. Initial analysis of Job Salary data

Session 6, April 19: Review, Summary & Project
READINGS: P&F, Chaps 11 & 12; Zhao (R & Data Mining) Chapter 9; Nolan & Temple Lang, “Exploring Data Science Jobs with Web Scraping” (LATTE); Project 2 instructions: Job Salary Prediction
Topics:
a. Brief project-2 discussion
b. Debrief Analysis 3
c. Scraping the Web for Data
d. Other application areas and challenges
e. Developing models with Business Value
Deliverable: Analysis 3 (Job Salary Part 1)

Tuesday, May 3: No Class Session this week
Final project due before this date. Graduating students are encouraged to submit early.
Deliverable: Project 2 (Job Salary Prediction)

Brief Description of Assignments (complete assignment details to be distributed in class):
Analysis 1: Introduction to Modeling with R and RStudio
Analysis 2: Build a model to support In-Game Betting in Football (soccer)
Analysis 3: Text analysis of Job Salary Prediction Data
Project 1: Heterogeneity of Movement
Project 2: Job Salary Prediction

Supplementary Readings and Cases (chronologically during course):
Those in bold-face are in the Harvard Business Publishing on-line course pack.

Russom, P. (2011) “Big Data Analytics.” TDWI Best Practices Report.
Watson, H. (2013) “All about Analytics.” International Journal of Business Intelligence Research, January-March, Vol. 4, No. 1.
Leek, J. and Peng, R. (2015) “What is the Question?” Sciencexpress. Published online 26 February: 10.1126/science.aaa6146.
Loh, Wei-Yin (2011) “Classification and Regression Trees.” WIREs Data Mining and Knowledge Discovery, Wiley.
Kumar, U., Sandeep, V. and Satyabala (2013) “A Game of Two Halves: In-Play Betting in Football” (IMB-401). Indian Institute of Management–Bangalore.
“Heterogeneity of Movement” case: Inspired by entry in the U.C. Irvine Machine Learning Repository (2015). Online: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition.
Venkatesan, Rajkumar (2014) “Cluster Analysis for Segmentation” (UV0745-PDF-ENG). Darden School of Business.
“Job Salary Prediction” case: Inspired by Kaggle Competition (2013). Online: https://www.kaggle.com/c/job-salary-prediction.
“Exploring Data Science Jobs with Web Scraping” (2015). Based on Nolan, D. and Temple Lang, D., Data Science in R, Chapter 12. Boca Raton, FL: CRC Press.

Rev. 01/2016