Revised: Jan 23, 2015

Stevens Institute of Technology
Howe School of Technology Management
Syllabus
BIA 656 Statistical Learning and Analytics

Fall 2014
Germán Creamer, Babbio 637
gcreamer@stevens.edu
6.15-8.45PM, Babbio 320
Office Hours: M, 1.30 PM – 3 PM. Also by appointment
Course Room/Web Address: Babbio 320/ http://www.stevens.edu/moodle

Overview
The significant amount of corporate information available requires a systematic and analytical approach to select the most important information and anticipate major events. Machine learning algorithms facilitate this process by understanding, modeling, and forecasting the behavior of major corporate variables.
This course introduces statistical and graphical (machine learning) models used for inference and prediction. The emphasis of the course is on the learning capability of the algorithms and their application to several business areas.
Prerequisites: Basic course in probability and statistics at the level of MGT 620 or BIA 652 Multivariate data analytics.

Prerequisites
Admission requirements for the BI&A program.

Course Objectives
Students will:
- Learn the fundamental concepts of statistical learning algorithms.
- Explore existing and new applications of statistical learning methods to business problems and to generic classification problems.
- Learn to solve analytical problems in groups and effectively communicate the results.

Relationship of Course to Rest of Curriculum
Students will have the opportunity to explore the main concepts of statistical learning that will be used in the applied modules of this program.

List of Course Outcomes:
By the end of this course, the students will be able to:
1. Understand the foundations of statistical learning algorithms.
2. Apply statistical models and analytical methods to several business domains using a statistical language.
3. Recognize the value and also the limits of statistical learning algorithms to solve business problems.
Additional learning objectives include the development of:
1. Written and oral communication skills: students are required to communicate properly during the class discussions and project class presentations. Homework and the project report should be presented "as if" they were submitted to a senior manager of a major corporation.
2. The ability to solve a major analytical problem using large and heterogeneous datasets in a group project and to communicate the results in a professional way.

Pedagogy
The class will combine class presentations, discussions, exercises and case analysis to motivate students and train them in the appropriate use of statistical and econometric techniques.

Readings

Required Text
Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013. (code to get a discount on oreilly.com: AUTHD)
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. (Amazon.com sells the paperback version (2013))

Case
Pilgrim Bank A (602104), Harvard Business School
You must register on the following website, buy the case, and download the related documents: https://cb.hbsp.harvard.edu/cbmp/access/28615189

Optional Texts
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning. Springer-Verlag, New York, 2010 (selected sections) (downloadable at http://www-stat.stanford.edu/~tibs/ElemStatLearn/).
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, Cambridge University Press. 2008 (downloadable at http://nlp.stanford.edu/IR-book).
R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley & Sons, 2001.
Tom M. Mitchell, Machine Learning, McGraw-Hill Series in Computer Science, 1997.
Vasant Dhar and Roger Stein. Seven methods for transforming corporate data into business intelligence. Upper Saddle River: Prentice Hall. 1997.

Additional Free Texts
A. Rajaraman and J. Ullman, Mining of Massive Datasets (very useful for big data problems)
Mohammed Zaki and Wagner Meira Jr., Data Mining and Analysis: Fundamental Concepts and Algorithms (draft)
StatSoft, Electronic Statistics Textbook (statistics and data mining)
Roberto Battiti and Mauro Brunato, LIONbook: Learning and Intelligent Optimization (introductory)

Assignments
The course will have a main project and 4 assignments/cases of data analysis. The assignments must be submitted electronically through the course website before the beginning of the class on the assigned day. Each student must submit his/her own report. You should also include the Readme, log and code files if you used a script or wrote a program. E-mail submissions will not be accepted. Each assignment has a value of 5 points.

Project
The project requires that participants build a decision support system (DSS) based on one of the methods explored in this course. Each project must be developed by a group of three students, and each group should present a project proposal in the middle of the semester.

Grades
Assignment            Grade %
Assignments/cases     20%
Team project          30%
Participation         10%
Final exam            40%
Total Grade           100%

Software
Python is the main software package that will be used. You should participate in the Python bootcamp offered by the school at the beginning of the semester.

Class policy
Late Policy: 1 point lost for each day late. No assignments accepted after 3 days.
Cooperation: You are allowed to discuss lecture and textbook materials, and how to approach assignments. You cannot share ideas in any written form: code, pseudocode or solutions. You cannot submit someone else's work found on the internet or in any other source, or a modification of that work, with or without that person's knowledge, regardless of the circumstances under which it was obtained, copied, or modified. Of course, no cooperation is allowed during exams.
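As an illustration only, the grade weights and the late policy stated above can be sketched in Python, the course's main software. The function names are hypothetical, and two points are assumptions that the syllabus does not state: lateness is counted in whole days, and a late assignment cannot score below zero.

```python
# Grade weights from the syllabus, as a percentage of the final grade.
WEIGHTS = {"assignments": 20, "project": 30, "participation": 10, "final_exam": 40}

ASSIGNMENT_POINTS = 5  # each of the 4 assignments is worth 5 points
MAX_DAYS_LATE = 3      # no assignments accepted after 3 days


def assignment_score(raw_points, days_late=0):
    """Apply the late policy: 1 point lost per day late, nothing after 3 days.

    Assumptions (not in the syllabus): days_late is a whole number of days,
    and the penalized score is floored at zero.
    """
    if not 0 <= raw_points <= ASSIGNMENT_POINTS:
        raise ValueError("raw_points must be between 0 and 5")
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > MAX_DAYS_LATE:
        return 0  # submission is no longer accepted
    return max(0, raw_points - days_late)


def course_total(assignment_scores, project, participation, final_exam):
    """Combine the components into a total out of 100.

    assignment_scores: penalized scores of the 4 assignments (0-5 each).
    project, participation, final_exam: fraction earned, between 0 and 1.
    """
    if len(assignment_scores) != 4:
        raise ValueError("the course has 4 assignments")
    return (
        sum(assignment_scores)
        + WEIGHTS["project"] * project
        + WEIGHTS["participation"] * participation
        + WEIGHTS["final_exam"] * final_exam
    )
```

Under these assumptions, a full-credit assignment submitted two days late would score 3 of its 5 points, and one submitted four days late would score 0.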
Re-grades: If you dispute the grade received for an assignment, you must submit, in writing, your detailed and clearly stated argument for what you believe is incorrect and why. This must be submitted by the beginning of the next class after the assignment was returned. Requests for a re-grade after the beginning of class will not be accepted. A written response will be provided by the next class indicating your final score. Be aware that a request for a re-grade of a specific problem can result in a re-grade of the entire assignment. This re-grade and written response are final; no additional re-grades or debate for that assignment will be allowed.

Ethical Conduct
The following statement is printed in the Stevens Graduate Catalog and applies to all students taking Stevens courses, on and off campus.
“Cheating during in-class tests or take-home examinations or homework is, of course, illegal and immoral. A Graduate Academic Evaluation Board exists to investigate academic improprieties, conduct hearings, and determine any necessary actions. The term ‘academic impropriety’ is meant to include, but is not limited to, cheating on homework, during in-class or take home examinations and plagiarism.”
Consequences of academic impropriety are severe, ranging from receiving an “F” in a course, to a warning from the Dean of the Graduate School, which becomes a part of the permanent student record, to expulsion.
Reference: The Graduate Student Handbook, Academic Year 2003-2004 Stevens Institute of Technology, page 10.
Consistent with the above statements, all homework exercises, tests and exams that are designated as individual assignments MUST contain the following signed statement before they can be accepted for grading.
____________________________________________________________________
I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination.
I further pledge that I have not copied any material from a book, article, the Internet or any other source except where I have expressly cited the source.
Signature _________________________ Date: _____________

Please note that assignments in this class may be submitted to www.turnitin.com, a web-based anti-plagiarism system, for an evaluation of their originality.

Course/Teacher Evaluation
Continuous improvement can only occur with feedback based on comprehensive and appropriate surveys. Your feedback is an important contributor to decisions to modify course content/pedagogy, which is why we strive for 100% class participation in the survey.
All course teacher evaluations are conducted on-line. You will receive an e-mail one week prior to the end of the course informing you that the survey site (https://www.stevens.edu/assess) is open, along with instructions for accessing the site. Login using your Campus (email) username and password. This is the same username and password you use for access to Moodle. Simply click on the course that you wish to evaluate and enter the information. All responses are strictly anonymous. We especially encourage you to clarify your position on any of the questions and give explicit feedback on your overall evaluations in the section at the end of the formal survey which allows for written comments. We ask that you submit your survey prior to the end of the examination period.

COURSE SCHEDULE

8/25
Topic(s): Introduction to data science and data analytic thinking
Reading(s): PF, ch. 1 and 2

9/8
Topic(s): Predictive modeling
Reading(s): PF, ch. 3; B, 1.3, 1.4, 1.5

9/15
Topic(s): From correlation to supervised segmentation
Reading(s): B, 1.6, 14.4. Optional reference: HTF, ch. 9.2

9/22
Topic(s): Linear models
Reading(s): PF, ch. 4; B, 3.1, 4.1.1-4.1.3, 4.3.2

9/29
Topic(s): Support vector machines
Reading(s): B, 6.1, 6.2, 7.1. Optional references: HTF, ch. 12; MRS, ch. 15

10/6
Topic(s): Model performance analysis
Reading(s): PF, ch. 5, 7 and 8

10/14 (Tuesday)
Topic(s): Graphical models
Reading(s): PF, ch. 9; B, 1.2. Optional references: HTF, ch. 8.3-8.4; MRS, ch. 11, 13

10/20
Topic(s): Graphical models. Relational learning: Bayesian models
Reading(s): B, ch. 8

10/27
Topic(s): Application to marketing: Targeting consumers
Reading(s): PF, ch. 11; B, 11.1, 11.2, 11.3

11/3
Topic(s): Sequential data (time series): Markov decision processes: reinforcement learning, time series, application to trading
Reading(s): B, 13.1; http://www1.icsi.berkeley.edu/~moody/MoodySaffellTNN01.pdf

11/10
Topic(s): Sequential data (time series): Hidden Markov models
Reading(s): B, 13.2

11/17
Topic(s): Mean variance decomposition. Combining models: Ensemble methods
Reading(s): B, 3.2, 14.2-14.3; PF, ch. 12. Optional references: ADTrees, Bagging, Random Forests

11/24
Topic(s): Combining models
Reading(s): HTF, 8.7, 10.1, 15.1-15.3, 16; B, 14.1, 14.4, 14.5

12/1
Topic(s): Application to finance: algorithmic trading, mixed trading strategies. Final presentations
Reading(s): Creamer, Model calibration…, Quantitative.

Hwks and cases, in schedule order:
Case Pilgrim Bank 1st part
Hwk 1: Python
Project proposal
Hwk 2: classification
Hwk 3: Case Pilgrim Bank I discussion
Case Pilgrim Bank 2nd part
Hwk 4: Case Pilgrim Bank II discussion
Final project report (12/5/2014)

PF: Provost and Fawcett, Data Science for Business
B: C. Bishop, Pattern Recognition and Machine Learning
Optional readings:
HTF: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning. 2010
MRS: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schutze, Introduction to Information Retrieval, 2008.