Stevens Institute of Technology
Howe School of Technology Management
Business Intelligence & Analytics Program

Syllabus

BIA 676: Data Streams Analytics: Internet of Things
Fall 2014

Instructor: M. Daneshmand
Mahmoud.daneshmand@stevens.edu
Office Hours: Wednesdays 5-6 PM, Babbio 303 B
Class Website: http://www.stevens.edu/moodle http://webct.stevens.edu

Overview

The objective of this course is to study emerging online analytical tools & techniques for management & mining of massive data streams for real-time intelligent decisions. Topics include: an overview of IoT architecture, the stream data model, data streams quality, bounds of random variables, sufficient statistics, synopsis & sampling techniques, sliding windows, computing the entropy in streams, drifting concepts, drift detection, change detection, outliers & anomaly detection, data streams histograms and density, change detection in histograms, clustering data streams, decision trees and classification methods in data streams, and time series in data streams. Emphasis will be on both methods and practice.

Since the area is very new, no textbook exists for the course. Several books containing collections of papers written by leaders in the field will be used to drive lectures. A distinctive feature of the course is that students will be engaged in lectures by conducting literature research and by writing and presenting a “survey paper” on emerging topics. Resources such as books, papers, and case studies will be provided so that students can pick topics of interest, work on a “survey paper” throughout the semester, and present it according to a mutually agreed schedule. These “survey papers” prepare students to conduct research in the wide-open area of “Stream Data Analytics”.

Prerequisites: MIS 637 (Knowledge Discovery from Databases).
Introduction to Course

In recent years, progress in sensor technologies, RFID (Radio Frequency Identification) tags, smart phones, and other smart devices has made it possible to measure, record, and report large streams of transactional data in real time. Such data sets, which grow continuously and rapidly over time, are referred to as Big Data Streams. Analysis of streaming data poses a number of unique challenges that are not easily solved through direct application of well-known data mining methods and algorithms developed for traditional static data.

This course will serve as a first course in the emerging field of “Data Streams Analytics”. It will provide an introduction to IoT, sensors & devices, the architecture and environment in which these devices generate data streams, data quality & data cleaning, data acquisition, and emerging methodologies and algorithms for knowledge discovery from data streams. Topics include: synopsis & sampling techniques, sliding windows, computing the entropy in streams, data streams correlations, change detection, and outliers & anomaly detection.

Learning Goals

At the end of the course, students will be able to:
1. Understand the infrastructures, in particular the IoT infrastructure, that generate data streams.
2. Formulate the differences between traditional “static” data analytics and emerging “streaming” data analytics.
3. Manage data quality and conduct data cleaning specific to IoT and streaming data.
4. Conduct predictive analysis, change detection, and extreme value, anomaly, and outlier detection.
5. Select and execute emerging algorithms specific to streaming data.
6. Extract knowledge and intelligence from data streams that exhibit high volume, velocity, and/or variety.
7. Have a good foundation to conduct research on streaming data analytics towards a PhD thesis.
Additional learning objectives include the development of:

Written and oral communications skills: the individual project proposal, execution, and documentation will be used to assess written skills, and the final presentations will be used to assess oral skills.

Pedagogy

The course will employ lectures, class discussion, in-class individual and team assignments, and individual and team homework and projects. Students will make presentations during the class. Each student will develop and execute an End-to-End Knowledge Discovery in Databases project during the semester using a real-world data set. The result is documented as a research project and presented to the class.

Required Text(s)

There will not be a formal textbook. Several books have been published over the last two years, none of which is written as a textbook. Most of these are edited books with a collection of papers. In addition to lecture notes, the following two books will be used as primary resources:
1. Knowledge Discovery from Data Streams, Joao Gama, CRC Press, 2010
2. Mining of Massive Datasets, A. Rajaraman, J.D. Ullman, Stanford University, Cambridge University Press, 2012

Additional Resources and Recommended Readings:
1. Managing and Mining Sensor Data, Edited by Charu C. Aggarwal, IBM Research, Springer 2013: http://link.springer.com/chapter/10.1007/978-1-46146309- 12
2. Outlier Analysis, Charu C. Aggarwal, IBM Research, Springer 2013
3. Data Streams: Models and Algorithms, Edited by Charu C. Aggarwal, IBM Research, Springer 2007
4. Data Stream Mining: A Practical Approach, by Albert Bifet, Geoff Holmes, Richard Kirkby and Bernhard Pfahringer, May 2011, The University of Waikato

A bulk package of selected articles taken from well-known magazines and international scientific journals will be distributed to the students as further reading material. Students will also have access to all lecture slides.
Grading Percentages:

Assignment(s) | Grade Percent
Homework | 10%
Mid-term | 15%
Final | 15%
Survey/Research/Project Paper and Presentation* | 60%
Total Grade | 100%

Survey Paper and Presentation*: Each student will conduct a literature review, then write and present a “survey paper” on a topic of his/her interest in the emerging field of “Streaming Data Analytics”.

Ethical Conduct

The following statement is printed in the Stevens Graduate Catalog and applies to all students taking Stevens courses, on and off campus.

“Cheating during in-class tests or take-home examinations or homework is, of course, illegal and immoral. A Graduate Academic Evaluation Board exists to investigate academic improprieties, conduct hearings, and determine any necessary actions. The term ‘academic impropriety’ is meant to include, but is not limited to, cheating on homework, during in-class or take home examinations and plagiarism.”

Consequences of academic impropriety are severe, ranging from receiving an “F” in a course, to a warning from the Dean of the Graduate School, which becomes a part of the permanent student record, to expulsion.

Reference: The Graduate Student Handbook, Academic Year 2003-2004 Stevens Institute of Technology, page 10.

Consistent with the above statements, all homework exercises, tests and exams that are designated as individual assignments MUST contain the following signed statement before they can be accepted for grading.

____________________________________________________________________
I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination. I further pledge that I have not copied any material from a book, article, the Internet or any other source except where I have expressly cited the source.

Signature ________________ Date: _____________

Please note that assignments in this class may be submitted to www.turnitin.com, a web-based anti-plagiarism system, for an evaluation of their originality.
Course Schedule

Week | Topic(s) | Readings & Cases | Homework | Expected Outcomes
Week 1 | Course Orientation; Introduction to the topic and some of the primary sources | Lecture slides | |
Week 2 | Introduction to IoT, Sensors, and RFID Tags, Data Privacy & Security | Chapter 12 of Resources 1 | |
Week 3 | Stream Data Model | 2.1 of 1 and 4.1 of 2 | From Ch 4 of 2 | Formulate data streams
Week 4 | Bounds of Random Variables | 2.2 of 1 | From Ch 2 of 1 | Calculate important bounds
Week 5 | Sufficient Statistics | 2.2 of 1 | From Ch 2 of 1 | Calculate important summaries
Week 6 | Synopsis & Sampling Techniques | Chapter 4 of 2 | From Ch 4 of 2 | Practice selecting representative samples
Week 7 | Mid-Term | | |
Week 8 | Sliding Windows | Chapter 2 of 1 | From Ch 8 of 5 | Optimal time horizon
Week 9 | Computing the Entropy in Streams | 2.3 of 1 | From Ch 2 of 1 | Get ready for change detection
Week 10 | Drifting Concept, Drift Detection, Change Detection, Outliers | Ch 3 of 1 | From Ch 3 of 1 | Test any concept drift
Week 11 | Data Streams Histograms and Density, Change Detection in Histograms | Ch 4 of 1 | From Ch 4 of 1 | Change detection
Week 12 | Clustering Data Streams | Ch 6 of 1 | From Ch 6 of 1 | Practice clustering algorithm
Week 13 | Decision Trees and Classification Methods in Data Streams | Ch 8 of 1 | From Ch 8 of 1 | Practice CT algorithm
Week 14 | Final Exam | | |
Week 15 | Students' Survey Paper Presentations | TBD | From Resources 1-3 above | Conduct and present an E2E KD in Data Streams

CT = Classification Tree
KD = Knowledge Discovery
SP = Selected Problems
E2E = End to End