
Stevens Institute of Technology
Howe School of Technology Management
Business Intelligence & Analytics Program
Syllabus
BIA 676: Data Streams Analytics: Internet of Things
Fall, 2014
Instructor: M. Daneshmand
Mahmoud.daneshmand@stevens.edu
Office Hours: Wednesdays 5-6 PM
Babbio 303 B
Class Website:
http://www.stevens.edu/moodle
http://webct.stevens.edu
Overview
The objective of this course is to study emerging online analytical tools & techniques for
management & mining of massive data streams for real-time intelligent decisions. Topics
include: an overview of IoT architecture, the stream data model, data streams quality,
bounds of random variables, sufficient statistics, synopsis & sampling techniques, sliding
windows, computing the entropy in streams, drifting concepts, drift detection, change
detection, outliers & anomaly detection, data streams histograms and density, change
detection in histograms, clustering data streams, decision trees and classification methods
in data streams, time series in data streams.
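As an illustration of two of the topics listed above (sliding windows and computing the entropy in streams), the following is a minimal Python sketch, not taken from the course materials. The class name `SlidingWindowEntropy` and its parameters are hypothetical. It keeps only the last `size` symbols and their counts, so memory stays bounded however long the stream runs.

```python
import math
from collections import Counter, deque
from typing import Deque, Hashable


class SlidingWindowEntropy:
    """Shannon entropy (in bits) of the most recent `size` symbols of a stream."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.window: Deque[Hashable] = deque()  # symbols currently in the window
        self.counts: Counter = Counter()        # frequency of each symbol in the window

    def update(self, symbol: Hashable) -> float:
        """Add one symbol, evict the oldest if the window is full, return the entropy."""
        self.window.append(symbol)
        self.counts[symbol] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        n = len(self.window)
        # H = sum over symbols of p * log2(1 / p), with p = count / n
        return sum((c / n) * math.log2(n / c) for c in self.counts.values())


# Hypothetical usage: track the entropy of a short symbol stream.
tracker = SlidingWindowEntropy(size=4)
for s in "aaaabbab":
    print(s, round(tracker.update(s), 3))
```

A rise in the windowed entropy is one simple signal that the distribution of the stream may be changing. Recomputing the sum on every update costs time proportional to the number of distinct symbols in the window, which is acceptable for a sketch but not for very large alphabets.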
Emphasis will be on both methods and practice. Since the area is very new, no textbook
exists for the course. Several books with collections of papers written by leaders of the field
will be used to drive lectures. A distinct feature of the course is that students will be
engaged in lectures through conducting literature research and writing and presenting a
“survey paper” on emerging topics. Resources, such as books, papers, and case studies,
will be provided so that students can pick their topics of interest, work on a “survey
paper” throughout the semester, and present it according to a mutually agreed schedule.
These “survey papers” prepare students for conducting research in the wide-open
area of “Stream Data Analytics”.
Prerequisites: MIS 637 (Knowledge Discovery from Databases).
Introduction to Course
In recent years, the progress in sensor technologies, RFID (Radio Frequency
Identification) tags, smart phones and other smart devices has made it possible to
measure, record, and report large streams of transactional data in real time. Such data
sets, which continuously and rapidly grow over time, are referred to as Big Data Streams.
Analysis of streaming data poses a number of unique challenges which are not easily
solved through direct applications of well-known data mining methods and algorithms
developed for traditional static data. This is a first course in the
emerging field of “Data Streams Analytics”. It provides an introduction to IoT,
sensors and devices, the architecture and environment in which these devices generate data
streams, data quality and data cleaning, data acquisition, and emerging methodologies
and algorithms for knowledge discovery from data streams. Topics include: synopsis &
sampling techniques, sliding windows, computing the entropy in streams, data streams
correlations, change detection, outliers & anomaly detection.
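To make the contrast with static data concrete, here is a minimal sketch of reservoir sampling, one well-known sampling technique for streams. It is offered only as an illustration and is not drawn from the course materials. The function name `reservoir_sample` and the simulated sensor readings are hypothetical. The sketch draws a uniform random sample of `k` items in a single pass, without knowing the stream length in advance and without storing the stream.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)   # the first k items fill the reservoir
        else:
            j = rng.randrange(n)     # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item  # the n-th item is kept with probability k / n
    return reservoir


# Hypothetical usage: keep 5 readings from a simulated sensor stream.
readings = (20.0 + 0.01 * i for i in range(10_000))
sample = reservoir_sample(readings, k=5, rng=random.Random(42))
print(sample)
```

Memory use is fixed at `k` items regardless of how many readings arrive, which is the property that lets a sample stand in for a stream too large to store.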
Learning Goals
At the end of the course, students will be able to:
1. Understand the infrastructures, and in particular the IoT infrastructure, that generate
data streams.
2. Formulate the differences between traditional “static” data analytics and
emerging “streaming” data analytics.
3. Manage data quality and conduct data cleaning specific to IoT and Streaming
data.
4. Conduct predictive analysis, change detection, and extreme-value, anomaly, and
outlier detection.
5. Select and execute new emerging algorithms specific to streaming data.
6. Extract knowledge and intelligence from data streams which exhibit high volume,
velocity, and/or variety.
7. Have a good foundation to conduct research on streaming data analytics towards a
PhD thesis.
Additional learning objectives include the development of:
Written and oral communication skills: the individual project proposal, execution, and
documentation will be used to assess written skills, and the final
presentations will be used to assess oral skills.
Pedagogy
The course will employ lectures, class discussion, in-class individual and team
assignments, and individual and team homework and projects. Students will make
presentations during the class. An End-to-End Knowledge Discovery in Databases
project will be developed and executed during the semester by each student using a
real-world data set. The result is documented as a research project and presented to the
class.
Required Text(s)
There will not be a formal textbook. Several books have been published over the last two
years, none of which is written as a textbook. Most of these are edited books with a
collection of papers. In addition to lecture notes, the following two books will be used as
primary resources:
1. Knowledge Discovery from Data Streams, Joao Gama, CRC Press, 2010
2. Mining of Massive Datasets, A. Rajaraman, J. D. Ullman, Stanford University,
Cambridge University Press, 2012
Additional Resources and Recommended Readings:
1. Managing and Mining Sensor Data, edited by Charu C. Aggarwal, IBM Research, Springer, 2013: http://link.springer.com/chapter/10.1007/978-1-46146309- 12
2. Outlier Analysis, Charu C. Aggarwal, IBM Research, Springer, 2013
3. Data Streams: Models and Algorithms, edited by Charu C. Aggarwal, IBM Research, Springer, 2007
4. Data Stream Mining: A Practical Approach, by Albert Bifet, Geoff Holmes, Richard Kirkby and Bernhard Pfahringer, University of Waikato, May 2011
A bulk package of selected articles taken from well-known magazines and international
scientific journals will be distributed to the students as further reading material. Students
will also have access to all lecture slides.
Grading Percentages:

Assignment(s)                                      Grade Percent
Homework                                           10%
Mid-term                                           15%
Final                                              15%
Survey/Research/Project Paper and Presentation*    60%
Total Grade                                        100%
Survey Paper and Presentation*: Each student will conduct a literature review and
write and present a “survey paper” on a topic of his/her interest in the emerging field of
“Streaming Data Analytics”.
Ethical Conduct
The following statement is printed in the Stevens Graduate Catalog and applies to all
students taking Stevens courses, on and off campus.
“Cheating during in-class tests or take-home examinations or homework is, of course,
illegal and immoral. A Graduate Academic Evaluation Board exists to investigate
academic improprieties, conduct hearings, and determine any necessary actions. The
term ‘academic impropriety’ is meant to include, but is not limited to, cheating on
homework, during in-class or take home examinations and plagiarism.”
Consequences of academic impropriety are severe, ranging from receiving an “F” in a
course, to a warning from the Dean of the Graduate School, which becomes a part of the
permanent student record, to expulsion.
Reference:
The Graduate Student Handbook, Academic Year 2003-2004 Stevens
Institute of Technology, page 10.
Consistent with the above statements, all homework exercises, tests and exams that are
designated as individual assignments MUST contain the following signed statement
before they can be accepted for grading.
____________________________________________________________________
I pledge on my honor that I have not given or received any unauthorized assistance on
this assignment/examination. I further pledge that I have not copied any material from a
book, article, the Internet or any other source except where I have expressly cited the
source.
Signature ________________
Date: _____________
Please note that assignments in this class may be submitted to www.turnitin.com, a
web-based anti-plagiarism system, for an evaluation of their originality.
Course Schedule

Week 1
Topic(s): Course Orientation; Introduction to the topic and some of the primary sources.
Readings & Cases: Lecture slides

Week 2
Topic(s): Introduction to IoT, Sensors, and RFID Tags, Data Privacy & Security
Readings & Cases: Chapter 12 of Resources 1.

Week 3
Topic(s): Stream Data Model
Readings & Cases: 2.1 of 1. and 4.1 of 2.
Homework: From ch 4 of 2
Expected Outcomes: Formulate Data Streams

Week 4
Topic(s): Bounds of Random Variables
Readings & Cases: 2.2 of 1.
Homework: From Ch 2 of 1
Expected Outcomes: Calculate important bounds

Week 5
Topic(s): Sufficient Statistics
Readings & Cases: 2.2 of 1.
Homework: From Ch 2 of 1
Expected Outcomes: Calculate important summaries

Week 6
Topic(s): Synopsis & Sampling Techniques
Readings & Cases: Chapter 4 of 2.
Homework: From Ch 4 of 2
Expected Outcomes: Practice selecting representative samples

Week 7
Topic(s): Mid-Term

Week 8
Topic(s): Sliding Windows
Readings & Cases: Chapter 2 of 1
Homework: From ch 8 of Data Streams: Models and Algorithms (Additional Resources)
Expected Outcomes: Optimal time horizon

Week 9
Topic(s): Computing the Entropy in Streams
Readings & Cases: 2.3 of 1
Homework: From ch 2 of 1
Expected Outcomes: Get ready for change detection

Week 10
Topic(s): Drifting Concept, Drift Detection, Change Detection, Outliers
Readings & Cases: Ch 3 of 1
Homework: From ch 3 of 1
Expected Outcomes: Test any concept drift

Week 11
Topic(s): Data Streams Histograms and Density, Change Detection in Histograms
Readings & Cases: Ch 4 of 1
Homework: From ch 4 of 1
Expected Outcomes: Change detection

Week 12
Topic(s): Clustering Data Streams
Readings & Cases: Ch 6 of 1
Homework: From ch 6 of 1
Expected Outcomes: Practice clustering algorithm

Week 13
Topic(s): Decision Trees and Classification Methods in Data Streams
Readings & Cases: Ch 8 of 1
Homework: From ch 8 of 1
Expected Outcomes: Practice CT algorithm

Week 14
Topic(s): Final Exam

Week 15
Topic(s): Students Survey Paper Presentations
Readings & Cases: TBD
Homework: From Resources 1-3 above
Expected Outcomes: Conduct and present an E2E KD in Data Streams

CT = Classification Tree
KD = Knowledge Discovery
SP = Selected Problems
E2E = End to End