
Revised: January 2015
Stevens Institute of Technology
Howe School of Technology Management
Syllabus
BIA 678
Big Data Seminar
Spring 2015
David Belanger
Babbio 409
Tel: 201-216-3392
Fax: 201-216-5385
dbelange@stevens.edu
Tuesday 6:15 pm
Office Hours:
Monday 2:00 and 5:00 pm
Also by appointment
Course Room/Web Address:
Babbio 304
http://www.stevens.edu/canvas
Overview
The field of Big Data is emerging as one of the transformative business processes of
recent times. It utilizes classic techniques from Business Intelligence & Analysis, along
with new tools and processes to deal with the volume, velocity, and variety associated
with big data. As they enter the workforce, a significant percentage of BIA students will
be directly involved with big data either as technologists, managers, or users. This course
will build on their understanding of the basic concepts of BI&A to provide them with the
background to succeed in the evolving data-centric world, not only from the point of view
of the technologies required, but in terms of management, governance, and organization.
Tools will include Hadoop, HBase, and related software.
Prerequisites: Admission requirements for the BI&A program.
Course Objectives
The objective of this course is to study key technological, management, and governance
techniques for the application of big data. This will be done through a series of readings and
lectures, some by outside experts; case studies of the application of big data; application
of technologies typical of the field (e.g. Map/Reduce); and a semester-long, small-team
project applying what has been learned. Students will learn how to apply selected tools in
areas such as data management, data analysis, and data visualization, and also learn how
to deal with the issues related to the management of large sets of data. The course will
concentrate on what is different in a big data environment compared with what students have
already learned about standard BIA environments. Finally, through the analysis and discussion
of case studies, students will gain useful insights on how to optimize the value of big data processes
and operations, to streamline goals, and to design flexible systems. Students taking
the course will be expected to have some background in areas such as multivariate
statistics, data mining, data management, and programming.
Additional learning objectives include the development of:
Written and oral communications skills: the individual project proposal will be used to
assess written skills and the final presentations will be used to assess presentation skills.
Technical Reading Capability: Students will be required to read, and lead discussions on,
seminal papers in the field of big data.
Team skills: The final project for the course will involve student teams; an online survey
instrument will be used to measure individual contributions to team performance.
List of Course Outcomes:
After taking this course, students will be able to:
CO.1. Understand and discuss what big data is, and how it differs from traditional
approaches to BI&A
CO.2. Plan and use the primary tools associated with big data in creating systems to
take advantage of big data.
CO.3. Extract knowledge and intelligence from datasets which exhibit high volume,
velocity, and/or variety.
CO.4. Plan and execute a project that includes the use of at least one big data dataset.
CO.5. Understand and discuss the meta issues around big data such as governance,
security, privacy, and OAM&P.
CO.6. Understand and be able to execute analyses oriented to streaming data.
CO.7. Have a framework with which to understand new advances in the field, and
distinguish hype from reality.
CO.8. Understand and discuss organizational issues related to big data.
Pedagogy
The course will employ lectures, class discussion, in-class individual assignments, an
individual term paper and a team project. In the team project, students will analyze an
industrial problem using real data, design a solution approach using big data techniques
along with other statistical and machine learning techniques, program and execute the
solution, and interpret the solution for management. In the term paper, students will be
required to describe and address issues of importance in modern big data systems.
Readings
Required Text
Soares, Sunil, “Big Data Governance – An Emerging Imperative.” Boise ID, MC Press,
2012.
Supplementary Reading:
Wu, et al., “Data Mining with Big Data”, IEEE Transactions on Knowledge and Data Engineering,
1/2014 http://www.cs.umb.edu/~ding/papers/TKDE2013.pdf
Lin & Ryaboy, “Scaling Big Data Mining Infrastructure: The Twitter Experience”, SIGKDD Explorations,
V14 I2 http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
McKinsey Global Institute, “Big Data: The next frontier for innovation, competition, and productivity”,
2011
http://www.mckinsey.com/Search.aspx?q=big%20data%20the%20next%20frontier%20for%20innovation%20competition%20and%20productivity&l=Insights%20%26%20Publications
Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf,
2004
Ghemawat, et al, “Google File System”,
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf , 2003
Compression vcodex,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.4161&rep=rep1&type=pdf,
Cortes, et al., “Communities of Interest”,
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=737FC800B052765E59749637FAB5AF7D?doi=10.1.1.23.8792&rep=rep1&type=pdf
CAP, IEEE Computer V45 N2 2/2012 pp. 21-58, esp. 21, 23, 30, 37, 43.
Lynch & Gilbert, “Perspectives on the CAP Theorem”,
http://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf , 2012
Abadi, et al, “Column-Stores vs. Row-Stores: How Different Are They Really?”,
http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
Chang et al, “Bigtable: A Distributed Storage System for Structured Data”,
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf ,
2006
Decandia, et al, “Dynamo: Amazon’s Highly Available Key-Value Store”,
http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
HBase Basics (Cassandra Basics), O’Reilly;
http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
Widom, et al., “STREAM: The Stanford Data Stream Management System”,
http://ilpubs.stanford.edu:8090/641/1/2004-20.pdf, 2004
Cranor, et al., “Gigascope: A Stream Database for Network Applications”,
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4467&rep=rep1&type=pdf, 2004
Johnson, “Stream Warehousing”, http://www.stanford.edu/group/mmds/slides2012/s-johnson.pdf,
An application to Darkstar
IBM Infosphere Streams, http://www-03.ibm.com/software/products/en/infosphere-streams
Wu, et al, “Top 10 Algorithms in Data Mining”, Knowledge Systems 2007,
http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf,
Marai, Liz, http://vis.cs.pitt.edu/teaching/cs2620/lectures/L04_TufteDesign.pdf
Shneiderman, Ben, Extreme Visualization: Squeezing a Billion Records into a Million Pixels
http://www.cs.umd.edu/~ben/papers/Shneiderman2008Extreme.pdf
Scheidegger, et al, “Visual Embedding, a Model for Visualization”,
http://cscheid.net/static/papers/visual_embedding.pdf, 1/2014
http://docs.media.bitpipe.com/io_11x/io_113511/item_821580/Big%20Data%20Needs%20Agile%20Information%20And%20Integration%20Governance.PDF
NIST Big Data Public Working Group and Standardization activities,
http://bigdatawg.nist.gov/_uploadfiles/M0270_v1_9179221138.pdf,
Privacy Policies, for example: jpmorgan, at&t, google, smaller folks, …
Johnson, et al, “Bistro Data Feed Management System”,
http://www.research.att.com/export/sites/att_labs/techdocs/TD_100454.pdf
MIT, “an evaluation framework for data quality tools”,
http://mitiq.mit.edu/iciq/pdf/an%20evaluation%20framework%20for%20data%20quality%20tools.pdf,
Assignments
Class Discussion Leadership (10%)
Each student will be required to lead the class discussion of one or more of the assigned
readings. This will be done in front of the class. All students are expected to have read
the assigned readings, and to take part in the discussion.
Biweekly homework assignments, including programming using map/reduce,
compression, etc., along with weekly reading assignments (33%)
Individual Term Paper (33%)
Each student will be required to write a term paper of approximately 5–10 pages on a
topic of their choice within the domain of big data.
Team Project Report & Presentation (33%)
The class will be divided into teams of approximately 5 students each. Each team will be
expected to select a data set appropriate to big data, to conduct a variety of analyses on
the data using tools associated with big data, to present the project results to the class, and to
create a written report on the project.
Ethical Conduct
The following statement is printed in the Stevens Graduate Catalog and applies to all students
taking Stevens courses, on and off campus.
“Cheating during in-class tests or take-home examinations or homework is, of course, illegal and
immoral. A Graduate Academic Evaluation Board exists to investigate academic improprieties,
conduct hearings, and determine any necessary actions. The term ‘academic impropriety’ is
meant to include, but is not limited to, cheating on homework, during in-class or take home
examinations and plagiarism.”
Consequences of academic impropriety are severe, ranging from receiving an “F” in a course, to a
warning from the Dean of the Graduate School, which becomes a part of the permanent student
record, to expulsion.
Reference:
The Graduate Student Handbook, Academic Year 2003-2004 Stevens
Institute of Technology, page 10.
Consistent with the above statements, all homework exercises, tests and exams that are
designated as individual assignments MUST contain the following signed statement before they
can be accepted for grading.
____________________________________________________________________
I pledge on my honor that I have not given or received any unauthorized assistance on this
assignment/examination. I further pledge that I have not copied any material from a book, article,
the Internet or any other source except where I have expressly cited the source.
Signature _________________________
Date: _____________
Please note that assignments in this class may be submitted to www.turnitin.com, a web-based
anti-plagiarism system, for an evaluation of their originality.
Course/Teacher Evaluation
Continuous improvement can only occur with feedback based on comprehensive and appropriate
surveys. Your feedback is an important contributor to decisions to modify course
content/pedagogy, which is why we strive for 100% class participation in the survey.
All course teacher evaluations are conducted on-line. You will receive an e-mail one week prior
to the end of the course informing you that the survey site (https://www.stevens.edu/assess) is
open along with instructions for accessing the site. Login using your Campus (email) username
and password. This is the same username and password you use for access to Moodle. Simply
click on the course that you wish to evaluate and enter the information. All responses are strictly
anonymous. We especially encourage you to clarify your position on any of the questions and
give explicit feedback on your overall evaluations in the section at the end of the formal survey,
which allows for written comments. We ask that you submit your survey prior to the end of the
examination period.
COURSE SCHEDULE
The course is divided into modules, some of which will extend across more than a single
class meeting.
1. Introduction to Big Data
Overview: An introduction to Big Data, Definitions, Applications, Tools, and
Governance.
Readings: Wu, et al., 2014
Lin & Ryaboy, 2012
McKinsey Global Institute, 2011
2. Core Technologies for Distribution and Scale
An introduction to the core technologies for scale and distribution, including map/reduce,
Hadoop, compression, GFS and HDFS
Readings: Dean & Ghemawat, 2004
Ghemawat, et al., 2003
Vcodex
Cortes, et al.
Cloudera Tutorial
3. Data Base Management
CAP, NoSQL, Column Store, HBase, XQuery
Readings: Lynch & Gilbert, 2012
Abadi, et al, 2008
Chang, et al, 2006
Decandia, et al, 2007
4. Data Stream Management
Internet of Things, Data Stream Management Systems, InfoSphere Streams, STREAM,
Gigascope, Analytics
Readings: Widom, et al, 2004
Cranor, et al, 2004
Johnson, 2012
Infosphere Speaker
5. Data Analytics
Data analytics in a big data, distributed world. R over Hadoop
Readings: Wu et al, 2007
6. Visualization in a big data world
Issues and techniques in visualizing large, or fast moving, datasets.
Readings: Marai, 2004
Shneiderman, 2008
Scheidegger, et al, 2014
7. Data Governance
Issues related to the governance of large data sets, including: security, privacy, integrity,
quality, and OA&M
Readings: Soares Parts 1, 2, and 3
8. Meta Issues in Big Data Governance
More detailed discussion of the issues of security, privacy, integrity, quality, OA&M, and
management of big data, including related technologies. (Visiting Speaker.)
Readings: NIST Documents
Privacy Policies of selected companies (e.g. JPMorgan, AT&T)
MIT
9. Applications
Detailed discussion of selected applications of big data in a few different industries.
(Visiting Speaker)
10. Student Presentations of Term Projects
Each team presents their term project: written report plus oral presentation