Cleveland State University

CIS 695 Big Data Processing and Data Analytics (3-0-3) – 2016
Section 51 – Class Nbr. 5493. Tues, Thur – TBA
Prerequisites: CIS 505 and CIS 530. CIS 612, CIS 660 Preferred.
Instructor: Dr. Sunnie S. Chung
Office Location: FH211   Phone: 216 687 4732
Email: sschung.cis@gmail.com, s.chung@csuohio.edu
Webpage: http://grail.csuohio.edu/~sschung
Office Hours: Tues, Thurs 2:00 – 3:30 PM and 5:45 – 6:45 PM (or by appointment)
Class Location: TBA
Key Concepts: Big Data Processing and Parallel Computing; Google’s MapReduce and Apache Hadoop;
Hadoop/MapReduce Based Big Data Processing Systems: Google’s Bigtable, Facebook HBase, Hive, Apache
Pig Latin; Key Value Store Systems: MongoDB, Cassandra; Cloud Computing Systems: Google Cloud,
Amazon Elastic Cloud; Parallel Data Warehouse Based Big Data Processing Systems; Data Analytics using
OLAP Cube; Text Data Analytics; Data Analytics for Big Data; Building Big Data Processing
Infrastructures for Data Analytics; Data Analytics in Twitter and LinkedIn; and Practicum in Data Analytics:
Case Studies from Progressive, Explorys by IBM, and Data Think
List of Required Materials:
Microsoft SQL Server 2012
Microsoft Visual Studio 2012 or higher
Microsoft SQL Server Business Intelligence Data Analytic Tool 2012 – OLAP Server and SQL Server
Data Tool
These are available through the Microsoft Academic Alliance program:
http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897
Adventure Works 2012 Data Warehouse Database for SQL Server 2012 – directions will be given in class
Applied Analytics Using SAS Enterprise Miner
Apache Pig Latin: http://pig.apache.org/docs/r0.11.1/basic.html
WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
R and MapR
Detailed Instructions on Installation and Setting Up for Big Data Processing Systems will be given in Class.
Textbook:
1. Lecture Notes – Will be given in class
2. List of Selected Industry R&D Papers on Data Analytics and Big Data Processing will be given in
class
Supplementary Textbook:
1. “Data Mining with Microsoft SQL Server BI Data Tool” (2008 or 2012), Jamie MacLennan, Bogdan
Crivat. Publisher: Wiley; 1st Edition. ISBN-10: 0470277742 | ISBN-13: 978-0470277744
Official Calendar
Please consult the page http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html
Final exam: Thur, Dec 11 4:00-6:00 PM.
Grading: The course grade is based on a student's overall performance throughout the semester. The
final grade is distributed among the following components:
1. Exams (Midterm & Final): 45% (15% Midterm, 30% Final)
2. Computer Labs: 25% (about 3-4 assignments)
3. One Project on Big Data Processing: 20% (2-3 person group project)
4. Research Topic Presentation: 10% (tentative)
Grade    Range
A        94% +
A-       90% - 93%
B+       87% - 89%
B        80% - 86%
B-, C    70% - 79%
D        < 70%
F        < 60%
A: Outstanding (student's performance is genuinely excellent)
B: Very Good (student's performance is clearly commendable but not necessarily
outstanding)
C: Good (student's performance meets every course requirement and is acceptable;
not distinguished)
D: Below Average (student's performance fails to meet course objectives and
standards)
F: Failure (student's performance is unacceptable)
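As a worked illustration of how the component weights listed under Grading combine, the following minimal Python sketch computes a weighted course percentage. The scores are hypothetical, the name `course_percentage` is an assumption made for this example, and the presentation weight is the tentative 10% stated above.

```python
# Component weights in percent, taken from the Grading section of this syllabus.
WEIGHTS = {
    "midterm": 15,
    "final": 30,
    "labs": 25,
    "project": 20,
    "presentation": 10,  # listed as tentative
}


def course_percentage(scores):
    """Return the weighted course percentage from per-component scores (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "midterm": 80,
    "final": 90,
    "labs": 100,
    "project": 90,
    "presentation": 100,
}
print(course_percentage(example_scores))  # 92.0
```

With these example scores the weighted total is 92%, which falls in the 90% - 93% range.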
Examination Policy: Students are allowed to bring to the tests a summary page (standard letter size) with
their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices
is prohibited, and (2) students must not share any materials.
Make-Up Exam Policy: No make-up exams will be given unless the instructor is notified and agrees in advance.
Requests will be considered only in case of exceptional demonstrated need.
Homework Policy: Students are expected to attend all classes. Students are responsible for
collecting the notes, handouts, and any other course material distributed during the class period. All
assignments must be individually and independently completed and must represent the effort of the student
turning in the assignment. Should two or more students turn in substantially the same solution or output, in
the judgment of the instructor, the solution will be considered group effort. All involved in group effort
homework will receive a zero grade for that assignment. A student turning in a group effort assignment
more than once will automatically receive an “F” grade for the course.
Late Assignment: All lab assignments are due at the beginning of class on the date specified. Laboratory
assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a
week and will not be accepted after that. All laboratory assignments must be completed. Failure to do so
will lower your course grade by one additional letter grade.
Student Conduct: Students are expected to do their own work. Academic misconduct, student misconduct,
cheating and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified
in the CSU Student Conduct Code. A copy can be obtained on the web page at:
http://www.csuohio.edu/studentlife/StudentCodeOfConduct.pdf or by contacting Valerie Hinton Hannah,
Judicial Affairs Officer in the Department of Student Life (MC 106, email: v.hintonhannah@csuohio.edu).
For more information, consult the CSU Judicial Affairs web page at
http://www.csuohio.edu/studentlife/jaffairs/faq.html
Course Schedule: The tentative schedule of topics and their order of coverage are given below. The
schedule and the topics covered may vary depending on students’ progress.
Weeks 1-2
Topic: Big Data Processing and Parallel Computing
Reading: Lecture Notes, Listed Papers
Google’s MapReduce and Apache Hadoop
- Google MapReduce Tutorial
- Apache Hadoop File System
• MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean (Google) and Sanjay
  Ghemawat (Google), in the Proceedings of OSDI 2004
  http://grail.csuohio.edu/~sschung/CIS695/mapreduce-osdi04.pdf
• Apache Hadoop white papers by Apache and Yahoo
  https://hadoop.apache.org/
Lab 1: MapReduce Programming on Hadoop
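To make the MapReduce programming model covered in weeks 1-2 concrete, here is a minimal single-process sketch of the classic word count in plain Python. It only imitates the map, shuffle, and reduce phases in memory and does not use Hadoop; the function names and the sample input are assumptions chosen for illustration, not the lab's required design.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical input, for illustration only.
docs = ["big data processing", "big data analytics"]
print(word_count(docs))
# {'big': 2, 'data': 2, 'processing': 1, 'analytics': 1}
```

On a real cluster the map and reduce functions run in parallel on different nodes and the shuffle moves data across the network; the sketch keeps only the data flow.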
Weeks 3-5
Topic: Hadoop/MapReduce Based Big Data Processing Systems
Reading: Lecture Notes, Listed Papers
Google’s Bigtable
Data Models for Unstructured Data Processing
• Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean,
  Sanjay Ghemawat (Google, Inc.), in the Proceedings of OSDI 2006
  http://grail.csuohio.edu/~sschung/CIS695/googlebigtable-osdi06.pdf
Facebook HBase, Hive
• Data Warehousing and Analytics Infrastructure at Facebook, by Ashish Thusoo (Facebook), et al.,
  in SIGMOD 2010
  http://grail.csuohio.edu/~sschung/CIS695/facebook_sigmodwarehouse2010_CIS695.pdf
  http://grail.csuohio.edu/~sschung/CIS695/hive.pdf
• Petabyte Scale Databases and Storage Systems Deployed at Facebook, by Dhruba Borthakur,
  in SIGMOD 2013
  http://grail.csuohio.edu/~sschung/IST734/Facebook_Sig2013_Paper_IST734.pdf
  http://grail.csuohio.edu/~sschung/IST734/Facebook_Sigmod2013_Presentation_IST734.pdf
  https://hbase.apache.org/
Apache Pig Latin
• Pig Latin: A Not-So-Foreign Language for Data Processing, in SIGMOD 2008
  http://pig.apache.org/docs/r0.11.1/basic.html
  http://grail.csuohio.edu/~sschung/cis612/WebDataProcessingonDWYahooSig08.pdf
Key Value Stores: MongoDB, Cassandra
  https://www.mongodb.com/
  http://cassandra.apache.org/
Cloud: Google Cloud, Amazon Elastic Cloud
  http://aws.amazon.com/
  http://aws.amazon.com/what-is-cloud-computing/
  https://cloud.google.com/storage/docs/resources-support#samples
  http://en.m.wikipedia.org/wiki/Google_Cloud_Platform
  http://www.wired.com/2014/03/urs-google-story/
Lab 2: Processing Streaming Twitter Logging Data to Build a Data Warehouse or MongoDB on Hadoop
Weeks 6-7
Topic: Parallel Data Warehouse Based Big Data Processing Systems
Reading: Listed Papers, Lecture Notes
Data Warehouse and OLAP
- Decision Support Technology
- On Line Analytical Processing
- Star Schema
- OLAP Aggregation Operators: Data Cube, Roll Up, Drill Down
- Building BI Data Analytics using DW and OLAP
- MDX, DMX
• An Overview of Data Warehousing and OLAP Technology, by Surajit Chaudhuri (Microsoft) and
  Umeshwar Dayal (HP Labs), in the proceedings of IEEE 1995
• Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross Tab, and SubTotals,
  by Jim Gray (Microsoft), et al., in the proceedings of IEEE 1996
MS PDW Optimization
• Query Optimization in Microsoft SQL Server Parallel Data Warehouse, by Srinath Shankar,
  David DeWitt, César Galindo-Legaria (Microsoft), et al., in SIGMOD 2012
  http://grail.csuohio.edu/~sschung/IST734/MicrosoftSqlOptimizationSig2012-shankar.pdf
Parallel Data Warehouse with OLAP Query Processing: Microsoft
Extended PDW with MapReduce and Hadoop: Oracle, Teradata
Columnar Databases: SAP HANA Databases
Extended PDW with Columnar Data Processing: Teradata, Microsoft
PDW on Cloud
Azure Cloud System by Microsoft, published in IEEE 2011
  http://grail.csuohio.edu/~sschung/IST734/azure2_Microsoft_IEEE2011.pdf
  http://grail.csuohio.edu/~sschung/IST734/AzureSQLMicrosoft_Sigmod2010campbell
Big Data Processing System with PDW on Cloud
  http://grail.csuohio.edu/~sschung/IST734/CloudVista_huiqixu_vldb2012.pdf
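The OLAP aggregation operators listed for weeks 6-7 (Data Cube, Roll Up, Drill Down) can be illustrated with a small in-memory sketch. The code below sums a measure over every subset of the grouping dimensions, which is the idea behind the CUBE operator; a rolled-up dimension is marked with an `ALL` placeholder, following the convention in the Data Cube paper. The fact table, the function name `data_cube`, and the column names are hypothetical, and a real OLAP server would compute this very differently.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder value for a dimension that has been rolled up


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (the CUBE of a GROUP BY)."""
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for kept in combinations(dimensions, size):
            for row in rows:
                # Keep the value of each retained dimension, replace the rest with ALL.
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                cube[key] += row[measure]
    return dict(cube)


# Hypothetical fact table, for illustration only.
sales = [
    {"region": "East", "year": 2015, "amount": 10},
    {"region": "East", "year": 2016, "amount": 20},
    {"region": "West", "year": 2016, "amount": 5},
]
cube = data_cube(sales, ["region", "year"], "amount")
print(cube[("East", 2015)])  # 10: the finest level (drill down)
print(cube[("East", ALL)])   # 30: year rolled up for region East
print(cube[(ALL, ALL)])      # 35: grand total
```

Moving from `("East", 2015)` to `("East", ALL)` is a roll up along the year dimension; going the other way is a drill down.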
Weeks 8-9
Topic: Big Data Processing Techniques for Data Analytic Tasks
Reading: Lecture Notes, Selected Papers
- MapReduce Join Algorithms
- Pig Latin Job Execution Steps and Operators: Filter By, Group, Cogroup
- Unstructured Data Processing: Data Transformation from Unstructured/Semi-Structured Data to
  Object Exchange Model (JSON, XML) or Relation/CSV
- Introduction to Information Retrieval and Web Logging Data Processing
- Text Data Analytics, Data Analytics for Big Data
- Optimization Techniques for Big Data Processing
Lab 3: Unstructured Logging Data Processing to JSON Format for Object Exchange Model (OEM)
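As an illustration of the MapReduce join and Pig Latin Cogroup topics listed for weeks 8-9, the following sketch shows a reduce-side inner equi-join in plain Python. The `cogroup` helper stands in for the map and shuffle steps (records from both inputs are collected under their join key), and the join then pairs the two sides per key, as a reducer would. The function names and the sample `users` and `tweets` records are assumptions for this example; this is one common join strategy, not necessarily the one emphasized in class.

```python
from collections import defaultdict


def cogroup(left, right, key_of_left, key_of_right):
    """Group records of both inputs by join key, in the spirit of Pig Latin's COGROUP.

    Each key maps to a pair: (records from left, records from right).
    """
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[key_of_left(record)][0].append(record)
    for record in right:
        groups[key_of_right(record)][1].append(record)
    return groups


def reduce_side_join(left, right, key_of_left, key_of_right):
    """Inner equi-join: for each key, pair every left record with every right record."""
    joined = []
    grouped = cogroup(left, right, key_of_left, key_of_right)
    for key, (lefts, rights) in grouped.items():
        for l_rec in lefts:
            for r_rec in rights:
                joined.append((key, l_rec, r_rec))
    return joined


# Hypothetical inputs, for illustration only: (user_id, name) and (user_id, text).
users = [(1, "ann"), (2, "bob")]
tweets = [(1, "hello"), (1, "big data"), (3, "orphan")]
result = reduce_side_join(users, tweets, lambda u: u[0], lambda t: t[0])
print(result)
# [(1, (1, 'ann'), (1, 'hello')), (1, (1, 'ann'), (1, 'big data'))]
```

Keys that appear on only one side (user 2 and tweet author 3 here) produce no output, which is the inner-join behavior; keeping them would give an outer join.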
Weeks 10-11
Topic: Data Analytics in Twitter and LinkedIn
Reading: Lecture Notes, Listed Papers
Twitter:
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture,
  by Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.)
  http://grail.csuohio.edu/~sschung/IST734/FastDataTwitter.pdf
LinkedIn:
• The “Big Data” Ecosystem at LinkedIn, by Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn)
  http://grail.csuohio.edu/~sschung/IST734/BigDataEcoSystemLinkedInSigmod2014.pdf
• Avatara: OLAP for Webscale Analytics Products, by Lili Wu, et al. (LinkedIn)
  http://grail.csuohio.edu/~sschung/cis612/LinkedIn_liliwu_vldb2012.pdf
Lab 4: Data Analytics with Streaming Twitter Messages
Weeks 12-13
Topic: Practicum in Data Analytics: Case Study
Reading: Lecture Notes, Selected Papers
• Progressive
• Explorys by IBM
• Data Think
Week 14
Topic: Data Analytics and Security
Reading: Lecture Notes
- Text Mining: Intrusion Detection in Network and System
- Database Security
- Security in Cloud
Weeks 15-16
Topic: Presentation of Significant Research Papers on Data Analytics and Big Data Processing.
A list of selected papers will be given in class.
Reading: Selected Papers
NOTE: The instructor reserves the right to retain, for pedagogical reasons, either the original or a copy of
your work submitted either individually or as a group project for this class. Students' names will be deleted
from any retained items.