Cleveland State University
CIS 695 Big Data Processing and Data Analytics (3-0-3) – 2016
Section 51 – Class Nbr. 5493. Tues, Thur – TBA

Prerequisites: CIS 505 and CIS 530. CIS 612, CIS 660 Preferred.

Instructor: Dr. Sunnie S. Chung
Office Location: FH211
Phone: 216 687 4732
Email: sschung.cis@gmail.com, s.chung@csuohio.edu
Webpage: http://grail.csuohio.edu/~sschung
Office Time: Tues, Thurs 2:00 – 3:30 PM and 5:45 – 6:45 PM (or by appointment)
Class Location: TBA (Section 51, Tue & Thu, TBA)

Key Concepts:
• Big Data Processing and Parallel Computing
• Google’s MapReduce and Apache Hadoop
• Hadoop/MapReduce Based Big Data Processing Systems: Google’s Big Table, Facebook HBase, Hive, Apache Pig Latin
• Key Value Store Systems: MongoDB, Cassandra
• Cloud Computing Systems: Google Cloud, Amazon Elastic Cloud
• Parallel Data Warehouse Based Big Data Processing Systems
• Data Analytics using OLAP Cube
• Text Data Analytics, Data Analytics for Big Data
• Building Big Data Processing Infrastructures for Data Analytics
• Data Analytics in Twitter, LinkedIn
• Practicum in Data Analytics: Case Study from Progressive, Explorys by IBM and Data Think

List of Required Materials:
• Microsoft SQL Server 2012, Microsoft Visual Studio 2012 or any higher version
• Microsoft SQL Server Business Intelligence Data Analytic Tool 2012 – OLAP Server and SQL Server Data Tool
  They are available at the Microsoft Academic Alliance program:
  http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897
• Adventure Works 2012 Data Warehouse Database for SQL Server 2012 – Will be directed in class
• Applied Analytics Using SAS Enterprise Miner
• Apache Pig Latin: http://pig.apache.org/docs/r0.11.1/basic.html
• WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
• R and MapR
Detailed instructions on installation and setup for the Big Data Processing Systems will be given in class.

Text Book:
1. Lecture Notes – Will be given in class
2. List of Selected Industry R&D Papers on Data Analytics and Big Data Processing will be given in class

Supplement Text Book:
1. “Data Mining with Microsoft SQL Server BI Data Tool” (2008 or 2012), Jamie MacLennan, Bogdan Crivat, Publisher: Wiley; 1st Edition. ISBN-10: 0470277742 | ISBN-13: 978-0470277744

Official Calendar: Please consult the page http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html
Final exam: Thur, Dec 11 4:00-6:00 PM.

Grading:
The course grade is based on a student's overall performance through the entire semester. The final grade is distributed among the following components:
1. Exams (Midterm & Final): 45% (15% Midterm, 30% Final)
2. Computer Labs: 25% (about 3-4 assignments)
3. One Project on Big Data Processing: 20% (2-3 person group project)
4. Research Topic Presentation: 10% (tentative)

Grade    Range
A        94% +
A-       90% – 93%
B+       87% – 89%
B        80% – 86%
B-/C     70% – 79%
D        < 70%
F        < 60%

A: Outstanding (student's performance is genuinely excellent)
B: Very Good (student's performance is clearly commendable but not necessarily outstanding)
C: Good (student's performance meets every course requirement and is acceptable; not distinguished)
D: Below Average (student's performance fails to meet course objectives and standards)
F: Failure (student's performance is unacceptable)

Examination Policy:
Students are allowed to bring to the tests a summary page (standard letter size) with their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices is prohibited, and (2) students must not share any materials.

Make-Up Exam Policy:
No makeup exams will be given unless the instructor is notified and agrees in advance. Requests will be considered only in case of exceptional demonstrated need.

Homework Policy:
Students are expected to attend all classes. Students are responsible for collecting the notes, handouts and any other course material distributed during the class period.
All assignments must be individually and independently completed and must represent the effort of the student turning in the assignment. Should two or more students turn in substantially the same solution or output, in the judgment of the instructor, the solution will be considered group effort. All involved in group effort homework will receive a zero grade for that assignment. A student turning in a group effort assignment more than once will automatically receive an “F” grade for the course.

Late Assignment:
All lab assignments are due at the beginning of class on the date specified. Laboratory assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a week and then not accepted at all. All laboratory assignments must be completed. Failure to do so will lower your course grade by one additional letter grade.

Student Conduct:
Students are expected to do their own work. Academic misconduct, student misconduct, cheating and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified in the CSU Student Conduct Code. A copy can be obtained on the web page at http://www.csuohio.edu/studentlife/StudentCodeOfConduct.pdf or by contacting Valerie Hinton Hannah, Judicial Affairs Officer in the Department of Student Life (MC 106, email v.hintonhannah@csuohio.edu). For more information consult the CSU Judicial Affairs web page at http://www.csuohio.edu/studentlife/jaffairs/faq.html

Course Schedule:
The tentative schedule of topics and their order of coverage is given below. The schedule and topics to be covered may vary depending upon the students’ progress.
Week 1-2: Big Data Processing and Parallel Computing
  Reading: Lecture Notes, Listed Papers
  Topics:
  • Google’s MapReduce and Apache Hadoop
  • Google MapReduce Tutorial
  • Apache Hadoop File System
  Papers:
  • MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean (Google) and Sanjay Ghemawat (Google), in the Proceedings of OSDI 2004
    http://grail.csuohio.edu/~sschung/CIS695/mapreduce-osdi04.pdf
  • Apache Hadoop in White Papers by Apache, Yahoo
    https://hadoop.apache.org/
  Lab 1: MapReduce Programming on Hadoop

Week 3–5: Hadoop/MapReduce Based Big Data Processing Systems
  Reading: Lecture Notes, Listed Papers
  Topics and Papers:
  • Google’s Big Table: Data Models for Unstructured Data Processing
    Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Google, Inc., in the Proceedings of OSDI 2006
    http://grail.csuohio.edu/~sschung/CIS695/googlebigtable-osdi06.pdf
  • Facebook HBase, Hive
    Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010, by Ashish Thusoo (Facebook), et al.
    http://grail.csuohio.edu/~sschung/CIS695/facebook_sigmodwarehouse2010_CIS695.pdf
    http://grail.csuohio.edu/~sschung/CIS695/hive.pdf
    Petabyte Scale Databases and Storage Systems Deployed at Facebook, in SIGMOD 2013, by Dhruba Borthakur
    http://grail.csuohio.edu/~sschung/IST734/Facebook_Sig2013_Paper_IST734.pdf
    http://grail.csuohio.edu/~sschung/IST734/Facebook_Sigmod2013_Presentation_IST734.pdf
    https://hbase.apache.org/
  • Apache Pig Latin
    Pig Latin: A Not-So-Foreign Language for Data Processing, in SIGMOD 2008
    http://pig.apache.org/docs/r0.11.1/basic.html
    http://grail.csuohio.edu/~sschung/cis612/WebDataProcessingonDWYahooSig08.pdf
  • Key Value Stores: MongoDB, Cassandra
    https://www.mongodb.com/
    http://cassandra.apache.org/
  • Cloud: Google Cloud, Amazon Elastic Cloud
    http://aws.amazon.com/
    http://aws.amazon.com/what-is-cloud-computing/
    https://cloud.google.com/storage/docs/resources-support#samples
    http://en.m.wikipedia.org/wiki/Google_Cloud_Platform
    http://www.wired.com/2014/03/urs-google-story/
  Lab 2: Processing Streaming Twitter Logging Data to build a Data Warehouse or MongoDB on Hadoop

Week 6–7: Parallel Data Warehouse Based Big Data Processing Systems
  Reading: Lecture Notes, Listed Papers
  Topics:
  • Data Warehouse and OLAP
    - Decision Support Technology
    - On Line Analytical Processing
    - Star Schema
    - OLAP Aggregation Operators: Data Cube, Roll Up, Drill Down
    - Building BI Data Analytics using DW and OLAP
    - MDX, DMX
  • MS PDW Optimization
  • Parallel Data Warehouse with OLAP Query Processing: Microsoft
  • Extended PDW with MapReduce and Hadoop: Oracle, Teradata
  • Columnar Databases: SAP HANA Databases
  • Extended PDW with Columnar Data Processing: Teradata, Microsoft
  • PDW on Cloud
  Papers:
  • An Overview of Data Warehousing and OLAP Technology, by Surajit Chaudhuri (Microsoft) and Umeshwar Dayal (HP Labs), in the proceedings of IEEE 1995
  • Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross Tab, and SubTotals, by Jim Gray (Microsoft), et al., in the proceedings of IEEE 1996
  • Query Optimization in Microsoft SQL Server Parallel Data Warehouse, in SIGMOD 2012, by Srinath Shankar (Microsoft), David DeWitt (Microsoft), César Galindo-Legaria (Microsoft), et al.
    http://grail.csuohio.edu/~sschung/IST734/MicrosoftSqlOptimizationSig2012-shankar.pdf
  • Azure Cloud System by Microsoft, published in IEEE 2011
    http://grail.csuohio.edu/~sschung/IST734/azure2_Microsoft_IEEE2011.pdf
    http://grail.csuohio.edu/~sschung/IST734/AzureSQLMicrosoft_Sigmod2010campbell
  • Big Data Processing System with PDW on Cloud
    http://grail.csuohio.edu/~sschung/IST734/CloudVista_huiqixu_vldb2012.pdf

Week 8-9: Big Data Processing Techniques for Data Analytic Tasks
  Reading: Lecture Notes, Selected Papers
  Topics:
  • MapReduce Join Algorithms
  • Pig Latin Job Execution Steps and Operators: Filter By, Group, Cogroup
  • Unstructured Data Processing: Data Transformation from Unstructured/Semi-Structured Data to Object Exchange Model (JSON, XML) or Relation/CSV
  • Introduction to Information Retrieval and Web Logging Data Processing
  • Text Data Analytics, Data Analytics for Big Data
  • Optimization Techniques for Big Data Processing
  Lab 3: Unstructured Logging Data Processing to JSON format for Object Exchange Model (OEM)

Week 10-11: Data Analytics in Twitter and LinkedIn
  Reading: Lecture Notes, Listed Papers
  Papers:
  • Twitter: Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture, by Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.)
    http://grail.csuohio.edu/~sschung/IST734/FastDataTwitter.pdf
  • LinkedIn: The “Big Data” Ecosystem at LinkedIn, by Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn)
    http://grail.csuohio.edu/~sschung/IST734/BigDataEcoSystemLinkedInSigmod2014.pdf
  • Avatara: OLAP for Webscale Analytics Products, by Lili Wu, et al. (LinkedIn)
    http://grail.csuohio.edu/~sschung/cis612/LinkedIn_liliwu_vldb2012.pdf
  Lab 4: Data Analytics with Streaming Twitter Messages

Week 12–13: Practicum in Data Analytics: Case Study
  Reading: Lecture Notes, Selected Papers
  • Progressive
  • Explorys by IBM
  • Data Think

Week 14: Data Analytics and Security
  Reading: Lecture Notes
  Topics:
  • Text mining – Intrusion detection in Network and System
  • Database Security
  • Security in Cloud

Week 15-16: Presentation of Significant Research Papers on Data Analytics and Big Data Processing
  Reading: Selected Papers
  The list of selected papers will be given in class.

NOTE: The instructor reserves the right to retain, for pedagogical reasons, either the original or a copy of your work submitted either individually or as a group project for this class. Students' names will be deleted from any retained items.