Cleveland State University
Department of Electrical and Computer Engineering
CIS 612/712 Big Data & Parallel Database Processing Systems

Catalog Data: Big Data & Parallel Database Processing Systems (3-0-3). Prerequisites: CIS 505 and CIS 530. Detailed study of modern database processing and parallel database processing systems for big data processing. Topics include the transaction concept, concurrency control strategies, and semi-structured and unstructured data processing strategies. The course then advances to big data processing strategies on the Hadoop distributed file system with the MapReduce paradigm, and focuses on massively parallel database processing systems for big data processing with selected NoSQL systems, NewSQL systems, and cloud computing platforms and infrastructures. The course covers data models, indexes, querying techniques, data processing methods, and ACID (Atomicity, Consistency, Isolation, and Durability) issues in parallel database processing systems. Students will gain hands-on experience with big data processing systems by processing real-time big data streams obtained from well-known social network sites. Finally, the course will explore the latest advances in industry research on big data processing and data analytics.

Textbooks:
Fundamentals of Database Systems, by Elmasri / Navathe. 7th Edition. Addison Wesley Pub Co.
Lecture notes taken from database research papers and industry parallel database systems documentation on big data processing systems and data analytics

References:
Tutorials for Hadoop/MapReduce and VM (Virtual Machine)
Tutorials for NoSQL Systems - Hive, HBase, PigLatin, MongoDB, Cassandra
Tutorials for NewSQL System VoltDB

Coordinator: Dr. Sunnie S. Chung

Outcomes: Upon successful completion of the course, the student will be able to:
• Understand a well-defined transaction concept and concurrency control strategies in database processing systems;
• Create modern database applications that process non-traditional data - semistructured data such as JSON or XML data, or unstructured data such as web logging data;
• Understand big data processing techniques and gain comprehensive knowledge of Massively Parallel Processing (MPP) systems - NoSQL/NewSQL systems, and cloud computing;
• Obtain hands-on experience with parallel data processing systems and tools, and cloud computing platforms and infrastructures for big data processing;
• Build an infrastructure for big data processing systems;
• Be exposed to the latest advances in database industry research in big data processing.

Topics                                                            Lecture Hours

Introduction to Big Data                                          3
    Transaction, ACID (Atomicity, Consistency, Isolation, Durability)
    Concurrency Control
Database Programming Constructs                                   3
    Database Triggers
    Stored Procedure, Embedded SQL, Dynamic SQL, JDBC/ODBC, PHP
User Defined Function (UDF), User Defined Type (UDT),             3
    User Defined Aggregate (UDA), Table Function,
    Common Language Runtime (CLR) Functions and Types
Enhanced Data Models for Advanced Applications                    3
    Semi Structured and Unstructured Databases:
    XML Data Processing, XPath, XQuery
    JavaScript Object Notation (JSON) Data Processing
Introduction to Information Retrieval and Web Data Processing     3
    Data Models for Unstructured Big Data Processing
    Google’s Bigtable
Introduction to Big Data                                          3
    Google’s MapReduce Paradigm
    Apache Hadoop File System for Parallel Processing
Big Data Processing and Massively Parallel Processing Systems     3
    NoSQL Systems:
    Pig Latin on Apache Hadoop by Yahoo and Apache
    Data Warehouse Hive with Hadoop by Facebook
    HBase
MongoDB                                                           3
    Cassandra
Key Value Stores                                                  3
    MapReduce Join Algorithms
NewSQL System:                                                    3
    VoltDB
    Extended PDW with MapReduce and Hadoop: Oracle, Teradata
    NoSQL vs NewSQL
ACID of Massively Parallel Processing Systems                     3
Cloud Computing: Platforms and Infrastructures                    3
Advanced Research Literature Review and Presentations             3
Exams and Reviews                                                 3
                                                                  __
Total                                                             45

Grading: The course grade is based on a student's overall performance through the entire semester. The final grade is distributed among the following components:
• Exams (Midterm & Final) 40% (15% Midterm, 20% Final)
• Computer Labs 30% (about 4-5 Lab Assignments)
• 1 Project on Big Data processing: 2 person group project (25%)
• Research Paper Presentation: 10%

Additional Requirements for CIS 712 Students:
• Doctoral students who take CIS 712 must select a project to work on
• Doctoral students who take CIS 712 must work on the project individually (instead of in a 2 person group)
• The list of projects and research papers for doctoral students will be given separately in class. A tentative selection of research projects and the paper list is given at the end of this course schedule
• In each exam, one additional problem is designed to be completed by doctoral students only

Computer Software Required: Installation and setup instructions for each system will be given in class.
1. Visual Studio 2012/2013 or higher
2. SQL Server 2014 or higher
3. Microsoft SQL Server Data Tools for Analysis Service 2014 or higher
4. Hadoop/MapReduce and VM
5. Hive
6. PigLatin
7. HBase
8. MongoDB
9. Cassandra
10. VoltDB

Tentative List of Research Papers and Projects for CIS 712 Doctoral Students:
CIS 712 doctoral students should choose one of the following research topics, give a 30 min presentation on the papers (which will be given in class), and complete a project related to the subject. The paper list and project specification for each research topic below will be given in class.

Examples of Selected Current Database Research Topics in Big Data and Parallel Database Systems (The subjects and the paper list may vary every year.)

Semistructured/Unstructured Data Processing
Hadoop based Data Warehousing and Analytics Infrastructure at Facebook
Parallel Computing for Big Data Processing:
    • Google Cloud, Amazon Cloud
    • Hadoop Based NoSQL Systems
    • NewSQL Systems
MapReduce: Simplified Data Processing on Large Clusters by Google
Lämmel, Ralf. Google's MapReduce Programming Model Revisited.
Stream Processing
Spark
NoSQL Systems: Pig Latin, HBase, Hive, MongoDB, Cassandra
MapReduce Join Algorithms
Data Partition Techniques
Performance Survey: SQL vs NoSQL Processing
MR/Hadoop with PDW: Oracle, Teradata
Information Retrieval: Google Search Engine
Big Data Integration Systems
Cloud Computing: Microsoft Azure, Amazon Cloud, Google Cloud

• Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, et al. (Yahoo! Research), in the proceedings of SIGMOD 2008
• Data Warehousing and Analytics Infrastructure at Facebook, Ashish Thusoo, et al. (Facebook), in the proceedings of SIGMOD 2010
• Petabyte Scale Databases and Storage Systems Deployed at Facebook, Dhruba Borthakur, et al., in the proceedings of SIGMOD 2014
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture, Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc), SIGMOD 2014
• The “Big Data” Ecosystem at LinkedIn, Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn), SIGMOD 2015
• Avatara: OLAP for Webscale Analytics Products, Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, Sam Shah (LinkedIn), SIGMOD 2014
• Microsoft Azure as a Self-Managing Database Service: Lessons Learned and Challenges Ahead, Kunal Mukerjee, et al. (Microsoft), in the proceedings of IEEE Computer Society Technical Committee on Data Engineering 2014