Cleveland State University Department of Electrical and Computer Engineering

CIS 612/712 Big Data & Parallel Database Processing Systems
Catalog Data: Big Data & Parallel Database Processing Systems (3-0-3).
Prerequisites: CIS 505 and CIS 530. Detailed study of modern database
processing and parallel database processing systems for big data
processing. Topics include the transaction concept, concurrency control
strategies, and semi-structured and unstructured data processing strategies.
The course then covers big data processing strategies on the Hadoop
distributed file system with the MapReduce paradigm and focuses on
massively parallel database processing systems for big data processing,
using selected NoSQL systems, NewSQL systems, and cloud computing
platforms and infrastructures. The course covers data models, indexes,
querying techniques, data processing methods, and ACID (Atomicity,
Consistency, Isolation, and Durability) issues in parallel database
processing systems. Students gain hands-on experience with big data
processing systems by processing real-time big data streams obtained
from well-known social network sites. Finally, the course explores the
latest advances in industry research on big data processing and data
analytics.
Textbooks:
Fundamentals of Database Systems, by Elmasri / Navathe. 7th Edition.
Addison Wesley Pub Co.
Lecture Notes taken from the Database Research Papers and Industry
Parallel Database Systems Documentation on Big Data Processing
Systems and Data Analytics
References:
Tutorials for Hadoop/Map Reduce and VM (Virtual Machine)
Tutorials for NoSQL Systems - Hive, HBase, PigLatin, MongoDB,
Cassandra
Tutorials for NewSQL System VoltDB
Coordinator: Dr. Sunnie S. Chung
Outcomes: Upon successful completion of the course, the student will be able to:
• Understand the transaction concept and concurrency control strategies
in database processing systems;
• Create modern database applications that process non-traditional data:
semi-structured data such as JSON or XML data, or unstructured data such as
web log data;
• Understand big data processing techniques and gain comprehensive knowledge of
Massively Parallel Processing (MPP) systems (NoSQL/NewSQL systems) and
cloud computing;
• Obtain hands-on experience with parallel data processing systems and tools, and
cloud computing platforms and infrastructures for big data processing;
• Build an infrastructure for big data processing systems;
• Gain exposure to the latest advances in database industry research in big data
processing.
Topics (Lecture Hours)
1. Introduction to Big Data (3)
2. Transaction, ACID (Atomicity, Consistency, Isolation, Durability) (3)
   Concurrency Control
3. Database Programming Constructs (3)
   Database Triggers
   Stored Procedure, Embedded SQL, Dynamic SQL, JDBC/ODBC, PHP
4. User Defined Function (UDF), User Defined Type (UDT), (3)
   User Defined Aggregate (UDA), Table Function,
   Common Language Runtime (CLR) Functions and Types
5. Enhanced Data Models for Advanced Applications (3)
   Semi-Structured and Unstructured Databases:
   XML Data Processing, XPath, XQuery
   JavaScript Object Notation (JSON) Data Processing
6. Introduction to Information Retrieval and Web Data Processing (3)
   Data Models for Unstructured Big Data Processing
   Google’s Bigtable
7. Introduction to Big Data (3)
   Google’s MapReduce Paradigm
   Apache Hadoop File System for Parallel Processing
8. Big Data Processing and Massively Parallel Processing Systems (3)
   NoSQL Systems:
   Pig Latin on Apache Hadoop by Yahoo and Apache
   Data Warehouse Hive with Hadoop by Facebook
   HBase
9. MongoDB (3)
   Cassandra
10. Key Value Stores (3)
    MapReduce Join Algorithms
11. NewSQL System: (3)
    VoltDB
    Extended PDW with MapReduce and Hadoop: Oracle, Teradata
    NoSQL vs NewSQL
12. ACID of Massively Parallel Processing Systems (3)
13. Cloud Computing: Platforms and Infrastructures (3)
14. Advanced Research Literature Review and Presentations (3)
15. Exams and Reviews (3)
Total Lecture Hours: 45
Grading: The course grade is based on a student's overall performance throughout the entire
semester. The final grade is distributed among the following components:
• Exams (Midterm & Final): 40% (15% Midterm, 20% Final)
• Computer Labs: 30% (about 4-5 lab assignments)
• Project on big data processing (one two-person group project): 25%
• Research Paper Presentation: 10%
Additional Requirements for CIS 712 Students:
• Doctoral students who take CIS 712 must select a project to work on.
• Doctoral students who take CIS 712 must work on the project individually (instead
of in a two-person group).
• The list of projects and research papers for doctoral students will be given
separately in class. A tentative example of the research projects and the paper
list is given at the end of this course schedule.
• In each exam, one additional problem is designed to be completed by doctoral
students only.
Computer Software Required:
Installation and setup instructions for each system will be given in class.
1. Visual Studio 2012/2013 or higher
2. SQL Server 2014 or higher
3. Microsoft SQL Server Data Tools for Analysis Service 2014 or higher
4. Hadoop/MapReduce and VM
5. Hive
6. PigLatin
7. HBase
8. MongoDB
9. Cassandra
10. VoltDB
Tentative List of Research Papers and Projects for CIS 712 Doctoral Students:
CIS 712 doctoral students should choose one of the following research topics, give a 30-minute
presentation on the assigned papers (which will be given in class), and complete a project related to the topic.
The paper list and project specification for each research topic below will be given in class.
Examples of Selected Current Database Research Topics in Big Data and Parallel Database Systems
(The topics and the paper list may vary every year.)
1. Semistructured/Unstructured Data Processing
2. Hadoop-Based Data Warehousing and Analytics Infrastructure at Facebook
3. Parallel Computing for Big Data Processing:
   • Google Cloud, Amazon Cloud
   • Hadoop-Based NoSQL Systems
   • NewSQL Systems
4. MapReduce: Simplified Data Processing on Large Clusters by Google
5. Lämmel, Ralf. Google's MapReduce Programming Model Revisited.
6. Stream Processing: Spark
7. NoSQL Systems: Pig Latin, HBase, Hive, MongoDB, Cassandra
8. MapReduce Join Algorithms
9. Data Partition Techniques
10. Performance Survey: SQL vs NoSQL
11. Processing MR/Hadoop with PDW: Oracle, Teradata
12. Information Retrieval: Google Search Engine
13. Big Data Integration Systems
14. Cloud Computing: Microsoft Azure, Amazon Cloud, Google Cloud
• Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, et al. (Yahoo! Research), in the
proceedings of SIGMOD 2008
• Data Warehousing and Analytics Infrastructure at Facebook, Ashish Thusoo, et al. (Facebook), in the
proceedings of SIGMOD 2010
• Petabyte Scale Databases and Storage Systems Deployed at Facebook, Dhruba Borthakur, et al., in the proceedings
of SIGMOD 2014
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture, Gilad Mishne, Jeff
Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.), SIGMOD 2014
• The “Big Data” Ecosystem at LinkedIn, Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn), SIGMOD 2015
• Avatara: OLAP for Webscale Analytics Products, Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo,
Hyung Jin Kim, Jay Kreps, Sam Shah (LinkedIn), SIGMOD 2014
• Microsoft Azure as a Self-Managing Database Service: Lessons Learned and Challenges Ahead, Kunal
Mukerjee, et al. (Microsoft), in the proceedings of IEEE Computer Society Technical Committee on Data
Engineering 2014