Database Management Systems (2): SICT 3308
Fall 2013

Instructor: Rawia Awadallah
Email: rradi@iugaza.edu.ps
Faculty of Information Technology
Office: I205
Office Hours: Sat./Mon./Wed. 1:00-2:00
Time and Location: Sat./Mon./Wed. 9:00-10:00, Room I102; Sun./Tues. 9:00-11:00

Course Description:
This course covers a spectrum of topics from core techniques in relational data management to highly scalable data processing using parallel database systems and MapReduce. The course material will be drawn from textbooks. The material will cover storage structures, query evaluation and execution, query optimization, concurrency control, transaction processing, recovery, parallel databases, and data warehousing.

Textbook 1: Database System Concepts, Sixth Edition, by Avi Silberschatz, Henry F. Korth, S. Sudarshan
Textbook 2: Hadoop: The Definitive Guide, Third Edition, by Tom White
Textbook 3: Database Management Systems, Third Edition, by Raghu Ramakrishnan and Johannes Gehrke

Course website:
The course website on Moodle includes links to course content, including textbooks, assignments, and lecture slides.

Grading:
1. Assignments: 20%
2. Research Paper Presentation: 20%
3. Midterm Exam: 20%
4. Final Exam: 40%

Course Outline:
In general, slides are intended for the lecture presentations and do not provide thorough descriptions of the material on their own; for reference, read the indicated sections from the textbooks. An effective strategy is to skim the textbook before the lecture, then read it thoroughly soon afterwards. We recommend printing the slides before the lecture and annotating them during the lecture.

Week#   Topic                                               Subtopics                                    Readings
1-6     Data Storage and Querying                           Storage and File Structure                   Textbook 1 Ch. 10
                                                            Indexing and Hashing                         Textbook 1 Ch. 11
                                                            Query Processing                             Textbook 1 Ch. 12
                                                            Query Optimization                           Textbook 1 Ch. 13
7-11    Transaction Management                              Transactions                                 Textbook 1 Ch. 14
                                                            Concurrency Control                          Textbook 1 Ch. 15
                                                            Recovery System                              Textbook 1 Ch. 16
12-14   Parallel and Distributed Databases                  Parallel Databases & Distributed Databases   Textbook 3 Ch. 22
                                                            Overview of Hadoop                           Textbook 2 Ch. 1
                                                            The Hadoop Distributed Filesystem            Textbook 2 Ch. 3
                                                            MapReduce Overview                           Textbook 2 Ch. 2
15-16   Overview of Data Warehousing, Data Mining, and IR   Data Warehousing                             Textbook 3 Ch. 25
                                                            Data Mining                                  Textbook 3 Ch. 26
                                                            Information Retrieval and XML Data           Textbook 3 Ch. 27

Data Storage and Querying
1. We begin with an overview of physical storage media, including mechanisms to minimize the chance of data loss due to device failures. Then we describe how records are mapped to files, which in turn are mapped to bits on the disk.
2. An index is a structure that helps locate desired records of a relation quickly, without examining all records. This course will describe several types of indices used in database systems.
3. User queries have to be executed on the database contents, which reside on storage devices. It is usually convenient to break up queries into smaller operations, roughly corresponding to the relational-algebra operations. In this course we will describe how queries are processed, presenting algorithms for implementing individual operations, and then outlining how the operations are executed in synchrony to process a query.
4. There are many alternative ways of processing a query, and their costs can vary widely. Query optimization refers to the process of finding the lowest-cost method of evaluating a given query.

Transaction Management
1. The term transaction refers to a collection of operations that form a single logical unit of work. For instance, a transfer of money from one account to another is a transaction consisting of two updates, one to each account. It is important that either all actions of a transaction be executed completely or, in case of some failure, the partial effects of each incomplete transaction be undone. This property is called atomicity.
2. Further, once a transaction is successfully executed, its effects must persist in the database; a system failure should not result in the database forgetting about a transaction that successfully completed. This property is called durability.
3. In a database system where multiple transactions are executing concurrently, if updates to shared data are not controlled, there is potential for transactions to see inconsistent intermediate states created by updates of other transactions. Such a situation can result in erroneous updates to data stored in the database. Thus, database systems must provide mechanisms to isolate transactions from the effects of other concurrently executing transactions. This property is called isolation.
4. Concurrency-control techniques help implement the isolation property.
5. The recovery-management component of a database implements the atomicity and durability properties.

Parallel and Distributed Databases
1. The architecture of a database system is greatly influenced by the underlying computer system on which the database system runs. Database systems can be centralized, where one server machine executes operations on the database.
2. Database systems can be designed to exploit parallel computer architectures. Distributed databases span multiple geographically separated machines.
3. We will describe how various actions of a database, in particular query processing, can be implemented to exploit parallel processing.
4. We will also present a number of issues that arise in a distributed database, and describe how to deal with each issue. The issues include how to store data, how to ensure atomicity of transactions that execute at multiple sites, how to perform concurrency control, and how to provide high availability in the presence of failures. Cloud-based data storage systems, distributed query processing, and directory systems will also be described.

Data Warehousing
1. The first aspect covered in this part of the course is gathering data from multiple sources into a central repository, called a data warehouse. Issues involved in warehousing include techniques for dealing with dirty data, that is, data with some errors, and techniques for efficient storage and indexing of large volumes of data.
2. The second aspect is analyzing the gathered data to find information or knowledge that can be the basis for business decisions. Some kinds of data analysis can be done by using SQL constructs for online analytical processing (OLAP). Another approach to getting knowledge from data is to use data mining, which aims at detecting various types of patterns in large volumes of data. Data mining supplements various types of statistical techniques with similar goals.
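
As an informal illustration of the OLAP-style analysis mentioned above, the following minimal sketch computes totals at three levels of aggregation (per region and product, per region, and overall) using Python's standard sqlite3 module. It is not taken from the course material: the sales table, its columns, the sample rows, and the function name rollup_totals are all hypothetical. The sketch assumes the SQL ROLLUP construct is not available in the bundled SQLite engine, so the same result is emulated with UNION ALL.

```python
import sqlite3

# Hypothetical sample data: (region, product, amount). Not from the course.
SAMPLE_SALES = [
    ("north", "laptop", 3),
    ("north", "phone", 5),
    ("south", "laptop", 2),
    ("south", "phone", 4),
]


def rollup_totals(rows):
    """Return a dict mapping (region, product) to the summed amount.

    A key of (region, None) holds the subtotal for that region and the
    key (None, None) holds the grand total, mimicking a ROLLUP result.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        # Three aggregation levels, from finest to coarsest.
        query = """
            SELECT region, product, SUM(amount) FROM sales
            GROUP BY region, product
            UNION ALL
            SELECT region, NULL, SUM(amount) FROM sales
            GROUP BY region
            UNION ALL
            SELECT NULL, NULL, SUM(amount) FROM sales
        """
        return {
            (region, product): total
            for region, product, total in conn.execute(query)
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for key, total in rollup_totals(SAMPLE_SALES).items():
        print(key, total)
```

On a database system that supports GROUP BY ROLLUP, the three SELECT statements would collapse into a single query; the dictionary form is used here only so that the subtotals are easy to look up.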