Database Management Systems (2): SICT 3308

Database Management Systems (2): SICT 3308, Fall 2013
Instructor: Rawia Awadallah
Email: rradi@iugaza.edu.ps
Faculty of Information Technology
Office: I205
Office Hours:
Sat./Mon./Wed. 1:00-2:00
Time and Location: Sat./Mon./Wed. 9:00-10:00, Room I102
Sun./Tues. 9:00-11:00
Course Description: This course covers a spectrum of topics from core techniques in relational data management to
highly-scalable data processing using parallel database systems and MapReduce. The course material will be drawn from
textbooks. The material will cover storage structures, query evaluation and execution, query optimization, concurrency
control, transaction processing, recovery, parallel databases, and data warehousing.
Textbook 1: Database System Concepts Sixth Edition by Avi Silberschatz, Henry F. Korth, S. Sudarshan
Textbook 2: Hadoop: The Definitive Guide Third Edition by Tom White
Textbook 3: Database Management Systems Third Edition by Raghu Ramakrishnan and Johannes Gehrke
Course website: The course website on Moodle includes links to course content, including textbooks, assignments, and
lecture slides.
Grading:
1. Assignments: 20%
2. Research Paper Presentation: 20%
3. Midterm Exam: 20%
4. Final Exam: 40%
Course Outline: In general, slides are intended for the lecture presentations and do not provide thorough descriptions of the
material alone; for reference, read the indicated sections from textbooks. An effective strategy is to skim the textbook
before the lecture, then read it thoroughly soon afterwards. We recommend printing the slides before lecture and annotating
them during lecture.
Week#   Topic                                Subtopics                                    Readings
1-6     Data Storage and Querying            Storage and File Structure                   Textbook 1 Ch. 10
                                             Indexing and Hashing                         Textbook 1 Ch. 11
                                             Query Processing                             Textbook 1 Ch. 12
                                             Query Optimization                           Textbook 1 Ch. 13
7-11    Transaction Management               Transactions                                 Textbook 1 Ch. 14
                                             Concurrency Control                          Textbook 1 Ch. 15
                                             Recovery System                              Textbook 1 Ch. 16
12-14   Parallel and Distributed Databases   Parallel Databases & Distributed Databases   Textbook 3 Ch. 22
                                             Overview of Hadoop                           Textbook 2 Ch. 1
                                             The Hadoop Distributed Filesystem            Textbook 2 Ch. 3
                                             MapReduce Overview                           Textbook 2 Ch. 2
15-16   Overview of Data Warehousing,        Data Warehousing                             Textbook 3 Ch. 25
        Data Mining, and IR                  Data Mining                                  Textbook 3 Ch. 26
                                             Information Retrieval and XML Data           Textbook 3 Ch. 27
Data Storage and Querying
1. We begin with an overview of physical storage media, including mechanisms to minimize the chance of data
loss due to device failures. Then we describe how records are mapped to files, which in turn are mapped to
bits on the disk.
2. An index is a structure that helps locate desired records of a relation quickly, without examining all records.
This course will describe several types of indices used in database systems.
3. User queries have to be executed on the database contents, which reside on storage devices. It is usually
convenient to break up queries into smaller operations, roughly corresponding to the relational-algebra
operations. In this course we will describe how queries are processed, presenting algorithms for
implementing individual operations, and then outlining how the operations are executed in synchrony, to
process a query.
4. There are many alternative ways of processing a query, which can have widely varying costs. Query
optimization refers to the process of finding the lowest-cost method of evaluating a given query.
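As a small illustration of item 2 above, the following sketch contrasts a full scan with a lookup through a hash index. It is only a toy model under stated assumptions: the relation, the column layout, and the function names (`scan_lookup`, `build_index`, `index_lookup`) are hypothetical and are not taken from the textbooks.

```python
from collections import defaultdict

# Hypothetical relation: (student_id, name, dept) tuples held in memory.
records = [
    (101, "Amal", "IT"),
    (102, "Basel", "Math"),
    (103, "Dina", "IT"),
    (104, "Fadi", "Physics"),
]

def scan_lookup(rows, dept):
    """Full scan: examine every record and keep the matching ones."""
    return [row for row in rows if row[2] == dept]

def build_index(rows, column):
    """Hash index: map each value of `column` to the positions of matching rows."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

def index_lookup(rows, index, key):
    """Fetch only the rows whose positions the index lists for `key`."""
    return [rows[pos] for pos in index.get(key, [])]

dept_index = build_index(records, column=2)
it_students = index_lookup(records, dept_index, "IT")
```

Both lookups return the same rows; the index version touches only the matching positions, at the cost of building and maintaining the extra structure.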
Transaction Management
1. The term transaction refers to a collection of operations that form a single logical unit of work. For instance,
transfer of money from one account to another is a transaction consisting of two updates, one to each account.
It is important that either all actions of a transaction be executed completely, or, in case of some failure,
partial effects of each incomplete transaction be undone. This property is called atomicity.
2. Further, once a transaction is successfully executed, its effects must persist in the database—a system failure
should not result in the database forgetting about a transaction that successfully completed. This property is
called durability.
3. In a database system where multiple transactions are executing concurrently, if updates to shared data are not
controlled there is potential for transactions to see inconsistent intermediate states created by updates of other
transactions. Such a situation can result in erroneous updates to data stored in the database. Thus, database
systems must provide mechanisms to isolate transactions from the effects of other concurrently executing
transactions. This property is called isolation.
4. Concurrency-control techniques help implement the isolation property.
5. The recovery-management component of a database system implements the atomicity and durability properties.
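The money-transfer example in item 1 can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, assuming a hypothetical `account` table and a hypothetical `transfer` helper; it relies on the connection's context manager, which commits when the block finishes and rolls back when an exception escapes it.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single unit of work."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Simulated failure: both updates above are undone together.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM account"))

# Hypothetical in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()
```

Calling `transfer(conn, "A", "B", 30)` applies both updates, while `transfer(conn, "A", "B", 500)` leaves both balances unchanged because the partial effects are rolled back.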
Parallel and Distributed Databases
1. The architecture of a database system is greatly influenced by the underlying computer system on which the
database system runs. Database systems can be centralized, where one server machine executes operations on
the database.
2. Database systems can be designed to exploit parallel computer architectures. Distributed databases span
multiple geographically separated machines.
3. We will describe how various actions of a database, in particular query processing, can be implemented to
exploit parallel processing.
4. We will also present a number of issues that arise in a distributed database, and describe how to deal with
each issue. The issues include how to store data, how to ensure atomicity of transactions that execute at
multiple sites, how to perform concurrency control, and how to provide high availability in the presence of
failures. Cloud-based data storage systems, distributed query processing, and directory systems will also be
described.
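One common way to exploit parallelism in query processing (item 3) is to partition the input by the grouping key and aggregate each partition independently. The sketch below simulates this in a single process; the function names and sample data are hypothetical, and in a real parallel system each partition would be handled by a separate node or worker.

```python
from collections import defaultdict

def hash_partition(rows, n_partitions):
    """Send each (key, value) row to a partition chosen by hashing its key."""
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        parts[hash(key) % n_partitions].append((key, value))
    return parts

def local_sum(partition):
    """Aggregate one partition on its own, as one worker would."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

def parallel_group_sum(rows, n_partitions=3):
    """Compute SUM(value) GROUP BY key over hash-partitioned input."""
    result = {}
    # Each local_sum call is independent and could run on a different machine.
    for partial in map(local_sum, hash_partition(rows, n_partitions)):
        # All rows for a key land in the same partition, so the partial
        # results have disjoint keys and can simply be merged.
        result.update(partial)
    return result

# Hypothetical (department, amount) rows.
sales = [("IT", 10), ("Math", 5), ("IT", 7), ("Physics", 3), ("Math", 1)]
totals = parallel_group_sum(sales)
```

The same partition-then-aggregate idea underlies the map, shuffle, and reduce phases covered in the Hadoop readings.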
Data Warehousing
1. The first aspect of this part of the course is gathering data from multiple sources into a central
repository, called a data warehouse. Issues involved in warehousing include techniques for dealing with dirty
data, that is, data with some errors, and techniques for efficient storage and indexing of large volumes of
data.
2. The second aspect is to analyze the gathered data to find information or knowledge that can be the basis for
business decisions. Some kinds of data analysis can be done by using SQL constructs for online analytical
processing (OLAP). Another approach to getting knowledge from data is to use data mining, which aims at
detecting various types of patterns in large volumes of data. Data mining supplements various types of
statistical techniques with similar goals.
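As a rough illustration of analysis with SQL aggregates (item 2), the sketch below summarizes a tiny fact table with sqlite3. The `sales` table and its contents are hypothetical, and plain `GROUP BY` stands in here for the richer OLAP constructs discussed in the readings.

```python
import sqlite3

# Hypothetical fact table standing in for a data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "laptop", 1200),
        ("North", "phone", 800),
        ("South", "laptop", 1000),
        ("South", "phone", 600),
    ],
)

# Summarize along one dimension at a time, then over the whole table.
by_region = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
by_product = warehouse.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
grand_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Viewing the same facts by region, by product, and as a grand total is the kind of multi-level summary that OLAP tools automate.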