Learning Goals for
CPSC 404: Advanced Relational Databases
September-December 2023
Last Update: September 4, 2023
This document contains the course-level learning goals and a superset of the topic-level learning
goals for CPSC 404. I’ve been using most of these learning goals in CPSC 404 classes in recent
years. Note that this document includes a bunch of special (or historical) topics that changed
from term to term, since we used to have 2-3 weeks’ worth of specialized material (instructor’s
choice). Now, with the introduction of Transaction Management, Concurrency Control, and
Crash Recovery, we don’t really have that opportunity anymore. Skip the learning goals for the
“Special Topics” units if we don’t cover the topic in the current version of CPSC 404.
Course-Level Learning Goals
Explain and justify the I/O cost-based model for query evaluation in relational database
management systems (RDBMSs).
Recommend the most useful indexes for a set of tables, given some information about the
expected query mix (e.g., the importance and frequency of expected queries).
Show how to use an index (e.g., B+ tree, extendible hash structure, linear hash structure) to look
up search keys and the corresponding rows in a table. Demonstrate insertions and
deletions of keys in the index, by performing splitting or merging of nodes. Analyze the
complexity of operations against these data structures.
Explain the query evaluation and optimization decisions regarding joins, indexes, pipelining,
selection, projection, etc. made by an RDBMS optimizer, given appropriate metadata.
Estimate the I/O cost of evaluating and optimizing a given SQL query, given a set of indexes and
appropriate metadata about the tables and indexes.
Estimate the I/O cost of sorting a large file using external mergesort.
Explain how concurrency is handled to permit greater throughput in an RDBMS. Compare and
contrast the different concurrency control methods that could be used by an RDBMS in terms of
the types of schedules they produce, efficiency, and ease of implementation.
Define the following properties of database transactions, and give practical examples of their
characteristics or desirable features: ACID properties, serializability, and isolation levels.
Explain how logging works in a DBMS that uses the ARIES crash recovery algorithm. Provide
guidelines for setting appropriate checkpoint intervals. For application-oriented recovery,
provide guidelines for setting appropriate image copy intervals.
Topic-Level Learning Goals
Introduction (brief
coverage)
List the job responsibilities of various DB personnel. Gain an
awareness of the scope and complexity of database technology in an
organization.
Explain the benefits of logical and physical data independence brought
about by the relational model.
Identify some database challenges in providing 24x7 operations.
Explain why DBMSs are so hard to configure “properly” for an
organization.
Justify the use of autonomic (self-tuning) database systems.
Explain how self-tuning RDBMSs can improve query performance.
Provide examples of DBA (database administration) activity that can be
simplified using autonomic computing.
Chapter 9a: Disks
Explain the impact that disk activity has on DBMS query performance.
Draw the memory hierarchy. Show where database bottlenecks are
most likely to occur and where extensive caching takes place.
Compare and contrast: cost, capacity, and speed of access, in the levels
of the memory hierarchy.
Identify the components of a disk drive.
Given disk geometry figures, calculate the amount of time that it takes
to read or write a number of bytes, blocks/pages, tracks, or cylinders of
data onto a disk.
Given disk geometry figures, compute the minimum, average, and
maximum seek, rotation, and transfer times to/from a disk.
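As a worked illustration of these geometry calculations, the sketch below uses hypothetical figures (7200 RPM, 512-byte sectors, 500 sectors per track, 9 ms average seek — all assumed for the example, not taken from the course notes):

```python
# Hypothetical disk geometry figures (assumed for illustration only).
RPM = 7200                 # spindle speed
SECTOR_BYTES = 512
SECTORS_PER_TRACK = 500
AVG_SEEK_MS = 9.0          # assumed average seek time

rotation_ms = 60_000 / RPM               # one full revolution (~8.33 ms here)
avg_rotational_delay_ms = rotation_ms / 2
track_bytes = SECTOR_BYTES * SECTORS_PER_TRACK

def transfer_ms(num_bytes):
    """Time to transfer num_bytes once the head is positioned."""
    return rotation_ms * (num_bytes / track_bytes)

def avg_read_ms(num_bytes):
    """Average time for one random read: seek + rotational delay + transfer."""
    return AVG_SEEK_MS + avg_rotational_delay_ms + transfer_ms(num_bytes)

print(round(avg_read_ms(8192), 2))       # cost of reading one 8 KB page
```

Note how seek and rotational delay dominate the tiny transfer time for a single page — the motivation for the sequential-placement goals below.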
Compare and contrast the relative speeds of seek, rotation, and transfer
times—when accessing a given size of data on disk.
Explain how a large file, broken up into pages, can be optimally placed
on a disk to improve performance.
Compare and contrast hard disk drives (HDDs) to solid state disks
(SSDs). Discuss performance implications.
Defend the ongoing role of hard disk drives in the DBMS world (e.g.,
explain why we can’t eliminate spinning, hard disk drives anytime
soon).
Explain the difference between a random read and a sequential read,
and argue why one is preferable over the other.
Argue for or against: It is worth spending a fair bit of DBA time on
analyzing disk geometries, disk usage, and table and index sizes.
Chapter 9b: Buffer Pool Management
Explain the purpose of a DBMS buffer pool, and justify why a DBMS
should manage its own buffer pool, rather than let the OS handle it.
Provide an example of sequential flooding in a DBMS buffer pool.
Explain the tradeoffs between force/no force and steal/no steal page
management in a buffer pool. Justify the use of the ARIES algorithm
to exploit these properties. [Many times, the ARIES algorithm isn’t
covered in depth in CPSC 304; therefore, students often need a
refresher of the key goals and features of it.]
For a given reference string (of page accesses), compute the behavior of
the following buffer pool page replacement algorithms: FIFO, LRU,
MRU, Clock (reference bit), and Extended Clock (reference bit + dirty
bit).
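These replacement policies can be simulated directly. Here is a minimal sketch (the function name and the restriction to FIFO/LRU/MRU are my own simplifications; a full treatment would also cover Clock and Extended Clock). The reference string repeats a scan of one more page than there are frames — the classic sequential-flooding case where LRU faults on every access while MRU does not:

```python
def count_faults(refs, frames, policy="LRU"):
    """Count page faults for a reference string with a given pool size.
    policy: 'LRU' evicts least-recently-used, 'MRU' most-recently-used,
    'FIFO' the oldest-loaded page."""
    pool, faults = [], 0            # pool ordered oldest -> newest use/load
    for p in refs:
        if p in pool:
            if policy in ("LRU", "MRU"):   # refresh recency on a hit
                pool.remove(p)
                pool.append(p)
            continue
        faults += 1
        if len(pool) == frames:
            pool.pop(0 if policy in ("LRU", "FIFO") else -1)
        pool.append(p)
    return faults

refs = [1, 2, 3, 4] * 3             # repeated scan of 4 pages, 3 frames
print(count_faults(refs, 3, "LRU"), count_faults(refs, 3, "MRU"))
```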
Create a reference string that produces worst-case performance for a
given page replacement algorithm.
Explain why page requests for a disk may not be serviced immediately.
List some of the points of contention.
[Later in the course] Explain how the page replacement policy and the
access plan can have a big impact on the number of I/Os required by a
given application.
[Later in the course] Predict which buffer pool page replacement
strategy works best with a given workload (e.g., table scan, index scan,
index lookup, return a large number of rows, RUNSTATS utility, log
files).
Chapter 9c (First Part): Disk Scheduling
Only an overview of the basics will be presented. We won’t spend too
much time on this topic, though.
Explain the relationship among disk geometry, buffer pool
management, and disk scheduling in providing good performance for
large data requests from a user of a DBMS. List the bottlenecks that
may contribute to poor I/O performance in this disk “chain”.
Compute the service order for a queue of track/cylinder/page requests
using each of these disk scheduling algorithms: FCFS (First Come,
First Serve), SSTF (shortest seek time first), and Elevator (Scan with
and without Look), and [optional] CSCAN (circular scan, with and
without Look).
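A sketch of the service-order computation for three of these algorithms, using the common textbook simplification that all requests are queued up front (the cylinder numbers below are illustrative, not from the course):

```python
def service_order(head, requests, algo="FCFS"):
    """Return the order in which queued cylinder requests are serviced."""
    pending = list(requests)
    if algo == "FCFS":                   # first come, first serve
        return pending
    if algo == "SSTF":                   # always pick the closest request
        pos, order = head, []
        while pending:
            nxt = min(pending, key=lambda c: abs(c - pos))
            pending.remove(nxt)
            order.append(nxt)
            pos = nxt
        return order
    if algo == "ELEVATOR_UP":            # Scan: sweep up first, then down
        up = sorted(c for c in pending if c >= head)
        down = sorted((c for c in pending if c < head), reverse=True)
        return up + down
    raise ValueError(algo)

reqs = [98, 183, 37, 122, 14, 124, 65, 67]   # head starts at cylinder 53
print(service_order(53, reqs, "SSTF"))
print(service_order(53, reqs, "ELEVATOR_UP"))
```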
[Optional] Integrate disk scheduling times with disk geometries. In
particular, given disk geometry statistics (e.g., sector size, RPM) and
page requests at specific arrival times, compute the service completion
schedule for short jobs for the various disk scheduling algorithms.
[Optional] Compare and contrast these disk scheduling algorithms:
FCFS, SSTF, and Elevator (with and without Look), and CSCAN (with
and without Look). Determine which kinds of page requests would
most benefit (perhaps unfairly) from each algorithm.
Explain the problem of starvation, with respect to disk scheduling.
Provide an example of starvation.
Chapter 9c (Second Part): Record and Page Layouts, Fixed and
Variable-Length Records, Metadata
Case Study: IBM DB2 Catalog statistics for z/OS (a large
enterprise-level system). IBM DB2 has been rebranded to Db2; but, for
consistency with existing course materials, we’ll often refer to it as
DB2.
Compare and contrast the record layouts for fixed-length and
variable-length records in a DBMS. Provide an advantage for each.
Compare and contrast the page layouts for fixed-length and
variable-length records in a DBMS. Provide an advantage for each.
Explain why rows in a table might be relocated.
Justify the use of free space within a page, and intermittent free pages
within a file, for an RDBMS table.
Compute the maximum number of records in a data file given an
estimate of the size of the file (e.g., pages, tracks, or cylinders) and
possibly other constraints (e.g., level of fill on a typical page).
Compute the size of the file needed to be able to store a given number
of records (given a set of assumptions).
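A minimal sketch of both sizing calculations, using assumed figures (4 KB pages, 100-byte fixed-length records, 80% fill):

```python
import math

# Assumed figures, for illustration only.
PAGE_BYTES = 4096
RECORD_BYTES = 100
FILL_FACTOR = 0.8            # leave ~20% free space per page for inserts

records_per_page = math.floor((PAGE_BYTES * FILL_FACTOR) / RECORD_BYTES)

def pages_needed(num_records):
    """Pages required to store num_records at the given fill factor."""
    return math.ceil(num_records / records_per_page)

def max_records(num_pages):
    """Maximum records that fit in num_pages at the given fill factor."""
    return num_pages * records_per_page

print(records_per_page, pages_needed(1_000_000), max_records(500))
```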
Given probabilities of average string lengths, determine whether it
makes more sense to use a fixed-length field, rather than a
variable-length field.
Give at least ten examples of the kind of metadata stored for an
RDBMS.
Justify the use of metadata from the perspective of both a DBMS and a
DBA.
[During assignments] Query an RDBMS catalog for metadata that is of
interest to a DBA.
Argue for the value of storing metadata as a table in an RDBMS, rather
than as another data structure.
Chapter 10: Tree-Structured Indexes
List the 3 ways or “alternatives” of representing data entries k* having
key k, in a general index.
Justify the use of indexes in database systems.
Explain how a tree-structured index supports both range and point
queries.
Explain the differences between dense and sparse indexes.
Build a B+ tree for a given set of data.
Show how to insert and delete keys (records) in a B+ tree.
Analyze the complexity of: (a) searching, (b) inserting into, and (c)
deleting from a B+ tree.
Explain why a B+ tree needs sibling pointers for its leaf pages.
Argue for or against: Overflow pages are not needed in a B+ tree.
Explain why B+ trees tend to be very shallow, even after storing many
millions of data entries.
Provide arguments for why B+ trees can store large numbers of data
entries in their pages.
Explain the pros and cons of allowing prefix key compression in an
index.
Given a set of data, build a B+ tree using bulk loading.
Provide several advantages of bulk loading compared to using SQL
INSERT statements.
Estimate the number of page I/Os and/or page faults required to look up
a search key in an index, and to locate the record(s) corresponding to
that search key.
Estimate the number of pages at each level of a B+ tree, given a
(possibly) composite search key, the number of data records, and a set
of assumptions (e.g., percentage of fill, unique vs. non-unique, Alt. 1
vs. Alt. 2 vs. Alt. 3). (Note: Most of our calculations will be done
using dense, Alt. 2 indexes, where the file is stored separately from the
index.)
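A sketch of that level-by-level estimate for a dense, Alternative-2 B+ tree; the record count, leaf capacity, and inner-node fanout below are assumed values, chosen only to make the arithmetic concrete:

```python
import math

# Assumed parameters for a dense Alternative-2 index (illustrative only).
NUM_RECORDS = 1_000_000
LEAF_ENTRIES = 200       # (key, rid) data entries per leaf page at ~2/3 fill
FANOUT = 300             # child pointers per inner page

leaves = math.ceil(NUM_RECORDS / LEAF_ENTRIES)
levels = [leaves]
while levels[-1] > 1:                   # each level indexes the one below it
    levels.append(math.ceil(levels[-1] / FANOUT))
levels.reverse()                        # root first
print(levels)                           # pages per level, root to leaves
```

With a fanout in the hundreds, a million data entries still need only three levels — which is why B+ trees stay so shallow.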
Using a diagram, show the principal difference between a clustered and
an unclustered index.
Provide arguments for what column(s) (possibly composite) would be
the best candidate(s) for a clustering index for a given table.
Provide examples of when clustering doesn’t help performance, and
may actually hinder the performance of certain kinds of queries.
Chapter 11: Hash-Based Indexes
Compare and contrast the performance of hash-based indexes versus
tree-based indexes (e.g., B+ tree) for equality and range searches.
Provide the best-case, average-case, and worst-case complexities for
such searches.
Explain how collisions are handled for open addressing and chaining
implementations of hash structures. (from CPSC 221)
Explain the advantages that dynamic hashing provides over static
hashing.
Show how insertions and deletions are handled in extendible hashing.
Build an extendible hash index using a given set of data.
Show how insertions and deletions are handled in linear hashing.
Build a linear hash index using a given set of data.
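The bucket-addressing rules behind both dynamic schemes can be sketched in a few lines. The bucket counts and the `Next` pointer value below are assumed; real implementations hash keys to bit strings, but for integer keys Python's built-in `hash` (which is the identity for small ints) is enough to illustrate the arithmetic:

```python
def ext_bucket(key, global_depth):
    """Extendible hashing: the directory slot is the low-order
    global_depth bits of h(key); doubling the directory just uses one
    more bit."""
    return hash(key) & ((1 << global_depth) - 1)

def lin_bucket(key, level, next_ptr, n0=4):
    """Linear hashing with n0 initial buckets and no directory: apply
    h_level = h mod (n0 * 2^level); buckets before Next have already
    been split this round, so rehash those with h_{level+1}."""
    b = hash(key) % (n0 * (1 << level))
    if b < next_ptr:
        b = hash(key) % (n0 * (1 << (level + 1)))
    return b

print(ext_bucket(13, 2), lin_bucket(13, 0, 2))
```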
Describe some of the major differences between extendible hashing and
linear hashing (e.g., how the directory is handled, how skew is
handled).
Chapter 13: External Sorting Using Mergesort (General External
Mergesort, Two-Phase Multiway Mergesort)
Justify the importance of external sorting (i.e., sorting for disk resident
files).
Argue that, for many database applications, I/O activity (and hence,
elapsed time) dominates CPU time when estimating the complexity.
Compute the number of sorted runs (or passes) that are required to sort
a large file using general external mergesort.
Compute the number of I/Os (using one physical I/O per page request)
that are required to sort a file of size N using k-way (general) external
mergesort, where N is the number of pages in the file, and k ≥ 2. Show
how your calculations change when the block size is bigger than one
page at a time.
Analyze the complexity and scalability of sorting a large file using
general external mergesort.
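A sketch of the standard I/O-cost formula, assuming one physical I/O per page and B buffer pages (so pass 0 produces ceil(N/B) sorted runs and each later pass does a (B-1)-way merge over the whole file):

```python
import math

def mergesort_io(N, B):
    """Page I/Os for general external mergesort on a file of N pages
    with B buffer pages. Each pass reads and writes every page once."""
    runs = math.ceil(N / B)                       # sorted runs after pass 0
    passes = 1 + (0 if runs <= 1 else math.ceil(math.log(runs, B - 1)))
    return passes, 2 * N * passes

# e.g., a 1000-page file with 11 buffer pages: 91 initial runs,
# then two 10-way merge passes.
print(mergesort_io(1000, 11))
```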
Identify potential bottlenecks in sorting a large file using general
external mergesort (e.g., RAM, I/O, block size, pages vs. tracks vs.
cylinders, I/O scheduling).
Explain how two-phase multiway mergesort relates to general external
mergesort.
Using the parameters for a given disk geometry, and a list of
assumptions, estimate the elapsed time for sorting a large file using
general external mergesort (or two-phase multiway mergesort).
Suggest several optimizations that can be exploited when sorting large
files via general external mergesort (or two-phase multiway
mergesort)—e.g., cylindrification, larger block sizes, double buffering
(not covered anymore), disk scheduling, and multiple smaller disks
including work files.
Explain, perhaps via a diagram, why the sorting performance of large
files is data dependent.
Estimate the number of short seeks and the number of long seeks that
are required in a data dependent sort of a large file, where a “short
seek” is to a neighbouring cylinder, and a “long seek” is a seek that’s
further away than one cylinder.
Argue for or against: sorting performance has linear complexity for
large datasets.
Chapters 12 and 14: Query Evaluation and Optimization
Note: We no longer cover Chapter 15, as there is sufficient content in
Chapters 12 and 14.
Translate between SQL and Relational Algebra.
Give examples of the diverse types of metadata stored in an RDBMS
catalog.
Write SQL queries against catalog tables, such as those for IBM’s DB2.
Explain the purpose of a plan in an RDBMS. Explain why many plans
are often possible when evaluating a query. Differentiate between good
and bad plans.
Justify the use of metadata in estimating the cost of computing
relational queries. Explain why the System R model has stood the test
of time for query evaluation.
Provide examples of—and explain—how “out of date” catalog statistics
can provide horrendous query performance.
Regarding “out of date” catalog/metadata statistics, analyze the
tradeoffs between static and dynamic metadata updates, drawing
examples from commercial RDBMSs such as those based upon IBM’s
System R model (e.g., DB2 RUNSTATS).
Provide examples of when it is faster to read the whole table(s) in
response to a query, than to use any index(es).
Explain how access paths relate to query plans.
Given an SQL query, determine the indexes that may apply when
evaluating the query.
Given an SQL query and a list of indexes, determine an efficient plan
for evaluating the query. Justify the assumptions behind your
reasoning.
Compute the difference that clustering makes when evaluating certain
types of queries.
Compare and contrast the roles that sorting and hashing have (for
intermediate results) when evaluating a query.
Compute the cost of evaluating a given query using a block nested loop
(BNL) join.
Compute the cost of evaluating a given query using an index nested
loop (INL) join (hash, B+ tree) for these types of indexes: (a) clustered
B+ tree, (b) unclustered.
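A sketch of the two cost formulas under the usual simplifying assumptions (B-2 buffer pages for the outer relation in BNL; a fixed per-probe I/O cost for INL). The table sizes below are illustrative, not from the course:

```python
import math

def bnl_cost(M, N, B):
    """Block nested loop join: scan the outer once; for each outer
    chunk of B-2 pages, scan the inner relation in full."""
    return M + math.ceil(M / (B - 2)) * N

def inl_cost(M, tuples_per_page, probe_cost):
    """Index nested loop join: scan the outer, then one index probe per
    outer tuple (probe_cost covers the lookup plus fetching matches)."""
    return M + M * tuples_per_page * probe_cost

# Illustrative: outer M=1000 pages at 100 tuples/page, inner N=500
# pages, B=102 buffers, ~3 I/Os per probe (all assumed).
print(bnl_cost(1000, 500, 102), inl_cost(1000, 100, 3))
```

Note how BNL beats INL here despite the index — INL pays per outer *tuple*, BNL per outer *chunk*.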
Compute the cost of doing a sort-merge join when evaluating a query.
Explain when and how it is possible to merge more than 2 sorted runs
during the join phase, when joining 2 tables.
Compute the cost of doing a hash join when evaluating a query.
Identify possible uses of a hash join, and explain how such joins scale
to larger datasets.
Justify which type of join to perform when evaluating a query, given a
list of indexes, catalog statistics, reduction factors, and/or assumptions.
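The sort-merge vs. hash comparison can be sketched the same way, under common simplifying assumptions (a single merging pass with no duplicate-spilling for sort-merge; one partitioning pass sufficing for Grace hash join, i.e., enough buffers relative to the smaller input):

```python
def smj_cost(M, N, sort_passes_M, sort_passes_N):
    """Sort-merge join: sort both inputs (2 * pages * passes each, since
    every pass reads and writes), then one merging pass over each."""
    return 2 * M * sort_passes_M + 2 * N * sort_passes_N + M + N

def hash_join_cost(M, N):
    """Grace hash join, assuming one partitioning pass suffices:
    read + write both inputs to partition, then read both to probe."""
    return 3 * (M + N)

# Illustrative: M=1000, N=500 pages, 2 sorting passes per input (assumed).
print(smj_cost(1000, 500, 2, 2), hash_join_cost(1000, 500))
```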
Construct a query evaluation tree for evaluating a given SQL query.
Explain the purpose of left-deep joins and pipelining. Relate these to
the complexity of evaluating a query plan when joining multiple tables.
Compute the number of I/Os required to evaluate a query using various
plans, both with and without pipelining, for: (a) single-relation queries,
and (b) multiple-relation queries (where multiple is a very small
positive integer).
Analyze and explain (in high-level terms), the additional complexity
that multiple-relation queries bring to query evaluation and
optimization. For example, demonstrate that additional tables, joins,
indexes, reduction factors, etc. complicate the process.
Determine when a table scan is better than using an index, during query
evaluation and optimization.
Argue for or against: TPC benchmarks are very useful when
comparing the performance of database systems.
Suggest additional forms of metadata (that are currently not available)
that would be useful when performing query evaluation.
Identify and discuss the criteria, measures, and resource costs
(including user buy-in) that go into a table reorganization decision.
Discuss and evaluate the tradeoffs between autonomic computing
decisions and DBA decisions, in determining (for example) when to do
table reorganizations and backups. Identify potential bottlenecks with
respect to online activity, isolation levels, and RDBMS transaction
performance that such decisions may bring.
Describe, using a diagram, how a hash join can improve BNL join
performance (over sort-merge join).
Explain the conditions under which you should use a hash join instead
of a sort-merge join.
Explain the role of histograms in selectivity estimation.
Compute data frequencies using equi-width and equi-depth histograms
using various numbers of buckets.
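A minimal sketch of both histogram constructions (bucket-boundary conventions vary between textbooks; the ones used here are an assumption, and the code assumes the values are not all equal):

```python
import math

def equi_width(values, buckets):
    """Equi-width histogram: equal-size value ranges, varying counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)   # clamp the max value
        counts[i] += 1
    return counts

def equi_depth(sorted_values, buckets):
    """Equi-depth histogram: varying ranges, roughly equal counts;
    returns the upper boundary value of each bucket."""
    n = len(sorted_values)
    return [sorted_values[math.ceil(n * k / buckets) - 1]
            for k in range(1, buckets + 1)]

vals = [1, 1, 2, 2, 2, 3, 5, 8, 9, 14, 14, 15]
print(equi_width(vals, 3))
print(equi_depth(sorted(vals), 3))
```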
Explain how the choice of histogram and its parameters can affect the
approximation of an attribute’s data distribution.
Chapters 16 and 17: Transaction Management and Concurrency
Control
Define the terms transaction, schedule, serial schedule, serializability,
and read/write conflicts; and describe the relationship among those
terms when it comes to database read and write actions.
Define the ACID properties of an RDBMS. Argue for why these
properties are desirable. Identify probable consequences of not meeting
any one of these properties.
Give reasons for allowing concurrency in a database system.
Explain why a serializable schedule is a desirable schedule.
Explain what is meant by the terms recoverable schedule and
cascading abort.
Provide examples of transaction schedules that are recoverable and that
avoid cascading aborts.
Describe the relationship between increased concurrency and increased
throughput.
Explain the purpose of having isolation levels in an RDBMS. Give an
example of a query that could benefit from each isolation level.
Given a set of transactions, determine whether a given schedule is
serial, serializable, conflict serializable, and/or view serializable.
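Conflict serializability can be checked mechanically with a precedence graph. This sketch (the function name and the schedule encoding are my own, and it ignores view serializability) adds an edge Ti → Tj for every conflicting pair with Ti's action first, then tests for a cycle:

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, action, item) with action in {'R', 'W'}.
    Returns True iff the precedence graph is acyclic."""
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and (a1 == 'W' or a2 == 'W'):
                edges.add((t1, t2))          # conflict: t1 precedes t2
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal or (nxt not in seen
                               and reachable(nxt, goal, seen | {nxt})):
                return True
        return False

    txns = {t for t, _, _ in schedule}
    return not any(reachable(t, t, set()) for t in txns)

# Classic lost-update interleaving: T1 and T2 both read then write A.
s = [(1, 'R', 'A'), (2, 'R', 'A'), (1, 'W', 'A'), (2, 'W', 'A')]
print(conflict_serializable(s))
```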
Explain how the two-phase locking protocol or any of its variants
works, and justify its usefulness.
Explain the purpose of lock management. Justify the use of multiple
granularities in lock management.
Determine whether a given transaction schedule can or cannot lead to a
deadlock.
Apply deadlock detection and deadlock prevention algorithms to a
given schedule.
Determine if a schedule with a set of queries and updates exhibits the
phantom problem.
[time dependent] Determine whether a schedule can/cannot be
produced by the timestamp-based concurrency control algorithm.
[time dependent] Determine whether a schedule is valid when using
Optimistic Concurrency Control.
Compare the different concurrency control methods in terms of: the
types of schedules they produce, efficiency, and ease of use.
Chapter 18: Crash Recovery
Explain the purposes of an image copy. Provide guidelines for how
often to make an image copy of a given table or set of tables—and in
the latter case, provide a reason for why it may be appropriate to
back up more than one table at a time.
Describe the steal/no-steal and force/no-force buffer policies. Justify
the policies used by the ARIES crash recovery algorithm.
Describe the three phases of ARIES.
Describe the actions taken by ARIES when a transaction updates a
page, aborts, or commits.
Explain the purpose of write-ahead logging.
Explain the purpose of a checkpoint. Provide guidelines for
determining an appropriate checkpoint interval.
Given a schedule with a set of actions, show the log that would be
produced by these actions.
Given a log and the fact that a crash occurred, itemize the steps that
ARIES goes through to bring the table back to an acceptable state.
Rebuild the Transaction Table and Dirty Page Table.
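A simplified sketch of how the ARIES analysis phase rebuilds those two tables from a log. The toy log format is assumed, and real ARIES starts from the last checkpoint and also handles CLRs — this only shows the two bookkeeping rules:

```python
def aries_analysis(log):
    """Rebuild the Transaction Table (txn -> lastLSN) and Dirty Page
    Table (page -> recLSN, the first LSN that dirtied the page) by
    scanning the log forward."""
    txn_table, dirty_pages = {}, {}
    for lsn, txn, rec_type, page in log:
        if rec_type == 'END':
            txn_table.pop(txn, None)        # transaction fully finished
            continue
        txn_table[txn] = lsn                # track lastLSN per live txn
        if rec_type == 'UPDATE' and page not in dirty_pages:
            dirty_pages[page] = lsn         # recLSN: earliest dirtying LSN
    return txn_table, dirty_pages

# Toy log: (lsn, txn, type, page); page is None for non-update records.
log = [
    (10, 'T1', 'UPDATE', 'P5'),
    (20, 'T2', 'UPDATE', 'P3'),
    (30, 'T2', 'COMMIT', None),
    (40, 'T2', 'END',    None),
    (50, 'T1', 'UPDATE', 'P5'),
]
print(aries_analysis(log))
```

After this scan, T1 is a loser transaction to undo, and the Dirty Page Table bounds where the redo phase must start.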