Learning Goals for
CPSC 404: Advanced Relational Databases
September-December 2023
Last Update: September 4, 2023
This document contains the course-level learning goals and a superset of the topic-level learning
goals for CPSC 404. I’ve been using most of these learning goals in CPSC 404 classes in recent
years. Note that this document includes a bunch of special (or historical) topics that changed
from term to term, since we used to have 2-3 weeks’ worth of specialized material (instructor’s
choice). Now, with the introduction of Transaction Management, Concurrency Control, and
Crash Recovery, we don’t really have that opportunity anymore. Skip the learning goals for the
“Special Topics” units if we don’t cover the topic in the current version of CPSC 404.
Course-Level Learning Goals
Explain and justify the I/O cost-based model for query evaluation in relational database
management systems (RDBMSs).
Recommend the most useful indexes for a set of tables, given some information about the
expected query mix (e.g., the importance and frequency of expected queries).
Show how to use an index (e.g., B+ tree, extendible hash structure, linear hash structure) to look
up search keys and the corresponding rows in a table. Demonstrate insertions and
deletions of keys in the index, by performing splitting or merging of nodes. Analyze the
complexity of operations against these data structures.
Explain the query evaluation and optimization decisions regarding joins, indexes, pipelining,
selection, projection, etc. made by an RDBMS optimizer, given appropriate metadata.
Estimate the I/O cost of evaluating and optimizing a given SQL query, given a set of indexes and
appropriate metadata about the tables and indexes.
Estimate the I/O cost of sorting a large file using external mergesort.
Explain how concurrency is handled to permit greater throughput in an RDBMS. Compare and
contrast the different concurrency control methods that could be used by an RDBMS in terms of
the types of schedules they produce, efficiency, and ease of implementation.
Define the following properties of database transactions, and give practical examples of their
characteristics or desirable features: ACID properties, serializability, and isolation levels.
Explain how logging works in a DBMS that uses the ARIES crash recovery algorithm. Provide
guidelines for setting appropriate checkpoint intervals. For application-oriented recovery,
provide guidelines for setting appropriate image copy intervals.
Topic-Level Learning Goals
Introduction (brief
coverage)
List the job responsibilities of various DB personnel. Gain an
awareness of the scope and complexity of database technology in an
organization.
Explain the benefits of logical and physical data independence brought
about by the relational model.
Identify some database challenges in providing 24x7 operations.
Explain why DBMSs are so hard to configure “properly” for an
organization.
Justify the use of autonomic (self-tuning) database systems.
Explain how self-tuning RDBMSs can improve query performance.
Provide examples of DBA (database administration) activity that can be
simplified using autonomic computing.
Chapter 9a: Disks
Explain the impact that disk activity has on DBMS query performance.
Draw the memory hierarchy. Show where database bottlenecks are
most likely to occur and where extensive caching takes place.
Compare and contrast: cost, capacity, and speed of access, in the levels
of the memory hierarchy.
Identify the components of a disk drive.
Given disk geometry figures, calculate the amount of time that it takes
to read or write a number of bytes, blocks/pages, tracks, or cylinders of
data onto a disk.
Given disk geometry figures, compute the minimum, average, and
maximum seek, rotation, and transfer times to/from a disk.
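As a worked illustration of these geometry calculations, the sketch below uses hypothetical figures (7200 RPM, 512-byte sectors, 500 sectors per track, 9 ms average seek — all assumed for the example, not taken from the course notes):

```python
# Hypothetical disk geometry figures (assumed for illustration only).
RPM = 7200                 # spindle speed
SECTOR_BYTES = 512
SECTORS_PER_TRACK = 500
AVG_SEEK_MS = 9.0          # assumed average seek time

rotation_ms = 60_000 / RPM               # one full revolution (~8.33 ms here)
avg_rotational_delay_ms = rotation_ms / 2
track_bytes = SECTOR_BYTES * SECTORS_PER_TRACK

def transfer_ms(num_bytes):
    """Time to transfer num_bytes once the head is positioned."""
    return rotation_ms * (num_bytes / track_bytes)

def avg_read_ms(num_bytes):
    """Average time for one random read: seek + rotational delay + transfer."""
    return AVG_SEEK_MS + avg_rotational_delay_ms + transfer_ms(num_bytes)

print(round(avg_read_ms(8192), 2))       # cost of reading one 8 KB page
```

Note how seek and rotational delay dominate the tiny transfer time for a single page — the motivation for the sequential-placement goals below.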
Compare and contrast the relative speeds of seek, rotation, and transfer
times—when accessing a given size of data on disk.
Explain how a large file, broken up into pages, can be optimally placed
on a disk to improve performance.
Compare and contrast hard disk drives (HDDs) to solid state disks
(SSDs). Discuss performance implications.
Defend the ongoing role of hard disk drives in the DBMS world (e.g.,
explain why we can’t eliminate spinning, hard disk drives anytime
soon).
Explain the difference between a random read and a sequential read,
and argue why one is preferable over the other.
Argue for or against: It is worth spending a fair bit of DBA time on
analyzing disk geometries, disk usage, and table and index sizes.
Chapter 9b: Buffer Pool Management
Explain the purpose of a DBMS buffer pool, and justify why a DBMS
should manage its own buffer pool, rather than let the OS handle it.
Provide an example of sequential flooding in a DBMS buffer pool.
Explain the tradeoffs between force/no force and steal/no steal page
management in a buffer pool. Justify the use of the ARIES algorithm
to exploit these properties. [Many times, the ARIES algorithm isn’t
covered in depth in CPSC 304; therefore, students often need a
refresher of the key goals and features of it.]
For a given reference string (of page accesses), compute the behavior of
the following buffer pool page replacement algorithms: FIFO, LRU,
MRU, Clock (reference bit), and Extended Clock (reference bit + dirty
bit).
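These replacement policies can be simulated directly. Here is a minimal sketch (the function name and the restriction to FIFO/LRU/MRU are my own simplifications; a full treatment would also cover Clock and Extended Clock). The reference string repeats a scan of one more page than there are frames — the classic sequential-flooding case where LRU faults on every access while MRU does not:

```python
def count_faults(refs, frames, policy="LRU"):
    """Count page faults for a reference string with a given pool size.
    policy: 'LRU' evicts least-recently-used, 'MRU' most-recently-used,
    'FIFO' the oldest-loaded page."""
    pool, faults = [], 0            # pool ordered oldest -> newest use/load
    for p in refs:
        if p in pool:
            if policy in ("LRU", "MRU"):   # refresh recency on a hit
                pool.remove(p)
                pool.append(p)
            continue
        faults += 1
        if len(pool) == frames:
            pool.pop(0 if policy in ("LRU", "FIFO") else -1)
        pool.append(p)
    return faults

refs = [1, 2, 3, 4] * 3             # repeated scan of 4 pages, 3 frames
print(count_faults(refs, 3, "LRU"), count_faults(refs, 3, "MRU"))
```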
Create a reference string that produces worst-case performance for a
given page replacement algorithm.
Explain why page requests for a disk may not be serviced immediately.
List some of the points of contention.
[Later in the course] Explain how the page replacement policy and the
access plan can have a big impact on the number of I/Os required by a
given application.
[Later in the course] Predict which buffer pool page replacement
strategy works best with a given workload (e.g., table scan, index scan,
index lookup, return a large number of rows, RUNSTATS utility, log
files).
Chapter 9c (First Part): Disk Scheduling
Only an overview of the basics will be presented. We won’t spend too
much time on this topic, though.
Explain the relationship among disk geometry, buffer pool
management, and disk scheduling in providing good performance for
large data requests from a user of a DBMS. List the bottlenecks that
may contribute to poor I/O performance in this disk “chain”.
Compute the service order for a queue of track/cylinder/page requests
using each of these disk scheduling algorithms: FCFS (First Come,
First Serve), SSTF (shortest seek time first), and Elevator (Scan with
and without Look), and [optional] CSCAN (circular scan, with and
without Look).
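A sketch of the service-order computation for three of these algorithms, using the common textbook simplification that all requests are queued up front (the cylinder numbers below are illustrative, not from the course):

```python
def service_order(head, requests, algo="FCFS"):
    """Return the order in which queued cylinder requests are serviced."""
    pending = list(requests)
    if algo == "FCFS":                   # first come, first serve
        return pending
    if algo == "SSTF":                   # always pick the closest request
        pos, order = head, []
        while pending:
            nxt = min(pending, key=lambda c: abs(c - pos))
            pending.remove(nxt)
            order.append(nxt)
            pos = nxt
        return order
    if algo == "ELEVATOR_UP":            # Scan: sweep up first, then down
        up = sorted(c for c in pending if c >= head)
        down = sorted((c for c in pending if c < head), reverse=True)
        return up + down
    raise ValueError(algo)

reqs = [98, 183, 37, 122, 14, 124, 65, 67]   # head starts at cylinder 53
print(service_order(53, reqs, "SSTF"))
print(service_order(53, reqs, "ELEVATOR_UP"))
```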
[Optional] Integrate disk scheduling times with disk geometries. In
particular, given disk geometry statistics (e.g., sector size, RPM) and
page requests at specific arrival times, compute the service completion
schedule for short jobs for the various disk scheduling algorithms.
[Optional] Compare and contrast these disk scheduling algorithms:
FCFS, SSTF, and Elevator (with and without Look), and CSCAN (with
and without Look). Determine which kinds of page requests would
most benefit (perhaps unfairly) from each algorithm.
Explain the problem of starvation, with respect to disk scheduling.
Provide an example of starvation.
Chapter 9c (Second Part): Record and Page Layouts, Fixed and
Variable-Length Records, Metadata
Case Study: IBM DB2 Catalog statistics for z/OS (a large
enterprise-level system). IBM DB2 has been rebranded to Db2; but, for
consistency with existing course materials, we’ll often refer to it as
DB2.
Compare and contrast the record layouts for fixed-length and
variable-length records in a DBMS. Provide an advantage for each.
Compare and contrast the page layouts for fixed-length and
variable-length records in a DBMS. Provide an advantage for each.
Explain why rows in a table might be relocated.
Justify the use of free space within a page, and intermittent free pages
within a file, for an RDBMS table.
Compute the maximum number of records in a data file given an
estimate of the size of the file (e.g., pages, tracks, or cylinders) and
possibly other constraints (e.g., level of fill on a typical page).
Compute the size of the file needed to be able to store a given number
of records (given a set of assumptions).
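A minimal sketch of both sizing calculations, using assumed figures (4 KB pages, 100-byte fixed-length records, 80% fill):

```python
import math

# Assumed figures, for illustration only.
PAGE_BYTES = 4096
RECORD_BYTES = 100
FILL_FACTOR = 0.8            # leave ~20% free space per page for inserts

records_per_page = math.floor((PAGE_BYTES * FILL_FACTOR) / RECORD_BYTES)

def pages_needed(num_records):
    """Pages required to store num_records at the given fill factor."""
    return math.ceil(num_records / records_per_page)

def max_records(num_pages):
    """Maximum records that fit in num_pages at the given fill factor."""
    return num_pages * records_per_page

print(records_per_page, pages_needed(1_000_000), max_records(500))
```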
Given probabilities of average string lengths, determine whether it
makes more sense to use a fixed-length field, rather than a
variable-length field.
Give at least ten examples of the kind of metadata stored for an
RDBMS.
Justify the use of metadata from the perspective of both a DBMS and a
DBA.
[During assignments] Query an RDBMS catalog for metadata that is of
interest to a DBA.
Argue for the value of storing metadata as a table in an RDBMS, rather
than as another data structure.
Chapter 10: Tree-Structured Indexes
List the 3 ways or “alternatives” of representing data entries k* having
key k, in a general index.
Justify the use of indexes in database systems.
Explain how a tree-structured index supports both range and point
queries.
Explain the differences between dense and sparse indexes.
Build a B+ tree for a given set of data.
Show how to insert and delete keys (records) in a B+ tree.
Analyze the complexity of: (a) searching, (b) inserting into, and (c)
deleting from a B+ tree.
Explain why a B+ tree needs sibling pointers for its leaf pages.
Argue for or against: Overflow pages are not needed in a B+ tree.
Explain why B+ trees tend to be very shallow, even after storing many
millions of data entries.
Provide arguments for why B+ trees can store large numbers of data
entries in their pages.
Explain the pros and cons of allowing prefix key compression in an
index.
Given a set of data, build a B+ tree using bulk loading.
Provide several advantages of bulk loading compared to using SQL
INSERT statements.
Estimate the number of page I/Os and/or page faults required to look up
a search key in an index, and to locate the record(s) corresponding to
that search key.
Estimate the number of pages at each level of a B+ tree, given a
(possibly) composite search key, the number of data records, and a set
of assumptions (e.g., percentage of fill, unique vs. non-unique, Alt. 1
vs. Alt. 2 vs. Alt. 3). (Note: Most of our calculations will be done
using dense, Alt. 2 indexes, where the file is stored separately from the
index.)
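A sketch of that level-by-level estimate for a dense, Alternative-2 B+ tree; the record count, leaf capacity, and inner-node fanout below are assumed values, chosen only to make the arithmetic concrete:

```python
import math

# Assumed parameters for a dense Alternative-2 index (illustrative only).
NUM_RECORDS = 1_000_000
LEAF_ENTRIES = 200       # (key, rid) data entries per leaf page at ~2/3 fill
FANOUT = 300             # child pointers per inner page

leaves = math.ceil(NUM_RECORDS / LEAF_ENTRIES)
levels = [leaves]
while levels[-1] > 1:                   # each level indexes the one below it
    levels.append(math.ceil(levels[-1] / FANOUT))
levels.reverse()                        # root first
print(levels)                           # pages per level, root to leaves
```

With a fanout in the hundreds, a million data entries still need only three levels — which is why B+ trees stay so shallow.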
Using a diagram, show the principal difference between a clustered and
an unclustered index.
Provide arguments for what column(s) (possibly composite) would be
the best candidate(s) for a clustering index for a given table.
Provide examples of when clustering doesn’t help performance, and
may actually hinder the performance of certain kinds of queries.
Chapter 11: Hash-Based Indexes
Compare and contrast the performance of hash-based indexes versus
tree-based indexes (e.g., B+ tree) for equality and range searches.
Provide the best-case, average-case, and worst-case complexities for
such searches.
Explain how collisions are handled for open addressing and chaining
implementations of hash structures. (from CPSC 221)
Explain the advantages that dynamic hashing provides over static
hashing.
Show how insertions and deletions are handled in extendible hashing.
Build an extendible hash index using a given set of data.
Show how insertions and deletions are handled in linear hashing.
Build a linear hash index using a given set of data.
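The bucket-addressing rules behind both dynamic schemes can be sketched in a few lines. The bucket counts and the `Next` pointer value below are assumed; real implementations hash keys to bit strings, but for integer keys Python's built-in `hash` (which is the identity for small ints) is enough to illustrate the arithmetic:

```python
def ext_bucket(key, global_depth):
    """Extendible hashing: the directory slot is the low-order
    global_depth bits of h(key); doubling the directory just uses one
    more bit."""
    return hash(key) & ((1 << global_depth) - 1)

def lin_bucket(key, level, next_ptr, n0=4):
    """Linear hashing with n0 initial buckets and no directory: apply
    h_level = h mod (n0 * 2^level); buckets before Next have already
    been split this round, so rehash those with h_{level+1}."""
    b = hash(key) % (n0 * (1 << level))
    if b < next_ptr:
        b = hash(key) % (n0 * (1 << (level + 1)))
    return b

print(ext_bucket(13, 2), lin_bucket(13, 0, 2))
```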
Describe some of the major differences between extendible hashing and
linear hashing (e.g., how the directory is handled, how skew is
handled).
Chapter 13: External Sorting Using Mergesort (General External
Mergesort, Two-Phase Multiway Mergesort)
Justify the importance of external sorting (i.e., sorting for disk resident
files).
Argue that, for many database applications, I/O activity (and hence,
elapsed time) dominates CPU time when estimating the complexity.
Compute the number of sorted runs (or passes) that are required to sort
a large file using general external mergesort.
Compute the number of I/Os (using one physical I/O per page request)
that are required to sort a file of size N using k-way (general) external
mergesort, where N is the number of pages in the file, and k ≥ 2. Show
how your calculations change when the block size is bigger than one
page at a time.
Analyze the complexity and scalability of sorting a large file using
general external mergesort.
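A sketch of the standard I/O-cost formula, assuming one physical I/O per page and B buffer pages (so pass 0 produces ceil(N/B) sorted runs and each later pass does a (B-1)-way merge over the whole file):

```python
import math

def mergesort_io(N, B):
    """Page I/Os for general external mergesort on a file of N pages
    with B buffer pages. Each pass reads and writes every page once."""
    runs = math.ceil(N / B)                       # sorted runs after pass 0
    passes = 1 + (0 if runs <= 1 else math.ceil(math.log(runs, B - 1)))
    return passes, 2 * N * passes

# e.g., a 1000-page file with 11 buffer pages: 91 initial runs,
# then two 10-way merge passes.
print(mergesort_io(1000, 11))
```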
Identify potential bottlenecks in sorting a large file using general
external mergesort (e.g., RAM, I/O, block size, pages vs. tracks vs.
cylinders, I/O scheduling).
Explain how two-phase multiway mergesort relates to general external
mergesort.
Using the parameters for a given disk geometry, and a list of
assumptions, estimate the elapsed time for sorting a large file using
general external mergesort (or two-phase multiway mergesort).
Suggest several optimizations that can be exploited when sorting large
files via general external mergesort (or two-phase multiway
mergesort)—e.g., cylindrification, larger block sizes, double buffering
(not covered anymore), disk scheduling, and multiple smaller disks
including work files.
Explain, perhaps via a diagram, why the sorting performance of large
files is data dependent.
Estimate the number of short seeks and the number of long seeks that
are required in a data dependent sort of a large file, where a “short
seek” is to a neighbouring cylinder, and a “long seek” is a seek that’s
further away than one cylinder.
Argue for or against: sorting performance has linear complexity for
large datasets.
Chapters 12 and 14: Query Evaluation and Optimization
Note: We no longer cover Chapter 15, as there is sufficient content in
Chapters 12 and 14.
Translate between SQL and Relational Algebra.
Give examples of the diverse types of metadata stored in an RDBMS
catalog.
Write SQL queries against catalog tables, such as those for IBM’s DB2.
Explain the purpose of a plan in an RDBMS. Explain why many plans
are often possible when evaluating a query. Differentiate between good
and bad plans.
Justify the use of metadata in estimating the cost of computing
relational queries. Explain why the System R model has stood the test
of time for query evaluation.
Provide examples of—and explain—how “out of date” catalog statistics
can provide horrendous query performance.
Regarding “out of date” catalog/metadata statistics, analyze the
tradeoffs between static and dynamic metadata updates, drawing
examples from commercial RDBMSs such as those based upon IBM’s
System R model (e.g., DB2 RUNSTATS).
Provide examples of when it is faster to read the whole table(s) in
response to a query, than to use any index(es).
Explain how access paths relate to query plans.
Given an SQL query, determine the indexes that may apply when
evaluating the query.
Given an SQL query and a list of indexes, determine an efficient plan
for evaluating the query. Justify the assumptions behind your
reasoning.
Compute the difference that clustering makes when evaluating certain
types of queries.
Compare and contrast the roles that sorting and hashing have (for
intermediate results) when evaluating a query.
Compute the cost of evaluating a given query using a block nested loop
(BNL) join.
Compute the cost of evaluating a given query using an index nested
loop (INL) join (hash, B+ tree) for these types of indexes: (a) clustered
B+ tree, (b) unclustered.
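A sketch of the two cost formulas under the usual simplifying assumptions (B-2 buffer pages for the outer relation in BNL; a fixed per-probe I/O cost for INL). The table sizes below are illustrative, not from the course:

```python
import math

def bnl_cost(M, N, B):
    """Block nested loop join: scan the outer once; for each outer
    chunk of B-2 pages, scan the inner relation in full."""
    return M + math.ceil(M / (B - 2)) * N

def inl_cost(M, tuples_per_page, probe_cost):
    """Index nested loop join: scan the outer, then one index probe per
    outer tuple (probe_cost covers the lookup plus fetching matches)."""
    return M + M * tuples_per_page * probe_cost

# Illustrative: outer M=1000 pages at 100 tuples/page, inner N=500
# pages, B=102 buffers, ~3 I/Os per probe (all assumed).
print(bnl_cost(1000, 500, 102), inl_cost(1000, 100, 3))
```

Note how BNL beats INL here despite the index — INL pays per outer *tuple*, BNL per outer *chunk*.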
Compute the cost of doing a sort-merge join when evaluating a query.
Explain when and how it is possible to merge more than 2 sorted runs
during the join phase, when joining 2 tables.
Compute the cost of doing a hash join when evaluating a query.
Identify possible uses of a hash join, and explain how such joins scale
to larger datasets.
Justify which type of join to perform when evaluating a query, given a
list of indexes, catalog statistics, reduction factors, and/or assumptions.
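The sort-merge vs. hash comparison can be sketched the same way, under common simplifying assumptions (a single merging pass with no duplicate-spilling for sort-merge; one partitioning pass sufficing for Grace hash join, i.e., enough buffers relative to the smaller input):

```python
def smj_cost(M, N, sort_passes_M, sort_passes_N):
    """Sort-merge join: sort both inputs (2 * pages * passes each, since
    every pass reads and writes), then one merging pass over each."""
    return 2 * M * sort_passes_M + 2 * N * sort_passes_N + M + N

def hash_join_cost(M, N):
    """Grace hash join, assuming one partitioning pass suffices:
    read + write both inputs to partition, then read both to probe."""
    return 3 * (M + N)

# Illustrative: M=1000, N=500 pages, 2 sorting passes per input (assumed).
print(smj_cost(1000, 500, 2, 2), hash_join_cost(1000, 500))
```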
Construct a query evaluation tree for evaluating a given SQL query.
Explain the purpose of left-deep joins and pipelining. Relate these to
the complexity of evaluating a query plan when joining multiple tables.
Compute the number of I/Os required to evaluate a query using various
plans, both with and without pipelining, for: (a) single-relation queries,
and (b) multiple-relation queries (where multiple is a very small
positive integer).
Analyze and explain (in high-level terms), the additional complexity
that multiple-relation queries bring to query evaluation and
optimization. For example, demonstrate that additional tables, joins,
indexes, reduction factors, etc. complicate the process.
Determine when a table scan is better than using an index, during query
evaluation and optimization.
Argue for or against: TPC benchmarks are very useful when
comparing the performance of database systems.
Suggest additional forms of metadata (that are currently not available)
that would be useful when performing query evaluation.
Identify and discuss the criteria, measures, and resource costs
(including user buy-in) that go into a table reorganization decision.
Discuss and evaluate the tradeoffs between autonomic computing
decisions and DBA decisions, in determining (for example) when to do
table reorganizations and backups. Identify potential bottlenecks with
respect to online activity, isolation levels, and RDBMS transaction
performance that such decisions may bring.
Describe, using a diagram, how a hash join can improve BNL join
performance (over sort-merge join).
Explain the conditions under which you should use a hash join instead
of a sort-merge join.
Explain the role of histograms in selectivity estimation.
Compute data frequencies using equi-width and equi-depth histograms
using various numbers of buckets.
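A minimal sketch of both histogram constructions (bucket-boundary conventions vary between textbooks; the ones used here are an assumption, and the code assumes the values are not all equal):

```python
import math

def equi_width(values, buckets):
    """Equi-width histogram: equal-size value ranges, varying counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) / width), buckets - 1)   # clamp the max value
        counts[i] += 1
    return counts

def equi_depth(sorted_values, buckets):
    """Equi-depth histogram: varying ranges, roughly equal counts;
    returns the upper boundary value of each bucket."""
    n = len(sorted_values)
    return [sorted_values[math.ceil(n * k / buckets) - 1]
            for k in range(1, buckets + 1)]

vals = [1, 1, 2, 2, 2, 3, 5, 8, 9, 14, 14, 15]
print(equi_width(vals, 3))
print(equi_depth(sorted(vals), 3))
```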
Explain how the choice of histogram and its parameters can affect the
approximation of an attribute’s data distribution.
Chapters 16 and 17: Transaction Management and Concurrency
Control
Define the terms transaction, schedule, serial schedule, serializability,
and read/write conflicts; and describe the relationship among those
terms when it comes to database read and write actions.
Define the ACID properties of an RDBMS. Argue for why these
properties are desirable. Identify probable consequences of not meeting
any one of these properties.
Give reasons for allowing concurrency in a database system.
Explain why a serializable schedule is a desirable schedule.
Explain what is meant by the terms recoverable schedule and
cascading abort.
Provide examples of transaction schedules that are recoverable and that
avoid cascading aborts.
Describe the relationship between increased concurrency and increased
throughput.
Explain the purpose of having isolation levels in an RDBMS. Give an
example of a query that could benefit from each isolation level.
Given a set of transactions, determine whether a given schedule is
serial, serializable, conflict serializable, and/or view serializable.
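Conflict serializability can be checked mechanically with a precedence graph. This sketch (the function name and the schedule encoding are my own, and it ignores view serializability) adds an edge Ti → Tj for every conflicting pair with Ti's action first, then tests for a cycle:

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, action, item) with action in {'R', 'W'}.
    Returns True iff the precedence graph is acyclic."""
    edges = set()
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and (a1 == 'W' or a2 == 'W'):
                edges.add((t1, t2))          # conflict: t1 precedes t2
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal or (nxt not in seen
                               and reachable(nxt, goal, seen | {nxt})):
                return True
        return False

    txns = {t for t, _, _ in schedule}
    return not any(reachable(t, t, set()) for t in txns)

# Classic lost-update interleaving: T1 and T2 both read then write A.
s = [(1, 'R', 'A'), (2, 'R', 'A'), (1, 'W', 'A'), (2, 'W', 'A')]
print(conflict_serializable(s))
```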
Explain how the two-phase locking protocol or any of its variants
works, and justify its usefulness.
Explain the purpose of lock management. Justify the use of multiple
granularities in lock management.
Determine whether a given transaction schedule can or cannot lead to a
deadlock.
Apply deadlock detection and deadlock prevention algorithms to a
given schedule.
Determine if a schedule with a set of queries and updates exhibits the
phantom problem.
[time dependent] Determine whether a schedule can/cannot be
produced by the timestamp-based concurrency control algorithm.
[time dependent] Determine whether a schedule is valid when using
Optimistic Concurrency Control.
Compare the different concurrency control methods in terms of: the
types of schedules they produce, efficiency, and ease of use.
Chapter 18: Crash Recovery
Explain the purposes of an image copy. Provide guidelines for how
often to make an image copy of a given table or set of tables—and in
the latter case, provide a reason for why it may be appropriate to
back up more than one table at a time.
Describe the steal/no-steal and force/no-force buffer policies. Justify
the policies used by the ARIES crash recovery algorithm.
Describe the three phases of ARIES.
Describe the actions taken by ARIES when a transaction updates a
page, aborts, or commits.
Explain the purpose of write-ahead logging.
Explain the purpose of a checkpoint. Provide guidelines for
determining an appropriate checkpoint interval.
Given a schedule with a set of actions, show the log that would be
produced by these actions.
Given a log and the fact that a crash occurred, itemize the steps that
ARIES goes through to bring the table back to an acceptable state.
Rebuild the Transaction Table and Dirty Page Table.
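A simplified sketch of how the ARIES analysis phase rebuilds those two tables from a log. The toy log format is assumed, and real ARIES starts from the last checkpoint and also handles CLRs — this only shows the two bookkeeping rules:

```python
def aries_analysis(log):
    """Rebuild the Transaction Table (txn -> lastLSN) and Dirty Page
    Table (page -> recLSN, the first LSN that dirtied the page) by
    scanning the log forward."""
    txn_table, dirty_pages = {}, {}
    for lsn, txn, rec_type, page in log:
        if rec_type == 'END':
            txn_table.pop(txn, None)        # transaction fully finished
            continue
        txn_table[txn] = lsn                # track lastLSN per live txn
        if rec_type == 'UPDATE' and page not in dirty_pages:
            dirty_pages[page] = lsn         # recLSN: earliest dirtying LSN
    return txn_table, dirty_pages

# Toy log: (lsn, txn, type, page); page is None for non-update records.
log = [
    (10, 'T1', 'UPDATE', 'P5'),
    (20, 'T2', 'UPDATE', 'P3'),
    (30, 'T2', 'COMMIT', None),
    (40, 'T2', 'END',    None),
    (50, 'T1', 'UPDATE', 'P5'),
]
print(aries_analysis(log))
```

After this scan, T1 is a loser transaction to undo, and the Dirty Page Table bounds where the redo phase must start.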