Conceptual Design Using the ER Model

advertisement
Database Design and
The EntityRelationship Model
Lecture 21
R&G - Chapter 2
A relationship, I think, is like a
shark, you know? It has to
constantly move forward or it
dies. And I think what we got on
our hands is a dead shark.
Woody Allen (from Annie Hall, 1979)
Administriva
• Homework 4 Due One Week From Today
• Midterm is graded
Exam Summary
20
Average: 74.5
Max: 93.5
Min: 36.5
Std Dev: 11.6
10
Diff from Mean Exam 1
•
•
•
•
15
Good on Exam 1
Worse on Exam 2
5
0
-40
-30
-20
-10
0
10
20
-5
-10
Bad on Exam 1
Good on Exam 2
-15
-20
Diff from Mean Exam 2
30
Exam Details
Section: Low – High / Total (Average)
• 1: External Sorting 10 – 24 / 24 (18)
• 2: Query Execution 17 – 43 / 44 (36)
• 3: Query Optim.
6.5 – 29 / 32 (20)
Section 1: External Sorting
• For all questions in this section, assume that you have 96
pages in the buffer pool, and you wish to sort EMP by
Ename.
• Using the general external sort, how many sorting passes
are required? (6pts)
– #passes = 1 + Ceil(log(B-1)(N/B))
– = 1 + Ceil(log(95)(10,000/96))
– =1+2
– =3
• What is the I/O cost of doing this sort? (6pts)
– I/O Cost = 2N * (# passes)
– = 20,000 * 3
Section 1 (cont)
•
Database systems sometimes use “blocked I/O” because reading a block of
continuous pages is more efficient that doing a separate I/O for each page.
Assume that in our computer system, it is much more efficient to read and
write blocks of 32 pages at a time, so all reads and writes from files must be in
blocks of 32 pages (and if a file has less than 32 pages, it is padded with blank
pages). Consider doing external sort with blocked I/O, which gets faster I/O at
the expense of more sorting passes. If the database must read and write 32
pages at a time, how many sorting passes are required? Show either the
formula you use, or the temp file sizes after each pass. (4pts)
–
–
–
–
–
–
–
–
–
Pass 0: 105 runs of 96 pages
Pass 1: 2-way merge 96 + 96 -> result is 53 runs of 192 pages
Pass 2: 2-way merge of 192 + 192 -> result is 27 runs of 384 pages
Pass 3: -> result is 14 runs of 768 pages,
Pass 4: -> result is 7 runs 1536 pages,
Pass 5: -> result is 4 runs 3072 pages,
Pass 6: -> result is 2 runs of 6144,
Pass 7: -> result is 10,000
8 passes total
Section 1 (cont)
•
Blocked I/O is used because reading a 32-page block is
faster than 32 separate 1-page I/Os. For each sorting
pass, you still must read and write every page in the
file, but instead of doing 10,000 1-page I/Os, instead
you do (10,000/32) block I/Os. Assume that in our
system, we can read a 32-page block in the time it
would normally take to do 8 single-page I/Os. Is the
blocked I/O sort faster or slower than a regular sort
from question 2? By approximately what ratio? (4pts)
•
Assume that instead of a heap file, the records from
EMP are stored a clustered B-Tree index, whose key is
Ename, using alternative 1 (i.e., the full data records
are stored in the leaves of the tree). The B-tree has
depth 3. Assuming the B-Tree already exists, what is
the approximate I/O cost to use the B-tree to get the
records in sorted order? (4pts)
Section 2: Relational Operators
•
Consider the operation:  (EID < 5000)EMP
•
What is the I/O cost of this operation if there is no index? (4pts)
–
Lecture 15 slide 9: no index -> sequential scan, N I/Os, meaning 10,000
I/Os
•
What is the reduction factor? (4pts)
–
5000/100,000 = 0.05
•
Assume that instead of a heap file, the records from EMP are stored
in a clustered B-Tree index, whose key is Ename, using alternative 1
(i.e., the full data records are stored in the leaves of the tree). The
B-tree has depth 3. Assuming the B-Tree already exists, what is the
I/O cost of this selection operation? (4pts)
–
Due to a copy paste error on my part, the B-Tree Index is *not* useful for
the query. I gave credit either way, whether people tried to use the index
as an index, or for sequential scan.
–
In considering the cost of using the index, either as an index or for a
scan, full credit required knowing that B-Trees have an overhead of
approximately 50%, i.e., there are 50% more leaves than the number of
pages in a heap file.
Relational Operators (cont)
•
Consider the query: select distinct Ename from EMP
order by Ename
– If you had 96 pages of memory, what algorithm would you
use to execute this query and why? (4pts)
– Order by suggests sort-merge to remove dups.
•
If, instead, the query were: select distinct Ename from
EMP and you had 101 pages of memory, what
algorithm would you use and why? (4pts)
– Could use hashing to remove dups, which would have a
similar I/O cost but be more CPU efficient.
Relational Operators (cont)
•
Consider the join: In_Dept  Dept
– Assuming 100 pages of memory, what is the I/O cost of this join using Blocked Nested Loops?
(4pts)
– Lec 15, slide 27: Blocked nested loops: M + Ceil(M/(B-2)) * N
– 550 + Ceil(550/100)) * 20 = 550 + 6 * 20 = 670
•
What is the I/O cost of this using Index Nested Loops, with an unclustered Hash index
on Dept.DID? Assume an average cost of 1.2 I/Os to get to the right hash bucket.
(4pts)
– Lec 15, slide 27: index nested loops: M + #tuples in M * index cost
– 550 + 110,000 * (1.2 + 1) = 242,000
•
Consider the join: Dept  In_Dept. Assuming 100 pages of memory, what is the I/O
cost of this using Sort-Merge-Join, optimized so that the last merge pass is combined
with the join if possible? (4pts)
– Best: 1 pass over In_Dept, to make 6 runs, sort Dept in memory, merge. Cost 550*3 + 20 =
1670
– By the book: 3(M + N) = 3*570 = 1710
•
Assuming 100 pages of memory, what is the I/O cost of this using regular (not hybrid)
hash join? (4pts)
– Best: partition In_Dept (2 * 550), hash Dept in memory, match. Cost 550 * 3 + 20 = 1670
– By the book: 3(M + N) = 1710
Relational Operators (cont)
•
Consider the join: ( (EID <= 100)EMP)  In_Building Assume there
is no index on EMP, and an unclustered hash index on
In_Building.EID, and plan operators are pipelined where possible.
You have 100 pages of memory. Assume that it takes on average 1.2
I/Os get reach the appropriate hash bucket.
•
What is the I/O cost of this using Blocked Nested Loops? (4pts)
– M + Ceil(M/(B-2)) * N
– If you do not push select down:10,000 + Ceil(10,000/(100-2)) * 550 =
10,000 + 103*550 = 66,650
– If you do push select, pipeline result: Size of selection in pages is S =
(I/100,000)*10,000 = I/10 = 10
– Join cost: 10,000 + Ceil(10/(100-2)) * (550) = 10,550
•
What is the I/O cost of this using Index Nested Loops? (4pts)
– If you do push select, pipeline result: Size of selection in pages is:
– S = (I/100,000)*10,000 = I/10 = 10
–
each EID will have 1.1 matching tuples.
– Join cost: M + (# tuples * index cost)
– 10,000 + 100 * (1.2 + 1.1) = 10,230
Section 3: Query Optimization
•
There are several parts of an SQL query that can cause a column (or columns)
to be considered an "interesting order", meaning that an ordered input can
lower the cost of evaluating that part of the query. Briefly describe three of
these, and for each one briefly explain why. (6 pts)
–
Order By, Group By, Joins, Distinct, some aggregations (e.g. Max, Min)
•
Write the reduction factor that would be used by a System R style query
optimizer for each of the following predicates (6pts)
–
Building.BID > 150
–
–
•
•
•
BID ranges from 1 to 200, so this would be ¼.
•
We don’t know about distribution of names, so 1/10.
•
Since there are 100,000 employees, this is 1/100,000
EMP.Ename = “Joe”
IN_DEPT.EID = 003
Given the following query, where “X” is the join operator:
π(EMP.Ename) (Dept.Budget > 500000) (Emp X In_Dept X Dept)
Mark whether each of the following queries are equivalent (True/False).
–
__T___π(EMP.Ename) (Dept.Budget > 500000) (Emp X Dept X In_Dept)
–
__T___π(EMP.Ename) (Dept.Budget > 500000) (Dept X In_Dept X Emp)
–
__T___π(EMP.Ename) (Emp X In_Dept X (Dept.Budget > 500000)(Dept))
–
__F___(Dept.Budget > 500000) (π(EMP.Ename) (Emp) X In_Dept X Dept)
Query Optimization (cont)
•
Given the schema on the first page of the exam, assume the following:
– There are unclustered hash indexes on both Emp.EID and Dept.DID.
– There are clustered B-Tree Indexes (alt 1 – data records store in the B-Tree leaves) on
both Emp.EID and Dept.DID.
– There is an unclustered Btree index on (Emp.Salary, Emp.EID)
– You can assume that Btree-Indexes have heights of 3. And the cost for getting to the
data record using a hash index is 2.
•
What is the best access plan for the following query (4pts).
SELECT E.eid, E.Salary
FROM Emp E
WHERE E.Salary > 100,000
– Very best is unclustered B-tree, index only plan
•
Draw all possible join orders considered by a System-R style query optimizer for
the following query. (4pts)
SELECT E.eid, D.dname
FROM Emp E, In_Dept I, Dept D
WHERE E.sal = 64,000 AND D.budget > 500,000 AND
E.eid = I.eid AND I.did = D.did
– (E X I) X D)
– (I X E) X D)
– (D X I) X E)
– (I X D) X E)
Where are we?
• This week: Database Design (me)
– The ER Model as a design tool
• Next Week: Concurrency Control & Recovery
(Prof. Roth)
• Week after: More Database Design (me)
Today and Thursday: The ER Model
• A different data model from Relational
• Most commonly used for database design
• Today: Details of the ER Model
• Thursday: Translating ER Schemas to Relational
Databases Model the Real World
• “Data Model” translates real world things
into structures computers can store
• Many models:
– Relational, E-R, O-O, Network, Hierarchical, etc.
• Relational
– Rows & Columns
– Keys & Foreign Keys to link Relations
Enrolled
sid
53666
53666
53650
53666
cid
grade
Carnatic101
C
Reggae203
B
Topology112
A
History105
B
Students
sid
53666
53688
53650
name
login
Jones jones@cs
Smith smith@eecs
Smith smith@math
age
18
18
19
gpa
3.4
3.2
3.8
Database Design
• The process of modelling things in the real
world into elements of a data model.
• I.E., describing things in the real world using a
data model.
• E.G., describing students and enrollments
using various tables with key/foreign key
relationships
• The Relational model is not the only model in
use…
A Problem with the Relational Model
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2))
CREATE TABLE Students
(sid CHAR(20),
name CHAR(20),
login CHAR(10),
age INTEGER,
gpa FLOAT)
With complicated schemas, it may be hard for a person
to understand the structure from the data definition.
Enrolled
cid
grade sid
Carnatic101
C 53666
Reggae203
B 53666
Topology112
A 53650
History105
B 53666
Students
sid
53666
53688
53650
name
login
Jones jones@cs
Smith smith@eecs
Smith smith@math
age
18
18
19
gpa
3.4
3.2
3.8
One Solution: The E-R Model
• Instead of relations, it has:
Entities and Relationships
• These are described with diagrams,
both structure, notation more obvious to humans
• A visual language for describing schemas
since
name
ssn
lot
Students
dname
budget
did
Enrolled_in
Courses
Steps in Database Design
• Requirements Analysis
– user needs; what must database do?
• Conceptual Design
– high level descr (often done w/ER model)
• Logical Design
– translate ER into DBMS data model
• Schema Refinement (in 2 weeks)
– consistency, normalization
• Physical Design (discussed already)
– indexes, disk layout
• Security Design
– who accesses what, and how
Conceptual Design
• Define enterprise entities and relationships
• What information about entities and relationships
should be in database?
• What are the integrity constraints or business rules
that hold?
• A database `schema’ in the ER Model is represented
pictorially (ER diagrams).
• Can map an ER diagram into a relational schema.
ER Model Basics
ssn
name
lot
Employees
• Entity:
Real-world thing, distinguishable from other objects.
Entity described by set of attributes.
• Entity Set: A collection of similar entities. E.g., all employees.
– All entities in an entity set have the same set of attributes.
(Until we consider hierarchies, anyway!)
– Each entity set has a key (underlined).
– Each attribute has a domain.
ER Model Basics (Contd.)
since
name
ssn
did
lot
Employees
dname
Works_In
budget
Departments
• Relationship: Association among two or more entities.
E.g., Attishoo works in Pharmacy department.
– relationships can have their own attributes.
• Relationship Set: Collection of similar relationships.
– An n-ary relationship set R relates n entity sets E1 ... En ;
each relationship in R involves entities e1  E1, ..., en  En
ER Model Basics (Cont.)
name
ssn
since
dname
did
Employees
supervisor
budget
Departments
lot
Works_In
subordinate
Reports_To
• Same entity set can participate in
different relationship sets, or in
different “roles” in the same set.
name
ssn
since
lot
dname
did
budget
Key Constraints
Employees
An employee can
work in many
departments; a
dept can have
many employees.
In contrast, each dept
has at most one
manager, according
to the key constraint
on Manages.
Manages
Departments
Works_In
since
Many-toMany
1-to Many
1-to-1
Participation Constraints
• Does every employee work in a department?
• If so, this is a participation constraint
– the participation of Employees in Works_In is said to be
total (vs. partial)
– What if every department has an employee working in it?
• Basically means “at least one”
since
name
ssn
dname
did
lot
Employees
Manages
budget
Departments
Works_In
Means: “exactly one”
since
Weak Entities
A weak entity can be identified uniquely only by
considering the primary key of another
(owner) entity.
– Owner entity set and weak entity set must
participate in a one-to-many relationship set (one
owner, many weak entities).
– Weak entity set must have total participation in
this identifying relationship set.
name
ssn
lot
Employees
cost
Policy
pname
age
Dependents
Weak entities have only a “partial key” (dashed underline)
Binary vs. Ternary Relationships
name
ssn
If each policy is
owned by just 1
employee:
Key constraint on
Policies would
mean policy can
only cover 1
dependent!
• Think through all
the constraints in
the 2nd diagram!
pname
lot
Employees
Dependents
Covers
Bad design
age
Policies
policyid
cost
name
pname
ssn
lot
age
Dependents
Employees
Purchaser
Beneficiary
Better design
policyid
Policies
cost
Binary vs. Ternary Relationships (Contd.)
• Previous example illustrated case when two binary
relationships were better than one ternary
relationship.
• Opposite example: a ternary relation Contracts
relates entity sets Parts, Departments and Suppliers,
and has descriptive attribute qty. No combination of
binary relationships is an adequate substitute.
Binary vs. Ternary Relationships (Contd.)
qty
Parts
Contract
Departments
VS.
Suppliers
Parts
can-supply
needs
Suppliers
Departments
deals-with
– S “can-supply” P, D “needs” P, and D “deals-with” S does
not imply that D has agreed to buy P from S.
– How do we record qty?
Summary so far
• Entities and Entity Set (boxes)
• Relationships and Relationship sets (diamonds)
– binary
– n-ary
• Key constraints (1-1,1-M, M-M, arrows on 1 side)
• Participation constraints (bold for Total)
• Weak entities - require strong entity for key
• Next, a couple more “advanced” concepts…
Download