Lecture 3 - Analysis of File Organizations

Department of Computer Science
University of Cyprus
EPL446 – Advanced Database Systems
Lecture 3
Overview of Storage and Indexing
Chap. 8.4-8.5: Ramakrishnan & Gehrke
* exclude 8.4.5-8.4.6
Demetris Zeinalipour
http://www.cs.ucy.ac.cy/~dzeina/courses/epl446
EPL446: Advanced Database Systems - Demetris Zeinalipour (University of Cyprus)
Lecture Outline
Overview of Storage and Indexing
• Note: This lecture aims to qualitatively compare the file organization and index alternatives we introduced in the previous lecture.
• 8.4) Comparison of File Organizations
– System and Cost Model
– Heap Files, Sorted Files and Clustered Files
– Comparison of I/O Costs
• 8.5) Indexes and Performance Tuning
– Understanding the Workload
– Index Specification in SQL
– Index-Only Plans
– Index Selection Guidelines
[Figure: DBMS layers, top to bottom: Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB]
Cost Model for Our Analysis
• The unit of information that is read from or written to a disk is called a page, e.g., 4KB or 8KB.
• An index or data page in any file organization consists of several records, which are accessed by their respective record IDs (RIDs).
• Our analysis will use the following notation:
– B: # of data pages
– R: # of records per page
– D: (average) time to read or write a disk page
[Figure: a file of B pages (Page_1, Page_2, …, Page_B), each holding R records identified by RID_1 … RID_R]
System Model
[Figure: a sorted file (by composite key <age,sal>) of B pages with R records per page, stored in the DB on disk and accessed by the DBMS (File and Index Layer) in main memory. Average time for 1 I/O = D.]
Page-1: <22,6003> <25,3000> <29,2007>
Page-2: <33,4003> <40,6003> ...
Page-B: <44,4000> <44,5004> <50,5004>
Analysis Assumptions
To simplify our analysis we make some assumptions:
1. Processing Time: The average time to process a record (e.g., a comparison) in main memory, denoted as C, is taken to be zero.
– The I/O cost outweighs the CPU cost by several orders of magnitude (about 5 with the values below).
– For example, some typical values for D (I/O time) and C (CPU time) are: D = 15 milliseconds (1.5 x 10^-2 seconds) and C = 100 nanoseconds (1.0 x 10^-7 seconds).
2. Prefetching: We ignore the gains of prefetching a sequence of pages (i.e., the I/O cost is approximated).
3. Average-case analysis: We perform an average-case analysis, as opposed to a worst-case analysis, which is good enough to show the overall trends.
– e.g., a DBMS operation A takes 100 I/Os half of the time and 0 I/Os the other half. The average cost is 50 I/Os.
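The two numeric claims on this slide can be checked with a few lines of arithmetic. The sketch below assumes D and C are expressed in seconds and uses the typical values quoted above; the variable names are only illustrative.

```python
import math

# Typical values quoted in the lecture, converted to seconds (assumption).
D = 15e-3    # average time to read or write one disk page: 15 ms
C = 100e-9   # average time to process one record in memory: 100 ns

# How many in-memory record operations fit in the time of one page I/O?
ratio = D / C
orders_of_magnitude = math.log10(ratio)
print(f"1 I/O ~ {ratio:,.0f} record operations (~10^{orders_of_magnitude:.1f})")

# Average-case cost of operation A: 100 I/Os half of the time, 0 otherwise.
outcomes = [(0.5, 100), (0.5, 0)]   # (probability, cost in I/Os)
average_ios = sum(p * cost for p, cost in outcomes)
print(f"Average cost of A: {average_ios:.0f} I/Os")
```

With such a gap between D and C, treating C as zero changes the totals very little, which is why the analysis counts only page I/Os.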
Comparing File Organizations
EMPLOYEE records (ssn, name, age, salary, ...)
1. Heap files: Employee records are inserted at EOF in random order.
2. Sorted files: Employee records are sorted on the composite (concatenated) search key <age, sal>.
3. Clustered B+ tree file
– Alternative 1: Data entries contain <age, sal> (useful for equality searches).
– Clustered: The order of the data entries is close to (actually the same as) that of the data records.
[Figure: 4 B+ tree indexes on data sorted by name (data entries shown, not data records), illustrated to explain composite search keys. Index entries contain ONLY age (the first search key attribute) and not sal; data entries contain both age and sal. Example conditions shown: age=12 AND sal<=30; sal=20 AND age<50.]
Operations to Compare
• Scan: Fetch all records from disk
– e.g., SELECT * FROM Employees;
• Equality Selection
– e.g., SELECT * FROM Employees WHERE age=33 AND sal=4003;
• Range Selection
– e.g., SELECT * FROM Employees WHERE age BETWEEN 35 AND 45;
– e.g., SELECT * FROM Employees WHERE 35<age AND sal<=4000;
– But NOT: SELECT * FROM Employees WHERE sal>40; (the tree index is on age)
• Insert a record
– e.g., INSERT INTO Employees (age, sal) VALUES (45, 3000);
• Delete a record
– e.g., DELETE FROM Employees WHERE age=45;
* For more details on composite search keys, see Section 8.5.3.
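The examples above suggest a rule of thumb: a tree index on <age, sal> helps only when the condition constrains the leading attribute (age). The following is a minimal sketch of that check, assuming a condition is summarised by the set of attributes it constrains; the function name `tree_index_can_help` is hypothetical and not part of any DBMS.

```python
def tree_index_can_help(search_key, condition_attrs):
    """Return True if the condition constrains the leading attribute
    of the (composite) search key. Simplified illustration only."""
    return len(search_key) > 0 and search_key[0] in condition_attrs

key = ("age", "sal")   # composite search key <age, sal>

# WHERE age BETWEEN 35 AND 45
between_age = tree_index_can_help(key, {"age"})
# WHERE 35<age AND sal<=4000
age_and_sal = tree_index_can_help(key, {"age", "sal"})
# WHERE sal>40 (no condition on age)
sal_only = tree_index_can_help(key, {"sal"})

print(between_age, age_and_sal, sal_only)
```

A real optimizer considers more than this (for example the selectivity of each condition), so the function only mirrors the three cases listed on the slide.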
Heap File Analysis
• Heap File Assumptions
– Equality selection is on the key <age,sal>.
– Equality selection produces exactly 1 match.

Operation          Cost         Explanation
ScanAll            BD           All records are read.
Eq. Selection      0.5BD        On average we traverse ½ of the file.
Range Selection    BD           The file is in random order and tuples might be anywhere!
Insert             2D           Read the (last) Page B + write Page B.
Delete             0.5BD + D    Find the page + write the page back after the deletion.

[Figure: heap file (records in random order), accessed by the DBMS (File and Index Layer). Average time for 1 I/O = D.]
Page-1: <40,6003> <25,3000> <44,5004>
Page-2: <29,2007> <33,4003> <44,4000>
...
Page-B: <22,6003> <50,5004>
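To get a feel for the magnitudes, the heap file formulas can be evaluated for concrete values. The sketch below assumes B = 1000 pages and D = 15 ms (both chosen only for illustration); the function name `heap_file_costs` is hypothetical.

```python
def heap_file_costs(B, D):
    """I/O cost (in the time unit of D) of each operation on a heap file
    with B data pages, following the formulas of the slide."""
    return {
        "scan": B * D,                 # read every page
        "equality": 0.5 * B * D,       # on average half the file is read
        "range": B * D,                # matches may be anywhere
        "insert": 2 * D,               # read last page + write it back
        "delete": 0.5 * B * D + D,     # find the page + write it back
    }

costs = heap_file_costs(B=1000, D=0.015)   # assumed example values
for operation, seconds in costs.items():
    print(f"{operation:<9}{seconds:8.3f} s")
```

Insert is the only operation whose cost does not grow with B, which is the main attraction of heap files.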
Sorted File Analysis
• Sorted File Assumptions
– Files are compacted after deletions (no holes in pages).

Operation          Cost                      Explanation
ScanAll            BD                        All records are read.
Eq. Selection      D log2(B)                 Binary search over B pages. Each I/O costs D.
Range Selection    D(log2(B) + #matches)     Binary search for the 1st tuple, then transfer the rest of the qualifying pages.
Insert             D log2(B) + BD            Binary search for the correct position + shift (read/write) ½ of the subsequent pages (i.e., 2 x 0.5BD = BD).
Delete             D log2(B) + BD            Same as insert, but ½ of the pages are shifted back in order to compact the file.

[Figure: sorted file, accessed by the DBMS (File and Index Layer). Average time for 1 I/O = D.]
Page-1: <22,6003> <25,3000> <29,2007>
Page-2: <33,4003> <40,6003> <44,4000>
...
Page-B: <44,5004> <50,5004>
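The sorted file formulas can be evaluated the same way. This sketch assumes B = 1024 pages (so that log2(B) is exactly 10), D = 15 ms, and a range query that touches 10 qualifying pages; these numbers and the name `sorted_file_costs` are illustrative assumptions.

```python
import math

def sorted_file_costs(B, D, matching_pages=1):
    """I/O cost of each operation on a sorted file with B data pages.
    matching_pages is the number of qualifying pages of a range query."""
    search = D * math.log2(B)          # binary search over the pages
    return {
        "scan": B * D,
        "equality": search,
        "range": D * (math.log2(B) + matching_pages),
        "insert": search + B * D,      # locate + shift half the file (read+write)
        "delete": search + B * D,      # locate + shift back to compact
    }

costs = sorted_file_costs(B=1024, D=0.015, matching_pages=10)
for operation, seconds in costs.items():
    print(f"{operation:<9}{seconds:8.3f} s")
```

Searches become logarithmic in B, but every insert or delete now pays roughly the cost of a full scan to keep the file sorted and compact.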
Clustered File Analysis
• Clustered B+ Assumptions (Alternative 1)
– Index pages are 67% full, which implies that the index file is 3/2 the size of the data file, i.e., 1.5B pages (recall that for every data record we have an index record).
– F: average fanout

Operation          Cost                         Explanation
ScanAll            1.5BD                        The index is 1.5 times larger than the data.
Eq. Selection      D logF(1.5B)                 F-ary search over 1.5B pages. Each I/O costs D.
Range Selection    D(logF(1.5B) + #matches)     Equality selection, then transfer the rest of the qualifying pages.
Insert             D logF(1.5B) + D             F-ary search for the correct position + write the page (most of the time there will be enough space in that page).
Delete             D logF(1.5B) + D             Same as insert.
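The effect of the fanout F is easiest to see numerically. The sketch below assumes B = 1000 pages, D = 15 ms and an average fanout F = 100 (all illustrative values), and compares the F-ary search with the binary search of a sorted file; `clustered_file_costs` is a hypothetical helper name.

```python
import math

def clustered_file_costs(B, D, F, matching_pages=1):
    """I/O cost of each operation on a clustered B+ tree file
    (Alternative 1) whose pages are 67% full, i.e. 1.5*B pages."""
    pages = 1.5 * B
    search = D * math.log(pages, F)    # F-ary search from root to leaf
    return {
        "scan": pages * D,
        "equality": search,
        "range": D * (math.log(pages, F) + matching_pages),
        "insert": search + D,          # locate the leaf + write it
        "delete": search + D,
    }

B, D, F = 1000, 0.015, 100             # assumed example values
costs = clustered_file_costs(B, D, F, matching_pages=10)
binary_search_cost = D * math.log2(B)  # equality search on a sorted file

print(f"F-ary search:  {costs['equality']:.4f} s")
print(f"binary search: {binary_search_cost:.4f} s")
```

The scan is 50% more expensive than on a packed file, but searches, inserts and deletes all stay within a couple of page I/Os for realistic fanouts.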
Understanding the Workload
• Workload: The typical mix of i) query (Select) and ii) update (Insert/Delete/Update) operations in a DBMS.
• For each query/update in the workload:
– Which types of operations are involved (Select, Insert, Delete, Update)?
– Which relations/attributes does it access?
– Which attributes are involved in selection/join conditions? How selective are these conditions likely to be?
• Selectivity: The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.
– E.g., σ age>40 (EMPLOYEE) returns 10 out of 1000 tuples. Selectivity = 1%.
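Selectivity is a simple ratio, as the following sketch shows. The relation contents are synthetic and were chosen only so that 10 of the 1000 employees are older than 40, matching the example above; the helper name `selectivity` is an assumption.

```python
def selectivity(tuples, predicate):
    """Fraction of tuples that satisfy the selection predicate."""
    if not tuples:
        raise ValueError("selectivity is undefined for an empty relation")
    selected = sum(1 for t in tuples if predicate(t))
    return selected / len(tuples)

# Synthetic EMPLOYEE ages: 990 employees aged 30 and 10 aged 45.
employee_ages = [30] * 990 + [45] * 10

s = selectivity(employee_ages, lambda age: age > 40)
print(f"Selectivity of age>40: {s:.0%}")
```

A low selectivity such as this one means few tuples qualify, which is the situation where an index on the selection attribute tends to pay off.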
Index Specification in SQL
• The SQL standard (up until SQL 2008) does not include any statement for creating/dropping indexes.
• However, in practice every major DBMS supports index creation, with access methods such as B-trees, hash, R-trees and GiST.
Example from the PostgreSQL DBMS (a partial B-tree index on a composite key):
CREATE INDEX AgeSalIndex
ON Employees USING BTREE (age, sal)
WHERE sal > 3000;
Index-Only Plans
• Index-Only Plans: Plans that can be evaluated without EVER accessing the base relation (i.e., by ONLY accessing the data entries).
• Index-only plans work for both clustered and unclustered indexes!
Consider the following queries and the B+ tree indexes that support them:
// Count employees for each dno
SELECT E.dno, COUNT(*)
FROM Emp E
GROUP BY E.dno
with data entries: <E.dno>
// For each dno find the min salary
SELECT E.dno, MIN(E.sal)
FROM Emp E
GROUP BY E.dno
with data entries: <E.dno,E.sal>
// Avg salary for employees that satisfy the predicate
SELECT AVG(E.sal)
FROM Emp E
WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
with data entries: <E.age,E.sal> or <E.sal,E.age>
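The idea behind an index-only plan can be imitated in a few lines: if the leaf level of the index already holds <dno, sal> pairs in key order, the COUNT and MIN queries above need nothing else. The sketch below is only a simulation with a sorted Python list standing in for the leaf level; the data values are hypothetical.

```python
from itertools import groupby

# Hypothetical data entries <dno, sal> of a B+ tree index, kept in key
# order as in the leaf level. The Emp records themselves are never read.
data_entries = sorted([(10, 3000), (20, 4500), (10, 2500),
                       (30, 6000), (20, 4000)])

count_per_dno = {}
min_sal_per_dno = {}
for dno, group in groupby(data_entries, key=lambda entry: entry[0]):
    salaries = [sal for _, sal in group]
    count_per_dno[dno] = len(salaries)     # SELECT dno, COUNT(*) ... GROUP BY dno
    min_sal_per_dno[dno] = salaries[0]     # entries are sorted, so first = MIN(sal)

print(count_per_dno)
print(min_sal_per_dno)
```

Because the entries are already grouped and ordered by the search key, one sequential pass over the leaf level answers both queries.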
Choice of Indexes
• The DBA is usually confronted with several questions regarding indexes:
– Which relations should have indexes?
– What type of index should we use? Clustered? Hash? B-tree?
– What attribute(s) should be the search key?
– Should we build several indexes?
Index Selection Guidelines
• Tip 1: Consider the queries that are executed most often (the most important ones), e.g., for Oracle:
– SELECT executions, sql_text FROM v$sqlarea ORDER BY executions DESC;
– V$ views are Oracle's Dynamic Performance Views.
• Tip 2: Try to choose indexes that benefit as many
queries as possible
• Tip 3: Attributes in WHERE clause are candidates
for index keys.
• Tip 4: Hash vs. Tree
– Exact match condition suggests Hash index.
– Range query suggests tree index.
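Tip 4 can be illustrated with two in-memory stand-ins: a dictionary plays the role of a hash index and a sorted list searched with `bisect` plays the role of a tree index. This is a sketch under those assumptions, not how a DBMS stores its indexes; the sample <age, sal> records are the ones used in the earlier figures, and the function names are hypothetical.

```python
import bisect
from collections import defaultdict

# Sample <age, sal> records; the list position serves as the record ID.
employees = [(22, 6003), (25, 3000), (29, 2007), (33, 4003),
             (40, 6003), (44, 4000), (44, 5004), (50, 5004)]

# Hash-style index on age: direct lookup of an exact key, no ordering.
hash_index = defaultdict(list)
for rid, (age, _sal) in enumerate(employees):
    hash_index[age].append(rid)

# Tree-style index on age: entries kept sorted by key.
tree_index = sorted((age, rid) for rid, (age, _sal) in enumerate(employees))

def equality_lookup(age):
    """Exact match: one probe of the hash index."""
    return hash_index.get(age, [])

def range_lookup(low, high):
    """Inclusive range: locate both ends in the sorted entries, then slice."""
    start = bisect.bisect_left(tree_index, (low, -1))
    end = bisect.bisect_right(tree_index, (high, len(employees)))
    return [rid for _, rid in tree_index[start:end]]

print(equality_lookup(44))     # record IDs with age = 44
print(range_lookup(35, 45))    # record IDs with age BETWEEN 35 AND 45
```

The hash structure cannot answer the range query without probing every possible key, whereas the sorted structure answers both kinds of query, at the price of a search instead of a single probe.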
Index Selection Guidelines (continued)
• Tip 5: Consider the best plan using the current indexes, and see if a better plan is possible with an additional index. If so, create it!
• Tip 6: Since only one index can be clustered per relation, choose it based on the important queries that would benefit the most from clustering.
• Tip 7: Multi-attribute search keys should be considered when a WHERE clause contains several conditions.
• Tip 8: Indexes can make queries go faster, but updates become slower. Indexes also require additional disk space, so choose them wisely!
Summary
• Understanding the nature of the workload for the application, and the performance goals, is essential to developing a good design.
– What are the important queries and updates? What attributes/relations are involved?
• Indexes must be chosen to speed up important queries (and perhaps some updates!).
– Choose indexes that can help many queries, if possible.
– Don't use an excessive number of indexes, as there is an associated maintenance overhead.
– Build indexes to support index-only strategies.
– Clustering is an important decision; only 1 index on a given relation can be clustered!