Uploaded by Jude Ocomil

3.0 Indexing Structure for files

Topic 3
3. Indexing Structure for Files
Learning Objectives
• Discuss single-level indexes:
  – Primary index
  – Clustering index
  – Secondary index
• Explain the need for multilevel indexing
• Discuss dynamic multilevel indexing using B-trees and B+-trees
• Explain the insertion and deletion processes in B-trees and B+-trees
• Discuss indexing using hashing techniques and buckets
• Develop a hashing table given a hashing function
Reference Book
• Ramez Elmasri & Shamkant B. Navathe, Fundamentals of Database Systems, Sixth Edition (pp. 660-697)
Basic Concepts and Introduction
Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  – E.g., the author catalog in a library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form <search-key, pointer>.
• Index files are typically much smaller than the original file.
Indexes
• Extra access structures, or indexes, can be created on a file.
• They are stored on disk just like the main data file, and provide other access paths that can be quicker for certain operations on certain fields.
• Any field (or combination of fields) can be used to create an index, but there will be different index types depending on whether the field is a key (unique), and whether the main file is sorted by it or not.
• There can be multiple indexes on one file.
• Indexes speed up access on the indexed field, but slow down updates: almost every update on the main table must also update every index.
Sparse and Dense Indexes
– The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller.
– A binary search on the index yields a pointer to the file record.
– Indexes can also be characterized as dense or sparse.
  • A dense index has an index entry for every search key value (and hence every record) in the data file.
  • A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.
Monday, June 14,
2021
Dense Index
• In a dense index, there is an index record for every search key value in the database.
• This makes searching faster but requires more space to store the index records themselves.
• Index records contain the search key value and a pointer to the actual record on the disk.
Sparse Index
• In a sparse index, index records appear for only some of the search key values in the file.
Pointers
Types of Single-Level Ordered Indexes
1. Primary Indexes
• An index on the ordering key (often the primary key) of a sorted file.
• The index file is a table of <key, block_pointer> pairs.
• These pairs are called the index entries; each one repeats the ordering key of the first record of the block it points to.
• The first record of each block is called the anchor record.
• To retrieve a record given its ordering key value, the system does a binary search in the index file to find the last index entry whose key value is ≤ the goal key's value, then retrieves the pointed-to block from the original file.
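The lookup described above can be sketched in a few lines of Python. This is only an illustration: the blocks, the record payloads and the name lookup are made up for the example, and real blocks would live on disk rather than in a list.

```python
from bisect import bisect_right

# Hypothetical data file: each block is a sorted list of (key, payload) records.
blocks = [
    [(3, "a"), (8, "b"), (12, "c")],
    [(15, "d"), (21, "e"), (27, "f")],
    [(30, "g"), (42, "h"), (57, "i")],
]

# Primary index: one <anchor key, block number> entry per block.
index = [(block[0][0], block_no) for block_no, block in enumerate(blocks)]
anchor_keys = [key for key, _ in index]

def lookup(goal):
    """Return the payload stored under goal, or None if it is absent."""
    # Binary search for the last index entry whose key is <= goal.
    pos = bisect_right(anchor_keys, goal) - 1
    if pos < 0:
        return None                      # goal is smaller than every anchor key
    _, block_no = index[pos]
    # One block access, then a scan inside the block.
    for key, payload in blocks[block_no]:
        if key == goal:
            return payload
    return None
```

Here lookup(21) touches only the second block, because 15 is the largest anchor key that does not exceed 21.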
Example 1
• Consider an ordered file with these parameters:
Block size (B): 1024 bytes
Record count (r): 30,000
Record length (R): 100 bytes, fixed size, unspanned
Blocking factor: bfr = ⌊B/R⌋ = ⌊1024/100⌋ = 10 records per block
Number of blocks needed by the file: b = ⌈r/bfr⌉ = ⌈30,000/10⌉ = 3000 blocks
A binary search on the data file would need to access approximately ⌈log2 b⌉ = ⌈log2 3000⌉ = 12 blocks.
Example 1 Continued
• Now consider a primary index on that file with these parameters:
Ordering key length (V): 9 bytes
Block pointer length (P): 6 bytes
Index entry length (Ri): 15 bytes (9 + 6)
Calculate the following:
i. The blocking factor of the index file, bfri
ii. The total number of index entries, ri
iii. The number of blocks needed by the index file, bi
iv. The number of block accesses when doing a binary search on the index
Solution
i. The blocking factor of the index file is bfri = ⌊B/Ri⌋ = ⌊1024/15⌋ = 68 entries per block.
ii. The total number of index entries equals the number of blocks in the main file: ri = b = 3000.
iii. Thus the number of blocks needed by the index file is bi = ⌈ri/bfri⌉ = ⌈3000/68⌉ = 45 blocks.
iv. A binary search on the index needs ⌈log2 bi⌉ = ⌈log2 45⌉ = 6 block accesses, plus 1 access to fetch the data block, for 7 block accesses in total.
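The arithmetic of Example 1 can be checked with a short Python sketch. The variable names mirror the symbols on the slides; the helper ceil_div is introduced here only for the ceiling divisions.

```python
import math

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

B, r, R = 1024, 30_000, 100      # block size, record count, record length
V, P = 9, 6                      # ordering key length, block pointer length

# Data file
bfr = B // R                                  # records per block
b = ceil_div(r, bfr)                          # blocks in the data file
data_search = math.ceil(math.log2(b))         # binary search on the data file

# Primary index
Ri = V + P                                    # index entry length
bfri = B // Ri                                # index entries per block
ri = b                                        # one entry per data block
bi = ceil_div(ri, bfri)                       # blocks in the index file
index_search = math.ceil(math.log2(bi)) + 1   # index search + data block

print(bfr, b, data_search)       # 10 3000 12
print(bfri, bi, index_search)    # 68 45 7
```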
Problems with Primary Indexing
i. Inserting and deleting records in the main file must move other records, since the file is ordered.
ii. Some insertions/deletions must also change index entries, if the anchor records change.
iii. There can be only one primary index on a file.
Types of Single-Level Indexes
2. Clustering Index
– Defined on an ordered data file.
– The data file is ordered on a non-key field, unlike a primary index, which requires that the ordering field of the data file have a distinct value for each record.
– Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.
– It is another example of a nondense (sparse) index. Insertion and deletion are relatively straightforward with a clustering index.
Example:
• Suppose a company has several employees in each department. Suppose we use a clustering index, where all employees who belong to the same Dept_ID are considered to be within a single cluster, and index pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
FIGURE 2
Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.
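A small Python sketch of how a clustering index on Dept_ID could be built. The employee names, the block capacity and the variable names are assumptions made for the illustration. Unlike Figure 2, this sketch packs records into blocks without reserving a separate block cluster per department, so a block may hold records of two departments.

```python
# Hypothetical EMPLOYEE records (name, dept_id), already sorted on the
# non-key ordering field dept_id.
records = [
    ("Ann", 1), ("Bob", 1), ("Cid", 1),
    ("Dee", 2),
    ("Eli", 3), ("Fay", 3), ("Gus", 3), ("Hal", 3),
]
BLOCK_CAPACITY = 2   # assumed number of records per block

# Pack the sorted records into consecutive blocks.
blocks = [records[i:i + BLOCK_CAPACITY]
          for i in range(0, len(records), BLOCK_CAPACITY)]

# One index entry per distinct dept_id, pointing to the first block
# that contains a record with that value.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for _, dept_id in block:
        clustering_index.setdefault(dept_id, block_no)

print(clustering_index)   # {1: 0, 2: 1, 3: 2}
```

The index has three entries for eight records, which is why a clustering index is nondense.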
Types of Single-Level Indexes
3. Secondary Index
– A secondary index provides a secondary means of accessing a file for which some primary access already exists.
– The secondary index may be on a field that is a candidate key and has a unique value in every record, or on a nonkey field with duplicate values.
– The index is an ordered file with two fields:
  i. The first field is of the same data type as some non-ordering field of the data file that is an indexing field.
  ii. The second field is either a block pointer or a record pointer.
– There can be many secondary indexes (and hence, indexing fields) for the same file.
– Includes one entry for each record in the data file; hence, it is a dense index.
Secondary Index
Summary of Single-Level Indexes
Multilevel Indexes
• The previous approaches accessed records with a binary search on the index file, then direct access to the block containing the record.
• How about creating an index on the index? And an index on that?
• Why not keep going until the top-level index fits in a single block?
• Since the first-level index is ordered and keyed on the index value, the remaining levels can all be non-dense.
Two-Level Index
Multilevel Indexing
• Here the blocking factor of the index entries, bfri, is called the fan-out, fo.
• Searching the multilevel index then needs only about logfo(bi) block accesses, instead of the ⌈log2(bi)⌉ needed by a binary search on a single-level index.
• If the first level has r1 index entries, it needs ⌈r1/fo⌉ blocks, which is also r2, the number of index entries in the next level up.
• The second level needs ⌈r2/fo⌉ blocks, which is also r3. And so on.
• Eventually some level t needs only a single block (⌈rt/fo⌉ = 1), and t is the top level.
• The number of levels can be calculated as t = ⌈logfo(r1)⌉.
Example 2:
• Recall these parameters given in Example 1:
Data file
Block size (B): 1024 bytes
Record count (r): 30,000
Record length (R): 100 bytes, fixed size, unspanned
Blocking factor (bfr): 10 records/block
Number of file blocks (b): 3000 blocks
Primary index
Ordering key length (V): 9 bytes
Block pointer length (P): 6 bytes
Index entry length (Ri): 15 bytes (9 + 6)
Blocking factor (bfri): 68 entries/block
Number of index blocks (bi): 45 blocks
• If this is extended into a multilevel index, the index blocking factor is also the fan-out (bfri = fo = 68), and the first level is the primary index itself, with b1 = bi = 45 blocks.
• The number of blocks at level 2 is ⌈b1/fo⌉ = ⌈45/68⌉ = 1, so two levels are enough.
• Accessing a record by Ssn will require 3 block reads (number of levels + 1), and the index will need 45 + 1 = 46 blocks in total (about 1.5% of the size of the data file).
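The level-by-level reasoning of the multilevel index can be written as a short loop. This is a sketch under the assumption that every level uses the same fan-out; the function name multilevel_index is made up for the illustration.

```python
def multilevel_index(r1, fo):
    """Return the number of blocks at each index level, first level first.

    r1 is the number of first-level index entries and fo the fan-out
    (fo >= 2 is assumed). Levels are added until one fits in one block.
    """
    blocks_per_level = []
    entries = r1
    while True:
        blocks = -(-entries // fo)     # ceiling division
        blocks_per_level.append(blocks)
        if blocks == 1:                # top level reached
            return blocks_per_level
        entries = blocks               # one entry per block of the level below

levels = multilevel_index(r1=3000, fo=68)
print(levels)            # [45, 1]
print(len(levels) + 1)   # 3 block reads: one per level plus the data block
print(sum(levels))       # 46 index blocks in total
```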
Class Exercise
• Consider a disk with block size B = 1024 bytes. A block pointer is P = 9 bytes long, and a record pointer is PR = 8 bytes long. A file has r = 30,000 EMPLOYEE records of fixed length. Each record has the following fields: Name (20 bytes), Ssn (15 bytes), Department_code (10 bytes), Address (30 bytes), Phone (11 bytes), Birth_date (8 bytes), Sex (1 byte), Job_code (4 bytes), and Salary (4 bytes, real number). An additional byte is used as a deletion marker.
a) Calculate the record size R in bytes.
b) Calculate the blocking factor bfr and the number of file blocks b, assuming an unspanned organization.
c) Suppose that the file is ordered by the key field Ssn and we want to construct a primary index on Ssn. Calculate:
i. The index blocking factor bfri (which is also the index fan-out fo);
ii. The number of first-level index entries and the number of first-level index blocks;
iii. The number of levels needed if we make it into a multilevel index;
iv. The total number of blocks required by the multilevel index;
v. The number of block accesses needed to search for and retrieve a record from the file, given its Ssn value, using the primary index.
d) Suppose the file is not ordered by the key field Ssn and we want to construct a secondary index on Ssn. Repeat the previous exercise (c) for the secondary index and compare with the primary index.
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
• The previous multilevel index is great, but inserting or deleting records has to happen on the underlying sorted file, which is costly. It may also require modifying the index file(s), which is costly too.
• Recall the (binary) tree structure. If the tree is in storage and occupies more than one block, there is a risk of thrashing or heavy caching.
Terms used in a tree
• A tree is formed of nodes. Each node in the tree, except for a special node called the root, has one parent node and zero or more child nodes.
• The root node has no parent. A node that does not have any child nodes is called a leaf node; a nonleaf node is called an internal node.
• A subtree of a node consists of that node and all its descendant nodes: its child nodes, the child nodes of its child nodes, and so on.
• In Figure 18.7 (shown in the previous slide), the root node is A, and its child nodes are B, C, and D. Nodes E, J, C, G, H, and K are leaf nodes. Since the leaf nodes are at different levels of the tree, this tree is called unbalanced.
Search Trees and B-Trees
• Each search field has the search value (index value) and a pointer to the associated record in storage (a record pointer or a block pointer, depending on the index).
• In practice each node is one block of storage.
• Need for balancing?
B-trees
• A B-tree is a balanced, sorted tree.
• It is used to index data held in external storage, that is, data too large to fit into main memory.
• To reduce disk accesses, several conditions of the tree must hold:
i. The height of the tree must be kept minimal.
ii. There must be no empty subtrees above the leaves of the tree.
iii. Leaves of the tree must all be at the same level.
iv. All nodes except the leaves must have some minimum number of children.
Properties of a B-tree
A B-tree of order M has the following properties:
i. Each node has a maximum of M children; each internal node other than the root has a minimum of ⌈M/2⌉ children.
ii. Each node has at most (M-1) keys.
iii. Keys are arranged in ascending order. All the keys in the subtree to the left of a key are smaller than the key (its predecessors). All the keys in the subtree to its right are greater than the key (its successors).
iv. When a key is inserted into a full node, the node is split into two nodes, and the median value is inserted in the parent node.
v. All leaves are on the same level, i.e., there is no empty subtree above the leaf nodes.
Example
• Construct a B-tree of order 5 with the following set of data:
10, 22, 9, 6, 4, 12, 40, 60, 2, 1, 8, 32, 36, 41
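The insertion rule (insert into a leaf, split a node that overflows, push the median up) can be sketched in Python as follows. This is a minimal in-memory illustration, not a disk-based implementation: the class and method names are made up, keys are assumed to be distinct, a node is split after the insertion has made it overflow, and for an even number of keys the upper median is the one pushed up.

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # an empty list marks a leaf

class BTree:
    def __init__(self, order):
        self.order = order                # maximum number of children per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                         # the root overflowed: grow one level
            median, right = split
            self.root = Node([median], [self.root, right])

    def _insert(self, node, key):
        if not node.children:             # leaf: place the key in sorted order
            insort(node.keys, key)
        else:                             # internal: descend into the right child
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:                     # child was split: adopt its median
                median, right = split
                node.keys.insert(i, median)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.order - 1:
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right

    def levels(self):
        """Return the keys of every node, grouped by tree level."""
        out, frontier = [], [self.root]
        while frontier:
            out.append([n.keys for n in frontier])
            frontier = [c for n in frontier for c in n.children]
        return out

tree = BTree(order=5)
for k in [10, 22, 9, 6, 4, 12, 40, 60, 2, 1, 8, 32, 36, 41]:
    tree.insert(k)
for level in tree.levels():
    print(level)
```

Under these assumptions the example data set ends with the root [4, 9, 22, 40] and the leaves [1, 2], [6, 8], [10, 12], [32, 36], [41, 60]. A different splitting convention can give a different but equally valid tree.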
Class Exercise
1) Construct a B-tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6.
2) Construct a B-tree of order 4 with the following set of data: 5, 3, 21, 9, 1, 13, 2, 7, 10, 12, 4, 8.
Solution??
1) Construct a B-tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6.
B-Trees Summary
• Each node has at most m children and at most m-1 keys.
• Nodes may leave some empty space, allowing for inserting new entries. (Each node other than the root has at minimum ⌈m/2⌉-1 keys, i.e., it is about half full.)
• Insertion into a non-full node is very efficient.
• If the node is full, insertion causes a split.
• Splitting the root node creates two children, with only the middle value left in the original root.
• Splitting a branch node divides it into two branch nodes, and moves the middle value to the parent node.
• Splitting can propagate to other tree levels.
• Deletion from a node that is more than half full is very efficient.
• If deletion leaves a node less than half full, it must be merged with neighboring nodes. Specific approaches to deletion and merging vary.
• Equilibrium: once the tree stabilizes, nodes are about 69% full.
B+Trees
• A B+ tree is a balanced tree structure where the leaf nodes contain the data entries and the nodes above contain entries that direct the search.
  » A distinction is made in a B+ tree between the entries in leaf nodes and the entries in non-leaf nodes.
  » The entries in the leaf nodes contain pointers to the data records. The leaf level is a dense index.
• The non-leaf nodes form a sparse index.
B+Trees Compared with B-Trees:
– Only leaf nodes store record pointers.
– All search values are stored in leaf nodes. (And may be stored at higher levels.)
– Leaf nodes are usually linked into a sorted list (sequential list/linked list).
Insertion and Deletion in B+ trees
Example
• Construct a B+ tree of order 3. The data set is 45, 77, 60, 33, 100, 14, 11, 55.
B+Trees
• B+ trees support equality and range searches, multiple-attribute keys, and partial key searches.
• They respond to dynamic changes in the table.
• Sibling pointers in the leaf level support range searches as well as equality searches. They link the leaves into a sequential set (linked list).
• E.g. SELECT * FROM employees WHERE salary BETWEEN 2000 AND 30000;
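A minimal Python sketch of a B+ tree with linked leaves shows why range searches are cheap: descend once to the leaf where the lower bound belongs, then follow the sibling pointers. The conventions used here are assumptions and may differ from the ones on the slides: order is the maximum number of children, a leaf holds at most order-1 keys, a leaf split copies the first key of the new right leaf up, an internal split moves the median up, and keys are distinct. Only keys are stored; record pointers are omitted.

```python
from bisect import bisect_right, insort

class Leaf:
    def __init__(self, keys=None):
        self.keys = keys or []
        self.next = None                  # sibling pointer to the next leaf

class Internal:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children

class BPlusTree:
    def __init__(self, order):
        self.order = order
        self.root = Leaf()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                         # root was split: add a new root
            sep, right = split
            self.root = Internal([sep], [self.root, right])

    def _insert(self, node, key):
        if isinstance(node, Leaf):
            insort(node.keys, key)
            if len(node.keys) <= self.order - 1:
                return None
            mid = len(node.keys) // 2     # split the overflowing leaf
            right = Leaf(node.keys[mid:])
            node.keys = node.keys[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right   # separator is copied up
        i = bisect_right(node.keys, key)
        split = self._insert(node.children[i], key)
        if split:
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
        if len(node.keys) <= self.order - 1:
            return None
        mid = len(node.keys) // 2         # split the overflowing internal node
        sep = node.keys[mid]              # separator is moved up
        right = Internal(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return sep, right

    def range_search(self, low, high):
        """Return all keys k with low <= k <= high, in ascending order."""
        node = self.root
        while isinstance(node, Internal):
            node = node.children[bisect_right(node.keys, low)]
        result = []
        while node is not None:           # walk the linked leaves
            for k in node.keys:
                if k > high:
                    return result
                if k >= low:
                    result.append(k)
            node = node.next
        return result

tree = BPlusTree(order=3)
for k in [45, 77, 60, 33, 100, 14, 11, 55]:
    tree.insert(k)
print(tree.range_search(33, 60))   # [33, 45, 55, 60]
```

The range query plays the role of the BETWEEN condition in the SQL example: only one root-to-leaf descent is needed, and the rest of the answer comes from the leaf chain.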
Class Exercise-Assignment 3
a) A B+ tree is of order 4. Construct a B+ tree given the following data set: 1, 4, 10, 17, 21, 31, 25, 19, 20, 28, 42
b) Construct a B+ tree of order 3 with the following set of data: 8, 5, 1, 7, 3, 12, 9, 6. Then delete 5, 12 and 9 from the tree.
B+Tree insertion
B+tree Deletion Sequence
Hashed Files
• Hashing for disk files is called External Hashing.
• The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1. Typically, a bucket corresponds to one (or a fixed number of) disk block(s).
• One of the file fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is already full. An overflow file is kept for storing such records. Overflow records that hash to each bucket can be linked together.
Hash Indexes
• The hash index is a secondary structure to access the file by using hashing on a search key other than the one used for the primary data file organization.
• The index entries are of the type <K, Pr> or <K, P>, where Pr is a pointer to the record containing the key, or P is a pointer to the block containing the record for that key.
• Searching for an entry uses the hash search algorithm on K. Once an entry is found, the pointer Pr (or P) is used to locate the corresponding record in the data file. In a practical application, there may be thousands of buckets; the bucket number, which may be several bits long, would be subjected to the directory schemes.
Example:
• An Employee file with EmployeeID as the hash key includes records with the following EmployeeID values: 2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115, 1620, 2428, 3943, 4750, 6975, 4981, and 9208.
• The file uses eight buckets, numbered 0 to 7. Each bucket is one disk block and holds two records. Load these records into the file in the given order, using the hash function h(K) = K mod 8. Clearly show the hashing table and the buckets.
Solution
Hash function: h(K) = K mod 8.
EmployeeID   K mod 8
2369         1
3760         0
4692         4
4871         7
5659         3
1821         5
1074         2
7115         3
1620         4
2428         4
3943         7
4750         6
6975         7
4981         5
9208         0
Hashing Table
• Buckets 0 to 7
Bucket   Records       Overflow
0        3760, 9208
1        2369
2        1074
3        5659, 7115
4        4692, 1620    2428
5        1821, 4981
6        4750
7        4871, 3943    6975
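The loading procedure of this example can be reproduced with a few lines of Python. The function name load and the use of one overflow list per bucket are assumptions made for the sketch; the slides only say that overflow records are kept separately.

```python
def load(keys, num_buckets=8, capacity=2):
    """Load keys in the given order using h(K) = K mod num_buckets.

    Each bucket holds at most `capacity` records; a record that hashes
    to a full bucket goes to that bucket's overflow list.
    """
    buckets = [[] for _ in range(num_buckets)]
    overflow = [[] for _ in range(num_buckets)]
    for k in keys:
        i = k % num_buckets
        if len(buckets[i]) < capacity:
            buckets[i].append(k)
        else:
            overflow[i].append(k)         # collision on a full bucket
    return buckets, overflow

employee_ids = [2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115,
                1620, 2428, 3943, 4750, 6975, 4981, 9208]
buckets, overflow = load(employee_ids)
for i in range(8):
    print(i, buckets[i], overflow[i])
```

Only two records overflow: 2428 (bucket 4) and 6975 (bucket 7).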
Hashed Files (cont.)
There are numerous methods for collision resolution, including the following:
• Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
• Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
• Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
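As an illustration of the first method, here is a small Python sketch of open addressing with one record per position. The hash function K mod m, the wrap-around at the end of the table and the name insert_open_addressing are assumptions made for the example.

```python
def insert_open_addressing(table, key):
    """Place key in the first free position at or after its hash address.

    The table is a list with None marking an empty position. Probing
    wraps around to position 0 after the last position (an assumption).
    Returns the position used.
    """
    m = len(table)
    start = key % m                       # hash address
    for step in range(m):
        pos = (start + step) % m
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")

table = [None] * 8
positions = [insert_open_addressing(table, k)
             for k in (2369, 3760, 4692, 1620, 2428)]
print(positions)   # [1, 0, 4, 5, 6]
```

The keys 4692, 1620 and 2428 all hash to address 4, so the second and third are pushed to positions 5 and 6.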
Hashed Files (cont.)
Employee hashing
Figure 18.15 illustrates a hash index on the Emp_id field for a file that has been stored as a sequential file ordered by Name. The Emp_id is hashed to a bucket number by using a hashing function: the sum of the digits of Emp_id modulo 10. For example, to find Emp_id 51024, the hash function results in bucket number 2; that bucket is accessed first. It contains the index entry <51024, Pr>; the pointer Pr leads us to the actual record in the file.
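The digit-sum hash function and the <K, Pr> index entries can be sketched in Python as follows. The three employee records and the function names are hypothetical, and a list position stands in for the record pointer Pr.

```python
def digit_sum_hash(emp_id, num_buckets=10):
    """Sum of the decimal digits of emp_id, modulo the number of buckets."""
    return sum(int(d) for d in str(emp_id)) % num_buckets

# Hypothetical data file, stored in order of Name: (name, emp_id).
records = [("Ana", 23402), ("Ben", 51024), ("Cam", 81165)]

# Hash index on Emp_id: each bucket holds <K, Pr> entries, where Pr is
# the position of the record in the data file.
hash_index = {b: [] for b in range(10)}
for pr, (_, emp_id) in enumerate(records):
    hash_index[digit_sum_hash(emp_id)].append((emp_id, pr))

def find(emp_id):
    """Look up a record by Emp_id through the hash index."""
    for k, pr in hash_index[digit_sum_hash(emp_id)]:
        if k == emp_id:
            return records[pr]
    return None

print(digit_sum_hash(51024))   # 2, since 5 + 1 + 0 + 2 + 4 = 12
print(find(51024))             # ('Ben', 51024)
```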
End
Any Questions?