Database Indexes

advertisement
DATABASE INDEXES
Nimesh Shah (nimesh.s) , Amit Bhawnani (amit.b)
Outline



Why Indexes ?
Demo
Types of Single-level Ordered Indexes
•
•
•





Primary Indexes
Clustering Indexes
Secondary Indexes
Multilevel Indexes
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
Indexes on Multiple Keys
Disadvantages of Indexes
Indexes in SQL SERVER


Clustered Index
Non Clustered Index
Why Indexes?



Indexing mechanisms used to speed up access to desired data. They are
much the same as book indexes, providing the database with quick jump
points on where to find the full reference.
Search Key - attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the form
Search-Key


Pointer
Index files are typically much smaller than the original file
Access types supported efficiently. E.g.,


Records with a specified value in the attribute
Records with an attribute value falling in a specified range of values.
Demo

Given a CUSTOMER table, find all customers whose
name = ‘Justine’
 Query
:
SELECT * FROM CUSTOMER
WHERE name ='Justine‘
Single Level Indexes - Indexes as
Access Paths




A single-level index is an auxiliary file that makes it more
efficient to search for a record in the data file.
The index is usually specified on one field of the file
(although it could be specified on several fields)
One form of an index is a file of entries <field value,
pointer to record>, which is ordered by field value
The index is called an access path on the field.
Indexes as Access Paths (contd.)

The index file usually occupies considerably less disk
blocks than the data file because its entries are much
smaller

A binary search on the index yields a pointer to the file
record

Indexes can also be characterized as dense or sparse.


A dense index has an index entry for every search key value
(and hence every record) in the data file.
A sparse (or nondense) index, on the other hand, has index
entries for only some of the search values
Indexes as Access Paths (contd.)
Example: Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
Record size R = 150 bytes
Block size B = 512 bytes
No of Records r = 30000 records
Then, we get:
Blocking factor Bfr = B div R = 512 div 150 = 3 records/block
Number of file blocks b = (r/Bfr) = (30000/3) = 10000 blocks
For an index on the SSN field, assume the field size VSSN= 9 bytes,
Assume the record pointer size PR= 7 bytes. Then:
Index entry size RI=(VSSN+ PR)=(9+7)=16 bytes
Index blocking factor BfrI= B div RI= 512 div 16= 32 entries/block
Number of index blocks b= (r/ BfrI)= (30000/32)= 938 blocks
Binary search needs log2bI= log2938= 10 block accesses
This is compared to an average linear search cost of:
(b/2)= 30000/2= 15000 block accesses
If the file records are ordered, the binary search cost would be: log 2b= log230000= 15 block accesses
Types of Single-Level Indexes

Primay Index

Defined on an ordered data file

The data file is ordered on a key field

Includes one index entry for each block in the data file; the index
entry has the key field value for the first record in the block,
which is called the block anchor

A similar scheme can use the last record in a block.

A primary index is a nondense (sparse) index, since it includes an
entry for each disk block of the data file and the keys of its
anchor record rather than for every search value.
Primary Index (contd.)
Primary
index on
the ordering
key field of
an ordered
data file
Types of Single-Level Indexes

Clustering Index

Defined on an ordered data file

The data file is ordered on a non-key field unlike primary index,
which requires that the ordering field of the data file have a
distinct value for each record.

Includes one index entry for each distinct value of the field; the
index entry points to the first data block that contains records
with that field value.

It is another example of nondense index where Insertion and
Deletion is relatively straightforward with a clustering index.
Clustering Index (contd.)
A clustering index on the
DEPTNUMBER ordering nonkey
field of an EMPLOYEE file.
Clustering Index (contd.)
Clustering index with a
separate block cluster for each
group of records that share
the same value for the
clustering field.
Types of Single-Level Indexes

Secondary indexes




A secondary index provides a secondary means of accessing a
file for which some primary access already exists.
The secondary index may be on a field which is a candidate key
and has a unique value in every record, or a nonkey with
duplicate values.
The index is an ordered file with two fields.
 The first field is of the same data type as some nonordering field
of the data file that is an indexing field.
 The second field is either a block pointer or a record pointer.
There can be many secondary indexes (and hence, indexing fields)
for the same file.
Includes one entry for each record in the data file; hence, it is a
dense index
Secondary Indexes (contd.)
A dense secondary
index (with block
pointers) on a
nonordering key
field of a file.
Secondary Indexes (contd.)
A secondary index (with recored pointers) on
a nonkey field implemented using one level
of indirection so that index entries are of
fixed length and have unique field values.
Types of Single-Level Indexes
Multi-Level Indexes

Because a single-level index is an ordered file, we can
create a primary index to the index itself ; in this case, the
original index file is called the first-level index and the
index to the index is called the second-level index.

We can repeat the process, creating a third, fourth, ..., top
level until all entries of the top level fit in one disk block

A multi-level index can be created for any type of firstlevel index (primary, secondary, clustering) as long as the
first-level index consists of more than one disk block
Multi-Level Indexes (contd.)
A two-level primary
index resembling ISAM
(Indexed Sequential
Access Method)
organization.
Multi-Level Indexes (contd.)


Such a multi-level index is a form of search tree ;
however, insertion and deletion of new index entries
is a severe problem because every level of the index
is an ordered file.
A node in a search tree with pointers to subtrees
below it.
Dynamic Multilevel Indexes Using
B-Trees and B+ -Trees

Because of the insertion and deletion problem,
most multi-level indexes use B-tree or B+-tree
data structures, which leave space in each tree
node (disk block) to allow for new index entries

These data structures are variations of search
trees that allow efficient insertion and deletion of
new search values.

In B-Tree and B+-Tree data structures, each node
corresponds to a disk block

Each node is kept between half-full and
completely full
Dynamic Multilevel Indexes Using
B-Trees and B+ -Trees

An insertion into a node that is not full is quite
efficient; if a node is full the insertion causes a
split into two nodes

Splitting may propagate to other tree levels

A deletion is quite efficient if a node does not
become less than half full

If a deletion causes a node to become less than
half full, it must be merged with neighboring
nodes
Difference between B-tree and B+-tree





In a B-tree, pointers to data records exist at all levels of the
tree
In a B+-tree, all pointers to data records exists at the leaflevel nodes
A B+-tree can have less levels (or higher capacity of search
values) than the corresponding B-tree
B-tree allows search-key values to appear only once;
eliminates redundant storage of search keys.
Search keys in nonleaf nodes appear nowhere else in the
Btree; an additional pointer field for each search key in a
nonleaf node must be included.
B-tree structures. (a) A node in a B-tree with q – 1 search values.
(b) A B-tree of order p = 3. The values were inserted in the order
8, 5, 1, 7, 3, 12, 9, 6.
The nodes of a B+-tree. (a) Internal node of a B+-tree with q –1
search values. (b) Leaf node of a B+-tree with q – 1 search
values and q – 1 data pointers.
Indexes on Multiple Keys


Use multiple indices for certain types of queries.
Example:
select account-number from account
where branch-name = “Perryridge” and balance = 1000

Possible strategies for processing query using indices on single
attributes:
1. Use index on branch-name to find accounts with branch name of
Perryridge; test balance = 1000.
2. Use index on balance to find accounts with balances of $1000; test
branch-name = “Perryridge”.
3. Use branch-name index to find pointers to all records pertaining to the
Perryridge branch. Similarly use index on balance. Take intersection of
both sets of pointers obtained.
Indexes on Multiple Keys


All of these alternatives eventually give the correct result. However, if the
set of records that meet each condition (branch-name = “Perryridge” OR
balance = 1000) individually are large, yet only a few records satisfy the
combined condition, then none of the above is a very efficient technique for
the given search request.
Suppose we have an index on combined search-key (branch-name, balance).

With the where clause

where branch-name = “Perryridge” and balance = 1000 the index on the
combined search-key will fetch only records that satisfy both conditions.

Can also efficiently handle where branch-name = “Perryridge” and balance <
1000

But cannot efficiently handle where branch-name < “Perryridge” and balance =
1000. May fetch many records that satisfy the first but not the second condition.
Disadvantages of Indexes


Indexes require additional storage space on disk,
just as the index in a book require additional
pages.
Too many indexes can actually slow your database
down. Each time a page or database row is
updated or removed, the reference or index also
has to be updated
Indexes in Microsoft SQL Server

Indexes are organized as B-trees.
Each page in an index B-tree is called an index node.
 The top node of the B-tree is called the root node.
 The bottom level of nodes in the index is called the leaf
nodes.
 Any index levels between the root and the leaf nodes are
collectively known as intermediate levels.
 The pages in each level of the index are linked in a doublylinked list.


SQL Server primarily supports 2 types of indexes
CLUSTERED Index
 NON-CLUSTERED Index

Clustered Index






A clustered index is a special type of index that reorders the way
records in the table are physically stored.
The root and intermediate level nodes contain index pages holding
index rows
Each index row contains a key value and a pointer to either an
intermediate level page in the B-tree, or a data row in the leaf level
of the index.
Leaf node contains the actual data pages.
The data row of the table are sorted and stored in the table based
on their clustered index key (i.e. based on the index column(s)).
You can have only one clustered index per table.
Clustered Index
Clustered Index


CREATE CLUSTERED INDEX <index name> ON
<table name> (<column name list>)
Eg:
 CREATE
CLUSTERED INDEX IX_CUSTOMER_id ON
CUSTOMER(id)
Clustered Index

Potential candidates for the clustered index:
Columns with a number of duplicate values that are
searched frequently, for example, WHERE last_name =
‘Smith’.
 Columns that are often specified in the ORDER BY / GROUP
BY clause
 Columns that are often searched for within a range of
values, for example, WHERE price between $10 and $20.
 Columns, other than the primary key, that are frequently
used in join clauses.
 Queries that return large result sets.

Clustered Index

Clustered indexes are not a good choice for:
Columns that undergo frequent changes: This results in the
entire row moving (because SQL Server must keep the data
values of a row in physical order). This is an important
consideration in high-volume transaction processing systems
where data tends to be volatile.
 Wide keys: The key values from the clustered index are
used by all nonclustered indexes as lookup keys and
therefore are stored in each nonclustered index leaf entry.

Non-Clustered Index



A nonclustered index is a special type of index in which the logical
order of the index does not match the physical stored order of the
rows on disk.
Leaf node contains index pages instead of data pages.
Each index row in the nonclustered index contains the nonclustered
key value and a row locator. This locator points to the data row in
the clustered index or heap having the key value



Heap: The row locator is a pointer to the row. Row locator is built based
on the following. ROW ID (RowLocator)= file identifier + page number
+ row number on the page
Clustered Index : Row locator is the clustered index key for the row
You can have up to 249 Non Clustered Index per table.
Non-Clustered Index
Non-Clustered Index


CREATE [NONCLUSTERED] INDEX <index name>
ON <table name> (<column name list>)
Eg:
 CREATE
NONCLUSTERED INDEX IX_CUSTOMER_name
ON CUSTOMER(name)
Non-Clustered Index

Potential candidates for nonclustered indexes for
your environment:
 Columns
referenced in SARGs or join clauses that have
a relatively high selectivity(the density value is low).
 Columns referenced in both the WHERE clause and the
ORDER BY clause
 Nonclustered indexes are useful for single-row lookups,
joins, queries on columns that are highly selective, or
queries with small range retrievals.
Non-Clustered Index

Index with Included Columns
 Functionality
of nonclustered indexes can be extended
by adding nonkey columns to the leaf level of the
nonclustered index
 An index with included nonkey columns can significantly
improve query performance when all columns in the
query are included in the index either as key or nonkey
columns.
Questions
?
Download