DATABASE INDEXES Nimesh Shah (nimesh.s) , Amit Bhawnani (amit.b) Outline Why Indexes ? Demo Types of Single-level Ordered Indexes • • • Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes Dynamic Multilevel Indexes Using B-Trees and B+-Trees Indexes on Multiple Keys Disadvantages of Indexes Indexes in SQL SERVER Clustered Index Non Clustered Index Why Indexes? Indexing mechanisms used to speed up access to desired data. They are much the same as book indexes, providing the database with quick jump points on where to find the full reference. Search Key - attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form Search-Key Pointer Index files are typically much smaller than the original file Access types supported efficiently. E.g., Records with a specified value in the attribute Records with an attribute value falling in a specified range of values. Demo Given a CUSTOMER table, find all customers whose name = ‘Justine’ Query : SELECT * FROM CUSTOMER WHERE name ='Justine‘ Single Level Indexes - Indexes as Access Paths A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file. The index is usually specified on one field of the file (although it could be specified on several fields) One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value The index is called an access path on the field. Indexes as Access Paths (contd.) The index file usually occupies considerably less disk blocks than the data file because its entries are much smaller A binary search on the index yields a pointer to the file record Indexes can also be characterized as dense or sparse. A dense index has an index entry for every search key value (and hence every record) in the data file. A sparse (or nondense) index, on the other hand, has index entries for only some of the search values Indexes as Access Paths (contd.) Example: Given the following data file: EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... ) Suppose that: Record size R = 150 bytes Block size B = 512 bytes No of Records r = 30000 records Then, we get: Blocking factor Bfr = B div R = 512 div 150 = 3 records/block Number of file blocks b = (r/Bfr) = (30000/3) = 10000 blocks For an index on the SSN field, assume the field size VSSN= 9 bytes, Assume the record pointer size PR= 7 bytes. Then: Index entry size RI=(VSSN+ PR)=(9+7)=16 bytes Index blocking factor BfrI= B div RI= 512 div 16= 32 entries/block Number of index blocks b= (r/ BfrI)= (30000/32)= 938 blocks Binary search needs log2bI= log2938= 10 block accesses This is compared to an average linear search cost of: (b/2)= 30000/2= 15000 block accesses If the file records are ordered, the binary search cost would be: log 2b= log230000= 15 block accesses Types of Single-Level Indexes Primay Index Defined on an ordered data file The data file is ordered on a key field Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor A similar scheme can use the last record in a block. A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file and the keys of its anchor record rather than for every search value. Primary Index (contd.) Primary index on the ordering key field of an ordered data file Types of Single-Level Indexes Clustering Index Defined on an ordered data file The data file is ordered on a non-key field unlike primary index, which requires that the ordering field of the data file have a distinct value for each record. Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value. It is another example of nondense index where Insertion and Deletion is relatively straightforward with a clustering index. Clustering Index (contd.) A clustering index on the DEPTNUMBER ordering nonkey field of an EMPLOYEE file. Clustering Index (contd.) Clustering index with a separate block cluster for each group of records that share the same value for the clustering field. Types of Single-Level Indexes Secondary indexes A secondary index provides a secondary means of accessing a file for which some primary access already exists. The secondary index may be on a field which is a candidate key and has a unique value in every record, or a nonkey with duplicate values. The index is an ordered file with two fields. The first field is of the same data type as some nonordering field of the data file that is an indexing field. The second field is either a block pointer or a record pointer. There can be many secondary indexes (and hence, indexing fields) for the same file. Includes one entry for each record in the data file; hence, it is a dense index Secondary Indexes (contd.) A dense secondary index (with block pointers) on a nonordering key field of a file. Secondary Indexes (contd.) A secondary index (with recored pointers) on a nonkey field implemented using one level of indirection so that index entries are of fixed length and have unique field values. Types of Single-Level Indexes Multi-Level Indexes Because a single-level index is an ordered file, we can create a primary index to the index itself ; in this case, the original index file is called the first-level index and the index to the index is called the second-level index. We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block A multi-level index can be created for any type of firstlevel index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block Multi-Level Indexes (contd.) A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization. Multi-Level Indexes (contd.) Such a multi-level index is a form of search tree ; however, insertion and deletion of new index entries is a severe problem because every level of the index is an ordered file. A node in a search tree with pointers to subtrees below it. Dynamic Multilevel Indexes Using B-Trees and B+ -Trees Because of the insertion and deletion problem, most multi-level indexes use B-tree or B+-tree data structures, which leave space in each tree node (disk block) to allow for new index entries These data structures are variations of search trees that allow efficient insertion and deletion of new search values. In B-Tree and B+-Tree data structures, each node corresponds to a disk block Each node is kept between half-full and completely full Dynamic Multilevel Indexes Using B-Trees and B+ -Trees An insertion into a node that is not full is quite efficient; if a node is full the insertion causes a split into two nodes Splitting may propagate to other tree levels A deletion is quite efficient if a node does not become less than half full If a deletion causes a node to become less than half full, it must be merged with neighboring nodes Difference between B-tree and B+-tree In a B-tree, pointers to data records exist at all levels of the tree In a B+-tree, all pointers to data records exists at the leaflevel nodes A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree B-tree allows search-key values to appear only once; eliminates redundant storage of search keys. Search keys in nonleaf nodes appear nowhere else in the Btree; an additional pointer field for each search key in a nonleaf node must be included. B-tree structures. (a) A node in a B-tree with q – 1 search values. (b) A B-tree of order p = 3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6. The nodes of a B+-tree. (a) Internal node of a B+-tree with q –1 search values. (b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers. Indexes on Multiple Keys Use multiple indices for certain types of queries. Example: select account-number from account where branch-name = “Perryridge” and balance = 1000 Possible strategies for processing query using indices on single attributes: 1. Use index on branch-name to find accounts with branch name of Perryridge; test balance = 1000. 2. Use index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”. 3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained. Indexes on Multiple Keys All of these alternatives eventually give the correct result. However, if the set of records that meet each condition (branch-name = “Perryridge” OR balance = 1000) individually are large, yet only a few records satisfy the combined condition, then none of the above is a very efficient technique for the given search request. Suppose we have an index on combined search-key (branch-name, balance). With the where clause where branch-name = “Perryridge” and balance = 1000 the index on the combined search-key will fetch only records that satisfy both conditions. Can also efficiently handle where branch-name = “Perryridge” and balance < 1000 But cannot efficiently handle where branch-name < “Perryridge” and balance = 1000. May fetch many records that satisfy the first but not the second condition. Disadvantages of Indexes Indexes require additional storage space on disk, just as the index in a book require additional pages. Too many indexes can actually slow your database down. Each time a page or database row is updated or removed, the reference or index also has to be updated Indexes in Microsoft SQL Server Indexes are organized as B-trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. The pages in each level of the index are linked in a doublylinked list. SQL Server primarily supports 2 types of indexes CLUSTERED Index NON-CLUSTERED Index Clustered Index A clustered index is a special type of index that reorders the way records in the table are physically stored. The root and intermediate level nodes contain index pages holding index rows Each index row contains a key value and a pointer to either an intermediate level page in the B-tree, or a data row in the leaf level of the index. Leaf node contains the actual data pages. The data row of the table are sorted and stored in the table based on their clustered index key (i.e. based on the index column(s)). You can have only one clustered index per table. Clustered Index Clustered Index CREATE CLUSTERED INDEX <index name> ON <table name> (<column name list>) Eg: CREATE CLUSTERED INDEX IX_CUSTOMER_id ON CUSTOMER(id) Clustered Index Potential candidates for the clustered index: Columns with a number of duplicate values that are searched frequently, for example, WHERE last_name = ‘Smith’. Columns that are often specified in the ORDER BY / GROUP BY clause Columns that are often searched for within a range of values, for example, WHERE price between $10 and $20. Columns, other than the primary key, that are frequently used in join clauses. Queries that return large result sets. Clustered Index Clustered indexes are not a good choice for: Columns that undergo frequent changes: This results in the entire row moving (because SQL Server must keep the data values of a row in physical order). This is an important consideration in high-volume transaction processing systems where data tends to be volatile. Wide keys: The key values from the clustered index are used by all nonclustered indexes as lookup keys and therefore are stored in each nonclustered index leaf entry. Non-Clustered Index A nonclustered index is a special type of index in which the logical order of the index does not match the physical stored order of the rows on disk. Leaf node contains index pages instead of data pages. Each index row in the nonclustered index contains the nonclustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value Heap: The row locator is a pointer to the row. Row locator is built based on the following. ROW ID (RowLocator)= file identifier + page number + row number on the page Clustered Index : Row locator is the clustered index key for the row You can have up to 249 Non Clustered Index per table. Non-Clustered Index Non-Clustered Index CREATE [NONCLUSTERED] INDEX <index name> ON <table name> (<column name list>) Eg: CREATE NONCLUSTERED INDEX IX_CUSTOMER_name ON CUSTOMER(name) Non-Clustered Index Potential candidates for nonclustered indexes for your environment: Columns referenced in SARGs or join clauses that have a relatively high selectivity(the density value is low). Columns referenced in both the WHERE clause and the ORDER BY clause Nonclustered indexes are useful for single-row lookups, joins, queries on columns that are highly selective, or queries with small range retrievals. Non-Clustered Index Index with Included Columns Functionality of nonclustered indexes can be extended by adding nonkey columns to the leaf level of the nonclustered index An index with included nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Questions ?