Single Level Indexing

Indexed Files
Hashing is a computational technique for organizing files. Records in hashed files
can be stored and retrieved quickly. However, hashed files are difficult to process
in key order, which matters when you want to access all records with keys in a
certain range.
Indexing is a data structure based technique for accessing records in a file. Indexes
are auxiliary access structures which are used to speed up the retrieval of records in
response to certain search conditions. A main file of records can be supplemented
by one or more indexes.
Index structures provide secondary access paths: alternative ways of accessing the
records without affecting their physical placement on the disk. Indexes may be part
of the main file or be separate files, and may be created and destroyed as required
without affecting the main file.
Indexes allow for efficient access to records based on the indexing fields that are
used to construct the index. Any field of the file can be used to create an index, and
multiple indexes on different fields can be constructed on the same file.
Single Level Ordered Indexes
Ordered access structures are similar to indexes in textbooks. The indexes or tables
of contents list the important terms in alphabetical order, along with page numbers
where the information can be found.
You can search an index to find the address (page numbers in this case), and locate
the term by searching the appropriate page. The alternative is to complete a linear
search of the textbook, looking through each page one at a time.
For a file consisting of several fields, an index access structure is usually defined on
a single field, called an indexing field. An index is usually composed of two
components: the indexing field value and a list of pointers to all disk blocks that
contain records with that field value. The values in the index are ordered, so we can
do a binary search on the index. The index file is much smaller than the data file, so
a binary search on the index is much faster than one on the data file.
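As a minimal sketch of the structure described above, the index can be modelled as a sorted list of field values with a parallel list of block pointers. The names index_keys, index_ptrs and lookup_blocks, and the sample values, are assumptions made for illustration only.

```python
from bisect import bisect_left

# Hypothetical index: field values in sorted order, each paired with the
# list of data block numbers that contain records with that value.
index_keys = ["Adams", "Baker", "Chen", "Diaz"]
index_ptrs = [[0], [0, 2], [1], [3]]

def lookup_blocks(key):
    """Binary-search the ordered index and return the block pointers for key."""
    pos = bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return index_ptrs[pos]
    return []  # key does not appear in the index

print(lookup_blocks("Baker"))
```

Only the blocks returned by the lookup need to be read from the data file; the rest of the file is never touched.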
There are different types of indexes, including:
- Primary index
- Clustering index
- Secondary index
Primary Indexes
A primary index is an ordered file whose records are of fixed length with two fields.
The first field is the primary key of the main data file, and the second is a pointer to a
disk block. There is one index entry in the index file for each block in the data file.
Each index entry has the value of the primary key field for the first record in the block,
and a pointer to the block.
[Figure: primary index on the ordering key field of the data file. The data file has
the fields Employee #, Name, Sex, Job, Salary and Birth Date. Each index file entry
holds a primary key value and a block pointer; for example, the entries for 125-777,
185-654, 300-213, 750-222 and 800-356 point to the data blocks whose first records
have those keys.]
Each entry in the index has an employee id value and a pointer. The total number of
entries in the index is the same as the number of disk blocks in the ordered data file.
<K(1) = (125-777), P(1) = address of block 1>
The first record in each block of the data file is called the anchor record or the block
anchor.
Indexes can be characterised as dense or sparse:
Dense Index – has an index entry for every search key value (i.e. every record) in the
data file.
Sparse (non dense) Index – has index entries for only some of the search key
values. A primary index is a sparse (non dense) index because it includes an entry
for each disk block of the data file rather than for every search key value.
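The difference between the two kinds of index can be sketched on a small in-memory model. The data_blocks layout below (a list of blocks, each a list of (key, payload) pairs ordered by key) is an assumption for illustration, not a real file format.

```python
# Hypothetical ordered data file: three blocks of (key, payload) records.
data_blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h")],
]

# Dense index: one (key, block number) entry for every record.
dense_index = [(key, block_no)
               for block_no, block in enumerate(data_blocks)
               for key, _ in block]

# Sparse index: one entry per block, keyed by the block anchor
# (the key of the first record in the block).
sparse_index = [(block[0][0], block_no)
                for block_no, block in enumerate(data_blocks)]

print(len(dense_index), len(sparse_index))
```

The sparse index has one entry per block (three here) while the dense index has one per record (eight here), which is why a primary index needs far fewer blocks than the data file.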
The index file for a primary index needs fewer blocks than the data file, because a)
there are fewer index entries than there are records of data, and b) each index entry
is smaller than a data record because it has only two fields, so more index entries
than data records can fit in one block. This means that a binary search on the index
file requires fewer block accesses than a binary search on the data file.
Block Accesses Required
Remember that a binary search on an ordered data file with b blocks requires about
log2(b) block accesses. If the primary index file contains bi blocks, then locating a
record with a given search key value requires a binary search of the index (log2(bi)
block accesses), plus one access to the block containing the record: log2(bi) + 1
block accesses in total.
A record whose primary key value is K is in the block whose address is P(i), where
K(i) <= K < K(i+1). To retrieve a record, we do a binary search of the index file to find
the appropriate entry i, then retrieve the data block whose address is P(i).
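The retrieval rule above can be sketched as follows, assuming a small in-memory model in which data_blocks stands in for the ordered data file and anchors for the K(i) values of the primary index. All names and sample values are hypothetical.

```python
from bisect import bisect_right

# Hypothetical ordered data file: blocks of (key, name) records.
data_blocks = [
    [(125, "Ng"), (135, "Lee"), (175, "Roy")],
    [(185, "Kim"), (195, "Das"), (215, "Fox")],
    [(240, "Ito"), (250, "Paz"), (290, "Wu")],
]

# Primary index: K(i) is the anchor of block i, P(i) is the block number i.
anchors = [block[0][0] for block in data_blocks]

def find_record(key):
    """Find the entry i with K(i) <= key < K(i+1), then search block P(i)."""
    i = bisect_right(anchors, key) - 1  # last anchor that is <= key
    if i < 0:
        return None  # key is smaller than every anchor
    for k, name in data_blocks[i]:  # one data block access
        if k == key:
            return name
    return None

print(find_record(195))
```

Using bisect_right and stepping back one position selects the last anchor that does not exceed the search key, which is the condition K(i) <= K < K(i+1).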
Example from Text:
We have an ordered file with r = 30,000 records stored on a disk with block size
B = 1024 bytes. Records are fixed length and unspanned, with record size R = 100 bytes.
- What is the blocking factor (bfr)?
- How many blocks are needed for the file?
- How many block accesses are needed to locate a record using a binary search on
the data file?
Now assume the ordering key field of the file is V = 9 bytes long, a block pointer is
P = 6 bytes long, and a primary index has been constructed for the file.
- What is the size of each index entry?
- What is the blocking factor of the index?
- How many blocks are needed for the index?
- How many block accesses are needed to locate a record?
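One way to check your answers is to work the arithmetic through in a few lines of Python. This is a sketch that assumes unspanned records (so the blocking factor is rounded down) and rounds block counts and binary search costs up; the variable names are chosen here for illustration.

```python
from math import ceil, log2

B, R, r = 1024, 100, 30_000       # block size, record size, record count

bfr = B // R                       # records per block (unspanned)
b = ceil(r / bfr)                  # blocks in the data file
data_file_accesses = ceil(log2(b)) # binary search on the data file

V, P = 9, 6                        # key field size, block pointer size
Ri = V + P                         # size of one index entry
bfri = B // Ri                     # index entries per block
ri = b                             # primary index: one entry per data block
bi = ceil(ri / bfri)               # blocks in the index file
index_accesses = ceil(log2(bi)) + 1  # binary search on index + 1 data block

print(bfr, b, data_file_accesses)  # 10 3000 12
print(Ri, bfri, bi, index_accesses)  # 15 68 45 7
```

Under these assumptions the primary index reduces the search from 12 block accesses to 7.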
Insertion and deletion of records are a problem with any ordered file. The problem is
worse with a primary index, because when we insert a record into its correct position
in the data file, not only does space have to be made in the blocks for the new record,
but index entries may also need to be changed because the anchor records of some
blocks will change.
These problems can be overcome, as seen previously, by using an unordered overflow
file or a linked list of overflow records for each block in the data file.
Clustering Indexes
A clustering index is used when records are physically ordered on a nonkey field,
which does not have a distinct value for each record. This field is called the
clustering field. A clustering index can be created to speed up the retrieval of
records that have the same value for the clustering field.
Clustering indexes are different from a primary index, because a primary index
requires that the ordering field of the data file has a distinct value for each record.
A clustering index is an ordered file with two fields: the first is of the same type as
the clustering field of the data file, and the second is a block pointer. There is one
entry in the clustering index for each distinct value of the clustering field, containing
the value and a pointer to the first block in the data file that has a record with that
value for its clustering field.
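Building such an index can be sketched as a single pass over the ordered data file. In this illustration each block is reduced to a list of clustering field values (department numbers); the layout and the function name build_clustering_index are assumptions.

```python
# Hypothetical data file ordered on a nonkey clustering field.
# Each inner list is one block; only the clustering field value is shown.
data_blocks = [
    [1, 1, 1, 2],
    [2, 3, 3, 3],
    [3, 4, 4, 5],
    [5, 5, 5],
]

def build_clustering_index(blocks):
    """Return (value, first block number) for each distinct clustering value."""
    index = []
    seen = set()
    for block_no, block in enumerate(blocks):
        for value in block:
            if value not in seen:
                seen.add(value)
                index.append((value, block_no))
    return index

clustering_index = build_clustering_index(data_blocks)
print(clustering_index)
```

To retrieve all records with a given value, a reader would follow the pointer to the first block and then scan forward through consecutive blocks until the value changes.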
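[Figure: clustering example in which records with different clustering field values
are stored in the same block. The data file has the fields Dept No., Name, Job,
BirthDate and Salary and is ordered on Dept No. (values 1 to 5). The index has one
entry per distinct Dept No. value.]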
Insertion and deletion still cause problems because the records in the data file are
still physically ordered. It is common to reserve a whole block or cluster of blocks for
each value of the clustering field. All records with that value are placed in the block
(or cluster). See the diagram below for an example.
[Figure: clustering example in which each distinct value of the clustering field is
stored in a separate block. The data file has the fields Dept No., Name, Job, BirthDate
and Salary. Each block ends with a block pointer, which is either a null pointer or a
link to the next block holding records with the same Dept No. value.]
A clustering index is a nondense index, because it has an entry for every distinct
value of the indexing field rather than for every record in the file. The indexing field
is nonkey by definition, so it has duplicate values.
An index is similar to the directory structures used for extendible hashing. Both are
searched to find a pointer to the data block containing the desired record. The main
difference is that an index search uses the values of the search field itself, whereas
a hash directory search uses the hash value that is calculated by applying the hash
function to the search field.
Secondary Indexes
A secondary index provides a secondary means of accessing a file for which some
primary access already exists. The secondary index may be on a field which is a
candidate key and has a unique value in every record, or a nonkey with duplicate
values.
The index is an ordered file with two fields. The first field is of the same data type as
the indexing field, which is a nonordering field of the data file. The second is either a
block pointer or a record pointer. There can be many secondary indexes for the
same file.
Consider a secondary index access structure on a key field that has a distinct value
for every record. Such a field is sometimes called a secondary key. There is one index
entry for each record in the data file, therefore the index is dense.
The index entries are ordered by value K(i), so a binary search can be performed.
Because the records in the data file are not physically ordered by the values of the
secondary key field, block anchors cannot be used. An index entry is created for
each record in the data file rather than for each block.
A block pointer is used to point to the block containing the record with the desired key
value. The appropriate block is transferred to main memory and a search for the
desired record within the block can be performed.
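A dense secondary index on a key field can be sketched as below. The data file here is deliberately not ordered on the indexed field; the record layout, sample values and the name find_by_secondary_key are assumptions for illustration.

```python
from bisect import bisect_left

# Hypothetical data file: blocks of (secondary key, name) records,
# NOT physically ordered on the secondary key.
data_blocks = [
    [(310, "Lee"), (125, "Ng")],
    [(240, "Roy"), (190, "Kim")],
    [(175, "Das"), (290, "Wu")],
]

# Dense secondary index: one (key, block pointer) entry per record,
# ordered by key so that binary search can be used.
secondary_index = sorted((key, block_no)
                         for block_no, block in enumerate(data_blocks)
                         for key, _ in block)
index_keys = [key for key, _ in secondary_index]

def find_by_secondary_key(key):
    """Binary-search the index, fetch the block, then search inside it."""
    pos = bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None
    block_no = secondary_index[pos][1]
    for k, name in data_blocks[block_no]:
        if k == key:
            return name
    return None

print(find_by_secondary_key(190))
```

Because the data file is unordered on this field, the index needs one entry per record; block anchors would not identify where a given key lives.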
A secondary index needs more storage space and longer search time than a primary
index, because there is a larger number of entries. However the improvement in the
search time for a secondary index is greater than for a primary index since a linear
search would have to be performed if the secondary index did not exist.
SEE DIAGRAM FROM TEXT (14.4)
Example from Text
Consider the file from the previous example, with r = 30,000 fixed length records of
size R = 100 bytes, stored on a disk with block size B = 1024 bytes. The file has
b = 3000 blocks.
- How many block accesses would be required to do a linear search on the file?
Suppose we construct a secondary index on a nonordering key field of the file that is
V = 9 bytes long, with a block pointer that is P = 6 bytes long.
- How big is each index entry?
- What is the blocking factor of the index?
In a dense secondary index, the total number of index entries is equal to the number
of records in the data file.
- How many blocks are needed for the index?
- How many block accesses are needed to locate a record using a binary search on
the index?
As you can see, there is a large improvement over the number of block accesses
needed for the average linear search, but it is slightly worse than the seven block
accesses required for the primary index.
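The figures behind this comparison can be reproduced with a short calculation. This sketch assumes that the average linear search reads half the blocks of the file and that block counts and binary search costs are rounded up; the variable names are illustrative.

```python
from math import ceil, log2

B, r, b = 1024, 30_000, 3000      # block size, records, data file blocks
V, P = 9, 6                       # key field size, block pointer size

linear_average = b // 2           # average linear search: half the blocks

Ri = V + P                        # size of one index entry
bfri = B // Ri                    # index entries per block
ri = r                            # dense index: one entry per record
bi = ceil(ri / bfri)              # blocks in the index file
index_accesses = ceil(log2(bi)) + 1  # binary search on index + 1 data block

print(linear_average)             # 1500
print(Ri, bfri, bi, index_accesses)  # 15 68 442 10
```

Under these assumptions the secondary index needs 10 block accesses against an average of 1500 for the linear search.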
Non Key Secondary Indexes
A secondary index can be created on a nonkey field of a file. In this case, many
records in the data file can have the same value for the indexing field. There are
several ways an index like this can be created:
- Include several index entries with the same key value, one for each record. This
would be a dense index.
- Have variable length records for the index entries, with a repeating field for the
pointer. A list of pointers can be kept in the index for the key, one to each block
that contains a record whose indexing field is equal to the key value. Binary
search can be used, but must be modified.
- Keep the index entries at a fixed length, with a single entry for each indexing
field value, and create an extra level of indirection to handle the multiple
pointers. This is a non dense scheme.
o The pointer P in the index entry points to a block of record pointers.
o Each record pointer in the block points to one of the data file records with
the key value K for the indexing field.
o If a value of K occurs in too many records, so that the record pointers can't fit
in a single disk block, a cluster or linked list of blocks is used.
o Retrieval using the index requires one or more additional block accesses
because of the extra level, but searching and inserting more records is
straightforward.
o SEE DIAGRAM FROM TEXT (14.5)
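The third option, with its extra level of indirection, can be sketched in memory as follows. Record pointers are modelled as positions in a records list and each block of record pointers as a Python list; these modelling choices and all names are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical data file: (name, dept) records, not ordered on dept.
records = [("Ann", 3), ("Bob", 1), ("Cy", 3), ("Di", 2), ("Ed", 1)]

# Group record pointers (positions in `records`) by indexing field value.
groups = defaultdict(list)
for record_ptr, (_, dept) in enumerate(records):
    groups[dept].append(record_ptr)

# Index: one fixed-length entry per distinct value, in sorted order.
# Entry i points to pointer_blocks[i], the block of record pointers for it.
index_keys = sorted(groups)
pointer_blocks = [groups[key] for key in index_keys]

def find_all(dept):
    """Binary-search the index, then follow the indirection to the records."""
    pos = bisect_left(index_keys, dept)
    if pos == len(index_keys) or index_keys[pos] != dept:
        return []
    return [records[ptr] for ptr in pointer_blocks[pos]]

print(find_all(3))
```

Inserting a new record with an existing value only appends a pointer to the matching pointer block, so the ordered index entries themselves do not change.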
A secondary index provides a logical ordering on the records by the indexing field.