Part 2 - File Organizations and Indexing

Files and Records
• A file is a sequence of records, where each record is a collection of
data values.
• Records are stored on disk blocks. The blocking factor for a file is
the (average) number of file records stored in a disk block.
• A file can have fixed-length records or variable-length records.
• File records can be un-spanned (no record can span two blocks)
or spanned (a record can be stored in more than one block).
• The physical disk blocks that are allocated to hold the records of a
file can be contiguous, linked, or indexed.
• In a file of fixed-length records, all records have the same
structure. Usually, un-spanned blocking is used with such files.
• Files of variable-length records require additional information to
be stored in each record, such as separator characters and field
types. Usually, spanned blocking is used with such files.
Basic File Operations:
• OPEN: Prepares the file for access, and associates a pointer that
will refer to a current file record at each point in time.
• READ: Reads the current file record into a program variable.
• INSERT: Inserts a new record into the file, and makes it the current
file record.
• DELETE: Removes the current file record from the file, usually by
marking the record to indicate that it is no longer valid.
• UPDATE: Changes the values of some fields of the current file
record.
• CLOSE: Terminates access to the file.
• FIND: Searches for the first file record that satisfies a certain
condition, and makes it the current file record.
• FINDNEXT: Searches for the next file record (from the current
record) that satisfies a certain condition, and makes it the current file
record.
• REORGANIZE: Reorganizes the file records. For example, the
records marked deleted are physically removed from the file or a
new organization of the file records is created.
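The operations above can be sketched in a few lines. This is only an illustration under assumptions: an in-memory list stands in for the file, and the class name RecordFile and its method names are hypothetical. It shows the current-record pointer, deletion by marking, and REORGANIZE physically dropping the marked records.

```python
class RecordFile:
    """Toy stand-in for a file of records (hypothetical, in-memory)."""

    def __init__(self, records):
        # Each entry is [record, valid_flag]; DELETE only clears the flag.
        self.records = [[rec, True] for rec in records]
        self.current = None  # position of the current file record

    def find(self, condition, start=0):
        # FIND: first valid record at or after `start` satisfying the condition.
        for i in range(start, len(self.records)):
            rec, valid = self.records[i]
            if valid and condition(rec):
                self.current = i
                return rec
        return None

    def find_next(self, condition):
        # FINDNEXT: continue the search after the current record.
        start = 0 if self.current is None else self.current + 1
        return self.find(condition, start)

    def delete(self):
        # DELETE: mark the current record as no longer valid.
        self.records[self.current][1] = False

    def reorganize(self):
        # REORGANIZE: physically remove the records marked deleted.
        self.records = [entry for entry in self.records if entry[1]]
        self.current = None
```

Marking instead of removing keeps DELETE cheap; the cost of compacting the file is deferred to REORGANIZE.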
Sequential File Organization
• Developed in the early 1950s to store data on magnetic tapes.
(Magnetic tapes were the first secondary storage devices available.)
• An unordered sequential file is also called a heap file.
• To search for a record, a linear search is necessary. On the
average, searching an un-ordered file requires reading and
searching half the file’s records.
• Maintenance of a sequential file:
  • Addition:
    • Unordered file:
      • Requires appending to the end of the file. O(1).
    • Ordered file:
      • More complex, since the order (ascending or
        descending) must be preserved. O(n).
      • Requires 3 steps:
        1) Copy the records up to the location where the
           new record must be inserted into a new file.
        2) Write the new record.
        3) Copy the rest of the file to the new file.
  • Deletion:
    • Same as addition to an ordered file, except that the file
      is rewritten with the deleted record left out. O(n).
  • Update:
    • Same as addition to an ordered file, except that one or more
      of the records are changed before rewriting. O(n).
  • Search:
    • Unordered file: linear search, n/2 records on average.
    • Ordered file: binary search, O(log n).
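The three-step addition to an ordered file can be sketched as follows. This is a minimal illustration that assumes Python lists stand in for the old and new files and that records compare by their key; the function name insert_ordered is hypothetical.

```python
def insert_ordered(old_file, new_record):
    """Rewrite an ordered file with new_record added in its sorted position."""
    new_file = []
    i = 0
    # Step 1: copy the records that come before the new record.
    while i < len(old_file) and old_file[i] <= new_record:
        new_file.append(old_file[i])
        i += 1
    # Step 2: write the new record.
    new_file.append(new_record)
    # Step 3: copy the rest of the old file.
    new_file.extend(old_file[i:])
    return new_file
```

Every record is copied once, which is where the O(n) cost comes from.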
Sequential Search:
• Processes the records of a file in their order of occurrence
until it either locates the record or reaches the EOF.
• O(n) complexity
• Average search will require (n/2) comparisons.
• Worst case requires n comparisons (i.e., the record is the last
one or does not exist).
• Best if n is small.
• Simple algorithm.
• Sorting the records in the file reduces the average cost of an
unsuccessful search to about n/2 comparisons.
(i.e., once you pass the point at which the record should
appear and it is not there, you know that it is not in the
file.)
• Advantages:
  • Simple
• Disadvantages:
  • Slow
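A sketch of sequential search over a sorted file, stopping as soon as the search passes the place where the record would have to be. A Python list of keys stands in for the file, and the function name is hypothetical.

```python
def sequential_search_sorted(keys, target):
    """Return (position or None, number of comparisons made)."""
    comparisons = 0
    for position, key in enumerate(keys):
        comparisons += 1
        if key == target:
            return position, comparisons
        if key > target:
            # Passed the point where target would appear: it is not in the file.
            break
    return None, comparisons
```

For the keys [5, 10, 15, 20, 25], a search for 12 stops after 3 comparisons instead of scanning all 5 records.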
Binary Search:
Basic Requirement:
• Requires a sorted file!
• Must be a random access file on a random access device!
• Must have equal sized records!
Basic Algorithm:
• Same as Binary search in an array.
• Search the file by repeatedly splitting it in half, until the record
is found or until you can no longer split the file.
• O(log2 n) complexity.
• Both the average case and the worst case require O(log2 n)
comparisons.
• Advantages:
  • Faster than sequential search.
• Disadvantages:
  • Still slow
    (Fine for in-memory searches; in a file, each comparison may
    cost a disk access, so binary search is best when n is small.)
    (Example: with n=256, we still need about 8 comparisons to
    locate a record, possibly 8 seeks + 8 rotational delays.)
  • Overhead
    (The file must be sorted initially, and the order must be
    maintained after insertions and deletions.)
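The requirements above (sorted, random access, equal-sized records) are what make the following sketch possible: record i always starts at byte i * record_size, so the search can seek straight to the middle record. The 16-byte record layout (4-byte key plus 12-byte payload) and the function name are assumptions made for illustration, and an in-memory io.BytesIO object stands in for a file on disk.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte unsigned key + 12-byte payload.
RECORD = struct.Struct("<I12s")

def binary_search_file(f, n_records, target_key):
    """Return (payload or None, number of record reads)."""
    low, high = 0, n_records - 1
    reads = 0
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECORD.size)          # direct access to record `mid`
        key, payload = RECORD.unpack(f.read(RECORD.size))
        reads += 1                         # each read may be a seek on disk
        if key == target_key:
            return payload.rstrip(b"\x00"), reads
        if key < target_key:
            low = mid + 1
        else:
            high = mid - 1
    return None, reads

# Build a sorted "file" of 256 records with keys 0, 2, 4, ..., 510.
buf = io.BytesIO()
for k in range(0, 512, 2):
    buf.write(RECORD.pack(k, f"rec{k}".encode()))
```

Calling binary_search_file(buf, 256, 100) returns the payload for key 100 together with the number of reads, which stays at or below 9 for 256 records.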
Random Access File Organization:
• Direct Access
• Insertion, deletion and searching of records can be done at
random.
• Fixed record size. (Or buckets)
• The location of a record is computed from its unique
key (identifier).
• Logical ordering of records may or may not correspond to their
physical sequence.
• Random access files can be updated in place.
• Often a HASH function is used to locate a record.
• Fast
3) Index Sequential File Organization:
• Provides both Sequential and Direct Access
• Combines sequential access and ordering with random access
capabilities.
• Contains two parts:
1) A sequential file:
- Maintains the actual data.
- Sometimes ordered on a key.
- Variable sized records.
2) An Index:
- A linear or hierarchical index structure:
(Key, Block #) or
(Key, Absolute file location)
• Variable length records within fixed size blocks.
• One or more records can be placed in the same block.
• Some space may be wasted.
• Variable length records waste less space, but are slower.
• Fixed length records waste space but provide faster access.
4) Multi-Key:
• Provides both Sequential and Direct Access
• Allows access to a data file by several different key fields.
• Example:
• A library file which requires access by author and title.
• Index sequential file organization provides random access by
one key field only.
Hashing
Internal Hashing
• In memory hashing, typically implemented through the use of an
array of records.
• The record with hash key value K is stored in record i, where i=h(K),
and h is the hashing function.
• Collisions are handled using methods such as open addressing,
chaining, or multiple hashing. (Below a chaining method is shown)
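A minimal sketch of internal hashing with chaining. It assumes integer keys and the hash function h(K) = K mod M, where M is the array size; the class and method names are hypothetical.

```python
class ChainedHashTable:
    """In-memory hash table; each array slot holds a chain of (key, record)."""

    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def h(self, key):
        # Assumed hash function: K mod M.
        return key % len(self.slots)

    def insert(self, key, record):
        chain = self.slots[self.h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, record)   # replace an existing key
                return
        chain.append((key, record))        # a collision just extends the chain

    def find(self, key):
        for k, record in self.slots[self.h(key)]:
            if k == key:
                return record
        return None
```

With size 8, keys 3 and 11 collide in slot 3 and end up on the same chain; a search only walks that one chain.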
Hashed Files
Static External Hashing
• The file blocks are divided into M equal-sized buckets, numbered
bucket_0, bucket_1, ..., bucket_(M-1). Typically, for efficiency reasons, a
bucket corresponds to one disk block (or a fixed number of disk blocks).
• One of the fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i=h(K),
and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is
already full. An overflow area is kept for storing such records.
• To reduce overflow records, a hash file is typically kept 70-80% full.
• The hash function h should distribute the records uniformly among
the buckets; otherwise, search time will be increased because many
overflow records will exist.
• Main disadvantages of static external hashing:
  - The fixed number of buckets M is a problem if the number of
    records in the file grows or shrinks.
  - Ordered access on the hash key is quite inefficient (requires
    sorting the records).
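A sketch of static external hashing with a fixed number of buckets and a shared overflow area. Python lists stand in for disk blocks, integer keys and h(K) = K mod M are assumed, and the class name StaticHashFile is hypothetical.

```python
class StaticHashFile:
    """M fixed-capacity buckets plus one overflow area (in-memory stand-in)."""

    def __init__(self, num_buckets, bucket_capacity):
        self.M = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = []

    def h(self, key):
        return key % self.M   # assumed hash function

    def insert(self, key, record):
        bucket = self.buckets[self.h(key)]
        if len(bucket) < self.capacity:
            bucket.append((key, record))
        else:
            # Collision with a full bucket: the record goes to the overflow area.
            self.overflow.append((key, record))

    def find(self, key):
        for k, record in self.buckets[self.h(key)]:
            if k == key:
                return record
        # Only reached on a miss in the home bucket: scan the overflow area.
        for k, record in self.overflow:
            if k == key:
                return record
        return None
```

Because M never changes, a growing file pushes more and more records into the overflow area, and every lookup that misses its home bucket pays for a scan of it.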
Dynamic and Extendible Hashing Techniques
- Hashing techniques are adapted to allow the dynamic growth and
shrinking of the number of file records.
- Extendible hashing uses the binary representation of the hash value
h(K) in order to access a directory. In extendible hashing the
directory is an array of size 2^d, where d is called the global depth.
- The directories can be stored on disk, and they expand or shrink
dynamically. Directory entries point to the disk blocks that contain the
stored records.
- Each bucket has a local depth which is less than or equal to the
global depth.
- An insertion into a disk block that is full causes the block to split into
two blocks, and the records are redistributed between the two blocks.
The directory is updated appropriately.
- If a bucket that overflows and is split had a local depth d'
equal to the global depth d of the directory, then the size of the
directory must be doubled (one extra bit) to accommodate the split.
- Extendible hashing does not require an overflow area.
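The splitting and directory-doubling rules can be sketched as below. Several choices here are assumptions rather than part of the slides: the directory is indexed by the low-order d bits of h(K) (many textbook presentations use the leading bits instead), keys are distinct non-negative integers hashed with Python's built-in hash, and the class names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}                       # key -> record


class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]          # 2**global_depth entries

    def _index(self, key):
        # Assumed convention: low-order global_depth bits of h(K).
        return hash(key) & ((1 << self.global_depth) - 1)

    def find(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, record):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = record
                return
            self._split(bucket)               # full: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Local depth equals global depth: double the directory.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)   # the newly significant bit
        # Directory entries with that bit set now point to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):
                self.directory[i] = new_bucket
        # Redistribute the records between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = new_bucket if (hash(k) & bit) else bucket
            target.items[k] = v
```

Inserting the keys 0 through 7 with a bucket capacity of 2 grows the directory to 4 entries (global depth 2) without using any overflow area.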
Index Structures
- A single-level index is an auxiliary file that makes it more efficient to
search for a record in the data file.
- The index is usually specified on one field of the file (although it
could be specified on several fields).
- One form of an index is a file of entries <field value, pointer to
record>, which is ordered by field value.
- The index is called an access path on the field.
- The index file usually occupies considerably fewer disk blocks than
the data file because its entries are much smaller.
- A binary search (or a hash) on the index yields a pointer to the file
record.
Example:
Given the following data file:
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
record size R=150 bytes
block size B=512 bytes
r=30000 records
Then, we get:
blocking factor Bfr = B div R = 512 div 150 = 3 records/block
number of file blocks b = ceil(r/Bfr) = ceil(30000/3) = 10000 blocks
For an index on the SSN field,
assume the field size V_SSN = 9 bytes,
assume the record pointer size P_R = 7 bytes.
Then:
index entry size R_I = (V_SSN + P_R) = (9 + 7) = 16 bytes
index blocking factor Bfr_I = B div R_I = 512 div 16 = 32 entries/block
number of index blocks b_I = ceil(r/Bfr_I) = ceil(30000/32) = 938 blocks
binary search needs ceil(log2 b_I) = ceil(log2 938) = 10 block accesses
This is compared to an average linear search cost of:
(b/2) = 10000/2 = 5000 block accesses
If the file records are ordered, the binary search cost would be:
ceil(log2 b) = ceil(log2 10000) = 14 block accesses
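The arithmetic of this example can be re-derived with a short script. The variable names are chosen here for illustration; the search costs are computed from b, the number of file blocks obtained in the example (10000), since the cost is counted in block accesses rather than records.

```python
import math

B = 512        # block size in bytes
R = 150        # record size in bytes
r = 30000      # number of records

bfr = B // R                      # blocking factor: records per block
b = math.ceil(r / bfr)            # number of data file blocks

V_SSN = 9                         # size of the SSN field in bytes
P_R = 7                           # size of a record pointer in bytes
R_I = V_SSN + P_R                 # index entry size
bfr_I = B // R_I                  # index entries per block
b_I = math.ceil(r / bfr_I)        # number of index blocks

index_search = math.ceil(math.log2(b_I))    # binary search on the index
linear_search = b // 2                      # average linear search on the file
file_binary = math.ceil(math.log2(b))       # binary search on an ordered file

print(bfr, b, R_I, bfr_I, b_I)                    # 3 10000 16 32 938
print(index_search, linear_search, file_binary)   # 10 5000 14
```

The index search is cheaper than a binary search on the data file itself because the index has far fewer blocks to halve.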
Types of Single-Level Indexes
1. Primary Index
- Defined on an ordered data file
- The data file is ordered on a key field
- Includes one index entry for each block in the data file; the index
entry has the key field value for the first record in the block, which is
called the block anchor.
- A similar scheme can use the last record in a block.
- Primary index is a non-dense index: it includes an entry for
each disk block of the data file rather than for every record.
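A lookup through a primary index can be sketched as a binary search over the block anchors followed by a scan of a single data block. The data below are made-up sample keys, lists stand in for disk blocks, and the names are hypothetical.

```python
from bisect import bisect_right

# Ordered data file: each inner list is one disk block of key values.
data_blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Non-dense index: one (block anchor, block number) entry per data block.
primary_index = [(block[0], block_no) for block_no, block in enumerate(data_blocks)]
anchors = [anchor for anchor, _ in primary_index]

def lookup(key):
    """Return the block number that holds `key`, or None if it is absent."""
    # Find the last index entry whose anchor is <= key.
    pos = bisect_right(anchors, key) - 1
    if pos < 0:
        return None                    # key is smaller than every anchor
    block_no = primary_index[pos][1]
    # Only this one data block has to be read and searched.
    return block_no if key in data_blocks[block_no] else None
```

The index has 3 entries for 9 records; lookup(50) picks the entry with anchor 40 and then reads block 1 only.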
2. Clustering Index
- A clustering index is used to maintain an index on a file that is
ordered on a non-key field.
- Includes one index entry for each distinct value of the field; the index
entry points to the first data block that contains records with that field
value.
- A clustering index is another example of a non-dense index: it
includes an entry for each distinct value of the field rather
than for every record.
3. Secondary Index
- Defined on an unordered data file
- Can be defined on a key field or a non-key field
- Includes one entry for each record in the data file; hence, it is called
a dense index
Multi-Level Indexes
- Because a single-level index is an ordered file, we can create a
primary index to the index itself; in this case, the original index file is
called the first-level index and the index to the index is called the
second-level index.
- We can repeat the process, creating a third, fourth, ..., top level until
all entries of the top level fit in one disk block.
- A multi-level index can be created for any type of first-level index
(primary, secondary, clustering) as long as the first-level index
consists of more than one disk block.
- Such a multi-level index is a form of search tree; however, insertion
and deletion of new index entries is a severe problem because
every level of the index is an ordered file.
- Because of the insertion and deletion problem, most multi-level
indexes use B-tree or B+-tree data structures, which leave space in
each tree node (disk block) to allow for new index entries.
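The number of levels follows from repeatedly dividing the block count by the index blocking factor (fan-out) until one block remains. The sketch below applies this to the figures from the earlier SSN index example (938 first-level blocks, 32 entries per block); the function name is hypothetical.

```python
import math

def index_levels(first_level_blocks, fan_out):
    """Return the number of blocks at each index level, first level first."""
    levels = [first_level_blocks]
    while levels[-1] > 1:
        # Each higher level needs one entry per block of the level below it.
        levels.append(math.ceil(levels[-1] / fan_out))
    return levels

levels = index_levels(938, 32)
print(levels)        # [938, 30, 1]
print(len(levels))   # 3 levels, so 3 index block accesses per search
```

A search reads one block per level, so three accesses replace the 10 needed by a binary search on the single-level index.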
Using B-Trees and B+-Trees as Dynamic Multi-level Indexes
- These data structures are variations of search trees that allow
efficient insertion and deletion of search values.
- In B-tree and B+-tree data structures, each node corresponds to a
disk block.
- Each node is kept between half full and completely full.
- An insertion into a node that is not full is quite efficient; if a node is
full, the insertion causes a split into two nodes.
- Splitting may propagate to other tree levels.
- A deletion is quite efficient if a node does not become less than half
full.
- If a deletion causes a node to become less than half full, it must be
merged with neighboring nodes.
Difference between B-tree and B+-tree:
- In a B-tree, pointers to data records exist at all levels of the tree.
- In a B+-tree, all pointers to data records exist at the leaf-level
nodes.
- A B+-tree can have fewer levels (or a higher capacity of search values)
than the corresponding B-tree.
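The split of a full node can be illustrated for a single B+-tree leaf. This is only a partial sketch under assumptions: it handles one leaf holding sorted keys, ignores record pointers and the parent update, and the function name is hypothetical. The separator it returns is the key a parent node would need to receive, which is how a split propagates upward.

```python
from bisect import insort

def insert_into_leaf(leaf_keys, key, capacity):
    """Insert key into a sorted leaf.

    Returns (left_keys, right_keys, separator). right_keys and separator
    are None when no split was needed.
    """
    keys = list(leaf_keys)
    insort(keys, key)                  # keep the leaf ordered
    if len(keys) <= capacity:
        return keys, None, None        # the node was not full: cheap insert
    # Overflow: split into two nodes that are each at least half full.
    mid = (len(keys) + 1) // 2
    left, right = keys[:mid], keys[mid:]
    # In a B+-tree the separator is copied up; it also stays in the right leaf.
    return left, right, right[0]
```

With a capacity of 3, inserting 25 into the leaf [10, 20, 30] yields the leaves [10, 20] and [25, 30] with separator 25.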