Comp 335 File Structures

Hashing
What is Hashing?


A process used with record files that
tries to achieve O(1) (i.e., constant-time)
access to a record’s location in the file.
An algorithm, called a hash function
(h), is given a primary key as input; the
resulting output is the location of the
record within the file; h(key) =
address.
Hashing Example


Assume you want to store 5,000 data records in a
file. You want this to be a hashed file for quick
access. Each record will be fixed in length and the
primary key for each record is an employee number
which is 8 digits long.
A common hash function uses modulo
arithmetic.
 h(key) = key mod n; n = 5000
 h(82461792) = 82461792 mod 5000 = 1792
 The address (RRN) of the record with this key is
1792
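The modulo example above can be sketched in a few lines. The function name mod_hash is hypothetical; the key and n = 5000 are the values from the example.

```python
def mod_hash(key, n):
    """Return the home address (RRN) of key in an address space of n slots."""
    return key % n

# Employee number from the example, hashed into 5,000 addresses.
print(mod_hash(82461792, 5000))  # 1792
```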
Other Hashing Methods
Folding


Folding requires extracting certain groupings from the key and
then adding or multiplying the groupings in some fashion to
form the hash address.
Example:
Key = “BISON”
Address Space = 101
Step 1 – Get the ASCII value of each character in the string
B(66), I(73), S(83), O(79), N(78)
Step 2 – Add the values at even indexes (0, 2, 4)
66 + 83 + 78 = 227
Step 3 – Add the values at odd indexes (1, 3)
73 + 79 = 152
Step 4 – Multiply the results
227 * 152 = 34504
Step 5 – Take the product modulo the address space
34504 mod 101 = 63 (hash address)
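A minimal sketch of this particular folding scheme, assuming 0-based character indexes and ASCII keys. The name fold_hash is hypothetical, and other folding variants group and combine the key differently.

```python
def fold_hash(key, n):
    """Fold a string key: sum even-index codes, sum odd-index codes,
    multiply the two sums, then reduce modulo the address space n."""
    codes = [ord(ch) for ch in key]   # character codes (ASCII for "BISON")
    even_sum = sum(codes[0::2])       # indexes 0, 2, 4, ...
    odd_sum = sum(codes[1::2])        # indexes 1, 3, ...
    return (even_sum * odd_sum) % n

print(fold_hash("BISON", 101))  # 63
```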
Other Hashing Methods
Mid-Square


Involves squaring the “numeric” form
of a key and extracting some of the
digits from the “middle of the square”.
Example:




Assume address space is 1000
Key(4 digit int) = 2973
2973 * 2973 = 8838729
Extract “middle” digits = 387 (hash
address)
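A sketch of mid-square hashing for the example above. The slides do not say how to choose the “middle” when the square cannot be split evenly, so centering the window and leaning left is an assumption here; mid_square_hash is a hypothetical name.

```python
def mid_square_hash(key, digits=3):
    """Square the key and return `digits` digits taken from the middle."""
    squared = str(key * key).zfill(digits)   # pad very small squares with zeros
    start = (len(squared) - digits) // 2     # assumed centering rule
    return int(squared[start:start + digits])

# 2973 * 2973 = 8838729; the middle three digits are 387.
print(mid_square_hash(2973))  # 387
```

With an address space of 1000, three digits are extracted so that every result falls in 0 to 999.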
Other Hashing Methods
Radix Transformation


Convert the key to a different base and
then use modulo arithmetic.
Example:




Address space is 100.
Key is 453 (base 10)
Conversion: 382 (base 11)
382 mod 100 = 82 (hash address)
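A sketch of radix transformation, assuming the converted digits are read back as an ordinary decimal number before the modulo step. For instance, 453 in base 10 is 382 in base 11, and 382 mod 100 = 82. The names to_base and radix_hash are hypothetical.

```python
def to_base(value, base):
    """Return the digits of a non-negative integer in the given base."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        digits.append(value % base)
        value //= base
    return digits[::-1]

def radix_hash(key, base, n):
    """Convert key to `base`, reread the digits as decimal, reduce mod n."""
    digits = to_base(key, base)
    # Assumption: a digit of 10 or more is written out in decimal ("10").
    reread = int("".join(str(d) for d in digits))
    return reread % n

print(to_base(453, 11))          # [3, 8, 2]
print(radix_hash(453, 11, 100))  # 82
```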
Other Hashing Methods
Multiplicative Function


Involves multiplying the key by some
constant less than one; the hash function
returns some of the digits of the fractional
part of the result.
Example:





Address space = 1000
Key (5 digit integer): 82165
Multiplier: 0.39731
82165 * 0.39731 = 32644.97615
The first three digits of the fractional part
form the hash address: 976
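A sketch of the multiplicative method using ordinary floating-point arithmetic, which is precise enough for the values in this example. The name multiplicative_hash is hypothetical.

```python
def multiplicative_hash(key, multiplier, digits=3):
    """Multiply the key by a constant below one and return the leading
    `digits` digits of the fractional part of the product."""
    product = key * multiplier
    fraction = product - int(product)    # keep only the fractional part
    return int(fraction * 10 ** digits)  # shift the wanted digits left of the point

# 82165 * 0.39731 = 32644.97615, so the address is 976.
print(multiplicative_hash(82165, 0.39731))  # 976
```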
Major Problem with Hashing



Given a random set of keys and a hash function (h), it is highly
probable that some keys in the set will be hash synonyms. In
other words, the same hash function output can be obtained
from different keys in the set.
A hashing algorithm can yield three different types of address
distributions:

Perfect – no synonyms given a set of keys; the probability of
obtaining a perfect distribution from a large set of unknown
keys is very, very low (textbook – 1 out of 10^120,000)

Random – “few” synonyms generated; what we strive for!

Scud – many synonyms generated
If the set of keys is known beforehand, it is possible to generate
a perfect hashing algorithm (Pearson, Cichelli)
Collisions



When two or more keys hash to the same
address, this is called a collision.
This has to be accounted for with random
hashing algorithms.
The handling of collisions becomes a critical
issue in the overall search efficiency of a
given file. Remember each search could
mean a “disk access”.
Decreasing the Probability of
Collisions


Increase the address space – a common
technique; allocate more addresses in the
file than records to store; this can decrease
the possibility of collisions greatly assuming
the hashing algorithm is random. The
disadvantage obviously is wasted space.
Place more than one record at an address.
Such addresses are commonly referred to as buckets. A
single address can store an array of
records. This has been shown to increase
search efficiency.
Collision Resolution


Even if you have tried to decrease the
probability of collisions, they still can
and will happen.
Ways to resolve collisions:




Linear Probing
Double Hashing
Prime area with overflow
Chaining
Linear Probing




If a key is hashed to an address already occupied
or full, search the address space linearly until the
first free space is found.
Easy to implement, however this technique can
lead to poor search efficiency. This technique can
take away home addresses from other keys
resulting in more collision handling.
It can also take many accesses to determine if a
key does not exist.
What happens when a key is deleted under this technique?
If the slot is simply marked empty, a later search for a key that
was probed past that slot will stop early and miss it.
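A minimal in-memory sketch of linear probing, with a Python list standing in for the file. Deletion uses a tombstone marker, a common way to keep later searches working. The slides do not prescribe a deletion method, so that choice, the class name, and the modulo home address are all assumptions.

```python
EMPTY = None
DELETED = object()  # tombstone: slot was used once, so keep probing past it

class LinearProbingFile:
    def __init__(self, n):
        self.n = n
        self.slots = [EMPTY] * n

    def _probe(self, key):
        """Yield addresses starting at the home address, wrapping around."""
        home = key % self.n
        for step in range(self.n):
            yield (home + step) % self.n

    def insert(self, key):
        # Takes the first free or tombstoned slot; no duplicate check.
        for addr in self._probe(key):
            if self.slots[addr] is EMPTY or self.slots[addr] is DELETED:
                self.slots[addr] = key
                return addr
        raise RuntimeError("file is full")

    def search(self, key):
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:          # a never-used slot ends the search
                return None
            if slot is not DELETED and slot == key:
                return addr
        return None

    def delete(self, key):
        addr = self.search(key)
        if addr is None:
            return False
        self.slots[addr] = DELETED     # do not reset to EMPTY
        return True

f = LinearProbingFile(7)
for k in (10, 17, 24):                 # all three have home address 3
    f.insert(k)
f.delete(17)
print(f.search(24))                    # 5, still reachable past the tombstone
```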
Double Hashing



Upon a collision, the key is re-hashed using a
different algorithm; this determines the
increment to take to search for an open
address space.
The same problems exist as with linear
probing.
Research has shown that this technique will
give better performance than linear probing.
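A sketch of double hashing. The slides do not name the second hash function, so h2(key) = 1 + key mod (n - 1) is an assumption, chosen because it never returns zero. The table size n is assumed to be prime so that every increment eventually visits every address.

```python
def h1(key, n):
    return key % n                    # home address

def h2(key, n):
    return 1 + key % (n - 1)          # assumed second hash: step in 1..n-1

def probe_sequence(key, n):
    """Addresses to try, in order: home, home + step, home + 2*step, ..."""
    home, step = h1(key, n), h2(key, n)
    return [(home + i * step) % n for i in range(n)]

def insert(slots, key):
    for addr in probe_sequence(key, len(slots)):
        if slots[addr] is None:
            slots[addr] = key
            return addr
    raise RuntimeError("file is full")

slots = [None] * 7                    # n = 7 is prime
for k in (10, 17, 24):                # all three have home address 3
    print(k, insert(slots, k))        # 10 -> 3, 17 -> 2, 24 -> 4
```

Keys with the same home address usually get different increments, so they do not pile up in one contiguous run the way they do with linear probing.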
Prime area with Overflow


Usually used with buckets. A bucket will
hold x number of records in the prime
address space and will also contain a
pointer to an overflow area of the file which
is entry-sequenced. This pointer gives the
location of the first overflow record, and each
overflow record contains a pointer to the next
overflow record.
This is a common technique and gives
excellent search efficiency.
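A sketch of a prime area with overflow, using Python lists in place of the two file regions. The bucket size of 2, the modulo hash, and the names below are assumptions made for illustration; search returns the number of accesses so the cost of following the overflow chain is visible.

```python
BUCKET_SIZE = 2  # assumed number of records per prime-area bucket

class OverflowFile:
    def __init__(self, n):
        # Prime area: each bucket holds records plus a pointer into overflow.
        self.prime = [{"records": [], "overflow": None} for _ in range(n)]
        # Overflow area: entry-sequenced list of [key, next_index] entries.
        self.overflow = []

    def insert(self, key):
        bucket = self.prime[key % len(self.prime)]
        if len(bucket["records"]) < BUCKET_SIZE:
            bucket["records"].append(key)
            return
        self.overflow.append([key, None])        # append at end of overflow area
        new_index = len(self.overflow) - 1
        if bucket["overflow"] is None:
            bucket["overflow"] = new_index       # first overflow record
            return
        index = bucket["overflow"]
        while self.overflow[index][1] is not None:
            index = self.overflow[index][1]      # walk to the end of the chain
        self.overflow[index][1] = new_index

    def search(self, key):
        """Return the number of accesses needed to find key, or None."""
        bucket = self.prime[key % len(self.prime)]
        accesses = 1                             # reading the bucket
        if key in bucket["records"]:
            return accesses
        index = bucket["overflow"]
        while index is not None:
            accesses += 1                        # one access per overflow record
            if self.overflow[index][0] == key:
                return accesses
            index = self.overflow[index][1]
        return None

f = OverflowFile(5)
for k in (5, 10, 15, 20, 7):
    f.insert(k)
print(f.search(5), f.search(15), f.search(20))   # 1 2 3
```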
Chaining


The file consists of a hash table which is
simply an array of pointers. When a key is
hashed, the result is an index into the hash
table. At this location is a pointer to the first
record which has this hash address. All the
records are then “chained” together as a
linked list.
The data record portion of the file can be
entry-sequenced.
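A sketch of chaining with two Python lists: one for the hash table of pointers and one for the entry-sequenced data records. Linking each new record at the head of its chain and using a modulo hash are assumptions, as are the names.

```python
class ChainedFile:
    def __init__(self, n):
        self.table = [None] * n   # hash table: record number of each chain's head
        self.records = []         # entry-sequenced data: (key, next_record_number)

    def insert(self, key):
        slot = key % len(self.table)
        # The new record points at the old head, then becomes the head.
        self.records.append((key, self.table[slot]))
        self.table[slot] = len(self.records) - 1

    def search(self, key):
        """Return the record number holding key, or None if absent."""
        rrn = self.table[key % len(self.table)]
        while rrn is not None:
            stored_key, next_rrn = self.records[rrn]
            if stored_key == key:
                return rrn
            rrn = next_rrn
        return None

f = ChainedFile(5)
for k in (12, 7, 22):             # all three hash to table index 2
    f.insert(k)
print(f.search(12), f.search(7), f.search(22))  # 0 1 2
```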
Hash Address Distributions

Assuming you have a random hash
function, the Poisson Function can be used
to compute various probabilities such as:



How many empty hash slots will there be?
What percentage of the time will finding a key
take more than one access?
What is the probability that a certain hash
address will have x number of keys assigned to
it?
Poisson Function
$p(x) = \frac{(r/n)^x \, e^{-r/n}}{x!}$
n – the address space
r - number of keys to hash
x – number of records assigned to a given address
r/n = packing density; load factor
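The function can be evaluated directly. This sketch uses the same symbols (x, r, n); the function name poisson is hypothetical.

```python
import math

def poisson(x, r, n):
    """Probability that a given address has exactly x of the r keys,
    assuming a random hash function over n addresses."""
    load = r / n                               # packing density
    return load ** x * math.exp(-load) / math.factorial(x)

# With 1,000 keys and 1,000 addresses:
print(round(poisson(0, 1000, 1000), 3))        # 0.368, share of empty addresses
print(round(poisson(2, 1000, 1000), 3))        # 0.184, share holding two keys
```

Multiplying p(0) by n estimates the number of empty hash slots, which answers the first question above.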
Poisson Function Example
Assume 1,000 records to be hashed into a 1,000 address hashed
file.
1) What is the probability that a given address will have two
keys hashed to it?
$p(2) = \frac{(1{,}000/1{,}000)^2 \, e^{-1{,}000/1{,}000}}{2!} = \frac{e^{-1}}{2} = \frac{.368}{2} = .184$
2) How many addresses can be expected to have two keys hashed to them?
1,000 (number of addresses) * .184 = 184
Therefore approximately 184 addresses will have 2 keys
hashed to them, which means these addresses alone produce
184 overflow records (one each, with one record per address).
Poisson Function Example
Assume 1,000 records to be hashed into a 1,500 address hashed
file.
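1) What is the probability that a given address will have two
keys hashed to it?
$p(2) = \frac{(1{,}000/1{,}500)^2 \, e^{-1{,}000/1{,}500}}{2!} = \frac{(.67)^2 \, e^{-.67}}{2!} = \frac{(.449)(.512)}{2} = \frac{.230}{2} = .115$
2) How many addresses can be expected to have two keys hashed to them?
1,500 (number of addresses) * .115 = 172.5 (rounded up to 173)
Therefore approximately 173 addresses will have 2 keys
hashed to them, which means these addresses alone produce
173 overflow records (one each, with one record per address).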
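The two examples count only addresses that receive exactly two keys. As a rough sketch, the same function can also sum the overflow from addresses with three or more keys, assuming one record per address. The helper names are hypothetical, and the sum is cut off at 50 keys per address, where the remaining terms are negligible.

```python
import math

def poisson(x, r, n):
    load = r / n
    return load ** x * math.exp(-load) / math.factorial(x)

def expected_addresses(x, r, n):
    """Expected number of addresses that receive exactly x keys."""
    return n * poisson(x, r, n)

def expected_overflow(r, n, bucket_size=1, max_x=50):
    """Expected overflow records: every key beyond bucket_size at an address."""
    return sum((x - bucket_size) * expected_addresses(x, r, n)
               for x in range(bucket_size + 1, max_x + 1))

for r, n in [(1000, 1000), (1000, 1500)]:
    print(n, round(expected_addresses(2, r, n)), round(expected_overflow(r, n)))
# 1000 addresses: about 184 with two keys, about 368 overflow records in total
# 1500 addresses: about 171 with two keys, about 270 overflow records in total
```

The figure of about 171 for the second example differs slightly from 173 because r/n is not rounded to .67 here.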