CS 405G: Introduction to Database Systems 13 Hashing Chen Qian

CS 405G: Introduction to
Database Systems
13 Hashing
Chen Qian
University of Kentucky
Searching
Consider the problem of searching an array for a given value
  If the array is not sorted, the search requires O(n) time
  If the array is sorted, we can do a binary search
    A binary search requires O(log n) time
  It doesn’t seem like we could do much better
How about an O(1), that is, constant time search?
  We can do it if the array is organized in a particular way
Hashing algorithm
Suppose we were to come up with a “magic function”
that, given a value to search for, would tell us exactly
where in the array to look



If it’s in that location, it’s in the array
If it’s not in that location, it’s not in the array
This function is called a hash function because it “makes
hash” of its inputs

Example (ideal) hash function

Suppose our hash function gave us the following values:
  hashCode("apple") = 5
  hashCode("watermelon") = 3
  hashCode("grapes") = 8
  hashCode("cantaloupe") = 7
  hashCode("kiwi") = 0
  hashCode("strawberry") = 9
  hashCode("mango") = 6
  hashCode("banana") = 2
The resulting array:
  0  kiwi
  1  (empty)
  2  banana
  3  watermelon
  4  (empty)
  5  apple
  6  mango
  7  cantaloupe
  8  grapes
  9  strawberry
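A minimal Python sketch of the lookup this slide describes. The hash values are copied from the slide; the fallback for strings that are not on the slide (`len(value) % TABLE_SIZE`) and the names `hash_code` and `contains` are assumptions made only so the example runs.

```python
# Hash values copied from the slide; everything else is illustrative.
HASH_CODES = {
    "apple": 5, "watermelon": 3, "grapes": 8, "cantaloupe": 7,
    "kiwi": 0, "strawberry": 9, "mango": 6, "banana": 2,
}
TABLE_SIZE = 10


def hash_code(value):
    """Stand-in for the slide's hashCode; the fallback is hypothetical."""
    return HASH_CODES.get(value, len(value) % TABLE_SIZE)


# Place every value at the location its hash code names.
table = [None] * TABLE_SIZE
for fruit in HASH_CODES:
    table[hash_code(fruit)] = fruit


def contains(value):
    """One array access: the value is either at its location or absent."""
    return table[hash_code(value)] == value
```

With no two values sharing a location, `contains` looks at exactly one slot, which is the O(1) search the slides are after.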
Sets and tables


Sometimes we just want a set of things: objects are either in it, or they are not in it
Sometimes we want a map, a way of looking up one thing based on the value of another
  We use a key to find a place in the map
  The associated value is the information we are trying to look up
Example map:
  location  key      value
  ...
  141       (empty)
  142       robin    robin info
  143       sparrow  sparrow info
  144       hawk     hawk info
  145       seagull  seagull info
  146       (empty)
  147       bluejay  bluejay info
  148       owl      owl info
  ...
Finding the hash function


How can we come up with this magic function?
  In general, we cannot: there is no such magic function
  In a few specific cases, where all the possible values are known in advance, it has been possible to compute a perfect hash function
What is the next best thing?
  A perfect hash function would tell us exactly where to look
  In general, the best we can do is a function that tells us where to start looking!
Example imperfect hash function

Suppose our hash function gave us the following values:
  hash("apple") = 5
  hash("watermelon") = 3
  hash("grapes") = 8
  hash("cantaloupe") = 7
  hash("kiwi") = 0
  hash("strawberry") = 9
  hash("mango") = 6
  hash("banana") = 2
  hash("honeydew") = 6
The array before honeydew is added:
  0  kiwi
  1  (empty)
  2  banana
  3  watermelon
  4  (empty)
  5  apple
  6  mango
  7  cantaloupe
  8  grapes
  9  strawberry
honeydew hashes to 6, which mango already occupies. Now what?
Collisions



When two values hash to the same array location,
this is called a collision
Collisions are normally treated as “first come, first
served”—the first value that hashes to the location
gets it
We have to find something to do with the second and
subsequent values that hash to this same location
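As a small illustration, the following Python sketch groups the values from the imperfect example by their hash location and reports every location claimed by more than one value. The hash values are the ones on the previous slide; the variable names are assumptions.

```python
from collections import defaultdict

# Hash values copied from the "imperfect hash function" slide.
HASH = {
    "apple": 5, "watermelon": 3, "grapes": 8, "cantaloupe": 7,
    "kiwi": 0, "strawberry": 9, "mango": 6, "banana": 2, "honeydew": 6,
}

# Group values by the array location they hash to, in arrival order.
by_location = defaultdict(list)
for value, location in HASH.items():
    by_location[location].append(value)

# A collision is any location with more than one claimant.
collisions = {loc: vals for loc, vals in by_location.items() if len(vals) > 1}
```

Under "first come, first served", the first name in each list keeps the location and the rest must be placed somewhere else.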
Handling collisions

What can we do when two different values attempt to occupy the same place in an array?
  Solution #1: Search from there for an empty location
    Can stop searching when we find the value or an empty location
    Search must be end-around
  Solution #2: Use a second hash function
    ...and a third, and a fourth, and a fifth, ...
  Solution #3: Use the array location as the header of a linked list of values that hash to this location
All these solutions work, provided:
  We use the same technique to add things to the array as we use to search for things in the array
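Solution #3 can be sketched in a few lines of Python. This is only an illustration: a Python list stands in for the linked list at each array location, and the class name `ChainedHashSet` and its `hash_fn` parameter are assumptions, not part of the slides.

```python
class ChainedHashSet:
    """Set of values where each array location heads a chain of values."""

    def __init__(self, capacity, hash_fn=hash):
        # One (initially empty) chain per array location.
        self.chains = [[] for _ in range(capacity)]
        self.hash_fn = hash_fn

    def _chain(self, value):
        # Adding and searching both start from the same location.
        return self.chains[self.hash_fn(value) % len(self.chains)]

    def add(self, value):
        chain = self._chain(value)
        if value not in chain:      # already present: do nothing
            chain.append(value)

    def __contains__(self, value):
        return value in self._chain(value)


# Usage with the colliding values from the earlier slide (assumed hashes).
slide_hash = {"mango": 6, "honeydew": 6, "kiwi": 0}
fruits = ChainedHashSet(10, hash_fn=lambda v: slide_hash.get(v, 0))
fruits.add("mango")
fruits.add("honeydew")
```

Because `add` and `__contains__` share `_chain`, the proviso on the slide holds by construction: the same technique is used to add and to search.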
Insertion, I



Suppose you want to add seagull to this hash table
Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is empty
Therefore, put seagull at location 145
The table:
  ...
  141  (empty)
  142  robin
  143  sparrow
  144  hawk
  145  seagull (newly added)
  146  (empty)
  147  bluejay
  148  owl
  ...
Searching, I


Suppose you want to look up seagull in this hash table
Also suppose:
  hashCode(seagull) = 143
  table[143] is not empty
  table[143] != seagull
  table[144] is not empty
  table[144] != seagull
  table[145] is not empty
  table[145] == seagull !
We found seagull at location 145
The table:
  ...
  141  (empty)
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146  (empty)
  147  bluejay
  148  owl
  ...
Searching, II


Suppose you want to look up cow in this hash table
Also suppose:
  hashCode(cow) = 144
  table[144] is not empty
  table[144] != cow
  table[145] is not empty
  table[145] != cow
  table[146] is empty
If cow were in the table, we should have found it by now
Therefore, it isn’t here
The table:
  ...
  141  (empty)
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146  (empty)
  147  bluejay
  148  owl
  ...
Insertion, II



Suppose you want to add hawk to this hash table
Also suppose
  hashCode(hawk) = 143
  table[143] is not empty
  table[143] != hawk
  table[144] is not empty
  table[144] == hawk
hawk is already in the table, so do nothing
The table:
  ...
  141  (empty)
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146  (empty)
  147  bluejay
  148  owl
  ...
Insertion, III

Suppose:
  You want to add cardinal to this hash table
  hashCode(cardinal) = 147
  The last location is 148
  147 and 148 are occupied
Solution:
  Treat the table as circular; after 148 comes 0
  Hence, cardinal goes in location 0 (or 1, or 2, or ...)
The table:
  ...
  141  (empty)
  142  robin
  143  sparrow
  144  hawk
  145  seagull
  146  (empty)
  147  bluejay
  148  owl
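The insertion and search walkthroughs above can be replayed with a short Python sketch of linear probing (Solution #1). The table contents and hash codes come from the slides; the function names `probe` and `insert`, the table size of 149 (locations 0 to 148), and the assumption that locations 0 to 141 are empty are illustrative only.

```python
def probe(table, value, home):
    """Linear probing from `home`; returns (index, found).

    Stops at the value itself or at the first empty slot (None),
    wrapping around the end of the table. Returns (None, False)
    only if every slot holds some other value.
    """
    n = len(table)
    for step in range(n):
        i = (home + step) % n       # end-around: after the last slot comes 0
        if table[i] is None:
            return i, False
        if table[i] == value:
            return i, True
    return None, False


def insert(table, value, home):
    """Add value using the same probe sequence that searching uses."""
    i, found = probe(table, value, home)
    if i is None:
        raise RuntimeError("table is full")
    if not found:
        table[i] = value
    return i


# Table from the slides, before seagull is added (other slots assumed empty).
table = [None] * 149
for loc, bird in {142: "robin", 143: "sparrow", 144: "hawk",
                  147: "bluejay", 148: "owl"}.items():
    table[loc] = bird

seagull_at = insert(table, "seagull", 143)    # Insertion, I
seagull_hit = probe(table, "seagull", 143)    # Searching, I
cow_miss = probe(table, "cow", 144)           # Searching, II
hawk_at = insert(table, "hawk", 143)          # Insertion, II
cardinal_at = insert(table, "cardinal", 147)  # Insertion, III
```

This sketch does not support deletion; simply emptying a slot would cut probe sequences that pass through it.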
Efficiency




Hash tables are actually surprisingly efficient
Until the table is about 70% full, the number of
probes (places looked at in the table) is typically
only 2 or 3
Sophisticated mathematical analysis is required to
prove that the expected cost of inserting into a hash
table, or looking something up in the hash table, is
O(1)
Even if the table is nearly full (leading to occasional
long searches), efficiency is usually still quite high
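One way to get a feel for these numbers is a small simulation. The sketch below fills a linear-probing table with random integer keys and measures the average number of probes needed to find a key that is in the table. The table size, the seed, and the use of `key % capacity` as the hash function are all assumptions chosen for illustration; it measures successful lookups only, and unsuccessful lookups and insertions generally need more probes.

```python
import random


def average_probes(capacity, load_factor, seed=0):
    """Average probes for a successful lookup with linear probing."""
    rng = random.Random(seed)
    # Distinct random keys; load_factor must be below 1 so inserts terminate.
    keys = rng.sample(range(100 * capacity), int(capacity * load_factor))

    table = [None] * capacity
    for key in keys:
        i = key % capacity
        while table[i] is not None:     # search end-around for an empty slot
            i = (i + 1) % capacity
        table[i] = key

    total = 0
    for key in keys:
        i = key % capacity
        probes = 1                      # the home location counts as a probe
        while table[i] != key:
            i = (i + 1) % capacity
            probes += 1
        total += probes
    return total / len(keys)


half_full = average_probes(10007, 0.5)
seventy_percent = average_probes(10007, 0.7)
ninety_percent = average_probes(10007, 0.9)
```

Comparing the three results shows how the probe count grows as the table fills.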
Hash for database indexing

Hash-based indexes
  are best for equality selections
  cannot support range searches
A good hash function is:
  Random
  Uniform
Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.
7/1/2016
Chen Qian @ University of Kentucky
Static Hashing



Each record (entry) has a key k
# primary pages fixed, allocated sequentially, never
de-allocated; overflow pages if needed.
h(k) mod N = bucket to which data entry with key k
belongs. (N = # of buckets)
[Figure: a key is passed through h; h(key) mod N selects one of the primary bucket pages 0 ... N-1, and each primary bucket page may have a chain of overflow pages.]
Static Hashing (Contd.)


Buckets contain data records.
Hashing works on search key field of record r. Must distribute values over range 0 ... N-1.
  h(key) = a binary number with 32/64/128/256 bits.
  h(k) mod N is a way to distribute values
Long overflow chains can develop and degrade performance.
  Extendible and Linear Hashing: Dynamic techniques to fix this problem.
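A rough Python sketch of static hashing, to make the overflow-chain problem concrete. Pages are modelled as small lists with a fixed capacity; the class name `StaticHashFile`, the method names, and the choice of the identity function for h in the usage are assumptions.

```python
class StaticHashFile:
    """N fixed primary pages; a full bucket grows a chain of overflow pages."""

    def __init__(self, num_buckets, page_capacity, h=hash):
        self.n = num_buckets
        self.page_capacity = page_capacity
        self.h = h
        # Each bucket is a list of pages; page 0 is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def bucket_of(self, key):
        return self.h(key) % self.n          # h(k) mod N

    def insert(self, key):
        pages = self.buckets[self.bucket_of(key)]
        if len(pages[-1]) == self.page_capacity:
            pages.append([])                 # allocate an overflow page
        pages[-1].append(key)

    def pages_read(self, key):
        """Pages scanned by an equality search for key (found or not)."""
        pages = self.buckets[self.bucket_of(key)]
        for count, page in enumerate(pages, start=1):
            if key in page:
                return count
        return len(pages)


# Usage: 4 buckets, 2 entries per page, keys that mostly land in bucket 0.
f = StaticHashFile(4, 2, h=lambda k: k)
for k in [0, 4, 8, 12, 16, 1]:
    f.insert(k)
```

The number of buckets never changes, so a bucket that keeps receiving keys keeps lengthening its chain, and every search in that bucket pays for it.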
Extendible Hashing

Situation: Bucket (primary page) becomes full. Why not reorganize file by doubling # of buckets?
  Reading and writing all pages is expensive!
Idea: Use directory of pointers to buckets
  double # of buckets by doubling the directory, splitting just the bucket that overflowed!
  Directory much smaller than file, so doubling it is much cheaper.
  Only one page of data entries is split. No overflow page!
Example
Directory is array of size 4.
To find bucket for r, take last 'global depth' # bits of h(r).
  If h(5) = binary(5) = 101, it is in bucket pointed to by 01.

GLOBAL DEPTH = 2
  DIRECTORY   LOCAL DEPTH   DATA PAGES
  00          2             Bucket A: 4* 12* 32* 16*
  01          2             Bucket B: 1* 5* 21* 13*
  10          2             Bucket C: 10*
  11          2             Bucket D: 15* 7* 19*


Insert: If bucket is full, split it (allocate new page, re-distribute).

If necessary, double the directory. (As we will see, splitting a
bucket does not always require doubling; we can tell by
comparing global depth with local depth for the split bucket.)
Suppose now insert 20, whose hash is 10100.
  Bucket A is full.
  Split A to A1 and A2
    A1: values ending with 000
    A2: values ending with 100
Insert h(r)=20 (Causes Doubling)
After splitting Bucket A, before the directory is doubled (GLOBAL DEPTH = 2):
  DIRECTORY   LOCAL DEPTH   DATA PAGES
  00          2             Bucket A: 32* 16*
  01          2             Bucket B: 1* 5* 21* 13*
  10          2             Bucket C: 10*
  11          2             Bucket D: 15* 7* 19*
              2             Bucket A2 ('split image' of Bucket A): 4* 12* 20*

After doubling the directory (GLOBAL DEPTH = 3):
  DIRECTORY   LOCAL DEPTH   DATA PAGES
  000         3             Bucket A: 32* 16*
  001         2             Bucket B: 1* 5* 21* 13*
  010         2             Bucket C: 10*
  011         2             Bucket D: 15* 7* 19*
  100         3             Bucket A2 ('split image' of Bucket A): 4* 12* 20*
  101         2             Bucket B
  110         2             Bucket C
  111         2             Bucket D
Points to Note

20 = binary 10100.
  Last 2 bits (00) tell us r belongs in A or others.
  Last 3 bits needed to tell us r belongs in A1 or A2.
  Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to.
  Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
When does bucket split cause directory doubling?
  Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and 'fixing' pointer to split image page.
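The split-and-double rule can be sketched in Python and checked against the example. This is an illustrative sketch, not a disk-based implementation: buckets are in-memory lists, the stored numbers are the hash values themselves (as in the figures), and the names `Bucket`, `ExtendibleHash` and `max_depth` are assumptions. The `max_depth` guard exists because more than one page of entries with the same hash value would otherwise split forever.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHash:
    def __init__(self, capacity=4, global_depth=2, max_depth=16):
        self.capacity = capacity
        self.global_depth = global_depth
        self.max_depth = max_depth      # hypothetical guard against duplicates
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, h):
        # Take the last `global_depth` bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)
            return
        if bucket.local_depth == self.global_depth:
            if self.global_depth == self.max_depth:
                raise RuntimeError("too many entries share the same hash bits")
            # Double the directory by copying it over.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket on one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        old = bucket.entries
        bucket.entries = [e for e in old if not e & new_bit]
        image.entries = [e for e in old if e & new_bit]
        # 'Fix' the directory pointers that should now reach the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        self.insert(h)                  # retry; may split again


# Load the example file, then insert 20 (binary 10100).
eh = ExtendibleHash(capacity=4, global_depth=2)
for h in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(h)
eh.insert(20)
```

After the last insert the directory has 8 elements; elements 000 and 100 point to the two halves of the old Bucket A, while 001 and 101 still share Bucket B.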
Comments on Extendible Hashing


Static hash: one disk access for equality search (when the bucket has no overflow pages)
If directory fits in memory, equality search answered with one disk access; else two.
  100MB file, 100 bytes/rec, 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.
  Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
  Multiple entries with same hash value cause problems!
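The numbers in the sizing example can be checked with a few lines of Python. The sketch assumes "100MB" means 100 * 10^6 bytes, "4K" means 4096 bytes, that records are not split across pages, and that there is one directory element per page of data entries.

```python
file_bytes = 100 * 10**6     # 100MB file (assumed decimal megabytes)
record_bytes = 100           # 100 bytes per record
page_bytes = 4096            # 4K pages (assumed 4096 bytes)

records = file_bytes // record_bytes               # data entries in the file
records_per_page = page_bytes // record_bytes      # whole records per page
directory_elements = records // records_per_page   # one element per page
```

Even at several bytes per pointer, a directory of this size is on the order of a hundred kilobytes, which is why it is likely to fit in memory.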
Linear Hashing



This is another dynamic hashing scheme, an alternative to Extendible Hashing.
Linear hashing handles the problem of long overflow chains without using a directory, and handles duplicates.
Idea: Use a family of hash functions h_0, h_1, h_2, ...
  h_i(key) = h(key) mod (2^i N); N = initial # buckets
  h is some hash function (range is not 0 to N-1)
  h_0 ranges in 0 to N-1. (If N = 2^d, looking at the last d bits)
  h_1 ranges in 0 to 2N-1. (looking at the last d+1 bits)
  h_{i+1} doubles the range of h_i (similar to directory doubling)
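A tiny Python sketch of the hash-function family, assuming N = 4 (so d = 2) and taking h(key) to be the key itself, as the later examples do. The function name `h_i` and its parameter names are assumptions.

```python
def h_i(i, hash_value, n=4):
    """h_i(key) = h(key) mod (2^i * N), with hash_value standing in for h(key)."""
    return hash_value % (2 ** i * n)


# 44 is binary 101100; each h_i keeps one more trailing bit than the last.
h0, h1, h2 = h_i(0, 44), h_i(1, 44), h_i(2, 44)

# With N = 2^d, h_i is the same as masking off the last d + i bits.
d = 2
same_as_mask = all(
    h_i(i, v) == v & ((1 << (d + i)) - 1)
    for i in range(4)
    for v in range(200)
)
```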
Linear Hashing (Contd.)

Directory avoided in LH by using overflow pages, and choosing bucket to split round-robin.
  Splitting proceeds in 'rounds'. Round ends when all N_R initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to N_R - 1 yet to be split.
  Current round number is Level.
Search: To find bucket for data entry r, find h_Level(r):
  If h_Level(r) in range 'Next to N_R - 1', r belongs here.
  Else, r could belong to bucket h_Level(r) or bucket h_Level(r) + N_R; must apply h_{Level+1}(r) to find out.
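The search rule translates almost directly into code. In this Python sketch, `hash_value` stands in for h(r), `n0` is the initial number of buckets N, and the function name `bucket_for` is an assumption.

```python
def bucket_for(hash_value, level, next_bucket, n0):
    """Bucket number for a data entry, given the current Level and Next."""
    n_level = n0 * 2 ** level           # N_R: buckets at the start of the round
    b = hash_value % n_level            # h_Level(r)
    if b < next_bucket:
        # This bucket was already split in this round, so the entry is in
        # either b or b + N_R; h_{Level+1}(r) decides which.
        b = hash_value % (2 * n_level)
    return b
```

For instance, with N = 4, Level = 0 and Next = 1, an entry with hash value 44 maps to h_0 = 0, which has already been split, so h_1 = 4 is used instead; an entry with hash value 43 maps to bucket 3 directly.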
Overview of LH File

In the middle of a round.
[Figure: the buckets of an LH file, top to bottom:
  Buckets split in this round: if h_Level(search key value) is in this range, must use h_{Level+1}(search key value) to decide if entry is in 'split image' bucket.
  Next: bucket to be split.
  Buckets that existed at the beginning of this round: this is the range of h_Level.
  'Split image' buckets: created (through splitting of other buckets) in this round.]
Linear Hashing (Contd.)

Insert: Find bucket by applying h_Level and h_{Level+1}:
  If bucket to insert into is full:
    Add overflow page and insert data entry.
    Split Next bucket and increment Next.
Since buckets are split round-robin, long overflow chains don’t develop!
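A compact Python sketch of the insert procedure. It is a simplification under stated assumptions: each bucket is one in-memory list, entries beyond `capacity` are treated as living on overflow pages, the stored numbers are the hash values themselves, and the class and attribute names are hypothetical.

```python
class LinearHashFile:
    def __init__(self, n0=4, capacity=4):
        self.n0 = n0                    # N: initial number of buckets
        self.capacity = capacity        # entries per primary page
        self.level = 0
        self.next = 0
        self.buckets = [[] for _ in range(n0)]

    def _address(self, h):
        n_level = self.n0 * 2 ** self.level
        b = h % n_level                            # h_Level
        return h % (2 * n_level) if b < self.next else b

    def insert(self, h):
        bucket = self.buckets[self._address(h)]
        was_full = len(bucket) >= self.capacity
        bucket.append(h)                # past capacity = on an overflow page
        if was_full:
            self._split_next()

    def _split_next(self):
        n_level = self.n0 * 2 ** self.level
        old = self.buckets[self.next]
        # The split image is appended, so it gets number Next + N_Level.
        self.buckets.append([e for e in old if e % (2 * n_level) != self.next])
        self.buckets[self.next] = [e for e in old if e % (2 * n_level) == self.next]
        self.next += 1
        if self.next == n_level:        # every initial bucket split: new round
            self.level += 1
            self.next = 0

    def __contains__(self, h):
        return h in self.buckets[self._address(h)]


# Usage with hypothetical entries: four entries fill bucket 3, then a fifth
# overflows it and triggers a split of bucket 0 (the Next bucket).
lh = LinearHashFile(n0=4, capacity=4)
for h in [32, 44, 36, 31, 35, 7, 11]:
    lh.insert(h)
lh.insert(43)
```

Note that the bucket that is split (bucket 0) is not the bucket that overflowed (bucket 3); the overflowing bucket keeps its overflow page until Next reaches it.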
Example of Linear Hashing
Level=0, N=4

Before the insert (Level=0, Next=0):
  h_1   h_0   PRIMARY PAGES
  000   00    32* 44* 36*
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*
(5* is the data entry r with h(r)=5, stored in its primary bucket page.)

Now adding 43 (Level=0, Next=1):
  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25* 5*
  010   10    14* 18* 10* 30*
  011   11    31* 35* 7* 11*      43*
  100   00    44* 36*
Example: End of a Round
Level=0, Next=3:
  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25*
  010   10    66* 18* 10* 34*
  011   11    31* 35* 7* 11*      43*
  100   00    44* 36*
  101   01    5* 37* 29*
  110   10    14* 30* 22*

Level=1, Next=0:
  h_1   h_0   PRIMARY PAGES       OVERFLOW PAGES
  000   00    32*
  001   01    9* 25*
  010   10    66* 18* 10* 34*     50*
  011   11    43* 35* 11*
  100   00    44* 36*
  101   01    5* 37* 29*
  110   10    14* 30* 22*
  111   11    31* 7*
LH Described as a Variant of EH

The two schemes are actually quite similar:
  Begin with an EH index where directory has N elements.
  Use overflow pages, split buckets round-robin.
  So, directory can double gradually. Also, primary bucket pages are created in order.
  If they are allocated in sequence too (so that finding i’th is easy), we actually don’t need a directory! Voila, LH.
Summary



Hash-based indexes: best for equality searches, cannot
support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a
full bucket when a new data entry is to be added to it.
(Duplicates may require overflow pages.)


Directory to keep track of buckets, doubles periodically.
Can get large with skewed data; additional I/O if this does
not fit in main memory.
Summary (Contd.)

Linear Hashing avoids directory by splitting buckets round-robin, and using overflow pages.
  Overflow chains not likely to be long.
  Duplicates handled easily.
  Space utilization could be lower than Extendible Hashing, since splits not concentrated on 'dense' data areas.
For hash-based indexes, a skewed data distribution is one in which the hash values of data entries are not uniformly distributed!


Quiz 5 this Wed.
Homework due 4/25