CS 245: Database System Principles
Notes 5: Hashing and More
Hector Garcia-Molina

Hashing
• key → h(key) → bucket holding <key>
• Buckets (typically 1 disk block)

Two alternatives
(1) key → h(key) → records
(2) key → h(key) → index → record
• Alt (2) for “secondary” search key

Example hash function
• Key = ‘x1 x2 … xn’ n byte character string
• Have b buckets
• h: add x1 + x2 + … + xn
  - compute sum modulo b
• This may not be the best function …
• Read Knuth Vol. 3 if you really need to select a good function.
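The example hash function can be sketched in a few lines of Python. This is only an illustration of "add the bytes, take the sum modulo b"; the function name `h` with an explicit `b` parameter and the UTF-8 encoding of the key are assumptions, not part of the slides.

```python
def h(key: str, b: int) -> int:
    """Add the bytes x1 + x2 + ... + xn of the key, then reduce modulo b."""
    return sum(key.encode("utf-8")) % b


# 'abc' has byte values 97 + 98 + 99 = 294, and 294 mod 4 = 2.
bucket_abc = h("abc", 4)

# Any permutation of the same bytes gives the same sum, so anagrams
# always collide: one reason this may not be the best function.
bucket_cab = h("cab", 4)
print(bucket_abc, bucket_cab)
```

Because addition ignores byte order, keys that differ only in the order of their characters always share a bucket.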
Good hash function:
• Expected number of keys/bucket is the same for all buckets

Within a bucket:
• Do we keep keys sorted?
• Yes, if CPU time critical & Inserts/Deletes not too frequent

Next: example to illustrate inserts, overflows, deletes

EXAMPLE 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0

  bucket 0: d
  bucket 1: a, c
  bucket 2: b
  bucket 3: (empty)

Next insert: h(e) = 1
EXAMPLE 2 records/bucket (continued)
INSERT: h(e) = 1
Bucket 1 is already full, so e goes into an overflow block chained to bucket 1.

  bucket 0: d
  bucket 1: a, c → overflow: e
  bucket 2: b
  bucket 3: (empty)

EXAMPLE: deletion
Delete: e, f, c

  Before:
  bucket 0: a
  bucket 1: b, c → overflow: d
  bucket 2: e
  bucket 3: f, g

  After:
  bucket 0: a
  bucket 1: b, d   (d takes the place of c)
  bucket 2: (empty)
  bucket 3: g      (maybe move “g” up)

Rule of thumb:
• Try to keep space utilization between 50% and 80%
  Utilization = (# keys used) / (total # keys that fit)
• If < 50%, wasting space
• If > 80%, overflows significant
  (depends on how good hash function is & on # keys/bucket)
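The insert, overflow, and delete behavior in the example, together with the utilization formula, can be sketched as follows. The class `StaticHashFile`, its method names, and the choice to repack a bucket's chain on deletion are assumptions made for illustration; the slides only say that records may move up. The total number of slots is counted over all allocated blocks, including overflow blocks, which is also an assumption.

```python
BLOCK_CAPACITY = 2  # records per block, as in the slide example


class StaticHashFile:
    """Hypothetical sketch of a hash file with overflow chains."""

    def __init__(self, num_buckets, h):
        self.h = h
        # Each bucket is a chain of blocks; chain[0] is the primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if len(block) < BLOCK_CAPACITY:
                block.append(key)
                return
        chain.append([key])  # every block is full: add an overflow block

    def delete(self, key):
        chain = self.buckets[self.h(key)]
        keys = [k for block in chain for k in block]
        if key not in keys:
            return False
        keys.remove(key)
        # Repack so that records in overflow blocks move up.
        chain[:] = [keys[n:n + BLOCK_CAPACITY]
                    for n in range(0, len(keys), BLOCK_CAPACITY)] or [[]]
        return True

    def utilization(self):
        used = sum(len(block) for chain in self.buckets for block in chain)
        total = sum(len(chain) for chain in self.buckets) * BLOCK_CAPACITY
        return used / total


# Hash values taken from the insert example above.
slide_h = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
table = StaticHashFile(4, slide_h.__getitem__)
for key in "abcde":
    table.insert(key)
after_inserts = [block[:] for block in table.buckets[1]]  # a, c plus overflow e
table.delete("c")                                         # e moves up next to a
print(after_inserts, table.buckets[1], table.utilization())
```

Repacking on every delete keeps chains short but costs extra writes; leaving holes and reorganizing later is the other obvious choice.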
How do we cope with growth?
• Overflows and reorganizations
• Dynamic hashing
  - Extensible
  - Linear

Extensible hashing: two ideas
(a) Use i of b bits output by hash function
    h(K) → 00110101 (b bits); use i of them (the leading bits in the example below); i grows over time
(b) Use directory
    h(K)[i] → directory entry → to bucket

Example: h(k) is 4 bits; 2 keys/bucket
  i = 1
  0 → [0001]        (bucket uses 1 bit)
  1 → [1001, 1100]  (bucket uses 1 bit)
Insert 1010
Example continued: Insert 1010
Bucket 1 is full, so it splits and the directory doubles (new directory, i = 2).

  i = 2
  00, 01 → [0001]        (1 bit)
  10     → [1001, 1010]  (2 bits)
  11     → [1100]        (2 bits)

Example continued: Insert 0111, 0000
0111 fits next to 0001; 0000 then splits that bucket without doubling the directory.

  i = 2
  00 → [0000, 0001]  (2 bits)
  01 → [0111]        (2 bits)
  10 → [1001, 1010]  (2 bits)
  11 → [1100]        (2 bits)

Example continued: Insert 1001
Bucket 10 is full and already uses 2 bits, so the directory must double again (i = 3).
Example continued: after inserting 1001

  i = 3
  000, 001 → [0000, 0001]  (2 bits)
  010, 011 → [0111]        (2 bits)
  100      → [1001, 1001]  (3 bits)
  101      → [1010]        (3 bits)
  110, 111 → [1100]        (2 bits)

Extensible hashing: deletion
• No merging of blocks
• Merge blocks and cut directory if possible
  (Reverse insert procedure)

Deletion example:
• Run thru insert example in reverse!

Note: Still need overflow chains
• Example: many records with duplicate keys
  bucket (1 bit): [1101, 1100]; insert 1100
  if we split: the copies of 1100 always land in the same bucket, so splitting alone cannot separate duplicates

Solution: overflow chains
  bucket (1 bit): [1101, 1100]; insert 1100
  add overflow block: [1101, 1100] → [1100]
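The insert procedure of the extensible hashing example can be sketched as below. This is a minimal illustration, not a definitive implementation: the class and attribute names are hypothetical, keys are passed in as their hash values, the directory is indexed by the leading i bits (as in the example), and a bucket that cannot split further raises an error where a real file would add an overflow block.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of leading hash bits this bucket uses
        self.keys = []


class ExtensibleHash:
    """Hypothetical sketch: directory indexed by the leading i of b bits."""

    def __init__(self, b=4, capacity=2):
        self.b = b
        self.capacity = capacity
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _slot(self, hv):
        return hv >> (self.b - self.i)  # leading i bits of the hash value

    def insert(self, hv):
        while True:
            bucket = self.directory[self._slot(hv)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(hv)
                return
            if bucket.depth == self.b:
                # All b bits are used up; an overflow chain would be needed.
                raise OverflowError("bucket cannot be split any further")
            if bucket.depth == self.i:
                # Double the directory: every old entry appears twice.
                self.directory = [bk for bk in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)

    def _split(self, old):
        depth = old.depth + 1
        zero, one = Bucket(depth), Bucket(depth)
        for k in old.keys:
            bit = (k >> (self.b - depth)) & 1  # the newly used hash bit
            (one if bit else zero).keys.append(k)
        for slot, bk in enumerate(self.directory):
            if bk is old:
                bit = (slot >> (self.i - depth)) & 1
                self.directory[slot] = one if bit else zero


# Replay the insert sequence from the slides.
eh = ExtensibleHash(b=4, capacity=2)
for hv in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    eh.insert(hv)
for slot, bk in enumerate(eh.directory):
    print(format(slot, "03b"), [format(k, "04b") for k in bk.keys], bk.depth)
```

Directory entries that share a bucket (for example 000 and 001) point to the same `Bucket` object, which is what lets a later split proceed without doubling the directory.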
Summary: Extensible hashing
+ Can handle growing files
  • with less wasted space
  • with no full reorganizations
- Indirection
  (Not bad if directory in memory)
- Directory doubles in size
  (Now it fits, now it does not)

Linear hashing
• Another dynamic hashing scheme
Two ideas:
(a) Use i low order bits of hash
    01110101 (b bits); the i low-order bits are used, and i grows
(b) File grows linearly

Example b=4 bits, i=2, 2 keys/bucket
  bucket 00: 0000, 1010
  bucket 01: 0101, 1111
  buckets 10, 11: future growth buckets
  m = 01 (max used block)

Rule
  If h(k)[i] ≤ m, then look at bucket h(k)[i]
  else, look at bucket h(k)[i] - 2^(i-1)

• insert 0101
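The lookup rule for linear hashing translates almost directly into code. In this sketch the function name `bucket_for` and the convention of passing the hash value as an integer are assumptions; `i` and `m` have the meaning used on the slides.

```python
def bucket_for(hv: int, i: int, m: int) -> int:
    """Return the bucket to inspect for hash value hv.

    i is the number of low-order bits in use and m is the largest
    bucket number currently in use.
    """
    low = hv & ((1 << i) - 1)        # h(k)[i]: the i low-order bits
    if low <= m:
        return low
    return low - (1 << (i - 1))      # bucket not created yet: drop the top bit


# State from the example: i = 2, m = 01.
print(bucket_for(0b0000, 2, 0b01))   # 0
print(bucket_for(0b1010, 2, 0b01))   # 0, because bucket 10 does not exist yet
print(bucket_for(0b1111, 2, 0b01))   # 1, because bucket 11 does not exist yet
```

Subtracting 2^(i-1) clears the highest of the i bits, which is exactly the bucket that the key would have used before i grew.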
Example b=4 bits, i=2, 2 keys/bucket (continued)
• insert 0101
• can have overflow chains!

  bucket 00: 0000, 1010
  bucket 01: 0101, 1111 → overflow: 0101
  buckets 10, 11: future growth buckets
  m = 01 (max used block)

Note
• In textbook, n is used instead of m
• n = m + 1 (here n = 10)

m = 01 → 10:
  bucket 00: 0000
  bucket 01: 0101, 1111 → overflow: 0101
  bucket 10: 1010   (moved from bucket 00)
  bucket 11: future growth bucket

m = 10 → 11:
  bucket 00: 0000
  bucket 01: 0101, 0101
  bucket 10: 1010
  bucket 11: 1111   (moved from bucket 01)
Example Continued: How to grow beyond this?

  i = 2
  bucket 00: 0000
  bucket 01: 0101, 0101
  bucket 10: 1010
  bucket 11: 1111
  m = 11 (max used block)

Every bucket addressable with 2 bits is now in use, so i grows from 2 to 3 and each bucket number gains a leading bit:

  i = 3
  bucket 000: 0000
  bucket 001: 0101, 0101
  bucket 010: 1010
  bucket 011: 1111
  buckets 100, 101, 110, 111, ...: future growth buckets
  m = 011 → 100 → 101

  After m reaches 101:
  bucket 001: (empty)
  bucket 101: 0101, 0101   (moved from bucket 001)
When do we expand file?
• Keep track of: U = (# used slots) / (total # of slots)
• If U > threshold then increase m
  (and maybe i)

Summary: Linear Hashing
+ Can handle growing files
  • with less wasted space
  • with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains
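Putting the lookup rule and the expansion test together gives the following sketch of a linear hash file. It is an illustration under stated assumptions: the class `LinearHash` and its method names are hypothetical, keys are passed as hash values, a bucket is a plain list whose entries beyond `capacity` stand for an overflow chain, and U counts only primary slots in its denominator.

```python
class LinearHash:
    """Hypothetical sketch of linear hashing with a utilization threshold."""

    def __init__(self, capacity=2, threshold=0.8):
        self.capacity = capacity
        self.threshold = threshold
        self.i = 1                 # low-order bits in use
        self.m = 1                 # largest bucket number in use
        self.buckets = [[], []]    # entries beyond capacity model overflow
        self.count = 0

    def _addr(self, hv):
        low = hv & ((1 << self.i) - 1)
        return low if low <= self.m else low - (1 << (self.i - 1))

    def utilization(self):
        return self.count / (self.capacity * (self.m + 1))

    def insert(self, hv):
        self.buckets[self._addr(hv)].append(hv)
        self.count += 1
        if self.utilization() > self.threshold:
            self.grow()

    def grow(self):
        """Increase m by one (and maybe i), then split the matching bucket."""
        self.m += 1
        if self.m >= (1 << self.i):
            self.i += 1
        self.buckets.append([])
        source = self.m - (1 << (self.i - 1))   # bucket that new bucket m splits
        old, self.buckets[source] = self.buckets[source], []
        for hv in old:
            self.buckets[self._addr(hv)].append(hv)


# Replay the slide example; a threshold of 1.0 is chosen only so that the
# first expansion happens on the fifth insert.
lh = LinearHash(capacity=2, threshold=1.0)
for hv in (0b0000, 0b1010, 0b0101, 0b1111, 0b0101):
    lh.insert(hv)
after_first_growth = [bucket[:] for bucket in lh.buckets]   # m = 10
lh.grow()                                                   # m = 11
print(lh.i, format(lh.m, "b"), lh.buckets)
```

Note that the bucket being split is chosen by m, not by which bucket overflowed, which is why overflow chains can persist for a while.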
Example: BAD CASE
(figure: a file in which some buckets are “Very full” and others are “Very empty”; labels: “Need to move m here…” and “Would waste space...”)

Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear

Next:
• Indexing vs Hashing
• Index definition in SQL
• Multiple key access

Indexing vs Hashing
• Hashing good for probes given key
  e.g.,
    SELECT …
    FROM R
    WHERE R.A = 5
Indexing vs Hashing (continued)
• INDEXING (Including B Trees) good for Range Searches:
  e.g.,
    SELECT …
    FROM R
    WHERE R.A > 5

Index definition in SQL
• Create index name on rel (attr)
• Create unique index name on rel (attr)
  (defines candidate key)
• Drop INDEX name

Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …)
OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)
... at least in SQL...

Note: ATTRIBUTE LIST → MULTIKEY INDEX (next)
e.g., CREATE INDEX foo ON R(A,B,C)
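The index statements above can be tried out with Python's built-in `sqlite3` module. SQLite is used here only because it ships with Python; the slides do not name a system. The table contents, the index name `r_a`, and the column types are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)",
                 [(a, a % 3, a * 10) for a in range(10)])

# Attribute list: a multi-key index, as in the slide's example.
conn.execute("CREATE INDEX foo ON R(A, B, C)")
# A unique index makes A behave as a candidate key.
conn.execute("CREATE UNIQUE INDEX r_a ON R(A)")

# A probe given a key, and a range search.
probe = conn.execute("SELECT A FROM R WHERE A = 5").fetchall()
in_range = conn.execute("SELECT A FROM R WHERE A > 5 ORDER BY A").fetchall()

# The unique index rejects a second row with A = 5.
try:
    conn.execute("INSERT INTO R VALUES (5, 0, 0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX foo")
remaining = [row[1] for row in conn.execute("PRAGMA index_list(R)")]
print(probe, in_range, duplicate_rejected, remaining)
```

As the note says, nothing in these statements chooses the index structure or its parameters; that decision is left to the system.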
Multi-key Index
Motivation: Find records where
  DEPT = “Toy” AND SAL > 50k

Strategy I:
• Use one index, say Dept.
• Get all Dept = “Toy” records and check their salary
  (figure: index I1 on Dept pointing to the records)
Strategy II:
• Use 2 Indexes; Manipulate Pointers
  (figure: index I1 yields the pointers for Toy, index I2 yields the pointers for Sal > 50k)

Strategy III:
• Multiple Key Index
  One idea: (figure: index I1 whose entries lead to further indexes I2, I3)

Example
(figure: a Dept Index with entries Art, Sales, Toy; each entry leads to a Salary Index.
 Salary Index entries shown: 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k.
 Example Record: Name=Joe, DEPT=Sales, SAL=15k)

For which queries is this index good?
• Find RECs Dept = “Sales” ∧ SAL = 20k
• Find RECs Dept = “Sales” ∧ SAL > 20k
• Find RECs Dept = “Sales”
• Find RECs SAL = 20k
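One way to see which of these queries the multiple key index serves well is to model it as a Dept index whose entries lead to sorted salary lists. Everything in this sketch is hypothetical apart from the record Joe/Sales/15k and the salary values from the figure: the assignment of salary lists to departments, the other record names, and the function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Dept index -> (sorted salaries in k, records in the same order).
index = {
    "Art":   ([10, 15, 17, 21], ["Ann", "Bob", "Cal", "Dee"]),
    "Sales": ([12, 15, 15, 19], ["Eve", "Joe", "Kim", "Lou"]),
}


def dept_sal_eq(dept, sal):
    """Dept = d AND SAL = s: one probe per level."""
    keys, recs = index.get(dept, ([], []))
    return recs[bisect_left(keys, sal):bisect_right(keys, sal)]


def dept_sal_gt(dept, sal):
    """Dept = d AND SAL > s: one probe, then a scan of the sorted tail."""
    keys, recs = index.get(dept, ([], []))
    return recs[bisect_right(keys, sal):]


def dept_only(dept):
    """Dept = d: the first level alone finds the whole salary index."""
    return list(index.get(dept, ([], []))[1])


def sal_only(sal):
    """SAL = s: no help from the first level; every salary index is probed."""
    return [r for dept in index for r in dept_sal_eq(dept, sal)]


print(dept_sal_eq("Sales", 15), dept_sal_gt("Sales", 15), sal_only(15))
```

The first three queries follow a single path from the Dept index, while the salary-only query has to visit every department, which is why the order of attributes in a multiple key index matters.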
Interesting application:
• Geographic Data
  DATA:
    <X1,Y1, Attributes>
    <X2,Y2, Attributes>
    ...
  (figure: the data points plotted on x and y axes)

Queries:
• What city is at <Xi,Yi>?
• What is within 5 miles from <Xi,Yi>?
• Which is closest point to <Xi,Yi>?
Example
(figure: points a through o plotted in the x-y plane, with the plane cut into regions by split lines; values shown in the figure: 5, 10, 15, 20, 25, 30, 35, 40. The matching tree holds the split values in its interior nodes and the points in its leaves: a b, c, d e, f, g, h i, j k, l m, n o.)
• Search points near f
• Search points near b
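A tree of this kind can be sketched as a small kd-tree with a rectangular range search. This is an assumption-heavy illustration: the coordinates below are invented except for b = <7,24> and i = <12,38>, which appear in the query examples of these notes; splitting at the median and alternating between x and y are also assumptions, since the figure's actual split values cannot be recovered.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree over (name, (x, y)) pairs, alternating x and y splits."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[1][axis])
    mid = len(pts) // 2
    return {
        "axis": axis,
        "split": pts[mid][1][axis],   # low side <= split <= high side
        "low": build(pts[:mid], depth + 1, leaf_size),
        "high": build(pts[mid:], depth + 1, leaf_size),
    }


def range_query(node, lo, hi):
    """Names of points with lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]."""
    if "leaf" in node:
        return [name for name, (x, y) in node["leaf"]
                if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]]
    axis, split = node["axis"], node["split"]
    found = []
    if lo[axis] <= split:             # the box reaches into the low side
        found += range_query(node["low"], lo, hi)
    if hi[axis] >= split:             # the box reaches into the high side
        found += range_query(node["high"], lo, hi)
    return found


# Hypothetical coordinates, except b and i.
points = [("a", (3, 5)), ("b", (7, 24)), ("c", (18, 8)), ("d", (4, 30)),
          ("e", (9, 35)), ("i", (12, 38)), ("f", (22, 28)), ("g", (30, 12))]
tree = build(points)

near_b = range_query(tree, (2, 19), (12, 29))            # box of +/- 5 around b
high_y = sorted(range_query(tree, (0, 21), (100, 100)))  # points with y >= 21
print(near_b, high_y)
```

A search "near" a point becomes a box query here; subtrees whose region cannot intersect the box are never visited.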
Queries
• Find points with Yi > 20
• Find points with Xi < 5
• Find points “close” to i = <12,38>
• Find points “close” to b = <7,24>

• Many types of geographic index structures have been suggested
  • kd-Trees (very similar to what we described here)
  • Quad Trees
  • R Trees
  • ...

Two more types of multi key indexes
• Grid
• Partitioned hash

Grid Index
(figure: a grid with the Key 1 values V1, V2, …, Vn along one axis and the Key 2 values X1, X2, …, Xn along the other; one cell is labeled “To records with key1=V3, key2=X2”)

CLAIM
• Can quickly find records with
  - key 1 = Vi ∧ Key 2 = Xj
  - key 1 = Vi
  - key 2 = Xj
• And also ranges….
  - E.g., key 1 ≥ Vi ∧ key 2 < Xj
• How do we find entry i,j in linear structure?
  max number of i values N = 4

    i, j    position
    0, 0    S+0
    0, 1    S+1
    0, 2    S+2
    0, 3    S+3
    1, 0    S+4
    ...
    2, 1    S+9
    ...

  pos(i, j) = S + iN + j

  Issue: Cells must be same size, and N must be constant!
  Issue: Some cells may overflow, some may be sparse...

Solution: Use Indirection
(figure: a grid with rows V1, V2, V3, V4 and columns X1, X2, X3 whose cells point to Buckets)
*Grid only contains pointers to buckets

With indirection:
• Grid can be regular without wasting space
• We do have price of indirection

Can also index grid on value ranges
(figure: a Grid with Salary ranges 0-20K (1), 20K-50K (2), 50K- (3) on one axis and a Linear Scale 1, 2, 3 for Toy, Sales, Personnel on the other)

Grid files
+ Good for multiple-key search
- Space, management overhead (nothing is free)
- Need partitioning ranges that evenly split keys
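The position formula and the idea of a grid that only holds pointers to buckets can be sketched together. This is an illustration under assumptions: the bucket numbering in `grid`, the record names other than Joe, and the decision to put a salary of exactly 20K into the 20K-50K range are all hypothetical.

```python
from bisect import bisect_right


def pos(i, j, S, N):
    """Position of entry (i, j) in a linear structure starting at S."""
    return S + i * N + j


DEPTS = ["Toy", "Sales", "Personnel"]   # linear scale for one dimension
SALARY_BOUNDS = [20_000, 50_000]        # ranges 0-20K, 20K-50K, 50K-
N = len(SALARY_BOUNDS) + 1              # cells per row

# With indirection the grid stores only bucket numbers, so sparse
# neighboring cells can share one bucket.
grid = [0, 0, 1,
        2, 3, 3,
        4, 4, 4]
buckets = {b: [] for b in set(grid)}


def cell(dept, salary):
    i = DEPTS.index(dept)
    j = bisect_right(SALARY_BOUNDS, salary)
    return pos(i, j, 0, N)


def insert(name, dept, salary):
    buckets[grid[cell(dept, salary)]].append((name, dept, salary))


def lookup(dept, salary):
    """Follow the grid pointer, then filter the shared bucket."""
    return [r for r in buckets[grid[cell(dept, salary)]]
            if r[1] == dept and r[2] == salary]


insert("Joe", "Sales", 15_000)
insert("Ann", "Toy", 10_000)
insert("Bob", "Toy", 30_000)   # different cell than Ann, same bucket
print(pos(2, 1, 0, 4), cell("Sales", 15_000), lookup("Toy", 30_000))
```

The grid stays a regular array of equal-size cells, and the uneven record counts are absorbed by the buckets, at the price of one extra pointer dereference per lookup.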
Partitioned hash function
Idea:
  Key1 → h1 → 010110
  Key2 → h2 → 1110010
  The bucket address is the concatenation of the two: 010110 1110010

EX:
  h1(toy)   = 0        h2(10k) = 01
  h1(sales) = 1        h2(20k) = 11
  h1(art)   = 1        h2(30k) = 01
  ...                  h2(40k) = 00
                       ...
  Buckets: 000, 001, 010, 011, 100, 101, 110, 111

Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>
  bucket 001: <Fred>          (h1 = 0, h2 = 01)
  bucket 101: <Joe> <Sally>   (h1 = 1, h2 = 01)

(The query slides use a fuller file containing <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>.)

• Find Emp. with Dept. = Sales ∧ Sal=40k
  h1(sales) = 1 and h2(40k) = 00, so look here: bucket 100
• Find Emp. with Sal=30k
  h2(30k) = 01, so look here: buckets 001 and 101
• Find Emp. with Dept. = Sales
  h1(sales) = 1, so look here: buckets 100, 101, 110, 111
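The bucket arithmetic of the partitioned hash example can be checked with a short sketch. The hash values are the ones listed in the example; the table-driven `H1` and `H2`, the function names, and the convention that h1 supplies the leading bit are assumptions chosen to match the insert shown on the slides.

```python
H1 = {"toy": 0b0, "sales": 0b1, "art": 0b1}                  # 1 bit
H2 = {"10k": 0b01, "20k": 0b11, "30k": 0b01, "40k": 0b00}    # 2 bits
H1_BITS, H2_BITS = 1, 2


def bucket(dept, sal):
    """Concatenate h1(dept) and h2(sal) into one bucket number."""
    return (H1[dept] << H2_BITS) | H2[sal]


def candidates(dept=None, sal=None):
    """Buckets to examine when only some of the keys are given."""
    h1_values = [H1[dept]] if dept is not None else range(1 << H1_BITS)
    h2_values = [H2[sal]] if sal is not None else range(1 << H2_BITS)
    return sorted((a << H2_BITS) | b for a in h1_values for b in h2_values)


def show(bucket_numbers):
    return [format(n, "03b") for n in bucket_numbers]


print(show([bucket("toy", "10k")]))          # Fred
print(show(candidates("sales", "40k")))      # both keys: a single bucket
print(show(candidates(sal="30k")))           # salary only: two buckets
print(show(candidates(dept="sales")))        # dept only: four buckets
```

Each key that is missing from a query doubles the number of candidate buckets once per bit it contributes, so giving more bits to one attribute favors queries on that attribute.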
Summary
Post hashing discussion:
- Indexing vs. Hashing
- SQL Index Definition
- Multiple Key Access
  - Multi Key Index
    Variations: Grid, Geo Data
  - Partitioned Hash

Reading Chapter 5
• Skim the following sections:
  - Sections 14.3.6, 14.3.7, 14.3.8
    [Second Ed: 14.6.6, 14.6.7, 14.6.8]
  - Sections 14.4.2, 14.4.3, 14.4.4
    [Second Ed: 14.7.2, 14.7.3, 14.7.4]
• Read the rest

The BIG picture….
• Chapters 11 & 12 [13]: Storage, records, blocks...
• Chapters 13 & 14 [14]: Access Mechanisms
  - Indexes
  - B trees
  - Hashing
  - Multi key
• Chapters 15 & 16 [15, 16]: Query Processing (NEXT)