Hashing

CPSC 461
Instructor: Marina Gavrilova
Goal
 The goal of today’s lecture is to introduce the concept of a
“hash index”. Hashing is an alternative to tree
structures that allows near-constant access time to ANY
record in a very large database.
Presentation Outline
 Introduction
 Did you know that?
 Static hashing definition and methods
 Good hash function
 Collision resolution (open addressing, chaining)
 Extendible hashing
 Linear hashing
 Useful links and current market trends
 Summary
Introduction to Hashing
Approaches to Search
1. Sequential and list methods (lists, tables, arrays)
2. Direct access by key value (hashing)
3. Tree indexing methods
Introduction to Hashing
Definition
Hashing is the process of mapping a key value to a position in a table.
A hash function maps key values to positions.
A hash table is an array that holds the records.
Searching in a hash table can be done in O(1) time regardless of the hash table size.
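To make the definition concrete, here is a minimal sketch in Python (the table size and the mod-based hash function are illustrative assumptions; collisions are ignored here and handled in a later section):

```python
TABLE_SIZE = 11
table = [None] * TABLE_SIZE            # the hash table: an array of records

def h(key: int) -> int:
    """A hash function: maps a key value to a position in the table."""
    return key % TABLE_SIZE

table[h(42)] = "record for key 42"     # store: one O(1) array access
print(table[h(42)])                    # search: the same O(1) access
```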
Introduction to Hashing
Example of Usefulness
 10 stock details, 10 table positions.
 Stock numbers are between 0 and 1000.
Using the whole stock numbers would require 1000 storage locations, and this is an obvious waste of memory.
Introduction to Hashing
Applications of Hashing
 Compilers use hash tables to keep track of declared variables.
 A hash table can be used for on-line spelling checkers: if misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time.
 Game-playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again.
 Hash functions can be used to quickly check for inequality: if two elements hash to different values, they must be different.
 Storing sparse data.
Did you know that?
 Cryptography was once known only to key people in the
National Security Agency and a few academics.
 Until 1996, it was illegal to export strong cryptography from the
United States.
 Fast forward to 2006, and the Payment Card Industry Data Security
Standard (PCI DSS) requires merchants to encrypt cardholder
information. Visa and MasterCard can levy fines of up to $500,000
for not complying!
 Among methods recommended are:
 Strong one-way hash functions (hashed indexes)
 Truncation
 Index tokens and pads (pads must be securely stored)
 Strong cryptography
[Hashing for fun and profit: Demystifying encryption for PCI DSS
Roger Nebel]
Did you know that?
 The Transport Layer Security (TLS) protocol on networks uses the Rivest,
Shamir, and Adleman (RSA) public key algorithm for the TLS key
exchange and authentication, and the Secure Hash Algorithm 1
(SHA-1) for hashing.
[System cryptography: Use FIPS compliant algorithms for encryption,
hashing, and signing, Microsoft TechNews, 2005]
Did you know that?
 Spatial hashing studies performed at Microsoft Research, Redmond
combine hashing with computer graphics to create a new set of tools for
rendering, mesh reconstruction, and collision optimization (see the public
poster by Hugues Hoppe, summarized on the next slide).
Sylvain Lefebvre, Hugues Hoppe (Microsoft Research)
Perfect Spatial Hashing
[Poster summary: the authors design a perfect hash function to losslessly pack sparse spatial data while retaining efficient random access. The hash is simply h(p) = p + Φ[p mod r] (modulo the table sizes), where Φ is a small offset table: a point p in the 2D or 3D domain is mapped through the offset table into a slot s = h(p) of the hash table H. Highlights: a perfect hash on multidimensional data; no collisions, hence ideal for the GPU; a single lookup into a small offset table; offsets of only ~4 bits per defined data item; access in ~4 GPU instructions; optimized spatial coherence. Applications shown include 3D textures, vector images (1024², 500KB, 700fps), sprite maps, 3D painting (1024³, 46MB, 530fps), simulation, collision detection (1024³, 12MB, 140fps), and alpha compression (0.9 bits/pixel, 800fps).]
Did you know that?
 Combining hashing and encryption provides a much
stronger tool for database and password protection.
 http://msdn.microsoft.com/msdnmag/issues/03/08/SecurityBriefs/
[Security Briefs, MSDN Magazine]
Hash Functions
Hash Functions
 Hashing is the process of chopping up the key and mixing it up
in various ways in order to obtain an index which will be
uniformly distributed over the range of indices; hence the name ‘hashing’.
There are several common ways of doing this:
 Truncation
 Folding
 Modular Arithmetic
Hash Functions
Hash Functions – Truncation
 Truncation is a method in which parts of the key are ignored and
the remaining portion becomes the index.
- We take the given key and produce a hash location by
taking portions of the key (truncating the key).
 Example – if a hash table can hold 1000 entries and an 8-digit
number is used as the key, the 3rd, 5th and 7th digits starting
from the left of the key could be used to produce the index.
- e.g. the key is 62538194 and the hash location is 589.
 Advantage: simple and easy to implement.
 Problems: clustering and repetition.
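A minimal sketch of this truncation scheme in Python (assuming 8-digit integer keys and a 1000-entry table):

```python
def truncation_hash(key: int) -> int:
    """Hash an 8-digit key by keeping its 3rd, 5th and 7th digits
    (counting from the left), e.g. 62538194 -> 589."""
    digits = f"{key:08d}"              # fixed-width string of 8 digits
    return int(digits[2] + digits[4] + digits[6])

assert truncation_hash(62538194) == 589
```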
Hash Functions
Hash Functions – Folding
 Folding breaks the key into several parts and combines the parts to form
an index.
- The parts may be recombined by addition, subtraction, or multiplication, and may have
to be truncated as well.
- Such a process is usually better than truncation by itself since it produces a better
distribution: all of the digits in the key are considered.
- Using the key 62538194 and breaking it into 3 numbers using the first three, the middle three
and the last two digits produces 625, 381 and 94. These could be added to get 1100, which
could be truncated to 100.
They could also be multiplied together and then three digits chosen
from the middle of the number produced.
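A minimal sketch of the folding scheme in Python (same 8-digit key, parts of 3, 3 and 2 digits, sum truncated to its last 3 digits):

```python
def folding_hash(key: int) -> int:
    """Fold an 8-digit key into parts of 3, 3 and 2 digits, add them,
    and truncate the sum to the last 3 digits, e.g. 62538194 -> 100."""
    digits = f"{key:08d}"
    parts = int(digits[0:3]), int(digits[3:6]), int(digits[6:8])
    return sum(parts) % 1000           # truncate to 3 digits

assert folding_hash(62538194) == 100   # 625 + 381 + 94 = 1100 -> 100
```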
Hash Functions
Hash Functions – Modular Arithmetic
 The modular arithmetic process essentially ensures that the index
produced is within a specified range. For this, the key is converted to
an integer, which is divided by the range of the index, with the
remainder becoming the hash value.
Uses: biometrics, encryption, compression
- If the value of the modulus is a prime number, the distribution of
indices obtained is quite uniform.
- A table whose size is a number with many factors provides the
possibility of many indices which are the same, so the size should be a prime
number.
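A minimal sketch in Python (the prime table size 101 is an illustrative assumption):

```python
def mod_hash(key: int, table_size: int = 101) -> int:
    """Map any integer key into [0, table_size); a prime table size
    gives a more uniform spread of indices."""
    return key % table_size

assert mod_hash(62538194) == 4         # 62538194 mod 101 = 4
```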
Hash Functions
Good Hash Functions
 Hash functions which use all of the key are almost always better
than those which use only some of the key.
- When only portions are used, information is lost and therefore the
number of possibilities for the final key is reduced.
- If we deal with the integer in its binary form, then the number of
pieces that can be manipulated by the hash function is greatly
increased.
Collision Resolution
Collision
 No matter what hash function is used, the possibility
exists that it will produce an index which duplicates an
index already in use. This is a collision.
Collision resolution strategies:
- Open addressing: store the key/entry in a different position
- Chaining: chain together several keys/entries in each position
Collision Resolution
Collision - Example
- Hash table size: 11
- Hash function: key MOD hash size
[The slide shows the resulting positions in the hash table for a set of keys; some collisions occur with this hash function.]
Collision Resolution
Collision Resolution – Open Addressing
 Resolving collisions by open addressing means taking the next open
space, as determined by rehashing the key according to some algorithm.
 Two main open addressing collision resolution techniques:
- Linear probing: increase the position by 1 each time [MOD table size!]
- Quadratic probing: to the original position, add 1, 4, 9, 16, ...
In some cases a key-dependent increment technique is also used.
Probing: if the table position given by the hashed key is already
occupied, increase the position by some amount until an empty
position is found.
Collision Resolution
Collision Resolution – Open Addressing
Linear Probing
new position = (current position + 1) MOD hash size
Example:
[The slide shows the hash table before and after linear probing.]
Problem – clustering occurs; that is, the used spaces tend to
appear in groups which tend to grow, and this increases
the search time to reach an open space.
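A minimal sketch of insertion with linear probing (Python; the size-11 table and the keys are illustrative assumptions):

```python
def insert_linear(table: list, key: int) -> None:
    """Insert key with linear probing: step one slot forward
    (mod table size) until an empty position is found.
    Assumes the table still has at least one free slot."""
    size = len(table)
    pos = key % size
    while table[pos] is not None:      # occupied: try the next slot
        pos = (pos + 1) % size
    table[pos] = key

table = [None] * 11
for k in (26, 37, 15):                 # all three hash to position 4
    insert_linear(table, k)
print(table.index(26), table.index(37), table.index(15))   # 4 5 6
```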
Collision Resolution
Collision Resolution – Open Addressing
 In order to try to avoid clustering, a method which does not look for
the first open space must be used.
 Two common methods are:
- Quadratic probing
- Key-dependent increments
Collision Resolution
Collision Resolution – Open Addressing
Quadratic Probing
new position = (collision position + j^2) MOD hash size, for j = 1, 2, 3, 4, ...
Example:
[The slide shows the hash table before and after quadratic probing.]
Problem – overflow may occur even when there is still space in the
hash table.
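A minimal sketch of quadratic probing (Python; illustrative keys), which also demonstrates the overflow problem by raising an error when the probe sequence never reaches a free slot:

```python
def insert_quadratic(table: list, key: int) -> None:
    """Insert key with quadratic probing: from the collision position,
    try offsets 1, 4, 9, 16, ... (mod table size)."""
    size = len(table)
    home = key % size
    if table[home] is None:
        table[home] = key
        return
    for j in range(1, size):
        pos = (home + j * j) % size
        if table[pos] is None:
            table[pos] = key
            return
    raise RuntimeError("no free slot reached: quadratic probing can "
                       "overflow even when the table still has space")

table = [None] * 11
for k in (26, 37, 48):                 # all hash to 4 in a size-11 table
    insert_quadratic(table, k)
print(table)                           # 26 at 4, 37 at 5 (4+1), 48 at 8 (4+4)
```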
Collision Resolution
Collision Resolution – Open Addressing
Key-dependent Increments
 This technique is used to solve the overflow problem of the
quadratic probing method.
 The increments vary according to the key used for the hash
function.
 If the original hash function results in a good distribution, then key-dependent
functions work quite well for rehashing, and all locations in the
table will eventually be probed for a free position.
 Key-dependent increments are determined by using the key to
calculate a new value and then using this as an increment to determine
successive probes.
Collision Resolution
Collision Resolution – Open Addressing
Key-dependent Increments
For example, since the original hash function was key MOD 11, we might choose a different
function of the key, such as key MOD 7 or (key DIV 11) MOD 11, to find the increment. With the
latter, the rehash becomes:
new position = (current position + (key DIV 11) MOD 11) MOD hash size
Example:
[The slide shows the hash table before and after key-dependent increments.]
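A minimal sketch of key-dependent increments (Python; uses the (key DIV 11) MOD 11 increment from above, forced to be non-zero as discussed on the next slide):

```python
def insert_key_dependent(table: list, key: int) -> None:
    """Insert key with a key-dependent increment: the probe step is
    derived from the key, here (key DIV 11) MOD 11, forced non-zero
    so the probe sequence always advances."""
    size = len(table)                  # 11 is prime, so any non-zero
    pos = key % size                   # step visits every position
    step = (key // 11) % 11 or 1       # never allow a 0 increment
    while table[pos] is not None:
        pos = (pos + step) % size
    table[pos] = key

table = [None] * 11
for k in (26, 37, 48):                 # all hash to 4; steps are 2, 3, 4
    insert_key_dependent(table, k)
print(table.index(37), table.index(48))   # 7 (4+3) and 8 (4+4)
```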
Collision Resolution
Collision Resolution – Open Addressing
Key-dependent Increments
 In all of the closed hash functions it is important to ensure that an
increment of 0 does not arise.
 If the increment is equal to the hash size, the same position will be probed all the
time, so this value cannot be used.
 If we ensure that the hash size is prime, that the divisors for the open and
closed hash are prime, and that the rehash function does not produce a 0 increment,
then this method will usually access all positions, as the linear probe does.
- Using a key-dependent method usually reduces clustering, and therefore
searches for an empty position should not be as long as for the linear method.
Collision Resolution
Collision Resolution – Chaining
 Each table position is a linked list.
 Add the keys and entries anywhere in the
list (the front is easiest).
Advantages over open addressing:
- Simpler insertion and removal
- Array size is not a limitation (but collisions should
still be minimized: make the table size
roughly equal to the expected number of keys
and entries)
Disadvantage:
- Memory overhead is large if entries are small
Collision Resolution
Collision Resolution – Chaining
Example:
[The slide shows the hash table before and after chaining.]
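A minimal sketch of separate chaining (Python; illustrative keys, front insertion):

```python
def insert_chained(table: list, key: int) -> None:
    """Insert key with separate chaining: each slot holds a list,
    and colliding keys are prepended (front insertion is easiest)."""
    pos = key % len(table)
    table[pos].insert(0, key)          # add at the front of the chain

table = [[] for _ in range(11)]
for k in (26, 37, 15):                 # 26, 37 and 15 all hash to 4
    insert_chained(table, k)
print(table[4])                        # [15, 37, 26]
```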
Analysis of Searching using Hash Tables
 In analyzing search efficiency, the average is usually used. Searching with
hash tables is highly dependent on how full the table is, since as the table
approaches a full state, more rehashes are necessary. The proportion of the
table which is full is called the load factor.
- When collisions are resolved using open addressing, the maximum load
factor is 1.
- Using chaining, however, the load factor can be greater than 1: when the
table is full, the linked list attached to each hash address has more than
one element.
- Chaining consistently requires fewer probes than open addressing.
- Traversal of the linked list is slow, however, and if the records are small, it may be just
as well to use open addressing.
- Chaining is best under two conditions: when the number of
unsuccessful searches is large, or when the records are large.
- Open addressing is likely a reasonable choice when most searches are
likely to be successful, the load factor is moderate, and the records are
relatively small.
Analysis of Searching using Hash Tables
Average number of probes for the different collision resolution methods:
[The table of values is not reproduced here; the values are for large hash tables, in this case larger than 500.]
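For reference, the classical large-table approximations for the expected number of probes at load factor α (these are the standard textbook results, not necessarily the exact figures from the original table) are:
- Chaining: ≈ 1 + α/2 probes for a successful search, ≈ α for an unsuccessful one.
- Linear probing: ≈ (1/2)(1 + 1/(1−α)) for a successful search, ≈ (1/2)(1 + 1/(1−α)²) for an unsuccessful one.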
Analysis of Searching using Hash Tables
When are other representations more suitable than hashing?
 Hash tables are very good if there is a need for many searches in a
reasonably stable database.
 Hash tables are not so good if there are many insertions and
deletions, or if table traversals are needed; in this case, trees are
better for indexing.
 Also, hashing is very slow for any operation which requires the
entries to be sorted (e.g. a query to find the minimum key).
Perfect Hashing
 A perfect hashing function maps each key to a unique address. If the
range of potential addresses is the same as the number of keys, the
function is a minimal (in space) perfect hashing function.
 What makes perfect hashing distinctive is that it is a process for
mapping a key space to a unique address in a smaller address space, that is
hash(key) → unique address
 Not only does a perfect hashing function improve retrieval performance,
but a minimal perfect hashing function also provides 100 percent storage
utilization.
Perfect Hashing
Process of creating a perfect hash function
A general form of a perfect hashing function is:
p.hash(key) = (h0(key) + g[h1(key)] + g[h2(key)]) mod N
Cichelli’s Algorithm
 In Cichelli’s algorithm, the component functions are:
h0 = length(key)
h1 = first_character(key)
h2 = second_character(key)
and
g = T(x)
where T is the table of values associated with the individual characters x which
may appear in a key.
The time-consuming part of Cichelli’s algorithm is determining T.
Cichelli’s Algorithm
Table 1: Values associated with the characters of the Pascal
reserved words
[The table itself is not reproduced here.]
 When we apply Cichelli’s perfect hashing function to the keyword
begin using Table 1, the keyword begin is stored in location 33. Since the hash values
run from 2 through 37 for this set of data, the hash function is a minimal
perfect hashing function.
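A sketch of Cichelli’s function in Python. The character table T below is hypothetical (the published table for the Pascal reserved words is not reproduced on the slide); its values were chosen only so that “begin” lands at location 33, as in the text:

```python
# Hypothetical character-value table; Cichelli's algorithm searches for
# values that make the hash perfect (and minimal) for a fixed word set.
T = {'b': 10, 'e': 18}

def cichelli_hash(key: str) -> int:
    """h(key) = length(key) + T[first char] + T[second char]."""
    return len(key) + T[key[0]] + T[key[1]]

print(cichelli_hash("begin"))   # 5 + 10 + 18 = 33 with this T
```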
Some Links to Hashing Animation
Links for interactive hashing examples:
 http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm
 http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html
 http://www.cse.yorku.ca/~aaw/Hang/hash/Hash.html
 http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html
Hashing as Database Index
 The basic idea is to use a hashing function which maps a
search key value (of a field) into a record or a bucket of
records.
 As for any index, there are 3 alternatives for data entries k*:
 the data record itself, with key value k
 <k, rid of data record with search key value k>
 <k, list of rids of data records with search key k>
 Hash-based indexes are best for equality selections; they
cannot support range searches.
 Static and dynamic hashing techniques exist; the trade-offs are
similar to ISAM vs. B+ trees.
Static Hashing
 # of primary pages is fixed, allocated sequentially, never
de-allocated; overflow pages are allowed if needed.
 h(k) mod M = bucket to which data entry with key k
belongs. (M = # of buckets)
[Figure: a key is mapped by h(key) mod N to one of the primary bucket pages 0, 1, ..., N-1, each of which may point to a chain of overflow pages.]
Static Hashing (Contd.)
 Buckets contain data entries.
 The hash function depends on the search key field of record r.
It must distribute values over the range 0 ... M-1 (M = table size).
 h(key) = (a * key + b) mod M usually works well;
a and b are constants, and a lot is known about how to tune h.
 Long overflow chains can develop and degrade
performance.
 Extendible and Linear Hashing are dynamic techniques that fix
this problem.
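A minimal sketch of such a static hash function (Python; M and the constants a, b are illustrative assumptions):

```python
M = 16                                 # fixed number of primary buckets
A, B = 31, 7                           # illustrative constants for h

def bucket_of(key: int) -> int:
    """Static hashing: h(key) = (a*key + b) mod M never changes, so a
    growing file can only extend its overflow chains."""
    return (A * key + B) % M

print(bucket_of(20))                   # (31*20 + 7) mod 16 = 627 mod 16 = 3
```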
Rule of thumb
 Try to keep space utilization between 50% and 80%:
 if <50%, we are wasting space;
 if >80%, overflows become significant.
This depends on how good the hashing function is and on the # of keys per bucket.
Hash Functions for Extendible Hashing
 Extendible Hashing (Fagin et al., 1979)
 Expandable Hashing (Knott, 1971)
 Dynamic Hashing (Larson, 1978)
Extendible Hashing
Assume that a hashing technique is applied to a dynamically
changing file composed of buckets, and that each bucket can hold
only a fixed number of items.
Extendible hashing accesses the data stored in buckets
indirectly, through an index that is dynamically adjusted to
reflect changes in the file.
The characteristic feature of extendible hashing is the
organization of the index, which is an expandable table.
Extendible Hashing
 A hash function applied to a certain key indicates a position in the index
and not in the file (or table of keys). Values returned by such a hash
function are called pseudokeys.
 The database/file requires no reorganization when data are added to or
deleted from it, since these changes are indicated in the index.
 Only one hash function h is used, but depending on the size of the
index, only a portion of the address h(K) is utilized.
 A simple way to achieve this effect is to view the address as a
string of bits from which only the i leftmost bits are used.
The number i is the depth of the directory.
In Figure 1(a) (on the next slide), the depth is equal to two.
 Example
Figure 1. An example of extendible hashing (Drozdek textbook)
[The figure is not reproduced here.]
Extendible Hashing as Index
 Situation: a bucket (primary page) becomes full. Why not
re-organize the file by doubling the # of buckets?
 Reading and writing all pages is expensive!
 Idea: use a directory of pointers to buckets; double the # of
buckets by doubling the directory and splitting just the bucket
that overflowed!
 The directory is much smaller than the file, so doubling it is much
cheaper. Only one page of data entries is split. No
overflow page!
 The trick lies in how the hash function is adjusted!
Example
 The directory is an array of size 4; its global depth is 2, and each bucket has local depth 2.
 To find the bucket for r, take the last ‘global depth’ # of bits of h(r); we
denote r simply by h(r).
 If h(r) = 5 = binary 101, it is in the bucket pointed to by directory entry 01.
[Figure: directory entries 00, 01, 10, 11 point to Bucket A (4*, 12*, 32*, 16*), Bucket B (1*, 5*, 21*, 13*), Bucket C (10*), and Bucket D (15*, 7*, 19*) respectively; the directory and the data pages are stored separately.]
 Insert: if the bucket is full, split it (allocate a new page, re-distribute the entries).
 If necessary, double the directory. (As we will see, splitting a
bucket does not always require doubling; we can tell by
comparing the global depth with the local depth of the split bucket.)
Insert h(r)=20 (Causes Doubling)
[Figure: inserting 20 into the full Bucket A (4*, 12*, 32*, 16*) forces a split. After the split, the global depth becomes 3 and the directory doubles from entries 00..11 to 000..111. Bucket A keeps 32*, 16*, and its ‘split image’ Bucket A2 holds 4*, 12*, 20*; A and A2 have local depth 3, while Buckets B (1*, 5*, 21*, 13*), C (10*) and D (15*, 7*, 19*) keep local depth 2.]
Points to Note
 20 = binary 10100. The last 2 bits (00) tell us r belongs in A or
A2. The last 3 bits are needed to tell which.
 Global depth of directory: max # of bits needed to tell which
bucket an entry belongs to.
 Local depth of a bucket: # of bits used to determine if an
entry belongs to this bucket.
 When does a bucket split cause directory doubling?
 Before the insert, local depth of the bucket = global depth. The insert
causes the local depth to become > global depth; the directory is
doubled by copying it over and ‘fixing’ the pointer to the split image
page. (Use of least significant bits enables efficient doubling
via copying of the directory!)
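A compact, in-memory sketch of extendible hashing in Python (bucket capacity 4 and an identity hash, as in the slide example; a real index would store buckets as disk pages):

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    """Directory of 2**global_depth pointers to buckets; only the
    overflowing bucket is split, and the directory doubles only when
    that bucket's local depth would exceed the global depth."""
    CAPACITY = 4                                 # entries per bucket/page

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key: int) -> int:
        # last global_depth bits of the (here: identity) hash value
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key: int) -> None:
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.CAPACITY:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2  # double by copying
            self.global_depth += 1
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)       # the 'split image'
        bit = 1 << (bucket.local_depth - 1)      # newly significant bit
        for i in range(len(self.directory)):     # re-aim half the pointers
            if self.directory[i] is bucket and i & bit:
                self.directory[i] = image
        old, bucket.keys = bucket.keys, []
        for k in old + [key]:                    # re-distribute the entries
            self.insert(k)

eh = ExtendibleHash()
for k in (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20):
    eh.insert(k)
print(eh.global_depth)   # 3: inserting 20 forced the directory to double
```

Run on the keys of the example, the final state matches the slides: bucket 000 holds 32, 16 and its split image 100 holds 4, 12, 20.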
Directory Doubling
Why use least significant bits in the directory?
⇒ It allows for doubling via copying!
[Figure: the key 6 = 110 is placed in directories of depth 2 and depth 3. With least-significant-bit ordering (00, 01, 10, 11), doubling simply appends a copy of the directory (000, 001, ..., 111) and the entries stay aligned; with most-significant-bit ordering, doubling requires interleaving the old entries instead of copying them.]
Comments on Extendible Hashing
 If the directory fits in memory, an equality search is answered with
one disk access; else two.
 A 100MB file with 100 bytes/rec and 4K pages contains 1,000,000
records (as data entries) and 25,000 directory elements;
chances are high that the directory will fit in memory.
 The directory grows in spurts, and, if the distribution of hash
values is skewed, the directory can grow large.
 Multiple entries with the same hash value cause problems!
 Delete: if the removal of a data entry makes a bucket empty, the bucket
can be merged with its ‘split image’. If each directory
element points to the same bucket as its split image, we can
halve the directory.
Hybrid methods
Expandable Hashing
 Similar idea to extendible hashing, but a binary tree is used to store an index on the buckets.
Dynamic Hashing
 Multiple binary trees are used.
Outcome:
- Shortens the search.
- Based on the key, selects which tree to search.
Linear Hashing
 This is another dynamic hashing scheme, an alternative to
Extendible Hashing.
 LH handles the problem of long overflow chains without
using a directory, and handles duplicates.
 Idea: use a family of hash functions h0, h1, h2, ...
 h_i(key) = h(key) mod (2^i * N); N = initial # of buckets
 h is some hash function (its range is not just 0 to N-1)
 If N = 2^d0 for some d0, h_i consists of applying h and looking at
the last d_i bits, where d_i = d0 + i.
 h_(i+1) doubles the range of h_i (similar to directory doubling)
Linear Hashing (Contd.)
 The directory is avoided in LH by using overflow pages and
choosing the bucket to split round-robin.
 Splitting proceeds in ‘rounds’. A round ends when all N_R
initial (for round R) buckets have been split. Buckets 0 to Next-1
have been split; Next to N_R are yet to be split.
 The current round number is Level.
 Search: to find the bucket for data entry r, find h_Level(r):
 If h_Level(r) is in the range ‘Next to N_R’, r belongs here.
 Else, r could belong to bucket h_Level(r) or to bucket h_Level(r)
+ N_R; we must apply h_(Level+1)(r) to find out.
Overview of LH File
 In the middle of a round:
[Figure: the buckets that existed at the beginning of this round form the range of h_Level. Buckets 0 to Next-1 have already been split in this round; bucket Next is the next to be split. If h_Level(search key value) falls in the already-split range, h_(Level+1)(search key value) must be used to decide whether the entry is in a ‘split image’ bucket, i.e. one of the buckets created through splitting in this round.]
Linear Hashing (Contd.)
 Insert: find the bucket by applying h_Level / h_(Level+1):
 If the bucket to insert into is full:
 Add an overflow page and insert the data entry.
 (Maybe) split the Next bucket and increment Next.
 Any criterion can be chosen to ‘trigger’ a split.
 Since buckets are split round-robin, long overflow chains
don’t develop!
 The doubling of the directory in Extendible Hashing is similar;
the switching of hash functions is implicit in how the # of bits
examined is increased.
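A compact, in-memory sketch of linear hashing in Python (bucket capacity 4, and a split triggered whenever an insert lands in a full bucket; both are illustrative choices). Fed the keys from the example on the next slide, it reproduces the split of bucket 0 caused by inserting 43:

```python
class LinearHash:
    """No directory; buckets split in round-robin order. Long lists
    stand in for a primary page plus its overflow pages."""
    CAPACITY = 4                       # entries per primary page

    def __init__(self, n: int = 4):
        self.n0 = n                    # N: number of buckets at Level=0
        self.level = 0                 # current round number (Level)
        self.next = 0                  # next bucket to split this round
        self.buckets = [[] for _ in range(n)]

    def _h(self, i: int, key: int) -> int:
        return key % (self.n0 << i)    # h_i(key) = key mod (2^i * N)

    def _bucket_of(self, key: int) -> int:
        b = self._h(self.level, key)
        if b < self.next:              # already split: h_(level+1) decides
            b = self._h(self.level + 1, key)
        return b

    def insert(self, key: int) -> None:
        b = self._bucket_of(key)
        self.buckets[b].append(key)    # may land on an 'overflow page'
        if len(self.buckets[b]) > self.CAPACITY:
            self._split()              # split bucket Next, not bucket b!

    def _split(self) -> None:
        old, self.buckets[self.next] = self.buckets[self.next], []
        self.buckets.append([])        # the 'split image' bucket
        for k in old:                  # re-distribute with h_(level+1)
            self.buckets[self._h(self.level + 1, k)].append(k)
        self.next += 1
        if self.next == self.n0 << self.level:   # all N_R buckets split:
            self.level += 1                      # the round ends
            self.next = 0

lh = LinearHash()
for k in (32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11, 43):
    lh.insert(k)
print(lh.buckets[0], lh.buckets[4])    # [32] [44, 36]: bucket 0 was split
```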
Example of Linear Hashing
 On a split, h_(Level+1) is used to re-distribute entries.
[Figure: Level=0, N=4. The h1/h0 columns (000/00, 001/01, 010/10, 011/11) are for illustration only; the primary pages are the actual contents of the linear hashed file. Initially Next=0 and the buckets hold: 00: 32*, 44*, 36*; 01: 9*, 25*, 5*; 10: 14*, 18*, 10*, 30*; 11: 31*, 35*, 7*, 11*. The notation 5* denotes a data entry r with h(r)=5 on a primary bucket page. Inserting 43* fills bucket 11, which gets an overflow page holding 43*, and triggers the split of bucket 0: afterwards Next=1, bucket 000 holds 32*, and the new split-image bucket 100 holds 44*, 36*.]
Example: End of a Round
[Figure: before the insert, Level=0 and Next=3. The buckets hold: 000: 32*; 001: 9*, 25*; 010: 66*, 18*, 10*, 34*; 011: 31*, 35*, 7*, 11* with an overflow page holding 43*; 100: 44*, 36*; 101: 5*, 37*, 29*; 110: 14*, 30*, 22*. Inserting 50* (h0(50) = 10, an already-split bucket, so h1(50) = 010 applies) adds an overflow page to bucket 010 and triggers the split of bucket 011 into 011: 43*, 35*, 11* and its split image 111: 31*, 7*. All buckets of the round are now split, so the round ends with Level=1, Next=0.]
LH Described as a Variant of EH
 The two schemes are actually quite similar:
 Begin with an EH index where the directory has N elements.
 Use overflow pages; split buckets round-robin.
 The first split is at bucket 0. (Imagine the directory being doubled at
this point.) But elements <1, N+1>, <2, N+2>, ... are the same.
So, we need only create directory element N, which now differs from
element 0.
 When bucket 1 splits, create directory element N+1, etc.
 So, the directory can double gradually. Also, primary bucket
pages are created in order. If they are allocated in sequence
too (so that finding the i’th is easy), we actually don’t need a
directory!
Useful Links
 http://www.cs.ucla.edu/classes/winter03/cs143/l1/handouts/hash.pdf
 http://www.ecst.csuchico.edu/~melody/courses/csci151_live/Static_hash_course_notes.htm
 http://www.smckearney.com/adb/notes/lecture.extendible.hashing.pdf
Summary
 Hash-based indexes: best for equality searches; they cannot
support range searches.
 Static Hashing can lead to long overflow chains.
 Extendible Hashing avoids overflow pages by splitting a
full bucket when a new data entry is to be added to it.
(Duplicates may require overflow pages.)
 A directory keeps track of the buckets and doubles periodically.
 Directoryless schemes (linear dynamic hashing) are available.
 The directory can get large with skewed data; additional I/O is
needed if it does not fit in main memory.
Summary
 Linear Hashing avoids a directory by splitting buckets
round-robin and using overflow pages.
 Overflow pages are not likely to be long.
 Duplicates are handled easily.
 Space utilization could be lower than Extendible Hashing,
since splits are not concentrated on ‘dense’ data areas.
Check List
 What is the intuition behind hash-structured indexes?
 Why are they especially good for equality searches but
useless for range selections?
 What is Extendible Hashing? How does it handle
search, insert, and delete?
 What is Linear Hashing?
 What are the similarities and differences between
Extendible and Linear Hashing?
 How does a perfect hash function work?