Hashing

3.3 Hashing
Hashing is a method for storing and retrieving information. It is often very fast: much
faster than linear search and even faster than binary search. Suppose we have n items.
Searching for one of them using linear search requires time O(n), and searching using
binary search requires time O(log n). However, if properly implemented, searching using
hashing requires time O(1) on average. Thus the search time stays essentially constant
as n grows, provided the table is large enough for the n items. In a sense any search takes
bounded time when n is bounded, so the same could be said of linear and binary search,
but practically speaking hashing is usually faster.
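The three kinds of search compared above can be sketched with the C++ standard library. This is only an illustration: the function names are hypothetical, and std::unordered_set stands in for a hash table rather than for the hashing function developed later in this section.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Linear search: examines the items one by one, O(n) comparisons.
bool findLinear(const std::vector<std::string>& items, const std::string& key) {
    return std::find(items.begin(), items.end(), key) != items.end();
}

// Binary search: requires the items to be sorted, O(log n) comparisons.
bool findBinary(const std::vector<std::string>& sortedItems, const std::string& key) {
    return std::binary_search(sortedItems.begin(), sortedItems.end(), key);
}

// Hashed lookup: std::unordered_set is a hash table, O(1) on average.
bool findHashed(const std::unordered_set<std::string>& items, const std::string& key) {
    return items.count(key) > 0;
}
```

All three functions give the same answer for the same data; they differ only in how much work they do as the number of items grows.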
We illustrate hashing by storing and retrieving names in an array. Suppose the array is
called Customer. For simplicity we shall make it not too large. Suppose it has 111
elements with indices starting out at 0 and going to 110. We want to store some names in
this array, for example Williams, Smith and Johnson.
[Figure: the array Customer with indices 0, 1, 2, 3, ..., 110, and the names Williams, Smith and Johnson to be stored in it.]
However, instead of putting them in the first 3 locations, what we do is compute from
each name a number which is the location in the array where it will be stored. We
compute the location where the name will be stored using a function called a hashing
function. There are many possible hashing functions. Let’s look at one.
It is convenient to take advantage of how names are commonly stored on a computer.
The letters of the name are usually stored sequentially in a computer’s memory with each
letter being stored in 1 byte (8 bits) with the letter represented by its ASCII code. At the
end of this section is an ASCII table with the bit representation of each letter. For
example, a capital A is represented by 41 (hex) or 65 (decimal) with the other capital
letters following in sequence. A lower case a is represented by 61 (hex) or 97 (decimal)
with the other lower case letters following in sequence. Thus Williams is represented
using 8 bytes as follows. We give the contents in hex.

W   i   l   l   i   a   m   s
57  69  6C  6C  69  61  6D  73
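The byte representation above can be checked with a few lines of C++. This is a minimal sketch that assumes the machine uses ASCII; the function name toHexBytes is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Returns the ASCII codes of the characters of s as two-digit hex
// values separated by single spaces.
std::string toHexBytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Convert through unsigned char so the code is never negative.
        unsigned code = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", code);
        if (i > 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling toHexBytes("Williams") yields the string 57 69 6C 6C 69 61 6D 73, matching the eight bytes shown above.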
The idea of the hashing function is to take some of the characters, treat them as a number
and divide by the length of the array and take the remainder. The remainder is where the
name will be stored in the array.
To simplify matters, let’s take the first 4 letters of the name and treat that as an integer. If
there are fewer than 4 letters, we take however many there are.
For example, with the name Williams we would just take Will.

W   i   l   l
57  69  6C  6C
At this point we need to look at one little quirk of the processors in personal computers.
When the CPU takes 4 bytes of memory and treats them as an integer, it reverses the
order of the bytes. Thus the first byte is the low order byte of the integer and the fourth
byte is the high order byte of the integer. (This byte order is called little-endian.) In the
example of Will, the W is the low order byte and the rightmost l is the high order byte.
The table below lists the bytes from high order to low order, and also includes the
decimal representation of each letter.

char   l    l    i    W
hex    6C   6C   69   57
dec    108  108  105  87
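The byte order described above can be observed directly. The sketch below copies 4 bytes into a 32-bit integer with memcpy, so the result depends on the CPU on which it runs. The function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Copies 4 bytes into a 32-bit integer exactly as they sit in memory,
// so the value obtained depends on the byte order of the CPU.
std::uint32_t asNativeInteger(const char bytes[4]) {
    std::uint32_t n;
    std::memcpy(&n, bytes, sizeof n);
    return n;
}

// True on a CPU that treats the first byte as the low order byte.
bool isLittleEndian() {
    const char probe[4] = {1, 0, 0, 0};
    return asNativeInteger(probe) == 1;
}
```

On a little-endian machine, passing the four characters W, i, l, l to asNativeInteger gives the integer whose hex digits are 6C6C6957. On a big-endian machine the same bytes would give 57696C6C instead.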
If we want to see what number this represents in decimal we must compute

$$
\begin{aligned}
108 \cdot 256^3 + 108 \cdot 256^2 + 105 \cdot 256 + 87
&= (108)(16{,}777{,}216) + (108)(65{,}536) + 26{,}880 + 87 \\
&= 1{,}811{,}939{,}328 + 7{,}077{,}888 + 26{,}880 + 87 \\
&= 1{,}819{,}044{,}183
\end{aligned}
$$

In the above example the array length is 111, so we need to mod
1,819,044,183 by 111.

1,819,044,183 mod 111 = 48

So Williams is stored in location 48 of the array.
[Figure: the array Customer with indices 0 through 110 and Williams stored in location 48.]
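Storing a name at its computed location, and finding it again, might be sketched as follows. The names hashName, store and isStored are hypothetical, and the sketch ignores the possibility that two names are sent to the same location, which this section has not discussed.

```cpp
#include <cstdint>
#include <string>

const int TABLE_SIZE = 111;            // indices 0 through 110
std::string Customer[TABLE_SIZE];      // an empty string marks an unused location

// Treats the first four characters as an integer, first character as the
// low order byte, and returns the remainder on division by the array length.
int hashName(const std::string& name) {
    std::uint32_t n = 0;
    std::uint32_t weight = 1;          // 1, 256, 256^2, 256^3
    for (std::size_t j = 0; j < 4 && j < name.size(); ++j) {
        n += weight * static_cast<unsigned char>(name[j]);
        weight *= 256;
    }
    return static_cast<int>(n % TABLE_SIZE);
}

// Stores name at its hashed location and returns that location.
// Assumption: whatever was stored there before is simply overwritten.
int store(const std::string& name) {
    int location = hashName(name);
    Customer[location] = name;
    return location;
}

// Returns true if name is found at its hashed location.
bool isStored(const std::string& name) {
    return Customer[hashName(name)] == name;
}
```

With this sketch, store("Williams") returns 48, in agreement with the hand computation. Retrieval looks at that one location only, which is why the search time does not grow with the number of names.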
We can use the algebra of mod to reduce the size of the numbers
we are working with at each step of the calculations. In particular, we can then do the
computations entirely by hand without a calculator. The main idea is that in a sequence of
additions, subtractions and multiplications followed by a mod, we can replace any
number by another number that has the same remainder mod 111. For example, 108 can
be replaced by -3, since 108 - 111 = -3.
[1082563 + 1082562 + 105256 + 87] mod 111
= [ 108(256 mod 111)3 + 108(256 mod 111)2 + 105(256 mod 111) + 87] mod 111
= [ (- 3)(34)3 + (- 3)(34)2 + (- 6)(34) - 24] mod 111
= [ - 102(34)2 - 10234 - 2102 - 24] mod 111 = [ 9(34)2 + 934 + 29 - 24] mod 111
= [ 333434 + 3334 + 29 - 24] mod 111
3.3 - 2
= [ 102102 + 3102 + 18 - 24] mod 111
= [ (-9)(-9) + 3(-9) - 6] mod 111 = [ 81 - 27 - 6] mod 111 = 48 mod 111 =
48
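The same idea, reducing mod 111 at every step so the numbers stay small, can be sketched in code. Note that this sketch rearranges the sum using Horner's rule, which is not the arrangement used in the hand calculation above, and it uses nonnegative remainders rather than negative replacements such as -3. The function name hashSmallSteps is hypothetical.

```cpp
#include <string>

// Computes [a0 + 256 a1 + 256^2 a2 + 256^3 a3] mod 111 for the first four
// characters of s, reducing mod 111 after every step so that no
// intermediate value exceeds 110 * 34 + 110 = 3850.
int hashSmallSteps(const std::string& s) {
    const int m = 111;
    const int base = 256 % m;              // 256 mod 111 = 34
    int r = 0;
    for (int j = 3; j >= 0; --j) {         // high order character first
        int a = (j < static_cast<int>(s.size()))
                    ? static_cast<unsigned char>(s[j]) % m
                    : 0;                   // missing characters count as 0
        r = (r * base + a) % m;
    }
    return r;
}
```

For Will the successive values of r are 108, 6, 87 and 48, so the result agrees with the calculation above without ever forming the ten-digit integer.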
Remark: Most calculators have a way to do mod. On the HP 48G, to do n mod p, do the following:
(1) MTH (2) REAL (3) n <Enter> (4) p <Enter> (5) MOD. Some calculators allow one to do calculations
in hex. On the HP 48G, to convert a number from hex to decimal, do the following: (1) MTH (2) BASE (3)
| # (4) 6C6C6957 | h (5) | h (6) B→R.
The way a hashing function is specified in mathematical notation differs somewhat from
person to person. Here is an example to illustrate one method.
Example 1. Specify a function that takes a character string s, treats the first four
characters as an integer with the first character as the low order digit, and then takes the
result mod 111.
Solution. Let
s = a character string
p = length of s
$s_j$ = (j+1)st character of s. So $s_0$ = first character, $s_1$ = second character, …,
$s_{p-1}$ = last character of s.

$$
a_j =
\begin{cases}
\text{numerical value of the ASCII code of } s_j & \text{if } j < p \\
0 & \text{if } j \ge p
\end{cases}
$$

$$
h(s) = [a_0 + 256 a_1 + 256^2 a_2 + 256^3 a_3] \bmod 111
$$
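The definition of h translates almost symbol for symbol into code. The following is a minimal sketch; the helper a mirrors the quantity a_j in the solution, and the choice of std::uint32_t for the intermediate integer is an assumption made so that the arithmetic does not depend on the byte order of the CPU.

```cpp
#include <cstdint>
#include <string>

// a_j: ASCII code of the (j+1)st character of s if j < p, otherwise 0,
// where p is the length of s.
std::uint32_t a(const std::string& s, std::size_t j) {
    return j < s.size() ? static_cast<unsigned char>(s[j]) : 0;
}

// h(s) = [a0 + 256 a1 + 256^2 a2 + 256^3 a3] mod 111
std::uint32_t h(const std::string& s) {
    std::uint32_t n = a(s, 0)
                    + 256u * a(s, 1)
                    + 256u * 256u * a(s, 2)
                    + 256u * 256u * 256u * a(s, 3);
    return n % 111u;
}
```

Because only the first four characters enter the formula, h("Williams") and h("Will") are both 48.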
Here is a C++ program which illustrates the hashing method we have been discussing.
It relies on the byte order described above, in which the first byte is the low order byte.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string Name;
    char Name4[4];
    uint32_t n;        // exactly 4 bytes, unlike unsigned long on some systems
    unsigned int p;

    // Read a name.
    cout << "Program to demonstrate a hashing function" << endl;
    cout << "Enter a Name: ";
    cin >> Name;

    // Take first four characters, padding with 0's if necessary.
    for (size_t i = 0; i < 4; i = i + 1)
        if (i < Name.length())
            Name4[i] = Name[i];
        else
            Name4[i] = '\0';

    // Treat as an integer. Copying the bytes with memcpy avoids the
    // alignment and size problems of casting a char pointer to an
    // integer pointer.
    memcpy(&n, Name4, sizeof n);
    cout << "First 4 characters viewed as an integer = " << n << endl;

    // Compute the hashed value.
    p = n % 111;
    cout << "Hashed value = " << p << endl;
    return 0;
}
ASCII Table
The ASCII code represents keyboard characters as sequences of 7 bits. In many computers the
code for each character is stored in the low order 7 bits of an 8-bit byte and the high order bit is set to 0. In
the table below the ASCII codes for the keyboard characters are given in hex along with the decimal
equivalent of the hex value. Note: SP indicates a space and Del indicates the "Delete" character.
The first 32 ASCII codes do not represent printable characters. Rather they are used to send
control signals between the computer and input/output devices. For example, the code 0A indicates a Line
Feed, which means to move down to the next line. The code 0D indicates a Carriage Return, which means to
move the print position back to the start of the current line.
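An entry of the table can be produced for any character with a short function. This is a sketch assuming an ASCII machine; the names tableEntry and isControlCode are hypothetical.

```cpp
#include <cstdio>
#include <string>

// Formats one table entry as the character, its code in hex, and its
// code in decimal, separated by single spaces.
std::string tableEntry(char c) {
    unsigned code = static_cast<unsigned char>(c);
    char buf[32];
    std::snprintf(buf, sizeof buf, "%c %02X %u", c, code, code);
    return buf;
}

// True for the first 32 codes (00 through 1F hex), which are control
// codes rather than printable characters.
bool isControlCode(char c) {
    return static_cast<unsigned char>(c) < 32;
}
```

For example, tableEntry('A') gives the string A 41 65, and isControlCode is true for the Line Feed and Carriage Return characters but false for a space.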
char  hex  dec      char  hex  dec      char  hex  dec
SP    20   32       @     40   64       `     60   96
!     21   33       A     41   65       a     61   97
"     22   34       B     42   66       b     62   98
#     23   35       C     43   67       c     63   99
$     24   36       D     44   68       d     64   100
%     25   37       E     45   69       e     65   101
&     26   38       F     46   70       f     66   102
'     27   39       G     47   71       g     67   103
(     28   40       H     48   72       h     68   104
)     29   41       I     49   73       i     69   105
*     2A   42       J     4A   74       j     6A   106
+     2B   43       K     4B   75       k     6B   107
,     2C   44       L     4C   76       l     6C   108
-     2D   45       M     4D   77       m     6D   109
.     2E   46       N     4E   78       n     6E   110
/     2F   47       O     4F   79       o     6F   111
0     30   48       P     50   80       p     70   112
1     31   49       Q     51   81       q     71   113
2     32   50       R     52   82       r     72   114
3     33   51       S     53   83       s     73   115
4     34   52       T     54   84       t     74   116
5     35   53       U     55   85       u     75   117
6     36   54       V     56   86       v     76   118
7     37   55       W     57   87       w     77   119
8     38   56       X     58   88       x     78   120
9     39   57       Y     59   89       y     79   121
:     3A   58       Z     5A   90       z     7A   122
;     3B   59       [     5B   91       {     7B   123
<     3C   60       \     5C   92       |     7C   124
=     3D   61       ]     5D   93       }     7D   125
>     3E   62       ^     5E   94       ~     7E   126
?     3F   63       _     5F   95       Del   7F   127