Hash codes for strings

Hashing
Searching techniques:
• Sequential search: O(n)
• Binary search: O(log₂ n)
There is at least one condition where a binary search does not work very well – when the
data that you are searching for is on a disk, and not in main memory. When accessing data
on a disk, the slowest part is accessing the disk drive, and the goal is to minimize the
number of disk accesses needed to find the desired data.
Obviously, the best possible case would be if you could find the data on the first try every
time! There is no method for doing this, but there is a way of coming close – hashing.
Hashing: the process of taking a key and applying an algorithm to it to come up with an
address.
Ideal hashing
Consider a town with a population of 9,000 where everybody's telephone number begins
with the same 3-digit prefix. We want to be able to use everybody's phone number as a key
to look up the person's name and address. And we want to find the name and address on
the first guess.
One solution is to create an array of 10,000 items (numbered from 0 to 9,999) and use the
last 4 digits of the phone numbers (phone number mod 10,000) as keys.
So if your number is 555-1234, your information would be stored in location 1234 in the
array. If the number of names and addresses is close to 10,000 (as it is in this case),
almost all of the array would be full, and we would have very little wasted space. In return
for the wasted space, we would be able to go directly to the item that we are looking for on
the first try!
It doesn't get any better than this! We will always find the item we are looking for on the
first try.
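The ideal scheme above can be sketched in Java roughly as follows. The class name DirectPhoneBook and its methods are hypothetical, chosen only for illustration, and phone numbers are assumed to be passed as 7-digit integers such as 5551234.

```java
// A minimal sketch of the ideal case: the last 4 digits are the array index.
class DirectPhoneBook {
    private final String[] entries = new String[10_000]; // locations 0..9999

    // Hash function: phone number mod 10,000 keeps the last 4 digits.
    static int hash(int phoneNumber) {
        return phoneNumber % 10_000;
    }

    void put(int phoneNumber, String nameAndAddress) {
        entries[hash(phoneNumber)] = nameAndAddress;
    }

    String get(int phoneNumber) {
        return entries[hash(phoneNumber)]; // one array access, no searching
    }

    public static void main(String[] args) {
        DirectPhoneBook book = new DirectPhoneBook();
        book.put(5551234, "Sample Name, 1 Sample Street"); // made-up sample data
        System.out.println(hash(5551234));     // prints 1234
        System.out.println(book.get(5551234)); // prints the stored entry
    }
}
```

Because every possible key has its own location, no two keys ever compete for the same slot.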
However, not all keys will convert into an address (or array index) as easily.
Typical Hashing
In real life, things don't always work out this easily. Consider the same example as above,
but for a much smaller town, say with a population of 700. If we allocate 10,000 memory
locations, we will be wasting 9,300 of them (93%).
One possible solution is to apply the same algorithm that we did before, except make our
array size 700 (0 to 699). Since we need to map (convert) each key into a number in the
range 0 to 699, we could take the last 4 digits of the phone number (mod 10,000) and
divide it by 700 and take the remainder (mod 700) to get a number in the range 0 to 699.
3/18/2016
Document1
1 of 8
However, unless the phone numbers are 555-0000 through 555-0699 (or something
similar), we are going to have a problem. It is very likely that we will have two phone
numbers that, when divided by 700, produce the same remainder.
Example
The phone numbers 555-0123 and 555-0823 both produce a remainder of 123 when
divided by 700. Only one of them can be put in location 123 in the array. So what
happens when we try to put another one in the array? We get a collision.
Collision: the mapping of two keys into the same hash index.
Collisions are bad.
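The collision in the example can be checked with a few lines of Java. This is only a sketch: the class name CollisionDemo is hypothetical, and phone numbers are assumed to be 7-digit integers.

```java
// Sketch: two different phone numbers that map to the same index in a 700-slot table.
class CollisionDemo {
    static final int TABLE_SIZE = 700;

    // Keep the last 4 digits, then reduce them to the range 0..699.
    static int hash(int phoneNumber) {
        return (phoneNumber % 10_000) % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash(5550123)); // prints 123
        System.out.println(hash(5550823)); // prints 123 as well: a collision
    }
}
```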
In the real world, perfect data occurs very infrequently. In the real world, collisions are
going to occur. So we need to decide:
• How to minimize their occurrence.
• How to handle them when they do occur.
General characteristics of a hashing function
A good hashing function should:
• Minimize collisions
• Distribute entries uniformly throughout the hash table
• Be fast to compute
Java's built-in hashCode method
The Object class has a built-in method called hashCode, which returns an integer. The
integer is typically based on the object's address in main memory. However, this is not a good
hashing algorithm to use. The reason is that we could have two different objects that have
the same values. These objects would hash into different locations in a hash table (because
their addresses are different), even though they should hash into the same location in the
hash table (because their values are the same).
So we should override the built-in hashCode method with one of our own. Our own hash
function should:
• Provide equal hash codes for Objects that have equal values.
• Always produce the same hash code for the same data values.
• Evenly distribute the keys through the range of possible hash indexes.
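One way to meet the first two requirements is to compute the hash code from the object's field values. The sketch below assumes a hypothetical PhoneEntry class with two String fields; it uses the standard Objects.hash helper, and overrides equals as well so that "equal values" has a defined meaning.

```java
import java.util.Objects;

// Sketch: a value class whose hash code depends on its contents, not its address.
class PhoneEntry {
    private final String phoneNumber;
    private final String address;

    PhoneEntry(String phoneNumber, String address) {
        this.phoneNumber = phoneNumber;
        this.address = address;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PhoneEntry)) return false;
        PhoneEntry that = (PhoneEntry) other;
        return phoneNumber.equals(that.phoneNumber) && address.equals(that.address);
    }

    @Override
    public int hashCode() {
        // Equal field values always produce equal hash codes.
        return Objects.hash(phoneNumber, address);
    }

    public static void main(String[] args) {
        PhoneEntry a = new PhoneEntry("555-1214", "150 Main Street");
        PhoneEntry b = new PhoneEntry("555-1214", "150 Main Street");
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```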
Hash codes for strings
A hashing function is supposed to convert a key into an integer. If the key is a string, we
need to find a way to convert the string into an integer. The most common way is to take
the ASCII or Unicode value of each character (for ASCII characters, the Unicode value is
the same as the ASCII code). We could just add up the values, but unfortunately this
doesn't distribute the keys very evenly: strings that contain the same characters in a
different order, such as "stop" and "pots", all get the same sum. A better solution is to
use the following formula:
$$u_0 g^{n-1} + u_1 g^{n-2} + \cdots + u_{n-2} g^{1} + u_{n-1} g^{0}$$
where the u's represent the codes of the characters of the string, and the g is some
constant, and n is the number of characters in the string.
This formula can be converted into the following which reduces the number of arithmetic
operations that need to be done:
$$(\cdots((u_0 g + u_1) g + u_2) g + \cdots + u_{n-2}) g + u_{n-1}$$
This is written in Java like this:
import java.util.Scanner;

public class HashTest
{
    public static void main (String[] args)
    {
        Scanner input = new Scanner(System.in);
        System.out.print("Enter a string: ");
        String s = input.nextLine();
        int g = 31;      // the constant multiplier
        int hash = 0;
        int n = s.length();
        for (int i = 0; i < n; i++)
        {
            // Show the character code, then fold it into the running hash.
            System.out.println("Code = " + (int) s.charAt(i));
            hash = g * hash + s.charAt(i);
            System.out.println("Hash = " + hash);
        }
    }
}
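Note:
• The variable g could be any number, although an odd prime such as 31 is a common choice.
• The int value returned by charAt is the character's Unicode (UTF-16) code, which is the same as the ASCII code for ASCII characters.
• You can use charAt(i) in an arithmetic expression and Java will use its integer value!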
The output from this program is:
Enter a string: Hello
Code = 72
Hash = 72
Code = 101
Hash = 2333
Code = 108
Hash = 72431
Code = 108
Hash = 2245469
Code = 111
Hash = 69609650
Note that for long strings, you can get an overflow. Java will ignore integer overflows and
just keep the low-order bits of the result! You do not get an error message!
The String class has a built-in hashCode function that uses a value of 31 for g.
The hash code for the string "Hello" is 69,609,650.
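The following sketch compares a plain character sum with the polynomial formula, and checks the polynomial result against the built-in String.hashCode. The class and method names (StringHashDemo, sumHash, polyHash) are hypothetical.

```java
// Sketch: a simple character sum versus the polynomial hash with g = 31.
class StringHashDemo {
    // Adds up the character codes; the order of the characters is ignored.
    static int sumHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash += s.charAt(i);
        }
        return hash;
    }

    // Polynomial hash; overflow silently wraps around.
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("stop") == sumHash("pots"));      // prints true
        System.out.println(polyHash("stop") == polyHash("pots"));    // prints false
        System.out.println(polyHash("Hello") == "Hello".hashCode()); // prints true
        System.out.println(polyHash("Hello"));                       // prints 69609650
    }
}
```

The two anagrams collide under the sum but not under the polynomial formula, because the polynomial weights each character by its position.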
Hash codes for primitive types
If the type is byte, short, or char, cast it into an int.
If the type is long, you can cast it into an int. The cast keeps the low-order 32 bits of
the number (for a non-negative value, this is the remainder after dividing by 2^32), and
these bits fit into an integer.
If the type is double, you can convert its bit pattern into a long like this:
long bits = Double.doubleToLongBits(n);
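A small sketch of these conversions follows. The class name PrimitiveHashDemo and the helper names fromChar and lowBits are hypothetical; Double.doubleToLongBits and Long.toHexString are standard library methods.

```java
// Sketch: turning primitive values into int-sized hash inputs.
class PrimitiveHashDemo {
    // byte, short, and char widen to int without losing information.
    static int fromChar(char c) {
        return (int) c;
    }

    // Casting a long to int keeps only the low-order 32 bits.
    static int lowBits(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        System.out.println(fromChar('A'));       // prints 65
        System.out.println(lowBits(4294967301L)); // 2^32 + 5, prints 5
        long bits = Double.doubleToLongBits(1.0);
        System.out.println(Long.toHexString(bits)); // prints 3ff0000000000000
    }
}
```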
Hashing method: Folding
Another alternative is folding. Folding involves dividing the key into several parts and then
combining the parts.
Example
A long key is 64 bits wide. We can get its leftmost 32 bits by using the right-shift
operator in Java: >>. So to shift a long variable called key to the right 32 bits, we
would write this:
key >> 32
We can then add these bits to the rightmost bits and cast the sum to a 32-bit integer:
int hashCode = (int) (key + (key >> 32));
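As a minimal sketch of folding a long, the method below adds the two 32-bit halves and casts the sum to an int. The class name FoldLongDemo and method name fold are hypothetical. Note that the cast is applied to the whole sum; without the extra parentheses, only key would be cast and the result would still be a long.

```java
// Sketch: fold a 64-bit key into 32 bits by adding its two halves.
class FoldLongDemo {
    static int fold(long key) {
        // (key >> 32) moves the leftmost 32 bits into the rightmost 32 positions.
        return (int) (key + (key >> 32));
    }

    public static void main(String[] args) {
        long key = 0x0000000100000002L; // left half = 1, right half = 2
        System.out.println(fold(key));  // prints 3
    }
}
```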
Example
We can also create a hash code for doubles. However, casting a double to a long or an int
would only make use of the bits that make up the integer, and would discard the fractional
part. Keeping the fractional part would be more likely to generate a "more random" number
(it seems like the more bits, the more random the resulting hash value should be). The
Double wrapper class has the following method that will convert a double into a long
integer:
long bits = Double.doubleToLongBits(key);
int hashCode = (int) (bits ^ (bits >> 32));
Things to note:
• A double value in Java is represented by a 64-bit bit pattern.
• A long value in Java is also represented by a 64-bit bit pattern.
• Since they are both the same size, we can convert the double to a long without losing any bits. The result will be a 64-bit long value.
• We can then shift the number 32 bits to the right and perform an exclusive-or operation on the leftmost 32 bits with the rightmost 32 bits.
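The sketch below wraps the two lines of code in a method and shows why the bit pattern is preferred over a cast. The class name DoubleHashDemo is hypothetical. The comparison with Double.hashCode relies on the documented behavior of that method, which combines the two halves of the same bit pattern with exclusive-or.

```java
// Sketch: hash a double from its full 64-bit pattern.
class DoubleHashDemo {
    static int hash(double key) {
        long bits = Double.doubleToLongBits(key);
        return (int) (bits ^ (bits >> 32)); // xor the left half into the right half
    }

    public static void main(String[] args) {
        // Casting to int discards the fractional part, so these two keys collide.
        System.out.println((int) 1.25 == (int) 1.75);   // prints true
        // Using the bit pattern keeps the fractional part in play.
        System.out.println(hash(1.25) == hash(1.75));   // prints false
        // The result agrees with the built-in hash code of the Double wrapper.
        System.out.println(hash(1.5) == Double.valueOf(1.5).hashCode()); // prints true
    }
}
```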
There are two types of or operations – an inclusive or and an exclusive-or. Inclusive-or
means one or the other or both. The exclusive-or means one or the other, but not both.
The truth table for the exclusive-or function is:
A   B   A xor B
0   0   0
0   1   1
1   0   1
1   1   0
The last row is where it differs from the inclusive or.
Hashing method: Digit Analysis/Extraction
TABLE of frequency of occurrence of the digits 0..9 in each key position, for 2,800 five-digit keys:

Digit   Pos. 1   Pos. 2   Pos. 3   Pos. 4   Pos. 5
0         2026      250      218     1012      260
1          618      395      391      185      382
2          128      263      389      299      271
3           23      298      330       52      302
4            5      298      330       52      302
5                   335      299      101      387
6                   303      339       18      199
7                   289      308      124      301
8                   267      267      999      245
9                   400      259        0      353
Analysis:
"By this method a frequency count is performed in regard to the number of times each of
the 10 digits occurs in each of the positions included in the record key. For example,
consider the following table showing the number of times each digit occurred in a five-position numeric key for 2,800 records. In this tabulation we can observe that digits 0..9
occur with approximately uniform distribution in key positions 2, 3, and 5; therefore, if a 3-digit address were required, the digits in these three positions in the record keys could be
used. Given that there are 2,800 records, however, a four-digit address would be required.
Suppose we desire the first digit to be a 0, 1, 2, or 3 only. Such assignment can be made
with about equal frequency for each digit by using a rule such as the following: assign a 0
when digits in positions 2 and 3 both contain odd numbers, a 1 if position 2 is odd and
position 3 is even, a 2 if position 2 is even and position 3 is odd, or a 3 if positions 2 and 3
both contain even numbers. Thus, the address for key 16258 would be 3628: the 3 from
the fact that positions 2 and 3 both contain even numbers and the 628 from key positions 2,
3, and 5. Other rules for prefixing additional digits can be formulated for different
circumstances. In any event, the digit analysis method relies on the digits in some of the
key positions being approximately equally distributed. If such is not the case, the method
cannot be used with good results." (Philippakis)
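The rule in the quoted passage can be sketched as follows. The class name DigitExtractionDemo is hypothetical, and the code assumes a non-negative key of at most five digits, with positions counted from the left.

```java
// Sketch: build a 4-digit address from key positions 2, 3, and 5 plus a parity prefix.
class DigitExtractionDemo {
    static int address(int key) {
        int p2 = (key / 1000) % 10; // digit in position 2
        int p3 = (key / 100) % 10;  // digit in position 3
        int p5 = key % 10;          // digit in position 5
        boolean odd2 = p2 % 2 == 1;
        boolean odd3 = p3 % 2 == 1;

        int prefix;
        if (odd2 && odd3) {
            prefix = 0;             // both odd
        } else if (odd2) {
            prefix = 1;             // position 2 odd, position 3 even
        } else if (odd3) {
            prefix = 2;             // position 2 even, position 3 odd
        } else {
            prefix = 3;             // both even
        }
        return prefix * 1000 + p2 * 100 + p3 * 10 + p5;
    }

    public static void main(String[] args) {
        System.out.println(address(16258)); // prints 3628
    }
}
```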
Hashing Method: Mid-Square Method
The record key is multiplied by itself, and the product is truncated from both left and right
so as to form a number equal to the desired address length. Thus, key 36,258 would be
squared to give 1,314,642,564. To form a four-digit address, this number would be
truncated from both the left and the right, resulting in the address 4642.
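A minimal sketch of this calculation is shown below. It assumes, as in the example, that the square has 10 digits, so that dropping 3 digits from each side leaves the middle 4; a general version would first have to count the digits of the square. The class name MidSquareDemo is hypothetical.

```java
// Sketch: mid-square hashing for a key whose square has 10 digits.
class MidSquareDemo {
    static int address(long key) {
        long square = key * key;                 // 36258 * 36258 = 1314642564
        return (int) ((square / 1000) % 10_000); // drop 3 right digits, keep the next 4
    }

    public static void main(String[] args) {
        System.out.println(address(36258)); // prints 4642
    }
}
```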
Converting a Hash Code into an Index for the Hash Table
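The most common way to convert a hash code into an index into a hash table is to divide
the hash code by the size of the hash table and take the remainder:
index = hashCode % tableSize;
In Java, the % operator gives a negative result when the hash code is negative, so in that
case the result must be adjusted into the range 0 to tableSize - 1.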
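Hash codes in Java can be negative (for example, after an overflow), and the % operator then returns a negative remainder, which is not a valid array index. One possible fix, sketched here, is the standard Math.floorMod method; the class name IndexDemo and method name indexFor are hypothetical.

```java
// Sketch: turn any int hash code into a valid index 0..tableSize-1.
class IndexDemo {
    static int indexFor(int hashCode, int tableSize) {
        return Math.floorMod(hashCode, tableSize); // never negative for a positive tableSize
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);          // prints -2: not usable as an index
        System.out.println(indexFor(-7, 5)); // prints 3
        System.out.println(indexFor(12, 5)); // prints 2, same as 12 % 5
    }
}
```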
Resolving collisions
The problem with hashing is that we are mapping a LARGE key space into a SMALL address
space. Whenever we do this, we are bound to have COLLISIONS (sometimes called
clashes). When a collision occurs, we have a problem. We still have to put the item into the
table (or dictionary), but we can't put it into the location that it hashed to because it's
already occupied. So the challenge is to figure out where to put it.
COLLISION/CLASH: when 2 records hash into the same address.
So, how do you resolve the problem?
You need to make sure your table is big enough for all of the data that you will want to put
into it. That means that you must have some idea of how much data you will be storing. If
we leave more than enough space, all we have to do is find an extra space. One solution to
this problem is a Linear probe.
Open Addressing with a Linear probe
LINEAR PROBE: a search of subsequent memory locations until an open slot is found.
A linear probe simply looks for the next available location. If the key hashes into location
index, then we look at location index + 1. If that is occupied, we look at location index + 2,
etc., until we find a free location.
Note that if we are at the end of the array, we must "wrap around" to position 0 and
continue from there. That is, we treat the array as if it were circular.
For a linear probe to work, at creation time, mark each hash table element as being unused
(e.g. by putting the value null there).
Now assume that two values hash into the same location. The first one will get to occupy
that location. The second one, however, must look at the next few (hopefully) places and
find an unoccupied place.
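Insertion with a linear probe can be sketched as follows. The class name LinearProbeInsert and the method insert are hypothetical; the table is a plain array in which null marks an unused location, and the starting index is assumed to come from the hash function.

```java
// Sketch: insert a value using a linear probe with wrap-around.
class LinearProbeInsert {
    // Returns the location where the value was stored, or -1 if the table is full.
    static int insert(String[] table, int index, String value) {
        for (int step = 0; step < table.length; step++) {
            int slot = (index + step) % table.length; // treat the array as circular
            if (table[slot] == null) {
                table[slot] = value;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] table = new String[5];
        System.out.println(insert(table, 3, "A")); // prints 3
        System.out.println(insert(table, 3, "B")); // prints 4 (location 3 is taken)
        System.out.println(insert(table, 3, "C")); // prints 0 (wrapped around)
    }
}
```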
Example: Adding and Retrieving (from Carrano)
Assume that all 4 of the following key/data pairs hash to the same location (52):
• "555-1214", "150 Main Street"
• "555-8132", "75 Center Court"
• "555-4294", "205 Ocean Road"
• "555-2072", "82 Campus Way"
We will put "150 Main Street" into location 52.
We will put "75 Center Court" into location 53 (if it is available).
We will put "205 Ocean Road" into location 54 (if it is available).
We will put "82 Campus Way" into location 55 (if it is available).
Now, consider what happens when we want to retrieve the data associated with phone
number 555-2072.
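Our hash function will take us to location 52, but that location holds the data for
555-1214. How do we know whether the data we find is the correct data? From the data
alone, we don't!
The only way that you can know that you have retrieved the correct data is if you also store
the key with the data! Then we can compare the stored key with the key we are searching
for and keep probing (locations 53, 54, 55) until the keys match.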
Example: Deleting
Suppose we remove the objects in locations 53 and 54 by replacing their values with null.
Now when we go to retrieve the value for the key "555-2072", if we stop when we find a
null value, we won't find it!
There are two kinds of empty spaces in a hash table:
(1) Spaces that have never been occupied (and should end a search), and
(2) Spaces that have been occupied (and should NOT end a search).
Therefore, when we remove an item from a hash table, we should NOT replace it with null,
but we should mark it in some way as being available.
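The retrieval and deletion rules above can be put together in one small sketch. Everything here is an assumption made for illustration: the class name TombstoneTable, the AVAILABLE marker object, the table size of 101, and the stand-in hash function that sends every key to location 52 so that the four keys collide as in the example.

```java
// Sketch: open addressing with stored keys and an "available" marker for removed entries.
class TombstoneTable {
    // Unique marker object; compared by identity, never equal to a real key.
    private static final String AVAILABLE = new String("<available>");
    private final String[] keys = new String[101];
    private final String[] values = new String[101];

    // Stand-in hash function: every key hashes to 52, as in the example.
    private int hash(String key) {
        return 52;
    }

    void put(String key, String value) {
        int firstFree = -1;
        for (int step = 0; step < keys.length; step++) {
            int slot = (hash(key) + step) % keys.length;
            if (keys[slot] == null) {             // never occupied: key is not in the table
                if (firstFree < 0) firstFree = slot;
                break;
            }
            if (keys[slot] == AVAILABLE) {        // removed earlier: reusable location
                if (firstFree < 0) firstFree = slot;
            } else if (keys[slot].equals(key)) {  // key already present: replace its data
                values[slot] = value;
                return;
            }
        }
        if (firstFree < 0) throw new IllegalStateException("table is full");
        keys[firstFree] = key;
        values[firstFree] = value;
    }

    String get(String key) {
        for (int step = 0; step < keys.length; step++) {
            int slot = (hash(key) + step) % keys.length;
            if (keys[slot] == null) return null;  // never-occupied location ends the search
            if (keys[slot] != AVAILABLE && keys[slot].equals(key)) return values[slot];
        }
        return null;
    }

    void remove(String key) {
        for (int step = 0; step < keys.length; step++) {
            int slot = (hash(key) + step) % keys.length;
            if (keys[slot] == null) return;
            if (keys[slot] != AVAILABLE && keys[slot].equals(key)) {
                keys[slot] = AVAILABLE;           // mark as available, do NOT set to null
                values[slot] = null;
                return;
            }
        }
    }

    public static void main(String[] args) {
        TombstoneTable table = new TombstoneTable();
        table.put("555-1214", "150 Main Street");
        table.put("555-8132", "75 Center Court");
        table.put("555-4294", "205 Ocean Road");
        table.put("555-2072", "82 Campus Way");
        table.remove("555-8132");
        table.remove("555-4294");
        System.out.println(table.get("555-2072")); // prints 82 Campus Way
    }
}
```

Because removed locations are marked rather than reset to null, the search for "555-2072" passes over locations 53 and 54 and still reaches location 55.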
Clustering
When you add records and a lot of them end up in the same part of the table, this can
severely slow down your searching. You will end up with areas of your table that have
clusters of filled entries, and other areas of your table that have very few entries. This is
called clustering.
A way to scatter out the records that hash into the same location is to use a quadratic
probe instead of a linear probe. Instead of looking at position k+1, k+2, k+3, etc., a
quadratic probe looks at positions k+1, k+4, k+9, k+16, k+25, etc.
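The probe positions can be generated with a short loop. In this sketch the class name QuadraticProbeDemo and the method probeSequence are hypothetical, and the wrap-around at the end of the table is handled with the remainder operator, as for the linear probe.

```java
import java.util.Arrays;

// Sketch: the first few locations visited by a quadratic probe starting at k.
class QuadraticProbeDemo {
    static int[] probeSequence(int k, int tableSize, int count) {
        int[] slots = new int[count];
        for (int i = 1; i <= count; i++) {
            slots[i - 1] = (k + i * i) % tableSize; // k+1, k+4, k+9, ...
        }
        return slots;
    }

    public static void main(String[] args) {
        // Starting at location 52 in a table of 101 locations.
        System.out.println(Arrays.toString(probeSequence(52, 101, 5)));
        // prints [53, 56, 61, 68, 77]
    }
}
```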
A Potential Problem with Open Addressing
It is possible with the methods that we just described that, after many additions and
deletions, ALL of the records in a table will be marked either as occupied or available and
that no table entries are marked as empty, or null. Then, if we have a clash and have to
search for another location to put the data, we may actually end up searching the entire
table! This is not good – it will be very slow.
Separate Chaining
Another alternative is to allow more than one data item to go in a single table entry. When
you do this, each location in the table is called a bucket and each table entry points to a
linked list of data items that all hashed into this spot.
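A minimal sketch of separate chaining follows, using java.util.LinkedList for the buckets. The names ChainedTable and Entry are hypothetical, and the key is stored with the data so that items in the same bucket can be told apart.

```java
import java.util.LinkedList;

// Sketch: a hash table in which each location holds a linked list (bucket) of entries.
class ChainedTable {
    private static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private LinkedList<Entry> bucketFor(String key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void put(String key, String value) {
        LinkedList<Entry> bucket = bucketFor(key);
        for (Entry e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // replace existing data
        }
        bucket.add(new Entry(key, value)); // a collision simply lengthens the list
    }

    String get(String key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // one bucket forces every key to collide
        table.put("555-1214", "150 Main Street");
        table.put("555-2072", "82 Campus Way");
        System.out.println(table.get("555-2072")); // prints 82 Campus Way
    }
}
```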
The Load Factor
Our first example used a perfect hashing function: there were never any collisions, and
very few unused table locations. In real life, this rarely happens.
The load factor is the number of entries stored in the table divided by the number of
locations in the table. The fuller the table, the more likely collisions become.
We will have collisions, but we want to minimize the number of collisions that will occur.
There are several things we can do to reduce them:
• Make the table larger than the number of keys (so there will ALWAYS be unused locations).
• Try to develop a hashing function that will truly evenly distribute the keys over the hash table.
• Use a prime number for the size of the hash table.
Hash Table Size
How big should you make the hash table?
If you use open addressing, you should try to keep the table less than half full.
If you use separate chaining, the table keeps working even when it holds more items than it
has locations, but longer chains mean slower lookups.
It is also recommended that the size of the hash table be a prime number.
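These two recommendations can be combined when choosing a size for open addressing. The sketch below picks the smallest prime that keeps the table less than half full for an expected number of entries; the class name TableSizeDemo and its methods are hypothetical, and the simple trial-division primality test is only meant for small table sizes.

```java
// Sketch: choose a prime table size that keeps the table less than half full.
class TableSizeDemo {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime greater than twice the expected number of entries.
    static int tableSizeFor(int expectedEntries) {
        int size = 2 * expectedEntries + 1;
        while (!isPrime(size)) {
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        int size = tableSizeFor(700);         // the 700-person town from earlier
        System.out.println(size);             // prints 1409
        System.out.println(700.0 / size < 0.5); // prints true
    }
}
```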
Advantage of Hashing: Lookup is fast.
Disadvantage of Hashing: The data cannot be retrieved in sorted order.