Sets using hashing

HashSet<T> implementation: Hash Sets
Hashing
Preface: There are two different approaches to hash sets. We describe hashing using chaining, as
used for HashSets in the Java library. The alternative approach is called open addressing.
Hashing is a technique for implementing sets (and consequently maps). The essential idea is to
encode the client’s view of a set as a fixed-size sequence of small sets. E.g. the set of naturals {4,
7, 21, 43, 56, 63, 98, 49} might be encoded internally as the following 10-element sequence of
mini-sets: ⟨{}, {21}, {}, {43, 63}, {4}, {}, {56}, {7}, {98}, {49}⟩ (we write sets within curly
brackets and sequences within angle brackets). Each value x in the original set is here placed in
the mini-set located at index x%10. If we now insert 15 and 27, for example, the set becomes,
from the user’s viewpoint, {4, 7, 21, 43, 56, 63, 98, 49, 15, 27} and internally ⟨{}, {21}, {}, {43,
63}, {4}, {15}, {56}, {7, 27}, {98}, {49}⟩.
To determine whether x occurs, we examine the mini-set at index x%10. E.g., to determine
whether 45 occurs, we look at the mini-set at index 5 (and conclude 45 does not occur).
Generating the index (here x%10) from the item x is called hashing x.
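The worked example above can be reproduced with a few lines of Java. This is only a sketch: the class name MiniSetDemo and its methods are made up for illustration, java.util.TreeSet stands in for the mini-sets, and the values are assumed to be naturals (so x%10 is never negative).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class MiniSetDemo {
    // Build a table of tableSize mini-sets and place each value x at index x % tableSize.
    static List<TreeSet<Integer>> build(int tableSize, int... values) {
        List<TreeSet<Integer>> table = new ArrayList<>();
        for (int i = 0; i < tableSize; i++) table.add(new TreeSet<>());
        for (int x : values) table.get(x % tableSize).add(x);
        return table;
    }

    // Membership test: only the mini-set at index x % tableSize is examined.
    static boolean contains(List<TreeSet<Integer>> table, int x) {
        return table.get(x % table.size()).contains(x);
    }

    public static void main(String[] args) {
        List<TreeSet<Integer>> table = build(10, 4, 7, 21, 43, 56, 63, 98, 49, 15, 27);
        // prints [[], [21], [], [43, 63], [4], [15], [56], [7, 27], [98], [49]]
        System.out.println(table);
        System.out.println(contains(table, 45)); // prints false
    }
}
```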
We implement the fixed-size sequence as an array (called a hash table). The mini-sets are very
small in practice and so a simple implementation is ok. Typically class LinkedSet is used.
Warning! Most textbooks don’t make clear that hashing is a technique for treating a set as a
sequence of mini-sets.
Hashing objects: hashCode
Every object p in Java has a method p.hashCode() which yields an integer (it is defined in
class Object). So Math.abs(p.hashCode()%hashTable.length) hashes p to an
index in hash table hashTable.
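A small sketch of why the absolute value is taken after the remainder: hashCode() may be negative, and in Java the remainder of a negative number is negative or zero. The class HashIndexDemo is hypothetical. Math.floorMod (in the library since Java 8) is shown as an alternative; it can yield a different index, but one that is equally valid.

```java
class HashIndexDemo {
    // Index computation as in the text: remainder first, then absolute value.
    static int index(Object p, int tableLength) {
        return Math.abs(p.hashCode() % tableLength);
    }

    // Alternative: floorMod always returns a value in 0..tableLength-1.
    static int indexFloorMod(Object p, int tableLength) {
        return Math.floorMod(p.hashCode(), tableLength);
    }

    public static void main(String[] args) {
        System.out.println(index(Integer.valueOf(-27), 10));         // prints 7
        System.out.println(indexFloorMod(Integer.valueOf(-27), 10)); // prints 3
        // Safe even for the most negative hash code, because % is applied first.
        System.out.println(index(Integer.valueOf(Integer.MIN_VALUE), 1000)); // prints 648
    }
}
```

Note that the order matters: Math.abs(p.hashCode()) % n would not be safe, because Math.abs(Integer.MIN_VALUE) is itself negative.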
Implementation of HashSet<T>
class HashSet<T> {
    private LinkedSet<T>[] hashTable; // hash table

    HashSet() { // create the empty set
        // note coding trick! Java does not allow new LinkedSet<T>[1000],
        // so we create a raw array and cast it
        hashTable = (LinkedSet<T>[])(new LinkedSet[1000]);
        for (int i=0; i<hashTable.length; i++)
            hashTable[i] = new LinkedSet<T>();
    }

    private int hash(T t) { // hash t into hashTable index
        return Math.abs(t.hashCode()%hashTable.length);
    }

    int size() { // sum the sizes of all the mini-sets
        int numItems = 0;
        for (LinkedSet<T> miniSet: hashTable)
            numItems = numItems+miniSet.size();
        return numItems;
    }

    boolean contains(T t) {
        return hashTable[hash(t)].contains(t);
    }

    boolean add(T t) {
        return hashTable[hash(t)].add(t);
    }

    boolean remove(T t) {
        return hashTable[hash(t)].remove(t);
    }
}
Size of hash table
The size of the hash table should be chosen so that it’s greater than the maximum size of the set,
preferably significantly greater. To accommodate this, a second constructor is usually provided:
HashSet(int maxSize) {
    hashTable = (LinkedSet<T>[])(new LinkedSet[maxSize]);
    ... as before ...
}
Complexity
If we know an upper bound m on the maximum size of the set, and if we use a hash table of size
m, then the average time complexity of each of the basic operations is constant, i.e. O(1) – each
operation acts on a mini-set of about 1 element assuming the hash function gives a good spread
over the indices of the hash table. Even if we use a smaller hash table, say one of size m/4, the
expected time complexities remain O(1), but of course the operations will be a little bit slower.
For an example of an operation that is not cheap with hash sets, consider finding the minimum
element in the set. For TreeSet, finding the minimum element has O(log n) average time
complexity. With a hashed implementation, however, we would have to search all the mini-sets
in the hash table, and this has time complexity O(n+k) where k denotes the size of the hash table.
If the hash table gets crowded then performance deteriorates (why?). If the load factor (i.e. set
size over table size) exceeds some upper bound, say 0.75, it is common to increase (double, say)
the table size. This is an expensive operation as all the elements must be re-hashed afresh.
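The resizing step might be sketched as follows. This is not the Java library's implementation: the class ResizingHashSet is hypothetical, java.util.LinkedList stands in for LinkedSet, and the initial table length 4, the load factor bound 0.75 and the doubling policy are illustrative choices.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ResizingHashSet<T> {
    private List<LinkedList<T>> table = newTable(4); // deliberately tiny initial table
    private int size = 0;

    private static <E> List<LinkedList<E>> newTable(int length) {
        List<LinkedList<E>> t = new ArrayList<>(length);
        for (int i = 0; i < length; i++) t.add(new LinkedList<>());
        return t;
    }

    private int hash(T t, int length) {
        return Math.abs(t.hashCode() % length);
    }

    int size() { return size; }

    int tableLength() { return table.size(); }

    boolean contains(T t) {
        return table.get(hash(t, table.size())).contains(t);
    }

    boolean add(T t) {
        LinkedList<T> miniSet = table.get(hash(t, table.size()));
        if (miniSet.contains(t)) return false;
        miniSet.add(t);
        size++;
        // Load factor = set size / table size; double the table once it exceeds 0.75.
        if (size > 0.75 * table.size()) resize(2 * table.size());
        return true;
    }

    // Expensive step: every element is hashed again against the new table length.
    private void resize(int newLength) {
        List<LinkedList<T>> old = table;
        table = newTable(newLength);
        for (LinkedList<T> miniSet : old)
            for (T t : miniSet)
                table.get(hash(t, newLength)).add(t);
    }
}
```

Adding the integers 0 to 99 to such a set grows the table from 4 to 256 slots in six doublings, so the occasional expensive re-hash is spread over many cheap insertions.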
A better hash
For some applications a function hash as above, which does little more than taking an integer
modulo the table size, may be too simple because it only draws on the rightmost digits of the
integer. This may not always yield a good spread of hash table indices. For example, suppose the
set elements are student identity numbers in which the third digit from the right encodes the
student’s sex as 0 or 1 and the table size is 1000 (explain what goes wrong). We want hash to
make use of all the digits in hashCode, e.g. as follows:
private int hash(T t) {
    int val = t.hashCode(); int hashVal = 0;
    while (val!=0) {
        hashVal = hashVal+val%hashTable.length;
        val = val/hashTable.length;
    }
    return Math.abs(hashVal%hashTable.length);
}
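To see the difference in spread, the two hash functions can be compared on made-up identity numbers. In this sketch the class HashSpreadDemo and the ID layout p*1000 + s*100 + r (with s restricted to 0 or 1) are assumptions chosen to mimic the student-number example; real identity numbers may be distributed differently.

```java
class HashSpreadDemo {
    static final int TABLE_LENGTH = 1000;

    // The simple hash: remainder modulo the table length.
    static int simpleHash(int code) {
        return Math.abs(code % TABLE_LENGTH);
    }

    // The folding hash from the text, applied directly to an int hash code.
    static int foldingHash(int code) {
        int val = code, hashVal = 0;
        while (val != 0) {
            hashVal = hashVal + val % TABLE_LENGTH;
            val = val / TABLE_LENGTH;
        }
        return Math.abs(hashVal % TABLE_LENGTH);
    }

    private static int count(boolean[] used) {
        int n = 0;
        for (boolean b : used) if (b) n++;
        return n;
    }

    // Returns {indices used by simpleHash, indices used by foldingHash}.
    static int[] distinctIndices() {
        boolean[] simple = new boolean[TABLE_LENGTH];
        boolean[] folded = new boolean[TABLE_LENGTH];
        for (int p = 100; p <= 999; p++)          // leading digits
            for (int s = 0; s <= 1; s++)          // third digit from the right: 0 or 1
                for (int r = 0; r <= 99; r++) {   // last two digits
                    int id = p * 1000 + s * 100 + r;
                    simple[simpleHash(id)] = true;
                    folded[foldingHash(id)] = true;
                }
        return new int[] { count(simple), count(folded) };
    }

    public static void main(String[] args) {
        int[] d = distinctIndices();
        System.out.println(d[0] + " " + d[1]); // prints 200 1000
    }
}
```

With these IDs the simple hash reaches only 200 of the 1000 table indices, whereas the folding hash reaches all of them.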
hashCode in Java library classes
Although every class inherits a hashCode from class Object, it is usually not a good one and
so most classes in the Java library provide their own. For example, hashCode in class
Integer returns the encapsulated integer, e.g. given Integer t = Integer.valueOf(-27)
then t.hashCode() yields -27.
hashCode in String
For class String, hashCode might be coded (poorly) thus:
public int hashCode() {
    int hashVal = 0;
    for (int i=0; i<length(); i++)
        hashVal = hashVal+charAt(i);
    return hashVal;
}
(Java is like C in allowing characters to be added to integers – the ASCII value of the character is
used.) This hashCode is poor because strings that are permutations of one another return the
same hash value. The Java library avoids this by taking the position of the characters in the string
into account (the constant 31 used below is arbitrary – experiment shows it works well):
public int hashCode() {
    int hashVal = 0;
    for (int i=0; i<length(); i++)
        hashVal = hashVal*31+charAt(i);
    return hashVal;
}
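The two string hashes can be compared on a pair of permutations. The class StringHashDemo and its static methods are hypothetical stand-ins that take the string as a parameter instead of being methods of String.

```java
class StringHashDemo {
    // The poor hash: sum of character codes, so the order of characters is ignored.
    static int additiveHash(String s) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++)
            hashVal = hashVal + s.charAt(i);
        return hashVal;
    }

    // The positional hash: each step multiplies by 31 before adding the next character.
    static int positionalHash(String s) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++)
            hashVal = hashVal * 31 + s.charAt(i);
        return hashVal;
    }

    public static void main(String[] args) {
        System.out.println(additiveHash("ab") + " " + additiveHash("ba"));     // prints 195 195
        System.out.println(positionalHash("ab") + " " + positionalHash("ba")); // prints 3105 3135
        System.out.println("ab".hashCode());                                   // prints 3105
    }
}
```

The last line shows that the positional version agrees with the value documented for String.hashCode().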
hashCode in programmer classes
If you intend to put objects of class C in a hash set, where C is a class that you write yourself,
then you must define a good hashCode and not just rely on the default definition inherited
from Object. The weakness of the default hashCode is that it is typically derived from the
object’s identity (e.g. its memory address) and not from its contents. For example, execution of the code
HashSet<Person> set = new HashSet<Person>();
set.add(new Person("Bill", 1991));
System.out.println(set.contains(new Person("Bill", 1991)));
will cause false to be output (why?). The solution is to provide a hashCode based on the
contents of the object (as well as an equals based on contents). For example, a suitable
hashCode for a class Person with instance variables representing a person’s name (name of
type String) and year of birth (yob of type int) is:
public int hashCode() {return name.hashCode()+3*yob;}
The header must always be public int hashCode() – public cannot be omitted. We
here used the hashCode() method in the String class to get an integer from name, to which
we add three times the value of yob; the 3 is arbitrary.
All the primitive wrapper classes redefine hashCode() appropriately. This means, for
example, that to derive a hash value from a variable salary of type double, say, you can use
Double.valueOf(salary).hashCode() or, since Java 8, the static method Double.hashCode(salary).
As far as correct behaviour is concerned, it doesn’t matter what integer is returned by
hashCode() as long as it is compatible with equals(). By “compatible” we mean that
whenever p.equals(q) yields true, p.hashCode() and q.hashCode() yield the
same value.
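A minimal sketch of a Person class whose equals and hashCode are both based on contents is given below. The field names name and yob follow the text; the equals method and the class PersonDemo are assumptions added for illustration, and java.util.HashSet from the Java library is used so that the example is self-contained.

```java
import java.util.HashSet;
import java.util.Set;

class Person {
    private final String name;
    private final int yob; // year of birth

    Person(String name, int yob) {
        this.name = name;
        this.yob = yob;
    }

    // Two persons are equal when their contents are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Person)) return false;
        Person p = (Person) other;
        return name.equals(p.name) && yob == p.yob;
    }

    // Compatible with equals: equal persons always yield the same hash code.
    @Override
    public int hashCode() {
        return name.hashCode() + 3 * yob;
    }
}

class PersonDemo {
    static boolean lookup() {
        Set<Person> set = new HashSet<>();
        set.add(new Person("Bill", 1991));
        return set.contains(new Person("Bill", 1991));
    }

    public static void main(String[] args) {
        System.out.println(lookup()); // prints true
    }
}
```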
Iteration over a HashSet
1. Make sure LinkedSet<T> implements iteration. If you are using your own implementation
of ArrayList<T> make sure it too implements iteration.
2. Add the following method to the HashSet<T> class:
public Iterator<T> iterator() {
    ArrayList<T> items = new ArrayList<T>();
    for (LinkedSet<T> ls: hashTable)
        for (T t: ls) items.add(t);
    return items.iterator();
}
3. Write implements Iterable<T> in the header of class HashSet<T>.
The above is a lazy trick in that it just piggybacks on iteration for ArrayList. You are not
required to memorise the details of implementing iteration.
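The three steps can be tried out in a self-contained sketch. The class IterableHashSet is hypothetical, and java.util.LinkedList (which already implements iteration) stands in for LinkedSet.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class IterableHashSet<T> implements Iterable<T> {
    private final List<LinkedList<T>> hashTable = new ArrayList<>();

    IterableHashSet(int tableLength) {
        for (int i = 0; i < tableLength; i++) hashTable.add(new LinkedList<>());
    }

    private int hash(T t) {
        return Math.abs(t.hashCode() % hashTable.size());
    }

    boolean add(T t) {
        LinkedList<T> miniSet = hashTable.get(hash(t));
        if (miniSet.contains(t)) return false;
        return miniSet.add(t);
    }

    // Lazy trick from the text: copy every element into a list and reuse the list's iterator.
    @Override
    public Iterator<T> iterator() {
        ArrayList<T> items = new ArrayList<>();
        for (LinkedList<T> ls : hashTable)
            for (T t : ls) items.add(t);
        return items.iterator();
    }

    public static void main(String[] args) {
        IterableHashSet<Integer> set = new IterableHashSet<>(10);
        set.add(21); set.add(4); set.add(43); set.add(63);
        for (int x : set) System.out.print(x + " "); // prints 21 43 63 4
        System.out.println();
    }
}
```

Elements come out in table order, not insertion order. Because the iterator works on a copy, it makes an extra pass over the whole table before the first element is delivered.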