HashSet<T> implementation: Hash Sets

Hashing

Preface: There are two different approaches to hash sets. We describe hashing using chaining, as used for HashSets in the Java library. The alternative approach is called open addressing.

Hashing is a technique for implementing sets (and consequently maps). The essential idea is to encode the client’s view of a set as a fixed-size sequence of small sets. E.g. the set of naturals {4, 7, 21, 43, 56, 63, 98, 49} might be encoded internally as the following 10-element sequence of mini-sets:

    ⟨{}, {21}, {}, {43, 63}, {4}, {}, {56}, {7}, {98}, {49}⟩

(we write sets within curly brackets and sequences within angle brackets). Each value x in the original set is here placed in the mini-set located at index x%10. If we now insert 15 and 27, for example, the set becomes from the user’s viewpoint {4, 7, 21, 43, 56, 63, 98, 49, 15, 27} and internally:

    ⟨{}, {21}, {}, {43, 63}, {4}, {15}, {56}, {7, 27}, {98}, {49}⟩

To determine whether x occurs, we examine the mini-set at index x%10. E.g., to determine whether 45 occurs, we look at the mini-set at index 5 (and conclude 45 does not occur). Generating the index (here x%10) from the item x is called hashing x.

We implement the fixed-size sequence as an array (called a hash table). The mini-sets are very small in practice and so a simple implementation is ok. Typically class LinkedSet is used.

Warning! Most textbooks don’t make clear that hashing is a technique for treating a set as a sequence of mini-sets.

Hashing objects: hashCode

Every object p in Java has a method p.hashCode() which yields an integer (it is defined in class Object). So Math.abs(p.hashCode()%hashTable.length) hashes p to an index in hash table hashTable. (Math.abs is needed because hashCode() may be negative, in which case % yields a negative result.)

Implementation of HashSet<T>

    class HashSet<T> {
        private LinkedSet<T>[] hashTable;   // hash table

        HashSet() {   // create the empty set
            hashTable = (LinkedSet<T>[])(new LinkedSet[1000]);   // note coding trick!
            for (int i=0; i<hashTable.length; i++)
                hashTable[i] = new LinkedSet<T>();
        }

        private int hash(T t) {   // hash t into hashTable index
            return Math.abs(t.hashCode()%hashTable.length);
        }

        int size() {
            int numItems = 0;
            for (LinkedSet<T> miniSet: hashTable)
                numItems = numItems+miniSet.size();
            return numItems;
        }

        boolean contains(T t) { return hashTable[hash(t)].contains(t); }

        boolean add(T t) { return hashTable[hash(t)].add(t); }

        boolean remove(T t) { return hashTable[hash(t)].remove(t); }
    }

Size of hash table

The size of the hash table should be chosen so that it’s greater than the maximum size of the set, preferably significantly greater. To accommodate this, a second constructor is usually provided:

    HashSet(int maxSize) {
        hashTable = (LinkedSet<T>[])(new LinkedSet[maxSize]);
        ... as before ...
    }

Complexity

If we know an upper bound m on the size of the set, and if we use a hash table of size m, then the average time complexity of each of the basic operations is constant, i.e. O(1): each operation acts on a mini-set of about 1 element, assuming the hash function gives a good spread over the indices of the hash table. Even if we use a smaller hash table, say one of size m/4, the expected time complexities remain O(1), but of course the operations will be a little bit slower.

For an example of an operation that is not cheap with hash sets, consider finding the minimum element in the set. For TreeSet, finding the minimum element has O(log n) average time complexity. With a hashed implementation, however, we would have to search all the mini-sets in the hash table, and this has time complexity O(n+k) where n denotes the size of the set and k the size of the hash table.

If the hash table gets crowded then performance deteriorates (why?). If the load factor (i.e. set size over table size) exceeds some upper bound, say 0.75, it is common to increase (double, say) the table size. This is an expensive operation as all the elements must be re-hashed afresh.
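The resizing policy just described can be sketched as follows. This is a minimal illustration, not the Java library's implementation: the class name ResizingHashSet, the method tableSize() and the use of java.util.LinkedList as the mini-set (standing in for LinkedSet) are assumptions made so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch: a chained hash set that doubles its table when crowded.
class ResizingHashSet<T> {
    private static final double MAX_LOAD = 0.75;  // threshold suggested in the text
    private List<LinkedList<T>> table;            // each LinkedList acts as a mini-set
    private int numItems = 0;                     // current size of the set

    ResizingHashSet(int tableSize) {
        table = newTable(tableSize);
    }

    // Build a table of the given size filled with empty mini-sets.
    private static <E> List<LinkedList<E>> newTable(int size) {
        List<LinkedList<E>> t = new ArrayList<>(size);
        for (int i = 0; i < size; i++) t.add(new LinkedList<>());
        return t;
    }

    // Hash t to an index of a table with tableSize slots.
    private int hash(T t, int tableSize) {
        return Math.abs(t.hashCode() % tableSize);
    }

    int size() { return numItems; }

    int tableSize() { return table.size(); }

    boolean contains(T t) {
        return table.get(hash(t, table.size())).contains(t);
    }

    boolean add(T t) {
        LinkedList<T> miniSet = table.get(hash(t, table.size()));
        if (miniSet.contains(t)) return false;    // already present
        miniSet.add(t);
        numItems++;
        // Grow once the load factor (set size over table size) exceeds the bound.
        if ((double) numItems / table.size() > MAX_LOAD) resize();
        return true;
    }

    // Double the table size and re-hash every element into the new table.
    private void resize() {
        List<LinkedList<T>> bigger = newTable(2 * table.size());
        for (LinkedList<T> miniSet : table)
            for (T t : miniSet)
                bigger.get(hash(t, bigger.size())).add(t);
        table = bigger;
    }
}
```

Starting from a table of size 4, the fourth insertion pushes the load factor to 1.0 and triggers a resize to 8 slots. Every element has to be hashed again because its index depends on the table size.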
A better hash

For some applications a function hash as above, which does little more than taking an integer modulo the table size, may be too simple because it only draws on the rightmost digits of the integer. This may not always yield a good spread of hash table indices. For example, suppose the set elements are student identity numbers in which the third digit from the right encodes the student’s sex as 0 or 1 and the table size is 1000 (explain what goes wrong). We want hash to make use of all the digits in hashCode, e.g. as follows:

    private int hash(T t) {
        int val = t.hashCode();
        int hashVal = 0;
        while (val!=0) {
            hashVal = hashVal+val%hashTable.length;
            val = val/hashTable.length;
        }
        return Math.abs(hashVal%hashTable.length);
    }

hashCode in Java library classes

Although every class inherits a hashCode from class Object, it is usually not a good one and so most classes in the Java library provide their own. For example, hashCode in class Integer returns the encapsulated integer, e.g. given Integer t = Integer.valueOf(-27) then t.hashCode() yields -27.

hashCode in String

For class String, hashCode might be coded (poorly) thus:

    public int hashCode() {
        int hashVal = 0;
        for (int i=0; i<length(); i++)
            hashVal = hashVal+charAt(i);
        return hashVal;
    }

(Java is like C in allowing characters to be added to integers – the numeric code of the character is used, which for ASCII characters is the ASCII value.) This hashCode is poor because strings that are permutations of one another return the same hash value.
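The collision is easy to observe. In the sketch below, the class name PermutationCollision and the static method sumHash are hypothetical; sumHash simply reproduces the poor character-sum hashCode above as a standalone function so that it can be compared with the library's own String.hashCode().

```java
// Hypothetical demo class; sumHash mirrors the poor hashCode shown above.
class PermutationCollision {
    // Add up the character codes, ignoring character positions.
    static int sumHash(String s) {
        int hashVal = 0;
        for (int i = 0; i < s.length(); i++)
            hashVal = hashVal + s.charAt(i);
        return hashVal;
    }

    public static void main(String[] args) {
        String a = "ab", b = "ba";
        // 'a' is 97 and 'b' is 98, so both strings sum to 195.
        System.out.println(sumHash(a) == sumHash(b));       // prints true
        // The library's hashCode gives 3105 for "ab" and 3135 for "ba".
        System.out.println(a.hashCode() == b.hashCode());   // prints false
    }
}
```

With the character-sum version, every permutation of a string lands in the same mini-set, so a set holding many anagrams degrades towards a single linked list.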
The Java library avoids this by taking the position of the characters in the string into account (the constant 31 used below is arbitrary – experiment shows it works well):

    public int hashCode() {
        int hashVal = 0;
        for (int i=0; i<length(); i++)
            hashVal = hashVal*31+charAt(i);
        return hashVal;
    }

hashCode in programmer classes

If you intend to put objects of class C in a hash set, where C is a class that you write yourself, then you must define a good hashCode and not just rely on the default definition inherited from Object. The weakness of the default hashCode is that it is typically based on the object’s identity (e.g. its memory address) and not on its contents. For example, execution of the code

    HashSet<Person> set = new HashSet<Person>();
    set.add(new Person("Bill", 1991));
    System.out.println(set.contains(new Person("Bill", 1991)));

will cause false to be output (why?). The solution is to provide a hashCode based on the contents of the object (as well as an equals based on contents). For example, a suitable hashCode for a class Person with instance variables representing a person’s name (name of type String) and year of birth (yob of type int) is:

    public int hashCode() { return name.hashCode()+3*yob; }

The header must always be public int hashCode() – public cannot be omitted. We here used the hashCode() method in the String class to get an integer from name, to which we add three times the value of yob; the 3 is arbitrary.

All the primitive wrapper classes redefine hashCode() appropriately. This means, for example, that to derive a hash value from a variable salary of type double, say, you can use Double.valueOf(salary).hashCode().

As far as correct behaviour is concerned, it doesn’t matter what integer is returned by hashCode() as long as it is compatible with equals(). By “compatible” we mean that whenever p.equals(q) yields true, p.hashCode() and q.hashCode() yield the same value.

Iteration over a HashSet

1. Make sure LinkedSet<T> implements iteration.
   If you are using your own implementation of ArrayList<T>, make sure it too implements iteration.

2. Add the following method to the HashSet<T> class:

       public Iterator<T> iterator() {
           ArrayList<T> items = new ArrayList<T>();
           for (LinkedSet<T> ls: hashTable)
               for (T t: ls)
                   items.add(t);
           return items.iterator();
       }

3. Write implements Iterable<T> in the HashSet<T> header.

The above is a lazy trick in that it just piggybacks on iteration for ArrayList. You are not required to memorise the details of implementing iteration.
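To see the three steps working together, here is a small self-contained sketch. It is an illustration under stated assumptions rather than the course's own class: the name IterableHashSet is hypothetical, and java.util.LinkedList and java.util.ArrayList stand in for LinkedSet and a hand-written ArrayList.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for HashSet<T>; LinkedList replaces LinkedSet.
class IterableHashSet<T> implements Iterable<T> {
    private final List<LinkedList<T>> hashTable = new ArrayList<>();

    IterableHashSet(int tableSize) {
        for (int i = 0; i < tableSize; i++) hashTable.add(new LinkedList<>());
    }

    // Hash t into a hashTable index.
    private int hash(T t) {
        return Math.abs(t.hashCode() % hashTable.size());
    }

    boolean add(T t) {
        LinkedList<T> miniSet = hashTable.get(hash(t));
        if (miniSet.contains(t)) return false;   // already present
        return miniSet.add(t);
    }

    // Copy every element into an ArrayList and hand out that list's iterator.
    @Override
    public Iterator<T> iterator() {
        ArrayList<T> items = new ArrayList<>();
        for (LinkedList<T> ls : hashTable)
            for (T t : ls)
                items.add(t);
        return items.iterator();
    }

    public static void main(String[] args) {
        IterableHashSet<Integer> set = new IterableHashSet<>(10);
        for (int x : new int[] {4, 7, 21, 43}) set.add(x);
        // Elements come out in table order (indices 1, 3, 4, 7), not insertion order.
        for (int x : set) System.out.println(x);   // prints 21, 43, 4, 7 on separate lines
    }
}
```

Because the iterator walks a copy, elements added to the set after iterator() is called are not seen by that iterator, and the copy costs O(n+k) time before the first element is returned.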