CS 261 – Data Structures
Hash Tables Part II: Using Buckets

Hash Tables, Review

• Hash tables are similar to Vectors, except…
  – Elements can be indexed by values other than integers
  – A single position may hold more than one element
• Arbitrary values (hash keys) map to integers by means of a hash function
• Computing a hash function is usually a two-step process:
  1. Transform the value (or key) into an integer
  2. Map that integer to a valid hash table index
• Example: storing names
  – Compute an integer from a name
  – Map the integer to an index in a table (i.e., a vector, array, etc.)

Hash Tables

Say we're storing names: Angie, Joe, Abigail, Linda, Mark, Max, Robert, John.
The hash function maps each name to a table index:

  0: Linda
  1: Angie, Robert
  2: Joe, Max, John
  3: (empty)
  4: Abigail, Mark

Hash Tables: Resolving Collisions

There are two general approaches to resolving collisions:
1. Open address hashing: if a spot is full, probe for the next empty spot
2. Chaining (or buckets): keep a linked list at each table entry

Today we will look at option 2.

Resolving Collisions: Chaining / Buckets

Maintain a collection at each hash table entry:
– Chaining/buckets: maintain a linked list (or another collection data structure, such as an AVL tree) at each table entry

(Diagram: each table entry 0–4 holds the head pointer of a linked list; colliding names, e.g. Robert and Angie, sit in the same list.)

Combining arrays and linked lists

struct hashTable {
    struct link ** table;  /* array initialized to null pointers */
    int tablesize;
    int count;             /* number of elements */
    …
};

Hash table init

void hashTableInit (struct hashTable *ht, int size) {
    int i;
    ht->count = 0;
    ht->table = (struct link **) malloc(size * sizeof(struct link *));
    assert(ht->table != 0);
    ht->tablesize = size;
    for (i = 0; i < size; i++)
        ht->table[i] = 0;  /* null pointer */
}

Adding a value to a hash table

void hashTableAdd (struct hashTable * ht, EleType newValue) {
    /* find correct bucket, add to list */
    int indx = abs(hashfun(newValue)) % ht->tablesize;
    struct link * newLink = (struct link *) malloc(sizeof(struct link));
    assert(newLink != 0);
    newLink->value = newValue;
    newLink->next = ht->table[indx];  /* add to front of bucket */
    ht->table[indx] = newLink;
    ht->count++;
    /* note: next step: reorganize if load factor > 3.0 */
}

Contains test, remove

• Contains: find the correct bucket, then see if the element is there
• Remove: slightly trickier, because you only want to decrement the count if the element is actually in the list
• Alternative: instead of keeping a count in the hash table, call count on each list. What are the pros and cons of this?

Remove: need to track the previous link

• Since we have only single links, remove is tricky
• Solutions: use double links (too much work), or keep a previous pointer
• We have seen this before: keep a pointer to the previous node, trailing behind the current node

prev = 0;
for (current = ht->table[indx]; current != 0; current = current->next) {
    if (EQ(current->value, testValue)) {
        … remove it
    }
    prev = current;
}

Two cases: prev is null or not

prev = 0;
for (current = ht->table[indx]; current != 0; current = current->next) {
    if (EQ(current->value, testValue)) {
        if (prev == 0)
            ht->table[indx] = current->next;  /* removing the first link */
        else
            prev->next = current->next;
        free(current);
        ht->count--;
        return;
    }
    prev = current;
}

Hash Table Size

• Load factor: λ = n / m, where n is the number of elements and m is the size of the table
  – So the load factor represents the average number of elements at each table entry
• We want the load factor to remain small
• We can use the same trick as open address hashing: if the load factor becomes larger than some fixed limit (say, 3.0), double the table size

Hash Tables: Algorithmic Complexity

• Assumptions:
  – Time to compute the hash function is constant
  – Chaining uses a linked list
  – Worst case: all values hash to the same position
  – Best case: the hash function uniformly distributes the values (all buckets hold the same number of elements)
• Find element operation:
  – Worst case for open addressing: O(n)
  – Worst case for chaining: O(n), or O(log n) if each bucket is an AVL tree
  – Best case for open addressing: O(1)
  – Best case for chaining: O(1)
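The contains test described above (find the right bucket, then walk its list) can be sketched as follows. This is a minimal, self-contained sketch that assumes integer keys and a trivial hash of abs(value) % tablesize; the function names hashTableContains and hashIndex are illustrative, not required by the worksheet.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal bucket hash table with integer keys (sketch). */
struct link { int value; struct link *next; };
struct hashTable { struct link **table; int tablesize; int count; };

/* Trivial hash: map the integer key straight to a table index. */
static int hashIndex(struct hashTable *ht, int value) {
    return abs(value) % ht->tablesize;
}

void hashTableInit(struct hashTable *ht, int size) {
    ht->count = 0;
    ht->tablesize = size;
    ht->table = (struct link **)calloc(size, sizeof(struct link *));
    assert(ht->table != 0);
}

void hashTableAdd(struct hashTable *ht, int value) {
    int indx = hashIndex(ht, value);
    struct link *newLink = (struct link *)malloc(sizeof(struct link));
    assert(newLink != 0);
    newLink->value = value;
    newLink->next = ht->table[indx];  /* push onto front of bucket */
    ht->table[indx] = newLink;
    ht->count++;
}

/* Contains: locate the bucket, then scan its linked list. */
int hashTableContains(struct hashTable *ht, int value) {
    struct link *current;
    for (current = ht->table[hashIndex(ht, value)]; current != 0;
         current = current->next)
        if (current->value == value)
            return 1;
    return 0;
}
```

Note that contains never touches any bucket except the one the key hashes to, which is where the O(λ) average cost comes from.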
Hash Tables: Average Case

• Assuming that the hash function distributes elements uniformly (a BIG if)…
• …the average case for all operations is O(λ)
• So you want to keep the load factor relatively small
• You can do this by resizing the table (doubling its size) if the load factor grows larger than some fixed limit, say 10
• But that only improves things IF the hash function distributes values uniformly
• What happens if the hash value is always zero? Every element lands in bucket 0, the table degenerates into one long linked list, and operations become O(n) no matter how large the table is

So when should you use hash tables?

• Your data values must be objects with good hash functions defined (e.g., String, Double)
• Or you need to write your own definition of hashCode
• You need to know that the values are uniformly distributed
• If you can't guarantee that, a skip list or AVL tree is often faster

Your turn

• Now do the worksheet to implement a hash table with buckets
• Run down the linked list for the contains test
• Think about how to do remove
• Keep track of the number of elements
• Resize the table if the load factor is bigger than 3.0

Questions??
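As a companion to the worksheet's resize step, here is one possible sketch of doubling the table when the load factor exceeds 3.0. It assumes integer keys and a trivial abs(value) % tablesize hash; resizeTable and the other names are illustrative, not part of the assignment. The key point: every element's index depends on tablesize, so each node must be relinked into its new bucket after doubling.

```c
#include <assert.h>
#include <stdlib.h>

/* Bucket hash table with integer keys (sketch). */
struct link { int value; struct link *next; };
struct hashTable { struct link **table; int tablesize; int count; };

void hashTableInit(struct hashTable *ht, int size) {
    ht->count = 0;
    ht->tablesize = size;
    ht->table = (struct link **)calloc(size, sizeof(struct link *));
    assert(ht->table != 0);
}

/* Double the table and rehash every element into its new bucket. */
void resizeTable(struct hashTable *ht) {
    struct link **oldTable = ht->table;
    int oldSize = ht->tablesize, i;
    ht->tablesize = 2 * oldSize;
    ht->table = (struct link **)calloc(ht->tablesize, sizeof(struct link *));
    assert(ht->table != 0);
    for (i = 0; i < oldSize; i++) {
        struct link *current = oldTable[i];
        while (current != 0) {
            struct link *next = current->next;   /* save before relinking */
            int indx = abs(current->value) % ht->tablesize;
            current->next = ht->table[indx];     /* push into new bucket */
            ht->table[indx] = current;
            current = next;
        }
    }
    free(oldTable);  /* nodes were reused; only the old array is freed */
}

void hashTableAdd(struct hashTable *ht, int value) {
    int indx = abs(value) % ht->tablesize;
    struct link *newLink = (struct link *)malloc(sizeof(struct link));
    assert(newLink != 0);
    newLink->value = value;
    newLink->next = ht->table[indx];
    ht->table[indx] = newLink;
    ht->count++;
    if ((double)ht->count / ht->tablesize > 3.0)  /* load factor check */
        resizeTable(ht);
}

int hashTableContains(struct hashTable *ht, int value) {
    struct link *cur;
    for (cur = ht->table[abs(value) % ht->tablesize]; cur != 0; cur = cur->next)
        if (cur->value == value)
            return 1;
    return 0;
}
```

Notice that resizing reuses the existing links rather than re-allocating them, so the cost of a resize is O(n) pointer updates, amortized away over the adds that triggered it.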
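The slides also note that you sometimes have to write your own hash function for keys like names. One possible sketch of the two-step process (string to integer, then integer to index) uses the well-known djb2 string hash; any function that spreads keys uniformly would do, and the names hashString and hashIndexFor are illustrative.

```c
#include <assert.h>

/* Step 1: transform the string key into a (large) integer.
   djb2: start at 5381, then hash = hash * 33 + c for each character. */
unsigned long hashString(const char *key) {
    unsigned long hash = 5381;
    int c;
    while ((c = *key++) != 0)
        hash = hash * 33 + (unsigned long)c;
    return hash;
}

/* Step 2: map that integer onto a valid table index. */
int hashIndexFor(const char *key, int tablesize) {
    return (int)(hashString(key) % (unsigned long)tablesize);
}
```

Because step 1 is deterministic, equal keys always land in the same bucket; whether the buckets fill evenly depends entirely on how well step 1 scatters your actual key set.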