CS 261 – Data Structures Hash Tables Part II: Using Buckets

Hash Tables, Review
• Hash tables are similar to Vectors except…
– Elements can be indexed by values other than integers
– A single position may hold more than one element
• Arbitrary values (hash keys) map to integers by means of a
hash function
• Computing a hash function is usually a two-step process:
1. Transform the value (or key) to an integer
2. Map that integer to a valid hash table index
• Example: storing names
– Compute an integer from a name
– Map the integer to an index in a table (i.e., a vector, array, etc.)
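The two steps can be sketched in C. This is a minimal illustration, not the hash function from the course; the names hashString and hashIndex are made up for this example:

```c
/* Step 1: transform the string key into an integer (simple
   multiplicative hash; illustrative only). */
int hashString(const char *key) {
    int h = 0;
    for (int i = 0; key[i] != '\0'; i++)
        h = h * 31 + key[i];
    return h;
}

/* Step 2: map that integer to a valid table index. */
int hashIndex(const char *key, int tablesize) {
    int h = hashString(key);
    if (h < 0) h = -h;        /* keep the index non-negative */
    return h % tablesize;
}
```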
Hash Tables
Say we’re storing names: Angie, Joe, Abigail, Linda, Mark, Max, Robert, John
The hash function maps each name to a table index; after inserting all eight names the table looks like:
0: Angie, Robert
1: Linda
2: Joe, Max, John
3: (empty)
4: Abigail, Mark
Hash Tables: Resolving Collisions
There are two general approaches to resolving collisions:
1. Open address hashing: if a spot is full, probe for next empty spot
2. Chaining (or buckets): keep a linked list at each table entry
Today we will look at option 2
Resolving Collisions: Chaining / Buckets
Maintain a collection at each hash table entry:
– Chaining/buckets: maintain a linked list (or other collection-type data structure, such as an AVL tree) at each table entry
For the names above, each bucket holds a linked list:
0: Robert → Angie
1: Linda
2: Max → John → Joe
3: (empty)
4: Mark → Abigail
Combining arrays and linked lists …
struct hashTable {
struct link ** table; // array initialized to null pointers
int tablesize;
int count; // number of elements
…
};
Hash table init
void hashTableInit (struct hashTable * ht, int size) {
int i;
ht->count = 0;
ht->table = (struct link **) malloc(size * sizeof(struct link *));
assert(ht->table != 0);
ht->tablesize = size;
for (i = 0; i < size; i++)
ht->table[i] = 0; /* null pointer */
}
Adding a value to a hash table
void add (struct hashTable * ht, EleType newValue) {
// find correct bucket, add to list
int indx = abs(hashfun(newValue)) % ht->tablesize;
struct link * newLink = (struct link *) malloc(sizeof(struct link));
assert(newLink != 0);
newLink->value = newValue; newLink->next = ht->table[indx];
ht->table[indx] = newLink; /* add to bucket */
ht->count++;
// note: next step: reorganize if load factor > 3.0
}
Contains test, remove
• Contains: Find correct bucket, then see if the element is
there
• Remove: Slightly more tricky, because you only want to
decrement the count if the element is actually in the list.
• Alternatives: instead of keeping a count in the hash table, we
could call count on each list. What are the pros and cons of this?
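A minimal sketch of the contains test in C, using the struct from the earlier slide. EleType, EQ, and hashfun here are illustrative stand-ins (EleType is taken to be int for the example):

```c
#include <stdlib.h>

/* Illustrative types; the worksheet's EleType, EQ, and hashfun may differ. */
typedef int EleType;
#define EQ(a, b) ((a) == (b))

struct link {
    EleType value;
    struct link *next;
};

struct hashTable {
    struct link **table;
    int tablesize;
    int count;
};

int hashfun(EleType v) { return v; }   /* stand-in hash function */

/* Contains: hash to the right bucket, then walk that bucket's list. */
int hashTableContains(struct hashTable *ht, EleType testValue) {
    int indx = abs(hashfun(testValue)) % ht->tablesize;
    for (struct link *cur = ht->table[indx]; cur != 0; cur = cur->next)
        if (EQ(cur->value, testValue))
            return 1;
    return 0;
}
```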
Remove - need to change previous
• Since we have only single links, remove is tricky
• Solutions: use double links (too much work), or use previous
pointers
• We have seen this before. Keep a pointer to the previous node, trail
after current node
prev = 0;
for (current = ht->table[indx]; current != 0; current = current->next) {
if (EQ(current->value, testValue)) { /* … remove it */
}
prev = current;
}
Two cases, prev is null or not
prev = 0;
for (current = ht->table[indx]; current != 0; current = current->next) {
if (EQ(current->value, testValue)) {
if (prev == 0) ht->table[indx] = current->next;
else prev->next = current->next;
free(current);
ht->count--;
return;
}
prev = current;
}
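Putting the loop into a complete function, it might look like the sketch below. The types and hashfun are illustrative stand-ins (EleType is taken to be int), and the function name is an assumption:

```c
#include <stdlib.h>

/* Illustrative types; the worksheet's EleType, EQ, and hashfun may differ. */
typedef int EleType;
#define EQ(a, b) ((a) == (b))

struct link { EleType value; struct link *next; };
struct hashTable { struct link **table; int tablesize; int count; };

int hashfun(EleType v) { return v; }   /* stand-in hash function */

/* Remove: trail a prev pointer behind current; unlink, free, and only
   then decrement count, so count changes only if the value was found. */
void hashTableRemove(struct hashTable *ht, EleType testValue) {
    int indx = abs(hashfun(testValue)) % ht->tablesize;
    struct link *prev = 0;
    for (struct link *current = ht->table[indx]; current != 0;
         current = current->next) {
        if (EQ(current->value, testValue)) {
            if (prev == 0)
                ht->table[indx] = current->next;  /* first node in bucket */
            else
                prev->next = current->next;
            free(current);
            ht->count--;
            return;
        }
        prev = current;
    }
}
```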
Hash Table Size
• Load factor: λ = n / m, where n = number of elements and m = size of the table
– So, the load factor represents the average number of elements at
each table entry
• Want the load factor to remain small
• Can do the same trick as open address hashing: if the load factor
becomes larger than some fixed limit (say, 3.0), then
double the table size
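A sketch of the resize step in C: allocate a table twice the size, then move every link into its new bucket (reusing the existing nodes, so nothing new is allocated). The types, hashfun, and the name resizeTable are illustrative assumptions:

```c
#include <stdlib.h>
#include <assert.h>

typedef int EleType;                   /* illustrative element type */

struct link { EleType value; struct link *next; };
struct hashTable { struct link **table; int tablesize; int count; };

int hashfun(EleType v) { return v; }   /* stand-in hash function */

/* Double the table and rehash every element into its new bucket. */
void resizeTable(struct hashTable *ht) {
    int oldSize = ht->tablesize;
    struct link **oldTable = ht->table;

    ht->tablesize = 2 * oldSize;
    ht->table = (struct link **) malloc(ht->tablesize * sizeof(struct link *));
    assert(ht->table != 0);
    for (int i = 0; i < ht->tablesize; i++)
        ht->table[i] = 0;              /* start with empty buckets */

    for (int i = 0; i < oldSize; i++) {
        struct link *cur = oldTable[i];
        while (cur != 0) {
            struct link *next = cur->next;
            int indx = abs(hashfun(cur->value)) % ht->tablesize;
            cur->next = ht->table[indx];   /* push node onto new bucket */
            ht->table[indx] = cur;
            cur = next;
        }
    }
    free(oldTable);
}
```

Note that elements can land in different buckets after the resize, because the modulus changes; that is why every node must be rehashed rather than copied bucket-for-bucket.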
Hash Tables: Algorithmic Complexity
• Assumptions:
– Time to compute the hash function is constant
– Chaining uses a linked list
– Worst case analysis: all values hash to the same position
– Best case analysis: the hash function uniformly distributes the
values (all buckets have the same number of elements)
• Find element operation:
– Worst case for open addressing: O(n)
– Worst case for chaining: O(n), or O(log n) if using an AVL tree
– Best case for open addressing: O(1)
– Best case for chaining: O(1)
Hash Tables: Average Case
• Assuming that the hash function distributes elements
uniformly (a BIG if)
• Then the average case for all operations is O(λ)
• So you want to try to keep the load factor relatively
small.
• You can do this by resizing the table (doubling its size) if
the load factor is larger than some fixed limit, say 10
• But that only improves things IF the hash function
distributes values uniformly.
• What happens if hash value is always zero?
So when should you use hash tables?
• Your data values must be types with good hash
functions defined (e.g., string, Double)
• Or you need to write your own definition of hashCode
• You need to know that the values are uniformly distributed
• If you can’t guarantee that, then a skip list or AVL tree is
often faster
Your turn
• Now do the worksheet to implement a hash table with
buckets
• Run down linked list for contains test
• Think about how to do remove
• Keep track of number of elements
• Resize table if load factor is bigger than 3.0
Questions??