Hashing (Ch. III.11)

Data Structure := (Set; Operations-on-Set)

Examples:
Table = ({items}, {Insert, Delete, Search})
List  = ({items}, {Insert, Delete, Search})
Stack = ({items}, {Push, Pop})
Queue = ({items}, {Enqueue, Dequeue})

Note that stack and queue do not support the search operation. Insert, Delete and Search are called the dictionary operations.

Typically Operations-on-Set ⊆ {Insert, Modify, Delete, Search, Find-Max, Find-Min, etc. ...}

Comparison with algebraic structures in mathematics:
- similarity: same basic idea; an algebraic structure = (Set; Operations-on-Set). Example: a group (A, +), e.g. (real numbers, addition) or ({0,1,2}, + modulo 3).
- difference: the set in a mathematical structure is fixed and often infinite; in CS the set is always finite and items are repeatedly added and removed (the set is dynamic). In fact, inserting and deleting items are central to data structure work, are performed a great number of times, and therefore need to be very fast.

Goals:
1. Design: find the data structure most appropriate for the task at hand. Example: priority queues for scheduling.
2. Implementation: should satisfy software engineering criteria, both general ones such as modularity and scalability, and more specific ones such as encapsulation, information hiding, inheritance, etc. Example: a heap for implementing priority queues.
3. Performance: the Operations-on-Set should be as fast as possible; ideally the worst-case or average time is of constant order, or at most of order lg n. Example: worst-case asymptotic times for priority queues implemented with a heap: T-Insert/Change = O(lg n), T-Extract = O(lg n).

Hashing

Goal: develop a data structure for the dictionary operations, i.e. Insert, Delete, Search, implemented as fast and as elegantly as possible; there is no need to worry about max, min, etc.

First thing that comes to mind: a table, implemented as an array, with direct access to every element (see Figure 11.1, p.223).
Pros: O(1) for all dictionary operations.
Cons: the array needs one slot per possible key, so it requires lots of space and may be outright impossible if we work with a large number of potential items/keys. (A small code sketch of such a direct-address table follows the list of main ideas below.)

Main Ideas of Hash Tables:
- The number of keys that can potentially occur in the problem may be very large, but the number of values that actually need to be stored is relatively small.
- We could make a rule (function) that assigns not one but a whole set of key values to a single table slot, taking the risk that relatively few values will be assigned to the same slot in a particular run of the problem. Thus we generalize the direct-access idea from "one key value to one table slot" to "a set of key values to one table slot" (see Figure 11.2, p.225).
- To make sure only a few values go to the same slot we need to come up with a clever rule for assigning values to slots. Intuitively this can be accomplished by dispersing, in a hodgepodge way, or "hashing", the values "evenly" across the table slots -- a hash function.
- Under very unfortunate circumstances all values can hash to a single slot, making the worst-case time of order n, O(n) = O(#items), a depressing result. The average time, however, is proportional to the average number of values hashing to a slot. Assuming the table has m slots and our hash function disperses the n values evenly over these slots, the average list length should be n/m, making the average table access time O(n/m) = O(#items/#table-slots).
- Of course, we still cannot avoid that more than one key hashes to a single slot: when this occurs there is a collision, and we need a collision resolution policy.
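As promised above, here is a minimal sketch of a direct-address table in Python. It is only an illustration, not from the text: the class name DirectAddressTable and the toy universe size are our own choices.

# A direct-address table (cf. Figure 11.1): one slot per possible key,
# so Insert, Delete and Search are all O(1), but the array must be as
# large as the entire universe U of keys.

class DirectAddressTable:
    def __init__(self, universe_size):
        self.slot = [None] * universe_size   # one slot for each key in U

    def insert(self, key, value):            # O(1)
        self.slot[key] = value

    def search(self, key):                   # O(1); None if absent
        return self.slot[key]

    def delete(self, key):                   # O(1)
        self.slot[key] = None

t = DirectAddressTable(1000)   # fine for |U| = 1000 ...
t.insert(42, "x")
print(t.search(42))            # -> x
# ... but for 32-bit keys the array would need 2**32 slots, even to
# store just a handful of items: hence hashing.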
Now let's work out the details. Let
U – universe of keys;
K ⊆ U – set of keys actually stored in the table, |K| = n;
T[0 .. m-1] – array implementing the table, with m slots.

load factor α = n/m = #items / #table-slots

Hash function: h: U → {0, 1, 2, ..., m-1}

Collision Resolution by Chaining (see Figure 11.3, p.225):
The first thing that comes to mind is simply to make a linked list of all keys with the same hash value and attach it to the corresponding table slot h(k). Let k be some key, i.e. an element of U, while x is a pointer to an item/element to be stored; the element has a key, key[x] ∈ U, and assorted satellite data (for more detail review Section 10.2 on linked lists). Assume a doubly linked list where each element x carries: key[x] – the key, prev[x] – pointer to the previous element, next[x] – pointer to the next element.

CHAINED-HASH-INSERT(T, x)   //p.227
  insert x at the beginning of the list at T[h(key[x])]      O(1)

CHAINED-HASH-SEARCH(T, k)   //p.227
  search for k in the list at T[h(k)]      O(length of list at T[h(k)])

CHAINED-HASH-DELETE(T, x)   //p.227
  delete x from the list at T[h(key[x])]      O(1) if the list is doubly linked

(A code sketch of these operations follows the analysis below.)

Note that
- if we are deleting based on some key k and not on a pointer x to the element, we must first search for k in the list at T[h(k)]: the time is then proportional to the length of the list at h(k), as in searching;
- if the list is not doubly linked, i.e. the prev link is missing and only the next link is provided, we must still search for x in order to find the element preceding x and update its next link to point to the element after x. The worst case occurs when x is last, requiring a traversal of the whole list and making the time again proportional to the length of the list at T[h(k)] (see Section 10.2, and p.226).

Analysis of hashing with chaining: how long does it take to search for a key?

Worst case: O(n) – trivial (all n keys hash to a single slot).
Average case: Θ(1 + α), where α is the load factor.

Proof idea: assume each key is equally likely to be hashed to any slot, or equivalently each slot is equally likely to get a key -- simple uniform hashing.

Cost of an unsuccessful search (Theorem 11.1, p.227): Θ(1 + α)
- O(1) to compute the hash function and access slot T[h(k)];
- search the entire list at the slot; the average list length is the load factor α.

Cost of a successful search (Theorem 11.2, p.227): Θ(1 + α/2 - α/(2n)) = Θ(1 + α)
- O(1) to access slot T[h(k)];
- the average cost over the n keys k stored in the table is
  T = (1/n) Σ_k (avg-number-elements-examined-before-k + 1).

Intuitively, as k is equally likely to be the key of the first element, of the last element, or of any element in between, the avg-number-elements-examined-before-k is α/2.

More precisely: how many elements are examined before finding k depends on when k was inserted, or on how many elements were added after k (new elements are inserted at the front of the list):
avg-number-elements-examined-before-k = avg-number-elements-added-after-k.

Thus if k was added
1st:  avg-number-elements-added-after-k = (n-1)/m
2nd:  avg-number-elements-added-after-k = (n-2)/m
3rd:  avg-number-elements-added-after-k = (n-3)/m
...
i-th: avg-number-elements-added-after-k = (n-i)/m,   with i = 1, 2, ..., n

T = (1/n) Σ_{i=1}^{n} (1 + (n-i)/m)
  = 1 + (1/(nm)) Σ_{i=1}^{n} (n-i)
  = 1 + (1/(nm)) (n^2 - n(n+1)/2)
  = 1 + (n-1)/(2m)
  = 1 + α/2 - α/(2n)
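The chained-hash operations above translate almost line for line into code. Below is a minimal sketch in Python; the class name and the division-method hash inside it are our illustrative choices, and the chains are deques rather than the doubly linked lists of the text, so deletion here searches by key and costs O(chain length) instead of the O(1) pointer-based CHAINED-HASH-DELETE.

# Hashing with chaining (cf. Figure 11.3). Each slot holds a chain of
# (key, value) pairs; colliding keys share a chain.

from collections import deque

class ChainedHashTable:
    def __init__(self, m):
        self.m = m                              # number of slots
        self.table = [deque() for _ in range(m)]

    def _h(self, key):
        return key % self.m                     # division-method hash

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: prepend to the chain, O(1)
        self.table[self._h(key)].appendleft((key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: O(length of chain at h(key))
        for k, v in self.table[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        # Delete by key, not by pointer: scan the chain, O(chain length)
        chain = self.table[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return

t = ChainedHashTable(m=7)
for k in (10, 17, 24):        # 10, 17, 24 all hash to slot 3: collisions
    t.insert(k, str(k))
print(t.search(17))           # -> 17, found by scanning the chain at slot 3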
What is a good hash function? One that disperses values evenly, so that for most x ≠ y we have h(x) ≠ h(y). How to do this?

(a) Reason systematically through probabilities, and ensure that the condition of simple uniform hashing is met, i.e. the probability that a key hashes to slot j is the same for all slots:

P(j) = Σ_{k : h(k) = j} P(k) = 1/m for every slot j.

Unfortunately we usually do not know the probability distribution of the keys and cannot check this.

(b) Use ad hoc methods that try to destroy any dependency patterns in the data, and thus make hash values independent of each other:
- with a fixed hash function: the division method, the multiplication method;
- with a hash function chosen at random for each run: universal hashing.

Division method (Ch. 11.3.1): h(k) = k mod m

The simplest thing that can come to mind, in fact. There are some caveats, however:

1. Do not choose m = 2^p (a power of 2) when working with binary numbers: choosing m = 2^p makes the hash function h(k) depend only on the p lowest bits of k. Let k have a (w+1)-bit representation

k = k_w·2^w + k_{w-1}·2^{w-1} + ... + k_{p+1}·2^{p+1} + k_p·2^p + k_{p-1}·2^{p-1} + k_{p-2}·2^{p-2} + ... + k_1·2^1 + k_0·2^0
  = (k_w·2^{w-p} + k_{w-1}·2^{w-1-p} + ... + k_{p+1}·2^1 + k_p)·2^p + (k_{p-1}·2^{p-1} + k_{p-2}·2^{p-2} + ... + k_1·2^1 + k_0·2^0).

Dividing by m = 2^p yields

k/m = (k_w·2^{w-p} + k_{w-1}·2^{w-1-p} + ... + k_{p+1}·2^1 + k_p)            <- integer part of the division, irrelevant for h(k)
    + (k_{p-1}·2^{p-1} + k_{p-2}·2^{p-2} + ... + k_1·2^1 + k_0·2^0)·2^{-p}   <- fractional part; its bits are the remainder of the division by m, i.e. h(k)

Example: w+1 = 8, p = 4,
k = 1101 1010 = 1·2^7 + 1·2^6 + 0·2^5 + 1·2^4 + 1·2^3 + 0·2^2 + 1·2^1 + 0·2^0 = (1101)·2^4 + (1010)·2^0.
Dividing by m = 2^4 yields
k/m = 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0 + 1·2^{-1} + 0·2^{-2} + 1·2^{-3} + 0·2^{-4} = (1101)·2^0 + (1010)·2^{-4},
with integer part (1101) and fractional part (1010), so h(k) = 1010.

Similarly, do not choose m a power of 10 (or of radix d) for decimal (or radix-d) applications.

2. If k is a character string in radix-2^p representation and m = 2^p - 1, any pair of strings that are identical except for a transposition of two adjacent characters hash to the same slot.

3. Good values for m are primes not too close to exact powers of 2. See the examples on p.231.

Multiplication method (Ch. 11.3.2): h(k) = floor(m·(k·A mod 1)) with m = 2^p and 0 < A = s/2^w < 1, i.e. 0 < s = A·2^w < 2^w.

Example: k = 13, p = 3, w = 4; choose s = 9, i.e. A = 9/16, close to the golden-ratio value (√5 - 1)/2 = 0.618033...
Then k·A = 13·9/16 = 117/16 = 7 + 5/16, so k·A mod 1 = 5/16 and h(13) = floor(8·5/16) = floor(2.5) = 2.
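Both methods fit in a few lines of Python. The sketch below uses the example's parameters as reconstructed above (s = 9, so A = 9/16, with m = 2^3 and w = 4); the function names and the sample modulus 701 are our illustrative choices.

import math

def hash_division(k, m):
    # Division method: h(k) = k mod m; pick m a prime
    # not too close to an exact power of 2, e.g. m = 701.
    return k % m

def hash_multiplication(k, p=3, w=4, s=9):
    # Multiplication method: h(k) = floor(m * (k*A mod 1)),
    # with m = 2**p and A = s / 2**w. In w-bit integer arithmetic:
    # (k*s) mod 2**w equals (k*A mod 1) scaled by 2**w, and its
    # top p bits are h(k).
    frac = (k * s) % (2 ** w)        # fractional part of k*A, times 2**w
    return (frac * (2 ** p)) >> w    # floor(m * (k*A mod 1))

print(hash_division(13, 7))                   # -> 6

# The worked example from the notes: k = 13, A = 9/16, m = 8.
print(hash_multiplication(13))                # -> 2
print(math.floor(8 * ((13 * 9 / 16) % 1)))    # same value via the definition: 2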