Lecture 3 ADT Map/Dictionary, Hash Tables, Skip Lists TDDC32 Lecture notes in Design and Implementation of a Software Module in Java 21 January 2013 Tommy Färnqvist, IDA, Linköping University 3.1 Lecture Topics Innehåll 1 ADT Map/Dictionary 1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1 3 2 Hash Tables 2.1 Collision Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Choosing a Hash Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 4 6 3 Skip Lists 7 1 3.2 ADT Map/Dictionary 1.1 Definitions ADT Map • Domain: sets of entries/pairs (key, value) The sets are partial functions mapping keys to values! • Typical operations: – size() the number of pairs in the set – isEmpty() check if the set is empty – get(k) get the info associated with k or null if there is no such key – put(k, v) add (k, v) to the set an return null if k is new; otherwise replace value by v and return the old value – remove(k) remove entry (k, v) and return v; return null if there is no such entry 3.3 ADT Map • Examples: – Course database: (code, name) – Associative memory: (address, value) – Sparse matrix: ((row, column), value) – Lunch menu: (day, course) • Static Map: no updates allowed • Dynamic Map: updates are allowed 3.4 1 Example: operations performed on initially empty Map M operation isEmpty() put(5,A) put(7,B) put(2,C) put(8,D) put(2,E) get(7) get(4) get(2) size() remove(5) remove(2) get(2) isEmpty() utdata true null null null null C B null E 4 A E null false M 0/ (5,A) (5,A),(7,B) (5,A),(7,B),(2,C) (5,A),(7,B),(2,C),(8,D) (5,A),(7,B),(2,E),(8,D) (5,A),(7,B),(2,E),(8,D) (5,A),(7,B),(2,E),(8,D) (5,A),(7,B),(2,E),(8,D) (5,A),(7,B),(2,E),(8,D) (7,B),(2,E),(8,D) (7,B),(8,D) (7,B),(8,D) (7,B),(8,D) 3.5 ADT Dictionary • Domain: sets of pairs (key, value) The sets are relations between keys and values! • Typical operations: – size() the number of pairs in the set – isEmpty() check if the set is empty – find(k) return some entry with key k or null if no pair exists – findAll(k) return an iterable collection of all entries with key k – insert(k, v) add (k, v) and return the new entry – remove(k, v) remove and return pair (k, v); return null if there is no such pair – entries() return iterable collection of all entries 3.6 ADT Dictionary • Example: – Swedish-English dictionary . . . , (jakt, yacht), (jakt, hunting), . . . – Telephone directory (multiple numbers allowed) – Relation between liuid and courses taken – Lunch menu (multiple courses): (day, course) • Static Dictionary: no updated allowed • Dynamic Dictionary: updates are allowed 3.7 Example: operations on initially empty Dictionary D operation isEmpty() insert(5,A) insert(7,B) insert(2,C) insert(8,D) insert(2,E) find(7) find(4) find(2) findAll(2) size() remove(find(5)) find(5) utdata true (5,A) (7,B) (2,C) (8,D) (2,E) (7,B) null (2,C) (2,C),(2,E) 5 (5,A) null D 0/ (5,A) (5,A),(7,B) (5,A),(7,B),(2,C) (5,A),(7,B),(2,C),(8,D) (5,A),(7,B),(2,C),(8,D),(2,E) (5,A),(7,B),(2,C),(8,D),(2,E) (5,A),(7,B),(2,C),(8,D),(2,E) (5,A),(7,B),(2,C),(8,D),(2,E) (5,A),(7,B),(2,E),(8,D),(2,E) (5,A),(7,B),(2,E),(8,D),(2,E) (7,B),(2,C),(8,D),(2,E) (7,B),(2,C),(8,D),(2,E) 3.8 2 1.2 Implementation Implementations: Map, Dictionary • Table/array: sequence of memory chunks of equal size – Unordered: no particular order between T [i] and T [i + 1] – Ordered: . . . but here T [i] < T [i + 1] • Linked list – Unordered – Ordered • (Binary) Search Trees • Hashing • Skip Lists 3.9 Table representation of Dictionary Unordered table: find by linear search • unsuccessful look-up: n comparisons ⇒ O(n) time • successful look-up, worst case: n comparisons ⇒ O(n) time • successful look-up, average case with uniform distribution of requests: comparisons ⇒ O(n) time 1 n (1 + 2 + . . . + n) = n+1 2 3.10 Table representation of Dictionary Ordered table (keys in Map/Dictionary are linearly ordered): find by binary search • look-up: O(log n) time • . . . updates are expensive!! 3.11 Search tree representation of Dictionary Balanced search tree: • look-up: O(log n) time • updates also in O(log n) time 3.12 Can we do better? Yes, using hash tables • Idea: given a table T [0, . . . , max] to store elements in. . . . . . for each element find a suitable table index • Find a function h such that h(key) ∈ [0, . . . , max] and (ideally) such that k1 6= k2 ⇒ h(k1 ) 6= h(k2 ) • Store each key-value pair (k, v) in T [h(k)] 3.13 2 Hash Tables Hash Table • In practice, hash functions do not give unique values (are not injective) • We need conflict (or collision) resolution . . . and • We need to find a good hash function 3.14 3 2.1 Collision Resolution Collision Resolution Two principles for handling collisions: • Chaining: keep conflicting data in linked lists – Separate chaining: Keep a linked list of the colliding items outside the table – Coalesced chaining: Store all items inside the table • Open addressing: Store all items inside the table and let some algorithm decide what index to use in case of a collision [Swe: Separat länkning, samlad länkning, öppen adressering] 3.15 Chaining 0 1 2 41 28 54 3 4 18 5 6 7 8 9 10 36 10 90 12 11 Separate chaining 12 38 25 Coalesced chaining 3.16 Open addressing 3.17 Example: Hashing with Separate Chaining • Hash table of size 13 • Hash function h with h(k) = k mod 13 • Store 10 integer keys: 54, 10, 18, 25, 28, 41, 38, 36, 12, 90 0 1 2 41 28 54 3 4 5 18 6 7 8 9 10 36 10 90 12 11 12 38 25 3.18 4 Separate Chaining: find Given: key k, hash table T , hash function h • compute h(k) • search for k in the list pointed to by T [h(k)] Notation: probe = one access to the linked list data structure • • • • 1 probe for accessing the list header (if non-empty) 1+1 probes for accessing the contents of the first element 1+2 probes for accessing the contents of the second element ... A probe (just pointer de-referencing) takes constant time. How many probes P are needed to retrieve a hash table entry? [Swe: probe=sondering] 3.19 Separate Chaining: Unsuccessful Look-up • n data items • m positions in the table Worst case: • All items have the same hash value: P = 1 + n Average case: • Hash values equally distributed among m: • Average length α of the list: α = n/m – α = n/m is called the load factor • P = 1+α 3.20 Separate Chaining: Successful Look-up Average case: • Access to T [h(k)] (beginning of list L): 1 • Traversing L ⇒ k found after: |L|/2 • Expected (or average) |L| corresponds to α, thus: P = α/2 + 1 3.21 Coalesced Chaining: items inside table • Place data items in the table • Extend them with pointers • Resolve collisions by using the first free slot Chain may contain keys with different hash values. . . . . . but all keys with same hash value appears in the same chain + Better space utilization - Table may become full - Long chains (look-up takes longer time) 3.22 Coalesced Chaining 3.23 5 Open addressing • Store all elements inside the table • Use a fixed algorithm fo find a free slot Sequential/Linear Probing • desired hash index j = h(k) • in case of conflict go the the next free position • if at the end of the table, go to the beginning. . . • Close positions rapidly filled (primary clustering) • How to do remove(k)? 3.24 Open addressing — remove() The element to remove may be part of a collision chain. How can we tell? It it is part fo a chain, we cannot just remove it! • since all keys are stored, re-hash all remaining data? • Scan elements after, re-hash or compact when suitable, stop at first free slot. . . ? • Ignore – insert a “deleted” marker if the next slot is non-empty. . . 3.25 Double Hashning – or what to do on a collision? • Second hash function h2 computes increments in case of conflicts • Increment beyond the table is taken modulo m = tableSize Linear probing is double hashing with h2 (k) = 1 Requirements on h2 : • h2 (k) 6= 0 for all k • For each k, h2 (k) has no common divisors with m ⇒ all table positions can be reached A common choice: h2 (k) = q − (k mod q) for q < m, m prime (i.e., pick a prime less than the table size!) 2.2 3.26 Choosing a Hash Function What is a good hash function? Assume k is a natural number. Hashing should give a uniform distribution of hash values, but this depends on the distribution of keys in the data to be hashed. Example: Hashing of last names in a (Swedish) student group • hash function: the ASCII-value of the last character bad choice: majority of the names end with ’n’. 3.27 6 Example Hash Functions • Memory address – Reinterpret the memory address of the object as an integer (default hash code of all Java objects) – Good in general, except for numeric and string keys • Integer cast – Reinterpret the bits of the key as an integer – Suitable for keys of length less than or equal to the number of bits of the integer type • Component sum – We partition the bits of the key into components of fixed length (e.g. 16 or 32 bits) and sum the components (ignoring overflows) – Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type 3.28 Example Hash Functions • Polynomial accumulation – Partition the bits of the key into a sequence of components of fixed length (e.g. 8, 16, or 32 bits) a0 a1 an−1 – Evaluate the polynomial p(z) = a0 + a1 z + a2 z2 + . . . + an−1 zn−1 at a fixed value z, ignoring overflows – Especially suitable for strings. (E.g. z = 33 gives at most 6 collisions on a set of 50000 English words.) • Polynomial p(z) can be evaluated in O(n) time using Horner’s rule: – The following polynomials are successively computed, each from the previous one in O(1) time p0 (z) = an−1 pi (z) = an−i−1 + zpi−1 (z) (i = 1, 2, . . . , n − 1) • We have p(z) = pn−1 (z) 3.29 Hashing by integer division Let m be the table size h(k) = k mod m Avoid • m = 2d : hashing gives d last bits of k • m = 10d : hashing gives d last decimal digits Usually, prime number is suggested for m Check samples of real data to experiment with hash parameters See http://burtleburtle.net/bob/hash/doobs.html for other opinions on the matter. 3 3.30 Skip Lists Skip List • • • • • A hierarchical linked list. . . A randomized alternative for implementation of ADT Dictionary Insertion uses randomization (“coin flipping”) Good expected-case performance Worst-case performance in skip lists is very unlikely (>250 items, the risk of a search time more than 3 times the expected is below 10−6 ) 3.31 7 Skip List data structure • • • • Levels L1 , . . . , Lh of nodes (key, value) Some nodes appear in several levels (towers) Special keys: −∞ and +∞ . . . smaller/larger than any real key. . . Several levels of doubly linked lists, less dense higher up – Level 1: all nodes in a doubly linked list between −∞ and +∞ in order by the ’<’-relation – On average half of the nodes in Li appear in Li+1 – Special keys −∞ and +∞ appear on all levels – Only −∞ and +∞ appear on level Lh 3.32 Example: Skip List 3.33 Searching When searching for a key k: • Follow the list of the highest level. . . – Stop before we pass a ki > k (we risk to overshoot what we are looking for) – If found, return it, otherwise. . . • We have stopped at a level: – We have found the key? – No, switch to next lower level (via the “last tower”) and continue searching – Returns: The largest key ki ≤ k (which might be +∞) 3.34 Searching Search for key k: • Similarity with binary search – but for lists • Example: find(18) 3.35 8 Insert function INSERT(x) P ← FIND(x) if P.value < x then insert new list cell after P “toss a coin” to decide how high this “tower” should be: while “toss a coin”=yes do increase the tower height one step (possibly increase height of skip list) 3.36 Example: insert(20) 3.37 Delete. . . and properties • Similar to insert: – Search – If found, remove and fix the link between the towers • Worst case execution time of find, insert, and remove on a skip list of n items is O(n + h) • But expected execution time (assuming uniform distribution of keys) is O(log n) if search starts at height blog nc 3.38 Voluntary Homework Problem In a hash table of size 13, use linear probing to insert:14, 43, 28, 66, 79, 19, 1, 21, 72, 29 Describe where each element is placed and comment/explain why 9 3.39