J. Maluszynski, IDA, Linköpings Universitet, 2004. TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 2 Page 1 TDDB57 DALG – Lecture 4: ADT Array, Hashing . TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 4 Arrays with elements of different size Different elements X i may have different size si, l E M I L s1 =4 s3 =5 X Y L I S S I s2 =2 M U R P H Y s3 =6 i J. Maluszynski, IDA, Linköpings Universitet, 2004. u J. Maluszynski, IDA, Linköpings Universitet, 2004. a function from an integer interval l u of indices to a value set ADT array Lecture Overview: (1)Arrays, (2)Hashing Array: Operations: [Lewis/Denenberg pp. 130–131] ADT Array [Lewis/Denenberg pp. 133–134] [Lewis/Denenberg p. 139–140] [Lewis/Denenberg 6.1] [Lewis/Denenberg 8.3] Access X i returns the value of X at index i Length X returns the number of elements u l 1 Assign X i v changes the value of X at index i to v Initialize X v assign v to every element of X Special case: strings with additional operations Multidimensional arrays Representation of arrays list representations ADT Set, ADT Dictionary Hashing techniques – Chaining: separate or coalesced – Open addressing: Linear Probing or Double Hashing [Lewis/Denenberg 8.5] J. Maluszynski, IDA, Linköpings Universitet, 2004. – space for pointers in addition to space for data – time needed to follow them + fast substitution by moving pointers instead of moving data elements of large size si Hash Functions Represented as table of pointers to elements: Page 3 l TDDB57 DALG – Lecture 4: ADT Array, Hashing . Length or u must be stored Alternatively, a sentinel value may be used as end marker instead of keeping u. (for constant element size s) Contiguous Representation of Arrays Elements of array X stored in a table Start address X Each element has size s i-th element stored starting at address X s i Access in O 1 time Iteration: sequence of addresses computed as above until u reached TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 5 4 2 J. Maluszynski, IDA, Linköpings Universitet, 2004. Inf Idx Next 25 5 Inf Idx Next 81 9 majority of the elements are null elements. Linked list representation of arrays Used for sparse arrays: 0 13 A FirstInRow 1 4 11 13 22 31 43 FirstInCol column-compressed A 11 Row 4 1 3 5 22 43 13 31 1 number of nonzero entries Inf Idx Next 0 4 0 0 25 0 0 0 81 0 0 0 – access in O m time in the worst case, with m For a sparse matrix we may use + list items linked in two-dimensions (Right, Down pointers; row/col index) + compressed storage using one-dimensional arrays 11 0 0 0 22 0 0 0 0 0 0 0 31 0 3 43 0 0 0 4 S y x S J. Maluszynski, IDA, Linköpings Universitet, 2004. y Page 7 y Col 1 row-compressed uncompressed TDDB57 DALG – Lecture 4: ADT Array, Hashing . x ADT Set Operations on a set S: MakeEmptySet creates and returns a new, empty set IsMember S x searches in S for an element x, returns true iff x IsEmptySet S returns true iff S 0/ Insert S x adds x to S Delete S x removes x from S Size S returns the number of elements in S IsEqual S1 S2 returns S1 S2 Union S1 S2 returns S1 S2 Intersection S1 S2 returns S1 S2 Difference S1 S2 returns S1 S2 For ordered sets (order function ): FindMin S returns the least element x in S: TDDB57 DALG – Lecture 4: ADT Array, Hashing . Sets Page 6 J. Maluszynski, IDA, Linköpings Universitet, 2004. Set: collection of (mutually different) elements from a single universe. Multiset: an element may be contained multiple times Elements in a set are not necessarily ordered, but an available (total) order can be used for more efficient retrieval of elements. J. Maluszynski, IDA, Linköpings Universitet, 2004. We consider only finite sets for representation in a computer. Examples: Page 8 Set of cities in a country Membership list of an association Set of declared names maintained in a compiler Set of words known to a spell-checking program ... TDDB57 DALG – Lecture 4: ADT Array, Hashing . Dictionaries Data sets: data items retrieved by a key. Examples: Telephone directory, person name English-Swedish dictionary, English word Railway timetable, destination Identfier list in compiled program, identifier ... Static Dictionary: no updates allowed Dynamic Dictionary: updates allowed Ideal Hashing LookUp(D,k) h(k) = j j T Page 9 i J. Maluszynski, IDA, Linköpings Universitet, 2004. J. Maluszynski, IDA, Linköpings Universitet, 2004. TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 10 Store each item k i in T h k Page 12 Find a function h (hash function) that maps keys to indices of T in a unique way: k1 k2 h k1 h k2 and that can be computed in time O 1 . Represent the data set as a table T (hash table). TDDB57 DALG – Lecture 4: ADT Array, Hashing . Hashing: Resolving Conflicts J. Maluszynski, IDA, Linköpings Universitet, 2004. J. Maluszynski, IDA, Linköpings Universitet, 2004. Open Addressing: the index for a conflicting value is determined by an algorithm Coalesced Chaining: the data are placed in the table and linked a chain may include data with different hash values Separate Chaining: the hash table includes only a pointer to the list Chaining: conflicting data elements are represented as linked lists Page 11 k h k2 In practice, hash functions are not unique (not injective): k2 but h k1 TDDB57 DALG – Lecture 4: ADT Array, Hashing . Hashing ADT Dictionary How to achieve (O 1 time) retrieval from a Set or Dictionary? Domain: A dictionary is a set of pairs k i k K key, i I information The keys are unique and linearly ordered. A conflict / collision: Operations on a dictionary D MakeEmptyDictionary K I creates an empty dict. for keys from K, inf. from I LookUp D k searches in D for a pair (item) with key k; returns k i if found, Λ otherwise. IsEmpty D returns true iff D contains no items. Insert D k i adds to D an item k i if k not yet in D, overwrites old information entry for k by i, otherwise. Delete D k removes from D the item with key k, if existing. TDDB57 DALG – Lecture 4: ADT Array, Hashing . k1 J. Maluszynski, IDA, Linköpings Universitet, 2004. TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 14 Hashing with Separate Chaining: LookUp Page 13 Hashing with Separate Chaining: Example Given: key k, hash table T , hash function h TDDB57 DALG – Lecture 4: ADT Array, Hashing . store 10 integer keys: 54, 10, 18, 25, 28, 41, 38, 36, 12, 90 k mod 13 54 25 1. compute h k 10 38 2. search for k in the list pointed by T h k n J. Maluszynski, IDA, Linköpings Universitet, 2004. 1 probe for accessing the list header (if nonempty) 1 1 probes for accessing the contents of the first element 1 2 probes for accessing the contents of the second element ... Notation: probe = one access to the linked list data structure using a hash table of size 13 36 12 28 90 18 41 and hash function h with h k 0 1 2 3 4 5 6 7 8 9 10 11 12 J. Maluszynski, IDA, Linköpings Universitet, 2004. Successful lookup: Page 16 1 1 2 probes Average case: Expected number P of probes for given key k k found after L 3 2 Access to T h k (beginning of a list L) Traversing L α 2 Expected L is α, thus: P J. Maluszynski, IDA, Linköpings Universitet, 2004. How many probes P are needed to retrieve a hash table entry? A probe (pointer dereferencing and comparison ) takes constant time. TDDB57 DALG – Lecture 4: ADT Array, Hashing . TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 15 Hashing with Separate Chaining: Successful lookup Hashing with Separate Chainining: Unsuccessful lookup 1 n data items m positions in the table α Unsuccessful lookup: 1 Worst case: P all items have the same hash value Average case: Hash values equally distributed among m: Average length of the list α n m α n m is called the load factor P TDDB57 DALG – Lecture 4: ADT Array, Hashing . Coalesced Chaining (1) 41 10 28 36 12 18 90 54 38 Page 17 25 Eliminate the first probe by storing the first list item in the table entry 0 1 2 3 4 5 6 7 8 9 10 11 12 The increase in space consumption is acceptable if key fields are small or hash table is quite full Page 19 from end of the table go to the beginning 1 mod m – Delete ?? 0 1 2 3 4 5 6 7 8 9 J. Maluszynski, IDA, Linköpings Universitet, 2004. probes 5 times Insert(62) J. Maluszynski, IDA, Linköpings Universitet, 2004. 10 12 23 34 45 99 TDDB57 DALG – Lecture 4: ADT Array, Hashing . Coalesced Chaining (2) Page 18 Place data items in the table Extend them with pointers 0 1 2 3 4 15 1 41 0 Page 20 Two approaches: J. Maluszynski, IDA, Linköpings Universitet, 2004. Rehashing: the deleted position is made into empty; all subsequent positions in the probe sequence are deleted and re-inserted to the hash table Deleted bit: the deleted position is marked but kept in the probe sequence. J. Maluszynski, IDA, Linköpings Universitet, 2004. – table may become full + exploits space better with about the same average time Chains may contain keys with different hash values, but all keys with the same hash value appear in the same list. Lookup: Resolve collisions by using the first free slot Insert(41) Insert(15) Insert(1) Insert(0) ... 5 6 7 8 9 10 11 12 TDDB57 DALG – Lecture 4: ADT Array, Hashing . Two Deletion Techniques for Open Adressing Open adressing: a position is in probe sequence cannot be made into empty position TDDB57 DALG – Lecture 4: ADT Array, Hashing . Open Addressing with Linear Probing hk Sequence of probed table positions: H k l hk H k 0 1 H k l l mod m hk – close positions rapidly filled (primary clustering) Sequential / Linear Probing desired hash index j in case of conflict go to the next free position TDDB57 DALG – Lecture 4: ADT Array, Hashing . 10 12 22 32 43 99 10 0 1 2 3 4 5 6 7 8 9 Lookup(32) Delete(22) Page 21 99 32 43 12 10 Deletion under Open Adressing 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 12 22 d 32 43 0 1 2 3 4 5 6 7 6 Lookup(32) Lookup(32) Lookup stops WRONG!!! 10 12 32 43 99 "Rehashing" technique J. Maluszynski, IDA, Linköpings Universitet, 2004. J. Maluszynski, IDA, Linköpings Universitet, 2004. TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 22 mod m = h k q 1 l h2 k k. k mod q for q size T mod m m prime, m prime. J. Maluszynski, IDA, Linköpings Universitet, 2004. Lab 2: Open Adressing with Linear Probing and Re-hashing The assignment: implement a hash table open adressing, linear probing, deletion by re-hashing; the keys are strings the associated information is a type information; may be empty. insert and delete are to be implemented by a single write operation: – write K, empty delete K – write K, T delete K; insert K,T Page 24 J. Maluszynski, IDA, Linköpings Universitet, 2004. A skeleton code is given. The assignment is to implement the operations look-up and write. TDDB57 DALG – Lecture 4: ADT Array, Hashing . What is a good hash function? Bad choice: majority of the names ends with n hashing last names in a (swedish) student group hash function: the ASCII-value of the last character Example: Hashing should give a uniform distribution of hash values, but this depends on the distribution of keys in the data to be hashed. Open addressing with Double Hashing 9 7 "Deleted bit" technique 8 9 8 99 Page 23 Assume k is a natural number. TDDB57 DALG – Lecture 4: ADT Array, Hashing . Double Hashing Requirements on h2: k second hashing function h2 computes increments in case of conflicts h2 k increment beyond the table is taken modulo m, with m H k 0 hk H k l 1 H k l Linear probing is double hashing with h2 k h2 k 0 h2 k For each k, h2 k has no common divisor with m all table positions can be reached by the computed increments A common choice for h2: TDDB57 DALG – Lecture 4: ADT Array, Hashing . Page 26 n 1 J. Maluszynski, IDA, Linköpings Universitet, 2004. Page 25 TDDB57 DALG – Lecture 4: ADT Array, Hashing . 0 Uniform distribution!!! TDDB57 DALG – Lecture 4: ADT Array, Hashing . Hashing by Multiplication Page 28 Choose an irrational number θ Choose table size m hk m kθ kθ 5 1982 87 2 4 9 1 6 3 8 J. Maluszynski, IDA, Linköpings Universitet, 2004. 7 2000 23 J. Maluszynski, IDA, Linköpings Universitet, 2004. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 10 Hashing by Multiplication: A property of irrational numbers Hashing by Modulo-Division θ an irrational number. n denote: j m size of the table 12 The hash function is Example 1968 22 For i fi j k mod m 1 hk iθ iθ 0 1 ; all of them are distinct. fi Notice: fi We put them in order: fi1 fi2 fin Avoid: m 2d : hashing gives d last bits of k m 10d : hashing of decimal numbers gives last d digits For every n, there are at most three different lengths of the intervals fi j Prime numbers suggested for m n falls in a maximum length interval Check samples of real data to experiment with hash parameters ... ... J. Maluszynski, IDA, Linköpings Universitet, 2004. f7 0.326 100 Page 27 0 61803399 f6 0.708 φ 1, m 1954 57 1953 95 5 1 2 f5 0.090 hashing with θ TDDB57 DALG – Lecture 4: ADT Array, Hashing . 1 φ 1774 33 f4 0.472 input value : hash value : θ 1 3 8 Example 4 6 A nice θ is the reciprocal of golden ratio φ 2 7 9 Approximate distribution f1 f2 f3 0.618 0.236 0.854 5 10 ! 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0