Hashing: The Magic Container Interface
• Main methods:
  – void put(Object)
  – Object get(Object) … returns null if not in the container
  – void remove(Object)
• Goal: methods are O(1)! (usually)
• Implementation details:
  – hash table: the storage bin
  – hashFunction(object): tells where the object should go
  – collision resolution strategy: what to do when two objects "hash" to the same location
• In Java, all objects have a default int hashCode(), but it is usually better to define your own (except for strings).
• String hashing in Java is good.

Hash Functions
• Goal: map objects into the table so that the distribution is uniform.
• Tricky to do.
• Examples for a string s (a Java sketch of two of these appears after the Rehashing slide):
  – multiply the ASCII codes, then mod tableSize
    • nearly always even, so bad
  – sum the ASCII codes, then mod tableSize
    • the sum may be too small for a large table
  – shift the bits of the ASCII codes
    • Java allows this with << and >>
  – Java does a good job with Strings.

Example Problem
• Suppose we are storing the numeric ids of customers, maybe 100,000 of them.
• We want to check whether a person is delinquent; there are usually fewer than 400 delinquents.
• Use an array of size 1000 for the delinquents.
• Put each id in at id mod tableSize.
• Clearly fast for getting and removing.
• But what happens when entries collide?

Separate Chaining
• Array of linked lists (see the Java sketch after the Rehashing slide).
• The hash function determines which list to search.
• May or may not keep the individual lists in sorted order.
• Problems:
  – needs a very good hash function, which may not exist
  – worst case: O(n)
  – extra space for the links
• Another approach: Open Addressing
  – everything goes into the array, somehow
  – several approaches: linear probing, quadratic probing, double hashing, rehashing

Linear Probing
• Store the information (or pointers to the objects) in the array.
• Linear probing (sketched in Java after the Rehashing slide):
  – When inserting an object, if its location is filled, find the first unfilled position: look at h(x) + f(i) (mod tableSize), where f(i) = i.
  – When getting an object, start at the hash address and do a linear search until you find the object or a hole.
  – primary clustering: blocks of filled cells build up
  – harder to insert than to find an existing element
  – load factor λ = the fraction of the array that is filled
  – expected number of probes for:
    • insertion (or failed search): (1/2)(1 + 1/(1 - λ)^2)
    • successful search: (1/2)(1 + 1/(1 - λ))

Expected number of probes

  Load factor   Failure (insertion)   Success
      .1               1.11             1.06
      .2               1.28             1.13
      .3               1.52             1.21
      .4               1.89             1.33
      .5               2.5              1.50
      .6               3.6              1.75
      .7               6.0              2.17
      .8              13.0              3.0
      .9              50.5              5.5

Quadratic Probing
• Idea: f(i) = i^2 (or some other quadratic function).
• Problem: if the table is more than 1/2 full, there is no guarantee of finding any space!
• Theorem: if the table is less than 1/2 full and the table size is prime, then an element can always be inserted.
• Good: quadratic probing eliminates primary clustering.
• Quadratic probing has secondary clustering (minor):
  – elements that hash to the same address follow the same probe sequence

Proof of theorem
• Theorem: the first P/2 probe locations are distinct (P = table size, a prime).
  – Suppose not.
  – Then there are distinct i and j, both less than P/2, that probe the same location.
  – So h(x) + i^2 ≡ h(x) + j^2 (mod P).
  – So i^2 ≡ j^2 (mod P).
  – So (i + j)(i - j) ≡ 0 (mod P).
  – Since P is prime, P must divide i + j or i - j.
  – But i ≠ j and both are less than P/2, so i + j and i - j are nonzero and smaller than P in absolute value; P divides neither.
  – Contradiction.

Double Hashing
• Goal: spread out the probe sequence.
• f(i) = i * hash2(x), where hash2 is another hash function.
• Dangerous: a bad hash2 can make this very bad.
• It also may not eliminate any of the problems above.
• In the best case, it's great.

Rehashing
• All of these methods degrade when the table becomes too full.
• Simplest solution:
  – create a new table, twice as large
  – rehash everything into it
  – this is O(N), so it hurts if done often
  – with quadratic probing, rehash when the table is 1/2 full
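To make the string-hashing tradeoffs above concrete, here is a minimal Java sketch of the "sum of ASCII codes" hash next to a shift-based one. The class and method names (StringHashing, sumHash, shiftHash) are mine, not from the slides, and the table size is an arbitrary prime chosen for the example.

    public class StringHashing {

        // Sum of character codes, mod table size. As noted above, for short
        // strings the sum is small, so only the low slots of a big table get used.
        public static int sumHash(String s, int tableSize) {
            int h = 0;
            for (int i = 0; i < s.length(); i++)
                h += s.charAt(i);
            return h % tableSize;
        }

        // Shift-based hash in the spirit of Java's own String.hashCode():
        // h = 31*h + c, where 31*h is computed with a shift as (h << 5) - h.
        public static int shiftHash(String s, int tableSize) {
            int h = 0;
            for (int i = 0; i < s.length(); i++)
                h = (h << 5) - h + s.charAt(i);
            return Math.abs(h % tableSize);   // h may overflow to negative
        }

        public static void main(String[] args) {
            System.out.println(sumHash("cat", 10007));    // 312: stuck near the bottom
            System.out.println(shiftHash("cat", 10007));  // spread across the table
        }
    }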
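The separate-chaining scheme itself fits in a few lines. Below is a minimal sketch for String keys; the names (ChainedHashSet and its methods) are illustrative, and the lists are kept unsorted, which the slides allow.

    import java.util.LinkedList;

    public class ChainedHashSet {
        private LinkedList<String>[] table;   // one linked list per array slot

        @SuppressWarnings("unchecked")
        public ChainedHashSet(int size) {
            table = new LinkedList[size];
            for (int i = 0; i < size; i++)
                table[i] = new LinkedList<>();
        }

        private int hash(String key) {
            return Math.abs(key.hashCode() % table.length);
        }

        // The hash function determines which list to search; with a bad hash
        // function one list holds everything and this degrades to O(n).
        public boolean contains(String key) {
            return table[hash(key)].contains(key);
        }

        public void add(String key) {
            if (!contains(key))
                table[hash(key)].add(key);
        }

        public void remove(String key) {
            table[hash(key)].remove(key);
        }
    }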
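Open addressing with linear probing, including the doubling step from the Rehashing slide, can be sketched as follows. This assumes nonnegative integer keys, as in the delinquent-customer example; the class name is illustrative, the rehash-at-half-full policy follows the quadratic-probing theorem above, and quadratic probing itself would only change the probe step from f(i) = i to f(i) = i^2.

    public class LinearProbingSet {
        private Integer[] table;    // null marks an empty cell (a "hole")
        private int occupied = 0;

        public LinearProbingSet(int size) {
            table = new Integer[size];
        }

        private int hash(int x) {
            return x % table.length;    // assumes x >= 0
        }

        // Insert: if the home location is filled, scan forward (f(i) = i)
        // to the first unfilled position.
        public void put(int x) {
            if (2 * (occupied + 1) > table.length)
                rehash();                           // keep load factor <= 1/2
            int i = hash(x);
            while (table[i] != null) {
                if (table[i] == x) return;          // already present
                i = (i + 1) % table.length;
            }
            table[i] = x;
            occupied++;
        }

        // Find: start at the hash address and search until the object or a hole.
        public boolean contains(int x) {
            int i = hash(x);
            while (table[i] != null) {
                if (table[i] == x) return true;
                i = (i + 1) % table.length;
            }
            return false;                           // hit a hole: never inserted
        }

        // Rehashing as on the slide above: new table twice as large, reinsert all.
        private void rehash() {
            Integer[] old = table;
            table = new Integer[2 * old.length];
            occupied = 0;
            for (Integer x : old)
                if (x != null) put(x);
        }
    }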
Extendible Hashing: Uses Secondary Storage
• Suppose the data does not fit in main memory.
• Goal: reduce the number of disk accesses.
• Suppose there are N records to store and M records fit in one disk block.
• Result: 2 disk accesses for a find (~4 for an insert).
• Let D be the maximum number of bits such that 2^D < M.
  – This sizes the root, or directory (one disk block).
• Algorithm:
  – hash on the first D bits; this yields a pointer to a disk block (a leaf)
• Expected number of leaves: (N/M) log2 e ≈ 1.44 N/M
• Expected directory size: O(N^(1 + 1/M) / M)
• The analysis is theoretically difficult, and an implementation needs more details.

Applications
• Compilers: keep track of variables and their scope (symbol tables).
• Graph theory: associate an id with a name (general technique).
• Game playing: e.g., in chess, keep track of positions already considered and evaluated (evaluation may be expensive).
• Spelling checkers: at least to check that a word is right.
  – But how do you suggest the correct word?
• Lexicons / book indices.

HashSets vs HashMaps
• HashSets store objects.
  – supports adding and removing in constant time
• HashMaps store pairs (key, value).
  – this is an implementation of a Map
• HashMaps are more useful and standard.
• A HashMap's main methods are:
  – put(Object key, Object value)
  – get(Object key)
  – remove(Object key)
• All done in expected O(1) time.

Lexicon Example
• Input: a text file (N words) and a content-word file (the keys, M words).
• Output: the content words in order, with their page numbers.
• Algorithm (a Java sketch follows the Memoization slide):
  – Define entry = (content word, linked list of page numbers); initially each list is empty.
  – Step 1: read the content-word file and make a HashMap mapping each content word to an empty list.
  – Step 2: read the text file, checking whether each word is in the HashMap; if it is, add the page number, otherwise continue.
  – Step 3: use the iterator method to walk through the HashMap and put the entries into a sortable container.

Lexicon Example: Complexity
• Complexity:
  – step 1: O(M), where M is the number of content words
  – step 2: O(N), where N is the size of the text file
  – step 3: O(M log M) at most
  – total: O(max(N, M log M))
• Dumb algorithm:
  – sort the content words into a balanced tree: O(M log M)
  – look up each text word in the content-word tree and update: O(N log M)
  – total complexity: O(N log M)
• With N = 500 * 2000 = 1,000,000 and M = 1000 (so log M ≈ 10):
  – smart algorithm: ~1,000,000 steps; dumb algorithm: ~1,000,000 * 10 = 10,000,000.

Memoization
• Recursive Fibonacci:

    fib(n) = if (n < 2) return 1
             else return fib(n-1) + fib(n-2)

• Use hashing to store the intermediate results (a runnable Java version follows the Lexicon sketch below):

    HashMap ht;
    fib(n) =
        ans = ht.get(n);
        if (ans != null) return ans;
        if (n < 2) ans = 1;
        else ans = fib(n-1) + fib(n-2);
        ht.put(n, ans);
        return ans;
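Here is the three-step lexicon algorithm above as a runnable Java sketch. The input parsing is simplified: the arrays contentWords, textWords, and pages stand in for reading the two files, and all names are illustrative.

    import java.util.*;

    public class Lexicon {
        public static void main(String[] args) {
            String[] contentWords = {"hash", "probe", "table"};
            String[] textWords = {"the", "hash", "table", "uses", "a", "probe", "hash"};
            int[] pages        = { 1,     1,      2,       2,      3,   3,       4    };

            // Step 1: one HashMap entry per content word, with an empty page list.  O(M)
            Map<String, List<Integer>> index = new HashMap<>();
            for (String w : contentWords)
                index.put(w, new LinkedList<>());

            // Step 2: one expected-O(1) lookup per text word; record the page.  O(N)
            for (int i = 0; i < textWords.length; i++) {
                List<Integer> pageList = index.get(textWords[i]);
                if (pageList != null)               // not a content word: continue
                    pageList.add(pages[i]);
            }

            // Step 3: walk the map and move the entries into a sortable container.  O(M log M)
            List<String> sorted = new ArrayList<>(index.keySet());
            Collections.sort(sorted);
            for (String w : sorted)
                System.out.println(w + ": " + index.get(w));
        }
    }

Running this prints each content word with its page list (e.g. "hash: [1, 4]"), in sorted order.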
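And the memoized Fibonacci from the last slide, made runnable. A HashMap<Integer, Long> replaces the slide's raw hash table, which avoids the cast; otherwise the logic is the same.

    import java.util.HashMap;
    import java.util.Map;

    public class Fib {
        private static final Map<Integer, Long> memo = new HashMap<>();

        public static long fib(int n) {
            Long cached = memo.get(n);      // expected-O(1) lookup
            if (cached != null)
                return cached;              // intermediate result already stored
            long ans = (n < 2) ? 1 : fib(n - 1) + fib(n - 2);
            memo.put(n, ans);               // store the intermediate result
            return ans;
        }

        public static void main(String[] args) {
            System.out.println(fib(50));    // fast: each fib(k) is computed once
        }
    }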