Appendix E-A
Hashing
Modified

Chapter Scope
• Concept of hashing
• Hashing functions
• Collision handling
  – Open addressing
  – Buckets
  – Chaining
• Deletions
• Performance

Java Software Structures, 4th Edition, Lewis/Chase 15 - 2

What is hashing?
• Hashing is a scheme for storing and retrieving information by (key) value. It is sometimes used to implement associative memory.
• A hash function maps a value to a location; the value (and associated information) may be stored at that location, or at least accessed via that location.
• Very efficient for storing and retrieving.
• Used extensively in computing, in both software and hardware.

Collisions
• Ideally, each value being mapped would be stored at its mapped location in the location space; this is achievable with a perfect hashing function.
• However, in most situations, multiple values may map to the same location (a collision).
• So we need a strategy to handle collisions.
• There are several popular collision handling strategies.

Hash function
• A hash function is a mapping from a value space to a location space.
• The value space is any domain of values: strings, ints, phone numbers, student IDs, …
• The location space is normally a sequence of integers from 0 to N-1, where N is the size of the location space. The location space resembles a one-dimensional array (like computer memory).

[Figure: mapping from value space to location space]

Characteristics of a good hash function
• It should cover the entire location space.
• It should distribute the key values fairly evenly over the location space.
• Generally, two values that are “close together” in the value space should not be close together in the location space.
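
As a rough illustration of the first two characteristics (coverage and even distribution), the Java sketch below counts how many keys land in each location. The class name HashSpreadDemo, both hash functions, and the sample keys k0 … k9 are assumptions made up for this illustration; they are not taken from the slides or the textbook, and the sketch makes no attempt at the third characteristic.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;

class HashSpreadDemo {
    // Hypothetical poor hash: looks only at the first character of a non-empty key.
    static int firstCharHash(String key, int n) {
        return key.charAt(0) % n;
    }

    // Hypothetical better hash: mixes every character, then takes the remainder.
    static int polyHash(String key, int n) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        // floorMod keeps the result in 0..n-1 even if h overflowed to a negative int.
        return Math.floorMod(h, n);
    }

    // Counts how many keys map to each location 0..n-1.
    static int[] histogram(String[] keys, int n, ToIntFunction<String> hash) {
        int[] counts = new int[n];
        for (String key : keys) {
            counts[hash.applyAsInt(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 7; // size of the location space (a prime, chosen arbitrarily)
        String[] keys = new String[10];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = "k" + i;
        }
        System.out.println(Arrays.toString(histogram(keys, n, k -> firstCharHash(k, n))));
        System.out.println(Arrays.toString(histogram(keys, n, k -> polyHash(k, n))));
    }
}
```

With these sample keys, the first line printed is [0, 0, 10, 0, 0, 0, 0] (every key collides at one location) and the second is [2, 1, 1, 1, 1, 2, 2] (every location is used and the counts are nearly equal).
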
Aside: Cryptographic hashing

Division (remainder) hash function
• Probably the most commonly used method, either by itself or combined with another method.
• If the location space is of size N, divide the value (somehow represented as an integer) by N and take the remainder as the result.
• Choosing N to be a prime number improves the likelihood that the mapping distributes the values fairly evenly.

Representing a value as an integer
• We know that all information stored in computer storage is a string of bits.
• Any string of bits can be interpreted as a binary integer.
• So how do we make that interpretation?
• Modern languages try to prevent us from changing our interpretation of a string of bits (strong typing).

Char to int
Java does give us a loophole for character data:
    System.out.println((int) "ABC".charAt(0)); // displays 65
The data type char is an “integral” type, and can be automatically converted to an int or long:
    int i = 'B'; // sets i to 66

Also, can use bitwise and bit shift ops.
• ~, &, |, ^, <<, >>, >>>
    int i = 'B';
    System.out.println(i);                     // displays 66
    System.out.println((int) "ABC".charAt(0)); // displays 65
    System.out.println('A' & 'B');             // displays 64
    System.out.println('A' | 'B');             // displays 67
    System.out.println('A' ^ 'B');             // displays 3
    System.out.println(~'A');                  // displays -66
    System.out.println('A' << 2);              // displays 260
    System.out.println('A' >>> 1);             // displays 32

Folding
• Divide the value into parts and then combine them.
• Example: value is 234-56-9876. Split the digits into 234, 569, and 876, reverse the middle group, and add:
      234
      965
    + 876
    -----
     2075 % N

Other hash functions
• Mid square: square the value, as a number, and take a portion out of the middle of that product.
• Extraction: use only a part of an element’s value or key to compute the location at which to store the element.
• Length dependent: use a portion of the value, then combine it with the length of the value.

Hashing Functions - Digit Analysis
• In the digit analysis method, the index is formed by extracting, and then manipulating, specific digits from the key.
• For example, if our key is 1234567, we might select the digits in positions 2 through 4, yielding 234.
• The manipulation can then take many forms:
  – Reversing the digits (432)
  – Performing a circular shift to the right (423)
  – Performing a circular shift to the left (342)
  – Swapping each pair of digits (324)
• Alternatively, these manipulations could be done on the bits.

Appendix E-B
Hashing – Open Addressing
Modified

Open Addressing, A.K.A. Closed Hashing
• All hashed entries, including collisions, are stored within the hash table (closed array).
• Colliding entries are stored at open addresses (locations) within the table.
• When a collision occurs and the entry cannot be stored at its home address (to which it was originally hashed), the table is probed for an open position where it can be stored.
• When the entry is looked for, this same probe sequence must be followed until the entry is found or it is determined that it is not in the table.

Three probing approaches
1. Linear probing
2. Quadratic probing
3. Double hashing

Open addressing using Linear Probing
In linear probing, if an entry hashes to position P and that position is occupied, we simply probe for empty positions at
    (P + I) % TableSize, where I = 1, 2, 3, 4, … or some other linear sequence

Issues with Linear probing
• Linear probing may lead to clustering, which is both good and bad: it increases the average number of probes, but gives good locality of reference (if the interval is 1).
• Deleted positions are marked as deleted rather than empty; they can be reused by later insertions, but they do not mark the end of a probe sequence.
• When the probe interval is greater than 1, the table size should be a prime number (more precisely, relatively prime to the interval) to ensure that all positions appear in the probe sequence.

Issues with Linear probing (continued)
• Performance drops off as the load factor nears 80%.
• Must then expand the table and rehash all entries.
https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Quadratic probing
• In quadratic probing, the probe interval is a quadratic polynomial: I² in the simplest case.
• So, if an entry hashes to position P and that position is occupied, we simply probe for empty positions at
    (P + I²) % TableSize, where I = 1, 2, 3, 4, …
• Less primary clustering than with linear probing
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Double Hashing
• The interval between probes is computed by another hash function H2(x).
• So, if an entry x hashes to position P and that position is occupied, we simply probe for empty positions at
    (P + I * H2(x)) % TableSize, where I = 1, 2, 3, 4, …
• Less primary clustering than with linear probing
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Appendix E-C
Hashing – Buckets
Modified

Buckets
• The locations in the hash table are referred to as cells or as buckets.
• A bucket can be big enough to hold several entries (not just one).
• So, entries are hashed to a bucket location, and colliding entries can be stored in the same bucket until it becomes full.
• After it becomes full, the colliding elements can be stored in a common overflow area.

• What is the advantage of this approach over some of the other open addressing approaches?
• Locality of reference: the likelihood that, when you are accessing a place in memory or on disk, the next place you reference is “nearby”.
• This improves efficiency in virtual memory and makes disk access more efficient.
• Eliminates primary clustering
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html

Appendix E-D
Hashing – With chaining
Modified

Chaining
• The chaining method treats the hash table conceptually as an array of lists of individual elements.
• Thus each hash value locates a list of all entries that hash to (collide at) that hash location.
• These lists are usually linked (chained) lists.

The chaining method of collision handling
[Figure: table cells 0, 1, 2, …, N-1 with chained lists]
Two variants:
1. The table cells can contain the data being stored, or
2. The table cells can contain only head pointers to the lists, with all data being stored in the list nodes.
Pros and cons of each variant?

Basic operations
• Insert
• Find
• Delete
Lists can be ordered or not.

Pros of chaining – compared to closed hashing
• Hash table does not ever have to be expanded.
• Performance degrades more slowly as table fills up.
• Fewer (or no) empty table (data) spaces.
• Insertion (at the head of list) is simple and takes constant time.
• Deletion does not require special treatment.
• No clustering

Cons of chaining – compared to closed hashing
• Extra space used for pointers
• Extra time required to allocate list nodes dynamically!
• Worse locality of reference. Significant if lists get long. Size (and number) of data records must be considered.

Chaining using an overflow area
Chaining (with simulated links) can be accomplished using an array-based structure with an overflow area.
Pros and cons?

Appendix E-E
Hashing – Variations
Modified
http://en.wikipedia.org/wiki/Hash_table

Coalesced hashing
Omit!

Incremental resizing of a hash table
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:
• During the resize, allocate the new hash table, but keep the old table unchanged.
• In each lookup or delete operation, check both tables.
• Perform insertion operations only in the new table.
• At each insertion also move r elements from the old table to the new table.
• When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.
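
The incremental resizing steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the textbook's implementation: the class IncrementalHashSet, its method names, the use of chaining with int keys, r = 2, a resize trigger at load factor 1, and a growth factor of 2 (which satisfies the (r + 1)/r bound) are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class IncrementalHashSet {
    private static final int R = 2;             // assumed r: entries moved per insertion
    private List<LinkedList<Integer>> current;  // the new (active) table
    private List<LinkedList<Integer>> old;      // table being drained, or null
    private int oldCursor;                      // next old bucket to drain
    private int size;

    IncrementalHashSet(int capacity) {
        current = newTable(capacity);
    }

    private static List<LinkedList<Integer>> newTable(int capacity) {
        List<LinkedList<Integer>> table = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) {
            table.add(new LinkedList<>());
        }
        return table;
    }

    // Division (remainder) hash into the given table.
    private static LinkedList<Integer> bucket(List<LinkedList<Integer>> table, int key) {
        return table.get(Math.floorMod(key, table.size()));
    }

    // Lookup checks both tables while a resize is in progress.
    boolean contains(int key) {
        if (bucket(current, key).contains(key)) {
            return true;
        }
        return old != null && bucket(old, key).contains(key);
    }

    // Insertion goes only into the new table, then moves up to R old entries.
    boolean add(int key) {
        if (contains(key)) {
            return false;
        }
        if (old == null && size >= current.size()) {
            old = current;                          // keep the old table unchanged
            current = newTable(old.size() * 2);     // factor 2 >= (R + 1) / R
            oldCursor = 0;
        }
        bucket(current, key).addFirst(key);
        size++;
        migrate(R);
        return true;
    }

    // Deletion checks both tables as well.
    boolean remove(int key) {
        boolean removed = bucket(current, key).remove(Integer.valueOf(key));
        if (!removed && old != null) {
            removed = bucket(old, key).remove(Integer.valueOf(key));
        }
        if (removed) {
            size--;
        }
        return removed;
    }

    private void migrate(int count) {
        while (old != null) {
            if (oldCursor == old.size()) {
                old = null;                         // fully drained: release the old table
                return;
            }
            LinkedList<Integer> b = old.get(oldCursor);
            if (b.isEmpty()) {
                oldCursor++;                        // skip buckets that are already empty
            } else if (count == 0) {
                return;                             // quota used up; continue next insertion
            } else {
                int key = b.removeFirst();
                bucket(current, key).addFirst(key);
                count--;
            }
        }
    }

    boolean isResizing() { return old != null; }

    int size() { return size; }
}
```

With an initial capacity of 4, the fifth insertion starts a resize and the sixth completes it, since each insertion moves two of the four old entries. Note that skipping over empty old buckets is not bounded per call in this sketch, so a real-time implementation would need a stricter limit on work per operation.
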