3.3 Hashing

Hashing is a method for storing and retrieving information. It is often very fast: much faster than linear search and even faster than binary search. Suppose we have n items. Searching for one of them by linear search takes time O(n), and by binary search time O(log n). With a properly implemented hashing scheme, however, searching takes time O(1). Thus the search time is constant, provided n remains within bounds. In a sense the same could be said of linear and binary search (any search over a bounded number of items takes bounded time), but practically speaking hashing is usually better.

We illustrate hashing by storing and retrieving names in an array. Suppose the array is called Customer. For simplicity we shall make it not too large: 111 elements, with indices running from 0 to 110. We want to store some names in this array, for example Williams, Smith and Johnson. The obvious approach would put them in the first three locations:

    Customer
    index   contents
    0       Williams
    1       Smith
    2       Johnson
    3
    ...
    110

However, instead of putting them in the first 3 locations, we compute from each name a number, and that number is the location in the array where the name will be stored. The function that computes the location is called a hashing function. There are many possible hashing functions. Let's look at one.

It is convenient to take advantage of how names are commonly stored on a computer. The letters of the name are usually stored sequentially in the computer's memory, each letter occupying 1 byte (8 bits) and represented by its ASCII code. At the end of this section is an ASCII table giving the hex and decimal code of each character. For example, a capital A is represented by 41 (hex) or 65 (decimal), with the other capital letters following in sequence. A lower case a is represented by 61 (hex) or 97 (decimal), with the other lower case letters following in sequence. Thus Williams is represented using 8 bytes as follows. We give the contents in hex.

    W   i   l   l   i   a   m   s
    57  69  6C  6C  69  61  6D  73
The idea of the hashing function is to take some of the characters, treat them as a number, divide by the length of the array and take the remainder. The remainder is the location in the array where the name will be stored.

To simplify matters, let's take the first 4 letters of the name and treat them as an integer. If there are fewer than 4 letters, we take however many there are. For example, with the name Williams we would just take Will.

    W   i   l   l
    57  69  6C  6C

At this point we need to look at one little quirk of the processors in personal computers. When the CPU takes 4 bytes of memory and treats them as an integer, it reverses the order of the bytes. Thus the first byte is the low order byte of the integer and the fourth byte is the high order byte. In the example of Will, the W is the low order byte and the rightmost l is the high order byte. The table below lists the bytes from high order to low order, together with the decimal value of each letter.

    char   l     l     i     W
    hex    6C    6C    69    57
    dec    108   108   105   87

To see what number this represents in decimal we compute

\[
\begin{aligned}
108 \cdot 256^3 + 108 \cdot 256^2 + 105 \cdot 256 + 87
&= (108)(16{,}777{,}216) + (108)(65{,}536) + 26{,}880 + 87 \\
&= 1{,}811{,}939{,}328 + 7{,}077{,}888 + 26{,}880 + 87 \\
&= 1{,}819{,}044{,}183
\end{aligned}
\]

In this example the array length is 111, so we reduce 1,819,044,183 mod 111:

\[ 1{,}819{,}044{,}183 \bmod 111 = 48 \]

So Williams is stored in location 48 of the array.

    Customer
    index   contents
    0
    1
    2
    3
    ...
    48      Williams
    ...
    110

We can use the algebra of mod to reduce the size of the numbers we are working with at each step of the calculation. In particular, we can then do the computation entirely by hand, without a calculator. The main idea is that in a sequence of additions, subtractions and multiplications followed by a mod, we can replace any number by another number that leaves the same remainder mod 111.
Applying this to the calculation for Will, we use 256 mod 111 = 34 and replace 108, 105, 87 and 102 by -3, -6, -24 and -9 respectively, each of which differs from the original by 111. Likewise -102 is replaced by 9.

\[
\begin{aligned}
&\bigl[108 \cdot 256^3 + 108 \cdot 256^2 + 105 \cdot 256 + 87\bigr] \bmod 111 \\
&= \bigl[108(256 \bmod 111)^3 + 108(256 \bmod 111)^2 + 105(256 \bmod 111) + 87\bigr] \bmod 111 \\
&= \bigl[(-3)(34)^3 + (-3)(34)^2 + (-6)(34) - 24\bigr] \bmod 111 \\
&= \bigl[-102(34)^2 - 102 \cdot 34 - 2 \cdot 102 - 24\bigr] \bmod 111 \\
&= \bigl[9(34)^2 + 9 \cdot 34 + 2 \cdot 9 - 24\bigr] \bmod 111 \\
&= \bigl[3 \cdot 3 \cdot 34 \cdot 34 + 3 \cdot 3 \cdot 34 + 2 \cdot 9 - 24\bigr] \bmod 111 \\
&= \bigl[102 \cdot 102 + 3 \cdot 102 + 18 - 24\bigr] \bmod 111 \\
&= \bigl[(-9)(-9) + 3(-9) - 6\bigr] \bmod 111 \\
&= [81 - 27 - 6] \bmod 111 \\
&= 48 \bmod 111 = 48
\end{aligned}
\]

Remark: Most calculators have a way to do mod. On the HP 48G, to do n mod p, do the following: (1) MTH (2) REAL (3) n <Enter> (4) p <Enter> (5) MOD. Some calculators allow one to do calculations in hex. On the HP 48G, to convert a number from hex to decimal, do the following: (1) MTH (2) BASE (3) | # (4) 6C6C6957 | h (5) | h (6) B R.

The way a hashing function is specified in mathematical notation differs somewhat from person to person. Here is an example to illustrate one method.

Example 1. Specify a function that takes a character string s, treats the first four characters as an integer with the first character as the low order digit, and then mod's by 111.

Solution. Let

    s   = a character string
    p   = length of s
    s_j = (j+1)st character of s, so that s_0 = first character, s_1 = second character, ..., s_{p-1} = last character of s

\[
a_j =
\begin{cases}
\text{numerical value of the ASCII code of } s_j & \text{if } j < p \\
0 & \text{if } j \ge p
\end{cases}
\]

\[ h(s) = \bigl[a_0 + 256\,a_1 + 256^2 a_2 + 256^3 a_3\bigr] \bmod 111 \]

Here is a C++ program which illustrates the hashing method we have been discussing.

    #include <cstdint>
    #include <cstring>
    #include <iomanip>
    #include <iostream>
    using namespace std;

    int main()
    {
        char Name[20] = "", Name4[4];
        int i, NameEnd, p;
        uint32_t n;   // exactly 4 bytes; unsigned long is 8 bytes on some systems

        // Read a name (setw keeps the input from overflowing Name).
        cout << "Program to demonstrate a hashing function" << endl;
        cout << "Enter a Name: ";
        cin >> setw(sizeof Name) >> Name;

        // Take first four characters, padding with 0's if necessary.
        NameEnd = strlen(Name) - 1;
        for (i = 0; i <= 3; i = i + 1)
            if (i <= NameEnd)
                Name4[i] = Name[i];
            else
                Name4[i] = '\0';

        // Treat the 4 bytes as an integer. On the processors discussed
        // above, the first character becomes the low order byte.
        memcpy(&n, Name4, sizeof n);
        cout << "First 4 characters viewed as an integer = " << n << endl;

        // Compute the hashed value.
        p = n % 111;
        cout << "Hashed value = " << p << endl;
        return 0;
    }

ASCII Table

The ASCII code represents keyboard characters as sequences of 7 bits. In many computers the code for each character is stored in the low order 7 bits of an 8-bit byte and the high order bit is set to 0. In the table below the ASCII codes for the keyboard characters are given in hex along with the decimal equivalent of the hex value. Note: SP indicates a space and Del indicates the "Delete" character.

The first 32 ASCII codes (00 through 1F hex) do not represent printable characters. Rather, they are used to send control signals between the computer and input/output devices. For example, the code 0A indicates a Line Feed, which means to advance to the next line. The code 0D indicates a Carriage Return, which means to move the print position back to the start of the current line.

    char  hex  dec      char  hex  dec      char  hex  dec
    SP    20   32       @     40   64       `     60   96
    !     21   33       A     41   65       a     61   97
    "     22   34       B     42   66       b     62   98
    #     23   35       C     43   67       c     63   99
    $     24   36       D     44   68       d     64   100
    %     25   37       E     45   69       e     65   101
    &     26   38       F     46   70       f     66   102
    '     27   39       G     47   71       g     67   103
    (     28   40       H     48   72       h     68   104
    )     29   41       I     49   73       i     69   105
    *     2A   42       J     4A   74       j     6A   106
    +     2B   43       K     4B   75       k     6B   107
    ,     2C   44       L     4C   76       l     6C   108
    -     2D   45       M     4D   77       m     6D   109
    .     2E   46       N     4E   78       n     6E   110
    /     2F   47       O     4F   79       o     6F   111

    char  hex  dec      char  hex  dec      char  hex  dec
    0     30   48       P     50   80       p     70   112
    1     31   49       Q     51   81       q     71   113
    2     32   50       R     52   82       r     72   114
    3     33   51       S     53   83       s     73   115
    4     34   52       T     54   84       t     74   116
    5     35   53       U     55   85       u     75   117
    6     36   54       V     56   86       v     76   118
    7     37   55       W     57   87       w     77   119
    8     38   56       X     58   88       x     78   120
    9     39   57       Y     59   89       y     79   121
    :     3A   58       Z     5A   90       z     7A   122
    ;     3B   59       [     5B   91       {     7B   123
    <     3C   60       \     5C   92       |     7C   124
    =     3D   61       ]     5D   93       }     7D   125
    >     3E   62       ^     5E   94       ~     7E   126
    ?     3F   63       _     5F   95       Del   7F   127