Search, Sorting and Big

8 Tables

A table is an unordered collection of items. Each item in the table has a unique key (no duplicates are allowed in the table). Aside from the key, each item contains other information related to the key. By contrast, the items in lists, arrays, stacks and queues are ordered (nth item in the list, front of the queue, top of the stack, etc.) and their keys do not have to be unique.

8.1 Table operations

- Initialize: the table is initialized to an empty table
- isempty: tests if the table is empty
- isfull: tests if the table is full
- insert: adds a new item (key and data) to the table
- delete: given a key, removes the item (key and associated data)
- update: given a key, changes the data
- find: given a key, finds the data
- enumerate: processes/lists/counts all items in the table

8.2 Simple implementation

Use a list to implement a table. Store the items so that the keys are in ascending order, which makes binary search possible. Note that the ordering is an implementation detail and not part of the table definition.

8.3 Hash Tables

A hash table uses the key to directly calculate the item's location in an array. To do this, the key is passed into a hash function, which returns an index:

    int HashFunction(key);

The key can be any data type. Ideally, the index returned by the hash function would be the key itself, but this is rarely possible in the real world: it would require the range of possible key values to be small. If keys are big (9 digit student numbers, unique names, etc.), the array would have to be far too large.
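To make the size problem concrete, here is a minimal sketch of a hash function for 9 digit student numbers. The name HashStudentNumber, the table size 1009 and the use of the modulus operator are all assumptions made for this illustration, not something prescribed by these notes. Used directly as an index, a 9 digit key would need an array of 1,000,000,000 entries; the sketch folds the key into a much smaller range instead.

```cpp
#include <cassert>

// Hypothetical table size, chosen only for this illustration.
const int TABLE_SIZE = 1009;

// Maps a (non-negative) student number to an index in 0..TABLE_SIZE-1
// by taking the remainder after division by the table size.
int HashStudentNumber(long studentNumber){
    assert(studentNumber >= 0);   // negative keys are not handled in this sketch
    return static_cast<int>(studentNumber % TABLE_SIZE);
}

// Returns true if two keys are sent to the same index.
bool SameIndex(long a, long b){
    return HashStudentNumber(a) == HashStudentNumber(b);
}
```

Because many keys share each index, two different student numbers can end up at the same spot. For instance, SameIndex(123456789, 123456789 + TABLE_SIZE) is true, since the two keys differ by exactly the table size.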
A good hash function should be:

- Uniform (all indexes are equally likely for a given key)
- Random (the index should not be predictable from patterns in the keys)

For example, suppose we wanted to hash the telephone numbers of people living in a building (fewer than 1000 residents in all):

- Using the first three digits of the phone number is probably a bad idea (phone numbers in the same general area often start with the same digits).
- Using the last three digits of the phone number would be a better idea.

8.4 Load Factor

The number of items already stored in a hash table affects its performance. The measurement of fullness is called the load factor. In textbooks it is often represented by the symbol λ (pronounced lambda):

    λ = number of items in the table / size of the table

Put another way, λ is simply the fraction of occupied spots in the table. Therefore λ = 0.5 means that 50% of the table is full, and λ = 0.1 means it is 10% full.

NOTE: depending on the collision resolution method (described in the next section), it is possible that λ > 1.

8.5 Collision

A hash function takes a key and translates it into an index. This translation can sometimes map two different keys to the same index. This problem is called a collision. Collisions are common in hash tables and they are unavoidable, so it is essential to have a method for resolving them.

8.5.1 Bucketing

Make every entry in the array big enough to hold N items (N is not the amount of data, just a constant). Think of it as a 2-D array.

Problems:

- Lots of wasted space.
- If N is exceeded, another strategy will need to be used.
- Not good for a memory-based algorithm, but doable if the buckets are disk-based.

For bucketing it is alright to have λ > 1. However, the higher λ is, the higher the chance of collision. λ > 1 guarantees there will be at least 1 collision (pigeonhole principle). A high λ increases both the run time and the possibility of running out of space in a bucket.
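The 2-D array idea behind bucketing can be sketched roughly as follows. This is only an illustration under stated assumptions: the keys are non-negative ints, the hash function is a simple modulus, and the names BucketTable, ENTRIES and N (with the made-up values 11 and 3) are hypothetical.

```cpp
#include <cassert>

const int ENTRIES = 11;   // hypothetical number of table entries
const int N = 3;          // hypothetical bucket capacity (the constant N above)

struct BucketTable{
    int keys_[ENTRIES][N];   // 2-D array: one row of N slots per entry
    int count_[ENTRIES];     // number of slots in use in each bucket

    BucketTable(){
        for(int i=0;i<ENTRIES;i++){
            count_[i]=0;
        }
    }
    // Stand-in hash function (an assumption, any hash function would do).
    static int Hash(int key){
        assert(key >= 0);
        return key % ENTRIES;
    }
    // Linear search of one bucket only: at most N comparisons.
    bool Find(int key) const{
        int b=Hash(key);
        for(int i=0;i<count_[b];i++){
            if(keys_[b][i]==key) return true;
        }
        return false;
    }
    // Returns false if the key is already present or the bucket is full.
    // A full bucket is where another strategy would be needed.
    bool Insert(int key){
        int b=Hash(key);
        if(Find(key) || count_[b]==N) return false;
        keys_[b][count_[b]]=key;
        count_[b]++;
        return true;
    }
};
```

With these sizes, the keys 1, 12 and 23 all hash to entry 1 and fill that bucket, so a fourth colliding key such as 34 is rejected even though the rest of the table is empty.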
For a hash table of M entries where each bucket holds up to X items:

- Successful search: O(X) worst case
- Unsuccessful search: O(X) worst case
- Insertion: O(X), assuming success. Bucketing has no good way to handle an insertion into a full bucket.
- Deletion: O(X)
- Storage: O(M * X)

8.5.2 Chaining

At every spot in the hash table, store a linked list of items. This is better than bucketing because you only use as many nodes as necessary. Some space is still used for the pointers, but not nearly as much as bucketing wastes. The table also cannot overflow (i.e. there is no X to exceed). You still need to do a short linear search of the linked list, but if the hash function distributes the items uniformly, the lists should not be very long.

For chaining, the run times depend on the load factor λ. The average length of each chain is λ, so λ is the expected number of probes needed for either an insertion or an unsuccessful search. For a successful search it is 1 + λ/2 probes.

8.5.3 Linear Probing

If a given position is already in use, search the array for the next available open space (you need a way to tell whether each space is filled or not). Searches work the same way: every probe first uses the hash function to find out where to look. If the item is not found there, a linear search starts from that position, because an item may not be exactly where the hash function indicates.

This method only works well if the array is larger than the maximum number of expected items. In other words, you would expect lots of open spaces. In general you will need to keep at least 30% to 35% of the table empty. The problem is that you must estimate the maximum number of items in advance, which can be hard.

Another problem is that whenever there is a collision, the item being added has to be placed into the table in another spot.
This could mean that it takes up a spot that another item would have hashed directly into. That item would in turn have to be placed somewhere else. As a result, the items in the hash table tend to form clusters. Any attempt to hash a value into a cluster causes the item to be placed in an alternate spot, which increases the size of the cluster.

With λ as the load factor, the average number of probes in an unsuccessful search using linear probing is around:

$$\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$$

The average number of probes for a successful search using linear probing is around:

$$\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$$

Note that the actual cost of finding an item depends on λ at the time that item was inserted. For example, if the table was empty when the item was inserted, then any future search for it will always find the item where the hash function says it should be.

8.5.3.1 Example of using linear probing to create a hash table

Suppose we want to create a hash table where the key is a string, and we expect no more than 100 items. How would we create the hash table using linear probing?

Step 1: Decide on the size. Since we are using linear probing, we want a table at least 30% bigger than 100 items. Generally it is a good idea to choose a hash table size that is a prime number. In this case 137 would be a good choice.

Step 2: Choose a hash function. Choose a hash function that will create a relatively uniform distribution. This means you should not do something like use the first letter (more words begin with a than with z). One simple solution is to use the sum of the ASCII codes and take its modulus with 137 (the array size):

    int HashFunction(char* s){
        int total=0;
        for(int i=0;s[i]!='\0';i++){
            //cast so that character codes above 127 cannot
            //make the sum (and the index) negative
            total+=(unsigned char)s[i];
        }
        return total%137;
    }

Writing a good hash function is not an easy task. You may wish to look for ones that are better than this.

Step 3: Initialize the hash table. For the sake of coding, let's create an object that we can hash.
We will define it as:

    #include <cstring>   //strcmp() and NULL

    struct HashItem{
        char key_[100];
        int data1_;
        double data2_;
    };

    class HashTable{
        HashItem* mytable_[137];   //array of HashItem pointers,
                                   //that way a NULL pointer means empty
        int numitems_;
        int Search(char* key);
    public:
        HashTable();
        int Insert(HashItem item);
        void Remove(char* key);
        int Retrieve(char* key,HashItem& item);
    };

    HashTable::HashTable(){
        for(int i=0;i<137;i++){
            mytable_[i]=NULL;
        }
        numitems_=0;
    }

The above code assumes a table size of 137, but it could easily be modified to use any size.

Step 4: Insert an item. Note that this version does not check whether the key is already in the table.

    //returns 1 if the item was inserted, 0 if the table is full
    int HashTable::Insert(HashItem item){
        int retval=0;
        if(numitems_<137){
            retval=1;
            numitems_++;
            int idx=HashFunction(item.key_);
            //loop stops when mytable_[idx] is NULL
            for(;mytable_[idx];idx=(idx+1)%137);
            mytable_[idx]=new HashItem;
            *(mytable_[idx])=item;     //copy into new item
        }
        return retval;
    }

Step 5: Write a function to do a retrieval (given a key, find the data). Since other functions (like Remove()) also need to do a search, an internal search function returning the appropriate index is written first.

    //returns the index of the item, or -1 if not found
    int HashTable::Search(char* key){
        int retval=-1;
        int start=HashFunction(key);
        if(mytable_[start]){
            //remember strcmp returns 0 if the items are the same
            if(strcmp(mytable_[start]->key_,key)){
                int i;
                //loop stops when we have either searched the entire table,
                //found an empty space or found a match
                for(i=(start+1)%137;
                    i!=start&&mytable_[i]&&strcmp(mytable_[i]->key_,key);
                    i=(i+1)%137);
                if(mytable_[i] && !strcmp(mytable_[i]->key_,key))
                    retval=i;
            }
            else{
                retval=start;   //item is exactly where it hashes to
            }
        }
        return retval;
    }

    //returns 1 and copies the data into item if found, 0 otherwise
    int HashTable::Retrieve(char* key,HashItem& item){
        int idx=Search(key);
        if(idx!=-1){
            item=*mytable_[idx];
        }
        return idx==-1?0:1;
    }

Step 6: Write a function to do removal. Deletion is not as simple as it may seem when using linear probing, because an item may not be stored where it hashes to. Consider the following: suppose that we are using an 11 element array.
Keys are ints and the hash function is simply the last digit of the number. Let's suppose that after doing a number of insertions we have the following array:

    index:   0    1    2    3    4    5    6    7    8    9   10
    value:   -   11    -    -   54   64   76   87   85   99    -

64 hashes to 4 but sits at index 5, and 85 hashes to 5 but sits at index 8. Every other item is in its proper place.

Using linear probing, we must be very careful about leaving empty spaces. This is because a search treats an empty space as the indication that the search is done. We must not leave an empty space between where an item should be and where it actually is.

Example: suppose we delete 54:

    index:   0    1    2    3    4    5    6    7    8    9   10
    value:   -   11    -    -    -   64   76   87   85   99    -

We can't just leave the array like this, because if we look for 64 now we won't find it: the space where it should have been is empty. We need to move 64 over:

    index:   0    1    2    3    4    5    6    7    8    9   10
    value:   -   11    -    -   64    -   76   87   85   99    -

The problem now is that 85 will not be found, so we need to move 85 over too:

    index:   0    1    2    3    4    5    6    7    8    9   10
    value:   -   11    -    -   64   85   76   87    -   99    -

Now it is ok. The Remove function is below:

    int isbetween(int rm, int proper, int next);   //defined below

    void HashTable::Remove(char* key){
        int rm=Search(key);
        if(rm!=-1){
            delete mytable_[rm];       //delete the appropriate item
            mytable_[rm]=NULL;
            numitems_--;               //one less item in the table
            int next=(rm+1)%137;
            while(mytable_[next]){
                int proper=HashFunction(mytable_[next]->key_);
                if(isbetween(rm,proper,next)){
                    //move if the empty spot is between the item's
                    //proper and current location
                    mytable_[rm]=mytable_[next];
                    mytable_[next]=NULL;
                    rm=next;
                }
                next=(next+1)%137;
            }
        }
    }

The isbetween() function returns true if the first parameter is between the second and third parameters. In other words, it is true if rm >= proper && rm < next. BUT don't forget that our array is circular. Suppose that proper=136 and next=3. Is rm=0 between proper and next? YES!
So the function is slightly more complex:

    int isbetween(int rm, int proper, int next){
        int retval=0;
        if(proper < next){
            if(rm >= proper && rm < next)
                retval=1;
        }
        else if(proper > next){
            //range wraps around the end of the array
            if((rm >= proper && rm < 137) || (rm >= 0 && rm < next))
                retval=1;
        }
        return retval;
    }

8.5.4 Quadratic Probing

Quadratic probing is similar to linear probing. The difference is that if you try to insert into a space that is filled, you first check 1^2 = 1 element away, then 2^2 = 4 elements away, then 3^2 = 9 elements away, then 4^2 = 16 elements away, and so on.

With linear probing we know that we will always find an open spot if one exists (it might be a long search, but we will find it). However, this is not the case with quadratic probing unless you take care in choosing the table size. For example, consider what happens in the following situation. The table size is 16, and the first 5 pieces of data all hash to index 2:

- The first piece goes to index 2.
- The second piece goes to index 3 ((2 + 1) % 16).
- The third piece goes to index 6 ((2 + 4) % 16).
- The fourth piece goes to index 11 ((2 + 9) % 16).
- The fifth piece doesn't get inserted because (2 + 16) % 16 == 2, which is full. We end up back where we started without having searched all the empty spots. Later probes do no better, since a square modulo 16 can only be 0, 1, 4 or 9.

In order to guarantee that quadratic probing will eventually find an open spot, the table must meet these requirements:

- Its size must be a prime number.
- It must never be more than half full (even by one element).

8.5.5 Double Hashing

Double hashing works on a similar idea to linear and quadratic probing. Use a big table and hash into it. Whenever a collision occurs, choose another spot in the table for the value. The difference is that instead of choosing the next opening, a second hash function is used to determine the location of the next spot. For example, given hash functions H1 and H2 and a key, do the following (all locations are taken modulo the table size):

- Check location H1(key). If it is empty, put the record in it.
- If it is not empty, calculate H2(key).
- Check if H1(key)+H2(key) is open. If it is, put the record in it.
- Otherwise repeat with H1(key)+2*H2(key), H1(key)+3*H2(key) and so on, until an opening is found.

As with quadratic probing, you must take care in choosing H2. H2 CANNOT return 0, and H2 must be chosen so that all cells will eventually be probed.
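The double hashing steps can be sketched as below. This is a minimal illustration, not a definitive implementation: it assumes non-negative int keys, a prime table size of 11, and two made-up hash functions H1 and H2. H2 here returns a value from 1 to 10, so it is never 0, and because the table size is prime every step size eventually visits all cells. The names DoubleHashTable, TABLE_SIZE and EMPTY are hypothetical, and removal is not handled.

```cpp
#include <cassert>

const int TABLE_SIZE = 11;   // hypothetical prime table size
const int EMPTY = -1;        // marks an unused cell (keys are assumed >= 0)

// First hash: the starting location.
int H1(int key){ return key % TABLE_SIZE; }
// Second hash: the step size, always in 1..TABLE_SIZE-1, never 0.
int H2(int key){ return 1 + (key % (TABLE_SIZE - 1)); }

struct DoubleHashTable{
    int cells_[TABLE_SIZE];

    DoubleHashTable(){
        for(int i=0;i<TABLE_SIZE;i++){
            cells_[i]=EMPTY;
        }
    }
    // Returns the index where key was stored, or -1 if the table is full.
    // This sketch does not check for duplicate keys.
    int Insert(int key){
        assert(key >= 0);
        for(int i=0;i<TABLE_SIZE;i++){
            int idx=(H1(key) + i*H2(key)) % TABLE_SIZE;
            if(cells_[idx]==EMPTY){
                cells_[idx]=key;
                return idx;
            }
        }
        return -1;
    }
    // Follows the same probe sequence as Insert.
    // Returns the index of key, or -1 if it is not in the table.
    int Find(int key) const{
        assert(key >= 0);
        for(int i=0;i<TABLE_SIZE;i++){
            int idx=(H1(key) + i*H2(key)) % TABLE_SIZE;
            if(cells_[idx]==EMPTY) return -1;   //empty cell ends the search
            if(cells_[idx]==key) return idx;
        }
        return -1;
    }
};
```

With these functions, the keys 2, 13 and 24 all have H1(key) equal to 2, but their step sizes differ (3, 4 and 5), so inserting them in that order places them at indexes 2, 6 and 7 rather than in one contiguous cluster.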
