Search, Sorting and Big

8 Tables
A table is an unordered collection of items. Each item in the table has a unique key
(no duplicates are allowed in the table). Aside from the key, each item contains other
information related to the key. For lists, arrays, stacks and queues, the items are ordered
(nth item in the list, front of the queue, top item of the stack, etc.) and the key does not
have to be unique.
8.1 Table operations
Initialize - Table is initialized to an empty table
isempty - tests if table is empty
isfull - tests if table is full
insert - add a new item with key:data to the table
delete - given a key, remove the item (key and associated data)
update - given a key, change the data
find - given a key, find the data
enumerate - process/list/count all items in the table
8.2 Simple implementation
Use a list to implement a table. Store items into it so that the keys are in ascending order
(for binary search purpose). Note that the ordering is part of the implementation details
and not part of the table definition.
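A minimal sketch of this list-based approach, assuming int keys and string data. The names Item, SortedTable, insert and find are illustrative and do not come from these notes; the binary search is done with std::lower_bound.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical item type: an int key plus some associated data.
struct Item {
    int key;
    std::string data;
};

// Table stored as a vector that is kept in ascending key order.
class SortedTable {
    std::vector<Item> items_;

    // Binary search for the first position whose key is >= key.
    std::vector<Item>::iterator lowerBound(int key) {
        return std::lower_bound(items_.begin(), items_.end(), key,
            [](const Item& it, int k) { return it.key < k; });
    }

public:
    // Returns false (and changes nothing) if the key is already present.
    bool insert(const Item& item) {
        auto pos = lowerBound(item.key);
        if (pos != items_.end() && pos->key == item.key) return false;
        items_.insert(pos, item);   // O(n): later items shift over
        return true;
    }

    // Returns a pointer to the item, or nullptr if the key is absent.
    const Item* find(int key) {
        auto pos = lowerBound(key);
        return (pos != items_.end() && pos->key == key) ? &*pos : nullptr;
    }

    bool isempty() const { return items_.empty(); }
};
```

In this sketch find costs O(log n) comparisons, while insert is still O(n) because items after the insertion point must shift.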
8.3 Hash Tables
A hash table uses the key to directly calculate its location in an array structure. To do
this the key is passed into a hash function which returns an index.
int HashFunction(key);
key can be any data type.
Ideally, the index returned by the hash function would be the key itself, but this is
unlikely in the real world. To be able to do this, the range of possible key values would
have to be small. If keys are big (9 digit student numbers, unique names, etc.) this is
not possible.
A good hash function should be:
Uniform (all indexes are equally likely for a given key)
Random (not predictable)
For example, suppose we wanted to hash the telephone numbers of people living
in a building (<1000 residents in all).
Using the first three digits of the phone number is probably a bad idea (phone numbers in
the same general area often share the same leading digits).
Using the last three digits of the phone number would be a better idea.
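The last-three-digits idea could be sketched as below. The function name HashPhone and the choice to pass the number as a string of digits are assumptions made for illustration.

```cpp
#include <string>

// Hypothetical hash for a table of 1000 slots: the last three digits
// of a phone number given as a string of digits, e.g. "4165551234".
// Assumes the string holds at least three characters, all digits.
int HashPhone(const std::string& phone) {
    int index = 0;
    for (std::size_t i = phone.size() - 3; i < phone.size(); i++) {
        index = index * 10 + (phone[i] - '0');
    }
    return index;   // always in the range 0..999
}
```

With this sketch, "4165551234" hashes to index 234, regardless of the area code and exchange that neighbours are likely to share.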
8.4 Load Factor
When looking at the efficiency of a hash table, the number of items already stored in the
hash table will have an effect on the performance of the hash table. The measurement of
fullness is called the Load Factor. In textbooks it is often represented by the symbol λ
(pronounced lambda). λ = number of items in the table / size of the table. Put another
way, λ is simply the fraction of occupied spots in the table. Therefore, if λ = 0.5 it
means that 50% of the table is full, and λ = 0.1 means it is 10% full. NOTE: depending on
the collision resolution method (described in the next section) it is possible that λ > 1.
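As a small illustration, the load factor (λ) can be computed as follows. LoadFactor is a hypothetical helper name, not something defined elsewhere in these notes.

```cpp
// Load factor: number of items stored divided by the number of slots.
// Assumes tableSize > 0.
double LoadFactor(int numItems, int tableSize) {
    return static_cast<double>(numItems) / tableSize;
}
```

For example, 50 items in a table of 100 slots gives 0.5, and 150 items in 100 slots gives 1.5, which is only possible when a slot can hold more than one item.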
8.5 Collision
A hash function takes a key and translates it into an index. This translation can
sometimes translate two keys into the same index. This problem is called a collision.
Collisions are common in hashtables and they are unavoidable. Therefore it is essential
that there be methods to resolve a collision.
8.5.1 Bucketing
Make every entry in the array big enough to hold N items (N is not the amount of data,
just a constant). Think of it as a 2-D array.
Problems:
• Lots of wasted space.
• If N is exceeded, another strategy will need to be used.
• Not good for a memory-based algorithm, but doable if buckets are disk-based.
For bucketing it is alright to have λ > 1. However, the higher λ is, the higher the chance
of collision. λ > 1 guarantees there will be at least 1 collision (pigeonhole principle).
That will bring up both the run time and the possibility of running out of buckets.
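A rough sketch of bucketing with int keys follows. The sizes M and X, the names Bucket and BucketTable, and the modulo hash are all assumptions chosen to keep the example short.

```cpp
#include <array>

constexpr int M = 7;   // number of hash entries (assumed for the sketch)
constexpr int X = 3;   // capacity of each bucket (assumed for the sketch)

struct Bucket {
    std::array<int, X> keys{};  // fixed room for X keys
    int used = 0;               // how many of the X places are filled
};

class BucketTable {
    std::array<Bucket, M> table_{};

    // Simple modulo hash that also copes with negative keys.
    static int Hash(int key) { return ((key % M) + M) % M; }

public:
    // Returns false if the key's bucket is already full (overflow).
    // Does not check whether the key is already stored.
    bool insert(int key) {
        Bucket& b = table_[Hash(key)];
        if (b.used == X) return false;
        b.keys[b.used++] = key;
        return true;
    }

    // Linear scan of one bucket: at most X comparisons.
    bool find(int key) const {
        const Bucket& b = table_[Hash(key)];
        for (int i = 0; i < b.used; i++) {
            if (b.keys[i] == key) return true;
        }
        return false;
    }
};
```

In this sketch the keys 0, 7 and 14 all land in bucket 0 and fill it, so inserting 21 fails even though the other six buckets are empty. That is the overflow problem listed above.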
For a hash table of M hash entries with buckets that each hold X items (the constant
called N above):
• Successful Search - O(X) worst case
• Unsuccessful Search - O(X) worst case
• Insertion - O(X) - assuming success; bucketing does not have a good way to handle
unsuccessful insertions.
• Deletion - O(X)
• Storage: O(M * X)
8.5.2 Chaining
At every spot in your hash table store a linked list of items. This is better than bucketing
as you only use as many nodes as necessary. Some space will still be wasted for the
pointers, but not nearly as much as with bucketing. The table will also not overflow (i.e.
there is no X to exceed). You will still need to conduct a short linear search of the linked
list, but if your hash function distributes the items uniformly, the list should not be very
long.
For chaining, the run time depends on λ. The average length of each chain is λ, and λ is
the number of expected probes needed for either an insertion or an unsuccessful search. For
a successful search it is 1+λ/2 probes.
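Chaining might be sketched as follows, here storing only string keys to keep it short. The class name ChainedTable, the use of std::forward_list for the chains, and the sum-of-characters hash are assumptions of this sketch.

```cpp
#include <forward_list>
#include <string>
#include <vector>

// Sketch of chaining: each slot holds a singly linked list of keys.
// Assumes the table is created with at least one slot.
class ChainedTable {
    std::vector<std::forward_list<std::string>> slots_;
    int numitems_ = 0;

    // Sum-of-characters hash; any uniform hash could be used instead.
    std::size_t Hash(const std::string& key) const {
        unsigned total = 0;
        for (unsigned char c : key) total += c;
        return total % slots_.size();
    }

public:
    explicit ChainedTable(std::size_t size) : slots_(size) {}

    // Walks one chain only.
    bool find(const std::string& key) const {
        for (const std::string& k : slots_[Hash(key)]) {
            if (k == key) return true;
        }
        return false;
    }

    // Returns false if the key is already in the table.
    bool insert(const std::string& key) {
        if (find(key)) return false;
        slots_[Hash(key)].push_front(key);
        numitems_++;
        return true;
    }

    // Average chain length; may exceed 1 because chains can grow.
    double loadFactor() const {
        return static_cast<double>(numitems_) / slots_.size();
    }
};
```

Three keys in a two-slot table give a load factor of 1.5, and every insert still succeeds because a chain simply grows by one node.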
8.5.3 Linear Probing
If a given position is already being used, then search the list for the next available open
space (obviously, you would need to have a way to tell if each space was filled or not).
Typically, this method will involve making small linear searches for the correct item, as
a hashed item may not be exactly where the hash function indicates.
Using this method, any probe will use the hash function to find out where to look. If the
item is not found there, do a linear search starting at that position.
This method only works well if the size of the array is larger than the maximum number
of expected values. In other words you would expect lots of open spaces. In general you
will need to keep no less than around 30% to 35% of the table empty.
The problem is that you must estimate the maximum size of the table, which can be hard.
Another problem is that whenever you have a collision, the item you are trying to add
will have to be placed into the table in another spot. This of course could mean that it is
taking up a spot that was meant for another item that would have hashed directly into it.
That item would in turn have to be placed somewhere else. The result is that the items in
hash tables tend to form clusters. Any attempt to hash a value into the cluster causes the
item to be placed in an alternate spot and increases the size of the cluster.
Average number of probes in an unsuccessful search using linear probing is around:
$$\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$$
Average number of probes for a successful search using linear probing is around:
$$\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$$
Note that the actual cost of finding an item depends on λ at the time that item was
inserted. For example, if the table was empty when the item was inserted, then any future
search for the item will always find it where the hash function said it should be.
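To get a feel for how quickly these estimates grow, the two approximations can be evaluated directly. The function names below are illustrative, and the formulas only make sense for a load factor λ between 0 and 1 (exclusive of 1).

```cpp
// Approximate average probes for linear probing at load factor
// lambda, where 0 <= lambda < 1.

// Unsuccessful search: 1/2 * (1 + 1/(1 - lambda)^2)
double ProbesUnsuccessful(double lambda) {
    double d = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (d * d));
}

// Successful search: 1/2 * (1 + 1/(1 - lambda))
double ProbesSuccessful(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
```

At λ = 0.5 these give 2.5 and 1.5 probes. At λ = 0.9 they give roughly 50.5 and 5.5, which is why linear probing needs a good share of the table kept empty.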
8.5.3.1 Example of using linear probing to create hash table
Suppose we want to create a hashtable where the key is a string. Suppose that we expect
no more than 100 items. How would we create the hash table using linear probing?
Step 1: decide on the size. Since we are using linear probing, we will want to use a table
at least 30% bigger than 100 items. Generally it is a good idea to choose a hash table size
that is a prime number. In this case 137 would be a good choice.
Step 2: choose a hash function. Choose a hash function that will create a relatively
uniform distribution. This means you should not do something like choose the first letter
(more words begin with a than with z). One simple solution is to use the sum of the
ASCII codes and take it modulo 137 (the array size):
int HashFunction(const char* s){
    int total=0;
    for(int i=0;s[i]!='\0';i++){
        //cast so characters above 127 cannot make the sum negative
        total+=static_cast<unsigned char>(s[i]);
    }
    return total%137;
}
Writing a good hash function is not an easy task. You may wish to see if there are some
that are better than this.
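One weakness of a plain character sum is that any two strings made of the same letters, such as "abc" and "cba", always collide. A possible alternative, shown here only as a sketch, is a polynomial hash that weights each character by its position. The name HashFunction31 and the multiplier 31 are illustrative choices, not part of these notes.

```cpp
// Polynomial string hash: every step multiplies the running total by
// 31 before adding the next character, so character order matters.
// Unsigned arithmetic is used so that overflow wraps around safely.
int HashFunction31(const char* s) {
    unsigned int total = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        total = total * 31 + static_cast<unsigned char>(s[i]);
    }
    return static_cast<int>(total % 137);   // index in 0..136
}
```

With this sketch "abc" and "cba" hash to different indexes, whereas the character sum sends both to the same one.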
Step 3: Initialize the hash table. For the sake of coding, let's create an object that we
can hash. We will define it as:
struct HashItem{
    char key_[100];
    int data1_;
    double data2_;
};
class HashTable{
    HashItem* mytable_[137];   //array of HashItem pointers,
                               //that way a NULL pointer means empty
    int numitems_;
    int Search(char* key);
public:
    HashTable();
    int Insert(HashItem item);
    void Remove(char* key);
    int Retrieve(char* key,HashItem& item);
};
HashTable::HashTable(){
    for(int i=0;i<137;i++){
        mytable_[i]=NULL;
    }
    numitems_=0;
}
Above code assumes a table size of 137 but this could easily be modified to use any
length.
Step 4: Insert an item.
//returns 1 if the item was inserted, 0 if the table is full
//note: does not check whether the key is already in the table
int HashTable::Insert(HashItem item){
    int retval=0;
    if(numitems_<137){
        retval=1;
        numitems_++;
        int idx=HashFunction(item.key_);
        //loop stops when mytable_[idx] is NULL
        for(;mytable_[idx];idx=(idx+1)%137);
        mytable_[idx]=new HashItem;
        *(mytable_[idx])=item;   //copy into new item
    }
    return retval;
}
Step 5: Write a function to do a retrieval (given a key find the data). Since other
functions (like remove()) will also require doing a search, an internal search function
returning the appropriate index will first be written.
//return -1 if not found. (strcmp requires <cstring>)
int HashTable::Search(char* key){
    int retval=-1;
    int start=HashFunction(key);
    if(mytable_[start]){
        //remember strcmp returns 0 if the items are the same
        if(strcmp(mytable_[start]->key_,key)){
            int i;
            //loop stops when we either have searched entire table,
            //find empty space or find a match
            for(i=(start+1)%137;
                i!=start&&mytable_[i]&&
                strcmp(mytable_[i]->key_,key);
                i=(i+1)%137); //endfor
            if(mytable_[i] && !strcmp(mytable_[i]->key_,key))
                retval=i;
        }
        else{
            //item is exactly where the hash function says it is
            retval=start;
        }
    }
    return retval;
}
//returns 1 and copies the item if the key is found, 0 otherwise
int HashTable::Retrieve(char* key,HashItem& item){
    int idx=Search(key);
    if(idx!=-1){
        item=*mytable_[idx];
    }
    return idx==-1?0:1;
}
Step 6: Write a function to do removal. Delete is not as simple as it may seem when
using linear probing because you might have put something in the wrong place.
Consider:
Suppose that we are using an 11-element array. Keys are ints and the hash function is
simply the last digit of the number.
Let's suppose that after doing a number of hashes we have the following array. Items 64
and 85 are not in their proper locations: 64 belongs at index 4 and 85 belongs at index 5,
but both were pushed along by collisions. Every other item is in its proper place:
index:   0   1   2   3   4   5   6   7   8   9  10
item:    -  11   -   -  54  64  76  87  85  99   -
Using linear probing we must be very careful about leaving empty spaces. This is
because when we search we look for empty spaces as an indication that the search is done.
We must not leave an empty space between where an item should be and where an item
actually is.
Example: Suppose we delete 54:
index:   0   1   2   3   4   5   6   7   8   9  10
item:    -  11   -   -   -  64  76  87  85  99   -
We can't just leave the array like this because if we try to look for 64 now, we won't find it since
the space where it should have been is empty.
We will need to move 64 over
index:   0   1   2   3   4   5   6   7   8   9  10
item:    -  11   -   -  64   -  76  87  85  99   -
The problem now of course is that 85 will not be found, because the space at index 5
where its search starts is empty. We will need to move 85 over too.
index:   0   1   2   3   4   5   6   7   8   9  10
item:    -  11   -   -  64  85  76  87   -  99   -
Now it is ok. Remove function is below:
void HashTable::Remove(char* key){
    int rm=Search(key);
    if(rm!=-1){
        delete mytable_[rm];   //delete the appropriate item
        mytable_[rm]=NULL;
        numitems_--;           //one fewer item in the table
        int next=(rm+1)%137;
        while(mytable_[next]){
            int proper=HashFunction(mytable_[next]->key_);
            if(isbetween(rm,proper,next)){
                //move if empty spot is between its proper
                //and current location
                mytable_[rm]=mytable_[next];
                mytable_[next]=NULL;
                rm=next;
            }
            next=(next+1)%137;
        }
    }
}
The isbetween() function returns true if the first parameter is between the second and
third parameters. In other words it is true if rm >= proper && rm < next.
BUT don't forget that our array is circular.
Suppose that proper=136 and next=3. Is rm=0 between proper and next? YES! So the
function is slightly more complex:
int isbetween(int rm, int proper, int next){
    int retval=0;
    if(proper < next){
        if(rm >= proper && rm < next)
            retval=1;
    }
    else if(proper > next){
        if((rm >= proper && rm < 137) ||
           (rm >=0 && rm < next))
            retval=1;
    }
    return retval;
}
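The removal steps can be checked against the small worked example (11 slots, int keys, hash is the last digit) with a compact sketch like the one below. SmallTable, its member names, and the use of -1 as the empty marker are assumptions made for this illustration; it is not the HashTable class from these notes.

```cpp
#include <array>

constexpr int SIZE = 11;    // table size from the worked example
constexpr int EMPTY = -1;   // free-slot marker (keys must be >= 0)

struct SmallTable {
    std::array<int, SIZE> slot;

    SmallTable() { slot.fill(EMPTY); }

    static int Hash(int key) { return key % 10; }   // last digit

    // True if hole lies in the circular range [proper, next).
    static bool IsBetween(int hole, int proper, int next) {
        if (proper < next) return hole >= proper && hole < next;
        if (proper > next) return hole >= proper || hole < next;
        return false;   // item is already in its proper slot
    }

    // Assumes the table is not full and the key is not already present.
    void Insert(int key) {
        int idx = Hash(key);
        while (slot[idx] != EMPTY) idx = (idx + 1) % SIZE;
        slot[idx] = key;
    }

    // Returns the index of key, or -1 if it is not in the table.
    int Search(int key) const {
        int idx = Hash(key);
        for (int n = 0; n < SIZE && slot[idx] != EMPTY; n++) {
            if (slot[idx] == key) return idx;
            idx = (idx + 1) % SIZE;
        }
        return -1;
    }

    void Remove(int key) {
        int hole = Search(key);
        if (hole == -1) return;
        slot[hole] = EMPTY;
        for (int next = (hole + 1) % SIZE; slot[next] != EMPTY;
             next = (next + 1) % SIZE) {
            if (IsBetween(hole, Hash(slot[next]), next)) {
                slot[hole] = slot[next];   // pull the item back
                slot[next] = EMPTY;
                hole = next;               // the hole moves forward
            }
        }
    }
};
```

Inserting 11, 54, 64, 76, 87, 85, 99 in that order and then removing 54 leaves 64 at index 4, 85 at index 5 and index 8 empty, matching the arrays in the example.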
8.5.4 Quadratic Probing
Quadratic probing is similar to linear probing. The difference is that if you were to try
to insert into a space that is filled you would first check 1^2 = 1 element away, then
2^2 = 4 elements away, then 3^2 = 9 elements away, then 4^2 = 16 elements away and so on.
With linear probing we know that we will always find an open spot if one exists (it might
be a long search but we will find it). However, this is not the case with quadratic probing
unless you take care in choosing the table size. For example, consider what would
happen in the following situation:
Table size is 16. The first 5 pieces of data all hash to index 2.
• First piece goes to index 2.
• Second piece goes to 3 ((2+1)%16).
• Third piece goes to 6 ((2+4)%16).
• Fourth piece goes to 11 ((2+9)%16).
• Fifth piece doesn't get inserted because (2+16)%16==2, which is full, so we end up
back where we started and we haven't searched all empty spots (the probes that
follow, 2+25, 2+36, 2+49, ..., only land on 11, 6, 3 and 2 again).
In order to guarantee that quadratic probing will always find an open spot, your table
must meet these requirements:
• Its size must be a prime number
• It must never be more than half full (even by one element)
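The table-size problem can be reproduced with a short sketch. QuadraticInsert is a hypothetical helper that only tracks which slots are used; it gives up after as many probes as there are slots.

```cpp
#include <vector>

// Tries to place one item with home index `home` using quadratic
// probing: home, home+1, home+4, home+9, ... (mod table size).
// Gives up after size probes. Returns the index used, or -1.
int QuadraticInsert(std::vector<bool>& used, int home) {
    int size = static_cast<int>(used.size());
    for (int i = 0; i < size; i++) {
        int idx = (home + i * i) % size;
        if (!used[idx]) {
            used[idx] = true;
            return idx;
        }
    }
    return -1;
}
```

With a table of 16 slots, five items with home index 2 go to 2, 3, 6 and 11, and the fifth fails. With a prime size of 17 the fifth item finds index 1, since the table is still under half full.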
8.5.5 Double Hashing
Double hashing works on a similar idea to linear and quadratic probing. Use a big
table and hash into it. Whenever a collision occurs, choose another spot in the table to
put the value. The difference here is that instead of choosing the next opening, a second
hash function is used to determine the location of the next spot. For example, given hash
functions H1 and H2 and a key, do the following (all locations are taken modulo the
table size):
• Check location H1(key). If it is empty, put the record in it.
• If it is not empty, calculate H2(key).
• Check if H1(key)+H2(key) is open. If it is, put the record there.
• Repeat with H1(key)+2*H2(key), H1(key)+3*H2(key) and so on, until an opening
is found.
Like quadratic probing, you must take care in choosing H2. H2 CANNOT return 0. H2
must also be chosen so that all cells will be probed eventually.
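A minimal sketch of double hashing for non-negative int keys follows. The particular H1 and H2, the EMPTY_SLOT marker and the function name DoubleHashInsert are assumptions for illustration. The sketch relies on a prime table size, so that every step size H2 can return eventually reaches every cell.

```cpp
#include <vector>

constexpr int EMPTY_SLOT = -1;   // marker for a free slot (assumed)

// Hypothetical hash pair for non-negative int keys and table size m.
int H1(int key, int m) { return key % m; }
// Never returns 0: the result is always in the range 1..m-1.
int H2(int key, int m) { return 1 + key % (m - 1); }

// Returns the index where key was stored, or -1 if no free slot was
// found. Assumes table.size() is a prime number.
int DoubleHashInsert(std::vector<int>& table, int key) {
    int m = static_cast<int>(table.size());
    int idx = H1(key, m);
    int step = H2(key, m);
    for (int i = 0; i < m; i++) {
        if (table[idx] == EMPTY_SLOT) {
            table[idx] = key;
            return idx;
        }
        idx = (idx + step) % m;   // H1 + i*H2, wrapped around
    }
    return -1;
}
```

With a table of 11 slots, the keys 22, 33 and 44 all have H1 equal to 0 but end up at indexes 0, 4 and 5, because each key has its own step size. Keys that collide at the same spot therefore do not follow the same probe sequence.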