
PROBLEM: Search for a symbol among a group of symbols in “constant” time: HASHING

You have already seen that an unsorted array can be searched for a single value in linear time (O(N)). This is called linear search. A sorted array can be searched in logarithmic time (O(log N)). This is called binary search. A binary search tree of good form can also be searched in logarithmic time (O(log N)). Remember that an AVL tree is one form that is guaranteed to be well balanced, and thus searchable in logarithmic time.

Hashing is a technique that can search for a value in constant time (O(c)), under the right circumstances. (Note the qualifying phrase.) Hashing is an address calculation technique, and like most such techniques, it is not guaranteed to work well regardless of the data. It is the task of the programmer to set up the parameters of the hash in such a manner that the desired results and the desired search speed will be achieved. The primary factor the programmer must consider while setting up the hash is the number of unique symbols expected.

There are two basic forms that hashing takes. One form is dynamic in nature and allocates space for the symbols on the fly as the program executes. I will call this Open Hashing. The second form is static in nature, and space for all the symbols must be allocated before inserting any. I will call this form Closed Hashing.

Unfortunately, there are no standardized terms for hashing that are used in all or even most books.

OPEN/DYNAMIC HASHING

(Also called separate chaining, bucket hashing, etc.)

This technique is actually very simple. Symbols are placed in one of a series of linked lists (also called bins or buckets). Typically an array of linked-list headers is allocated, one for each bin.

1. Decide on the number of bins. Here is where you get O(c) time. Suppose you expect 200 unique symbols and you want the expected number of comparisons required to search a bin to be no more than 5. If the bins are ordered, and the average bin length is 10, then the expected number of comparisons required to locate a symbol in a linked list of size 10 is 5 comparisons. So if you expect 200 unique symbols and want an average bin length of 10, you should allocate 20 bins. Notice that the qualifying phrase “under the right circumstances” is needed. If the number of symbols exceeds expectations, or if your symbols are not scattered evenly across the bins, then the bins will get longer than expected and be slower to search.

2. Create a hash function that is sent a symbol and returns a bin number. If the symbols to be hashed are strings, the hash function usually adds up some or all of the ASCII values in the symbol, and then returns the sum MOD the number of bins. If the symbols are numbers, then the MOD function alone suffices. Regardless, the hash function must have the following properties:

a. The hash function H should map the symbol S into a bin number: H(S) => 0..B-1

b. It should be fast, preferably running in constant time.

c. It should be deterministic. The same symbol must always return the same bin number. (This is often called “pseudo-random”.)

d. It should provide uniform scattering among the bins, so that the bins will be approximately the same length.

Below is a sample hash function in C++ that assumes the symbol is a string.

In Java, various methods are possible depending upon whether the string is the object or a parameter. I included one variation below that assumes the string is the object (but have not checked it). Strings are not objects in C++.

// C++ version: the symbol is a null-terminated character array.
int H( char symbol[], const int NUM_BINS )
{
    int sum = 0, k = 0;
    while( symbol[k] != 0 )
        sum += int( symbol[k++] );
    return sum % NUM_BINS;
}

// Java variation: assumes the string is the object (see the note above).
public int H( int NUM_BINS )
{
    int sum = 0;
    int k;
    for( k = 0; k < this.length(); k++ )
        sum += (int)this.charAt(k);
    return sum % NUM_BINS;
}

3. As each symbol is encountered, the hash function is called to determine the bin number. The appropriate bin (usually a linked list) is scanned. A new symbol is inserted in the list in order. A known symbol is not added a second time to the list, but other data associated with the symbol may be updated.
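To tie steps 1 through 3 together, here is a minimal open-hashing sketch in C++. The names (Node, Bin, Search), the 32-character symbol limit, and the choice of 20 bins (from the 200-symbol example above) are mine, not from the notes, and for brevity the bins here are unordered rather than kept in order as step 3 suggests.

#include <cstring>

const int NUM_BINS = 20;            // e.g. 200 expected symbols / average bin length 10

struct Node                         // one symbol in a bin
{
    char  sym[32];
    Node* next;
};

Node* Bin[NUM_BINS];                // the array of linked-list headers (all start out null)

int H( const char symbol[] )        // same idea as the sample hash function above
{
    int sum = 0, k = 0;
    while( symbol[k] != 0 )
        sum += int( symbol[k++] );
    return sum % NUM_BINS;
}

bool Search( const char s[] )       // step 3: hash to a bin, then scan only that bin
{
    for( Node* p = Bin[ H(s) ]; p != 0; p = p->next )
        if( strcmp( p->sym, s ) == 0 )
            return true;            // known symbol: do not add it a second time
    return false;                   // not found; an insert routine would add it here
}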

CLOSED/STATIC HASHING

(There are many variations of closed hashing. The particular version shown below is variously called linear quotient hashing, double hashing, quadratic hashing, etc.)

1. ARRAY SIZE. In closed hashing there are no bins. Instead, there is an array of a predetermined size. Each slot in the array can hold one symbol (as well as whatever else about the symbol you want to preserve). Thus the total number of unique symbols stored cannot be more than the number of slots in the array. For reasons that will become clear later, the size of the array should be prime. Furthermore, the size of the array should be chosen so that it is approximately twice the expected number of unique symbols. In this way the array will never be more than half full, unless more symbols are encountered than expected.

2. StartAt, the first hash function. This hash function is sent the symbol and returns the starting location where the hash will begin. It returns a value in the range 0..NumSlots-1.

3. StepSize, the second hash function. This hash function determines the step to be made for all additional lookups if there is a collision at the first location. It should be different from the first function, and it must return a value in the range 1..NumSlots-1. The step cannot be zero.

4. A C++ version of the algorithm is given below. This version is correct, but not entirely satisfactory. First, the second hash function should only be called if the first lookup generates a collision. Second, setting and testing firstTry slows down the algorithm. The algorithm can be made much faster (albeit longer) if rewritten so that the first lookup code is separate from the loop providing all the other lookups; a sketch of that rewrite follows the code. What do you think about keeping track of the number of insertions? Is the result worth the time required?

Globals:

const int NUM_SLOTS;          // size of the array (choose a prime)
int NumInsertions;            // total # of unique symbols inserted so far
InfoStruct Slot[NUM_SLOTS];

bool Hash( string s )
{
    bool firstTry = true;
    int loc, step;

    if( NumInsertions >= NUM_SLOTS )
        return false;
    else
    {
        loc  = StartAt( s );
        step = StepSize( s );
        while( true )
        {
            if( ! firstTry )
                loc = ( loc + step ) % NUM_SLOTS;

            if( strcmp( Slot[ loc ].sym, s.c_str() ) == 0 )
            {   // found the symbol s here.  Do whatever is
                // necessary, then
                return true;
            }
            else if( strlen( Slot[ loc ].sym ) == 0 )
            {   // found an empty slot.  Insert s here, then
                NumInsertions++;
                return true;
            }
            else // collision
            {
                firstTry = false;
            }
        }
    }
}
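For what it is worth, the faster rewrite mentioned in step 4 might look roughly like this. It is only a sketch, reusing the globals, StartAt, and StepSize assumed by the code above; the name FastHash is mine. The first lookup is handled once, before the loop, so firstTry disappears and StepSize is called only after a collision.

bool FastHash( string s )                   // hypothetical name for the rewritten routine
{
    if( NumInsertions >= NUM_SLOTS )
        return false;

    int loc = StartAt( s );                 // first lookup only

    if( strcmp( Slot[ loc ].sym, s.c_str() ) == 0 )
        return true;                        // found on the first try
    if( strlen( Slot[ loc ].sym ) == 0 )
    {
        NumInsertions++;                    // empty slot on the first try; insert s here
        return true;
    }

    int step = StepSize( s );               // only computed after a collision

    while( true )
    {
        loc = ( loc + step ) % NUM_SLOTS;
        if( strcmp( Slot[ loc ].sym, s.c_str() ) == 0 )
            return true;                    // found the symbol s here
        if( strlen( Slot[ loc ].sym ) == 0 )
        {
            NumInsertions++;                // found an empty slot; insert s here
            return true;
        }
        // otherwise another collision: loop and take another step
    }
}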

COMMENTS:

Language Support. As with many other data structures, Java offers a HashSet class that is an extension of the AbstractSet class. I have not used it. It looks interesting and easy, but by no means optimal.

What is the Array Size?

The array size should be a prime number, as all slots will be examined only if the step size and the array size are relatively prime. A prime number is relatively prime to all natural numbers less than itself, so the variation of the step size cannot cause a problem if the array size is prime.
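A small worked example (the numbers are mine): with 12 slots and a step of 4, probing from location 3 visits 3, 7, 11, 3, 7, 11, ... so only three of the twelve slots are ever examined, because gcd(4, 12) = 4. With a prime size such as 13, every step from 1 to 12 is relatively prime to 13, so the sequence loc, loc+step, loc+2*step, ... (mod 13) reaches all thirteen slots before it repeats.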

Why Two Hash Functions?

There is nothing magical about having two hash functions. We could get by with only one function and have a fixed step size to be added in if there is a collision. This works but produces clustering.

Alternatively, we could have written and used three hash functions or more. The first could give the first place to look. The second could give the second place to look. The third could give the step if the first two lookups fail. Implementing two hash functions seems to be the best compromise between the desire to maximize speed, and the need to generate uniform scattering and minimize clustering. Even if two lookups collide once, they will likely not collide again, as they are unlikely to have the same step. Similarly, if two symbols produce the same step, they will likely not start at the same location. One common method of writing the two hash functions is to add up the characters with even indices in one hash function and the characters with odd indices in the other.

Notice that none of the hash functions discussed here actually performs in constant time even though the discussion suggests that this is desirable. The execution time of both candidate functions, as well as of the sample hash function above, is linearly proportional to the length of the symbol. To make the function faster, one must limit the number of characters used. You might use the first four characters for one function and the last four characters for the other. You might use characters 1, 2, 4, 8, etc., up to the length of the symbol. (Notice this requires logarithmic time, not constant time.)
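Putting the last two suggestions together, a StartAt/StepSize pair might look something like the sketch below: one function adds the characters at even indices, the other the characters at odd indices, and both stop after a fixed number of characters so each call does a bounded amount of work. The cap of 8 characters and the function bodies are my own illustration, reusing the NUM_SLOTS global from the algorithm above.

const int MAX_CHARS = 8;                         // my choice; not from the notes

int StartAt( const string& s )
{
    int sum = 0;
    for( int k = 0; k < (int)s.length() && k < MAX_CHARS; k += 2 )   // even indices
        sum += (int)s[k];
    return sum % NUM_SLOTS;                      // range 0..NUM_SLOTS-1
}

int StepSize( const string& s )
{
    int sum = 0;
    for( int k = 1; k < (int)s.length() && k < MAX_CHARS; k += 2 )   // odd indices
        sum += (int)s[k];
    return 1 + sum % ( NUM_SLOTS - 1 );          // range 1..NUM_SLOTS-1; never zero
}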

Table half full.

Why should the table be no more than half full? It can be shown that the performance of closed hashing is quite acceptable up to about 80% full, but the degradation is non-linear after that. Thus, if you were 15% off in your estimate, the table could get as much as 95% full, and performance would be quite unsatisfactory. If you were 20% off, the table could even fill up. If you do not keep track of the number of insertions, this could produce an infinite loop. By attempting to make the table at most half full, you keep the number of collisions very low, and the hash procedure will still be acceptably fast even if you are as much as 50% off in your estimate. If the table is half full, then the expected number of lookups required before completing the hash should be no more than 2 (here is our constant time). Note that a good estimation of the number of unique symbols, combined with a table size that should produce a table that is no more than half full, makes the Infinite Loop paragraph below moot.
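Where does the figure of about 2 lookups come from? Under the usual uniform-hashing approximation (an assumption of mine; the notes do not state it), the expected number of probes for an unsuccessful search is roughly 1/(1 - load factor). At half full that is 1/(1 - 0.5) = 2; at 80% full it is 1/(1 - 0.8) = 5; at 95% full it is 1/(1 - 0.95) = 20, which is why the degradation described above is so sharply non-linear.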

Symbol Deletion . Note that a symbol can be removed from a bin in open hashing, but cannot be removed from its slot in closed hashing. If absolutely necessary, a dead bit can be added to the information stored in the slot. If turned on, the dead bit indicates the slot is unavailable, but no longer holds a symbol.
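A minimal sketch of the dead-bit idea, with field names of my own choosing (the notes do not define InfoStruct):

struct InfoStruct
{
    char sym[32];     // the symbol stored in this slot ("" if the slot was never used)
    bool dead;        // true means the slot once held a symbol that has been deleted:
                      // a search must keep probing past it, but an insertion may reuse it
};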

Infinite Loop . In closed hashing, it is possible for the array to fill up if more symbols are encountered than expected. If your algorithm makes no provision for this case, the program will enter an infinite loop. There are three basic ways of guaranteeing that this will not occur:

1. Keep track of the number of unique symbols inserted, and test this counter before entering the hash function. (This choice is the fastest, but it will not permit searches for existing symbols once the table is full.)

2. Hold the initial search location, the one returned by StartAt. If the collision loop should get back to this initial location, the table is full.

3. Keep a count of the number of times through the collision loop on the current symbol. If the count gets up to the table size, the table is full.
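As an illustration, option 3 is easy to bolt onto the routine given earlier. The following is a sketch only (HashWithGuard is my name; the globals, StartAt, and StepSize are those assumed above):

bool HashWithGuard( string s )
{
    int loc   = StartAt( s );
    int step  = StepSize( s );
    int tries = 0;

    while( tries < NUM_SLOTS )                        // at most one probe per slot
    {
        if( strcmp( Slot[ loc ].sym, s.c_str() ) == 0 )
            return true;                              // found the symbol s
        if( strlen( Slot[ loc ].sym ) == 0 )
        {
            NumInsertions++;                          // empty slot: insert s here
            return true;
        }
        loc = ( loc + step ) % NUM_SLOTS;             // collision: take a step
        tries++;
    }
    return false;                                     // every slot examined: table is full
}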
