Hashing - MHS Comp Sci

Hashing
Purpose:
In this lecture we will discuss another data structure, the Hash Table.
We will also learn how to use Java’s Map and Set implementations in
the HashSet and HashMap classes.
Resources:
Barrons Chapter 11 p.374 – 386 (exclude treeset,treemap)
Chapter 12 p.424
Lambert Fundamentals Comprehensive Lesson 17 p.567
C++ Notes Chapter 26
Java Essentials Study Guide Chapter 17 p.303 & Chapter 20 p.370
Java Methods Chapter 6 p.151
Litvin Be Prepared Chapter 5 p.137
Handouts:
YOU MUST BRING YOUR BARRONS TEXT TO EACH CLASS !!!
1. main.java & myStuff.java (hashing ZIP file)
2. Hashing --- Illustration.doc
Intro:
We have discussed various data structures like the List implementations ArrayList
and LinkedList. We have also discussed Stacks and Queues and will soon learn
about Binary trees.
With these structures we can iterate over the entire structure and determine if a
specific value is in the set.
As an example with a structure of strings, we can maintain a structure of domain
names and determine if a given name has already been assigned. However, we do
not know anything about the user who owns it.
In another example, we can have a structure of dictionary words. We can
determine if a given word is spelled correctly, but if we also wanted to get the
meaning, pronunciation or derivation of the word these current structures would
come up short.
If we were to store these in an ordered fashion, how would we do it so that we
can retrieve specific information quickly (i.e., in less than linear time)?
These requirements lead us to utilizing a structure that is more elaborate, such as a
Map.
A Map allows us to associate a Key with an object.
Example:
Key / Index (Lot Number): A140
Ultimately Links to a HomeOwnerInfo Object:
Lot Number        A140
Name              Smith, Joe
Address           120 East End Avenue
Phone             973-333-5555
Value             $420,000
Property Taxes    $11,000
Family Income     $210,000
Children          3
Example:
Key / Index (ID): NaCl
Ultimately Links to a Compound Object
The compound Sodium Chloride has the following:
ID      Bonding   Mol Wt    Den          Sol
NaCl    Ionic     58.5 g    1.54 g/ml    YES
Databases are based on this principle as we can perform searches on the existence
of specific objects by searching against an INDEX (key) that provides a LINK to
the actual data (object)
In these examples, the Index or KEY is stored SEPARATE from the data to
which it will ultimately point
This structure allows us to maintain the physical data in a separate storage
location
The Index or Key provides a link to the data
We can have multiple / separate INDICES that work against a single set of
objects
For example, we can store objects that maintain information on homeowners
We can keep their name, address, lot number, home value, tax base, income,
number of children, etc
We might wish to access this information in different ways
Maybe we want to search by phone number or Lot number
Key / Index (Phone Number): 973-333-5555
Ultimately Links to a HomeOwnerInfo Object:
Lot Number        A140
Name              Smith, Joe
Address           120 East End Avenue
Phone             973-333-5555
Value             $420,000
Property Taxes    $11,000
Family Income     $210,000
Children          3
Maybe we want to get information on all homes worth over $500,000
If we were to attempt to store this information in a linked list or an array we
would have difficulty implementing efficient search (or sort) processes that could
perform searches based on different pieces of data
If we were to sort this data, it could only be sorted on one piece of information
(Lot Number, for example), and further changes to the elements would require re-sorting
This is where a Map implementation is best used
This lecture will focus on this type of implementation including Hash Tables,
HashSet and HashMap
Hashing: (Barrons p.424 thru 426)
A System of mapping from KEYS to integer indices in a table
The goal is to Map all possible KEY values into a smaller set of indices & to
cover that range uniformly
The hash algorithm will convert a KEY ( SSN, UPC, Account Number) into a
representation of a specific location to store or find that information (converts a
KEY into a location in the hash table). At that spot in the table is the address of
the associated object. This tells us where to look for a specific item or where to
insert an item. It always returns an integer
The “perfect hash function” is one where it yields a 1 to 1 mapping from the index
elements to the integers starting at 0 and ending at the last element in the set
(array, list)
However, there is no known systematic process that can be used to generate a
perfect hash function from an arbitrary set of values
Therefore we will have to account for and resolve Collisions when several
different Keys map to the same position in the Hash Table
Example:
Using our Homeowner Database for Example, we can write our own “hashing
algorithm” that converts a given Key, Lot Number for example, into an integer
value that corresponds to an index in an Array or ArrayList
We MUST make certain assumptions; we MUST understand our data so we can
estimate its load
In this example, let's assume that our universe of LOTS in Millburn is
approximately 1,000
So, let's count on an array (to hold the Key and address of related
HomeOwnerInfo) that can hold about 1,500 elements
This will allow us to spread out our data so that we can minimize situations where
our Keys “hash” to the same index on the array (a Collision)
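Our “Hashing Algorithm” is simple: we take the numeric value of the Lot and
add in the ASCII value of the letter. Given this:
A140 will “hash” to the integer value 205 (140 + 65)
A151 will “hash” to the integer value 216 (151 + 65)
B140 will “hash” to the integer value 206 (140 + 66)
C150 will “hash” to the integer value 217 (150 + 67)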
So, the HomeOwnerInfo along with the Key will be inserted into the array, known
as our “Hash Table”, as follows:
Index #    HomeOwnerInfo Object with a Key of:
205        A140
206        B140
207
208
209
210
211
212
213
214
215
216        A151
217        C150
So if we were looking for HomeOwner Information for lot Number C150
All we need to do is “Hash” the Lot number which will result in the integer 217
We can then access the Homeowner information as follows
MyHomeownerInfoArray[ hashedInteger]
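The lot-number scheme above can be sketched in Java roughly as follows. This is only an illustration under the lecture's assumptions (a lot number is one uppercase letter followed by digits, and the table has 1,500 slots). The names LotHash, hash and TABLE_SIZE are made up for this sketch, and plain Strings stand in for HomeOwnerInfo objects.

```java
// Illustrative sketch only. LotHash, hash and TABLE_SIZE are hypothetical
// names; lot numbers are assumed to be one uppercase letter followed by digits.
class LotHash {
    static final int TABLE_SIZE = 1500; // array size chosen in the example above

    // Adds the ASCII value of the leading letter to the numeric part of the lot.
    static int hash(String lot) {
        int letter = lot.charAt(0);                       // 'A' is 65
        int number = Integer.parseInt(lot.substring(1));  // "140" becomes 140
        // The modulo is an added safeguard (not in the lecture) that keeps
        // the result inside the table.
        return (letter + number) % TABLE_SIZE;
    }

    public static void main(String[] args) {
        // Strings stand in for HomeOwnerInfo objects here.
        String[] table = new String[TABLE_SIZE];
        table[hash("A140")] = "Smith, Joe";

        System.out.println(hash("A140"));         // 205
        System.out.println(hash("C150"));         // 217
        System.out.println(table[hash("A140")]);  // Smith, Joe
    }
}
```

Note that this simple function does nothing about collisions: two lots whose letter and number add up to the same total would land in the same slot.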
Hash Tables:
Typically a fixed sized array that contains an integer representation of a KEY
A well balanced Hash Table hinges upon the proper handling of two major issues:
Deciding on a solid Hash Function
Building an Algorithm for dealing with Collisions
The KEY can be SSN’s, last names, UPC Codes
When we retrieve an element we need to verify that its KEY matches the target so
the KEY must be explicitly stored in the table along with the address of the rest of
the record
Hash Functions:
Converts a KEY into an integer (hashed) where the integer ranges from 0
to one less than the size of the table
Properties of a good hash function:
Easy and fast to compute
Scatter the data evenly throughout the hash table (uniform)
Select a data structure that has more space than actually required
Develop a function to compute the hash address (value)
Minimize collisions
For example, if our Key is a String we could slice the String into parts and add
them (using their ASCII values)
For Example, the String containing SSN can be broken down into parts
133-56-7878
mod the first part 133 % 100 = 33
reverse the second part 56 = 65
int divide 3rd part by 100 = 78
The hashed value for 133-56-7878 is 176 (33 + 65 + 78)
How good a hash function this is will depend on how evenly it scatters the
data over the array and how well it minimizes any collisions
The result MUST be an integer that does not exceed the range of the Hash Table
This method of manipulating the key is given the term “hashing”
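As a rough sketch, the SSN slicing described above might look like this in Java. The class and method names (SsnHash, hash) are invented for the illustration, and the input is assumed to always have the form ddd-dd-dddd.

```java
// Illustrative sketch only. SsnHash and hash are hypothetical names; the SSN
// is assumed to be a String of the form "ddd-dd-dddd".
class SsnHash {
    static int hash(String ssn) {
        String[] parts = ssn.split("-");

        // mod the first part: 133 % 100 = 33
        int first = Integer.parseInt(parts[0]) % 100;

        // reverse the second part: "56" becomes 65
        String reversed = new StringBuilder(parts[1]).reverse().toString();
        int second = Integer.parseInt(reversed);

        // integer-divide the third part by 100: 7878 / 100 = 78
        int third = Integer.parseInt(parts[2]) / 100;

        return first + second + third;
    }

    public static void main(String[] args) {
        System.out.println(hash("133-56-7878")); // 176
    }
}
```

Each of the three parts is at most 99, so the result never exceeds 297. A table with at least 298 slots would be needed for this particular function.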
Common hash functions are:
Numeric / Division:
MOD the KEY by an integer equal to the size of the array
KEY % (#elements)
Example:
UPC # 1966211001
ArraySize 1500
Hash Value = UPC % Size = 501
Alpha:
Hash the sum of ASCII values of its characters
MidSquare:
Square the KEY and keep the middle digits of the result as the
Hashed value
Works better with smaller values (less than 10,000)
Example:
number 9876
9876 ^ 2 = 97,535,376
353 (the middle digits) becomes the hashed value
Folding:
Divide KEY into several parts
Each of which are combined to provide the hashed value
Example:
Social Security Number :
387-58-1505
hash as sum of three integers:
387 + 58 + 1505 = 1950
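The four common techniques above could be sketched in Java along these lines. All names here (CommonHashes, division, alpha, midSquare, folding) are hypothetical, and the choice of which three "middle" digits to keep in midSquare is an assumption made so that it reproduces the 9876 example.

```java
// Illustrative sketch only. All class and method names are hypothetical.
class CommonHashes {
    // Numeric / Division: KEY % (#elements)
    static int division(long key, int tableSize) {
        return (int) (key % tableSize);
    }

    // Alpha: sum of the character codes, reduced to the table size
    static int alpha(String key, int tableSize) {
        int sum = 0;
        for (char c : key.toCharArray()) {
            sum += c;
        }
        return sum % tableSize;
    }

    // MidSquare: square the key and keep three digits from the middle.
    // Assumes a non-negative key; the exact digits kept are a design choice.
    static int midSquare(int key) {
        String squared = Long.toString((long) key * key);
        if (squared.length() <= 3) {
            return Integer.parseInt(squared);
        }
        int start = (squared.length() - 2) / 2;
        return Integer.parseInt(squared.substring(start, start + 3));
    }

    // Folding: split the key on '-' and add the parts
    static int folding(String key) {
        int sum = 0;
        for (String part : key.split("-")) {
            sum += Integer.parseInt(part);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(division(1966211001L, 1500)); // 501
        System.out.println(alpha("AB", 1500));           // 131 (65 + 66)
        System.out.println(midSquare(9876));             // 353
        System.out.println(folding("387-58-1505"));      // 1950
    }
}
```

The folding result (1950) is larger than a 1,500 element table, so in practice it would still have to be reduced, for example with the division method.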
The record that the KEY leads to holds everything you need for a given item
(price, item name, etc.)
NOTE: Java classes like String and Integer provide a hashCode method
that hashes the object and returns an integer --- SEE the Java Docs for the String class
Example:
Bar Coding of items in a supermarket
UPC codes allow for up to 10 billion items (10 digit code)
The average store has approximately 10,000 items
If the program that scans these items had to search through all 10 billion
possibilities it would be very inefficient (similar to the MBS 2D
environment)
We can store the UPC codes, specific to that store, in an array called
the HASH TABLE
We typically size the hash table with more elements (items) than the
initial universe of elements (KEYS)
We could size our array at 15,000 elements
The HASH Function will tell us where a specific item is stored in the
15,000 element Array
UPC           Hash Value (UPC % 15000)
1966211001    11001
1966211011    11011
1966211021    11021
1966211031    11031
So, if we were to add in information on Products Keyed by UPC code
into a hash table, we could do so as follows:
MyHashTable[myProduct.getUPC( ) % 15000] = myProduct;
To retrieve product price for a given product you can:
priceOfProduct = MyHashTable[1966211011 % 15000].getPrice( );
Using our HomeOwnerInfo Example:
So, if we were to add in information on HomeOwners Keyed by Lot Number
into a hash table, we could do so as follows:
aString = myHomeOwnerInfo.getLot( );
index = // break up the string and calculate the hash value;
MyHashTable[index] = myHomeOwnerInfo;
To retrieve Lot value for a given home you can:
lotValue = MyHashTable[index].getValue( );
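To make the UPC fragments above concrete, here is a minimal runnable sketch. The Product and ProductTable classes, their methods, and the price used are all invented for the illustration. The UPC is held in a long because a 10 digit code can exceed the range of an int, and collisions are deliberately ignored at this stage.

```java
// Illustrative sketch only. Product, ProductTable and the sample price are
// hypothetical; collisions are ignored here.
class Product {
    private final long upc;
    private final double price;

    Product(long upc, double price) {
        this.upc = upc;
        this.price = price;
    }

    long getUPC() { return upc; }
    double getPrice() { return price; }
}

class ProductTable {
    static final int SIZE = 15000;
    private final Product[] table = new Product[SIZE];

    // Division hash: UPC % table size
    static int indexFor(long upc) {
        return (int) (upc % SIZE);
    }

    void add(Product p) {
        table[indexFor(p.getUPC())] = p;
    }

    // The stored key is checked against the target, since a different UPC
    // could hash to the same slot.
    Product find(long upc) {
        Product p = table[indexFor(upc)];
        return (p != null && p.getUPC() == upc) ? p : null;
    }

    public static void main(String[] args) {
        ProductTable store = new ProductTable();
        store.add(new Product(1966211011L, 2.49));
        System.out.println(indexFor(1966211011L));               // 11011
        System.out.println(store.find(1966211011L).getPrice());  // 2.49
    }
}
```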
Collisions:
A problem occurs when 2 different keys map to the same hash value, that is, the
same element (location) in the table
This shows up when we try to insert a new element into the table and that location is
already occupied
Example, if we used a hashing function that combines Folding with Division:
UPC 70662 11001
Group into pairs: 70 66 21 10 01
Multiply the first three pairs together
70 X 66 X 21 = 97020
Add this number to the last two pairs:
97020 + 10 + 01 = 97031
Find the remainder of mod division by 14997 (15000 – 3)
97031 % 14997 = 7049
What happens when we have an item with the bar code 66702 10110 and we use
the same hash function to code it:
66 70 21 01 10
66 X 70 X 21 = 97020
97020 + 1 + 10 = 97031
97031 % 14997 = 7049
This is the same address as the previous bar code. When this event occurs, two
values need to be stored in the same hash address. This is called a collision (or
hash clash)
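The collision worked through above can be reproduced with a short sketch. The class name UpcFoldHash and its hash method are hypothetical, and the bar code is assumed to be exactly ten digits, optionally separated by a space.

```java
// Illustrative sketch only. UpcFoldHash is a hypothetical name; the bar code
// is assumed to contain exactly ten digits.
class UpcFoldHash {
    static int hash(String upc) {
        String digits = upc.replace(" ", "");

        // Group into five pairs of digits
        int[] pairs = new int[5];
        for (int i = 0; i < 5; i++) {
            pairs[i] = Integer.parseInt(digits.substring(2 * i, 2 * i + 2));
        }

        // Multiply the first three pairs, then add the last two
        int folded = pairs[0] * pairs[1] * pairs[2] + pairs[3] + pairs[4];

        // Division step: remainder of mod division by 14997
        return folded % 14997;
    }

    public static void main(String[] args) {
        System.out.println(hash("70662 11001")); // 7049
        System.out.println(hash("66702 10110")); // 7049, a collision
    }
}
```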
One reason why our table size is 15000 and not 10000 is to help avoid collisions.
The smaller the number of possible addresses the higher the probability of a
collision.
In order for a hash table to work well, it is important that the programmer
has a good estimate, in advance, of the number of items that will be stored in the table
There are several ways to resolve a Collision:
Chaining
With Chaining, we implement our Hash Table as an Array of Linked Lists
When we have Keys that map to the same Index, we add each one to that index's
List
Table entries in this structure are called “Buckets”
Chaining is good with densely populated hash tables
However, the retrieval and insertion of chained elements is more involved
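A minimal sketch of chaining, assuming String keys and using java.util.LinkedList for the buckets, might look like the following. ChainedTable and its methods are hypothetical names, and Math.floorMod is just one simple way to turn a hashCode into a valid index.

```java
import java.util.LinkedList;

// Illustrative sketch only. ChainedTable is a hypothetical class that stores
// String keys in an array of linked lists (the "buckets").
class ChainedTable {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // floorMod keeps the index non-negative even for negative hash codes
    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Adds the key to its bucket's list unless it is already there
    boolean add(String key) {
        LinkedList<String> bucket = buckets[indexFor(key)];
        if (bucket.contains(key)) {
            return false;
        }
        bucket.add(key);
        return true;
    }

    // Only the one bucket the key hashes to has to be searched
    boolean contains(String key) {
        return buckets[indexFor(key)].contains(key);
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // one bucket forces collisions
        table.add("A140");
        table.add("B140");
        System.out.println(table.contains("A140")); // true
        System.out.println(table.contains("C150")); // false
    }
}
```

With a single bucket every key collides, which shows why a crowded chained table degrades toward a linear search of one long list.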
Probing
With probing we store the colliding element in a different slot of the same
hash table
Calculate the index into the table using the hash function, if that element is
occupied a probing function is used to convert that index into a new index
and repeat until an empty slot is located
Probing should be used only with sparsely populated hash tables so any
Probing sequences are short
Example:
int index = hashCode(target.getKey( ));
while (hashTable[index] != null)
    index = probe(index);      // slot is occupied, so try another slot
hashTable[index] = target;     // store the element in the empty slot
The same function must be used to locate an element:
int index = hashCode(key);
while (hashTable[index] != null &&
       !key.equals(hashTable[index].getKey( )))
    index = probe(index);
target = hashTable[index];     // null if the key is not in the table
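A runnable version of the probing idea is sketched below. It assumes linear probing (move to the next slot, wrapping around at the end), which is only one possible probing function, and it assumes the table is never allowed to fill completely. ProbingTable and its methods are hypothetical names.

```java
// Illustrative sketch only. ProbingTable is hypothetical and uses linear
// probing; it assumes the table always keeps at least one empty slot,
// otherwise the loops below would never end.
class ProbingTable {
    private final String[] keys;
    private final String[] values;

    ProbingTable(int size) {
        keys = new String[size];
        values = new String[size];
    }

    private int hash(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Linear probe: step to the next slot, wrapping to 0 at the end
    private int probe(int index) {
        return (index + 1) % keys.length;
    }

    void put(String key, String value) {
        int index = hash(key);
        while (keys[index] != null && !keys[index].equals(key)) {
            index = probe(index);
        }
        keys[index] = key;      // the key is stored so it can be verified later
        values[index] = value;
    }

    String get(String key) {
        int index = hash(key);
        while (keys[index] != null && !keys[index].equals(key)) {
            index = probe(index);
        }
        return keys[index] == null ? null : values[index];
    }

    public static void main(String[] args) {
        ProbingTable table = new ProbingTable(7);
        table.put("A140", "Smith, Joe");
        System.out.println(table.get("A140")); // Smith, Joe
        System.out.println(table.get("Z999")); // null
    }
}
```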
Review Example on Hash Coding in Barrons P.422 to 426
Load Factor:
A measure of how full a Hash Table gets before capacity is doubled (rehash)
Many collisions degrade a Hash Table's performance
If the hash table resolves collisions via Chaining then the ratio of entries in the
table to the total number of “buckets” is called the Hash Table’s Load Factor
The Load Factor determines how full the table may get BEFORE the Map's
capacity is increased
A small Load Factor means that there is significant wasted space in the Hash
Table
A high Load Factor means that the advantages of the Hash Table are minimized
Reasonable Load Factor is approximately .75 as it represents a good tradeoff
between time and space costs. The higher the Load Factor the denser the keys
and therefore the higher incidence of collisions
Java’s HashSet and HashMap take in maximum Load Factors in the constructor
but have a default Load Factor of .75
Initial Capacity Represents the number of openings in a HashTable
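The arithmetic behind the Load Factor can be sketched as follows. LoadFactorDemo and its two helper methods are hypothetical, and the threshold calculation is a simplified model of when a table would be expected to grow, not a description of Java's internal code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only. LoadFactorDemo and its helpers are hypothetical.
class LoadFactorDemo {
    // Load factor = number of entries / number of buckets
    static double loadFactor(int entries, int buckets) {
        return (double) entries / buckets;
    }

    // Simplified model: the number of entries a table can hold before its
    // capacity would be increased
    static int threshold(int capacity, float maxLoadFactor) {
        return (int) (capacity * maxLoadFactor);
    }

    public static void main(String[] args) {
        // 10,000 store items in a 15,000 element table
        System.out.println(loadFactor(10000, 15000));

        // Default capacity 16 with the default load factor of .75
        System.out.println(threshold(16, 0.75f)); // 12

        // Initial capacity and load factor passed to the constructor;
        // the load factor argument is a float
        Set<String> names = new HashSet<>(200, 0.75f);
        names.add("Larry");
        System.out.println(names.size()); // 1
    }
}
```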
HashSet:
Remember that the Set interface extends the Collection interface
Think of a Math Set
Definition: a collection that contains NO DUPLICATES of an Object
For example the input of: 1, 3, 5, 6, 7, 7, 8, 2, 9
Has a set of: 1, 2, 3, 5, 6, 7, 8, 9
class java.util.HashSet implements java.util.Set
This Generic class is implemented with a Hash table
Each entry in the HashSet is a single Object that can be hashed (there is no separate key and value)
With a HashSet (unlike the HashMap), you do not select a “key” to hash by; the object
is hashed based on its implementation of the hashCode method
The HashSet implements the Set behaviors:
boolean add(E x)
adds element if unique otherwise leaves set unchanged
boolean contains(Object x)
determines if a given object is an element of the set
boolean remove(Object x)
removes the element from the set or leaves set unchanged
int size( )
number of elements in the set
Iterator <E> iterator( )
allows for set traversal
Object[] toArray( ) OR <T> T[] toArray(T[] a)
Returns the elements in the set as an array
HashSet has a default constructor that creates an empty Hash Table with a default
capacity and Load Factor (16, .75)
You may set the initial capacity by using the overloaded constructor
HashSet myHash = new HashSet(200);
To avoid unnecessary reallocation and rehashing of the table when it runs out of space set
the initial capacity , number of buckets to be used in the table, to roughly 2 times the
expected number of elements to be stored
Another overloaded constructor also lets you set the Load Factor limit (a float)
HashSet myHash = new HashSet(200, 0.75f);
Objects stored in the HashSet DO NOT need to implement the Comparable interface,
WHY? (they are not in any specific order)
An Iterator for the HashSet produces the set’s values in NO particular order, and
that order may change as elements are added or removed.
When ordering is not important HashSet is a better choice than the TreeSet (discussed in
next lecture)
When iterating over a HashSet, do NOT modify the Set by any means other than the
Iterator's own iter.remove( ); otherwise a ConcurrentModificationException will normally be thrown
Invoking the HashSet’s add or contains method invokes the hashCode method of the
OBJECT (value) being stored or searched for
For example, if we were storing a String as the value, the String’s HashCode is executed
The String class returns a HashCode value as an int for the String
Sets DO NOT allow duplicates
A duplicate exists when the equals method applied against two objects resolves to true
Therefore, if you use a user defined class in a HashSet make sure the equals AND
hashCode methods are defined (overridden from the versions inherited from Object). Otherwise
unwanted duplicates may result
Review Examples 1-2-3 on HashSet Coding in Barrons P. 376 to 378
NOTE: in example 2 remember that ArrayList IS A Collection and HashSet has a
constructor that takes in a Collection; therefore passing an ArrayList to that constructor
will automatically remove any duplicates
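The Collection constructor mentioned in the note can be tried out with a short sketch like this one. DedupDemo and unique are hypothetical names, and the input list reuses the numbers from the Math Set example earlier in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only. DedupDemo and unique are hypothetical names.
class DedupDemo {
    // Passing any Collection to the HashSet constructor drops the duplicates
    static Set<Integer> unique(List<Integer> values) {
        return new HashSet<>(values);
    }

    public static void main(String[] args) {
        List<Integer> input =
            new ArrayList<>(Arrays.asList(1, 3, 5, 6, 7, 7, 8, 2, 9));
        Set<Integer> set = unique(input);

        System.out.println(input.size()); // 9
        System.out.println(set.size());   // 8, the second 7 is dropped
    }
}
```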
OPEN and Review HashSet on Java Docs
Another Example:
import java.util.Set;
import java.util.Iterator;
import java.util.HashSet;
Set<String> names = new HashSet<String>(101); // initial capacity of the hash table
names.add("Larry");
names.add("Tony");
names.add("Kathy");
names.add("Eve");
names.add("Julie");
System.out.println(names.size()); // 5
Iterator<String> iter = names.iterator();
while (iter.hasNext())
{
System.out.println(iter.next());
}
Displays (the names may appear in a different order):
5
Larry
Eve
Julie
Kathy
Tony
names.add("Tony");     // duplicate, so the set is unchanged
names.remove("Eve");
System.out.println(names.size()); // 4
iter = names.iterator();
while (iter.hasNext())
{
System.out.println(iter.next());
}
if (names.contains("Frank"))
System.out.println("Frank found");
else
System.out.println("Frank NOT Found");
Displays (the names may appear in a different order):
4
Kathy
Julie
Tony
Larry
Frank NOT Found
The add method of HashSet, as in names.add(“Julie”);
calls the hashCode of the Object being added, String in this example
String has a hashCode method that resolves the “state” of the String into a
hash value (integer), which determines the place in the HashSet’s hash table where this object
will be stored
In the same manner the call to the HashSet’s remove method, names.remove(“Eve”);
invokes the String’s hashCode to determine where in the Hash Table this object resides
This is why it is CRITICAL to understand that Objects used in a HashSet MUST have the
equals and hashCode methods defined !!!
In your own classes, you need to define the hashCode and equals methods yourself
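As a sketch of what this means for a user defined class, the hypothetical Lot class below overrides both methods. Its hashCode reuses the letter plus number idea from the Homeowner example; any function that gives equal objects equal hash codes would also do.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only. Lot and LotSetDemo are hypothetical classes.
class Lot {
    private final char block;
    private final int number;

    Lot(char block, int number) {
        this.block = block;
        this.number = number;
    }

    // Two Lots are equal when both fields match
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Lot)) {
            return false;
        }
        Lot that = (Lot) other;
        return block == that.block && number == that.number;
    }

    // Equal Lots must produce the same hash code
    @Override
    public int hashCode() {
        return block + number;
    }
}

class LotSetDemo {
    public static void main(String[] args) {
        Set<Lot> lots = new HashSet<>();
        lots.add(new Lot('A', 140));
        lots.add(new Lot('A', 140)); // equal to the first, so not added again

        System.out.println(lots.size());                       // 1
        System.out.println(lots.contains(new Lot('A', 140)));  // true
    }
}
```

Without the two overrides, the inherited Object versions would treat the two new Lot('A', 140) objects as different, and the set would end up holding both.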
HashMap:
class java.util.HashMap implements java.util.Map
The HashMap implements the Map behaviors:
Object put(Object key, Object value)
Associates a Value with a Key and places this pair into the Map
REPLACES a prior value if the Key already is Mapped to a value
Returns the PREVIOUS value associated with the Key or NULL if no prior
mapping exists
Object get(Object key) // returns the value type declared when the Map was created
Returns the value associated with a Key OR
NULL if no mapping exists or the Key maps to NULL
Object remove(Object key)
Removes the mapping for this Key and returns its associated value OR
returns NULL if no mapping existed or the mapping was to NULL
boolean containsKey(Object key)
True if there is a key / value mapping, otherwise false
int size( )
Returns the number of key / value mappings
Set keySet( )
Returns the Set of keys in the map
Default constructor creates an empty Map. You can also create a generic (templated) HashMap
by identifying the Type of Key and Value to be stored. Remember that the KEY MUST
implement hashCode.
Keys (Objects) stored in the HashMap DO NOT need to implement the Comparable
interface
Invoking the HashMap’s put or containsKey method invokes the hashCode method of the
OBJECT (Key) being stored or looked up
For example, if an Integer is the Key, the Integer’s hashCode is executed
The Integer class returns a hashCode value as an int for the Integer (Key)
You are not required to Iterate over a HashMap
However, you will be expected to write code that iterates over the Set of Keys in a Map:
Map<Integer, myStuff> stuff = new HashMap<Integer, myStuff>();
// Can also Create: HashMap stuff = new HashMap( );
// add key / value pairs to the map
for (Iterator<Integer> i = stuff.keySet( ).iterator( ); i.hasNext( ); )
System.out.println( i.next( ) );
The Keys will appear in an unpredictable order
If i.remove( ) is executed during this iteration over the Key Set, then the associated
Key / Value pair will be removed from the HashMap
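The points above about the Key Set can be sketched with a generic map and a for-each loop. KeySetDemo, sample and removeBelow are hypothetical names, and the case numbers are borrowed from the larger example in this section.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Illustrative sketch only. KeySetDemo and its methods are hypothetical.
class KeySetDemo {
    static Map<Integer, String> sample() {
        Map<Integer, String> names = new HashMap<>();
        names.put(1435, "Smith");
        names.put(1110, "Thomas");
        names.put(1425, "Jones");
        names.put(987, "Evans");
        names.put(1323, "Murray");
        return names;
    }

    // Removing through the key set's Iterator also removes the
    // key / value pair from the map itself
    static void removeBelow(Map<Integer, String> map, int limit) {
        Iterator<Integer> it = map.keySet().iterator();
        while (it.hasNext()) {
            if (it.next() < limit) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> names = sample();

        // for-each over the keys; the order is unpredictable
        for (Integer key : names.keySet()) {
            System.out.println(key + " handled by " + names.get(key));
        }

        removeBelow(names, 1200);          // drops keys 1110 and 987
        System.out.println(names.size());  // 3
    }
}
```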
Review Examples 1-2-3 on HashMap Coding in Barrons P. 374-386
OPEN and Review HashMap on Java Docs
Another Example:
import java.util.Iterator;
import java.util.Map;
import java.util.HashMap;
import java.util.Set;
public class HashMapTest
{
    public static void main(String[] args)
    {
        // raw (non-generic) Map, so keys come back as Object and must be cast
        Map names = new HashMap();
        names.put(new Integer(1435), "Smith");
        names.put(new Integer(1110), "Thomas");
        names.put(new Integer(1425), "Jones");
        names.put(new Integer(987), "Evans");
        names.put(new Integer(1323), "Murray");
        System.out.println("Number of cases: " + names.size()); // 5
        Integer lookfor = new Integer(1435);
        if (names.containsKey(lookfor))
            System.out.println("Key found.");
        else
            System.out.println("Key NOT found.");
        Set namesSet = names.keySet();
        Iterator iter = namesSet.iterator();
        while (iter.hasNext())
        {
            Integer caseNumber = (Integer) iter.next();
            System.out.println(caseNumber + " handled by " +
                names.get(caseNumber));
        }
    }
}
The resulting output is (the order of the keys may differ):
Number of cases: 5
Key found.
1323 handled by Murray
987 handled by Evans
1435 handled by Smith
1110 handled by Thomas
1425 handled by Jones
If the statements that insert keys and values into the HashMap were changed to:
names.put(new Integer(1435), "Smith");
names.put(new Integer(1110), "Thomas");
names.put(new Integer(1425) , "Jones");
names.put(new Integer(987), "Evans");
names.put(new Integer(1323), "Murray");
names.put(new Integer(1323), "Duplicate");
The resulting output would be:
Number of cases: 5
Key found.
1323 handled by Duplicate
987 handled by Evans
1435 handled by Smith
1110 handled by Thomas
1425 handled by Jones
Notice that case #1323 is handled by Duplicate, not by Murray. If a duplicate key entry is
attempted, the original one is replaced.
Review of Sample Code: Main.java & myStuff.java (handout)
Misc:
Java’s String, Double and Integer classes have their own hashCode methods built in
When designing your own class for use in a HashSet or HashMap you need to override
Object’s hashCode method with a method that is appropriate for your specific class
The Object hashCode is typically based on the Object's memory location and NOT on the
attributes of the class
Regardless of who designs it, you MUST supply a hashCode (and a matching equals) if you plan on using your
objects in a HashSet or a HashMap
The HashCode method returns an integer from which the HashSet and HashMap further
map the HashCode onto the range of valid table indices for a particular table
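One simple way to picture that second mapping step is shown below. This is a sketch of the idea only: IndexDemo and indexFor are hypothetical names, and Java's own HashSet and HashMap use their own internal scheme rather than this exact calculation.

```java
// Illustrative sketch only. IndexDemo and indexFor are hypothetical; the
// library classes use their own internal mapping.
class IndexDemo {
    // Reduces any hashCode (even a negative one) to a valid index
    // in the range 0 to tableSize - 1
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its int value
        System.out.println(Integer.valueOf(1435).hashCode()); // 1435
        System.out.println(indexFor(1435, 16));               // 11
        System.out.println(indexFor(-7, 16));                 // 9
    }
}
```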
Make sure you are aware of the Generic Capability of the HashMap and HashSet
classes, and of the FOR EACH LOOP that can be used with them.
Big-O:
HashSet has an expected Big-O of O(1) for add, remove and contains
HashMap has an expected Big-O of O(1) for get and put but could be O(n) in the worst case if many
collisions occur
A Hash Table provides a structure where insert and search are carried out in constant time on average
AP AB Subset Requirements:
Students should be able to understand:
Hash tables as well as understand how to use the Java classes HashSet and HashMap
Understand and be able to utilize the three HashSet constructors
Know the concept of hashing and how collisions are created and resolved
Explain how best to construct a Hash Table to minimize collisions
Understand the goal of a good hash function
Understand chaining, probing and load factor
Determine when to use the HashSet and HashMap and know the Big-O of their behaviors
Write code that creates, adds, removes and iterates over Sets using HashSet
Write code that creates, puts, gets, removes and returns the Set of Keys for a HashMap
Be aware of the Generic Capability of the HashMap and HashSet classes
Tips for the AP Exam:
Do not change objects in a Set
Sets do not contain duplicates
Sets are not ordered
Use an Iterator to list all of the elements of a Set
Iterating thru a HashSet Does not iterate in any specific order
You cannot add an element to a set at an iterator position
In a HashMap only the Keys are hashed
HashSet and HashMap operations (add, put, remove, contains) run in O(1) expected time but O(n) in the
worst case
User Defined Classes that will be used in a HashSet or HashMap should have
overridden equals and hashCode methods
Project:
Fun With Chemistry
MBS
POE
ASCII
int n = (int)'e' - (int)'a'; ---- gives you the alpha displacement of 'e' from the beginning of an
array (n is 4)
Use this to count the number of different letters in a phrase
Create an array of 26 ints
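The letter-counting hint could be sketched as follows. LetterCount and its methods are hypothetical names, and only the letters a to z are counted (upper case is folded to lower case, everything else is skipped).

```java
// Illustrative sketch only. LetterCount and its methods are hypothetical.
class LetterCount {
    // counts[0] holds the number of a's, counts[1] the b's, and so on
    static int[] countLetters(String phrase) {
        int[] counts = new int[26];
        for (char c : phrase.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++; // alpha displacement from 'a' is the index
            }
        }
        return counts;
    }

    // Number of different letters that occur at least once
    static int distinctLetters(String phrase) {
        int distinct = 0;
        for (int count : countLetters(phrase)) {
            if (count > 0) {
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println((int) 'e' - (int) 'a');           // 4
        System.out.println(countLetters("Hello")['l' - 'a']); // 2
        System.out.println(distinctLetters("Hello"));         // 4 (h, e, l, o)
    }
}
```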