Hashing - MHS Comp Sci

Hashing
Starring: HashSet
Co-Starring: HashMap
1
Purpose:
In this lecture we will discuss another
data structure, the Hash Table.
We will also learn how to use Java’s
Map and Set implementations in the
HashSet and HashMap classes
2
Resources:
Barrons Chapter 11 p.378 – 379 & p. 383
– 385
Chapter 12 p.422
Lambert Fundamentals Comprehensive
Lesson 17 p.567
C++ Notes Chapter 26
Java Essentials Study Guide Chapter
17 p.303 & Chapter 20 p.370
Java Methods Chapter 6 p.151
Litvin Be Prepared Chapter 5 p.137
3
Handouts:
YOU MUST BRING YOUR BARRONS
TEXT TO EACH CLASS !!!
1. Map-Key_value.java
2. Hashing --- Illustration.doc
4
Intro:
We have discussed various data
structures like the List implementations
ArrayList and LinkedList. We have also
discussed Stacks and Queues and will
soon learn about Binary trees.
5
Intro:
With these structures we can iterate
over the entire structure and determine
if a specific value is in the set.
6
As an example, we can maintain a
structure of domain names and
determine if a given name has already
been assigned. However, we do not
know anything about the user who
owns it.
7
In another example, we can have a
structure of dictionary words. We can
determine if a given word is spelled
correctly, but if we also wanted to get
the meaning, pronunciation or
derivation of the word these current
structures would come up short.
8
• These requirements lead us to a more elaborate structure, such as a Map.
• A Map allows us to associate a Key with an object.
Example:
Key / Index (Lot #): A140
Ultimately links to a HomeOwnerInfo Object:
Lot: A140
Name: Smith, Joe
Address: 120 East End Avenue
Phone: 973-333-5555
Value: $420,000
Property Taxes: $11,000
Family Income: $210,000
3 Children
9
Databases are based on this principle
as we can perform searches on the
existence of specific objects by
searching against an INDEX (key) that
provides a LINK to the actual data
(object)
10
In this example, the Index or KEY is
stored SEPARATE from the data to
which it will ultimately point
This structure allows us to maintain the
physical data in a separate storage
location
The Index or Key provides a link to the
data
11
We can have multiple / separate
INDICES that work against a single set
of objects
For example, we can store objects that
maintain information on homeowners
We can keep their name, address, lot
number, home value, tax base, income,
number of children, etc
12
We might wish to access this
information in different ways
Maybe we want to search by phone
number or Lot number
13
Key / Index (Phone Number): 973-333-5555
Ultimately links to a HomeOwnerInfo Object:
Lot: A140
Name: Smith, Joe
Address: 120 East End Avenue
Phone: 973-333-5555
Value: $420,000
Property Taxes: $11,000
Family Income: $210,000
3 Children
14
Maybe we want to get information on all
homes worth over $500,000
If we were to attempt to store this
information in a linked list or an array we
would have difficulty implementing
efficient search (or sort) processes that
could perform searches based on different
pieces of data
15
If we were to sort this data, it could only
be sorted on 1 piece of information
(Lot Number), and further changes to
elements would require re-sorting
This is where a Map implementation is
best used
This lecture will focus on this type of
implementation including Hash Tables,
HashSet and HashMap
16
Hashing:
A System of mapping from KEYS to
integer indices in a table
The goal is to Map all possible KEY values
into a smaller set of indices & to cover
that range uniformly
17
The hash algorithm will convert a KEY
(SSN, UPC, Account Number) into a
representation of a specific location to
store or find that information (converts a
KEY into a location in the hash table).
This tells us where to look for a specific
item or where to insert an item. It always
returns an integer
18
The “perfect hash function” is one that
yields a 1 to 1 mapping from the KEYS to
the integers starting at 0 and ending at
the index of the last element in the table
(array, list)
19
• However, there is no known systematic process
that can be used to generate a perfect hash
function from an arbitrary set of values
• Therefore we will have to account for and
resolve Collisions, which occur when several
different Keys map to the same position in the
Hash Table
20
Example:
Using our Homeowner Database for
Example, we can write our own “hashing
algorithm” that converts a given Key, Lot
Number for example, into an integer value
that corresponds to an index in an Array
or ArrayList
21
We MUST make certain assumptions; we
MUST understand our data so we can
estimate its load
In this example, let’s assume that our
universe of LOTS in Millburn is
approximately 1,000
22
So, let’s count on an array (to hold the Key
and related HomeOwnerInfo) that can hold
about 1,500 indices
This will allow us to spread out our data
so that we can minimize situations where
our Keys “hash” to the same index on the
array (a Collision)
23
Our “Hashing Algorithm” is simple: we
take the numeric value of the Lot and
add in the ASCII value of the letter. Given
this:
A140 will “hash” to the integer value 205 (140 + 65)
A151 will “hash” to 216 (151 + 65)
B140 will “hash” to 206 (140 + 66)
C150 will “hash” to 217 (150 + 67)
24
So, the HomeOwnerInfo along with the
Key will be inserted into the array, known
as our “Hash Table” as follows:
25
Index #    HomeOwnerInfo Object with a Key of:
205        A140
206        B140
207
208
209
210
211
212
213
214
215
216        A151
217        C150
26
So if we were looking for HomeOwner
Information for lot Number C150
All we need to do is “Hash” the Lot
number which will result in the integer 217
We can then access the Homeowner
information as follows
MyHomeownerInfoArray[ hashedInteger]
27
Hash Tables:
Typically a fixed-size array whose indices
are integer representations of KEYs
28
A well balanced Hash Table hinges upon
the proper handling of two major issues:
Deciding on a solid Hash Function
Building an Algorithm for dealing with
Collisions
29
The KEY can be SSN’s, last names, UPC
Codes
When we retrieve an element we need to
verify that its KEY matches the target so
the KEY must be explicitly stored in the
table along with the rest of the record
30
Hash Functions:
Converts a KEY into an integer (hashed)
where the integer ranges from 0 to one
less than the size of the table
Properties of a good hash function:
Easy and fast to compute
31
Hash Functions:
• Scatter the data evenly throughout the hash
table (uniform)
• Select a data structure that has more space than
actually required
• Develop a function to compute the hash address
(value)
• Minimize collisions
32
For example, if our Key is a String we
could slice the String into parts and add
them (using their ASCII values)
33
For Example, the String containing SSN
can be broken down into parts
133-56-7878
mod the first part: 133 % 100 = 33
reverse the second part: 56 becomes 65
integer-divide the third part by 100: 7878 / 100 = 78
The hashed value for 133-56-7878 is 176
(33 + 65 + 78)
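The three steps above could be coded as follows. This is a sketch under the assumption that the key always has the form ddd-dd-dddd; the class name SsnHash is hypothetical.

```java
// Hypothetical helper class; assumes keys of the form "ddd-dd-dddd".
class SsnHash {
    static int hash(String ssn) {
        String[] parts = ssn.split("-");
        // mod the first part: 133 % 100 = 33
        int first = Integer.parseInt(parts[0]) % 100;
        // reverse the digits of the second part: "56" becomes 65
        String reversed = new StringBuilder(parts[1]).reverse().toString();
        int second = Integer.parseInt(reversed);
        // integer-divide the third part by 100: 7878 / 100 = 78
        int third = Integer.parseInt(parts[2]) / 100;
        return first + second + third;
    }

    public static void main(String[] args) {
        System.out.println(hash("133-56-7878")); // prints 176
    }
}
```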
34
How good a hash function this is will
depend on how evenly it scatters the data
over the array and how well it minimizes
any collisions
The result MUST be an integer that does
not exceed the range of the Hash Table
This method of manipulating the key is
given the term “hashing”
35
Common hash functions are:
Numeric / Division:
MOD the KEY by an integer equal to the
size of the array
KEY % (#elements)
Example: UPC # 1966211001
ArraySize 1500
Hash Value = UPC % Size = 1966211001 % 1500 = 501
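A minimal Java sketch of the Division method, assuming a non-negative numeric key (the class name DivisionHash is hypothetical). The key is held in a long because a 10 digit UPC can be larger than the biggest int.

```java
// Hypothetical helper class illustrating the Division method.
class DivisionHash {
    // Assumes key >= 0, so the remainder falls in 0 .. tableSize - 1.
    static int hash(long key, int tableSize) {
        return (int) (key % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash(1966211001L, 1500)); // prints 501
    }
}
```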
36
Alpha:
Hash the sum of ASCII values of its
characters
37
MidSquare:
Square the KEY and keep the middle digits
of the result as the Hashed value
Works better with smaller values (less
than 10,000)
Example: number 9876
9876 ^ 2 = 97,535,376
The middle digits 353 become the hashed value
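A possible Java version of MidSquare. Which digits count as "the middle" is a convention; this sketch assumes a 4 digit key and reproduces the example above by dropping the last two digits of the square and keeping the next three. The class name MidSquareHash is hypothetical.

```java
// Hypothetical helper class; assumes a 4 digit key (square has up to 8 digits).
class MidSquareHash {
    static int hash(int key) {
        long square = (long) key * key;       // 9876 * 9876 = 97535376
        // Drop the two low-order digits, then keep the next three digits.
        return (int) ((square / 100) % 1000); // 975353 % 1000 = 353
    }

    public static void main(String[] args) {
        System.out.println(hash(9876)); // prints 353
    }
}
```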
38
Folding:
Divide KEY into several parts
Each of which are combined to provide
the hashed value
39
Example:
Social Security Number :
387-58-1505
hash as sum of three integers:
387 + 58 + 1505 = 1950
The data stored with the KEY is everything
you need for a given structure or record
(price, item name, etc…)
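The Folding example with the Social Security Number can be sketched in Java as below. The class name FoldingHash is hypothetical, and the key is assumed to be digits separated by dashes.

```java
// Hypothetical helper class illustrating Folding.
class FoldingHash {
    // Splits the key at the dashes and sums the parts as integers.
    static int hash(String ssn) {
        int sum = 0;
        for (String part : ssn.split("-")) {
            sum += Integer.parseInt(part);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("387-58-1505")); // prints 1950
    }
}
```

If the sum can exceed the range of the Hash Table, it would still need to be reduced, for example with the Division method (sum % table size).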
40
Example:
Bar Coding of items in a supermarket
UPC codes allow for up to 10 billion items
(10 digit code)
The average store has approximately 10,000
items
41
If the program that scans these items had
to search through all 10 billion possibilities,
it would be very inefficient
We can store the UPC codes, specific to
that store, in an array called the HASH
TABLE
We typically size the hash table with more
elements (items) than the initial universe
of elements (KEYS)
42
We could size our array at 15,000
elements
The HASH Function will tell us where a
specific item is stored in the 15,000
element Array
43
UPC           Hash Value (UPC % 15000)
1966211001    11001
1966211011    11011
1966211021    11021
1966211031    11031
44
So, if we were to add in information on
Products Keyed by UPC code into a hash
table, we could do so as follows:
MyHashTable[myProduct.getUPC( ) %
15000] = myProduct;
45
To retrieve product price for a given
product you can:
priceOfProduct =
MyHashTable[1966211011 % 15000].getPrice( );
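Putting the add and retrieve steps together, a small sketch might look like this. The Product and ProductTable classes and the price used are hypothetical; only getUPC, getPrice and the 15000 table size come from the slides. Collisions are ignored here (a second product hashing to the same slot would overwrite the first), but the stored KEY is checked on retrieval.

```java
// Hypothetical record class; getUPC and getPrice mirror the names used above.
class Product {
    private final long upc;
    private final double price;

    Product(long upc, double price) {
        this.upc = upc;
        this.price = price;
    }

    long getUPC() { return upc; }
    double getPrice() { return price; }
}

// Hypothetical table class; it does NOT resolve collisions.
class ProductTable {
    private static final int SIZE = 15000;
    private final Product[] table = new Product[SIZE];

    // Stores the product at index (UPC % SIZE).
    void add(Product p) {
        table[(int) (p.getUPC() % SIZE)] = p;
    }

    // Returns the product with this UPC, or null if the slot is empty
    // or holds a product with a different KEY.
    Product find(long upc) {
        Product p = table[(int) (upc % SIZE)];
        return (p != null && p.getUPC() == upc) ? p : null;
    }

    public static void main(String[] args) {
        ProductTable t = new ProductTable();
        t.add(new Product(1966211011L, 2.49)); // made-up price
        System.out.println(t.find(1966211011L).getPrice()); // prints 2.49
    }
}
```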
46
Using our HomeOwnerInfo Example:
So, if we were to add in information on
HomeOwners Keyed by Lot Number into a hash
table, we could do so as follows:
aString = myHomeOwnerInfo.getLot( );
index = // break up the string and calculate the
// hash value;
MyHashTable[index] = myHomeOwnerInfo;
To retrieve Lot value for a given home you can:
lotValue = MyHashTable[index].getValue( );
47
Collisions:
Problems occur because 2 different keys
MAY map to the same hash value, that is,
the same element (location) in the table
A Collision occurs when we try to insert a
new element into the table and its location
is already occupied
48
Example, if we used a hashing function
that combines Folding with Division:
UPC 70662 11001
Group into pairs: 70 66 21 10 01
Multiply the first three pairs together
70 X 66 X 21 = 97020
Add this number to the last two pairs:
97020 + 10 + 01 = 97031
Find the remainder of mod division by
14997 (15000 – 3)
97031 % 14997 = 7049
49
What happens when we have an item with
the bar code 66702 10110 and we use the
same hash function to code it:
66 70 21 01 10
66 X 70 X 21 = 97020
97020 + 1 + 10 = 97031
97031 % 14997 = 7049
50
This is the same address as the previous
bar code. When this event occurs, two
values need to be stored in the same hash
address. This is called a collision (or hash
clash)
One reason why our table size is 15000
and not 10000 is to help avoid collisions.
The smaller the number of possible
addresses the higher the probability of a
collision.
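The two bar codes above can be checked with a short Java sketch of the Folding with Division function. The class name FoldDivideHash is hypothetical, and the UPC is assumed to be passed as a 10 digit string with no spaces.

```java
// Hypothetical helper class; assumes a 10 digit UPC with no spaces.
class FoldDivideHash {
    static int hash(String upc) {
        // Group the digits into five pairs.
        int[] pair = new int[5];
        for (int i = 0; i < 5; i++) {
            pair[i] = Integer.parseInt(upc.substring(2 * i, 2 * i + 2));
        }
        // Multiply the first three pairs, add the last two pairs,
        // then take the remainder of division by 14997.
        int product = pair[0] * pair[1] * pair[2];
        return (product + pair[3] + pair[4]) % 14997;
    }

    public static void main(String[] args) {
        System.out.println(hash("7066211001")); // prints 7049
        System.out.println(hash("6670210110")); // prints 7049 (a collision)
    }
}
```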
51
In order for a hash table to work properly
it is important that the programmer knows
the number of items in the table in
advance
There are several ways to resolve a
Collision:
52
Chaining
Probing
Review Example on Hash Coding in
Barrons P.422 to 424
53
Load Factor:
A Hash table with many collisions
degrades its performance
If the hash table resolves collisions via
Chaining then the ratio of entries in the
table to the total number of “buckets” is
called the Hash Table’s Load Factor
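The slides only name Chaining, so the following is a generic textbook-style sketch rather than the course's own code: each bucket holds a list (chain) of all keys that hash to it, and the Load Factor is entries divided by buckets. The class name ChainedStringTable is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical class; a generic sketch of Chaining, not java.util.HashSet's code.
class ChainedStringTable {
    private final List<List<String>> buckets = new ArrayList<List<String>>();
    private int entries = 0;

    ChainedStringTable(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<String>());
        }
    }

    // Maps the key's hashCode onto a valid bucket index and returns that chain.
    private List<String> chainFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    // Adds the key unless it is already present; colliding keys share a chain.
    boolean add(String key) {
        List<String> chain = chainFor(key);
        if (chain.contains(key)) {
            return false;
        }
        chain.add(key);
        entries++;
        return true;
    }

    boolean contains(String key) {
        return chainFor(key).contains(key);
    }

    // Ratio of entries in the table to the total number of buckets.
    double loadFactor() {
        return (double) entries / buckets.size();
    }
}
```

For example, three keys added to a table with 4 buckets give a Load Factor of 3 / 4 = 0.75.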
54
The Load Factor determines how full the
table may get BEFORE the Map’s capacity
is increased
A small Load Factor means that there is
significant wasted space in the Hash Table
55
A high Load Factor means that the
advantages of the Hash Table are
minimized
Reasonable Load Factors range from 0.5
to 2.0
Java’s HashSet and HashMap take in
maximum Load Factors in the constructor
but have a default Load Factor of .75
56
HashSet:
• Remember that the Set interface extends the
Collection interface

• Definition: a collection that contains NO
DUPLICATES of an Object
• For example, the input 1, 3, 5, 6, 7, 7, 8, 2, 9
produces the set 1, 2, 3, 5, 6, 7, 8, 9
• class java.util.HashSet implements java.util.Set
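A short Java sketch of the "no duplicates" behavior. The class name SetDemo is hypothetical, and the code uses generics (Set<Integer>), which the slides do not.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class showing that a HashSet drops duplicates.
class SetDemo {
    static Set<Integer> build(int... values) {
        Set<Integer> set = new HashSet<Integer>();
        for (int v : values) {
            set.add(v); // add returns false and changes nothing for a duplicate
        }
        return set;
    }

    public static void main(String[] args) {
        Set<Integer> s = build(1, 3, 5, 6, 7, 7, 8, 2, 9);
        System.out.println(s.size()); // prints 8: the second 7 was not added
    }
}
```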
57
This class is implemented with a Hash table
Each entry in a HashSet is a single Object
that can be hashed, not a key / value pair
With a HashSet (unlike the HashMap), you
do not select a “key” to hash by; the
object is hashed based on its
implementation of the hashCode method
58
The HashSet implements the Set behaviors:
boolean add(Object x)
adds element if unique otherwise leaves set
unchanged
boolean contains(Object x)
determines if a given object is an element of
the set
boolean remove(Object x)
removes the element from the set or leaves
set unchanged
59
The HashSet implements the Set behaviors:
int size( )
number of elements in the set
Iterator iterator( )
allows for set traversal
Object [] toArray( );
Returns elements in the set as an array
60
HashSet has a default constructor that
creates an empty Hash Table with a default
capacity and Load Factor
You may set the initial capacity by using the
overloaded constructor
HashSet myHash = new HashSet(200);
61
To avoid unnecessary reallocation and
rehashing of the table when it runs out of
space, set the initial capacity (the number of
buckets to be used in the table) to roughly 2
times the expected number of elements to be
stored
Another overloaded constructor also lets you
set the Load Factor limit, which is a float
HashSet myHash = new HashSet(200, 1.5f);
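The three constructors discussed above, sketched with generics (the slides use the older raw HashSet type). The class name SetConstructors is hypothetical. The Load Factor parameter is a float, so the literal needs the f suffix to compile.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class for the HashSet constructors.
class SetConstructors {
    static int demo() {
        Set<String> a = new HashSet<String>();          // default capacity and Load Factor (0.75)
        Set<String> b = new HashSet<String>(200);       // initial capacity of 200 buckets
        Set<String> c = new HashSet<String>(200, 1.5f); // capacity plus Load Factor limit
        a.add("x");
        b.add("x");
        c.add("x");
        return a.size() + b.size() + c.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3
    }
}
```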
62
Objects stored in the HashSet DO NOT need
to implement the Comparable interface
An Iterator for the HashSet produces the set’s
values in NO particular order
When ordering is not important HashSet is a
better choice than the TreeSet (discussed in
next lecture)
63
When iterating over a HashSet, Do NOT
modify the Set by any means other than the
iterator’s own iter.remove( ); otherwise the
iterator will typically throw a
ConcurrentModificationException
Invoking the HashSet’s add or contains
method invokes the hashCode method of the
OBJECT (value) being stored
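A sketch of removing elements safely while iterating, using only the iterator's own remove. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical demo class for safe removal during iteration.
class IterRemoveDemo {
    // Removes every name that starts with the given prefix.
    static void removeStartingWith(Set<String> names, String prefix) {
        for (Iterator<String> iter = names.iterator(); iter.hasNext(); ) {
            if (iter.next().startsWith(prefix)) {
                iter.remove(); // safe: removes the element just returned by next()
                // Calling names.remove(...) here instead would risk a
                // ConcurrentModificationException on the next iteration step.
            }
        }
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<String>();
        names.add("Eve");
        names.add("Julie");
        names.add("Ed");
        removeStartingWith(names, "E");
        System.out.println(names); // prints [Julie]
    }
}
```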
64
For example, if we were storing a String as
the value, the String’s HashCode is executed
The String class returns a HashCode value
as an int for the String
Sets DO NOT allow duplicates
65
A duplicate exists when the equals method
applied against two objects resolves to true
Therefore, if you use a user defined class in a
HashSet make sure the equals AND
HashCode methods are defined (overridden
from the super Object’s version) Otherwise
unwanted duplicates may result
66
Review Examples 1-2-3 on HashSet
Coding in Barrons P. 379 to 381
67
NOTE: in example 2, remember that an
ArrayList IS A Collection and HashSet has a
constructor that takes in a Collection;
therefore, passing the ArrayList to the
HashSet constructor will automatically
remove any duplicates
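The Collection constructor can be sketched as follows. This is not the Barrons example itself; the class name DedupDemo and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class: building a HashSet from a List removes duplicates.
class DedupDemo {
    static Set<String> unique(List<String> list) {
        return new HashSet<String>(list); // constructor that takes a Collection
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        names.add("Eve");
        names.add("Julie");
        names.add("Eve");
        System.out.println(unique(names).size()); // prints 2
    }
}
```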
68
OPEN and Review HashSet on Java Docs
69
Another Example:
Lets Review the HashSet Example in the
Handout
70
The add method of HashSet
names.add("Julie");
calls the hashCode of the Object being
added, String in this example
String has a hashCode method that
resolves the “state” of the String into a
hash value (integer), which determines the
place in the HashSet’s hash table where
this object will be stored
71
In the same manner the call to the
HashSet’s remove method
names.remove("Eve");
invokes the String’s hashCode to
determine where in the Hash Table this
object resides
72
This is why it is CRITICAL to understand
that Objects used in a HashSet MUST
have the equals and hashCode methods
defined !!!
In your own classes, you would need to
have the hashCode and equals methods
defined
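A minimal sketch of a user defined class with both methods overridden. The Lot class and its fields are hypothetical. The two methods must agree: objects that are equal must return the same hashCode.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical key class with equals and hashCode overridden together.
class Lot {
    private final char section;
    private final int number;

    Lot(char section, int number) {
        this.section = section;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Lot)) return false;
        Lot that = (Lot) other;
        return section == that.section && number == that.number;
    }

    @Override
    public int hashCode() {
        // Built from the same attributes that equals compares.
        return Objects.hash(section, number);
    }

    public static void main(String[] args) {
        Set<Lot> lots = new HashSet<Lot>();
        lots.add(new Lot('A', 140));
        lots.add(new Lot('A', 140)); // equal to the first, so not added
        System.out.println(lots.size()); // prints 1
    }
}
```

Without these overrides, the two Lot objects above would be treated as different elements and the set would hold an unwanted duplicate.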
73
HashMap:
class java.util.HashMap implements
java.util.Map
The HashMap implements the Map
behaviors:
74
Object put(Object key, Object value)
Associates a Value with a Key and places this
pair into the Map
REPLACES the prior value if the Key is already
Mapped to a value
Returns the value PREVIOUSLY associated with
the Key, or NULL if no prior mapping exists
Object get(Object key)
Returns the value associated with a Key, OR
NULL if no mapping exists or the Key maps to
NULL
75
Object remove (Object key)
Removes the mapping for this Key and returns its
associated value OR
returns NULL if no mapping existed or the mapping
was to NULL
boolean containsKey(Object key)
True if there is a key / value mapping, otherwise false
int size( )
Returns the number of key / value mappings
Set keySet( )
Returns the Set of keys in the map
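The Map behaviors listed above, sketched with a Lot number as the Key and an owner name as the Value. The class name MapDemo and the second owner name are hypothetical, and generics are used where the slides use Object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class exercising put, get, containsKey, remove and size.
class MapDemo {
    static Map<String, String> build() {
        Map<String, String> owners = new HashMap<String, String>();
        owners.put("A140", "Smith, Joe"); // no prior mapping, so put returns null
        return owners;
    }

    public static void main(String[] args) {
        Map<String, String> owners = build();
        // put on an existing Key REPLACES the value and returns the old one.
        String previous = owners.put("A140", "Jones, Ann"); // made-up name
        System.out.println(previous);                    // prints Smith, Joe
        System.out.println(owners.get("B200"));          // prints null (no mapping)
        System.out.println(owners.containsKey("A140"));  // prints true
        System.out.println(owners.remove("A140"));       // prints Jones, Ann
        System.out.println(owners.size());               // prints 0
    }
}
```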
76
Default constructor creates an empty Map
Keys (Objects) stored in the HashMap DO
NOT need to implement the Comparable
interface
Invoking the HashMap’s put or containsKey
method invokes the hashCode method of the
OBJECT (Key) being stored
77
For example, if an Integer is the Key, the
Integer’s HashCode is executed
The Integer class returns a HashCode value
as an int for the Integer (Key)
You are not required to Iterate over a
HashMap
78
However, you will be expected to write code
that iterates over the Set of Keys in a Map:
79
HashMap m = new HashMap( );
// add key / value pairs to the map
for (Iterator i = m.keySet( ).iterator( ) ;
i.hasNext( ) ; )
System.out.println( i.next( ) );
The Keys will appear in an unpredictable
order
If i.remove( ) is executed during this
iteration over the Key Set, then the
associated
80
Key / Value pair will be removed from the
HashMap
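A sketch of the remove behavior during Key Set iteration. The class and method names are hypothetical, and generics are used where the slides use raw types.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical demo class: removing through the Key Set iterator
// removes the whole Key / Value pair from the HashMap.
class KeySetRemoveDemo {
    static void removeKeysStartingWith(Map<String, Integer> m, String prefix) {
        for (Iterator<String> i = m.keySet().iterator(); i.hasNext(); ) {
            if (i.next().startsWith(prefix)) {
                i.remove(); // the Key AND its Value leave the map
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> values = new HashMap<String, Integer>();
        values.put("A140", 420000);
        values.put("B140", 510000); // made-up home value
        removeKeysStartingWith(values, "A");
        System.out.println(values.size()); // prints 1
    }
}
```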
81
Review Examples 1-2-3 on HashMap
Coding in Barrons P. 384-387
OPEN and Review HashMap on Java Docs
82
Another Example:
Lets Review the HashMap Example in the
Handout
83
Misc:
Java’s String, Double and Integer classes
have their own hashCode methods built in
When designing your own class for use in
a HashSet or HashMap you need to
override the Object’s HashCode method
with a method that is appropriate for your
specific class
84
The Object class’s hashCode typically
works from the object’s memory location
and NOT on the attributes of the class
Regardless of who designs it, you MUST
supply a HashCode if you plan on using
your objects in a HashSet or a HashMap
85
The HashCode method returns an integer
from which the HashSet and HashMap
further map the HashCode onto the range
of valid table indices for a particular table
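One simple way to map a hashCode onto the valid indices of a table is a remainder that is never negative. This is only an illustration of the idea; it is not necessarily how java.util.HashSet or java.util.HashMap do it internally, and the class name IndexFor is hypothetical.

```java
// Hypothetical helper class: maps any hashCode into 0 .. tableSize - 1.
class IndexFor {
    static int indexFor(Object key, int tableSize) {
        // hashCode may be negative; floorMod always returns a value
        // in the range 0 .. tableSize - 1 for a positive tableSize.
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its own int value.
        System.out.println(indexFor(Integer.valueOf(217), 1500)); // prints 217
        System.out.println(indexFor(Integer.valueOf(-7), 10));    // prints 3
    }
}
```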
86
Big-O:
HashSet has an expected Big-O of O(1) for
add, remove and contains
HashMap has an expected Big-O of O(1) for
get and put but could be O(n) in the worst
case if many collisions occur
A Hash Table provides a structure where
insert and search are carried out in
expected constant time
87
AP AB Subset Requirements:
Students should be able to understand:
Hash tables as well as understand how to
use the Java classes HashSet and
HashMap
Understand and be able to utilize the three
HashSet constructors
88
AP AB Subset Requirements:
Know the concept of hashing and how
collisions are created and resolved
Explain how best to construct a Hash
Table to minimize collisions
Understand the goal of a good hash
function
89
AP AB Subset Requirements:
Understand chaining, probing and load
factor
Determine when to use the HashSet and
HashMap and know the Big-O of their
behaviors
Write code that creates, adds, removes
and iterates over Sets using HashSet
90
AP AB Subset Requirements:
Write code that creates, puts, gets,
removes and returns the Set of Keys for a
HashMap
91
Tips for the AP Exam:
Do not change objects in a Set
Sets do not contain duplicates
Sets are not ordered
92
Tips for the AP Exam:
Use an Iterator to list all of the elements of
a Set
Iterating thru a HashSet Does not iterate in
any specific order
You can not add an element to a set at an
iterator position
93
Tips for the AP Exam:
In a HashMap only the Keys are hashed
HashSet’s add, remove, contains and
HashMap’s put, get, remove run in O(1)
expected time but O(n) in worst case
User Defined Classes that will be used in a
HashSet or HashMap should have
overridden equals and hashCode
methods
94
Project:
MyMap
POE
95
TEST FOLLOWS
LABS !!!
96