Appendix E-A
Hashing
Modified
Chapter Scope
• Concept of hashing
• Hashing functions
• Collision handling
– Open addressing
– Buckets
– Chaining
• Deletions
• Performance
Java Software Structures, 4th Edition, Lewis/Chase
What is hashing?
• Hashing is a scheme for storing and retrieving
information by (key) value. Sometimes used to
implement associative memory.
• A hash function maps a value to a
location. The value (and associated info) may be
stored at that location, or at least accessed via
that location.
• Very efficient for storing and retrieving: with a
good hash function, access time is close to
constant on average.
• Used extensively in computing – software and
hardware.
Collisions
• Ideally, each value would be stored at exactly
the location it maps to, with no two values
sharing a location. A perfect hashing function
achieves this.
• However, in most situations, multiple values
may map to the same location (a collision).
• So we have to have a strategy to handle
collisions.
• There are several popular collision handling
strategies.
Hash function
• A hash function is a mapping from a value space to a
location space.
• The value space is any domain of values. Strings, ints,
phone numbers, student IDs, …
• The location space is normally a sequence of integers
from 0 to N-1, where N is the size of the location space.
The location space resembles a 1-dimensional array (like
computer memory).
[Figure: a hash function maps the value space to the location space]
Characteristics of a good hash function
• It should cover the entire location space
• It should distribute the key values fairly evenly
into the location space
• Generally, 2 values that are “close together” in
the value space should not be close together in
the location space.
Aside: Cryptographic hashing
Division (remainder) hash function
• Probably most commonly used method, either by
itself or combined with another method.
• If the location space is of size N, divide the value
(somehow represented as an integer) by N and
take the remainder as the result.
• Choosing N to be a prime number improves the
likelihood of the mapping distributing the values
fairly evenly.
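A minimal sketch of the division method, assuming the value has already been represented as an integer. The class name DivisionHash and the table size 11 are illustrative choices, not part of the slides.

```java
final class DivisionHash {
    /** Maps an integer key to a location in 0..n-1 by taking the remainder. */
    static int hash(long key, int n) {
        // floorMod keeps the result non-negative even when the key is negative
        return (int) Math.floorMod(key, (long) n);
    }

    public static void main(String[] args) {
        int n = 11; // table size; a prime, as the slide recommends
        long[] keys = {22, 35, 47, 100};
        for (long k : keys) {
            System.out.println(k + " -> " + hash(k, n));
        }
    }
}
```

Math.floorMod is used instead of % because Java's % can return a negative remainder for a negative key, which would not be a valid location.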
Representing a value as an integer
• We know that all information stored in computer
storage is a string of bits.
• Any string of bits can be interpreted as a binary
integer.
• So how do we make that interpretation?
• Modern languages try to prevent us from
changing our interpretation of a string of bits –
strong typing.
Char to int
Java does give us a loophole for character data
System.out.println( (int) "ABC".charAt(0));
displays 65
The data type char is an “integral” type, and can
be automatically converted to an int or long.
int i = 'B'; // sets i to 66
Also, can use bitwise and bit shift ops.
• ~, &, |, ^, <<, >>, >>>
int i = 'B';
System.out.println( i ); //displays 66
System.out.println( (int) "ABC".charAt(0) ); //displays 65
System.out.println( 'A' & 'B' ); //displays 64
System.out.println( 'A' | 'B' ); //displays 67
System.out.println( 'A' ^ 'B' ); //displays 3
System.out.println( ~'A' ); //displays -66
System.out.println( 'A' << 2 ); //displays 260
System.out.println( 'A' >>> 1 ); //displays 32
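The char conversions and bit operations above can be combined to turn a whole string into one integer. The following is only a sketch under assumptions: the mixing step (rotate left by 5 bits, then XOR in the next character) and the names StringToInt, toInt and location are illustrative, not a method given in the slides.

```java
final class StringToInt {
    /** Folds the characters of s into a single int using shifts and XOR. */
    static int toInt(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // rotate h left by 5 bits, then mix in the next character's code
            h = (h << 5) ^ (h >>> 27) ^ s.charAt(i);
        }
        return h;
    }

    /** Maps the string to a location in 0..n-1 with the division method. */
    static int location(String s, int n) {
        return Math.floorMod(toInt(s), n);
    }
}
```

For example, toInt("A") is 65, and toInt("AB") is (65 << 5) ^ 66 = 2146.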
Folding
• Divide the value into parts and then combine
them.
• Example:
value is 234-56-9876, split into the groups 234, 569, 876
(the middle group is reversed before adding)
  234
  965
+ 876
------
 2075
location = 2075 % N
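The example reverses the middle group (569 becomes 965) before adding. A small sketch of that variant follows, assuming groups of a fixed size and that every second group is reversed. The names Folding and boundaryFold and the table size 101 are illustrative.

```java
final class Folding {
    /** Splits the digits of key into groups, reverses every second group, and sums them. */
    static int boundaryFold(String key, int groupSize) {
        String digits = key.replaceAll("\\D", ""); // drop dashes and other non-digits
        int sum = 0;
        for (int start = 0, g = 0; start < digits.length(); start += groupSize, g++) {
            int end = Math.min(start + groupSize, digits.length());
            String part = digits.substring(start, end);
            if (g % 2 == 1) {
                part = new StringBuilder(part).reverse().toString();
            }
            sum += Integer.parseInt(part);
        }
        return sum;
    }

    /** Folds the key, then applies the division method for a table of size n. */
    static int hash(String key, int groupSize, int n) {
        return boundaryFold(key, groupSize) % n;
    }
}
```

boundaryFold("234-56-9876", 3) gives 234 + 965 + 876 = 2075, and with N = 101 the location is 2075 % 101 = 55.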
Other hash functions
• Mid square – square the value, as a number, and
take a portion out of the middle of that product.
• Extraction – use only a part of an
element’s value or key to compute the location
at which to store the element.
• Length dependent – use a portion of the value,
then combine with the length of the value.
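As a rough illustration of the mid-square method, the sketch below squares a key and takes a run of digits from the middle of the decimal product. The key 4567, the choice of decimal digits, and the name midSquare are assumptions made for the example. It expects a non-negative key small enough that the square fits in a long.

```java
final class MidSquare {
    /** Squares key and returns `digits` decimal digits taken from the middle of the square. */
    static int midSquare(long key, int digits) {
        String sq = Long.toString(key * key);
        if (sq.length() <= digits) {
            return Integer.parseInt(sq); // square is too short to trim
        }
        int start = (sq.length() - digits) / 2;
        return Integer.parseInt(sq.substring(start, start + digits));
    }
}
```

For key 4567 the square is 20857489, and the middle four digits are 8574. The result would normally be reduced further, for example with % N.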
Hashing Functions - Digit Analysis
• In the digit analysis method, the index is formed by
extracting, and then manipulating specific digits from
the key
• For example, if our key is 1234567, we might select
the digits in positions 2 through 4 yielding 234
• The manipulation can then take many forms
– Reversing the digits (432)
– Performing a circular shift to the right (423)
– Performing a circular shift to the left (342)
– Swapping each pair of digits (324)
• Alternatively, these manipulations could be done
on the bits
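The four manipulations can be sketched on digit strings as follows. The class and method names are illustrative, and the inputs are assumed to be non-empty.

```java
final class DigitAnalysis {
    /** Extracts digits at 1-based positions from..to (inclusive). */
    static String extract(String key, int from, int to) {
        return key.substring(from - 1, to);
    }

    static String reverse(String d) {
        return new StringBuilder(d).reverse().toString();
    }

    /** Circular shift right: the last digit moves to the front. */
    static String rotateRight(String d) {
        return d.charAt(d.length() - 1) + d.substring(0, d.length() - 1);
    }

    /** Circular shift left: the first digit moves to the end. */
    static String rotateLeft(String d) {
        return d.substring(1) + d.charAt(0);
    }

    /** Swaps each adjacent pair of digits; a trailing unpaired digit stays in place. */
    static String swapPairs(String d) {
        char[] c = d.toCharArray();
        for (int i = 0; i + 1 < c.length; i += 2) {
            char t = c[i];
            c[i] = c[i + 1];
            c[i + 1] = t;
        }
        return new String(c);
    }
}
```

With extract("1234567", 2, 4) yielding "234", the four methods return "432", "423", "342" and "324", matching the slide.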
Appendix E-B
Hashing – Open Addressing
Modified
Open Addressing
A.K.A. Closed Hashing
• All hashed entries, including collisions, are stored within
the hash table (closed array)
• Colliding entries are stored at open
addresses (locations) within the table.
• When a collision occurs and the entry cannot be stored
at its home address (to which it was originally hashed),
the table is probed for an open position in the table
where it can be stored.
• When the entry is looked for, this same probe sequence
must be followed until it is found or determined that it
is not in the table.
Three probing approaches
1. Linear probing
2. Quadratic probing
3. Double hashing
Open addressing using Linear Probing
In linear probing, if an entry
hashes to position P and that
position is occupied, we simply
probe for empty positions at
(P + I) % TableSize
where I = 1, 2, 3, 4, …
or some other linear sequence.
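A minimal sketch of insert and find with linear probing (interval 1), assuming int keys and the division method for the home position. The class LinearProbing is illustrative and does not handle deletions or resizing.

```java
final class LinearProbing {
    private final Integer[] table; // null means the cell is empty

    LinearProbing(int size) {
        table = new Integer[size];
    }

    private int home(int key) {
        return Math.floorMod(key, table.length);
    }

    /** Returns the index where key is stored, or -1 if the table is full. */
    int insert(int key) {
        int p = home(key);
        for (int i = 0; i < table.length; i++) {
            int pos = (p + i) % table.length; // i = 0 is the home address
            if (table[pos] == null || table[pos] == key) {
                table[pos] = key;
                return pos;
            }
        }
        return -1;
    }

    /** Follows the same probe sequence as insert; returns the index or -1. */
    int find(int key) {
        int p = home(key);
        for (int i = 0; i < table.length; i++) {
            int pos = (p + i) % table.length;
            if (table[pos] == null) return -1; // empty cell ends the probe sequence
            if (table[pos] == key) return pos;
        }
        return -1;
    }
}
```

In a table of size 7, the keys 10, 17 and 24 all hash to position 3, so they end up at positions 3, 4 and 5.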
Issues with Linear probing
• Linear probing may lead to clustering, which is both
good and bad: it increases the average number of
probes, but gives good locality of reference (if the
interval is 1).
• Deleted entries are marked as deleted, not empty;
they can be reused, but they do not mark the
end of a probe sequence.
• If the probe interval is greater than 1, the interval
and the table size must share no common factor;
a prime table size ensures that all positions are
in every probe sequence (for any interval smaller
than the table size).
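The deletion rule can be sketched with a special marker object (often called a tombstone). The class name TombstoneTable and the DELETED marker are assumptions for illustration. Keys are ints and the probe interval is 1.

```java
final class TombstoneTable {
    private static final Object DELETED = new Object(); // marks a deleted cell
    private final Object[] table;                       // null means never used

    TombstoneTable(int size) {
        table = new Object[size];
    }

    private int indexOf(int key) {
        int p = Math.floorMod(key, table.length);
        for (int i = 0; i < table.length; i++) {
            int pos = (p + i) % table.length;
            Object cell = table[pos];
            if (cell == null) return -1; // truly empty: end of the probe sequence
            if (cell != DELETED && cell.equals(key)) return pos; // deleted cells are skipped
        }
        return -1;
    }

    boolean contains(int key) {
        return indexOf(key) >= 0;
    }

    /** Returns false if the key is already present or the table is full. */
    boolean insert(int key) {
        if (contains(key)) return false;
        int p = Math.floorMod(key, table.length);
        for (int i = 0; i < table.length; i++) {
            int pos = (p + i) % table.length;
            if (table[pos] == null || table[pos] == DELETED) { // deleted cells can be reused
                table[pos] = key;
                return true;
            }
        }
        return false;
    }

    boolean delete(int key) {
        int pos = indexOf(key);
        if (pos < 0) return false;
        table[pos] = DELETED; // do not reset to null, or later entries become unreachable
        return true;
    }
}
```

If delete simply cleared the cell, a search for a key stored further along the same probe sequence would stop early at the now-empty cell and wrongly report the key as missing.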
Issues with Linear probing (continued)
• Performance drops off as load factor nears 80%
• Must expand table and rehash all entries
https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
Quadratic probing
• In quadratic probing, the probe interval is a
quadratic polynomial – I² in the simplest case.
• So, if an entry hashes to position P and that
position is occupied, we simply probe for empty
positions at
(P + I²) % TableSize
where I = 1, 2, 3, 4, …
• Less primary clustering than with linear probing
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
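A short sketch that lists the positions visited by quadratic probing in the simplest case (interval I²). The method name probeSequence is illustrative. Index 0 of the result is the home position P itself.

```java
final class QuadraticProbing {
    /** Returns the first `count` probe positions for home position p. */
    static int[] probeSequence(int p, int tableSize, int count) {
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            // long arithmetic avoids overflow of i * i for large i
            seq[i] = (int) ((p + (long) i * i) % tableSize);
        }
        return seq;
    }
}
```

For P = 3 and TableSize = 11 the first five positions are 3, 4, 7, 1, 8, so successive probes spread out instead of filling the cells right next to P.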
Double Hashing
• The interval between probes is computed by
another hash function H2(x)
• So, if an entry x hashes to position P and that
position is occupied, we simply probe for empty
positions at
(P + I * H2(x)) % TableSize
where I = 1, 2, 3, 4, …
• H2(x) must never evaluate to 0, or the probe
never leaves P
• Less primary clustering than with linear probing;
keys that share a home position usually follow
different probe sequences
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
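A sketch of the double hashing probe sequence. The second hash function used here, 1 + (key mod (N - 1)), is only one common choice and is an assumption, as are the names h1, h2 and probeSequence. It never returns 0, so the probe always moves.

```java
final class DoubleHashing {
    /** Primary hash: the home position. */
    static int h1(int key, int n) {
        return Math.floorMod(key, n);
    }

    /** Secondary hash (an assumed choice): a step size in 1..n-1, never 0. */
    static int h2(int key, int n) {
        return 1 + Math.floorMod(key, n - 1);
    }

    /** Returns the first `count` probe positions for key in a table of size n. */
    static int[] probeSequence(int key, int n, int count) {
        int p = h1(key, n);
        int step = h2(key, n);
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = (int) ((p + (long) i * step) % n);
        }
        return seq;
    }
}
```

With N = 11, the keys 25 and 14 share home position 3 but then diverge: 25 probes 3, 9, 4, 10, 5 while 14 probes 3, 8, 2.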
Appendix E-C
Hashing – Buckets
Modified
Buckets
• The locations in the hash table are referred to as
cells or as buckets.
• A bucket can be big enough to hold several
entries (not just one).
• So, entries are hashed to a bucket location, and
colliding entries can be stored in the same bucket
until it becomes full.
• After it becomes full, the colliding elements can
be stored in a common overflow area.
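A minimal sketch of fixed-capacity buckets with a shared overflow area, assuming int keys and no deletions. The class BucketTable and its use of an ArrayList for the overflow area are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketTable {
    private final int[][] buckets; // buckets[b] holds up to `capacity` keys
    private final int[] counts;    // number of keys currently in each bucket
    private final List<Integer> overflow = new ArrayList<>(); // common overflow area

    BucketTable(int numBuckets, int capacity) {
        buckets = new int[numBuckets][capacity];
        counts = new int[numBuckets];
    }

    void insert(int key) {
        int b = Math.floorMod(key, buckets.length);
        if (counts[b] < buckets[b].length) {
            buckets[b][counts[b]++] = key; // room left in the home bucket
        } else {
            overflow.add(key);             // bucket full: spill to the overflow area
        }
    }

    boolean contains(int key) {
        int b = Math.floorMod(key, buckets.length);
        for (int i = 0; i < counts[b]; i++) {
            if (buckets[b][i] == key) return true;
        }
        // the overflow area only needs searching when the home bucket is full
        return counts[b] == buckets[b].length && overflow.contains(key);
    }

    int overflowSize() {
        return overflow.size();
    }
}
```

With 5 buckets of capacity 2, the keys 3, 8 and 13 all hash to bucket 3; the first two fit and 13 goes to the overflow area.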
• What is the advantage of this approach over
some of the other open addressing approaches?
• Locality of reference – the likelihood that, when
you access a place in memory or on disk,
the next place you reference is “nearby”.
• This makes for better efficiency in virtual
memory and more efficient disk access.
• Eliminates primary clustering
• https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html
Appendix E-D
Hashing – With chaining
Modified
Chaining
• The chaining method simply treats the hash table
conceptually as an array of lists of individual
elements
• Thus each hash value locates a list of all entries
that hash to (collide at) that hash location.
• These lists are usually linked (chained) lists.
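A minimal sketch of chaining, assuming int keys and using java.util.LinkedList for each chain. The class ChainedTable and its method names are illustrative.

```java
import java.util.LinkedList;

final class ChainedTable {
    private final LinkedList<Integer>[] chains; // one list per table location

    @SuppressWarnings("unchecked")
    ChainedTable(int n) {
        chains = new LinkedList[n];
        for (int i = 0; i < n; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    private LinkedList<Integer> chainFor(int key) {
        return chains[Math.floorMod(key, chains.length)];
    }

    /** Adds the key at the head of its chain unless it is already present. */
    void insert(int key) {
        LinkedList<Integer> chain = chainFor(key);
        if (!chain.contains(key)) {
            chain.addFirst(key);
        }
    }

    boolean find(int key) {
        return chainFor(key).contains(key);
    }

    boolean delete(int key) {
        // Integer.valueOf selects remove(Object), not remove(int index)
        return chainFor(key).remove(Integer.valueOf(key));
    }

    int chainLength(int location) {
        return chains[location].size();
    }
}
```

In a table of size 5, the keys 3, 8 and 13 collide at location 3 and simply share one chain of length 3.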
The chaining method of collision handling
[Figure: table cells 0, 1, 2, …, N-1, each with its chain of entries]
Two variants:
1. The table cells can contain
the data being stored, or
2. The table cells can contain
only head pointers to the
lists, with all data being
stored in the list nodes.
Pros and cons of each variant?
Basic operations
• Insert
• Find
• Delete
Lists can be ordered or not
Pros of chaining – compared to closed hashing
• Hash table never has to be expanded.
• Performance degrades more slowly as table fills
up.
• Fewer (or no) empty table (data) spaces.
• Insertion (at the head of list) is simple and takes
constant time.
• Deletion does not require special treatment.
• No clustering
Cons of chaining – compared to closed hashing
• Extra space used for pointers
• Extra time required to allocate list nodes
dynamically!
• Worse locality of reference. Significant if lists get
long.
• Size (and number) of data records must be
considered.
Chaining using an overflow area
Chaining (with simulated
links) can be accomplished
using an array-based
structure with an overflow
area.
Pros and cons?
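One possible array-based layout, sketched under assumptions: the primary cells and the overflow cells live in one array, and a parallel int array holds the simulated links (the index of the next cell in the chain, or -1). The class OverflowChainTable and its field names are illustrative. Deletion is not handled.

```java
import java.util.Arrays;

final class OverflowChainTable {
    private static final int NONE = -1; // simulated null link
    private final Integer[] keys;       // cells 0..primarySize-1 are primary, the rest overflow
    private final int[] next;           // next[i] = index of the next cell in i's chain
    private final int primarySize;
    private int nextFree;               // first unused overflow cell

    OverflowChainTable(int primarySize, int overflowSize) {
        this.primarySize = primarySize;
        keys = new Integer[primarySize + overflowSize];
        next = new int[keys.length];
        Arrays.fill(next, NONE);
        nextFree = primarySize;
    }

    /** Returns false if the key is already present or the overflow area is full. */
    boolean insert(int key) {
        int pos = Math.floorMod(key, primarySize);
        if (keys[pos] == null) {
            keys[pos] = key; // home cell is free
            return true;
        }
        while (true) {       // walk to the end of the chain
            if (keys[pos] == key) return false;
            if (next[pos] == NONE) break;
            pos = next[pos];
        }
        if (nextFree == keys.length) return false;
        keys[nextFree] = key;
        next[pos] = nextFree++; // link the new overflow cell onto the chain
        return true;
    }

    /** Returns the cell index holding key, or -1 if it is absent. */
    int indexOf(int key) {
        int pos = Math.floorMod(key, primarySize);
        while (pos != NONE && keys[pos] != null) {
            if (keys[pos] == key) return pos;
            pos = next[pos];
        }
        return -1;
    }
}
```

With 5 primary cells and 3 overflow cells, key 3 lands in cell 3, and the colliding keys 8 and 13 are chained into overflow cells 5 and 6.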
Appendix E-E
Hashing – Variations
Modified
http://en.wikipedia.org/wiki/Hash_table
Coalesced hashing
Omit!
Incremental resizing of a hash table
Some hash table implementations, notably in real-time systems,
cannot pay the price of enlarging the hash table all at once, because it
may interrupt time-critical operations. If one cannot avoid dynamic
resizing, a solution is to perform the resizing gradually:
• During the resize, allocate the new hash table, but keep the old
table unchanged.
• In each lookup or delete operation, check both tables.
• Perform insertion operations only in the new table.
• At each insertion also move r elements from the old table to the
new table.
• When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new
table itself needs to be enlarged, it is necessary to increase the size of
the table by a factor of at least (r + 1)/r during resizing.
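The migration protocol above can be sketched as follows. To keep the example short, the old and new tables are stood in for by java.util.HashSet objects, which is purely an assumption of this sketch; a real implementation would manage its own arrays. The class name and beginResize are hypothetical, and r is the number of elements moved per insertion.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

final class IncrementalResizeSketch {
    private Set<Integer> oldTable = null;          // non-null only while a resize is in progress
    private Set<Integer> newTable = new HashSet<>();
    private final int r;                           // elements migrated per insertion

    IncrementalResizeSketch(int r) {
        this.r = r;
    }

    /** Starts a resize: allocate the new table, keep the old one unchanged. */
    void beginResize() {
        if (oldTable != null) throw new IllegalStateException("resize already in progress");
        oldTable = newTable;
        newTable = new HashSet<>();
    }

    /** Lookups check both tables. */
    boolean contains(int key) {
        return newTable.contains(key) || (oldTable != null && oldTable.contains(key));
    }

    /** Deletes check both tables. */
    boolean delete(int key) {
        boolean removed = newTable.remove(key);
        if (oldTable != null) removed |= oldTable.remove(key);
        return removed;
    }

    /** Inserts go only into the new table, then r old elements are moved over. */
    void insert(int key) {
        if (oldTable != null) oldTable.remove(key); // avoid keeping a stale copy
        newTable.add(key);
        migrate();
    }

    private void migrate() {
        if (oldTable == null) return;
        Iterator<Integer> it = oldTable.iterator();
        for (int moved = 0; moved < r && it.hasNext(); moved++) {
            newTable.add(it.next());
            it.remove();
        }
        if (oldTable.isEmpty()) oldTable = null;    // old table drained: release it
    }

    boolean resizing() {
        return oldTable != null;
    }
}
```

With r = 2 and six elements in the old table, three insertions after beginResize() are enough to drain the old table.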