Course 2:Searching

Data Structures(数据结构)
Vocabulary
sequential search 顺序查找
element 元素
order 次序
binary search 二分查找
target 目标
algorithm 算法
array 数组
location 位置
object 对象,目标
parameter 参数
index 下标,索引,指针
sentinel 哨兵
probability 概率
key 关键字
hash 散列,杂凑
collision 冲突
cluster 聚集,群集
synonym 同义语,同义词
probe 探测
load factor 装填因子
西南财经大学天府学院
Searching
Searching is one of the most common and time-consuming operations in computer science.
Its goal is to find the location of a target among a list of objects.
Main contents (in chapter 2)
List searching (two basic search algorithms)
Sequential search (with three variations)
Binary search
Hashed list searching: the key, through an algorithmic function, determines the location of the data
Collision resolution
The list search algorithms are discussed using an array structure
2-1 List searches (work with arrays)
•The algorithm used to search a list depends on the structure of the list
•Sequential search (any array) is used when:
  The list is not ordered
  The list is small
  The list is not searched often
Locating data in an unordered list
Index:  A[0] A[1] … A[11]
List:   4 21 36 14 62 91 8 22 7 81 77 10
Target given: 14
Location wanted: 3
Search concept
List (A[0] to A[11]): 4 21 36 14 62 91 8 22 7 81 77 10
Target given: 14
Location wanted: 3
Index 0: 14 not equal 4
Index 1: 14 not equal 21
…
Index 3: 14 equal 14
Sequential search algorithm
The algorithm needs to tell the calling algorithm two things:
Did it find the data it was looking for?
If it did, at what index were the target data found?
It requires four parameters:
The list we are searching
An index to the last element in the list
The target
The address where the found element's index location is to be stored
(It returns a Boolean)
Sequential search algorithm
algorithm seqsearch(val list <array>,
                    val last <index>,
                    val target <keytype>,
                    ref locn <index>)
Locate the target in an unordered list.
  Pre    list must contain at least one element
         last is index to last element in the list
         target contains the data to be located
         locn is address of index in calling algorithm
  Post   if found: matching index stored in locn and found true
         if not found: last stored in locn and found false
  Return found <boolean>
  looker = 0
  loop (looker < last and target not equal list[looker])
    looker = looker + 1
  end loop
  locn = looker
  if (target equal list[looker])
    found = true
  else
    found = false
  end if
  return found
end seqsearch
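The seqsearch pseudocode can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `seq_search` and the choice to return a `(found, locn)` pair instead of using a reference parameter are assumptions made for the example.

```python
def seq_search(items, target):
    """Sequential search; returns (found, locn) like the seqsearch pseudocode.

    Precondition (from the slides): items holds at least one element.
    """
    last = len(items) - 1              # index of the last element
    looker = 0
    # Stop at the last element or as soon as the target is seen.
    while looker < last and items[looker] != target:
        looker += 1
    return items[looker] == target, looker


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(seq_search(data, 14))   # (True, 3)
print(seq_search(data, 5))    # (False, 11): last index is reported when not found
```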
Variations on sequential searches
Sentinel search
Probability search
Ordered list search
Sentinel search
algorithm sentinelsearch(val list <array>,
                         val last <index>,
                         val target <keytype>,
                         ref locn <index>)
Locate the target in an unordered list.
  Pre    list must contain at least one element
         last is index to last element in the list
         target contains the data to be located
         locn is address of index in calling algorithm
  Post   if found: matching index stored in locn and found true
         if not found: last stored in locn and found false
  Return found <boolean>
  list[last + 1] = target
  looker = 0
  loop (target not equal list[looker])
    looker = looker + 1
  end loop
  if (looker <= last)
    found = true
    locn = looker
  else
    found = false
    locn = last
  end if
  return found
end sentinelsearch
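A Python sketch of the sentinel idea is shown below. It assumes that appending to a Python list can stand in for the spare slot `list[last + 1]` that the pseudocode relies on; the name `sentinel_search` and the returned `(found, locn)` pair are illustrative choices.

```python
def sentinel_search(items, target):
    """Sentinel search; returns (found, locn)."""
    last = len(items) - 1
    items.append(target)               # sentinel stored at index last + 1
    looker = 0
    # Only one test per iteration: the sentinel guarantees termination.
    while items[looker] != target:
        looker += 1
    items.pop()                        # remove the sentinel again
    if looker <= last:
        return True, looker
    return False, last


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(sentinel_search(data, 14))   # (True, 3)
print(sentinel_search(data, 5))    # (False, 11)
```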
Probability search
Locate the target in an unordered list.
  Pre    the same as above
  Post   if found: matching index stored in locn, found true, and the element is moved up in priority
         if not found: the same as above
  Return found <boolean>
  looker = 0
  loop (looker < last and target not equal list[looker])
    looker = looker + 1
  end loop
  if (target equal list[looker])
    found = true
    if (looker > 0)
      temp = list[looker - 1]
      list[looker - 1] = list[looker]
      list[looker] = temp
      looker = looker - 1
    end if
  else
    found = false
  end if
  locn = looker
  return found
end probability search
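The move-up step of the probability search can be illustrated with a short Python sketch. The function name `probability_search` and the tuple return value are assumptions for the example; the swap logic follows the pseudocode.

```python
def probability_search(items, target):
    """Sequential search that moves a found element one position toward the front."""
    last = len(items) - 1
    looker = 0
    while looker < last and items[looker] != target:
        looker += 1
    found = items[looker] == target
    if found and looker > 0:
        # Swap with the preceding element so popular targets drift to the front.
        items[looker - 1], items[looker] = items[looker], items[looker - 1]
        looker -= 1
    return found, looker


data = [4, 21, 36, 14, 62, 91, 8, 22, 7, 81, 77, 10]
print(probability_search(data, 14))   # (True, 2)
print(probability_search(data, 14))   # (True, 1)
```

Each successful search for the same target shortens the next search for it by one comparison.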
Ordered list search
Locate target in a list ordered on target.
  Pre    the same as sequential search
  Post   if found: the same as above
         if not found: locn is index of first element > target, or locn equals last, and found is false
  Return found <boolean>
  if (target <= list[last])
    looker = 0
    loop (target > list[looker])
      looker = looker + 1
    end loop
  else
    looker = last
  end if
  if (target equal list[looker])
    found = true
  else
    found = false
  end if
  locn = looker
  return found
Note:
• It is not necessary to search to the end of the list
• It is only for small lists
• It incorporates the sentinel idea
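A minimal Python sketch of the ordered list search follows, assuming an ascending list with at least one element. The name `ordered_search` is illustrative. The last element acts as a built-in sentinel, so the loop needs only one test.

```python
def ordered_search(items, target):
    """Search an ascending list; stop as soon as an element >= target is reached."""
    last = len(items) - 1
    if target <= items[last]:
        looker = 0
        # items[last] >= target guarantees that this loop terminates.
        while target > items[looker]:
            looker += 1
    else:
        looker = last
    return items[looker] == target, looker


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(ordered_search(ordered, 22))    # (True, 6)
print(ordered_search(ordered, 11))    # (False, 4): index of first element > 11
print(ordered_search(ordered, 100))   # (False, 11)
```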
Binary search
The sequential search algorithm is very slow
–But it is the only solution if the array is not sorted
Binary search (ordered list)
–For large lists
–First sort
–Then search
Binary search method
Suppose L is a sorted list and we are searching for a value X.
1. Compare X to the middle value (M) in L.
2. If X = M, we are done.
3. If X < M, we continue the search but confine it to the first half of L and ignore the second half entirely.
4. If X > M, we continue but confine the search to the second half of L.
Binary search example: target 22 is in the list
Index:  A[0] A[1] … A[11]
List:   4 7 8 10 14 21 22 36 62 77 81 91
first   mid   last   comparison
0       5     11     22 > 21
6       8     11     22 < 62
6       6     7      22 = 22
Target found: target 22 is in the list
Binary search example: target 11 is not in the list
Index:  A[0] A[1] … A[11]
List:   4 7 8 10 14 21 22 36 62 77 81 91
first   mid   last   comparison
0       5     11     11 < 21
0       2     4      11 > 8
3       3     4      11 > 10
4       4     4      11 < 14
4       4     3      function terminates (first > last)
Target not found: target 11 is not in the list
Binary search (ordered list)
algorithm binary_search(val list <array>,
                        val end <index>,
                        val target <keytype>,
                        ref locn <index>)
  Pre    list is ordered; it must contain at least one element
         end is index to the largest element in the list
         target is the value of element being sought
         locn is address of index in calling algorithm
  Post   found: locn assigned index to target element, found set true
         not found: locn = element below or above target, found set false
  Return found <boolean>
  first = 0
  last = end
  loop (first <= last)
    mid = (first + last) / 2
    if (target > list[mid])
      look in upper half
      first = mid + 1
    else if (target < list[mid])
      look in lower half
      last = mid - 1
    else
      found equal: force exit
      first = last + 1
    end if
  end loop
  locn = mid
  if (target equal list[mid])
    found = true
  else
    found = false
  end if
  return found
end binary_search
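The binary search pseudocode translates to Python roughly as follows. This is a sketch under the slide's precondition (an ordered list with at least one element); the name `binary_search` matches the pseudocode, while the `(found, locn)` return value and the use of `break` in place of the forced exit `first = last + 1` are choices made for the example.

```python
def binary_search(items, target):
    """Binary search on an ascending list; returns (found, locn)."""
    first, last = 0, len(items) - 1
    mid = 0
    while first <= last:
        mid = (first + last) // 2
        if target > items[mid]:
            first = mid + 1            # look in upper half
        elif target < items[mid]:
            last = mid - 1             # look in lower half
        else:
            break                      # found: force exit from the loop
    return items[mid] == target, mid


ordered = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
print(binary_search(ordered, 22))   # (True, 6)
print(binary_search(ordered, 11))   # (False, 4)
```

The two calls reproduce the traces for targets 22 and 11 on the example list.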
Analyzing the efficiency
Sequential search, sentinel search, ordered list search: O(n)
Binary search: O(log2 n)
Comparison of binary and sequential searches
size        binary   sequential (average)   sequential (worst case)
16          4        8                      16
10,000      14       5,000                  10,000
1,000,000   20       500,000                1,000,000
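The figures in the comparison can be reproduced with a few lines of Python. The sketch assumes the binary column is the ceiling of log2(size), the sequential average is size / 2, and the sequential worst case is size; the function name `comparison_counts` is hypothetical.

```python
import math


def comparison_counts(size):
    """Return (binary, sequential average, sequential worst case) for a list size."""
    binary = math.ceil(math.log2(size))
    return binary, size // 2, size


for size in (16, 10_000, 1_000_000):
    print(size, comparison_counts(size))
# 16 (4, 8, 16)
# 10000 (14, 5000, 10000)
# 1000000 (20, 500000, 1000000)
```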
2-3 Hashed list searches
Ideal search: we would know exactly where the data are and go directly there
Goal of a hashed search: to find the data with only one test
Use an array of data: the key is passed to a hash function (hash algorithm), which produces the index in the array (the address in the list)
Hash function: key → address
key      hash   address
102002   →      005
107095   →      100
111060   →      002
[000]
[001] Harry Lee
[002] 111060 Sarah Trapp
…
[005] 102002 Vu Nguyen
…
[100] 107095 John Adams
Figure 2-6 Hash concept
Basic concepts
Hash search: a search in which the key, through an algorithmic function, determines the location of the data.
We use a hashing algorithm to transform the key into the index that contains the data we need to locate (key-to-address).
Problem
Synonyms: a set of keys that hash to the same location
Collision: occurs when a list contains two or more synonyms
Home address: the address produced by the hashing algorithm
Prime area: the memory that contains all of the home addresses
Collision resolution: when two keys collide at a home address, one of the keys and its data is placed in another location
Collision resolution
1. hash(A): A is placed at [8]
2. hash(B): B and A collide at 8; collision resolution places B at [16]
3. hash(C): C and B collide at 16; collision resolution places C at [4]
Figure 2-7 The collision resolution concept
Locate an element in a hashed list
Use the same algorithm that was used to insert it into the list
First hash the key and check the home address to see whether it contains the desired element
If it does, the search is complete
If not, use the collision resolution algorithm to determine the next location and continue until we find the element or determine that it is not in the list
Each calculation of an address and test for success is called a probe
Hashing methods
Direct
Subtraction
Modulo division
Digit extraction
Midsquare
Folding
Rotation
Pseudorandom generation
Figure 2-8 Basic hashing techniques
Direct method
The key is the address (one key per element, no synonyms)
Example 1: total monthly sales by the days of the month
Create an array of 31 accumulators
The accumulation code is:
dailySales[sale.day] = dailySales[sale.day] + sale.amount;
Example 2: a small company has fewer than 100 employees
Employee numbers are between 1 and 100
key   hash   address
5     →      005
100   →      100
2     →      002
[000] (not used)
[001] Harry Lee
[002] Sarah Trapp
…
[005] Vu Nguyen
…
[100] John Adams
Figure 2-9 Direct hashing of employee numbers
Subtraction method
•Keys are consecutive but do not start from 1
•Example: your student ID number
Advantages
•The hashing function is very simple
•No collisions
Disadvantage
•Only for small lists
Note:
1. Generally speaking, hashed lists require some empty elements to reduce the number of collisions
2. The two methods above are ideal, but they are very limited in application, for example with ID card numbers
Modulo-division method (division remainder)
This method divides the key by the array size and uses the remainder for the address
The hashing algorithm is:
address = key modulus listsize
Note: a list size that is a prime number produces fewer collisions
key      hash   address
121267   →      002
045128   →      306
379452   →      000
[000] 379452 Marry Dodd
[001]
[002] 121267 Bryan Devaux
…
[007] 378845 John Carver
[008]
…
[305] 160252 Tuan Ngo
[306] 045128 Shouli Feldman
List size = 307
Figure 2-10 Modulo-division hashing
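The addresses in Figure 2-10 can be checked with a one-line hash function. This is a minimal sketch; the name `modulo_division_hash` is illustrative, and keys with a leading zero such as 045128 are written as plain integers because Python integer literals cannot start with 0.

```python
LIST_SIZE = 307   # prime list size used in Figure 2-10


def modulo_division_hash(key, list_size=LIST_SIZE):
    """Address = key modulo list size."""
    return key % list_size


for key in (121267, 45128, 379452, 378845, 160252):
    print(key, modulo_division_hash(key))
# 121267 2
# 45128 306
# 379452 0
# 378845 7
# 160252 305
```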
Digit extraction method
Selected digits are extracted from the key and used as the address
Example: select the first, third, and fourth digits
6-digit employee number   3-digit address
379452                    394
121267                    112
378845                    388
160252                    102
045128                    051
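The digit selection in the table can be sketched in Python as below. The function name `digit_extraction_hash` and its `positions` and `width` parameters are assumptions for the example; positions are counted from 1 as on the slide.

```python
def digit_extraction_hash(key, positions=(1, 3, 4), width=6):
    """Build the address from selected digits of a fixed-width key."""
    digits = str(key).zfill(width)     # restore leading zeros, e.g. 45128 -> "045128"
    return int("".join(digits[p - 1] for p in positions))


print(digit_extraction_hash(379452))   # 394
print(digit_extraction_hash(121267))   # 112
print(digit_extraction_hash(45128))    # 51, i.e. address 051
```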
Midsquare method
The key is squared and the address is selected from the middle of the squared number
Limitation: the size of the key
Example: 4-digit keys
9452 * 9452 = 89340304: address is 3403
Variation: select a portion of the key (here digits 1-3), square it, fill with leading zeros to 6 digits, and select digits 3-5 as the address
key      digits 1-3 squared     address
379452   379 * 379 = 143641     364
121267   121 * 121 = 014641     464
378845   378 * 378 = 142884     288
160252   160 * 160 = 025600     560
045128   045 * 045 = 002025     202
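Both midsquare examples can be reproduced with a short Python sketch. The function names `midsquare_4digit` and `midsquare_hash` are hypothetical; the digit positions follow the slide (middle four digits of an 8-digit square, and digits 3-5 of a 6-digit square).

```python
def midsquare_4digit(key):
    """Square a 4-digit key and take the middle four digits of the 8-digit result."""
    squared = str(key * key).zfill(8)
    return int(squared[2:6])


def midsquare_hash(key):
    """Variation: square the first three digits of a 6-digit key, keep digits 3-5."""
    prefix = int(str(key).zfill(6)[:3])
    squared = str(prefix * prefix).zfill(6)   # fill with zeros to 6 digits
    return int(squared[2:5])


print(midsquare_4digit(9452))    # 3403
print(midsquare_hash(379452))    # 364
print(midsquare_hash(45128))     # 202
```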
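Folding methods: fold shift and fold boundary
Key: 123456789, divided into the parts 123 | 456 | 789
(a) Fold shift: 123 + 456 + 789 = 1368; the leading 1 is discarded, giving 368
(b) Fold boundary: the digits of the left and right parts are reversed, 321 + 456 + 987 = 1764; the leading 1 is discarded, giving 764
Figure 2-11 Hash fold examples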
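The two folding variants of Figure 2-11 can be sketched as follows. The helper `split_parts` and the function names are illustrative; discarding the overflow digit is implemented as a modulo by 10 to the power of the part size, and the key is assumed to split into at least two parts.

```python
def split_parts(key, part_size):
    """Cut the decimal digits of the key into consecutive parts."""
    digits = str(key)
    return [digits[i:i + part_size] for i in range(0, len(digits), part_size)]


def fold_shift(key, part_size=3):
    """Add the parts as they are and discard any overflow digit."""
    total = sum(int(part) for part in split_parts(key, part_size))
    return total % 10 ** part_size


def fold_boundary(key, part_size=3):
    """Reverse the left and right parts, then add and discard any overflow digit."""
    parts = split_parts(key, part_size)
    parts[0] = parts[0][::-1]
    parts[-1] = parts[-1][::-1]
    total = sum(int(part) for part in parts)
    return total % 10 ** part_size


print(fold_shift(123456789))      # 368
print(fold_boundary(123456789))   # 764
```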
Rotation method: used in combination with other methods
Useful when keys are assigned serially
Original key   Rotated key
600101         160010
600102         260010
600103         360010
600104         460010
600105         560010
Figure 2-12 Rotation hashing
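The rotation in Figure 2-12 moves the rightmost digit to the front. A minimal Python sketch, with the function name `rotate_key` and the fixed `width` of six digits as assumptions:

```python
def rotate_key(key, width=6):
    """Rotate the rightmost digit of a fixed-width key to the leftmost position."""
    digits = str(key).zfill(width)
    return int(digits[-1] + digits[:-1])


for key in range(600101, 600106):
    print(key, rotate_key(key))
# 600101 160010
# 600102 260010
# 600103 360010
# 600104 460010
# 600105 560010
```

Serial keys that differ only in the last digit now differ in the first digit, which spreads them out once another method such as folding is applied.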
Pseudorandom method
In this method, the key is used as the seed in a pseudorandom number generator, and the resulting random number is scaled into the possible address range using modulo division
A common random generator is: y = ax + c
For efficiency, the factors a and c should be prime numbers
For example, a = 17, c = 7
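With a = 17, c = 7 and a list size of 307 (the size used in the other examples of this chapter), the method can be sketched as below. The function name `pseudorandom_hash` is illustrative.

```python
def pseudorandom_hash(key, a=17, c=7, list_size=307):
    """y = a * key + c, scaled into the address range with modulo division."""
    return (a * key + c) % list_size


print(pseudorandom_hash(121267))   # 41
print(pseudorandom_hash(45128))    # 297
print(pseudorandom_hash(379452))   # 7
```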
key      hash                                   address
121267   (17 * 121267 + 7) modulo 307 = 41      041
045128   (17 * 045128 + 7) modulo 307 = 297     297
379452   (17 * 379452 + 7) modulo 307 = 7       007
[000]
…
[007] 379452 Marry Dodd
…
[041] 121267 Bryan Devaux
…
[…] 378845 John Carver
…
[297] 045128 Shouli Feldman
…
[…] 160252 Tuan Ngo
…
[306]
Figure: pseudorandom hashing (list size 307)
Hash algorithm
Convert the alphanumeric key into a number by adding the American Standard Code for Information Interchange (ASCII) value of each character to an accumulator.
Rotate the bits in the address to maximize the distribution of the values.
Take the absolute value of the address and map it into the address range.
Hash algorithm
algorithm hash(val key <array>,
               val size <integer>,
               val maxAddr <integer>,
               ref addr <integer>)
This algorithm converts an alphanumeric key of size characters into an integral address.
  Pre    key is a key to be hashed
         size is the number of characters in the key
         maxAddr is the maximum possible address for the list
  Post   addr contains the hashed address
  looper = 0
  addr = 0
  Hash key
  loop (looper < size)
    if (key[looper] not space)
      addr = addr + key[looper]
      rotate addr 12 bits right
    end if
    looper = looper + 1
  end loop
  Test for negative address
  if (addr < 0)
    addr = absolute(addr)
  end if
  addr = addr modulo maxAddr
  return
end hash
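A Python sketch of this hash algorithm is given below. It assumes a 32-bit unsigned accumulator, which is not stated on the slide; with an unsigned word the test for a negative address is unnecessary, so it is omitted. The names `rotate_right` and `hash_key` are illustrative.

```python
WORD_BITS = 32                       # assumed word size for the accumulator
MASK = (1 << WORD_BITS) - 1


def rotate_right(value, bits):
    """Rotate a WORD_BITS-wide unsigned value to the right by the given bit count."""
    bits %= WORD_BITS
    return ((value >> bits) | (value << (WORD_BITS - bits))) & MASK


def hash_key(key, max_addr):
    """Add each non-space character code to the address, rotating 12 bits each time."""
    addr = 0
    for ch in key:
        if ch != " ":
            addr = (addr + ord(ch)) & MASK
            addr = rotate_right(addr, 12)
    return addr % max_addr           # map into the address range


print(hash_key("Sarah Trapp", 307))
```

Spaces are skipped, so "ab" and "a b" hash to the same address under this sketch.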
2-4 Collision resolution
Except for the direct and subtraction methods, none of the hashing methods is a one-to-one mapping
Collisions cannot be avoided
There are several methods for handling collisions:
Collision resolution
  Open addressing
    Linear probe
    Quadratic probe
    Pseudorandom
    Key offset
  Linked lists
  Buckets
Figure 2-13 Collision resolution methods
Several concepts
Load factor
•There must be some empty elements in a list
•load factor = (number of filled elements) / (total number of elements), kept below 75%
Clustering
•The tendency of data to group within the list (unevenly across a hashed list)
•A high degree of clustering increases the number of probes needed to locate an element and reduces the processing efficiency of the list
•There are two types:
  Primary clustering: when data cluster around a home address
  Secondary clustering: when data become grouped along a collision path throughout a list
•Hashing algorithms need to be designed to minimize clustering
Open addressing
Resolves collisions in the prime area (the area that contains all of the home addresses)
Linear probe
Quadratic probe
Double hashing
  Pseudorandom
  Key offset
Linear probe
First insert: 070918 hashes to 1, no collision
Second insert: 166702 hashes to 1, collision; add 1 to the address
[000] 379452 Marry Dodd
[001] 070918 Sarah Trapp
[002] 121267 Bryan Devaux
[…] 166702 Harry Eagle (placed by the linear probe)
…
[…] 378845 John Carver
[008]
…
[305] 160252 Tuan Ngo
[306] 045128 Shouli Feldman
Figure 2-14 Linear probe collision resolution
Linear probe
Variation: add 1, subtract 2, add 3, subtract 4, and so on
Advantage: simple to implement
Disadvantages: first, linear probes tend to produce primary clustering; second, they tend to make the search algorithm more complex
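A minimal Python sketch of insertion with the basic add-1 linear probe follows. It assumes modulo-division hashing, an empty slot marked by `None`, and wrap-around at the end of the list; the name `linear_probe_insert` is illustrative.

```python
def linear_probe_insert(table, key):
    """Insert key using modulo-division hashing and an add-1 linear probe."""
    size = len(table)
    home = key % size
    for step in range(size):
        address = (home + step) % size     # wrap around at the end of the list
        if table[address] is None:
            table[address] = key
            return address
    raise OverflowError("hash table is full")


table = [None] * 307
print(linear_probe_insert(table, 70918))    # 1: no collision
print(linear_probe_insert(table, 166702))   # 2: collision at 1, add 1
```

Synonyms pile up right next to the home address, which is the primary clustering named above.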
Quadratic probe
Used to eliminate primary clustering
The increment is the collision probe number squared: first probe, add 1²; second probe, add 2²; …
The new address is the result modulo the list size.
Disadvantages:
1. The time required to square the probe number.
2. It is not possible to generate a new address for every element in the list.
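The quadratic probe can be sketched in Python as below. The sketch assumes that each increment is added to the most recent collision address (one common reading of the rule above) and that the search gives up after as many probes as there are elements; the name `quadratic_probe_insert` is illustrative.

```python
def quadratic_probe_insert(table, key):
    """Insert key; after the n-th collision the address advances by n squared."""
    size = len(table)
    address = key % size
    probe = 0
    while table[address] is not None:
        probe += 1
        if probe > size:
            # Quadratic probing cannot reach every element, so give up eventually.
            raise OverflowError("no free address found")
        address = (address + probe * probe) % size
    table[address] = key
    return address


table = [None] * 307
print(quadratic_probe_insert(table, 70918))    # 1
print(quadratic_probe_insert(table, 166702))   # 2  (1 + 1*1)
print(quadratic_probe_insert(table, 572556))   # 6  (2 + 2*2)
```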
Pseudorandom collision resolution
A form of double hashing: the address is rehashed
Uses a pseudorandom number to resolve the collision
The collision address is used as a factor in the random number calculation, such as:
New address = 3 * collision address + 5
Figure 2-15 shows this collision resolution applied to the example of Figure 2-14
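A short Python sketch of pseudorandom rehashing with y = 3x + 5 is shown below. It assumes the new address is reduced modulo the list size and that the probe sequence is abandoned after as many attempts as there are elements; the name `pseudorandom_probe_insert` is illustrative.

```python
def pseudorandom_probe_insert(table, key, a=3, c=5):
    """Insert key; on a collision, rehash the collision address with a * x + c."""
    size = len(table)
    address = key % size
    for _ in range(size):
        if table[address] is None:
            table[address] = key
            return address
        address = (a * address + c) % size
    raise OverflowError("no free address found")


table = [None] * 307
print(pseudorandom_probe_insert(table, 70918))    # 1: no collision
print(pseudorandom_probe_insert(table, 166702))   # 8: 3 * 1 + 5
```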
Pseudorandom probe
First insert: 070918 hashes to 1, no collision
Second insert: 166702 hashes to 1, collision; the pseudorandom rehash y = 3x + 5 gives 3 * 1 + 5 = 8
[000] 379452 Marry Dodd
[001] 070918 Sarah Trapp
[002] 121267 Bryan Devaux
…
[007] 378845 John Carver
[008] 166702 Harry Eagle
…
[305] 160252 Tuan Ngo
[306] 045128 Shouli Feldman
Figure 2-15 Pseudorandom collision resolution
Key offset
Another form of double hashing
Produces different collision paths for different keys
In its simplest version, key offset calculates the new address as:
offset = key / listsize (integer quotient)
address = ((offset + old address) modulo listsize)
Example: the key is 166702 and the list size is 307; modulo-division generates an address of 1
This synonym of 070918 produces a collision at 1
Using key offset to calculate the next address:
offset = 166702 / 307 = 543
address = ((543 + 001) modulo 307) = 237
If 237 were also a collision, we would repeat the process:
offset = 166702 / 307 = 543
address = ((543 + 237) modulo 307) = 166
To really see the effect of key offset, we need to calculate several different keys, all hashing to the same home address. Table 2-3 shows three keys that collide at address 001 and their next two collision probe addresses.
Key      Home address   Key offset   Probe 1   Probe 2
166702   1              543          237       166
572556   1              1865         024       047
067234   1              219          220       132
Table 2-3 Key offset
Note: each key resolves its collision at a different address for both the first and second probes
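The rows of Table 2-3 can be recomputed with a few lines of Python. The function name `key_offset_probes` and its return shape `(home, offset, probes)` are assumptions for the example; the formulas are the ones given for key offset.

```python
def key_offset_probes(key, list_size=307, count=2):
    """Return (home address, offset, first `count` probe addresses) for a key."""
    home = key % list_size
    offset = key // list_size              # integer quotient
    probes = []
    address = home
    for _ in range(count):
        address = (offset + address) % list_size
        probes.append(address)
    return home, offset, probes


for key in (166702, 572556, 67234):
    print(key, key_offset_probes(key))
# 166702 (1, 543, [237, 166])
# 572556 (1, 1865, [24, 47])
# 67234 (1, 219, [220, 132])
```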
Linked list resolution
To eliminate the disadvantage of open addressing that
each collision resolution increases the probability of
future collisions
A linked list is an ordered collection of data in which
each element contains the location of the next element
[000] 379452 Marry Dodd
[001] 070918 Sarah Trapp, pointer → 166702 Harry Eagle, pointer → 572556 Chris Wallj
[002] 121267 Bryan Devaux
[003]
…
[008]
…
[305] 160252 Tuan Ngo
[306] 045128 Shouli Feldman
Figure 2-16 Linked list collision resolution
Linked list resolution
Linked list resolution uses a separate area to store
collisions and chains all synonyms together in a linked
list
It uses two storage areas, the prime area and the
overflow area
Each element in the prime area contains an additional
field, a link head pointer
The linked list data can be stored in any order, but the
most common is key sequence
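The prime area with link head pointers and an overflow chain kept in key sequence can be sketched as follows. This is an illustrative layout, not the textbook's code: the class names `Node` and `ChainedTable` and the two parallel lists `prime` and `heads` are assumptions.

```python
class Node:
    """One overflow element: a key and the location of the next element."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedTable:
    def __init__(self, size):
        self.size = size
        self.prime = [None] * size     # prime area: one key per home address
        self.heads = [None] * size     # link head pointer of each prime element

    def insert(self, key):
        home = key % self.size
        if self.prime[home] is None:
            self.prime[home] = key
            return
        # Collision: chain the synonym in the overflow area, in key sequence.
        prev, node = None, self.heads[home]
        while node is not None and node.key < key:
            prev, node = node, node.next
        new_node = Node(key, node)
        if prev is None:
            self.heads[home] = new_node
        else:
            prev.next = new_node

    def search(self, key):
        home = key % self.size
        if self.prime[home] == key:
            return True
        node = self.heads[home]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False


table = ChainedTable(307)
for key in (70918, 572556, 166702):
    table.insert(key)
print(table.search(166702), table.search(67234))   # True False
```

Because synonyms leave the prime area, resolving one collision does not occupy another key's home address.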
Bucket hashing
Keys are hashed to buckets, nodes that accommodate multiple data occurrences; collisions are postponed until the bucket is full
[000] Bucket 0: 379452 Marry Dodd
[001] Bucket 1: 070918 Sarah Trapp, 166702 Harry Eagle, 367173 Ann Georgis
[002] Bucket 2: 121267 Bryan Devaux, 572556 Chris Wallj (linear probe places it here)
…
[307] Bucket 307: 045128 Shouli Feldman
Figure 2-17 Bucket hashing
Two problems and combination approaches
First: it uses significantly more space, because many of the buckets will be empty or partially empty
Second: it does not completely resolve the collision problem
When a bucket is full, the collision is resolved with a linear probe
There are several approaches to resolving collisions, and they are often combined in multiple steps
Example: one large database hashes to a bucket; if the bucket is full, it uses a linear probe and then a linked list overflow area
Summary
Searching is the process of finding the location of a target among a list of objects
There are two basic searching methods for arrays: sequential and binary search
The sequential search is normally used when a list is not sorted. It starts at the beginning of the list and searches until it finds the data or hits the end of the list
One of the variations of the sequential search is the sentinel search. In this method, the condition ending the search is reduced to only one by artificially inserting the target at the end of the list
The second variation of the sequential search is called the probability search. In this method, the list is ordered with the most probable elements at the beginning of the list and the least probable at the end
Summary (continued)
The sequential search can also be used to search a sorted list. In this case, we can terminate the search when the target is less than the current element
If an array is sorted, we can use a more efficient algorithm called the binary search
The binary search algorithm searches the list by first checking the middle element. If the target is not in the middle element, the algorithm eliminates the upper half or the lower half of the list depending on the value of the middle element. The process continues until the target is found or the reduced list length becomes zero
The efficiency of a sequential search is O(n)
The efficiency of a binary search is O(log2 n)
Summary (continued)
In a hashed search, the key, through an algorithmic transformation, determines the location of the data. It is a key-to-address transformation
There are several hashing functions: we discussed direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, and pseudorandom generation
Summary (continued)
In direct hashing, the key is the address without any algorithmic manipulation
In subtraction hashing, the key is transformed to an address by subtracting a fixed number from it
In modulo-division hashing, the key is divided by the list size, which is recommended to be a prime number, and the remainder is used as the address
In digit-extraction hashing, selected digits are extracted from the key and used as an address
In midsquare hashing, the key is squared and the address is selected from the middle of the result
In fold shift hashing, the key is divided into parts whose sizes match the size of the required address. Then the parts are added to obtain the address
Summary (continued)
In fold boundary hashing, the key is divided into parts whose sizes match the size of the required address. Then the left and right parts are reversed and added to the middle part to obtain the address
In rotation hashing, the rightmost digit of the key is rotated to the left to determine an address. However, this method is usually used in combination with other methods
In pseudorandom generation hashing, the key is used as the seed to generate a pseudorandom number. The result is then scaled to obtain the address
Except in the direct and subtraction methods, collisions are unavoidable in hashing. Collisions occur when a new key is hashed to an address that is already occupied
Summary (continued)
Clustering is the tendency of data to build up unevenly across a hashed list.
Primary clustering occurs when data build up around a home address
Secondary clustering occurs when data build up along a collision path in the list
To solve a collision, a collision resolution method is used
Three general methods are used to resolve collisions: open addressing, linked list, and buckets
The open addressing method can be subdivided into linear probe, quadratic probe, pseudorandom rehashing, and key-offset rehashing
Summary (continued)
In the linear probe method, when a collision occurs, the new data are stored in the next available address.
In the quadratic probe method, the increment is the collision probe number squared.
In the pseudorandom rehashing method, we use a random number generator to rehash the address
In the key-offset rehashing method, we use an offset to rehash the address
Summary (continued)
In the linked list technique, we use separate areas to store collisions and chain all synonyms together in a linked list
In bucket hashing, we use a bucket that can accommodate multiple data occurrences
Homework
Using the modulo-division method and linear probing, store the keys shown below in an array with 19 elements. How many collisions occurred? What is the load factor of the list after all keys have been inserted?
224562, 137456, 214562, 140145, 214567, 162145, 144467, 199645, 234534
Repeat the above problem using the digit-extraction method (first, third, and fifth digits) and quadratic probing.