Data Structures (数据结构) Course 2: Searching

Vocabulary
sequential search    顺序查找
element    元素
order    次序
binary search    二分查找
target    目标
algorithm    算法
array    数组
location    位置
object    对象,目标
parameter    参数
index    下标,索引,指针
sentinel    哨兵
probability    概率
key    关键字
hash    散列,杂凑
collision    冲突
cluster    聚集,群集
synonym    同义语,同义词
probe    探测
load factor    装填因子

西南财经大学天府学院

Searching
• Searching is one of the most common and time-consuming operations in computer science.
• To search is to find the location of a target among a list of objects.

Main contents (Chapter 2)
• List searching (two basic search algorithms)
  - Sequential search (including three variations)
  - Binary search
• Hashed list searching: the key, through an algorithmic function, determines the location of the data
  - Collision resolution
• The list search algorithms are discussed using an array structure.

2-1 List searches (work with arrays)
• The algorithm used to search a list depends on the structure of the list.
• The sequential search works on any array and is used when:
  - the list is not ordered
  - the list is small
  - the list is not searched often

Locating data in an unordered list
Target given: 14. Location wanted: 3.

  index:  A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9] A[10] A[11]
  value:    4   21   36   14   62   91    8   22    7   81   77    10

Search concept
Target given: 14. Location wanted: 3.
  index 0: 14 not equal 4
  index 1: 14 not equal 21
  ...
  index 3: 14 equal 14 (target found)

Sequential search algorithms
The algorithm needs to tell the calling algorithm two things:
• Did it find the data it was looking for?
• If it did, at what index were the target data found?
The algorithm requires four parameters:
• the list we are searching
• an index to the last element in the list
• the target
• the address where the found element's index location is to be stored
It returns a Boolean.

Sequential search algorithm
Locate the target in an unordered list.

algorithm seqsearch (val list   <array>,
                     val last   <index>,
                     val target <keytype>,
                     ref locn   <index>)
Pre    list must contain at least one element
       last is index to last element in the list
       target contains the data to be located
       locn is address of index in calling algorithm
Post   if found: matching index stored in locn and found is true
       if not found: last stored in locn and found is false
Return found <boolean>
    looker = 0
    loop (looker < last and target not equal list[looker])
        looker = looker + 1
    end loop
    locn = looker
    if (target equal list[looker])
        found = true
    else
        found = false
    end if
    return found
end seqsearch

Variations on sequential searches
• Sentinel search
• Probability search
• Ordered list search

Sentinel search
Locate the target in an unordered list.

algorithm sentinelsearch (val list   <array>,
                          val last   <index>,
                          val target <keytype>,
                          ref locn   <index>)
Pre    list must contain at least one element
       last is index to last element in the list
       target contains the data to be located
       locn is address of index in calling algorithm
Post   if found: matching index stored in locn and found is true
       if not found: last stored in locn and found is false
Return found <boolean>
    list[last + 1] = target
    looker = 0
    loop (target not equal list[looker])
        looker = looker + 1
    end loop
    if (looker <= last)
        found = true
        locn = looker
    else
        found = false
        locn = last
    end if
    return found
end sentinelsearch

Probability search
Locate the target in an unordered list.

Pre    the same as above
Post   if found: matching index stored in locn, found is true,
       and the element is moved up one position in priority
       if not found: the same as above
Return found <boolean>
    looker = 0
    loop (looker < last and target not equal list[looker])
        looker = looker + 1
    end loop
    if (target equal list[looker])
        found = true
        if (looker > 0)
            temp = list[looker - 1]
            list[looker - 1] = list[looker]
            list[looker] = temp
            looker = looker - 1
        end if
    else
        found = false
    end if
    locn = looker
    return found
end probability search

Ordered list search
Locate the target in a list ordered on target.

Pre    the same as the sequential search
Post   if found: the same as above
       if not found: locn is index of first element > target,
       or locn equals last, and found is false
Return found <boolean>
    if (target <= list[last])
        looker = 0
        loop (target > list[looker])
            looker = looker + 1
        end loop
    else
        looker = last
    end if
    if (target equal list[looker])
        found = true
    else
        found = false
    end if
    locn = looker
    return found

Note:
• It is not necessary to search to the end of the list.
• It is suitable only for small lists.
• It can incorporate the sentinel.

Binary search
• The sequential search is slow for large lists, because it may have to test every element.
  - But it is the only solution if the array is not sorted.
• Binary search (ordered list)
  - For large lists
  - First sort
  - Then search

Binary search method
Suppose L is a sorted list and we are searching for a value X.
1. Compare X to the middle value (M) in L.
2. If X = M, we are done.
3. If X < M, we continue our search, but we can confine it to the first half of L and entirely ignore the second half of L.
4. If X > M, we continue, but confine ourselves to the second half of L.
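The four steps above can be sketched in Python. This is a minimal illustration rather than the lecture's own algorithm: the function name `binary_search` and the choice to return a `(found, locn)` pair instead of using a reference parameter are assumptions made for the sketch.

```python
def binary_search(items, target):
    """Search a sorted list; return (found, locn).

    Assumption for this sketch: when the target is absent, locn is the
    last middle index examined (an element just below or above target).
    """
    first, last = 0, len(items) - 1
    mid = 0
    while first <= last:
        mid = (first + last) // 2
        if target > items[mid]:
            first = mid + 1      # step 4: confine search to the upper half
        elif target < items[mid]:
            last = mid - 1       # step 3: confine search to the lower half
        else:
            return True, mid     # step 2: X = M, we are done
    return False, mid


if __name__ == "__main__":
    data = [4, 7, 8, 10, 14, 21, 22, 36, 62, 77, 81, 91]
    print(binary_search(data, 22))
    print(binary_search(data, 11))
```

Each pass through the loop halves the part of the list still under consideration, which is why the list must be sorted before this search is used.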
Binary search example: target found (target 22 is in the list)

  index:  A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9] A[10] A[11]
  value:    4    7    8   10   14   21   22   36   62   77   81    91

  first  mid  last   comparison
    0     5    11    22 > 21
    6     8    11    22 < 62
    6     6     7    22 = 22 (target found at index 6)

Binary search example: target not found (target 11 is not in the list)

  first  mid  last   comparison
    0     5    11    11 < 21
    0     2     4    11 > 8
    3     3     4    11 > 10
    4     4     4    11 < 14
    4     4     3    first > last: function terminates

Binary search (ordered list)

algorithm binary_search (val list   <array>,
                         val end    <index>,
                         val target <keytype>,
                         ref locn   <index>)
Pre    list is ordered; it must contain at least one element
       end is index to the largest element in the list
       target is the value of the element being sought
       locn is address of index in calling algorithm
Post   found: locn assigned index to target element, found set true
       not found: locn = element below or above target, found set false
Return found <boolean>
    first = 0
    last = end
    loop (first <= last)
        mid = (first + last) / 2
        if (target > list[mid])
            look in upper half
            first = mid + 1
        else if (target < list[mid])
            look in lower half
            last = mid - 1
        else
            found equal: force exit
            first = last + 1
        end if
    end loop
    locn = mid
    if (target equal list[mid])
        found = true
    else
        found = false
    end if
    return found
end binary_search

Analyzing the efficiency
• Sequential search, sentinel search, ordered list search: O(n)
• Binary search: O(log2 n)

Comparison of binary and sequential searches (number of comparisons)

  List size    Binary   Sequential (average)   Sequential (worst case)
  16           4        8                      16
  10,000       14       5,000                  10,000
  1,000,000    20       500,000                1,000,000

2-3 Hashed list searches
• Ideal search: we would know exactly where the data are and go directly there.
• Goal of a hashed search: to find the data with only one test.
• key -> hash function -> location of data
• With an array of data: key -> hash algorithm -> index of array (address in the list)

Figure 2-6 Hash concept

  key      hash   address
  102002   ->     5
  107095   ->     100
  111060   ->     2

  [000]
  [001]           Harry Lee
  [002]  111060   Sarah Trapp
  ...
  [005]  102002   Vu Nguyen
  ...
  [100]  107095   John Adams

Basic concepts
• Hash search: a search in which the key, through an algorithmic function, determines the location of the data.
• We use a hashing algorithm to transform the key into the index that contains the data we need to locate (key-to-address).

Problem
• Synonyms: a set of keys that hash to the same location.
• Collision: occurs when a list contains two or more synonyms.
• Home address: the address produced by the hashing algorithm.
• Prime area: the memory that contains all of the home addresses.
• Collision resolution: when two keys collide at a home address, one of the keys and its data are placed in another location.

Figure 2-7 The collision resolution concept
  1. hash(A): A is placed at [8].
  2. hash(B): B and A collide at 8; collision resolution places B at [16].
  3. hash(C): C and B collide at 16; collision resolution places C at [4].
  Result: [4] C, [8] A, [16] B

Locating an element in a hashed list
• Use the same algorithm that was used to insert it into the list.
• First hash the key and check the home address.
• If the home address contains the element, the search is complete.
• If not, use the collision resolution algorithm to determine the next location, and continue until the element is found or it is determined that it is not in the list.
• Probe: each calculation of an address and test for success.

Hashing methods
• direct
• subtraction
• modulo division
• digit extraction
• midsquare
• folding
• rotation
• pseudorandom generation
Figure 2-8 Basic hashing techniques

Direct method
• The key is the address: one element per key, so there are no synonyms.
• Example 1: total monthly sales by the days of the month.
  - Create an array of 31 accumulators.
  - The accumulation code is:
    dailySales[sale.day] = dailySales[sale.day] + sale.amount;
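The accumulation step above can be sketched in Python. This is only an illustration under stated assumptions: the function name `accumulate_daily_sales` and the `(day, amount)` tuple representation of a sale are hypothetical, and the array is given 32 slots (slot 0 unused) so that day numbers 1 to 31 can be used directly as addresses.

```python
def accumulate_daily_sales(sales):
    """Total sales per day of the month using direct hashing.

    `sales` is an iterable of (day, amount) pairs with 1 <= day <= 31.
    The key (the day) is used directly as the array index.
    """
    daily_sales = [0.0] * 32          # slot 0 is not used (assumption)
    for day, amount in sales:
        if not 1 <= day <= 31:
            raise ValueError(f"day out of range: {day}")
        daily_sales[day] += amount    # the key is the address
    return daily_sales


if __name__ == "__main__":
    totals = accumulate_daily_sales([(1, 10.0), (15, 5.5), (1, 2.5)])
    print(totals[1], totals[15])
```

Because each key maps to its own slot, no two different days can ever collide, which is what makes the direct method collision-free.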
• Example 2: a small company has fewer than 100 employees. Each employee number is between 1 and 100.

Figure 2-9 Direct hashing of employee numbers

  key    hash   address
  005    ->     5
  100    ->     100
  002    ->     2

  [000]  000   (not used)
  [001]  001   Harry Lee
  [002]  002   Sarah Trapp
  ...
  [005]  005   Vu Nguyen
  ...
  [100]  100   John Adams

Subtraction method
• The keys are consecutive but do not start from 1.
• Example: student ID numbers.
Advantages
• The hashing function is very simple.
• No collisions.
Disadvantage
• Only suitable for small lists.

Note:
1. Generally speaking, hashed lists require some empty elements to reduce the number of collisions.
2. The two methods above (direct and subtraction) are ideal, but their use is very limited, for example with keys such as ID card numbers.

Modulo-division method (division remainder)
• This method divides the key by the array size and uses the remainder as the address.
• The hashing algorithm is:
    address = key modulo listsize
• Note: a list size that is a prime number produces fewer collisions.

Figure 2-10 Modulo-division hashing (listsize = 307)

  key      hash   address
  121267   ->     2
  045128   ->     306
  379452   ->     0

  [000]  379452   Marry Dodd
  [001]
  [002]  121267   Bryan Devaux
  ...
  [007]  378845   John Carver
  ...
  [305]  160252   Tuan Ngo
  [306]  045128   Shouli Feldman

Digit extraction method
• Selected digits are extracted from the key and used as the address.
• Example: from a 6-digit employee number, select the first, third, and fourth digits to form a 3-digit address.

  key      address
  379452   394
  121267   112
  378845   388
  160252   102
  045128   051

Midsquare method
• The key is squared and the address is selected from the middle of the squared number.
• Limitation: the size of the key.
• Example with 4-digit keys: 9452 * 9452 = 89340304, so the address is 3403.
• Variation: square only a portion of the key. Select digits 1-3 of the key, square them, pad the result with leading zeros to 6 digits, and select digits 3-5 as the address.

  key      digits 1-3 squared     address (digits 3-5)
  379452   379 * 379 = 143641     364
  121267   121 * 121 = 014641     464
  378845   378 * 378 = 142884     288
  160252   160 * 160 = 025600     560
  045128   045 * 045 = 002025     202

Folding methods: fold shift and fold boundary
Key: 123456789, divided into 123 | 456 | 789.
(a) Fold shift: 123 + 456 + 789 = 1368. The leading 1 is discarded; the address is 368.
(b) Fold boundary: the digits of the left and right parts are reversed: 321 + 456 + 987 = 1764. The leading 1 is discarded; the address is 764.
Figure 2-11 Hash fold examples

Rotation method
• Usually incorporated with other methods.
• Useful when keys are assigned serially.

  original key   rotated key
  600101         160010
  600102         260010
  600103         360010
  600104         460010
  600105         560010
Figure 2-12 Rotation hashing

Pseudorandom method
• The key is used as the seed in a pseudorandom number generator; the resulting random number is scaled into the possible address range using modulo division.
• A common random number generator is: y = ax + c
• For efficiency, the factors a and c should be prime numbers, for example a = 17, c = 7.

Pseudorandom hashing example (a = 17, c = 7, listsize = 307)

  (17 * 121267 + 7) modulo 307 = 41
  (17 * 045128 + 7) modulo 307 = 297
  (17 * 379452 + 7) modulo 307 = 7

  [007]  379452   Marry Dodd
  ...
  [041]  121267   Bryan Devaux
  ...
  [297]  045128   Shouli Feldman
  ...
  The figure also contains 378845 John Carver and 160252 Tuan Ngo.

Hash algorithm
• Convert the alphanumeric key into a number by adding the American Standard Code for Information Interchange (ASCII) value of each character to an accumulator.
• Rotate the bits in the address to maximize the distribution of the values.
• Take the absolute value of the address and map it into the address range.

algorithm Hash (val key     <array>,
                val size    <integer>,
                val maxAddr <integer>,
                ref addr    <integer>)
This algorithm converts an alphanumeric key of size characters into an integral address.
Pre    key is a key to be hashed
       size is the number of characters in the key
       maxAddr is the maximum possible address for the list
Post   addr contains the hashed address
    looper = 0
    addr = 0
    Hash key
    loop (looper < size)
        if (key[looper] not space)
            addr = addr + key[looper]
            rotate addr 12 bits right
        end if
        looper = looper + 1
    end loop
    Test for negative address
    if (addr < 0)
        addr = absolute(addr)
    end if
    addr = addr modulo maxAddr
    return
end Hash

2-4 Collision resolution
• Except for the direct and subtraction methods, none of the hashing methods is a one-to-one mapping.
• Collisions therefore cannot be avoided.
• There are several methods for handling collisions.

Collision resolution methods
• Open addressing
  - Linear probe
  - Quadratic probe
  - Pseudorandom
  - Key offset
• Linked lists
• Buckets
Figure 2-13 Collision resolution methods

Several concepts
Load factor
• There must be some empty elements in a list.
• load factor = (number of filled elements) / (total number of elements), which should be < 75%
Clustering
• The tendency of data to group within the list (to build up unevenly across a hashed list).
• A high degree of clustering increases the number of probes needed to locate an element and reduces the processing efficiency of the list.
• There are two types:
  - Primary clustering: data cluster around a home address.
  - Secondary clustering: data become grouped along a collision path throughout a list.
• Hashing algorithms need to be designed to minimize clustering.

Open addressing
Resolves collisions in the prime area (the area that contains all of the home addresses).
• Linear probe
• Quadratic probe
• Double hashing
  - Pseudorandom
  - Key offset

Linear probe
• First insert: 070918 hashes to 1, no collision.
• Second insert: 166702 hashes to 1, collision; add 1.

Figure 2-14 Linear probe collision resolution
  Entries shown in the first part of the table ([000] to [008]), in order:
    379452   Marry Dodd
    070918   Sarah Trapp
    121267   Bryan Devaux
    166702   Harry Eagle
    378845   John Carver
  ...
  [305]  160252   Tuan Ngo
  [306]  045128   Shouli Feldman

Linear probe
• Variation: add 1, subtract 2, add 3, subtract 4, ...
• Advantage: simple to implement.
• Disadvantages: first, it tends to produce primary clustering; second, it tends to make the search algorithm more complex.

Quadratic probe
• Used to eliminate primary clustering.
• The increment is the collision probe number squared: on the first probe add 1^2, on the second probe add 2^2, and so on.
• The new address is taken modulo the list size.
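The quadratic increment rule can be sketched in Python. This is a minimal sketch, not the lecture's algorithm: the function name `quadratic_probe_addresses` is hypothetical, and the sketch assumes each increment is added to the previous collision address (adding it to the home address instead is another common reading of the rule).

```python
def quadratic_probe_addresses(home, list_size, max_probes):
    """Return the sequence of addresses tried by a quadratic probe.

    Assumption: the increment (probe number squared) is added to the
    previous collision address, and the result wraps modulo list_size.
    """
    address = home
    addresses = []
    for probe in range(1, max_probes + 1):
        address = (address + probe * probe) % list_size
        addresses.append(address)
    return addresses


if __name__ == "__main__":
    # Probe sequence starting from home address 1 in a list of 307 elements.
    print(quadratic_probe_addresses(1, 307, 4))
```

The modulo step is what keeps every probe inside the list, however large the squared increment becomes.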
Disadvantages:
1. The time required to square the probe number.
2. It is not possible to generate a new address for every element in the list.

Pseudorandom collision resolution
• A form of double hashing: the address is rehashed.
• Uses a pseudorandom number to resolve the collision.
• The collision address is used as a factor in the random number calculation, such as:
    new address = 3 * collision address + 5
• Figure 2-15 shows this method resolving the collision from Figure 2-14.

Pseudorandom probe
• First insert: 070918 hashes to 1, no collision.
• Second insert: 166702 hashes to 1, collision.
• Pseudorandom: y = 3x + 5, so the new address is 3 * 1 + 5 = 8.

Figure 2-15 Pseudorandom collision resolution
  Entries shown in the first part of the table ([000] to [008]), in order:
    379452   Marry Dodd
    070918   Sarah Trapp
    121267   Bryan Devaux
    378845   John Carver
    166702   Harry Eagle
  ...
  [305]  160252   Tuan Ngo
  [306]  045128   Shouli Feldman

Key offset
• Another form of double hashing.
• Produces different collision paths for different keys.
• In its simplest version, key offset calculates the new address as:
    offset = key / listsize
    address = ((offset + old address) modulo listsize)

Example: the key is 166702 and the list size is 307. Modulo division generates an address of 1.
This synonym of 070918 produces a collision at 1.
Using key offset to calculate the next address:
    offset = 166702 / 307 = 543
    address = ((543 + 001) modulo 307) = 237
If 237 were also a collision, repeat the process:
    offset = 166702 / 307 = 543
    address = ((543 + 237) modulo 307) = 166

To really see the effect of key offset, we need to calculate several different keys, all hashing to the same home address.
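The key-offset calculation can be sketched in Python for several keys at once. This is an illustrative sketch only: the function name `key_offset_probes` is hypothetical, and the division is assumed to be integer division, as the worked example (166702 / 307 = 543) suggests.

```python
LIST_SIZE = 307  # list size used in the lecture's example


def key_offset_probes(key, list_size=LIST_SIZE, count=2):
    """Return (home address, offset, list of probe addresses) for a key.

    Assumes modulo-division hashing for the home address and integer
    division for the offset.
    """
    home = key % list_size
    offset = key // list_size
    address = home
    probes = []
    for _ in range(count):
        address = (offset + address) % list_size
        probes.append(address)
    return home, offset, probes


if __name__ == "__main__":
    # Three keys that share the same home address (written without
    # leading zeros, since 067234 is not a valid Python integer literal).
    for key in (166702, 572556, 67234):
        print(key, key_offset_probes(key))
```

Because the offset depends on the key itself, keys that collide at the same home address follow different collision paths afterwards.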
Table 2-3 shows three keys that collide at address 001, together with their next two collision probe addresses.

  Key      Home address   Key offset   Probe 1   Probe 2
  166702   1              543          237       166
  572556   1              1865         024       047
  067234   1              219          220       132
Table 2-3 Key offset

Note: each key resolves its collision at a different address for both the first and second probes.

Linked list resolution
• Eliminates a disadvantage of open addressing: each collision resolution increases the probability of future collisions.
• A linked list is an ordered collection of data in which each element contains the location of the next element.

Figure 2-16 Linked list collision resolution

  [000]  379452   Marry Dodd
  [001]  070918   Sarah Trapp     pointer -> 166702 Harry Eagle -> 572556 Chris Wallj
  [002]  121267   Bryan Devaux
  [003]
  ...
  [305]  160252   Tuan Ngo
  [306]  045128   Shouli Feldman

Linked list resolution
• Linked list resolution uses a separate area to store collisions and chains all synonyms together in a linked list.
• It uses two storage areas, the prime area and the overflow area.
• Each element in the prime area contains an additional field, a link head pointer.
• The linked list data can be stored in any order, but the most common is key sequence.

Bucket hashing
• Buckets are nodes that accommodate multiple data occurrences; collisions are postponed until the bucket is full.

Figure 2-17 Bucket hashing

  [000]  Bucket 0     379452 Marry Dodd
  [001]  Bucket 1     070918 Sarah Trapp; 166702 Harry Eagle; 367173 Ann Georgis
  [002]  Bucket 2     121267 Bryan Devaux; 572556 Chris Wallj (the linear probe places it here)
  ...
  [307]  Bucket 307   045128 Shouli Feldman

Two problems and combination approaches
• First: bucket hashing uses significantly more space, because many of the buckets will be empty or partially empty.
• Second: it does not completely resolve the collision problem. When a bucket is full, the collision is resolved with a linear probe.
• There are several approaches to resolving collisions, and an implementation often uses multiple steps.
• Example: one large database hashes to a bucket; if the bucket is full, it uses a linear probe, and then a linked list overflow area.

2-5 Summary
• Searching is the process of finding the location of a target among a list of objects.
• There are two basic searching methods for arrays: sequential and binary search.
• The sequential search is normally used when a list is not sorted. It starts at the beginning of the list and searches until it finds the data or hits the end of the list.
• One of the variations of the sequential search is the sentinel search. In this method, the condition ending the search is reduced to only one by artificially inserting the target at the end of the list.
• The second variation of the sequential search is called the probability search. In this method, the list is ordered with the most probable elements at the beginning of the list and the least probable at the end.
• The sequential search can also be used to search a sorted list. In this case, we can terminate the search when the target is less than the current element.
• If an array is sorted, we can use a more efficient algorithm called the binary search.
• The binary search algorithm searches the list by first checking the middle element.
  If the target is not the middle element, the algorithm eliminates the upper half or the lower half of the list, depending on the value of the middle element. The process continues until the target is found or the reduced list length becomes zero.
• The efficiency of a sequential search is O(n).
• The efficiency of a binary search is O(log2 n).
• In a hashed search, the key, through an algorithmic transformation, determines the location of the data. It is a key-to-address transformation.
• There are several hashing functions. We discussed direct, subtraction, modulo division, digit extraction, midsquare, folding, rotation, and pseudorandom generation.
• In direct hashing, the key is the address without any algorithmic manipulation.
• In subtraction hashing, the key is transformed to an address by subtracting a fixed number from it.
• In modulo-division hashing, the key is divided by the list size (recommended to be a prime number) and the remainder is used as the address.
• In digit-extraction hashing, selected digits are extracted from the key and used as an address.
• In midsquare hashing, the key is squared and the address is selected from the middle of the result.
• In fold shift hashing, the key is divided into parts whose sizes match the size of the required address. The parts are then added to obtain the address.
• In fold boundary hashing, the key is divided into parts whose sizes match the size of the required address. The left and right parts are then reversed and added to the middle part to obtain the address.
• In rotation hashing, the rightmost digit of the key is rotated to the left to determine an address. However, this method is usually used in combination with other methods.
• In pseudorandom generation hashing, the key is used as the seed to generate a pseudorandom number. The result is then scaled to obtain the address.
• Except in the direct and subtraction methods, collisions are unavoidable in hashing.
• A collision occurs when a new key is hashed to an address that is already occupied.
• Clustering is the tendency of data to build up unevenly across a hashed list.
• Primary clustering occurs when data build up around a home address.
• Secondary clustering occurs when data build up along a collision path in the list.
• To resolve a collision, a collision resolution method is used.
• Three general methods are used to resolve collisions: open addressing, linked lists, and buckets.
• The open addressing method can be subdivided into linear probe, quadratic probe, pseudorandom rehashing, and key-offset rehashing.
• In the linear probe method, when a collision occurs, the new data are stored in the next available address.
• In the quadratic probe method, the increment is the collision probe number squared.
• In the pseudorandom rehashing method, we use a random number generator to rehash the address.
• In the key-offset rehashing method, we use an offset to rehash the address.
• In the linked list technique, we use a separate area to store collisions and chain all synonyms together in a linked list.
• In bucket hashing, we use a bucket that can accommodate multiple data occurrences.

Homework
1. Using the modulo-division method and linear probing, store the keys shown below in an array with 19 elements. How many collisions occurred? What is the load factor of the list after all keys have been inserted?
   224562, 137456, 214562, 140145, 214567, 162145, 144467, 199645, 234534
2. Repeat the problem above using the digit-extraction method (first, third, and fifth digits) and quadratic probing.
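As a possible starting point for checking this kind of exercise by hand, modulo-division hashing with a linear probe can be sketched in Python. The sketch uses keys from the lecture's examples with a list size of 307, not the homework keys. The names `insert_linear_probe` and `load_factor` are hypothetical, and counting one collision for every occupied slot encountered is an assumption about how collisions are tallied.

```python
def insert_linear_probe(table, key):
    """Insert key using modulo-division hashing and a linear probe.

    Returns (address, collisions). Assumption: every occupied slot
    encountered counts as one collision; the probe adds 1 and wraps.
    """
    size = len(table)
    address = key % size               # home address
    collisions = 0
    while table[address] is not None:
        collisions += 1
        if collisions >= size:
            raise RuntimeError("table is full")
        address = (address + 1) % size
    table[address] = key
    return address, collisions


def load_factor(table):
    """Number of filled elements divided by total number of elements."""
    filled = sum(1 for slot in table if slot is not None)
    return filled / len(table)


if __name__ == "__main__":
    table = [None] * 307
    for key in (379452, 70918, 121267, 166702):
        print(key, insert_linear_probe(table, key))
    print(load_factor(table))
```

Keys are written without leading zeros because a literal such as 070918 is not valid in Python.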