Introduction to Programming with Matlab
Weeks 11-12
© 2003 J. Michael Fitzpatrick and John D. Crocetti

Searching, Sorting, and Files

Prerequisites: Familiarity with concepts from previous Lab/Homework assignments.

Objective concepts:
General (i.e., not specific to Matlab):
o searching methods: sequential search and binary search
o sorting methods: selection sort
o algorithmic complexity: worst-case analysis, order notation
o files: binary versus text, opening and closing, file handle, permissions
Matlab:
o sort() and sortrows()
o file manipulation: file handle = "file id number", exist(), fopen(), fgetl(), fgets(), fprintf(), permissions

Reference(s): Matlab Programming for Engineers by Chapman (your textbook)
Chapter.Sections: 5.2 (Example 5.1 only), 8.3, 8.4-8.4.2, 8.6-8.6.6, 8.8.1

The concepts concerning files are well covered in the chapters and sections of your textbook listed above. The concepts associated with searching and sorting are covered below.

Searching

Searching is important in many areas of computer science. Many large computer systems include databases that must be searched when an inquiry is made. For example, when you phone a credit card company to inquire about your bill, you will be asked for your card number. In a few seconds the company representative will be looking at your data. Your data and that of millions of other customers are all stored in a database, and your information is found by searching among all those millions of card numbers for your unique number.

Sequential search

The simplest search method is the sequential search, also known as the linear search. In this method the target value being sought is compared with the first member of the list, then with the second, and so on, until either the target is found or the end of the list is reached, whichever comes first. Here is a function that carries out the sequential search:

function index = sequential_search(array, target)
Found = 0;
first = 1;
last = length(array);
for ii = first:last
    if target == array(ii)   % compare target with each element in turn
        Found = 1;
        break;               % stop at the first match
    end
end
index = ii;
if ~Found
    index = -1;              % flag value: target not in the list
end

This function returns -1, as a flag indicating failure, if the target value is not in the list. If the target is in the list, it returns the first position at which it finds the target. So, for example, when searching for the target 17 when array = [45 23 17 17 -2 100 34], the number 17 would be found at index 3 after 45, 23, and 17 (the first one) had been compared with it. When searching for a number that is not on the list, 82 for example, every number will have to be checked, requiring 7 comparisons. These few comparisons will not take long, but if there are a million numbers on the list, then about half a million comparisons will be required on average when the number is on the list, and one million comparisons will be required when it is not. With such large lists the time becomes prohibitive. Faster searching methods are available, but only if the list is sorted. For an unsorted list, the sequential search is the only possible choice.
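As a quick check, here is a hypothetical command-line session (assuming the function has been saved as sequential_search.m on the Matlab path) that reproduces the example just described:

>> array = [45 23 17 17 -2 100 34];
>> sequential_search(array, 17)
ans =
     3
>> sequential_search(array, 82)
ans =
    -1

The first call stops at the first occurrence of 17; the second checks all 7 elements before returning the failure flag -1.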
Binary search

If a list has been sorted, searches can be carried out much more quickly. The search method to use in this case is the binary search. It is more complicated than the sequential search, but the complication is well worth it. The binary search requires only 40 comparisons to find a number on a list of one million numbers.

The method works as follows: First the number in the middle of the list is compared with the target. If the target is less than that number, then the numbers that come after the middle number are no longer searched; the search can now be confined to the first half. Half the numbers have been eliminated with just one comparison! If the target is greater than the middle number, then the search can be confined to the second half. If the target equals the middle number, then the search is complete. Thus, regardless of what happens on this first comparison, less than half of the list remains to be searched.

Now the number in the middle of the remaining list is compared with the target. Again, there are three possibilities: the second half of the list is eliminated, the first half is eliminated, or the number is found at the middle. After this second comparison, less than one-fourth of the list remains to be searched. This technique of dividing the list in half, and in half again, and again, proceeds until there is just one number left. If this number matches the target, its index is the answer. If not, then the number is not in the list. As in the case of the sequential search above, we can return a -1 as a flag to indicate that the number was not found. This search method is called the "binary" search because the list is repeatedly divided into two parts. (Actually it is divided into two big parts and one very small one. The very small part is the single number in the middle.) Here is a function that carries out the binary search:

function index = binary_search(array, target)
Found = 0;
first = 1;
last = length(array);
while first <= last & ~Found
    mid = fix((first + last)/2);   % index of the middle element
    if target < array(mid)
        last = mid - 1;            % eliminate the second half
    elseif target > array(mid)
        first = mid + 1;           % eliminate the first half
    else
        Found = 1;                 % target found at mid
    end
end
if Found
    index = mid;
else
    index = -1;                    % flag value: target not in the list
end

In this function the variables first and last are set equal to the first and last indexes, respectively, of the list. The dividing of the list into smaller and smaller pieces is accomplished by changing one of these two values at a time, either moving first closer to the end or last closer to the beginning. The index of the element in the middle of the list is assigned to mid by taking the average of first and last. If first is odd and last is even, or vice versa, then that average will not be an integer. Since mid must be an integer in order to be an index, the fix() function is used to discard the fractional part (which is always 0.5), if there is one. All this work is done in a while loop. That loop ends when the target is found, because the value of Found is then changed to 1 (meaning true). If the target value is not in the list, then the loop will still halt, because the values of first and last eventually become out of order (first > last). As with sequential_search above, in that case the number -1 is returned as a flag to mean that the number was not found.
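A similar check for the binary search (again a hypothetical session, with binary_search.m on the Matlab path). Remember that the list must already be sorted:

>> array = [-2 17 17 23 34 45 100];   % the earlier example list, now sorted
>> binary_search(array, 23)
ans =
     4
>> binary_search(array, 82)
ans =
    -1

The first call succeeds on the very first comparison, because 23 happens to sit at the middle position (index 4). The second call narrows the interval three times and then halts with first > last.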
Sorting

Obviously, large lists should be sorted to make searches feasible. Sorting is another important task in computer applications and is carefully studied in computer science. It is important because of the huge savings in search time possible when using the binary search instead of the sequential search. Many sorting methods have been developed, some of which are quite complicated (far more complicated than the binary search algorithm given above, for example). The benefit in reduced sorting time is, however, worth the programming effort. Students who choose computer science or computer engineering as a major spend a good deal of time mastering the concepts involved in efficient sorting algorithms.

An algorithm that is not very fast, but is simple enough to be readily understood, is the selection sort. This sort is well described by your textbook as Example 5.2 in Section 5.2, where Chapman provides a function called ssort() that is used as follows: y = ssort(x), which causes y to be the sorted version of the vector x. Among the faster sorts are the heap sort, the merge sort, and "quicksort". Matlab provides an efficient sort, called sort(). It is used as follows: y = sort(x), where y is the sorted version of x.

Algorithmic Complexity

When a computer scientist says that an algorithm is fast, he or she means something very specific. A "fast" algorithm is not simply an algorithm that runs fast. One problem with that definition is that the speed at which an algorithm runs depends on the hardware it runs on. Since computers differ, and since computers improve every year, it is meaningless to measure the speed of an algorithm by timing it. Furthermore, the relative speeds of two algorithms running on the same machine may depend on the size of the data set they are working on. For example, for a list of 4 numbers, the sequential search may take less time than the binary search, while for larger lists the binary search will beat the sequential search (and beat it soundly).

Worst-case analysis

The more important aspect of an algorithm is the way in which the number of operations that it must perform grows as the size of the data set grows. That number may depend on the data involved. The best case for the sequential search is when the first item on the list has the target value. In designing algorithms, however, the worst case is usually of most concern. This makes sense for the credit-card example above, where we would want to minimize the longest time that a customer would have to wait. Therefore, most algorithms are subjected to a worst-case analysis.

For example, an analysis of the sequential search algorithm above reveals that the required number of comparisons in the worst case, which is when the target value is not in the list (or equals only the last item on the list), is simply equal to the number N of items on the list. While comparisons are not the only work done by the algorithm, the number of comparisons is a good measure of the algorithm's work because, in the general form of the algorithm, comparing a target to an item on the list is the most time-consuming step, and also because the number of other required operations is approximately proportional to the number of comparisons. For the binary search, the number of comparisons in the worst case, which occurs when the target is greater than all items on the list (or is equal only to the last item on the list), is approximately 2*log2(N+1), where log2(x) is the power to which 2 must be raised to get x. The quantity log2(N+1) is usually not an integer. If it is not, the exact number of required comparisons is obtained by rounding log2(N+1) up to the next higher integer before doubling, which can be written in Matlab as 2*ceil(log2(N+1)). It is fairly easy to see that this is indeed the number of comparisons required by considering some specific worst-case examples.
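To make these counts concrete, here is a minimal sketch (the sample sizes are chosen arbitrarily) that evaluates both worst-case counts:

N = [7 100 1000 1e6];              % sample list sizes
worst_seq = N                      % sequential search: N comparisons
worst_bin = 2*ceil(log2(N + 1))    % binary search: 2*ceil(log2(N+1)) comparisons

For N = 7 the binary-search count is 6, and for N = one million it is 40, the figure quoted in the Searching section above.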
The worst-case time necessary to complete these algorithms on a list of length N has the following two forms:

sequential: a + b*N
binary:     a + b*ceil(log2(N+1))

where a and b represent constants that will be different for the two algorithms (i.e., two different values for a and two different values for b). The value of a represents the start-up time for the algorithm (including the time for the function to read the values of its arguments, to define its local variables, to initialize them, etc.). The value of b represents the amount of time required per comparison in the sequential search, and twice the amount of time required per comparison in the binary search. The plot below shows a sample behavior for these functions when a = 100 and b = 5 for the sequential search and a = 200 and b = 10 for the binary search:

[Plot: worst-case time versus list size N for the sequential search (blue straight line) and the binary search (red curve), using the constants given above.]

Note that the binary search (these plots are for fictitious implementations) actually requires more time for small lists (fewer than about 30 items), but after the size of the list has grown to 100, the sequential search takes over twice as long. The advantage of the binary search grows as N grows. For N = 1000, the ratio is 8 to 1, and for N = one million the ratio is over 6000 to one!
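The two curves are easy to regenerate. Here is a minimal sketch using the constants given above (the plotting range and labels are illustrative choices):

N = 1:200;
t_seq = 100 + 5*N;                    % sequential search: a + b*N
t_bin = 200 + 10*ceil(log2(N + 1));   % binary search: a + b*ceil(log2(N+1))
plot(N, t_seq, 'b', N, t_bin, 'r');
xlabel('number of items N');
ylabel('worst-case time (arbitrary units)');
legend('sequential search', 'binary search');

With these constants the two curves meet at N = 30, matching the crossover described above.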
Order Notation

The important difference between the two plots above lies in the difference between their general shapes. Both plots increase monotonically with the number of items, but the red plot (the binary search) has a different shape: it curves downward. Because of that downward curve, it will eventually go below the straight-line blue plot (the sequential search). There are three things that differ between the two formulas above: each formula has different values of a and of b, and one has ceil(log2(N+1)) where the other has simply N. Of these, the difference that determines the shape of the curve is the last one: the dependence on N. That is so because, for any values of a and b in either or both formulas, the second formula will produce a plot that curves downward and hence will cross the straight line for sufficiently large N. That means that for a big enough problem the binary search will win, even if its a and b are bigger than the a and b for the sequential search (as they are in the example above).

Furthermore, as the plot below shows, there is little importance to either the ceil() or the +1 in ceil(log2(N+1)). When we omit them, we are left with log2(N), and the resultant plot (the dashed line) is almost the same. Finally, what is important about log2(N) is the fact that we are taking the log of N, regardless of the base. As the plot below shows, if we were to replace log2 with log10 (log base 10), the general shape of the plot would be the same. It is curved downward, so it will cross the plot of the sequential search. It crosses the plot of the sequential search at a different place, but the important thing is not where it crosses, but the fact that it must cross somewhere.

[Plot: the binary-search timing curve together with curves for log2(N) (dashed) and log10(N), all of which have the same downward-curving shape.]

Thus, the important aspects of the worst-case time behavior of the binary search can be described simply by saying that the plot of the search time versus N has the same shape as log(N) for any base. Likewise, the important aspects of the worst-case time behavior of the sequential search are captured by saying that the plot of the search time versus N has the same shape as N. These statements are said more formally as follows: The worst-case time complexity of the sequential search is "order N". The worst-case time complexity of the binary search is "order log N". The phrase "order N" is written O(N). The phrase "order log(N)" is written O(log N). Usually a computer scientist will simply write, "the binary search is O(log N)," leaving out the words "time" and "complexity", which are to be understood. (Sometimes we are interested in the increase in memory required by an algorithm as N increases, as opposed to the increase in time. In that case, we measure space as a function of N and determine the "space complexity".) This notation is called "Order notation", "Big-O notation", or sometimes "O notation".

The selection sort is O(N^2). The best sorting algorithms, Matlab's sort() for example, are O(N log N). We have now discussed four complexities: O(log N), O(N), O(N log N), and O(N^2). As the plots below show, they are listed here from fastest to slowest. The relative sizes are so different that in order to show them all on the same plot, we have multiplied the first three by 10,000, 500, and 100, respectively.

[Plot: the four complexities O(log N), O(N), O(N log N), and O(N^2) versus N, with the first three scaled by 10,000, 500, and 100, respectively, so that all four are visible on one plot.]

Dominance

It is clear from the discussion above that the value of a in the formulas is unimportant. That is because the second term dominates a for very large values of N. A similar thing happens for formulas like this: a*log(N) + b*N. Here, the second term is the only important one, because N dominates log(N). The dominance of N over log(N) can be roughly stated this way: the ratio N/log(N) can be made as large as you want, provided you make N large enough. In terms of algorithms, that means that if an algorithm's time behavior is of the form a*log(N) + b*N, then the algorithm is O(N), not O(log N + N). The dominant term determines the order of the algorithm, because the other terms become insignificant for very large N.
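This dominance is easy to see numerically. A minimal sketch (sample sizes chosen arbitrarily):

N = 10.^(1:6);          % N = 10, 100, ..., 1000000
ratio = N ./ log2(N)    % grows without bound as N grows

The ratio is about 3 at N = 10 but over 50,000 at N = one million; no constants a and b can keep the b*N term from eventually overwhelming the a*log(N) term.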