Lab/Homework 2 - Electrical Engineering and Computer Science

Introduction to Programming with Matlab
Weeks 11-12
© 2003 J. Michael Fitzpatrick and John D. Crocetti
Searching, Sorting, and Files
Prerequisites: Familiarity with concepts from previous Lab/Homework assignments.
Objective concepts:
• General (i.e., not specific to Matlab):
  o searching methods: sequential search and binary search
  o sorting methods: selection sort
  o algorithmic complexity
    - worst-case analysis
    - order notation
  o files: binary versus text, opening and closing, file handle, permissions
• Matlab:
  o sort() and sortrows()
  o file manipulation: file handle = "file id number", exist(), fopen(), fgetl(), fgets(), fprintf(), permissions.
Reference(s):
Matlab Programming for Engineers by Chapman (your textbook)
Chapter.Sections: 5.2 (Example 5.1 only), 8.3, 8.4-8.4.2, 8.6-8.6.3, 8.6.4, 8.6.5, 8.6.6,
8.8.1
The concepts concerning files are well covered in the chapters and sections of your
textbook listed above. The concepts associated with searching and sorting are covered
below.
Searching
Searching is important in many areas of computer science. Many large computer systems
include databases that must be searched when an inquiry is made. For example, when you
phone a credit card company to inquire about your bill, you will be asked for your card
number. In a few seconds the company representative will be looking at your data. Your
data and that of millions of other customers are all stored in a database. Your information
is found by searching among all those millions of card numbers for your unique number.
Finding that number is accomplished by searching for it in the database.
Sequential search
The simplest search method is the sequential search method, also known as the linear
search method. In this method the target value that is being sought is compared with the
first member of the database, then with the second, etc., until either the number is found,
or the end of the list is reached, whichever comes first. Here is a function that carries out
the sequential search:
function index = sequential_search(array, target)
index = -1;                 % flag value returned if the target is not found
first = 1;
last = length(array);
for ii = first:last
    if target == array(ii)
        index = ii;         % first position at which the target occurs
        break;
    end
end
This function returns -1, as a flag indicating failure, if the target value is not in the list. If
the target is in the list, it returns the first position at which it finds the target. So, for
example, when searching for the target 17 when array = [ 45 23 17 17 -2 100 34 ], the
number 17 would be found at index 3 after 45, 23, and 17 (the first one) had been
compared with it. When searching for a number that is not on the list, 82, for example,
every number will have to be checked, requiring 7 comparisons. These few comparisons
will not take long, but if there are a million numbers on the list, about half a million
comparisons will be required on average when the number is on the list; when it is not
there, one million comparisons are required. With such large lists the time becomes
prohibitive. Faster searching methods are available, but only if the list is sorted. For an
unsorted list, the sequential search is the only possible choice.
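The comparison counts quoted above are easy to check. Here is a Python sketch of the same sequential-search idea (Python rather than Matlab so the counts can be verified mechanically); the comparison counter is an addition of my own, not part of the function above.

```python
def sequential_search(array, target):
    """Return (index, comparisons); index is -1 when the target is absent."""
    comparisons = 0
    for ii, value in enumerate(array, start=1):  # 1-based indexing, as in Matlab
        comparisons += 1
        if value == target:
            return ii, comparisons               # first match wins
    return -1, comparisons

# The first 17 is found at index 3 after 3 comparisons; the absent
# value 82 costs all 7 comparisons and returns the flag value -1.
print(sequential_search([45, 23, 17, 17, -2, 100, 34], 17))   # (3, 3)
print(sequential_search([45, 23, 17, 17, -2, 100, 34], 82))   # (-1, 7)
```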
Binary search
If a list has been sorted, searches can be carried out much more quickly. The search
method to use in this case is the binary search. It is more complicated than the sequential
search method, but the complication is well worth it. The binary search requires at most
40 comparisons to find a number on a list of one million numbers. The method works as
follows:
• First the number in the middle of the list is compared with the target. If the
target is less than that number, then the numbers that come after the middle
number are no longer searched. The search can now be confined to the first
half. Half the numbers have been eliminated with just one comparison! If the
target is greater than the middle number, then the search can be confined to the
second half. If the target equals the middle number, then the search is
completed. Thus, regardless of what happens on this first comparison, less than
half of the list is left to search after the first comparison.
• Now the number in the middle of the remaining list is compared with the
target. Again, there are three possibilities. The second half of the list is
eliminated, the first half is eliminated, or the number is found at the middle.
After this second comparison, less than one-fourth of the list remains to be
searched.
• This technique of dividing the list in half, and in half again, and again, etc.,
proceeds until there is just one number left. If this number matches the target,
its index is the answer. If not, then the number is not found. As in the case of
the sequential search above, we can return a -1 as a flag to indicate that the
number was not found.
This search method is called the “binary” search because the list is repeatedly divided
into two parts. (Actually it is divided into two big parts and one very small one. The very
small part is the single number in the middle.)
Here is a function that carries out the binary search:
function index = binary_search(array, target)
Found = 0;
first = 1;
last = length(array);
while first <= last && ~Found
    mid = fix( (first + last) / 2 );
    if target < array(mid)
        last = mid - 1;       % target is in the first half, if anywhere
    elseif target > array(mid)
        first = mid + 1;      % target is in the second half, if anywhere
    else
        Found = 1;            % target found at mid
    end
end
if Found
    index = mid;
else
    index = -1;               % flag value: target not found
end
In this function the variables first and last are set equal to the first and last indexes,
respectively, of the list. The dividing of the list into smaller and smaller pieces is
accomplished by changing one of these two values at a time, either moving first closer
to the end or last closer to the beginning. The index of the element in the middle of the
list is assigned to mid by taking the average of first and last. If first is odd and
last is even, or vice versa, then that average will not be an integer. Since mid must be
an integer in order to be an index, the fix() function is used to discard the fractional
part (which is always 0.5), if there is one.
All this work is done in a while loop. That loop will end when the target is found,
because the value of Found is changed to 1 (meaning true). If the target value is not in
the list, then the loop will still halt when the values of first and last become out of
order (first > last). As with sequential_search above, in that case the number -1
is returned as a flag to mean that the number was not found.
Sorting
Obviously, large lists should be sorted to make searches feasible. Sorting is another
important task in computer applications and is carefully studied in computer science. It is
important because of the huge savings in search time possible when using the binary
search instead of the sequential search. Many sorting methods have been developed,
some of which are quite complicated (far more complicated than the binary search
algorithm given above, for example). The benefit in reduced sorting time is, however,
worth the programming effort. Students who choose computer science or computer
engineering as a major spend a good deal of time mastering the concepts involved in
efficient sorting algorithms.
An algorithm that is not very fast, but is simple enough to be readily understood, is the
selection sort. This sort is well described by your textbook as Example 5.2 in Section
5.2, where Chapman provides a function called ssort() that is used as follows: y =
ssort(x), which causes y to be the sorted version of the vector x. Among the faster sorts
are the heap sort, the merge sort, and "quicksort". Matlab provides an efficient sort, called
sort(). It is used as follows: y = sort(x), where y is the sorted version of x.
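Chapman's ssort() is written in Matlab; as a rough sketch of the selection-sort idea it implements, here is an illustrative Python version (the function name and variable names are my own, not the textbook's):

```python
def selection_sort(x):
    """Return a sorted copy of x, built by repeatedly selecting the minimum."""
    y = list(x)                    # work on a copy, as y = ssort(x) does
    for i in range(len(y) - 1):
        smallest = i               # index of the smallest remaining element
        for j in range(i + 1, len(y)):
            if y[j] < y[smallest]:
                smallest = j
        y[i], y[smallest] = y[smallest], y[i]   # swap it into position i
    return y

print(selection_sort([45, 23, 17, 17, -2, 100, 34]))
# [-2, 17, 17, 23, 34, 45, 100]
```

The two nested loops over the list are what make this sort O(N^2), as discussed in the complexity section below.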
Algorithmic Complexity
When a computer scientist says that an algorithm is fast, he or she means something very
specific. A “fast” algorithm is not simply an algorithm that runs fast. One problem with
that definition is that the speed at which an algorithm runs depends on the hardware that
it runs on. Since computers are different, and since computers improve every year, it is
meaningless to measure the speed of an algorithm by timing it. Furthermore, the relative
speeds of two algorithms that are running on the same machine may depend on the size of
the data set that they are working on. For example, for a list of 4 numbers, the sequential
search may take less time than the binary search, while for larger lists the binary search
will beat the sequential search (and beat it soundly).
Worst-case analysis
The more important aspect of an algorithm is the way in which the number of operations
that it must perform grows as the size of the data set grows. That number may depend on
the data involved. The best case for the sequential search is when the first item on the list
has the target value. In designing algorithms, however, the worst case is usually of most
concern. This makes sense in the example of the search for credit-card information
above, where we would want to minimize the longest time that a customer
would have to wait.
Therefore, most algorithms are subjected to a worst-case analysis. For example, an
analysis of the sequential search algorithm above reveals that the required number of
comparisons in the worst case, which is when the target value is not in the list (or equals
only the last item on the list), is equal simply to the number N of items on the list. While
comparisons are not the only work done by the algorithm, the number of comparisons is a
good measure of the algorithm’s work because, in the general form of the algorithm,
comparing a target to an item on the list is the most time consuming step and also
because the number of other required operations is approximately proportional to the
number of comparisons. For the binary search, the worst case is the one in which the
target is greater than all items on the list (or is equal to only the last item on the list); in
that case the number of comparisons is approximately 2*log2(N+1), where log2(x) is the
power to which 2 must be raised to get x. This number is usually not an integer. If it is
not, then the exact number of required comparisons is found by rounding the logarithm
up to the next integer, which can be written in Matlab as 2*ceil(log2(N+1)). It is fairly
easy to see that this is indeed the number of comparisons required by considering some
specific worst-case examples.
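One way to carry out that check is to simulate the worst case, a target greater than every element, and count the comparisons directly. Here is a Python sketch (the counting function is illustrative, not part of the algorithm above):

```python
import math

def worst_case_comparisons(n):
    """Count comparisons made by the binary search on a list of length n
    when the target exceeds every element (a worst case)."""
    first, last, comparisons = 1, n, 0
    while first <= last:
        mid = (first + last) // 2
        comparisons += 2      # both the < test and the > test execute
        first = mid + 1       # target > array(mid): keep the upper half
    return comparisons

for n in (7, 100, 1000, 10**6):
    assert worst_case_comparisons(n) == 2 * math.ceil(math.log2(n + 1))
print(worst_case_comparisons(10**6))   # 40, as claimed above
```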
The worst-case time necessary to complete these algorithms on a list of length N has the
following two forms,
sequential: a + bN,
binary: a + b*ceil(log2(N+1)),
where the a and b represent constants that will be different for the two algorithms (i.e.,
two different values for a and two different values for b). The value of a represents the
start-up time for the algorithm (including the time for the function to read the values of its
arguments and to define its local variables and to initialize them, etc.). The value of b
represents the amount of time required per comparison in the sequential search and
represents twice the amount of time required per comparison in the binary search. The
plot below shows a sample behavior for these functions when a=100 and b= 5 for the
sequential search and a=200 and b=10 for the binary search:
Note that the binary search (these plots are for fictitious implementations) actually
requires more time for small lists (fewer than about 30 items), but after the size of the list
has grown to 100, the sequential search takes over twice as long. The advantage of
binary search grows as N grows. For N = 1000, the ratio is 17 to 1, and for N = one
million the ratio is over 12,000 to one!
Order Notation
The important difference between the two plots above lies in the difference between their
general shapes. Both plots increase monotonically with the number of items, but the red
plot (the binary search) has a different shape. It curves downward. Because of that
downward curve, it will eventually go below the straight-line blue plot (the sequential
search).
There are three things that differ between the two formulas above. Each formula has
different values of a and of b, and one has ceil(log2(N+1)) where the other has simply N.
Of these, the difference that determines the shape of the curve is the last one---the
dependence on N. That is so because, for any values of a and b in either or both formulas,
the second formula will produce a plot that curves downward and hence will cross the
straight line for sufficiently large N. That means that for a big enough problem the binary
search will win, even if its a and b are bigger than the a and b for the sequential search
(as they are in the example above). Furthermore, as the plot below shows, there is little
importance to either the ceil() or the +1 in ceil(log2(N+1)). When we omit them, we are
left with log2(N), and the resultant plot (the dashed line) is almost the same. Finally, the
importance of log2(N) is the fact that we are taking the log of N, regardless of what the
base is. As the plot below shows, if we were to replace log2 with log10 (log base 10), the
general shape of the plot would be the same. It is curved downward, so it will cross the
plot of the sequential search. It crosses the plot of the sequential search at a different
place, but the important thing is not where it crosses, but the fact that it must cross
somewhere.
Thus, the important aspects of the worst-time behavior of the binary search can be
described simply by saying that the plot of the search time versus N has the same shape
as log(N) for any base. Likewise the important aspects of the worst-time behavior of the
sequential search are captured by saying that the plot of the search time versus N has the
same shape as N. These statements are said more formally as follows: The worst-case
time complexity of the sequential search is "order N". The worst-case time complexity of
the binary search is "order log N". The phrase "order N" is written O(N). The phrase "order
log(N)" is written O(log N). Usually a computer scientist will simply write, "the binary
search is O(log N)", leaving out the words "time" and "complexity", which are to be
understood. (Sometimes we are interested in the increase in memory required by an
algorithm as N increases, as opposed to the increase in time. In that case, we measure
space as a function of N and determine the "space complexity".) This notation is called
"order notation", "big-O notation", or sometimes, "O notation".
The selection sort is O(N^2). The best sorting algorithms, Matlab's sort() for example, are
O(N log N). We have now discussed four complexities: O(log N), O(N), O(N log N), and
O(N^2). As the plots below show, they are listed here from fastest to slowest. The relative
sizes are so different that in order to show them all on the same plot, we have multiplied
the first three by 10,000, 500, and 100, respectively.
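The fastest-to-slowest ordering of the four complexities is easy to confirm numerically; here is a small Python check (the scale factors used in the plot are omitted):

```python
import math

# For list sizes in this range, the four complexities keep their order:
# log N < N < N log N < N^2.
for n in (10, 100, 10**4, 10**6):
    values = [math.log2(n), n, n * math.log2(n), n ** 2]
    assert values == sorted(values)
print("ordering confirmed")
```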
Dominance
It is clear from the discussion above that the value of a in the formulas is unimportant.
That is because the second term dominates a for very large values of N. A similar thing
happens for formulas like this: a*log(N) + b*N. Here, the second term is the only
important one, because N dominates log(N). The dominance of N over log(N) can be
roughly stated this way: the ratio N/log(N) can get as large as you want, provided you
make N large enough. In terms of algorithms, that means that if an algorithm's time
behavior is of the form a*log(N) + b*N, then the algorithm is O(N), not O(log N + N).
The dominant term determines the order of the algorithm because the other terms become
insignificant for very large N.
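The claim that N/log(N) can get as large as you want can be checked numerically; a quick Python sketch:

```python
import math

# N / log2(N) grows without bound: each thousandfold increase in N
# makes the ratio far larger than before.
ratios = [n / math.log2(n) for n in (10, 10**3, 10**6, 10**9)]
print(ratios)
assert all(earlier < later for earlier, later in zip(ratios, ratios[1:]))
```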