Sorting

Introduction
• The objective is to take an unordered set of comparable data
items and arrange them in order.
• We will usually sort the data into ascending order — sorting
into descending order is very similar.
• Data can be sorted in various ADTs, such as arrays and trees;
in fact, we have already seen how a binary search tree can be
used to sort data.
• There are two main sorting themes: address-based sorting and
comparison-based sorting. We will focus on the latter.
The family of sorting methods
Main sorting themes:
• Address-based sorting
  – ProxmapSort
  – RadixSort
• Comparison-based sorting
  – Transposition sorting: BubbleSort
  – Insert and keep sorted: Insertion sort, Tree sort
  – Priority queue sorting: Selection sort, Heap sort
  – Diminishing increment sorting: ShellSort
  – Divide and conquer: QuickSort, MergeSort
Lecture schedule
• An overview of sorting algorithms
• We will not study address-based sorting in detail because it is
quite restricted in its application
– One slide on Proxmap/Radix sort
• Examples from each type of comparison-based sorting
algorithm
– Bubble sort [transposition sorting]
– Insertion sort (already seen, see Tree notes) [insert and keep sorted]
– Selection sort [priority queue sorting]
– Shell sort [diminishing increment sorting]
– Merge sort and Quick sort [divide and conquer sorting]
Bubble sort
• A pretty dreadful type of sort!
• However, the code is small:
for (int i = arr.length; i > 0; i--) {   // each pass shrinks the unsorted region
  for (int j = 1; j < i; j++) {
    if (arr[j-1] > arr[j]) {             // adjacent pair out of order: swap
      int temp = arr[j-1];
      arr[j-1] = arr[j];
      arr[j] = temp;
    }
  }
}
• Example trace on [5, 3, 2, 4]:
  5 3 2 4
  3 5 2 4
  3 2 5 4
  3 2 4 5   ← end of one inner loop: 5 ‘bubbled’ to the correct position
  2 3 4 5   ← remaining elements put in place
Insertion sort
Tree Insertion Sort
• This is inserting into a normal tree structure: i.e. data are put into the
correct position when they are inserted.
• We’ve already seen this happening with a tree, requiring a find and an
insert.
• We know the time complexity for one insert is O(logN) + O(1) = O(logN);
therefore inserting N items has a complexity of O(N log N).
Array Insertion Sort
• The array must be kept sorted; each insertion requires a find + insert.
• The insertion time complexity is O(logN) + O(N) = O(N) for one item;
therefore inserting N items has a complexity of O(N²).
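As a concrete sketch, here is array insertion sort in Java; the in-place shifting version below is a standard rendering, not code from these notes:

// Array insertion sort: grow a sorted prefix one element at a time,
// shifting larger elements right to open a slot (the O(N) ‘insert’ step).
static void insertionSort(int[] arr) {
  for (int i = 1; i < arr.length; i++) {
    int key = arr[i];            // next element to place
    int j = i - 1;
    while (j >= 0 && arr[j] > key) {
      arr[j + 1] = arr[j];       // the linear shift that dominates the cost
      j--;
    }
    arr[j + 1] = key;            // insert into the opened slot
  }
}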
SelectionSort
• SelectionSort uses an array implementation of the Priority
Queue ADT (not the heap implementation we will study)
• The elements are inserted into the array as they arrive and then
extracted in descending order into another array
• The major disadvantage is the performance overhead of
finding the largest element at each step; we have to traverse
over the entire array to find it
Example: the elements 13, 2, 15, 4 are inserted as they arrive;
extracting the largest each time yields 15, 13, 4, 2.
SelectionSort (2)
• In practice, we use the same array for the Priority Queue and
the output results
• General algorithm:
1. Initialise PQLast to the last index of the Priority Queue
2. Search from the start of the array to PQLast for the largest element:
call its position front
3. Swap element indexed by front with element indexed by PQLast
4. Decrement PQLast by one
5. Repeat steps 2–4 until PQLast reaches the front of the array
   (a Java sketch of this follows below)
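A minimal Java sketch of this algorithm; the names selectionSort, pqLast and front are our own, chosen to mirror the steps above:

// In-place selection sort: find the largest element in the unsorted
// region [0..pqLast] and swap it into position pqLast, then shrink.
static void selectionSort(int[] arr) {
  for (int pqLast = arr.length - 1; pqLast > 0; pqLast--) {
    int front = 0;                          // index of largest so far
    for (int i = 1; i <= pqLast; i++) {
      if (arr[i] > arr[front]) front = i;
    }
    int temp = arr[front];                  // swap largest into place
    arr[front] = arr[pqLast];
    arr[pqLast] = temp;
  }
}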
SelectionSort (3)
front = position of the largest element in the unsorted part;
PQLast = end of the unsorted part:

  13  2 15  4 12  9    initial array:  front → 15, PQLast → 9
  13  2  9  4 12 15    after one swap: front → 13, PQLast → 12
  12  2  9  4 13 15    after one swap: front → 12, PQLast → 4
  etc.
Time complexity of SelectionSort
• The main operations being performed in the algorithm are:
1. Comparisons to find the largest element in the Priority Queue subarray
   (to find the front index): there are n + (n–1) + … + 2 + 1 comparisons,
   i.e., (n/2)(n+1) = (n² + n)/2, so this is an O(n²) operation
2. Swapping elements between front and PQLast: n–1 exchanges are
   performed, so this is an O(n) operation
• The dominant operation (time-wise) gives the overall time
complexity, i.e., O(n²)
• Although this is an O(n²) algorithm, its advantage over
O(n log n) sorts is its simplicity
Time complexity of SelectionSort (2)
• For very small sets of data, SelectionSort may actually be
more efficient than O(n log n) algorithms
• This is because many of the more complex sorts have a
relatively high level of overhead associated with them, e.g.,
recursion is expensive compared with simple loop iteration
• This overhead might outweigh the gains provided by a more
complex algorithm where a small number of data elements is
being sorted
• SelectionSort does better than BubbleSort because fewer swaps are
required, although the same number of comparisons is performed
(each swap puts an element in its correct place)
Shell sort [diminishing increment sorting]
• Named after D.L. Shell! The method can also be pictured as
shrinking ‘shells’ of insertion sorting.
• Shell sort aims to reduce the work done by insertion sort (i.e.
scanning a list and inserting into the right position).
• Do the following (a Java sketch follows below):
  – Begin by looking at the sublists of elements x₁ positions apart and
    sort those elements by insertion sort
  – Reduce the gap x₁ to x₂, chosen so that x₁ is not a multiple of x₂,
    and repeat these two steps until the gap is 1.
• It can be shown that this approach will sort a list with a time
complexity of O(N^1.25).
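A minimal Java sketch; the gap schedule below (start near N/10, keep gaps odd, roughly halve each round) follows the guidance given later in these notes, but is only one of many reasonable choices:

// Shell sort: gapped insertion sorts with a diminishing increment.
static void shellSort(int[] arr) {
  int gap = Math.max(1, arr.length / 10);   // initial gap about N/10
  if (gap % 2 == 0) gap++;                  // prefer odd gaps
  while (true) {
    // insertion sort each sublist of elements ‘gap’ apart
    for (int i = gap; i < arr.length; i++) {
      int key = arr[i];
      int j = i - gap;
      while (j >= 0 && arr[j] > key) {
        arr[j + gap] = arr[j];              // shift within the sublist
        j -= gap;
      }
      arr[j + gap] = key;
    }
    if (gap == 1) break;                    // finished the gap = 1 pass
    gap /= 2;
    if (gap % 2 == 0) gap++;                // keep the gaps odd
  }
}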
Shell Sort Illustration
• Consider sorting the following list by Shell sort:

  GAP=3:  45 23  4  9  6  7 91  8 12 24
     →     9  6  4 24  8  7 45 23 12 91
  GAP=2:   9  6  4 24  8  7 45 23 12 91
     →     4  6  8  7  9 23 12 24 45 91
  GAP=1:   4  6  8  7  9 23 12 24 45 91
     →     4  6  7  8  9 12 23 24 45 91
Comparing O(N²), O(N^1.25) and O(N)
[Graph: time T against size of input N, for N = 1 to 25, plotting O(N),
O(N^1.25) and O(N²); O(N²) grows far faster, while O(N^1.25) stays close
to O(N) at these sizes.]
How do you choose the gap size?
• The idea of the decreasing gap size is that the list becomes
more and more sorted each time the gap size is reduced;
therefore you don’t (for example) want a gap size of 4
followed by a gap size of 2, because you’ll be sorting half the
numbers a second time.
• There is no formal proof of a good initial gap size, but about a
tenth of N is a reasonable start.
• Try to use prime numbers as your gap sizes, or odd numbers if
you cannot readily get a list of primes (though note that the gaps
9, 7, 5, 3, 1 will be doing less work when gap = 3).
Divide and conquer sorting
MergeSort and QuickSort
Divide ...
  5  1  4  2 10  3  9 15
  5  1  4  2  |  10  3  9 15
  5  1  |  4  2  |  10  3  |  9 15
  5  |  1  |  4  |  2  |  10  |  3  |  9  |  15
and conquer
  5  |  1  |  4  |  2  |  10  |  3  |  9  |  15
  1  5  |  2  4  |  3 10  |  9 15
  1  2  4  5  |  3  9 10 15
  1  2  3  4  5  9 10 15
MergeSort [divide and conquer sorting]
• For MergeSort an initial array is repeatedly divided into
halves (usually each is a separate array), until arrays of just
one or zero elements remain
• At each level of recombination, two sorted arrays are merged
into one
• This is done by copying the smaller of the two elements from
the sorted arrays into the new array, and then moving along
the arrays
  A: 1 13 24 26    B: 2 15 27 38    C: [ ]
  (Aptr, Bptr and Cptr start at the front of each array)
Merging
  compare 1 and 2   → copy 1:  C = [1]         Aptr → 13, Bptr → 2
  compare 13 and 2  → copy 2:  C = [1, 2]      Aptr → 13, Bptr → 15
  compare 13 and 15 → copy 13: C = [1, 2, 13]  Aptr → 24, Bptr → 15
  etc.
Merge Sort In Pseudocode
• Basic idea in pseudocode:
method mergeSort(array) is:
    if len(array) == 0 or len(array) == 1 then:   // terminating case
        return array
    else:
        end    = len(array)
        center = end / 2
        left   = mergeSort( array[0:center] )     // recursive call
        right  = mergeSort( array[center:end] )   // recursive call
        return merge(left, right)                 // method that merges lists
    end if
end method
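A runnable Java rendering of the pseudocode; copying subarrays with Arrays.copyOfRange mirrors the slicing above and is our own choice:

import java.util.Arrays;

// MergeSort: split into halves, sort each recursively, merge the results.
static int[] mergeSort(int[] a) {
  if (a.length <= 1) return a;                     // terminating case
  int center = a.length / 2;
  int[] left  = mergeSort(Arrays.copyOfRange(a, 0, center));
  int[] right = mergeSort(Arrays.copyOfRange(a, center, a.length));
  return merge(left, right);
}

// Merge two sorted arrays by repeatedly copying the smaller front element.
static int[] merge(int[] a, int[] b) {
  int[] c = new int[a.length + b.length];
  int aptr = 0, bptr = 0, cptr = 0;
  while (aptr < a.length && bptr < b.length) {
    c[cptr++] = (a[aptr] <= b[bptr]) ? a[aptr++] : b[bptr++];
  }
  while (aptr < a.length) c[cptr++] = a[aptr++];   // drain the leftovers
  while (bptr < b.length) c[cptr++] = b[bptr++];
  return c;
}

Note the extra arrays allocated by merge: this is the linear extra memory referred to in the analysis that follows.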
Analysis of MergeSort
• Let the time to carry out a MergeSort on n elements be T(n)
• Assume that n is a power of 2, so that we always split into
equal halves (the analysis can be refined for the general case)
• For n=1, the time is constant, so we can take T(1) = 1
• Otherwise, the time T(n) is the time to do two MergeSorts on
n/2 elements, plus the time to merge, which is linear
• So, T(n) = 2 T(n/2) + n
• Divide through by n to get T(n)/n = T(n/2)/(n/2) + 1
• Replacing n by n/2 gives, T(n/2)/(n/2) = T(n/4)/(n/4) + 1
• And again gives, T(n/4)/(n/4) = T(n/8)/(n/8) + 1
Analysis of MergeSort (2)
• We continue until we end up with T(2)/2 = T(1)/1 + 1
• Since n is divided by 2 at each step, we have log₂n steps
• Now, substituting the last equation in the previous one, and
working back up to the top gives T(n)/n = T(1)/1 + log₂n
• That is, T(n)/n = log₂n + 1
• So T(n) = n log₂n + n = O(n log n)
• Although this is an O(n log n) algorithm, it is hardly ever used
for main memory sorts because it requires linear extra
memory
QuickSort [divide and conquer sorting]
• As its name implies, QuickSort is the fastest known
comparison-based sorting algorithm in practice (address-based
sorts can be faster)
• It was devised by C.A.R. Hoare in 1962
• Its average running time is O(n log n) and it is very fast
• It has worst-case performance of O(n²), but this can be made
very unlikely with little effort
• The idea is as follows:
1. If the number of elements to be sorted is 0 or 1, then return
2. Pick any element, v (this is called the pivot)
3. Partition the other elements into two disjoint sets, S1 of elements ≤ v,
   and S2 of elements > v
4. Return QuickSort(S1) followed by v followed by QuickSort(S2)
QuickSort example
  5  1  4  2 10  3  9 15 12
Pick the middle element as the pivot, i.e., 10
Partition into the two subsets below
  5  1  4  2  3  9        15 12
Sort the subsets
  1  2  3  4  5  9        12 15
Recombine with the pivot
  1  2  3  4  5  9 10 12 15
QuickSort Example (full)
  Unsorted List:  5  1  4  2 10  3  9 15 12
  Partition 1:   [5  1  4  2  3  9]  10  [15 12]
  Partition 2:   [1  2  3]  4  [5  9]  10  [12]  15
  Partition 3:    1  2  3  4  5  9  10  12  15
  Sorted List:    1  2  3  4  5  9 10 12 15
QuickSort Pseudocode
• Basic idea in pseudocode (compare with mergeSort)
method quickSort(array) is:
    if len(array) == 0 or len(array) == 1 then:   // terminating case
        return array
    else:
        pivotIndx = int( random() * len(array) )  // just choose a random pivot
        pivot     = array(pivotIndx)
        (left, right) = partition(array, pivot)   // get left and right
        return quickSort(left) + pivot + quickSort(right)
    end if
end method
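A Java rendering of this list-building formulation (an in-place version via partitioning follows below); the use of ArrayList and a shared Random are our own choices:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

static final Random RAND = new Random();

// QuickSort in the style of the pseudocode: random pivot, split into
// <= and > sublists, sort each recursively, then concatenate.
static List<Integer> quickSort(List<Integer> a) {
  if (a.size() <= 1) return a;                    // terminating case
  int pivot = a.get(RAND.nextInt(a.size()));
  List<Integer> left = new ArrayList<>(), right = new ArrayList<>();
  boolean pivotSkipped = false;                   // remove one pivot copy
  for (int x : a) {
    if (x == pivot && !pivotSkipped) { pivotSkipped = true; continue; }
    if (x <= pivot) left.add(x); else right.add(x);
  }
  List<Integer> result = new ArrayList<>(quickSort(left));
  result.add(pivot);                              // pivot sits between the halves
  result.addAll(quickSort(right));
  return result;
}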
Partitioning example
  5 11 4 25 10 3 9 15 12
Pick the middle element as the pivot, i.e., 10
Move the pivot out of the way by swapping it with the first element
  10 11 4 25 5 3 9 15 12      (swapPos → the 11)
Step along the array, swapping small elements into swapPos
  10 4 11 25 5 3 9 15 12      (swapPos → the 11)
Partitioning example (2)
  10 4 5 25 11 3 9 15 12      (swapPos → the 25)
  10 4 5 3 11 25 9 15 12      (swapPos → the 11)
  10 4 5 3 9 25 11 15 12      (swapPos → the 25)
Pseudo code for partitioning
pivotPos = middle of array a;
swap a[pivotPos] with a[first];           // Move the pivot out of the way
swapPos = first + 1;
for each element in the array from swapPos to last do:
    // If the current element is smaller than the pivot we
    // move it towards the start of the array
    if (a[currentElement] < a[first]):
        swap a[swapPos] with a[currentElement];
        increment swapPos by 1;
// Now move the pivot back to its rightful place
swap a[first] with a[swapPos-1];
return swapPos-1;                         // Pivot position
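The same scheme in Java, together with the in-place QuickSort driver it supports; first and last bound the subarray, and the middle pivot choice follows the example above:

// Partition a[first..last] in place around the middle element;
// returns the final position of the pivot.
static int partition(int[] a, int first, int last) {
  int pivotPos = (first + last) / 2;
  swap(a, pivotPos, first);               // move the pivot out of the way
  int swapPos = first + 1;
  for (int i = swapPos; i <= last; i++) {
    if (a[i] < a[first]) {                // smaller than the pivot:
      swap(a, swapPos, i);                // move it towards the start
      swapPos++;
    }
  }
  swap(a, first, swapPos - 1);            // pivot back to its rightful place
  return swapPos - 1;
}

static void swap(int[] a, int i, int j) {
  int t = a[i]; a[i] = a[j]; a[j] = t;
}

// In-place QuickSort using the partition above.
static void quickSortInPlace(int[] a, int first, int last) {
  if (first >= last) return;              // 0 or 1 elements: done
  int p = partition(a, first, last);
  quickSortInPlace(a, first, p - 1);
  quickSortInPlace(a, p + 1, last);
}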
Analysis of QuickSort
• We assume a random choice of pivot
• Let the time to carry out a QuickSort on n elements be T(n)
• We have T(0) = T(1) = 1
• The running time of QuickSort is the running time of the
partitioning (linear in n) plus the running time of the two
recursive calls of QuickSort
• Let i be the number of elements in the left partition; then
T(n) = T(i) + T(n–i–1) + cn (for some constant c)
Worst-case analysis
• If the pivot is always the smallest element, then i = 0 always
• We ignore the term T(0) = 1, so the recurrence relation is
T(n) = T(n–1) + cn
• So, T(n–1) = T(n–2) + c(n–1) and so on until we get
T(2) = T(1) + c(2)
• Substituting back up gives T(n) = T(1) + c(n + … + 2) = O(n²)
• Notice that this case happens if we always take the pivot to be
the first element in the array and the array is already sorted
• So, in this extreme case, QuickSort takes O(n²) time to do
absolutely nothing!
Best-case analysis
• In the best case, the pivot is in the middle
• To simplify the equations, we assume that the two subarrays
are each exactly half the length of the original (a slight
overestimate which is acceptable for big-Oh calculations)
• So, we get T(n) = 2T(n/2) + cn
• This is very similar to the formula for MergeSort, and a
similar analysis leads to T(n) = cn log₂n + n = O(n log n)
Average-case analysis
• We assume that each of the possible sizes of the left partition
is equally likely, and hence has probability 1/n
• With this assumption, the average value of T(i), and hence
also of T(n–i–1), is (T(0) + T(1) + … + T(n–1))/n
• Hence, our recurrence relation becomes
T(n) = 2(T(0) + T(1) + … + T(n–1))/n + cn
• Multiplying by n gives
nT(n) = 2(T(0) + T(1) + … + T(n–1)) + cn2
• Replacing n by n–1 gives
(n–1)T(n–1) = 2(T(0) + T(1) + … + T(n–2)) + c(n–1)2
• Subtracting the last equation from the previous one gives
nT(n) – (n–1)T(n–1) = 2T(n–1) + 2cn – c
Average-case analysis (2)
• Rearranging, and dropping the insignificant c on the end,
gives nT(n) = (n+1)T(n–1) + 2cn
• Divide through by n(n+1) to get
T(n)/(n+1) = T(n–1)/n + 2c/(n+1)
• Hence, T(n–1)/n = T(n–2)/(n–1) + 2c/n and so on down to
T(2)/3 = T(1)/2 + 2c/3
• Substituting back up gives
T(n)/(n+1) = T(1)/2 + 2c(1/3 + 1/4 + … + 1/(n+1))
• The sum in brackets is about logₑ(n+1) + γ – 3/2, where γ is
Euler’s constant, which is approximately 0.577
• So, T(n)/(n+1) = O(log n) and T(n) = O(n log n)
Some observations about QuickSort
• We have seen that a consistently poor choice of pivot can lead
to O(n2) time performance
• A good strategy is to pick the middle value of the left, centre,
and right elements (the ‘median-of-three’ strategy)
• For small arrays, with n less than (say) 20, QuickSort does not
perform as well as simpler sorts such as SelectionSort
• Because QuickSort is recursive, these small cases will occur
frequently
• A common solution is to stop the recursion at n = 10, say, and
use a different, non-recursive sort
• This also avoids nasty special cases, e.g., trying to take the
middle of three elements when n is one or two
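A small Java sketch of that median-of-three choice; it assumes the subarray holds at least three elements, which is exactly the caveat above:

// Returns the index of the median of a[first], a[centre] and a[last].
// Assumes last - first >= 2 (at least three elements).
static int medianOfThree(int[] a, int first, int last) {
  int lo = first, mid = (first + last) / 2, hi = last;
  // three compare-swaps on the indices order the three values
  if (a[lo] > a[mid]) { int t = lo; lo = mid; mid = t; }
  if (a[mid] > a[hi]) { int t = mid; mid = hi; hi = t; }
  if (a[lo] > a[mid]) { int t = lo; lo = mid; mid = t; }
  return mid;                              // index of the median value
}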
Address-Based Sorting
• ProxmapSort uses techniques similar to hashing: a hash function
assigns each element to (near) its correctly sorted position in a
container such as an array
• The algorithms are generally complex, and very often only
suitable for certain kinds of data
• You need to find a suitable hashing function that will
distribute the data fairly evenly
• Clashes are dealt with using a comparison-based sort, so the
more clashes there are, the further the time complexity moves
away from O(n)
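A deliberately simplified, bucket-style sketch of the idea in Java, assuming keys are doubles in [0, 1) so that the ‘hash’ is just scaling; real proxmap implementations are considerably more involved:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Proxmap-style sort: a hash sends each key to a home bucket near its
// sorted position; clashes in a bucket fall back to a comparison sort.
static double[] proxmapStyleSort(double[] arr) {
  int n = arr.length;
  List<List<Double>> buckets = new ArrayList<>();
  for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());
  for (double x : arr) buckets.get((int) (x * n)).add(x);   // address hash
  double[] out = new double[n];
  int k = 0;
  for (List<Double> b : buckets) {
    Collections.sort(b);                    // comparison sort for clashes
    for (double x : b) out[k++] = x;
  }
  return out;
}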
Radix / Bucket / Bin sort
• Radix Sort
  – In its favour, it is an O(n) sort; that makes it the fastest sort we have
    investigated.
  – However, it requires at least 2n space in which to operate, and is in
    principle exponential in its space complexity.
  – A radix sort makes one pass through the data for each atom in the key.
    If the key is a three-character uppercase alphabetic string, such as ABC,
    then A is an atom. In this case, there would be three passes through the data.
  – The first pass sorts on the low-order element of the key (the C in our
    example). The second pass sorts on the next atom in order of importance
    (the B in our example). Each pass progresses toward the high-order atom
    of the key.
  – In each such pass, the elements of the array are distributed into k buckets.
    For our alphabetic key example, A’s go into the ‘A’ bucket, B’s into the
    ‘B’ bucket, and so on. These distribution buckets are then gathered, in
    order, and placed back in the original array. The next pass is then
    executed. When a pass has been made on the high-order atom in the key,
    the array will be sorted.
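A minimal LSD radix sort in Java over non-negative integers, using decimal digits as the atoms so it matches the worked example that follows; the digits parameter is our own addition:

import java.util.ArrayList;
import java.util.List;

// LSD radix sort: one distribute-and-gather pass per decimal digit,
// from the low-order digit up to the high-order digit.
static void radixSort(int[] arr, int digits) {
  int divisor = 1;
  for (int pass = 0; pass < digits; pass++) {
    List<List<Integer>> bins = new ArrayList<>();
    for (int d = 0; d < 10; d++) bins.add(new ArrayList<>());
    for (int x : arr) bins.get((x / divisor) % 10).add(x);  // distribute
    int k = 0;
    for (List<Integer> bin : bins)                          // gather in order
      for (int x : bin) arr[k++] = x;
    divisor *= 10;
  }
}

For example, radixSort(list, 3) on the 3-digit keys below performs the three passes illustrated.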
Radix Sort Example

Original list: 123 234 345 134 245 356 156 267 378 121 232 343

Pass 1 – distribute on the low-order digit (col 3), then gather:
  bin 1: 121    bin 2: 232    bin 3: 123 343    bin 4: 234 134
  bin 5: 345 245    bin 6: 356 156    bin 7: 267    bin 8: 378
  (bins 0 and 9 are empty)
  → 121 232 123 343 234 134 345 245 356 156 267 378

Pass 2 – distribute on the middle digit (col 2), then gather:
  bin 2: 121 123    bin 3: 232 234 134    bin 4: 343 345 245
  bin 5: 356 156    bin 6: 267    bin 7: 378
  (bins 0–1 and 8–9 are empty)
  → 121 123 232 234 134 343 345 245 356 156 267 378

Pass 3 – distribute on the high-order digit (col 1), then gather:
  bin 1: 121 123 134 156    bin 2: 232 234 245 267    bin 3: 343 345 356 378
  (bins 0 and 4–9 are empty)
  → 121 123 134 156 232 234 245 267 343 345 356 378, i.e. sorted