Sorting

Sorting
• Importance of sorting
• Quicksort
• Lower bounds for comparison-based
methods
• Heapsort
• Non-comparison based sorting
Why don't CS profs ever stop
talking about sorting?!
• Computers spend more time sorting than anything else;
historically, about 25% of mainframe cycles went to sorting.
• Sorting is the best-studied problem in computer
science, with a variety of different algorithms known.
• Most of the interesting ideas we encounter in the course
are taught in the context of sorting, such as divide-and-conquer,
randomized algorithms, and lower bounds.
You should have seen most of these algorithms before; we will
concentrate on the analysis.
Applications of Sorting
• Closest Pair
• Element Uniqueness
• Frequency Distribution
• Selection of Kth largest element
• Convex Hulls
–See next slide!
Convex Hulls
Huffman Codes
If you are trying to minimize the amount of space a
text file takes up, it is silly to assign each letter a
code of the same length (i.e. one byte).
Example: e is more common than q, a is more
common than z.
If we were storing English text, we would want a and
e to have shorter codes than q and z.
Example Problems
a. You are given a pile of thousands of telephone bills and thousands
of checks sent in to pay the bills. Find out who did not pay.
b. You are given a list containing the title, author, call number and
publisher of all the books in a school library and another list of 30
publishers. Find out how many of the books in the library were
published by each of those 30 companies.
c. You are given all the book checkout cards used in the campus
library during the past year, each of which contains the name of the
person who took out the book. Determine how many distinct people
checked out at least one book.
Quicksort
Although mergesort is O(n log n), it is difficult to implement
in place on arrays, since we need extra space to merge. In
practice, Quicksort is the fastest sorting algorithm.
Example: Pivot about 10
17 12 6 23 19 8 5 10 - before
6 8 5 10 17 12 23 19 - after
The pivot is now in its correctly sorted position, and all
other numbers are on the correct side of it, before or after.
Quicksort Walkthrough
[figure: recursion tree of quicksort on 17 12 6 23 19 8 5 10:
pivoting about 10 splits the array into (6 8 5) and (17 12 23 19),
each side is recursively partitioned (pivots 5, 19, ...), and the
sorted leaves combine to 5 6 8 10 12 17 19 23]
Pseudocode
Sort(A) {
    Quicksort(A, 1, n);
}

Quicksort(A, low, high) {
    if (low < high) {
        pivotLocation = Partition(A, low, high);
        Quicksort(A, low, pivotLocation - 1);
        Quicksort(A, pivotLocation + 1, high);
    }
}
Pseudocode
int Partition(A, low, high) {
    pivot = A[high];
    leftwall = low - 1;
    for i = low to high - 1 {
        if (A[i] < pivot) then {
            leftwall = leftwall + 1;
            swap(A[i], A[leftwall]);
        }
    }
    swap(A[high], A[leftwall + 1]);
    return leftwall + 1;
}
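The pseudocode translates almost directly into C. Here is a minimal sketch (the names `quicksort` and `partition` are mine, and I use 0-based array indexing rather than the slides' 1-based convention):

```c
static void swap(int *a, int *b) {
    int t = *a; *a = *b; *b = t;
}

/* Lomuto-style partition: pivot on A[high], return its final index. */
static int partition(int A[], int low, int high) {
    int pivot = A[high];
    int leftwall = low - 1;           /* last index of the "< pivot" region */
    for (int i = low; i < high; i++) {
        if (A[i] < pivot) {
            leftwall++;
            swap(&A[i], &A[leftwall]);
        }
    }
    swap(&A[high], &A[leftwall + 1]); /* drop the pivot between the regions */
    return leftwall + 1;
}

void quicksort(int A[], int low, int high) {
    if (low < high) {
        int p = partition(A, low, high);
        quicksort(A, low, p - 1);
        quicksort(A, p + 1, high);
    }
}
```

Running this on the example array from the earlier slide places the pivot 10 in position 3 on the first call and yields the fully sorted array.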
Best Case for Quicksort
Worst Case for Quicksort
Intuition: The Average Case
[figure: number line from 0 to n with marks at n/4, n/2, and 3n/4]
Anywhere in the middle half is a decent partition.
(3/4)^h · n = 1  =>  n = (4/3)^h
log(n) = h log(4/3)
h = log(n) / log(4/3) < 2 log(n)
What have we shown?
At most 2log(n) decent partitions suffice to sort an array
of n elements.
But if we just take arbitrary pivot points, how often will
they, in fact, be decent?
Since any number ranked between n/4 and 3n/4 would
make a decent pivot, we get one half the time on average.
Therefore, on average, we will need 2 × 2log(n) = 4log(n)
levels of partitioning to sort.
Quicksort in the real world…
Average-case Analysis
• Let X denote the random variable that
represents the total number of comparisons
performed
• Let Xij be the indicator random variable for
the event that the ith smallest element and
the jth smallest element are compared
• E[X] = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} E[Xij]
Computing Xij
• Observation
– All comparisons are between a pivot element
and another element
– If an item k with i < k < j is chosen as a pivot
before either i or j, then items i and j are never
compared
• Pr[Xij = 1] = 2/(j-i+1)
– Item i or item j must be chosen as a pivot before
any other item in the interval (i..j), and each of
the j-i+1 items in the interval is equally likely
to be chosen first
Computing E[X]
E[X] = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} 2/(j-i+1)
     = Σ_{i=1}^{n-1} Σ_{k=1}^{n-i} 2/(k+1)      (substituting k = j - i)
     ≤ Σ_{i=1}^{n-1} 2 H_{n-i+1}
     ≤ Σ_{i=1}^{n-1} 2 H_n
     = 2(n-1)H_n = O(n log n)
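As a sanity check (mine, not from the slides), the double sum can be evaluated numerically and compared against the 2(n-1)H_n bound:

```c
/* Evaluate E[X] = sum over i < j of 2/(j-i+1) exactly. */
double expected_comparisons(int n) {
    double sum = 0.0;
    for (int i = 1; i <= n - 1; i++)
        for (int j = i + 1; j <= n; j++)
            sum += 2.0 / (j - i + 1);
    return sum;
}

/* The nth harmonic number H_n = 1 + 1/2 + ... + 1/n. */
double harmonic(int n) {
    double h = 0.0;
    for (int k = 1; k <= n; k++) h += 1.0 / k;
    return h;
}
```

For every n, `expected_comparisons(n)` stays below `2 * (n-1) * harmonic(n)`, matching the derivation above.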
Avoiding worst-case
• Understanding quicksort’s worst-case
• Methods for avoiding it
– Pivot strategies
– Randomization
Understanding the worst case
[figure: quicksorting the already-sorted keys A B D F H J K, where
each partition peels off only one element, so the recursion
degenerates into a path of depth n]
The worst case is a likely case for many
applications.
Pivot Strategies
• Use the middle element of the sub-array as
the pivot.
• Use the median of (first, middle,
last) to make sure to avoid any kind of pre-sorting.
What is the worst-case performance for these
pivot selection mechanisms?
Randomization Techniques
• Make the chance of worst-case run time equally
small for all inputs
• Methods
– Choose pivot element randomly from range
[low..high]
– Initially permute the array
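The first method can be sketched in C (my own sketch, not the slides' code): swap a uniformly random element into position `high`, then reuse the ordinary partition routine unchanged.

```c
#include <stdlib.h>  /* rand */

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition, pivoting on A[high], as before. */
static int partition(int A[], int low, int high) {
    int pivot = A[high], leftwall = low - 1;
    for (int i = low; i < high; i++) {
        if (A[i] < pivot) {
            leftwall++;
            swap(&A[i], &A[leftwall]);
        }
    }
    swap(&A[high], &A[leftwall + 1]);
    return leftwall + 1;
}

/* Pick the pivot uniformly at random from [low..high], so no fixed
   input can force the worst case. */
static int randomized_partition(int A[], int low, int high) {
    int r = low + rand() % (high - low + 1);
    swap(&A[r], &A[high]);
    return partition(A, low, high);
}

void randomized_quicksort(int A[], int low, int high) {
    if (low < high) {
        int p = randomized_partition(A, low, high);
        randomized_quicksort(A, low, p - 1);
        randomized_quicksort(A, p + 1, high);
    }
}
```

On an already-sorted (or reverse-sorted) input, where a fixed last-element pivot would be quadratic, the random pivot gives expected O(n log n) behavior.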
Is Quicksort really faster than
Mergesort?
Since Quicksort is (n log n) and Selection Sort is (n2),
there isn’t any debate about which is faster.
How can we compare two (n log n) algorithms to know
which one is faster?
Using the RAM model and the big Oh notation, we can't!
If all of the algorithms are well implemented, Quicksort
is at least 2-3 times faster than any of the others, but this
only has to do with implementation details.
Possible reasons for not choosing
quicksort
• What do you know about the input data?
• Is the data already partially sorted?
• Do we know the distribution of the keys?
• Are your keys very long or hard to compare?
• Is the range of possible keys very small?
Optimizing Quicksort
Using randomization: guarantees that bad input data alone
can never cause worst-case time.
Median of three: can be slightly faster than
randomization for somewhat sorted data.
Leave small sub-arrays for insertion sort: insertion
sort can be faster, in practice, for small values of n.
Do the smaller partition first: minimizes runtime
(stack) memory.
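The insertion-sort optimization might be sketched like this (the cutoff value 16 is an illustrative choice of mine, not from the slides; in practice it is tuned empirically):

```c
#define CUTOFF 16  /* illustrative threshold for switching algorithms */

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static int partition(int A[], int low, int high) {
    int pivot = A[high], leftwall = low - 1;
    for (int i = low; i < high; i++) {
        if (A[i] < pivot) {
            leftwall++;
            swap(&A[i], &A[leftwall]);
        }
    }
    swap(&A[high], &A[leftwall + 1]);
    return leftwall + 1;
}

/* Simple insertion sort on A[low..high], fast for small ranges. */
static void insertion_sort(int A[], int low, int high) {
    for (int i = low + 1; i <= high; i++) {
        int key = A[i], j = i - 1;
        while (j >= low && A[j] > key) { A[j + 1] = A[j]; j--; }
        A[j + 1] = key;
    }
}

/* Quicksort that hands small sub-arrays to insertion sort. */
void hybrid_quicksort(int A[], int low, int high) {
    if (high - low + 1 <= CUTOFF) {
        insertion_sort(A, low, high);
    } else {
        int p = partition(A, low, high);
        hybrid_quicksort(A, low, p - 1);
        hybrid_quicksort(A, p + 1, high);
    }
}
```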
Is Linear Sorting Possible?
Any comparison-based sorting program can be thought
of as defining a decision tree of possible executions.
Example Decision Tree
How big is the decision tree?
Since different permutations of n elements require
different sequences of steps to sort, there must be at least
n! different paths from the root to leaves in the decision
tree, i.e. at least n! different leaves in the tree.
Since a binary tree of height h has at most 2^h leaves, we
know that n! ≤ 2^h, or h ≥ log(n!).
By inspection, n! > (n/2)^(n/2), since the last n/2 factors of
the product are each greater than n/2. Thus h > (n/2)log(n/2).
Heaps
• Definition
• Operations
– Insertion
– Heap construction
– Heap extract max
• Heapsort
Definition
A binary heap is defined to be a binary tree with a key
in each node such that:
1: All leaves are on, at most, two adjacent levels.
2: All leaves on the lowest level occur to the left, and all
levels except the lowest one are completely filled.
3: The key in the root is greater than the keys in both its
children, and the left and right subtrees are again binary heaps.
Conditions 1 and 2 specify the shape of the tree, and
condition 3 the labeling of the tree.
Example Heap
Are these legal?
Partial Order Property
The ancestor relation in a heap defines a partial order on
its elements, which means it is reflexive, anti-symmetric,
and transitive.
Reflexive: x is an ancestor of itself.
Anti-symmetric: if x is an ancestor of y and y is an
ancestor of x, then x=y.
Transitive: if x is an ancestor of y and y is an ancestor of
z, x is an ancestor of z.
Partial orders can be used to model hierarchies with
incomplete information or equal-valued elements.
Insertion Operation
• Heaps can be constructed incrementally, by inserting new
elements into the left-most open spot in the array.
• If the new element is greater than its parent, swap their
positions and recur.
The height h of an n-element heap is bounded because:
Σ_{i=0}^{h} 2^i = 2^{h+1} - 1 ≥ n
so
h = ⌊lg n⌋
and insertions take O(log n) time.
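A sketch of this bubble-up insertion in C (array-based, 0-based indexing, so the parent of index i is (i-1)/2; the type and names are mine):

```c
#define MAXHEAP 1024  /* illustrative fixed capacity */

typedef struct {
    int data[MAXHEAP];
    int size;
} Heap;

/* Insert at the leftmost open slot (end of the array), then swap the
   new key upward while it is greater than its parent. */
void heap_insert(Heap *h, int key) {
    int i = h->size++;
    h->data[i] = key;
    while (i > 0 && h->data[i] > h->data[(i - 1) / 2]) {
        int parent = (i - 1) / 2;
        int t = h->data[i];
        h->data[i] = h->data[parent];
        h->data[parent] = t;
        i = parent;
    }
}
```

Each insertion walks at most one root-to-leaf path, so it costs O(log n), matching the height bound above.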
Heap Construction
The bottom up insertion algorithm gives a good way to
build a heap, but Robert Floyd found a better way, using
a merge procedure called heapify.
Given two heaps and a fresh element, they can be
merged into one by making the new entry the root and
trickling down.
To convert an array of integers into a heap, place them
all into a binary tree, and call heapify on each node.
How long would this take?
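Floyd's trickle-down heapify and the bottom-up construction might be sketched like this (0-based array, so the children of index i are 2i+1 and 2i+2; names are mine). Since only internal nodes need heapifying and the work per node is proportional to its height, the whole construction turns out to take O(n) rather than O(n log n):

```c
/* Trickle the key at index i down until both children are smaller. */
void heapify(int A[], int n, int i) {
    int largest = i;
    int left = 2 * i + 1, right = 2 * i + 2;
    if (left < n && A[left] > A[largest]) largest = left;
    if (right < n && A[right] > A[largest]) largest = right;
    if (largest != i) {
        int t = A[i]; A[i] = A[largest]; A[largest] = t;
        heapify(A, n, largest);  /* recur into the subtree we disturbed */
    }
}

/* Call heapify on every internal node, from the bottom up. */
void build_heap(int A[], int n) {
    for (int i = n / 2 - 1; i >= 0; i--)
        heapify(A, n, i);
}
```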
Heapify Example
Try to create a heap with the entries:
5, 3, 17, 10, 84, 19, 6, 22, 9
Heap Extract Max
HeapExtractMax(A) {
    if (heap-size(A) < 1)
        then error "Heap Underflow";
    max = A[1];
    A[1] = A[heap-size(A)];
    heap-size(A)--;
    Heapify(A, 1);
    return max;
}
Heap Sort
To sort using the heap data structure, we first build the
heap, and then just repeatedly extract the maximum.
Build Heap = O(n)
Extract Maximum = O(log n)
Therefore:
Heap Sort = O(n) + n O(log n)
= O(n log n)
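Putting the pieces together in C (a sketch; `heapify` is the 0-based trickle-down routine sketched earlier, repeated here so the block is self-contained, and `heap_sort` is my name for it):

```c
static void heapify(int A[], int n, int i) {
    int largest = i, left = 2 * i + 1, right = 2 * i + 2;
    if (left < n && A[left] > A[largest]) largest = left;
    if (right < n && A[right] > A[largest]) largest = right;
    if (largest != i) {
        int t = A[i]; A[i] = A[largest]; A[largest] = t;
        heapify(A, n, largest);
    }
}

/* Build the heap in O(n), then extract the max n times: each
   extraction swaps the root to the end of the shrinking heap and
   re-heapifies in O(log n), for O(n log n) total. */
void heap_sort(int A[], int n) {
    for (int i = n / 2 - 1; i >= 0; i--)  /* build heap */
        heapify(A, n, i);
    for (int end = n - 1; end > 0; end--) {
        int t = A[0]; A[0] = A[end]; A[end] = t;  /* extract max */
        heapify(A, end, 0);
    }
}
```

Unlike mergesort, this sorts in place: the extracted maxima accumulate at the tail of the same array.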
Non-comparison Based Sorting
All the sorting algorithms we have seen assume binary
comparisons as the basic primitive, questions of the form
“is x before y?”.
Suppose you were given a deck of playing cards to sort.
Most likely you would set up 13 piles and put all cards
with the same number in one pile.
A 2 3 4 5 6 7 8 9 10 J Q K
Bucketsort
Suppose we are sorting n numbers from 1 to m, where
we know the numbers are approximately uniformly
distributed.
We can set up n buckets, each responsible for an interval
of m/n numbers from 1 to m
[figure: n buckets covering the ranges 1..m/n, m/n+1..2m/n,
2m/n+1..3m/n, ...]
Bucketsort
We can use bucketsort effectively whenever we
understand the distribution of the data.
However, bad things happen when we assume the
wrong distribution.
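A minimal bucketsort sketch under the slide's assumptions (n keys uniformly distributed in 1..m; the bucket layout and all names are mine). Key k goes into bucket (k-1)·n/m; a counting pass sizes the buckets, keys are placed into their bucket's segment, and each (expected constant-sized) segment is insertion-sorted:

```c
#include <stdlib.h>  /* calloc, malloc, free */
#include <string.h>  /* memcpy */

static void insertion_sort(int A[], int lo, int hi) {  /* sorts A[lo..hi-1] */
    for (int i = lo + 1; i < hi; i++) {
        int key = A[i], j = i - 1;
        while (j >= lo && A[j] > key) { A[j + 1] = A[j]; j--; }
        A[j + 1] = key;
    }
}

/* Sort n keys drawn from 1..m, assuming roughly uniform distribution. */
void bucketsort(int A[], int n, int m) {
    int *start = calloc(n + 1, sizeof(int));
    int *next = malloc((n + 1) * sizeof(int));
    int *out = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)                 /* count keys per bucket */
        start[(long long)(A[i] - 1) * n / m + 1]++;
    for (int b = 1; b <= n; b++)                /* prefix sums: segment starts */
        start[b] += start[b - 1];
    memcpy(next, start, (n + 1) * sizeof(int));
    for (int i = 0; i < n; i++) {               /* place keys into segments */
        int b = (long long)(A[i] - 1) * n / m;
        out[next[b]++] = A[i];
    }
    for (int b = 0; b < n; b++)                 /* sort each small segment */
        insertion_sort(out, start[b], start[b + 1]);
    memcpy(A, out, n * sizeof(int));
    free(start); free(next); free(out);
}
```

Under the uniformity assumption each bucket holds O(1) keys in expectation, so the total expected time is O(n); a skewed distribution degrades this toward insertion sort's quadratic behavior, which is exactly the "bad things" the slide warns about.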
Real World Distributions
Consider the distribution of names in a telephone book.
• Will there be a lot of Ofria’s?
• Will there be a lot of Smith’s?
• Will there be a lot of Zucker’s?
Make sure you understand your data, or use a good
worst-case or randomized algorithm!