Order Statistics

Introduction to
Algorithms
Jiafen Liu
Sept. 2013
Today’s Tasks
• Order Statistics
– Randomized divide and conquer
– Analysis of expected time
– Worst-case linear-time order statistics
– Analysis
Order statistics
• Given n elements in an array, select the ith smallest element (the element with rank i).
• This has various applications.
– i=1, find the minimum element.
– i=n, find the maximum element.
– Find the median:
– i = (n+1)/2 if n is odd; i = n/2 and i = n/2+1 if n is even (lower and upper median)
– This is useful in statistics.
How to find the ith element?
• Naïve algorithm?
– Sort array A, and return the element A[i].
– With merge sort, the worst-case running time is Θ(n lg n) + Θ(1) = Θ(n lg n).
– Randomized quicksort gives the same bound, but only in expectation.
• Can we do better than that?
– Selection is related to sorting, but different.
– Our goal is Θ(n) expected time.
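The naive approach above can be sketched in a few lines of Python. The function name `select_by_sorting` and the sample array are illustrative assumptions, not part of the lecture; the rank i is 1-indexed as on the slides.

```python
def select_by_sorting(values, i):
    """Return the i-th smallest element (1-indexed rank) by sorting a copy."""
    if not 1 <= i <= len(values):
        raise ValueError("rank i must satisfy 1 <= i <= n")
    # Theta(n lg n) for the sort, then Theta(1) for the lookup.
    return sorted(values)[i - 1]


data = [6, 10, 13, 5, 8, 3, 2, 11]   # arbitrary example array
print(select_by_sorting(data, 1))    # 2  (the minimum)
print(select_by_sorting(data, 7))    # 11 (the 7th smallest)
```

This is the baseline that the rest of the lecture tries to beat.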
Randomized divide-and-conquer algorithm
• RAND-SELECT calls RAND-PARTITION(A, p, q). Does that subroutine look familiar?
Partitioning subroutine
PARTITION(A, p, q)        // partitions A[p . . q]
    x ← A[p]              // pivot = A[p]
    i ← p
    for j ← p+1 to q
        do if A[j] ≤ x
            then i ← i + 1
                 exchange A[i] ↔ A[j]
    exchange A[p] ↔ A[i]
    return i
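A minimal Python sketch of how the partitioning subroutine can be combined into randomized selection. It follows the standard textbook formulation; the function names `partition`, `rand_partition` and `rand_select` are chosen here for illustration, indices are 0-based and inclusive, and the rank i is 1-indexed within the subarray.

```python
import random


def partition(a, p, q):
    """Partition a[p..q] (inclusive) around the pivot a[p]; return the pivot's final index."""
    x = a[p]
    i = p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]
    return i


def rand_partition(a, p, q):
    """Move a uniformly random element of a[p..q] to position p, then partition."""
    r = random.randint(p, q)  # both endpoints are included
    a[p], a[r] = a[r], a[p]
    return partition(a, p, q)


def rand_select(a, p, q, i):
    """Return the i-th smallest element of a[p..q]; rearranges a in place."""
    if p == q:
        return a[p]
    r = rand_partition(a, p, q)
    k = r - p + 1  # rank of the pivot within a[p..q]
    if i == k:
        return a[r]
    if i < k:
        return rand_select(a, p, r - 1, i)
    return rand_select(a, r + 1, q, i - k)
```

Unlike quicksort, only one side of the partition is searched after each split, which is where the savings over sorting come from.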
Randomized divide-and-conquer algorithm
Example
• Select the i= 7th smallest:
• Partition
Algorithm Analysis
• (All our analyses today assume that all elements are distinct.)
• Like quicksort, our algorithm depends on how well PARTITION splits the array.
• Recall: what is the lucky case of PARTITION?
– Splitting at the median
– A 1/10 : 9/10 split?
– Any split with a constant fraction on each side is lucky; the unlucky splits are 0 : n–1 and n–1 : 0.
Lucky or Unlucky?
• Lucky:
– Take a 1/10 : 9/10 partition as an example.
– T(n) = T(9n/10) + Θ(n)
– How do we solve it?
– Master method (case 3):
– T(n) = Θ(n)
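The master-method step can be written out as follows. This is a worked sketch, taking f(n) = n as a representative of the Θ(n) term when checking the regularity condition.

```latex
\begin{align*}
T(n) &= a\,T(n/b) + f(n), \qquad a = 1,\quad b = \tfrac{10}{9},\quad f(n) = \Theta(n) \\
n^{\log_b a} &= n^{\log_{10/9} 1} = n^{0} = 1 \\
f(n) &= \Omega\!\left(n^{0+\varepsilon}\right) \text{ for } \varepsilon = 1,
  \qquad a\,f(n/b) = \tfrac{9}{10}\,n \le c\,f(n) \text{ with } c = \tfrac{9}{10} < 1 \\
\Rightarrow\; T(n) &= \Theta(f(n)) = \Theta(n)
\end{align*}
```

Equivalently, the costs n, (9/10)n, (9/10)²n, … form a geometric series whose sum is at most 10n.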
Lucky or Unlucky?
• Unlucky:
– A 0 : n–1 or n–1 : 0 partition
– T(n) = T(n–1) + Θ(n)
– T(n) = Θ(n²), because the costs form an arithmetic series.
– Even worse than sorting first and then selecting!
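Unrolling the unlucky recurrence makes the arithmetic series explicit (a worked sketch, treating the Θ(n) term as proportional to n):

```latex
\begin{align*}
T(n) &= T(n-1) + \Theta(n) \\
     &= \Theta(n) + \Theta(n-1) + \dots + \Theta(1) \\
     &= \Theta\!\left(\sum_{k=1}^{n} k\right)
      = \Theta\!\left(\frac{n(n+1)}{2}\right)
      = \Theta(n^2)
\end{align*}
```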
Analysis of Expected Time
• We dealt with the expected running time of quicksort in Lecture 4.
– Recall how we handled that.
– PARTITION has n possible outcomes; how do we capture them all in one expression?
– With indicator random variables.
Analysis of Expected Time
• Let T(n) = the running time of RAND-SELECT
on an input of size n, assuming random
numbers are independent.
• To obtain an upper bound, assume that the ith
element always falls in the larger side of the
partition:
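Under that assumption, the recurrence can be written by cases. This is a reconstruction following the standard analysis of randomized selection, with one case per possible split:

```latex
T(n) =
\begin{cases}
T(\max\{0,\, n-1\}) + \Theta(n) & \text{if the split is } 0 : n-1, \\
T(\max\{1,\, n-2\}) + \Theta(n) & \text{if the split is } 1 : n-2, \\
\qquad\vdots \\
T(\max\{n-1,\, 0\}) + \Theta(n) & \text{if the split is } n-1 : 0.
\end{cases}
```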
Analysis of Expected Time
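• For k = 0, 1, …, n–1, define the indicator random variable:
– X_k = 1 if PARTITION generates a k : n–k–1 split, and X_k = 0 otherwise.
– E[X_k] = Pr{X_k = 1} = 1/n, since all splits are equally likely when the elements are distinct.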
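Computing the Expected Time
• Write T(n) as the sum over k of X_k · (T(max{k, n–k–1}) + Θ(n)) and take expectations of both sides.
• Independence! X_k is independent of the random choices made inside the recursive call, so the expectation of each product splits into a product of expectations.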
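The computation can be sketched as follows. It is a reconstruction of the standard derivation, where X_k denotes the indicator that PARTITION produces a k : n−k−1 split, so that E[X_k] = 1/n.

```latex
\begin{align*}
E[T(n)] &= E\!\left[\sum_{k=0}^{n-1} X_k \bigl(T(\max\{k,\, n-k-1\}) + \Theta(n)\bigr)\right] \\
        &= \sum_{k=0}^{n-1} E\!\left[X_k \bigl(T(\max\{k,\, n-k-1\}) + \Theta(n)\bigr)\right]
           && \text{(linearity of expectation)} \\
        &= \sum_{k=0}^{n-1} E[X_k]\; E\!\left[T(\max\{k,\, n-k-1\}) + \Theta(n)\right]
           && \text{(independence)} \\
        &= \frac{1}{n} \sum_{k=0}^{n-1} E\!\left[T(\max\{k,\, n-k-1\})\right] + \Theta(n) \\
        &\le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + \Theta(n)
           && \text{(each term appears at most twice)}
\end{align*}
```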
Computing the Expected Time
• How do we solve E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n–1} E[T(k)] + Θ(n)?
• Substitution method:
– We guess the answer is Θ(n)
– Prove: E[T(n)] ≤ cn for some constant c.
– Try to do the rest of this by yourself.
– The induction goes through if c is chosen large enough so that cn/4 dominates the Θ(n) term.
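One way to carry out the substitution step is sketched below. It assumes the recurrence E[T(n)] ≤ (2/n) Σ E[T(k)] + Θ(n), summed over k from ⌊n/2⌋ to n−1, and uses the fact that this range of k sums to at most 3n²/8.

```latex
\begin{align*}
E[T(n)] &\le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} c\,k + \Theta(n)
           && \text{(inductive hypothesis } E[T(k)] \le ck\text{)} \\
        &\le \frac{2c}{n} \cdot \frac{3}{8}\,n^2 + \Theta(n)
           && \Bigl(\textstyle\sum_{k=\lfloor n/2 \rfloor}^{n-1} k \le \tfrac{3}{8} n^2\Bigr) \\
        &= cn - \Bigl(\frac{cn}{4} - \Theta(n)\Bigr) \\
        &\le cn
           && \text{(if } cn/4 \text{ dominates the } \Theta(n) \text{ term)}
\end{align*}
```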
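• Is that the end of the proof? Not quite: the induction still needs a base case.
The Base Case
• For n below some constant, E[T(n)] = Θ(1), so E[T(n)] ≤ cn holds there as well once c is large enough.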
Summary of randomized order-statistic selection
• Works fast: linear expected time.
• The worst case is bad: Θ(n²).
• Still an excellent algorithm in practice.
• Question: is there an algorithm that runs in linear time even in the worst case?
• Picking the pivot randomly is simple, but it gives no worst-case guarantee.
Improvement of randomized selection
• Due to Blum, Floyd, Pratt, Rivest, and
Tarjan [1973].
• IDEA: Generate a really good pivot recursively.
• How can we do that while keeping the total cost of the recursion at Θ(n)?
Worst-case linear-time order statistics
• SELECT(i, n)
– Divide the n elements into ⌊n/5⌋ groups of 5 elements. Find the median of each 5-element group by rote.
– Recursively SELECT the median x of the ⌊n/5⌋ group medians to be the pivot.
– Partition around the pivot x. Let k = rank(x).
• If i = k then return x
• Else if i < k then recursively SELECT the ith smallest element in the lower part
• Else recursively SELECT the (i–k)th smallest element in the upper part
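The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the elements are assumed distinct (as in the lecture's analysis), the function name `select` is chosen here, sublists are copied rather than partitioned in place, and inputs with fewer than 50 elements are handled by sorting as a constant-size base case.

```python
def select(items, i):
    """Return the i-th smallest (1-indexed) element of items, assumed distinct."""
    n = len(items)
    if not 1 <= i <= n:
        raise ValueError("rank i must satisfy 1 <= i <= n")
    if n < 50:
        # Constant-size base case: sorting costs Theta(1) here.
        return sorted(items)[i - 1]

    # Step 1: median of each of the floor(n/5) full groups of 5
    # (leftover elements are ignored when choosing the pivot).
    medians = [sorted(items[g:g + 5])[2] for g in range(0, n - n % 5, 5)]

    # Step 2: recursively select the median of the group medians as the pivot.
    x = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around x; k is the rank of x among all n elements.
    lower = [v for v in items if v < x]
    upper = [v for v in items if v > x]
    k = len(lower) + 1

    if i == k:
        return x
    if i < k:
        return select(lower, i)
    return select(upper, i - k)
```

Copying into `lower` and `upper` keeps the sketch short; an in-place version would reuse a partition routine with x as the pivot.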
Choosing the pivot
• Divide the n elements into ⌊n/5⌋ groups of 5 elements (figure: ⌊n/5⌋ groups drawn side by side, 5 elements each).
• Reorganize the five elements in each group so that
– the middle one is the median.
– the upper two are less than the median.
– the lower two are bigger.
Choosing the pivot
• How much time does this take?
• Θ(n): each group of 5 is handled in constant time, and there are ⌊n/5⌋ groups.
Choosing the pivot
• Recursively select the median x of the ⌊n/5⌋ group medians to be the pivot.
• Picture the groups rearranged in order of their medians.
Choosing the pivot
• Suppose that the whole SELECT(i, n) algorithm takes time T(n). What is the running time of this step?
• T(⌊n/5⌋), which we write as T(n/5).
• Now, what do we know about all these elements?
Analysis
• Rest of the algorithm
– Partition around the pivot x. Let k = rank(x).
• If i = k then return x
• Else if i < k then recursively select the ith smallest element in the lower part
• Else recursively select the (i–k)th smallest element in the upper part
• We want the total cost to be Θ(n), so the remaining recursive call must cost strictly less than T(4n/5). Why?
– We already have a recursive call costing T(n/5), and T(n) = T(n/5) + T(4n/5) + Θ(n) would only give Θ(n lg n).
Analysis
• Look at this figure carefully.
– The directed paths (arrows) in it give us more information than we had before.
– All the elements in the block are ≤ x.
– How many elements are there?
Analysis
• At least half the group medians are ≤ x, which is at least ⌊⌊n/5⌋/2⌋ = ⌊n/10⌋ group medians.
• Each of these medians has two more elements of its group below it, so at least 3⌊n/10⌋ elements are ≤ x.
Analysis
• Look at the figure again.
– Now all the elements in the block are ≥ x.
– How many elements are there?
Analysis
• At least half the group medians are ≥ x, which is at least ⌊⌊n/5⌋/2⌋ = ⌊n/10⌋ group medians.
• Therefore, at least 3⌊n/10⌋ elements are ≥ x.
• Similarly, at least 3⌊n/10⌋ elements are ≤ x.
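The two counting bounds can be checked empirically on random inputs. The sketch below is an illustration only: the helper names are hypothetical, sorting the group medians stands in for the recursive SELECT call, and the pivot is taken to be the lower median of the ⌊n/5⌋ group medians.

```python
import random


def median_of_medians_pivot(items):
    """Lower median of the medians of the floor(n/5) full groups of 5 (needs n >= 5)."""
    n = len(items)
    medians = [sorted(items[g:g + 5])[2] for g in range(0, n - n % 5, 5)]
    medians.sort()  # stands in for the recursive SELECT call
    return medians[(len(medians) - 1) // 2]


def check_pivot_bound(n, trials=200, seed=0):
    """Return True if at least 3*floor(n/10) elements lie on each side in every trial."""
    rng = random.Random(seed)
    bound = 3 * (n // 10)
    for _ in range(trials):
        items = rng.sample(range(10 * n), n)  # n distinct values
        x = median_of_medians_pivot(items)
        at_most = sum(1 for v in items if v <= x)
        at_least = sum(1 for v in items if v >= x)
        if at_most < bound or at_least < bound:
            return False
    return True
```

A passing check is not a proof, but it is a quick way to see that the guarantee does not depend on how the input is arranged.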
Analysis
• Then, what is the cost of the recursive call in the three-case step?
– One side has at least 3⌊n/10⌋ elements.
– So the other side has at most n – 3⌊n/10⌋ elements.
– The recursive call therefore costs at most T(n – 3⌊n/10⌋).
• For n ≥ 50, we have 3⌊n/10⌋ ≥ n/4.
– This means that for n ≥ 50 we have n – 3⌊n/10⌋ ≤ 3n/4.
– T(3n/4) is even better than the T(4n/5) we were aiming for.
• For n < 50, we have T(n) = Θ(1).
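The threshold 50 can be checked directly. The short sketch below tests 3⌊n/10⌋ ≥ n/4 using integer arithmetic; the function name `bound_holds` and the upper limit of the scan are arbitrary choices made for this illustration.

```python
def bound_holds(n):
    """True if 3*floor(n/10) >= n/4, written with integers as 12*floor(n/10) >= n."""
    return 12 * (n // 10) >= n


# Scan a finite range above the threshold and collect any counterexamples.
failures = [n for n in range(50, 100_000) if not bound_holds(n)]
print(failures)          # []
print(bound_holds(49))   # False: 3*floor(49/10) = 12 < 49/4 = 12.25
```

Writing n = 10m + r with 0 ≤ r ≤ 9, the inequality 12m ≥ 10m + r holds for every r once m ≥ 5, which is why the bound starts at n = 50.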
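Total Running Time
• Dividing into groups and finding the group medians: Θ(n)
• Recursively selecting the median of the medians: T(n/5)
• Partitioning around x: Θ(n)
• Final recursive call: at most T(3n/4)
• Altogether: T(n) = T(n/5) + T(3n/4) + Θ(n)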
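Solving the recurrence
• How?
• Substitution method: guess T(n) ≤ cn.
– T(n) ≤ c(n/5) + c(3n/4) + Θ(n) = (19/20)cn + Θ(n) = cn – (cn/20 – Θ(n)) ≤ cn
– Here cn is the desired bound and cn/20 – Θ(n) is the residual, which is nonnegative if c is chosen large enough to handle the Θ(n) term.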
Conclusions
• Since the work at each level of the recursion is a constant fraction (19/20) of the level above, the work per level forms a geometric series, and the total is Θ(n).
• In practice, this algorithm runs slowly,
because the constant in front of n is large.
• The randomized algorithm is far more
practical.
Further Thought
• Why did we use groups of five?
• Why not groups of three?
• How about 7?
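One way to approach these questions is to repeat the counting argument with a different group size and compare the resulting recurrences. The sketch below ignores floors and small-n effects, so the fractions are approximations rather than exact bounds.

```latex
\begin{align*}
\text{Groups of 3:}\quad
  & \text{about } 2 \cdot \tfrac{n}{6} = \tfrac{n}{3} \text{ elements are guaranteed on each side of } x, \\
  & T(n) = T(n/3) + T(2n/3) + \Theta(n), \qquad \tfrac{1}{3} + \tfrac{2}{3} = 1
    \;\Rightarrow\; \text{this recurrence solves to } \Theta(n \lg n). \\[4pt]
\text{Groups of 5:}\quad
  & T(n) = T(n/5) + T(7n/10) + \Theta(n), \qquad \tfrac{1}{5} + \tfrac{7}{10} = \tfrac{9}{10} < 1
    \;\Rightarrow\; T(n) = \Theta(n). \\[4pt]
\text{Groups of 7:}\quad
  & \text{about } 4 \cdot \tfrac{n}{14} = \tfrac{2n}{7} \text{ elements are guaranteed on each side of } x, \\
  & T(n) = T(n/7) + T(5n/7) + \Theta(n), \qquad \tfrac{1}{7} + \tfrac{5}{7} = \tfrac{6}{7} < 1
    \;\Rightarrow\; T(n) = \Theta(n).
\end{align*}
```

The pattern to look for is whether the two recursive fractions sum to strictly less than 1.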