22S6 - Numerical and data analysis techniques Mike Peardon Hilary Term 2012

22S6 - Numerical and data analysis
techniques
Mike Peardon
School of Mathematics
Trinity College Dublin
Hilary Term 2012
Mike Peardon (TCD)
22S6 - Data analysis
Hilary Term 2012
1 / 15
The median
and sorting
The median
In describing statistical data, we often use the median.
The median is a “typical” sample.
It is defined as the middle value in the sample (so the pdf
must be non-zero at that value, which need not hold for the sample mean).
Median of n data
After ordering the data into a sequence
$S = \{X_1, X_2, X_3, \ldots, X_n\}$ where $X_1 \le X_2 \le X_3 \le \cdots \le X_n$,
consider the two cases where n is
1. Odd: $M_X = X_m$ where $m = \frac{n+1}{2}$
2. Even: $M_X = \frac{X_m + X_{m+1}}{2}$ where $m = \frac{n}{2}$
The median (2)
The median of 9 data-points
Consider the data
{3, 7, 2, 9, 6, 5, 1, 9, 8}
After ordering this data, we find the sequence
S = {1, 2, 3, 5, 6, 7, 8, 9, 9}
and so the median is MX = 6.
The median of 10 data-points
Consider the data
{23, 28, 12, 84, 92, 45, 32, 81, 11, 52}
After ordering this data, we find the sequence
S = {11, 12, 23, 28, 32, 45, 52, 81, 84, 92}
and so the median is $M_X = \frac{32+45}{2} = 38\frac{1}{2}$.
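The odd and even cases of the definition can be checked against the two examples with a few lines of Python. This is a minimal sketch: the function name `median` is chosen here for illustration, and the ordering step simply uses Python's built-in `sorted`.

```python
def median(data):
    """Return the median of a non-empty sequence of numbers."""
    if len(data) == 0:
        raise ValueError("the median of an empty data set is undefined")
    s = sorted(data)      # ordered sequence S, X_1 <= X_2 <= ... <= X_n
    n = len(s)
    m = n // 2            # 0-based index into s
    if n % 2 == 1:
        # Odd n: 1-based m = (n+1)/2 is 0-based index n // 2
        return s[m]
    # Even n: mean of X_m and X_{m+1} with 1-based m = n/2,
    # which are the 0-based entries m - 1 and m
    return (s[m - 1] + s[m]) / 2


print(median([3, 7, 2, 9, 6, 5, 1, 9, 8]))               # 6
print(median([23, 28, 12, 84, 92, 45, 32, 81, 11, 52]))  # 38.5
```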
Sorting algorithms
An algorithm is a practical method for solving some
problem.
To find the median of n data, where n is large, we would
use a computer. Finding the median is then equivalent to
solving the problem of sorting the data-set.
As we shall see, this is almost true: there is a short-cut...
There are many different approaches to solving this
problem. Are they all equally useful?
Assuming they all work, we would like to find the
algorithm that finds the correct solution in the shortest
amount of time.
Bubblesort
NB: this is an example of a bad algorithm!
The bubblesort algorithm
To sort n data:
1. For $i = 1, 2, 3, \ldots, n-1$:
2. Test if $X_i > X_{i+1}$ and if true, swap $X_i \leftrightarrow X_{i+1}$
3. Repeat steps 1, 2 until all pairs are in the right order
How many tests and swap operations on average will we
need to perform?
Suppose the smallest number starts at position k. We
need k − 1 iterations of the loop to get it to the top of the list
and each iteration requires n − 1 “>-tests”, so the method will
converge in at least (k − 1) × (n − 1) ≈ kn tests.
For jumbled data, k ∝ n, so the cost of the method grows like $n^2$.
Bubblesort is called an $O(n^2)$ algorithm.
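The three steps of bubblesort can be sketched in Python as follows. The function name `bubblesort` and the `tests` counter are illustrative additions, not part of the slides; the counter records how many “>-tests” are made so the cost estimate can be checked on small inputs.

```python
def bubblesort(data):
    """Sort a copy of `data`; return (sorted_list, number_of_tests)."""
    x = list(data)
    n = len(x)
    tests = 0
    swapped = True
    while swapped:                   # step 3: repeat until a pass makes no swap
        swapped = False
        for i in range(n - 1):       # step 1: loop over adjacent pairs (0-based)
            tests += 1
            if x[i] > x[i + 1]:      # step 2: test, and swap if out of order
                x[i], x[i + 1] = x[i + 1], x[i]
                swapped = True
    return x, tests


print(bubblesort([7, 2, 9, 4, 1]))   # ([1, 2, 4, 7, 9], 20)
```

For this input the smallest number starts at position k = 5, so the estimate (k − 1) × (n − 1) = 16 is a lower bound on the 20 tests actually made; the final pass is needed to confirm that no pair is out of order.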
Bubblesort - an example
Use bubblesort to sort 5 numbers
Start with X = {7, 2, 9, 4, 1}
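Each row is one pass through the list: the order at the start of the pass, then the order after each of the four tests.
Pass 1:  7 2 9 4 1  →  2 7 9 4 1  →  2 7 9 4 1  →  2 7 4 9 1  →  2 7 4 1 9
Pass 2:  2 7 4 1 9  →  2 7 4 1 9  →  2 4 7 1 9  →  2 4 1 7 9  →  2 4 1 7 9
Pass 3:  2 4 1 7 9  →  2 4 1 7 9  →  2 1 4 7 9  →  2 1 4 7 9  →  2 1 4 7 9
Pass 4:  2 1 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9
Pass 5:  1 2 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9  →  1 2 4 7 9
No swaps occur in pass 5, so all pairs are in the right order.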
Quicksort
Donald Knuth (The Art of Computer Programming, Vol 3):
“The bubble sort seems to have nothing to recommend it,
except a catchy name”
Are there algorithms better than $O(n^2)$? Yes.
One example: quicksort. It is popular because it is efficient on
computers.
The algorithm is an example of a recursive method.
The quicksort algorithm
Quicksort
To sort a sequence of n numbers $\Omega = \{X_1, X_2, \ldots, X_n\}$:
1. Choose one element $X_p$ (usually at random) called the
   pivot. Define two sets, $\Omega_{\mathrm{lo}}$ and $\Omega_{\mathrm{hi}}$.
2. For $i = 1, 2, 3, \ldots, p-1, p+1, \ldots, n$:
   test if $X_p > X_i$. If true, put $X_i$ in $\Omega_{\mathrm{lo}}$; otherwise put $X_i$ in $\Omega_{\mathrm{hi}}$.
3. Now apply this algorithm to sort both $\Omega_{\mathrm{lo}}$ and $\Omega_{\mathrm{hi}}$
   (recursion).
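The three steps above can be sketched in Python. This is one possible rendering, not an optimised implementation: it builds new lists `lo` and `hi` rather than partitioning in place, and the function name `quicksort` is chosen here for illustration.

```python
import random


def quicksort(omega):
    """Return a sorted copy of the sequence `omega`."""
    if len(omega) <= 1:
        return list(omega)               # nothing left to sort
    p = random.randrange(len(omega))     # step 1: pick a pivot at random
    pivot = omega[p]
    lo, hi = [], []
    for i, x in enumerate(omega):        # step 2: compare every other element
        if i == p:
            continue
        if pivot > x:
            lo.append(x)
        else:
            hi.append(x)
    # step 3: sort both sets recursively; the pivot sits between them
    return quicksort(lo) + [pivot] + quicksort(hi)


print(quicksort([12, 3, 8, 15, 4, 9, 1, 14, 7, 5]))
# [1, 3, 4, 5, 7, 8, 9, 12, 14, 15]
```

The pivot choice is random, so the intermediate partitions differ from run to run, but the sorted result does not.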
After the first pass, the pivot is in its correct final location.
If the purpose is to find the median, only one recursion
needs to be followed: the one into the set that contains the middle position.
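The short-cut for the median can be sketched as a selection routine that follows only one side of each partition. The names `select` and `median_by_selection` are hypothetical, and the sketch departs slightly from the two-set description by keeping elements equal to the pivot in a third list, which is an added assumption to handle repeated values cleanly.

```python
import random


def select(omega, k):
    """Return the element that would be at 0-based position k after sorting."""
    omega = list(omega)
    while True:
        pivot = omega[random.randrange(len(omega))]
        lo = [x for x in omega if x < pivot]
        eq = [x for x in omega if x == pivot]
        hi = [x for x in omega if x > pivot]
        if k < len(lo):
            omega = lo                     # target lies below the pivot
        elif k < len(lo) + len(eq):
            return pivot                   # target is the pivot itself
        else:
            k -= len(lo) + len(eq)         # target lies above the pivot
            omega = hi


def median_by_selection(data):
    """Median of a non-empty sequence without fully sorting it."""
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        return select(data, n // 2)
    return (select(data, n // 2 - 1) + select(data, n // 2)) / 2


print(median_by_selection([3, 7, 2, 9, 6, 5, 1, 9, 8]))               # 6
print(median_by_selection([23, 28, 12, 84, 92, 45, 32, 81, 11, 52]))  # 38.5
```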
The quicksort algorithm
How many test/swap operations are needed?
The first pass, comparing all elements in Ω to the pivot,
takes n operations.
There are two sorts to do at the next level of recursion,
each (on average) of length n/2. These two sorts need
2 × n/2 = n comparisons.
The number of levels of recursion (on average) is d where $2^d = n$,
so $d = \log_2 n$.
The total number of comparisons is then $O(n \log_2 n)$.
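The counting argument can also be written as a recurrence. This is a sketch under the idealised assumption that every pivot splits its sequence exactly in half; $T(n)$ is a symbol introduced here for the number of comparisons needed to sort n numbers.

```latex
\begin{align*}
T(n) &= n + 2\,T(n/2) \\
     &= n + 2\left(\frac{n}{2} + 2\,T(n/4)\right) = 2n + 4\,T(n/4) \\
     &\;\;\vdots \\
     &= d\,n + 2^{d}\,T(n/2^{d}), \qquad 2^{d} = n \\
     &= n \log_2 n + n\,T(1) = O(n \log_2 n)
\end{align*}
```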
Quicksort
Example: quicksorting 10 numbers
Suppose we want to sort the sequence
Ω = {12, 3, 8, 15, 4, 9, 1, 14, 7, 5}
Each row shows the sequence after one more step of the algorithm.
Start:   12  3  8 15  4  9  1 14  7  5
Step 1:   3  4  1  7  5  8 12 15  9 14
Step 2:   3  1  4  7  5  8 12 15  9 14
Step 3:   1  3  4  7  5  8 12 15  9 14
Step 4:   1  3  4  5  7  8 12 15  9 14
Step 5:   1  3  4  5  7  8  9 12 15 14
Step 6:   1  3  4  5  7  8  9 12 14 15
Step 7:   1  3  4  5  7  8  9 12 14 15
Comparing bubblesort with quicksort
[Figure: log-log plot of time to sort (secs), from 10^-6 to 10^1, against size of sort array, from 10^1 to 10^7, for quicksort and bubblesort.]
Except for tiny arrays, quicksort is much faster
Visualising the cdf
[Figure: i/n (from 0 to 1) plotted against a[i] (from -4 to 4).]
Plotting the sorted data (with its index in the array) gives
a visualisation of the cdf
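The points for such a plot can be produced with a short Python sketch. The function name `empirical_cdf` is illustrative, and the choice of standard normal samples for the demonstration is an assumption, not something stated on the slide.

```python
import random


def empirical_cdf(data):
    """Return the points (a[i], i/n), i = 1..n, for the sorted data a."""
    a = sorted(data)
    n = len(a)
    return [(a[i - 1], i / n) for i in range(1, n + 1)]


# Demonstration with standard normal samples (an assumed distribution)
rng = random.Random(1)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
points = empirical_cdf(samples)
print(points[0], points[-1])   # smallest and largest sample with their i/n
```

Passing the first and second entries of each pair to any plotting tool as the horizontal and vertical coordinates gives the staircase approximation to the cdf.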
Expectation of the median for a large sample
For a large set of independent samples, the median
should be the “middle” of the cdf.
The expected value of the median $\bar{M}_X$ in this case would obey
$F_X(\bar{M}_X) = \frac{1}{2}$
For a random value, the probability it is above (or below) the
median is $\frac{1}{2}$.
This is not true for the expected value of the mean.
Expectation of the median for a large sample
Example: Expected values of the mean and median
Consider a random number X in the range [0, 1] with pdf
$f_X(x) = 2x$
What is the expected value of the mean of X?
$E[X] = \int_0^1 x \cdot 2x \, dx = \left[\frac{2}{3} x^3\right]_0^1 = \frac{2}{3}$
What is the expected value of the median of X?
$F_X(x) = \int_0^x 2\tilde{x} \, d\tilde{x} = x^2$
so
$F_X(\bar{M}_X) = \frac{1}{2} \rightarrow \bar{M}_X = \sqrt{\frac{1}{2}} = 0.707106\ldots$
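Both results can be checked numerically with a small simulation. This is a sketch under stated assumptions: samples are drawn by inverting the cdf, so that X = √U with U uniform on [0, 1], and the sample size and seed are arbitrary choices made here.

```python
import math
import random


def sample_x(rng):
    """Draw X with pdf f_X(x) = 2x on [0, 1] by inverting F_X(x) = x^2."""
    return math.sqrt(rng.random())


rng = random.Random(2012)                  # fixed seed, arbitrary choice
xs = sorted(sample_x(rng) for _ in range(100_000))
n = len(xs)

sample_mean = sum(xs) / n
sample_median = (xs[n // 2 - 1] + xs[n // 2]) / 2   # n is even here

print(sample_mean, 2 / 3)                  # estimate next to the exact mean
print(sample_median, math.sqrt(0.5))       # estimate next to the exact median
```

With this many samples the two estimates should agree with 2/3 and 0.7071 to about two decimal places.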