22S6 - Numerical and data analysis techniques
Mike Peardon
School of Mathematics, Trinity College Dublin
Hilary Term 2012

Mike Peardon (TCD) 22S6 - Data analysis Hilary Term 2012

The median and sorting

The median

In describing statistical data, we often use the median.
The median is a “typical” sample.
It is defined as the middle value in the sample (so the pdf must be non-zero at that value, which need not be true of the sample mean).

Median of n data
After ordering the data into a sequence $S = \{X_1, X_2, X_3, \ldots, X_n\}$ where $X_1 \le X_2 \le X_3 \le \cdots \le X_n$, consider the two cases where n is
  1. Odd: $M_X = X_m$ where $m = \frac{n+1}{2}$
  2. Even: $M_X = \frac{X_m + X_{m+1}}{2}$ where $m = \frac{n}{2}$

The median (2)

The median of 9 data-points
Consider the data $\{3, 7, 2, 9, 6, 5, 1, 9, 8\}$.
After ordering this data, we find the sequence $S = \{1, 2, 3, 5, 6, 7, 8, 9, 9\}$, and so the median is $M_X = 6$.

The median of 10 data-points
Consider the data $\{23, 28, 12, 84, 92, 45, 32, 81, 11, 52\}$.
After ordering this data, we find the sequence $S = \{11, 12, 23, 28, 32, 45, 52, 81, 84, 92\}$, and so the median is $M_X = \frac{32+45}{2} = 38\tfrac{1}{2}$.

Sorting algorithms

An algorithm is a practical method for solving some problem.
To find the median of n data, where n is large, we would use a computer.
Finding the median is then equivalent to solving the problem of sorting the data-set.
As we shall see, this is only almost true: there is a short-cut...
There are many different approaches to solving this problem. Are they all equally useful?
Assuming they all work, we would like to find the algorithm that reaches the correct solution in the shortest amount of time.

Bubblesort

NB: this is an example of a bad algorithm!

The bubblesort algorithm
To sort n data:
  1. For $i = 1, 2, 3, \ldots, n-1$:
  2. Test if $X_i > X_{i+1}$ and, if true, swap $X_i \leftrightarrow X_{i+1}$
  3. Repeat steps 1, 2 until all pairs are in the right order

How many tests and swap operations will we need to perform on average?
Suppose the smallest number starts at position k. We need $k - 1$ iterations of the loop to get it to the top of the list, and each iteration requires $n - 1$ “>-tests”, so the method will converge after at least $(k-1) \times (n-1) \approx kn$ tests.
For jumbled data, $k \propto n$, so the cost of the method grows like $n^2$.
Bubblesort is called an $O(n^2)$ algorithm.

Bubblesort - an example

Use bubblesort to sort 5 numbers
Start with $X = \{7, 2, 9, 4, 1\}$. Each row below is one pass through the list: the first entry is the list at the start of the pass, and each later entry is the list after the next “>-test”.

  Pass 1:  7 2 9 4 1 | 2 7 9 4 1 | 2 7 9 4 1 | 2 7 4 9 1 | 2 7 4 1 9
  Pass 2:  2 7 4 1 9 | 2 7 4 1 9 | 2 4 7 1 9 | 2 4 1 7 9 | 2 4 1 7 9
  Pass 3:  2 4 1 7 9 | 2 4 1 7 9 | 2 1 4 7 9 | 2 1 4 7 9 | 2 1 4 7 9
  Pass 4:  2 1 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9
  Pass 5:  1 2 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9 | 1 2 4 7 9

Quicksort

Donald Knuth (The Art of Computer Programming, Vol 3): “The bubble sort seems to have nothing to recommend it, except a catchy name”
Are there algorithms better than $O(n^2)$? Yes.
One example: quicksort. It is popular because it is efficient on computers.
The algorithm is an example of a recursive method.

The quicksort algorithm

Quicksort
To sort a sequence of n numbers $\Omega = \{X_1, X_2, \ldots, X_n\}$:
  1. Choose one element $X_p$ (usually at random), called the pivot.
  2. Define two sets, $\Omega_{\mathrm{lo}}$ and $\Omega_{\mathrm{hi}}$.
     For $i = 1, 2, 3, \ldots, p-1, p+1, \ldots, n$:
       Test if $X_p > X_i$. If true, put $X_i$ in $\Omega_{\mathrm{lo}}$; otherwise put $X_i$ in $\Omega_{\mathrm{hi}}$.
  3. Now apply this algorithm to sort both $\Omega_{\mathrm{lo}}$ and $\Omega_{\mathrm{hi}}$ (recursion).
After the first pass, the pivot is in its correct final location.
If the purpose is to find the median, only one recursion needs to be followed.

The quicksort algorithm

How many test/swap operations are needed?
The first pass, comparing all elements in $\Omega$ to the pivot, takes n operations.
There are two sorts to do at the next level of recursion, each (on average) of length $n/2$. These two sorts need $2 \times n/2 = n$ comparisons.
The number of levels of recursion (on average) is d, where $2^d = n$, so $d = \log_2 n$.
The total number of comparisons is then $O(n \log_2 n)$.

Quicksort

Example: quicksorting 10 numbers
Suppose we want to sort the sequence $\Omega = \{12, 3, 8, 15, 4, 9, 1, 14, 7, 5\}$. Successive rows show the sequence as the partitioning proceeds:

  12   3   8  15   4   9   1  14   7   5
   3   4   1   7   5   8  12  15   9  14
   3   1   4   7   5   8  12  15   9  14
   1   3   4   7   5   8  12  15   9  14
   1   3   4   5   7   8  12  15   9  14
   1   3   4   5   7   8   9  12  15  14
   1   3   4   5   7   8   9  12  14  15
   1   3   4   5   7   8   9  12  14  15

Comparing bubblesort with quicksort

[Figure: time to sort (secs) against size of sort array, on logarithmic axes, for quicksort and bubblesort]
Except for tiny arrays, quicksort is much faster.

Visualising the cdf

[Figure: i/n plotted against a[i]]
Plotting the sorted data (with its index in the array) gives a visualisation of the cdf.

Expectation of the median for a large sample

For a large set of independent samples, the median should be the “middle” of the cdf.
The expected value of the median $\bar{M}_X$ in this case would obey
  $F_X(\bar{M}_X) = \frac{1}{2}$
For a random value, the probability that it is above (or below) the median is $\frac{1}{2}$.
This is not true for the expected value of the mean.

Expectation of the median for a large sample

Example: Expected values of the mean and median
Consider a random number X in the range [0, 1] with pdf $f_X(x) = 2x$.
What is the expected value of the mean of X?
  $E[X] = \int_0^1 x \cdot 2x \, dx = \left[ \frac{2x^3}{3} \right]_0^1 = \frac{2}{3}$
What is the expected value of the median of X?
  $F_X(x) = \int_0^x 2\tilde{x} \, d\tilde{x} = x^2$
so
  $F_X(\bar{M}_X) = \frac{1}{2} \rightarrow \bar{M}_X = \sqrt{\frac{1}{2}} = 0.707106\ldots$
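
The value $\bar{M}_X = \sqrt{1/2}$ can be checked numerically. The following is a minimal Python sketch, not part of the original slides. It finds the median with the short-cut mentioned for quicksort (follow only the side of the partition that contains the middle element, often called quickselect) and applies it to samples drawn with pdf $2x$. The names `quickselect`, `median` and `sample_pdf_2x` are hypothetical. The sampling step assumes the inverse-cdf method: since $F_X(x) = x^2$, taking $X = \sqrt{U}$ with U uniform on [0, 1] gives the required pdf. Unlike the two-set partition in the slides, elements equal to the pivot are counted separately so that repeated values are handled cleanly.

```python
import math
import random


def quickselect(data, k):
    """Return the k-th smallest element of data (k = 0 is the minimum)."""
    items = list(data)
    while True:
        pivot = random.choice(items)             # pivot chosen at random
        lo = [x for x in items if x < pivot]     # elements below the pivot
        hi = [x for x in items if x > pivot]     # elements above the pivot
        n_equal = len(items) - len(lo) - len(hi)  # the pivot and any ties
        if k < len(lo):
            items = lo                           # follow only the low side
        elif k < len(lo) + n_equal:
            return pivot                         # pivot is in position k
        else:
            k -= len(lo) + n_equal               # follow only the high side
            items = hi


def median(data):
    """Middle value for odd n, mean of the two middle values for even n."""
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        # 1-based index m = (n + 1) / 2 is 0-based index (n - 1) // 2
        return quickselect(data, (n - 1) // 2)
    # 1-based indices m = n / 2 and m + 1 are 0-based n // 2 - 1 and n // 2
    return 0.5 * (quickselect(data, n // 2 - 1) + quickselect(data, n // 2))


def sample_pdf_2x(n, rng):
    """Draw n samples with pdf 2x on [0, 1] using X = sqrt(U)."""
    return [math.sqrt(rng.random()) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(2012)                    # arbitrary seed
    xs = sample_pdf_2x(100_001, rng)
    print("sample median:", median(xs), "expected:", math.sqrt(0.5))
    print("sample mean:  ", sum(xs) / len(xs), "expected:", 2 / 3)
```

For a large sample the printed median should lie close to 0.7071 and the mean close to 0.6667, with differences shrinking roughly like $1/\sqrt{n}$. Applied to the two small data sets from the earlier slides, `median` should return 6 and 38.5.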