Mean, Median, Standard Deviation
Prof. McGahagan – Stat 1040

Mean = arithmetic average: add all the values and divide by the number of values.
Median = 50th percentile: sort the data and choose the middle value (or average the two middle values).

In a perfectly symmetric distribution, the mean and median will be the same. In a right-skewed distribution, the mean will be higher than the median. (Example: an income distribution with 9 individuals making $ 40,000 a year plus one individual making $ 10,000,000 a year. Median = $ 40,000; Mean = $ 10,360,000 / 10 = $ 1,036,000.)

The median is more robust than the mean – it is not as sensitive to outliers. (In the above example, suppose someone who makes $ 50,000,000 is added to the group. The median income remains $ 40,000; the mean income is $ 60,360,000 / 11 = $ 5,487,273.)

Deviation = difference of an individual value from the mean. We want to find some way of finding the "average" deviation. There are three candidates:

Mean absolute deviation = add the absolute values of all deviations and take their mean.
Median absolute deviation = the median value of the absolute deviations from the median.
Standard deviation = root mean square deviation: take the square root of the average of the squared deviations.

Example: Wage distribution with 4 individuals:
Arthur makes $ 20 an hour.
Beth makes $ 300 an hour.
Charles makes $ 40 an hour.
Diana makes $ 40 an hour.

Note that if you asked the computer to draw a histogram of this data, you would get exactly the same histogram as we did with the example in our "Data and Histograms" handout. Try it: the commands would be:

wages <- c(20, 300, 40, 40)
hist(wages, breaks = c(0, 30, 50, 500))

And you can find the results given below by the commands mean(wages), median(wages), mean(abs(wages - mean(wages))), and median(abs(wages - median(wages))).

Mean = $ 400 / 4 = $ 100
Median = middle observation or average of the two middle observations.
Sorted data = ($ 20, $ 40, $ 40, $ 300)
Median = ($ 40 + $ 40) / 2 = $ 40
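For reference, here is what those commands return (a minimal check, assuming the wages vector above has been entered); the last two values are the mean and median absolute deviations worked out step by step later in the handout:

wages <- c(20, 300, 40, 40)
mean(wages)                           # 100
median(wages)                         # 40
mean(abs(wages - mean(wages)))        # 100  (mean absolute deviation)
median(abs(wages - median(wages)))    # 10   (median absolute deviation)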
The mean and median have some interesting properties, which can be explained after giving a little thought to a bet we might propose: guess the hourly wage of the next person who walks through the door (we know Arthur, Beth, Charles and Diana are in a meeting, and we are betting on who will walk out next). You are risk neutral, so you will take any bet which has a positive expected value. If you agree to play the game, I will give you $ 80 before you guess. However, you will lose the absolute value of the difference between your guess and the hourly wage of the person who actually walks through the door next. If you guess $ 50 and Diana comes in, your payoff is $ 80 - |$ 40 - $ 50| = + $ 70. If you guess $ 50 and Beth comes in, your payoff is $ 80 - |$ 300 - $ 50| = - $ 170.

Should you play the game at all? What guess should you make if you play the game? Consider the mean and median as logical guesses. You might also consider guessing Beth's wage to avoid the maximum possible loss. To answer the question, you must calculate the expected value of the game for each guess. For example, if you guess Beth's wage of $ 300, the expected value will be the average of all four payoffs, which we assume equally likely.

If Arthur walks in, your payoff is:   $ 80 - |$ 300 - $ 20|  = $ 80 - $ 280 = - $ 200
If Beth walks in, your payoff is:     $ 80 - |$ 300 - $ 300| = $ 80 - $ 0   = + $ 80
If Charles walks in, your payoff is:  $ 80 - |$ 300 - $ 40|  = $ 80 - $ 260 = - $ 180
If Diana walks in, your payoff is:    $ 80 - |$ 300 - $ 40|  = $ 80 - $ 260 = - $ 180

The expected value of the game is therefore (- 200 + 80 - 180 - 180) / 4 = - $ 480 / 4 = - $ 120

Would another guess make it profitable to play the game? First, try guessing the mean value of $ 100.

If Arthur walks in, your payoff is:   $ 80 - |$ 100 - $ 20|  = $ 80 - $ 80  = $ 0
If Beth walks in, your payoff is:     $ 80 - |$ 100 - $ 300| = $ 80 - $ 200 = - $ 120
If Charles walks in, your payoff is:  $ 80 - |$ 100 - $ 40|  = $ 80 - $ 60  = + $ 20
If Diana walks in, your payoff is:    $ 80 - |$ 100 - $ 40|  = $ 80 - $ 60  = + $ 20

The expected value of the game is therefore (0 - 120 + 20 + 20) / 4 = - $ 80 / 4 = - $ 20

It is not worthwhile to play and guess the mean. How about guessing the median value of $ 40?

If Arthur walks in, your payoff is:   $ 80 - |$ 40 - $ 20|  = $ 80 - $ 20  = + $ 60
If Beth walks in, your payoff is:     $ 80 - |$ 40 - $ 300| = $ 80 - $ 260 = - $ 180
If Charles walks in, your payoff is:  $ 80 - |$ 40 - $ 40|  = $ 80 - $ 0   = + $ 80
If Diana walks in, your payoff is:    $ 80 - |$ 40 - $ 40|  = $ 80 - $ 0   = + $ 80

The expected value of the game is therefore (60 - 180 + 80 + 80) / 4 = + $ 40 / 4 = + $ 10

If you are risk neutral, it is worthwhile to play the game and guess the median income – the expected value of the game for you is positive. Deviations from the median result in a smaller sum of absolute errors, and hence a smaller average absolute error, than deviations from the mean.

It clearly sounds like a good idea to minimize the sum of errors – whether in this betting game or in (say) estimating your minimum-cost level of output. So why bother with the mean? Isn't the median the better of the two?

Answer: not always. If you think that avoiding a big loss is also an important consideration, you might look more kindly on the mean. Even though the expected value is negative, you avoid losing money in three of the four possible events (you break even if Arthur walks in), and you take a smaller maximum loss than if you had guessed the median. The mean minimizes the sum of squared deviations, and so minimizes the really big errors one might make with the median. Often small errors are not that bad, and large errors are really terrible.

Example: Estimating cost functions. You should remember from microeconomics that the average cost curve is roughly U-shaped. The textbook average cost curves look nice and smooth – but estimating real ones from data is likely to lead to a lot of points scattered about a U-shaped curve. Suppose you are interested in choosing the quantity of output which will minimize your firm's average cost, and the true average cost curve is AC = 1000 + (Q - 500)². The minimum average cost will be at a quantity of 500, as is clear without calculus (a more realistic AC curve would require some calculus; more realistic economics would note that firms should be interested in maximizing profit rather than minimizing average cost, but simplicity is better than realism here). Here's the curve in red, with a horizontal blue line at the minimum cost of 1000. The important point here is not so much the optimum quantity of 500, but the fact that a small mistake in choosing that optimum quantity will not cost you very much – but a large mistake will.
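The handout does not show the commands used to draw that figure; a minimal R sketch that reproduces something like it (the range of Q values is my assumption, and square is the helper function defined in the appendix):

square <- function(x) x * x      # squaring helper, also defined in the appendix
Q <- seq(0, 1000)                # assumed range of output levels for the plot
AC <- 1000 + square(Q - 500)     # the average cost curve from the example
plot(Q, AC, type = "l", col = "red", lwd = 3)    # the AC curve in red
abline(h = 1000, col = "blue")   # horizontal blue line at the minimum cost of 1000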
Let's define the COE (Cost of Error) as the extra average cost you incur by choosing Q instead of the optimum Q* = 500: COE(Q) = AC(Q) - AC(Q*) = (500 - Q)², so that we can compute:

COE(500) = (500 - 500)² = 0
COE(501) = (500 - 501)² = 1
COE(510) = (500 - 510)² = 100
COE(600) = (500 - 600)² = 10,000

An error of 10 is not 10 times as costly as an error of 1, but 100 times as costly; an error of 100 is not 100 times as costly, but 10,000 times as costly. The cost function increases with the square of the error. So, if our real target is the cost of the error rather than the arithmetic magnitude of the error, we want an estimator which penalizes the squared error, not the absolute value of the error.

Other examples:

Medicine: giving a patient a little more or a little less of a drug may be quite harmless – but cutting the dose in half or doubling it may kill the patient.

Grading: a history teacher asking when Thomas Jefferson and John Adams died (both died the same day, July 4, 1826) might not penalize you very much if you said July 4, 1825 – but might flunk you for the course if you said October 12, 1492.

Astronomy: if you were trying to estimate a missing asteroid's course on the basis of a few observations, a small error in the estimate might still leave the asteroid in the viewing field of your telescope – but a larger error would leave it invisible. This was exactly the problem that Carl Friedrich Gauss was trying to solve when he invented "least squares estimation" in 1809.

Does the mean value minimize the sum of squared errors in our example? Make your guess for the best estimator of "average" income in the list (20, 40, 40, 300).

Median = 40:
Errors:          (20 - 40) = -20    (40 - 40) = 0    (40 - 40) = 0    (300 - 40) = 260
Squared errors:  400                0                0                260 * 260 = 67,600
Sum of squared errors = 68,000

Mean = 100:
Errors:          (20 - 100) = -80    (40 - 100) = -60    (40 - 100) = -60    (300 - 100) = 200
Squared errors:  80 * 80 = 6,400     60 * 60 = 3,600     60 * 60 = 3,600     200 * 200 = 40,000
Sum of squared errors = 53,600

To examine the other possibilities, see the Appendix. A calculus-based proof is also presented in the appendix. It will not be on the exam.

Computational details: Mean/median absolute deviation and standard deviation.

Mean absolute deviation:
( |20 - 100| + |40 - 100| + |40 - 100| + |300 - 100| ) / 4
Note that the vertical lines are the math symbol for "take the absolute value."
= (80 + 60 + 60 + 200) / 4 = $ 400 / 4 = $ 100

Median absolute deviation:
median( |20 - 40|, |40 - 40|, |40 - 40|, |300 - 40| )
= median(20, 0, 0, 260) = ($ 0 + $ 20) / 2 = $ 10
Note: in R, the median absolute deviation (or MAD) is defined with an adjustment constant chosen so that it is in a predictable relation to a normal distribution. To get our results, you would set the adjustment constant to 1, with the command mad(wages, constant=1).

Standard deviation:
First, compute the sum of squared deviations:
(20 - 100)² + (40 - 100)² + (40 - 100)² + (300 - 100)²
= (-80)² + (-60)² + (-60)² + 200²
= 6,400 + 3,600 + 3,600 + 40,000 = 53,600
Second, find the mean squared deviation = sum of squared deviations / number of observations. The mean squared deviation is also called the variance.
Variance = 53,600 / 4 = 13,400
The standard deviation in our example is sqrt(13,400) = 115.7584
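All of these figures can be checked in R (a minimal sketch, assuming the wages vector defined earlier); the expected values appear as comments:

wages <- c(20, 300, 40, 40)
mean(abs(wages - mean(wages)))        # 100       mean absolute deviation
mad(wages, constant = 1)              # 10        median absolute deviation
mean((wages - mean(wages))^2)         # 13400     variance (mean squared deviation)
sqrt(mean((wages - mean(wages))^2))   # 115.7584  population standard deviation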
Note well: the text and Stark use what is called the population standard deviation = sqrt(SSE / N) in their calculations throughout the book. There is a (very technical) argument that a slight change to the formula gives a more reliable estimate of the variance when dealing with small samples, and the sample standard deviation = sqrt(SSE / (N - 1)) is used by some texts and computer programs (including R) when computing standard deviations. The very slight technical advantage of the more complicated formula in some situations does not in my mind justify inflicting it on beginning students. The basic idea that the standard deviation is one of three possible ways to measure the "average" deviation is watered down by forcing the formula on students, and it will not be used at all in this course. But keep it in mind if your calculator or computer program gives you a slightly different standard deviation than you calculated. In R, you can define

stdev <- function(x) sqrt(mean((x - mean(x))^2))

to get the text formula; the R command sd(x) will give the sample standard deviation.

Properties of the mean and standard deviation:

Both the text and Stark explore at length what happens to the mean and SD if you add, subtract, multiply and/or divide a list of numbers by another number. Such a transformation of a list of numbers by simple arithmetic operations (not including squares, square roots, logarithms, or trigonometric functions) is known as a linear transformation or (in Stark's online text) as an affine transformation (a term I will not use on exams).

Consider a few examples (and run the numbers through your calculator to confirm my statements and to get some practice in calculating):

Consider the list of numbers x = (3, 10, 7, 2, 3).
Confirm that mean(x) = 5 and stdev(x) = 3.03315 (and that R's sd(x) = 3.391165).

Add 3 to each number in x to get another list, y = (6, 13, 10, 5, 6).
Confirm that mean(y) = 8 and stdev(y) = 3.03315 (and R's sd(y) also remains unchanged).

Multiply each number in the original list x by -2 to get the list z = (-6, -20, -14, -4, -6).
Confirm that mean(z) = -10 and stdev(z) = 6.0663 (and that sd(z) = 2 * sd(x)).
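A minimal R sketch that reproduces these three checks (using the stdev function defined above), with the values you should see as comments:

stdev <- function(x) sqrt(mean((x - mean(x))^2))   # the text (population) formula
x <- c(3, 10, 7, 2, 3)
mean(x); stdev(x); sd(x)     # 5, 3.03315, 3.391165
y <- x + 3
mean(y); stdev(y)            # 8, 3.03315   (adding a constant leaves the SD unchanged)
z <- -2 * x
mean(z); stdev(z); sd(z)     # -10, 6.0663, 6.78233   (both SDs are doubled)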
Answer the following questions (true/false, but explain any false statement or part of a statement). If necessary, work your own example to confirm or refute the statements. Notice that refutation only requires a single counterexample, but a confirmation could be accidental, so you should try two or three examples to confirm that your example would work for both fractions and whole numbers, positive and negative numbers.

______ 1. Multiplying any list by a positive number k will lead to a list with a mean and SD k times greater.
______ 2. Subtracting 10,000 from any list will decrease the mean by 10,000 but the SD by only 100, since in calculating the SD we take the square root.
______ 3. Adding any number, whether positive or negative, to a list will not change the SD of a list.
______ 4. Multiplying any list by a negative number will lead to a negative mean and negative standard deviation.
______ 5. Squaring every number in a list will square both the mean and the SD.
______ 6. Squaring every number in a list will lead to a larger mean and SD, but not necessarily to an exact square of either mean or SD.
______ 7. If the standard deviation of a list is zero, every item on that list must be zero.
______ 8. The standard deviation of the list (-3, -2, -1, 0, 1, 2, 3) is zero.

See the end of the appendix for answers.

Appendix

1. Demonstration that the median minimizes the absolute value of the difference among all reasonable guesses.

Would any other strategy be better? We can try to see this graphically, by looking at the outcome of all guesses between $ 0 and $ 100. Note: lines beginning with a prompt (>) indicate what you are supposed to type. (Don't type the prompt.) Press return where I start a new line, and the computer will provide a "+" to indicate that you are continuing the line.

First, define an R function:

> payoff <- function(guess, payment=80, salaries=c(20,300,40,40)) {
+     mean(payment - abs(salaries - guess)) }
# defines payoff as a function of three arguments:
#   guess, which you must supply
#   payment, which defaults to 80
#   salaries, which defaults to a list of all the salaries we assumed

Then set guesses to all integers from 1 to 100:

> guesses <- seq(1, 100)

The computation is done simply:

> payoffs <- sapply(guesses, payoff)
# sapply is short for "simple apply" – we apply the payoff function to all the guesses we made.

> plot(guesses, payoffs, type="l", col="red", lwd=3)   # draws the plot
> abline(v=40, col="blue")                             # adds the vertical blue line at the median

2. Demonstration that the mean minimizes the sum of squared errors

First, define two R functions, square and sse:

> square <- function(x) (x * x)
> sse <- function(guess, salaries=c(20,300,40,40)) {
+     sum(square(salaries - guess)) }

Second, set the range of your guesses:

> guesses <- seq(40, 300)    # gives you the sequence 40, 41, 42, ..., 299, 300

Confirm that the mean beats the median:

> sse(40)     # we guess the median, and should get 68,000
> sse(100)    # we guess the mean, and should get 53,600

Compute and save the sums of squared errors (note that R is case sensitive: SSEs is not the same as sses):

> SSEs <- sapply(guesses, sse)

Plot the results:

> plot(guesses, SSEs, col="red", type="l", lwd=2, main="Sum of Squared Errors")
> abline(h=53600, col="blue")   # horizontal line at the minimum sum of squared errors
> abline(v=100)                 # vertical line at a guess of the mean

Change your guesses to seq(50, 150) to focus in on the critical area, or calculate sse(99.9), sse(100), sse(100.1) to see that this is the real minimum.

3. Calculus-based proof that the mean minimizes the sum of squared errors

Start with a list of numbers x. There are N numbers in the list. The SSE is defined as

    SSE = Σ (x - b)²

where Σ is the summation sign (over all N observations) and b is any guess. We want the guess that minimizes this value, so we take the derivative of the SSE with respect to b and set it equal to zero:

    dSSE/db = Σ -2 (x - b) = -2 Σ x + 2 Σ b = -2 Σ x + 2 N b
    (note that the sum of N b's is N times b)

Set the derivative equal to zero and solve: divide through by 2, so - Σ x + N b = 0; add Σ x to each side of the equation, so N b = Σ x; divide both sides by N, so

    b = Σ x / N

which is of course the arithmetic mean – but note that we were looking for any expression that would minimize the sum of squared errors.
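A quick numerical cross-check of this result, using R's built-in optimize function (a minimal sketch, assuming the sse function from section 2 has been defined):

> optimize(sse, interval = c(0, 500))
# $minimum should be (essentially) 100, the mean of the salary list,
# and $objective should be 53,600, the minimum sum of squared errors.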
4. Answers to true/false questions on linear transformations and their effect on mean and SD

______ 1. Multiplying any list by a positive number k will lead to a list with a mean and SD k times greater.
True – but note that if k is a fraction, "k times greater" means smaller in absolute terms.

______ 2. Subtracting 10,000 from any list will decrease the mean by 10,000 but the SD by only 100, since in calculating the SD we take the square root.
False – the mean does decrease by 10,000, but the SD remains unchanged by adding or subtracting any number.

______ 3. Adding any number, whether positive or negative, to a list will not change the SD of a list.
True – the location of all numbers changes, but their spread does not.

______ 4. Multiplying any list by a negative number will lead to a negative mean and negative standard deviation.
False. If you begin with a list of all negative numbers, multiplying by a negative number will give a list of all positive numbers, and hence a positive mean. The standard deviation can never be negative – squaring all deviations results in a list of non-negative numbers.

______ 5. Squaring every number in a list will square both the mean and the SD.
False – we are not dealing with a linear transformation, so the simple rules above do not apply; the new mean and SD are not simply the squares of the old ones (see the counterexample in the next answer).

______ 6. Squaring every number in a list will lead to a larger mean and SD, but not necessarily to an exact square of either mean or SD.
False if the list is a list of positive fractions, say x = (0.3, 0.4, 0.5). The mean is 0.4, and the SD is 0.08165. The new list is z = (0.09, 0.16, 0.25). The mean of this list is 0.1667, and the SD is 0.06549. Both the mean and SD are smaller after squaring.

______ 7. If the SD of a list is zero, every item in the list must be zero.
False – every item must be the SAME, but not necessarily zero. Compute the SD of (8, 8, 8) if you don't believe this.

______ 8. The standard deviation of the list (-3, -2, -1, 0, 1, 2, 3) is zero.
False – the MEAN is zero. The SD will be non-zero whenever the items are not all the same. In this case, it is sqrt(28 / 7) = 2.
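If you would rather check the last few answers in R than by hand, a short sketch (again assuming the population stdev function defined earlier in the handout):

stdev <- function(x) sqrt(mean((x - mean(x))^2))
x <- c(0.3, 0.4, 0.5)
mean(x); stdev(x)         # 0.4, 0.08165     (question 6, before squaring)
mean(x^2); stdev(x^2)     # 0.1667, 0.06549  (question 6, after squaring: both smaller)
stdev(c(8, 8, 8))         # 0                (question 7: the SD is zero, the items are not)
stdev(-3:3)               # 2                (question 8)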