Numerical Measures

In this handout we will develop numerical measures which will help us describe a data set. We begin with some definitions of great importance for understanding inferential statistics.

A population is the entire body of data from which a sample may be drawn.
A sample is a specific subset of a population.
A statistic is a numerical measure which is computed from a sample of data.
A parameter is a numerical measure of a population. Parameters are usually represented with Greek letters.

[Diagram: Population with its Parameters; Sample with its Statistics]

As a little foreshadowing, you should know that questions of interest are typically questions about parameters. Inferential statistics, which we will talk about in great detail, is the study of using statistics to answer questions about parameters.

Numerical Measures of Populations: Parameters

Measures of Central Tendency

(1) The mean or average: Let $x_i$ be the $i$th data point in a population, $i = 1, \ldots, N$. Then

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

A mean is also the number which minimizes the sum of squared distances to each of the population values.

Example: Consider a population made up of the five values: 3, 4, 1, 7, 5. Then

$$\mu = \frac{3 + 4 + 1 + 7 + 5}{5} = 4$$

(2) A median is the middle value of a population where the values have been ordered from smallest to largest. What is the median for the population above? What should we do if our ordered population looks like: 2, 3, 4, 5, 6, 7? That is, what if there is no unique middle value?

A median minimizes the sum of the absolute distances to the data points. As a result, the median is not as readily influenced by outliers as the mean is.

Example: Two populations

{1, 2, 3, 4, 5}:     μ = ____   M = ____
{1, 2, 3, 4, 5000}:  μ = ____   M = ____

Homework: For the population {1, 2, 3, 4, 5, 6}, compute (1) the sum of squared distances and (2) the sum of absolute distances from each population value to the number (a) 4.0 and (b) 5.0.
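The two distance sums in the homework can also be computed in a few lines of code instead of EXCEL. The following is a minimal Python sketch and is not part of the original handout; the function names `sum_squared_distances` and `sum_absolute_distances` are illustrative choices. It evaluates both sums at 3.5, the mean of the population {1, 2, 3, 4, 5, 6}.

```python
def sum_squared_distances(population, point):
    """Sum of squared distances from each population value to `point`."""
    return sum((x - point) ** 2 for x in population)


def sum_absolute_distances(population, point):
    """Sum of absolute distances from each population value to `point`."""
    return sum(abs(x - point) for x in population)


population = [1, 2, 3, 4, 5, 6]
mu = sum(population) / len(population)  # population mean, 3.5

print(sum_squared_distances(population, mu))   # 17.5
print(sum_absolute_distances(population, mu))  # 9.0
```

Passing 4.0 or 5.0 as `point` in place of `mu` gives the quantities the homework asks for.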
To illustrate, here are the sum of squared distances and the sum of absolute distances from each population value to 3.5 (μ):

Sum-of-squared-distances = $(1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2 = 17.5$

Sum-of-absolute-distances = $|1 - 3.5| + |2 - 3.5| + |3 - 3.5| + |4 - 3.5| + |5 - 3.5| + |6 - 3.5| = 9.0$

Now you do the computations for 4.0 and 5.0. You may use EXCEL if you wish.

(3) A mode is the most frequently occurring value in a population.

Example: What is the mode for the population {1, 2, 3, 3, 5}?

Descriptions of populations: Unimodal, Bimodal, Skewed

Measures of Variability or Dispersion

While a measure of central tendency is very useful, it does not distinguish between populations which look considerably different. For example, the mean, median, and mode of the two populations {1000, 10000, 10000, 19000} and {9000, 10000, 10000, 11000} are exactly the same, but look at their frequency distributions.

[Figure: frequency distributions of population 1 and population 2; horizontal axis from 1000 to 19000, vertical axis from 0 to 2]

If these represent the populations of incomes for a summer intern position, which population has a distribution which is more “equitable”? Yet both populations have μ = $10,000. The key distinguishing feature between these populations is the amount of dispersion exhibited by their values. Which population do you think has the greatest dispersion?

(1) The range is the difference between the largest and smallest value in the population.

Range = largest value - smallest value

What is the range for population 1 above? For population 2? Now consider the following populations:

{1, 1, 1, 1, 5}     {1, 2, 3, 4, 5}

What are their ranges?

It should be obvious, however, that these two populations are dispersed in significantly different ways. In this sense, a range provides a naïve measure of dispersion because it takes into account only two values in the population. Which ones?
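The range formula above can be written as a short Python function. This is an illustrative sketch rather than part of the handout, and the helper name `population_range` is an assumption made for the example.

```python
def population_range(population):
    """Range = largest value - smallest value."""
    return max(population) - min(population)


# The two small populations from the text.
pop_a = [1, 1, 1, 1, 5]
pop_b = [1, 2, 3, 4, 5]

# Each call inspects only the minimum and maximum of its population.
print(population_range(pop_a), population_range(pop_b))
```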
Note that in some sense the measure does take into account all of the values but in a loose way. How?

Another way of developing a measure of dispersion is to measure how far each value is from some fixed point. For example, we could choose our fixed point to be zero. Unfortunately, the two populations {-1, 0, 1} and {4, 5, 6} would have different measures of dispersion although they have (intuitively) the same dispersion. Why?

[Figure: frequency plots of the two populations; horizontal axes from -2 to 2 and from 3 to 7, vertical axes from 0 to 2]

Perhaps a better “fixed” point would be in the middle of the values, like the population mean. With the population {1, 3, 4, 5, 7}, we have μ = 4 and

x_i         1    3    4    5    7    Sum
x_i - μ    -3   -1    0    1    3      0

Thus

$$\sum_{i=1}^{5} (x_i - \mu) = 0.$$

In fact:

$$\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu = \sum_{i=1}^{N} x_i - N\mu = \sum_{i=1}^{N} x_i - N\left[\frac{\sum_{i=1}^{N} x_i}{N}\right] = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} x_i = 0$$

This will happen for any population. Is this a very good measure of dispersion? How can we correct for this?

(2) A variance is the average or mean squared distance to the population mean (μ).

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$

The standard deviation, σ, is the square root of the variance. For the population {1, 3, 4, 5, 7}:

x_i            1    3    4    5    7    Sum
x_i - μ       -3   -1    0    1    3      0
(x_i - μ)^2    9    1    0    1    9     20

Then σ² = 20/5 = 4, and σ = √4 = 2.
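The worked example for {1, 3, 4, 5, 7} can be checked with a minimal Python sketch. It is not part of the original handout, and the function name `population_variance` is an illustrative choice. It divides by N, as in the population formula above.

```python
import math


def population_variance(population):
    """Mean squared distance from each value to the population mean."""
    n = len(population)
    mu = sum(population) / n
    return sum((x - mu) ** 2 for x in population) / n


population = [1, 3, 4, 5, 7]
mu = sum(population) / len(population)      # 4.0
deviations = [x - mu for x in population]   # [-3.0, -1.0, 0.0, 1.0, 3.0]

print(sum(deviations))                      # 0.0, the deviations cancel
variance = population_variance(population)  # 20 / 5 = 4.0
sigma = math.sqrt(variance)                 # 2.0
print(variance, sigma)
```

Python's standard `statistics` module provides `pvariance` and `pstdev`, which compute the same population quantities.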