Chapter 3 Descriptive Statistics: Numerical Methods

Section 3.1 Describing Central Tendency

Population Parameters
A population parameter is a number calculated from all the population measurements that describes some aspect of the population.
The population mean, denoted \(\mu\), is a population parameter and is the average of the population measurements.

The Mean
Population: \(X_1, X_2, \ldots, X_N\)
Population mean:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]
Sample: \(x_1, x_2, \ldots, x_n\)
Sample mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Point Estimates and Sample Statistics
A point estimate is a one-number estimate of the value of a population parameter.
A sample statistic is a number calculated using sample measurements that describes some aspect of the sample.
Use sample statistics as point estimates of the population parameters.
The sample mean, denoted \(\bar{x}\), is a sample statistic and is the average of the sample measurements.
The sample mean is a point estimate of the population mean \(\mu\).

Measures of Central Tendency
Mean, \(\mu\): The average or expected value
Median, \(M_d\): The value of the middle point of the ordered measurements
Mode, \(M_o\): The most frequent value

The Sample Mean
For a sample of size n, the sample mean is defined as
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
and is a point estimate of the population mean \(\mu\).
• It is the value to expect, on average and in the long run

Example: Car Mileage Case
Sample mean for the first five car mileages from Table 2.1: 30.8, 31.7, 30.1, 31.6, 32.1
\[ \bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26 \]

Example: Car Mileage Case Continued
Sample mean for all the car mileages from Table 2.1:
\[ \bar{x} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1546.1}{49} = 31.5531 \]
Based on this calculated sample mean, the point estimate of the mean mileage of all cars is 31.5531 mpg.

The Median
The population or sample median \(M_d\) is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it.
The median \(M_d\) is found as follows:
1. If the number of measurements is odd, the median is the middlemost measurement in the ordered values
2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordered values

Example 2.3: Sample Median
Internist's Yearly Salaries (in thousands of dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Because n = 13 is odd, the median is the middlemost (7th) value of the ordered data, so \(M_d = 152\).
An annual salary of $180,000 is in the high end, well above the median salary of $152,000.
• In fact, $180,000 is a very high and competitive salary

The Mode
The mode \(M_o\) of a population or sample of measurements is the measurement that occurs most frequently.
• Modes are the values that are observed "most typically"
• Sometimes the highest frequency occurs at two or more values
• If there are two modes, the data is bimodal
• If there are more than two modes, the data is multimodal
• When data are in classes, the class with the highest frequency is the modal class
• The modal class is the tallest box in the histogram

Example 2.4 DVD Recorder Satisfaction
Satisfaction ratings on a scale of 1 (not satisfied) to 10 (extremely satisfied), arranged in increasing order:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Because n = 20 is even, the median is the average of the two middlemost ratings; these are the 10th and 11th values.
Both of these are 8, so \(M_d = 8\).
Because the rating 8 occurs with the highest frequency, \(M_o = 8\).

Comparing Mean, Median & Mode
The median is not affected by extreme values.
• "Extreme values" are values much larger or much smaller than most of the data
• The median is resistant to extreme values
The mean is strongly affected by extreme values.
• The mean is sensitive to extreme values

Payment Time Case
Mean = 18.108 days
Median = 17.000 days
Mode = 16.000 days
So:
Expect the mean payment time to be 18.108 days.
A long payment time would be > 17 days and a short payment time would be < 17 days.
The typical payment time is 16 days.

Section 3.2 Measures of Variation
Figure 2.31 indicates that we need measures of variation to express how the two distributions differ.
Figure 2.31: 20 Repair Times for Personal Computers at Two Service Centers

The Range
Range = largest measurement - smallest measurement
The range measures the interval spanned by all the data.
Example 2.3: Internist's Salaries (in thousands of dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Range = 241 - 127 = 114 ($114,000)

The Population Variance \(\sigma^2\) (pronounced sigma squared)
The average of the squared deviations of all the population measurements from the population mean \(\mu\).
The Population Standard Deviation \(\sigma\) (pronounced sigma)
The square root of the variance.

The Variance
Population: \(X_1, X_2, \ldots, X_N\)
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]
Sample: \(x_1, x_2, \ldots, x_n\)
Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

The Variance
For a population of size N, the population variance \(\sigma^2\) is defined as
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} \]
For a sample of size n, the sample variance \(s^2\) is defined as
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} \]
and is a point estimate of \(\sigma^2\).

The Standard Deviation
Population standard deviation: \(\sigma = \sqrt{\sigma^2}\)
Sample standard deviation: \(s = \sqrt{s^2}\)

Example 2.6 The Car Mileage Case
Sample variance and standard deviation for the first five car mileages from Table 2.1: 30.8, 31.7, 30.1, 31.6, 32.1
\[ s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5-1} = \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4} = \frac{2.572}{4} = 0.643 \]
so \(s = \sqrt{0.643} \approx 0.8019\).
Sample variance and standard deviation for all car mileages from Table 2.1:
\[ s^2 = \frac{\sum_{i=1}^{49} (x_i - \bar{x})^2}{49-1} = \frac{30.66204}{48} = 0.638793, \qquad s = \sqrt{s^2} = \sqrt{0.638793} = 0.7992 \]
The point estimate of the variance of the mileages of all cars is 0.638793 mpg², and the point estimate of the standard deviation is 0.7992 mpg.

The computational formula for the sample variance \(s^2\):
\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] \]

Example 2.7 The Payment Time Case
Consider the sample of 65 payment times in Table 2.2.
\[ \sum_{i=1}^{65} x_i = x_1 + x_2 + \cdots + x_{65} = 22 + 19 + \cdots + 21 = 1{,}177 \]
\[ \sum_{i=1}^{65} x_i^2 = x_1^2 + x_2^2 + \cdots + x_{65}^2 = (22)^2 + (19)^2 + \cdots + (21)^2 = 22{,}317 \]
Therefore
\[ s^2 = \frac{1}{65-1} \left[ 22{,}317 - \frac{(1{,}177)^2}{65} \right] = \frac{1{,}004.2462}{64} = 15.69135 \]
and \(s = \sqrt{s^2} = \sqrt{15.69135} = 3.9612\) days.

Section 3.3 The Normal Curve
A symmetrical and bell-shaped curve for a normally distributed population.
The height of the normal curve over any point represents the relative proportion of values near that point.
Example 2.1, The Car Mileage Case
[Figure: Daily Return of ZSYH for Recent 5 Years; vertical axis: daily return, -15.00% to 15.00%; horizontal axis: 2006.4.21 to 2010.4.21]

The Empirical Rule for Normal Populations
If a population has mean \(\mu\) and standard deviation \(\sigma\) and is described by a normal curve, then
1. 68.26% of the population measurements lie within one standard deviation of the mean: \([\mu - \sigma, \mu + \sigma]\)
2. 95.44% of the population measurements lie within two standard deviations of the mean: \([\mu - 2\sigma, \mu + 2\sigma]\)
3. 99.73% of the population measurements lie within three standard deviations of the mean: \([\mu - 3\sigma, \mu + 3\sigma]\)

The Empirical Rule
The Empirical Rule holds for normally distributed populations.
This rule also approximately holds for populations having mound-shaped (single-peaked) distributions that are not very skewed to the right or left.
For example, the distribution of the 65 payment times indicates that the empirical rule holds.
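The variance formulas and the empirical rule percentages above can be checked numerically. The following is a minimal sketch in plain Python and is not part of the original slides; the function names (sample_variance, sample_variance_shortcut, normal_coverage) are illustrative assumptions. It computes the sample variance from the definition, computes it again from the computational formula using only n, the sum of the measurements, and the sum of their squares, and evaluates the proportion of a normal population lying within k standard deviations of the mean through the error function.

```python
import math


def sample_variance(xs):
    """Sample variance from the definition: squared deviations summed, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(n, sum_x, sum_x2):
    """Computational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    if n < 2:
        raise ValueError("need at least two measurements")
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def normal_coverage(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# First five car mileages (Example 2.6).
first_five = [30.8, 31.7, 30.1, 31.6, 32.1]
print("s^2, first five mileages:", round(sample_variance(first_five), 4))

# Payment time summary values (Example 2.7): n = 65, sum = 1,177, sum of squares = 22,317.
s2_payment = sample_variance_shortcut(65, 1177, 22317)
print("s^2, payment times:", round(s2_payment, 5))
print("s, payment times:", round(math.sqrt(s2_payment), 4))

# Empirical rule percentages for k = 1, 2, 3.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * normal_coverage(k):.2f}%")
```

The exact normal proportions come out slightly above 68.26% and 95.44% at one and two standard deviations; the figures quoted in the slides most likely reflect rounding in a four-decimal normal table.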

Example 2.8 The Car Mileage Case
Recall that for the 49 car mileages
\[ \bar{x} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1546.1}{49} = 31.5531, \qquad s^2 = \frac{\sum_{i=1}^{49} (x_i - \bar{x})^2}{49-1} = \frac{30.66204}{48} = 0.638793, \qquad s = \sqrt{0.638793} = 0.7992 \]
Rounding these to 31.6 and 0.8:
68.26% of all individual cars will have mileages in the range \([\bar{x} \pm s] = [31.6 \pm 0.8] = [30.8, 32.4]\) mpg
95.44% of all individual cars will have mileages in the range \([\bar{x} \pm 2s] = [31.6 \pm 1.6] = [30.0, 33.2]\) mpg
99.73% of all individual cars will have mileages in the range \([\bar{x} \pm 3s] = [31.6 \pm 2.4] = [29.2, 34.0]\) mpg

Tolerance Intervals
An interval that contains a specified percentage of the individual measurements in a population is called a tolerance interval.
The one, two, and three standard deviation intervals around \(\mu\) given in (1), (2) and (3) are tolerance intervals containing, respectively, 68.26 percent, 95.44 percent and 99.73 percent of the measurements in a normally distributed population.
The three-sigma interval \([\mu \pm 3\sigma]\) is often interpreted as a tolerance interval that contains almost all of the measurements in a normally distributed population.

Section 3.4 Percentiles, Quartiles and Box-and-Whiskers Displays
For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100 - p) percent of the measurements fall at or above the value.
The first quartile \(Q_1\) is the 25th percentile
The second quartile (or median) \(M_d\) is the 50th percentile
The third quartile \(Q_3\) is the 75th percentile
The interquartile range IQR is \(Q_3 - Q_1\)

Calculating the pth percentile
• Calculate the index i = (p/100) × n
• If i is not an integer, the next integer greater than i denotes the position of the pth percentile in the ordered arrangement.
• If i is an integer, then the pth percentile is the average of the measurements in positions i and i+1 in the ordered arrangement.
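The index rule above can be sketched as a short Python function. This is an illustration under stated assumptions rather than the slides' own code: the name percentile is hypothetical, p is assumed to be a whole number strictly between 0 and 100, and the data are the 20 DVD recorder satisfaction ratings listed earlier. Integer arithmetic is used to decide whether i is an integer, which sidesteps floating-point rounding in (p/100) × n.

```python
def percentile(values, p):
    """p-th percentile by the index rule i = (p / 100) * n on the ordered data."""
    if not values:
        raise ValueError("values must not be empty")
    if not isinstance(p, int) or not 0 < p < 100:
        raise ValueError("p is assumed to be a whole number with 0 < p < 100")
    ordered = sorted(values)
    n = len(ordered)
    # i = p * n / 100; remainder == 0 exactly when i is an integer.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: take the next position up.
        # 1-based position whole + 1 is 0-based index whole.
        return ordered[whole]
    # i is an integer: average the measurements in 1-based positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


# DVD recorder satisfaction ratings, already in increasing order.
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]

q1 = percentile(ratings, 25)
md = percentile(ratings, 50)
q3 = percentile(ratings, 75)
iqr = q3 - q1
print("Q1 =", q1, "Md =", md, "Q3 =", q3, "IQR =", iqr)
```

Statistical packages often use interpolation rules that differ from this one, so their quartiles can disagree slightly with values obtained by the index rule.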

Percentile Example
• i = (10/100)(12) = 1.2
• Not an integer, so round up to 2
• The 10th percentile is the value in the second position, 11,070
• i = (25/100)(12) = 3
• An integer, so average the values in positions 3 and 4
• The 25th percentile is (18,211 + 26,817)/2 = 22,514

Figure 2.33: Using stem-and-leaf displays to find percentiles. (a) The 75th percentile of the 65 payment times, and a five-number summary. (b) The 5th percentile of the 60 bottle design ratings, and a five-number summary.

Example 2.10 DVD Recorder Satisfaction
20 customer satisfaction ratings:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
\(M_d\) = (8 + 8)/2 = 8
\(Q_1\) = (7 + 8)/2 = 7.5
\(Q_3\) = (9 + 9)/2 = 9
IQR = \(Q_3 - Q_1\) = 9 - 7.5 = 1.5

The Box-and-Whiskers Plots
The box plots the:
• first quartile, \(Q_1\)
• median, \(M_d\)
• third quartile, \(Q_3\)
• inner fences, located 1.5 × IQR away from the quartiles: \(Q_1 - 1.5\,\mathrm{IQR}\) and \(Q_3 + 1.5\,\mathrm{IQR}\)
• outer fences, located 3 × IQR away from the quartiles: \(Q_1 - 3\,\mathrm{IQR}\) and \(Q_3 + 3\,\mathrm{IQR}\)
The "whiskers" are dashed lines that plot the range of the data.
• A dashed line is drawn from the box below \(Q_1\) down to the smallest measurement
• Another dashed line is drawn from the box above \(Q_3\) up to the largest measurement
Note: \(Q_1\), \(M_d\), \(Q_3\), the smallest value, and the largest value are sometimes referred to as the five-number summary.

Outliers
Outliers are measurements that are very different from most of the other measurements, because they are either very much larger or very much smaller than most of the other measurements.
Outliers lie beyond the fences of the box-and-whiskers plot.
• Measurements between the inner and outer fences are mild outliers
• Measurements beyond the outer fences are extreme outliers

Weighted Means
Sometimes, some measurements are more important than others.
Assign numerical "weights" to the data.
• Weights measure the relative importance of each value
Calculate the weighted mean as
\[ \frac{\sum w_i x_i}{\sum w_i} \]
where \(w_i\) is the weight assigned to the ith measurement \(x_i\).

Example 2.12
June 2001 unemployment rates in the U.S. by region

Census Region    Civilian Labor Force (millions)    Unemployment Rate (%)
Northeast        26.9                               4.1
South            50.6                               4.7
Midwest          34.7                               4.4
West             32.5                               5.0

We want the mean unemployment rate for the U.S.
Calculate it as a weighted mean, so that the bigger the region, the more heavily it counts in the mean.
• The data values are the regional unemployment rates
• The weights are the sizes of the regional labor forces
\[ \frac{26.9(4.1) + 50.6(4.7) + 34.7(4.4) + 32.5(5.0)}{26.9 + 50.6 + 34.7 + 32.5} = \frac{663.29}{144.7} = 4.58\% \]
Note that the unweighted mean is 4.55%, which underestimates the true rate by 0.03 percentage points.
That is, 0.0003 × 144.7 million = 43,410 workers.
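
The weighted mean calculation in Example 2.12 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name weighted_mean and the variable names are illustrative assumptions, and the data are the regional unemployment rates and labor force sizes from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of the weights w_i."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Regions in order: Northeast, South, Midwest, West.
rates = [4.1, 4.7, 4.4, 5.0]            # unemployment rate, percent
labor_force = [26.9, 50.6, 34.7, 32.5]  # civilian labor force, millions

weighted = weighted_mean(rates, labor_force)
unweighted = sum(rates) / len(rates)
print("weighted mean rate (%):", round(weighted, 2))
print("unweighted mean rate (%):", round(unweighted, 2))
```

Passing equal weights to the same function gives the ordinary mean, which is one way to see that the unweighted mean treats every region as if it had the same labor force.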