1. Descriptive Statistics and Basic Probability

1.1. Descriptive Statistics

Suppose $y_1, y_2, \ldots, y_N$ are all the elements in the population and $x_1, x_2, \ldots, x_n$ are the sample drawn from $y_1, y_2, \ldots, y_N$, where $N$ is referred to as the population size and $n$ is the sample size. In this chapter, we introduce several numerical measures that summarize important information about the population. Numerical measures computed from a sample are called sample statistics, while those computed from a population are called population parameters.

(I) Measure of Location:

1. Mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

2. Median: Arrange the data in ascending (or descending) order. Then,
(i) if the sample size is odd, the median is the middle value;
(ii) if the sample size is even, the median is the mean of the middle two values.

3. Mode: The data value that occurs with the greatest frequency (the data need not be numerical).

4. Percentile: The $p$th percentile is a value such that at least $p$ percent of the data have this value or less and at least $(100-p)$ percent of the data have this value or more.

Note: 50th percentile = median.

The procedure to calculate the $p$th percentile:
(i) Arrange the data in ascending order.
(ii) Compute the index $i = \frac{p}{100}\, n$.
(iii) (a) If $i$ is not an integer, round up: the next integer greater than $i$ gives the position of the $p$th percentile.
      (b) If $i$ is an integer, the $p$th percentile is the average of the data values in positions $i$ and $i+1$.

5. Quartiles: When the data are divided into 4 parts, the division points are referred to as the quartiles. That is,
$Q_1$ = the first quartile, or 25th percentile
$Q_2$ = the second quartile, or 50th percentile
$Q_3$ = the third quartile, or 75th percentile

Example 1:
Suppose the following data are the scores of 10 students in a quiz:
1, 3, 5, 7, 9, 2, 4, 6, 8, 10.
Some measures are needed to summarize the performance of the 10 students in this quiz.

1. Mean: $\bar{x} = \frac{1 + 3 + \cdots + 10}{10} = 5.5$.

2. Median: $\frac{5 + 6}{2} = 5.5$. If the data were 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 instead, the median would be 6.

4. Find the 40th percentile and the 26th percentile for the previous data.
Step 1: The data in ascending order are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Step 2: For the 40th percentile, $i = \frac{40}{100} \times 10 = 4$. For the 26th percentile, $i = \frac{26}{100} \times 10 = 2.6$.
Step 3: The 40th percentile is $\frac{4 + 5}{2} = 4.5$ (the average of positions 4 and 5), and the 26th percentile is 3 (position 3).

5. Find the first quartile and the third quartile for the previous example.
Step 2: For the first quartile, $i = \frac{25}{100} \times 10 = 2.5$. For the third quartile, $i = \frac{75}{100} \times 10 = 7.5$.
Step 3: $Q_1 = 3$ (position 3) and $Q_3 = 8$ (position 8).

(II) Measure of Dispersion:

Example 2:
Suppose there are two factories producing batteries. From each factory, 10 batteries are drawn to test their lifetime (in hours). These lifetimes are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The mean lifetime is 10 hours for both factories. However, the data make it obvious that the batteries produced by factory 1 are much more consistent than those produced by factory 2. This implies that other measures, which capture the "dispersion" or "variation" of the data, are required. ◆

1. Range: range = (largest value of the data) - (smallest value of the data).

2. Interquartile Range: The interquartile range is the difference between the third and the first quartiles. That is, $\text{IQR} = Q_3 - Q_1$.

3. Variance and Standard Deviation:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}, \qquad s = \sqrt{s^2}.$$

4. Coefficient of Variation: The coefficient of variation is another useful statistic for measuring the dispersion of the data. It is defined as

$$\text{C.V.} = \frac{s}{\bar{x}} \times 100.$$

The coefficient of variation is invariant with respect to the scale of the data. The standard deviation, on the other hand, is not scale-invariant.

Example 2 (continued):

1. Range of lifetime data for factory 1 = 10.1 - 9.9 = 0.2
Range of lifetime data for factory 2 = 16 - 3 = 13
The range of battery lifetimes for factory 1 is much smaller than the one for factory 2.

2. The first quartile and the third quartile for the data from factory 1 are 9.9 and 10.1, respectively, and 6 and 14 for the data from factory 2. Therefore,
IQR (factory 1) = 10.1 - 9.9 = 0.2
IQR (factory 2) = 14 - 6 = 8.
The interquartile range of battery lifetimes for factory 1 is much smaller than the one for factory 2.

3. The sample variances are

$$s^2(\text{factory 1}) = \frac{(10.1-10)^2 + (9.9-10)^2 + \cdots + (10.1-10)^2}{10-1} = 0.0111,$$

$$s^2(\text{factory 2}) = \frac{(16-10)^2 + (5-10)^2 + \cdots + (12-10)^2}{10-1} = 21.1111.$$

The sample variance of battery lifetimes for factory 2 is 1900 times as large as the one for factory 1. The sample standard deviations for the data from factories 1 and 2 are $\sqrt{0.01111} = 0.1054$ and $\sqrt{21.1111} = 4.5946$, respectively.

4. In the battery data from factory 1, suppose the measurement is in minutes rather than hours. Then the data are
606, 594, 606, 594, 594, 606, 594, 606, 594, 606.
The standard deviation becomes 6.3245, which is 60 times the value 0.1054 obtained from the original data measured in hours. However, whether the data are measured in hours or in minutes, the coefficient of variation is

$$\text{C.V.} = \frac{0.1054}{10} \times 100 = \frac{6.3245}{600} \times 100 = 1.054.$$

Example 3:
The amount of time (in minutes) that each of a sample of 15 students spends watching television per day is given below.
40 25 35 30 20 40 30 40 10 30 20 10 5 20 20
(a) Compute the mean.
(b) Compute the standard deviation.
(c) Compute the coefficient of variation.
(d) Find the 40th percentile.
(e) Find the mode.
(f) Find the interquartile range.
(g) Construct a frequency distribution, a cumulative frequency distribution and a relative frequency distribution. Let the first class be 1-10.

[solution:]

(a) $\bar{x} = \frac{\sum_{i=1}^{15} x_i}{15} = \frac{40 + 25 + \cdots + 20}{15} = 25$.

(b) $s = \sqrt{\frac{\sum_{i=1}^{15} (x_i - \bar{x})^2}{15-1}} = \sqrt{\frac{(40-25)^2 + (25-25)^2 + \cdots + (20-25)^2}{14}} = 11.339$.

(c) $\text{C.V.} = \frac{s}{\bar{x}} \times 100 = \frac{11.339}{25} \times 100 = 45.356$.

(d) 1. The data in ascending order are
5 10 10 20 20 20 20 25 30 30 30 35 40 40 40
2. $i = \frac{40}{100} \times 15 = 6$, which is an integer. Thus the 40th percentile is the average of the values in positions 6 and 7, $\frac{20 + 20}{2} = 20$.

(e) The mode is 20.

(f) Since $Q_1 = 20$ and $Q_3 = 35$, $\text{IQR} = Q_3 - Q_1 = 35 - 20 = 15$.

(III) Other Descriptive Statistics:

1. Five-Number Summary: The five-number summary provides important information about both the location and the dispersion of the data. It consists of
Smallest value
First quartile
Median
Third quartile
Largest value

2. Z-score: The z-score, referred to as the standardized value for observation $i$, is defined as

$$z_i = \frac{x_i - \bar{x}}{s}.$$

3. Weighted Mean:

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}.$$

4. Sample Mean for Grouped Data: With $m$ classes, class frequencies $f_k$ and class midpoints $M_k$,

$$\bar{x}_g = \frac{\sum_{k=1}^{m} f_k M_k}{\sum_{k=1}^{m} f_k} = \frac{\sum_{k=1}^{m} f_k M_k}{n}.$$

Sample Variance for Grouped Data:

$$s_g^2 = \frac{\sum_{k=1}^{m} f_k (M_k - \bar{x}_g)^2}{n-1} = \frac{\sum_{k=1}^{m} f_k M_k^2 - n\bar{x}_g^2}{n-1}.$$

Example 2 (continued):
The original data (in hours) are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.

The five-number summaries for the data from both factories are

             Smallest   Q1     Median   Q3     Largest
Factory 1    9.9        9.9    10       10.1   10.1
Factory 2    3          6      10.5     14     16

Z-scores for the data:

Factory 1:
x_i   10.1    9.9     10.1    9.9     9.9     10.1    9.9     10.1    9.9     10.1
z_i   0.948   -0.948  0.948   -0.948  -0.948  0.948   -0.948  0.948   -0.948  0.948

Factory 2:
x_i   16      5       7       14      6       15      3       13      9       12
z_i   1.305   -1.088  -0.652  0.870   -0.870  1.088   -1.523  0.652   -0.217  0.435

Example 4:
The following are 5 purchases of a raw material over the past 3 months.

Purchase   Cost per Pound ($)   Number of Pounds
1          3.00                 1200
2          3.40                 500
3          2.80                 2750
4          2.90                 1000
5          3.25                 800

Find the mean cost per pound.

[solution:]
The weights are the numbers of pounds, $w_1 = 1200$, $w_2 = 500$, $w_3 = 2750$, $w_4 = 1000$, $w_5 = 800$, and the values are the costs per pound, $x_1 = 3.00$, $x_2 = 3.40$, $x_3 = 2.80$, $x_4 = 2.90$, $x_5 = 3.25$. Then,

$$\bar{x}_w = \frac{\sum_{i=1}^{5} w_i x_i}{\sum_{i=1}^{5} w_i} = \frac{1200 \times 3.00 + 500 \times 3.40 + 2750 \times 2.80 + 1000 \times 2.90 + 800 \times 3.25}{1200 + 500 + 2750 + 1000 + 800} = 2.96.$$

Example 5:
The following is the frequency distribution of the time in days required to complete year-end audits:

Audit Time (days)   Frequency
10-14               4
15-19               8
20-24               5
25-29               2
30-34               1

What are the mean and the variance of the audit time?

[solution:]
The class frequencies are $f_1 = 4$, $f_2 = 8$, $f_3 = 5$, $f_4 = 2$, $f_5 = 1$, so $n = f_1 + f_2 + f_3 + f_4 + f_5 = 4 + 8 + 5 + 2 + 1 = 20$, and the class midpoints are $M_1 = 12$, $M_2 = 17$, $M_3 = 22$, $M_4 = 27$, $M_5 = 32$. Thus,

$$\bar{x}_g = \frac{\sum_{k=1}^{5} f_k M_k}{\sum_{k=1}^{5} f_k} = \frac{4 \times 12 + 8 \times 17 + 5 \times 22 + 2 \times 27 + 1 \times 32}{4 + 8 + 5 + 2 + 1} = 19$$

and

$$s_g^2 = \frac{\sum_{k=1}^{5} f_k (M_k - \bar{x}_g)^2}{n-1} = \frac{4(12-19)^2 + 8(17-19)^2 + 5(22-19)^2 + 2(27-19)^2 + 1(32-19)^2}{20-1} = 30.$$

(IV) Numerical Measures of Association:

Covariance and Correlation Coefficient:
Let sample 1 be $x_1, x_2, \ldots, x_n$ and sample 2 be $z_1, z_2, \ldots, z_n$. The sample covariance is

$$s_{xz} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n-1} = \frac{\sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z}}{n-1},$$

while the sample correlation coefficient is

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (z_i - \bar{z})^2}}.$$

Note: $|r_{xz}| \le 1$.

Example 6:
Let $x_i$ be the total money spent on advertisement for some product and $z_i$ be the sales volume (1 unit = 100 packs). Here $\bar{x} = 3$ and $\bar{z} = 51$.

x_i                           2    5    1    3    4    1    5    3    4    2
z_i                           50   57   41   54   54   38   63   48   59   46
(x_i - x̄)(z_i - z̄)            1    12   20   0    3    26   24   0    8    5

$$s_{xz} = \frac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10-1} = \frac{99}{9} = 11,$$

$$s_x^2 = \frac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10-1} = 1.4907^2 \quad\text{and}\quad s_z^2 = \frac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10-1} = 7.9303^2.$$

Then,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{11}{1.4907 \times 7.9303} = 0.93.$$

Example 7:
Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.

x_i   1   2   3   4   5
z_i   2   4   6   8   10

Then $\bar{x} = 3$, $\bar{z} = 6$,

$$s_x = \sqrt{\frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5-1}} = \sqrt{\frac{5}{2}}, \qquad s_z = \sqrt{\frac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5-1}} = \sqrt{10},$$

$$s_{xz} = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5-1} = 5.$$

Thus,

$$r_{xz} = \frac{s_{xz}}{s_x s_z} = \frac{5}{\sqrt{5/2}\,\sqrt{10}} = 1.$$

Note: When there is a perfect positive linear relationship between the variables $x$ and $z$, $r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear relationship.
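The percentile procedure in (I) and the formulas for the sample variance, covariance and correlation coefficient can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names are illustrative assumptions, and `percentile` is restricted to integer p between 1 and 99 so that the index $i = \frac{p}{100} n$ can be tested for being an integer without floating-point error.

```python
import math


def percentile(data, p):
    """p-th percentile by the index rule of (I); p is an integer, 1 <= p <= 99."""
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer between 1 and 99")
    xs = sorted(data)                 # step (i): ascending order
    q, r = divmod(p * len(xs), 100)   # step (ii): i = (p/100) * n = q + r/100
    if r:                             # step (iii)(a): i is not an integer, round up
        return xs[q]                  # 1-based position q + 1
    return (xs[q - 1] + xs[q]) / 2    # step (iii)(b): average of positions i and i + 1


def mean(data):
    return sum(data) / len(data)


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)


def sample_std(data):
    return math.sqrt(sample_variance(data))


def sample_covariance(xs, zs):
    """Sum of products of deviations, divided by n - 1."""
    mx, mz = mean(xs), mean(zs)
    return sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / (len(xs) - 1)


def correlation(xs, zs):
    """Sample correlation coefficient r_xz = s_xz / (s_x * s_z)."""
    return sample_covariance(xs, zs) / (sample_std(xs) * sample_std(zs))


scores = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]            # Example 1
factory2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]      # Example 2, factory 2
ads = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]                # Example 6, x_i
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]    # Example 6, z_i

print(percentile(scores, 40), percentile(scores, 26))  # 4.5 3
print(round(sample_variance(factory2), 4))             # 21.1111
print(round(correlation(ads, sales), 2))               # 0.93
```

The printed values agree with Examples 1, 2 and 6. Statistical software often uses other percentile conventions (for example, interpolating between neighbouring values), so results from a library routine may differ slightly from this rule.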