Descriptive Statistics Measure of Central Tendency Numerical Summary Measures Mean: the average value of the data. If n observations are denoted by x1, x2, ..., xn, their (sample) mean is Describe Distribution with Numbers • Measure of Center • Measure of Variation • Measure of Position n x + x + ... + xn x= 1 2 = n ∑x i =1 i n 1 Example: Birth weights (in lb) of 5 babies born from a group of women under certain diet. 2 Example: (number of hysterectomies performed by 15 male doctors) 7, 6, 8, 7, 7 27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28 ⇒ mean = 41.33 Sol: 2 | 05578 3 | 13467 4|4 5 | 09 6| 7| 8 | 56 mean = (7 + 6 + 8 +7 + 7) / 5 = 35/5 = 7 → [near the center of the data set] 3 Weighted Mean Weighted Mean Example: (Grade point average) A student received 3 A’s, 5 B’s, 2 C’s. Class (grade point, x) 4 3 Frequency (weight, w) 3 5 2 2 Average grade point = = 4 3x4+5x3+2x2 3+5+2 weight 31 = 3.1 10 Weighted mean = w1. x1 + w2. x2 + … + wk. xk w1 + w2 + … + w k = Σw.x Σw Where w1, w2, … are the weights and x1, x2, … are the values (or class midpoint or class mark). 5 6 Descriptive Stat - 1 Descriptive Statistics Grouped Mean Mean Estimation Class Frequency Class Mark (w) (x) Median: of a data set is w .x 90 - < 110 1 100 1x100 110 - < 130 2 120 2x120 130 - <150 3 140 3x140 150 - < 170 1 160 1x160 Total 7 Estimated mean = the data value exactly in the middle of its ordered list if the number of pieces of data is odd, the mean of the two middle data values in its ordered list if the number of pieces of data is even. 920 920 7 [median is not influenced by outliers and is best for nonsymmetric distribution] = 131.43 7 Example: (number of hysterectomies performed by 15 doctors) 27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28 8 Example: (Birth weights for 6 infants.) 5, 7, 6, 8, 5, 9 ordered list => 5, 5, 6, 7, 8, 9 ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86 median = 34 median = (6+7) / 2 = 6.5 9 10 Example 1: (number of times visited class website by 15 students) 27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28 Mode: of a data set is the observation that occurs most frequently. ordered list => 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86 Mode = 25 Example 2: (Blood type of 15 students) A, B, A, A, O, AB, A, A, B, B, O, O, A, A, A Mode = A 11 A–8 B–3 O–3 AB – 1 12 Descriptive Stat - 2 Descriptive Statistics Mean ? What is a Modal class? Skewed to the Right Median ? Class 90< - 110 110< - 130 130< - 150 150< - 170 170< - 190 190< - 210 210< - 230 230< - 250 250< - 270 270< - 290 Total Frequency 2 2 4 2 7 3 1 0 0 1 22 Relative Freq. 2/22 = .091 2/22 = .091 4/22 = .182 2/22 = .091 7/22 = .318 3/22 = .136 1/22 = .045 0/22 = .000 0/22 = .000 1/22 = .045 1.000 Mode ? Cumulative R.F. 2/22 4/22 8/22 10/22 17/22 20/22 21/22 21/22 21/22 22/22 13 14 Measure of Dispersion (variability) Range = largest data value − smallest data value Is there any difference between the two samples? Sample from group I (diet program I): 7, 6, 8, 7, 7 => mean = (7 + 6 + 8 +7 + 7) / 5 = 35/5 = 7 range of sample I = 8 - 6 = 2 Sample from group II (diet program II): 3, 4, 8, 9, 11 => mean = (3 + 4 + 8 + 9 + 11) / 5 = 35/5 = 7 range of sample II = 11 - 3 = 8 Does the mother’s diet program affect the birth weights of babies? 15 16 Example: Birth weights (in lb) of 5 babies born from a group of women under diet program II. 3, 4, 8, 9, 11 ⇒ mean = x = 7 Data Value Deviation from mean Squared Dev. xi Variance and Standard Deviation Measure the spread of the data around the center of the data. 17 xi − x ( xi − x ) 2 3 3–7 =–4 16 4 4–7=–3 9 8 8–7=1 1 9 9–7=2 4 11 11 – 7 = 4 16 Total 0 46 Sample Variance = 46/4 = 11.5 lb, 46 / 4 = 3.39 lb. Sample Standard Deviation = 18 Descriptive Stat - 3 Descriptive Statistics If n observations are denoted by x1, x2, ..., xn, their variance and standard deviation are 2 n s2 = Sample Variance: ∑ ( xi − x ) 2 i =1 n −1 (unbiased estimator for variance of an infinite population.) n s= Sample Standard Deviation: Sample Mean: x= A Short Cut formula: ∑ ( xi − x ) 2 i =1 n −1 x1 + x2 + ... + xn 1 n = ∑ xi n n i =1 ⎛n ⎞ ⎜ ∑ xi ⎟ n 2 ∑ xi − ⎝ i =1 ⎠ n s 2 = i =1 n −1 352 291 − 5 = 11.5 = 4 Data, x x2 3 9 4 16 8 64 9 81 11 121 35 291 19 20 Population Parameters What is the standard deviation of the weights of babies from the sample of mothers who received diet program I? If N observations are denoted by x1, x2, ..., xn, are all the observation in a finite population, their mean, µ , variance σ 2, and standard deviation, σ , are Data: 7, 6, 8, 7, 7 Population Mean: µ= x1 + x2 + ... + xn 1 n = ∑ xi N N i =1 n s2 = (0+1+1+0+0)/(5-1) = ½ 1 = 0.71 2 Does the mother’s diet program affect the birth weights of babies? Population Variance: s= Diet I: mean = 7, Diet II: mean = 7, s = 0.71 s = 3.39 i =1 2 i N n Population Standard Deviation: σ= ∑ (x − µ) i =1 21 s measures the spread around the mean. the larger s is, the more spread out the data are. if s = 0, then all the observations must be equal. s is strongly influenced by outliers. N The Use of Mean and Standard Deviation 23 2 i 22 About s (sample standard deviation) : σ2 = ∑ (x − µ) Describe distribution Understand the center and the spread of the distribution 24 Descriptive Stat - 4 Descriptive Statistics Profit Margin (1972-1981) American Water Works x s 7.61 .68 Birth Weight Data Unit: oz Mean, Which company should you 7.62 7.39 invest your money? Brown & Sharpe Campbell Soup 13.65 1.05 McDonald’s 20.04 1.02 -.98 14.18 Pam American x Standard Deviation, s Mother Drank Alcohol 102.2 25 Mother Did not Drink Alcohol 118.4 20 25 Many distributions can be described by a mathematical function with specific parameters, such as mean and standard deviation. 26 Properties of a symmetric and bell-shaped (Normal) distribution: Example: Normal Distribution (Bell-shaped) the distribution is symmetric about it mean (µ), 68% of the area is between µ − σ and µ + σ , 95% of the area is between µ − 2σ and µ + 2σ , 99.7% of the area is between µ − 3σ and µ + 3σ . σ µ 27 Heart rates for a certain population at a certain condition follow a bell shape symmetric distribution with mean 70 and standard deviation 2. µ − 3σ µ µ + 3σ 28 Chebychev’s inequality There is at least 1 – (1/k2) of the data in a data set lie within k standard deviation of their mean. What percentage of people in this population will have heart rates between 66 and 74? 95% ?% 66 70 74 29 30 Descriptive Stat - 5 Descriptive Statistics Example: Heart rates for asthmatic patients in a state of respiratory arrest has a mean of 140 beats per minute and a standard deviation of 35.5 beats per minute. What percentage of the population of this type of patients have heart rates lie between two standard deviations of the mean in a state of respiratory arrest? Heart rates example: mean=144, s.d.=35.5 k=2 75% = 1 − (1/22) At least 75% It will be at 75%, because ??? k = 2, and 1 – (1/22) = ¾ = 75%. 69 140 - 2x35.5 = 69 144 211 140 + 2x35.5 = 211 31 What about within three standard deviations? Heart rates example: mean=144, s.d.=35.5 k=3 2)2) ?% ≈≈11−−(1/3 89% (1/3 AtAtleast least89% ?% 32 Coefficient of Variation (C.V.): is the standard deviation expressed as a percentage of the mean. It is a unit-free measure of dispersion. It provides a measurement for comparing relative variability of data sets from different scales. C.V. = 33.5 144 - 3x35.5 = 33.5 144 s ⋅ 100 % x 246.5 144 + 3x35.5 = 246.5 33 Example: As part of the Berkeley Guidance Study. The heights (in cm) and weights (in kg) of 13 girls were measured at age two. The average height was 86.6 cm with a s.d.= 2.9 and average weight is 12.6 kg with a s.d.= 1.4. 34 Measure of Position C.V. (height) = (2.9/86.6)100% = 3.3% C.V. (weight) = (1.4/12.6)100% = 11.1% Standard Score, Percentile, Quartile More variability is weight than in height 35 36 Descriptive Stat - 6 Descriptive Statistics If x is an observation from a distribution that has mean µ , and standard deviation σ , the standardized value of x is, z-score of x : z= If a distribution has a mean 10 and a s.d. 2, the value 7 has a z-score –1.5. x −µ z-score = (7 – 10)/2 = – 1.5. σ “µ + 3σ” has a z-score 3, since it is 3 s.d. from mean. 1.5 s.d. 6 8 10 12 14 37 Percentile (Measure of position) Find a Data Value Corresponding to a Given Percentile Step 1: Sort the data. Step 2: Compute position index c c = n ⋅ p / 100 38 Example: A sample of number of times visited class website by 15 students is the following: 27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28. Find the 90th percentile of the data in this sample. Sol: n = 15, p = 90. Ordered data: 20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86 n = total number of values p = percentile (If for 90th percentile, p = 90.) Step 3 (find position): c = np/100 = 15 x 90 / 100 = 13.5 1) If c is not whole number, round up c to the next whole number. 2) If c is a whole number, the percentile is at the position that is halfway between c and c + 1. Round c to 14. The 14th number in the ordered list is the 90th percentile and that is 85. 39 Quartiles: (Measure of Position) 40 Example: [odd number of data values] (n = 21) 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75 • • The first quartile, Q1, or 25th percentile, is the median of the lower half of the list of ordered observations. The third quartile, Q3, or 75th percentile, is the median of the upper half of the list of ordered observations. Q1 = ?64.5 Median = 69 72 Q3 = ? Measure of spread: Interquartile range (IQR) = Q3 − Q1 IQR = 72 – 64.5 = 7.5 41 42 Descriptive Stat - 7 Descriptive Statistics The five-number summary Example: [even number of data] (n = 22) 6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75 Q1 = 64 ? Median = 68 ? .Minimum value .Q1 .Median .Q3 .Maximum value 72 Q3 = ? Measure of spread: Interquartile range (IQR) = Q3 − Q1 IQR = 72 - 64 = 8 43 44 With 6 in the data: Example: (data sheet without outlier “6”) 6, 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75 60,61,63,64,64,65,65,65,66,67,69,71,71,71,72,72,72,72,73,74,75 Q1 = 64 Min = 60, Q1 = 64.5, Median = 69, Q3 = 72, Max = 75. 80 80 70 60 Q3 = 72 40 60 20 50 N= Median = 68 IQR = 72 - 64 = 8 21 HEIGHT 1 0 45 N= 22 46 HEIGHT IQR = 72 – 64 = 8; Q1 = 64; Q3 = 72 Inner and outer fences for outliers • • • The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = Q1 - 1.5 x IQR ) and at a distance of 1.5 IQR above Q3 (upper inner fence = Q3 + 1.5 x IQR ). The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = Q1 – 3 x IQR ) and at a distance of 3 IQR above Q3 (upper outer fence = Q3 + 3 x IQR ) . • 47 The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = 64 - 1.5 x 8 = 52 ) and at a distance of 1.5 IQR above Q3 (upper inner fence = 72 + 1.5 x 8 = 84). The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = 64 – 3 x 8 = 40) and at a distance of 3 IQR above Q3 (upper outer fence = 72 + 3 x 8 = 96) . 48 Descriptive Stat - 8 Descriptive Statistics 84 UOF:72 + 3 x 8 = 96 96 UIF: 72 + 1.5 x 8 = 84 Outer fence Inner fence Inner fence 80 80 IQR IQR 60 60 Inner fence Inner fence 52 LIF: 64 - 1.5 x 8 = 52 40 20 40 Q1 = 64; Q3 = 72; Outer fence 40 LOF: 64 - 3 x 8 = 40 20 IQR = 72 – 64 = 8 1 1 0 N= 0 22 N= 22 HEIGHT HEIGHT 49 50 Mild and Extreme outliers Side-by-side Box Plot 80 Data values falling between the inner and outer fences are considered mild outliers. Data values falling outside the outer fences are considered extreme outliers. 8 14 70 21 18 19 60 HEIGHT When outliers exist, the whisker extended to the smallest and largest data values within the inner fence. 50 N= sex 51 Boxplot 8 13 Female Male 52 Remarks: 3 Find the five-number-summary and make a boxplot for the following data: 13 72 78 40 50 56 50 52 57 69 130 142 51 52 53 If the distribution of the data is symmetric, then the mean and median will be about the same. The five-number summary is best for nonsymmetric data. The median and quartiles are not influenced by outliers. The mean and standard deviation are most appropriate to use only if the data are symmetric because both of these measures are easily influenced by outliers. 54 Descriptive Stat - 9