Measures of Center and Variation
Sections 3.1 and 3.3
Prof. Felix Apfaltrer
fapfaltrer@bmcc.cuny.edu
Office: N518
Phone: 212-220-8000 ext. 7421
Office hours: Mon-Thu 1:30-2:15 pm

Measures of center - mean
• A measure of center is a value that represents the center of the data set
• The mean (also called the arithmetic mean) is the most important measure of center
• Sample mean: $\bar{x} = \frac{\sum x}{n}$
• Population mean: $\mu = \frac{\sum x}{N}$
  – $\sum$: addition of values
  – $x$: variable (individual data values)
  – $n$: sample size
  – $N$: population size
Example. Lead (Pb) in air at BMCC (μg/m³; a level of 1.5 is considered high):
5.4, 1.1, 0.42, 0.73, 0.48, 1.1
$\bar{x} = \frac{5.4 + 1.1 + 0.42 + 0.73 + 0.48 + 1.1}{6} = \frac{9.23}{6} \approx 1.538$
The outlier 5.4 has a strong effect on the mean!

Measures of center - median
• The mean is good but sensitive to outliers!
• Large values can have a dramatic effect!
• The median is the middle value of the original data arranged in increasing order
  – if n is odd: the exact middle value
  – if n is even: the average of the 2 middle values
Previous example, data reordered: 0.42, 0.48, 0.73, 1.1, 1.1, 5.4
n = 6 is even, so the median is (0.73 + 1.1)/2 = 0.915.
If we had an extra data point: 5.4, 1.1, 0.42, 0.73, 0.48, 1.1, 0.66
After reordering we have 0.42, 0.48, 0.66, 0.73, 1.1, 1.1, 5.4
n = 7 is odd, so the median is the 4th value, 0.73.
The outlier has a strong effect on the mean, but not on the median!
The median is used, for example, for household income: median household income $36,078.

Measures of center - mode and midrange
• Mode M: the value that occurs most frequently
  – if 2 values are most frequent: bimodal
  – if more than 2: multimodal
  – if no value is repeated: no mode
• Needs no numerical values
Examples:
a. 5.4, 1.1, 0.42, 0.73, 0.48, 1.1
b. 27, 27, 27, 55, 55, 55, 88, 88, 99
c. 1, 2, 3, 6, 7, 8, 9, 10
Solutions:
a. unimodal: 1.1
b. bimodal: 27 and 55
c. no mode
• Midrange = (highest value + lowest value)/2
• Outliers have a very strong weight in the midrange
a. (0.42 + 5.4)/2 = 2.91
b. (27 + 99)/2 = 63
c. (1 + 10)/2 = 5.5

Measures of center for three data sets (the first row matches the smoker cotinine data used later in these slides):
Mean   Median   Mode   Midrange
172    170      1      245.5
61     2        1      276
16     0        0      154.5

Mode and more …
• Mode: not much used with numerical data
Example: a survey shows that students own:
• 84% TV
• 76% VCR
• 69% CD player
• 39% video game player
• 35% DVD
TV is the mode! There is no mean, median or midrange!
Round-off: carry one more decimal place than is present in the data!
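The four measures of center can be checked with a short Python sketch. The helper names `modes` and `midrange` are illustrative, not part of the slides; the midrange is computed as (lowest + highest)/2, which is what the worked examples a to c use.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value is repeated: no mode
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Midpoint between the lowest and the highest value."""
    return (min(values) + max(values)) / 2


lead = [5.4, 1.1, 0.42, 0.73, 0.48, 1.1]           # example a
b = [27, 27, 27, 55, 55, 55, 88, 88, 99]           # example b
c = [1, 2, 3, 6, 7, 8, 9, 10]                      # example c

print(round(mean(lead), 3))     # 1.538
print(round(median(lead), 3))   # 0.915
print(modes(lead), modes(b), modes(c))
print(round(midrange(lead), 2), midrange(b), midrange(c))
```

Dropping the outlier 5.4 from `lead` and rerunning shows how much the mean and midrange move while the median barely changes.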
• Mean from a frequency distribution: $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$
• Weighted mean: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
• Advantages and disadvantages of the different measures of center

Measures of variation
• Variation measures consistency
• Range = highest value - lowest value
• Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
[Figure: precision arrows vs. jungle arrows. Same mean length, but different variation!]

Standard deviation
• Measure of variation of all values from the mean
• Positive or zero (zero only when all data values are equal)
• Larger deviations, larger s
• Can increase dramatically with outliers
• Same units as the original data values

Calculating the standard deviation: Bank Unpredictable
          x    x - mean   (x - mean)^2
          0      -5          25
         15      10         100
          5       0           0
          0      -5          25
          0      -5          25
         10       5          25
Total:   30                 200
mean = 30/6 = 5 min
s = sqrt(200/(6-1)) = sqrt(40) = 6.3 min

Recipe:
1. Compute the mean
2. Subtract the mean from the individual values
3. Square the differences
4. Add the squared differences
5. Divide by n-1
6. Take the square root

Example: waiting times (min)
Bank Consistency:     6  5  4  4  6  5
Bank Unpredictable:   0 15  5  0  0 10
For Bank Consistency:
1. Mean: (6+5+4+4+6+5)/6 = 5
2. (6-5) = 1, (5-5) = 0, (4-5) = -1, (4-5) = -1, (6-5) = 1, (5-5) = 0
3. 1^2 = 1, 0^2 = 0, (-1)^2 = 1, (-1)^2 = 1, 1^2 = 1, 0^2 = 0
4. ∑ = 1+0+1+1+1+0 = 4
5. n-1 = 6-1 = 5, 4/5 = 0.8
6. √0.8 = 0.9 min, vs 6.3 min for Bank Unpredictable

Standard deviation of sample and population
Example using the fast (shortcut) formula $s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$ on the Bank Consistency data:
• Find the values of n, $\sum x$ and $\sum x^2$
  – n = 6 (6 values in the sample)
  – $\sum x = 30$ (adding the values)
  – $\sum x^2 = 6^2 + 5^2 + 4^2 + 4^2 + 6^2 + 5^2 = 154$
• $s = \sqrt{\frac{6 \cdot 154 - 30^2}{6 \cdot 5}} = \sqrt{\frac{24}{30}} = \sqrt{0.8} \approx 0.9$ min
Standard deviation of a population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
• divide by N
• $\mu$: mu (population mean)
• $\sigma$: sigma (st. dev. of population)
• Different notations in calculators
  – Excel: STDEVP for $\sigma$, STDEV for s
Estimating s and $\sigma$: (highest value - lowest value)/4

Example: class grades
A statistics class of 20 students obtains the following grades:
Name      Grade    Name      Grade
Peter     83       Albert    69
Kathy     98       John      71
Pat       57       John B.   64
Nina      73       Hughes    85
Nancy     78       Zak       89
Victor    86       Zoe       84
Vikki     82       Lena      83
Jen       95       Mary      92
Jay       92       Joe       74
Fred      66       Betty     78
To rapidly approximate the mean, we take a random sample of 5 students.
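As an aside before the class example continues, the six-step recipe for the standard deviation can be sketched in Python for the two banks. The function names `sample_std` and `shortcut_std` are illustrative; the shortcut is assumed to be the usual computational form $s = \sqrt{(n \sum x^2 - (\sum x)^2) / (n(n-1))}$, and both results are compared with `statistics.stdev` from the standard library.

```python
from math import sqrt
from statistics import stdev


def sample_std(values):
    """Recipe: mean, deviations, squares, sum, divide by n-1, square root."""
    n = len(values)
    m = sum(values) / n                          # step 1
    squared = [(x - m) ** 2 for x in values]     # steps 2 and 3
    return sqrt(sum(squared) / (n - 1))          # steps 4, 5 and 6


def shortcut_std(values):
    """Shortcut form (assumed) that needs only n, sum(x) and sum(x^2)."""
    n = len(values)
    sx = sum(values)
    sxx = sum(x * x for x in values)
    return sqrt((n * sxx - sx ** 2) / (n * (n - 1)))


consistent = [6, 5, 4, 4, 6, 5]        # Bank Consistency waiting times (min)
unpredictable = [0, 15, 5, 0, 0, 10]   # Bank Unpredictable waiting times (min)

for bank in (consistent, unpredictable):
    print(round(sample_std(bank), 1), round(shortcut_std(bank), 1),
          round(stdev(bank), 1))
```

Both banks have the same mean of 5 minutes, so the difference between the two printed rows comes from variation alone.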
At random, we pick Nancy, Mary, John B., Lena and Betty:
$\bar{x}$ = (78+92+64+83+78)/5 = 395/5 = 79
s = √(((78-79)^2 + (92-79)^2 + (64-79)^2 + (83-79)^2 + (78-79)^2)/4)
  = √(((-1)^2 + (13)^2 + (-15)^2 + (4)^2 + (-1)^2)/4)
  = √((1 + 169 + 225 + 16 + 1)/4)
  = √(412/4)
  = √103 = 10.15
The population mean is obtained by adding all grades and dividing by 20, which is 79.95. The population standard deviation is 10.71, which we can obtain using Excel:
Name      Grade x   x - mu   squared
Peter     83          3.1       9.3
Kathy     98         18.1     325.8
Pat       57        -23.0     526.7
Nina      73         -7.0      48.3
Nancy     78         -2.0       3.8
Victor    86          6.1      36.6
Vikki     82          2.1       4.2
Jen       95         15.1     226.5
Jay       92         12.1     145.2
Fred      66        -14.0     194.6
Albert    69        -11.0     119.9
John      71         -9.0      80.1
John B.   64        -16.0     254.4
Hughes    85          5.1      25.5
Zak       89          9.1      81.9
Zoe       84          4.1      16.4
Lena      83          3.1       9.3
Mary      92         12.1     145.2
Joe       74         -6.0      35.4
Betty     78         -2.0       3.8
sum                           2293.0
/ N = 20                       114.6
root                           10.71

Variance and coefficient of variation
Variance
• Variance = square of the standard deviation
  – sample variance: $s^2$
  – population variance: $\sigma^2$
• General terms referring to variation: dispersion, spread, variation. Variance has a specific definition.
Examples:
• In the class grade case, the sample standard deviation was 10.15. Therefore, $s^2 = 103$.
• The population standard deviation was 10.71; therefore, $\sigma^2 = 10.71^2 \approx 114.7$ (the table's 114.6 uses the unrounded value).
• Ex: finding a variance. For the two banks, $s^2 = 0.8$ (Bank Consistency) and $s^2 = 40$ (Bank Unpredictable).

Coefficient of variation CV [p.155 ex. 49]
Describes the standard deviation relative to the mean:
• sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
• population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
• The coefficient of variation allows us to compare the dispersion of completely different data sets
  – ex: in the previous example, CV(sample) = 10.1/79 = 12.8% and CV(population) = 10.71/79.95 = 13.4%
• Consistent bank data set 6, 5, 4, 4, 6, 5: $\bar{x}$ = 5, s = 0.9, CV = 0.9/5 = 0.18
• Class sample: $\bar{x}$ = 79, s = 10.1, CV = 10.1/79 = 0.13
  – The variation of the consistent bank is larger than that of the class in relative terms!

More on variance and standard deviation
• Why use variance when standard deviation is more intuitive?
  – (Independent) variances have additive properties
  – Probabilistic properties
  – Standard deviation is more intuitive
• Why divide the sample st. dev. by n-1?
  – Only n-1 free parameters: once the mean is fixed, only n-1 of the deviations can vary freely

Empirical rule for data with normal distribution
• About 68% of the data lie within 1 standard deviation of the mean
• About 95% of the data lie within 2 standard deviations of the mean
• About 99.7% of the data lie within 3 standard deviations of the mean
[Figure: bell-shaped curve with the 68%, 95% and 99.7% regions marked]
Example: Adult IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15. What percentage of adults have an IQ in the range 55 to 145?
s = 15, 3s = 45, $\bar{x}$ - 3s = 55, $\bar{x}$ + 3s = 145
Hence, about 99.7% of adults have IQs in that range.
Chebyshev’s theorem: for any k > 1, at least a fraction $1 - 1/k^2$ of the data lie within k standard deviations of the mean.
Ex: At least 1 - 1/3^2 = 8/9 ≈ 89% of the data lie within 3 st. dev. of the mean.

• The mean and the median are often different
• This difference gives us clues about the shape of the distribution
  – Is it symmetric?
  – Is it skewed left?
  – Is it skewed right?
  – Are there any extreme values?
• Symmetric – the mean will usually be close to the median
• Skewed left – the mean will usually be smaller than the median
• Skewed right – the mean will usually be larger than the median
• Skewness: Pearson’s index I = 3(mean - median)/s
• If I < -1 or I > 1: significantly skewed
• For a mostly symmetric distribution, the mean and the median will be roughly equal
• Many variables, such as birth weights, are approximately symmetric
[Figure: histogram of birth weights]

Summary: Chapter 3 – Sections 1 and 2
• Mean
  – The center of gravity
  – Useful for roughly symmetric quantitative data
• Median
  – Splits the data into halves
  – Useful for highly skewed quantitative data
• Mode
  – The most frequent value
  – Useful for qualitative data
• Range
  – The maximum minus the minimum
  – Not a resistant measurement
• Variance and standard deviation
  – Measure deviations from the mean
  – Not resistant measurements
• Empirical rule
  – About 68% of the data is within 1 standard deviation of the mean
  – About 95% of the data is within 2 standard deviations of the mean

Summary: Chapter 3 – Section 3 (Grouped Data)
• As an example, for the
following frequency table,
Class       Midpoint   Frequency
0 – 1.9     1          3
2 – 3.9     3          7
4 – 5.9     5          6
6 – 7.9     7          1
we calculate the mean as if
  – the value 1 occurred 3 times
  – the value 3 occurred 7 times
  – the value 5 occurred 6 times
  – the value 7 occurred 1 time
• Evaluating this formula:
$\frac{(1 \cdot 3) + (3 \cdot 7) + (5 \cdot 6) + (7 \cdot 1)}{3 + 7 + 6 + 1} = \frac{61}{17} \approx 3.6$
• The mean is about 3.6
• In mathematical notation: $\frac{\sum x_i f_i}{\sum f_i}$
• This would be $\mu$ for the population mean and $\bar{x}$ for the sample mean

Variance and standard deviation (grouped data)
• Finding s from a frequency distribution: $s = \sqrt{\frac{n \sum (f \cdot x^2) - (\sum f \cdot x)^2}{n(n-1)}}$
Example: cotinine levels of smokers
Range      Midpoint   Smokers
0-99       49.5       11
100-199    149.5      12
200-299    249.5      14
300-399    349.5      1
400-499    449.5      2
500-599    549.5      0
Using Excel we obtain
Range      Frequency f   Midpoint x   f·x      f·(x^2)
0-99       11            49.5         544.5    26952.75
100-199    12            149.5        1794     268203
200-299    14            249.5        3493     871503.5
300-399    1             349.5        349.5    122150.25
400-499    2             449.5        899      404100.5
500-599    0             549.5        0        0
Totals:    40                         7080     1692910
with which we calculate:
$s = \sqrt{\frac{40 \cdot 1692910 - 7080^2}{40 \cdot 39}} = \sqrt{\frac{17590000}{1560}} \approx 106.2$

Interpreting a known value of the standard deviation s:
If the standard deviation s is known, use it to find rough estimates of the minimum and maximum “usual” sample values by using
max “usual” value ≈ mean + 2(st. dev)
min “usual” value ≈ mean - 2(st. dev)

N-1: why divide by n-1?
Data (population): 3, 6, 9, with $\mu = 6$ and $\sigma^2 = 6$
Samples of size n = 2 (with replacement):
Sample                 3,3   3,6   3,9   6,3   6,6   6,9   9,3   9,6   9,9
$\bar{x}$              3     4.5   6     4.5   6     7.5   6     7.5   9
∑(x - $\bar{x}$)^2     0     4.5   18    4.5   0     4.5   18    4.5   0
s^2 (divide by n-1=1)  0     4.5   18    4.5   0     4.5   18    4.5   0
s^2 (divide by n=2)    0     2.25  9     2.25  0     2.25  9     2.25  0
Dividing by n-1: mean value of s^2 = 54/9 = 6, which equals $\sigma^2$
Dividing by n: mean value of s^2 = 27/9 = 3, which underestimates $\sigma^2$

Measures of relative standing
Useful for comparing different data sets
• z scores
  – Number of standard deviations that a value x is above or below the mean
  – sample: $z = \frac{x - \bar{x}}{s}$
  – population: $z = \frac{x - \mu}{\sigma}$
[Figure: z-score axis from -3 to 3; ordinary values lie between z = -2 and z = 2, unusual values lie beyond]
Example:
• NBA Jordan: x = 78, $\mu$ = 69, $\sigma$ = 2.8
• WNBA Lobo: x = 76, $\mu$ = 63.6, $\sigma$ = 2.5
  – J: z = (x - $\mu$)/$\sigma$ = (78 - 69)/2.8 = 3.21
  – L: z = (x - $\mu$)/$\sigma$ = (76 - 63.6)/2.5 = 4.96
• Percentiles:
  – Percentile of value x: $P_x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$
Example: data point 48 in the Smoker data: 8/40 · 100 = 20th percentile = P20
Exercise: Locate the percentiles of data points 1, 130 and 250.

Quartiles and percentiles
SMOKERS (n = 40), values as collected:
1 0 131 173 265 210 44 277 32 3
35 112 477 289 227 103 222 149 313 491
130 234 164 198 17 253 87 121 266 290
123 167 250 245 48 86 284 1 208 173
SMOKERS, sorted:
pos  1-10:  0 1 1 3 17 32 35 44 48 86
pos 11-20:  87 103 112 121 123 130 131 149 164 167
pos 21-30:  173 173 198 208 210 222 227 234 245 250
pos 31-40:  253 265 266 277 284 289 290 313 477 491
CLASS grades (n = 20), as collected:
83 98 57 73 78 86 82 95 92 66
69 71 64 85 89 84 83 92 74 78
CLASS grades, sorted:
pos  1-10:  57 64 66 69 71 73 74 78 78 82
pos 11-20:  83 83 84 85 86 89 92 92 95 98

Percentiles and Quartiles
• Pk: k = (number of values less than x)/(total number of values) · 100
• Quartiles:
  – Q1 = P25, Q2 = P50 = median, Q3 = P75
• If x is the L-th value of the sorted list: k = (L - 1)/n · 100
Example: data point 48 in the Smoker data is 9th in the sorted table, n = 40.
(9 - 1)/40 · 100 = 20, so 48 is at the 20th percentile (P20), in the first quarter of the data (below Q1).
Data point 234 is 28th. k = (28 - 1)/40 · 100 = 67.5, about the 68th percentile, in the third quarter of the data (between Q2 and Q3).
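The two measures of relative standing can be sketched in Python as follows. The function names `z_score` and `percentile_of` are illustrative; the percentile follows the slide's definition (values strictly less than x, divided by the total number of values, times 100).

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def percentile_of(x, data):
    """Percentile of x: count of values strictly less than x, as a percentage."""
    below = sum(1 for v in data if v < x)
    return below * 100 / len(data)


smokers = [
    1, 0, 131, 173, 265, 210, 44, 277, 32, 3,
    35, 112, 477, 289, 227, 103, 222, 149, 313, 491,
    130, 234, 164, 198, 17, 253, 87, 121, 266, 290,
    123, 167, 250, 245, 48, 86, 284, 1, 208, 173,
]

print(round(z_score(78, 69, 2.8), 2))     # Jordan: 3.21
print(round(z_score(76, 63.6, 2.5), 2))   # Lobo: 4.96
print(percentile_of(48, smokers))         # 20.0
print(percentile_of(234, smokers))        # 67.5
```

Note that `percentile_of` does not need the data to be sorted, since it only counts the values below x; the same call can be used to check the exercise for the data points 1, 130 and 250.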
Finding the value of the kth percentile
• Conversely, if you are looking for the data value at the kth percentile:
L = (k/100) · n
  – n: total number of values
  – k: percentile being used
  – L: locator that gives the position of a value (for the 12th value in the sorted list, L = 12)
  – Pk: kth percentile (ex: P25 is the 25th percentile)
Procedure:
1. START: sort the data
2. Compute L = (k/100) · n
3. Is L a whole number?
  – Yes: Pk is the average of the Lth and (L+1)st values
  – No: round L up; Pk is the Lth value

Example: class table (n = 20), sorted
pos  1-10:  57 64 66 69 71 73 74 78 78 82
pos 11-20:  83 83 84 85 86 89 92 92 95 98
• Find the value of the 21st percentile:
  – L = 21/100 · 20 = 4.2
  – not a whole number: round up to the 5th data point
  – P21 = 71
• Find the 80th percentile:
  – L = 80/100 · 20 = 16
  – whole number: average the 16th and 17th values
  – P80 = (89 + 92)/2 = 90.5

Exploratory Data Analysis
Exploratory data analysis is the process of using statistical tools (graphs, measures of center and variation) to investigate data sets in order to understand their characteristics.
Outlier: an extreme value (outliers are often typos made when collecting data, but not always).
• can have a dramatic effect on the mean
• can have a dramatic effect on the standard deviation
• can have a dramatic effect on the histogram
[Figure: box plot on a 0 to 500 axis, marking Min, Q1, Median, Q3 and Max]
• Box plots have less information than histograms and stem-and-leaf plots
• Not that often used with only one set of data
• Good when comparing many different sets of data
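The locator procedure for Pk, and the five values a box plot displays, can be sketched in Python. The names `percentile_value` and `five_number_summary` are illustrative, and the sketch assumes an integer k with 0 < k < 100; the locator is kept in integer arithmetic so that the "is L a whole number?" test is exact.

```python
def percentile_value(data, k):
    """Value of the k-th percentile using the locator L = (k/100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be an integer between 1 and 99")
    values = sorted(data)
    n = len(values)
    whole, remainder = divmod(k * n, 100)   # L = whole + remainder/100
    if remainder == 0:
        # L is a whole number: average the L-th and (L+1)-st values
        return (values[whole - 1] + values[whole]) / 2
    # L is not a whole number: round up and take that value
    return values[whole]


def five_number_summary(data):
    """Min, Q1 = P25, median = P50, Q3 = P75 and max, as shown in a box plot."""
    return (min(data), percentile_value(data, 25), percentile_value(data, 50),
            percentile_value(data, 75), max(data))


grades = [83, 98, 57, 73, 78, 86, 82, 95, 92, 66,
          69, 71, 64, 85, 89, 84, 83, 92, 74, 78]

print(percentile_value(grades, 21))    # 71
print(percentile_value(grades, 80))    # 90.5
print(five_number_summary(grades))     # (57, 72.0, 82.5, 87.5, 98)
```

Statistical software often uses other interpolation conventions for percentiles, so results from other tools may differ slightly from this rule.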