CB2200 Week 3: Introduction to Statistics Exercises
TUESDAY 23

Concept Review
● Measures of Central Tendency: Mean and Median (numerical variables only) and Mode (also usable for categorical variables, although central tendency is not the best way to describe a categorical variable)
● Measures of Variation: Range, Interquartile Range (IQR), Variance and Standard Deviation
● Distribution Shape of the curve
● Five-Number Summary and Boxplot

Mean
● Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
● Population mean: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

Median
● Middle value of the data ranked from smallest to largest; with an even number of values, the average of the two middle values

Mode
● Value that occurs most often
● Not affected by extreme values (outliers)
● Used for both numerical and categorical data
● There may be no mode
● There may be several modes

Comparison Table
Measure | Variable Type | Can be nonexistent? | Can be more than one? | Affected by outliers?
Mean | Numerical | No | No | Yes
Median | Numerical | No | No | No
Mode | Numerical & Categorical | Yes | Yes | No

Range
● Simplest measure of variation
● Difference between the largest and the smallest values
● Disadvantage: ignores the way in which the data are distributed
● Does not reflect the distribution shape

Quartiles
Calculating the quantile position r for n values: for Q1, r = (n + 1)/4; for Q3, r = 3(n + 1)/4
● E.g. n = 3: Q1 position r = (3 + 1)/4 = 1, a whole number
● If r is a whole number, it is the ranked position to use (for Q1 and Q3 this requires n to be odd, with n + 1 divisible by 4)
● When r is not a whole number, the following linear interpolation steps can be used to determine the quantile value (this is always the case for Q1 and Q3 when n is even)
1. d = r - [r], where [r] is the integer part of r
■ E.g. d = 8.75 - 8 = 0.75
2. Quantile value = $X_{[r]} + d\,(X_{[r]+1} - X_{[r]})$, where $X_{[r]}$ is the value at ranked position [r]
→ E.g. $X_8 + 0.75\,(X_{8+1} - X_8) = X_8 + 0.75\,(X_9 - X_8)$

Interquartile Range
● Interquartile range IQR = Q3 - Q1; it measures the spread in the middle 50% of the data
● Also called the mid-spread because it covers the middle 50% of the data
● Not influenced by outliers or extreme values
● Values that fall outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are usually considered outliers

Variance
● The most commonly preferred measure of variation because of its mathematical properties
● Shows the variation of the values around the mean
■ Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
■ Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard Deviation
● It is the square root of the variance
● It has the same units as the original data
■ Sample standard deviation: $S = \sqrt{S^2}$
■ Population standard deviation: $\sigma = \sqrt{\sigma^2}$
■ A smaller standard deviation means most values of X are closer to the mean
→ The more the data are concentrated, the smaller the range, variance, and standard deviation
→ Safer, more stable and more reliable
■ A larger standard deviation means the values of X are more spread out
→ The more the data are spread out, the greater the range, variance, and standard deviation
→ Higher volatility, unstable

Distribution Shape
● Position of the mean and median for a unimodal continuous distribution
■ Negatively skewed (left-skewed): mean < median
■ Symmetric: mean = median
■ Positively skewed (right-skewed): mean > median
● If the data are skewed, the median may be a more appropriate measure of central tendency

The Five-Number Summary and Boxplot
● The five numbers that help describe the center, spread and shape of the data are: smallest value, Q1, median, Q3, largest value
● Boxplot [figure: boxplot]

Distribution Shape and Boxplot
[figure: distribution shape and boxplot]

Q7
The following is a set of data for a population of size N = 10:
7, 5, 11, 8, 3, 5, 2, 1, 10, 8
a) Compute the population mean (µ).
µ = (7 + 5 + 11 + 8 + 3 + 5 + 2 + 1 + 10 + 8)/10 = 60/10 = 6
b) Compute the population standard deviation (σ).
Population variance: σ² = [(7 - 6)² + ... + (8 - 6)²]/10 = 102/10 = 10.2
Population standard deviation: σ = √10.2 = 3.1937

Q8
A food inspector, examining 10 bottles of a certain brand of honey, obtained the following percentages of impurities (n = 10):
23.5 19.8 21.3 22.6 19.4 18.2 24.7 21.9 20.0 21.1
What are the mean and standard deviation of this sample?

Q9
The data contain the price for two tickets with online service charges, large popcorn, and two medium soft drinks at a sample of six theater chains (n = 6):
$36.15 $31.00 $35.05 $40.25 $33.75 $43.00
a) Compute the mean and median.
● Mean: X̄ = (36.15 + 31.00 + 35.05 + 40.25 + 33.75 + 43.00)/6 = 219.20/6 = $36.53
● Median:
1. Order the sample data in ascending order (smallest to largest).
2. $31.00, $33.75 (2nd), $35.05, $36.15, $40.25 (5th), $43.00
3. Median = ($35.05 + $36.15)/2 = $35.60
b) Compute the variance, standard deviation and range.
● Variance: S² = [(31.00 - 36.53)² + ... + (43.00 - 36.53)²]/(6 - 1) = 19.27
● Standard deviation: S = √19.27 = 4.39
● Range = X_largest - X_smallest = 43.00 - 31.00 = 12.00
c) Are the data skewed? If so, how?
→ Since the mean ($36.53) is only slightly greater than the median ($35.60), the data are slightly right-skewed.
d) Based on the results of (a) through (c), describe the data.
1. Describe the range: there is a $12 difference between the most expensive and the least expensive outlet.
2. Compare the mean and median: the prices vary around $36.53 (mean), with half of the outlets being more expensive than $35.60 (median).
3. Mention the middle half of the prices (Q1 and Q3): the middle half of the prices fall between $33.75 (2nd) and $40.25 (5th).
$31.00, $33.75 (2nd), $35.05, $36.15, $40.25 (5th), $43.00
Quartile | Quartile Position
Q1 | r = (n + 1)/4 = 7/4 = 1.75, rounded to the 2nd ranked value
Q3 | r = 3(n + 1)/4 = 3(7)/4 = 5.25, rounded to the 5th ranked value

Q10
The data contain the total fat, in grams per serving, for a sample of 20 chicken sandwiches from fast-food chains (n = 20):
7 8 4 5 16 20 20 24 19 30 23 30 25 19 29 29 30 30 50 56
a) Compute the first quartile (Q1), the third quartile (Q3), and the interquartile range.
1. Reorder the data → 4 (Min), 5, 7, 8, 16 (5th), 19, 19, 20, 20, 23, 24, 25, 29, 29, 30 (15th), 30, 30, 30, 50, 56 (Max)
Quartile | Quartile Position | Quartile Value
Q1 | r = (20 + 1)/4 = 5.25, [r] = 5 | 16 + 0.25(19 - 16) = 16.75
Q3 | r = 3(20 + 1)/4 = 15.75, [r] = 15 | 30 + 0.75(30 - 30) = 30
Interquartile range = Q3 - Q1 = 30 - 16.75 = 13.25
b) List the five-number summary.
c) Construct a boxplot and describe the shape.

Q11
The following data are the number of vitamin supplements sold by a health food store in a sample of 11 days (n = 11):
19 19 20 20 20 22 23 25 26 27 30
a) What are the average (X̄) and standard deviation (S) of the daily sales of vitamin supplements at the health food store?
b) Work out a five-number summary of the data in the sample. Comment on the distribution of the sample data.

Formula Reference
Measure | Sample | Population
Mean | X̄ | µ
Variance | S² | σ²
Standard Deviation | S | σ
Size | n | N

Summary
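As a recap of the sample and population measures in the Formula Reference, the following is a minimal Python sketch, not an official course solution. The function names (`mean`, `variance`, `std_dev`) and the `population` flag are illustrative choices, not part of the course material. The only difference between the two cases is the divisor: N for a population and n - 1 for a sample.

```python
import math

def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def variance(values, population=False):
    """Sum of squared deviations from the mean, divided by N (population)
    or by n - 1 (sample)."""
    m = mean(values)
    squared_deviations = sum((x - m) ** 2 for x in values)
    divisor = len(values) if population else len(values) - 1
    return squared_deviations / divisor

def std_dev(values, population=False):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(values, population))

# Q7: a population of size N = 10
q7 = [7, 5, 11, 8, 3, 5, 2, 1, 10, 8]
print(mean(q7))                                  # 6.0
print(round(variance(q7, population=True), 4))   # 10.2
print(round(std_dev(q7, population=True), 4))    # 3.1937

# Q9: a sample of n = 6 prices
q9 = [36.15, 31.00, 35.05, 40.25, 33.75, 43.00]
print(round(mean(q9), 2))             # 36.53
print(round(variance(q9), 2))         # 19.27
print(round(std_dev(q9), 2))          # 4.39
print(round(max(q9) - min(q9), 2))    # 12.0 (range)
```

The same functions can be applied to the Q8 and Q11 samples by leaving `population` at its default of `False`.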
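The quartile steps from the Concept Review can also be sketched in Python. This is a hedged illustration under stated assumptions: the helper names (`quantile_by_position`, `five_number_summary`, `outlier_fences`) are hypothetical, the ranked position is taken as r = q(n + 1), and the code always interpolates as in Q10 rather than rounding r to the nearest rank as in Q9.

```python
def quantile_by_position(values, q):
    """Quantile at ranked position r = q * (n + 1), using linear
    interpolation when r is not a whole number."""
    ordered = sorted(values)
    n = len(ordered)
    r = q * (n + 1)
    whole = int(r)      # [r], the integer part of r
    d = r - whole       # fractional part of r
    if whole < 1:       # position falls before the first value
        return ordered[0]
    if whole >= n:      # position falls at or after the last value
        return ordered[-1]
    lower = ordered[whole - 1]   # X[r] (ranks are 1-based)
    upper = ordered[whole]       # X[r]+1
    return lower + d * (upper - lower)

def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value."""
    ordered = sorted(values)
    return (
        ordered[0],
        quantile_by_position(ordered, 0.25),
        quantile_by_position(ordered, 0.50),
        quantile_by_position(ordered, 0.75),
        ordered[-1],
    )

def outlier_fences(values):
    """Limits Q1 - 1.5*IQR and Q3 + 1.5*IQR; values outside them
    are usually considered outliers."""
    q1 = quantile_by_position(values, 0.25)
    q3 = quantile_by_position(values, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Q10: total fat for a sample of n = 20 chicken sandwiches
q10 = [7, 8, 4, 5, 16, 20, 20, 24, 19, 30,
       23, 30, 25, 19, 29, 29, 30, 30, 50, 56]
q1 = quantile_by_position(q10, 0.25)
q3 = quantile_by_position(q10, 0.75)
print(q1)         # 16.75
print(q3)         # 30.0
print(q3 - q1)    # 13.25 (interquartile range)

low, high = outlier_fences(q10)
print([x for x in q10 if x < low or x > high])  # values outside the fences
```

Calling `five_number_summary` on the Q10 or Q11 data gives the five values needed to draw the boxplot by hand.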