Stat 226 – Introduction to Business Statistics I
Spring 2009, Section A
Professor: Dr. Petrutza Caragea
Tuesdays and Thursdays 9:30-10:50 a.m.
Chapter 1, Section 1.2

Review: Chapter 1, Section 1, Describing distributions with graphs
- Pie charts
- Bar graphs
- Pareto graphs
- Histograms
- Stemplots

Match the histograms to the best description
1. Numbers of medals won by countries in the 1992 Winter Olympics.
2. Last digit of each of 500 students' social security numbers.
3. Age at death of a sample of 45 persons.
4. The SAT scores of 500 students.
5. The heights in inches of 500 college students.
6. Time on hold at a help line.

Chapter 1.2 – Describing Distributions with Numbers
We want to describe NUMERICALLY
1. the CENTER of the data
2. the SPREAD of the data

Measuring the center:
- associated with locating the "middle" of the data
- finding the value that is most typical for the data
- three common measures:
  mean – average value of all data points
  median – "middle" value of all data points
  mode – data point(s) with highest frequency (most popular)

Chapter 1.2 – Sample mean
Notation: x̄
The sample mean of a set of observations $x_1, x_2, \dots, x_n$ is the arithmetic average of all observations:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Example: number of sick days employees took in a small local business
0, 1, 2, 0, 4, 0, 1, 2, 3

Sometimes the mean is not an appropriate measure of the center, because it does not reflect a typical value of the data. This is almost always the case when the data contain unusually large or small observations (called outliers).

Example: starting salaries of 5 people after graduating from college
35,000; 37,000; 35,000; 33,000; 210,000
x̄ = (35,000 + 37,000 + 35,000 + 33,000 + 210,000) / 5 = 70,000
70,000 is certainly not a typical starting salary for these 5 people; it is "just" the average.

After removing the salary of 210,000, the new sample mean is
x̄new = (35,000 + 37,000 + 35,000 + 33,000) / 4 = 35,000

Note: The sample mean x̄ is sensitive to outliers, i.e. it gets pulled toward the extreme values in a data set.

Chapter 1.2 – Median
A measure of center that is more robust against outliers is the so-called median.
Notation: median, M
The median is the value that occupies the middle position when all observations are ordered from smallest to largest.

Finding the median:
1. Order all observations from smallest to largest.
2. Assess whether the total number of observations n is odd or even.
3. Locate the middle value of the data:
   odd: the median M is the middle observation in the ordered list, i.e. the $\left(\frac{n+1}{2}\right)$th observation.
   even: the median M is the average of the two middle observations in the ordered list, i.e. the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations.

Examples:
Example 1: Salary data, ordered
33,000; 35,000; 35,000; 37,000; 210,000
Example 2:
33,000; 35,000; 35,000; 37,000; 39,000; 210,000
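The salary examples above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper name `median_by_position` is hypothetical and simply follows the three steps listed under "Finding the median".

```python
from statistics import mean

# Starting salaries from the slide example (the last value is the outlier).
salaries = [35_000, 37_000, 35_000, 33_000, 210_000]


def median_by_position(values):
    """Median via the slide's steps: order, check odd/even, take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the (n+1)/2-th observation (1-based), so index (n+1)/2 - 1
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


mean_all = mean(salaries)                   # 70000
mean_without_outlier = mean(salaries[:-1])  # 35000
median_example_1 = median_by_position(salaries)             # 35000
median_example_2 = median_by_position(salaries + [39_000])  # 36000.0

print("mean with outlier:   ", mean_all)
print("mean without outlier:", mean_without_outlier)
print("median, Example 1:   ", median_example_1)
print("median, Example 2:   ", median_example_2)
```

Dropping the single salary of 210,000 halves the mean, while the median of Example 1 already sits at 35,000 with the outlier included.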
Chapter 1.2 – Mean vs. Median
Note: Salary data in Example 1
x̄ = 70,000 and M = 35,000 ⇒ the median M is clearly less influenced by outliers.

Example: sick days data: 0, 1, 2, 0, 4, 0, 1, 2, 3; add one more data point, x = 56.

- We should not conclude, though, that the median should always be preferred over the mean simply because of its robustness against outliers.
- The mean and the median measure the center of a data set in different ways; both are useful, depending on the situation/application.
- If costs are directly associated with the number of sick days, then the mean is clearly the better measure, as it takes the extreme observation into account.
- If we are just interested in the typical number of sick days for all employees, the median is probably the more representative measure.

Relation between the shape of a distribution and mean/median
The more symmetric a distribution is, the closer the mean and the median will be.
[Figure: three example distributions: perfectly symmetric, skewed to the right, skewed to the left]

Chapter 1.2 – Mode
The mode is the value of the variable that occurs most frequently. It is most useful for categorical data with a relatively small number of possible values.
Example: Stat 226 – classification
Fr – 4, So – 31, J – 41, S – 7

Chapter 1.2 – Measuring spread/variation
- Variation is always present in real data.
- It is important to know how spread out the data are, as this tells us something about the behavior of a variable.
- Furthermore, describing data using only measures of location/center is not sufficient: totally different data sets can still have the same mean/median.

Example: Number of sick days for 9 employees, two data sets
Data set 1: 0, 0, 0, 1, 1, 2, 2, 3, 4
Data set 2: 0, 0, 0, 0, 0, 0, 0, 0, 13

Measures of spread
Quartiles Q1, Q2, Q3
- are 3 numbers that divide the ordered observations into 4 equally sized groups (i.e. each group contains 25% of all observations)
- describe the position of a specific data value in relation to the rest of the data

Chapter 1.2 – Quartiles
Finding quartiles:
Q1: median of all observations to the left of the median M
Q2: corresponds to the median M
Q3: median of all observations to the right of the median M

Example: sick days (total number of observations is odd)
0, 0, 0, 1, 1, 2, 2, 3, 4

If an additional observation of x = 56 is added (the total number of observations is now even):
0, 0, 0, 1, 1, 2, 2, 3, 4, 56

Quartiles are also less influenced by outliers.
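The effect of the extra observation x = 56 on the sick-days data can be checked numerically. The following is a minimal Python sketch, not part of the original slides. The helper name `quartiles` is hypothetical, and it assumes that the median itself is left out of both halves when n is odd, which is one common textbook convention for "observations to the left/right of the median".

```python
from statistics import mean, median

sick_days = [0, 0, 0, 1, 1, 2, 2, 3, 4]   # n = 9 (odd)
with_outlier = sick_days + [56]           # n = 10 (even)


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from the slides.

    Assumption: for odd n the median observation belongs to neither half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations to the left of M
    upper = ordered[half + n % 2:]    # observations to the right of M
    return median(lower), median(ordered), median(upper)


for label, data in (("9 observations", sick_days),
                    ("with x = 56", with_outlier)):
    q1, m, q3 = quartiles(data)
    print(f"{label}: mean={mean(data):.2f}, Q1={q1}, M={m}, Q3={q3}")
```

With this rule the mean jumps from about 1.44 to 6.9 once 56 is added, while M moves only from 1 to 1.5 and Q3 from 2.5 to 3. Statistical software may use other quartile rules and report slightly different values.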
Chapter 1.2 – Five-number summary
- convenient tool to describe both the center and the spread in a data set

5-number summary
The 5-number summary consists of the following measures:
Min   Q1   Median   Q3   Max

Example: sick days
0, 0, 0, 1, 1, 2, 2, 3, 4
Note: Min and Q1 coincide here due to the nature of the data (this is more the exception than the rule).

Chapter 1.2 – Boxplots
- A graphical display of the 5-number summary is a so-called boxplot.
- Boxplots can be either vertical or horizontal.
- Side-by-side boxplots can be used to compare different groups.

Example: data on the number of surgeries performed by male and female surgeons in a hospital
female: 5, 7, 10, 14, 18, 19, 25, 29, 31, 32
male: 20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86
[Figure: side-by-side boxplots for female and male surgeons]

Using boxplots to describe distributions
- less variability among female surgeons
- the distribution is also more symmetric for female surgeons
- more variability among male surgeons
- the mean/median is much higher for male surgeons than for female ones
In general:
- for a symmetric distribution: Q1 and Q3 are about equally far from M
- for a distribution skewed to the right: Q3 will be farther from M than Q1 (and likewise Max farther than Min)
- for a distribution skewed to the left: Q1 will be farther from M than Q3 (and likewise Min farther than Max)

Chapter 1.2 – More measures of spread
Measuring spread: the range, IQR, variance and standard deviation
We need to describe the amount of spread or variability that is present in the data.
Note: The range, the variance and the standard deviation take the value zero only if all observations in the data set have the same value! (The IQR is zero whenever Q1 = Q3, which can happen even if a few observations differ.)

range, R
The range R is the difference between the highest and the lowest value: R = Max − Min.
Example: number of surgeries performed by the 16 male surgeons
Note: the range shows the full spread of the data, but it depends only on the smallest and largest observations, which could be outliers!

Chapter 1.2 – Interquartile range
Alternatively, we can use the so-called interquartile range, IQR:
IQR = Q3 − Q1,
corresponding to the range of the middle 50% of the data.
Example: 16 male surgeons

Chapter 1.2 – More measures of spread: Sample variance
Improve the description of SPREAD by looking at the deviation of each single observation from the mean, i.e. how far an observation is from the overall mean x̄.

sample variance, s²
The sample variance is the sum of the squared deviations of the observations from the sample mean x̄, divided by n − 1:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Chapter 1.2 – Sample standard deviation
standard deviation, s
The standard deviation is the positive square root of the variance s²:
$s = \sqrt{s^2}$

Chapter 1.2 – Variance and Standard deviation
Note: the variance s² (and hence s) can only be greater than or equal to zero, as it is based on squared deviations.
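The 5-number summary, the range and the IQR for the surgeon data above can be computed as follows. This is a minimal Python sketch, not part of the original slides. The helper name `five_number_summary` is hypothetical, and the quartiles use the median-of-halves rule from the quartiles slide (assuming the median is excluded from both halves when n is odd).

```python
from statistics import median

female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]


def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max) with median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])           # left of the median
    q3 = median(ordered[half + n % 2:])   # right of the median
    return ordered[0], q1, median(ordered), q3, ordered[-1]


for name, data in (("female", female), ("male", male)):
    low, q1, m, q3, high = five_number_summary(data)
    print(f"{name}: Min={low}, Q1={q1}, M={m}, Q3={q3}, Max={high}, "
          f"R={high - low}, IQR={q3 - q1}")
```

Under this rule the female group gives (5, 10, 18.5, 29, 32) with R = 27 and IQR = 19, and the male group gives (20, 27.5, 35, 47, 86) with R = 66 and IQR = 19.5. The ranges differ far more than the IQRs because the range is driven by the two largest male values, 85 and 86.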
- s is usually reported instead of s² because s has the same units of measurement as the observations in the data set
- s² and s measure the spread about the sample mean x̄
- s² = s = 0 only if all observations have the same value
- s² and s are strongly influenced by outliers; one outlier can cause s² and s to increase drastically

Example (sample standard deviation): number of surgeries, female surgeons
5, 7, 10, 14, 18, 19, 25, 29, 31, 32

Choosing a numerical summary
The choice of an appropriate measure of center/spread relies heavily on
- the shape of the distribution
- the presence of outliers
⇒ If the data are reasonably symmetric and no outliers are present, the sample mean x̄ and the standard deviation s can be used.
⇒ If the data are skewed and/or outliers are present, the 5-number summary should be used.
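Returning to the surgeon data, s² and s can be computed directly from the definition (sum of squared deviations from x̄, divided by n − 1). The following is a minimal Python sketch, not part of the original slides. The helper name `sample_variance` is hypothetical, and the male data are repeated from the boxplot example so that the block is self-contained.

```python
from math import sqrt
from statistics import mean, variance

female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)


for name, data in (("female", female), ("male", male)):
    s2 = sample_variance(data)
    s = sqrt(s2)
    # Cross-check against the standard library's sample variance.
    assert abs(s2 - variance(data)) < 1e-9
    print(f"{name}: mean={mean(data)}, s^2={s2:.2f}, s={s:.2f}")
```

For the female surgeons this gives x̄ = 19, s² = 896/9 ≈ 99.56 and s ≈ 9.98. For the male surgeons it gives x̄ = 41, s² = 5972/15 ≈ 398.13 and s ≈ 19.95. The two male values 85 and 86 alone contribute 3,961 of the 5,972 in the sum of squared deviations, which illustrates how strongly s² and s react to extreme observations.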