MAT 107 Chapter 2 Descriptive Statistics
(Covered in a different order and with different emphasis and technique than your text!)

Graphical Displays

For summarizing categorical data the primary displays are Pie Charts and Bar Graphs.

Pie Charts: circles where each “slice” represents a category and the size of each slice corresponds to the proportion or percentage of observations in that category.

Bar Charts: display a vertical bar for each category. The height of the bar is the percentage or proportion of observations in that category.

The proportion of observations in a class or category is the frequency of observations that fall in the class divided by the total number of observations. The percentage is the proportion times 100. Proportions and percentages are both known as relative frequencies. If we let n be the sample size and f be the frequency of a class, then RF = f/n.

A frequency table is a listing of all the classes and their corresponding frequencies. It is necessary to create a frequency table before making a pie chart, bar graph, or histogram (later).

Examples:
Nominal: Cell phone carrier.
Ordinal: Year at college.

Graphs for quantitative variables

Dot plots: a dot for each observation is placed above the appropriate number on a number line.

Stem and Leaf Plots: each observation is represented as a stem and a leaf. The stem usually consists of all the digits of the number except for the last one, which is the leaf. Sometimes stems can be split.

Dot plots and stem and leaf plots are only reasonable for small data sets. For larger data sets a Histogram is used. (This differs from what your book states.)

A Histogram is a graph that uses bars to portray the frequencies or relative frequencies of the possible outcomes for a quantitative variable. If you are asked to make a histogram, always make a frequency distribution first!

Steps for constructing a Frequency Distribution and Histogram
1. Choose a range of values that captures all of the data.
Then divide the range of the data into classes: non-overlapping intervals of equal length. The endpoints are the boundaries. For a discrete set of data with a small number of values, use the actual values as the classes.
2. The text and calculator use the left-endpoint convention, meaning that an observation equal to the lower (left) endpoint is included in that class and the upper (right) endpoint is not included. Intervals have the form [a, b), like [1, 2), [2, 3), ... What is important is that every observation is in one and only one class.
3. Guideline: there should be 5 – 20 intervals and you should use “nice” numbers like [10, 20) or [0, .5), [.5, 1), not numbers like [10.387, 12.451), ...
4. Count the number of observations in each class, forming a frequency table.
5. Compute the relative frequencies by dividing the frequency in each class by the total sample size.
6. Find the cumulative relative frequency of each class: the sum of all the relative frequencies up to and including that class.
7. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each class or value with height equal to its frequency (or percentage). The vertical axis should be scaled and labeled with either the raw or relative frequencies. Both the horizontal and vertical axes should be scaled so that all the classes and frequencies fit and are distinguishable.

Histogram example

Heights of students (inches) from a previous class:
Heights 57 59 66 60 70 47 61 55 57 71
Heights 74 48 70 67 62 62 58 55 62 68

We have 20 data points and want to break them up into 4 or 5 classes. The range of the data = max – min = 74 – 47 = 27. Note that there are 28 integers between 47 and 74 if you include the endpoints. So 4 classes that are 7 units long will work fine. The classes would then be [47, 54), [54, 61), [61, 68), [68, 75). Since we have integer data, the following classes are equivalent: [47, 53], [54, 60], [61, 67], [68, 74].
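The choice of classes can be checked with a short Python sketch. This is only an illustration, not part of the course materials: the function name frequency_table and its parameters are made up here, and the classes follow the left-endpoint convention [a, b) described in step 2.

```python
# Heights (inches) from the example above.
heights = [57, 59, 66, 60, 70, 47, 61, 55, 57, 71,
           74, 48, 70, 67, 62, 62, 58, 55, 62, 68]


def frequency_table(data, start, width, n_classes):
    """Tally data into classes [start, start + width), [start + width, ...), etc.

    Returns one row per class: (lower, upper, frequency,
    relative frequency, cumulative relative frequency).
    """
    n = len(data)
    rows = []
    cumulative = 0
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        # Left endpoint included, right endpoint excluded.
        freq = sum(1 for x in data if lower <= x < upper)
        cumulative += freq
        rows.append((lower, upper, freq, freq / n, cumulative / n))
    return rows


table = frequency_table(heights, start=47, width=7, n_classes=4)
for lower, upper, freq, rel, cum in table:
    print(f"[{lower}, {upper})  freq={freq}  rel={rel:.2f}  cum={cum:.2f}")

# Every observation should land in exactly one class,
# so the frequencies must add up to the sample size.
total = sum(row[2] for row in table)
print("total =", total)
```

If the frequencies did not add up to n, either the range would fail to capture all of the data or the classes would overlap.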
Note that all the data points are included and no point is in more than one class. The frequency distribution would then be:

Class      Freq   Rel. Freq   Cum. Rel. Freq
[47, 53]   2      .10         .10
[54, 60]   7      .35         .45
[61, 67]   6      .30         .75
[68, 74]   5      .25         1.00

Note that the total is 20. Now just make a bar graph of the frequency distribution.

[Figure: Histogram of Heights. Horizontal axis: Height in inches, with classes [47, 53], [54, 60], [61, 67], [68, 74]. Vertical axis: Number (Freq), scaled from 0 to 8.]

Note that the axes are labeled, the histogram is titled, and there are no gaps between the bars.

Note that Excel uses right-endpoint inclusion.

Some books use histograms with unequal class widths. This is almost always a bad idea!

For Quantitative data there are 3 kinds of plots:
Dot plots
Stem and leaf plots
Histograms

Dot plots and stem and leaf plots are used for small data sets (under 50 observations). Histograms are more flexible because the data are grouped into classes.

Histograms, dot plots, and stem and leaf plots allow us to see the shape of the distribution:
1. Outlier detection: rare or unusual observations.
2. The mode, or most common observation or class: unimodal vs. bimodal and multimodal.
3. Symmetry of the dataset.
a. Symmetric: when you divide the histogram down the middle, the left side is a mirror image of the right side.
b. Skewed left: the left tail of the histogram is longer than the right tail. The small observations are more extreme than the large observations.
c. Skewed right: the right tail of the histogram is longer than the left tail. The large observations are more extreme than the small observations.

Ex 2.7 page 56. Given: stem and leaf plot of heights of male semi-professional soccer players.
60; 60.5; 61; 61; 61.5; 63.5; 63.5; 63.5; 64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5; 70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71; 72; 72; 72; 72.5; 72.5; 73; 73.5; 74;

This stem and leaf plot is terrible. It is not very clear what is going on here. See Example2-7.xlsx

In class ex: Misleading your audience with statistics

Guidelines for Constructing Effective Graphs
1. Label both axes and provide a title.
2. Compare relative sizes accurately; scale correctly! The Y axis should start at 0.
3. Use standard shapes and symbols.
4. Displaying more than one group on a single graph can be difficult.

Don’ts
1. Do not use scale breaks in any of your axes!
2. When making a histogram, use classes and bars of the same width.
3. Do not make inferences about the population from one simple statistic like the mean, especially when you have a small sample size.

Numerical Summary Measures

Mean: average
Sample (arithmetic) mean: the average of the sample.
This is a capital sigma: ∑. It means to take the sum.
Symbolically we write the formula for the sample mean as:
xbar = ∑x / n (xbar is read x-bar)
The population mean is denoted by µ, which we will talk more about later.

Median: the middle
Sample median: the middle of the sample.
Symbolically we will write the sample median as x~ (read x-squiggle).
To find the sample median:
1. Sort the n observations (in ascending order).
2. If n is odd, let k = (n + 1) / 2. Then x~ = the kth observation.
3. If n is even, let k = n / 2 and j = (n + 2) / 2. Then x~ is the average of the kth and the jth observations.

Outlier: an observation that falls outside the pattern of the data.

Ex. The following is a sample of 10 scores from a test given last semester.
75 84 86 68 93 97 32 90 80 70
Find the mean. Find the median. Make a dot plot. Are there any outliers? If so, identify them.

∑x = 775 and n = 10, so the mean = 775 / 10 = 77.5
Mean = 77.5

Sort the data: 32 68 70 75 80 84 86 90 93 97
n = 10, 10 / 2 = 5 and 12 / 2 = 6, so the median is the average of the 5th and 6th observations = (80 + 84) / 2 = 82.
Median = 82

Dot plot: [Figure: dot plot of the 10 scores on a number line labeled from 30 to 90.]

32 seems to be an outlier.

How can outliers affect the mean and the median? Assume that the person who got the 32 drops the class because the student got really sick. The remaining (sorted) data look like:
68 70 75 80 84 86 90 93 97
So now n = 9 and ∑x = 743, so
Mean = 743 / 9 ≈ 82.6
Median = 5th observation = 84.
The mean increased about 5 points, but the median only increased 2 points. The mean is not a resistant measure, whereas the median is. A measure is resistant if extreme observations have little if any effect on it.

To calculate the mean and the median, as well as some other important statistics, on the TI83 / TI84:
1. Enter the data into a list. Hit [STAT]. Choose 1:Edit. In L1, enter the data. Ex. 75 [ENTER] 84 [ENTER] … 70 [ENTER]
2. Hit [STAT]. Hit the right arrow to highlight CALC. Choose 1:1-Var Stats and hit [ENTER]. The screen should read: 1-Var Stats. Then enter L1 by hitting [2nd] [1], so that the screen reads: 1-Var Stats L1. Hit [ENTER].

The output should look like:
1-Var Stats
xbar = 77.5
∑x = 775
∑x^2 = 63183
Sx = 18.62047857
σx = 17.66493702
n = 10
(to see more hit the down arrow)
minX = 32
Q1 = 70
Med = 82
Q3 = 90
maxX = 97

As a rule of thumb:
if mean = median then the data are symmetric.
if mean > median then the data are skewed right.
if mean < median then the data are skewed left.
In the example above, once the outlier is removed the mean (82.6) and the median (84) are close, so the data are roughly symmetric. With the outlier 32 included, mean = 77.5 < 82 = median, which suggests skew to the left.

The mode of a data set of n observations is the value that occurs most often. If each value occurs the same number of times then the data set has no mode. If there is a tie between 2 values then we say the data set is bimodal.
If there is a tie between 3 values then the data set is said to be trimodal.
The above data set does not have a mode. There are 6 values that occur twice and 3 values that occur once. This is why the mode is rarely used.

EX: How many math classes have you taken? Make a Frequency Table. Find the Mean, Median and Mode.

The sample proportion of successes is denoted p̂ and read p-hat. It is also called the relative frequency of successes. Note that “success” carries no connotation: it is simply the outcome being counted.

Ex. In a random sample of 50 students, 37 had failed a class. If we are counting the number of students who failed a class, then p̂ = 37/50 = .74

Measures of Variability

First we measured the center of the data: the mean and the median. We also looked at the shape of the data: unimodal or bimodal, symmetric or skewed. Now we look at how spread out the data are.

The first measure is simple but does not tell us much about the spread. The range is the difference between the largest and smallest observations.
Range = max - min

A better measure would summarize the deviations from the center of the data. A deviation of an observation x from the mean xbar is the difference (x - xbar). A deviation is positive if x is bigger than xbar. A deviation is negative if x is smaller than xbar. Unfortunately, if we sum all the deviations of any data set we get 0, because of how xbar is defined. So before we sum up the deviations, we square them, which makes them all non-negative. The average of these squared deviations (for a sample, the sum is divided by n - 1 rather than n) is called the variance and is denoted by s^2. The square root of s^2 is s, which is called the standard deviation. The bigger the standard deviation, the more spread out the data are.

Ex. A random sample of 10 grades is given below. Calculate the mean and standard deviation of the sample.
Grades (x)   (x - xbar)   (x - xbar)^2
95            17.8         316.84
87             9.8          96.04
45           -32.2        1036.84
76            -1.2           1.44
76            -1.2           1.44
82             4.8          23.04
68            -9.2          84.64
63           -14.2         201.64
92            14.8         219.04
88            10.8         116.64
Sum            0.000      2097.600

xbar = 77.2
s^2 = 2097.600 / 9 = 233.067
s = 15.267

We will not use the formula much because your calculator will do it for you. Remember that under 1-Var Stats there was Sx, which is the standard deviation. Technically this is the sample standard deviation, which is what we want.

Interpreting the standard deviation

The more spread out the data are, the greater s is. In general, interpretation can be difficult mathematically, so we will deal with a special case which is easier. Also, s = 0 means that there is no deviation, which only happens when all the observations are the same. For example, if your data set was 20, 20, 20, 20, 20, 20, then s = 0.

The special case: when the data are bell or mound shaped, meaning that the data are unimodal and symmetric around the mean (median or mode), we can use

The Empirical Rule:
Approximately 68% of the observations fall within 1 standard deviation of the mean.
Approximately 95% of the observations fall within 2 standard deviations of the mean.
Approximately 100% (about 99.7%) of the observations fall within 3 standard deviations of the mean.

The 3 intervals:
(xbar - s, xbar + s)
(xbar - 2s, xbar + 2s)
(xbar - 3s, xbar + 3s)

Ex. A random sample of the weights (in ounces) of full term babies born at a large local hospital is collected. The population is assumed to be bell shaped. The sample mean is 124 oz with a standard deviation of 17 oz. Use the empirical rule to find the intervals where about:
68% of full term baby weights should fall?
95% of full term baby weights should fall?
100% of full term baby weights should fall?

xbar = 124, s = 17
68%: (124 – 17, 124 + 17) = (107, 141)
95%: (124 – 34, 124 + 34) = (90, 158)
100%: (124 – 51, 124 + 51) = (73, 175)

Using the above data, do you think a baby that was 170 oz was unusual? Yes. 170 is outside the 95% interval.
Using the above data, do you think a baby that was 100 oz was unusual? No. 100 is inside the 95% interval.
Using the above data, do you think a baby that was 70 oz was unusual? Yes. 70 is outside the 95% interval and the 100% interval.

Z-scores / T-scores
z = (x – µ)/σ
t = (x – xbar)/s

Z-scores and T-scores can be used to compare observations that are on different scales. Both tell you how many standard deviations away from the mean a specific observation is, and in what direction. T-scores use the sample mean and standard deviation, and Z-scores use the population mean and standard deviation.

Ex. A student, Amy, takes two tests. On the first test she gets an 82, and the mean was 73 with a standard deviation of 6.7. On the second test she gets an 86, and the mean was 78 with a standard deviation of 9.8. The teacher grades on a curve and she wants to know on which test she did better (based on the curve). Find the T-scores.
T1 = (82 – 73)/6.7 = 1.34; her score was 1.34 standard deviations above the mean.
T2 = (86 – 78)/9.8 = 0.82; her score was 0.82 standard deviations above the mean.
Since T1 > T2, on the curve she did better on the first test, even though her raw score was higher on the second.

The pth percentile is a value such that p percent of the observations fall below or at that value. You have probably seen percentiles on standardized tests. We will not cover percentiles in general, but we will use some special percentiles. The median is a percentile, the 50th.

Three useful percentiles that we will use are the quartiles.
The first quartile is called Q1 or QL and is the 25th percentile. It is also the median of the lower half of the data.
The median is the second quartile, called Q2.
The third quartile is called Q3 or QU and is the 75th percentile. It is also the median of the upper half of the data.

To calculate the quartiles, find the median. Divide the data into a lower half and an upper half. Do not include the median in either half. To find Q1, find the median of the lower half. To find Q3, find the median of the upper half.
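The quartile procedure just described can be sketched in Python. This is only an illustration of the "median of each half" method from these notes; the function names median and quartiles are made up here, and other software may use a different quartile rule and give slightly different values.

```python
def median(sorted_values):
    """Median of a list that is already sorted in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle observation.
        return sorted_values[mid]
    # Even n: average of the two middle observations.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Return (Q1, Q2, Q3), where Q1 and Q3 are medians of the lower and upper halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # When n is odd, the median itself is left out of both halves.
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    return median(lower), median(values), median(upper)


# The 10 test scores used earlier in the chapter.
scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
q1, q2, q3 = quartiles(scores)
print("Q1 =", q1, " Med =", q2, " Q3 =", q3)
```

For the 10 test scores this reproduces the Q1, Med, and Q3 values shown in the 1-Var Stats output.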
The calculator calculates all 3 of them for you.

The Inter Quartile Range is IQR = Q3 - Q1.

Some people look at Q0 as the minimum observation and Q4 as the maximum observation. These 5 numbers together (minimum, Q1, Q2, Q3, maximum) are called the 5-number summary of the data. These numbers can be used to detect outliers and to create a visual display of the data called a box-and-whisker plot. The box-plots that your text described are useless!

Modified Box Plots and the 5-Number Summary

Constructing a box-plot
1. Calculate Q1, Q2, Q3 and IQR = Q3 – Q1.
2. Compute the Inner Fences (IF) and the Outer Fences (OF):
LIF = Q1 – 1.5 * IQR
UIF = Q3 + 1.5 * IQR
LOF = Q1 – 3 * IQR
UOF = Q3 + 3 * IQR
3. Draw a horizontal axis. Draw vertical lines at Q1, Q2, Q3.
4. A whisker (horizontal line) is drawn from Q1 to the smallest observation that is bigger than LIF. A whisker (horizontal line) is drawn from Q3 to the largest observation that is smaller than UIF. Any observation that is between the inner and outer fences is a mild outlier and is labeled with a solid circle. Any observation that is outside the outer fences is an extreme outlier and is labeled with an open circle.

Ex.
Grades (x): 95 87 35 76 76 82 68 63 92 88
Sorted x (descending): 95 92 88 87 82 76 76 68 63 35
Median = (76 + 82) / 2 = 79
Q1 = 68
Q3 = 88
Min = 35
Max = 95
IQR = 88 – 68 = 20
1.5 * IQR = 30
3 * IQR = 60
LIF = 68 – 30 = 38
UIF = 88 + 30 = 118
LOF = 68 – 60 = 8
UOF = 88 + 60 = 148
Since LOF = 8 < 35 < 38 = LIF, 35 is a mild outlier.
Since 95 < 118 = UIF, 95 is NOT an outlier.
As one can see from the box plot, the data set looks pretty symmetric except for the one outlier (35), which skews the distribution left. Note that mean = 76.2 < 79 = median.

Ex. An airline company is wondering about the number of cancellations it receives for a specific commuter flight. The airline takes a random sample of 15 days. The data are listed below. Find the mean and the median for the sample. Make a dot plot of the data. Are there any outliers? Describe the symmetry of the data.
4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16.

Answers:
Mean = 13.8
Median = 14
Q1 = 12
Q2 = 14
Q3 = 16
IQR = 16 – 12 = 4
1.5 * IQR = 1.5 * 4 = 6
3 * IQR = 12
LIF = 12 – 6 = 6
UIF = 16 + 6 = 22
LOF = 12 – 12 = 0
UOF = 16 + 12 = 28
4 is a mild outlier because it lies between the LOF and the LIF.
24 is also a mild outlier because it lies between the UIF and the UOF.

In class: 2.13.1, 2.13.6, 2.13.9, 2.13.10, 2.13.14
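The fence calculation for the airline example can be checked with a short Python sketch. It is only an illustration: the function names median, fences, and classify_outliers are made up here, the quartiles are found as medians of the lower and upper halves as in these notes, and treating an observation that lands exactly on an outer fence as mild is an assumption of this sketch.

```python
def median(sorted_values):
    """Median of a list that is already sorted in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def fences(data):
    """Return (LOF, LIF, UIF, UOF) for a modified box plot."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # When n is odd, the median is excluded from both halves.
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    return q1 - 3 * iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr, q3 + 3 * iqr


def classify_outliers(data):
    """Return (mild, extreme) outliers as two sorted lists."""
    lof, lif, uif, uof = fences(data)
    # Mild: between an inner fence and the matching outer fence.
    # Assumption: a value exactly on an outer fence counts as mild.
    mild = sorted(x for x in data if lof <= x < lif or uif < x <= uof)
    # Extreme: beyond an outer fence.
    extreme = sorted(x for x in data if x < lof or x > uof)
    return mild, extreme


cancellations = [4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16]
print("fences (LOF, LIF, UIF, UOF):", fences(cancellations))
mild, extreme = classify_outliers(cancellations)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

The same functions can be applied to the grades example above by passing that list instead of cancellations.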