Chapter 2 – Descriptive Statistics

2.1 Frequency Distributions

Types of Data
Qualitative data: nonnumerically valued data.
Quantitative data: numerically valued data.

Frequency and Relative Frequency Distributions:

Example 1
A nursery school offers programs for 4-year-olds ranging from a 1-day-a-week program to a 5-day-a-week program. To help in planning, the school's director surveyed parents regarding the type of program they prefer. The following data, which represent the number of days, were obtained.

2 3 1 2 2 3 3 1 2 2 3 4 2
1 4 4 2 3 5 1 3 2 5 2 2 5

Construct the frequency and relative frequency distributions and answer the following.
(a) What percentage of parents did not prefer the 2-day-a-week program?
(b) What percentage of parents prefer the 4-day-a-week or the 5-day-a-week program?

Solution
In the data set there are four 1s, ten 2s, six 3s, three 4s, and three 5s, for 26 data items in total. For example, the relative frequency of the data value 1 is (4/26) × 100 ≈ 15.4%.

Number of Days    Frequency    Relative Frequency (%)
1                 4            15.4
2                 10           38.5
3                 6            23.1
4                 3            11.5
5                 3            11.5
Total             26           100

a. Since 38.5% of parents prefer the 2-day-a-week program, the percentage of parents who do not prefer the 2-day-a-week program is 100 - 38.5 = 61.5%.
b. The percentage of parents who prefer the 4-day-a-week or the 5-day-a-week program is 11.5 + 11.5 = 23.0%, that is, about 23% (exactly 6/26 ≈ 23.1%).

Grouped-Data Table

Example 1
We are given the mathematics achievement test scores for a sample of 50 sixth-grade students at Maple Elementary School.

75 49 84 55 61 77 63 84 41 67
48 61 85 69 72 51 98 79 65 57
46 51 79 88 64 61 54 75 77 63
65 57 85 89 67 68 53 65 71 71
71 49 83 55 60 54 71 50 63 77

Test Scores    Frequency    Relative Frequency (%)
40-49          5            10
50-59          10           20
60-69          15           30
70-79          12           24
80-89          7            14
90-99          1            2
Total          50           100

Cumulative Frequency and Cumulative Percent Frequency Distributions:

Freq. Distribution            Cumulative Freq. Distribution
Test Scores    Frequency      Test Scores    Cumulative Freq.
40-49          5              ≤ 49           5
50-59          10             ≤ 59           15
60-69          15             ≤ 69           30
70-79          12             ≤ 79           42
80-89          7              ≤ 89           49
90-99          1              ≤ 99           50

Cumulative Relative Frequency Distribution

Freq. Distribution            Cumulative Relative Freq. Distribution
Test Scores    Frequency      Test Scores    Cumulative Relative Freq. (%)
40-49          5              ≤ 49           10
50-59          10             ≤ 59           30
60-69          15             ≤ 69           60
70-79          12             ≤ 79           84
80-89          7              ≤ 89           98
90-99          1              ≤ 99           100

Questions: What are we looking for when we look at data?
a. The shape of the distribution of the data.
b. The symmetry or skew of the data.
c. The center of the data.
d. The spread of the data.

Graphical Displays:
Histogram: A histogram is a graphical representation of quantitative data that can help answer the questions above.

Example 2
Draw the histogram for the data in Example 1.

1. Guidelines for making a histogram:
a. Choose between 5 and 20 classes (intervals for a histogram). A histogram is sensitive to the number of classes, so you may want to try several possibilities in practice. Rule of thumb: about \(\sqrt{n}\) classes for a histogram.
b. All class widths must be the same.
c. The lower limit of the smallest class is always less than the smallest data value. The upper limit of the largest class is always greater than the largest value.
d. Each item goes into one and only one class; that is, the classes are non-overlapping.

Homework: 15-23 (odd), 29, 31, 32 (pp. 43-45)

2.2 Pie Charts and Bar Graphs

A histogram is designed for use with quantitative data. Two methods for displaying qualitative data are pie charts and bar graphs.

Example 3 (Reference Example 4 on page 49)
Display the relative frequency distribution of the data using
a. a pie chart
b. a bar graph

Stem-and-Leaf Diagram
We are given the mathematics achievement test scores for a sample of 50 sixth-grade students at Maple Elementary School. Draw a stem-and-leaf display for this data.
75 49 84 55 61 77 63 84 41 67
48 61 85 69 72 51 98 79 65 57
46 51 79 88 64 61 54 75 77 63
65 57 85 89 67 68 53 65 71 71
71 49 83 55 60 54 71 50 63 77

Solution

Unordered leaves:
Stem  Leaves
4 | 8 6 9 9 1
5 | 1 7 5 5 1 4 4 3 0 7
6 | 5 1 9 1 4 7 0 1 8 3 5 5 3 7 3
7 | 5 1 9 2 7 1 9 5 7 1 1 7
8 | 4 5 5 3 8 9 4
9 | 8

Ordered leaves:
Stem  Leaves
4 | 1 6 8 9 9
5 | 0 1 1 3 4 4 5 5 7 7
6 | 0 1 1 1 3 3 3 4 5 5 5 7 7 8 9
7 | 1 1 1 1 2 5 5 7 7 7 9 9
8 | 3 4 4 5 5 8 9
9 | 8

Homework: 15, 17, 20, 21 (pp. 54-55)

2.3 Measures of Center

1. Mean (also called the arithmetic mean)
The mean of a data set is the sum of the observations divided by the number of observations. If the data values are \(x_1, x_2, x_3, \ldots, x_n\), then

\[ \text{Mean} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \]

Two notations for the mean:
a. Sample mean: \(\bar{x}\) (read as "x-bar")
b. Population mean: \(\mu\) ("mu")

Thus \(\bar{x} = \frac{\sum x}{n}\), where n = number of items in the sample data, and \(\mu = \frac{\sum x}{N}\), where N = size of the population.
Note: \(\Sigma\) (called sigma) is a Greek symbol that signifies summation.

Example 1: Find the mean for this sample data: 2, 3, 6, 7, 7, 8, 9, 9, 9, 10
Solution: \(\bar{x} = \frac{\sum x}{n} = \frac{2 + 3 + 6 + 7 + 7 + 8 + 9 + 9 + 9 + 10}{10} = 70/10 = 7\)

Example 2: A sample of five families in Harrold, Iowa showed the following annual family incomes: $17,500, $23,000, $24,000, $26,000, $320,000. Find the mean for this data.
Soln. \(\bar{x} = \frac{\sum x}{n} = \frac{17500 + 23000 + 24000 + 26000 + 320000}{5} = 410500/5\) = $82,100

Extreme Value/Outlier: a data value that is much larger or much smaller than most of the data values.
Note: In the presence of extreme value(s), the mean provides a poor description of the center of the data set.

2. Median
The median is the middle value of the data when the data have been arranged in ascending (or descending) order. When the number of data values is even, the median is the average of the two middle values.

Example 3: Find the median for data set 1 and data set 2.
Data Set 1: 7, 2, 8, 5, 9, 4, 7, 8, 6
Data Set 2: 7, 2, 8, 5, 9, 4, 8, 8
Solution: Sorted, data set 1 is 2, 4, 5, 6, 7, 7, 8, 8, 9, so its median is the 5th value, 7. Sorted, data set 2 is 2, 4, 5, 7, 8, 8, 8, 9, so its median is (7 + 8)/2 = 7.5.

Example 4: Find the median for the data set given in Example 2.
Solution: Median = $24,000
Note: The median is not affected by extreme values. Thus, in the presence of extreme values, the median may be a better indicator of the center.

3. Mode
The most frequently occurring data value in a set of data is called the mode. That is, the mode is the value that occurs with the greatest frequency.

Example 5: Find the mode for the given data: 2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8
Solution: Mode = 8

Example 6: Find the mode for the given data: 2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8, 2
Solution: Modes = 2 and 8 (each occurs four times)
Note: Such a distribution is called bimodal.

Example 7: Find the mode for the given data: 2, 3, 8, 7, 9
Solution: The mode is undefined, since no value occurs more than once.

Note: The mode is seldom used in practice, except to answer the special kind of question it is designed to answer:
a. What is the most watched TV show?
b. What is the best selling automobile?
c. What is the most common cause of death?

Example 8: 10 out of the 11 data values in a data set are 11, 13, 15, 9, 4, 12, 10, 7, 8, and 15. If the mean for the data is 10, what is the missing item?
Soln. The 11 values must sum to 11 × 10 = 110. The 10 known values sum to 104, so the missing item is 110 - 104 = 6.

Homework: 1, 2, 3, 5, 14, 15-23 (odd), 29, and 30 (pp. 64-66)

2.4 Measures of Variation

Range = Largest Value - Smallest Value

Example 9: Given the three data sets below, find the range, mean, and median.
Data Set 1: 99, 91, 84, 84, 80, 80, 80, 76, 76, 69, 61
Data Set 2: 99, 80, 80, 80, 80, 80, 80, 80, 80, 80, 61
Data Set 3: 99, 99, 99, 99, 99, 80, 61, 61, 61, 61, 61
Soln: For all of these data sets, Range = 99 - 61 = 38 and Mean = Median = 80.
Note: The range is based on only two of the items in the data set and thus is influenced too much by extreme values.

Variance:
Given the data 46, 54, 42, 46, 32. The mean (\(\mu\)) for this data is 44.

X        X - μ     (X - μ)²
46       2         4
54       10        100
42       -2        4
46       2         4
32       -12       144
Total    0         256

Variance = average squared deviation from the mean = 256/5 = 51.2

Population variance: \(\sigma^2 = \frac{\sum (X - \mu)^2}{N}\), where N is the size of the population.
Sample variance: \(s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}\), where n is the size of the sample.
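The two variance definitions above can be checked with a short Python sketch. The function names `population_variance` and `sample_variance` are illustrative choices, not part of the notes; the data are the five values from the variance table.

```python
from math import sqrt


def population_variance(data):
    """Average squared deviation from the mean (divide by N)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


data = [46, 54, 42, 46, 32]          # values from the variance table
pop_var = population_variance(data)  # 256 / 5
samp_var = sample_variance(data)     # 256 / 4
samp_sd = sqrt(samp_var)             # sample standard deviation

print(pop_var, samp_var, samp_sd)    # prints: 51.2 64.0 8.0
```

Treating the same five values as a sample rather than a population changes the divisor from 5 to 4, which is why the sample variance (64) is larger than the population variance (51.2).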
Easier Computational Formula for Variance

\[ \sigma^2 = \frac{\sum X^2 - N\mu^2}{N} \qquad s^2 = \frac{\sum X^2 - n\bar{x}^2}{n - 1} \]

Standard Deviation = \(\sqrt{\text{Variance}}\)
So, Sample Standard Deviation = \(s = \sqrt{s^2}\) and Population Standard Deviation = \(\sigma = \sqrt{\sigma^2}\).

Example 10: Given the sample data below, find the sample standard deviation.
9, 11, 16, 14, 12, 12, 10, 9, 9
Solution:
Sum of the x-values = 102
Sum of the squares of the x-values = 1204
Sample mean = 102/9 ≈ 11.33
Sample variance = (1204 - 9(102/9)²)/8 = (1204 - 1156)/8 = 6
Sample s.d. = \(\sqrt{6}\) ≈ 2.45
(Rounding the mean to 11.33 before squaring gives the less accurate values 6.08 and 2.47.)

Some Uses of Mean and Standard Deviation

Data: \(x_1, x_2, x_3, \ldots, x_n\)

\[ \text{Z-score} = \frac{x - \bar{x}}{s}, \quad \text{where } s = \text{sample s.d.} \]

The Z-score for any data item is referred to as its standardized value. It can be interpreted as a measure of the relative location of an item in the data.

Example 11: If the Z-score of a data item is 2, the data value is 2 standard deviations above (larger than) the sample mean.

CHEBYSHEV'S THEOREM (p. 77)
For any data set,
at least 75% of the items must lie within two standard deviations of the mean;
at least 89% of the items must lie within three standard deviations of the mean;
at least 94% of the items must lie within four standard deviations of the mean.

Example 12: Midterm scores for 100 students in a college statistics course had a mean of 70 and s.d. of 5.
(a) How many students scored between 60 and 80?
(b) How many students scored between 50 and 90?

The Empirical Rule:
For data having an approximately bell-shaped distribution,
approximately 68% of the data fall within 1 standard deviation of the mean;
approximately 95% of the data fall within 2 standard deviations of the mean;
approximately 99.7% of the data fall within 3 standard deviations of the mean.

Example 13: In a class with 50 students, the mean score on a test was 60 while the standard deviation was 12. It is given that the scores are normally distributed. How many students
a. scored between 48 and 72?
b. scored between 36 and 84?
c. scored between 24 and 96?

Detecting Outliers:
Sometimes a set of data has one or more items with unusually large or unusually small values.
Extreme values such as these are called outliers. Experienced statisticians take steps to identify outliers and then review each one carefully. An outlier may be an item whose value has been incorrectly recorded. If so, the value can be corrected before proceeding with the analysis. An outlier may also be an item that was incorrectly included in the data set; if so, it can be removed. Finally, an outlier may just be an unusual item that has been correctly recorded and does belong in the data set. In such cases, the item should remain in the data set.

Use of Z-score to identify outliers:
RULE: An item with Z-score > 3 or Z-score < -3 will be treated as an outlier.

Example 14: Given the data set below, identify outliers, if any, in the data.
46, 54, 42, 46, 32
Soln. Note that \(\bar{x}\) = 44 and s.d. = 8.

x      x̄      x - x̄    z-score
46     44     2        0.25
54     44     10       1.25
42     44     -2       -0.25
46     44     2        0.25
32     44     -12      -1.50

There are no outliers in this data.

Homework: 2, 4, 6, 9, 13, 15, 16, 17, 19, 21, 25, 26, 27, 28, 29, 30 (pp. 80-83)

2.5 Measures of Position

Percentile: A percentile is a numerical measure that also locates values of interest in the data set. A percentile provides information regarding how the data items are spread over the interval from the lowest value to the highest value.

Defn. The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.

Step 1: Sort the data in ascending order, that is, from the smallest to the largest.
Step 2: Find i = (p/100)n, where n is the number of data values.
Step 3: If i is not an integer, then pth percentile = \(x_{\mathrm{INT}(i+1)}\), the item in position INT(i + 1), where INT denotes the integer part.
If i is an integer, then pth percentile = \(\frac{x_i + x_{i+1}}{2}\).

Example 16: Given the data below, find the 50th and 90th percentiles.
26, 4, 5, 20, 6, 12, 15, 15, 15, 8, 9, 10, 14, 18, 16, 17
Soln:
Step 1: Sort the data in ascending order, that is, from the smallest to the largest.
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18, 20, 26
For p = 90, i = (90/100)(16) = 14.4 is not an integer, so the 90th percentile is the item in position 15, which is 20.
For p = 50, i = (50/100)(16) = 8 is an integer, so the 50th percentile is the average of the items in positions 8 and 9: (14 + 15)/2 = 14.5.
90th percentile = 20; 50th percentile = 14.5
Note: The median and the 50th percentile are the same.

Quartiles
It is often desired to divide a data set into four parts, with each part containing one-fourth of the data.
\(Q_1\) = First Quartile = 25th percentile
\(Q_2\) = Second Quartile = 50th percentile
\(Q_3\) = Third Quartile = 75th percentile

Example 17: For the data given in Example 16, find the first, second, and third quartiles.
Soln. \(Q_1\) = 8.5, \(Q_2\) = 14.5, \(Q_3\) = 16.5

The Interquartile Range (IQR)
IQR = \(Q_3 - Q_1\)
Note: The IQR gives the range of the middle 50% of the observations.

The Five-Number Summary
The five-number summary of a data set consists of the minimum, the quartiles, and the maximum, written in increasing order: Min, \(Q_1\), \(Q_2\), \(Q_3\), and Max.

Example 17: Reference Example 16. Find the five-number summary.
Soln. The data is 4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16, 17, 18, 20, 26
Minimum = 4, \(Q_1\) = 8.5, \(Q_2\) = 14.5, \(Q_3\) = 16.5, and Maximum = 26.

Boxplot (p. 89)
A boxplot is based on the five-number summary and can be used to provide a graphical display of the center and variation of a data set.

Notes:
1. There are two ways of identifying outliers: using z-scores, and using upper and lower fences. These methods do not necessarily identify the same items as outliers.
2. An advantage of using boxplots for analysis of data is that very few numerical calculations are needed. Just arrange the data in ascending order and compute the five-number summary. You do not have to compute the mean and the standard deviation.

The Shape of Distributions (Ref. page 63 in the textbook)

Homework: 1, 3, 5, 7, 8, 15, 16, 21 (pp. 93-95)
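
The three-step percentile rule of Section 2.5 and the five-number summary can be sketched in Python as follows. This is a minimal illustration, not the only percentile convention in use (software packages often interpolate differently). The names `percentile` and `five_number_summary` are hypothetical, and p is assumed to be an integer strictly between 0 and 100 so that the test "is i an integer?" can be done in exact integer arithmetic.

```python
def percentile(data, p):
    """p-th percentile by the three-step rule (integer p, 0 < p < 100)."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(data)                      # Step 1: sort ascending
    # Step 2: i = (p/100) * n, kept as quotient and remainder of p*n / 100
    whole, rem = divmod(p * len(xs), 100)
    if rem != 0:
        # Step 3, i not an integer: item in 1-based position INT(i) + 1,
        # which is 0-based index `whole`
        return xs[whole]
    # Step 3, i an integer: average of the items in positions i and i + 1
    return (xs[whole - 1] + xs[whole]) / 2


def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) using the percentile rule above."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


data = [26, 4, 5, 20, 6, 12, 15, 15, 15, 8, 9, 10, 14, 18, 16, 17]
p50 = percentile(data, 50)
p90 = percentile(data, 90)
summary = five_number_summary(data)
iqr = summary[3] - summary[1]

print(p50, p90)   # prints: 14.5 20
print(summary)    # prints: (4, 8.5, 14.5, 16.5, 26)
print(iqr)        # prints: 8.0
```

On the data of Example 16 this reproduces the values worked out above: the 50th percentile is 14.5, the 90th percentile is 20, and the quartiles are 8.5, 14.5, and 16.5, giving IQR = 16.5 - 8.5 = 8.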