Return to the definitions in Class Handout #1:

five-number summary
Minimum, First Quartile, Second Quartile, Third Quartile, Maximum
(Min, Q1, Q2, Q3, Max)

Now return to Exercise #1(i) in Class Handout #1:

1.-continued
(i) From the stem-and-leaf display in part (g) for the variable “yearly income,” find the five-number summary, the median, the range, and the interquartile range.

2 | 5 6 7 8 9
3 | 0 0 3 4 4 5 9 9 9
4 | 0 1 4 5 5 9
5 | 3 5
6 | 0 1 4 5 8
7 | 1 5 8

Min = 25, Q1 = 33, Q2 = 40.5 = median, Q3 = 60, Max = 78
range = Max – Min = 78 – 25 = 53
IQR = Q3 – Q1 = 60 – 33 = 27

outlier
an observation whose value is deemed to be either unusually high or unusually low relative to the other observations in the data set; one guideline is to consider any value which is a distance of more than 1.5(IQR) below Q1 or above Q3 a potential outlier

box plot
a graphical display for quantitative sample data where two edges of a rectangular box mark the values of Q1 and Q3, a line dividing the box into two sections marks the value of Q2, and the ends of two lines extending from the sides of the box represent the minimum and maximum. Often, the lines are stopped at the largest and smallest values which are not potential outliers, and all potential outliers are designated with dots.

(j) From the five-number summary found in part (i) for the variable “yearly income,” decide whether or not there are any potential outliers, and construct a box plot.
(1.5)IQR = (1.5)(27) = 40.5
Since 33 – 25 = 8 < 40.5, the minimum is not a potential outlier.
Since 78 – 60 = 18 < 40.5, the maximum is not a potential outlier.
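The five-number summary in part (i) and the 1.5(IQR) guideline in part (j) can be sketched in Python as follows. This is a minimal sketch, not the handout's Excel spreadsheet: the function names are illustrative, and it assumes Q1 and Q3 are taken as the medians of the lower and upper halves of the sorted data, a convention that reproduces the values above (other software may use a different quartile rule).

```python
from statistics import median


def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max) for at least two observations.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data (the middle value is left out when n is odd).
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # observations below the middle position
    upper = data[-half:]   # observations above the middle position
    return data[0], median(lower), median(data), median(upper), data[-1]


def potential_outliers(values):
    """Return the values more than 1.5(IQR) below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


# Yearly incomes ($1000s) read off the stem-and-leaf display in part (i).
incomes = [25, 26, 27, 28, 29,
           30, 30, 33, 34, 34, 35, 39, 39, 39,
           40, 41, 44, 45, 45, 49,
           53, 55,
           60, 61, 64, 65, 68,
           71, 75, 78]

print(five_number_summary(incomes))   # Min, Q1, Q2, Q3, Max
print(potential_outliers(incomes))    # empty list when no value is flagged
```

Any value the second function returns would be drawn as a separate dot on the box plot, with the lines stopped at the most extreme values that are not flagged.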
[Box plot of Yearly Income ($1000s) on an axis from 25 to 80, marking Min = 25, Q1 = 33, Q2 = 40.5 = median, Q3 = 60, Max = 78]

center value in a distribution
refers to a middle or average value for the distribution

Some measures of center for a distribution are as follows:

mean (denoted \(\bar{y}\))
the arithmetic average of n quantitative observations \(y_1, y_2, \ldots, y_n\), calculated by dividing the sum of the observations by the number of observations; that is,
\[ \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{\sum y}{n} \]

median
the second quartile Q2 in the five-number summary

mode
the most frequently occurring observation(s), if any exist

(With a quantitative variable, the mean and median are often of more interest than the mode; the mode is more useful with a qualitative variable, where computing a mean or median makes no sense as a general rule.)

dispersion in a distribution
refers to the amount of variation (spread) in the distribution

Some measures of dispersion for a distribution are as follows:

range
the difference between the largest and smallest observations, that is, Max – Min

interquartile range (IQR)
the difference between the third and first quartiles, that is, Q3 – Q1

variance (denoted \(s^2\))
the sum of the squared deviations from the mean (i.e., the sum of \((y_1 - \bar{y})^2, (y_2 - \bar{y})^2, \ldots, (y_n - \bar{y})^2\)) divided by one less than the number of observations; that is,
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]

standard deviation (denoted s)
the square root of the variance

These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.

1.-continued
(k) The yearly incomes ($1000s) for the eight Democrats in the sample are
28 75 26 78 40 60 49 39
Do each of the following for these eight incomes:

List the observations in ascending order.
26 28 39 40 49 60 75 78

Find the five-number summary.
Min = 26, Q1 = 33.5, Q2 = 44.5, Q3 = 67.5, Max = 78

Find the median.
44.5

Find the mean.
\[ \bar{y} = \frac{26 + 28 + 39 + 40 + 49 + 60 + 75 + 78}{8} = \frac{395}{8} = 49.375 \]

Find the variance and standard deviation.
\[ s^2 = \frac{(26 - 49.375)^2 + (28 - 49.375)^2 + \cdots + (78 - 49.375)^2}{8 - 1} = \frac{2787.875}{7} = 398.268 \]
\[ s = \sqrt{398.268} = 19.957 \]

Find the mode.
There is no mode, since no observations are repeated.

If the largest salary of 78 ($78,000) were changed to 175 ($175,000), what would be the values for the mean and median? What does this suggest about how the values of the mean and median are influenced by symmetry and skewness in a distribution? This can be done using the Excel spreadsheet named Summary_Statistics.
The mean would be 61.5, but the median would stay equal to 44.5. When a distribution is nearly symmetric, the mean and median will be close in value. If the mean is smaller than the median, the distribution is negatively skewed; if the mean is larger than the median, the distribution is positively skewed.

shape of a distribution (symmetry and skewness)
involves the type and amount of symmetry or non-symmetry present in the distribution
A distribution is called positively skewed when larger values tend to be more dispersed (perhaps resulting in a few unusually high values) and called negatively skewed when smaller values tend to be more dispersed (perhaps resulting in a few unusually small values).
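The mean, variance, and standard deviation in part (k), and the effect of changing the largest income from 78 to 175, can also be checked outside Excel. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the handout. `variance` and `stdev` divide by n – 1, matching the definition of \(s^2\) used here.

```python
from statistics import mean, median, stdev, variance

# Yearly incomes ($1000s) for the eight Democrats in part (k).
incomes = [28, 75, 26, 78, 40, 60, 49, 39]

y_bar = mean(incomes)     # sum of the observations divided by n
s2 = variance(incomes)    # sample variance, divisor n - 1
s = stdev(incomes)        # square root of the sample variance

print(f"mean = {y_bar:.3f}, variance = {s2:.3f}, standard deviation = {s:.3f}")

# Replace the largest income (78) with 175 and compare the two measures of center.
changed = [175 if value == 78 else value for value in incomes]
print(f"original: mean = {mean(incomes)}, median = {median(incomes)}")
print(f"changed:  mean = {mean(changed)}, median = {median(changed)}")
```

Only the mean moves when the largest value is changed, because the median depends on the order of the middle observations and not on how far out the largest one lies.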
One measure of skewness in a distribution is as follows:

skewness ratio
\[ \text{skewness ratio} = \frac{\bar{y} - \text{median}}{s} \]
(If the skewness ratio for a distribution is less than –0.3, then the distribution can be considered very negatively skewed, and if the skewness ratio for a distribution is greater than +0.3, then the distribution can be considered very positively skewed.)

What does the skewness ratio suggest about the shape of the distribution?
\[ \frac{49.375 - 44.5}{19.957} = +0.24 \]
suggests a somewhat positively skewed distribution

(l) From the information displayed in part (d) for the variable “political party affiliation,” find the mode.
The mode is the category “Republican,” which is repeated most often.

2. For each of four levels of educational attainment, the distribution of ages for Americans at least 25 years old in 1984 is organized into a frequency polygon (from data collected by the U.S. Bureau of the Census and taken from the World Almanac and Book of Facts, 1986).
[Four frequency polygons, one per panel, each plotting Relative Frequency (0 to 40) against Age (years) (15 to 95): Did Not Complete High School; Completed High School; 1 to 3 Years of College; 4 or More Years of College]

(a) Does the distribution of ages appear to be centered at different values for the different levels of education? If yes, for which level of education does the center appear to be smallest, and for which level of education does the center appear to be largest?
The ages for those not completing high school appear to be centered at a considerably higher value than for those in each of the other three level-of-education categories.

(b) Does the distribution of ages appear to have a different amount of dispersion for the different levels of education? If yes, for which level of education does the dispersion appear to be smallest, and for which level of education does the dispersion appear to be largest?
The variation of ages appears to be roughly the same for the four level-of-education categories.

(c) Does the distribution of ages appear to have a different shape for the different levels of education? If yes, how does the shape appear to differ?
None of the distributions appear to be symmetric. The distribution of ages looks negatively skewed for those not completing high school and positively skewed for each of the other three level-of-education categories.

bar chart
a graphical display for qualitative data where categories are listed on a horizontal axis and the height of a bar for each category represents a raw or relative frequency as indicated by the labels on a vertical axis
pie chart
a graphical display for qualitative data where relative frequency for each category is represented as a slice of a circle (pie) and the categories listed include all possibilities

histogram
a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and the height of a bar for each of several intervals of values represents a raw or relative frequency as indicated by the labels on a vertical axis

frequency polygon
a graphical display for quantitative sample data where the possible numerical values are scaled on a horizontal axis and dots placed at the middle of the top of where each bar for a histogram would be are connected to produce a rough “curve”; the proportion of observations that fall within a given interval of values is represented by the corresponding area under the “curve”

probability distribution curve
a “smooth curve” designed to describe quantitative population data so that the proportion, or probability, of observations falling within a given interval of values is represented by the corresponding area under the “curve”

3. Suppose each of the box plots displayed represents the distribution of sample data selected from some population. For each box plot, make a sketch of what a corresponding histogram of the data could look like, and make a sketch of what the probability distribution curve for the corresponding population could look like.

[Four box plots, labeled: Uniform Distribution; Bell-Shaped Distribution; Positively Skewed Distribution; Negatively Skewed Distribution]

Now look at Class Handout #2.

Class Handout #2 Definitions

parameter
a numerical quantity which describes some characteristic of a population

statistic
a numerical quantity which describes some characteristic of a sample

The symbol \(\bar{x}\) is used to represent the mean for a sample.
The symbol \(\mu\) is used to represent the mean for a population.
The symbol s is used to represent the standard deviation for a sample.
The symbol \(\sigma\) is used to represent the standard deviation for a population.
We see then that \(\mu\) and \(\sigma\) are parameters, and that \(\bar{x}\) and s are statistics.

1. For each situation, identify the experimental unit, the population, the sample, the parameter, and the statistic.

(a) A state government official computes the mean yearly amount of spending per pupil for 75 selected public school districts in the state in order to draw a conclusion about the mean yearly amount of spending per pupil for all public school districts in the state.
The experimental unit is each public school district in the state.
The population is all public school districts in the state.
The sample is the 75 selected public school districts.
The parameter is the mean yearly amount of spending per pupil for all public school districts in the state.
The statistic is the mean yearly amount of spending per pupil for the 75 selected public school districts in the state.

(b) A pollster surveys 427 selected voters in a state by phone and calculates the proportion intending to vote for the incumbent governor in order to draw a conclusion about the proportion of all voters in the state intending to vote for the incumbent governor.
The experimental unit is each voter in the state.
The population is all voters in the state.
The sample is the 427 selected voters.
The parameter is the proportion of all voters in the state intending to vote for the incumbent governor.
The statistic is the proportion of the 427 surveyed voters intending to vote for the incumbent governor.

Tchebysheff’s Theorem
a statement of the following facts (requiring calculus to prove):
For any data set (population or sample), at least 75% (three fourths) of the measurements must lie within two standard deviations of the mean; that is, at least 75% (3/4) of a data set must lie between the values
mean – 2(standard deviation) and mean + 2(standard deviation).
For any data set (population or sample), at least 89% of the measurements must lie within three standard deviations of the mean; that is, at least 89% of a data set must lie between the values
mean – 3(standard deviation) and mean + 3(standard deviation).

2. For each data set (all of which are displayed in a stem-and-leaf display),
(i) verify that the given mean, the given standard deviation, and the given five-number summary are correct;
(ii) find the proportion of measurements which lie within two standard deviations of the mean, and compare this proportion with what Tchebysheff’s Theorem states.

(a)
10 | 9
11 | 5
12 |
13 |
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 |
27 |
28 | 5
29 | 1

Min = 109, Q1 = 188, Q2 = 200, Q3 = 212, Max = 291
mean = 200
standard deviation = 41.94
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.
The interval within two standard deviations of the mean is from 200 – 2(41.94) to 200 + 2(41.94), that is, from 116.12 to 283.88.
The proportion of measurements which lie within two standard deviations of the mean is 16/20 = 80%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.
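The check in part (a) can be sketched in Python as follows. This is a minimal sketch with illustrative function names, not the handout's spreadsheet. The general bound 1 – 1/k² used in `tchebysheff_bound` is the usual statement of the theorem for k standard deviations; the handout itself quotes only the k = 2 (75%) and k = 3 (about 89%) cases.

```python
from statistics import mean, stdev


def proportion_within(values, k):
    """Proportion of values within k sample standard deviations of the mean."""
    center = mean(values)
    spread = stdev(values)   # sample standard deviation, divisor n - 1
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


def tchebysheff_bound(k):
    """Minimum proportion guaranteed within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


# Data set (a), read off the stem-and-leaf display.
data_a = [109, 115, 178, 181, 185, 191, 194, 196, 198, 199,
          201, 202, 204, 206, 209, 215, 219, 222, 285, 291]

print(f"mean = {mean(data_a)}, standard deviation = {stdev(data_a):.2f}")
print(f"observed proportion within 2 s.d. = {proportion_within(data_a, 2):.0%}")
print(f"guaranteed minimum for k = 2      = {tchebysheff_bound(2):.0%}")
```

The same two functions can be applied to any other data set by passing in its list of measurements.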
(b)
10 |
11 |
12 | 9
13 | 5
14 |
15 |
16 |
17 | 8
18 | 1 5
19 | 1 4 6 8 9
20 | 1 2 4 6 9
21 | 5 9
22 | 2
23 |
24 |
25 |
26 | 5
27 | 1
28 |
29 |

Min = 129, Q1 = 188, Q2 = 200, Q3 = 212, Max = 271
mean = 200
standard deviation = 33.20
These statistics can all be verified by using the Excel spreadsheet named Summary_Statistics.
The interval within two standard deviations of the mean is from 200 – 2(33.20) to 200 + 2(33.20), that is, from 133.60 to 266.40.
The proportion of measurements which lie within two standard deviations of the mean is 18/20 = 90%.
Tchebysheff’s Theorem states that this proportion will be at least 75%.

Note that in both data sets, the two largest measurements and the two smallest measurements are all potential outliers.
For most bell-shaped (or mound-shaped) distributions with no outliers, 95% of the measurements will lie within two standard deviations of the mean.

3. A random sample of 30 days of hotel reservations is selected to obtain information about the distribution of the daily number of “no shows”. Each value recorded below is the number of room reservation bookings where the party did not show up and did not cancel the reservation.

18 16 16 16 14 18 16 18 14 19
15 19 9 20 10 10 12 14 18 12
14 14 17 12 18 13 15 13 15 19

(a) Identify the experimental unit, the variable of interest, the population, and the sample.
The experimental unit is each day.
The variable of interest is the number of “no shows” among hotel bookings for a day.
The population is all days when hotel reservations are made.
The sample is the 30 selected days.

(b) Are the mean and standard deviation for this data parameters or statistics?
The mean and standard deviation calculated from this random sample are statistics. (Parameters are numerical quantities which describe characteristics of a population.)

(c) Use the Excel files Summary_Statistics and M214_Data to find the mean and standard deviation for this data, and comment on the type of distribution the data appears to have.
mean = \(\bar{x}\) = 15.133
standard deviation = s = 2.945
The data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers.

(d) We shall use the mean and standard deviation from part (c) as estimates of the mean and standard deviation for the population. Explain why it is reasonable to assume that about 95% of all days will have a number of “no shows” within two standard deviations of the mean, and then find this interval.
Since the data appears to have a somewhat bell-shaped (mound-shaped) distribution, and there are no outliers, we expect about 95% of all measurements to be within two standard deviations of the mean.
The interval within two standard deviations of the mean is from 15.133 – 2(2.945) to 15.133 + 2(2.945), that is, from 9.243 to 21.023.

(e) From the interval found in part (d), how many rooms can the hotel overbook per day and still feel confident that all reservations will be honored?
Since it appears that the number of “no shows” each day will almost always be at least 9 or 10, a hotel might feel confident with overbooking 9 or 10 rooms per day.

Before submitting Homework #1, check some of the answers (if you haven’t done so already) from the link on the course schedule: http://srv2.lycoming.edu/~sprgene/M214/Schedule214.htm
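The calculations in Exercise 3, parts (c) and (d), can be sketched in Python as follows. This is a minimal sketch with illustrative variable names, offered as an alternative check to the Excel files named in the exercise. It rounds the mean and standard deviation to three decimal places before forming the interval, as the handout does, so the endpoints agree with the values in part (d).

```python
from statistics import mean, stdev

# Daily numbers of "no shows" for the 30 sampled days in Exercise 3.
no_shows = [18, 16, 16, 16, 14, 18, 16, 18, 14, 19,
            15, 19, 9, 20, 10, 10, 12, 14, 18, 12,
            14, 14, 17, 12, 18, 13, 15, 13, 15, 19]

x_bar = round(mean(no_shows), 3)   # sample mean, rounded as in part (c)
s = round(stdev(no_shows), 3)      # sample standard deviation (divisor n - 1)

# Interval within two standard deviations of the mean, as in part (d).
low, high = x_bar - 2 * s, x_bar + 2 * s
inside = sum(1 for v in no_shows if low <= v <= high)

print(f"mean = {x_bar}, standard deviation = {s}")
print(f"interval: {low:.3f} to {high:.3f}")
print(f"days inside the interval: {inside} of {len(no_shows)}")
```

Counting the sampled days that fall inside the interval gives a quick check on the "about 95%" expectation for a bell-shaped distribution.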