MSTM 4025 Solutions to Assignment #1
Winter 2019
Total of 47 marks, due: Jan. 23, 11:50pm

Comments:
Student Name:
Student ID:

Question   Grades
1          /6
2          /8
3          /4
4          /6
5          /4
6          /5
7          /5
8          /4
9          /5
Total      /47

1. (6 marks) Open the file Birth Year of First Year Class.
a. Plot these data using a bar chart. What birth year is most common among the class? What is the distribution of the bar chart? (Does it look symmetric or skewed?) What would we expect this distribution to look like? Is there a birth year that sticks out as unusual, and what is it? Is there a plausible explanation for this, or do you figure it is pure chance?
b. Provide a boxplot of Birth Year of the first year class. Is there any issue shown in this plot, and is there a plausible explanation?
c. Plot the relative percentage of these using a pie chart. Be sure to include the birth year and the percentage in the graph. What percentage of the class was born in 1995?

Answer
a) [Figure: bar chart, Frequency of Birth Year of First Year Class; x-axis: Birth Year (1985, 1990, 1991, 1994, 1995, 1996, 1997, 1998, 1999); y-axis: Count (0 to 9)]

The most common birth year is 1999. The bar chart shows a distribution that is left skewed. We would expect the highest frequency in a first year class to be the most recent graduates, with the numbers falling off the older the student. That would lead to a left skew. The year 1995 sticks out as unusual. We would expect the number of students born in that year to fall between 1 and 3, not 5. There is no plausible explanation. This looks like a matter of chance.

b) [Figure: Boxplot of Birth Year; y-axis: Birth Year (1985.0 to 2000.0)]

Another unusual feature is 1985, which is far from the rest of the group. Although only one student was born in that year, it is clearly an outlier. See the boxplot.

c) 17.2% of the first year class were born in 1995.
[Figure: Pie Chart of Birth Year of First Year Class. Slices: 1985 3.4%, 1990 3.4%, 1991 3.4%, 1994 6.9%, 1995 17.2%, 1996 3.4%, 1997 10.3%, 1998 24.1%, 1999 27.6%]

2. (8 marks) Students in an Intro Stats class were asked to describe their politics as “Liberal”, “Conservative”, or “Moderate”. Results are shown in the table below.

Answer
a) Marginal and Joint Distribution of Gender and Politics
Rows: Gender   Columns: Politics

          Liberal  Moderate  Conservative     All
Female      18.23     18.75          3.13   40.10
Male        26.04     22.92         10.94   59.90
All         44.27     41.67         14.06  100.00

Cell Contents: % of Total

[Figure: bar chart, Joint Distribution of Politics and Gender; Female and Male bars within Liberal, Moderate and Conservative; y-axis: Percent. Percent is calculated within all data.]

b) Relative Frequency of Politics Conditional on Gender
Rows: Gender   Columns: Politics

          Liberal  Moderate  Conservative     All
Female      45.45     46.75          7.79  100.00
Male        43.48     38.26         18.26  100.00
All         44.27     41.67         14.06  100.00

Cell Contents: % of Row

[Figure: bar chart, Relative Frequency of Politics Conditional on Gender; Liberal, Moderate and Conservative bars within Female and Male; y-axis: Percent. Percent is calculated within levels of Gender.]

c) Relative Frequency of Gender Conditional on Politics
Rows: Gender   Columns: Politics

          Liberal  Moderate  Conservative     All
Female      41.18     45.00         22.22   40.10
Male        58.82     55.00         77.78   59.90
All        100.00    100.00        100.00  100.00

Cell Contents: % of Column

[Figure: bar chart, Relative Frequency of Gender Conditional on Politics; Female and Male bars within Liberal, Moderate and Conservative; y-axis: Percent. Percent is calculated within variables.]

d) 59.9% of the class were male.
e) 14.06% consider themselves to be Conservative.
f) 18.26% of the males in the class consider themselves Conservative.
g) 10.94% of all the students in the class are males who consider themselves Conservative.
h) 45.45% of the females in the class consider themselves to be Liberal.
i) In the graph titled “Rel Freq of Politics Conditional on Gender”, the heights of the blue bars are about the same for each gender, as are the heights of the red bars and the heights of the yellow bars. In the graph titled “Rel Freq of Gender Conditional on Politics”, the heights of the blue bars are about the same for both liberals and moderates, and the heights of the red bars are about the same for liberals and moderates. The heights of the red and blue bars for conservatives differ from the pattern we see for liberals and moderates, but that may not be significant. Overall, it would appear the two variables are independent, in that one’s politics don’t seem to depend on one’s gender.

3. (4 marks) The annual number of deaths from tornadoes in the United States is given for the years 1998 to 2011.

a) Descriptive Statistics: no of deaths

Variable       N    N*   Mean    Q1     Median   Q3      Range   IQR
no of deaths   14   0    133.5   39.5   60.5     126.3   534.0   86.8

b) By hand: first put the data in increasing order:

21 35 38 40 40 45 54 67 81 94 125 130 544 555

Median = 0.5 (54 + 67) = 60.5
Q1 = 40 (number in the middle of the bottom half)
Q3 = 125 (number in the middle of the top half)
IQR = Q3 - Q1 = 125 - 40 = 85

c) The Minitab answers do not agree totally with those done by hand. Minitab calculates percentiles, not quartiles. When it gives a value for Q1, it is really giving the 25th percentile, which can fall between two data values (here 39.5, between 38 and 40). In a similar manner, what Minitab gives as Q3 really represents the 75th percentile.

4. (6 marks) The frequency table shows the heights (in inches) of 130 members of a choir. Data are also found in the file Heights of Choir Members.
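Returning briefly to question 3 c): the disagreement between the by-hand and Minitab quartiles can be reproduced with a short sketch. It is written in Python rather than Minitab, and it assumes that Minitab's quartiles follow the (n + 1)p position rule with linear interpolation, which is what Python's statistics.quantiles does with method="exclusive". The helper name hand_quartiles is made up for this illustration.

```python
from statistics import median, quantiles

# Tornado deaths, 1998 to 2011, as listed in question 3 b)
deaths = [21, 35, 38, 40, 40, 45, 54, 67, 81, 94, 125, 130, 544, 555]


def hand_quartiles(values):
    """By-hand rule: Q1 and Q3 are the medians of the bottom and top halves.

    For an odd number of values the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


# By-hand quartiles (median of each half)
q1_hand, med_hand, q3_hand = hand_quartiles(deaths)

# Percentile-style quartiles: position (n + 1) * p with interpolation.
# Assumption: this is the rule behind the Minitab output in part a).
q1_pct, med_pct, q3_pct = quantiles(deaths, n=4, method="exclusive")

print("By hand:    ", q1_hand, med_hand, q3_hand, "IQR =", q3_hand - q1_hand)
print("Percentiles:", q1_pct, med_pct, q3_pct, "IQR =", q3_pct - q1_pct)
```

Under these assumptions the sketch gives Q1 = 40 and Q3 = 125 by hand, and 39.5 and 126.25 by the interpolation rule, which matches the two sets of answers in question 3. The medians agree because both rules average the two middle values.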
a) Descriptive Statistics: Height

Variable   Minimum   Q1       Median   Q3       Maximum   IQR
Height     60.000    65.000   66.000   70.000   76.000    5.000

b) Descriptive Statistics: Height

Variable   Mean     StDev
Height     67.115   3.792

c), d) The histogram looks to be bimodal, which we would expect because the distributions of height for males and females are different. The high point at about 65 inches probably represents the average height for females in the choir, while the other high point at roughly 69 to 70 inches would represent the average height of the males in the choir. I changed the number of bins to 15 rather than the default of 17 that Minitab used, as shown in the plot below.

[Figure: Histogram of Height; x-axis: Height (60.0 to 76.8); y-axis: Frequency (0 to 20)]

The boxplot shows that the distribution of heights, without regard to males or females, is positively skewed because the top half of the box is longer than the bottom half. Some members of the choir are taller than the “typical” member.

[Figure: Histogram of Height (x-axis: Height, 60 to 75; y-axis: Frequency) beside Boxplot of Height (y-axis: Height, 60 to 78)]

5. (4 marks) The Cornell Lab of Ornithology holds an annual Christmas Bird Count in which bird watchers at various locations around the country see how many different species of birds they can spot. Here are some of the sites in Texas during the 2010 event:

Stem-and-Leaf Display: Number of bird species

Stem-and-leaf of Number of bird species   N = 20
Leaf Unit = 1.0

  7    15   0004689
 (5)   16   04469
  8    17   578
  5    18   3
  4    19   9
  3    20   3
  2    21   6
  1    22
  1    23   1

[Figure: Number of Species in 2010 Texas Christmas Bird Count; y-axis: Number of species (150 to 240)]

This distribution is very right skewed. Think of the stem-and-leaf display as a histogram and tilt your head so you are looking at it sideways. There is a long tail to the right, which makes it positively skewed (positive = right). Most of the locations (15 of the 20) taking part in the Christmas Bird Count reported between 150 and 179 different species.
Fewer locations reported counts of over 200 different species, and only one location reported a count above 230. In fact, if you plot these data as a boxplot, you can clearly see that the count of 231 is an outlier.

6. (5 marks) What percent of the general U.S. population are high school dropouts? Data from the Statistical Abstract of the United States, 120th ed., give percentages of high school dropouts by state. Data are found in the file High School Dropouts.

a) Using Minitab, obtain summary descriptive statistics for this data set. Include the mean, median, standard deviation, Min, Max, Range, Q1, Q3 and IQR.

Descriptive Statistics: High school dropout (%)

Variable                  N    Mean     StDev   Minimum   Q1      Median   Q3       Maximum
High school dropout (%)   50   10.420   2.383   5.000     9.000   10.000   12.250   15.000

Variable                  Range    IQR
High school dropout (%)   10.000   3.250

b) Compare the mean and the median. Do these statistics agree with one another? If so, what does that tell you about the distribution of these data? If they don’t, what would this imply?

The mean is 10.42% and the median is 10.0%. These values agree pretty closely, which implies the distribution will be fairly symmetric. Since the mean is a bit bigger than the median, we can expect a slight right skew to the data set.

c) Provide a boxplot of these data. Comment on its shape. Does it agree with your answers to parts ‘a’ and ‘b’?

[Figure: Boxplot of Percent high school dropout; y-axis: Percent high school dropout (5.0 to 15.0)]

The data are fairly symmetric, but the center 50% (the box) does exhibit a right skew.

d) Using the range and the IQR, describe the spread of this data set.

The spread of the entire data set is given by the range, which is 10 percentage points. The IQR of 3.25 percentage points describes the spread of the middle 50% of the data (i.e. the length of the box). If the data were spread evenly across the range, we would expect the IQR to be about half the range, or 5.0. Since it is only 3.25, the data are less variable in the center than we might expect.
e) Would it be better to use the mean and the standard deviation, or the median and the IQR, to describe the centre and spread of these data? Explain why.

Since the data are mostly symmetric and do not indicate any outliers, the mean and standard deviation would be the better measures to use to describe the center and the spread of these data.

7. (5 marks) At Centre Hospital there is some concern about the high turnover of nurses. A survey of 20 randomly picked nurses from Centre Hospital was conducted to determine how long (in months) nurses had been in their current positions. To compare the turnover of the nurses, 20 clerical staff were also surveyed. The responses are shown in the table below.

a) Provide histograms of both data sets and describe and compare their distributions.

[Figure: Histogram of Time as nurse (months), Time as clerical staff (months), shown as side-by-side panels and as stacked panels; x-axis: months (0 to 70); y-axis: Frequency]

The distribution for clerical staff is unimodal and mostly symmetric, with most respondents being in their position 15 to 35 months and one person having 65 to 75 months of service. The nurses, on the other hand, show two distinct modes and hence a bimodal distribution. One group seems to have a mode of 7.5 to 12.5 months, while the other group clusters around 22.5 to 27.5 months. The nurses' service ranges from 0 to 42.5 months and, with the exception of 2 people, the clerical staff's service also spans about 40 to 45 months. Perhaps nurses include both permanent and casual groups, which could account for the bimodal distribution.

b) What can we say about the length of time in the current position for nurses and clerical staff? Are they the same, different? Explain.
The median for both groups is 23 months, but the distributions are different. With the exception of one clerical staff person with upwards of 70 months of service, the spread of service is about the same for clerical staff and nurses, that is, from 0 to 40 months. However, the nurses show a bimodal distribution with one peak at 10 months and another at 30 months, while the clerical staff show a unimodal distribution with a peak around 25 months of service. Perhaps nurses include both permanent and casual groups, which could account for the bimodal distribution.

8. (4 marks) Find the z value (to two decimal places) that cuts off an area in the upper tail section of the standard normal curve equal to:
a) 0.025
b) 0.05
c) 0.005
d) 0.01
Show the area and corresponding value of z on a graph of the normal curve for each part of the problem. You can do this by hand and scan the graph, or use the Probability Distribution Plot in Minitab.

a) z = 1.96
b) z = 1.645
c) z = 2.576
d) z = 2.326

[Figure: four Distribution Plots, Normal, Mean=0, StDev=1; each shades the upper-tail area (0.025, 0.05, 0.005, 0.01) to the right of the corresponding z (1.960, 1.645, 2.576, 2.326); x-axis: z; y-axis: Density]

9. (5 marks) Suppose that a normal random variable x has a mean of $\mu = 32.0$ and a standard deviation of $\sigma = 6.0$. Find the following probabilities:
a) $P(x \le 24.0)$
b) $P(x \ge 38.0)$
Calculate the value of z for each value of x and provide a sketch of the normal curve for each part of the problem showing the required area under the curve.
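a) $z = \frac{x - \mu}{\sigma} = \frac{24.0 - 32.0}{6.0} = -1.33$

$P(x \le 24.0) = P(z \le -1.33) = 0.09121$

Or from tables: 0.09176. The small difference arises because 0.09121 corresponds to the unrounded z = -1.3333, while the table value uses z rounded to -1.33.

[Figure: Distribution Plot, Normal, Mean=32, StDev=6, with area 0.09121 shaded to the left of X = 24; Distribution Plot, Normal, Mean=0, StDev=1, with the area to the left of z = -1.33 shaded; y-axis: Density]

b) $z = \frac{x - \mu}{\sigma} = \frac{38.0 - 32.0}{6.0} = 1.00$

$P(x \ge 38.0) = P(z \ge 1.00) = 0.1587$

[Figure: Distribution Plot, Normal, Mean=32, StDev=6, with area 0.1587 shaded to the right of X = 38; Distribution Plot, Normal, Mean=0, StDev=1, with area 0.1587 shaded to the right of z = 1; y-axis: Density]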
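The z values and tail areas in questions 8 and 9 can be cross-checked without Minitab. The sketch below is one way to do it, using Python's standard-library statistics.NormalDist. The helper names upper_tail_z and z_score are made up for this illustration and are not part of the assignment.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_tail_z(area):
    """Return the z value that leaves `area` in the upper tail."""
    return std_normal.inv_cdf(1 - area)


def z_score(x, mean, stdev):
    """Standardize x: z = (x - mean) / stdev."""
    return (x - mean) / stdev


# Question 8: z values for the four upper-tail areas
critical = {area: round(upper_tail_z(area), 3) for area in (0.025, 0.05, 0.005, 0.01)}

# Question 9: x is normal with mean 32.0 and standard deviation 6.0
mu, sigma = 32.0, 6.0
x_dist = NormalDist(mu, sigma)

z_a = z_score(24.0, mu, sigma)  # about -1.3333
z_b = z_score(38.0, mu, sigma)  # exactly 1.0

p_a = x_dist.cdf(24.0)                      # P(x <= 24.0), unrounded z
p_a_table = std_normal.cdf(round(z_a, 2))   # same area using z = -1.33, as in tables
p_b = 1 - x_dist.cdf(38.0)                  # P(x >= 38.0)

print(critical)
print(round(z_a, 2), round(p_a, 5), round(p_a_table, 5))
print(round(z_b, 2), round(p_b, 4))
```

Computing p_a both ways shows where the two answers in 9 a) come from: the unrounded z gives about 0.09121, while rounding z to -1.33 first gives about 0.09176.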