Uploaded by sarahynes1995

Solution to Assignment #1

MSTM 4025
Solutions to Assignment #1 Winter 2019
Total of 47 marks, due: Jan. 23, 11:50pm
Comments:
Student Name:
Student ID:
Question   Grades
1          /6
2          /8
3          /4
4          /6
5          /4
6          /5
7          /5
8          /4
9          /5
Total      /47
1. (6 marks) Open the file Birth Year of First Year Class.
a. Plot these data using a bar chart. What birth year is most common among the
class? What is the distribution of the bar chart? (Does it look symmetric or
skewed?) What would we expect this distribution to look like? Is there a birth
year that sticks out as unusual and what is it? Is there a plausible explanation for
this or do you figure it is pure chance?
b. Provide a boxplot of Birth Year of the first year class. Is there any issue shown in
this plot, and is there a plausible explanation?
c. Plot the relative percentage of these using a pie chart. Be sure to include the
birth year and the percentage in the graph. What percentage of the class was
born in 1995?
Answer
a)
[Figure: bar chart, Frequency of Birth Year of First Year Class; x-axis: Birth Year
(1985, 1990, 1991, 1994 to 1999), y-axis: Count (0 to 9)]
The most common birth year is 1999.
The bar chart shows a distribution that is left skewed. We would expect the highest
frequency in a first year class to be the most recent graduates and for the numbers
to fall off the older the student. That would lead to a left skew. The year 1995 sticks
out as unusual. We would expect the number of students born in that year to fall
between 1 and 3, not 5. There is no plausible explanation. This looks like a matter of
chance.
[Figure: Boxplot of Birth Year; vertical axis: Birth Year (1985 to 2000)]
b) Another unusual feature is 1985, which is far from the rest of the students. Although
only one student was born in that year, it is clearly an outlier. See the boxplot.
c) 17.2% of the first year class were born in 1995.
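The relative percentages in part c) can be checked with a short calculation. The sketch below is only an illustration: the raw counts are not listed in this text, so the counts used are an assumption, reconstructed from the pie chart percentages for a class of 29 students (one student per 3.4% slice). The function name is hypothetical.

```python
# Counts per birth year.
# ASSUMPTION: these counts are not given in the text; they are reconstructed
# from the pie chart percentages, assuming a class of 29 students.
counts = {1985: 1, 1990: 1, 1991: 1, 1994: 2, 1995: 5,
          1996: 1, 1997: 3, 1998: 7, 1999: 8}


def relative_percentages(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {year: 100 * n / total for year, n in counts.items()}


percentages = relative_percentages(counts)
most_common_year = max(counts, key=counts.get)  # year with the largest count

for year, pct in percentages.items():
    print(f"{year}: {pct:.1f}%")
print("Most common birth year:", most_common_year)
```

Under these assumed counts, the 1995 share rounds to the 17.2% reported in part c).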
[Figure: Pie Chart of Birth Year of First Year Class]
Birth Year   Percent
1985           3.4%
1990           3.4%
1991           3.4%
1994           6.9%
1995          17.2%
1996           3.4%
1997          10.3%
1998          24.1%
1999          27.6%
2. (8 marks) Students in an Intro Stats class were asked to describe their politics as “Liberal”,
“Conservative”, or “Moderate”. Results are shown in the table below.
Answer
a) Marginal and Joint Distribution of Gender and Politics
Rows: Gender   Columns: Politics
          Liberal  Moderate  Conservative     All
Female      18.23     18.75          3.13   40.10
Male        26.04     22.92         10.94   59.90
All         44.27     41.67         14.06  100.00
Cell Contents: % of Total
[Figure: Joint Distribution of Politics and Gender; bars for Female and Male within
Liberal, Moderate, and Conservative; y-axis: Percent (0 to 25). Percent is calculated
within all data.]
b) Relative Frequency of Politics Conditional on Gender
Rows: Gender   Columns: Politics
          Liberal  Moderate  Conservative     All
Female      45.45     46.75          7.79  100.00
Male        43.48     38.26         18.26  100.00
All         44.27     41.67         14.06  100.00
Cell Contents: % of Row
[Figure: Relative Frequency of Politics Conditional on Gender; bars for Liberal,
Moderate, and Conservative within Female and Male; y-axis: Percent (0 to 50). Percent is
calculated within levels of Gender.]
c) Relative Frequency of Gender Conditional on Politics
Rows: Gender   Columns: Politics
          Liberal  Moderate  Conservative     All
Female      41.18     45.00         22.22   40.10
Male        58.82     55.00         77.78   59.90
All        100.00    100.00        100.00  100.00
Cell Contents: % of Column
[Figure: Relative Frequency of Gender Conditional on Politics; bars for Female and Male
within Liberal, Moderate, and Conservative; y-axis: Percent (0 to 80). Percent is
calculated within variables.]
d) 59.9% of the class were male.
e) 14.06% consider themselves to be Conservative.
f) 18.26% of the males in the class consider themselves Conservative.
g) 10.94% of all the students in the class are males who consider themselves Conservative.
h) 45.45% of the females in the class consider themselves to be Liberal.
i) In the graph titled “Rel Freq of Politics Conditional on Gender” the heights of the blue
bars are about the same, the heights of the red bars are about the same, and the heights
of the yellow bars are about the same for each gender.
In the graph titled “Rel Freq of Gender Conditional on Politics” the heights of the blue
bars are about the same for both liberals and moderates, and the heights of the red bars
are about the same for liberals and moderates. The heights of the red and blue bars for
conservatives differ from the pattern we see for liberals and moderates, but
that may not be significant. Overall, it would appear the two variables are independent,
in that one’s politics don’t seem to depend on one’s gender.
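The conditional percentages quoted in parts f) and h) follow from the joint distribution by dividing each cell by its row or column total. The sketch below shows that arithmetic using the "% of Total" values from part a). The function names are hypothetical, and because the inputs are already rounded to two decimals, the results can differ from the Minitab tables in the second decimal place.

```python
# Joint distribution (% of total) copied from the table in part a).
joint = {
    "Female": {"Liberal": 18.23, "Moderate": 18.75, "Conservative": 3.13},
    "Male": {"Liberal": 26.04, "Moderate": 22.92, "Conservative": 10.94},
}


def politics_given_gender(joint):
    """Row percentages: divide each cell by its gender (row) total."""
    result = {}
    for gender, row in joint.items():
        row_total = sum(row.values())
        result[gender] = {p: 100 * v / row_total for p, v in row.items()}
    return result


def gender_given_politics(joint):
    """Column percentages: divide each cell by its politics (column) total."""
    politics = list(next(iter(joint.values())))
    col_totals = {p: sum(joint[g][p] for g in joint) for p in politics}
    return {g: {p: 100 * joint[g][p] / col_totals[p] for p in politics}
            for g in joint}


by_gender = politics_given_gender(joint)
by_politics = gender_given_politics(joint)

# Share of males who are Conservative (part f) and of females who are Liberal (part h).
print(f"{by_gender['Male']['Conservative']:.2f}")
print(f"{by_gender['Female']['Liberal']:.2f}")
# Share of Conservatives who are male (table in part c).
print(f"{by_politics['Male']['Conservative']:.2f}")
```

If politics and gender were independent, the row percentages would be roughly equal for females and males, which is the comparison made in part i).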
3. (4 marks) The annual number of deaths from tornadoes in the United States is given for
the years 1998 to 2011.
a) Descriptive Statistics: no of deaths
Variable       N  N*   Mean    Q1  Median     Q3  Range   IQR
no of deaths  14   0  133.5  39.5    60.5  126.3  534.0  86.8
b) By Hand:
First put data in increasing order:
21 35 38 40 40 45 54 67 81 94 125 130 544 555
Median = 0.5 (54 + 67) = 60.5
Q1 = 40 (number in middle of bottom half)
Q3 = 125 (number in middle of top half)
IQR = Q3 – Q1 = 125 – 40 = 85
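The by-hand quartiles can be checked against an interpolated version with a few lines of Python. This is a sketch only: `quartiles_by_hand` is a hypothetical helper that takes the median of each half, and `statistics.quantiles(..., method="exclusive")` is used here as a stand-in for an interpolating percentile rule. It is an assumption, not something stated in this solution, that this rule is the one Minitab applies, although it does reproduce the Minitab values in part a) for these data.

```python
from statistics import median, quantiles

# Tornado deaths, 1998 to 2011, in increasing order (from part b).
deaths = [21, 35, 38, 40, 40, 45, 54, 67, 81, 94, 125, 130, 544, 555]


def quartiles_by_hand(values):
    """Q1 and Q3 as the medians of the bottom and top halves of the data."""
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]
    # For an odd number of values, leave the middle value out of both halves.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(s), median(upper)


q1_hand, med, q3_hand = quartiles_by_hand(deaths)

# Interpolated 25th, 50th, and 75th percentiles.
q1_interp, med_interp, q3_interp = quantiles(deaths, n=4, method="exclusive")

print("By hand:      Q1 =", q1_hand, " Q3 =", q3_hand, " IQR =", q3_hand - q1_hand)
print("Interpolated: Q1 =", q1_interp, " Q3 =", q3_interp, " IQR =", q3_interp - q1_interp)
```

The two methods agree on the median but give slightly different quartiles, which is the discrepancy discussed in part c).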
c) Minitab answers do not agree totally with those done by hand. Minitab calculates
percentiles, not quartiles. When it gives a value for Q1 it really is giving the 25th
percentile, found by interpolating between ordered data values. Here Q1 = 39.5 lies
between the 3rd and 4th ordered values (38 and 40). In a similar manner, what Minitab
gives as Q3 really represents the 75th percentile: Q3 = 126.3 lies between the 11th and
12th ordered values (125 and 130).
4. (6 marks) The frequency table shows the heights (in inches) of 130 members of a choir.
Data are also found in the file Heights of Choir Members.
a) Descriptive Statistics: Height
Variable  Minimum      Q1  Median      Q3  Maximum    IQR
Height     60.000  65.000  66.000  70.000   76.000  5.000
b) Descriptive Statistics: Height
Variable    Mean  StDev
Height    67.115  3.792
c)
d) The histogram looks to be bimodal, which we would expect because the distribution of
height for males and females is different. The high point at about 65 inches probably
represents the average height for females in the choir, while the other point at roughly
69 to 70 inches would represent the average height of the males in the choir. I changed
the number of bins to 15 rather than the default that Minitab used of 17 as shown in the
plot below.
[Figure: Histogram of Height; x-axis: Height, y-axis: Frequency]
The boxplot shows the distribution of heights, without regard to males or females, is positively
skewed because the top half of the box is longer than the bottom half. Some members of the
choir are taller than the “typical” member.
[Figures: Histogram of Height (x-axis: Height, y-axis: Frequency); Boxplot of Height
(axis: Height, 60 to 78)]
5. (4 marks) The Cornell Lab of Ornithology holds an annual Christmas Bird Count in which bird
watchers at various locations around the country see how many different species of birds
they can spot. Here are some of the sites in Texas during the 2010 event:
Stem-and-Leaf Display: Number of bird species
Stem-and-leaf of Number of bird species  N = 20
Leaf Unit = 1.0
  7   15  0004689
 (5)  16  04469
  8   17  578
  5   18  3
  4   19  9
  3   20  3
  2   21  6
  1   22
  1   23  1
[Figure: Number of Species in 2010 Texas Christmas Bird Count; axis: Number of species
(150 to 240)]
This distribution is very right skewed. Think of it as a histogram and twist your head so you are
looking at it sideways. There is a long tail to the right, which makes it positively skewed (positive =
right). Most of the locations (15 of the 20) taking part in the Christmas Bird Count reported
between 150 and 179 different species. Fewer locations reported counts of over 200 different
species, and only one location reported a count above 230. In fact, if you plot these data as a
boxplot, you can clearly see the 231 count is an outlier.
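The outlier claim can be checked with the usual boxplot rule, which flags values more than 1.5 IQR beyond the quartiles. The sketch below reads the 20 counts off the stem-and-leaf display. The quartile rule used (`method="exclusive"` in Python's `statistics.quantiles`) is an assumption chosen as an interpolating rule; a different quartile convention can move the fences slightly and change which points are flagged.

```python
from statistics import quantiles

# Species counts read off the stem-and-leaf display (N = 20).
species = [150, 150, 150, 154, 156, 158, 159,
           160, 164, 164, 166, 169,
           175, 177, 178,
           183, 199, 203, 216, 231]


def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    # ASSUMPTION: interpolated quartiles; other conventions give other fences.
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


outliers = boxplot_outliers(species)
print("Outliers:", outliers)
```

With this quartile rule only the count of 231 falls outside the fences, which agrees with the comment above.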
6. (5 marks) What percent of the general U.S. population are high school dropouts? Data
from Statistical Abstract of the United States, 120th ed, gives percentages of high school
dropouts by state. Data are found in the file High School Dropouts.
a) Using Minitab, obtain summary descriptive statistics for this data set. Include the mean,
median, standard deviation, Min, Max, Range, Q1 and Q3 and IQR.
Descriptive Statistics: High school dropout (%)
Variable                  N    Mean  StDev  Minimum     Q1  Median      Q3  Maximum
High school dropout (%)  50  10.420  2.383    5.000  9.000  10.000  12.250   15.000
Variable                  Range    IQR
High school dropout (%)  10.000  3.250
b) Compare the mean and the median. Do these statistics agree with one another? If so,
what does that tell you about the distribution of these data? If they don’t, what would
this imply?
The mean is 10.42% and the median is 10.0%. These values agree pretty closely and this implies
the distribution will be fairly symmetric. Since the mean is a bit bigger than the median, we can
expect a slight right skew to the data set.
c) Provide a boxplot of these data. Comment on its shape. Does it agree with your answers
to part ‘a’ and ‘b’?
[Figure: Boxplot of Percent high school dropout; axis: Percent high school dropout
(5.0 to 15.0)]
Data are fairly symmetric but the center 50% (the box) do exhibit a right skew.
d) Using the range and the IQR, describe the spread of this data set.
The spread of the entire data set is given by the range, which is 10 percentage points.
The IQR of 3.25 percentage points describes the spread of the middle 50% of the data (i.e., the
length of the box). If the data were spread evenly across the range, we would expect the IQR to
be about half the range, or 5.0. Since it is only 3.25, the data are more concentrated in the
center than an even spread would give.
f) Would it be better to use the mean and the standard deviation, or the median and the
IQR to describe the centre and spread of these data? Explain why.
Since the data are mostly symmetric and do not indicate any outliers, the mean and st dev
would be the better measures to use to describe the center and the spread of these data.
7. (5 marks) At Center Hospital there is some concern about the high turnover of nurses. A
survey of 20 randomly picked nurses from Center Hospital was conducted to determine
how long (in months) nurses had been in their current positions. To compare the
turnover of the nurses, 20 clerical staff were also surveyed. The responses are shown in
the table below.
a) Provide histograms of both data sets and describe and compare their distributions.
[Figure: Histogram of Time as nurse (months), Time as clerical staff (months); two panels,
x-axis: months (0 to 70), y-axis: Frequency]
The distribution for clerical staff is unimodal and mostly symmetric, with most respondents being
in their position 15 to 35 months and one person having 65 to 75 months of service. The nurses,
on the other hand, show two distinct modes, so their distribution is bimodal. One group
seems to have a mode of 7.5 to 12.5 months, while the other group clusters around 22.5 to 27.5
months. The nurses have service that ranges from 0 to 42.5 months and, with the exception of
2 people, the clerical staff also range up to about 40 to 45 months.
Perhaps nurses include both permanent and casual groups, which could account for the
bimodal distribution.
b) What can we say about the length of time in the current position for nurses and clerical
staff? Are they the same, different? Explain.
The median for both groups is 23 months. With the exception of one clerical staff person with
upwards of 70 months of service, the spread of service is also about the same for clerical staff
and nurses, that is, from 0 to 40 months. However, the shapes of the distributions differ: the
nurses show a bimodal distribution with peaks at 10 months and another at 30 months, while the
clerical staff show a unimodal distribution with a peak around 25 months of service. Perhaps nurses
include both permanent and casual groups, which could account for the bimodal distribution.
8. (4 marks) Find the z value (to two decimal places) that cuts off an area in the upper tail
section of the standard normal curve equal to:
a) 0.025
b) 0.05
c) 0.005
d) 0.01
Show the area and corresponding value of z on a graph of the normal curve for each part of the
problem. You can do this by hand and scan the graph, or use the Probability Distribution Plot in
Minitab.
a) z = 1.96
[Figure: Distribution Plot, Normal, Mean=0, StDev=1; upper-tail area 0.025 to the right
of z = 1.960]
b) z = 1.645
[Figure: Distribution Plot, Normal, Mean=0, StDev=1; upper-tail area 0.05 to the right
of z = 1.645]
c) z = 2.576
[Figure: Distribution Plot, Normal, Mean=0, StDev=1; upper-tail area 0.005 to the right
of z = 2.576]
d) z = 2.326
[Figure: Distribution Plot, Normal, Mean=0, StDev=1; upper-tail area 0.01 to the right
of z = 2.326]
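These z values can also be found without Minitab. The sketch below uses `NormalDist` from Python's standard library: an upper-tail area of a corresponds to a cumulative area of 1 - a, so the inverse CDF is evaluated at 1 - a. The helper name `upper_tail_z` is hypothetical.

```python
from statistics import NormalDist


def upper_tail_z(area):
    """Return z such that P(Z >= z) = area for a standard normal Z."""
    return NormalDist().inv_cdf(1 - area)


for area in (0.025, 0.05, 0.005, 0.01):
    print(f"upper-tail area {area}: z = {upper_tail_z(area):.3f}")
```

Rounded to three decimals, the results match the values in parts a) to d).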
9. (5 marks) Suppose that a normal random variable x has a mean of μ = 32.0 and a standard
deviation of σ = 6.0. Find the following probabilities:
a) P(x ≤ 24.0)
b) P(x ≥ 38.0)
Calculate the value of z for each value of x and provide a sketch of the normal curve for
each part of the problem showing the required area under the curve.
a) z = (x − μ)/σ = (24.0 − 32.0)/6.0 = −1.33
P(x ≤ 24.0) = P(z ≤ -1.33) = 0.09121
[Figures: Distribution Plot, Normal, Mean=32, StDev=6, area 0.09121 to the left of X = 24;
Distribution Plot, Normal, Mean=0, StDev=1, area to the left of z = -1.33]
Or from tables (using z = -1.33): 0.09176
b) z = (x − μ)/σ = (38.0 − 32.0)/6.0 = 1.00
P(x ≥ 38.0) = P(z ≥ 1.00) = 0.1587
[Figures: Distribution Plot, Normal, Mean=32, StDev=6, area 0.1587 to the right of X = 38;
Distribution Plot, Normal, Mean=0, StDev=1, area 0.1587 to the right of z = 1]
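The probabilities in Question 9 can be reproduced with `NormalDist` from Python's standard library. The sketch below standardizes each x value and computes the required tail areas. The variable and function names are hypothetical. It also evaluates the table lookup for part a), where z is first rounded to two decimals, which explains the small gap between 0.09121 and 0.09176.

```python
from statistics import NormalDist

mu, sigma = 32.0, 6.0
x_dist = NormalDist(mu, sigma)


def z_score(x):
    """Standardize x using the mean and standard deviation given in the question."""
    return (x - mu) / sigma


# Part a): lower-tail probability P(x <= 24.0).
p_a = x_dist.cdf(24.0)
# Same probability from a table lookup, where z is rounded to two decimals first.
p_a_table = NormalDist().cdf(round(z_score(24.0), 2))

# Part b): upper-tail probability P(x >= 38.0).
p_b = 1 - x_dist.cdf(38.0)

print(f"z = {z_score(24.0):.2f}, P(x <= 24.0) = {p_a:.5f} (table: {p_a_table:.5f})")
print(f"z = {z_score(38.0):.2f}, P(x >= 38.0) = {p_b:.4f}")
```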