Practice Midterm Exam

Statistics 103
Answers to practice problems for Midterm II
1. Isolation and brain-wave activity
a) I would use the analysis for two separate samples. The prisoners are
not matched before random assignment; rather, the group of 20 is split
into two groups without any matching.
b) 0.80  2.11 .46 2 / 10  .612 / 10 ,
to 0.80  2.11 .46 2 / 10  .612 / 10
or, after doing the math, (.80–2.11(.24), .80+2.11(.24)) = (.29, 1.31)
the 2.11 is the multiplier from the t-table with 17 degrees of freedom.
c) Looking at the confidence interval, we find that it is entirely positive.
Hence, the data suggest that the population average for non-confined
prisoners is larger than the population average for confined prisoners.
Alternatively, you can use a two-sided hypothesis test. The null
hypothesis is that the two averages are equal, and the alternative
hypothesis is that they differ. The test-statistic equals:
t = 0.80 / sqrt(.46^2/10 + .61^2/10) = 3.31.
The p-value equals the area under the t-curve with 17 degrees of freedom
to the left of -3.31 and to the right of 3.31. This is a very small
area, roughly .004. Hence, we reject the null hypothesis. There does
appear to be a difference in population average alpha waves between
confined and non-confined prisoners.
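For readers who want to reproduce the arithmetic in (b) and (c), here is a minimal Python sketch using only the summary numbers quoted above (difference in means 0.80, SDs 0.46 and 0.61, 10 prisoners per group, 17 degrees of freedom); the variable names and the use of scipy are my own.

    # Sketch: two-sample CI and t-test from the summary statistics above.
    from math import sqrt
    from scipy import stats

    diff, sd1, sd2, n1, n2, df = 0.80, 0.46, 0.61, 10, 10, 17

    se = sqrt(sd1**2 / n1 + sd2**2 / n2)         # standard error, about 0.24
    tmult = stats.t.ppf(0.975, df)               # multiplier, about 2.11
    print(diff - tmult * se, diff + tmult * se)  # 95% CI, about (0.29, 1.31)

    tstat = diff / se                            # about 3.31
    print(2 * stats.t.sf(tstat, df))             # two-sided p-value, about .004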
d) We assume the prisoners were randomly assigned to the treatments
(they were), and that the central limit theorem holds. We would check
the CLT by examining a normal probability plot to make sure the alpha
wave frequencies in each group roughly follow normal curves.
e) Because the study was randomized, we can draw valid conclusions for
the population of Canadian prisoners. However, I would be reluctant to
extend these conclusions to populations outside of Canadian prisons.
Prisoners may have different alpha-wave reactions to isolation than
non-prisoners because they differ psychologically from non-prisoners.
2. Chucky Cheese
a) This is a matched pairs analysis. Pairs of plants are put in the
same pot at the same time, so that they are matched. This matching
invalidates the two sample analysis.
b) Using the matched pairs analysis, we get
(2.63 - 2.145 sqrt(1.97^2/15), 2.63 + 2.145 sqrt(1.97^2/15))
= (2.63 - 2.145(.51), 2.63 + 2.145(.51)) = (1.54, 3.72)
The 2.145 is the multiplier from the t-curve with 14 degrees of freedom.
c) The null hypothesis is that the average difference equals zero. The
alternative hypothesis is that the average difference (cross – self) is
greater than zero.
The test statistic equals:
t = 2.63 / sqrt(1.97^2/15) = 5.15
The p-value equals the area under the t-curve with 14 degrees of freedom
to the right of 5.15, which is a very small number (less than .0001).
Hence, we reject the null hypothesis. It does appear that the
population average height of cross-fertilized plants is larger than the
population average height of self-fertilized plants.
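A minimal sketch of the matched-pairs computation, using only the summary numbers quoted above (mean difference 2.63, SD of the differences 1.97, 15 pairs); with the raw paired data one could instead call scipy.stats.ttest_rel.

    # Sketch: matched-pairs CI and one-sided test from the summary statistics above.
    from math import sqrt
    from scipy import stats

    dbar, sd_d, n = 2.63, 1.97, 15
    df = n - 1

    se = sd_d / sqrt(n)                          # about 0.51
    tmult = stats.t.ppf(0.975, df)               # about 2.145
    print(dbar - tmult * se, dbar + tmult * se)  # 95% CI, about (1.54, 3.72)

    tstat = dbar / se                            # test statistic
    print(stats.t.sf(tstat, df))                 # one-sided p-value, well under .0001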
d) We assume that the plants were assigned randomly to locations in the
pot (they were), and that the central limit theorem holds. Because the
sample size is not large, the central limit theorem will hold only if
the differences roughly follow a normal curve. We could check
this using a normal probability plot or histogram in JMP.
e) It is not valid. Galton matched values based on the outcomes, not on
background characteristics. By doing so, he used the wrong standard error
(it’s too small). One cannot ignore the way the data are randomized.
3. A real bloody problem
a) I would use a hypothesis test for two separate samples. My null
hypothesis would be that the population average blood pressures are
equal in the two groups, and the alternative would be that they differ.
Because the sample sizes are large, the CLT is likely to apply, as long
as there are no really bad outliers in the two groups.
b) This is a 95% CI for a single proportion. It equals:
.13 ± 1.96 sqrt(.13(1 - .13)/100)
= (.13 - 1.96(.034), .13 + 1.96(.034))
= (.06, .20)
The assumptions are that the data were collected at random and that the
CLT holds. Because the sample size is large, the CLT holds.
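A quick sketch of this one-proportion interval; the sample proportion .13 and n = 100 come from the formula above, everything else is illustrative.

    # Sketch: 95% CI for a single proportion.
    from math import sqrt

    phat, n = 0.13, 100
    se = sqrt(phat * (1 - phat) / n)           # standard error of the sample proportion
    print(phat - 1.96 * se, phat + 1.96 * se)  # about (0.06, 0.20)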
c) This is a 95% CI for the difference in two proportions. It equals:
(.13 - .10) ± 1.96 sqrt(.13(1 - .13)/100 + .10(1 - .10)/100)
The assumptions are that the data were collected at random and that the
CLT holds. Because the sample size is large, the CLT holds.
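The corresponding sketch for the difference in two proportions, again taking both group sizes to be 100 as in the formula above.

    # Sketch: 95% CI for a difference in two proportions.
    from math import sqrt

    p1, n1 = 0.13, 100
    p2, n2 = 0.10, 100
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    print(diff - 1.96 * se, diff + 1.96 * se)  # roughly (-0.06, 0.12)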
d) The physician is wrong. Because this was a randomized experiment,
the background characteristics of the two groups should be similar.
Hence, a comparison of the blood pressures after the treatments gives a
measure of the effects of the treatments on blood pressures.
4. Choose the right analysis
a) z-test or confidence interval for difference in two proportions.
b) t-test or confidence interval for difference in two means for
separate samples. Note that “percentage of income” is a continuous
variable, like the amount of change in your pocket. It is not a
percentage of successes.
c) z-test or confidence interval for one proportion
d) regression
5. Linear combinations
a) This is not the variance. We need to add, not subtract. This is
because when taking variances of linear combinations, we square the
constants in front of each random variable, so that -1 squared equals 1.
b) Var(3X - 2Y) = 9 Var(X) + 4 Var(Y) - 2(3)(2) Cov(X, Y)
= 9(25^2) + 4(15^2) - 12(.3)(25)(15) = 5175
The last term comes from the fact that Cov(X, Y) = Corr(X, Y) SD(X) SD(Y).
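The same computation as a short sketch (the SDs of 25 and 15 and the correlation of .3 are as above):

    # Sketch: variance of the linear combination 3X - 2Y.
    sd_x, sd_y, corr = 25, 15, 0.3
    cov_xy = corr * sd_x * sd_y               # Cov(X, Y) = Corr(X, Y) SD(X) SD(Y)
    var = 3**2 * sd_x**2 + (-2)**2 * sd_y**2 + 2 * 3 * (-2) * cov_xy
    print(var)                                # 5175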
6. Studying for midterm
Let X be the random variable for score, and let Y be the random variable
for hours studied.
a) E(X) = 60(.06) + 70(.22) + 80(.35) + 90(.31) + 100(.06) = 80.9
Var(X) = E(X^2) - E(X)^2 = 60^2(.06) + 70^2(.22) + 80^2(.35) + 90^2(.31) + 100^2(.06) - 80.9^2 = 100.19
So, the SD(X) is 10.0095.
b) E(Y) = 0(.07) + 5(.19) + 10(.25) + 15(.29) + 20(.20) = 11.8
Var(Y) = E(Y^2) - E(Y)^2 = 0^2(.07) + 5^2(.19) + 10^2(.25) + 15^2(.29) + 20^2(.20) - 11.8^2 = 35.76
So, the SD(Y) is 5.98.
c) Cov(X, Y) = E(XY) - E(X)E(Y) = E(XY) - (80.9)(11.8).
E(XY) = (60)(5)(.01) + (70)(5)(.10) + (70)(10)(.07) + (70)(15)(.03)
+ (80)(5)(.05) + (80)(10)(.10) + (80)(15)(.15) + (80)(20)(.05)
+ (90)(5)(.03) + (90)(10)(.08) + (90)(15)(.10) + (90)(20)(.10)
+ (100)(15)(.01) + (100)(20)(.05) = 994
Putting it all together, we get Cov(X, Y) = 994 - 954.62 = 39.38.
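The arithmetic in (a)-(c) can be replicated with a short sketch; the marginal distributions and the nonzero-product cells of the joint distribution are copied from the work above.

    # Sketch: means, SDs, and covariance from the distributions used above.
    from math import sqrt

    px = {60: .06, 70: .22, 80: .35, 90: .31, 100: .06}       # marginal of X (score)
    py = {0: .07, 5: .19, 10: .25, 15: .29, 20: .20}          # marginal of Y (hours)

    ex = sum(x * p for x, p in px.items())                    # 80.9
    ey = sum(y * p for y, p in py.items())                    # 11.8
    sd_x = sqrt(sum(x**2 * p for x, p in px.items()) - ex**2) # about 10.01
    sd_y = sqrt(sum(y**2 * p for y, p in py.items()) - ey**2) # about 5.98

    # Joint cells with x*y > 0 (cells where x*y = 0 do not affect E(XY)).
    joint = [(60, 5, .01), (70, 5, .10), (70, 10, .07), (70, 15, .03),
             (80, 5, .05), (80, 10, .10), (80, 15, .15), (80, 20, .05),
             (90, 5, .03), (90, 10, .08), (90, 15, .10), (90, 20, .10),
             (100, 15, .01), (100, 20, .05)]
    exy = sum(x * y * p for x, y, p in joint)                 # 994
    print(exy - ex * ey)                                      # Cov(X, Y) = 39.38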
d) E(X | Y = 10) = 60(0) + 70(.28) + 80(.40) + 90(.32) + 100(0) = 80.4.
Var(X | Y = 10) = 60^2(0) + 70^2(.28) + 80^2(.40) + 90^2(.32) + 100^2(0) - 80.4^2 = 59.84
So, the SD(X | Y = 10) = 7.74.
e) E(X - Y) = E(X) - E(Y) = 80.9 - 11.8 = 69.1.
Var(X - Y) = Var(X) + (-1)^2 Var(Y) + 2(1)(-1)Cov(X, Y) = 100.19 + 35.76 - 78.76 = 57.19
f) Because the sample size is large (100), we can use the central limit
theorem for sample averages.
Pr(X̄ > 85) = Pr(Z > (85 - 80.9)/(10.0095/sqrt(100))) = Pr(Z > 4.09).
Looking at the normal table, the area to the right of 4.09 is less than
.0001. Hence there is less than a .01% chance the sample average score
for 100 students will exceed 85.
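A one-line check of that normal-curve area (scipy's norm.sf gives the upper-tail area):

    # Sketch: Pr(sample average of 100 scores exceeds 85) via the CLT.
    from math import sqrt
    from scipy.stats import norm

    z = (85 - 80.9) / (10.0095 / sqrt(100))   # about 4.09
    print(norm.sf(z))                         # upper-tail area, well under .0001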
g) Because the sample size is small (2), and the population distribution
of scores is not a normal curve, we cannot use the central limit theorem
for sample averages. We have to write down the sampling distribution
for the sample average, and determine the probability that the sample
average will exceed 85.
For two people, the only ways the sample average can exceed 85 include:
100, 100: yields an average of 100.
100, 90 or 90, 100: yields an average of 95.
100, 80 or 80, 100: yields an average of 90.
90, 90: yields an average of 90.
The sample average will equal 85 if we get 90, 80. But, we want the
chance that it will exceed 85, so we do not count the 90,80 option.
Because each person can be considered independent, we have:
Pr(X̄ = 100) = (.06)(.06) = .0036.
Pr(X̄ = 95) = (.06)(.31) + (.31)(.06) = .0372.
Pr(X̄ = 90) = (.06)(.35) + (.35)(.06) + (.31)(.31) = .1381.
Hence, the chance that the sample average score of two students exceeds
85 equals .1789, or 17.89%.
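The enumeration in (g) can also be done exhaustively in code, using the score distribution from part (a):

    # Sketch: exact Pr(average of two independent scores exceeds 85).
    from itertools import product

    px = {60: .06, 70: .22, 80: .35, 90: .31, 100: .06}
    prob = sum(p1 * p2 for (x1, p1), (x2, p2) in product(px.items(), repeat=2)
               if (x1 + x2) / 2 > 85)
    print(prob)                               # about .1789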
7. Finishing the analysis from Midterm 1 practice problems
Treatment Group       Mean    SD
Visits + Childcare    90.79   16.46
Control               85.13   18.23
a) Null hypothesis: Population average Peabody scores are the same for
childcare and control groups.
Alternative hypothesis: Population average Peabody score for childcare
group is larger than population average Peabody score for control group.
SE = sqrt(16.46^2/377 + 18.23^2/608) = 1.1248
Test statistic = (obs. - exp.)/SE = (90.79 - 85.13)/1.1248 = 5.03185
To get the p-value, we want the area under the t-curve with about 1000
degrees of freedom to the right of 5.03185. This is a p-value less than
.0001. Hence, we reject the null hypothesis: it appears that population
average Peabody scores under the childcare+visits treatment are larger
than population average Peabody scores under the control condition.
That is, there is less than a .01% chance that we would observe a
difference in the sample averages of 5.66 if in fact the two conditions
are equally effective. Hence, we doubt that the conditions are equally
effective.
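Here is the same test as a sketch, using the summary statistics in the table and the group sizes 377 and 608 from the SE above; with roughly 1000 degrees of freedom the t-curve is essentially the normal curve.

    # Sketch: two-sample test from the summary statistics in the table above.
    from math import sqrt
    from scipy import stats

    m1, s1, n1 = 90.79, 16.46, 377            # visits + childcare
    m2, s2, n2 = 85.13, 18.23, 608            # control

    se = sqrt(s1**2 / n1 + s2**2 / n2)        # about 1.125
    tstat = (m1 - m2) / se                    # about 5.03
    print(stats.t.sf(tstat, df=1000))         # one-sided p-value, well under .0001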
b) FALSE. Because it is a randomized experiment, background
characteristics in the two groups should be similar.
8. Conceptual questions about hypothesis testing.
(i) In 38% of all samples, the sample percentage of happy caregivers
will be larger for the treated group than for the control group.
FALSE. This is not the definition of a p-value. Actually, if in fact
the null hypothesis is true, the percentage of happy caregivers would be
larger for the treated group than the control group in about half of the
samples.
(ii) The sample percentage of happy caregivers is larger for the
treated group than for the control group.
TRUE. Because the test statistic uses treatment – control percentages,
and because the result is positive, the sample percentage for the
treated group must be larger than the sample percentage for the control
group. Note this question is about the sample percentages, not the
population percentages.
(iii) There is about a 30% difference in the sample percentages in the two
groups.
FALSE. .30 is the test statistic, which equals (treated % - control %)/SE. So,
the only way the numerator could equal .30 is if the SE=1. But, there is
no way the SE can equal 1, because we are dealing with percentages. To
verify this, write down the formula for the SE of the difference of two
percentages, and plug in any values of treated and control percentages
you want. You’ll always get an SE<1.
(v) There is a 38% chance that caregivers whose child is assigned to
the treatment condition will be happier than caregivers whose child is
assigned to the control condition.
FALSE. This is not the definition of a p-value. The p-value is not the
probability that the alternative hypothesis is true, which is what this
statement is saying.
(iv) The standard error for the difference in the sample percentages is
less than 0.05.
TRUE. To show this, write down the SE formula for this problem:
SE = sqrt(treated %(1 - treated %)/377 + control %(1 - control %)/608).
We don’t know the treated % or the control %. But, the largest value of
the SE occurs when the treated % = 0.5 and the control % = 0.5. (Plug
in any other values and you’ll see this is true.) If you plug 0.5s into
the SE formula, you get an SE = .032. Hence, the SE must be less than
.05, since it is at most .032.
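A quick numerical check that the SE is maximized at 0.5 and 0.5 for these group sizes:

    # Sketch: the SE of the difference in proportions is largest at 0.5 and 0.5.
    from math import sqrt

    def se(p1, p2, n1=377, n2=608):
        return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    print(se(0.5, 0.5))                                   # about .03, below .05
    grid = [i / 100 for i in range(101)]
    print(max(se(p1, p2) for p1 in grid for p2 in grid))  # no larger than se(0.5, 0.5)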
(vi) Heads-up: this is not a true/false question.
Based on the p-value of 0.38, write a brief conclusion about the effect
on caregivers’ happiness of the treatment compared to the control.
Because the p-value is large, we cannot reject the null hypothesis. It
is plausible that the population percentages are equal, because if the
null hypothesis is true, the difference in the observed sample
percentages could well be explained by random chance.
9. Me work out. Me strong.
a) Lower limit for 99% CI: x̄ - 2.738 s/sqrt(n). Note the multiplier for a 99%
CI is 2.738, obtained from the t-curve with 32 degrees of freedom.
The mean has to be right in the middle of the 99% CI. So, it must be 37.7.
Hence, 37.7 - 25.3 = 2.738 s/sqrt(33).
Solving for s, you get s = 26.02.
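The algebra can be checked numerically; the interval (25.3, 50.1), n = 33, and the 2.738 multiplier are as stated above.

    # Sketch: recover the sample SD from the 99% CI (25.3, 50.1) with n = 33.
    from math import sqrt
    from scipy import stats

    n, lower, upper = 33, 25.3, 50.1
    xbar = (lower + upper) / 2                     # the mean sits in the middle: 37.7
    tmult = stats.t.ppf(0.995, df=n - 1)           # about 2.738
    print(xbar, (xbar - lower) * sqrt(n) / tmult)  # 37.7 and about 26.0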
b)
i) If we took random samples of 33 male Duke undergraduate students
over and over again, we would expect roughly 95% of the sample averages
to fall within two SEs of the population average.
TRUE. This follows from the central limit theorem of MMDS Ch. 4 and is
the foundation of confidence intervals. Note that you are told the data
follow a normal curve, so the CLT kicks in automatically, regardless of
the sample size.
ii) Approximately 99% of all Duke men work out between 25.3 and 50.1
minutes during a workout.
Cannot tell, or false. The 99% confidence interval tells you a likely
range for the population AVERAGE amount of time Duke men work out. We
don’t know whether or not this range encompasses 99% of all work out
times for Duke men. We’d need the population histogram of the data to
see if that is the case.
iii) There is a 1% chance that the population average time spent
working out by male Duke undergraduate students is greater than 50.1.
FALSE. The population average time spent working out is either greater
than 50.1, or it is less than 50.1. There is no 1% chance. What is
true is that there is a 1% chance that the 99% confidence interval
method will give you an interval that does not contain the population
average.
iv) A 95% confidence interval made using the same data has a larger
lower limit than 25.3 and a smaller upper limit than 50.1.
TRUE. You would use a smaller multiplier, which means you subtract or
add less from the mean when making the 95% CI. Conceptually, if you are
willing to have less confidence, you can use a narrower interval.
10. Genetics and birth
a) Because there are 100 sampled births, we can use the central limit
theorem (np > 10 and n(1 - p) > 10).
The expected value of the sample percentage equals .50
The SE of the sample percentage = sqrt(.5(1 - .5)/100) = .05.
Hence, the z-statistic equals (.6 - .5)/.05 = 2.
We want the area under the normal curve to the right of 2, which equals
about 2.2%.
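A short sketch of this normal approximation (the sample percentage of .6 and n = 100 are taken from the z-statistic above):

    # Sketch: chance of 60% or more males in 100 births, by the CLT.
    from math import sqrt
    from scipy.stats import norm

    p0, n, phat = 0.5, 100, 0.6
    se = sqrt(p0 * (1 - p0) / n)              # 0.05
    z = (phat - p0) / se                      # 2
    print(norm.sf(z))                         # about .022, i.e. about 2.2%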
b) We can convert this to a problem about percentages. Because there
are 100 sampled births, we can use the central limit theorem (np>10, and
n(1-p)>10)
The expected value of the sample percentage equals .50.
The SE of the sample percentage = sqrt(.5(1 - .5)/100) = .05.
Hence, the z-statistic equals (.45 - .50)/.05 = -1.
We want the area under the normal curve to the left of -1, which equals
about 16%.
c) Because there are only 5 sampled births, we cannot use the central
limit theorem (np<10).
Hence, we have to use probability:
Pr(zero males) = Pr(female and female and female and female and female) = .5^5,
since births are independent.
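For (c), the exact probability can also be read off the binomial distribution, since the CLT is unavailable with only 5 births:

    # Sketch: exact chance of zero males in 5 independent births.
    from scipy.stats import binom

    print(0.5 ** 5)                           # .03125
    print(binom.pmf(0, 5, 0.5))               # same answer from the binomial distribution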
12.
a) Let X be the score on midterm 1, and let Y be the score on midterm 2.
E((X + Y)/2) = (1/2)E(X + Y) = (1/2)(E(X) + E(Y)) = (1/2)(80 + 70) = 75.
Var((X + Y)/2) = (1/4)Var(X + Y) = (1/4)(Var(X) + Var(Y) + 2Cov(X, Y))
= (1/4)(10^2 + 15^2 + 2(.65)(10)(15)) = 130
Hence, the SD of the average of the two scores is 11.4.
b) E(.3X + .7Y) = .3E(X) + .7E(Y) = (.3)(80) + (.7)(70) = 73.
Var(.3X + .7Y) = Var(.3X) + Var(.7Y) + 2Cov(.3X, .7Y) = .3^2 Var(X) + .7^2 Var(Y) + 2(.3)(.7)Cov(X, Y)
= (.09)(10^2) + (.49)(15^2) + 2(.21)(.65)(10)(15) = 160.2
Hence, the SD of the weighted average of the two scores is 12.66.
14a and b) Let X_i be the random variable for person i, where i = 1, ..., n. We
know that Pr(X_i = 1) = p and that Pr(X_i = 0) = 1 - p. Note that
P̂ = (X_1 + X_2 + ... + X_n)/n.
Hence,
E(P̂) = E((X_1 + X_2 + ... + X_n)/n) = (1/n)E(X_1 + ... + X_n) = (1/n)[E(X_1) + ... + E(X_n)] = (1/n)(np) = p.
Because the X_i are independent,
Var(P̂) = Var((X_1 + X_2 + ... + X_n)/n) = (1/n^2)Var(X_1 + ... + X_n) = (1/n^2)[Var(X_1) + ... + Var(X_n)] = (1/n^2)(np(1 - p)) = p(1 - p)/n.
12a) Let Y_i be the random variable for whether Prof. Reiter reaches base
safely on the ith opportunity. The probability distribution of each Y_i is
Pr(Y_i = 1) = 0.40 and Pr(Y_i = 0) = 0.60. The total number of times reached
safely in four at bats is Y = Y_1 + Y_2 + Y_3 + Y_4.
Hence, E(Y) = E(Y_1 + Y_2 + Y_3 + Y_4) = .4 + .4 + .4 + .4 = 1.6.
b) We can use the law of iterated expected values to solve this problem.
E(Y) = E(E(Y | X)) = Σ E(Y | X = x) Pr(X = x), summing over x = 3, 4, 5
= (1.2)Pr(X = 3) + (1.6)Pr(X = 4) + (2.0)Pr(X = 5)
= 1.2(.2) + 1.6(.7) + 2.0(.1) = 1.56.
14. Let X̃ be the random variable for the average of the smallest and largest
values in a sample of three.
Pr(X̃ = 2) = 1/10.
Pr(X̃ = 2.5) = 2/10.
Pr(X̃ = 3) = 4/10.
Pr(X̃ = 3.5) = 2/10.
Pr(X̃ = 4) = 1/10.
b) 3/10.
c) E(X̃) = 2(.1) + (2.5)(.2) + (3)(.4) + 3.5(.2) + (4)(.1) = 3. The average of
the smallest and largest values is an unbiased estimator of the population
average, since its expected value equals the average of the box (which is 3).
d) Var(X̃) = 2^2(.1) + (2.5)^2(.2) + (3)^2(.4) + 3.5^2(.2) + (4)^2(.1) - 3^2 = 0.3.
15. a) Let M be the random variable for the maximum.
Pr(M = 3) = 1/10.
Pr(M = 4) = 3/10.
Pr(M = 5) = 6/10.
b) 9/10.
c) E(M) = 3(.1) + 4(.3) + 5(.6) = 4.5. Hence, M is a biased estimator of the
average of the box, since its expected value is larger than that average
(which is 3).
d) Var(M) = 9(.1) + 16(.3) + 25(.6) - 4.5^2 = 0.45.
16. Let X̄ be the sample average of the three draws.
Pr(X̄ = 2) = 1/10.
Pr(X̄ = 7/3) = 1/10.
Pr(X̄ = 8/3) = 2/10.
Pr(X̄ = 3) = 2/10.
Pr(X̄ = 10/3) = 2/10.
Pr(X̄ = 11/3) = 1/10.
Pr(X̄ = 4) = 1/10.
b) 4/10.
c) E(X̄) = 2(.1) + (7/3)(.1) + (8/3)(.2) + 3(.2) + (10/3)(.2) + (11/3)(.1) + 4(.1) = 3. Hence,
the sample average is an unbiased estimator of the population average,
since its expected value equals the average of the box (which is 3).
d) Var(X̄) = 2^2(.1) + (7/3)^2(.1) + (8/3)^2(.2) + 3^2(.2) + (10/3)^2(.2) + (11/3)^2(.1) + 4^2(.1) - 9 = .33333
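All three sampling distributions in problems 14-16 can be reproduced by enumerating the 10 equally likely samples of three draws; the box is assumed here to hold the tickets 1, 2, 3, 4, 5, which is consistent with every probability and expected value listed above.

    # Sketch: enumerate samples of 3 draws from a box assumed to be {1, 2, 3, 4, 5}.
    from itertools import combinations
    from collections import Counter

    box = [1, 2, 3, 4, 5]                          # assumed tickets; box average = 3
    samples = list(combinations(box, 3))           # 10 equally likely samples

    midrange = Counter((min(s) + max(s)) / 2 for s in samples)   # problem 14
    maximum = Counter(max(s) for s in samples)                   # problem 15
    average = Counter(sum(s) / 3 for s in samples)               # problem 16

    for name, dist in [("midrange", midrange), ("maximum", maximum), ("average", average)]:
        ev = sum(value * count / 10 for value, count in dist.items())
        print(name, dict(dist), "E =", ev)         # E = 3, 4.5, and 3, respectively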