Name:____________ Biometry 109 Final 5/12/06

(a.) Show all work to receive full credit.
(b.) Circle your final answer.
(c.) You may use a sheet of notes, probability distribution tables, and calculators.
Explicitly describe any advanced calculator commands used.
(d.) Ask the instructor for clarification if any questions are unclear.
(1. 1pt each) Circle whether the item is a parameter or a statistic.
(1a) statistic parameter: The mean length of 57 netted fish.
(1b) statistic parameter: The standard deviation of yearly Eureka rainfall totals as
calculated by the National Weather Service.
(1c) statistic parameter: The mean weight of mule deer in Colorado.
(1d) statistic parameter: The slope of the regression line estimating the relationship
between weeks of gestation and birth weight.
(1e) statistic parameter: The probability of surviving to age 2 for grizzly bear cubs.
(2. 2pts) The empirical rule states that if the sampled data come from a symmetric, bell-shaped population, then approximately ______________% of the data will fall inside
$\bar{x} \pm 1s$.
(3. 2pts) When is $P(A \text{ and } B) = P(A) \times P(B)$ true? _____________________
(Hint: Not a number.)
(4. 2pts) Morning weather conditions were recorded for 90 days of winter in Eureka.
Suppose 45 mornings were rainy, 15 foggy, and 30 sunny. Create a pie chart for the data
and label the number of degrees allocated to each slice.
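A minimal sketch for checking the slice angles in question 4, assuming each slice receives the same fraction of 360 degrees as its share of the 90 mornings. The variable names are illustrative only.

```python
# Morning weather counts from question 4 (90 winter days in Eureka).
counts = {"rainy": 45, "foggy": 15, "sunny": 30}
total = sum(counts.values())

# Each slice gets the same share of the 360-degree circle as its share of days.
degrees = {label: 360 * n / total for label, n in counts.items()}

for label, angle in degrees.items():
    print(f"{label}: {angle:.0f} degrees")
```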
(5. 3pts) Circle the statement that best describes the key part of the central limit theorem.
a) As the sample size increases, the population becomes distributed more like the
normal distribution.
b) As the sample size increases, the population variance decreases.
c) As the sample size increases, the distribution of the sample becomes distributed
like the normal distribution.
d) As the sample size increases, the distribution of the sample means becomes
distributed more like the normal distribution.
(6. 3pts) Circle which statement best defines “95% confidence interval for the mean”.
a) The 95% confidence interval is an interval that will include 95% of the data.
b) If the sampling process were to be repeated 100 times and the 95% confidence
interval calculated for each sample, about 95 of the 100 sample means would be
inside their respective 95% confidence interval.
c) The 95% confidence interval is a fixed interval calculated so that the population
mean, which is random, will fall within the interval about 95% of the time.
d) If the sampling process were to be repeated 100 times and the 95% confidence
interval calculated for each sample, about 95 of the 100 confidence intervals will
contain the fixed population mean.
(7. 3pts) Circle which statement best defines “p-value”.
a) The probability of getting sampled data that gives a test statistic as extreme or
more extreme than the test statistic calculated from your sample, if the null
hypothesis actually were true.
b) The probability of the null hypothesis being true.
c) The power of a statistical test.
d) The probability of the alternative hypothesis being true.
(8. 2pts) Circle the appropriate choices: Power is the probability of (rejecting, retaining)
the null hypothesis when the ( null, alternative ) hypothesis is true.
(9) Circle whether the following statements are true or false.
(9a. 2pts) True or False: The sample mean is more affected by outliers than is the
median.
(9b. 1pt) True or False: Increased sample sizes will result in greater power of a statistical
test.
(9c. 1pt) True or False: The greater the variance in the two populations, the greater the
statistical power of a two-sample t-test.
(9d. 1pt) True or False: The greater the difference between the two populations means,
the greater the statistical power of a two-sample t-test.
(9e. 2pts) True or False: The chance of a type one error, when the null hypothesis is true,
is determined by the level of significance, $\alpha$, used in the statistical test.
(9f. 1pt) True or False: The median and the 50th percentile are the same.
(9g. 2pts) True or False: You should keep the null hypothesis if the p-value is less than
$\alpha$.
(9h. 1pt) True or False: The ANOVA test essentially compares the variation of the data
within each group to that of the variation between the group means.
(9i. 2pts) True or False: The calculations for hypothesis tests are performed assuming the
alternative hypothesis is true.
(10. 2pts) Let Y be a random variable from a uniform random distribution with a lower
bound of 0 and an upper bound of 25. What is the height of the density curve below?
(The following density curve is not drawn to scale.)
[Figure: uniform density curve over the interval from 0 to 25, with its height marked "?"]
? = __________________
(11) The number of children in a family is distributed according to the following fictitious
probability distribution. The probability distribution for the number of children (k) is
given below with the exception of the probability for 2 children.
k     P(X=k)     cdf: $P(X \le k)$
1     0.3        ???=
2     ????=      ???=
3     0.1        ???=
4+    0.05       ???=
(11a. 2pts) Fill in the table for P(X=2).
(11b. 3pts) Fill in the table for the cumulative distribution function (cdf) column.
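A short sketch of the arithmetic behind question 11, assuming the four rows of the table list every possible outcome so that the probabilities sum to 1. The dictionary key 4 is a stand-in for the "4+" row.

```python
# Known probabilities from the table in question 11; P(X=2) is the unknown.
known = {1: 0.30, 3: 0.10, 4: 0.05}  # key 4 stands for the "4+" row

# Assumption: the table lists every outcome, so the probabilities sum to 1.
p2 = 1 - sum(known.values())

pmf = {1: known[1], 2: p2, 3: known[3], 4: known[4]}

# The cdf at k is the running total of P(X = j) for all j <= k.
cdf = {}
running = 0.0
for k in sorted(pmf):
    running += pmf[k]
    cdf[k] = running

print(f"P(X=2) = {p2:.2f}")
for k, value in cdf.items():
    print(f"P(X <= {k}) = {value:.2f}")
```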
(12) Let X be a normal random variable with mean 50 and standard deviation 5.
(12a. 3pts) Calculate P( 39 < X < 45 )
(12b. 2pts) Calculate the 95th percentile of X.
(12c. 2pts) Suppose n=25 values of X were sampled and the average calculated.
Calculate P ( X  48 ) .
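The normal calculations in question 12 can be checked with a sketch like the one below, which uses Python's standard-library `statistics.NormalDist` in place of a printed table. For the sample average it assumes independent draws, so the standard deviation of the average is 5 / sqrt(25) = 1. It reports the cumulative probability below 48, and the upper tail is its complement.

```python
from math import sqrt
from statistics import NormalDist

# X ~ Normal(mean 50, standard deviation 5), as stated in question 12.
X = NormalDist(mu=50, sigma=5)

# (12a) Probability that X falls between 39 and 45.
p_between = X.cdf(45) - X.cdf(39)

# (12b) 95th percentile of X.
pct95 = X.inv_cdf(0.95)

# (12c) The average of n = 25 independent values has the same mean
# and a standard deviation of sigma / sqrt(n).
n = 25
Xbar = NormalDist(mu=50, sigma=5 / sqrt(n))
p_below_48 = Xbar.cdf(48)

print(f"P(39 < X < 45) = {p_between:.4f}")
print(f"95th percentile = {pct95:.2f}")
print(f"P(sample mean below 48) = {p_below_48:.4f}")
```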
(13. 3pts) Suppose the probability of a seed germinating is 0.9 and you plant 10 seeds.
Assume the number of seeds that germinate is distributed according to the binomial
distribution. By hand, calculate the probability of exactly 7 seeds germinating. Show
your work.
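As a check on the hand calculation in question 13, the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k) can be evaluated directly. This is only a verification sketch and does not replace the written work the question asks for.

```python
from math import comb

n, k, p = 10, 7, 0.9  # seeds planted, seeds germinating, germination probability

# Binomial probability of exactly k successes in n independent trials.
prob = comb(n, k) * p**k * (1 - p) ** (n - k)

print(f"C({n},{k}) = {comb(n, k)}")
print(f"P(X = {k}) = {prob:.4f}")
```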
(14. 3pts) A study found 386 of 543 birds to be carriers of a certain parasite. Calculate
the 95% confidence interval for the population proportion of birds with the parasite.
Show the formula you use.
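One way to sketch the interval in question 14 is the large-sample (Wald) formula, p-hat ± z × sqrt(p-hat (1 - p-hat) / n). That choice is an assumption here: the course may instead use an adjusted estimate, which would give slightly different endpoints.

```python
from math import sqrt

y, n = 386, 543  # carriers and birds sampled, from question 14
z = 1.96         # approximate two-sided 95% critical value

p_hat = y / n
# Assumption: Wald standard error; an adjusted formula would differ slightly.
se = sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - z * se, p_hat + z * se

print(f"p-hat = {p_hat:.4f}, SE = {se:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```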
(15. 2pts) Suppose you want the confidence interval for a sample proportion to be no
wider than $\hat{p} \pm 0.015$. Calculate the minimum sample size.
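A hedged sketch of the sample-size arithmetic for question 15. It assumes a 95% confidence level (z = 1.96) and the conservative guess p = 0.5, neither of which is stated in the question, and it solves z × sqrt(p(1 - p)/n) ≤ 0.015 for n.

```python
from math import ceil

z = 1.96        # assumption: 95% confidence level
margin = 0.015  # desired half-width of the interval
p_guess = 0.5   # assumption: conservative value that maximizes p(1 - p)

# Solve z * sqrt(p(1 - p) / n) <= margin for n, then round up.
n_min = ceil(z**2 * p_guess * (1 - p_guess) / margin**2)

print(f"minimum sample size = {n_min}")
```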
(16. 2pts) Three of the four statements are true. Which one of the following is FALSE?
a) The level of significance, $\alpha$, is the probability of committing a type 1 error if the
null hypothesis is true.
b) The level of significance, $\alpha$, determines the cut-off value when using the p-value
to decide whether or not to reject the null hypothesis.
c) The level of significance, $\alpha$, is the probability that the alternative hypothesis is
true.
d) The choice of the level of significance, $\alpha$, affects the power of a test.
(17. 2pts) The statistic from an ANOVA is compared to the ____________ distribution to
get the p-value.
(18. 3pts) A scientist believes that there is no difference between the placebo, drug X,
drug Y, and drug Z with regard to the blood pressure of the patient. 30 patients were
randomly assigned to one of the four treatments. After being on the drug or placebo for
one month, the blood pressures were measured and the difference between the before and
after blood pressures calculated. Which statistical test seems best for analyzing this
dataset?
a) Chi-square test for independence
b) 2-sample t-test
c) ANOVA
d) Chi-square goodness-of-fit test
e) Simple linear regression
(19) A physiologist is interested in whether or not caffeine increases blood pressure. The
physiologist had 60 volunteers in his study. Upon awakening in the morning, 30
randomly selected volunteers were given a large cup of regular coffee and the other 30
were given a large cup of decaffeinated coffee. Blood pressure (mm Hg) was measured 30
minutes later for each volunteer.
(19a. 3pts) Which statistical methodology seems best for analyzing this dataset?
a) Chi-square test for independence
b) 2-sample t-test
c) ANOVA
d) Chi-square goodness-of-fit test
e) Simple linear regression
(19b. 2pts) How would you alter the study to make it such that a paired t-test would be
the best way to analyze the data?
(19c. 2pts) What would be an advantage of a paired study design over the original study
design?
(20. 2pts) A biologist is interested in whether or not the prevalence of a certain parasite
among deer differs by sex. 100 does were inspected, of which 34 had the parasite. 100
bucks were inspected, of which 42 had the parasite. Which statistical test would be most
reasonable to investigate this dataset?
a) Chi-square test for independence
b) 2-sample t-test
c) ANOVA
d) Chi-square goodness-of-fit test
e) Simple linear regression
(21) Body measurement data were collected on indigenous Peruvian men. The
measurements included height and weight. The simple linear regression below
examines whether height (mm) can be used to predict weight (kg). Below is a scatter
plot, the regression line, and the output from the analysis.
[Figure: Fitted Line Plot, Weight = - 32.63 + 0.06067 Height. Scatter plot of Weight (kg, axis from 50 to 90) against Height (mm, axis from 1450 to 1650) with the fitted regression line. S = 6.42340, R-Sq = 20.3%, R-Sq(adj) = 18.1%]

The regression equation is
Weight = - 32.6 + 0.0607 Height

Predictor     Coef  SE Coef      T      P
Constant    -32.63    31.24  -1.04  0.303
Height     0.06067  0.01977   3.07  0.004

S = 6.42340   R-Sq = 20.3%   R-Sq(adj) = 18.1%
(21a. 2pts) For each mm increase in height, how many kg should you expect the average
weight to increase?
(21b. 2pts) For a man with a height of 1542mm, how many kg is the expected weight
according to the regression model?
(21c. 2pts) A man who was 1542mm tall weighed 87.0 kg.
(i) Circle the dot that represents that man and draw a line for the residual of that man.
(ii) Calculate the value of the residual for that man's data point.
residual = _________________
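A small sketch of the prediction and residual arithmetic in (21b) and (21c), using the coefficients from the regression output and taking the residual as observed minus predicted weight. The variable names are illustrative.

```python
# Coefficients from the regression output: Weight = b0 + b1 * Height.
b0, b1 = -32.63, 0.06067

height_mm = 1542    # height of the man in (21b) and (21c)
observed_kg = 87.0  # his observed weight in (21c)

# Predicted weight from the fitted line.
predicted_kg = b0 + b1 * height_mm

# Residual = observed value minus predicted value.
residual_kg = observed_kg - predicted_kg

print(f"predicted weight = {predicted_kg:.2f} kg")
print(f"residual = {residual_kg:.2f} kg")
```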
(21d. 2pts) Is the slope statistically significantly different from 0? Specifically explain
how you reached your conclusion. Use $\alpha = 0.05$.
(21e. 2pts) Which method best describes how the line’s slope (b1) and intercept (b0) were
determined for the simple linear regression?
a) b0 and b1 were selected so that the line runs over as many points as possible.
b) b0 and b1 were selected so that the line minimizes the sum of the residuals.
c) b0 and b1 were selected so that the line minimizes the sum of the squared
residuals.
d) b0 and b1 were selected so that the line maximizes the sum of the residuals.
(22. 2pts) The graph below shows a scatter plot of age versus height for the Peruvian
dataset.
[Figure: Scatterplot of Age vs Height. Vertical axis: Age, from 20 to 55. Horizontal axis: Height, from 1450 to 1650.]
Which number is closest to the sample correlation between Age and Height?
(i) -35 (ii) -1 (iii) -0.9 (iv) 0 (v) +0.9 (vi) +1 (vii) +35
(23) A two-sample t-test was performed to test if there is a difference between the mean
tibia length of male cicadas and that of females. For the following, use df=8.
Group     n   mean    sd
males     5   78.42   2.87
females   6   80.44   3.52
(23a. 2pts) Calculate $SE(\bar{y}_1 - \bar{y}_2)$.
(23b. 2pts) Calculate the 95% confidence interval for $\mu_1 - \mu_2$.
(23c. 2pts) Calculate the appropriate test statistic.
(23d. 2pts) Using your t-table, bracket the p-value.
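The calculations in (23a) through (23c) can be sketched as follows. Two assumptions are made: the standard error uses the unpooled form sqrt(s1²/n1 + s2²/n2), and the critical value 2.306 is the standard t-table entry for df = 8 with 0.025 in the upper tail. Males are taken as group 1.

```python
from math import sqrt

# Summary statistics from the table in question 23.
n1, mean1, sd1 = 5, 78.42, 2.87  # males
n2, mean2, sd2 = 6, 80.44, 3.52  # females

# Assumption: unpooled standard error of the difference in sample means.
se_diff = sqrt(sd1**2 / n1 + sd2**2 / n2)

diff = mean1 - mean2

# Assumption: critical value read from a t-table (df = 8, upper tail 0.025).
t_crit = 2.306
ci_lower = diff - t_crit * se_diff
ci_upper = diff + t_crit * se_diff

# Test statistic for H0: mu1 - mu2 = 0.
t_stat = diff / se_diff

print(f"SE = {se_diff:.3f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"t = {t_stat:.3f}")
```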
(24) The contingency table below consists of data describing the number of cycles
between stopping birth control and a planned pregnancy. Women are categorized as
smokers (1st row) and non-smokers (2nd row).
Chi-Square Test: first cycle, 2+ cycles

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

        first cycle  2+ cycles  Total
    1            29         71    100
              XXXXX      XXXXX
              2.448      1.548

    2           198        288    486
              XXXXX      YYYYY
              0.504      0.318

Total           227        359    586

Chi-Sq = ZZZZZ, DF = 1, P-Value = 0.028
(24a. 2pts) Assuming independence between smoking and number of cycles, how many
women were expected to fall in the category of non-smoker & 2+ cycles (YYYYY)?
YYYYY=____________________
(24b. 2pts) What is the chi-square statistic value (ZZZZZ) for this analysis?
ZZZZZ = ______________________
(24c. 3pts) Which conclusion is most appropriate from the above analysis?
a) There is a statistically significant dependence between smoking and the number of
cycles until pregnancy (P=0.028).
b) There is not a statistically significant dependence between smoking and the
number of cycles until pregnancy (P=0.028).
c) There is a statistically significant difference between the average number of
smokers and non-smokers (P=0.028).
d) There is not a statistically significant difference between the average number of
smokers and non-smokers (P=0.028).
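The expected counts and chi-square statistic for the table in question 24 can be reproduced with a sketch like this one. Each expected count is (row total × column total) / grand total, and each cell contributes (observed - expected)² / expected. The p-value line relies on the identity that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x / 2)). Because the printed contributions are rounded to three decimals, their sum can differ from the computed statistic in the last digit.

```python
from math import erfc, sqrt

# Observed counts: rows are smokers, non-smokers; columns are first cycle, 2+ cycles.
observed = [[29, 71], [198, 288]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper-tail probability for a chi-square variable with 1 degree of freedom.
p_value = erfc(sqrt(chi_sq / 2))

print(f"expected non-smoker & 2+ cycles = {expected[1][1]:.2f}")
print(f"chi-square = {chi_sq:.3f}, p-value = {p_value:.3f}")
```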