Name:____________ Biometry 109 Final 5/12/06 (a.) Show all work to receive full credit. (b.) Circle your final answer. (c.) You may use a sheet of notes, probability distribution tables, and calculators. Explicitly describe any advanced calculator commands used. (d.) Ask the instructor for clarification if any questions are unclear. (1. 1pt each)Circle whether the item is a parameter or a statistic. (1a) statistic parameter: The mean length of 57 netted fish. (1b) statistic parameter: The standard deviation of yearly Eureka rainfall totals as calculated by the National Weather Service. (1c) statistic parameter: The mean weight of mule deer in Colorado. (1d) statistic parameter: The slope of the regression line estimating the relationship between weeks of gestation and birth weight. (1e) statistic parameter: The probability of surviving to age 2 for grizzly bear cubs. (2. 2pts) The empirical rule states that if the sampled data come from a symmetric bellshaped population then approximately ______________% of the data will fall inside x 1s (3. 2pts) When is P(A and B) = P(A) P(B) true? _____________________ (Hint: Not a number.) (4. 2pts) Morning weather conditions were recorded for 90 days of winter in Eureka. Suppose 45 mornings were rainy, 15 foggy, and 30 sunny. Create a pie chart for the data and label the number of degrees allocated to each slice. (5. 3pts) Circle which one best describes the key part of the central limit theorem? a) As the sample size increases, the population becomes distributed more like the normal distribution. b) As the sample size increases, the population variance decreases. c) As the sample size increases, the distribution of the sample becomes distributed like the normal distribution. d) As the sample size increases, the distribution of the sample means become distributed more like the normal distribution. 1 (6. 3pts) Circle which statement best defines “95% confidence interval for the mean”. a) The 95% confidence interval is an interval that will include 95% of the data. b) If the sampling process was to be repeated 100 times and the 95% confidence interval calculated for each sample, about 95 of the 100 sample means would be inside their respective 95% confidence interval. c) The 95% confidence interval is a fixed interval calculated so that the population mean, which is random, will fall within the interval about 95% of the time. d) If the sampling process was to be repeated 100 times and the 95% confidence interval calculated for each sample, about 95 of the 100 confidence intervals will contain the fixed population mean. (7. 3pts) Circle which statement best defines “p-value”. a) The probability of getting sampled data that gives a test statistic as extreme or more extreme than the test statistic calculated from your sample, if the null hypothesis actually were true. b) The probability of null hypothesis being true. c) The power of a statistical test. d) The probability of the alternative hypothesis being true. (8. 2pts) Circle the appropriate choices: Power is the probability of (rejecting, retaining) the null hypothesis when the ( null, alternative ) hypothesis is true. (9) Circle whether the following statements are true or false. (9a. 2pts) True or False: The sample mean is more affected by outliers than is the median. (9b. 1pt) True or False: Increased sample sizes will result in greater power of a statistical test. (9c. 1pt) True or False: The greater the variance in the two populations, the greater the statistical power of a two-sample t-test. (9d. 1pt) True or False: The greater the difference between the two populations means, the greater the statistical power of a two-sample t-test. (9e. 2pts) True or False: The chance of a type one error, when the null hypothesis is true, is determined by the level of significance, , used in the statistical test. (9f. 1pt) True or False: The median and the 50th percentile are the same. (9g. 2pts) True or False: You should keep the null hypothesis if the p-value is less than . (9h. 1pt) True or False: The ANOVA test essentially compares the variation of the data within each group to that of the variation between the group means. (9i. 2pts)True or False: The calculations for hypothesis tests are performed assuming the alternative hypothesis is true. (10. 2pts) Let Y be a random variable from a uniform random distribution with a lower bound of 0 and an upper bound of 25. What is the height of the density curve below? (The following density curve is not drawn to scale.) ? 0 25 ?=__________________ 2 (11) The number of children in a family is distributed according to the following fictitious probability distribution. The probability distribution for the number of children (k) is given below with the exception of the probability for 2 children. k P(X=k) cdf: P( X k ) 1 0.3 ???= 2 ????= ???= 3 0.1 ???= 4+ 0.05 ???= (11a. 2pts) Fill in the table for P(X=2). (11b. 3pts) Fill in the table for the cumulative distribution function (cdf) column. (12) Let X be a normal random variable with mean 50 and standard deviation 5. (12a. 3pts) Calculate P( 39 < X < 45 ) (12b. 2pts) Calculate the 95th percentile of X. (12c. 2pts) Suppose n=25 values of X were sampled and the average calculated. Calculate P ( X 48 ) . (13. 3pts) Suppose the probability of a seed germinating is 0.9 and you plant 10 seeds. Assume the number of seeds that germinate is distributed according to the binomial distribution. By hand, calculate the probability of exactly 7 seeds germinating. Show your work. 3 (14. 3pts) A study found 386 of 543 birds to be carriers of a certain parasite. Calculate the 95% confidence interval for the population proportion of birds with the parasite. Show the formula you use. (15. 2pts) Suppose you want the confidence interval for a sample proportion to be no wider than pˆ 0.015 . Calculate the minimum sample size. (16. 2pts) Three of the four statements are true. Which one of the following is FALSE? a) The level of significance, , is the probability of committing a type 1 error if the null hypothesis is true. b) The level of significance, , determines the cut-off value when using the p-value to decide whether or not to reject the null hypothesis c) The level of significance, , is the probability that the alternative hypothesis is true. d) The choice of the level of significance, , affects the power of a test. (17. 2pts) The statistic from an ANOVA is compared to the ____________ distribution to get the p-value. (18. 3pts) A scientist believes that there is no difference between the placebo, drug X, drug Y, and drug Z with regards to blood pressure of the patient. 30 patients were randomly assigned to one of the four treatments. After being on the drug or placebo for one month, the blood pressures were measured and the difference between the before and after blood pressures calculated. Which statistical test seems best for analyzing this dataset? a) Chi-square test for independence b) 2-sample t-test c) ANOVA d) Chi-square goodness-of-fit test e) Simple linear regression 4 (19) A physiologist is interested in whether or not caffeine increases blood pressure. The physiologist had 60 volunteers in his study. Upon awakening in the morning, 30 randomly selected volunteers were given a large cup of regular coffee and the other 30 provided a large cup of decaffeinated coffee. Blood pressure (mm Hg) was measured 30 minutes later for each volunteer. (19a. 3pts) Which statistical methodology seems best for analyzing this dataset? a) Chi-square test for independence b) 2-sample t-test c) ANOVA d) Chi-square goodness-of-fit test e) Simple linear regression (19b. 2pts) How would you alter the study to make it such that a paired t-test would be the best way to analyze the data? (19c. 2pts) What would be an advantage of a paired study design over the original study design? (20. 2pts) A biologist is interested in whether or not the prevalence of a certain parasite among deer differs by sex. 100 does were inspected, of which 34 had the parasite. 100 bucks were inspected, of which 42 had the parasite. Which statistical test would be most reasonable to investigate this dataset? a) Chi-square test for independence b) 2-sample t-test c) ANOVA d) Chi-square goodness-of-fit test e) Simple linear regression 5 (21) Body measurement data were collected on indigenous Peruvian men. The measurements included height and weight. The below simple linear regression examines whether height (mm) can be used to predict weight (kg). Below is a scatter plot, the regression line, and the output from the analysis. Fitted Line Plot Weight = - 32.63 + 0.06067 Height 90 S R-Sq R-Sq(adj) 6.42340 20.3% 18.1% Weight 80 70 60 50 1450 1500 1550 Height 1600 1650 The regression equation is Weight = - 32.6 + 0.0607 Height Predictor Constant Height S = 6.42340 Coef -32.63 0.06067 SE Coef 31.24 0.01977 R-Sq = 20.3% T -1.04 3.07 P 0.303 0.004 R-Sq(adj) = 18.1% (21a. 2pts) For each mm increase in height, how many kg should you expect the average weight to increase? (21b. 2pts) For a man with a height of 1542mm, how many kg is the expected weight according to the regression model? (21c. 2pts) A man who was 1542mm tall weighed 87.0 kg. (i) Circle the dot that represents that man and draw a line for the residual of that man. (ii) Calculate the value of the residual for that man's data point. residual = _________________ (21d. 2pts) Is the slope statistically significantly different from 0? Specifically explain how you reached your conclusion. Use =0.05. (21e. 2pts) Which method best describes how the line’s slope (b1) and intercept (b0) were determined for the simple linear regression? a) b0 and b1 were selected so that the line runs over as many points as possible. b) b0 and b1 were selected so that the line minimizes the sum of the residuals. c) b0 and b1 were selected so that the line minimizes the sum of the squared residuals. d) b0 and b1 were selected so that the line maximizes the sum of the residuals. 6 (22. 2pts) The below graph shows a scatter plot of age versus height for the Peruvian dataset. Scatterplot of Age vs Height 55 50 45 Age 40 35 30 25 20 1450 1500 1550 Height 1600 1650 Which number is the closest to the sample correlation between Age and height? (i) -35 (ii) -1 (iii) -0.9 (iv) 0 (v) +0.9 (vi) +1 (vii) +35 (23) A two-sample t-test was performed to test if there is a difference between the mean length of male cicada tibia lengths and that of females. For the following, use df=8. Group n mean sd males 5 78.42 2.87 female 6 80.44 3.52 s (23a. 2pts) Calculate SE( y1 y2 ) (23b. 2pts) Calculate the 95% confidence interval for 1 2 . (23c. 2pts) Calculate the appropriate test statistic. (23d. 2pts) Using your t-table, bracket the p-value. 7 (24) The below contingency table consists of data describing the number of cycles between stopping birth control and a planned pregnancy. Women are categorized by smokers (1st row) and non-smokers (2nd row). Chi-Square Test: first cycle, 2+ cycles Expected counts are printed below observed counts Chi-Square contributions are printed below expected counts first cycle 29 XXXXX 2.448 2+ cycles 71 XXXXX 1.548 Total 100 2 198 XXXXX 0.504 288 YYYYY 0.318 486 Total 227 359 586 1 Chi-Sq = ZZZZZ, DF = 1, P-Value = 0.028 (24a. 2pts) Assuming independence between smoking and number of cycles, how many women were expected to fall in the category of non-smoker & 2+ cycles (YYYYY)? YYYYY=____________________ (24b. 2pts) What is the chi-square statistic value (ZZZZZ) for this analysis? ZZZZZ = ______________________ (24c. 3pts) Which conclusion is most appropriate from the above analysis? a) There is a statistically significant dependence between smoking and the number of cycles until pregnancy (P=0.028). b) There is not a statistically significant dependence between smoking and the number of cycles until pregnancy (P=0.028). c) There is a statistically significant difference between the average number of smokers and non-smokers (P=0.028). d) There is not a statistically significant difference between the average number of smokers and non-smokers (P=0.028). 8