Stat 101 L – Final Exam
May 5, 2008
Name: ______________________

INSTRUCTIONS: Read the questions carefully and completely. Answer each question and show work in the space provided. Partial credit will not be given if work is not shown. When asked to explain, describe, or comment, do so within the context of the problem. Be sure to include units when dealing with quantitative variables.

1. [15 pts] Short answer.

a) [2] Statistics is about _______________. (Fill in the blank with one word.)

b) [2] A ____________________ is a numerical summary of a sample, while a _____________________ is a numerical summary of a population.

c) [3] What does “90% confidence” mean? In your explanation you cannot use the words chance, sure, probability or confident.

d) [3] Explain why we use t instead of z when constructing a confidence interval for the population mean.

e) [2] Holding all other things the same, if the sample size is decreased the width of the confidence interval will _______________________.

f) [3] Sketch a normal model with μ = 80 and σ = 20.

2. [12 pts] Multiple Choice:

a) ___ The correct interpretation of a 95% confidence interval is?
   A: I am 95% confident that the sample mean is in the interval.
   B: 95% of the sample values are in the interval.
   C: 95% of the population values are in the interval.
   D: I am 95% confident that the population mean is in the interval.

b) ___ The P-value is …?
   A: The probability of getting a value of the test statistic more extreme than the one observed when the null hypothesis is false.
   B: The probability that the null hypothesis is true.
   C: The probability of getting a value of the test statistic more extreme than the one observed when the null hypothesis is true.
   D: The probability that the null hypothesis is false.

c) ___ You have calculated the correlation coefficient between two variables to be -0.95. This would indicate?
   A: A strong positive linear relationship.
   B: A strong negative linear relationship.
   C: No relationship.
   D: No linear relationship.

d) ___ The difference between an observed value of a response variable and the corresponding predicted value of the response is called?
   A: An outlier.
   B: A statistic.
   C: An influential point.
   D: A residual.

e) ___ When you get one measurement on each individual in a sample of males and one measurement on each individual in a sample of females you have
   A: paired sample data.
   B: two independent sample data.
   C: population data.
   D: cluster sample data.

f) ___ Which is not a fundamental principle of experimental design?
   A: Control.
   B: Randomization.
   C: Replication.
   D: They are all fundamental principles.

3. [10 pts] Below is a screen shot of the web page that simulates sampling from a population of Reese’s Pieces that contains 36% orange pieces.

a) [2] If we randomly select samples of 100 Reese’s Pieces what will be the mean of the distribution of the sample proportion of orange Reese’s Pieces, p̂?

b) [4] If we randomly select samples of 100 Reese’s Pieces what will be the standard deviation of the distribution of the sample proportion of orange Reese’s Pieces?

c) [4] Sketch the distribution of the sample proportion of orange Reese’s Pieces, p̂, for random samples of size 100. Use the space to the right of the Reese’s Pieces machine above.

4. [22 pts] Suppose you received a diamond as a gift. You know the size of the diamond (50 milligrams) but you don’t know the price ($). A random sample of 25 diamonds (weighing from 25 to 100 mg) is obtained and you find the corresponding prices.

a) [4] Answer the questions Who? and What? for this problem.

b) [4] Below is a plot of price versus weight.

   [Scatterplot: Price ($), axis from 100 to 800, versus Weight (mg), axis from 20 to 80]

   Describe the relationship in terms of direction, form, strength, and indicate any unusual points.

Below is partial JMP output for the least squares regression line.

Linear Fit
   Predicted Price ($) = -172.04 + 12.273*Weight (mg)

Summary of Fit
   RSquare                       0.975972
   RSquare Adj                   0.974928
   Root Mean Square Error        23.77098
   Mean of Response              330.64
   Observations (or Sum Wgts)    25

c) [5] Give an interpretation of the slope within the context of the problem.

d) [2] Use the least squares regression line to predict the price of a diamond weighing 50 mg.

e) [2] One of the 25 diamonds in the sample weighs 50 mg and has a price of $489. What is the residual for this diamond?

f) [3] Graph the least squares regression line on the plot in b). In order to get full credit it must be obvious to me that you are using the equation to draw the line.

g) [2] How much of the variability in the price of diamonds can be explained by the linear relationship with weight?

5. [16 pts] A major medical center in the Northeastern U.S. conducted a study looking at blood cholesterol levels and incidence of heart attack. Below is JMP output on cholesterol for 16 randomly selected people who had had a heart attack.

   [Figure: JMP distribution plots of Blood Cholesterol (axis from 150 to 350), including a Normal Quantile Plot]

Moments
   Mean              264.6875
   Std Dev           42.115268
   Std Err Mean      10.528817
   upper 95% Mean    287.12914
   lower 95% Mean    242.24586
   N                 16

Test Mean = value
   Hypothesized Value    240
   Actual Estimate       264.688
   df                    15
   Std Dev               42.1153

   t Test
   Test Statistic        2.3448
   Prob > |t|            0.0332
   Prob > t              0.0166
   Prob < t              0.9834

We wish to see if the mean cholesterol level of all people who had a heart attack is greater than 240.

a) [3] Set up appropriate null and alternative hypotheses. Be sure to clearly define the population parameter you are testing.

b) [4] Verify the nearly normal condition is met. Be sure to refer to all three plots.

c) [3] What are the values of the test statistic and the P-value?

d) [2] Reach a decision using the P-value.

e) [4] State a conclusion in the context of the problem.

6. [12 pts] For each of the following indicate whether the study is an experiment or an observational study. Also indicate the explanatory and response variables and whether it is paired sample or independent sample data. Explain each choice briefly.

a) [6] In a study of dietary calcium on blood pressure, 30 participants experience two different diets each for one month. One diet is low in calcium and the other diet is high in calcium. Which diet the participant experiences first is determined by a flip of a coin. After each diet the blood pressure of each participant is measured. The researchers want to see if average blood pressure differs with diet.

b) [6] In a study on exercise and diet, 30 participants are asked to keep a diary of the food they eat and the exercise they do. After 30 days the diaries are examined and the participants are classified into one of two groups, the healthy lifestyle group and the unhealthy lifestyle group. The body mass index (BMI) of each participant is calculated. The researchers wish to see if average BMI is different for the lifestyle groups.

7. [15 pts] A study was done in Michigan with students in grades 4 – 6. The students were asked the following question: What would you most like to do at school? The choices were Make good grades, Be good at sports, or Be popular. Students came from Rural, Suburban and Urban schools. Below are the data.

                        Rural   Suburban   Urban   Total
   Make good grades.      57       87       103     247
   Be good at sports.     50       42        49     141
   Be popular.            42       22        26      90
   Total                 149      151       178     478

a) [2] What is the probability that a student selected at random is from an Urban school?

b) [2] What is the probability that a student selected at random wants to be good at sports?

c) [3] If location of school (rural, urban, suburban) is independent of what students most like to do at school, what is the expected count for the cell, urban and be good at sports?

d) [3] What is the contribution to the χ² test statistic for the cell, urban and be good at sports?

e) [1] How many degrees of freedom are there for the test of independence between location and what students most like to do at school?

f) [4] The value of the χ² test statistic is 18.828 with an associated P-value = 0.0008. What does this tell you about school location and what students most like to do at school? Be sure to justify your answer statistically.

8. [23 pts] A researcher is interested in seeing if the rain in a major city in Illinois is more acidic (has a lower pH) than rain in a major city in Texas. The researcher measures the pH of rain on 20 randomly selected days in the city in Texas and on a separate set of 16 randomly selected days in the city in Illinois. Below is JMP output for the data.

   [Figure: JMP distribution plots with Count axis and Normal Quantile Plot for pH of Rain, one panel for (1) City in Illinois and one for (2) City in Texas]

                          Mean     Std Dev    N
   (1) City in Illinois   4.495    0.1432     16
   (2) City in Texas      4.880    0.3843     20

a) [4] Describe the distribution of the pH of rain in the city in Illinois. Be sure to comment on center, spread, and shape.

b) [4] Describe what you see in the Normal Quantile Plot for the pH of rain in the city in Illinois. What does this indicate about the nearly normal condition?

We wish to test a hypothesis to see if the mean pH of rain in the two cities is the same against an alternative that the mean pH of rain in the city in Illinois is less than that in the city in Texas. Note: df = 25 for this problem.

c) [2] Set up the null and alternative hypotheses.

d) [4] Calculate the value of the test statistic.

e) [3] Use Table t to find the P-value.

f) [2] Use the P-value to reach a decision.

g) [4] State a conclusion within the context of the problem.

The Final Exam is worth 125 points.
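How many points do you think you got? _______

I will probably not finish grading Final Exams until the end of finals week. Course grades will be available on Access Plus approximately one week after final exams. I keep the final exam papers for one semester. If you wish to pick up your final exam you can do so during final exam week fall 2008.

Formulas

Summary statistics, correlation, and regression:
   $\bar{y} = \dfrac{\sum y}{n}$
   $\bar{x} = \dfrac{\sum x}{n}$
   $s_y = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n - 1}}$
   $s_x = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
   $z_x = \dfrac{x - \bar{x}}{s_x}$
   $z_y = \dfrac{y - \bar{y}}{s_y}$
   $r = \dfrac{\sum z_x z_y}{n - 1}$
   $b_1 = r \dfrac{s_y}{s_x}$
   $b_0 = \bar{y} - b_1 \bar{x}$
   $\hat{y} = b_0 + b_1 x$
   residual $= y - \hat{y}$

Sampling Distribution of $\hat{p}$:
   Mean: $p$
   Standard Deviation: $SD(\hat{p}) = \sqrt{\dfrac{p(1 - p)}{n}}$

Single sample (Categorical Variable)
   Confidence interval for $p$: $\hat{p} - z^* \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$ to $\hat{p} + z^* \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
   Test Statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$

Sampling Distribution of $\bar{y}$:
   Mean: $\mu$
   Standard Error: $SE(\bar{y}) = \dfrac{s}{\sqrt{n}}$

Single sample (Numerical Variable)
   Confidence interval for $\mu$: $\bar{y} - t^* \dfrac{s}{\sqrt{n}}$ to $\bar{y} + t^* \dfrac{s}{\sqrt{n}}$
   Test Statistic: $t = \dfrac{\bar{y} - \mu_0}{s / \sqrt{n}}$, df $= n - 1$

Two independent samples (Numerical Response, Categorical Explanatory)
   Confidence interval for $\mu_1 - \mu_2$: $(\bar{y}_1 - \bar{y}_2) - t^* \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ to $(\bar{y}_1 - \bar{y}_2) + t^* \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
   Test Statistic: $t = \dfrac{(\bar{y}_1 - \bar{y}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

Paired samples
   Confidence interval for $\mu_d$: $\bar{d} - t^* \dfrac{s_d}{\sqrt{n_d}}$ to $\bar{d} + t^* \dfrac{s_d}{\sqrt{n_d}}$
   Test Statistic: $t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n_d}}$, df $= n_d - 1$

Test of Independence (Categorical Response, Categorical Explanatory)
   Expected $= \dfrac{(\text{row total})(\text{column total})}{\text{total in sample}}$
   $\chi^2 = \sum \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
   df $= (r - 1)(c - 1)$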
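
For readers who want to check hand calculations against the formula sheet, a few of the formulas can be sketched in Python. This is a minimal sketch and not part of the exam: the function names and all numbers in the usage section are hypothetical and chosen only for illustration, not taken from any exam question.

```python
import math


def sd_phat(p, n):
    """SD(p-hat) = sqrt(p(1 - p) / n), from the formula sheet."""
    return math.sqrt(p * (1 - p) / n)


def one_sample_t(ybar, mu0, s, n):
    """t = (ybar - mu0) / (s / sqrt(n)); returns (t, df) with df = n - 1."""
    t = (ybar - mu0) / (s / math.sqrt(n))
    return t, n - 1


def two_sample_t(ybar1, s1, n1, ybar2, s2, n2):
    """t = ((ybar1 - ybar2) - 0) / sqrt(s1^2/n1 + s2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((ybar1 - ybar2) - 0) / se


def expected_count(row_total, col_total, total):
    """Expected = (row total)(column total) / (total in sample)."""
    return row_total * col_total / total


def chi_sq_contribution(observed, expected):
    """One cell's contribution: (Observed - Expected)^2 / Expected."""
    return (observed - expected) ** 2 / expected


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(sd_phat(0.5, 100))
    print(one_sample_t(52, 50, 4, 16))
    print(two_sample_t(10, 3, 9, 8, 4, 16))
    e = expected_count(40, 50, 200)
    print(e, chi_sq_contribution(14, e))
```

The two-sample function returns only the test statistic; the degrees of freedom for that test are not given by a formula on the sheet, so they are left out of the sketch.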