Practice Final Examination (With Answers)
Statistics 515, Spring Semester 2002
E. A. Pena's Class

Part I: (20 points) Basic Concepts: Explain briefly what each of the following terms/phrases means, or what its importance is.

1. Which statistical hypothesis typically corresponds to the "research hypothesis"?
Answer: The alternative hypothesis. The null hypothesis usually coincides with the status quo.

2. In statistical hypothesis testing, which type of error is considered to be more serious?
Answer: The type I error, which is committed when the null hypothesis is rejected although in reality it is correct, is the more serious type of error. This is the reason why we set an upper limit (the level of significance) on the probability of committing this error.

3. Science Magazine reported that the mean listening time of 7-month-old infants exposed to a three-syllable sentence (e.g., "ga ti ti") is 9 seconds. Set up the null and alternative hypotheses for testing the claim.
Answer: NULL: Mean listening time is 9 seconds. ALTERNATIVE: Mean listening time is not equal to 9 seconds.

4. What is the level of significance of a test?
Answer: It is the (maximum) probability of committing a type I error.

5. How is the p-value used in making decisions in hypothesis testing?
Answer: It is the probability, under the null hypothesis, of observing the observed value of the test statistic or more extreme values. Equivalently, it is the lowest significance level at which the null hypothesis will be rejected for the data at hand. The null hypothesis is therefore rejected whenever the p-value is at most the chosen level of significance.

6. How are the probabilities of a Type I and a Type II error related for a fixed sample size?
Answer: They are inversely related. When one is decreased, the other increases. (Such is life … no free lunch).

7. Why is it that we could not "accept" the null hypothesis, but instead simply conclude that we "fail to reject the null hypothesis"?
Answer: It is because we did not control the probability of committing a type II error, which is committed when we do not reject the null hypothesis although in reality it is false. In contrast, since we specify the level of significance, we have controlled the probability of committing a type I error.

8. What does the regression coefficient in the simple linear regression model represent?
Answer: It represents the change in the mean value of the response variable (Y) per unit change in the predictor variable (X).

9. In simple linear regression, what is the idea behind the least-squares principle for obtaining the coefficients in the regression model?
Answer: The idea is to minimize the distance between the observed Y-values and the predicted Y-values. The distance is measured by the sum of the squared residuals, where a residual is the difference between the observed Y-value and the predicted value.

10. In simple linear regression, as well as in a one-way analysis of variance, which quantity serves as an estimator of the common variance?
Answer: The mean square error (MSE) is the estimator of the variance.

Part II: Problem Solving and Interpretations.

1. (20 points) Environmental Science and Technology reported on a study of contaminated soil in The Netherlands. Seventy-two 400-gram soil specimens were sampled, dried, and analyzed for the contaminant cyanide. The cyanide concentration [in milligrams per kilogram (mg/kg) of soil] of each soil specimen was determined using an infrared microscopic method. The sample resulted in a mean cyanide level of 84 mg/kg and a standard deviation of S = 80 mg/kg. Perform a test of the null hypothesis that the true mean cyanide level in The Netherlands exceeds 100 mg/kg. Use a level of significance of 0.05.

a) State the hypotheses.
H0 (Null): Mean cyanide level >= 100 mg/kg.
H1 (Alternative): Mean cyanide level < 100 mg/kg.

b) State your decision rule.
Reject H0 if Z = (XBAR - 100)/(S/SQRT(n)) < -1.645.

c) Compute your test-statistic.
Z = (84 - 100)/(80/SQRT(72)) = -1.70.

d) State your decision.
Since -1.70 < -1.645, we reject the null hypothesis.

e) State your conclusion with regards to the practical problem considered.
At the 5% level of significance, we conclude that the mean cyanide level is less than 100 mg/kg.

2. (20 points) The Cleveland Casting Plant is a large, highly automated producer of gray and nodular iron automotive castings for Ford Motor Company. One process variable of interest to Cleveland Casting is the pouring temperature of the molten iron. The pouring temperatures (in degrees Fahrenheit) for a random sample of ten crankshafts produced at Cleveland Casting are listed below. The target setting for the pouring temperature is 2,550 degrees. Assuming the process is stable, conduct a test to determine whether the true mean pouring temperature differs from the target setting.

2543 2541 2544 2620 2560 2559 2562 2553 2552 2553

For this data set, the sample mean equals 2558.7 and the sample standard deviation is 22.7452.

a) State the hypotheses.
H0 (Null): Mean pouring temperature is 2550 degrees.
H1 (Alternative): Mean pouring temperature is different from 2550 degrees.

b) State your decision rule.
Reject H0 if |T| > t(9; .025) = 2.262, where T = (XBAR - 2550)/(S/SQRT(n)).

c) Compute your test-statistic.
T = (2558.7 - 2550)/(22.7452/SQRT(10)) = 1.21.

d) State your decision.
Since 1.21 < 2.262, we fail to reject the null hypothesis.

e) State your conclusion with regards to the practical problem considered.
Based on the data, and at the 5% level of significance, we cannot conclude that the mean pouring temperature is different from 2550, so we cannot conclude that the process is out of order.

3. (20 points) Marine biochemists at the University of Tokyo studied the properties of crustacean striated muscles (The Journal of Experimental Zoology). The main purpose of the experiment was to compare the biochemical properties of fast and slow muscles of crayfish.
Using crayfish obtained from a local supplier, the researchers excised twelve fast-muscle fiber bundles and tested each fiber bundle for uptake of calcium. Twelve slow-muscle fiber bundles were excised from a second sample of crayfish, and calcium uptake was measured. A summary of the sample statistics associated with the calcium uptake (in moles per milligram) for these two groups is provided below.

Descriptive Statistics
Group         n    Sample Mean   Sample Standard Deviation
Fast Muscle   12   .57           .104
Slow Muscle   12   .37           .035

Based on this information, compare the population means of the calcium uptake for the fast- and slow-muscle groups. In particular, test the null hypothesis that the two means are identical. In performing your test you may assume that the population distribution of the calcium uptakes for each group is normal, and that the two populations have equal variances. Also, use a 5% level of significance. Again you may answer this question by following the steps below.

a) State the hypotheses.
NULL: Mean calcium uptakes for the fast- and slow-muscle groups are identical.
ALT: Mean calcium uptakes for the fast- and slow-muscle groups are different.

b) State your decision rule.
Decision Rule: Reject the null hypothesis if |T| > t(22; .025) = 2.074, where
T = (XBAR1 - XBAR2)/{Sp[SQRT(1/n1 + 1/n2)]}
and Sp is the pooled standard deviation.

c) Compute your test-statistic.
Sp^2 = {(12-1)(.104)^2 + (12-1)(.035)^2}/(12 + 12 - 2) = .0060
Sp = .0776
T = (.57 - .37)/[.0776 SQRT(1/12 + 1/12)] = 6.31

d) State your decision.
Since 6.31 > 2.074, we reject the null hypothesis of equal means.

e) State your conclusion with regards to the practical problem considered.
Based on the data, we can conclude that there is a difference between the mean calcium uptakes of the slow- and fast-muscle groups, with the fast-muscle group having the higher mean calcium uptake.

4.
(30 points) The quality of the orange juice produced by a manufacturer (e.g., Tropicana) is constantly monitored. There are numerous sensory and chemical components that combine to make the best tasting orange juice. There is a measure of "sweetness" of an orange juice: the higher the value of this "sweetness" measure, the better the orange juice. In order to study the relationship between the "sweetness" and a chemical measure such as the amount of water soluble pectin (parts per million), the sweetness and the pectin level were measured in 24 production runs.

[Scatterplot: Sweetness (vertical axis, 5.2 to 6.0) versus PectinLevel (horizontal axis, 200 to 400)]

A scatterplot of these 24 pairs of values is provided above. A simple linear regression analysis with Sweetness as the response or dependent variable and PectinLevel as the predictor or independent variable was fitted using Minitab. The output of this analysis is given below.

Regression

The regression equation is
y = 6.25 - 0.00231 x

Predictor   Coef         StDev       T       P
Constant    6.2521       0.2366      26.42   0.000
x           -0.0023106   0.0009049   -2.55   0.018

S = 0.2150   R-Sq = 22.9%   R-Sq(adj) = 19.4%

Analysis of Variance
Source           DF   SS        MS        F      P
Regression       1    0.30140   0.30140   6.52   0.018
Residual Error   22   1.01693   0.04622
Total            23   1.31833

a) By examining the scatterplot, describe the type of relationship between PectinLevel and Sweetness. For instance, is there a negative type of relationship?
There is a negative, approximately linear relationship between PectinLevel and Sweetness.

b) Based on the simple linear regression analysis, what are the least-squares estimates of the intercept and the slope?
The estimate of the intercept is 6.2521. The estimate of the slope is -.0023.

c) Provide an interpretation for the value of b, the estimate of the slope.
The value of b = -.0023 means that for an increase of one unit in the Pectin Level, the mean Sweetness changes by -.0023, that is, it decreases by about .0023.
d) For testing the hypothesis that the slope equals 0 (that is, there is no linear relationship between PectinLevel and Sweetness), what will be your conclusion at the 5% level of significance? Indicate the information you are using to make your conclusion.
The p-value associated with the t-value of -2.55 is .018, which is less than .05, so we reject the hypothesis of a zero slope and conclude that Pectin Level is a significant predictor of the Sweetness level. The same conclusion follows from the analysis of variance table, where the p-value is also .018.

e) What will be the estimate of the common standard deviation?
The estimate of the common standard deviation is the square root of the MSE, and this equals .2150.

Using the "fitted line" option in Minitab, the 95% confidence band and prediction interval were also generated. These are shown in the plot that follows.

[Regression Plot: fitted line Y = 6.25207 - 2.31E-03X, R-Sq = 22.9%, with 95% CI and 95% PI bands; Sweetness (vertical axis) versus PectinLevel (horizontal axis, 200 to 400)]

f) Based on these plots, if a new production line produced a Pectin Level equal to 300, what will be a 95% confidence interval for the mean Sweetness of the orange juice?
A 95% confidence interval for the mean Sweetness of the orange juice when the Pectin Level is 300 goes from (approximately) 5.46 to 5.75. These values are obtained from the red curves at a PectinLevel of 300.

g) What will be a 95% prediction interval for the exact value of the Sweetness of this orange juice with Pectin Level of 300?
A 95% prediction interval for a Pectin Level of 300 goes from (approximately) 5.1 to 6.0. These values are obtained from the blue curves at a PectinLevel of 300.

h) The coefficient of determination of the fitted simple linear regression was 22.9%. Based on this value, how would you assess the ability of Pectin Level to explain the variation in the Sweetness measure? Is it high or is it low?
The coefficient of determination of 22.9% indicates that 22.9% of the total variability in the Y-values (the sweetness) can be explained by the predictor variable, which is the Pectin Level. The value is not high, so as a predictor of Sweetness, the Pectin Level may not be very good.

5. (20 points) The Journal of Hazardous Materials published the results of a study of the chemical properties of three different types of hazardous organic solvents used to clean metal parts: aromatics, chloroalkanes, and esters. One variable studied was sorption rate, measured as mole percentage. Independent samples of solvents from each type were tested and their sorption rates were recorded. Summary statistics for the three groups are provided below.

Descriptive Statistics

Variable   N    Mean     Median   TrMean   StDev    SE Mean
Aromatic   9    0.9422   0.9500   0.9422   0.1683   0.0561
Chloroal   8    1.006    1.015    1.006    0.401    0.142
Esters     15   0.3300   0.3400   0.3292   0.2076   0.0536

Variable   Minimum   Maximum   Q1       Q3
Aromatic   0.6500    1.1500    0.8050   1.0900
Chloroal   0.430     1.580     0.635    1.377
Esters     0.0600    0.6100    0.1000   0.5300

Overlaid boxplots for the three groups are also given below.

[Boxplots of sorption rate (vertical axis, 0.0 to 1.5) for Aromatics, Chloroalkanes, and Esters]

To determine whether the population mean sorption rates for the three groups are identical, a one-way analysis of variance was performed using Minitab. The output of this analysis is provided below.

One-way Analysis of Variance

Analysis of Variance
Source   DF   SS       MS       F       P
Factor   2    3.3054   1.6527   24.51   0.000
Error    29   1.9553   0.0674
Total    31   5.2607

                                  Individual 95% CIs For Mean
                                  Based on Pooled StDev
Level      N    Mean     StDev    ----+---------+---------+---------+
Aromatic   9    0.9422   0.1683                       (----*-----)
Chloroal   8    1.0063   0.4010                        (------*-----)
Esters     15   0.3300   0.2076   (----*----)
                                  ----+---------+---------+---------+
Pooled StDev = 0.2597               0.30      0.60      0.90      1.20

Based on the description of the problem and the Minitab output, answer the following questions.
a) What will be your null hypothesis and your alternative hypothesis?
NULL: The (population) mean sorption rates for the three groups are identical.
ALT: At least two of the three (population) mean sorption rates are different.

b) How many levels do you have in your factor? What are they?
There are three levels. The levels are Aromatics, Chloroalkanes, and Esters.

c) What will be your estimate of the common variance of the three populations?
The common variance is estimated by the MSE, which is 0.0674.

d) What will be your conclusion with regards to your hypothesis, and what is the basis of your conclusion?
Since the p-value in the analysis of variance table is reported as 0.000, which is below any usual level of significance, we reject the null hypothesis and conclude that at least two of the mean sorption rates are different.

e) Which population mean would you conclude is different from the other two?
By examining the confidence intervals for the three mean sorption rates, we note that the intervals for the aromatics and chloroalkanes overlap, and that these intervals do not overlap with the interval for the ester group. Therefore, we could conclude that the mean sorption rate of the ester group is different from the means of the aromatics and chloroalkanes groups.
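
The test statistics quoted in Part II can be rechecked from the numbers given in each problem. The following is a minimal Python sketch using only the standard library; the function names (one_sample_stat, pooled_t, anova_f) are illustrative assumptions and are not part of the examination or of Minitab. The t critical values (2.262 and 2.074) are taken from the answers above rather than computed, since the standard library has no t distribution.

```python
import math
from statistics import NormalDist, mean, stdev

# All function names here are hypothetical helpers for illustration only.

def one_sample_stat(xbar, mu0, s, n):
    """(xbar - mu0) / (s / sqrt(n)): Z in Problem 1, T in Problem 2."""
    return (xbar - mu0) / (s / math.sqrt(n))

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Equal-variance two-sample T and pooled standard deviation (Problem 3)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    return (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2)), sp

def anova_f(ns, means, sds):
    """One-way ANOVA F and MSE from group sizes, means, and SDs (Problem 5)."""
    total = sum(ns)
    grand = sum(n * m for n, m in zip(ns, means)) / total
    ss_factor = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    ss_error = sum((n - 1) * s**2 for n, s in zip(ns, sds))
    ms_factor = ss_factor / (len(ns) - 1)
    mse = ss_error / (total - len(ns))
    return ms_factor / mse, mse

# Problem 1: lower-tailed Z test at the 5% level.
z = one_sample_stat(84, 100, 80, 72)
z_crit = NormalDist().inv_cdf(0.05)  # lower 5% point of the standard normal

# Problem 2: two-sided t test computed from the raw pouring temperatures.
temps = [2543, 2541, 2544, 2620, 2560, 2559, 2562, 2553, 2552, 2553]
t2 = one_sample_stat(mean(temps), 2550, stdev(temps), len(temps))

# Problem 3: pooled two-sample t test from the summary statistics.
t3, sp = pooled_t(0.57, 0.104, 12, 0.37, 0.035, 12)

# Problem 4: quantities recoverable from the Minitab regression output.
t4 = -0.0023106 / 0.0009049   # slope T = Coef / StDev
r_sq = 0.30140 / 1.31833      # R-Sq = SS(Regression) / SS(Total)
s4 = math.sqrt(0.04622)       # S = sqrt(MSE)

# Problem 5: one-way ANOVA from the group summaries.
f5, mse5 = anova_f([9, 8, 15], [0.9422, 1.0063, 0.3300],
                   [0.1683, 0.4010, 0.2076])

print(f"Problem 1: Z = {z:.2f}, reject H0: {z < z_crit}")
print(f"Problem 2: T = {t2:.2f}, reject H0: {abs(t2) > 2.262}")
print(f"Problem 3: Sp = {sp:.4f}, T = {t3:.2f}, reject H0: {abs(t3) > 2.074}")
print(f"Problem 4: T = {t4:.2f}, R-Sq = {r_sq:.3f}, S = {s4:.4f}")
print(f"Problem 5: F = {f5:.2f}, MSE = {mse5:.4f}")
```

Because Problems 3 and 5 are recomputed from rounded summary statistics, the results should be expected to agree with the reported values only up to rounding.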