Fall 2011
Name
STATISTICS 479
Exam II (100 points)

1. A SAS data set world, containing various demographic variables for 60 countries, was created using the following input statement:

       input country $ popurban percgnp birthrat deathrat lifeexp;

   Category variables popgrp and gnpgrp were created corresponding to % urban population and per capita GNP. Answer parts (a) to (e) below.

   (a) (2) Give the name of a SAS procedure that enables you to create a variety of highly customizable tables of many different statistics computed for variables such as birthrat, classified by variables such as popgrp and gnpgrp.

   (b) (2) Give the name of a SAS procedure that allows you to produce side-by-side boxplots in high resolution graphics (without using SAS/GRAPH statements such as symbol or axis).

   (c) (2) Give the name of a SAS procedure (that is not a SAS/GRAPH procedure) that you may use to obtain a normal probability plot of a variable such as popurban in high resolution graphics.

   (d) (2) Give the name of a SAS procedure that you may use to obtain a scatter plot matrix of the 4 variables popurban, percgnp, birthrat, and deathrat in high resolution graphics.

   (e) (2) Give the name of a SAS procedure that enables you to assign descriptive strings (e.g. "Medium") to be displayed on output instead of the numeric values of category variables such as popgrp and gnpgrp.

2. (a) (3) The following statement is included in a proc freq step, where the data are observed counts for combinations of popgrp and gnpgrp in the SAS data set world:

           tables popgrp*gnpgrp/chisq expected cellchi2 norow nocol nopercent;

       Explain the purpose for which you may use the χ² statistic that this statement will produce.

   (b) (3) The following statement is included in a proc tabulate step analyzing the SAS data set world:

           table popgrp*gnpgrp,birthrat*(n*f=4.0 mean*f=9.4 stderr*f=9.4);

       Explain as much as possible the SAS output this statement will produce.
   (c) (3) The following statement is included in a proc gchart step using the world data set as input:

           vbar lifeexp/midpoints= 50 to 80 by 5 type=freq;

       Explain as much as possible the SAS output this statement will produce.

   (d) (3) The following statement is included in a proc sgplot step using the world data set as input:

           hbar popgrp/response=deathrat stat=mean group=gnpgrp;

       Explain as much as possible the SAS output this statement will produce.

3. Daily ozone concentrations (in ppb) for an Eastern city in the U.S. for a period of 152 consecutive days are graphed in a box plot. Answer the questions given below:

   [Figure: box plot of the ozone data, with plot symbols ∗ and +; axis ticks from 10 to 110 in steps of 10]

   (a) (3) Give the name and value of a measure of the location of this distribution.

   (b) (3) What is the shape of the distribution of the ozone measurements suggested by the boxplot?

   (c) (3) Give the name and value of a measure of spread of this distribution.

   (d) (3) For how many days during this period is the daily ozone concentration below 30 ppb?

4. (4) Use the following plot to say in what way the distribution of the data is different from that of a normal distribution:

   [Figure: Normal Probability Plot; vertical axis: Ordered Data (0 to 10); horizontal axis: Normal Percentiles (−2 to 2)]

5. Below is a Q-Q plot of carbon monoxide (CO) concentrations in the air, measured on successive Sundays and Fridays for several months in Linden, Mass.:

   (a) (2) What does this tell you about the shapes of the distributions of CO measurements on these two days?

   (b) (2) Using the above graph, compare the median CO concentrations on the two days.

   (c) (2) Using the above graph, compare the variability of CO concentrations on the two days.

6. Below is a quantile plot of the Stamford ozone data used in class:

   (a) (2) Explain what is plotted on the vertical axis.

   (b) (2) Estimate, approximately, the 40th percentile.

   (c) (2) Estimate, approximately, the percentage of days having ozone concentrations of 150 ppb or more.

7.
Consider the following data:

        x     y
       10   120
       20   115
       21   250
       27   210
       29   300
       33   330
       40   295
       44   400
       52   380
       56   460
       62   125
       68   510

The procedure reg in SAS was used to perform an analysis of this set of data using the model

       y = β0 + β1 x + ε

where the εi are assumed to be independently distributed as N(0, σ²) variables. Answer the following questions based on the results appearing in the output attached to the end of the question.

   i) (2) Give the regression sum of squares and its degrees of freedom. What is the value of R²?

   ii) (2) Give the residual sum of squares and its degrees of freedom. What is the estimate s² of σ²?

   iii) (2) What is the F statistic for testing H0: β1 = 0 vs. Ha: β1 ≠ 0? Make a decision based on the p-value.

   iv) (4) Using the estimate of β1 and its standard error, compute the t-statistic for testing H0: β1 = 0 vs. Ha: β1 ≠ 0.

   v) (4) Using the estimate of β1 and its standard error, compute a 95% confidence interval for β1.

   vi) (4) Use the value of h22 to compute the standard error of the residual for observation 2.

   vii) (4) Use the value of h22 to compute the standard error of ŷ2.

   viii) (4) Use the residual for observation 2 and its standard error to calculate the corresponding studentized residual.

   ix) (4) Use RStudent to conduct the Bonferroni test procedure for y-outliers, using the supplied Table B.10 with α = .05.

   x) (4) Identify any cases that may be x-outliers, explaining why you selected them.

   xi) (4) Identify any cases that may be influential, explaining why you selected them.

   xii) (4) If you find any case to be influential, explain why this case should be examined further.

   xiii) (4) Examine the residuals vs. predicted values plot. State what you think this plot indicates about the fit of the data to a straight line model.

   xiv) (4) Examine the normal probability plot of the studentized residuals. Does it show that the assumptions made about the model are plausible? Explain why or why not.
Simple Linear Regression of Data for Exam II: Fall 2011

The REG Procedure

Number of Observations Read    12
Number of Observations Used    12

                           Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1             77201          77201       6.61    0.0278
Error              10            116755          11676
Corrected Total    11            193956

Root MSE          108.05335    R-Square    0.3980
Dependent Mean    291.25000    Adj R-Sq    0.3378
Coeff Var          37.09986

                           Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             114.35740          75.53323       1.51      0.1610
x             1               4.59461           1.78680       2.57      0.0278

                           Output Statistics

     Dependent  Predicted     Std Error             Std Error   Student                   Cook's            Hat Diag
Obs   Variable      Value  Mean Predict   Residual   Residual  Residual    -2-1 0 1 2          D  RStudent         H
  1   120.0000   160.3035       59.7176   -40.3035     90.052    -0.448  |      |      |   0.044   -0.4289    0.3054
  2   115.0000   206.2497       45.4494   -91.2497     98.030    -0.931  |     *|      |   0.093   -0.9240    0.1769
  3   250.0000   210.8443       44.1668    39.1557     98.614     0.397  |      |      |   0.016    0.3797    0.1671
  4   210.0000   238.4119       37.3522   -28.4119      101.4    -0.280  |      |      |   0.005   -0.2669    0.1195
  5   300.0000   247.6012       35.5119    52.3988      102.1     0.513  |      |*     |   0.016    0.4937    0.1080
  6   330.0000   265.9796       32.7038    64.0204      103.0     0.622  |      |*     |   0.019    0.6015    0.0916
  7   295.0000   298.1419       31.3073    -3.1419      103.4   -0.0304  |      |      |   0.000   -0.0288    0.0839
  8   400.0000   316.5204       32.7038    83.4796      103.0     0.811  |      |*     |   0.033    0.7956    0.0916
  9   380.0000   353.2773       39.4312    26.7227      100.6     0.266  |      |      |   0.005    0.2529    0.1332
 10   460.0000   371.6557       44.1668    88.3443     98.614     0.896  |      |*     |   0.080    0.8862    0.1671
 11   125.0000   399.2234       52.3078  -274.2234     94.549    -2.900  | *****|      |   1.287   -6.9047    0.2343
 12   510.0000   426.7911       61.2484    83.2089     89.018     0.935  |      |*     |   0.207    0.9283    0.3213

Table B.10. 5% critical values based on the Bonferroni bounds for the t-test for a single outlier using externally studentized residual in a linear regression model.
                                                   k
  n       1       2       3       4       5       6       7       8       9      10      11      12
  5    9.92   63.66
  6    6.23   10.89   76.39
  7    5.07    6.58   11.77   89.12
  8    4.53    5.26    6.90   12.59  101.86
  9    4.22    4.66    5.44    7.18   13.36  114.59
 10    4.03    4.32    4.77    5.60    7.45   14.09  127.32
 11    3.90    4.10    4.40    4.88    5.75    7.70   14.78  140.05
 12    3.81    3.96    4.17    4.49    4.98    5.89    7.94   15.44  152.79
 13    3.74    3.86    4.02    4.24    4.56    5.08    6.02    8.16   16.08  165.52
 14    3.69    3.79    3.91    4.07    4.30    4.63    5.16    6.14    8.37   16.69  178.25
 15    3.65    3.73    3.83    3.95    4.12    4.36    4.70    5.25    6.25    8.58   17.28  190.98
 16    3.62    3.68    3.77    3.87    4.00    4.17    4.41    4.76    5.33    6.36    8.77   17.85
 17    3.59    3.65    3.72    3.80    3.90    4.04    4.21    4.46    4.82    5.40    6.47    8.95
 18    3.57    3.62    3.68    3.75    3.83    3.94    4.08    4.26    4.51    4.88    5.47    6.57
 19    3.56    3.60    3.65    3.71    3.78    3.86    3.97    4.11    4.30    4.55    4.93    5.54
 20    3.54    3.58    3.62    3.67    3.73    3.81    3.89    4.00    4.15    4.33    4.59    4.98
 21    3.53    3.57    3.60    3.65    3.70    3.76    3.83    3.92    4.03    4.18    4.37    4.64
 22    3.52    3.55    3.59    3.63    3.67    3.72    3.78    3.86    3.95    4.06    4.21    4.40
 23    3.52    3.54    3.57    3.61    3.65    3.69    3.75    3.81    3.88    3.98    4.09    4.24
 24    3.51    3.53    3.56    3.59    3.63    3.67    3.71    3.77    3.83    3.91    4.00    4.12
 25    3.50    3.53    3.55    3.58    3.61    3.65    3.69    3.73    3.79    3.85    3.93    4.02
 26    3.50    3.52    3.54    3.57    3.60    3.63    3.66    3.70    3.75    3.81    3.87    3.95
 27    3.50    3.52    3.54    3.56    3.58    3.61    3.65    3.68    3.72    3.77    3.83    3.89
 28    3.50    3.51    3.53    3.55    3.58    3.60    3.63    3.66    3.70    3.74    3.79    3.84
 29    3.49    3.51    3.53    3.55    3.57    3.59    3.62    3.64    3.68    3.71    3.76    3.81
 30    3.49    3.51    3.52    3.54    3.56    3.58    3.60    3.63    3.66    3.69    3.73    3.77
 31    3.49    3.50    3.52    3.54    3.55    3.57    3.59    3.62    3.64    3.67    3.71    3.74
 32    3.49    3.50    3.52    3.53    3.55    3.57    3.59    3.61    3.63    3.66    3.69    3.72
 33    3.49    3.50    3.52    3.53    3.54    3.56    3.58    3.60    3.62    3.64    3.67    3.70
 34    3.49    3.50    3.51    3.53    3.54    3.56    3.57    3.59    3.61    3.63    3.66    3.68
 35    3.49    3.50    3.51    3.52    3.54    3.55    3.57    3.58    3.60    3.62    3.64    3.67
 36    3.49    3.50    3.51    3.52    3.54    3.55    3.56    3.58    3.60    3.61    3.63    3.66
 37    3.49    3.50    3.51    3.52    3.53    3.55    3.56    3.57    3.59    3.61    3.62    3.65
 38    3.49    3.50    3.51    3.52    3.53    3.54    3.56    3.57    3.58    3.60    3.62    3.64
 39    3.49    3.50    3.51    3.52    3.53    3.54    3.55    3.57    3.58    3.59    3.61    3.63
 40    3.49    3.50    3.51    3.52    3.53    3.54    3.55    3.56    3.58    3.59    3.60    3.62
 50    3.51    3.51    3.52    3.53    3.53    3.54    3.54    3.55    3.56    3.57    3.57    3.58
 60    3.53    3.53    3.54    3.54    3.54    3.55    3.55    3.56    3.56    3.57    3.57    3.58
 70    3.55    3.55    3.55    3.56    3.56    3.56    3.56    3.57    3.57    3.57    3.58    3.58
 80    3.57    3.57    3.57    3.57    3.58    3.58    3.58    3.58    3.58    3.59    3.59    3.59
 90    3.59    3.59    3.59    3.59    3.59    3.59    3.60    3.60    3.60    3.60    3.60    3.60
100    3.60    3.60    3.60    3.61    3.61    3.61    3.61    3.61    3.61    3.61    3.62    3.62

(continued)
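
The quantities in the attached REG listing for question 7 can be cross-checked outside SAS. The following is a minimal sketch in plain Python, not part of the original exam, assuming the usual closed-form formulas for simple linear regression with p = 2 parameters (intercept and slope). The function name simple_linear_fit and the dictionary keys are hypothetical names introduced only for illustration.

```python
import math

# Data from question 7 (12 observations).
x = [10, 20, 21, 27, 29, 33, 40, 44, 52, 56, 62, 68]
y = [120, 115, 250, 210, 300, 330, 295, 400, 380, 460, 125, 510]


def simple_linear_fit(x, y):
    """Hypothetical helper: least-squares fit of y = b0 + b1*x with case diagnostics."""
    n = len(x)
    p = 2  # assumed number of regression parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares estimates and residuals.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Sums of squares and the estimate of sigma^2.
    sse = sum(e ** 2 for e in resid)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sst - sse
    mse = sse / (n - p)

    # Leverages h_ii, then internally and externally studentized residuals.
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, hat)]
    rstudent = [r * math.sqrt((n - p - 1) / (n - p - r ** 2)) for r in student]
    cooks_d = [r ** 2 * h / (p * (1 - h)) for r, h in zip(student, hat)]

    return {
        "b0": b0, "b1": b1, "se_b1": math.sqrt(mse / sxx),
        "ssr": ssr, "sse": sse, "mse": mse, "r2": ssr / sst,
        "hat": hat, "student": student, "rstudent": rstudent, "cooks_d": cooks_d,
    }


fit = simple_linear_fit(x, y)
print("slope:", round(fit["b1"], 5), "R-square:", round(fit["r2"], 4))
print("h22:", round(fit["hat"][1], 4))
```

Each entry of the returned dictionary corresponds to a column or summary value in the listing, for example fit["hat"] to Hat Diag H and fit["rstudent"] to RStudent, so the sketch can be used to confirm that the data table and the output agree.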