MA3518: Applied Statistics

Department of Mathematics
Faculty of Science and Technology
City University of Hong Kong

MA 3518: Applied Statistics
Solutions to Assignment 2

Question 1:

The SAS input is

    Data A2Q1;
      do n=1 to 30;
        y=100+3*rannor(201);
        output;
      end;
    proc print;
    run;

    Obs     n          y
      1     1    105.562
      2     2    102.734
      3     3    102.287
      4     4    104.082
      5     5     93.260
      6     6     99.222
      7     7     96.632
      8     8     99.624
      9     9     99.087
     10    10     95.600
     11    11    104.330
     12    12     99.462
     13    13    101.572
     14    14     98.272
     15    15    101.197
     16    16    101.894
     17    17    100.110
     18    18     99.824
     19    19    102.931
     20    20    100.692
     21    21    103.079
     22    22    104.685
     23    23     99.858
     24    24     99.982
     25    25    101.914
     26    26     98.087
     27    27    104.141
     28    28    104.492
     29    29     96.969
     30    30    103.581

(a) The SAS procedure is given by:

    proc univariate data=A2Q1 normal;
      var y;
    run;

Relevant output:

    The UNIVARIATE Procedure
    Variable: y

    Moments
    N                          30    Sum Weights                30
    Mean               100.838755    Sum Observations   3025.16265
    Std Deviation      2.95744478    Variance           8.74647963
    Skewness            -0.555295    Kurtosis           0.05011072
    Uncorrected SS     305307.283    Corrected SS       253.647909
    Coeff Variation    2.93284539    Std Error Mean     0.53995307

    Tests for Normality
    Test                  --Statistic---     -----p Value------
    Shapiro-Wilk          W       0.966698   Pr < W        0.4532
    Kolmogorov-Smirnov    D       0.076831   Pr > D       >0.1500
    Cramer-von Mises      W-Sq    0.037879   Pr > W-Sq    >0.2500
    Anderson-Darling      A-Sq    0.285016   Pr > A-Sq    >0.2500

The p-values of the normality tests are all greater than 5%, so normality is not rejected at the 5% level. The data satisfy the normality assumption.
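The entries in the Moments table can be cross-checked outside SAS. The following is a minimal Python sketch, not part of the original solution, which assumes the 30 values printed by proc print (rounded to three decimals) are used as the data. Because of that rounding, the results should agree with the SAS output only to about three decimal places. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

# The 30 simulated values as printed by proc print (rounded to 3 decimals).
y = [
    105.562, 102.734, 102.287, 104.082, 93.260, 99.222, 96.632, 99.624,
    99.087, 95.600, 104.330, 99.462, 101.572, 98.272, 101.197, 101.894,
    100.110, 99.824, 102.931, 100.692, 103.079, 104.685, 99.858, 99.982,
    101.914, 98.087, 104.141, 104.492, 96.969, 103.581,
]

n = len(y)
y_mean = mean(y)                   # "Mean"
y_var = variance(y)                # "Variance" (divisor n - 1)
y_sd = stdev(y)                    # "Std Deviation"
std_error = y_sd / sqrt(n)         # "Std Error Mean"
coeff_var = 100 * y_sd / y_mean    # "Coeff Variation", in percent

print(n, round(y_mean, 3), round(y_sd, 3), round(std_error, 3), round(coeff_var, 3))
```

The standard error of the mean is the standard deviation divided by the square root of n, and the coefficient of variation is the standard deviation expressed as a percentage of the mean.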

(b) The SAS input may be:

    Data A2Q1m;
      set A2Q1;
      if ranuni(0)<.4;
      title ' sampling without replacement';
    proc print data=A2Q1m;
    run;

Because ranuni(0) takes its seed from the system clock, the selected observations change from run to run. Possible output:

    sampling without replacement

    Obs     n          y
      1     2    102.734
      2     4    104.082
      3     9     99.087
      4    13    101.572
      5    15    101.197
      6    17    100.110
      7    24     99.982
      8    27    104.141
      9    28    104.492
     10    30    103.581

(c) The input is:

    proc univariate data=a2Q1m mu0=100;
      var y;
    run;

Possible output:

    sampling without replacement

    The UNIVARIATE Procedure
    Variable: y

    Moments
    N                          10    Sum Weights                10
    Mean                102.09782    Sum Observations    1020.9782
    Std Deviation      1.97276849    Variance           3.89181553
    Skewness           -0.2153613    Kurtosis            -1.611451
    Uncorrected SS     104274.674    Corrected SS       35.0263398
    Coeff Variation    1.93223372    Std Error Mean     0.62384417

    Basic Statistical Measures
        Location                    Variability
    Mean      102.0978     Std Deviation            1.97277
    Median    102.1528     Variance                 3.89182
    Mode       .           Range                    5.40463
                           Interquartile Range      3.97179

    Tests for Location: Mu0=100
    Test           -Statistic-    -----p Value------
    Student's t    t   3.36273    Pr > |t|     0.0084
    Sign           M         3    Pr >= |M|    0.1094
    Signed Rank    S      23.5    Pr >= |S|    0.0137

The Student's t-test has a p-value of 0.0084, which is less than 5%, so we reject the hypothesis that the mean is equal to 100. Wilcoxon's signed rank test (p-value 0.0137) confirms that the average value is not equal to 100. The sign test (p-value 0.1094) does not reject at the 5% level, but it uses only the signs of the deviations from 100 and is therefore less accurate.
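The three location tests in part (c) can be reproduced by hand from the ten sampled values. The following is a minimal Python sketch, not part of the original solution. It assumes the rounded values shown in the proc print output, that M and S are centred as (n+ - n-)/2 and T+ - n(n+1)/4, and that the sign and signed rank p-values are the exact two-sided ones. These conventions match the figures in the output above, but the variable names are illustrative. There are no zero differences and no tied absolute differences in this sample, which the rank step relies on.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, stdev

# The ten sampled values from part (b), rounded to 3 decimals.
y = [102.734, 104.082, 99.087, 101.572, 101.197,
     100.110, 99.982, 104.141, 104.492, 103.581]
mu0 = 100.0
n = len(y)
d = [v - mu0 for v in y]

# Student's t statistic: mean difference over its standard error.
t_stat = mean(d) / (stdev(d) / sqrt(n))

# Sign test: M = (n+ - n-)/2 with an exact two-sided binomial p-value.
n_pos = sum(1 for v in d if v > 0)
n_neg = sum(1 for v in d if v < 0)
M = (n_pos - n_neg) / 2
m = n_pos + n_neg
p_sign = min(1.0, 2 * sum(comb(m, i) for i in range(min(n_pos, n_neg) + 1)) / 2 ** m)

# Signed rank: rank |d| (no ties here), S = T+ - n(n+1)/4.
order = sorted(range(n), key=lambda i: abs(d[i]))
rank = {i: r + 1 for r, i in enumerate(order)}
t_plus = sum(rank[i] for i in range(n) if d[i] > 0)
S = t_plus - n * (n + 1) / 4

# Exact two-sided p-value: under H0 every subset of ranks is equally likely
# to be the set of positive ranks, so count subsets with a small enough sum.
t_min = min(t_plus, n * (n + 1) // 2 - t_plus)
count = sum(1 for r in range(n + 1)
            for c in combinations(range(1, n + 1), r) if sum(c) <= t_min)
p_signed_rank = min(1.0, 2 * count / 2 ** n)

print(round(t_stat, 3), M, round(p_sign, 4), S, round(p_signed_rank, 4))
```

Only two of the ten values lie below 100, and they carry the ranks 1 and 3, which is why the signed rank test rejects while the sign test, which ignores the magnitudes, does not.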
(5 marks)

Question 2:

(a) The SAS procedure is given by:

    PROC REG Data = A2Q2;
      MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t;
    RUN;

The relevant part of the SAS output is given by:

    Parameter Estimates
                       Parameter      Standard
    Variable     DF     Estimate         Error    t Value    Pr > |t|
    Intercept     1    -88.18317      14.37005      -6.14      <.0001
    S1_t          1      0.07915       0.00327      24.23      <.0001
    LR2_t         1    173.40100     895.48752       0.19      0.8465
    R2_t          1     26.18427       0.94783      27.63      <.0001
    LR1_t         1    407.37338     607.44108       0.67      0.5027
    R1_t          1     -6.85785       0.44933     -15.26      <.0001

From the SAS output, the p-values of the t-tests on the coefficients of S1_t, R2_t and R1_t are all below 10%, while those of LR1_t (0.5027) and LR2_t (0.8465) are not. Hence, we conclude that all independent variables except LR1_t and LR2_t are significant at the 10% significance level.

(b) The SAS procedure and the relevant part of the SAS output are given by:

    PROC REG Data = A2Q2;
      MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t / p;
      OUTPUT OUT = A2Q2NS p = predict;
    RUN;

    PROC PLOT Data = A2Q2NS;
      PLOT S2_t*predict;
    RUN;

[Plot of S2_t*predict. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: S2_t, from 0 to 3000. Horizontal axis: Predicted Value of S2_t, from -500 to 2500. NOTE: 5 obs had missing values. 88 obs hidden.]

From the SAS output, the plot of the observed values of S2_t against their corresponding predicted values deviates seriously from a straight line.
Hence, the linearity assumption of the regression model is not valid for this data set, and the regression model does not fit the data very well (adjrsq = 0.7597).

(c) The SAS procedure is given by:

    PROC REG Data = A2Q2;
      MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t;
      OUTPUT OUT = A2Q2NS1 student = residual;
    RUN;

    PROC PLOT Data = A2Q2NS1;
      PLOT residual*LR1_t;
    RUN;

The relevant part of the SAS output is given by:

[Plot of residual*LR1_t. Legend: A = 1 obs, B = 2 obs, etc. Vertical axis: Studentized Residual, from -4 to 8. Horizontal axis: LR1_t, from -0.08 to 0.1. NOTE: 5 obs had missing values.]

From the SAS output, the residuals fluctuate around the zero level and look fairly random. However, the positive residuals reach larger magnitudes than the negative ones (up to about 8, against about -4), and the size of the fluctuations seems to depend on the level of LR1_t. From the residual plot, we can also identify some outliers in the data set.
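The adjusted R-square of 0.7597 quoted in part (b) can be reproduced from the ordinary R-square. The following is a minimal Python sketch, not part of the original solution. It assumes the usual definition for a model with an intercept, and it takes R-Square = 0.7613, 747 observations used and 5 regressors from the full-model output reported in Question 3. The function name is illustrative.

```python
def adjusted_r_square(r_square, n_obs, n_regressors):
    """Adjusted R-square for a regression model that includes an intercept."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Full model: S2_t on S1_t, LR2_t, R2_t, LR1_t and R1_t.
full_model = adjusted_r_square(0.7613, 747, 5)
print(round(full_model, 4))
```

With 747 observations the penalty for five regressors is small, so the adjusted value is only slightly below the unadjusted R-square.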
(5 marks)

Question 3:

(a) The SAS procedure is given by:

    proc reg data=A2Q2;
      model S2_t= S1_t R1_t R2_t LR1_t LR2_t/selection= rsquare adjrsq cp;
    run;

    The REG Procedure
    Model: MODEL1
    Dependent Variable: S2_t

    R-Square Selection Method

    Number of Observations Read                        752
    Number of Observations Used                        747
    Number of Observations with Missing Values           5

    Number in               Adjusted
      Model     R-Square    R-Square        C(p)    Variables in Model
        1        0.5438      0.5432     673.0083    R2_t
        1        0.5148      0.5141     763.0125    S1_t
        1        0.1727      0.1716     1824.782    R1_t
        1        0.0136      0.0122     2318.701    LR1_t
        1        0.0095      0.0081     2331.408    LR2_t
    ------------------------------------------------------------------
        2        0.6845      0.6837     238.1280    S1_t R2_t
        2        0.5635      0.5623     613.7550    R1_t R2_t
        2        0.5517      0.5505     650.3115    R2_t LR1_t
        2        0.5492      0.5480     658.2073    R2_t LR2_t
        2        0.5154      0.5140     763.2571    S1_t LR1_t
        2        0.5152      0.5139     763.6300    S1_t LR2_t
        2        0.5148      0.5135     764.9497    S1_t R1_t
        2        0.1828      0.1806     1795.433    R1_t LR1_t
        2        0.1792      0.1770     1806.706    R1_t LR2_t
        2        0.0138      0.0111     2319.993    LR1_t LR2_t
    ------------------------------------------------------------------
        3        0.7601      0.7592       5.5354    S1_t R1_t R2_t
        3        0.6862      0.6849     235.0184    S1_t R2_t LR1_t
        3        0.6857      0.6844     236.5475    S1_t R2_t LR2_t
        3        0.5719      0.5702     589.6471    R1_t R2_t LR1_t
        3        0.5695      0.5677     597.2604    R1_t R2_t LR2_t
        3        0.5519      0.5501     651.7302    R2_t LR1_t LR2_t
        3        0.5154      0.5134     765.1615    S1_t R1_t LR1_t
        3        0.5154      0.5134     765.2569    S1_t LR1_t LR2_t
        3        0.5153      0.5133     765.5516    S1_t R1_t LR2_t
        3        0.1832      0.1799     1796.077    R1_t LR1_t LR2_t
    ------------------------------------------------------------------
        4        0.7612      0.7600       4.0375    S1_t R1_t R2_t LR1_t
        4        0.7611      0.7598       4.4498    S1_t R1_t R2_t LR2_t
        4        0.6862      0.6845     236.9415    S1_t R2_t LR1_t LR2_t
        4        0.5720      0.5697     591.2971    R1_t R2_t LR1_t LR2_t
        4        0.5154      0.5128     767.1609    S1_t R1_t LR1_t LR2_t
    ------------------------------------------------------------------
        5        0.7613      0.7597       6.0000    S1_t R1_t R2_t LR1_t LR2_t

(i) Based on R-square, the “best” regression model is the full
model.
(ii) Based on the adjusted R-square, the “best” regression model is the model without LR2_t (adjusted R-square = 0.7600).
(iii) Based on Mallows' Cp statistic, the “best” regression model is also the model without LR2_t (C(p) = 4.0375).

(b) The SAS procedure is given by:

    proc reg data=A2Q2;
      model S2_t= S1_t R1_t R2_t LR1_t LR2_t/Selection = forward sle = 0.05;
    RUN;

    The REG Procedure
    Model: MODEL1
    Dependent Variable: S2_t

    Number of Observations Read                        752
    Number of Observations Used                        747
    Number of Observations with Missing Values           5

    Forward Selection: Step 1
    Variable R2_t Entered: R-Square = 0.5438 and C(p) = 673.0083

    Analysis of Variance
                                 Sum of         Mean
    Source              DF      Squares       Square    F Value    Pr > F
    Model                1     35814558     35814558     888.00    <.0001
    Error              745     30047143        40332
    Corrected Total    746     65861702

                  Parameter     Standard
    Variable       Estimate        Error    Type II SS    F Value    Pr > F
    Intercept    -270.31683     16.60755      10685187     264.93    <.0001
    R2_t           26.94211      0.90412      35814558     888.00    <.0001

    Bounds on condition number: 1, 1
    ------------------------------------------------------------------
    Forward Selection: Step 2
    Variable S1_t Entered: R-Square = 0.6845 and C(p) = 238.1280

    Analysis of Variance
                                 Sum of         Mean
    Source              DF      Squares       Square    F Value    Pr > F
    Model                2     45084988     22542494     807.23    <.0001
    Error              744     20776714        27926
    Corrected Total    746     65861702

                  Parameter     Standard
    Variable       Estimate        Error    Type II SS    F Value    Pr > F
    Intercept    -193.56306     14.44705       5012915     179.51    <.0001
    S1_t            0.06408      0.00352       9270430     331.97    <.0001
    R2_t           17.98320      0.89876      11180284     400.36    <.0001

    Bounds on condition number: 1.4272, 5.7087
    ------------------------------------------------------------------
    Forward Selection: Step 3
    Variable R1_t Entered: R-Square = 0.7601 and C(p) = 5.5354

    Analysis of Variance
                                 Sum of         Mean
    Source              DF      Squares       Square    F Value    Pr > F
    Model                3     50062952     16687651     784.80    <.0001
    Error              743     15798750        21263
    Corrected Total    746     65861702

                  Parameter     Standard
    Variable       Estimate        Error    Type II SS    F Value    Pr > F
    Intercept     -87.74372     14.37898        791790      37.24    <.0001
    S1_t            0.07994      0.00324      12948623     608.96    <.0001
    R1_t           -6.87660      0.44943       4977964     234.11    <.0001
    R2_t           26.14860      0.94861      16156915     759.84    <.0001

    Bounds on condition number: 2.1822, 17.58
    ------------------------------------------------------------------
    No other variable met the 0.0500 significance level for entry into the model.

    Summary of Forward Selection
            Variable    Number     Partial       Model
    Step    Entered     Vars In    R-Square    R-Square       C(p)    F Value    Pr > F
       1    R2_t           1        0.5438      0.5438     673.008     888.00    <.0001
       2    S1_t           2        0.1408      0.6845     238.128     331.97    <.0001
       3    R1_t           3        0.0756      0.7601      5.5354     234.11    <.0001

From the SAS output, the “best” regression model chosen by forward selection at the 5% significance level has 3 variables: R2_t, S1_t and R1_t.

(c) The SAS procedure is given by:

    proc reg data=A2Q2ns;
      model S2_t= S1_t R1_t R2_t LR1_t LR2_t/selection= backward sls=0.10;
    run;

Part of the output:

    Backward Elimination: Step 0
    All Variables Entered: R-Square = 0.7613 and C(p) = 6.0000

    Backward Elimination: Step 1
    Variable LR2_t Removed: R-Square = 0.7612 and C(p) = 4.0375

    All variables left in the model are significant at the 0.1000 level.

    Summary of Backward Elimination
            Variable    Number     Partial       Model
    Step    Removed     Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
       1    LR2_t          4        0.0000      0.7612     4.0375       0.04    0.8465

Hence backward elimination removes only the variable LR2_t from the model.

(d) From the SAS output, the best subset selection in (a), using the adjusted R-square or Cp criterion, and the backward elimination in (c) give the same “best” regression model. It differs slightly from the choice in (b), which also leaves out the variable LR1_t. The decisions will all be the same if the significance level for (b) is set at 10% as well.

(5 marks)

Question 4:

(a) The problem can be formulated as a one-factor design.
Either the one-way ANOVA method or the Kruskal-Wallis test can be used to perform the analysis.

    Data A2Q4;
      INPUT BRL $ RL @@;
      Datalines;
    L 1.25 L 1.17 L 1.32 L 1.18 L 1.62 L 1.11 L 1.32 L 1.31 L 1.33
    M 1.28 M 1.36 M 1.12 M 1.22 M 1.36 M 1.21 M 1.33 M 1.28 M 1.13
    H 1.12 H 1.33 H 1.26 H 1.30 H 1.28 H 1.18 H 1.10 H 1.16 H 1.62
    ;
    RUN;

(b) Assuming normality of the data, the problem can be assessed by ANOVA:

    PROC ANOVA Data = A2Q4;
      Class BRL;
      MODEL RL = BRL;
    RUN;

The SAS output is given by:

    The ANOVA Procedure

    Class Level Information
    Class    Levels    Values
    BRL           3    H L M

    Number of observations    27

    The ANOVA Procedure
    Dependent Variable: RL

                                   Sum of
    Source              DF        Squares    Mean Square    F Value    Pr > F
    Model                2     0.00642963     0.00321481       0.18    0.8393
    Error               24     0.43731111     0.01822130
    Corrected Total     26     0.44374074

    R-Square    Coeff Var    Root MSE     RL Mean
    0.014490     10.64125    0.134986    1.268519

    Source    DF      Anova SS    Mean Square    F Value    Pr > F
    BRL        2    0.00642963     0.00321481       0.18    0.8393

From the SAS output, the p-value of the F-test is 0.8393. Hence, we do not reject H0 and conclude that, at the 5% significance level, there is no significant difference in the readings of the meter under the three background radiation levels.

(c) Without relying on the normality assumption, the Kruskal-Wallis test may be used:

    PROC npar1way Data = A2Q4 wilcoxon;
      Class BRL;
      var RL;
    RUN;

    The NPAR1WAY Procedure

    Wilcoxon Scores (Rank Sums) for Variable RL
    Classified by Variable BRL

                    Sum of    Expected      Std Dev     Mean
    BRL     N       Scores    Under H0     Under H0    Score
    ---------------------------------------------------------
    L       9       135.00       126.0    19.403608    15.00
    M       9       130.50       126.0    19.403608    14.50
    H       9       112.50       126.0    19.403608    12.50

    Average scores were used for ties.

    Kruskal-Wallis Test
    Chi-Square            0.5020
    DF                         2
    Pr > Chi-Square       0.7780

The p-value of the test is 0.7780, which is greater than 5%. Hence it gives the same conclusion as (b).
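Both test statistics in parts (b) and (c) can be recomputed directly from the 27 readings. The following is a minimal Python sketch, not part of the original solution. It assumes midranks for ties and the usual tie-corrected Kruskal-Wallis statistic. The two p-value formulas are closed forms that hold only because there are three groups: the F distribution has 2 numerator degrees of freedom and the chi-square distribution has 2 degrees of freedom. The variable names are illustrative.

```python
from math import exp
from statistics import mean

# Meter readings (RL) by background radiation level (BRL), from the data step.
groups = {
    "L": [1.25, 1.17, 1.32, 1.18, 1.62, 1.11, 1.32, 1.31, 1.33],
    "M": [1.28, 1.36, 1.12, 1.22, 1.36, 1.21, 1.33, 1.28, 1.13],
    "H": [1.12, 1.33, 1.26, 1.30, 1.28, 1.18, 1.10, 1.16, 1.62],
}
pooled = sorted(v for g in groups.values() for v in g)
N, k = len(pooled), len(groups)

# One-way ANOVA: between-group and within-group sums of squares.
grand_mean = mean(pooled)
ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_error = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
F = (ss_model / (k - 1)) / (ss_error / (N - k))
# Upper tail of F(2, N - k); this closed form is valid only for 2 numerator df.
p_anova = (1 + 2 * F / (N - k)) ** (-(N - k) / 2)

def midrank(value):
    """Average rank of a value in the pooled sample (handles ties)."""
    first = pooled.index(value) + 1
    ties = pooled.count(value)
    return first + (ties - 1) / 2

# Kruskal-Wallis: rank sums per group, then the tie-corrected H statistic.
rank_sums = {name: sum(midrank(v) for v in g) for name, g in groups.items()}
H_raw = (12 / (N * (N + 1))
         * sum(rank_sums[name] ** 2 / len(g) for name, g in groups.items())
         - 3 * (N + 1))
tie_term = sum(pooled.count(v) ** 3 - pooled.count(v) for v in set(pooled))
H = H_raw / (1 - tie_term / (N ** 3 - N))
# Upper tail of chi-square with 2 df; valid only for k = 3 groups.
p_kw = exp(-H / 2)

print(round(F, 2), round(p_anova, 4), rank_sums, round(H, 4), round(p_kw, 4))
```

The rank sums lie close to their common expected value of 126 under H0, which is why the Kruskal-Wallis statistic is small and agrees with the ANOVA conclusion.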
(5 marks)

Question 5:

    data A2Q5;
      input Reactant $ Catalyst $ yield;
      cards;
    A- B- 28.0
    A- B- 25.0
    A- B- 27.0
    A- B+ 18.0
    A- B+ 19.0
    A- B+ 23.0
    A+ B- 36.0
    A+ B- 32.0
    A+ B- 32.0
    A+ B+ 31.0
    A+ B+ 30.0
    A+ B+ 29.0
    ;
    run;

    proc glm data=a2q5;
      class reactant catalyst;
      model yield= reactant|catalyst;
    run;

    The GLM Procedure

    Class Level Information
    Class       Levels    Values
    Reactant         2    A+ A-
    Catalyst         2    B+ B-

    Number of observations    12

    The GLM Procedure
    Dependent Variable: yield

                                    Sum of
    Source              DF         Squares    Mean Square    F Value    Pr > F
    Model                3     291.6666667     97.2222222      24.82    0.0002
    Error                8      31.3333333      3.9166667
    Corrected Total     11     323.0000000

    R-Square    Coeff Var    Root MSE    yield Mean
    0.902993     7.196571    1.979057      27.50000

    Source               DF       Type I SS    Mean Square    F Value    Pr > F
    Reactant              1     208.3333333    208.3333333      53.19    <.0001
    Catalyst              1      75.0000000     75.0000000      19.15    0.0024
    Reactant*Catalyst     1       8.3333333      8.3333333       2.13    0.1828

    Source               DF     Type III SS    Mean Square    F Value    Pr > F
    Reactant              1     208.3333333    208.3333333      53.19    <.0001
    Catalyst              1      75.0000000     75.0000000      19.15    0.0024
    Reactant*Catalyst     1       8.3333333      8.3333333       2.13    0.1828

As the p-values for Reactant (<.0001) and Catalyst (0.0024) are both less than 10%, there is a significant treatment effect for Reactant and a significant treatment effect for Catalyst. There is no significant interaction between the two factors, as the corresponding p-value is 0.1828 > 0.1.

(5 marks)

~ End of Solutions to Assignment 2 ~