Department of Mathematics - City University of Hong Kong

MA3518: Applied Statistics
Page 1
Department of Mathematics
Faculty of Science and Technology
City University of Hong Kong
MA 3518: Applied Statistics
Solutions to Assignment 2
Question 1:
The SAS input is
Data A2Q1;
do n=1 to 30;
y=100+3*rannor(201);
output;
end;
proc print;
run;
Obs     n          y
  1     1    105.562
  2     2    102.734
  3     3    102.287
  4     4    104.082
  5     5     93.260
  6     6     99.222
  7     7     96.632
  8     8     99.624
  9     9     99.087
 10    10     95.600
 11    11    104.330
 12    12     99.462
 13    13    101.572
 14    14     98.272
 15    15    101.197
 16    16    101.894
 17    17    100.110
 18    18     99.824
 19    19    102.931
 20    20    100.692
 21    21    103.079
 22    22    104.685
 23    23     99.858
 24    24     99.982
 25    25    101.914
 26    26     98.087
 27    27    104.141
 28    28    104.492
 29    29     96.969
 30    30    103.581
(a) The SAS procedure is given by:
proc univariate data=A2Q1 normal;
var y;
run;
Relevant output:
The UNIVARIATE Procedure
Variable: y

Moments
N                          30    Sum Weights                 30
Mean               100.838755    Sum Observations    3025.16265
Std Deviation      2.95744478    Variance            8.74647963
Skewness            -0.555295    Kurtosis            0.05011072
Uncorrected SS     305307.283    Corrected SS        253.647909
Coeff Variation    2.93284539    Std Error Mean      0.53995307

Tests for Normality
Test                  --Statistic---      -----p Value------
Shapiro-Wilk          W       0.966698    Pr < W        0.4532
Kolmogorov-Smirnov    D       0.076831    Pr > D       >0.1500
Cramer-von Mises      W-Sq    0.037879    Pr > W-Sq    >0.2500
Anderson-Darling      A-Sq    0.285016    Pr > A-Sq    >0.2500
The p-values of the normality tests are all greater than 5%, so normality is not rejected. The data satisfy the normality assumption.
(b) The SAS input may be:
Data A2Q1m;
set A2Q1;
if ranuni(0)<.4;
title ' sampling without replacement';
proc print data=A2Q1m;
run;
Each observation is kept independently with probability 0.4, and ranuni(0) takes its seed from the system clock, so the selected observations and the sample size change from run to run.
possible output:
sampling without replacement
Obs     n          y
  1     2    102.734
  2     4    104.082
  3     9     99.087
  4    13    101.572
  5    15    101.197
  6    17    100.110
  7    24     99.982
  8    27    104.141
  9    28    104.492
 10    30    103.581
(c) The input is:
proc univariate data=a2Q1m mu0=100;
var y;
run;
possible output:
sampling without replacement
The UNIVARIATE Procedure
Variable: y

Moments
N                          10    Sum Weights                 10
Mean                102.09782    Sum Observations     1020.9782
Std Deviation      1.97276849    Variance            3.89181553
Skewness           -0.2153613    Kurtosis             -1.611451
Uncorrected SS     104274.674    Corrected SS        35.0263398
Coeff Variation    1.93223372    Std Error Mean      0.62384417

Basic Statistical Measures
Location                  Variability
Mean      102.0978        Std Deviation          1.97277
Median    102.1528        Variance               3.89182
Mode             .        Range                  5.40463
                          Interquartile Range    3.97179

Tests for Location: Mu0=100
Test           -Statistic-       -----p Value------
Student's t    t    3.36273      Pr > |t|     0.0084
Sign           M          3      Pr >= |M|    0.1094
Signed Rank    S       23.5      Pr >= |S|    0.0137
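The Student’s t-test has a p-value of 0.0084, which is less than 5%; therefore we reject H0 and
conclude that the mean value is not equal to 100. Wilcoxon’s signed rank test (p-value 0.0137)
confirms that the average value is not equal to 100. The sign test (p-value 0.1094) does not
reject H0 at the 5% level; it uses only the signs of the deviations from 100 and is therefore
less accurate.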
(5 marks)
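As a cross-check of the Tests for Location table in Question 1(c), the t statistic and the sign test p-value can be recomputed by hand. The sketch below is written in Python rather than SAS and uses only the standard library; the variable names are illustrative, and the ten y values are copied from the printed output in (b), so small differences from the SAS figures are due to rounding to three decimals.

```python
from math import comb, sqrt
from statistics import mean, stdev

# y values of the 10 sampled observations, copied from the output in (b)
y = [102.734, 104.082, 99.087, 101.572, 101.197,
     100.110, 99.982, 104.141, 104.492, 103.581]
mu0 = 100.0

# Student's t statistic: (sample mean - mu0) / (s / sqrt(n))
n = len(y)
t_stat = (mean(y) - mu0) / (stdev(y) / sqrt(n))

# Sign test: M = (n_plus - n_minus) / 2 with a two-sided exact binomial p-value
n_plus = sum(v > mu0 for v in y)
n_minus = sum(v < mu0 for v in y)
m_stat = (n_plus - n_minus) / 2
n_eff = n_plus + n_minus            # observations equal to mu0 are dropped
k = min(n_plus, n_minus)
p_sign = min(1.0, 2 * sum(comb(n_eff, i) for i in range(k + 1)) / 2 ** n_eff)

print(f"t = {t_stat:.3f}, M = {m_stat}, sign test p = {p_sign:.4f}")
```

Only 2 of the 10 values lie below 100, which is why the sign test, ignoring the sizes of the deviations, gives a weaker result than the t-test.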
Question 2:
(a) The SAS procedure is given by:
PROC REG Data = A2Q2;
MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t;
RUN;
The relevant part of the SAS output is given by:
Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1     -88.18317     14.37005      -6.14      <.0001
S1_t          1       0.07915      0.00327      24.23      <.0001
LR2_t         1     173.40100    895.48752       0.19      0.8465
R2_t          1      26.18427      0.94783      27.63      <.0001
LR1_t         1     407.37338    607.44108       0.67      0.5027
R1_t          1      -6.85785      0.44933     -15.26      <.0001
From the SAS output, the p-values of the t-tests on the coefficients of S1_t, R2_t and R1_t
are all below 0.0001, while those for LR2_t (0.8465) and LR1_t (0.5027) exceed 10%. Hence, we
conclude that all independent variables except LR1_t and LR2_t are significant at the 10%
significance level.
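Each t value in the Parameter Estimates table is the parameter estimate divided by its standard error. A minimal sketch of that arithmetic, in Python rather than SAS and with illustrative names, using the printed (rounded) figures:

```python
# (estimate, standard error) pairs copied from the Parameter Estimates table
estimates = {
    "Intercept": (-88.18317, 14.37005),
    "S1_t": (0.07915, 0.00327),
    "LR2_t": (173.40100, 895.48752),
    "R2_t": (26.18427, 0.94783),
    "LR1_t": (407.37338, 607.44108),
    "R1_t": (-6.85785, 0.44933),
}

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:<10} t = {t:7.2f}")
```

The results agree with the SAS column up to the rounding of the printed estimates and standard errors. The very large standard errors of LR1_t and LR2_t are what make their t values small.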
(b) The SAS procedure and the relevant part of the SAS output are given by:
PROC REG Data = A2Q2;
MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t / p;
OUTPUT OUT = A2Q2NS p = predict;
RUN;
PROC PLOT Data = A2Q2NS;
PLOT S2_t*predict;
RUN;
Plot of S2_t*predict.
Legend: A = 1 obs, B = 2 obs, etc.
[Scatter plot: S2_t on the vertical axis (0 to 3000) against Predicted Value of S2_t on the horizontal axis (-500 to 2500)]
NOTE: 5 obs had missing values.
88 obs hidden.
From the SAS output, the plot of the observed values of S2_t against their corresponding
predicted values deviates seriously from a straight line. Hence, the linearity assumption
of the regression model is not valid for the data set, and the regression model does not
fit the data set very well (adjusted R-square = 0.7597).
(c) The SAS procedure is given by:
PROC REG Data = A2Q2;
MODEL S2_t = S1_t LR2_t R2_t LR1_t R1_t;
OUTPUT OUT = A2Q2NS1 student = residual;
RUN;
PROC PLOT Data = A2Q2NS1;
PLOT residual*LR1_t;
RUN;
The relevant part of the SAS output is given by:
Plot of residual*LR1_t.
Legend: A = 1 obs, B = 2 obs, etc.
[Scatter plot: Studentized Residual on the vertical axis (-4 to 8) against LR1_t on the horizontal axis (-0.08 to 0.1)]
NOTE: 5 obs had missing values.
From the SAS output, the residuals fluctuate around the zero level and look quite random.
However, the positive residuals seem to reach larger magnitudes than the negative ones, and
the fluctuations seem to depend on the level of LR1_t. From the residual plot, we can also
identify some outliers in the data set.
(5 marks)
Question 3:
(a) The SAS procedure is given by:
proc reg data=A2Q2;
model S2_t= S1_t R1_t R2_t LR1_t LR2_t/selection= rsquare adjrsq cp;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: S2_t
R-Square Selection Method

Number of Observations Read                     752
Number of Observations Used                     747
Number of Observations with Missing Values        5

Number in                Adjusted
  Model      R-Square    R-Square        C(p)    Variables in Model
      1        0.5438      0.5432    673.0083    R2_t
      1        0.5148      0.5141    763.0125    S1_t
      1        0.1727      0.1716    1824.782    R1_t
      1        0.0136      0.0122    2318.701    LR1_t
      1        0.0095      0.0081    2331.408    LR2_t
---------------------------------------------------------------------------
      2        0.6845      0.6837    238.1280    S1_t R2_t
      2        0.5635      0.5623    613.7550    R1_t R2_t
      2        0.5517      0.5505    650.3115    R2_t LR1_t
      2        0.5492      0.5480    658.2073    R2_t LR2_t
      2        0.5154      0.5140    763.2571    S1_t LR1_t
      2        0.5152      0.5139    763.6300    S1_t LR2_t
      2        0.5148      0.5135    764.9497    S1_t R1_t
      2        0.1828      0.1806    1795.433    R1_t LR1_t
      2        0.1792      0.1770    1806.706    R1_t LR2_t
      2        0.0138      0.0111    2319.993    LR1_t LR2_t
---------------------------------------------------------------------------
      3        0.7601      0.7592      5.5354    S1_t R1_t R2_t
      3        0.6862      0.6849    235.0184    S1_t R2_t LR1_t
      3        0.6857      0.6844    236.5475    S1_t R2_t LR2_t
      3        0.5719      0.5702    589.6471    R1_t R2_t LR1_t
      3        0.5695      0.5677    597.2604    R1_t R2_t LR2_t
      3        0.5519      0.5501    651.7302    R2_t LR1_t LR2_t
      3        0.5154      0.5134    765.1615    S1_t R1_t LR1_t
      3        0.5154      0.5134    765.2569    S1_t LR1_t LR2_t
      3        0.5153      0.5133    765.5516    S1_t R1_t LR2_t
      3        0.1832      0.1799    1796.077    R1_t LR1_t LR2_t
---------------------------------------------------------------------------
      4        0.7612      0.7600      4.0375    S1_t R1_t R2_t LR1_t
      4        0.7611      0.7598      4.4498    S1_t R1_t R2_t LR2_t
      4        0.6862      0.6845    236.9415    S1_t R2_t LR1_t LR2_t
      4        0.5720      0.5697    591.2971    R1_t R2_t LR1_t LR2_t
      4        0.5154      0.5128    767.1609    S1_t R1_t LR1_t LR2_t
---------------------------------------------------------------------------
      5        0.7613      0.7597      6.0000    S1_t R1_t R2_t LR1_t LR2_t
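(i) Based on R-square, the “best” regression model is the full model (R-square = 0.7613). R-square can never decrease when a variable is added, so this criterion always favours the full model.
(ii) Based on the adjusted R-square, the “best” regression model is the model without LR2_t (adjusted R-square = 0.7600).
(iii) Based on Mallows’ Cp statistic, the “best” regression model is also the model without LR2_t (Cp = 4.0375, the smallest value).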
(b) The SAS procedure is given by:
proc reg data=A2Q2;
model S2_t= S1_t R1_t R2_t LR1_t LR2_t/Selection = forward sle = 0.05;
RUN;
The REG Procedure
Model: MODEL1
Dependent Variable: S2_t

Number of Observations Read                     752
Number of Observations Used                     747
Number of Observations with Missing Values        5

Forward Selection: Step 1
Variable R2_t Entered: R-Square = 0.5438 and C(p) = 673.0083

Analysis of Variance
                             Sum of        Mean
Source              DF      Squares      Square    F Value    Pr > F
Model                1     35814558    35814558     888.00    <.0001
Error              745     30047143       40332
Corrected Total    746     65861702

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept    -270.31683     16.60755      10685187     264.93    <.0001
R2_t           26.94211      0.90412      35814558     888.00    <.0001

Bounds on condition number: 1, 1
---------------------------------------------------------------------------
Forward Selection: Step 2
Variable S1_t Entered: R-Square = 0.6845 and C(p) = 238.1280

Analysis of Variance
                             Sum of        Mean
Source              DF      Squares      Square    F Value    Pr > F
Model                2     45084988    22542494     807.23    <.0001
Error              744     20776714       27926
Corrected Total    746     65861702

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept    -193.56306     14.44705       5012915     179.51    <.0001
S1_t            0.06408      0.00352       9270430     331.97    <.0001
R2_t           17.98320      0.89876      11180284     400.36    <.0001

Bounds on condition number: 1.4272, 5.7087
---------------------------------------------------------------------------
Forward Selection: Step 3
Variable R1_t Entered: R-Square = 0.7601 and C(p) = 5.5354

Analysis of Variance
                             Sum of        Mean
Source              DF      Squares      Square    F Value    Pr > F
Model                3     50062952    16687651     784.80    <.0001
Error              743     15798750       21263
Corrected Total    746     65861702

              Parameter     Standard
Variable       Estimate        Error    Type II SS    F Value    Pr > F
Intercept     -87.74372     14.37898        791790      37.24    <.0001
S1_t            0.07994      0.00324      12948623     608.96    <.0001
R1_t           -6.87660      0.44943       4977964     234.11    <.0001
R2_t           26.14860      0.94861      16156915     759.84    <.0001

Bounds on condition number: 2.1822, 17.58
---------------------------------------------------------------------------
No other variable met the 0.0500 significance level for entry into the model.

Summary of Forward Selection
        Variable    Number     Partial       Model
Step    Entered     Vars In    R-Square    R-Square       C(p)    F Value    Pr > F
   1    R2_t              1      0.5438      0.5438    673.008     888.00    <.0001
   2    S1_t              2      0.1408      0.6845    238.128     331.97    <.0001
   3    R1_t              3      0.0756      0.7601     5.5354     234.11    <.0001
From the SAS output, forward selection with a 5% significance level for entry gives a “best”
regression model with 3 variables: R2_t, S1_t and R1_t.
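The R-Square, Partial R-Square and overall F values reported for the forward selection steps follow directly from the sums of squares in the Analysis of Variance tables. The sketch below (Python rather than SAS, illustrative names, standard library only) recomputes them from the printed sums of squares, which SAS has rounded to integers.

```python
# Sums of squares and degrees of freedom copied from the Analysis of
# Variance tables of the three forward selection steps
ss_total = 65861702
steps = [
    # (variable entered, model SS, model DF, error SS, error DF)
    ("R2_t", 35814558, 1, 30047143, 745),
    ("S1_t", 45084988, 2, 20776714, 744),
    ("R1_t", 50062952, 3, 15798750, 743),
]

results = []
previous_r2 = 0.0
for name, ss_model, df_model, ss_error, df_error in steps:
    r2 = ss_model / ss_total                  # model R-square
    partial_r2 = r2 - previous_r2             # gain from the variable just entered
    f_model = (ss_model / df_model) / (ss_error / df_error)  # overall F value
    results.append((name, partial_r2, r2, f_model))
    previous_r2 = r2

for name, partial_r2, r2, f_model in results:
    print(f"{name:<5} partial R2 = {partial_r2:.4f}  R2 = {r2:.4f}  F = {f_model:.2f}")
```

Note that the F value in the Summary of Forward Selection table is the partial F for the variable just entered (Type II SS divided by the error mean square), not the overall F computed here, so the two coincide only at step 1.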
(c) The SAS procedure is given by:
proc reg data=A2Q2ns;
model S2_t= S1_t R1_t R2_t LR1_t LR2_t/selection= backward sls=0.10;
run;
Part of the output:
Backward Elimination: Step 0
All Variables Entered: R-Square = 0.7613 and C(p) = 6.0000
Backward Elimination: Step 1
Variable LR2_t Removed: R-Square = 0.7612 and C(p) = 4.0375
All variables left in the model are significant at the 0.1000 level.
Summary of Backward Elimination
        Variable    Number     Partial       Model
Step    Removed     Vars In    R-Square    R-Square      C(p)    F Value    Pr > F
   1    LR2_t             4      0.0000      0.7612    4.0375       0.04    0.8465
Hence backward elimination removes the variable LR2_t from the model.
(d) From the SAS output, the adjusted R-square and Cp criteria in (a) and the backward
elimination in (c) choose the same “best” regression model, namely the model without LR2_t.
The choice in (b) differs slightly: it also leaves out the variable LR1_t. The decisions will
all be the same if the significance level is set at 10% for (b) as well.
(5 marks)
Question 4:
(a) The problem can be formulated as a one-factor design. The analysis can be performed with
the one-way ANOVA method or the Kruskal-Wallis test.
Data A2Q4;
INPUT BRL $ RL @@;
Datalines;
L 1.25 L 1.17 L 1.32 L 1.18 L 1.62 L 1.11 L 1.32 L 1.31 L 1.33
M 1.28 M 1.36 M 1.12 M 1.22 M 1.36 M 1.21 M 1.33 M 1.28 M 1.13
H 1.12 H 1.33 H 1.26 H 1.30 H 1.28 H 1.18 H 1.10 H 1.16 H 1.62
;
RUN;
(b) Assuming normality of the data, the problem can be assessed by ANOVA:
PROC ANOVA Data = A2Q4;
Class BRL;
MODEL RL = BRL;
RUN;
The SAS output is given by:
The ANOVA Procedure
Class Level Information
Class    Levels    Values
BRL           3    H L M
Number of observations    27

The ANOVA Procedure
Dependent Variable: RL
                               Sum of
Source              DF        Squares    Mean Square    F Value    Pr > F
Model                2     0.00642963     0.00321481       0.18    0.8393
Error               24     0.43731111     0.01822130
Corrected Total     26     0.44374074

R-Square    Coeff Var    Root MSE     RL Mean
0.014490     10.64125    0.134986    1.268519

Source              DF       Anova SS    Mean Square    F Value    Pr > F
BRL                  2     0.00642963     0.00321481       0.18    0.8393
From the SAS output, the p-value of the F-test is 0.8393 > 0.05. Hence, we do not reject H0 and
conclude that, at the 5% significance level, there is no significant difference in the readings
of the meter under the three background radiation levels.
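The one-way ANOVA table in Question 4(b) can be reproduced from the raw readings by splitting the total variation into a between-level part and a within-level part. The following is a minimal sketch in Python rather than SAS, using only the standard library; the names are illustrative and the data are copied from the DATA step in (a).

```python
# Readings copied from the DATA step in (a), grouped by background radiation level
groups = {
    "L": [1.25, 1.17, 1.32, 1.18, 1.62, 1.11, 1.32, 1.31, 1.33],
    "M": [1.28, 1.36, 1.12, 1.22, 1.36, 1.21, 1.33, 1.28, 1.13],
    "H": [1.12, 1.33, 1.26, 1.30, 1.28, 1.18, 1.10, 1.16, 1.62],
}

all_values = [x for values in groups.values() for x in values]
n_total = len(all_values)
grand_mean = sum(all_values) / n_total

# Between-level (Model) and within-level (Error) sums of squares
ss_between = sum(
    len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
)
ss_within = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1        # 3 levels - 1
df_within = n_total - len(groups)   # 27 observations - 3 levels
f_value = (ss_between / df_between) / (ss_within / df_within)

print(f"SS model = {ss_between:.8f}, SS error = {ss_within:.8f}, F = {f_value:.2f}")
```

The three level means are so close to the grand mean that the Model sum of squares is tiny compared with the Error sum of squares, which is what produces the small F value.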
(c) Without relying on the normality assumption, the Kruskal-Wallis test
may be used:
PROC npar1way Data = A2Q4 wilcoxon;
Class BRL;
var RL ;
RUN;
The NPAR1WAY Procedure
Wilcoxon Scores (Rank Sums) for Variable RL
Classified by Variable BRL
               Sum of    Expected      Std Dev      Mean
BRL     N      Scores    Under H0     Under H0     Score
L       9      135.00       126.0    19.403608     15.00
M       9      130.50       126.0    19.403608     14.50
H       9      112.50       126.0    19.403608     12.50
Average scores were used for ties.

Kruskal-Wallis Test
Chi-Square             0.5020
DF                          2
Pr > Chi-Square        0.7780
The p-value of the test is 0.7780 > 0.05, so H0 is again not rejected; the test gives the same result as (b).
(5 marks)
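The Kruskal-Wallis output in Question 4(c) can also be traced by hand: pool the 27 readings, rank them with average ranks for ties, sum the ranks per level, and apply the tie-corrected statistic. The sketch below is in Python rather than SAS with illustrative names. It assumes exactly three groups, so that the chi-square reference distribution has 2 degrees of freedom and its upper tail is simply exp(-H/2).

```python
from math import exp

# Readings copied from the DATA step in Question 4(a)
groups = {
    "L": [1.25, 1.17, 1.32, 1.18, 1.62, 1.11, 1.32, 1.31, 1.33],
    "M": [1.28, 1.36, 1.12, 1.22, 1.36, 1.21, 1.33, 1.28, 1.13],
    "H": [1.12, 1.33, 1.26, 1.30, 1.28, 1.18, 1.10, 1.16, 1.62],
}

# Pool the data and give tied values the average of the ranks they occupy
pooled = sorted(x for values in groups.values() for x in values)
n_total = len(pooled)
rank_of = {}
tie_term = 0                      # sum of (t^3 - t) over sets of t tied values
i = 0
while i < n_total:
    j = i
    while j < n_total and pooled[j] == pooled[i]:
        j += 1
    t = j - i
    rank_of[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1, ..., j
    tie_term += t ** 3 - t
    i = j

rank_sums = {name: sum(rank_of[x] for x in v) for name, v in groups.items()}

# Kruskal-Wallis statistic, then the correction for ties
h = 12 / (n_total * (n_total + 1)) * sum(
    rank_sums[name] ** 2 / len(v) for name, v in groups.items()
) - 3 * (n_total + 1)
h /= 1 - tie_term / (n_total ** 3 - n_total)

# Chi-square upper tail with 2 degrees of freedom (valid for 3 groups only)
p_value = exp(-h / 2)

print(rank_sums, f"H = {h:.4f}", f"p = {p_value:.4f}")
```

Under H0 each level of 9 readings has an expected rank sum of 9 × 28 / 2 = 126, which matches the Expected Under H0 column.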
Question 5:
data A2Q5;
input Reactant $ Catalyst $ yield;
cards;
A- B- 28.0
A- B- 25.0
A- B- 27.0
A- B+ 18.0
A- B+ 19.0
A- B+ 23.0
A+ B- 36.0
A+ B- 32.0
A+ B- 32.0
A+ B+ 31.0
A+ B+ 30.0
A+ B+ 29.0
;
run;
proc glm data=a2q5;
class reactant catalyst;
model yield= reactant|catalyst;
run;
The GLM Procedure
Class Level Information
Class       Levels    Values
Reactant         2    A+ A-
Catalyst         2    B+ B-
Number of observations    12

The GLM Procedure
Dependent Variable: yield
                               Sum of
Source              DF        Squares    Mean Square    F Value    Pr > F
Model                3    291.6666667     97.2222222      24.82    0.0002
Error                8     31.3333333      3.9166667
Corrected Total     11    323.0000000

R-Square    Coeff Var    Root MSE    yield Mean
0.902993     7.196571    1.979057      27.50000

Source                DF      Type I SS    Mean Square    F Value    Pr > F
Reactant               1    208.3333333    208.3333333      53.19    <.0001
Catalyst               1     75.0000000     75.0000000      19.15    0.0024
Reactant*Catalyst      1      8.3333333      8.3333333       2.13    0.1828

Source                DF    Type III SS    Mean Square    F Value    Pr > F
Reactant               1    208.3333333    208.3333333      53.19    <.0001
Catalyst               1     75.0000000     75.0000000      19.15    0.0024
Reactant*Catalyst      1      8.3333333      8.3333333       2.13    0.1828
As the p-values for Reactant (<.0001) and Catalyst (0.0024) are both less than 10%, there is a
treatment effect for Reactant and a treatment effect for Catalyst. There is no significant
interaction between the two factors, as the corresponding p-value is 0.1828 > 0.1.
(5 marks)
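Because the design in Question 5 is balanced (3 replicates in each of the 4 cells), the sums of squares in the GLM table can be recovered with simple averages, and the Type I and Type III values coincide. A minimal sketch in Python rather than SAS, with illustrative names and the yields copied from the DATA step:

```python
# Yields copied from the DATA step, keyed by (Reactant, Catalyst)
cells = {
    ("A-", "B-"): [28.0, 25.0, 27.0],
    ("A-", "B+"): [18.0, 19.0, 23.0],
    ("A+", "B-"): [36.0, 32.0, 32.0],
    ("A+", "B+"): [31.0, 30.0, 29.0],
}


def mean(values):
    return sum(values) / len(values)


all_values = [y for ys in cells.values() for y in ys]
n_total = len(all_values)
grand = mean(all_values)


def main_effect_ss(position):
    """Sum of squares for the factor stored at this key position (0 or 1)."""
    total = 0.0
    for level in {key[position] for key in cells}:
        ys = [y for key, v in cells.items() if key[position] == level for y in v]
        total += len(ys) * (mean(ys) - grand) ** 2
    return total


ss_reactant = main_effect_ss(0)
ss_catalyst = main_effect_ss(1)
ss_cells = sum(len(v) * (mean(v) - grand) ** 2 for v in cells.values())
ss_interaction = ss_cells - ss_reactant - ss_catalyst   # valid for a balanced design
ss_error = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)

mse = ss_error / (n_total - len(cells))   # error DF = 12 - 4 = 8
# Every effect has 1 degree of freedom here, so F = SS / MSE
f_values = {
    "Reactant": ss_reactant / mse,
    "Catalyst": ss_catalyst / mse,
    "Reactant*Catalyst": ss_interaction / mse,
}

for name, f in f_values.items():
    print(f"{name:<18} F = {f:.2f}")
```

The subtraction used for the interaction sum of squares relies on the equal cell sizes; with unequal replication the main effects are no longer orthogonal and this shortcut would not reproduce the Type III values.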
~ End of Solutions to Assignment 2~