Lecture Using Proc RSQUARE

PROC ROBUSTREG & Evaluating Regression Analyses
with the Help of PROC RSQUARE
Animal Science 500
Lecture No. 10
October 5, 2010
IOWA STATE UNIVERSITY
Department of Animal Science
PROC ROBUSTREG

The purpose of robust regression is to detect outliers and provide
stable results in the presence of outliers.


In order to achieve this stability, robust regression limits the influence of outliers.
Outliers can be classified as:

Problems with outliers in the y-direction (response direction)

Problems with multivariate outliers in the x-space (i.e., outliers in the
covariate space, which are also referred to as leverage points)

Problems with outliers in both the y-direction and the x-space

Two types of estimation methods

M estimation - the method for outlier detection and
robust regression when contamination is mainly in
the response direction (y).

LTS (least trimmed squares) estimation - the method used
when data contamination occurs in the x-space.
PROC ROBUSTREG M-estimation

The following ROBUSTREG statements analyze the
data:
proc robustreg data=stack;
   model y = x1 x2 x3 / diagnostics leverage;
   id x1;
   test x3;
run;
quit;
By default, the procedure performs M estimation with the bisquare weight function and uses the
median method to estimate the scale parameter.
The MODEL statement specifies the covariate effects.
The DIAGNOSTICS option requests a table of outlier diagnostics.
The LEVERAGE option adds leverage-point diagnostics for the continuous covariate effects to this
table.
The ID statement specifies that the variable x1 identifies each observation in this table. If the ID
statement is omitted, the observation number is used instead (in some cases this is the clearer
choice).
The TEST statement requests a test of significance for the covariate effects listed in it (here, x3).
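
The defaults described above can also be written out explicitly. The sketch below assumes that WORK.STACK holds the 21-observation stack loss data; the lecture does not show the DATA step, so the data lines are an assumption, although they do reproduce the means in the Summary Statistics output. METHOD=M(WF=BISQUARE SCALE=MED) is intended to request the same fit as the default, and the second call swaps in the Huber weight function purely for comparison.

```sas
/* Assumed source data: x1 = air flow, x2 = water temperature,
   x3 = acid concentration, y = stack loss. */
data stack;
   input x1 x2 x3 y @@;
   datalines;
80 27 89 42  80 27 88 37  75 25 90 37  62 24 87 28  62 22 87 18
62 23 87 18  62 24 93 19  62 24 93 20  58 23 87 15  58 18 80 14
58 18 89 14  58 17 88 13  58 18 82 11  58 19 93 12  50 18 89  8
50 18 86  7  50 19 72  8  50 19 79  8  50 20 80  9  56 20 82 15
70 20 91 15
;
run;

/* Same model as in the lecture, with the defaults spelled out. */
proc robustreg data=stack method=m(wf=bisquare scale=med);
   model y = x1 x2 x3 / diagnostics leverage;
   test x3;
run;

/* Alternative weight function, for comparison only. */
proc robustreg data=stack method=m(wf=huber);
   model y = x1 x2 x3 / diagnostics leverage;
run;
```

Comparing the two Parameter Estimates tables gives a quick feel for how sensitive the fit is to the choice of weight function.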
http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/rreg_sect3.htm
PROC ROBUSTREG example output M-estimation

The ROBUSTREG Procedure

Model Information
Data Set                  WORK.STACK
Dependent Variable        y
Number of Covariates      3
Number of Observations    21
Method                    M Estimation

Summary Statistics
Variable        Q1    Median        Q3      Mean   Std Dev       MAD
x1         53.0000   58.0000   62.0000   60.4286    9.1683    5.9304
x2         18.0000   20.0000   24.0000   21.0952    3.1608    2.9652
x3         82.0000   87.0000   89.5000   86.2857    5.3586    4.4478
y          10.0000   15.0000   19.5000   17.5238   10.1716    5.9304
Parameter Estimates
Parameter  DF   Estimate  Std Error  Lower 95%  Upper 95%  Chi-Square  Pr > ChiSq
Intercept   1   -42.2854     9.5045   -60.9138   -23.6569       19.79      <.0001
x1          1     0.9276     0.1077     0.7164     1.1387       74.11      <.0001
x2          1     0.6507     0.2940     0.0744     1.2270        4.90      0.0269
x3          1    -0.1123     0.1249    -0.3571     0.1324        0.81      0.3683
Scale       1     2.2819
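
The Chi-Square and confidence-limit columns can be checked by hand. The sketch below assumes they are Wald quantities, that is, chi-square = (estimate / standard error)**2 on 1 df and limits = estimate plus or minus the 0.975 normal quantile times the standard error. The lecture does not state this, but the assumption reproduces the x2 row to rounding.

```sas
/* Recompute the x2 row of the Parameter Estimates table.
   Assumes a Wald chi-square and normal-theory 95% limits. */
data _null_;
   estimate = 0.6507;
   se       = 0.2940;
   z        = probit(0.975);              /* 0.975 normal quantile  */
   chisq    = (estimate / se)**2;         /* Wald chi-square, 1 df  */
   p_value  = 1 - probchi(chisq, 1);
   lower    = estimate - z * se;
   upper    = estimate + z * se;
   put chisq= 8.2 p_value= 8.4 lower= 8.4 upper= 8.4;
run;
```

The values written to the log should be close to the tabled 4.90, 0.0269, 0.0744, and 1.2270; small differences in the last digit come from using the rounded estimate and standard error.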
Diagnostics
               Mahalanobis   Robust MCD                 Standardized
Obs          x1      Distance     Distance   Leverage   Robust Residual   Outlier
  1   80.000000        2.2536       5.5284      *                1.0995
  2   80.000000        2.3247       5.6374      *               -1.1409
  3   75.000000        1.5937       4.1972      *                1.5604
  4   62.000000        1.2719       1.5887                       3.0381      *
 21   70.000000        2.1768       3.6573      *               -4.5733      *
Diagnostics Summary
Observation Type    Proportion    Cutoff
Outlier                 0.0952    3.0000
Leverage                0.1905    3.0575
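
The summary values can be reproduced from the Diagnostics table. This sketch assumes that the outlier cutoff of 3 is applied to the absolute standardized robust residual, and that the leverage cutoff is the square root of the 0.975 quantile of a chi-square distribution with 3 df (one per covariate). Neither rule is stated in the lecture; they are assumptions that happen to match the numbers shown.

```sas
/* Check the Diagnostics Summary for the stack data by hand. */
data _null_;
   n             = 21;                    /* observations in WORK.STACK    */
   n_outlier     = 2;                     /* obs 4 and 21 are flagged      */
   n_leverage    = 4;                     /* obs 1, 2, 3 and 21 are flagged */
   prop_outlier  = n_outlier / n;
   prop_leverage = n_leverage / n;
   lev_cutoff    = sqrt(cinv(0.975, 3));  /* assumed rule, 3 covariates    */
   put prop_outlier= 6.4 prop_leverage= 6.4 lev_cutoff= 6.4;
run;
```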
PROC ROBUSTREG LTS-estimation

The following statements invoke the ROBUSTREG procedure with
the LTS estimation method.
proc robustreg data=hbk fwls method=lts;
   model y = x1 x2 x3 / diagnostics leverage;
   id index;
run;
quit;
http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/rreg_sect4.htm
PROC ROBUSTREG Output LTS-estimation

The ROBUSTREG Procedure

Model Information
Data Set                  WORK.HBK
Dependent Variable        y
Number of Covariates      3
Number of Observations    75
Method                    LTS Estimation
Summary Statistics
Variable        Q1    Median        Q3      Mean   Std Dev       MAD
x1          0.8000    1.8000    3.1000    3.2067    3.6526    1.9274
x2          1.0000    2.2000    3.3000    5.5973    8.2391    1.6309
x3          0.9000    2.1000    3.0000    7.2307   11.7403    1.7791
y          -0.5000    0.1000    0.7000    1.2787    3.4928    0.8896
LTS Profile
Total Number of Observations         75
Number of Squares Minimized          57
Number of Coefficients                4
Highest Possible Breakdown Value      0.2533
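
The profile values are consistent with a common default for LTS: the h = floor((3n + p + 1) / 4) smallest squared residuals are minimized, where p counts the intercept, and the breakdown value is (n - h + 1) / n. These formulas are an assumption here (the lecture does not state them), but they reproduce the 57 and 0.2533 shown above.

```sas
/* Reproduce the LTS Profile for the hbk data under the assumed formulas. */
data _null_;
   n = 75;                           /* total number of observations  */
   p = 4;                            /* intercept plus 3 covariates   */
   h = floor((3 * n + p + 1) / 4);   /* number of squares minimized   */
   breakdown = (n - h + 1) / n;      /* highest possible breakdown    */
   put h= breakdown= 6.4;
run;
```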
LTS Parameter Estimates
Parameter         DF   Estimate
Intercept          1    -0.3431
x1                 1     0.0901
x2                 1     0.0703
x3                 1    -0.0731
Scale (sLTS)       0     0.7451
Scale (Wscale)     0     0.5749
Diagnostics
           Mahalanobis   Robust MCD                 Standardized
Obs   index      Distance     Distance   Leverage   Robust Residual   Outlier
  1       1        1.9168      29.4424      *               17.0868      *
  3       2        1.8558      30.2054      *               17.8428      *
  5       3        2.3137      31.8909      *               18.3063      *
  7       4        2.2297      32.8621      *               16.9702      *
  9       5        2.1001      32.2778      *               17.7498      *
 11       6        2.1462      30.5892      *               17.5155      *
 13       7        2.0105      30.6807      *               18.8801      *
 15       8        1.9193      29.7994      *               18.2253      *
 17       9        2.2212      31.9537      *               17.1843      *
 19      10        2.3335      30.9429      *               17.8021      *
 21      11        2.4465      36.6384      *                0.0406
 23      12        3.1083      37.9552      *               -0.0874
 25      13        2.6624      36.9175      *                1.0776
 27      14        6.3816      41.0914      *               -0.7875
Diagnostics Summary
Observation Type    Proportion    Cutoff
Outlier                 0.1333    3.0000
Leverage                0.1867    3.0575
Parameter Estimates for Final Weighted Least Squares Fit
Parameter  DF   Estimate  Std Error  Lower 95%  Upper 95%  Chi-Square  Pr > ChiSq
Intercept   1    -0.1805     0.1044    -0.3852     0.0242        2.99      0.0840
x1          1     0.0814     0.0667    -0.0493     0.2120        1.49      0.2222
x2          1     0.0399     0.0405    -0.0394     0.1192        0.97      0.3242
x3          1    -0.0517     0.0354    -0.1210     0.0177        2.13      0.1441
Scale       0     0.5572

The final weighted least squares estimates are shown. These estimates are least
squares estimates computed after deleting the detected outliers.
PROC RSQUARE

- The RSQUARE procedure selects optimal subsets of independent variables
  in a multiple regression analysis.
- Regression coefficients and a variety of statistics useful for model
  selection can be printed or output to a SAS data set.
- In SAS Version 6 and later, the RSQUARE procedure is subsumed by PROC REG.
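
Because PROC RSQUARE has been folded into PROC REG, the same kind of all-subsets analysis can be sketched with SELECTION=RSQUARE in the MODEL statement of PROC REG. The data set SIM and its variables are simulated here purely for illustration and are not part of the lecture.

```sas
/* Simulated data for illustration only: y depends on x1 and x2, not x3. */
data sim;
   call streaminit(500);
   do obs = 1 to 40;
      x1 = rand('normal');
      x2 = rand('normal');
      x3 = rand('normal');
      y  = 2 + 1.5 * x1 + 0.5 * x2 + rand('normal');
      output;
   end;
run;

/* All-subsets regression, with models ranked by R-square within each size. */
proc reg data=sim;
   model y = x1 x2 x3 / selection=rsquare;
run;
quit;
```

With three regressors, this should list the seven possible subset models grouped by the number of variables in the model.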
PROC RSQUARE General Form

PROC RSQUARE options;
   MODEL dependents=independents / options;
   FREQ variable;
   WEIGHT variable;
   BY variables;
RUN;
QUIT;
PROC RSQUARE

- There must be one or more MODEL statements.
- The FREQ, WEIGHT, and BY statements can appear only once.
- The MODEL, FREQ, WEIGHT, and BY statements can appear in any order.
PROC RSQUARE Options

The following options can be specified in the PROC statement:

DATA=SASdataset

Names the SAS data set to be used. The data set can be an ordinary SAS data set
or a TYPE=CORR, COV, or SSCP data set. If the DATA= option is omitted, RSQUARE
uses the most recently created SAS data set.

SIMPLE|S

Prints means and standard deviations for every variable listed in a MODEL
statement.
CORR|C

Prints the correlation matrix for all variables in the analysis.

NOINT

Suppresses the intercept term from all models.

NOPRINT

Suppresses the regression printout.

OUTEST=SASdataset

Creates a TYPE=EST data set containing model-selection statistics and
parameter estimates for the selected models.
PROC RSQUARE Options

- The options listed in the MODEL Statement section can also be used in the
  PROC RSQUARE statement.
- Any option specified in the PROC statement applies to every MODEL statement
  except those in which you specify a different value of the option.
- Optional statistics will appear in the OUTEST= data set only if the
  corresponding options are specified in the PROC statement.
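
As a rough modern counterpart, the sketch below uses OUTEST= and NOPRINT in PROC REG to store the subset models in a data set instead of printing them. The data set names SIM and EST are hypothetical, and the data are simulated. Which optional statistics end up in EST depends on the options requested, so the data set is printed for inspection rather than assumed to have a particular layout.

```sas
/* Simulated data for illustration only. */
data sim;
   call streaminit(500);
   do obs = 1 to 40;
      x1 = rand('normal');
      x2 = rand('normal');
      x3 = rand('normal');
      y  = 2 + 1.5 * x1 + 0.5 * x2 + rand('normal');
      output;
   end;
run;

/* Suppress the printout and save one row per selected model in WORK.EST. */
proc reg data=sim outest=est noprint;
   model y = x1 x2 x3 / selection=rsquare adjrsq cp;
run;
quit;

/* Inspect what was stored. */
proc print data=est;
run;
```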
PROC RSQUARE Model Statement Options

MODEL dependents=independents/options;

Options are listed after a forward slash that follows the model
statement

SELECT=n

Specifies the maximum number of subset models of each size to be
printed or output to the OUTEST= data set.

If SELECT= is used without the B option, the variables in each MODEL
are listed in order of inclusion instead of the order in which they appear in
the MODEL statement.

If SELECT= is omitted and the number of regressors is less than 11, all
possible subsets are evaluated.

If SELECT= is omitted and the number of regressors is greater than 10,
the number of subsets selected is at most equal to the number of
regressors. A small value of SELECT= greatly reduces the CPU time
required for large problems.

INCLUDE=i

Requests that the first i variables after the equal sign in the MODEL statement
be included in every regression model. By default, no variables are required to
appear in every model.

START=n

Specifies the smallest number of regressors to be reported in a subset model.
The default value is one more than the value specified by the INCLUDE= option,
or one if INCLUDE= is omitted.
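
A hedged sketch of these options in PROC REG, which accepts INCLUDE=, START=, and STOP= with SELECTION=RSQUARE. BEST= is used here on the assumption that it is the PROC REG counterpart of SELECT=. The data set SIM is simulated for illustration only.

```sas
/* Simulated data for illustration only. */
data sim;
   call streaminit(500);
   do obs = 1 to 40;
      x1 = rand('normal');
      x2 = rand('normal');
      x3 = rand('normal');
      x4 = rand('normal');
      y  = 2 + 1.5 * x1 + 0.5 * x2 + rand('normal');
      output;
   end;
run;

proc reg data=sim;
   model y = x1 x2 x3 x4 / selection=rsquare
                           include=1   /* x1 is forced into every model        */
                           start=2     /* smallest reported model: 2 regressors */
                           stop=3      /* largest reported model: 3 regressors  */
                           best=2;     /* at most 2 models of each size         */
run;
quit;
```

Restricting the range of model sizes and the number of models per size is what keeps large problems manageable.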

STOP=n

Specifies the largest number of regressors to be reported in a subset model.
The default is the number of regressors listed in the MODEL statement.

ADJRSQ

Computes R-square adjusted for degrees of freedom for each model selected.

CP

Computes Mallows' Cp statistic for each model selected.

JP

Computes Jp, the estimated mean square error of prediction for each model
selected, assuming that the values of the regressors are fixed and that the
model is correct. The Jp statistic is also called the final prediction error
(FPE).

MSE

Computes the mean square error for each model selected.

SSE

Computes the error sum of squares for each model selected.

B

Computes estimated regression coefficients for each model selected.
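
To make the model-selection statistics concrete, the sketch below computes them by hand for a single candidate model. The formulas are the usual textbook definitions for a model with an intercept, with p counting the intercept, and are assumed rather than taken from the lecture. All input numbers are hypothetical.

```sas
/* Hypothetical inputs for one candidate subset model. */
data _null_;
   n       = 40;     /* number of observations                        */
   p       = 3;      /* parameters in the subset, including intercept */
   sst     = 120;    /* corrected total sum of squares                */
   sse     = 42;     /* error sum of squares of the subset model      */
   s2_full = 1.1;    /* error mean square of the full model           */

   rsq    = 1 - sse / sst;
   adjrsq = 1 - (n - 1) * (1 - rsq) / (n - p);   /* ADJRSQ */
   mse    = sse / (n - p);                       /* MSE    */
   cp     = sse / s2_full - (n - 2 * p);         /* CP     */
   jp     = (n + p) / n * mse;                   /* JP     */
   put rsq= 6.4 adjrsq= 6.4 mse= 6.4 cp= 6.2 jp= 6.4;
run;
```

A model with little bias has Cp close to p, which is why Cp is usually read against the number of parameters rather than on its own.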
PROC RSQUARE FREQ Statement

FREQ variable;

The FREQ statement treats the data set as if each observation appeared n times,
where n is the value of the FREQ variable for that observation.

When the procedure determines the degrees of freedom used to calculate
significance probabilities, the total number of observations is taken to be the
sum of the FREQ variable.
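
A minimal sketch of the FREQ statement, written for PROC REG (which accepts the same statement). The data set GROUPED and its values are hypothetical. The counts sum to 12, so the fit should be treated as having 12 observations rather than 5 rows.

```sas
/* Hypothetical grouped data: each row stands for COUNT identical records. */
data grouped;
   input x y count;
   datalines;
1  2.1 3
2  3.9 1
3  6.2 2
4  7.8 4
5 10.1 2
;
run;

/* The degrees of freedom are based on the sum of COUNT, not on the row count. */
proc reg data=grouped;
   model y = x;
   freq count;
run;
quit;
```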
If your data set includes a variable that gives the frequency of occurrence of
each observation, name that variable in the FREQ statement.
PROC RSQUARE WEIGHT Statement

WEIGHT variable;

The WEIGHT statement names a variable in the input data set whose values
are relative weights for a weighted least-squares fit. If the weight value is
proportional to the reciprocal of the variance for each observation, then the
weighted estimates are the best linear unbiased estimates (BLUE).

The WEIGHT and FREQ statements have similar effects, except in the
calculation of degrees of freedom.
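
A minimal sketch of a weighted fit, written for PROC REG (which accepts the same statement). The data set PENS and its values are hypothetical. Each response is taken to be a pen mean of n_pigs animals, so its variance is proportional to 1 / n_pigs and the weight is set to n_pigs.

```sas
/* Hypothetical pen means: y is the mean response of n_pigs animals. */
data pens;
   input feed y n_pigs;
   w = n_pigs;        /* proportional to 1 / variance of the pen mean */
   datalines;
2.0 0.71  4
2.3 0.78  8
2.6 0.86  6
2.9 0.90 10
3.2 0.97  5
3.5 1.01  7
;
run;

/* Weighted least squares; unlike FREQ, the df are based on the 6 rows. */
proc reg data=pens;
   model y = feed;
   weight w;
run;
quit;
```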
PROC RSQUARE BY Statement

BY variables;

A BY statement can be used with PROC RSQUARE to obtain separate analyses of
observations in groups defined by the BY variables.

When a BY statement appears, the procedure expects the input data set to be
sorted in order of the BY variables.

If the data have not already been sorted in ascending order, either:

Use PROC SORT with a similar BY statement to sort the data,

Or, where appropriate, use the NOTSORTED or DESCENDING option in the BY
statement (for example, DESCENDING if the data were previously sorted from
largest to smallest value for some other reason).

Most likely you will need to sort the data.
PROC RSQUARE BY Statement (with PROC SORT)

PROC SORT DATA=New;
   BY variable1;
RUN;

PROC RSQUARE DATA=New options;
   MODEL dependents=independents / options;
   FREQ variable;
   WEIGHT variable;
   BY variable1;
RUN;
QUIT;
PROC RSQUARE

- What we are building toward with PROC RSQUARE is selecting the best, or most
  predictive, model.
- Topic of the next lecture: Model Development and Selection of Variables.