Model Development and
Selection of Variables
Animal Science 500
Lecture No. 11
October 7, 2010
IOWA STATE UNIVERSITY
Department of Animal Science
Class Statement
• Variables included in the CLASS statement are referred to as class variables.
• The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
• Class variables represent the various levels of some factor or effect, for example:
  - Treatment (1, ..., n)
  - Season (spring, summer, fall, and winter, coded 1 through 4)
  - Breed
  - Color
  - Sex
  - Line
  - Day
  - Laboratory
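As a rough illustration of how class variables define subgroup combinations, the Python sketch below (not SAS) groups made-up records by the combination of two class variables and averages the response within each subgroup. The records, the variable names, and the function subgroup_means are hypothetical and used only for this example.

```python
from collections import defaultdict

# Hypothetical records: (treatment, sex, response). Values are made up for illustration.
records = [
    (1, "F", 10.0),
    (1, "F", 12.0),
    (1, "M", 14.0),
    (2, "F", 9.0),
    (2, "M", 11.0),
    (2, "M", 13.0),
]


def subgroup_means(rows):
    """Average the response within each combination of class-variable levels."""
    groups = defaultdict(list)
    for treatment, sex, response in rows:
        # The pair (treatment, sex) plays the role of one subgroup combination.
        groups[(treatment, sex)].append(response)
    return {levels: sum(vals) / len(vals) for levels, vals in sorted(groups.items())}


means = subgroup_means(records)
for levels, mean in means.items():
    print(levels, mean)
```

Two class variables with two levels each give four subgroup combinations here. Note that the levels are treated as labels, not as quantities.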
Class Variables
• Are usually things you would like to account for in your model
• Can be numeric or character
• Can be continuous values
• They are generally not used in regression analyses
  - What meaning would they have?
Class Statement Options
• ASCENDING: sorts the class variable in ascending order
• DESCENDING: sorts the class variable in descending order
Other options with the CLASS statement are generally specific to the procedure (PROC) being used, so they are not all covered here.
Discrete Variables
• A discrete variable is one that cannot take on all values within the limits of the variable.
  - Limited to whole numbers
  - For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
  - The variable cannot have the value 1.7. A variable such as a person's height can take on any value.
• Discrete variables also are of two types:
  1. unorderable (also called nominal variables)
  2. orderable (also called ordinal)
Discrete Variables
• Data are sometimes called categorical, as the observations may fall into one of a number of categories, for example:
  - Any trait where you score the value
    - Lameness scores
    - Body condition scores
    - Soundness scoring
      - Reproductive
      - Feet and leg
    - Behavioral traits
      - Fear test
      - Back test
      - Vocal scores
    - Body lesion scores
Discrete Variables
• When do discrete variables become continuous, or do they?
• Is a trait like number born alive considered discrete or continuous?
Model Development and Selection of Variables
Example:
The general problem addressed is to identify
important soil characteristics influencing aerial
biomass production of marsh grass, Spartina
alterniflora.
Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Explanation of the Assumptions
1. Linear functional form
   - Does not detect curvilinear relationships
2. The observations are independent
   - Representative sample from some larger population
   - If the observations are not independent, the result is autocorrelation, which inflates the t, r, and F statistics, which in turn distorts the significance tests
3. Normality of the residuals
   - Permits proper significance testing, similar to ANOVA and other statistical procedures
4. Equal variance (no heterogeneous variance)
   - Heteroskedasticity precludes generalization and external validity
   - This too distorts the significance tests being used
5. Multicollinearity (many of the traits exhibit collinearity)
   - Biases parameter estimation
   - Can prevent the analysis from running or converging (getting your answers)
6. Severe or several outliers will distort the results and may bias the results
   - If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
Example Data Origination
(Dr. P. J. Berger)
Data: The data were published as an exercise
by Rawlings (1988) and originally appeared as a
study by Dr. Rick Linthurst, North Carolina State
University (1979). The purpose of his research
was to identify the important soil characteristics
influencing aerial biomass production of the
marsh grass, Spartina alterniflora in the Cape
Fear Estuary of North Carolina. The design for
collecting data was such that there were three
types of Spartina vegetation in each of three
locations, and five random sites within each
location-by-vegetation-type combination.
Example Variables
Data:
The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (these are the independent variables):
1. Salinity
2. Acidity
3. Potassium
4. Sodium
5. Zinc
Example Data
Objective:
• Find the substrate variable, or combination of variables, showing the strongest relationship to biomass.
Or,
• From the list of five independent variables of salinity, acidity, potassium, sodium, and zinc, find the combination of one or more variables that has the strongest relationship with aerial biomass.
• Find the independent variables that can be used to predict aerial biomass.
Example analysis
• The REG Procedure: INTRODUCTION
• The REG procedure fits linear regression models by least squares.
• SPECIFICATIONS
    PROC REG;
    MODEL dependents = regressors / options;
Example analysis
• The RSQUARE Procedure: RECALL
• The RSQUARE procedure selects optimal subsets of independent variables in a multiple regression analysis.
Example analysis
PROC RSQUARE options;
MODEL dependents = independents / options;
(Options can appear in either the PROC RSQUARE statement or any MODEL statement.)
• SELECT=n specifies the maximum number of subset models
• INCLUDE=i requests that the first i variables after the equal sign be included in every regression
• SIGMA=n specifies the true standard deviation of the error term
• ADJRSQ computes R² adjusted for degrees of freedom
• CP computes Mallows' Cp statistic
Example analysis
PROC RSQUARE DATA=name OUTEST=EST ADJRSQ MSE CP SELECT=n;
MODEL dependent = variable list;
Example analysis
PROC PRINT DATA=EST;
RUN;
PROC PLOT DATA=EST;
PLOT _CP_*_P_ = 'C' _P_*_P_ = 'P' / OVERLAY;
PLOT _MSE_*_P_ = 'M';
RUN;
QUIT;
PROC STEPWISE
The STEPWISE procedure provides five methods
for stepwise regression.
General form:
PROC STEPWISE;
MODEL dependents = independents / options;
Run;
Quit;
** Assumes that you have at least one dependent variable and two or more independent
variables. If only one independent variable exists, then you are just doing a simple
regression of y on x.
Types of Regression
• Uses of PROC REG for standard problems:
1. PROC REG;      /* simple linear regression */
   model y = x;
2. PROC REG;      /* weighted linear regression */
   model y = x;
   weight w;
3. PROC REG;      /* multiple regression */
   model y = x1 x2 x3;
PROC REG
General form:
PROC REG;
MODEL dependents = independents / options;
Options available include:
NOINT: regression with no intercept
FORWARD (requested as SELECTION=FORWARD in current releases of PROC REG)
• A forward selection analysis starts out with no predictors in the model.
• Each predictor chosen by the user is evaluated to see how much R² is increased by adding it to the model.
• The predictor that increases R² the most is added if it meets the statistical condition for entry.
• With SAS, the statistical condition is the significance level for the increase in R² produced by adding the predictor.
• If no predictor meets the condition, the analysis stops.
• If a predictor is added, the second step re-evaluates all of the available predictors that have not yet been entered into the model.
• If any satisfy the statistical condition for entry, the predictor that increases R² the most is added.
• This process continues until no remaining predictor can enter.
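The forward selection steps can be sketched in plain Python (not SAS). This is a minimal illustration under stated assumptions, not how PROC REG is implemented: the entry condition is a fixed F-to-enter threshold of 4.0 standing in for a significance level, the data are synthetic and constructed so that x2 adds nothing once x1 is in the model, and the names solve, sse, and forward_select are hypothetical helpers.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sse(y, columns):
    """Error sum of squares from least squares of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def forward_select(y, predictors, f_to_enter=4.0):
    """Add, one at a time, the predictor with the largest partial F (largest R-square gain)."""
    selected = []
    current_sse = sse(y, [])  # intercept-only model
    n = len(y)
    while len(selected) < len(predictors):
        best = None
        for name in predictors:
            if name in selected:
                continue
            trial = sse(y, [predictors[s] for s in selected + [name]])
            df_error = n - (len(selected) + 1) - 1
            f_stat = (current_sse - trial) / (trial / df_error)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, trial)
        if best is None or best[1] < f_to_enter:
            break  # no remaining predictor meets the entry condition
        selected.append(best[0])
        current_sse = best[2]
    return selected


# Synthetic data (made up): y depends on x1 and x3; x2 is correlated with x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [1, -1, -1, 1, 1, -1, -1, 1]
noise = [0.2, -0.1, 0.1, -0.2, -0.2, 0.1, -0.1, 0.2]
y = [2 * a + 3 * b + e for a, b, e in zip(x1, x3, noise)]
predictors = {"x1": x1, "x2": x2, "x3": x3}

chosen = forward_select(y, predictors)
print(chosen)
```

With these synthetic data, x1 enters first, x3 enters second, and x2 never meets the entry condition because it explains nothing beyond what x1 already explains.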
PROC REG
Options available include:
BACKWARD
• In a backward elimination analysis we start out with all of the predictors in the model.
• At each step we evaluate the predictors that are in the model and eliminate any that meet the criterion for removal.
STEPWISE
• Stepwise selection begins like forward selection. However, at each step the variables already in the model are first evaluated for removal. Among those meeting the removal criterion, the one whose removal lowers R² the least is removed.
• How can a variable enter and then leave later? If two predictors have both entered the model, one may later be removed because they are highly correlated and removing one changes R² very little, if at all.
PROC REG
Options available include:
MAXR
• The maximum R² option does not settle on a single model. Instead, it tries to find the "best" one-variable model, the "best" two-variable model, and so forth.
• MAXR starts out by finding the one-variable model producing the greatest R².
• It then adds the variable that increases R² the most, giving a two-variable model, and continues this process until adding another variable is no better than the previous model (for example, adding a 4th variable does not significantly improve R² compared with the 3-variable model).
• The difference between the STEPWISE and MAXR options is that all switches are evaluated before any switch is made in the MAXR method.
• Using the STEPWISE option, the "worst" variable may be removed without considering what adding the "best" remaining variable might accomplish.
PROC REG
Options available include:
MINR
• The MINR option closely resembles the MAXR method. However, the switch chosen with the MINR option is the switch that produces the smallest increase in R². In a way, it approaches the "best" model in reverse compared with MAXR.
PROC REG
Options available include:
SLE=value
• Sets the criterion for entry into the model: the user-defined significance level that the change (Δ) in R² from adding a predictor must meet.
SLS=value
• Sets the criterion for staying in the model: the user-defined significance level that a predictor's contribution (Δ) to R² must meet for the predictor to stay in the model.
PROC REG
• The default significance levels differ by selection method unless changed by the user.
• The defaults are:
  BACKWARD: SLSTAY = 0.10
  FORWARD: SLENTRY = 0.50
  STEPWISE: SLENTRY = 0.15 and SLSTAY = 0.15
• The user can set these with the SLENTRY= and SLSTAY= options, for example / SLSTAY=.05.
Significance Tests for the Regression Coefficients
1. The significance of the parameter estimates is found by using the F or t test (shown in a couple of slides).
2. R² (R-Square) is the proportion of variation in the dependent variable (Y) that can be explained by the predictors (X variables) in the regression model.
3. Adjusted R²: Predictors could be added to the model which would continue to improve the ability of the predictors to explain the dependent variable. Some of the improvement in R-Square would be simply due to chance variation. The adjusted R-Square attempts to yield a more honest estimate of R-Square.
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where
R² = the unadjusted R²,
n = the number of observations, and
p = the number of predictors
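A small Python sketch (not SAS) of the adjusted R-Square formula shows how the penalty grows with the number of predictors. The function name adjusted_r_square and the input values are hypothetical and chosen only for illustration.

```python
def adjusted_r_square(r2, n, p):
    """Adjusted R-square: 1 - (1 - R2)(n - 1)/(n - p - 1), with p = number of predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical inputs: the same unadjusted R-square with one versus five predictors.
one_predictor = adjusted_r_square(0.60, n=45, p=1)
five_predictors = adjusted_r_square(0.60, n=45, p=5)
print(round(one_predictor, 4), round(five_predictors, 4))
```

For the same unadjusted R², the model with more predictors receives the lower adjusted value.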
Significance Tests for the Regression Coefficients
The Mallows' Cp statistic
$$C_p = \frac{SSE}{\hat{\sigma}^2} + 2p - n$$
where
SSE = the error sum of squares,
σ̂² = the estimate of pure error variance, from the SIGMA= option or from fitting the full model,
p = the number of parameters including the intercept, and
n = the number of observations
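The Cp formula can be checked with a short Python sketch (not SAS). The function name mallows_cp and all numbers are hypothetical. One property that follows directly from the formula: when the error variance is estimated from the full model as SSE/(n - p), the full model's Cp equals its own p.

```python
def mallows_cp(sse, sigma2, p, n):
    """Mallows' Cp = SSE / sigma^2 + 2p - n, where p counts parameters including the intercept."""
    return sse / sigma2 + 2 * p - n


# Hypothetical full model: 5 predictors plus intercept (p = 6), n = 45, SSE = 390.
n = 45
p_full = 6
sse_full = 390.0
sigma2 = sse_full / (n - p_full)  # error variance estimated from the full model

cp_full = mallows_cp(sse_full, sigma2, p_full, n)
# Hypothetical subset model: 2 predictors plus intercept (p = 3), SSE = 430.
cp_subset = mallows_cp(430.0, sigma2, 3, n)
print(cp_full, cp_subset)
```

A subset model whose Cp is close to its own p is usually read as showing little bias from the omitted variables.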
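F and t tests for significance of the overall model
$$F = \frac{\text{model variance}}{\text{error variance}} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}$$
where
p = number of predictors (parameters excluding the intercept)
n = sample size
For a single predictor (p = 1), with r the correlation between x and y:
$$t = \sqrt{F} = \sqrt{\frac{(n - 2)\, r^2}{1 - r^2}}$$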
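The relationship between the two statistics can be verified numerically with a short Python sketch (not SAS). The function names overall_f and t_from_r and the values r = 0.5 and n = 45 are hypothetical. With one predictor, R² equals r², so t² should equal F.

```python
import math


def overall_f(r2, n, p):
    """Overall F = (R2 / p) / ((1 - R2) / (n - p - 1)), with p = number of predictors."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def t_from_r(r, n):
    """Magnitude of t for a simple regression: sqrt((n - 2) r^2 / (1 - r^2))."""
    return math.sqrt((n - 2) * r ** 2 / (1.0 - r ** 2))


# Hypothetical simple regression: correlation r = 0.5 from n = 45 observations.
r = 0.5
n_obs = 45
f_value = overall_f(r ** 2, n_obs, p=1)
t_value = t_from_r(r, n_obs)
print(round(f_value, 4), round(t_value, 4), round(t_value ** 2, 4))
```

The square root returns only the magnitude of t. The sign is the sign of r.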