Lecture 17 Model Development and Selection of Variables

Animal Science 500
Lecture No. 17
October 28, 2010
IOWA STATE UNIVERSITY
Department of Animal Science
Using PROC COMPARE

PROC COMPARE compares two SAS data sets with each other.

It warns you if it detects observations (rows) or variables (columns) that do not
agree across the two data sets.
When there are no disagreements, you can be more confident that data entry is reliable.

To use PROC COMPARE:
1. Enter your data twice, once into each of two separate raw data files.
2. Use the two raw data files to create two SAS data sets.
3. Run PROC COMPARE on the two data sets.

Example:
The following example compares the two SAS data sets named
PIG1 and PIG12.

PROC COMPARE BASE = PIG1 COMPARE = PIG12 ERROR ;
    ID subjctid ;   /* match observations on subjctid */
RUN ;

The BASE keyword defines the data set that SAS will use as a
basis for comparison.

The keyword COMPARE defines the dataset which SAS will
compare with the base dataset.

The ERROR keyword requests that SAS write an error message to
the SAS log if it discovers any differences when it compares
the two data sets.

The ID statement tells SAS to compare rows (observations) in the
data set by the identifying variable, which here is named
SUBJCTID. This variable must have a unique value for each case.

PROC COMPARE features a number of options, many of which
are designed to control the amount and type of information
displayed in the listing file.
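
The double-entry workflow described above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the variables weight and backfat and all data values are made up for illustration, and DATALINES stands in for the two raw data files. When an ID statement is used, PROC COMPARE expects both data sets to be sorted by the ID variable, which is why the PROC SORT steps are included.

```sas
/* First keying of the data (illustrative values only) */
DATA PIG1 ;
    INPUT subjctid weight backfat ;
    DATALINES ;
101 250 0.85
102 262 0.91
103 247 0.78
;
RUN ;

/* Second keying of the same records; subject 102 has a keying error */
DATA PIG12 ;
    INPUT subjctid weight backfat ;
    DATALINES ;
101 250 0.85
102 226 0.91
103 247 0.78
;
RUN ;

/* The ID statement requires both data sets to be sorted by the ID variable */
PROC SORT DATA = PIG1 ;  BY subjctid ; RUN ;
PROC SORT DATA = PIG12 ; BY subjctid ; RUN ;

PROC COMPARE BASE = PIG1 COMPARE = PIG12 ERROR ;
    ID subjctid ;
RUN ;
```

With these made-up records, the comparison should report that weight differs for subject 102, which is the record to check against the original data sheet.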
Class Statement
• Variables included in the CLASS statement are referred to as class variables.
• The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
• Class variables represent the various levels of some factors or effects, for example:
    - Treatment (1, ..., n)
    - Season (spring, summer, fall, and winter coded 1 through 4)
    - Breed
    - Color
    - Sex
    - Line
    - Day
    - Laboratory
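
As a small illustration of how class variables define subgroups, the sketch below uses PROC MEANS, which is one of several procedures that accept a CLASS statement. The data set name growth, the variable adg (average daily gain), and all values are hypothetical and exist only so the example can run.

```sas
/* Illustrative records only: breed, sex, and average daily gain (adg) */
DATA growth ;
    INPUT breed :$12. sex $ adg ;
    DATALINES ;
Duroc     M 1.92
Duroc     F 1.81
Yorkshire M 1.88
Yorkshire F 1.79
Duroc     M 1.95
Yorkshire F 1.76
;
RUN ;

PROC MEANS DATA = growth N MEAN STD ;
    CLASS breed sex ;   /* subgroups are the breed-by-sex combinations */
    VAR adg ;
RUN ;
```

Here breed and sex are character class variables; a numerically coded factor such as season (1 through 4) could be listed in the CLASS statement in the same way.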
Class Variables
• Are usually things you would like to account for in your model
• Can be numeric or character
• Can hold continuous values, although each distinct value is then treated as its own level
• They are generally not used in regression analyses
    - What meaning would they have?
Class Statement Options

• ASCENDING: sorts the levels of the class variable in ascending order
• DESCENDING: sorts the levels of the class variable in descending order

Other CLASS statement options generally depend on the
procedure (PROC) being used, so they will not all be covered here.
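
A minimal sketch of a CLASS statement option, assuming PROC MEANS as the procedure (the slide does not name one). Options follow a slash after the variable list. The data set scores and its values are made up for illustration.

```sas
/* Illustrative data: season coded 1 through 4, plus a body weight */
DATA scores ;
    INPUT season weight ;
    DATALINES ;
1 240
2 251
3 247
4 238
1 244
3 250
;
RUN ;

PROC MEANS DATA = scores MEAN ;
    CLASS season / DESCENDING ;   /* season levels should be listed from 4 down to 1 */
    VAR weight ;
RUN ;
```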
Discrete Variables
• A discrete variable is one that cannot take on all values within the limits of the variable.
    - Typically limited to whole numbers
    - For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5. The variable cannot have the value 1.7.
    - By contrast, a continuous variable such as a person's height can take on any value within its range.
• Discrete variables are also of two types:
    1. unorderable (also called nominal variables)
    2. orderable (also called ordinal)
• Discrete data are sometimes called categorical because the observations fall into one of a number of categories, for example any trait where you score the value:
    - Lameness scores
    - Body condition scores
    - Soundness scoring
        - Reproductive
        - Feet and leg
    - Behavioral traits
        - Fear test
        - Back test
        - Vocal scores
    - Body lesion scores
• When do discrete variables become continuous, or do they?
• Is a trait like number born alive considered discrete or continuous?
Model Development and Selection of Variables
Example:
The general problem addressed is to identify
important soil characteristics influencing aerial
biomass production of marsh grass, Spartina
alterniflora.
Assumptions of the Linear Regression Model
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Explanation of the Assumptions
1. Linear functional form
    - Does not detect curvilinear relationships
2. The observations are independent
    - Representative sample from some larger population
    - If the observations are not independent, the result is autocorrelation, which inflates the t, r, and F statistics and in turn distorts the significance tests
3. Normality of the residuals
    - Permits proper significance testing, similar to ANOVA and other statistical procedures
4. Equal variance (no heterogeneous variance)
    - Heteroskedasticity precludes generalization and external validity
    - This too distorts the significance tests being used
5. Multicollinearity (many of the traits exhibit collinearity)
    - Biases parameter estimation
    - Can prevent the analysis from running or converging (getting your answers)
6. Severe or several outliers will distort the results and may bias them
    - If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
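
Several of these assumptions can be examined with standard PROC REG options. The sketch below is one possible way to do so, not a procedure prescribed in the lecture. The data are simulated placeholders with hypothetical variable names (y, x1, x2). VIF relates to multicollinearity, DW to autocorrelation (meaningful only when the observations have a natural order), R and INFLUENCE to outliers and influential observations, and the NORMAL option of PROC UNIVARIATE to normality of the residuals.

```sas
/* Simulated placeholder data; not data from the lecture */
DATA sim ;
    CALL STREAMINIT(17) ;
    DO obs = 1 TO 45 ;
        x1 = RAND('NORMAL', 30, 4) ;
        x2 = RAND('NORMAL', 5, 1) ;
        y  = 100 + 2*x1 - 8*x2 + RAND('NORMAL', 0, 10) ;
        OUTPUT ;
    END ;
RUN ;

PROC REG DATA = sim ;
    /* VIF: collinearity, DW: autocorrelation, R and INFLUENCE: outliers */
    MODEL y = x1 x2 / VIF DW R INFLUENCE ;
    /* save residuals and predicted values for further checks */
    OUTPUT OUT = diag RESIDUAL = resid PREDICTED = yhat ;
RUN ;
QUIT ;

/* Normality tests on the saved residuals */
PROC UNIVARIATE DATA = diag NORMAL ;
    VAR resid ;
RUN ;
```

Plotting resid against yhat from the diag data set is a common informal check for unequal variance and for curvature missed by the linear functional form.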
Example Data Origination
(Dr. P. J. Berger)
Data: The data were published as an exercise
by Rawlings (1988) and originally appeared as a
study by Dr. Rick Linthurst, North Carolina State
University (1979). The purpose of his research
was to identify the important soil characteristics
influencing aerial biomass production of the
marsh grass, Spartina alterniflora in the Cape
Fear Estuary of North Carolina. The design for
collecting data was such that there were three
types of Spartina vegetation in each of three
locations, and five random sites within each
location-by-vegetation-type combination (3 × 3 × 5 = 45 sites).
Example Variables
Data:
The dependent variable (what is being measured) is aerial biomass,
and there are five substrate measurements
(these are the independent variables):
1. Salinity
2. Acidity
3. Potassium
4. Sodium
5. Zinc
Example Data

Objective:

Find the substrate variable, or combination of
variables, showing the strongest relationship to
biomass.
Or,

From the list of five independent variables of salinity,
acidity, potassium, sodium, and zinc, find the
combination of one or more variables that has the
strongest relationship with aerial biomass.

Find the independent variables that can be used to
predict aerial biomass.
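
One way such a search could be set up in SAS is with the SELECTION= option of the MODEL statement in PROC REG. The sketch below is an assumption about tooling, not the analysis used in the lecture. The data set marsh_sim is simulated, and its variable names, means, and coefficients are arbitrary placeholders rather than the Linthurst data.

```sas
/* Simulated stand-in for the 45 sampling sites; values are NOT the real data */
DATA marsh_sim ;
    CALL STREAMINIT(2010) ;
    DO site = 1 TO 45 ;
        salinity  = RAND('NORMAL', 30, 4) ;
        acidity   = RAND('NORMAL', 5, 1) ;
        potassium = RAND('NORMAL', 800, 150) ;
        sodium    = RAND('NORMAL', 16000, 3000) ;
        zinc      = RAND('NORMAL', 17, 5) ;
        /* arbitrary relationship, chosen only so the example runs */
        biomass   = 200 + 300*acidity - 20*zinc + RAND('NORMAL', 0, 200) ;
        OUTPUT ;
    END ;
RUN ;

PROC REG DATA = marsh_sim ;
    /* all-subsets summary: the 2 best models of each size by R-square */
    MODEL biomass = salinity acidity potassium sodium zinc
          / SELECTION = RSQUARE BEST = 2 ADJRSQ CP ;
    /* stepwise alternative with entry and stay significance levels */
    MODEL biomass = salinity acidity potassium sodium zinc
          / SELECTION = STEPWISE SLENTRY = 0.15 SLSTAY = 0.15 ;
RUN ;
QUIT ;
```

The first MODEL statement lists candidate subsets of each size so they can be compared on R-square, adjusted R-square, and Mallows' Cp; the second lets SAS add and drop variables one at a time. The two approaches need not agree on a final model.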
Definition of Mixed Models by their component effects
• Mixed models contain both fixed and random effects.
• Fixed effects: factors for which the only levels under consideration are contained in the coding of those effects.
• Random effects: factors for which the levels contained in the coding of those factors are a random sample of the total number of levels in the population for that factor.
Examples of Fixed and Random Effects
• Fixed effect:
    - Sex, where both male and female are included in the factor.
    - Breed, where the levels (pure or crossbred; or Angus, Hereford, and Charolais) are the ones included in the factor.
• Random effect:
    - Subject: the sample is a random sample of the target population.
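
A mixed model with these kinds of effects could be fit with PROC MIXED, where fixed effects go in the MODEL statement and random effects in the RANDOM statement. The sketch below is illustrative only: the data are simulated, and sire is a hypothetical random factor standing in for a randomly sampled "subject", with breed and sex as coded fixed effects.

```sas
/* Simulated placeholder data: 12 sires, 6 calves per sire */
DATA calves ;
    CALL STREAMINIT(500) ;
    DO sire = 1 TO 12 ;
        sire_eff = RAND('NORMAL', 0, 5) ;   /* one random draw per sire */
        breed = MOD(sire, 3) + 1 ;          /* breed coded 1 through 3 */
        DO calf = 1 TO 6 ;
            sex = MOD(calf, 2) + 1 ;        /* sex coded 1 or 2 */
            weight = 200 + 10*breed + 8*sex + sire_eff
                     + RAND('NORMAL', 0, 12) ;
            OUTPUT ;
        END ;
    END ;
RUN ;

PROC MIXED DATA = calves ;
    CLASS breed sex sire ;
    MODEL weight = breed sex / SOLUTION ;   /* fixed effects */
    RANDOM sire ;                           /* random effect */
RUN ;
```

The output reports fixed-effect estimates for breed and sex, and a variance component for sire rather than a separate mean for each sire.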
Defining fixed or random factor

Levels
    Random: selected at random from a conceptually infinite collection of possibilities
    Fixed: finite number of possibilities
Another experiment
    Random: would use different levels from the same population
    Fixed: would use the same levels of the factor
Goals
    Random: estimate variance components
    Fixed: estimate means
Inference
    Random: for all levels of the factor (i.e. for the population from which levels are selected)
    Fixed: only for levels actually used in the experiment

From D. A. Dickey, 2008: SAS Global Forum
Classification of effects

There are main effects: Linear Explanatory Factors

There are interaction effects: Joint effects over and
above the component main effects.

There are nested effects. Hierarchical designs
contain nested effects: animals may be nested within
treatment, which might in turn be nested within farm.

Such effects may be fixed or random. Their
classification depends on the experimental design.

Between-subjects effects are those for which each subject is in
one group or another, but not in both.

Experimental group is a fixed effect because the
manager is considering only those groups in his
experiment.

One group is the experimental group and the other is
the control group. Therefore, this grouping factor is a
between-subject effect.

Within-subject effects are experienced by subjects
repeatedly over time.

Trial is a random effect when there are several trials in
the repeated measures design; all subjects experience
all of the trials.

Trial is therefore a within-subject effect.

For example, the operator of a scanning machine may be a
fixed or random effect, depending upon whether one is
generalizing beyond the sample.

If ultrasound scanner operator is a random effect, then
the machine*operator interaction is a random effect.

There are contrasts: These contrast the values of one
level with those of other levels of the same effect.
Classification of Effects cont’d
• Hierarchical designs have nested effects.
    - Nested effects are those with subjects within groups.
• An example would be pens of animals nested within barns, and barns nested within farms.
• SAS expresses nesting of effects by:
    - Pen(barn): pen of animals nested within barn
    - Barn(farm): barn nested within farm
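
In a model specification, this nesting notation might be used as in the sketch below. It assumes PROC MIXED, treats farm as fixed and the nested barn and pen effects as random, and uses simulated data with hypothetical names. None of these choices come from the lecture. Barn identifiers are made unique across farms here so that pen(barn) identifies each pen unambiguously.

```sas
/* Simulated placeholder data: 3 farms, 2 barns per farm, 3 pens per barn */
DATA nested ;
    CALL STREAMINIT(17) ;
    DO farm = 1 TO 3 ;
        DO b = 1 TO 2 ;
            barn = 10*farm + b ;                /* unique barn id across farms */
            barn_eff = RAND('NORMAL', 0, 4) ;
            DO pen = 1 TO 3 ;
                pen_eff = RAND('NORMAL', 0, 3) ;
                DO animal = 1 TO 4 ;
                    gain = 100 + 5*farm + barn_eff + pen_eff
                           + RAND('NORMAL', 0, 6) ;
                    OUTPUT ;
                END ;
            END ;
        END ;
    END ;
RUN ;

PROC MIXED DATA = nested ;
    CLASS farm barn pen ;
    MODEL gain = farm ;
    RANDOM barn(farm) pen(barn) ;   /* nesting written as inner(outer) */
RUN ;
```

If barns were numbered 1 and 2 within every farm instead, the pen effect would need to be written as pen(barn farm) so that pens in different farms are not pooled.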
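Interactions case
• If an interaction term were included, the formula would be
    $y_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ij}$
• The interaction or crossed effect is the joint effect, over and above the individual main effects. Therefore, the main effects must be in the model for the interaction to be properly specified.
    $(\alpha\beta)_{ij} = (y_{ij} - \mu) - (\alpha_i - \mu) - (\beta_j - \mu) = y_{ij} - \alpha_i - \beta_j + \mu$
  In this last expression, $y_{ij}$ is read as the cell mean and $\alpha_i$, $\beta_j$ as the corresponding level means, so that $\alpha_i - \mu$ and $\beta_j - \mu$ are the main-effect deviations.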
Higher Order Interactions
• If 3-way interactions are in the model, then the main effects and all lower order interactions must be in the model for the 3-way interaction to be properly specified. For example, a 3-way interaction model would be:
    $y_{ijk} = \mu + a_i + b_j + c_k + (ab)_{ij} + (ac)_{ik} + (bc)_{jk} + (abc)_{ijk} + e_{ijk}$
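
In SAS, a model of this form could be written as in the sketch below, which assumes PROC GLM and a simulated 2 x 2 x 2 factorial with three replicates per cell. The data set fact and all effect sizes are made up for illustration.

```sas
/* Simulated placeholder data: 2 x 2 x 2 factorial, 3 replicates per cell */
DATA fact ;
    CALL STREAMINIT(28) ;
    DO a = 1 TO 2 ;
        DO b = 1 TO 2 ;
            DO c = 1 TO 2 ;
                DO rep = 1 TO 3 ;
                    y = 10 + a + 2*b + c + 0.5*a*b + RAND('NORMAL', 0, 1) ;
                    OUTPUT ;
                END ;
            END ;
        END ;
    END ;
RUN ;

PROC GLM DATA = fact ;
    CLASS a b c ;
    /* main effects, all 2-way interactions, then the 3-way interaction */
    MODEL y = a b c a*b a*c b*c a*b*c ;
RUN ;
QUIT ;
```

The bar notation MODEL y = a|b|c ; is shorthand that expands to the same set of terms, which makes it harder to leave out a lower order effect by accident.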