Statistical Modeling in SAS - Electrical and Computer Engineering

advertisement
Statistical Modeling with SAS/STAT
Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria, Canada
rexlei86@uvic.ca
I.
Overview
There are more than 70
procedures in SAS/STAT software and
most of them are dedicated to solving
problems in statistical modeling. The
contents discussed in the statistical
document are presented in a simple
schema so that it is not easy to express
the complexity of statistical models.
Compared
to
completely
understand the theories behind each
model, it is more practical to apply
these models in terms of simple
criteria.
II.
Statistical Models
Simply classified, the statistical
models consists of mainly two
categories, namely, deterministic and
stochastic models, model-based and
design-based randomness models.
The
deterministic
models
represent that the relationship
between inputs and outputs are
theoretically in deterministic fashion.
The purely mathematical models can
be of importance as theoretical tools,
but
impractical
for
describing
observational, experimental, or survey
data. However, the stochastic models
illustrate that the outputs are
uncertainly and affected
parameters.
These
constants, parameters, are
based on the assumptions
model and the data.
by some
unknown
estimated
about the
In practice, the statistical models
are preferred over deterministic
models. The reasons are that, a)
randomness is often introduced into a
system so that it can balance or
representativeness, B) even if a
deterministic
model
can
be
formulated for the phenomenon under
study, a stochastic model can provide
a more parsimonious and more easily
comprehended description, c) it is
often sufficient to describe the
average behaviour of a process, rather
than each particular realization.
In the statistical theory, the
process of estimating the parameters
based on the data is called fitting the
model.
There
are
numerous
procedures in SAS/STAT software that
can perform the fitting. The principles
of estimating the parameters vary
from computing the distribution,
variance, bias and etc.
A statistical model is a description
of how the date is generated, not a
description of the specific data to
which it is applied. The models’
missions are to capture those aspects
of a phenomenon that is relevant to
inquiry and to explain how the data
could have come about as a realization
of a random experiment. Different
means lead to different model
formulations,
different
analytic
strategies and different results. Thus,
the model-based randomness owns
the innate randomness while designbased randomness has the induced
randomness. Interpreted in another
way, randomness characteristic of the
model-based randomness derives
from the model and cannot be
modified unless the model is changed.
Conversely,
the
design-based
randomness is controlled or induced
by how to design the model. The
randomness changes as the design
changes.
Design-based approaches play an
important role in the analysis of data
from controlled experiments by
randomization tests. Its view treats
the potential response of a particular
treatment
for
a
particular
experimental unit as a constant. The
stochastic nature of the error-control
design is induced by randomly
selecting one of the potential
responses. A design-based approach is
implicit in SAS/STAT procedures
whose name commences with
SURVEY.
In the purely model-based
framework, the only source of random
variation for inference comes from the
unknown variation in the responses as
the model has been already designed
and ready to use. However, it does not
imply that there is only one source of
random variation in the data. The
inferential approaches in other
SAS/STAT
based.
procedures
are
model
Once a model is introduced and
accepted as a description of datagenerating mechanism, the parameter
estimators can be estimated by using
the data. After the parameter
estimates are ready, the model
selected can be applied to answer the
questions of interest about the study
population. Also, it can be employed to
derive new predictions, to validate
assumption, to evaluate the model
itself and so on. Therefore, the three
steps of model selection, diagnosis and
discrimination
becomes
of
significance during the modelconstructing process. This process
usually is achieved by iterating itself.
III.
Classes of Statistical Models
There are seven classes of
statistical models in SAS/STAT
software. The first one comes to the
linear and nonlinear models. In
statistical estimation problems, a
nonlinear problem typically does not
have closed-form solution and must
be solved by iterative, numerical
techniques. The nonlinearity in the
mean function is often used to
distinguish between linear and
nonlinear models.
Then, it is the regression models
and models with classification effects.
A regression model in the narrow
sense is a linear model in which all
regressor effects are continuous
variables, compared to a classification
model. More specifically, each effect in
the model contributes a single column
to the property matrix and single
parameter to the overall model.
The univariate models in the
statistical models are discussed at the
third place and it indicates that there
only one response variable. By
contrast, the multivariate models have
multiple response variables modeled
jointly. The multivariate data can be
generally
grouped
into
three
categories, homogeneous multivariate
data which consist of observations of
the same attributes, heterogeneous
multivariate data which arise when
the responses that are modeled jointly
refer to different attributes and
contains two important subtypes data,
and heterocatanomic multivariate
data in which the observations can
come from different distributional
families.
The fourth one is the Fixed,
Random and Mixed models. Each term
in a statistical model demonstrates
either a fixed or a random effect.
Models with fixed effects are called
fixed models. Likewise, models with
random effects are called random
effects models. The mixed models are
the models with both fixed-effects and
random effects terms. The reasons of
having random effects in the model
are a) some observations are no
longer uncorrelated but instead have a
covariance that depends on the
variance of the random effects, b) the
different inference spaces, c) the
structure of G (variance-covariance
matrix) and Var[], d) certain
concepts, such as least square means
and Type III estimable functions, are
meaningful only for fixed effects, e)
using random effects.
The generalized linear models lay
on the fifth location. This class of
models extends the theory and
methods of linear models to data with
non-normal
responses.
The
generalized linear models also apply a
transformation, known as the link
function, but it is applied to a
deterministic component, the mean of
the data. This kind of models takes the
distribution of the data in to account,
rather than assuming that a
transformation of the data leads to
normally distributed data to which
standard linear modeling techniques
can be applied.
The latent variable models involve
variables that are not observed
directly in the research and also
involved in all kinds of regression
models.
Latent
variables
are
systematic unmeasured variables that
are also referred to as factors.
Bayesian Models are the last one
discussed in the SAS/STAT document.
It is based on the classical paradigm,
which treat the parameters of the
model as fixed, unknown constants.
They are not random variables, and
the notion of probability is derived in
an objective sense as a limiting
relative frequency. Model parameters
are random variables and the
probability of an event is defined in a
subjective sense as the degree to
which you believe that the event is
true.
Download