Statistical Modeling with SAS/STAT Cheng Lei Department of Electrical and Computer Engineering University of Victoria, Canada rexlei86@uvic.ca I. Overview There are more than 70 procedures in SAS/STAT software and most of them are dedicated to solving problems in statistical modeling. The contents discussed in the statistical document are presented in a simple schema so that it is not easy to express the complexity of statistical models. Compared to completely understand the theories behind each model, it is more practical to apply these models in terms of simple criteria. II. Statistical Models Simply classified, the statistical models consists of mainly two categories, namely, deterministic and stochastic models, model-based and design-based randomness models. The deterministic models represent that the relationship between inputs and outputs are theoretically in deterministic fashion. The purely mathematical models can be of importance as theoretical tools, but impractical for describing observational, experimental, or survey data. However, the stochastic models illustrate that the outputs are uncertainly and affected parameters. These constants, parameters, are based on the assumptions model and the data. by some unknown estimated about the In practice, the statistical models are preferred over deterministic models. The reasons are that, a) randomness is often introduced into a system so that it can balance or representativeness, B) even if a deterministic model can be formulated for the phenomenon under study, a stochastic model can provide a more parsimonious and more easily comprehended description, c) it is often sufficient to describe the average behaviour of a process, rather than each particular realization. In the statistical theory, the process of estimating the parameters based on the data is called fitting the model. There are numerous procedures in SAS/STAT software that can perform the fitting. The principles of estimating the parameters vary from computing the distribution, variance, bias and etc. A statistical model is a description of how the date is generated, not a description of the specific data to which it is applied. The models’ missions are to capture those aspects of a phenomenon that is relevant to inquiry and to explain how the data could have come about as a realization of a random experiment. Different means lead to different model formulations, different analytic strategies and different results. Thus, the model-based randomness owns the innate randomness while designbased randomness has the induced randomness. Interpreted in another way, randomness characteristic of the model-based randomness derives from the model and cannot be modified unless the model is changed. Conversely, the design-based randomness is controlled or induced by how to design the model. The randomness changes as the design changes. Design-based approaches play an important role in the analysis of data from controlled experiments by randomization tests. Its view treats the potential response of a particular treatment for a particular experimental unit as a constant. The stochastic nature of the error-control design is induced by randomly selecting one of the potential responses. A design-based approach is implicit in SAS/STAT procedures whose name commences with SURVEY. In the purely model-based framework, the only source of random variation for inference comes from the unknown variation in the responses as the model has been already designed and ready to use. However, it does not imply that there is only one source of random variation in the data. The inferential approaches in other SAS/STAT based. procedures are model Once a model is introduced and accepted as a description of datagenerating mechanism, the parameter estimators can be estimated by using the data. After the parameter estimates are ready, the model selected can be applied to answer the questions of interest about the study population. Also, it can be employed to derive new predictions, to validate assumption, to evaluate the model itself and so on. Therefore, the three steps of model selection, diagnosis and discrimination becomes of significance during the modelconstructing process. This process usually is achieved by iterating itself. III. Classes of Statistical Models There are seven classes of statistical models in SAS/STAT software. The first one comes to the linear and nonlinear models. In statistical estimation problems, a nonlinear problem typically does not have closed-form solution and must be solved by iterative, numerical techniques. The nonlinearity in the mean function is often used to distinguish between linear and nonlinear models. Then, it is the regression models and models with classification effects. A regression model in the narrow sense is a linear model in which all regressor effects are continuous variables, compared to a classification model. More specifically, each effect in the model contributes a single column to the property matrix and single parameter to the overall model. The univariate models in the statistical models are discussed at the third place and it indicates that there only one response variable. By contrast, the multivariate models have multiple response variables modeled jointly. The multivariate data can be generally grouped into three categories, homogeneous multivariate data which consist of observations of the same attributes, heterogeneous multivariate data which arise when the responses that are modeled jointly refer to different attributes and contains two important subtypes data, and heterocatanomic multivariate data in which the observations can come from different distributional families. The fourth one is the Fixed, Random and Mixed models. Each term in a statistical model demonstrates either a fixed or a random effect. Models with fixed effects are called fixed models. Likewise, models with random effects are called random effects models. The mixed models are the models with both fixed-effects and random effects terms. The reasons of having random effects in the model are a) some observations are no longer uncorrelated but instead have a covariance that depends on the variance of the random effects, b) the different inference spaces, c) the structure of G (variance-covariance matrix) and Var[], d) certain concepts, such as least square means and Type III estimable functions, are meaningful only for fixed effects, e) using random effects. The generalized linear models lay on the fifth location. This class of models extends the theory and methods of linear models to data with non-normal responses. The generalized linear models also apply a transformation, known as the link function, but it is applied to a deterministic component, the mean of the data. This kind of models takes the distribution of the data in to account, rather than assuming that a transformation of the data leads to normally distributed data to which standard linear modeling techniques can be applied. The latent variable models involve variables that are not observed directly in the research and also involved in all kinds of regression models. Latent variables are systematic unmeasured variables that are also referred to as factors. Bayesian Models are the last one discussed in the SAS/STAT document. It is based on the classical paradigm, which treat the parameters of the model as fixed, unknown constants. They are not random variables, and the notion of probability is derived in an objective sense as a limiting relative frequency. Model parameters are random variables and the probability of an event is defined in a subjective sense as the degree to which you believe that the event is true.