SAS Lecture 5 – Some regression procedures
Aidan McDermott, April 25, 2005

What will the output from this program look like?

How many variables will be in the dataset example, and what will be the length and type of each variable? What will the variable package look like?

What will the output from this program look like?

Modeling with SAS

• examine relationships between variables
• estimate parameters and their standard errors
• calculate predicted values
• evaluate the fit or lack of fit of a model
• test hypotheses

design outcome

The linear model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$

Example: $\text{Weight} = \beta_0 + \beta_1 \text{Height} + \beta_2 \text{Age} + \varepsilon$

Note: the outcome variable must be continuous and normally distributed given the independent variables.

The linear model with proc reg

• estimates parameters by least squares
• produces diagnostics to test model fit (e.g. scatter plots)
• tests hypotheses

Example:

    proc reg data=mydata;
      model weight = height age;
    run;

proc reg

Syntax:

    proc reg <options>;
      model response = effects </options>;
      plot yvariable*xvariable = 'symbol';
      by varlist;
      output <OUT=SAS data set> <output statistic list>;
    run;

proc reg

Options on the proc reg statement:

    data=SAS data set name      input data set
    outest=SAS data set name    creates a data set with the parameter estimates
    simple                      prints simple statistics

proc reg: the model statement

    model response = <effects> </options>;

• required
• variables must be numeric
• many options
• can specify more than one model statement

Example:

    model weight = height age;
    model weight = height age / p clm cli;

proc reg: the plot statement

    plot yvariable*xvariable <=symbol> </options>;

• produces scatter plots, with yvariable on the vertical axis and xvariable on the horizontal axis
• can specify several plots
• optional symbol to mark points
• yvariable and xvariable can be variables specified in model statements or statistics available in the output statement

Example:

    plot weight * age / pred;
    plot r. * p. / vref = 0;

proc reg

Some statistics available for plotting:

    P.       predicted values
    R.       residuals
    L95.     lower 95% CI bound for individual prediction
    U95.     upper 95% CI bound for individual prediction
    L95M.    lower 95% CI bound for mean of dependent variable
    U95M.    upper 95% CI bound for mean of dependent variable

Example:

    plot weight * age / pred;
    plot r. * p. / vref = 0;
    plot (weight p. l95. u95.) * age / overlay;

proc reg: the output statement

    output <OUT=SAS data set> keywords=names;

• creates a SAS data set
• all original variables are included
• keyword=names specifies the statistics to include

Example:

    output out=pvals p=pred r=resid;

Example: NMES

Variables of interest:

    totalexp    total medical expenditure ($)
    chd5        indicator of CHD
    lastage     age at last interview
    male        sex of participant

proc reg example

Here:

    1. model     estimate parameters etc.
    2. plot      make three plots
    3. output    make an output dataset regout

The run statement

Many people assume that the run statement ends a procedure such as proc reg. This is because when SAS encounters a run statement it executes any outstanding instructions in the program buffer. But it may or may not end the procedure: interactive procedures such as proc reg stay active after run and accept further statements until a quit statement (or the start of another step) is encountered.
    proc reg data=lecture4.nmes;
      model totalexp = chd5 lastage male;
    run;                 /* runs the first model; proc reg is still active */
      model totalexp = chd5 lastage;
      plot r.*chd5;
    run;                 /* runs the second model and the plot */
    quit;                /* ends the procedure */

proc glm (the general linear model)

• uses least squares with generalized inverses
• performs linear regression, analysis of variance, analysis of covariance
• accepts classification variables (discrete) and continuous variables
• estimates and performs tests for general linear effects
• proc anova is suitable for "balanced" designs; proc glm can be used for either balanced or unbalanced designs
• suitable for random effects models

proc glm

Syntax:

    proc glm data=name <options>;
      class classification variables;
      model response = effects / options;
      means effects / options;
      random effects / options;
      estimate 'label' effect values / options;
      contrast 'label' effect values / options;
    run;

proc glm

• the response (dependent) variable is continuous, with the same normality assumption as in proc reg
• independent variables are discrete or continuous; discrete variables must be listed on the class statement
• interaction terms can be specified with an asterisk, a*b, e.g.

    model bmi = a b a*b;

proc glm

    means effects / options;

• computes arithmetic means and standard deviations of all continuous variables in the model (both dependent and independent) within each group for effects specified on the right-hand side of the model statement
• only class variables may be specified as effects
• options specify multiple comparison methods for main effect terms in the model

proc glm example

Here:

    1. solution    show estimated parameters
    2. means       show means for the smoke variable
    3. class       treat smoke as discrete

proc glm example

Here:

    1. format      changes the reference group

reg and glm

Both proc reg and proc glm are suitable only when the outcome variable is normally distributed. proc reg has many regression diagnostic features, while proc glm allows you to fit more sophisticated linear models such as random effects models, models for unbalanced designs, etc.
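The two procedures compared above can be sketched side by side. This is a minimal illustration, not the code from the original slides: the proc reg step uses the NMES variables named earlier, but the choice of the three plots is an assumption. In the proc glm step, the dataset mydata, the response bmi, the covariate age, the 0/1 coding of smoke, and the format name smokef are all hypothetical. The format step assumes proc glm's default ordering of class levels by formatted value, with the last level serving as the reference group.

```sas
/* proc reg: model, three plots, and an output dataset named regout.   */
/* The particular plots chosen here are assumptions for illustration.  */
proc reg data=lecture4.nmes;
  model totalexp = chd5 lastage male;
  plot r.*p. / vref=0;              /* residuals against predicted values */
  plot r.*lastage / vref=0;         /* residuals against age              */
  plot totalexp*lastage / pred;     /* outcome against age, with          */
                                    /* prediction limits                  */
  output out=regout p=pred r=resid; /* original variables plus pred,resid */
run;
quit;

/* Hypothetical format: relabel smoke so that the non-smokers (0)      */
/* sort last and therefore become the reference group in proc glm.     */
proc format;
  value smokef 0 = '2: non-smoker'
               1 = '1: smoker';
run;

/* proc glm: smoke is treated as discrete via the class statement.     */
/* mydata, bmi and age are hypothetical names.                         */
proc glm data=mydata;
  class smoke;
  format smoke smokef.;
  model bmi = smoke age / solution; /* solution prints the estimates   */
  means smoke;                      /* group means for each smoke level */
run;
quit;
```

Removing the format statement from the proc glm step would leave the levels ordered by their unformatted values, so the reference group would switch to smoke=1 under the assumed coding.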
non-normal outcomes

In many situations we cannot assume our response variable is normally distributed. proc reg and proc glm are not suitable for modeling such outcomes.

Example: Suppose you are interested in estimating the prevalence of disease in a population. You have an indicator of disease (1 = Yes, 0 = No).

non-normal outcomes

Example: You are interested in estimating how the incidence of infant mortality has changed as a function of time.

Example: You are interested in estimating the median survival time for two groups of patients receiving either a placebo or treatment.

proc logistic

Example: Survey data: parent agrees to close school when certain toxic elements are found in the environment.

Variables:

    close    0 = no, 1 = yes
    lived    years lived in community

proc logistic

Syntax:

    proc logistic <options>;
      by variables;
      class variables;
      model response = effects </options>;
      output <out=name> <keyword=name>;
    run;

proc logistic

• the descending option means that we are modeling the probability that close=1 and not the probability that close=0.

proc genmod

• implements the generalized linear model
• fits models with normal, binomial or Poisson response variable (among others)
• fits generalized estimating equations for repeated measures data

proc genmod

Syntax:

    proc genmod <options>;
      by variables;
      class variables;
      model response = effects </options>;
      output <out=name> <keyword=name>;
      make 'table' out=name;
    run;

proc genmod:

• the class statement says which variables are classification (categorical) variables
• the by statement produces a separate analysis for each level of the by variables (data must be sorted in the order of the by variables)
• response is the response (dependent) variable in the regression model
• <effects> is a list of variables: the independent variables in the regression model. Any independent variables that are categorical must be listed in the class statement.
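As a rough sketch of the two procedures introduced above: the first step fits the school-closing model with proc logistic, and the second fits a linear model with proc genmod using a class variable. The dataset names survey and mydata, and the variables bmi and age, are hypothetical; close, lived and smoke are the variables named in the slides.

```sas
/* Logistic regression for the survey example.                         */
/* "survey" is a hypothetical dataset name.                            */
proc logistic data=survey descending; /* model P(close=1), not P(close=0) */
  model close = lived;
run;

/* A linear model in proc genmod: with no dist= or link= options the   */
/* defaults are the normal distribution and the identity link.         */
/* mydata, bmi and age are hypothetical names.                         */
proc genmod data=mydata;
  class smoke;                        /* smoke is treated as categorical */
  model bmi = smoke age;
run;
```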
Example:

• smoke will be treated as a categorical variable because of the class statement
• same model as we produced with proc glm. The default is a linear model.

options for the model statement

• dist= specifies the distribution of the response variable (default = normal)
• link= specifies the link function that relates the mean of the response to the linear predictor (default = identity when the distribution is normal)

Examples:

    logistic regression:    dist=binomial link=logit
    Poisson regression:     dist=poisson link=log

options for the model statement

• alpha= sets the significance level for confidence intervals; the confidence coefficient is 1 - alpha (default alpha=0.05, i.e. 95% intervals)
• waldci or lrci specifies that confidence intervals are to be computed. waldci gives approximate intervals and doesn't take as long as lrci. lrci gives intervals based on the likelihood ratio.

the output statement

• the output statement is just one of the ways to create a new SAS dataset containing results from the genmod procedure.
• the statement is similar to that found in proc means and proc glm.

Example:

    output out=new predicted=fit upper=upper lower=lower;

the make statement

• the make statement is another way to create a new SAS dataset containing results from the genmod procedure.
• ods is another, more general way (see later).

Example:

    make 'ParameterEstimates' out=parms;
    make 'ParmInfo' out=parminfo;

example: logistic regression

• Perform a logistic regression analysis to determine how the odds of CHD are associated with age and gender in the 1987 NMES.
• Save the parameter estimates as a new dataset.
• Save the predicted values along with the original data.

Example:

• the descending option means that we are modeling the probability that chd5=1 and not the probability that chd5=0.
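One way the logistic regression exercise above might be written is sketched below. It is an illustration under stated assumptions, not the code from the original slide: the output dataset names parms and preds are hypothetical, and the parameter estimates are saved with an ods output statement (the more general route mentioned above) rather than with make.

```sas
/* Logistic regression of CHD on age and sex in the NMES data.         */
/* parms and preds are hypothetical dataset names.                     */
ods output ParameterEstimates=parms;   /* save the parameter estimates */

proc genmod data=lecture4.nmes descending; /* model P(chd5=1)          */
  model chd5 = lastage male / dist=binomial link=logit waldci;
  /* predicted probabilities and confidence limits, saved alongside    */
  /* the original variables                                            */
  output out=preds predicted=fit lower=lower upper=upper;
run;
```

Exponentiating the lastage and male estimates in parms gives the corresponding odds ratios, since the logit link models the log odds of chd5=1.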