Linear Regression Analysis
with a focus on Influence Diagnostics using proc reg

Prepared by Voytek Grus for SAS user group, Halifax
February 23, 2007

Introduction: What is Regression Analysis?
• A broad collection of statistical techniques used to explore relationships between measurable variables.
  – Its primary purpose is to describe the relationship between variables (the model) and to predict the response or study its components (the coefficients).
• A central idea of RA is that it describes a statistical (stochastic) process, not a deterministic equation.
• A subgroup of Generalized Linear Models and/or Multivariate Analysis.

Introduction: Types of Regression Analysis
• Data types and statistical techniques
  – Analysis of observational versus experimental data (proc rsreg)
  – Discrete response variable: logistic regression (proc logistic, transreg)
  – Time series versus cross-sectional data (procs autoreg, pdlreg, arima)
  – Survival analysis: lifetime or failure time (proc lifereg)
  – Regression on random predictors
    • Simultaneous econometric equations (procs model, syslin)
    • Structural equation modeling (proc calis)
• Estimation techniques
  – Linear vs non-linear (proc nlin, nlinmix)
  – Least squares vs non-least squares such as MLE (proc robustreg)
  – Least squares vs partial least squares (proc pls)
  – Multivariate regression (multiple response regression)

SAS offers many diverse tools to do regression analysis
- SAS procedures, SAS Enterprise Guide, matrix programming language
- A good way to start is to read about RA in SAS Help.
- Chapter 2, “Introduction to Regression Procedures”, gives a good overview of RA and of the SAS procedures available for various analyses.

Regression Analysis: Process
• State the purpose of the analysis: prediction, variable screening, model specification, parameter estimation (signs and significance), influence diagnostics.
• Identify the type of regression analysis to be conducted and find the appropriate tools.
• Assess the quality of your data.
• Fit the regression model.
• Examine compliance with statistical assumptions, remedy violations where necessary, and assess quality of fit.
• Draw conclusions.

Diagnostics: testing for violation of assumptions
• Analysis of residuals
  – Normality assumption (QQ- and PP-plots, added variable plots, partial residual plots, histograms, F tests for lack of fit, Durbin-Watson)
  – Heteroscedasticity (ACOV and SPEC options)
  – Outlier detection (how large is too large?)
  – Influence diagnostics (Cook’s distance, PRESS)
• Model specification (leverage plots, Mallows’ Cp)
  – Non-linearity (scatter plots, partial residual plots)
  – Over- and under-specification
• Multicollinearity tests (tol, vif, collin)
• Autocorrelation (Durbin-Watson)
• Random predictors (X’s measured with errors)

Remedies to violation of assumptions
• Variable selection process (stepwise, maxr etc. in proc reg)
• Variable transformation
  – Dummy variables
  – Box-Tidwell procedure
  – Not all functions are linearizable; in those cases non-linear regression must be used.
• Polynomial regression (proc rsreg)
• Weighted least squares (weight statement in proc reg)
• Non-least squares regression
  – Failure of normality: Huber M-estimator (proc robustreg)
  – Principal components regression (proc pls, princomp)
  – Ridge regression (proc reg)
• Partial least squares: random predictors
  – proc pls
• Non-linear regression
  – proc NMLX, proc nlin, proc model

Functionality of Proc Reg in Linear Regression Analysis
• Data modeling: by-group processing, where statement, multiple model statements
• Interactive analysis: reweight, paint, plot statements etc.
• Diagnostic tools: plots, tests (outliers, normality etc.)
• Hypothesis testing: F and t tests, partitioning of variability
• Automated variable selection procedures: stepwise regression, forward selection, backward elimination, maxr
• Model validation: Mallows’ Cp graphs
• Prediction: prediction intervals, PRESS residuals etc.

Literature
• Raymond H. Myers (1986), Classical and Modern Regression with Applications
• Sanford Weisberg (1985), Applied Linear Regression
• SAS Help

Examples

Questions?
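
Sketch: influence diagnostics with proc reg

The diagnostics named in the slides (Cook’s distance, PRESS residuals, leverage, tol, vif, collin, SPEC) can be requested in a single proc reg step. What follows is a minimal sketch and not the presenter's own example. It assumes the SASHELP.CLASS sample data set that ships with SAS and an arbitrary model of weight on height and age. The output data set name (diag) and the variable names to the right of each equals sign are hypothetical.

```sas
/* Sketch only: SASHELP.CLASS and the model weight = height age are
   assumptions chosen for illustration, not the presenter's example. */
proc reg data=sashelp.class;
   /* r         : residual analysis, including Cook's D
      influence : hat diagonal (leverage), RSTUDENT, COVRATIO, DFFITS, DFBETAS
      vif tol collin : multicollinearity diagnostics
      spec      : test of the heteroscedasticity / specification assumption */
   model weight = height age / r influence vif tol collin spec;

   /* Save per-observation diagnostics; the data set and variable
      names chosen here are hypothetical. */
   output out=diag
          p=pred r=resid rstudent=rstud
          h=lev cookd=cooksd press=press_resid;
run;
quit;

/* List observations whose Cook's distance exceeds the common 4/n
   rule of thumb (n = 19 in SASHELP.CLASS). The cutoff is a
   convention, not a formal test. */
proc print data=diag;
   where cooksd > 4/19;
   var name weight pred resid rstud lev cooksd press_resid;
run;
```

The 4/n cutoff is only one of several conventions for the "how large is too large?" question. Flagged observations deserve a closer look rather than automatic deletion.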