ADVANCED RESEARCH METHODS: REGRESSION ANALYSIS THEORY AND MODELING
BY ERLAN BAKIEV, PH.D.

Faculty and Text
• Textbook:
– Keith, T. Z. (2006). Multiple Regression and Beyond. Pearson
– Montgomery, Peck, and Vining, Introduction to Linear Regression Analysis, 4th ed., 2006

Lecture Outline
• Overview of Regression Analysis
• Example
• Guidelines for Group Project (due week 5 class)

Goals of Today’s Lecture
• What is regression analysis
• Introduction to the most widely used statistical models
– Linear regression
– Logistic regression
• How these models are used to analyze data and inform decisions
– When different models are appropriate
– How to fit, interpret, and assess different models
• Practice with different data sets

A Note on SPSS
• SPSS stands for Statistical Package for the Social Sciences; it is one of the most widely used programs for statistical analysis in the social sciences.
• Widely used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others
– User friendly
• You will use a basic set of commands for model fitting
• Current version of SPSS is 18

Regression Analysis
• Regression analysis is a statistical tool for the investigation of relationships between variables
• The investigator seeks to understand the causal effect of one variable upon another, for example
– the effect of a price increase upon demand,
– the effect of changes in the money supply upon the inflation rate.
Types of Regression Models
• Two types of regression models
– Linear regression
  • Simple linear regression (continuous Y, one X)
  • Multiple linear regression (continuous Y, several Xs)
– Logistic regression
  • Binary Y, several Xs
• Linear regression forms the basis for understanding most regression techniques

Steps in a Regression Analysis
• Explore the data graphically
• Select a tentative structure
• Estimate the structure and its uncertainty
• Assess the plausibility of the tentative structure
• Use the estimated structure for your inferences (suitably qualified by the estimated uncertainty)

Simple Regression
• Regression analysis with a single explanatory variable is termed “simple regression.”
• Ex: Education-income relationship
• I = α + βE + ε, where
– α = a constant amount (what one earns with zero education);
– β = the effect in dollars of an additional year of schooling on income, hypothesized to be positive; and
– ε = the “noise” term reflecting other factors that influence earnings.

Multiple Regression
• It is a technique that allows additional factors to enter the analysis separately so that the effect of each can be estimated.
• Several independent variables can influence a dependent variable simultaneously.
• Ex: Influence of education (E) and experience (X) on income (I)
• The model is as follows: I = α + βE + γX + ε, where γ is expected to be positive

Model-Based Data Analysis
• Data sets consist of a set of observations
– Each observation contains
  • One dependent variable (“Y”)
  • One or more independent variables (“Xs”)
• We want to determine if there is a structure to the data set
– Relationship between the response and one or more predictors
– Response = structure(predictors) + “error”
• Structure
– Varies from observation to observation
  • Function of predictors
– Systematic, deterministic
• “Error”
– The error of an observation is the deviation of the observed value from the (unobservable) true function value.
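To make Response = structure(predictors) + “error” concrete, the following is a minimal Python sketch that fits the simple regression I = α + βE by least squares and then computes the residuals, which serve as estimates of ε. The function name fit_simple_regression and the education and income figures are made up for illustration; they are not course data, and in the course itself SPSS does this arithmetic.

```python
def fit_simple_regression(x, y):
    """Return (alpha, beta) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                  # slope: change in y per unit change in x
    alpha = mean_y - beta * mean_x    # intercept: fitted value of y at x = 0
    return alpha, beta


# Hypothetical data: years of schooling (E) and annual income in dollars (I).
education = [8, 10, 12, 12, 14, 16, 16, 18]
income = [21000, 26500, 30000, 33000, 37500, 41000, 44000, 49500]

alpha, beta = fit_simple_regression(education, income)

# Residual = observed income minus the fitted "structure" alpha + beta * E.
residuals = [i - (alpha + beta * e) for e, i in zip(education, income)]

print(f"alpha = {alpha:.0f}, beta = {beta:.0f} dollars per year of schooling")
print("residuals:", [round(r) for r in residuals])
```

The fitted line is the estimated structure and the residuals are what it leaves unexplained. Multiple regression works the same way with more than one predictor, although the arithmetic is then best left to statistical software.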
Modeling Process
• Fitting a model then requires:
– Selecting a tentative structure
– Estimating the structure and its uncertainty
  • Typically as statistics and their sampling distributions
– Assessing plausibility of the choice of structure
  • Model diagnostics

Characterizing the Relationship Between Two Variables
• Outline
– Types of variables
– Correlation
– Graphical methods
– Linear regression

Independent and Dependent Variables
• Each observation has two parts: a dependent Y and one or more independent Xs
• We are interested in the relationship of X and Y
– At a minimum, X and Y vary together
– X and Y are associated
• Statistical relationship does not imply causation
– Gujarati, D.N. (2003). Basic Econometrics, International Edition, 4th ed. McGraw-Hill Higher Education. pp. 22-24. ISBN 0-07-112342-3.

Examples
• X: income; Y: consumption
• X: parents’ SES; Y: education level
• X: education level; Y: income
• X: range to target; Y: probability of hit/kill
• The statistical methods we will study establish association
• Association does not entail causality

Types of Variables
• Continuous: variable can take any value in a (possibly infinite) range
– Money, height, blood pressure, weight
• Discrete: variable takes on a countable set of numerical values (often finite and small)
– People in a queue, hits on a target
• Ordinal: variable has a finite set of non-numeric but ordered values (categorical)
– Level of schooling, rating
• Nominal: variable has a finite set of non-numeric, unordered values
– Religion, gender

Methods of Examining Relationships
• Simple regression: X continuous; Y continuous
• Multiple regression: X continuous or discrete; Y continuous
• Logistic regression: X continuous or discrete; Y nominal
• Other combinations are possible
• These methods form the foundation of most other methods

Continuous X, Continuous Y
• Simplest case
• Correlation
– Closely related to simple regression
– Widely used in some substantive areas to characterize association
• Graphical methods
– Essential exploratory and diagnostic tool

A Graphic Illustration
[Figure: scatterplot of x1 against y1, correlation 0.8]

Correlation
• Definition
$\mathrm{Corr}(X, Y) = \dfrac{E[(Y - EY)(X - EX)]}{[\mathrm{Var}(Y)\,\mathrm{Var}(X)]^{1/2}} = \dfrac{\mathrm{Cov}(X, Y)}{[\mathrm{Var}(Y)\,\mathrm{Var}(X)]^{1/2}}$
• Attributes
– Measures how X and Y vary together linearly
– Scale-free
– Ranges from –1 to +1
– Zero correlation is not necessarily independence
• Note that this is pure association, with no specification of response/dependent and predictor/independent variables

Examples of Correlation
[Figure: six scatterplots of x1 against y1, with correlations 0.8, 0.5, 0.1, 0.0, -0.5 and -0.9]

Problems with Correlation
• Correlation is a single-number summary of association
• But
– Outliers can dominate the value (it is not robust)
– Zero correlation does not mean no relation
• By itself, it does not allow the prediction of Y from X
– This is usually of interest

Graphical Methods: Two-Way Scatterplot (1993 Car Data)
Stata commands:
    * read engine size and weight from the 1993 car data file
    infile _skip(11) EngSiz _skip(12) Wt _skip(1) using 93cars.dat
    * plot engine size against weight
    scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)")
[Figure: scatterplot of Engine Size (l) against Weight (lbs)]

Linear Regression
• In simple linear regression, we use the following model for the expected value of y given x:
– E(y|x) = β0 + β1x
• Relationship characterized by two numbers
– Straight line
• Wide applicability in practical situations
– Many relationships are approximately linear (or can be made so)
– Forms the basis for more sophisticated analysis

The “Best” Straight Line
• How do we construct the best line?
• Some ideas
– Minimize sum of distances from each Y to the line
– Minimize sum of absolute values of distances from each Y to the line
– Minimize sum of squared distances from Y to the line, $\sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2$

Regression Line for Car Data
Stata commands:
    * fit the least-squares line and save the fitted values
    regress EngSiz Wt
    predict pengsiz
    * overlay the fitted line on the scatterplot
    scatter EngSiz Wt, ytitle("Engine Size (l)") xtitle("Weight (lbs)") || line pengsiz Wt, clc(black)
Fitted line: EngSiz = -1.90 + 0.0015 Weight
[Figure: scatterplot of Engine Size (l) against Weight (lbs) with the fitted values drawn as a line]

Interpretation of Coefficients
• In many applications we are interested in the coefficients directly
– β1 is the change in expected value of y for a unit change in x (slope)
  • β1 = 0 means that x is not related to change in the mean of y
– β0 is the value of E(y) for x = 0 (intercept)
  • Very often of little interest because it reflects the choice of origin
  • Often outside the range of the x data values

Fit a Regression Line
We Can Graph the Result
[Figure: scatterplot of avg_scr against str with fitted values]

What We Need to Have to Do Inference
• Construct a 95% confidence interval for the slope
• Get p-values for the t-statistic
• This will allow us to state whether there is a statistically significant association between the independent and dependent variables
– Test the hypothesis that β1 = 0
• SPSS, of course, does the arithmetic
• In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. The lower the p-value, the less likely the result is if the null hypothesis is true, and consequently the more “significant” the result is, in the sense of statistical significance.

What does it mean to “explain” variation?
• General idea
– Want to assess whether the regression line “explains” the data better than simpler alternative models
– Variation left after model fit is “unexplained”
• $R^2$ indicates what percent of the variability in Y is accounted for by the regression model
• Some properties of $R^2$
– $0 \le R^2 \le 1$
– $R^2 = 0$: Regression line is horizontal
– $R^2 = 1$: Data fit perfectly on the regression line

Multiple Regression Models
• Extension of simple regression:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
• Together the βi are called the regression coefficients

In Policy Analysis Applications Indicator Variables Are Very Important
• Indicator / Categorical / Dummy Variables
• Often at least part of our data is qualitative
– Subject is male or female
– Location is urban or rural
– Others include ethnic group, insured vs. uninsured, high-school vs. post-high school
– Etc., etc., etc.
• In much of the modeling in health, most of the variables are qualitative
• We need to generate the 0,1 indicators

Will LM Approach Work for Other Types of Data?
• Suppose we have 0,1 data
– Success/failure, die/live, leave military/stay, ...
– Suppose we want to link p, the probability of success, with covariates (age, income, disease, etc.)?
• Least squares won’t work nicely … let’s take a look!

Example
• We want to know whether the presence of coronary heart disease (CHD) is related to age.
– If we take a group of people of a given age
  • What is the fraction that have CHD?
  • Equivalently, what is the probability that a person of a given age has CHD?
– We expect the probability to depend on age.

Example: Coronary Heart Disease
[Figure: chd (0/1) plotted against age]

Linear Regression on 0,1 CHD Data
[Figure: chd (0/1) and fitted values plotted against age]
• Fitted values are out of range

Why A Different Type of Regression?
• Logistic regression is the most commonly used generalization of multiple linear regression
• Output data is categorical with 2 categories
– Categorical: no metric, no order
– Usually coded as 0/1
– Terminology: failure/success
– Typical examples: dies/lives, does not/does have condition, does not/does marry, etc.
• As we’ve seen, linear regression can be inappropriate

Interpretation
• A one-year increase in age means that the odds of having CHD are about 11% higher than for someone a year younger
• But the probability that you have CHD is much different depending on age
• Logistic regression is often used to see how much additional risk is contributed by some risk factor (e.g. smoking)
– The coefficient shows how much the risk factor increases the odds of having some condition
– But the probability may still be small

Group Project
• Each of you will form a group to perform and report on a regression analysis of a set of data you select
– Multiple linear regression
– At least 100 observations (i.e., 100 y’s and 100 x1’s, x2’s, …, xk’s)
– No more than 1000 observations
– At least 5 x’s
– Data set and analysis goal must have my approval
– Analysis report due March 15, 2011
– Teams of up to three students
– Must do model fitting, interpretation, and write up results
• The paper is 10 pages maximum
• This includes all text, tables, graphics
• Write as a memo to explain
– What you are trying to do
– What you found out
• Content will depend on the data set and what you find

SPSS Commands for the Group Project
• Menu clear
• Analyze
• Regression
• Linear
• Move dependent and independent variables into boxes
• Statistics
– Descriptives
– Part and Partial Correlations
• Plots
– Histogram
– Normal probability plot
• Continue
• Stepwise
• OK

Resources to help you learn and use SPSS
http://www.ats.ucla.edu/stat/spss
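
Returning to the logistic regression interpretation in these notes, the sketch below illustrates why an 11% increase in the odds per year of age is compatible with very different probabilities at different ages. It is a minimal Python illustration, not the fitted CHD model: the intercept of -5.0 is a made-up value, and the slope is set to log(1.11) only so that the odds ratio per year matches the 11% figure. The names chd_probability and odds are likewise illustrative.

```python
import math


def chd_probability(age, intercept, slope):
    """Logistic model: P(CHD) = 1 / (1 + exp(-(intercept + slope * age)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * age)))


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical coefficients (assumptions, not estimates from the CHD data).
INTERCEPT = -5.0
SLOPE = math.log(1.11)   # chosen so that exp(SLOPE) = 1.11

for age in (30, 50, 70):
    p_now = chd_probability(age, INTERCEPT, SLOPE)
    p_next = chd_probability(age + 1, INTERCEPT, SLOPE)
    ratio = odds(p_next) / odds(p_now)   # odds ratio for one more year of age
    print(f"age {age}: P(CHD) = {p_now:.3f}, odds ratio for +1 year = {ratio:.2f}")
```

The odds ratio is the same at every age because it equals exp(slope), while the probability itself rises with age along an S-shaped curve that stays between 0 and 1.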