Two-Variable Regression Analysis: Some Basic Ideas
Jamie Monogan
University of Georgia
Intermediate Political Methodology
Jamie Monogan (UGA), Two-Variable Regression: Basic Ideas, POLS 7014

Objectives

By the end of this meeting, participants should be able to:
- Define the population and sample regression models and identify the components of each.
- Classify whether a model is linear in variables and whether it is linear in parameters.
- Explain the role of the stochastic disturbance in the population regression model and describe how this term might be interpreted.
- Estimate a two-variable linear model in R & Stata and graph the results over the raw data.

The Population Regression Model

- Regression models give us the conditional expectation of Y given X, $E(Y|X)$. This should be more informed than the unconditional expected value $E(Y)$.
- At the population level, if we connect all of the conditional expected values, we obtain a population regression curve.
- Generally speaking, we try to model the population regression function, $E(Y|X_i) = f(X_i)$. In other words, what function can represent the conditional expectation in the population?

Source: Gujarati & Porter 2009, 37

A Linear Population Regression Function

- It falls on the researcher to specify the functional form of the population regression function. (Recall: Attributes of a population are unknown.)
- A common specification is the linear population regression function: $E(Y|X_i) = \beta_1 + \beta_2 X_i$.
- Equivalent representation: $Y_i = \beta_1 + \beta_2 X_i + u_i$.
- $\beta_1$ is the population intercept, or the conditional expectation when $X_i = 0$.
- $\beta_2$ is the population slope coefficient.
- $\beta_1$ and $\beta_2$ are parameters.
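
To make the linear population regression function concrete, the following Python sketch builds a small artificial population in which the conditional mean of Y at each X falls exactly on the line $\beta_1 + \beta_2 X_i$. The parameter values, the X values, and the disturbances are all hypothetical and chosen only for illustration. The course itself works in R and Stata; Python is used here only as a neutral scratchpad.

```python
from statistics import mean

# Hypothetical population parameters (illustrative values only).
BETA_1 = 2.0  # population intercept
BETA_2 = 0.5  # population slope


def prf(x):
    """Linear population regression function: E(Y | X = x)."""
    return BETA_1 + BETA_2 * x


# Hypothetical disturbances; by construction they average to zero.
disturbances = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Build the population: Y_i = beta_1 + beta_2 * X_i + u_i,
# with the same set of disturbances at every value of X.
population = {x: [prf(x) + u for u in disturbances] for x in (0, 10, 20, 30)}

# The conditional mean of Y at each X recovers the regression line.
conditional_means = {x: mean(ys) for x, ys in population.items()}

for x, m in conditional_means.items():
    print(f"X = {x:2d}: mean of Y = {m:.1f}, line gives {prf(x):.1f}")
```

Individual outcomes differ from the line because of the disturbances, but the conditional means do not. In particular, the conditional mean at X = 0 equals the intercept.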
The Meaning of the Term Linear

- A model is linear in the variables if Y is a linear function of every X variable.
- A model is linear in the parameters if each parameter is only raised to the power 1 and is not multiplied or divided by any other parameter. (I.e., $\beta_1^2$, $\beta_1 \times \beta_2$, and $\beta_1 / \beta_2$ are all prohibited.)
- The linear regression model is linear in the parameters.
- A linear regression model might be linear in the variables, or it might not be.
- Hence, the linear regression model can produce a variety of geometric shapes.

Examples of Linear and Non-Linear Models

Which models are linear in the parameters?

- Income as a Function of Age: $Y_i = \beta_1 + \beta_2 X_i + u_i$
- Income as a Function of Age Squared: $Y_i = \beta_1 + \beta_2 X_i^2 + u_i$
- Probability of Voting as a Function of Income (MLE): $\Pr(Y_i = 1) = \frac{\exp(\beta_1 + \beta_2 X_i)}{1 + \exp(\beta_1 + \beta_2 X_i)}$
  Alternate form (Generalized linear model): $\Lambda^{-1}(\Pr(Y_i = 1)) = \beta_1 + \beta_2 X_i$
- Moving Average Model (Time Series): $z_t = \theta_0 - \theta_1 a_{t-1} + a_t$

The Role of the Disturbance Term

- We model the conditional expectation, but that does not mean we can perfectly predict the outcome.
- Hence, we have to say that a disturbance term is part of the model of the outcome. Namely, $u_i$ in the equation $Y_i = E(Y|X_i) + u_i$.
- The disturbance takes on different and unpredictable values for each observation.
- For example, consider several outcomes with the same input value (from Gujarati & Porter 2009, 40):
  $Y_1 = 55 = \beta_1 + \beta_2(80) + u_1$
  $Y_2 = 60 = \beta_1 + \beta_2(80) + u_2$
  $Y_3 = 65 = \beta_1 + \beta_2(80) + u_3$
  $Y_4 = 70 = \beta_1 + \beta_2(80) + u_4$
  $Y_5 = 75 = \beta_1 + \beta_2(80) + u_5$

The Meaning of the Disturbance Term

What information might be captured by the disturbance term?
1. Vagueness of theory.
2. Unavailability of data.
3. Core variables versus peripheral variables. (Careful here.)
4. Intrinsic randomness in human behavior.
5. Poor proxy variables. (Poses a substantial problem.)
6. Principle of parsimony. (Occam's razor.)
7. Wrong functional form. (Poses a substantial problem.)

The Sample Regression Model

- We usually have to estimate our models with a sample from the population.
- Hence, the sample regression function is: $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$.
- $\hat{\beta}_1$ and $\hat{\beta}_2$ are estimators or statistics. We use these to estimate population parameters.
- The numerical values we obtain from our estimator are called estimates.
- The estimate for $\hat{\beta}_1$ is the sample intercept and the estimate for $\hat{\beta}_2$ is the sample slope coefficient.
- $\hat{u}_i$ is the residual or error term. In contrast to p. 44, we don't really estimate $u_i$ as much as we predict it.

For Next Time

- Read Gujarati & Porter chapter 3 (Two-Variable Regression Analysis: The Problem of Estimation).
- From Gujarati & Porter, pp. 48-49, answer questions 2.3, 2.5, & 2.13.
- Open stateImmig0511.tab in R or Stata (source: http://hdl.handle.net/1902.1/16471). Again, you may want the file codebookStateImmig.txt to look up variable descriptions. From these data please report the following:
  - Write down a linear population regression model in which immigrant policy is a function of public ideology. (Same measures as last week.)
  - Estimate this regression model using the data. Report the results of your model in equation form as a sample regression model.
  - Create a scatterplot with immigrant policy on the vertical axis, public ideology on the horizontal axis, and the estimated regression line through the data.
  - Write a sentence or two describing how this line compares to the line you approximated by hand last time.
  - Include your code at the back of the document.
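
As a companion to the sample regression model, the sketch below computes a sample intercept, a sample slope, fitted values, and residuals for a tiny made-up sample. It relies on the standard least-squares formulas, which these slides do not derive (Gujarati & Porter chapter 3 covers estimation). The data values and the function name `ols_two_variable` are hypothetical; this is not the stateImmig0511 data, and Python stands in for the R or Stata code the assignment expects.

```python
# Hypothetical sample (illustrative values, not the course data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_two_variable(xs, ys):
    """Return (b1_hat, b2_hat) using the standard least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b2_hat = s_xy / s_xx             # sample slope
    b1_hat = y_bar - b2_hat * x_bar  # sample intercept
    return b1_hat, b2_hat


# Estimates: the numerical values produced by the estimators.
b1_hat, b2_hat = ols_two_variable(xs, ys)

# Fitted values Y_hat_i and residuals u_hat_i = Y_i - Y_hat_i.
fitted = [b1_hat + b2_hat * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]

print(f"Y_hat = {b1_hat:.3f} + {b2_hat:.3f} X")
for x, y, u_hat in zip(xs, ys, residuals):
    print(f"X = {x:.0f}, Y = {y:.1f}, residual = {u_hat:+.3f}")
```

The residuals are computed from the estimates, whereas the disturbances $u_i$ in the population model are never observed. With an intercept in the model, least-squares residuals sum to zero up to rounding error.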