Lectures 8 and 9: Bivariate Regression and Correlation
Oct. 19, 23, 2006

A. Linear regression
   1. Describing the form of the association between X and Y
      a. Y = a + bX is the formula for a straight line
         (1) X and Y represent the values of variables
         (2) a and b are statistics that describe how X affects Y
             (a) for any given set of values for X and Y, a and b are constants
         (3) a is the Y-intercept
             (a) the point where the regression line crosses (intercepts) the Y axis
             (b) the value of Y when X = 0
                 (i) in some data, no case can have a value of 0
             (c) a = µY – bµX
         (4) b is the regression coefficient, which tells us how much Y changes with a one-unit change in X
             (a) the sign of the regression coefficient indicates the direction of the association, i.e., whether X's effect on Y is positive or negative
   2. Predicting the value of Y when we know the value of X, because we know the values of the constants a and b
      a. use of test scores to predict subsequent performance
      b. measuring people's socioeconomic status
      c. in research on the effect of X on Y, we can plug various values of X into the regression equation and get an expected value of Y, which helps to illustrate our results
   3. Estimating the values of a and b in a set of data
      a. estimating a and b is easy with only 2 points, but with more than two points we need some criterion for estimating the slope of the regression line and its intercept
      b. we want the line that predicts Y from X with a minimum of error, where we think of error as the sum of the distances between every value of Y and its predicted value, given X
         (1) we refer to the value of Y we predict based on X as Ŷ
      c. an obvious possibility is to place the line so that it minimizes the total deviations, Σ(Yi – Ŷi)
      d. but more than one line can satisfy that criterion
      e. however, minimizing the sum of the squared deviations of each value of Y from its predicted value, Σ(Yi – Ŷi)², produces a unique line
         (1) demonstrating that this is true and deriving the formula involves calculus
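The estimation steps above can be sketched in a few lines of code. This is a minimal illustration of computing a and b from data (the data values and function name are made up for the example, not part of the lecture):

```python
# Minimal sketch of fitting a line Y = a + bX by least squares.
# Illustrative only; the data below are invented.

def ols_fit(x, y):
    """Return (a, b) minimizing the sum of squared deviations Σ(Yi - Ŷi)²."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = covariance of X and Y divided by variance of X
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b = cov_xy / var_x
    # a = mean of Y minus b times mean of X (the line passes through the means)
    a = mean_y - b * mean_x
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = ols_fit(x, y)
print(a, b)        # intercept and slope
print(a + b * 6)   # predicted value Ŷ when X = 6
```

Plugging a new value of X into the fitted equation, as in the last line, is exactly the prediction use described in point 2.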
      f. we use the least-squares criterion to calculate b, the slope of the regression line, which is why we refer to this type of regression as ordinary least squares (OLS) regression
      g. consequences of using squared deviations to estimate b
   4. What does it mean when you get a regression coefficient byx = 0? It means that with a one-unit change in X, there would be no change in Y. This could happen for two reasons:
      a. X and Y are completely unrelated
      b. Y is a constant

B. Correlation versus regression
   1. b tells us how much Y changes with each one-unit change in X, whereas r tells us the strength of the association between X and Y; both r and b tell us the direction of that association (positive or negative)
   2. r does not tell us how much X affects Y; to answer that question we use regression
   3. r is symmetric (ryx = rxy), whereas b is asymmetric (byx seldom equals bxy)
      a. we can regress X on Y, which would produce a different regression line unless |r| = 1
   4. the range of r is from -1.0 to +1.0, whereas b can take any value from –infinity to +infinity
   5. formulas: the formulas for b and r have the same numerator, the covariance of X and Y
      a. b is the ratio of the covariance of X and Y to the variance of X
         (1) byx = covXY / σ²X
         (2) why should we divide the covariance by σ²X?
             (a) so our estimate of the effect of X will be in the units in which Y was measured
      b. r is the ratio of the covariance of X and Y to the product of the standard deviations of X and Y
         (1) ryx = covXY / σXσY
         (2) why divide the covariance of X and Y by σXσY?
             (a) to make r a unit-free or standardized measure by canceling out the units in which both X and Y were measured
             (b) standardizing r to make it a unit-free measure limits its range to -1.0 to +1.0
   6. Thus, b is measured in units of Y, whereas r is measured in standard units (i.e., it is unit-free)
   7. Given the similarity of b's and r's formulas, you can convert one to the other:
      a. byx = ryx (σY/σX)
      b. ryx = byx (σX/σY)
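The contrast between b and r sketched above can be checked numerically: swapping X and Y changes b but not r, and the two are linked by the ratio of standard deviations. A short sketch with invented data (the helper functions `cov`, `slope`, and `corr` are illustrative names, not a standard API):

```python
# Sketch of b versus r: b is asymmetric, r is symmetric,
# and b_yx = r_yx * (s_y / s_x). Data are invented for illustration.
import statistics

def cov(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

def slope(x, y):
    # b_yx = cov(X, Y) / variance of X (population formulas throughout)
    return cov(x, y) / statistics.pvariance(x)

def corr(x, y):
    # r = cov(X, Y) / (sd(X) * sd(Y))
    return cov(x, y) / (statistics.pstdev(x) * statistics.pstdev(y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b_yx, b_xy = slope(x, y), slope(y, x)
r = corr(x, y)
print(b_yx, b_xy)      # generally not equal: regressing Y on X vs. X on Y
print(r, corr(y, x))   # equal: correlation is symmetric
# conversion formula from point 7: b_yx = r_yx * (s_y / s_x)
print(r * statistics.pstdev(y) / statistics.pstdev(x))  # matches b_yx
```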
   8. The interpretation of b is the change in Y with a one-unit change in X; r² has a PRE (proportional reduction in error) interpretation
      a. this is true because we can decompose the total variation in Y (to be accurate, the total sum of squares in Y, Σ(Yi – Ȳ)²) into two components:
         (1) the portion of the variation in Y that X explains: Σ(Ŷi – Ȳ)², the "explained sum of squares"
         (2) the portion of the variation in Y that X does not explain: Σ(Yi – Ŷi)², the "unexplained sum of squares"
      b. the ratio of the explained sum of squares to the total sum of squares tells us the proportion of the variation in Y that X explains
         (1) this ratio = r², which is called the coefficient of determination
   9. Any restriction on the range of X or Y affects the values of both b and r
   10. Outliers strongly affect the values of both b and r

C. Limits of the formula Ŷ = a + bX
   1. It assumes that only X influences the value of Y; it ignores unmeasured variables that influence the value of Y
   2. It ignores the difficulty of perfectly measuring our variables; random measurement/coding errors occur
   3. To address these problems, we need to modify the formula for Y:

      Yi = a + bXi + Ui

      a. We call Ui the "disturbance" term, since it stands for all the unobserved factors that affect the value of Y. These disturbances may include other possible independent variables that we omitted from the regression as well as random factors like measurement or sampling error.
      b. we cannot estimate the value of Ui because, by definition, we cannot observe it
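The sums-of-squares decomposition behind r² can also be verified numerically. A minimal sketch with invented data, confirming that the total sum of squares equals the explained plus unexplained parts and that their ratio gives r²:

```python
# Sketch of the decomposition: total SS = explained SS + unexplained SS,
# and r² = explained SS / total SS. Data are invented for illustration.

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# OLS slope and intercept, then the predicted values Ŷi
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x
yhat = [a + b * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)             # total: Σ(Yi – Ȳ)²
ess = sum((yh - mean_y) ** 2 for yh in yhat)          # explained: Σ(Ŷi – Ȳ)²
uss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained: Σ(Yi – Ŷi)²

print(tss, ess + uss)   # the two should agree (up to rounding)
r2 = ess / tss          # coefficient of determination
print(r2)               # proportion of the variation in Y that X explains
```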