Class 3

Lectures 8 and 9: Bivariate Regression and Correlation
Oct. 19, 23, 2006
A. Linear regression
1. Describing the form of the association between X and Y
a. Y = a + bX is the formula for a straight line
(1) X and Y represent the values of variables
(2) a and b are statistics that describe how X affects Y
(a) for any given set of values for X and Y, a and b are constants
(3) a is the Y-intercept
(a) the point where the regression line crosses (intercepts) the Y axis
(b) the value of Y when X = 0
(i) in some data, no case can have X = 0
(c) a = µY – bµX
(4) b is the regression coefficient which tells us how much Y changes with one unit
change in X
(a) the sign of the regression coefficient indicates the direction of the
association—whether X’s effect on Y is positive or negative
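The relationships above can be sketched in a few lines of Python (the data and the slope value here are hypothetical): given b, the intercept follows from the means via a = µY – bµX, which forces the regression line through the point (µX, µY).

```python
# Minimal sketch of the intercept formula, assuming hypothetical data
# and a slope b that is already known.
X = [1, 2, 3, 4, 5]          # hypothetical predictor values
Y = [2, 4, 5, 4, 5]          # hypothetical outcome values

mean_x = sum(X) / len(X)
mean_y = sum(Y) / len(Y)

b = 0.6                       # assumed slope (hypothetical)
a = mean_y - b * mean_x       # intercept computed from the means

# The fitted line evaluated at mean(X) returns mean(Y) exactly.
y_at_mean_x = a + b * mean_x
```

Evaluating the line at µX recovering µY is not an accident: it follows directly from how a is defined.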
2. Predicting the value of Y when we know the value of X because we know the value of the
constants, a and b
a. use of test scores to predict subsequent performance
b. measuring people’s socioeconomic status
c. in research on the effect of X on Y, we can plug various values of X into the regression
equation and get an expected value of Y which helps to illustrate our results
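As a minimal sketch of point 2 (the constants a and b below are hypothetical), prediction is just plugging values of X into Ŷ = a + bX:

```python
# Hypothetical constants; in practice a and b are estimated from data.
a, b = 2.2, 0.6

def predict(x):
    """Expected value of Y for a given X under the fitted line."""
    return a + b * x

# Plug several values of X into the regression equation.
predictions = [predict(x) for x in (0, 5, 10)]
```

Note that predict(0) returns a itself, illustrating that the intercept is the expected Y when X = 0.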
3. Estimating the values of a and b in a set of data
a. estimating a and b is easy if we have only 2 points, but if there are more than two points, we
need some criterion for choosing the slope of the regression line and its intercept
b. we want the line that will predict Y from X with a minimum of error, where we think of
error as the sum of the distances between each value of Y and its predicted value, given X
(1) we refer to the value of Y we predict based on X as Ŷ
c. an obvious possibility is to place the line so it minimizes the total deviations (Yi – Ŷi)
d. but more than one line can satisfy that criterion
e. however, the sum of the squared deviations of each value of Y from its predicted value
Ŷ produces a unique line
(1) demonstrating that this is true and deriving the formula involves calculus
f. We use the least-squares criterion to calculate b, the slope of the regression line, which
is why we refer to this type of regression as ordinary least squares regression
g. consequences of using squared deviations to estimate b
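A sketch of the least-squares estimates, using hypothetical data: minimizing the sum of squared deviations Σ(Yi – Ŷi)² yields the familiar formulas b = Σ(Xi – µX)(Yi – µY) / Σ(Xi – µX)² and a = µY – bµX.

```python
# Ordinary least squares by hand on hypothetical data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

mx = sum(X) / len(X)
my = sum(Y) / len(Y)

# Numerator: sum of cross-products of deviations; denominator: variation in X.
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)

b = sxy / sxx            # least-squares slope
a = my - b * mx          # least-squares intercept

# Sum of squared deviations of Y from its predicted value under this line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
```

No other straight line can produce a smaller sum of squared deviations for these points, which is the sense in which the least-squares line is unique.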
4. What does it mean when you get a regression coefficient byx = 0? It means that with one
unit change in X, there would be no units change in Y. This could happen for two reasons:
a. X and Y are completely unrelated
b. Y is a constant
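A tiny check of the second case, with hypothetical data: if Y is a constant, the covariance of X and Y is zero, so the slope b = cov / var(X) is zero as well.

```python
# Hypothetical data in which Y does not vary at all.
X = [1, 2, 3, 4]
Y = [7, 7, 7, 7]

mx = sum(X) / len(X)
my = sum(Y) / len(Y)
cov_xy = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / len(X)
var_x = sum((x - mx) ** 2 for x in X) / len(X)

b = cov_xy / var_x   # 0.0: one unit change in X changes Y by nothing
```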
B. Correlation versus regression
1. b tells us how much Y changes with each one unit change of X, whereas r tells us the
strength of the association between X and Y; both r and b tell the direction of that association
(positive or negative)
2. r does not tell us how much X affects Y; to answer that question we use regression
3. r is symmetric (ryx = rxy), whereas b is asymmetric (byx seldom = bxy)
a. we can regress X on Y, which would produce a different regression line unless r = ±1
4. the range of r is from -1.0 to +1.0, whereas b can take any value from – infinity to + infinity
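Points 3 and 4 can be verified numerically with hypothetical data: r comes out the same whichever variable we treat as dependent, while the two slopes byx and bxy differ (a useful side fact is that byx · bxy = r²).

```python
# Symmetry of r versus asymmetry of b, on hypothetical data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
var_x = sum((x - mx) ** 2 for x in X) / n
var_y = sum((y - my) ** 2 for y in Y) / n

b_yx = cov / var_x                        # slope from regressing Y on X
b_xy = cov / var_y                        # slope from regressing X on Y
r = cov / (var_x ** 0.5 * var_y ** 0.5)   # identical either way
```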
5. formulas: same numerator in formulas for b and r: the covariance of X and Y
a. b = ratio of the covariance of X and Y to the variance in X
(1) byx = covyx / σ²x
(2) why should we divide the covariance by σ²x?
(a) so our estimate of the effect of X will be in the units in which Y was
measured.
b. r = ratio of the covariance of X and Y to the standard deviations of both X and Y
(1) ryx = covx,y / (σxσy)
(2) why divide the covariance of X and Y by σxσy?
(a) to make r a unit-free or standardized measure by canceling out the
units in which both X and Y were measured
(b) standardizing r to make it a unit-free measure limits its range to
-1.0 to +1.0
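A sketch of why the two divisors matter, using hypothetical data: rescaling Y (say, dollars to cents) rescales b by the same factor, because b is expressed in Y's units, while r is unchanged because the standard deviations cancel the units out.

```python
# b depends on Y's units; r does not. Data are hypothetical.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

def slope_and_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / var_x, cov / (var_x ** 0.5 * var_y ** 0.5)

b1, r1 = slope_and_r(X, Y)
b2, r2 = slope_and_r(X, [100 * y for y in Y])   # same Y in 100x smaller units
```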
6. Thus, b is measured in units of Y, whereas r is measured in standard units (i.e., it is unit
free)
7. Given the similarity of b’s and r’s formulae, you can convert one to the other
a. byx = ryx (σy/σx)
b. ryx = byx (σx/σy)
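The conversion in point 7 can be checked directly on hypothetical data: multiplying r by σy/σx recovers b, and multiplying b by σx/σy recovers r.

```python
# Converting between b and r, on hypothetical data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sd_x = (sum((x - mx) ** 2 for x in X) / n) ** 0.5
sd_y = (sum((y - my) ** 2 for y in Y) / n) ** 0.5

b = cov / sd_x ** 2        # slope in units of Y
r = cov / (sd_x * sd_y)    # unit-free correlation

b_from_r = r * (sd_y / sd_x)   # formula a. above
r_from_b = b * (sd_x / sd_y)   # formula b. above
```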
8. interpretation of b is change of Y with one unit change in X; r2 has a PRE interpretation
a. this is true because we can decompose the total variation in Y (to be accurate, the total
sum of squares in Y—Σ(Yi – Ȳ)²) into two components:
(1) the portion of the variation in Y that X explains: Σ(Ŷi – Ȳ)²—“explained sum
of squares”
(2) the portion of the variation in Y that is not explained by X: Σ(Yi – Ŷi)²—
“unexplained sum of squares”
b. the ratio of the explained sum of squares to the total sum of squares tells us the
proportion of the variation in Y that X explains
(1) this ratio = r2 which is called the coefficient of determination
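The decomposition in point 8 can be verified numerically with hypothetical data: the total sum of squares equals the explained plus the unexplained sums of squares, and the ratio of explained to total equals r².

```python
# Sum-of-squares decomposition and the coefficient of determination,
# on hypothetical data.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx
Y_hat = [a + b * x for x in X]                        # predicted values

sst = sum((y - my) ** 2 for y in Y)                   # total sum of squares
ssr = sum((yh - my) ** 2 for yh in Y_hat)             # explained sum of squares
sse = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))   # unexplained sum of squares

r_squared = ssr / sst   # proportion of variation in Y that X explains
```

The PRE interpretation follows: knowing X reduces our error in predicting Y by the proportion r² relative to guessing the mean of Y every time.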
9. Any restrictions on the range of X or Y affect the values of both b and r
10. Outliers strongly affect the value of both b and r
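A quick illustration of point 10, with hypothetical data: appending a single extreme point can change both b and r substantially, here even flipping their signs.

```python
# Effect of one outlier on b and r, with hypothetical data.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / var_x, cov / (var_x ** 0.5 * var_y ** 0.5)

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

b_clean, r_clean = fit(X, Y)                # positive association
b_out, r_out = fit(X + [6], Y + [-20])      # one extreme point appended
```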
C. Limits with formula Ŷ = a + bX
1. It assumes that only X influences the value of Y; it ignores unmeasured variables that
influence the value of Y
2. It ignores the difficulty of perfectly measuring our variables; random measurement/coding
errors occur
3. To address these problems, we need to modify the formula for Y:
Yi = a + bXi + Ui
a. We call Ui the “disturbance” term since it stands for all the unobserved factors that
affect the value of Y. These disturbances may include other possible independent
variables that we omitted from the regression as well as random factors like measurement
or sampling error.
b. we cannot estimate the value of U because, by definition, we cannot observe it
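The role of the disturbance term can be sketched by simulation (all values below are hypothetical): we generate Yi = a + bXi + Ui with a known U, fit the regression, and note that the fitted residuals only approximate U, since in real data U is never observed.

```python
# Simulating the disturbance term Ui; all parameters are hypothetical.
import random

random.seed(0)
a_true, b_true = 2.0, 0.5
X = [float(i) for i in range(1, 51)]
U = [random.gauss(0, 1) for _ in X]                  # unobserved disturbances
Y = [a_true + b_true * x + u for x, u in zip(X, U)]  # Yi = a + b*Xi + Ui

# Fit by ordinary least squares.
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
b_hat = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a_hat = my - b_hat * mx

# Residuals stand in for U but are not U itself.
residuals = [y - (a_hat + b_hat * x) for x, y in zip(X, Y)]
```

Because U here averages out to roughly zero, the estimates land near the true a and b; omitted variables correlated with X would not average out this way.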