EDF 802
Dr. Jeffrey Oescher
Regression Analysis – Linear Regression

I. Introduction
   A. Description
      1. The process of predicting or estimating scores on a Y variable based on knowledge of scores on an X variable (i.e., the regression of Y on X).
      2. The term linear is based on the assumption that the relationship between Y and X can be represented by a straight line. This straight line is the regression line and represents how, on average, a change in the X variable is associated with a change in the Y variable.
   B. Applications
      1. Make predictions from X to Y
      2. Identify the proportion of variance accounted for in the dependent variable (i.e., Y) based on our knowledge of the independent variable (i.e., X)
      3. Estimate residual values on Y having removed the effect of X
   C. Data
      1. Interval or ratio scales for both variables
   D. Variables
      1. The variable one is predicting from is known as the independent or predictor variable
      2. The variable one is predicting to is known as the dependent or criterion variable
   E. Limitations
      1. Addresses only linear data
      2. Uses only one predictor

II. Two major issues to be discussed
   A. Describing the relationship
   B. Inferentially testing the magnitude of the relationship

III. Descriptive linear regression
   A. Purpose – estimating Y from X
   B. The general linear regression equation
      1. Yi' = bXi + c
         a. Yi' is the predicted Y score
         b. b is the regression coefficient
         c. Xi is the score on X
         d. c is the Y intercept
      2. An example of a regression equation
         a. Y = 0.5X + 2.0
         b. For each unit of change in X we see one-half a unit of change in Y
      3. An example of a regression line
         a. The intercept is the value of Y when X is zero (0)
         b. The slope is the amount of change in Y that corresponds to a change of one (1) unit in X
         c. Graphing a regression line
   C. Calculating the regression equation
      1. Scatterplot of sample data – see the attached page
         a. Data set D5 containing logical reasoning scores (X) and creativity scores (Y) for 20 subjects
      2. Least squares criterion – see the attached page
         a. Residuals
            (1) ei = (Yi – Yi')
            (2) The values of b and c are derived so that ∑ei² is minimized
      3. Formula for c: c = Ȳ – bX̄
      4. Formula for b: b = r(sy / sx)
      5. SPSS – Windows programming and output
      6. Regression equation for D5 sample data
         a. Y' = (0.65)X + 5.23
         b. Predicting Y from X using the formula (see the illustrative sketch following this section)
            (1) If X is 7, Y' is 9.78
            (2) If X is 12, Y' is 13.03
            (3) If X is 17, Y' is 16.28
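A minimal Python sketch of the descriptive calculations in Section III C: it computes b and c from the formulas above (b = r(sy/sx), c = Ȳ – bX̄) and predicts Y' for several values of X. The scores, variable names, and function names below are illustrative assumptions; they are not the D5 data set referenced on the attached page, and this is not the course's SPSS procedure.

    # Sketch of Section III: computing b, c, and predicted Y' values.
    # The x and y values are made-up stand-ins, NOT the D5 data set;
    # only the formulas follow the outline.

    import numpy as np

    def regression_coefficients(x, y):
        """Return (b, c) for the regression of Y on X using
        b = r * (s_y / s_x) and c = mean(Y) - b * mean(X)."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
        b = r * (y.std(ddof=1) / x.std(ddof=1))      # regression coefficient (slope)
        c = y.mean() - b * x.mean()                  # Y intercept
        return b, c

    def predict(x, b, c):
        """Predicted scores Y' = bX + c."""
        return b * np.asarray(x, dtype=float) + c

    # Hypothetical logical-reasoning (x) and creativity (y) scores
    x = [4, 6, 7, 9, 11, 12, 14, 17]
    y = [7, 9, 10, 11, 12, 13, 15, 16]

    b, c = regression_coefficients(x, y)
    print(f"Y' = ({b:.2f})X + {c:.2f}")
    print(predict([7, 12, 17], b, c))                # predicted creativity scores

With the hypothetical scores above the printed equation will differ from the D5 result (Y' = 0.65X + 5.23); the point is only to show how the two formulas combine to yield predicted values.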
IV. Inferential testing and linear regression
   A. Purpose – to determine whether the observed relationship between X and Y is of sufficient magnitude to suggest a relationship truly exists
      1. If the relationship is zero, our knowledge of X will not help predict Y
      2. If the relationship is not zero, our knowledge of X will help predict Y
   B. Errors in prediction
      1. An error in prediction can be notationally represented as ei and is equal to (Yi – Yi')
         a. Errors can be found above (positive) and below (negative) the regression line
         b. The regression line is developed to reduce the sum of all squared errors to a minimum
      2. There is a need to estimate the characteristics of the error terms
         a. Average error
         b. Variation of errors
         c. Shape of the distribution of errors
      3. An estimate of the average error
         a. By definition, the sum of all errors ∑ei = 0
         b. Thus the mean error is also zero (0)
         c. See the accompanying sample data
      4. Estimates of the variance and standard deviation of the distribution of errors
         a. Conceptually, the variance of the distribution of error terms is the sum of the squared deviation scores (i.e., ∑(ei – ē)²) divided by the appropriate degrees of freedom
         b. Since ē (i.e., the mean error) equals 0, this equation can be simplified
         c. The following formula represents the variance of the distribution of the errors: s²y·x = ∑ei² / (n – 2)
         d. Taking the square root of the variance produces a statistic called the standard error of estimate: sy·x = √[∑ei² / (n – 2)]
            (1) This represents the "standard deviation" of the distribution of error terms
            (2) It is critically important to the inferential analyses related to linear regression
                (a) Calculating confidence intervals for estimating the range of predicted values of Y
                (b) Calculating the error term for the test statistic used to examine the significance of the regression coefficient
            (3) If the relationship between the two variables is high, the standard error of estimate is small; if the relationship is low, the standard error of estimate is large
            (4) Conditional distributions
                (a) Distributions of actual scores around the predicted scores
                (b) Homoscedasticity
                    i) The variation associated with all conditional distributions is the same
                    ii) This is a major assumption of linear regression
   C. Testing the significance of the regression coefficient
      1. The concern related to the need for inferential analysis
         a. If the relationship is zero (0), the regression coefficient b is zero (0) and our knowledge of X will NOT help predict Y
         b. Thus the issue becomes how different from zero (0) must the regression coefficient be in order to statistically enhance the prediction of Y?
      2. The relationship between linear regression and correlation
         a. If r = ±1.00 there are no residual errors
         b. If r = 0 then there is no regression line
            (1) If r = 0 then the formula for b is equal to zero (0) (see Section III C 4 above)
            (2) This suggests the slope of the regression line is 0 (i.e., it is parallel to the X axis), which in turn implies that regardless of the value of X, Y' will be equal to the intercept c (c = Ȳ – bX̄ = Ȳ – 0·X̄ = Ȳ)
            (3) If r = 0, the best prediction of Y' is Ȳ
   D. Inferential logic
      1. Hypotheses
         a. H0: β = 0
         b. H1: β ≠ 0
      2. Test statistic and sampling distribution (a worked sketch appears at the end of this outline)
         a. t = b / sb
         b. sb is the standard error of the regression coefficient and is computed as follows: sb = sy·x / √SSx
         c. The statistic is distributed as a t-distribution with n – 2 df
         d. Assumptions
            (1) Random selection
            (2) Normal distributions of Y's for each X
            (3) Homoscedasticity
   E. An alternative method for evaluating this inferential question is to test the null hypothesis ρ = 0

V. Determining residual values
   A. Residuals represent the value of Y having removed the effect of X
   B. Useful in multiple regression contexts

VI. Numerical example
   A. SPSS-Windows programming
   B. Output
   C. Interpreting the output
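The sketch below works through the inferential pieces of Section IV in Python rather than SPSS: the residuals ei, the standard error of estimate sy·x, the standard error of the regression coefficient sb, and the t test of H0: β = 0 with n – 2 df. All scores and names are illustrative assumptions; this is not the course's numerical example or its SPSS output, and scipy's linregress is included only as an independent cross-check.

    # Sketch of Section IV: standard error of estimate, s_b, and t test of H0: beta = 0.
    # The scores are hypothetical; this is not the D5 data or the course SPSS output.

    import numpy as np
    from scipy import stats

    x = np.array([4, 6, 7, 9, 11, 12, 14, 17], dtype=float)
    y = np.array([7, 9, 10, 11, 12, 13, 15, 16], dtype=float)
    n = len(x)

    r = np.corrcoef(x, y)[0, 1]
    b = r * (y.std(ddof=1) / x.std(ddof=1))    # slope, b = r(sy/sx)
    c = y.mean() - b * x.mean()                # intercept, c = mean(Y) - b*mean(X)

    e = y - (b * x + c)                        # residuals e_i = Y_i - Y_i'
    s_yx = np.sqrt(np.sum(e**2) / (n - 2))     # standard error of estimate
    ss_x = np.sum((x - x.mean())**2)           # SS_x
    s_b = s_yx / np.sqrt(ss_x)                 # standard error of b

    t = b / s_b                                # test statistic, df = n - 2
    p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-tailed p value
    print(f"b = {b:.3f}, s_b = {s_b:.3f}, t({n - 2}) = {t:.2f}, p = {p:.4f}")

    # Cross-check with scipy's built-in simple regression
    res = stats.linregress(x, y)
    print(f"linregress: slope = {res.slope:.3f}, p = {res.pvalue:.4f}")

Note that the residual vector e computed here is the same quantity discussed in Section V (the value of Y with the effect of X removed), and the p value from the hand calculation should match the linregress cross-check.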