QMeth201 - Winter 2011 - BB and BC Sections - Mohsen Sharifani
Review handout for Regression analysis

A: Phases of Regression analysis:

a) Draw a scatterplot to see whether:
   a. It looks “friendly”. If so, there is most probably a connection between X and Y, and some sort of regression analysis [not necessarily linear] should be able to describe the relationship.
   b. It behaves linearly. If so, we are allowed to use linear regression, the only regression technique that we learn in this course.

b) Find the regression equation:
   a. Compute r, the coefficient of correlation:
      i. By Excel: use the CORREL function.
      ii. By hand: check your blue sheet. To find r, you need Sx,y, Sx and Sy.
         1. Hint: If the problem gives you $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, do not forget to divide it by n-1 in order to get Sx,y.
   b. Compute b, the slope of the regression line. Check your blue sheet for the formula. Note: see section B, where we explain what b intuitively means.
   c. Compute a, the intercept of the regression line.
   d. Write your regression equation: Yexpected = a + b * X

c) Error analysis (how much can we trust this regression equation?)
   The main statistical measure of the error size is the standard error of estimate, Se. There are two ways to find it:
   a. By hand:
      i. Find the expected Y for each data point you have.
      ii. Find the residuals (residual = actual - expected).
         1. Hint: be careful with the residual formula. It is NOT “expected - actual”.
      iii. Calculate residual^2 for each data point.
      iv. Add them up, divide by n-2, and take the square root.
      v. Finding Se is in many ways similar to finding a standard deviation. It is in fact the standard deviation of the residual values, except that for Se we divide the sum of squares by n-2 rather than n-1. One might point out that the procedure described above is not exactly the same as the algorithm for finding a standard deviation.
      In that algorithm, we subtract the average from the data values and then square the differences, whereas for Se we just square the residuals without subtracting the average of the residuals from them. However, note that the average of the residuals is zero[1]. Thus, [Residual_i - E(residuals)]^2 is simply equal to Residual_i^2.
   b. By formula: check your blue sheet.

d) Analyzing the magnitude of error (answering the question: are we surprised by…?):
   a. In general, whenever we want to figure out whether a data point is within our expected range, we calculate the standardized score for that data point, $\frac{X_i - X_{\text{expected}}}{S_x}$. If its absolute value is less than 1, it is pretty much within the range that we expect. If its absolute value is between 1 and 2, then it is mostly subjective to decide whether the data point is within the expected range or not[2]. A standardized score with absolute value greater than 2, however, is outside the expected range, so it is “strange”.
   b. In this specific case, when we want to see whether a y value is within our expected range, we calculate the residual for that y and find its standardized score. Note that the expected value of the residuals, which is the average of the residuals, is equal to zero, and the standard deviation of the residuals is the standard error of estimate, Se. So the standardized score for a residual is $\frac{\text{Residual}}{S_e}$.

B: The concepts that you are supposed to know (summary)

1) r: coefficient of correlation. It is a number between -1 and 1 and shows how linearly correlated our data is. If our data points had a perfect linear shape, r would be either -1 or 1, but in practice that does not happen.
2) b: slope of the regression line. It shows how much Y would change if X increases by 1 unit.
3) r^2: percentage of variability in y that can be explained by variability in x. The larger r^2 is, the better the prediction we can make.
4) Adjusted r^2: the concept is the same as the regular r^2, except that it is more precise when our sample size is small. (Check formula.)
5) Se: standard error of estimate. It shows how much error, on average, we make in our estimation process.

[1] That is the way we find the formula for b and a. We do not go that far in this course though.
[2] In other words, if you get such a question in the exam, you can answer either “Yes, this sounds out of range to me” or “No, this data does not look strange to me” and both answers would be correct.
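
To tie the phases in section A to the concepts in section B, here is a minimal Python sketch of the by-hand route. It is only an illustration under stated assumptions: the function names (fit_line, standard_error, adjusted_r2, surprise) are made up for this handout, the slope is computed as b = r * Sy / Sx and the intercept as a = ybar - b * xbar (standard textbook forms; your blue sheet may write them differently), adjusted r^2 uses the usual one-predictor textbook formula, which this handout does not state, and the five data points are invented purely for demonstration.

```python
import math


def fit_line(xs, ys):
    """Return (r, a, b) for the line Yexpected = a + b * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and standard deviations all divide by n-1.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = s_xy / (s_x * s_y)
    b = r * s_y / s_x        # assumed textbook form of the slope
    a = y_bar - b * x_bar    # assumed textbook form of the intercept
    return r, a, b


def standard_error(xs, ys, a, b):
    """Return (residuals, Se); residual = actual - expected, divisor n-2."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    n = len(xs)
    s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return residuals, s_e


def adjusted_r2(r, n):
    """Usual adjusted r^2 for one predictor (assumed, not from the handout)."""
    return 1 - (1 - r ** 2) * (n - 1) / (n - 2)


def surprise(residual, s_e):
    """Classify a residual by its standardized score, Residual / Se."""
    z = abs(residual / s_e)
    if z < 1:
        return "within expected range"
    if z <= 2:
        return "judgment call"
    return "strange"


# Invented data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, a, b = fit_line(xs, ys)
residuals, s_e = standard_error(xs, ys, a, b)

print("r =", round(r, 4), " r^2 =", round(r ** 2, 4))
print("Yexpected =", round(a, 4), "+", round(b, 4), "* X")
print("Se =", round(s_e, 4))
print("adjusted r^2 =", round(adjusted_r2(r, len(xs)), 4))
for x, e in zip(xs, residuals):
    print("x =", x, " residual =", round(e, 4), " ->", surprise(e, s_e))
```

For this invented data the fitted line comes out as Yexpected = 2.2 + 0.6 * X with r^2 = 0.6 and Se of about 0.894. The residuals add up to zero (up to rounding), which is the reason squaring them directly gives the same result as subtracting their average first.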