Review handout for Regression analysis

QMeth201- Winter 2011- BB and BC Sections- Mohsen Sharifani
A: Phases of Regression analysis:
a) Draw a scatterplot to see if:
a. It looks “friendly”. Then most probably there is a connection
between X and Y, and some sort of regression analysis [not necessarily
linear] should be able to describe the relationship.
b. It behaves linearly. Then we are allowed to use linear regression, the only
regression technique that we learn in this course.
b) Find the regression equation:
a. Compute r, the coefficient of correlation:
i. By Excel: use the “CORREL” function.
ii. By hand: check your blue sheet. To find r, you need to have Sx,y, Sx and
Sy.
1. Hint: If the problem gives you $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$, do not forget
to divide it by n-1 in order to get Sx,y.
b. Compute b, the slope of the regression line. Check your blue sheet for the
formula.
Note: (See section B for what b intuitively means.)
c. Compute a, the intercept of the regression line.
d. Write your regression equation: Yexpected = a + b * X
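The by-hand steps for r, b and a can be sketched in Python. This is only an illustration: the data values are made up, the function name fit_line is hypothetical, and the formulas b = r * Sy / Sx and a = mean(Y) - b * mean(X) are the usual least-squares ones, assumed here to match the blue sheet. All sample quantities divide by n-1.

```python
import math

def fit_line(x, y):
    """Return (r, b, a) for the line Yexpected = a + b * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and standard deviations, each dividing by n - 1.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    r = s_xy / (s_x * s_y)   # coefficient of correlation
    b = r * s_y / s_x        # slope (assumed standard formula)
    a = y_bar - b * x_bar    # intercept (assumed standard formula)
    return r, b, a

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r, b, a = fit_line(x, y)
print(f"r = {r:.3f}, Yexpected = {a:.2f} + {b:.2f} * X")
```

The value of r from this sketch should agree with what Excel's CORREL returns for the same two columns.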
c) Error analysis (how much can we trust this regression equation?)
The main statistical measure of the error size is the standard error of estimate, Se. There
are two ways to find it:
a. By hand:
i. Find the expected Y for each data point you have.
ii. Find the residuals (residual = actual - expected).
1. Hint: be careful about the residual formula. It is NOT “expected - actual”.
iii. Calculate residual^2 for each data point.
iv. Add them up, divide by n-2, and take the square root.
v. Finding Se is in many ways similar to finding a standard deviation. It is
in fact the standard deviation of the residual values, except that for Se we
divide the sum of squares by n-2. One might point out that
the procedure described above is not exactly the
algorithm for finding a standard deviation. In that algorithm, we
subtract the average from the data values and then square them,
whereas for Se we just square the residuals without
subtracting the average of the residuals from them. However, the
average of the residuals is zero (see footnote 1). Thus, [Residual_i - E(residuals)]^2
is equal to Residual_i^2.
b. By formula: check your blue sheet.
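A minimal sketch of the by-hand route to Se, dividing the sum of squared residuals by n-2 as the description of Se states. The data are made up, the function names are hypothetical, and the values of a and b are the least-squares line for this particular data set.

```python
import math

def residuals(x, y, a, b):
    """Residual = actual - expected, one per data point."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def standard_error(x, y, a, b):
    """Se = sqrt(sum of squared residuals / (n - 2))."""
    sum_sq = sum(e ** 2 for e in residuals(x, y, a, b))
    return math.sqrt(sum_sq / (len(x) - 2))

# Hypothetical data and its least-squares line Yexpected = 2.2 + 0.6 * X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = 2.2, 0.6

res = residuals(x, y, a, b)
se = standard_error(x, y, a, b)
print(f"sum of residuals = {sum(res):.6f}")  # zero up to rounding
print(f"Se = {se:.4f}")
```

The printed sum of residuals is a quick check of the claim that the residuals of a least-squares line average to zero.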
d) Analyzing the magnitude of error (answering the question: are we surprised by…?):
a. In general, whenever we want to figure out whether a data point is within our
expected range or not, we calculate the standardized score for that
data point, $(X_i - X_{expected}) / S_x$. If its absolute value is less than 1, it is pretty much within
the range that we expect. If its absolute value is between 1 and 2, then it is
mostly subjective to decide whether the data point is within the expected range or
not (see footnote 2). A standardized score with absolute value more than 2 is, however, out of
the range of expectation, so it is “strange”.
b. In this specific case, when we want to see whether a y value is within our expected
range or not, we calculate the residual for that y and find its standardized
score. Note that the expected value of the
residuals, which is their average, is equal to zero, and the standard
deviation of the residuals is the standard error of estimate, Se. So the standardized
score for a residual is $\text{Residual} / S_e$.
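The "are we surprised?" check can be sketched as follows. The function names, the sample line and the value of Se are hypothetical. The handout does not say how to treat a score of exactly 1 or 2, so this sketch places both in the subjective middle band as an assumption.

```python
def standardized_residual(actual_y, x, a, b, se):
    """(actual - expected) / Se for one data point."""
    expected_y = a + b * x
    return (actual_y - expected_y) / se

def describe(z):
    """Map a standardized score to the handout's three ranges."""
    size = abs(z)
    if size < 1:
        return "within the expected range"
    if size <= 2:  # boundary handling is an assumption of this sketch
        return "subjective: either answer is defensible"
    return "strange"

# Hypothetical regression line and standard error of estimate.
a, b, se = 2.2, 0.6, 0.8944

for x_val, y_val in [(3.0, 4.5), (3.0, 5.0), (3.0, 7.0)]:
    z = standardized_residual(y_val, x_val, a, b, se)
    print(f"X = {x_val}, Y = {y_val}: z = {z:.2f}, {describe(z)}")
```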
B: The concepts that you are supposed to know (summary)
1) r: coefficient of correlation. It is a number between -1 and 1 and shows how linearly
correlated our data is. If our data points lie exactly on a straight line, r would be either -1 or 1, but in practice this almost never happens.
2) b: slope of the regression line. It shows how much Y would change if X increases by 1
unit.
3) r^2: percentage of variability in y that can be explained by variability in x. The larger
r^2 is, the better the predictions we can make.
4) Adjusted r^2: the concept is the same as the regular r^2, except that it is more
precise when our sample size is small. (Check the formula.)
5) Se: standard error of estimate. It shows how much error we make, on average, in our
estimation process.
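A small sketch relating r, r^2 and adjusted r^2. The blue-sheet formula is not reproduced in this handout, so the adjusted r^2 below uses the common textbook definition for a regression with one X variable, 1 - (1 - r^2)(n - 1)/(n - 2), as an assumption. The function name and the numbers are hypothetical.

```python
def adjusted_r_squared(r_squared, n, predictors=1):
    """Common textbook adjustment; assumed, not taken from the blue sheet."""
    return 1 - (1 - r_squared) * (n - 1) / (n - predictors - 1)

# Hypothetical values: r from a sample of n = 5 data points.
r = 0.7746
n = 5

r_squared = r ** 2
adj = adjusted_r_squared(r_squared, n)
print(f"r^2 = {r_squared:.3f}, adjusted r^2 = {adj:.3f}")
```

With a small sample the adjustment is large. As n grows, (n - 1)/(n - 2) approaches 1 and the two values converge.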
Footnote 1: That is the way we find the formulas for b and a. We do not go that far in this course though.
Footnote 2: In other words, if you get such a question in the exam, you can either answer “Yes, this sounds out of
range to me” or “No, this data does not look strange to me” and both answers would be correct.