Relations Between Two Variables Regression and Correlation

advertisement
Relations Between Two Variables
Regression and Correlation
In both cases, y is a random variable beyond the control of the experimenter.
In the case of correlation, x is also a random variable.
In the case of regression, x is treated as a fixed variable. (As if there is no
sampling error in x.)
Regression: you are wishing to predict the value of y on the basis of the value of x.
Correlation: you are wishing to express the degree the relation between a and y.
Scatter Diagram or Scatter Plot
X axis (abscissa) = predictor variable
Y axis (ordinate) = criterion variable
Positive
Negative
Perfect None
Covariance
COVxy
is a number reflecting the degree to which two variable vary or change
in value together.
( x  x )( y  y )


n 1
n = the number of xy pairs.
Using an example of collecting RT and error scores.
If a subject is slow (high x) and accurate (low y), then the d score for the x will be
positive and the d score for the y will be negative; their product will be negative.
If a subject is slow (high x) and inaccurate (high y), then the d score for the x will be
positive and the d score for the y will be positive; their product will be positive.
If a subject is fast (low x) and accurate (low y), then the d score for the x will be
negative and the d score for the y will be negative; their product will be positive.
If a subject is fast (low x) and inaccurate (high y), then the d score for the x will be
negative and the d score for the y will be positive; their product will be negative.
Illustrative Trends
(x  x)
Sub.
x
1
2
3
4
5
100
200
300
400
500
1
2
3
4
5
1
2
3
4
5
100
200
300
400
500
100
200
300
400
500
-200
-100
0
100
200
-200
-100
0
100
200
-200
-100
0
100
200
( y  y)
y
20
15
10
5
0
0
5
10
15
20
10
5
20
5
10
10
5
0
-5
-10
( x  x )( y  y )
-2000
-500
0
-500
-2000
-10
-5
0
5
10
2000
500
0
500
2000
0
-5
10
-5
0
0
500
0
-500
0
Those subjects who are fast
make more errors.
Total = -5000
Those subjects who are fast
make fewer errors.
Total = 5000
There is no trend.
Total = 0
Scatter plots of data from previous page.
We can see a trend
after all.
100 200 300 400 500
Scale Issues
(Sec.) (Min.)
x
(x  x)
y
( y  y)
1
3
-4
-2
5
13
-8
0
32
0
5
7
9
0
2
4
9
17
21
-4
4
8
0
8
32
1
3
5
7
-4
-2
0
2
300
780
540
1020
-430
0
-240
240
1920
0
0
480
9
4
1260
480
1920
( x  x )( y  y )
Total = 72
Total = 4320
Sub
1
2
3
4
5
X
2
3
2
4
4
Y
10
12
12
15
12
COVxy
( x  x )( y  y )


n 1
What is the covariance?
The absolute value of the covariance is a function of the variance of x and the
variance of y. Thus, a covariance could reflect a strong relation when the two
variances are small, but maybe express a weak relation when the variances are large.
Linear Relation is one in which the relation can be most accurately represented by
a straight line.
xnew  c1 ( xold )  c2
Remember: a linear transformation
The general equation for a straight line:
y  bx  a
(a is the y intercept and b is the slope of the line.)
b
y y2  y1

x x2  x1
3 2 1

 .5
31 2
If x = 8 then, y = .5(8) + 1.5 = 5.5
A = 1.5
When the relation is imperfect:
(not all points fall on a straight line.)
Why are the points not on the line?
We draw the “best fit” using what is called the “least-squares” criterion.
Why squares?
See optional link on simultaneous equations for a
closer look at the idea of least-squares.
Regression Line: Example
Subject
Stat. Score (x)
GPA (y)
1
110
1.0
2
112
1.6
3
118
1.2
4
119
2.1
5
122
2.6
6
125
1.8
7
127
2.6
8
130
2.0
9
132
3.2
10
134
2.6
11
136
3.0
12
138
3.6
GPA
4
3
2
1
110 120 130 140
Statistics Score
We wish to minimize
y 
 ( y  y )
2
The predicted value of y for a given value of x
y  by x  a y
by
= the slope minimizing the errors predicting y
ay
= y-axis minimizing the errors predicting y
 ( x  x )( y  y )
by 
COVxy
sx2

(n  1)
 (x  x)2
(n  1)
For our example:
by  0.074
What does this mean?
ay
y  b x

a
n
 y  bx
Our working example:
A = 2.275 – 0.074(125.25)
= -7.006
The regression line for our data:
y  0.074 x  7.006
Using the regression formula to predict: e.g., x = 124
y  0.074(124)  7.006
y  2.17
Note: If the x value you are inserting is beyond
the range of the values used to construct the
Formula, caution must be used.
Remember: To minimize the sum of the squared deviations about a point, the mean is best.
GPA
( y  y)2
1.0
1.69
1.6
.49
1.2
1.21
2.1
.04
2.6
.09
1.8
.25
2.6
.09
2.0
.09
3.2
.81
2.6
.09
3.0
.49
3.6
.169

y  27.3
y  2.3
 ( y  y)
Note: Using our GPA and Statistic Scores data
7.03
sy 
11
= .79
We could call this a type of
Standard Error” of y.
2
 7.03
Using only the mean of y to predict y, all y values would be the mean.
Using X,
sy .x  ?
Which MODEL is superior? Why?
Is there a reliable difference?
Standard Error of the Estimate: similar to a standard deviation
Where the relation is imperfect, there will be prediction error, whether one
use the mean or the regression line.
s y .x
2

(
y

y
)


( n  2)
Transformed….
 n  1
sy .x  sy (1  r r )

 n  2
What is r?
Residual Variance =
What might create residual variance?
Download