Association for Interval Level Variables

Chapter 15 (1e) or 13 (2/3e)
Association Between Variables
Measured at the Interval-Ratio Level:
Bivariate Correlation and Regression
Introduction:

Scattergrams / Scatterplots
Graphs that display relationships between two interval-ratio variables.

The Regression Line, Slope, and Intercept
The regression line, Y = a + bX, summarizes the linear relationship between X and Y. It predicts the score of Y from a score of X.
b represents the slope of the line.
a, called the intercept, is the point on the Y-axis where the regression line crosses it.

Pearson’s r and the Coefficient of Determination (r²)
r is a measure of association for two interval-ratio (I-R) variables.
r² tells you how much of the variation in the dependent variable is explained by the independent variable.
Scattergram / Scatterplot

Has two dimensions:




The X (independent) variable is arrayed along the
horizontal axis.
The Y (dependent) variable is arrayed along the
vertical axis.
Each dot on a scattergram is a case in the
data set.
The dot is placed at the intersection of the
case’s scores on X and Y.
Example of a Hypothetical Scattergram
Showing the Relationship Between X and Y

Shows the relationship between % College
Educated (X) and Voter Turnout (Y) on election
day for the 50 cities.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
Scattergram Example (cont.)

Horizontal X axis - % of population of a city with a
college education.

Scores range from 15.3% to 34.6% and increase from
left to right.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
Scattergram Example (cont.)

Vertical (Y) axis is voter turnout.

Scores range from 44.1% to 70.4% and increase
from bottom to top.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
The Regression Line on a Scattergram



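A single straight line that comes as close as possible to
all data points.
It is called the “least squares regression line” because it minimizes
the sum of the squared vertical distances between the dots and the line.
Indicates strength and direction of the relationship.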
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
Strength of Regression Line


The greater the extent to which dots are clustered around the
regression line, the stronger the relationship.
This relationship is weak to moderate in strength.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
Direction of Regression Line



Positive: regression line rises left to right.
Negative: regression line falls left to right.
This is a positive relationship: As % college educated
increases, % turnout increases.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
Scattergrams and Linearity

Inspection of the scattergram should always be the
first step in assessing the correlation between two
interval-ratio variables. In addition to assessing
strength and direction, check that the relationship is
linear: the regression line and Pearson’s r describe
only linear relationships.
[Figure: scattergram "Turnout By % College". Horizontal axis: % College (15 to 35). Vertical axis: Turnout (43 to 73).]
The Regression Line: Formula

This formula defines the regression line:

$Y = a + bX$

Where:
Y = score on the dependent variable
a = the Y intercept, or the point where the regression line crosses the Y axis
b = the slope of the regression line, or the amount of change produced in Y by a unit change in X
X = score on the independent variable
Regression and Prediction


We can use the regression line to find the predicted
value of Y (symbolized as Y', also written $\hat{Y}$) for values of X.
Once we know the values of the coefficients b and a,
we can substitute any value of X into the prediction
formula to predict Y:

$Y' = a + bX$

We can also use the regression formula to accurately
plot the regression line on our scattergram.
Regression Analysis: Healey’s definitional
formula for calculating the slope of the
line (Formula 15.2 (1e) or 13.2 (2/3e))

$b = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

Note: The numerator is the covariation of X and Y
(how X and Y vary together). The denominator is the
sum of the squared deviations around the mean of X.
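
As a minimal sketch of the definitional formula, the Python below computes b from deviations around the means. It borrows the five-city data (years of education and % turnout) from the worked example later in this deck; the function name `slope_definitional` is only illustrative.

```python
def slope_definitional(xs, ys):
    """Slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: covariation of X and Y
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations around the mean of X
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    return covariation / ss_x

years = [11.9, 12.1, 12.7, 12.8, 13.0]   # X, from the worked example
turnout = [55, 60, 65, 68, 70]           # Y, from the worked example
b = slope_definitional(years, turnout)
print(round(b, 2))
```

The result should agree with the slope obtained from the computational formula in the worked example (12.67).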
Regression Analysis: *computational
formula* for b (Formula 13.3 in 2/3e)

Below is the computational (working) formula to calculate b. It
is a re-arrangement of the theoretical formula and is much
easier to calculate!

$b = \dfrac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$


The slope tells you how much Y changes for every one-unit increase in X.
The sign of the slope coefficient (+/- b) tells you whether the
relationship is positive or negative.
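
Why the computational formula equals the definitional one can be sketched in a few lines, writing n for the number of cases and using $\bar{X} = \sum X / n$ and $\bar{Y} = \sum Y / n$:

```latex
\begin{aligned}
\sum (X-\bar{X})(Y-\bar{Y})
  &= \sum XY - \bar{Y}\sum X - \bar{X}\sum Y + n\bar{X}\bar{Y}
   = \sum XY - \frac{(\sum X)(\sum Y)}{n} \\
\sum (X-\bar{X})^2
  &= \sum X^2 - \frac{(\sum X)^2}{n} \\
b &= \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}
   = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
\end{aligned}
```

The last step multiplies the numerator and the denominator by n, which leaves b unchanged.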
Regression Analysis

The Y intercept (a) is computed from Healey,
Formula 15.3 (1e) or 13.4 (2/3e):

$a = \bar{Y} - b\bar{X}$

The intercept (a) is the point where the regression
line crosses the Y-axis, when X=0.
Results of a Hypothetical Regression
Analysis of the Relationship Shown in the
Scattergrams Above:

For the relationship between % college educated
and % turnout:




Assume b (slope) = .42
Assume a (Y intercept)= 50.03
A slope of .42 means that % turnout increases by
.42 percentage points (less than half a point) for every
1-point increase in % college educated.
The Y intercept means that the regression line
crosses the Y axis at Y = 50.03.
An example of prediction:



We can use the regression equation Y' = a + bX for
prediction. For instance, we could ask, what %
turnout would be expected in a city where only
10% of the population was college educated?
What % turnout would be expected in a city
where 70% of the population was college
educated?
This is a positive relationship so the value for Y
increases as X increases. Our prediction:

For X = 10, Y' = 50.03 + .42(10) = 54.23
For X = 70, Y' = 50.03 + .42(70) = 79.43
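
The prediction step can be sketched in Python by plugging the assumed coefficients (a = 50.03, b = .42) into Y' = a + bX. The helper name `predict` is hypothetical and chosen only for this illustration.

```python
def predict(a, b, x):
    """Predicted Y' on the regression line Y' = a + b*X."""
    return a + b * x

a, b = 50.03, 0.42   # assumed intercept and slope from the slide

for pct_college in (10, 70):
    print(pct_college, round(predict(a, b, pct_college), 2))
```

Both X values fall outside the observed range of 15.3% to 34.6% college educated, so these predictions extend the line beyond the data it was fitted to.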
Calculating the Correlation Coefficient:
Formula for Pearson’s r

Definitional formula for Pearson’s r:

$r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{[\sum (X - \bar{X})^2][\sum (Y - \bar{Y})^2]}}$

Use the computational formula to calculate:

$r = \dfrac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$
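
A minimal Python sketch of the computational formula for r follows. The function name `pearson_r` and the two toy data sets are made up for illustration; they are perfectly linear, so r should land at the two ends of its range.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Toy data (hypothetical): a perfect positive and a perfect negative line
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_pos, r_neg)
```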
Pearson’s r



Like Gamma, r varies from -1.00 to +1.00
Pearson’s r is a measure of association for
Interval-Ratio variables.
For the hypothetical relationship between %
college educated and turnout, assume r =.32


This relationship would be positive and weak to
moderate.
As level of education increases, % turnout
increases.
The Coefficient of Determination:

Total variation in Y, $\sum (Y - \bar{Y})^2$, is the sum of
the explained variation, $\sum (Y' - \bar{Y})^2$,
and the unexplained variation, $\sum (Y - Y')^2$.
The proportion of the total variation that is explained
by X is given by the formula:

$r^2 = \dfrac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$

Or, alternatively: r² = (r)²
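
The decomposition can be checked numerically. The sketch below uses the five-city data from the worked example later in this deck, fits the least squares line, and then splits the variation in Y into its explained and unexplained parts. Variable names are illustrative only.

```python
years = [11.9, 12.1, 12.7, 12.8, 13.0]   # X, from the worked example
turnout = [55, 60, 65, 68, 70]           # Y, from the worked example
n = len(years)
mean_x, mean_y = sum(years) / n, sum(turnout) / n

# Least squares slope and intercept (definitional formulas)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, turnout))
     / sum((x - mean_x) ** 2 for x in years))
a = mean_y - b * mean_x
predicted = [a + b * x for x in years]   # Y' for each case

total = sum((y - mean_y) ** 2 for y in turnout)
explained = sum((yp - mean_y) ** 2 for yp in predicted)
unexplained = sum((y - yp) ** 2 for y, yp in zip(turnout, predicted))

print(round(total, 2), round(explained, 2), round(unexplained, 2))
print(round(explained / total, 3))       # proportion explained, r squared
```

The explained and unexplained parts should add up to the total, and the ratio should match the r² reported in the worked example (.968).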
Practical Example using Healey #15.1 (1e)
(Problem 13.1 in 2/3e)


The computation and interpretation of a, b, Pearson’s
r and r² will be illustrated using an example similar
to Healey Problem 15.1 (% Turnout by Education,
measured in years of schooling), but with only 5 cases.
The variables are:



Voter turnout (Y) is the dependent variable.
Average years of school (X) is the independent variable.
The sample is 5 cities.

This is only to simplify the calculation. A sample of 5 is
actually very small.
Data from Problem:

The scores on each variable are displayed in table format:

City    X (Years of Education)    Y (% Turnout)
A       11.9                      55
B       12.1                      60
C       12.7                      65
D       12.8                      68
E       13.0                      70
1. Draw and Interpret the Scattergram:


The relationship between X and Y is linear.
Sketch an estimated regression line through the dots. The relationship is positive and strong.
2. Make a Computational Table:

X       Y      X²        Y²       XY
11.9    55     141.61    3025     654.5
12.1    60     146.41    3600     726.0
12.7    65     161.29    4225     825.5
12.8    68     163.84    4624     870.4
13.0    70     169.00    4900     910.0

∑X = 62.5    ∑Y = 318    ∑X² = 782.15    ∑Y² = 20374    ∑XY = 3986.4
Sums (Σ) are needed to compute b, a, and Pearson’s r.
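
As a rough sketch, the computational table and its column sums can be rebuilt in a few lines of Python from the five pairs of scores. The variable names are illustrative, and the means are included because the next step uses them.

```python
years = [11.9, 12.1, 12.7, 12.8, 13.0]   # X
turnout = [55, 60, 65, 68, 70]           # Y

# One row per city: X, Y, X squared, Y squared, XY
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(years, turnout)]

# Column sums, rounded to two decimals to hide floating-point noise
sums = [round(sum(col), 2) for col in zip(*rows)]
n = len(rows)

print("sum X, sum Y, sum X^2, sum Y^2, sum XY:", sums)
print("mean X:", round(sums[0] / n, 2), "mean Y:", round(sums[1] / n, 2))
```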

The means of X and Y are needed as well:
$\bar{X} = \sum X / n = 62.5 / 5 = 12.5$
$\bar{Y} = \sum Y / n = 318 / 5 = 63.6$

3. Next, calculate b and a….

Calculate slope:
$b = \dfrac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$
Calculate y-intercept:
$a = \bar{Y} - b\bar{X}$
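Interpret Slope (b), the Intercept (a)

$b = \dfrac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = \dfrac{5(3986.4) - (62.5)(318)}{5(782.15) - (62.5)^2} = \dfrac{57}{4.5} = 12.67$

For every unit increase in X, Y increases by 12.67.
This means that for 1 additional year of schooling,
voter turnout goes up by 12.67 percentage points.

$a = \bar{Y} - b\bar{X} = 63.6 - 12.67(12.5) = -94.78$

This is the point at which the regression line crosses
the Y-axis (when X is equal to 0, Y is equal to -94.78).
A negative turnout is impossible; X = 0 lies far outside the
observed range of 11.9 to 13.0 years, so here the intercept
only fixes the position of the line.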
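
The slope and intercept can be reproduced from the column sums with a short Python sketch. It also shows, as an aside, how rounding b before computing a shifts the intercept slightly; the variable names are illustrative.

```python
n = 5
sum_x, sum_y = 62.5, 318
sum_x2, sum_xy = 782.15, 3986.4

# Computational formula for the slope
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
mean_x, mean_y = sum_x / n, sum_y / n

a_exact = mean_y - b * mean_x               # b kept at full precision
a_rounded = mean_y - round(b, 2) * mean_x   # b rounded to 12.67 first

print(round(b, 2), round(a_exact, 3), round(a_rounded, 3))
```

With b rounded to 12.67 the intercept comes out at -94.775, which the slide reports as -94.78; keeping b unrounded gives about -94.73. Differences of this size come from rounding intermediate values, not from a different method.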
Find the Regression Line*:
$Y = a + bX = -94.78 + 12.67X$
*Note: you can now substitute two values for X and
solve for Y to find points to plot the actual regression
line on your scattergram.
For prediction:
Suppose years of schooling = 10 years…
Then, Y = -94.78 + 12.67(10) = 31.92.
We would predict that when average years of education is
10 years, the voter turnout would be 31.92%.
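
One way to get the two points needed to plot the line is to evaluate the equation at the smallest and largest observed X. The sketch below does that with the rounded coefficients from the slide and then repeats the prediction for 10 years; `predict` is a hypothetical helper name.

```python
a, b = -94.78, 12.67   # rounded coefficients from the slide

def predict(x):
    """Predicted % turnout for x years of schooling."""
    return a + b * x

# Two points are enough to draw a straight line
for x in (11.9, 13.0):
    print(x, round(predict(x), 2))

# Prediction for a city averaging 10 years of schooling
print(10, round(predict(10), 2))
```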
4. Pearson’s r

Calculate the correlation coefficient r:

$r = \dfrac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$
Interpret Pearson’s r

$r = \dfrac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} = \dfrac{5(3986.4) - (62.5)(318)}{\sqrt{[5(782.15) - (62.5)^2][5(20374) - (318)^2]}} = \dfrac{57}{\sqrt{(4.5)(746)}} = .984$
An r of 0.98 indicates an extremely strong
relationship between years of education and voter
turnout for these five cities (use the table given for
gamma to estimate strength)
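
The same arithmetic can be sketched in Python directly from the column sums of the computational table; variable names are illustrative.

```python
import math

n = 5
sum_x, sum_y = 62.5, 318
sum_x2, sum_y2, sum_xy = 782.15, 20374, 3986.4

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print(round(r, 3), round(r ** 2, 3))   # r and the coefficient of determination
```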
5. Find the Coefficient of Determination (r²)
and Interpret:

$r^2 = (r)^2 = (.984)^2 = .968$

The coefficient of determination is r² = .968.
Education, by itself, explains 96.8% of the
variation in voter turnout.
6. Testing r for significance:

We can test the relationship between % turnout and
years of education (represented by Pearson’s r) for
significance using the 5 step model and the following
formula:

$t_{obtained} = r\sqrt{\dfrac{n - 2}{1 - r^2}}$

Degrees of Freedom = n - 2
Step 1: Assumptions

There are 3 main assumptions…




1. The dependent and independent variables are normally
distributed. We can check this by looking at the histograms
for the two variables.
2. The relationship between X and Y is linear. We can
check this by looking at the scattergram.
3. The relationship is homoscedastic. We can test
homoscedasticity by looking at the scattergram and
observing that the data points form a “roughly symmetrical,
cigar-shaped pattern” about the regression line.
If the above 3 assumptions have been met, then
we can use linear regression and correlation and
test r for significance.
Step 2: Null and Alternate Hypotheses:



Ho: ρ = 0.0
H1: ρ ≠ 0.0
(Note that ρ (rho) is the population parameter, while r is the sample
statistic.)
Step 3: Sampling Distribution and Critical Region:




Sampling distribution = t distribution
Alpha = .05 (two-tailed)
DF = n - 2 = 5 - 2 = 3
t(critical) = ±3.182
Step 4. Computing the Test Statistic:

Use Formula 15.6 (1e) or 13.6 (2/3e):

$t_{obtained} = r\sqrt{\dfrac{n - 2}{1 - r^2}} = .984\sqrt{\dfrac{5 - 2}{1 - (.984)^2}} = .984\sqrt{\dfrac{3}{1 - .968}} = 9.53$
Step 5. Decision and Interpretation:


Tobtained = 9.53 > tcritical = 3.182
Reject Ho. The relationship between % turnout and years
of schooling is significant.
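
The test statistic and the decision can be sketched in Python as follows. The function name `t_obtained` is hypothetical, and the critical value is simply the one quoted on the slide (alpha = .05, df = 3), not looked up by the code. The unrounded r is written as 57 / sqrt(3357), using the numerator 57 and the denominator product (4.5)(746) = 3357 from the computational formula.

```python
import math

def t_obtained(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t_rounded = t_obtained(0.984, 5)                # r rounded to three decimals
t_exact = t_obtained(57 / math.sqrt(3357), 5)   # unrounded r
t_critical = 3.182                              # from the slide

print(round(t_rounded, 2), round(t_exact, 2), t_exact > t_critical)
```

Depending on where rounding happens, the statistic comes out between about 9.50 and 9.57 (the slide's 9.53 uses r = .984 and r² = .968). Every version is far above 3.182, so the decision to reject Ho does not change.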
Always include a brief summary of your
results:

There is a very strong, positive relationship
between % voter turnout and years of
schooling for the five cities. As years of
schooling increase, the % of voter turnout
goes up. The relationship is significant (t=9.53,
df=3, α = .05) . Years of schooling explain
96.8% of the variation in % voter turnout.
Practice Problems

Calculate, interpret and summarize the
results for Healey 1e #15.1 (2/3e #13.1) for
“% Turnout” and “Unemployment” and for “%
Turnout” and “Negative Campaigning”.

Answers* can be found below on the lectures
page.

*I used SPSS rather than a calculator to compute
the solutions, so your answers may be very
slightly different from the ones in the powerpoint.