Correlation



Introduction to Correlation

 The Pearson product-moment correlation coefficient measures the degree of association between two interval-level (or better) variables: for example, the relationship between daily consumption of fat calories and body weight, between attitudes toward smoking and attitudes toward consumption of alcohol, or between student achievement and dollars per student spent by the school district

 Sometimes both of the variables are treated as “dependent,” meaning that we haven’t ordered them causally. Sometimes one of the variables, X, is treated as independent and the other, Y, as dependent. Which is dependent and which is independent depends on your theory of the relationship

 The correlation coefficient, Pearson’s r, ranges between +1 and -1 where +1 is a perfect positive association (people who get high scores on X also get high scores on Y) and -1 is a perfect negative association (people who get high scores on X get low scores on Y).

A correlation near zero indicates that there is no linear relationship between scores on the two variables
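The slides compute r in SPSS; as a quick sketch of the same quantity outside SPSS (a hypothetical illustration, assuming Python with NumPy is available), a rising pattern yields r near +1 and the mirror-image falling pattern yields r near −1:

```python
import numpy as np

# Hypothetical scores for six cases on two scales (illustrative data,
# not from the slides).
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y_pos = 3.0 * x + np.array([0.5, -0.3, 0.2, -0.4, 0.1, -0.1])  # rises with x
y_neg = -y_pos                                                 # falls with x

r_pos = np.corrcoef(x, y_pos)[0, 1]   # near +1: positive association
r_neg = np.corrcoef(x, y_neg)[0, 1]   # near -1: negative association
print(round(r_pos, 3), round(r_neg, 3))
```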

Related Measures of Association

 The correlation coefficient is related to other types of measures of association:

 The partial correlation, which measures the degree of association between two variables when the effects of a third variable on them are removed: what is the relationship between student achievement and dollars per student spent by the school district when the effect of parents’ SES is removed?

 The multiple correlation, which measures the degree to which one variable is correlated with two or more other variables: how well can I predict student achievement knowing mean school district expenditure per pupil and parent SES?

Other Related Measures

 The squared Pearson’s correlation coefficient, usually called R squared or the coefficient of determination, tells us how much of the variation in Y, the dependent variable, can be explained by variation in X, the independent variable; for example, how much of the variation in student achievement can be explained by dollars-per-student expenditure by the school district?

 The quantity 1 − R² is sometimes called the coefficient of non-determination, and it is an estimate of the proportion of variance in the dependent variable that is not explained by the IV

Scatterplot: Visual Representation of the Relationship Measured by the Correlation Coefficient

 The scatterplot is a figure that plots cases for which two measures have been taken (for example, people who have filled out a survey of their attitudes toward smoking and another survey about their attitudes toward drinking) against each other

 In a scatterplot, one of the variables (usually the independent variable) is plotted along the horizontal or X axis and the other is plotted along the vertical or Y axis

 Each point in a scatterplot corresponds to the scores (X, Y) for an individual case (a person, for example), where X is the score that person was assigned or obtained on one variable and Y is the score they attained on the other

 The linear relationship between X and Y is stronger the more closely the swarm of points in the scatterplot approximates a diagonal “line” across the graph

An Example of a Scatterplot

In this scatterplot, computer anxiety scores (openness to computing) are plotted against the Y (vertical) axis and computer self-efficacy scores are plotted along the X (horizontal) axis. For example, the person to whom the arrow is pointing had a score of about 17 on the openness scale and about 162 on the self-efficacy scale. What were the scores on the two scales of the person with the star next to his point?

[Scatterplot: Computer Self-efficacy on the X axis (20–180); openness to computing on the Y axis (0–30)]

Scatterplot Allows You to Visualize the Relationship between Variables

The purpose of the scatterplot is to visualize the relationship between the two variables represented by the horizontal and vertical axes. Note that although the relationship is not perfect, there is a tendency for higher values of openness to computing to be associated with larger values of computer self-efficacy, suggesting that as openness increases, self-efficacy increases. This indicates that there is a positive correlation

[Scatterplot: Computer Self-efficacy on the X axis (20–180); openness to computing on the Y axis (0–30)]

Drawing A Possible Regression Line

Let’s draw a line through the swarm of points that best “fits” the data set (minimizes the distance between the line and each of the points). This imposes a linear description of the relationship between the two variables; sometimes you might want to find out whether a line that represented a curvilinear relationship (in this case an inverted U) was a better fit, but we’ll leave that question for another time. The line that represents this relationship best mathematically is called a “regression line,” and the point at which the mathematically best-fitting line crosses the Y axis is called the “intercept”

[Scatterplot with fitted regression line: Computer Self-efficacy on the X axis (20–180); openness to computing on the Y axis (0–30)]

Various Types of Associations

[Scatterplot: Engine Displacement (cu. inches) on the X axis (0–500); Y axis 0–300. A positive relationship between X and Y]

[Scatterplot: Vehicle Weight (lbs.) on the X axis (0–6000); Y axis 0–50. A strong negative relationship between X and Y; points tightly clustered around the line, with a nonlinear trend at lower weights]

[Scatterplot: Number of Children on the X axis (−2 to 8); Y axis 0–30. Essentially no relationship between X and Y; points loosely clustered around the line]

How is the Correlation Coefficient Computed?

The conceptual formula for the correlation coefficient is a little daunting, but it looks like this:

r = ∑(X − X̄)(Y − Ȳ) / √( [∑(X − X̄)²] [∑(Y − Ȳ)²] )

Where X is a person’s or case’s score on the independent variable, Y is a person’s or case’s score on the dependent variable, and X̄ and Ȳ are the means of the scores on the independent and dependent variables, respectively. The quantity in the numerator is called the sum of the crossproducts (SP). The quantity in the denominator is the square root of the product of the sums of squares for both variables (SSx and SSy)

Meaning of Crossproducts

 The notion of the crossproducts is not too difficult to understand.

When we have a positive relationship between two variables, a person who is high on one of the variables will also score high on the other. And it follows that if his or her score on X is larger than the mean of variable X, then if there is a positive relationship his or her score on Y will be larger than the mean of Y. And this should hold for all or most of the cases

 When the crossproducts are negative (when for example the typical person who scores higher than the mean on X scores lower than the mean on Y) then there still may be a relationship but it is a negative relationship

 Thus the sign of the crossproducts (positive or negative) in the numerator of the formula for r tells us whether the relationship is positive or negative

 You can think of the formula for r as the ratio of (a) how much score deviation the two distributions (X and Y) have in common to (b) the maximum amount of score deviation they could have in common
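That ratio can be sketched numerically (a hypothetical example, assuming Python with NumPy): when high-X cases are low-Y cases, the deviation crossproducts come out negative, and so does r:

```python
import numpy as np

# Hypothetical scores (not from the slides): high X goes with low Y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

dx = x - x.mean()          # deviations from the mean of X
dy = y - y.mean()          # deviations from the mean of Y
sp = (dx * dy).sum()       # sum of crossproducts (SP); negative here
r = sp / np.sqrt((dx**2).sum() * (dy**2).sum())
print(sp, r)               # SP < 0, so r < 0 (here exactly -1)
```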

Computing Formula for Pearson’s r

 The conceptual formula for Pearson’s r is rarely used to compute it. You will find a nice illustration here of a computing formula and a brief example

Here is another computing formula

r = ( N ∑XY − ∑X ∑Y ) / √( [N ∑X² − (∑X)²] [N ∑Y² − (∑Y)²] )

We will do an example using this computing formula next, so let’s download the correlation.sav data set

Scatterplot for the Correlation.sav data set

Open the correlation.sav file in SPSS

Go to Graphs/Chart Builder/OK

Under Choose From select Scatter/Dot (top leftmost icon) and double click to move it into the preview window

Drag Shyness onto the X axis box

Drag Speeches onto the Y axis box and click OK

 In the Output viewer, double click on the chart to bring up the Chart Editor; go to Elements and select “Fit Line at Total,” then select “linear” and click Close

Scatterplot of Shyness and Speeches

A negative relationship: the more shy you are (the farther you are along the X axis), the fewer speeches you give (the lower you are on the Y axis)

Computational Example of r for the Relationship between Shyness and Speeches

Shyness (X), Speeches (Y), and the quantities the computing formula needs:

    X     Y     XY    X²    Y²
    0     8     0     0     64
    2     10    20    4     100
    3     4     12    9     16
    6     6     36    36    36
    9     1     9     81    1
    10    3     30    100   9
∑   30    32    107   230   226

r = ( N ∑XY − ∑X ∑Y ) / √( [N ∑X² − (∑X)²] [N ∑Y² − (∑Y)²] )

  = ( (6 × 107) − 30(32) ) / √( [6(230) − 30²] [6(226) − 32²] )

r = −.797 (note that the crossproducts term in the numerator is negative) and R-square = .635
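The hand calculation above can be checked line by line in Python (standard library only); the sums and the final r and R-square reproduce the slide’s values:

```python
import math

# Shyness (X) and Speeches (Y) from the slides' six-case example.
X = [0, 2, 3, 6, 9, 10]
Y = [8, 10, 4, 6, 1, 3]
N = len(X)

sum_x, sum_y = sum(X), sum(Y)                 # 30, 32
sum_xy = sum(x * y for x, y in zip(X, Y))     # 107
sum_x2 = sum(x * x for x in X)                # 230
sum_y2 = sum(y * y for y in Y)                # 226

# Computing formula for Pearson's r
r = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x**2) * (N * sum_y2 - sum_y**2))
print(round(r, 3), round(r**2, 3))   # -0.797 and 0.635, matching the slide
```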

SPSS vs. the Hand Calculation: It’s a Lot Quicker

 Now let’s try computing the coefficient with that same data in SPSS

 Go to Analyze/Correlate/Bivariate, and move Shyness and Speeches into the Variables box. Click Pearson, one-tailed, and OK. Did you get the same result as the hand calculation?

Correlations

                                 Shyness    Speeches
Shyness    Pearson Correlation   1          -.797*
           Sig. (1-tailed)                  .029
           N                     6          6
Speeches   Pearson Correlation   -.797*     1
           Sig. (1-tailed)       .029
           N                     6          6

*. Correlation is significant at the 0.05 level (1-tailed).

Using SPSS to Test a Hypothesis about the Strength of Association between Two Interval or Ratio Level Variables: Correlation Coefficient

 Download the file called World95.sav

 We are going to test the strength of the association between population density (the variable is “number of people per square kilometer”) and “average female life expectancy,” based on data from 109 cases (109 countries, with each country a case). Our hypothesis is that the association will be negative; that is, as population density increases, female life expectancy will decrease

 In SPSS Data Editor, go to Analyze/ Correlate/ Bivariate

 Move the two variables, “number of people per square kilometer” and “average female life expectancy” into the variables box

 Under correlation coefficients, select Pearson

 Under Tests of Significance, click one-tailed (we are making a directional prediction, so we will only accept as significant results in the “negative” 5% of the distribution)

 Click “flag significant results”

 Click Options, and under Statistics, select Means and standard deviations, then Continue, then OK

 Compare your output to the next slide

SPSS Output for Bivariate Correlation

Descriptive Statistics

                                   Mean      Std. Deviation   N
Number of people / sq. kilometer   203.415   675.7052         109
Average female life expectancy     70.16     10.572           109

Correlations

                                                  Number of people   Average female
                                                  / sq. kilometer    life expectancy
Number of people     Pearson Correlation          1                  .128
/ sq. kilometer      Sig. (1-tailed)              .                  .093
                     N                            109                109
Average female       Pearson Correlation          .128               1
life expectancy      Sig. (1-tailed)              .093               .
                     N                            109                109

Significance Test of Pearson’s r

Significance of r is tested with a t statistic with N − 2 degrees of freedom, where

t = r √(N − 2) / √(1 − r²)

SPSS provides the results of the t test of the significance of r for you. You can also consult Table F in Levin and Fox
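As a check on the SPSS output for the Shyness/Speeches example, the same t and one-tailed p can be computed directly from that formula (a sketch assuming Python with SciPy installed):

```python
import math
from scipy import stats

# t test of r for the Shyness/Speeches example (r carried to more
# decimal places than the slide prints).
r, N = -0.79659, 6
t = r * math.sqrt(N - 2) / math.sqrt(1 - r**2)
p_one_tailed = stats.t.cdf(t, df=N - 2)      # left tail: directional test
print(round(t, 3), round(p_one_tailed, 3))   # p matches SPSS's .029
```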

[The Correlations table from the previous output is shown again here: r = .128, Sig. (1-tailed) = .093, N = 109]

Write a sentence which states your findings. Report the correlation coefficient r, R² (the percent of variance in Y accounted for by X), the significance level, and N, as well as the means on each of the two variables. Indicate whether or not your hypothesis was supported.

A Hypothesis to Test

Now, test the following hypothesis:

Countries in which there is a larger proportion of people living in cities (urban) will have a higher proportion of males who read (lit_male) (not “people who read”)

Write up your result

 Compare to next slide

Writing up the Results of Your Test

Descriptive Statistics

                              Mean    Std. Deviation   N
People living in cities (%)   56.53   24.203           108
Males who read (%)            78.73   20.445           85

The correlation of a variable with itself = 1, which appears in all the main-diagonal cells

Correlations

                                              People living   Males who
                                              in cities (%)   read (%)
People living      Pearson Correlation        1               .587**
in cities (%)      Sig. (1-tailed)            .               .000
                   N                          108             85
Males who          Pearson Correlation        .587**          1
read (%)           Sig. (1-tailed)            .000            .
                   N                          85              85

**. Correlation is significant at the 0.01 level (1-tailed).

The hypothesis that the proportion of its people living in cities would be positively associated with a country’s rate of male literacy was confirmed (r = .587, DF=83, p < .01, one-tailed).

The Regression Model

 Regression takes us a step beyond correlation in that not only are we concerned with the strength of the association, but we want to be able to describe its nature with sufficient precision to be able to make predictions

 To be able to make predictions, we need to be able to characterize one of the variables in the relationship as independent and the other as dependent

 For example, in the relationship between male literacy and % of people living in cities, the causal order seems pretty obvious. Literacy rates are not likely to produce urbanization, but urbanization is probably causally prior to increases in literacy rates

Regression model, cont’d

 A regression equation is used to predict the value of a dependent variable, Y, in this case a country’s male literacy rate, on the basis of some constant a that applies to all cases, plus some amount b which is applied to each individual value of X (the country’s % of people living in cities), plus some error term e that is unique to the individual case and unpredictable: Y = a + bX + e (male literacy = a + b(percent urban) + e)

 You can think of regression as an equation which best fits a scatter of points representing the plot of X and Y against each other

Calculating the Regression Line

 What line best describes the relationship depicted in the scattergram?

 The formula for the regression line is Y = a + bX + e, where Y is (in this case) a country’s score on male literacy and X is the country’s % of people living in cities; a is the Y-intercept or constant (the point where the line crosses the Y axis, or the value of Y when X is zero, if X is a variable for which there is a zero amount of the property); and b is the slope of the regression line (the amount by which male literacy changes for each unit change in percent living in cities)

 We can use the formula to predict Y given that we know X
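The raw World95 data aren’t reproduced in the slides, so as an illustration, here are the standard least-squares formulas (b = SP/SSx, a = Ȳ − bX̄) applied in Python to the six-case Shyness/Speeches data from the earlier example:

```python
# Least-squares slope and intercept: b = SP / SS_x, a = Ybar - b * Xbar.
# Illustrated on the Shyness (X) / Speeches (Y) data from the earlier
# slides, since the raw World95 data aren't reproduced here.
X = [0, 2, 3, 6, 9, 10]
Y = [8, 10, 4, 6, 1, 3]
N = len(X)

xbar, ybar = sum(X) / N, sum(Y) / N
sp = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))  # sum of crossproducts
ss_x = sum((x - xbar) ** 2 for x in X)                   # sum of squares for X

b = sp / ss_x            # slope: change in Y per one-unit change in X
a = ybar - b * xbar      # intercept: predicted Y when X = 0
y_hat = a + b * 4        # predicted Speeches for a case with Shyness = 4
print(round(b, 4), round(a, 4), round(y_hat, 2))
```

With b = −.6625, each additional point of shyness predicts about two-thirds of a speech fewer.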

Calculating the Regression Line, cont’d

 We will not do the hand computations for b, the slope of the regression line, or a, the intercept. Let’s use another SPSS method for finding not only the correlation coefficient, Pearson’s r, but also the regression equation (i.e., find the intercept, a, and the slope of the regression line, b)

 In Data Editor, go to Analyze/Regression/Linear

 Put the dependent variable, male literacy, into the Dependent box

 Move the independent variable, percentage of people living in cities, into the Independent(s) box, and click OK

 Compare your output to the next slide

Finding the Intercept (Constant) and Slope (B, the unstandardized regression coefficient)

The intercept, or a (sometimes called β₀), and the slope, or b, appear in the B column:

Coefficients(a)

                               Unstandardized          Standardized
                               Coefficients            Coefficients
Model                          B        Std. Error     Beta           t        Sig.
1  (Constant)                  52.372   4.378                         11.961   .000
   People living in cities (%) .495     .075           .587           6.608    .000

a. Dependent Variable: Males who read (%)

The Beta column is the beta weight: the slope when X and Y are expressed in standard-score units

The regression equation for predicting Y (male literacy) is Y = a + bX, or Y = 52.372 + .495X, so if we wanted to predict the male literacy rate in country j we would multiply its percentage of people living in cities by .495 and add the constant, 52.372. Compare this to the scatterplot. Does it look right?

When scores on X and Y are available as Z scores, and so are expressed in the same standardized units, there is no intercept (constant), because you don’t have to make an adjustment for the differences in scale between X and Y; the equation for the regression line just becomes Y = bX, or in this case Y = .587X, where .587 is the standardized version of b (note that it’s also the value of r, but only when there is just the one X variable and not multiple independent variables)
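You can verify this equivalence numerically (assuming Python with NumPy; the Shyness/Speeches data stand in for the World95 variables): converting X and Y to z scores makes the least-squares slope equal Pearson’s r:

```python
import numpy as np

# With one predictor, the slope computed on z-scored X and Y equals r.
# Shown on the six-case Shyness/Speeches data from the earlier slides.
X = np.array([0, 2, 3, 6, 9, 10], dtype=float)
Y = np.array([8, 10, 4, 6, 1, 3], dtype=float)

zx = (X - X.mean()) / X.std(ddof=1)   # z scores using the sample SD
zy = (Y - Y.mean()) / Y.std(ddof=1)

beta = (zx * zy).sum() / (zx**2).sum()   # least-squares slope on z scores
r = np.corrcoef(X, Y)[0, 1]
print(round(beta, 3), round(r, 3))       # both -0.797
```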

More Output from Linear Regression

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .587a   .345       .337                16.649

a. Predictors: (Constant), People living in cities (%)

The correlation coefficient and the coefficient of determination. The coefficient of determination, or R-square, is the proportion of variance in the dependent variable which can be accounted for by the independent variable. Adjusted R-square is an adjustment made to R-square when you have a lot of independent variables or predictors in the equation or have complexities like cubic terms. It is a minor adjustment with only one predictor
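The adjustment is simple enough to check by hand with the usual formula, adj R² = 1 − (1 − R²)(N − 1)/(N − k − 1), where k is the number of predictors (a Python sketch using the slide’s values):

```python
# Adjusted R-square from the Model Summary values: R-square = .345,
# N = 85 cases, k = 1 predictor.
r2, n, k = 0.345, 85, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))   # ~.337, matching the table; a minor adjustment
```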

F Test of the Regression Equation: More Output from Linear Regression, Cont’d

ANOVA(b)

Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   12104.898        1    12104.898     43.668   .000a
   Residual     23007.879        83   277.203
   Total        35112.776        84

a. Predictors: (Constant), People living in cities (%)
b. Dependent Variable: Males who read (%)

If the independent variable, X, were of no value in predicting Y, the best estimate of Y would be the mean of Y. To see how much better our calculated regression line is as a predictor of Y than the simple mean of Y, we calculate the sum of squares for the regression line and then a residual sum of squares (the variance left over after the regression line has done its work as a predictor), which shows how well or how badly the regression line fits the actual obtained scores on Y. If the residual mean square is large compared to the regression mean square, the value of F will be low and the resulting F ratio may not be significant. If the F ratio is statistically significant, it suggests that we can reject the hypothesis that our predictor’s slope, β, is zero in the population, and say that the regression line is a good fit to the data
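The F ratio in the ANOVA table can be reproduced from the sums of squares and their degrees of freedom (a Python check of the slide’s values):

```python
# F = MS_regression / MS_residual, where each mean square is its sum of
# squares divided by its df (values from the slide's ANOVA table).
ss_reg, df_reg = 12104.898, 1
ss_res, df_res = 23007.879, 83

ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res               # ~277.203
f = ms_reg / ms_res
print(round(ms_res, 3), round(f, 3))   # 277.203 and 43.668, as in the table
```

Note also that with a single predictor, F is the square of the slope’s t (6.608² ≈ 43.67).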

Partial Correlation

 What is the relationship between a country’s percentage of people living in cities (X2, the IV) and male literacy rate (Y, the DV) when the effects of gross domestic product (X1, another potential IV or control variable) are removed?

 That is, what happens when you statistically remove that portion of the variance that both percentage of people living in cities (X2) and gross domestic product (X1) have in common with each other and with Y, male literacy rate (i.e., compute a partial correlation)?

 What you want to see is whether the correlation between male literacy rate and percentage of people living in cities, which you determined to be .587 in your previous analyses, is lower when the effects of gross domestic product are “partialled out”

Using SPSS to Compute a Partial Correlation

 A partial correlation is the relationship between two variables after removing the overlap with a third variable completely from both variables. In the Venn diagram accompanying this slide (not reproduced here), this would be the relationship between male literacy (Y) and percentage living in cities (X2), after removing the influence of gross domestic product (X1) on both literacy and percentage living in cities

In the calculation of the partial correlation coefficient r(YX2·X1), the area of interest is section a, and the effects removed are those in b, c, and d; partial correlation is the relationship of X2 and Y after the influence of X1 is completely removed from both variables.

When only the effect of X1 on X2 is removed, this is called a part correlation; part correlation first removes from X2 all variance which may be accounted for by X1 (sections c and b), then correlates the remaining unique component of X2 with the dependent variable, Y
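The partial correlation SPSS reports can also be obtained from the three zero-order correlations in the output, using the standard first-order partial formula r(YX2·X1) = (r_YX2 − r_YX1·r_X2X1) / √((1 − r_YX1²)(1 − r_X2X1²)) (a Python check; this formula is algebraically equivalent to what SPSS computes):

```python
import math

# Zero-order r's from the SPSS output: Y = male literacy, X2 = % living
# in cities, X1 = GDP per capita.
r_yx2, r_yx1, r_x2x1 = 0.587, 0.417, 0.591

partial = (r_yx2 - r_yx1 * r_x2x1) / math.sqrt(
    (1 - r_yx1**2) * (1 - r_x2x1**2))
print(round(partial, 3))   # ~.464, matching the partial-correlation output
```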

Computing the Partial Correlation in SPSS

Go to Analyze/Correlate/Partial

 Move % People living in cities and Males who read into the Variables box

 Put Gross Domestic Product into the Controlling for box

 Select one-tailed test and check display actual significance level

 Under Options, select zero-order correlations

 Click Continue and then OK

 Compare your output to the next slide

Comparing Partial to Zero-Order Correlation: Effect of Controlling for GDP on Relationship Between Percent Living in Cities and Male Literacy

Correlations

Control Variables: -none-(a)

                                               People living   Males who   Gross domestic
                                               in cities (%)   read (%)    product / capita
People living       Correlation                1.000           .587        .591
in cities (%)       Significance (1-tailed)    .               .000        .000
                    df                         0               83          83
Males who           Correlation                .587            1.000       .417
read (%)            Significance (1-tailed)    .000            .           .000
                    df                         83              0           83
Gross domestic      Correlation                .591            .417        1.000
product / capita    Significance (1-tailed)    .000            .000        .
                    df                         83              83          0

Control Variable: Gross domestic product / capita

                                               People living   Males who
                                               in cities (%)   read (%)
People living       Correlation                1.000           .464
in cities (%)       Significance (1-tailed)    .               .000
                    df                         0               82
Males who           Correlation                .464            1.000
read (%)            Significance (1-tailed)    .000            .
                    df                         82              0

a. Cells contain zero-order (Pearson) correlations.

The top panel shows the zero-order r’s of the control variable with X and Y; the bottom panel shows r when the effect of GDP is removed

Note that the partial correlation of % people living in cities and male literacy is only .4644 when GDP is held constant, whereas the zero-order correlation you obtained previously was .5871. So clearly GDP is a control variable which influences the relationship between % of people living in cities and male literacy, although the % living in cities–literacy relationship is still significant even with GDP removed
