SPSS Practical 4 – Correlation and Linear Regression

In this practical we are using a dataset on child development, development.sav.

Locate development.sav and save it to a network drive or your computer (right-click the online file, select 'Save as' and choose where to save the data). Open SPSS and load the dataset from where you saved it [File → Open → Data, then navigate to the location where you saved development.sav and open the file]. A full list of variables in the dataset is given below.

Variable                                    Units / coding

Physical
  Height                                    cm
  Weight                                    kg
  Head circumference                        cm
  Sex                                       0 if male, 1 if female

Developmental
  Kaufman assessment battery for children   A global measure of IQ. Must be >0.
                                            Normal range is 105 ± 15.
  Number of pictures recognised             Out of 30 pictures shown

Socioeconomic
  Mother's qualifications                   GCSE = 0 / Higher than GCSE = 1
  Number of rooms in home
  Parents' marital status                   0 = Married, 1 = Cohabiting, 2 = Separated, 3 = Single
  Parents' joint income                     0 = <£20,000; 1 = £20-30,000; 2 = £30-40,000; 3 = ≥£40,000

CORRELATION

Correlation is useful when exploring the linear relationship between two numerical variables that are likely to be associated but where one probably does not depend on the other. The correlation coefficient ranges from -1 to +1: +1 is a perfect positive linear association, -1 is a perfect negative linear association, and 0 suggests no linear association. Positive values closer to +1 suggest that as one variable increases, so does the other, whilst negative values closer to -1 suggest that as one variable increases, the other decreases. We will use the example of height and head circumference.
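If you want to see the arithmetic behind the coefficient SPSS reports, here is a minimal Python sketch. The height and head-circumference values below are made up for illustration; they are not taken from development.sav.

```python
import numpy as np

# Hypothetical height (cm) and head circumference (cm) values,
# invented for illustration -- not the development.sav data.
height = np.array([95.0, 100.0, 104.0, 110.0, 116.0, 121.0])
headcirc = np.array([48.0, 49.5, 49.0, 51.0, 52.0, 53.5])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between the two variables.
r = np.corrcoef(height, headcirc)[0, 1]
print(f"Pearson r = {r:.3f}")
```

With these made-up values r comes out close to +1, matching the pattern described above: taller children tend to have larger head circumferences.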

First we will examine a scatter plot of height against head circumference. Go to 'Graphs → Legacy Dialogs → Scatterplot'. Choose 'height' and 'headcirc' and move them to the right-hand side so that they sit along the proposed X and Y axes. Click 'Ok'. The scatter plot suggests there may be an association between the variables; the scatter of the data points suggests that individuals with larger head circumferences tend to have greater height.

Click 'Analyze → Correlate → Bivariate'. Choose 'height' and 'headcirc' and move them to the 'Variables' box. Click 'Ok'. The resulting table gives you the correlation between these two variables. You are also given a p-value testing the null hypothesis that the correlation coefficient is zero against the alternative that it is not equal to zero.

By default, Pearson's correlation is used. Pearson's correlation coefficient assumes a linear relationship between the two variables. If it appears that your two variables are not linearly related, Spearman's correlation can be calculated instead, as it relaxes this assumption. Go back to the correlation dialog box, untick 'Pearson' and tick 'Spearman', then click 'Ok' again. If the two correlations are in close agreement, the assumption of linearity was probably reasonable. In this example we see that the correlations are in close agreement, so the assumption of linearity was okay and we interpret Pearson's correlation coefficient.

Kendall’s tau-b is another non-parametric version of the correlation coefficient which is an alternative to Spearman’s correlation. Explore the data set further. Go back to the correlate dialog box and add more variables. This gives you a matrix of correlations for each pair of variables selected. Try this and make sure you understand the output.

CORRELATION QUESTIONS

1. What is the Pearson correlation between height and head circumference?

2. What does this correlation value suggest?

3. Is the association between height and head circumference significant?

4. What is the Pearson correlation between height and weight?

REGRESSION

Regression analysis is a general modelling technique, useful for quantifying associations and for prediction when you have one numerical variable which you expect to depend upon one or more other variables. Linear regression quantifies the linear relationship between one numerical outcome variable and one or more explanatory variables (factors believed to affect the outcome) with an equation: the regression line. We are interested in the effect of weight on intelligence, represented by the Kaufman score. Are heavier children more or less intelligent than their lighter counterparts?

Click 'Analyze → Regression → Linear'. Select 'Kaufman' as the dependent variable and 'Weight' as the independent variable. Click the 'Statistics' button and tick 'Confidence intervals', then click 'Continue'. Click 'Plots', tick the box next to 'Normal probability plot' and indicate that you also want SPSS to produce the residuals versus fitted values plot by moving 'ZPRED' into the 'X:' box and 'ZRESID' into the 'Y:' box. We will come back to the use of this plot shortly. Click 'Continue' then 'Ok'.

The resulting coefficients table gives estimates of two parameters: the constant and the slope (B relating to weight). The constant tells us the value of the dependent variable (Kaufman score) when the independent variable (weight) is 0. In many cases this is not of interest: there are no children with a weight of 0! The weight parameter B is of interest to us. It gives the increase in Kaufman score associated with a 1 kg increase in weight. There is also a confidence interval around B and a significance test (testing the null hypothesis that the true slope is 0).
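The quantities in the coefficients table (constant, slope B, and its 95% CI) can be reproduced outside SPSS. Below is a Python sketch on invented weight and Kaufman-score values (not the development.sav data); the CI is built from the slope's standard error using the t distribution with n - 2 degrees of freedom:

```python
import numpy as np
from scipy.stats import linregress, t

# Hypothetical weight (kg) and Kaufman-score values, not from development.sav.
weight = np.array([14.0, 15.5, 16.0, 17.2, 18.1, 19.0, 20.4, 21.3])
kaufman = np.array([98.0, 101.0, 99.0, 104.0, 103.0, 107.0, 108.0, 110.0])

fit = linregress(weight, kaufman)

# fit.intercept is the constant; fit.slope is B for weight:
# the change in Kaufman score per 1 kg increase in weight.
n = len(weight)
t_crit = t.ppf(0.975, df=n - 2)  # two-sided 95% critical value
ci = (fit.slope - t_crit * fit.stderr,
      fit.slope + t_crit * fit.stderr)

print(f"constant   = {fit.intercept:.2f}")
print(f"B (weight) = {fit.slope:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interpretation matches the text: B is the estimated change in Kaufman score per 1 kg of weight, and the CI quantifies the uncertainty around that estimate.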

You also get the normal probability plot and residuals versus fitted values plot as outputs which are important for assessing the assumptions of the linear regression model.

There are four assumptions made when a linear regression model is run:

1. Independent data observations
2. Linear relationship between the numerical outcome variable and the explanatory variable(s)
3. Residuals are normally distributed
4. Residuals have constant variance

If any of these assumptions is not met, the analysis may be invalid. Two observations are said to be independent (assumption 1) if the value of one observation gives no information about the value of the other. To assess assumption (2) we consult the residuals versus fitted values plot in the output viewer: we hope to see a random scatter in this plot, and the assumption is violated if a non-linear pattern is observed. We also assess assumption (4) using the residuals versus fitted values plot; we do not want to see any 'funnelling' in the scatter of points, as we want the variation in the residuals to be constant over the range of fitted values. Finally, we assess assumption (3) using the normal probability plot, also in the output viewer. The assumption is satisfied if the data points lie close to the diagonal line.
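The residuals SPSS plots can also be computed by hand, which makes it clear what the diagnostic plots show. A minimal Python sketch on the same invented data (not development.sav): fit the line, then form residuals as observed minus fitted values. Least-squares residuals always average to (numerically) zero; what the plots examine is their *pattern* against the fitted values.

```python
import numpy as np

# Hypothetical data, invented for illustration.
weight = np.array([14.0, 15.5, 16.0, 17.2, 18.1, 19.0, 20.4, 21.3])
kaufman = np.array([98.0, 101.0, 99.0, 104.0, 103.0, 107.0, 108.0, 110.0])

# np.polyfit with deg=1 returns [slope, intercept].
slope, intercept = np.polyfit(weight, kaufman, deg=1)
fitted = intercept + slope * weight
residuals = kaufman - fitted

# The residuals-versus-fitted plot is just `residuals` plotted against
# `fitted`: a random scatter supports assumptions (2) and (4).
print("mean residual:", residuals.mean())
```

Plotting `residuals` against `fitted` (e.g. with matplotlib) reproduces the SPSS residuals versus fitted values plot, where you look for non-linear patterns and funnelling.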

REGRESSION QUESTIONS

1. What is the effect (and 95% CI) of a 1kg increase in weight on the Kaufman score?

2. Therefore what would be the effect of a 5kg increase in weight on Kaufman score?

3. What is the R² for this linear regression model?

4. Are the assumptions of the linear regression model satisfied?

It is entirely possible that the effect of weight on intelligence is confounded. What if the apparent effect arises simply because weight is associated with sex, and sex also affects intelligence? We need to adjust the estimate, B, for sex. To do this, open the regression dialog box again and enter 'sex' as an additional independent variable. Look at the coefficients output table again. Has the estimate (B) corresponding to weight changed drastically?

Sex is a binary variable where 0 = Males and 1 = Females (see table of variables). B for sex therefore quantifies the difference in the estimated mean value of the Kaufman score between males and females after adjusting for weight.
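Adjustment works by fitting both predictors at once, so each coefficient is estimated holding the other fixed. Here is a Python sketch with fully invented data, constructed so the true coefficients are known in advance (Kaufman = 70 + 2×weight + 5×sex), to show that the multiple-regression fit recovers the weight effect after adjusting for sex:

```python
import numpy as np

# Hypothetical example: Kaufman score generated from weight and sex,
# so the true coefficients are known. Not the development.sav data.
weight = np.array([14.0, 15.5, 16.0, 17.2, 18.1, 19.0, 20.4, 21.3])
sex = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 0 = male, 1 = female
kaufman = 70.0 + 2.0 * weight + 5.0 * sex

# Design matrix with a column of 1s for the constant.
X = np.column_stack([np.ones_like(weight), weight, sex])
coefs, *_ = np.linalg.lstsq(X, kaufman, rcond=None)
constant, b_weight, b_sex = coefs

print(f"B (weight, adjusted for sex) = {b_weight:.2f}")
print(f"B (sex: female vs male)      = {b_sex:.2f}")
```

As in the SPSS output, B for sex is the estimated difference in mean Kaufman score between females and males at a fixed weight.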

QUESTIONS

5. Did the adjustment for sex you made change B for weight?

6. What is the effect (and 95% CI) of being female on the Kaufman score?

7. Did adding sex reduce the R² of your model?

You can continue like this, adjusting for other variables you think are potential confounders until you have reached a model you are happy with. As well as adjusting for confounders, you may wish to produce a model which explains as much variation in the dependent variable (Kaufman) as possible. This is measured by R², the proportion of variation explained by the variables in your model, which is given in the Model Summary table. Try adding different independent variables to your model to see which improve R² the most.
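The R² in the Model Summary table is one minus the ratio of unexplained to total variation. A short Python sketch on invented data (not development.sav) makes the definition concrete; for simple linear regression it also equals the squared Pearson correlation:

```python
import numpy as np

# Hypothetical data, invented for illustration.
weight = np.array([14.0, 15.5, 16.0, 17.2, 18.1, 19.0, 20.4, 21.3])
kaufman = np.array([98.0, 101.0, 99.0, 104.0, 103.0, 107.0, 108.0, 110.0])

slope, intercept = np.polyfit(weight, kaufman, deg=1)
fitted = intercept + slope * weight

ss_res = np.sum((kaufman - fitted) ** 2)          # unexplained variation
ss_tot = np.sum((kaufman - kaufman.mean()) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
```

Adding a useful independent variable shrinks the unexplained sum of squares, which is why R² never decreases when predictors are added; the question is whether it improves by enough to matter.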
