Bivariate

Quantitative Methods
Topic 9
Bivariate Relationships
Outline
Crosstab (exploring the relationship between two categorical variables)
Correlation (exploring the relationship between two continuous variables, typically)

2
Data file

YR12SURV2.SAV
YR12SURVEYCODING2.DOC
(Questionnaire)
Holland2fory12data.doc
3
RANKMS and RANKINV

RANKMS is a constructed variable ranking informants on the amount of time spent on Maths and Science.
A high rank (e.g. 3) means the informant spent a lot of time on Maths and Science.
A low rank (e.g. 1) means the informant spent very little or no time on Maths and Science.

RANKINV is a similar rank-type variable focused on investigative interests, e.g. whether the informant is interested in laboratory work.
A high rank (e.g. 4) means the informant was very interested.
A low rank (e.g. 1) means the informant was least interested.

These are ordinal variables.
4
Relationships between two
categorical variables
Example of research questions



Is there a relationship between gender and
student investigative interest?
Are males more likely to be interested in
investigative activities than females?
Is the proportion of males at each investigative level the same as the proportion of females?
5
Variables

To answer the research questions in the example above, we will crosstabulate two variables:
Gender
RANKINV

6
Hypothesis of independence
There is no association between the two variables gender and RANKINV.
There is no difference in the proportion of females and males in each of the categories (levels) of investigative interest.

7
How to do crosstabulations in SPSS





From the menu bar of the Data Editor window, select ANALYSE, then Descriptive Statistics, then Crosstabs.
Move GENDER into the Column(s) window and RANKINV into the Row(s) window.
Open the Statistics window, tick Chi-square, then click Continue to close it.
Open the Cells window; under Counts tick Observed, under Percentages tick Column, then click Continue.
Click OK in the Crosstabs window to run.
8
Screen
9
Output
There are two tables in the output that are important for us:
Table 1: Crosstab
Table 2: Chi-square

10
Table 1: Crosstab of RankInv by Gender

rankinv * gender Crosstabulation

                                  gender
rankinv                          1          2      Total
1       Count                  113         61        174
        % within gender      30.7%      19.6%      25.6%
2       Count                  119         95        214
        % within gender      32.3%      30.5%      31.5%
3       Count                   82         76        158
        % within gender      22.3%      24.4%      23.3%
4       Count                   54         79        133
        % within gender      14.7%      25.4%      19.6%
Total   Count                  368        311        679
        % within gender     100.0%     100.0%     100.0%
11
Interpreting Association in the
Table



We can compare the column percentages along the rows and calculate the percentage point difference to see (in this case) whether females differ from males at each ‘level’ of interest.
In the rankinv by gender crosstabulation, for example, 30.7% of females were in category 1 (very low investigative interest) compared with 19.6% of males, giving a percentage point difference of 11.1.
Similarly, in category 4 (very high investigative interest) the percentage of males (25.4%) exceeds the percentage of females (14.7%) by 10.7 percentage points.
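
The percentage point comparison can also be reproduced outside SPSS. The sketch below is plain Python, not part of the SPSS workflow in these slides. It uses the counts from Table 1 and assumes the coding gender 1 = female, gender 2 = male, as the interpretation above implies. All variable names are illustrative.

```python
# Counts from Table 1 (rankinv by gender); list positions are rankinv levels 1 to 4.
# Assumption: gender 1 = female and gender 2 = male, as in the slide text.
counts = {
    "female": [113, 119, 82, 54],
    "male": [61, 95, 76, 79],
}


def column_percentages(column):
    """Return each cell as a percentage of its column total."""
    total = sum(column)
    return [100 * n / total for n in column]


pct_female = column_percentages(counts["female"])
pct_male = column_percentages(counts["male"])

# Percentage point difference (female minus male) at each level of interest.
pp_diff = [f - m for f, m in zip(pct_female, pct_male)]

for level, (f, m, d) in enumerate(zip(pct_female, pct_male, pp_diff), start=1):
    print(f"rankinv {level}: female {f:.1f}%, male {m:.1f}%, difference {d:+.1f}")
```

A positive difference means a larger share of females at that level; a negative one means a larger share of males.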
12
Table 2: Chi-square Statistics
generated by Crosstabs
Chi-Square Tests

                               Value      df   Asymp. Sig.   Exact Sig.
                                               (2-sided)     (2-sided)
Pearson Chi-Square             18.504a     3   .000
Likelihood Ratio               18.642      3   .000
Linear-by-Linear Association   17.828      1   .000
McNemar Test                                                 .b
N of Valid Cases               679

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 60.92.
b. Computed only for a PxP table, where P must be greater than 1.

Callout on the Pearson Chi-Square row: the Pearson chi-square value, degrees of freedom and significance level.
Callout on note a: this helps us to check whether any expected counts are less than 5.
13
Tests of Statistical Significance for
Tables




Chi-square is used to test the null hypothesis that there is no discrepancy between the observed and expected frequencies, that is, no association between the row and column variables.
Chi-square based statistics can be used independently of level of measurement.
If chi-square is significant (say Asymp. Sig. < 0.05), we reject the null hypothesis and conclude that the data show some association compared with a (hypothetical) table in which the frequencies were determined solely by the separate distributions of the two crosstabulated variables (the ‘marginal distributions’).
If chi-square is not significant (say Asymp. Sig. > 0.05), we fail to reject the null hypothesis: the data do not provide evidence of association when compared with that same hypothetical table.
14
Assumptions
Random samples
Independent observations: the different groups are independent of each other
The lowest expected frequency in any cell should be 5 or more

15
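Chi-square Statistics: limitations

Chi-square measures are sensitive to sample size: with a large sample, even a weak association can be statistically significant, so significance alone does not show that an association is strong.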
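
The sensitivity to sample size can be illustrated with a small sketch. It is plain Python rather than SPSS, and the 2 x 2 tables are hypothetical, made up for illustration only. Both tables have identical percentages; the second has ten times as many cases.

```python
def chi_square(table):
    """Pearson chi-square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


# Hypothetical table: 55% of column 1 and 45% of column 2 fall in row 1.
small = [[55, 45], [45, 55]]
# Same percentages, ten times as many cases.
large = [[10 * n for n in row] for row in small]

chi2_small = chi_square(small)
chi2_large = chi_square(large)

# Critical value of chi-square for df = 1 at the 0.05 level (standard tables).
CRITICAL_0_05_DF1 = 3.841

print(f"n = 200:  chi-square = {chi2_small:.2f}, significant: {chi2_small > CRITICAL_0_05_DF1}")
print(f"n = 2000: chi-square = {chi2_large:.2f}, significant: {chi2_large > CRITICAL_0_05_DF1}")
```

The pattern of association is the same in both tables, yet only the larger one crosses the 0.05 threshold.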
16
Interpretation of output from
chi-square




The note under the table shows that you have not violated one of the assumptions of chi-square, the one concerning ‘minimum expected cell frequency’: no cell has an expected count below 5.
Pearson chi-square value:
  At 18.5 with 3 degrees of freedom, chi-square is highly significant.
  If there were no association in the population, the probability of an association at least this strong occurring by chance is less than 0.001.
Degrees of freedom = (r-1)(c-1), where r and c are the numbers of categories in the two variables. Here (4-1)(2-1) = 3.
Conclusion: males are more likely than females to be interested in investigative activities.
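
As a cross-check on the SPSS output, the chi-square statistic can be recomputed by hand from the observed counts. The sketch below is plain Python with illustrative names. It does not compute a p value; it only compares the statistic with the tabulated critical value for 3 degrees of freedom at the 0.001 level.

```python
# Observed counts from Table 1: rows are rankinv levels 1 to 4,
# columns are gender 1 and gender 2.
observed = [
    [113, 61],
    [119, 95],
    [82, 76],
    [54, 79],
]


def chi_square(table):
    """Return (chi-square, degrees of freedom, minimum expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, df, min_expected


chi2, df, min_expected = chi_square(observed)
print(f"chi-square = {chi2:.3f}, df = {df}, minimum expected count = {min_expected:.2f}")

# Critical value of chi-square for df = 3 at the 0.001 level (standard tables).
CRITICAL_0_001_DF3 = 16.266
print("significant at the 0.001 level:", chi2 > CRITICAL_0_001_DF3)
```

The three printed values should agree with the Pearson Chi-Square row and note a of Table 2.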
17
Class activity 1: Produce a
similar table using GENDER by
RANKMS
18
Summary of analyses of association


RANKINV and RANKMS are ordinal variables with 4 and 3 categories respectively. RANKINV is constructed from the total score on investigative interests, RANKMS from the proportion of curriculum time spent in Maths/Science.
Gender heads the columns; interest and curriculum participation in Maths/Science form the rows.
  Thus, by convention, gender is the explanatory or independent variable, and interest or curriculum participation is the response or dependent variable.
19
Correlations

Strengths of relationships between two
variables.
20
Correlation
Examples of Research questions




Is there a relationship between student achievement in mathematics and English language?
Is there a relationship between parents’ incomes and children’s VCE results?
Is there a correlation between SES and achievement?
How strong are these relationships?
21
Assumptions (1)






Scores are obtained using a random sample from the population.
Independence of observations.
The distribution of the variable(s) involved is normal.
Homoscedasticity: the variance of the dependent variable is the same for all values of X (residual variance, or conditional variance).
Linearity: the relationship between the two variables should be linear.
Related pairs: both pieces of information must be from the same subjects.
22
Data Set

Vietnam Data Set
vnsample2.sav
23
Scatter plot
24
Producing a Scatterplot

GRAPHS-SCATTER-SIMPLE-DEFINE

Select MEASUREMENT score (pmes500) to make this the Y variable.
Select NUMBER score (pnum500) to make this the X variable.
Click OK.
The scatterplot should appear in the OUTPUT window.
25
Scatter plot
26
Interpretation of Scatter plot



Step 1: Checking for outliers
Step 2: Inspecting the distribution of data points:
  Are the data points scattered all over?
  Are the data points neatly arranged in a narrow cigar shape?
  Could we draw a straight line through the main cluster of points, or would a curved line better represent the points?
  Is the shape of the cluster even from one end to the other? (If it starts off narrow and then gets wider, the data may violate the assumption of homoscedasticity: at different values of X, the variability of Y is different.)
Step 3: Determining the direction of the relationship between the two variables: positive or negative correlation
27
Direct Relationship




When values on two variables tend to go in the
same direction, we call this a direct relationship.
The correlation between children’s ages and
heights is a direct relationship.
That is, older children tend to be taller than
younger children.
This is a direct relationship because children
with higher ages tend to have higher heights.
28
Inverse Relationship




When values on two variables tend to go in
opposite directions, we call this an inverse
relationship.
The correlation between students’ number of
absences and level of achievement is an
inverse relationship.
That is, students who are absent more often
tend to have lower achievement.
This is an inverse relationship because
children with higher numbers of absences
tend to have lower achievement scores.
29
Scatter plot
30
How to run correlation




Highlight ANALYSE, CORRELATE, BIVARIATE.
Copy the two variables into the VARIABLES box.
Check that the PEARSON box (for two continuous variables; see the notes for other variable types) and the TWO-TAILED box are ticked.
Click OK.
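
For readers who want to see what the Pearson option computes, here is a minimal sketch in plain Python, written from the textbook definition of r. It is not SPSS code. The two small data sets are hypothetical, chosen to echo the direct and inverse relationships described earlier; they are not taken from vnsample2.sav.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for illustration only.
age = [6, 7, 8, 9, 10, 11]
height_cm = [115, 122, 127, 133, 138, 144]   # rises with age: direct
absences = [0, 2, 4, 6, 8, 10]
achievement = [88, 85, 79, 74, 70, 61]       # falls with absences: inverse

r_direct = pearson_r(age, height_cm)
r_inverse = pearson_r(absences, achievement)
print(f"age vs height:           r = {r_direct:+.3f}")
print(f"absences vs achievement: r = {r_inverse:+.3f}")
```

The sign of r gives the direction of the relationship and its absolute size gives the strength.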
31
OUTPUT AND INTERPRETATION
Correlations

                                                      pnum500    pmes500
pnum500 NUMBER 500 SCORE     Pearson Correlation      1          .821**
IN PUPIL MATHEMATICS         Sig. (2-tailed)                     .000
                             N                        733        733
pmes500 MEASUREMENT 500      Pearson Correlation      .821**     1
SCORE IN PUPIL MATH.         Sig. (2-tailed)          .000
                             N                        733        733

**. Correlation is significant at the 0.01 level (2-tailed).

Callouts: the Pearson Correlation row gives the correlation coefficient (r); the Sig. (2-tailed) row gives the p value; the N row gives the number of cases.

Step 1: Checking information about sample size
Step 2: Determining the directions and strengths of the relationships
Step 3: Calculating the coefficient of determination (r²)
Step 4: Assessing the significance
32
Correlation Coefficient



The relationship between two variables may be expressed with a number between -1.00 and 1.00. This number is called a correlation coefficient.
The closer the correlation coefficient is to 0.00, the weaker the relationship between the two variables. The closer the coefficient is to 1.00 or -1.00, the stronger the relationship.
According to Cohen (1988):
  r = .10 to .29 or r = -.10 to -.29: small
  r = .30 to .49 or r = -.30 to -.49: medium
  r = .50 to 1.00 or r = -.50 to -1.00: large
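
The Cohen (1988) bands can be written as a small helper. This is a sketch in plain Python; the function name and the label used for values below .10 are assumptions, since the bands above do not name that range.

```python
def cohen_label(r):
    """Classify the size of a correlation using the Cohen (1988) bands."""
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    # Assumption: values under .10 fall outside the named bands.
    return "below small"


# .821 is the coefficient from the SPSS output; the others are hypothetical.
for r in (0.821, -0.35, 0.12, 0.05):
    print(f"r = {r:+.3f}: {cohen_label(r)}")
```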
33
Some caveats about correlation
and scatter plots - 1
Make a scatter plot of Measurement score against Number score again.
This time, double click on the plot to get into Chart Editor.
Change both X and Y axis scales to have a minimum of 200 and a maximum of 750.
Does the strength of the relationship look weaker in this graph compared with the one where the minimum is 0 and the maximum is 1000?

34
Some caveats about correlation
and scatter plots - 2

Be aware that judging the strength of
relationship based on visual perception of
scatter plot could be flawed, as the scale
of the plots can make a difference.
35
Some caveats about correlation
and scatter plots - 3

Create a new variable pmes10 using Transform → Compute new variable such that pmes10 = pmes500/100 + 5.
That is, we have transformed the measurement score to have a mean of 10 and a standard deviation of 1.
Compute the correlation between pmes10 and pnum500.
How does this correlation compare with the correlation between pmes500 and pnum500?
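
The same exercise can be tried outside SPSS. The sketch below is plain Python with hypothetical scores, not the vnsample2.sav data. It applies the slide's transformation (divide by 100, then add 5) and compares the two correlations.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores on a 500/100 scale, for illustration only.
pnum500 = [380, 450, 520, 610, 470, 560, 430, 640]
pmes500 = [400, 430, 540, 580, 500, 530, 460, 620]

# Same transformation as in the slide: divide by 100, then add 5.
pmes10 = [score / 100 + 5 for score in pmes500]

r_original = pearson_r(pnum500, pmes500)
r_rescaled = pearson_r(pnum500, pmes10)
print(f"r with pmes500: {r_original:.6f}")
print(f"r with pmes10:  {r_rescaled:.6f}")
```

Dividing by a positive constant and adding a constant changes the mean and standard deviation of a variable but leaves its correlation with other variables unchanged.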
Some caveats about correlation
and scatter plots - 4



Compute the correlation between pmes500 and pnum500, but only for scores between 300 and 600.
You can do this by selecting a sample:
Data → Select cases → If condition is satisfied:
pnum500 > 300 and pnum500 < 600 and pmes500 > 300 and pmes500 < 600

How does the correlation compare with that from the full sample?
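
The effect of restricting the range can be explored with simulated data. The sketch below is plain Python; the scores are generated at random (an assumption made purely for illustration, not the vnsample2.sav data) so that the two variables correlate at roughly 0.8 in the full sample.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Simulated scores: both variables share a common component, which
# produces a correlation of about 0.8 on a mean-500, SD-100 scale.
rng = random.Random(1)
pairs = []
for _ in range(5000):
    common = rng.gauss(0, 1)
    num = 500 + 100 * (0.9 * common + 0.436 * rng.gauss(0, 1))
    mes = 500 + 100 * (0.9 * common + 0.436 * rng.gauss(0, 1))
    pairs.append((num, mes))

# Keep only cases where both scores lie strictly between 300 and 600,
# mirroring the Select Cases condition in the slide.
restricted = [(n, m) for n, m in pairs if 300 < n < 600 and 300 < m < 600]

r_full = pearson_r(*zip(*pairs))
r_restricted = pearson_r(*zip(*restricted))
print(f"full sample:       n = {len(pairs)}, r = {r_full:.3f}")
print(f"restricted sample: n = {len(restricted)}, r = {r_restricted:.3f}")
```

Cutting off the extremes of both variables removes much of the shared variation, so the correlation in the restricted sample is typically lower than in the full sample.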
37
Some caveats about correlation
and scatter plots - 5
The file TIMSSandPISA 2003.doc shows TIMSS and PISA 2003 maths country mean scores for 22 countries.
Plot a scatter graph of TIMSS against PISA scores.
Compute the correlation.
Repeat without Tunisia and Indonesia.

38
Coefficient of determination
This is the squared correlation coefficient.
It represents the proportion of common variation in the two variables.

39
Proportion of variance explained
The proportion of variance explained is equal to the square of the correlation coefficient.
If the correlation between alternate forms of a standardized test is 0.80, then (0.80)² = 0.64 of the variance in scores on one form of the test is explained by, or associated with, the variance of scores on the other form.
That is, 64% of the variance one sees in scores on one form is associated with the variance of scores on the other form. Consequently, only 36% (100% – 64%) of the variance of scores on one form is unassociated with the variance of scores on the other form.
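
The arithmetic is simple enough to check in a few lines. This sketch is plain Python with an illustrative function name. It uses the 0.80 example above and the r = .821 reported in the Correlations output.

```python
def variance_explained(r):
    """Return the coefficient of determination (r squared)."""
    return r ** 2


for r in (0.80, 0.821):
    shared = variance_explained(r)
    print(f"r = {r:.3f}: {shared:.1%} shared, {1 - shared:.1%} unassociated")
```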
40
Presenting the results for
Correlation
Purpose of the test
Variables involved
r value, number of cases, p value
R-square
Interpretation

41
Class activity 1




What are the correlations between three dimensions of mathematics: number, measurement and space?
Is it true that students who perform well in the number strand also do well in the measurement and space strands?
Datafile: VNsample2.sav
Variables: student scores in measurement, number and space
42
Regression line
43
Producing a Scatterplot with
regression line

GRAPHS-SCATTER-SIMPLE-DEFINE

Select MEASUREMENT 500 SCORE IN PUPIL MATH to make this the Y variable.
Select NUMBER 500 SCORE IN PUPIL MATH to make this the X variable.
Click OK.
The scatterplot should appear in the OUTPUT window.
Double click anywhere in the scatter plot to open the SPSS CHART EDITOR, click on ELEMENTS → FIT LINE AT TOTAL, and make sure that LINEAR is selected.



44
Regression line
45
Regression line
[Figure: scatter plot (Series1) with a fitted linear regression line, y = 0.7444x + 124.76, R² = 0.5654. Both axes run from 0 to 1000.]
46
Linear Regression
Example of research questions



To what extent can student achievement in Vietnamese language predict student achievement in mathematics?
How well can family income (or wealth) predict student performance?
How well can university entrance scores predict student success at university?
47
Regression equation
y = a + bx + e
y and x are the dependent and independent variables respectively.
a is the intercept (the point at which the line cuts the vertical axis).
b is the slope of the line, or the regression coefficient.
e is the error term.
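
To make the roles of a, b and e concrete, here is a minimal least-squares sketch in plain Python. It uses the standard formulas b = Sxy / Sxx and a = mean(y) - b * mean(x). The data are hypothetical and the function name is illustrative; SPSS performs the equivalent estimation internally.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope (regression coefficient)
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical data, chosen only to illustrate the calculation.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(x, y)

# Residuals are the estimated error terms e, one per case.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

With an intercept in the model, the residuals from a least-squares fit sum to zero (up to rounding).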
48
How to RUN REGRESSION







Highlight ANALYSE, REGRESSION, LINEAR.
Copy the continuous dependent variable into the DEPENDENT box.
Copy the independent variable into the INDEPENDENT box.
For METHOD, make sure that ENTER is selected.
Click STATISTICS and tick ESTIMATES, MODEL FIT and DESCRIPTIVES, then CONTINUE.
Click on OPTIONS, select INCLUDE CONSTANT IN EQUATION and EXCLUDE CASES PAIRWISE, then CONTINUE.
Click OK.
49
An example for regression
Research question: To what extent can student scores in reading predict student scores in mathematics?
Datafile: VNsample2.sav
Variables: Reading 500 scores, Mathematics 500 scores

50
OUTPUT AND INTERPRETATION
Step 1: Checking descriptive statistics

Descriptive Statistics

                                                      Mean       Std. Deviation   N
prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]      494.8215   104.11670        733
pma500 PUPIL MATHEMATICS 500 SCORE                    493.1033   103.07274        733

Correlations

                                                                 prd500    pma500
Pearson Correlation   prd500 PUPIL READING 500 SCOR.             1.000     .752
                      pma500 PUPIL MATHEMATICS 500 SCORE         .752      1.000
Sig. (1-tailed)       prd500 PUPIL READING 500 SCOR.             .         .000
                      pma500 PUPIL MATHEMATICS 500 SCORE         .000      .
N                     prd500 PUPIL READING 500 SCOR.             733       733
                      pma500 PUPIL MATHEMATICS 500 SCORE         733       733
51
OUTPUT AND
INTERPRETATION (2)
Step 2: Evaluating the model

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .752a   .565       .565                67.99595

a. Predictors: (Constant), prd500 PUPIL READING 500 SCOR. [MEAN=500/SD=100]
52
Output and interpretation
Step 3: Evaluating the effect of the independent variable

Coefficients (a)

                                  Unstandardized           Standardized
                                  Coefficients             Coefficients
Model                             B          Std. Error    Beta            t        Sig.
1   (Constant)                    124.761    12.205                        10.222   .000
    prd500 PUPIL READING 500
    SCOR. [MEAN=500/SD=100]       .744       .024          .752            30.839   .000

a. Dependent Variable: pma500 PUPIL MATHEMATICS 500 SCORE

Callout on the t and Sig. columns: the t-tests with significance levels for the constant (a) and the regression coefficient (b).
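
The numbers in this output fit together, and a short sketch can show how. The Python below copies values from the Coefficients and Descriptive Statistics tables. The function and variable names are illustrative, and the relations used (t = B / Std. Error, Beta = B x SD of predictor / SD of outcome) are the standard ones for simple linear regression.

```python
# Values copied from the SPSS output above.
a = 124.761             # intercept (Constant), unstandardized B
b = 0.744               # slope for prd500 (reading), unstandardized B
se_a = 12.205           # standard error of the intercept
sd_reading = 104.11670  # Std. Deviation of prd500
sd_maths = 103.07274    # Std. Deviation of pma500


def predict_maths(reading_score):
    """Predicted mathematics score from the fitted line y = a + b*x."""
    return a + b * reading_score


# t statistic = coefficient / standard error.
t_constant = a / se_a

# Standardized coefficient: slope rescaled by the two standard deviations.
beta = b * sd_reading / sd_maths

print(f"predicted maths score at reading = 500: {predict_maths(500):.1f}")
print(f"t for the constant: {t_constant:.3f}")
print(f"beta: {beta:.3f}")
```

Recomputing t for the slope from the rounded values (.744 / .024) gives about 31.0 rather than 30.839, because SPSS works with unrounded coefficients.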
53
Presenting the results for
regression







Purpose of the test
Variables involved
Number of cases
Intercept
Un-standardised b and standardised (beta)
coefficients, SE, p value
R-square
Interpretation of the relationship
54
Class activity 2
To what extent can school resources predict student achievement in maths and Vietnamese language?
Datafile: VNsample2.sav
Variables: school resources index, math achievement

55