Crosstabs & Measures of Association

POL242
October 9 and 11, 2012
Jennifer Hove
Questions of Causality
• Recall:
  • Most causal thinking in the social sciences is probabilistic, not deterministic: as X increases, the probability of Y increases; X does not invariably produce Y
  • We can observe only association (per Hume)
  • We must therefore infer causation
  • Not one, but many possible causes
Inferring Causal Relations
1. There must be association
  • X → Y; ~X → ~Y
2. Time order must be considered
  • Presumed cause should precede presumed effect
3. Must rule out possible rival explanations
  • Sometimes what appears to be a strong relationship between two variables is due to the influence of others
4. Must be able to identify the process by which one factor brings about change in another
  • Causal linkage
Establishing Association
• With nominal or ordinal data, relationships are usually presented in tabular form
• Why? Hypotheses rest on the core idea of comparison
  • Ex: if we compare respondents on the basis of their value on the IV, say party identification, they should also differ on the DV, say support for gay rights
• Crosstabs are a wonderful means of making comparisons
• “God speaks to you through crosstabs!”
Using/Interpreting Crosstabs
Table 1: Support for the Afghan Mission by Perceived Impact of Taliban Resurgence, 2007

                    Fear of Taliban Resurgence
Support for                                          All
Afghan Mission      Low            High              Respondents
Low                 86.1%          52.7%             60.4%
                    (173)          (355)             (528)
High                13.9           47.3              39.6
                    (28)           (318)             (346)
Total               100            100               100
(N)                 (201)          (673)             (874)

Tau-b = .29
Source: Strategic Counsel, CTV/Globe and Mail Survey, July 2007

• Data arranged in side-by-side frequency distributions
• IV (X) presented across the top of the table, in columns
  • If ordinal, arrange from low scores (on left) to high scores (on right)
• DV (Y) presented down the left-hand side of the table, in rows
  • Again, if ordinal, arrange from low (at top) to high (at bottom)
Using/Interpreting Crosstabs
• Data presented so that categories of the IV add to 100%
  • Percentaging within categories of the IV (down in a table)
• Comparisons are made across categories of the IV
  • From left to right
  • To see the effect of the IV on the DV
Rules (!) of Crosstabs
1. Make the IV define the columns and the DV define the rows of the table
2. Always percentage down within categories of the IV
3. Interpret the relationship by comparing across columns, within rows of the table
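The three rules can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the cell counts are the ones reported in Table 1, while the function name `column_percentages` and the list-of-rows layout are assumptions made for the example.

```python
# Cell counts from Table 1. Rows are the DV (support for the mission),
# columns are the IV (fear of Taliban resurgence), both ordered low to high.
counts = [
    [173, 355],  # support: low
    [28, 318],   # support: high
]


def column_percentages(table):
    """Rule 2: percentage down, so each column (IV category) sums to 100."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [100.0 * cell / col_totals[j] for j, cell in enumerate(row)]
        for row in table
    ]


pct = column_percentages(counts)
for label, row in zip(["Low", "High"], pct):
    print(label, [f"{p:.1f}%" for p in row])

# Rule 3: compare across columns, within a row. Here: the percentage-point
# difference in high support between high-fear and low-fear respondents.
diff_high_support = pct[1][1] - pct[1][0]
print(f"Difference in high support: {diff_high_support:.1f} points")
```

Reading across the "High" row, support for the mission is about 33 points higher among respondents with a high fear of Taliban resurgence.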
Example: 2 x 2 Crosstab
Support for Y Variable by Support for X Variable

                 Score on X Variable
Score on
Y Variable       Low       High
Low              A         C         A+C
High             B         D         B+D
                 A+B       C+D
Diagonals
• Main diagonal: running to the right and down
• When a larger proportion of cases falls on the main diagonal, the relationship is said to be direct or positive
  • Low values on X associated with low values on Y; high values on X associated with high values on Y

                 Score on X Variable
Score on
Y Variable       Low       High
Low              A         C         A+C
High             B         D         B+D
                 A+B       C+D
Diagonals
• Off diagonal: running to the right and up
• When a larger proportion of cases falls on the off diagonal, the relationship is said to be inverse or negative
  • Low values on X associated with high values on Y; high values on X associated with low values on Y

                 Score on X Variable
Score on
Y Variable       Low       High
Low              A         C         A+C
High             B         D         B+D
                 A+B       C+D
Explaining Variation in Y
• Relationships between variables in the social sciences are rarely, if ever, perfectly predictable
• You are unlikely to see something like this:

Support for Y Variable by Support for X Variable

                 Score on X Variable
Score on
Y Variable       Low       High
Low              100%      0
High             0         100%
Total            100       100
Explaining Variation in Y
• There is likely to be more than one explanation or “cause” behind the variation in Y
• So we will generally be looking at:
  • X1 → Y
  • X2 → Y
• To compare, we want to know the relative strength of each relationship
• A variety of summary terms called measures of association are used
Measures of Association
• Compress the information that appears in a crosstab into a single number by summarizing:
  • Magnitude (strength) of the relationship
  • Direction of the relationship
• Magnitude: ranges from 0 (completely unpredictable) to 1 (perfectly predictable)
• Direction: positive (+) = cases primarily on main diagonal; negative (-) = cases primarily on off diagonal
Two Cautionary Notes
• Direction is not meaningful with nominal-level variables, since they are not ordered/ranked from low to high
• Even with ordinal measurement, interpretation of direction depends entirely on how your variables are coded
  • Always code your variables so that high scores indicate “more” of what you want to explain
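A small sketch of the second caution, assuming the Table 1 counts and a hypothetical helper `diagonal_balance` (the cross-product difference AD - BC, used here only as a quick indicator of direction). Reversing the coding of the DV leaves the data untouched but flips the sign.

```python
def diagonal_balance(table):
    """Cross-product difference AD - BC for a 2 x 2 table.

    table is [[A, C], [B, D]]: first row = first Y category,
    columns = X low, X high. Positive means the main diagonal dominates.
    """
    (a, c), (b, d) = table
    return a * d - b * c


# Table 1 counts with the DV coded low (top row) to high (bottom row).
table1 = [[173, 355], [28, 318]]

# Same respondents, but the DV coded in the opposite order (high, then low).
recoded = list(reversed(table1))

print(diagonal_balance(table1))   # 45074
print(diagonal_balance(recoded))  # -45074
```

The strength of the relationship is identical in both versions; only the apparent direction changes, which is why the coding has to be checked before a sign is interpreted.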
Direction & Strength
• Combining direction & strength, we get a range of possibilities

-1.0   -.8   -.6   -.4   -.2   0   +.2   +.4   +.6   +.8   +1.0

• All intermediary values can also occur, e.g. -.2367
• Note that equivalent positive and negative scores are equal in strength
  • Ex: +.4 and -.4 are equal in strength; they differ only in direction
Choosing among Measures
• We use different measures of association for 2 main reasons:
1. There are different levels of measurement
  • Ordinal measurement offers ranking information used to calculate association, which isn’t available with nominal data
2. Some measures are specific to tables of certain sizes and shapes
  • Specific measures for 2 x 2 tables; others for larger square tables; still others for rectangular tables
Phi Φ
• Use with dichotomous variables, 2 x 2 tables
• Applies to nominal and ordinal data
• Measures the strength of a relationship by taking the cases on the main diagonal (A x D) minus the cases on the off diagonal (B x C), adjusting for the marginal distribution of cases, i.e. the sums of the columns and rows

$$\Phi = \frac{AD - BC}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$$
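As a rough check of the formula, here is a minimal Python sketch applied to the cell counts of Table 1. The function name `phi` and its argument order are assumptions for illustration; the cell labels follow the 2 x 2 layout used on the earlier slides (A and B in the low-X column, C and D in the high-X column).

```python
import math


def phi(a, b, c, d):
    """Phi for a 2 x 2 table.

    a = low Y / low X,  c = low Y / high X,
    b = high Y / low X, d = high Y / high X.
    """
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Counts from Table 1 (support for the Afghan mission by fear of resurgence).
phi_table1 = phi(a=173, b=28, c=355, d=318)
print(round(phi_table1, 2))  # 0.29
```

For a 2 x 2 table this agrees with the Tau-b of .29 reported under Table 1.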
2 Examples: Phi Φ

                 Score on X Variable
Score on
Y Variable       Low       High
Low              75%       10%
High             25%       90%
Total            100       100

Φ = .6

                 Score on X Variable
Score on
Y Variable       Low       High
Low              50%       20%
High             50%       80%
Total            100       100

Φ = .2
Cramer’s V
• An extension of Phi
• Logic of Cramer’s V is based on percentage differences across the columns, not on the logic of diagonals
• Use with nominal data, when tables are larger than 2 x 2
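The slide gives no formula for Cramer’s V. The sketch below assumes the standard chi-square-based definition, V = sqrt(chi-square / (N x (k - 1))), where k is the smaller of the number of rows and columns. The function name `cramers_v` and the 3 x 2 counts are hypothetical and serve only as illustration; the code also assumes no row or column is entirely empty.

```python
import math


def cramers_v(table):
    """Cramer's V from the chi-square statistic of an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical counts: a three-category nominal DV (rows) by a
# two-category IV (columns). Not taken from the lecture.
hypothetical = [[40, 10], [25, 25], [10, 40]]
v_hypothetical = cramers_v(hypothetical)
print(round(v_hypothetical, 2))  # 0.49

# On a 2 x 2 table V reduces to the absolute value of Phi (Table 1 counts).
v_table1 = cramers_v([[173, 355], [28, 318]])
print(round(v_table1, 2))  # 0.29
```

Because V is built from squared deviations it carries no sign, which fits its use with nominal data, where direction has no meaning.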
Lambda
• Lambda (λ) is another measure of association for nominal data
• Its rationale of “percentage of improvement” or “proportional reduction in error” is relatively easy to explain
• Not recommended in this course
  • When the modal category of each column is in the same row, λ = 0
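The weakness noted in the last bullet can be seen with the Table 1 counts. The sketch assumes the usual proportional-reduction-in-error definition of lambda with the row variable (the DV) as dependent; the function name `lambda_dv` and the second table are hypothetical.

```python
def lambda_dv(table):
    """Lambda with the row variable (DV) treated as dependent.

    table[i][j] is the count in DV category i and IV category j.
    """
    n = sum(sum(row) for row in table)
    largest_row_total = max(sum(row) for row in table)
    # Errors made when always guessing the overall modal DV category
    errors_without_iv = n - largest_row_total
    # Errors made when guessing the modal DV category within each IV column
    errors_with_iv = n - sum(max(col) for col in zip(*table))
    return (errors_without_iv - errors_with_iv) / errors_without_iv


# Table 1: the modal category of both columns is the "Low" row.
lambda_table1 = lambda_dv([[173, 355], [28, 318]])
print(lambda_table1)  # 0.0

# Hypothetical counts where the modal row differs between columns.
lambda_hypothetical = lambda_dv([[30, 10], [10, 30]])
print(lambda_hypothetical)  # 0.5
```

Table 1 shows a clear percentage difference across columns, yet lambda reports no association at all because knowing the IV never changes the best guess.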
Measures of Association: Ordinal Data
• Measures include Tau-b, Tau-c and Gamma
• Rely on analysis of diagonals

                 Support for X
Support
for Y            Low       Med       High
Low              a         b         c
Med              d         e         f
High             g         h         i
Mind your Ps and Qs
• The letter P indicates the # of pairs of cases on the main diagonals (from left to right)
• The letter Q indicates the # of pairs of cases on the off diagonals (from right to left)
• If P > Q, we have a positive association
• If P < Q, we have a negative association
• The core calculation = P - Q
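One way to count P and Q for a table of any size is sketched below. It assumes rows (Y) and columns (X) are both ordered from low to high; the function name `concordant_discordant` is hypothetical, and the counts are those of Table 1.

```python
def concordant_discordant(table):
    """Return (P, Q) for a table of counts ordered low to high on both axes.

    P counts pairs of cases where one case is higher on both X and Y
    (main-diagonal direction); Q counts pairs where one case is higher
    on Y but lower on X (off-diagonal direction).
    """
    p = q = 0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            for lower_row in table[i + 1:]:          # rows with higher Y
                p += cell * sum(lower_row[j + 1:])   # cells to the right
                q += cell * sum(lower_row[:j])       # cells to the left
    return p, q


table1 = [[173, 355], [28, 318]]
P, Q = concordant_discordant(table1)
print(P, Q, P - Q)  # 55014 9940 45074
```

Since P > Q, the association in Table 1 is positive.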
Gamma
• The information of P and Q can be used to calculate Gamma (γ)

$$\gamma = \frac{P - Q}{P + Q} = \frac{P}{P + Q} - \frac{Q}{P + Q}$$

• Problems:
  • Any vacant cell produces a score of 1.0
  • Tends to overstate strength of a relationship
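Both problems can be illustrated with a short sketch. The helper names `pairs` and `gamma` are assumptions for the example, the first table uses the Table 1 counts, and the second table is hypothetical, chosen to have one empty cell.

```python
def pairs(table):
    """Return (P, Q) for a table ordered low to high on rows and columns."""
    p = q = 0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            for lower_row in table[i + 1:]:
                p += cell * sum(lower_row[j + 1:])
                q += cell * sum(lower_row[:j])
    return p, q


def gamma(table):
    """Gamma = (P - Q) / (P + Q)."""
    p, q = pairs(table)
    return (p - q) / (p + q)


# Table 1: Tau-b is reported as .29, but Gamma is much larger.
gamma_table1 = gamma([[173, 355], [28, 318]])
print(round(gamma_table1, 2))  # 0.69

# Hypothetical 2 x 2 counts with one vacant cell: Q is 0, so Gamma is 1.0
# even though 30 of the 100 cases sit off the main diagonal.
gamma_vacant = gamma([[50, 30], [0, 20]])
print(gamma_vacant)  # 1.0
```

Gamma ignores tied pairs entirely, which is why it runs higher than measures that keep ties in the denominator.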
Tau-b and Tau-c
• Preferable to Gamma, though built on the same logic of diagonals
• Tend to produce results similar to phi (using nominal data) or the most important interval measure (r), to be discussed later in the year

$$\text{Tau-b} = \frac{P - Q}{\sqrt{(P + Q + X)(P + Q + Y)}}$$
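The slide does not define X and Y in the formula. The sketch below assumes the usual reading: X is the number of pairs tied on the X variable only (same column, different rows) and Y is the number of pairs tied on the Y variable only (same row, different columns). The function name `tau_b` is hypothetical; the counts are those of Table 1.

```python
import math


def tau_b(table):
    """Tau-b for a table of counts ordered low to high on rows and columns."""
    p = q = tied_x = tied_y = 0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            # Same row (tied on Y), different column
            tied_y += cell * sum(row[j + 1:])
            for lower_row in table[i + 1:]:
                p += cell * sum(lower_row[j + 1:])   # higher on both X and Y
                q += cell * sum(lower_row[:j])       # higher on Y, lower on X
                tied_x += cell * lower_row[j]        # same column (tied on X)
    return (p - q) / math.sqrt((p + q + tied_x) * (p + q + tied_y))


tau_b_table1 = tau_b([[173, 355], [28, 318]])
print(round(tau_b_table1, 2))  # 0.29
```

Under this reading the result matches the Tau-b of .29 reported under Table 1, and for a 2 x 2 table it coincides with Phi.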
Tau-b and Tau-c
• Tau-b never quite reaches 1.0 in non-square tables
• So Tau-c was developed for use with rectangular tables
• In practice, the difference between Tau-b and Tau-c when applied to the same table is not great, but keep the distinction above in mind
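The ceiling on Tau-b can be shown with a hypothetical 2 x 4 table in which Y rises perfectly with X. The slides give no formula for Tau-c, so the sketch assumes Stuart's standard version, 2m(P - Q) / (N² (m - 1)), where m is the smaller of the number of rows and columns; all function names are illustrative.

```python
import math


def pair_counts(table):
    """Return (P, Q, pairs tied on X only, pairs tied on Y only)."""
    p = q = tied_x = tied_y = 0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            tied_y += cell * sum(row[j + 1:])
            for lower_row in table[i + 1:]:
                p += cell * sum(lower_row[j + 1:])
                q += cell * sum(lower_row[:j])
                tied_x += cell * lower_row[j]
    return p, q, tied_x, tied_y


def tau_b(table):
    p, q, tied_x, tied_y = pair_counts(table)
    return (p - q) / math.sqrt((p + q + tied_x) * (p + q + tied_y))


def tau_c(table):
    p, q, _, _ = pair_counts(table)
    n = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))
    return 2 * m * (p - q) / (n ** 2 * (m - 1))


# Hypothetical counts: everyone low on X is low on Y, everyone high on X
# is high on Y, so no pair of cases is out of order.
rectangular = [[10, 10, 0, 0], [0, 0, 10, 10]]
tau_b_rect = tau_b(rectangular)
tau_c_rect = tau_c(rectangular)
print(round(tau_b_rect, 3))  # 0.816
print(tau_c_rect)            # 1.0
```

With four columns and only two rows, ties on Y are unavoidable, so Tau-b stays below 1.0 while Tau-c reaches it.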
Example
Table 2: Approval of President Chavez by Opinion of the United States, 2007

                         Opinion of the United States
Approval of                                                        All
Chavez         Very Bad     Bad        Good       Very Good     Respondents
Disapprove     12.7%        22.8%      43.4%      67.9%         35.6%
               (26)         (64)       (171)      (110)         (371)
Approve        87.3         77.2       56.6       32.1          64.4
               (178)        (217)      (223)      (52)          (670)
Total          100          100        100        100           100
(N)            (204)        (281)      (394)      (162)         (1041)

Tau-c: -.39   Tau-b: -.35
Source: Latinobarometer, 2007 – Venezuelan respondents only
Summing Up
• With nominal data, use Phi or Cramer’s V
  • Phi used for 2 x 2 tables
  • Cramer’s V used for any other crosstab involving nominal data
  • Avoid Lambda
• With ordinal data, use Tau-c or Tau-b
  • Tau-b used for square tables: 3 x 3, 4 x 4, etc.
  • Tau-c used for rectangular tables
  • Avoid Gamma