Measures of Association for Crosstabulations

Crosstabulation and Measures of Association

 Investigating the relationship between two variables.
 Generally, a statistical relationship exists if the values of the observations for one variable are associated with the values of the observations for another variable.
 Knowing that two variables are related allows us to make predictions: if we know the value of one, we can predict the value of the other.
 Determining how the values of one variable are related to the values of another is one of the foundations of empirical science.
 In making such determinations we must consider the following features of the relationship.
 1.)
The level of measurement of the
variables. Difference varibles necessitate
different procedures.
 2.) The form of the relationship. We can
ask if changes in X move in lockstep with
changes in Y or if a more sophisticated
relationship exists.
 3.)The strength of the relationship. Is it
possible that some levels of X will always
be associated with certain levels of Y?

4.) Numerical Summaries of the relationship.
Social scientists strive to boil down the different
aspects of a relationship to a single number that
reveals the type and strength of the association.
 5.) Conditional relationships. The variables X
and Y may seem to be related in some fashion
but appearances can be deceiving.
Spuriousness for example. So we need to know
if the introduction of any other variables into the
analysis changes the relationship.
Types of Association

 1.) General Association – the variables are simply associated in some way.
 2.) Positive Monotonic Correlation – when the variables have order (ordinal or continuous), high values of one variable are associated with high values of the other. The converse is also true.
 3.) Negative Monotonic Correlation – low values of one variable are associated with high values of the other.
Types of Association Cont.

 4.) Positive Linear Association – a particular type of positive monotonic relationship where the plotted values of X and Y fall on a straight line that slopes upward.
 5.) Negative Linear Relationship – the plotted values fall on a straight line that slopes downward.
Strength of Relationships

 Virtually no relationships between variables in social science (and largely in natural science as well) have a perfect form.
 As a result, it makes sense to talk about the strength of relationships.
Strength Cont.

 The strength of a relationship between variables can be found by simply looking at a graph of the data.
 If the values of X and Y are tied together tightly, then the relationship is strong.
 If the X-Y points are spread out, then the relationship is weak.
Direction of Relationship

 We can also infer direction from a graph by simply observing how the values for our variables move across the graph.
 This is only true, however, when our variables are ordinal or continuous.
Types of Bivariate Relationships and Associated Statistics

 Nominal/Ordinal (including dichotomous) – Crosstabulation (Lambda, Chi-Square, Gamma, etc.)
 Interval and Dichotomous – Difference of means test
 Interval and Nominal/Ordinal – Analysis of Variance
 Interval and Ratio – Regression and correlation
Assessing Relationships between Variables

 1. Calculate the appropriate statistic to measure the magnitude of the relationship in the sample.
 2. Calculate additional statistics to determine if the relationship holds for the population of interest (statistical significance).
 Substantive significance vs. statistical significance.
What is a Crosstabulation?

 Crosstabulations are appropriate for examining relationships between variables that are nominal, ordinal, or dichotomous.
 Crosstabs show values for one variable categorized by another variable.
 They display the joint distribution of values of the variables by listing the categories for one along the x-axis and the other along the y-axis.
 Each case is then placed in the cell of the table that represents the combination of values that corresponds to its scores on the variables.
What is a Crosstabulation?

 Example: We would like to know if presidential vote choice in 2000 was related to race.
 Vote choice = Gore or Bush
 Race = White, Hispanic, Black
Are Race and Vote Choice Related? Why?

        Black   Hispanic   White   TOTAL
Gore    106     23         427     556
Bush    8       15         484     507
TOTAL   114     38         911     1063
Are Race and Vote Choice Related? Why?

        Black        Hispanic     White         TOTAL
Gore    106 (93%)    23 (60.5%)   427 (46.9%)   556 (52.3%)
Bush    8 (7%)       15 (39.5%)   484 (53.1%)   507 (47.7%)
TOTAL   114 (100%)   38 (100%)    911 (100%)    1063 (100%)
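A minimal sketch of how such a table could be built in Python with pandas, assuming respondent-level data (the records below are reconstructed from the cell counts above; the variable names are illustrative):

import pandas as pd

# Hypothetical respondent-level records reconstructed from the counts above.
df = pd.DataFrame({
    "race": ["Black"] * 114 + ["Hispanic"] * 38 + ["White"] * 911,
    "vote": ["Gore"] * 106 + ["Bush"] * 8      # Black respondents
          + ["Gore"] * 23 + ["Bush"] * 15      # Hispanic respondents
          + ["Gore"] * 427 + ["Bush"] * 484,   # White respondents
})

counts = pd.crosstab(df["vote"], df["race"], margins=True)          # raw counts
col_pct = pd.crosstab(df["vote"], df["race"], normalize="columns")  # column shares
print(counts)
print((col_pct * 100).round(1))   # column percentages, as in the table above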
Measures of Association for Crosstabulations

 Purpose – to determine if nominal/ordinal variables are related in a crosstabulation.
 At least one nominal variable: Lambda, Chi-Square, Cramer's V
 Two ordinal variables: Tau, Gamma
Measures of Association for Crosstabulations

 These measures of association provide us with correlation coefficients that summarize the data from a table in one number.
 This is extremely useful when dealing with several tables or very complex tables.
 These coefficients measure both the strength and direction of an association.
Coefficients for Nominal Data

 When one or both of the variables are nominal, ordinal coefficients cannot be used because there is no underlying ordering.
 Instead we use PRE measures.
Lambda (PRE coefficient)

 PRE – Proportional Reduction in Error
 Two rules:
 1.) Make a prediction of the value of an observation in the absence of any prior information.
 2.) Given information on a second variable, take it into account in making the prediction.
Lambda PRE

 If the two variables are associated, then the use of rule two should lead to fewer errors in your predictions than rule one.
 How many fewer errors depends upon how closely the variables are associated.
 PRE = (E1 – E2) / E1, where E1 is the number of errors under rule one and E2 the number under rule two.
 The scale goes from 0 to 1.
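As a quick illustration, the PRE formula is simple enough to express directly in Python (a minimal sketch; the numbers plugged in below anticipate the lambda example worked out later in these slides):

def pre(e1, e2):
    """Proportional reduction in error: (E1 - E2) / E1."""
    return (e1 - e2) / e1

print(pre(60, 50))   # 0.166..., i.e. about .17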
Lambda

 Lambda is a PRE coefficient and it relies on rules 1 & 2 above.
 When applying rule one, all we have to go on is what proportion of the population fits into one category as opposed to another.
 So, without any other information, guessing that every observation is in the modal category would give you the best chance of getting the most correct.
Why?

 Think of it like this: if you knew that I tended to make exams where the most often used answer was B, then, without any other information, you would be best served to pick B every time.
 But, if you know each case's value on another variable, rule two directs you to look only at the members of that category and find the modal category within it.
Example

 Suppose we have a sample of 100 voters and need to predict how they will vote in the general election.
 Assume we know that overall 30% voted Democrat, 30% voted Republican, and 40% were independent.
 Now suppose we take one person out of the group (John Smith); our best guess would be that he would vote independent.
 Now suppose we take another person (Larry Mendez); again we would guess that he voted independent.
 As a result, our best guess is to predict that all of the voters (all 100) were independent.
 We are sure to get some wrong, but it's the best we can do over the long run.
 How many do we get wrong? 60.
 Suppose now that we know something about the voters' regions (where they are from) and how each region voted in the election.
 Regional totals: NE – 30, MW – 20, SO – 30, WE – 20
Lambda

         NE    MW    SO    WE    TOTAL
REPUB    4     10    6     10    30
IND      12    8     16    4     40
DEM      14    2     8     6     30
TOTAL    30    20    30    20    100
Lambda – Rule 1
(prediction based solely on knowledge of the marginal distribution of the dependent variable – partisanship)

Each cell shows the observed count, with the number of cases predicted to fall there in parentheses. Rule 1 predicts the overall modal category (Independent) for every case.

         NE        MW        SO        WE        TOTAL
REPUB    4 (0)     10 (0)    6 (0)     10 (0)    30
IND      12 (30)   8 (20)    16 (30)   4 (20)    40
DEM      14 (0)    2 (0)     8 (0)     6 (0)     30
TOTAL    30        20        30        20        100
Lambda – Rule 2
(prediction based on knowledge provided by the independent variable)

Rule 2 predicts each region's modal category: DEM for NE, REPUB for MW, IND for SO, and REPUB for WE.

         NE        MW        SO        WE        TOTAL
REPUB    4 (0)     10 (20)   6 (0)     10 (20)   30
IND      12 (0)    8 (0)     16 (30)   4 (0)     40
DEM      14 (30)   2 (0)     8 (0)     6 (0)     30
TOTAL    30        20        30        20        100
Lambda – Calculation of Errors

 Errors w/Rule 1: 18 + 12 + 14 + 16 = 60
 Errors w/Rule 2: 16 + 10 + 14 + 10 = 50
 Lambda = (Errors R1 – Errors R2) / Errors R1
 Lambda = (60 – 50) / 60 = 10/60 = .17
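A minimal sketch verifying this lambda computation in plain Python (the values are the region-by-party counts from the table above):

# Rows = party; columns = regions in the order NE, MW, SO, WE.
table = {
    "REPUB": [4, 10, 6, 10],
    "IND":   [12, 8, 16, 4],
    "DEM":   [14, 2, 8, 6],
}
n = sum(sum(counts) for counts in table.values())       # 100

# Rule 1: always predict the overall modal party (IND, 40 cases).
e1 = n - max(sum(counts) for counts in table.values())  # 60 errors

# Rule 2: within each region, predict that region's modal party.
e2 = sum(
    sum(table[p][j] for p in table) - max(table[p][j] for p in table)
    for j in range(4)
)                                                       # 16+10+14+10 = 50 errors

print((e1 - e2) / e1)                                   # 0.166..., i.e. .17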
Lambda

 PRE measure
 Ranges from 0 to 1
 Potential problems with Lambda:
 Underestimates the relationship when one or both variables are highly skewed
 Always 0 when the modal category of Y is the same across all categories of X
Chi-Square (χ²)

 Also appropriate for any crosstabulation with at least one nominal variable (and another nominal/ordinal variable)
 Based on the difference between the empirically observed crosstab and what we would expect to observe if the two variables were statistically independent
Background for χ²

 Statistical Independence – a property of two variables in which the probability that an observation is in a particular category of one variable and also in a particular category of the other variable equals the simple or marginal probability of being in those categories.
 Plays a large role in data analysis
 Is another way to view the strength of a relationship
Example

 Suppose we have two nominal or categorical variables, X and Y. We label the categories of the first (a, b, c) and those of the second (r, s, t).
 Let P(X = a) stand for the probability that a randomly selected case has property a on variable X, and P(Y = r) stand for the probability that a randomly selected case has property r on variable Y.
 These two probabilities are called marginal probabilities and simply refer to the chance that an observation has a particular value on a particular variable irrespective of its value on another variable.
 Finally, let us assume that P(X = a, Y = r) stands for the joint probability that a randomly selected observation has both property a and property r simultaneously.
 Statistical Independence – the two variables are therefore statistically independent only if the chance of observing a combination of categories is equal to the marginal probability of one category times the marginal probability of the other.
Background for χ²

 P(X = a, Y = r) = P(X = a) × P(Y = r)
 For example, if men are as likely to vote as women, then the two variables (gender and voter turnout) are statistically independent because the probability of observing a male nonvoter in the sample is equal to the probability of observing a male times the probability of observing a nonvoter.
Example

 If 100/300 are men & 210/300 voted, then the marginal probabilities are: P(X = m) = 100/300 = .33 and P(Y = v) = 210/300 = .7
 .33 × .7 = .23, which is the joint probability we would expect under independence.
 If we know that 70 of the voters are male and divide that number by the total sample size (70/300), we also get .23.
 We can therefore say that the two variables are independent.
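The arithmetic can be checked in a few lines of Python (a sketch using the example's numbers):

p_male = 100 / 300        # marginal probability of being male
p_voter = 210 / 300       # marginal probability of voting
print(p_male * p_voter)   # 0.2333..., the joint probability if independent
print(70 / 300)           # 0.2333..., the observed joint probability -- equal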
 The chi-squared statistic essentially compares an observed result (the table produced by the sample) with a hypothetical table that would occur if (in the population) the variables were statistically independent.
 A value of 0 implies statistical independence, which means no association.
 Chi-squared increases as the departure of observed from expected values grows. There is no upper limit to how big the difference can become, but if it is past a critical value then there is reason to reject the null hypothesis that the two variables are independent.
How do we Calculate χ²?

 The observed frequencies are already in the crosstab.
 The expected frequency in each table cell is found by multiplying the row and column marginal totals and dividing by the sample size.
Chi-Square (χ²)

Observed (O) counts, with expected (E) counts still to be filled in:

         NE       MW       SO       WE       TOTAL
         O   E    O   E    O   E    O   E
REPUB    4        10       6        10       30
IND      12       8        16       4        40
DEM      14       2        8        6        30
TOTAL    30       20       30       20       100
Calculating Expected Frequencies

 To calculate the expected cell frequency for NE Republicans:
 • E/30 = 30/100, therefore E = (30 × 30)/100 = 9
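A minimal sketch of the expected-frequency rule E = (row total × column total) / n, applied to the whole table:

row_totals = [30, 40, 30]       # REPUB, IND, DEM
col_totals = [30, 20, 30, 20]   # NE, MW, SO, WE
n = 100

expected = [[r * c / n for c in col_totals] for r in row_totals]
print(expected[0][0])           # 9.0, the NE Republican cell computed above
print(expected)                 # the full expected table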
Calculating the Chi-Square Statistic

 The chi-square statistic is calculated as:
 χ² = Σ (Oik – Eik)² / Eik
 = (25/9)+(16/6)+(9/9)+(16/6)+(0)+(0)+(16/12)+(16/8)+(25/9)+(16/6)+(1/9)+(0) = 18
         NE        MW        SO        WE        TOTAL
         O    E    O    E    O    E    O    E
REPUB    4    9    10   6    6    9    10   6    30
IND      12   12   8    8    16   12   4    8    40
DEM      14   9    2    6    8    9    6    6    30
TOTAL    30        20        30        20       100
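A short Python check of this sum (observed and expected values taken from the table above):

observed = [[4, 10, 6, 10],     # REPUB
            [12, 8, 16, 4],     # IND
            [14, 2, 8, 6]]      # DEM
expected = [[9, 6, 9, 6],
            [12, 8, 12, 8],
            [9, 6, 9, 6]]

chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(chi2)                     # 18.0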
 The value 9 is the expected frequency in the first cell of the table and is what we would expect in a sample of 100 (with 30 Republicans and 30 Northeasterners) if there is statistical independence in the population.
 This is more than we have in our sample, so there is a difference.
Just Like the Hypothesis Test

 Null: statistical independence between X and Y.
 Alt: X and Y are not independent.
Interpreting the Chi-Square Statistic

 The Chi-Square statistic ranges from 0 to infinity
 0 = perfect statistical independence
 Even though two variables may be statistically independent in the population, in a sample the Chi-Square statistic may be > 0
 Therefore it is necessary to determine statistical significance for a Chi-Square statistic (given a certain level of confidence)
Cramer's V

 Problem with Chi-Square: not comparable across different sample sizes (and their associated crosstabs)
 Cramer's V is a standardization of the Chi-Square statistic
Calculating Cramer's V

 V = √( χ² / (N × min(R − 1, C − 1)) ), where R = # rows and C = # columns
 V ranges from 0 to 1
 Example (region and partisanship):
 V = √( 18 / (100 × (3 − 1)) ) = √.09 = .30
Relationships between Ordinal Variables

 There are several measures of association appropriate for relationships between ordinal variables:
 Gamma, Tau-b, Tau-c, Somers' d
 All are based on identifying concordant, discordant, and tied pairs of observations
Concordant Pairs: Ideology and Voting

 Ideology – conservative (1), moderate (2), liberal (3)
 Voting – never (1), sometimes (2), often (3)
 Consider two hypothetical individuals in the sample with scores:
 • Individual A: Ideology = 1, Voting = 1
 • Individual B: Ideology = 2, Voting = 2
 • Pair A & B are considered a concordant pair because B's ideology score is greater than A's score, and B's voting score is greater than A's score
Concordant Pairs (cont'd)

 All of the following are concordant pairs:
 A(1,1) B(2,2)
 A(1,1) B(2,3)
 A(1,1) B(3,2)
 A(1,2) B(2,3)
 A(2,2) B(3,3)
 Concordant pairs are consistent with a positive relationship between the IV and the DV (ideology and voting)
Discordant Pairs

 All of the following are discordant pairs:
 A(1,2) B(2,1)
 A(1,3) B(2,2)
 A(2,2) B(3,1)
 A(1,2) B(3,1)
 A(3,1) B(1,2)
 Discordant pairs are consistent with a negative relationship between the IV and the DV (ideology and voting)
Identifying Concordant Pairs

 Concordant pairs for Never-Conservative (1,1):
 #Concordant = 80×70 + 80×10 + 80×20 + 80×80 = 14,400

               Conservative (1)   Moderate (2)   Liberal (3)
Never (1)      80                 10             10
Sometimes (2)  20                 70             10
Often (3)      0                  20             80
Identifying Concordant Pairs

 Concordant pairs for Never-Moderate (1,2):
 #Concordant = 10×10 + 10×80 = 900
Identifying Discordant Pairs

 Discordant pairs for Often-Conservative (1,3):
 #Discordant = 0×10 + 0×10 + 0×70 + 0×10 = 0
Identifying Discordant Pairs

 Discordant pairs for Often-Moderate (2,3):
 #Discordant = 20×10 + 20×10 = 400
Gamma

 Gamma is calculated by identifying all possible pairs of individuals in the sample and determining if they are concordant or discordant
 Gamma = (#C − #D) / (#C + #D)
Interpreting Gamma

 Gamma = 21400/24400 = .88
 Gamma ranges from −1 to +1
 Gamma does not account for tied pairs
 Tau (b and c) and Somers' d account for tied pairs in different ways
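A sketch of the pair-counting logic in Python for the ideology-by-voting table (rows = voting, columns = ideology), confirming the counts behind gamma = .88:

table = [[80, 10, 10],    # Never
         [20, 70, 10],    # Sometimes
         [0, 20, 80]]     # Often
rows, cols = 3, 3

# Concordant: partner lies below and to the right (both variables higher).
c = sum(table[i][j] * table[k][l]
        for i in range(rows) for j in range(cols)
        for k in range(i + 1, rows) for l in range(j + 1, cols))

# Discordant: partner lies below and to the left (one higher, one lower).
d = sum(table[i][j] * table[k][l]
        for i in range(rows) for j in range(cols)
        for k in range(i + 1, rows) for l in range(j))

print(c, d, (c - d) / (c + d))   # 22900 1500 0.877...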
 Square tables: Tau-b
 Non-square tables: Tau-c
Example

 NES 2004 – What explains variation in one's political ideology?
 Income?
 Education?
 Religion?
 Race?
Bivariate Relationships and Hypothesis Testing (Significance Testing)

 1. Determine the null and alternative hypotheses.
 • Null: There is no relationship between X and Y (X and Y are statistically independent and the test statistic = 0).
 • Alternative: There IS a relationship between X and Y (the test statistic does not equal 0).
Bivariate Relationships and Hypothesis Testing

 2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.
Bivariate Relationships and Hypothesis Testing

 4. Calculate the test statistic from the sample data and determine the probability of observing a test statistic this large (in absolute terms) if the null hypothesis is true.
 P-value (significance level) – the probability of observing a test statistic at least as large as our observed test statistic, if in fact the null hypothesis is true.
Bivariate Relationships and Hypothesis Testing

 5. Choose an "alpha level" – a decision rule to guide us in determining which values of the p-value lead us to reject/not reject the null hypothesis.
 When the p-value is extremely small, we reject the null hypothesis (why?). The relationship is deemed "statistically significant."
 When the p-value is not small, we do not reject the null hypothesis (why?). The relationship is deemed "statistically insignificant."
 Most common alpha level: .05
Bottom Line

 Assuming we will always use an alpha level of .05:
 Reject the null hypothesis if p-value < .05
 Do not reject the null hypothesis if p-value > .05
An Example

 Dependent variable: Vote Choice in 2000 (Gore, Bush, Nader)
 Independent variable: Ideology (liberal, moderate, conservative)
An Example

 1. Determine the null and alternative hypotheses.
An Example

 Null Hypothesis: There is no relationship between ideology and vote choice in 2000.
 Alternative (Research) Hypothesis: There is a relationship between ideology and vote choice (liberals were more likely to vote for Gore, while conservatives were more likely to vote for Bush).
An Example

 2. Determine the appropriate test statistic (based on the measurement levels of X and Y).
 3. Identify the type of sampling distribution for the test statistic, and what it would look like if the null hypothesis were true.
Sampling Distributions for the Chi-Squared Statistic
(under the assumption of perfect independence)

df = (rows − 1)(columns − 1)
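For the earlier region-and-partisanship example (χ² = 18, df = (3 − 1)(4 − 1) = 6), the p-value can be read off the chi-squared distribution; a quick check, assuming scipy is available:

from scipy.stats import chi2

print(chi2.sf(18, df=6))   # roughly .006, well below the usual alpha of .05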
In-Class Exercise

 For some years now, political commentators have cited the importance of a "gender gap" in explaining election outcomes. What is the source of the gender gap?
 Develop a simple theory and corresponding hypothesis (where gender is the independent variable) which seeks to explain the source of the gender gap.
 Specifically, determine:
 Theory
 Null and research hypothesis
 Test statistic for a cross-tabulation to test your hypothesis