Cross Tabulation: Part 2
Slide 1
I’ll finish up this lecture with one more extended example about contingency tables and chi-square (χ2) statistics and then talk about introducing a third variable and how it may reveal that
a relationship we believe exists between two variables doesn’t exist.
Slide 2
Example #4 concerns Eli Lilly and its decision about whether or not to enter the generic drug
market. I’ll leave you to read the description, background and drug study, but we’ll review the
questions. Using the χ2 contingency table, decide whether income level, education level, and
geographical area are independent of preference for generic drugs. If they’re not independent,
what relationship exists? Are the results of the analysis consistent with the hypothesis
suggested by the marketing department? If not, what hypothesis might explain the results?
Slide 3
In this first table, there’s an effort to relate total annual household income to willingness to
purchase generic drugs. As incomes progress from less than $5,999 to more than $30,000, a
quick eyeball analysis suggests it’s possible there might be a weak relationship between
household income and willingness to purchase generic drugs. For the lowest income group,
there’s an almost 50/50 yes/no split, but for the more than $30,000 income group, the yes/no
ratio is 5-to-1. There may be some relationship, but it’s difficult to tell by just eyeballing the table.
That’s why we need to run the χ2 test.
If the test indicates a relationship—meaning these two variables are not independent of one
another—then we’ll examine the table more fully to determine which income levels are driving
the relationship. Perhaps the relationship is curvilinear rather than linear. Perhaps lower-income
people don’t purchase generic drugs because they don’t know such drugs are chemically
equivalent to name-brand drugs. Perhaps higher-income people don’t care to buy generic drugs
because they have enough money to pay for brand-name drugs. Until we run a χ2 test, we don’t
know if it’s necessary to examine this table more fully for such possibilities.
Slide 4
There were 1502 respondents to this survey and the two questions being related in this crosstab
table are ‘If available, would you purchase generic substitutes? Yes/no/don’t know’ and ‘level of
education’. Ignoring the ‘don’t know’ row, an eyeball analysis suggests there’s a relationship
between ‘willingness to use or purchase a generic drug’ and ‘level of education’. Comparing the
yes/no ratios across columns, the 1½-to-1 ratio for ‘less than four years in high school’ grows to
an 8-to-1 ratio for ‘graduated from college or beyond’. This massive increase suggests the
variables are related.
Slide 5
The previous crosstab table related ‘willingness to use generic drugs’ to ‘level of education’.
‘Level of education’ is an ordinal variable. This slide shows a crosstab table and χ2 test based
on strictly nominal data: ‘willingness to use generic drugs’ versus ‘geographical area of
residence’. It’s difficult to discern a pattern, as it seems many of the yes/no ratios are 3-to-1 within
each column. Thus, an eyeball analysis suggests a relationship between ‘geographical area of
residence’ and ‘willingness to use generic drugs’ is unlikely.
Slide 6
Here are the results of the χ2 test for income versus ‘willingness to use generic drugs’,
education versus ‘willingness to use generic drugs’, and ‘geographic area of residence’ versus
‘willingness to use generic drugs’.
The first table shows income versus ‘willingness to use generic drugs’. Remember degrees of
freedom? The ‘willingness to use’ question had three possible answers and incomes were
classified into one of six categories. If degrees of freedom is ‘rows – 1’ x ‘columns – 1’, then the
degrees of freedom for this table is (6-1) x (3-1), or ‘10’. Assume we want to be 95% confident
before concluding that these two variables are not independent. With 10 degrees of freedom at the
95% level of confidence, the χ2 level that we need to exceed is 18.31. In fact, the calculated χ2 for this table is
77.88. As the χ2 value exceeds the critical value, we should reject the null hypothesis that
income and ‘willingness to use generic drugs’ are independent. After examining the crosstab
table further, we’d conclude that the higher a person’s income, the more likely they would be
willing to use generic drugs.
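Here’s a minimal sketch of the decision rule just described, using only the numbers quoted for the income table. The function names `degrees_of_freedom` and `reject_independence` are illustrative assumptions, not part of the slides, and the critical value is simply the 18.31 quoted above rather than something the code derives.

```python
def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom for a chi-square test on an r x c crosstab."""
    return (n_rows - 1) * (n_cols - 1)


def reject_independence(calculated: float, critical: float) -> bool:
    """Reject the null hypothesis of independence when the calculated
    chi-square exceeds the critical value."""
    return calculated > critical


# Income table: 6 income categories by 3 answers (yes / no / don't know).
df_income = degrees_of_freedom(6, 3)

# Both values below are the ones quoted in the lecture.
CRITICAL_95_DF10 = 18.31
calculated_income = 77.88

decision = reject_independence(calculated_income, CRITICAL_95_DF10)
print(f"df = {df_income}, reject independence: {decision}")
```

The same two helpers apply to any crosstab once its row and column counts and the matching critical value are known.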
The test for education shows, after determining the degrees of freedom and confidence
level, that the critical χ2 value is 12.59 and the calculated χ2 value is 77.63;
thus, we can reject the null hypothesis that education and generic drug usage are independent
and conclude that the higher a person’s education level, the more likely they are to use generic
drugs.
Finally, a relationship between geographic area and willingness to use generic drugs is not
supported by the χ2 test: the calculated value is only 7.13, but the critical χ2 value is 21.03. As
a result, we fail to reject the null hypothesis that geographic area and willingness to use generic
drugs are independent. With all those tests, the answer to question #2 is ‘the results don’t
support the marketing department’s hypothesis’. In fact, the results are just the opposite, which
might be explained by people with higher education and income levels knowing that generic
drugs are a money-saving alternative that is equivalent to brand-name drugs; hence, their
increased willingness to use generic drugs.
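As a rough cross-check on the three tests, the sketch below turns each calculated χ2 into a p-value. It relies on a closed form for the upper tail of the chi-square distribution that holds only when the degrees of freedom are even. The helper name `chi2_sf_even_df` is an assumption for illustration. The degrees of freedom for education (6) and geographic area (12) are also assumptions: they aren’t stated in the lecture and are inferred here from the quoted critical values of 12.59 and 21.03.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form valid only for even df = 2k:
        exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = (half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * sum(terms)


# (calculated chi-square, df). The calculated values are quoted in the
# lecture; the df for education (6) and area (12) are ASSUMED, inferred
# from the quoted critical values 12.59 and 21.03.
tests = {
    "income": (77.88, 10),
    "education": (77.63, 6),
    "geographic area": (7.13, 12),
}

ALPHA = 0.05  # corresponds to the 95% level of confidence
p_values = {name: chi2_sf_even_df(x, df) for name, (x, df) in tests.items()}
for name, p in p_values.items():
    verdict = "reject" if p < ALPHA else "fail to reject"
    print(f"{name}: p = {p:.3g} -> {verdict} independence")
```

Comparing a p-value with 0.05 is the same decision as comparing the calculated χ2 with the critical value; it just expresses the result on a probability scale.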
Slide 7
Although the first example introduced you to the idea that looking at three variables
simultaneously would be useful, I’d like to expand this discussion to moderator variables, which
are third variables that alter or have a contingent effect on the relationship between an
independent and dependent variable. A relationship between two variables may appear, disappear,
or change once a third variable is taken into account. In particular, the third variable may indicate
that there’s a spurious relationship between the dependent and independent variables.
Slide 8
This slide provides a basic flow diagram for the decision rules for deciding when and how to
introduce a third variable into a crosstab analysis.
Slide 9
Here’s an example of a spurious association revealed by introducing a third variable. Assume a
marketing manager for a transit company is interested in ascertaining the ability of advertising to
stimulate intentions to use mass transit. The manager assumes that ads not seen can’t
stimulate intentions to use; thus, there should be a positive relationship between recalling the
transit ad and intentions to use mass transit. Does potential riders’ ability to recall seeing a mass
transit ad increase their intentions to use mass transit?
To answer this question, 360 people were surveyed. This crosstab table relates their intentions
to use mass transit to their ad recall. The table and χ2 test strongly suggest that there’s a
positive relationship. Of the 150 people who recall seeing the ad, 63% indicated that they intend
to use mass transit in the next month, and of the 210 people who didn’t recall the ad, only 36%
indicated that they intend to use mass transit in the next month. The percentages for the first
column are the reverse of the second column. The χ2 value is 25; with 1 degree of freedom, that
value is significant at the 0.001 level. At first blush, the marketing manager would be overjoyed
that the ad is stimulating demand.
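To see why a χ2 of 25 with 1 degree of freedom clears the 0.001 level, here’s a small sketch that converts it to a p-value. With 1 degree of freedom, a chi-square variable is the square of a standard normal variable, so the upper tail can be computed with the complementary error function. The helper name `chi2_sf_df1` is an assumption for illustration.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    Equals P(|Z| > sqrt(x)) for a standard normal Z, which is
    erfc(sqrt(x) / sqrt(2)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Chi-square value quoted in the lecture for the ad recall crosstab.
p_value = chi2_sf_df1(25.0)
print(f"p = {p_value:.2e}, significant at 0.001: {p_value < 0.001}")
```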
Slide 10
However, this second slide suggests a different interpretation. Now the original crosstab table
has been divided into two sub-tables: current users of mass transit and current non-users of
mass transit. Of the 360 people sampled, 160 are current users and 200 are non-users. The
numbers in parentheses are the counts. The cell ‘recall the ad/intend to use’ for current users contains
85 people, and the cell ‘recall the ad/intend to use’ for non-users is 10 people, which sums to 95
people. The sub-tables are equivalent in the sense that the ‘same cell numbers’ sum to the
numbers in the original table.
By dividing the sample further and considering a third variable, we get a different picture. The
table in the center, for users, shows roughly 80% intend to use mass transit in the next 30 days
whether or not they recall seeing the ad. For non-users, roughly 20% intend to use mass transit
whether or not they recall seeing the ad. The χ2 value for each of these sub-tables is not significant,
which suggests that the advertising is not generating intentions to use among either current
users or non-users. Instead, current use is related to ad recall; people who use mass transit are
more likely to recall seeing an ad for mass transit. Current users are also more likely to use
mass transit in the future, so user status is related to advertising recall and user status also is
related to future use intentions. Thus, in this case advertising is unrelated to intended usage.
Instead, current user status relates positively to both variables; it’s the third variable that’s
causing the apparent relationship between the first two variables.
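The mechanism can be illustrated with a short sketch. The counts below are HYPOTHETICAL and are not the slide’s data: they assume exactly 80% of users and 20% of non-users intend to ride regardless of recall, keep the 160/200 user split and the 150/210 recall split, and invent the split of recall by user status. The function name `chi_square` is also an assumption for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a crosstab given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# HYPOTHETICAL counts, not the slide's data.
# Rows: recall ad / don't recall. Columns: intend to use / don't intend.
users = [[80, 20], [48, 12]]       # 160 users, 80% intend in both rows
non_users = [[10, 40], [30, 120]]  # 200 non-users, 20% intend in both rows

# Pool the two sub-tables cell by cell to get the two-variable crosstab.
pooled = [
    [u + n for u, n in zip(u_row, n_row)]
    for u_row, n_row in zip(users, non_users)
]

CRITICAL_001_DF1 = 10.83  # chi-square critical value, 1 df, 0.001 level

for label, table in [("users", users), ("non-users", non_users), ("pooled", pooled)]:
    stat = chi_square(table)
    print(f"{label}: chi-square = {stat:.2f}, "
          f"significant at 0.001: {stat > CRITICAL_001_DF1}")
```

Within each sub-table the intention rate is identical across the recall rows, so the statistic is zero there. The pooled table still shows a strong association, because users are over-represented among people who recall the ad.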
The bottom line: By assessing only two variables at a time—‘intentions to use’ and ‘ad recall’—
the marketing manager could erroneously conclude that the advertising is stimulating intentions
to use and might be encouraged to spend lots more on advertising. Considering current user
status, we discover that advertising is not generating additional intentions to use; hence, the
money spent on advertising is wasted. The marketing manager and the public would be better
off if fares were lowered, which would encourage more people to ride mass transit.
Slide 11
Here’s an example that relates ‘intentions to use’ to ‘the degree to which someone likes or
dislikes a product’. If we like something, then we’re more likely to use it, and if we dislike
something, then we’re less likely to use it. However, when we look at current users and current
non-users, we see there’s no relationship. Only current users can determine if they like or dislike
something. Users tend to like what they are using and current users also are more likely to use
in the future, which increases intention scores. Everything we see in sub-table A is being driven
by whether one is currently a user or non-user.