Cross Tabulation: Part 2

Slide 1
I’ll finish up this lecture with one more extended example about contingency tables and chi-square (χ²) statistics, and then talk about introducing a third variable and how it may reveal that a relationship we believe exists between two variables doesn’t exist.

Slide 2
Example #4 concerns Eli Lilly and its decision about whether or not to enter the generic drug market. I’ll leave you to read the description, background, and drug study, but we’ll review the questions. Using the χ² contingency table, decide whether income level, education level, and geographical area are independent of preference for generic drugs. If they’re not independent, what relationship exists? Are the results of the analysis consistent with the hypothesis suggested by the marketing department? If not, what hypothesis might explain the results?

Slide 3
This first table relates total annual household income to willingness to purchase generic drugs. As incomes progress from less than $5,999 to more than $30,000, a quick eyeball analysis suggests there might be a weak relationship between household income and willingness to purchase generic drugs. For the lowest income group, there’s an almost 50/50 yes/no split, but for the more than $30,000 income group, the yes/no ratio is 5-to-1. There may be some relationship, but it’s difficult to tell by just eyeballing the table. That’s why we need to run the χ² test. If the test indicates a relationship, meaning these two variables are not independent of one another, then we’ll examine the table more fully to determine which income levels are driving the relationship. Perhaps the relationship is curvilinear rather than linear. Perhaps lower-income people don’t purchase generic drugs because they don’t know such drugs are chemically equivalent to name-brand drugs. Perhaps higher-income people don’t care to buy generic drugs because they have enough money to pay for brand-name drugs.
Until we run a χ² test, we don’t know whether it’s necessary to examine this table more fully for such possibilities.

Slide 4
There were 1,502 respondents to this survey, and the two questions being related in this crosstab table are ‘If available, would you purchase generic substitutes? Yes/no/don’t know’ and ‘level of education’. Ignoring the ‘don’t know’ row, an eyeball analysis suggests there’s a relationship between ‘willingness to use or purchase a generic drug’ and ‘level of education’. Comparing the yes/no ratios across columns, the 1½-to-1 ratio for ‘less than four years in high school’ grows to an 8-to-1 ratio for ‘graduated from college or beyond’. This massive increase suggests the variables are related.

Slide 5
The previous crosstab table related ‘willingness to use generic drugs’ to ‘level of education’. ‘Level of education’ is an ordinal variable. This slide shows a crosstab table and χ² test based on strictly nominal data: ‘willingness to use generic drugs’ versus ‘geographical area of residence’. It’s difficult to discern a pattern, as many of the yes/no ratios seem to be about 3-to-1 within each column. Thus, an eyeball analysis suggests a relationship between ‘geographical area of residence’ and ‘willingness to use generic drugs’ is unlikely.

Slide 6
Here are the results of the χ² tests for income versus ‘willingness to use generic drugs’, education versus ‘willingness to use generic drugs’, and ‘geographic area of residence’ versus ‘willingness to use generic drugs’. The first table shows income versus ‘willingness to use generic drugs’. Remember degrees of freedom? The ‘willingness to use’ question had three possible answers, and incomes were classified into one of six categories. Degrees of freedom is (rows – 1) × (columns – 1), so for this table it’s (6 – 1) × (3 – 1), or 10. Assume we want to test whether these two things are independent at the 95% level of confidence.
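As a quick check on that arithmetic, here is a minimal Python sketch of the degrees-of-freedom rule. The function name `degrees_of_freedom` is just an illustrative choice and doesn’t come from the case materials.

```python
def degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom for a crosstab table: (rows - 1) x (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Income table: 3 possible answers (yes / no / don't know) by 6 income categories.
print(degrees_of_freedom(3, 6))  # prints 10
```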
With 10 degrees of freedom at the 95% level of confidence, the χ² value that we need to exceed is 18.31. In fact, the calculated χ² for this table is 77.88. As the calculated χ² value exceeds the critical value, we should reject the null hypothesis that income and ‘willingness to use generic drugs’ are independent. After examining the crosstab table further, we’d conclude that the higher a person’s income, the more likely they are to be willing to use generic drugs. For education, after determining the degrees of freedom and confidence level, the critical χ² value is 12.59 and the calculated χ² value is 77.63; thus, we can reject the null hypothesis that education and generic drug usage are independent and conclude that the higher a person’s education level, the more likely they are to use generic drugs. Finally, a relationship between geographic area and willingness to use generic drugs is not supported by the χ² test: the calculated value is only 7.13, but the critical χ² value is 21.03. As a result, we can’t reject the null hypothesis that geographic area and willingness to use generic drugs are independent. With all those tests, the answer to question #2 is ‘the results don’t support the marketing department’s hypothesis’. In fact, the results are just the opposite, which might be explained by people with higher education and income levels knowing that generic drugs are a money-saving equivalent to brand-name drugs; hence, their increased willingness to use generic drugs.

Slide 7
Although the first example introduced you to the idea that looking at three variables simultaneously would be useful, I’d like to expand this discussion to moderator variables, which are third variables that alter or have a contingent effect on the relationship between an independent and a dependent variable. An apparent relationship between two variables, or the apparent lack of one, may be affected by a third variable.
In particular, the third variable may indicate that there’s a spurious relationship between the dependent and independent variables.

Slide 8
This slide provides a basic flow diagram of the decision rules for when and how to introduce a third variable into a crosstab analysis.

Slide 9
Here’s an example of a spurious association revealed by introducing a third variable. Assume a marketing manager for a transit company is interested in ascertaining the ability of advertising to stimulate intentions to use mass transit. The manager assumes that ads not seen can’t stimulate intentions to use; thus, there should be a positive relationship between recalling the transit ad and intentions to use mass transit. Does potential riders’ ability to recall seeing a mass transit ad increase their intentions to use mass transit? To answer this question, 360 people were surveyed. This crosstab table relates their intentions to use mass transit to their ad recall. The table and χ² test strongly suggest that there’s a positive relationship. Of the 150 people who recalled seeing the ad, 63% indicated that they intend to use mass transit in the next month, and of the 210 people who didn’t recall the ad, only 36% indicated that they intend to use mass transit in the next month. The percentages in the first column are roughly the reverse of those in the second column. The χ² value is 25; with 1 degree of freedom, that value is significant at the 0.001 level. At first blush, the marketing manager would be overjoyed that the ad is stimulating demand.

Slide 10
However, this second slide suggests a different interpretation. Now the original crosstab table has been divided into two sub-tables: current users of mass transit and current non-users of mass transit. Of the 360 people sampled, 160 are current users and 200 are non-users. The numbers in parentheses are the counts.
The cell ‘recall the ad/intend to use’ for current users contains 85 people, and the cell ‘recall the ad/intend to use’ for non-users contains 10 people, which sums to 95 people, or the 63% of the 150 people who recalled the ad. The sub-tables are equivalent to the original table in the sense that the same cells sum to the numbers in the original table. By dividing the sample further and considering a third variable, we get a different picture. The table in the center, for users, shows roughly 80% intend to use mass transit in the next 30 days whether or not they recall seeing the ad. For non-users, roughly 20% intend to use mass transit whether or not they recall seeing the ad. The χ² value for each of these sub-tables is not significant, which suggests that the advertising is not generating intentions to use among either current users or non-users. Instead, current use is related to ad recall; people who use mass transit are more likely to recall seeing an ad for mass transit. Current users are also more likely to use mass transit in the future, so user status is related to advertising recall and also to future use intentions. Thus, in this case advertising is unrelated to intended usage. Instead, current user status relates positively to both variables; it’s the third variable that’s causing the apparent relationship between the first two variables. The bottom line: By assessing only two variables at a time (‘intentions to use’ and ‘ad recall’), the marketing manager could erroneously conclude that the advertising is stimulating intentions to use and might be encouraged to spend lots more on advertising. Considering current user status, we discover that advertising is not generating additional intentions to use; hence, the money spent on advertising is wasted. The marketing manager and the public would be better off if fares were lowered, which would encourage more people to ride mass transit.

Slide 11
Here’s an example that relates ‘intentions to use’ to ‘the degree to which someone likes or dislikes a product’.
If we like something, then we’re more likely to use it, and if we dislike something, then we’re less likely to use it. However, when we look at current users and current non-users separately, we see there’s no relationship. Only current users can determine if they like or dislike something. Users tend to like what they are using, and current users also are more likely to use in the future, which increases intention scores. Everything we see in sub-table A is being driven by whether one is currently a user or non-user.
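To make the spurious-association logic concrete, here is a rough Python sketch that reruns the transit example from Slides 9 and 10: one χ² test on the pooled table, then one on each sub-table. Only the 85 and 10 ‘recall the ad/intend to use’ counts and the group sizes (150 recall, 210 no recall, 160 users, 200 non-users) come from the lecture. Every other cell count is an assumption, chosen so the percentages land near the quoted 63%, 36%, 80%, and 20%. The helper name `chi_square` is also just illustrative.

```python
def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Rows: intend to use / don't intend. Columns: recall the ad / don't recall.
# ASSUMPTION: apart from 85 and 10, these cell counts are hypothetical,
# picked only to match the group sizes and rough percentages in the lecture.
users = [[85, 43], [21, 11]]        # 160 current users
non_users = [[10, 33], [34, 123]]   # 200 current non-users

# Adding the same cells of the two sub-tables rebuilds the original table.
pooled = [[a + b for a, b in zip(u_row, n_row)]
          for u_row, n_row in zip(users, non_users)]

CRITICAL_05_DF1 = 3.84  # chi-square critical value for 1 df at the 0.05 level

for name, table in [("pooled", pooled), ("users", users), ("non-users", non_users)]:
    stat, df = chi_square(table)
    verdict = "related" if stat > CRITICAL_05_DF1 else "no evidence of a relationship"
    print(f"{name}: chi-square = {stat:.2f}, df = {df} -> {verdict}")
```

With these assumed counts, the pooled table clears even the 0.001 critical value (10.83 for 1 degree of freedom), while neither sub-table comes anywhere near 3.84. That’s the same pattern the lecture describes: the relationship shows up only when users and non-users are lumped together.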