SPSS: Expected frequencies, chi-squared test. In-depth example: Age groups and radio choices. Dealing with small frequencies. Quick Example: Handedness and Careers Last time we tested whether one nominal variable was independent of another. We did this by looking at the cross tabs and seeing how far the observed frequencies were from the frequencies we would expect if the two variables were independent. For nominal variables that only had 2 possible responses each (yes/no, male/female, insane/sane), we could use the odds ratio. When one or both of the variables has more than 2 responses odds ratio is no longer useful, so we use the chi-squared test instead. The tradeoff: Odds ratio can be used for one-tailed tests, chisquared can’t. Chi-squared can handle any number of rows and columns. Get chi-squared is also heavy in math, so in the real world, SPSS and other software can handle most of it for us. Most important things to know: - How to get the expected frequency from a particular cell. - Chi-squared is a measure of how far the observed frequencies are from the expected frequencies. - Large chi-squared values mean large deviations from the expected frequencies. - The df for chi-squared is (rows – 1) x (columns – 1) SPSS: Expected frequencies Start with a crosstab. Analyze Descriptive Stats Crosstabs In the pop-up, choose your row and column variables and click the cells button in the upper right of the pop-up. The cells button brings up the menu of what you want the cells to show. Uncheck observed and check expected. Then click Continue, then OK. This will produce a crosstab of the expected values. If father figure type and parenting style were independent, there would be 9.4 moderate style stepfathers in our sample on average. Leaving Observed checked and leaving Expected unchecked produces the observed values. In our sample, we found 10 moderate style stepfathers. Very near the 9.4 in the independent ideal. Checking both observed and expected produces a table that has both the observed and expected values in the same table. It allows cell-to-cell comparison but it’s more cluttered. The null hypothesis of independence fits the moderates and stepfathers. But live-in partners appear to be more permissive and less authoritarian than other types of father figure. Note the vague language about the trends in the data. That’s because we can’t say whether these trends are significant or not. We don’t have the tools to say anything definitive about specific categories. Bats: Observing frequencies you wouldn’t expect. SPSS: Full crosstab analysis. Consider the following data on a sample of people’s ages and radio preference. We want to know if a person’s radio preference depends on what generation they belong to. We have the data from 72 people in total in three nominal categories of radio choice and three ordinal categories of age. Should we do an odds ratio or a chi-squared? Should we do an odds ratio or a chi-squared? Chi-squared. Because we have 3x3 table. Odds Ratio only works for 2x2 tables. SPSS: Chi-Squared is also in the crosstabs section. Analyze Descriptive Statistics Crosstabs. Click on the Statistics button. Put a check next to Chi-Square in the upper left. It doesn’t matter if Risk is checked or unchecked. Then click Continue, then OK. Checking Chi-Squared produces the following table. We want the Pearson Chi-Square. (yeah, Pearson is a big deal) 2 χ = 10.268 df = 4. We could have got this from (rows – 1) x (cols. – 1) = 2 x 2 = 4. We also know that the p-value = .036. So if we were testing for independence at alpha =0.05, we would reject the null hypothesis of independence. For interest: Asymp. Sig. stands for Asymptotic Significance. Asymptotic in statistics means “As n infinity.” The Chi-Square test also tells us of potential problems. The test assumes there is a large number of respondents in each cell. The standard rule is that every cell should have a frequency of at least 5. Having small cells (cells with less than 5 respondents) makes the p-value of the chi-squared test inaccurate. The more small cells there are, the worse the problem. There are ways to deal with cells with small n. The easiest one is to find a logical way to group categories together. Here, there are substantially fewer older adults than any other group. We could merge the middle age and older adult categories into a “not young” category. Then we would have 2x3 cross tab with larger n values. For a table of this size, it’s simple enough to do by hand. Music News Sports Young 14 4 7 Middle Age 10 15 9 Older Adult 2 8 3 The frequencies in the new categories are the frequencies in both the old categories added together. Music News Sports Young 14 4 7 Not Young 10 + 2 = 12 15 + 8 = 23 9 + 3 = 12 Music News Sports Young 14 4 7 Not Young 12 23 12 We still have one cell below 5, but that’s better than having three cells below 5. This won’t distort our answer by much. But if we do this by hand, then we can’t analyze the new dataset with SPSS. We need some way to make new variables from old ones. This slide for interest: For 2x2 crosstabs, there is no way to merge to improve the frequencies in cells, but we can use a modification to the chi-squared test called the Yates’ Adjustment. The textbook talks about dealing with cells with few respondents in pages 326-331. Also, it’s technically the small expected frequencies that cause trouble, but the best indicator of these is small observed counts. We need some way to transform old variables into new ones. SPSS: Recoding variables. Goal: To take the three category variable Young/Middle/Old And make a two category variable Young/Not Young Transform Recode into Different Variables. Select the variable you want to change. In our case it’s age. Give the new variable a name in Output Variable: Name, Then click on Change. Then, click on Old and New Values. This brings up the menu to define the old categories you have the new categories you want. In the new popup, check Output variables are strings first Then enter the old category name in Old Value: Value And enter the new category name in New Value: Value Click Add and repeat the last slide for each category. “1Young Young”, “2MiddleAge NotYoung”, and “3OlderAdult NotYoung” are the recoding we’re doing. Now we can a crosstab in SPSS with the variable with the merged category variable. ( Analyze Descriptive Statistics Crosstabs ) We can look at the expected frequencies. (Crosstabs menu, Statistics button, Check “Expected”) Even though one cell has observed frequency less than 5, its expected frequency is more than 5, so the potential problem is lessened. We can also do the chi-squared test again and see if there’s a problem or a change in the p-value. 0/6 cells are too small instead of 3/9. We went from 4 df to 2 because we now have a 2x3 crosstab. (2 – 1) x (3 -1 ) = 2. Also, the most important part, the p-value, hasn’t changed dramatically. (In the 3x3 table it was .036) This implies that merging middle age and older didn’t change anything major. We reject the null ; radio choice depends on age. It’s easier to detect differences in larger groups, so we would expect the p-value to go down a little, but not something dramatic like .001 or .000. If the p-value had increased much we would have lost the ability to reject the null. (A bad merge can do this). Pacing parrot asks: Do we time for another? We took a survey of people in four career fields and found if they were left or right handed. These are the observed counts. Most of the respondents are right handed except for in the athletics field, where a few more than half are left handed. We want to know if this difference is a fluke or if career and handedness are somehow dependent. We have a 2x4 crosstab, so we should use a chi-squared test. These are the results: Degrees of freedom = 2 χ = There is evidence against independence. We have a 2x4 crosstab, so we should use a chi-squared test. These are the results: Degrees of freedom = 3 2 χ = 50.434 There is very significant evidence against independence. The chi-squared test has a very small p-value (less than .001). Do the results of this test tell us that there are more left handed people in athletics in general? The chi-squared test has a very small p-value (less than .001). Do the results of this test tell us that there are more left handed people in athletics in general? No. Chi-squared only checks whether two variables are independent, not specific trends within them. By comparing the expected and observed counts, we can see that the athletic field is much different from the others. We can use this information to guide a next step even if we’re not getting definite answers from just the expected counts. We could try merging the other three fields into “non-athletic” and “athletic”, as long as those three fields together fairly represented everything non-athletic. In that case, the odds ratio shows that someone in the athletic field has 7.371 times the odds of being left handed as someone in a non-athletic profession. The confidence interval shows that this odds ratio is significantly more than 1 at the alpha = 0.025 level. Next time: More on cross tabs. If time permits: Intro to Analysis of Variance. (Ch. 8)