Sociology 541
Review of Bivariate Regression
Chi-Square Test

Cautionary Notes Regarding Linear Regression

1. Differentiate between association and causation. Consider temporal ordering and the possibility of endogeneity or simultaneity (causal paths running in both directions). Both x and y could be caused by some other variable, producing a spurious correlation. Multivariate techniques control better for possible spurious relationships by including the confounding factor, but the control is still incomplete.

2. Ecological correlations. When the units of analysis are not individuals but aggregate units, correlations tend to be much higher.

3. Non-linearity. The correlation coefficient r measures scatter around a straight line; it is not appropriate for a curvilinear relationship.

4. Extrapolation. Be careful with extrapolation beyond the range of the data; the relationship could be curvilinear there.

Analyzing Association between Categorical Variables

Contingency Tables (Review Kurtz, pp. 32-34)

When the variables analyzed have only a few categories, as in most nominal- and ordinal-scale measurement, bivariate data are presented in tables. These tables go by a few names: cross-tabulations, cross-classifications, or contingency tables.

Consider the following question: How do attitudes about current divorce laws depend on gender? We have two categorical variables:

Independent or explanatory variable is SEX: 1=male; 2=female
Outcome of interest is DIVLAW (VAR 216) (Does the respondent think current divorce laws should be changed and, if so, how?): 1=easier; 2=more difficult; 3=stay the same

In this case, there are 3 possible outcomes for each of the two categories of the independent variable (male and female), giving a total of six possible combinations.

SPSS Commands: Analyze – Descriptives – Crosstabs – Select row (dependent) and column (independent) variables – Okay

DIVLAW2 * RESPONDENTS SEX Crosstabulation (Count)

                     RESPONDENTS SEX
DIVLAW2              MALE     FEMALE    Total
1.00                  188       240       428
2.00                  408       559       967
3.00                  177       174       351
Total                 773       973      1746

It is customary to make the dependent variable the row variable and to treat the independent variable as the column variable, but this convention is somewhat arbitrary and often broken.

Marginal frequencies (marginals): the numbers along the right margin and bottom margin of the table. They essentially present the univariate frequency distributions. The number in the lower right-hand corner is N, the total sample size excluding missing cases; N equals the sum of either the row or the column marginals.

The cells where the categories of the two variables intersect contain the bivariate frequency distribution. Each intersection is called a cell, and the number in it is the cell frequency: the number of cases with each possible combination of the two characteristics.

We can conduct some percentage comparisons (to take into account the different sample sizes of men and women): to compare males and females, we examine the percentages of men and of women who fall into each of the categories of opinion about divorce laws.
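For readers who want to reproduce this kind of column-percentage comparison outside SPSS, the short Python/pandas sketch below does the same arithmetic on the counts from the table above. It is an illustrative aside, not part of the course's SPSS workflow; the data-frame setup and labels are our own.

import pandas as pd

# Observed counts from the DIVLAW2 * RESPONDENTS SEX table above
# (rows are the outcome categories; columns are the categories of SEX)
counts = pd.DataFrame(
    {"MALE": [188, 408, 177], "FEMALE": [240, 559, 174]},
    index=["1 easier", "2 more difficult", "3 stay the same"],
)

# Column percentages: divide each cell by its column total so that men and
# women can be compared despite their different sample sizes
column_pct = counts.div(counts.sum(axis=0), axis=1) * 100
print(column_pct.round(1))

The resulting percentages should match the "% within RESPONDENTS SEX" rows of the SPSS output shown next.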
SPSS Commands: Analyze – Descriptives – Crosstabs – Select row and column variables – Cells (choose column percentages) – Okay

DIVLAW2 * RESPONDENTS SEX Crosstabulation

                                           RESPONDENTS SEX
DIVLAW2                                    MALE      FEMALE     Total
1.00    Count                               188        240        428
        % within RESPONDENTS SEX          24.3%      24.7%      24.5%
2.00    Count                               408        559        967
        % within RESPONDENTS SEX          52.8%      57.5%      55.4%
3.00    Count                               177        174        351
        % within RESPONDENTS SEX          22.9%      17.9%      20.1%
Total   Count                               773        973       1746
        % within RESPONDENTS SEX         100.0%     100.0%     100.0%

These percentages are called conditional distributions on the response (outcome) variable (divorce laws). They are the sample distributions of opinion on divorce laws, conditional on gender.

With regard to table construction, keep in mind:

1. The table heading or title should succinctly describe what is contained in the table.
2. Clearly indicate all attributes of the table.
3. When percentages are reported, indicate the base on which they were computed. Remember that percentages are affected by small sample sizes.
4. Exclude missing data when calculating percentages (it is often useful to include an additional row indicating the number of cases with missing data).
5. Most cross-tabulations in social research are limited to variables with relatively few categories.
6. When describing a table, be selective and discuss only those comparisons that best describe or highlight the relationship.

Chi-Square Test for Independence (Kurtz, pp. 215-222)

Two categorical variables are statistically independent if the population conditional probabilities on the response variable are identical. The variables are statistically dependent if the conditional distributions are not identical.

Are the observed sample differences between groups in their conditional distributions due to sampling variation? In other words, if the variables were truly independent, would sample differences of this size be likely? Or are the observed differences in percentages so great that statistical independence in the population is implausible? What are the appropriate hypotheses?

A chi-square test compares the observed frequencies in the cells of a contingency table with the values expected under the null hypothesis of independence.

Fo: observed frequency in a cell of the table
Fe: expected frequency – the count expected in a cell if the variables were independent

Calculating the Expected Frequency

The expected frequency Fe for a cell equals the product of the row and column totals (marginals) for that cell, divided by the total sample size.

Calculate the expected frequencies for the cross-tabulation of opinions of divorce laws and gender:

Chi-Square Test Statistic

This test statistic for independence summarizes how close the expected frequencies fall to the observed frequencies. It is symbolized by χ².

χ² = Σ (Fo – Fe)² / Fe, where the sum is taken over all cells of the table.

When the null hypothesis is true, the observed and expected frequencies tend to be close in each cell and the chi-square statistic is relatively small. If the null hypothesis is false, at least some of the observed and expected frequencies are not very close, leading to a large test statistic.

Calculate the chi-square test statistic for this example:

Degrees of Freedom for the Chi-Square Test Statistic

DF = (r-1)(c-1), where r = number of rows and c = number of columns.

The degrees of freedom have the following interpretation: given the row and column marginal frequencies, the observed frequencies in (r-1)(c-1) cells of the contingency table determine the remaining cell frequencies.
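As a cross-check on the hand calculations requested above, here is a short Python sketch (an illustrative aside, not part of the course's SPSS workflow) that computes the expected frequencies, the chi-square statistic, its degrees of freedom, and the p-value for the divorce-law table. It assumes SciPy is available; scipy.stats.chi2_contingency performs the same test.

import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies (Fo) from the DIVLAW2 * RESPONDENTS SEX table
observed = np.array([[188, 240],   # easier
                     [408, 559],   # more difficult
                     [177, 174]])  # stay the same

# Expected frequencies (Fe): (row total * column total) / N for each cell
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-square statistic: sum over all cells of (Fo - Fe)^2 / Fe
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# The same test via SciPy; dof should equal (r - 1)(c - 1) = 2 for this table
chi2, p, dof, expected_from_scipy = chi2_contingency(observed)
print(round(chi2_by_hand, 2), round(chi2, 2), dof, round(p, 4))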
Determine the degrees of freedom for this chi-square test statistic:

The chi-square table in Appendix B (Table B.4) of Kurtz describes a sampling distribution. It describes the distribution of a statistic (chi-square) by reporting the probabilities associated with every possible sample outcome (every possible chi-square value). For the two variables we are analyzing, imagine every possible bivariate table with a given N, each with an associated chi-square value.

Obtain the p-value:

Properties of the Chi-Square Distribution

1. It is always positive (since it sums squared deviations). The minimum possible value is 0, which occurs when the variables are completely independent in the sample.
2. It is skewed to the right.
3. Its precise shape depends on the degrees of freedom.
4. A larger value of the test statistic provides stronger evidence against the null hypothesis.
5. The test is non-parametric ("distribution-free"). It requires no assumptions about the shape of the sampling distribution (unlike a t-test, which assumes that the sampling distribution is normal in shape).

Sample Size Requirements

The chi-square distribution is the sampling distribution of the chi-square test statistic only if the sample size is large. A rough guideline for this requirement is that the expected frequency Fe should exceed 5 in each cell. Remember that all tests of hypothesis are sensitive to sample size, but the chi-square test is particularly so: the probability of rejecting the null hypothesis increases as the number of cases increases.

SPSS Commands: Same commands as above for cross-tabulations (but also select Statistics, then chi-square)

GROUP EXERCISE:

Sample data from the 1998 GSS:

Independent or explanatory variable is SEX: 1=male; 2=female
Outcome of interest is EVSTRAY (whether the respondent ever had sex with someone other than his or her spouse while married): 1=yes; 0=no (never-married people and those who responded 'DK', 'NAP', or 'NA' were recoded as missing data)

No. of males who responded yes: 169
No. of females who responded yes: 155
Total no. of males: 750
Total no. of females: 1055

a. Present this information in a contingency table and summarize the findings.
b. Conduct a chi-square test to determine whether sex and ever having strayed are statistically independent. State the hypotheses, calculate the chi-square statistic, find the associated p-value, and interpret.
c. Now conduct an independent-samples t-test to determine whether there is a significant difference by sex in the response to the question about ever having strayed.
d. What is the correspondence between a chi-square test and a two-sample t-test? (A computational sketch illustrating this correspondence appears below.)
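The following Python sketch is offered only as an illustration of how parts b through d fit together; the exercise itself should be worked by hand and in SPSS. It builds the 2 x 2 table implied by the counts above, runs the chi-square test, and runs the independent-samples t-test on the 0/1-coded outcome. For a 2 x 2 table the uncorrected chi-square statistic equals the square of the two-sample z statistic comparing proportions, and with samples this large the t statistic is very close to that z, which is the correspondence part d asks about.

import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

# 2 x 2 table from the group-exercise counts: rows = sex, columns = yes / no
males_yes, males_total = 169, 750
females_yes, females_total = 155, 1055
observed = np.array([[males_yes, males_total - males_yes],
                     [females_yes, females_total - females_yes]])

# Chi-square test of independence (correction=False matches the hand formula)
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# Independent-samples t-test on the outcome coded 1 = yes, 0 = no
male_responses = np.r_[np.ones(males_yes), np.zeros(males_total - males_yes)]
female_responses = np.r_[np.ones(females_yes), np.zeros(females_total - females_yes)]
t, p_t = ttest_ind(male_responses, female_responses)

# With df = 1, t squared is approximately equal to chi2, and the p-values agree closely
print(round(chi2, 2), round(p, 4), round(t, 2), round(p_t, 4))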
Lab: SPSS Exercises

1. Use the states96nodc.sav dataset. Obtain correlation coefficients for IMR, percent black, the poverty rate, and median household income. Interpret the results.

SPSS: Analyze – Correlate – Bivariate – Select variables – Okay

2. Using the 1998 GSS, conduct a chi-square test to determine whether responses to a question about being afraid to walk alone at night in your neighborhood differ significantly by class identification. (Remember to check the variable codes and do any necessary recoding before you conduct the test.)

SPSS: Analyze – Descriptives – Crosstabs – Select row and column variables – Cells (choose column percentages) – Statistics (select chi-square) – Okay

CLASS: Subjective Class Identification (VARIABLE 189)
If you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class, or the upper class?
0 NAP
1 Lower Class
2 Working Class
3 Middle Class
4 Upper Class
5 No Class
8 DK
9 NA

FEAR: Afraid to walk at night in neighborhood (VARIABLE 234)
Is there any area right around here – that is, within a mile – where you would be afraid to walk alone at night?
0 NAP
1 Yes
2 No
8 DK
9 NA

3. Using the 1998 GSS, conduct both a chi-square test and an independent-samples t-test to determine whether there is a significant difference by sex (SEX – VARIABLE 42) in the likelihood of having had an extramarital affair (EVSTRAY – VARIABLE 673). You will need to recode EVSTRAY (keep only the people who responded yes or no, and code 1 as yes). An illustrative recoding sketch follows below.
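The recoding these exercises call for can also be sketched outside SPSS. The Python sketch below is illustrative only: the file name gss1998.csv and the lower-case column names are hypothetical placeholders for however the 1998 GSS extract is stored, and the recodes follow the variable code lists above (for EVSTRAY we assume the raw codes are 1 = yes and 2 = no; all other responses are treated as missing).

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical extract of the 1998 GSS; file and column names are placeholders
gss = pd.read_csv("gss1998.csv")

# CLASS: keep the four substantive categories (1-4); 0, 5, 8, 9 become missing
gss["class_id"] = gss["class"].where(gss["class"].isin([1, 2, 3, 4]))

# FEAR: recode 1 = yes to 1 and 2 = no to 0; 0, 8, 9 become missing
gss["fear_yes"] = gss["fear"].map({1: 1, 2: 0})

# EVSTRAY: keep only yes/no responses, coding 1 = yes and 0 = no
gss["evstray_yes"] = gss["evstray"].map({1: 1, 2: 0})

# Exercise 2: chi-square test for FEAR by CLASS (crosstab drops missing values)
chi2, p, dof, expected = chi2_contingency(pd.crosstab(gss["fear_yes"], gss["class_id"]))
print(round(chi2, 2), dof, round(p, 4))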