Handout

Sociology 541
Review of Bivariate Regression
Chi-Square Test
Cautionary Notes Regarding Linear Regression
1. Differentiate between association and causation
Temporal ordering matters, and endogeneity or simultaneity is possible: causal paths may run in
both directions.
Both x and y could be caused by some other variable, producing a spurious correlation.
Multivariate techniques control for possible spurious relationships better by including the
confounding factor, but the control is still incomplete.
2. Ecological Correlations
When the units of analysis are aggregate units rather than individuals, the correlation tends to be
much higher than it would be for individuals.
3. Non-Linearity
Correlation coefficient r measures scatter around a straight line. Not appropriate for a curvilinear
relationship.
4. Extrapolation
Be careful with extrapolation. Data could be curvilinear.
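Note 3 can be illustrated with a small sketch. The data below are made up purely for illustration and do not come from the course datasets, and `pearson_r` is a hypothetical helper written for this example. In the curved case y is an exact function of x, yet r is 0 because there is no straight-line trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data: y depends perfectly on x in both cases.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys_curved = [x ** 2 for x in xs]       # curvilinear (U-shaped) relationship
ys_linear = [2 * x + 1 for x in xs]    # straight-line relationship

r_curved = pearson_r(xs, ys_curved)    # 0: no linear trend despite perfect dependence
r_linear = pearson_r(xs, ys_linear)    # 1: all points fall on a straight line
print(r_curved, r_linear)
```

A scatterplot would show the curve immediately, which is why it is worth plotting the data before relying on r.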
Analyzing Association between Categorical Variables
Contingency Tables (Review Kurtz, pp. 32-34)
When variables analyzed have only a few categories, as in most nominal and ordinal-scale
measurement, bivariate data are presented in tables.
These tables go by a few names: cross-tabulations, cross-classifications, and contingency tables.
Consider the following question: How do attitudes about current divorce laws depend on gender?
We have two categorical variables:
Independent or explanatory variable is SEX: 1=male; 2=female
Outcome of interest is DIVLAW (VAR 216) (Does the respondent think current divorce laws
should be changed and if so, how?): 1=easier; 2=more difficult; 3=stay the same
In this case, there are 3 possible outcomes for each of the two categories of the independent
variable (male and female), giving a total of six possible combinations.
SPSS Commands:
Analyze – Descriptive Statistics – Crosstabs – Select row (dependent) and column (independent)
variables – OK
DIVLAW2 * RESPONDENTS SEX Crosstabulation
Count
                 RESPONDENTS SEX
DIVLAW2          MALE    FEMALE    Total
1.00              188       240      428
2.00              408       559      967
3.00              177       174      351
Total             773       973     1746
It is customary to make the dependent variable the row variable and to treat the independent
variable as the column variable, but this convention is somewhat arbitrary and often broken.
Marginal Frequency (marginals): numbers along the right margin and bottom margin of the table.
They essentially present univariate frequency distributions.
Number in lower right-hand corner is N, the total sample size excluding missing cases.
N = sum of either row or column marginals.
The body of the table, where the categories of the two variables intersect, contains the bivariate
frequency distribution.
Each intersection is called a cell and the number in the cell is called a cell frequency.
Cell frequencies = number of cases with each possible combination of two characteristics.
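The marginals and N can be recomputed from the cell frequencies. The following is a minimal sketch in Python (not part of the SPSS workflow used in this course); the variable names are chosen for illustration only.

```python
# Cell frequencies from the DIVLAW2 by SEX table.
# Rows: DIVLAW2 = 1 (easier), 2 (more difficult), 3 (stay the same).
# Columns: male, female.
observed = [
    [188, 240],
    [408, 559],
    [177, 174],
]

row_marginals = [sum(row) for row in observed]        # right margin of the table
col_marginals = [sum(col) for col in zip(*observed)]  # bottom margin of the table
n = sum(row_marginals)                                # lower right-hand corner

print(row_marginals)  # [428, 967, 351]
print(col_marginals)  # [773, 973]
print(n)              # 1746
```

Summing either set of marginals gives the same N, which is a quick check that the table was entered correctly.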
We can conduct some percentage comparisons (to take into account the different sample sizes of
men and women):
To compare males and females, we want to examine the percentages of men and women who fall
into each of the different categories of opinions about divorce laws.
SPSS Commands:
Analyze – Descriptive Statistics – Crosstabs – Select row and column variables – Cells (choose
column percentages) – OK
DIVLAW2 * RESPONDENTS SEX Crosstabulation
                                        RESPONDENTS SEX
DIVLAW2                                 MALE     FEMALE     Total
1.00     Count                           188        240       428
         % within RESPONDENTS SEX      24.3%      24.7%     24.5%
2.00     Count                           408        559       967
         % within RESPONDENTS SEX      52.8%      57.5%     55.4%
3.00     Count                           177        174       351
         % within RESPONDENTS SEX      22.9%      17.9%     20.1%
Total    Count                           773        973      1746
         % within RESPONDENTS SEX     100.0%     100.0%    100.0%
These percentages are called conditional distributions on the response (outcome) variable
(divorce laws). They are the sample distribution of opinions on divorce laws, conditional on
gender.
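The column percentages can be reproduced by dividing each cell frequency by its column total. This is a short Python sketch under the assumption that the cell counts are typed in by hand; it is meant only to show the arithmetic behind the SPSS output.

```python
# Cell frequencies: rows are DIVLAW2 = 1, 2, 3; columns are male, female.
observed = [
    [188, 240],
    [408, 559],
    [177, 174],
]

col_totals = [sum(col) for col in zip(*observed)]  # 773 males, 973 females

# Conditional distribution of DIVLAW2 within each sex (column percentages).
col_pct = [
    [100 * cell / total for cell, total in zip(row, col_totals)]
    for row in observed
]

for category, row in zip([1, 2, 3], col_pct):
    print(category, [f"{p:.1f}%" for p in row])
# 1 ['24.3%', '24.7%']
# 2 ['52.8%', '57.5%']
# 3 ['22.9%', '17.9%']
```

Each column of percentages sums to 100, because the base for the percentages is the number of men or women, not the total sample.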
With regard to table construction, keep in mind:
1. The table heading or title should succinctly describe what is contained in the table.
2. Clearly indicate all attributes of the table.
3. When percentages are reported, the base on which they are computed should be indicated.
Remember that percentages are affected by small sample sizes.
4. Exclude missing data in calculating percentages (it is often useful to include an additional row
indicating the number of cases with missing data).
5. Most cross-tabulations in social research are limited to variables with relatively few
categories.
6. When describing a table, be selective and discuss only those comparisons that best describe or
highlight the relationship.
Chi Square Test for Independence (Kurtz, pp. 215-222)
Two categorical variables are statistically independent if the population conditional distributions
on the response variable are identical across the categories of the explanatory variable. The
variables are statistically dependent if the conditional distributions are not identical.
Are the sample differences that we observe between groups in their conditional
distributions due to sampling variation? In other words, if the variables were truly independent,
would sampled differences of this size be likely? Or are the observed differences in percentages
so great that statistical independence in the population is implausible?
What are the appropriate hypotheses?
A chi-square test compares the observed frequencies in the cells of a contingency table with
values expected from the null hypothesis of independence.
Fo: observed frequency in a cell of a table
Fe: expected frequency – the count expected in a cell if the variables are independent.
Calculating the Expected Frequency
The expected frequency Fe for a cell equals the product of the row and column totals (or
marginals) for that cell, divided by the total sample size.
Calculate the expected frequencies for the cross-tabulation of opinions of divorce laws and
gender:
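The rule (row total times column total, divided by N) can be sketched in Python as follows. This is an illustration of the arithmetic under the assumption that the observed counts are entered by hand; the names are hypothetical and the hand calculation is still worth doing.

```python
# Observed cell frequencies: rows are DIVLAW2 = 1, 2, 3; columns are male, female.
observed = [
    [188, 240],
    [408, 559],
    [177, 174],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Fe for each cell = (row total * column total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(fe, 2) for fe in row])
```

The expected frequencies have the same row and column marginals as the observed table, so each row of expected values should sum to the corresponding observed row total.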
Chi-Square Test Statistic
This test statistic for independence summarizes how close the expected frequencies fall to the
observed frequencies. Symbolized by χ².
χ² = Σ (Fo – Fe)² / Fe
When the null hypothesis is true, the observed and expected frequencies tend to be close for each
cell and the chi-square statistic is relatively small.
If the null hypothesis is false, at least some of the observed and expected frequencies tend not to
be very close, leading to a large test statistic.
Calculate the chi-square test statistic for this example:
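One way to check a hand calculation is a short Python sketch like the one below. The function `chi_square` is a hypothetical helper written for this handout's example, not an SPSS or library routine; it sums (Fo – Fe)² / Fe over all cells.

```python
def chi_square(observed):
    """Return the chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected frequency for this cell
            total += (fo - fe) ** 2 / fe
    return total

# Observed cell frequencies: rows are DIVLAW2 = 1, 2, 3; columns are male, female.
observed = [[188, 240], [408, 559], [177, 174]]
chi2 = chi_square(observed)
print(round(chi2, 2))  # 7.11
```

Most of the total comes from the third row (stay the same), where the observed counts differ most from the expected counts.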
Degrees of freedom for chi-square test statistic
DF = (r-1)(c-1), where r = number of rows and c = number of columns
The degrees of freedom have the following interpretation: given the row and column marginal
frequencies, the observed frequencies in (r-1)(c-1) cells of the contingency table determine the
other cell frequencies.
Determine the degrees of freedom for this chi-square test statistic:
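The interpretation of degrees of freedom can be made concrete with a small Python sketch. It assumes the marginals of the divorce-law table are known and that only two cells (the male counts for the first two rows) have been observed; every other cell then follows by subtraction.

```python
# Marginals from the DIVLAW2 by SEX table.
row_totals = [428, 967, 351]   # DIVLAW2 = 1, 2, 3
col_totals = [773, 973]        # male, female

# DF = (r - 1)(c - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(df)  # 2

# Suppose only these df = 2 cells are known: males in rows 1 and 2.
known_male = [188, 408]

# The remaining cells are fixed by the marginals.
male = known_male + [col_totals[0] - sum(known_male)]
female = [r - m for r, m in zip(row_totals, male)]

print(male)    # [188, 408, 177]
print(female)  # [240, 559, 174]
```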
The chi square table in Appendix B (Table B.4) of Kurtz describes a sampling distribution. It
describes the distribution of a statistic (chi square) by reporting the probabilities associated with
every possible sample outcome (every possible chi square). For the two variables we are
analyzing, imagine every possible bivariate table with a given N, each with an associated chi
square.
Obtain p-value:
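As a check on the table lookup, the p-value can be computed directly. The sketch below relies on a special case: with exactly 2 degrees of freedom, the right-tail probability of the chi-square distribution is exp(-x/2). The helper `chi2_right_tail_df2` is hypothetical and valid only for df = 2; other degrees of freedom need the table or statistical software.

```python
import math

# Observed cell frequencies: rows are DIVLAW2 = 1, 2, 3; columns are male, female.
observed = [[188, 240], [408, 559], [177, 174]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum of (Fo - Fe)^2 / Fe over all cells.
chi2 = sum(
    (fo - r * c / n) ** 2 / (r * c / n)
    for row, r in zip(observed, row_totals)
    for fo, c in zip(row, col_totals)
)

def chi2_right_tail_df2(x):
    """P(chi-square >= x) for df = 2 only (closed form exp(-x/2))."""
    return math.exp(-x / 2)

p_value = chi2_right_tail_df2(chi2)
print(round(chi2, 2), round(p_value, 3))  # 7.11 0.029
```

This agrees with a chi-square table: the statistic exceeds 5.991, the critical value for df = 2 at the .05 level, so the p-value is below .05.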
Properties of Chi-Square Distribution
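1. Non-negative (since it sums squared deviations, each divided by a positive expected frequency).
Minimum possible value is 0, which occurs when the observed and expected frequencies are equal in
every cell, i.e. when the variables are completely independent in the sample.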
2. Skewed to right
3. Precise shape depends on degrees of freedom
4. Larger value of test statistic provides stronger evidence against the null hypothesis.
5. Non-parametric (“distribution-free” test). It requires no assumptions about the shape of the
population distribution (unlike a t-test, which assumes that the sampling distribution is normal
in shape).
Sample Size Requirements
The chi-square distribution is the sampling distribution of the chi-square test statistic only if the
sample size is large. A rough guideline for this requirement is that the expected frequency Fe
should exceed 5 in each cell.
Remember that all tests of hypothesis are sensitive to sample size, but the chi square test is
particularly so. For a given pattern of cell percentages, the chi-square statistic grows in
proportion to N, so the probability of rejecting the null hypothesis increases as the number of
cases increases.
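The sensitivity to sample size can be seen in a short Python sketch. The second table below is hypothetical: it multiplies every cell of the divorce-law table by 10, so the percentages are unchanged and only N differs. `chi_square` is an illustrative helper, not a library function.

```python
def chi_square(observed):
    """Return the chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (fo - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(observed, row_totals)
        for fo, c in zip(row, col_totals)
    )

# Actual table: rows are DIVLAW2 = 1, 2, 3; columns are male, female.
observed = [[188, 240], [408, 559], [177, 174]]

# Hypothetical table with identical percentages but ten times as many cases.
observed_x10 = [[10 * cell for cell in row] for row in observed]

chi2 = chi_square(observed)
chi2_x10 = chi_square(observed_x10)
print(round(chi2, 2), round(chi2_x10, 2))
print(round(chi2_x10 / chi2, 6))  # 10.0
```

Because the degrees of freedom stay at 2, the tenfold larger statistic gives far stronger evidence against independence even though the percentage differences between men and women are exactly the same.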
SPSS Commands:
Same commands as above for cross-tabulations (but also select statistics, chi-square)
GROUP EXERCISE:
Sample data from 1998 GSS:
Independent or explanatory variable is SEX: 1=male; 2=female
Outcome of interest is EVSTRAY (Whether had sex with someone other than spouse while
married): 1=yes; 0=no
(Recoded never married people and those who responded 'DK', ‘NAP’, ‘NA’ as missing data)
No. males who responded yes: 169
No. females who responded yes: 155
Total no. of males: 750
Total no. of females: 1055
a. Present this information in a contingency table and summarize the findings.
b. Conduct a chi square test to determine whether sex and ever having strayed are
statistically independent. State the hypotheses, calculate the chi square statistic, find
the associated p-value and interpret.
c. Now, conduct an independent samples t-test to determine whether the proportion who report
ever having strayed differs significantly by sex.
d. What is the correspondence between a chi-square test and a two-sample t-test?
Lab: SPSS Exercises
1. Use the states96nodc.sav dataset. Obtain correlation coefficients for IMR, %black, poverty
rate, and median hh income. Interpret the results.
SPSS: Analyze – Correlate – Bivariate – Select variables – OK
2. Using the 1998 GSS, conduct a chi-square test to determine whether being afraid to walk at
night in your neighborhood is associated with class identification. (Remember to check variable
codes and do any necessary recoding before you conduct the test.)
SPSS: Analyze – Descriptive Statistics – Crosstabs – Select row and column variables – Cells
(choose column percentages) – Statistics (select chi-square) – OK
CLASS: Subjective Class Identification (VARIABLE 189)
If you were asked to use one of four names for your social class, which would you say you belong
in: the lower class, the working class, the middle class, or the upper class?
0 NAP
1 Lower Class
2 Working Class
3 Middle Class
4 Upper Class
5 No Class
8 DK
9 NA
FEAR: Afraid to walk at night in neighborhood (VARIABLE 234)
Is there any area right around here – that is, within a mile – where you would be afraid to walk
alone at night?
0 NAP
1 Yes
2 No
8 DK
9 NA
3. Using the 1998 GSS, conduct both a chi-square test and an independent samples t-test to
determine whether there is a significant difference by sex (SEX – VARIABLE 42) in the
likelihood of having had an extramarital affair (EVSTRAY – VARIABLE 673). You will
need to recode EVSTRAY (keep only the people who responded yes or no and code 1 as
yes).