ASSOCIATIONS IN CATEGORICAL DATA
(Statistics Course Notes by Alan Pickering)
Conventions Used in My Statistics Handouts
• Text which appears in unshaded boxes can be treated as an aside. It may define a
concept being used or provide a mathematical justification for something. Some of
these relate to very important statistical ideas with which you SHOULD be
familiar.
• The names of SPSS menus will be written in shadow bold font (e.g., Analyze).
The options on the menu will be called procedures and their names will be written in
bold small capitals font (e.g., DESCRIPTIVE STATISTICS). Procedures sometimes
offer a family of related options; the name of the one selected will appear in italic
small capitals font (e.g., CROSSTABS). The resulting window will contain boxes to
be filled or checked, and will offer buttons to access subwindows. Subwindow
names will appear in italic font (e.g., Statistics). Subwindows also have boxes to
be filled or checked. Thus the full path to the CROSSTABS procedure will be written:
Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS

Questions to be completed will appear in shaded boxes at various points.
PART I – BASICS AND BACKGROUND
What Are Categorical Data?
Categorical or nominal data are those in which the classes within each variable do not
have any meaningful numerical value. Common examples are: gender; presence or
absence (of some sign, symptom, disease, or behaviour); ethnic group etc. It is
sometimes useful to recode a numerical variable into a small number of categories,
and sometimes this classification will retain ordinal information (e.g., carrying out a
median-split on a personality scale to create high or low trait subgroups; or grouping
subjects into broad age bands). The data can then be analysed using the techniques
described below, but one has to be aware that these analyses usually have
considerably reduced power, relative to using the full range of values on the variable.
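As an aside, here is how such a median split might be sketched outside SPSS, in
Python (the pandas library, the made-up scores, and the variable name "trait" are
assumptions purely for illustration):

# Sketch: median-split a numerical trait scale into a two-level categorical variable.
import pandas as pd

df = pd.DataFrame({"trait": [12, 25, 31, 18, 27, 9, 22, 30]})   # hypothetical scores
median = df["trait"].median()
# Arbitrary coding: 1 = high trait group, 2 = low trait group
df["trait_group"] = (df["trait"] > median).map({True: 1, False: 2})
print(df)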
Contingency Tables
One can carry out analyses on a single categorical variable to check whether the
frequencies occurring for each level of that category are as predicted (check out SPSS
Procedure: Analyze > NON-PARAMETRIC TESTS >> CHI-SQUARE). However, much
more commonly, the basic data structure for categorical variables is the contingency
(or classification) table, or crosstabulation, formed from the variables concerned.
These are described as n-way contingency tables, where n refers to the number of
variables involved. The contingency table documents the frequencies (i.e., counts) of
data for each combination of the variables concerned. Hence contingency tables are
also referred to as frequency tables. The small-scale example below is a 2-way table
formed from the variables PDstatus (i.e., Parkinson’s Disease status: does the
participant have Parkinson’s Disease?) and Smokehis (i.e., smoking history: has the
participant smoked regularly at some point during their life?). Each variable in the
table has two levels (yes vs. no).
                                PDstatus
                       yes (=1)        no (=2)        Row Totals
Smokehis   yes (=1)    3  (5.73)       11  (8.27)         14
           no  (=2)    6  (3.27)        2  (4.73)          8
Column Totals              9                13         Grand Total = 22

Table 1. Observed frequency counts of current Parkinson’s
disease status by smoking history. Expected frequencies
under an independence model are in parentheses.
In SPSS, some of the analysis procedures reviewed below require that the categorical
variables have numerical values, so it is recommended to always code categorical
variables in this way. Remember that it is completely arbitrary how these values are
assigned to the categories1. To aid memory, and to produce clearer analysis printouts,
one should always use the variable labelling options for such variables and include
verbal value labels for each numerically-coded categorical variable.
The Names of the Techniques
Introductory stats courses familiarise students with a specific technique for analysing
2-way contingency tables: Pearson’s χ2 test. More general methods are required for
analysing higher-order contingency tables (involving 3 or more variables), and some
of these analytic methods are therefore described (e.g., in Tabachnick and Fidell) as
multiway frequency table analyses (or MFA). However, there are a large number of
alternative names for a family of closely-related procedures. Here are just some of the
other names which one may come across (e.g., in the procedure names available in
SPSS): multinomial (or binary) logistic regression; logit regression; logistic analysis;
analysis of multinomial variance; and (hierarchical) loglinear modelling. As these
names imply, the techniques are often comparable to the procedures used for handling
numerical data, in particular multiple linear regression and analysis of variance
(ANOVA). A reasonable simplification is that the categorical data analyses below can
be thought of as extending the Pearson χ2 test to multiway tables. The techniques also
share, with the χ2 test, the fact that the calculated statistics are tested for statistical
significance against the χ2 distribution (just as multiple linear regression and ANOVA
involve testing their statistics against the F distribution).
Association in Contingency Tables
There are many techniques in statistics for detecting an association between 2
variables. A significant association simply means that the values of one variable vary
systematically (i.e., at a level greater than chance) with values of the other variable.
1 As we shall see later, changing the way the numerical values for categories are assigned can
sometimes alter the value of a particular statistic (e.g. from a positive number to a negative number),
but this does not alter the statistical significance of the statistic. It is also sometimes helpful (for
interpreting analyses) to choose one particular assignment of numbers rather than others.
The most well-known measures of association are probably the (various types of)
correlation coefficient between two variables. Correlation coefficients can reveal the
extent to which the score (or rank) of one variable is linearly related to the score (or
rank) of another. In contingency tables, the values within each category have no
intrinsic numerical value, but associations can still be detected. An association means
that the distribution of frequencies across the levels of one category differs depending
upon the particular level of another category. When there is no association between
variables, they are described as being independent. Thus, independence in a 2-way
table means that there is no association between the row and column variables.
It is a much-noted statistical fact that finding significant associations between
variables, in itself, tells you nothing about the causal relationships at work. In
association analyses one may therefore have no logical reason to treat variables as
either dependent or independent. Sometimes research is entirely exploratory and,
when significant associations are found, the search for a causal connection between
the variables begins. In such research with categorical variables one might typically
take a single sample of subjects and record the values on the variables of interest. For
example, one might explore whether the political orientation reported by a subject
(left; centre; right) was associated with the newspaper he or she reads. Neither
variable here is obviously the dependent variable (DV), as the causal relationship
could go in either direction. Very often, however, one has a causal model in mind. For
example, one might be interested in whether a subject’s gender is associated with
political orientation. Here, political orientation is the DV and a subject’s gender is the
independent variable (IV), as it is not possible for political orientation to affect gender.
The research here will usually adopt a different sampling scheme, by controlling the
sampling of the subjects in terms of the IVs. In this example, two samples of subjects
(males and females) would be tested, recording the DV (political orientation) in each
sample. In would be typical to arrange for equal-sized samples of males and females
in such research. In the example data of Table 1, we are interested in finding variables
that predict whether a subject will develop Parkinson’s Disease (PD). Thus PD status
is the DV and the IV (or predictor) is smoking history.
For contingency table data, the distinction between (exploratory) analyses, where all
variables have a similar status, and analyses involving both IVs and DVs is important.
We will see below that it affects the name of the analysis and the statistical procedure
that one uses. In this handout, where the categorical analyses have both IVs and DVs,
we will adopt the convention that the DV will be shown as the column variable.
It follows from the above that a pair of alternative hypotheses (H1 = independence;
H2 = association) may be applied to a contingency table. In order to decide between
these hypotheses, one can calculate a statistic that reflects the discrepancy between
the actual frequencies obtained and the frequencies that would be expected under the
independence model described above. If the discrepancies are within the limits of
chance (i.e., the statistic is nonsignificant), then one cannot reject the hypothesis of
independence. If the discrepancies are not within chance limits (i.e., the statistic is
significant) then one can safely reject the independence hypothesis, which implies
association between the variables in the table.
Estimating Expected Frequencies Under the Independence Model
We will use the data from Table 1 as an example. If the variables PDstatus and
Smokehis are independent then the proportion of “Smokehis=yes” subjects with
PD should be equal to the proportion of “Smokehis=no” subjects with PD, and
both should be equal to the proportion of subjects who have PD overall. The
overall proportion with PD is equal to 0.41 (i.e., 9/22). Therefore, the expected
frequency with PD in the Smokehis=yes group should be 0.41 times the total
number of subjects in the Smokehis=yes group; i.e., 0.41*14 (=5.73). This gives
the expected frequency without PD in the Smokehis=yes group by subtraction
(=14-5.73=8.27). Similar calculations give the expected frequencies for PD
(0.41*8=3.27) and no PD (=8-3.27=4.73) in the Smokehis=no group. Another
way to get the expected frequency for a cell in row R and column C is to multiply
the row total for row R by the column total for column C and divide the result by
the grand total for the whole table (e.g., [9*14]/22=5.73 for row 1 and column 1).
This approach is easy to use with tables that have more than 2 variables (where
the rows represent one variable, columns another, and separate subtables are used
for other variables).
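As an aside, the expected frequencies for Table 1 can be checked outside SPSS with a
few lines of Python (a sketch only; the numpy library is assumed):

# Sketch: expected frequencies under independence for Table 1.
# Rows = Smokehis (yes, no); columns = PDstatus (yes, no).
import numpy as np

observed = np.array([[3.0, 11.0],
                     [6.0, 2.0]])
row_totals = observed.sum(axis=1)     # [14, 8]
col_totals = observed.sum(axis=0)     # [9, 13]
grand_total = observed.sum()          # 22

# Expected count for a cell = (its row total * its column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)   # [[5.73 8.27]
                  #  [3.27 4.73]]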
Testing Associations in 2-way Contingency Tables
There are several statistics that one can compute to test for association vs.
independence in a 2-way contingency table (such statistics are thus sometimes
referred to as “indices of association”). Three such statistics (described below) are
worthy of attention, and two (G2; OR) are of particular significance for logistic
regression analyses.
In this section we shall consider a 2-way table with R rows and C columns; this is
therefore referred to as an RxC table. There are m cells in the table where m = R * C.
The actual frequency in cell number i of the table is denoted by the symbol fi and the
expected frequency (under the independence model) is denoted by ei.
(i) Pearson’s χ2 statistic.
How to Compute using SPSS: Select the following procedure:
Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS
Click on the Statistics button to access the Statistics subwindow and then check
the Chi-square box. (The Cells subwindow is also useful as it lets you display
things other than just the actual frequencies in the contingency table.)
Key SPSS Output: Conducting a Pearson χ2 analysis on the data in Table 1 (which
is available on the J drive as the dataset small parks data) produces the following
SPSS output:
Is/Was subject a smoker? * Has got Parkinson's disease? Crosstabulation
Count
                                        Has got Parkinson's disease?
                                          yes         no        Total
Is/Was subject a smoker?      yes          3          11           14
                              no           6           2            8
Total                                      9          13           22
Chi-Square Tests
                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                               (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              6.044b     1      .014
Continuity Correction a         4.031      1      .045
Likelihood Ratio                6.222      1      .013
Fisher's Exact Test                                              .026         .016
Linear-by-Linear Association    5.769      1      .022
N of Valid Cases                   22

a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is
3.27.
What Do These Results Mean?: The results of this analysis show that the χ2 test
statistic was significantly greater than the tabulated critical value (i.e., larger than
would plausibly arise by chance if the variables were independent). Therefore, the
independence model can be rejected and one concludes that the variables PDstatus
and smokehis are associated. If the p-value associated with the χ2 test statistic were
nonsignificant, then one would not be able to reject the hypothesis that PDstatus
and smokehis are independent.
Formula2:
χ2 = Σi=1 to m ([fi – ei]2 / ei)
Degrees of Freedom (df):
df=(R-1)*(C-1)
Testing Significance: Under the independence model, the χ2 test statistic has a
distribution which follows the χ2 distribution with the degrees of freedom as given
above. (Pearson could have been more helpful and given his statistic a name that
differed from that of the distribution against which it is tested.) It has been shown
that the χ2 distribution can be used to test the Pearson χ2 statistic as long as none
of the expected frequencies is lower than 3.

2 The expression on the next line is used in many statistical formulae:
Σi=1 to m (xi)
It means calculate the sum of a set of numbers x1, x2, ... up to xm.
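Returning to the Pearson χ2 statistic itself: as an aside, the formula and its
significance test can be reproduced outside SPSS in a few lines of Python (a sketch
only; numpy and scipy are assumed):

# Sketch: Pearson chi-square for Table 1, computed directly from the formula.
import numpy as np
from scipy.stats import chi2

observed = np.array([[3.0, 11.0],
                     [6.0, 2.0]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

chi_sq = ((observed - expected) ** 2 / expected).sum()    # approx 6.04
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)    # (R-1)*(C-1) = 1
p_value = chi2.sf(chi_sq, df)                             # approx .014, as in the SPSS output
print(chi_sq, df, p_value)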
(ii) Likelihood ratio statistic (usually abbreviated G2 or Y2).
“Computing using SPSS”, “Key SPSS Output”, “What Do These Results Mean?”,
df, and “Testing Significance”, are the same as for Pearson χ2. Because of the
mathematical relationship between the two formulae, the values under many
circumstances are approximately equal.
Formula3:
G2 = 2 * Σi=1 to m (fi * log[fi / ei])
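As an aside, the corresponding sketch for G2 (same assumptions as the Pearson χ2
sketch above: numpy, scipy, and the Table 1 frequencies):

# Sketch: likelihood ratio statistic G2 for Table 1 (natural logs throughout).
import numpy as np
from scipy.stats import chi2

observed = np.array([[3.0, 11.0],
                     [6.0, 2.0]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

g_sq = 2 * (observed * np.log(observed / expected)).sum()   # approx 6.22
p_value = chi2.sf(g_sq, df=1)                               # approx .013, as in the SPSS output
print(g_sq, p_value)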
(iii) Odds ratio (OR).
(Note: this applies only to a 2x2 table, or to a 2x2 comparison within a larger table.)
How to Compute Using SPSS: This is also available via SPSS CROSSTABS. Follow
the procedure for χ2 and G2 but now check the “Cochran’s and Mantel-Haenszel
Statistics” box in the Statistics subwindow.
Key SPSS Output: Conducting an OR analysis for the data in Table 1 using SPSS
CROSSTABS gives the following additional output:
Mantel-Haenszel Common Odds Ratio Estimate
Estimate                                                   .091
ln(Estimate)                                             -2.398
Std. Error of ln(Estimate)                                1.044
Asymp. Sig. (2-sided)                                      .022
Asymp. 95% Confidence    Common Odds Ratio       Lower Bound     .012
Interval                                         Upper Bound     .704
                         ln(Common Odds Ratio)   Lower Bound   -4.445
                                                 Upper Bound    -.351
The Mantel-Haenszel common odds ratio estimate is asymptotically normally distributed
under the common odds ratio of 1.000 assumption. So is the natural log of the estimate.
What Do These Results Mean: The OR, like χ2 and G2, tests whether the
independence hypothesis can be rejected. The significance test above reveals a
p-value of 0.022, and so the row and column variables (smoking history and
Parkinson’s Disease status) are not independent. An OR value of 1 corresponds to
perfect independence; the OR itself can range from 0 to plus infinity. The value calculated
for the Table 1 data lies well below one (=0.091; the p-value shows this to be
significantly different from 1). This means that the odds of having Parkinson’s
Disease (PD) if you are a smoker (row 1) are 0.091 times the odds of having PD if
you are a nonsmoker. Smoking in these data is significantly protective with
respect to PD. If one had predicted the direction of this relationship (based on the
existing literature demonstrating a PD-protective effect for smoking), then one
could justifiably make a one-tailed test: the p-value for the OR of 0.091 in this
case would be 0.022/2 (=0.011).

3 The log in the formula refers to natural logarithms, also written as loge or ln.
In the SPSS output, one sees that the natural logarithm of the odds ratio is also
reported. For various reasons, this statistic is more useful within contingency
tables than the raw OR. This use of log-transformed statistics explains why
contingency table analyses are often described as logistic analyses, logistic
regressions or loglinear modelling. The output also reports confidence intervals
for the OR and log(OR) statistics. The next 2 boxes give a few basic facts about
logarithms and confidence intervals.
Formula: Assuming the cells of a 2x2 table contain the frequencies a, b, c, d as
follows:

   a   b
   c   d

the formula is: OR = (a * d) / (c * b)
If any of the frequencies (a to d) are zero then one usually replaces the zero with
0.5.
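As an aside, this formula (including the 0.5 replacement for zero cells) can be
sketched in Python as follows:

# Sketch: odds ratio for a 2x2 table laid out as  a  b
#                                                 c  d
def odds_ratio(a, b, c, d):
    # Replace any zero frequency with 0.5 (the usual convention)
    a, b, c, d = [x if x != 0 else 0.5 for x in (a, b, c, d)]
    return (a * d) / (c * b)

print(odds_ratio(3, 11, 6, 2))   # Table 1: (3*2)/(6*11) = 0.091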
Testing Significance: As the calculated ln(OR) is normally distributed under the
independence hypothesis, the estimate can be converted to a Z-value and tested
against the value in a normal probability table. Z(log[OR])=log(OR)/SE(log[OR]).
For the data in Table 1, the value is -2.30. In a standard normal (i.e., Z)
distribution, the two-tailed probability of finding a value that is as far (or further)
above or below 0 as –2.3, is 0.022.
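As an aside, this z-test can be sketched in Python using the Table 1 frequencies
(scipy is assumed for the normal tail probability; the SE formula used is the one
given in the confidence interval box below):

# Sketch: z-test for ln(OR) with Table 1 frequencies a=3, b=11, c=6, d=2.
import math
from scipy.stats import norm

a, b, c, d = 3, 11, 6, 2
log_or = math.log((a * d) / (c * b))       # approx -2.398
se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # approx 1.044
z = log_or / se                            # approx -2.30
p_two_tailed = 2 * norm.sf(abs(z))         # approx 0.022
print(log_or, se, z, p_two_tailed)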
Key Facts About Logarithms
Taking the logarithm of a number is the inverse mathematical operation to
raising a number to a power (or exponentiating). If we write the following
equation:
x^y = z
then we can define the “log to the base x of z” thus:
logx(z) = y
Try it with x=10; y=2 and z=100. The log (to base 10) of 100 is 2; i.e. the log
of a number is the power to which you have to raise the base to obtain the
original number. Natural logarithms employ the number e as their base (e is
approximately 2.718). In statistical theory, natural logs are always used. The
convenient thing about logs is that they turn multiplicative (or reciprocal)
relationships into additive (or subtractive) ones. Because e^a * e^b = e^(a + b) and
e^a / e^b = e^(a - b), this means the following are true (and indeed it is true for logs
to any base):
loge(a*b) = loge(a) + loge(b)
loge(a/b) = loge(a) - loge(b)
In order for x^y = 0 to be generally true, y must be minus infinity (-∞). Thus,
loge(0) could be -∞, but convention dictates that loge(0) is undefined. Any
number raised to the power zero is 1 and so loge(1)=0. Thus we have
the following ranges:
loge(x) > 0   if 1 < x < +∞
loge(x) = 0   if x = 1
loge(x) < 0   if 0 < x < 1
Probabilities (or likelihoods) have values between 0 and 1. As the statistics in
this handout involve taking the logs of probabilities, it follows from the above
that the resulting values are negative.
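As an aside, these rules are easy to verify numerically; a brief Python illustration
(the values 4.0 and 2.5 are arbitrary):

# Sketch: checking the logarithm rules stated above (math.log is the natural log).
import math

a, b = 4.0, 2.5
print(math.log(a * b), math.log(a) + math.log(b))   # equal: log of a product
print(math.log(a / b), math.log(a) - math.log(b))   # equal: log of a quotient
print(math.log(1.0))                                # 0.0
print(math.log(0.3))                                # negative, since 0 < 0.3 < 1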
What Are Confidence Intervals?
When we calculate the value of a statistic for a sample of data, we are
attempting to measure the true value of that statistic for the populations under
study. Even if we have removed (most of) the sources of systematic bias from
our experiment, the sample used to calculate the statistic will be subject to
several sources of random error or noise. Therefore, although the value we
calculate for the statistic is our best estimate of its true value in the
population, we might prefer to give a range of values within which we feel
the true value is likely to fall. A confidence interval (CI) for the calculated
value of a statistic is just such a range (error bars on graphs serve a similar
purpose). If we estimate a statistic to have a value of 10 with a 95%
confidence interval of plus or minus 6, then we are saying that we are 95%
certain that the true value of the statistic lies somewhere between 4 and 16
([10 – 6] to [10 + 6]). If the statistic would be expected to have a value of 0
according to some hypothesis, then the value and associated CI in the above
example imply that such a hypothesis should be rejected. Journals
increasingly demand that CIs are calculated.
In the SPSS output for the OR analysis of the data in Table 1, the log(OR) had
a value of –2.398, with a standard error (SE) for the log(OR) of 1.044. The SE
is given by the square root of the sum of the reciprocals of the frequencies
used in the OR calculation (i.e. √(1/3 + 1/11 + 1/6 + 1/2)=1.044). With
sufficient subjects in the whole table, the distribution of possible values for
log(OR) is distributed normally around the value actually obtained. 95% of
the distribution will lie within 1.96*SE either side of the estimated value.
Thus, the 95% CI for the calculated value of ln(OR) lies between (-2.398 –
1.96*1.044) and (–2.398 + 1.96*1.044); i.e. between –4.44 and –0.35. To
calculate the corresponding CI for the OR, we simply compute e^-4.44
(=0.01) and e^-0.35 (=0.70). For the data in Table 1, our best guess of the true OR
is 0.09, and we are 95% certain that the true value lies between 0.01 and 0.70. As
these 95% CIs do not include 1.0 (the value expected if there were
independence between PDstatus and Smokehis), then the independence
hypothesis can be rejected.
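As an aside, the CI arithmetic above can be sketched in Python (Table 1
frequencies; purely an illustration of the calculation, not an SPSS procedure):

# Sketch: 95% confidence interval for ln(OR) and for the OR, Table 1 data.
import math

a, b, c, d = 3, 11, 6, 2
log_or = math.log((a * d) / (c * b))                  # approx -2.398
se = math.sqrt(1/a + 1/b + 1/c + 1/d)                 # approx 1.044
lower_log = log_or - 1.96 * se                        # approx -4.44
upper_log = log_or + 1.96 * se                        # approx -0.35
lower_or, upper_or = math.exp(lower_log), math.exp(upper_log)   # approx 0.01 and 0.70
print(lower_log, upper_log, lower_or, upper_or)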
Question 1
What happens if you calculate 2, G2, and OR statistics for the contingency
table formed from the variables smokehis and PDstatr in the small parks
dataset? (PDstatr is the same variable as PDstatus, only it has been recoded
so that 1=no and 2=yes.)