The Analysis of Categorical Data

Categorical variables
• When both predictor and response variables are categorical:
  • Presence/absence
  • Colors
• The data in such a study represent counts (or frequencies) of observations in each category

Data Analysis
• A single categorical predictor variable: organized as two-way contingency tables, and tested with a chi-square or G-test
• Multiple predictor variables (or complex models): organized as multiway contingency tables, and analyzed using either log-linear models or classification trees

Two-way Contingency Tables
• Analysis of contingency tables is done correctly only on the raw counts, not on the percentages, proportions, or relative frequencies of the data

Wildebeest carcasses from the Serengeti (Sinclair and Arcese 1995)
Variables
• Sex (males / females)
• Cause of death (predation / other)
• Bone marrow type:
  1. Solid white fatty (healthy)
  2. Opaque gelatinous
  3. Translucent gelatinous

Data
Sex     Marrow  Predation
Male    SWF     Yes
Male    OG      Yes
Male    TG      Yes
…       …       …

Brief format
SEX     MARROW  DEATH   COUNT
FEMALE  SWF     PRED    26
MALE    SWF     PRED    14
FEMALE  OG      PRED    32
MALE    OG      PRED    43
FEMALE  TG      PRED    8
MALE    TG      PRED    10
FEMALE  SWF     NPRED   6
MALE    SWF     NPRED   7
FEMALE  OG      NPRED   26
MALE    OG      NPRED   12
FEMALE  TG      NPRED   16
MALE    TG      NPRED   26

Contingency table: Sex * Death Crosstabulation
                Death
Sex        NPRED   PRED    Total
FEMALE     48      66      114
MALE       45      67      112
Total      93      133     226

Contingency table: Sex * Marrow Crosstabulation
                Marrow
Sex        OG      SWF     TG      Total
FEMALE     58      32      24      114
MALE       55      21      36      112
Total      113     53      60      226

Contingency table: Death * Marrow Crosstabulation
                Marrow
Death      OG      SWF     TG      Total
NPRED      38      13      42      93
PRED       75      40      18      133
Total      113     53      60      226

Are the variables independent?
We want to know, for example, whether males are more likely to die by predation than females…
• Our null hypothesis is that the predictor and response variables are not associated with each other, i.e. the two variables are independent of each other and the observed degree of association is not stronger than we would expect by chance or random sampling

Calculating the expected values
• The expected value is the total number of observations (N) times the probability of an individual being both male and dead by predation:

$$\hat{Y}_{\text{male, dead by predation}} = N \times P(\text{male} \cap \text{dead by predation})$$

The probability of two independent events:

$$P(\text{male} \cap \text{dead by predation}) = P(\text{male}) \times P(\text{dead by predation})$$

Because we have no information other than the data, we estimate the probability of each right-hand term of the equation from the marginal totals…

Contingency table: Sex * Death expected values
                Death
Sex        NPRED    PRED     Total    P
FEMALE     46.91    67.09    114      0.5044
MALE       46.09    65.91    112      0.4956
Total      93       133      N=226
P          0.4115   0.5885

$$\hat{Y}_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{\text{sample size}}$$

$$\hat{Y}_{\text{female, not predated}} = N \times P(\text{female} \cap \text{not predated})$$

Testing the hypothesis: Pearson's Chi-square test

$$X^2_{\text{Pearson}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = 0.0866, \quad P = 0.7685$$

$$X^2_{\text{Yates}} = \sum_{\text{all cells}} \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}} = 0.0253, \quad P = 0.8736$$

The degrees of freedom:

$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) = 1$$

Calculating the P-value
• We find the probability of obtaining a value of $X^2$ as large as or larger than 0.0866 relative to a $\chi^2$ distribution with 1 degree of freedom
• P = 0.769

[Figure: standardized residuals for the Sex * Death table (female / male by predator / non predator); shading scale <-4, -4:-2, -2:0, 0:2, 2:4, >4]

An alternative
• The likelihood ratio test: it compares observed values with the distribution of expected values based on the multinomial probability distribution

$$G = 2 \sum_{\text{all cells}} \text{Observed} \times \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 0.0866$$

Two-way contingency tables
• Sex * Death Crosstabulation: $X^2_{\text{Pearson}} = 0.087$, d.f. = 1, $P = 0.769$; $G = 0.087$, d.f. = 1, $P = 0.769$
• Sex * Marrow Crosstabulation: $X^2_{\text{Pearson}} = 4.745$, d.f. = 2, $P = 0.093$; $G = 4.778$, d.f. = 2, $P = 0.092$
• Marrow * Death Crosstabulation: $X^2_{\text{Pearson}} = 29.308$, d.f. = 2, $P < 0.001$; $G = 29.520$, d.f. = 2, $P < 0.001$

Which test to choose?
Model    Rows/Columns                 Sample size   Test
I, II    Not fixed; Fixed/not fixed   small         G-test, with corrections
I, II    Not fixed; Fixed/not fixed   large         G-test, Chi-square test
III      Fixed                                      Fisher exact test

Log-linear models

Multi-way Contingency Tables
Multiple two-way tables

Females
                Marrow
Death      OG      SWF     TG      Total
PRED       32      26      8       66
NPRED      26      6       16      48
Total      58      32      24      114

Males
                Marrow
Death      OG      SWF     TG      Total
PRED       43      14      10      67
NPRED      12      7       26      45
Total      55      21      36      112

Log-linear models
• They treat the cell frequencies as counts distributed as a Poisson random variable
• The expected cell frequencies are modeled against the variables using the log link and a Poisson error term
• They are fitted, and their parameters estimated, using maximum likelihood techniques

Log-linear models
• They do not distinguish response and predictor variables: all the variables are considered equally as response variables
However
• A logit model with categorical variables can be analyzed as a log-linear model

Two-way tables
• For a two-way table (I by J) we can fit two log-linear models
• The first is a saturated (full) model
• $\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}$
• $f_{ij}$ is the expected frequency in cell ij
• $\lambda_i^{X}$ is the effect of category i of variable X
• $\lambda_j^{Y}$ is the effect of category j of variable Y
• $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y
• This model fits the observed frequencies perfectly!
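The Sex * Death test from the two-way table slides can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the variable names and the helper `chi2_sf_1df` are illustrative assumptions, and the closed-form P-value is valid only for 1 degree of freedom. The G statistic computed here is the same quantity written G2 in the log-linear slides.

```python
import math

# Observed counts from the Sex * Death crosstabulation (rows: sex, columns: death).
observed = {
    ("FEMALE", "NPRED"): 48, ("FEMALE", "PRED"): 66,
    ("MALE", "NPRED"): 45, ("MALE", "PRED"): 67,
}

rows = sorted({r for r, _ in observed})
cols = sorted({c for _, c in observed})
n = sum(observed.values())
row_tot = {r: sum(observed[r, c] for c in cols) for r in rows}
col_tot = {c: sum(observed[r, c] for r in rows) for c in cols}

# Expected count under independence: row total * column total / sample size.
expected = {(r, c): row_tot[r] * col_tot[c] / n for r in rows for c in cols}

# Pearson chi-square, Yates-corrected chi-square, and the G statistic.
x2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
x2_yates = sum((abs(observed[k] - expected[k]) - 0.5) ** 2 / expected[k]
               for k in observed)
g = 2 * sum(observed[k] * math.log(observed[k] / expected[k]) for k in observed)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(rows) - 1) * (len(cols) - 1)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with exactly 1 df."""
    return math.erfc(math.sqrt(x / 2))


p_x2 = chi2_sf_1df(x2)
p_yates = chi2_sf_1df(x2_yates)
p_g = chi2_sf_1df(g)

print(f"X2 = {x2:.4f}, P = {p_x2:.4f}")
print(f"X2 (Yates) = {x2_yates:.4f}, P = {p_yates:.4f}")
print(f"G = {g:.4f}, P = {p_g:.4f}, df = {df}")
```

For tables with more than 1 degree of freedom (Sex * Marrow, Marrow * Death) the statistics are computed the same way, but the P-value needs a general chi-square tail function, for example from a statistics library.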
Note
• The effect does not imply any causality, just the influence of a variable, or of an interaction between variables, on the log of the expected number of observations in a cell…

Two-way tables
• The second log-linear model represents independence of the two variables (X and Y) and is a reduced model:
• $\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y}$
• The interpretation of this model is that the log of the expected frequency in any cell is a function of the mean of the log of all the expected frequencies plus the effect of variable X and the effect of variable Y. This is an additive linear model with no interaction between the two variables

Interpretation
• The parameters of the log-linear models are the effects of a particular category of each variable on the expected frequencies:
• i.e. a larger λ means that the expected frequencies will be larger for that category.
• These parameters are also deviations from the mean of the log of all expected frequencies.

Null hypothesis of independence
• The H0 is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies
• It is also a test that $\lambda_{ij}^{XY} = 0$:
• There is NO interaction between the two variables

Test
• We can test this H0 by comparing the fit of the model without this term to the saturated model that includes this term
• We determine the fit of each model by calculating the expected frequencies under each model, comparing the observed and expected frequencies, and calculating the log-likelihood of each model

Test
• We then compare the fit of the two models with the likelihood ratio test statistic ∆
• However, the sampling distribution of this ratio (∆) is not well known, so instead we calculate the $G^2$ statistic
• $G^2 = -2 \log \Delta$
• $G^2$ follows a $\chi^2$ distribution for reasonable sample sizes and can be generalized to
• $G^2 = -2(\text{log-likelihood reduced model} - \text{log-likelihood full model})$

Degrees of freedom
• The calculated $G^2$ is compared to a $\chi^2$ distribution with $(I-1)(J-1)$ df.
• This df, $(I-1)(J-1)$, is the difference between the df for the full model, $(IJ-1)$, and the df for the reduced model, $[(I-1)+(J-1)]$

Akaike information criterion (Hirotugu Akaike)

$$AIC = -2 \log L(\hat{\theta} \mid \text{data}) + 2K$$

The full model

$$\log f_{ijk} = C + \lambda_i^{\text{death}} + \lambda_j^{\text{sex}} + \lambda_k^{\text{marrow}} + \lambda_{ij}^{\text{death} \cdot \text{sex}} + \lambda_{ik}^{\text{death} \cdot \text{marrow}} + \lambda_{jk}^{\text{sex} \cdot \text{marrow}} + \lambda_{ijk}^{\text{death} \cdot \text{sex} \cdot \text{marrow}}$$

$$AIC = G^2_{\text{particular model}} - 2\,df_{\text{particular model}}$$

Complete table
     Model                  G2       df   P        AIC
1    D+S+M                  42.76    7    <0.001   28.76
2    D*S                    42.68    6    <0.001   30.68
3    D*M                    13.24    5    0.021    3.24
4    S*M                    37.98    5    <0.001   27.98
5    D*S+D*M                13.16    4    0.01     5.16
6    D*S+S*M                37.89    4    <0.001   29.89
7    D*M+S*M                8.46     3    0.037    2.46
8    D*S+D*M+S*M            7.19     2    0.027    3.19
9    Saturated full model   0        0

Two-way interactions (marginal independence)
Term     Models compared   G2        Difference in G2        d.f.      P
D+S+M    reference         42.76
D*S      1 vs 2            42.6759   42.76-42.68 = 0.084     7-6 = 1   0.769
D*M      1 vs 3            13.24     42.76-13.24 = 29.520    7-5 = 2   <0.001
S*M      1 vs 4            37.98     42.76-37.98 = 4.778     7-5 = 2   0.092

Three-way interaction
• Death*Sex*Marrow
• Models compared: 8 vs 9
• G2 = 7.19
• df = 2
• P = 0.027

Conditional independence
Term     Models compared   G2      df   P
D*S      7 vs 8            1.28    1    0.259
D*M      6 vs 8            30.71   2    <0.001
S*M      5 vs 8            5.97    2    0.051
Death and marrow have a partial association

Conditional independence

Females
                Marrow
Death      OG      SWF     TG      Total
PRED       32      26      8       66
NPRED      26      6       16      48
Total      58      32      24      114

Males
                Marrow
Death      OG      SWF     TG      Total
PRED       43      14      10      67
NPRED      12      7       26      45
Total      55      21      36      112

The conditional odds ratio of X and Y within level k of the third variable, its asymptotic standard error on the log scale, and its confidence interval:

$$\hat{\theta}_{XY(k)} = \frac{n_{11k}\, n_{22k}}{n_{12k}\, n_{21k}}$$

$$ASE(\log \hat{\theta}_{XY(k)}) = \sqrt{\frac{1}{n_{11k}} + \frac{1}{n_{12k}} + \frac{1}{n_{21k}} + \frac{1}{n_{22k}}}$$

$$CI = e^{\log \hat{\theta}_{XY(k)} \pm z_{0.95} \times ASE(\log \hat{\theta}_{XY(k)})}$$

Experimental Design and Data Analysis for Biologists. Gerry P. Quinn (Monash University), Michael J. Keough (University of Melbourne)

$$\hat{\theta}^{\text{SWF vs OG}}_{\text{female}} = \frac{26 \times 26}{32 \times 6} = 3.521 \qquad \hat{\theta}^{\text{SWF vs OG}}_{\text{male}} = \frac{14 \times 12}{43 \times 7} = 0.558$$

             Males   95% CI        Females   95% CI
TG vs OG     0.107   0.041-0.283   0.406     0.150-1.097
TG vs SWF    0.192   0.060-0.616   0.115     0.034-0.395
SWF vs OG    0.558   0.184-1.693   3.521     1.261-9.836

[Figure: odds ratios for SWF vs OG, TG vs SWF, and TG vs OG in males (M) and females (F), with frequentist and Bayesian estimates]

Complete independence
• Models compared: 1 vs 8
• G2 = 35.57
• df = 5
• P < 0.001

Warning
• Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.
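The conditional odds ratios and their 95% intervals can be recomputed from the death-by-marrow counts within each sex. The sketch below is illustrative rather than the method used to produce the slides: the function name `odds_ratio_ci` and the data layout are assumptions, and it assumes a Wald interval on the log scale with the two-sided normal quantile (about 1.96), which appears to match the tabulated intervals to within rounding.

```python
import math
from statistics import NormalDist

# Death-by-marrow counts within each sex, copied from the slide tables.
counts = {
    "FEMALE": {"PRED": {"OG": 32, "SWF": 26, "TG": 8},
               "NPRED": {"OG": 26, "SWF": 6, "TG": 16}},
    "MALE": {"PRED": {"OG": 43, "SWF": 14, "TG": 10},
             "NPRED": {"OG": 12, "SWF": 7, "TG": 26}},
}


def odds_ratio_ci(table, a, b, level=0.95):
    """Odds of predation for marrow type a relative to type b, with a Wald CI."""
    n11, n12 = table["PRED"][a], table["PRED"][b]
    n21, n22 = table["NPRED"][a], table["NPRED"][b]
    theta = (n11 * n22) / (n12 * n21)
    # Asymptotic standard error of log(theta).
    ase = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    # Assumption: two-sided interval, so use the (1 + level) / 2 quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    log_theta = math.log(theta)
    return theta, math.exp(log_theta - z * ase), math.exp(log_theta + z * ase)


comparisons = [("TG", "OG"), ("TG", "SWF"), ("SWF", "OG")]
results = {
    (sex, a, b): odds_ratio_ci(table, a, b)
    for sex, table in counts.items()
    for a, b in comparisons
}

for (sex, a, b), (theta, lo, hi) in results.items():
    print(f"{sex:6s} {a} vs {b}: {theta:.3f} ({lo:.3f}-{hi:.3f})")
```

An interval that excludes 1 (for example TG vs OG in males) indicates that the odds of predation differ between the two marrow types within that sex.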