Statistics: Analyses of Categorical Variables
March 16, 2011
Jobayer Hossain, Ph.D.
Nemours Bioinformatics Core Facility
Nemours Biomedical Research

Categorical Variable
• Observations belong to a finite set of discrete categories or groups.
• Gender, race, and severity of a disease are examples of categorical variables.
• Descriptive statistics: frequencies, percentages, and proportions are usually used to describe the data of a categorical variable.

Categorical Variables in Retinoid data
• There are three categorical variables in the Retinoid data set on the course website:
  – trt (treatment groups)
  – bmigrp2 (obese or non-obese at baseline)
  – bmigrp3 (non-obese, overweight, and obese at baseline)

Characterizing categorical variables
Calculating frequencies and percentages for the categories of variables trt, bmigrp2, and bmigrp3 in the Retinoid data set

SPSS output: characterizing categorical variables

Proportion Tests
• Use
  – Test for the value of a single proportion
    • E.g., to test whether the proportion of obese among the 39 participating patients equals a specified value, say 0.5.
  – Test for equality of two or more proportions
    • E.g., to test whether the proportions of obese in two treatment groups are equal.

Test for the value of a single proportion
Type a test value

SPSS output: Test for the value of a single proportion
We are testing the hypothesis that the proportion of lean subjects is 0.5, i.e., H0: the proportion of lean = 0.5, against the alternative Ha: the proportion of lean ≠ 0.5.
The observed proportion of lean subjects is 0.33, which is 0.17 (0.5 − 0.33) smaller than the hypothesized proportion (0.5). The p-value is 0.053, which is marginally higher than the level of significance (0.05), so H0 is not rejected at the 0.05 level.
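
The single-proportion test above can also be checked outside SPSS. The following is a minimal sketch of an exact two-sided binomial test in plain Python, not a reproduction of the SPSS procedure. The count of 13 lean subjects is an assumption inferred from the reported proportion 0.33 among 39 patients, and the function name `binomial_two_sided_p` is illustrative only.

```python
from math import comb


def binomial_two_sided_p(k, n, p0):
    """Exact two-sided binomial p-value for H0: proportion = p0.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under H0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # A small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)


n_subjects = 39   # from the slides
n_lean = 13       # ASSUMPTION: inferred from 0.33 * 39, not stated in the slides
p_value = binomial_two_sided_p(n_lean, n_subjects, 0.5)
print(f"observed proportion = {n_lean / n_subjects:.2f}, p-value = {p_value:.3f}")
```

Under this assumed count the exact two-sided p-value rounds to 0.053, in line with the value reported in the SPSS output.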

Chi-square Test
• Formula: If x_i (i = 1, 2, …, n) are independent and normally distributed with mean µ and variance σ², then
  $$\sum_{i=1}^{n} \left( \frac{x_i - \mu}{\sigma} \right)^2$$
  follows a χ² distribution with n d.f.
• If we don't know µ, then we estimate it using the sample mean x̄, and then
  $$\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^2$$
  follows a χ² distribution with (n − 1) d.f.

The Pearson Chi-squared Test
Consider a contingency table. The number of units that fall in a cell is the cell's observed frequency, and the number predicted by theory to do so is the cell's expected frequency. The Pearson chi-squared test statistic, which summarizes the difference between observed and expected frequencies, is
  $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$
distributed as χ² with (n − 1) d.f., where n is the number of categories of a single variable.
O_i = observed frequency
E_i = expected frequency

The Pearson Chi-squared Test
• Use
  – Testing the equality of proportions for all categories of a variable
  – Testing user-specified proportions for all categories of a variable
  – Testing the independence/association of attributes
  – Testing the population variance, σ² = σ₀²
• Assumptions
  – Sample observations should be independent.
  – Expected cell frequencies should be ≥ 5.
  – Total observed and total expected frequencies are equal.

The Pearson Chi-squared Test – calculation of expected frequency
• Expected frequency for any cell
  – Single variable:
    • The probability associated with a cell is multiplied by the total number of subjects.
  – Two variables (contingency table):
    • (row total × column total) / grand total

The Pearson Chi-squared Test – calculation of degrees of freedom (df)
• Single variable: (Number of categories − 1)
  E.g., the variable bmigrp3 has 3 categories.
  So the df is (3 − 1) = 2.
• Two variables (contingency table): (Number of rows − 1) × (Number of columns − 1)
  E.g., to compare the distribution of obesity status (bmigrp3) between two treatment groups, the associated df is (3 − 1) × (2 − 1) = 2.

The Pearson Chi-squared Test – Skewness
• The distribution of the Chi-square statistic is positively skewed. That is, it has a long tail to the right.
• As the df increases, the distribution of the Chi-square statistic becomes more symmetric.

Testing the equality of proportions of a variable
H0: the proportions of subjects are equal in all three groups.
Ha: the three proportions are not all equal.
The asymptotic p-value is 0.006, which is much smaller than the level of significance (0.05). It indicates a significant difference in the proportions of subjects among the three groups.

Testing the user-specified proportions of a variable
Let us test the hypothesis that the proportions of subjects are 0.3, 0.4, and 0.3 in the lean, overweight, and obese groups, respectively.
The asymptotic p-value is 0.001, which is much smaller than the level of significance (0.05). It indicates that the proportions of subjects in the three groups are significantly different from the specified proportions.

Testing the independence/association of attributes

SPSS output: Testing the independence/association of attributes
Testing the distribution of obesity status in two treatment groups. That is, testing the association of obesity and treatment groups.
The value of the Pearson chi-squared statistic is 4.005 and the df is 2. The asymptotic p-value is 0.135, which is greater than the level of significance (0.05).
Question: Can we reject the H0?
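
The expected-frequency and df rules for a contingency table can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SPSS computation: the 3 × 2 table of counts below is hypothetical (the Retinoid cell counts are not given in these slides), and the function names are illustrative. The p-value helper relies on the fact that a χ² distribution with exactly 2 d.f. has upper-tail probability exp(−x/2), so it applies only to the df = 2 case.

```python
from math import exp


def pearson_chi_square(table):
    """Return (statistic, df, expected) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency = (row total x column total) / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # df = (number of rows - 1) x (number of columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def chi_square_p_value_df2(statistic):
    """Upper-tail p-value for a chi-square statistic with exactly 2 d.f."""
    return exp(-statistic / 2)


# HYPOTHETICAL counts: rows = lean, overweight, obese; columns = two treatment groups
hypothetical_table = [[8, 5], [4, 9], [7, 6]]
stat, df, expected = pearson_chi_square(hypothetical_table)
print(f"hypothetical table: chi-square = {stat:.3f}, df = {df}")

# Statistic reported in the SPSS output above (4.005 with 2 d.f.)
reported_p = chi_square_p_value_df2(4.005)
print(f"p-value for 4.005 with 2 d.f. = {reported_p:.3f}")
```

Applied to the reported statistic of 4.005 with 2 d.f., the helper returns a p-value that rounds to 0.135, matching the asymptotic p-value in the SPSS output.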

Limitations of Chi-square Test
• The only product is a p-value; there is no parameter describing the degree or strength of association.
• May not be appropriate for a small sample size, especially when any cell has an expected frequency less than 5.

Fisher Exact Test
• An exact test for the analysis of a 2x2 contingency table
• Most useful for small sample sizes, especially when the Pearson Chi-squared test is not applicable
• The exact probability of observing the cell counts is calculated using the hypergeometric distribution
• No distributional assumption is needed

Odds and Odds Ratio
• Like a proportion, the odds of an event are also very useful for describing categorical data.
• Odds, odds ratios, and related regressions will be covered in the next class.

Thank you