P. V. Prathyusha, Research Assistant, Dept. of Biostatistics, NIMHANS

Bivariate Description
Usually we want to study associations between two or more variables.
Two quantitative variables: show the data using scatterplots and correlation.
Two categorical variables: show the data using contingency tables.
A mixture of a categorical and a quantitative variable: give numerical summaries (mean, standard deviation) or side-by-side box plots for the groups.
Example (General Social Survey, GSS, data): Men: mean = 7.0, s = 8.4; Women: mean = 5.9, s = 6.0.

Categorical data
Age in years (< 15, 15-30, 30-45, 45+)      Ordinal
Gender (M, F)                               Nominal
Diagnosis (Normal, Abnormal)                Nominal
Improvement (Mild, Moderate, Fair)          Ordinal
SES (Low, Medium, High)                     Ordinal
Locality (Rural / Urban)                    Nominal
Anxiety score (< 13, 13-23, 24-40, 41+)     Ordinal

Contingency Tables
Cross-classifications of categorical variables in which the rows (typically) represent categories of the explanatory variable and the columns represent categories of the response variable.
Counts in the "cells" of the table give the numbers of individuals at the corresponding combination of levels of the two variables.
Contingency tables enable us to compare one characteristic of the sample (e.g. smoking) across the levels of another categorical variable (e.g. gender).

Happiness and Family Income
Income            Very    Pretty   Not too   Row total
-------------------------------------------------------
Above average      164      233        26        423
Average            293      473       117        883
Below average      132      383       172        687
-------------------------------------------------------
Column total       589     1089       315       1993
Row and column totals are called marginal counts.

What can a contingency table do?
It can summarize the data by percentages on the response variable (happiness):
Income            Very        Pretty       Not too      Total
---------------------------------------------------------------
Above average     164 (39%)   233 (55%)     26 (6%)       423
Average           293 (33%)   473 (54%)    117 (13%)      883
Below average     132 (19%)   383 (56%)    172 (25%)      687
---------------------------------------------------------------
Example: the percentage "very happy" is
  39% for above-average income (164/423 = 0.39)
  33% for average income (293/883 = 0.33)
  19% for below-average income (132/687 = 0.19)

What can a contingency table do?
It lets us examine the association between two categorical variables. For example:
Is there any association between gender and headache?
Is there any association between taking aspirin and the risk of heart attack in the population?
Is lung cancer associated with smoking or not?
Is diabetes associated with type of occupation or not?

Observed frequencies
Depending on the subjects' responses, the data can be summarized in a table like this. The observed numbers (counts) are:
Gender           Headache: Yes   Headache: No   Marginal total (row)
Men                    10              30                40
Women                  23              17                40
Marginal total
(column)               33              47                80
This is what we have observed in the random sample of 80 subjects.

Sample to population
Knowing the incidence of headache in these 80 subjects with great certainty is of limited use to us. On the basis of the observed frequencies (or percentages) we can make claims about the sample itself, but we cannot generalize to the population from which the sample was drawn unless we submit our results to a test of statistical significance. A test of statistical significance tells us how confidently we can generalize to a larger (unmeasured) population from a (measured) sample of that population.
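As a small illustration of the row-percentage summary described above, here is a minimal Python sketch (not part of the original notes; the variable names are mine). It reproduces the "very happy" percentages quoted for the happiness-by-income table.

```python
# Row-percentage summary of the happiness-by-income contingency table.
# The counts come from the table above; everything else is illustrative.
counts = {
    "Above average": [164, 233, 26],
    "Average":       [293, 473, 117],
    "Below average": [132, 383, 172],
}
categories = ["Very happy", "Pretty happy", "Not too happy"]

for income, row in counts.items():
    total = sum(row)
    cells = ", ".join(
        f"{cat}: {n} ({100 * n / total:.0f}%)" for cat, n in zip(categories, row)
    )
    print(f"{income} (n={total}): {cells}")
# e.g. the above-average row prints 164 (39%) "very happy", matching 164/423 = 0.39.
```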
Steps in a test of hypothesis
1. Identify the type of problem and the question to be answered.
2. State the null and alternative hypotheses.
3. Calculate the standard error.
4. Calculate the critical ratio. Generally this is given by
   (difference between the means or proportions) / (standard error of the difference).
5. Compare the value observed in the experiment with the value given by the table, at a predetermined significance level.
6. Make inferences.

Testing of hypothesis
We need to measure how different our observed results are from what the null hypothesis predicts. How does chi-square do this? It compares the observed frequencies in our sample with the expected frequencies. What are expected frequencies?

Expected frequencies
If the null hypothesis were true, what would the frequency in each cell be? If there were no relation between gender and headache, i.e. if men and women were equally affected, we would expect the 33 headache cases to be split roughly equally between the genders (about 17 each) and likewise the 47 subjects without headache (about 23 each).
Under the assumption of no association between gender and headache, the expected count in each cell is
Expected number = (row total × column total) / (table total)
Gender    Headache: Yes    Headache: No    Total
Men          10 (16.5)       30 (23.5)       40
Women        23 (16.5)       17 (23.5)       40
Total            33               47          80
(Observed counts, with expected counts in parentheses.)

Chi-square value
The chi-square value is a single number that adds up all the differences between the observed data and the expected data:
χ² = Σ over all cells (i, j) of (O_ij - E_ij)² / E_ij,  where O = observed count and E = expected count.
For the table above,
χ²(calculated) = (10 - 16.5)²/16.5 + (30 - 23.5)²/23.5 + (23 - 16.5)²/16.5 + (17 - 23.5)²/23.5 ≈ 8.72
For a 2 × 2 table with cells a, b, c, d, row totals R1, R2 and column totals C1, C2, an equivalent shortcut is
χ² = N (|ad - bc|)² / (R1 × R2 × C1 × C2)
(A computational check of this example is sketched at the end of this section.)

Theoretical chi-square value
Look up the theoretical chi-square value in the χ² distribution table with d.f. = (r - 1)(c - 1) to see whether the calculated value is big enough to indicate a significant association between headache and gender. For a 2 × 2 table like this, d.f. = (2 - 1)(2 - 1) = 1, and the critical value is χ²(1, 0.05) = 3.841. Since the calculated value (about 8.72) exceeds 3.841, we reject the null hypothesis and conclude that headache is associated with gender.

Degrees of freedom
Degrees of freedom are the number of independent pieces of information in the data set. In a contingency table, the degrees of freedom are the product of the number of rows minus 1 and the number of columns minus 1, i.e. (r - 1)(c - 1). In the 2 × 2 gender-headache table, once one cell is fixed (say, men with headache = 10), the marginal totals determine the remaining three cells, so d.f. = 1.

Chi-square values
If the observed data and the expected data are identical (i.e. there is no difference at all), the chi-square value is 0. Greater differences between expected and observed data produce a larger chi-square value, and the larger the chi-square value, the stronger the evidence that there really is an association.

Assumptions of the χ² test
The sample must be randomly drawn from the population.
Data must be reported as raw frequencies (not percentages).
Categories of the variables must be mutually exclusive and exhaustive.
Expected frequencies cannot be too small: the expected frequency should be more than 5 in at least 80% of the cells.

Tables of higher dimensions
The chi-square test can be employed for tables of higher dimensions too, for example the 3 × 3 happiness-by-income table shown earlier (row totals 423, 883, 687; column totals 589, 1089, 315; N = 1993).

Higher-dimension tables: example
                  KNOWLEDGE (column %)
OCCUPATION        Poor       Good
Govt sector       3 (20)      --
Pvt sector         --        6 (40)
Business          5 (33)     6 (40)
Unemployed        7 (47)     3 (20)
Chi-square value = 10.691(a), df = 3, asymp. sig. (2-sided) = .014
(a) 4 cells (50.0%) have expected count < 5
After combining the Govt sector and Pvt sector rows:
                  KNOWLEDGE (column %)
OCCUPATION        Poor       Good
Govt / Pvt        3 (20)     6 (40)
Business          5 (33)     6 (40)
Unemployed        7 (47)     3 (20)
Chi-square value = 2.691(a), df = 2, asymp. sig. (2-sided) = .260
(a) 2 cells (33.3%) have expected count < 5
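The gender-headache worked example above can be checked computationally. Below is a minimal Python sketch (illustrative only; it assumes numpy and scipy are available) that computes the expected counts from the (row total × column total) / (table total) rule, the chi-square statistic from its definition, and the same statistic via scipy.stats.chi2_contingency.

```python
# Check of the 2 x 2 gender-headache chi-square example.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[10, 30],    # men:   headache yes / no
                     [23, 17]])   # women: headache yes / no

# Expected counts: (row total x column total) / table total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()
print(expected)                    # [[16.5, 23.5], [16.5, 23.5]]

# Chi-square statistic from the definition: sum of (O - E)^2 / E over all cells
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_by_hand, 2))      # about 8.72, above the critical value 3.841

# Same statistic from scipy (correction=False gives the uncorrected Pearson chi-square)
chi2_stat, p_value, dof, exp = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)
```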
Higher-dimension tables: example
                  Depression (column %)
Family type       Normal     Borderline    Abnormal
Nuclear (n=39)    1 (50)      2 (67)       36 (95)
Joint (n=4)       1 (50)      1 (33)        2 (5)
Chi-square value = 6.715(a), df = 2, asymp. sig. (2-sided) = 0.035
(a) 5 cells (83.3%) have expected count < 5
With so many small expected counts the chi-square result is unreliable; a Mann-Whitney U test would be appropriate here.

Collapsing tables
We can often combine columns or rows to increase expected counts that are too low. This
may increase or reduce interpretability, and
may create or destroy structure in the table.
There are no clear guidelines; avoid simply trying to identify the combination of cells that produces a "significant" result.
Chi-square is basically a measure of significance; it is not a good measure of the strength of association. It can help you decide whether an association exists, but not tell you how strong it is.

Yates' correction
The chi-square distribution is a continuous distribution, whereas the observed frequencies are discrete, and the continuous approximation becomes poor when any one of the expected frequencies is less than 5. In such cases Yates' correction for continuity is applied. The formula for the chi-square test with Yates' correction is
χ² = N (|ad - bc| - N/2)² / (R1 × R2 × C1 × C2)

Fisher's exact test
The test can be used for 2 × 2 contingency tables when the sample sizes are small.
           Column 1       Column 2       Total
Row 1          a              b          R1 = a + b
Row 2          c              d          R2 = c + d
Total      C1 = a + c     C2 = b + d     N = R1 + R2
Fisher's exact probability is given by
p = (R1! R2! C1! C2!) / (N! a! b! c! d!),  where N! = 1 × 2 × 3 × … × (N - 1) × N
Example:
Gender     Age < 20 yrs   Age > 20 yrs   Total
Male             4              1           5
Female           1              5           6
Total            5              6          11
p = (a+b)! (a+c)! (b+d)! (c+d)! / (N! a! b! c! d!) = (5! × 5! × 6! × 6!) / (11! × 4! × 1! × 1! × 5!) ≈ 0.065
If the p value is less than or equal to 0.05, the null hypothesis is rejected and the difference between the rows (or columns) is considered significant. Here p = 0.065 > 0.05, so the association is not statistically significant. (A computational sketch of this example appears at the end of this section.)

Cochran (1954) suggests
The decision regarding use of chi-square should be guided by the following considerations:
1. When N > 40, use chi-square corrected for continuity.
2. When N is between 20 and 40, the chi-square test may be used if all the expected frequencies are 5 or more. If any expected frequency is less than 5, use Fisher's exact probability test.
3. When N < 20, use Fisher's test in all cases.

McNemar's test
Used in the case of two related samples or repeated measurements.
It can be used to test for the significance of changes in "before-after" designs in which each person is used as his own control. Thus the test can be used to test the effectiveness of a treatment, training programme, therapy or intervention, or to compare the ratings of two judges on the same set of individuals.

McNemar's test (continued)
                 POST: Normal   POST: Abnormal   Total
PRE: Normal            a               b           R1
PRE: Abnormal          c               d           R2
Total                  C1              C2           N
Example 1: severity before (Pre Rx) and after (Post Rx) treatment, n = 50
              Post: Severe   Post: Mild   Total
Pre: Severe         7             19        26
Pre: Mild           4             20        24
Total              11             39        50
Example 2: n = 100
              Post: Mild   Post: Severe   Total
Pre: Mild         40             8          48
Pre: Severe       45             7          52
Total             85            15         100
The McNemar test statistic is based on the discordant cells b and c (the subjects who changed category):
χ² = (b - c)² / (b + c), with 1 degree of freedom (with the continuity correction, χ² = (|b - c| - 1)² / (b + c)).
(A computational sketch for Example 1 follows below.)

What is probability?
The probability of a favorable event is the fraction of times you expect to see that event in many trials. In epidemiology and behavioural research, a "risk" is treated as a probability.
For example: you record 25 heads on 50 flips of a coin; what is the probability of a heads?
Probability of heads = (# heads) / (# trials) = 25/50 = 0.50, or 50%
Remember: a probability should never exceed 1.0, or 100%.
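Returning to the Fisher's exact example above (gender by age group, n = 11), here is a hedged Python sketch: it evaluates the factorial formula from the notes and, for comparison, calls scipy.stats.fisher_exact. Note that the factorial formula gives the probability of the observed table itself, whereas scipy sums the probabilities of all tables at least as extreme, so the two numbers need not coincide exactly.

```python
# Fisher's exact example: a=4, b=1, c=1, d=5 (cell labels as in the layout above).
from math import factorial
from scipy.stats import fisher_exact

a, b, c, d = 4, 1, 1, 5
n = a + b + c + d

# Probability of this particular table, using the factorial formula in the notes
p_table = (factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)) / (
    factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
)
print(round(p_table, 3))   # about 0.065, as quoted above

# Exact-test p-value summed over all tables at least as extreme (two-sided)
odds_ratio, p_two_sided = fisher_exact([[a, b], [c, d]], alternative="two-sided")
print(odds_ratio, p_two_sided)
```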
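And a minimal sketch of the McNemar calculation for Example 1 above, using the standard statistic on the discordant cells (b = 19 changed from severe to mild, c = 4 changed from mild to severe); scipy is used only to look up the p-value from the χ² distribution with 1 d.f.

```python
# McNemar's test for the first pre/post severity table (n = 50).
from scipy.stats import chi2

b, c = 19, 4
stat = (b - c) ** 2 / (b + c)                      # uncorrected statistic, about 9.78
stat_corrected = (abs(b - c) - 1) ** 2 / (b + c)   # with continuity correction, about 8.52
p_value = chi2.sf(stat_corrected, df=1)            # refer to chi-square with 1 d.f.
print(round(stat, 2), round(stat_corrected, 2), round(p_value, 4))
# Both statistics exceed the critical value 3.841, suggesting a real pre/post change.
```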
Relative risk (RR)
Relative risk is the risk of developing a disease relative to exposure.
It is most commonly used in cohort studies, to study incidence.
It is the ratio of the probability of the event occurring in the exposed group to that in the control (unexposed) group:
RR = (risk of developing the disease among the exposed) / (risk of developing the disease among the unexposed)
   = (incidence of disease among the exposed) / (incidence of disease among the unexposed)
   = [a / (a + b)] / [c / (c + d)]

What is the risk of myocardial infarction (MI) if a patient is taking aspirin versus a placebo?
Risk of MI for the aspirin group = 50/1080 = .046, or 4.6%
Risk of MI for the placebo group = 200/1770 = .11, or 11%
RR = [a / (a + b)] / [c / (c + d)] = 0.046 / 0.11 = 0.418
That is, patients taking aspirin have less than half the risk of MI of those on placebo.

RR example
               Lung cancer   No lung cancer
Smokers            190             450
Non-smokers         70             700
RR = [190 / (190 + 450)] / [70 / (70 + 700)] = 0.297 / 0.09 = 3.27
The risk of lung cancer is 3.27 times higher in smokers than in non-smokers.

RR = 1: an association between exposure and disease is unlikely to exist.
RR > 1: increased risk of disease among those who have been exposed.
RR < 1: decreased risk of disease among those who have been exposed.

             DM type II   No DM type II   Total
BMI < 30         25            350          375
BMI > 30         65            200          265
Total            90            550          640
RR = [a / (a + b)] / [c / (c + d)] = (25/375) / (65/265) = 0.27
Those who have BMI < 30 have less risk of developing type II diabetes.

What are odds?
The odds of a favorable event is the probability that it occurs divided by the probability that it does not occur. For example, what are the odds you will get a heads when flipping a fair coin?
Odds of heads = P(heads) / (1 - P(heads)) = 0.50 / (1 - 0.50) = 1
"The odds of flipping heads to flipping tails is 1."
In clinical and epidemiologic research we use a ratio of two odds, the odds ratio (OR), together with the relative risk (RR), to express the strength of the relationship between two variables.

Odds ratio (OR)
The odds ratio is the ratio of two odds.
It is generally used in case-control studies, to study prevalence.
It is the ratio of the odds of exposure in cases to the odds of exposure in controls, and it provides an estimate of the relative risk when the outcome is rare.
             Case   Control
Exposed        a       b
Unexposed      c       d
Odds of exposure among the cases: a/c
Odds of exposure among the controls: b/d
OR(exposure) = (odds of exposure in cases) / (odds of exposure in controls) = (a/c) / (b/d) = ad/bc

OR for cohort and cross-sectional studies
             Outcome: Yes   Outcome: No
Exposed            a             b
Unexposed          c             d
Odds of the outcome among the exposed = a/b
Odds of the outcome among the unexposed = c/d
OR(outcome) = (a/b) / (c/d) = ad/bc
The exposure odds ratio is equal to the disease odds ratio.
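A small Python sketch (illustrative; the function name is my own) of the relative-risk formula above, applied to two of the worked examples.

```python
# Relative risk for a 2 x 2 table with exposure in rows and outcome in columns.
def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)]: risk in the exposed over risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))

# Smoking and lung cancer: 190/640 vs 70/770
print(round(relative_risk(190, 450, 70, 700), 2))    # about 3.27

# Aspirin and MI: 50/1080 vs 200/1770
print(round(relative_risk(50, 1030, 200, 1570), 2))  # about 0.41 (0.418 above uses rounded risks)
```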
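A matching sketch for the odds ratio OR = ad/bc, using the two tables worked out in the "OR example" section that follows; again the function name is mine.

```python
# Odds ratio for a 2 x 2 table laid out as above (cells a, b, c, d).
def odds_ratio(a, b, c, d):
    """Odds in the first row over odds in the second row, which simplifies to ad/bc."""
    return (a * d) / (b * c)

# Aspirin vs placebo and MI
print(round(odds_ratio(50, 1030, 200, 1570), 2))   # about 0.38

# Smoking history and lung cancer
print(round(odds_ratio(190, 450, 70, 700), 2))     # about 4.22
```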
OR example
            Had MI   No MI
Aspirin        50     1030
Placebo       200     1570
Odds of myocardial infarction (MI) if a patient is taking aspirin = 50/1030
Odds of myocardial infarction (MI) if a patient is taking placebo = 200/1570
OR = (50 × 1570) / (1030 × 200) = 0.38, i.e. the odds of MI in the aspirin group are about 38% of the odds in the placebo group.

OR example
                 Lung cancer   No lung cancer
Smoking Hx           190             450
No smoking Hx         70             700
OR = (190 × 700) / (450 × 70) = 4.22

             DM type II   No DM type II
BMI < 30         25            350
BMI > 30         65            200
OR = (25 × 200) / (350 × 65) = 0.22

OR values
OR(exposure) = (odds of exposure in cases) / (odds of exposure in controls)
OR = 1: no association between exposure and outcome.
OR > 1: indicates that the exposure is associated with an increased risk of developing the disease.
OR < 1: indicates that the exposure is associated with a reduced risk of (protects against) developing the outcome.

When to use RR and OR
             Outcome: Yes   Outcome: No
Exposed            a             b
Unexposed          c             d
RR = [a / (a + b)] / [c / (c + d)]      OR = (a/b) / (c/d) = ad/bc
For a rare condition, a and c will be very small compared with b and d, so a/(a + b) ≈ a/b and c/(c + d) ≈ c/d, and hence
RR ≈ (a/b) / (c/d) = ad/bc
So, given a rare condition, the odds ratio approximates the relative risk.
Relative risk can only be calculated for prospective studies. The odds ratio can be calculated for any of the designs: case-control, cross-sectional or cohort.

Diagnostic tests
A diagnostic procedure or test gives us an answer to the following question: "How well does this test discriminate between two conditions of interest (health and disease, two stages of a disease, etc.)?" This discriminative ability can be quantified by the following measures:
sensitivity and specificity
positive and negative predictive values (PPV, NPV)
diagnostic accuracy

Sensitivity and specificity
These measure how "good" a test is at detecting a binary feature of interest (disease / no disease).
Sensitivity is the ability of a test to correctly classify an individual as "diseased".
The ability of a test to correctly classify an individual as disease-free is called the test's specificity.
Usually the "true" disease status is determined by some "gold standard" method.
For a given test, sensitivity increases as specificity decreases, and vice versa.
            Disease present      Disease absent     Total
Test +             a                    b           a + b  (total test positive)
Test -             c                    d           c + d  (total test negative)
Total            a + c                b + d
           (total diseased)     (total normals)
Sensitivity = a / (a + c) × 100
Specificity = d / (b + d) × 100
Example:
            Disease present   Disease absent   Total
Test +            25                 2            27
Test -             5                68            73
Total             30                70           100
Sensitivity = 25/30 (0.83); specificity = 68/70 (0.97)

Positive predictive value (PPV)
It is the percentage of patients with a positive test who actually have the disease.
PPV = a / (a + b) = probability that the patient has the disease when the test is positive.

Negative predictive value (NPV)
It is the percentage of patients with a negative test who do not have the disease.
NPV = d / (c + d) = probability that the patient does not have the disease when the test is negative.
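To tie the diagnostic-test measures together, here is a minimal Python sketch (the function name is mine) that computes sensitivity, specificity, PPV and NPV from the example 2 × 2 table above.

```python
# Diagnostic-test measures for a 2 x 2 table:
# a = test+/diseased, b = test+/not diseased, c = test-/diseased, d = test-/not diseased.
def diagnostic_measures(a, b, c, d):
    return {
        "sensitivity": a / (a + c),   # proportion of diseased correctly identified
        "specificity": d / (b + d),   # proportion of non-diseased correctly identified
        "PPV": a / (a + b),           # P(disease | test positive)
        "NPV": d / (c + d),           # P(no disease | test negative)
    }

for name, value in diagnostic_measures(25, 2, 5, 68).items():
    print(f"{name}: {value:.2f}")
# sensitivity 25/30 = 0.83, specificity 68/70 = 0.97, PPV 25/27 = 0.93, NPV 68/73 = 0.93
```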