Chi Squared Tests

Introduction
• Two statistical techniques are presented. Both are used to analyze nominal data.
– A goodness-of-fit test for a multinomial experiment.
– A contingency table test of independence.
• The test statistics in both cases follow the $\chi^2$ distribution.

Chi-Squared Goodness-of-Fit Test
• The hypothesis tested involves the “success” probabilities $p_1, p_2, \ldots, p_k$ of a multinomial distribution.
• The multinomial experiment is an extension of the binomial experiment.
– There are n trials.
– The outcome of each trial can be classified into one of k categories, called cells.
– The probability $p_i$ for an outcome to fall into cell i remains constant for each trial. By assumption, $p_1 + p_2 + \cdots + p_k = 1$.
– Trials in the experiment are independent.
• Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for $p_i$.
• The hypotheses:
$H_0: p_1 = a_1,\ p_2 = a_2,\ \ldots,\ p_k = a_k$
$H_1$: At least one $p_i \neq a_i$
• The test builds on comparing the actual frequency and the expected frequency of occurrences in all cells.

An Example
• Example 16.1
– Two competing companies, A and B, have been dominant players in the market. Both companies recently conducted advertising campaigns for their products.
– Market shares before the campaigns were:
• Company A = 45%
• Company B = 40%
• Other competitors = 15%
– To study the effect of the campaigns on the market shares, a survey was conducted.
– 200 customers were asked to indicate their preference regarding the products advertised.
– Survey results:
• 102 customers preferred company A’s product,
• 82 customers preferred company B’s product,
• 16 customers preferred the competitors’ products.
– Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?
• Solution
– The population investigated is the brand preferences.
– The data are nominal (A, B, or other).
– This is a multinomial experiment (three categories).
– The question of interest: Are $p_1$, $p_2$, and $p_3$ different after the campaigns from their values prior to the campaigns?
• The hypotheses are:
$H_0: p_1 = .45,\ p_2 = .40,\ p_3 = .15$
$H_1$: At least one $p_i$ changed.
• The expected frequency for each category (cell) if the null hypothesis is true, next to the actual frequency the sample returned:

Cell             Expected frequency    Actual frequency
1 (Company A)    200(.45) = 90         102
2 (Company B)    200(.40) = 80          82
3 (Other)        200(.15) = 30          16

• The statistic is:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \qquad \text{where } e_i = n p_i$$
Here $f_i$ is the observed frequency and $e_i$ the expected frequency of cell i. Intuitively, this measures the extent of the differences between the observed and the expected frequencies.
• The rejection region is:
$$\chi^2 > \chi^2_{\alpha,\,k-1}$$
• For Example 16.1:
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18$$
$$\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$$
The p-value is $P(\chi^2 > 8.18) \approx .0167$ [this comes from Excel: CHIDIST(8.18,2)].
[Figure: density of the $\chi^2$ distribution with 2 degrees of freedom, marking the rejection region to the right of 5.99 (area alpha) and the p-value as the area to the right of 8.18.]
• Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities $p_i$ is different. Thus, at least two market shares have changed.

Required Conditions – The Rule of Five
• The test statistic used to perform the test is only approximately chi-squared distributed.
• For the approximation to apply, the expected cell frequency has to be at least 5 for all cells ($n p_i \geq 5$).
• If the expected frequency in a cell is less than 5, combine it with other cells.

Chi-squared Test of a Contingency Table
• This test is used to determine whether
– two nominal variables are related;
– there are differences between two or more populations of a nominal variable.
• To accomplish the test objectives, we need to classify the data according to two different criteria.
• The idea is also based on goodness of fit.
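As a check on the goodness-of-fit arithmetic in Example 16.1, the calculation can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `chi2_goodness_of_fit` is hypothetical, and the p-value and critical value use the closed form $P(\chi^2 > x) = e^{-x/2}$, which holds only for 2 degrees of freedom.

```python
import math


def chi2_goodness_of_fit(observed, null_probs):
    """Return (statistic, expected frequencies) for a multinomial goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_probs]          # e_i = n * p_i
    statistic = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return statistic, expected


# Example 16.1: survey counts and the pre-campaign market shares under H0.
observed = [102, 82, 16]
null_probs = [0.45, 0.40, 0.15]

stat, expected = chi2_goodness_of_fit(observed, null_probs)

# Rule of five: every expected cell frequency should be at least 5.
rule_of_five_ok = all(e >= 5 for e in expected)

# Closed forms below are valid ONLY for k - 1 = 2 degrees of freedom.
p_value = math.exp(-stat / 2)          # P(chi2 > stat) when df = 2
critical = -2 * math.log(0.05)         # chi2_{.05, 2}
reject_h0 = stat > critical

print(f"statistic = {stat:.2f}, critical = {critical:.2f}, p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

For other degrees of freedom the p-value would need a general chi-squared tail function (for instance from a statistics library) in place of the closed form used here.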
• Example 16.2
– In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic backgrounds affect their choice of MBA major and thus their course selection.
– A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data (the observed values).

Undergraduate               MBA Major
Degree         Accounting   Finance   Marketing   Total
BA                 31          13        16         60
BENG                8          16         7         31
BBA                12          10        17         39
Other              10           5         7         22
Total              61          44        47        152

• There are two ways to view this problem:
– If each classification is considered a nominal variable, are these two variables dependent?
– If each undergraduate degree is considered a population, do these populations differ?
• Solution
– The hypotheses are:
$H_0$: The two variables are independent
$H_1$: The two variables are dependent
– The test statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$$
where k is the number of cells in the contingency table.
– Since $e_i = n p_i$ but $p_i$ is unknown, we need to estimate the unknown probability from the data, assuming $H_0$ is true.
– The rejection region is
$$\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$$
where r and c are the numbers of rows and columns of the table.

Estimating the expected frequencies
• The marginal totals give the estimated probabilities:

Undergraduate               MBA Major
Degree         Accounting   Finance   Marketing   Total   Probability
BA                                                  60      60/152
BENG                                                31      31/152
BBA                                                 39      39/152
Other                                               22      22/152
Total              61          44        47        152
Probability      61/152      44/152    47/152

• Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting) × P(BA) = [61/152][60/152].
• The number of students expected to fall in the cell “Accounting - BA” is
$e_{\text{Acct-BA}} = n\,p_{\text{Acct-BA}} = 152(61/152)(60/152) = (61)(60)/152 = 24.08$
• The number of students expected to fall in the cell “Finance - BBA” is
$e_{\text{Finance-BBA}} = n\,p_{\text{Finance-BBA}} = 152(44/152)(39/152) = (44)(39)/152 = 11.29$
• The expected frequency of the cell in row i and column j of the contingency table is calculated by:
$$e_{ij} = \frac{(\text{Row } i \text{ total})(\text{Column } j \text{ total})}{\text{Sample size}}$$

Calculation of the $\chi^2$ statistic
• Solution, continued. Observed frequencies, with the expected frequencies in parentheses:

Undergraduate                    MBA Major
Degree         Accounting     Finance       Marketing     Total
BA             31 (24.08)     13 (17.37)    16 (18.55)      60
BENG            8 (12.44)     16 (8.97)      7 (9.59)       31
BBA            12 (15.65)     10 (11.29)    17 (12.06)      39
Other          10 (8.83)       5 (6.37)      7 (6.80)       22
Total          61             44            47             152

$$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70$$
– The critical value in our example is:
$$\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$$
• Conclusion: Since $\chi^2$ = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their choice of MBA major are dependent.

Using the computer
• Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.
• Define a code to specify each nominal value. Input the data in columns, one column for each category.
• Code:
– Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
– MBA Major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING
• Raw data (first rows):

Degree   MBA Major
  3         1
  1         1
  1         1
  1         1
  2         2
  1         3
  .         .
  .         .

• Output:

Contingency Table
           1     2     3   Total
1         31    13    16     60
2          8    16     7     31
3         12    10    17     39
4         10     5     7     22
Total     61    44    47    152

Test Statistic CHI-Squared = 14.7019
P-Value = 0.0227
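The same calculation can be reproduced outside Excel. The following is a minimal Python sketch, not the Data Analysis Plus procedure itself: it builds the expected frequencies from the row and column totals of Example 16.2 and sums the cell contributions. The helper `chi2_sf_even_df` is a hypothetical name; it uses the closed-form tail probability of the chi-squared distribution, which is valid only when the degrees of freedom are even (here 6).

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi2 > x) for an EVEN number of degrees of freedom (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


# Observed counts from Example 16.2: rows = BA, BENG, BBA, Other;
# columns = Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row i total)(column j total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

stat = sum(
    (f - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for f, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # (r - 1)(c - 1)
p_value = chi2_sf_even_df(stat, df)

print(f"chi-squared = {stat:.4f}, df = {df}, p-value = {p_value:.4f}")
```

Because the p-value is below .05, the script leads to the same decision as comparing the statistic with the critical value 12.5916.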