Chapter 16 Chi-Squared Tests

16.1 Introduction
• Two statistical techniques for analyzing nominal data are presented:
  – A goodness-of-fit test for the multinomial experiment.
  – A contingency table test of independence.
• Both tests use the chi-squared ($\chi^2$) distribution as the sampling distribution of the test statistic.

16.2 Chi-Squared Goodness-of-Fit Test
• The hypothesis tested involves the probabilities $p_1, p_2, \ldots, p_k$ of a multinomial distribution.
• The multinomial experiment is an extension of the binomial experiment.
  – There are $n$ trials.
  – The outcome of each trial can be classified into one of $k$ categories, called cells.
  – The probability $p_i$ that the outcome falls into cell $i$ remains constant from trial to trial. Moreover, $p_1 + p_2 + \cdots + p_k = 1$.
  – The trials of the experiment are independent.
• We test whether there is sufficient evidence to reject a pre-specified set of values for the $p_i$.
• The hypotheses:
  $H_0: p_1 = a_1,\ p_2 = a_2,\ \ldots,\ p_k = a_k$
  $H_1:$ At least one $p_i \neq a_i$
• The test compares the actual (observed) frequency with the expected frequency of occurrences in each cell.

The multinomial goodness-of-fit test – Example
• Example 16.1
  – Two competing companies, A and B, enjoy a dominant position in the market. The companies conducted aggressive advertising campaigns.
  – Market shares before the campaigns were:
    • Company A = 45%
    • Company B = 40%
    • Other competitors = 15%
• Example 16.1 – continued
  – To study the effect of the campaigns on the market shares, a survey was conducted.
  – 200 customers were asked to indicate their preference regarding the product advertised.
  – Survey results:
    • 102 customers preferred company A’s product,
    • 82 customers preferred company B’s product,
    • 16 customers preferred a competitor’s product.
• Example 16.1 – continued
  – Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?
• Solution
  – The population investigated is the brand preferences.
  – The data are nominal (A, B, or other).
  – This is a multinomial experiment (three categories).
  – The question of interest: Are $p_1$, $p_2$, and $p_3$ different after the campaigns from their values before the campaigns?
• The hypotheses are:
  $H_0: p_1 = .45,\ p_2 = .40,\ p_3 = .15$
  $H_1:$ At least one $p_i$ differs from its specified value.
• The expected frequency for each category (cell) if the null hypothesis is true, next to the actual frequency the sample returned:

  Cell            Expected frequency e_i = n p_i    Actual frequency f_i
  1 (Company A)   200(.45) = 90                     102
  2 (Company B)   200(.40) = 80                     82
  3 (Other)       200(.15) = 30                     16

• The test statistic is
  $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$, where $e_i = n p_i$.
• The rejection region is
  $\chi^2 > \chi^2_{\alpha,\,k-1}$.
• Example 16.1 – continued
  $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18$
  $\chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147$
  The p-value is $P(\chi^2 > 8.18) = .0167$ [from Excel: CHIDIST(8.18,2)].

  [Figure: $\chi^2$ density with 2 degrees of freedom. The rejection region (area $\alpha$ = .05) lies to the right of 5.99; the p-value is the area to the right of 8.18.]

• Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities $p_i$ is different. Because the probabilities must sum to 1, at least two market shares have changed.

Required conditions – the rule of five
• The test statistic used to perform the test is only approximately chi-squared distributed.
• For the approximation to apply, the expected cell frequency has to be at least 5 for all the cells ($n p_i \geq 5$).
• If the expected frequency in a cell is less than 5, combine it with other cells.
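The goodness-of-fit calculation and the rule-of-five check for Example 16.1 can be sketched in a few lines of Python, using only the standard library. This is an illustrative sketch, not the method used in the text: the helper names `chi2_sf_even_df` and `goodness_of_fit` are hypothetical, the p-value helper relies on a closed form that holds only when the degrees of freedom are even (here 2), and the critical value is copied from the example rather than computed.

```python
import math


def chi2_sf_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x); closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form holds only for positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


def goodness_of_fit(observed, null_probs):
    """Return (expected frequencies, chi-squared statistic, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in null_probs]  # e_i = n * p_i under H0
    # Rule of five: the chi-squared approximation needs every e_i >= 5.
    if min(expected) < 5:
        raise ValueError("an expected frequency is below 5; combine cells first")
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return expected, stat, len(observed) - 1


# Example 16.1: survey counts and pre-campaign market shares (values from the text).
observed = [102, 82, 16]
null_probs = [0.45, 0.40, 0.15]
critical = 5.99147  # chi-squared critical value for alpha = .05, 2 df, as quoted in the text

expected, stat, df = goodness_of_fit(observed, null_probs)
p_value = chi2_sf_even_df(stat, df)

print(f"expected frequencies: {expected}")
print(f"chi-squared = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if stat > critical else "do not reject H0")
```

Raising an error when an expected frequency falls below 5 is a design choice for the sketch; in practice the cells would be combined and the test rerun with fewer categories.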
16.3 Chi-Squared Test of a Contingency Table
• This test is used to determine whether
  – two nominal variables are related, or
  – there are differences between two or more populations of a nominal variable.
• To accomplish the test objectives, we need to classify the data according to two different criteria.

Contingency table $\chi^2$ test – Example
• Example 16.2
  – In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students’ academic backgrounds affect their choice of MBA major and, thus, their course selection.
  – A random sample of last year’s MBA students was selected. The following contingency table summarizes the relevant data.

The observed values:

  Degree    Accounting   Finance   Marketing   Total
  BA        31           13        16          60
  BENG      8            16        7           31
  BBA       12           10        17          39
  Other     10           5         7           22
  Total     61           44        47          152

• There are two ways to address the problem:
  – If each classification is considered a nominal variable, are these two variables dependent?
  – If each undergraduate degree is considered a population, do these populations differ?
• Solution
  – The hypotheses are:
    $H_0$: The two variables are independent
    $H_1$: The two variables are dependent
  – The test statistic is
    $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
    where $k$ is the number of cells in the contingency table.
  – The rejection region is
    $\chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)}$
    where $r$ and $c$ are the numbers of rows and columns in the table.
  – Since $e_i = n p_i$ but $p_i$ is unknown, we need to estimate the unknown probability from the data, assuming $H_0$ is true.

Estimating the expected frequencies

  Undergraduate degree   Accounting   Finance   Marketing   Total   Probability
  BA                                                        60      60/152
  BENG                                                      31      31/152
  BBA                                                       39      39/152
  Other                                                     22      22/152
  Total                  61           44        47          152
  Probability            61/152       44/152    47/152

Under the null hypothesis the two variables are independent:
P(Accounting and BA) = P(Accounting) × P(BA) = [61/152][60/152].
The number of students expected to fall in the cell “Accounting - BA” is
$e_{\text{Acct-BA}} = n\,p_{\text{Acct-BA}} = 152(61/152)(60/152) = (61)(60)/152 = 24.08$
The number of students expected to fall in the cell “Finance - BBA” is
$e_{\text{Finance-BBA}} = n\,p_{\text{Finance-BBA}} = 152(44/152)(39/152) = (44)(39)/152 = 11.29$

The expected frequencies for a contingency table
• The expected frequency of the cell in row $i$ and column $j$ of the contingency table is
  $e_{ij} = \frac{(\text{Row } i \text{ total})(\text{Column } j \text{ total})}{\text{Sample size}}$

Calculation of the $\chi^2$ statistic
• Solution – continued
  Observed frequencies, with the expected frequencies in parentheses:

  Undergraduate degree   Accounting   Finance      Marketing    Total
  BA                     31 (24.08)   13 (17.37)   16 (18.55)   60
  BENG                   8 (12.44)    16 (8.97)    7 (9.59)     31
  BBA                    12 (15.65)   10 (11.29)   17 (12.06)   39
  Other                  10 (8.83)    5 (6.37)     7 (6.80)     22
  Total                  61           44           47           152

  $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \cdots + \frac{(5-6.37)^2}{6.37} + \cdots + \frac{(7-6.80)^2}{6.80} = 14.70$

Contingency table $\chi^2$ test – Example
• Solution – continued
  – The critical value in our example is:
    $\chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916$
• Conclusion: Since $\chi^2$ = 14.70 > 12.5916, there is sufficient evidence to infer at the 5% significance level that students’ undergraduate degree and their choice of MBA major are dependent.

Using the computer
• Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.
• Define a code to specify each nominal value.
• Input the data in columns, one column for each classification.
• Code:
  – Undergraduate degree: 1 = BA, 2 = BENG, 3 = BBA, 4 = OTHERS
  – MBA major: 1 = ACCOUNTING, 2 = FINANCE, 3 = MARKETING
• Sample of the coded data (columns: Degree, MBA Major): 3 1 1 1 1 1 1 1 2 2 1 3 . . .
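The expected-frequency rule and the $\chi^2$ statistic for Example 16.2 can also be checked outside Excel. The Python sketch below is one possible way to do so with the standard library only; it starts from the table of counts rather than the coded raw data. The variable names are illustrative, the p-value uses a closed form that is valid only for even degrees of freedom (here 6), and the critical value is copied from the example rather than computed.

```python
import math

# Observed counts from Example 16.2.
# Rows: BA, BENG, BBA, Other. Columns: Accounting, Finance, Marketing.
observed = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (row i total)(column j total) / sample size, which assumes independence (H0).
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Rule of five: every expected frequency should be at least 5.
if min(min(row) for row in expected) < 5:
    raise ValueError("an expected frequency is below 5; combine rows or columns first")

stat = sum(
    (f - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for f, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability of a chi-squared variable; this closed form needs even df.
assert df % 2 == 0, "closed form below holds only for even df"
half = stat / 2.0
p_value = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

critical = 12.5916  # chi-squared critical value for alpha = .05, 6 df, as quoted in the text

print(f"chi-squared = {stat:.4f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if stat > critical else "do not reject H0")
```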
Data Analysis Plus output:

  Contingency Table
            1     2     3     Total
  1         31    13    16    60
  2         8     16    7     31
  3         12    10    17    39
  4         10    5     7     22
  Total     61    44    47    152

  Test Statistic CHI-Squared = 14.7019
  P-Value = 0.0227

Required condition – the rule of five
– The $\chi^2$ distribution provides an adequate approximation to the sampling distribution provided that $e_{ij} \geq 5$ for all the cells.
– When $e_{ij} < 5$ for some cell, rows or columns must be combined so that the condition is met.

Example (observed frequencies, with expected frequencies in parentheses):

  Column 1     Column 2     Column 3
  10 (10.1)    14 (12.8)    4 (5.1)
  12 (12.7)    16 (16.0)    7 (6.3)
  8 (7.2)      8 (9.2)      4 (3.6)

The expected frequency 3.6 is below 5, so we combine columns 2 and 3:

  Column 1     Columns 2 + 3
  10 (10.1)    14 + 4 = 18 (12.8 + 5.1 = 17.9)
  12 (12.7)    16 + 7 = 23 (16.0 + 6.3 = 22.3)
  8 (7.2)      8 + 4 = 12 (9.2 + 3.6 = 12.8)

16.5 Chi-Squared Test for Normality
• The goodness-of-fit chi-squared test can be used to determine whether data were drawn from a specified distribution.
• The general procedure:
  – Hypothesize the parameter values of the distribution being tested (e.g. $\mu = \mu_0$, $\sigma = \sigma_0$ for the normal distribution).
  – For the variable tested, $X$, specify disjoint ranges that cover all its possible values.
  – Build a chi-squared statistic that compares, over all ranges, the expected frequency under $H_0$ with the actual frequency of observations that fall in each range.
  – Run a goodness-of-fit test based on the multinomial experiment.
• Testing for normality in Example 12.1
  For a sample of size n = 50 (see Xm12-01), the sample mean was 460.38 with a standard deviation of 38.83. Can we infer from the data provided that this sample was drawn from a normal distribution with $\mu$ = 460.38 and $\sigma$ = 38.83? Use a 5% significance level.

$\chi^2$ test for normality
• Solution
  First let us select z values that define each cell (expected frequency > 5 for each cell).
  $z_1 = -1$; $P(z < -1) = p_1 = .1587$; $e_1 = np_1 = 50(.1587) = 7.94$
  $z_2 = 0$; $P(-1 < z < 0) = p_2 = .3413$; $e_2 = np_2 = 50(.3413) = 17.07$
  $z_3 = 1$; $P(0 < z < 1) = p_3 = .3413$; $e_3 = 17.07$
  $P(z > 1) = p_4 = .1587$; $e_4 = 7.94$
  The cell boundaries are calculated from the corresponding z values under $H_0$.
  $z_1 = (x_1 - 460.38)/38.83 = -1$; $x_1 = 421.55$
  Similarly, $z_2 = 0$ gives $x_2 = 460.38$ and $z_3 = 1$ gives $x_3 = 499.21$.
  The expected frequencies can now be determined for each cell.

  Cell   Range                  Probability   Expected e_i   Observed f_i
  1      x < 421.55             .1587         7.94           10
  2      421.55 < x < 460.38    .3413         17.07          13
  3      460.38 < x < 499.21    .3413         17.07          19
  4      x > 499.21             .1587         7.94           8

  – The test statistic
    $\chi^2 = \frac{(10-7.94)^2}{7.94} + \frac{(13-17.07)^2}{17.07} + \frac{(19-17.07)^2}{17.07} + \frac{(8-7.94)^2}{7.94} = 1.72$
  – The rejection region
    $\chi^2 > \chi^2_{\alpha,\,k-1-L}$, where $L$ is the number of parameters estimated from the data.
    Here $k = 4$ and $L = 2$ (the mean and the standard deviation), so
    $\chi^2_{\alpha,\,k-3} = \chi^2_{.05,\,4-3} = 3.84146$
• Conclusion: Since 1.72 < 3.84146, there is insufficient evidence to conclude at the 5% significance level that the data are not normally distributed.
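The normality test above can be retraced with a short Python sketch, given here as an illustration under stated assumptions rather than as the procedure used in the text. It uses `statistics.NormalDist` from the standard library for the cell probabilities; the variable names are illustrative, the observed counts and the critical value are copied from the example, and the p-value formula applies only to 1 degree of freedom. Because the sketch keeps the cell probabilities unrounded, its statistic can differ from the 1.72 above in the second decimal place.

```python
import math
from statistics import NormalDist

# Values from the example: sample size, estimated mean and standard deviation.
n, mean, sd = 50, 460.38, 38.83
z_cuts = [-1.0, 0.0, 1.0]    # z values that separate the four cells
observed = [10, 13, 19, 8]   # observed counts per cell, as given in the text
params_estimated = 2         # L: mean and standard deviation were estimated from the data

# Cell probabilities under H0 from the standard normal CDF.
std_normal = NormalDist()
cdf = [0.0] + [std_normal.cdf(z) for z in z_cuts] + [1.0]
probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in probs]

# Cell boundaries on the original scale: x = mean + z * sd.
boundaries = [mean + z * sd for z in z_cuts]

# Rule of five: every expected frequency should be at least 5.
if min(expected) < 5:
    raise ValueError("an expected frequency is below 5; choose wider cells")

stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
df = len(observed) - 1 - params_estimated

# For 1 degree of freedom only: P(chi-squared > x) = erfc(sqrt(x / 2)).
assert df == 1, "the p-value formula below holds only for df = 1"
p_value = math.erfc(math.sqrt(stat / 2.0))

critical = 3.84146  # chi-squared critical value for alpha = .05, 1 df, as quoted in the text

print(f"boundaries: {[round(b, 2) for b in boundaries]}")
print(f"chi-squared = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0" if stat > critical else "do not reject H0")
```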