Additional Tests for Qualitative data

Chi-Squared Tests: Introduction

• Two statistical techniques are presented. Both are used to analyze nominal data.
  – A goodness-of-fit test for a multinomial experiment.
  – A contingency table test of independence.
• The test statistics in both cases follow the $\chi^2$ distribution.

Chi-Squared Goodness-of-Fit Test

• The hypothesis tested involves the "success" probabilities $p_1, p_2, \dots, p_k$ of a multinomial distribution.
• The multinomial experiment is an extension of the binomial experiment.
  – There are $n$ independent trials.
  – The outcome of each trial can be classified into one of $k$ categories, called cells.
  – The probability $p_i$ that an outcome falls into cell $i$ remains constant from trial to trial. By assumption, $p_1 + p_2 + \dots + p_k = 1$.
• Our objective is to find out whether there is sufficient evidence to reject a pre-specified set of values for the $p_i$.
• The hypotheses:
  $H_0$: $p_1 = a_1,\ p_2 = a_2,\ \dots,\ p_k = a_k$
  $H_1$: At least one $p_i \ne a_i$
• The test compares the actual (observed) frequency with the expected frequency of occurrences in all $k$ cells.

An Example

• Example 16.1
  – Two competing companies, A and B, have been the dominant players in the market. Both companies recently conducted advertising campaigns for their products.
  – Market shares before the campaigns were:
    • Company A = 45%
    • Company B = 40%
    • Other competitors = 15%
  – To study the effect of the campaigns on the market shares, a survey was conducted.
  – 200 customers were asked to indicate their preference regarding the products advertised.
  – Survey results:
    • 102 customers preferred company A's product,
    • 82 customers preferred company B's product,
    • 16 customers preferred the competitors' products.
  – Can we conclude at the 5% significance level that the market shares were affected by the advertising campaigns?
• Solution
  – The population investigated is the brand preferences.
  – The data are nominal (A, B, or other).
  – This is a multinomial experiment (three categories).
  – The question of interest: Are $p_1$, $p_2$, and $p_3$ different after the campaigns from their values prior to the campaigns?
• The hypotheses are:
  $H_0$: $p_1 = .45,\ p_2 = .40,\ p_3 = .15$
  $H_1$: At least one $p_i$ changed.
• The expected frequency for each category (cell) if the null hypothesis is true, next to the actual frequency the sample returned:

  Cell             Expected frequency   Observed frequency
  1 (Company A)    90 = 200(.45)        102
  2 (Company B)    80 = 200(.40)         82
  3 (Others)       30 = 200(.15)         16

• The statistic is:
  \[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}, \qquad \text{where } e_i = n p_i \]
  Intuitively, this measures the extent of the differences between the observed frequencies $f_i$ and the expected frequencies $e_i$.
• The rejection region is:
  \[ \chi^2 > \chi^2_{\alpha,\,k-1} \]
• For Example 16.1:
  \[ \chi^2 = \frac{(102-90)^2}{90} + \frac{(82-80)^2}{80} + \frac{(16-30)^2}{30} = 8.18 \]
  \[ \chi^2_{\alpha,\,k-1} = \chi^2_{.05,\,3-1} = 5.99147 \]
  The p-value $= P(\chi^2 > 8.18) \approx .0167$ [this comes from Excel: =CHIDIST(8.18,2)].
• [Figure: density of the $\chi^2$ distribution with 2 degrees of freedom, marking alpha, the p-value, the critical value 5.99, the test statistic 8.18, and the rejection region.]
• Conclusion: Since 8.18 > 5.99, there is sufficient evidence at the 5% significance level to reject the null hypothesis. At least one of the probabilities $p_i$ is different. Because the shares must sum to 1, at least two market shares have changed.

Required Conditions – The Rule of Five

• The test statistic used to perform the test is only approximately chi-squared distributed.
• For the approximation to apply, the expected cell frequency has to be at least 5 for all cells ($n p_i \ge 5$).
• If the expected frequency in a cell is less than 5, combine it with other cells.

Chi-Squared Test of a Contingency Table

• This test is used to determine whether:
  – two nominal variables are related;
  – there are differences between two or more populations of a nominal variable.
• To accomplish the test objectives, we need to classify the data according to two different criteria.
• The idea is also based on goodness of fit.
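The goodness-of-fit calculation of Example 16.1, together with the rule-of-five check, can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the slides (which rely on Excel): the function names `chi2_sf_even_df` and `goodness_of_fit` are made up for this example, and the p-value helper is a shortcut that assumes an even number of degrees of freedom, where the upper tail of the chi-squared distribution has a closed form.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable.

    Assumption: df is a positive even integer, so the closed form
    exp(-x/2) * sum_{j < df/2} (x/2)^j / j! applies.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )


def goodness_of_fit(observed, null_probs):
    """Return (statistic, degrees of freedom, expected frequencies)."""
    n = sum(observed)
    expected = [n * p for p in null_probs]  # e_i = n * p_i
    # Rule of five: every expected cell frequency must be at least 5.
    if min(expected) < 5:
        raise ValueError("expected frequency below 5; combine cells first")
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return stat, len(observed) - 1, expected


# Example 16.1: observed preferences and pre-campaign market shares.
observed = [102, 82, 16]
null_probs = [0.45, 0.40, 0.15]

stat, df, expected = goodness_of_fit(observed, null_probs)
p_value = chi2_sf_even_df(stat, df)
print(f"chi-squared = {stat:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Rejecting the null hypothesis when the p-value is below .05 is equivalent to comparing the statistic with the critical value 5.99147.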
• Example 16.2
  – In an effort to better predict the demand for courses offered by a certain MBA program, it was hypothesized that students' academic backgrounds affect their choice of MBA major and thus their course selection.
  – A random sample of last year's MBA students was selected. The following contingency table summarizes the relevant data (the observed values).

                      MBA Major
  Degree    Accounting   Finance   Marketing   Total
  BA            31          13        16         60
  BENG           8          16         7         31
  BBA           12          10        17         39
  Other         10           5         7         22
  Total         61          44        47        152

• There are two ways to view this problem:
  – If each classification is considered a nominal variable, are these two variables dependent?
  – If each undergraduate degree is considered a population, do these populations differ?
• Solution
  – The hypotheses are:
    $H_0$: The two variables are independent
    $H_1$: The two variables are dependent
  – The test statistic is
    \[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} \]
    where $k$ is the number of cells in the contingency table.
  – Since $e_i = n p_i$ but $p_i$ is unknown, we need to estimate the unknown probabilities from the data, assuming $H_0$ is true.
  – The rejection region is
    \[ \chi^2 > \chi^2_{\alpha,\,(r-1)(c-1)} \]
    where $r$ and $c$ are the numbers of rows and columns in the table.

Estimating the expected frequencies

• Marginal totals and the probabilities estimated from them:

  Undergraduate Degree   Total   Probability
  BA                       60     60/152
  BENG                     31     31/152
  BBA                      39     39/152
  Other                    22     22/152

  MBA Major              Total   Probability
  Accounting               61     61/152
  Finance                  44     44/152
  Marketing                47     47/152

• Under the null hypothesis the two variables are independent: P(Accounting and BA) = P(Accounting) · P(BA) = [61/152][60/152].
• The number of students expected to fall in the cell "Accounting - BA" is
  \[ e_{\text{Acct-BA}} = n\,p_{\text{Acct-BA}} = 152(61/152)(60/152) = \frac{(61)(60)}{152} = 24.08 \]
• The number of students expected to fall in the cell "Finance - BBA" is
  \[ e_{\text{Finance-BBA}} = n\,p_{\text{Finance-BBA}} = 152(44/152)(39/152) = \frac{(44)(39)}{152} = 11.29 \]
• The expected frequency of the cell in row $i$ and column $j$ of the contingency table is calculated by:
  \[ e_{ij} = \frac{(\text{Column } j \text{ total})(\text{Row } i \text{ total})}{\text{Sample size}} \]

Calculation of the $\chi^2$ statistic

• Solution – continued. Observed frequencies, with the expected frequencies in parentheses:

                          MBA Major
  Degree    Accounting     Finance       Marketing     Total
  BA        31 (24.08)     13 (17.37)    16 (18.55)      60
  BENG       8 (12.44)     16 (8.97)      7 (9.59)       31
  BBA       12 (15.65)     10 (11.29)    17 (12.06)      39
  Other     10 (8.83)       5 (6.37)      7 (6.80)       22
  Total     61             44            47             152

• The statistic:
  \[ \chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = \frac{(31-24.08)^2}{24.08} + \dots + \frac{(5-6.37)^2}{6.37} + \dots + \frac{(7-6.80)^2}{6.80} = 14.70 \]
• The critical value in our example is:
  \[ \chi^2_{\alpha,\,(r-1)(c-1)} = \chi^2_{.05,\,(4-1)(3-1)} = 12.5916 \]
• Conclusion: Since $\chi^2 = 14.70 > 12.5916$, there is sufficient evidence to infer at the 5% significance level that students' undergraduate degree and their choice of MBA major are dependent.

Using the computer

• Select the Chi squared / raw data option from Data Analysis Plus under Tools. See Xm16-02.
• Define a code to specify each nominal value. Input the data in columns, one column for each variable.
• Code:

  Undergraduate degree    MBA Major
  1 = BA                  1 = ACCOUNTING
  2 = BENG                2 = FINANCE
  3 = BBA                 3 = MARKETING
  4 = OTHERS

• Raw data (first rows):

  Degree   MBA Major
  3        1
  1        1
  1        1
  1        1
  2        2
  1        3
  .        .
  .        .

• Output:

  Contingency Table
            1     2     3   Total
  1        31    13    16      60
  2         8    16     7      31
  3        12    10    17      39
  4        10     5     7      22
  Total    61    44    47     152

  Test Statistic CHI-Squared = 14.7019
  P-Value = 0.0227
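The contingency-table test of Example 16.2 can also be reproduced without Data Analysis Plus. The following Python sketch is offered only as an illustration under stated assumptions: `independence_test` and `chi2_sf_even_df` are hypothetical helper names, and the p-value helper assumes an even number of degrees of freedom (here 6), for which the chi-squared upper tail has a closed form.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x); assumes df is a positive even integer."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )


def independence_test(table):
    """Chi-squared test of independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # e_ij = (row i total)(column j total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Rows: BA, BENG, BBA, Other. Columns: Accounting, Finance, Marketing.
table = [
    [31, 13, 16],
    [8, 16, 7],
    [12, 10, 17],
    [10, 5, 7],
]

stat, df, expected = independence_test(table)
p_value = chi2_sf_even_df(stat, df)
print(f"chi-squared = {stat:.4f}, df = {df}, p-value = {p_value:.4f}")
```

Because the sketch works with unrounded expected frequencies, it should agree with the computer output (14.7019, p-value 0.0227) rather than with a hand calculation based on two-decimal values.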
