Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students

Laboratory 8: Frequency Data Analysis

To complete the laboratory exercise, work your way through this handout, which is self-contained and self-explanatory. Work in pairs (two per machine), and learn from each other. Keep separate logs of your work. The tutor is available to help if necessary.

Invitations to consider the results and their statistical interpretation are printed in italics. Take some time for this; consult the tutor if necessary. Make notes in your log for later reference.

Topics:

1. One-sample tests and confidence interval for proportions
     graphical display of proportions
2. Chi-Square test of homogeneity of proportions
3. Two-sample tests of proportions
     equivalence to Chi-Square test
4. Assessing homogeneity of patterns of proportions
     graphical display
     Chi-Square test

Learning Objectives:

Be able to
    implement one- and two-sample tests of proportions and interpret the results
    implement Chi-Square tests of homogeneity of proportions and homogeneity of frequency distributions
    produce graphical displays of categorised proportions and frequency distributions

A market penetration study

One of the factors which may help to explain variation in sales of a product in different regions is the level of market penetration achieved for the product through promotion, advertising, etc. One way of assessing this is to carry out an appropriate marketing research survey. In one such survey, potential purchasers randomly sampled in each of three sales regions were interviewed, 200 from Region A, 150 from Region B and 300 from Region C. Among the questions were the following:

    Have you ever heard of this product?
        Yes
        No
    If 'No', skip to next Question. If 'Yes', ask:
    Did you ever buy this product?
        Yes
        No

The answers to these questions reflect different levels of market penetration of the advertising campaign. The first objective must be to reach as many potential buyers as possible. The ultimate objective is to persuade as many people as possible to buy the product. In this case, a notional target of 90% was set for the first objective, that is, it was desired that 90% of the population should have heard of the product as a result of the advertising campaign. More critically, a target of 60% was set for the percentage of the population that actually bought the product.

To assess the success of the marketing campaign in achieving these targets, we need to study the data. The results of the survey for these questions are available in the MarketPen.xls data file in the GenericSkillsData folder; copy and paste into Minitab.

1 One-sample tests and confidence interval for proportions

1.1 Assess target achievement

To assess the success in achieving the target of 90% for the percentage in the population who heard of the product, calculate the corresponding sample percentage and test the "target" hypothesis as follows:

    from the Stat menu, select Basic Statistics, then 1 Proportion,
    select Hear? as the Sample column,
    check the Perform hypothesis test box and enter .9 as the hypothesized proportion,
    click the Options button and check the "Use test and interval based on normal distribution" box,
    click OK, OK.

Was the target achieved? Summarise the results in terms of estimated percentage achieved, confidence interval and significance test.

Implement a similar analysis for the Buy? data. First, note that there are blanks in the Buy? column. These correspond to respondents who answered "No" to the first question on the questionnaire extract above. For the purpose of counting those who did or did not buy the product, these respondents should be classified as "N", did Not buy. A simple adjustment to the Buy?
data will fix this:

    from the Data menu, select Code, then Text to Text,
    select Buy? as the column to code data from,
    store the coded data in C5,
    code original values N and Y to N and Y,
    code original value "" (meaning blank) to N,
    click OK,
    name C5 as "Bought?" (or something else appropriate),
    implement the "1 Proportion" test (target = 60%) for the new column of data.

Was the target achieved? Summarise the results in terms of estimated percentage achieved, confidence interval and significance test.

1.2 Assess percentages that heard of product by Region

The reported percentages are likely to vary between regions, A, B and C. To facilitate analysis, use the Minitab Crosstabulation command as follows:

    from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
    select Region as the categorical variable for rows, Hear? for columns,
    check boxes to display Counts and Row percents,
    click OK.

Make a simple summary of the regional breakdown.

Although the differences from target appear substantial, check their statistical significance as follows:

    from the Stat menu, select Basic Statistics, then 1 Proportion,
    check the Summarized data option,
    enter the number of Yes's (Number of events) and the sample size (Number of trials) for Region A,
    ensure Hypothesised proportion is .9,
    click the Options button and check the "Use test and interval based on normal distribution" box,
    click OK, OK,
    repeat for Regions B and C.

Summarise the results.

Compare the confidence interval widths, including that for the complete sample. Explain the differences in width.

Compare the sample proportions for Regions A and C, compare their z-values, explain.
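For readers who want to see what the normal-distribution option of "1 Proportion" computes, the following is a minimal Python sketch, not Minitab's own code. It assumes the usual textbook convention: the z statistic uses the standard error evaluated at the hypothesised proportion, while the confidence interval uses the standard error evaluated at the sample proportion. The function name one_proportion_z and the z_crit parameter are illustrative. The counts are the Bought figures quoted elsewhere in this handout (326 of 650 overall, 109 of 200 in Region A), tested against the 60% target.

```python
import math


def one_proportion_z(events, trials, p0, z_crit=1.96):
    """Normal-approximation test and confidence interval for one proportion.

    Assumed convention (textbook, not taken from Minitab documentation):
    test SE uses p0, interval SE uses the sample proportion.
    """
    p_hat = events / trials
    # Standard error under the null hypothesis p = p0.
    se_null = math.sqrt(p0 * (1 - p0) / trials)
    z = (p_hat - p0) / se_null
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Interval based on the standard error at the sample proportion.
    se_hat = math.sqrt(p_hat * (1 - p_hat) / trials)
    ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)
    return p_hat, z, p_value, ci


# Complete sample and Region A, both against the 60% Bought target.
for label, events, trials in [("All", 326, 650), ("Region A", 109, 200)]:
    p_hat, z, p_value, ci = one_proportion_z(events, trials, 0.6)
    print(f"{label}: p = {p_hat:.4f}, z = {z:.2f}, "
          f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), width = {ci[1] - ci[0]:.4f}")
```

Running the two cases side by side gives a feel for how the sample size enters both the interval width and the z-value, which is the comparison asked for above.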

1.3 Graphical display

To make a bar chart showing the percentages of those who heard of and bought the product, first make a summary table of percentages who Bought the product (as for Heard about the product), then enter the summary data for both as a table in a new worksheet, as follows:

    from the File menu, select New, then Minitab Worksheet,
    enter column names Region, Heard%, Bought% for C1, C2, C3,
    enter A, B, C in C1 and the corresponding percentages from the Session window in C2 and C3.

To make the chart,

    from the Graph menu, select Bar Chart,
    set Bars to represent Values from a table,
    under Two-way table, select Cluster, click OK,
    select Heard%, Bought% as the Graph variables,
    select Region as Row labels,
    check the "Rows are outermost categories" option (click the Help button to see what this means),
    click on the Scale button, then the Reference Lines tab,
    show lines at Y values 60 90,
    click OK, OK.

2 Chi-Square test of homogeneity of proportions

2.1 Testing the homogeneity of regional differences

A question of interest is whether the percentages in the regions are the same, apart from chance variation due to sampling. A formal test of the homogeneity of the regional Buy percentages may be added to the crosstabulation that led to the percentages, as follows:

    first, click in the original worksheet to make it active (or from the Window menu, select Worksheet 1),
    from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
    select Region as the categorical variable for rows, Bought? for columns,
    check boxes to display Counts,
    click on the Chi-Square button,
    check the boxes for Chi-Square analysis, Expected cell counts,
    click OK, OK.

Report on the statistical significance of the results; focus on Pearson Chi-Square.
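
The Expected cell counts requested in the steps above can be cross-checked outside Minitab. The following is a minimal Python sketch under stated assumptions: it uses only the table margins quoted in this handout (regional sample sizes 200, 150 and 300; 326 buyers out of 650, hence 324 non-buyers) and the one Observed count the handout gives (109 buyers in Region A). The full Observed table is not reproduced here, so the overall statistic is left to the Minitab output. All function names are illustrative.

```python
def expected_from_margins(row_totals, col_totals):
    """Expected counts under homogeneity: row total * column total / grand total."""
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def cell_contribution(observed, expected):
    """Contribution of a single cell to the Pearson Chi-Square statistic."""
    return (observed - expected) ** 2 / expected


def pearson_chi_square(observed, expected):
    """Sum of the cell contributions over the whole table."""
    return sum(cell_contribution(o, e)
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))


# Margins quoted in the handout: rows are Regions A, B, C; columns are Y, N.
expected = expected_from_margins([200, 150, 300], [326, 324])
for region, (e_yes, e_no) in zip("ABC", expected):
    print(f"Region {region}: Expected Y = {e_yes:.1f}, Expected N = {e_no:.1f}")

# The only Observed count given in the handout: 109 buyers in Region A.
print(f"Region A, Y cell contribution: {cell_contribution(109, expected[0][0]):.3f}")
```

Once the Observed counts have been read from the Session window, passing them as a list of rows to pearson_chi_square together with expected should reproduce the statistic that Minitab reports.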

The calculation of the Pearson Chi-Square test statistic is based on the formula

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where O represents Observed frequency and E represents Expected frequency and the sum is over the cells of the table produced by the Cross Tabulation command in the Session window. Thus, the Observed frequency of Yes answers in Region A was 109 and the corresponding Expected frequency was 100.3.

The so-called Expected frequencies are those calculated on the assumption (the null hypothesis) that the population Buy percentages in the regions were all the same (homogeneous) and that the differences between the sample Buy percentages were due to chance variation. If the null hypothesis is correct, then the best estimate of the common value of the Buy percentage is that calculated from the complete sample, 326 / 650 = 50.15%. The Expected frequencies are calculated by applying this percentage to each of the regional sample sizes.

Check that the Expected Buy frequencies are those shown in the Y column.

Check that the Expected frequencies in each row add to the corresponding row sample size.

Check that the Expected frequencies in each column add to the corresponding column total.

Hence, explain the number of degrees of freedom associated with Chi-Square.

3 Two-sample tests of proportions

3.1 A two-sample test of regional differences

The summary data suggests that Regions A and C are very similar in their penetration levels but that Region B has rather lower levels. To check this formally, it makes sense to combine Regions A and C and compare results with those in B, using two-sample tests.
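
Before working through the Minitab steps, it may help to see what a pooled two-sample test of proportions computes. The following is a minimal Python sketch, assuming the standard pooled z statistic; it is not Minitab's own code. The function names are illustrative, and the counts (60 of 100 versus 45 of 100) are hypothetical round numbers chosen for illustration, not survey results. The second function computes the Pearson Chi-Square statistic for the corresponding 2x2 table so that it can be set against the square of z.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled estimate of the common proportion under the null hypothesis.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, z, p_value


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson Chi-Square for the 2x2 table of events and non-events."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    grand = n1 + n2
    stat = 0.0
    for row, n in zip(observed, (n1, n2)):
        for o, c in zip(row, col_totals):
            e = n * c / grand  # Expected count under homogeneity
            stat += (o - e) ** 2 / e
    return stat


# Hypothetical counts, for illustration only (not taken from the survey).
diff, z, p_value = two_proportion_z(60, 100, 45, 100)
print(f"difference = {diff:.3f}, z = {z:.3f}, p = {p_value:.4f}")
print(f"z squared = {z ** 2:.4f}, Chi-Square = {chi_square_2x2(60, 100, 45, 100):.4f}")
```

Substituting the event counts and sample sizes for the combined AC region and for Region B, as read from the Minitab output, allows the same comparison to be made for the survey data.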

To combine the regions, proceed as follows:

    from the Data menu, select Code, then Text to Text,
    select Region as the column to code data from,
    store the coded data in C6,
    code original values A and C to AC,
    code original value B to B,
    click OK,
    name C6 as Region2 (or something else appropriate).

Implement the 2-sample test as follows:

    from the Stat menu, select Basic Statistics, then 2 Proportions,
    enter Hear? as Samples column, Region2 as Subscripts column,
    click the Options button, then check the "Use pooled estimate of p" option,
    click OK, OK,
    repeat for Bought?

Make a report of the test results.

3.2 A Chi-Square two-sample test

The Chi-Square test applied above to test the homogeneity of three regions can just as well be applied to testing the homogeneity of two regions. This is a direct analogue of the application of ANOVA to test the statistical significance of the difference between two sample means. To use it here,

    from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
    select Region2 as the categorical variable for rows, Hear? for columns,
    uncheck Counts,
    click the Chi-Square button,
    uncheck all but Chi-Square analysis,
    click OK, OK.

Demonstrate the equivalence of the 2-sample Z-test and the Pearson Chi-Square test (calculate the square root of the latter).

Identify the sample proportions of the 2-sample test with relevant entries in the 2x2 table.

Explain the Chi-Square DF.

4 Assessing homogeneity of patterns of proportions

There are three levels of market penetration in these data. Some survey respondents Never heard of the product, some Heard of the product but did not buy, some Bought the product.
Denoting these three levels as N, H, B, respectively, it is of interest to study the different penetration patterns among the respondents in the different regions, as reflected in the regional frequency distributions of the three levels of penetration. To tabulate these frequency distributions, construct a new variable whose values are the different penetration levels, as follows:

    from the Data menu, select Code, then Text to Text,
    select Buy? as the column to code data from,
    store the coded data in C7,
    code original values "" (blank) to N, N to H, Y to B,
    click OK,
    name C7 as Level.

To tabulate,

    from the Stat menu, select Tables, then Cross Tabulation and Chi-Square,
    select Region as the categorical variable for rows, Level for columns,
    check box to display Row percents,
    click on the Chi-Square button and uncheck Chi-Square analysis,
    click OK.

Summarise the variation between regional penetration patterns.

4.1 Graphical display

A profile plot provides an effective graphical display. To set up the necessary data, switch to Worksheet 2, note that the region identifiers are already in C1 and that the B percentages are already in C3 and proceed as follows:

    rename C3 as B,
    name C4 as H, C5 as N,
    enter the H and N percentages in C4 and C5.

To make the plot,

    from the Graph menu, select Line Plot, then Multiple Y's With Symbols, click OK,
    select B, H, N as the Graph variables,
    select Region as the Categorical variable for grouping,
    check the "Graph variables are X-scale groups" option,
    click OK.

Improve the graph annotation:

    double click the graph title, edit Text to "Penetration Profiles by Region",
    double click the Y axis label, edit Text to "Per cent",
    from the Editor menu, select Add, then X Axis Label,
    double click the X axis label, edit Text to "Penetration Level".

Discuss the variation patterns.
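
The Level recoding used in this section (blank to N, N to H, Y to B) and the row percentages that feed the profile plot can be sketched in a few lines of Python. This is an illustration of the logic only, not a replacement for the Minitab commands. The function names are illustrative, and the six respondents below are made up purely to exercise the code; they are not survey records.

```python
from collections import Counter

LEVELS = ("N", "H", "B")  # Never heard, Heard but did not buy, Bought


def penetration_level(buy):
    """Recode a Buy? answer: blank -> N, 'N' -> H, 'Y' -> B."""
    answer = buy.strip()
    if answer == "Y":
        return "B"
    if answer == "N":
        return "H"
    return "N"  # blank: the respondent never heard of the product


def row_percents(regions, levels):
    """Percentage of respondents at each level within each region."""
    counts = {}
    for region, level in zip(regions, levels):
        counts.setdefault(region, Counter())[level] += 1
    return {
        region: {lvl: 100 * c[lvl] / sum(c.values()) for lvl in LEVELS}
        for region, c in counts.items()
    }


# Made-up respondents, for illustration only (not survey data).
regions = ["A", "A", "A", "A", "B", "B"]
buy_answers = ["Y", "N", "", "Y", "", "N"]
levels = [penetration_level(b) for b in buy_answers]
for region, pct in row_percents(regions, levels).items():
    print(region, pct)
```

Each region's percentages add to 100, so a region's profile is a complete frequency distribution over the three levels, which is what the profile plot displays.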

4.2 Chi-Square test

At this stage, a Chi-Square test of the homogeneity of the regional penetration patterns is superfluous. For completeness, implement the test as follows (if not already implemented):

    switch to Worksheet 1,
    from the Stat menu, select Tables, then Cross Tabulation and Chi-Square,
    confirm Region as the categorical variable for rows, Level for columns,
    uncheck box to display Row percents,
    click on the Chi-Square button, then check the Chi-Square analysis box,
    click OK, OK.

Confirm the degrees of freedom for Chi-Square; explain.

Calculate the 5% critical value.

Report on the result of the Pearson Chi-Square test.

Conclusion

This concludes Laboratory 8. The learning objectives listed at the outset are reproduced here. Check them individually and ensure that you have achieved each one; seek help from the Tutor if necessary.

Learning Objectives:

Be able to
    implement one- and two-sample tests of proportions and interpret the results
    implement Chi-Square tests of homogeneity of proportions and homogeneity of frequency distributions
    produce graphical displays of categorised proportions and frequency distributions