Laboratory 8 - Trinity College Dublin

Trinity College, Dublin
Generic Skills Programme
Statistics for Research Students
Laboratory 8:
Frequency Data Analysis
To complete the laboratory exercise, work your way through this handout, which is
self-contained and self-explanatory. Work in pairs (two per machine), and learn from each other.
Keep separate logs of your work. The tutor is available to help if necessary.
Invitations to consider the results and their statistical interpretation are printed in italics.
Take some time for this; consult the tutor if necessary. Make notes in your log for later
reference.
Topics:
1. One-sample tests and confidence interval for proportions
 graphical display of proportions
2. Chi-Square test of homogeneity of proportions
3. Two-sample tests of proportions
 equivalence to Chi-Square test
4. Assessing homogeneity of patterns of proportions
 graphical display
 Chi-Square test
Learning Objectives:
Be able to
• implement one- and two-sample tests of proportions and interpret the results
• implement Chi-Square tests of homogeneity of proportions and homogeneity of frequency distributions
• produce graphical displays of categorised proportions and frequency distributions
A market penetration study
One of the factors which may help to explain variation in sales of a product in different regions
is the level of market penetration achieved for the product through promotion, advertising, etc.
One way of assessing this is to carry out an appropriate marketing research survey. In one
such survey, potential purchasers randomly sampled in each of three sales regions were
interviewed, 200 from region A, 150 from Region B and 300 from Region C. Among the
questions were the following:
Have you ever heard of this product?   Yes / No
If 'No', skip to next Question. If 'Yes', ask:
Did you ever buy this product?   Yes / No
The answers to these questions reflect different levels of market penetration of the advertising
campaign. The first objective must be to reach as many potential buyers as possible. The
ultimate objective is to persuade as many people as possible to buy the product. In this case, a
notional target of 90% was set for the first objective, that is, it was desired that 90% of the
population should have heard of the product as a result of the advertising campaign. More
critically, a target of 60% was set for the percentage of the population that actually bought the
product. To assess the success of the marketing campaign in achieving these targets, we need
to study the data.
The results of the survey for these questions are available in the MarketPen.xls data file in the
GenericSkillsData folder; copy and paste into Minitab.
1 One-sample tests and confidence interval for proportions
1.1 Assess target achievement
To assess the success in achieving the target of 90% for the percentage in the population who
heard of the product, calculate the corresponding sample percentage and test the "target"
hypothesis as follows:
• from the Stat menu, select Basic Statistics, then 1 Proportion,
• select Hear? as the Sample column,
• check the Perform hypothesis test box and enter .9 as the hypothesized proportion,
• click the Options button and check the "Use test and interval based on normal distribution" box,
• click OK, OK.
Was the target achieved? Summarise the results in terms of estimated percentage
achieved, confidence interval and significance test.
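For reference, the calculation behind a normal-approximation test and interval for one proportion can be sketched in Python as follows. This is a minimal sketch of the textbook calculation, not a copy of Minitab's implementation: it assumes a two-sided alternative and a 95% interval, the function names are chosen here for illustration, and the counts in the example are hypothetical, not the survey results.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_proportion_z(events, trials, p0, z_crit=1.959964):
    """Normal-approximation test of H0: p = p0, plus a 95% interval."""
    p_hat = events / trials
    # The test statistic uses the standard error under the null hypothesis.
    se_null = math.sqrt(p0 * (1.0 - p0) / trials)
    z = (p_hat - p0) / se_null
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))  # two-sided (assumed)
    # The interval uses the standard error at the sample proportion.
    se_hat = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    ci = (p_hat - z_crit * se_hat, p_hat + z_crit * se_hat)
    return p_hat, z, p_value, ci

# Hypothetical counts, NOT the survey results: 520 "Yes" out of 650.
p_hat, z, p_value, ci = one_proportion_z(520, 650, 0.9)
print(f"estimate={p_hat:.3f}  z={z:.2f}  p={p_value:.4f}  "
      f"95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Replace the hypothetical counts with the number of events and trials reported in the Session window and compare the output with Minitab's.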
Implement a similar analysis for the Buy? data. First, note that there are blanks in the Buy?
column. These correspond to respondents who answered "No" to the first question on the
questionnaire extract above. For the purpose of counting those who did or did not buy the
product, these respondents should be classified as "N", did Not buy. A simple adjustment to the
Buy? data will fix this:
• from the Data menu, select Code, then Text to Text,
• select Buy? as the column to code data from,
• store the coded data in C5,
• code original values N and Y to N and Y, code original value "" (meaning blank) to N,
• click OK,
• name C5 as "Bought?" (or something else appropriate),
• implement the "1 Proportion" test (target = 60%) for the new column of data.
Was the target achieved? Summarise the results in terms of estimated percentage
achieved, confidence interval and significance test.
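The recoding of blanks to "N" described above amounts to a simple lookup. A minimal Python sketch, using a short hypothetical extract of the Buy? column (the variable names are chosen here for illustration):

```python
# Hypothetical extract of the Buy? column; "" marks a blank cell.
buy = ["Y", "N", "", "Y", "", "N"]

# Blank -> "N"; "N" and "Y" are kept as they are.
recode = {"": "N", "N": "N", "Y": "Y"}
bought = [recode[value] for value in buy]

print(bought)
```

After recoding, every respondent counts towards the denominator of the Buy percentage, including those who never heard of the product.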
1.2 Assess percentages that heard of product by Region
The reported percentages are likely to vary between regions, A, B and C. To facilitate analysis,
use the Minitab Crosstabulation command as follows:
• from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
• select Region as the categorical variable for rows, Hear? for columns,
• check boxes to display Counts and Row percents,
• click OK.
Make a simple summary of the regional breakdown.
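To see what the crosstabulation computes, the counts and row percentages can be sketched in Python. The (Region, Hear?) pairs below are hypothetical and far fewer than in the survey; the names used are illustrative only.

```python
from collections import Counter

# Hypothetical (Region, Hear?) pairs, NOT the survey data.
rows = [("A", "Y"), ("A", "Y"), ("A", "N"),
        ("B", "Y"), ("B", "N"), ("B", "N"),
        ("C", "Y"), ("C", "Y"), ("C", "Y"), ("C", "N")]

counts = Counter(rows)                              # cell counts
row_totals = Counter(region for region, _ in rows)  # regional sample sizes

def row_percent(region, answer):
    """Percentage of a region's respondents giving this answer."""
    return 100.0 * counts[(region, answer)] / row_totals[region]

# One line per region: N count, Y count, N row percent, Y row percent.
for region in sorted(row_totals):
    print(region,
          counts[(region, "N")], counts[(region, "Y")],
          f"{row_percent(region, 'N'):.1f}%",
          f"{row_percent(region, 'Y'):.1f}%")
```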
Although the differences from target appear substantial, check their statistical significance as
follows:
• from the Stat menu, select Basic Statistics, then 1 Proportion,
• check the Summarized data option,
• enter the number of Yes's (Number of events) and the sample size (Number of trials) for Region A,
• ensure Hypothesised proportion is .9,
• click the Options button and check the "Use test and interval based on normal distribution" box,
• click OK, OK,
• repeat for Regions B and C.
Summarise the results.
Compare the confidence interval widths, including that for the complete sample.
Explain the differences in width.
Compare the sample proportions for Regions A and C, compare their z-values,
explain.
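As an aid to these comparisons, the following Python sketch shows how the interval width and the z-value depend on sample size when the sample proportion is held fixed. The sample sizes are those of the survey; the proportion 0.80 is an illustrative value only, not a survey result, and the formulas are the usual normal-approximation ones.

```python
import math

def ci_width(p_hat, n, z_crit=1.959964):
    """Full width of the normal-approximation 95% interval."""
    return 2.0 * z_crit * math.sqrt(p_hat * (1.0 - p_hat) / n)

def z_value(p_hat, n, p0):
    """z statistic for H0: p = p0 under the normal approximation."""
    return (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)

p_hat = 0.80   # illustrative value only, not a survey result
for label, n in [("A", 200), ("B", 150), ("C", 300), ("All", 650)]:
    print(label, n,
          round(ci_width(p_hat, n), 3),
          round(z_value(p_hat, n, 0.9), 2))
```

Both quantities involve the square root of the sample size, so quadrupling the sample size halves the interval width and doubles the z-value for the same sample proportion.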
1.3 Graphical display
To make a bar chart showing the percentages of those who heard of and bought the product,
first make a summary table of percentages who Bought the product (as for Heard about the
product), then enter the summary data for both as a table in a new worksheet, as follows:
• from the File menu, select New, then Minitab Worksheet,
• enter column names Region, Heard%, Bought% for C1, C2, C3,
• enter A, B, C in C1 and the corresponding percentages from the Session window in C2 and C3.
To make the chart,
• from the Graph menu, select Bar Chart,
• set Bars to represent Values from a table,
• under Two-way table, select Cluster, click OK,
• select Heard%, Bought% as the Graph variables, select Region as Row labels,
• check the "Rows are outermost categories" option (click the Help button to see what this means),
• click on the Scale button, then the Reference Lines tab, show lines at Y values 60 90,
• click OK, OK.
2 Chi-Square test of homogeneity of proportions
2.1 Testing the homogeneity of regional differences
A question of interest is whether the percentages in the regions are the same, apart from
chance variation due to sampling. A formal test of the homogeneity of the regional Buy
percentages may be added to the crosstabulation that led to the percentages, as follows:
• first, click in the original worksheet to make it active (or from the Window menu, select Worksheet 1),
• from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
• select Region as the categorical variable for rows, Bought? for columns,
• check boxes to display Counts,
• click on the Chi-Square button, check the boxes for Chi-Square analysis, Expected cell counts,
• click OK, OK.
Report on the statistical significance of the results; focus on Pearson Chi-Square.
The calculation of the Pearson Chi-Square test statistic is based on the formula
$$\chi^2 = \sum \frac{(O - E)^2}{E},$$
where O represents Observed frequency and E represents Expected frequency and the sum is
over the cells of the table produced by the Cross Tabulation command in the Session window.
Thus, the Observed frequency of Yes answers in Region A was 109 and the corresponding
Expected frequency was 100.3.
The so-called Expected frequencies are those calculated on the assumption (the null
hypothesis) that the population Buy percentages in the regions were all the same
(homogeneous) and that the differences between the sample Buy percentages were due to
chance variation. If the null hypothesis is correct, then the best estimate of the common value
of the Buy percentage is that calculated from the complete sample, 326 / 650 = 50.15%. The
Expected frequencies are calculated by applying this percentage to each of the regional sample
sizes.
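The Expected frequency calculation just described can be checked with a few lines of Python. The sample sizes, the total of 326 Yes answers and the Observed frequency of 109 for Region A are taken from the text above; the variable names are chosen here for illustration.

```python
sizes = {"A": 200, "B": 150, "C": 300}   # regional sample sizes
total_yes = 326                          # "Yes" answers in the complete sample
total_n = sum(sizes.values())            # 650

p_common = total_yes / total_n           # common Buy proportion under the null

# Expected frequencies under the null hypothesis of a common percentage.
expected_yes = {r: n * p_common for r, n in sizes.items()}
expected_no = {r: n - expected_yes[r] for r, n in sizes.items()}

for r in sizes:
    print(r, round(expected_yes[r], 1), round(expected_no[r], 1))

# Contribution of the Region A "Yes" cell (Observed 109) to Chi-Square.
cell_a_yes = (109 - expected_yes["A"]) ** 2 / expected_yes["A"]
print(round(cell_a_yes, 3))
```

Each cell of the table contributes a term of this kind; the Pearson Chi-Square statistic is the sum of all six.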
Check that the Expected Buy frequencies are those shown in the Y column.
Check that the Expected frequencies in each row add to the corresponding row
sample size.
Check that the Expected frequencies in each column add to the corresponding
column total.
Hence, explain the number of degrees of freedom associated with Chi-Square.
3 Two-sample tests of proportions
3.1 A two-sample test of regional differences
The summary data suggest that Regions A and C are very similar in their penetration levels
but that Region B has rather lower levels. To check this formally, it makes sense to combine
Regions A and C and compare results with those in B, using two-sample tests. To combine the
regions, proceed as follows:
• from the Data menu, select Code, then Text to Text,
• select Region as the column to code data from,
• store the coded data in C6,
• code original values A and C to AC, code original value B to B,
• click OK,
• name C6 as Region2 (or something else appropriate).
Implement the 2-sample test as follows:
• from the Stat menu, select Basic Statistics, then 2 Proportions,
• enter Hear? as Samples column, Region2 as Subscripts column,
• click the Options button, then check the "Use pooled estimate of p" option,
• click OK, OK,
• repeat for Bought?
Make a report of the test results.
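The two-sample test with a pooled estimate of p can be sketched in Python as follows. This is the textbook pooled z calculation with a two-sided alternative assumed, not a copy of Minitab's code. The Yes counts are hypothetical; only the sample sizes (500 for AC, 150 for B) follow from the survey design.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common estimate under H0
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical "Yes" counts, NOT the survey results.
z = two_proportion_z(400, 500, 105, 150)
print(round(z, 2), round(two_sided_p(z), 4))
```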
3.2 A Chi-Square two-sample test
The Chi-Square test applied above to test the homogeneity of three regions can just as well be
applied to testing the homogeneity of two regions. This is a direct analogue of the application of
ANOVA to test the statistical significance of the difference between two sample means. To use
it here,
• from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
• select Region2 as the categorical variable for rows, Hear? for columns,
• uncheck Counts,
• click the Chi-Square button, uncheck all but Chi-Square analysis,
• click OK, OK.
Demonstrate the equivalence of the 2-sample Z-test and the Pearson Chi-Square
test (calculate the square root of the latter).
Identify the sample proportions of the 2-sample test with relevant entries in the 2x2
table.
Explain the Chi-Square DF.
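The equivalence can also be checked numerically outside Minitab. The Python sketch below computes the pooled two-sample z statistic and the Pearson Chi-Square statistic for the same 2x2 table and compares the square of the first with the second. The counts are hypothetical, not the survey results, and the function names are chosen here for illustration.

```python
import math

def pooled_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for H0: p1 = p2."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (x1 / n1 - x2 / n2) / se

def pearson_chi_square(table):
    """Pearson Chi-Square for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows are AC and B, columns are Yes and No.
table = [[400, 100], [105, 45]]
z = pooled_z(400, 500, 105, 150)
chi2 = pearson_chi_square(table)
print(round(z ** 2, 4), round(chi2, 4))   # the two values agree
```

The agreement holds only when the z-test uses the pooled estimate of p, which is why that option was checked in the 2 Proportions dialog.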
4 Assessing homogeneity of patterns of proportions
There are three levels of market penetration in these data. Some survey respondents
Never heard of the product, some Heard of the product but did not buy, some Bought the
product. Denoting these three levels as N, H, B, respectively, it is of interest to study the
different penetration patterns among the respondents in the different regions, as reflected in the
regional frequency distributions of the three levels of penetration. To tabulate these frequency
distributions, construct a new variable whose values are the different penetration levels, as
follows:
• from the Data menu, select Code, then Text to Text,
• select Buy? as the column to code data from,
• store the coded data in C7,
• code original values "" (blank) to N, N to H, Y to B,
• click OK,
• name C7 as Level.
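The coding of the three penetration levels can be sketched in Python as a lookup on the original Buy? values. The extract below is hypothetical and the variable names are illustrative only.

```python
# Hypothetical extract of the Buy? column; "" marks a blank cell,
# i.e. a respondent who had never heard of the product.
buy = ["Y", "", "N", "Y", "N", ""]

level_code = {"": "N",   # Never heard of the product
              "N": "H",  # Heard of it but did not buy
              "Y": "B"}  # Bought it
level = [level_code[value] for value in buy]

print(level)
```

Note that N in the Level column means "Never heard", whereas N in the Buy? column means "did not buy".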
To tabulate,
• from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
• select Region as the categorical variable for rows, Level for columns,
• check box to display Row percents,
• click on the Chi-Square button and uncheck Chi-Square analysis,
• click OK.
Summarise the variation between regional penetration patterns.
4.1 Graphical display
A profile plot provides an effective graphical display. To set up the necessary data, switch to
Worksheet 2, note that the region identifiers are already in C1 and that the B percentages are
already in C3 and proceed as follows:
• rename C3 as B, name C4 as H, C5 as N,
• enter the H and N percentages in C4 and C5.
To make the plot,
• from the Graph menu, select Line Plot, then Multiple Y's With Symbols, click OK,
• select B, H, N as the Graph variables,
• select Region as the Categorical variable for grouping,
• check the "Graph variables are X-scale groups" option,
• click OK.
Improve the graph annotation:
• double click the graph title, edit Text to "Penetration Profiles by Region",
• double click the Y axis label, edit Text to "Per cent",
• from the Editor menu, select Add, then X Axis Label,
• double click the X axis label, edit Text to "Penetration Level".
Discuss the variation patterns.
4.2 Chi-Square test
At this stage, a Chi-Square test of the homogeneity of the regional penetration patterns is
superfluous. For completeness, implement the test as follows (if not already implemented):
• switch to Worksheet 1,
• from the Stat menu, select Tables, then Crosstabulation and Chi-Square,
• confirm Region as the categorical variable for rows, Level for columns,
• uncheck box to display Row percents,
• click on the Chi-Square button, then check the Chi-Square analysis box,
• click OK, OK.
Confirm the degrees of freedom for Chi-Square; explain. Calculate the 5% critical
value. Report on the result of the Pearson Chi-Square test.
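If statistical tables are not to hand, the 5% critical value can be computed directly. The Python sketch below assumes the usual (rows - 1) x (columns - 1) rule for the degrees of freedom of a two-way table, and uses a closed-form expression for the upper tail of the Chi-Square distribution that is valid only for even degrees of freedom. The function names are chosen here for illustration.

```python
import math

def chi2_upper_tail(x, df):
    """P(Chi-Square > x) for even df, using the closed-form series."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def chi2_critical(df, alpha=0.05):
    """Solve P(Chi-Square > x) = alpha for x by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, df) > alpha:
            lo = mid    # tail still too large: move right
        else:
            hi = mid    # tail small enough: move left
    return (lo + hi) / 2.0

rows, cols = 3, 3                  # regions A, B, C by levels N, H, B
df = (rows - 1) * (cols - 1)
print(df, round(chi2_critical(df), 3))
```

Compare the printed critical value with the Pearson Chi-Square value in the Session window.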
Conclusion
This concludes Laboratory 8. The learning objectives listed at the outset are reproduced here.
Check them individually and ensure that you have achieved each one; seek help from the Tutor
if necessary.
Learning Objectives:
Be able to
• implement one- and two-sample tests of proportions and interpret the results
• implement Chi-Square tests of homogeneity of proportions and homogeneity of frequency distributions
• produce graphical displays of categorised proportions and frequency distributions